[ ] Adding new functionality to pandas
[X] Changing existing functionality in pandas
[ ] Removing existing functionality in pandas
When reading files, pandas reads "NA", "N/A" and other similar values as NaN by default. It should not. "NA" and "N/A" could mean two different things: "Not Available" and "Not Applicable", for instance. Or it could just be the initials for "Nadine Auburg". The default of reading these values as NaN is misleading.
import pandas as pd
# test.csv file has four lines:
# a
# "NA"
# "NAN"
# "N/A"
file = 'test.csv'
df = pd.read_csv(file)
# expected df to be:
# a
# 0 NA
# 1 NAN
# 2 N/A
# actual:
# a
# 0 NaN
# 1 NAN
# 2 NaN
Set na_filter=False as default rather than na_filter=True in pandas.read_csv (and other file-reading methods, if applicable).
Currently, the workaround is to explicitly set na_filter=False when reading the file. Example:
import pandas as pd
file = 'test.csv'
df = pd.read_csv(file, na_filter=False)
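Note that na_filter=False turns off missing-value detection entirely, so empty fields are no longer read as NaN either. If that is too broad, a narrower option is to combine keep_default_na=False with an explicit na_values list. The following is only a sketch: the in-memory CSV text and the second column b are made up for illustration and are not part of the original test.csv.

```python
import io

import pandas as pd

# Hypothetical in-memory data standing in for a CSV file on disk.
# Column "b" is an assumption added to show an empty field.
csv_text = "a,b\nNA,1\nNAN,\nN/A,3\n"

df = pd.read_csv(
    io.StringIO(csv_text),
    keep_default_na=False,  # drop the built-in list ("NA", "N/A", ...)
    na_values=[""],         # only empty fields count as missing
)

print(df["a"].tolist())         # strings are kept as they appear in the file
print(df["b"].isna().tolist())  # the empty field is still detected as missing
```

With this combination, "NA" and "N/A" survive as text while blank cells in numeric columns are still treated as missing.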
import pandas as pd
pd.show_versions()
# INSTALLED VERSIONS
# ------------------
# commit : 8dab54d6573f7186ff0c3b6364d5e4dd635ff3e7
# python : 3.9.15.final.0
# python-bits : 64
# OS : Darwin
# OS-release : 22.1.0
# Version : Darwin Kernel Version 22.1.0: Sun Oct 9 20:14:30 PDT 2022; root:xnu-8792.41.9~2/RELEASE_ARM64_T8103
# machine : x86_64
# processor : i386
# byteorder : little
# LC_ALL : en_CA.UTF-8
# LANG : en_CA.UTF-8
# LOCALE : en_CA.UTF-8
# pandas : 1.5.2
# numpy : 1.23.5
# pytz : 2022.6
# dateutil : 2.8.2
# setuptools : 65.5.1
# pip : 22.3.1
# Cython : None
# pytest : None
# hypothesis : None
# sphinx : 5.3.0
# blosc : None
# feather : None
# xlsxwriter : None
# lxml.etree : None
# html5lib : None
# pymysql : None
# psycopg2 : None
# jinja2 : 3.1.2
# IPython : 7.33.0
# pandas_datareader: None
# bs4 : 4.11.1
# bottleneck : None
# brotli :
# fastparquet : None
# fsspec : None
# gcsfs : None
# matplotlib : 3.6.2
# numba : None
# numexpr : None
# odfpy : None
# openpyxl : None
# pandas_gbq : None
# pyarrow : None
# pyreadstat : None
# pyxlsb : None
# s3fs : None
# scipy : 1.9.3
# snappy : None
# sqlalchemy : None
# tables : None
# tabulate : None
# xarray : None
# xlrd : None
# xlwt : None
# zstandard : None
# tzdata : None
-1 on this. The default is sensible; you can simply set it to False if you don't want the conversion. Also, the impact is way too big to change this now.
I agree changing the setting could have a big impact on existing code. And that could be reason enough to not do it. But let's also weigh it against code that is or will be written by unsuspecting users.
If you take a .csv file with "NA"s, import it using the defaults of pd.read_csv, and export the pandas dataframe as a csv file again, you get a different csv file.
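The round trip described above can be checked with a short script. This is a minimal sketch that uses an in-memory string in place of a real file (an assumption made so the example is self-contained) and compares the files line by line to sidestep platform-specific line endings.

```python
import io

import pandas as pd

# Same three values as in test.csv, held in memory instead of on disk.
original = "a\nNA\nNAN\nN/A\n"

# Default settings: "NA" and "N/A" are converted to missing values on read.
df_default = pd.read_csv(io.StringIO(original))
round_trip_default = df_default.to_csv(index=False)

# With na_filter=False the values are kept as plain strings.
df_raw = pd.read_csv(io.StringIO(original), na_filter=False)
round_trip_raw = df_raw.to_csv(index=False)

# Compare line by line so the platform's line terminator does not matter.
print(round_trip_default.splitlines() == original.splitlines())
print(round_trip_raw.splitlines() == original.splitlines())
```

The first comparison fails because the missing values are written back using to_csv's default na_rep (an empty string) rather than the original text.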
Pandas edits the data behind my back by default without me telling it to. To me, that's not okay.