Prevent Pandas from Interpreting 'NA' as NaN in a String

You can use the keep_default_na and na_values parameters to set all NA values by hand (see the read_csv docs):

import pandas as pd
from io import StringIO

data = """
PDB CHAIN SP_PRIMARY RES_BEG RES_END PDB_BEG PDB_END SP_BEG SP_END
5d8b N P60490 1 146 1 146 1 146
5d8b NA P80377 _ 126 1 126 1 126
5d8b O P60491 1 118 1 118 1 118
"""

df = pd.read_csv(StringIO(data), sep=' ', keep_default_na=False, na_values=['_'])

In [130]: df
Out[130]:
    PDB CHAIN SP_PRIMARY  RES_BEG  RES_END  PDB_BEG  PDB_END  SP_BEG  SP_END
0  5d8b     N     P60490      1.0      146        1      146       1     146
1  5d8b    NA     P80377      NaN      126        1      126       1     126
2  5d8b     O     P60491      1.0      118        1      118       1     118

In [144]: df.CHAIN.apply(type)
Out[144]:
0    <class 'str'>
1    <class 'str'>
2    <class 'str'>
Name: CHAIN, dtype: object

EDIT

All default NA values, from the na_values documentation (as of pandas 1.0.0):

The default NaN recognized values are ['-1.#IND', '1.#QNAN', '1.#IND', '-1.#QNAN', '#N/A N/A', '#N/A', 'N/A', 'n/a', 'NA', '<NA>', '#NA', 'NULL', 'null', 'NaN', '-NaN', 'nan', '-nan', ''].

Prevent Pandas read_csv from interpreting NA as NaN but retaining NaN for empty values

For me, this works:

df = pd.read_csv('file.csv', keep_default_na=False, na_values=[''])

which gives:

  region      date  expenses
0     NA  1/1/2019      53.0
1     EU  1/2/2019       NaN

But I'd rather play it safe, since other columns may contain other NaN markers that should still be parsed as missing, and do:

df = pd.read_csv('file.csv')
df['region'] = df['region'].fillna('NA')
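
To compare the two approaches above side by side, here is a minimal sketch that reads the same data from an in-memory string instead of file.csv. The sample rows are an assumption reconstructed from the output shown, not the asker's actual file, and the names csv_text, strict and filled are illustrative only.

```python
import pandas as pd
from io import StringIO

# Assumed sample data: 'NA' is a real region code, the second expense is empty.
csv_text = "region,date,expenses\nNA,1/1/2019,53\nEU,1/2/2019,\n"

# Approach 1: drop the default NA markers and treat only empty fields as NaN.
strict = pd.read_csv(StringIO(csv_text), keep_default_na=False, na_values=[''])

# Approach 2: keep the defaults, then restore 'NA' in the one affected column.
filled = pd.read_csv(StringIO(csv_text))
filled['region'] = filled['region'].fillna('NA')

print(strict)
print(filled)
```

Note that fillna('NA') also turns genuinely empty region cells into 'NA', so the second approach only fits when that column has no real gaps.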

How to prevent pandas from removing 'NA' character string when reading a csv?

Read the dataframe with keep_default_na=False, possibly specifying with na_values the set of values that you want to consider as "genuine" NaNs:

# custom admissible NaN values; 'NA' is not in this list
na_values = ['', '#N/A', '#N/A N/A', '#NA', '-1.#IND',
             '-1.#QNAN', '-NaN', '-nan', '1.#IND',
             '1.#QNAN', 'N/A', 'NULL', 'NaN',
             'n/a', 'nan', 'null']

data = pd.read_csv('C:\\Users\\User\\Desktop\\' + filename,
                   sep=',',
                   quotechar='"',
                   encoding='mbcs',
                   low_memory=False,
                   na_values=na_values,    # specify custom NaN values
                   keep_default_na=False)  # and use only those

Here's a reproducible example of what could be happening here:

# create dataframe with NA and write it to file
import pandas as pd
df = pd.DataFrame({'Line Code':['MV', 'RM', 'NA', 'AB'],
'Product SKU':['Product1', 'Product2', 'Product3', 'Product4']})

df.to_csv("mydf.csv", index = False)

# read it in, in two different fashions
df_problematic = pd.read_csv("mydf.csv")
df_ok = pd.read_csv("mydf.csv", keep_default_na = False)

In df_problematic, the 'NA' value is interpreted as NaN, which is not what you want. Refer to the read_csv docs for the options available when reading CSV files in pandas and for the default list of strings interpreted as NaN.
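
One detail worth checking is what happens to empty fields and other markers once the defaults are switched off. The following is a small sketch on assumed in-memory data (the rows and the names csv_text, default, no_defaults and custom are made up for illustration); it contrasts the default behaviour, keep_default_na=False on its own, and keep_default_na=False combined with a custom na_values list.

```python
import pandas as pd
from io import StringIO

# Assumed sample: 'NA' is a valid line code; the SKU column has an empty
# field and a 'NULL' marker that should both count as missing.
csv_text = "Line Code,Product SKU\nMV,Product1\nNA,\nAB,NULL\n"

# Defaults: 'NA', '' and 'NULL' are all parsed as NaN.
default = pd.read_csv(StringIO(csv_text))

# No defaults and no custom list: nothing is treated as missing,
# so the empty field comes back as an empty string.
no_defaults = pd.read_csv(StringIO(csv_text), keep_default_na=False)

# No defaults plus a custom list: 'NA' survives, '' and 'NULL' become NaN.
custom = pd.read_csv(StringIO(csv_text), keep_default_na=False,
                     na_values=['', 'NULL'])

print(default, no_defaults, custom, sep="\n\n")
```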

Prevent pandas from interpreting 'NA' as NaN in a string: CSV file

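To select with a not-null mask (the frame keeps its shape and missing cells stay NaN):

df[~df.isnull()]

To drop the rows that contain any missing value:

df.dropna()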

String NA conflict with pandas na type

Set the na_filter parameter to False:

df = pd.read_csv("aa.csv", na_filter=False)
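
As a quick illustration of what na_filter=False does, here is a sketch on assumed in-memory data rather than aa.csv (the column names and rows are hypothetical). With the filter off, no string is converted to a missing value, so empty fields in a text column come back as empty strings.

```python
import pandas as pd
from io import StringIO

# Assumed sample: the 'code' column holds 'NA', an empty field and 'NaN'.
csv_text = "code,qty\nNA,1\n,2\nNaN,3\n"

# na_filter=False disables missing-value detection for the whole file.
df = pd.read_csv(StringIO(csv_text), na_filter=False)

print(df['code'].tolist())
print(df.isna().sum())
```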

Why does pandas identify string NaN (a nitride of sodium) as a missing value?

As per the pandas documentation for read_csv, 'NaN' is one of the default missing-value indicators.

If you're sure there are no missing values in your CSV file, you can simply pass na_filter=False to your read_csv() call to turn off missing-value detection.

Otherwise, you can use keep_default_na=False to exclude the default values and specify your own with the na_values parameter.

How to stop pandas from inferring the string value Infinity as inf and changing the datatype to float64

You can pass the dtype argument to set explicit column types for particular column names, like so:

pd.read_csv(file_name, dtype={'Vendor': str})
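
A minimal sketch of the dtype approach, using assumed in-memory data in place of file_name (the rows and the names csv_text, inferred and typed are illustrative). Every value in the Vendor column looks numeric to the parser, which is the situation the question describes.

```python
import pandas as pd
from io import StringIO

# Assumed sample: both Vendor values can be parsed as numbers.
csv_text = "Vendor,Amount\nInfinity,10\n123,20\n"

# Without dtype, pandas infers the column type itself
# (the question reports float64 with inf for data like this).
inferred = pd.read_csv(StringIO(csv_text))
print(inferred['Vendor'].dtype)

# With an explicit dtype, the values are kept as the original strings.
typed = pd.read_csv(StringIO(csv_text), dtype={'Vendor': str})
print(typed['Vendor'].tolist())
```

Only the listed column is pinned to str; the other columns are still inferred as usual.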

