Remove row with null value from pandas data frame
This should do the job:
df = df.dropna(how='any', axis=0)
It drops every row (axis=0) that contains at least one (how='any') null value.
EXAMPLE:
import numpy as np
import pandas as pd

# Recreate a random DataFrame with NaN values
df = pd.DataFrame(index=pd.date_range('2017-01-01', '2017-01-10', freq='1d'))
# Average speed in miles per hour
df['A'] = np.random.randint(low=198, high=205, size=len(df.index))
df['B'] = np.random.random(size=len(df.index))*2
# Create dummy NaN values in 2 cells
df.iloc[2,1]=None
df.iloc[5,0]=None
print(df)
                A         B
2017-01-01  203.0  1.175224
2017-01-02  199.0  1.338474
2017-01-03  198.0       NaN
2017-01-04  198.0  0.652318
2017-01-05  199.0  1.577577
2017-01-06    NaN  0.234882
2017-01-07  203.0  1.732908
2017-01-08  204.0  1.473146
2017-01-09  198.0  1.109261
2017-01-10  202.0  1.745309
# Delete the rows containing the dummy NaN values
df = df.dropna(how='any', axis=0)
print(df)
                A         B
2017-01-01  203.0  1.175224
2017-01-02  199.0  1.338474
2017-01-04  198.0  0.652318
2017-01-05  199.0  1.577577
2017-01-07  203.0  1.732908
2017-01-08  204.0  1.473146
2017-01-09  198.0  1.109261
2017-01-10  202.0  1.745309
See the dropna() documentation for further detail.
If everything is OK with your DataFrame, dropping NaNs should be as easy as that. If it is still not working, make sure your columns have the proper datatypes defined (pd.to_numeric comes to mind...)
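To illustrate that last point: a column read in as strings will not register any NaN until it is coerced to a numeric dtype. A small sketch (the column name and values are made up):

```python
import pandas as pd

# A column of strings: 'bad' is not a recognized null value yet
df = pd.DataFrame({'speed': ['201', 'bad', '199']})
assert df.dropna().shape[0] == 3  # nothing is dropped

# Coerce to numeric: unparseable entries become NaN...
df['speed'] = pd.to_numeric(df['speed'], errors='coerce')

# ...and dropna() can now remove them
df = df.dropna(how='any', axis=0)
```

After the coercion, the 'bad' row carries a real NaN and is dropped like any other.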
How to keep null values when writing to csv
You have two options here: change the csv.writer() quoting option in Python, or tell PostgreSQL to accept quoted strings as possible NULLs (requires PostgreSQL 9.4 or newer).

csv.writer() and quoting

On the Python side, you are telling the csv.writer() object to add quotes, because you configured it to use csv.QUOTE_NONNUMERIC:

    Instructs writer objects to quote all non-numeric fields.

None values are non-numeric, so they result in "" being written.
Switch to using csv.QUOTE_MINIMAL or csv.QUOTE_NONE:

csv.QUOTE_MINIMAL
    Instructs writer objects to only quote those fields which contain special characters such as delimiter, quotechar or any of the characters in lineterminator.

csv.QUOTE_NONE
    Instructs writer objects to never quote fields. When the current delimiter occurs in output data it is preceded by the current escapechar character.
Since all you are writing is longitude and latitude values, you don't need any quoting here: there are no delimiters or quote characters present in your data. With either option, the CSV output for None values is simply an empty string:
>>> import csv
>>> from io import StringIO
>>> def test_csv_writing(rows, quoting):
... outfile = StringIO()
... csv_writer = csv.writer(outfile, delimiter=',', quoting=quoting)
... csv_writer.writerows(rows)
... return outfile.getvalue()
...
>>> rows = [
... [42.313270000, -71.116240000],
... [42.377010000, -71.064770000],
... [None, None],
... ]
>>> print(test_csv_writing(rows, csv.QUOTE_NONNUMERIC))
42.31327,-71.11624
42.37701,-71.06477
"",""
>>> print(test_csv_writing(rows, csv.QUOTE_MINIMAL))
42.31327,-71.11624
42.37701,-71.06477
,
>>> print(test_csv_writing(rows, csv.QUOTE_NONE))
42.31327,-71.11624
42.37701,-71.06477
,
PostgreSQL 9.4 COPY FROM, NULL values and FORCE_NULL

As of PostgreSQL 9.4, you can also force PostgreSQL to accept quoted empty strings as NULLs when you use the FORCE_NULL option. From the COPY FROM documentation:
FORCE_NULL
    Match the specified columns' values against the null string, even if it has been quoted, and if a match is found set the value to NULL. In the default case where the null string is empty, this converts a quoted empty string into NULL. This option is allowed only in COPY FROM, and only when using CSV format.

Naming the columns in a FORCE_NULL option lets PostgreSQL accept both the empty column and "" as NULL values for those columns, e.g.:
COPY position (
lon,
lat
)
FROM 'filename'
WITH (
FORMAT csv,
NULL '',
DELIMITER ',',
FORCE_NULL(lon, lat)
);
at which point it doesn't matter anymore what quoting options you used on the Python side.
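The COPY statement above can also be driven from Python. A hedged sketch using psycopg2's cursor.copy_expert() (connection setup is omitted; the table and column names are the ones from the example, and the function name is made up):

```python
import io

# The COPY statement from the example; FORCE_NULL turns quoted empty
# strings ("") into NULL, in addition to unquoted empty fields.
COPY_SQL = """
    COPY position (lon, lat)
    FROM STDIN
    WITH (FORMAT csv, NULL '', FORCE_NULL (lon, lat))
"""

def load_coordinates(conn, csv_text):
    """Stream CSV text into PostgreSQL via COPY ... FROM STDIN."""
    with conn.cursor() as cur:
        cur.copy_expert(COPY_SQL, io.StringIO(csv_text))
    conn.commit()
```

copy_expert() reads from any file-like object, so the CSV never needs to touch disk on its way into the table.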
Other options to consider

For simple data transformation tasks from other databases, don't use Python
If you are already querying databases to collate data to go into PostgreSQL, consider inserting directly into Postgres. If the data comes from other sources, using the foreign data wrapper (fdw) module lets you cut out the middle-man and pull data directly into PostgreSQL from those sources.
Numpy data? Consider using COPY FROM as binary, directly from Python
Numpy data can be inserted more efficiently via a binary COPY FROM; the linked answer augments a numpy structured array with the required extra metadata and byte ordering, then efficiently creates a binary copy of the data and inserts it into PostgreSQL using COPY FROM STDIN WITH BINARY and the psycopg2 copy_expert() method. This neatly avoids number -> text -> number conversions.
Persisting data to handle large datasets in a pipeline?
Don't re-invent the data pipeline wheels. Consider using existing projects such as Apache Spark, which have already solved the efficiency problems. Spark lets you treat data as a structured stream, and includes the infrastructure to run data analysis steps in parallel, and you can treat distributed, structured data as Pandas dataframes.
Another option might be to look at Dask to help share datasets between distributed tasks to process large amounts of data.
Even if converting an already running project to Spark might be a step too far, at least consider using Apache Arrow, the data exchange platform Spark builds on top of. The pyarrow
project would let you exchange data via Parquet files, or exchange data over IPC.
The Pandas and Numpy teams are quite heavily invested in supporting the needs of Arrow and Dask (there is considerable overlap in core members between these projects) and are actively working to make Python data exchange as efficient as possible, including extending Python's pickle
module to allow for out-of-band data streams to avoid unnecessary memory copying when sharing data.
How can I find the isnull() values in a CSV file?
- If you know the NA values are in the Title field:
  train_data.dropna(subset=['Title'])
- If you want to remove all NAs:
  train_data.dropna()
- View which columns contain NAs:
  train_data.isna().any()
- If you want to view the rows with NA values:
  train_data[train_data.isna().any(axis=1)]
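A minimal sketch of those calls on a made-up DataFrame (only the train_data name and the Title column come from the question; the rest is illustrative):

```python
import numpy as np
import pandas as pd

train_data = pd.DataFrame({'Title': ['Mr', np.nan, 'Mrs'],
                           'Age': [22, 38, np.nan]})

# Which columns contain at least one NA
na_columns = train_data.isna().any()

# Rows that contain at least one NA, in any column
na_rows = train_data[train_data.isna().any(axis=1)]

# Drop only the rows where Title specifically is NA
cleaned = train_data.dropna(subset=['Title'])
```

isna().any() reduces column-wise by default; passing axis=1 reduces row-wise, which is what makes the boolean row mask work.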
Pandas csv dataframe: Input "Null" into all cells in empty columns, input "NA" into all empty cells in non-empty columns
So are the empty values "", None, or NaN?
import numpy as np
import pandas as pd

df = pd.DataFrame({"A": [1, 2, 3],
                   "B": [5, 3, 1],
                   "C": [5, "", 1],
                   "D": ["", "", ""],
                   "E": [3, np.nan, 6],
                   "F": [np.nan, np.nan, np.nan],
                   "G": [9, None, 3],
                   "H": [None, None, None]})

# Columns that are entirely NaN/None or entirely "" get "NA" in every cell
for j in df.columns[df.isna().all() | df.eq("").all()].tolist():
    df[j] = np.repeat("NA", len(df))

# Any remaining NaN/None or "" cells in the other columns get "null"
for i, j in zip(*np.where(df.isna() | df.eq(""))):
    df.iloc[i, j] = "null"
Output, before:
   A  B  C D    E   F    G     H
0  1  5  5    3.0 NaN  9.0  None
1  2  3       NaN NaN  NaN  None
2  3  1  1    6.0 NaN  3.0  None
and after:
   A  B     C   D     E   F     G   H
0  1  5     5  NA   3.0  NA   9.0  NA
1  2  3  null  NA  null  NA  null  NA
2  3  1     1  NA   6.0  NA   3.0  NA
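The same transformation can be sketched without explicit loops, using DataFrame.mask (this is my own variation, not part of the original answer; the data is a trimmed copy of the example):

```python
import pandas as pd

df = pd.DataFrame({"A": [1, 2, 3],
                   "C": [5, "", 1],
                   "D": ["", "", ""],
                   "H": [None, None, None]})

# Entirely-empty columns (all NaN/None or all "") become "NA"
empty_cols = df.columns[df.isna().all() | df.eq("").all()]
df[empty_cols] = "NA"

# Remaining empty cells in the other columns become "null"
df = df.mask(df.isna() | df.eq(""), "null")
```

mask() replaces every cell where the boolean condition is True, so it covers both NaN/None and empty-string cells in one vectorized pass.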
How do I drop the None line in a csv file using python?
In order to remove rows with 'empty' cells, do this:
1. Import .csv to pandas dataframe
import pandas as pd
df_csv = pd.read_csv('yourfile.csv')
2. Drop NaN rows
df_csv.dropna(axis = 0, how = 'any', inplace = True)
# axis = 1 will drop the column instead of the row
# how = 'all' will only drop if all cells are NaN
3. Save to .csv
df_csv.to_csv('yourfile_parsed.csv', index = False)
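To see why step 2 works on a CSV with blank cells, here is a self-contained sketch using an in-memory file (the data is made up):

```python
import pandas as pd
from io import StringIO

# Hypothetical file contents: the middle row has only empty cells
csv_text = "lon,lat\n42.31,-71.11\n,\n42.37,-71.06\n"

# read_csv turns empty cells into NaN automatically
df_csv = pd.read_csv(StringIO(csv_text))

# so dropna() removes the blank row
df_csv.dropna(axis=0, how='any', inplace=True)

csv_out = df_csv.to_csv(index=False)
```

The empty fields never need special handling: pandas' default na_values already map them to NaN on read.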
Comments

- It is better to refer to None or NaN rather than saying 'empty'.
- Also, 'clear' is better called 'drop' - people may otherwise think you want to remove all values while keeping the row.