Delete Rows Containing Numeric Values in Strings from Pandas Dataframe

In your case, I think it's better to use simple indexing rather than drop. For example:

>>> df
       text type
0       abc    b
1    abc123    a
2       cde    a
3  abc1.2.3    b
4     1.2.3    a
5       xyz    a
6    abc123    a
7      9999    a
8     5text    a
9      text    a


>>> df[~df.text.str.contains(r'[0-9]')]
   text type
0   abc    b
2   cde    a
5   xyz    a
9  text    a

That selects only the rows whose text contains no digits.
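For comparison, the drop-based equivalent would pass the index of the matching rows to drop; a sketch:

>>> df.drop(df[df.text.str.contains(r'[0-9]')].index)
   text type
0   abc    b
2   cde    a
5   xyz    a
9  text    a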

To explain:

df.text.str.contains(r'[0-9]')

returns a boolean Series indicating which rows contain any digits:

0    False
1     True
2    False
3     True
4     True
5    False
6     True
7     True
8     True
9    False

and you can negate it with ~ to index your DataFrame, keeping only the rows where the mask is False.
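One caveat, assuming the text column may contain missing values (not part of the original example): str.contains returns NaN for those rows, and boolean indexing with a mask containing NaN raises an error. Passing na=True treats missing text as a match, so those rows are dropped along with the digit-containing ones:

>>> df[~df.text.str.contains(r'[0-9]', na=True)]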

Remove rows from pandas dataframe if string has 'only numbers'

If we're only worrying about ASCII digits 0-9:

df = df[~df['question_stemmed'].str.isdigit()]

If we need to worry about unicode or digits in other languages:

df = df[~df['question_stemmed'].str.isnumeric()]

The pandas string methods internally call the corresponding Python methods. See What's the difference between str.isdigit, isnumeric and isdecimal in python? for an explanation of how these functions differ.
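For a quick, hand-picked illustration of the difference (these sample values are mine, not from the question):

s = pd.Series(['123', '½', 'abc'])
s.str.isdigit()    # [True, False, False] -- '½' is numeric but not a digit
s.str.isnumeric()  # [True, True, False]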

Remove rows where column value type is string Pandas

Use convert_objects with the parameter convert_numeric=True; this will coerce any non-numeric values to NaN:

In [24]:

df = pd.DataFrame({'a': [0.1,0.5,'jasdh', 9.0]})
df
Out[24]:
       a
0    0.1
1    0.5
2  jasdh
3      9
In [27]:

df.convert_objects(convert_numeric=True)
Out[27]:
     a
0  0.1
1  0.5
2  NaN
3  9.0

You can then drop them:

In [29]:

df.convert_objects(convert_numeric=True).dropna()
Out[29]:
     a
0  0.1
1  0.5
3  9.0

UPDATE

Since version 0.17.0 this method is deprecated and you need to use to_numeric. Unfortunately this operates on a Series rather than a whole DataFrame, so the equivalent code is now:

df.apply(lambda x: pd.to_numeric(x, errors='coerce')).dropna()
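Putting it together, a minimal runnable sketch of the modern approach on the same example frame:

import pandas as pd

df = pd.DataFrame({'a': [0.1, 0.5, 'jasdh', 9.0]})
# coerce each column to numeric; strings such as 'jasdh' become NaN, then drop them
df = df.apply(lambda x: pd.to_numeric(x, errors='coerce')).dropna()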

Removing rows with digits and strings in pandas dataframe

Using pandas.Series.str.contains with regex

A simpler regex, but it would also match a row like '123 456', because both '3 ' and ' 4' satisfy the pattern.

df[df.col1.str.contains(r'\d\D|\D\d')]

           col1
3  C96305407PLA
4      P0116711

The following pattern addresses that shortcoming by only matching when a digit is directly adjacent to a letter (digit/letter or letter/digit).

df[df.col1.str.contains(r'(?i)\d[a-z]|[a-z]\d')]

           col1
3  C96305407PLA
4      P0116711
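For reference, a runnable sketch; the sample rows other than the two shown in the output above are assumptions:

import pandas as pd

df = pd.DataFrame({'col1': ['123456', 'ABCDEF', '123 456',
                            'C96305407PLA', 'P0116711']})
# keep only rows where a digit is directly adjacent to a letter
df[df.col1.str.contains(r'(?i)\d[a-z]|[a-z]\d')]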

Python Pandas: Remove Rows that have Numbers (not float nor int, but like 1.2.3)

Convert the column to str, then use a regex:

df = pd.DataFrame({'id': [1, 2, 3], 'value': ['3.3.4 text', '3.4.5', 3.2]})
df = df.astype(str)
df[df['value'].str.contains(r'^[\d.]+$')]

This gives:

  id  value
1  2  3.4.5
2  3    3.2
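Note the question asks to remove such rows; to drop them rather than keep them, negate the mask:

df[~df['value'].str.contains(r'^[\d.]+$')]   # keeps only '3.3.4 text'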

Select rows which contain numeric substrings in Pandas

You can use boolean indexing with a str.contains() regex:

  • ^0E - starts with 0E
  • \d{2}$ - ends with 2 digits
  • \d{2}[A-Z]$ - ends with 2 digits and 1 capital letter

col = ...  # target column
mask = df[col].str.contains(r'^0E|\d{2}$|\d{2}[A-Z]$')
df = df.loc[~mask]
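A runnable sketch with hypothetical data (the column name 'code' and the sample values are assumptions):

import pandas as pd

df = pd.DataFrame({'code': ['0E123', 'ABC45', 'XY12Z', 'hello']})
mask = df['code'].str.contains(r'^0E|\d{2}$|\d{2}[A-Z]$')
df = df.loc[~mask]   # only 'hello' survives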

Remove rows from DataFrame that contain numbers from 0 to 9

You can use the vectorised str.contains with the regex pattern \d to test whether a string contains any digits, then negate the resulting boolean mask with ~:

In [173]:
df[~df['Testvalue'].str.contains(r'\d')]

Out[173]:
  Testvalue
2     water

Here the contains generates the following boolean mask:

In [174]:
df['Testvalue'].str.contains(r'\d')

Out[174]:
0     True
1     True
2    False
Name: Testvalue, dtype: bool
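For context, a frame like the one implied by the output could look as follows (the first two values are assumptions; only 'water' appears in the original output):

import pandas as pd

df = pd.DataFrame({'Testvalue': ['abc1', '2def', 'water']})
df[~df['Testvalue'].str.contains(r'\d')]   # keeps only 'water'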

Delete rows of a pandas data frame having string values in python 3.4.1

The way I would approach this is to try to convert the columns to int with a small user function, using try/except to handle values that cannot be coerced; these get set to NaN. Then drop the row with the 'empty' date value. For some reason it actually had a length of 1 when I tested with your data; it may work for you with len 0.

In [42]:
import numpy as np
import pandas as pd

# simple function that tries to convert the type, returning NaN if the value cannot be coerced
def func(x):
    try:
        return int(x)
    except ValueError:
        return np.nan

# assign multiple columns
df['Pro_L_1'], df['Pro_L_3'], df['Sale'] = df['Pro_L_1'].apply(func), df['Pro_L_3'].apply(func), df['Sale'].apply(func)
# drop the 'empty' date row, take a copy() so we don't get a warning
df = df.loc[df['Date'].str.len() > 1].copy()
# convert the string to a datetime; if we hadn't dropped the row, the empty value would become today's date
df['Date'] = pd.to_datetime(df['Date'])
# now convert all the dtypes that are numeric to a numeric dtype
df = df.convert_objects(convert_numeric=True)
# check the dtypes
df.dtypes

Out[42]:
Geo_L_1             int64
Geo_L_2             int64
Geo_L_3             int64
Pro_L_1           float64
Pro_L_2           float64
Pro_L_3           float64
Date       datetime64[ns]
Sale              float64
dtype: object
In [43]:
# display the current situation
df
Out[43]:
    Geo_L_1  Geo_L_2  Geo_L_3  Pro_L_1  Pro_L_2     Pro_L_3        Date  Sale
 0        1        2        3      129        1  5193316745  2012-01-01     9
 1        1        2        3      129        1  5193316745  2013-01-01   NaN
 3        1        2        3      129      NaN  5193316745  2012-01-10    10
 4        1        2        3      129        1  5193316745  2013-01-10     4
 5        1        2        3      NaN        1  5193316745  2014-01-10     6
 6        1        2        3      129        1  5193316745  2012-01-11     4
 7        1        2        3      129        1         NaN  2013-01-11     2
 8        1        2        3      129        1  5193316745  2014-01-11     6
 9        1        2        3      129        1  5193316745  2012-01-12   NaN
10        1        2        3      129        1  5193316745  2013-01-12     5
In [44]:
# drop the rows
df.dropna()
Out[44]:
    Geo_L_1  Geo_L_2  Geo_L_3  Pro_L_1  Pro_L_2     Pro_L_3        Date  Sale
 0        1        2        3      129        1  5193316745  2012-01-01     9
 4        1        2        3      129        1  5193316745  2013-01-10     4
 6        1        2        3      129        1  5193316745  2012-01-11     4
 8        1        2        3      129        1  5193316745  2014-01-11     6
10        1        2        3      129        1  5193316745  2013-01-12     5

For the last line, assign the result back: df = df.dropna().
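As noted in the UPDATE further up, convert_objects is deprecated since 0.17.0; a minimal modern replacement for that step, assuming the remaining object columns should all be coerced to numeric, is:

# coerce every remaining object-dtype column (Date is already datetime by this point)
for c in df.select_dtypes(include='object').columns:
    df[c] = pd.to_numeric(df[c], errors='coerce')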


