Find Records With Leading Zero in Python Pandas

Find records with leading zero in Python Pandas

This should work:

df1[df1['Acct_no'].str[0] == '0']
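
For example, a minimal sketch with made-up account numbers (the column must already hold strings):

import pandas as pd

df1 = pd.DataFrame({'Acct_no': ['0123', '4567', '0890']})
df1[df1['Acct_no'].str[0] == '0']   # keeps rows 0 and 2
# equivalent spelling: df1[df1['Acct_no'].str.startswith('0')]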

Add leading zeros based on condition in Python

You need to vectorize this: select the rows with a boolean mask, then use .str.zfill() on each subset (assigning through .loc so the writes land in the original frame):

# select the right rows to avoid wasting time operating on longer strings
shorter = df.Random.str.len() < 9
longer = ~shorter
# assign through .loc; chained assignment like df.Random[shorter] = ...
# may silently fail to modify df in modern pandas
df.loc[shorter, 'Random'] = df.Random[shorter].str.zfill(9)
df.loc[longer, 'Random'] = df.Random[longer].str.zfill(20)

Note: I did not use np.where() here because it would double the work: a vectorized df.Random.str.zfill() beats looping over the rows, but running it twice over the whole column still costs more than running it once per subset.

Speed comparison on 1 million rows of strings with random lengths (from 5 characters up to 29):

In [1]: import numpy as np, pandas as pd

In [2]: import platform; print(platform.python_version_tuple(), platform.platform(), pd.__version__, np.__version__, sep="\n")
('3', '7', '3')
Darwin-17.7.0-x86_64-i386-64bit
0.24.2
1.16.4

In [3]: !sysctl -n machdep.cpu.brand_string
Intel(R) Core(TM) i7-7820HQ CPU @ 2.90GHz

In [4]: from random import choices, randrange

In [5]: def randvalue(chars="0123456789", _c=choices, _r=randrange):
   ...:     return "".join(_c(chars, k=_r(5, 30))).lstrip("0")
   ...:

In [6]: df = pd.DataFrame(data={"Random": [randvalue() for _ in range(10**6)]})

In [7]: %%timeit
   ...: target = df.copy()
   ...: shorter = target.Random.str.len() < 9
   ...: longer = ~shorter
   ...: target.Random[shorter] = target.Random[shorter].str.zfill(9)
   ...: target.Random[longer] = target.Random[longer].str.zfill(20)
   ...:
   ...:
825 ms ± 22.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [8]: %%timeit
   ...: target = df.copy()
   ...: target.Random = np.where(target.Random.str.len() < 9, target.Random.str.zfill(9), target.Random.str.zfill(20))
   ...:
   ...:
929 ms ± 69.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

(The target = df.copy() line is needed to make sure that each repeated test run is isolated from the one before.)

Conclusion: on 1 million rows, using np.where() is roughly 13% slower here (929 ms vs. 825 ms).

However, using df.Random.apply(), as proposed by jackbicknell14, beats either method by a huge margin:

In [9]: def fill_zeros(x, _len=len, _zfill=str.zfill):
   ...:     # len() and str.zfill() are cached as parameters for performance
   ...:     return _zfill(x, 9 if _len(x) < 9 else 20)

In [10]: %%timeit
    ...: target = df.copy()
    ...: target.Random = target.Random.apply(fill_zeros)
    ...:
    ...:
299 ms ± 2.55 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

That's nearly 3 times faster! (pandas .str methods are not truly vectorized; they loop over the values in Python and build intermediate Series, so a single .apply() pass per value comes out well ahead.)

Remove leading zeroes pandas

You can try str.replace (since pandas 2.0 you must pass regex=True explicitly):

df['amount'].str.replace(r'^0+', '', regex=True).fillna('0')
0     324
1    S123
2      10
3       0
4      30
5    SA40
6    SA24
Name: amount, dtype: object
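
One caveat: a value that is all zeros is stripped to the empty string, which fillna('0') does not catch (it only replaces NaN). A sketch with assumed sample values, mapping such strings back to '0':

df = pd.DataFrame({'amount': ['0324', 'S123', '010', '000', '030', 'SA40', 'SA24']})
out = df['amount'].str.replace(r'^0+', '', regex=True)
out = out.replace('', '0')  # '000' was stripped to '', so restore a single zero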

Time-efficient way to add leading zeros in a pandas series

s = pd.Series(map(lambda x: '%010d' % x, s))

where s is your series.
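
An alternative that preserves the series' index and name (a sketch, assuming s holds integers):

s = s.map('{:010d}'.format)
# or, converting to str and padding:
# s = s.astype(str).str.zfill(10)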

Why does pandas remove leading zero when writing to a csv?

Pandas doesn't strip padded zeros. You're likely seeing this when opening the file in Excel. Open the csv in a text editor like Notepad++ and you'll see the values are still zero-padded.
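
A quick way to verify (a sketch with made-up values and an assumed file name):

import pandas as pd

df = pd.DataFrame({'acct': ['000123', '004567']})
df.to_csv('accts.csv', index=False)
print(open('accts.csv').read())   # the raw text still contains 000123 and 004567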

How to save a CSV from dataframe, to keep zeros left in column with numbers?

Specify dtype as str while reading the csv file, as below:

# if you are reading data with leading zeros
candidatos_2014 = pd.read_csv('candidatos_2014.csv', dtype=str)

Or convert the data column into a string:

# if the data is generated in Python, you can convert the column to string first
candidatos_2014['cpf'] = candidatos_2014['cpf'].astype('str')
candidatos_2014.to_csv('candidatos_2014.csv')
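
Note that astype('str') only keeps zeros that are still present; once the column has been read as integers, the leading zeros are gone and must be re-padded. A sketch (the per-column dtype and the 11-digit CPF width are assumptions):

# read just the cpf column as strings, letting other columns be inferred
candidatos_2014 = pd.read_csv('candidatos_2014.csv', dtype={'cpf': str})

# or, if the zeros were already lost on an integer read, re-pad to 11 digits
candidatos_2014['cpf'] = candidatos_2014['cpf'].astype(str).str.zfill(11)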

How can I keep leading zeros in a column, when I export to CSV?

This is an Excel problem, as @EdChum suggested. You'll want to wrap each value in ="" using apply('="{}"'.format). This tells Excel to treat the entry as a formula that returns the text within the quotes, and that text keeps your leading zeros.

Consider the following example.

df = pd.DataFrame(dict(A=['001', '002']))
df.A = df.A.apply('="{}"'.format)
df.to_excel('test_leading_zeros.xlsx')

Using multiindex resample in pandas with zeros results in NaN

Don't resample, but use the date in the groupby:

df['datetime'] = pd.to_datetime(df['datetime'])

df.groupby(['name', df['datetime'].dt.date]).sum()

Or, using pandas.Grouper for flexibility:

df.groupby(['name', pd.Grouper(key='datetime', freq='D')]).sum()

Output:

                       value
name       datetime
Excalibur1 2013-12-25      3
           2014-12-25    914
Janus      2014-01-11   8129
Michael    2012-01-11   3999

Rectangular shape and missing dates:

For a rectangular shape use:

df2 = df.groupby(['name', pd.Grouper(key='datetime', freq='D')])['value'].sum().unstack(level='name', fill_value=0)

Output:

name        Excalibur1  Janus  Michael
datetime
2013-12-25           3      0        0
2014-12-25         914      0        0
2014-01-11           0   8129        0
2012-01-11           0      0     3999

And to add missing dates, reindex:

df2 = df.groupby(['name', pd.Grouper(key='datetime', freq='D')])['value'].sum().unstack(level='name', fill_value=0)
df2 = df2.reindex(pd.date_range(df['datetime'].dt.date.min(), df['datetime'].max()), fill_value=0)

Output:

name        Excalibur1  Janus  Michael
2012-01-11           0      0     3999
2012-01-12           0      0        0
2012-01-13           0      0        0
2012-01-14           0      0        0
2012-01-15           0      0        0
...
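
A self-contained version of the above (a sketch; the sample data is assumed to match the output shown):

import pandas as pd

df = pd.DataFrame({
    'name': ['Excalibur1', 'Excalibur1', 'Janus', 'Michael'],
    'datetime': ['2013-12-25', '2014-12-25', '2014-01-11', '2012-01-11'],
    'value': [3, 914, 8129, 3999],
})
df['datetime'] = pd.to_datetime(df['datetime'])

daily = df.groupby(['name', pd.Grouper(key='datetime', freq='D')])['value'].sum()
df2 = daily.unstack(level='name', fill_value=0)
df2 = df2.reindex(pd.date_range(df['datetime'].min(), df['datetime'].max()), fill_value=0)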

