Handling Variable Number of Columns with Pandas - Python

One way that seems to work (at least in pandas 0.10.1 and 0.11.0.dev-fc8de6d):

>>> !cat ragged.csv
1,2,3
1,2,3,4
1,2,3,4,5
1,2
1,2,3,4
>>> import pandas as pd
>>> my_cols = ["A", "B", "C", "D", "E"]
>>> pd.read_csv("ragged.csv", names=my_cols, engine='python')
   A  B    C    D    E
0  1  2    3  NaN  NaN
1  1  2    3    4  NaN
2  1  2    3    4    5
3  1  2  NaN  NaN  NaN
4  1  2    3    4  NaN

Note that this approach requires you to name the columns you want, though. It's not as general as some other ways, but it works well enough when it applies.
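
If you don't know in advance how many columns to expect, one workaround is to scan the file for the widest row first and generate that many names. A minimal sketch (the file name and generated names are placeholders, not part of the answer above):

import csv
import pandas as pd

# Find the widest row so we can generate enough column names
with open('ragged.csv', newline='') as f:
    width = max(len(row) for row in csv.reader(f))

names = [f'col{i}' for i in range(width)]
df = pd.read_csv('ragged.csv', names=names, engine='python')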

Handling Variable Number of Columns Dataframe - Python

I believe you can just pass the list of lists into pd.DataFrame() and you will get NaN for the values that don't exist.

For example:

import pandas as pd

List_of_Lists = [[1, 2, 3, 4],
                 [5, 6, 7],
                 [9, 10],
                 [11]]
df = pd.DataFrame(List_of_Lists)
print(df)
    0     1    2    3
0   1   2.0  3.0  4.0
1   5   6.0  7.0  NaN
2   9  10.0  NaN  NaN
3  11   NaN  NaN  NaN

Then, to get the naming the way you want, just use pandas.DataFrame.add_prefix:

df = df.add_prefix('Column')
print(df)
   Column0  Column1  Column2  Column3
0        1      2.0      3.0      4.0
1        5      6.0      7.0      NaN
2        9     10.0      NaN      NaN
3       11      NaN      NaN      NaN

You might also want each list to be a column rather than a row. In that case you need to transpose List_of_Lists, padding the shorter lists with itertools.zip_longest:

from itertools import zip_longest

df = pd.DataFrame(list(map(list, zip_longest(*List_of_Lists))))
print(df)
   0    1     2     3
0  1  5.0   9.0  11.0
1  2  6.0  10.0   NaN
2  3  7.0   NaN   NaN
3  4  NaN   NaN   NaN

Reading csv with variable number of columns with pandas

Just make use of the usecols parameter instead of the names one. names assumes that you're listing all the columns' names, whereas usecols assumes a subset of the columns.

from io import StringIO
import pandas as pd

file = StringIO(
'''1, 2, 3, 4,
1, 2
1, 2, 3, 4,
1, 2, 3,''')

df = pd.read_csv(file, usecols=[0, 1, 2], header=None)
df
   0  1    2
0  1  2  3.0
1  1  2  NaN
2  1  2  3.0
3  1  2  3.0

How to deal with variable number of columns in dataframe

Use Index.intersection:

df[df.columns.intersection(['Col_A', 'Col_E'], sort=False)]
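
The one-liner assumes df already exists; here's a self-contained sketch with made-up column names, keeping only the requested columns that are actually present:

import pandas as pd

df = pd.DataFrame({'Col_A': [1, 2], 'Col_C': [3, 4], 'Col_E': [5, 6]})

# 'Col_Z' is not in df, so the intersection silently drops it
wanted = ['Col_A', 'Col_E', 'Col_Z']
print(df[df.columns.intersection(wanted, sort=False)])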

Multiple condition over a variable number of columns

Here's a solution using any and mask without apply:

import numpy as np
import pandas as pd

df = pd.DataFrame(index=range(8), columns=['TOT_SIGNAL', 'TRADING_DAY']).join(pd.DataFrame(np.eye(8, 5)))

df.TRADING_DAY = df.TRADING_DAY.mask((df.iloc[:, 2:] != 0).any(axis=1), 1)

Result:

  TOT_SIGNAL TRADING_DAY    0    1    2    3    4
0        NaN           1  1.0  0.0  0.0  0.0  0.0
1        NaN           1  0.0  1.0  0.0  0.0  0.0
2        NaN           1  0.0  0.0  1.0  0.0  0.0
3        NaN           1  0.0  0.0  0.0  1.0  0.0
4        NaN           1  0.0  0.0  0.0  0.0  1.0
5        NaN         NaN  0.0  0.0  0.0  0.0  0.0
6        NaN         NaN  0.0  0.0  0.0  0.0  0.0
7        NaN         NaN  0.0  0.0  0.0  0.0  0.0

How to Read A CSV With A Variable Number of Columns?

You could pass a dummy separator, and then use str.split (by ",") with expand=True:

import pandas as pd

df = pd.read_csv('path/to/file.csv', sep=" ", header=None)
df = df[0].str.split(",", expand=True).fillna("")
print(df)

Output

      0     1     2     3
0  5783  145v
1  g656  4589  3243  tt56
2  6579

Split a pandas DataFrame column into a variable number of columns

You could slightly change the function and use it in a list comprehension; then assign the nested list to columns:

import re
import numpy as np

def get_header_properties(header):
    pf_type = re.match(r".*?(?=\.)", header).group()
    pf_id = re.search(rf"(?<={pf_type}\.).*?(?=(,|$))", header).group()
    pf_coords = re.search(rf"(?<={pf_id}).*", header).group()
    coords = pf_coords.split(",")[1:]
    return [pf_type, pf_id] + coords + ([np.nan] * (2 - len(coords)) if len(coords) < 2 else [])

df[['Type', 'ID', 'dim1', 'dim2']] = [get_header_properties(i) for i in df['index']]
out = df.drop(columns='index')[['Type', 'ID', 'dim1', 'dim2', 'value']]

That said, instead of the function, it's simpler and more efficient to use str.split once on the "index" column and join the result to df:

df = (df['index'].str.split('[.,]', expand=True)
        .fillna(np.nan)
        .rename(columns={i: col for i, col in enumerate(['Type', 'ID', 'dim1', 'dim2'])})
        .join(df[['value']]))

Output:

        Type       ID dim1 dim2   value
0  FirstType  FirstID  NaN  NaN    0.23
1  OtherType  OtherID    1  NaN   50.00
2  OtherType  OtherID    4  NaN   60.00
3   LastType   LastID    1    1  110.00
4   LastType   LastID    1    2  199.00
5   LastType   LastID    2    3  123.00
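
For reference, a self-contained run of the str.split approach; the 'index' strings below are reconstructed from the output above:

import numpy as np
import pandas as pd

df = pd.DataFrame({
    'index': ['FirstType.FirstID',
              'OtherType.OtherID,1',
              'OtherType.OtherID,4',
              'LastType.LastID,1,1',
              'LastType.LastID,1,2',
              'LastType.LastID,2,3'],
    'value': [0.23, 50.0, 60.0, 110.0, 199.0, 123.0],
})

df = (df['index'].str.split('[.,]', expand=True)
        .fillna(np.nan)
        .rename(columns={i: col for i, col in enumerate(['Type', 'ID', 'dim1', 'dim2'])})
        .join(df[['value']]))
print(df)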

Pandas comparison with variable number of columns

So, assuming your dataframe has the datetime columns parsed (you can use to_datetime for that, or e.g. specify parse_dates in read_csv; a sketch of that step follows the frame below):

In [64]: df
Out[64]:
   id       date birth_date_1 birth_date_2
0   1 2000-01-01   2000-01-03   2000-01-05
1   1 2000-01-07   2000-01-03   2000-01-05
2   2 2000-01-02   2000-01-10   2000-01-01
3   2 2000-01-05   2000-01-10   2000-01-01
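
For completeness, the parsing step itself might look like this (the file name is hypothetical):

import pandas as pd

# Parse the date columns while reading...
df = pd.read_csv('data.csv', parse_dates=['date', 'birth_date_1', 'birth_date_2'])

# ...or after the fact, with to_datetime
for col in ['date', 'birth_date_1', 'birth_date_2']:
    df[col] = pd.to_datetime(df[col])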

You can now check where the values in the 'birth_date' columns are lower than the values in the 'date' column, and then use sum to count:

In [65]: df[['birth_date_1', 'birth_date_2']].lt(df['date'], axis=0)
Out[65]:
   birth_date_1  birth_date_2
0         False         False
1          True          True
2         False          True
3         False          True

In [66]: df[['birth_date_1', 'birth_date_2']].lt(df['date'], axis=0).sum(axis=1)
Out[66]:
0    0
1    2
2    1
3    1
dtype: int64

To deal with the varying number of 'birth_date' columns, you can select them automatically with filter, like this:

In [67]: df.filter(like="birth_date")
Out[67]:
  birth_date_1 birth_date_2
0   2000-01-03   2000-01-05
1   2000-01-03   2000-01-05
2   2000-01-10   2000-01-01
3   2000-01-10   2000-01-01

Altogether, this would give:

In [68]: df.filter(like="birth_date").lt(df['date'], axis=0).sum(axis=1)
Out[68]:
0    0
1    2
2    1
3    1
dtype: int64

Pandas - string split into multiple columns with variable number of delimited values into 3 columns

Use DataFrame.reindex:

s.str.split(' - ', expand=True).reindex(range(3), axis=1).astype(object).mask(lambda x: x.isna(), None)

Or:

s.str.split(' - ', expand=True).reindex(range(3), axis=1).fillna('')
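
A self-contained run with a made-up Series; the widest row here splits into only two parts, so reindex adds the missing third column:

import pandas as pd

s = pd.Series(['a - b', 'a'])

out = s.str.split(' - ', expand=True).reindex(range(3), axis=1).fillna('')
print(out)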

How to set a value on a dataframe given a variable number of conditions?

If I understand you correctly, you are looking for the .query() method:

import pandas as pd
from itertools import product

animals = ["dogs", "cats"]
eyes = ['brown', 'blue', 'green']
height = ['short', 'average', 'tall']
a = [animals, eyes, height]
df = pd.DataFrame(list(product(*a)), columns=["animals", "eyes", "height"])
df['value'] = 1

def zero_out(df, lst):
    q = ' & '.join('{} == "{}"'.format(col, val) for col, val in lst)
    df.loc[df.query(q).index, 'value'] = 0

zero_out(df, [("height", "tall")])
print(df)

Prints:

   animals   eyes   height  value
0     dogs  brown    short      1
1     dogs  brown  average      1
2     dogs  brown     tall      0
3     dogs   blue    short      1
4     dogs   blue  average      1
5     dogs   blue     tall      0
6     dogs  green    short      1
7     dogs  green  average      1
8     dogs  green     tall      0
9     cats  brown    short      1
10    cats  brown  average      1
11    cats  brown     tall      0
12    cats   blue    short      1
13    cats   blue  average      1
14    cats   blue     tall      0
15    cats  green    short      1
16    cats  green  average      1
17    cats  green     tall      0

Or zero_out(df, [("animals", "dogs"), ("eyes", "blue")]):

   animals   eyes   height  value
0     dogs  brown    short      1
1     dogs  brown  average      1
2     dogs  brown     tall      1
3     dogs   blue    short      0
4     dogs   blue  average      0
5     dogs   blue     tall      0
6     dogs  green    short      1
7     dogs  green  average      1
8     dogs  green     tall      1
9     cats  brown    short      1
10    cats  brown  average      1
11    cats  brown     tall      1
12    cats   blue    short      1
13    cats   blue  average      1
14    cats   blue     tall      1
15    cats  green    short      1
16    cats  green  average      1
17    cats  green     tall      1

