Handling Variable Number of Columns with Pandas - Python
One way which seems to work (at least in 0.10.1 and 0.11.0.dev-fc8de6d):
>>> !cat ragged.csv
1,2,3
1,2,3,4
1,2,3,4,5
1,2
1,2,3,4
>>> my_cols = ["A", "B", "C", "D", "E"]
>>> pd.read_csv("ragged.csv", names=my_cols, engine='python')
A B C D E
0 1 2 3 NaN NaN
1 1 2 3 4 NaN
2 1 2 3 4 5
3 1 2 NaN NaN NaN
4 1 2 3 4 NaN
Note that this approach requires you to name the columns you want, though. It's not as general as some other approaches, but it works well enough when it applies.
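If you don't want to hard-code the names, one workaround (a sketch, not part of the answer above) is to scan the data once for the widest row and generate generic names from that. The data below mirrors the ragged.csv sample; the "col0", "col1", ... names are invented for illustration:

```python
import pandas as pd
from io import StringIO

# Same rows as the ragged.csv sample above
data = "1,2,3\n1,2,3,4\n1,2,3,4,5\n1,2\n1,2,3,4\n"

# First pass: find the widest row so we know how many names to supply
width = max(len(line.split(",")) for line in data.splitlines())

# Generate placeholder names and let read_csv pad short rows with NaN
names = [f"col{i}" for i in range(width)]
df = pd.read_csv(StringIO(data), names=names)
print(df.shape)  # (5, 5)
```

For a file on disk you would do the counting pass over the file object instead of a string, at the cost of reading it twice.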
Handling Variable Number of Columns Dataframe - Python
I believe you can just pass the list of lists into pd.DataFrame(), and you will get NaNs for the values that don't exist. For example:
List_of_Lists = [[1, 2, 3, 4],
                 [5, 6, 7],
                 [9, 10],
                 [11]]
df = pd.DataFrame(List_of_Lists)
print(df)
0 1 2 3
0 1 2.0 3.0 4.0
1 5 6.0 7.0 NaN
2 9 10.0 NaN NaN
3 11 NaN NaN NaN
Then to get the naming the way you want just use pandas.DataFrame.add_prefix
df = df.add_prefix('Column')
print(df)
Column0 Column1 Column2 Column3
0 1 2.0 3.0 4.0
1 5 6.0 7.0 NaN
2 9 10.0 NaN NaN
3 11 NaN NaN NaN
It's also possible that you want each list to be a column instead. In that case you need to transpose List_of_Lists:
from itertools import zip_longest
df = pd.DataFrame(list(map(list, zip_longest(*List_of_Lists))))
print(df)
0 1 2 3
0 1 5.0 9.0 11.0
1 2 6.0 10.0 NaN
2 3 7.0 NaN NaN
3 4 NaN NaN NaN
Reading csv with variable number of columns with pandas
Just make use of the usecols parameter instead of names. names assumes that you're listing all of the columns' names, whereas usecols assumes a subsample of the columns.
from io import StringIO
import pandas as pd
file = StringIO(
'''1, 2, 3, 4,
1, 2
1, 2, 3, 4,
1, 2, 3,''')
df = pd.read_csv(file, usecols=[0, 1, 2], header=None)
df
0 1 2
0 1 2 3.0
1 1 2 NaN
2 1 2 3.0
3 1 2 3.0
How to deal with variable number of columns in dataframe
Use Index.intersection:
df[df.columns.intersection(['Col_A','Col_A','Col_E'], sort=False)]
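A minimal runnable sketch of what that one-liner does; the frame and column names below are invented for illustration:

```python
import pandas as pd

# Hypothetical frame; only Col_A of the requested names exists
df = pd.DataFrame({'Col_A': [1, 2], 'Col_B': [3, 4], 'Col_C': [5, 6]})

# intersection keeps only the requested columns that are actually
# present, so a missing name like 'Col_E' is silently dropped
# instead of raising a KeyError
subset = df[df.columns.intersection(['Col_A', 'Col_E'], sort=False)]
print(subset.columns.tolist())  # ['Col_A']
```

This is the usual way to select a wish-list of columns from frames whose column sets vary.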
Multiple condition over a variable number of columns
Here's a solution using any and mask, without apply:
import numpy as np
import pandas as pd

df = pd.DataFrame(index=range(8), columns=['TOT_SIGNAL', 'TRADING_DAY']).join(pd.DataFrame(np.eye(8, 5)))
df.TRADING_DAY = df.TRADING_DAY.mask((df.iloc[:, 2:] != 0).any(axis=1), 1)
Result:
TOT_SIGNAL TRADING_DAY 0 1 2 3 4
0 NaN 1 1.0 0.0 0.0 0.0 0.0
1 NaN 1 0.0 1.0 0.0 0.0 0.0
2 NaN 1 0.0 0.0 1.0 0.0 0.0
3 NaN 1 0.0 0.0 0.0 1.0 0.0
4 NaN 1 0.0 0.0 0.0 0.0 1.0
5 NaN NaN 0.0 0.0 0.0 0.0 0.0
6 NaN NaN 0.0 0.0 0.0 0.0 0.0
7 NaN NaN 0.0 0.0 0.0 0.0 0.0
How to Read A CSV With A Variable Number of Columns?
You could pass a dummy separator, and then use str.split (by ",") with expand=True:
df = pd.read_csv('path/to/file.csv', sep=" ", header=None)
df = df[0].str.split(",", expand=True).fillna("")
print(df)
Output
0 1 2 3
0 5783 145v
1 g656 4589 3243 tt56
2 6579
Split a pandas DataFrame column into a variable number of columns
You could slightly change the function and use it in a list comprehension; then assign the nested list to columns:
import re
import numpy as np

def get_header_properties(header):
    # Raw strings avoid invalid-escape warnings in the regex patterns
    pf_type = re.match(r".*?(?=\.)", header).group()
    pf_id = re.search(rf"(?<={pf_type}\.).*?(?=(,|$))", header).group()
    pf_coords = re.search(rf"(?<={pf_id}).*", header).group()
    coords = pf_coords.split(",")[1:]
    # Pad with NaN so every row yields exactly 4 fields
    return [pf_type, pf_id] + coords + ([np.nan] * (2 - len(coords)) if len(coords) < 2 else [])

df[['Type', 'ID', 'dim1', 'dim2']] = [get_header_properties(i) for i in df['index']]
out = df.drop(columns='index')[['Type', 'ID', 'dim1', 'dim2', 'value']]
That said, instead of the function, it seems simpler and more efficient to use str.split once on the "index" column and join it to df:
df = (df['index'].str.split('[.,]', expand=True)
.fillna(np.nan)
.rename(columns={i: col for i,col in enumerate(['Type','ID','dim1','dim2'])})
.join(df[['value']]))
Output:
Type ID dim1 dim2 value
0 FirstType FirstID NaN NaN 0.23
1 OtherType OtherID 1 NaN 50.00
2 OtherType OtherID 4 NaN 60.00
3 LastType LastID 1 1 110.00
4 LastType LastID 1 2 199.00
5 LastType LastID 2 3 123.00
Pandas comparison with variable number of columns
So assuming your dataframe has parsed the datetime columns (you can use to_datetime for that, or e.g. specify parse_dates in read_csv):
In [64]: df
Out[64]:
id date birth_date_1 birth_date_2
0 1 2000-01-01 2000-01-03 2000-01-05
1 1 2000-01-07 2000-01-03 2000-01-05
2 2 2000-01-02 2000-01-10 2000-01-01
3 2 2000-01-05 2000-01-10 2000-01-01
You can now check where the values in the 'birth_date' columns are lower than the values in the 'date' column, and then use sum to count:
In [65]: df[['birth_date_1', 'birth_date_2']].lt(df['date'], axis=0)
Out[65]:
birth_date_1 birth_date_2
0 False False
1 True True
2 False True
3 False True
In [66]: df[['birth_date_1', 'birth_date_2']].lt(df['date'], axis=0).sum(axis=1)
Out[66]:
0 0
1 2
2 1
3 1
dtype: int64
To deal with the varying number of 'birth_date' columns, you can select them automatically with filter, like this:
In [67]: df.filter(like="birth_date")
Out[67]:
birth_date_1 birth_date_2
0 2000-01-03 2000-01-05
1 2000-01-03 2000-01-05
2 2000-01-10 2000-01-01
3 2000-01-10 2000-01-01
Altogether, this would give:
In [66]: df.filter(like="birth_date").lt(df['date'], axis=0).sum(axis=1)
Out[66]:
0 0
1 2
2 1
3 1
dtype: int64
Pandas - string split into multiple columns with variable number of delimited values into 3 columns
Use DataFrame.reindex:
s.str.split(' - ', expand=True).reindex(range(3), axis=1).astype(object).mask(lambda x: x.isna(), None)
Or:
s.str.split(' - ', expand=True).reindex(range(3), axis=1).fillna('')
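A runnable sketch of the second variant; the input strings below are assumptions, since the original question's data isn't shown:

```python
import pandas as pd

# Hypothetical delimited strings with 3, 2, and 1 parts
s = pd.Series(["a - b - c", "a - b", "a"])

# split produces as many columns as the widest row; reindex forces
# exactly 3 columns even if every row were shorter, and fillna
# replaces the missing cells with empty strings
out = s.str.split(" - ", expand=True).reindex(range(3), axis=1).fillna("")
print(out)
```

The reindex step is what guarantees a fixed 3-column shape regardless of how many delimiters each string contains.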
How set value on dataframe given a variable number of conditions?
If I understand you correctly, you are looking for the .query() method:
import pandas as pd
from itertools import product
animals = ["dogs", "cats"]
eyes = ['brown', 'blue', 'green']
height = ['short', 'average', 'tall']
a = [animals, eyes, height]
df = pd.DataFrame(list(product(*a)), columns=["animals", "eyes", "height"])
df['value'] = 1
def zero_out(df, lst):
    q = ' & '.join('{} == "{}"'.format(col, val) for col, val in lst)
    df.loc[df.query(q).index, 'value'] = 0
zero_out(df, [("height", "tall")])
print(df)
Prints:
animals eyes height value
0 dogs brown short 1
1 dogs brown average 1
2 dogs brown tall 0
3 dogs blue short 1
4 dogs blue average 1
5 dogs blue tall 0
6 dogs green short 1
7 dogs green average 1
8 dogs green tall 0
9 cats brown short 1
10 cats brown average 1
11 cats brown tall 0
12 cats blue short 1
13 cats blue average 1
14 cats blue tall 0
15 cats green short 1
16 cats green average 1
17 cats green tall 0
Or zero_out(df, [("animals", "dogs"), ("eyes", "blue")])
:
animals eyes height value
0 dogs brown short 1
1 dogs brown average 1
2 dogs brown tall 1
3 dogs blue short 0
4 dogs blue average 0
5 dogs blue tall 0
6 dogs green short 1
7 dogs green average 1
8 dogs green tall 1
9 cats brown short 1
10 cats brown average 1
11 cats brown tall 1
12 cats blue short 1
13 cats blue average 1
14 cats blue tall 1
15 cats green short 1
16 cats green average 1
17 cats green tall 1