How do I retrieve the number of columns in a Pandas data frame?
Like so:
import pandas as pd
df = pd.DataFrame({"pear": [1,2,3], "apple": [2,3,4], "orange": [3,4,5]})
len(df.columns)
3
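As an alternative, here is a minimal sketch that reads the column count from df.shape, assuming the same df as above. It should give the same number as len(df.columns).

```python
import pandas as pd

df = pd.DataFrame({"pear": [1, 2, 3], "apple": [2, 3, 4], "orange": [3, 4, 5]})

# df.shape is a (rows, columns) tuple, so index 1 is the column count
n_cols = df.shape[1]
print(n_cols)
```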
How to get number of columns in a DataFrame row that are above threshold
The value of x in the lambda is a Series (one row of the frame, because axis=1), which can be indexed with a boolean mask like this:
df[9] = df.apply(lambda x: x[x > 2].count(), axis=1)
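For larger frames, a vectorized comparison may be faster than apply. The sketch below uses hypothetical sample data (the column names a, b, c are not from the question) and computes the count both ways so the results can be compared.

```python
import pandas as pd

# Hypothetical sample data, not from the original question
df = pd.DataFrame({"a": [1, 3, 5], "b": [2, 2, 7], "c": [0, 4, 9]})

# Row-wise version from the answer: x is one row as a Series
per_row_apply = df.apply(lambda x: x[x > 2].count(), axis=1)

# Vectorized version: True counts as 1, so summing the mask counts matches per row
per_row_mask = (df > 2).sum(axis=1)

print(per_row_mask.tolist())
```

Both are computed before any result column is assigned, so the count does not include itself.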
How to count the number of columns with a value on each row in python?
Replace all the blank values with NaN, then count the notnull values by row using sum(1):
df['Chains'] = df.iloc[:,1:].replace('',np.nan).notnull().sum(1)
>>> df
IndividualID Trip1 Trip2 Trip3 Trip4 Trip5 Trip6 Trip7 Trip8 \
0 200100001 23 1.0 2.0 4.0 4.0 1.0 5.0 5.0
1 200100002 21 1.0 12.0 3.0 1.0 55.0 7.0 7.0
2 200100003 12 3.0 3.0 6.0 3.0 NaN NaN NaN
3 200100004 4 NaN NaN NaN NaN NaN NaN NaN
4 200100005 6 5.0 3.0 9.0 3.0 5.0 6.0 NaN
5 200100005 23 4.0 4.0 2.0 4.0 3.0 6.0 5.0
Trip9 Chains
0 5.0 9
1 NaN 8
2 NaN 5
3 NaN 1
4 NaN 7
5 NaN 8
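To try this without the full table, here is a small self-contained sketch. The data is a hypothetical miniature of the trips table, with '' standing in for blank cells; columns containing blanks are kept as strings, as they would typically arrive from a text file.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature of the trips table; '' marks a blank cell
df = pd.DataFrame({
    "IndividualID": [200100001, 200100002, 200100004],
    "Trip1": [23, 21, 4],
    "Trip2": ["1", "", ""],
    "Trip3": ["2", "12", ""],
})

# Skip the ID column, turn blanks into NaN, then count non-null cells per row
df["Chains"] = df.iloc[:, 1:].replace("", np.nan).notnull().sum(axis=1)
print(df["Chains"].tolist())
```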
Python ValueError: The number of columns in this dataset is different from the one used to fit this transformer (when using the fit() method)
Problem solved. In my pipeline, the categorical features are one-hot encoded. My training set had 42 unique categories, which yields 42 columns after one-hot encoding. My testing set had only 27 unique categories, which yields 27 columns. Hence the ValueError was raised.
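The mismatch can be reproduced on toy data. The sketch below is an illustration with pandas.get_dummies, not the pipeline from the question: the frames train and test and the column color are hypothetical. It shows that encoding each set separately gives different widths, and one way to align the test columns to the training columns. With a scikit-learn encoder, the usual equivalent is to fit on the training set only and call transform on the test set.

```python
import pandas as pd

# Hypothetical toy data: the test set never contains category "c"
train = pd.DataFrame({"color": ["a", "b", "c"]})
test = pd.DataFrame({"color": ["a", "b", "a"]})

# Encoding each set on its own produces a different number of columns
train_enc = pd.get_dummies(train, columns=["color"])
test_enc = pd.get_dummies(test, columns=["color"])
print(train_enc.shape[1], test_enc.shape[1])

# Align the test set to the training columns; absent categories become 0
test_aligned = test_enc.reindex(columns=train_enc.columns, fill_value=0)
print(list(test_aligned.columns))
```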
Split a pandas DataFrame column into a variable number of columns
You could slightly change the function and use it in a list comprehension; then assign the nested list to columns:
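import re
import numpy as np

def get_header_properties(header):
    # Type is everything before the first "."
    pf_type = re.match(r".*?(?=\.)", header).group()
    # ID is what follows "<type>." up to the first comma or end of string
    pf_id = re.search(rf"(?<={pf_type}\.).*?(?=(,|$))", header).group()
    pf_coords = re.search(rf"(?<={pf_id}).*", header).group()
    coords = pf_coords.split(",")[1:]
    # Pad with NaN so every row has exactly two coordinate slots
    return [pf_type, pf_id] + coords + ([np.nan]*(2-len(coords)) if len(coords)<2 else [])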
df[['Type','ID','dim1','dim2']] = [get_header_properties(i) for i in df['index']]
out = df.drop(columns='index')[['Type','ID','dim1','dim2','value']]
That said, instead of the function, it seems simpler and more efficient to use str.split once on the "index" column and join the result to df:
df = (df['index'].str.split('[.,]', expand=True)
.fillna(np.nan)
.rename(columns={i: col for i,col in enumerate(['Type','ID','dim1','dim2'])})
.join(df[['value']]))
Output:
Type ID dim1 dim2 value
0 FirstType FirstID NaN NaN 0.23
1 OtherType OtherID 1 NaN 50.00
2 OtherType OtherID 4 NaN 60.00
3 LastType LastID 1 1 110.00
4 LastType LastID 1 2 199.00
5 LastType LastID 2 3 123.00
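For a runnable version of the str.split approach, the sketch below rebuilds an input frame from the output shown above. The exact "index" strings are an assumption inferred from that output. The column names are assigned directly rather than through rename.

```python
import pandas as pd

# Input reconstructed from the output above (assumed original strings)
df = pd.DataFrame({
    "index": [
        "FirstType.FirstID",
        "OtherType.OtherID,1",
        "OtherType.OtherID,4",
        "LastType.LastID,1,1",
        "LastType.LastID,1,2",
        "LastType.LastID,2,3",
    ],
    "value": [0.23, 50.0, 60.0, 110.0, 199.0, 123.0],
})

# A multi-character pattern is treated as a regex: split on "." or ","
parts = df["index"].str.split("[.,]", expand=True)
parts.columns = ["Type", "ID", "dim1", "dim2"]

# Shorter rows are padded with missing values; join brings back "value"
out = parts.join(df[["value"]])
print(out)
```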
Adding Dummy Column If Number of Column is less than number of rows
You can use reindex with a fill_value:
out = df.reindex(columns = df.columns.to_list()+[*range(df.shape[0]-df.shape[1])],fill_value=0)
Output:
A B C 0 1 2
0 4 2 5 0 0 0
1 2 6 8 0 0 0
2 8 3 4 0 0 0
3 4 2 5 0 0 0
4 3 6 7 0 0 0
5 7 3 8 0 0 0
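A variation with named dummy columns, as a sketch. The frame is rebuilt from the output above, and the dummy_ prefix is a hypothetical naming choice. Because range() is empty when there are at least as many columns as rows, the frame is left unchanged in that case.

```python
import pandas as pd

# Data reconstructed from the output shown above
df = pd.DataFrame({
    "A": [4, 2, 8, 4, 3, 7],
    "B": [2, 6, 3, 2, 6, 3],
    "C": [5, 8, 4, 5, 7, 8],
})

n_rows, n_cols = df.shape
# Hypothetical names for the padding columns; empty list if n_rows <= n_cols
extra = [f"dummy_{i}" for i in range(n_rows - n_cols)]
out = df.reindex(columns=df.columns.to_list() + extra, fill_value=0)
print(out.shape)
```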
Change amount of columns in Pandas
Create a list of all values from the file, then reshape it with numpy.reshape into a 4-column DataFrame:
with open('data.txt') as f:
    L = [x for line in f for x in line.strip().split()]
print (L)
['32', '45', '2.65', '-845', '1', '-84', '97.236', '454',
'35.78', '77.12', '948.87', '151', '-23.5', '-787.48', '13.005', '31']
df = pd.DataFrame(np.array(L).reshape(-1, 4))
print (df)
0 1 2 3
0 32 45 2.65 -845
1 1 -84 97.236 454
2 35.78 77.12 948.87 151
3 -23.5 -787.48 13.005 31
But this solution does not work if the values cannot fill complete rows of 4 columns; then it is a bit more complicated:
#missing last value
print (L)
['32', '45', '2.65', '-845', '1', '-84', '97.236', '454', '35.78',
'77.12', '948.87', '151', '-23.5', '-787.48', '13.005']
arr = np.empty(((len(L) - 1)//4 + 1)*4, dtype='O')
arr[:len(L)] = L
df = pd.DataFrame(arr.reshape((-1, 4))).fillna('0')
print(df)
0 1 2 3
0 32 45 2.65 -845
1 1 -84 97.236 454
2 35.78 77.12 948.87 151
3 -23.5 -787.48 13.005 0
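Another option, sketched here as an assumption rather than as part of the answer above, is to pad the list itself before reshaping. The expression -len(L) % n_cols gives the number of cells missing from the last row, and is 0 when the list already fits.

```python
import numpy as np
import pandas as pd

# Same 15 values as above (one short of a full 4x4 grid)
L = ['32', '45', '2.65', '-845', '1', '-84', '97.236', '454', '35.78',
     '77.12', '948.87', '151', '-23.5', '-787.48', '13.005']

n_cols = 4
# Number of filler cells needed to complete the last row
n_missing = -len(L) % n_cols
padded = L + ['0'] * n_missing

df = pd.DataFrame(np.array(padded).reshape(-1, n_cols))
print(df)
```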