Delete the first three rows of a dataframe in pandas
Use iloc:
df = df.iloc[3:]
will give you a new df without the first three rows.
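As a minimal sketch of the slicing above, using made-up sample data (the column name x and its values are assumptions, not from the question): note that iloc keeps the original index labels, so you may want reset_index afterwards.

```python
import pandas as pd

# Hypothetical sample data, not taken from the original question.
df = pd.DataFrame({"x": [10, 20, 30, 40, 50]})

# Drop the first three rows by position; index labels 3 and 4 are kept.
trimmed = df.iloc[3:]

# Optionally renumber the remaining rows from 0.
trimmed_reset = trimmed.reset_index(drop=True)

print(trimmed.index.tolist())
print(trimmed_reset.index.tolist())
```

If the old labels matter downstream (for joins, say), skip the reset_index step.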
Remove very first row in pandas
Use DataFrame.droplevel, because the columns have a MultiIndex:
df = df.sort_index(axis=1, level=1).droplevel(0, axis=1)
Or, for older versions of pandas, use MultiIndex.droplevel:
df.columns = df.columns.droplevel(0)
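To see why the "first row" here is really a column level, a small sketch with an invented two-level header may help (the labels top, a and b are assumptions for illustration only):

```python
import pandas as pd

# Hypothetical frame whose columns have two levels: ("top", "a") and ("top", "b").
cols = pd.MultiIndex.from_tuples([("top", "a"), ("top", "b")])
df = pd.DataFrame([[1, 2], [3, 4]], columns=cols)

# Drop the outer column level; the data rows are untouched.
flat = df.droplevel(0, axis=1)

# Equivalent approach that works on the column index directly.
flat_cols = df.columns.droplevel(0)

print(flat.columns.tolist())
print(len(flat))
```

Both approaches leave the same single-level columns; the number of data rows does not change.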
Python/Pandas - Remove the first row with Unnamed: 0,Unnamed: 1,Unnamed: 2,Unnamed: 3,Unnamed: 4,Unnamed: 5,Unnamed: 6,Unnamed: 7
You need to pass the skiprows argument to the pd.read_excel function so that the column names are correctly read from the 5th row.
UPDATE: including the forward filling
import pandas as pd
xl = pd.ExcelFile('Sample_File.xlsm')
for sheet in xl.sheet_names:
    df = pd.read_excel(xl, sheet_name=sheet, skiprows=4)  # no more iloc here
    df['Comment'] = df['Comment'].ffill()  # fill empty comments from the row above
    df.to_csv(f'{sheet}.csv', index=False)
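The same idea can be tried without an Excel file. The sketch below swaps in pd.read_csv on an in-memory string, which also accepts skiprows; the file contents and the two junk lines are invented for illustration.

```python
import io
import pandas as pd

# Hypothetical file contents: two junk lines, then the real header row.
raw = "junk line 1\njunk line 2\nID,Comment\n1,first\n2,\n3,second\n"

# Skip the two leading lines so that "ID,Comment" becomes the header.
df = pd.read_csv(io.StringIO(raw), skiprows=2)

# Forward-fill the missing comment from the row above it.
df["Comment"] = df["Comment"].ffill()

print(df.columns.tolist())
print(df["Comment"].tolist())
```

Without skiprows, the junk lines would be parsed as the header and the "Unnamed: N" columns from the question title would likely reappear.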
How to remove/delete first row/column from Data Frame using python?
.read_html() returns a list of dataframes. You select a specific dataframe by its index position (i.e., as you did with df[1]). So you need to use .iloc on the dataframe at index position 1 in your list of dataframes.
df = df[1].iloc[: , 1:]
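A rough sketch of that indexing, with a hand-built list standing in for the result of read_html (the tables and column names below are assumptions, not the real page data):

```python
import pandas as pd

# Stand-in for the list that read_html would return (hypothetical tables).
tables = [
    pd.DataFrame({"a": [1]}),
    pd.DataFrame({"idx": [0, 1], "name": ["x", "y"], "val": [1.0, 2.0]}),
]

# Pick the second table, then keep all rows and every column except the first.
no_first_col = tables[1].iloc[:, 1:]

# Dropping the first row as well only changes the row slice.
no_first_row_or_col = tables[1].iloc[1:, 1:]

print(no_first_col.columns.tolist())
print(len(no_first_row_or_col))
```

The first slice in iloc selects rows and the second selects columns, so [:, 1:] drops a column and [1:, :] drops a row.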
Remove top row from a dataframe
You can try using slicing.
df = df[1:]
This will remove the first row of your dataframe.
How to delete first row in a csv file using python
FILENAME = 'test.csv'
DELETE_LINE_NUMBER = 1  # zero-based; line 0 is the header
with open(FILENAME) as f:
    data = f.read().splitlines()  # Read csv file
with open(FILENAME, 'w') as g:
    # Keep everything before and after the line to delete
    g.write('\n'.join(data[:DELETE_LINE_NUMBER] + data[DELETE_LINE_NUMBER+1:]))  # Write to file
Original test.csv:
ID, Name
0, ABC
1, DEF
2, GHI
3, JKL
4, MNO
After run:
ID, Name
1, DEF
2, GHI
3, JKL
4, MNO
(deleted 0, ABC)
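The approach above can be wrapped in a small function and tried on a temporary file, so nothing real gets overwritten. This is only a sketch: delete_line is a hypothetical helper name, and the trailing newline it writes is a choice made here rather than something the original answer does.

```python
import os
import tempfile

def delete_line(path, line_number):
    """Remove the line at zero-based index line_number from a text file."""
    with open(path) as f:
        lines = f.read().splitlines()
    del lines[line_number]
    with open(path, "w") as f:
        # Write the remaining lines back, ending the file with a newline.
        f.write("\n".join(lines) + "\n")

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "test.csv")
    with open(path, "w") as f:
        f.write("ID, Name\n0, ABC\n1, DEF\n")
    delete_line(path, 1)  # index 1 is the first data row
    with open(path) as f:
        result = f.read()

print(result)
```

Because the whole file is read into memory, this suits small files; for very large files a streaming copy to a new file would be gentler on memory.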
Python: Pandas - Delete the first row by group
You could use groupby/transform to prepare a boolean mask which is True for the rows you want and False for the rows you don't want. Once you have such a boolean mask, you can select the sub-DataFrame using df.loc[mask]:
import numpy as np
import pandas as pd
df = pd.DataFrame(
    {'ID': [10001, 10001, 10001, 10002, 10002, 10002, 10003, 10003, 10003],
     'PRICE': [14.5, 14.5, 14.5, 15.125, 14.5, 14.5, 14.5, 14.5, 15.0],
     'date': [19920103, 19920106, 19920107, 19920108, 19920109, 19920110,
              19920113, 19920114, 19920115]},
    index = range(1,10))
def mask_first(x):
    # All ones, except a zero in the first position of each group
    result = np.ones_like(x)
    result[0] = 0
    return result
mask = df.groupby(['ID'])['ID'].transform(mask_first).astype(bool)
print(df.loc[mask])
yields
ID PRICE date
2 10001 14.5 19920106
3 10001 14.5 19920107
5 10002 14.5 19920109
6 10002 14.5 19920110
8 10003 14.5 19920114
9 10003 15.0 19920115
Since you're interested in efficiency, here is a benchmark:
import timeit
import operator
import numpy as np
import pandas as pd
N = 10000
df = pd.DataFrame(
    {'ID': np.random.randint(100, size=(N,)),
     'PRICE': np.random.random(N),
     'date': np.random.random(N)})
def using_mask(df):
    def mask_first(x):
        result = np.ones_like(x)
        result[0] = 0
        return result
    mask = df.groupby(['ID'])['ID'].transform(mask_first).astype(bool)
    return df.loc[mask]
def using_apply(df):
    return df.groupby('ID').apply(lambda group: group.iloc[1:, 1:])
def using_apply_alt(df):
    return df.groupby('ID', group_keys=False).apply(lambda x: x[1:])
timing = dict()
for func in (using_mask, using_apply, using_apply_alt):
    timing[func] = timeit.timeit(
        '{}(df)'.format(func.__name__),
        'from __main__ import df, {}'.format(func.__name__), number=100)
for func, t in sorted(timing.items(), key=operator.itemgetter(1)):
    print('{:16}: {:.2f}'.format(func.__name__, t))
reports
using_mask : 0.85
using_apply_alt : 2.04
using_apply : 3.70
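One more option, not part of the benchmark above and not timed here, is to number the rows within each group using GroupBy.cumcount and keep everything after position 0. The sketch below uses a shortened, made-up version of the sample data.

```python
import pandas as pd

# Shortened, hypothetical version of the sample data above.
df = pd.DataFrame(
    {"ID": [10001, 10001, 10002, 10002, 10002],
     "PRICE": [14.5, 14.5, 15.125, 14.5, 14.5]})

# cumcount numbers rows within each ID group as 0, 1, 2, ...
position_in_group = df.groupby("ID").cumcount()

# Keep every row that is not the first of its group.
result = df[position_in_group > 0]

print(result.index.tolist())
```

This avoids a Python-level function per group, but whether it is faster on your data is something to measure rather than assume.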
pandas data frame removing the first row of every numbers
Use duplicated with boolean indexing to drop the first row of each number, then remove the # either by position with str[1:] or with str.strip:
print (df)
a
0 #1
1 #2
2 #2
3 #3
4 #3
5 #3
6 #3
7 #4
8 #4
9 #5
df = df.loc[df['a'].duplicated(), 'a'].str[1:]
print (df)
2 2
4 3
5 3
6 3
8 4
Name: a, dtype: object
Or:
df = df.loc[df['a'].duplicated(), 'a'].str.strip('#')
print (df)
2 2
4 3
5 3
6 3
8 4
Name: a, dtype: object
Detail:
print (df['a'].duplicated())
0 False
1 False
2 True
3 False
4 True
5 True
6 True
7 False
8 True
9 False
Name: a, dtype: bool
EDIT:
df = df[df['a'].duplicated()]
df['a'] = df['a'].str.strip('#')