Exclude First Row When Importing Data from Excel into Python

Exclude first row when importing data from excel into Python

The pandas documentation for the pd.read_excel method mentions a skiprows parameter that you can use to exclude the first row of your excel file.

Example

import pandas as pd
data = pd.read_excel("file.xlsx", parse_cols="A,C,E,G", skiprows=[0])

Source: pandas docs

How to read a excel file without taking its first row as header ? Pandas, Python

The file can be read using the file name as string or an open file object:

pd.read_excel('test.xlsx', index_col=0)

if you want to read particular sheet.

pd.read_excel(open('test.xlsx', 'rb'),sheet_name='Sheet3')

Index and header can be specified via the index_col and header arguments

pd.read_excel('test.xlsx', index_col=None, header=None)

Exclude Excel rows while importing to Python based on keyword

It will be easier and probably faster to perform the filtering as a post processing step because if you decide to filter whilst reading then you are iteratively growing the dataframe which is not efficient.

so you should use the following code after your for loop

drop_list = ['Category','Owner']
df=df[~df.isin(drop_list)]
df = df.dropna()
df.reset_index(drop=True,inplace=True)

df.to_excel("CombinedTest.xlsx")

I hope this helps you out :)

Skipping range of rows after header through pandas.read_excel

As per the documentation for pandas.read_excel, skiprows must be list-like.

Try this instead to exclude rows 1 to 336 inclusive:

df = pd.read_excel("file.xlsx",
sheet_name = "Sheet1",
skiprows = range(1, 337),
usecols = "H:BD")

Note: range constructor is considered list-like for this purpose, so no explicit list conversion is necessary.

How to delete first row in excel with openpyxl

Untested but the following tweak should fix it. Instead of reading 'A1' and removing row 0 you look for the first non empty row using min_row and read and remove that.

import openpyxl

filename = "example.xlsx"
wb = openpyxl.load_workbook(filename)
sheet = wb['Sheet1']
status = sheet.cell(sheet.min_row, 1).value
print(status)
sheet.delete_rows(sheet.min_row, 1)
wb.save(filename)

How to skip first rows and all empty rows after that in specific column?

For dropping the first 21 rows - you can do this

frames = [df.iloc[21:, :] for df in frames]

And to drop all the NaN values from column E - you can do this

combined.dropna(subset=["E"], inplace=True)

Your final code will look something like this -

import pandas as pd
import glob

#all files in directory (NOT SURE IF I CAN OPTIMIZE THE CODE WITH THIS)
#AM NOT USING THIS LINE AT THE MOMENT
#excel_names = glob.glob('*JAN_2019-jan.xlsx')

# filenames
excel_names = ["file1.xlsx", "file2.xlsx", "file3.xlsx"]

# read them in
excels = [pd.ExcelFile(name) for name in excel_names]

# turn them into dataframes
frames = [x.parse(x.sheet_names[0], header=None,index_col=None) for x in
excels]

# delete the first row for all frames except the first (NOT WORKING)
# i.e. remove the header row -- assumes it's the first (NOT WORKING)
frames = [df.iloc[21:, :] for df in frames]

# concatenate them..
combined = pd.concat(frames)
combined.dropna(subset=["E"], inplace=True)

# write it out
combined.to_excel("c.xlsx", header=False, index=False)

To drop the first 21 rows from all dataframes except the first one - you can do this -

frames_2 = [df.iloc[21:, :] for df in frames[1:]]
#And combine them separately
combined = pd.concat([frames[0], *frames_2])

To exclude the character "-" from your dataframe -

combined = combined[~combined['E'].isin(['-'])]

Python - pandas xls import - difficulties removing certain row +

Looking at the screenshot of your inputted excel file along with the printed dataframe, the problem you're encountering is likely due to the Merged Cells you have in the second and third rows.

I recommend making use of some of the parameters for pandas.DataFrame.to_excel that are outlined in docs (Link Here). In particular, header and skiprows should help you.

I've provided an example below, in which I create an excel file (.xlsx) that replicates the issue you have with the merged cells. Then I copy the .xlsx to be a .xls and read it using pandas.DataFrame.to_excel with header and skiprows spelled out.

import pandas as pd
import numpy as np
import shutil

# Creating a dataframe and saving as test.xlsx in current directory
df = pd.DataFrame(np.random.randn(10, 3), columns=list('ABC'))
writer = pd.ExcelWriter('test.xlsx', engine='xlsxwriter')
df.to_excel(writer, sheet_name='Sheet1', startrow=3, index=False,
header=False)
wb = writer.book
ws = writer.sheets['Sheet1']

ws.merge_range('A1:C1', 'Large Merged Cell in first Row')
ws.merge_range('A2:A3', 'A')
ws.merge_range('B2:B3', 'B')
ws.merge_range('C2:C3', 'C')

wb.close()

print(df)
#copying test.xlsx as a .xls file
shutil.copy(r"test.xlsx" , r"test.xls")

new_df = pd.read_excel('test.xls', header = 0, skiprows = [0,2])
print(new_df)

Expected test.xls file:
Expected test.xls

print(new_df) should show:

          A         B         C
0 1.242498 0.512675 -1.370710
1 0.060366 -0.467702 -1.420735
2 -0.198547 0.042364 0.915423
3 0.340909 0.749019 0.272871
4 2.633348 -1.343251 -0.248733
5 0.892257 0.371924 0.023415
6 -0.809030 -0.633796 0.449373
7 0.322960 2.073352 1.362657
8 -0.848093 1.848489 0.813144
9 2.718069 -0.540174 1.411980

how to remove header from excel file using python

Just add skiprows argument while reading excel.

import pandas as pd
import os
def remove_header():
file_name = "AV Clients.xlsx"
os.chmod(file_name, 0o777)
df = pd.read_excel(file_name, skiprows = 1) #Read Excel file as a DataFrame
print(df)
df.to_excel("AV_Clients1.xlsx", index=False)

remove_header()

Skip first line in import statement using gc.open_by_url from gspread (i.e. add header=0)

Looking at the API documentation, you probably want to use:

df = pd.DataFrame(g_sheets.get_worksheet(0).get_all_records(head=1))

The .get_all_records method returns a dictionary of with the column headers as the keys and a list of column values as the dictionary values. The argument head=<int> determines which row to use as keys; rows start from 1 and follow the numeration of the spreadsheet.

Since the values returned by .get_all_records() are lists of strings, the data frame constructor, pd.DataFrame, will return a data frame that is all strings. To convert it to floats, we need to replace the empty strings, and the the dash-only strings ('-') with NA-type values, then convert to float.

Luckily pandas DataFrame has a convenient method for replacing values .replace. We can pass it mapping from the string we want as NAs to None, which gets converted to NaN.

import pandas as pd

data = g_sheets.get_worksheet(0).get_all_records(head=1)

na_strings_map= {
'-': None,
'': None
}

df = pd.DataFrame(data).replace(na_strings_map).astype(float)


Related Topics



Leave a reply



Submit