How to Ignore First Two Columns of CSV File

I'm reading a csv file and want to skip the first two columns.

You should use the csv module in the standard library. You might need to pass additional kwargs (keyword arguments) depending on the format of your csv file.

import csv

with open('my_csv_file', 'r') as fin:
    reader = csv.reader(fin)
    for line in reader:
        print(line[2:])
        # do something with the rest of the columns...
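For example, if your file were semicolon-delimited with quoted fields (a made-up format, just for illustration), the extra kwargs might look like this:

import csv

# Sketch assuming a hypothetical semicolon-delimited file with quoted fields;
# adjust delimiter/quotechar to whatever your file actually uses.
with open('my_csv_file', 'r', newline='') as fin:
    reader = csv.reader(fin, delimiter=';', quotechar='"')
    for line in reader:
        print(line[2:])  # still skipping the first two columns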

Read multiple csv files (and skip 2 columns in each csv file) into one dataframe in R?

Using the data.table package functions fread() and rbindlist() will get you the result you're after, typically faster than the base R or tidyverse alternatives.

library(data.table)

## Create a list of the files
FileList <- list.files(pattern = "\\.csv$")

## Pre-allocate a list to store all of the results of reading
## so that we aren't re-copying the list for each iteration
DTList <- vector(mode = "list", length = length(FileList))

## Read in all the files, excluding the first two columns
for (i in seq_along(DTList)) {
  DTList[[i]] <- data.table::fread(FileList[[i]], drop = c(1, 2))
}

## Combine the results into a single data.table
DT <- data.table::rbindlist(DTList)

## Optionally, convert the data.table to a data.frame to match requested result
## Though I would recommend looking into using data.table instead!
data.table::setDF(DT)

Skip the first column when reading a csv file in Python

OK, so removing the data keyword (or whatever the keyword is) can be done with a regular expression (which is not really the scope of the question, but meh...)

About the regular expression:

Let's imagine your keyword is data, right? You can use this: (?:data)*\W*(?P<juicy_data>\w+)\W*(?:data)* If your keyword is something else, just change the two data strings in that regular expression to whatever value the keyword contains.

You can test regular expressions online at www.pythonregex.com or www.debuggex.com

The regular expression basically says: look for zero or more data strings, but if you find any, don't do anything with them; don't add them to the list of matched groups, don't show them... just match them and discard them. After that, look for zero or more non-word characters (anything that is not a letter or a number), in case something like a data:, a comma, or a data--> sits in front of the value; that \W consumes the non-alphanumeric characters that come after data. Then you get to your juicy_data: one or more of the characters found in "regular" words (any alphanumeric character). Finally, in case there's a trailing data behind it, do the same as with the first data group: match it and discard it.
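To see the pattern in isolation before wiring it into the CSV code, here is a minimal sketch (the sample cells are made up for illustration):

import re

# The keyword-stripping pattern from above, with the keyword 'data'
kwd_remover = re.compile(r'(?:data)*\W*(?P<juicy_data>\w+)\W*(?:data)*')

for cell in ['data: abc', 'vf data', 'plain']:
    print(kwd_remover.findall(cell)[0])
# abc
# vf
# plain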

Now, to skip the first column: you can use the fact that a csv.reader is itself an iterator. When you iterate over it (as the code below does), it gives you a list containing all the columns found in one row. Since each row comes back as a list, skipping the first column is just a matter of indexing: take row[1:] if you want everything except the first column, or, as the code below does, pick out row[1] because that is the only column you care about.

So here it goes:

import csv
import re

def get_values_flexibly(csv_path, keyword):
    # Pattern that strips the keyword (and any surrounding punctuation)
    # from a cell, capturing what's left as 'juicy_data'
    kwd_remover = re.compile(
        r'(?:{kw})*\W*(?P<juicy_data>\w+)\W*(?:{kw})*'.format(kw=keyword)
    )
    result = []
    with open(csv_path, 'r', newline='') as f:
        reader = csv.reader(f)
        first_row = [kwd_remover.findall(cell)[0] for cell in next(reader)]
        print("Cleaned first_row: %s" % first_row)
        for row in reader:
            print("Before cleaning: %s" % row)
            cleaned_row = [kwd_remover.findall(cell)[0] for cell in row]
            result.append(cleaned_row[1])  # keep only the second column
            print("After cleaning: %s" % cleaned_row)
    return result

print("Result: %s" % get_values_flexibly("sample.csv", 'data'))

Outputs:

Cleaned first_row: ['h1', 'h2', 'h3']
Before cleaning: ['a data', 'data: abc', 'tr']
After cleaning: ['a', 'abc', 'tr']
Before cleaning: ['b data', 'vf data', ' gh']
After cleaning: ['b', 'vf', 'gh']
Before cleaning: ['k data', 'grt data', ' ph']
After cleaning: ['k', 'grt', 'ph']
Result: ['abc', 'vf', 'grt']
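If the goal is simply to drop the first column and keep the rest, without the keyword cleanup, the iterator idea reduces to a few lines; a minimal sketch, assuming a plain comma-separated sample.csv:

import csv

# Minimal version: drop the first column, keep everything else as-is
with open('sample.csv', 'r', newline='') as f:
    reader = csv.reader(f)
    rows = [row[1:] for row in reader]

print(rows)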

Ignore the first space in CSV

Ideally you should parse the first two parts as a single datetime. Using a space as the delimiter implies the header has three columns, but the space after the date in each row is then seen as introducing an extra column.

A workaround is to skip the header entirely and supply your own column names. The parse_dates parameter can be used to tell Pandas to parse the first two columns as a single combined datetime object.

For example:

import pandas as pd

points = pd.read_csv("test.csv", delimiter=" ",
                     skipinitialspace=True, skiprows=1, index_col=None,
                     parse_dates=[[0, 1]], names=["Date", "Time", "Latitude", "Longitude"])

print(points)

Should give you the following dataframe:

            Date_Time  Latitude  Longitude
0 2021-09-12 23:13:00     44.63     -63.56
1 2021-09-14 23:13:00     43.78     -62.00
2 2021-09-16 23:14:00     44.83     -54.60
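If your pandas version warns about the nested-list form of parse_dates (it has been deprecated in recent releases), the same result can be had by combining the two columns yourself; a minimal sketch, assuming the same test.csv layout as above:

import pandas as pd

# Read the columns as plain strings, then build the combined datetime manually
points = pd.read_csv("test.csv", delimiter=" ", skipinitialspace=True,
                     skiprows=1, names=["Date", "Time", "Latitude", "Longitude"])
points["Date_Time"] = pd.to_datetime(points["Date"] + " " + points["Time"])
points = points.drop(columns=["Date", "Time"])

print(points)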

How to read a CSV without the first column

You can specify a converter for any column.

import numpy as np

converters = {0: lambda s: float(s.strip('"'))}
data = np.loadtxt("Data/sim.csv", delimiter=',', skiprows=1, converters=converters)

Or, you can specify which columns to use, something like:

data = np.loadtxt("Data/sim.csv", delimiter=',', skiprows=1, usecols=range(1,15))

http://docs.scipy.org/doc/numpy/reference/generated/numpy.loadtxt.html


One way you can skip the first column, without knowing the number of columns, is to read the number of columns from the csv manually. It's easy enough, although you may need to tweak this on occasion to account for formatting inconsistencies*.

with open("Data/sim.csv") as f:
    ncols = len(f.readline().split(','))

data = np.loadtxt("Data/sim.csv", delimiter=',', skiprows=1, usecols=range(1, ncols))

*If there are blank lines at the top, you'll need to skip them. If there may be commas in the field headers, you should count columns using the first data line instead. So, if you have specific problems, I can add some details to make the code more robust.
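For instance, here is a minimal sketch that counts columns from the first data line instead of the header (assuming, as above, that the header occupies exactly the first line):

import numpy as np

with open("Data/sim.csv") as f:
    f.readline()  # skip the header (assumed to be the first line)
    # Count columns from the first data line instead, in case the
    # header contains commas inside quoted field names
    ncols = len(f.readline().split(','))

data = np.loadtxt("Data/sim.csv", delimiter=',', skiprows=1,
                  usecols=range(1, ncols))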


