Only Read Selected Columns

Suppose the data are in the file data.txt. You can use the colClasses argument of read.table() to skip columns. Here the first 7 columns are read as "integer" and the remaining 6 columns are set to "NULL", which tells read.table() to skip them:

> read.table("data.txt", colClasses = c(rep("integer", 7), rep("NULL", 6)), 
+ header = TRUE)
Year Jan Feb Mar Apr May Jun
1 2009 -41 -27 -25 -31 -31 -39
2 2010 -41 -27 -25 -31 -31 -39
3 2011 -21 -27 -2 -6 -10 -32

Change "integer" to one of the accepted types as detailed in ?read.table depending on the real type of data.

data.txt looks like this:

$ cat data.txt 
"Year" "Jan" "Feb" "Mar" "Apr" "May" "Jun" "Jul" "Aug" "Sep" "Oct" "Nov" "Dec"
2009 -41 -27 -25 -31 -31 -39 -25 -15 -30 -27 -21 -25
2010 -41 -27 -25 -31 -31 -39 -25 -15 -30 -27 -21 -25
2011 -21 -27 -2 -6 -10 -32 -13 -12 -27 -30 -38 -29

and was created by using

write.table(dat, file = "data.txt", row.names = FALSE)

where dat is

dat <- structure(list(Year = 2009:2011, Jan = c(-41L, -41L, -21L), Feb = c(-27L, 
-27L, -27L), Mar = c(-25L, -25L, -2L), Apr = c(-31L, -31L, -6L
), May = c(-31L, -31L, -10L), Jun = c(-39L, -39L, -32L), Jul = c(-25L,
-25L, -13L), Aug = c(-15L, -15L, -12L), Sep = c(-30L, -30L, -27L
), Oct = c(-27L, -27L, -30L), Nov = c(-21L, -21L, -38L), Dec = c(-25L,
-25L, -29L)), .Names = c("Year", "Jan", "Feb", "Mar", "Apr",
"May", "Jun", "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"), class = "data.frame",
row.names = c(NA, -3L))

If the number of columns is not known beforehand, the utility function count.fields() will read through the file and count the number of fields in each line. Its sep argument must match the file's delimiter; data.txt above is whitespace-separated, so the default sep = "" applies.

## returns a vector with one element per line, giving that line's field count
count.fields("data.txt")
## returns the maximum, which gives the length needed for colClasses
max(count.fields("data.txt"))
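
For readers working in Python, the same idea can be sketched with the standard csv module. This is only a rough analogue of count.fields(), not a port of it: the helper name count_fields and the small sample file are made up for illustration, and the sample only mimics the first four columns of data.txt.

```python
import csv
import os
import tempfile


def count_fields(path, delimiter=" "):
    """Return the number of fields on each line of a delimited text file."""
    with open(path, newline="") as handle:
        reader = csv.reader(handle, delimiter=delimiter, skipinitialspace=True)
        return [len(row) for row in reader]


# Hypothetical sample: space-separated, quoted header, like data.txt above.
sample = '"Year" "Jan" "Feb" "Mar"\n2009 -41 -27 -25\n2010 -41 -27 -25\n'
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as tmp:
    tmp.write(sample)
    sample_path = tmp.name

counts = count_fields(sample_path)  # one entry per line of the file
max_fields = max(counts)            # widest line, i.e. the column count
os.remove(sample_path)
print(counts, max_fields)
```

As with count.fields(), the delimiter has to match the file, otherwise every line is reported as a single field.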

Read specific columns with pandas or other Python module

An easy way to do this is using the pandas library like this.

import pandas as pd
fields = ['star_name', 'ra']

df = pd.read_csv('data.csv', skipinitialspace=True, usecols=fields)
# See the keys
print(df.keys())
# See content in 'star_name'
print(df.star_name)

The key here is skipinitialspace=True, which strips the spaces that follow each delimiter, including those in the header, so ' star_name' becomes 'star_name'.
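
To see the effect on the header, here is a small sketch that reads the same text with and without skipinitialspace. The inline data (an id column plus two stars) is invented for the example; only the column names star_name and ra come from the answer above.

```python
import io

import pandas as pd

# Hypothetical file contents: every comma is followed by a space.
raw = "id, star_name, ra\n1, Vega, 279.23\n2, Sirius, 101.29\n"

# Without skipinitialspace the header names keep their leading space.
plain = pd.read_csv(io.StringIO(raw))
plain_columns = list(plain.columns)

# With skipinitialspace=True the names are clean, so usecols can match them.
df = pd.read_csv(
    io.StringIO(raw), skipinitialspace=True, usecols=["star_name", "ra"]
)
clean_columns = list(df.columns)

print(plain_columns)
print(clean_columns)
```

Without the option, usecols=['star_name', 'ra'] would not match the actual names ' star_name' and ' ra'.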

How to skip reading certain columns in readr

There is an answer out there, I just didn't search hard enough:
https://github.com/hadley/readr/issues/132

Apparently this was a documentation issue that has been corrected. This functionality may eventually get added but Hadley thought it was more useful to be able to just update one column type and not drop the others.

Update: The functionality has been added

The following code is from the readr documentation:

read_csv("iris.csv", col_types = cols_only( Species = col_factor(c("setosa", "versicolor", "virginica"))))

This will read only the Species column of the iris data set. To read only specific columns you must also pass a column specification for each one, e.g. col_factor(), col_double(), etc.
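
For comparison, a roughly similar "only this column, with this type" read can be sketched in pandas by combining usecols with dtype. This is an analogue, not an equivalent of cols_only(); the inline data below is a made-up three-row stand-in for iris.csv.

```python
import io

import pandas as pd

# Hypothetical stand-in for iris.csv with two columns.
raw = "Sepal.Length,Species\n5.1,setosa\n7.0,versicolor\n6.3,virginica\n"

# Fixed set of levels, similar in spirit to col_factor(c(...)).
species_type = pd.CategoricalDtype(["setosa", "versicolor", "virginica"])

# usecols restricts the read to Species; dtype sets its type.
df = pd.read_csv(
    io.StringIO(raw), usecols=["Species"], dtype={"Species": species_type}
)
levels = list(df["Species"].cat.categories)
print(df.dtypes)
```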

How to select specific columns from read_csv which start with specific word?

Pass a callable to usecols. pandas calls it with each column name and keeps the columns for which it returns True, so the file only needs to be read once:

df = pd.read_csv('data.csv', usecols=lambda col: col.startswith('A_') or col.startswith('X_'))
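
If you would rather build an explicit list of column names first, a two-pass variant is also possible: read only the header, pick the names, then read the data. A minimal sketch, assuming made-up column names A_1, A_2, B_1 and X_1 and an in-memory string in place of data.csv:

```python
import io

import pandas as pd

# Hypothetical contents standing in for data.csv.
raw = "A_1,A_2,B_1,X_1\n1,2,3,4\n5,6,7,8\n"

# First pass: nrows=0 parses only the header line.
header = pd.read_csv(io.StringIO(raw), nrows=0).columns
wanted = [col for col in header if col.startswith(("A_", "X_"))]

# Second pass: read the data, restricted to the chosen columns.
df = pd.read_csv(io.StringIO(raw), usecols=wanted)
selected = list(df.columns)
print(selected)
```

Note that str.startswith() accepts a tuple of prefixes, which avoids chaining several or conditions.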

How to read specific columns from multiple CSV files, and skip columns that do not exist in some of the files using Python Pandas

You could read only the column names from each CSV file and check them against your desired columns, as follows:

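import csv

import pandas as pd

desired_col = ["user_id", "event_type"]  # I selected only two values

# csv_files is assumed to be a list of CSV file paths defined elsewhere
for file_name in csv_files:
    # read only the header row to get the CSV column names
    with open(file_name, newline="") as f:
        csv_cols = next(csv.reader(f))
    # keep the desired columns that exist in this file
    cols = [col for col in desired_col if col in csv_cols]
    df = pd.read_csv(file_name, usecols=cols)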

Then, each time you read a new CSV file, you first read its column names and check desired_col against csv_cols.
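
Here is a self-contained sketch of that loop which also combines the results. The two file names, their contents and the helper available_columns are invented for the example, and using pd.concat to stack the frames is my assumption about what you want to do with them afterwards.

```python
import csv
import os
import tempfile

import pandas as pd

desired_cols = ["user_id", "event_type"]


def available_columns(path, wanted):
    """Return the wanted column names that appear in the file's header row."""
    with open(path, newline="") as handle:
        header = next(csv.reader(handle))
    return [col for col in wanted if col in header]


with tempfile.TemporaryDirectory() as tmp_dir:
    # Hypothetical files: b.csv has no event_type column.
    contents = {
        "a.csv": "user_id,event_type,ts\n1,click,10\n2,view,11\n",
        "b.csv": "user_id,ts\n3,12\n",
    }
    frames = []
    for name, text in contents.items():
        path = os.path.join(tmp_dir, name)
        with open(path, "w", newline="") as handle:
            handle.write(text)
        cols = available_columns(path, desired_cols)
        frames.append(pd.read_csv(path, usecols=cols))

# Columns missing from a file show up as NaN after concatenation.
combined = pd.concat(frames, ignore_index=True)
print(combined)
```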

How to plot and symbolize only selected columns from a CSV in d3?

Map the data to filter out columns not included in keys:

d3.csv("ratings.csv").then(data => {
  const keys = ['date', 'Dixit', 'Dominion'];
  const filteredData = data.map(item =>
    keys.reduce((obj, key) => ({...obj, [key]: item[key]}), {}));
  ...
});
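
The same key-filtering step, written in Python for comparison with the other snippets on this page. It is a sketch only: the sample rows and the extra Catan column are invented to show a column being dropped.

```python
import csv
import io

# Hypothetical stand-in for ratings.csv with one column we do not want.
raw = (
    "date,Dixit,Dominion,Catan\n"
    "2019-01-01,7.5,8.0,6.5\n"
    "2019-02-01,7.0,8.5,6.0\n"
)
keys = ["date", "Dixit", "Dominion"]

# Keep only the listed keys from every parsed row.
filtered = [
    {key: row[key] for key in keys}
    for row in csv.DictReader(io.StringIO(raw))
]
print(filtered[0])
```

As in the d3 version, values stay as strings until you convert them.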

