Extracting specific columns from a pandas DataFrame
import pandas as pd
input_file = "C:\\....\\consumer_complaints.csv"
# read_csv already returns a DataFrame, so no extra pd.DataFrame() call is needed
df = pd.read_csv(input_file)
cols = [1, 2, 3, 4]  # zero-based column positions
df = df[df.columns[cols]]
List the positions of the columns you want to select in cols. Column positions in a DataFrame start at index 0.
cols = []
You can also select columns by name, using the following line:
df = df[["Column Name","Column Name2"]]
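The two selection styles above can be compared on a small frame. This is only a sketch: the data and the column names (id, product, state, zip) are made up for illustration and are not taken from consumer_complaints.csv.

```python
import pandas as pd

# Hypothetical data; the column names are invented for this example.
df = pd.DataFrame({
    "id": [1, 2],
    "product": ["loan", "card"],
    "state": ["CA", "NY"],
    "zip": ["90001", "10001"],
})

cols = [1, 2]                          # zero-based positions
by_position = df[df.columns[cols]]     # select by position
by_name = df[["product", "state"]]     # select the same columns by label

print(list(by_position.columns))
print(by_position.equals(by_name))
```

Both expressions should give the same two columns, so the choice mostly depends on whether positions or names are more stable in your data.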
Extracting specific columns from a data frame
Using the dplyr package, if your data.frame is called df1:
library(dplyr)
df1 %>%
select(A, B, E)
This can also be written without the %>% pipe as:
select(df1, A, B, E)
Extract certain columns from data frame R
base R
newdf <- df[, unique(c("x", names(which(sapply(df, function(z) is.numeric(z) & any(c(1, 3) %in% z)))))), drop = FALSE]
newdf
# x s1 s3
# 1 x1 1 1
# 2 x2 2 1
# 3 x3 1 2
# 4 x4 2 2
# 5 x5 3 1
newdf[-1] <- lapply(newdf[-1], function(z) +(z == 1))
newdf
# x s1 s3
# 1 x1 1 1
# 2 x2 0 1
# 3 x3 1 0
# 4 x4 0 0
# 5 x5 0 1
Walk-through:
First, we determine which columns are numeric and contain the numbers 1 or 3:
sapply(df, function(z) is.numeric(z) & any(c(1, 3) %in% z))
#     x    s1    s2    s3    s4
# FALSE  TRUE FALSE  TRUE FALSE
This will exclude any column that is not numeric, meaning that a character column that contains a literal "1" or "3" will not be retained. This is complete inference on my end; if you want to accept the string versions then remove the is.numeric(z) component.
Second, we extract the names of those that are true, and prepend "x":
c("x", names(which(sapply(df, function(z) is.numeric(z) & any(c(1, 3) %in% z)))))
# [1] "x" "s1" "s3"
Third, wrap that in unique(.) in case, for some reason, "x" is also numeric and contains 1 or 3 (this step is purely defensive, you may not strictly need it).
Fourth, select those columns, defensively adding drop=FALSE so that if only one column is matched, it still returns a full data.frame.
Finally, replace just those columns (excluding the first column, which is "x") with 0 or 1; the z == 1 returns logical, and the wrapping +(..) converts logical to 0 (false) or 1 (true).
dplyr
library(dplyr)
df %>%
select(x, where(~ is.numeric(.) & any(c(1, 3) %in% .))) %>%
mutate(across(-x, ~ +(. == 1)))
# x s1 s3
# 1 x1 1 1
# 2 x2 0 1
# 3 x3 1 0
# 4 x4 0 0
# 5 x5 0 1
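For readers working in pandas rather than R, the same logic might be sketched as follows. This is not part of the original answer. The s1 and s3 values are taken from the output shown above, while the s2 and s4 columns are hypothetical stand-ins, since the original data frame is not shown in full.

```python
import pandas as pd
from pandas.api.types import is_numeric_dtype

df = pd.DataFrame({
    "x": ["x1", "x2", "x3", "x4", "x5"],
    "s1": [1, 2, 1, 2, 3],
    "s2": [2, 2, 2, 2, 2],             # hypothetical: numeric, but no 1 or 3
    "s3": [1, 1, 2, 2, 1],
    "s4": ["1", "3", "1", "3", "1"],   # hypothetical: strings, so excluded
})

# Keep numeric columns (other than "x") that contain a 1 or a 3.
keep = [
    c for c in df.columns
    if c != "x" and is_numeric_dtype(df[c]) and df[c].isin([1, 3]).any()
]

newdf = df[["x"] + keep].copy()
# Convert the kept columns to 1 where the value is 1, and 0 otherwise.
newdf[keep] = (newdf[keep] == 1).astype(int)
print(newdf)
```

As in the R version, dropping the is_numeric_dtype check would let string columns holding "1" or "3" through only if the comparison values were strings too, so that part would need adapting.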
Extracting specific selected columns to new DataFrame as a copy
There is a way of doing this and it actually looks similar to R
new = old[['A', 'C', 'D']].copy()
Here you are just selecting the columns you want from the original data frame and creating a variable for those. If you want to modify the new dataframe at all, you'll probably want to use .copy() to avoid a SettingWithCopyWarning.
An alternative method is to use filter, which will create a copy by default:
new = old.filter(['A','B','D'], axis=1)
Finally, depending on the number of columns in your original dataframe, it might be more succinct to express this using a drop (this will also create a copy by default):
new = old.drop('B', axis=1)
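To see that the three approaches give independent frames, here is a minimal sketch on a made-up four-column frame (the values are illustrative only):

```python
import pandas as pd

old = pd.DataFrame({"A": [1, 2], "B": [3, 4], "C": [5, 6], "D": [7, 8]})

new_copy = old[["A", "C", "D"]].copy()             # explicit copy
new_filter = old.filter(["A", "C", "D"], axis=1)   # keep the listed columns
new_drop = old.drop("B", axis=1)                   # drop the unwanted column

# Modifying the copy should leave the original untouched.
new_copy.loc[0, "A"] = 100
print(old.loc[0, "A"], new_copy.loc[0, "A"])
```

All three results here contain columns A, C and D; drop is the shorter spelling when you want to keep most of the columns.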
In R: Extract specific columns from data frame by date and keep basic columns at beginning
You can use split.default to split the columns by year, and then use lapply to cbind the first four columns onto each element of the resulting list.
result <- lapply(split.default(df[-(1:4)],
format(as.Date(names(df)[-(1:4)], 'X%Y.%m.%d'), '%Y')),
function(x) cbind(df[1:4], x))
R discourages column names that start with a number, so if you read the data with default options it will change a column name from 2007-01-07 to X2007.01.07. With that in mind I have used 'X%Y.%m.%d' in as.Date. If you have somehow managed to read the column names as you have shown them, i.e. 2007-01-07, use %Y-%m-%d in as.Date.
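The same idea could be sketched in pandas, for readers who are not using R. This is not from the original answer: the frame below is hypothetical, with four made-up leading columns followed by columns named as %Y-%m-%d dates.

```python
import pandas as pd

# Hypothetical frame: four base columns, then date-named columns.
df = pd.DataFrame({
    "id": [1, 2], "name": ["a", "b"], "grp": ["g", "h"], "unit": ["u", "v"],
    "2007-01-07": [1, 2], "2007-06-10": [3, 4], "2008-01-06": [5, 6],
})

base = list(df.columns[:4])
date_cols = list(df.columns[4:])
# Parse the column names as dates and take the year of each.
years = pd.to_datetime(date_cols, format="%Y-%m-%d").year

# One frame per year: the base columns plus that year's date columns.
result = {
    str(year): df[base + [c for c, y in zip(date_cols, years) if y == year]]
    for year in sorted(set(years))
}
print(list(result))
```

Each value in result plays the role of one element of the R list, keyed by year.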
Selecting multiple columns in a Pandas dataframe
The column names (which are strings) cannot be sliced in the manner you tried.
Here you have a couple of options. If you know from context which variables you want to slice out, you can select only those columns by passing a list into the __getitem__ syntax (the []'s).
df1 = df[['a', 'b']]
Alternatively, if it matters to index them numerically and not by their name (say your code should automatically do this without knowing the names of the first two columns) then you can do this instead:
df1 = df.iloc[:, 0:2] # Remember that Python does not slice inclusive of the ending index.
Additionally, you should familiarize yourself with the idea of a view into a pandas object vs. a copy of that object. The first of the above methods will return a new copy in memory of the desired sub-object (the desired slices).
Sometimes, however, there are indexing conventions in pandas that don't do this and instead give you a new variable that just refers to the same chunk of memory as the sub-object or slice in the original object. This can happen with the second way of indexing, so you can add the .copy() method to get a regular copy. When it does happen, changing what you think is the sliced object can sometimes alter the original object. It is always good to be on the lookout for this.
df1 = df.iloc[:, 0:2].copy() # To avoid the case where changing df1 also changes df
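A small sketch of selecting the first two columns by position and copying the result, using a made-up three-column frame, shows that edits to the copy do not reach the original:

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6], "c": [7, 8, 9]})

# All rows, columns at positions 0 and 1; position 2 is excluded.
df1 = df.iloc[:, 0:2].copy()

df1.iloc[0, 0] = -1   # change the copy only
print(df.iloc[0, 0], df1.iloc[0, 0])
```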
To use iloc, you need to know the column positions (or indices). As the column positions may change, instead of hard-coding indices you can use iloc along with the get_loc function of the DataFrame's columns attribute to obtain column indices.
{c: df.columns.get_loc(c) for c in df.columns}
Now you can use this dictionary to access columns through names while using iloc.
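As a rough illustration, a name-to-position lookup built with get_loc can feed iloc like this. The frame and the names positions and wanted are hypothetical, chosen only for the example.

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2], "b": [3, 4], "c": [5, 6]})

# Map each column name to its current position.
positions = {name: df.columns.get_loc(name) for name in df.columns}

wanted = ["a", "c"]   # hypothetical list of columns to pull out
subset = df.iloc[:, [positions[name] for name in wanted]]
print(list(subset.columns))
```

This assumes the column names are unique; with duplicate names, get_loc does not return a single integer.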