How to Delete a Column by Name in data.table

How do you delete a column by name in data.table?

Any of the following will remove column foo from the data.table df3:

# Method 1 (and preferred as it takes 0.00s even on a 20GB data.table)
df3[,foo:=NULL]

df3[, c("foo","bar"):=NULL] # remove two columns

myVar = "foo"
df3[, (myVar):=NULL] # lookup myVar contents

# Method 2a -- A safe idiom for excluding (possibly multiple)
# columns matching a regex
df3[, grep("^foo$", colnames(df3)):=NULL]

# Method 2b -- An alternative to 2a, also "safe" in the sense described below
df3[, which(grepl("^foo$", colnames(df3))):=NULL]

data.table also supports the following syntax:

## Method 3 (you could then assign the result back to df3)
df3[, !"foo"]

though if you were actually wanting to remove column "foo" from df3 (as opposed to just printing a view of df3 minus column "foo") you'd really want to use Method 1 instead.
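To make that distinction concrete, here is a minimal sketch, assuming the data.table package is installed. The table contents and the names without_foo, names_before and names_after are made up for illustration.

```r
library(data.table)

# Hypothetical example table; only the column names matter here
df3 <- data.table(foo = 1:3, bar = 4:6, baz = 7:9)

# Method 3 returns a new data.table without foo; df3 itself is untouched
without_foo <- df3[, !"foo"]
names_before <- copy(names(df3))   # copy() guards against by-reference changes

# Method 1 deletes foo from df3 by reference, with no reassignment needed
df3[, foo := NULL]
names_after <- names(df3)

print(names(without_foo))  # "bar" "baz"
print(names_before)        # "foo" "bar" "baz"
print(names_after)         # "bar" "baz"
```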

(Do note that if you use a method relying on grep() or grepl(), you need to set pattern="^foo$" rather than "foo", if you don't want columns with names like "fool" and "buffoon" (i.e. those containing foo as a substring) to also be matched and removed.)
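The effect of anchoring the pattern can be checked on a plain character vector, without any data.table at all. A small sketch in base R, with hypothetical column names:

```r
# Hypothetical column names, chosen to show substring matches
cols <- c("foo", "fool", "buffoon", "bar")

unanchored <- grep("foo", cols, value = TRUE)    # any name containing "foo"
anchored   <- grep("^foo$", cols, value = TRUE)  # only the exact name "foo"

print(unanchored)  # "foo" "fool" "buffoon"
print(anchored)    # "foo"
```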

Less safe options, fine for interactive use:

The idioms below will also work, provided df3 contains a column matching "foo", but will fail in a probably-unexpected way if it does not. If, for instance, you use any of them to search for the non-existent column "bar", you'll end up with a zero-row data.table.

As a consequence, they are really best suited for interactive use where one might, e.g., want to display a data.table minus any columns with names containing the substring "foo". For programming purposes (or if you are wanting to actually remove the column(s) from df3 rather than from a copy of it), Methods 1, 2a, and 2b are really the best options.

# Method 4:
df3[, .SD, .SDcols = !patterns("^foo$")]

Lastly there are approaches using with=FALSE, though data.table is gradually moving away from using this argument so it's now discouraged where you can avoid it; showing here so you know the option exists in case you really do need it:

# Method 5a (like Method 3)
df3[, !"foo", with=FALSE]
# Method 5b (like Method 4)
df3[, !grep("^foo$", names(df3)), with=FALSE]
# Method 5c (another like Method 4)
df3[, !grepl("^foo$", names(df3)), with=FALSE]

Drop data frame columns by name

There's also the subset command, useful if you know which columns you want:

df <- data.frame(a = 1:10, b = 2:11, c = 3:12)
df <- subset(df, select = c(a, c))

UPDATED after comment by @hadley: To drop columns a,c you could do:

df <- subset(df, select = -c(a, c))

Remove Column from Data Table C#

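DataTable t;                      // an existing, populated DataTable
t.Columns.Remove("columnName");   // remove a column by name
t.Columns.RemoveAt(columnIndex);  // remove a column by zero-based index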

Remove columns from DataTable which are not in List string

If you just want to remove all the columns not found in list 'A'

var A = new List<string> { "a", "b", "c" };
var toRemove = dt.Columns.Cast<DataColumn>().Select(x => x.ColumnName).Except(A).ToList();

foreach (var col in toRemove) dt.Columns.Remove(col);

Keeping Data Columns by name and removing the others within a DataTable

You can't modify a collection while you are enumerating it. Instead, change your code to a standard for loop that indexes backwards, like this:

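for (int x = dataTable.Columns.Count - 1; x >= 0; x--)
{
    DataColumn dc = dataTable.Columns[x];
    // Keep only the four named columns; everything else is removed
    if (dc.ColumnName != "Cat" && dc.ColumnName != "Dog" &&
        dc.ColumnName != "Turtle" && dc.ColumnName != "Lion")
    {
        // Remove through the table's column collection, not the column itself
        dataTable.Columns.Remove(dc);
    }
}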

Looping in reverse is required to avoid skipping columns when you remove an item from the collection. Also, as explained in the comment below, you need the && logical operator to remove ALL the columns whose name is not one of the four you want to preserve. Using the || logical operator would remove all of your columns: the column named "Lion", for example, would be removed because its name is not "Cat" (or any of the other names in the if condition).

You can also use a DataView to extract only the columns you need, but this has the drawback of requiring a second DataTable in memory, which could cause problems if your data set is really big.

DataTable datatable = CSVReader.CSVInput(filepath);
DataView dv = new DataView(datatable);
DataTable newTable = dv.ToTable(false, new string[] {"cat", "dog", "turtle", "lion"});

Delete multiple columns by reference using reverse selection in data.table

We can use setdiff to get the column names of the dataset that are not in list_to_keep, and assign (:=) those columns to NULL

df[, setdiff(names(df), list_to_keep) := NULL]

As @rosscova mentioned, you can also apply which to the logical vector to get the positions of the unwanted columns and assign those columns to NULL

df[, which(!names(df)%in%list_to_keep):=NULL] 
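As a runnable illustration of the setdiff form, here is a minimal sketch, assuming the data.table package is installed. The table and the contents of list_to_keep are invented for the example.

```r
library(data.table)

# Hypothetical table and keep-list
df <- data.table(a = 1:3, b = 4:6, c = 7:9, d = 10:12)
list_to_keep <- c("a", "c")

# Drop every column whose name is not in list_to_keep, by reference
df[, setdiff(names(df), list_to_keep) := NULL]

print(names(df))  # "a" "c"
```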

How to drop columns by name in a data frame

You should use either indexing or the subset function. For example:

R> df <- data.frame(x=1:5, y=2:6, z=3:7, u=4:8)
R> df
  x y z u
1 1 2 3 4
2 2 3 4 5
3 3 4 5 6
4 4 5 6 7
5 5 6 7 8

Then you can use the which function and the - operator in column indexing:

R> df[ , -which(names(df) %in% c("z","u"))]
  x y
1 1 2
2 2 3
3 3 4
4 4 5
5 5 6

Or, much simpler, use the select argument of the subset function: you can then use the - operator directly on a vector of column names, and you can even omit the quotes around the names!

R> subset(df, select=-c(z,u))
  x y
1 1 2
2 2 3
3 3 4
4 4 5
5 5 6
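One caveat with the -which() form shown earlier: when none of the names match, which() returns an empty vector, and negating an empty index selects no columns rather than all of them. A short base R sketch of that edge case, with a hypothetical to_drop vector, alongside a logical-index alternative:

```r
df <- data.frame(x = 1:5, y = 2:6, z = 3:7, u = 4:8)
to_drop <- c("no_such_column")   # hypothetical: nothing matches

# which() returns integer(0), and -integer(0) is still integer(0),
# so this selects zero columns instead of dropping none
n_which <- ncol(df[, -which(names(df) %in% to_drop)])

# A logical index keeps all four columns when nothing matches
n_logical <- ncol(df[, !(names(df) %in% to_drop), drop = FALSE])

print(n_which)    # 0
print(n_logical)  # 4
```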

Note that you can also select the columns you want instead of dropping the others:

R> df[ , c("x","y")]
  x y
1 1 2
2 2 3
3 3 4
4 4 5
5 5 6

R> subset(df, select=c(x,y))
  x y
1 1 2
2 2 3
3 3 4
4 4 5
5 5 6

Remove columns from DataTable in C#

Aside from limiting the columns you select in the first place (which reduces bandwidth and memory use), you can remove columns from an existing DataTable:

DataTable t;
t.Columns.Remove("columnName");
t.Columns.RemoveAt(columnIndex);

