Accessing Y Columns with Duplicated Names in J of X[Y, J] Merges


You can use

DT[DT2, i.y]

If you are surprised that this is not the same output as DT[DT2][, y.1], see this thread:

Can I access repeated column names in `j` in a data.table join?

You can refer to the columns of the data.table in i — that is, DT2's columns — with the `i.` prefix, as follows:

DT1[DT2, list(val = i.value - value)]
#    name val
# 1:    a   1
# 2:    b   1
# 3:    c   1
# 4:    d   1
# 5:    e   1

# Data used
library(data.table)
DT1 <- data.table(name = letters[1:5], value = 2:6)
DT2 <- data.table(name = letters[1:5], value = 3:7)
setkey(DT1, name)

data.table merge on duplicate column name / how to write J corresponding to names in Y?

I would set the key of edges twice, and join twice.

setkey(edges, From)
edges[nodes, FromName := Name]
setkey(edges, To)
edges[nodes, ToName := Name]

## one-liner
setkey(setkey(edges, From)[nodes, FromName := Name], To)[nodes, ToName := Name]

python pandas remove duplicate columns

Here's a one line solution to remove columns based on duplicate column names:

df = df.loc[:,~df.columns.duplicated()].copy()

How it works:

Suppose the columns of the data frame are ['alpha','beta','alpha']

df.columns.duplicated() returns a boolean array: one True or False per column. False means the column name has not appeared earlier; True means it duplicates an earlier column. For the example above, the returned value would be [False, False, True].

Pandas allows indexing with boolean values, selecting only the positions that are True. Since we want to keep the unduplicated columns, the boolean array above needs to be flipped (i.e. ~[False, False, True] gives [True, True, False]).

Finally, df.loc[:, [True, True, False]] selects only the non-duplicated columns using that boolean indexing capability.

The final .copy() copies the data frame, which (mostly) avoids warnings about modifying a view of an existing data frame later down the line.

Note: the above only checks columns names, not column values.
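The steps above can be sketched as a runnable example, using the hypothetical ['alpha', 'beta', 'alpha'] columns from the explanation (the data values are made up):

```python
import pandas as pd

# Frame with a duplicated column name, as in the example above
df = pd.DataFrame([[1, 2, 3], [4, 5, 6]], columns=['alpha', 'beta', 'alpha'])

mask = df.columns.duplicated()     # [False, False, True]
deduped = df.loc[:, ~mask].copy()  # keeps the first 'alpha' and 'beta'

print(list(deduped.columns))       # ['alpha', 'beta']
```

Note that only the first occurrence of each name survives; the second 'alpha' column's values are dropped regardless of their contents.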

To remove duplicated indexes

Since it is similar enough, do the same thing on the index:

df = df.loc[~df.index.duplicated(),:].copy()

To remove duplicates by checking values without transposing

df = df.loc[:,~df.apply(lambda x: x.duplicated(),axis=1).all()].copy()

This avoids the issue of transposing. Is it fast? No. Does it work? Yeah. Here, try it on this:

import numpy as np
import pandas as pd

# create a large(ish) dataframe
ldf = pd.DataFrame(np.random.randint(0, 100, size=(736334, 1312)))

# to see size in gigs
# ldf.memory_usage().sum()/1e9  # it's about 3 gigs

# duplicate a column
ldf.loc[:, 'dup'] = ldf.loc[:, 101]

# take out duplicated columns by values
ldf = ldf.loc[:, ~ldf.apply(lambda x: x.duplicated(), axis=1).all()].copy()
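For a quick check of the mechanics without the large frame, here is a small sketch with made-up data in which column 'c' duplicates column 'a' value-for-value:

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6], 'c': [1, 2, 3]})

# For each row, x.duplicated() flags values that repeat earlier in that
# row; a column flagged in every row duplicates some earlier column.
dup_cols = df.apply(lambda x: x.duplicated(), axis=1).all()

deduped = df.loc[:, ~dup_cols].copy()

print(list(deduped.columns))  # ['a', 'b']
```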

