accessing Y columns with duplicated names in j of X[Y, j] merges
You can use DT[DT2, i.y], and if you find yourself surprised that it is not the same output as DT[DT2][, y.1], see this thread.
Can I access repeated column names in `j` in a data.table join?
You can refer to the columns of the data.table in i (that is, DT2's columns) with the prefix i. as follows:
DT1[DT2, list(val=i.value-value)]
   name val
1:    a   1
2:    b   1
3:    c   1
4:    d   1
5:    e   1
# Data used
DT1 <- data.table(name=letters[1:5], value=2:6)
DT2 <- data.table(name=letters[1:5], value=3:7)
setkey(DT1, name)
data.table merge on duplicate column name / how to write J corresponding to names in Y?
I would set the key of edges
twice, and join twice.
setkey(edges, From)
edges[nodes, FromName := Name]
setkey(edges, To)
edges[nodes, ToName := Name]
## one-liner
setkey(setkey(edges, From)[nodes, FromName := Name], To)[nodes, ToName := Name]
python pandas remove duplicate columns
Here's a one-line solution that removes columns based on duplicate column names:
df = df.loc[:,~df.columns.duplicated()].copy()
How it works:
Suppose the columns of the data frame are ['alpha', 'beta', 'alpha'].
df.columns.duplicated() returns a boolean array: a True or False for each column. If it is False, the column name is unique up to that point; if it is True, the column name duplicates one that appeared earlier. For these columns, the returned value would be [False, False, True].
Pandas allows indexing with boolean values, whereby it selects only the True positions. Since we want to keep the unduplicated columns, we need the above boolean array flipped (i.e. [True, True, False] = ~[False, False, True]).
Finally, df.loc[:, [True, True, False]] selects only the non-duplicated columns using that indexing capability. The final .copy() copies the dataframe, which (mostly) avoids warnings about modifying a view of an existing dataframe later down the line.
Note: the above only checks column names, not column values.
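As a concrete, runnable sketch of the name-based approach (the toy frame and column names here are illustrative, not from the original answer):

```python
import pandas as pd

# Toy frame with a repeated column name 'alpha' (illustrative data)
df = pd.DataFrame([[1, 2, 3], [4, 5, 6]], columns=['alpha', 'beta', 'alpha'])

# True marks a column whose name appeared earlier
mask = df.columns.duplicated()
print(list(mask))             # [False, False, True]

# Flip the mask to keep only the first occurrence of each name
deduped = df.loc[:, ~mask].copy()
print(list(deduped.columns))  # ['alpha', 'beta']
```

Note that the second 'alpha' column's values are discarded; only the first occurrence of each name survives.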
To remove duplicated indexes
Since it is similar enough, do the same thing on the index:
df = df.loc[~df.index.duplicated(),:].copy()
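The same pattern on the index, sketched with a toy frame (the labels are illustrative):

```python
import pandas as pd

# Toy frame with a repeated index label 'a' (illustrative data)
df = pd.DataFrame({'x': [1, 2, 3]}, index=['a', 'a', 'b'])

# Keep only the first row for each index label
deduped = df.loc[~df.index.duplicated(), :].copy()
print(list(deduped.index))    # ['a', 'b']
print(deduped['x'].tolist())  # [1, 3]
```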
To remove duplicates by checking values without transposing
df = df.loc[:,~df.apply(lambda x: x.duplicated(),axis=1).all()].copy()
This avoids the issue of transposing. Is it fast? No. Does it work? Yeah. Here, try it on this:
import numpy as np
import pandas as pd

# create a large(ish) dataframe
ldf = pd.DataFrame(np.random.randint(0, 100, size=(736334, 1312)))
# to see size in gigs
# ldf.memory_usage().sum()/1e9  # it's about 3 gigs
# duplicate a column
ldf.loc[:, 'dup'] = ldf.loc[:, 101]
# take out duplicated columns by values
ldf = ldf.loc[:, ~ldf.apply(lambda x: x.duplicated(), axis=1).all()].copy()
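To see what the row-wise apply is doing at a small scale, here is a toy frame where the third column repeats the first column's values under a different name (names and values are illustrative):

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 2], 'b': [3, 4], 'dup': [1, 2]})

# For each row, x.duplicated() flags values already seen earlier in that row;
# .all() then keeps only columns flagged in every row
mask = df.apply(lambda x: x.duplicated(), axis=1).all()
print(mask.tolist())          # [False, False, True]

deduped = df.loc[:, ~mask].copy()
print(list(deduped.columns))  # ['a', 'b']
```

A column is dropped only if, in every single row, its value duplicates one appearing earlier in that row, so order of columns matters: of two identical columns, the later one is removed.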