Merge overlapping ranges into unique groups, in dataframe
You'll need the IRanges package:
require(IRanges)
ir <- IRanges(df$start, df$end)
df$group2 <- subjectHits(findOverlaps(ir, reduce(ir)))
> df
# start end group group2
# 1 178 5025 1 1
# 2 400 5025 1 1
# 3 983 5535 2 1
# 4 1932 6918 2 1
# 5 33653 38197 3 2
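For comparison, the same group assignment can be sketched in pure pandas (no Bioconductor needed) with a running-max scan over start-sorted intervals; the start/end values below are copied from the table above:

```python
import pandas as pd

# A pandas analogue of reduce() + findOverlaps(): a new group starts
# whenever an interval begins after every previously seen end.
df = pd.DataFrame({"start": [178, 400, 983, 1932, 33653],
                   "end":   [5025, 5025, 5535, 6918, 38197]})

d = df.sort_values("start")
# Boolean marks the first interval of each merged block; cumsum numbers them
df["group2"] = (d["start"] > d["end"].cummax().shift()).cumsum() + 1
print(df)
```

The assignment aligns back to the original row order by index, so the sort inside does not reorder `df`.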
To install IRanges, type these lines in R (the old biocLite() installer has been retired; current Bioconductor releases use BiocManager):
install.packages("BiocManager")
BiocManager::install("IRanges")
To learn more (manual, vignettes, etc.), see the IRanges page on the Bioconductor website.
Group rows by overlapping ranges
I found the accepted solution (update: now deleted) to be misleading, because it fails to generalize to similar cases. For example:
df = pd.DataFrame({'left':  [0, 5, 10, 3, 12, 13, 18, 31],
                   'right': [4, 8, 13, 7, 19, 16, 23, 35]})
df
The suggested aggregation function puts the 18-23 range in its own group, even though it should be in group 1 along with 12-19. One solution uses the following approach (based on a method for combining intervals posted by @CentAu):
# Union intervals, based on a method by @CentAu
from sympy import Interval, Union

def union(data):
    """Union of a list of intervals, e.g. [(1, 2), (3, 4)]."""
    intervals = [Interval(begin, end) for (begin, end) in data]
    u = Union(*intervals)
    return [u] if isinstance(u, Interval) else list(u.args)

# Create a list of intervals
df['left_right'] = df[['left', 'right']].apply(list, axis=1)
intervals = union(df.left_right)

# Add a group column: index of the merged interval containing each left edge
df['group'] = df['left'].apply(
    lambda x: [g for g, l in enumerate(intervals) if l.contains(x)][0])
...which outputs the expected grouping: the rows with left edges 0, 5 and 3 land in group 0; 10, 12, 13 and 18 in group 1; and 31 in group 2.
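If you would rather avoid sympy, the same grouping can be computed with a sort plus a running-max cumsum; a sketch on the same data:

```python
import pandas as pd

# Sort by left edge; a new group starts whenever an interval begins after
# the running max of all previous right edges.
df = pd.DataFrame({'left':  [0, 5, 10, 3, 12, 13, 18, 31],
                   'right': [4, 8, 13, 7, 19, 16, 23, 35]})

d = df.sort_values('left')
group = (d['left'] > d['right'].cummax().shift()).cumsum()
df['group'] = group  # aligns back to the original row order by index
print(df)
```

The strict `>` treats touching intervals (end 13, next start 13) as overlapping, matching sympy's closed-interval union.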
Combine two data frames with overlapping ranges and calculate overlap grouped by category
def group(df1):
    df2 = Strat[Strat['HOLE_ID'] == df1.iloc[0]['HOLE_ID']]
    df1[['FROM', 'TO']] = df1[['FROM', 'TO']].astype(float)
    df2[['FROM', 'TO']] = df2[['FROM', 'TO']].astype(float)
    # Collect every FROM/TO breakpoint from both tables into one sorted column
    temp = pd.concat([df1[['FROM', 'TO']], df2[['FROM', 'TO']]]).unstack() \
        .reset_index(drop=True).drop_duplicates().sort_values() \
        .reset_index(drop=True).to_frame(name='FROM') \
        .merge(df2, how='outer').ffill()
    temp['TO'] = temp.shift(-1)['FROM']

    def tobins(x):
        # Rows of temp inside [x.FROM, x.TO), grouped by Type with its share
        agg = temp[(x.FROM <= temp.FROM) & (temp.FROM < x.TO)].groupby('Type') \
            .apply(lambda y: y['TO'].max() - y['FROM'].min()).reset_index(name='res')
        agg.res = agg.Type + ' ' + (agg.res / agg.res.sum()).map('{:.0%}'.format)
        return '; '.join(agg.res.tolist())

    df1['Type'] = df1.apply(tobins, axis=1)
    return df1

Assay.groupby('HOLE_ID').apply(group)
HOLE_ID FROM TO val Type
0 Hole_1 1.0 2.5 0.001 A 100%
1 Hole_1 2.5 5.0 0.005 A 60%; B 40%
2 Hole_1 5.0 7.0 0.002 C 75%; D 25%
3 Hole_1 7.0 10.0 0.001 D 33%; E 67%
4 Hole_2 1.0 3.0 0.001 A 100%
5 Hole_2 3.0 5.0 0.005 B 100%
6 Hole_2 5.0 7.0 0.002 C 75%; D 25%
7 Hole_2 7.0 10.0 0.001 D 33%; E 67%
The key point is building the temp DataFrame, which contains every FROM and TO breakpoint from both tables. For HOLE_ID = 'Hole_1' it looks as below. For each row x of Assay we can then select the rows of temp satisfying x.FROM <= temp.FROM < x.TO, group them by Type, calculate each type's share, and join the results into the output format:
FROM HOLE_ID TO Type
0 0.0 Hole_1 1.0 A
1 1.0 Hole_1 2.5 A
2 2.5 Hole_1 4.0 A
3 4.0 Hole_1 5.0 B
4 5.0 Hole_1 6.5 C
5 6.5 Hole_1 7.0 D
6 7.0 Hole_1 8.0 D
7 8.0 Hole_1 10.0 E
8 10.0 Hole_1 NaN E
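The breakpoint construction can be sketched in isolation. The Strat rows below are made up (the question's actual Strat table is not shown here) but chosen so the resulting temp matches the Hole_1 table above; HOLE_ID is omitted for brevity:

```python
import pandas as pd

# Hypothetical Strat intervals for Hole_1 (illustrative data only)
strat = pd.DataFrame({'FROM': [0.0, 4.0, 5.0, 6.5, 8.0],
                      'TO':   [4.0, 5.0, 6.5, 8.0, 10.0],
                      'Type': ['A', 'B', 'C', 'D', 'E']})
assay_bounds = pd.Series([1.0, 2.5, 5.0, 7.0, 10.0])

# Union of all breakpoints from both tables, de-duplicated and sorted
points = pd.concat([strat['FROM'], strat['TO'], assay_bounds]) \
    .drop_duplicates().sort_values().reset_index(drop=True)

# Each consecutive pair of breakpoints becomes one segment; a segment's Type
# is that of the Strat interval it starts in (filled forward)
temp = points.to_frame(name='FROM').merge(strat, how='outer').ffill()
temp['TO'] = temp['FROM'].shift(-1)
print(temp)
```

The last segment's TO is NaN, exactly as in the table above, because there is no breakpoint after 10.0.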
Merge consecutive and overlapping date ranges
- within customer: groupby("cust", as_index=False)
- look for overlapping dates in rows: cumsum() sums booleans to generate a group id for each run of overlapping dates
- finally, a simple min/max aggregation within those groups
df.groupby("cust", as_index=False).apply(
lambda d: d.sort_values(["start_date", "end_date"])
.groupby(
["cust", (~(d["start_date"] <= (d["end_date"].shift() + pd.Timedelta(days=1)))).cumsum()],
as_index=False
)
.agg({"start_date": "min", "end_date": "max"})
).reset_index(drop=True)
|   | cust | start_date | end_date |
|---|---|---|---|
| 0 | CUST123 | 2021-01-01 00:00:00 | 2021-01-31 00:00:00 |
| 1 | CUST123 | 2021-02-02 00:00:00 | 2021-02-28 00:00:00 |
| 2 | CUST456 | 2021-01-05 00:00:00 | 2021-01-31 00:00:00 |
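The same gap test can be written as a standalone helper. The input rows below are made up so that the two January CUST123 ranges overlap (and merge) while February starts more than one day later (and stays separate); merge_ranges is a hypothetical name:

```python
import pandas as pd

df = pd.DataFrame({
    "cust": ["CUST123", "CUST123", "CUST123", "CUST456"],
    "start_date": pd.to_datetime(["2021-01-01", "2021-01-20", "2021-02-02", "2021-01-05"]),
    "end_date":   pd.to_datetime(["2021-01-25", "2021-01-31", "2021-02-28", "2021-01-31"]),
})

def merge_ranges(d):
    d = d.sort_values(["start_date", "end_date"])
    # True starts a new block: this range begins after the previous end + 1 day
    new_block = d["start_date"] > d["end_date"].shift() + pd.Timedelta(days=1)
    return (d.groupby(new_block.cumsum())
             .agg({"cust": "first", "start_date": "min", "end_date": "max"})
             .reset_index(drop=True))

out = pd.concat([merge_ranges(d) for _, d in df.groupby("cust")],
                ignore_index=True)
print(out)
```

Iterating over the groupby (instead of apply) sidesteps pandas version differences around grouping columns inside apply.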
Merge overlapping ranges per group
I used the Bioconductor GenomicRanges package, which seems highly appropriate to your domain.
> ## install.packages("BiocManager")
> ## BiocManager::install("GenomicRanges")
> library(GenomicRanges)
> my.df |> as("GRanges") |> reduce()
GRanges object with 5 ranges and 0 metadata columns:
seqnames ranges strand
<Rle> <IRanges> <Rle>
[1] 4F 2500-3401 +
[2] 4F 19116-20730 +
[3] 4F 1420-2527 -
[4] 0F 1405-1700 -
[5] 0F 1727-2038 -
-------
seqinfo: 2 sequences from an unspecified genome; no seqlengths
which differs from your expectation because there are two non-overlapping 0F ranges: 1405-1700 and 1727-2038 do not touch, so reduce() keeps them separate.
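A rough pandas analogue of reduce() grouped by (seqnames, strand), for readers outside R; the input intervals below are made up (the ones above are already reduced) so that one 4F pair merges while the two 0F ranges stay apart:

```python
import pandas as pd

df = pd.DataFrame({
    "seqnames": ["4F", "4F", "4F", "0F", "0F"],
    "strand":   ["+", "+", "-", "-", "-"],
    "start":    [2500, 2900, 1420, 1405, 1727],
    "end":      [3000, 3401, 2527, 1700, 2038],
})

def reduce_intervals(g):
    g = g.sort_values("start")
    # New cluster whenever an interval starts after the running max end
    cluster = (g["start"] > g["end"].cummax().shift()).cumsum()
    return g.groupby(cluster).agg(
        {"seqnames": "first", "strand": "first", "start": "min", "end": "max"})

merged = pd.concat(
    [reduce_intervals(g) for _, g in df.groupby(["seqnames", "strand"])],
    ignore_index=True,
)
print(merged)
```

Unlike GenomicRanges, this sketch has no seqinfo and treats strand as just another grouping key.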
Find overlapping intervals in groups and retain largest non-overlapping periods
Having searched for related problems on stackoverflow, I found that the following approaches (here: Collapse and merge overlapping time intervals) and (here: How to flatten / merge overlapping time periods) could be adapted to my issue.
# Solution adapted from:
# here https://stackoverflow.com/questions/53213418/collapse-and-merge-overlapping-time-intervals
# and here: https://stackoverflow.com/questions/28938147/how-to-flatten-merge-overlapping-time-periods/28938694#28938694
# Note: df and df1 created in the initial reprex (above)
df2 <- df %>%
group_by(group) %>%
arrange(group, start) %>%
mutate(indx = c(0, cumsum(as.numeric(lead(start)) > # find overlaps
cummax(as.numeric(end)))[-n()])) %>%
ungroup() %>%
group_by(group, indx) %>%
arrange(desc(intval_length)) %>% # retain largest interval
filter(row_number() == 1) %>%
ungroup() %>%
select(-indx) %>%
arrange(group, start)
# Desired output?
identical(df1, df2)
#> [1] TRUE
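The same idea can be sketched in pandas: cluster overlapping intervals within each group, then keep only the longest interval per cluster. The data and column names (group/start/end/intval_length) below are made up to mirror the reprex:

```python
import pandas as pd

df = pd.DataFrame({
    "group": ["a", "a", "a", "b", "b"],
    "start": [1, 3, 10, 2, 4],
    "end":   [5, 6, 12, 8, 6],
})
df["intval_length"] = df["end"] - df["start"]

def largest_per_cluster(g):
    g = g.sort_values("start")
    # New cluster when an interval starts after the running max of prior ends
    cluster = (g["start"] > g["end"].cummax().shift()).cumsum()
    # Keep the single longest interval in each cluster
    return g.loc[g.groupby(cluster)["intval_length"].idxmax()]

out = pd.concat(
    [largest_per_cluster(g) for _, g in df.groupby("group")],
    ignore_index=True,
)
print(out)
```

The cumsum clustering mirrors the dplyr `cummax`/`lead` trick; `idxmax` plays the role of `arrange(desc(intval_length)) %>% filter(row_number() == 1)`.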
Pandas: Reduce overlapping date ranges to one line (per group)
Henry Ecker pointed me in the right direction by treating this problem as a graph:
Pandas combining rows based on dates
The code only needed a very small bit of rewriting to get the right answer:
from scipy.sparse.csgraph import connected_components
def reductionFunction(data):
    # create a 2D graph of connectivity between date ranges
    start = data.start_date.values
    end = data.end_date.values
    graph = (start <= end[:, None]) & (end >= start[:, None])

    # find connected components in this graph
    n_components, indices = connected_components(graph)

    # group the results by these connected components
    return data.groupby(indices).aggregate({'start_date': 'min',
                                            'end_date': 'max'})
df.groupby(['customer_id']).apply(reductionFunction).reset_index('customer_id')
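To make the snippet self-contained, here is a made-up input it can run on: the two overlapping January ranges for customer 1 collapse to one line, while the disjoint March range stays separate:

```python
import pandas as pd
from scipy.sparse.csgraph import connected_components

df = pd.DataFrame({
    "customer_id": [1, 1, 1, 2],
    "start_date": pd.to_datetime(["2021-01-01", "2021-01-10", "2021-03-01", "2021-02-01"]),
    "end_date":   pd.to_datetime(["2021-01-15", "2021-01-20", "2021-03-05", "2021-02-10"]),
})

def reductionFunction(data):
    start = data.start_date.values
    end = data.end_date.values
    # ranges i and j touch iff start_i <= end_j and end_i >= start_j
    graph = (start <= end[:, None]) & (end >= start[:, None])
    n_components, indices = connected_components(graph)
    return data.groupby(indices).aggregate({'start_date': 'min',
                                            'end_date': 'max'})

out = df.groupby(['customer_id']).apply(reductionFunction) \
        .reset_index('customer_id')
print(out)
```

Note the graph approach also merges transitively overlapping chains (A overlaps B, B overlaps C) that pairwise scans can miss.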