Merge Overlapping Ranges into Unique Groups, in Dataframe


You'll need the IRanges package:

require(IRanges)
ir <- IRanges(df$start, df$end)
# reduce() merges overlapping ranges; findOverlaps() then maps each
# original range to the index of the merged range it falls in
df$group2 <- subjectHits(findOverlaps(ir, reduce(ir)))
> df
#   start   end group group2
# 1   178  5025     1      1
# 2   400  5025     1      1
# 3   983  5535     2      1
# 4  1932  6918     2      1
# 5 33653 38197     3      2

To install IRanges, type these lines in R (the old biocLite installer has since been replaced by BiocManager):

install.packages("BiocManager")
BiocManager::install("IRanges")

To learn more (manual, vignettes, etc.), see the IRanges page on Bioconductor.

Group rows by overlapping ranges

I found the accepted solution (update: now deleted) misleading, because it fails to generalize to similar cases. For example, take the following dataframe:

import pandas as pd

df = pd.DataFrame({'left':  [0, 5, 10, 3, 12, 13, 18, 31],
                   'right': [4, 8, 13, 7, 19, 16, 23, 35]})
df

The suggested aggregate function outputs the following dataframe (note that 18-23 should be in group 1, along with 12-19):

(image: the incorrect output, with 18-23 placed in its own group rather than grouped with 12-19)

One solution uses the following approach (based on a method for combining intervals posted by @CentAu):

# Union intervals, by @CentAu
from sympy import Interval, Union

def union(data):
    """Union of a list of intervals, e.g. [(1, 2), (3, 4)]."""
    intervals = [Interval(begin, end) for (begin, end) in data]
    u = Union(*intervals)
    return [u] if isinstance(u, Interval) else list(u.args)

# Create a list of intervals
df['left_right'] = df[['left', 'right']].apply(list, axis=1)
intervals = union(df.left_right)

# Add a group column: the index of the merged interval containing 'left'
df['group'] = df['left'].apply(
    lambda x: [g for g, l in enumerate(intervals) if l.contains(x)][0])

...which outputs:

(image: the corrected grouping, with 0-8 in group 0, 10-23 in group 1, and 31-35 in group 2)
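
For reference, the same grouping can be done in plain pandas without sympy. This is a minimal sketch of an alternative cummax/cumsum technique (not the original answer's method): sort by left, then start a new group whenever an interval begins past the running maximum of all previous ends.

import pandas as pd

df = pd.DataFrame({'left':  [0, 5, 10, 3, 12, 13, 18, 31],
                   'right': [4, 8, 13, 7, 19, 16, 23, 35]})

s = df.sort_values('left')
# A new group starts wherever an interval begins past every earlier end;
# cumsum() turns those break points into group ids
breaks = s['left'] > s['right'].cummax().shift()
df['group'] = breaks.cumsum()  # assignment aligns back to df by index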

Combine two data frames with overlapping ranges and calculate overlap grouped by category

def group(df1):
    # stratigraphy rows for this hole (copy to avoid mutating a slice of Strat)
    df2 = Strat[Strat['HOLE_ID'] == df1.iloc[0]['HOLE_ID']].copy()
    df1[['FROM', 'TO']] = df1[['FROM', 'TO']].astype(float)
    df2[['FROM', 'TO']] = df2[['FROM', 'TO']].astype(float)

    # every FROM/TO boundary from both tables, deduplicated and sorted,
    # then re-joined with the stratigraphy and forward-filled
    temp = pd.concat([df1[['FROM', 'TO']], df2[['FROM', 'TO']]]).unstack() \
        .reset_index(drop=True).drop_duplicates().sort_values() \
        .reset_index(drop=True).to_frame(name='FROM') \
        .merge(df2, how='outer').ffill()
    temp['TO'] = temp.shift(-1)['FROM']

    def tobins(x):
        # rows of temp inside this assay interval, share of thickness per Type
        agg = temp[(x.FROM <= temp.FROM) & (temp.FROM < x.TO)].groupby('Type') \
            .apply(lambda y: y['TO'].max() - y['FROM'].min()).reset_index(name='res')
        agg.res = agg.Type + ' ' + (agg.res / agg.res.sum()).map('{:.0%}'.format)
        return '; '.join(agg.res.tolist())

    df1['Type'] = df1.apply(tobins, axis=1)
    return df1

Assay.groupby('HOLE_ID').apply(group)

  HOLE_ID  FROM    TO    val          Type
0  Hole_1   1.0   2.5  0.001        A 100%
1  Hole_1   2.5   5.0  0.005  A 60%; B 40%
2  Hole_1   5.0   7.0  0.002  C 75%; D 25%
3  Hole_1   7.0  10.0  0.001  D 33%; E 67%
4  Hole_2   1.0   3.0  0.001        A 100%
5  Hole_2   3.0   5.0  0.005        B 100%
6  Hole_2   5.0   7.0  0.002  C 75%; D 25%
7  Hole_2   7.0  10.0  0.001  D 33%; E 67%

The key point is building the temp DataFrame with all FROM and TO boundary points from both tables (a standalone sketch of this step follows the table below). For HOLE_ID == 'Hole_1' it looks as below. Now, for each row x of Assay, we can take the rows of temp with x.FROM <= temp.FROM < x.TO, group them by Type, calculate the share per Type, and join the shares into the result format.

   FROM HOLE_ID    TO Type
0   0.0  Hole_1   1.0    A
1   1.0  Hole_1   2.5    A
2   2.5  Hole_1   4.0    A
3   4.0  Hole_1   5.0    B
4   5.0  Hole_1   6.5    C
5   6.5  Hole_1   7.0    D
6   7.0  Hole_1   8.0    D
7   8.0  Hole_1  10.0    E
8  10.0  Hole_1   NaN    E
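
In isolation, that boundary-merging step looks like this (a minimal sketch with hypothetical single-hole slices; the assay/strat names and values here are assumptions, not the question's data):

import pandas as pd

# Hypothetical slices for one hole
assay = pd.DataFrame({'FROM': [1.0, 2.5, 5.0, 7.0],
                      'TO':   [2.5, 5.0, 7.0, 10.0]})
strat = pd.DataFrame({'FROM': [0.0, 4.0, 5.0, 6.5, 8.0],
                      'TO':   [4.0, 5.0, 6.5, 8.0, 10.0],
                      'Type': list('ABCDE')})

# Every FROM/TO boundary from both tables, deduplicated and sorted
bounds = (pd.concat([assay[['FROM', 'TO']], strat[['FROM', 'TO']]])
          .unstack().reset_index(drop=True)
          .drop_duplicates().sort_values().reset_index(drop=True)
          .to_frame(name='FROM'))

# Re-attach the stratigraphy and forward-fill Type down the boundaries
temp = bounds.merge(strat, how='outer').ffill()
temp['TO'] = temp['FROM'].shift(-1)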

Merge consecutive and overlapping date ranges

  • within each customer (groupby("cust", as_index=False)), look for overlapping dates in consecutive rows
  • cumsum() sums those booleans to generate a group id for each run of overlapping dates
  • finally, a simple min/max aggregation within each group

df.groupby("cust", as_index=False).apply(
lambda d: d.sort_values(["start_date", "end_date"])
.groupby(
["cust", (~(d["start_date"] <= (d["end_date"].shift() + pd.Timedelta(days=1)))).cumsum()],
as_index=False
)
.agg({"start_date": "min", "end_date": "max"})
).reset_index(drop=True)
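
For reference, a hypothetical input consistent with the merged output below (the question's actual frame isn't shown, so this is an assumption):

import pandas as pd

df = pd.DataFrame({
    "cust": ["CUST123", "CUST123", "CUST123", "CUST456"],
    "start_date": pd.to_datetime(["2021-01-01", "2021-01-08", "2021-02-02", "2021-01-05"]),
    "end_date":   pd.to_datetime(["2021-01-10", "2021-01-31", "2021-02-28", "2021-01-31"]),
})

The first two CUST123 rows overlap and collapse into one; the third starts more than a day after 2021-01-31, so it stays separate: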

      cust          start_date            end_date
0  CUST123 2021-01-01 00:00:00 2021-01-31 00:00:00
1  CUST123 2021-02-02 00:00:00 2021-02-28 00:00:00
2  CUST456 2021-01-05 00:00:00 2021-01-31 00:00:00

Merge overlapping ranges per group

I used the Bioconductor GenomicRanges package, which seems highly appropriate to your domain.


> ## install.packages("BiocManager")
> ## BiocManager::install("GenomicRanges")
> library(GenomicRanges)
> my.df |> as("GRanges") |> reduce()
GRanges object with 5 ranges and 0 metadata columns:
      seqnames      ranges strand
         <Rle>   <IRanges>  <Rle>
  [1]       4F   2500-3401      +
  [2]       4F 19116-20730      +
  [3]       4F   1420-2527      -
  [4]       0F   1405-1700      -
  [5]       0F   1727-2038      -
  -------
  seqinfo: 2 sequences from an unspecified genome; no seqlengths

which differs from your expectation because there are two non-overlapping '0F' ranges?

Find overlapping intervals in groups and retain largest non-overlapping periods

Having searched for related problems on Stack Overflow, I found that the approaches from Collapse and merge overlapping time intervals and How to flatten / merge overlapping time periods could be adapted to my issue.

# Solution adapted from:
# here https://stackoverflow.com/questions/53213418/collapse-and-merge-overlapping-time-intervals
# and here: https://stackoverflow.com/questions/28938147/how-to-flatten-merge-overlapping-time-periods/28938694#28938694

# Note: df and df1 created in the initial reprex (above)

df2 <- df %>%
  group_by(group) %>%
  arrange(group, start) %>%
  mutate(indx = c(0, cumsum(as.numeric(lead(start)) >     # find overlaps
                              cummax(as.numeric(end)))[-n()])) %>%
  ungroup() %>%
  group_by(group, indx) %>%
  arrange(desc(intval_length)) %>%  # retain largest interval
  filter(row_number() == 1) %>%
  ungroup() %>%
  select(-indx) %>%
  arrange(group, start)

# Desired output?
identical(df1, df2)
#> [1] TRUE

Pandas: Reduce overlapping date ranges to one line (per group)

Henry Ecker pointed me in the right direction by suggesting this problem be treated as a graph:

Pandas combining rows based on dates

The code only needed a very small bit of rewriting to get the right answer:

from scipy.sparse.csgraph import connected_components

def reductionFunction(data):
    # create a 2D graph of connectivity between date ranges
    start = data.start_date.values
    end = data.end_date.values
    graph = (start <= end[:, None]) & (end >= start[:, None])

    # find connected components in this graph
    n_components, indices = connected_components(graph)

    # group the results by these connected components
    return data.groupby(indices).aggregate({'start_date': 'min',
                                            'end_date': 'max'})

df.groupby(['customer_id']).apply(reductionFunction).reset_index('customer_id')
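
A minimal usage sketch with assumed data (the question's actual frame isn't shown, so the values here are illustrative only):

import pandas as pd

df = pd.DataFrame({
    'customer_id': [1, 1, 1, 2],
    'start_date': pd.to_datetime(['2021-01-01', '2021-01-10', '2021-02-01', '2021-01-05']),
    'end_date':   pd.to_datetime(['2021-01-15', '2021-01-20', '2021-02-10', '2021-01-31']),
})
# customer 1's first two ranges overlap, so the call above collapses them
# into 2021-01-01..2021-01-20; the other two ranges are left untouched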

