Merge overlapping ranges into unique groups, in dataframe
You'll need the IRanges package:
require(IRanges)
ir <- IRanges(df$start, df$end)
df$group2 <- subjectHits(findOverlaps(ir, reduce(ir)))
> df
# start end group group2
# 1 178 5025 1 1
# 2 400 5025 1 1
# 3 983 5535 2 1
# 4 1932 6918 2 1
# 5 33653 38197 3 2
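For comparison, the same group assignment can be sketched in pure pandas (no Bioconductor needed) with a running-max scan over start-sorted intervals; the start/end values below are copied from the table above:

```python
import pandas as pd

# A pandas analogue of reduce() + findOverlaps(): a new group starts
# whenever an interval begins after every previously seen end.
df = pd.DataFrame({"start": [178, 400, 983, 1932, 33653],
                   "end":   [5025, 5025, 5535, 6918, 38197]})

d = df.sort_values("start")
# Boolean marks the first interval of each merged block; cumsum numbers them
df["group2"] = (d["start"] > d["end"].cummax().shift()).cumsum() + 1
print(df)
```

The assignment aligns back to the original row order by index, so the sort inside does not reorder `df`.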
To install IRanges, type these lines in R (the old biocLite() installer has been retired; current Bioconductor releases use BiocManager):
install.packages("BiocManager")
BiocManager::install("IRanges")
To learn more (manual, vignettes, etc.), see the IRanges page on the Bioconductor website.
Group rows by overlapping ranges
I found the accepted solution (update: now deleted) to be misleading, because it fails to generalize to similar cases. For example:
df = pd.DataFrame({'left':  [0, 5, 10, 3, 12, 13, 18, 31],
                   'right': [4, 8, 13, 7, 19, 16, 23, 35]})
df
The suggested aggregation function puts the 18-23 range in its own group, even though it should be in group 1 along with 12-19. One solution uses the following approach (based on a method for combining intervals posted by @CentAu):
# Union intervals, based on a method by @CentAu
from sympy import Interval, Union

def union(data):
    """Union of a list of intervals, e.g. [(1, 2), (3, 4)]."""
    intervals = [Interval(begin, end) for (begin, end) in data]
    u = Union(*intervals)
    return [u] if isinstance(u, Interval) else list(u.args)

# Create a list of intervals
df['left_right'] = df[['left', 'right']].apply(list, axis=1)
intervals = union(df.left_right)

# Add a group column: index of the merged interval containing each left edge
df['group'] = df['left'].apply(
    lambda x: [g for g, l in enumerate(intervals) if l.contains(x)][0])
...which outputs the expected grouping: the rows with left edges 0, 5 and 3 land in group 0; 10, 12, 13 and 18 in group 1; and 31 in group 2.
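If you would rather avoid sympy, the same grouping can be computed with a sort plus a running-max cumsum; a sketch on the same data:

```python
import pandas as pd

# Sort by left edge; a new group starts whenever an interval begins after
# the running max of all previous right edges.
df = pd.DataFrame({'left':  [0, 5, 10, 3, 12, 13, 18, 31],
                   'right': [4, 8, 13, 7, 19, 16, 23, 35]})

d = df.sort_values('left')
group = (d['left'] > d['right'].cummax().shift()).cumsum()
df['group'] = group  # aligns back to the original row order by index
print(df)
```

The strict `>` treats touching intervals (end 13, next start 13) as overlapping, matching sympy's closed-interval union.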
Combine two data frames with overlapping ranges and calculate overlap grouped by category
def group(df1):
    df2 = Strat[Strat['HOLE_ID'] == df1.iloc[0]['HOLE_ID']]
    df1[['FROM', 'TO']] = df1[['FROM', 'TO']].astype(float)
    df2[['FROM', 'TO']] = df2[['FROM', 'TO']].astype(float)
    # Collect every FROM/TO breakpoint from both tables into one sorted column
    temp = pd.concat([df1[['FROM', 'TO']], df2[['FROM', 'TO']]]).unstack() \
        .reset_index(drop=True).drop_duplicates().sort_values() \
        .reset_index(drop=True).to_frame(name='FROM') \
        .merge(df2, how='outer').ffill()
    temp['TO'] = temp.shift(-1)['FROM']

    def tobins(x):
        # Rows of temp inside [x.FROM, x.TO), grouped by Type with its share
        agg = temp[(x.FROM <= temp.FROM) & (temp.FROM < x.TO)].groupby('Type') \
            .apply(lambda y: y['TO'].max() - y['FROM'].min()).reset_index(name='res')
        agg.res = agg.Type + ' ' + (agg.res / agg.res.sum()).map('{:.0%}'.format)
        return '; '.join(agg.res.tolist())

    df1['Type'] = df1.apply(tobins, axis=1)
    return df1

Assay.groupby('HOLE_ID').apply(group)
HOLE_ID FROM TO val Type
0 Hole_1 1.0 2.5 0.001 A 100%
1 Hole_1 2.5 5.0 0.005 A 60%; B 40%
2 Hole_1 5.0 7.0 0.002 C 75%; D 25%
3 Hole_1 7.0 10.0 0.001 D 33%; E 67%
4 Hole_2 1.0 3.0 0.001 A 100%
5 Hole_2 3.0 5.0 0.005 B 100%
6 Hole_2 5.0 7.0 0.002 C 75%; D 25%
7 Hole_2 7.0 10.0 0.001 D 33%; E 67%
The key point is building the temp DataFrame, which contains every FROM and TO breakpoint from both tables. For HOLE_ID = 'Hole_1' it looks as below. For each row x of Assay we can then select the rows of temp satisfying x.FROM <= temp.FROM < x.TO, group them by Type, calculate each type's share, and join the results into the output format:
FROM HOLE_ID TO Type
0 0.0 Hole_1 1.0 A
1 1.0 Hole_1 2.5 A
2 2.5 Hole_1 4.0 A
3 4.0 Hole_1 5.0 B
4 5.0 Hole_1 6.5 C
5 6.5 Hole_1 7.0 D
6 7.0 Hole_1 8.0 D
7 8.0 Hole_1 10.0 E
8 10.0 Hole_1 NaN E
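The breakpoint construction can be sketched in isolation. The Strat rows below are made up (the question's actual Strat table is not shown here) but chosen so the resulting temp matches the Hole_1 table above; HOLE_ID is omitted for brevity:

```python
import pandas as pd

# Hypothetical Strat intervals for Hole_1 (illustrative data only)
strat = pd.DataFrame({'FROM': [0.0, 4.0, 5.0, 6.5, 8.0],
                      'TO':   [4.0, 5.0, 6.5, 8.0, 10.0],
                      'Type': ['A', 'B', 'C', 'D', 'E']})
assay_bounds = pd.Series([1.0, 2.5, 5.0, 7.0, 10.0])

# Union of all breakpoints from both tables, de-duplicated and sorted
points = pd.concat([strat['FROM'], strat['TO'], assay_bounds]) \
    .drop_duplicates().sort_values().reset_index(drop=True)

# Each consecutive pair of breakpoints becomes one segment; a segment's Type
# is that of the Strat interval it starts in (filled forward)
temp = points.to_frame(name='FROM').merge(strat, how='outer').ffill()
temp['TO'] = temp['FROM'].shift(-1)
print(temp)
```

The last segment's TO is NaN, exactly as in the table above, because there is no breakpoint after 10.0.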
Merge consecutive and overlapping date ranges
- within customer: groupby("cust", as_index=False)
- look for overlapping dates in rows: cumsum() sums booleans to generate a group id for each run of overlapping dates
- finally, a simple min/max aggregation within those groups
df.groupby("cust", as_index=False).apply(
lambda d: d.sort_values(["start_date", "end_date"])
.groupby(
["cust", (~(d["start_date"] <= (d["end_date"].shift() + pd.Timedelta(days=1)))).cumsum()],
as_index=False
)
.agg({"start_date": "min", "end_date": "max"})
).reset_index(drop=True)
|   | cust | start_date | end_date |
|---|---|---|---|
| 0 | CUST123 | 2021-01-01 00:00:00 | 2021-01-31 00:00:00 |
| 1 | CUST123 | 2021-02-02 00:00:00 | 2021-02-28 00:00:00 |
| 2 | CUST456 | 2021-01-05 00:00:00 | 2021-01-31 00:00:00 |
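The same gap test can be written as a standalone helper. The input rows below are made up so that the two January CUST123 ranges overlap (and merge) while February starts more than one day later (and stays separate); merge_ranges is a hypothetical name:

```python
import pandas as pd

df = pd.DataFrame({
    "cust": ["CUST123", "CUST123", "CUST123", "CUST456"],
    "start_date": pd.to_datetime(["2021-01-01", "2021-01-20", "2021-02-02", "2021-01-05"]),
    "end_date":   pd.to_datetime(["2021-01-25", "2021-01-31", "2021-02-28", "2021-01-31"]),
})

def merge_ranges(d):
    d = d.sort_values(["start_date", "end_date"])
    # True starts a new block: this range begins after the previous end + 1 day
    new_block = d["start_date"] > d["end_date"].shift() + pd.Timedelta(days=1)
    return (d.groupby(new_block.cumsum())
             .agg({"cust": "first", "start_date": "min", "end_date": "max"})
             .reset_index(drop=True))

out = pd.concat([merge_ranges(d) for _, d in df.groupby("cust")],
                ignore_index=True)
print(out)
```

Iterating over the groupby (instead of apply) sidesteps pandas version differences around grouping columns inside apply.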
Merge overlapping ranges per group
I used the Bioconductor GenomicRanges package, which seems highly appropriate to your domain.
> ## install.packages("BiocManager")
> ## BiocManager::install("GenomicRanges")
> library(GenomicRanges)
> my.df |> as("GRanges") |> reduce()
GRanges object with 5 ranges and 0 metadata columns:
seqnames ranges strand
<Rle> <IRanges> <Rle>
[1] 4F 2500-3401 +
[2] 4F 19116-20730 +
[3] 4F 1420-2527 -
[4] 0F 1405-1700 -
[5] 0F 1727-2038 -
-------
seqinfo: 2 sequences from an unspecified genome; no seqlengths
which differs from your expectation because there are two non-overlapping 0F ranges: 1405-1700 and 1727-2038 do not touch, so reduce() keeps them separate.
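A rough pandas analogue of reduce() grouped by (seqnames, strand), for readers outside R; the input intervals below are made up (the ones above are already reduced) so that one 4F pair merges while the two 0F ranges stay apart:

```python
import pandas as pd

df = pd.DataFrame({
    "seqnames": ["4F", "4F", "4F", "0F", "0F"],
    "strand":   ["+", "+", "-", "-", "-"],
    "start":    [2500, 2900, 1420, 1405, 1727],
    "end":      [3000, 3401, 2527, 1700, 2038],
})

def reduce_intervals(g):
    g = g.sort_values("start")
    # New cluster whenever an interval starts after the running max end
    cluster = (g["start"] > g["end"].cummax().shift()).cumsum()
    return g.groupby(cluster).agg(
        {"seqnames": "first", "strand": "first", "start": "min", "end": "max"})

merged = pd.concat(
    [reduce_intervals(g) for _, g in df.groupby(["seqnames", "strand"])],
    ignore_index=True,
)
print(merged)
```

Unlike GenomicRanges, this sketch has no seqinfo and treats strand as just another grouping key.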
Find overlapping intervals in groups and retain largest non-overlapping periods
Having searched for related problems on stackoverflow, I found that the following approaches (here: Collapse and merge overlapping time intervals) and (here: How to flatten / merge overlapping time periods) could be adapted to my issue.
# Solution adapted from:
# here https://stackoverflow.com/questions/53213418/collapse-and-merge-overlapping-time-intervals
# and here: https://stackoverflow.com/questions/28938147/how-to-flatten-merge-overlapping-time-periods/28938694#28938694
# Note: df and df1 created in the initial reprex (above)
df2 <- df %>%
group_by(group) %>%
arrange(group, start) %>%
mutate(indx = c(0, cumsum(as.numeric(lead(start)) > # find overlaps
cummax(as.numeric(end)))[-n()])) %>%
ungroup() %>%
group_by(group, indx) %>%
arrange(desc(intval_length)) %>% # retain largest interval
filter(row_number() == 1) %>%
ungroup() %>%
select(-indx) %>%
arrange(group, start)
# Desired output?
identical(df1, df2)
#> [1] TRUE
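The same idea can be sketched in pandas: cluster overlapping intervals within each group, then keep only the longest interval per cluster. The data and column names (group/start/end/intval_length) below are made up to mirror the reprex:

```python
import pandas as pd

df = pd.DataFrame({
    "group": ["a", "a", "a", "b", "b"],
    "start": [1, 3, 10, 2, 4],
    "end":   [5, 6, 12, 8, 6],
})
df["intval_length"] = df["end"] - df["start"]

def largest_per_cluster(g):
    g = g.sort_values("start")
    # New cluster when an interval starts after the running max of prior ends
    cluster = (g["start"] > g["end"].cummax().shift()).cumsum()
    # Keep the single longest interval in each cluster
    return g.loc[g.groupby(cluster)["intval_length"].idxmax()]

out = pd.concat(
    [largest_per_cluster(g) for _, g in df.groupby("group")],
    ignore_index=True,
)
print(out)
```

The cumsum clustering mirrors the dplyr `cummax`/`lead` trick; `idxmax` plays the role of `arrange(desc(intval_length)) %>% filter(row_number() == 1)`.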
Pandas: Reduce overlapping date ranges to one line (per group)
Henry Ecker pointed me in the right direction by treating this problem as a graph:
Pandas combining rows based on dates
The code only needed a very small bit of rewriting to get the right answer:
from scipy.sparse.csgraph import connected_components
def reductionFunction(data):
    # create a 2D graph of connectivity between date ranges
    start = data.start_date.values
    end = data.end_date.values
    graph = (start <= end[:, None]) & (end >= start[:, None])

    # find connected components in this graph
    n_components, indices = connected_components(graph)

    # group the results by these connected components
    return data.groupby(indices).aggregate({'start_date': 'min',
                                            'end_date': 'max'})
df.groupby(['customer_id']).apply(reductionFunction).reset_index('customer_id')
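To make the snippet self-contained, here is a made-up input it can run on: the two overlapping January ranges for customer 1 collapse to one line, while the disjoint March range stays separate:

```python
import pandas as pd
from scipy.sparse.csgraph import connected_components

df = pd.DataFrame({
    "customer_id": [1, 1, 1, 2],
    "start_date": pd.to_datetime(["2021-01-01", "2021-01-10", "2021-03-01", "2021-02-01"]),
    "end_date":   pd.to_datetime(["2021-01-15", "2021-01-20", "2021-03-05", "2021-02-10"]),
})

def reductionFunction(data):
    start = data.start_date.values
    end = data.end_date.values
    # ranges i and j touch iff start_i <= end_j and end_i >= start_j
    graph = (start <= end[:, None]) & (end >= start[:, None])
    n_components, indices = connected_components(graph)
    return data.groupby(indices).aggregate({'start_date': 'min',
                                            'end_date': 'max'})

out = df.groupby(['customer_id']).apply(reductionFunction) \
        .reset_index('customer_id')
print(out)
```

Note the graph approach also merges transitively overlapping chains (A overlaps B, B overlaps C) that pairwise scans can miss.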