Finding Rows with Consecutive Increase in the Values of a Column


In any case, it helps to put this in terms of increasing rows-per-stock (the actual quoteId value isn't really helpful here). A count of days captured (in this table) is easiest; if you want something else (like only business days, ignoring weekends/holidays, or whatever) it gets more involved, and you'd probably need a calendar table. You're going to want an index over [stockId, createdDate], if you don't have one already.
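
For instance, a minimal sketch of that index, assuming the Quote table used in the query below (the index name and the INCLUDE clause are my additions; INCLUDE is SQL Server syntax):

CREATE INDEX IX_Quote_StockId_CreatedDate -- hypothetical name
ON Quote (stockId, createdDate)
INCLUDE (closePrice); -- optional: lets the query run entirely from the index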

WITH StockRow AS (SELECT stockId, closePrice, createdDate,
ROW_NUMBER() OVER(PARTITION BY stockId
ORDER BY createdDate) rn
FROM Quote),

RunGroup AS (SELECT Base.stockId, Base.createdDate,
MAX(Restart.rn) OVER(PARTITION BY Base.stockId
ORDER BY Base.createdDate) groupingId
FROM StockRow Base
LEFT JOIN StockRow Restart
ON Restart.stockId = Base.stockId
AND Restart.rn = Base.rn - 1
AND Restart.closePrice > Base.closePrice)

SELECT stockId,
COUNT(*) AS consecutiveCount,
MIN(createdDate) AS startDate, MAX(createdDate) AS endDate
FROM RunGroup
GROUP BY stockId, groupingId
HAVING COUNT(*) >= 3
ORDER BY stockId, startDate

Which yields the following results from the provided data:

Increasing_Run
stockId  consecutiveCount  startDate   endDate
==============================================
1        5                 2012-01-01  2012-01-05
2        4                 2012-01-01  2012-01-04
3        3                 2012-01-02  2012-01-04

SQL Fiddle Example
(Fiddle also has an example for multiple runs)

This analysis ignores any gaps in the captured dates, and it correctly matches all runs, including a new increasing run that starts immediately after the previous one ends.


So what's going on here?

StockRow AS (SELECT stockId, closePrice, createdDate,
ROW_NUMBER() OVER(PARTITION BY stockId
ORDER BY createdDate) rn
FROM Quote)

This CTE is being used for one purpose: we need a way to find the next/previous row, so first we number each row in date order...

RunGroup AS (SELECT Base.stockId, Base.createdDate,
MAX(Restart.rn) OVER(PARTITION BY Base.stockId
ORDER BY Base.createdDate) groupingId
FROM StockRow Base
LEFT JOIN StockRow Restart
ON Restart.stockId = Base.stockId
AND Restart.rn = Base.rn - 1
AND Restart.closePrice > Base.closePrice)

... and then join each row to the previous one by that index. If you end up on something that has LAG()/LEAD(), using those instead will almost certainly be a better option (a sketch of that variant follows this walkthrough). There's one critical thing here though: Restart only matches when the row is out-of-sequence (less than the previous row). Otherwise, the value ends up being NULL (with LAG(), you'd need something like a CASE expression afterwards to pull this off). You get a temporary set that looks something like this:

B.rn  B.closePrice  B.createdDate  R.rn  R.closePrice  R.createdDate  groupingId
1     15            2012-01-01     -     -             -              -
2     13            2012-01-02     1     15            2012-01-01     1
3     17            2012-01-03     -     -             -              1
4     18            2012-01-04     -     -             -              1
5     10            2012-01-05     4     18            2012-01-04     4

... So there are values for Restart only when the previous close was greater than the "current" row's. The MAX() in the window function carries forward the greatest row-index seen so far; because MAX() ignores NULLs, that index is retained for all subsequent rows until another mismatch occurs (which gives a new value).
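
Here's a minimal, self-contained demo of that running-MAX behavior (the VALUES rows are invented purely for illustration):

SELECT ord, v,
MAX(v) OVER(ORDER BY ord) AS runningMax
FROM (VALUES (1, NULL), (2, 1), (3, NULL), (4, NULL), (5, 4)) AS t(ord, v)
-- runningMax: NULL, 1, 1, 1, 4 - each non-NULL value "sticks" until a larger one appears

At this point, we essentially have the intermediate results of a gaps-and-islands query, ready for the final aggregation.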

SELECT stockId, 
COUNT(*) AS consecutiveCount,
MIN(createdDate) AS startDate, MAX(createdDate) AS endDate
FROM RunGroup
GROUP BY stockId, groupingId
HAVING COUNT(*) >= 3
ORDER BY stockId, startDate

The final part of the query gets the start and end dates of each run, and counts the number of entries between those dates. If there were something more complicated for the date calculation, it would probably need to happen at this point. The GROUP BY is showing one of the few legitimate instances of grouping by a column (groupingId) that isn't in the SELECT clause. The HAVING clause is used to eliminate runs that are "too short".
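
For completeness, here's a hedged sketch of the LAG() variant mentioned earlier. It's untested, but it should produce the same groupingId values by replacing the self-join with a CASE over the lagged price:

WITH StockRow AS (SELECT stockId, closePrice, createdDate,
ROW_NUMBER() OVER(PARTITION BY stockId
ORDER BY createdDate) rn,
LAG(closePrice) OVER(PARTITION BY stockId
ORDER BY createdDate) prevClose
FROM Quote),

RunGroup AS (SELECT stockId, createdDate,
MAX(CASE WHEN prevClose > closePrice THEN rn - 1 END)
OVER(PARTITION BY stockId
ORDER BY createdDate) groupingId
FROM StockRow)

SELECT stockId,
COUNT(*) AS consecutiveCount,
MIN(createdDate) AS startDate, MAX(createdDate) AS endDate
FROM RunGroup
GROUP BY stockId, groupingId
HAVING COUNT(*) >= 3
ORDER BY stockId, startDate

(The rn - 1 mirrors the Restart.rn = Base.rn - 1 join, so the groupingId values match the intermediate table above.)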

In Python Pandas, searching where there are 4 consecutive rows where values are going up

Assuming the dataframe is sorted, one way is based on the cumsum of the sign of the differences, to identify each row where an upward Price move follows a 3-day upward trend (i.e. 4 days of upward movement).

# running count of up-moves: +1 wherever Price increased vs. the previous row
quant1 = (df['Price'].diff().apply(np.sign) == 1).cumsum()
# the same running count, frozen at the last non-increasing row and carried forward
quant2 = (df['Price'].diff().apply(np.sign) == 1).cumsum().where(~(df['Price'].diff().apply(np.sign) == 1)).ffill().fillna(0).astype(int)
# at least 3 consecutive up-moves ending here means 4 consecutive increasing prices
df['is_consecutive'] = (quant1-quant2) >= 3

note that the above takes into account only strictly increasing Prices (not equal).

Then we also override the is_consecutive flag for the previous 3 prices to be True, using the self-defined win_view function:

def win_view(x, size):
    # normalize the input to a numpy array
    if isinstance(x, list):
        x = np.array(x)
    if isinstance(x, pd.core.series.Series):
        x = x.values
    if isinstance(x, np.ndarray):
        pass
    else:
        raise Exception('wrong type')
    # overlapping sliding windows of length `size`, as a writable strided view
    return np.lib.stride_tricks.as_strided(
        x,
        shape=(x.size - size + 1, size),
        strides=(x.strides[0], x.strides[0])
    )

arr = win_view(df['is_consecutive'], 4)
# wherever the last element of a window is True, mark the whole window True
arr[arr[:,3]] = True

Note that this replaces the values in place: as_strided returns a view, so the writes go through to the underlying array.

EDIT 1
Inspired by the self-defined win_view function, I realized that the solution can be obtained simply with win_view (without the need for the cumsums), as below:

df['is_consecutive'] = False
arr = win_view(df['Price'].diff(), 4)
arr_ind = win_view(list(df['Price'].index), 4)
mask = arr_ind[np.all(arr[:, 1:] > 0, axis=1)].flatten()
df.loc[mask, 'is_consecutive'] = True

We maintain 2 arrays, one for the price differences and one for the indices. We collect the indices where there are 3 consecutive positive differences, np.all(arr[:, 1:] > 0, axis=1) (i.e. 4 up-moving prices), and set those rows in our original df.

How can I find 5 consecutive rows in pandas Dataframe where a value of a certain column is at least 0.5

Here is a non-iterative pandas approach, which makes it quite efficient.

Steps:

  • Create a rolling window of 5 points and determine the minimum value.
  • If the minimum value is >= 0.5, store True, else store False.
  • All booleans are stored in a numpy.array, called idx.
  • The idx array is used as a filter on the main dataset with a value of 4 subtracted to determine the first index of the run of 5.
  • The filtered DataFrame is presented.

Sample code:

# True at the last row of each 5-row window whose minimum residual is >= 0.5
idx = (df['residual'].rolling(window=5).min() >= 0.5).to_numpy()
# step back 4 rows to the first row of each qualifying window (assumes a default RangeIndex)
df.iloc[df.index[idx]-4]

Output:

Index  Time                       real_generation  predicted_generation  residual
1      2019-01-01 11:00:00+00:00  0.126            0.627                 0.501

Count Number of Consecutive Occurrence of values in Table

One approach is the difference of row numbers:

select name, count(*) 
from (select t.*,
(row_number() over (order by id) -
row_number() over (partition by name order by id)
) as grp
from t
) t
group by grp, name;

The logic is easiest to understand if you run the subquery and look at the values of each row number separately and then look at the difference.
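
For instance, with some invented sample rows (id and name are the columns the query above assumes), the pieces look like this:

select t.*,
row_number() over (order by id) as rn_all,
row_number() over (partition by name order by id) as rn_name,
(row_number() over (order by id) -
row_number() over (partition by name order by id)
) as grp
from t;

-- id  name  rn_all  rn_name  grp
-- 1   A     1       1        0
-- 2   A     2       2        0
-- 3   B     3       1        2
-- 4   B     4       2        2
-- 5   A     5       3        2

Note that the second run of A (id 5) shares grp = 2 with the B run, which is exactly why the outer query groups by grp and name together.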

How to identify consecutive repeating values in data frame column?

To detect consecutive runs in the series, we first detect the turning points by looking at the locations where difference with previous entry isn't 0. Then cumulative sum of this marks the groups:

# for the second frame
>>> consecutives = df.Value.diff().ne(0).cumsum()
>>> consecutives

0    1
1    1
2    2
3    2
4    3
5    4
6    4
7    4
8    5
9    5

But since you're interested in a particular value's consecutive runs (e.g., 0), we can mask the above to put NaNs wherever we don't have 0 in the original series:

>>> masked_consecs = consecutives.mask(df.Value.ne(0))
>>> masked_consecs

0    NaN
1    NaN
2    2.0
3    2.0
4    NaN
5    4.0
6    4.0
7    4.0
8    NaN
9    NaN

Now we can group by this series and look at the groups' sizes:

>>> consec_sizes = df.Value.groupby(masked_consecs).size().to_numpy()
>>> consec_sizes

array([2, 3])

The final decision can be made with the threshold given (e.g., 2) to see if any of the sizes satisfy that:

>>> is_okay = (consec_sizes >= 2).any()
>>> is_okay
True

Now we can wrap this procedure in a function for reusability:

def is_consec_found(series, value=0, threshold=2):
    # mark consecutive groups
    consecs = series.diff().ne(0).cumsum()

    # disregard those groups that are not of `value`
    masked_consecs = consecs.mask(series.ne(value))

    # get size of each
    consec_sizes = series.groupby(masked_consecs).size().to_numpy()

    # check sizes against the threshold
    is_okay = (consec_sizes >= threshold).any()

    # whether a suitable sequence is found or not
    return is_okay

and we can run it as:

# these are all for the second dataframe you posted
>>> is_consec_found(df.Value, value=0, threshold=2)
True

>>> is_consec_found(df.Value, value=0, threshold=5)
False

>>> is_consec_found(df.Value, value=1, threshold=2)
True

>>> is_consec_found(df.Value, value=1, threshold=3)
False

Find records with 3 or more consecutive records with same value

You can use a trick to enumerate the "cash" transactions. This trick is a difference of row numbers and it is very useful:

select t.*
from (select t.*, count(*) over (partition by grp, customerid, transtype) as cnt
from (select t.*,
(row_number() over (partition by customerid order by date) -
row_number() over (partition by customerid, transtype order by date)
) as grp
from t
) t
where transtype = 'cash'
) t
where cnt >= 3;

This returns the actual transactions in each qualifying run, using an additional level of window functions for the count. If you instead want one row per run, with the customer, start date, and total value, you can aggregate:

select customerid, min(date) as start_date, sum(value) as sumvalue
from (select t.*,
(row_number() over (partition by customerid order by date) -
row_number() over (partition by customerid, transtype order by date)
) as grp
from t
) t
where transtype = 'cash'
group by grp, transtype, customerid
having count(*) >= 3;

T-SQL for finding consecutive increasing values

Q is used to get a ranking value rn ordered by Column1; PrimaryKey is added in case there are ties in Column1. C is a recursive CTE that loops from the top, ordered by rn, incrementing cc for each increasing value of Column2. It breaks out of the recursion when cc reaches 10. Finally, take the last 10 rows from C. The WHERE clause takes care of the case where there are no 10 consecutive increasing values.

with Q as
(
select PrimaryKey,
Column1,
Column2,
row_number() over(order by Column1, PrimaryKey) as rn
from MyTable
),
C as
(
select PrimaryKey,
Column1,
Column2,
rn,
1 as cc
from Q
where rn = 1
union all
select Q.PrimaryKey,
Q.Column1,
Q.Column2,
Q.rn,
case
when Q.Column2 > C.Column2 then C.cc + 1
else 1
end
from Q
inner join C
on Q.rn - 1 = C.rn
where C.cc < 10
)
select top 10 *
from C
where 10 in (select cc from C)
order by rn desc
option (maxrecursion 0)

Version 2
As Martin Smith pointed out in a comment, the above query has really bad performance; the culprit is the first CTE. The version below uses a table variable to hold the ranked rows. The primary key on rn creates an index that is used by the join in the recursive part of the query. Apart from the table variable, this does the same as above.

declare @T table
(
PrimaryKey int,
Column1 datetime,
Column2 int,
rn int primary key
);

insert into @T
select PrimaryKey,
Column1,
Column2,
row_number() over(order by Column1, PrimaryKey) as rn
from MyTable;

with C as
(
select PrimaryKey,
Column1,
Column2,
rn,
1 as cc
from @T
where rn = 1
union all
select T.PrimaryKey,
T.Column1,
T.Column2,
T.rn,
case
when T.Column2 > C.Column2 then C.cc + 1
else 1
end
from @T as T
inner join C
on T.rn = C.rn + 1
where C.cc < 10
)
select top 10 *
from C
where 10 in (select cc from C)
order by rn desc
option (maxrecursion 0)

Stats of consecutive increasing values of a dataframe column

import pandas as pd
import numpy as np

s = pd.Series([1,2,3,4,1,2,3,3,2,1])

def ascends(s):
    # 1 where the step to the next value is non-decreasing, padded with a 0 at each end
    diff = np.r_[0, (np.diff(s.values)>=0).astype(int), 0]
    # +1 marks the start of an ascending run, -1 the position after its end
    diff2 = np.diff(diff)
    # positions flanked by two zeros are single-element (length-1) runs
    descends = np.where(np.logical_not(diff)[1:] & np.logical_not(diff)[:-1])[0]
    starts = np.sort(np.r_[np.where(diff2 > 0)[0], descends])
    ends = np.sort(np.r_[np.where(diff2 < 0)[0], descends])
    return ends - starts + 1

b = ascends(s)
print(b)
print(b.max())
print(b.min())
print(b.mean())


Output:

[4 4 1 1]
4
1
2.5

