Find first and last occurrence of an item in R dataframe
Making a few assumptions about your data:
- week is numeric
- item is always associated with at least one week (no NA weeks)
- "last" is equivalent to "largest value" for week
Then this dplyr solution should work:
library(dplyr)
df %>%
  group_by(item) %>%
  summarise(diff = max(week) - min(week)) %>%
  ungroup()
# A tibble: 2 x 2
item diff
<int> <dbl>
1 63230 2
2 63233 2
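For readers working in pandas rather than dplyr, the same idea can be sketched roughly as below. The data frame here is made up for illustration: only the item values 63230 and 63233 come from the output above, and the week values and the name spans are assumptions.

```python
import pandas as pd

# Hypothetical data: the week values are invented for illustration.
df = pd.DataFrame({
    "item": [63230, 63230, 63230, 63233, 63233],
    "week": [1, 2, 3, 4, 6],
})

# Per item: largest week minus smallest week, under the same
# assumptions as above (numeric week, no missing weeks).
spans = (
    df.groupby("item")["week"]
      .agg(lambda w: w.max() - w.min())
      .rename("diff")
      .reset_index()
)
print(spans)
```

As with the dplyr version, this treats "first" and "last" as the smallest and largest week, not as the order of the rows.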
How can I find the first and last occurrences of an element in a data.frame?
You can do this with duplicated and rev (for LAST):
> v1=c(1,1,1,2,2,3,3,3,3,4,4,5)
> data.frame(v1,FIRST=!duplicated(v1),LAST=rev(!duplicated(rev(v1))))
v1 FIRST LAST
1 1 TRUE FALSE
2 1 FALSE FALSE
3 1 FALSE TRUE
4 2 TRUE FALSE
5 2 FALSE TRUE
6 3 TRUE FALSE
7 3 FALSE FALSE
8 3 FALSE FALSE
9 3 FALSE TRUE
10 4 TRUE FALSE
11 4 FALSE TRUE
12 5 TRUE TRUE
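A pandas sketch of the same trick, for comparison: Series.duplicated takes a keep argument, so no reversing is needed. The name flags is just a placeholder for this example.

```python
import pandas as pd

v1 = pd.Series([1, 1, 1, 2, 2, 3, 3, 3, 3, 4, 4, 5])

flags = pd.DataFrame({
    "v1": v1,
    # keep="first" marks every repeat after the first occurrence as a duplicate
    "FIRST": ~v1.duplicated(keep="first"),
    # keep="last" marks every occurrence before the last one as a duplicate
    "LAST": ~v1.duplicated(keep="last"),
})
print(flags)
```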
pandas - find first occurrence
idxmax returns the index label, and argmax the position, of the maximal value, or of its first occurrence if the maximal value occurs more than once.
Because True sorts above False, applying idxmax to df.A.ne('a') finds the first row where A is not 'a':
df.A.ne('a').idxmax()
3
Or the numpy equivalent:
(df.A.values != 'a').argmax()
3
However, if A has already been sorted, then we can use searchsorted:
df.A.searchsorted('a', side='right')
array([3])
Or the numpy equivalent:
df.A.values.searchsorted('a', side='right')
3
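One caveat with the idxmax approach is worth a small sketch: if no row satisfies the condition, the boolean mask is all False and idxmax still returns the first label. The data below is hypothetical (the question's data frame is not shown here), and the guard with any() is one possible way to handle that case, not the only one.

```python
import pandas as pd

# Hypothetical column standing in for df.A.
df = pd.DataFrame({"A": ["a", "a", "a", "b", "b"]})

mask = df["A"].ne("a")
# Guard: only trust idxmax when at least one row matches.
first_not_a = mask.idxmax() if mask.any() else None

# Without the guard, an all-False mask silently yields the first label.
all_a = pd.Series(["a", "a"])
unguarded = all_a.ne("a").idxmax()

print(first_not_a, unguarded)
```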
How to obtain first and last occurrence of an item in pandas
You can try groupby and then apply a custom function f, like:
import pandas as pd

def f(x):
    # Column-wise min and max over the rows where each sensor fired
    Doormin = x[x['Door'] == 1].min()
    Doormax = x[x['Door'] == 1].max()
    Coaster2min = x[x['Coaster2'] == 1].min()
    Coaster2max = x[x['Coaster2'] == 1].max()
    Coaster1min = x[x['Coaster1'] == 1].min()
    Coaster1max = x[x['Coaster1'] == 1].max()
    # One summary row per sensor: first and last SensorTime
    Door = pd.Series([Doormin['Door'], Doormin['SensorDate'], Doormin['SensorTime'], Doormax['SensorTime'], Doormin['RegisteredTime']], index=['Door','SensorDate','SensorTimeFirst','SensorTimeLast','RegisteredTime'])
    Coaster1 = pd.Series([Coaster1min['Coaster1'], Coaster1min['SensorDate'], Coaster1min['SensorTime'], Coaster1max['SensorTime'], Coaster1min['RegisteredTime']], index=['Coaster1','SensorDate','SensorTimeFirst','SensorTimeLast','RegisteredTime'])
    Coaster2 = pd.Series([Coaster2min['Coaster2'], Coaster2min['SensorDate'], Coaster2min['SensorTime'], Coaster2max['SensorTime'], Coaster2min['RegisteredTime']], index=['Coaster2','SensorDate','SensorTimeFirst','SensorTimeLast','RegisteredTime'])
    return pd.DataFrame([Door, Coaster2, Coaster1])

print(df.groupby(['User','Activity']).apply(f))
Coaster1 Coaster2 Door RegisteredTime \
User Activity
Chris coffee + hot water 0 NaN NaN 1 13:09:00
1 NaN 1 NaN 13:09:00
2 NaN NaN NaN NaN
SensorDate SensorTimeFirst SensorTimeLast
User Activity
Chris coffee + hot water 0 2015-09-21 13:05:54 13:05:56
1 2015-09-21 13:05:58 13:05:59
2 NaN NaN NaN
And maybe you can fill in 0 instead of NaN with fillna:
df = df.groupby(['User','Activity']).apply(f)
df[['Coaster1','Coaster2','Door']] = df[['Coaster1','Coaster2','Door']].fillna(0)
print(df)
Coaster1 Coaster2 Door RegisteredTime \
User Activity
Chris coffee + hot water 0 0 0 1 13:09:00
1 0 1 0 13:09:00
2 0 0 0 NaN
SensorDate SensorTimeFirst SensorTimeLast
User Activity
Chris coffee + hot water 0 2015-09-21 13:05:54 13:05:56
1 2015-09-21 13:05:58 13:05:59
2 NaN NaN NaN
Within rows of data frame, find first occurrence and longest sequence of value
Edit #2: Rewrote as combination of two summarizations.
library(dplyr)
library(tidyr)

# Reshape to long format: one row per ID/column pair
input_tidy <- input %>%
  gather(col, val, -ID) %>%
  group_by(ID) %>%
  arrange(ID) %>%
  mutate(col_num = row_number() + 1)

input[,1] %>%
  # Combine with summary of each ID's first zero
  left_join(input_tidy %>% filter(val == 0) %>%
              summarize(first_0_name = first(col),
                        first_0_loc = first(col_num))) %>%
  # Combine with length of each ID's first post-0 streak of 1's
  left_join(input_tidy %>%
              filter(val == 1 & cumsum(val == 1 & lag(val, default = 1) == 0) == 1) %>%
              summarize(streak_1 = n()))
# A tibble: 10 x 4
ID first_0_name first_0_loc streak_1
<chr> <chr> <dbl> <int>
1 A i9 10 5
2 B i4 5 4
3 C i6 7 8
4 D i8 9 4
5 E i9 10 5
6 F NA NA NA
7 G i1 2 5
8 H i3 4 8
9 I i2 3 NA
10 J i3 4 2
find the first occurrence of a value (from a list of values)in a pandas dataframe and return the index of the row
We can use stack and then drop_duplicates:
out = df.loc[:,'N2':].stack().drop_duplicates()
0 N2 12
N3 14
N4 40
N5 42
1 N2 5
N3 24
N4 43
N5 45
2 N2 23
N3 28
N4 38
N5 49
3 N2 11
N3 22
N5 41
4 N2 27
N3 30
N4 46
dtype: int64
Extract rows for the first occurrence of a variable in a data frame
t.first <- species[match(unique(species$Taxa), species$Taxa),]
should give you what you're looking for. match returns the index of the first match for each value, which gives you the rows you need.
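For comparison, a rough pandas counterpart keeps the first row for each distinct value with drop_duplicates. The species table below is invented for illustration; only the column name Taxa comes from the answer above.

```python
import pandas as pd

# Hypothetical data: values and the Count column are made up.
species = pd.DataFrame({
    "Taxa": ["oak", "elm", "oak", "ash", "elm"],
    "Count": [3, 5, 7, 2, 9],
})

# Keep only the first row in which each Taxa value appears.
t_first = species.drop_duplicates(subset="Taxa", keep="first")
print(t_first)
```

The original row labels are preserved, so t_first.index shows where each first occurrence sat in the full table.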
Python pandas get first and last index, duplicate if first is also the last, of group in data frame
Using pd.concat
pd.concat([d.iloc[[0, -1]] for _, d in df.groupby('ID')])
ID Date
0 A 1/1/2015
2 A 1/3/2017
3 B 1/3/2017
3 B 1/3/2017
4 C 1/5/2016
5 C 1/7/2016
Using agg
df.groupby('ID').agg(['first', 'last']).stack().reset_index('ID')
ID Date
first A 1/1/2015
last A 1/3/2017
first B 1/3/2017
last B 1/3/2017
first C 1/5/2016
last C 1/7/2016
Access index of last element in data frame
The former answer is now superseded by .iloc:
>>> df = pd.DataFrame({"date": range(10, 64, 8)})
>>> df.index += 17
>>> df
date
17 10
18 18
19 26
20 34
21 42
22 50
23 58
>>> df["date"].iloc[0]
10
>>> df["date"].iloc[-1]
58
The shortest way I can think of uses .iget() (available only in old pandas versions):
>>> df = pd.DataFrame({"date": range(10, 64, 8)})
>>> df.index += 17
>>> df
date
17 10
18 18
19 26
20 34
21 42
22 50
23 58
>>> df['date'].iget(0)
10
>>> df['date'].iget(-1)
58
Alternatively:
>>> df['date'][df.index[0]]
10
>>> df['date'][df.index[-1]]
58
There's also .first_valid_index() and .last_valid_index(), but depending on whether or not you want to rule out NaNs, they might not be what you want.
Remember that df.ix[0] doesn't give you the first row, but the one whose index label is 0. For example, in the above case, df.ix[0] would produce:
>>> df.ix[0]
Traceback (most recent call last):
File "<ipython-input-489-494245247e87>", line 1, in <module>
df.ix[0]
[...]
KeyError: 0
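Since .ix was removed in pandas 1.0, the same distinction can be sketched on current versions with .iloc (position) and .loc (label). The variable names below are placeholders for this example.

```python
import pandas as pd

df = pd.DataFrame({"date": range(10, 64, 8)})
df.index += 17  # labels now run from 17 to 23

# Position-based access ignores the labels entirely.
first_by_position = df["date"].iloc[0]
last_by_position = df["date"].iloc[-1]

# The label of the last row, if the index itself is what you need.
last_label = df.index[-1]

# Label-based access fails because no row is labelled 0.
try:
    df.loc[0]
    label_zero_found = True
except KeyError:
    label_zero_found = False

print(first_by_position, last_by_position, last_label, label_zero_found)
```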