Get first row value of a given column
To select the ith row, use iloc:
In [31]: df_test.iloc[0]
Out[31]:
ATime 1.2
X 2.0
Y 15.0
Z 2.0
Btime 1.2
C 12.0
D 25.0
E 12.0
Name: 0, dtype: float64
To select the ith value in the Btime column you could use:
In [30]: df_test['Btime'].iloc[0]
Out[30]: 1.2
There is a difference between df_test['Btime'].iloc[0] (recommended) and df_test.iloc[0]['Btime']:
DataFrames store data in column-based blocks (where each block has a single
dtype). If you select by column first, a view can be returned (which is
quicker than returning a copy) and the original dtype is preserved. In contrast,
if you select by row first, and if the DataFrame has columns of different
dtypes, then Pandas copies the data into a new Series of object dtype. So
selecting columns is a bit faster than selecting rows. Thus, although df_test.iloc[0]['Btime'] works, df_test['Btime'].iloc[0] is a little more efficient.
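The dtype point above can be seen directly. A minimal sketch with made-up mixed-dtype data (column names are assumptions):

```python
import pandas as pd

# A frame with mixed dtypes: one float column, one string column
df = pd.DataFrame({'Btime': [1.2, 1.3], 'label': ['a', 'b']})

# Column-first: the Series keeps the column's own dtype
print(df['Btime'].dtype)   # float64

# Row-first: mixed dtypes force a single object-dtype Series
print(df.iloc[0].dtype)    # object
```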
There is a big difference between the two when it comes to assignment. df_test['Btime'].iloc[0] = x affects df_test, but df_test.iloc[0]['Btime'] may not. See below for an explanation of why. Because a subtle difference in the order of indexing makes a big difference in behavior, it is better to use single indexing assignment:
df.iloc[0, df.columns.get_loc('Btime')] = x
df.iloc[0, df.columns.get_loc('Btime')] = x
(recommended):
The recommended way to assign new values to a
DataFrame is to avoid chained indexing, and instead use the method shown by
andrew,
df.loc[df.index[n], 'Btime'] = x
or
df.iloc[n, df.columns.get_loc('Btime')] = x
The latter method is a bit faster, because df.loc has to convert the row and column labels to positional indices, so there is a little less conversion necessary if you use df.iloc instead.
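Both recommended forms can be sketched together; the data here is made up for illustration:

```python
import pandas as pd

# Non-trivial integer index, so position and label differ
df = pd.DataFrame({'Btime': [1.2, 1.3, 1.4]}, index=[0, 2, 1])

# Positional: row 1, column located by name -- no chained indexing
df.iloc[1, df.columns.get_loc('Btime')] = 99.0

# Label-based equivalent: df.index[1] is the label of the 2nd row (here 2)
df.loc[df.index[1], 'Btime'] = 99.0

print(df)
```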
df['Btime'].iloc[0] = x works, but is not recommended:
Although this works, it is taking advantage of the way DataFrames are currently implemented. There is no guarantee that Pandas has to work this way in the future. In particular, it is taking advantage of the fact that (currently) df['Btime'] always returns a view (not a copy), so df['Btime'].iloc[n] = x can be used to assign a new value at the nth location of the Btime column of df.
Since Pandas makes no explicit guarantee about when indexers return a view versus a copy, assignments that use chained indexing generally raise a SettingWithCopyWarning, even though in this case the assignment succeeds in modifying df:
In [22]: df = pd.DataFrame({'foo':list('ABC')}, index=[0,2,1])
In [24]: df['bar'] = 100
In [25]: df['bar'].iloc[0] = 99
/home/unutbu/data/binky/bin/ipython:1: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame
See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
self._setitem_with_indexer(indexer, value)
In [26]: df
Out[26]:
foo bar
0 A 99 <-- assignment succeeded
2 B 100
1 C 100
df.iloc[0]['Btime'] = x does not work:
In contrast, assignment with df.iloc[0]['bar'] = 123 does not work because df.iloc[0] is returning a copy:
In [66]: df.iloc[0]['bar'] = 123
/home/unutbu/data/binky/bin/ipython:1: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame
See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
In [67]: df
Out[67]:
foo bar
0 A 99 <-- assignment failed
2 B 100
1 C 100
Warning: I had previously suggested df_test.ix[i, 'Btime']. But this is not guaranteed to give you the ith value, since ix tries to index by label before trying to index by position. So if the DataFrame has an integer index which is not in sorted order starting at 0, then using ix[i] will return the row labeled i rather than the ith row. For example,
In [1]: df = pd.DataFrame({'foo':list('ABC')}, index=[0,2,1])
In [2]: df
Out[2]:
foo
0 A
2 B
1 C
In [4]: df.ix[1, 'foo']
Out[4]: 'C'
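Note that .ix was removed in pandas 1.0. With the same frame, iloc and loc make the position/label split explicit:

```python
import pandas as pd

# Integer index that is not sorted starting at 0
df = pd.DataFrame({'foo': list('ABC')}, index=[0, 2, 1])

# Position-based: always the ith row, regardless of the index labels
print(df['foo'].iloc[1])   # 'B'

# Label-based: the row labeled 1
print(df.loc[1, 'foo'])    # 'C'
```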
pandas extract first row column value equal to 1 for each group
First filter only the rows with label=1, then remove duplicates per id with DataFrame.drop_duplicates:
df1 = df[df['label'].eq(1)].drop_duplicates('id')
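A quick sketch with made-up data (the id/label values are assumptions, not from the question):

```python
import pandas as pd

# Repeated ids with a 0/1 label
df = pd.DataFrame({'id':    [1, 1, 2, 2, 3],
                   'label': [0, 1, 1, 1, 0]})

# Keep label==1 rows, then the first such row per id
df1 = df[df['label'].eq(1)].drop_duplicates('id')
print(df1)
```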
Get first row of dataframe in Python Pandas based on criteria
This tutorial is a very good one for pandas slicing. Make sure you check it out. Onto some snippets... To slice a dataframe with a condition, you use this format:
>>> df[condition]
This will return a slice of your dataframe which you can index using iloc. Here are your examples:
Get first row where A > 3 (returns row 2)
>>> df[df.A > 3].iloc[0]
A 4
B 6
C 3
Name: 2, dtype: int64
If what you actually want is the row number, rather than using iloc, it would be df[df.A > 3].index[0].
Get first row where A > 4 AND B > 3:
>>> df[(df.A > 4) & (df.B > 3)].iloc[0]
A 5
B 4
C 5
Name: 4, dtype: int64
Get first row where A > 3 AND (B > 3 OR C > 2) (returns row 2)
>>> df[(df.A > 3) & ((df.B > 3) | (df.C > 2))].iloc[0]
A 4
B 6
C 3
Name: 2, dtype: int64
Now, with your last case we can write a function that handles the default case of returning the descending-sorted frame:
>>> def series_or_default(X, condition, default_col, ascending=False):
... sliced = X[condition]
... if sliced.shape[0] == 0:
... return X.sort_values(default_col, ascending=ascending).iloc[0]
... return sliced.iloc[0]
>>>
>>> series_or_default(df, df.A > 6, 'A')
A 5
B 4
C 5
Name: 4, dtype: int64
As expected, it returns row 4.
Get the first row of each group of unique values in another column
Use groupby
+ first
:
firsts = df.groupby('col_B', as_index=False).first()
Output:
>>> firsts
col_B col_A
0 x 1
1 xx 2
2 y 4
If the order of the columns is important:
firsts = df.loc[df.groupby('col_B', as_index=False).first().index]
Output:
>>> firsts
col_A col_B
0 1 x
1 2 xx
2 3 xx
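An alternative that keeps both the original column order and the original row labels is drop_duplicates, which is not used in the answer above; sketched here with made-up data consistent with the first output:

```python
import pandas as pd

df = pd.DataFrame({'col_A': [1, 2, 3, 4],
                   'col_B': ['x', 'xx', 'xx', 'y']})

# First row of each col_B value, columns left in their original order
firsts = df.drop_duplicates('col_B')
print(firsts)
```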
Find next first row meeting a condition after a specific row in pandas
You can use:
g = df['first'].ne(df['first'].shift()).cumsum().loc[~df['first']]
# or
# g = df['first'].cumsum()[~df['first']]
out = df[df['second']].groupby(g).head(1)
Output:
first second
1 False True
4 False True
Intermediate grouper g
:
1 2
3 4
4 4
5 4
7 6
Name: first, dtype: int64
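The full pipeline can be reproduced end to end. The boolean data below is reconstructed to match the outputs shown above (an assumption, since the question's frame is not included here):

```python
import pandas as pd

df = pd.DataFrame({
    'first':  [True, False, True, False, False, False, True, False],
    'second': [False, True, False, False, True, True, False, False],
})

# Each consecutive run of equal 'first' values gets a run id;
# keep only the ids of rows where 'first' is False
g = df['first'].ne(df['first'].shift()).cumsum().loc[~df['first']]

# Within each such run, take the first row where 'second' is True
out = df[df['second']].groupby(g).head(1)
print(out)
```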
What is the correct way to get the first row of a dataframe?
To get the first and last element of the column, your option is already the most efficient/correct way. If you're interested in this topic, I can recommend you to read this other Stackoverflow answer: https://stackoverflow.com/a/25254087/8294752
To get the first row, I personally prefer to use DataFrame.head(1), therefore for your code something like this:
df_first_row = sub_df.head(1)
I didn't look into how the head() method is defined in Pandas and its performance implications, but in my opinion it improves readability and reduces some potential confusion with indexes.
In other examples you might also find sub_df.iloc[0], but this option will return a pandas.Series whose index is the DataFrame's column names. sub_df.head(1) will return a 1-row DataFrame instead, which is the same result as sub_df.iloc[0:1, :].
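The difference in return type is easy to check; sub_df below is a made-up stand-in:

```python
import pandas as pd

sub_df = pd.DataFrame({'a': [10, 20], 'b': [30, 40]})

first_as_frame = sub_df.head(1)    # 1-row DataFrame, keeps column structure
first_as_series = sub_df.iloc[0]   # Series, indexed by the column names

print(type(first_as_frame).__name__)   # DataFrame
print(type(first_as_series).__name__)  # Series
```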
Filter dataframe rows based on return value of foo() applied to first column
TLDR
mask = df.apply(lambda row: foo(row['Path']), axis=1)
res: pd.DataFrame = df[mask]
Solution
To filter the rows of a DataFrame according to the return value of foo(str: str) -> bool applied to the values contained in column Path of each row, the solution is to generate a mask with pandas.DataFrame.apply().
How does a mask work?
The mask works as follows: given a dataframe df: pd.DataFrame and a mask: pd.Series of booleans, accessing with square brackets, df[mask], will result in a new DataFrame containing only the rows corresponding to a True value in the mask series.
How to get the mask
Since df.apply(function, axis, ...) takes a function as input, one would be tempted to pass foo() as the argument of apply(), but this is wrong. The function argument of apply() must take a pd.Series as argument, not a string. Therefore the correct way to get the mask is the following, where axis=1 indicates that the lambda is applied to every row of the dataframe rather than to every column.
mask = df.apply(lambda row: foo(row['Path']), axis=1)
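Putting it together with a hypothetical foo (the predicate and the data are assumptions for illustration):

```python
import pandas as pd

# Hypothetical predicate: keep rows whose Path ends with '.csv'
def foo(path: str) -> bool:
    return path.endswith('.csv')

df = pd.DataFrame({'Path': ['a.csv', 'b.txt', 'c.csv'],
                   'size': [1, 2, 3]})

# Build the boolean mask row by row, then filter
mask = df.apply(lambda row: foo(row['Path']), axis=1)
res = df[mask]
print(res)
```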
How to select the first row of a specific column in a list in R
I don't think you need a list or loop. Here's my suggestion, based on my understanding of your issue:
First, define your intervals based on breaks using cut()
:
library(dplyr)
# Define breaks for intervals ("groups")
breaks <- seq(0, 0.28, by = 0.02)
# Group data using breaks
dat$interval <- cut(dat$Time, breaks = breaks, right = FALSE, labels = FALSE)
Now use dplyr to manipulate your data based on your condition: "find the highest velocity value in each interval of time BUT only IF it is greater than the value in the previous time interval."
dat2 <- dat %>%
# For each group, as defined by an interval
group_by(interval) %>%
# Get the maximum velocity value
mutate(interval_max = max(velocity)) %>%
# Now ungroup and iterate across rows
ungroup %>%
# If the previous value is greater than the current value,
# keep the previous value, if not, keep the current value
mutate(interval_max_cond = ifelse(
lag(interval_max) > interval_max,
lag(interval_max),
interval_max))
Let me know if you have any questions!
data:
dat <- structure(list(Time = c(0, 0.0099998, 0.0200003, 0.0300001, 0.0399999,
0.0499997, 0.0600002, 0.07, 0.0799998, 0.0900003, 0.1000001,
0.1099999, 0.1199997, 0.1300002, 0.14, 0.1499998, 0.1600003,
0.1700001, 0.1799999, 0.1899997, 0.2000002, 0.21, 0.2199998,
0.2300003, 0.2400001, 0.2499999, 0.2599997), velocity = c(0.3444447,
0.3444447, 0.3444447, 0.3444447, 0.3444447, 0.3444447, 0.3444447,
0.3444447, 0.3444447, 0.3444447, 0.3444447, 0.3444447, 0.3444447,
0.3444447, 0.3444447, 0.3444447, 0.3444447, 0.3444447, 0.3444447,
0.441667, 0.441667, 0.441667, 0.441667, 0.441667, 0.6222227,
0.6222227, 0.6222227)), class = "data.frame", row.names = c(NA,
-27L))
Pandas, get first and last column index for row value
You can first transpose the DataFrame with DataFrame.T, then aggregate the minimal and maximal index per consecutive group, convert the values to strings with Series.dt.strftime, and finally convert to dictionaries with DataFrame.to_dict.
To get consecutive groups, shifted values are compared and the result accumulated with Series.cumsum.
df1 = df.T.reset_index()
L = [df1.groupby(df1[x].ne(df1[x].shift()).cumsum())
.agg(value=(x, 'first'),
start=('index', 'min'),
end=('index', 'max'))
.assign(start=lambda x: x['start'].dt.strftime('%Y-%m-%d'),
end=lambda x: x['end'].dt.strftime('%Y-%m-%d'))
.to_dict(orient='records') for x in df1.columns.drop('index')]
print (L)
[[{'value': 0, 'start': '2022-05-21', 'end': '2022-05-31'},
{'value': 2, 'start': '2022-06-01', 'end': '2022-06-19'},
{'value': 5, 'start': '2022-06-20', 'end': '2022-06-30'}],
[{'value': 5, 'start': '2022-05-21', 'end': '2022-05-31'},
{'value': 2, 'start': '2022-06-01', 'end': '2022-06-19'},
{'value': 1, 'start': '2022-06-20', 'end': '2022-06-30'}],
[{'value': 5, 'start': '2022-05-21', 'end': '2022-05-31'},
{'value': 2, 'start': '2022-06-01', 'end': '2022-06-19'},
{'value': 5, 'start': '2022-06-20', 'end': '2022-06-30'}]]