Pandas selecting by label sometimes returns Series, sometimes returns DataFrame
Granted, the behavior is inconsistent, but I think it's easy to imagine cases where this is convenient. Anyway, to get a DataFrame every time, just pass a list to loc. There are other ways, but in my opinion this is the cleanest.
In [2]: type(df.loc[[3]])
Out[2]: pandas.core.frame.DataFrame
In [3]: type(df.loc[[1]])
Out[3]: pandas.core.frame.DataFrame
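As a minimal sketch of why the return type varies, assume a hypothetical frame in which label 1 is duplicated in the index and label 3 is unique (the column name and values are made up for illustration). A scalar label gives a Series or a DataFrame depending on duplication, while a list gives a DataFrame in both cases:

```python
import pandas as pd

# Hypothetical frame: label 1 appears twice in the index, label 3 once.
df = pd.DataFrame({"x": [10, 20, 30, 40]}, index=[1, 1, 2, 3])

unique_scalar = df.loc[3]    # unique label, scalar key -> Series
dup_scalar = df.loc[1]       # duplicated label, scalar key -> DataFrame
unique_list = df.loc[[3]]    # list key -> DataFrame
dup_list = df.loc[[1]]       # list key -> DataFrame

results = {
    "df.loc[3]": unique_scalar,
    "df.loc[1]": dup_scalar,
    "df.loc[[3]]": unique_list,
    "df.loc[[1]]": dup_list,
}
for expr, result in results.items():
    print(expr, type(result).__name__, result.shape)
```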
Different brackets in pandas DataFrame.loc
Although the first two are equivalent in output, the second is called chained indexing:
http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
In the second form, df.loc['Second'] on its own is a Series:
In[48]:
type(df.loc['Second'])
Out[48]: pandas.core.series.Series
You then index that Series by the column label, which returns the scalar value:
In[47]:
df.loc['Second']
Out[47]:
price    2
count    3
Name: Second, dtype: int32
In[49]:
df.loc['Second']['count']
Out[49]: 3
Regarding the last one, the additional brackets return a DataFrame, which is why you see the index value rather than a scalar value:
In[44]:
type(df.loc[['Second']])
Out[44]: pandas.core.frame.DataFrame
Passing the column then indexes this DataFrame and returns the matching column as a Series:
In[46]:
type(df.loc[['Second'],'count'])
Out[46]: pandas.core.series.Series
So it depends on what you want to achieve, but avoid the second form, as it can lead to unexpected behaviour when you try to assign to the column or DataFrame.
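The three forms can be compared side by side. This is a sketch only: the frame below is hypothetical, with values chosen so that the 'Second' row matches the output shown above. For assignment, it uses a single .loc call, which writes to df itself rather than to an intermediate object.

```python
import pandas as pd

# Hypothetical data; only the 'Second' row (price 2, count 3) mirrors the output above.
df = pd.DataFrame({"price": [1, 2], "count": [2, 3]}, index=["First", "Second"])

single_call = df.loc["Second", "count"]      # one indexing call -> scalar
chained = df.loc["Second"]["count"]          # Series first, then scalar
list_rows = df.loc[["Second"], "count"]      # list of rows + one column -> Series

print(single_call, chained)
print(type(list_rows).__name__, list_rows.index.tolist())

# Assigning through one .loc call modifies df directly.
df.loc["Second", "count"] = 30
updated = df.loc["Second", "count"]
print(updated)
```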
Series [] and .loc[] sometimes return a single value, and sometimes unexpectedly a single-element Series containing the same value
In my opinion the problem is duplicated index values: if the tuple returned by idxmax is duplicated in the index, the selection returns not a scalar but all of the duplicated rows.
A simple way to avoid this is to use the default index. Here, change:
df = pd.read_clipboard(sep='\t', index_col=[0, 1, 2, 3, 4], na_values='')
to:
df = pd.read_clipboard(sep='\t', na_values='')
This gives the default RangeIndex instead of a MultiIndex.
Check that the index is a RangeIndex:
print(df.index)
If you need the MultiIndex, the solution is to remove the duplicated values:
df = pd.read_clipboard(sep='\t', index_col=[0, 1, 2, 3, 4], na_values='')
df = df[~df.index.duplicated()]
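The effect can be reproduced without the clipboard. The sketch below assumes a small hypothetical Series whose two-level MultiIndex contains the tuple ('b', 2) twice; the names and values are illustrative and not taken from the original data.

```python
import pandas as pd

# Hypothetical Series: the tuple ('b', 2) appears twice in the MultiIndex.
idx = pd.MultiIndex.from_tuples([("a", 1), ("b", 2), ("b", 2)])
s = pd.Series([10, 30, 20], index=idx)

label = s.idxmax()            # label of the first maximum
with_dupes = s.loc[label]     # duplicated label -> Series holding both rows

# Keep only the first row for each index value, then select again.
deduped = s[~s.index.duplicated()]
scalar = deduped.loc[deduped.idxmax()]   # unique label -> scalar

print(label, type(with_dupes).__name__, len(with_dupes))
print(scalar)
```

Note that duplicated() keeps the first occurrence by default, so which of the duplicated rows survives depends on their order.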
Stop pandas dataframe from converting to vector
Actually, df.iloc[1, :] is not a pd.DataFrame, it is a pd.Series; you can check this with type(df.iloc[1, :]). So the notion of a row or a column does not make sense in this case.
To keep it as a pd.DataFrame, you could select a range of rows of length 1, df.iloc[1:2, :], or pass a list of positions, df.iloc[[1], :].
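A quick way to see the difference is to compare shapes. This is a small sketch on a hypothetical three-row frame:

```python
import pandas as pd

# Hypothetical data for illustration.
df = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]})

row_series = df.iloc[1, :]    # scalar position -> Series
row_slice = df.iloc[1:2, :]   # slice of length 1 -> DataFrame
row_list = df.iloc[[1], :]    # list of positions -> DataFrame

print(type(row_series).__name__, row_series.shape)
print(type(row_slice).__name__, row_slice.shape)
print(type(row_list).__name__, row_list.shape)
```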
Keep selected column as DataFrame instead of Series
As @Jeff mentions there are a few ways to do this, but I recommend using loc/iloc to be more explicit (and raise errors early if you're trying something ambiguous):
In [10]: df = pd.DataFrame([[1, 2], [3, 4]], columns=['A', 'B'])
In [11]: df
Out[11]:
   A  B
0  1  2
1  3  4
In [12]: df[['A']]
In [13]: df[[0]]  # positional fallback; recent pandas versions raise KeyError here
In [14]: df.loc[:, ['A']]
In [15]: df.iloc[:, [0]]
Out[12-15]: # they all return the same thing:
   A
0  1
1  3
The latter two choices remove ambiguity in the case of integer column names (precisely why loc/iloc were created). For example:
In [16]: df = pd.DataFrame([[1, 2], [3, 4]], columns=['A', 0])
In [17]: df
Out[17]:
   A  0
0  1  2
1  3  4
In [18]: df[[0]]  # ambiguous: is 0 a label or a position?
Out[18]:
   A
0  1
1  3
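With the same frame, loc and iloc make the intent explicit. A short sketch showing that the label 0 and the position 0 refer to different columns here:

```python
import pandas as pd

df = pd.DataFrame([[1, 2], [3, 4]], columns=["A", 0])

by_label = df.loc[:, [0]]       # label 0 -> the column holding 2 and 4
by_position = df.iloc[:, [0]]   # position 0 -> column 'A' holding 1 and 3

print(by_label.columns.tolist(), by_label.iloc[:, 0].tolist())
print(by_position.columns.tolist(), by_position.iloc[:, 0].tolist())
```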
Pandas dataframe row extraction is changing dimensions
To get the row as a DataFrame, you need to use:
csv_row1 = csv.loc[[0]]
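The change in dimensions is visible in ndim and shape. In this sketch the CSV content is hypothetical and is read from an in-memory string in place of the original file:

```python
import io

import pandas as pd

# Hypothetical CSV content standing in for the original file.
csv = pd.read_csv(io.StringIO("a,b,c\n1,2,3\n4,5,6\n"))

as_series = csv.loc[0]     # scalar label -> 1-D Series
csv_row1 = csv.loc[[0]]    # list of labels -> 2-D DataFrame, columns preserved

print(as_series.ndim, as_series.shape)
print(csv_row1.ndim, csv_row1.shape)
```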
Pandas series: Only keep the first entry that contains a given character (comma)
If you really want to work with Series methods, the approach would be:
series[series.str.contains(',')].iloc[0]
However, this requires checking all elements, just to keep one.
A much more efficient approach (depending on the exact data, there might be edge cases where this isn't true) would be to use filter and next to get the first element. This is more than 100 times faster on the provided example.
next(filter(lambda x: ',' in x, series))
Output: '3,360,003|'
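Both approaches can be compared on a small example. The Series below is hypothetical, except for the value '3,360,003|' taken from the output above. Passing a default to next is an addition of this sketch: it avoids a StopIteration when no element contains a comma.

```python
import pandas as pd

# Hypothetical data; only '3,360,003|' comes from the output above.
series = pd.Series(["abc", "3,360,003|", "x,y", "def"])

# Series methods: builds a boolean mask over every element first.
via_mask = series[series.str.contains(",")].iloc[0]

# Lazy iteration: stops at the first match; None if nothing matches.
via_next = next(filter(lambda x: "," in x, series), None)

no_match = next(filter(lambda x: "," in x, pd.Series(["abc", "def"])), None)

print(via_mask, via_next, no_match)
```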