Difference between a Pandas Series and a Single-Column DataFrame

What is the difference between a pandas Series and a single-column DataFrame?

Quoting the pandas docs:

pandas.DataFrame(data=None, index=None, columns=None, dtype=None, copy=False)

Two-dimensional size-mutable, potentially heterogeneous tabular data structure with labeled axes (rows and columns). Arithmetic operations align on both row and column labels. Can be thought of as a dict-like container for Series objects. The primary pandas data structure.

So the Series is the data structure for a single column of a DataFrame. This holds conceptually, and it is also how the API behaves: selecting one column of a DataFrame returns a Series, and the DataFrame acts as a dict-like container of Series that share one index.

Analogously: we need both lists and matrices, because matrices are built from lists. Single-row matrices, while functionally equivalent to lists, still cannot exist without the list(s) they're composed of.

They both have extremely similar APIs, but you'll find that DataFrame methods always cater to the possibility that you have more than one column. And, of course, you can always add another Series (or equivalent object) to a DataFrame, while adding a Series to another Series involves creating a DataFrame.
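As a rough illustration of the points above, here is a minimal sketch using made-up values; the names s, df_one, and combined are only for this example.

```python
import pandas as pd

s = pd.Series([1, 3], name="A")   # one-dimensional
df_one = s.to_frame()             # single-column DataFrame built from the Series

print(s.ndim, s.shape)            # 1 (2,)
print(df_one.ndim, df_one.shape)  # 2 (2, 1)

# Adding another Series to a DataFrame keeps it a DataFrame.
df_one["B"] = pd.Series([2, 4])

# Putting two Series side by side produces a DataFrame.
combined = pd.concat([s, pd.Series([2, 4], name="B")], axis=1)
print(type(combined).__name__)    # DataFrame
```

The shapes make the difference visible: the Series has a single axis, while the single-column DataFrame still carries a column axis of length one.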

Keep selected column as DataFrame instead of Series

As @Jeff mentions there are a few ways to do this, but I recommend using loc/iloc to be more explicit (and raise errors early if you're trying something ambiguous):

In [10]: df = pd.DataFrame([[1, 2], [3, 4]], columns=['A', 'B'])

In [11]: df
Out[11]:
   A  B
0  1  2
1  3  4

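In [12]: df[['A']]

In [13]: df[[0]]  # worked when this was written; current pandas raises KeyError because no column is labelled 0

In [14]: df.loc[:, ['A']]

In [15]: df.iloc[:, [0]]

Out[12, 14, 15]: # these all return the same thing:
   A
0  1
1  3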

The latter two choices remove ambiguity in the case of integer column names (precisely why loc/iloc were created). For example:

In [16]: df = pd.DataFrame([[1, 2], [3, 4]], columns=['A', 0])

In [17]: df
Out[17]:
   A  0
0  1  2
1  3  4

In [18]: df[[0]] # ambiguous: position 0 or label 0? Current pandas reads it as the label
Out[18]:
   0
0  2
1  4
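To make the explicit alternatives concrete, here is a small sketch on the same frame; by_label and by_position are illustrative names.

```python
import pandas as pd

df = pd.DataFrame([[1, 2], [3, 4]], columns=["A", 0])

by_label = df.loc[:, [0]]      # the column whose label is 0
by_position = df.iloc[:, [0]]  # the first column, whatever its label

print(by_label.columns.tolist())     # [0]
print(by_position.columns.tolist())  # ['A']
```

Both results are one-column DataFrames, and the indexer itself states whether 0 is meant as a label or as a position.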

Difference between Series & Data Frame

In SQL terms, you can think of a Series as a single column and the DataFrame as the whole table.

Row Series vs Col Series in Pandas

If you check the docs for DataFrame:

DataFrame is a 2-dimensional labeled data structure with columns of potentially different types. You can think of it like a spreadsheet or SQL table, or a dict of Series objects

If you check the docs for Series:

Series is a one-dimensional labeled array capable of holding any data type (integers, strings, floating point numbers, Python objects, etc.).

So if you select a single index label or a single column label (and that label is not duplicated), you always get a Series.

I don't think there are many differences between the two kinds of Series (row or column). The obvious one arises for a general DataFrame whose columns (Series) have different types, as here: the column ReleaseYear is filled with integers, while the other two columns are filled with strings.

So if you check Series.dtype, you will see a difference. Within a column, all values share one type, so the dtype is object (which here means strings) or an integer dtype. A Series taken from a row holds mixed types: the first value is a string, the second an integer, and the third a string, so the resulting dtype is object. You can verify this by testing each value separately with .apply(type), as in the output at the end of this answer.

Note:

If all columns have the same type, there is no such difference.

Note 2:

Of course, it is also possible to create a Series filled with mixed data; a Series created from such a column then has object dtype too, just like a Series created from a row.

year_column = df['ReleaseYear']
print(year_column)
0    1997
1    2002
Name: ReleaseYear, dtype: int64

print(type(year_column))
<class 'pandas.core.series.Series'>

print(year_column.dtype)
int64

print(year_column.apply(type))
0    <class 'int'>
1    <class 'int'>
Name: ReleaseYear, dtype: object


row_one = df.loc[0]
print(row_one)
Title                Titanic
ReleaseYear             1997
Director       James Cameron
Name: 0, dtype: object

print(type(row_one))
<class 'pandas.core.series.Series'>

print(row_one.dtype)
object

print(row_one.apply(type))
Title                  <class 'str'>
ReleaseYear    <class 'numpy.int64'>
Director               <class 'str'>
Name: 0, dtype: object
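The frame behind that output is not shown, so the sketch below rebuilds something similar with placeholder values (the second title and director are invented for illustration). It also checks the note about frames whose columns all share one type.

```python
import pandas as pd

# Hypothetical data shaped like the example; the second row is a placeholder.
movies = pd.DataFrame({
    "Title": ["Titanic", "Example Film"],
    "ReleaseYear": [1997, 2002],
    "Director": ["James Cameron", "Example Director"],
})

col = movies["ReleaseYear"]  # column Series: one type throughout
row = movies.loc[0]          # row Series: strings and an integer mixed

print(col.dtype)  # int64
print(row.dtype)  # object

# When every column has the same dtype, a row keeps that dtype as well.
numeric = pd.DataFrame({"a": [1, 2], "b": [3, 4]})
print(numeric.loc[0].dtype)  # int64
```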

Why do I get Pandas data frame with only one column vs Series?

In general, a one-column DataFrame will be returned when the operation could return a multicolumn DataFrame. For instance, when you use a boolean column index, a multicolumn DataFrame would have to be returned if there was more than one True value, so a DataFrame will always be returned, even if it has only one column. Likewise when setting an index, if your DataFrame had more than two columns, the result would still have to be a DataFrame after removing one for the index, so it will still be a DataFrame even if it has only one column left.

In contrast, if you do something like df.loc[:, 'col'], it returns a Series, because there is no way that passing one column name to select can ever select more than one column.

The idea is that doing an operation should not sometimes return a DataFrame and sometimes a Series based on features specific to the operands (i.e., how many columns they happen to have, how many values are True in your boolean mask). When you do df.set_index('col'), it's simpler if you know that you will always get a DataFrame, without having to worry about how many columns the original happened to have.

Note that there is also the DataFrame method .squeeze() for turning a one-column DataFrame into a Series.
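The cases described above can be sketched on a small hypothetical frame; the column names col and other and the variable names are assumptions made for this example.

```python
import pandas as pd

df = pd.DataFrame({"col": [1, 2], "other": [3, 4]})

# A boolean column mask could match several columns, so the result is a DataFrame
# even when only one entry is True.
one_col = df.loc[:, [True, False]]
print(type(one_col).__name__)   # DataFrame

# set_index returns a DataFrame even when a single column remains.
indexed = df.set_index("other")
print(type(indexed).__name__)   # DataFrame

# A single column label can only ever match one column, so this is a Series.
series = df.loc[:, "col"]
print(type(series).__name__)    # Series

# squeeze collapses the length-one column axis into a Series.
squeezed = one_col.squeeze("columns")
print(type(squeezed).__name__)  # Series
```

Passing "columns" to squeeze restricts it to the column axis, so a frame with one row and one column would still come back as a Series rather than a scalar.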

Comparing two dataframes and getting the differences

This approach, df1 != df2, works only for DataFrames with identical rows and columns. In fact, all DataFrame axes are compared with the _indexed_same method, and an exception is raised if differences are found, even in the order of columns or index labels.
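A minimal sketch of that behaviour, using small hypothetical frames (a, b, and same_labels are illustrative names, not taken from the question):

```python
import pandas as pd

a = pd.DataFrame({"x": [1, 2]})
b = pd.DataFrame({"x": [1, 2, 3]})          # one extra row, so the index differs
same_labels = pd.DataFrame({"x": [1, 5]})   # same index and columns as `a`

# Identical labels: element-wise comparison returns a boolean DataFrame.
changed = a != same_labels
print(changed["x"].tolist())  # [False, True]

# Different labels: pandas refuses to compare and raises ValueError.
try:
    a != b
    raised = False
except ValueError:
    raised = True
print(raised)  # True
```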

If I understand you correctly, you want to find not the changes but the symmetric difference. For that, one approach is to concatenate the DataFrames:

>>> df = pd.concat([df1, df2])
>>> df = df.reset_index(drop=True)

group by

>>> df_gpby = df.groupby(list(df.columns))

get index of unique records

>>> idx = [x[0] for x in df_gpby.groups.values() if len(x) == 1]

filter

>>> df.reindex(idx)
         Date   Fruit   Num   Color
9  2013-11-25  Orange   8.6  Orange
8  2013-11-25   Apple  22.1     Red
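The steps above can be run end to end on a small example. The two frames below are invented for illustration, since the question's data is not shown here; drop_duplicates(keep=False) is included as a shorter alternative, not as part of the approach above.

```python
import pandas as pd

# Hypothetical inputs: df2 has one row that df1 lacks.
df1 = pd.DataFrame({"Fruit": ["Apple", "Orange"], "Num": [22.1, 8.6]})
df2 = pd.DataFrame({"Fruit": ["Apple", "Orange", "Banana"], "Num": [22.1, 8.6, 3.0]})

# Concatenate and renumber the rows 0..n-1.
both = pd.concat([df1, df2]).reset_index(drop=True)

# Group identical rows together; each group maps to the row labels it contains.
groups = both.groupby(list(both.columns)).groups

# Keep the label of every row that appears exactly once.
idx = [labels[0] for labels in groups.values() if len(labels) == 1]
only_once = both.reindex(idx)
print(only_once["Fruit"].tolist())  # ['Banana']

# Alternative: drop every row that has a duplicate anywhere in the frame.
alt = both.drop_duplicates(keep=False)
print(alt["Fruit"].tolist())        # ['Banana']
```

Both routes treat a row that is duplicated inside a single input frame as "seen twice", so such rows are dropped as well.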

Extract single column from Pandas DataFrame in two ways, difference?

Square brackets are important

df['floor_temperature'] represents a series. pd.Series objects are one-dimensional. The argument feeding pd.DataFrame.__getitem__, for which [] is syntactic sugar, is a scalar.

df[['floor_temperature']] represents a dataframe. pd.DataFrame objects are two-dimensional, and passing a list as the argument is what requests a two-dimensional result.

What you are seeing is the difference between a single isolated series and a dataframe with a single series.
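A quick way to see this for yourself, sketched with invented temperature values and a hypothetical second column named room:

```python
import pandas as pd

df = pd.DataFrame({"floor_temperature": [20.5, 21.0], "room": ["a", "b"]})

as_series = df["floor_temperature"]   # scalar key: one-dimensional Series
as_frame = df[["floor_temperature"]]  # list key: two-dimensional DataFrame

print(type(as_series).__name__, as_series.shape)  # Series (2,)
print(type(as_frame).__name__, as_frame.shape)    # DataFrame (2, 1)
```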

How do I add data to a column only if a certain value exists in previous column using Python and Faker?

I'd slightly change the approach and generate a column OS. You can then transform this column into With MacOS etc. if needed.

With this approach it's easier to get the 0.5 / 0.5 split within Windows right:

from collections import OrderedDict

import pandas as pd
from faker import Faker

pc_type = ['PC', 'Apple']
# Weighted choices: each Windows version is picked with probability 0.5
wos_type = OrderedDict([('With Windows 10', 0.5), ('With Windows 11', 0.5)])
fake = Faker()

def create_data(x):
    # Build x fake records keyed by row number
    project_data = {}
    for i in range(x):
        project_data[i] = {}
        project_data[i]['Name'] = fake.name()
        project_data[i]['PC Type'] = fake.random_element(pc_type)
        # Only PCs get a Windows version; Apple machines always get MacOS
        if project_data[i]['PC Type'] == 'PC':
            project_data[i]['OS'] = fake.random_element(elements=wos_type)
        else:
            project_data[i]['OS'] = 'MacOS'

    return project_data

# The outer keys become columns, so transpose to get one row per record
df = pd.DataFrame(create_data(10)).transpose()
df

Output

                     Name  PC Type               OS
0         Nicholas Walker    Apple            MacOS
1               Eric Hull       PC  With Windows 10
2       Veronica Gonzales       PC  With Windows 11
3  Mrs. Krista Richardson    Apple            MacOS
4              Anne Craig       PC  With Windows 10
5            Joseph Hayes       PC  With Windows 10
6             Mary Nelson    Apple            MacOS
7               Jill Hunt    Apple            MacOS
8             Mark Taylor       PC  With Windows 11
9           Kyle Thompson       PC  With Windows 10
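If you do want With MacOS style labels afterwards, one possible way to transform the OS column is sketched below. It uses a small hand-written frame instead of Faker output, and the column name OS Label is an assumption made for this example.

```python
import pandas as pd

# Hand-written stand-in for the generated data.
df = pd.DataFrame({
    "PC Type": ["Apple", "PC", "PC"],
    "OS": ["MacOS", "With Windows 10", "With Windows 11"],
})

# Keep values that already start with "With "; prefix the rest.
has_prefix = df["OS"].str.startswith("With ")
df["OS Label"] = df["OS"].where(has_prefix, "With " + df["OS"])

print(df["OS Label"].tolist())
# ['With MacOS', 'With Windows 10', 'With Windows 11']
```

Series.where keeps the original value wherever the condition is True and takes the replacement elsewhere, so the Windows rows pass through unchanged.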

