Pandas: How to Easily Share a Sample Dataframe Using df.to_dict()

Pandas: How to easily share a sample dataframe using df.to_dict()?

The answer:

In many situations, using an approach with df.to_dict() will do the job perfectly! Here are two cases that come to mind:

Case 1: You've got a dataframe built or loaded in Python from a local source

Case 2: You've got a table in another application (like Excel)



The details:

Case 1: You've got a dataframe built or loaded from a local source

Given that you've got a pandas dataframe named df, just

  1. run df.to_dict() in your console or editor,
  2. copy the output, which is formatted as a dictionary, and
  3. paste the content into pd.DataFrame(<output>) and include that chunk in your now reproducible code snippet (see the minimal sketch below).
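A minimal sketch of that round trip (the dataframe here is made up for illustration):

import pandas as pd

# 1. a stand-in for whatever dataframe you're actually working with
df = pd.DataFrame({'x': [1, 2, 3], 'y': [4.0, 5.0, 6.0]})

# 2. print the dictionary representation and copy it from the console
print(df.to_dict())
# {'x': {0: 1, 1: 2, 2: 3}, 'y': {0: 4.0, 1: 5.0, 2: 6.0}}

# 3. paste the copied output into pd.DataFrame() in your question
df = pd.DataFrame({'x': {0: 1, 1: 2, 2: 3}, 'y': {0: 4.0, 1: 5.0, 2: 6.0}})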

Case 2: You've got a table in another application (like Excel)

Depending on the source and its separator (for example ',', ';', or '\s+', where the latter matches any whitespace), you can simply:

  1. Ctrl+C the contents,
  2. run df = pd.read_clipboard(sep='\s+') in your console or editor,
  3. run df.to_dict(), and
  4. include the output in df = pd.DataFrame(<output>) (see the sketch below)
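A sketch of that workflow (pd.read_clipboard needs a desktop clipboard, so this is illustrative rather than something to run headless):

import pandas as pd

# after copying a table from e.g. Excel with Ctrl+C, read it off the clipboard;
# sep='\s+' treats any run of whitespace (including Excel's tabs) as the separator
df = pd.read_clipboard(sep='\s+')

# print the dict representation, then paste that output into your question
print(df.to_dict())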

In this case, the start of your question would look something like this:

import pandas as pd
df = pd.DataFrame({0: {0: 0.25474768796402636, 1: 0.5792136563952824, 2: 0.5950396800676201},
1: {0: 0.9071073567355232, 1: 0.1657288354283053, 2: 0.4962367707789421},
2: {0: 0.7440601352930207, 1: 0.7755487356392468, 2: 0.5230707257648775}})

Of course, this gets a little clumsy with larger dataframes. But very often, all anyone seeking to answer your question needs is a small sample of your real-world data, so that they can take the structure of your data into consideration.

And there are two ways you can handle larger dataframes:

  1. run df.head(20).to_dict() to include only the first 20 rows, or
  2. change the format of your dict using, for example, df.to_dict('split') (there are other options besides 'split') to reshape the output into a dict that requires fewer lines.

Here's an example using the iris dataset, which is available from plotly express (among other places).

If you just run:

import plotly.express as px
import pandas as pd
df = px.data.iris()
df.to_dict()

This will produce an output of nearly 1000 lines, which won't be very practical as a reproducible sample. But if you include .head(10), you'll get:

{'sepal_length': {0: 5.1, 1: 4.9, 2: 4.7, 3: 4.6, 4: 5.0, 5: 5.4, 6: 4.6, 7: 5.0, 8: 4.4, 9: 4.9},
'sepal_width': {0: 3.5, 1: 3.0, 2: 3.2, 3: 3.1, 4: 3.6, 5: 3.9, 6: 3.4, 7: 3.4, 8: 2.9, 9: 3.1},
'petal_length': {0: 1.4, 1: 1.4, 2: 1.3, 3: 1.5, 4: 1.4, 5: 1.7, 6: 1.4, 7: 1.5, 8: 1.4, 9: 1.5},
'petal_width': {0: 0.2, 1: 0.2, 2: 0.2, 3: 0.2, 4: 0.2, 5: 0.4, 6: 0.3, 7: 0.2, 8: 0.2, 9: 0.1},
'species': {0: 'setosa', 1: 'setosa', 2: 'setosa', 3: 'setosa', 4: 'setosa', 5: 'setosa', 6: 'setosa', 7: 'setosa', 8: 'setosa', 9: 'setosa'},
'species_id': {0: 1, 1: 1, 2: 1, 3: 1, 4: 1, 5: 1, 6: 1, 7: 1, 8: 1, 9: 1}}

And now we're getting somewhere. Depending on the structure and content of the data, though, this may still not capture the complexity of the contents satisfactorily. But you can include more data on fewer lines by using to_dict('split'), like this:

import plotly.express as px
df = px.data.iris().head(10)
df.to_dict('split')

Now your output will look like:

{'index': [0, 1, 2, 3, 4, 5, 6, 7, 8, 9],
'columns': ['sepal_length',
'sepal_width',
'petal_length',
'petal_width',
'species',
'species_id'],
'data': [[5.1, 3.5, 1.4, 0.2, 'setosa', 1],
[4.9, 3.0, 1.4, 0.2, 'setosa', 1],
[4.7, 3.2, 1.3, 0.2, 'setosa', 1],
[4.6, 3.1, 1.5, 0.2, 'setosa', 1],
[5.0, 3.6, 1.4, 0.2, 'setosa', 1],
[5.4, 3.9, 1.7, 0.4, 'setosa', 1],
[4.6, 3.4, 1.4, 0.3, 'setosa', 1],
[5.0, 3.4, 1.5, 0.2, 'setosa', 1],
[4.4, 2.9, 1.4, 0.2, 'setosa', 1],
[4.9, 3.1, 1.5, 0.1, 'setosa', 1]]}

And now you can easily increase the number in .head(10) without cluttering your question too much. There is one minor drawback, though: you can no longer use the output directly in pd.DataFrame. But if you include a few specifications regarding index, columns, and data, you'll be just fine. So for this particular dataset, my preferred approach would be:

import pandas as pd
import plotly.express as px

sample = {'index': [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14],
'columns': ['sepal_length',
'sepal_width',
'petal_length',
'petal_width',
'species',
'species_id'],
'data': [[5.1, 3.5, 1.4, 0.2, 'setosa', 1],
[4.9, 3.0, 1.4, 0.2, 'setosa', 1],
[4.7, 3.2, 1.3, 0.2, 'setosa', 1],
[4.6, 3.1, 1.5, 0.2, 'setosa', 1],
[5.0, 3.6, 1.4, 0.2, 'setosa', 1],
[5.4, 3.9, 1.7, 0.4, 'setosa', 1],
[4.6, 3.4, 1.4, 0.3, 'setosa', 1],
[5.0, 3.4, 1.5, 0.2, 'setosa', 1],
[4.4, 2.9, 1.4, 0.2, 'setosa', 1],
[4.9, 3.1, 1.5, 0.1, 'setosa', 1],
[5.4, 3.7, 1.5, 0.2, 'setosa', 1],
[4.8, 3.4, 1.6, 0.2, 'setosa', 1],
[4.8, 3.0, 1.4, 0.1, 'setosa', 1],
[4.3, 3.0, 1.1, 0.1, 'setosa', 1],
[5.8, 4.0, 1.2, 0.2, 'setosa', 1]]}

df = pd.DataFrame(index=sample['index'], columns=sample['columns'], data=sample['data'])
df

Now you'll have this dataframe to work with:

    sepal_length  sepal_width  petal_length  petal_width species  species_id
0            5.1          3.5           1.4          0.2  setosa           1
1            4.9          3.0           1.4          0.2  setosa           1
2            4.7          3.2           1.3          0.2  setosa           1
3            4.6          3.1           1.5          0.2  setosa           1
4            5.0          3.6           1.4          0.2  setosa           1
5            5.4          3.9           1.7          0.4  setosa           1
6            4.6          3.4           1.4          0.3  setosa           1
7            5.0          3.4           1.5          0.2  setosa           1
8            4.4          2.9           1.4          0.2  setosa           1
9            4.9          3.1           1.5          0.1  setosa           1
10           5.4          3.7           1.5          0.2  setosa           1
11           4.8          3.4           1.6          0.2  setosa           1
12           4.8          3.0           1.4          0.1  setosa           1
13           4.3          3.0           1.1          0.1  setosa           1
14           5.8          4.0           1.2          0.2  setosa           1

Which will increase your chances of receiving useful answers significantly!

Edit:

The output of df.to_dict() can contain timestamps like 1: Timestamp('2020-01-02 00:00:00'); a snippet that pastes such output into pd.DataFrame() won't run without also including from pandas import Timestamp.
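A small illustration (the dates and values here are made up):

import pandas as pd
from pandas import Timestamp  # required to rebuild the frame below

# pasting to_dict() output that contains Timestamp(...) values only works
# with the import above; the column round-trips as datetime64[ns]
df = pd.DataFrame({'date': {0: Timestamp('2020-01-01 00:00:00'),
                            1: Timestamp('2020-01-02 00:00:00')},
                   'value': {0: 10, 1: 20}})
print(df.dtypes)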

Convert a Pandas DataFrame to a dictionary

The to_dict() method uses the column names as dictionary keys, so you'll need to reshape your DataFrame slightly. Setting the 'ID' column as the index and then transposing the DataFrame is one way to achieve this.

to_dict() also accepts an 'orient' argument which you'll need in order to output a list of values for each column. Otherwise, a dictionary of the form {index: value} will be returned for each column.

These steps can be done with the following line:

>>> df.set_index('ID').T.to_dict('list')
{'p': [1, 3, 2], 'q': [4, 3, 2], 'r': [4, 0, 9]}
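
The dataframe that one-liner assumes isn't shown here; a sketch that reproduces the output could look like this (the column names X, Y and Z are made up):

>>> import pandas as pd
>>> df = pd.DataFrame({'ID': ['p', 'q', 'r'],
...                    'X': [1, 4, 4],
...                    'Y': [3, 3, 0],
...                    'Z': [2, 2, 9]})
>>> df.set_index('ID').T.to_dict('list')
{'p': [1, 3, 2], 'q': [4, 3, 2], 'r': [4, 0, 9]}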

In case a different dictionary format is needed, here are examples of the possible orient arguments. Consider the following simple DataFrame:

>>> df = pd.DataFrame({'a': ['red', 'yellow', 'blue'], 'b': [0.5, 0.25, 0.125]})
>>> df
        a      b
0     red  0.500
1  yellow  0.250
2    blue  0.125

Then the options are as follows.

dict - the default: column names are keys, values are dictionaries of index:data pairs

>>> df.to_dict('dict')
{'a': {0: 'red', 1: 'yellow', 2: 'blue'},
'b': {0: 0.5, 1: 0.25, 2: 0.125}}

list - keys are column names, values are lists of column data

>>> df.to_dict('list')
{'a': ['red', 'yellow', 'blue'],
'b': [0.5, 0.25, 0.125]}

series - like 'list', but values are Series

>>> df.to_dict('series')
{'a': 0       red
      1    yellow
      2      blue
 Name: a, dtype: object,
 'b': 0    0.500
      1    0.250
      2    0.125
 Name: b, dtype: float64}

split - 'columns', 'data', and 'index' as keys, with values being the column names, the data values by row, and the index labels respectively

>>> df.to_dict('split')
{'columns': ['a', 'b'],
'data': [['red', 0.5], ['yellow', 0.25], ['blue', 0.125]],
'index': [0, 1, 2]}

records - each row becomes a dictionary whose keys are the column names and whose values are the cell data

>>> df.to_dict('records')
[{'a': 'red', 'b': 0.5},
{'a': 'yellow', 'b': 0.25},
{'a': 'blue', 'b': 0.125}]

index - like 'records', but a dictionary of dictionaries with keys as index labels (rather than a list)

>>> df.to_dict('index')
{0: {'a': 'red', 'b': 0.5},
1: {'a': 'yellow', 'b': 0.25},
2: {'a': 'blue', 'b': 0.125}}

Return generator instead of list from df.to_dict()

There is not a way to get a generator directly from to_dict(orient='records'). However, it is possible to adapt the to_dict source code so that it yields rows as a generator instead of building a list:

from pandas.core.common import standardize_mapping
from pandas.core.dtypes.cast import maybe_box_native


def dataframe_records_gen(df_):
    columns = df_.columns.tolist()
    into_c = standardize_mapping(dict)

    for row in df_.itertuples(index=False, name=None):
        yield into_c(
            (k, maybe_box_native(v)) for k, v in dict(zip(columns, row)).items()
        )

Sample Code:

import pandas as pd

df = pd.DataFrame({
    'A': [1, 2],
    'B': [3, 4]
})

# Using Generator
for row in dataframe_records_gen(df):
    print(row)

# For Comparison with to_dict function
print("to_dict", df.to_dict(orient='records'))

Output:

{'A': 1, 'B': 3}
{'A': 2, 'B': 4}
to_dict [{'A': 1, 'B': 3}, {'A': 2, 'B': 4}]

For more natural syntax, it's also possible to register a custom accessor:

import pandas as pd
from pandas.core.common import standardize_mapping
from pandas.core.dtypes.cast import maybe_box_native


@pd.api.extensions.register_dataframe_accessor("gen")
class GenAccessor:
    def __init__(self, pandas_obj):
        self._obj = pandas_obj

    def records(self):
        columns = self._obj.columns.tolist()
        into_c = standardize_mapping(dict)

        for row in self._obj.itertuples(index=False, name=None):
            yield into_c(
                (k, maybe_box_native(v))
                for k, v in dict(zip(columns, row)).items()
            )

Which makes this generator accessible via the gen accessor in this case:

df = pd.DataFrame({
    'A': [1, 2],
    'B': [3, 4]
})

# Using Generator through registered custom accessor
for row in df.gen.records():
    print(row)

# For Comparison with to_dict function
print("to_dict", df.to_dict(orient='records'))

Output:

{'A': 1, 'B': 3}
{'A': 2, 'B': 4}
to_dict [{'A': 1, 'B': 3}, {'A': 2, 'B': 4}]

Pandas DataFrame to List of Dictionaries

Use df.to_dict('records') -- gives the output without having to transpose externally.
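
(The input dataframe isn't shown in the original question; reconstructed from the output below, it would be something like this:)

In [1]: df = pd.DataFrame({'customer': [1, 2, 3],
   ...:                    'item1': ['apple', 'water', 'juice'],
   ...:                    'item2': ['milk', 'orange', 'mango'],
   ...:                    'item3': ['tomato', 'potato', 'chips']})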

In [2]: df.to_dict('records')
Out[2]:
[{'customer': 1, 'item1': 'apple', 'item2': 'milk', 'item3': 'tomato'},
 {'customer': 2, 'item1': 'water', 'item2': 'orange', 'item3': 'potato'},
 {'customer': 3, 'item1': 'juice', 'item2': 'mango', 'item3': 'chips'}]

Converting a pandas DataFrame to a list of dictionaries

First rename the columns to the desired output keys, then set the orient argument to 'records' when exporting to a dict with to_dict().

df.columns = ["parameterCode", "parameterName", "unit", "result", "uRange", "lRange"]

res = df.to_dict(orient='records')
print(res)

[{'parameterCode': 27162, 'parameterName': 'MPV (Mean Platelet Volume)', 'unit': 'fL', 'result': '12.3', 'uRange': 6.5, 'lRange': 12.0}, {'parameterCode': 29789, 'parameterName': 'Platelet Count', 'unit': 'thou/mm3', 'result': '156', 'uRange': 150.0, 'lRange': 450.0}, {'parameterCode': 29790, 'parameterName': 'Segmented Neutrophils', 'unit': '%', 'result': '72', 'uRange': 40.0, 'lRange': 80.0}, {'parameterCode': 29791, 'parameterName': 'Lymphocytes', 'unit': '%', 'result': '22', 'uRange': 20.0, 'lRange': 40.0}, {'parameterCode': 29792, 'parameterName': 'Monocytes', 'unit': '%', 'result': '3.7', 'uRange': 2.0, 'lRange': 10.0}, {'parameterCode': 29793, 'parameterName': 'Eosinophils', 'unit': '%', 'result': '1.8', 'uRange': 1.0, 'lRange': 6.0}, {'parameterCode': 29794, 'parameterName': 'Basophils', 'unit': '%', 'result': '0.5', 'uRange': nan, 'lRange': 2.0}, {'parameterCode': 29795, 'parameterName': 'Neutrophils', 'unit': 'thou/mm3', 'result': '6.74', 'uRange': 2.0, 'lRange': 7.0}, {'parameterCode': 29796, 'parameterName': 'Lymphocytes', 'unit': 'thou/mm3', 'result': '2.06', 'uRange': 1.0, 'lRange': 3.0}, {'parameterCode': 29797, 'parameterName': 'Monocytes', 'unit': 'thou/mm3', 'result': '0.35', 'uRange': 0.2, 'lRange': 1.0}, {'parameterCode': 29798, 'parameterName': 'Eosinophils', 'unit': 'thou/mm3', 'result': '0.17', 'uRange': 0.02, 'lRange': 0.5}, {'parameterCode': 29799, 'parameterName': 'Basophils', 'unit': 'thou/mm3', 'result': '0.05', 'uRange': 0.01, 'lRange': 0.1}, {'parameterCode': 29806, 'parameterName': 'Hemoglobin', 'unit': 'g/dL', 'result': '7', 'uRange': 11.5, 'lRange': 15.0}, {'parameterCode': 29807, 'parameterName': 'Packed Cell Volume (PCV)', 'unit': '%', 'result': '26.7', 'uRange': 36.0, 'lRange': 46.0}, {'parameterCode': 29808, 'parameterName': 'Total Leukocyte Count (TLC)', 'unit': 'thou/mm3', 'result': '9.36', 'uRange': 4.0, 'lRange': 10.0}, {'parameterCode': 29809, 'parameterName': 'RBC Count', 'unit': 'mill/mm3', 'result': '3.54', 'uRange': 3.8, 'lRange': 4.8}, {'parameterCode': 29810, 'parameterName': 'MCV', 'unit': 'fL', 'result': '75.4', 'uRange': 80.0, 'lRange': 100.0}, {'parameterCode': 29811, 'parameterName': 'MCH', 'unit': 'pg', 'result': '19.8', 'uRange': 27.0, 'lRange': 32.0}, {'parameterCode': 29812, 'parameterName': 'MCHC', 'unit': 'g/dL', 'result': '26.2', 'uRange': 32.0, 'lRange': 35.0}, {'parameterCode': 29813, 'parameterName': 'Red Cell Distribution Width (RDW)', 'unit': '%', 'result': '17.8', 'uRange': 11.5, 'lRange': 14.5}]

Sampling data from the pandas dataframe

If you're re-assigning frequency to the original dataframe, that's probably the issue. Make sure you don't have duplicate labels and weights going into your sampling.

Using your summary data I can generate 5000 samples which do have (roughly) the same distribution as the original:

In [1]: import pandas as pd

In [2]: summary = pd.DataFrame(
...: [
...: ['A', 0.350019],
...: ['B', 0.209966],
...: ['C', 0.126553],
...: ['D', 0.100983],
...: ['E', 0.053767],
...: ['F', 0.039378],
...: ['G', 0.029529],
...: ['H', 0.019056],
...: ['I', 0.016783],
...: ['J', 0.014813],
...: ['K', 0.014152],
...: ['L', 0.013477],
...: ['M', 0.009444],
...: ['N', 0.002082],
...: ],
...: columns=['label', 'freq']
...: )

You can sample from the summary table, weighting each unique label with the frequency in the original dataset:

In [3]: summary.label.sample(
...: n=5000,
...: weights=summary.freq,
...: replace=True,
...: ).value_counts(normalize=True)

Out[3]:
label
A    0.3448
B    0.2198
C    0.1356
D    0.0952
E    0.0488
F    0.0322
G    0.0284
H    0.0234
I    0.0168
J    0.0162
K    0.0146
L    0.0140
M    0.0090
N    0.0012
dtype: float64

Alternatively, you could simply skip the calculation of the frequencies altogether - pandas will do this for you:

In [6]: import numpy as np

In [7]: df = pd.DataFrame(np.random.choice(["A", "B", "C", "D"], size=20_000, p=[0.6, 0.3, 0.05, 0.05]), columns=["label"])

In [8]: df.label.sample(5000, replace=True).value_counts(normalize=True)
Out[8]:
A    0.5994
B    0.2930
C    0.0576
D    0.0500
Name: label, dtype: float64

The issue with the code in your question is that you end up weighting based on frequency and based on the explicit weights (which also account for frequency):

In [1]: import numpy as np, pandas as pd

In [2]: df = pd.DataFrame(np.random.choice(["A", "B", "C", "D"], size=20_000, p=[0.6, 0.3, 0.05, 0.05]), columns=["label"])
In [3]: df['frequency'] = df.groupby('label')['label'].transform('count')
In [4]: df
Out[4]:
      label  frequency
0         A      11908
1         A      11908
2         B       5994
3         B       5994
4         D       1033
...     ...        ...
19995     A      11908
19996     D       1033
19997     A      11908
19998     A      11908
19999     A      11908

The result is roughly equal to the normalized square of each frequency:

In [6]: freqs = np.array([0.6, 0.3, 0.05, 0.05])
In [7]: (freqs ** 2) / (freqs ** 2).sum()
Out[7]:
array([0.79120879, 0.1978022 , 0.00549451, 0.00549451])
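
A quick sketch that verifies the double-weighting effect end to end (seeded so the numbers are reproducible; the exact proportions will vary slightly with the seed):

In [8]: np.random.seed(0)

In [9]: df = pd.DataFrame(np.random.choice(["A", "B", "C", "D"], size=20_000, p=[0.6, 0.3, 0.05, 0.05]), columns=["label"])

In [10]: df['frequency'] = df.groupby('label')['label'].transform('count')

In [11]: df.label.sample(5000, weights=df.frequency, replace=True).value_counts(normalize=True)

The proportions come out near 0.79 / 0.20 / 0.005 / 0.005, matching the squared, renormalized frequencies above.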

Convert Python dict into a dataframe

The error here arises because the DataFrame constructor is being called with scalar values (it expects the values to be lists/dicts/... i.e. to have multiple entries per column):

pd.DataFrame(d)
ValueError: If using all scalar values, you must pass an index
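
(The dictionary d isn't defined in this excerpt; judging from the outputs below, it maps date strings to integer counts, something like the sketch here. A dict whose values are all scalars is exactly what triggers the error above.)

d = {'2012-07-02': 392, '2012-07-06': 392, '2012-06-29': 391, '2012-06-28': 391}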

You could take the items from the dictionary (i.e. the key-value pairs):

In [11]: pd.DataFrame(d.items())  # or list(d.items()) in python 3
Out[11]:
            0    1
0  2012-07-02  392
1  2012-07-06  392
2  2012-06-29  391
3  2012-06-28  391
...

In [12]: pd.DataFrame(d.items(), columns=['Date', 'DateValue'])
Out[12]:
         Date  DateValue
0  2012-07-02        392
1  2012-07-06        392
2  2012-06-29        391

But I think it makes more sense to use the Series constructor:

In [21]: s = pd.Series(d, name='DateValue')

In [22]: s
Out[22]:
2012-06-08    388
2012-06-09    388
2012-06-10    388

In [23]: s.index.name = 'Date'

In [24]: s.reset_index()
Out[24]:
         Date  DateValue
0  2012-06-08        388
1  2012-06-09        388
2  2012-06-10        388

How to convert dataframe to dictionary in pandas WITHOUT index

When I see a dataset with just 2 columns, I see a series rather than a dataframe.

Try this: d = df.set_index('name')['coverage'].to_dict(), which selects the 'coverage' column as a series indexed by 'name' and converts it to a dict.

However, if your intent is to have more columns rather than a common key, you can store the rows in a list of dictionaries instead, using orient='records': d = df.to_dict('records'). (The 'r' shorthand for 'records' was deprecated and has since been removed from pandas.)

Runnable code:

import pandas as pd

df = pd.DataFrame({
    'name': ['Jason'],
    'coverage': [25.1]
})

print(df.to_dict())
print(df.set_index('name')['coverage'].to_dict())
print(df.to_dict('records'))

Returns:

{'name': {0: 'Jason'}, 'coverage': {0: 25.1}}
{'Jason': 25.1}
[{'name': 'Jason', 'coverage': 25.1}]

And one more thing: try to avoid using dict as a variable name, since it shadows the built-in type.

How to make good reproducible pandas examples

Note: The ideas here are pretty generic for Stack Overflow, and indeed for questions in general.

Disclaimer: Writing a good question is hard.

The Good:

  • do include a small* example DataFrame, either as runnable code:

    In [1]: df = pd.DataFrame([[1, 2], [1, 3], [4, 6]], columns=['A', 'B'])

    or make it "copy and pasteable" using pd.read_clipboard(sep='\s\s+'), you can format the text for Stack Overflow highlight and use Ctrl+K (or prepend four spaces to each line), or place three backticks (```) above and below your code with your code unindented:

    In [2]: df
    Out[2]:
       A  B
    0  1  2
    1  1  3
    2  4  6

    test pd.read_clipboard(sep='\s\s+') yourself.

    * I really do mean small. The vast majority of example DataFrames could be fewer than 6 rows [citation needed], and I bet I can do it in 5 rows. Can you reproduce the error with df = df.head()? If not, fiddle around to see if you can make up a small DataFrame which exhibits the issue you are facing.

    * Every rule has an exception; the obvious one is for performance issues (in which case definitely use %timeit and possibly %prun), where you should generate something like df = pd.DataFrame(np.random.randn(100000000, 10)). Consider using np.random.seed so we all have the exact same frame (see the sketch after this list). That said, "make this code fast for me" is not strictly on topic for the site.

  • write out the outcome you desire (similarly to above)

    In [3]: iwantthis
    Out[3]:
       A  B
    0  1  5
    1  4  6

    Explain where the numbers come from: the 5 is the sum of the B column for the rows where A is 1.

  • do show the code you've tried:

    In [4]: df.groupby('A').sum()
    Out[4]:
       B
    A
    1  5
    4  6

    But say what's incorrect: the A column ends up in the index rather than as a column.

  • do show you've done some research (search the documentation, search Stack Overflow), and give a summary:

    The docstring for sum simply states "Compute sum of group values"

    The groupby documentation doesn't give any examples for this.

    Aside: the answer here is to use df.groupby('A', as_index=False).sum().

  • if it's relevant that you have Timestamp columns, e.g. you're resampling or something, then be explicit and apply pd.to_datetime to them for good measure**.

    df['date'] = pd.to_datetime(df['date']) # this column ought to be date..

    ** Sometimes this is the issue itself: they were strings.
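
As mentioned in the performance bullet above, a seeded random frame keeps timing questions reproducible. A minimal sketch (the size here is illustrative, not prescriptive):

import numpy as np
import pandas as pd

np.random.seed(0)  # everyone who runs this gets the identical frame
df = pd.DataFrame(np.random.randn(1_000_000, 10), columns=list('abcdefghij'))

# then, in IPython/Jupyter:
# %timeit df.sum()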

The Bad:

  • don't include a MultiIndex, which we can't copy and paste (see above). This is kind of a grievance with Pandas' default display, but nonetheless annoying:

    In [11]: df
    Out[11]:
         C
    A B
    1 2  3
      2  6

    The correct way is to include an ordinary DataFrame with a set_index call:

    In [12]: df = pd.DataFrame([[1, 2, 3], [1, 2, 6]], columns=['A', 'B', 'C']).set_index(['A', 'B'])

    In [13]: df
    Out[13]:
         C
    A B
    1 2  3
      2  6
  • do provide insight into what it is when giving the outcome you want:

       B
    A
    1  1
    5  0

    Be specific about how you got the numbers (what are they)... double check they're correct.

  • If your code throws an error, do include the entire stack trace (this can be edited out later if it's too noisy). Show the line number (and the corresponding line of your code which it's raising against).

The Ugly:

  • don't link to a CSV file we don't have access to (ideally don't link to an external source at all...)

    df = pd.read_csv('my_secret_file.csv')  # ideally with lots of parsing options

    Most data is proprietary, we get that: make up similar data and see if you can reproduce the problem (something small).

  • don't explain the situation vaguely in words, like you have a DataFrame which is "large", mention some of the column names in passing (be sure not to mention their dtypes). Try and go into lots of detail about something which is completely meaningless without seeing the actual context. Presumably no one is even going to read to the end of this paragraph.

    Essays are bad, it's easier with small examples.

  • don't include 10+ (100+??) lines of data munging before getting to your actual question.

    Please, we see enough of this in our day jobs. We want to help, but not like this....
    Cut the intro, and just show the relevant DataFrames (or small versions of them) in the step which is causing you trouble.

Anyway, have fun learning Python, NumPy and Pandas!


