How to Make Good Reproducible Pandas Examples

Note: The ideas here are pretty generic for Stack Overflow, and indeed for asking questions in general.

Disclaimer: Writing a good question is hard.

The Good:

  • do include a small* example DataFrame, either as runnable code:

    In [1]: df = pd.DataFrame([[1, 2], [1, 3], [4, 6]], columns=['A', 'B'])

    or make it "copy and pasteable" using pd.read_clipboard(sep='\s\s+'). You can format the text for Stack Overflow by highlighting it and pressing Ctrl+K (or by prepending four spaces to each line), or by placing three backticks (```) above and below your code, with your code unindented:

    In [2]: df
    Out[2]:
       A  B
    0  1  2
    1  1  3
    2  4  6

    Test that pd.read_clipboard(sep='\s\s+') works on it yourself.

    * I really do mean small. The vast majority of example DataFrames could be fewer than 6 rows [citation needed], and I bet I can do it in 5 rows. Can you reproduce the error with df = df.head()? If not, fiddle around to see if you can make up a small DataFrame which exhibits the issue you are facing.

    * Every rule has an exception, the obvious one is for performance issues (in which case definitely use %timeit and possibly %prun), where you should generate: df = pd.DataFrame(np.random.randn(100000000, 10)). Consider using np.random.seed so we have the exact same frame (see the sketch below). Having said that, "make this code fast for me" is not strictly on topic for the site.
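
    A minimal sketch of a seeded benchmark frame (the shape and the timed operation are just placeholders, pick whatever reproduces your issue):

    import numpy as np
    import pandas as pd

    np.random.seed(0)                                # so everyone benchmarks the exact same frame
    df = pd.DataFrame(np.random.randn(1_000_000, 10))

    # then time the operation in question, e.g. in IPython:
    # %timeit df.sum()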

  • write out the outcome you desire (similarly to above)

    In [3]: iwantthis
    Out[3]:
       A  B
    0  1  5
    1  4  6

    Explain where the numbers come from: the 5 is the sum of the B column for the rows where A is 1.

  • do show the code you've tried:

    In [4]: df.groupby('A').sum()
    Out[4]:
       B
    A
    1  5
    4  6

    But say what's incorrect: the A column is in the index rather than a column.

  • do show you've done some research (search the documentation, search Stack Overflow), and give a summary:

    The docstring for sum simply states "Compute sum of group values"

    The groupby documentation doesn't give any examples for this.

    Aside: the answer here is to use df.groupby('A', as_index=False).sum().
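
    For completeness, this is what that looks like with the example frame above (recent pandas versions):

    In [5]: df.groupby('A', as_index=False).sum()
    Out[5]:
       A  B
    0  1  5
    1  4  6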

  • if it's relevant that you have Timestamp columns, e.g. you're resampling or something, then be explicit and apply pd.to_datetime to them for good measure**.

    df['date'] = pd.to_datetime(df['date']) # this column ought to be date..

    ** Sometimes this is the issue itself: they were strings.
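
    A quick illustration of that failure mode (the column names here are made up): the string column converts cleanly with pd.to_datetime, after which datetime operations such as resampling work as expected.

    import pandas as pd

    df = pd.DataFrame({'date': ['2021-01-01', '2021-01-02', '2021-01-05'],
                       'value': [1, 2, 3]})          # 'date' arrived as strings

    df['date'] = pd.to_datetime(df['date'])          # now datetime64[ns]
    print(df.set_index('date').resample('D').sum())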

The Bad:

  • don't include a MultiIndex, which we can't copy and paste (see above). This is kind of a grievance with Pandas' default display, but nonetheless annoying:

    In [11]: df
    Out[11]:
         C
    A B
    1 2  3
      2  6

    The correct way is to include an ordinary DataFrame with a set_index call:

    In [12]: df = pd.DataFrame([[1, 2, 3], [1, 2, 6]], columns=['A', 'B', 'C']).set_index(['A', 'B'])

    In [13]: df
    Out[13]:
         C
    A B
    1 2  3
      2  6

  • do provide insight into what it is when giving the outcome you want:

       B
    A
    1  1
    5  0

    Be specific about how you got the numbers (what they are)... and double-check that they're correct.

  • If your code throws an error, do include the entire stack trace (this can be edited out later if it's too noisy). Show the line number (and the corresponding line of your code which it's raising against).

The Ugly:

  • don't link to a CSV file we don't have access to (ideally don't link to an external source at all...)

    df = pd.read_csv('my_secret_file.csv')  # ideally with lots of parsing options

    Most data is proprietary, we get that. Make up similar data and see if you can reproduce the problem (keep it small), for example as sketched below.
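
    A minimal sketch of that idea (the column names and values are made up): embed a few representative rows directly in the question instead of linking to the real file.

    import io
    import pandas as pd

    # made-up stand-in for the proprietary CSV, included right in the question
    csv_text = "id,category,amount\n1,a,10.5\n2,b,3.0\n3,a,7.25\n"
    df = pd.read_csv(io.StringIO(csv_text))
    print(df)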

  • don't explain the situation vaguely in words, like you have a DataFrame which is "large", mention some of the column names in passing (be sure not to mention their dtypes). Try and go into lots of detail about something which is completely meaningless without seeing the actual context. Presumably no one is even going to read to the end of this paragraph.

    Essays are bad, it's easier with small examples.

  • don't include 10+ (100+??) lines of data munging before getting to your actual question.

    Please, we see enough of this in our day jobs. We want to help, but not like this....
    Cut the intro, and just show the relevant DataFrames (or small versions of them) in the step which is causing you trouble.

Anyway, have fun learning Python, NumPy and Pandas!

Print pandas data frame for reproducible example (equivalent to dput in R)

If binary data is OK for you, you can use the pickle library. It lets you serialize and deserialize arbitrary objects (provided their class definition is available, which is true for DataFrames as long as pandas is installed).
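
For instance, a minimal sketch using pandas' thin wrappers around pickle (the file name is arbitrary):

import pandas as pd

df = pd.DataFrame({'a': [1, 2], 'b': [3, 3]})

df.to_pickle('df.pkl')               # binary dump to disk
restored = pd.read_pickle('df.pkl')  # exact round-trip, dtypes and index included
assert restored.equals(df)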

If you need a human-readable format, you can create a Python dictionary from your dataframe with df_dict = df.to_dict(), and print this dictionary (to look at it and maybe copy-paste), or dump it to a JSON string.

When you want to convert a dict back to pandas, use df = pd.DataFrame.from_dict(df_dict).

A minimal example of decoding and encoding:

import pandas as pd
df = pd.DataFrame.from_dict({'a': {0: 1, 1: 2}, 'b': {0: 3, 1: 3}})
print(df.to_dict())

which prints {'a': {0: 1, 1: 2}, 'b': {0: 3, 1: 3}}, a representation you can copy and paste straight back into pd.DataFrame.from_dict.

Reshaping dataframes in pandas

Based on the data you've provided, you should be able to use .unstack() to do this:

print(df['counts'].unstack(level=['Model_1', 'Winloss']))
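
Since the original data isn't shown, here is a sketch with made-up values (the 'season' level and the numbers are assumptions; only the 'Model_1', 'Winloss' and 'counts' names come from the question):

import pandas as pd

df = pd.DataFrame({
    'season':  [2020, 2020, 2021, 2021],   # assumed remaining index level
    'Model_1': ['A', 'A', 'B', 'B'],
    'Winloss': ['win', 'loss', 'win', 'loss'],
    'counts':  [10, 4, 7, 2],
}).set_index(['season', 'Model_1', 'Winloss'])

# move the 'Model_1' and 'Winloss' index levels into the columns
print(df['counts'].unstack(level=['Model_1', 'Winloss']))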

Pandas Randomly Data Choosing

You can also sample from groups generated with groupby:

df.groupby('district').sample(n=5)

To restrict the sampling to those districts you can filter the df beforehand:

df[df['district'].isin(['USA', 'Canada', 'LA', 'NY', 'Japan'])].groupby('district').sample(n=5)

This is assuming 'district' is the district column. Also, if I understood correctly, since you are sampling 5 items from each of 5 districts, the final DataFrame should have 5*5 = 25 rows and the same 5 columns as the original, i.e. a shape of 25x5.

You need pandas version >= 1.1.0 to use this method.
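
A quick sketch with toy data (the column names and values are made up, and n is kept small so the example runs):

import pandas as pd

df = pd.DataFrame({
    'district': ['USA', 'USA', 'USA', 'Canada', 'Canada', 'Canada', 'Japan', 'Japan', 'Japan'],
    'value':    [1, 2, 3, 4, 5, 6, 7, 8, 9],
})

sample = (
    df[df['district'].isin(['USA', 'Canada', 'Japan'])]  # keep only the districts of interest
      .groupby('district')
      .sample(n=2, random_state=0)                       # 2 rows per district, reproducible draw
)
print(sample)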

Pandas not all but two of True

Generate the dataframe:

import pandas as pd

df = pd.DataFrame({
    "0": [True, True, True, True, True],
    "1": [False, False, False, False, False],
    "2": [False, False, False, False, False],
    "3": [False, False, False, False, False],
    "4": [False, True, False, False, False],
})

Get the number of True values per row:

df = df.assign(number_of_true=lambda x: x.sum(axis=1))

      0      1      2      3      4  number_of_true
0  True  False  False  False  False               1
1  True  False  False  False   True               2
2  True  False  False  False  False               1
3  True  False  False  False  False               1
4  True  False  False  False  False               1

Select the rows that do not have exactly two True values:

df = df.query("number_of_true != 2")

One liner:

(
    df
    .assign(number_of_true=lambda x: x.sum(axis=1))
    .query("number_of_true != 2")
    .drop(columns="number_of_true")  # drop the helper column again
)

Output:

      0      1      2      3      4
0  True  False  False  False  False
2  True  False  False  False  False
3  True  False  False  False  False
4  True  False  False  False  False


