"Large Data" Workflows Using Pandas

Large data workflows using pandas

I routinely use tens of gigabytes of data in just this fashion, e.g. I have tables on disk that I read via queries, create data from, and append back to.

It's worth reading the docs and the later answers in this thread for several suggestions on how to store your data.

Give as much detail as you can, and I can help you develop a structure. Details which will affect how you store your data include:

  1. Size of data, # of rows, columns, types of columns; are you appending
    rows, or just columns?
  2. What will typical operations look like? E.g. do a query on columns to select a bunch of rows and specific columns, then do an operation (in-memory), create new columns, and save these.

    (Giving a toy example could enable us to offer more specific recommendations.)
  3. After that processing, then what do you do? Is step 2 ad hoc, or repeatable?
  4. Input flat files: how many, and the rough total size in GB. How are these organized, e.g. by records? Does each one contain different fields, or do they have some records per file with all of the fields in each file?
  5. Do you ever select subsets of rows (records) based on criteria (e.g. select the rows with field A > 5)? and then do something, or do you just select fields A, B, C with all of the records (and then do something)?
  6. Do you 'work on' all of your columns (in groups), or are there a good proportion that you may only use for reports (e.g. you want to keep the data around, but don't need to pull in that column explicitly until final results time)?

Solution

Ensure you have at least pandas 0.10.1 installed.

Read up on iterating over files chunk-by-chunk and on multiple-table queries.

Since PyTables is optimized to operate row-wise (which is what you query on), we will create a table for each group of fields. This way it's easy to select a small group of fields (which would also work with one big table, but it's more efficient to do it this way... I think I may be able to fix this limitation in the future... this is more intuitive anyhow):

(The following is pseudocode.)

import numpy as np
import pandas as pd

# create a store
store = pd.HDFStore('mystore.h5')

# this is the key to your storage:
# this maps your fields to a specific group, and defines
# what you want to have as data_columns.
# you might want to create a nice class wrapping this
# (as you will want to have this map and its inversion)
group_map = dict(
    A=dict(fields=['field_1', 'field_2', .....], dc=['field_1', ...., 'field_5']),
    B=dict(fields=['field_10', ......], dc=['field_10']),
    .....
    REPORTING_ONLY=dict(fields=['field_1000', 'field_1001', ...], dc=[]),
)

group_map_inverted = dict()
for g, v in group_map.items():
    group_map_inverted.update(dict([(f, g) for f in v['fields']]))

Reading in the files and creating the storage (essentially doing what append_to_multiple does):

for f in files:
    # read in the file, additional options may be necessary here
    # the chunksize is not strictly necessary, you may be able to slurp each
    # file into memory in which case just eliminate this part of the loop
    # (you can also change chunksize if necessary)
    for chunk in pd.read_table(f, chunksize=50000):
        # we are going to append to each table by group
        # we are not going to create indexes at this time
        # but we *ARE* going to create (some) data_columns

        # figure out the field groupings
        for g, v in group_map.items():
            # create the frame for this group
            frame = chunk.reindex(columns=v['fields'], copy=False)

            # append it
            store.append(g, frame, index=False, data_columns=v['dc'])

Now you have all of the tables in the file (actually you could store them in separate files if you wish; you would probably have to add the filename to the group_map, but that likely isn't necessary).

This is how you get columns and create new ones:

frame = store.select(group_that_I_want)
# you can optionally specify:
# columns = a list of the columns IN THAT GROUP (if you wanted to
# select only say 3 out of the 20 columns in this sub-table)
# and a where clause if you want a subset of the rows

# do calculations on this frame
new_frame = cool_function_on_frame(frame)

# to 'add columns', create a new group (you probably want to
# limit the columns in this new_group to be only NEW ones
# (e.g. so you don't overlap from the other tables)
# add this info to the group_map
store.append(
    new_group,
    new_frame.reindex(columns=new_columns_created, copy=False),
    data_columns=new_columns_created,
)

When you are ready for post_processing:

# This may be a bit tricky; and depends what you are actually doing.
# I may need to modify this function to be a bit more general:
report_data = store.select_as_multiple(
    [group_1, group_2, .....],
    where=['field_1>0', 'field_1000=foo'],
    selector=group_1,
)

About data_columns: you don't actually need to define ANY data_columns; they allow you to sub-select rows based on that column. E.g. something like:

store.select(group, where = ['field_1000=foo', 'field_1001>0'])

They may be most interesting to you in the final report generation stage (essentially a data column is segregated from the other columns, which might impact efficiency somewhat if you define a lot of them).

You also might want to:

  • create a function which takes a list of fields, looks up the groups in the group_map, then selects these and concatenates the results so you get the resulting frame (this is essentially what select_as_multiple does); see the sketch after this list. This way the structure would be pretty transparent to you.
  • create indexes on certain data columns (this makes row-subsetting much faster).
  • enable compression.
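Here is a minimal sketch of such a helper, assuming the group_map / group_map_inverted structures defined above and an already-populated store; the function name select_fields is my own, not a pandas API:

import pandas as pd

def select_fields(store, group_map_inverted, fields, where=None):
    # map each requested field to the group (table) it lives in
    by_group = {}
    for f in fields:
        by_group.setdefault(group_map_inverted[f], []).append(f)

    # pull only the needed columns from each group's table;
    # note a where clause can only reference data_columns that exist
    # in that group (select_as_multiple handles this via its selector)
    pieces = [
        store.select(g, columns=cols, where=where)
        for g, cols in by_group.items()
    ]

    # the tables were appended row-aligned, so concatenate column-wise
    return pd.concat(pieces, axis=1)

# indexes on the data columns make where-based row-subsetting much faster:
# store.create_table_index('A', columns=['field_1'], optlevel=9, kind='full')

# compression is enabled when the store is opened, e.g.:
# store = pd.HDFStore('mystore.h5', complevel=9, complib='blosc')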

Let me know when you have questions!

How to quickly group multiple features of data from a dataframe using pandas

You can try:

result = library.groupby("library_id").agg({"library_id": "size", "school": "unique"})

to get

            library_id       school
library_id
A123                 3  [A1, A, A2]
A456                 2          [A]
B123                 2          [B]

We group by the library_id and then aggregate over the group size and the unique school entries.

If you don't want library_id appearing on top of the index, you can write result.index.name = None, since library_id is the name of the result's index.
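If you prefer output column names that don't repeat library_id, here is a small sketch using named aggregation (available in pandas 0.25+); the sample data is made up to match the output shown above:

import pandas as pd

# made-up data matching the output shown above
library = pd.DataFrame({
    "library_id": ["A123", "A123", "A123", "A456", "A456", "B123", "B123"],
    "school": ["A1", "A", "A2", "A", "A", "B", "B"],
})

# named aggregation lets you choose the output column names directly
result = library.groupby("library_id").agg(
    count=("school", "size"),
    school=("school", "unique"),
)
print(result)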

How to load large data into pandas efficiently?

You could use

pandas.read_csv(filename, chunksize = chunksize)
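A minimal sketch of how the chunksize iterator is typically consumed; the filename, chunk size, and the per-chunk aggregation are placeholders of my own:

import pandas as pd

chunksize = 100_000
partials = []

# with chunksize set, read_csv returns an iterator of DataFrames
# instead of loading the whole file into memory at once
for chunk in pd.read_csv("big_file.csv", chunksize=chunksize):
    # reduce each chunk to something small, e.g. a partial aggregate
    partials.append(chunk.groupby("key")["value"].sum())

# combine the partial results at the end
total = pd.concat(partials).groupby(level=0).sum()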

How to handle extremely large data sets in pandas

I guess you need to distribute your processing.

One way to do this is to split your input to multiple chunks, use multiprocessing to generate intermediate outputs, then combine them all at the end.

For how to do this in pandas, see "Large data" workflows using pandas above.

Finding the number of times rows of a particular column are repeated using Pandas

Try this:

df = df.groupby('Name').filter(lambda group: group.shape[0] == 3)
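A small self-contained illustration of what that filter keeps (the sample data is made up):

import pandas as pd

df = pd.DataFrame({
    "Name": ["a", "a", "a", "b", "b", "c"],
    "Value": [1, 2, 3, 4, 5, 6],
})

# keep only the rows whose Name occurs exactly 3 times
df = df.groupby("Name").filter(lambda group: group.shape[0] == 3)
print(df)  # only the three 'a' rows remain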

Is there any way to create multiple axes plot only by using python-pandas?

You can use twinx:

fig, ax1 = plt.subplots()
ax2 = ax1.twinx()

Now you have 2 axes on the same figure.

This follows the two-scales demo from the matplotlib documentation.

Adapted to pandas:

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

t = np.arange(0.01, 10.0, 0.01)
data1 = np.exp(t)
data2 = np.sin(2 * np.pi * t)
df = pd.DataFrame({'X': t, 'Y1': data1, 'Y2': data2})

ax1 = df.plot(x='X', y='Y1', color='b', legend=False)
ax2 = df.plot(x='X', y='Y2', color='r', legend=False, ax=ax1.twinx())
plt.show()


Reading, calculating and grouping data from several files with pandas

Assuming that all of your files have exactly the same length (# of rows) and the same values for angles, you don't really need to make a bunch of dataframes and concatenate them all together.

If I'm understanding correctly, you just want a final dataframe with a new column for each file (named with the filename) containing the 'int' data, normalized using only the values from that specific file.

On the first file, you can create a dataframe to use as your final output, then just add a column to it for each subsequent file:

for idx, file in enumerate(files):
    df = pd.read_csv(file, names=('angle', 'int'), delim_whitespace=True)
    filename = file.split('\\')[-1][:-3]  # get the filename by splitting the full path and removing the last 3 characters (file extension)
    df[filename] = df['int'] / df['int'].max()  # use the filename itself as the new column name

    if idx == 0:  # create the norm_files output dataframe on the first file
        norm_files = df[['angle', filename]]
    else:  # add a column to norm_files for each subsequent file
        norm_files[filename] = df[filename]

Return a message and put it in a new column after comparing values in a dataset (using pandas)

Try this:

import numpy as np

country_data['Humanitarian Help'] = np.where(country_data['Deathrate'] < 9, "balanced", "Deathrate")

Note that I changed > (greater than) to < (less than) per "Deathrate is lower than 9".
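A small self-contained illustration (the country_data frame here is made up, since the original dataset isn't shown):

import numpy as np
import pandas as pd

# made-up data, just to show the np.where behaviour
country_data = pd.DataFrame({
    "Country": ["A", "B", "C"],
    "Deathrate": [4.2, 9.5, 12.1],
})

country_data['Humanitarian Help'] = np.where(
    country_data['Deathrate'] < 9, "balanced", "Deathrate"
)
print(country_data)  # rows with Deathrate < 9 get "balanced", the rest get "Deathrate"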


