How to Prevent rbind() from Getting Really Slow as a Dataframe Grows Larger

Python processing CSV file really slow

I think this gives what you're looking for and avoids looping. It could potentially be more efficient (I wasn't able to find a way to avoid creating the counts column), but it should be much faster than your current approach.

# per-group running count, restarting at 1 for each (year, month, day)
df['counts'] = df.groupby(['year', 'month', 'day'])['SPECIAL_ID'].cumcount() + 1
# zero-pad the counter to three digits, e.g. 1 -> '001'
df['counts'] = df['counts'].astype(str)
df['counts'] = df['counts'].str.zfill(3)
# replace the last three characters of SPECIAL_ID with the new counter
df['SPECIAL_ID'] = df['SPECIAL_ID'].str.slice(0, -3).str.cat(df['counts'])

I added a fake record at the end to confirm it does increment properly:

       SPECIAL_ID sex  age     zone          key  day  month  year counts
0  13012016505001   F    1  1001001  1001001_F_1   13      1  2016    001
1  25122013505001   F    4  1001001  1001001_F_4   25     12  2013    001
2  24022012505001   F    5  1001001  1001001_F_5   24      2  2012    001
3  09032012505001   F    5  1001001  1001001_F_5    9      3  2012    001
4  21082011505001   F    6  1001001  1001001_F_6   21      8  2011    001
5  16082011505001   F    6  1001001  1001001_F_6   16      8  2011    001
6  21102011505002   F    6  1001001  1001001_F_6   16      8  2011    002
7  21102012505003   F    6  1001001  1001001_F_6   16      8  2011    003

If you want to get rid of the counts column afterwards, you just need:

df.drop('counts', inplace=True, axis=1)
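
For a quick self-contained check, here is a minimal sketch using only the columns the snippet actually touches (SPECIAL_ID, day, month, year); the values are taken from the example rows above:

import pandas as pd

df = pd.DataFrame({
    'SPECIAL_ID': ['21082011505001', '16082011505001', '21102011505002'],
    'day':   [21, 16, 16],
    'month': [8, 8, 8],
    'year':  [2011, 2011, 2011],
})

# records sharing a (year, month, day) get an incrementing, zero-padded suffix
df['counts'] = df.groupby(['year', 'month', 'day'])['SPECIAL_ID'].cumcount() + 1
df['counts'] = df['counts'].astype(str).str.zfill(3)
df['SPECIAL_ID'] = df['SPECIAL_ID'].str.slice(0, -3).str.cat(df['counts'])
print(df.drop('counts', axis=1))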

Pandas DataFrame.apply going very slow for scipy.stats

pd.DataFrame.apply isn't magic: it's just a Python-level loop and a convenience method. Except that here it doesn't add much convenience. It also doesn't accept a list of functions, since it applies a single function along an axis, so your code is erroneous as written.

You can feed your dataframe as an argument to all these functions directly, and this exhibits consistent performance:

# Python 3.6.0, Pandas 0.19.2

import pandas as pd
import numpy as np
import scipy as sc

np.random.seed(0)
d = pd.DataFrame(np.random.randint(0,10, size=10**6))

%timeit np.mean(d) # 1.3 ms per loop
%timeit np.std(d) # 2.82 ms per loop
%timeit sc.stats.kurtosis(d) # 33 ms per loop
%timeit [func(d) for func in (np.mean, np.std)] # 3.95 ms per loop
%timeit [func(d) for func in (np.mean, sc.stats.kurtosis)] # 34.8 ms per loop
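
If you still want a single call that labels each statistic, DataFrame.agg accepts a list of functions; a minimal sketch, assuming pandas 0.20 or later (the timings above were run on 0.19.2):

import numpy as np
import pandas as pd
import scipy.stats

np.random.seed(0)
d = pd.DataFrame(np.random.randint(0, 10, size=10**6))

# each function is applied column-wise; rows of the result are labelled by function name
print(d.agg([np.mean, np.std, scipy.stats.kurtosis]))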

Efficient way to add rows to dataframe while being able to add column names

We can use the deprecated rbind_list() from dplyr:

rbind_list(list_of_nums)
# A tibble: 3 x 5
#   alpha  beta gamma    pi omega
#   <dbl> <dbl> <dbl> <dbl> <dbl>
# 1     1     4     2    NA    NA
# 2     5    NA    18     2    NA
# 3     2    10    NA    NA    12
# warning:
# 'rbind_list' is deprecated.
# Use 'bind_rows()' instead.
# See help("Deprecated")

Benchmark

l <- rep(list_of_nums, 10000)

library(microbenchmark)
b <- microbenchmark(
  markus    = rbind_list(l),
  OP        = OP(l),
  Julian_Hn = bind_rows(!!!l),
  times     = 10L
)

autoplot(b)


b
# Unit: milliseconds
#      expr         min          lq        mean      median          uq         max neval cld
#    markus   108.43026   108.98696   119.86560   122.87064   128.76507   134.64753    10  a
#        OP 33415.89685 33647.62856 34314.40213 34058.06817 34695.69121 36231.96304    10  b
# Julian_Hn    27.36839    27.77864    30.83439    28.44502    29.68894    42.87212    10  a

Where OP is given by

OP <- function(x) {
  df = data.frame()

  for (num in x) {
    temp_df = data.frame(as.list(num))
    df = dplyr::bind_rows(df, temp_df)
  }
  df
}

Julia Optimization

Using DataFrames.jl you can do e.g.:

function bootstrap(; iters=1, data=nothing, statistic=nothing)
    statArr = Float64.(empty(data))  # init an empty data frame
    for i in 1:iters
        stat = statistic(data, rand(1:nrow(data), nrow(data)))
        push!(statArr, stat)  # push the row onto the empty data frame
    end
    return statArr
end;

# Statistic function for column means
meanmap(data, sel) = [mean(@view x[sel]) for x in eachcol(data)]

This should be faster than R. The changes are:

  • major: use views instead of copying everything in every iteration
  • minor: do not create a data frame for each bootstrap replicate; build a vector instead and push! it rather than append!ing it (this saves the cost of creating and validating data frame objects)

(I have made only the major optimizations of the code; there are some additional minor optimizations that could be made, but they should not affect the run time in a significant way)

Also note that you are close to the maximum execution speed, as the following:

julia> x = rand(1:nrow(df), nrow(df));

julia> y = df[!, 1];

julia> f(y, x) = mean(@view y[x]);

julia> g(y, x) = [f(y, x) for _ in 1:9999*1000];

julia> @time g(y, x);

is roughly the lower bound of the execution time you can expect, and it is not much faster than the code above (it is faster, of course, by around 25%-30%, since it does less work and is more CPU-cache friendly).


As a small comment showing how details matter in such cases (I think it is interesting, although it is a minor optimization, so I left it out of the code above):

If you use sort!(rand(1:nrow(data), nrow(data))) instead of rand(1:nrow(data), nrow(data)), you save an additional second. The reason is that this way you ensure that you access the data sequentially when calculating the mean (which is more CPU-cache friendly, and mean is unaffected by the order of observations).

A second comment along the same lines: on a multi-core machine (with Julia started with the -t switch set to use more than one thread), you could use threading to speed things up like this (again, I did not optimize this to the last possible tweak, but rather wanted to show the main idea):

function bootstrap(; iters=1, data=nothing, statistic=nothing)
    statArr = Float64.(empty(data))  # init empty data frame
    tmp = Vector{Any}(undef, iters)
    Threads.@threads for i in 1:iters
        stat = statistic(data, rand(1:nrow(data), nrow(data)))
        tmp[i] = stat
    end
    for v in tmp
        push!(statArr, v)  # push each collected row onto the empty data frame
    end
    return statArr
end

This is much faster, and it is easy to do in Julia (doable, but not as easy, in R).


Regarding views, you can read about them in the Julia documentation.

Faster way to make pandas Multiindex dataframe than append

You can adapt the answer to a very similar question as follows:

z = json.loads(json_data)

out = pd.Series({
    (i, j, m): z[i][j][k][m]
    for i in z
    for j in z[i]
    for k in ['players']
    for m in z[i][j][k]
}).to_frame('salary').rename_axis('year team player'.split())

# out:

                                          salary
year      team      player
1990-1991 Cleveland Hot Rod Williams  $3,785,000
                    Danny Ferry       $2,640,000
                    Mark Price        $1,400,000
                    Brad Daugherty    $1,320,000
                    Larry Nance       $1,260,000
                    Chucky Brown        $630,000
                    Steve Kerr          $548,000
                    Derrick Chievous    $525,000
                    Winston Bennett     $525,000
                    John Morton         $350,000
                    Milos Babic         $200,000
                    Gerald Paddio       $120,000
                    Darnell Valentine   $100,000
                    Henry James          $75,000

Also, if you intend to do some numerical analysis with those salaries, you probably want them as numbers, not strings. If so, also consider:

out['salary'] = pd.to_numeric(out['salary'].str.replace(r'\D', '', regex=True))

PS: Explanation:

The for lines are just one big comprehension to flatten your nested dict. To understand how it works, try first:

[
    (i, j)
    for i in z
    for j in z[i]
]

The 3rd for would list all the keys of z[i][j], which would be ['salary', 'players', 'url'], but we are only interested in 'players', so we say so.

The final bit is that, instead of a list, we want a dict. Try the expression without wrapping it in pd.Series() and you'll see exactly what's going on.
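
For an end-to-end, self-contained illustration, here is a minimal sketch with a made-up two-player payload in the assumed {year: {team: {'players': {name: salary}}}} layout (the values are copied from the output above; the exact structure is only illustrative):

import json
import pandas as pd

json_data = json.dumps({
    "1990-1991": {
        "Cleveland": {
            "url": "https://example.com/cle",  # extra keys are simply ignored
            "players": {
                "Hot Rod Williams": "$3,785,000",
                "Danny Ferry": "$2,640,000",
            },
        }
    }
})

z = json.loads(json_data)
out = pd.Series({
    (i, j, m): z[i][j][k][m]
    for i in z
    for j in z[i]
    for k in ['players']
    for m in z[i][j][k]
}).to_frame('salary').rename_axis(['year', 'team', 'player'])

out['salary'] = pd.to_numeric(out['salary'].str.replace(r'\D', '', regex=True))
print(out)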

Fastest save and load options for a numpy array

For really big arrays, I've heard of several solutions, and they mostly rely on being lazy about the I/O:

  • numpy.memmap, which maps big arrays to binary files on disk

    • Pros :

      • No dependency other than Numpy
      • Transparent replacement of ndarray (any class accepting ndarray accepts memmap)
    • Cons :

      • Chunks of your array are limited to 2.5G
      • Still limited by Numpy throughput
  • Use Python bindings for HDF5, a bigdata-ready file format, like PyTables or h5py

    • Pros :

      • Format supports compression, indexing, and other super nice features
      • Apparently the ultimate PetaByte-large file format
    • Cons :

      • Learning curve of having a hierarchical format?
      • Have to define what your performance needs are (see later)
  • Python's pickling system (out of the race, mentioned for Pythonicity rather than speed)

    • Pros:

      • It's Pythonic! (haha)
      • Supports all sorts of objects
    • Cons:

      • Probably slower than the others (because it is aimed at arbitrary objects, not arrays)

Numpy.memmap

From the docs of numpy.memmap:

Create a memory-map to an array stored in a binary file on disk.

Memory-mapped files are used for accessing small segments of large files on disk, without reading the entire file into memory

The memmap object can be used anywhere an ndarray is accepted. Given any memmap fp, isinstance(fp, numpy.ndarray) returns True.
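
A minimal sketch of dumping and reloading an array through a memmap (the file name and shape here are arbitrary choices for illustration):

import numpy as np

shape = (10_000, 1_000)

# write: create a disk-backed array and fill it like a normal ndarray
fp = np.memmap('big_array.dat', dtype='float32', mode='w+', shape=shape)
fp[:] = np.random.rand(*shape)
fp.flush()  # make sure the changes are written to disk

# read: map the same file back without loading it all into memory
fp2 = np.memmap('big_array.dat', dtype='float32', mode='r', shape=shape)
print(fp2[0, :5])  # only the touched pages are actually read from disk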


HDF5 arrays

From the h5py doc

Lets you store huge amounts of numerical data, and easily manipulate that data from NumPy. For example, you can slice into multi-terabyte datasets stored on disk, as if they were real NumPy arrays. Thousands of datasets can be stored in a single file, categorized and tagged however you want.

The format supports compressing data in various ways (more bits loaded for the same I/O read), which means that the data becomes less easy to query individually; but in your case (purely loading / dumping arrays), it might be efficient.
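
A minimal sketch of saving and loading an array with h5py (file and dataset names are arbitrary; gzip is one of several available compression filters):

import numpy as np
import h5py

a = np.random.rand(1_000, 1_000)

# save: one named dataset per array, optionally compressed
with h5py.File('arrays.h5', 'w') as f:
    f.create_dataset('a', data=a, compression='gzip')

# load: slicing reads only the requested part from disk
with h5py.File('arrays.h5', 'r') as f:
    first_rows = f['a'][:10]  # partial read
    full = f['a'][:]          # whole dataset into memory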

Can I execute a function in apply to pandas dataframe asynchronously?

An asynchronous I/O approach with the well-known asyncio + aiohttp libraries:

Demonstrated on a sample DataFrame and simple webpage-content-processing routines (to show the mechanics of the approach).

Let's say we need to count all header, link (<a>) and span tags across all URLs and store the resulting counts in the source dataframe.

import pandas as pd
import asyncio
import aiohttp
from bs4 import BeautifulSoup


def count_headers(html):
    return len(list(html.select('h1,h2,h3,h4,h5,h6')))

def count_links(html):
    return len(list(html.find_all('a')))

def count_spans(html):
    return len(list(html.find_all('spans')))


df = pd.DataFrame({'id': [1, 2, 3], 'url': ['https://stackoverflow.com/questions',
                                            'https://facebook.com',
                                            'https://wiki.archlinux.org']})
df['head_c'], df['link_c'], df['span_c'] = [None, None, None]
# print(df)

async def process_url(df, url):
    async with aiohttp.ClientSession() as session:
        resp = await session.get(url)
        content = await resp.text()
        soup = BeautifulSoup(content, 'html.parser')
        headers_count = count_headers(soup)
        links_count = count_links(soup)
        spans_count = count_spans(soup)
        print("Done")

        df.loc[df['url'] == url, ['head_c', 'link_c', 'span_c']] = \
            [[headers_count, links_count, spans_count]]


async def main(df):
    await asyncio.gather(*[process_url(df, url) for url in df['url']])
    print(df)


loop = asyncio.get_event_loop()
loop.run_until_complete(main(df))
loop.close()

The output:

Done
Done
Done
   id                                  url  head_c  link_c  span_c
0   1  https://stackoverflow.com/questions      25     306       0
1   2                 https://facebook.com       3      55       0
2   3           https://wiki.archlinux.org      15      91       0

Enjoy the performance difference.
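
One side note: on Python 3.7+, the explicit event-loop management at the end of the script can be replaced with a single call; a minimal equivalent:

# Python 3.7+ replacement for the get_event_loop()/run_until_complete()/close() block
asyncio.run(main(df))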

Cumulative OLS with Python Pandas

Following the advice in the comments, I created my own function that can be used with apply and that relies on cumsum to accumulate all the individual terms needed to express the coefficient of a univariate OLS regression vectorially.

import numpy as np


def cumulative_ols(
    data_frame,
    lhs_column,
    rhs_column,
    date_column,
    min_obs=60,
):
    """
    Function to perform a cumulative OLS on a Pandas data frame. It is
    meant to be used with `apply` after grouping the data frame by categories
    and sorting by date, so that the regression below applies to the time
    series of a single category's data and the use of `cumsum` will work
    appropriately given sorted dates. It is also assumed that the date
    conventions of the left-hand-side and right-hand-side variables have been
    arranged by the user to match up with any lagging conventions needed.

    This OLS is implicitly univariate and relies on the simplification to the
    formula:

        Cov(x,y) ~ (1/n)*sum(x*y) - (1/n)*sum(x)*(1/n)*sum(y)
        Var(x)   ~ (1/n)*sum(x^2) - ((1/n)*sum(x))^2
        beta     ~ Cov(x,y) / Var(x)

    and the code makes a further simplification by cancelling one factor
    of (1/n).

    Notes: one easy improvement is to change the date column to a generic sort
    column since there's no special reason the regressions need to be time-
    series specific.
    """
    data_frame["xy"] = (data_frame[lhs_column] * data_frame[rhs_column]).fillna(0.0)
    data_frame["x2"] = (data_frame[rhs_column]**2).fillna(0.0)
    data_frame["yobs"] = data_frame[lhs_column].notnull().map(int)
    data_frame["xobs"] = data_frame[rhs_column].notnull().map(int)
    data_frame["cum_yobs"] = data_frame["yobs"].cumsum()
    data_frame["cum_xobs"] = data_frame["xobs"].cumsum()
    data_frame["cumsum_xy"] = data_frame["xy"].cumsum()
    data_frame["cumsum_x2"] = data_frame["x2"].cumsum()
    data_frame["cumsum_x"] = data_frame[rhs_column].fillna(0.0).cumsum()
    data_frame["cumsum_y"] = data_frame[lhs_column].fillna(0.0).cumsum()
    data_frame["cum_cov"] = data_frame["cumsum_xy"] - (1.0/data_frame["cum_yobs"])*data_frame["cumsum_x"]*data_frame["cumsum_y"]
    data_frame["cum_x_var"] = data_frame["cumsum_x2"] - (1.0/data_frame["cum_xobs"])*(data_frame["cumsum_x"])**2
    data_frame["FactorBeta"] = data_frame["cum_cov"] / data_frame["cum_x_var"]
    # use .loc to avoid chained-assignment issues when masking out short histories
    data_frame.loc[data_frame["cum_yobs"] < min_obs, "FactorBeta"] = np.nan
    return data_frame[[date_column, "FactorBeta"]].set_index(date_column)
### End cumulative_ols

I have verified on numerous test cases that this matches the output of my former function and the output of NumPy's linalg.lstsq function. I haven't done a full benchmark on the timing, but anecdotally, it is around 50 times faster in the cases I've been working on.
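
A minimal usage sketch, with a hypothetical long-format frame (the category, date, x and y columns are made up for illustration), showing how the function is meant to be called via groupby + apply:

import numpy as np
import pandas as pd

# hypothetical panel: one category, daily dates, a noisy linear relation y ~ 2x
rng = np.random.default_rng(0)
n = 120
df = pd.DataFrame({
    "category": "A",
    "date": pd.date_range("2010-01-01", periods=n, freq="D"),
    "x": rng.normal(size=n),
})
df["y"] = 2.0 * df["x"] + rng.normal(scale=0.1, size=n)

betas = (
    df.sort_values("date")
      .groupby("category", group_keys=True)
      .apply(cumulative_ols, lhs_column="y", rhs_column="x",
             date_column="date", min_obs=60)
)
print(betas.tail())  # FactorBeta should settle near 2.0 once min_obs is reached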

Efficient way to assign values from another column pandas df

You can use:

import numpy as np
import pandas as pd

def f(x):
    # get unique days for this location
    u = x['Day'].unique()
    # mapping dictionary: every 3 unique days -> next group number
    d = dict(zip(u, np.arange(len(u)) // 3 + 1))
    x['new'] = x['Day'].map(d)
    return x

df = df.groupby('Location', sort=False).apply(f)
# combine the group number with the Location
s = df['new'].astype(str) + df['Location']
# encode the combinations by factorize
df['new'] = pd.Series(pd.factorize(s)[0] + 1).map(str).radd('C')
print(df)
      Day Location new
0     Mon     Home  C1
1    Tues     Home  C1
2     Wed     Away  C2
3     Wed     Home  C1
4   Thurs     Away  C2
5   Thurs     Home  C3
6     Fri     Home  C3
7     Mon     Home  C1
8     Sat     Home  C3
9     Fri     Away  C2
10    Sun     Home  C4
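
If you want to run the snippet end to end, the input frame can be reconstructed from the printed output above:

import pandas as pd

df = pd.DataFrame({
    'Day':      ['Mon', 'Tues', 'Wed', 'Wed', 'Thurs', 'Thurs',
                 'Fri', 'Mon', 'Sat', 'Fri', 'Sun'],
    'Location': ['Home', 'Home', 'Away', 'Home', 'Away', 'Home',
                 'Home', 'Home', 'Home', 'Away', 'Home'],
})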

