How to Prevent rbind() from Getting Really Slow as a Dataframe Grows Larger

Python processing CSV file really slow

I think this gives what you're looking for and avoids looping. It could potentially be more efficient (I wasn't able to find a way to avoid creating the counts column), but it should be much faster than your current approach.

# per-group running count, restarting at 1 for each (year, month, day)
df['counts'] = df.groupby(['year', 'month', 'day'])['SPECIAL_ID'].cumcount() + 1
# zero-pad the counter to three digits, e.g. 1 -> '001'
df['counts'] = df['counts'].astype(str)
df['counts'] = df['counts'].str.zfill(3)
# replace the last three characters of SPECIAL_ID with the new counter
df['SPECIAL_ID'] = df['SPECIAL_ID'].str.slice(0, -3).str.cat(df['counts'])

I added a fake record at the end to confirm it does increment properly:

       SPECIAL_ID sex  age     zone          key  day  month  year counts
0  13012016505001   F    1  1001001  1001001_F_1   13      1  2016    001
1  25122013505001   F    4  1001001  1001001_F_4   25     12  2013    001
2  24022012505001   F    5  1001001  1001001_F_5   24      2  2012    001
3  09032012505001   F    5  1001001  1001001_F_5    9      3  2012    001
4  21082011505001   F    6  1001001  1001001_F_6   21      8  2011    001
5  16082011505001   F    6  1001001  1001001_F_6   16      8  2011    001
6  21102011505002   F    6  1001001  1001001_F_6   16      8  2011    002
7  21102012505003   F    6  1001001  1001001_F_6   16      8  2011    003

If you want to get rid of the counts column afterwards, you just need:

df.drop('counts', inplace=True, axis=1)
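
For a quick self-contained check, here is a minimal sketch using only the columns the snippet actually touches (SPECIAL_ID, day, month, year); the values are taken from the example rows above:

import pandas as pd

df = pd.DataFrame({
    'SPECIAL_ID': ['21082011505001', '16082011505001', '21102011505002'],
    'day':   [21, 16, 16],
    'month': [8, 8, 8],
    'year':  [2011, 2011, 2011],
})

# records sharing a (year, month, day) get an incrementing, zero-padded suffix
df['counts'] = df.groupby(['year', 'month', 'day'])['SPECIAL_ID'].cumcount() + 1
df['counts'] = df['counts'].astype(str).str.zfill(3)
df['SPECIAL_ID'] = df['SPECIAL_ID'].str.slice(0, -3).str.cat(df['counts'])
print(df.drop('counts', axis=1))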

Pandas DataFrame.apply going very slow for scipy.stats

pd.DataFrame.apply isn't magic: it's just a Python-level loop and a convenience method. Except that here it doesn't add much convenience. It also doesn't accept a list of functions, since it applies a single function along an axis, so your code is erroneous as written.

You can feed your dataframe as an argument to all these functions directly, and this exhibits consistent performance:

# Python 3.6.0, Pandas 0.19.2

import pandas as pd
import numpy as np
import scipy as sc

np.random.seed(0)
d = pd.DataFrame(np.random.randint(0,10, size=10**6))

%timeit np.mean(d) # 1.3 ms per loop
%timeit np.std(d) # 2.82 ms per loop
%timeit sc.stats.kurtosis(d) # 33 ms per loop
%timeit [func(d) for func in (np.mean, np.std)] # 3.95 ms per loop
%timeit [func(d) for func in (np.mean, sc.stats.kurtosis)] # 34.8 ms per loop
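
If you still want a single call that labels each statistic, DataFrame.agg accepts a list of functions; a minimal sketch, assuming pandas 0.20 or later (the timings above were run on 0.19.2):

import numpy as np
import pandas as pd
import scipy.stats

np.random.seed(0)
d = pd.DataFrame(np.random.randint(0, 10, size=10**6))

# each function is applied column-wise; rows of the result are labelled by function name
print(d.agg([np.mean, np.std, scipy.stats.kurtosis]))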

Efficient way to add rows to dataframe while being able to add column names

We can use the deprecated rbind_list() from dplyr:

rbind_list(list_of_nums)
# A tibble: 3 x 5
#   alpha  beta gamma    pi omega
#   <dbl> <dbl> <dbl> <dbl> <dbl>
# 1     1     4     2    NA    NA
# 2     5    NA    18     2    NA
# 3     2    10    NA    NA    12
# warning:
# 'rbind_list' is deprecated.
# Use 'bind_rows()' instead.
# See help("Deprecated")

Benchmark

l <- rep(list_of_nums, 10000)

library(microbenchmark)
b <- microbenchmark(
  markus    = rbind_list(l),
  OP        = OP(l),
  Julian_Hn = bind_rows(!!!l),
  times     = 10L
)

autoplot(b)


b
# Unit: milliseconds
#      expr         min          lq        mean      median          uq         max neval cld
#    markus   108.43026   108.98696   119.86560   122.87064   128.76507   134.64753    10  a
#        OP 33415.89685 33647.62856 34314.40213 34058.06817 34695.69121 36231.96304    10  b
# Julian_Hn    27.36839    27.77864    30.83439    28.44502    29.68894    42.87212    10  a

Where OP is given by

OP <- function(x) {
  df = data.frame()

  for (num in x) {
    temp_df = data.frame(as.list(num))
    df = dplyr::bind_rows(df, temp_df)
  }
  df
}

Julia Optimization

Using DataFrames.jl you can do e.g.:

function bootstrap(; iters=1, data=nothing, statistic=nothing)
    statArr = Float64.(empty(data))  # init an empty data frame
    for i in 1:iters
        stat = statistic(data, rand(1:nrow(data), nrow(data)))
        push!(statArr, stat)  # push the row onto the empty data frame
    end
    return statArr
end;

# Statistic function for column means
meanmap(data, sel) = [mean(@view x[sel]) for x in eachcol(data)]

This should be faster than R. The changes are:

  • major: use views instead of copying everything in every iteration
  • minor: do not create a data frame for each bootstrap replicate; build a vector instead and push! it rather than append!ing it (this saves the cost of creating and validating data frame objects)

(I have made only the major optimizations of the code; there are some additional minor optimizations that could be made, but they should not affect the run time in a significant way)

Also note that you are close to the maximum execution speed, as the following:

julia> x = rand(1:nrow(df), nrow(df));

julia> y = df[!, 1];

julia> f(y, x) = mean(@view y[x]);

julia> g(y, x) = [f(y, x) for _ in 1:9999*1000];

julia> @time g(y, x);

is roughly the lower bound of the execution time you can expect, and it is not much faster than the code above (it is faster, of course, by around 25%-30%, since it does less work and is more CPU-cache friendly).


As a small comment showing how details matter in such cases (I think it is interesting, although it is a minor optimization, so I left it out of the code above):

If you use sort!(rand(1:nrow(data), nrow(data))) instead of rand(1:nrow(data), nrow(data)), you save an additional second. The reason is that this way you ensure that you access the data sequentially when calculating the mean (which is more CPU-cache friendly, and mean is unaffected by the order of observations).

A second comment along the same lines: on a multi-core machine (with Julia started with the -t switch set to use more than one thread), you could use threading to speed things up like this (again, I did not optimize this to the last possible tweak, but rather wanted to show the main idea):

function bootstrap(; iters=1, data=nothing, statistic=nothing)
    statArr = Float64.(empty(data))  # init empty data frame
    tmp = Vector{Any}(undef, iters)
    Threads.@threads for i in 1:iters
        stat = statistic(data, rand(1:nrow(data), nrow(data)))
        tmp[i] = stat
    end
    for v in tmp
        push!(statArr, v)  # push each collected row onto the empty data frame
    end
    return statArr
end

This is much faster, and it is easy to do in Julia (doable, but not as easy, in R).


Regarding views, you can read about them in the Julia documentation.

Faster way to make pandas Multiindex dataframe than append

You can adapt the answer to a very similar question as follows:

z = json.loads(json_data)

out = pd.Series({
    (i, j, m): z[i][j][k][m]
    for i in z
    for j in z[i]
    for k in ['players']
    for m in z[i][j][k]
}).to_frame('salary').rename_axis('year team player'.split())

# out:

                                          salary
year      team      player
1990-1991 Cleveland Hot Rod Williams  $3,785,000
                    Danny Ferry       $2,640,000
                    Mark Price        $1,400,000
                    Brad Daugherty    $1,320,000
                    Larry Nance       $1,260,000
                    Chucky Brown        $630,000
                    Steve Kerr          $548,000
                    Derrick Chievous    $525,000
                    Winston Bennett     $525,000
                    John Morton         $350,000
                    Milos Babic         $200,000
                    Gerald Paddio       $120,000
                    Darnell Valentine   $100,000
                    Henry James          $75,000

Also, if you intend to do some numerical analysis with those salaries, you probably want them as numbers, not strings. If so, also consider:

out['salary'] = pd.to_numeric(out['salary'].str.replace(r'\D', '', regex=True))

PS: Explanation:

The for lines are just one big comprehension to flatten your nested dict. To understand how it works, try first:

[
    (i, j)
    for i in z
    for j in z[i]
]

The 3rd for would list all the keys of z[i][j], which would be ['salary', 'players', 'url'], but we are only interested in 'players', so we say so.

The final bit is that, instead of a list, we want a dict. Try the expression without wrapping it in pd.Series() and you'll see exactly what's going on.
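
For an end-to-end, self-contained illustration, here is a minimal sketch with a made-up two-player payload in the assumed {year: {team: {'players': {name: salary}}}} layout (the values are copied from the output above; the exact structure is only illustrative):

import json
import pandas as pd

json_data = json.dumps({
    "1990-1991": {
        "Cleveland": {
            "url": "https://example.com/cle",  # extra keys are simply ignored
            "players": {
                "Hot Rod Williams": "$3,785,000",
                "Danny Ferry": "$2,640,000",
            },
        }
    }
})

z = json.loads(json_data)
out = pd.Series({
    (i, j, m): z[i][j][k][m]
    for i in z
    for j in z[i]
    for k in ['players']
    for m in z[i][j][k]
}).to_frame('salary').rename_axis(['year', 'team', 'player'])

out['salary'] = pd.to_numeric(out['salary'].str.replace(r'\D', '', regex=True))
print(out)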

Fastest save and load options for a numpy array

For really big arrays, I've heard of several solutions, and they mostly rely on being lazy about the I/O:

  • numpy.memmap, which maps big arrays to binary files on disk

    • Pros :

      • No dependency other than Numpy
      • Transparent replacement of ndarray (any class accepting ndarray accepts memmap)
    • Cons :

      • Chunks of your array are limited to 2.5G
      • Still limited by Numpy throughput
  • Use Python bindings for HDF5, a bigdata-ready file format, like PyTables or h5py

    • Pros :

      • Format supports compression, indexing, and other super nice features
      • Apparently the ultimate PetaByte-large file format
    • Cons :

      • Learning curve of having a hierarchical format?
      • Have to define what your performance needs are (see later)
  • Python's pickling system (out of the race, mentioned for Pythonicity rather than speed)

    • Pros:

      • It's Pythonic! (haha)
      • Supports all sorts of objects
    • Cons:

      • Probably slower than the others (because it is aimed at arbitrary objects, not arrays)

Numpy.memmap

From the docs of numpy.memmap:

Create a memory-map to an array stored in a binary file on disk.

Memory-mapped files are used for accessing small segments of large files on disk, without reading the entire file into memory

The memmap object can be used anywhere an ndarray is accepted. Given any memmap fp, isinstance(fp, numpy.ndarray) returns True.
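
A minimal sketch of dumping and reloading an array through a memmap (the file name and shape here are arbitrary choices for illustration):

import numpy as np

shape = (10_000, 1_000)

# write: create a disk-backed array and fill it like a normal ndarray
fp = np.memmap('big_array.dat', dtype='float32', mode='w+', shape=shape)
fp[:] = np.random.rand(*shape)
fp.flush()  # make sure the changes are written to disk

# read: map the same file back without loading it all into memory
fp2 = np.memmap('big_array.dat', dtype='float32', mode='r', shape=shape)
print(fp2[0, :5])  # only the touched pages are actually read from disk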


HDF5 arrays

From the h5py doc

Lets you store huge amounts of numerical data, and easily manipulate that data from NumPy. For example, you can slice into multi-terabyte datasets stored on disk, as if they were real NumPy arrays. Thousands of datasets can be stored in a single file, categorized and tagged however you want.

The format supports compressing data in various ways (more bits loaded for the same I/O read), which means that the data becomes less easy to query individually; but in your case (purely loading / dumping arrays), it might be efficient.
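
A minimal sketch of saving and loading an array with h5py (file and dataset names are arbitrary; gzip is one of several available compression filters):

import numpy as np
import h5py

a = np.random.rand(1_000, 1_000)

# save: one named dataset per array, optionally compressed
with h5py.File('arrays.h5', 'w') as f:
    f.create_dataset('a', data=a, compression='gzip')

# load: slicing reads only the requested part from disk
with h5py.File('arrays.h5', 'r') as f:
    first_rows = f['a'][:10]  # partial read
    full = f['a'][:]          # whole dataset into memory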

Can I execute a function in apply to pandas dataframe asynchronously?

An asynchronous I/O approach with the well-known asyncio + aiohttp libraries:

Demonstrated on a sample DataFrame and simple webpage-content-processing routines (to show the mechanics of the approach).

Let's say we need to count all header, link (<a>) and span tags across all URLs and store the resulting counts in the source dataframe.

import pandas as pd
import asyncio
import aiohttp
from bs4 import BeautifulSoup


def count_headers(html):
    return len(list(html.select('h1,h2,h3,h4,h5,h6')))

def count_links(html):
    return len(list(html.find_all('a')))

def count_spans(html):
    return len(list(html.find_all('spans')))


df = pd.DataFrame({'id': [1, 2, 3], 'url': ['https://stackoverflow.com/questions',
                                            'https://facebook.com',
                                            'https://wiki.archlinux.org']})
df['head_c'], df['link_c'], df['span_c'] = [None, None, None]
# print(df)

async def process_url(df, url):
    async with aiohttp.ClientSession() as session:
        resp = await session.get(url)
        content = await resp.text()
        soup = BeautifulSoup(content, 'html.parser')
        headers_count = count_headers(soup)
        links_count = count_links(soup)
        spans_count = count_spans(soup)
        print("Done")

        df.loc[df['url'] == url, ['head_c', 'link_c', 'span_c']] = \
            [[headers_count, links_count, spans_count]]


async def main(df):
    await asyncio.gather(*[process_url(df, url) for url in df['url']])
    print(df)


loop = asyncio.get_event_loop()
loop.run_until_complete(main(df))
loop.close()

The output:

Done
Done
Done
   id                                  url  head_c  link_c  span_c
0   1  https://stackoverflow.com/questions      25     306       0
1   2                 https://facebook.com       3      55       0
2   3           https://wiki.archlinux.org      15      91       0

Enjoy the performance difference.
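
One side note: on Python 3.7+, the explicit event-loop management at the end of the script can be replaced with a single call; a minimal equivalent:

# Python 3.7+ replacement for the get_event_loop()/run_until_complete()/close() block
asyncio.run(main(df))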

Cumulative OLS with Python Pandas

Following the advice in the comments, I created my own function that can be used with apply and that relies on cumsum to accumulate all the individual terms needed to express the coefficient of a univariate OLS regression vectorially.

import numpy as np


def cumulative_ols(
    data_frame,
    lhs_column,
    rhs_column,
    date_column,
    min_obs=60,
):
    """
    Function to perform a cumulative OLS on a Pandas data frame. It is
    meant to be used with `apply` after grouping the data frame by categories
    and sorting by date, so that the regression below applies to the time
    series of a single category's data and the use of `cumsum` will work
    appropriately given sorted dates. It is also assumed that the date
    conventions of the left-hand-side and right-hand-side variables have been
    arranged by the user to match up with any lagging conventions needed.

    This OLS is implicitly univariate and relies on the simplification to the
    formula:

        Cov(x,y) ~ (1/n)*sum(x*y) - (1/n)*sum(x)*(1/n)*sum(y)
        Var(x)   ~ (1/n)*sum(x^2) - ((1/n)*sum(x))^2
        beta     ~ Cov(x,y) / Var(x)

    and the code makes a further simplification by cancelling one factor
    of (1/n).

    Notes: one easy improvement is to change the date column to a generic sort
    column since there's no special reason the regressions need to be time-
    series specific.
    """
    data_frame["xy"] = (data_frame[lhs_column] * data_frame[rhs_column]).fillna(0.0)
    data_frame["x2"] = (data_frame[rhs_column]**2).fillna(0.0)
    data_frame["yobs"] = data_frame[lhs_column].notnull().map(int)
    data_frame["xobs"] = data_frame[rhs_column].notnull().map(int)
    data_frame["cum_yobs"] = data_frame["yobs"].cumsum()
    data_frame["cum_xobs"] = data_frame["xobs"].cumsum()
    data_frame["cumsum_xy"] = data_frame["xy"].cumsum()
    data_frame["cumsum_x2"] = data_frame["x2"].cumsum()
    data_frame["cumsum_x"] = data_frame[rhs_column].fillna(0.0).cumsum()
    data_frame["cumsum_y"] = data_frame[lhs_column].fillna(0.0).cumsum()
    data_frame["cum_cov"] = data_frame["cumsum_xy"] - (1.0/data_frame["cum_yobs"])*data_frame["cumsum_x"]*data_frame["cumsum_y"]
    data_frame["cum_x_var"] = data_frame["cumsum_x2"] - (1.0/data_frame["cum_xobs"])*(data_frame["cumsum_x"])**2
    data_frame["FactorBeta"] = data_frame["cum_cov"] / data_frame["cum_x_var"]
    # use .loc to avoid chained-assignment issues when masking out short histories
    data_frame.loc[data_frame["cum_yobs"] < min_obs, "FactorBeta"] = np.nan
    return data_frame[[date_column, "FactorBeta"]].set_index(date_column)
### End cumulative_ols

I have verified on numerous test cases that this matches the output of my former function and the output of NumPy's linalg.lstsq function. I haven't done a full benchmark on the timing, but anecdotally, it is around 50 times faster in the cases I've been working on.
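
A minimal usage sketch, with a hypothetical long-format frame (the category, date, x and y columns are made up for illustration), showing how the function is meant to be called via groupby + apply:

import numpy as np
import pandas as pd

# hypothetical panel: one category, daily dates, a noisy linear relation y ~ 2x
rng = np.random.default_rng(0)
n = 120
df = pd.DataFrame({
    "category": "A",
    "date": pd.date_range("2010-01-01", periods=n, freq="D"),
    "x": rng.normal(size=n),
})
df["y"] = 2.0 * df["x"] + rng.normal(scale=0.1, size=n)

betas = (
    df.sort_values("date")
      .groupby("category", group_keys=True)
      .apply(cumulative_ols, lhs_column="y", rhs_column="x",
             date_column="date", min_obs=60)
)
print(betas.tail())  # FactorBeta should settle near 2.0 once min_obs is reached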

Efficient way to assign values from another column pandas df

You can use:

import numpy as np
import pandas as pd

def f(x):
    # get unique days for this location
    u = x['Day'].unique()
    # mapping dictionary: every 3 unique days -> next group number
    d = dict(zip(u, np.arange(len(u)) // 3 + 1))
    x['new'] = x['Day'].map(d)
    return x

df = df.groupby('Location', sort=False).apply(f)
# combine the group number with the Location
s = df['new'].astype(str) + df['Location']
# encode the combinations by factorize
df['new'] = pd.Series(pd.factorize(s)[0] + 1).map(str).radd('C')
print(df)
      Day Location new
0     Mon     Home  C1
1    Tues     Home  C1
2     Wed     Away  C2
3     Wed     Home  C1
4   Thurs     Away  C2
5   Thurs     Home  C3
6     Fri     Home  C3
7     Mon     Home  C1
8     Sat     Home  C3
9     Fri     Away  C2
10    Sun     Home  C4
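
If you want to run the snippet end to end, the input frame can be reconstructed from the printed output above:

import pandas as pd

df = pd.DataFrame({
    'Day':      ['Mon', 'Tues', 'Wed', 'Wed', 'Thurs', 'Thurs',
                 'Fri', 'Mon', 'Sat', 'Fri', 'Sun'],
    'Location': ['Home', 'Home', 'Away', 'Home', 'Away', 'Home',
                 'Home', 'Home', 'Home', 'Away', 'Home'],
})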

