Numpy/Scipy Equivalent of R ecdf(x)(x) Function

numpy/scipy equivalent of R ecdf(x)(x) function?

Try these links:

statsmodels.ECDF

ECDF in python without step function?

Example code

import numpy as np
from statsmodels.distributions.empirical_distribution import ECDF
import matplotlib.pyplot as plt

data = np.random.normal(0, 5, size=2000)

ecdf = ECDF(data)
plt.plot(ecdf.x, ecdf.y)

How to plot empirical cdf (ecdf)

That looks to be (almost) exactly what you want. Two things:

First, the result is a tuple of four items: the first is the cumulative count of points in or below each bin; the second is the lower limit of the smallest bin; the third is the bin width; and the last is the number of points outside the limits (but since you haven't set any limits, all points will be binned).

Second, you'll want to rescale the results so the final value is 1, to follow the usual conventions of a CDF, but otherwise it's right.

Here's what it does under the hood:

def cumfreq(a, numbins=10, defaultreallimits=None):
    # docstring omitted
    h, l, b, e = histogram(a, numbins, defaultreallimits)
    cumhist = np.cumsum(h * 1, axis=0)
    return cumhist, l, b, e

It does the histogramming, then takes a cumulative sum of the counts in each bin, so the ith value of the result is the number of array values less than or equal to the maximum of the ith bin. The final value is therefore just the size of the initial array.

Finally, to plot it, you'll need the starting point of the smallest bin and the bin size to work out which x-axis values to use.
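Putting those pieces together, a minimal sketch (assuming scipy.stats.cumfreq and a normally distributed sample, as in the earlier example):

```python
import numpy as np
from scipy import stats
import matplotlib.pyplot as plt

data = np.random.normal(0, 5, size=2000)

# cumfreq returns (cumcount, lowerlimit, binsize, extrapoints)
cumcount, lowerlimit, binsize, extrapoints = stats.cumfreq(data, numbins=20)

# rescale so the final value is 1, as a CDF should be
cdf = cumcount / cumcount[-1]

# x values: the upper edge of each bin
x = lowerlimit + binsize * np.arange(1, len(cumcount) + 1)

plt.plot(x, cdf)
```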

Another option is to use numpy.histogram, which can normalize the counts (pass density=True; the older normed argument has been removed from recent numpy) and returns the bin edges. You'll still need to do the cumulative sum yourself, and with density=True each count must be weighted by its bin width so that the CDF ends at 1.

a = numpy.array([...])  # your array of numbers
num_bins = 20
counts, bin_edges = numpy.histogram(a, bins=num_bins, density=True)
cdf = numpy.cumsum(counts * numpy.diff(bin_edges))
pylab.plot(bin_edges[1:], cdf)

(bin_edges[1:] is the upper edge of each bin.)

ECDF in python without step function?

If you just want to change the plot, then you could let matplotlib interpolate between the observed values.

>>> import numpy as np
>>> import statsmodels.api as sm
>>> import matplotlib.pyplot as plt
>>> nobs = 200
>>> xx = np.random.randn(nobs)
>>> ecdf = sm.distributions.ECDF(xx)
>>> plt.plot(ecdf.x, ecdf.y)
[<matplotlib.lines.Line2D object at 0x07A872D0>]
>>> plt.show()

Or sort the original data and plot:

>>> xx.sort()
>>> plt.plot(xx, ecdf(xx))
[<matplotlib.lines.Line2D object at 0x07A87090>]
>>> plt.show()

which is the same as plotting it directly

>>> a=0; plt.plot(xx, np.arange(1.,nobs+1)/(nobs+a))
[<matplotlib.lines.Line2D object at 0x07A87D30>]
>>> plt.show()

Note: depending on how you want the ecdf to behave at the boundaries and how it is centered, there are several normalizations for "plotting positions" in common use, controlled by the parameter a I added above as an example; a=1 is a common choice.

As alternative to using the empirical cdf, you could also use an interpolated or smoothed ecdf or histogram, or a kernel density estimate.
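For instance, a smooth CDF estimate can be read off a kernel density estimate; a sketch using scipy's gaussian_kde, whose integrate_box_1d method gives the area under the KDE up to a point (the data here is illustrative):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
xx = rng.standard_normal(500)

# kernel density estimate of the data
kde = stats.gaussian_kde(xx)

# smooth CDF: integrate the estimated density from -inf up to each grid point
grid = np.linspace(xx.min(), xx.max(), 100)
smooth_cdf = np.array([kde.integrate_box_1d(-np.inf, t) for t in grid])
```

Unlike the step-function ECDF, this curve is smooth everywhere, at the cost of depending on the kernel bandwidth.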

Extracting/Exporting the Data of the Empirical Cumulative Distribution Function in R (ecdf)

The output of ecdf is a function, among other things. You can verify this with class(myResult), which displays the classes of the object myResult ("ecdf", "stepfun", and "function").

If you enter myResult(unique(myData)), R evaluates the ecdf object myResult at all distinct values appearing in myData, and prints it to the console. To save the output you can enter write.csv(cbind(unique(myData), myResult(unique(myData))), file="C:/Documents/My ecdf.csv") to save it to that filepath.

The ecdf doesn't tell you how many cars are above or below a specific threshold; rather, it gives the probability that a randomly selected car from your data set falls above or below the threshold. If you're interested in the number of cars satisfying some criterion, just count them: myData[myData<=10] returns the qualifying elements, and length(myData[myData<=10]) tells you how many there are.

Assuming you want the sample probability that a randomly selected car from your data is below 10 mph, that's the value given by myResult(10).
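The same idea carries over to the Python tools above; a quick sketch with statsmodels' ECDF (illustrative speeds, not the asker's myData):

```python
import numpy as np
from statsmodels.distributions.empirical_distribution import ECDF

# illustrative speeds in mph; the asker's myData would go here
speeds = np.array([4.0, 7.5, 9.0, 10.0, 12.5, 15.0, 18.0])

ecdf = ECDF(speeds)

# proportion of cars at or below 10 mph -- the analogue of myResult(10) in R
proportion = ecdf(10.0)

# the actual count -- the analogue of length(myData[myData <= 10])
count = (speeds <= 10.0).sum()
```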

How to interpret `scipy.stats.kstest` and `ks_2samp` to evaluate `fit` of data to a distribution?

So the null hypothesis for the KS test is that the distributions are the same. Thus, the lower your p-value, the greater the statistical evidence you have to reject the null hypothesis and conclude the distributions are different. The test only really lets you speak to your confidence that the distributions are different, not that they are the same, since the test is designed to control alpha, the probability of Type I error.

Also, the KS test is only valid if you have a fully specified distribution in mind beforehand. Here, you fit a gamma distribution to the very data you are testing, so it's no surprise the test yielded a high p-value (i.e., you cannot reject the null hypothesis that the distributions are the same).
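To make the distinction concrete, a sketch with synthetic data (not the asker's): testing against parameters fixed in advance versus parameters estimated from the same sample:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
data = rng.normal(loc=-50.0, scale=2.0, size=500)

# Case 1: fully specified null distribution, fixed before seeing the data --
# this is the setting the KS test's p-value is valid for
stat_fixed, p_fixed = stats.kstest(data, 'norm', args=(-50.0, 2.0))

# Case 2: parameters estimated from the very sample being tested --
# the fitted CDF hugs the ECDF, so the statistic is biased small and
# the p-value tends to be over-optimistic
mu, sigma = data.mean(), data.std(ddof=1)
stat_fit, p_fit = stats.kstest(data, 'norm', args=(mu, sigma))
```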

Real quickly, here is the pdf of the Gamma you fit (in blue) against the pdf of the normal distribution you sampled from (in green):

In [13]: paramsd = dict(zip(('shape','loc','scale'),params))

In [14]: a = paramsd['shape']

In [15]: del paramsd['shape']

In [16]: paramsd
Out[16]: {'loc': -71.588039241913037, 'scale': 0.051114096301755507}

In [17]: X = np.linspace(-55, -45, 100)

In [18]: plt.plot(X, stats.gamma.pdf(X,a,**paramsd))
Out[18]: [<matplotlib.lines.Line2D at 0x7ff820f21d68>]

[plot: pdf of the fitted gamma (blue) vs. pdf of the sampled normal (green)]

It should be obvious these aren't very different. Really, the test compares the empirical CDF (ECDF) against the CDF of your candidate distribution (which, again, you derived by fitting your data to that distribution), and the test statistic is the maximum difference between them. Borrowing an implementation of ECDF from here, we can see that any such maximum difference will be small, and the test will clearly not reject the null hypothesis:

In [32]: def ecdf(x):
.....: xs = np.sort(x)
.....: ys = np.arange(1, len(xs)+1)/float(len(xs))
.....: return xs, ys
.....:

In [33]: plt.plot(X, stats.gamma.cdf(X,a,**paramsd))
Out[33]: [<matplotlib.lines.Line2D at 0x7ff805223a20>]

In [34]: plt.plot(*ecdf(x))
Out[34]: [<matplotlib.lines.Line2D at 0x7ff80524c208>]

[plot: CDF of the fitted gamma vs. the ECDF of the data]
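In the same spirit, the KS statistic itself can be computed by hand as the largest gap between the ECDF and the candidate CDF; a sketch with synthetic data and a standard normal null (not the original x and fitted gamma):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
x = rng.normal(size=300)

# ECDF as in the answer above
xs = np.sort(x)
n = len(xs)
ys = np.arange(1, n + 1) / float(n)

# the ECDF jumps at each sample point, so check the gap on both sides
# of every step against the candidate CDF (standard normal here)
cdf = stats.norm.cdf(xs)
d = np.max(np.maximum(ys - cdf, cdf - (ys - 1.0 / n)))

# agrees with scipy's one-sample KS statistic
d_scipy = stats.kstest(x, 'norm').statistic
```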


