Compute a Confidence Interval from Sample Data

Compute a confidence interval from sample data

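import numpy as np
import scipy.stats

def mean_confidence_interval(data, confidence=0.95):
    """Return the sample mean and the bounds of a t-based confidence interval."""
    a = np.asarray(data, dtype=float)
    n = len(a)
    # Sample mean and standard error of the mean (sem uses ddof=1 by default).
    m, se = np.mean(a), scipy.stats.sem(a)
    # Half-width: standard error times the two-sided t critical value (n-1 degrees of freedom).
    h = se * scipy.stats.t.ppf((1 + confidence) / 2., n-1)
    return m, m-h, m+h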

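You can calculate it like this. The function returns the sample mean followed by the lower and upper bounds of the confidence interval, using the t-distribution with n-1 degrees of freedom.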
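As a cross-check, SciPy can also produce the same interval in one call via scipy.stats.t.interval. The sketch below is only an illustration: the function name mean_confidence_interval_builtin and the sample values are made up for this example, and the confidence level is passed positionally because the keyword name of that argument has changed between SciPy versions.

```python
import numpy as np
import scipy.stats

def mean_confidence_interval_builtin(data, confidence=0.95):
    """Hypothetical helper: t-based CI for the mean via scipy.stats.t.interval."""
    a = np.asarray(data, dtype=float)
    m = np.mean(a)
    se = scipy.stats.sem(a)  # standard error of the mean (ddof=1)
    # Interval of a t-distribution with n-1 degrees of freedom,
    # centred on the sample mean and scaled by the standard error.
    lower, upper = scipy.stats.t.interval(confidence, len(a) - 1, loc=m, scale=se)
    return m, lower, upper

# Made-up sample values, for illustration only.
sample = [2.1, 2.5, 1.9, 2.8, 2.4, 2.2, 2.6]
m, lo, hi = mean_confidence_interval_builtin(sample)
print(f"mean={m:.3f}, 95% CI=({lo:.3f}, {hi:.3f})")
```

This should agree with the manual computation (mean plus or minus the standard error times the t critical value) up to floating-point error.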

Compute a confidence interval from sample data assuming unknown distribution

If you don't know the underlying distribution, then my first thought would be to use bootstrapping: https://en.wikipedia.org/wiki/Bootstrapping_(statistics)

As a rough sketch in Python, assuming x is a one-dimensional NumPy array containing your data:

import numpy as np
N = 10000
mean_estimates = []
for _ in range(N):
    re_sample_idx = np.random.randint(0, len(x), x.shape)
    mean_estimates.append(np.mean(x[re_sample_idx]))

mean_estimates is now a list of 10000 estimates of the mean of the distribution. Take the 2.5th and 97.5th percentile of these 10000 values, and you have a confidence interval around the mean of your data:

sorted_estimates = np.sort(np.array(mean_estimates))
conf_interval = [sorted_estimates[int(0.025 * N)], sorted_estimates[int(0.975 * N)]]
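The steps above can also be wrapped into one self-contained function. This is a minimal sketch of the percentile bootstrap, not a definitive implementation: the name bootstrap_mean_ci, its parameters, and the synthetic exponential data are assumptions made for illustration. It uses np.random.default_rng for reproducible resampling and np.percentile instead of manual sorting and indexing.

```python
import numpy as np

def bootstrap_mean_ci(x, confidence=0.95, n_resamples=10000, seed=None):
    """Hypothetical helper: percentile-bootstrap CI for the mean of x."""
    x = np.asarray(x, dtype=float)
    rng = np.random.default_rng(seed)
    # One row of indices per bootstrap resample, drawn with replacement.
    idx = rng.integers(0, len(x), size=(n_resamples, len(x)))
    # Mean of each resample.
    mean_estimates = x[idx].mean(axis=1)
    # Cut off alpha/2 of the estimates in each tail.
    alpha = 1.0 - confidence
    lower, upper = np.percentile(
        mean_estimates, [100 * alpha / 2, 100 * (1 - alpha / 2)]
    )
    return lower, upper

# Synthetic, deliberately non-normal data, for illustration only.
data = np.random.default_rng(0).exponential(scale=2.0, size=200)
lower, upper = bootstrap_mean_ci(data, confidence=0.95, seed=1)
print(f"sample mean={data.mean():.3f}, 95% bootstrap CI=({lower:.3f}, {upper:.3f})")
```

Drawing all indices at once trades memory for speed; for very large arrays, a loop like the one above may be the better choice.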

How do I calculate confidence interval with only sample size and confidence level

I should mention (just to be clear) that the CI is an interval estimate for the population mean, not for individual values of the population. In that case, if we assume the population is normally distributed and that we know the population standard deviation SD, then the CI is estimated as

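\bar{x} \pm z_{\alpha/2} \cdot \frac{SD}{\sqrt{n}}

where \bar{x} is the sample mean, n is the sample size, and z_{\alpha/2} is the standard normal critical value for the chosen confidence level (about 1.96 for 95%).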

Solving this formula for n also gives your formula, in which you are estimating n.
If the population SD is not known, then you need to use the sample SD and replace the z-value with a t-value (with n-1 degrees of freedom).
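To make the known-SD case concrete, here is a small sketch of both directions: computing the z-based interval for a given n, and solving the same relation for the n needed to reach a target margin of error. The function names and the numbers (SD of 15, margin of 3) are made up for illustration. The critical value comes from statistics.NormalDist in the Python standard library.

```python
import math
from statistics import NormalDist

def z_confidence_interval(sample_mean, sd, n, confidence=0.95):
    """Hypothetical helper: CI for the mean when the population SD is known."""
    z = NormalDist().inv_cdf((1 + confidence) / 2)  # two-sided critical value
    h = z * sd / math.sqrt(n)                       # margin of error
    return sample_mean - h, sample_mean + h

def sample_size_for_margin(sd, margin, confidence=0.95):
    """Hypothetical helper: smallest n with z * sd / sqrt(n) <= margin."""
    z = NormalDist().inv_cdf((1 + confidence) / 2)
    return math.ceil((z * sd / margin) ** 2)

# Illustrative numbers only: population SD of 15, target margin of +-3.
n = sample_size_for_margin(sd=15, margin=3, confidence=0.95)
low, high = z_confidence_interval(sample_mean=100, sd=15, n=n, confidence=0.95)
print(n, round(low, 2), round(high, 2))
```

Note that the sample mean plays no role in the sample-size calculation; only the SD, the margin, and the confidence level matter.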

Calculating confidence interval and sample size for data conversions

OK, let's assume the variable X = proportion of books converted correctly, distributed normally, with values between 0 and 1.

Sample size = this is what we want to determine

Population size = 30
Existing book list contains 30 books

Estimated value = 0.90
That is, the value of X that you expect to be true. You stated:

"90+-5% of all books converted correctly"

If you have no idea what the actual value is, use 0.5 instead.

Error margin = 0.05
The largest difference you are willing to accept between the real value and the estimated value. As you stated above, this is +-5%.

Confidence level = 0.95
This is NOT the same as the error margin. You are making a prediction; how sure do you want to be of it? That is the confidence level. You gave two values above:
"to be 85-95% certain that all books converted correctly"
So we're going with 95%, just to be sure.

The recommended sample size is 25.
You can use this calculator to arrive at the same result:
https://select-statistics.co.uk/calculators/sample-size-calculator-population-proportion/
It also has an excellent explanation of all the input values above.
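For reference, the figure of 25 can be reproduced by hand. The sketch below assumes the usual sample-size formula for a proportion with a finite population correction; whether the linked calculator uses exactly this formula is an assumption, although it gives the same number for these inputs. The function name sample_size_proportion is hypothetical.

```python
import math
from statistics import NormalDist

def sample_size_proportion(population_size, expected_proportion, margin, confidence=0.95):
    """Hypothetical helper: sample size for estimating a proportion.

    Assumes the standard normal-approximation formula with a
    finite population correction.
    """
    z = NormalDist().inv_cdf((1 + confidence) / 2)
    p = expected_proportion
    # Sample size for an effectively infinite population.
    n0 = z ** 2 * p * (1 - p) / margin ** 2
    # Shrink it for a small, finite population.
    n = n0 / (1 + (n0 - 1) / population_size)
    return math.ceil(n)

# Inputs from the answer: 30 books, expected 0.90, margin 0.05, confidence 0.95.
print(sample_size_proportion(30, 0.90, 0.05, 0.95))  # 25
```

With a population of only 30, the correction matters a lot: without it, the same inputs would call for well over 100 samples, which is more than the number of books available.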

Hope it works for you. Cheers!

How to plot confidence interval of a time series data in Python?

I'm not qualified to answer question 1; however, the answers to this SO question (linked in the code below) produce different results from your code.

As for question 2, you can use matplotlib's fill_between to fill the area between two curves (the upper and lower bounds in your example).

import numpy as np
import matplotlib.pyplot as plt
import scipy.stats

# https://stackoverflow.com/questions/15033511/compute-a-confidence-interval-from-sample-data
def mean_confidence_interval(data, confidence=0.95):
    a = 1.0 * np.array(data)
    n = len(a)
    m, se = np.mean(a), scipy.stats.sem(a)
    h = se * scipy.stats.t.ppf((1 + confidence) / 2., n-1)
    return m, m-h, m+h

mean, lower, upper = [], [], []
ci = 0.8  # confidence level
for _ in range(20):
    a = np.random.rand(100)  # random stand-in for your output at each step
    m, ml, mu = mean_confidence_interval(a, ci)
    mean.append(m)
    lower.append(ml)
    upper.append(mu)

plt.figure()
plt.plot(mean,'-b', label='mean')
plt.plot(upper,'-r', label='upper')
plt.plot(lower,'-g', label='lower')
# fill the area with black color, opacity 0.15
plt.fill_between(list(range(len(mean))), upper, lower, color="k", alpha=0.15)

plt.xlabel("Value")
plt.ylabel("Loss")
plt.legend()
plt.show()