Fastest way to compute image dataset channel wise mean and standard deviation in Python
Since this is a numerically heavy task (many iterations over a matrix or a tensor), I always suggest using a library that is good at this: NumPy.
A properly installed NumPy runs reductions such as mean and std in compiled, vectorized loops rather than in the Python interpreter, and its linear-algebra routines can use the underlying BLAS (Basic Linear Algebra Subprograms) library, which is optimized for floating-point arrays from the memory-hierarchy perspective.
imread should already give you a NumPy array. You can get the flattened 1-D array of a single channel (index 0 here; note that cv2.imread returns channels in BGR order, so index 0 is blue and index 2 is red) by
import numpy as np
val = np.reshape(image[:,:,0], -1)
its mean by
np.mean(val)
and the standard deviation by
np.std(val)
In this way, you can get rid of two layers of python loops:
count = 0
mean = 0
delta = 0
delta2 = 0
M2 = 0
for i, file in enumerate(tqdm(first)):
    image = cv2.imread(file)
    val = np.reshape(image[:,:,0], -1)
    img_mean = np.mean(val)
    img_std = np.std(val)
    ...
The rest of the incremental update should be straightforward.
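One possible shape for that incremental update is sketched below. It is not the code elided above, just a minimal illustration that merges each image's pixel count, mean, and sum of squared deviations into running totals (the pairwise merge formula usually attributed to Chan et al.). The helper name update_channel_stats and the synthetic images are assumptions made for the example; it also handles all channels at once by reducing over the pixel axis.

```python
import numpy as np

def update_channel_stats(count, mean, m2, image):
    """Merge one H x W x C image into running per-channel statistics.

    count: pixels seen so far, mean: running per-channel mean,
    m2: running per-channel sum of squared deviations from the mean.
    """
    pixels = image.reshape(-1, image.shape[-1]).astype(np.float64)
    n = pixels.shape[0]
    img_mean = pixels.mean(axis=0)
    img_m2 = pixels.var(axis=0) * n  # squared deviations within this image
    delta = img_mean - mean
    total = count + n
    new_mean = mean + delta * n / total
    new_m2 = m2 + img_m2 + delta ** 2 * count * n / total
    return total, new_mean, new_m2

# Synthetic stand-ins for the arrays that imread would return (assumption).
rng = np.random.default_rng(0)
images = [rng.integers(0, 256, size=(h, 5, 3), dtype=np.uint8) for h in (4, 6, 8)]

count, mean, m2 = 0, np.zeros(3), np.zeros(3)
for image in images:
    count, mean, m2 = update_channel_stats(count, mean, m2, image)
std = np.sqrt(m2 / count)  # population std over every pixel seen
```

Because only three small accumulators are kept, memory use does not grow with the number of images.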
Once you have done this, the bottleneck will become the image loading speed, which is limited by disk read performance. In that regard, I suspect that using multiple threads, as others suggested, will help a lot, based on my prior experience.
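A rough sketch of what threaded loading could look like is given below, using the standard library's ThreadPoolExecutor. The function names image_stats and stats_for_files, the load parameter, and the fake_load stand-in are all hypothetical; in practice load would be something like cv2.imread, and whether threads actually speed things up depends on the disk and on the decoder, so it is worth measuring.

```python
from concurrent.futures import ThreadPoolExecutor
import numpy as np

def image_stats(path, load):
    """Load one image and return (pixel count, channel means, channel variances)."""
    image = np.asarray(load(path), dtype=np.float64)
    pixels = image.reshape(-1, image.shape[-1])
    return pixels.shape[0], pixels.mean(axis=0), pixels.var(axis=0)

def stats_for_files(paths, load, workers=4):
    """Compute per-image statistics in a thread pool, preserving input order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(lambda p: image_stats(p, load), paths))

# Stand-in loader so the sketch runs without image files (assumption).
def fake_load(path):
    rng = np.random.default_rng(len(path))
    return rng.integers(0, 256, size=(4, 4, 3), dtype=np.uint8)

results = stats_for_files(["a.png", "bb.png", "ccc.png"], fake_load)
```

The per-image tuples can then be merged into dataset-wide statistics in the main thread.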
Finding mean and standard deviation across image channels PyTorch
You just need to rearrange the batch tensor in the right way: from [B, C, W, H] to [B, C, W * H] by:
batch = batch.view(batch.size(0), batch.size(1), -1)
Here is a complete usage example on random data:
Code:
import torch
from torch.utils.data import TensorDataset, DataLoader

data = torch.randn(64, 3, 28, 28)
labels = torch.zeros(64, 1)
dataset = TensorDataset(data, labels)
loader = DataLoader(dataset, batch_size=8)

nimages = 0
mean = 0.
std = 0.
for batch, _ in loader:
    # Rearrange batch to be the shape of [B, C, W * H]
    batch = batch.view(batch.size(0), batch.size(1), -1)
    # Update total number of images
    nimages += batch.size(0)
    # Compute mean and std here
    mean += batch.mean(2).sum(0)
    # Note: this sums per-image stds; their average approximates, but does
    # not equal, the std taken over all pixels of the dataset
    std += batch.std(2).sum(0)

# Final step
mean /= nimages
std /= nimages

print(mean)
print(std)
Output:
tensor([-0.0029, -0.0022, -0.0036])
tensor([0.9942, 0.9939, 0.9923])
Normalising images using mean and std of a dataset
Your formulas are not correct. You can't take the mean of the values of a batch and then the standard deviation of these means and expect it to be the standard deviation over the entire dataset. Try something like:
import torch

total = 0.0
totalsq = 0.0
count = 0
for data, *_ in dataloader:
    # numel() is the number of elements in the batch tensor
    count += data.numel()
    total += data.sum()
    totalsq += (data**2).sum()
mean = total/count
var = (totalsq/count) - (mean**2)
std = torch.sqrt(var)
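The snippet above pools every channel into one mean and one std. If per-channel values are wanted, the same sum and sum-of-squares idea can be applied along the batch and spatial axes only. The following is a minimal sketch under the assumption that batches are [B, C, H, W] tensors; channel_stats is a hypothetical helper name, and random data split into chunks stands in for a real data loader.

```python
import torch

def channel_stats(batches):
    """Per-channel mean and population std over an iterable of [B, C, H, W] tensors."""
    count = 0
    total = None
    total_sq = None
    for batch in batches:
        batch = batch.double()  # accumulate in float64 to limit rounding error
        s = batch.sum(dim=(0, 2, 3))
        sq = (batch ** 2).sum(dim=(0, 2, 3))
        total = s if total is None else total + s
        total_sq = sq if total_sq is None else total_sq + sq
        count += batch.size(0) * batch.size(2) * batch.size(3)
    mean = total / count
    var = total_sq / count - mean ** 2
    # Clamp guards against tiny negative values caused by rounding
    return mean, torch.sqrt(var.clamp(min=0))

torch.manual_seed(0)
data = torch.randn(64, 3, 28, 28)
mean, std = channel_stats(data.split(8))  # eight batches of eight images
```

Accumulating in float64 matters here because E[x^2] - E[x]^2 can lose precision when the mean is large relative to the spread.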
Numpy - Normalize RGB image dataset
Looks good, but there are some things NumPy does that could make it nicer. I'm assuming that you want to normalize each channel separately.
For one, notice that x has a method mean, so we can write x[..., 0].mean() instead of np.mean(x[:, :, :, 0]). Also, the mean method takes the keyword argument axis, which we can use as follows:
means = x.mean(axis=(0, 1, 2)) # Take the mean over the N,H,W axes
means.shape # => will evaluate to (C,)
Then we can subtract the means from the whole dataset like so:
centered = x - x.mean(axis=(0,1,2), keepdims=True)
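Note that we used keepdims here, so the means keep the shape (1, 1, 1, C) and broadcast against x explicitly. With the channels on the last axis a plain (C,) array would broadcast too, but keepdims continues to work if the channel axis sits elsewhere.
There is also an x.std that works the same way, so we can do the whole normalization in 1 line: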
z = (x - x.mean(axis=(0,1,2), keepdims=True)) / x.std(axis=(0,1,2), keepdims=True)
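As a quick sanity check of that one-liner, here is a small sketch on synthetic data. It assumes x is an (N, H, W, C) array of uint8 images; the random array is a stand-in for a real dataset.

```python
import numpy as np

# Synthetic stand-in for an (N, H, W, C) image dataset (assumption).
rng = np.random.default_rng(0)
x = rng.integers(0, 256, size=(10, 8, 8, 3), dtype=np.uint8)

x = x.astype(np.float64)  # make the floating-point dtype explicit
means = x.mean(axis=(0, 1, 2), keepdims=True)  # shape (1, 1, 1, 3)
stds = x.std(axis=(0, 1, 2), keepdims=True)    # shape (1, 1, 1, 3)
z = (x - means) / stds

# After normalization each channel should have mean ~0 and std ~1.
print(z.mean(axis=(0, 1, 2)))
print(z.std(axis=(0, 1, 2)))
```

If a channel is constant across the whole dataset its std is zero and the division produces NaN or inf, so a real pipeline may want to guard against that case.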
Check out the docs for numpy.ndarray.mean and numpy.ndarray.std for more info. The np.ndarray.method methods are what you hit when you call x.method instead of using np.method(x).
Edit: I have since learned that, of course, there is a scipy.stats.zscore. I'm not sure if this is a more readable way to take z-scores along each channel, but some might prefer it:
from scipy.stats import zscore
z = zscore(x.reshape(-1, 3)).reshape(x.shape)
The scipy function operates only over a single axis, so we have to reshape into an NHW x C matrix first and then reshape back.