How to Crop a NetCDF File

Is there a way to crop a NetCDF file?

nco works fine, but just to list an alternative, one can also do it with cdo (Climate Data Operators), whose syntax I find easier to remember. You can specify the longitude and latitude values directly, like this:

cdo sellonlatbox,lon1,lon2,lat1,lat2 infile.nc outfile.nc

where lon1,lon2,lat1,lat2 define the boundaries of the area you require.

Note that longitude can be specified using either the 0:360 or the -180:180 convention, irrespective of the convention used in the input file; the output follows the convention given in the cdo command. Thus this command can also be used to convert a file from one convention to the other.
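For example, assuming the input file uses the 0:360 convention, selecting the whole globe with -180:180 bounds converts the longitudes without cropping anything (the file names here are placeholders):

cdo sellonlatbox,-180,180,-90,90 infile_0360.nc outfile_pm180.nc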

For more details on extracting subregions, I have posted a video tutorial on YouTube.

If you don't have cdo installed already, you can get it on Ubuntu with:

sudo apt-get install cdo

cdo has many other functions for processing, combining, and splitting files, and excellent online documentation. Note that for cdo to work, the coordinate variables (lat/lon) need to be defined according to CF conventions, so in that respect the nco solution is more robust.
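If you are unsure whether your file meets that requirement, you can print the grid description that cdo detects with its griddes operator; if it reports a recognized lonlat grid, sellonlatbox should work:

cdo griddes infile.nc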

Crop NetCDF files in R

I am making assumptions here, because this is not my area of expertise, but you should be able to simply transform this into a data frame using the raster package. This seems to be the way to go, also according to this author.

raster::as.data.frame(nc.stars.crop, xy = TRUE)

At least for me this worked. You could then transform it back into a simple features object, if you are so inclined, with:

raster::as.data.frame(nc.stars.crop, xy = TRUE) %>%
  sf::st_as_sf(coords = c('lon', 'lat'))

However, the transformation to lon/lat is not exact, because it produces point data, whereas the original information is raster data, so some information clearly gets lost.

sf::st_as_sf() seems to work out of the box for this, but I am not sure, because I have no way to validate the transformation of the original data. For me the following worked:

stars::read_ncdf('20220301120000-NCEI-L4_GHRSST-SSTblend-AVHRR_OI-GLOB-v02.0-fv02.1.nc', var = "analysed_sst") %>%
  sf::st_as_sf()

This creates polygons the size of your initial raster cells and seems to preserve all the necessary information.
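Since the goal is cropping, the resulting polygons can then be clipped with sf::st_crop(); a minimal sketch, with a made-up bounding box:

nc_sf <- stars::read_ncdf('20220301120000-NCEI-L4_GHRSST-SSTblend-AVHRR_OI-GLOB-v02.0-fv02.1.nc', var = "analysed_sst") %>%
  sf::st_as_sf()
# bounding box values below are placeholders
nc_sf_crop <- sf::st_crop(nc_sf, c(xmin = -10, ymin = 40, xmax = 10, ymax = 60))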

Finally, here is a work-around for extracting exactly the data you were plotting. You can access the data that ggplot used by assigning the plot to a variable and then accessing the data of its first layer.

p <- ggplot() +
  geom_stars(data = nc.stars.crop) +
  coord_equal() +
  theme_void() +
  scale_x_discrete(expand = c(0, 0)) +
  scale_y_discrete(expand = c(0, 0))

p$layers[[1]]$data

How to crop a NetCDF file by longitude in NCO

NCO handles these "wrapped coordinates" as described here. Please re-try with your bounding-box limits in [0,360], e.g.,

ncks -d lat,40.,70. -d lon,347.,10. https://dataserver.nccs.nasa.gov/thredds/dodsC/CMIP5/NASA/GISS/rcp85/E2-H_rcp85_r2i1p1_day/tos_day_GISS-E2-H_rcp85_r2i1p1_20510101-20751231.nc out.nc

Faster, more efficient way to crop NetCDF in R

Dealing with (NetCDF) files with many layers (time steps) can be very slow when using (a standard approach with) GDAL, which is what terra uses. I hope to fix this over the coming months. What you want to do may go much faster with raster, because it approaches the data as a three-dimensional array rather than looping over layers. So I would suggest:

library(raster)
r <- brick("myfile.nc")
r2 <- crop(r, extent(-79, -72, 0, 12.4))
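If you then want the cropped result back on disk as NetCDF, raster should be able to write it directly (a sketch; "myvar" is a placeholder variable name, and the CDF format requires the ncdf4 package):

writeRaster(r2, "myfile_cropped.nc", format = "CDF", varname = "myvar", overwrite = TRUE)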

NCO cropping a netcdf file using dimension values rather than indices

Yes, using a decimal indicates the range of actual values (e.g., latitudes) to extract over, while using integers indicates the range of indices corresponding to those values.

For instance, to extract across latitudes 30.0 - 40.0 degrees N:

ncks -d lat,30.,40. file.nc -O cropped_file.nc 
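By contrast, integer limits are interpreted as indices (0-based in NCO), so the following extracts index positions 30 through 40 along the lat dimension, regardless of their coordinate values:

ncks -d lat,30,40 file.nc -O cropped_file.nc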

Clipping netCDF file to a shapefile and cloning the metadata variables in R

You have a NetCDF file with many (52) variables (sub-datasets). When you open the file with rast these become "layers". Alternatively, you can open the file with sds to keep the sub-dataset structure, but that does not help you here (and you would need to skip the first two; see below).

library(terra)
f <- "ESACCI-LAKES-L3S-LK_PRODUCTS-MERGED-20190101-fv1.0.nc"
r <- rast(f)
r
#class : SpatRaster
#dimensions : 21600, 43200, 52 (nrow, ncol, nlyr)
#resolution : 0.008333333, 0.008333333 (x, y)
#extent : -180, 180, -90, 90 (xmin, xmax, ymin, ymax)
#coord. ref. : +proj=longlat +datum=WGS84 +no_defs
#sources : ESACCI-LAKES-L3S-LK_PRODUCTS-MERGED-20190101-fv1.0.nc:water_surface_height_above_reference_datum
#           ESACCI-LAKES-L3S-LK_PRODUCTS-MERGED-20190101-fv1.0.nc:water_surface_height_uncertainty
#           ESACCI-LAKES-L3S-LK_PRODUCTS-MERGED-20190101-fv1.0.nc:lake_surface_water_extent
#           ... and 49 more source(s)
#varnames : water_surface_height_above_reference_datum (water surface height above geoid)
#           water_surface_height_uncertainty (water surface height uncertainty)
#           lake_surface_water_extent (Lake Water Extent)
#           ...
#names : water~datum, water~ainty, lake_~xtent, lake_~ainty, lake_~ature, lswt_~ainty, ...
#unit : m, m, km2, km2, Kelvin, Kelvin, ...
#time : 2019-01-01

Note that there are 52 layers and 52 sources (sub-datasets). There are names:

head(names(r))
#[1] "water_surface_height_above_reference_datum" "water_surface_height_uncertainty"
#[3] "lake_surface_water_extent" "lake_surface_water_extent_uncertainty"
#[5] "lake_surface_water_temperature" "lswt_uncertainty"

And also "longnames" (they are often much longer than the variable names, though not in this case):

head(longnames(r))
# [1] "water surface height above geoid" "water surface height uncertainty" "Lake Water Extent"
# [4] "Water extent uncertainty" "lake surface skin temperature" "Total uncertainty"

You can also open the file with sds, but you need to skip "lon_bounds" and "lat_bounds" variables (dimensions)

s <- sds(f, 3:52)

Now read a vector data set (shapefile in this case) and crop

lake <- vect("hydro_p_LakeErie.shp")
rc <- crop(r, lake)
rc

#class : SpatRaster
#dimensions : 182, 555, 52 (nrow, ncol, nlyr)
#resolution : 0.008333333, 0.008333333 (x, y)
#extent : -83.475, -78.85, 41.38333, 42.9 (xmin, xmax, ymin, ymax)
#coord. ref. : +proj=longlat +datum=WGS84 +no_defs
#source : memory
#names : water~datum, water~ainty, lake_~xtent, lake_~ainty, lake_~ature, lswt_~ainty, ...
#min values : NaN, NaN, NaN, NaN, 271.170, 0.283, ...
#max values : NaN, NaN, NaN, NaN, 277.090, 0.622, ...
#time : 2019-01-01

It can be convenient to save this to a GTiff file like this (or, even better, to use the filename argument in crop, as shown further below):

gtf <- writeRaster(rc, "test.tif", overwrite=TRUE)
gtf
#class : SpatRaster
#dimensions : 182, 555, 52 (nrow, ncol, nlyr)
#resolution : 0.008333333, 0.008333333 (x, y)
#extent : -83.475, -78.85, 41.38333, 42.9 (xmin, xmax, ymin, ymax)
#coord. ref. : +proj=longlat +datum=WGS84 +no_defs
#source : test.tif
#names : water~datum, water~ainty, lake_~xtent, lake_~ainty, lake_~ature, lswt_~ainty, ...
#min values : NaN, NaN, NaN, NaN, 271.170, 0.283, ...
#max values : NaN, NaN, NaN, NaN, 277.090, 0.622, ...

What has changed is that the data are now in a file, rather than in memory. And you still have the layer (variable) names.
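As mentioned above, the filename argument of crop does the cropping and the writing in one step; a minimal sketch:

rc <- crop(r, lake, filename = "test.tif", overwrite = TRUE)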

To write the layers as variables to a NetCDF file you need to create a SpatRasterDataset. You can do that like this:

x <- as.list(rc)
s <- sds(x)
names(s) <- names(rc)
longnames(s) <- longnames(r)
units(s) <- units(r)

Note the use of longnames(r) and units(r) (not rc). This is because r has subdatasets (and each has a longname and a unit) while rc does not.

Now use writeCDF

z <- writeCDF(s, "test.nc", overwrite=TRUE)

rc2 <- rast("test.nc")
rc2

#class : SpatRaster
#dimensions : 182, 555, 52 (nrow, ncol, nlyr)
#resolution : 0.008333333, 0.008333333 (x, y)
#extent : -83.475, -78.85, 41.38333, 42.9 (xmin, xmax, ymin, ymax)
#coord. ref. : +proj=longlat +datum=WGS84 +no_defs
#sources : test.nc:water_surface_height_above_reference_datum
#           test.nc:water_surface_height_uncertainty
#           test.nc:lake_surface_water_extent
#           ... and 49 more source(s)
#varnames : water_surface_height_above_reference_datum (water surface height above geoid)
#           water_surface_height_uncertainty (water surface height uncertainty)
#           lake_surface_water_extent (Lake Water Extent)
#           ...
#names : water~datum, water~ainty, lake_~xtent, lake_~ainty, lake_~ature, lswt_~ainty, ...
#unit : m, m, km2, km2, Kelvin, Kelvin, ...
#time : 2019-01-01

So it looks like we have a NetCDF with the same structure.

Note that the current CRAN version of terra drops the time variable if there is only one time step. The development version (1.3-11) keeps the time dimension, even if there is only one step.

You can install the development version with:

install.packages('terra', repos='https://rspatial.r-universe.dev')

Circular lat/lon crop of a NetCDF file with Python

I would calculate the distance between the center and each lat/lon pair (2D grid), and use that to construct a mask which you can apply to your data. Once masked, you can again simply use numpy functions to calculate statistics like max().

For example, using the haversine() function from https://stackoverflow.com/a/4913653/3581217, modified to a vectorized version that you can apply directly to NumPy arrays:

import numpy as np
import matplotlib.pylab as pl

def haversine(lon1, lat1, lon2, lat2):
    # convert decimal degrees to radians
    lon1 = np.deg2rad(lon1)
    lon2 = np.deg2rad(lon2)
    lat1 = np.deg2rad(lat1)
    lat2 = np.deg2rad(lat2)

    # haversine formula
    dlon = lon2 - lon1
    dlat = lat2 - lat1
    a = np.sin(dlat/2)**2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlon/2)**2
    c = 2 * np.arcsin(np.sqrt(a))
    r = 6371  # mean Earth radius in km
    return c * r

# Latitude / longitude grid
lat = np.linspace(50,54,16)
lon = np.linspace(6,9,12)

# Center coordinates
clat = 52
clon = 7

max_dist = 100 # max distance in km

# Calculate distance between center and all other lat/lon pairs
distance = haversine(lon[:,np.newaxis], lat, clon, clat)

# Mask distance array where distance > max_dist
distance_m = np.ma.masked_greater(distance, max_dist)

# Dummy data
data = np.random.random(size=[lon.size, lat.size])

# Test: set a value outside the max_dist circle to a large value:
data[0,0] = 10

# Mask the data array based on the distance mask
data_m = np.ma.masked_where(distance > max_dist, data)

pl.figure()
pl.subplot(221)
pl.title('distance (km)')
pl.pcolormesh(lon, lat, np.transpose(distance))
pl.colorbar()

pl.subplot(222)
pl.title('distance < max_dist (km)')
pl.pcolormesh(lon, lat, np.transpose(distance_m))
pl.colorbar()

pl.subplot(223)
pl.title('all data; max = {0:.1f}'.format(data.max()))
pl.pcolormesh(lon, lat, np.transpose(data))
pl.colorbar()

pl.subplot(224)
pl.title('masked data; max = {0:.1f}'.format(data_m.max()))
pl.pcolormesh(lon, lat, np.transpose(data_m))
pl.colorbar()

pl.show()

The resulting four-panel figure shows the distance field, the distance field masked beyond max_dist, the full data (including the large test value), and the masked data.
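To apply the same approach to a real NetCDF file rather than dummy data, a sketch along these lines should work (the file, variable, and dimension names are assumptions; it reuses haversine, clon, clat, and max_dist from above, plus the netCDF4 package):

import netCDF4 as nc

ds = nc.Dataset('infile.nc')            # hypothetical input file
lat = ds.variables['lat'][:]            # 1D latitude coordinates
lon = ds.variables['lon'][:]            # 1D longitude coordinates
data = ds.variables['analysed_sst'][0]  # first time step, shape (lat, lon)

# orient the distance grid as (lat, lon) to match the data
dist = haversine(lon[np.newaxis, :], lat[:, np.newaxis], clon, clat)
data_m = np.ma.masked_where(dist > max_dist, data)
print(data_m.max())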


