Netcdf Files in R

Crop netcdf files in R

I am making assumptions here, because this is not my area of expertise, but you can simply convert this into a data frame using the raster package. This seems to be the way to go, also according to this author.

raster::as.data.frame(nc.stars.crop, xy = TRUE)

At least for me this worked. If you are so inclined, you could then transform it back into a simple features object with

raster::as.data.frame(nc.stars.crop, xy = TRUE) %>% 
sf::st_as_sf(coords = c('lon','lat'))

However, the transformation to lon/lat is not really exact, because it produces point data, whereas the original information is raster data. So there is clearly information that gets lost.

Applying sf::st_as_sf() directly to the stars object seems to work out of the box, but I am not sure, because I have no way to validate the transformation against the original data. For me the following worked:

stars::read_ncdf('20220301120000-NCEI-L4_GHRSST-SSTblend-AVHRR_OI-GLOB-v02.0-fv02.1.nc', var="analysed_sst") %>%
  sf::st_as_sf()

This creates polygons the size of the original raster cells and seems to preserve all the necessary information.

Finally, here is a workaround for extracting exactly the data you were plotting. You can access the data that ggplot used by assigning the plot to a variable and then reading the data of its first layer.

p <- ggplot() +
  geom_stars(data = nc.stars.crop) +
  coord_equal() +
  theme_void() +
  scale_x_discrete(expand = c(0, 0)) +
  scale_y_discrete(expand = c(0, 0))

p$layers[[1]]$data

R sub-setting netcdf file

In most cases you can do that like this

library(raster)
b <- brick("filename.nc")
e <- extent(8.125, 37.125, 68.125, 97.375)
x <- crop(b, e)

Clipping netCDF file to a shapefile and cloning the metadata variables in R

You have a NetCDF file with many (52) variables (sub-datasets). When you open the file with rast these become "layers". Alternatively you can open the file with sds to keep the sub-dataset structure but that does not help you here (and you would need to skip the first two, see below).

library(terra)
f <- "ESACCI-LAKES-L3S-LK_PRODUCTS-MERGED-20190101-fv1.0.nc"
r <- rast(f)
r
#class       : SpatRaster 
#dimensions  : 21600, 43200, 52  (nrow, ncol, nlyr)
#resolution  : 0.008333333, 0.008333333  (x, y)
#extent      : -180, 180, -90, 90  (xmin, xmax, ymin, ymax)
#coord. ref. : +proj=longlat +datum=WGS84 +no_defs 
#sources     : ESACCI-LAKES-L3S-LK_PRODUCTS-MERGED-20190101-fv1.0.nc:water_surface_height_above_reference_datum  
              ESACCI-LAKES-L3S-LK_PRODUCTS-MERGED-20190101-fv1.0.nc:water_surface_height_uncertainty  
              ESACCI-LAKES-L3S-LK_PRODUCTS-MERGED-20190101-fv1.0.nc:lake_surface_water_extent  
              ... and 49 more source(s)
#varnames    : water_surface_height_above_reference_datum (water surface height above geoid) 
              water_surface_height_uncertainty (water surface height uncertainty) 
              lake_surface_water_extent (Lake Water Extent) 
              ...
#names       : water~datum, water~ainty, lake_~xtent, lake_~ainty, lake_~ature, lswt_~ainty, ... 
#unit        :           m,           m,         km2,         km2,      Kelvin,      Kelvin, ... 
#time        : 2019-01-01

Note that there are 52 layers and sources (sub-datasets). There are names

head(names(r))
#[1] "water_surface_height_above_reference_datum" "water_surface_height_uncertainty"          
#[3] "lake_surface_water_extent"                  "lake_surface_water_extent_uncertainty"     
#[5] "lake_surface_water_temperature"             "lswt_uncertainty"

And also "longnames" (they are often much longer than the variable names, not in this case)

head(longnames(r))
# [1] "water surface height above geoid" "water surface height uncertainty" "Lake Water Extent"               
# [4] "Water extent uncertainty"         "lake surface skin temperature"    "Total uncertainty"

You can also open the file with sds, but you need to skip "lon_bounds" and "lat_bounds" variables (dimensions)

s <- sds(f, 3:52)

Now read a vector data set (shapefile in this case) and crop

lake <- vect("hydro_p_LakeErie.shp")
rc <- crop(r, lake)
rc 

#class       : SpatRaster 
#dimensions  : 182, 555, 52  (nrow, ncol, nlyr)
#resolution  : 0.008333333, 0.008333333  (x, y)
#extent      : -83.475, -78.85, 41.38333, 42.9  (xmin, xmax, ymin, ymax)
#coord. ref. : +proj=longlat +datum=WGS84 +no_defs 
#source      : memory 
#names       : water~datum, water~ainty, lake_~xtent, lake_~ainty, lake_~ature, lswt_~ainty, ... 
#min values  :         NaN,         NaN,         NaN,         NaN,     271.170,       0.283, ... 
#max values  :         NaN,         NaN,         NaN,         NaN,     277.090,       0.622, ... 
#time        : 2019-01-01

It can be convenient to save this to a GeoTIFF file like this (or, even better, to use the filename argument in crop so the result is written to disk directly)

gtf <- writeRaster(rc, "test.tif", overwrite=TRUE)
gtf
#class       : SpatRaster 
#dimensions  : 182, 555, 52  (nrow, ncol, nlyr)
#resolution  : 0.008333333, 0.008333333  (x, y)
#extent      : -83.475, -78.85, 41.38333, 42.9  (xmin, xmax, ymin, ymax)
#coord. ref. : +proj=longlat +datum=WGS84 +no_defs 
#source      : test.tif 
#names       : water~datum, water~ainty, lake_~xtent, lake_~ainty, lake_~ature, lswt_~ainty, ... 
#min values  :         NaN,         NaN,         NaN,         NaN,     271.170,       0.283, ... 
#max values  :         NaN,         NaN,         NaN,         NaN,     277.090,       0.622, ...

What has changed is that the data are now in a file, rather than in memory. And you still have the layer (variable) names.
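As a sketch of the filename argument of crop mentioned above, the example below crops a small synthetic raster and writes the result straight to a GeoTIFF in one step. The raster r_demo, the extent e_demo and the temporary file path are made up for illustration; with the real data you would pass r and lake instead.

```r
library(terra)

# Small synthetic global raster (10-degree cells) standing in for the NetCDF data
r_demo <- rast(nrows = 18, ncols = 36, xmin = -180, xmax = 180, ymin = -90, ymax = 90)
values(r_demo) <- seq_len(ncell(r_demo))

# Hypothetical crop extent, given as xmin, xmax, ymin, ymax
e_demo <- ext(-90, -60, 30, 50)

# Crop and write to disk in a single call; no separate writeRaster step is needed
out_file <- file.path(tempdir(), "crop_demo.tif")
rc_demo <- crop(r_demo, e_demo, filename = out_file, overwrite = TRUE)

# The cropped raster is backed by the file rather than held in memory
sources(rc_demo)
```

Writing during the crop avoids holding the cropped result in memory first, which should matter more for large, many-layered files.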

To write the layers as variables to a NetCDF file you need to create a SpatRasterDataset. You can do that like this:

x <- as.list(rc)
s <- sds(x)
names(s) <- names(rc)
longnames(s) <- longnames(r)
units(s) <- units(r)

Note the use of longnames(r) and units(r) (not rc). This is because r has subdatasets (and each has a longname and a unit) while rc does not.

Now use writeCDF

z <- writeCDF(s, "test.nc", overwrite=TRUE)
 
rc2 <- rast("test.nc")
rc2

#class       : SpatRaster 
#dimensions  : 182, 555, 52  (nrow, ncol, nlyr)
#resolution  : 0.008333333, 0.008333333  (x, y)
#extent      : -83.475, -78.85, 41.38333, 42.9  (xmin, xmax, ymin, ymax)
#coord. ref. : +proj=longlat +datum=WGS84 +no_defs 
#sources     : test.nc:water_surface_height_above_reference_datum  
              test.nc:water_surface_height_uncertainty  
              test.nc:lake_surface_water_extent  
              ... and 49 more source(s)
#varnames    : water_surface_height_above_reference_datum (water surface height above geoid) 
              water_surface_height_uncertainty (water surface height uncertainty) 
              lake_surface_water_extent (Lake Water Extent) 
              ...
#names       : water~datum, water~ainty, lake_~xtent, lake_~ainty, lake_~ature, lswt_~ainty, ... 
#unit        :           m,           m,         km2,         km2,      Kelvin,      Kelvin, ... 
#time        : 2019-01-01

So it looks like we have a NetCDF with the same structure.

Note that the current CRAN version of terra drops the time variable if there is only one time step. The development version (1.3-11) keeps the time dimension, even if there is only one step.

You can install the development version with
install.packages('terra', repos='https://rspatial.r-universe.dev')

Faster efficient way to crop netcdf in R

Dealing with (NetCDF) files with many layers (time steps) can be very slow when using (a standard approach with) GDAL, which is what terra uses. I hope to fix this over the coming months. What you want to do may go much faster with raster because it approaches the data as a three-dimensional array (it is not looping over layers). So I would suggest

library(raster)
r <- brick("myfile.nc")
r2 <- crop(r, extent(-79, -72, 0, 12.4))

netCDF files in R

When you extract your variable, you need to specify which dimensions you want. Currently you're asking R to get everything and so I suspect it's creating a 3D array which will likely be enormous.

The ncdf4 package generally supersedes ncdf, so you should try using that instead. You need to decide whether you want to read all time steps for one location, or all locations for one time step. This is easier to envisage on a plain 2D grid:

Single cell at all time steps
All locations single time step

Yours is a 3D grid through time (albeit with only two bands in the third dimension); however, it looks like your variable isn't using the bands dimension. Here's a 2D workflow based on ncdf4, ignoring your bands:

Package:

install.packages("ncdf4")
library(ncdf4)

Open connection:

nc = nc_open("~/dir/dir/file.nc")

For a grid at one time step

Read dimensions:

precip = list()
precip$x = ncvar_get(nc, "lon")
precip$y = ncvar_get(nc, "lat")

Read data (note that start is the index along each dimension at which to begin, and count is how many values to read from that point, with -1 meaning all values along that dimension; so here we read the whole grid at the first time step):

precip$z = ncvar_get(nc, "precip", start=c(1, 1, 1), count=c(-1, -1, 1))
# Convert to a raster if required (needs the raster package; the list must have x, y and z elements)
precip.r = raster::raster(precip)
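To make the start and count semantics concrete, here is a minimal sketch that writes a tiny NetCDF file to a temporary location and reads it back both ways. The file, the dimension values and the data are made up for illustration; only the variable name "precip" and the lon, lat, time ordering follow the example above.

```r
library(ncdf4)

# Define a toy variable with dimensions lon (4) x lat (3) x time (5)
lon_dim  <- ncdim_def("lon",  "degrees_east",  c(0, 1, 2, 3))
lat_dim  <- ncdim_def("lat",  "degrees_north", c(10, 11, 12))
time_dim <- ncdim_def("time", "days since 2000-01-01", 0:4)
precip_var <- ncvar_def("precip", "mm", list(lon_dim, lat_dim, time_dim), missval = -9999)

# Write made-up values 1..60 to a temporary file
toy_file <- tempfile(fileext = ".nc")
nc_toy <- nc_create(toy_file, precip_var)
ncvar_put(nc_toy, precip_var, array(seq_len(4 * 3 * 5), dim = c(4, 3, 5)))
nc_close(nc_toy)

nc_toy <- nc_open(toy_file)
# Whole grid at the first time step: all lon, all lat, one time index
grid_t1 <- ncvar_get(nc_toy, "precip", start = c(1, 1, 1), count = c(-1, -1, 1))
# One cell (lon index 2, lat index 3) at every time step
cell_series <- ncvar_get(nc_toy, "precip", start = c(2, 3, 1), count = c(1, 1, -1))
nc_close(nc_toy)

dim(grid_t1)         # lon x lat
length(cell_series)  # number of time steps
```

Only the requested slab is read from disk, which is why this pattern scales to files whose full array would not fit in memory.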

To read a single cell at all time steps

You need to find your cell index, precip$x and precip$y will help. Once you have it (e.g. cell x=5 and y=10):

precip.cell = ncvar_get(nc, "precip", start=c(5, 10, 1), count=c(1, 1, -1))
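One way to find the cell index from the coordinate vectors is to pick the grid coordinate nearest to the location of interest. The sketch below uses made-up lon and lat vectors in place of precip$x and precip$y, and a hypothetical target location; it assumes the target uses the same longitude convention as the file (for example -180 to 180 rather than 0 to 360).

```r
# Made-up coordinate vectors standing in for precip$x and precip$y
lon <- seq(-10, 10, by = 2.5)
lat <- seq(40, 60, by = 2.5)

# Hypothetical location of interest
target_lon <- 1.3
target_lat <- 52.1

# Index of the nearest grid coordinate along each dimension
ix <- which.min(abs(lon - target_lon))
iy <- which.min(abs(lat - target_lat))

c(ix, iy)
```

The resulting ix and iy would then take the place of 5 and 10 in the start argument.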

Use netCDF file in R as panel data sf object

The code below combines the gridded GDP with the centroids of administrative regions (an sf object, represented by obj2 in the code snippet).

If you need to aggregate the grid cells over an administrative region (say by averaging over the region), have a look at the exactextractr package.

library(sf)
library(raster)
library(terra)

obj1 <- stack("./doi_10.5061_dryad.dk1j0__v2/GDP_per_capita_PPP_1990_2015_v2.nc")

# obj1 is a RasterStack, so use raster's extract method; df = TRUE returns a data.frame
extract.df <- raster::extract(obj1, obj2, df = TRUE)
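For the regional averaging mentioned above, exactextractr is the suggested tool. As a lighter sketch of the same idea, terra::extract with fun = mean averages the cells whose centres fall inside each polygon; unlike exactextractr it does not weight cells by the fraction covered. The grid, layer names and square "regions" below are all made up for illustration.

```r
library(terra)

# Synthetic 4 x 4 grid with two layers standing in for two years of gridded GDP
g1 <- rast(nrows = 4, ncols = 4, xmin = 0, xmax = 4, ymin = 0, ymax = 4)
values(g1) <- 1:16
gdp_demo <- c(g1, g1 * 10)
names(gdp_demo) <- c("gdp_1990", "gdp_1991")

# Two hypothetical square regions defined from WKT strings
regions_demo <- rbind(
  vect("POLYGON ((0 0, 2 0, 2 2, 0 2, 0 0))"),
  vect("POLYGON ((2 2, 4 2, 4 4, 2 4, 2 2))")
)
crs(regions_demo) <- crs(gdp_demo)  # give the polygons the same CRS as the raster

# One row per region, one column per layer (plus an ID column)
region_means <- extract(gdp_demo, regions_demo, fun = mean, na.rm = TRUE)
region_means
```

The result has one row per region and one column per year, which can then be reshaped to long format for panel-style analysis.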