clisops.ops package

Operations module for clisops.

Submodules

clisops.ops.average module

Average operations for xarray datasets.

clisops.ops.average.average_over_dims(ds, dims=None, ignore_undetected_dims=False, output_dir=None, output_type='netcdf', split_method='time:auto', file_namer='standard')[source]

Calculate an average over given dimensions.

Parameters:
  • ds (xr.Dataset or str) – Xarray dataset.

  • dims (Sequence of str or DimensionParameter, optional) – The dimensions over which to apply the average. If None, none of the dimensions are averaged over. Dimensions must be one of [“time”, “level”, “latitude”, “longitude”].

  • ignore_undetected_dims (bool) – Controls what happens when the specified dimensions are not found in the dataset. If False, an Exception will be raised. If True, no exception will be raised and the other dimensions will be averaged over. Default = False.

  • output_dir (str or Path, optional) – The directory where the output files will be saved. If None, the output will not be saved to disk.

  • output_type ({“netcdf”, “nc”, “zarr”, “xarray”}) – The format of the output files. If “xarray”, the output will be an xarray Dataset. If “netcdf”, “nc”, or “zarr”, the output will be saved to disk in the specified format.

  • split_method ({“time:auto”}) – The method to split the output files. Currently only “time:auto” is supported, which will automatically split the output files based on time.

  • file_namer ({“standard”, “simple”}) – The file namer to use for generating output file names. “standard” uses a more descriptive naming convention, while “simple” uses a numbered sequence.

Return type:

list[Dataset | str]

Returns:

list of xr.Dataset or str – A list of the outputs in the format selected; str corresponds to file paths if the output format selected is a file.

Examples

ds: xarray Dataset or “cmip5.output1.MOHC.HadGEM2-ES.rcp85.mon.atmos.Amon.r1i1p1.latest.tas”
dims: [‘latitude’, ‘longitude’]
ignore_undetected_dims: False
output_dir: “/cache/wps/procs/req0111”
output_type: “netcdf”
split_method: “time:auto”
file_namer: “standard”
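
The example values above can be turned into a call along the following lines. This is a minimal sketch rather than an official recipe: build_average_request and run_average are hypothetical helper names, and the clisops import is deferred into run_average so that the snippet loads even where clisops is not installed.

```python
def build_average_request(output_dir):
    """Collect keyword arguments that match the documented signature."""
    return {
        "dims": ["latitude", "longitude"],
        "ignore_undetected_dims": False,
        "output_dir": output_dir,
        "output_type": "netcdf",
        "split_method": "time:auto",
        "file_namer": "standard",
    }


def run_average(ds, output_dir):
    """Run the average; assumes clisops is installed (imported lazily)."""
    from clisops.ops.average import average_over_dims

    return average_over_dims(ds, **build_average_request(output_dir))


# Building the request needs no dataset and no clisops installation.
request = build_average_request("/cache/wps/procs/req0111")
```

Calling run_average(ds, "/cache/wps/procs/req0111") with a dataset or path would then return the list described under Returns.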
clisops.ops.average.average_shape(ds, shape, variable=None, output_dir=None, output_type='netcdf', split_method='time:auto', file_namer='standard')[source]

Calculate a spatial average over a given shape.

Parameters:
  • ds (xr.Dataset or str or Path) – Xarray dataset.

  • shape (str, Path, or gpd.GeoDataFrame) – Path to shape file, or directly a GeoDataFrame. Supports formats compatible with geopandas. Will be converted to EPSG:4326 if needed.

  • variable (str or sequence of str, optional) – Variables to average. If None, average over all data variables.

  • output_dir (str or Path, optional) – The directory where the output files will be saved. If None, the output will not be saved to disk.

  • output_type ({“netcdf”, “nc”, “zarr”, “xarray”}) – The format of the output files. If “xarray”, the output will be an xarray Dataset.

  • split_method ({“time:auto”}) – The method to split the output files. Currently only “time:auto” is supported, which will automatically split the output files based on time.

  • file_namer ({“standard”, “simple”}) – The file namer to use for generating output file names. “standard” uses a more descriptive naming convention, while “simple” uses a numbered sequence.

Return type:

list[Dataset | str]

Returns:

list of xr.Dataset or str – A list of the outputs in the format selected. str corresponds to file paths if the output format selected is a file.

Examples

ds: xarray Dataset or “cmip5.output1.MOHC.HadGEM2-ES.rcp85.mon.atmos.Amon.r1i1p1.latest.tas”
shape: path to a shape file or a gpd.GeoDataFrame
variable: None
output_dir: “/cache/wps/procs/req0111”
output_type: “netcdf”
split_method: “time:auto”
file_namer: “standard”
clisops.ops.average.average_time(ds, freq, output_dir=None, output_type='netcdf', split_method='time:auto', file_namer='standard')[source]

Calculate an average over time for a given frequency.

Parameters:
  • ds (xr.Dataset or str) – Xarray dataset.

  • freq (str) – The frequency to average over, either “month” or “year”.

  • output_dir (str or Path, optional) – The directory where the output files will be saved. If None, the output will not be saved to disk.

  • output_type ({“netcdf”, “nc”, “zarr”, “xarray”}) – The format of the output files. If “xarray”, the output will be an xarray Dataset.

  • split_method ({“time:auto”}) – The method to split the output files. Currently only “time:auto” is supported, which will automatically split the output files based on time.

  • file_namer ({“standard”, “simple”}) – The file namer to use for generating output file names. “standard” uses a more descriptive naming convention, while “simple” uses a numbered sequence.

Return type:

list[Dataset | str]

Returns:

list of xr.Dataset or str – A list of the outputs in the format selected. str corresponds to file paths if the output format selected is a file.

Examples

ds: xarray Dataset or “cmip5.output1.MOHC.HadGEM2-ES.rcp85.mon.atmos.Amon.r1i1p1.latest.tas”
freq: “year”
output_dir: “/cache/wps/procs/req0111”
output_type: “netcdf”
split_method: “time:auto”
file_namer: “standard”
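
To illustrate what averaging at a “month” or “year” frequency means, here is a small pure-Python sketch. It is not how clisops computes the result (clisops operates on xarray datasets); the function average_by_freq and the sample data are hypothetical, and the sketch assumes an unweighted mean of the samples in each calendar period.

```python
from collections import defaultdict
from datetime import date
from statistics import fmean


def average_by_freq(samples, freq):
    """Average (date, value) pairs per calendar year or calendar month."""
    if freq not in ("month", "year"):
        raise ValueError(f"freq must be 'month' or 'year', got {freq!r}")

    groups = defaultdict(list)
    for day, value in samples:
        # Group key: the year alone, or the (year, month) pair.
        key = (day.year,) if freq == "year" else (day.year, day.month)
        groups[key].append(value)

    return {key: fmean(values) for key, values in sorted(groups.items())}


samples = [
    (date(2000, 1, 10), 1.0),
    (date(2000, 1, 20), 3.0),
    (date(2000, 2, 5), 5.0),
    (date(2001, 1, 1), 7.0),
]
yearly = average_by_freq(samples, "year")
monthly = average_by_freq(samples, "month")
```
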

clisops.ops.base_operation module

Base class for all Operations in clisops.

class clisops.ops.base_operation.Operation(ds, file_namer='standard', split_method='time:auto', output_dir=None, output_type='netcdf', **params)[source]

Bases: object

Base class for all Operations.

This class provides the common interface and functionality for all operations in clisops.

Parameters:
  • ds (str or Path or xr.Dataset) – The input dataset, which can be a path to a file or an xarray Dataset.

  • file_namer (str, optional) – The file namer to use for output files. Default is “standard”.

  • split_method (str, optional) – The method to use for splitting the dataset into time slices. Default is “time:auto”.

  • output_dir (str or Path or None, optional) – The directory where output files will be saved. If None, no files will be saved. Default is None.

  • output_type (str, optional) – The type of output to generate. Can be “netcdf”, “zarr”, or “xarray”. Default is “netcdf”.

  • **params (dict, optional) – Additional parameters specific to the operation. These will be resolved in self._resolve_params().

_calculate()[source]

The _calculate() method is implemented within each operation subclass.

_cap_deflate_level(ds)[source]

Cap the netCDF4 deflate level at 1. For CMOR3 / CMIP6, an investigation into which deflate_level best balances file-size reduction against performance degradation settled on deflate_level=1 with shuffle=True. To keep write times to a minimum, compression level 1 is not exceeded. See issue: https://github.com/PCMDI/cmor/issues/403.
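
The capping idea can be sketched with plain dictionaries standing in for per-variable encoding settings. This is an illustration under assumptions, not the method's actual body: cap_deflate_level here is a hypothetical free function, and the keys “zlib”, “shuffle” and “complevel” follow the names used by xarray's netCDF4 encoding.

```python
def cap_deflate_level(encodings, max_level=1):
    """Return a copy of the encodings with complevel capped at max_level."""
    capped = {}
    for name, encoding in encodings.items():
        new_encoding = dict(encoding)  # leave the caller's dict untouched
        level = new_encoding.get("complevel")
        if level is not None and level > max_level:
            new_encoding["complevel"] = max_level
        capped[name] = new_encoding
    return capped


encodings = {
    "tas": {"zlib": True, "shuffle": True, "complevel": 4},
    "time": {"dtype": "float64"},
}
capped = cap_deflate_level(encodings)
```

Variables without a compression level pass through unchanged, and levels already at or below the cap are not raised.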

_fix_netcdf_attrs_encoding(ds)[source]

Executes output_utils.fix_netcdf_attrs_encoding for xarray.Datasets.

_get_file_namer()[source]

Return the appropriate file namer object.

static _remove_redundant_coordinates_attr(ds)[source]

Remove the coordinate attribute added by xarray.

See also

https://github.com/roocs/clisops/issues/224

Examples

If you have a dataset with a time_bnds variable that has a coordinate attribute:

    double time_bnds(time, bnds);
    time_bnds:coordinates = "height";

Programs like cdo will complain about this:

    Warning (cdf_set_var): Inconsistent variable definition for time_bnds!

static _remove_redundant_fill_values(ds)[source]

Get coordinate and data variables and remove fill values added by xarray.

CF-Conventions say that coordinate variables cannot have missing values.

See also

https://github.com/roocs/clisops/issues/224

_remove_str_compression(ds)[source]

netCDF4 datatypes of variable length are decoded to str by xarray < 2023.11.0. As of xarray 2023.11.0, they are decoded to one of np.dtypes.StrDType (e.g. “<U20”) of variable length and stripped of all encoding settings. With netcdf-c >= 4.9.0 and xarray < 2023.11.0, the encoding settings must be stripped manually to avoid an Exception when writing the xarray.Dataset to disk. See issue: https://github.com/Unidata/netcdf4-python/issues/1205. See PR: https://github.com/roocs/clisops/pull/319.
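
A rough sketch of the manual stripping step, using plain dictionaries in place of per-variable encodings. The function strip_str_compression is hypothetical, and the list of keys to remove is an assumption made for illustration; the actual method may remove a different set.

```python
# Assumed set of compression-related encoding keys (illustrative only).
COMPRESSION_KEYS = ("zlib", "complevel", "shuffle")


def strip_str_compression(encodings):
    """Drop compression keys from encodings whose dtype is str."""
    cleaned = {}
    for name, encoding in encodings.items():
        if encoding.get("dtype") is str:
            encoding = {
                key: value
                for key, value in encoding.items()
                if key not in COMPRESSION_KEYS
            }
        cleaned[name] = dict(encoding)
    return cleaned


encodings = {
    "station_name": {"dtype": str, "zlib": True, "complevel": 1},
    "tas": {"dtype": "float32", "zlib": True, "complevel": 1},
}
cleaned = strip_str_compression(encodings)
```
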

_resolve_dsets(ds)[source]

Take in the ds object and load it as an xarray Dataset if it is a path/wildcard. Set the result to self.ds.

_resolve_params(**params)[source]

Resolve the operation-specific input parameters to self.params.

Return type:

None

process()[source]

Main processing method used by all subclasses.

Return type:

list[Dataset | Path]

Returns:

list of xr.Dataset or Path – A list of outputs, which might be NetCDF file paths, Zarr file paths, or xarray.Dataset.
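
The division of labour between process(), _resolve_params() and _calculate() can be pictured with the simplified sketch below. It is a hypothetical stand-in, not the clisops implementation: SketchOperation and ScaleOperation are invented names, file output and dataset loading are left out, and a plain list plays the role of the dataset.

```python
class SketchOperation:
    """Simplified, hypothetical stand-in for the documented base class."""

    def __init__(self, ds, **params):
        self.ds = ds
        self.params = {}
        self._resolve_params(**params)

    def _resolve_params(self, **params):
        # Base behaviour in this sketch: store the parameters unchanged.
        self.params = dict(params)

    def _calculate(self):
        raise NotImplementedError("implemented within each operation subclass")

    def process(self):
        # Shared driver: compute the result, then return it in a list.
        return [self._calculate()]


class ScaleOperation(SketchOperation):
    """Hypothetical subclass that multiplies every value by a factor."""

    def _resolve_params(self, **params):
        self.params = {"factor": float(params.get("factor", 1.0))}

    def _calculate(self):
        return [value * self.params["factor"] for value in self.ds]


outputs = ScaleOperation([1, 2, 3], factor=2).process()
```

A subclass only overrides the two hooks; the shared process() method stays in the base class.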

clisops.ops.regrid module

Regridding operation for xarray datasets.

clisops.ops.regrid.regrid(ds, *, method='nearest_s2d', adaptive_masking_threshold=0.5, grid='adaptive', mask=None, output_dir=None, output_type='netcdf', split_method='time:auto', file_namer='standard', keep_attrs=True)[source]

Regrid specified input file or xarray object.

Parameters:
  • ds (xarray.Dataset or str or Path) – Dataset to regrid, or a path to a file or files (wildcards allowed).

  • method ({“nearest_s2d”, “conservative”, “patch”, “bilinear”}) – The regridding method to use. Default is “nearest_s2d”.

  • adaptive_masking_threshold (int or float, optional) – Threshold for adaptive masking. If None, adaptive masking is not applied. Default is 0.5.

  • grid (xarray.Dataset or xarray.DataArray or int or float or tuple or str) – The target grid for regridding. If None, the default grid is used. If “adaptive”, an adaptive grid will be used based on the input dataset. If “auto”, the grid will be automatically determined based on the input dataset. If a tuple, it should be in the format (lat, lon) or (lat, lon, level). Default is “adaptive”.

  • mask ({“ocean”, “land”}, optional) – The mask to apply to the regridded data. If None, no mask is applied.

  • output_dir (str or Path, optional) – The directory where the output files will be saved. If None, the output will not be saved to disk.

  • output_type ({“netcdf”, “nc”, “zarr”, “xarray”}) – The format of the output files. If “xarray”, the output will be an xarray Dataset. If “netcdf”, “nc”, or “zarr”, the output will be saved to files in the specified format. Default is “netcdf”.

  • split_method ({“time:auto”}) – The method to split the output files. Currently only “time:auto” is supported, which will split the output files by time slices automatically. Default is “time:auto”.

  • file_namer ({“standard”, “simple”}) – File namer to use for generating output file names. “standard” uses the dataset name and adds a suffix for the operation. “simple” uses a numbered sequence for the output files. Default is “standard”.

  • keep_attrs ({True, False, “target”}) – Whether to keep the attributes of the input dataset in the output dataset. If “target”, the attributes of the target grid will be kept. Default is True.

Return type:

list[Dataset | str]

Returns:

list of xr.Dataset or list of str – A list of the regridded outputs in the format selected; str corresponds to file paths if the output format selected is a file.

Examples

ds: xarray Dataset or “cmip5.output1.MOHC.HadGEM2-ES.rcp85.mon.atmos.Amon.r1i1p1.latest.tas”
method: “nearest_s2d”
adaptive_masking_threshold:
grid: “1deg”
mask: “land”
output_dir: “/cache/wps/procs/req0111”
output_type: “netcdf”
split_method: “time:auto”
file_namer: “standard”
keep_attrs: True
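
A call built from the example values might look like the sketch below. The helpers build_regrid_request and run_regrid are hypothetical, the method check simply mirrors the set of names listed under Parameters, and adaptive_masking_threshold is set to the documented default of 0.5 as an assumption. The clisops import is deferred so the snippet loads without clisops installed.

```python
REGRID_METHODS = {"nearest_s2d", "conservative", "patch", "bilinear"}


def build_regrid_request(method="nearest_s2d", grid="1deg", mask="land"):
    """Collect keyword-only arguments matching the documented signature."""
    if method not in REGRID_METHODS:
        raise ValueError(f"unknown regridding method: {method!r}")
    return {
        "method": method,
        "adaptive_masking_threshold": 0.5,
        "grid": grid,
        "mask": mask,
        "output_dir": "/cache/wps/procs/req0111",
        "output_type": "netcdf",
        "split_method": "time:auto",
        "file_namer": "standard",
        "keep_attrs": True,
    }


def run_regrid(ds):
    """Run the regridding; assumes clisops is installed (imported lazily)."""
    from clisops.ops.regrid import regrid

    return regrid(ds, **build_regrid_request())


request = build_regrid_request()
```

All arguments after ds are passed by keyword because the signature marks them as keyword-only.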

clisops.ops.subset module

Subset operations for xarray datasets.

class clisops.ops.subset.Subset(ds, file_namer='standard', split_method='time:auto', output_dir=None, output_type='netcdf', **params)[source]

Bases: Operation

Subset operation for xarray datasets.

This operation allows subsetting of datasets based on time, area, and level parameters.

Variables:
  • ds (xr.Dataset or str or Path) – The dataset to be subsetted, can be a path to a file or an xarray Dataset.

  • time (str or tuple or TimeParameter or Series or Interval, optional) – Time parameter for subsetting, can be a string, tuple, or TimeParameter instance.

  • area (str or tuple or AreaParameter, optional) – Area parameter for subsetting, can be a string, tuple, or AreaParameter instance.

  • level (str or tuple or LevelParameter or Interval, optional) – Level parameter for subsetting, can be a string, tuple, or LevelParameter instance.

  • time_components (str or dict or TimeComponents or TimeComponentsParameter, optional) – Time components for subsetting, can be a string, dictionary, or TimeComponentsParameter instance.

  • output_dir (str or Path, optional) – Directory where the output will be saved. If None, the output will not be saved to a file.

  • output_type (str, default "netcdf") – The format of the output, can be “netcdf”, “nc”, “zarr”, or “xarray”.

  • split_method (str, default "time:auto") – Method for splitting the output, currently only supports “time:auto”.

  • file_namer (str, default "standard") – The file naming strategy to use for the output files, can be “standard” or “simple”.

_resolve_params(**params)[source]

Generates a dictionary of subset parameters based on the provided arguments.

_calculate()[source]

Processes the subsetting request and returns the subsetted dataset.

process()

Executes the subsetting operation and returns the result.

_calculate()[source]

The _calculate() method is implemented within each operation subclass.

_resolve_params(**params)[source]

Generates a dictionary of subset parameters.

clisops.ops.subset.subset(ds, *, time=None, area=None, level=None, time_components=None, output_dir=None, output_type='netcdf', split_method='time:auto', file_namer='standard')[source]

Subset operation.

Parameters:
  • ds (xarray.Dataset or str or Path) – The dataset to be subsetted, can be a path to a file or an xarray Dataset.

  • time (str or Tuple[str, str] or TimeParameter or Series or Interval, optional) – Time parameter for subsetting, can be a string, tuple of strings, TimeParameter instance, Series, or Interval. If None, no time subsetting is applied.

  • area (str or AreaParameter or Tuple[int or float or str, int or float or str, int or float or str, int or float or str], optional) – Area parameter for subsetting, can be a string, AreaParameter instance, or a tuple of four values representing the bounding box (lon_min, lat_min, lon_max, lat_max). If None, no area subsetting is applied.

  • level (str or Tuple[int or float or str, int or float or str] or LevelParameter or Interval, optional) – Level parameter for subsetting, can be a string, tuple of two values, LevelParameter instance, or Interval. If None, no level subsetting is applied.

  • time_components (str or dict or TimeComponentsParameter, optional) – Time components for subsetting, can be a string, dictionary, or TimeComponentsParameter instance.

  • output_dir (str or Path, optional) – Directory where the output will be saved. If None, the output will not be saved to a file.

  • output_type ({“netcdf”, “nc”, “zarr”, “xarray”}) – The format of the output, can be “netcdf”, “nc”, “zarr”, or “xarray”. Default is “netcdf”.

  • split_method ({“time:auto”}) – Method for splitting the output, currently only supports “time:auto”. Default is “time:auto”.

  • file_namer ({“standard”, “simple”}) – The file naming strategy to use for the output files, can be “standard” or “simple”. Default is “standard”.

Return type:

list[Dataset | str]

Returns:

list of xr.Dataset or list of str – A list of the subsetted outputs in the format selected; str corresponds to file paths if the output format selected is a file.

Notes

If you request a selection range (such as level, latitude or longitude) that specifies the lower and upper bounds in the opposite direction to the actual coordinate values then clisops.ops.subset will detect this issue and reverse your selection before returning the data subset.
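
The reversal described in this note can be illustrated with plain lists. This sketch is not the clisops implementation; order_bounds and select_range are hypothetical names, and the coordinate is assumed to be monotonic.

```python
def order_bounds(lower, upper, coord_values):
    """Return the bounds ordered to match the direction of coord_values."""
    coord_descending = coord_values[0] > coord_values[-1]
    request_descending = lower > upper
    if coord_descending != request_descending:
        # The request runs against the coordinate direction: reverse it.
        lower, upper = upper, lower
    return lower, upper


def select_range(coord_values, lower, upper):
    """Select the values between the bounds, whatever order they came in."""
    start, stop = order_bounds(lower, upper, coord_values)
    low, high = min(start, stop), max(start, stop)
    return [value for value in coord_values if low <= value <= high]


levels = [1000, 850, 500, 250]   # descending coordinate
latitudes = [-10, 0, 10, 20]     # ascending coordinate

level_bounds = order_bounds(250, 850, levels)
level_subset = select_range(levels, 250, 850)
latitude_subset = select_range(latitudes, 20, 0)
```
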

Examples

ds: xarray Dataset or “cmip5.output1.MOHC.HadGEM2-ES.rcp85.mon.atmos.Amon.r1i1p1.latest.tas”
time: (“1999-01-01T00:00:00”, “2100-12-30T00:00:00”) or “2085-01-01T12:00:00Z/2120-12-30T12:00:00Z”
area: (-5.,49.,10.,65) or “0.,49.,10.,65” or [0, 49.5, 10, 65] with the order being lon_0, lat_0, lon_1, lat_1
level: (1000.,) or “1000/2000” or (“1000.50”, “2000.60”)
time_components: “year:2000,2004,2008|month:01,02” or {“year”: (2000, 2004, 2008), “month”: (1, 2)}
output_dir: “/cache/wps/procs/req0111”
output_type: “netcdf”
split_method: “time:auto”
file_namer: “standard”
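
The string forms shown for time_components and area can be read as in the following sketch. These parsers are hypothetical and only handle the numeric forms shown in the examples above; they are not the parameter classes clisops itself uses.

```python
def parse_time_components(text):
    """Parse 'year:2000,2004,2008|month:01,02' into a dict of int tuples."""
    components = {}
    for part in text.split("|"):
        name, _, values = part.partition(":")
        components[name.strip()] = tuple(int(v) for v in values.split(","))
    return components


def parse_area(text):
    """Parse 'lon_0,lat_0,lon_1,lat_1' into a tuple of four floats."""
    bounds = tuple(float(v) for v in text.split(","))
    if len(bounds) != 4:
        raise ValueError("area needs four values: lon_0, lat_0, lon_1, lat_1")
    return bounds


components = parse_time_components("year:2000,2004,2008|month:01,02")
area = parse_area("0.,49.,10.,65")
```
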