clisops.utils package
Module providing utility functions for CLISOPS.
Submodules
clisops.utils.common module
Common utility functions for CLISOPS.
- clisops.utils.common.check_dir(func, dr)[source]
Ensure that directory ‘dr’ exists before function/method is called, decorator.
- Parameters:
func (FunctionType) – The function to be decorated.
dr (str or Path) – The directory path to check for existence.
- Return type:
Callable
- Returns:
FunctionType – The decorated function that checks for the directory’s existence before execution.
- clisops.utils.common.enable_logging()[source]
Enable logging for CLISOPS.
- Return type:
list[int]
- Returns:
list[int] – List of enabled log levels, e.g., [10, 20, 30, 40, 50].
- clisops.utils.common.expand_wildcards(paths)[source]
Expand the wildcards that may be present in Paths.
- Parameters:
paths (str or Path) – The path or paths to expand, which may contain wildcards (e.g., *.nc).
- Return type:
list
- Returns:
list – A list of Path objects that match the expanded wildcards.
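To illustrate wildcard expansion, here is a minimal sketch built on the standard-library glob module. The helper name expand_paths is hypothetical and this is not the CLISOPS implementation; it only assumes that a pattern such as *.nc should resolve to a sorted list of Path objects.

```python
import glob
from pathlib import Path


def expand_paths(pattern):
    """Hypothetical helper: expand a glob pattern into sorted Path objects."""
    # glob.glob returns the matching paths as strings;
    # recursive=True additionally enables the "**" pattern.
    matches = glob.glob(str(pattern), recursive=True)
    return sorted(Path(match) for match in matches)
```

A pattern that matches nothing yields an empty list in this sketch; whether the library behaves the same way is not stated above.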
- clisops.utils.common.parse_size(size)[source]
Parse size string into number of bytes.
Parses a size string with a number and a unit (e.g., “10MB”, “2GB”) into an integer representing the number of bytes.
- Parameters:
size (str) – The size string to parse, which should consist of a number followed by a unit (e.g., “10MB”, “2GB”).
- Return type:
int
- Returns:
int – The size in bytes as an integer.
- Raises:
ValueError – If the size string does not match the expected format or if the unit is not recognized.
Examples
>>> parse_size("10MB")
10485760
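The documented result (10 MB giving 10485760 bytes) implies binary multipliers of 1024. The sketch below shows one way such a parser could work; parse_size_sketch and its unit table are assumptions made for illustration, not the library's code.

```python
import re

# Binary multipliers, chosen so that "10MB" gives 10 * 1024**2 = 10485760,
# as in the documented example. The set of accepted units is an assumption.
UNITS = {"B": 1, "KB": 1024, "MB": 1024**2, "GB": 1024**3, "TB": 1024**4}


def parse_size_sketch(size):
    """Hypothetical re-implementation: '<number><unit>' -> number of bytes."""
    match = re.fullmatch(r"\s*(\d+(?:\.\d+)?)\s*([A-Za-z]+)\s*", size)
    if match is None:
        raise ValueError(f"Cannot parse size string: {size!r}")
    number, unit = match.groups()
    unit = unit.upper()
    if unit not in UNITS:
        raise ValueError(f"Unrecognized unit: {unit!r}")
    return int(float(number) * UNITS[unit])
```

Both failure modes named under Raises (bad format, unknown unit) map to ValueError here.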
- clisops.utils.common.require_module(func, module, module_name, min_version='0.0.0', unsupported_version_range=None, max_supported_version=None, max_supported_warning=None)[source]
Ensure that module is installed before function/method is called, decorator.
- Parameters:
func (FunctionType) – The function to be decorated.
module (ModuleType) – The module to check for availability.
module_name (str) – The name of the module to check.
min_version (str, optional) – The minimum version of the module required. Defaults to “0.0.0”.
unsupported_version_range (list of str or tuple of str, optional) – A list with two elements, with the elements marking a range of unsupported versions, with the first element being the first unsupported and the second element being the first supported version. If provided, a warning will be issued if the module version falls within this range: version_0 <= module_version < version_1 Defaults to None, meaning no unsupported version range check is performed.
max_supported_version (str, optional) – The maximum supported version of the module. If provided, a warning will be issued if the module version exceeds this. Defaults to None, meaning no maximum version check is performed.
max_supported_warning (str, optional) – The warning message to display if the module version exceeds the maximum supported version.
- Return type:
Callable
- Returns:
FunctionType – The decorated function that checks for the module’s availability and version before execution.
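The half-open range test described for unsupported_version_range (version_0 <= module_version < version_1) can be sketched with plain tuple comparison. The helper names below are hypothetical, and the sketch only handles purely numeric dotted versions with the same number of components; it is not how the decorator itself is implemented.

```python
def version_tuple(version):
    """Turn a dotted version string such as '1.2.3' into (1, 2, 3)."""
    return tuple(int(part) for part in version.split("."))


def in_unsupported_range(module_version, unsupported_version_range):
    """Return True if version_0 <= module_version < version_1."""
    first_unsupported, first_supported = unsupported_version_range
    return (
        version_tuple(first_unsupported)
        <= version_tuple(module_version)
        < version_tuple(first_supported)
    )
```

Comparing integer tuples avoids the string-comparison pitfall where "0.10.0" would sort before "0.9.0".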
clisops.utils.dataset_utils module
Dataset utilities for CLISOPS.
- clisops.utils.dataset_utils._convert_interval_between_lon_frames(low, high)[source]
Convert a longitude interval to another longitude frame, returns Tuple of two floats.
- clisops.utils.dataset_utils._crosses_0_meridian(lon_c)[source]
Determine whether the grid extends over the 0-meridian.
Assumes approximate constant width of grid cells.
- Parameters:
lon_c (xr.DataArray) – Longitude coordinate variable in the longitude frame [-180, 180].
- Returns:
bool – True for a dataset crossing the 0-meridian, False otherwise.
- clisops.utils.dataset_utils._determine_grid_orientation(lon)[source]
Determine grid orientation by checking the longitude range along each axis.
- clisops.utils.dataset_utils._get_kwargs_for_opener(otype, **kwargs)[source]
Returns a dictionary of keyword arguments to pass to either xr.open_dataset() or xr.open_mfdataset(), depending on whether otype="single" or otype="multi". The provided kwargs dictionary is used to extend/override the default values.
- Parameters:
otype (str) – The type of opener, either “single” for xr.open_dataset() or “multi” for xr.open_mfdataset().
**kwargs (dict) – Additional keyword arguments to include when opening the dataset.
- Returns:
dict[str, any] – A dictionary of keyword arguments to be used with the specified xarray dataset opener.
- clisops.utils.dataset_utils._is_time(coord)[source]
Check if a coordinate uses cftime datetime objects.
Handles Dask-backed arrays for lazy evaluation.
- Return type:
bool
- clisops.utils.dataset_utils._lonbnds_mids_trans_check(lon1, lon2, lon3, lon4)[source]
Checks if the midpoints of the bounds traverse the Greenwich Meridian or antimeridian. If so, the midpoints are adjusted.
- clisops.utils.dataset_utils._lonbnds_mids_trans_check_diff(lon1, lon2)[source]
Checks if the midpoints of the bounds traverse the Greenwich Meridian or antimeridian. If so, the midpoints are adjusted.
- clisops.utils.dataset_utils._lonbnds_mids_trans_check_sum(lon1, lon2)[source]
Checks if the midpoints of the bounds traverse the Greenwich Meridian or antimeridian. If so, the midpoints are adjusted.
- clisops.utils.dataset_utils._open_as_kerchunk(dset, **kwargs)[source]
Open the dataset dset as a Kerchunk file. Return an Xarray Dataset.
- clisops.utils.dataset_utils._patch_time_encoding(ds, file_list, **kwargs)[source]
Patches the time encoding of an xarray Dataset that has been opened from multiple files.
Reads the time units attribute from the first file in file_list and saves it in ds.time.encoding["units"].
- Parameters:
ds (xarray.Dataset) – The xarray dataset to patch.
file_list (list of str or Path) – List of file paths to the dataset files.
Notes
Hopefully this will be fixed in Xarray at some point. The problem is that if time is present, the multi-file dataset has an empty encoding dictionary.
- clisops.utils.dataset_utils.add_hor_CF_coord_attrs(ds, lat='lat', lon='lon', lat_bnds='lat_bnds', lon_bnds='lon_bnds', keep_attrs=False)[source]
Add the common CF variable attributes to the horizontal coordinate variables.
- Parameters:
ds (xarray.Dataset) – An xarray Dataset.
lat (str) – Latitude coordinate variable name. The default is “lat”.
lon (str) – Longitude coordinate variable name. The default is “lon”.
lat_bnds (str) – Latitude bounds coordinate variable name. The default is “lat_bnds”.
lon_bnds (str) – Longitude bounds coordinate variable name. The default is “lon_bnds”.
keep_attrs (bool) – Whether to keep original coordinate variable attributes if they do not conflict. In case of a conflict, the attribute value will be overwritten independent of this setting. The default is False.
- Return type:
Dataset
- Returns:
xarray.Dataset – The input dataset with updated coordinate variable attributes.
- clisops.utils.dataset_utils.adjust_date_to_calendar(ds, date, direction='backwards')[source]
Check that the date specified exists in the calendar type of the dataset.
If the date does not exist, it is changed one day at a time (at most five times) until an existing date is found. The direction parameter sets whether the search moves backwards or forwards in time.
- Parameters:
ds (xarray.Dataset or xarray.DataArray) – The data to examine.
date (str) – The date to check.
direction (str) – The direction to move the index in days to find a date that does exist. ‘backwards’ means the search will go backwards in time until an existing date is found. ‘forwards’ means the search will go forwards in time. The default is ‘backwards’.
- Return type:
str
- Returns:
str – The next possible existing date in the calendar of the dataset.
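The day-by-day search can be sketched without xarray. The example below is an assumption-laden stand-in: it validates dates against the standard library's proleptic Gregorian calendar rather than the dataset's own calendar, and adjust_date_sketch is a hypothetical name.

```python
import datetime


def adjust_date_sketch(date, direction="backwards", max_steps=5):
    """Hypothetical: nudge 'YYYY-MM-DD' day by day until the date exists."""
    if direction not in ("backwards", "forwards"):
        raise ValueError(f"Unknown direction: {direction!r}")
    year, month, day = (int(part) for part in date.split("-"))
    for _ in range(max_steps + 1):
        try:
            # datetime.date raises ValueError for non-existent dates
            return datetime.date(year, month, day).isoformat()
        except ValueError:
            if direction == "backwards":
                day -= 1
            else:
                day += 1
                if day > 31:
                    # walk past the end of the month into the next one
                    day, month = 1, month + 1
                    if month > 12:
                        month, year = 1, year + 1
    raise ValueError(f"No existing date found near {date!r}")
```

For a date such as 30 February, the backwards search lands on the last day of February and the forwards search on 1 March.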
- clisops.utils.dataset_utils.calculate_offset(lon, first_element_value)[source]
Calculate the number of elements to roll the dataset by so that the longitudes fall within the requested bounds.
- Parameters:
lon (xarray.DataArray) – Longitude coordinate of xarray dataset.
first_element_value (float) – The value of the first element of the longitude array to roll to.
- Return type:
int
- Returns:
int – The number of elements to roll the dataset by.
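One way to think about the roll offset, assuming a regularly spaced longitude axis, is to divide the required shift in degrees by the grid resolution. The sketch below works on plain lists; both function names are hypothetical, and the actual computation in CLISOPS may differ.

```python
def calculate_offset_sketch(lon, first_element_value):
    """Hypothetical: positions to roll a regularly spaced longitude axis."""
    resolution = lon[1] - lon[0]  # assumes constant spacing
    shift_in_degrees = lon[0] - first_element_value
    return int(round(shift_in_degrees / resolution))


def roll(values, offset):
    """Cyclically shift a list to the right, like numpy.roll."""
    offset %= len(values)
    return values[-offset:] + values[:-offset] if offset else list(values)
```

With longitudes [0, 90, 180, 270] and a requested first value of -180, the offset is 2 and the rolled axis starts at 180, which is the same meridian as -180 in the other longitude frame.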
- clisops.utils.dataset_utils.cf_convert_between_lon_frames(ds_in, lon_interval, force=False)[source]
Convert ds or lon_interval to the other longitude frame if the longitude frames do not match, as appropriate.
If ds and lon_interval are defined on different longitude frames ([-180, 180] and [0, 360]), this function will convert one of the input parameters to the other longitude frame, preferably the lon_interval. Adjusts shifted longitude frames [0-x, 360-x] in the dataset to one of the two standard longitude frames, dependent on the specified lon_interval. In the case of curvilinear grids featuring an additional 1D x-coordinate of the projection, this projection x-coordinate will not get converted.
- Parameters:
ds_in (xarray.Dataset or xarray.DataArray) – An xarray data object with defined longitude dimension.
lon_interval (tuple or list) – Length-2-tuple or length-2-list of floats or integers denoting the bounds of the longitude interval.
force (bool) – If True, force conversion even if longitude frames match.
- Returns:
Tuple(ds, lon_low, lon_high) – The xarray.Dataset and the bounds of the longitude interval, potentially adjusted in terms of their defined longitude frame.
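The arithmetic behind switching between the two standard longitude frames can be sketched with modular arithmetic. These helpers are hypothetical and use half-open intervals; how the library treats the endpoints (180 and 360) is not specified above and may differ.

```python
def to_0_360(lon):
    """Map a longitude in degrees onto the [0, 360) frame."""
    return lon % 360


def to_minus180_180(lon):
    """Map a longitude in degrees onto the [-180, 180) frame."""
    return (lon + 180) % 360 - 180
```

Applying either helper to both ends of a longitude interval gives the same interval expressed in the other frame, provided the interval does not straddle the frame's wrap-around point.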
- clisops.utils.dataset_utils.check_lon_alignment(ds, lon_bnds)[source]
Check whether the longitude subset requested is within the bounds of the dataset.
If not, try to roll the dataset so that the request is. Raise an exception if rolling is not possible.
- Parameters:
ds (xarray.Dataset) – The xarray dataset to check.
lon_bnds (tuple) – A tuple of two floats representing the longitude bounds to check against the dataset.
- Return type:
Dataset
- Returns:
xarray.Dataset – The dataset with the longitude coordinate adjusted if necessary.
- clisops.utils.dataset_utils.convert_coord_to_axis(coord)[source]
Convert coordinate type to its single character axis identifier (tzyx).
- Parameters:
coord (str) – The coordinate type to convert, e.g. ‘time’, ‘longitude’, ‘latitude’, ‘level’, ‘realization’.
- Returns:
str – The single character axis identifier of the coordinate (t for time, z for level, y for latitude, x for longitude, r for realization).
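The mapping listed above can be written as a small lookup table. This is an illustrative sketch with hypothetical names; returning None for an unknown coordinate type is an assumption, not documented behavior.

```python
# Coordinate type -> single-character axis identifier, as listed above.
AXIS_BY_COORD_TYPE = {
    "time": "t",
    "level": "z",
    "latitude": "y",
    "longitude": "x",
    "realization": "r",
}


def coord_to_axis(coord):
    """Hypothetical lookup; unknown types give None (an assumption)."""
    return AXIS_BY_COORD_TYPE.get(coord)
```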
- clisops.utils.dataset_utils.detect_bounds(ds, coordinate)[source]
Use cf_xarray to obtain the variable name of the requested coordinate's bounds.
- Parameters:
ds (xarray.Dataset or xarray.DataArray) – Dataset the coordinate bounds variable name shall be obtained from.
coordinate (str) – Name of the coordinate variable to determine the bounds from.
- Return type:
str|None
- Returns:
str or None – Returns the variable name of the requested coordinate bounds. Returns None if the variable has no bounds or if they cannot be identified.
- clisops.utils.dataset_utils.detect_coordinate(ds, coord_type)[source]
Use cf_xarray to obtain the variable name of the requested coordinate.
- Parameters:
ds (xarray.Dataset or xarray.DataArray) – Dataset the coordinate variable name shall be obtained from.
coord_type (str) – Coordinate type understood by cf-xarray, e.g. ‘lat’, ‘lon’, …
- Return type:
str
- Returns:
str – Coordinate variable name.
- Raises:
KeyError – Raised if the requested coordinate cannot be identified.
- clisops.utils.dataset_utils.detect_format(ds)[source]
Detect format of a dataset.
Supported formats are ‘CF’, ‘SCRIP’, ‘xESMF’.
- Parameters:
ds (xr.Dataset) – An xarray.Dataset of which to detect the format.
- Return type:
str
- Returns:
str – The format, if supported. Otherwise an Exception is raised.
- clisops.utils.dataset_utils.detect_gridtype(ds, lon, lat, lon_bnds=None, lat_bnds=None)[source]
Detect the type of the grid as one of “regular_lat_lon”, “curvilinear”, “unstructured”.
Assumes the grid description / structure follows the CF conventions.
- Parameters:
ds (xarray.Dataset) – Dataset containing the grid / coordinate variables.
lon (str) – Longitude variable name.
lat (str) – Latitude variable name.
lon_bnds (str, optional) – Longitude bounds variable name. If not provided, the bounds will not be considered.
lat_bnds (str, optional) – Latitude bounds variable name. If not provided, the bounds will not be considered.
- Return type:
str
- Returns:
str – The type of the grid, one of “regular_lat_lon”, “curvilinear”, “unstructured”.
- clisops.utils.dataset_utils.detect_shape(ds, lat, lon, grid_type)[source]
Detect the shape of the grid.
Returns a tuple of (nlat, nlon, ncells). For an unstructured grid nlat and nlon are not defined, and therefore the returned tuple will be (ncells, ncells, ncells).
- Parameters:
ds (xr.Dataset) – Dataset containing the grid / coordinate variables.
lat (str) – Latitude variable name.
lon (str) – Longitude variable name.
grid_type ({“regular_lat_lon”, “curvilinear”, “unstructured”}) – The grid type to detect the shape for.
- Return type:
tuple[int,int,int]
- Returns:
int – Number of latitude points in the grid.
int – Number of longitude points in the grid.
int – Number of cells in the grid.
- clisops.utils.dataset_utils.determine_lon_lat_range(ds, lon, lat, lon_bnds=None, lat_bnds=None, apply_fix=True)[source]
Determine the min/max lon/lat values of the dataset (and potentially apply fix for unmasked missing values).
- Parameters:
ds (xarray.Dataset) – Input dataset object.
lon (str) – Name of longitude coordinate.
lat (str) – Name of latitude coordinate.
lon_bnds (str or None, optional) – Name of longitude bounds coordinate. The default is None.
lat_bnds (str or None, optional) – Name of latitude bounds coordinate. The default is None.
apply_fix (bool, optional) – Whether to apply fix for unmasked missing values. The default is True.
- Returns:
xmin (float) – Minimum longitude value.
xmax (float) – Maximum longitude value.
ymin (float) – Minimum latitude value.
ymax (float) – Maximum latitude value.
- clisops.utils.dataset_utils.fix_unmasked_missing_values_lon_lat(ds, lon, lat, lon_bnds, lat_bnds, xminmax, yminmax)[source]
Fix for unmasked missing values in longitude and latitude coordinates and their bounds.
- Parameters:
ds (xarray.Dataset) – Input dataset object.
lon (str) – Name of longitude coordinate.
lat (str) – Name of latitude coordinate.
lon_bnds (str or None) – Name of longitude bounds coordinate.
lat_bnds (str or None) – Name of latitude bounds coordinate.
xminmax (list) – List of minimum and maximum longitude values.
yminmax (list) – List of minimum and maximum latitude values.
- Returns:
bool – Whether the fix on ds[lon] and ds[lat] (and if specified ds[lon_bnds] and ds[lat_bnds]) was applied or not.
- clisops.utils.dataset_utils.generate_bounds_curvilinear(ds, lat, lon, clip_latitude=True, roll=True)[source]
Compute bounds for curvilinear grids.
Assumes 2D latitude and longitude coordinate variables. The bounds will be attached as coords to the xarray.Dataset. Assume the longitudes are defined on the longitude frame [-180, 180]. The default setting for ‘roll’ ensures that the longitudes are converted if this is not the case.
The bound calculation for curvilinear grids was adapted from https://github.com/SantanderMetGroup/ATLAS/blob/mai-devel/scripts/ATLAS-data/bash-interpolation-scripts/AtlasCDOremappeR_CORDEX/grid_bounds_calc.py which is based on work by Caillaud Cécile and Samuel Somot from Meteo-France. Compared with the original code, there is an additional correction performed in the calculation, ensuring that at the Greenwich meridian and antimeridian the sign of the bounds does not differ. The new implementation is also significantly faster, as it replaces for loops with numpy.vectorize and index slicing.
- Parameters:
ds (xarray.Dataset) – Dataset to compute the bounds for.
lat (str) – Latitude variable name.
lon (str) – Longitude variable name.
clip_latitude (bool, optional) – Whether to clip latitude values to [-90, 90]. The default is True.
roll (bool, optional) – Whether to roll longitude values to [-180, 180]. The default is True.
- Returns:
xarray.Dataset – Dataset with the bounds variables attached.
- clisops.utils.dataset_utils.generate_bounds_rectilinear(ds, lat, lon)[source]
Compute bounds for rectilinear grids.
The bounds will be attached as coords to the xarray.Dataset of the Grid object. If no bounds can be created, a warning is issued. It is assumed but not ensured that no duplicated cells are present in the grid.
- Parameters:
ds (xarray.Dataset) – The dataset to modify.
lat (str) – Latitude variable name.
lon (str) – Longitude variable name.
- Return type:
Dataset
- Returns:
xarray.Dataset – Dataset with attached bounds.
- clisops.utils.dataset_utils.get_coord_by_attr(ds, attr, value)[source]
Return a coordinate based on a known attribute of a coordinate.
- Parameters:
ds (xarray.Dataset or xarray.DataArray) – The xarray dataset or data array to search for the coordinate.
attr (str) – The name of the attribute to look for in the coordinates.
value (any) – The expected value of the attribute you are looking for.
- Returns:
xarray.DataArray, optional – The coordinate of the xarray dataset if found, otherwise None.
- clisops.utils.dataset_utils.get_coord_by_type(ds, coord_type, ignore_aux_coords=True, return_further_matches=False, warn_if_no_main_variable=True)[source]
Return the name of the coordinate that matches the given type.
- Parameters:
ds (xarray.Dataset or xarray.DataArray) – Dataset/DataArray to search for coordinate.
coord_type (str) – Type of coordinate, e.g. ‘time’, ‘level’, ‘latitude’, ‘longitude’, ‘realization’.
ignore_aux_coords (bool) – Whether to ignore auxiliary coordinates. Default is True.
return_further_matches (bool) – Whether to return further matches. Default is False.
warn_if_no_main_variable (bool) – Whether to warn if no main variable can be identified. Default is True.
- Returns:
str – Name of the coordinate that matches the given type.
str or list of str – If return_further_matches is True, apart from the matching coordinate, a list with further potential matches is returned.
- Raises:
ValueError – If the coordinate type is not known.
- clisops.utils.dataset_utils.get_coord_type(coord)[source]
Get the coordinate type.
- Parameters:
coord (xarray.DataArray or xarray.Dataset) – Coordinate of xarray dataset, e.g. coord = ds.coords[coord_id].
- Return type:
str|None
- Returns:
str, optional – The type of coordinate as a string. Either ‘longitude’, ‘latitude’, ‘time’, ‘level’, ‘realization’ or None.
- clisops.utils.dataset_utils.get_main_variable(ds, exclude_common_coords=True)[source]
Find the main variable of an xarray Dataset.
- Parameters:
ds (xarray.Dataset) – The xarray Dataset to search for the main variable.
exclude_common_coords (bool) – If True, common coordinates (time, level, latitude, longitude, bounds) are excluded from the search for the main variable. Default is True.
- Returns:
str – The name of the main variable in the dataset, e.g. ‘tas’.
- clisops.utils.dataset_utils.is_kerchunk_file(dset)[source]
Return a boolean based on reading the file extension.
- Parameters:
dset (str or Path) – The dataset identifier, which is expected to be a file path or name.
- Return type:
bool
- Returns:
bool – True if the file is a Kerchunk file (i.e., has a .json, .zst, .zstd, or .parquet extension), otherwise False.
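An extension-based check like the one described can be sketched with pathlib. The name looks_like_kerchunk is hypothetical; the suffix set is taken from the description above, while the exact, case-sensitive match is an assumption.

```python
from pathlib import Path

# Extensions listed in the description above.
KERCHUNK_SUFFIXES = {".json", ".zst", ".zstd", ".parquet"}


def looks_like_kerchunk(dset):
    """Hypothetical check: decide from the file extension alone."""
    return Path(str(dset)).suffix in KERCHUNK_SUFFIXES
```

Because only the extension is inspected, the file does not need to exist and its contents are never read.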
- clisops.utils.dataset_utils.is_latitude(coord)[source]
Determine if a coordinate is latitude.
- Parameters:
coord (xarray.DataArray or xarray.Dataset) – Coordinate of xarray dataset, e.g. coord = ds.coords[coord_id].
- Return type:
bool
- Returns:
bool – True if the coordinate is latitude, otherwise False.
- clisops.utils.dataset_utils.is_level(coord)[source]
Determine if a coordinate is level.
- Parameters:
coord (xarray.DataArray or xarray.Dataset) – Coordinate of xarray dataset, e.g. coord = ds.coords[coord_id].
- Return type:
bool
- Returns:
bool – True if the coordinate is level, otherwise False.
- clisops.utils.dataset_utils.is_longitude(coord)[source]
Determine if a coordinate is longitude.
- Parameters:
coord (xarray.DataArray or xarray.Dataset) – Coordinate of xarray dataset, e.g. coord = ds.coords[coord_id].
- Return type:
bool
- Returns:
bool – True if the coordinate is longitude, otherwise False.
- clisops.utils.dataset_utils.is_realization(coord)[source]
Determine if a coordinate is realization.
- Parameters:
coord (xarray.DataArray or xarray.Dataset) – Coordinate of xarray dataset, e.g. coord = ds.coords[coord_id].
- Return type:
bool
- Returns:
bool – True if the coordinate is realization, otherwise False.
- clisops.utils.dataset_utils.is_time(coord)[source]
Determine if a coordinate is time.
- Parameters:
coord (xarray.DataArray or xarray.Dataset) – Coordinate of xarray dataset, e.g. coord = ds.coords[coord_id].
- Return type:
bool
- Returns:
bool – True if the coordinate is time, otherwise False.
- clisops.utils.dataset_utils.open_xr_dataset(dset, **kwargs)[source]
Open an xarray dataset from a dataset input.
- Parameters:
dset (str or Path or list of str or list of Path) – A dataset identifier, directory path, or file path ending in *.nc.
**kwargs (dict) – Any additional keyword arguments for opening the dataset. decode_times=xr.coders.CFDatetimeCoder(use_cftime=True) and decode_timedelta=False are used by default, along with combine="by_coords" for open_mfdataset only.
- Returns:
xarray.Dataset – An xarray Dataset object opened from the provided dataset input.
Notes
Any list will be interpreted as a list of files.
- clisops.utils.dataset_utils.reformat_SCRIP_to_CF(ds, keep_attrs=False)[source]
Reformat dataset from SCRIP to CF format.
- Parameters:
ds (xarray.Dataset) – Input dataset in SCRIP format.
keep_attrs (bool) – Whether to keep the global attributes.
- Return type:
Dataset
- Returns:
xarray.Dataset – Reformatted dataset.
- clisops.utils.dataset_utils.reformat_xESMF_to_CF(ds, keep_attrs=False)[source]
Reformat dataset from xESMF to CF format.
- Parameters:
ds (xarray.Dataset) – Input dataset in xESMF format.
keep_attrs (bool) – Whether to keep the global attributes.
- Return type:
Dataset
- Returns:
xarray.Dataset – The reformatted dataset.
clisops.utils.file_namers module
File namers for CLISOPS.
- class clisops.utils.file_namers.SimpleFileNamer(replace=None, extra='')[source]
Bases:
_BaseFileNamer
Simple file namer class.
Generates numbered file names.
- class clisops.utils.file_namers.StandardFileNamer(replace=None, extra='')[source]
Bases:
SimpleFileNamer
Standard file namer class.
Generates file names based on input dataset.
- static _get_project(ds)[source]
Gets the project name from the input dataset.
- _get_template(ds)[source]
Gets template to use for output file name, based on the project of the dataset.
- static _get_time_range(da)[source]
Finds the time range of the data in the output.
- Return type:
str
- _resolve_derived_attrs(ds, attrs, template, fmt=None)[source]
Finds var_id, time_range and format_extension of dataset and output to generate output file name.
- Return type:
None
- get_file_name(ds, fmt='nc')[source]
Construct file name.
- Parameters:
ds (xr.DataArray | xr.Dataset) – The dataset for which to generate the file name.
fmt (str) – The format of the output file, by default “nc”.
- Return type:
str
- Returns:
str – The generated file name based on the dataset attributes and project configuration.
- class clisops.utils.file_namers._BaseFileNamer(replace=None, extra='')[source]
Bases:
object
File namer base class.
- get_file_name(ds, fmt='nc')[source]
Generate numbered file names.
- clisops.utils.file_namers.get_file_namer(name)[source]
Return the file namer that corresponds to the provided name.
- Parameters:
name (str) – The name of the file namer to return. Options are “standard” or “simple”.
- Return type:
object
- Returns:
_BaseFileNamer – The file namer class corresponding to the provided name.
clisops.utils.file_utils module
File utilities for CLISOPS.
- class clisops.utils.file_utils.FileMapper(file_list, dirpath=None)[source]
Bases:
object
Class to represent a set of files that exist in the same directory as one object.
- Parameters:
file_list (list of str) – The list of files to represent. If dirpath is not provided, these should be full file paths.
dirpath (str or Path, optional) – The directory path where the files exist. Default is None. If dirpath is not provided, it will be deduced from the file paths provided in file_list.
- Variables:
file_list (list of str) – List of file names of the files represented.
file_paths (list of str or list of Path) – List of full file paths of the files represented.
dirpath (str or Path) – The directory path where the files exist. Default is None. If dirpath is not provided, it will be deduced from the file paths provided in file_list.
- _resolve()[source]
- clisops.utils.file_utils.is_file_list(coll)[source]
Check whether a collection is a list of files.
- Parameters:
coll (list) – Collection to check.
- Return type:
bool
- Returns:
bool – True if the collection is a list of files, else returns False.
clisops.utils.output_utils module
Utility functions for handling output formats and file writing in CLISOPS.
- class clisops.utils.output_utils.FileLock(fpath)[source]
Bases:
object
Create and release a lockfile.
Adapted from https://github.com/cedadev/cmip6-object-store/cmip6_zarr/file_lock.py
- Parameters:
fpath (str) – The file path for the lock file to be created.
- acquire(timeout=10)[source]
Create the actual lockfile; raise an error if an existing lockfile is still present after ‘timeout’ seconds.
- Parameters:
timeout (int) – Maximum time in seconds to wait for the lockfile to be created. Default is 10 seconds.
- Raises:
Exception – If the lockfile cannot be created within the specified timeout.
- release()[source]
Release lock, i.e. delete lockfile.
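The acquire/release cycle can be illustrated with an exclusive file creation and a polling loop. LockFileSketch is a hypothetical class written for this example, not the CLISOPS FileLock; the poll_interval parameter and the use of TimeoutError are assumptions.

```python
import os
import time


class LockFileSketch:
    """Hypothetical lockfile helper; not the CLISOPS FileLock class."""

    def __init__(self, fpath):
        self.fpath = fpath
        self.locked = False

    def acquire(self, timeout=10, poll_interval=0.1):
        """Create the lockfile, waiting up to `timeout` seconds."""
        deadline = time.monotonic() + timeout
        while True:
            try:
                # O_EXCL makes creation fail if the file already exists
                fd = os.open(self.fpath, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
            except FileExistsError:
                if time.monotonic() >= deadline:
                    raise TimeoutError(
                        f"Could not acquire lock {self.fpath!r}"
                    ) from None
                time.sleep(poll_interval)
            else:
                os.close(fd)
                self.locked = True
                return

    def release(self):
        """Delete the lockfile if this object created it."""
        if self.locked:
            os.remove(self.fpath)
            self.locked = False
```

Using O_CREAT together with O_EXCL makes the existence check and the creation a single atomic step, which avoids a race between two processes that both see no lockfile.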
- clisops.utils.output_utils._fix_str_encoding(s, encoding='utf-8')[source]
Helper function to fix string encoding of surrogates.
- Parameters:
s (str or bytes) – The string to be fixed. If the input is not of type str or bytes, it is returned as is.
encoding (str, optional) – The encoding to be used. Default is “utf-8”.
- Returns:
str – The fixed string.
- clisops.utils.output_utils._format_time(tm, fmt='%Y-%m-%d')[source]
Convert to datetime if time is a numpy datetime.
- clisops.utils.output_utils._get_chunked_dataset(ds)[source]
Chunk xr.Dataset and return chunked dataset.
- clisops.utils.output_utils.check_format(fmt)[source]
Check that the requested format exists.
- Parameters:
fmt (str) – The format to check against the supported formats.
- Raises:
KeyError – If the format is not recognized.
- Return type:
None
- clisops.utils.output_utils.create_lock(fname)[source]
Check whether the lockfile already exists and, if not, create it.
- Parameters:
fname (str) – Path of the lockfile to be created.
- Return type:
FileLock|None
- Returns:
FileLock or None – Returns a FileLock object if the lockfile is created successfully, or None if the lockfile already exists.
- clisops.utils.output_utils.filter_times_within(times, start=None, end=None)[source]
Return a reduced array if start or end times are defined and are within the main array.
- Parameters:
times (array-like) – An array of datetime objects or strings representing times.
start (str, optional) – A string representing the start date in “YYYY-MM-DD” format. If None, no start filter is applied.
end (str, optional) – A string representing the end date in “YYYY-MM-DD” format. If None, no end filter is applied.
- Returns:
list – A list of datetime objects that fall within the specified start and end times.
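A start/end filter of this kind can be sketched with standard-library dates. The function name is hypothetical, the sketch accepts only datetime.date objects (not strings), and treating both bounds as inclusive is an assumption.

```python
import datetime


def filter_times_sketch(times, start=None, end=None):
    """Hypothetical: keep the dates lying between start and end, inclusive."""
    lower = datetime.date.fromisoformat(start) if start else None
    upper = datetime.date.fromisoformat(end) if end else None
    return [
        t
        for t in times
        if (lower is None or t >= lower) and (upper is None or t <= upper)
    ]
```

Leaving start or end as None disables that side of the filter, mirroring the optional parameters described above.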
- clisops.utils.output_utils.fix_netcdf_attrs_encoding(ds, encoding='utf-8')[source]
Fix strings that contain invalid chars in Dataset attrs to be safe for NetCDF writing.
- Parameters:
ds (xarray.Dataset) – The dataset with attrs to be fixed.
encoding (str, optional) – The encoding to be used. Default is “utf-8”.
- Returns:
xarray.Dataset – The dataset with fixed attrs.
- clisops.utils.output_utils.get_chunk_length(da)[source]
Calculate the chunk length to use when chunking xarray datasets.
Based on the memory limit provided in config and the size of the dataset.
- Parameters:
da (xr.DataArray) – The data array to be chunked.
- Return type:
int
- Returns:
int – The length of the chunk to be used for the time dimension.
- clisops.utils.output_utils.get_da(ds)[source]
Return xr.DataArray when the format of ds may be either xr.Dataset or xr.DataArray.
If ds is an xr.Dataset, it will extract the main variable DataArray.
- Parameters:
ds (xr.Dataset or xr.DataArray) – The dataset or data array to extract the main variable from.
- Return type:
DataArray
- Returns:
xr.DataArray – The main variable DataArray from the dataset.
- clisops.utils.output_utils.get_format_engine(fmt)[source]
Find the engine for the requested output format.
- Parameters:
fmt (str) – The format for which to find the engine.
- Return type:
str
- Returns:
str – The engine to use for writing the output format.
- clisops.utils.output_utils.get_format_extension(fmt)[source]
Find the extension for the requested output format.
- Parameters:
fmt (str) – The format for which to find the file extension.
- Return type:
str
- Returns:
str – The file extension associated with the requested format.
- clisops.utils.output_utils.get_format_writer(fmt)[source]
Find the output method for the requested output format.
- Parameters:
fmt (str) – The format for which to find the output method.
- Return type:
str|None
- Returns:
str or None – The method to use for writing the output format, or None if no method is defined.
- clisops.utils.output_utils.get_output(ds, output_type, output_dir, namer)[source]
Return output after applying chunking and determining the output format.
- Parameters:
ds (xarray.Dataset) – The dataset to be processed.
output_type (str) – The type of output format to be used (e.g., “netcdf”, “zarr”).
output_dir (str or Path) – The directory where the output file will be saved. If None, the current directory is used.
namer (object) – An object responsible for generating the file name based on the dataset attributes and output type.
- Returns:
str or xarray.Dataset – The path to the output file if written, or the original dataset if no writing is performed.
- clisops.utils.output_utils.get_time_slices(ds, split_method, start=None, end=None, file_size_limit=None)[source]
Get time slices for a dataset or data array.
Take an xarray Dataset or DataArray, assume it can be split on the time axis into a sequence of slices. Optionally, take a start and end date to specify a sub-slice of the main time axis.
Use the prescribed file size limit to generate a list of (“YYYY-MM-DD”, “YYYY-MM-DD”) slices so that the output files do not (significantly) exceed the file size limit.
- Parameters:
ds (xr.Dataset or xr.DataArray) – A dataset or data array that contains a time dimension.
split_method (str) – The method to use for splitting the dataset.
start (str, optional) – A string specifying the start date in “YYYY-MM-DD” format.
end (str, optional) – A string specifying the end date in “YYYY-MM-DD” format.
file_size_limit (str) – A string specifying “<number><units>”.
- Return type:
list[tuple[str,str]]
- Returns:
list of tuples – A list of tuples, each containing two strings representing the start and end dates of each slice.
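Size-based splitting can be illustrated by grouping consecutive time steps so that each group stays within a byte budget. The sketch below assumes every time step has the same size in bytes; the function and parameter names are hypothetical and it does not reproduce the library's split methods.

```python
def time_slices_sketch(dates, bytes_per_step, size_limit_bytes):
    """Hypothetical: group time steps so each group fits the size limit."""
    # At least one step per output file, even if a single step is too big
    steps_per_file = max(1, size_limit_bytes // bytes_per_step)
    slices = []
    for first in range(0, len(dates), steps_per_file):
        chunk = dates[first:first + steps_per_file]
        slices.append((chunk[0], chunk[-1]))
    return slices
```

Each returned tuple holds the first and last date of one slice, in the same ("YYYY-MM-DD", "YYYY-MM-DD") shape as described above.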
clisops.utils.testing module
Test utilities for clisops.
- class clisops.utils.testing.ContextLogger(caplog=False)[source]
Bases:
object
Helper class for safe logging management in pytest tests.
This class manages the loguru logger context, enabling and disabling logging for a specific package during the test execution. It also handles the case where pytest’s caplog fixture is used, allowing for log capturing without interfering with the logger’s configuration.
- Parameters:
caplog (CaplogFixture, optional) – The pytest caplog fixture, if provided, to capture logs during tests.
- clisops.utils.testing.get_esgf_file_paths(esgf_cache_dir)[source]
Get a dictionary of example ESGF file paths for testing purposes.
- Parameters:
esgf_cache_dir (str or os.PathLike) – The base directory where ESGF test data is cached.
- Return type:
dict[str,str]
- Returns:
dict[str, str] – A dictionary where keys are descriptive names of datasets and values are their corresponding file paths.
- clisops.utils.testing.get_esgf_glob_paths(esgf_cache_dir)[source]
Return a dictionary of glob paths for ESGF test data.
- Parameters:
esgf_cache_dir (str or os.PathLike) – The base directory where ESGF test data is cached.
- Return type:
dict[str,str]
- Returns:
dict – A dictionary where keys are dataset identifiers and values are glob paths to the datasets.
- clisops.utils.testing.load_registry(branch, repo)[source]
Load the registry file for the test data.
- Parameters:
branch (str) – The branch of the repository to use for the registry.
repo (str) – The URL of the repository to use for the registry.
- Return type:
dict[str,str]
- Returns:
dict – Dictionary of filenames and hashes.
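As a rough illustration of the returned mapping, the sketch below parses registry text into a dictionary of filenames and hashes. It assumes a plain-text registry with one "filename hash" pair per line, which is a common Pooch layout but is not confirmed here for CLISOPS; `sketch_parse_registry` and the sample entries are hypothetical.

```python
def sketch_parse_registry(text):
    """Parse 'filename hash' lines into a dict, skipping blanks and comments."""
    registry = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        # Only the first two whitespace-separated fields are used.
        name, file_hash = line.split()[:2]
        registry[name] = file_hash
    return registry


# Hypothetical registry content, for illustration only.
sample = """
# filename hash
cmip5/tas.nc sha256:aaa111
cmip6/pr.nc sha256:bbb222
"""
registry = sketch_parse_registry(sample)
```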
- clisops.utils.testing.stratus(repo, branch, cache_dir, data_updates=True)[source]
Pooch registry instance for the testing data.
- Parameters:
repo (str) – URL of the repository to use when fetching testing datasets.
branch (str) – Branch of repository to use when fetching testing datasets.
cache_dir (str or Path) – The path to the directory where the data files are stored.
data_updates (bool) – If True, allow updates to the data files. Default is True.
- Returns:
pooch.Pooch – The Pooch instance for accessing the testing data.
Examples
Using the registry to download a file:
import xarray as xr
from clisops.utils.testing import stratus

s = stratus(cache_dir=..., repo=..., branch=...)
example_file = s.fetch("example.nc")
data = xr.open_dataset(example_file)
- clisops.utils.testing.write_roocs_cfg(template=None, cache_dir=PosixPath('/home/docs/.cache/mini-esgf-data'))[source]
Write a ROOCS configuration file for testing purposes.
- Parameters:
template (str, optional) – A custom template for the ROOCS configuration file. If not provided, a default template is used.
cache_dir (str or Path, optional) – The directory where the configuration file will be written. Defaults to the ESGF test data cache directory.
- Return type:
str
- Returns:
str – The path to the written ROOCS configuration file.
clisops.utils.time_utils module
Utility functions for handling time and date objects in CLISOPS.
- class clisops.utils.time_utils.AnyCalendarDateTime(year, month, day, hour, minute, second)[source]
Bases: object
A class to represent a datetime that could be of any calendar.
Can add and subtract a day from the input based on MAX_DAY, MIN_DAY, MAX_MONTH and MIN_MONTH.
- Parameters:
year (int) – The year of the datetime.
month (int) – The month of the datetime (1-12).
day (int) – The day of the month (1-31).
hour (int) – The hour of the day (0-23).
minute (int) – The minute of the hour (0-59).
second (int) – The second of the minute (0-59).
- DAY_RANGE = range(1, 32)
- HOUR_RANGE = range(0, 24)
- MINUTE_RANGE = range(0, 60)
- MONTH_RANGE = range(1, 13)
- SECOND_RANGE = range(0, 60)
- add_day()[source]
Add a day to the input datetime.
- sub_day()[source]
Subtract a day from the input datetime.
- validate_input(input, name, range)[source]
Validate input against a given range.
- Parameters:
input (int) – The input value to validate.
name (str) – The name of the input for error messages.
range (range) – The valid range for the input value.
- Raises:
ValueError – If the input value is not within the specified range.
- property value
Show calendar value.
- Returns:
str – A string representation of the datetime in ISO 8601 format.
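The range validation and day arithmetic described for this class can be sketched as follows. `SketchDateTime` is a hypothetical stand-in, not the CLISOPS class: it tracks only the date part and assumes every month spans the full day range of 1 to 31, which is one plausible reading of the `DAY_RANGE` attribute.

```python
class SketchDateTime:
    """Hypothetical stand-in: every month is treated as having 31 days."""

    MONTH_RANGE = range(1, 13)
    DAY_RANGE = range(1, 32)

    def __init__(self, year, month, day):
        self.validate_input(month, "month", self.MONTH_RANGE)
        self.validate_input(day, "day", self.DAY_RANGE)
        self.year, self.month, self.day = year, month, day

    @staticmethod
    def validate_input(value, name, valid_range):
        if value not in valid_range:
            raise ValueError(
                f"Invalid {name}: {value}. Expected a value between "
                f"{valid_range[0]} and {valid_range[-1]}."
            )

    def add_day(self):
        # Roll over into the next month, then the next year, when needed.
        self.day += 1
        if self.day > self.DAY_RANGE[-1]:
            self.day = self.DAY_RANGE[0]
            self.month += 1
        if self.month > self.MONTH_RANGE[-1]:
            self.month = self.MONTH_RANGE[0]
            self.year += 1

    def sub_day(self):
        # Roll back into the previous month, then the previous year.
        self.day -= 1
        if self.day < self.DAY_RANGE[0]:
            self.day = self.DAY_RANGE[-1]
            self.month -= 1
        if self.month < self.MONTH_RANGE[0]:
            self.month = self.MONTH_RANGE[-1]
            self.year -= 1

    @property
    def value(self):
        return f"{self.year:04d}-{self.month:02d}-{self.day:02d}"


dt = SketchDateTime(1999, 12, 31)
dt.add_day()
rolled_forward = dt.value
dt.sub_day()
rolled_back = dt.value
```

Because the sketch does not consult a real calendar, adding a day to the 30th of a 30-day month gives the 31st rather than the 1st of the next month; whether the actual class behaves the same way is not stated here.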
- clisops.utils.time_utils.create_time_bounds(ds, freq)[source]
Generate time bounds for datasets that have been temporally averaged.
Averaging frequencies supported are yearly, monthly and daily.
- Parameters:
ds (xarray.Dataset) – The dataset containing the time variable.
freq (str) – The frequency of the time bounds to create. Options are “year”, “month”, or “day”.
- Return type:
list[list[AnyCalendarDateTime]]
- Returns:
list of list of AnyCalendarDateTime – A list of lists, each containing the start and end datetime objects for one time point in the dataset.
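To illustrate the idea of per-period bounds, the sketch below builds [start, end] pairs for a list of standard-library `datetime` values. It is not the CLISOPS implementation: `sketch_time_bounds` is a hypothetical name, it assumes the Gregorian calendar, and it assumes each period ends at 23:59:59 on its last day, a convention that the description above does not specify.

```python
import calendar
from datetime import datetime


def sketch_time_bounds(times, freq):
    """Return [start, end] pairs covering the year, month, or day of each time."""
    bounds = []
    for t in times:
        if freq == "year":
            start = datetime(t.year, 1, 1)
            end = datetime(t.year, 12, 31, 23, 59, 59)
        elif freq == "month":
            # monthrange returns (weekday of first day, number of days).
            last_day = calendar.monthrange(t.year, t.month)[1]
            start = datetime(t.year, t.month, 1)
            end = datetime(t.year, t.month, last_day, 23, 59, 59)
        elif freq == "day":
            start = datetime(t.year, t.month, t.day)
            end = datetime(t.year, t.month, t.day, 23, 59, 59)
        else:
            raise ValueError(f"Unsupported frequency: {freq}")
        bounds.append([start, end])
    return bounds


monthly = sketch_time_bounds([datetime(2000, 2, 15)], "month")
yearly = sketch_time_bounds([datetime(2000, 2, 15)], "year")
```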
- clisops.utils.time_utils.str_to_AnyCalendarDateTime(dt, defaults=None)[source]
Given a string representing a date/time, return an AnyCalendarDateTime object.
String formats should start with the year and run through to the second, but anything from the month onwards can be omitted.
- Parameters:
dt (str) – A string representing a date/time in the format “YYYY-MM-DDTHH:MM:SS” or similar.
defaults (list, optional) – A list of default values for year, month, day, hour, minute, and second if they cannot be parsed from the string.
- Returns:
AnyCalendarDateTime – An instance of AnyCalendarDateTime initialized with the parsed or default values.
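The partial-string parsing described above can be sketched as follows. This is an assumption-laden illustration rather than the library's parser: `sketch_parse_datetime` is a hypothetical name, the fallback values in `defaults` are chosen for the example, and the result is a plain tuple instead of an AnyCalendarDateTime instance.

```python
import re


def sketch_parse_datetime(dt, defaults=None):
    """Split a date/time string into six integers, filling gaps from defaults."""
    # Assumed defaults: placeholder year, first month and day, midnight.
    if defaults is None:
        defaults = [-1, 1, 1, 0, 0, 0]
    # Any run of non-digits ("-", "T", ":", spaces) separates components.
    parts = [int(p) for p in re.split(r"\D+", dt.strip()) if p]
    if not parts or len(parts) > 6:
        raise ValueError(f"Cannot parse date/time string: {dt!r}")
    # Components missing from the string come from the defaults list.
    return tuple(parts + defaults[len(parts):])


full = sketch_parse_datetime("2000-03-04T05:06:07")
year_only = sketch_parse_datetime("1999")
year_month = sketch_parse_datetime("1999-07")
```

Splitting on non-digit runs keeps the sketch tolerant of different separators, at the cost of not handling negative years.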
- clisops.utils.time_utils.to_isoformat(tm)[source]
Return an ISO 8601 string from a time object (of different types).
- Parameters:
tm (datetime.datetime or datetime.date or numpy.datetime64 or similar) – A time object that can be converted to an ISO 8601 string.
- Returns:
str – An ISO 8601 formatted string representing the time.