Averaging over dimensions of the dataset

The average over dimensions operation makes use of clisops.core.average to process the datasets and to set the output type and the output file names.

You can average over any combination of the time, longitude, latitude and level dimensions in the dataset, or over none of them.

[1]:
from clisops.utils.testing import (
    XCLIM_TEST_DATA_CACHE_DIR,
    XCLIM_TEST_DATA_REPO_URL,
    XCLIM_TEST_DATA_VERSION,
    stratus,
)

Stratus = stratus(repo=XCLIM_TEST_DATA_REPO_URL, branch=XCLIM_TEST_DATA_VERSION, cache_dir=XCLIM_TEST_DATA_CACHE_DIR)

# fetch files locally or from GitHub
tas_files = [
    Stratus.fetch("cmip5/tas_Amon_HadGEM2-ES_rcp85_r1i1p1_200512-203011.nc"),
    Stratus.fetch("cmip5/tas_Amon_HadGEM2-ES_rcp85_r1i1p1_203012-205511.nc"),
    Stratus.fetch("cmip5/tas_Amon_HadGEM2-ES_rcp85_r1i1p1_205512-208011.nc"),
]

o3_file = Stratus.fetch("cmip6/o3_Amon_GFDL-ESM4_historical_r1i1p1f1_gr1_185001-194912.nc")

# remove previously created example file
import os
if os.path.exists("./output_001.nc"):
    os.remove("./output_001.nc")

Parameters

The parameters accepted by average_over_dims are listed below:

ds: Union[xr.Dataset, str]
dims : Optional[Union[Tuple[str], DimensionParameter]]
  The dimensions over which to apply the average. If None, none of the dimensions are averaged over. Dimensions
  must be one of ["time", "level", "latitude", "longitude"].
ignore_undetected_dims: bool
  If False, an exception is raised when a specified dimension is not found in the dataset.
  If True, no exception is raised and the remaining dimensions are averaged over. Default = False
output_dir: Optional[Union[str, Path]] = None
output_type: {"netcdf", "nc", "zarr", "xarray"}
split_method: {"time:auto"}
file_namer: {"standard", "simple"}

The function returns a list containing the outputs in the selected format.

[2]:
from clisops.ops.average import average_over_dims
from clisops.exceptions import InvalidParameterValue
import xarray as xr
[3]:
ds = xr.open_mfdataset(tas_files, use_cftime=True, combine="by_coords")

ds

Average over one dimension

[4]:
result = average_over_dims(ds, dims=["time"], ignore_undetected_dims=False, output_type="xarray")

result[0]


In the output dataset, the time dimension has been averaged over and removed.

Average over two dimensions

Averaging over two dimensions is just as simple as averaging over one. The dimensions to be averaged over should be passed in as a sequence.

[5]:
result = average_over_dims(ds, dims=["time", "latitude"], ignore_undetected_dims=False, output_type="xarray")

result[0]

In this case both the time and latitude dimensions have been removed.
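
To make the effect on the data's shape concrete, here is a minimal sketch that uses a plain NumPy array instead of clisops. The array, its (time, latitude, longitude) axis order and its values are made up for illustration, and the sketch assumes a simple unweighted mean, which is not necessarily what average_over_dims computes for real data.

```python
import numpy as np

# Hypothetical data with axes ordered (time, latitude, longitude).
data = np.arange(24, dtype=float).reshape(3, 2, 4)

# Averaging over one axis removes that axis from the result.
time_mean = data.mean(axis=0)
print(time_mean.shape)  # (2, 4)

# Averaging over two axes removes both of them.
time_lat_mean = data.mean(axis=(0, 1))
print(time_lat_mean.shape)  # (4,)
```

Only the longitude axis is left after the second call, which mirrors the result of passing dims=["time", "latitude"].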

Allowed dimensions

It is only possible to average over longitude, latitude, level and time. If any other dimension is requested, an error is raised.

[6]:
try:
    average_over_dims(
        ds,
        dims=["incorrect_dim"],
        ignore_undetected_dims=False,
        output_type="xarray",
    )
except InvalidParameterValue as exc:
    print(exc)

Dimensions not found

When a dimension is selected for averaging but does not exist in the dataset, there are two options.

  1. To raise an exception when the dimension doesn’t exist, set ignore_undetected_dims = False

[7]:
try:
    average_over_dims(
        ds,
        dims=["level", "time"],
        ignore_undetected_dims=False,
        output_type="xarray",
    )
except InvalidParameterValue as exc:
    print(exc)

  2. To ignore a missing dimension and average over the other requested dimensions anyway, set ignore_undetected_dims = True

[8]:
result = average_over_dims(
    ds,
    dims=["level", "time"],
    ignore_undetected_dims=True,
    output_type="xarray",
)
result[0]

In the case above, a level dimension did not exist, but this was ignored and time was averaged over anyway.
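
The rules for allowed and undetected dimensions can be modelled with a small pure-Python sketch. The function resolve_dims and the constant ALLOWED_DIMS are hypothetical names used only for illustration, ValueError stands in for the library's InvalidParameterValue, and dimensions are assumed to be matched by these generic names. This is a model of the behaviour described on this page, not the clisops implementation.

```python
# Allowed names, taken from the parameter description on this page.
ALLOWED_DIMS = ("time", "level", "latitude", "longitude")


def resolve_dims(requested, available, ignore_undetected_dims=False):
    """Return the dimensions that would be averaged over (illustrative only)."""
    if requested is None:
        return []  # nothing requested, so no averaging is applied
    for dim in requested:
        if dim not in ALLOWED_DIMS:
            raise ValueError(f"Unsupported dimension: {dim!r}")
    missing = [dim for dim in requested if dim not in available]
    if missing and not ignore_undetected_dims:
        raise ValueError(f"Dimensions not found in dataset: {missing}")
    # Keep only the requested dimensions that actually exist.
    return [dim for dim in requested if dim in available]


# A hypothetical dataset without a level dimension.
dataset_dims = ["time", "latitude", "longitude"]
print(resolve_dims(["level", "time"], dataset_dims, ignore_undetected_dims=True))
```

With ignore_undetected_dims=True the missing level dimension is skipped and only time is kept; with the default of False the same call raises an error.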

No dimensions supplied

If no dimensions are supplied, no averaging will be applied and the original dataset will be returned.

[9]:
result = average_over_dims(
    ds,
    dims=None,
    ignore_undetected_dims=False,
    output_type="xarray",
)

result[0]

An example of averaging over level

[10]:
print("Original dataset")
print(xr.open_dataset(o3_file, use_cftime=True))

result = average_over_dims(
    o3_file,
    dims=["level"],
    ignore_undetected_dims=False,
    output_type="xarray",
)


print("Averaged dataset")
result[0]

In the example above, the plev dimension has been averaged over and removed.