Subsetting

The subset operation makes use of clisops.core.subset to process the datasets and to set the output type and the output file names.

[1]:
# Initialize the testing data
import clisops.utils.testing as clite

Stratus = clite.stratus(repo=clite.XCLIM_TEST_DATA_REPO_URL, branch=clite.XCLIM_TEST_DATA_VERSION, cache_dir=clite.XCLIM_TEST_DATA_CACHE_DIR)

# fetch files locally or from GitHub
tas_files = [
    "cmip5/tas_Amon_HadGEM2-ES_rcp85_r1i1p1_200512-203011.nc",
    "cmip5/tas_Amon_HadGEM2-ES_rcp85_r1i1p1_203012-205511.nc",
    "cmip5/tas_Amon_HadGEM2-ES_rcp85_r1i1p1_205512-208011.nc",
    "cmip5/tas_Amon_HadGEM2-ES_rcp85_r1i1p1_208012-209912.nc",
    "cmip5/tas_Amon_HadGEM2-ES_rcp85_r1i1p1_209912-212411.nc",
    "cmip5/tas_Amon_HadGEM2-ES_rcp85_r1i1p1_212412-214911.nc",
    "cmip5/tas_Amon_HadGEM2-ES_rcp85_r1i1p1_214912-217411.nc",
    "cmip5/tas_Amon_HadGEM2-ES_rcp85_r1i1p1_217412-219911.nc",
    "cmip5/tas_Amon_HadGEM2-ES_rcp85_r1i1p1_219912-222411.nc",
    "cmip5/tas_Amon_HadGEM2-ES_rcp85_r1i1p1_222412-224911.nc",
    "cmip5/tas_Amon_HadGEM2-ES_rcp85_r1i1p1_224912-227411.nc",
    "cmip5/tas_Amon_HadGEM2-ES_rcp85_r1i1p1_227412-229911.nc",
    "cmip5/tas_Amon_HadGEM2-ES_rcp85_r1i1p1_229912-229912.nc"
]
for i, name in enumerate(tas_files):
    tas_files[i] = Stratus.fetch(name)

o3_file = Stratus.fetch("cmip6/o3_Amon_GFDL-ESM4_historical_r1i1p1f1_gr1_185001-194912.nc")

# remove previously created example file
import os
if os.path.exists("./output_001.nc"):
    os.remove("./output_001.nc")
Downloading file 'cmip5/tas_Amon_HadGEM2-ES_rcp85_r1i1p1_208012-209912.nc' from 'https://raw.githubusercontent.com/Ouranosinc/xclim-testdata/v2024.8.23/data/cmip5/tas_Amon_HadGEM2-ES_rcp85_r1i1p1_208012-209912.nc' to '/home/docs/.cache/xclim-testdata/v2024.8.23'.
Downloading file 'cmip5/tas_Amon_HadGEM2-ES_rcp85_r1i1p1_209912-212411.nc' from 'https://raw.githubusercontent.com/Ouranosinc/xclim-testdata/v2024.8.23/data/cmip5/tas_Amon_HadGEM2-ES_rcp85_r1i1p1_209912-212411.nc' to '/home/docs/.cache/xclim-testdata/v2024.8.23'.
Downloading file 'cmip5/tas_Amon_HadGEM2-ES_rcp85_r1i1p1_212412-214911.nc' from 'https://raw.githubusercontent.com/Ouranosinc/xclim-testdata/v2024.8.23/data/cmip5/tas_Amon_HadGEM2-ES_rcp85_r1i1p1_212412-214911.nc' to '/home/docs/.cache/xclim-testdata/v2024.8.23'.
Downloading file 'cmip5/tas_Amon_HadGEM2-ES_rcp85_r1i1p1_214912-217411.nc' from 'https://raw.githubusercontent.com/Ouranosinc/xclim-testdata/v2024.8.23/data/cmip5/tas_Amon_HadGEM2-ES_rcp85_r1i1p1_214912-217411.nc' to '/home/docs/.cache/xclim-testdata/v2024.8.23'.
Downloading file 'cmip5/tas_Amon_HadGEM2-ES_rcp85_r1i1p1_217412-219911.nc' from 'https://raw.githubusercontent.com/Ouranosinc/xclim-testdata/v2024.8.23/data/cmip5/tas_Amon_HadGEM2-ES_rcp85_r1i1p1_217412-219911.nc' to '/home/docs/.cache/xclim-testdata/v2024.8.23'.
Downloading file 'cmip5/tas_Amon_HadGEM2-ES_rcp85_r1i1p1_219912-222411.nc' from 'https://raw.githubusercontent.com/Ouranosinc/xclim-testdata/v2024.8.23/data/cmip5/tas_Amon_HadGEM2-ES_rcp85_r1i1p1_219912-222411.nc' to '/home/docs/.cache/xclim-testdata/v2024.8.23'.
Downloading file 'cmip5/tas_Amon_HadGEM2-ES_rcp85_r1i1p1_222412-224911.nc' from 'https://raw.githubusercontent.com/Ouranosinc/xclim-testdata/v2024.8.23/data/cmip5/tas_Amon_HadGEM2-ES_rcp85_r1i1p1_222412-224911.nc' to '/home/docs/.cache/xclim-testdata/v2024.8.23'.
Downloading file 'cmip5/tas_Amon_HadGEM2-ES_rcp85_r1i1p1_224912-227411.nc' from 'https://raw.githubusercontent.com/Ouranosinc/xclim-testdata/v2024.8.23/data/cmip5/tas_Amon_HadGEM2-ES_rcp85_r1i1p1_224912-227411.nc' to '/home/docs/.cache/xclim-testdata/v2024.8.23'.
Downloading file 'cmip5/tas_Amon_HadGEM2-ES_rcp85_r1i1p1_227412-229911.nc' from 'https://raw.githubusercontent.com/Ouranosinc/xclim-testdata/v2024.8.23/data/cmip5/tas_Amon_HadGEM2-ES_rcp85_r1i1p1_227412-229911.nc' to '/home/docs/.cache/xclim-testdata/v2024.8.23'.
Downloading file 'cmip5/tas_Amon_HadGEM2-ES_rcp85_r1i1p1_229912-229912.nc' from 'https://raw.githubusercontent.com/Ouranosinc/xclim-testdata/v2024.8.23/data/cmip5/tas_Amon_HadGEM2-ES_rcp85_r1i1p1_229912-229912.nc' to '/home/docs/.cache/xclim-testdata/v2024.8.23'.
[2]:
from clisops.ops.subset import subset
import xarray as xr

The subset process takes several parameters:

Subsetting Parameters

ds: Union[xr.Dataset, str, Path]
time: Optional[Union[str, TimeParameter]]
area: Optional[
    Union[
        str,
        Tuple[
            Union[int, float, str],
            Union[int, float, str],
            Union[int, float, str],
            Union[int, float, str],
        ],
        AreaParameter,
    ]
]
level: Optional[
    Union[
        str, LevelParameter
    ]
]
time_components: Optional[Union[str, Dict, TimeComponentsParameter]]
output_dir: Optional[Union[str, Path]]
output_type: {"netcdf", "nc", "zarr", "xarray"}
split_method: {"time:auto"}
file_namer: {"standard"}

The output is a list containing the outputs in the format selected.

[3]:
ds = xr.open_mfdataset(tas_files, decode_times=xr.coders.CFDatetimeCoder(use_cftime=True), combine="by_coords")

Output to xarray

There will only be one output for this example.

[4]:
outputs = subset(
        ds=ds,
        time="2007-01-01T00:00:00/2200-12-30T00:00:00",
        area=(0.0, 10.0, 175.0, 90.0),
        output_type="xarray",
    )

print(f"There is only {len(outputs)} output.")
outputs[0]
There is only 1 output.
[4]:
<xarray.Dataset> Size: 140kB
Dimensions:    (time: 2329, lat: 1, bnds: 2, lon: 1)
Coordinates:
    height     float64 8B 1.5
  * lat        (lat) float64 8B 35.0
  * lon        (lon) float64 8B 0.0
  * time       (time) object 19kB 2007-01-16 00:00:00 ... 2200-12-16 00:00:00
Dimensions without coordinates: bnds
Data variables:
    lat_bnds   (time, lat, bnds) float64 37kB dask.array<chunksize=(287, 1, 2), meta=np.ndarray>
    lon_bnds   (time, lon, bnds) float64 37kB dask.array<chunksize=(287, 1, 2), meta=np.ndarray>
    tas        (time, lat, lon) float32 9kB dask.array<chunksize=(287, 1, 1), meta=np.ndarray>
    time_bnds  (time, bnds) object 37kB dask.array<chunksize=(287, 2), meta=np.ndarray>
Attributes: (12/29)
    institution:            Met Office Hadley Centre, Fitzroy Road, Exeter, D...
    institute_id:           MOHC
    experiment_id:          rcp85
    source:                 HadGEM2-ES (2009) atmosphere: HadGAM2 (N96L38); o...
    model_id:               HadGEM2-ES
    forcing:                GHG, SA, Oz, LU, Sl, Vl, BC, OC, (GHG = CO2, N2O,...
    ...                     ...
    title:                  HadGEM2-ES model output prepared for CMIP5 RCP8.5
    parent_experiment:      historical
    modeling_realm:         atmos
    realization:            1
    cmor_version:           2.5.0
    NCO:                    4.7.3

Output to netCDF with simple namer

There is only one output as the file size is under the memory limit so does not need to be split. This example uses the simple namer which numbers output files.

[5]:
outputs = subset(
        ds=ds,
        time="2007-01-01T00:00:00/2200-12-30T00:00:00",
        area=(0.0, 10.0, 175.0, 90.0),
        output_type="nc",
        output_dir=".",
        split_method="time:auto",
        file_namer="simple"
    )
[6]:
# To open the file

subset_ds = xr.open_mfdataset("./output_001.nc", decode_times=xr.coders.CFDatetimeCoder(use_cftime=True), combine="by_coords")
subset_ds
[6]:
<xarray.Dataset> Size: 140kB
Dimensions:    (time: 2329, lat: 1, bnds: 2, lon: 1)
Coordinates:
    height     float64 8B ...
  * lat        (lat) float64 8B 35.0
  * lon        (lon) float64 8B 0.0
  * time       (time) object 19kB 2007-01-16 00:00:00 ... 2200-12-16 00:00:00
Dimensions without coordinates: bnds
Data variables:
    lat_bnds   (time, lat, bnds) float64 37kB dask.array<chunksize=(2329, 1, 2), meta=np.ndarray>
    lon_bnds   (time, lon, bnds) float64 37kB dask.array<chunksize=(2329, 1, 2), meta=np.ndarray>
    tas        (time, lat, lon) float32 9kB dask.array<chunksize=(2329, 1, 1), meta=np.ndarray>
    time_bnds  (time, bnds) object 37kB dask.array<chunksize=(2329, 2), meta=np.ndarray>
Attributes: (12/29)
    institution:            Met Office Hadley Centre, Fitzroy Road, Exeter, D...
    institute_id:           MOHC
    experiment_id:          rcp85
    source:                 HadGEM2-ES (2009) atmosphere: HadGAM2 (N96L38); o...
    model_id:               HadGEM2-ES
    forcing:                GHG, SA, Oz, LU, Sl, Vl, BC, OC, (GHG = CO2, N2O,...
    ...                     ...
    title:                  HadGEM2-ES model output prepared for CMIP5 RCP8.5
    parent_experiment:      historical
    modeling_realm:         atmos
    realization:            1
    cmor_version:           2.5.0
    NCO:                    4.7.3

Output to netCDF with standard namer

There is only one output as the file size is under the memory limit so does not need to be split. This example uses the standard namer which names output filesa ccording the the input file and how it has been subsetted.

[7]:
outputs = subset(
        ds=ds,
        time="2007-01-01T00:00:00/2200-12-30T00:00:00",
        area=(0.0, 10.0, 175.0, 90.0),
        output_type="nc",
        output_dir=".",
        split_method="time:auto",
        file_namer="standard"
    )

Subsetting by level

[8]:
ds = xr.open_dataset(o3_file, decode_times=xr.coders.CFDatetimeCoder(use_cftime=True))

No subsetting applied

[9]:
result = subset(ds=ds,
                output_type="xarray")

result[0].coords
[9]:
Coordinates:
  * lat      (lat) float64 16B -89.5 10.5
  * lon      (lon) float64 24B 0.625 125.6 250.6
  * plev     (plev) float64 152B 1e+05 9.25e+04 8.5e+04 ... 1e+03 500.0 100.0
  * time     (time) object 10kB 1850-01-16 12:00:00 ... 1949-12-16 12:00:00

Subsetting over level

[10]:
# subsetting over pressure level (plev)

result = subset(ds=ds,
                level="600/100",
                output_type="xarray")

print(result[0].coords)
print(f"\nplev has been subsetted and now only has {len(result[0].coords)} values.")
Coordinates:
  * lat      (lat) float64 16B -89.5 10.5
  * lon      (lon) float64 24B 0.625 125.6 250.6
  * plev     (plev) float64 16B 500.0 100.0
  * time     (time) object 10kB 1850-01-16 12:00:00 ... 1949-12-16 12:00:00

plev has been subsetted and now only has 4 values.

Use time components

[11]:
ds = xr.open_mfdataset(tas_files, decode_times=xr.coders.CFDatetimeCoder(use_cftime=True), combine="by_coords")
[12]:
outputs = subset(
        ds=ds,
        time_components="year: 2010, 2020, 2030|month: 12, 1, 2",
        output_type="xarray",
    )

print(f"There is only {len(outputs)} output.")
outputs[0]
There is only 1 output.
[12]:
<xarray.Dataset> Size: 976B
Dimensions:    (time: 9, lat: 2, bnds: 2, lon: 2)
Coordinates:
    height     float64 8B 1.5
  * lat        (lat) float64 16B -90.0 35.0
  * lon        (lon) float64 16B 0.0 187.5
  * time       (time) object 72B 2010-01-16 00:00:00 ... 2030-12-16 00:00:00
Dimensions without coordinates: bnds
Data variables:
    lat_bnds   (time, lat, bnds) float64 288B dask.array<chunksize=(9, 2, 2), meta=np.ndarray>
    lon_bnds   (time, lon, bnds) float64 288B dask.array<chunksize=(9, 2, 2), meta=np.ndarray>
    tas        (time, lat, lon) float32 144B dask.array<chunksize=(9, 2, 2), meta=np.ndarray>
    time_bnds  (time, bnds) object 144B dask.array<chunksize=(9, 2), meta=np.ndarray>
Attributes: (12/29)
    institution:            Met Office Hadley Centre, Fitzroy Road, Exeter, D...
    institute_id:           MOHC
    experiment_id:          rcp85
    source:                 HadGEM2-ES (2009) atmosphere: HadGAM2 (N96L38); o...
    model_id:               HadGEM2-ES
    forcing:                GHG, SA, Oz, LU, Sl, Vl, BC, OC, (GHG = CO2, N2O,...
    ...                     ...
    title:                  HadGEM2-ES model output prepared for CMIP5 RCP8.5
    parent_experiment:      historical
    modeling_realm:         atmos
    realization:            1
    cmor_version:           2.5.0
    NCO:                    4.7.3

Using parameter classes

[13]:
from clisops.parameter import (
    time_components,
    time_interval,
)
[14]:
ds = xr.open_mfdataset(tas_files, decode_times=xr.coders.CFDatetimeCoder(use_cftime=True), combine="by_coords")
[15]:
outputs = subset(
        ds=ds,
        time=time_interval("2007-01-01T00:00:00", "2200-12-30T00:00:00"),
        time_components=time_components(month=["dec", "jan", "feb"]),
        output_type="xarray",
    )

print(f"There is only {len(outputs)} output.")
outputs[0]
There is only 1 output.
[15]:
<xarray.Dataset> Size: 61kB
Dimensions:    (time: 583, lat: 2, bnds: 2, lon: 2)
Coordinates:
    height     float64 8B 1.5
  * lat        (lat) float64 16B -90.0 35.0
  * lon        (lon) float64 16B 0.0 187.5
  * time       (time) object 5kB 2007-01-16 00:00:00 ... 2200-12-16 00:00:00
Dimensions without coordinates: bnds
Data variables:
    lat_bnds   (time, lat, bnds) float64 19kB dask.array<chunksize=(258, 2, 2), meta=np.ndarray>
    lon_bnds   (time, lon, bnds) float64 19kB dask.array<chunksize=(258, 2, 2), meta=np.ndarray>
    tas        (time, lat, lon) float32 9kB dask.array<chunksize=(258, 2, 2), meta=np.ndarray>
    time_bnds  (time, bnds) object 9kB dask.array<chunksize=(258, 2), meta=np.ndarray>
Attributes: (12/29)
    institution:            Met Office Hadley Centre, Fitzroy Road, Exeter, D...
    institute_id:           MOHC
    experiment_id:          rcp85
    source:                 HadGEM2-ES (2009) atmosphere: HadGAM2 (N96L38); o...
    model_id:               HadGEM2-ES
    forcing:                GHG, SA, Oz, LU, Sl, Vl, BC, OC, (GHG = CO2, N2O,...
    ...                     ...
    title:                  HadGEM2-ES model output prepared for CMIP5 RCP8.5
    parent_experiment:      historical
    modeling_realm:         atmos
    realization:            1
    cmor_version:           2.5.0
    NCO:                    4.7.3