Subsetting

The subset operation uses clisops.core.subset to process the datasets, and lets you choose the output type and how the output files are named.

[1]:

from clisops.utils import get_file
# fetch files locally or from GitHub
tas_files = get_file([
    "cmip5/tas_Amon_HadGEM2-ES_rcp85_r1i1p1_200512-203011.nc",
    "cmip5/tas_Amon_HadGEM2-ES_rcp85_r1i1p1_203012-205511.nc",
    "cmip5/tas_Amon_HadGEM2-ES_rcp85_r1i1p1_205512-208011.nc",
    "cmip5/tas_Amon_HadGEM2-ES_rcp85_r1i1p1_208012-209912.nc",
    "cmip5/tas_Amon_HadGEM2-ES_rcp85_r1i1p1_209912-212411.nc",
    "cmip5/tas_Amon_HadGEM2-ES_rcp85_r1i1p1_212412-214911.nc",
    "cmip5/tas_Amon_HadGEM2-ES_rcp85_r1i1p1_214912-217411.nc",
    "cmip5/tas_Amon_HadGEM2-ES_rcp85_r1i1p1_217412-219911.nc",
    "cmip5/tas_Amon_HadGEM2-ES_rcp85_r1i1p1_219912-222411.nc",
    "cmip5/tas_Amon_HadGEM2-ES_rcp85_r1i1p1_222412-224911.nc",
    "cmip5/tas_Amon_HadGEM2-ES_rcp85_r1i1p1_224912-227411.nc",
    "cmip5/tas_Amon_HadGEM2-ES_rcp85_r1i1p1_227412-229911.nc",
    "cmip5/tas_Amon_HadGEM2-ES_rcp85_r1i1p1_229912-229912.nc"
])

o3_file = get_file("cmip6/o3_Amon_GFDL-ESM4_historical_r1i1p1f1_gr1_185001-194912.nc")

# remove previously created example file
import os
if os.path.exists("./output_001.nc"):
    os.remove("./output_001.nc")

[2]:

from clisops.ops.subset import subset
import xarray as xr

The subset process takes several parameters:

Subsetting Parameters

ds: Union[xr.Dataset, str, Path]
time: Optional[Union[str, TimeParameter]]
area: Optional[
    Union[
        str,
        Tuple[
            Union[int, float, str],
            Union[int, float, str],
            Union[int, float, str],
            Union[int, float, str],
        ],
        AreaParameter,
    ]
]
level: Optional[
    Union[
        str, LevelParameter
    ]
]
time_components: Optional[Union[str, Dict, TimeComponentsParameter]]
output_dir: Optional[Union[str, Path]]
output_type: {"netcdf", "nc", "zarr", "xarray"}
split_method: {"time:auto"}
file_namer: {"standard", "simple"}

The output is a list containing the outputs in the format selected.
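The string forms of `time` and `level` used in the examples on this page are written as `start/end`. As a rough illustration of that shape only, the sketch below splits such a string into its two bounds. `split_interval` is a hypothetical helper written for this page; it is not part of clisops, which handles these values through its own parameter classes.

```python
def split_interval(value):
    """Split a "start/end" string into a (start, end) tuple.

    Hypothetical helper for illustration only; not a clisops function.
    """
    start, sep, end = value.partition("/")
    if not sep:
        raise ValueError(f"expected 'start/end', got {value!r}")
    return start.strip(), end.strip()


# The time and level strings used in the examples on this page.
time_bounds = split_interval("2007-01-01T00:00:00/2200-12-30T00:00:00")
level_bounds = tuple(float(v) for v in split_interval("600/100"))

print(time_bounds)
print(level_bounds)
```

The bounds stay as plain strings for `time` and are converted to floats for `level`, which is an assumption made for this sketch rather than a statement about how the library stores them.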

[3]:

ds = xr.open_mfdataset(tas_files, use_cftime=True, combine="by_coords")

Output to xarray

There will only be one output for this example.

[4]:

outputs = subset(
    ds=ds,
    time="2007-01-01T00:00:00/2200-12-30T00:00:00",
    area=(0.0, 10.0, 175.0, 90.0),
    output_type="xarray",
)

print(f"There is only {len(outputs)} output.")
outputs[0]

There is only 1 output.

/home/docs/checkouts/readthedocs.org/user_builds/clisops/conda/stable/lib/python3.11/site-packages/clisops/core/subset.py:1331: UserWarning: "start_date" has been nudged to nearest valid time step in xarray object.
  da = subset_time(da, start_date=start_date, end_date=end_date)
/home/docs/checkouts/readthedocs.org/user_builds/clisops/conda/stable/lib/python3.11/site-packages/clisops/core/subset.py:1331: UserWarning: "end_date" has been nudged to nearest valid time step in xarray object.
  da = subset_time(da, start_date=start_date, end_date=end_date)

[4]:

<xarray.Dataset>
Dimensions:    (lat: 1, time: 2329, bnds: 2, lon: 1)
Coordinates:
    height     float64 1.5
  * lat        (lat) float64 35.0
  * lon        (lon) float64 0.0
  * time       (time) object 2007-01-16 00:00:00 ... 2200-12-16 00:00:00
Dimensions without coordinates: bnds
Data variables:
    lat_bnds   (time, lat, bnds) float64 dask.array<chunksize=(287, 1, 2), meta=np.ndarray>
    lon_bnds   (time, lon, bnds) float64 dask.array<chunksize=(287, 1, 2), meta=np.ndarray>
    tas        (time, lat, lon) float32 dask.array<chunksize=(287, 1, 1), meta=np.ndarray>
    time_bnds  (time, bnds) object dask.array<chunksize=(287, 2), meta=np.ndarray>
Attributes: (12/29)
    institution:            Met Office Hadley Centre, Fitzroy Road, Exeter, D...
    institute_id:           MOHC
    experiment_id:          rcp85
    source:                 HadGEM2-ES (2009) atmosphere: HadGAM2 (N96L38); o...
    model_id:               HadGEM2-ES
    forcing:                GHG, SA, Oz, LU, Sl, Vl, BC, OC, (GHG = CO2, N2O,...
    ...                     ...
    title:                  HadGEM2-ES model output prepared for CMIP5 RCP8.5
    parent_experiment:      historical
    modeling_realm:         atmos
    realization:            1
    cmor_version:           2.5.0
    NCO:                    4.7.3
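The warnings above report that the requested start and end dates were nudged to valid time steps: the request starts at 2007-01-01, while the first time step in the result is 2007-01-16. A simplified picture of that effect is sketched below, assuming the selection keeps every step on or after the requested start. The three-step `steps` axis is hypothetical and uses `datetime` for brevity, whereas the real dataset holds cftime objects.

```python
from datetime import datetime

# Hypothetical monthly axis with mid-month steps, mirroring the
# 16th-of-month values shown in the output above.
steps = [datetime(2006, 12, 16), datetime(2007, 1, 16), datetime(2007, 2, 16)]

requested_start = datetime(2007, 1, 1)

# Keep only the steps that fall on or after the requested start.
selected = [t for t in steps if t >= requested_start]
first_step = selected[0]

print(first_step.isoformat())
```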

Output to netCDF with simple namer

There is only one output because the file size is under the memory limit, so the output does not need to be split. This example uses the simple namer, which numbers the output files.
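The numbering can be pictured with the short sketch below. It reproduces the `output_001.nc` name that this page opens later on. `simple_name` is a hypothetical helper, and the names for any further files are an assumption based on that single example rather than something taken from the clisops source.

```python
def simple_name(index, extension="nc"):
    """Return a numbered file name such as output_001.nc.

    Hypothetical helper for illustration only; not the clisops file namer.
    """
    return f"output_{index:03d}.{extension}"


# Names the first three outputs would get under this assumed scheme.
names = [simple_name(i) for i in range(1, 4)]
print(names)
```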

[5]:

outputs = subset(
    ds=ds,
    time="2007-01-01T00:00:00/2200-12-30T00:00:00",
    area=(0.0, 10.0, 175.0, 90.0),
    output_type="nc",
    output_dir=".",
    split_method="time:auto",
    file_namer="simple",
)

/home/docs/checkouts/readthedocs.org/user_builds/clisops/conda/stable/lib/python3.11/site-packages/clisops/core/subset.py:1331: UserWarning: "start_date" has been nudged to nearest valid time step in xarray object.
  da = subset_time(da, start_date=start_date, end_date=end_date)
/home/docs/checkouts/readthedocs.org/user_builds/clisops/conda/stable/lib/python3.11/site-packages/clisops/core/subset.py:1331: UserWarning: "end_date" has been nudged to nearest valid time step in xarray object.
  da = subset_time(da, start_date=start_date, end_date=end_date)

[6]:

# To open the file

subset_ds = xr.open_mfdataset("./output_001.nc", use_cftime=True, combine="by_coords")
subset_ds

[6]:

<xarray.Dataset>
Dimensions:    (lat: 1, time: 2329, bnds: 2, lon: 1)
Coordinates:
    height     float64 ...
  * lat        (lat) float64 35.0
  * lon        (lon) float64 0.0
  * time       (time) object 2007-01-16 00:00:00 ... 2200-12-16 00:00:00
Dimensions without coordinates: bnds
Data variables:
    lat_bnds   (time, lat, bnds) float64 dask.array<chunksize=(2329, 1, 2), meta=np.ndarray>
    lon_bnds   (time, lon, bnds) float64 dask.array<chunksize=(2329, 1, 2), meta=np.ndarray>
    tas        (time, lat, lon) float32 dask.array<chunksize=(2329, 1, 1), meta=np.ndarray>
    time_bnds  (time, bnds) object dask.array<chunksize=(2329, 2), meta=np.ndarray>
Attributes: (12/29)
    institution:            Met Office Hadley Centre, Fitzroy Road, Exeter, D...
    institute_id:           MOHC
    experiment_id:          rcp85
    source:                 HadGEM2-ES (2009) atmosphere: HadGAM2 (N96L38); o...
    model_id:               HadGEM2-ES
    forcing:                GHG, SA, Oz, LU, Sl, Vl, BC, OC, (GHG = CO2, N2O,...
    ...                     ...
    title:                  HadGEM2-ES model output prepared for CMIP5 RCP8.5
    parent_experiment:      historical
    modeling_realm:         atmos
    realization:            1
    cmor_version:           2.5.0
    NCO:                    4.7.3

Output to netCDF with standard namer

There is only one output because the file size is under the memory limit, so the output does not need to be split. This example uses the standard namer, which names output files according to the input file and how it has been subsetted.

[7]:

outputs = subset(
    ds=ds,
    time="2007-01-01T00:00:00/2200-12-30T00:00:00",
    area=(0.0, 10.0, 175.0, 90.0),
    output_type="nc",
    output_dir=".",
    split_method="time:auto",
    file_namer="standard",
)

/home/docs/checkouts/readthedocs.org/user_builds/clisops/conda/stable/lib/python3.11/site-packages/clisops/core/subset.py:1331: UserWarning: "start_date" has been nudged to nearest valid time step in xarray object.
  da = subset_time(da, start_date=start_date, end_date=end_date)
/home/docs/checkouts/readthedocs.org/user_builds/clisops/conda/stable/lib/python3.11/site-packages/clisops/core/subset.py:1331: UserWarning: "end_date" has been nudged to nearest valid time step in xarray object.
  da = subset_time(da, start_date=start_date, end_date=end_date)

Subsetting by level

[8]:

ds = xr.open_dataset(o3_file, use_cftime=True)

No subsetting applied

[9]:

result = subset(ds=ds,
                output_type="xarray")

result[0].coords

[9]:

Coordinates:
  * lat      (lat) float64 -89.5 10.5
  * lon      (lon) float64 0.625 125.6 250.6
  * plev     (plev) float64 1e+05 9.25e+04 8.5e+04 7e+04 ... 1e+03 500.0 100.0
  * time     (time) object 1850-01-16 12:00:00 ... 1949-12-16 12:00:00

Subsetting over level

[10]:

# subsetting over pressure level (plev)

result = subset(ds=ds,
                level="600/100",
                output_type="xarray")

print(result[0].coords)
print(f"\nplev has been subsetted and now only has {len(result[0].plev)} values.")

Coordinates:
  * lat      (lat) float64 -89.5 10.5
  * lon      (lon) float64 0.625 125.6 250.6
  * plev     (plev) float64 500.0 100.0
  * time     (time) object 1850-01-16 12:00:00 ... 1949-12-16 12:00:00

plev has been subsetted and now only has 2 values.

/home/docs/checkouts/readthedocs.org/user_builds/clisops/conda/stable/lib/python3.11/site-packages/clisops/ops/subset.py:146: UserWarning: "first_level" has been nudged to nearest valid level in xarray object.
  result = subset_level(result, **kwargs)
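The warning above says that the first level was nudged to the nearest valid level: 600 is not a value on the `plev` axis, and the result starts at 500.0. The sketch below illustrates a nearest-value nudge followed by a range selection. It is an assumption about the general idea, not the clisops implementation; `nearest_level` is a hypothetical helper, and `visible_levels` holds only the values visible in the printed coordinates earlier on this page.

```python
# Only the pressure levels visible in the printed coordinates;
# the elided values in between are deliberately left out.
visible_levels = [1e05, 9.25e04, 8.5e04, 7e04, 1e03, 500.0, 100.0]


def nearest_level(requested, levels):
    """Return the level closest to the requested value (hypothetical helper)."""
    return min(levels, key=lambda level: abs(level - requested))


# Nudge both requested bounds of "600/100" onto the axis.
first = nearest_level(600.0, visible_levels)
last = nearest_level(100.0, visible_levels)

# Keep every level between the nudged bounds, inclusive.
low, high = sorted((first, last))
selected = [level for level in visible_levels if low <= level <= high]

print(first, selected)
```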

Use time components

[11]:

ds = xr.open_mfdataset(tas_files, use_cftime=True, combine="by_coords")

[12]:

outputs = subset(
    ds=ds,
    time_components="year: 2010, 2020, 2030|month: 12, 1, 2",
    output_type="xarray",
)

print(f"There is only {len(outputs)} output.")
outputs[0]

There is only 1 output.

[12]:

<xarray.Dataset>
Dimensions:    (lat: 2, time: 9, bnds: 2, lon: 2)
Coordinates:
    height     float64 1.5
  * lat        (lat) float64 -90.0 35.0
  * lon        (lon) float64 0.0 187.5
  * time       (time) object 2010-01-16 00:00:00 ... 2030-12-16 00:00:00
Dimensions without coordinates: bnds
Data variables:
    lat_bnds   (time, lat, bnds) float64 dask.array<chunksize=(8, 2, 2), meta=np.ndarray>
    lon_bnds   (time, lon, bnds) float64 dask.array<chunksize=(8, 2, 2), meta=np.ndarray>
    tas        (time, lat, lon) float32 dask.array<chunksize=(8, 2, 2), meta=np.ndarray>
    time_bnds  (time, bnds) object dask.array<chunksize=(8, 2), meta=np.ndarray>
Attributes: (12/29)
    institution:            Met Office Hadley Centre, Fitzroy Road, Exeter, D...
    institute_id:           MOHC
    experiment_id:          rcp85
    source:                 HadGEM2-ES (2009) atmosphere: HadGAM2 (N96L38); o...
    model_id:               HadGEM2-ES
    forcing:                GHG, SA, Oz, LU, Sl, Vl, BC, OC, (GHG = CO2, N2O,...
    ...                     ...
    title:                  HadGEM2-ES model output prepared for CMIP5 RCP8.5
    parent_experiment:      historical
    modeling_realm:         atmos
    realization:            1
    cmor_version:           2.5.0
    NCO:                    4.7.3
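The `time_components` string above combines a list of years and a list of months, separated by `|`. To make the selection logic concrete, the sketch below parses that string into a dictionary and applies it to a hypothetical monthly axis of `(year, month)` pairs. `parse_time_components` is an illustrative helper, not a clisops or roocs_utils function, and the axis is invented for the example.

```python
def parse_time_components(text):
    """Parse "name: v1, v2|name: v1" into a dict of integer lists.

    Hypothetical helper for illustration only.
    """
    components = {}
    for part in text.split("|"):
        name, _, values = part.partition(":")
        components[name.strip()] = [int(v) for v in values.split(",")]
    return components


components = parse_time_components("year: 2010, 2020, 2030|month: 12, 1, 2")

# Hypothetical monthly axis covering 2005 to 2035 as (year, month) pairs.
axis = [(year, month) for year in range(2005, 2036) for month in range(1, 13)]

# A step is kept when both its year and its month are listed.
selected = [
    (year, month)
    for (year, month) in axis
    if year in components["year"] and month in components["month"]
]

print(components)
print(len(selected))
```

Three years with three months each give nine steps in this sketch, in line with the `time: 9` dimension in the output above. Each listed year contributes its own January, February and December, so the selection is not a contiguous winter season.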

Using parameter classes

[13]:

from roocs_utils.parameter.param_utils import (
    level_interval,
    level_series,
    time_components,
    time_interval,
    time_series,
)

[14]:

ds = xr.open_mfdataset(tas_files, use_cftime=True, combine="by_coords")

[15]:

outputs = subset(
    ds=ds,
    time=time_interval("2007-01-01T00:00:00", "2200-12-30T00:00:00"),
    time_components=time_components(month=["dec", "jan", "feb"]),
    output_type="xarray",
)

print(f"There is only {len(outputs)} output.")
outputs[0]

There is only 1 output.

/home/docs/checkouts/readthedocs.org/user_builds/clisops/conda/stable/lib/python3.11/site-packages/clisops/ops/subset.py:120: UserWarning: "start_date" has been nudged to nearest valid time step in xarray object.
  result = subset_time(self.ds, **kwargs)
/home/docs/checkouts/readthedocs.org/user_builds/clisops/conda/stable/lib/python3.11/site-packages/clisops/ops/subset.py:120: UserWarning: "end_date" has been nudged to nearest valid time step in xarray object.
  result = subset_time(self.ds, **kwargs)

[15]:

<xarray.Dataset>
Dimensions:    (lat: 2, time: 583, bnds: 2, lon: 2)
Coordinates:
    height     float64 1.5
  * lat        (lat) float64 -90.0 35.0
  * lon        (lon) float64 0.0 187.5
  * time       (time) object 2007-01-16 00:00:00 ... 2200-12-16 00:00:00
Dimensions without coordinates: bnds
Data variables:
    lat_bnds   (time, lat, bnds) float64 dask.array<chunksize=(71, 2, 2), meta=np.ndarray>
    lon_bnds   (time, lon, bnds) float64 dask.array<chunksize=(71, 2, 2), meta=np.ndarray>
    tas        (time, lat, lon) float32 dask.array<chunksize=(71, 2, 2), meta=np.ndarray>
    time_bnds  (time, bnds) object dask.array<chunksize=(71, 2), meta=np.ndarray>
Attributes: (12/29)
    institution:            Met Office Hadley Centre, Fitzroy Road, Exeter, D...
    institute_id:           MOHC
    experiment_id:          rcp85
    source:                 HadGEM2-ES (2009) atmosphere: HadGAM2 (N96L38); o...
    model_id:               HadGEM2-ES
    forcing:                GHG, SA, Oz, LU, Sl, Vl, BC, OC, (GHG = CO2, N2O,...
    ...                     ...
    title:                  HadGEM2-ES model output prepared for CMIP5 RCP8.5
    parent_experiment:      historical
    modeling_realm:         atmos
    realization:            1
    cmor_version:           2.5.0
    NCO:                    4.7.3