Subsetting utilities

clisops comes with some utilities to perform common tasks that are either not implemented in xarray, or that are implemented but do not have the generality needed for climate science work. Here we show examples of the clisops.core.subset submodule.

[ ]:
import xarray as xr
from clisops.core import subset
xr.set_options(display_style='html')

import matplotlib.pyplot as plt
plt.style.use('seaborn')
plt.rcParams['figure.figsize'] = (13, 5)
%matplotlib inline
[ ]:
ds = xr.tutorial.open_dataset('air_temperature')
ds.coords
Coordinates:
  * lat      (lat) float32 75.0 72.5 70.0 67.5 65.0 ... 25.0 22.5 20.0 17.5 15.0
  * lon      (lon) float32 200.0 202.5 205.0 207.5 ... 322.5 325.0 327.5 330.0
  * time     (time) datetime64[ns] 2013-01-01 ... 2014-12-31T18:00:00
[ ]:
ds.air.isel(time=0).plot()  # Simple index-selection with xarray
<matplotlib.collections.QuadMesh at 0x7fbce02e6760>
../_images/notebooks_core_subset_3_1.png

subset_bbox : using a latitude-longitude bounding box

In the previous example notebook, we used xarray’s .sel() to cut a lat-lon subset of our data. clisops offers the same utility, but with more generality. For example, if we mindlessly try xarray’s method on our dataset:

[ ]:
ds.sel(lat=slice(45, 50), lon=slice(-60, -55)).coords
Coordinates:
  * lat      (lat) float32
  * lon      (lon) float32
  * time     (time) datetime64[ns] 2013-01-01 ... 2014-12-31T18:00:00

As you can see, lat and lon are empty. In this dataset, the lats are defined in descending order and lons are in the range [0, 360[ instead of [-180, 180[, which is why xarray’s method did not return the expected result. clisops understands these nuances:

[ ]:
subset.subset_bbox(ds, lat_bnds=[45, 50], lon_bnds=[-60, -55]).coords
Coordinates:
  * lat      (lat) float32 50.0 47.5 45.0
  * lon      (lon) float32 300.0 302.5 305.0
  * time     (time) datetime64[ns] 2013-01-01 ... 2014-12-31T18:00:00

When lat and lon are 2-D

subset_bbox also manages cases where the lat-lon coordinates are not sorted 1D vectors, for example with this more complex dataset:

Most subset methods expect the input dataset / dataarray to have lat and lon as variables. It may be able to understand your data if other common names are used (like latitude, or lons), but we recommend renaming the variables before using the tool (like in this example).

[ ]:
ds_roms = xr.tutorial.open_dataset('ROMS_example').rename(lon_rho='lon', lat_rho='lat')
salt = ds_roms.salt.isel(ocean_time=0, s_rho=0)

#matplotlib-based plotting
fig, (axEtaXi, axLatLon) = plt.subplots(1, 2)
salt.plot(cmap=plt.cm.gray_r, ax=axEtaXi, add_colorbar=False)
axLatLon.pcolormesh(salt.lon, salt.lat, salt)
axLatLon.set_xlabel(salt.lon.long_name)
axLatLon.set_ylabel(salt.lat.long_name)
<ipython-input-6-e7e8689dc693>:7: MatplotlibDeprecationWarning: shading='flat' when X and Y have the same dimensions as C is deprecated since 3.3.  Either specify the corners of the quadrilaterals with X and Y, or pass shading='auto', 'nearest' or 'gouraud', or set rcParams['pcolor.shading'].  This will become an error two minor releases later.
  axLatLon.pcolormesh(salt.lon, salt.lat, salt)
Text(0, 0.5, 'latitude of RHO-points')
../_images/notebooks_core_subset_9_2.png
[ ]:
salt_bb = subset.subset_bbox(salt, lat_bnds=[28, 30], lon_bnds=[-91, -88])

fig, (axEtaXi, axLatLon) = plt.subplots(1, 2)
salt_bb.plot(cmap=plt.cm.gray_r, ax=axEtaXi, add_colorbar=False)
axLatLon.pcolormesh(salt_bb.lon, salt_bb.lat, salt_bb)
axLatLon.set_xlabel(salt_bb.lon.long_name)
axLatLon.set_ylabel(salt_bb.lat.long_name)
/home/phobos/Python/clisops/clisops/core/subset.py:272: UserWarning: lat and lon-like dimensions are not found among arg `<xarray.DataArray 'salt' (eta_rho: 191, xi_rho: 371)>
array([[34.953022, 34.955223, 34.957733, ..., 34.91684 , 34.913895, 34.913876],
       [34.95414 , 34.957577, 34.961086, ..., 34.916145, 34.91495 , 34.913864],
       [34.959286, 34.963936, 34.968594, ..., 34.923023, 34.92564 , 34.926853],
       ...,
       [      nan,       nan,       nan, ...,       nan,       nan,       nan],
       [      nan,       nan,       nan, ...,       nan,       nan,       nan],
       [      nan,       nan,       nan, ...,       nan,       nan,       nan]],
      dtype=float32)
Coordinates:
    Cs_r        float64 ...
    lon         (eta_rho, xi_rho) float64 -93.6 -93.58 -93.57 ... -88.88 -88.87
    hc          float64 ...
    h           (eta_rho, xi_rho) float64 ...
    lat         (eta_rho, xi_rho) float64 27.45 27.45 27.45 ... 30.85 30.86
    Vtransform  int32 ...
    ocean_time  datetime64[ns] 2001-08-01
    s_rho       float64 -0.9833
Dimensions without coordinates: eta_rho, xi_rho
Attributes:
    long_name:  salinity
    time:       ocean_time
    field:      salinity, scalar, series` dimensions: ['eta_rho', 'xi_rho'].
  warnings.warn(
<ipython-input-7-67bc17b9a9a6>:5: MatplotlibDeprecationWarning: shading='flat' when X and Y have the same dimensions as C is deprecated since 3.3.  Either specify the corners of the quadrilaterals with X and Y, or pass shading='auto', 'nearest' or 'gouraud', or set rcParams['pcolor.shading'].  This will become an error two minor releases later.
  axLatLon.pcolormesh(salt_bb.lon, salt_bb.lat, salt_bb)
Text(0, 0.5, 'latitude of RHO-points')
../_images/notebooks_core_subset_10_2.png

Add time subsetting

subset_bbox and other methods of the submodule give options to also give time bounds. These options are mostly for convenience as only some basics sanity checks are performed.

[ ]:
subset.subset_bbox(ds, lat_bnds=[45, 50], lon_bnds=[-60, -55],
                       start_date='2013-02', end_date='2013-08').coords
Coordinates:
  * lat      (lat) float32 50.0 47.5 45.0
  * lon      (lon) float32 300.0 302.5 305.0
  * time     (time) datetime64[ns] 2013-02-01 ... 2013-08-31T18:00:00

subset_gridpoint : Selecting grid points

subset_gridpoint can be used for selecting single locations. Compared to .sel(), this method adds a tolerance parameter in meters so that it finds the nearest point from the given coordinate within this distance, or else it returns NaN.

[ ]:
lon_pt = -70.2
lat_pt = 50.1

# Get timeseries for point within 100 km of (-70.2, 50.1)
ds3 = subset.subset_gridpoint(ds, lon=lon_pt, lat=lat_pt, tolerance=100 * 1000)
ds3.coords
Coordinates:
    lat      float32 50.0
    lon      float32 290.0
  * time     (time) datetime64[ns] 2013-01-01 ... 2014-12-31T18:00:00
[ ]:
ds3.air.plot()
[<matplotlib.lines.Line2D at 0x7fbcdfc62b20>]
../_images/notebooks_core_subset_15_1.png

This also works with a list of coordinates. In that case, a new dimension site concatenates the selected gridpoints. Moreover, passing add_distance = True allows us to inspect how close the selected gridpoints were from the requested coordinates. In this example, the third point (-90, 54.1) is a little bit further than 100 km from the nearest gridpoint, so the coordinate of this gridpoint and the distance are returned, but the values are masked with NaN (and thus they do not appear on the graph).

[ ]:
lon_pt = [-65, -70.2, -90]
lat_pt = [45, 50.1, 54.1]

# Get timeseries for point within 100 km of each points
ds3 = subset.subset_gridpoint(ds, lon=lon_pt, lat=lat_pt, tolerance=100 * 1000, add_distance=True)
print(ds3.coords)
ds3.air.plot(hue='site')
Coordinates:
    lat       (site) float32 45.0 50.0 55.0
    lon       (site) float32 295.0 290.0 270.0
  * time      (time) datetime64[ns] 2013-01-01 ... 2014-12-31T18:00:00
    distance  (site) float64 0.0 1.814e+04 1.002e+05
[<matplotlib.lines.Line2D at 0x7fbcdfb2fa60>,
 <matplotlib.lines.Line2D at 0x7fbcdfb66490>,
 <matplotlib.lines.Line2D at 0x7fbcdfb66310>]
../_images/notebooks_core_subset_17_2.png

subset_shape : Selecting a region from a polygon

If the region we want to keep in our dataset is complex, we can use subset_shape and pass a polygon. The input shape can be any georeferenced data file understood by geopandas and fiona that run under the hood. For example here with a geojson file of Canada available online:

[ ]:
ds_can = subset.subset_shape(ds, 'https://raw.githubusercontent.com/johan/world.geo.json/master/countries/CAN.geo.json')

fig, (axall, axcan) = plt.subplots(1, 2)
ds.air.isel(time=0).plot(ax=axall)
ds_can.air.isel(time=0).plot(ax=axcan)
2021-01-26 14:44:38,120 - root - INFO - No file or no dimensions found in arg `https://raw.githubusercontent.com/johan/world.geo.json/master/countries/CAN.geo.json`.
/home/phobos/Python/clisops/clisops/core/subset.py:279: UserWarning: CRS definitions are similar but raster lon values must be wrapped.
  final = func(*formatted_args, **kwargs)
/home/phobos/Python/clisops/clisops/core/subset.py:463: UserWarning: Wrapping longitudes at 180 degrees.
  warnings.warn("Wrapping longitudes at 180 degrees.")
<matplotlib.collections.QuadMesh at 0x7fbcdf949310>
../_images/notebooks_core_subset_19_3.png

create_mask : Creating a mask from polygons

We can also create a mask to divide the data in regions. In the example, we load the shapes for Canada, Mexico and the US, merge them with pandas and then create the mask. create_mask is a bit more restrictive than the other methods showed here and specifically asks for the x and y dims. Also, we need to pass wrap_lons=True since our longitudes go from 0 to 360 instead of -180 to 180.

[ ]:
import pandas as pd
import geopandas as gpd
can = gpd.read_file('https://raw.githubusercontent.com/johan/world.geo.json/master/countries/CAN.geo.json')
usa = gpd.read_file('https://raw.githubusercontent.com/johan/world.geo.json/master/countries/USA.geo.json')
mex = gpd.read_file('https://raw.githubusercontent.com/johan/world.geo.json/master/countries/MEX.geo.json')
northam = pd.concat([can, usa, mex]).set_index('id')
northam
name geometry
id
CAN Canada MULTIPOLYGON (((-63.66450 46.55001, -62.93930 ...
USA United States of America MULTIPOLYGON (((-155.54211 19.08348, -155.6881...
MEX Mexico POLYGON ((-97.14001 25.87000, -97.52807 24.992...
[ ]:
mask = subset.create_mask(x_dim=ds.lon, y_dim=ds.lat, poly=northam, wrap_lons=True)
/home/phobos/Python/clisops/clisops/core/subset.py:463: UserWarning: Wrapping longitudes at 180 degrees.
  warnings.warn("Wrapping longitudes at 180 degrees.")
[ ]:
mask.T.plot() # The transpose is needed because create_mask returns lon, lat instead the input lat, lon.
<matplotlib.collections.QuadMesh at 0x7fbcdf82ad00>
../_images/notebooks_core_subset_23_1.png

The same can be done with 2D lat/lon coordinates. Here get the regions of salt that are within the boundaries of the US as defined in our geojson data. As our data comes from a small region in the Gulf of Mexico and our USA mask is quite coarse, the following isn’t really helpful. In any case, it illustrates what can be done with clisops.

[ ]:
usa_mask = subset.create_mask(x_dim=salt.lon, y_dim=salt.lat, poly=usa)
[ ]:
salt.plot()
usa_mask.plot(cmap=plt.cm.Blues_r, add_colorbar=False)
<matplotlib.collections.QuadMesh at 0x7fbcdf778df0>
../_images/notebooks_core_subset_26_1.png

create_weight_masks : Creating exact weights masks from polygons

While create_mask (and subset_shape) only check if the center of a grid cell lies inside of a polygon or not, create_weight_masks (and average_shape, see the Average utilities notebook) consider the partial overlap of grid cells and polygons. Thus, instead of creating an integer mask for the data, this method creates one mask for each polygon. The weights are the fraction of the grid cells-polygon overlap over the whole polygon. As such, the sum along spatial dimensions gives 1 for each geometry.

However, this feature of clisops requires the optional xESMF dependency (version 0.6.2 or later), which is not currently available of pypi (pip) and must be installed either through conda or from the source.

The call signature is different: a dataset or dataarray containing the grid information is passed directly. The same restrictions as for xESMF apply.

[ ]:
weights = subset.create_weight_masks(ds, northam)
weights
<xarray.DataArray (geom: 3, lat: 25, lon: 53)>
array([[[0.00000000e+00, 0.00000000e+00, 0.00000000e+00, ...,
         0.00000000e+00, 0.00000000e+00, 0.00000000e+00],
        [0.00000000e+00, 0.00000000e+00, 0.00000000e+00, ...,
         0.00000000e+00, 0.00000000e+00, 0.00000000e+00],
        [0.00000000e+00, 0.00000000e+00, 0.00000000e+00, ...,
         0.00000000e+00, 0.00000000e+00, 0.00000000e+00],
        ...,
        [0.00000000e+00, 0.00000000e+00, 0.00000000e+00, ...,
         0.00000000e+00, 0.00000000e+00, 0.00000000e+00],
        [0.00000000e+00, 0.00000000e+00, 0.00000000e+00, ...,
         0.00000000e+00, 0.00000000e+00, 0.00000000e+00],
        [0.00000000e+00, 0.00000000e+00, 0.00000000e+00, ...,
         0.00000000e+00, 0.00000000e+00, 0.00000000e+00]],

       [[0.00000000e+00, 0.00000000e+00, 0.00000000e+00, ...,
         0.00000000e+00, 0.00000000e+00, 0.00000000e+00],
        [0.00000000e+00, 4.38326936e-05, 1.38383820e-05, ...,
         0.00000000e+00, 0.00000000e+00, 0.00000000e+00],
        [4.47662669e-03, 5.31807679e-03, 5.29956131e-03, ...,
         0.00000000e+00, 0.00000000e+00, 0.00000000e+00],
...
        [0.00000000e+00, 1.73510030e-04, 9.35640399e-04, ...,
         0.00000000e+00, 0.00000000e+00, 0.00000000e+00],
        [0.00000000e+00, 0.00000000e+00, 0.00000000e+00, ...,
         0.00000000e+00, 0.00000000e+00, 0.00000000e+00],
        [0.00000000e+00, 0.00000000e+00, 0.00000000e+00, ...,
         0.00000000e+00, 0.00000000e+00, 0.00000000e+00]],

       [[0.00000000e+00, 0.00000000e+00, 0.00000000e+00, ...,
         0.00000000e+00, 0.00000000e+00, 0.00000000e+00],
        [0.00000000e+00, 0.00000000e+00, 0.00000000e+00, ...,
         0.00000000e+00, 0.00000000e+00, 0.00000000e+00],
        [0.00000000e+00, 0.00000000e+00, 0.00000000e+00, ...,
         0.00000000e+00, 0.00000000e+00, 0.00000000e+00],
        ...,
        [0.00000000e+00, 0.00000000e+00, 0.00000000e+00, ...,
         0.00000000e+00, 0.00000000e+00, 0.00000000e+00],
        [0.00000000e+00, 0.00000000e+00, 0.00000000e+00, ...,
         0.00000000e+00, 0.00000000e+00, 0.00000000e+00],
        [0.00000000e+00, 0.00000000e+00, 0.00000000e+00, ...,
         0.00000000e+00, 0.00000000e+00, 0.00000000e+00]]])
Coordinates:
    name     (geom) object 'Canada' 'United States of America' 'Mexico'
  * geom     (geom) object 'CAN' 'USA' 'MEX'
  * lat      (lat) float32 75.0 72.5 70.0 67.5 65.0 ... 25.0 22.5 20.0 17.5 15.0
  * lon      (lon) float32 200.0 202.5 205.0 207.5 ... 322.5 325.0 327.5 330.0
[ ]:
weights.plot(col='geom', cmap='pink')
<xarray.plot.facetgrid.FacetGrid at 0x7fbcddd3f640>
../_images/notebooks_core_subset_29_1.png
[ ]:
weights.sum(['lon', 'lat'])
<xarray.DataArray (geom: 3)>
array([1., 1., 1.])
Coordinates:
    name     (geom) object 'Canada' 'United States of America' 'Mexico'
  * geom     (geom) object 'CAN' 'USA' 'MEX'