Select and View Data

Datasets

In PRIMAP2, data is handled in xarray datasets with defined dimensions, coordinates and metadata. If you are not familiar with selecting data using xarray, we recommend reading the corresponding section in xarray’s documentation first.

To get going, we will show the most important features of the data format using a toy example.

# setup logging for the docs - we don't need debug logs
import sys
from loguru import logger

logger.remove()
logger.add(sys.stderr, level="INFO")
1
import primap2
import primap2.tests

ds = primap2.tests.examples.toy_ds()

ds
<xarray.Dataset> Size: 3kB
Dimensions:              (time: 6, area (ISO3): 2, category (IPCC2006): 5,
                          source: 2)
Coordinates:
  * time                 (time) datetime64[ns] 48B 2015-01-01 ... 2020-01-01
  * area (ISO3)          (area (ISO3)) <U3 24B 'COL' 'ARG'
  * category (IPCC2006)  (category (IPCC2006)) <U3 60B '0' '1' '2' '1.A' '1.B'
  * source               (source) <U8 64B 'RAND2020' 'RAND2021'
Data variables:
    CO2                  (time, area (ISO3), category (IPCC2006), source) float64 960B [CO2·Gg/a] ...
    CH4                  (time, area (ISO3), category (IPCC2006), source) float64 960B [CH4·Gg/a] ...
    CH4 (SARGWP100)      (time, area (ISO3), category (IPCC2006), source) float64 960B [CO2·Gg/a] ...
Attributes:
    area:     area (ISO3)
    cat:      category (IPCC2006)

You can click through the coordinates and variables to check out the contents of the toy dataset. As mentioned, primap2 datasets are xarray datasets, but with clearly defined naming conventions so that the data is self-describing.

Each dataset has a time dimension, an area dimension and a source dimension. In our toy example, we additionally have a category dimension. For the area and category dimensions, the terminology used for the dimension is given in parentheses in the dimension name, e.g. ISO3 for the area. The terminologies are defined in the separate climate-categories package, so that the meaning of the area codes is clearly defined.

The dataset contains data variables. Each greenhouse gas is stored in a separate data variable. If a data variable contains global warming potential equivalent emissions instead of mass emissions, the metric used is given in parentheses in the variable name, e.g. CH4 (SARGWP100).
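Because these naming conventions are regular, the two parts of a name can be recovered from the string itself. The following is a minimal sketch in plain Python, not part of primap2: the helper `split_name` is hypothetical and assumes the qualifier is always the final parenthesised part of the name.

```python
import re
from typing import Optional, Tuple

# Hypothetical helper, not part of primap2: splits "base (qualifier)" names
# such as "area (ISO3)" or "CH4 (SARGWP100)" into their two parts.
_NAME_PATTERN = re.compile(r"^(?P<base>.+?)\s*\((?P<qualifier>[^()]+)\)$")


def split_name(name: str) -> Tuple[str, Optional[str]]:
    """Return (base, qualifier); qualifier is None if there are no parentheses."""
    match = _NAME_PATTERN.match(name)
    if match is None:
        return name, None
    return match.group("base"), match.group("qualifier")


print(split_name("area (ISO3)"))      # dimension name and its terminology
print(split_name("CH4 (SARGWP100)"))  # entity and its GWP metric
print(split_name("CO2"))              # plain mass emissions, no qualifier
```

In day-to-day use this kind of string processing should rarely be needed, since the same information is also available from the metadata stored in the attrs.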

Selecting

Data can be selected using the xarray indexing methods, but PRIMAP2 also provides its own versions of some of xarray’s selection methods, which are easier to use in the primap2 context.

Getitem

The following two selections return the same result:

ds["area (ISO3)"]
<xarray.DataArray 'area (ISO3)' (area (ISO3): 2)> Size: 24B
array(['COL', 'ARG'], dtype='<U3')
Coordinates:
  * area (ISO3)  (area (ISO3)) <U3 24B 'COL' 'ARG'
ds.pr["area"]
<xarray.DataArray 'area (ISO3)' (area (ISO3): 2)> Size: 24B
array(['COL', 'ARG'], dtype='<U3')
Coordinates:
  * area (ISO3)  (area (ISO3)) <U3 24B 'COL' 'ARG'
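To make the idea of bare dimension names concrete, here is a sketch of one way such a lookup could work. It is written in plain Python and only illustrates the concept; the helpers `build_aliases` and `resolve` are hypothetical, and primap2's actual implementation may differ.

```python
def build_aliases(dims, attrs):
    """Map short names to full dimension names (illustrative only)."""
    aliases = {}
    for dim in dims:
        # "area (ISO3)" gets the alias "area"; "time" has no qualifier.
        base = dim.split(" (", 1)[0]
        if base != dim:
            aliases[base] = dim
    for key, value in attrs.items():
        # attrs such as {"cat": "category (IPCC2006)"} add further aliases.
        if value in dims:
            aliases[key] = value
    return aliases


def resolve(name, dims, attrs):
    """Return the full dimension name for a bare name, or the name unchanged."""
    return build_aliases(dims, attrs).get(name, name)


# Dimension names and attrs copied from the toy dataset shown above.
dims = ["time", "area (ISO3)", "category (IPCC2006)", "source"]
attrs = {"area": "area (ISO3)", "cat": "category (IPCC2006)"}

print(resolve("area", dims, attrs))
print(resolve("cat", dims, attrs))
print(resolve("time", dims, attrs))
```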

The loc Indexer

Similarly, a version of the loc indexer is provided which works with the bare dimension names:

ds.pr.loc[{"time": slice("2016", "2018"), "area": "COL"}]
<xarray.Dataset> Size: 880B
Dimensions:              (time: 3, category (IPCC2006): 5, source: 2)
Coordinates:
  * time                 (time) datetime64[ns] 24B 2016-01-01 ... 2018-01-01
    area (ISO3)          <U3 12B 'COL'
  * category (IPCC2006)  (category (IPCC2006)) <U3 60B '0' '1' '2' '1.A' '1.B'
  * source               (source) <U8 64B 'RAND2020' 'RAND2021'
Data variables:
    CO2                  (time, category (IPCC2006), source) float64 240B [CO2·Gg/a] ...
    CH4                  (time, category (IPCC2006), source) float64 240B [CH4·Gg/a] ...
    CH4 (SARGWP100)      (time, category (IPCC2006), source) float64 240B [CO2·Gg/a] ...
Attributes:
    area:     area (ISO3)
    cat:      category (IPCC2006)

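Negative Selections

Using the primap2 loc indexer, you can also use negative selections to select everything but the specified value or values along a dimension. In the following example, the categories 0, 1 and 2 are excluded, so only 1.A and 1.B remain; because the requested time slice (2002 to 2005) lies outside the toy dataset's time range (2015 to 2020), the resulting time dimension is empty: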

from primap2 import Not

ds.pr.loc[{"time": slice("2002", "2005"), "cat": Not(["0", "1", "2"])}]
<xarray.Dataset> Size: 112B
Dimensions:              (time: 0, area (ISO3): 2, category (IPCC2006): 2,
                          source: 2)
Coordinates:
  * time                 (time) datetime64[ns] 0B 
  * area (ISO3)          (area (ISO3)) <U3 24B 'COL' 'ARG'
  * category (IPCC2006)  (category (IPCC2006)) <U3 24B '1.A' '1.B'
  * source               (source) <U8 64B 'RAND2020' 'RAND2021'
Data variables:
    CO2                  (time, area (ISO3), category (IPCC2006), source) float64 0B [CO2·Gg/a] ...
    CH4                  (time, area (ISO3), category (IPCC2006), source) float64 0B [CH4·Gg/a] ...
    CH4 (SARGWP100)      (time, area (ISO3), category (IPCC2006), source) float64 0B [CO2·Gg/a] ...
Attributes:
    area:     area (ISO3)
    cat:      category (IPCC2006)
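Conceptually, a negative selection is the complement of a positive one. The sketch below reproduces the category part of the selection above in plain Python, without primap2. The helper `complement` is hypothetical and only illustrates the idea, not how `Not` is implemented.

```python
def complement(all_values, excluded):
    """Return the values that are not in `excluded`, keeping the original order."""
    excluded = set(excluded)
    return [value for value in all_values if value not in excluded]


# Category codes copied from the toy dataset's "category (IPCC2006)" coordinate.
categories = ["0", "1", "2", "1.A", "1.B"]

kept = complement(categories, ["0", "1", "2"])
print(kept)  # ['1.A', '1.B']
```

The resulting list matches the categories that remain in the output above.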

Metadata

We store metadata about the whole dataset in the attrs of the dataset, and metadata about specific data variables in their respective attrs.

ds.attrs
{'area': 'area (ISO3)', 'cat': 'category (IPCC2006)'}
ds["CH4 (SARGWP100)"].attrs
{'entity': 'CH4', 'gwp_context': 'SARGWP100'}

The toy example contains only a few technical metadata values, which are convenient for tasks such as reading the global warming potential metric without resorting to string processing. However, you can also add more information, for example a short description of your dataset in the attribute title:

ds.attrs["title"] = "A toy example dataset which contains random data."

We have standardized names for a few attributes (e.g. title), which can then also be accessed via the pr namespace:

ds.pr.title
'A toy example dataset which contains random data.'

You can find the definition of all standardized attributes at Dataset Attributes.

Unit handling

PRIMAP2 handles units with the openscm_units package, which is based on the Pint library, together with the pint-xarray library.

Unit information

To access the unit information, you can use the pint accessor on DataArrays provided by pint-xarray:

ds["CH4"].pint.units
CH4 gigagram/year

Simple conversions

Simple unit conversions are possible using standard Pint functions:

ch4_kt_per_day = ds["CH4"].pint.to("kt CH4 / day")
ch4_kt_per_day.pint.units
CH4 kt/day
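The arithmetic behind this particular conversion is small enough to check by hand. The sketch below does so in plain Python without Pint. It assumes a year of 365.25 days, which to our knowledge is Pint's default definition of `year`; treat that constant as an assumption rather than a guarantee about the unit registry in use.

```python
# One gigagram and one kilotonne are both 1e9 grams, so only the time unit changes.
GRAMS_PER_GG = 1e9
GRAMS_PER_KT = 1e9
DAYS_PER_YEAR = 365.25  # assumption: Julian year, believed to match Pint's default


def gg_per_year_to_kt_per_day(value):
    """Convert a rate in Gg/year to kt/day by hand."""
    return value * GRAMS_PER_GG / GRAMS_PER_KT / DAYS_PER_YEAR


# 365.25 Gg/year corresponds to 1 kt/day under the assumption above.
print(gg_per_year_to_kt_per_day(365.25))
```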

CO2 equivalent units and mass units

To convert mass emissions of a gas into CO2-equivalent emissions, you have to specify a global warming potential (GWP) context (also known as a global warming potential metric):

ch4_ar4 = ds["CH4"].pr.convert_to_gwp(gwp_context="AR4GWP100", units="Gg CO2 / year")
# The information about the used GWP context is retained:
ch4_ar4.attrs
{'entity': 'CH4', 'gwp_context': 'AR4GWP100'}

Because the GWP context used for conversion is stored, it is easy to convert back to mass units:

ch4 = ch4_ar4.pr.convert_to_mass()
ch4.attrs
{'entity': 'CH4'}

The stored GWP context can also be used to convert another array using the same context:

ch4_sar = ds["CH4"].pr.convert_to_gwp_like(ds["CH4 (SARGWP100)"])
ch4_sar.attrs
{'entity': 'CH4', 'gwp_context': 'SARGWP100'}
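Numerically, a GWP conversion is a multiplication by a gas-specific factor, and converting back to mass is the matching division. The following sketch shows this for methane in plain Python. The factors are the 100-year GWP values for CH4 from the IPCC Second Assessment Report (21) and Fourth Assessment Report (25); they are hard-coded here for illustration, whereas primap2 takes its factors from the unit registry. The helpers `to_gwp` and `to_mass` are hypothetical.

```python
# 100-year global warming potentials of CH4, hard-coded for illustration.
GWP100_CH4 = {"SARGWP100": 21.0, "AR4GWP100": 25.0}


def to_gwp(mass_gg_ch4, gwp_context):
    """Convert Gg CH4 into Gg CO2-equivalent under the given GWP context."""
    return mass_gg_ch4 * GWP100_CH4[gwp_context]


def to_mass(gg_co2eq, gwp_context):
    """Convert Gg CO2-equivalent back into Gg CH4; the context must be known."""
    return gg_co2eq / GWP100_CH4[gwp_context]


mass = 10.0  # Gg CH4 per year, made-up value
ar4 = to_gwp(mass, "AR4GWP100")
sar = to_gwp(mass, "SARGWP100")

print(ar4, sar)                   # same mass, different CO2-equivalents
print(to_mass(ar4, "AR4GWP100"))  # round trip recovers the mass
```

The round trip only works because the context is known, which is why it is useful that the GWP context is retained in the attrs.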

Dropping units

Sometimes, it is necessary or convenient to drop the units, for example to use arrays as input for external functions which are unit-naive. This can be done safely by first converting to the target unit, then dequantifying the dataset or array:

da_nounits = ds["CH4"].pint.to("Mt CH4 / year").pr.dequantify()
da_nounits.attrs
{'entity': 'CH4', 'units': 'CH4 * megametric_ton / year'}

Note that the units are then stored in the DataArray’s attrs, and can be restored using the xarray.DataArray.pr.quantify() function.
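As an illustration of why the conversion should happen before the units are dropped, the sketch below passes bare numbers to a unit-naive NumPy function. The values are made up and stand in for a dequantified array; the units string mirrors the attrs shown above. Nothing here uses primap2 or Pint.

```python
import numpy as np

# Made-up magnitudes standing in for a dequantified array.
units = "CH4 * megametric_ton / year"
years_since_2015 = np.array([0.0, 1.0, 2.0, 3.0])
emissions = np.array([1.0, 1.5, 2.0, 2.5])

# np.polyfit knows nothing about units: it only sees bare floats, so the
# fitted slope is implicitly expressed in whatever unit the input had.
slope, intercept = np.polyfit(years_since_2015, emissions, deg=1)

print(f"trend: {slope:.2f} ({units}) per year, starting at {intercept:.2f}")
```

Had the input been left in Gg instead of Mt, the same call would have returned a slope a thousand times larger with no warning, so keeping the units string next to the numbers is the only record of what they mean.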

Descriptive statistics

To get an overview of missing information in a Dataset or DataArray, you can use the xarray.DataArray.pr.coverage() function. It summarizes the number of non-NaN data points.

To illustrate this, we use an array with missing information:

import numpy as np
import pandas as pd
import xarray as xr

time = pd.date_range("2000-01-01", "2003-01-01", freq="YS")
area_iso3 = np.array(["COL", "ARG", "MEX"])
category_ipcc = np.array(["1", "2"])
coords = [
    ("category (IPCC2006)", category_ipcc),
    ("area (ISO3)", area_iso3),
    ("time", time),
]
da = xr.DataArray(
    data=[
        [
            [1, 2, 3, 4],
            [np.nan, np.nan, np.nan, np.nan],
            [1, 2, 3, np.nan],
        ],
        [
            [np.nan, 2, np.nan, 4],
            [1, np.nan, 3, np.nan],
            [1, np.nan, 3, np.nan],
        ],
    ],
    coords=coords,
)

da
<xarray.DataArray (category (IPCC2006): 2, area (ISO3): 3, time: 4)> Size: 192B
array([[[ 1.,  2.,  3.,  4.],
        [nan, nan, nan, nan],
        [ 1.,  2.,  3., nan]],

       [[nan,  2., nan,  4.],
        [ 1., nan,  3., nan],
        [ 1., nan,  3., nan]]])
Coordinates:
  * category (IPCC2006)  (category (IPCC2006)) <U1 8B '1' '2'
  * area (ISO3)          (area (ISO3)) <U3 36B 'COL' 'ARG' 'MEX'
  * time                 (time) datetime64[ns] 32B 2000-01-01 ... 2003-01-01

With this array, we can now obtain coverage statistics along given dimensions:

da.pr.coverage("area")
area (ISO3)
COL    6
ARG    2
MEX    5
Name: coverage, dtype: int64
da.pr.coverage("time", "area")
area (ISO3)  COL  ARG  MEX
time
2000-01-01     1    1    2
2001-01-01     2    0    1
2002-01-01     1    1    2
2003-01-01     2    0    0
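The numbers above can be cross-checked with plain NumPy: coverage along a set of dimensions is the count of non-NaN values summed over all other dimensions. The sketch below reuses the example values and relies only on NumPy; it is an independent check, not the implementation used by primap2.

```python
import numpy as np

nan = np.nan
# Same values as the example array; axes are (category, area, time).
data = np.array(
    [
        [[1, 2, 3, 4], [nan, nan, nan, nan], [1, 2, 3, nan]],
        [[nan, 2, nan, 4], [1, nan, 3, nan], [1, nan, 3, nan]],
    ]
)

present = ~np.isnan(data)  # True where a data point exists

# Coverage per area: sum over category (axis 0) and time (axis 2).
per_area = present.sum(axis=(0, 2))

# Coverage per time and area: sum over category, then put time on the rows.
per_time_area = present.sum(axis=0).T

print(per_area)       # order: COL, ARG, MEX
print(per_time_area)  # rows: 2000 to 2003, columns: COL, ARG, MEX
```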

For Datasets, you can also specify the “entity” as a coordinate:

ds = primap2.tests.examples._cached_opulent_ds.copy(deep=True)
ds["CO2"].pr.loc[{"product": "milk", "area": ["COL", "MEX"]}].pint.magnitude[:] = np.nan

ds.pr.coverage("product", "entity", "area")
area (ISO3)                         COL   ARG   MEX   BOL
product (FAOSTAT) entity
milk              CO2              2016  2016  2016  2016
                  SF6              2016  2016  2016  2016
                  CH4              2016  2016  2016  2016
                  SF6 (SARGWP100)  2016  2016  2016  2016
meat              CO2              2016  2016  2016  2016
                  SF6              2016  2016  2016  2016
                  CH4              2016  2016  2016  2016
                  SF6 (SARGWP100)  2016  2016  2016  2016