Select and View Data#
Datasets#
In PRIMAP2, data is handled in xarray datasets with defined dimensions, coordinates and metadata. If you are not familiar with selecting data using xarray, we recommend reading the corresponding section in xarray’s documentation first.
To get going, we will show the most important features of the data format using a toy example.
Logging setup for the docs
# setup logging for the docs - we don't need debug logs
import sys
from loguru import logger
logger.remove()
logger.add(sys.stderr, level="INFO")
1
import primap2
import primap2.tests
ds = primap2.tests.examples.toy_ds()
ds
<xarray.Dataset> Size: 3kB
Dimensions:              (time: 6, area (ISO3): 2, category (IPCC2006): 5, source: 2)
Coordinates:
  * time                 (time) datetime64[ns] 48B 2015-01-01 ... 2020-01-01
  * area (ISO3)          (area (ISO3)) <U3 24B 'COL' 'ARG'
  * category (IPCC2006)  (category (IPCC2006)) <U3 60B '0' '1' '2' '1.A' '1.B'
  * source               (source) <U8 64B 'RAND2020' 'RAND2021'
Data variables:
    CO2                  (time, area (ISO3), category (IPCC2006), source) float64 960B [CO2·Gg/a] ...
    CH4                  (time, area (ISO3), category (IPCC2006), source) float64 960B [CH4·Gg/a] ...
    CH4 (SARGWP100)      (time, area (ISO3), category (IPCC2006), source) float64 960B [CO2·Gg/a] ...
Attributes:
    area:     area (ISO3)
    cat:      category (IPCC2006)
The representation above lists the coordinates and variables of the toy dataset. As mentioned, primap2 datasets are xarray datasets, but with clearly defined naming conventions so that the data is self-describing.
Each dataset has a time dimension, an area dimension and a source dimension. In our toy example, we additionally have a category dimension. For the area and category dimensions, the terminology used is given in parentheses in the dimension name, e.g. ISO3 for the area. The terminologies are defined in the separate climate-categories package, so that the meaning of the codes is clearly defined.
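The naming convention for dimensions can be illustrated with a small, self-contained sketch. The helper split_dimension_name is hypothetical and not part of primap2; it only assumes the "name (terminology)" pattern described above.

```python
import re

# Hypothetical helper, not part of primap2: split a dimension name such as
# "area (ISO3)" into the bare name and the terminology given in parentheses.
_DIM_PATTERN = re.compile(r"^(?P<name>[^()]+?)\s*(?:\((?P<terminology>[^()]+)\))?$")


def split_dimension_name(dim):
    """Return (bare_name, terminology); terminology is None if absent."""
    match = _DIM_PATTERN.match(dim)
    if match is None:
        raise ValueError(f"unexpected dimension name: {dim!r}")
    return match.group("name"), match.group("terminology")


print(split_dimension_name("area (ISO3)"))
print(split_dimension_name("category (IPCC2006)"))
print(split_dimension_name("time"))
```

Dimensions without a terminology, such as time and source, simply yield None as the second element.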
The dataset contains data variables. Each greenhouse gas is stored in a separate data variable. If a data variable contains global warming potential equivalent emissions instead of mass emissions, the metric used is given in parentheses in the variable name, e.g. CH4 (SARGWP100).
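As a rough illustration of this convention, the following sketch builds a variable name from an entity and an optional GWP metric. The function variable_name is a hypothetical helper for this page, not a primap2 function.

```python
# Hypothetical helper, not part of primap2: build the conventional variable
# name from an entity and an optional global warming potential metric.
def variable_name(entity, gwp_context=None):
    if gwp_context is None:
        return entity  # mass emissions, e.g. "CH4"
    return f"{entity} ({gwp_context})"  # GWP-weighted emissions


print(variable_name("CH4"))
print(variable_name("CH4", "SARGWP100"))
```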
Selecting#
Data can be selected using the xarray indexing methods, but PRIMAP2 also provides its own versions of some of xarray’s selection methods, which are easier to use in the primap2 context.
Getitem#
The following two selections return the same result:
ds["area (ISO3)"]
<xarray.DataArray 'area (ISO3)' (area (ISO3): 2)> Size: 24B
array(['COL', 'ARG'], dtype='<U3')
Coordinates:
  * area (ISO3)  (area (ISO3)) <U3 24B 'COL' 'ARG'
ds.pr["area"]
<xarray.DataArray 'area (ISO3)' (area (ISO3): 2)> Size: 24B
array(['COL', 'ARG'], dtype='<U3')
Coordinates:
  * area (ISO3)  (area (ISO3)) <U3 24B 'COL' 'ARG'
The loc Indexer#
Similarly, a version of the loc indexer is provided which works with the bare dimension names:
ds.pr.loc[{"time": slice("2016", "2018"), "area": "COL"}]
<xarray.Dataset> Size: 880B
Dimensions:              (time: 3, category (IPCC2006): 5, source: 2)
Coordinates:
  * time                 (time) datetime64[ns] 24B 2016-01-01 ... 2018-01-01
    area (ISO3)          <U3 12B 'COL'
  * category (IPCC2006)  (category (IPCC2006)) <U3 60B '0' '1' '2' '1.A' '1.B'
  * source               (source) <U8 64B 'RAND2020' 'RAND2021'
Data variables:
    CO2                  (time, category (IPCC2006), source) float64 240B [CO2·Gg/a] ...
    CH4                  (time, category (IPCC2006), source) float64 240B [CH4·Gg/a] ...
    CH4 (SARGWP100)      (time, category (IPCC2006), source) float64 240B [CO2·Gg/a] ...
Attributes:
    area:     area (ISO3)
    cat:      category (IPCC2006)
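One way to picture how bare names can work is as aliases that are looked up in the dataset attributes (area and cat in the toy dataset) and translated into the full dimension names. The following is a minimal sketch of that idea in plain Python. The functions resolve_dim and resolve_indexers are hypothetical, and this is not primap2's actual implementation.

```python
# Attributes and dimensions as shown for the toy dataset. The resolver is an
# illustrative sketch and not how primap2 is implemented internally.
attrs = {"area": "area (ISO3)", "cat": "category (IPCC2006)"}
dims = ("time", "area (ISO3)", "category (IPCC2006)", "source")


def resolve_dim(key, attrs, dims):
    """Translate a bare dimension name into the full dimension name."""
    if key in dims:  # already a full name, e.g. "time" or "area (ISO3)"
        return key
    if key in attrs and attrs[key] in dims:  # bare alias stored in attrs
        return attrs[key]
    raise KeyError(key)


def resolve_indexers(indexers, attrs, dims):
    """Rewrite the keys of a loc-style selection dict to full names."""
    return {resolve_dim(k, attrs, dims): v for k, v in indexers.items()}


selection = {"time": slice("2016", "2018"), "area": "COL"}
print(resolve_indexers(selection, attrs, dims))
```

After such a translation, the selection can be passed to the standard xarray loc indexer, which expects the full dimension names.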
Negative Selections#
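Using the primap2 loc indexer, you can also use negative selections to select everything but the specified value or values along a dimension. Note that in the example below, the time slice from 2002 to 2005 lies outside the 2015 to 2020 range of the toy dataset, so the result has an empty time dimension; the category dimension still shows the effect of the negative selection: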
from primap2 import Not
ds.pr.loc[{"time": slice("2002", "2005"), "cat": Not(["0", "1", "2"])}]
<xarray.Dataset> Size: 112B
Dimensions:              (time: 0, area (ISO3): 2, category (IPCC2006): 2, source: 2)
Coordinates:
  * time                 (time) datetime64[ns] 0B
  * area (ISO3)          (area (ISO3)) <U3 24B 'COL' 'ARG'
  * category (IPCC2006)  (category (IPCC2006)) <U3 24B '1.A' '1.B'
  * source               (source) <U8 64B 'RAND2020' 'RAND2021'
Data variables:
    CO2                  (time, area (ISO3), category (IPCC2006), source) float64 0B [CO2·Gg/a] ...
    CH4                  (time, area (ISO3), category (IPCC2006), source) float64 0B [CH4·Gg/a] ...
    CH4 (SARGWP100)      (time, area (ISO3), category (IPCC2006), source) float64 0B [CO2·Gg/a] ...
Attributes:
    area:     area (ISO3)
    cat:      category (IPCC2006)
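The semantics of a negative selection can be sketched in a few lines of plain Python. The class Exclude below is a hypothetical stand-in used only for illustration; it is not primap2.Not and makes no claim about how primap2 implements it.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Exclude:
    """Hypothetical stand-in for a negative selector (not primap2.Not itself)."""

    values: tuple


def select(labels, selector):
    """Return the labels matched by a positive or negative selector."""
    if isinstance(selector, Exclude):
        return [label for label in labels if label not in selector.values]
    wanted = [selector] if isinstance(selector, str) else list(selector)
    return [label for label in labels if label in wanted]


categories = ["0", "1", "2", "1.A", "1.B"]
print(select(categories, Exclude(("0", "1", "2"))))  # ['1.A', '1.B']
print(select(categories, ["0", "1"]))  # ['0', '1']
```

The first call keeps exactly the categories that remain in the output above, 1.A and 1.B.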
Metadata#
We store metadata about the whole dataset in the attrs of the dataset, and metadata about specific data variables in their respective attrs.
ds.attrs
{'area': 'area (ISO3)', 'cat': 'category (IPCC2006)'}
ds["CH4 (SARGWP100)"].attrs
{'entity': 'CH4', 'gwp_context': 'SARGWP100'}
In our toy example, there are only a few technical metadata values, which are mainly convenient for tasks such as accessing the global warming potential metric without resorting to string processing. However, you can also add more information, for example a short description of your dataset in the title attribute:
ds.attrs["title"] = "A toy example dataset which contains random data."
We have standardized names for a few attributes (e.g. title), which can then also be accessed via the pr namespace:
ds.pr.title
'A toy example dataset which contains random data.'
You can find the definition of all standardized attributes at Dataset Attributes.
Unit handling#
PRIMAP2 handles units with the openscm_units package, which is based on the Pint library, together with the pint-xarray library.
Unit information#
To access the unit information, you can use the pint accessor on DataArrays provided by pint-xarray:
ds["CH4"].pint.units
Simple conversions#
Simple unit conversions are possible using standard Pint functions:
ch4_kt_per_day = ds["CH4"].pint.to("kt CH4 / day")
ch4_kt_per_day.pint.units
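The arithmetic behind this conversion can be checked by hand. The sketch below assumes that 1 Gg equals 1 kt (both are 10^9 g) and that a year is a Julian year of 365.25 days, which is how Pint's default registry defines it. The helper gg_per_year_to_kt_per_day is hypothetical and only mirrors the conversion factor; in practice the unit registry does this for you.

```python
# Assumptions: 1 Gg = 1 kt (both are 10**9 g), and one year ("a") is a Julian
# year of 365.25 days, as in Pint's default unit definitions.
DAYS_PER_YEAR = 365.25


def gg_per_year_to_kt_per_day(value):
    """Convert a rate in Gg/a to kt/day (hypothetical helper)."""
    return value / DAYS_PER_YEAR


print(gg_per_year_to_kt_per_day(365.25))  # 1.0
```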
CO2 equivalent units and mass units#
To convert mass units (emissions of gases) into global warming potentials in units of equivalent CO2 emissions, you have to specify a global warming potential context (also known as global warming potential metric):
ch4_ar4 = ds["CH4"].pr.convert_to_gwp(gwp_context="AR4GWP100", units="Gg CO2 / year")
# The information about the used GWP context is retained:
ch4_ar4.attrs
{'entity': 'CH4', 'gwp_context': 'AR4GWP100'}
Because the GWP context used for conversion is stored, it is easy to convert back to mass units:
ch4 = ch4_ar4.pr.convert_to_mass()
ch4.attrs
{'entity': 'CH4'}
The stored GWP context can also be used to convert another array using the same context:
ch4_sar = ds["CH4"].pr.convert_to_gwp_like(ds["CH4 (SARGWP100)"])
ch4_sar.attrs
{'entity': 'CH4', 'gwp_context': 'SARGWP100'}
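Conceptually, a GWP conversion multiplies the mass emissions by the gas-specific global warming potential of the chosen metric, and converting back divides by the same factor. The sketch below uses the 100-year GWP of CH4 from the IPCC Second Assessment Report (21) and Fourth Assessment Report (25). The dictionary and the two helper functions are illustrative only and are not the primap2 API.

```python
# 100-year GWP values for CH4 from the IPCC Second (SAR) and Fourth (AR4)
# Assessment Reports. The dictionary and helpers are illustrative only.
GWP100 = {
    "SARGWP100": {"CH4": 21.0},
    "AR4GWP100": {"CH4": 25.0},
}


def to_co2_equivalent(mass, entity, gwp_context):
    """Mass emissions -> CO2-equivalent emissions under the given metric."""
    return mass * GWP100[gwp_context][entity]


def to_mass(co2_equivalent, entity, gwp_context):
    """CO2-equivalent emissions -> mass emissions; needs the same metric."""
    return co2_equivalent / GWP100[gwp_context][entity]


ch4_mass = 2.0  # Gg CH4 / a
ch4_ar4 = to_co2_equivalent(ch4_mass, "CH4", "AR4GWP100")
print(ch4_ar4)  # 50.0 (Gg CO2 / a)
print(to_mass(ch4_ar4, "CH4", "AR4GWP100"))  # 2.0
```

The sketch also shows why the GWP context has to be stored with the data: without knowing which metric was used, the conversion back to mass units would be ambiguous.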
Dropping units#
Sometimes, it is necessary or convenient to drop the units, for example to use arrays as input for external functions which are unit-naive. This can be done safely by first converting to the target unit, then dequantifying the dataset or array:
da_nounits = ds["CH4"].pint.to("Mt CH4 / year").pr.dequantify()
da_nounits.attrs
{'entity': 'CH4', 'units': 'CH4 * megametric_ton / year'}
Note that the units are then stored in the DataArray’s attrs, and can be restored using the xarray.DataArray.pr.quantify() function.
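The round trip can be pictured with plain Python containers: convert to the target unit first, then move the unit string into the attributes, and later read it back. The functions dequantify and quantify below are simplified stand-ins written for this illustration; they only mimic the idea and are not the primap2 methods.

```python
# Illustrative stand-ins for a quantity with units and a unit-naive array.
# These functions only mimic the idea and are not the primap2 methods.
def dequantify(magnitudes, units, attrs):
    """Strip units, remembering them in the attrs under the key "units"."""
    return list(magnitudes), {**attrs, "units": units}


def quantify(magnitudes, attrs):
    """Restore the units from attrs and remove the "units" entry again."""
    remaining = {k: v for k, v in attrs.items() if k != "units"}
    return list(magnitudes), attrs["units"], remaining


# Convert Gg CH4 / a to Mt CH4 / a first (1 Mt = 1000 Gg), then drop units.
values_gg = [1500.0, 2500.0]
values_mt = [v / 1000.0 for v in values_gg]
plain, attrs = dequantify(values_mt, "Mt CH4 / year", {"entity": "CH4"})
print(plain, attrs)
print(quantify(plain, attrs))
```

Converting before dropping the units matters: once the units are gone, the external function has no way to tell Gg from Mt.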
Descriptive statistics#
To get an overview of the missing information in a Dataset or DataArray, you can use the xarray.DataArray.pr.coverage() function. It gives you a summary of the number of non-NaN data points.
To illustrate this, we use an array with missing information:
import numpy as np
import pandas as pd
import xarray as xr
time = pd.date_range("2000-01-01", "2003-01-01", freq="YS")
area_iso3 = np.array(["COL", "ARG", "MEX"])
category_ipcc = np.array(["1", "2"])
coords = [
    ("category (IPCC2006)", category_ipcc),
    ("area (ISO3)", area_iso3),
    ("time", time),
]
da = xr.DataArray(
    data=[
        [
            [1, 2, 3, 4],
            [np.nan, np.nan, np.nan, np.nan],
            [1, 2, 3, np.nan],
        ],
        [
            [np.nan, 2, np.nan, 4],
            [1, np.nan, 3, np.nan],
            [1, np.nan, 3, np.nan],
        ],
    ],
    coords=coords,
)
da
<xarray.DataArray (category (IPCC2006): 2, area (ISO3): 3, time: 4)> Size: 192B
array([[[ 1.,  2.,  3.,  4.],
        [nan, nan, nan, nan],
        [ 1.,  2.,  3., nan]],

       [[nan,  2., nan,  4.],
        [ 1., nan,  3., nan],
        [ 1., nan,  3., nan]]])
Coordinates:
  * category (IPCC2006)  (category (IPCC2006)) <U1 8B '1' '2'
  * area (ISO3)          (area (ISO3)) <U3 36B 'COL' 'ARG' 'MEX'
  * time                 (time) datetime64[ns] 32B 2000-01-01 ... 2003-01-01
With this array, we can now obtain coverage statistics along given dimensions:
da.pr.coverage("area")
area (ISO3)
COL 6
ARG 2
MEX 5
Name: coverage, dtype: int64
da.pr.coverage("time", "area")
area (ISO3) | COL | ARG | MEX |
---|---|---|---|
time | |||
2000-01-01 | 1 | 1 | 2 |
2001-01-01 | 2 | 0 | 1 |
2002-01-01 | 1 | 1 | 2 |
2003-01-01 | 2 | 0 | 0 |
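The numbers in both summaries can be reproduced by counting the non-NaN entries by hand. The following standard-library sketch uses the same values as the example array; its coverage function is a simplified re-implementation for illustration and not the primap2 method.

```python
import math
from collections import Counter

nan = math.nan
areas = ["COL", "ARG", "MEX"]
years = [2000, 2001, 2002, 2003]
# Same values as the example array, indexed as data[category][area][time].
data = [
    [[1, 2, 3, 4], [nan, nan, nan, nan], [1, 2, 3, nan]],
    [[nan, 2, nan, 4], [1, nan, 3, nan], [1, nan, 3, nan]],
]


def coverage(data, *dims):
    """Count non-NaN points per combination of the requested dimensions."""
    counts = Counter()
    for per_category in data:
        for area, per_area in zip(areas, per_category):
            for year, value in zip(years, per_area):
                labels = {"area": area, "time": year}
                key = tuple(labels[d] for d in dims)
                # Adding 0 still creates the key, so empty cells are reported.
                counts[key] += 0 if math.isnan(value) else 1
    return dict(counts)


print(coverage(data, "area"))
print(coverage(data, "time", "area")[(2001, "COL")])
```

All dimensions that are not requested (here the category) are summed over, which is why COL reaches 6 although each category holds at most 4 time steps.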
For Datasets, you can also specify the “entity” as a coordinate:
ds = primap2.tests.examples._cached_opulent_ds.copy(deep=True)
ds["CO2"].pr.loc[{"product": "milk", "area": ["COL", "MEX"]}].pint.magnitude[:] = np.nan
ds.pr.coverage("product", "entity", "area")
area (ISO3) | | COL | ARG | MEX | BOL |
---|---|---|---|---|---|
product (FAOSTAT) | entity | | | | |
milk | CO2 | 2016 | 2016 | 2016 | 2016 |
milk | SF6 | 2016 | 2016 | 2016 | 2016 |
milk | CH4 | 2016 | 2016 | 2016 | 2016 |
milk | SF6 (SARGWP100) | 2016 | 2016 | 2016 | 2016 |
meat | CO2 | 2016 | 2016 | 2016 | 2016 |
meat | SF6 | 2016 | 2016 | 2016 | 2016 |
meat | CH4 | 2016 | 2016 | 2016 | 2016 |
meat | SF6 (SARGWP100) | 2016 | 2016 | 2016 | 2016 |