General Usage

Because PRIMAP2 builds on xarray, all xarray functionality is available right away. Additional functionality is provided in the primap2 package and in the pr namespace on xarray objects. In this section, we will present examples of general PRIMAP2 usage, which are likely to be helpful in any context. More specialized functionality for specific tasks is presented in the next section.

Importing

[1]:
import primap2  # injects the "pr" namespace into xarray
[2]:
# set up logging for the docs - don't show debug messages
import sys
from loguru import logger

logger.remove()
logger.add(sys.stderr, level="INFO")
[2]:
1

Loading Datafiles

Loading from netcdf files

PRIMAP2's native storage format is netcdf5, and datasets can be written to and loaded from netcdf5 files using PRIMAP2 functions. We will load the “minimal” and “opulent” Datasets from the data format section:

[3]:
ds_min = primap2.open_dataset("minimal_ds.nc")
ds = primap2.open_dataset("opulent_ds.nc")

ds
[3]:
<xarray.Dataset> Size: 518kB
Dimensions:               (time: 21, area (ISO3): 4, category (IPCC 2006): 8,
                           animal (FAOSTAT): 3, product (FAOSTAT): 2,
                           scenario (FAOSTAT): 2, provenance: 1, model: 1,
                           source: 2)
Coordinates:
  * animal (FAOSTAT)      (animal (FAOSTAT)) <U5 60B 'cow' 'swine' 'goat'
  * area (ISO3)           (area (ISO3)) <U3 48B 'COL' 'ARG' 'MEX' 'BOL'
  * category (IPCC 2006)  (category (IPCC 2006)) <U3 96B '0' '1' ... '1.A' '1.B'
    category_names        (category (IPCC 2006)) <U14 448B 'total' ... 'light...
  * model                 (model) <U8 32B 'FANCYFAO'
  * product (FAOSTAT)     (product (FAOSTAT)) <U4 32B 'milk' 'meat'
  * provenance            (provenance) <U9 36B 'projected'
  * scenario (FAOSTAT)    (scenario (FAOSTAT)) <U7 56B 'highpop' 'lowpop'
  * source                (source) <U8 64B 'RAND2020' 'RAND2021'
  * time                  (time) datetime64[ns] 168B 2000-01-01 ... 2020-01-01
Data variables:
    CH4                   (time, area (ISO3), category (IPCC 2006), animal (FAOSTAT), product (FAOSTAT), scenario (FAOSTAT), provenance, model, source) float64 129kB [CH4·Gg/a] ...
    CO2                   (time, area (ISO3), category (IPCC 2006), animal (FAOSTAT), product (FAOSTAT), scenario (FAOSTAT), provenance, model, source) float64 129kB [CO2·Gg/a] ...
    SF6                   (time, area (ISO3), category (IPCC 2006), animal (FAOSTAT), product (FAOSTAT), scenario (FAOSTAT), provenance, model, source) float64 129kB [Gg·SF6/a] ...
    SF6 (SARGWP100)       (time, area (ISO3), category (IPCC 2006), animal (FAOSTAT), product (FAOSTAT), scenario (FAOSTAT), provenance, model, source) float64 129kB [CO2·Gg/a] ...
    population            (time, area (ISO3), provenance, model, source) float64 1kB [] ...
Attributes:
    area:                area (ISO3)
    cat:                 category (IPCC 2006)
    comment:             GHG inventory data ...
    contact:             lol_no_one_will_answer@example.com
    entity_terminology:  primap2
    history:             2021-01-14 14:50 data invented\n2021-01-14 14:51 add...
    institution:         PIK
    references:          doi:10.1012
    rights:              Use however you want.
    scen:                scenario (FAOSTAT)
    sec_cats:            ['animal (FAOSTAT)', 'product (FAOSTAT)']
    title:               Completely invented GHG inventory data
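
Datasets can be written back to disk with the pr.to_netcdf function. A minimal sketch (the file name here is only an example):

ds.pr.to_netcdf("opulent_copy.nc")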

Accessing metadata

Metadata is stored in the attrs of Datasets, and you can access it there directly. Additionally, the PRIMAP2 metadata is available under the .pr namespace, which has the advantage that autocompletion works in IPython and IDEs, and typos are caught immediately.

[4]:
ds.attrs
[4]:
{'area': 'area (ISO3)',
 'cat': 'category (IPCC 2006)',
 'comment': 'GHG inventory data ...',
 'contact': 'lol_no_one_will_answer@example.com',
 'entity_terminology': 'primap2',
 'history': '2021-01-14 14:50 data invented\n2021-01-14 14:51 additional processing step\n',
 'institution': 'PIK',
 'references': 'doi:10.1012',
 'rights': 'Use however you want.',
 'scen': 'scenario (FAOSTAT)',
 'sec_cats': ['animal (FAOSTAT)', 'product (FAOSTAT)'],
 'title': 'Completely invented GHG inventory data'}
[5]:
ds.pr.title
[5]:
'Completely invented GHG inventory data'
[6]:
ds.pr.title = "Another title"
ds.pr.title
[6]:
'Another title'
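
The same accessor pattern works for the other metadata fields listed in attrs above, for example (a sketch, assuming the accessor exposes these fields in the same way):

ds.pr.contact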

Selecting data

Data can be selected using the xarray indexing methods, but PRIMAP2 also provides its own versions of some of xarray’s selection methods, which work using the dimension names without the category set, for example “area” instead of “area (ISO3)”.

Getitem

The following two selections are equivalent:

[7]:
ds["area (ISO3)"]
[7]:
<xarray.DataArray 'area (ISO3)' (area (ISO3): 4)> Size: 48B
array(['COL', 'ARG', 'MEX', 'BOL'], dtype='<U3')
Coordinates:
  * area (ISO3)  (area (ISO3)) <U3 48B 'COL' 'ARG' 'MEX' 'BOL'
[8]:
ds.pr["area"]
[8]:
<xarray.DataArray 'area (ISO3)' (area (ISO3): 4)> Size: 48B
array(['COL', 'ARG', 'MEX', 'BOL'], dtype='<U3')
Coordinates:
  * area (ISO3)  (area (ISO3)) <U3 48B 'COL' 'ARG' 'MEX' 'BOL'

The loc Indexer

Similarly, a version of the loc indexer is provided which works with the bare dimension names:

[9]:
ds.pr.loc[{"time": slice("2002", "2005"), "animal": "cow"}]
[9]:
<xarray.Dataset> Size: 34kB
Dimensions:               (time: 4, area (ISO3): 4, category (IPCC 2006): 8,
                           product (FAOSTAT): 2, scenario (FAOSTAT): 2,
                           provenance: 1, model: 1, source: 2)
Coordinates:
    animal (FAOSTAT)      <U5 20B 'cow'
  * area (ISO3)           (area (ISO3)) <U3 48B 'COL' 'ARG' 'MEX' 'BOL'
  * category (IPCC 2006)  (category (IPCC 2006)) <U3 96B '0' '1' ... '1.A' '1.B'
    category_names        (category (IPCC 2006)) <U14 448B 'total' ... 'light...
  * model                 (model) <U8 32B 'FANCYFAO'
  * product (FAOSTAT)     (product (FAOSTAT)) <U4 32B 'milk' 'meat'
  * provenance            (provenance) <U9 36B 'projected'
  * scenario (FAOSTAT)    (scenario (FAOSTAT)) <U7 56B 'highpop' 'lowpop'
  * source                (source) <U8 64B 'RAND2020' 'RAND2021'
  * time                  (time) datetime64[ns] 32B 2002-01-01 ... 2005-01-01
Data variables:
    CH4                   (time, area (ISO3), category (IPCC 2006), product (FAOSTAT), scenario (FAOSTAT), provenance, model, source) float64 8kB [CH4·Gg/a] ...
    CO2                   (time, area (ISO3), category (IPCC 2006), product (FAOSTAT), scenario (FAOSTAT), provenance, model, source) float64 8kB [CO2·Gg/a] ...
    SF6                   (time, area (ISO3), category (IPCC 2006), product (FAOSTAT), scenario (FAOSTAT), provenance, model, source) float64 8kB [Gg·SF6/a] ...
    SF6 (SARGWP100)       (time, area (ISO3), category (IPCC 2006), product (FAOSTAT), scenario (FAOSTAT), provenance, model, source) float64 8kB [CO2·Gg/a] ...
    population            (time, area (ISO3), provenance, model, source) float64 256B [] ...
Attributes:
    area:                area (ISO3)
    cat:                 category (IPCC 2006)
    comment:             GHG inventory data ...
    contact:             lol_no_one_will_answer@example.com
    entity_terminology:  primap2
    history:             2021-01-14 14:50 data invented\n2021-01-14 14:51 add...
    institution:         PIK
    references:          doi:10.1012
    rights:              Use however you want.
    scen:                scenario (FAOSTAT)
    sec_cats:            ['animal (FAOSTAT)', 'product (FAOSTAT)']
    title:               Another title

It also works on DataArrays:

[10]:
da = ds["CO2"]

da_subset = da.pr.loc[
    {
        "time": slice("2002", "2005"),
        "animal": "cow",
        "category": "0",
        "area": "COL",
    }
]
da_subset
[10]:
<xarray.DataArray 'CO2' (time: 4, product (FAOSTAT): 2, scenario (FAOSTAT): 2,
                         provenance: 1, model: 1, source: 2)> Size: 256B
<Quantity([[[[[[0.13213244 0.30232155]]]


   [[[0.56331297 0.86483736]]]]



  [[[[0.4988332  0.31410627]]]


   [[[0.10251511 0.81517807]]]]]




 [[[[[0.17221409 0.44064751]]]


   [[[0.08172373 0.93020094]]]]

...

  [[[[0.41901311 0.35808953]]]


   [[[0.66955278 0.09849324]]]]]




 [[[[[0.54763585 0.6474731 ]]]


   [[[0.89670035 0.70107409]]]]



  [[[[0.78242851 0.08874086]]]


   [[[0.70000787 0.16270348]]]]]], 'CO2 * gigagram / year')>
Coordinates:
    animal (FAOSTAT)      <U5 20B 'cow'
    area (ISO3)           <U3 12B 'COL'
    category (IPCC 2006)  <U3 12B '0'
    category_names        <U14 56B 'total'
  * model                 (model) <U8 32B 'FANCYFAO'
  * product (FAOSTAT)     (product (FAOSTAT)) <U4 32B 'milk' 'meat'
  * provenance            (provenance) <U9 36B 'projected'
  * scenario (FAOSTAT)    (scenario (FAOSTAT)) <U7 56B 'highpop' 'lowpop'
  * source                (source) <U8 64B 'RAND2020' 'RAND2021'
  * time                  (time) datetime64[ns] 32B 2002-01-01 ... 2005-01-01
Attributes:
    entity:   CO2

Negative Selections

Using the primap2 loc indexer, you can also make negative selections, which select everything but the specified value or values along a dimension:

[11]:
from primap2 import Not

ds.pr.loc[{"time": slice("2002", "2005"), "animal": Not("cow")}]
[11]:
<xarray.Dataset> Size: 67kB
Dimensions:               (time: 4, area (ISO3): 4, category (IPCC 2006): 8,
                           animal (FAOSTAT): 2, product (FAOSTAT): 2,
                           scenario (FAOSTAT): 2, provenance: 1, model: 1,
                           source: 2)
Coordinates:
  * animal (FAOSTAT)      (animal (FAOSTAT)) <U5 40B 'swine' 'goat'
  * area (ISO3)           (area (ISO3)) <U3 48B 'COL' 'ARG' 'MEX' 'BOL'
  * category (IPCC 2006)  (category (IPCC 2006)) <U3 96B '0' '1' ... '1.A' '1.B'
    category_names        (category (IPCC 2006)) <U14 448B 'total' ... 'light...
  * model                 (model) <U8 32B 'FANCYFAO'
  * product (FAOSTAT)     (product (FAOSTAT)) <U4 32B 'milk' 'meat'
  * provenance            (provenance) <U9 36B 'projected'
  * scenario (FAOSTAT)    (scenario (FAOSTAT)) <U7 56B 'highpop' 'lowpop'
  * source                (source) <U8 64B 'RAND2020' 'RAND2021'
  * time                  (time) datetime64[ns] 32B 2002-01-01 ... 2005-01-01
Data variables:
    CH4                   (time, area (ISO3), category (IPCC 2006), animal (FAOSTAT), product (FAOSTAT), scenario (FAOSTAT), provenance, model, source) float64 16kB [CH4·Gg/a] ...
    CO2                   (time, area (ISO3), category (IPCC 2006), animal (FAOSTAT), product (FAOSTAT), scenario (FAOSTAT), provenance, model, source) float64 16kB [CO2·Gg/a] ...
    SF6                   (time, area (ISO3), category (IPCC 2006), animal (FAOSTAT), product (FAOSTAT), scenario (FAOSTAT), provenance, model, source) float64 16kB [Gg·SF6/a] ...
    SF6 (SARGWP100)       (time, area (ISO3), category (IPCC 2006), animal (FAOSTAT), product (FAOSTAT), scenario (FAOSTAT), provenance, model, source) float64 16kB [CO2·Gg/a] ...
    population            (time, area (ISO3), provenance, model, source) float64 256B [] ...
Attributes:
    area:                area (ISO3)
    cat:                 category (IPCC 2006)
    comment:             GHG inventory data ...
    contact:             lol_no_one_will_answer@example.com
    entity_terminology:  primap2
    history:             2021-01-14 14:50 data invented\n2021-01-14 14:51 add...
    institution:         PIK
    references:          doi:10.1012
    rights:              Use however you want.
    scen:                scenario (FAOSTAT)
    sec_cats:            ['animal (FAOSTAT)', 'product (FAOSTAT)']
    title:               Another title
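
Not can also be given a list of values to exclude several of them at once. A sketch analogous to the selection above (assuming the list form behaves like the scalar form):

ds.pr.loc[{"time": slice("2002", "2005"), "animal": Not(["cow", "goat"])}]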

Setting data

PRIMAP2 provides a unified API to introduce new data values, fill missing information, and overwrite existing information: the da.pr.set function and its sibling ds.pr.set.

The basic signature of the set functions is set(dimension, keys, values), and it returns the changed object without changing the original one. Use it like this:

[12]:
da = ds_min["CO2"].loc[{"time": slice("2000", "2005")}]
da
[12]:
<xarray.DataArray 'CO2' (time: 6, area (ISO3): 4, source: 1)> Size: 192B
<Quantity([[[0.66352018]
  [0.91683183]
  [0.10638745]
  [0.61416341]]

 [[0.33822533]
  [0.68233863]
  [0.73953851]
  [0.79057726]]

 [[0.11409101]
  [0.25072807]
  [0.85044897]
  [0.40873907]]

 [[0.80147707]
  [0.85209597]
  [0.71521289]
  [0.08031265]]

 [[0.61018888]
  [0.88112521]
  [0.55500009]
  [0.11105145]]

 [[0.81245612]
  [0.14774848]
  [0.14072429]
  [0.13586946]]], 'CO2 * gigagram / year')>
Coordinates:
  * area (ISO3)  (area (ISO3)) <U3 48B 'COL' 'ARG' 'MEX' 'BOL'
  * source       (source) <U8 32B 'RAND2020'
  * time         (time) datetime64[ns] 48B 2000-01-01 2001-01-01 ... 2005-01-01
Attributes:
    entity:   CO2
[13]:
import numpy as np
from primap2 import ureg

modified = da.pr.set(
    "area", "CUB", np.linspace(0, 20, 6) * ureg("Gg CO2 / year")
)
modified
[13]:
<xarray.DataArray 'CO2' (time: 6, area (ISO3): 5, source: 1)> Size: 240B
<Quantity([[[ 0.91683183]
  [ 0.61416341]
  [ 0.66352018]
  [ 0.        ]
  [ 0.10638745]]

 [[ 0.68233863]
  [ 0.79057726]
  [ 0.33822533]
  [ 4.        ]
  [ 0.73953851]]

 [[ 0.25072807]
  [ 0.40873907]
  [ 0.11409101]
  [ 8.        ]
  [ 0.85044897]]

 [[ 0.85209597]
  [ 0.08031265]
  [ 0.80147707]
  [12.        ]
  [ 0.71521289]]

 [[ 0.88112521]
  [ 0.11105145]
  [ 0.61018888]
  [16.        ]
  [ 0.55500009]]

 [[ 0.14774848]
  [ 0.13586946]
  [ 0.81245612]
  [20.        ]
  [ 0.14072429]]], 'CO2 * gigagram / year')>
Coordinates:
  * area (ISO3)  (area (ISO3)) <U3 60B 'ARG' 'BOL' 'COL' 'CUB' 'MEX'
  * source       (source) <U8 32B 'RAND2020'
  * time         (time) datetime64[ns] 48B 2000-01-01 2001-01-01 ... 2005-01-01
Attributes:
    entity:   CO2

By default, existing non-NaN values are not overwritten:

[14]:
try:
    da.pr.set("area", "COL", np.linspace(0, 20, 6) * ureg("Gg CO2 / year"))
except ValueError as err:
    print(err)
Values {'COL'} for 'area (ISO3)' already exist and contain data. Use existing='overwrite' or 'fillna' to avoid this error.

You can overwrite existing values by specifying existing="overwrite", or fill in only NaN values by specifying existing="fillna".

[15]:
da.pr.set(
    "area",
    "COL",
    np.linspace(0, 20, 6) * ureg("Gg CO2 / year"),
    existing="overwrite",
)
[15]:
<xarray.DataArray 'CO2' (time: 6, area (ISO3): 4, source: 1)> Size: 192B
<Quantity([[[ 0.91683183]
  [ 0.61416341]
  [ 0.        ]
  [ 0.10638745]]

 [[ 0.68233863]
  [ 0.79057726]
  [ 4.        ]
  [ 0.73953851]]

 [[ 0.25072807]
  [ 0.40873907]
  [ 8.        ]
  [ 0.85044897]]

 [[ 0.85209597]
  [ 0.08031265]
  [12.        ]
  [ 0.71521289]]

 [[ 0.88112521]
  [ 0.11105145]
  [16.        ]
  [ 0.55500009]]

 [[ 0.14774848]
  [ 0.13586946]
  [20.        ]
  [ 0.14072429]]], 'CO2 * gigagram / year')>
Coordinates:
  * time         (time) datetime64[ns] 48B 2000-01-01 2001-01-01 ... 2005-01-01
  * area (ISO3)  (area (ISO3)) <U3 48B 'ARG' 'BOL' 'COL' 'MEX'
  * source       (source) <U8 32B 'RAND2020'
Attributes:
    entity:   CO2
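
For comparison, a sketch of the fillna variant: because the existing "COL" values are not NaN, this call leaves them untouched and would only fill in missing values:

da.pr.set(
    "area",
    "COL",
    np.linspace(0, 20, 6) * ureg("Gg CO2 / year"),
    existing="fillna",
)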

By default, the set() function automatically extends the specified dimension to accommodate new values if some of the given keys are not present in it yet. You can change this by specifying new="error", which raises a KeyError if any of the keys is not found:

[16]:
try:
    da.pr.set(
        "area",
        ["COL", "CUB"],
        np.linspace(0, 20, 6) * ureg("Gg CO2 / year"),
        existing="overwrite",
        new="error",
    )
except KeyError as err:
    print(err)
"Values {'CUB'} not in 'area (ISO3)', use new='extend' to automatically insert new values into dim."

In particular, the set() functions can also be used with xarray’s arithmetic functions to derive values from existing data and store the result in the Dataset. As an example, we will derive better values for category 0 by summing all of its subcategories and storing the result.

First, let’s look at the current data for a small subset:

[17]:
sel = {
    "area": "COL",
    "category": ["0", "1", "2", "3", "4", "5"],
    "animal": "cow",
    "product": "milk",
    "scenario": "highpop",
    "source": "RAND2020",
}
subset = ds.pr.loc[sel].squeeze()

# TODO: currently, plotting with units still emits a warning
import warnings

with warnings.catch_warnings():
    warnings.simplefilter("ignore")
    subset["CO2"].plot.line(x="time", hue="category (IPCC 2006)")
_images/usage_28_0.png

While it is hard to see any details in this plot, it is clearly visible that category 0 is not the sum of the other categories (which should not come as a surprise since we generated the data at random).

We will now recompute category 0 for the entire dataset using set():

[18]:
cat0_new = ds.pr.loc[{"category": ["1", "2", "3", "4", "5"]}].pr.sum(
    "category"
)

ds = ds.pr.set(
    "category",
    "0",
    cat0_new,
    existing="overwrite",
)

# plot a small subset of the result
subset = ds.pr.loc[sel].squeeze()
# TODO: currently, plotting with units still emits a warning
import warnings

with warnings.catch_warnings():
    warnings.simplefilter("ignore")
    subset["CO2"].plot.line(x="time", hue="category (IPCC 2006)")
_images/usage_30_0.png

As you can see in the plot, category 0 is now computed from its subcategories. The set() method of Datasets works on all data variables in the dataset which have the corresponding dimension. In this example, the “population” variable does not have categories, so it was unchanged.

Adding dimensions

When adding a dimension to a primap2 dataset, the attrs also have to be kept up to date, so xarray's ds.expand_dims function alone does not suffice. The function ds.pr.expand_dims takes care of the attrs for you. The following example adds a category dimension with the coordinate value ‘0’ and the terminology ‘IPCC2006’ to ds_min and adds the corresponding alias to the attrs.

[19]:
ds_dim_added = ds_min.pr.expand_dims(dim='category', coord_value=['0'], terminology='IPCC2006')
ds_dim_added
[19]:
<xarray.Dataset> Size: 3kB
Dimensions:              (category (IPCC2006): 1, time: 21, area (ISO3): 4,
                          source: 1)
Coordinates:
  * category (IPCC2006)  (category (IPCC2006)) object 8B ('0',)
  * area (ISO3)          (area (ISO3)) <U3 48B 'COL' 'ARG' 'MEX' 'BOL'
  * source               (source) <U8 32B 'RAND2020'
  * time                 (time) datetime64[ns] 168B 2000-01-01 ... 2020-01-01
Data variables:
    CH4                  (category (IPCC2006), time, area (ISO3), source) float64 672B [CH4·Gg/a] ...
    CO2                  (category (IPCC2006), time, area (ISO3), source) float64 672B [CO2·Gg/a] ...
    SF6                  (category (IPCC2006), time, area (ISO3), source) float64 672B [Gg·SF6/a] ...
    SF6 (SARGWP100)      (category (IPCC2006), time, area (ISO3), source) float64 672B [CO2·Gg/a] ...
Attributes:
    area:     area (ISO3)
    cat:      category (IPCC2006)

Unit handling

PRIMAP2 uses the openscm_units package, which is based on the Pint library, for unit handling.
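
The unit registry is a regular Pint registry with the openscm_units definitions and GWP contexts loaded, so standard Pint operations work directly. A small sketch:

from primap2 import ureg

q = ureg("1 Mt CH4 / year")
q.to("Gg CH4 / year")  # plain unit conversion
q.to("Gg CO2 / year", "AR4GWP100")  # conversion within a GWP context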

CO2 equivalent units and mass units

Using global warming potential contexts, it is easy to convert mass units into CO2 equivalents:

[20]:
from primap2 import ureg  # The unit registry

sf6_gwp = ds["SF6"].pr.convert_to_gwp(
    gwp_context="AR4GWP100", units="Gg CO2 / year"
)
# The information about the used GWP context is retained:
sf6_gwp.attrs
[20]:
{'entity': 'SF6', 'gwp_context': 'AR4GWP100'}

Because the GWP context used for conversion is stored, it is equally easy to convert back to mass units:

[21]:
sf6 = sf6_gwp.pr.convert_to_mass()
sf6.attrs
[21]:
{'entity': 'SF6'}

The stored GWP context can also be used to convert another array using the same context:

[22]:
ch4_gwp = ds["CH4"].pr.convert_to_gwp_like(sf6_gwp)
ch4_gwp.attrs
[22]:
{'entity': 'CH4', 'gwp_context': 'AR4GWP100'}

Dropping units

Sometimes, it is necessary to drop the units, for example to use arrays as input for external functions which are unit-naive. This can be done safely by first converting to the target unit, then dequantifying the dataset or array:

[23]:
da_nounits = ds["CH4"].pint.to("Mt CH4 / year").pr.dequantify()
da_nounits.attrs
[23]:
{'entity': 'CH4', 'units': 'CH4 * megametric_ton / year'}

Note that the units are then stored in the DataArray’s attrs, and can be restored using the da.pr.quantify function.
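
A quick sketch of restoring the units afterwards with the quantify function mentioned above:

da_quantified = da_nounits.pr.quantify()
da_quantified.pint.units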

Descriptive statistics

To get an overview of the missing information in a Dataset or DataArray, you can use the pr.coverage function. It gives you a summary of the number of non-NaN data points:

[24]:
import numpy as np
import xarray as xr
import pandas as pd

time = pd.date_range("2000-01-01", "2003-01-01", freq="YS")
area_iso3 = np.array(["COL", "ARG", "MEX"])
category_ipcc = np.array(["1", "2"])
coords = [
    ("category (IPCC 2006)", category_ipcc),
    ("area (ISO3)", area_iso3),
    ("time", time),
]
data = np.random.rand(2, 3, 4)
data[0, 1, :] = np.nan  # sketch: make some data points missing
da = xr.DataArray(data, coords=coords)
da.pr.coverage("area")