General Usage

Because PRIMAP2 builds on xarray, all xarray functionality is available right away. Additional functionality is provided in the primap2 package and in the pr namespace on xarray objects. In this section, we will present examples of general PRIMAP2 usage, which are likely to be helpful in any context. More specialized functionality for specific tasks is presented in the next section.

Importing

[1]:
import primap2  # injects the "pr" namespace into xarray
[2]:
# set up logging for the docs - don't show debug messages
import sys
from loguru import logger

logger.remove()
logger.add(sys.stderr, level="INFO")
[2]:
1

Loading Datafiles

Loading from netcdf files

The native storage format of PRIMAP2 is netcdf, and datasets can be written to and loaded from netcdf files using PRIMAP2 functions. We will load the “minimal” and “opulent” Datasets from the data format section:

[3]:
ds_min = primap2.open_dataset("minimal_ds.nc")
ds = primap2.open_dataset("opulent_ds.nc")

ds
[3]:
<xarray.Dataset> Size: 518kB
Dimensions:               (time: 21, area (ISO3): 4, category (IPCC 2006): 8,
                           animal (FAOSTAT): 3, product (FAOSTAT): 2,
                           scenario (FAOSTAT): 2, provenance: 1, model: 1,
                           source: 2)
Coordinates:
  * animal (FAOSTAT)      (animal (FAOSTAT)) <U5 60B 'cow' 'swine' 'goat'
  * area (ISO3)           (area (ISO3)) <U3 48B 'COL' 'ARG' 'MEX' 'BOL'
  * category (IPCC 2006)  (category (IPCC 2006)) <U3 96B '0' '1' ... '1.A' '1.B'
    category_names        (category (IPCC 2006)) <U14 448B 'total' ... 'light...
  * model                 (model) <U8 32B 'FANCYFAO'
  * product (FAOSTAT)     (product (FAOSTAT)) <U4 32B 'milk' 'meat'
  * provenance            (provenance) <U9 36B 'projected'
  * scenario (FAOSTAT)    (scenario (FAOSTAT)) <U7 56B 'highpop' 'lowpop'
  * source                (source) <U8 64B 'RAND2020' 'RAND2021'
  * time                  (time) datetime64[ns] 168B 2000-01-01 ... 2020-01-01
Data variables:
    CH4                   (time, area (ISO3), category (IPCC 2006), animal (FAOSTAT), product (FAOSTAT), scenario (FAOSTAT), provenance, model, source) float64 129kB [CH4·Gg/a] ...
    CO2                   (time, area (ISO3), category (IPCC 2006), animal (FAOSTAT), product (FAOSTAT), scenario (FAOSTAT), provenance, model, source) float64 129kB [CO2·Gg/a] ...
    SF6                   (time, area (ISO3), category (IPCC 2006), animal (FAOSTAT), product (FAOSTAT), scenario (FAOSTAT), provenance, model, source) float64 129kB [Gg·SF6/a] ...
    SF6 (SARGWP100)       (time, area (ISO3), category (IPCC 2006), animal (FAOSTAT), product (FAOSTAT), scenario (FAOSTAT), provenance, model, source) float64 129kB [CO2·Gg/a] ...
    population            (time, area (ISO3), provenance, model, source) float64 1kB [] ...
Attributes:
    area:                area (ISO3)
    cat:                 category (IPCC 2006)
    comment:             GHG inventory data ...
    contact:             lol_no_one_will_answer@example.com
    entity_terminology:  primap2
    history:             2021-01-14 14:50 data invented\n2021-01-14 14:51 add...
    institution:         PIK
    references:          doi:10.1012
    rights:              Use however you want.
    scen:                scenario (FAOSTAT)
    sec_cats:            ['animal (FAOSTAT)', 'product (FAOSTAT)']
    title:               Completely invented GHG inventory data
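
Datasets can also be written back to disk. A minimal round-trip sketch using ds.pr.to_netcdf (the file name is just an example):

ds_min.pr.to_netcdf("minimal_ds_copy.nc")  # write the dataset to a netcdf file
ds_reloaded = primap2.open_dataset("minimal_ds_copy.nc")  # and read it back in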

Accessing metadata

Metadata is stored in the attrs of Datasets, and you can of course access it there directly. Additionally, you can access the PRIMAP2 metadata under the .pr namespace, which has the advantage that autocompletion works in IPython and IDEs, and typos are caught immediately.

[4]:
ds.attrs
[4]:
{'area': 'area (ISO3)',
 'cat': 'category (IPCC 2006)',
 'comment': 'GHG inventory data ...',
 'contact': 'lol_no_one_will_answer@example.com',
 'entity_terminology': 'primap2',
 'history': '2021-01-14 14:50 data invented\n2021-01-14 14:51 additional processing step\n',
 'institution': 'PIK',
 'references': 'doi:10.1012',
 'rights': 'Use however you want.',
 'scen': 'scenario (FAOSTAT)',
 'sec_cats': ['animal (FAOSTAT)', 'product (FAOSTAT)'],
 'title': 'Completely invented GHG inventory data'}
[5]:
ds.pr.title
[5]:
'Completely invented GHG inventory data'
[6]:
ds.pr.title = "Another title"
ds.pr.title
[6]:
'Another title'

Selecting data

Data can be selected using the xarray indexing methods, but PRIMAP2 also provides its own versions of some of xarray’s selection methods, which work with the bare dimension names, i.e. without the terminology in parentheses.

Getitem

The following two selections are equivalent:

[7]:
ds["area (ISO3)"]
[7]:
<xarray.DataArray 'area (ISO3)' (area (ISO3): 4)> Size: 48B
array(['COL', 'ARG', 'MEX', 'BOL'], dtype='<U3')
Coordinates:
  * area (ISO3)  (area (ISO3)) <U3 48B 'COL' 'ARG' 'MEX' 'BOL'
[8]:
ds.pr["area"]
[8]:
<xarray.DataArray 'area (ISO3)' (area (ISO3): 4)> Size: 48B
array(['COL', 'ARG', 'MEX', 'BOL'], dtype='<U3')
Coordinates:
  * area (ISO3)  (area (ISO3)) <U3 48B 'COL' 'ARG' 'MEX' 'BOL'

The loc Indexer

Similarly, a version of the loc indexer is provided which works with the bare dimension names:

[9]:
ds.pr.loc[{"time": slice("2002", "2005"), "animal": "cow"}]
[9]:
<xarray.Dataset> Size: 34kB
Dimensions:               (time: 4, area (ISO3): 4, category (IPCC 2006): 8,
                           product (FAOSTAT): 2, scenario (FAOSTAT): 2,
                           provenance: 1, model: 1, source: 2)
Coordinates:
    animal (FAOSTAT)      <U5 20B 'cow'
  * area (ISO3)           (area (ISO3)) <U3 48B 'COL' 'ARG' 'MEX' 'BOL'
  * category (IPCC 2006)  (category (IPCC 2006)) <U3 96B '0' '1' ... '1.A' '1.B'
    category_names        (category (IPCC 2006)) <U14 448B 'total' ... 'light...
  * model                 (model) <U8 32B 'FANCYFAO'
  * product (FAOSTAT)     (product (FAOSTAT)) <U4 32B 'milk' 'meat'
  * provenance            (provenance) <U9 36B 'projected'
  * scenario (FAOSTAT)    (scenario (FAOSTAT)) <U7 56B 'highpop' 'lowpop'
  * source                (source) <U8 64B 'RAND2020' 'RAND2021'
  * time                  (time) datetime64[ns] 32B 2002-01-01 ... 2005-01-01
Data variables:
    CH4                   (time, area (ISO3), category (IPCC 2006), product (FAOSTAT), scenario (FAOSTAT), provenance, model, source) float64 8kB [CH4·Gg/a] ...
    CO2                   (time, area (ISO3), category (IPCC 2006), product (FAOSTAT), scenario (FAOSTAT), provenance, model, source) float64 8kB [CO2·Gg/a] ...
    SF6                   (time, area (ISO3), category (IPCC 2006), product (FAOSTAT), scenario (FAOSTAT), provenance, model, source) float64 8kB [Gg·SF6/a] ...
    SF6 (SARGWP100)       (time, area (ISO3), category (IPCC 2006), product (FAOSTAT), scenario (FAOSTAT), provenance, model, source) float64 8kB [CO2·Gg/a] ...
    population            (time, area (ISO3), provenance, model, source) float64 256B [] ...
Attributes:
    area:                area (ISO3)
    cat:                 category (IPCC 2006)
    comment:             GHG inventory data ...
    contact:             lol_no_one_will_answer@example.com
    entity_terminology:  primap2
    history:             2021-01-14 14:50 data invented\n2021-01-14 14:51 add...
    institution:         PIK
    references:          doi:10.1012
    rights:              Use however you want.
    scen:                scenario (FAOSTAT)
    sec_cats:            ['animal (FAOSTAT)', 'product (FAOSTAT)']
    title:               Another title

It also works on DataArrays:

[10]:
da = ds["CO2"]

da_subset = da.pr.loc[
    {
        "time": slice("2002", "2005"),
        "animal": "cow",
        "category": "0",
        "area": "COL",
    }
]
da_subset
[10]:
<xarray.DataArray 'CO2' (time: 4, product (FAOSTAT): 2, scenario (FAOSTAT): 2,
                         provenance: 1, model: 1, source: 2)> Size: 256B
<Quantity([[[[[[0.13213244 0.30232155]]]


   [[[0.56331297 0.86483736]]]]



  [[[[0.4988332  0.31410627]]]


   [[[0.10251511 0.81517807]]]]]




 [[[[[0.17221409 0.44064751]]]


   [[[0.08172373 0.93020094]]]]

...

  [[[[0.41901311 0.35808953]]]


   [[[0.66955278 0.09849324]]]]]




 [[[[[0.54763585 0.6474731 ]]]


   [[[0.89670035 0.70107409]]]]



  [[[[0.78242851 0.08874086]]]


   [[[0.70000787 0.16270348]]]]]], 'CO2 * gigagram / year')>
Coordinates:
    animal (FAOSTAT)      <U5 20B 'cow'
    area (ISO3)           <U3 12B 'COL'
    category (IPCC 2006)  <U3 12B '0'
    category_names        <U14 56B 'total'
  * model                 (model) <U8 32B 'FANCYFAO'
  * product (FAOSTAT)     (product (FAOSTAT)) <U4 32B 'milk' 'meat'
  * provenance            (provenance) <U9 36B 'projected'
  * scenario (FAOSTAT)    (scenario (FAOSTAT)) <U7 56B 'highpop' 'lowpop'
  * source                (source) <U8 64B 'RAND2020' 'RAND2021'
  * time                  (time) datetime64[ns] 32B 2002-01-01 ... 2005-01-01
Attributes:
    entity:   CO2

Negative Selections

Using the primap2 loc indexer, you can also make negative selections, selecting everything but the specified value or values along a dimension:

[11]:
from primap2 import Not

ds.pr.loc[{"time": slice("2002", "2005"), "animal": Not("cow")}]
[11]:
<xarray.Dataset> Size: 67kB
Dimensions:               (time: 4, area (ISO3): 4, category (IPCC 2006): 8,
                           animal (FAOSTAT): 2, product (FAOSTAT): 2,
                           scenario (FAOSTAT): 2, provenance: 1, model: 1,
                           source: 2)
Coordinates:
  * animal (FAOSTAT)      (animal (FAOSTAT)) <U5 40B 'swine' 'goat'
  * area (ISO3)           (area (ISO3)) <U3 48B 'COL' 'ARG' 'MEX' 'BOL'
  * category (IPCC 2006)  (category (IPCC 2006)) <U3 96B '0' '1' ... '1.A' '1.B'
    category_names        (category (IPCC 2006)) <U14 448B 'total' ... 'light...
  * model                 (model) <U8 32B 'FANCYFAO'
  * product (FAOSTAT)     (product (FAOSTAT)) <U4 32B 'milk' 'meat'
  * provenance            (provenance) <U9 36B 'projected'
  * scenario (FAOSTAT)    (scenario (FAOSTAT)) <U7 56B 'highpop' 'lowpop'
  * source                (source) <U8 64B 'RAND2020' 'RAND2021'
  * time                  (time) datetime64[ns] 32B 2002-01-01 ... 2005-01-01
Data variables:
    CH4                   (time, area (ISO3), category (IPCC 2006), animal (FAOSTAT), product (FAOSTAT), scenario (FAOSTAT), provenance, model, source) float64 16kB [CH4·Gg/a] ...
    CO2                   (time, area (ISO3), category (IPCC 2006), animal (FAOSTAT), product (FAOSTAT), scenario (FAOSTAT), provenance, model, source) float64 16kB [CO2·Gg/a] ...
    SF6                   (time, area (ISO3), category (IPCC 2006), animal (FAOSTAT), product (FAOSTAT), scenario (FAOSTAT), provenance, model, source) float64 16kB [Gg·SF6/a] ...
    SF6 (SARGWP100)       (time, area (ISO3), category (IPCC 2006), animal (FAOSTAT), product (FAOSTAT), scenario (FAOSTAT), provenance, model, source) float64 16kB [CO2·Gg/a] ...
    population            (time, area (ISO3), provenance, model, source) float64 256B [] ...
Attributes:
    area:                area (ISO3)
    cat:                 category (IPCC 2006)
    comment:             GHG inventory data ...
    contact:             lol_no_one_will_answer@example.com
    entity_terminology:  primap2
    history:             2021-01-14 14:50 data invented\n2021-01-14 14:51 add...
    institution:         PIK
    references:          doi:10.1012
    rights:              Use however you want.
    scen:                scenario (FAOSTAT)
    sec_cats:            ['animal (FAOSTAT)', 'product (FAOSTAT)']
    title:               Another title

Setting data

PRIMAP2 provides a unified API to introduce new data values, fill missing information, and overwrite existing information: the da.pr.set function and its sibling ds.pr.set.

The basic signature of the set functions is set(dimension, keys, values), and it returns the changed object without changing the original one. Use it like this:

[12]:
da = ds_min["CO2"].loc[{"time": slice("2000", "2005")}]
da
[12]:
<xarray.DataArray 'CO2' (time: 6, area (ISO3): 4, source: 1)> Size: 192B
<Quantity([[[0.66352018]
  [0.91683183]
  [0.10638745]
  [0.61416341]]

 [[0.33822533]
  [0.68233863]
  [0.73953851]
  [0.79057726]]

 [[0.11409101]
  [0.25072807]
  [0.85044897]
  [0.40873907]]

 [[0.80147707]
  [0.85209597]
  [0.71521289]
  [0.08031265]]

 [[0.61018888]
  [0.88112521]
  [0.55500009]
  [0.11105145]]

 [[0.81245612]
  [0.14774848]
  [0.14072429]
  [0.13586946]]], 'CO2 * gigagram / year')>
Coordinates:
  * area (ISO3)  (area (ISO3)) <U3 48B 'COL' 'ARG' 'MEX' 'BOL'
  * source       (source) <U8 32B 'RAND2020'
  * time         (time) datetime64[ns] 48B 2000-01-01 2001-01-01 ... 2005-01-01
Attributes:
    entity:   CO2
[13]:
import numpy as np
from primap2 import ureg

modified = da.pr.set(
    "area", "CUB", np.linspace(0, 20, 6) * ureg("Gg CO2 / year")
)
modified
[13]:
<xarray.DataArray 'CO2' (time: 6, area (ISO3): 5, source: 1)> Size: 240B
<Quantity([[[ 0.91683183]
  [ 0.61416341]
  [ 0.66352018]
  [ 0.        ]
  [ 0.10638745]]

 [[ 0.68233863]
  [ 0.79057726]
  [ 0.33822533]
  [ 4.        ]
  [ 0.73953851]]

 [[ 0.25072807]
  [ 0.40873907]
  [ 0.11409101]
  [ 8.        ]
  [ 0.85044897]]

 [[ 0.85209597]
  [ 0.08031265]
  [ 0.80147707]
  [12.        ]
  [ 0.71521289]]

 [[ 0.88112521]
  [ 0.11105145]
  [ 0.61018888]
  [16.        ]
  [ 0.55500009]]

 [[ 0.14774848]
  [ 0.13586946]
  [ 0.81245612]
  [20.        ]
  [ 0.14072429]]], 'CO2 * gigagram / year')>
Coordinates:
  * area (ISO3)  (area (ISO3)) <U3 60B 'ARG' 'BOL' 'COL' 'CUB' 'MEX'
  * source       (source) <U8 32B 'RAND2020'
  * time         (time) datetime64[ns] 48B 2000-01-01 2001-01-01 ... 2005-01-01
Attributes:
    entity:   CO2

By default, existing non-NaN values are not overwritten:

[14]:
try:
    da.pr.set("area", "COL", np.linspace(0, 20, 6) * ureg("Gg CO2 / year"))
except ValueError as err:
    print(err)
Values {'COL'} for 'area (ISO3)' already exist and contain data. Use existing='overwrite' or 'fillna' to avoid this error.

You can overwrite existing values by specifying existing="overwrite" (replace all values) or existing="fillna" (replace only NaNs).

[15]:
da.pr.set(
    "area",
    "COL",
    np.linspace(0, 20, 6) * ureg("Gg CO2 / year"),
    existing="overwrite",
)
[15]:
<xarray.DataArray 'CO2' (time: 6, area (ISO3): 4, source: 1)> Size: 192B
<Quantity([[[ 0.91683183]
  [ 0.61416341]
  [ 0.        ]
  [ 0.10638745]]

 [[ 0.68233863]
  [ 0.79057726]
  [ 4.        ]
  [ 0.73953851]]

 [[ 0.25072807]
  [ 0.40873907]
  [ 8.        ]
  [ 0.85044897]]

 [[ 0.85209597]
  [ 0.08031265]
  [12.        ]
  [ 0.71521289]]

 [[ 0.88112521]
  [ 0.11105145]
  [16.        ]
  [ 0.55500009]]

 [[ 0.14774848]
  [ 0.13586946]
  [20.        ]
  [ 0.14072429]]], 'CO2 * gigagram / year')>
Coordinates:
  * time         (time) datetime64[ns] 48B 2000-01-01 2001-01-01 ... 2005-01-01
  * area (ISO3)  (area (ISO3)) <U3 48B 'ARG' 'BOL' 'COL' 'MEX'
  * source       (source) <U8 32B 'RAND2020'
Attributes:
    entity:   CO2
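
Analogously, existing="fillna" keeps all existing non-NaN values and only fills gaps; a sketch using the same data (since "COL" contains no NaNs here, the result equals da):

da.pr.set(
    "area",
    "COL",
    np.linspace(0, 20, 6) * ureg("Gg CO2 / year"),
    existing="fillna",  # only NaN values are replaced
)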

By default, the set() function extends the specified dimension automatically to accommodate new values if not all key values are in the specified dimension yet. You can change this by specifying new="error", which will raise a KeyError if any of the keys is not found:

[16]:
try:
    da.pr.set(
        "area",
        ["COL", "CUB"],
        np.linspace(0, 20, 6) * ureg("Gg CO2 / year"),
        existing="overwrite",
        new="error",
    )
except KeyError as err:
    print(err)
"Values {'CUB'} not in 'area (ISO3)', use new='extend' to automatically insert new values into dim."

In particular, the set() functions can also be used with xarray’s arithmetic functions to derive values from existing data and store the result in the Dataset. As an example, we will derive better values for category 0 by adding all its subcategories and store the result.

First, let’s look at the current data for a small subset:

[17]:
sel = {
    "area": "COL",
    "category": ["0", "1", "2", "3", "4", "5"],
    "animal": "cow",
    "product": "milk",
    "scenario": "highpop",
    "source": "RAND2020",
}
subset = ds.pr.loc[sel].squeeze()

# TODO: currently, plotting with units still emits a warning
import warnings

with warnings.catch_warnings():
    warnings.simplefilter("ignore")
    subset["CO2"].plot.line(x="time", hue="category (IPCC 2006)")
[plot: CO2 timeseries per category for the selected subset]

While it is hard to see any details in this plot, it is clearly visible that category 0 is not the sum of the other categories (which should not come as a surprise since we generated the data at random).

We will now recompute category 0 for the entire dataset using set():

[18]:
cat0_new = ds.pr.loc[{"category": ["1", "2", "3", "4", "5"]}].pr.sum(
    "category"
)

ds = ds.pr.set(
    "category",
    "0",
    cat0_new,
    existing="overwrite",
)

# plot a small subset of the result
subset = ds.pr.loc[sel].squeeze()
# TODO: currently, plotting with units still emits a warning
import warnings

with warnings.catch_warnings():
    warnings.simplefilter("ignore")
    subset["CO2"].plot.line(x="time", hue="category (IPCC 2006)")
[plot: CO2 timeseries per category, with category 0 recomputed as the sum of its subcategories]

As you can see in the plot, category 0 is now computed from its subcategories. The set() method of Datasets works on all data variables in the dataset which have the corresponding dimension. In this example, the “population” variable does not have categories, so it was unchanged.

Adding dimensions

When adding a dimension to a primap2 dataset, one also has to keep the attrs up to date, so xarray’s xr.Dataset.expand_dims function alone does not suffice. The function ds.pr.expand_dims takes care of the attrs for you. The following example adds a category dimension with the coordinate value ‘0’ and the terminology ‘IPCC2006’ to ds_min and adds the alias to the attrs.

[19]:
ds_dim_added = ds_min.pr.expand_dims(dim='category', coord_value=['0'], terminology='IPCC2006')
ds_dim_added
[19]:
<xarray.Dataset> Size: 3kB
Dimensions:              (category (IPCC2006): 1, time: 21, area (ISO3): 4,
                          source: 1)
Coordinates:
  * category (IPCC2006)  (category (IPCC2006)) object 8B ('0',)
  * area (ISO3)          (area (ISO3)) <U3 48B 'COL' 'ARG' 'MEX' 'BOL'
  * source               (source) <U8 32B 'RAND2020'
  * time                 (time) datetime64[ns] 168B 2000-01-01 ... 2020-01-01
Data variables:
    CH4                  (category (IPCC2006), time, area (ISO3), source) float64 672B [CH4·Gg/a] ...
    CO2                  (category (IPCC2006), time, area (ISO3), source) float64 672B [CO2·Gg/a] ...
    SF6                  (category (IPCC2006), time, area (ISO3), source) float64 672B [Gg·SF6/a] ...
    SF6 (SARGWP100)      (category (IPCC2006), time, area (ISO3), source) float64 672B [CO2·Gg/a] ...
Attributes:
    area:     area (ISO3)
    cat:      category (IPCC2006)

Unit handling

PRIMAP2 uses the openscm_units package, which is based on the Pint library, for unit handling.

CO2 equivalent units and mass units

Using global warming potential contexts, it is easy to convert mass units into CO2 equivalents:

[20]:
from primap2 import ureg  # The unit registry

sf6_gwp = ds["SF6"].pr.convert_to_gwp(
    gwp_context="AR4GWP100", units="Gg CO2 / year"
)
# The information about the used GWP context is retained:
sf6_gwp.attrs
[20]:
{'entity': 'SF6', 'gwp_context': 'AR4GWP100'}

Because the GWP context used for conversion is stored, it is equally easy to convert back to mass units:

[21]:
sf6 = sf6_gwp.pr.convert_to_mass()
sf6.attrs
[21]:
{'entity': 'SF6'}

The stored GWP context can also be used to convert another array using the same context:

[22]:
ch4_gwp = ds["CH4"].pr.convert_to_gwp_like(sf6_gwp)
ch4_gwp.attrs
[22]:
{'entity': 'CH4', 'gwp_context': 'AR4GWP100'}

Dropping units

Sometimes, it is necessary to drop the units, for example to use arrays as input for external functions which are unit-naive. This can be done safely by first converting to the target unit, then dequantifying the dataset or array:

[23]:
da_nounits = ds["CH4"].pint.to("Mt CH4 / year").pr.dequantify()
da_nounits.attrs
[23]:
{'entity': 'CH4', 'units': 'CH4 * megametric_ton / year'}

Note that the units are then stored in the DataArray’s attrs, and can be restored using the da.pr.quantify function.
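
For example, a sketch of restoring the units once the unit-naive processing is done:

da_quantified = da_nounits.pr.quantify()  # re-attach the units stored in attrs
da_quantified.pint.units  # CH4 * megametric_ton / year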

Descriptive statistics

To get an overview of the missing information in a Dataset or DataArray, you can use the pr.coverage function. It gives you a summary of the number of non-NaN data points:

[24]:
import numpy as np
import xarray as xr
import pandas as pd

time = pd.date_range("2000-01-01", "2003-01-01", freq="YS")
area_iso3 = np.array(["COL", "ARG", "MEX"])
category_ipcc = np.array(["1", "2"])
coords = [
    ("category (IPCC2006)", category_ipcc),
    ("area (ISO3)", area_iso3),
    ("time", time),
]
da = xr.DataArray(
    data=[
        [
            [1, 2, 3, 4],
            [np.nan, np.nan, np.nan, np.nan],
            [1, 2, 3, np.nan],
        ],
        [
            [np.nan, 2, np.nan, 4],
            [1, np.nan, 3, np.nan],
            [1, np.nan, 3, np.nan],
        ],
    ],
    coords=coords,
)

da
[24]:
<xarray.DataArray (category (IPCC2006): 2, area (ISO3): 3, time: 4)> Size: 192B
array([[[ 1.,  2.,  3.,  4.],
        [nan, nan, nan, nan],
        [ 1.,  2.,  3., nan]],

       [[nan,  2., nan,  4.],
        [ 1., nan,  3., nan],
        [ 1., nan,  3., nan]]])
Coordinates:
  * category (IPCC2006)  (category (IPCC2006)) <U1 8B '1' '2'
  * area (ISO3)          (area (ISO3)) <U3 36B 'COL' 'ARG' 'MEX'
  * time                 (time) datetime64[ns] 32B 2000-01-01 ... 2003-01-01
[25]:
da.pr.coverage("area")
[25]:
area (ISO3)
COL    6
ARG    2
MEX    5
Name: coverage, dtype: int64
[26]:
da.pr.coverage("time", "area")
[26]:
area (ISO3)  COL  ARG  MEX
time
2000-01-01     1    1    2
2001-01-01     2    0    1
2002-01-01     1    1    2
2003-01-01     2    0    0

For Datasets, you can also specify “entity” as a coordinate to summarize over the data variables:

[27]:
import primap2.tests

ds = primap2.tests.examples.opulent_ds()
ds["CO2"].pr.loc[{"product": "milk"}].pint.magnitude[:] = np.nan

ds.pr.coverage("product", "entity", "area")
[27]:
area (ISO3)                          COL   ARG   MEX   BOL
product (FAOSTAT) entity
milk              CO2                  0     0     0     0
                  SF6               2016  2016  2016  2016
                  CH4               2016  2016  2016  2016
                  SF6 (SARGWP100)   2016  2016  2016  2016
meat              CO2               2016  2016  2016  2016
                  SF6               2016  2016  2016  2016
                  CH4               2016  2016  2016  2016
                  SF6 (SARGWP100)   2016  2016  2016  2016

Merging

xarray provides different functions to combine Datasets and DataArrays, but they are not built to combine data which contains duplicates with rounding or processing errors. When reading data e.g. from country reports, this is often needed, as some sectors are included in several tables which might use different numbers of decimals. PRIMAP2 therefore adds a merge function that accepts data discrepancies not exceeding a given tolerance level. The merging of attributes is handled by xarray, and the combine_attrs parameter is just passed on to the xarray functions; the default is to drop_conflicts.

Below is an example using the built-in opulent_ds:

[28]:
from primap2.tests.examples import opulent_ds
import xarray as xr

op_ds = opulent_ds()
# only take part of the countries to have something to actually merge
da_start = op_ds["CO2"].pr.loc[{"area": ["ARG", "COL", "MEX"]}]
# modify some data
data_to_modify = op_ds["CO2"].pr.loc[{"area": ["ARG"]}].pr.sum("area")
data_to_modify.data = data_to_modify.data * 1.009
da_merge = op_ds["CO2"].pr.set(
    "area", "ARG", data_to_modify, existing="overwrite"
)

# merge with tolerance such that it will pass
da_result = da_start.pr.merge(da_merge, tolerance=0.01)
[29]:
# merge with lower tolerance such that it will fail
try:
    # the logged message is very large, only show a small part
    logger.disable("primap2")
    da_result = da_start.pr.merge(da_merge, tolerance=0.005)
except xr.MergeError as err:
    err_short = "\n".join(str(err).split("\n")[0:6])
    print(f"An error occured during merging: {err_short}")
logger.enable("primap2")

# you could also only log a warning and not raise an error
# using the error_on_discrepancy=False argument to `merge`
An error occurred during merging: pr.merge error: found discrepancies larger than tolerance (0.50%) for area (ISO3)=ARG, provenance=projected, model=FANCYFAO:
shown are relative discrepancies.
                                                                                                category_names    CO2
time       category (IPCC 2006) animal (FAOSTAT) product (FAOSTAT) scenario (FAOSTAT) source
2000-01-01 0                    cow              milk              highpop            RAND2020           total  0.009
                                                                                      RAND2021           total  0.009
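
As the comment above notes, you can also just log a warning instead of raising an error; a sketch:

# discrepancies larger than the tolerance are logged instead of
# raising an xr.MergeError
da_result = da_start.pr.merge(
    da_merge, tolerance=0.005, error_on_discrepancy=False
)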

Aggregation and infilling

xarray provides robust functions for aggregation (sum) and for filling missing information (fillna). PRIMAP2 adds functions which fill or skip missing information based on whether the information is missing at all points along certain axes, for example for a whole time series. This makes it possible, for example, to evaluate the sum of sub-categories while ignoring only those categories which are missing completely. It is also possible to ignore NA values (i.e. treat them as 0) in sums using the skipna parameter. When using skipna, the min_count parameter governs how many non-NA values are needed in a sum for the result to be non-NA; the default is min_count=1. skipna is helpful if you want to e.g. sum all subsectors while for some countries or gases some of the subsectors are NA because there is no data: it avoids NA timeseries whenever a single sector is NA. In other cases, e.g. when checking whether data coverage is complete, skipna is not used, so any NA value in the source data results in NA in the summed data and is not hidden.

[30]:
time = pd.date_range("2000-01-01", "2003-01-01", freq="YS")
area_iso3 = np.array(["COL", "ARG", "MEX"])
coords = [("area (ISO3)", area_iso3), ("time", time)]
da = xr.DataArray(
    data=[
        [1, 2, 3, 4],
        [np.nan, np.nan, np.nan, np.nan],
        [1, 2, 3, np.nan],
    ],
    coords=coords,
)

da
[30]:
<xarray.DataArray (area (ISO3): 3, time: 4)> Size: 96B
array([[ 1.,  2.,  3.,  4.],
       [nan, nan, nan, nan],
       [ 1.,  2.,  3., nan]])
Coordinates:
  * area (ISO3)  (area (ISO3)) <U3 36B 'COL' 'ARG' 'MEX'
  * time         (time) datetime64[ns] 32B 2000-01-01 2001-01-01 ... 2003-01-01
[31]:
da.pr.sum(dim="area", skipna_evaluation_dims="time")
[31]:
<xarray.DataArray (time: 4)> Size: 32B
array([ 2.,  4.,  6., nan])
Coordinates:
  * time     (time) datetime64[ns] 32B 2000-01-01 2001-01-01 ... 2003-01-01
[32]:
da.pr.sum(dim="area", skipna=True)
[32]:
<xarray.DataArray (time: 4)> Size: 32B
array([2., 4., 6., 4.])
Coordinates:
  * time     (time) datetime64[ns] 32B 2000-01-01 2001-01-01 ... 2003-01-01
[33]:
# compare this to the result of the standard xarray sum:

da.sum(dim="area (ISO3)")
[33]:
<xarray.DataArray (time: 4)> Size: 32B
array([2., 4., 6., 4.])
Coordinates:
  * time     (time) datetime64[ns] 32B 2000-01-01 2001-01-01 ... 2003-01-01
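
The min_count behaviour described above can be illustrated with xarray's plain sum, which also accepts the parameter (a sketch; requiring at least two non-NA values makes the 2003 sum NA again, since only one value is available there):

# require at least two non-NA values for a non-NA sum
da.sum(dim="area (ISO3)", skipna=True, min_count=2)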

The same functionality is available for filling in missing information:

[34]:
da.pr.fill_all_na("time", value=10)
[34]:
<xarray.DataArray (area (ISO3): 3, time: 4)> Size: 96B
array([[ 1.,  2.,  3.,  4.],
       [10., 10., 10., 10.],
       [ 1.,  2.,  3., nan]])
Coordinates:
  * area (ISO3)  (area (ISO3)) <U3 36B 'COL' 'ARG' 'MEX'
  * time         (time) datetime64[ns] 32B 2000-01-01 2001-01-01 ... 2003-01-01

For larger aggregation tasks, e.g. aggregating several gas baskets from individual gases or aggregating a full category tree from its leaves, we have the functions ds.pr.add_aggregates_variables, ds.pr.add_aggregates_coordinates, and da.pr.add_aggregates_coordinates, which are highly configurable but can also be used in a simplified mode for quick aggregation tasks. In the following, we give a few examples; for the full feature set, we refer to the function descriptions linked above. Internally, the functions work with ds/da.pr.merge() to allow for consistency checks when target timeseries exist.

ds.pr.add_aggregates_variables

This function aggregates data from individual variables into new variables (usually gas baskets). Several variables can be created in one call, where the order of definition is the order of creation. Filters can be specified to limit the aggregation to certain coordinate values.

Examples

Sum gases in the minimal example dataset

[35]:
summed_ds = ds_min.pr.add_aggregates_variables(
    gas_baskets={
        "test (SARGWP100)": {
            "sources": ["CO2", "SF6", "CH4"],
        },
    },
)
summed_ds["test (SARGWP100)"]
[35]:
<xarray.DataArray 'test (SARGWP100)' (time: 21, area (ISO3): 4, source: 1)> Size: 672B
<Quantity([[[   59.52116157]
  [12091.80994421]
  [16182.21435315]
  [ 2965.80220677]]

 [[11436.07585155]
  [12597.2322152 ]
  [19172.6969282 ]
  [19351.58487054]]

 [[12749.61794337]
  [14785.27911422]
  [ 5733.75559422]
  [ 7390.51658535]]

 [[12745.20608578]
  [11897.40280597]
  [10390.92115671]
  [16345.51800435]]

...

 [[14148.34407451]
  [  449.63570296]
  [  280.71991745]
  [14065.93620383]]

 [[13244.23861057]
  [23847.91204598]
  [ 4243.44458005]
  [19677.45365633]]

 [[23081.76617658]
  [ 6422.07536784]
  [22948.697363  ]
  [ 3986.50318618]]

 [[22806.4457288 ]
  [18200.33442572]
  [ 6282.66521788]
  [ 1017.35660718]]], 'CO2 * gigagram / year')>
Coordinates:
  * area (ISO3)  (area (ISO3)) <U3 48B 'COL' 'ARG' 'MEX' 'BOL'
  * source       (source) <U8 32B 'RAND2020'
  * time         (time) datetime64[ns] 168B 2000-01-01 2001-01-01 ... 2020-01-01
Attributes:
    gwp_context:  SARGWP100
    entity:       test

We can also use a filter to limit the aggregation to a single country:

[36]:
filtered_ds = ds_min.pr.add_aggregates_variables(
    gas_baskets={
        "test (SARGWP100)": {
            "sources": ["CO2", "SF6", "CH4"],
            "filter": {"area (ISO3)": ["COL"]},
        },
    },
)
filtered_ds["test (SARGWP100)"]
[36]:
<xarray.DataArray 'test (SARGWP100)' (time: 21, area (ISO3): 4, source: 1)> Size: 672B
<Quantity([[[   59.52116157]
  [           nan]
  [           nan]
  [           nan]]

 [[11436.07585155]
  [           nan]
  [           nan]
  [           nan]]

 [[12749.61794337]
  [           nan]
  [           nan]
  [           nan]]

 [[12745.20608578]
  [           nan]
  [           nan]
  [           nan]]

...

 [[14148.34407451]
  [           nan]
  [           nan]
  [           nan]]

 [[13244.23861057]
  [           nan]
  [           nan]
  [           nan]]

 [[23081.76617658]
  [           nan]
  [           nan]
  [           nan]]

 [[22806.4457288 ]
  [           nan]
  [           nan]
  [           nan]]], 'CO2 * gigagram / year')>
Coordinates:
  * area (ISO3)  (area (ISO3)) <U3 48B 'COL' 'ARG' 'MEX' 'BOL'
  * source       (source) <U8 32B 'RAND2020'
  * time         (time) datetime64[ns] 168B 2000-01-01 2001-01-01 ... 2020-01-01
Attributes:
    gwp_context:  SARGWP100
    entity:       test

If we recompute an existing timeseries, it has to be consistent with the existing data. Here we use the simple mode to specify the aggregation rules.

[37]:
from xarray import MergeError
try:
    recomputed_ds = filtered_ds.pr.add_aggregates_variables(
        gas_baskets={
            "test (SARGWP100)": ["CO2", "CH4"],
        },
    )
    recomputed_ds["test (SARGWP100)"]
except MergeError as err:
    print(err)

2024-05-16 08:49:09.222 | ERROR    | primap2._merge:merge_with_tolerance_core:72 - pr.merge error: found discrepancies larger than tolerance (1.00%) for area (ISO3)=COL, source=RAND2020:
shown are relative discrepancies.
            test (SARGWP100)
time
2000-01-01          0.722739
2001-01-01          0.998434
2002-01-01          0.999453
2003-01-01          0.998306
2004-01-01          0.998756
2005-01-01          0.998049
2006-01-01          0.995990
2007-01-01          0.998147
2008-01-01          0.995131
2009-01-01          0.999447
2010-01-01          0.999546
2011-01-01          0.998854
2012-01-01          0.999669
2013-01-01          0.999974
2014-01-01          0.999712
2015-01-01          0.999793
2016-01-01          0.998886
2017-01-01          0.998772
2018-01-01          0.999852
2019-01-01          0.999112
2020-01-01          0.999471
pr.merge error: found discrepancies larger than tolerance (1.00%) for area (ISO3)=COL, source=RAND2020:
shown are relative discrepancies.
            test (SARGWP100)
time
2000-01-01          0.722739
2001-01-01          0.998434
2002-01-01          0.999453
2003-01-01          0.998306
2004-01-01          0.998756
2005-01-01          0.998049
2006-01-01          0.995990
2007-01-01          0.998147
2008-01-01          0.995131
2009-01-01          0.999447
2010-01-01          0.999546
2011-01-01          0.998854
2012-01-01          0.999669
2013-01-01          0.999974
2014-01-01          0.999712
2015-01-01          0.999793
2016-01-01          0.998886
2017-01-01          0.998772
2018-01-01          0.999852
2019-01-01          0.999112
2020-01-01          0.999471

The recomputation succeeds only if we set the tolerance high enough (which is only possible in the complex mode for the aggregation rules):

[38]:
recomputed_ds = filtered_ds.pr.add_aggregates_variables(
    gas_baskets={
        "test (SARGWP100)": {
            "sources": ["CO2", "CH4"],
            "tolerance": 1 #  100%
        },
    },
)
recomputed_ds["test (SARGWP100)"]
[38]:
<xarray.DataArray 'test (SARGWP100)' (time: 21, area (ISO3): 4, source: 1)> Size: 672B
<Quantity([[[5.95211616e+01]
  [8.65600580e+00]
  [1.55819099e+01]
  [1.24871019e+01]]

 [[1.14360759e+04]
  [2.53322001e+00]
  [1.39057502e+01]
  [1.56300850e+00]]

 [[1.27496179e+04]
  [6.52687510e+00]
  [9.95243955e+00]
  [1.62190600e+01]]

 [[1.27452061e+04]
  [1.51039398e+01]
  [1.59373124e+01]
  [5.28770892e+00]]

...

 [[1.41483441e+04]
  [1.62580953e+01]
  [3.77418330e+00]
  [3.73085782e+00]]

 [[1.32442386e+04]
  [2.45037885e+00]
  [1.10246891e+00]
  [9.05224940e+00]]

 [[2.30817662e+04]
  [1.18341086e+00]
  [1.53357641e+01]
  [2.72788938e+00]]

 [[2.28064457e+04]
  [2.51734396e+00]
  [6.70344318e-01]
  [1.16859245e+01]]], 'CO2 * gigagram / year')>
Coordinates:
  * area (ISO3)  (area (ISO3)) <U3 48B 'COL' 'ARG' 'MEX' 'BOL'
  * source       (source) <U8 32B 'RAND2020'
  * time         (time) datetime64[ns] 168B 2000-01-01 2001-01-01 ... 2020-01-01
Attributes:
    gwp_context:  SARGWP100
    entity:       test

ds.pr.add_aggregates_coordinates

This function aggregates data from individual coordinate values to new values (e.g. from subcategories to categories). Several values for several coordinates can be created in one call, where the order of definition is the order of creation. Filters can be specified to limit the aggregation to certain coordinate values, entities, or variables. Most of the operation is similar to the variable aggregation, so we keep the examples here shorter. The da.pr.add_aggregates_coordinates function uses the same syntax.

Examples

Sum countries in the minimal example dataset

[39]:
test_ds = ds_min.pr.add_aggregates_coordinates(
    agg_info={
        "area (ISO3)": {
            "all": {
                "sources": ["COL", "ARG", "MEX", "BOL"],
            }
        }
    }
)
test_ds
[39]:
<xarray.Dataset> Size: 4kB
Dimensions:          (area (ISO3): 5, source: 1, time: 21)
Coordinates:
  * area (ISO3)      (area (ISO3)) <U3 60B 'ARG' 'BOL' 'COL' 'MEX' 'all'
  * source           (source) <U8 32B 'RAND2020'
  * time             (time) datetime64[ns] 168B 2000-01-01 ... 2020-01-01
Data variables:
    CH4              (time, area (ISO3), source) float64 840B [CH4·Gg/a] 0.36...
    CO2              (time, area (ISO3), source) float64 840B [CO2·Gg/a] 0.91...
    SF6              (time, area (ISO3), source) float64 840B [Gg·SF6/a] 0.50...
    SF6 (SARGWP100)  (time, area (ISO3), source) float64 840B [CO2·Gg/a] 1.20...
Attributes:
    area:         area (ISO3)
    entity:       SF6
    gwp_context:  SARGWP100

The difference between the entity and variable filters is that 'entity': ['SF6'] will match both the variables 'SF6' and 'SF6 (SARGWP100)' (as both variables are for the entity 'SF6'), while 'variable': ['SF6'] will match only the variable 'SF6'.
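
For example, a sketch restricting the country aggregation above to the entity 'SF6' via a filter (following the semantics just described, this matches both 'SF6' and 'SF6 (SARGWP100)'):

test_ds_sf6 = ds_min.pr.add_aggregates_coordinates(
    agg_info={
        "area (ISO3)": {
            "all": {
                "sources": ["COL", "ARG", "MEX", "BOL"],
                # 'entity' matches 'SF6' and 'SF6 (SARGWP100)';
                # 'variable' would match only 'SF6'
                "filter": {"entity": ["SF6"]},
            }
        }
    }
)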