The Composite Source Generator

The primap2.csg module can be used to create a composite dataset from multiple source datasets using specified rules.

The general strategy for combining datasets is to work on one timeseries at a time, i.e. an array whose only dimension is time. For each timeseries, the available source timeseries are ordered by the configured priorities, and the result timeseries is initialized from the highest-priority source. The lower-priority source timeseries are then used in turn, one at a time, to fill any information still missing from the result. How each source timeseries fills the gaps is determined by the strategy configured for it (such as direct substitution or least-squares matching of data). The algorithm terminates as soon as no missing information is left in the result timeseries, or once all source timeseries have been used, even if gaps remain.
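The filling loop described above can be sketched in a few lines of plain NumPy. This is only an illustration of the idea for the direct-substitution case, not the primap2 implementation; the function name `fill_by_priority` and the sample values are invented for the example.

```python
import numpy as np


def fill_by_priority(sources):
    """Combine 1-D timeseries given in order from highest to lowest priority.

    Hypothetical helper: missing values (NaN) in the result are replaced by
    the values of the next source, one source at a time.
    """
    # Initialize the result from the highest-priority source (as a copy).
    result = np.array(sources[0], dtype=float)
    for source in sources[1:]:
        missing = np.isnan(result)
        if not missing.any():
            # Nothing left to fill: stop early.
            break
        # Direct substitution: copy the source values into the gaps.
        result[missing] = np.asarray(source, dtype=float)[missing]
    return result


# Invented sample data: three sources, highest priority first.
high = [1.0, np.nan, np.nan, 4.0]
mid = [10.0, 20.0, np.nan, 40.0]
low = [100.0, 200.0, 300.0, 400.0]

composite = fill_by_priority([high, mid, low])
print(composite)
```

Each gap is filled by the highest-priority source that has a value at that time step, so values from lower-priority sources never overwrite values already present in the result.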

The core function to use is the primap2.csg.compose() function. It needs the following input:

  • The input dataset, containing all sources. The shape and dimensions of the input dataset also determine the shape and dimensions of the composed dataset.

  • A definition of priority dimensions and priorities. The priority dimensions are the dimensions of the input dataset that are used to select source datasets. The result dataset no longer has these dimensions, because the source timeseries are combined along them into a single composite timeseries. The priorities are a list of selections, ordered from highest to lowest priority. Each selection has to specify exactly one value for each priority dimension, so that the priorities are unambiguous. A selection can additionally specify values for dimensions other than the priority dimensions, e.g. to change the priorities for some countries or categories. You can also specify exclusions from either the result or the input dataset to skip specific sources or categories.

  • A definition of strategies. Selectors along any dimension of the input dataset determine which filling strategy applies to which timeseries. Every timeseries needs a filling strategy, so it is a good idea to define a default filling strategy using an empty selector (see example below).
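To make the priority rules more concrete, the following sketch shows one way a priority list with extra selectors could be resolved for a single timeseries. It is plain Python and does not use primap2; the helper `applicable_priorities` is hypothetical, and dropping repeated source combinations is an assumption of this sketch, not documented library behaviour.

```python
PRIORITY_DIMENSIONS = ["source", "scenario (FAOSTAT)"]
PRIORITIES = [
    # Only applies to category 0 because of the extra selector.
    {"category (IPCC 2006)": "0", "source": "RAND2020", "scenario (FAOSTAT)": "highpop"},
    {"source": "RAND2020", "scenario (FAOSTAT)": "lowpop"},
    {"source": "RAND2020", "scenario (FAOSTAT)": "highpop"},
    {"source": "RAND2021", "scenario (FAOSTAT)": "lowpop"},
]


def applicable_priorities(priorities, priority_dimensions, coords):
    """Return the ordered source selections that apply to one timeseries.

    Hypothetical helper: `coords` holds the non-priority coordinates of the
    timeseries, e.g. its category and area.
    """
    ordered = []
    for selection in priorities:
        # Selectors on non-priority dimensions restrict where an entry applies.
        extra = {k: v for k, v in selection.items() if k not in priority_dimensions}
        if all(coords.get(k) == v for k, v in extra.items()):
            source = {dim: selection[dim] for dim in priority_dimensions}
            # Assumption of this sketch: keep only the first occurrence.
            if source not in ordered:
                ordered.append(source)
    return ordered


for_cat0 = applicable_priorities(
    PRIORITIES, PRIORITY_DIMENSIONS, {"category (IPCC 2006)": "0", "area (ISO3)": "MEX"}
)
for_cat1 = applicable_priorities(
    PRIORITIES, PRIORITY_DIMENSIONS, {"category (IPCC 2006)": "1", "area (ISO3)": "MEX"}
)
print(for_cat0[0])
print(for_cat1[0])
```

With this list, a category 0 timeseries starts from the "highpop" scenario, while every other category starts from "lowpop".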

# setup logging for the docs - we don't need debug logs
import sys
from loguru import logger

logger.remove()
logger.add(sys.stderr, level="INFO")
import numpy as np
import primap2
import primap2.csg

ureg = primap2.ureg

input_ds = primap2.open_dataset("../opulent_ds.nc")[["CH4", "CO2", "SF6"]]
input_ds["CH4"].loc[
    {
        "category (IPCC 2006)": "1",
        "time": slice("2000", "2001"),
        "scenario (FAOSTAT)": "lowpop",
    }
][:] = np.nan * ureg("Gg CH4 / year")
input_ds
<xarray.Dataset> Size: 388kB
Dimensions:               (time: 21, area (ISO3): 4, category (IPCC 2006): 8,
                           animal (FAOSTAT): 3, product (FAOSTAT): 2,
                           scenario (FAOSTAT): 2, provenance: 1, model: 1,
                           source: 2)
Coordinates:
  * animal (FAOSTAT)      (animal (FAOSTAT)) <U5 60B 'cow' 'swine' 'goat'
  * area (ISO3)           (area (ISO3)) <U3 48B 'COL' 'ARG' 'MEX' 'BOL'
  * category (IPCC 2006)  (category (IPCC 2006)) <U3 96B '0' '1' ... '1.A' '1.B'
    category_names        (category (IPCC 2006)) <U14 448B 'total' ... 'light...
  * model                 (model) <U8 32B 'FANCYFAO'
  * product (FAOSTAT)     (product (FAOSTAT)) <U4 32B 'milk' 'meat'
  * provenance            (provenance) <U9 36B 'projected'
  * scenario (FAOSTAT)    (scenario (FAOSTAT)) <U7 56B 'highpop' 'lowpop'
  * source                (source) <U8 64B 'RAND2020' 'RAND2021'
  * time                  (time) datetime64[ns] 168B 2000-01-01 ... 2020-01-01
Data variables:
    CH4                   (time, area (ISO3), category (IPCC 2006), animal (FAOSTAT), product (FAOSTAT), scenario (FAOSTAT), provenance, model, source) float64 129kB [CH4·Gg/a] ...
    CO2                   (time, area (ISO3), category (IPCC 2006), animal (FAOSTAT), product (FAOSTAT), scenario (FAOSTAT), provenance, model, source) float64 129kB [CO2·Gg/a] ...
    SF6                   (time, area (ISO3), category (IPCC 2006), animal (FAOSTAT), product (FAOSTAT), scenario (FAOSTAT), provenance, model, source) float64 129kB [SF6·Gg/a] ...
Attributes:
    area:                area (ISO3)
    cat:                 category (IPCC 2006)
    comment:             GHG inventory data ...
    contact:             lol_no_one_will_answer@example.com
    entity_terminology:  primap2
    history:             2021-01-14 14:50 data invented\n2021-01-14 14:51 add...
    institution:         PIK
    references:          doi:10.1012
    rights:              Use however you want.
    scen:                scenario (FAOSTAT)
    sec_cats:            ['animal (FAOSTAT)', 'product (FAOSTAT)']
    title:               Completely invented GHG inventory data
priority_definition = primap2.csg.PriorityDefinition(
    priority_dimensions=["source", "scenario (FAOSTAT)"],
    priorities=[
        # only applies to category 0: prefer highpop
        {
            "category (IPCC 2006)": "0",
            "source": "RAND2020",
            "scenario (FAOSTAT)": "highpop",
        },
        {"source": "RAND2020", "scenario (FAOSTAT)": "lowpop"},
        {"source": "RAND2020", "scenario (FAOSTAT)": "highpop"},
        {"source": "RAND2021", "scenario (FAOSTAT)": "lowpop"},
        # the RAND2021, highpop combination is not used at all - you don't have to use all source timeseries
    ],
    # category 5 is not defined for CH4 in this example, so we skip processing it
    # altogether
    exclude_result=[{"entity": "CH4", "category (IPCC 2006)": "5"}],
    # in this example, we know that COL has reported wrong data in the RAND2020 source
    # for SF6 category 1, so we exclude it from processing. It will be skipped, and the
    # other data sources will be used as configured in the priorities instead.
    exclude_input=[
        {
            "entity": "SF6",
            "category (IPCC 2006)": "1",
            "area (ISO3)": "COL",
            "source": "RAND2020",
        }
    ],
)
# Currently, there is only one strategy implemented, so we use
# the empty selector {}, which matches everything, to apply
# the substitution strategy to all timeseries.
strategy_definition = primap2.csg.StrategyDefinition(
    strategies=[({}, primap2.csg.SubstitutionStrategy())]
)
result_ds = primap2.csg.compose(
    input_data=input_ds,
    priority_definition=priority_definition,
    strategy_definition=strategy_definition,
    progress_bar=None,  # The animated progress bar is useless in the generated documentation
)

result_ds
<xarray.Dataset> Size: 102kB
Dimensions:               (time: 21, area (ISO3): 4, category (IPCC 2006): 8,
                           animal (FAOSTAT): 3, product (FAOSTAT): 2,
                           provenance: 1, model: 1)
Coordinates:
  * animal (FAOSTAT)      (animal (FAOSTAT)) <U5 60B 'cow' 'swine' 'goat'
  * area (ISO3)           (area (ISO3)) <U3 48B 'COL' 'ARG' 'MEX' 'BOL'
  * category (IPCC 2006)  (category (IPCC 2006)) <U3 96B '0' '1' ... '1.A' '1.B'
    category_names        (category (IPCC 2006)) <U14 448B 'total' ... 'light...
  * model                 (model) <U8 32B 'FANCYFAO'
  * product (FAOSTAT)     (product (FAOSTAT)) <U4 32B 'milk' 'meat'
  * provenance            (provenance) <U9 36B 'projected'
  * time                  (time) datetime64[ns] 168B 2000-01-01 ... 2020-01-01
Data variables:
    CH4                   (time, area (ISO3), category (IPCC 2006), animal (FAOSTAT), product (FAOSTAT), provenance, model) float64 32kB [CH4·Gg/a] ...
    Processing of CH4     (area (ISO3), category (IPCC 2006), animal (FAOSTAT), product (FAOSTAT), provenance, model) object 2kB ...
    CO2                   (time, area (ISO3), category (IPCC 2006), animal (FAOSTAT), product (FAOSTAT), provenance, model) float64 32kB [CO2·Gg/a] ...
    Processing of CO2     (area (ISO3), category (IPCC 2006), animal (FAOSTAT), product (FAOSTAT), provenance, model) object 2kB ...
    SF6                   (time, area (ISO3), category (IPCC 2006), animal (FAOSTAT), product (FAOSTAT), provenance, model) float64 32kB [SF6·Gg/a] ...
    Processing of SF6     (area (ISO3), category (IPCC 2006), animal (FAOSTAT), product (FAOSTAT), provenance, model) object 2kB ...
Attributes:
    area:                area (ISO3)
    cat:                 category (IPCC 2006)
    comment:             GHG inventory data ...
    contact:             lol_no_one_will_answer@example.com
    entity_terminology:  primap2
    history:             2021-01-14 14:50 data invented\n2021-01-14 14:51 add...
    institution:         PIK
    references:          doi:10.1012
    rights:              Use however you want.
    sec_cats:            ['animal (FAOSTAT)', 'product (FAOSTAT)']
    title:               Completely invented GHG inventory data

In the result, you can see that the priority dimensions have been removed and that new data variables named “Processing of $entity” have been added. They contain detailed information on how each timeseries was derived.

sel = {
    "animal": "cow",
    "category": ["0", "1"],
    "product": "milk",
    "time": slice("2000", "2002"),
    "area": "MEX",
}
result_ds["CH4"].pr.loc[sel]
<xarray.DataArray 'CH4' (time: 3, category (IPCC 2006): 2, provenance: 1,
                         model: 1)> Size: 48B
<Quantity([[[[0.36864371]]

  [[0.2494224 ]]]


 [[[0.41488627]]

  [[0.7457293 ]]]


 [[[0.06242199]]

  [[0.85542488]]]], 'CH4 * gigagram / year')>
Coordinates:
    animal (FAOSTAT)      <U5 20B 'cow'
    area (ISO3)           <U3 12B 'MEX'
  * category (IPCC 2006)  (category (IPCC 2006)) <U3 24B '0' '1'
    category_names        (category (IPCC 2006)) <U14 112B 'total' 'industry'
  * model                 (model) <U8 32B 'FANCYFAO'
    product (FAOSTAT)     <U4 16B 'milk'
  * provenance            (provenance) <U9 36B 'projected'
  * time                  (time) datetime64[ns] 24B 2000-01-01 ... 2002-01-01
Attributes:
    entity:   CH4
del sel["time"]
result_ds["Processing of CH4"].pr.loc[sel]
<xarray.DataArray 'Processing of CH4' (category (IPCC 2006): 2, provenance: 1,
                                       model: 1)> Size: 16B
array([[[TimeseriesProcessingDescription(steps=[ProcessingStepDescription(time='all', function='substitution', description="substituted with corresponding values from {'source': 'RAND2020', 'scenario (FAOSTAT)': 'highpop'}", source="{'source': 'RAND2020', 'scenario (FAOSTAT)': 'highpop'}")])]],

       [[TimeseriesProcessingDescription(steps=[ProcessingStepDescription(time=array(['2002-01-01T00:00:00.000000000', '2003-01-01T00:00:00.000000000',
                '2004-01-01T00:00:00.000000000', '2005-01-01T00:00:00.000000000',
                '2006-01-01T00:00:00.000000000', '2007-01-01T00:00:00.000000000',
                '2008-01-01T00:00:00.000000000', '2009-01-01T00:00:00.000000000',
                '2010-01-01T00:00:00.000000000', '2011-01-01T00:00:00.000000000',
                '2012-01-01T00:00:00.000000000', '2013-01-01T00:00:00.000000000',
                '2014-01-01T00:00:00.000000000', '2015-01-01T00:00:00.000000000',
                '2016-01-01T00:00:00.000000000', '2017-01-01T00:00:00.000000000',
                '2018-01-01T00:00:00.000000000', '2019-01-01T00:00:00.000000000',
                '2020-01-01T00:00:00.000000000'], dtype='datetime64[ns]'), function='substitution', description="substituted with corresponding values from {'source': 'RAND2020', 'scenario (FAOSTAT)': 'lowpop'}", source="{'source': 'RAND2020', 'scenario (FAOSTAT)': 'lowpop'}"), ProcessingStepDescription(time=array(['2000-01-01T00:00:00.000000000', '2001-01-01T00:00:00.000000000'],
               dtype='datetime64[ns]'), function='substitution', description="substituted with corresponding values from {'source': 'RAND2020', 'scenario (FAOSTAT)': 'highpop'}", source="{'source': 'RAND2020', 'scenario (FAOSTAT)': 'highpop'}")])                                                                                                                                         ]]],
      dtype=object)
Coordinates:
    animal (FAOSTAT)      <U5 20B 'cow'
    area (ISO3)           <U3 12B 'MEX'
  * category (IPCC 2006)  (category (IPCC 2006)) <U3 24B '0' '1'
    category_names        (category (IPCC 2006)) <U14 112B 'total' 'industry'
  * model                 (model) <U8 32B 'FANCYFAO'
    product (FAOSTAT)     <U4 16B 'milk'
  * provenance            (provenance) <U9 36B 'projected'
Attributes:
    entity:              Processing of CH4
    described_variable:  CH4
for tpd in result_ds["Processing of CH4"].pr.loc[sel]:
    print(f"category={tpd['category (IPCC 2006)'].item()}")
    print(str(tpd.item()))
    print()
category=0
Using function=substitution with source={'source': 'RAND2020', 'scenario (FAOSTAT)': 'highpop'} for times=all: substituted with corresponding values from {'source': 'RAND2020', 'scenario (FAOSTAT)': 'highpop'}

category=1
Using function=substitution with source={'source': 'RAND2020', 'scenario (FAOSTAT)': 'lowpop'} for times=['2002-01-01T00:00:00.000000000' '2003-01-01T00:00:00.000000000'
 '2004-01-01T00:00:00.000000000' '2005-01-01T00:00:00.000000000'
 '2006-01-01T00:00:00.000000000' '2007-01-01T00:00:00.000000000'
 '2008-01-01T00:00:00.000000000' '2009-01-01T00:00:00.000000000'
 '2010-01-01T00:00:00.000000000' '2011-01-01T00:00:00.000000000'
 '2012-01-01T00:00:00.000000000' '2013-01-01T00:00:00.000000000'
 '2014-01-01T00:00:00.000000000' '2015-01-01T00:00:00.000000000'
 '2016-01-01T00:00:00.000000000' '2017-01-01T00:00:00.000000000'
 '2018-01-01T00:00:00.000000000' '2019-01-01T00:00:00.000000000'
 '2020-01-01T00:00:00.000000000']: substituted with corresponding values from {'source': 'RAND2020', 'scenario (FAOSTAT)': 'lowpop'}
Using function=substitution with source={'source': 'RAND2020', 'scenario (FAOSTAT)': 'highpop'} for times=['2000-01-01T00:00:00.000000000' '2001-01-01T00:00:00.000000000']: substituted with corresponding values from {'source': 'RAND2020', 'scenario (FAOSTAT)': 'highpop'}

We can see that, as configured, “highpop” was preferred for category 0 and “lowpop” for category 1. For category 0, the initial timeseries did not contain any NaNs, so no filling was needed. For category 1, information was missing in the initial timeseries (the years 2000 and 2001, which were set to NaN in the input above), so the lower-priority timeseries was used to fill the gaps.
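The category 1 pattern can be reproduced in miniature with plain NumPy. The sketch below fills a timeseries by priority and records which source supplied which years, loosely mirroring the information in the processing descriptions. The `FillStep` class, the `compose_with_log` function, and all numbers are invented for illustration and are not part of primap2.

```python
from dataclasses import dataclass

import numpy as np


@dataclass
class FillStep:
    """Hypothetical record of one filling step."""

    source: str
    years: list


def compose_with_log(years, sources):
    """Fill a timeseries by priority and log which source filled which years."""
    result = np.full(len(years), np.nan)
    steps = []
    for label, values in sources:
        values = np.asarray(values, dtype=float)
        # Time steps that are still missing and that this source can provide.
        usable = np.isnan(result) & ~np.isnan(values)
        if usable.any():
            result[usable] = values[usable]
            steps.append(FillStep(label, [y for y, u in zip(years, usable) if u]))
        if not np.isnan(result).any():
            # Complete: lower-priority sources are not consulted.
            break
    return result, steps


years = [2000, 2001, 2002, 2003]
sources = [
    # Highest priority first; the first source lacks 2000 and 2001.
    ("RAND2020/lowpop", [np.nan, np.nan, 0.3, 0.4]),
    ("RAND2020/highpop", [1.0, 2.0, 3.0, 4.0]),
    ("RAND2021/lowpop", [9.0, 9.0, 9.0, 9.0]),
]

result, steps = compose_with_log(years, sources)
for step in steps:
    print(step)
```

Two steps are logged: the first source supplies 2002 and 2003, the second fills 2000 and 2001, and the third source is never used because nothing is missing any more.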