# Dealing with missing information

## Aggregation
xarray provides robust functions for aggregation, such as `xarray.DataArray.sum()`.
PRIMAP2 adds functions which only skip missing data points if the information is
missing at all points along certain axes, for example for a whole time series.

Let's first create an example with missing information:
```python
import pandas as pd
import numpy as np
import xarray as xr
import primap2

time = pd.date_range("2000-01-01", "2003-01-01", freq="YS")
area_iso3 = np.array(["COL", "ARG", "MEX"])
coords = [("area (ISO3)", area_iso3), ("time", time)]
da = xr.DataArray(
    data=[
        [1, 2, 3, 4],
        [np.nan, np.nan, np.nan, np.nan],
        [1, 2, 3, np.nan],
    ],
    coords=coords,
    name="test data",
)

da.pr.to_df()
```
| area (ISO3) | 2000-01-01 | 2001-01-01 | 2002-01-01 | 2003-01-01 |
|---|---|---|---|---|
| COL | 1.0 | 2.0 | 3.0 | 4.0 |
| ARG | NaN | NaN | NaN | NaN |
| MEX | 1.0 | 2.0 | 3.0 | NaN |
Now, we can use the PRIMAP2 `xarray.DataArray.pr.sum()` function to sum over
countries while ignoring only those countries where the whole time series is
missing, using the `skipna_evaluation_dims` parameter:
```python
da.pr.sum(dim="area", skipna_evaluation_dims="time").pr.to_df()
```

```
time
2000-01-01    2.0
2001-01-01    4.0
2002-01-01    6.0
2003-01-01    NaN
Freq: YS-JAN, Name: test data, dtype: float64
```
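The effect of `skipna_evaluation_dims="time"` can be reproduced with plain xarray by first dropping labels whose whole time series is missing and then summing strictly. This is an illustrative sketch of the behaviour, not the PRIMAP2 implementation:

```python
import numpy as np
import pandas as pd
import xarray as xr

time = pd.date_range("2000-01-01", "2003-01-01", freq="YS")
da = xr.DataArray(
    data=[[1, 2, 3, 4], [np.nan] * 4, [1, 2, 3, np.nan]],
    coords=[("area (ISO3)", ["COL", "ARG", "MEX"]), ("time", time)],
)

# drop countries where the entire time series is NaN (here: ARG) ...
cleaned = da.dropna(dim="area (ISO3)", how="all")
# ... then sum strictly, so partially missing data (MEX in 2003) still yields NaN
result = cleaned.sum(dim="area (ISO3)", skipna=False)
```

The result matches the `pr.sum()` output above: 2.0, 4.0, 6.0, and NaN for 2003.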
If you instead want to skip all NA values, use the `skipna` parameter:
```python
da.pr.sum(dim="area", skipna=True).pr.to_df()
```

```
time
2000-01-01    2.0
2001-01-01    4.0
2002-01-01    6.0
2003-01-01    4.0
Freq: YS-JAN, Name: test data, dtype: float64
```
```python
# compare this to the result of the standard xarray sum - it also skips NA values by default:
da.sum(dim="area (ISO3)").pr.to_df()
```

```
time
2000-01-01    2.0
2001-01-01    4.0
2002-01-01    6.0
2003-01-01    4.0
Freq: YS-JAN, Name: test data, dtype: float64
```
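For contrast, `skipna=False` propagates every NaN: because ARG is missing at every time step, the strict sum is NaN everywhere. A quick sketch using the same example data:

```python
import numpy as np
import pandas as pd
import xarray as xr

time = pd.date_range("2000-01-01", "2003-01-01", freq="YS")
da = xr.DataArray(
    data=[[1, 2, 3, 4], [np.nan] * 4, [1, 2, 3, np.nan]],
    coords=[("area (ISO3)", ["COL", "ARG", "MEX"]), ("time", time)],
)

# ARG contributes NaN at every time step, so no time step survives a strict sum
strict = da.sum(dim="area (ISO3)", skipna=False)
```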
## Infilling

The same functionality is available for filling in missing information, using the
`xarray.DataArray.pr.fill_all_na()` function.
In this example, we fill missing information only where the whole time series is missing:
```python
da.pr.fill_all_na("time", value=10).pr.to_df()
```

| area (ISO3) | 2000-01-01 | 2001-01-01 | 2002-01-01 | 2003-01-01 |
|---|---|---|---|---|
| COL | 1.0 | 2.0 | 3.0 | 4.0 |
| ARG | 10.0 | 10.0 | 10.0 | 10.0 |
| MEX | 1.0 | 2.0 | 3.0 | NaN |
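The underlying logic can be sketched with plain xarray: build a mask of labels whose whole time series is missing and fill only those. This is an illustration of the behaviour, not the PRIMAP2 implementation:

```python
import numpy as np
import pandas as pd
import xarray as xr

time = pd.date_range("2000-01-01", "2003-01-01", freq="YS")
da = xr.DataArray(
    data=[[1, 2, 3, 4], [np.nan] * 4, [1, 2, 3, np.nan]],
    coords=[("area (ISO3)", ["COL", "ARG", "MEX"]), ("time", time)],
)

# mask of countries whose whole time series is missing (here: ARG)
all_missing = da.isnull().all(dim="time")
# fill only those rows; isolated gaps (MEX in 2003) stay NaN
filled = xr.where(all_missing, 10.0, da)
```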
## Bulk aggregation

For larger aggregation tasks, e.g. aggregating several gas baskets from individual
gases or aggregating a full category tree from its leaves, we provide the functions
`xarray.Dataset.pr.add_aggregates_variables()`,
`xarray.Dataset.pr.add_aggregates_coordinates()`, and
`xarray.DataArray.pr.add_aggregates_coordinates()`. They are highly configurable,
but can also be used in a simplified mode for quick aggregation tasks. In the
following we give a few examples; for the full feature set, refer to the function
descriptions linked above. Internally, the functions use
`xarray.Dataset.pr.merge()` / `xarray.DataArray.pr.merge()` to allow for
consistency checks when target timeseries already exist.
### Add aggregates for variables

The `xarray.Dataset.pr.add_aggregates_variables()` function aggregates data from
individual variables to new variables (usually gas baskets). Several variables can
be created in one call, where the order of definition is the order of creation.
Filters can be specified to limit aggregation to certain coordinate values.
#### Examples

Sum gases in the minimal example dataset:
```python
ds_min = primap2.open_dataset("../minimal_ds.nc")
summed_ds = ds_min.pr.add_aggregates_variables(
    gas_baskets={
        "test (SARGWP100)": {
            "sources": ["CO2", "SF6", "CH4"],
        },
    },
)
summed_ds["test (SARGWP100)"]
```
```
<xarray.DataArray 'test (SARGWP100)' (time: 21, area (ISO3): 4, source: 1)> Size: 672B
<Quantity([[[   59.52116157] [12091.80994421] [16182.21435315] [ 2965.80220677]]
 [[11436.07585155] [12597.2322152 ] [19172.6969282 ] [19351.58487054]]
 [[12749.61794337] [14785.27911422] [ 5733.75559422] [ 7390.51658535]]
 [[12745.20608578] [11897.40280597] [10390.92115671] [16345.51800435]]
 ...
 [[14148.34407451] [  449.63570296] [  280.71991745] [14065.93620383]]
 [[13244.23861057] [23847.91204598] [ 4243.44458005] [19677.45365633]]
 [[23081.76617658] [ 6422.07536784] [22948.697363  ] [ 3986.50318618]]
 [[22806.4457288 ] [18200.33442572] [ 6282.66521788] [ 1017.35660718]]], 'gigagram * CO2 / year')>
Coordinates:
  * area (ISO3)  (area (ISO3)) <U3 48B 'COL' 'ARG' 'MEX' 'BOL'
  * source       (source) <U8 32B 'RAND2020'
  * time         (time) datetime64[ns] 168B 2000-01-01 2001-01-01 ... 2020-01-01
Attributes:
    gwp_context:  SARGWP100
    entity:       test
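Conceptually, a gas basket is the GWP-weighted sum of its member gases. The following sketch illustrates this with tiny made-up emission series (the names and numbers here are only for illustration; the SAR GWP100 factors 21 for CH4 and 23900 for SF6 are the standard IPCC SAR values):

```python
import xarray as xr

# SAR GWP100 factors: CO2-equivalents per unit mass of gas
gwp = {"CO2": 1, "CH4": 21, "SF6": 23900}

# made-up two-step emission time series per gas
co2 = xr.DataArray([1.0, 2.0], dims="time")
ch4 = xr.DataArray([0.1, 0.2], dims="time")
sf6 = xr.DataArray([0.001, 0.002], dims="time")

# the basket is the sum of the GWP-weighted members
members = {"CO2": co2, "CH4": ch4, "SF6": sf6}
basket = sum(gwp[gas] * data for gas, data in members.items())
```

`pr.add_aggregates_variables()` additionally handles unit conversion and consistency checks, which this sketch omits.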
We can also use a filter / selector to limit the aggregation to a selection, e.g. a single country:
```python
filtered_ds = ds_min.pr.add_aggregates_variables(
    gas_baskets={
        "test (SARGWP100)": {
            "sources": ["CO2", "SF6", "CH4"],
            "sel": {"area (ISO3)": ["COL"]},
        },
    },
)
filtered_ds["test (SARGWP100)"]
```
```
<xarray.DataArray 'test (SARGWP100)' (time: 21, area (ISO3): 4, source: 1)> Size: 672B
<Quantity([[[   59.52116157] [          nan] [          nan] [          nan]]
 [[11436.07585155] [          nan] [          nan] [          nan]]
 [[12749.61794337] [          nan] [          nan] [          nan]]
 [[12745.20608578] [          nan] [          nan] [          nan]]
 ...
 [[14148.34407451] [          nan] [          nan] [          nan]]
 [[13244.23861057] [          nan] [          nan] [          nan]]
 [[23081.76617658] [          nan] [          nan] [          nan]]
 [[22806.4457288 ] [          nan] [          nan] [          nan]]], 'gigagram * CO2 / year')>
Coordinates:
  * area (ISO3)  (area (ISO3)) <U3 48B 'COL' 'ARG' 'MEX' 'BOL'
  * source       (source) <U8 32B 'RAND2020'
  * time         (time) datetime64[ns] 168B 2000-01-01 2001-01-01 ... 2020-01-01
Attributes:
    gwp_context:  SARGWP100
    entity:       test
When filtering, it is important to note that entities and variables are not the
same thing. The difference between the `entity` and `variable` filters / selectors
is that `'entity': ['SF6']` will match both the variables `'SF6'` and
`'SF6 (SARGWP100)'` (as both variables belong to the entity `'SF6'`), while
`'variable': ['SF6']` will match only the variable `'SF6'`.
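The naming convention behind this distinction is that GWP-converted variables carry the GWP context in parentheses after the entity name. A hypothetical helper illustrating how an entity can be derived from a variable name (this is only an illustration of the convention, not PRIMAP2's actual matching code):

```python
import re

def entity_of(variable: str) -> str:
    """Strip a trailing '(GWP context)' suffix from a variable name.

    Illustrative only: e.g. 'SF6 (SARGWP100)' -> 'SF6', while plain
    entity names like 'SF6' are returned unchanged.
    """
    return re.sub(r"\s*\([^)]*\)$", "", variable)
```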
If we recompute an existing timeseries, it has to be consistent with the existing
data. Here we use the simple mode to specify the aggregation rules. The example
below fails because the result is inconsistent with the existing data.
```python
from xarray import MergeError

try:
    recomputed_ds = filtered_ds.pr.add_aggregates_variables(
        gas_baskets={
            "test (SARGWP100)": ["CO2", "CH4"],
        },
    )
    recomputed_ds["test (SARGWP100)"]
except MergeError as err:
    print(err)
```
```
2025-02-07T09:47:02.111068+0000 ERROR pr.merge error: found discrepancies larger than tolerance (1.00%) for area (ISO3)=COL, source=RAND2020:
shown are relative discrepancies. (test (SARGWP100))
            test (SARGWP100)
time
2000-01-01          0.722739
2001-01-01          0.998434
2002-01-01          0.999453
2003-01-01          0.998306
2004-01-01          0.998756
2005-01-01          0.998049
2006-01-01          0.995990
2007-01-01          0.998147
2008-01-01          0.995131
2009-01-01          0.999447
2010-01-01          0.999546
2011-01-01          0.998854
2012-01-01          0.999669
2013-01-01          0.999974
2014-01-01          0.999712
2015-01-01          0.999793
2016-01-01          0.998886
2017-01-01          0.998772
2018-01-01          0.999852
2019-01-01          0.999112
2020-01-01          0.999471
```
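The relative discrepancies shown above can be understood as |recomputed − existing| / |existing|; a value of 0.999 means the recomputed basket is about 0.1% of the existing one, far outside the default 1% tolerance. A hedged sketch of this concept (not PRIMAP2's actual check):

```python
import numpy as np

def relative_discrepancy(existing: float, recomputed: float) -> float:
    # illustrative: relative discrepancy as shown in the table above
    return np.abs(recomputed - existing) / np.abs(existing)

# a 1% tolerance corresponds to relative discrepancies <= 0.01
```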
We can set the tolerance high enough that the check passes and no error is raised.
This is only possible in the complex mode for the aggregation rules.
```python
recomputed_ds = filtered_ds.pr.add_aggregates_variables(
    gas_baskets={
        "test (SARGWP100)": {
            "sources": ["CO2", "CH4"],
            "tolerance": 1,  # 100%
        },
    },
)
recomputed_ds["test (SARGWP100)"]
```
```
<xarray.DataArray 'test (SARGWP100)' (time: 21, area (ISO3): 4, source: 1)> Size: 672B
<Quantity([[[5.95211616e+01] [8.65600580e+00] [1.55819099e+01] [1.24871019e+01]]
 [[1.14360759e+04] [2.53322001e+00] [1.39057502e+01] [1.56300850e+00]]
 [[1.27496179e+04] [6.52687510e+00] [9.95243955e+00] [1.62190600e+01]]
 [[1.27452061e+04] [1.51039398e+01] [1.59373124e+01] [5.28770892e+00]]
 ...
 [[1.41483441e+04] [1.62580953e+01] [3.77418330e+00] [3.73085782e+00]]
 [[1.32442386e+04] [2.45037885e+00] [1.10246891e+00] [9.05224940e+00]]
 [[2.30817662e+04] [1.18341086e+00] [1.53357641e+01] [2.72788938e+00]]
 [[2.28064457e+04] [2.51734396e+00] [6.70344318e-01] [1.16859245e+01]]], 'gigagram * CO2 / year')>
Coordinates:
  * area (ISO3)  (area (ISO3)) <U3 48B 'COL' 'ARG' 'MEX' 'BOL'
  * source       (source) <U8 32B 'RAND2020'
  * time         (time) datetime64[ns] 168B 2000-01-01 2001-01-01 ... 2020-01-01
Attributes:
    gwp_context:  SARGWP100
    entity:       test
### Add aggregates for coordinates

The `xarray.Dataset.pr.add_aggregates_coordinates()` function aggregates data from
individual coordinate values to new values (e.g. from subcategories to categories).
Several values for several coordinates can be created in one call, where the order
of definition is the order of creation. Filters can be specified to limit
aggregation to certain coordinate values, entities, or variables. Most of the
operation is similar to the variable aggregation, so we keep the examples here
short. The `xarray.DataArray.pr.add_aggregates_coordinates()` function uses the
same syntax.
#### Examples

Sum countries in the minimal example dataset:
```python
test_ds = ds_min.pr.add_aggregates_coordinates(
    agg_info={
        "area (ISO3)": {
            "all": {
                "sources": ["COL", "ARG", "MEX", "BOL"],
            }
        }
    }
)
test_ds
```
```
<xarray.Dataset> Size: 4kB
Dimensions:          (area (ISO3): 5, source: 1, time: 21)
Coordinates:
  * area (ISO3)      (area (ISO3)) <U3 60B 'ARG' 'BOL' 'COL' 'MEX' 'all'
  * source           (source) <U8 32B 'RAND2020'
  * time             (time) datetime64[ns] 168B 2000-01-01 ... 2020-01-01
Data variables:
    CH4              (time, area (ISO3), source) float64 840B [CH4·Gg/a] 0.36...
    CO2              (time, area (ISO3), source) float64 840B [CO2·Gg/a] 0.91...
    SF6              (time, area (ISO3), source) float64 840B [SF6·Gg/a] 0.50...
    SF6 (SARGWP100)  (time, area (ISO3), source) float64 840B [CO2·Gg/a] 1.20...
Attributes:
    area:         area (ISO3)
    entity:       SF6
    gwp_context:  SARGWP100
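The essence of this operation, adding a new coordinate value holding the sum over existing values, can be sketched with plain xarray using the small example array from the beginning of this notebook. This illustrates the shape of the result, not PRIMAP2's implementation with its consistency checks:

```python
import numpy as np
import pandas as pd
import xarray as xr

time = pd.date_range("2000-01-01", "2003-01-01", freq="YS")
da = xr.DataArray(
    data=[[1, 2, 3, 4], [np.nan] * 4, [1, 2, 3, np.nan]],
    coords=[("area (ISO3)", ["COL", "ARG", "MEX"]), ("time", time)],
)

# sum over the existing coordinate values ...
summed = da.sum(dim="area (ISO3)")
# ... attach the sum as a new coordinate value 'all' ...
expanded = summed.expand_dims({"area (ISO3)": ["all"]})
# ... and concatenate it onto the original array
combined = xr.concat([da, expanded], dim="area (ISO3)")
```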