# Dealing with missing information

## Aggregation
xarray provides robust functions for aggregation, such as `xarray.DataArray.sum()`.
PRIMAP2 adds functions which only skip missing data points if the information is
missing at all points along certain axes, for example for a whole time series.

Let's first create an example with missing information:
```python
import pandas as pd
import numpy as np
import xarray as xr
import primap2

time = pd.date_range("2000-01-01", "2003-01-01", freq="YS")
area_iso3 = np.array(["COL", "ARG", "MEX"])
coords = [("area (ISO3)", area_iso3), ("time", time)]
da = xr.DataArray(
    data=[
        [1, 2, 3, 4],
        [np.nan, np.nan, np.nan, np.nan],
        [1, 2, 3, np.nan],
    ],
    coords=coords,
    name="test data",
)

da.pr.to_df()
```
| area (ISO3) | 2000-01-01 | 2001-01-01 | 2002-01-01 | 2003-01-01 |
|---|---|---|---|---|
| COL | 1.0 | 2.0 | 3.0 | 4.0 |
| ARG | NaN | NaN | NaN | NaN |
| MEX | 1.0 | 2.0 | 3.0 | NaN |
Now, we can use the PRIMAP2 `xarray.DataArray.pr.sum()` function to sum over
countries while ignoring only those countries where the whole time series is
missing, using the `skipna_evaluation_dims` parameter:
```python
da.pr.sum(dim="area", skipna_evaluation_dims="time").pr.to_df()
```

```
time
2000-01-01    2.0
2001-01-01    4.0
2002-01-01    6.0
2003-01-01    NaN
Freq: YS-JAN, Name: test data, dtype: float64
```
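The effect of `skipna_evaluation_dims="time"` can be reproduced with plain xarray by first dropping labels whose whole time series is missing and then summing strictly. This is an illustrative sketch of the behaviour, not the PRIMAP2 implementation:

```python
import numpy as np
import pandas as pd
import xarray as xr

time = pd.date_range("2000-01-01", "2003-01-01", freq="YS")
da = xr.DataArray(
    data=[[1, 2, 3, 4], [np.nan] * 4, [1, 2, 3, np.nan]],
    coords=[("area (ISO3)", ["COL", "ARG", "MEX"]), ("time", time)],
)

# drop countries where the entire time series is NaN (here: ARG) ...
cleaned = da.dropna(dim="area (ISO3)", how="all")
# ... then sum strictly, so partially missing data (MEX in 2003) still yields NaN
result = cleaned.sum(dim="area (ISO3)", skipna=False)
```

The result matches the `pr.sum()` output above: 2.0, 4.0, 6.0, and NaN for 2003.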
If you instead want to skip all NA values, use the `skipna` parameter:
```python
da.pr.sum(dim="area", skipna=True).pr.to_df()
```

```
time
2000-01-01    2.0
2001-01-01    4.0
2002-01-01    6.0
2003-01-01    4.0
Freq: YS-JAN, Name: test data, dtype: float64
```
```python
# compare this to the result of the standard xarray sum - it also skips NA values by default:
da.sum(dim="area (ISO3)").pr.to_df()
```

```
time
2000-01-01    2.0
2001-01-01    4.0
2002-01-01    6.0
2003-01-01    4.0
Freq: YS-JAN, Name: test data, dtype: float64
```
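For contrast, `skipna=False` propagates every NaN: because ARG is missing at every time step, the strict sum is NaN everywhere. A quick sketch using the same example data:

```python
import numpy as np
import pandas as pd
import xarray as xr

time = pd.date_range("2000-01-01", "2003-01-01", freq="YS")
da = xr.DataArray(
    data=[[1, 2, 3, 4], [np.nan] * 4, [1, 2, 3, np.nan]],
    coords=[("area (ISO3)", ["COL", "ARG", "MEX"]), ("time", time)],
)

# ARG contributes NaN at every time step, so no time step survives a strict sum
strict = da.sum(dim="area (ISO3)", skipna=False)
```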
## Infilling

The same functionality is available for filling in missing information, using the
`xarray.DataArray.pr.fill_all_na()` function.
In this example, we fill missing information only where the whole time series is missing:
```python
da.pr.fill_all_na("time", value=10).pr.to_df()
```

| area (ISO3) | 2000-01-01 | 2001-01-01 | 2002-01-01 | 2003-01-01 |
|---|---|---|---|---|
| COL | 1.0 | 2.0 | 3.0 | 4.0 |
| ARG | 10.0 | 10.0 | 10.0 | 10.0 |
| MEX | 1.0 | 2.0 | 3.0 | NaN |
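The underlying logic can be sketched with plain xarray: build a mask of labels whose whole time series is missing and fill only those. This is an illustration of the behaviour, not the PRIMAP2 implementation:

```python
import numpy as np
import pandas as pd
import xarray as xr

time = pd.date_range("2000-01-01", "2003-01-01", freq="YS")
da = xr.DataArray(
    data=[[1, 2, 3, 4], [np.nan] * 4, [1, 2, 3, np.nan]],
    coords=[("area (ISO3)", ["COL", "ARG", "MEX"]), ("time", time)],
)

# mask of countries whose whole time series is missing (here: ARG)
all_missing = da.isnull().all(dim="time")
# fill only those rows; isolated gaps (MEX in 2003) stay NaN
filled = xr.where(all_missing, 10.0, da)
```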
## Bulk aggregation

For larger aggregation tasks, e.g. aggregating several gas baskets from individual
gases or aggregating a full category tree from its leaves, we provide the functions
`xarray.Dataset.pr.add_aggregates_variables()`,
`xarray.Dataset.pr.add_aggregates_coordinates()`, and
`xarray.DataArray.pr.add_aggregates_coordinates()`. They are highly configurable,
but can also be used in a simplified mode for quick aggregation tasks. In the
following we give a few examples; for the full feature set, refer to the function
descriptions linked above. Internally, the functions use
`xarray.Dataset.pr.merge()` / `xarray.DataArray.pr.merge()` to allow for
consistency checks when target timeseries already exist.
### Add aggregates for variables

The `xarray.Dataset.pr.add_aggregates_variables()` function aggregates data from
individual variables to new variables (usually gas baskets). Several variables can
be created in one call, where the order of definition is the order of creation.
Filters can be specified to limit aggregation to certain coordinate values.
#### Examples

Sum gases in the minimal example dataset:
```python
ds_min = primap2.open_dataset("../minimal_ds.nc")
summed_ds = ds_min.pr.add_aggregates_variables(
    gas_baskets={
        "test (SARGWP100)": {
            "sources": ["CO2", "SF6", "CH4"],
        },
    },
)
summed_ds["test (SARGWP100)"]
```
```
<xarray.DataArray 'test (SARGWP100)' (time: 21, area (ISO3): 4, source: 1)> Size: 672B
<Quantity([[[   59.52116157] [12091.80994421] [16182.21435315] [ 2965.80220677]]
 [[11436.07585155] [12597.2322152 ] [19172.6969282 ] [19351.58487054]]
 [[12749.61794337] [14785.27911422] [ 5733.75559422] [ 7390.51658535]]
 [[12745.20608578] [11897.40280597] [10390.92115671] [16345.51800435]]
 ...
 [[14148.34407451] [  449.63570296] [  280.71991745] [14065.93620383]]
 [[13244.23861057] [23847.91204598] [ 4243.44458005] [19677.45365633]]
 [[23081.76617658] [ 6422.07536784] [22948.697363  ] [ 3986.50318618]]
 [[22806.4457288 ] [18200.33442572] [ 6282.66521788] [ 1017.35660718]]], 'gigagram * CO2 / year')>
Coordinates:
  * area (ISO3)  (area (ISO3)) <U3 48B 'COL' 'ARG' 'MEX' 'BOL'
  * source       (source) <U8 32B 'RAND2020'
  * time         (time) datetime64[ns] 168B 2000-01-01 2001-01-01 ... 2020-01-01
Attributes:
    gwp_context:  SARGWP100
    entity:       test
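Conceptually, a gas basket is the GWP-weighted sum of its member gases. The following sketch illustrates this with tiny made-up emission series (the names and numbers here are only for illustration; the SAR GWP100 factors 21 for CH4 and 23900 for SF6 are the standard IPCC SAR values):

```python
import xarray as xr

# SAR GWP100 factors: CO2-equivalents per unit mass of gas
gwp = {"CO2": 1, "CH4": 21, "SF6": 23900}

# made-up two-step emission time series per gas
co2 = xr.DataArray([1.0, 2.0], dims="time")
ch4 = xr.DataArray([0.1, 0.2], dims="time")
sf6 = xr.DataArray([0.001, 0.002], dims="time")

# the basket is the sum of the GWP-weighted members
members = {"CO2": co2, "CH4": ch4, "SF6": sf6}
basket = sum(gwp[gas] * data for gas, data in members.items())
```

`pr.add_aggregates_variables()` additionally handles unit conversion and consistency checks, which this sketch omits.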
We can also use a filter / selector to limit the aggregation to a selection, e.g. a single country:
```python
filtered_ds = ds_min.pr.add_aggregates_variables(
    gas_baskets={
        "test (SARGWP100)": {
            "sources": ["CO2", "SF6", "CH4"],
            "sel": {"area (ISO3)": ["COL"]},
        },
    },
)
filtered_ds["test (SARGWP100)"]
```
```
<xarray.DataArray 'test (SARGWP100)' (time: 21, area (ISO3): 4, source: 1)> Size: 672B
<Quantity([[[   59.52116157] [          nan] [          nan] [          nan]]
 [[11436.07585155] [          nan] [          nan] [          nan]]
 [[12749.61794337] [          nan] [          nan] [          nan]]
 [[12745.20608578] [          nan] [          nan] [          nan]]
 ...
 [[14148.34407451] [          nan] [          nan] [          nan]]
 [[13244.23861057] [          nan] [          nan] [          nan]]
 [[23081.76617658] [          nan] [          nan] [          nan]]
 [[22806.4457288 ] [          nan] [          nan] [          nan]]], 'gigagram * CO2 / year')>
Coordinates:
  * area (ISO3)  (area (ISO3)) <U3 48B 'COL' 'ARG' 'MEX' 'BOL'
  * source       (source) <U8 32B 'RAND2020'
  * time         (time) datetime64[ns] 168B 2000-01-01 2001-01-01 ... 2020-01-01
Attributes:
    gwp_context:  SARGWP100
    entity:       test
When filtering, it is important to note that entities and variables are not the
same thing. The difference between the `entity` and `variable` filters / selectors
is that `'entity': ['SF6']` will match both the variables `'SF6'` and
`'SF6 (SARGWP100)'` (as both variables belong to the entity `'SF6'`), while
`'variable': ['SF6']` will match only the variable `'SF6'`.
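The naming convention behind this distinction is that GWP-converted variables carry the GWP context in parentheses after the entity name. A hypothetical helper illustrating how an entity can be derived from a variable name (this is only an illustration of the convention, not PRIMAP2's actual matching code):

```python
import re

def entity_of(variable: str) -> str:
    """Strip a trailing '(GWP context)' suffix from a variable name.

    Illustrative only: e.g. 'SF6 (SARGWP100)' -> 'SF6', while plain
    entity names like 'SF6' are returned unchanged.
    """
    return re.sub(r"\s*\([^)]*\)$", "", variable)
```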
If we recompute an existing timeseries, it has to be consistent with the existing
data. Here we use the simple mode to specify the aggregation rules. The example
below fails because the result is inconsistent with the existing data.
```python
from xarray import MergeError

try:
    recomputed_ds = filtered_ds.pr.add_aggregates_variables(
        gas_baskets={
            "test (SARGWP100)": ["CO2", "CH4"],
        },
    )
    recomputed_ds["test (SARGWP100)"]
except MergeError as err:
    print(err)
```
```
2025-02-07T09:47:02.111068+0000 ERROR pr.merge error: found discrepancies larger than tolerance (1.00%) for area (ISO3)=COL, source=RAND2020:
shown are relative discrepancies. (test (SARGWP100))
            test (SARGWP100)
time
2000-01-01          0.722739
2001-01-01          0.998434
2002-01-01          0.999453
2003-01-01          0.998306
2004-01-01          0.998756
2005-01-01          0.998049
2006-01-01          0.995990
2007-01-01          0.998147
2008-01-01          0.995131
2009-01-01          0.999447
2010-01-01          0.999546
2011-01-01          0.998854
2012-01-01          0.999669
2013-01-01          0.999974
2014-01-01          0.999712
2015-01-01          0.999793
2016-01-01          0.998886
2017-01-01          0.998772
2018-01-01          0.999852
2019-01-01          0.999112
2020-01-01          0.999471
```
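The relative discrepancies shown above can be understood as |recomputed − existing| / |existing|; a value of 0.999 means the recomputed basket is about 0.1% of the existing one, far outside the default 1% tolerance. A hedged sketch of this concept (not PRIMAP2's actual check):

```python
import numpy as np

def relative_discrepancy(existing: float, recomputed: float) -> float:
    # illustrative: relative discrepancy as shown in the table above
    return np.abs(recomputed - existing) / np.abs(existing)

# a 1% tolerance corresponds to relative discrepancies <= 0.01
```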
We can set the tolerance high enough that the check passes and no error is raised.
This is only possible in the complex mode for the aggregation rules.
```python
recomputed_ds = filtered_ds.pr.add_aggregates_variables(
    gas_baskets={
        "test (SARGWP100)": {
            "sources": ["CO2", "CH4"],
            "tolerance": 1,  # 100%
        },
    },
)
recomputed_ds["test (SARGWP100)"]
```
```
<xarray.DataArray 'test (SARGWP100)' (time: 21, area (ISO3): 4, source: 1)> Size: 672B
<Quantity([[[5.95211616e+01] [8.65600580e+00] [1.55819099e+01] [1.24871019e+01]]
 [[1.14360759e+04] [2.53322001e+00] [1.39057502e+01] [1.56300850e+00]]
 [[1.27496179e+04] [6.52687510e+00] [9.95243955e+00] [1.62190600e+01]]
 [[1.27452061e+04] [1.51039398e+01] [1.59373124e+01] [5.28770892e+00]]
 ...
 [[1.41483441e+04] [1.62580953e+01] [3.77418330e+00] [3.73085782e+00]]
 [[1.32442386e+04] [2.45037885e+00] [1.10246891e+00] [9.05224940e+00]]
 [[2.30817662e+04] [1.18341086e+00] [1.53357641e+01] [2.72788938e+00]]
 [[2.28064457e+04] [2.51734396e+00] [6.70344318e-01] [1.16859245e+01]]], 'gigagram * CO2 / year')>
Coordinates:
  * area (ISO3)  (area (ISO3)) <U3 48B 'COL' 'ARG' 'MEX' 'BOL'
  * source       (source) <U8 32B 'RAND2020'
  * time         (time) datetime64[ns] 168B 2000-01-01 2001-01-01 ... 2020-01-01
Attributes:
    gwp_context:  SARGWP100
    entity:       test
### Add aggregates for coordinates

The `xarray.Dataset.pr.add_aggregates_coordinates()` function aggregates data from
individual coordinate values to new values (e.g. from subcategories to categories).
Several values for several coordinates can be created in one call, where the order
of definition is the order of creation. Filters can be specified to limit
aggregation to certain coordinate values, entities, or variables. Most of the
operation is similar to the variable aggregation, so we keep the examples here
short. The `xarray.DataArray.pr.add_aggregates_coordinates()` function uses the
same syntax.
#### Examples

Sum countries in the minimal example dataset:
```python
test_ds = ds_min.pr.add_aggregates_coordinates(
    agg_info={
        "area (ISO3)": {
            "all": {
                "sources": ["COL", "ARG", "MEX", "BOL"],
            }
        }
    }
)
test_ds
```
```
<xarray.Dataset> Size: 4kB
Dimensions:          (area (ISO3): 5, source: 1, time: 21)
Coordinates:
  * area (ISO3)      (area (ISO3)) <U3 60B 'ARG' 'BOL' 'COL' 'MEX' 'all'
  * source           (source) <U8 32B 'RAND2020'
  * time             (time) datetime64[ns] 168B 2000-01-01 ... 2020-01-01
Data variables:
    CH4              (time, area (ISO3), source) float64 840B [CH4·Gg/a] 0.36...
    CO2              (time, area (ISO3), source) float64 840B [CO2·Gg/a] 0.91...
    SF6              (time, area (ISO3), source) float64 840B [SF6·Gg/a] 0.50...
    SF6 (SARGWP100)  (time, area (ISO3), source) float64 840B [CO2·Gg/a] 1.20...
Attributes:
    area:         area (ISO3)
    entity:       SF6
    gwp_context:  SARGWP100
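The essence of this operation, adding a new coordinate value holding the sum over existing values, can be sketched with plain xarray using the small example array from the beginning of this notebook. This illustrates the shape of the result, not PRIMAP2's implementation with its consistency checks:

```python
import numpy as np
import pandas as pd
import xarray as xr

time = pd.date_range("2000-01-01", "2003-01-01", freq="YS")
da = xr.DataArray(
    data=[[1, 2, 3, 4], [np.nan] * 4, [1, 2, 3, np.nan]],
    coords=[("area (ISO3)", ["COL", "ARG", "MEX"]), ("time", time)],
)

# sum over the existing coordinate values ...
summed = da.sum(dim="area (ISO3)")
# ... attach the sum as a new coordinate value 'all' ...
expanded = summed.expand_dims({"area (ISO3)": ["all"]})
# ... and concatenate it onto the original array
combined = xr.concat([da, expanded], dim="area (ISO3)")
```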