primap2.csg.compose


primap2.csg.compose(*, input_data: xarray.core.dataset.Dataset, priority_definition: primap2.csg._models.PriorityDefinition, strategy_definition: primap2.csg._models.StrategyDefinition, progress_bar: type[tqdm.std.tqdm] | None = tqdm.std.tqdm) -> Dataset

Compose a harmonized dataset from multiple input datasets.

The input datasets are processed at the timeseries level. For each timeseries:

  • the highest priority dataset is chosen according to the priority_definition

  • if the dataset contains any nan values (denoting missing information), the next highest priority dataset is selected to fill the missing values

  • a filling strategy is chosen according to the strategy_definition and the filling is done with the dataset selected in the previous step

  • if there are still missing (nan) values afterwards, the next highest priority dataset is selected and the previous step is repeated

  • when all missing values are filled or all input datasets are exhausted, the timeseries is finished and written to the result.
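
The steps above can be sketched for a single timeseries as follows. This is a minimal illustration in plain Python, not primap2's implementation: the function name `compose_timeseries` and the step messages are hypothetical, and simple substitution is assumed as the only filling strategy.

```python
import math


def compose_timeseries(candidates):
    """Fill one timeseries from candidates ordered by descending priority.

    candidates: list of (source_name, values) pairs (hypothetical structure);
    values are equal-length lists of floats, math.nan marks missing data.
    Returns (filled_values, steps); steps describes what was done.
    """
    if not candidates:
        return [], []
    first_name, first_values = candidates[0]
    result = list(first_values)
    steps = [f"initialized from {first_name}"]
    for name, values in candidates[1:]:
        missing = [i for i, v in enumerate(result) if math.isnan(v)]
        if not missing:
            break  # all values are filled, the timeseries is finished
        filled = 0
        for i in missing:
            # Assumed strategy: plain substitution of the missing value.
            if not math.isnan(values[i]):
                result[i] = values[i]
                filled += 1
        steps.append(f"filled {filled} value(s) from {name}")
    return result, steps


result, steps = compose_timeseries(
    [
        ("A", [1.0, math.nan, math.nan, 4.0]),  # highest priority
        ("B", [9.0, 2.0, math.nan, 9.0]),
        ("C", [7.0, 7.0, 3.0, 7.0]),  # lowest priority
    ]
)
```

Values already present are never overwritten by lower priority sources. The `steps` list plays the role of the per-timeseries processing description.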

In addition to the harmonized data, the result dataset contains a description of the processing steps done for each timeseries. For each variable, a variable named “Processing of $variable” is returned, which has the same dimensions as the variable except for the time dimension.

Parameters:
input_data

Dataset with the input data. The input data dimensions determine the output data dimensions: the output data has the same dimensions as the input data, minus the priority dimensions defined in the priority_definition. The priority dimensions are used to select the different datasets for the filling, so they do not appear in the result.

priority_definition

Defines the priorities to select timeseries from the input data. Priorities are formed by a list of selections and are used “from left to right”, where the first matching selection has the highest priority. Each selection has to specify values for all priority dimensions (so that exactly one timeseries is selected from the input data), but can also specify other dimensions. That way it is, e.g., possible to define a different priority for a specific country by listing it early (i.e. with high priority) before the more general rules which should be applied for all other countries. You can also specify the “entity” or “variable” in the selection, which will limit the rule to a specific entity or variable, respectively. For each DataArray in the input_data Dataset, the variable is its name, the entity is the value of the key entity in its attrs.
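
The “from left to right” matching can be illustrated with plain dictionaries. This sketch does not use the PriorityDefinition class itself. The helper names are hypothetical, a single priority dimension called `source` is assumed, and skipping a source that was already selected by an earlier rule is an assumption of this sketch.

```python
def matches(selection, coords, ignore=()):
    """True if every key in selection (except ignored ones) equals coords[key]."""
    return all(coords.get(k) == v for k, v in selection.items() if k not in ignore)


def ordered_sources(priorities, priority_dim, coords):
    """Return the priority dimension values that apply to one timeseries, best first."""
    ordered = []
    for selection in priorities:  # leftmost selection = highest priority
        value = selection[priority_dim]
        if matches(selection, coords, ignore=(priority_dim,)) and value not in ordered:
            ordered.append(value)
    return ordered


# Country-specific rule first, general rules afterwards.
priorities = [
    {"source": "B", "area": "COL"},
    {"source": "A"},
    {"source": "B"},
]

col = ordered_sources(priorities, "source", {"area": "COL", "category": "1"})
arg = ordered_sources(priorities, "source", {"area": "ARG", "category": "1"})
```

Here `col` is `["B", "A"]` because the country-specific rule is listed early, while `arg` is `["A", "B"]` from the general rules.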

strategy_definition

Defines the filling strategies to be used when filling timeseries with other timeseries. Again, the priority is defined by a list of selections and corresponding strategies which are used “from left to right”. Selections can use any dimension and don’t have to apply to only one timeseries. For example, to define a default strategy which should be used for all timeseries unless something else is configured, configure an empty selection as the last (rightmost) entry. You can also specify the “entity” or “variable” in the selection, which will limit the rule to a specific entity or variable, respectively. For each DataArray in the input_data Dataset, the variable is its name, the entity is the value of the key entity in its attrs.
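
The role of an empty selection as a default can be sketched in the same way. This does not use the StrategyDefinition class itself. The function `pick_strategy` and the strategy names are hypothetical placeholders; the only point shown is that the first matching selection wins and an empty selection matches every timeseries.

```python
def pick_strategy(strategies, coords):
    """Return the strategy of the first (leftmost) selection matching coords."""
    for selection, strategy in strategies:
        # An empty selection has no conditions, so all() is True for it.
        if all(coords.get(k) == v for k, v in selection.items()):
            return strategy
    raise LookupError(f"no strategy configured for {coords}")


strategies = [
    ({"entity": "CO2", "source": "B"}, "scale_to_match"),  # specific rule
    ({}, "substitute"),  # default: empty selection as the last entry
]

specific = pick_strategy(strategies, {"entity": "CO2", "source": "B", "area": "COL"})
default = pick_strategy(strategies, {"entity": "CH4", "source": "B", "area": "COL"})
```

Without the final empty selection, a timeseries that matches no rule would have no strategy configured, which this sketch reports as a `LookupError`.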

progress_bar

By default, progress bars are shown during the operation using the tqdm package. If None, no progress bars are shown. To customize the progress bar, supply a class compatible with the protocol of tqdm.tqdm.

Returns:
result

Dataset with the same entities and dimensions as input_data, with the following changes: the data is composed and filled according to the rules, the priority dimensions are reduced and not included in the result, and additional variables of the form “Processing of $variable” are added which describe the processing steps done for each timeseries.