Data format details

Data is stored in xarray dataset objects with specified dimensions, coordinates, data variables, and attrs.

Dimensions

Every dataset requires the time, area, and source dimensions; other dimensions and coordinates can be added where needed. All dimensions must use the defined names, and additional metadata about the dimensions is stored in the dataset's attrs. The dimensions are:

dimension	dimension key	required	notes	attrs
time	time	yes	for periods, the start of the period	
area	area (<category-set>)	yes	must be a pre-defined category set	`'area': 'area (<category-set>)'`
category	category (<c-set>)	no	primary category	`'cat': 'category (<c-set>)'`
sec. categories	<type> (<c-set>)	no	there can be multiple	`'sec_cats': ['<type> (<c-set>)', …]`
scenario	scenario (<c-set>)	no		`'scen': 'scenario (<c-set>)'`
provenance	provenance	no	values from a fixed set	
model	model	no	model should be from a predefined list	
source	source	yes	a short source identifier	

For some dimensions, the meaning of the data is directly visible from the data type (time uses an xarray datetime data type) or the values come from a pre-defined list (provenance, model). For the other dimensions, values come from a set of categories (denoted <category-set>, or <c-set> for short, in the table), such as the ISO-3166-1 three-letter country abbreviations denoting the area or the IPCC 2006 categories used as the primary category. In this case, the category set in use is included directly in the dimension key in parentheses, and a translation from a generic name to the dimension key is included in the attrs.

Most commonly, data have either no category (for example, population data) or one primary category (for example, most CO2 emissions data). Therefore, the primary category is not required, and if it is used, it is denoted as the category. However, some data sets have more than one categorization, for example the FAO land-use emissions data, which are categorized according to agricultural sector and animal type. Therefore, it is possible to include arbitrary secondary categories, where the dimension key is formed from the dimension or type (<type> in the table) and the category set (for example, animal (FAOSTAT)).
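The key-naming scheme described above can be sketched in plain Python. This is only an illustration of how dimension keys and the matching attrs entries fit together, not part of any library: the helper names `dimension_key` and `dimension_attrs` and the category-set labels `ISO3` and `IPCC2006` are assumptions made for this example, while `animal (FAOSTAT)` is taken from the text.

```python
from typing import Optional, Sequence, Tuple


def dimension_key(kind: str, category_set: str) -> str:
    """Form a dimension key such as 'area (ISO3)' from a type and a category set."""
    return f"{kind} ({category_set})"


def dimension_attrs(
    area_set: str,
    category_set: Optional[str] = None,
    secondary: Sequence[Tuple[str, str]] = (),
    scenario_set: Optional[str] = None,
) -> dict:
    """Build the attrs entries that translate generic names to dimension keys."""
    attrs = {"area": dimension_key("area", area_set)}
    if category_set is not None:
        # The primary category is optional.
        attrs["cat"] = dimension_key("category", category_set)
    if secondary:
        # There can be several secondary categories, each with its own type.
        attrs["sec_cats"] = [dimension_key(kind, cset) for kind, cset in secondary]
    if scenario_set is not None:
        attrs["scen"] = dimension_key("scenario", scenario_set)
    return attrs


# "ISO3" and "IPCC2006" are hypothetical category-set labels.
attrs = dimension_attrs("ISO3", "IPCC2006", secondary=[("animal", "FAOSTAT")])
print(attrs)
```

The resulting dictionary would be merged into the dataset attrs, and the values it contains would be used verbatim as the dimension names of the dataset.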

Additional rules:

The valid values for the provenance are measured, projected, and derived.
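A check against this fixed set could look like the following sketch. The function name `invalid_provenance` is an assumption for illustration; only the three allowed values come from the rule above.

```python
# The three values allowed by the rule above.
VALID_PROVENANCE = frozenset({"measured", "projected", "derived"})


def invalid_provenance(values):
    """Return the provenance values that are not in the fixed set, in input order."""
    return [value for value in values if value not in VALID_PROVENANCE]


# "estimated" is not an allowed value and is reported.
bad = invalid_provenance(["measured", "estimated", "derived"])
print(bad)
```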

Additional Coordinates

Besides the coordinates defining dimensions, additional coordinates can be given, for example to supply category names for the categories. Additional coordinates are not required to have unique values. The names of additional coordinates must not contain spaces; replace them, preferably with _.

Data Variables

Each data variable has a name (key) and attributes (attrs). The attributes are:

attribute	content	required
entity	entity code, possibly from the dataset’s entity category set	yes
gwp_context	which global warming potential (GWP) context was used for calculating the data	if a GWP was used
units	in which units the data is given	see rules below

For the entity, a category set (i.e. a terminology) should be defined for the whole dataset in the dataset attributes using the key entity_terminology (see below). If the entity_terminology is defined, all entities in the dataset must be defined in the terminology so that the exact meaning of entity codes is known. If the entity_terminology is not defined, the meaning of the entities is not clearly defined.

The name of the data variable (its key) is formed from the entity and the gwp_context, if applicable. If there is no gwp_context, the name is the entity. If there is a gwp_context, the name is the entity, followed by the gwp_context in parentheses, separated from the entity by a space.
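The naming rule can be written as a small helper. This is a minimal sketch: the function name `variable_name` and the context label `AR4GWP100` are assumptions chosen for illustration, not values prescribed by the text.

```python
from typing import Optional


def variable_name(entity: str, gwp_context: Optional[str] = None) -> str:
    """Form the data variable key from the entity and an optional gwp_context."""
    if gwp_context is None:
        # Without a GWP context, the name is just the entity.
        return entity
    # With a GWP context: entity, a space, then the context in parentheses.
    return f"{entity} ({gwp_context})"


print(variable_name("CO2"))
# "AR4GWP100" is a hypothetical gwp_context label.
print(variable_name("CH4", "AR4GWP100"))
```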

Units are required for all data variables with a dtype of float, while for data with other data types, the units are required only where they make sense. For example, data with an integer data type representing (human or animal) population data requires units, while a data variable with a categorical data type representing the evaluation method for each data point does not require units. If the units are required, they can either be given by quantifying the data variable using pint_xarray, or can be included in the variable attributes using the key units as a string. For storage, the dataset should not be quantified and the units should be given in the attrs, but for calculations the dataset should be quantified using pint_xarray. If given in the attrs as a string, the units must be parsable by openscm-units.
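The part of the units rule that can be checked mechanically is the requirement for float data in the stored (non-quantified) form. The sketch below assumes each variable is described by its dtype name and its attrs dictionary; the function name `float_variables_missing_units` and the sample unit string are assumptions for illustration, and the code does not check whether a unit string is parsable by openscm-units.

```python
def float_variables_missing_units(variables: dict) -> list:
    """Return names of float variables whose attrs lack a 'units' entry.

    `variables` maps a variable name to a (dtype_name, attrs) pair.
    Non-float variables are skipped because their need for units is
    decided case by case.
    """
    missing = []
    for name, (dtype_name, attrs) in variables.items():
        if dtype_name.startswith("float") and "units" not in attrs:
            missing.append(name)
    return missing


variables = {
    # Illustrative unit string, not validated here.
    "CO2": ("float64", {"entity": "CO2", "units": "Gg CO2 / year"}),
    "CH4": ("float64", {"entity": "CH4"}),
    # Categorical data such as an evaluation method needs no units.
    "method": ("category", {"entity": "method"}),
}
missing = float_variables_missing_units(variables)
print(missing)
```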

Dataset Attributes

Metadata about the dimensions and the data set as a whole is stored in the dataset attrs. The metadata about the dimensions is described above in the Dimensions section. The other attributes, which hold metadata about the dataset as a whole, are:

attribute	description	data type
references	citable reference(s) describing the data	free-form `str` (ideally URL)
rights	license or other usage restrictions	free-form `str`
contact	who can answer questions about the data	free-form `str` (usually email)
title	a succinct description	free-form `str`
comment	longer form description	free-form `str`
institution	where the data originates	free-form `str`
history	processing steps done on the data	`str` with specific rules (see text)
entity_terminology	terminology for data variable entities	`str`
publication_date	date of publication of the dataset	`datetime.date`

All of these attributes are optional. If the references field starts with doi:, it is a DOI; otherwise it is a free-form literature reference. The history field can store an audit trail of modifications. Steps are separated by a newline character, and each processing step should append to the field.
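The two conventions for references and history can be sketched as follows, treating the dataset attrs as a plain dictionary. The helper names `is_doi` and `append_history`, the placeholder DOI, and the step descriptions are all assumptions for illustration.

```python
def is_doi(references: str) -> bool:
    """A references field is a DOI if it starts with 'doi:'."""
    return references.startswith("doi:")


def append_history(attrs: dict, step: str) -> None:
    """Append one processing step to the history, one step per line."""
    previous = attrs.get("history", "")
    attrs["history"] = f"{previous}\n{step}" if previous else step


# The DOI below is a placeholder, not a real reference.
attrs = {"references": "doi:10.1234/example"}
append_history(attrs, "read raw data")
append_history(attrs, "converted units")

print(is_doi(attrs["references"]))
print(attrs["history"].split("\n"))
```

Appending rather than overwriting keeps earlier steps intact, so the field stays usable as an audit trail.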

These attributes describing the data set contents are inspired by the CF conventions for the description of file contents.

The entity_terminology (if present) defines the meaning of the codes used in the data variables’ names and entity attributes. The attribute stores only the name of the terminology in use; the terminology itself is defined elsewhere.
