Interchange format¶
In PRIMAP2, data is handled internally in xarray datasets with defined coordinates and metadata. On disk, this structure is stored as a netCDF file. Because the netCDF format was developed for exchanging multi-dimensional datasets with rich metadata, in which different entities can have different numbers of dimensions, we recommend that consumers of datasets published by us use the provided netCDF files.
However, we recognise that many existing workflows rely on tools that handle only tabular data. We therefore also publish in the PRIMAP2 Interchange Format, a tabular wide format with additional metadata. Users of the interchange format have to integrate this metadata carefully into their workflows to ensure correct results.
Logical format¶
In the interchange format, all dimensions and time points are represented by columns in a two-dimensional array. Values of the time columns are data, while values of the other columns are coordinates. To store metadata, including the information contained in the attrs dict in the PRIMAP2 xarray format, we use an additional dictionary. See the sections In-memory representation and On-disk representation below for information on the storage of these structures.
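The split between time columns (data) and all other columns (coordinates) can be illustrated with a minimal sketch. It is not part of PRIMAP2: the helper `split_columns` is hypothetical, and it assumes that time columns are exactly those whose names parse with the given time format (here `"%Y"`, also an assumption).

```python
from datetime import datetime


def split_columns(columns, time_format="%Y"):
    """Split column names into coordinate columns and time columns.

    Hypothetical helper: a column counts as a time column if its name
    can be parsed with ``time_format``.
    """
    coord_cols, time_cols = [], []
    for col in columns:
        try:
            datetime.strptime(str(col), time_format)
        except ValueError:
            # Not a time point, so the column holds coordinate values.
            coord_cols.append(col)
        else:
            time_cols.append(col)
    return coord_cols, time_cols


columns = ["source", "area (ISO3)", "entity", "unit", "1991", "2000", "2010"]
coord_cols, time_cols = split_columns(columns)
print(coord_cols)
print(time_cols)
```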
The requirements for the data, columns, and coordinates follow the requirements in the standard PRIMAP2 data format. The dimensions area and source, which are mandatory in the xarray format, are mandatory columns in the tabular data in the interchange format. The time dimension is laid out horizontally, with one column per time point. Additionally, unit and entity are mandatory columns, with the restriction that each entity can have only one unit.
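These requirements can be checked on tabular data before further processing. The following is only a sketch under stated assumptions: `check_rows` and `dimension_name` are hypothetical helpers (not PRIMAP2 functions), and rows are represented as plain dicts keyed by column name.

```python
MANDATORY_COLUMNS = ("area", "source", "entity", "unit")


def dimension_name(column):
    """Strip a trailing terminology such as ' (ISO3)' from a column name."""
    return column.split(" (", 1)[0]


def check_rows(columns, rows):
    """Raise ValueError if mandatory columns are missing or if any
    entity is associated with more than one unit."""
    present = {dimension_name(col) for col in columns}
    missing = [col for col in MANDATORY_COLUMNS if col not in present]
    if missing:
        raise ValueError(f"missing mandatory columns: {missing}")

    # Collect every unit seen for each entity.
    units = {}
    for row in rows:
        units.setdefault(row["entity"], set()).add(row["unit"])
    conflicts = {ent: sorted(u) for ent, u in units.items() if len(u) > 1}
    if conflicts:
        raise ValueError(f"entities with more than one unit: {conflicts}")


columns = ["source", "area (ISO3)", "entity", "unit", "2000"]
rows = [
    {"source": "TESTcsv2021", "area (ISO3)": "AUS", "entity": "CO2",
     "unit": "Gg CO2 / yr", "2000": 5000.0},
    {"source": "TESTcsv2021", "area (ISO3)": "FRA", "entity": "CO2",
     "unit": "Gg CO2 / yr", "2000": 13.0},
]
check_rows(columns, rows)  # passes silently for consistent data
```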
All optional dimensions (see Data format details) can be added as optional columns. Secondary categories are columns with free format names. They are listed as secondary columns in the metadata dict.
Column names correspond to the dimension key of the xarray format, i.e. they contain the terminology in parentheses (e.g. area (ISO3)).
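A consumer working with the tabular data may want to separate the dimension name from its terminology. The sketch below uses a hypothetical helper `parse_column`, which is not part of PRIMAP2, and assumes that the terminology is always the final parenthesised part of the column name.

```python
import re

# Dimension name, a space, then the terminology in parentheses.
_COLUMN_RE = re.compile(r"^(?P<dimension>.+?) \((?P<terminology>[^()]+)\)$")


def parse_column(name):
    """Return (dimension, terminology); terminology is None if absent."""
    match = _COLUMN_RE.match(name)
    if match is None:
        return name, None
    return match["dimension"], match["terminology"]


print(parse_column("area (ISO3)"))
print(parse_column("category (IPCC2006)"))
print(parse_column("source"))
```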
Additional columns are currently not possible, but the option will be added in a future release (#25).
The metadata dict has an attrs entry, which corresponds to the attrs dict of the xarray format (see Data format details). Additionally, the metadata dict contains information on the dimensions of the data for each entity, on the time_format of the data columns, and (if stored on disk) on the name of the data_file (see Interchange format details).
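As a rough illustration, a metadata dict with the entries named above might look like the following. The concrete values are assumptions made for this sketch: the per-entity layout of dimensions, the time format `"%Y"`, and the helper `missing_metadata_keys` are illustrative and not taken from the PRIMAP2 specification, which should be consulted for the authoritative structure.

```python
# Illustrative metadata dict; values are assumptions, not a specification.
metadata = {
    "attrs": {"area": "area (ISO3)"},
    "time_format": "%Y",
    "dimensions": {
        "CH4": ["time", "source", "area (ISO3)"],
        "CO2": ["time", "source", "area (ISO3)"],
    },
}

REQUIRED_KEYS = ("attrs", "dimensions", "time_format")


def missing_metadata_keys(metadata, on_disk=False):
    """List expected top-level keys that are absent.

    ``data_file`` is only expected when the data is stored on disk.
    """
    required = REQUIRED_KEYS + (("data_file",) if on_disk else ())
    return [key for key in required if key not in metadata]


print(missing_metadata_keys(metadata))
print(missing_metadata_keys(metadata, on_disk=True))
```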
Use¶
The interchange format is intended for use mainly in two settings.
To publish data processed using PRIMAP2 in a way that is easy to read by others but also keeps the internal structure and metadata. The format will be used by future data publications by the PRIMAP team including PRIMAP-hist.
To have a common intermediate format for reading data from original sources (mostly xls or csv files in different formats) to simplify data reading functions and to enable use of our data reading functionality by other projects. All data is first read into the interchange format and subsequently converted into the native PRIMAP2 format. This enables using our data reading routines in other software packages.
In-memory representation¶
The in-memory representation of the interchange format uses a pandas DataFrame to store the data and a dict to store the additional metadata. Pandas DataFrames can store the metadata in their attrs attribute; however, this feature is still experimental and subject to change without notice, so care has to be taken not to lose the metadata when processing the DataFrame. For an example, see the Examples section below.
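One defensive pattern is to carry the metadata explicitly across a processing step instead of relying on pandas to propagate attrs. This is a sketch only: `apply_keeping_attrs` is a hypothetical helper, not a PRIMAP2 or pandas function, and the metadata values are illustrative.

```python
import copy

import pandas as pd


def apply_keeping_attrs(df, func):
    """Apply ``func`` to ``df`` and reattach a copy of the original attrs."""
    saved = copy.deepcopy(df.attrs)
    result = func(df)
    # Reattach explicitly, whether or not pandas propagated attrs itself.
    result.attrs = saved
    return result


df = pd.DataFrame({"entity": ["CO2", "CH4"], "2000": [1.0, 2.0]})
df.attrs = {"attrs": {"area": "area (ISO3)"}, "time_format": "%Y"}

filtered = apply_keeping_attrs(df, lambda d: d[d["entity"] == "CO2"])
print(filtered.attrs)
```

Keeping a deep copy also protects the original metadata from accidental in-place modification in later steps.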
On-disk representation¶
On disk, the dataset is represented by a csv file containing the array and a yaml file containing the additional metadata as a dict. Both files should have the same name except for the file extension. On disk, the key data_file, which contains the name of the csv file, is added to the metadata dict. Thus, a function reading interchange format data only needs the name of the yaml file.
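The naming convention for the file pair can be sketched with pathlib. Both helpers below are hypothetical and not part of PRIMAP2; the sketch assumes the extensions `.yaml` and `.csv`, and that data_file is given relative to the directory of the yaml file.

```python
from pathlib import Path


def interchange_paths(filepath):
    """Return the (yaml, csv) paths for a dataset base name.

    Any existing extension is replaced. Note that with_suffix also
    treats the part after the last dot of a name such as 'v1.2' as
    an extension.
    """
    base = Path(filepath)
    return base.with_suffix(".yaml"), base.with_suffix(".csv")


def csv_path_from_metadata(yaml_path, metadata):
    """Locate the csv file via the data_file key of the metadata dict.

    Assumption: data_file is relative to the yaml file's directory.
    """
    return Path(yaml_path).parent / metadata["data_file"]


yaml_path, csv_path = interchange_paths("data/test_csv_data_sec_cat_if")
print(yaml_path, csv_path)
print(csv_path_from_metadata(yaml_path, {"data_file": csv_path.name}))
```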
Examples¶
Here we show a few examples of the interchange format.
[1]:
# import all the used libraries
import primap2 as pm2
Reading csv data¶
The PRIMAP2 data reading procedures first convert data into the interchange format. For explanations of the used parameters see the Data reading example. A more complex dataset is read in Data reading PRIMAP-hist.
[2]:
file = "data_reading_writing_examples/test_csv_data_sec_cat.csv"
coords_cols = {
    "unit": "unit",
    "entity": "gas",
    "area": "country",
    "category": "category",
    "sec_cats__Class": "classification",
}
coords_defaults = {
    "source": "TESTcsv2021",
    "sec_cats__Type": "fugitive",
    "scenario": "HISTORY",
}
coords_terminologies = {
    "area": "ISO3",
    "category": "IPCC2006",
    "sec_cats__Type": "type",
    "sec_cats__Class": "class",
    "scenario": "general",
}
coords_value_mapping = {
    "category": "PRIMAP1",
    "entity": "PRIMAP1",
    "unit": "PRIMAP1",
}
data_if = pm2.pm2io.read_wide_csv_file_if(
    file,
    coords_cols=coords_cols,
    coords_defaults=coords_defaults,
    coords_terminologies=coords_terminologies,
    coords_value_mapping=coords_value_mapping,
)
data_if.head()
[2]:
| | source | scenario (general) | area (ISO3) | entity | unit | category (IPCC2006) | Class (class) | Type (type) | 1991 | 2000 | 2010 |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | TESTcsv2021 | HISTORY | AUS | CO2 | Gg CO2 / yr | 1 | TOTAL | fugitive | 4000.00 | 5000.00 | 6000.00 |
| 1 | TESTcsv2021 | HISTORY | AUS | KYOTOGHG (SARGWP100) | Mt CO2 / yr | 0 | TOTAL | fugitive | 8.00 | 9.00 | 10.00 |
| 2 | TESTcsv2021 | HISTORY | FRA | CH4 | Gg CH4 / yr | 2 | TOTAL | fugitive | 7.00 | 8.00 | 9.00 |
| 3 | TESTcsv2021 | HISTORY | FRA | CO2 | Gg CO2 / yr | 2 | TOTAL | fugitive | 12.00 | 13.00 | 14.00 |
| 4 | TESTcsv2021 | HISTORY | FRA | KYOTOGHG (SARGWP100) | Mt CO2 / yr | 0 | TOTAL | fugitive | 0.03 | 0.02 | 0.04 |
Writing interchange format data¶
Data is written using the pm2io.write_interchange_format function, which takes a filename and path (str or pathlib.Path), an interchange format dataframe (pandas.DataFrame), and optionally an attribute dict as inputs. If the filename has an ending, it will be ignored. The function writes a yaml file and a csv file.
[3]:
file_if = "data_reading_writing_examples/test_csv_data_sec_cat_if"
pm2.pm2io.write_interchange_format(file_if, data_if)
Reading data from disk¶
To read interchange format data from disk, use the function pm2io.read_interchange_format. It takes a filename and path (str or pathlib.Path) as input and returns a pandas.DataFrame containing the data and metadata. The filename and path have to point to the yaml file. The csv file will be read from the filename contained in the yaml file.
[4]:
data_if_read = pm2.pm2io.read_interchange_format(file_if)
data_if_read.head()
[4]:
| | source | scenario (general) | area (ISO3) | entity | unit | category (IPCC2006) | Class (class) | Type (type) | 1991 | 2000 | 2010 |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | TESTcsv2021 | HISTORY | AUS | CO2 | Gg CO2 / yr | 1 | TOTAL | fugitive | 4000.0000000000005 | 5000.000000000001 | 6000.000000000001 |
| 1 | TESTcsv2021 | HISTORY | AUS | KYOTOGHG (SARGWP100) | Mt CO2 / yr | 0 | TOTAL | fugitive | 8.0 | 9.0 | 10.0 |
| 2 | TESTcsv2021 | HISTORY | FRA | CH4 | Gg CH4 / yr | 2 | TOTAL | fugitive | 7.0 | 8.0 | 9.0 |
| 3 | TESTcsv2021 | HISTORY | FRA | CO2 | Gg CO2 / yr | 2 | TOTAL | fugitive | 12.0 | 13.0 | 14.0 |
| 4 | TESTcsv2021 | HISTORY | FRA | KYOTOGHG (SARGWP100) | Mt CO2 / yr | 0 | TOTAL | fugitive | 0.03 | 0.02 | 0.04 |
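The values read back from disk can differ from the values displayed before writing in the last few digits (for example 4000.0000000000005 instead of 4000.00). When comparing datasets before and after a round trip, a tolerance-based comparison is therefore more robust than exact equality. The following sketch uses only the standard library; `rows_close` is a hypothetical helper and the relative tolerance is an arbitrary choice for illustration.

```python
import math

# Values for the first row as displayed before writing and after reading.
displayed_before = [4000.00, 5000.00, 6000.00]
read_back = [4000.0000000000005, 5000.000000000001, 6000.000000000001]


def rows_close(a, b, rel_tol=1e-9):
    """Compare two sequences of floats element-wise within a tolerance."""
    return len(a) == len(b) and all(
        math.isclose(x, y, rel_tol=rel_tol) for x, y in zip(a, b)
    )


print(displayed_before == read_back)  # exact comparison
print(rows_close(displayed_before, read_back))  # tolerant comparison
```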
Converting to and from standard PRIMAP2 format¶
Data in the standard, xarray-based PRIMAP2 format can be converted to and from the interchange format with the corresponding functions:
[5]:
ds_minimal = pm2.open_dataset("minimal_ds.nc")
if_minimal = ds_minimal.pr.to_interchange_format()
if_minimal.head()
[5]:
| | source | area (ISO3) | entity | unit | 2000 | 2001 | 2002 | 2003 | 2004 | 2005 | ... | 2011 | 2012 | 2013 | 2014 | 2015 | 2016 | 2017 | 2018 | 2019 | 2020 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | RAND2020 | ARG | CH4 | CH4 * gigagram / year | 0.368532 | 0.088137 | 0.298864 | 0.678659 | 0.862995 | 0.131253 | ... | 0.867758 | 0.874488 | 0.845115 | 0.510803 | 0.108674 | 0.822922 | 0.767920 | 0.071981 | 0.025005 | 0.089894 |
| 1 | RAND2020 | ARG | CO2 | CO2 * gigagram / year | 0.916832 | 0.682339 | 0.250728 | 0.852096 | 0.881125 | 0.147748 | ... | 0.612935 | 0.966494 | 0.376384 | 0.487788 | 0.710766 | 0.945482 | 0.131772 | 0.938776 | 0.658307 | 0.629569 |
| 2 | RAND2020 | ARG | SF6 | SF6 * gigagram / year | 0.505571 | 0.526975 | 0.618358 | 0.497167 | 0.559072 | 0.943795 | ... | 0.569415 | 0.044608 | 0.587069 | 0.657630 | 0.027363 | 0.711145 | 0.018133 | 0.997718 | 0.268657 | 0.761415 |
| 3 | RAND2020 | ARG | SF6 (SARGWP100) | CO2 * gigagram / year | 12083.153938 | 12594.698995 | 14778.752239 | 11882.298866 | 13361.818016 | 22556.691567 | ... | 13609.030141 | 1066.140102 | 14030.947344 | 15717.359528 | 653.987580 | 16996.374472 | 433.377608 | 23845.461667 | 6420.891957 | 18197.817082 |
| 4 | RAND2020 | BOL | CH4 | CH4 * gigagram / year | 0.565378 | 0.036782 | 0.752872 | 0.247971 | 0.305199 | 0.094644 | ... | 0.374838 | 0.406868 | 0.352221 | 0.965541 | 0.037689 | 0.402788 | 0.173916 | 0.409820 | 0.093950 | 0.553759 |
5 rows × 25 columns
[6]:
ds_minimal_re = pm2.pm2io.from_interchange_format(if_minimal)
ds_minimal_re
2023-05-10 09:53:21.886 | DEBUG | primap2.pm2io._interchange_format:from_interchange_format:320 - Expected array shapes: [[21, 4, 1, 4], [21, 4, 1, 4], [21, 4, 1, 4], [21, 4, 1, 4]], resulting in size 1,344.
[6]:
<xarray.Dataset>
Dimensions:          (time: 21, source: 1, area (ISO3): 4)
Coordinates:
  * source           (source) object 'RAND2020'
  * area (ISO3)      (area (ISO3)) object 'ARG' 'BOL' 'COL' 'MEX'
  * time             (time) datetime64[ns] 2000-01-01 2001-01-01 ... 2020-01-01
Data variables:
    CH4              (time, source, area (ISO3)) float64 [CH4·Gg/a] 0.3685 .....
    CO2              (time, source, area (ISO3)) float64 [CO2·Gg/a] 0.9168 .....
    SF6              (time, source, area (ISO3)) float64 [Gg·SF6/a] 0.5056 .....
    SF6 (SARGWP100)  (time, source, area (ISO3)) float64 [CO2·Gg/a] 1.208e+04...
Attributes:
    area:     area (ISO3)
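The size reported in the debug message above is consistent with summing the number of elements over the listed array shapes. This is an interpretation of the log line rather than documented behaviour, and the arithmetic can be checked as follows:

```python
import math

# Shapes as listed in the debug message, one per data variable.
shapes = [[21, 4, 1, 4], [21, 4, 1, 4], [21, 4, 1, 4], [21, 4, 1, 4]]

# Elements per array: 21 * 4 * 1 * 4 = 336.
per_array = [math.prod(shape) for shape in shapes]

# Sum over all four arrays.
total = sum(per_array)
print(per_array, total)
```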