primap2.pm2io.convert_wide_dataframe_if

primap2.pm2io.convert_wide_dataframe_if(data_wide: DataFrame, *, coords_cols: dict[str, str], add_coords_cols: None | dict[str, list[str]] = None, coords_defaults: None | dict[str, Any] = None, coords_terminologies: dict[str, str], coords_value_mapping: None | dict[str, Any] = None, coords_value_filling: None | dict[str, dict[str, dict]] = None, filter_keep: None | dict[str, dict[str, Any]] = None, filter_remove: None | dict[str, dict[str, Any]] = None, meta_data: None | dict[str, Any] = None, time_format: str = '%Y', time_cols: None | list = None, convert_str: bool | dict[str, float] = True, copy_df: bool = False) → DataFrame

Convert a DataFrame in wide format into the PRIMAP2 interchange format.

Columns can be renamed or filled with default values to match the PRIMAP2 structure. Where we refer to “dimensions” in the parameter descriptions below, we mean the basic dimension names without the added terminology (e.g. “area”, not “area (ISO3)”); the terminology information will be added by this function. You cannot use the short dimension names (e.g. “cat” instead of “category”) in the attributes.

TODO: Currently duplicate data points will not be detected.

TODO: enable filtering through query strings

TODO: enable specification of the entity terminology

Parameters:
data_wide: pd.DataFrame

Wide DataFrame which will be converted.

coords_cols: dict

Dict where the keys are PRIMAP2 dimension names and the values are column names in the dataframe to be converted. For secondary categories use a sec_cats__ prefix.
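As an illustration (the column names below are invented for this sketch, not taken from the source), a coords_cols mapping could look like this:

```python
# Hypothetical example: keys are PRIMAP2 dimension names, values are the
# column names of the wide DataFrame that is being converted.
coords_cols = {
    "area": "country",           # ISO3 codes live in the "country" column
    "category": "ipcc_code",     # category codes live in the "ipcc_code" column
    "entity": "gas",             # gas names live in the "gas" column
    "unit": "unit",              # unit strings live in the "unit" column
    "sec_cats__Class": "class",  # secondary category, note the sec_cats__ prefix
}
```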

add_coords_cols: dict, optional

Dict where the keys are PRIMAP2 additional coordinate names and the values are lists with two elements, where the first is the column in the dataframe to be converted and the second is the primap2 dimension for the coordinate (e.g. category for a category_name coordinate).
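A minimal sketch (the column name "cat_name" is hypothetical) attaching a category_name coordinate to the category dimension:

```python
# Hypothetical example: the "category_name" coordinate is read from the
# "cat_name" column and associated with the "category" dimension.
add_coords_cols = {
    "category_name": ["cat_name", "category"],
}
```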

coords_defaults: dict, optional

Dict for default values of coordinates / dimensions not given in the dataframe. The keys are the dimension names and the values are the values for the dimensions. For secondary categories use a sec_cats__ prefix.

coords_terminologies: dict

Dict defining the terminologies used for the different coordinates (e.g. ISO3 for area). Only possible coordinates here are: area, category, scenario, entity, and secondary categories. For secondary categories use a sec_cats__ prefix. All entries different from “area”, “category”, “scenario”, “entity”, and sec_cats__<name> will raise a ValueError.
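For example (the terminology names are illustrative; use the ones that actually describe your data):

```python
# Declare which terminology each coordinate uses. Only "area", "category",
# "scenario", "entity", and sec_cats__<name> keys are accepted; any other
# key raises a ValueError.
coords_terminologies = {
    "area": "ISO3",
    "category": "IPCC2006",
    "scenario": "PRIMAP",
}
```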

coords_value_mapping: dict, optional

A dict with primap2 dimension names as keys. Values are dicts with input values as keys and output values as values. A standard use case is to map gas names from input data to the standardized names used in primap2. Alternatively a value can also be a function which transforms one CSV metadata value into the new metadata value. A third possibility is to give a string as a value, which defines a rule for translating metadata values. The only defined rule at the moment is “PRIMAP1” which can be used for the “category”, “entity”, and “unit” columns to translate from PRIMAP1 metadata to PRIMAP2 metadata.

coords_value_filling: dict, optional

A dict with primap2 dimension names as keys. These are the target columns where values will be filled (or replaced). Values are dicts with primap2 dimension names as keys; these are the source columns. The innermost values are dicts with source-value to target-value mappings. Value filling can do everything that value mapping can, but while mapping can only replace values within a column using information from that same column, the filling function can also fill or replace data based on values from a different column. This can be used e.g. to fill missing category codes based on category names, or to replace category codes which do not meet the terminology using the category names.
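A sketch of the nested structure (category names and codes here are invented): fill the "category" column from the "category_name" column.

```python
# Hypothetical example: recover missing or non-conforming category codes
# in the target column "category" from the source column "category_name".
coords_value_filling = {
    "category": {              # target column to fill / replace
        "category_name": {     # source column providing the information
            "Energy": "1",     # source value -> target value
            "Agriculture": "3",
        }
    }
}
```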

filter_keep: dict, optional

Dict defining filters of data to keep. Filtering is done before metadata mapping, so use original metadata values to define the filter. Column names are as in the csv file. Each entry in the dict defines an individual filter. The names of the filters have no relevance. Default: keep all data.

filter_remove: dict, optional

Dict defining filters of data to remove. Filtering is done before metadata mapping, so use original metadata values to define the filter. Column names are as in the csv file. Each entry in the dict defines an individual filter. The names of the filters have no relevance.

meta_data: dict, optional

Meta data for the whole dataset. Will end up in the dataset-wide attrs. Allowed keys are “references”, “rights”, “contact”, “title”, “comment”, “institution”, and “history”. Documentation about the format and meaning of the meta data can be found in the data format documentation.

time_format: str, optional (default: '%Y')

str with strftime style format used to parse the time information for the data columns. Default: “%Y”, which will match years.
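For instance, with the default time_format of "%Y", a data column header such as "2005" parses as a year. A minimal illustration using Python's stdlib strptime, which follows the same strftime-style format codes:

```python
from datetime import datetime

# The default time_format "%Y" matches bare years such as "2005";
# a format like "%Y-%m" would instead match monthly columns ("2005-03").
parsed = datetime.strptime("2005", "%Y")
print(parsed.year)  # 2005
```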

time_cols: list, optional

List of column names which contain the data for each time point. If not given, the columns will be inferred using time_format.

convert_str: bool or dict, optional (default: True)

If set to False, string values in the data columns will be kept. If set to True, they will be converted to np.nan or 0 following default rules. If a dict is given, values present in the dict are mapped as specified, and all other values are handled by the defaults as in parse_code.
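A sketch of the dict form (the codes below are illustrative notation keys, not taken from the source): map specific strings to floats, with everything else falling back to the defaults.

```python
# Hypothetical example: map selected string values found in the data
# columns to floats; any string not listed here is handled by the
# default parse_code rules (np.nan or 0).
convert_str = {
    "IE": 0.0,           # e.g. "included elsewhere" -> 0
    "NO": 0.0,           # e.g. "not occurring"      -> 0
    "C": float("nan"),   # e.g. "confidential"       -> NaN
}
```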

copy_df: bool, optional (default: False)

If set to True, a copy of the input DataFrame is made so that the input is left unchanged; this negatively impacts speed. If set to False, the input DataFrame will be altered, but performance is better.

Returns:
obj: pd.DataFrame

pandas DataFrame with the read data

Examples

Example for coords_value_mapping:

coords_value_mapping = {
    'pyCPA_col_1': {'col_1_value_1_in': 'col_1_value_1_out',
                    'col_1_value_2_in': 'col_1_value_2_out',
                    },
    'pyCPA_col_2': {'col_2_value_1_in': 'col_2_value_1_out',
                    'col_2_value_2_in': 'col_2_value_2_out',
                    },
}

Example for filter_keep:

filter_keep = {
    'f_1': {'variable': ['CO2', 'CH4'], 'region': 'USA'},
    'f_2': {'variable': 'N2O'}
}

This example filter keeps all CO2 and CH4 data for the USA, and all N2O data for all countries.

Example for filter_remove:

filter_remove = {
    'f_1': {'scenario': 'HISTORY'},
}

This filter removes all data with ‘HISTORY’ as the scenario.
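Putting the pieces together, an end-to-end sketch. The column names, category codes, and values below are invented for illustration; they are not from the source, and the final conversion call is only indicated in a comment since it requires primap2 to be installed:

```python
import pandas as pd

# Hypothetical wide-format input: one row per (country, category, gas),
# with the years spread across data columns.
data_wide = pd.DataFrame({
    "country": ["USA", "USA"],
    "code": ["1", "2"],
    "gas": ["CO2", "CO2"],
    "unit": ["Gg", "Gg"],
    "2000": [100.0, 50.0],
    "2001": [110.0, 55.0],
})

coords_cols = {"area": "country", "category": "code",
               "entity": "gas", "unit": "unit"}
coords_defaults = {"source": "EXAMPLE", "scenario": "HISTORY"}
coords_terminologies = {"area": "ISO3", "category": "IPCC2006",
                        "scenario": "PRIMAP"}

# With primap2 installed, the conversion would then be:
# import primap2 as pm2
# data_if = pm2.pm2io.convert_wide_dataframe_if(
#     data_wide,
#     coords_cols=coords_cols,
#     coords_defaults=coords_defaults,
#     coords_terminologies=coords_terminologies,
# )
```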