{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {
    "pycharm": {
     "name": "#%% md\n"
    }
   },
   "source": [
    "# Data format\n",
    "\n",
    "In PRIMAP2, data is handled in\n",
    "[xarray datasets](https://xarray.pydata.org/en/stable/data-structures.html#dataset)\n",
    "with defined coordinates and metadata.\n",
    "If you are not familiar with xarray data structures, we recommend reading\n",
    "[xarray's own primer](https://xarray.pydata.org/en/stable/data-structures.html) first.\n",
    "\n",
    "## Examples\n",
    "\n",
    "Let's start with two examples, one minimal example showing only what is required for a\n",
    "PRIMAP2 data set, and one opulent example showing the flexibility of the format."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false,
    "execution": {
     "iopub.execute_input": "2021-01-12T16:46:26.482526Z",
     "iopub.status.busy": "2021-01-12T16:46:26.482238Z",
     "iopub.status.idle": "2021-01-12T16:46:27.442994Z",
     "shell.execute_reply": "2021-01-12T16:46:27.442425Z",
     "shell.execute_reply.started": "2021-01-12T16:46:26.482495Z"
    },
    "jupyter": {
     "outputs_hidden": false
    },
    "pycharm": {
     "name": "#%%\n"
    }
   },
   "outputs": [],
   "source": [
    "# import all the used libraries\n",
    "import datetime\n",
    "import xarray as xr\n",
    "import pandas as pd\n",
    "import numpy as np\n",
    "import primap2\n",
    "from primap2 import ureg"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "pycharm": {
     "name": "#%% md\n"
    }
   },
   "source": [
    "### Minimal example\n",
    "\n",
    "This example contains only the required metadata, which are the time, the area,\n",
    "and the source.\n",
    "It also shows how multiple gases and global warming potentials stemming from the\n",
    "gases are included in a single dataset and the use of units.\n",
    "\n",
    "The example is created with dummy data; note that in real usage, you would read\n",
    "data from a file or API instead."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false,
    "execution": {
     "iopub.execute_input": "2021-01-12T16:46:27.443932Z",
     "iopub.status.busy": "2021-01-12T16:46:27.443739Z",
     "iopub.status.idle": "2021-01-12T16:46:27.724241Z",
     "shell.execute_reply": "2021-01-12T16:46:27.723607Z",
     "shell.execute_reply.started": "2021-01-12T16:46:27.443906Z"
    },
    "jupyter": {
     "outputs_hidden": false
    },
    "pycharm": {
     "name": "#%%\n"
    }
   },
   "outputs": [],
   "source": [
    "time = pd.date_range(\"2000-01-01\", \"2020-01-01\", freq=\"AS\")\n",
    "area_iso3 = np.array([\"COL\", \"ARG\", \"MEX\", \"BOL\"])\n",
    "minimal = xr.Dataset(\n",
    "    {\n",
    "        ent: xr.DataArray(\n",
    "            data=np.random.rand(len(time), len(area_iso3), 1),\n",
    "            coords={\n",
    "                \"time\": time,\n",
    "                \"area (ISO3)\": area_iso3,\n",
    "                \"source\": [\"RAND2020\"],\n",
    "            },\n",
    "            dims=[\"time\", \"area (ISO3)\", \"source\"],\n",
    "            attrs={\"units\": f\"{ent} Gg / year\", \"entity\": ent},\n",
    "        )\n",
    "        for ent in (\"CO2\", \"SF6\", \"CH4\")\n",
    "    },\n",
    "    attrs={\"area\": \"area (ISO3)\"},\n",
    ").pr.quantify()\n",
    "\n",
    "with ureg.context(\"SARGWP100\"):\n",
    "    minimal[\"SF6 (SARGWP100)\"] = minimal[\"SF6\"].pint.to(\"CO2 Gg / year\")\n",
    "minimal[\"SF6 (SARGWP100)\"].attrs[\"gwp_context\"] = \"SARGWP100\"\n",
    "\n",
    "minimal"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "pycharm": {
     "name": "#%% md\n"
    }
   },
   "source": [
    "Explore the dataset by clicking on the icons at the end of the rows, which will\n",
    "show you the metadata `attrs` and the actual data for each coordinate or variable.\n",
    "\n",
    "Notice:\n",
    "\n",
    "* For the time coordinate, python datetime objects are used, and the meaning of\n",
    "  each data point is therefore directly obvious.\n",
    "* For the area coordinate, three-letter country abbreviations are used, and their\n",
    "  meaning is not necessarily obvious. Therefore, the key or name for the area\n",
    "  coordinate also contains (in parentheses) the used set of categories, here\n",
    "  ISO-3166 three-letter country abbreviations. To be able to identify the area\n",
    "  coordinate without parsing strings, the data set `attrs` contain the key-value pair\n",
    "  `'area': 'area (ISO3)'`, which translates from the simple name to the coordinate\n",
    "  key including the identifier for the category set.\n",
    "* The variables all carry an associated `openscm_units` unit. It is the same unit\n",
    "  for all data points in a variable, but differs between variables because it\n",
    "  includes the gas.\n",
    "* The `attrs` of each variable specify the `entity` of the variable. For simple\n",
    "  gases like the CO2 emissions, this is the same as the variable name, but for\n",
    "  example for the global warming potential associated with the SF6 emissions,\n",
    "  it is different.\n",
    "* When a global warming potential is given, the used conversion factors have to be\n",
    "  specified explicitly using openscm_units context names, for example\n",
    "  `SARGWP100` for the global warming potential equivalent factors for a 100-year\n",
    "  time horizon specified in the second assessment report."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "pycharm": {
     "name": "#%% md\n"
    }
   },
   "source": [
    "### Opulent example\n",
    "\n",
    "The opulent example contains every standard metadata and also shows\n",
    "that the variables in the data set can have a different number of\n",
    "dimensions. Because it aims to show everything, creating it takes some effort,\n",
    "skip to the result unless you are interested in the details."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false,
    "execution": {
     "iopub.execute_input": "2021-01-12T16:46:27.725678Z",
     "iopub.status.busy": "2021-01-12T16:46:27.725360Z",
     "iopub.status.idle": "2021-01-12T16:46:27.848817Z",
     "shell.execute_reply": "2021-01-12T16:46:27.848241Z",
     "shell.execute_reply.started": "2021-01-12T16:46:27.725637Z"
    },
    "jupyter": {
     "outputs_hidden": false
    },
    "pycharm": {
     "name": "#%%\n"
    }
   },
   "outputs": [],
   "source": [
    "# create with dummy data\n",
    "coords = {\n",
    "    \"time\": pd.date_range(\"2000-01-01\", \"2020-01-01\", freq=\"AS\"),\n",
    "    \"area (ISO3)\": np.array([\"COL\", \"ARG\", \"MEX\", \"BOL\"]),\n",
    "    \"category (IPCC 2006)\": np.array(\n",
    "        [\"0\", \"1\", \"2\", \"3\", \"4\", \"5\", \"1.A\", \"1.B\"]\n",
    "    ),\n",
    "    \"animal (FAOSTAT)\": np.array([\"cow\", \"swine\", \"goat\"]),\n",
    "    \"product (FAOSTAT)\": np.array([\"milk\", \"meat\"]),\n",
    "    \"scenario (FAOSTAT)\": np.array([\"highpop\", \"lowpop\"]),\n",
    "    \"provenance\": np.array([\"projected\"]),\n",
    "    \"model\": np.array([\"FANCYFAO\"]),\n",
    "    \"source\": np.array([\"RAND2020\", \"RAND2021\"]),\n",
    "}\n",
    "\n",
    "opulent = xr.Dataset(\n",
    "    {\n",
    "        ent: xr.DataArray(\n",
    "            data=np.random.rand(*(len(x) for x in coords.values())),\n",
    "            coords=coords,\n",
    "            dims=list(coords.keys()),\n",
    "            attrs={\"units\": f\"{ent} Gg / year\", \"entity\": ent},\n",
    "        )\n",
    "        for ent in (\"CO2\", \"SF6\", \"CH4\")\n",
    "    },\n",
    "    attrs={\n",
    "        \"entity_terminology\": \"primap2\",\n",
    "        \"area\": \"area (ISO3)\",\n",
    "        \"cat\": \"category (IPCC 2006)\",\n",
    "        \"sec_cats\": [\"animal (FAOSTAT)\", \"product (FAOSTAT)\"],\n",
    "        \"scen\": \"scenario (FAOSTAT)\",\n",
    "        \"references\": \"doi:10.1012\",\n",
    "        \"rights\": \"Use however you want.\",\n",
    "        \"contact\": \"lol_no_one_will_answer@example.com\",\n",
    "        \"title\": \"Completely invented GHG inventory data\",\n",
    "        \"comment\": \"GHG inventory data ...\",\n",
    "        \"institution\": \"PIK\",\n",
    "        \"history\": \"\"\"2021-01-14 14:50 data invented\n",
    "2021-01-14 14:51 additional processing step\n",
    "\"\"\",\n",
    "        \"publication_date\": datetime.date(2099, 12, 31),\n",
    "    },\n",
    ")\n",
    "\n",
    "pop_coords = {\n",
    "    x: coords[x]\n",
    "    for x in (\n",
    "        \"time\",\n",
    "        \"area (ISO3)\",\n",
    "        \"provenance\",\n",
    "        \"model\",\n",
    "        \"source\",\n",
    "    )\n",
    "}\n",
    "opulent[\"population\"] = xr.DataArray(\n",
    "    data=np.random.rand(*(len(x) for x in pop_coords.values())),\n",
    "    coords=pop_coords,\n",
    "    dims=list(pop_coords.keys()),\n",
    "    attrs={\"entity\": \"population\", \"units\": \"\"},\n",
    ")\n",
    "\n",
    "opulent = opulent.assign_coords(\n",
    "    {\n",
    "        \"category_names\": xr.DataArray(\n",
    "            data=np.array(\n",
    "                [\n",
    "                    \"total\",\n",
    "                    \"industry\",\n",
    "                    \"energy\",\n",
    "                    \"transportation\",\n",
    "                    \"residential\",\n",
    "                    \"land use\",\n",
    "                    \"heavy industry\",\n",
    "                    \"light industry\",\n",
    "                ]\n",
    "            ),\n",
    "            coords={\n",
    "                \"category (IPCC 2006)\": coords[\"category (IPCC 2006)\"]\n",
    "            },\n",
    "            dims=[\"category (IPCC 2006)\"],\n",
    "        )\n",
    "    }\n",
    ")\n",
    "\n",
    "opulent = opulent.pr.quantify()\n",
    "\n",
    "with ureg.context(\"SARGWP100\"):\n",
    "    opulent[\"SF6 (SARGWP100)\"] = opulent[\"SF6\"].pint.to(\"CO2 Gg / year\")\n",
    "opulent[\"SF6 (SARGWP100)\"].attrs[\"gwp_context\"] = \"SARGWP100\"\n",
    "\n",
    "opulent"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "pycharm": {
     "name": "#%% md\n"
    }
   },
   "source": [
    "Compared to the minimal example, this data set has a lot more to unpack:\n",
    "\n",
    "* The first thing to notice is that there are a lot more dimensions, in particular\n",
    "  for the used `model`, the `provenance` of the data, the described `scenario`, and\n",
    "  the `animal` type. As before, the dimension `scenario`, `animal`, and `product` use a\n",
    "  specific set of\n",
    "  categories given in parentheses and with appropriate metadata in the `attrs`.\n",
    "  The `scenario` is a standard dimension, and the metadata in `attrs` is given using\n",
    "  the `scen` key. The `animal` and `product` dimensions are nonstandard, and are\n",
    "  included in the\n",
    "  secondary categories at `attrs['sec_cats']`. Note that `sec_cats` contains a list, so\n",
    "  that multiple nonstandard dimensions can be included if needed.\n",
    "* There is also s coordinate which is not defining a dimensions, `category names`. It\n",
    "  gives additional information about categories, which can be helpful for humans\n",
    "  trying to make sense of the category codes without looking them up. Note that\n",
    "  because this coordinate is not used as an index for a dimension, the category\n",
    "  names do not have to be unique.\n",
    "* In the data variables, emissions of the different gases all use all dimensions, but\n",
    "  the population data does not use all dimensions. For each data variable, only the\n",
    "  dimensions which make sense have to be used.\n",
    "* In the `attrs`, the terminology for the entities is explicitly defined, so that the\n",
    "  meaning of the entity attributes is unambigously defined.\n",
    "* In the `attrs`, additional metadata useful for humans is included: citable\n",
    "  `references`, usage `rights`, a descriptive `title`, a long-form `comment`,\n",
    "   an email address to `contact` for questions about the data set, and the `publication_date`.\n",
    "* In the `attrs` the field `history` records processing steps done with the data."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "pycharm": {
     "name": "#%% md\n"
    }
   },
   "source": [
    "## Advantages\n",
    "\n",
    "* By using standard xarray datasets, the standard data analysis libraries for in the\n",
    "xarray universe can be used.\n",
    "* Due to xarray's compatibility with dask arrays, [dask](https://dask.org/) can be used\n",
    "to distribute calculations to multiple cores or a cluster of machines.\n",
    "* xarray provides a file storage format based on netcdf, which is well-suited for\n",
    "efficient long-term storage.\n",
    "\n",
    "## Limitations\n",
    "\n",
    "* xarray does not provide a solution for the management of multiple data sets,\n",
    "  including search and discovery, change management etc. For this, we plan to use\n",
    "  [datalad](https://www.datalad.org/).\n",
    "* At the moment, xarray does not deal with very sparse data efficiently. For large,\n",
    "  very sparse datasets with lots of dimensions, primap2 is currently not usable."
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.8.6"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}