Skip to content

Data model

This page gives a high-level overview of the DBnomics data model, and the main design principles behind it.

The data model page of the documentation of the dbnomics-toolbox library explains how this data model is implemented in Python and how to use it in fetchers.

Data representation principles

DBnomics distinguishes data from its format, and simplifies format only.

This means DBnomics tries to preserve provider semantics while normalizing a limited set of representation details needed for consistent access across providers.

Preserved provider semantics

The following items are kept as-is from the provider:

  • time series and their observations
  • dataset dimensions: a dataset may have a dimension named "Country" with values "France", "Germany", etc., and another dataset may have a dimension named "REF_AREA" with values "FR", "DE", etc. DBnomics does not try to harmonize those dimensions, but keeps them as-is from the provider.
  • NA (non-available) values usage: DBnomics does not add or remove them. If a provider distributes a time series with an incomplete calendar (with some missing periods), DBnomics does not try to complete it

Some providers distribute time series with no observation, or with only NA values, and DBnomics keeps them as-is as well. Here are some examples:

Normalized representation

Some data is normalized from provider source data:

  • periods: providers use different ways to represent them and DBnomics defines a standard format (e.g. 202001 or 2020M01 becomes 2020-01)
  • NA (non-available) values: some providers use NaN, some others -9999, etc. DBnomics always uses NA

Normalization is done by each fetcher in the data conversion part, based on the knowledge of the provider source data. For example, a period like 2000-qII would be normalized as 2000-Q2.

Why normalize this?

If we kept the original formats for periods or NA values, it would be difficult to, for example, represent time series on a chart, because each format would have to be handled separately.

The best place to do this normalization is in the fetcher, because it has the knowledge of the provider source data, and can be adapted to each provider.

Conceptual model

We first present a conceptual model of the main DBnomics concepts and their relationships:

flowchart LR
    Provider -->|has many| Dataset
    Dataset -->|has many| Series
    Series -->|has many| Observation

Note

This diagram is a conceptual model, meant as a mental model of the main DBnomics concepts and their relationships.

It does not reflect the actual implementation or describe the in-memory representation used by DBnomics or by dbnomics_toolbox.model. The subsections below describe the conceptual model.

Provider

Each provider has:

  • a code that must be unique across providers
  • many datasets (see section below)

Examples: IMF for International Monetary Fund, OECD for Organisation for Economic Co-operation and Development, WB for World Bank, etc.

Provider codes are used in several DBnomics URLs.

Dataset

Each dataset is part of a provider and has:

  • a code that must be unique across a single provider (e.g. BOP for Balance of Payments).
  • an ID that is the concatenation of the provider code and the dataset code, separated by a slash (e.g. IMF/BOP).
  • many dimensions definitions (see section below)
  • many series (see section below)

Dataset codes are used in several DBnomics URLs.

Dataset dimensions

In DBnomics datasets are multi-dimensional and are sometimes referred to as data cubes.

Each dataset defines an ordered list of dimensions, and each series of the dataset has a value for each dimension.

For example, a dataset may have dimensions FREQ (frequency), REF_AREA (reference area), UNIT (unit of measurement), etc., and a series of this dataset may have values A for FREQ, FR for REF_AREA, USD for UNIT, etc.

Dataset releases

Sometimes providers publish datasets with releases.

In DBnomics each dataset release is a regular dataset having a code following this pattern: {dataset_code}:{release_code}.

For example, IMF publishes WEO every 6 months (e.g. 2019-04, 2019-10, 2020-04, etc.) and DBnomics datasets have the codes WEO:2019-04, WEO:2019-10, WEO:2020-04, etc.

See also: dataset releases.

Series

Each series is part of a dataset and has:

  • a code that must be unique across a single dataset (e.g. A.1C_355.BACK_BP6_USD)
  • an ID that is the concatenation of the dataset ID and the series code, separated by a slash (e.g. IMF/BOP/A.1C_355.BACK_BP6_USD)
  • a name that SHOULD be unique
  • a value for each dimension of the dataset (e.g. FREQ => A, REF_AREA => FR, etc.)
  • many observations (see section below)

Duplicate series names

Some providers give the same names to many series. DBnomics data model tolerates this, even if it's not recommended.

In this case, the user will always be able to distinguish those time series by looking at their code or dimensions.

Series without code

Some providers don't give codes to series, but only dimensions (e.g. {"FREQ": "A", "REF_AREA": "FR"}). In this case, as the series code is required in the data model of DBnomics, it can be generated from the dimensions (e.g. A.FR).

This does not mean that the series codes must be generated from dimensions: some providers give arbitrary codes to series, that do not correspond to dimensions (e.g. SERIES_137).

However, building the series codes from dimensions is a common practice among providers, and is recommended as it makes them easier to understand and more stable over time.

Observation

An observation is a pair of period and value, like {"period": "2020-01", "value": 123.45}.

It can also have attributes: a set of key-value pairs (e.g. {"unit": "USD", "status": "final"}).

Periods can be years (2020), months (2020-01), quarters (2020-Q1), etc. For more information, see the documentation of dbnomics-toolbox on periods.

Example: multi-dimensional dataset

A multi-dimensional dataset defines many dimensions and has many series where each one is categorized using dimensions.

To help understanding that concept, here is an example. Let's start from the following hypothetical CSV file named product_prices.csv that tracks the evolution of the price of different products in different countries:

sku country year price
111 FR 2000 12
111 FR 2001 13
111 FR 2002 11
111 DE 2001 9
111 DE 2002 11
111 DE 2003 14
222 FR 2000 87
222 FR 2001 88
222 FR 2002 90
222 FR 2003 79

We can consider this file a multi-dimensional dataset if we identify dimensions in it, and series where each one is categorized by those dimensions.

Based on the given columns, we can infer 2 dimensions: SKU and COUNTRY (we'll keep this order), with values SKU={111,222} and COUNTRY={DE,FR}.

The remaining columns year and price define the observations of each series, and the series can be found by applying a "group-by" operation on the dimensions.

So the dataset is composed of 4 time series, each being related to a single product and country.

Series 111.FR:

period value
2000 12
2001 13
2002 11

Series 111.DE:

period value
2001 9
2002 11
2003 14

Series 222.FR:

period value
2000 87
2001 88
2002 90
2003 79

Series codes

Because the dimensions of a dataset are ordered, we can infer the series codes by concatenating the codes of the values of the dimensions, separated by a . character.

For example, the series for SKU=111 and COUNTRY=FR has the code 111.FR, and not FR.111.