Data model

This page gives a high-level overview of the DBnomics data model, and the main design principles behind it.

Read the documentation of dbnomics-toolbox to see how this data model is implemented in Python and how to use it: https://dbnomics-toolbox.readthedocs.io/en/latest/data-model/.

Data representation principles

DBnomics distinguishes data from its format, and simplifies format only.

This means DBnomics tries to preserve provider semantics while normalizing a limited set of representation details needed for consistent access across providers.

Preserved provider semantics

The following items are kept as-is from the provider:

  • time series and their observations
  • dataset dimensions: a dataset may have a dimension named "Country" with values "France", "Germany", etc., and another dataset may have a dimension named "REF_AREA" with values "FR", "DE", etc. DBnomics does not try to harmonize those dimensions, but keeps them as-is from the provider.
  • usage of NA (non-available) values: DBnomics does not add or remove them. If a provider distributes a time series with an incomplete calendar (some periods are missing), DBnomics does not try to complete it.

Some providers distribute time series with no observations, or with only NA values, and DBnomics keeps them as-is as well.

Normalized representation

Some data is normalized from provider source data:

  • periods: providers use different ways to represent them and DBnomics defines a standard format (e.g. 202001 or 2020M01 becomes 2020-01)
  • NA (non-available) values: some providers use NaN, some others -9999, etc. DBnomics always uses NA

Normalization is done by each fetcher in the data conversion part, based on the knowledge of the provider source data. For example, a period like 2000-qII would be normalized as 2000-Q2.
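As an illustration, the kind of normalization a fetcher performs could be sketched as follows. This is a minimal sketch, not the dbnomics-toolbox implementation: the function names, the set of NA markers, and the supported input formats are assumptions chosen to match the examples on this page.

```python
import re
from typing import Union

# Hypothetical provider-specific markers for missing values (assumption).
NA_MARKERS = {"NaN", "-9999", ""}

# Roman numerals used by the illustrative "2000-qII" quarter format.
ROMAN_QUARTERS = {"I": 1, "II": 2, "III": 3, "IV": 4}


def normalize_period(raw: str) -> str:
    """Normalize a few illustrative provider period formats."""
    raw = raw.strip()
    # 202001 or 2020M01 -> 2020-01
    match = re.fullmatch(r"(\d{4})M?(\d{2})", raw)
    if match:
        return f"{match.group(1)}-{match.group(2)}"
    # 2000-qII -> 2000-Q2
    match = re.fullmatch(r"(\d{4})-q(I{1,3}|IV)", raw)
    if match:
        return f"{match.group(1)}-Q{ROMAN_QUARTERS[match.group(2)]}"
    raise ValueError(f"Unsupported period format: {raw!r}")


def normalize_value(raw: str) -> Union[float, str]:
    """Map provider-specific NA markers to "NA", else parse a number."""
    raw = raw.strip()
    if raw in NA_MARKERS:
        return "NA"
    return float(raw)
```

Each real fetcher would only need to handle the formats its own provider uses, which is why this logic lives in the fetcher rather than in a shared component.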

Why normalize this?

If we kept the original formats for periods or NA values, it would be difficult to, for example, represent time series on a chart, because each format would have to be handled separately.

The best place to do this normalization is in the fetcher, because it has the knowledge of the provider source data, and can be adapted to each provider.

Conceptual model

We first present a conceptual model of the main DBnomics concepts and their relationships:

flowchart LR
    Provider -->|has many| Dataset
    Dataset -->|has many| Series
    Series -->|has many| Observation

Note

This diagram is a conceptual model, meant as a mental model of the main DBnomics concepts and their relationships.

It does not reflect the actual implementation or describe the in-memory representation used by DBnomics or by dbnomics_toolbox.model. The subsections below describe the conceptual model.

Provider

  • code: a provider MUST have a code so that its URL is well defined (example)

Dataset

  • code: a dataset MUST have a code so that its URL is well defined (example)

Dataset dimensions

  • specific to each dataset, not harmonized across datasets

Dataset releases

Sometimes providers publish datasets with releases.

In DBnomics each dataset release is a regular dataset named after the pattern {dataset_code}:{release_code}.

For example, IMF provides WEO every 6 months (e.g. 2019-04, 2019-10, 2020-04, etc.). DBnomics datasets would be named WEO:2019-04, WEO:2019-10, WEO:2020-04, etc.

The release code latest is reserved for accessing the latest release of a dataset. It is supported by the DBnomics website, the Web API, and all the DBnomics clients, as long as they follow HTTP redirections.

See also: dataset releases.
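The naming pattern can be sketched with two small helpers. This is only an illustration: the function names are hypothetical and are not part of any DBnomics library; only the {dataset_code}:{release_code} pattern and the reserved latest code come from the text above.

```python
from typing import Optional, Tuple

# Reserved release code that refers to the latest release.
LATEST_RELEASE_CODE = "latest"


def build_release_dataset_code(dataset_code: str, release_code: str) -> str:
    """Build a dataset code following {dataset_code}:{release_code}."""
    return f"{dataset_code}:{release_code}"


def split_release_dataset_code(code: str) -> Tuple[str, Optional[str]]:
    """Split a dataset code into (base code, release code or None)."""
    base, separator, release = code.partition(":")
    return base, (release if separator else None)


weo_codes = [
    build_release_dataset_code("WEO", release)
    for release in ("2019-04", "2019-10", LATEST_RELEASE_CODE)
]
```

Here weo_codes contains WEO:2019-04, WEO:2019-10 and WEO:latest, and a dataset code without a colon is treated as having no release.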

Series

  • code: a series MUST have a code so that its URL is well defined (example)
  • name: SHOULD be unique
  • dimensions: a set of key-value pairs (e.g. FREQ => A, REF_AREA => FR)
  • observations: a list of observations (see below)

Duplicate series names

Some providers give the same name to many series. The DBnomics data model tolerates this, although it is not recommended.

In this case, the user will always be able to distinguish those time series by looking at their code or dimensions.

Series without code

Some providers don't give codes to series, but only dimensions (e.g. {"FREQ": "A", "REF_AREA": "FR"}). In this case, as the series code is required in the data model of DBnomics, it can be generated from the dimensions (e.g. A.FR).

This does not mean that series codes must be generated from dimensions: some providers give series arbitrary codes that do not correspond to dimensions (e.g. SERIES_137).

However, building the series codes from dimensions is a common practice among providers, and is recommended as it makes them easier to understand and more stable over time.
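Generating a series code from dimensions could look like the sketch below. The function name is hypothetical, and the explicit dimension order is an assumption: the text only shows the result A.FR, so a fetcher has to decide on a stable order for the dimension values itself.

```python
from typing import Dict, Sequence


def series_code_from_dimensions(
    dimensions: Dict[str, str], dimension_order: Sequence[str]
) -> str:
    """Join dimension values with "." in a fixed, explicit order."""
    return ".".join(dimensions[name] for name in dimension_order)


code = series_code_from_dimensions(
    {"FREQ": "A", "REF_AREA": "FR"},
    dimension_order=["FREQ", "REF_AREA"],
)
```

Passing the order explicitly, rather than relying on the order of the keys, keeps the generated codes stable even if the provider lists dimensions in a different order from one download to the next.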

Observation

An observation is a pair of period and value, like {"period": "2020-01", "value": 123.45}.

It can also have attributes: a set of key-value pairs (e.g. {"unit": "USD", "status": "final"}).
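As a mental model only, an observation could be written as a small Python data class. This sketch is an assumption for illustration and does not describe the classes of dbnomics_toolbox.model.

```python
from dataclasses import dataclass, field
from typing import Dict, Union


@dataclass(frozen=True)
class Observation:
    """Conceptual observation: a period, a value, optional attributes."""

    period: str  # normalized period, e.g. "2020-01"
    value: Union[float, str]  # a number, or the string "NA"
    attributes: Dict[str, str] = field(default_factory=dict)


obs = Observation(
    period="2020-01",
    value=123.45,
    attributes={"unit": "USD", "status": "final"},
)
missing = Observation(period="2020-02", value="NA")
```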

Period

The period format is normalized from provider source data:

  • YYYY for years
  • YYYY-MM for months (e.g. 2000-01, 2000-11)
  • YYYY-MM-DD for days (MM and DD MUST be zero-padded)
  • YYYY-Q[1-4] for year quarters (e.g. 2018-Q1 represents January to March 2018, and 2018-Q4 represents October to December 2018)
  • YYYY-S[1-2] for year semesters, aka bi-annual or semi-annual (e.g. 2018-S1 represents January to June 2018, and 2018-S2 represents July to December 2018)
  • YYYY-B[1-6] for pairs of months, aka bi-monthly (e.g. 2018-B1 represents January and February 2018, and 2018-B6 represents November and December 2018)
  • YYYY-W[01-53] for year weeks (the week number MUST be zero-padded)
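The formats listed above can be checked with regular expressions. The sketch below is an illustration under assumptions: the function name and the kind labels are hypothetical, and the check is purely syntactic (it would accept a day such as 2021-02-30).

```python
import re
from typing import Optional

# One pattern per normalized period format listed above.
PERIOD_PATTERNS = {
    "year": re.compile(r"\d{4}"),
    "month": re.compile(r"\d{4}-(0[1-9]|1[0-2])"),
    "day": re.compile(r"\d{4}-(0[1-9]|1[0-2])-(0[1-9]|[12]\d|3[01])"),
    "quarter": re.compile(r"\d{4}-Q[1-4]"),
    "semester": re.compile(r"\d{4}-S[1-2]"),
    "bimester": re.compile(r"\d{4}-B[1-6]"),
    "week": re.compile(r"\d{4}-W(0[1-9]|[1-4]\d|5[0-3])"),
}


def period_kind(period: str) -> Optional[str]:
    """Return the kind of a normalized period, or None if it matches none."""
    for kind, pattern in PERIOD_PATTERNS.items():
        if pattern.fullmatch(period):
            return kind
    return None


kinds = [period_kind(p) for p in ("2018", "2000-11", "2018-Q1", "2018-W01")]
```

Because padding is mandatory, unpadded inputs such as 2018-W1 or 2018-1-5 match no pattern and yield None.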