Data model
This page gives a high-level overview of the DBnomics data model, and the main design principles behind it.
Read the documentation of dbnomics-toolbox to see how this data model is implemented in Python and how to use it: https://dbnomics-toolbox.readthedocs.io/en/latest/data-model/.
Data representation principles
DBnomics distinguishes data from its format, and simplifies format only.
This means DBnomics tries to preserve provider semantics while normalizing a limited set of representation details needed for consistent access across providers.
Preserved provider semantics
The following items are kept as-is from the provider:
- time series and their observations
- dataset dimensions: a dataset may have a dimension named "Country" with values "France", "Germany", etc., and another dataset may have a dimension named "REF_AREA" with values "FR", "DE", etc. DBnomics does not try to harmonize those dimensions, but keeps them as-is from the provider.
- NA (non-available) values usage: DBnomics does not add or remove them. If a provider distributes a time series with an incomplete calendar (with some missing periods), DBnomics does not try to complete it.
Some providers distribute time series with no observations, or with only NA values, and DBnomics keeps them as-is as well.
Normalized representation
Some data is normalized from provider source data:
- periods: providers use different ways to represent them and DBnomics defines a standard format (e.g. 202001 or 2020M01 becomes 2020-01)
- NA (non-available) values: some providers use NaN, some others -9999, etc. DBnomics always uses NA
Normalization is done by each fetcher in the data conversion part, based on the knowledge of the provider source data.
For example, a period like 2000-qII would be normalized as 2000-Q2.
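The normalization described above can be sketched in a few lines of Python. This is a minimal illustration, not the code of any actual fetcher: the function names, the set of NA markers, and the handful of supported input formats are assumptions chosen to match the examples on this page.

```python
import re
from typing import Union

# Hypothetical provider-specific markers meaning "not available".
# A real fetcher would define its own set from its knowledge of the source data.
PROVIDER_NA_MARKERS = {"NaN", "-9999"}

ROMAN_QUARTERS = {"I": 1, "II": 2, "III": 3, "IV": 4}


def normalize_period(raw: str) -> str:
    """Normalize a few illustrative provider period formats."""
    # 202001 or 2020M01 -> 2020-01
    match = re.fullmatch(r"([0-9]{4})M?(0[1-9]|1[0-2])", raw)
    if match:
        return f"{match.group(1)}-{match.group(2)}"
    # 2000-qII -> 2000-Q2
    match = re.fullmatch(r"([0-9]{4})-[qQ](I{1,3}|IV)", raw)
    if match:
        return f"{match.group(1)}-Q{ROMAN_QUARTERS[match.group(2)]}"
    # Fail loudly rather than guess: an unknown format needs a human decision.
    raise ValueError(f"Unsupported period format: {raw!r}")


def normalize_value(raw: str) -> Union[float, str]:
    """Return "NA" for a provider NA marker, otherwise the numeric value."""
    if raw.strip() in PROVIDER_NA_MARKERS:
        return "NA"
    return float(raw)
```

Raising an error on an unknown period format, instead of passing the raw string through, keeps non-normalized periods from silently reaching the converted data.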
Why normalize this?
If we kept the original formats for periods or NA values, it would be difficult to, for example, represent time series on a chart, because each format would have to be handled separately.
The best place to do this normalization is in the fetcher, because it has the knowledge of the provider source data, and can be adapted to each provider.
Conceptual model
We first present a conceptual model of the main DBnomics concepts and their relationships:
flowchart LR
Provider -->|has many| Dataset
Dataset -->|has many| Series
Series -->|has many| Observation
Note
This diagram is a conceptual model, meant as a mental model of the main DBnomics concepts and their relationships.
It does not reflect the actual implementation or describe the in-memory representation used by DBnomics or by dbnomics_toolbox.model.
The subsections below describe the conceptual model.
Provider
code: to have a well-defined URL, a provider MUST have a code
Dataset
code: to have a well-defined URL, a dataset MUST have a code
Dataset dimensions
- dimensions differ from one dataset to another and are not harmonized
Dataset releases
Sometimes providers publish datasets with releases.
In DBnomics each dataset release is a regular dataset named after the pattern {dataset_code}:{release_code}.
For example, IMF provides WEO every 6 months (e.g. 2019-04, 2019-10, 2020-04, etc.).
DBnomics datasets would be named WEO:2019-04, WEO:2019-10, WEO:2020-04, etc.
The release code latest is reserved for accessing the latest release of a dataset. This is supported by the DBnomics website, the Web API, and all the DBnomics clients, as long as they follow HTTP redirections.
See also: dataset releases.
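The release naming pattern described above can be illustrated with two small helpers. This is only a sketch: the function names are hypothetical and are not part of dbnomics-toolbox, and resolving latest to an actual release is left out because it happens through HTTP redirections.

```python
from typing import Optional, Tuple

RELEASE_SEPARATOR = ":"
LATEST_RELEASE_CODE = "latest"  # reserved release code


def build_release_dataset_code(dataset_code: str, release_code: str) -> str:
    """Build a dataset code following {dataset_code}:{release_code}."""
    return f"{dataset_code}{RELEASE_SEPARATOR}{release_code}"


def split_release_dataset_code(code: str) -> Tuple[str, Optional[str]]:
    """Split a dataset code into (dataset_code, release_code or None)."""
    base, separator, release = code.partition(RELEASE_SEPARATOR)
    return base, (release if separator else None)


print(build_release_dataset_code("WEO", "2019-04"))
print(split_release_dataset_code("WEO:2020-04"))
print(build_release_dataset_code("WEO", LATEST_RELEASE_CODE))
```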
Series
- code: to have a well-defined URL, a series MUST have a code
- name: SHOULD be unique
- dimensions: a set of key-value pairs (like FREQ => A, REF_AREA => FR, etc.)
- observations: a list of observations (see below)
Duplicate series names
Some providers give the same name to many series. The DBnomics data model tolerates this, even though it is not recommended.
In this case, the user will always be able to distinguish those time series by looking at their code or dimensions.
Series without code
Some providers don't give codes to series, but only dimensions (e.g. {"FREQ": "A", "REF_AREA": "FR"}).
In this case, as the series code is required in the data model of DBnomics, it can be generated from the dimensions (e.g. A.FR).
This does not mean that the series codes must be generated from dimensions: some providers give arbitrary codes to series, that do not correspond to dimensions (e.g. SERIES_137).
However, building the series codes from dimensions is a common practice among providers, and is recommended as it makes them easier to understand and more stable over time.
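As an illustration, a series code such as A.FR could be derived from the dimensions as follows. The function name, the dot separator, and the explicit dimension order are assumptions inferred from the example above, not a documented DBnomics rule.

```python
from typing import Dict, Sequence


def series_code_from_dimensions(
    dimensions: Dict[str, str], dimension_order: Sequence[str]
) -> str:
    """Join dimension values with dots, in a fixed order of dimension codes."""
    # A missing dimension raises KeyError instead of yielding a partial code.
    return ".".join(dimensions[code] for code in dimension_order)


code = series_code_from_dimensions(
    {"REF_AREA": "FR", "FREQ": "A"},
    dimension_order=["FREQ", "REF_AREA"],
)
print(code)
```

Passing the order explicitly, instead of relying on the order of the keys in the dictionary, keeps the generated code stable even if the provider lists dimensions differently from one download to the next.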
Observation
An observation is a pair of period and value, like {"period": "2020-01", "value": 123.45}.
It can also have attributes: a set of key-value pairs (e.g. {"unit": "USD", "status": "final"}).
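To make the shape of a series and its observations concrete, here is a minimal sketch using Python dataclasses. It follows the conceptual model only: the class and field names are illustrative and do not describe the actual classes of dbnomics_toolbox.model.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Union


@dataclass
class Observation:
    period: str                # normalized period, e.g. "2020-01"
    value: Union[float, str]   # a number, or the string "NA"
    attributes: Dict[str, str] = field(default_factory=dict)


@dataclass
class Series:
    code: str                  # required
    name: str = ""             # should be unique, but duplicates are tolerated
    dimensions: Dict[str, str] = field(default_factory=dict)
    observations: List[Observation] = field(default_factory=list)


series = Series(
    code="A.FR",
    dimensions={"FREQ": "A", "REF_AREA": "FR"},
    observations=[
        Observation("2020-01", 123.45, {"unit": "USD", "status": "final"}),
        Observation("2020-02", "NA"),
    ],
)
print(series.observations[0])
```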
Period
The period format is normalized from provider source data:
- YYYY for years
- YYYY-MM for months (e.g. 2000-01, 2000-11)
- YYYY-MM-DD for days (MUST be padded for MM and DD)
- YYYY-Q[1-4] for year quarters
  - example: 2018-Q1 represents Jan to Mar 2018, and 2018-Q4 represents Oct to Dec 2018
- YYYY-S[1-2] for year semesters (aka bi-annual, semi-annual)
  - example: 2018-S1 represents Jan to Jun 2018, and 2018-S2 represents Jul to Dec 2018
- YYYY-B[1-6] for pairs of months (aka bi-monthly)
  - example: 2018-B1 represents Jan + Feb 2018, and 2018-B6 represents Nov + Dec 2018
- YYYY-W[01-53] for year weeks (MUST be padded)
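The formats listed above can be checked with regular expressions. The sketch below is an assumption-laden illustration: the names period_kind and PERIOD_PATTERNS are hypothetical, and the patterns check syntax only, so an impossible date such as 2018-02-31 would still be accepted.

```python
import re
from typing import Optional

# One pattern per normalized period format described on this page.
PERIOD_PATTERNS = {
    "year": r"[0-9]{4}",
    "month": r"[0-9]{4}-(0[1-9]|1[0-2])",
    "day": r"[0-9]{4}-(0[1-9]|1[0-2])-(0[1-9]|[12][0-9]|3[01])",
    "quarter": r"[0-9]{4}-Q[1-4]",
    "semester": r"[0-9]{4}-S[1-2]",
    "bimester": r"[0-9]{4}-B[1-6]",
    "week": r"[0-9]{4}-W(0[1-9]|[1-4][0-9]|5[0-3])",
}


def period_kind(period: str) -> Optional[str]:
    """Return the kind of a normalized period, or None if it matches no format."""
    for kind, pattern in PERIOD_PATTERNS.items():
        if re.fullmatch(pattern, period):
            return kind
    return None


for candidate in ["2018", "2000-11", "2018-Q1", "2018-W01", "2018-W1"]:
    print(candidate, period_kind(candidate))
```

The last candidate, 2018-W1, matches no pattern because the week number is not padded.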