Data model

Entities

The DBnomics data model is defined by the following entities:

provider -(has many)-> dataset -(has many)-> time series
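The hierarchy above can be pictured with a minimal Python sketch. The class and attribute names here are illustrative assumptions, not the actual DBnomics schema; only the `code` and `name` properties come from this page.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Series:
    code: str       # mandatory, see the Series section
    name: str = ""  # SHOULD be unique, but duplicates are accepted


@dataclass
class Dataset:
    code: str  # mandatory
    series: List[Series] = field(default_factory=list)  # hypothetical attribute name


@dataclass
class Provider:
    code: str  # mandatory
    datasets: List[Dataset] = field(default_factory=list)  # hypothetical attribute name


# One provider holding one dataset holding one time series.
provider = Provider(
    code="AMECO",
    datasets=[Dataset(code="UING", series=[Series(code="AUT.1.0.0.0.UING")])],
)
```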

Provider

code: to have a well-defined URL, a provider MUST have a code (example)

Dataset

code: to have a well-defined URL, a dataset MUST have a code (example)

Dataset releases

Sometimes providers publish datasets with releases.

In DBnomics, each dataset release is a regular dataset whose code follows the pattern {dataset_code}:{release_code}.

For example, the IMF publishes WEO every 6 months (e.g. 2019-04, 2019-10, 2020-04, etc.). DBnomics datasets would be named WEO:2019-04, WEO:2019-10, WEO:2020-04, etc.

The release code latest is reserved for accessing the latest release of a dataset. It is supported by the DBnomics website, the Web API, and all DBnomics clients, as long as they follow HTTP redirections.
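As a rough illustration of the naming pattern, the following Python sketch builds and splits release dataset codes. The function names are hypothetical, and splitting at the first colon assumes that plain dataset codes never contain a colon, which this page does not state.

```python
from typing import Optional, Tuple

LATEST_RELEASE = "latest"  # reserved release code described above


def make_release_dataset_code(dataset_code: str, release_code: str) -> str:
    """Build a dataset code following {dataset_code}:{release_code}."""
    if not dataset_code or not release_code:
        raise ValueError("dataset_code and release_code must be non-empty")
    return f"{dataset_code}:{release_code}"


def split_release_dataset_code(code: str) -> Tuple[str, Optional[str]]:
    """Split a code into (dataset_code, release_code).

    release_code is None when the code has no release part.
    Assumption: the first colon separates the two parts.
    """
    dataset_code, separator, release_code = code.partition(":")
    return dataset_code, (release_code if separator else None)


weo_2019_04 = make_release_dataset_code("WEO", "2019-04")
weo_latest = make_release_dataset_code("WEO", LATEST_RELEASE)
```

Here `weo_2019_04` is `WEO:2019-04` and `weo_latest` is `WEO:latest`; resolving `latest` to a concrete release is left to the server through HTTP redirections, as described above.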

Series

code: to have a well-defined URL, a series MUST have a code (example)
name: SHOULD be unique

TODO: if I generate a series code from a name, what characters are valid?

Duplicate series names

Some providers give the same name to many series. The DBnomics data model accepts duplicate series names, although this is not recommended.

Users can distinguish time series that share a name by looking at their codes or dimensions.

Remember that one of DBnomics' features is to redistribute provider data as-is. The situation would be the same when accessing the data from the provider website.

However, the data validation script reports an error for duplicate names when run in developer mode.
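To make the check concrete, here is a minimal sketch of how duplicate names could be detected. It is not the actual DBnomics validation script; the function name and the (code, name) input shape are assumptions chosen for illustration.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def find_duplicate_series_names(
    series: Iterable[Tuple[str, str]],
) -> Dict[str, List[str]]:
    """Map each name used by more than one series to the codes sharing it.

    `series` is an iterable of (code, name) pairs.
    """
    codes_by_name: Dict[str, List[str]] = defaultdict(list)
    for code, name in series:
        codes_by_name[name].append(code)
    # Keep only the names that appear on several series.
    return {name: codes for name, codes in codes_by_name.items() if len(codes) > 1}
```

The returned codes are exactly what a user would rely on to tell the identically named series apart.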

Missing series codes

Some providers distribute time series with arbitrary codes (e.g. AMECO/UING/AUT.1.0.0.0.UING), whereas others don't give codes to series, only dimensions (e.g. {"FREQ": "A", "REF_AREA": "FR"}). In the latter case, because the series code is a hard constraint, it has to be generated from the dimensions (e.g. A.FR).
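The generation step can be sketched as follows. This assumes the dataset defines an ordering of its dimensions and that the values are joined with a dot, as the A.FR example suggests; the function name and parameters are hypothetical.

```python
from typing import Mapping, Sequence


def series_code_from_dimensions(
    dimensions: Mapping[str, str],
    dimension_order: Sequence[str],
    separator: str = ".",
) -> str:
    """Join dimension values, in the dataset's dimension order, into a series code."""
    missing = [code for code in dimension_order if code not in dimensions]
    if missing:
        raise ValueError(f"missing values for dimensions: {missing}")
    return separator.join(dimensions[code] for code in dimension_order)


code = series_code_from_dimensions(
    {"FREQ": "A", "REF_AREA": "FR"},
    dimension_order=["FREQ", "REF_AREA"],
)
print(code)  # A.FR
```

Using an explicit order, rather than the order of keys in the mapping, keeps generated codes stable across series of the same dataset.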

Dimensions

Ideally, giving a value to every dimension of a dataset should return a unique time series. But sometimes, due to errors in provider data or because of modeling choices made by providers, datasets are distributed with more than one time series per dimension set.

For example, this search by dimension, where every dimension has a value selected, matches 3 time series (see also API link).
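A dataset can be checked for this situation by grouping series on their full set of dimension values. The sketch below is illustrative only; the function name and the (code, dimensions) input shape are assumptions, not part of any DBnomics tool.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Mapping, Tuple

DimensionKey = Tuple[Tuple[str, str], ...]


def find_ambiguous_dimension_sets(
    series: Iterable[Tuple[str, Mapping[str, str]]],
) -> Dict[DimensionKey, List[str]]:
    """Return the dimension sets shared by more than one series.

    `series` is an iterable of (series_code, dimensions) pairs.
    """
    codes_by_dimensions: Dict[DimensionKey, List[str]] = defaultdict(list)
    for code, dimensions in series:
        # Sort the items so the key does not depend on dict ordering.
        key = tuple(sorted(dimensions.items()))
        codes_by_dimensions[key].append(code)
    return {
        key: codes for key, codes in codes_by_dimensions.items() if len(codes) > 1
    }
```

An empty result means every dimension set identifies exactly one time series.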

TODO: if I generate a dimension code from a label, what characters are valid?