Data model¶
This is a reference for the data model of the different data we are gathering:
Altough there are no rules on which language or framework to use, I will write in the case of Python and Pandas, in case of somebody working with another language (like R or Java), most of this rules are easy to translate, in case of doubt, please feel free to discuss in #task-geo.
General guidelines¶
Cite the source of your data. ( In the docstring of your data_source file, and on the
AUTHORS.rst
Data Sources section and, of course, include yourself among the contributors).Units should be in metric system, codes like country codes, should be ISO.
Column names should be lowercase (it might be uglier, but it’s easier when coding).
Unless strictly necessary, we should try avoid having multiindices.
Avoid making totalizations of data, the more granular data, the best, we can always aggregate, but we can’t disaggregate.
Avoid unecessary stacking of data:
Don’t do this:
date
measure
value
2020-01-01
A
1.5
2020-01-01
B
2.0
2020-01-02
A
1.0
2020-01-02
B
0.5
Do this instead:
date
A
B
2020-01-01
1.5
2.0
2020-01-02
1.0
0.5
THIS DOESN’T APPLY TO TIME DATA!!!
Avoid unnecesary flattening/unstacking of data: (Yes, it may seem contradictory with the last point but it’s not!)
This may come in hand when working with one-hot encoded variables or unstructured or schemaless data using dataframes, for example, for the following JSON:
[ { 'id': 2, 'feature_A': 0.1, 'feature_B': 1.4, 'feature_C': 1.5, }, { 'id': 1, 'feature_X': 14.5, 'feature_Y': 2.0, 'feature_Z': 41 } ]
Don’t do this:
id
feature_A
feature_B
feature_C
feature_X
feature_Y
feature_Z
1
0.1
1.4
1.5
nan
nan
nan
2
nan
nan
nan
14.5
2.0
41
Do this instead:
id
feature
value
1
A
0.1
1
B
1.4
1
C
1.5
2
X
14.5
2
Y
2.0
2
Z
41
It’s hard to define an explicit set of rules, but I’ll say:
Try to avoid adding NaN to your data because of the shaping.
Try to make your date the most compact possible
In case of doubt, please, don’t hesistate to ask on Slack #team-geo!
Geographical data¶
Columns with geographical-related information should be at the beginning of the dataset.
Latitude and longitude should be in decimal values, not in degrees.
Enrich your location (lat, long) data with the geopolitical information(city, country) when possible. You can use something like https://opencagedata.com/
Columns containing ISO codes instead of names, should append _iso at the end of the name.
In case political subdivisions are available, each subdivision should be in its own column, those columns should be:
country
,region
,subregion
city
. If information after some level is not available the column is not required.