5. TOAR Near Realtime Data Processing
Currently we collect near real-time data from two data providers: UBA (German Environment Agency 1 ) and OpenAQ (open air quality data 2). The corresponding data harvesting procedures are described below.
5.1. UBA Data Harvesting
Since 2001, the German Umweltbundesamt - UBA 1 - has provided preliminary data from a growing number (currently 1004) of German surface stations. The basis for the data exchange is the manual „Luftqualitätsdaten- und Informationsaustausch in Deutschland" ("Air quality data and information exchange in Germany"), Version V 5, April 2019 (in German).
Data for at least ozone, SO2, PM10, PM2.5, NO2 and CO are updated hourly for the current day and remain available for up to four previous days. Data are fetched from the UBA service four times per day (at 08:00, 12:00, 18:00, and 22:00 local time).
The software for processing the data from UBA is available at https://gitlab.version.fz-juelich.de/esde/toar-data/toar-db-data/-/tree/master/toar_v2/harvesting/UBA_NRT.
The files StationparameterMeta.csv, StationMeta.csv, and uba_%s.csv (where %s denotes a date) are harvested four times daily from http://www.luftdaten.umweltbundesamt.de/files/ (secured with access credentials).
| name of component in original file | name of component in TOAR database |
|---|---|
| Schwefeldioxid | so2 |
| Ozon | o3 |
| Stickstoffdioxid | no2 |
| Stickstoffmonoxid | no |
| Kohlenmonoxid | co |
| Temperatur | temp |
| Windgeschwindigkeit | wspeed |
| Windrichtung | wdir |
| PM10 | pm10 |
| PM2_5 | pm2p5 |
| Relative Feuchte | relhum |
| Benzol | benzene |
| Ethan | ethane |
| Methan | ch4 |
| Propan | propane |
| Toluol | toluene |
| o-Xylol | oxylene |
| mp-Xylol | mpxylene |
| Luftdruck | press |
| term of station_type in original file | term of station_type in TOAR database |
|---|---|
| Hintergrund | background |
| Industrie | industrial |
| Verkehr | traffic |
| term of station_type_of_area in original file | term of station_type_of_area in TOAR database |
|---|---|
| ländlich abgelegen | rural |
| ländliches Gebiet | rural |
| ländlich regional | rural |
| ländlich stadtnah | rural |
| städtisches Gebiet | urban |
| vorstädtisches Gebiet | suburban |
| component | original unit | unit in TOAR DB | unit conversion while ingesting |
|---|---|---|---|
| co | mg m-3 | ppb | × 858.95 |
| no | µg m-3 | ppb | × 0.80182 |
| no2 | µg m-3 | ppb | × 0.52297 |
| o3 | µg m-3 | ppb | × 0.50124 |
| so2 | µg m-3 | ppb | × 0.37555 |
| benzene | µg m-3 | ppb | × 0.30802 |
| ethane | µg m-3 | ppb | × 0.77698 |
| ch4 | µg m-3 | ppb | × 1.49973 |
| propane | µg m-3 | ppb | × 0.52982 |
| toluene | µg m-3 | ppb | × 0.26113 |
| oxylene | µg m-3 | ppb | × 0.22662 |
| mpxylene | µg m-3 | ppb | × 0.22662 |
| pm1 | µg m-3 | µg m-3 | |
| pm10 | µg m-3 | µg m-3 | |
| pm2p5 | µg m-3 | µg m-3 | |
| press | hPa | hPa | |
| temp | degree Celsius | degree Celsius | |
| wdir | degree | degree | |
| wspeed | m s-1 | m s-1 | |
| relhum | % | % | |
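Most of the gas-phase factors in the table correspond to the molar volume of an ideal gas divided by the molar mass of the component; a molar volume of about 24.06 l mol-1 (roughly 20 °C and 1013 hPa, our assumption) reproduces the listed values for the components below. A minimal sketch, with standard molar masses that are not taken from the harvesting code:

```python
# Conversion factor from mass concentration to mixing ratio (ppb):
# factor = molar volume [l/mol] / molar mass [g/mol], times 1000 if the
# original unit is mg m-3 instead of ug m-3.
MOLAR_VOLUME = 24.06  # l/mol, assumed reference conditions (~20 degC, 1013 hPa)

MOLAR_MASS = {  # g/mol, standard values
    "co": 28.010, "no": 30.006, "no2": 46.006, "o3": 48.000,
    "so2": 64.066, "benzene": 78.114, "toluene": 92.141,
}

def conversion_factor(component: str, original_unit: str = "ug m-3") -> float:
    """Factor by which a mass concentration is multiplied to obtain ppb."""
    scale = 1000.0 if original_unit == "mg m-3" else 1.0
    return scale * MOLAR_VOLUME / MOLAR_MASS[component]
```

For example, `conversion_factor("no2")` yields about 0.523, matching the table entry for no2; the factors for ethane and propane in the table do not follow this simple relation and are kept as given.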
Validated data from the previous year are available by May 31st at the latest. These data are requested by email and then processed from the database dumps we receive. The validated data supersede the preliminary near real-time data. The real-time data remain in the database but are hidden from standard user access via the data quality flag settings.
5.2. OpenAQ
OpenAQ 2 collects data in 93 countries from real-time government and research-grade sources. Since 26 November 2016, OpenAQ has gathered more than one billion records, amounting to 306 GB in total and covering the air-quality-relevant variables BC, CO, NO2, O3, PM10, PM2.5 and SO2.
5.2.1. Data Provision
OpenAQ provides real-time air quality data on Amazon Web Services 3 in daily directories. Data files composed of records of measurement values are added to the directory of the current day at irregular intervals. The directories and their data files are stored on Amazon Web Services permanently.
Each data file contains up to hundreds of thousands of records. Records are JSON 4 objects with the same structure throughout the entire life cycle. The task of our real-time data harvesting procedure is to go through these records and save them into the TOAR database according to the TOAR database scheme in near real time.
A key element for processing the OpenAQ data is a separate intermediate database, which simplifies the processing. Only after the data are ready to be stored in the TOAR database are they uploaded.
The realised real-time data harvesting procedure consists of four steps: the first two download the data and store it in the intermediate database, while the last two parse the fields and map them to the TOAR database scheme.
The first two steps (workflow 1) handle the transfer between Amazon Web Services and the intermediate database, and the other two steps (workflow 2) the transfer between the intermediate database and the TOAR database.
Technically, the open-source software Apache Airflow is used for workflow automation, so that the workflows are triggered at regular intervals within a day.
5.2.2. The Intermediate Database
The reason for introducing an intermediate database is to simplify data parsing and mapping and to enable pre-evaluation, statistics, and visualisation. To this end, we flatten the long lists of tree-structured records into a two-dimensional table.
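The flattening can be sketched as follows. The record shown is a hypothetical example in the nested structure OpenAQ uses (location, parameter, value, unit, date, coordinates); the column names of the intermediate table are our assumptions, not taken from the actual schema:

```python
import json

# Hypothetical example record; the real files are newline-delimited JSON
# with one such object per line.
record = json.loads("""{
  "location": "DEBY039", "country": "DE", "parameter": "o3",
  "value": 61.0, "unit": "µg/m³",
  "date": {"utc": "2021-06-01T12:00:00Z", "local": "2021-06-01T14:00:00+02:00"},
  "coordinates": {"latitude": 48.13, "longitude": 11.55}
}""")

def flatten(rec: dict) -> dict:
    """Flatten one tree-structured record into a flat row of the
    two-dimensional intermediate table."""
    return {
        "station_code": rec["location"],
        "country": rec["country"],
        "parameter": rec["parameter"],
        "value": rec["value"],
        "unit": rec["unit"],
        "time_utc": rec["date"]["utc"],
        "lat": rec["coordinates"]["latitude"],
        "lon": rec["coordinates"]["longitude"],
    }
```

Each flat row can then be inserted directly into the intermediate table, and pre-evaluation or statistics reduce to ordinary column operations.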
5.2.3. The Harvesting Workflow
Workflow 1
We use Python and the boto3 5 Python module for querying Amazon Web Services (AWS).
First, the newly created 6 sub-directories on AWS are identified and retrieved; they are then inserted into the sub-directory table and the data file table of the intermediate database.
With that, the status of the intermediate database is synchronised with that of AWS, and all unprocessed records are prepared in the intermediate database for import into the TOAR database.
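This synchronisation step can be sketched as below. The bucket name, the `realtime/<day>/` prefix layout, and the anonymous access are assumptions about the AWS setup, not taken from the harvesting code:

```python
def list_day_keys(bucket: str, day: str) -> list:
    """List all object keys in one daily AWS directory (anonymous access)."""
    import boto3  # third-party; imported lazily so the helper below stays testable
    from botocore import UNSIGNED
    from botocore.config import Config

    s3 = boto3.client("s3", config=Config(signature_version=UNSIGNED))
    keys = []
    for page in s3.get_paginator("list_objects_v2").paginate(
            Bucket=bucket, Prefix="realtime/%s/" % day):
        keys.extend(obj["Key"] for obj in page.get("Contents", []))
    return keys

def unprocessed(listed, known):
    """Keys present on AWS but not yet recorded in the data file table."""
    return sorted(set(listed) - set(known))
```

Comparing the listed keys against the data file table of the intermediate database yields exactly the new files that still have to be downloaded and parsed.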
Workflow 2
The second workflow identifies the station and the time series in the TOAR database to which a new record belongs, in the way described in Steps 10 to 17 of Section 3.2.
With the id of the identified time series, the value of the record is finally inserted into the data table of the TOAR database. In the end, each record from the intermediate database is matched and saved into the TOAR database (Fig. 5.3).
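The matching step can be sketched as follows; the in-memory lookup dictionary and the field names are hypothetical stand-ins for the actual queries against the stations and timeseries tables:

```python
# Hypothetical stand-in for the TOAR station/timeseries lookup; the real
# workflow resolves (station, variable) via database queries (Section 3.2).
timeseries_ids = {("DEBY039", "o3"): 4711}

def to_data_row(rec):
    """Map a flattened record to a row of the TOAR data table, or None
    if no matching time series exists yet."""
    ts_id = timeseries_ids.get((rec["station_code"], rec["parameter"]))
    if ts_id is None:
        return None  # unknown station/variable: keep record for later inspection
    return {"timeseries_id": ts_id,
            "datetime": rec["time_utc"],
            "value": rec["value"]}
```

Records without a matching time series stay in the intermediate database, so no data are silently lost during import.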
5.2.4. Workflow Automation with Apache Airflow
The data harvesting process described in the previous subsection can be executed in one batch or divided into two isolated workflows. In both cases, it should be scheduled, executed, and monitored automatically. To this end, we use the Apache Airflow workflow management software 7, installed on the same server as the intermediate database. Apache Airflow is registered as a system service, so that it starts automatically on system boot. We define two separate workflows in Apache Airflow, as depicted in Fig. 5.2. Both workflows are scheduled hourly. On the web interface of Apache Airflow, we can monitor and manipulate the workflows with ease.
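The hourly scheduling can be sketched as an Airflow DAG configuration; the DAG and task names and the callables are hypothetical, only the two-workflow split and the hourly schedule are taken from the text:

```python
from datetime import datetime
from airflow import DAG
from airflow.operators.python import PythonOperator

def sync_aws_to_intermediate_db():
    """Workflow 1: fetch new AWS records into the intermediate database."""

def sync_intermediate_db_to_toar():
    """Workflow 2: map prepared records onto the TOAR database scheme."""

# Two independent workflows, each triggered hourly (cf. Fig. 5.2).
with DAG("openaq_workflow1", start_date=datetime(2021, 1, 1),
         schedule_interval="@hourly", catchup=False):
    PythonOperator(task_id="harvest_aws",
                   python_callable=sync_aws_to_intermediate_db)

with DAG("openaq_workflow2", start_date=datetime(2021, 1, 1),
         schedule_interval="@hourly", catchup=False):
    PythonOperator(task_id="import_toar",
                   python_callable=sync_intermediate_db_to_toar)
```

Keeping the two DAGs separate allows each to be paused, retried, and monitored independently in the Airflow web interface.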
Footnotes