Post processing the Stations Data
Station data can be fully processed within the provided sensor analysis framework. In this guide, we will walk through a live example that uses data from a permanent living lab station at Fablab BCN. The available sensors are listed here.
Load in the data
First, we will load the data from this station. The device number is 4748, and it is available here.
We will use the interface available in our framework to load the station data. In this case, we will load all the available data into our notebook, but the interface also lets us set time limits if a different timeframe is desired:
Info
In the Kit list field we can also input a comma-separated list of devices, such as 4748, 4565, 4587, and the data for all of them will be loaded.
We can now explore the available readings in our test with something like:
## This will output the devices we have in the selected test
print(readings['STATION_FABLAB_BCN']['devices'].keys())
[u'4748']
If we want to have access to the actual data, we can go under:
## This will output the dataframe's first 4 lines
print(readings['STATION_FABLAB_BCN']['devices']['4748']['data'].head(4))
                           BATT  CO_MICS_RAW    EXT_HUM   EXT_TEMP     GB_1A  \
2018-08-24 17:00:00+02:00   0.0    73.441111  47.652222  30.663333  4.517778
2018-08-24 17:10:00+02:00   0.0   129.049000  46.632000  31.209000  3.905000
2018-08-24 17:20:00+02:00   0.0    53.738333  45.901667  31.616667  3.808333
2018-08-24 17:30:00+02:00  74.5   122.405000  45.655000  31.810000  3.925000
...
Or if we want to see the available recordings:
## This will output the dataframe columns
print(readings['STATION_FABLAB_BCN']['devices']['4748']['data'].columns)
Index([u'BATT', u'CO_MICS_RAW', u'EXT_HUM', u'EXT_TEMP', u'GB_1A', u'GB_1W', u'GB_2A', u'GB_2W', u'GB_3A', u'GB_3W', u'HUM', u'LIGHT', u'NO2_MICS_RAW', u'PM_1', u'PM_10', u'PM_25', u'PM_DALLAS_TEMP', u'PRESS', u'TEMP'], dtype='object')
For more information about the test structure, all the fields are detailed here.
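The nested layout used above (test, then 'devices', then device id) can be walked with plain dictionary iteration. The sketch below uses a hand-built stand-in for `readings` that reproduces only the keys shown in this guide; the real structure carries more fields, and `list_devices` is a hypothetical helper, not part of the framework.

```python
# Stand-in for the `readings` dict. Only the keys shown in this guide are
# reproduced, and 'data' is a placeholder instead of a real dataframe.
readings = {
    'STATION_FABLAB_BCN': {
        'devices': {
            '4748': {'data': None},
        }
    }
}

def list_devices(readings):
    """Return (test name, device id) pairs for every device in every test."""
    pairs = []
    for test_name, test in readings.items():
        for device_id in test['devices']:
            pairs.append((test_name, device_id))
    return pairs

device_pairs = list_devices(readings)
print(device_pairs)
```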
If the device is a station, we will have to input the sensor references for the alphasense devices. We have prepared the framework so that this is easy to do:
## This will output the structure for inputting the alphasense sensor refs
print(readings['STATION_FABLAB_BCN']['devices']['4748']['alphasense'])
{'O3': 'TEMPORARY_O3', 'SLOTS': 'TEMPORARY_SLOTS', 'CO': 'TEMPORARY_CO', 'NO2': 'TEMPORARY_NO2'}
As you can see, we have no data in this struct yet, but we can easily fill it in with:
readings['STATION_FABLAB_BCN']['devices']['4748']['alphasense']['O3'] = 204560316
readings['STATION_FABLAB_BCN']['devices']['4748']['alphasense']['NO2'] = 202160413
readings['STATION_FABLAB_BCN']['devices']['4748']['alphasense']['CO'] = 162031257
readings['STATION_FABLAB_BCN']['devices']['4748']['alphasense']['SLOTS'] = ('NO2', 'CO', 'O3')
print(readings['STATION_FABLAB_BCN']['devices']['4748']['alphasense'])
{'O3': 204560316, 'SLOTS': ('NO2', 'CO', 'O3'), 'CO': 162031257, 'NO2': 202160413}
Note that each of these fields is necessary for our subsequent calculations. The O3, NO2 and CO fields hold the manufacturer's reference for each of the sensors, whilst the SLOTS field gives the order in which the sensors are placed. Normally, the stations are delivered with the SLOTS field set to ('CO', 'NO2', 'O3'), meaning that the CO sensor is in slot #1, the NO2 sensor in slot #2 and the O3 sensor in slot #3.
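Since every field is needed later on, it can help to check the struct before running any calculation. The following is a minimal sketch, assuming the placeholders always start with `TEMPORARY_` as in the output above; `slot_map` and `check_alphasense` are hypothetical helpers, not framework functions.

```python
def slot_map(alphasense):
    """Map each slot number (1-based) to the sensor placed in it."""
    return {position: name
            for position, name in enumerate(alphasense['SLOTS'], start=1)}

def check_alphasense(alphasense):
    """Return a list of problems found in an alphasense struct."""
    problems = []
    for name in ('CO', 'NO2', 'O3'):
        ref = alphasense.get(name)
        # Assumption: unfilled entries keep their 'TEMPORARY_' placeholder.
        if ref is None or str(ref).startswith('TEMPORARY_'):
            problems.append('missing reference for ' + name)
    slots = alphasense.get('SLOTS')
    if not isinstance(slots, (tuple, list)) or sorted(slots) != ['CO', 'NO2', 'O3']:
        problems.append('SLOTS must list CO, NO2 and O3 exactly once')
    return problems

config = {'O3': 204560316, 'NO2': 202160413, 'CO': 162031257,
          'SLOTS': ('NO2', 'CO', 'O3')}
print(check_alphasense(config))   # prints []
print(slot_map(config))           # prints {1: 'NO2', 2: 'CO', 3: 'O3'}
```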
Info
Normally we refer to the OX-B431 sensor as O3, although it measures the combined O3+NO2 mixing ratio, and therefore we use O3 and OX interchangeably.
Warning
You might have noticed that the slots we have input, ('NO2', 'CO', 'O3'), do not match the usual order described above... Our bad!
Explore the data
We can now have a look at the station's data. We can go to the section Exploratory Data Analysis and use the range of available interfaces for the data analysis. We would like this to serve as a flexible tool in which the priority is to generate proper analysis. For this, we have some interesting interactive plots, such as:
- Time Series Plot
- Back2Back correlation plots
- Correlogram
- Heat maps
Let's run a simple example. We will plot all the relevant alphasense signals using the interface. We could also do this in code, but we find the interface less time consuming and better suited to data analysis:
We can select among the channels available in the dataframe, for each of the devices within a test. All the devices are supposed to have overlapping timestamps, so they can be compared easily. This is where the concept of a test becomes useful, since all the devices within a test can be easily grouped and compared:
In this example we have selected the available data from all three alphasense sensors and plotted their working and auxiliary electrodes. Here we can use the integrated plotly commands to explore the data.
Here, it is important to see how the sensors adapt to the environment once their power is restored after a power cut, or after the first use. For example, in the case of the CO measurement, after the power was restored on the 31st of August, the sensor clearly goes through a stabilisation period that has to be discarded in our calculations.
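One way to discard such stabilisation periods in code is to look for gaps in the timestamps and blank out a fixed window after each restart. This is only a sketch on synthetic data: the one-hour gap threshold and the twelve-hour settling window are placeholder values chosen for illustration, and `mask_stabilisation` is a hypothetical helper rather than a framework function.

```python
import numpy as np
import pandas as pd

def mask_stabilisation(series, max_gap='1h', settle='12h'):
    """Set to NaN the readings taken within `settle` of the first sample
    and of every sample that follows a gap longer than `max_gap`."""
    index = series.index
    gaps = index.to_series().diff() > pd.Timedelta(max_gap)
    restarts = [index[0]] + list(index[gaps.values])
    unstable = np.zeros(len(index), dtype=bool)
    for start in restarts:
        unstable |= (index >= start) & (index < start + pd.Timedelta(settle))
    return series.mask(unstable)

# Synthetic 10-minute record with a power cut between the two blocks.
before_cut = pd.date_range('2018-08-30 00:00', periods=144, freq='10min')
after_cut = pd.date_range('2018-08-31 06:00', periods=144, freq='10min')
index = before_cut.append(after_cut)
co_raw = pd.Series(np.ones(len(index)), index=index)

co_clean = mask_stabilisation(co_raw)
print(int(co_clean.isna().sum()), 'of', len(co_clean), 'samples discarded')
```

The settling time actually needed depends on the sensor, so the window should be picked by looking at the plots rather than taken from this sketch.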
We can also see how some metrics correlate with each other and analyse potential sources of multicollinearity in our model. For this, we will study how any two measurements correlate with each other in the following interface:
We will use a very straightforward example: let's see how temperature and relative humidity correlate, and whether changes in the absolute humidity might drive variations in the relative humidity that are not explained by the temperature. If we select our channels, check Crop Data in X axis and input our dates, we will have the following:
Here, we can see that both are anticorrelated (Pearson = -0.61) and although they have a clear inverse trend, their R2 is low:
This might indicate that the variations in the humidity are not fully explained by the temperature variations, and that there might be variations in the absolute humidity that we could account for in our models.
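With a single predictor, the R2 of a least-squares line equals the square of the Pearson coefficient, so a Pearson value of -0.61 corresponds to an R2 of roughly 0.37. The sketch below checks this relation on synthetic temperature and humidity values; the numbers are made up and are not the station's data.

```python
import numpy as np

rng = np.random.default_rng(0)
temp = rng.normal(25.0, 3.0, size=500)                      # synthetic temperature, degC
hum = 60.0 - 1.5 * temp + rng.normal(0.0, 6.0, size=500)    # synthetic relative humidity, %

# Pearson correlation coefficient between the two channels.
r = np.corrcoef(temp, hum)[0, 1]

# R2 of the least-squares line hum = slope * temp + intercept.
slope, intercept = np.polyfit(temp, hum, 1)
residuals = hum - (slope * temp + intercept)
r2 = 1.0 - residuals.var() / hum.var()

print(round(r, 3), round(r2, 3), round(r ** 2, 3))
```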
Adding calculated channels
Let's then add the partial vapour pressure to our dataset. Based on the definition of the relative humidity (RH, in %):

$$RH = 100 \cdot \frac{P_{H_2O}}{P^*_{H_2O}}$$

Where P_{H_2O} is the partial vapour pressure and P^*_{H_2O} is the equilibrium vapour pressure at a certain pressure and temperature. This equilibrium vapour pressure can be determined with the Arden Buck equation:

$$P^*_{H_2O} = \left(1.0007 + 3.46 \times 10^{-6} \, P\right) \cdot 6.1121 \cdot \exp\left(\frac{17.502 \, T}{240.97 + T}\right)$$

Where P is the absolute pressure in mbar and T is the temperature in degC. With these definitions, we can calculate both values using the calculator implemented in our notebook:
Here, the formula is:
## Calculate equilibrium vapour pressure
P_H2O_EQ = (1.0007 + 3.46*1e-6*PRESS*10)*6.1121*np.exp(17.502*TEMP/(240.97+TEMP))
Note that we can input in the Formula field any expression that can be evaluated as a Python formula. Note as well that numpy operations are allowed and that they can be written inline.
Info
If you want to calculate this formula for several devices within a test, select all of them and the calculator will make their common metrics available in the dropdowns.
Now we can calculate the partial vapour pressure, since the equilibrium vapour pressure is available within our channels:
## Calculate partial vapour pressure
P_H2O_VAP = HUM*P_H2O_EQ/100
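Outside the calculator, the same two formulas can be written as plain functions. This is a sketch rather than framework code: the function names are hypothetical, and the conversion from kPa is an assumption based on the `PRESS*10` factor in the formula above, which suggests the PRESS channel is stored in kPa.

```python
import numpy as np

def equilibrium_vapour_pressure(temp_c, press_mbar):
    """Arden Buck estimate of the equilibrium vapour pressure, in mbar.

    temp_c is the temperature in degC, press_mbar the absolute pressure in mbar.
    """
    enhancement = 1.0007 + 3.46e-6 * press_mbar
    return enhancement * 6.1121 * np.exp(17.502 * temp_c / (240.97 + temp_c))

def partial_vapour_pressure(rh_percent, temp_c, press_mbar):
    """Partial vapour pressure in mbar, from the relative humidity in %."""
    return rh_percent * equilibrium_vapour_pressure(temp_c, press_mbar) / 100.0

temp = np.array([20.0, 30.0])     # degC
hum = np.array([50.0, 50.0])      # %
press_kpa = 101.325               # assumed unit of the PRESS channel

p_eq = equilibrium_vapour_pressure(temp, press_kpa * 10)
p_vap = partial_vapour_pressure(hum, temp, press_kpa * 10)
print(np.round(p_eq, 2), np.round(p_vap, 2))
```

Both functions accept numpy arrays or dataframe columns, so they can be applied to whole channels at once.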
If we analyse the data in periods where the partial vapour pressure is fairly constant (e.g. the 31st of August), we see that the variations of the temperature are directly correlated with the relative humidity, whilst days such as the 2nd of September show greater variations in both temperature and partial vapour pressure, which lowers the correlation between temperature and humidity.
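The day-by-day comparison described above can also be done numerically by grouping the dataframe by calendar day. The sketch below uses synthetic data (a first day with constant partial vapour pressure and a second day where it fluctuates); `daily_summary` is a hypothetical helper, and the pressure correction of the Buck equation is left out to keep the example short.

```python
import numpy as np
import pandas as pd

def daily_summary(df):
    """Per-day TEMP/HUM Pearson correlation and spread of P_H2O_VAP."""
    rows = {}
    for day, chunk in df.groupby(df.index.date):
        rows[day] = {
            'temp_hum_corr': chunk['TEMP'].corr(chunk['HUM']),
            'vap_std': chunk['P_H2O_VAP'].std(),
        }
    return pd.DataFrame.from_dict(rows, orient='index')

# Two synthetic days sampled every 10 minutes.
rng = np.random.default_rng(0)
index = pd.date_range('2018-08-31', periods=288, freq='10min')
temp = 25.0 + 5.0 * np.sin(2 * np.pi * np.arange(288) / 144.0)
p_vap = np.full(288, 15.0)                          # constant on day one
p_vap[144:] += rng.normal(0.0, 3.0, size=144)       # fluctuating on day two
p_eq = 6.1121 * np.exp(17.502 * temp / (240.97 + temp))
hum = 100.0 * p_vap / p_eq

df = pd.DataFrame({'TEMP': temp, 'HUM': hum, 'P_H2O_VAP': p_vap}, index=index)
summary = daily_summary(df)
print(summary)
```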
Calculating the actual pollutant concentrations
Now that we know how to get around in the notebook, explore data and add channels in a simple way, let's calculate actual pollutant concentrations. For this, we will use the section AlphaSense Baseline Calibration. In this block, we will apply the methodology explained in this section in order to calculate actual pollutant concentrations.
For this, in the above mentioned section, if we run the cell, we will see an output like the following:
This will list all the available devices that contain alphasense data. Remember to include the calibration data mentioned before in the dict so that we can calculate the final concentrations.
For reference, all the alphasense data is under this repository, which looks like:
{"Target 2": "na", "Target 1": "CO", "Serial No": "162031254", "Sensitivity 1": "568.3", "Sensitivity 2": "0", "Zero Current": "-34", "Aux Zero Current": "-20.8"}
{"Target 2": "na", "Target 1": "CO", "Serial No": "162031257", "Sensitivity 1": "493.1", "Sensitivity 2": "0", "Zero Current": "-69.4", "Aux Zero Current": "-18.6"}
{"Target 2": "na", "Target 1": "CO", "Serial No": "162031256", "Sensitivity 1": "601.9", "Sensitivity 2": "0", "Zero Current": "-68.1", "Aux Zero Current": "-13.9"}
{"Target 2": "na", "Target 1": "CO", "Serial No": "162581706", "Sensitivity 1": "581.4", "Sensitivity 2": "0", "Zero Current": "-72.8", "Aux Zero Current": "-35.3"}
{"Target 2": "na", "Target 1": "CO", "Serial No": "162581707", "Sensitivity 1": "605", "Sensitivity 2": "0", "Zero Current": "-56.7", "Aux Zero Current": "-46.3"}
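Records like the ones above can be indexed by serial number so that the references entered in the alphasense struct resolve directly to their calibration values. This is a sketch under two assumptions: that the records are stored one JSON object per line, and that serials are looked up as integers, as they are entered earlier in this guide. `load_calibration` is a hypothetical helper, and the framework retrieves this data on its own.

```python
import json

# Two of the records shown above, one JSON object per line (assumed layout).
raw = '''\
{"Target 2": "na", "Target 1": "CO", "Serial No": "162031254", "Sensitivity 1": "568.3", "Sensitivity 2": "0", "Zero Current": "-34", "Aux Zero Current": "-20.8"}
{"Target 2": "na", "Target 1": "CO", "Serial No": "162031257", "Sensitivity 1": "493.1", "Sensitivity 2": "0", "Zero Current": "-69.4", "Aux Zero Current": "-18.6"}
'''

def load_calibration(text):
    """Index calibration records by serial number, converting the
    numeric fields from strings to floats."""
    numeric = ('Sensitivity 1', 'Sensitivity 2', 'Zero Current', 'Aux Zero Current')
    table = {}
    for line in text.splitlines():
        if not line.strip():
            continue
        record = json.loads(line)
        for key in numeric:
            record[key] = float(record[key])
        table[int(record['Serial No'])] = record
    return table

calibration = load_calibration(raw)
co = calibration[162031257]   # the CO reference used for device 4748
print(co['Target 1'], co['Sensitivity 1'], co['Zero Current'])
```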
In the cell output, we can select the tests that contain alphasense devices and that are subject to be calculated. A brief explanation of all the checkboxes is detailed below:
- Decomp: attempts to decompose the trends found in the day-to-day data. This is a common technique in time series analysis, used to keep trends out of the regression. It is normally not needed, since the periods used for the calculations are short enough to have no significant trend (one day + overlap*)
- Plots Inter: checking this plots intermediary plots of all the calculations performed within the method. Use with caution
- Verbose: checking this includes extra information during the calculation process. Use with caution
- Plots Results: plots the final calculation per pollutant and all relevant intermediary calculation. Default is to be Checked
- Print Stats: prints interesting statistics about the dataset for later use
Now, let's calculate the pollutants. We don't need to worry about the manufacturer data, since it will be automatically retrieved during the process.
CO results
The methodology used here is not the baseline model, but the application of this formula, as explained here:
As we can see, the initial data should not be considered because of the sensor stabilisation time. If we focus on the usable data, we can already get some insights into which hours are more polluted and the difference between working days and weekends.
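The comparison between hours of the day and between working days and weekends can be summarised as a table of hourly means. The sketch below runs on a synthetic hourly series with an artificial morning peak on working days; the values are made up and `hourly_profile` is a hypothetical helper, not part of the notebook.

```python
import numpy as np
import pandas as pd

def hourly_profile(series):
    """Mean value per hour of day, split into working days and weekends."""
    frame = pd.DataFrame({
        'value': series.values,
        'hour': np.asarray(series.index.hour),
        # dayofweek: Monday=0 ... Saturday=5, Sunday=6
        'day_type': np.where(series.index.dayofweek >= 5, 'weekend', 'working_day'),
    })
    return frame.groupby(['hour', 'day_type'])['value'].mean().unstack('day_type')

# Two synthetic weeks of hourly data, starting on a Monday.
index = pd.date_range('2018-09-03', periods=14 * 24, freq='60min')
rush_hour = (index.hour == 8) & (index.dayofweek < 5)
co = pd.Series(0.3 + np.where(rush_hour, 0.5, 0.0), index=index)

profile = hourly_profile(co)
print(profile.loc[[7, 8, 9]])
```

Any stabilisation period should be removed from the series before building such a profile, otherwise it will bias the hourly means.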
NO2 results
For this metric, we will be using the above-mentioned baseline methodology.
If we have a look at the data, we see that the sensor still requires stabilisation, like the CO electrode:
Zooming in, we can see the most polluted hours are those in the morning:
Warning
The OX sensor in this station is not giving good results (its ageing has probably caused a loss of sensitivity), and therefore the data will not be shown here.