# Korwe

Data, Science & Software

# City of Cape Town Open Data

###### Version 1.0

This is a preliminary report for questions and comments.

The data used in the report was obtained from the City of Cape Town Open Data Portal in January 2017 from [2].

The data in this report are not in any way certified to be correct and in no way does the author accept any loss or liability from damages using this report or data. Please contact the author [1] for any corrections or questions. It should be noted that the author is not an air quality expert.

We confined ourself to the pollution data at the following test stations

• Athlone
• Atlantis
• Bellville-South
• Bothasig
• City Hall
• Foreshore
• Goodwood
• Khayelitsha
• Molteno
• Plattekloof
• Potsdam
• SomersetWest
• Tableview
• Wallacedene

We used the following files (here we use standardized file names)

• Sulphur_dioxide_SO2_2014.ods
• Sulphur_dioxide_SO2_2013.ods
• Air quality August2015
• Air quality July2015
• Air quality June2015
• Air quality May2015

We globbed the various pollution readings together into one table as

• Date
• Time
• Suburb aka Test Station
• DataSet aka the specific file
• SO2 (ug/m3) - Sulphur Dioxide $$SO_2$$ { $$\dfrac { 350 \mu g} {m^3 hr}$$ or $$\dfrac{125 \mu g} {m^3 24 hrs}$$ }
• NO (ug/m3) - Nitrogen Oxide / Nitrogen Monoxide
• NO2 (ug/m3) - Nitrogen Dioxide $$NO_2$$ { $$\dfrac {200 \mu g} {m^3 1 hr}$$ }
• NOX (ug/m3) - Nitrogen Oxides
• Wind Dir (degrees)
• Wind Speed (m/s)
• Temperature (°C)
• Relative Humidity (%)
• Barometric Pressure (mb)
• O3 (ug/m3) - Ozone { $$\dfrac{120 \mu g } { m^3 8hr}$$ }
• PM10 (ug/m3) - Particulate Matter { $$\dfrac{50 \mu g } {m^3 24 hr}$$ or $$\dfrac{ 40 mg } {m^3 yr}$$}
• Benzene (ppb) {$$\dfrac{4 \mu g } { m^3 yr}$$}
• Toluene (ppb)
• M&P-xylene (ppb)
• O-xylene (ppb)
• Ethylbenzene (ppb)
• timestamp (time of last modification).

The data in chain brackets is nominally permissible levels. We discuss this later. Note the warning to their correctness.

Here ug/m3 is micro grams per cubic meter $$\dfrac {\mu g }{ m^{3} }$$, mb is millibars, ppb is parts per billion and lngy/m2 is Langley per square meter (which is 41 840 Joules/m² or 11.622 watt-hours per square meter $$\dfrac{W h}{m^2}$$ where $$\dfrac{1 W } {m^2}$$ = $$\dfrac{24 W h} {m^2 day}$$ .)

The work was done using Java, Postgres and R. The code (.java and report) is available on Bitbucket ([3], sign in required). However the data is too big and sits in csv files (the author will provide on request and make the data available via a API DB query. Currently there are 163 807 rows.)

### Data cleaning

We had to clean the data to make it usable. File names had to be standardized, .xls and .ods formats were changed to .csv.

Most of the pollution data is integer, however strange types occur such as in SO2 there is a row 26/10/2014 04:00;0.8;0;15;0;5;3;0;1;1;1. Here the float was replaced by 1.

Some samples contained “<Samp”, “RS232”, “Down”, “NoData”, “InVld”, “Calib” and other text labels which were replaced with “null”.

## Air Quality Index

The author used the calculation of the Air Quality Index of the United Kingdom$$^{[4]}$$ and Europe$$^{[5]}$$ to determine permissible levels because of their availability. Clearly the values are merely reference values, because for example, we do not account for the period of measurement (nor is that information available). The permissible levels for Cape Town South Africa, will differ. Presumably the SABS will have appropriate levels.

# Initial Investigation

This report should be understood as someone with a torch going into a dark forest to find what they can find.

## Atlantis

We begin our investigation by looking at the Atlantis testing station.

### Nitrogen Monoxide NO levels

We choose to look at Nitrogen Monoxide levels.

#### Data Summary

First we load the data then we visualise it.

Let us look at a summary of the data:

##       date                time                 no
##  Min.   :2013-01-01   Length:20449       Min.   : -9
##  1st Qu.:2013-08-02   Class :character   1st Qu.:  2
##  Median :2014-03-03   Mode  :character   Median :  6
##  Mean   :2014-03-20                      Mean   : 11
##  3rd Qu.:2014-10-02                      3rd Qu.: 13
##  Max.   :2015-08-31                      Max.   :159
##                                          NA's   :9467
## [1] 20449

We see that we have 9467 nulls or not available data. The data goes from January 2013 to August 2015, with big holes in it. We have to be really careful to treat null data, no data and zero data correctly.

There are 20 449 lines of data. To get some kind of concept on the size we have $$365 \times 2 \times 24 + (30 + 31 + 31 + 31) \times 24 = (365\times2 + 3\times31 + 30) \times 24 = 20 472$$ possible samples. The small discrepancy between 20449 and 20 472 should me investigated but is not material.

Doing the same count across all suburbs and testing stations: $$7 \times 365 \times 2 \times 24 + 14 \times (30 + 31 + 31 + 31) \times 24 = ( 365 + 3\times31 + 30) \times 24 \times 14 = 20 472$$

We look at 7 stations in the 2013 and 2014 data (14 in the 2015) which implies 163 968 rows which is not too different from 163 807 in the table.

#### Nitric Monoxide Atlantis data as histogram

Next we look at the histogram

The data is largely zero, and does not exceed 50. Unfortunately we have some negative data, which makes no sense.

#### Nitric Monoxide Atlantis data as a time series

We now look at the time series.

##### Nitric Monoxide in Atlantis with nulls

Initially we do a crude approximation of the x-axis as an index. We checked that the dates behaved as expectedly by sorting them and plotting them against the index. This is a naive plot to give us a feel for the data.

As mentioned unavailable data is simply not plotted - this is not merely data flagged as null, or “zero” data, but data that does not exist. This is remedied in the next plot.

##### Nitric Monoxide in Atlantis verses date

Then we plot the x axis as date & time.