This project is a continuation of the analysis of the New Delhi weather dataset from part 1. The purpose of this second part is to design a model that identifies relationships in the data and is capable of producing a meaningful or interesting result, and to build that model by training it with data. This project also introduces an additional data source: a spreadsheet of data collected by the National Environmental Engineering Research Institute of India on the smog in New Delhi during the year 2001. This additional data will lend deeper insight into the weather of the city by establishing the baseline air conditions. Being able to accurately predict the weather and air quality in a city is extremely useful for health reasons: with an accurate prediction of conditions, people can make better-informed decisions about when to exercise or leave the house, helping them avoid complications like heatstroke or respiratory illness.

As always with any good data analysis, the right tools have to be selected first. This project introduces several new packages. The most important addition is caret, which specializes in predictive modeling: it trains a machine learning model on one portion of the dataset and makes it easy to test that model on the rest.

  # Helper: installs a package if it is not already present, then attaches it
  include <- function(library_name){
    if( !(library_name %in% installed.packages()) )
      install.packages(library_name)
    library(library_name, character.only=TRUE)
  }
include("tidyverse")
include("caret")
include("dplyr")
include("knitr")
include("tidyr")

With the primary dataset at the ready, we can now put our new tools to use.

We begin where we left off last time, by loading all of the data from P01 with the changes made to it in the previous deliverable. This ensures that no redundant work needs to be done to organize the data.
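No loading code is reproduced here, so the snippet below is only a minimal sketch of what that step might look like, assuming the cleaned weather and dailyweather data frames from P01 were saved with saveRDS() at the end of that deliverable; the file names are hypothetical.

# Hypothetical file names; assumes the P01 objects were saved with saveRDS()
weather      <- readRDS("p01_weather_cleaned.rds")
dailyweather <- readRDS("p01_dailyweather_cleaned.rds")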

The database used previously in this project has a column describing air conditions; however, it is a categorical variable that can take any one of the values below:

levels(weather$air_condition)
##  [1] ""                              "Blowing Sand"                 
##  [3] "Clear"                         "Drizzle"                      
##  [5] "Fog"                           "Funnel Cloud"                 
##  [7] "Haze"                          "Heavy Fog"                    
##  [9] "Heavy Rain"                    "Heavy Thunderstorms and Rain" 
## [11] "Heavy Thunderstorms with Hail" "Light Drizzle"                
## [13] "Light Fog"                     "Light Freezing Rain"          
## [15] "Light Hail Showers"            "Light Haze"                   
## [17] "Light Rain"                    "Light Rain Showers"           
## [19] "Light Sandstorm"               "Light Thunderstorm"           
## [21] "Light Thunderstorms and Rain"  "Mist"                         
## [23] "Mostly Cloudy"                 "Overcast"                     
## [25] "Partial Fog"                   "Partly Cloudy"                
## [27] "Patches of Fog"                "Rain"                         
## [29] "Rain Showers"                  "Sandstorm"                    
## [31] "Scattered Clouds"              "Shallow Fog"                  
## [33] "Smoke"                         "Squalls"                      
## [35] "Thunderstorm"                  "Thunderstorms and Rain"       
## [37] "Thunderstorms with Hail"       "Unknown"                      
## [39] "Volcanic Ash"                  "Widespread Dust"

This is not nearly detailed enough for a meaningful prediction of conditions. To remedy this, we bring in another dataset focused specifically on air quality.

This spreadsheet was obtained from the National Environmental Engineering Research Institute of India and contains data detailing the smog in New Delhi during the year 2001. It was downloaded from https://data.gov.in/resources/location-wise-monthly-ambient-air-quality-delhi-year-2001.

smog <- read.csv(file="cpcb_dly_aq_delhi-2001.csv", header=TRUE, sep=",")
head(smog)
##   Stn.Code      Sampling.Date State City.Town.Village.Area
## 1       55  January - M012001 Delhi                  Delhi
## 2       55 February - M022001 Delhi                  Delhi
## 3       55    March - M032001 Delhi                  Delhi
## 4       55    April - M042001 Delhi                  Delhi
## 5       55      May - M052001 Delhi                  Delhi
## 6       55     June - M062001 Delhi                  Delhi
##                            Agency Type.of.Location  SO2  NO2 RSPM.PM10 SPM
## 1 Central Pollution Control Board  Industrial Area 16.3 35.9        NA 278
## 2 Central Pollution Control Board  Industrial Area 18.1 44.3        NA 367
## 3 Central Pollution Control Board  Industrial Area 17.7 35.1        NA 280
## 4 Central Pollution Control Board  Industrial Area 16.3 39.8        NA 342
## 5 Central Pollution Control Board  Industrial Area 16.5 33.6        NA 285
## 6 Central Pollution Control Board  Industrial Area 17.9 37.7        NA 193

The important parts of this dataset are:

Sampling.Date - the date when the sample was taken
SO2 - the air concentration of the pollutant sulfur dioxide (SO2) in micrograms per cubic meter
NO2 - the air concentration of the pollutant nitrogen dioxide (NO2) in micrograms per cubic meter
SPM - the air concentration of suspended particulate matter (SPM) in micrograms per cubic meter
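A quick summary of these three columns gives a rough sense of the concentration ranges before any cleaning. This is just a sanity-check sketch; its output is not reproduced here.

# Ranges of the three pollutant measurements (sanity check)
summary(smog[, c("SO2", "NO2", "SPM")])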

In addition, it contains the sourcing of each data point, detailing the agency, monitoring station, station code, and city. Because these values are the same for every row, we can go ahead and remove several of these unimportant columns.

# Drop the constant source columns along with RSPM.PM10, which is not used in this analysis
smog <- subset(smog, select = -c(Stn.Code, State, City.Town.Village.Area, Agency, Type.of.Location, RSPM.PM10))

Next, we can parse the sampling date into a proper date object to make it easier to work with in R.

# The format "%B - M%m%Y" matches strings like "January - M012001"
smog$Sampling.Date <- parse_date(as.character(smog$Sampling.Date), format = "%B - M%m%Y")

Finally, we can rename the date column to the more consistent naming convention we established with the database in the last report.

colnames(smog)[colnames(smog)=="Sampling.Date"] <- "sample_date"
colnames(smog)
## [1] "sample_date" "SO2"         "NO2"         "SPM"

Now the spreadsheet is tidied up and ready to be analyzed.

This portion of the project will focus on developing a model that is capable of predicting the SO2 concentration on a given day.

We have two significant collections of tidied data that are ready for analysis. To make them more useful, we can combine them together with a join operation.

delhidata = inner_join(dailyweather, smog, c("date" = "sample_date"))
head(delhidata)
## # A tibble: 6 x 21
## # Groups:   date [1]
##   datetime_utc        time  date       air_condition dew_point any_fog
##   <dttm>              <tim> <date>     <fct>             <int>   <int>
## 1 2001-04-01 05:30:00 05:30 2001-04-01 Haze                 14       0
## 2 2001-04-01 05:30:00 05:30 2001-04-01 Haze                 14       0
## 3 2001-04-01 05:30:00 05:30 2001-04-01 Haze                 14       0
## 4 2001-04-01 05:30:00 05:30 2001-04-01 Haze                 14       0
## 5 2001-04-01 05:30:00 05:30 2001-04-01 Haze                 14       0
## 6 2001-04-01 05:30:00 05:30 2001-04-01 Haze                 14       0
## # … with 15 more variables: heat_index <dbl>, humidity <fct>,
## #   air_pressure <dbl>, any_rain <int>, temperature <int>, any_thunder <int>,
## #   visibility <dbl>, wind_angle <int>, wind_direction <fct>,
## #   wind_gust_speed <dbl>, wind_chill <dbl>, wind_average_speed <dbl>,
## #   SO2 <dbl>, NO2 <dbl>, SPM <int>

Unfortunately, this reduces the two datasets, one of which had over one hundred thousand observations, to a mere 64 rows. The loss comes from the join itself: the smog data is monthly, so only weather observations falling on the first day of a month covered by the smog data find a match. This will still be adequate for the analysis, but could be a lot better.
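One way to keep more of the data, sketched below but not pursued here, would be to aggregate the daily weather to monthly means before joining, so that every month of smog data finds a match. The column names follow those shown in the joined output above; the object names are hypothetical.

# Aggregate daily weather to monthly means (sketch; not used in the analysis below)
monthly_weather <- dailyweather %>%
  mutate(month = as.Date(format(date, "%Y-%m-01"))) %>%
  group_by(month) %>%
  summarise(temperature        = mean(temperature,        na.rm = TRUE),
            heat_index         = mean(heat_index,         na.rm = TRUE),
            dew_point          = mean(dew_point,          na.rm = TRUE),
            air_pressure       = mean(air_pressure,       na.rm = TRUE),
            wind_average_speed = mean(wind_average_speed, na.rm = TRUE))
# delhidata_monthly <- inner_join(monthly_weather, smog, by = c("month" = "sample_date"))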

Before building a predictive model, the dataset has to be partitioned into a larger training set and a smaller testing set. For the purposes of this project using 75% of the data for the former and 25% for the latter will be adequate.

sample_selection <- createDataPartition(delhidata$SO2, p = 0.75, list = FALSE)
train <- delhidata[sample_selection, ]
test <- delhidata[-sample_selection, ]
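One caveat not addressed in the original code: createDataPartition draws a random sample, so the 75/25 split (and everything trained on it) will differ between runs unless a seed is fixed beforehand. A minimal sketch, with an arbitrary seed value:

set.seed(331)  # arbitrary value; run before createDataPartition() for a reproducible split
sample_selection <- createDataPartition(delhidata$SO2, p = 0.75, list = FALSE)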

Next, we can gain insight into the way one variable affects another by seeing how strongly they are related. In this case, the target variable is the SO2 concentration, and we fit a linear model of SO2 against a selection of the other variables; the further a coefficient's t value is from zero (and the smaller its p-value), the stronger the evidence of a relationship.
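Before fitting the model, a plain correlation matrix over the numeric columns gives a quick, rough view of these relationships. This is a minimal sketch whose output is not reproduced here; the column names are the ones shown in the joined data above.

# Pairwise correlations between SO2 and the numeric predictors (sketch)
train %>%
  ungroup() %>%
  select(SO2, heat_index, air_pressure, dew_point,
         wind_angle, wind_average_speed, NO2) %>%
  cor(use = "pairwise.complete.obs")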

submission_model <- lm(data=train, formula = delhidata$SO2 
                       ~ delhidata$heat_index
                       + delhidata$air_pressure
                       + delhidata$air_condition
                       + delhidata$dew_point
                       + delhidata$wind_angle
                       + delhidata$wind_average_speed 
                       + delhidata$NO2
                       )
summary(submission_model)
## 
## Call:
## lm(formula = delhidata$SO2 ~ delhidata$heat_index + delhidata$air_pressure + 
##     delhidata$air_condition + delhidata$dew_point + delhidata$wind_angle + 
##     delhidata$wind_average_speed + delhidata$NO2, data = train)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -6.3247 -3.2667 -0.0222  3.0711  6.5512 
## 
## Coefficients:
##                                Estimate Std. Error t value Pr(>|t|)
## (Intercept)                   10.318627 202.951927   0.051    0.960
## delhidata$heat_index          -1.542404   1.886993  -0.817    0.417
## delhidata$air_pressure         0.037187   0.209370   0.178    0.860
## delhidata$air_conditionSmoke   1.438199   1.832617   0.785    0.436
## delhidata$dew_point            0.550931   0.839552   0.656    0.514
## delhidata$wind_angle           0.006550   0.007799   0.840    0.405
## delhidata$wind_average_speed  -0.036327   0.144312  -0.252    0.802
## delhidata$NO2                 -0.044055   0.033437  -1.318    0.193
## 
## Residual standard error: 3.954 on 56 degrees of freedom
## Multiple R-squared:  0.09889,    Adjusted R-squared:  -0.01375 
## F-statistic: 0.878 on 7 and 56 DF,  p-value: 0.5297

As we can see from these results, none of the variables show a particularly strong relationship with SO2; every coefficient has a p-value well above 0.05. It's worth noting that the Smoke air condition has a positive coefficient and the heat index has a negative one. It stands to reason intuitively that an air condition of smoke would come with a higher concentration of SO2 than other air conditions.

Now, we can use the predictive model we trained on 75% of the data earlier to make a prediction for the SO2 concentration of the other 25%. The first 10 of these predictions have been printed out.

predictions <- submission_model %>% predict(test)
## Warning: 'newdata' had 15 rows but variables found have 64 rows
head(predictions, n=10)
##        1        2        3        4        5        6        7        8 
## 12.29697 12.46438 13.16485 13.19569 12.16481 11.15596 11.38504 11.36027 
##        9       10 
## 11.92416 11.15321
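The warning above points to a problem with how the model was specified: because the formula refers to the columns as delhidata$..., the data = train argument is effectively ignored, the model is fit on all 64 rows, and predict(test) returns fitted values for the full dataset rather than predictions for the 15 held-out rows. A corrected sketch (not re-run here) would use bare column names so the train/test split is respected:

# Same model with bare column names, so data = train and predict(newdata = test)
# behave as intended. Any factor level present in test (e.g. air_condition)
# must also appear in train for predict() to succeed.
corrected_model <- lm(SO2 ~ heat_index + air_pressure + air_condition
                      + dew_point + wind_angle + wind_average_speed + NO2,
                      data = train)
test_predictions <- predict(corrected_model, newdata = test)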

These predictions all fall well within the normal range of the data, which suggests the model is at least producing plausible values. The actual values being predicted can be seen here:

head(delhidata$SO2, n=10)
##  [1] 16.3 17.5 10.1 11.6  7.3 11.8 11.5 16.5 10.1 17.1

From this we can see that the model is far from a perfect predictor and does not even seem to follow the same trends as the original data; however, the results are still within a reasonable range of error.
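If the corrected sketch above were used, so that the predictions line up with the held-out rows, caret makes it easy to quantify this error rather than judging it by eye; the object names below come from that sketch.

# RMSE, R-squared, and MAE on the held-out rows (assumes the corrected sketch above)
postResample(pred = test_predictions, obs = test$SO2)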

This portion of the project demonstrates very well the kinds of analyses that can be done with a data science project. Machine learning and predictive analysis are very powerful tools for many types of computing work, and data science is far from an exception.

The biggest hurdle in getting this project to work was joining the two sets of data on time. R was very finicky about this, and it took some help from both Dr. Edward Roualdes and Dr. Robin Donnatello to figure it out and finally get the join to work.

Unfortunately for the context of this assignment, the data prediction model produced isn't very good. It lacks the large variety of data required to make good predictions. Alternative strategies for redoing this work could include keeping more of the joined data, for example by matching the monthly smog samples against monthly weather summaries (as sketched earlier) rather than against single days.

Regardless of these shortcomings, this project still achieves the most important parts of its goal: the application of machine learning to this data science project and the use of an additional dataset to provide a more robust understanding of the subject.