New Delhi Weather Portfolio Project 3

This product is the final entry in the New Delhi weather and air quality analysis project that has been continuing throughout this course. In this project I will be walking through making some detailed changes in the previous projects based on feedback I received on the previous projects and also my own intuition. This report will include an analysis of a cross-validation done on the project and how effective it is at answering the research questions. In additon, I will be explaining how the findings from this project could be operationalized, and weigh the benefits and ramifications of doing so.

Just like always, we have to select the correct tools for the operation. These are the same packages as used in P02.

  include <- function(library_name){
    if( !(library_name %in% installed.packages()) )
      install.packages(library_name) 
    library(library_name, character.only=TRUE)
  }
include("tidyverse")
#include("caret")
#include("dplyr")
include("knitr")
#include("tidyr")

The first goal of this part of the project is the edits of the first and second parts of the project.

My classmate Asheela left several helpful issues in my repository for the first portion of the project: 1. Change the code for loading the dataset from using the entire filepath within my local directory to using the current directory as a reference. 2. Make the code in general more readable by using R markup syntax. She specifies line 54, but I’m generalizing this to the entire project. 3. Add visuals to help readers understand the data more intuitively

My classmate Andrew left an issue for the first portion of the project, about how the output filename in the yml header was incorrect. In addition, he wrote a peer review summary that offers many points I could improve on: 1. The two sources of data have been shrunk enormously by inner-joining them to produce the final model and remedying this would be very helpful for the accuracy of the model 2. Adding visuals could help with communicating details of the project to the reader 3. The concluding remarks are unfortunately vague

These two sources of feedback have helped drive further work on the prior section of the project.
To address Asheela’s advice,
1. I corrected the filepath as necessary in P01 2. I used R markdown syntax to bullet and indent the lists in P01 3. I added some visualizations to the end of P01 To address Andrew’s advice, 1. I was unable to remedy the lack of data with the inner join. I have some specilation about possible options in the conclusion of P02 but I was unable to carry out any of them. 2. I added visuals to project 1 that should be sufficient 3. I updated the conclusion of P02 to be a more robust and clear summary and explanation of possible further work

This advice was very helpful and has helped to make my project more robust.

Once again, we begin where we left off by passing in all of the data from P02 with all of the changes made to it from the previous deliverables. This ensures that no redundant work needs to be done to organize the data.

Project Data

This project resulted in the production of a table of readings of weather and air quality. The first few rows are shown below:

delhidata

## # A tibble: 64 x 21
## # Groups:   date [8]
##    datetime_utc        time  date       air_condition dew_point any_fog
##    <dttm>              <tim> <date>     <fct>             <int>   <int>
##  1 2001-04-01 05:30:00 05:30 2001-04-01 Haze                 14       0
##  2 2001-04-01 05:30:00 05:30 2001-04-01 Haze                 14       0
##  3 2001-04-01 05:30:00 05:30 2001-04-01 Haze                 14       0
##  4 2001-04-01 05:30:00 05:30 2001-04-01 Haze                 14       0
##  5 2001-04-01 05:30:00 05:30 2001-04-01 Haze                 14       0
##  6 2001-04-01 05:30:00 05:30 2001-04-01 Haze                 14       0
##  7 2001-04-01 05:30:00 05:30 2001-04-01 Haze                 14       0
##  8 2001-05-01 00:30:00 00:30 2001-05-01 Haze                 15       0
##  9 2001-05-01 00:30:00 00:30 2001-05-01 Haze                 15       0
## 10 2001-05-01 00:30:00 00:30 2001-05-01 Haze                 15       0
## # … with 54 more rows, and 15 more variables: heat_index <dbl>, humidity <fct>,
## #   air_pressure <dbl>, any_rain <int>, temperature <int>, any_thunder <int>,
## #   visibility <dbl>, wind_angle <int>, wind_direction <fct>,
## #   wind_gust_speed <dbl>, wind_chill <dbl>, wind_average_speed <dbl>,
## #   SO2 <dbl>, NO2 <dbl>, SPM <int>

Cross-Validation

The data analytics code below is commented out because of the errors it causes.

#data.frame( R2 = R2(predictions, test$temperature),
#            RMSE = RMSE(predictions, test$temperature),
#            MAE = MAE(predictions, test$temperature))

This code would have produced:

r^2, a measure of the variance between a prediction and the actual data Root Mean Squares Error, the standard deviation of the differences between the prediction and the actual data *Mean Absolute Error, a measure of the absolute difference between two continuous variables.

These data metrics would be a very useful tool for validating the strength of the model produced in this project. More than likely, the results would have looked quite poor, but they would have been helpful for understanding the specifics of the model.

Operationalization

The goal of this project was to explore trends in weather patterns in New Delhi in order to be able to use the model as a predictor for the weather. As the reader can see, undergoing that has been much more involved than it might sound. The primary application of this work would be to provide an accurate forecast of the weather. The second portion of the project adds data on pollutants, so this would ideally result in a predictor of the pollution in the air at any given time of year. Generally speaking, what this should produce is a model that can forecast the weather and pollution within the city.

Operationalizing this project would likely take the form of using the resultant predictive model as an engine for predicting the weather and air quality at any given time. Any online or television weather service could display the results of this model for the current time period to the public.

However there are some issues that arise from operationalizing the project this way:

The original data in the model that combines air quality and weather are very small. It’s possible that it could be salvaged with strategic imputation of the missing values, but this would have the same issue as the final data model in general of making a predictive model with very few data.
The model isn’t very accurate as a predictor. The results from the weather table in P02 were within the scope of correct results, but they were still over 20% off, which would be completely unacceptable as an actual forecast.
The weather data is taken from only one location in the city. New Delhi is a rather small city, with about half the geographic area of Chico, but it’s still possible for there to be bias in the weather reports by not cross-referencing them with another location.
The weather-prediction algorithms that are in use today are far beyond the complexity and accuracy of anything this project is capable of in its current state.

With all this in mind, this project has come very far towards its ultimate goals. It hasn’t produced the predictive data model that I hoped it would, but it still has a lot of potential to remedy that by other means. What this project shows is the strength of data science and its potential to help society. These kinds of projects have very clear positive potential impacts of implementation in society as a whole. When considering this with potential downsides, they can do quite a lot of good. Finally, through this project, I’ve learned a lot about data science and will use this knowledge and experience to use it in my data science career.

New Delhi Weather Portfolio Project 3

Olivia Lund

Project Data

Cross-Validation

Operationalization