For another assignment from Reproducible Research, I've prepared a self-contained analysis of a large and fairly dirty data set, as if it were to be delivered to a government decision maker.
By 'self-contained', I mean that every aspect, from start to finish, must be self-evident & reproducible.
As with my previous post, I would have preferred to tidy the data in R and build the visualizations with Tableau.
The graphical systems in R are robust, and I find them super convenient for quickly plotting the distribution of some data or spot-checking my work along the way. But R (in my opinion) is no place to produce "communication"-level visualizations.
Visualizations built in R are simply too rigid and static for communication purposes. The moment one reaches a "tidy" data set ready for analysis, it's time to switch to Tableau for communication.
If you would like to download my reproducible code to give it a run, click here.
The Impact of Large Weather Events
July 25, 2014
Synopsis
This analysis explores the NOAA Storm Database and answers basic questions about severe weather events. Specifically, this analysis will address the questions:
- Across the United States, which types of events (as indicated in the eventType variable) are most harmful with respect to population health?
- Across the United States, which types of events have the greatest economic consequences?
For reference, additional documentation of the data explains how the variables are constructed and defined (see References).
Recorded events begin in the year 1950 and end in November 2011. There are generally fewer events recorded in the earlier years, most likely due to a lack of good records. More recent years should be considered more complete.
There are 7 variables related to these questions:
- EVTYPE: Event Type (e.g. tornado, flood, etc.)
- FATALITIES: Number of fatalities
- INJURIES: Number of injuries
- PROPDMG: Property damage estimates, entered as actual dollar amounts
- PROPDMGEXP: Alphabetic codes signifying magnitude (“K” for thousands, “M” for millions, and “B” for billions)
- CROPDMG: Crop damage estimates, entered as actual dollar amounts
- CROPDMGEXP: Alphabetic codes signifying magnitude (“K” for thousands, “M” for millions, and “B” for billions), as illustrated below
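To illustrate how these magnitude codes work (illustrative values, not taken from the data set): a record with PROPDMG = 25 and PROPDMGEXP = "K" represents $25,000 of property damage. A minimal sketch of the translation:
## Illustration only: hypothetical values showing how the magnitude codes scale the amounts
propdmg    <- 25                              # damage amount as recorded
propdmgexp <- "K"                             # magnitude code: K = thousands
multiplier <- c(K = 1e3, M = 1e6, B = 1e9)    # code-to-dollar multipliers
propdmg * multiplier[propdmgexp]              # 25 * 1000 = 25000 dollars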
Data Processing
First, we download & process the raw data into a format suitable for analysis.
Computing Environment
Please set your own working directory. Be patient. The source data to download & load into memory is quite large!
## Set your own Working Directory
setwd("/Users/keithhelfrich/Dropbox/Coursera/2014.07 - JHU - Reproduceable Research/RepData_PeerAssessment2")
Download Source Data
Size = 47MB, please be patient!
sourcefile <- "https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2"
destfile <- "repdata-data-StormData.csv.bz2"
if(!file.exists(destfile)) {
    download.file(sourcefile, destfile = destfile, quiet = TRUE, method = "curl")
}
Install and Load Packages
packages <- c("ggplot2", "RColorBrewer", "plyr", "xtable")                   # packages req'd
is_installed <- function(mypkg) is.element(mypkg, installed.packages()[,1])  # function: check if installed
load_or_install <- function(packages) {                                      # function: load all packages
    for(package_name in packages) {                                          # install as needed
        if(!is_installed(package_name)) {
            install.packages(package_name, repos="http://lib.stat.cmu.edu/R/CRAN")
        }
        library(package_name, character.only=TRUE, quietly=TRUE, verbose=FALSE)
    }
}
load_or_install(packages)                                                    # install & load
Read Data
Be patient. The source data to load into memory is quite large!
## Read Data (Wait for it!)
stormData <- read.csv(bzfile(destfile)) # read source data
Format beginDate
## Convert BGN_DATE to a date format for analysis.
stormData$beginDate <- as.Date(strptime(stormData$BGN_DATE, format = "%m/%d/%Y %H:%M:%S"))
stormData$beginYear <- as.factor(format(stormData$beginDate,format="%Y"))
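As an optional spot check (not part of the original assignment), the parsed date range should match the documented span of 1950 through November 2011:
## Optional spot check: confirm the parsed dates span 1950 through November 2011
range(stormData$beginDate, na.rm = TRUE)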
Observations
Because fewer events were recorded in earlier years, this analysis will focus on only the most recent 10 years.
## Plot the number of observations per year, across the entire data set
barplot(table(stormData$beginYear), # plot no. of records per year
ylab="# records")
abline(v=63, lty=1, lwd=3, col="blue") # reference line marks 10 yrs
text(x=63,y=60000, labels = "analysis-->", # annotate analysis date range
pos=4, col="blue", cex=.7, offset=.1)
title("Focus: Recent 10 Years") # title the plot

Subset analysisData
This analysis will focus on the most recent 10 years: 2002 through 2011.
## subset to years 2002 thru 2011
analysisData <- stormData[format(stormData$beginDate,format="%Y") > 2001,] # subset years 2002 thru 2011
analysisData <- analysisData[,c("beginDate","beginYear","STATE","EVTYPE", # subset columns of interest
"FATALITIES","INJURIES","PROPDMG","PROPDMGEXP","CROPDMG", # including REMARKS & REFNUM
"CROPDMGEXP","REMARKS","REFNUM")] # for future reference
analysisData$beginYear <- factor(analysisData$beginYear) # re-level beginYear
analysisData$EVTYPE <- factor(analysisData$EVTYPE) # re-level EVTYPE
rm(stormData) # remove unnecessary data
gc() # free up memory
## used (Mb) gc trigger (Mb) max used (Mb)
## Ncells 882315 47.2 1476915 78.9 1144256 61.2
## Vcells 30814829 235.1 119226608 909.7 148940956 1136.4
Event Type (cleansing)
The field EVTYPE requires cleansing because the same type of event is often logged under slightly different names, some of them spelled incorrectly. For example, all hurricanes should be grouped together, whereas the raw data encodes each named hurricane as a separate type of event. A more thorough analysis would examine each event type individually, to decide whether it is encoded correctly or whether it should be consolidated with another group of events. Below are the cursory corrections made for this analysis.
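Before consolidating, it can help to see which raw labels dominate and how the near-duplicates are spelled. A minimal sketch (my addition, using only base R):
## Optional: inspect the 20 most frequent raw EVTYPE labels before consolidating
head(sort(table(analysisData$EVTYPE), decreasing = TRUE), 20)
## And list every raw label that mentions hurricanes, however it was spelled
grep("hurricane", levels(analysisData$EVTYPE), ignore.case = TRUE, value = TRUE)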
# Cleanup EVTYPE
analysisData$eventType <- as.character(analysisData$EVTYPE)
analysisData$eventType[grepl("tornad", analysisData$eventType, ignore.case = TRUE)] <- "TORNADO"
analysisData$eventType[grepl("thun.*orm", analysisData$eventType, ignore.case = TRUE)] <- "THUNDERSTORM"
analysisData$eventType[grepl("tstm", analysisData$eventType, ignore.case = TRUE)] <- "THUNDERSTORM"
analysisData$eventType[grepl("snow", analysisData$eventType, ignore.case = TRUE)] <- "SNOW"
analysisData$eventType[grepl("blizzard", analysisData$eventType, ignore.case = TRUE)] <- "BLIZZARD"
analysisData$eventType[grepl("hail", analysisData$eventType, ignore.case = TRUE)] <- "HAIL"
analysisData$eventType[grepl("rain", analysisData$eventType, ignore.case = TRUE)] <- "RAIN"
analysisData$eventType[grepl("precip", analysisData$eventType, ignore.case = TRUE)] <- "RAIN"
analysisData$eventType[grepl("hurricane", analysisData$eventType, ignore.case = TRUE)] <- "HURRICANE"
analysisData$eventType[grepl("tropical.*storm", analysisData$eventType, ignore.case = TRUE)] <- "TROPICALSTORM"
analysisData$eventType[grepl("flood", analysisData$eventType, ignore.case = TRUE)] <- "FLOOD"
analysisData$eventType[grepl("fire", analysisData$eventType, ignore.case = TRUE)] <- "FIRE"
analysisData$eventType[grepl("lightn", analysisData$eventType, ignore.case = TRUE)] <- "LIGHTNING"
analysisData$eventType[grepl("wind", analysisData$eventType, ignore.case = TRUE)] <- "WIND"
analysisData$eventType[grepl("cold", analysisData$eventType, ignore.case = TRUE)] <- "COLD"
analysisData$eventType[grepl("heat", analysisData$eventType, ignore.case = TRUE)] <- "HEAT"
analysisData$eventType[grepl("storm surge", analysisData$eventType, ignore.case = TRUE)] <- "STORM SURGE"
analysisData$eventType <- as.factor(analysisData$eventType)
Napa River Flood Correction
REMARKS in the Napa River Flood record from the year 2006 clearly indicate damages in the millions of dollars, yet the property damage multiplier value for PROPDMGEXP is mistakenly recorded as ‘B’ for billions. This inaccuracy significantly skews the analysis results, so let us make the manual correction.
## Correction to property damage multiplier for Napa River Flood
print(xtable(t(analysisData[analysisData$REFNUM == "605943",])), # print original values
type = "html", include.rownames = TRUE)
|            | 605953 |
|------------|--------|
| beginDate  | 2006-01-01 |
| beginYear  | 2006 |
| STATE      | CA |
| EVTYPE     | FLOOD |
| FATALITIES | 0 |
| INJURIES   | 0 |
| PROPDMG    | 115 |
| PROPDMGEXP | B |
| CROPDMG    | 32.5 |
| CROPDMGEXP | M |
| REMARKS    | Major flooding continued into the early hours of January 1st, before the Napa River finally fell below flood stage and the water receeded. Flooding was severe in Downtown Napa from the Napa Creek and the City and Parks Department was hit with $6 million in damage alone. The City of Napa had 600 homes with moderate damage, 150 damaged businesses with costs of at least $70 million. |
| REFNUM     | 605943 |
| eventType  | FLOOD |
analysisData[analysisData$REFNUM == 605943, "PROPDMGEXP"] <- "M" # manual data correction
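As an added sanity check (not part of the original correction), one might confirm the record now carries the ‘M’ code, and review any remaining billion-dollar property damage records for similar anomalies:
## Optional sanity checks: confirm the corrected multiplier code ...
analysisData[analysisData$REFNUM == 605943, c("PROPDMG", "PROPDMGEXP")]
## ... and review the remaining records coded "B" for billions, in case of similar errors
analysisData[analysisData$PROPDMGEXP == "B",
             c("beginDate", "eventType", "PROPDMG", "PROPDMGEXP", "REFNUM")]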
Fatalities & Injury (pre-processing)
To identify the most dangerous events by total count of bodily injury & death, a bit of data manipulation is required. This pre-processing code takes the following approach to tidy the data for analysis:
- Aggregate the Fatalities & Injuries by Year and Type of Event
- Create a new data frame with these values, named ‘personsAffected’
- For each type of event, also identify the total number of incidents, labeled ‘freq’
- Subset to the top 10 ‘mostDangerous’ events, as determined by ‘personCount’ of ‘personsAffected’
- Sort and re-level for tidiness
## Aggregate Fatalities & Injuries by Year and Type of Event
sumFatal <- aggregate(FATALITIES ~ eventType, # aggregate Fatalities
data = analysisData, sum) # by eventType
sumInjury <- aggregate(INJURIES ~ eventType, # aggregate Injuries
data = analysisData, sum) # by eventType
Identify the mostDangerous by personCount of personsAffected
personsAffected <- merge(sumFatal,sumInjury) # merge into 'personsAffected'
personsAffected <- transform(personsAffected,personCount = FATALITIES + INJURIES) # sum personCount (injured/killed)
personsAffected <- merge(personsAffected, # frequency of each event type
count(analysisData[analysisData$eventType %in% personsAffected$eventType,],
vars="eventType"))
mostDangerous <- personsAffected[order(-personsAffected[,"personCount"]), ][1:10,] # 10 worst by personCount
mostDangerous$eventType <- factor(mostDangerous$eventType) # re-level the eventType
Economic Consequences (pre-processing)
To identify the events which have caused the most property damage, yet more data manipulation is required. This pre-processing code takes the following approach:
- build lookupTables to translate PROPDMGEXP and CROPDMGEXP codes into a dollar ‘Multiplier’
- add these multipliers into the analysisData data set
- use these multipliers to calculate the ‘totalDamage’, in terms of economic dollar cost
## Build lookupTables
PROPDMGEXP <- levels(analysisData$PROPDMGEXP) # property damage levels
pMultiplier <- c(1,1,1,1,1,10,100,1000,10000,100000,1000000,10000000, # property damage multipliers
100000000,1000000000,100,100,1000,1000000,1000000)
propLookup <- data.frame(cbind(PROPDMGEXP,pMultiplier)) # propLookup table
propLookup$pMultiplier <- as.numeric(as.character(propLookup$pMultiplier)) # convert to numeric
CROPDMGEXP <- levels(analysisData$CROPDMGEXP) # crop damage levels
cMultiplier <- c(1,1,1,100,1000000000,1000,1000,1000000,1000000) # crop damage multipliers
cropLookup <- data.frame(cbind(CROPDMGEXP, cMultiplier)) # cropLookup table
cropLookup$cMultiplier <- as.numeric(as.character(cropLookup$cMultiplier)) # convert to numeric
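Because the multiplier vectors are typed by hand in the same order as the factor levels, a small defensive check (an addition of mine, not in the original assignment) guards against a length mismatch:
## Defensive check: exactly one multiplier per factor level
stopifnot(length(pMultiplier) == length(PROPDMGEXP),
          length(cMultiplier) == length(CROPDMGEXP))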
Merge lookupTables into analysisData
analysisData <- merge(analysisData,propLookup) # merge pMultiplier into data set
analysisData <- merge(analysisData,cropLookup) # merge cMultiplier into data set
analysisData <- analysisData[,c(4,3,5,12,13,6,7:11,14,2,15,1)] # reorder columns for tidiness
For Reference: The Lookup Tables
For reference, these are the dollar multipliers for PROPDMGEXP and CROPDMGEXP:
## For Reference, print the lookupTables
print(propLookup)
## PROPDMGEXP pMultiplier
## 1 1e+00
## 2 - 1e+00
## 3 ? 1e+00
## 4 + 1e+00
## 5 0 1e+00
## 6 1 1e+01
## 7 2 1e+02
## 8 3 1e+03
## 9 4 1e+04
## 10 5 1e+05
## 11 6 1e+06
## 12 7 1e+07
## 13 8 1e+08
## 14 B 1e+09
## 15 h 1e+02
## 16 H 1e+02
## 17 K 1e+03
## 18 m 1e+06
## 19 M 1e+06
print(cropLookup)
## CROPDMGEXP cMultiplier
## 1 1e+00
## 2 ? 1e+00
## 3 0 1e+00
## 4 2 1e+02
## 5 B 1e+09
## 6 k 1e+03
## 7 K 1e+03
## 8 m 1e+06
## 9 M 1e+06
Calculate Total Economic Damage
To reach the total economic damage, we use the formula (PROPDMG x pMultiplier) + (CROPDMG x cMultiplier):
## Calculate Total Economic Damage = (total property damage) + (total crop damage)
analysisData$totalDamage <- (analysisData$PROPDMG * analysisData$pMultiplier) + (analysisData$CROPDMG * analysisData$cMultiplier)
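As a worked example: the corrected Napa River flood record carries PROPDMG = 115 with an ‘M’ (millions) code and CROPDMG = 32.5, also in millions, so its total economic damage should come out to (115 x 1e6) + (32.5 x 1e6) = $147.5 million:
## Worked example: total damage for the corrected Napa River flood record
analysisData[analysisData$REFNUM == 605943, "totalDamage"]   # expect 147500000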
Identify mostCostly
When assessing total economic damage, we consider the top 20 types of events, ranked by total economic cost (damage to both property and crops). For completeness, we also consider the time dimension. To do so, this pre-processing code takes the following approach:
- Aggregate ‘totalDamage’ by Event Type
- Rank this list and keep the Top 20 as ‘mostCostly’
- Aggregate ‘totalDamage’ again, this time by both Event Type and Year
- Keep this summary as ‘damagesAgg’ and subset the values to only the 20 ‘mostCostly’
## Aggregate totalDamage by eventType & Year; rank to identify the top 20
mostCostly <- aggregate(totalDamage ~ eventType, # aggregate totalDamage
data = analysisData,sum) # by eventType only
mostCostly <- mostCostly[order(-mostCostly[,"totalDamage"]), ][1:20,] # top 20 mostCostly
mostCostly$eventType <- factor(mostCostly$eventType) # re-level the eventType
damagesAgg <- aggregate(totalDamage ~ eventType + beginYear, # aggregate totalDamage
data = analysisData,sum) # by eventType & Year
damagesAgg <- subset(damagesAgg, damagesAgg$eventType %in% mostCostly$eventType) # subset to Top 20
damagesAgg$eventType <- factor(damagesAgg$eventType) # re-level eventType factor
Tidy Structure
For reference: a look at the structure of the final tidy data for analysis
str(analysisData) # for reference: analysisData structure
## 'data.frame': 453730 obs. of 16 variables:
## $ beginYear : Factor w/ 10 levels "2002","2003",..: 1 1 1 1 3 1 1 3 1 3 ...
## $ beginDate : Date, format: "2002-07-22" "2002-08-16" ...
## $ STATE : Factor w/ 72 levels "AK","AL","AM",..: 26 42 5 19 40 48 41 41 5 51 ...
## $ REFNUM : num 484622 467741 449816 454144 548156 ...
## $ eventType : Factor w/ 72 levels "ABNORMALLY DRY",..: 54 26 18 68 18 26 26 67 68 54 ...
## $ EVTYPE : Factor w/ 121 levels "ABNORMALLY DRY",..: 69 41 28 117 28 41 41 25 117 98 ...
## $ FATALITIES : num 0 0 0 2 0 0 0 0 0 0 ...
## $ INJURIES : num 0 0 0 0 0 0 0 0 0 0 ...
## $ PROPDMG : num 0 0 0 0 0 0 0 0 0 0 ...
## $ CROPDMG : num 0 0 0 0 0 0 0 0 0 0 ...
## $ REMARKS : Factor w/ 436781 levels "","\t","\t\t",..: 24034 1 291464 40772 20322 1 1 56802 26749 327499 ...
## $ pMultiplier: num 1 1 1 1 1 1 1 1 1 1 ...
## $ PROPDMGEXP : Factor w/ 19 levels "","-","?","+",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ cMultiplier: num 1 1 1 1 1 1 1 1 1 1 ...
## $ CROPDMGEXP : Factor w/ 9 levels "","?","0","2",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ totalDamage: num 0 0 0 0 0 0 0 0 0 0 ...
Write analysisData
For safe keeping & future analysis, write the tidy analysisData to disk.
## Write the tidy analysisData to disk
write.table(analysisData,file="stormData_tidy.txt",row.names=FALSE,sep="|") # write to disk
tar("stormData_tidy.bz2", paste(getwd(),"/stormData_tidy.txt",sep=""),compression="bzip2") # compress
invisible(file.remove("stormData_tidy.txt")) # remove uncompressed
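To reload the tidy file later, extract the archive and read it back with the same "|" separator. A rough sketch (note: the tar call above stored the full path, so the extracted location may need adjusting):
## Sketch for reloading later: inspect, extract, then read with the matching "|" separator
untar("stormData_tidy.bz2", list = TRUE)                       # inspect stored path(s)
untar("stormData_tidy.bz2")                                    # extract
tidyData <- read.table("stormData_tidy.txt",                   # adjust path if needed
                       header = TRUE, sep = "|")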
Results
Fatalities and Injury
The ten worst types of weather event by bodily injury & death are shown below:
print(mostDangerous)
## eventType FATALITIES INJURIES personCount freq
## 55 TORNADO 1112 13588 14700 15168
## 29 HEAT 920 4019 4939 1610
## 54 THUNDERSTORM 227 2599 2826 154586
## 41 LIGHTNING 370 2250 2620 7899
## 18 FLOOD 789 820 1609 54700
## 36 HURRICANE 67 1291 1358 128
## 67 WIND 438 803 1241 19961
## 17 FIRE 76 1051 1127 3077
## 45 RIP CURRENT 340 208 548 431
## 26 HAIL 3 456 459 142701
The code to produce a scatter plot:
## Plot mostDangerous Events, sized by frequency
ggplot(mostDangerous, aes(x=FATALITIES, y=INJURIES, color=eventType)) +
geom_point(aes(size = freq), alpha=.5) +
geom_text(data = mostDangerous[mostDangerous$eventType == "TORNADO" ,],
aes(FATALITIES,INJURIES, label = eventType), hjust = 1.1, vjust = 1.1) +
geom_text(data = mostDangerous[mostDangerous$eventType == "FLOOD" ,],
aes(FATALITIES,INJURIES, label = eventType), hjust = 1, vjust = -1) +
geom_text(data = mostDangerous[mostDangerous$eventType == "HAIL" ,],
aes(FATALITIES,INJURIES, label = eventType), hjust = -.8, vjust = -.8) +
geom_text(data = mostDangerous[mostDangerous$eventType == "HEAT" ,],
aes(FATALITIES,INJURIES, label = eventType), hjust = 1, vjust = -1) +
geom_text(data = mostDangerous[mostDangerous$eventType == "HURRICANE" ,],
aes(FATALITIES,INJURIES, label = eventType), hjust = -1, vjust = .5) +
geom_text(data = mostDangerous[mostDangerous$eventType == "LIGHTNING" ,],
aes(FATALITIES,INJURIES, label = eventType), vjust = -.3) +
geom_text(data = mostDangerous[mostDangerous$eventType == "RIP CURRENT" ,],
aes(FATALITIES,INJURIES, label = eventType), hjust = .5, vjust = -.3) +
geom_text(data = mostDangerous[mostDangerous$eventType == "THUNDERSTORM" ,],
aes(FATALITIES,INJURIES, label = eventType), hjust = .5, vjust = -1) +
geom_text(data = mostDangerous[mostDangerous$eventType == "WIND" ,],
aes(FATALITIES,INJURIES, label = eventType), hjust = .5, vjust = -1) +
geom_text(data = mostDangerous[mostDangerous$eventType == "FIRE" ,],
aes(FATALITIES,INJURIES, label = eventType), hjust = -.3, vjust = .5) +
scale_colour_brewer(palette="Paired") +
scale_size_area(max_size=30) +
ggtitle("Weather Event Types Most \nDangerous to Population Health")

In the scatter plot above, where point size represents the frequency of each event type, we see that relatively infrequent events such as HURRICANE, HEAT, and RIP CURRENT are responsible for many deaths despite occurring far less often.
Other events that occur more frequently, like THUNDERSTORM and HAIL, are still in the top 10 but incur fewer fatalities. The tornado is in a class of its own, far outpacing the other event types in both recorded fatalities and injuries.
Economic Consequences
Code to produce the dot plot:
ggplot(damagesAgg, aes(x=beginYear, y=totalDamage, color=eventType))+
geom_point(aes(size = totalDamage), alpha = 0.5) +
scale_size_area(max_size=25) +
theme_bw() +
geom_text(data = subset(damagesAgg, damagesAgg$beginYear == "2005" &
damagesAgg$eventType == "HURRICANE"),
aes(beginYear,totalDamage, label = eventType), hjust = -.3, vjust = .5) +
geom_text(data = subset(damagesAgg, damagesAgg$beginYear == "2005" &
damagesAgg$eventType == "STORM SURGE"),
aes(beginYear,totalDamage, label = eventType), hjust = -.2, vjust = .5) +
geom_text(data = subset(damagesAgg, damagesAgg$beginYear == "2004" &
damagesAgg$eventType == "HURRICANE"),
aes(beginYear,totalDamage, label = eventType), hjust = -.1, vjust = .5) +
geom_text(data = subset(damagesAgg, damagesAgg$beginYear == "2011" &
damagesAgg$eventType == "TORNADO"),
aes(beginYear,totalDamage, label = eventType), hjust = 1.2, vjust = .5) +
geom_text(data = subset(damagesAgg, damagesAgg$beginYear == "2011" &
damagesAgg$eventType == "FLOOD"),
aes(beginYear,totalDamage, label = eventType), hjust = 1.2, vjust = .5) +
ggtitle("Weather Event Types Most \nCostly to Crops and Property") +
ylab("Total Economic Damage in Dollars") +
xlab("")

Shown above in a dot plot by year, with both the size and the Y-axis representing total economic cost, it quickly becomes apparent that large weather events in a handful of years drastically outpace all others in terms of economic consequence. For that reason, policy & preparedness efforts aimed at minimizing the damage caused by these types of large events will be worthwhile.
Word Count: 2,566
References
- "Reproducible Research", by Roger D. Peng, PhD, Jeff Leek, PhD, Brian Caffo, PhD, Coursera. July 11, 2014.
https://www.coursera.org/course/repdata - "A Few Simple Plots in R", by Keith Helfrich. July 29, 2014.
http://redheadedstepdata.io/a-few-simple-plots-in-R/ - "repData_project2.rmd - Reproducible Code", by Keith Helfrich. July 25, 2014.
https://www.dropbox.com/s/zci6m2x97zywkx6/repData_project2.rmd - "National Weather Service Storm Data Documentation". July 29, 2014
https://d396qusza40orc.cloudfront.net repdata/peer2_doc/pd01016005curr.pdf - "National Climatic Data Center Storm Events FAQ". July 29, 2014.
https://d396qusza40orc.cloudfront.net/repdata/peer2_doc/NCDC%20Storm%20Events-FAQ%20Page.pdf