Large Weather Events

For another assignment from Reproducible Research, I've prepared a self-contained analysis of a large and fairly dirty data set, as if it were to be delivered to a government decision maker.

By 'self-contained', we mean that every aspect, from start to finish, must be self-evident & reproducible.

As with my previous post, I would have preferred to tidy the data in R and build the visualizations with Tableau.

The graphical systems in R are robust, and I find them super convenient for quickly plotting the distribution of some data or spot-checking my work along the way. But R (in my opinion) is no place to produce "communication" level visualizations.

Visualizations built in R are simply too rigid and static for communication purposes. Once the data set is tidy and ready for analysis, it's time to switch to Tableau for communicating the results.

If you would like to download my reproducible code to give it a run, click here.

Synopsis

This analysis explores the NOAA Storm Database and answers basic questions about severe weather events. Specifically, it addresses the following questions:

  • Across the United States, which types of events (as indicated in the eventType variable) are most harmful with respect to population health?
  • Across the United States, which types of events have the greatest economic consequences?

For reference, additional documentation of the data exists which explains how the variables are constructed/defined.

Recorded events begin in the year 1950 and end in November 2011. There are generally fewer events recorded in the earlier years, most likely due to a lack of good records. More recent years should be considered more complete.

There are 7 variables related to these questions:

  • EVTYPE: Event Type (e.g. tornado, flood, etc.)
  • FATALITIES: Number of fatalities
  • INJURIES: Number of injuries
  • PROPDMG: Property damage estimates, entered as actual dollar amounts
  • PROPDMGEXP: Alphabetic code signifying magnitude ("K" for thousands, "M" for millions, "B" for billions)
  • CROPDMG: Crop damage estimates, entered as actual dollar amounts
  • CROPDMGEXP: Alphabetic code signifying magnitude ("K" for thousands, "M" for millions, "B" for billions); a brief example of how these codes combine with the damage amounts follows this list
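As a brief, hypothetical illustration (the values below are not taken from the data set): a PROPDMG of 25 combined with a PROPDMGEXP of "K" represents $25,000 in property damage.

## Hypothetical example: a PROPDMG of 25 with PROPDMGEXP "K" means $25,000
propdmg    <- 25                                    # hypothetical damage amount
propdmgexp <- "K"                                   # hypothetical magnitude code
multiplier <- c(K = 1e3, M = 1e6, B = 1e9)          # thousands, millions, billions
unname(propdmg * multiplier[propdmgexp])            # 25000

The full lookup tables used by the analysis are constructed further below, under Economic Consequences (pre-processing).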

Data Processing

First, we download and process the raw data into a format suitable for analysis.

Computing Environment

Please set your own working directory. Be patient. The source data to download & load into memory is quite large!

  
## Set your own Working Directory
setwd("/Users/keithhelfrich/Dropbox/Coursera/2014.07 - JHU - Reproduceable Research/RepData_PeerAssessment2")

## Download Source Data
## Size = 47MB, please be patient!
sourcefile <- "https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2"  
destfile   <- "repdata-data-StormData.csv.bz2"  
if(!file.exists(destfile)) {  
    download.file(sourcefile, destfile=destfile, quiet=TRUE, method="curl") }

## Install and Load Packages
packages <- c("ggplot2","RColorBrewer", "plyr", "xtable")                   # packages req'd

is_installed <- function(mypkg) is.element(mypkg, installed.packages()[,1]) # function: check if installed  
load_or_install <- function(packages) {                                     # function: load all packages  
  for(package_name in packages) {                                           #           install as needed
    if(!is_installed(package_name)) {
      install.packages(package_name, repos="http://lib.stat.cmu.edu/R/CRAN") }
    library(package_name, character.only=TRUE, quietly=TRUE, verbose=FALSE) }
}

load_or_install(packages)                                                   # install & load

Read Data

Be patient. The source data to load into memory is quite large!

  
## Read Data (Wait for it!)
stormData <- read.csv(bzfile(destfile))                                     # read source data

Format beginDate

  
## Convert BGN_DATE to a date format for analysis.
stormData$beginDate <- as.Date(strptime(stormData$BGN_DATE, format = "%m/%d/%Y %H:%M:%S"))  
stormData$beginYear <- as.factor(format(stormData$beginDate,format="%Y"))

Observations

Because fewer events were recorded in earlier years, this analysis will focus on only the most recent 10 years.

  
## Plot the number of observations per year, across the entire data set
barplot(table(stormData$beginYear),                                         # plot no. of records per year  
    ylab="# records")
abline(v=63, lty=1, lwd=3, col="blue")                                      # reference line marks 10 yrs  
text(x=63,y=60000, labels = "analysis-->",                                     # annotate analysis date range  
     pos=4, col="blue", cex=.7, offset=.1)
title("Focus: Recent 10 Years")                                                # title the plot

Subset analysisData

This analysis will focus on the most recent 10 years: 2002 through 2011.

  
## subset to years 2002 thru 2011
analysisData <- stormData[format(stormData$beginDate,format="%Y") > 2001,]  # subset years 2002 thru 2011  
analysisData <- analysisData[,c("beginDate","beginYear","STATE","EVTYPE",   # subset columns of interest  
    "FATALITIES","INJURIES","PROPDMG","PROPDMGEXP","CROPDMG",                    # including REMARKS & REFNUM
    "CROPDMGEXP","REMARKS","REFNUM")]                                               # for future reference
analysisData$beginYear <- factor(analysisData$beginYear)                    # re-level beginYear  
analysisData$EVTYPE <- factor(analysisData$EVTYPE)                          # re-level EVTYPE

rm(stormData)                                                               # remove unnecessary data  
gc()                                                                        # free up memory
##            used  (Mb) gc trigger  (Mb)  max used   (Mb)
## Ncells   882315  47.2    1476915  78.9   1144256   61.2
## Vcells 30814829 235.1  119226608 909.7 148940956 1136.4

Event Type (cleansing)

The field EVTYPE requires cleansing because the same type of event is often logged under slightly different names, some names are spelled incorrectly, and so on. For example, all hurricanes should be grouped together, whereas the raw data encodes each named hurricane as a separate type of event. A more thorough analysis would examine each event type individually to decide whether it is encoded correctly or should be consolidated with another group of events. Below are the cursory corrections made for this analysis.

  
# Cleanup EVTYPE
analysisData$eventType <- as.character(analysisData$EVTYPE)  
analysisData$eventType[grepl("tornad", analysisData$eventType, ignore.case = TRUE)]          <- "TORNADO"  
analysisData$eventType[grepl("thun.*orm", analysisData$eventType, ignore.case = TRUE)]       <- "THUNDERSTORM"  
analysisData$eventType[grepl("tstm", analysisData$eventType, ignore.case = TRUE)]            <- "THUNDERSTORM"  
analysisData$eventType[grepl("snow", analysisData$eventType, ignore.case = TRUE)]            <- "SNOW"  
analysisData$eventType[grepl("blizzard", analysisData$eventType, ignore.case = TRUE)]        <- "BLIZZARD"  
analysisData$eventType[grepl("hail", analysisData$eventType, ignore.case = TRUE)]            <- "HAIL"  
analysisData$eventType[grepl("rain", analysisData$eventType, ignore.case = TRUE)]            <- "RAIN"  
analysisData$eventType[grepl("precip", analysisData$eventType, ignore.case = TRUE)]          <- "RAIN"  
analysisData$eventType[grepl("hurricane", analysisData$eventType, ignore.case = TRUE)]       <- "HURRICANE"  
analysisData$eventType[grepl("tropical.*storm", analysisData$eventType, ignore.case = TRUE)] <- "TROPICALSTORM"  
analysisData$eventType[grepl("flood", analysisData$eventType, ignore.case = TRUE)]           <- "FLOOD"  
analysisData$eventType[grepl("fire", analysisData$eventType, ignore.case = TRUE)]            <- "FIRE"  
analysisData$eventType[grepl("lightn", analysisData$eventType, ignore.case = TRUE)]          <- "LIGHTNING"  
analysisData$eventType[grepl("wind", analysisData$eventType, ignore.case = TRUE)]            <- "WIND"  
analysisData$eventType[grepl("cold", analysisData$eventType, ignore.case = TRUE)]            <- "COLD"  
analysisData$eventType[grepl("heat", analysisData$eventType, ignore.case = TRUE)]            <- "HEAT"  
analysisData$eventType[grepl("storm surge", analysisData$eventType, ignore.case = TRUE)]     <- "STORM SURGE"  
analysisData$eventType <- as.factor(analysisData$eventType)
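
As a quick sanity check on the consolidation (a minimal sketch, assuming the cleansing code above has just run), compare the number of distinct raw EVTYPE values against the number of consolidated eventType values:

## Sanity check: fewer consolidated event types than raw EVTYPE values
length(levels(analysisData$EVTYPE))                 # distinct raw event type names
length(levels(analysisData$eventType))              # distinct consolidated names

In the final structure shown later under Tidy Structure, the 121 raw EVTYPE levels collapse into 72 consolidated eventType levels.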

Napa River Flood Correction

The REMARKS in the Napa River Flood record from 2006 clearly indicate damages in the millions of dollars, yet the property damage multiplier PROPDMGEXP is mistakenly recorded as 'B' for billions. This inaccuracy significantly skews the analysis results, so we make a manual correction.

  
## Correction to property damage multiplier for Napa River Flood 
print(xtable(t(analysisData[analysisData$REFNUM == 605943,])),              # print original values  
      type = "html", include.rownames = TRUE)                            
             605953
beginDate    2006-01-01
beginYear    2006
STATE        CA
EVTYPE       FLOOD
FATALITIES   0
INJURIES     0
PROPDMG      115
PROPDMGEXP   B
CROPDMG      32.5
CROPDMGEXP   M
REMARKS      Major flooding continued into the early hours of January 1st, before the Napa River finally fell below flood stage and the water receeded. Flooding was severe in Downtown Napa from the Napa Creek and the City and Parks Department was hit with $6 million in damage alone. The City of Napa had 600 homes with moderate damage, 150 damaged businesses with costs of at least $70 million.
REFNUM       605943
eventType    FLOOD
  
analysisData[analysisData$REFNUM == 605943, "PROPDMGEXP"] <- "M"                    # manual data correction
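
To confirm the correction took effect (a quick check, not part of the original write-up), inspect the affected fields once more:

## Verify the corrected multiplier for the Napa River Flood record
analysisData[analysisData$REFNUM == 605943, c("PROPDMG","PROPDMGEXP")]      # expect PROPDMGEXP of "M"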

Fatalities & Injuries (pre-processing)

To identify the most dangerous events by total count of bodily injury & death, a bit of data manipulation is required. This pre-processing code takes the following approach to tidy the data for analysis:

  • Aggregate the Fatalities & Injuries by Year and Type of Event
  • Create a new data frame with these values, named ‘personsAffected’
  • For each type of event, also identify the total number of incidents, labeled ‘freq’
  • Subset to the top 10 ‘mostDangerous’ events, as determined by ‘personCount’ of ‘personsAffected’
  • Sort and re-level for tidiness
  
## Aggregate Fatalities & Injuries by Year and Type of Event
sumFatal  <- aggregate(FATALITIES ~ eventType,                                      # aggregate Fatalities  
    data = analysisData, sum)                                                            # by eventType
sumInjury <- aggregate(INJURIES ~ eventType,                                        # aggregate Injuries  
    data = analysisData, sum)                                                            # by eventType

## Identify the mostDangerous by personCount of personsAffected
personsAffected <- merge(sumFatal,sumInjury)                                        # merge into 'personsAffected'  
personsAffected <- transform(personsAffected,personCount = FATALITIES + INJURIES)   # sum personCount (injured/killed)  
personsAffected <- merge(personsAffected,                                           # frequency of each event type  
    count(analysisData[analysisData$eventType %in% personsAffected$eventType,], 
    vars="eventType"))
mostDangerous <- personsAffected[order(-personsAffected[,"personCount"]), ][1:10,]  # 10 worst by personCount  
mostDangerous$eventType <- factor(mostDangerous$eventType)                          # re-level the eventType

Economic Consequences (pre-processing)

To identify the events which have caused the most economic damage, yet more data manipulation is required. This pre-processing code takes the following approach:

  • build lookupTables to translate PROPDMGEXP and CROPDMGEXP codes into a dollar ‘Multiplier’
  • add these multipliers into the analysisData data set
  • use these multipliers to calculate the ‘totalDamage’, in terms of economic dollar cost
  
## Build lookupTables
PROPDMGEXP  <- levels(analysisData$PROPDMGEXP)                          # property damage levels  
pMultiplier <- c(1,1,1,1,1,10,100,1000,10000,100000,1000000,10000000,   # property damage multipliers  
                 100000000,1000000000,100,100,1000,1000000,1000000)
propLookup  <- data.frame(cbind(PROPDMGEXP,pMultiplier))                # propLookup table  
propLookup$pMultiplier <- as.numeric(as.character(propLookup$pMultiplier))     # convert to numeric

CROPDMGEXP  <- levels(analysisData$CROPDMGEXP)                          # crop damage levels  
cMultiplier <- c(1,1,1,100,1000000000,1000,1000,1000000,1000000)        # crop damage multipliers  
cropLookup  <- data.frame(cbind(CROPDMGEXP, cMultiplier))                # cropLookup table  
cropLookup$cMultiplier <- as.numeric(as.character(cropLookup$cMultiplier))     # convert to numeric

## Merge lookupTables into analysisData
analysisData <- merge(analysisData,propLookup)                          # merge pMultiplier into data set  
analysisData <- merge(analysisData,cropLookup)                          # merge cMultiplier into data set  
analysisData <- analysisData[,c(4,3,5,12,13,6,7:11,14,2,15,1)]          # reorder columns for tidiness
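
As a defensive check (a small sketch, not part of the original workflow), confirm that every record received a numeric multiplier after the merges:

## Sanity check: every record should carry both multipliers
stopifnot(!any(is.na(analysisData$pMultiplier)),                            # property multiplier present
          !any(is.na(analysisData$cMultiplier)))                            # crop multiplier present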

For Reference: The Lookup Tables

For reference, these are the dollar multipliers for PROPDMGEXP and CROPDMGEXP:

  
## For Reference, print the lookupTables
print(propLookup)
##    PROPDMGEXP pMultiplier
## 1                   1e+00
## 2           -       1e+00
## 3           ?       1e+00
## 4           +       1e+00
## 5           0       1e+00
## 6           1       1e+01
## 7           2       1e+02
## 8           3       1e+03
## 9           4       1e+04
## 10          5       1e+05
## 11          6       1e+06
## 12          7       1e+07
## 13          8       1e+08
## 14          B       1e+09
## 15          h       1e+02
## 16          H       1e+02
## 17          K       1e+03
## 18          m       1e+06
## 19          M       1e+06
  
print(cropLookup)
##   CROPDMGEXP cMultiplier
## 1                  1e+00
## 2          ?       1e+00
## 3          0       1e+00
## 4          2       1e+02
## 5          B       1e+09
## 6          k       1e+03
## 7          K       1e+03
## 8          m       1e+06
## 9          M       1e+06

Calculate Total Economic Damage

To reach the total economic damage, we use the formula (PROPDMG x pMultiplier) + (CROPDMG x cMultiplier):

  
## Calculate Total Economic Damage = (total property damage) + (total crop damage)
analysisData$totalDamage <- (analysisData$PROPDMG * analysisData$pMultiplier) + (analysisData$CROPDMG * analysisData$cMultiplier)
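
As a worked example, the corrected Napa River Flood record (REFNUM 605943) carries PROPDMG = 115 and CROPDMG = 32.5, each with a multiplier of one million, so its total economic damage is (115 x 1,000,000) + (32.5 x 1,000,000) = $147,500,000:

## Worked example: total damage for the corrected Napa River Flood record
analysisData[analysisData$REFNUM == 605943, "totalDamage"]                  # expect 147500000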

Identify mostCostly

When assessing total economic damage, we consider the top 20 types of events ranked by total economic cost (damage to both property and crops). For completeness, we also consider the time dimension. To do so, this pre-processing code takes the following approach:

  • Aggregate ‘totalDamage’ by Event Type
  • Rank this list and keep the Top 20 as ‘mostCostly’
  • Aggregate ‘totalDamage’ again, this time by both Event Type and Year
  • Keep this summary as ‘damagesAgg’ and subset the values to only the 20 ‘mostCostly’
  
## Aggregate totalDamage by eventType & Year, ranks to identify the top 20
mostCostly <- aggregate(totalDamage ~ eventType,                             # aggregate totalDamage  
    data = analysisData,sum)                                                    # by eventType only

mostCostly <- mostCostly[order(-mostCostly[,"totalDamage"]), ][1:20,]        # top 20 mostCostly

mostCostly$eventType <- factor(mostCostly$eventType)                            # re-level the eventType

damagesAgg <- aggregate(totalDamage ~ eventType + beginYear,                 # aggregate totalDamage  
    data = analysisData,sum)                                                    # by eventType & Year
damagesAgg <- subset(damagesAgg, damagesAgg$eventType %in% mostCostly$eventType)  # subset to Top 20  
damagesAgg$eventType <- factor(damagesAgg$eventType)                              # re-level eventType factor

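For a quick look at the rankings themselves (a small sketch; output omitted here), one could print the first few rows of ‘mostCostly’ and of the per-year aggregate ‘damagesAgg’:

## Optional: peek at the top of the economic damage rankings
head(mostCostly[order(-mostCostly$totalDamage), ])                          # worst event types overall
head(damagesAgg[order(-damagesAgg$totalDamage), ])                          # worst eventType/year combinations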

Tidy Structure

For reference: a look at the structure of the final tidy data for analysis

  
str(analysisData)                                                           # for reference: analysisData structure
## 'data.frame':    453730 obs. of  16 variables:
##  $ beginYear  : Factor w/ 10 levels "2002","2003",..: 1 1 1 1 3 1 1 3 1 3 ...
##  $ beginDate  : Date, format: "2002-07-22" "2002-08-16" ...
##  $ STATE      : Factor w/ 72 levels "AK","AL","AM",..: 26 42 5 19 40 48 41 41 5 51 ...
##  $ REFNUM     : num  484622 467741 449816 454144 548156 ...
##  $ eventType  : Factor w/ 72 levels "ABNORMALLY DRY",..: 54 26 18 68 18 26 26 67 68 54 ...
##  $ EVTYPE     : Factor w/ 121 levels "ABNORMALLY DRY",..: 69 41 28 117 28 41 41 25 117 98 ...
##  $ FATALITIES : num  0 0 0 2 0 0 0 0 0 0 ...
##  $ INJURIES   : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ PROPDMG    : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ CROPDMG    : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ REMARKS    : Factor w/ 436781 levels "","\t","\t\t",..: 24034 1 291464 40772 20322 1 1 56802 26749 327499 ...
##  $ pMultiplier: num  1 1 1 1 1 1 1 1 1 1 ...
##  $ PROPDMGEXP : Factor w/ 19 levels "","-","?","+",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ cMultiplier: num  1 1 1 1 1 1 1 1 1 1 ...
##  $ CROPDMGEXP : Factor w/ 9 levels "","?","0","2",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ totalDamage: num  0 0 0 0 0 0 0 0 0 0 ...

Write analysisData

For safekeeping & future analysis, write the tidy analysisData to disk.

  
## Write the tidy analysisData to disk
write.table(analysisData,file="stormData_tidy.txt",row.names=FALSE,sep="|")                  # write to disk  
tar("stormData_tidy.bz2", paste(getwd(),"/stormData_tidy.txt",sep=""),compression="bzip2")   # compress  
invisible(file.remove("stormData_tidy.txt"))                                                 # remove uncompressed
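
Should the tidy data be needed for a later analysis (a sketch, assuming the archive written above), it can be extracted and read back in with the same pipe separator:

## Later: restore the tidy data from the compressed archive
untar("stormData_tidy.bz2", exdir = "tidy_extract")                         # extract the archive
tidyFile <- list.files("tidy_extract", pattern = "stormData_tidy.txt$",     # locate the extracted file
                       recursive = TRUE, full.names = TRUE)
tidyData <- read.table(tidyFile, header = TRUE, sep = "|",                  # pipe-separated, with header
                       stringsAsFactors = FALSE, comment.char = "")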


Results

Fatalities and Injuries

The ten worst types of weather event by bodily injury & death are shown below:

  
print(mostDangerous)
##       eventType FATALITIES INJURIES personCount   freq
## 55      TORNADO       1112    13588       14700  15168
## 29         HEAT        920     4019        4939   1610
## 54 THUNDERSTORM        227     2599        2826 154586
## 41    LIGHTNING        370     2250        2620   7899
## 18        FLOOD        789      820        1609  54700
## 36    HURRICANE         67     1291        1358    128
## 67         WIND        438      803        1241  19961
## 17         FIRE         76     1051        1127   3077
## 45  RIP CURRENT        340      208         548    431
## 26         HAIL          3      456         459 142701

The code to produce a scatter plot:

  
## Plot mostDangerous Events, sized by frequency
ggplot(mostDangerous, aes(x=FATALITIES, y=INJURIES, color=eventType)) +  
    geom_point(aes(size = freq), alpha=.5) +
    geom_text(data = mostDangerous[mostDangerous$eventType == "TORNADO" ,], 
              aes(FATALITIES,INJURIES, label = eventType), hjust = 1.1, vjust = 1.1) +
    geom_text(data = mostDangerous[mostDangerous$eventType == "FLOOD" ,], 
              aes(FATALITIES,INJURIES, label = eventType), hjust = 1, vjust = -1) +
    geom_text(data = mostDangerous[mostDangerous$eventType == "HAIL" ,], 
              aes(FATALITIES,INJURIES, label = eventType), hjust = -.8, vjust = -.8) +
    geom_text(data = mostDangerous[mostDangerous$eventType == "HEAT" ,], 
              aes(FATALITIES,INJURIES, label = eventType), hjust = 1, vjust = -1) +
    geom_text(data = mostDangerous[mostDangerous$eventType == "HURRICANE" ,], 
              aes(FATALITIES,INJURIES, label = eventType), hjust = -1, vjust = .5) +
    geom_text(data = mostDangerous[mostDangerous$eventType == "LIGHTNING" ,], 
              aes(FATALITIES,INJURIES, label = eventType), vjust = -.3) +
    geom_text(data = mostDangerous[mostDangerous$eventType == "RIP CURRENT" ,], 
              aes(FATALITIES,INJURIES, label = eventType), hjust = .5, vjust = -.3) +
    geom_text(data = mostDangerous[mostDangerous$eventType == "THUNDERSTORM" ,], 
              aes(FATALITIES,INJURIES, label = eventType), hjust = .5, vjust = -1) +
    geom_text(data = mostDangerous[mostDangerous$eventType == "WIND" ,], 
              aes(FATALITIES,INJURIES, label = eventType), hjust = .5, vjust = -1) +
    geom_text(data = mostDangerous[mostDangerous$eventType == "FIRE" ,], 
              aes(FATALITIES,INJURIES, label = eventType), hjust = -.3, vjust = .5) +
    scale_colour_brewer(palette="Paired")   +
    scale_size_area(max_size=30) +
    ggtitle("Weather Event Types Most \nDangerous to Population Health")

As illustrated in the scatter plot above, where the size of the point represents the frequency of that event type, relatively infrequent events such as HURRICANE, HEAT, and RIP CURRENT are responsible for many deaths and injuries despite occurring far less often.

Other events which occur more frequently, like THUNDERSTORM and HAIL, are still in the top 10 but incur fewer fatalities. The tornado is in a class of its own, far outpacing the other event types in recorded fatalities and injuries.

Economic Consequences

Code to produce the dot plot:

  
ggplot(damagesAgg, aes(x=beginYear, y=totalDamage, color=eventType))+  
    geom_point(aes(size = totalDamage), alpha = 0.5) + 
    scale_size_area(max_size=25) +
    theme_bw() +
    geom_text(data = subset(damagesAgg, damagesAgg$beginYear == "2005" & 
                                        damagesAgg$eventType == "HURRICANE"),
        aes(beginYear,totalDamage, label = eventType), hjust = -.3, vjust = .5) +
    geom_text(data = subset(damagesAgg, damagesAgg$beginYear == "2005" & 
                                        damagesAgg$eventType == "STORM SURGE"),
        aes(beginYear,totalDamage, label = eventType), hjust = -.2, vjust = .5) +
    geom_text(data = subset(damagesAgg, damagesAgg$beginYear == "2004" & 
                                        damagesAgg$eventType == "HURRICANE"),
        aes(beginYear,totalDamage, label = eventType), hjust = -.1, vjust = .5) +
    geom_text(data = subset(damagesAgg, damagesAgg$beginYear == "2011" & 
                                        damagesAgg$eventType == "TORNADO"),
        aes(beginYear,totalDamage, label = eventType), hjust = 1.2, vjust = .5) +
    geom_text(data = subset(damagesAgg, damagesAgg$beginYear == "2011" & 
                                        damagesAgg$eventType == "FLOOD"),
        aes(beginYear,totalDamage, label = eventType), hjust = 1.2, vjust = .5) +
    ggtitle("Weather Event Types Most \nCostly to Crops and Property") + 
    ylab("Total Economic Damage in Dollars") + 
    xlab("") 

Shown above in a dot plot by year, with both the size and the y-axis representing the total economic cost, it quickly becomes apparent that large weather events in a handful of years drastically outpace all others in terms of economic consequence. For that reason, policy & preparedness efforts aimed at minimizing the damage caused by these types of large events will be worthwhile.


References

  1. "Reproducible Research", by Roger D. Peng, PhD, Jeff Leek, PhD, Brian Caffo, PhD, Coursera. July 11, 2014.
    https://www.coursera.org/course/repdata
  2. "A Few Simple Plots in R", by Keith Helfrich. July 29, 2014.
    http://redheadedstepdata.io/a-few-simple-plots-in-R/
  3. "repData_project2.rmd - Reproducible Code", by Keith Helfrich. July 25, 2014.
    https://www.dropbox.com/s/zci6m2x97zywkx6/repData_project2.rmd
  4. "National Weather Service Storm Data Documentation". July 29, 2014.
    https://d396qusza40orc.cloudfront.net/repdata/peer2_doc/pd01016005curr.pdf
  5. "National Climatic Data Center Storm Events FAQ". July 29, 2014.
    https://d396qusza40orc.cloudfront.net/repdata/peer2_doc/NCDC%20Storm%20Events-FAQ%20Page.pdf