
A few simple plots in R

Recent coursework for the Exploratory Data Analysis & Reproducible Research courses, part of the nine-month Data Science Specialization offered on Coursera by Johns Hopkins University, offers an opportunity to examine what it takes to manipulate data & produce simple plots in R.

Some code might be expected for data manipulation. But the code needed to plot & annotate simple charts shows why Tableau is such a pleasure to work in.

It is also interesting to see the code to impute missing values. I found out later that an R package already exists for this. Nevertheless, one learns more by doing it by hand :) And even with a package, this is a far simpler task to perform in Tableau.

The Data

The gist of the assignment is quite simple:

A personal activity monitoring device collects observations at 5-minute intervals throughout the day. For two months of data from an anonymous individual, we're looking at the number of steps taken during each 5-minute interval of each day.

The variables included are:

  • steps: Number of steps taken in a 5-minute interval (missing values are coded as NA)
  • date: The date on which the measurement was taken in YYYY-MM-DD format
  • interval: Identifier for the 5-minute interval in which measurement was taken

Reproducible Research: Peer Assessment 1

Keith V. Helfrich
July 10, 2014

Read Data

It’s nice that we can read the CSV data from inside of the zip file:

data <- read.csv(unz("activity.zip", "activity.csv"))

Pre-Processing: Remove Missing Values

For this part of the assignment, we can ignore the missing values. The originalValue vector flags the rows that are complete (nothing missing) in the original data. Later, we will come back to replace the NAs with a point estimate (best guess). For now, how many missing values are we ignoring?

originalValue <- complete.cases(data)  
nMissing <- length(originalValue[originalValue == FALSE])                    # number of records with NA  
nComplete <- length(originalValue[originalValue == TRUE])                    # number of complete records

title="Missing vs. Complete Cases"  
barplot(table(originalValue),main=title,xaxt='n', col="bisque3")             # render Complete Cases barplot  
axis(side=1,at=c(.7,1.9),labels=c("Missing","Complete"),tick=FALSE)          # render axis  
text(.7,0,labels=nMissing, pos=3)                                            # label the NA's bar  
text(1.9,0,labels=nComplete, pos=3)                                          # label the Completes bar

missing vs complete cases
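
As a side note, the same counts can be had without subsetting, because logical values sum as 0 and 1. Here is a minimal sketch on a made-up toy data frame (the `toy` values and the names `isComplete`, `nMissingToy`, `nCompleteToy` and `naPerColumn` are illustrative, not part of the assignment):

```r
# Toy data frame standing in for the activity data (values are made up)
toy <- data.frame(
  steps    = c(NA, 0, 12, NA, 7, 3),
  date     = rep(c("2012-10-01", "2012-10-02"), each = 3),
  interval = rep(c(0, 5, 10), times = 2)
)

isComplete   <- complete.cases(toy)    # TRUE where no column of the row is NA
nMissingToy  <- sum(!isComplete)       # logicals sum as 0/1, so this counts FALSEs
nCompleteToy <- sum(isComplete)        # count of complete rows
naPerColumn  <- colSums(is.na(toy))    # shows which column(s) hold the NAs

print(c(missing = nMissingToy, complete = nCompleteToy))
print(naPerColumn)
```

The per-column count is a quick way to confirm that only the steps column carries NAs before deciding how to impute.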

What is the mean total number of steps taken per day?

Interesting that the total number of steps per day follows a nearly normal distribution without any outliers. Let’s make a histogram of the total number of steps taken each day, and report the mean and median of the same.

completes<-subset(data,complete.cases(data) == TRUE)            # build a subset of the complete values

splitByDay<-split(completes,completes$date, drop=TRUE)          # split the complete cases by date  
dailySteps<-sapply(splitByDay, function(x) sum(x$steps))        # build a numeric vector w/ daily sum of steps  
hist(dailySteps, main="Hist Total Steps per Day", xlab="# Steps", col="bisque3") # plot a histogram  
abline(v=mean(dailySteps), lty=3, col="blue")                   # draw a blue line thru the mean  
abline(v=median(dailySteps), lty=4, col="red")                  # draw a red line thru the median  
text(mean(dailySteps),25,labels="mean", pos=4, col="blue")      # label the mean  
text(median(dailySteps),23,labels="median", pos=4, col="red")   # label the median  
rug(dailySteps, col="chocolate")                                # place a rug underneath the histogram

histogram of total steps per day

The mean and median total number of steps per day are nearly the same.

summary(dailySteps)                                             # print summary (includes mean & median)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##      41    8840   10800   10800   13300   21200
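
The split / sapply pair above can also be written as a single tapply call. A rough sketch on made-up values (the `toy`, `toyComplete` and `dailyToy` names are mine, not from the assignment):

```r
# Made-up observations: three intervals on each of two days, one NA
toy <- data.frame(
  steps = c(10, 20, 30, 5, 5, NA),
  date  = rep(c("2012-10-01", "2012-10-02"), each = 3)
)

toyComplete <- toy[complete.cases(toy), ]                     # drop the NA row
dailyToy    <- tapply(toyComplete$steps, toyComplete$date, sum)  # sum of steps per date

print(dailyToy)
print(c(mean = mean(dailyToy), median = median(dailyToy)))
```

tapply returns a named vector with one entry per date, which can be passed to hist(), mean() and median() just like dailySteps.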

What is the average daily activity pattern?

Let’s make a time series plot of the 5-minute interval (x-axis) and the average number of steps taken, averaged across all days (y-axis).

splitByInterval <- split(completes,completes$interval, drop=TRUE)     # split the complete cases by interval  
intervalAvg <- sapply(splitByInterval, function(x) mean(x$steps))     # vector of Avg. steps per interval  
plot(intervalAvg, type="l",  
     main="5' Interval Time Series", 
     ylab="Average # Steps", 
     xlab="Interval INDEX", col="chocolate")                          # plot the 5' time series
abline(v=which.max(intervalAvg), lty=3, col="blue")                   # draw a blue line thru the max  
text(which.max(intervalAvg),max(intervalAvg),  
     labels=paste("max = ",as.character(round(max(intervalAvg)))), 
     pos=4, col="blue")                                               # label the max interval

Avg steps in each 5’ interval

Which 5-minute interval, on average across all the days in the dataset, contains the maximum number of steps? Well, that would be the interval by the name 835, with an average # steps = 206, which is located in the INDEX position # 104.

names(which.max(intervalAvg))                                         # name of the interval containing max
## [1] "835"
round(max(intervalAvg))                                               # max Avg. # of steps value
## [1] 206
which.max(intervalAvg)                                                # INDEX of the interval containing max
## 835 
## 104
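
The name 835 and the INDEX 104 fit together if the interval identifier is read as a clock time in HHMM form, i.e. 08:35. That reading is my assumption (the assignment only calls it an identifier), but the arithmetic agrees: 8 × 60 + 35 = 515 minutes, which is the 104th 5-minute slot of the day. A small sketch, with hypothetical helper names `intervalToClock` and `intervalToIndex`:

```r
# Assumption: interval identifiers encode clock time as HHMM (835 -> 08:35)
intervalToClock <- function(interval) {
  hhmm <- sprintf("%04d", as.integer(interval))          # left-pad to 4 digits
  paste0(substr(hhmm, 1, 2), ":", substr(hhmm, 3, 4))    # insert the colon
}

intervalToIndex <- function(interval) {
  interval <- as.integer(interval)
  minutes  <- (interval %/% 100) * 60 + interval %% 100  # minutes since midnight
  minutes %/% 5 + 1                                      # 1-based 5-minute slot
}

clock <- intervalToClock("835")
index <- intervalToIndex("835")
print(clock)
print(index)
```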

Imputing missing values

Up ’til now, we’ve ignored the records with missing values. Yet the presence of missing values can introduce bias into some calculations or summaries of the data. As we’ve seen earlier, the total number of missing values in the dataset (i.e. the total number of rows with NAs) is 2,304.

nMissing
## [1] 2304

To impute missing values, I will use the mean across all days for the 5-minute interval in which the NA occurs. So let’s create a new dataset, equal to the original, but with estimates for the missing data filled in. We’ll get there by adding a fourth utility column, which contains a TRUE / FALSE boolean to indicate whether the originalValue was present (TRUE) or missing (FALSE).

newData <- cbind(data,originalValue)                          # newData, with 'originalValue' column  
splitByOrig<-split(newData,newData$originalValue, drop=TRUE)  # split newData by whether originalValue exists

# For each row in the split data frame where originalValue == FALSE, 
# replace NA with the intervalAvg (rounded to the nearest integer)
# the impute value is found with a lookup from the intervalAvg named vector created earlier

for (row in 1:nrow(splitByOrig[["FALSE"]])){  
    splitByOrig[["FALSE"]][row,1] <- round(subset(intervalAvg,names(intervalAvg) ==
                                     as.character(splitByOrig[["FALSE"]][row,3])))
}

Now we have a list named splitByOrig, with two data frame elements: one data frame named “TRUE” which contains all those observations for which an originalValue was present, and another named “FALSE” which contains the imputed intervalAvg # of steps for the missing 5’ interval.
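
The same lookup can be done without the loop or the split, by indexing the named vector of interval averages with the interval names of the NA rows. This is only a sketch on made-up data (the `toy`, `toyAvg`, `isNA` and `imputed` names are illustrative), not the code used for the results in this post:

```r
# Made-up data: three intervals on each of two days, with two NAs
toy <- data.frame(
  steps    = c(NA, 4, 10, 2, 6, NA),
  date     = rep(c("2012-10-01", "2012-10-02"), each = 3),
  interval = rep(c(0, 5, 10), times = 2)
)

# Named vector of per-interval means, ignoring the NAs
toyAvg <- tapply(toy$steps, toy$interval, mean, na.rm = TRUE)

isNA    <- is.na(toy$steps)                  # rows that need an estimate
imputed <- toy                               # copy, so the original stays intact
imputed$steps[isNA] <- round(toyAvg[as.character(toy$interval[isNA])])

print(imputed)
```

Because the rows are filled in place, the chronological order is never disturbed and no re-sorting is needed afterwards.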

The imputation is done now, so let’s bring the pieces back together again. Chronological order was temporarily ruined by splitting & recombining the data, so we also need to re-order the rows by date & interval.

newData <- rbind(splitByOrig[["FALSE"]],splitByOrig[["TRUE"]])           # combine the TRUE & FALSE data frames  
newData <- newData[with(newData, order(date, interval)), ]               # re-order by date & interval

Using the newData, make a histogram of the total number of steps taken each day, and calculate and report the mean and median total number of steps taken per day.

splitNewByDay <- split(newData,newData$date, drop=TRUE)                  # split the newData by date  
dailyStepsNew <- sapply(splitNewByDay, function(x) sum(x$steps))         # numeric vector w/ daily sum of steps  
hist(dailyStepsNew, main="NEW Hist: Total Steps per Day", xlab="# Steps", col="bisque3") # plot a histogram  
abline(v=mean(dailyStepsNew), lty=3, col="blue")                         # draw a blue line thru the mean  
abline(v=median(dailyStepsNew), lty=4, col="red")                        # draw a red line thru the median  
text(mean(dailyStepsNew),35,labels="mean", pos=4, col="blue")            # label the mean  
text(median(dailyStepsNew),33,labels="median", pos=4, col="red")         # label the median  
rug(dailyStepsNew,col="chocolate")

NEW histogram, with imputed values

While the quartiles vary a bit, the mean and median total number of steps per day are the same at the precision that summary() prints.

summary(dailySteps)                                                # summary of dailySteps
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##      41    8840   10800   10800   13300   21200
summary(dailyStepsNew)                                             # summary of dailyStepsNew
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##      41    9820   10800   10800   12800   21200

What is the impact of imputing missing data ?

In fact, by using the average value for imputation, the only difference is in the frequency (number of observations) for the center bar of the new histogram:

par(mfrow=c(1,2))

### plot the original histogram
hist(dailySteps, main="Hist Total Steps per Day", xlab="# Steps", col="bisque3", ylim=c(0,35)) # plot a histogram  
abline(v=mean(dailySteps), lty=3, col="blue")                      # draw a blue line thru the mean  
abline(v=median(dailySteps), lty=4, col="red")                     # draw a red line thru the median  
text(mean(dailySteps),25,labels="mean", pos=4, col="blue")         # label the mean  
text(median(dailySteps),23,labels="median", pos=4, col="red")      # label the median  
rug(dailySteps, col="chocolate")

### plot the imputed histogram
hist(dailyStepsNew, main="NEW Hist: Total Steps per Day", xlab="# Steps", col="bisque3", ylab="") # plot a histogram  
abline(v=mean(dailyStepsNew), lty=3, col="blue")                   # draw a blue line thru the mean  
abline(v=median(dailyStepsNew), lty=4, col="red")                  # draw a red line thru the median  
text(mean(dailyStepsNew),35,labels="mean", pos=4, col="blue")      # label the mean  
text(median(dailyStepsNew),33,labels="median", pos=4, col="red")   # label the median  
rug(dailyStepsNew,col="chocolate")

side-by-side histograms
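
Why does only the center bar grow? One plausible explanation, assuming the NAs cover whole days (2,304 = 8 × 288, and a day has 288 five-minute intervals, so this is consistent, though I have not verified it here): a fully missing day filled with interval averages gets a daily total close to the mean daily total. Adding observations equal to the mean leaves the mean where it was and pulls the quartiles inward. A toy illustration with made-up totals (`dailyToy`, `filler` and `dailyToyNew` are illustrative names):

```r
# Made-up daily totals for four observed days
dailyToy <- c(8000, 10000, 12000, 14000)

# Pretend three fully missing days were imputed: each lands on the mean total
filler      <- rep(mean(dailyToy), 3)
dailyToyNew <- c(dailyToy, filler)

print(summary(dailyToy))      # before imputation
print(summary(dailyToyNew))   # after: same mean and median, tighter quartiles
```

The real data will not match this exactly, since the imputed interval values were rounded to whole steps.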

Are there differences in activity patterns between weekdays and weekends?

Create a new factor variable in the dataset with two levels – “weekday” and “weekend” indicating whether a given date is a weekday or weekend day.

newData$date <- as.Date(strptime(newData$date, format="%Y-%m-%d")) # convert date to a date() class variable  
newData$day <- weekdays(newData$date)                              # build a 'day' factor to hold weekday / weekend  
for (i in 1:nrow(newData)) {                                       # for each day  
    if (newData[i,]$day %in% c("Saturday","Sunday")) {             # if Saturday or Sunday,
        newData[i,]$day<-"weekend"                                 #   then 'weekend'
    }
    else{
        newData[i,]$day<-"weekday"                                 #    else 'weekday'
    }
}
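
The row-by-row loop works, but it is slow on a data frame this size, and the day names returned by weekdays() depend on the locale. A vectorized sketch that sidesteps both, using the numeric weekday from as.POSIXlt (0 = Sunday, 6 = Saturday). The `dates`, `wday` and `dayType` names and the sample dates are mine, for illustration only:

```r
# Four sample dates: a Friday, a Saturday, a Sunday and a Monday
dates <- as.Date(c("2012-10-05", "2012-10-06", "2012-10-07", "2012-10-08"))

wday <- as.POSIXlt(dates)$wday                 # 0 = Sunday ... 6 = Saturday
dayType <- factor(ifelse(wday %in% c(0, 6), "weekend", "weekday"),
                  levels = c("weekday", "weekend"))

print(data.frame(date = dates, day = dayType))
```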

Make a panel plot containing a time series plot (i.e. type = “l”) of the 5-minute interval (x-axis) and the average number of steps taken, averaged across all weekday days or weekend days (y-axis). The plot should look something like the following, which was created using simulated data:

## aggregate newData by steps as a function of interval + day  
stepsByDay <- aggregate(newData$steps ~ newData$interval + newData$day, newData, mean)

## reset the column names to be pretty & clean
names(stepsByDay) <- c("interval", "day", "steps")

## plot weekday over weekend time series
par(mfrow=c(1,1))  
with(stepsByDay, plot(steps ~ interval, type="n", main="Weekday vs. Weekend Avg."))  
with(stepsByDay[stepsByDay$day == "weekday",], lines(steps ~ interval, type="l", col="chocolate"))  
with(stepsByDay[stepsByDay$day == "weekend",], lines(steps ~ interval, type="l", col="16" ))  
legend("topright", lty=c(1,1), col = c("chocolate", "16"), legend = c("weekday", "weekend"), seg.len=3)

Weekday vs. Weekend
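
The assignment text asks for a panel plot, whereas the code above overlays both series on one set of axes. For comparison, a rough base-graphics sketch of the panel variant, one panel per day type with a shared y-range. The `toyByDay` values are made up, and `plotPanels` is a hypothetical helper, not part of the submitted assignment:

```r
# Made-up averages standing in for stepsByDay
toyByDay <- data.frame(
  interval = rep(c(0, 5, 10, 15), times = 2),
  day      = rep(c("weekday", "weekend"), each = 4),
  steps    = c(1, 8, 3, 2, 2, 4, 6, 5)
)

plotPanels <- function(df) {
  oldPar <- par(mfrow = c(2, 1), mar = c(4, 4, 2, 1))  # two stacked panels
  on.exit(par(oldPar))                                 # restore settings afterwards
  yRange <- range(df$steps)                            # shared y-axis for comparison
  dayTypes <- as.character(unique(df$day))
  for (d in dayTypes) {
    part <- df[df$day == d, ]
    plot(part$interval, part$steps, type = "l", ylim = yRange,
         main = d, xlab = "Interval", ylab = "Average # Steps",
         col = "chocolate")
  }
  invisible(dayTypes)                                  # panels drawn, in order
}

panels <- plotPanels(toyByDay)
```

Sharing the y-range across panels keeps the weekday and weekend curves visually comparable, which the overlay gets for free.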

It looks like this person has a day job & does most of his or her walking on the weekends!


References

  1. "Exploratory Data Analysis", by Roger D. Peng, PhD, Jeff Leek, PhD, Brian Caffo, PhD, Coursera. July 11, 2014. https://www.coursera.org/course/exdata
  2. "Reproducible Research", by Roger D. Peng, PhD, Jeff Leek, PhD, Brian Caffo, PhD, Coursera. July 11, 2014. https://www.coursera.org/course/repdata
  3. "Data Science Specialization", by Roger D. Peng, PhD, Jeff Leek, PhD, Brian Caffo, PhD, Coursera. July 11, 2014. https://www.coursera.org/specialization/jhudatascience/1
  4. "Imputation in R", Stack Overflow. http://stackoverflow.com/questions/13114812/imputation-in-r
  5. "Peer Assessments / Peer Assessment 1", Reproducible Research, by Roger D. Peng, PhD, Jeff Leek, PhD, Brian Caffo, PhD, Coursera. July 11, 2014. https://class.coursera.org/repdata-004/human_grading/view/courses/972143/assessments/3/submissions