Recent coursework for the Exploratory Data Analysis & Reproducible Research courses, which make up part of the 9-month Data Science Specialization offered on Coursera by Johns Hopkins University, offers the opportunity to examine what it takes to manipulate data & produce simple plots in R.

Some code might be expected for data manipulation. But the code needed to plot & annotate simple charts shows why Tableau is such a pleasure to work in.

It is also interesting to observe the code to **impute missing values**. I found out later that an R package already exists for this. Nevertheless, one learns more when doing it by hand :) And even with a package, this is a far simpler task to perform in Tableau.

## The Data

The gist of the assignment is quite simple:

A personal activity monitoring device collects observations at 5-minute intervals throughout the day. For two months of data from an anonymous individual, we're looking at the number of steps taken during each 5-minute interval of each day.

The variables included are:

- **steps:** Number of steps taken in a 5-minute interval (missing values are coded as NA)
- **date:** The date on which the measurement was taken, in YYYY-MM-DD format
- **interval:** Identifier for the 5-minute interval in which the measurement was taken

# Reproducible Research: Peer Assessment 1

Keith V. Helfrich

July 10, 2014

# Read Data

It’s nice that we can read the CSV data from inside of the zip file:

`data <- read.csv(unz("activity.zip", "activity.csv"))`

## Pre-Processing: Remove Missing Values

For this part of the assignment, we can ignore the missing values. The **originalValue** vector marks the records which are not missing from the original data. Later, we will come back to replace the NAs with a point estimate (best guess). For now, how many missing values are we ignoring?

```
originalValue <- complete.cases(data)
nMissing <- length(originalValue[originalValue == FALSE]) # number of records with NA
nComplete <- length(originalValue[originalValue == TRUE]) # number of complete records
title="Missing vs. Complete Cases"
barplot(table(originalValue),main=title,xaxt='n', col="bisque3") # render Complete Cases barplot
axis(side=1,at=c(.7,1.9),labels=c("Missing","Complete"),tick=FALSE) # render axis
text(.7,0,labels=nMissing, pos=3) # label the NA's bar
text(1.9,0,labels=nComplete, pos=3) # label the Completes bar
```
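The same counts can be obtained a little more directly. The following is a minimal sketch on a made-up toy data frame (the `toy` values and the `...Toy` names are illustrative assumptions, not the assignment data): because `TRUE` counts as 1 when summed, `sum()` over the logical vector returned by `complete.cases()` gives the number of rows, and `colSums(is.na())` shows which column the NAs live in.

```r
# Toy data frame standing in for the activity data (values are made up)
toy <- data.frame(
  steps    = c(NA, 0, 12, NA, 7),
  date     = c("2012-10-01", "2012-10-01", "2012-10-02", "2012-10-02", "2012-10-02"),
  interval = c(0, 5, 0, 5, 10),
  stringsAsFactors = FALSE
)

isComplete   <- complete.cases(toy)  # TRUE where no column in the row is NA
nMissingToy  <- sum(!isComplete)     # TRUE counts as 1, so this counts incomplete rows
nCompleteToy <- sum(isComplete)      # number of complete rows
naPerColumn  <- colSums(is.na(toy))  # NA count for each column
```

On the real data, `colSums(is.na(data))` would be one quick way to confirm that only the steps column contains NAs.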

## What is the mean total number of steps taken per day?

Interesting that the total number of steps per day follows a nearly normal distribution without any outliers. Let’s make a histogram of the total number of steps taken each day, and report the **mean** and **median** of the same.

```
completes<-subset(data,complete.cases(data) == TRUE) # build a subset of the complete values
splitByDay<-split(completes,completes$date, drop=TRUE) # split the complete cases by date
dailySteps<-sapply(splitByDay, function(x) sum(x$steps)) # build a numeric vector w/ daily sum of steps
hist(dailySteps, main="Hist Total Steps per Day", xlab="# Steps", col="bisque3") # plot a histogram
abline(v=mean(dailySteps), lty=3, col="blue") # draw a blue line thru the mean
abline(v=median(dailySteps), lty=4, col="red") # draw a red line thru the median
text(mean(dailySteps),25,labels="mean", pos=4, col="blue") # label the mean
text(mean(dailySteps),23,labels="median", pos=4, col="red") # label the median
rug(dailySteps, col="chocolate") # place a rug underneath the histogram
```
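The split-then-sapply pattern used for the daily totals can also be written as a single `tapply()` call. This is only a sketch on made-up toy values (the `toy`, `viaSplit` and `viaTapply` names are assumptions for illustration); it shows that both routes give the same per-day sums.

```r
# Toy complete cases: two days with made-up step counts
toy <- data.frame(
  steps = c(10, 20, 30, 5, 15),
  date  = c("2012-10-01", "2012-10-01", "2012-10-01", "2012-10-02", "2012-10-02"),
  stringsAsFactors = FALSE
)

# Route 1: split the data frame by date, then sum the steps within each piece
viaSplit <- sapply(split(toy, toy$date), function(x) sum(x$steps))

# Route 2: let tapply() group the steps by date and apply sum() directly
viaTapply <- tapply(toy$steps, toy$date, sum)
```

Both results are named by date, so either one can be passed to `hist()` or `summary()` in the same way.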

The **mean** and **median** total number of steps per day are nearly the same.

`summary(dailySteps) # print summary (includes mean & median)`

```
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 41 8840 10800 10800 13300 21200
```

# What is the average daily activity pattern?

Let’s make a time series plot of the 5-minute interval (x-axis) and the average number of steps taken, averaged across all days (y-axis).

```
splitByInterval <- split(completes,completes$interval, drop=TRUE) # split the complete cases by interval
intervalAvg <- sapply(splitByInterval, function(x) mean(x$steps)) # vector of Avg. steps per interval
plot(intervalAvg, type="l",
main="5' Interval Time Series",
ylab="Average # Steps",
xlab="Interval INDEX", col="chocolate") # plot the 5' time series
abline(v=which.max(intervalAvg), lty=3, col="blue") # draw a blue line thru the max
text(which.max(intervalAvg),max(intervalAvg),
labels=paste("max = ",as.character(round(max(intervalAvg)))),
pos=4, col="blue") # label the max interval
```

Which 5-minute interval, on average across all the days in the dataset, contains the maximum number of steps? Well, that would be the interval by the **name 835**, with an **average # steps = 206**, which is located in the **INDEX position # 104**.

`names(which.max(intervalAvg)) # name of the interval containing max`

`## [1] "835"`

`round(max(intervalAvg)) # max Avg. # of steps value`

`## [1] 206`

`which.max(intervalAvg) # INDEX of the interval containing max`

```
## 835
## 104
```

# Imputing missing values

Up ’til now, we’ve ignored the records with missing values. Yet the presence of missing values can introduce bias into some calculations or summaries of the data. As we’ve seen earlier, the total number of missing values in the dataset (i.e. the total number of rows with NAs) is **2,304**.

`nMissing`

`## [1] 2304`

To impute missing values, I will use the mean across all days for the 5-minute interval in which the NA occurs. So let’s create a new dataset, equal to the original, but with estimates for the missing data filled in. We’ll get there by adding a fourth utility column, which contains a TRUE / FALSE boolean to indicate whether the **originalValue** was present (TRUE) or missing (FALSE).

```
newData <- cbind(data,originalValue) # newData, with 'originalValue' column
splitByOrig<-split(newData,newData$originalValue, drop=TRUE) # split newData by whether originalValue exists
# For each row in the split data frame where originalValue is FALSE,
# replace NA with the intervalAvg (rounded to the nearest integer)
# the impute value is found with a lookup from the intervalAvg named vector created earlier
for (row in 1:nrow(splitByOrig[["FALSE"]])){
  splitByOrig[["FALSE"]][row,1] <- round(subset(intervalAvg,names(intervalAvg) ==
    as.character(splitByOrig[["FALSE"]][row,3])))
}
```
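For comparison, the same lookup can be done without a loop or a split, by indexing the named vector of interval averages with the interval identifiers of the NA rows. This is a sketch on made-up toy data, not the code used for the assignment; `toy`, `intervalAvgToy`, `gaps` and `imputed` are hypothetical names.

```r
# Toy data: three days, two intervals per day, with two missing step counts
toy <- data.frame(
  steps    = c(NA, 4, NA, 8, 2, 6),
  date     = rep(c("2012-10-01", "2012-10-02", "2012-10-03"), each = 2),
  interval = rep(c(0, 5), times = 3),
  stringsAsFactors = FALSE
)

# Average steps per interval, ignoring NAs; the result is named by interval
intervalAvgToy <- tapply(toy$steps, toy$interval, mean, na.rm = TRUE)

# Look up the average for each NA row by interval name and fill it in
gaps    <- is.na(toy$steps)
imputed <- toy
imputed$steps[gaps] <- round(intervalAvgToy[as.character(toy$interval[gaps])])
```

Because the rows are never split apart, the chronological order is preserved and no re-sorting is needed afterwards.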

Now we have a list named **splitByOrig**, with two data frame elements: one data frame named **“TRUE”** which contains all those observations for which an **originalValue** was present, and another named **“FALSE”** which contains the imputed **intervalAvg** # of steps for the missing 5’ interval.

The imputation is done now, so let’s bring the pieces back together again. Chronological order was temporarily ruined by splitting & recombining the data, so we also need to re-order the rows by date & interval.

```
newData <- rbind(splitByOrig[["FALSE"]],splitByOrig[["TRUE"]]) # combine the TRUE & FALSE data frames
newData <- newData[with(newData, order(date, interval)), ] # re-order by date & interval
```

Using the **newData**, make a histogram of the total number of steps taken each day, and calculate and report the mean and median total number of steps taken per day.

```
splitNewByDay <- split(newData,newData$date, drop=TRUE) # split the newData by date
dailyStepsNew <- sapply(splitNewByDay, function(x) sum(x$steps)) # numeric vector w/ daily sum of steps
hist(dailyStepsNew, main="NEW Hist: Total Steps per Day", xlab=" # Steps", col="bisque3") # plot a histogram
abline(v=mean(dailyStepsNew), lty=3, col="blue") # draw a blue line thru the mean
abline(v=median(dailyStepsNew), lty=4, col="red") # draw a red line thru the median
text(mean(dailyStepsNew),35,labels="mean", pos=4, col="blue") # label the mean
text(mean(dailyStepsNew),33,labels="median", pos=4, col="red") # label the median
rug(dailyStepsNew,col="chocolate")
```

While the quartiles vary a bit, the **mean** and **median** total number of steps per day are the same at the precision printed by `summary()`.

`summary(dailySteps) # summary of dailySteps`

```
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 41 8840 10800 10800 13300 21200
```

`summary(dailyStepsNew) # summary of dailyStepsNew`

```
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 41 9820 10800 10800 12800 21200
```

## What is the impact of imputing missing data?

In fact, by using the average value for imputation, the only difference is in the frequency (number of observations) for the center bar of the new histogram:

```
par(mfrow=c(1,2))
### plot the original histogram
hist(dailySteps, main="Hist Total Steps per Day", xlab="# Steps", col="bisque3", ylim=c(0,35)) # plot a histogram
abline(v=mean(dailySteps), lty=3, col="blue") # draw a blue line thru the mean
abline(v=median(dailySteps), lty=4, col="red") # draw a red line thru the median
text(mean(dailySteps),25,labels="mean", pos=4, col="blue") # label the mean
text(mean(dailySteps),23,labels="median", pos=4, col="red") # label the median
rug(dailySteps, col="chocolate")
### plot the imputed histogram
hist(dailyStepsNew, main="NEW Hist: Total Steps per Day", xlab="# Steps", col="bisque3", ylab="") # plot a histogram
abline(v=mean(dailyStepsNew), lty=3, col="blue") # draw a blue line thru the mean
abline(v=median(dailyStepsNew), lty=4, col="red") # draw a red line thru the median
text(mean(dailyStepsNew),35,labels="mean", pos=4, col="blue") # label the mean
text(mean(dailyStepsNew),33,labels="median", pos=4, col="red") # label the median
rug(dailyStepsNew,col="chocolate")
```
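Why does only the center bar grow? One plausible explanation, sketched below on toy data, is that the NAs cover whole days: the count of 2,304 equals 8 × 288, and a day has 288 five-minute intervals, though that whole-day pattern is an inference rather than something checked here. If an entire day is filled with the per-interval means, its total equals the mean daily total, so it lands in the center bar and leaves the mean unchanged. The sketch skips the rounding step used above, which can shift the real figures slightly.

```r
# Toy data: two observed days plus one day that is entirely missing (assumption)
toy <- data.frame(
  steps    = c(10, 20, 30, 40, NA, NA),
  date     = rep(c("2012-10-01", "2012-10-02", "2012-10-03"), each = 2),
  interval = rep(c(0, 5), times = 3),
  stringsAsFactors = FALSE
)

observed      <- toy[complete.cases(toy), ]
dailyObserved <- tapply(observed$steps, observed$date, sum)       # totals for the observed days
intervalMean  <- tapply(observed$steps, observed$interval, mean)  # mean per interval, named by interval

# Fill the missing day with the (unrounded) interval means
filled <- toy
gaps   <- is.na(filled$steps)
filled$steps[gaps] <- intervalMean[as.character(filled$interval[gaps])]
dailyFilled <- tapply(filled$steps, filled$date, sum)              # totals for all three days

meanBefore <- mean(dailyObserved)
meanAfter  <- mean(dailyFilled)
```

In this toy case the imputed day's total is exactly the mean of the observed daily totals, so `meanBefore` and `meanAfter` agree.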

# Are there differences in activity patterns between weekdays and weekends?

Create a new factor variable in the dataset with two levels – “weekday” and “weekend” indicating whether a given date is a weekday or weekend day.

```
newData$date <- as.Date(strptime(newData$date, format="%Y-%m-%d")) # convert date to a date() class variable
newData$day <- weekdays(newData$date) # build a 'day' factor to hold weekday / weekend
for (i in 1:nrow(newData)) { # for each day
if (newData[i,]$day %in% c("Saturday","Sunday")) { # if Saturday or Sunday,
newData[i,]$day<-"weekend" # then 'weekend'
}
else{
newData[i,]$day<-"weekday" # else 'weekday'
}
}
```
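The row-by-row loop works, but it is slow on 17,000+ rows, and `weekdays()` returns day names in the current locale, so the comparison with "Saturday" and "Sunday" assumes an English locale. A vectorized sketch that avoids both issues, using the numeric weekday from `as.POSIXlt()` (0 = Sunday, 6 = Saturday), might look like this; the `dates`, `wday` and `dayType` names are illustrative assumptions.

```r
# Four sample dates: a Friday, Saturday, Sunday and Monday in October 2012
dates <- as.Date(c("2012-10-05", "2012-10-06", "2012-10-07", "2012-10-08"))

# Numeric day of week, independent of locale: 0 = Sunday ... 6 = Saturday
wday <- as.POSIXlt(dates)$wday

# Two-level factor built in one vectorized step
dayType <- factor(ifelse(wday %in% c(0, 6), "weekend", "weekday"),
                  levels = c("weekday", "weekend"))
```

Unlike the loop, this yields an actual factor with the two levels the assignment asks for.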

Make a panel plot containing a time series plot (i.e. type = “l”) of the 5-minute interval (x-axis) and the average number of steps taken, averaged across all weekday days or weekend days (y-axis). The plot should look something like the following, which was created using simulated data:

```
## aggregate newData by steps as a function of interval + day
stepsByDay <- aggregate(newData$steps ~ newData$interval + newData$day, newData, mean)
## reset the column names to be pretty & clean
names(stepsByDay) <- c("interval", "day", "steps")
## plot weekday over weekend time series
par(mfrow=c(1,1))
with(stepsByDay, plot(steps ~ interval, type="n", main="Weekday vs. Weekend Avg."))
with(stepsByDay[stepsByDay$day == "weekday",], lines(steps ~ interval, type="l", col="chocolate"))
with(stepsByDay[stepsByDay$day == "weekend",], lines(steps ~ interval, type="l", col="16" ))
legend("topright", lty=c(1,1), col = c("chocolate", "16"), legend = c("weekday", "weekend"), seg.len=3)
```
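The code above overlays the two series on one set of axes. If separate panels are preferred, a base-graphics sketch along the following lines could be used; it runs on a made-up `toyAvg` data frame shaped like `stepsByDay` (the values and the `toyAvg`, `oldPar` and `panel` names are assumptions), and shares the y-axis range so the panels stay comparable.

```r
# Made-up averages in the same shape as stepsByDay: interval, day, steps
toyAvg <- data.frame(
  interval = rep(c(0, 5, 10), times = 2),
  day      = rep(c("weekday", "weekend"), each = 3),
  steps    = c(2, 40, 10, 1, 15, 30),
  stringsAsFactors = FALSE
)

# Two stacked panels; keep the previous settings so they can be restored
oldPar <- par(mfrow = c(2, 1), mar = c(4, 4, 2, 1))
for (d in c("weekend", "weekday")) {
  panel <- toyAvg[toyAvg$day == d, ]           # rows for this panel only
  plot(panel$interval, panel$steps, type = "l",
       main = d, xlab = "Interval", ylab = "Average # Steps",
       ylim = range(toyAvg$steps),             # same y-axis in both panels
       col = "chocolate")
}
par(oldPar)                                    # restore the original layout
```

The overlay makes the weekday and weekend curves easier to compare point by point, while the panels are closer to what the assignment text describes.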

It looks like this person has a day job & does most of his or her walking on the weekends!


# References

- "Exploratory Data Analysis", by Roger D. Peng, PhD, Jeff Leek, PhD, Brian Caffo, PhD, Coursera. July 11, 2014. https://www.coursera.org/course/exdata
- "Reproducible Research", by Roger D. Peng, PhD, Jeff Leek, PhD, Brian Caffo, PhD, Coursera. July 11, 2014. https://www.coursera.org/course/repdata
- "Data Science Specialization", by Roger D. Peng, PhD, Jeff Leek, PhD, Brian Caffo, PhD, Coursera. July 11, 2014. https://www.coursera.org/specialization/jhudatascience/1
- "Imputation in R", Stack Overflow. http://stackoverflow.com/questions/13114812/imputation-in-r
- "Peer Assessments / Peer Assessment 1", Reproducible Research : by Roger D. Peng, PhD, Jeff Leek, PhD, Brian Caffo, PhD, Coursera. July 11, 2014 https://class.coursera.org/repdata-004/human_grading/view/courses/972143/assessments/3/submissions