
Large Weather Events

For another assignment from Reproducible Research, I've prepared a self-contained analysis of a large and fairly dirty data set, as if it were to be delivered to a government decision maker.

By 'self-contained', we mean that every step from start to finish must be self-evident & reproducible.

Similar to my previous post, here I also would have preferred to tidy the data in R and build the visualizations with Tableau.

The graphical systems in R are robust. And I find them super convenient to quickly plot the distribution of some data, or to spot check my work along the way. But R (in my opinion) is no place to produce "communication" level visualizations.

Visualizations built in R are simply too rigid and static for communication purposes. The moment one reaches a "tidy" data set that is ready for analysis, it's time to switch to Tableau for communication.

If you would like to download my reproducible code to give it a run, click here.

A few simple plots in R

Recent coursework for the Exploratory Data Analysis & Reproducible Research courses, which make up part of the 9-month Data Science Specialization offered on Coursera by Johns Hopkins University, offers the opportunity to examine what it takes to manipulate data & produce simple plots in R.

Some code might be expected for data manipulation. But the code to plot & annotate simple charts shows the reason why Tableau is such a pleasure to work within.

It is also interesting to observe the code to impute missing values. I found out later that an R package already exists for this. Nevertheless, one learns more when doing it by hand :) And even with a package, this is a far simpler task to perform in Tableau.
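To give a feel for what imputing "by hand" can look like, here is a minimal sketch. It is not the code from the assignment: the tiny data frame, its values, and the `interval` column are made up for illustration, and filling each gap with the mean of its 5-minute interval is just one reasonable strategy among several.

```r
# Hypothetical activity data: 2 days x 3 five-minute intervals (values are made up)
activity <- data.frame(
  date     = rep(as.Date(c("2012-10-01", "2012-10-02")), each = 3),
  interval = rep(c(0, 5, 10), times = 2),
  steps    = c(NA, 4, 10, 6, NA, 20)
)

# Mean steps per interval, ignoring missing values, repeated for every row
interval_mean <- ave(activity$steps, activity$interval,
                     FUN = function(x) mean(x, na.rm = TRUE))

# Fill each NA with the mean of its interval; observed values stay untouched
activity$steps_imputed <- ifelse(is.na(activity$steps),
                                 interval_mean,
                                 activity$steps)

print(activity)
```

Keeping the imputed values in a separate column (`steps_imputed`) makes it easy to compare the distribution before and after imputation.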

The Data

The gist of the assignment is quite simple:

A personal activity monitoring device collects observations at 5-minute intervals throughout the day. For two months of data from an anonymous individual, we're looking at the number of steps taken during the 5-minute intervals of each day.

The variables included are:

  • steps: Number of steps taken in a 5-minute interval (missing values are coded as NA)
  • date: The date on which the

Master Tableau Concepts


Over the past few days I had the pleasure of spending time with Tableau Zen Master Joe Mako. It was a pleasure not only for the education (which was exceptional), but also for the example, which was extraordinary. Joe is a kind & gentle man, devoted to improving the planet with compassion and improving the lives of others.

One vehicle for his devotion has been 10,000 hours of reverse engineering Tableau. Hence, when Joe describes The Art and Science of Tableau, those with an interest in breaking through the conceptual barriers to reach a Master's Plateau listen.

Joe's talk was aimed at the Relational end of the learning spectrum. Rather than step-by-step instructions, he spoke of the underlying concepts that a data chef should seek to understand. He taught that understanding the relationship between these concepts will enable you to venture beyond cook-book instructions to create something new.

Approach to Learning Spectrum

Below are the various topics that Joe touched upon yesterday. For each of them I provide a brief summary, with links to additional learning.

  • Data Densification
  • Data Blending 1 vs. Data Blending 2
  • Domain Padding
  • Scaffolding
  • The Four Pill Types

Data Densification

Each mark within a vis in Tableau represents one record

Demographics Makeover


Earlier I identified coloring the tips of the bars as a nice technique used by Quantcast to highlight over-indexing, and I took issue with their use of pie charts and the cryptic INDEX scoring system. In this post, we will examine those techniques and how they might be improved.

This view below of demographics data for the web property Zimbio.com is the focus of today's makeover:

Pie Charts

There are many arguments in critique & defense of the pie chart, mostly in critique. Among them, Decision Viz has authored an entire series on "how to leave your pie chart":

10+ Ways to Leave Your Pie Chart | DecisionViz

Cole Nussbaumer is on a mission to rid the world of ineffective charts, "one exploding, 3d pie chart at a time".

storytelling with data: alternatives to pies

NPR, oh Data Where art Thou ?


An open letter to NPR.org

Regarding your blog post titled "How Far Your Paycheck Goes, In 356 U.S. Cities", by Quoctrung Bui, May 20, I would ask: where did you obtain the median income figures at a Metropolitan Area level of detail?

How Far Your Paycheck Goes, In 356 U.S. Cities

And why is the source of those figures not referenced from the article?

The link from your article to the bea.gov release finds data at the State level only. Hence, the analysis from your article related to 356 U.S. Cities is not verifiable.

BEA: News Release: Real Personal Income for States and Metropolitan Areas, 2008-2012

Stacked Area Makeover


The first in a series to improve upon the data visualization techniques at Quantcast.

Various expert articles detail the reasons why a stacked area chart is bad, bad, bad form. And just this week, the masters of our field are discussing similar problems with stream graphs, a type of stacked area chart.


So why does Quantcast use the stacked area for Site Traffic data?

Let's Make it Over

Here below is an example of the area chart to be improved.

First to Generate Some Data

  
### gen_data function: simulate daily clicks for three traffic sources
gen_data<-function(m_apps,m_web,online){
  date<-seq(as.Date("2014-04-01"), as.Date("2014-05-30"), by="days")
  ### one draw per day (60 days in this range)
  n<-length(date)
  ### abs() keeps the simulated counts non-negative
  clicks_m_apps<-as.integer(abs(rnorm(n,mean=m_apps,sd=m_apps/4)))
  clicks_m_web<-as.integer(abs(rnorm(n,mean=m_web,sd=m_web/4)))
  clicks_online<-as.integer(abs(rnorm(n,mean=online,sd=online/4)))
  df<-data.frame(date,clicks_m_apps,clicks_m_web,clicks_online)
  return(df)
}

### generate random variables
df<-gen_data(200,250,450)

### write to a CSV
write.csv(df,file="data.csv",row.names=FALSE)

From our generated data, this is what the stacked area chart looks like. This was sexy in Excel (circa 1995!):

Transparency is Nice

As Dan Murray recently pointed out, the use of transparency is nice, and all comparative plots should start from a zero baseline.

Twitter / DGM885: I like the use of transparency ...
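For a quick spot check of the same idea in R, here is a rough sketch using base graphics. It is an illustration only, not the makeover itself: the two series are regenerated inline with made-up parameters, and `draw_area` is a hypothetical helper name. Each series is drawn as its own semi-transparent area starting from zero, rather than stacked on top of the other.

```r
set.seed(1)  # fixed seed so the made-up data is repeatable

date <- seq(as.Date("2014-04-01"), as.Date("2014-05-30"), by = "days")
df <- data.frame(
  date,
  clicks_m_web  = as.integer(abs(rnorm(length(date), mean = 250, sd = 250 / 4))),
  clicks_online = as.integer(abs(rnorm(length(date), mean = 450, sd = 450 / 4)))
)

# Hypothetical helper: one semi-transparent area from the zero baseline up to y
draw_area <- function(x, y, col) {
  x <- as.numeric(x)
  polygon(c(x[1], x, x[length(x)]), c(0, y, 0),
          col = adjustcolor(col, alpha.f = 0.4), border = col)
}

# Empty canvas with a y-axis that starts at zero and fits both series
plot(df$date, df$clicks_online, type = "n",
     ylim = c(0, max(df$clicks_online, df$clicks_m_web)),
     xlab = "Date", ylab = "Clicks")
draw_area(df$date, df$clicks_online, "steelblue")
draw_area(df$date, df$clicks_m_web, "darkorange")
```

Because both areas share the zero baseline, the height of each one can be read directly, and the transparency keeps the smaller series visible where the two overlap.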

With only two measures, this transparency effect is actually quite easy to achieve in

A Patchwork Quilt in Favor of Practice


Review: Coyle, Daniel: The Talent Code: Greatness Isn't Born.  It's Grown, Here's How.  New York: Bantam Dell (Extract from Chapter 1 - The Sweet Spot)


The inspirational self-help book has risen to popular prominence during recent years. Riding this wave are numerous authors, each with their own prescription for methods and techniques to help you achieve success. The trend continues today, with new authors adding their voices to the collective chorus of, "You can do it! Here's how." In The Talent Code, New York Times best-selling author Daniel Coyle rides this wave with his own narrative in which he makes the argument: Greatness Isn't Born. It's Grown, Here's How.

Coyle's main argument is encapsulated very precisely by the subtitle of his work.  Notably he argues that talent can be learned, Nurture is superior to Nature and, more specifically, a recipe to Nurture exists that produces superior results.  As with many in the inspirational self-help genre, Coyle's premise is built upon timeless adages.  In this case, he works from "practice makes perfect" and "that which does not kill you, makes you stronger."  Yet his treatise quickly draws an important distinction between rehearsal, mere

Color The Tips


Here we look at one of the nicer techniques at Quantcast, coloring the tips of bars that extend beyond the reference line, and how to do this in Tableau. Coloring the tips of your bars this way provides an instant, pre-cognitive call-out to over-indexing.

Example Data Set

Building upon yet another earlier analysis, in this example I work with SFPD Police Incidents Data.

To follow along, you can either export the most recent 3 months from data.sfgov.org (build your viz from scratch), or download the packaged workbook (review the finished product).

Over-Indexing is for Sample to Population Comparisons

Just as with the demographics data at Quantcast, our analysis of Police Incidents in San Francisco is also focused on the over-indexing of a sample observation, compared to a reference line for the population.

Quantcast looks at visitors to a given website (the sample), compared to all Internet users (the population). Here I highlight incident categories where the frequency in a given District (the sample) is greater than the average across the entire city (the population).

You can of course color the tips in any scenario where reference lines are used. For example, in a plan-to-actuals comparison for finance data.
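The sample-to-population comparison can be sketched in a few lines of R. This is an illustration under stated assumptions, not the calculation from the workbook: the counts are made up (not SFPD data), the column names are hypothetical, and the index is defined here as the district's share of a category divided by the citywide share, scaled so that 100 means "on par with the city".

```r
# Hypothetical incident counts (made-up numbers, not SFPD data)
incidents <- data.frame(
  district = rep(c("A", "B"), each = 2),
  category = rep(c("THEFT", "ASSAULT"), times = 2),
  n        = c(60, 40, 20, 80)
)

# Share of each category within its district (the sample)
incidents$district_share <- incidents$n /
  ave(incidents$n, incidents$district, FUN = sum)

# Share of each category across the whole city (the population / reference line)
incidents$city_share <- ave(incidents$n, incidents$category, FUN = sum) /
  sum(incidents$n)

# Index: 100 = on par with the city; above 100 = over-indexing
incidents$index      <- 100 * incidents$district_share / incidents$city_share
incidents$over_index <- incidents$index > 100

print(incidents)
```

The rows flagged in `over_index` are the bars whose tips would be colored, since they extend beyond the reference line.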

SFPD Incidents


As a resident of San Francisco, I am curious to gain a better understanding of Police Incidents in my city. I would like to know if certain types of incidents occur in higher concentrations in some Police Districts than in others.

Judging by data derived from the SFPD Crime Incident Reporting system, was there a relationship between the Police District in San Francisco and the Category of Police Incident during the six months from November 2013 through April 2014?
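One common way to frame a question like this is a chi-squared test of independence on the District by Category contingency table. The sketch below assumes that approach purely for illustration; the counts are made up (not SFPD data) and the column names are hypothetical.

```r
# Hypothetical incident counts (made-up numbers, not SFPD data)
incidents <- data.frame(
  district = rep(c("A", "B"), each = 3),
  category = rep(c("THEFT", "ASSAULT", "VANDALISM"), times = 2),
  n        = c(60, 40, 30, 20, 80, 30)
)

# Contingency table: rows are districts, columns are incident categories
tab <- xtabs(n ~ district + category, data = incidents)

# Null hypothesis: the category mix is the same in every district
res <- chisq.test(tab)

print(tab)
print(res)
```

A small p-value would suggest that the mix of categories differs between districts. It says nothing about which categories drive the difference, so a per-category comparison is still needed afterwards.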

This analysis is relevant and important to all city residents, city planners, and those who will decide which neighborhoods are suitable for their life activities with regard to certain types of crime.

Sources

A Data Visualization Opportunity


Critical review of the data visualization practices at Quantcast.

Quantcast offers "insights into how web properties are faring." Rather, I would say, a perplexing mix of poorly chosen chart types, a cryptic scoring system, and data visualization techniques that require too much effort to understand. Stephen Few would be displeased.

Let's identify the opportunities. In future posts, I'll come back to remake these visualizations.

Traditional Chart Types

The Stacked Area Graph is widely criticized for its shortcomings with part-to-whole comparisons. In the Quantcast area chart, what to do if "Change in Mobile" is the pattern I'm interested in isolating? I'm given no tools to do so visually. Online traffic drives the entire visual, simply by being at the bottom.

Dan Murray has pointed me to a new tool called Visage, which addresses this question with an elegant use of transparency. It would be great if Tableau were able to incorporate these techniques, as well!


"Stacked area charts work when you primarily want to show how the whole changes through time and only give a rough sense of how the parts compare." Stephen Few


Then come the Pie Charts. For a detailed review of why Stacked Area Graphs and Pie