Menu

SFPD Incidents

cover-image

As a resident of San Francisco, I am curious to gain a better understanding of Police Incidents in my city. I would like to know if certain types of incidents occur in higher concentrations within some Police Districts over others.

Judging by data derived from the SFPD Crime Incident Reporting system, was there a relationship between the Police District in San Francisco and the Category of Police Incident during the six months from November, 2013 thru April, 2014?

This analysis is relevant and important to all city residents, city planners, and those who will decide which neighborhoods are suitable for their life activities with regard to certain types of crime.

Sources

An Introduction to the Data

The data for this analysis include the entire population of all Police Incidents derived from SFPD Crime Incident Reporting system during the time period. Each observational case is a single Police Incident. Both variables of interest are categorical and non-ordinal: “PdDistrict” and “Category”.

This analysis is a retrospective observational study. Incident data are collected in a way that does not directly interfere with how the data arose, i.e. we have merely “observed” the incidents after they occurred. This observational study is retrospective because it uses data from the past. Hence, the analysis seeks only to establish an association between the explanatory and response variables and is not sufficient to determine causation.

Because the data set contains only six months of data, we will be cautious to avoid generalizing conclusions to time frames greatly beyond the time period in question. Potential sources of bias that prevent generalizing further include the year, the season, or other other variability in the pattern of police incidents by neighborhood over time.

Exploratory Analysis

There were 61475 incidents during the six month date range. This is roughly ~10k incidents per month:

  
## Check the Date Range
dates <- as.Date(Incidents$Date, format = "%m/%d/%Y")  
range(dates)  
[1] "2013-11-01" "2014-04-30"

## Total Number of Incidents
nrow(Incidents)  
[1] 61475

## Table an Incident Count by Month
table(year(dates), months(dates))

       April December February January March November
  2013     0     9606        0       0     0    10518
  2014  9806        0     9852   10693 11000        0

There are 10 Districts in total, ranging from Bayview to the Tenderloin, shown below sequenced by Incident count.

  
## Sort District by Decreasing Incident Frequency
Incidents$PdDistrict <- factor(Incidents$PdDistrict, levels = names(sort(table(Incidents$PdDistrict), decreasing = TRUE)))

## Plot Police District
plot(Incidents$PdDistrict, main = "Count of Incidents by PdDistrict", sub = "Sort Descending", xlab = "PdDistrict", ylab = "Incident Count")  

Count of Incidents by PdDistrict

While there are 36 Categories, ranging from ARSON to WEAPON LAWS, the top 11 Categories account for 90% of all Police Incidents:

  
# Sort Category by Decreasing Incident Frequency
Incidents$Category <- factor(Incidents$Category, levels = names(sort(table(Incidents$Category), decreasing = TRUE)))

# Calculate the Proportion of Total Incidents in the Top Eleven Categories
sum(table(Incidents$Category)[1:11])/nrow(Incidents)  
[1] 0.8991

# Build a Pareto Chart
pareto.chart(table(Incidents$Category))  

Pareto Chart for table(Incidents$Category)

  
Pareto chart analysis for table(Incidents$Category)  
                              Frequency Cum.Freq. Percentage Cum.Percent.
  LARCENY/THEFT                   16386     16386  26.654738        26.65
  OTHER OFFENSES                   7905     24291  12.858886        39.51
  NON-CRIMINAL                     7235     31526  11.769012        51.28
  ASSAULT                          4771     36297   7.760878        59.04
  VANDALISM                        3297     39594   5.363156        64.41
  WARRANTS                         3101     42695   5.044327        69.45
  VEHICLE THEFT                    3036     45731   4.938593        74.39
  DRUG/NARCOTIC                    3013     48744   4.901179        79.29
  BURGLARY                         2955     51699   4.806832        84.10
  MISSING PERSON                   2077     53776   3.378609        87.48
  SUSPICIOUS OCC                   1497     55273   2.435136        89.91
  ROBBERY                          1493     56766   2.428630        92.34
  FRAUD                            1258     58024   2.046360        94.39
  WEAPON LAWS                       611     58635   0.993900        95.38
  TRESPASS                          523     59158   0.850752        96.23
  DRUNKENNESS                       379     59537   0.616511        96.85
  FORGERY/COUNTERFEITING            325     59862   0.528670        97.38
  SEX OFFENSES, FORCIBLE            272     60134   0.442456        97.82
  KIDNAPPING                        260     60394   0.422936        98.24
  DISORDERLY CONDUCT                176     60570   0.286295        98.53
  DRIVING UNDER THE INFLUENCE       166     60736   0.270028        98.80
  RECOVERED VEHICLE                 128     60864   0.208215        99.01
  ARSON                             112     60976   0.182188        99.19
  PROSTITUTION                      110     61086   0.178935        99.37
  RUNAWAY                           110     61196   0.178935        99.55
  LIQUOR LAWS                       105     61301   0.170801        99.72
  EMBEZZLEMENT                       44     61345   0.071574        99.79
  SUICIDE                            39     61384   0.063440        99.85
  FAMILY OFFENSES                    28     61412   0.045547        99.90
  EXTORTION                          16     61428   0.026027        99.92
  LOITERING                          14     61442   0.022773        99.95
  SEX OFFENSES, NON FORCIBLE         12     61454   0.019520        99.97
  BAD CHECKS                          8     61462   0.013013        99.98
  STOLEN PROPERTY                     7     61469   0.011387        99.99
  GAMBLING                            4     61473   0.006507       100.00
  BRIBERY                             2     61475   0.003253       100.00

The pareto analysis above shows the cumulative frequency of all Categories of Incident.

Data Prep

Because the top 22 Categories account for 99% of all Police Incidents, the remainder of this analsyis will lump the long tail into “OTHER OFFENCES”.

The R code employed to lump Categories 21 thru 36 of the long tail into “OTHER OFFENCES” and produce the dataset for analysis as only a subset of the original columns is shown below:



## Load the various overlapping data sets
Incidents_1 <- read.csv("SFPD_Incidents-2013.11.01-2014.01.31.csv")  
Incidents_2 <- read.csv("SFPD_Incidents_2014.02.01-2014.04.27.csv")  
Incidents_3 <- read.csv("SFPD_Incidents_2014.04.20-2014.04.30.csv")

## Subset the Unique Records
Incidents <- unique(rbind(Incidents_1, Incidents_2, Incidents_3))

## Sort the Categories by Decreasing Incident Frequency
Incidents$Category <- factor(Incidents$Category, levels = names(sort(table(Incidents$Category), decreasing = TRUE)))

## Create a new factor variable called Cat.New
## Lump the lower tail into "OTHER OFFENSES"
Cat.Orig <- levels(Incidents$Category)  
Cat.New <- Cat.Orig[1:20]  
Cat.New[21:length(Cat.Orig)] <- rep("OTHER OFFENSES", length(Cat.Orig) - 20)

## Build a Cross Reference Table and
## Merge Cat.New into the Incidents data frame
Map.Table <- data.frame(Cat.Orig, Cat.New)  
Incidents <- merge(x = Incidents, y = Map.Table, by.x = "Category", by.y = "Cat.Orig")

## Sort Cat.New by Decreasing Incident Frequency
Incidents$Cat.New <- factor(Incidents$Cat.New, levels = names(sort(table(Incidents$Cat.New), decreasing = TRUE)))

## Subset the Columns of Interest into 'dataset'
dataset <- Incidents[, c("Cat.New", "Date", "PdDistrict")]

## Set Friendly Names for the 'dataset' columns
## and attach into the global environment for future ease
colnames(dataset) <- c("Category", "Date", "PdDistrict")  
attach(dataset)  

Inference

As a hypothesis, I suspect that some Categories of Incident will occur with higher frequency within some Police Districts. In other words, I suspect that Category and Police District are associated. The structure of the test is as follows:

  • H_0: Incident Category and Police District are independent.
  • H_A: Incident Category and Police District are dependent.

Since both are multi-leveled categorical variables, when dealing with counts & investigating how far the observed counts are from the expected, a Chi-squared hypothesis test for independence is an appropriate technique.

Let's first check the necessary conditions.

  • The independence condition is met, as we have no reason to believe that individual Police Incidents in San Francisco are not independent from each other. As well, each Incident contributes to only a single cell in the Chi-Squared table. In other words: each Incident will have only one Category and occurs in only a one District.

  • For the sample size condition, each combination of Incident & Category do have at least 5 expected cases. The meeting of this condition was aided by chopping off the lengthy tail & focusing on only the TopEleven Incident Categories.

Now to run the Chi-Squared independence test, which also provides a nice overview of the TopEleven data set.

  
## Use Inference() Function to Run the Chi-Squared Test
inference(Category, PdDistrict, type = "ht", method = "theoretical", est = "proportion", sum_stats = FALSE, eda_plot = FALSE,  
    inf_plot = FALSE, alternative = "greater")

Response variable: categorical, Explanatory variable: categorical  
Chi-square test of independence  
H_0: Response and explanatory variable are independent.  
H_A: Response and explanatory variable are dependent.  
Check conditions: expected counts

    Pearson's Chi-squared test

data:  y_table  
X-squared = 6420, df = 189, p-value < 2.2e-16  

Results of the Chi-Squared Test for Indepencence

From the p-value < 2.2e-16 we can determine that, if the null hypothesis were true, then the probability of finding the observed distribution of Incident Category and Police District to be very nearly zero. Hence, the observed data provide good reason to conclude that the Police District and Incident Category were associated during the dates of 11/2013 thru 04/2014.

Additional Analysis

Now that we've established a relationship exists, let us briefly examine the proportion of Incidents for each Category within each District. Side-by-side boxplots of the frequency proportions are appropriate, and we can see below that the frequency of Larceny/Theft Incidents is an outlier in every district except “TENDERLOIN”.

Moreover, Larceny/Theft has the highest frequency proportion in all districts with multiple outlier categories. For this reason, let us also compute the mean frequency of Larceny/Theft Incidents across all Districts and add this as a reference line in red to the side-by-side boxplots.

Visually, the topmost outlier dot in each District will be Larceny/Theft.

  
## Create a PropTable of Incident Frequency for each Category, by District
TEprop <- (as.data.frame.matrix(prop.table(table(Category, PdDistrict))))

## Save the outliers to a vector & print side-by-side Boxplots
outliers <- boxplot(TEprop, plot = TRUE)$out

## In Each District, identify: 
## Which are the Categories with an outlier proportion ?
for (district in 1:ncol(TEprop)) {  
    PdDistrict <- (colnames(TEprop[district]))
    Cat <- rownames(TEprop[which(TEprop[district][, 1] %in% outliers), ])
    print(c(PdDistrict, Cat))
}

[1] "SOUTHERN"       "LARCENY/THEFT"  "OTHER OFFENSES" "NON-CRIMINAL"
[1] "MISSION"        "LARCENY/THEFT"  "OTHER OFFENSES"
[1] "NORTHERN"      "LARCENY/THEFT"
[1] "CENTRAL"       "LARCENY/THEFT" "NON-CRIMINAL"
[1] "BAYVIEW"        "LARCENY/THEFT"  "OTHER OFFENSES"
[1] "INGLESIDE"      "LARCENY/THEFT"  "OTHER OFFENSES"
[1] "TENDERLOIN"
[1] "TARAVAL"       "LARCENY/THEFT"
[1] "PARK"          "LARCENY/THEFT"
[1] "RICHMOND"      "LARCENY/THEFT"

## Add a Horizontal Reference Line for the 
## Mean proportion of Larceny/Theft Incidents across all Districts
abline(h = mean(as.numeric(TEprop[1, ])), col = "red")  

Side-by-Side Boxplots

Conclusions

We've learned that, during the dates examined, 11 of 36 Categories account for 90% of all Police Incidents in San Francisco. We have also found Police District & Incident Category to be associated. We've drawn a visual to highlight the districts which have a proportion of Incidents above & below the mean frequency for Larceny/Theft, which is an outlier in all but one of the districts. From these charts, it is apparent that some Districts do have a visibly higher frequency of Larceny/Theft than others.

While the Chi-Square test does not provide the tools to identify which districts and which categories are responsible for the low p-value in the test. Extreme outlier values for Larceny/Theft do give a non-scientific indication.

It will be interesting to pursue additional research to build upon this analysis. For example:

  • In which District is the frequency of Larceny/Theft statistically significant ?
  • Are these results observable across a broader range of dates ?
  • If so, can the conclusions be generalized to all San Francisco Police Incidents across all dates ?

Curiosities

Why the precipitous drop in incidents during the few days around Christmas ? Is it because the criminals are on vacation, the police, or both ?

plot(Date)  

Reference

The data from this analysis are freely available from the city of San Francisco:

San Francisco Data

Appendix

R Packages

  
# Install Packages
install.packages(c("lubridate","qcc","knitr"))  
require(lubridate)  
require(qcc)  
source("http://bit.ly/dasi_inference")  

A sample of 50 lines from the data set used for Inference:



dataset[250:300, ]

Category       Date PdDistrict  
250  ASSAULT 11/21/2013    MISSION  
251  ASSAULT 03/29/2014    CENTRAL  
252  ASSAULT 04/28/2014   NORTHERN  
253  ASSAULT 01/19/2014    MISSION  
254  ASSAULT 12/01/2013    BAYVIEW  
255  ASSAULT 11/21/2013   RICHMOND  
256  ASSAULT 01/01/2014   NORTHERN  
257  ASSAULT 11/22/2013    CENTRAL  
258  ASSAULT 03/23/2014    CENTRAL  
259  ASSAULT 11/01/2013    MISSION  
260  ASSAULT 02/22/2014   NORTHERN  
261  ASSAULT 12/08/2013   SOUTHERN  
262  ASSAULT 02/03/2014    BAYVIEW  
263  ASSAULT 11/30/2013   NORTHERN  
264  ASSAULT 04/06/2014 TENDERLOIN  
265  ASSAULT 02/23/2014    MISSION  
266  ASSAULT 11/26/2013    BAYVIEW  
267  ASSAULT 01/06/2014 TENDERLOIN  
268  ASSAULT 02/14/2014   RICHMOND  
269  ASSAULT 01/16/2014  INGLESIDE  
270  ASSAULT 11/17/2013   SOUTHERN  
271  ASSAULT 03/28/2014   NORTHERN  
272  ASSAULT 11/12/2013    MISSION  
273  ASSAULT 12/19/2013 TENDERLOIN  
274  ASSAULT 04/29/2014       PARK  
275  ASSAULT 04/11/2014    BAYVIEW  
276  ASSAULT 11/06/2013    TARAVAL  
277  ASSAULT 02/25/2014 TENDERLOIN  
278  ASSAULT 11/29/2013   SOUTHERN  
279  ASSAULT 01/26/2014  INGLESIDE  
280  ASSAULT 11/11/2013  INGLESIDE  
281  ASSAULT 01/11/2014 TENDERLOIN  
282  ASSAULT 04/10/2014  INGLESIDE  
283  ASSAULT 02/15/2014 TENDERLOIN  
284  ASSAULT 12/16/2013    BAYVIEW  
285  ASSAULT 03/04/2014    MISSION  
286  ASSAULT 04/25/2014    MISSION  
287  ASSAULT 03/04/2014   SOUTHERN  
288  ASSAULT 11/08/2013    TARAVAL  
289  ASSAULT 02/14/2014    MISSION  
290  ASSAULT 01/13/2014  INGLESIDE  
291  ASSAULT 12/01/2013    BAYVIEW  
292  ASSAULT 12/02/2013   SOUTHERN  
293  ASSAULT 03/04/2014    CENTRAL  
294  ASSAULT 01/09/2014    TARAVAL  
295  ASSAULT 02/07/2014    CENTRAL  
296  ASSAULT 04/27/2014   SOUTHERN  
297  ASSAULT 11/08/2013   NORTHERN  
298  ASSAULT 12/30/2013    TARAVAL  
299  ASSAULT 11/25/2013 TENDERLOIN  
300  ASSAULT 04/27/2014    CENTRAL  

Word Count: 1,826

References

  1. "SFPDIncidents: data & analysis derived from the SFPD Crime Incident Reporting system", Repository, GitHub, May 16, 2014: https://github.com/keithhelfrich/SFPDIncidents
  2. "SFPD Incidents - Previous Three Months", San Francisco Data, Updated Daily: https://data.sfgov.org/Public-Safety/SFPD-Incidents-Previous-Three-Months/tmnf-yvry#Export