As a resident of San Francisco, I am curious to gain a better understanding of Police Incidents in my city. I would like to know if certain types of incidents occur in higher concentrations within some Police Districts over others.

Judging by data derived from the SFPD Crime Incident Reporting system, was there a relationship between the Police District in San Francisco and the Category of Police Incident during the six months from November, 2013 thru April, 2014?

This analysis is relevant and important to all city residents, city planners, and those who will decide which neighborhoods are suitable for their life activities with regard to certain types of crime.

## Sources

- Work with the repo I maintain on github keithhelfrich/SFPD_Incidents

- Direct from sfgov.org, rolling three months San Francisco Data

## An Introduction to the Data

The data for this analysis include the entire population of all Police Incidents derived from SFPD Crime Incident Reporting system during the time period. Each observational case is a single Police Incident. Both variables of interest are categorical and non-ordinal: **“PdDistrict”** and **“Category”**.

This analysis is a retrospective observational study. Incident data are collected in a way that does not directly interfere with how the data arose, i.e. we have merely “observed” the incidents after they occurred. This observational study is retrospective because it uses data from the past. Hence, the analysis seeks only to establish an association between the explanatory and response variables and is not sufficient to determine causation.

Because the data set contains only six months of data, we will be cautious to avoid generalizing conclusions to time frames greatly beyond the time period in question. Potential sources of bias that prevent generalizing further include the year, the season, or other other variability in the pattern of police incidents by neighborhood over time.

# Exploratory Analysis

There were **61475** incidents during the six month date range. This is roughly **~10k incidents per month**:

```
## Check the Date Range
dates <- as.Date(Incidents$Date, format = "%m/%d/%Y")
range(dates)
[1] "2013-11-01" "2014-04-30"
## Total Number of Incidents
nrow(Incidents)
[1] 61475
## Table an Incident Count by Month
table(year(dates), months(dates))
April December February January March November
2013 0 9606 0 0 0 10518
2014 9806 0 9852 10693 11000 0
```

There are **10 Districts** in total, ranging from Bayview to the Tenderloin, shown below sequenced by Incident count.

```
## Sort District by Decreasing Incident Frequency
Incidents$PdDistrict <- factor(Incidents$PdDistrict, levels = names(sort(table(Incidents$PdDistrict), decreasing = TRUE)))
## Plot Police District
plot(Incidents$PdDistrict, main = "Count of Incidents by PdDistrict", sub = "Sort Descending", xlab = "PdDistrict", ylab = "Incident Count")
```

While there are **36 Categories**, ranging from ARSON to WEAPON LAWS, **the top 11 Categories** account for **90% of all Police Incidents**:

```
# Sort Category by Decreasing Incident Frequency
Incidents$Category <- factor(Incidents$Category, levels = names(sort(table(Incidents$Category), decreasing = TRUE)))
# Calculate the Proportion of Total Incidents in the Top Eleven Categories
sum(table(Incidents$Category)[1:11])/nrow(Incidents)
[1] 0.8991
# Build a Pareto Chart
pareto.chart(table(Incidents$Category))
```

```
Pareto chart analysis for table(Incidents$Category)
Frequency Cum.Freq. Percentage Cum.Percent.
LARCENY/THEFT 16386 16386 26.654738 26.65
OTHER OFFENSES 7905 24291 12.858886 39.51
NON-CRIMINAL 7235 31526 11.769012 51.28
ASSAULT 4771 36297 7.760878 59.04
VANDALISM 3297 39594 5.363156 64.41
WARRANTS 3101 42695 5.044327 69.45
VEHICLE THEFT 3036 45731 4.938593 74.39
DRUG/NARCOTIC 3013 48744 4.901179 79.29
BURGLARY 2955 51699 4.806832 84.10
MISSING PERSON 2077 53776 3.378609 87.48
SUSPICIOUS OCC 1497 55273 2.435136 89.91
ROBBERY 1493 56766 2.428630 92.34
FRAUD 1258 58024 2.046360 94.39
WEAPON LAWS 611 58635 0.993900 95.38
TRESPASS 523 59158 0.850752 96.23
DRUNKENNESS 379 59537 0.616511 96.85
FORGERY/COUNTERFEITING 325 59862 0.528670 97.38
SEX OFFENSES, FORCIBLE 272 60134 0.442456 97.82
KIDNAPPING 260 60394 0.422936 98.24
DISORDERLY CONDUCT 176 60570 0.286295 98.53
DRIVING UNDER THE INFLUENCE 166 60736 0.270028 98.80
RECOVERED VEHICLE 128 60864 0.208215 99.01
ARSON 112 60976 0.182188 99.19
PROSTITUTION 110 61086 0.178935 99.37
RUNAWAY 110 61196 0.178935 99.55
LIQUOR LAWS 105 61301 0.170801 99.72
EMBEZZLEMENT 44 61345 0.071574 99.79
SUICIDE 39 61384 0.063440 99.85
FAMILY OFFENSES 28 61412 0.045547 99.90
EXTORTION 16 61428 0.026027 99.92
LOITERING 14 61442 0.022773 99.95
SEX OFFENSES, NON FORCIBLE 12 61454 0.019520 99.97
BAD CHECKS 8 61462 0.013013 99.98
STOLEN PROPERTY 7 61469 0.011387 99.99
GAMBLING 4 61473 0.006507 100.00
BRIBERY 2 61475 0.003253 100.00
```

The pareto analysis above shows the cumulative frequency of all Categories of Incident.

# Data Prep

Because **the top 22 Categories** account for **99% of all Police Incidents**, the remainder of this analsyis will lump the long tail into “OTHER OFFENCES”.

The R code employed to lump Categories 21 thru 36 of the long tail into “OTHER OFFENCES” and produce the **dataset** for analysis as only a subset of the original columns is shown below:

```
## Load the various overlapping data sets
Incidents_1 <- read.csv("SFPD_Incidents-2013.11.01-2014.01.31.csv")
Incidents_2 <- read.csv("SFPD_Incidents_2014.02.01-2014.04.27.csv")
Incidents_3 <- read.csv("SFPD_Incidents_2014.04.20-2014.04.30.csv")
## Subset the Unique Records
Incidents <- unique(rbind(Incidents_1, Incidents_2, Incidents_3))
## Sort the Categories by Decreasing Incident Frequency
Incidents$Category <- factor(Incidents$Category, levels = names(sort(table(Incidents$Category), decreasing = TRUE)))
## Create a new factor variable called Cat.New
## Lump the lower tail into "OTHER OFFENSES"
Cat.Orig <- levels(Incidents$Category)
Cat.New <- Cat.Orig[1:20]
Cat.New[21:length(Cat.Orig)] <- rep("OTHER OFFENSES", length(Cat.Orig) - 20)
## Build a Cross Reference Table and
## Merge Cat.New into the Incidents data frame
Map.Table <- data.frame(Cat.Orig, Cat.New)
Incidents <- merge(x = Incidents, y = Map.Table, by.x = "Category", by.y = "Cat.Orig")
## Sort Cat.New by Decreasing Incident Frequency
Incidents$Cat.New <- factor(Incidents$Cat.New, levels = names(sort(table(Incidents$Cat.New), decreasing = TRUE)))
## Subset the Columns of Interest into 'dataset'
dataset <- Incidents[, c("Cat.New", "Date", "PdDistrict")]
## Set Friendly Names for the 'dataset' columns
## and attach into the global environment for future ease
colnames(dataset) <- c("Category", "Date", "PdDistrict")
attach(dataset)
```

# Inference

As a hypothesis, I suspect that some Categories of Incident will occur with higher frequency within some Police Districts. In other words, I suspect that Category and Police District are associated. The structure of the test is as follows:

- H_0: Incident Category and Police District are independent.
- H_A: Incident Category and Police District are dependent.

Since both are multi-leveled categorical variables, when dealing with counts & investigating how far the observed counts are from the expected, a Chi-squared hypothesis test for independence is an appropriate technique.

Let's first check the necessary conditions.

The independence condition is met, as we have no reason to believe that individual Police Incidents in San Francisco are not independent from each other. As well, each Incident contributes to only a single cell in the Chi-Squared table. In other words: each Incident will have only one Category and occurs in only a one District.

For the sample size condition, each combination of Incident & Category

**do**have at least 5 expected cases. The meeting of this condition was aided by chopping off the lengthy tail & focusing on only the TopEleven Incident Categories.

Now to run the Chi-Squared independence test, which also provides a nice overview of the TopEleven data set.

```
## Use Inference() Function to Run the Chi-Squared Test
inference(Category, PdDistrict, type = "ht", method = "theoretical", est = "proportion", sum_stats = FALSE, eda_plot = FALSE,
inf_plot = FALSE, alternative = "greater")
Response variable: categorical, Explanatory variable: categorical
Chi-square test of independence
H_0: Response and explanatory variable are independent.
H_A: Response and explanatory variable are dependent.
Check conditions: expected counts
Pearson's Chi-squared test
data: y_table
X-squared = 6420, df = 189, p-value < 2.2e-16
```

## Results of the Chi-Squared Test for Indepencence

From the **p-value < 2.2e-16** we can determine that, if the null hypothesis were true, then the probability of finding the observed distribution of Incident Category and Police District to be **very nearly zero**. Hence, the observed data provide good reason to conclude that the Police District and Incident Category were associated during the dates of 11/2013 thru 04/2014.

# Additional Analysis

Now that we've established a relationship exists, let us briefly examine the proportion of Incidents for each Category within each District. Side-by-side boxplots of the frequency proportions are appropriate, and we can see below that the frequency of **Larceny/Theft** Incidents is an outlier in every district except “TENDERLOIN”.

Moreover, **Larceny/Theft** has the highest frequency proportion in all districts with multiple outlier categories. For this reason, let us also compute the mean frequency of **Larceny/Theft** Incidents across all Districts and add this as a reference line in red to the side-by-side boxplots.

Visually, the topmost outlier dot in each District will be **Larceny/Theft**.

```
## Create a PropTable of Incident Frequency for each Category, by District
TEprop <- (as.data.frame.matrix(prop.table(table(Category, PdDistrict))))
## Save the outliers to a vector & print side-by-side Boxplots
outliers <- boxplot(TEprop, plot = TRUE)$out
## In Each District, identify:
## Which are the Categories with an outlier proportion ?
for (district in 1:ncol(TEprop)) {
PdDistrict <- (colnames(TEprop[district]))
Cat <- rownames(TEprop[which(TEprop[district][, 1] %in% outliers), ])
print(c(PdDistrict, Cat))
}
[1] "SOUTHERN" "LARCENY/THEFT" "OTHER OFFENSES" "NON-CRIMINAL"
[1] "MISSION" "LARCENY/THEFT" "OTHER OFFENSES"
[1] "NORTHERN" "LARCENY/THEFT"
[1] "CENTRAL" "LARCENY/THEFT" "NON-CRIMINAL"
[1] "BAYVIEW" "LARCENY/THEFT" "OTHER OFFENSES"
[1] "INGLESIDE" "LARCENY/THEFT" "OTHER OFFENSES"
[1] "TENDERLOIN"
[1] "TARAVAL" "LARCENY/THEFT"
[1] "PARK" "LARCENY/THEFT"
[1] "RICHMOND" "LARCENY/THEFT"
## Add a Horizontal Reference Line for the
## Mean proportion of Larceny/Theft Incidents across all Districts
abline(h = mean(as.numeric(TEprop[1, ])), col = "red")
```

# Conclusions

We've learned that, during the dates examined, 11 of 36 Categories account for 90% of all Police Incidents in San Francisco. We have also found Police District & Incident Category to be associated. We've drawn a visual to highlight the districts which have a proportion of Incidents above & below the mean frequency for Larceny/Theft, which is an outlier in all but one of the districts. From these charts, it is apparent that some Districts do have a visibly higher frequency of Larceny/Theft than others.

While the Chi-Square test does not provide the tools to identify which districts and which categories are responsible for the low p-value in the test. Extreme outlier values for **Larceny/Theft** do give a non-scientific indication.

It will be interesting to pursue additional research to build upon this analysis. For example:

- In which District is the frequency of Larceny/Theft statistically significant ?
- Are these results observable across a broader range of dates ?
- If so, can the conclusions be generalized to all San Francisco Police Incidents across all dates ?

## Curiosities

Why the precipitous drop in incidents during the few days around Christmas ? Is it because the criminals are on vacation, the police, or both ?

```
plot(Date)
```

# Reference

The data from this analysis are freely available from the city of San Francisco:

## Appendix

R Packages

```
# Install Packages
install.packages(c("lubridate","qcc","knitr"))
require(lubridate)
require(qcc)
source("http://bit.ly/dasi_inference")
```

A sample of 50 lines from the data set used for Inference:

```
dataset[250:300, ]
Category Date PdDistrict
250 ASSAULT 11/21/2013 MISSION
251 ASSAULT 03/29/2014 CENTRAL
252 ASSAULT 04/28/2014 NORTHERN
253 ASSAULT 01/19/2014 MISSION
254 ASSAULT 12/01/2013 BAYVIEW
255 ASSAULT 11/21/2013 RICHMOND
256 ASSAULT 01/01/2014 NORTHERN
257 ASSAULT 11/22/2013 CENTRAL
258 ASSAULT 03/23/2014 CENTRAL
259 ASSAULT 11/01/2013 MISSION
260 ASSAULT 02/22/2014 NORTHERN
261 ASSAULT 12/08/2013 SOUTHERN
262 ASSAULT 02/03/2014 BAYVIEW
263 ASSAULT 11/30/2013 NORTHERN
264 ASSAULT 04/06/2014 TENDERLOIN
265 ASSAULT 02/23/2014 MISSION
266 ASSAULT 11/26/2013 BAYVIEW
267 ASSAULT 01/06/2014 TENDERLOIN
268 ASSAULT 02/14/2014 RICHMOND
269 ASSAULT 01/16/2014 INGLESIDE
270 ASSAULT 11/17/2013 SOUTHERN
271 ASSAULT 03/28/2014 NORTHERN
272 ASSAULT 11/12/2013 MISSION
273 ASSAULT 12/19/2013 TENDERLOIN
274 ASSAULT 04/29/2014 PARK
275 ASSAULT 04/11/2014 BAYVIEW
276 ASSAULT 11/06/2013 TARAVAL
277 ASSAULT 02/25/2014 TENDERLOIN
278 ASSAULT 11/29/2013 SOUTHERN
279 ASSAULT 01/26/2014 INGLESIDE
280 ASSAULT 11/11/2013 INGLESIDE
281 ASSAULT 01/11/2014 TENDERLOIN
282 ASSAULT 04/10/2014 INGLESIDE
283 ASSAULT 02/15/2014 TENDERLOIN
284 ASSAULT 12/16/2013 BAYVIEW
285 ASSAULT 03/04/2014 MISSION
286 ASSAULT 04/25/2014 MISSION
287 ASSAULT 03/04/2014 SOUTHERN
288 ASSAULT 11/08/2013 TARAVAL
289 ASSAULT 02/14/2014 MISSION
290 ASSAULT 01/13/2014 INGLESIDE
291 ASSAULT 12/01/2013 BAYVIEW
292 ASSAULT 12/02/2013 SOUTHERN
293 ASSAULT 03/04/2014 CENTRAL
294 ASSAULT 01/09/2014 TARAVAL
295 ASSAULT 02/07/2014 CENTRAL
296 ASSAULT 04/27/2014 SOUTHERN
297 ASSAULT 11/08/2013 NORTHERN
298 ASSAULT 12/30/2013 TARAVAL
299 ASSAULT 11/25/2013 TENDERLOIN
300 ASSAULT 04/27/2014 CENTRAL
```

**Word Count: 1,826**

# References

- "SFPD
*Incidents: data & analysis derived from the SFPD Crime Incident Reporting system", Repository, GitHub, May 16, 2014: https://github.com/keithhelfrich/SFPD*Incidents - "SFPD Incidents - Previous Three Months", San Francisco Data, Updated Daily: https://data.sfgov.org/Public-Safety/SFPD-Incidents-Previous-Three-Months/tmnf-yvry#Export