Seattle Crime

Like all big cities, Seattle has areas and neighborhoods that feel safer or less safe than others. Personally, I know I have ideas about where I should be on my guard at night, neighborhoods where I should not park my car for extended periods of time, and areas where it is not safe to buy a house. Are these ideas correct? Do some neighborhoods and areas in Seattle really have more crime than others? Fortunately there is no need to speculate anymore - we have data! In particular, the Seattle Police Department uploads initial police reports to a publicly accessible dataset 6-12 hours after responding to an incident. This data set includes the location, the time, and a summarized description of each offense.

In this project we will examine whether different neighborhoods in Seattle have different levels of crime.

We will use data from the Seattle Police API. You can find all the details here:

https://data.seattle.gov/Public-Safety/Seattle-Police-Department-Police-Report-Incident/7ais-f98f

We start by loading all of the packages required for this project.

(Side note - you can find all of this code in my GitHub: https://github.com/JTDean123)

In [790]:
import urllib.request
import json
import pandas as pd
import matplotlib.pyplot as plt
from IPython.display import Image
from sklearn.decomposition import PCA
from sklearn.preprocessing import scale
import numpy as np
from sklearn.cluster import KMeans
from random import randint
from sklearn.datasets import load_digits

Next we import the most recent crime data and convert it to a pandas data frame.

In [128]:
# Data was pulled on March 17, 2017 at 10AM
crime2017 = urllib.request.urlopen('https://data.seattle.gov/resource/policereport.json')
crime2017 = pd.read_json(crime2017)
crime2017.head(3)
Out[128]:
census_tract_2000 date_reported district_sector general_offense_number hundred_block_location latitude location longitude month occurred_date_or_date_range_start occurred_date_range_end offense_code offense_code_extension offense_type rms_cdw_id summarized_offense_description summary_offense_code year zone_beat
0 7300.1029 2017-03-18T06:31:00 D 201795658 12XX BLOCK OF THOMAS ST 47.620815 {'latitude': '47.620815277', 'needs_recoding':... -122.332344 3 2017-03-18T04:56:00 NaN X 21 DISTURBANCE-OTH 1154709 DISTURBANCE X 2017 D3
1 11102.4008 2017-03-18T03:21:00 S 201795614 77XX BLOCK OF RAINIER AV S 47.533142 {'latitude': '47.53314209', 'needs_recoding': ... -122.269989 3 2017-03-18T03:21:00 NaN 5707 0 TRESPASS 1154704 TRESPASS 5700 2017 S2
2 9300.3029 2017-03-18T02:13:00 O 201795555 1XX BLOCK OF S HANFORD ST 47.575577 {'latitude': '47.575576782', 'needs_recoding':... -122.333740 3 2017-03-18T02:13:00 NaN 5499 0 TRAFFIC 1154683 TRAFFIC 5400 2017 O2

Looking good. Now that the crime data has been converted into a data frame let's check out the dimensions.

In [555]:
crime2017.shape
Out[555]:
(1000, 19)

As mentioned in the Seattle Police API documentation, there are 19 features in this data set. There are also 1000 observations, the default row limit for a single request to this API. Many of these features will not be useful for this project, so we will next create a new data frame containing only those that we need moving forward. Specifically, we will keep:

  • district_sector
  • latitude
  • longitude
  • month
  • summarized_offense_description
  • zone_beat

and we can slice our data frame to include only these.

In [560]:
crime2017 = crime2017[['district_sector', 'latitude', 'longitude', 'month', 'summarized_offense_description', 'zone_beat']]
crime2017.head(3)
Out[560]:
district_sector latitude longitude month summarized_offense_description zone_beat
0 D 47.620815 -122.332344 3 DISTURBANCE D3
1 S 47.533142 -122.269989 3 TRESPASS S2
2 O 47.575577 -122.333740 3 TRAFFIC O2

The next step is to check the data frame for missing data. We first scan the entire data frame to determine if any data is missing.

In [561]:
crime2017.isnull().values.any()
Out[561]:
False

Great - no missing data! The Seattle Police Department has provided us with a clean data set. Next, we take a look at the summarized_offense_description feature to see what types of offenses exist in the data and how many of each there are. We will use value_counts() and convert the result to a DataFrame.

In [562]:
crimeTypes = crime2017['summarized_offense_description'].value_counts()
crimeTypes = pd.DataFrame(crimeTypes)

Now we can visualize the data with a barplot.

In [563]:
plot = crimeTypes.plot(kind='bar', title='The 1000 Most Recent Reported Crimes in Seattle', figsize=(15, 10), legend=False, fontsize=16, edgecolor = 'black', linewidth=2);
plot.set_ylabel('Count', fontsize=16)
plot.set_xlabel('Crime Type', fontsize=16)
plt.show()

Interestingly, we find that the crime reported most often in this data set is car prowling. This makes sense, as finding broken car window glass on curbs is a pretty common sight around town. The corresponding district_sector is also included in this data set, and we can use it to determine the locations of these observations. As described here -

https://www.seattle.gov/police/about-us/about-policing/precinct-and-patrol-boundaries

and shown in map form here -

https://www.seattle.gov/police/information-and-data/tweets-by-beat

Seattle is divided into 17 sectors (letters) and each sector is divided into beats (numbers). This is visualized below.

In [434]:
Image("seattleMap.png")
Out[434]:
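
Before we aggregate by sector, a quick sanity check on the relationship between the two location fields can be helpful. This is just a sketch, under my assumption that each zone_beat code begins with its district_sector letter (e.g. beat 'D3' belongs to sector 'D'):

In [ ]:
# Fraction of reports whose beat code starts with its sector letter
# (assumption: beats are named <sector letter><number>, e.g. 'D3')
matches = crime2017['zone_beat'].str[0] == crime2017['district_sector']
print(matches.mean())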

We can start by determining the number of crimes reported in each sector.

In [564]:
district_sector = crime2017.groupby('district_sector')['month'].count()
district_sector = pd.DataFrame(district_sector)

plot = district_sector.plot(kind='bar', title='Number of Crimes Reported in Seattle Sectors', figsize=(15, 10), legend=False, fontsize=16, edgecolor = 'black', linewidth=2);
plot.set_ylabel('Number of crimes reported', fontsize=16)
plot.set_xlabel('Sector', fontsize=16)
plt.show()

The four sectors with the highest number of crimes reported are, in descending order:

  • M - Downtown
  • K - Downtown / Pioneer Square
  • B - Ballard / Fremont
  • J - Ballard / Greenlake

The sector with the lowest number of reported crimes was O - Georgetown.

Disagree with my neighborhood assignments? Let me know!

It is unclear what sector '99' refers to. A quick check of the latitude/longitude values provided with these reports showed that they occurred inside the sector map shown above, so they may simply be data points with location reporting errors. Regardless, the total number of reports in sector 99 was small, so I will remove these data points moving forward.
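
As a quick check of how rare these reports are, we can count them directly. A minimal sketch, assuming the sector code is stored as the string '99' (as it appears in the raw data); crime2017_no99 is just an illustrative name and is not used later:

In [ ]:
# Count reports assigned to the unexplained sector '99'
print(crime2017['district_sector'].value_counts().get('99', 0))

# If preferred, the rows could also be dropped from the raw data up front
crime2017_no99 = crime2017[crime2017['district_sector'] != '99']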

However, the number of crimes reported in each sector was much closer than I thought it would be. It may be that sector boundaries are drawn so that crimes are distributed approximately evenly between sectors.

To further explore the relationship between crime and neighborhood location we can perform principal component analysis (PCA) on this data set. PCA can be described as the eigendecomposition of a covariance matrix, where the eigenvectors with the largest eigenvalues represent the most significant principal components. Thus, the first principal component is the combination of variables that has the maximum variance. PCA is useful for dimension reduction of large data sets and, more relevant for this example, for determining the variables that explain most of the variance in the data set.
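
To make that description concrete, here is a minimal numpy sketch of the idea on toy data (this is only an illustration, not part of the analysis):

In [ ]:
# PCA as the eigendecomposition of the covariance matrix of standardized data
toy = np.random.randn(100, 5)                      # toy data: 100 samples x 5 features
toy = (toy - toy.mean(axis=0)) / toy.std(axis=0)   # standardize: mean 0, variance 1 per feature

cov = np.cov(toy, rowvar=False)                    # 5 x 5 covariance matrix
eigvals, eigvecs = np.linalg.eigh(cov)             # eigendecomposition (ascending eigenvalues)

order = np.argsort(eigvals)[::-1]                  # largest eigenvalue first = PC1
explained = eigvals[order] / eigvals.sum()         # proportion of variance per component
print(explained)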

So, a PCA of this data set will tell us what variables, or types of crimes, explain the variance in the neighborhoods, or sectors. To begin, we need to build a data matrix that contains crime types in columns and sectors as rows. Each entry will correspond to the number of crime types for a given sector, and we will remove sector 99 for the reasons mentioned above.

In [879]:
crimeArray = crime2017.groupby(['summarized_offense_description','district_sector'])['district_sector'].count().unstack()
crimeArray = crimeArray.drop('99', axis=1)
crimeArray = crimeArray.transpose()
crimeArray.head(5)
Out[879]:
summarized_offense_description ANIMAL COMPLAINT ASSAULT BIKE THEFT BURGLARY BURGLARY-SECURE PARKING-RES CAR PROWL COUNTERFEIT DISPUTE DISTURBANCE DUI ... SHOPLIFTING STOLEN PROPERTY THEFT OF SERVICES THREATS TRAFFIC TRESPASS VEHICLE THEFT VIOLATION OF COURT ORDER WARRANT ARREST WEAPON
district_sector
B NaN 7.0 NaN 3.0 NaN 24.0 NaN NaN 6.0 3.0 ... 1.0 4.0 NaN 2.0 NaN NaN 2.0 NaN 1.0 NaN
C NaN 6.0 2.0 17.0 5.0 2.0 NaN 1.0 2.0 NaN ... 2.0 2.0 NaN 2.0 NaN 2.0 2.0 NaN NaN NaN
D NaN 6.0 NaN 10.0 3.0 11.0 NaN NaN 3.0 NaN ... NaN 5.0 NaN 2.0 NaN 5.0 5.0 NaN 1.0 NaN
E NaN 11.0 NaN 3.0 2.0 9.0 NaN NaN 3.0 NaN ... NaN NaN NaN 5.0 1.0 4.0 1.0 NaN NaN NaN
F NaN NaN NaN 7.0 NaN 4.0 NaN NaN 2.0 NaN ... NaN 1.0 NaN NaN 3.0 NaN 3.0 NaN 5.0 1.0

5 rows × 32 columns

Looking at the data above, we can see that some crime types are not reported in a particular sector. For example, sector B does not have any bike thefts, so we need to convert the NaN values to zero.

In [880]:
crimeArray = crimeArray.replace(np.nan, 0)
crimeArray.head(5)
Out[880]:
summarized_offense_description ANIMAL COMPLAINT ASSAULT BIKE THEFT BURGLARY BURGLARY-SECURE PARKING-RES CAR PROWL COUNTERFEIT DISPUTE DISTURBANCE DUI ... SHOPLIFTING STOLEN PROPERTY THEFT OF SERVICES THREATS TRAFFIC TRESPASS VEHICLE THEFT VIOLATION OF COURT ORDER WARRANT ARREST WEAPON
district_sector
B 0.0 7.0 0.0 3.0 0.0 24.0 0.0 0.0 6.0 3.0 ... 1.0 4.0 0.0 2.0 0.0 0.0 2.0 0.0 1.0 0.0
C 0.0 6.0 2.0 17.0 5.0 2.0 0.0 1.0 2.0 0.0 ... 2.0 2.0 0.0 2.0 0.0 2.0 2.0 0.0 0.0 0.0
D 0.0 6.0 0.0 10.0 3.0 11.0 0.0 0.0 3.0 0.0 ... 0.0 5.0 0.0 2.0 0.0 5.0 5.0 0.0 1.0 0.0
E 0.0 11.0 0.0 3.0 2.0 9.0 0.0 0.0 3.0 0.0 ... 0.0 0.0 0.0 5.0 1.0 4.0 1.0 0.0 0.0 0.0
F 0.0 0.0 0.0 7.0 0.0 4.0 0.0 0.0 2.0 0.0 ... 0.0 1.0 0.0 0.0 3.0 0.0 3.0 0.0 5.0 1.0

5 rows × 32 columns

Looking at the types of crimes that were reported, it seems that although 32 types of crimes appear, they fall naturally into a smaller number of groups. For example, 'Bike Theft' and 'Shoplifting' could be generally grouped as 'Theft'. Therefore we will create new features from the crime types by grouping them as follows:

  • Theft:
    BIKE THEFT, BURGLARY, BURGLARY-SECURE PARKING-RES, CAR PROWL, MAIL THEFT, ROBBERY, SHOPLIFTING, STOLEN PROPERTY, THEFT OF SERVICES, VEHICLE THEFT

  • Fraud:
    COUNTERFEIT, EMBEZZLE, FORGERY, FRAUD

  • Violence:
    ASSAULT, THREATS, WEAPON, INJURY

  • Drugs:
    DUI, NARCOTICS

  • Property:
    LOST PROPERTY, OTHER PROPERTY, PROPERTY DAMAGE

  • Court Violation:
    VIOLATION OF COURT ORDER, WARRANT ARREST

  • Public Offense:
    DISPUTE, DISTURBANCE, OBSTRUCT, PUBLIC NUISANCE, TRESPASS

Note that we will leave ANIMAL COMPLAINT and TRAFFIC as their own categories as they do not naturally fall into one of the new categories listed above. Using this grouping strategy we can reduce 32 features to 9. This grouping is done as follows below.
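
(As an aside, the same grouping could be written more compactly by summing over a mapping from each new category to its source columns. This is only a sketch - it assumes every listed offense type appears as a column in the pulled data, which is the same assumption the cell below makes, and the grouped variable is not used elsewhere.)

In [ ]:
# Map each new category to the original offense columns it combines
groups = {
    'Theft': ['BIKE THEFT', 'BURGLARY', 'BURGLARY-SECURE PARKING-RES', 'CAR PROWL',
              'MAIL THEFT', 'ROBBERY', 'SHOPLIFTING', 'STOLEN PROPERTY',
              'THEFT OF SERVICES', 'VEHICLE THEFT'],
    'Fraud': ['COUNTERFEIT', 'EMBEZZLE', 'FORGERY', 'FRAUD'],
    'Violence': ['ASSAULT', 'THREATS', 'WEAPON', 'INJURY'],
    'Drugs': ['DUI', 'NARCOTICS'],
    'Property': ['LOST PROPERTY', 'OTHER PROPERTY', 'PROPERTY DAMAGE'],
    'Court Violation': ['VIOLATION OF COURT ORDER', 'WARRANT ARREST'],
    'Public Offense': ['DISPUTE', 'DISTURBANCE', 'OBSTRUCT', 'PUBLIC NUISANCE', 'TRESPASS'],
    'Traffic': ['TRAFFIC'],
    'Animal Complaint': ['ANIMAL COMPLAINT'],
}
grouped = pd.DataFrame({name: crimeArray[cols].sum(axis=1) for name, cols in groups.items()})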

In [881]:
crimeArray['Theft'] = crimeArray['BIKE THEFT'] + crimeArray['BURGLARY'] + crimeArray['BURGLARY-SECURE PARKING-RES'] + crimeArray['CAR PROWL'] + crimeArray['MAIL THEFT'] + crimeArray['ROBBERY'] + crimeArray['SHOPLIFTING'] + crimeArray['STOLEN PROPERTY'] + crimeArray['THEFT OF SERVICES']  + crimeArray['VEHICLE THEFT']  

crimeArray['Fraud'] = crimeArray['COUNTERFEIT'] + crimeArray['EMBEZZLE'] + crimeArray['FORGERY'] + crimeArray['FRAUD'] 

crimeArray['Violence'] = crimeArray['ASSAULT'] + crimeArray['THREATS'] + crimeArray['WEAPON'] + crimeArray['INJURY'] 

crimeArray['Drugs'] = crimeArray['DUI'] + crimeArray['NARCOTICS']

crimeArray['Property'] = crimeArray['LOST PROPERTY'] + crimeArray['OTHER PROPERTY'] + crimeArray['PROPERTY DAMAGE']

crimeArray['Court Violation'] = crimeArray['VIOLATION OF COURT ORDER'] + crimeArray['WARRANT ARREST']

crimeArray['Public Offense'] = crimeArray['DISPUTE'] + crimeArray['DISTURBANCE'] + crimeArray['OBSTRUCT'] + crimeArray['PUBLIC NUISANCE']  + crimeArray['TRESPASS']

crimeArray['Traffic'] = crimeArray['TRAFFIC']

crimeArray['Animal Complaint'] = crimeArray['ANIMAL COMPLAINT']

crimeArray = crimeArray.iloc[:,32:]
crimeArray.head(25)
Out[881]:
summarized_offense_description Theft Fraud Violence Drugs Property Court Violation Public Offense Traffic Animal Complaint
district_sector
B 39.0 8.0 9.0 3.0 16.0 1.0 6.0 0.0 0.0
C 34.0 4.0 8.0 0.0 8.0 0.0 7.0 0.0 0.0
D 36.0 2.0 8.0 0.0 7.0 1.0 9.0 0.0 0.0
E 15.0 3.0 17.0 0.0 8.0 0.0 7.0 1.0 0.0
F 15.0 4.0 1.0 1.0 18.0 5.0 2.0 3.0 0.0
G 9.0 1.0 5.0 0.0 6.0 1.0 11.0 0.0 2.0
J 55.0 2.0 2.0 0.0 16.0 1.0 4.0 0.0 0.0
K 22.0 7.0 32.0 1.0 5.0 6.0 13.0 0.0 0.0
L 21.0 12.0 4.0 0.0 4.0 0.0 0.0 0.0 0.0
M 52.0 7.0 10.0 0.0 4.0 4.0 12.0 0.0 0.0
N 29.0 10.0 2.0 2.0 10.0 5.0 5.0 0.0 0.0
O 11.0 1.0 0.0 1.0 9.0 2.0 0.0 1.0 0.0
Q 23.0 3.0 4.0 0.0 13.0 2.0 6.0 0.0 0.0
R 35.0 8.0 7.0 0.0 6.0 1.0 11.0 0.0 0.0
S 35.0 0.0 4.0 0.0 2.0 2.0 1.0 0.0 0.0
U 23.0 4.0 7.0 0.0 9.0 3.0 3.0 0.0 0.0
W 34.0 7.0 5.0 3.0 3.0 0.0 8.0 1.0 0.0

Looking good! We should take note that there were only two Animal Complaints, and they were both in Sector G.

We are now in a position to perform PCA on this data set, which we can do with the sklearn library as shown below. Since PCA is sensitive to the absolute scale of the data (remember that we are computing the eigendecomposition of a covariance matrix, not a correlation matrix), we also need to normalize the data. We will normalize each feature to a mean of zero and variance of one.

In [568]:
pca = PCA()
crimePCA = pca.fit(scale(crimeArray))
cumsum = crimePCA.explained_variance_ratio_.cumsum()

with plt.style.context('seaborn-whitegrid'):
    plt.figure(figsize=(6, 4))
    
    font = {'weight' : 'bold',
        'size'   : 18}
    
    plt.rc('font', **font)
    
    plt.plot(range(1,len(cumsum)+1), cumsum, '-o', color='b')
    plt.xlabel('Principal Component')
    plt.ylabel('Cumulative Proportion')    
    plt.tight_layout()
    plt.show()
    plt.rcdefaults()

As shown above, over 70% of the variation in the data is captured by the first four principal components, and we can explore this further.
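
For reference, the same information can be printed numerically (exact values will depend on when the data was pulled):

In [ ]:
# Per-component and cumulative proportion of variance explained
for i, (ratio, total) in enumerate(zip(crimePCA.explained_variance_ratio_, cumsum), 1):
    print('PC{}: {:.1%} of variance (cumulative {:.1%})'.format(i, ratio, total))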

In [584]:
pcDF = pd.DataFrame(crimePCA.components_[0:4])
pcDF = pcDF.transpose()
pcDF.columns = ['PC1', 'PC2', 'PC3', 'PC4']
indexes = crimeArray.columns.tolist()
pcDF.index = indexes
pcDF
Out[584]:
PC1 PC2 PC3 PC4
Theft -0.216239 -0.268253 -0.560047 -0.305941
Fraud -0.178603 -0.509305 -0.054975 0.451431
Violence -0.456637 -0.162421 0.387359 -0.208361
Drugs 0.117521 -0.479407 0.109732 0.530480
Property 0.427143 -0.182437 0.040754 -0.291975
Court Violation -0.002167 -0.333201 0.477726 -0.414472
Public Offense -0.533502 -0.018754 0.249167 -0.039958
Traffic 0.471814 -0.107496 0.377428 -0.015298
Animal Complaint -0.096566 0.506185 0.294649 0.344847

We can now interpret the contributors to these principal components. The value of each feature in each principal component, or loading, tells us how much that feature contributes to the component.

For example, in PC1 we find a large negative value for Violence and a large positive value for Traffic. This means that if a sector has a large negative PC1 it likely has high Violence and low Traffic crime types, and the opposite is true if a sector has a positive PC1.

  • PC1 increases with traffic and property crime reports and decreases with violence and public offenses.

  • PC2 decreases with fraud and drugs and increases with animal complaints.

  • PC3 decreases with theft and increases with violence, court violations, and traffic crime reports.

  • PC4 increases strongly with fraud and drug crime types.

It is interesting that although the Violence and Drugs crime types are both significant variables, meaning they explain much of the variance in the data set, they do not carry large loadings in the same principal component, which suggests that these two features do not vary together.
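
A small helper like the following makes the loading table easier to scan; this is just a convenience sketch using the pcDF data frame defined above:

In [ ]:
# For each principal component, list the features with the largest absolute loadings
for pc in pcDF.columns:
    top = pcDF[pc].abs().sort_values(ascending=False).head(3)
    print(pc, '->', ', '.join(str(feature) for feature in top.index))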

We find that about 45% of the variance in the data set is captured by PC1 and PC2, and we can project the data onto this new feature space. Including PC3 would capture over 60% of the variance, but that would require projecting and plotting in a 3D space, and I hate 3D plots!

In [583]:
crimeProject = pca.fit_transform(scale(crimeArray))
sector = crimeArray.index.tolist()

with plt.style.context('seaborn-whitegrid'):
    plt.figure(figsize=(8, 8))
    
    font = {'weight' : 'bold',
        'size'   : 22}

    plt.rc('font', **font)
    
    for i, labels in enumerate(sector):
        plt.scatter(crimeProject[i,0], crimeProject[i,1], s=250)
        plt.annotate(labels, (crimeProject[i,0], crimeProject[i,1]))
    
    plt.xlabel('Principal Component 1')
    plt.ylabel('Principal Component 2') 
    plt.tight_layout()
    plt.show()
    plt.rcdefaults()