Prologue

About us

We (Agathe Vukelic, Claire Bilat, Florence Diener and Gaëtan Michelet) are four students who decided to complete a master’s degree in Digital Investigation and Identification at University of Lausanne. As part of this, we found ourselves trying to make a success of an EPFL course intituled “Applied Data Analysis”. We were asked to developp a project under the theme data science for social good. A variety of datasets were provided by the course’s organisators including the Chicago Food Inspections Database, which we selected for our project. However, as forensic sciences students, we could not miss the opportunity to make everything about criminology, which ultimately led to this study. We indeed added a database of the crimes in Chicago in order to analyse the correlation, for a given community area in Chicago, between the number of crimes committed and the hygiene of the food establishments (among other various things).

About the databases used

The first dataset that we used is the Chicago Food Inspections Database linked above and provided by the Chicago department of Public Health’s Food Protection Program. It contains informations regarding the inspection reports of food establishments in Chicago from 2010 to the present. Some informations are directly linked to the establishments like their exact location, their type (restaurants, coffee shop, …) or their license number. Some other informations document the inspections realized like their result (pass, fail, …) or the violations noticed.

The second dataset used for this project is the Crimes in Chicago Database also linked above and provided by the Chicago Police Department’s Citizen Law Enforcement Analysis and Reporting. It contains informations regarding the reported incidents of crime that occured in Chicago from 2001 to present, like the primary type of the crime (homicide, burglary…) or its anonymized location.

The third dataset used for this projet is the Chicago Business Licenses and Owners Database provided by the City of Chicago. It contains informations about the owners of establishments in Chicago and allowed us to link the food establishments of the first dataset to their respective owners.

The fourth and last dataset used for this project is the Geographic Boundaries of Community Areas in Chicago Database provided by the City of Chicago. This dataset allowed us to link the food establishments of the first dataset to their respective community areas.

About this project

Research questions

The main research question was : where to eat safely in Chicago ?

The adjective safely has been chosen wisely because it has multiple meanings: you can eat in a safe way making sure that the establishment where you go respects particular hygiene rules, but also that the place is safe according to the crime rate of its district. That said, two scores has been computed per community area : their hygiene score based on the food inspections performed in the area and their crime score based on the incidents of crime reported in the area. More about the exact calculation of those two scores will be explained later.

It has been decided that additionally to the geographic component of the analysis (based on the geographic delimitations of the community areas) a temporal component would be taken into account. Indeed, the field of restoration is known to be in a constant evolution : each year, many establishments are opening while other are closing or changing of owner - especially in big cities. This is why the different calculations and comparisons has been done by year. Also, the main research question (where to eat safely in Chicago) has been given based on the latest data, considering that it would be the most useful.

It has also been decided to distribute the facility types into facility groups that fall into two main categories :

  • The private establishments, where it is possible to eat a main course (for example, the places where you can only eat an ice cream were deleted of our list)
  • The public establishments like school cafeterias and hospitals

More about the exact distribution of those Facility Groups will be explained later.

Some other research questions were added to the project :

  • About the HygieneScores :
    • Are they significant differences of Hygiene Scores between the community areas ?
    • How is the evolution over time ?
    • Which group of facilities has better hygiene score ?
    • Are they significant differences of Hygiene scores between the private and the public establishments ?
  • About the management of food establishments :
    • Is there a relation between the number of establishments that an owner has and the hygiene scores obtained ?

Preprocessing

This part of our project is not the most thrilling so we will not go over every detail of the preprocessing but only mention a few ones that could maybe help the reader to gain a better global comprehension of the project. Please refer to our project notebook if you have an insatiable curiosity about it.

The Chicago Food Inspections Database

The Facility group

The Database contains informations about the facility type of the establishments inspected but there were too many different types of facility for the purpose of our project. As explained before, we created two main categories, private and public establishments, containing a few custom facility groups into which the facility types of interest are distributed. Those facility groups are listed below.

private facility groups : - restaurant - grocery restaurant - banquet - rooftop restaurant - bar restaurant - bakery restaurant - liquor restaurant - catering - golden diner

public facility groups : - day care - school - childrens services - adulte care

The Hygiene score

In order to compute the hygiene score of a community area, we first computed the hygiene score of an inspection. To do so, we took into account the inspection’s result and the number of violations detected during the inspection. There is three main possible outcomes of an inspection : either the facility passes the inspection or it passes with conditions or it fails. We attributed an arbitrary score to each of those outcomes :

  • Pass = 1
  • Pass with conditions = 2
  • Fail = 3

The formula used to compute the hygiene score of an inspection is the following :

We gave much more weight to the result of the inspection because there are some violations more serious than others which can cause some biais : an establishment could fail its inspection by violating only one serious hygiene rule and another establishment could pass its inspection (with conditions) by violating two or more less serious hygiene rules.

The hygiene score of an establishment, per year is defined by the mean of the hygiene scores from every inspections performed within the year in the particular establishment.

The hygiene score of a community area, per year is defined by the mean of the hygiene scores from every inspections that took place within the year in the particular area.

We took the mean and not the sum because the distribution of inspections within the different community areas is not homogen and we could not assume that the ratio between inspected and uninspected establishments is the same for each community area.

Limits : An important point is to pay attention to the number of inspected establishments compared to the total number of establishments. It is certain than the variations of this ratio between the community areas has an impact on the results.

The Crimes in Chicago Database

The Crime score

In order to compute the crime score of a community area, we first computed the crime score of a reporterd crime. To do so, we took into account the crime’s minimum sentence (in term of years of imprisonment) provided by the Illinois Penalty Code. For the crimes where the minimum sentence is not imprisonment, we fixed the crime score to 0,1.

The crime score of a community area, per year is defined by the sum of the crime scores from every crimes reported within the year in the particular area.

Here we took the sum because we assumed that there was no huge difference between the ratios of reported vs unreported crimes for each community area.

Results and discussion

HygieneScores / Community Area

General Visualization

The following figure allows to visualize the HygieneScores per Year for each Community Area. The Mean line helps to see the bars tending to deviate from the Median computed.

      unselect the years you don’t want to display by clicking on their label on the right

We can see that there is no particular trend, with a Median oscillating between about 30 and 60, except for on entry, which is really above the rest, for the 47th Community Area with a Median at 4. This community area has low HygieneScores, exception made of 2013.

Visualization Year By Year

The following figure allows a visualization of the results in another way - each bar contains all the HygieneScores of the Community Areas.

Zooming in, we can notice again that the repartition for each Year seems to be random.

We also have plotted the HygieneScores in descending order, to see if we could learn something from it but neither the top 10 nor the bottom 10 are alike, except for the 47th Community Area which stays at the top 1 - getting the lowest scores so the best results.

Now let’s have a look at the map of Chicago, in order to have a geospatial view of the results.

As well as we saw on the figure 1.1 and 1.2, the HygieneScore doesn’t seem to follow any rule relative to the Community Areas. The maps for the other years - 2011, 2012, 2013, 2014, 2015, 2016 and 2017 - are showing the exact same absence of trend which ends up being a particular trend.

Correlation

The .corr() function gives the Pearson Coefficient between the HygieneScores and the Community Areas.

The result of the correlation computation is in accordance with the rest of the analysis : there is indeed no relation between the HygieneScore and the Community Areas.

Considering the constant variations in the food domain, the results obtained could simply indicated that the inspections are fair, following the phenomenon of changes, unpredictable due to the behaviors in the restauration industry.

N.B. : The results observed for Burnside, the 47th Community Area, can be explained with the White Flight phenomenon, leading the businesses to move away, but also with the geographical situation on the border of the city, which leads Burnside to be more a “comfortable residential community”. Source : wikipedia

The low HygieneScores could either be explained by a small number of establishments with correct HygieneScores running in a zone whose comfort could ease the maintenance of the establishments.


HygieneScores / Facility Type

The data also give the Facility Type of each establishments. As explained at the begining of our story, they have been put into groups in order to obtain meaningful results.

On this figure, we can study the differences between Public and Private establishments.

Again no particular trend can be observed for the Public establishments, except for the category Restaurant which happens to have minimized variations until 2018 where the rise of the HygieneScores is huge. The Restaurant type of facility is the one with the more entries in the dataset.

For the Private establishments, we can say that the HygieneScores are more stable. They seems to follow the same trend, with a rise in 2018 and 2019.

The Private establishements are the more sensitive. Because of the way they works - children or elderly are in their care and those care are often expensive - they generally have to follow specific rules. Their particular duty could explain the fact that their results are more constant than the ones of the Public establishments.


HygieneScores / Owner

We thought that it would be interesting to calculate the Pearson Coefficient between the HygieneScores and the Number of Restaurants by Owner. Using the .corr() function, we obtained the result : PCC = -0.29536601012489433

The HygieneScores and the Number of Restaurants by Owner are not correlated following the Pearson method.

The result obtained is contrary to our thinking : we thought that the more establishments a owner has, the more able he is to enforce rules fitting the Food Code - owning several establishements would induce more experiment and resources. Apparently, this is not the case !


CrimeScores / Community Area

General Visualization

The following figure the visualization of the CrimeScores per Year for each Community Area.

      unselect the years you don’t want to display by clicking on their label on the right

This figure allows to see in details the CrimeScores of each Community Area. We can already see that the CrimeScore are very different inter-Community Areas, but stay in the same ranges intra-Community Areas. Mostly, the CrimeScores have decreased between 2010 and 2015, then have rised in 2016.

Now let’s take a look at the city map for another point of view.


As we can see from the map, the crime score stays more or less stable with a slight decrease from 2010 to 2015. In 2017, as on the figure 4.1, we can observe very low crime scores everywhere. As we read nothing that could explain this, we think it is an issue in the dataset, maybe a non-complet dataset for this year.

We can also see that one community area particularly stands out every year (except 2017) : Austin. Getting information on this community area we find it is one of the most populated community area and that there is in Austin a lot of violent crimes. That explains the high crime score that it gets as crimes scores are calculated based on the penalty of the crimes.

Considering the previous results in the HygieneScores sections, this rise cannot be explain by our analysis of the Chicago Food Inspections yet, but it already has been discussed. Quickly searching on Google, many articles relate this trend :

“The city’s overall crime rate, especially the violent crime rate, is higher than the US average. Chicago was responsible for nearly half of 2016’s increase in homicides in the US, though the nation’s crime rates remain near historic lows. The reasons for the higher numbers in Chicago remain unclear.” - Source : wikipedia

Apparently, their analysis of the Chicago Food Inspections could not have explained it either.

This figure allows a good visualization of the fact that there is a general trend in which the CrimeScores varie linearly by CommunityArea.

Correlation

The corr function gives the Pearson Coefficient between the CrimeScores and the Community Areas.

The CrimeScore appears to be very related to the place.


Hygiene versus Crime

Correlation

The difference of the trends detected between HygieneScores and CrimeScores strongly lead to think that there is no correlation between the two. Using the .corr() function, we obtained the result : PCC = -0.3059226707490215

The CrimeScore and the HygieneScore are not correlated following the Pearson method. In our opinion, this result is a very good point :

  • Even if in some Community Areas the CrimeScore and therefore the level of criminality is very high, it doesn’t impact the food establishments of the place

  • The Chicago department of Public Health’s Food Protection Program works the same way no matter the Community Area : it is generallly not the fault of the people who want to run food establishments if the criminiality level is high, so they should not be penalized by it.


To end the story, we would like to warn you about the fact that if some Community Areas are more affected by the criminality than others, the criminality itself strongly depends on the people involved in the crimes - autors, victimes. Without solid data analysis of the actors of the crimes, we cannot tell you that you would not be safe in certain Community Areas according to your profil.