The Department of Health and Mental Hygiene in New York City is responsible for taking reports of rat sightings and following up on them. Each sighting is recorded, and the record is updated when the matter is considered resolved. Each record includes when it was created and closed, the type of location, the zip code, address, borough, and latitude and longitude. In this report, we seek insight into the following research questions:

The Data

The main data set describes rat sightings reported to the Department of Health and Mental Hygiene in NYC from January 2010 to the present. The data is freely available to the public from the NYC OpenData website and is updated daily. This report is set up to refresh the data it uses approximately once a month after it is first compiled.

The data can be accessed for viewing and is available for download from the following link:

The NYC OpenData website is part of the "Open Data Law" enacted in 2012 which mandates data from public entities to be available online. The website currently has over 1900 data sets available.

Variables of interest in the rat sighting data set include:

  • date reported
  • date completed
  • address
  • borough
  • zip-code
  • latitude and longitude

Second, a data set of monthly home-value time series by zip code is joined with the rat data set. We attempt to regress the average home value on the count of rat sightings and on borough. The home value data is accessed directly from Zillow's website at the following link:

The specific data set used was for Home Listings and Sales with Data Type = Median List Price and Geography = ZIP Code.

Variables of interest in the Zillow home value data set include:

  • region name (zip code)
  • state
  • metro
  • county
  • monthly median values by region name

Data cleaning

Because this is real-world data, it had to be examined and reshaped before analysis. For the time series plots, sightings were counted by year and month of creation, and each time point was assigned the earliest creation date in its group, which should be the first day of the month.
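The monthly grouping described above can be sketched as follows. The report's own pipeline was written in R; this is only a minimal standard-library Python illustration, and the sample dates and the names `created` and `monthly_counts` are hypothetical.

```python
from collections import defaultdict
from datetime import date

# Hypothetical creation dates standing in for the sighting records.
created = [
    date(2017, 7, 1), date(2017, 7, 14), date(2017, 7, 30),
    date(2017, 8, 2), date(2017, 8, 2),
]

def monthly_counts(dates):
    """Count records per (year, month) and label each group with its earliest date."""
    groups = defaultdict(list)
    for d in dates:
        groups[(d.year, d.month)].append(d)
    # One row per month: (earliest date seen in that month, number of records).
    return [(min(ds), len(ds)) for _, ds in sorted(groups.items())]

series = monthly_counts(created)
for label, count in series:
    print(label.isoformat(), count)
```

The label equals the first day of the month only when at least one record was created on that day. In this sample, August has no record on the 1st, so its label is 2017-08-02.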

Initially, a preliminary time series plot revealed an issue in the data: monthly counts suggested that rat sightings were not being entered consistently before July 2015. While this report was being created (as of May 2nd 2018), the issue appears to have been resolved by NYC OpenData authorities, so all rat sighting records from NYC OpenData are used in this report. As of May 2nd 2018, there were over 110,000 records of rat sightings.

For the Zillow data set, zip codes had lost their leading zeros. The leading zeros were pasted back using an sapply function with an ifelse statement. The data set was then converted to tidy format using the gather function.
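The two steps above, restoring leading zeros and reshaping from wide to tidy (long) format, might look like the following. This is a standard-library Python stand-in, not the R code the report used (sapply, ifelse, and gather). The column name `RegionName`, the sample values, and the helpers `pad_zip` and `to_long` are assumptions made for illustration.

```python
# Hypothetical wide-format rows: one row per zip code, one column per month.
wide_rows = [
    {"RegionName": 7030, "2018-01": 650000.0, "2018-02": 655000.0},
    {"RegionName": 10001, "2018-01": 1200000.0, "2018-02": None},
]

def pad_zip(value):
    """Restore leading zeros lost when a zip code was read as a number."""
    return str(value).zfill(5)

def to_long(rows, id_col="RegionName"):
    """Reshape wide rows into one (zip, month, value) record per cell."""
    long_rows = []
    for row in rows:
        zip_code = pad_zip(row[id_col])
        for column, value in row.items():
            if column != id_col:
                long_rows.append({"zip": zip_code, "month": column, "value": value})
    return long_rows

tidy = to_long(wide_rows)
for record in tidy:
    print(record)
```

Padding to a fixed width of five characters replaces the conditional paste described in the text. Both approaches give the same result for five-digit US zip codes.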

Next, the data set was filtered to the zip codes that appear in the rat data set (NYC zip codes). Lastly, home values were averaged within groups defined by date and zip code.
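A minimal sketch of the filter-then-average step, again in standard-library Python rather than the report's R. The sample records, the set `rat_zips`, and the function name are hypothetical. Skipping missing values before averaging is also an assumption, since the report does not say how they were handled.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical inputs: long-format home values and the zip codes seen in the rat data.
home_values = [
    {"zip": "10001", "month": "2018-01", "value": 1200000.0},
    {"zip": "10001", "month": "2018-01", "value": 1300000.0},
    {"zip": "11201", "month": "2018-01", "value": 900000.0},
    {"zip": "07030", "month": "2018-01", "value": 650000.0},  # not in the rat data
]
rat_zips = {"10001", "11201"}

def average_by_month_and_zip(rows, keep_zips):
    """Keep rows whose zip is in keep_zips, then average values per (month, zip)."""
    groups = defaultdict(list)
    for row in rows:
        # Assumption: rows with a missing value are dropped before averaging.
        if row["zip"] in keep_zips and row["value"] is not None:
            groups[(row["month"], row["zip"])].append(row["value"])
    return {key: mean(values) for key, values in groups.items()}

averages = average_by_month_and_zip(home_values, rat_zips)
for key, value in sorted(averages.items()):
    print(key, value)
```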

Report frequency distributions

When looking at rat sightings, it is interesting to see whether there is any pattern in the day of the week on which reports are filed and on which they are closed.

Looking at the day of the week on which reports are created, the most noticeable feature is that the weekend days, Saturday and Sunday, see only about half to two-thirds as many reports as most weekdays. Also quite noticeable is the drop from Wednesday to Friday. This suggests that reporting may typically "slow down" as the weekend approaches.

Looking at the day of the week on which reports are closed, the most noticeable feature is that nearly all reports are closed between Monday and Friday, although a small number of cases are closed during the weekend. There is also a sizable number of "NA" values, which represent cases that were not closed or are still in progress. No weekday stands out as particularly "busy", but Monday does appear to be the least busy.
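The day-of-week tallies behind such bar charts, including the "NA" category for reports with no closing date, can be sketched as below. This is a standard-library Python illustration rather than the report's R code, and the sample dates and the names `closed` and `weekday_counts` are hypothetical.

```python
from collections import Counter
from datetime import date

# date.weekday() returns 0 for Monday through 6 for Sunday.
DAY_NAMES = ("Monday", "Tuesday", "Wednesday", "Thursday",
             "Friday", "Saturday", "Sunday")

# Hypothetical closing dates; None marks a report with no closing date.
closed = [date(2018, 4, 30), date(2018, 5, 1), None,
          date(2018, 5, 5), date(2018, 5, 7)]

def weekday_counts(dates):
    """Count dates by weekday name, with missing dates tallied under 'NA'."""
    return Counter("NA" if d is None else DAY_NAMES[d.weekday()] for d in dates)

counts = weekday_counts(closed)
for name in DAY_NAMES + ("NA",):
    print(name, counts[name])
```

Keeping missing dates as an explicit "NA" category, rather than dropping them, matches the way the closed-report chart is described above.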

The month-to-month count bar charts are especially useful for examining how reports vary over the year.

Some months of the year are likely to be "busy" months. To find which months are busiest overall, we examine frequency counts of rat sighting reports by the month of the year in which they were created and closed. Reports that were not closed are not included.

The bar chart of rat sighting reports created shows clearly that July is the busiest month. In general, the summer months of June through September are the busiest, while the winter months of December, January, and February have much lower counts. Additionally, there is a noticeable drop between October and November.

The bar chart of closed reports shows a very similar pattern: the most reports are closed in the summer months and the fewest in winter. There is also a drop-off from October to November.

Among the five boroughs of NYC, Brooklyn has by far the most reports of rat sightings, followed by Manhattan, the Bronx, and Queens, with Staten Island having the fewest. Brooklyn appears to have about 7x as many rat sightings as Staten Island.

Rat Sightings over Time

The rat sighting data is collected over time, so natural questions are whether rat sightings are increasing and whether there is a seasonal trend. The same questions can also be asked of each borough in NYC. To address them, we examine time series plots of rat sightings.

Rat sightings appear to be generally increasing over time. There also appears to be a cyclical component: in general, counts decrease from August to January and increase from January to July. The highest point occurs in July of 2017 and is a large increase from the previous July.
