{"id":12334,"date":"2023-08-04T15:08:58","date_gmt":"2023-08-04T15:08:58","guid":{"rendered":"https:\/\/towardsdatascience.com\/data-driven-dispatch-76b7e998a7a7\/"},"modified":"2025-01-09T09:36:01","modified_gmt":"2025-01-09T09:36:01","slug":"data-driven-dispatch-76b7e998a7a7","status":"publish","type":"post","link":"https:\/\/towardsdatascience.com\/data-driven-dispatch-76b7e998a7a7\/","title":{"rendered":"Data-Driven Dispatch"},"content":{"rendered":"\n<h2 class=\"wp-block-heading\">Introduction<\/h2>\n<p class=\"wp-block-paragraph\">In today&#8217;s fast-paced world, the need for data-driven decisions in dispatch response systems is becoming essential. Dispatchers will perform a kind of triage when listening to calls, prioritising cases based on severity and time sensitivity among other factors. There is potential in optimising this process by leveraging the power of supervised learning models, to make more accurate predictions of case severity in tandem with a human dispatcher&#8217;s assessment.<\/p>\n<p class=\"wp-block-paragraph\">In this post, I&#8217;m going to run through one solution I developed to improve predictions of casualties and\/or serious vehicular damage from car collisions in Chicago. Factors such as crash location, road conditions, speed limit, and time of occurrence were taken into account to answer a simple yes or no question &#8211; <em>will this car crash require an ambulance or tow truck?<\/em><\/p>\n<figure class=\"wp-block-image size-large\"><img data-dominant-color=\"785434\" data-has-transparency=\"false\" style=\"--dominant-color: #785434;\" loading=\"lazy\" decoding=\"async\" width=\"2560\" height=\"1707\" class=\"wp-image-342611 not-transparent\" src=\"https:\/\/towardsdatascience.com\/wp-content\/uploads\/2023\/08\/02mdodL6yNGWhe1jK-scaled.jpg\" alt=\"Photo by Chris Dickens on Unsplash\" srcset=\"https:\/\/towardsdatascience.com\/wp-content\/uploads\/2023\/08\/02mdodL6yNGWhe1jK-scaled.jpg 2560w, https:\/\/towardsdatascience.com\/wp-content\/uploads\/2023\/08\/02mdodL6yNGWhe1jK-300x200.jpg 300w, https:\/\/towardsdatascience.com\/wp-content\/uploads\/2023\/08\/02mdodL6yNGWhe1jK-1024x683.jpg 1024w, https:\/\/towardsdatascience.com\/wp-content\/uploads\/2023\/08\/02mdodL6yNGWhe1jK-768x512.jpg 768w, https:\/\/towardsdatascience.com\/wp-content\/uploads\/2023\/08\/02mdodL6yNGWhe1jK-1536x1024.jpg 1536w, https:\/\/towardsdatascience.com\/wp-content\/uploads\/2023\/08\/02mdodL6yNGWhe1jK-2048x1365.jpg 2048w\" sizes=\"auto, (max-width: 2560px) 100vw, 2560px\" \/><figcaption class=\"wp-element-caption\">Photo by <a href=\"https:\/\/unsplash.com\/@chrisdickens?utm_source=medium&amp;utm_medium=referral\">Chris Dickens<\/a> on <a href=\"https:\/\/unsplash.com?utm_source=medium&amp;utm_medium=referral\">Unsplash<\/a><\/figcaption><\/figure>\n<p class=\"wp-block-paragraph\">In a nutshell, this machine learning tool&#8217;s primary objective is to classify collisions that will most likely require a callout (medical, tow, or both) based on other known factors. 
By leveraging this tool, responders would be able to allocate their resources efficiently across different parts of the city, based on various conditions such as weather and time of day.

For such a tool to be accurate and effective, a large data source is needed to learn from historical records; thankfully, the city of Chicago already has such a resource (the **[Chicago Data Portal](https://data.cityofchicago.org/)**), so this data will be used as the test case.

Implementing these kinds of predictive models would certainly improve preparedness and response-time efficiency when dealing with collisions on city streets. By gaining insights into the underlying patterns and trends within the collision data, we can work towards fostering safer road environments and optimising emergency services.

I go into the details of data cleaning, model building, fine-tuning, and evaluation below, before analysing the model's results and drawing conclusions. A link to the GitHub folder for this project, which includes a Jupyter notebook and a more comprehensive report, can be found **[here](https://github.com/jlenehan/Chicago_Dispatch_Classification)**.

---

## Data Collection and Preparation

### Initial Setup

I've listed the basic data analysis libraries used in the project below; standard libraries such as pandas and numpy were used throughout, along with matplotlib's pyplot and seaborn for visualisation. Additionally, I used the missingno library to identify gaps in the data; I find this library incredibly useful for visualising missing data in a dataset, and would recommend it for any data science project involving dataframes:

```python
#generic data analysis
import os
import pandas as pd
from datetime import date
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns
import missingno as msno
```

Functions from the machine learning library scikit-learn (sklearn) were imported to build the machine learning engine;
these functions are shown below, and I will describe the purpose of each in the Classification Model section later:

```python
#preprocessing
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import StandardScaler

#models
from sklearn.neighbors import KNeighborsClassifier

#model selection
from sklearn.model_selection import train_test_split
from sklearn.model_selection import RandomizedSearchCV

#metrics
from sklearn.metrics import accuracy_score
from sklearn.metrics import f1_score
from sklearn.metrics import precision_score
from sklearn.metrics import recall_score
```

The data for this project were all imported from the Chicago Data Portal, from two sources:

1. **[Traffic Crashes](https://data.cityofchicago.org/Transportation/Traffic-Crashes-Crashes/85ca-t3if):** a live dataset of vehicle collisions in the Chicago area. Its features are conditions recorded at the time of the collision, such as weather, road alignment, and latitude and longitude, among other details.
2. **[Police Beats Boundaries](https://data.cityofchicago.org/Public-Safety/Boundaries-Police-Beats-current-/aerh-rz74):** a static dataset indicating the boundaries of CPD beats, used to supplement the traffic crashes dataset with district information. Joining it to the original dataset makes it possible to analyse which districts see the most frequent collisions.
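
The notebook itself doesn't show the import step, so here is a minimal loading sketch; it assumes both datasets have been exported from the portal as CSV files, and the filenames are placeholders rather than anything prescribed by the portal:

```python
#minimal loading sketch (assumed step, not shown in the original notebook)
#both files are CSV exports from the Chicago Data Portal; filenames are placeholders
import pandas as pd

collision_raw = pd.read_csv("Traffic_Crashes_-_Crashes.csv")
beat_data = pd.read_csv("Boundaries_-_Police_Beats_current.csv")
```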

### Data Cleaning

With both datasets imported, they can now be merged to add district data to the final analysis. This is done using the .merge() function in pandas; I used an inner join to keep the rows matched across both dataframes, using the beat number as the join key (listed as beat_of_occurrence in the traffic crashes dataset, and BEAT_NUM in the police beats dataset):

```python
#joining collision data to beat data - inner join
collisions = collision_raw.merge(beat_data, how='inner',
                                 left_on='beat_of_occurrence',
                                 right_on='BEAT_NUM'
                                 )
```

A quick look at the output of the .info() function shows a number of columns with sparse data. This can be visualised using the missingno matrix function:

```python
#visualising missing data
#sorting values by crash date
collisions = collisions.sort_values(by='crash_date', ascending=True)

#plotting matrix of missing data
msno.matrix(collisions)
plt.show()

#info of sorted data
print(collisions.info())
```

This displays a matrix of missing data in all columns, as can be seen here:

*Unrefined dataset, with multiple columns containing a large number of null values*

By dropping the columns with sparse data, a much cleaner dataset can be extracted; the columns to drop are defined in a list and then removed from the dataset using the .drop() function:

```python
#defining unnecessary columns
drop_cols = ['location', 'crash_date_est_i', 'report_type', 'intersection_related_i',
             'hit_and_run_i', 'photos_taken_i', 'injuries_unknown',
             'private_property_i', 'statements_taken_i', 'dooring_i', 'work_zone_i',
             'work_zone_type', 'workers_present_i', 'lane_cnt', 'the_geom', 'rd_no',
             'SECTOR', 'BEAT', 'BEAT_NUM']

#dropping columns
collisions = collisions.drop(columns=drop_cols)

#plotting matrix of missing data
msno.matrix(collisions)
plt.show()

#info of sorted data
print(collisions.info())
```

This leads to a much cleaner msno matrix:

*msno matrix of the pruned dataset*
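
As an aside, the sparse columns could also be flagged programmatically rather than by hand; a sketch of one way to do this is below, with an arbitrary 50% null threshold that is my own illustrative choice, not a figure from the original analysis. Note this only catches sparse columns, so administrative columns like rd_no or BEAT_NUM would still need to be dropped explicitly:

```python
#flagging columns whose share of nulls exceeds a threshold
#the 0.5 cutoff is an arbitrary illustrative choice
null_share = collisions.isnull().mean()
sparse_cols = null_share[null_share > 0.5].index.tolist()
print(sparse_cols)
```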

Looking at the latitude and longitude data, a small handful of rows had null values, and others mistakenly had zero values (most likely a reporting error):

*Both the latitude and longitude columns contain zero values (see the min and max of each)*

These would cause errors in training the model, so I removed them:

```python
#some incorrect lat/long data - need to remove these rows
collisions = collisions[collisions['longitude'] < -80]
collisions = collisions[collisions['latitude'] > 40]
```

With the data adequately cleaned, I was able to progress with developing the classification model.

---

## Classification Model

### Exploratory Data Analysis

Before proceeding with the machine learning model, some exploratory data analysis (EDA) needs to be performed: each column of the dataframe is plotted as a histogram, with 50 bins, to show the distribution of the data.
Histograms are useful in the EDA step for a number of reasons: they give an overview of the data distribution, help to identify outliers, and ultimately assist in making decisions on feature engineering.

```python
#plotting histograms of numerical values
collisions.hist(bins=50, figsize=(16,12))
plt.show()
```

*Histograms of columns in the final dataset*

A cursory look at the column histograms indicates that the latitude data is bimodal, while the longitude data is right-skewed. These columns will need to be standardised so that they can be better applied for machine learning purposes.

*Latitude-longitude data without scaling*

Additionally, the crash hour column appears to be cyclic in nature; this can be transformed using a trigonometric function (for example, sine).

*Unscaled crash hour data*
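
The distribution shapes noted above can also be sanity-checked numerically; a quick sketch using pandas' built-in skewness measure (the printed values are whatever the current dataset yields, and none are quoted in the original article):

```python
#quantifying the shapes seen in the histograms
#skewness near 0 means roughly symmetric; positive values indicate a right tail
print("longitude skew:", collisions['longitude'].skew())
print("latitude skew:", collisions['latitude'].skew())
```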

### Scaling and Transformation

Scaling is a technique used in data preprocessing to standardise features so they have similar magnitudes. This is particularly important for machine learning models, since models are generally sensitive to the scale of input features. I used sklearn's StandardScaler() as the scaler in this model; it transforms the data so that it has a mean of 0 and a standard deviation of 1.

For data with a skewed or bimodal distribution, scaling can be combined with a logarithmic transformation. Log functions make skewed data more symmetrical and shrink the tail of the distribution, which is useful when dealing with outlier values. I scaled the latitude and longitude data in this way; as the longitude values are all negative, they were negated before taking the log and scaling. (The collisions_ml dataframe referenced here is the model copy of the data, created in the Data Encoding section below.)

```python
#scaling latitude and longitude data
scaler = StandardScaler()

#log transformation and scaling on (negated) longitude
collisions_ml['neg_log_longitude'] = scaler.fit_transform(
    np.log1p(-collisions_ml['longitude']).values.reshape(-1, 1))

#log transformation and scaling on latitude
collisions_ml['norm_latitude'] = scaler.fit_transform(
    np.log1p(collisions_ml['latitude']).values.reshape(-1, 1))
```

This produces the desired effect, as can be seen below:

*Scaled latitude-longitude data*
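
A quick way to verify the transformation behaved as described (this check isn't in the original notebook) is to confirm the new columns sit at a mean of roughly 0 and a standard deviation of roughly 1:

```python
#sanity check: StandardScaler output should have mean ~0 and std ~1
print(collisions_ml[['neg_log_longitude', 'norm_latitude']]
      .describe().loc[['mean', 'std']])
```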

In comparison, cyclic data is usually encoded using trigonometric functions such as sine and cosine. The crash hour data looks roughly cyclic based on the earlier observations, so I applied a sine function to the data as below; since numpy's sin() function works in radians, the hour values are first converted to radians before taking the sine:

```python
#transforming crash_hour
#data is cyclic, can be encoded using trig transforms

#trig transformation - sin(crash_hr)
collisions_ml['sin_hr'] = np.sin(2*np.pi*collisions_ml['crash_hour']/24)
```

A histogram of the transformed data can be seen below:

*Scaled crash hour data*
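
One caveat worth noting: a sine term on its own maps two different hours to the same value (for example, hours 2 and 10 both give sin = 0.5), so cyclic features are often encoded as a sine/cosine pair, which places each hour at a unique point on the unit circle. The original model uses only the sine term; a sketch of the optional companion feature, which would need to be computed before crash_hour is dropped below, is:

```python
#optional companion feature (not used in the original model):
#the sine/cosine pair gives each hour a unique position on the unit circle
collisions_ml['cos_hr'] = np.cos(2*np.pi*collisions_ml['crash_hour']/24)
```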

Finally, I removed the unscaled columns from the model, to avoid them interfering with model predictions:

```python
#drop previous latitude/longitude columns
lat_long_drop_cols = ['longitude', 'latitude']
collisions_ml.drop(lat_long_drop_cols, axis=1, inplace=True)

#drop crash_hour column
collisions_ml.drop('crash_hour', axis=1, inplace=True)
```

### Data Encoding

Another important step in data preprocessing is data encoding: representing non-numerical data (for example, categories) in numerical format, to make it compatible with machine learning algorithms. For the categorical data in this model, I used a method called label encoding, where each category in a column is assigned a numerical value before being passed to the model. A diagram of this process is shown below:

*An example of label encoding (credits to Zach M, [source](https://www.statology.org/label-encoding-in-python/))*

I encoded the columns in the dataset, first segmenting out the columns I wanted to keep from the original dataset and making a copy of the dataframe (collisions_ml). I then defined the categorical columns in a list, and used the LabelEncoder() class from sklearn to fit and transform them:

```python
#segmenting columns into lists
ml_cols = ['posted_speed_limit', 'traffic_control_device', 'device_condition',
           'weather_condition', 'lighting_condition', 'first_crash_type',
           'trafficway_type', 'alignment', 'roadway_surface_cond', 'road_defect',
           'crash_type', 'damage', 'prim_contributory_cause',
           'sec_contributory_cause', 'street_direction', 'num_units', 'DISTRICT',
           'crash_hour', 'crash_day_of_week', 'latitude', 'longitude']
cat_cols = ['traffic_control_device', 'device_condition', 'weather_condition',
            'DISTRICT', 'lighting_condition', 'first_crash_type', 'trafficway_type',
            'alignment', 'roadway_surface_cond', 'road_defect', 'crash_type',
            'damage', 'prim_contributory_cause', 'sec_contributory_cause',
            'street_direction', 'num_units']

#making a copy of the dataset
collisions_ml = collisions[ml_cols].copy()

#encoding categorical values
label_encoder = LabelEncoder()
for col in cat_cols:
    collisions_ml[col] = label_encoder.fit_transform(collisions_ml[col])
```

With the data now sufficiently preprocessed, it can be split into train and test sets, and a classification model can be fitted.

### Splitting the Train & Test Data

It's important to separate data into training and test sets when building a machine learning model; the training set is the fraction of the initial data used to teach the model the right responses, whereas the test set is used to evaluate model performance. Keeping these separate is necessary to reduce the risk of overfitting and model bias.

I separated out the crash_type column using the drop() function (the remaining features will be used as the variables to predict crash_type), and defined crash_type as the y value the model will predict. The train_test_split function from sklearn was used to hold out 20% of the dataset as test data, with the rest used for model training.

```python
#create test set
#setting X and y values
X = collisions_ml.drop('crash_type', axis=1)
y = collisions_ml['crash_type']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
```

### K-Nearest Neighbors Classification

For this project, a K-Nearest Neighbors (KNN) classification model is used to predict results from the features. KNN models work by finding the K nearest known points around an unknown data point, then classifying that point based on the values of those "neighbor" points. It's a non-parametric classifier, meaning it doesn't make any assumptions about the underlying data distribution; however, it is computationally expensive, and can be sensitive to outliers in the data.
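
To make that mechanism concrete, here is a from-scratch toy sketch of the idea, with invented points; it's an illustration only, and the project itself uses sklearn's implementation:

```python
#toy illustration of the KNN voting mechanism (not the project's model)
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_new, k=3):
    dists = np.linalg.norm(X_train - x_new, axis=1)  #Euclidean distance to each point
    nearest = np.argsort(dists)[:k]                  #indices of the k closest points
    return Counter(y_train[nearest]).most_common(1)[0][0]  #majority vote

X_toy = np.array([[0, 0], [0, 1], [1, 0], [5, 5], [5, 6]])
y_toy = np.array([0, 0, 0, 1, 1])
print(knn_predict(X_toy, y_toy, np.array([4.5, 5.0])))  #two of three neighbours vote 1
```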

I instantiated the KNN classifier with an initial n_neighbors of 3, using the Euclidean distance metric, before fitting the model to the training data:

```python
#classifier - K Nearest Neighbours
#instantiate KNN Classifier
KNNClassifier = KNeighborsClassifier(n_neighbors=3, metric='euclidean')

KNNClassifier.fit(X_train, y_train)
```

Once the model was fitted to the training data, I made predictions on both sets as below:

```python
#predictions
#predict on training set
y_train_pred = KNNClassifier.predict(X_train)

#predict on test data
y_test_pred = KNNClassifier.predict(X_test)
```

### Evaluation

Evaluation of a machine learning model is typically done using four metrics: accuracy, precision, recall, and F1 score. The differences between these metrics are subtle, but in plain English they can be defined as follows (a small worked example follows the list):

1. ***Accuracy:*** the percentage of correct predictions, both positive and negative, out of all model predictions. Typically the accuracy on both the train and test data should be measured to evaluate model fit.
2. ***Precision:*** the percentage of true positive predictions out of all *positive* model predictions.
3. ***Recall:*** the percentage of true positive predictions out of all *positive cases in the dataset*.
4. ***F1 Score:*** an overall measure of the model's ability to identify positive instances in the data, combining the precision and recall scores as their harmonic mean.
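
Here is the worked example, with labels invented purely for illustration (not drawn from the crash data):

```python
#toy example with invented labels: 3 actual positives, 4 predicted positives
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

y_true = [1, 1, 1, 0, 0, 0]
y_pred = [1, 1, 0, 1, 1, 0]

print(accuracy_score(y_true, y_pred))   #3 of 6 predictions correct -> 0.5
print(precision_score(y_true, y_pred))  #2 of 4 predicted positives are real -> 0.5
print(recall_score(y_true, y_pred))     #2 of 3 actual positives found -> 0.667
print(f1_score(y_true, y_pred))         #harmonic mean of the two -> 0.571
```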

I computed the metrics of the KNN model using the code snippet below; I also calculated the difference between the model's accuracy on the train and test sets, to assess fit:

```python
#evaluate model

#calculating accuracy of model on training data
train_accuracy = accuracy_score(y_train, y_train_pred)

#calculating accuracy of model on test data
test_accuracy = accuracy_score(y_test, y_test_pred)

#computing f1 score, precision, recall
f1 = f1_score(y_test, y_test_pred)
precision = precision_score(y_test, y_test_pred)
recall = recall_score(y_test, y_test_pred)

#comparing performances
print("Training Accuracy:", train_accuracy)
print("Test Accuracy:", test_accuracy)
print("Train-Test Accuracy Difference:", train_accuracy - test_accuracy)

#print precision score
print("Precision Score:", precision)

#print recall score
print("Recall Score:", recall)

#print f1 score
print("F1 Score:", f1)
```

The initial metrics of the KNN model are given below:

*Metrics of the KNN model on the 1st iteration*

The model scored well on test accuracy (79.6%), precision (82.1%), recall (91.1%), and F1 score (86.3%); however, the training accuracy was much higher than the test accuracy, at 93.1%, a 13.5% difference. This indicates the model is overfitting the data, meaning it would struggle to make accurate predictions on unseen data. The model therefore needs to be adjusted for a better fit; this can be done using a process called hyperparameter tuning.

### Hyperparameter Tuning

Hyperparameter tuning is the process of selecting the best set of hyperparameters for a machine learning model. I fine-tuned the model using *k-fold cross-validation*: a resampling technique where the data is split into *k* subsets (or *folds*), and each fold in turn is used as the validation set while the remaining data is used for training. This method is effective at reducing the risk of bias being introduced by any particular choice of training/test split.

The hyperparameters for the KNN model are the number of neighbors (*n_neighbors*) and the distance metric. There are a number of different ways to measure distance in a KNN classifier, but here I focused on two options, illustrated in the short sketch below:

1. ***Euclidean:*** the straight-line distance between two points; this is the most commonly used distance metric.
2. ***Manhattan:*** also called "city block" distance, this is the sum of the absolute differences between the coordinates of two points. Imagine standing at one corner of a city block and trying to reach the opposite corner: you wouldn't cut through the building, but would instead go up one block, then across one block.

Note that I could also have fine-tuned the weight parameter (which determines whether all neighbors vote equally or closer neighbors are given more importance), but I decided to keep the voting weight uniform.
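
Here is the quick illustration of the difference between the two metrics, using a pair of arbitrary points chosen purely for intuition:

```python
#Euclidean vs Manhattan distance for a pair of arbitrary points
import numpy as np

a, b = np.array([0.0, 0.0]), np.array([3.0, 4.0])
print(np.sqrt(((a - b) ** 2).sum()))  #Euclidean: straight line -> 5.0
print(np.abs(a - b).sum())            #Manhattan: block by block -> 7.0
```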

I defined a parameter grid with n_neighbors values of 3, 7, and 10, and metrics of Euclidean or Manhattan. I then instantiated a *RandomizedSearchCV* algorithm, passing in the KNN classifier as the estimator, along with the parameter grid. I set the algorithm to split the data into 5 folds via the *cv* parameter; this was then fit to the training set. A snippet of the code for this can be seen below:

```python
#fine tuning (RandomizedSearchCV)
#define parameter grid
param_grid = {
    'n_neighbors': [3, 7, 10],
    'metric': ['euclidean', 'manhattan']
}

#instantiate RandomizedSearchCV
random_search = RandomizedSearchCV(estimator=KNeighborsClassifier(),
                                   param_distributions=param_grid, cv=5)

#fit to training data
random_search.fit(X_train, y_train)

#retrieve best model and performance
best_classifier = random_search.best_estimator_
best_accuracy = random_search.best_score_

print("Best Accuracy:", best_accuracy)
print("Best Model:", best_classifier)
```

The best accuracy and classifier were retrieved from the search, indicating that the classifier performs best with n_neighbors set to 10 and the Manhattan distance metric, with a cross-validated accuracy of 74.0%:

*Results of cross-validation; the random search classifier recommends n_neighbors=10 using the manhattan distance metric*
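
As an aside, the grid here contains only 3 × 2 = 6 parameter combinations, so an exhaustive sweep would be just as cheap; a GridSearchCV equivalent of the search above (same estimator, grid, and folds) might look like this:

```python
#exhaustive alternative to the randomised search (same grid and folds)
from sklearn.model_selection import GridSearchCV

grid_search = GridSearchCV(estimator=KNeighborsClassifier(),
                           param_grid=param_grid, cv=5)
grid_search.fit(X_train, y_train)
print(grid_search.best_params_, grid_search.best_score_)
```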

As such, these parameters were passed to the classifier, and the model was retrained:

```python
#classifier - K Nearest Neighbours, tuned parameters
KNNClassifier = KNeighborsClassifier(n_neighbors=10, metric='manhattan')

KNNClassifier.fit(X_train, y_train)
```

Performance metrics were again extracted from the classifier, in the same manner as before; the metrics for this iteration can be seen below:

*Metrics of the tuned KNN model*

Cross-validation led to slightly poorer results across all metrics: test accuracy dropped by 2.6%, precision by 1.5%, recall by 0.5%, and F1 score by 1%. However, the train-test accuracy difference dropped to 3.8%, from an initial 13.5%. This indicates the model is no longer overfitting the data, and is therefore better suited to predicting unseen data.

## Conclusion

In summary, the KNN classifier performed well in predicting whether a collision would require a tow or ambulance. Initial metrics from the model's first iteration were impressive, but the disparity between training and test accuracy indicated overfitting. Hyperparameter tuning allowed the model to be optimised, which significantly reduced the gap in accuracies between the two datasets. While the performance metrics did take a small hit during this process, the benefit of a better-fitting model outweighs these concerns.

## References

1. Levy, J. (n.d.). Traffic Crashes – Crashes [Dataset]. Chicago Data Portal. Available at: https://data.cityofchicago.org/Transportation/Traffic-Crashes-Crashes/85ca-t3if (Accessed: 14 May 2023).
2. Chicago Police Department. (n.d.). Boundaries – Police Beats (current) [Dataset]. Chicago Data Portal. Available at: https://data.cityofchicago.org/Public-Safety/Boundaries-Police-Beats-current-/aerh-rz74 (Accessed: 14 May 2023).
3. Zach M. (2022). "How to Perform Label Encoding in Python (With Example)." [Online]. Available at: https://www.statology.org/label-encoding-in-python/ (Accessed: 19 July 2023).