{"id":3654,"date":"2024-01-20T05:58:35","date_gmt":"2024-01-20T05:58:35","guid":{"rendered":"https:\/\/towardsdatascience.com\/evaluating-cinematic-dialogue-which-syntactic-and-semantic-features-are-predictive-of-genre-2c69a71af6e2\/"},"modified":"2025-01-08T15:47:12","modified_gmt":"2025-01-08T15:47:12","slug":"evaluating-cinematic-dialogue-which-syntactic-and-semantic-features-are-predictive-of-genre-2c69a71af6e2","status":"publish","type":"post","link":"https:\/\/towardsdatascience.com\/evaluating-cinematic-dialogue-which-syntactic-and-semantic-features-are-predictive-of-genre-2c69a71af6e2\/","title":{"rendered":"Evaluating Cinematic Dialogue\u200a-\u200aWhich syntactic and semantic features are predictive of genre?"},"content":{"rendered":"<h3 class=\"wp-block-heading\">Natural Language Processing<\/h3>\n<h1 class=\"wp-block-heading\">Evaluating Cinematic Dialogue &#8211; Which Syntactic and Semantic Features Are Predictive of Genre?<\/h1>\n<h3 class=\"wp-block-heading\"><em>This article explores the relationship between a movie&#8217;s dialogue and its genre, leveraging domain-driven data analysis and informed feature engineering.<\/em><\/h3>\n\n<p class=\"wp-block-paragraph\">From fragmented speech in thrillers to expletive-laden exchanges in action movies, can we guess a movie&#8217;s genre simply by knowing its semantic and syntactic characteristics in the dialogue? If so, which ones?<\/p>\n<p class=\"wp-block-paragraph\">We will investigate whether or not the nuanced dialogue patterns within a screenplay &#8211; its lexicon, structure, and pacing &#8211; can be powerful predictors of genre. The focus here is twofold: to leverage syntactic and semantic script characteristics as predictive features and to underscore the significance of informed feature engineering.<\/p>\n<p class=\"wp-block-paragraph\">One of the primary gaps in many data science courses is the lack of emphasis on domain expertise and feature generation, engineering, and selection. 
Many courses also provide students with pre-existing datasets, and sometimes, these datasets are already cleaned. Moreover, in the workplace, the rush to produce results often overshadows the process of hypothesizing and validating predictive features, leaving little room for domain-specific exploration and understanding.<\/p>\n<p class=\"wp-block-paragraph\">In my own experience outlined in &quot;<a href=\"https:\/\/towardsdatascience.com\/using-multi-task-and-ensemble-learning-to-predict-alzheimers-cognitive-functioning-7b46fe09f9ff\">Using Multi-Task and Ensemble Learning to Predict Alzheimer&#8217;s Cognitive Functioning<\/a>,&quot; I witnessed the positive impact of informed feature engineering. Researching known predictors of Alzheimer&#8217;s allowed me to question the initial task and data, ultimately leading to the inclusion of key features during modeling.<\/p>\n<figure class=\"wp-block-image size-large\"><img data-dominant-color=\"5c655c\" data-has-transparency=\"true\" style=\"--dominant-color: #5c655c;\" loading=\"lazy\" decoding=\"async\" width=\"2026\" height=\"1946\" class=\"wp-image-314521 has-transparency\" src=\"https:\/\/towardsdatascience.com\/wp-content\/uploads\/2024\/01\/1eo6EgSEVniTiEWdfA-7yRg.png\" alt=\"DALLE Generated Image by Author\" srcset=\"https:\/\/towardsdatascience.com\/wp-content\/uploads\/2024\/01\/1eo6EgSEVniTiEWdfA-7yRg.png 2026w, https:\/\/towardsdatascience.com\/wp-content\/uploads\/2024\/01\/1eo6EgSEVniTiEWdfA-7yRg-300x288.png 300w, https:\/\/towardsdatascience.com\/wp-content\/uploads\/2024\/01\/1eo6EgSEVniTiEWdfA-7yRg-1024x984.png 1024w, https:\/\/towardsdatascience.com\/wp-content\/uploads\/2024\/01\/1eo6EgSEVniTiEWdfA-7yRg-768x738.png 768w, https:\/\/towardsdatascience.com\/wp-content\/uploads\/2024\/01\/1eo6EgSEVniTiEWdfA-7yRg-1536x1475.png 1536w\" sizes=\"auto, (max-width: 2026px) 100vw, 2026px\" \/><figcaption class=\"wp-element-caption\">DALLE Generated Image by Author<\/figcaption><\/figure>\n<p 
class=\"wp-block-paragraph\">In this article, I delve into a project that examines movie dialogue to illustrate my approach to research and feature extraction. The focus will be on identifying and analyzing textual, semantic, and syntactic elements within film dialogue, investigating how they interrelate, and evaluating their capacity to accurately predict a movie&#8217;s genre.<\/p>\n<h2 class=\"wp-block-heading\">Initial Questions<\/h2>\n<p class=\"wp-block-paragraph\">I like to start every project by conducting a literature review. I begin by jotting down relevant concepts and questions to guide my review. This initial phase is crucial and, depending on the time I have, I intentionally steer clear of research directly related to the modeling problem at hand. The goal is to understand the broader context and seek out supplemental information first. This strategy helps in cultivating an unbiased understanding of the subject matter, ensuring that my approach to the problem is informed, yet not prematurely narrowed by the solutions and methodologies already explored by others.<\/p>\n<h3 class=\"wp-block-heading\">A few questions I&#8217;d jotted down:<\/h3>\n<ul class=\"wp-block-list\">\n<li>Is there a relationship between dialogue and emotions?<\/li>\n<li>How do conversations differ in real life vs. in screenplay?<\/li>\n<li>What can I understand about movie dialogue and how it relates to genre?<\/li>\n<\/ul>\n<h3 class=\"wp-block-heading\">What I found<\/h3>\n<p class=\"wp-block-paragraph\">There is a body of literature that explores the interplay between natural dialogue and our emotions. Screenwriters capture an emotion or mood by capitalizing on textual and syntactical relationships. 
These vary across genres since different moods are associated with different genres.<\/p>\n<h2 class=\"wp-block-heading\">What are these syntactical &amp; textual characteristics?<\/h2>\n<p class=\"wp-block-paragraph\">We will extract and evaluate the four characteristics listed below. In each section, I&#8217;ll explain the rationale:<\/p>\n<ol class=\"wp-block-list\">\n<li>Length attributes<\/li>\n<li>Types of sentences<\/li>\n<li>Part of speech and profanity<\/li>\n<li>Sentiment analysis<\/li>\n<\/ol>\n<h3 class=\"wp-block-heading\">Data<\/h3>\n<p class=\"wp-block-paragraph\">The dataset used here is the Cornell Movie-Dialogs Corpus (MIT License) from <a href=\"https:\/\/www.kaggle.com\/datasets\/rajathmc\/cornell-moviedialog-corpus\/data\">Kaggle<\/a>, which was originally retrieved from the <a href=\"https:\/\/github.com\/CornellNLP\/ConvoKit?tab=MIT-1-ov-file\">ConvoKit toolkit<\/a> (Chang et al., 2020). It comprises over <strong>300k<\/strong> spoken lines across <strong>~220k<\/strong> conversational exchanges derived from <strong>617<\/strong> different movies.<\/p>\n<h3 class=\"wp-block-heading\">Load the Data<\/h3>\n<p class=\"wp-block-paragraph\">We&#8217;ll begin by loading data using the <code>movie_lines.txt<\/code> file.<\/p>\n<pre class=\"wp-block-prismatic-blocks\"><code class=\"language-python\">import os\n\n# Define the directory path where the &#039;movie_lines.txt&#039; file is located\ncorpus_directory = &#039;cornell movie-dialogs corpus&#039;\n\n# Construct the full file path\nfile_path = os.path.join(corpus_directory, &#039;movie_lines.txt&#039;)\n\n# Open the file in read mode with &#039;mac_roman&#039; encoding\nwith open(file_path, &#039;r&#039;, encoding=&#039;mac_roman&#039;) as file:\n\n    # Read the contents of the file and split them into individual lines\n    lines = file.read().splitlines()<\/code><\/pre>\n<pre class=\"wp-block-prismatic-blocks\"><code class=\"language-\">lines[:2]\n[&#039;L1045 +++$+++ u0 +++$+++ m0 +++$+++ BIANCA 
+++$+++ They do not!&#039;,\n &#039;L1044 +++$+++ u2 +++$+++ m0 +++$+++ CAMERON +++$+++ They do to!&#039;]<\/code><\/pre>\n<p class=\"wp-block-paragraph\">The columns are separated by <code>+++$+++<\/code>, so we use it as the separator to split each line, extract the columns, and read the data into a DataFrame.<\/p>\n<pre class=\"wp-block-prismatic-blocks\"><code class=\"language-python\">import pandas as pd\n\n# Split each line in &#039;lines&#039; using &#039; +++$+++ &#039; as the separator\npreprocessed_list = list(\n    map(lambda x: (str(x).split(&#039; +++$+++ &#039;)), lines)\n)\n\n# Define column names for the DataFrame\ncolumn_names = [&#039;line&#039;, &#039;speaker_id&#039;, &#039;movie_id&#039;, &#039;name&#039;, &#039;text&#039;]\n\n# Create a DataFrame using &#039;preprocessed_list&#039;\ndf = pd.DataFrame(preprocessed_list, columns=column_names)\n\n# Display the first 2 rows of the DataFrame\ndf.head(2)<\/code><\/pre>\n<figure class=\"wp-block-image size-large\"><img data-dominant-color=\"252525\" data-has-transparency=\"false\" style=\"--dominant-color: #252525;\" loading=\"lazy\" decoding=\"async\" width=\"1028\" height=\"218\" class=\"wp-image-314522 not-transparent\" src=\"https:\/\/towardsdatascience.com\/wp-content\/uploads\/2024\/01\/10C3wRtYtV-vorxxdtbkjJA.png\" alt=\"Sample of Data. Image by Author.\" srcset=\"https:\/\/towardsdatascience.com\/wp-content\/uploads\/2024\/01\/10C3wRtYtV-vorxxdtbkjJA.png 1028w, https:\/\/towardsdatascience.com\/wp-content\/uploads\/2024\/01\/10C3wRtYtV-vorxxdtbkjJA-300x64.png 300w, https:\/\/towardsdatascience.com\/wp-content\/uploads\/2024\/01\/10C3wRtYtV-vorxxdtbkjJA-1024x217.png 1024w, https:\/\/towardsdatascience.com\/wp-content\/uploads\/2024\/01\/10C3wRtYtV-vorxxdtbkjJA-768x163.png 768w\" sizes=\"auto, (max-width: 1028px) 100vw, 1028px\" \/><figcaption class=\"wp-element-caption\">Sample of Data. 
Image by Author.<\/figcaption><\/figure>\n<h3 class=\"wp-block-heading\">Preprocess Text<\/h3>\n<p class=\"wp-block-paragraph\">I used spaCy &#8211; an open-source natural language processing library written in Python and Cython &#8211; to process the text. This included expanding contractions, removing punctuation, and lemmatizing words.<\/p>\n<pre class=\"wp-block-prismatic-blocks\"><code class=\"language-python\"># clean_contractions, remove_punctuation, and preprocess are helper functions defined elsewhere\n\n# Expand all contractions to their longer form\ndf[&#039;text&#039;] = df.text.map(clean_contractions)\n\n# Remove all punctuation and punctuation errors in the data\ndf[&#039;text_no_punct&#039;] = df.text.map(remove_punctuation)\n\n# Remove stopwords and words shorter than 2 characters, lemmatize, and lowercase\ndf[&#039;clean_text&#039;] = df.text_no_punct.map(\n    lambda x: preprocess(&#039; &#039;.join(x))\n)<\/code><\/pre>\n<figure class=\"wp-block-image size-large\"><img data-dominant-color=\"232323\" data-has-transparency=\"false\" style=\"--dominant-color: #232323;\" loading=\"lazy\" decoding=\"async\" width=\"2528\" height=\"764\" class=\"wp-image-314523 not-transparent\" src=\"https:\/\/towardsdatascience.com\/wp-content\/uploads\/2024\/01\/16if74oYbJt708xZR0Cla0Q.png\" alt=\"DataFrame After Preprocessing. 
Image by Author.\" srcset=\"https:\/\/towardsdatascience.com\/wp-content\/uploads\/2024\/01\/16if74oYbJt708xZR0Cla0Q.png 2528w, https:\/\/towardsdatascience.com\/wp-content\/uploads\/2024\/01\/16if74oYbJt708xZR0Cla0Q-300x91.png 300w, https:\/\/towardsdatascience.com\/wp-content\/uploads\/2024\/01\/16if74oYbJt708xZR0Cla0Q-1024x309.png 1024w, https:\/\/towardsdatascience.com\/wp-content\/uploads\/2024\/01\/16if74oYbJt708xZR0Cla0Q-768x232.png 768w, https:\/\/towardsdatascience.com\/wp-content\/uploads\/2024\/01\/16if74oYbJt708xZR0Cla0Q-1536x464.png 1536w, https:\/\/towardsdatascience.com\/wp-content\/uploads\/2024\/01\/16if74oYbJt708xZR0Cla0Q-2048x619.png 2048w\" sizes=\"auto, (max-width: 2528px) 100vw, 2528px\" \/><figcaption class=\"wp-element-caption\">DataFrame After Preprocessing. Image by Author.<\/figcaption><\/figure>\n<h2 class=\"wp-block-heading\">1. Length Attributes<\/h2>\n<p class=\"wp-block-paragraph\">In suspense movies, dialogue is often sparse, showcasing the link between syntax and emotions. When characters are in states of terror, their speech tends to be concise, while nervousness often leads to longer utterances (i.e. rambling), a trait more commonly seen in comedies. 
Therefore, we will examine the length attributes of each line in the corpus.<\/p>\n<figure class=\"wp-block-image size-large\"><img data-dominant-color=\"71685c\" data-has-transparency=\"true\" style=\"--dominant-color: #71685c;\" loading=\"lazy\" decoding=\"async\" width=\"2154\" height=\"2048\" class=\"wp-image-314524 has-transparency\" src=\"https:\/\/towardsdatascience.com\/wp-content\/uploads\/2024\/01\/1yNCnAlWOYkfZ9XktKKID9A.png\" alt=\"DALLE Generated Image by Author\" srcset=\"https:\/\/towardsdatascience.com\/wp-content\/uploads\/2024\/01\/1yNCnAlWOYkfZ9XktKKID9A.png 2154w, https:\/\/towardsdatascience.com\/wp-content\/uploads\/2024\/01\/1yNCnAlWOYkfZ9XktKKID9A-300x285.png 300w, https:\/\/towardsdatascience.com\/wp-content\/uploads\/2024\/01\/1yNCnAlWOYkfZ9XktKKID9A-1024x974.png 1024w, https:\/\/towardsdatascience.com\/wp-content\/uploads\/2024\/01\/1yNCnAlWOYkfZ9XktKKID9A-768x730.png 768w, https:\/\/towardsdatascience.com\/wp-content\/uploads\/2024\/01\/1yNCnAlWOYkfZ9XktKKID9A-1536x1460.png 1536w, https:\/\/towardsdatascience.com\/wp-content\/uploads\/2024\/01\/1yNCnAlWOYkfZ9XktKKID9A-2048x1947.png 2048w\" sizes=\"auto, (max-width: 2154px) 100vw, 2154px\" \/><figcaption class=\"wp-element-caption\">DALLE Generated Image by Author<\/figcaption><\/figure>\n<p class=\"wp-block-paragraph\">In this section, we&#8217;ll take a look at:<\/p>\n<ol class=\"wp-block-list\">\n<li>Average # of words in a line<\/li>\n<li>Average # of sentences across the whole corpus<\/li>\n<li>Distribution of the average # of words per line<\/li>\n<li>Distribution of the average # of sentences per line across the corpus<\/li>\n<\/ol>\n<pre class=\"wp-block-prismatic-blocks\"><code class=\"language-python\">import nltk  # sent_tokenize needs NLTK&#039;s Punkt tokenizer data to be downloaded\n\n# Calculate the number of words in each line\ndf[&#039;num_words&#039;] = df[&#039;text_no_punct&#039;].map(len)\n\n# Extract the number of sentences in each line\ndf[&#039;num_sentences&#039;] = df[&#039;text&#039;].map(\n    lambda x: len(nltk.sent_tokenize(x))\n)\n\n# Remove 
entries with empty or non-textual content\ndf = df[df[&#039;num_words&#039;] != 0]<\/code><\/pre>\n<figure class=\"wp-block-image size-large\"><img data-dominant-color=\"222222\" data-has-transparency=\"false\" style=\"--dominant-color: #222222;\" loading=\"lazy\" decoding=\"async\" width=\"3410\" height=\"748\" class=\"wp-image-314525 not-transparent\" src=\"https:\/\/towardsdatascience.com\/wp-content\/uploads\/2024\/01\/1EsbMbsJFhl2JyV0-zQPrfw.png\" alt=\"DataFrame After Adding Length Features. Image by Author.\" srcset=\"https:\/\/towardsdatascience.com\/wp-content\/uploads\/2024\/01\/1EsbMbsJFhl2JyV0-zQPrfw.png 3410w, https:\/\/towardsdatascience.com\/wp-content\/uploads\/2024\/01\/1EsbMbsJFhl2JyV0-zQPrfw-300x66.png 300w, https:\/\/towardsdatascience.com\/wp-content\/uploads\/2024\/01\/1EsbMbsJFhl2JyV0-zQPrfw-1024x225.png 1024w, https:\/\/towardsdatascience.com\/wp-content\/uploads\/2024\/01\/1EsbMbsJFhl2JyV0-zQPrfw-768x168.png 768w, https:\/\/towardsdatascience.com\/wp-content\/uploads\/2024\/01\/1EsbMbsJFhl2JyV0-zQPrfw-1536x337.png 1536w, https:\/\/towardsdatascience.com\/wp-content\/uploads\/2024\/01\/1EsbMbsJFhl2JyV0-zQPrfw-2048x449.png 2048w\" sizes=\"auto, (max-width: 3410px) 100vw, 3410px\" \/><figcaption class=\"wp-element-caption\">DataFrame After Adding Length Features. Image by Author.<\/figcaption><\/figure>\n<h3 class=\"wp-block-heading\">Statistics Per Movie: Boxplot and Statistics DataFrame<\/h3>\n<figure class=\"wp-block-image size-large\"><img data-dominant-color=\"0f191c\" data-has-transparency=\"true\" style=\"--dominant-color: #0f191c;\" loading=\"lazy\" decoding=\"async\" width=\"2548\" height=\"1004\" class=\"wp-image-314526 has-transparency\" src=\"https:\/\/towardsdatascience.com\/wp-content\/uploads\/2024\/01\/1vLyaKtMmMsD8y8o8jCpqHg.png\" alt=\"Length Feature Boxplots. 
Image by Author.\" srcset=\"https:\/\/towardsdatascience.com\/wp-content\/uploads\/2024\/01\/1vLyaKtMmMsD8y8o8jCpqHg.png 2548w, https:\/\/towardsdatascience.com\/wp-content\/uploads\/2024\/01\/1vLyaKtMmMsD8y8o8jCpqHg-300x118.png 300w, https:\/\/towardsdatascience.com\/wp-content\/uploads\/2024\/01\/1vLyaKtMmMsD8y8o8jCpqHg-1024x403.png 1024w, https:\/\/towardsdatascience.com\/wp-content\/uploads\/2024\/01\/1vLyaKtMmMsD8y8o8jCpqHg-768x303.png 768w, https:\/\/towardsdatascience.com\/wp-content\/uploads\/2024\/01\/1vLyaKtMmMsD8y8o8jCpqHg-1536x605.png 1536w, https:\/\/towardsdatascience.com\/wp-content\/uploads\/2024\/01\/1vLyaKtMmMsD8y8o8jCpqHg-2048x807.png 2048w\" sizes=\"auto, (max-width: 2548px) 100vw, 2548px\" \/><figcaption class=\"wp-element-caption\">Length Feature Boxplots. Image by Author.<\/figcaption><\/figure>\n<figure class=\"wp-block-image size-large\"><img data-dominant-color=\"282828\" data-has-transparency=\"false\" style=\"--dominant-color: #282828;\" loading=\"lazy\" decoding=\"async\" width=\"910\" height=\"206\" class=\"wp-image-314527 not-transparent\" src=\"https:\/\/towardsdatascience.com\/wp-content\/uploads\/2024\/01\/1yukxOli-m2m80PDCQKqqsQ.png\" alt=\"Length Features Statistics DataFrame. Image by Author.\" srcset=\"https:\/\/towardsdatascience.com\/wp-content\/uploads\/2024\/01\/1yukxOli-m2m80PDCQKqqsQ.png 910w, https:\/\/towardsdatascience.com\/wp-content\/uploads\/2024\/01\/1yukxOli-m2m80PDCQKqqsQ-300x68.png 300w, https:\/\/towardsdatascience.com\/wp-content\/uploads\/2024\/01\/1yukxOli-m2m80PDCQKqqsQ-768x174.png 768w\" sizes=\"auto, (max-width: 910px) 100vw, 910px\" \/><figcaption class=\"wp-element-caption\">Length Features Statistics DataFrame. Image by Author.<\/figcaption><\/figure>\n<p class=\"wp-block-paragraph\">In the boxplot and the statistics data frame above, we see that:<\/p>\n<ul class=\"wp-block-list\">\n<li>Word length <strong>ranges from 0 to 30<\/strong>, with a <strong>median length of 7<\/strong>. 
The interquartile range (the span covering the middle half of the data) shows that <strong>the middle 50% of lines contain between 4 and 14 words.<\/strong><\/li>\n<li>Lines usually contain 1 or 2 sentences, occasionally reaching the upper bound of 3 sentences.<\/li>\n<\/ul>\n<figure class=\"wp-block-image size-large\"><img data-dominant-color=\"1d2521\" data-has-transparency=\"true\" style=\"--dominant-color: #1d2521;\" loading=\"lazy\" decoding=\"async\" width=\"1648\" height=\"422\" class=\"wp-image-314528 has-transparency\" src=\"https:\/\/towardsdatascience.com\/wp-content\/uploads\/2024\/01\/1BTtJE0rgxp1CN5vAikwblA.png\" alt=\"Proportion of Lines Greater Than a Sentence. Image by Author.\" srcset=\"https:\/\/towardsdatascience.com\/wp-content\/uploads\/2024\/01\/1BTtJE0rgxp1CN5vAikwblA.png 1648w, https:\/\/towardsdatascience.com\/wp-content\/uploads\/2024\/01\/1BTtJE0rgxp1CN5vAikwblA-300x77.png 300w, https:\/\/towardsdatascience.com\/wp-content\/uploads\/2024\/01\/1BTtJE0rgxp1CN5vAikwblA-1024x262.png 1024w, https:\/\/towardsdatascience.com\/wp-content\/uploads\/2024\/01\/1BTtJE0rgxp1CN5vAikwblA-768x197.png 768w, https:\/\/towardsdatascience.com\/wp-content\/uploads\/2024\/01\/1BTtJE0rgxp1CN5vAikwblA-1536x393.png 1536w\" sizes=\"auto, (max-width: 1648px) 100vw, 1648px\" \/><figcaption class=\"wp-element-caption\">Proportion of Lines Greater Than a Sentence. Image by Author.<\/figcaption><\/figure>\n<p class=\"wp-block-paragraph\">Fewer than half of the script lines contain more than one sentence. This tells us that <strong>most script lines are short<\/strong>, and the features that follow should be interpreted with that in mind.<\/p>\n<h3 class=\"wp-block-heading\">Distribution on a Per Movie Basis<\/h3>\n<p class=\"wp-block-paragraph\">The metrics mentioned above were calculated on a &#8216;per line&#8217; basis within the movie script data. 
In the next section, we shift our focus to explore the average length of lines per movie, allowing us to examine variations in word length at the movie level.<\/p>\n<figure class=\"wp-block-image size-large\"><img data-dominant-color=\"282828\" data-has-transparency=\"false\" style=\"--dominant-color: #282828;\" loading=\"lazy\" decoding=\"async\" width=\"1228\" height=\"218\" class=\"wp-image-314529 not-transparent\" src=\"https:\/\/towardsdatascience.com\/wp-content\/uploads\/2024\/01\/1JMULmUXTncYyDtWdU2oOjw.png\" alt=\"Length Features Statistics Across the Corpus. Image by Author.\" srcset=\"https:\/\/towardsdatascience.com\/wp-content\/uploads\/2024\/01\/1JMULmUXTncYyDtWdU2oOjw.png 1228w, https:\/\/towardsdatascience.com\/wp-content\/uploads\/2024\/01\/1JMULmUXTncYyDtWdU2oOjw-300x53.png 300w, https:\/\/towardsdatascience.com\/wp-content\/uploads\/2024\/01\/1JMULmUXTncYyDtWdU2oOjw-1024x182.png 1024w, https:\/\/towardsdatascience.com\/wp-content\/uploads\/2024\/01\/1JMULmUXTncYyDtWdU2oOjw-768x136.png 768w\" sizes=\"auto, (max-width: 1228px) 100vw, 1228px\" \/><figcaption class=\"wp-element-caption\">Length Features Statistics Across the Corpus. Image by Author.<\/figcaption><\/figure>\n<h3 class=\"wp-block-heading\">Dialogue Density Variation<\/h3>\n<p class=\"wp-block-paragraph\">The &quot;Length Features Statistics DataFrame&quot; figure shows that individual lines in scripts range from 0 to 582 words, with a median of 7 words, which suggests a <strong>high degree of variability in dialogue density on a line-by-line basis<\/strong>. 
In contrast, the <strong>aggregated movie data shows a much narrower range<\/strong>, with a maximum average of 38.69 words per line, indicating that while individual lines can be extremely verbose or concise, movies tend to balance out to a moderate density of words.<\/p>\n<h3 class=\"wp-block-heading\">Narrative Rhythm<\/h3>\n<p class=\"wp-block-paragraph\">With over 39% of script lines containing more than one sentence, the per-line analysis shows that multi-sentence lines are common. However, the tighter standard deviation in the movie averages (0.29 for sentences) suggests a <strong>consistency in narrative rhythm across different films,<\/strong> with most scripts keeping a steady pace in dialogue delivery.<\/p>\n<figure class=\"wp-block-image size-large\"><img data-dominant-color=\"2d4a4f\" data-has-transparency=\"true\" style=\"--dominant-color: #2d4a4f;\" loading=\"lazy\" decoding=\"async\" width=\"2010\" height=\"1942\" class=\"wp-image-314530 has-transparency\" src=\"https:\/\/towardsdatascience.com\/wp-content\/uploads\/2024\/01\/1hn7nMToSviWBVG02jZOoFQ.png\" alt=\"DALLE Generated Image by Author\" srcset=\"https:\/\/towardsdatascience.com\/wp-content\/uploads\/2024\/01\/1hn7nMToSviWBVG02jZOoFQ.png 2010w, https:\/\/towardsdatascience.com\/wp-content\/uploads\/2024\/01\/1hn7nMToSviWBVG02jZOoFQ-300x290.png 300w, https:\/\/towardsdatascience.com\/wp-content\/uploads\/2024\/01\/1hn7nMToSviWBVG02jZOoFQ-1024x989.png 1024w, https:\/\/towardsdatascience.com\/wp-content\/uploads\/2024\/01\/1hn7nMToSviWBVG02jZOoFQ-768x742.png 768w, https:\/\/towardsdatascience.com\/wp-content\/uploads\/2024\/01\/1hn7nMToSviWBVG02jZOoFQ-1536x1484.png 1536w\" sizes=\"auto, (max-width: 2010px) 100vw, 2010px\" \/><figcaption class=\"wp-element-caption\">DALLE Generated Image by Author<\/figcaption><\/figure>\n<h3 class=\"wp-block-heading\">Scriptwriting Consistency<\/h3>\n<p class=\"wp-block-paragraph\">The contrast between the median length of individual lines (7 words) and the 
average across movies (11.36 words) implies that <strong>screenwriters might often intersperse shorter lines of dialogue with longer monologues or exchanges.<\/strong> This technique could be a deliberate choice to create dynamic interactions between characters, keep the audience engaged, and ensure that each movie has its unique tempo and style.<\/p>\n<h3 class=\"wp-block-heading\">Visualizing the Outliers That Pull the Average to the Right<\/h3>\n<p class=\"wp-block-paragraph\">The histograms show a right-skewed distribution, with a central tendency for movies to feature lines averaging 7\u201313 words. This skewness is indicative of a minority of films with unusually long lines, which heavily influence the overall average.<\/p>\n<figure class=\"wp-block-image size-large\"><img data-dominant-color=\"080b0c\" data-has-transparency=\"true\" style=\"--dominant-color: #080b0c;\" loading=\"lazy\" decoding=\"async\" width=\"3278\" height=\"656\" class=\"wp-image-314531 has-transparency\" src=\"https:\/\/towardsdatascience.com\/wp-content\/uploads\/2024\/01\/1kWtIcIhacG2V4KSlRDD27Q.png\" alt=\"Histograms of Length Features. Image by Author.\" srcset=\"https:\/\/towardsdatascience.com\/wp-content\/uploads\/2024\/01\/1kWtIcIhacG2V4KSlRDD27Q.png 3278w, https:\/\/towardsdatascience.com\/wp-content\/uploads\/2024\/01\/1kWtIcIhacG2V4KSlRDD27Q-300x60.png 300w, https:\/\/towardsdatascience.com\/wp-content\/uploads\/2024\/01\/1kWtIcIhacG2V4KSlRDD27Q-1024x205.png 1024w, https:\/\/towardsdatascience.com\/wp-content\/uploads\/2024\/01\/1kWtIcIhacG2V4KSlRDD27Q-768x154.png 768w, https:\/\/towardsdatascience.com\/wp-content\/uploads\/2024\/01\/1kWtIcIhacG2V4KSlRDD27Q-1536x307.png 1536w, https:\/\/towardsdatascience.com\/wp-content\/uploads\/2024\/01\/1kWtIcIhacG2V4KSlRDD27Q-2048x410.png 2048w\" sizes=\"auto, (max-width: 3278px) 100vw, 3278px\" \/><figcaption class=\"wp-element-caption\">Histograms of Length Features. 
Image by Author.<\/figcaption><\/figure>\n<figure class=\"wp-block-image size-large\"><img data-dominant-color=\"0f161a\" data-has-transparency=\"true\" style=\"--dominant-color: #0f161a;\" loading=\"lazy\" decoding=\"async\" width=\"3268\" height=\"652\" class=\"wp-image-314532 has-transparency\" src=\"https:\/\/towardsdatascience.com\/wp-content\/uploads\/2024\/01\/1DKfbwZCJQLPS2VYYOfwsEA.png\" alt=\"Histograms of Length Features with Outliers Removed. Image by Author.\" srcset=\"https:\/\/towardsdatascience.com\/wp-content\/uploads\/2024\/01\/1DKfbwZCJQLPS2VYYOfwsEA.png 3268w, https:\/\/towardsdatascience.com\/wp-content\/uploads\/2024\/01\/1DKfbwZCJQLPS2VYYOfwsEA-300x60.png 300w, https:\/\/towardsdatascience.com\/wp-content\/uploads\/2024\/01\/1DKfbwZCJQLPS2VYYOfwsEA-1024x204.png 1024w, https:\/\/towardsdatascience.com\/wp-content\/uploads\/2024\/01\/1DKfbwZCJQLPS2VYYOfwsEA-768x153.png 768w, https:\/\/towardsdatascience.com\/wp-content\/uploads\/2024\/01\/1DKfbwZCJQLPS2VYYOfwsEA-1536x306.png 1536w, https:\/\/towardsdatascience.com\/wp-content\/uploads\/2024\/01\/1DKfbwZCJQLPS2VYYOfwsEA-2048x409.png 2048w\" sizes=\"auto, (max-width: 3268px) 100vw, 3268px\" \/><figcaption class=\"wp-element-caption\">Histograms of Length Features with Outliers Removed. Image by Author.<\/figcaption><\/figure>\n<p class=\"wp-block-paragraph\">After outliers are excluded, the bimodal distribution for words per line becomes more evident, suggesting that there are two common line lengths in scripts. This observation is interesting as it could reflect different styles or genres within the corpus. The distribution of sentences per line appears to be approximately normal, with a negligible right skew, indicating a consistent sentence structure across screenplays.<\/p>\n<h2 class=\"wp-block-heading\">2. 
Types of Sentences<\/h2>\n<h3 class=\"wp-block-heading\"><strong>Exclamation Points<\/strong><\/h3>\n<p class=\"wp-block-paragraph\">There are various ways to represent a heightened state of emotion in a script. One is to use an exclamation point (!); another is to use CAPITALIZATION FOR EMPHASIS. We&#8217;ll look at the presence of both and see if there&#8217;s a correlation with the overarching sentiment.<\/p>\n<h3 class=\"wp-block-heading\"><strong>Hyphens<\/strong><\/h3>\n<p class=\"wp-block-paragraph\">A hyphen placed at the end of a character&#8217;s dialogue (-) may signify an interruption in their speech or an abrupt pause in the character&#8217;s thinking (e.g., the character has an epiphany). It can also convey fragmented speech.<\/p>\n<h3 class=\"wp-block-heading\">Questions<\/h3>\n<p class=\"wp-block-paragraph\">I had no prior knowledge or intuition about the relationship between the presence of questions in a script and other features. However, the proportion of questions is easy to measure, and it is worth exploring whether any patterns can be detected.<\/p>\n<figure class=\"wp-block-image size-large\"><img data-dominant-color=\"1e1e1e\" data-has-transparency=\"false\" style=\"--dominant-color: #1e1e1e;\" loading=\"lazy\" decoding=\"async\" width=\"1106\" height=\"326\" class=\"wp-image-314533 not-transparent\" src=\"https:\/\/towardsdatascience.com\/wp-content\/uploads\/2024\/01\/1Z4JWXiCIas4XdPimGPpQdg.png\" alt=\"DataFrame with Added Sentence Type Features. 
Image by Author.\" srcset=\"https:\/\/towardsdatascience.com\/wp-content\/uploads\/2024\/01\/1Z4JWXiCIas4XdPimGPpQdg.png 1106w, https:\/\/towardsdatascience.com\/wp-content\/uploads\/2024\/01\/1Z4JWXiCIas4XdPimGPpQdg-300x88.png 300w, https:\/\/towardsdatascience.com\/wp-content\/uploads\/2024\/01\/1Z4JWXiCIas4XdPimGPpQdg-1024x302.png 1024w, https:\/\/towardsdatascience.com\/wp-content\/uploads\/2024\/01\/1Z4JWXiCIas4XdPimGPpQdg-768x226.png 768w\" sizes=\"auto, (max-width: 1106px) 100vw, 1106px\" \/><figcaption class=\"wp-element-caption\">DataFrame with Added Sentence Type Features. Image by Author.<\/figcaption><\/figure>\n<p class=\"wp-block-paragraph\">Below, we see that the proportion of lines with questions, indicated at 31.4%, suggests <strong>a strong preference for interactive dialogue<\/strong> within movies. This is substantially higher than the proportion of lines with exclamations, at 8.9%, which could indicate that <strong>while intense emotional expressions are present, they are less frequent than interrogative exchanges<\/strong>.<\/p>\n<figure class=\"wp-block-image size-large\"><img data-dominant-color=\"1f1e1e\" data-has-transparency=\"true\" style=\"--dominant-color: #1f1e1e;\" loading=\"lazy\" decoding=\"async\" width=\"2004\" height=\"1002\" class=\"wp-image-314534 has-transparency\" src=\"https:\/\/towardsdatascience.com\/wp-content\/uploads\/2024\/01\/1S3ZpNLq-mrYxYoJxbh6XXg.png\" alt=\"Visualization Representing the Sentence Type Proportions. 
Image by Author.\" srcset=\"https:\/\/towardsdatascience.com\/wp-content\/uploads\/2024\/01\/1S3ZpNLq-mrYxYoJxbh6XXg.png 2004w, https:\/\/towardsdatascience.com\/wp-content\/uploads\/2024\/01\/1S3ZpNLq-mrYxYoJxbh6XXg-300x150.png 300w, https:\/\/towardsdatascience.com\/wp-content\/uploads\/2024\/01\/1S3ZpNLq-mrYxYoJxbh6XXg-1024x512.png 1024w, https:\/\/towardsdatascience.com\/wp-content\/uploads\/2024\/01\/1S3ZpNLq-mrYxYoJxbh6XXg-768x384.png 768w, https:\/\/towardsdatascience.com\/wp-content\/uploads\/2024\/01\/1S3ZpNLq-mrYxYoJxbh6XXg-1536x768.png 1536w\" sizes=\"auto, (max-width: 2004px) 100vw, 2004px\" \/><figcaption class=\"wp-element-caption\">Visualization Representing the Sentence Type Proportions. Image by Author.<\/figcaption><\/figure>\n<p class=\"wp-block-paragraph\">The boxplot for the count of all-caps words reveals that the use of capitalized words is not common, suggesting that screenwriters may prefer subtler methods of conveying emphasis in dialogue rather than relying on text formatting.<\/p>\n<figure class=\"wp-block-image size-large\"><img data-dominant-color=\"172028\" data-has-transparency=\"false\" style=\"--dominant-color: #172028;\" loading=\"lazy\" decoding=\"async\" width=\"1928\" height=\"306\" class=\"wp-image-314535 not-transparent\" src=\"https:\/\/towardsdatascience.com\/wp-content\/uploads\/2024\/01\/1N0JeQ1jqxanEtOkqSyu8EQ.png\" alt=\"Boxplot for the Count of All Caps Words. 
Image by Author.\" srcset=\"https:\/\/towardsdatascience.com\/wp-content\/uploads\/2024\/01\/1N0JeQ1jqxanEtOkqSyu8EQ.png 1928w, https:\/\/towardsdatascience.com\/wp-content\/uploads\/2024\/01\/1N0JeQ1jqxanEtOkqSyu8EQ-300x48.png 300w, https:\/\/towardsdatascience.com\/wp-content\/uploads\/2024\/01\/1N0JeQ1jqxanEtOkqSyu8EQ-1024x163.png 1024w, https:\/\/towardsdatascience.com\/wp-content\/uploads\/2024\/01\/1N0JeQ1jqxanEtOkqSyu8EQ-768x122.png 768w, https:\/\/towardsdatascience.com\/wp-content\/uploads\/2024\/01\/1N0JeQ1jqxanEtOkqSyu8EQ-1536x244.png 1536w\" sizes=\"auto, (max-width: 1928px) 100vw, 1928px\" \/><figcaption class=\"wp-element-caption\">Boxplot for the Count of All Caps Words. Image by Author.<\/figcaption><\/figure>\n<figure class=\"wp-block-image size-large\"><img data-dominant-color=\"101a17\" data-has-transparency=\"true\" style=\"--dominant-color: #101a17;\" loading=\"lazy\" decoding=\"async\" width=\"2370\" height=\"654\" class=\"wp-image-314536 has-transparency\" src=\"https:\/\/towardsdatascience.com\/wp-content\/uploads\/2024\/01\/1dIQbHYZHdeBNIXowkf6jCA.png\" alt=\"Histogram for the Punctuation Usage Distributions. Image by Author.\" srcset=\"https:\/\/towardsdatascience.com\/wp-content\/uploads\/2024\/01\/1dIQbHYZHdeBNIXowkf6jCA.png 2370w, https:\/\/towardsdatascience.com\/wp-content\/uploads\/2024\/01\/1dIQbHYZHdeBNIXowkf6jCA-300x83.png 300w, https:\/\/towardsdatascience.com\/wp-content\/uploads\/2024\/01\/1dIQbHYZHdeBNIXowkf6jCA-1024x283.png 1024w, https:\/\/towardsdatascience.com\/wp-content\/uploads\/2024\/01\/1dIQbHYZHdeBNIXowkf6jCA-768x212.png 768w, https:\/\/towardsdatascience.com\/wp-content\/uploads\/2024\/01\/1dIQbHYZHdeBNIXowkf6jCA-1536x424.png 1536w, https:\/\/towardsdatascience.com\/wp-content\/uploads\/2024\/01\/1dIQbHYZHdeBNIXowkf6jCA-2048x565.png 2048w\" sizes=\"auto, (max-width: 2370px) 100vw, 2370px\" \/><figcaption class=\"wp-element-caption\">Histogram for the Punctuation Usage Distributions. 
Image by Author.<\/figcaption><\/figure>\n<p class=\"wp-block-paragraph\">While questions are more common, the range of usage varies widely among movies, potentially reflecting different genres or directorial styles. For example, a thriller may have more questions built into the dialogue to maintain suspense, whereas a comedy may use exclamations to highlight punchlines.<\/p>\n<figure class=\"wp-block-image size-large\"><img data-dominant-color=\"090c0b\" data-has-transparency=\"true\" style=\"--dominant-color: #090c0b;\" loading=\"lazy\" decoding=\"async\" width=\"2356\" height=\"654\" class=\"wp-image-314537 has-transparency\" src=\"https:\/\/towardsdatascience.com\/wp-content\/uploads\/2024\/01\/1XGcE7ykaITGstJ1btZV_tg.png\" alt=\"Histogram for Proportion of Lines that Contain Hyphen at the End and the Average Number of Uppercased Words. Image by Author.\" srcset=\"https:\/\/towardsdatascience.com\/wp-content\/uploads\/2024\/01\/1XGcE7ykaITGstJ1btZV_tg.png 2356w, https:\/\/towardsdatascience.com\/wp-content\/uploads\/2024\/01\/1XGcE7ykaITGstJ1btZV_tg-300x83.png 300w, https:\/\/towardsdatascience.com\/wp-content\/uploads\/2024\/01\/1XGcE7ykaITGstJ1btZV_tg-1024x284.png 1024w, https:\/\/towardsdatascience.com\/wp-content\/uploads\/2024\/01\/1XGcE7ykaITGstJ1btZV_tg-768x213.png 768w, https:\/\/towardsdatascience.com\/wp-content\/uploads\/2024\/01\/1XGcE7ykaITGstJ1btZV_tg-1536x426.png 1536w, https:\/\/towardsdatascience.com\/wp-content\/uploads\/2024\/01\/1XGcE7ykaITGstJ1btZV_tg-2048x569.png 2048w\" sizes=\"auto, (max-width: 2356px) 100vw, 2356px\" \/><figcaption class=\"wp-element-caption\">Histogram for Proportion of Lines that Contain Hyphen at the End and the Average Number of Uppercased Words. Image by Author.<\/figcaption><\/figure>\n<p class=\"wp-block-paragraph\">The histogram for lines that end with a hyphen shows a significant skew towards a lower proportion, indicating that lines ending with a hyphen are relatively uncommon in movie scripts. 
This could suggest that interrupted dialogue or sentences leading into actions (which are often denoted by hyphens) are used sparingly, perhaps to maintain the flow of dialogue or to avoid overusing a device that might otherwise lose its impact.<\/p>\n<h2 class=\"wp-block-heading\">3. Part of Speech and Profanity<\/h2>\n<p class=\"wp-block-paragraph\">Part of speech helps us understand the grammatical function of a word in a sentence. For instance, genres like historical or biographical films are often flooded with proper nouns, making the tracking of these and other common tags potentially revealing.<\/p>\n<p class=\"wp-block-paragraph\">According to &quot;<a href=\"https:\/\/stephenfollows.com\/wp-content\/uploads\/2019\/01\/JudgingScreenplaysByTheirCoverage_StephenFollows_c.pdf\">Judging Screenplays by Their Coverage<\/a>&quot; by Stephen Follows and Josh Cockcroft, &quot;swear words (are) not spread equally across all scripts [&#8230;] Comedies are the sweariest, beating Action and Horror scripts by a tiny margin (and) the genres featuring the lowest levels of swearing are Family, Animated and Faith-based scripts&quot; (42).<\/p>\n<h3 class=\"wp-block-heading\">Part of Speech Tagging<\/h3>\n<p class=\"wp-block-paragraph\">We&#8217;ll start by taking a look at the most frequent tags from the text by flattening the text, taking a sample, and using SpaCy for POS tagging.<\/p>\n<figure class=\"wp-block-image size-large\"><img data-dominant-color=\"1e261e\" data-has-transparency=\"true\" style=\"--dominant-color: #1e261e;\" loading=\"lazy\" decoding=\"async\" width=\"1042\" height=\"608\" class=\"wp-image-314538 has-transparency\" src=\"https:\/\/towardsdatascience.com\/wp-content\/uploads\/2024\/01\/1R0FORFMbYsUxX6Fb35xL7g.png\" alt=\"Barplot of the Most Frequent POS Tags from a Sample of Text. 
Image by Author.\" srcset=\"https:\/\/towardsdatascience.com\/wp-content\/uploads\/2024\/01\/1R0FORFMbYsUxX6Fb35xL7g.png 1042w, https:\/\/towardsdatascience.com\/wp-content\/uploads\/2024\/01\/1R0FORFMbYsUxX6Fb35xL7g-300x175.png 300w, https:\/\/towardsdatascience.com\/wp-content\/uploads\/2024\/01\/1R0FORFMbYsUxX6Fb35xL7g-1024x597.png 1024w, https:\/\/towardsdatascience.com\/wp-content\/uploads\/2024\/01\/1R0FORFMbYsUxX6Fb35xL7g-768x448.png 768w\" sizes=\"auto, (max-width: 1042px) 100vw, 1042px\" \/><figcaption class=\"wp-element-caption\">Barplot of the Most Frequent POS Tags from a Sample of Text. Image by Author.<\/figcaption><\/figure>\n<p class=\"wp-block-paragraph\">Overall, nouns are by far the most common parts of speech, with adjectives and verbs maintaining relatively similar counts. Adverbs are the rarest part of speech for our movies.<\/p>\n<pre class=\"wp-block-prismatic-blocks\"><code class=\"language-vbnet\">NN: noun, singular or mass\nJJ: adjective\nVB: verb, base form\nVBP: verb, non-3rd person singular present\nRB: adverb<\/code><\/pre>\n<figure class=\"wp-block-image size-large\"><img data-dominant-color=\"202221\" data-has-transparency=\"true\" style=\"--dominant-color: #202221;\" loading=\"lazy\" decoding=\"async\" width=\"1732\" height=\"776\" class=\"wp-image-314539 has-transparency\" src=\"https:\/\/towardsdatascience.com\/wp-content\/uploads\/2024\/01\/1pEkn0DBCQvhXOkHy0CpXPQ.png\" alt=\"Distribution of POS Tags. 
Image by Author.\" srcset=\"https:\/\/towardsdatascience.com\/wp-content\/uploads\/2024\/01\/1pEkn0DBCQvhXOkHy0CpXPQ.png 1732w, https:\/\/towardsdatascience.com\/wp-content\/uploads\/2024\/01\/1pEkn0DBCQvhXOkHy0CpXPQ-300x134.png 300w, https:\/\/towardsdatascience.com\/wp-content\/uploads\/2024\/01\/1pEkn0DBCQvhXOkHy0CpXPQ-1024x459.png 1024w, https:\/\/towardsdatascience.com\/wp-content\/uploads\/2024\/01\/1pEkn0DBCQvhXOkHy0CpXPQ-768x344.png 768w, https:\/\/towardsdatascience.com\/wp-content\/uploads\/2024\/01\/1pEkn0DBCQvhXOkHy0CpXPQ-1536x688.png 1536w\" sizes=\"auto, (max-width: 1732px) 100vw, 1732px\" \/><figcaption class=\"wp-element-caption\">Distribution of POS Tags. Image by Author.<\/figcaption><\/figure>\n<p class=\"wp-block-paragraph\">I chose to display all four histograms on this plot because it highlights a clear differentiation in the usage of various parts of speech within movie dialogues. <strong>Nouns dominate the linguistic landscape, occupying 40% to 60% of the dialogue<\/strong> whereas adverbs range anywhere between 0 to 10%. This prevalence underlines the concrete and tangible nature of film narratives, which often rely on specific nouns to anchor the conversation and set scenes. 
Adverbs, conversely, appear infrequently, suggesting that <strong>movie dialogue may favor direct and concise language over descriptive or qualifying phrases.<\/strong><\/p>\n<h3 class=\"wp-block-heading\">Profanity<\/h3>\n<p class=\"wp-block-paragraph\">We&#8217;ll detect profanity using the &#8216;badwords.txt&#8217; word list from <a href=\"https:\/\/github.com\/areebbeigh\/profanityfilter\/tree\/master\/profanityfilter\/data\">profanityfilter<\/a>.<\/p>\n<figure class=\"wp-block-image size-large\"><img data-dominant-color=\"0a100e\" data-has-transparency=\"true\" style=\"--dominant-color: #0a100e;\" loading=\"lazy\" decoding=\"async\" width=\"1110\" height=\"808\" class=\"wp-image-314540 has-transparency\" src=\"https:\/\/towardsdatascience.com\/wp-content\/uploads\/2024\/01\/1Ly7JfFDk02duVB0WFVOLSA.png\" alt=\"Histogram of the Proportion of Lines that Contain Profanity. Image by Author.\" srcset=\"https:\/\/towardsdatascience.com\/wp-content\/uploads\/2024\/01\/1Ly7JfFDk02duVB0WFVOLSA.png 1110w, https:\/\/towardsdatascience.com\/wp-content\/uploads\/2024\/01\/1Ly7JfFDk02duVB0WFVOLSA-300x218.png 300w, https:\/\/towardsdatascience.com\/wp-content\/uploads\/2024\/01\/1Ly7JfFDk02duVB0WFVOLSA-1024x745.png 1024w, https:\/\/towardsdatascience.com\/wp-content\/uploads\/2024\/01\/1Ly7JfFDk02duVB0WFVOLSA-768x559.png 768w\" sizes=\"auto, (max-width: 1110px) 100vw, 1110px\" \/><figcaption class=\"wp-element-caption\">Histogram of the Proportion of Lines that Contain Profanity. Image by Author.<\/figcaption><\/figure>\n<p class=\"wp-block-paragraph\">While most movie lines are devoid of profanity, it has a significant presence in certain scripts, with a few reaching a proportion as high as 0.37. This might reflect genre, setting, or character development choices, where profanity is used to add realism and intensity or to delineate characters&#8217; personalities.<\/p>\n<h2 class=\"wp-block-heading\">4. 
Sentiment<\/h2>\n<p class=\"wp-block-paragraph\">We&#8217;ll utilize two sentiment analysis models: NLTK&#8217;s VADER, which is quick but uses a basic rule-based approach, and Flair, which is more accurate but computationally intensive.<\/p>\n<p class=\"wp-block-paragraph\">VADER assigns sentiment scores based on individual words, so its scores can be diluted by neutral words even when strongly negative words are present, making it less precise. It also struggles to identify sarcasm or contextual nuance.<\/p>\n<h3 class=\"wp-block-heading\">Visualizing Frequency of Positive, Negative, Non-Neutral and All Words<\/h3>\n<p class=\"wp-block-paragraph\">Flair is an embedding-based model, which enables it to capture context. Words with similar vector representations are often used in the same context. The downside to using this approach is that it&#8217;s significantly slower than the naive, rule-based approach. The NLTK model took roughly 4 minutes to run, while the Flair model took roughly 3 hours.<\/p>\n<figure class=\"wp-block-image size-large\"><img data-dominant-color=\"211a1a\" data-has-transparency=\"true\" style=\"--dominant-color: #211a1a;\" loading=\"lazy\" decoding=\"async\" width=\"2706\" height=\"1400\" class=\"wp-image-314541 has-transparency\" src=\"https:\/\/towardsdatascience.com\/wp-content\/uploads\/2024\/01\/1dICHLDbo_nW9BDuoK0uqBg.png\" alt=\"WordCloud for Various Sentiment Groups. 
Image by Author.\" srcset=\"https:\/\/towardsdatascience.com\/wp-content\/uploads\/2024\/01\/1dICHLDbo_nW9BDuoK0uqBg.png 2706w, https:\/\/towardsdatascience.com\/wp-content\/uploads\/2024\/01\/1dICHLDbo_nW9BDuoK0uqBg-300x155.png 300w, https:\/\/towardsdatascience.com\/wp-content\/uploads\/2024\/01\/1dICHLDbo_nW9BDuoK0uqBg-1024x530.png 1024w, https:\/\/towardsdatascience.com\/wp-content\/uploads\/2024\/01\/1dICHLDbo_nW9BDuoK0uqBg-768x397.png 768w, https:\/\/towardsdatascience.com\/wp-content\/uploads\/2024\/01\/1dICHLDbo_nW9BDuoK0uqBg-1536x795.png 1536w, https:\/\/towardsdatascience.com\/wp-content\/uploads\/2024\/01\/1dICHLDbo_nW9BDuoK0uqBg-2048x1060.png 2048w\" sizes=\"auto, (max-width: 2706px) 100vw, 2706px\" \/><figcaption class=\"wp-element-caption\">WordCloud for Various Sentiment Groups. Image by Author.<\/figcaption><\/figure>\n<figure class=\"wp-block-image size-large\"><img data-dominant-color=\"202020\" data-has-transparency=\"false\" style=\"--dominant-color: #202020;\" loading=\"lazy\" decoding=\"async\" width=\"3258\" height=\"742\" class=\"wp-image-314542 not-transparent\" src=\"https:\/\/towardsdatascience.com\/wp-content\/uploads\/2024\/01\/1ofT5KYNkrfX2XEenbc9OvQ.png\" alt=\"DataFrame with All Features Added. 
Image by Author.\" srcset=\"https:\/\/towardsdatascience.com\/wp-content\/uploads\/2024\/01\/1ofT5KYNkrfX2XEenbc9OvQ.png 3258w, https:\/\/towardsdatascience.com\/wp-content\/uploads\/2024\/01\/1ofT5KYNkrfX2XEenbc9OvQ-300x68.png 300w, https:\/\/towardsdatascience.com\/wp-content\/uploads\/2024\/01\/1ofT5KYNkrfX2XEenbc9OvQ-1024x233.png 1024w, https:\/\/towardsdatascience.com\/wp-content\/uploads\/2024\/01\/1ofT5KYNkrfX2XEenbc9OvQ-768x175.png 768w, https:\/\/towardsdatascience.com\/wp-content\/uploads\/2024\/01\/1ofT5KYNkrfX2XEenbc9OvQ-1536x350.png 1536w, https:\/\/towardsdatascience.com\/wp-content\/uploads\/2024\/01\/1ofT5KYNkrfX2XEenbc9OvQ-2048x466.png 2048w\" sizes=\"auto, (max-width: 3258px) 100vw, 3258px\" \/><figcaption class=\"wp-element-caption\">DataFrame with All Features Added. Image by Author.<\/figcaption><\/figure>\n<h2 class=\"wp-block-heading\">Relationships Among Variables<\/h2>\n<h3 class=\"wp-block-heading\">Correlation<\/h3>\n<blockquote class=\"wp-block-quote is-layout-flow wp-block-quote-is-layout-flow\"><p>&quot;The Pearson correlation evaluates the linear relationship between two continuous variables. A relationship is linear when a change in one variable is associated with a proportional change in the other variable. The Spearman correlation evaluates the monotonic relationship between two continuous or ordinal variables. 
In a monotonic relationship, the variables tend to change together, but not necessarily at a constant rate.&quot; (<a href=\"https:\/\/support.minitab.com\/en-us\/minitab-express\/1\/help-and-how-to\/modeling-statistics\/regression\/supporting-topics\/basics\/a-comparison-of-the-pearson-and-spearman-correlation-methods\/\">source<\/a>)<\/p><\/blockquote>\n<p class=\"wp-block-paragraph\">In our analysis, we will use the Spearman correlation coefficient to identify monotonic relationships between each pair of features.<\/p>\n<figure class=\"wp-block-image size-large\"><img data-dominant-color=\"afa8a9\" data-has-transparency=\"true\" style=\"--dominant-color: #afa8a9;\" loading=\"lazy\" decoding=\"async\" width=\"1482\" height=\"686\" class=\"wp-image-314543 has-transparency\" src=\"https:\/\/towardsdatascience.com\/wp-content\/uploads\/2024\/01\/1JF7_B7VQDNQjB-XqTRyhcA.png\" alt=\"Correlation Heatmap Between Features. Image by Author.\" srcset=\"https:\/\/towardsdatascience.com\/wp-content\/uploads\/2024\/01\/1JF7_B7VQDNQjB-XqTRyhcA.png 1482w, https:\/\/towardsdatascience.com\/wp-content\/uploads\/2024\/01\/1JF7_B7VQDNQjB-XqTRyhcA-300x139.png 300w, https:\/\/towardsdatascience.com\/wp-content\/uploads\/2024\/01\/1JF7_B7VQDNQjB-XqTRyhcA-1024x474.png 1024w, https:\/\/towardsdatascience.com\/wp-content\/uploads\/2024\/01\/1JF7_B7VQDNQjB-XqTRyhcA-768x355.png 768w\" sizes=\"auto, (max-width: 1482px) 100vw, 1482px\" \/><figcaption class=\"wp-element-caption\">Correlation Heatmap Between Features. 
Image by Author.<\/figcaption><\/figure>\n<h3 class=\"wp-block-heading\">Only Display Significant Correlations<\/h3>\n<p class=\"wp-block-paragraph\">The heatmap below displays only the significant correlations, where the p-value for the Spearman correlation is less than 0.05.<\/p>\n<figure class=\"wp-block-image size-large\"><img data-dominant-color=\"595355\" data-has-transparency=\"true\" style=\"--dominant-color: #595355;\" loading=\"lazy\" decoding=\"async\" width=\"1480\" height=\"676\" class=\"wp-image-314544 has-transparency\" src=\"https:\/\/towardsdatascience.com\/wp-content\/uploads\/2024\/01\/1TRqOcORsKpAvnx5NSyoGMw.png\" alt=\"Correlation Heatmap Only Showing Significant Correlations. Image by Author.\" srcset=\"https:\/\/towardsdatascience.com\/wp-content\/uploads\/2024\/01\/1TRqOcORsKpAvnx5NSyoGMw.png 1480w, https:\/\/towardsdatascience.com\/wp-content\/uploads\/2024\/01\/1TRqOcORsKpAvnx5NSyoGMw-300x137.png 300w, https:\/\/towardsdatascience.com\/wp-content\/uploads\/2024\/01\/1TRqOcORsKpAvnx5NSyoGMw-1024x468.png 1024w, https:\/\/towardsdatascience.com\/wp-content\/uploads\/2024\/01\/1TRqOcORsKpAvnx5NSyoGMw-768x351.png 768w\" sizes=\"auto, (max-width: 1480px) 100vw, 1480px\" \/><figcaption class=\"wp-element-caption\">Correlation Heatmap Only Showing Significant Correlations. Image by Author.<\/figcaption><\/figure>\n<p class=\"wp-block-paragraph\">I expected to find some significant correlations, such as those between the average number of words and the average number of sentences or the average number of uppercase words. 
I&#8217;d also anticipated the following correlations:<\/p>\n<ul class=\"wp-block-list\">\n<li>Sentiment features correlate with the proportion of profanity.<\/li>\n<li>Correlations among different part-of-speech tag proportions.<\/li>\n<\/ul>\n<p class=\"wp-block-paragraph\">There were a few interesting observations:<\/p>\n<ul class=\"wp-block-list\">\n<li>No correlation between the use of exclamation marks and profanity<\/li>\n<li>A significant, albeit weak, correlation between the use of questions and profanity.<\/li>\n<li>A weak negative correlation between the Flair sentiment and the use of questions.<\/li>\n<\/ul>\n<p class=\"wp-block-paragraph\">A significant positive correlation between the average number of words and the use of proper nouns (<code>prop_noun<\/code>) may also indicate that more complex dialogues include more specific references to entities or names, which could be characteristic of certain genres like science fiction or fantasy with complex world-building.<\/p>\n<h3 class=\"wp-block-heading\">Profanity Against Variables<\/h3>\n<p class=\"wp-block-paragraph\">As noted above, I was quite surprised to see a correlation between questions and profanity yet no relationship between exclamation marks and profanity. Therefore, I decided to plot out a slope graph to see if we could uncover any relationship there.<\/p>\n<figure class=\"wp-block-image size-large\"><img data-dominant-color=\"080708\" data-has-transparency=\"true\" style=\"--dominant-color: #080708;\" loading=\"lazy\" decoding=\"async\" width=\"956\" height=\"778\" class=\"wp-image-314545 has-transparency\" src=\"https:\/\/towardsdatascience.com\/wp-content\/uploads\/2024\/01\/1xdhkYgsF1sauqMC4_uJbfw.png\" alt=\"Slopegraph for the Percentage Change of Profanity Proportion Between Lines With Exclamation &amp; Those Without. 
Image by Author.\" srcset=\"https:\/\/towardsdatascience.com\/wp-content\/uploads\/2024\/01\/1xdhkYgsF1sauqMC4_uJbfw.png 956w, https:\/\/towardsdatascience.com\/wp-content\/uploads\/2024\/01\/1xdhkYgsF1sauqMC4_uJbfw-300x244.png 300w, https:\/\/towardsdatascience.com\/wp-content\/uploads\/2024\/01\/1xdhkYgsF1sauqMC4_uJbfw-768x625.png 768w\" sizes=\"auto, (max-width: 956px) 100vw, 956px\" \/><figcaption class=\"wp-element-caption\">Slopegraph for the Percentage Change of Profanity Proportion Between Lines With Exclamation &amp; Those Without. Image by Author.<\/figcaption><\/figure>\n<h3 class=\"wp-block-heading\">Profanity Against Variables Summary<\/h3>\n<p class=\"wp-block-paragraph\">Interestingly enough, despite there being no significant correlation between the proportion of exclamations and the proportion of profanity, the largest jump in the proportion of profanity appears between dialogue with no exclamation marks and dialogue with exclamation marks.<\/p>\n<h3 class=\"wp-block-heading\">Final Data<\/h3>\n<figure class=\"wp-block-image size-large\"><img data-dominant-color=\"202020\" data-has-transparency=\"false\" style=\"--dominant-color: #202020;\" loading=\"lazy\" decoding=\"async\" width=\"2622\" height=\"346\" class=\"wp-image-314546 not-transparent\" src=\"https:\/\/towardsdatascience.com\/wp-content\/uploads\/2024\/01\/1-aTXhMfL4C_0UbXRyNmDtw.png\" alt=\"Final DataFrame. 
Image by Author.\" srcset=\"https:\/\/towardsdatascience.com\/wp-content\/uploads\/2024\/01\/1-aTXhMfL4C_0UbXRyNmDtw.png 2622w, https:\/\/towardsdatascience.com\/wp-content\/uploads\/2024\/01\/1-aTXhMfL4C_0UbXRyNmDtw-300x40.png 300w, https:\/\/towardsdatascience.com\/wp-content\/uploads\/2024\/01\/1-aTXhMfL4C_0UbXRyNmDtw-1024x135.png 1024w, https:\/\/towardsdatascience.com\/wp-content\/uploads\/2024\/01\/1-aTXhMfL4C_0UbXRyNmDtw-768x101.png 768w, https:\/\/towardsdatascience.com\/wp-content\/uploads\/2024\/01\/1-aTXhMfL4C_0UbXRyNmDtw-1536x203.png 1536w, https:\/\/towardsdatascience.com\/wp-content\/uploads\/2024\/01\/1-aTXhMfL4C_0UbXRyNmDtw-2048x270.png 2048w\" sizes=\"auto, (max-width: 2622px) 100vw, 2622px\" \/><figcaption class=\"wp-element-caption\">Final DataFrame. Image by Author.<\/figcaption><\/figure>\n<h2 class=\"wp-block-heading\">Modeling<\/h2>\n<p class=\"wp-block-paragraph\">I am going to fast-forward through this last part and provide a brief overview of the modeling process and performance. 
However, please feel free to let me know if you&#8217;d like a more in-depth exploration of the modeling work done here and I&#8217;ll release a part two \ud83d\ude42<\/p>\n<p class=\"wp-block-paragraph\">Here we are building a binary classifier to predict whether a movie belongs to the drama genre.<\/p>\n<h3 class=\"wp-block-heading\">LazyPredict<\/h3>\n<p class=\"wp-block-paragraph\">To expedite the modeling phase, we utilized LazyPredict, an AutoML-style Python package that fits many common machine learning algorithms to a dataset and reports standard metrics for the task.<\/p>\n<h3 class=\"wp-block-heading\">Hyperparameter Tuning: Bayesian Optimization<\/h3>\n<p class=\"wp-block-paragraph\">We then performed hyperparameter tuning on the first four models:<\/p>\n<ol class=\"wp-block-list\">\n<li>ExtraTreesClassifier<\/li>\n<li>AdaBoostClassifier<\/li>\n<li>Perceptron<\/li>\n<li>XGBClassifier<\/li>\n<\/ol>\n<p class=\"wp-block-paragraph\">Classically, hyperparameter sweeps are run via grid search (brute force), where every combination of hyperparameter values on a predefined grid is evaluated. Because the number of trials grows exponentially with each additional hyperparameter, this is often infeasible. Another approach, random search, samples hyperparameter combinations at random and often finds a good configuration more efficiently than a grid search that cannot be run to completion.<\/p>\n<p class=\"wp-block-paragraph\">Instead of either of these options, I will utilize Bayesian Optimization. This method fits a Gaussian process as a surrogate for the black-box objective function over the search space. 
The overarching advantage is that each trial is informed by the results of the previous ones, so the search converges toward a local solution (like any ML model does) rather than simply trying out different hyperparameters blindly.<\/p>\n<h3 class=\"wp-block-heading\">Manual Hyperparameter Tuning: Extra Trees Classifier<\/h3>\n<pre class=\"wp-block-prismatic-blocks\"><code class=\"language-yaml\">Train F1 Score: 0.985\nTrain Accuracy Score: 0.984\n\nTest F1 Score: 0.696\nTest Accuracy Score: 0.675<\/code><\/pre>\n<p class=\"wp-block-paragraph\">The F1 score, the harmonic mean of precision and recall, serves as a key indicator of our model&#8217;s performance. Precision reflects the model&#8217;s reliability in correctly identifying a movie as belonging to the &#8216;drama&#8217; genre, while recall measures the model&#8217;s ability to capture all relevant instances of drama movies.<\/p>\n<p class=\"wp-block-paragraph\">Considering the constraints, such as the absence of a fully developed pipeline for filtering low-variance columns and addressing potential multicollinearity, and the lack of a more extensive feature engineering process, the model demonstrated reasonable effectiveness, although the gap between the train and test scores suggests that it is overfitting. The following figure highlights the features that were most important for the model.<\/p>\n<figure class=\"wp-block-image size-large\"><img data-dominant-color=\"cddfcd\" data-has-transparency=\"true\" style=\"--dominant-color: #cddfcd;\" loading=\"lazy\" decoding=\"async\" width=\"1758\" height=\"1382\" class=\"wp-image-314547 has-transparency\" src=\"https:\/\/towardsdatascience.com\/wp-content\/uploads\/2024\/01\/1ACFnoZKluNqlqJQ19vsWEQ.png\" alt=\"Feature Importance Scores. 
Image by Author.\" srcset=\"https:\/\/towardsdatascience.com\/wp-content\/uploads\/2024\/01\/1ACFnoZKluNqlqJQ19vsWEQ.png 1758w, https:\/\/towardsdatascience.com\/wp-content\/uploads\/2024\/01\/1ACFnoZKluNqlqJQ19vsWEQ-300x236.png 300w, https:\/\/towardsdatascience.com\/wp-content\/uploads\/2024\/01\/1ACFnoZKluNqlqJQ19vsWEQ-1024x805.png 1024w, https:\/\/towardsdatascience.com\/wp-content\/uploads\/2024\/01\/1ACFnoZKluNqlqJQ19vsWEQ-768x604.png 768w, https:\/\/towardsdatascience.com\/wp-content\/uploads\/2024\/01\/1ACFnoZKluNqlqJQ19vsWEQ-1536x1207.png 1536w\" sizes=\"auto, (max-width: 1758px) 100vw, 1758px\" \/><figcaption class=\"wp-element-caption\">Feature Importance Scores. Image by Author.<\/figcaption><\/figure>\n<h2 class=\"wp-block-heading\">Future Work<\/h2>\n<p class=\"wp-block-paragraph\">This article was mainly focused on the process of feature generation and analyzing the data within the context of screenplays. However, if I wanted to work more on modeling, I&#8217;d focus on feature engineering, examine the effects of multicollinearity, and spend more time on model selection.<\/p>\n<p class=\"wp-block-paragraph\">More specifically, I would:<\/p>\n<ul class=\"wp-block-list\">\n<li>Use an <strong>ensemble approach<\/strong> where the <strong>first model would be fit on the most significant words in the corpus<\/strong> using a TF-IDF vectorizer and the second model would be fit on the features I have here. Then, I&#8217;d combine the two models to see if integrating them would boost performance.<\/li>\n<li>Perform topic modeling using TF-IDF and Latent Dirichlet Allocation. 
Extracting topics may boost performance, particularly for a multi-label classifier.<\/li>\n<li>Add other features that could hold predictive power for the genre, such as the movie rating, the movie year, and the average amount of dialogue per character.<\/li>\n<li>Analyze the average sentiment across time, and <strong>categorize it into a &quot;plot arc&quot;<\/strong> category. According to Follows, there are six common story plot arcs: Riches to Rags (a continuing emotional fall), Rags to Riches (a continuing emotional rise), Oedipus (fall-rise-fall), Cinderella (rise-fall-rise), Man in a Hole (fall-rise) and Icarus (rise-fall).<\/li>\n<li>We could create a function that would take the average sentiment and categorize the movie into one of the plot arcs and see if that would also be predictive of the genre.<\/li>\n<\/ul>\n<figure class=\"wp-block-image size-large\"><img data-dominant-color=\"111211\" data-has-transparency=\"true\" style=\"--dominant-color: #111211;\" loading=\"lazy\" decoding=\"async\" width=\"1748\" height=\"390\" class=\"wp-image-314548 has-transparency\" src=\"https:\/\/towardsdatascience.com\/wp-content\/uploads\/2024\/01\/140nRhh2Y_7fRQKD-0ChpLg.png\" alt=\"Average Sentiment Visual Example. 
Image by Author.\" srcset=\"https:\/\/towardsdatascience.com\/wp-content\/uploads\/2024\/01\/140nRhh2Y_7fRQKD-0ChpLg.png 1748w, https:\/\/towardsdatascience.com\/wp-content\/uploads\/2024\/01\/140nRhh2Y_7fRQKD-0ChpLg-300x67.png 300w, https:\/\/towardsdatascience.com\/wp-content\/uploads\/2024\/01\/140nRhh2Y_7fRQKD-0ChpLg-1024x228.png 1024w, https:\/\/towardsdatascience.com\/wp-content\/uploads\/2024\/01\/140nRhh2Y_7fRQKD-0ChpLg-768x171.png 768w, https:\/\/towardsdatascience.com\/wp-content\/uploads\/2024\/01\/140nRhh2Y_7fRQKD-0ChpLg-1536x343.png 1536w\" sizes=\"auto, (max-width: 1748px) 100vw, 1748px\" \/><figcaption class=\"wp-element-caption\">Average Sentiment Visual Example. 
Image by Author.<\/figcaption><\/figure>\n<h2 class=\"wp-block-heading\">Concluding Remarks<\/h2>\n<p class=\"wp-block-paragraph\">I hope you enjoyed this analysis and that this article showcased the potential of tailoring analyses to the unique characteristics of a field. While I focused on cinematic dialogue, the principles of domain-driven data analysis and modeling are universal. I encourage you to research your chosen domain, remain curious, and get creative with feature engineering during your next modeling task. I would also love to hear about your own experiences with interesting domain-driven analyses, so feel free to write a comment here or email me at christabellepabalan@gmail.com. Thanks!<\/p>\n<h3 class=\"wp-block-heading\">References<\/h3>\n<ul class=\"wp-block-list\">\n<li>Chang, J. P., Chiam, C., Fu, L., Wang, A., Zhang, J., &amp; Danescu-Niculescu-Mizil, C. (2020). ConvoKit: A toolkit for the analysis of conversations. Proceedings of SIGDIAL.<\/li>\n<li>\n<p class=\"wp-block-paragraph\">Cornell Movie-Dialogs Corpus. Retrieved from <a href=\"https:\/\/www.kaggle.com\/rajathmc\/cornell-moviedialog-corpus\/kernels\">https:\/\/www.kaggle.com\/rajathmc\/cornell-moviedialog-corpus\/kernels<\/a> (originally retrieved from ConvoKit).<\/p>\n<\/li>\n<li>\n<p class=\"wp-block-paragraph\">Follows, S., &amp; Cockcroft, J. (2019). Judging screenplays by their coverage. Retrieved from <a href=\"https:\/\/stephenfollows.com\/wp-content\/uploads\/2019\/01\/JudgingScreenplaysByTheirCoverage_StephenFollows_c.pdf\">https:\/\/stephenfollows.com\/wp-content\/uploads\/2019\/01\/JudgingScreenplaysByTheirCoverage_StephenFollows_c.pdf<\/a><\/p>\n<\/li>\n<li>Minitab. (2023). A comparison of the Pearson and Spearman correlation methods. 
Retrieved from <a href=\"https:\/\/support.minitab.com\/en-us\/minitab\/21\/help-and-how-to\/statistics\/basic-statistics\/supporting-topics\/correlation-and-covariance\/a-comparison-of-the-pearson-and-spearman-correlation-methods\/\">https:\/\/support.minitab.com\/en-us\/minitab\/21\/help-and-how-to\/statistics\/basic-statistics\/supporting-topics\/correlation-and-covariance\/a-comparison-of-the-pearson-and-spearman-correlation-methods\/<\/a><\/li>\n<\/ul>","protected":false},"excerpt":{"rendered":"<p>This article explores the relationship between a movie&#8217;s dialogue and its genre, leveraging domain-driven data analysis and informed&#8230;<\/p>\n","protected":false},"author":18,"featured_media":3655,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"is_member_only":true,"sub_heading":"This article explores the relationship between a movie's dialogue and its genre, leveraging domain-driven data analysis and 
informed...","footnotes":""},"categories":[14690,47,22,23],"tags":[579,508,468,446,568],"sponsor":[],"coauthors":[27508],"class_list":["post-3654","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-cinema","category-data-visualization","category-machine-learning","category-nlp","tag-cinema","tag-data-visualization","tag-deep-dives","tag-machine-learning","tag-nlp"],"yoast_head":"<!-- This site is optimized with the Yoast SEO plugin v25.2 - https:\/\/yoast.com\/wordpress\/plugins\/seo\/ -->\n<title>Evaluating Cinematic Dialogue\u200a-\u200aWhich syntactic and semantic features are predictive of genre? | Towards Data Science<\/title>\n<meta name=\"robots\" content=\"index, follow, max-snippet:-1, max-image-preview:large, max-video-preview:-1\" \/>\n<link rel=\"canonical\" href=\"https:\/\/towardsdatascience.com\/evaluating-cinematic-dialogue-which-syntactic-and-semantic-features-are-predictive-of-genre-2c69a71af6e2\/\" \/>\n<meta property=\"og:locale\" content=\"en_US\" \/>\n<meta property=\"og:type\" content=\"article\" \/>\n<meta property=\"og:title\" content=\"Evaluating Cinematic Dialogue\u200a-\u200aWhich syntactic and semantic features are predictive of genre? 
| Towards Data Science\" \/>\n<meta property=\"og:description\" content=\"This article explores the relationship between a movie&#039;s dialogue and its genre, leveraging domain-driven data analysis and informed...\" \/>\n<meta property=\"og:url\" content=\"https:\/\/towardsdatascience.com\/evaluating-cinematic-dialogue-which-syntactic-and-semantic-features-are-predictive-of-genre-2c69a71af6e2\/\" \/>\n<meta property=\"og:site_name\" content=\"Towards Data Science\" \/>\n<meta property=\"article:published_time\" content=\"2024-01-20T05:58:35+00:00\" \/>\n<meta property=\"article:modified_time\" content=\"2025-01-08T15:47:12+00:00\" \/>\n<meta property=\"og:image\" content=\"https:\/\/towardsdatascience.com\/wp-content\/uploads\/2024\/01\/1L3KsJpzcs1UzT1NPK7qeqw-1024x975.png\" \/>\n\t<meta property=\"og:image:width\" content=\"1024\" \/>\n\t<meta property=\"og:image:height\" content=\"975\" \/>\n\t<meta property=\"og:image:type\" content=\"image\/png\" \/>\n<meta name=\"author\" content=\"Christabelle Pabalan\" \/>\n<meta name=\"twitter:card\" content=\"summary_large_image\" \/>\n<meta name=\"twitter:creator\" content=\"@TDataScience\" \/>\n<meta name=\"twitter:site\" content=\"@TDataScience\" \/>\n<meta name=\"twitter:label1\" content=\"Written by\" \/>\n\t<meta name=\"twitter:data1\" content=\"Christabelle Pabalan\" \/>\n\t<meta name=\"twitter:label2\" content=\"Est. 
reading time\" \/>\n\t<meta name=\"twitter:data2\" content=\"18 minutes\" \/>\n<script type=\"application\/ld+json\" class=\"yoast-schema-graph\">{\"@context\":\"https:\/\/schema.org\",\"@graph\":[{\"@type\":\"Article\",\"@id\":\"https:\/\/towardsdatascience.com\/evaluating-cinematic-dialogue-which-syntactic-and-semantic-features-are-predictive-of-genre-2c69a71af6e2\/#article\",\"isPartOf\":{\"@id\":\"https:\/\/towardsdatascience.com\/evaluating-cinematic-dialogue-which-syntactic-and-semantic-features-are-predictive-of-genre-2c69a71af6e2\/\"},\"author\":{\"name\":\"TDS Editors\",\"@id\":\"https:\/\/towardsdatascience.com\/#\/schema\/person\/f9925d336b6fe962b03ad8281d90b8ee\"},\"headline\":\"Evaluating Cinematic Dialogue\u200a-\u200aWhich syntactic and semantic features are predictive of genre?\",\"datePublished\":\"2024-01-20T05:58:35+00:00\",\"dateModified\":\"2025-01-08T15:47:12+00:00\",\"mainEntityOfPage\":{\"@id\":\"https:\/\/towardsdatascience.com\/evaluating-cinematic-dialogue-which-syntactic-and-semantic-features-are-predictive-of-genre-2c69a71af6e2\/\"},\"wordCount\":3300,\"commentCount\":0,\"publisher\":{\"@id\":\"https:\/\/towardsdatascience.com\/#organization\"},\"image\":{\"@id\":\"https:\/\/towardsdatascience.com\/evaluating-cinematic-dialogue-which-syntactic-and-semantic-features-are-predictive-of-genre-2c69a71af6e2\/#primaryimage\"},\"thumbnailUrl\":\"https:\/\/towardsdatascience.com\/wp-content\/uploads\/2024\/01\/1L3KsJpzcs1UzT1NPK7qeqw.png\",\"keywords\":[\"Cinema\",\"Data Visualization\",\"Deep Dives\",\"Machine Learning\",\"NLP\"],\"articleSection\":[\"Cinema\",\"Data Visualization\",\"Machine Learning\",\"Natural Language 
Processing\"],\"inLanguage\":\"en-US\"},{\"@type\":\"WebPage\",\"@id\":\"https:\/\/towardsdatascience.com\/evaluating-cinematic-dialogue-which-syntactic-and-semantic-features-are-predictive-of-genre-2c69a71af6e2\/\",\"url\":\"https:\/\/towardsdatascience.com\/evaluating-cinematic-dialogue-which-syntactic-and-semantic-features-are-predictive-of-genre-2c69a71af6e2\/\",\"name\":\"Evaluating Cinematic Dialogue\u200a-\u200aWhich syntactic and semantic features are predictive of genre? | Towards Data Science\",\"isPartOf\":{\"@id\":\"https:\/\/towardsdatascience.com\/#website\"},\"primaryImageOfPage\":{\"@id\":\"https:\/\/towardsdatascience.com\/evaluating-cinematic-dialogue-which-syntactic-and-semantic-features-are-predictive-of-genre-2c69a71af6e2\/#primaryimage\"},\"image\":{\"@id\":\"https:\/\/towardsdatascience.com\/evaluating-cinematic-dialogue-which-syntactic-and-semantic-features-are-predictive-of-genre-2c69a71af6e2\/#primaryimage\"},\"thumbnailUrl\":\"https:\/\/towardsdatascience.com\/wp-content\/uploads\/2024\/01\/1L3KsJpzcs1UzT1NPK7qeqw.png\",\"datePublished\":\"2024-01-20T05:58:35+00:00\",\"dateModified\":\"2025-01-08T15:47:12+00:00\",\"breadcrumb\":{\"@id\":\"https:\/\/towardsdatascience.com\/evaluating-cinematic-dialogue-which-syntactic-and-semantic-features-are-predictive-of-genre-2c69a71af6e2\/#breadcrumb\"},\"inLanguage\":\"en-US\",\"potentialAction\":[{\"@type\":\"ReadAction\",\"target\":[\"https:\/\/towardsdatascience.com\/evaluating-cinematic-dialogue-which-syntactic-and-semantic-features-are-predictive-of-genre-2c69a71af6e2\/\"]}]},{\"@type\":\"ImageObject\",\"inLanguage\":\"en-US\",\"@id\":\"https:\/\/towardsdatascience.com\/evaluating-cinematic-dialogue-which-syntactic-and-semantic-features-are-predictive-of-genre-2c69a71af6e2\/#primaryimage\",\"url\":\"https:\/\/towardsdatascience.com\/wp-content\/uploads\/2024\/01\/1L3KsJpzcs1UzT1NPK7qeqw.png\",\"contentUrl\":\"https:\/\/towardsdatascience.com\/wp-content\/uploads\/2024\/01\/1L3KsJpzcs1UzT1NPK7qeqw
.png\",\"width\":2026,\"height\":1930,\"caption\":\"DALLE Generated Image by Author\"},{\"@type\":\"BreadcrumbList\",\"@id\":\"https:\/\/towardsdatascience.com\/evaluating-cinematic-dialogue-which-syntactic-and-semantic-features-are-predictive-of-genre-2c69a71af6e2\/#breadcrumb\",\"itemListElement\":[{\"@type\":\"ListItem\",\"position\":1,\"name\":\"Home\",\"item\":\"https:\/\/towardsdatascience.com\/\"},{\"@type\":\"ListItem\",\"position\":2,\"name\":\"Evaluating Cinematic Dialogue\u200a-\u200aWhich syntactic and semantic features are predictive of genre?\"}]},{\"@type\":\"WebSite\",\"@id\":\"https:\/\/towardsdatascience.com\/#website\",\"url\":\"https:\/\/towardsdatascience.com\/\",\"name\":\"Towards Data Science\",\"description\":\"Publish AI, ML &amp; data-science insights to a global community of data professionals.\",\"publisher\":{\"@id\":\"https:\/\/towardsdatascience.com\/#organization\"},\"alternateName\":\"TDS\",\"potentialAction\":[{\"@type\":\"SearchAction\",\"target\":{\"@type\":\"EntryPoint\",\"urlTemplate\":\"https:\/\/towardsdatascience.com\/?s={search_term_string}\"},\"query-input\":{\"@type\":\"PropertyValueSpecification\",\"valueRequired\":true,\"valueName\":\"search_term_string\"}}],\"inLanguage\":\"en-US\"},{\"@type\":\"Organization\",\"@id\":\"https:\/\/towardsdatascience.com\/#organization\",\"name\":\"Towards Data Science\",\"alternateName\":\"TDS\",\"url\":\"https:\/\/towardsdatascience.com\/\",\"logo\":{\"@type\":\"ImageObject\",\"inLanguage\":\"en-US\",\"@id\":\"https:\/\/towardsdatascience.com\/#\/schema\/logo\/image\/\",\"url\":\"https:\/\/towardsdatascience.com\/wp-content\/uploads\/2025\/02\/tds-logo.jpg\",\"contentUrl\":\"https:\/\/towardsdatascience.com\/wp-content\/uploads\/2025\/02\/tds-logo.jpg\",\"width\":696,\"height\":696,\"caption\":\"Towards Data 
Science\"},\"image\":{\"@id\":\"https:\/\/towardsdatascience.com\/#\/schema\/logo\/image\/\"},\"sameAs\":[\"https:\/\/x.com\/TDataScience\",\"https:\/\/www.youtube.com\/c\/TowardsDataScience\",\"https:\/\/www.linkedin.com\/company\/towards-data-science\/\"]},{\"@type\":\"Person\",\"@id\":\"https:\/\/towardsdatascience.com\/#\/schema\/person\/f9925d336b6fe962b03ad8281d90b8ee\",\"name\":\"TDS Editors\",\"image\":{\"@type\":\"ImageObject\",\"inLanguage\":\"en-US\",\"@id\":\"https:\/\/towardsdatascience.com\/#\/schema\/person\/image\/23494c9101089ad44ae88ce9d2f56aac\",\"url\":\"https:\/\/secure.gravatar.com\/avatar\/?s=96&d=mm&r=g\",\"contentUrl\":\"https:\/\/secure.gravatar.com\/avatar\/?s=96&d=mm&r=g\",\"caption\":\"TDS Editors\"},\"description\":\"Building a vibrant data science and machine learning community. Share your insights and projects with our global audience: bit.ly\/write-for-tds\",\"url\":\"https:\/\/towardsdatascience.com\/author\/towardsdatascience\/\"}]}<\/script>\n<!-- \/ Yoast SEO plugin. -->","yoast_head_json":{"title":"Evaluating Cinematic Dialogue\u200a-\u200aWhich syntactic and semantic features are predictive of genre? | Towards Data Science","robots":{"index":"index","follow":"follow","max-snippet":"max-snippet:-1","max-image-preview":"max-image-preview:large","max-video-preview":"max-video-preview:-1"},"canonical":"https:\/\/towardsdatascience.com\/evaluating-cinematic-dialogue-which-syntactic-and-semantic-features-are-predictive-of-genre-2c69a71af6e2\/","og_locale":"en_US","og_type":"article","og_title":"Evaluating Cinematic Dialogue\u200a-\u200aWhich syntactic and semantic features are predictive of genre? 
| Towards Data Science","og_description":"This article explores the relationship between a movie's dialogue and its genre, leveraging domain-driven data analysis and informed...","og_url":"https:\/\/towardsdatascience.com\/evaluating-cinematic-dialogue-which-syntactic-and-semantic-features-are-predictive-of-genre-2c69a71af6e2\/","og_site_name":"Towards Data Science","article_published_time":"2024-01-20T05:58:35+00:00","article_modified_time":"2025-01-08T15:47:12+00:00","og_image":[{"width":1024,"height":975,"url":"https:\/\/towardsdatascience.com\/wp-content\/uploads\/2024\/01\/1L3KsJpzcs1UzT1NPK7qeqw-1024x975.png","type":"image\/png"}],"author":"Christabelle Pabalan","twitter_card":"summary_large_image","twitter_creator":"@TDataScience","twitter_site":"@TDataScience","twitter_misc":{"Written by":"Christabelle Pabalan","Est. reading time":"18 minutes"},"schema":{"@context":"https:\/\/schema.org","@graph":[{"@type":"Article","@id":"https:\/\/towardsdatascience.com\/evaluating-cinematic-dialogue-which-syntactic-and-semantic-features-are-predictive-of-genre-2c69a71af6e2\/#article","isPartOf":{"@id":"https:\/\/towardsdatascience.com\/evaluating-cinematic-dialogue-which-syntactic-and-semantic-features-are-predictive-of-genre-2c69a71af6e2\/"},"author":{"name":"TDS Editors","@id":"https:\/\/towardsdatascience.com\/#\/schema\/person\/f9925d336b6fe962b03ad8281d90b8ee"},"headline":"Evaluating Cinematic Dialogue\u200a-\u200aWhich syntactic and semantic features are predictive of 
genre?","datePublished":"2024-01-20T05:58:35+00:00","dateModified":"2025-01-08T15:47:12+00:00","mainEntityOfPage":{"@id":"https:\/\/towardsdatascience.com\/evaluating-cinematic-dialogue-which-syntactic-and-semantic-features-are-predictive-of-genre-2c69a71af6e2\/"},"wordCount":3300,"commentCount":0,"publisher":{"@id":"https:\/\/towardsdatascience.com\/#organization"},"image":{"@id":"https:\/\/towardsdatascience.com\/evaluating-cinematic-dialogue-which-syntactic-and-semantic-features-are-predictive-of-genre-2c69a71af6e2\/#primaryimage"},"thumbnailUrl":"https:\/\/towardsdatascience.com\/wp-content\/uploads\/2024\/01\/1L3KsJpzcs1UzT1NPK7qeqw.png","keywords":["Cinema","Data Visualization","Deep Dives","Machine Learning","NLP"],"articleSection":["Cinema","Data Visualization","Machine Learning","Natural Language Processing"],"inLanguage":"en-US"},{"@type":"WebPage","@id":"https:\/\/towardsdatascience.com\/evaluating-cinematic-dialogue-which-syntactic-and-semantic-features-are-predictive-of-genre-2c69a71af6e2\/","url":"https:\/\/towardsdatascience.com\/evaluating-cinematic-dialogue-which-syntactic-and-semantic-features-are-predictive-of-genre-2c69a71af6e2\/","name":"Evaluating Cinematic Dialogue\u200a-\u200aWhich syntactic and semantic features are predictive of genre? 
| Towards Data Science","isPartOf":{"@id":"https:\/\/towardsdatascience.com\/#website"},"primaryImageOfPage":{"@id":"https:\/\/towardsdatascience.com\/evaluating-cinematic-dialogue-which-syntactic-and-semantic-features-are-predictive-of-genre-2c69a71af6e2\/#primaryimage"},"image":{"@id":"https:\/\/towardsdatascience.com\/evaluating-cinematic-dialogue-which-syntactic-and-semantic-features-are-predictive-of-genre-2c69a71af6e2\/#primaryimage"},"thumbnailUrl":"https:\/\/towardsdatascience.com\/wp-content\/uploads\/2024\/01\/1L3KsJpzcs1UzT1NPK7qeqw.png","datePublished":"2024-01-20T05:58:35+00:00","dateModified":"2025-01-08T15:47:12+00:00","breadcrumb":{"@id":"https:\/\/towardsdatascience.com\/evaluating-cinematic-dialogue-which-syntactic-and-semantic-features-are-predictive-of-genre-2c69a71af6e2\/#breadcrumb"},"inLanguage":"en-US","potentialAction":[{"@type":"ReadAction","target":["https:\/\/towardsdatascience.com\/evaluating-cinematic-dialogue-which-syntactic-and-semantic-features-are-predictive-of-genre-2c69a71af6e2\/"]}]},{"@type":"ImageObject","inLanguage":"en-US","@id":"https:\/\/towardsdatascience.com\/evaluating-cinematic-dialogue-which-syntactic-and-semantic-features-are-predictive-of-genre-2c69a71af6e2\/#primaryimage","url":"https:\/\/towardsdatascience.com\/wp-content\/uploads\/2024\/01\/1L3KsJpzcs1UzT1NPK7qeqw.png","contentUrl":"https:\/\/towardsdatascience.com\/wp-content\/uploads\/2024\/01\/1L3KsJpzcs1UzT1NPK7qeqw.png","width":2026,"height":1930,"caption":"DALLE Generated Image by Author"},{"@type":"BreadcrumbList","@id":"https:\/\/towardsdatascience.com\/evaluating-cinematic-dialogue-which-syntactic-and-semantic-features-are-predictive-of-genre-2c69a71af6e2\/#breadcrumb","itemListElement":[{"@type":"ListItem","position":1,"name":"Home","item":"https:\/\/towardsdatascience.com\/"},{"@type":"ListItem","position":2,"name":"Evaluating Cinematic Dialogue\u200a-\u200aWhich syntactic and semantic features are predictive of 
genre?"}]},{"@type":"WebSite","@id":"https:\/\/towardsdatascience.com\/#website","url":"https:\/\/towardsdatascience.com\/","name":"Towards Data Science","description":"Publish AI, ML &amp; data-science insights to a global community of data professionals.","publisher":{"@id":"https:\/\/towardsdatascience.com\/#organization"},"alternateName":"TDS","potentialAction":[{"@type":"SearchAction","target":{"@type":"EntryPoint","urlTemplate":"https:\/\/towardsdatascience.com\/?s={search_term_string}"},"query-input":{"@type":"PropertyValueSpecification","valueRequired":true,"valueName":"search_term_string"}}],"inLanguage":"en-US"},{"@type":"Organization","@id":"https:\/\/towardsdatascience.com\/#organization","name":"Towards Data Science","alternateName":"TDS","url":"https:\/\/towardsdatascience.com\/","logo":{"@type":"ImageObject","inLanguage":"en-US","@id":"https:\/\/towardsdatascience.com\/#\/schema\/logo\/image\/","url":"https:\/\/towardsdatascience.com\/wp-content\/uploads\/2025\/02\/tds-logo.jpg","contentUrl":"https:\/\/towardsdatascience.com\/wp-content\/uploads\/2025\/02\/tds-logo.jpg","width":696,"height":696,"caption":"Towards Data Science"},"image":{"@id":"https:\/\/towardsdatascience.com\/#\/schema\/logo\/image\/"},"sameAs":["https:\/\/x.com\/TDataScience","https:\/\/www.youtube.com\/c\/TowardsDataScience","https:\/\/www.linkedin.com\/company\/towards-data-science\/"]},{"@type":"Person","@id":"https:\/\/towardsdatascience.com\/#\/schema\/person\/f9925d336b6fe962b03ad8281d90b8ee","name":"TDS Editors","image":{"@type":"ImageObject","inLanguage":"en-US","@id":"https:\/\/towardsdatascience.com\/#\/schema\/person\/image\/23494c9101089ad44ae88ce9d2f56aac","url":"https:\/\/secure.gravatar.com\/avatar\/?s=96&d=mm&r=g","contentUrl":"https:\/\/secure.gravatar.com\/avatar\/?s=96&d=mm&r=g","caption":"TDS Editors"},"description":"Building a vibrant data science and machine learning community. 
Share your insights and projects with our global audience: bit.ly\/write-for-tds","url":"https:\/\/towardsdatascience.com\/author\/towardsdatascience\/"}]}},"distributor_meta":false,"distributor_terms":false,"distributor_media":false,"distributor_original_site_name":"Towards Data Science","distributor_original_site_url":"https:\/\/towardsdatascience.com","push-errors":false,"_links":{"self":[{"href":"https:\/\/towardsdatascience.com\/wp-json\/wp\/v2\/posts\/3654","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/towardsdatascience.com\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/towardsdatascience.com\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/towardsdatascience.com\/wp-json\/wp\/v2\/users\/18"}],"replies":[{"embeddable":true,"href":"https:\/\/towardsdatascience.com\/wp-json\/wp\/v2\/comments?post=3654"}],"version-history":[{"count":0,"href":"https:\/\/towardsdatascience.com\/wp-json\/wp\/v2\/posts\/3654\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/towardsdatascience.com\/wp-json\/wp\/v2\/media\/3655"}],"wp:attachment":[{"href":"https:\/\/towardsdatascience.com\/wp-json\/wp\/v2\/media?parent=3654"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/towardsdatascience.com\/wp-json\/wp\/v2\/categories?post=3654"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/towardsdatascience.com\/wp-json\/wp\/v2\/tags?post=3654"},{"taxonomy":"sponsor","embeddable":true,"href":"https:\/\/towardsdatascience.com\/wp-json\/wp\/v2\/sponsor?post=3654"},{"taxonomy":"author","embeddable":true,"href":"https:\/\/towardsdatascience.com\/wp-json\/wp\/v2\/coauthors?post=3654"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}