{"id":598426,"date":"2025-02-25T15:56:16","date_gmt":"2025-02-25T20:56:16","guid":{"rendered":"https:\/\/towardsdatascience.com\/?p=598426"},"modified":"2025-02-25T15:56:17","modified_gmt":"2025-02-25T20:56:17","slug":"efficient-data-handling-in-python-with-arrow","status":"publish","type":"post","link":"https:\/\/towardsdatascience.com\/efficient-data-handling-in-python-with-arrow\/","title":{"rendered":"Efficient Data Handling in Python with Arrow"},"content":{"rendered":"\n<h2 class=\"wp-block-heading\">1. Introduction<\/h2>\n\n\n\n<p class=\"wp-block-paragraph\">We\u2019re all used to working with CSVs, JSON files\u2026 With traditional libraries and large datasets, these can be extremely slow to read, write and operate on, leading to performance bottlenecks (been there). It\u2019s precisely with large amounts of data that handling it efficiently becomes crucial for our data science\/analytics workflow, and this is exactly where Apache Arrow comes into play.&nbsp;<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">Why? The main reason lies in how the data is stored in memory. While JSON and CSVs, for example, are text-based formats, Arrow is a columnar in-memory data format (and that allows for fast data interchange between different data processing tools). Arrow is therefore designed to optimize performance by enabling zero-copy reads, reducing memory usage, and supporting efficient compression.&nbsp;<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">Moreover, Apache Arrow is open-source and optimized for analytics. It is designed to accelerate big data processing while maintaining interoperability with various data tools, such as Pandas, Spark, and Dask. By storing data in a columnar format, Arrow enables faster read\/write operations and efficient memory usage, making it ideal for analytical workloads.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">Sounds great, right? The best part is that this is all the introduction to Arrow I\u2019ll provide. 
Enough theory; we want to see it in action. So, in this post, we&#8217;ll explore how to use Arrow in Python and how to get the most out of it.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\">2. Arrow in Python<\/h2>\n\n\n\n<p class=\"wp-block-paragraph\">To get started, you need to install the necessary libraries: pandas and pyarrow.<\/p>\n\n\n\n<pre class=\"wp-block-prismatic-blocks\"><code class=\"language-python\">pip install pyarrow pandas<\/code><\/pre>\n\n\n\n<p class=\"wp-block-paragraph\">Then, as always, import them in your Python script:<\/p>\n\n\n\n<pre class=\"wp-block-prismatic-blocks\"><code class=\"language-python\">import pyarrow as pa\nimport pandas as pd<\/code><\/pre>\n\n\n\n<p class=\"wp-block-paragraph\">Nothing new yet, just the setup needed for what follows. Let\u2019s start by performing some simple operations.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">2.1. Creating and Storing a Table<\/h3>\n\n\n\n<p class=\"wp-block-paragraph\">The simplest thing we can do is hardcode our table\u2019s data. 
Let\u2019s create a two-column table with football data:<\/p>\n\n\n\n<pre class=\"wp-block-prismatic-blocks\"><code class=\"language-python\">teams = pa.array([&#039;Barcelona&#039;, &#039;Real Madrid&#039;, &#039;Rayo Vallecano&#039;, &#039;Athletic Club&#039;, &#039;Real Betis&#039;], type=pa.string())\ngoals = pa.array([30, 23, 9, 24, 12], type=pa.int8())\n\nteam_goals_table = pa.table([teams, goals], names=[&#039;Team&#039;, &#039;Goals&#039;])<\/code><\/pre>\n\n\n\n<p class=\"wp-block-paragraph\">The result is a <em>pyarrow.Table<\/em>, but we can easily convert it to a pandas DataFrame if we want:<\/p>\n\n\n\n<pre class=\"wp-block-prismatic-blocks\"><code class=\"language-python\">df = team_goals_table.to_pandas()<\/code><\/pre>\n\n\n\n<p class=\"wp-block-paragraph\">And convert it back to Arrow using:<\/p>\n\n\n\n<pre class=\"wp-block-prismatic-blocks\"><code class=\"language-python\">team_goals_table = pa.Table.from_pandas(df)<\/code><\/pre>\n\n\n\n<p class=\"wp-block-paragraph\">Finally, we\u2019ll store the table in a file. We could use different formats, such as Feather or Parquet. I\u2019ll use the latter because it\u2019s fast and memory-optimized:<\/p>\n\n\n\n<pre class=\"wp-block-prismatic-blocks\"><code class=\"language-python\">import pyarrow.parquet as pq\npq.write_table(team_goals_table, &#039;data.parquet&#039;)<\/code><\/pre>\n\n\n\n<p class=\"wp-block-paragraph\">Reading a Parquet file back is just a matter of calling <code>pq.read_table(&#039;data.parquet&#039;)<\/code>.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">2.2. Compute Functions<\/h3>\n\n\n\n<p class=\"wp-block-paragraph\">Arrow has its own compute module for the usual operations. 
Let\u2019s start by comparing two arrays element-wise:<\/p>\n\n\n\n<pre class=\"wp-block-prismatic-blocks\"><code class=\"language-python\">&gt;&gt;&gt; import pyarrow.compute as pc\n&gt;&gt;&gt; a = pa.array([1, 2, 3, 4, 5, 6])\n&gt;&gt;&gt; b = pa.array([2, 2, 4, 4, 6, 6])\n&gt;&gt;&gt; pc.equal(a,b)\n[\n  false,\n  true,\n  false,\n  true,\n  false,\n  true\n]<\/code><\/pre>\n\n\n\n<p class=\"wp-block-paragraph\">That was easy. We can sum all the elements in an array with:<\/p>\n\n\n\n<pre class=\"wp-block-prismatic-blocks\"><code class=\"language-python\">&gt;&gt;&gt; pc.sum(a)\n&lt;pyarrow.Int64Scalar: 21&gt;<\/code><\/pre>\n\n\n\n<p class=\"wp-block-paragraph\">From this we can easily guess how to compute a count, a floor, an exp, a mean, a max, a multiplication\u2026 No need to go over them, then. So let\u2019s move on to tabular operations.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">We\u2019ll start by sorting a table. Note that <code>sort_indices<\/code> returns the row indices in sorted order rather than a sorted table:<\/p>\n\n\n\n<pre class=\"wp-block-prismatic-blocks\"><code class=\"language-python\">&gt;&gt;&gt; table = pa.table({&#039;i&#039;: [&#039;a&#039;,&#039;b&#039;,&#039;a&#039;], &#039;x&#039;: [1,2,3], &#039;y&#039;: [4,5,6]})\n&gt;&gt;&gt; pc.sort_indices(table, sort_keys=[(&#039;y&#039;, &#039;descending&#039;)])\n&lt;pyarrow.lib.UInt64Array object at 0x1291643a0&gt;\n[\n  2,\n  1,\n  0\n]<\/code><\/pre>\n\n\n\n<p class=\"wp-block-paragraph\">Just like in pandas, we can group values and aggregate the data. 
For example, let\u2019s group by \u201ci\u201d and compute the sum on \u201cx\u201d and the mean on \u201cy\u201d:<\/p>\n\n\n\n<pre class=\"wp-block-prismatic-blocks\"><code class=\"language-python\">&gt;&gt;&gt; table.group_by(&#039;i&#039;).aggregate([(&#039;x&#039;, &#039;sum&#039;), (&#039;y&#039;, &#039;mean&#039;)])\npyarrow.Table\ni: string\nx_sum: int64\ny_mean: double\n----\ni: [[&quot;a&quot;,&quot;b&quot;]]\nx_sum: [[4,2]]\ny_mean: [[5,5]]<\/code><\/pre>\n\n\n\n<p class=\"wp-block-paragraph\">Or we can join two tables:<\/p>\n\n\n\n<pre class=\"wp-block-prismatic-blocks\"><code class=\"language-python\">&gt;&gt;&gt; t1 = pa.table({&#039;i&#039;: [&#039;a&#039;,&#039;b&#039;,&#039;c&#039;], &#039;x&#039;: [1,2,3]})\n&gt;&gt;&gt; t2 = pa.table({&#039;i&#039;: [&#039;a&#039;,&#039;b&#039;,&#039;c&#039;], &#039;y&#039;: [4,5,6]})\n&gt;&gt;&gt; t1.join(t2, keys=&quot;i&quot;)\npyarrow.Table\ni: string\nx: int64\ny: int64\n----\ni: [[&quot;a&quot;,&quot;b&quot;,&quot;c&quot;]]\nx: [[1,2,3]]\ny: [[4,5,6]]<\/code><\/pre>\n\n\n\n<p class=\"wp-block-paragraph\">By default, it is a left outer join, but we can change that with the <strong><em>join_type<\/em><\/strong> parameter.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">There are many more useful operations, but to keep this short let\u2019s see just two more. The first is appending a new column to a table, which returns a new table and leaves the original unchanged:<\/p>\n\n\n\n<pre class=\"wp-block-prismatic-blocks\"><code class=\"language-python\">&gt;&gt;&gt; t1.append_column(&quot;z&quot;, pa.array([22, 44, 99]))\npyarrow.Table\ni: string\nx: int64\nz: int64\n----\ni: [[&quot;a&quot;,&quot;b&quot;,&quot;c&quot;]]\nx: [[1,2,3]]\nz: [[22,44,99]]<\/code><\/pre>\n\n\n\n<p class=\"wp-block-paragraph\">Before ending this section, let\u2019s see how to filter a table or array:<\/p>\n\n\n\n<pre class=\"wp-block-prismatic-blocks\"><code class=\"language-python\">&gt;&gt;&gt; t1.filter((pc.field(&#039;x&#039;) &gt; 0) &amp; (pc.field(&#039;x&#039;) &lt; 3))\npyarrow.Table\ni: 
string\nx: int64\n----\ni: [[&quot;a&quot;,&quot;b&quot;]]\nx: [[1,2]]<\/code><\/pre>\n\n\n\n<p class=\"wp-block-paragraph\">Easy, right? Especially if you\u2019ve been using pandas and NumPy for years!<\/p>\n\n\n\n<h2 class=\"wp-block-heading\">3. Working with Files<\/h2>\n\n\n\n<p class=\"wp-block-paragraph\">We\u2019ve already seen how we can read and write Parquet files. But let\u2019s check some other popular file types so that we have several options available.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">3.1. Apache ORC<\/h3>\n\n\n\n<p class=\"wp-block-paragraph\">Informally, Apache ORC can be understood as the equivalent of Arrow in the realm of file types (even though its origins have nothing to do with Arrow). More precisely, it\u2019s an open-source columnar storage format.&nbsp;<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">Reading and writing it works as follows:<\/p>\n\n\n\n<pre class=\"wp-block-prismatic-blocks\"><code class=\"language-python\">from pyarrow import orc\n# Write table\norc.write_table(t1, &#039;t1.orc&#039;)\n# Read table\nt1 = orc.read_table(&#039;t1.orc&#039;)<\/code><\/pre>\n\n\n\n<p class=\"wp-block-paragraph\">As a side note, we could decide to compress the file while writing by using the \u201ccompression\u201d parameter.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">3.2. 
CSV<\/h3>\n\n\n\n<p class=\"wp-block-paragraph\">No secret here, pyarrow has a CSV module:<\/p>\n\n\n\n<pre class=\"wp-block-prismatic-blocks\"><code class=\"language-python\">from pyarrow import csv\n# Write CSV\ncsv.write_csv(t1, &quot;t1.csv&quot;)\n# Read CSV\nt1 = csv.read_csv(&quot;t1.csv&quot;)\n\n# Write CSV compressed and without header\noptions = csv.WriteOptions(include_header=False)\nwith pa.CompressedOutputStream(&quot;t1.csv.gz&quot;, &quot;gzip&quot;) as out:\n    csv.write_csv(t1, out, options)\n\n# Read compressed CSV and add custom header\n# (the file was written without a header, so no rows are skipped)\nt1 = csv.read_csv(&quot;t1.csv.gz&quot;, read_options=csv.ReadOptions(\n    column_names=[&quot;i&quot;, &quot;x&quot;]\n))<\/code><\/pre>\n\n\n\n<h3 class=\"wp-block-heading\">3.3. JSON<\/h3>\n\n\n\n<p class=\"wp-block-paragraph\">PyArrow can read JSON files (in line-delimited format) but not write them. It\u2019s pretty straightforward. Let\u2019s see an example, supposing we have our JSON data in \u201cdata.json\u201d:<\/p>\n\n\n\n<pre class=\"wp-block-prismatic-blocks\"><code class=\"language-python\">from pyarrow import json\n# Read json\nfn = &quot;data.json&quot;\ntable = json.read_json(fn)\n\n# We can now convert it to pandas if we want to\ndf = table.to_pandas()<\/code><\/pre>\n\n\n\n<h3 class=\"wp-block-heading\">3.4. Feather<\/h3>\n\n\n\n<p class=\"wp-block-paragraph\">Feather is a portable file format for storing Arrow tables or data frames (from languages like Python or R) that utilizes the <a href=\"https:\/\/arrow.apache.org\/docs\/python\/ipc.html#ipc\">Arrow IPC format<\/a> internally. 
So, contrary to Apache ORC, this one was indeed created early in the Arrow project.<\/p>\n\n\n\n<pre class=\"wp-block-prismatic-blocks\"><code class=\"language-python\">from pyarrow import feather\n# Write feather from pandas DF\nfeather.write_feather(df, &quot;t1.feather&quot;)\n# Write feather from table, and compressed\nfeather.write_feather(t1, &quot;t1.feather.lz4&quot;, compression=&quot;lz4&quot;)\n\n# Read feather into table\nt1 = feather.read_table(&quot;t1.feather&quot;)\n# Read feather into df\ndf = feather.read_feather(&quot;t1.feather&quot;)<\/code><\/pre>\n\n\n\n<h2 class=\"wp-block-heading\">4. Advanced Features<\/h2>\n\n\n\n<p class=\"wp-block-paragraph\">We\u2019ve just touched upon the most basic features, which cover what most people need when working with Arrow. However, its usefulness doesn\u2019t end here; this is just where it starts.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">As the rest is quite domain-specific and not useful for everyone (nor introductory), I\u2019ll just mention some of these features without using any code:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li class=\"wp-block-list-item\">We can handle memory management through the <strong>Buffer<\/strong> type (built on top of the C++ Buffer object). Creating a buffer with our data does not allocate any memory; it is a zero-copy view on the memory exported from the data bytes object. Staying with memory management, an instance of <strong>MemoryPool<\/strong> tracks all allocations and deallocations (like <em>malloc<\/em> and <em>free<\/em> in C). This allows us to track the amount of memory being allocated.<\/li>\n\n\n\n<li class=\"wp-block-list-item\">Similarly, there are different ways to work with input\/output streams in batches.<\/li>\n\n\n\n<li class=\"wp-block-list-item\">PyArrow comes with an abstract filesystem interface, as well as concrete implementations for various storage types. 
So, for example, we can read and write Parquet files from an S3 bucket using the S3FileSystem. Google Cloud Storage and the Hadoop Distributed File System (HDFS) are also supported.<\/li>\n<\/ul>\n\n\n\n<h2 class=\"wp-block-heading\">5. Conclusion and Key Takeaways<\/h2>\n\n\n\n<p class=\"wp-block-paragraph\">Apache Arrow is a powerful tool for efficient data handling in Python. Its columnar storage format, zero-copy reads, and interoperability with popular data processing libraries make it ideal for data science workflows. By integrating Arrow into your pipeline, you can significantly boost performance and optimize memory usage.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\">6. Resources<\/h2>\n\n\n\n<ul class=\"wp-block-list\">\n<li class=\"wp-block-list-item\"><a href=\"https:\/\/arrow.apache.org\/docs\/\">Apache Arrow Documentation<\/a><\/li>\n\n\n\n<li class=\"wp-block-list-item\"><a href=\"https:\/\/github.com\/apache\/arrow\">PyArrow GitHub Repository<\/a><\/li>\n<\/ul>\n","protected":false},"excerpt":{"rendered":"<p>Introducing Arrow to those who are still unaware of its power<\/p>\n","protected":false},"author":18,"featured_media":598428,"comment_status":"closed","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"is_member_only":false,"sub_heading":"Introducing Arrow to those who are still unaware of its power","footnotes":""},"categories":[44],"tags":[7810,7844,9843,491,467],"sponsor":[],"coauthors":[30752],"class_list":["post-598426","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-data-science","tag-apache-arrow","tag-data-analyitcs","tag-data-handling","tag-programming","tag-python"],"yoast_head":"<!-- This site is optimized with the Yoast SEO plugin v25.2 - https:\/\/yoast.com\/wordpress\/plugins\/seo\/ -->\n<title>Efficient Data Handling in Python with Arrow | Towards Data Science<\/title>\n<meta name=\"robots\" content=\"index, follow, max-snippet:-1, max-image-preview:large, 
max-video-preview:-1\" \/>\n<link rel=\"canonical\" href=\"https:\/\/towardsdatascience.com\/efficient-data-handling-in-python-with-arrow\/\" \/>\n<meta property=\"og:locale\" content=\"en_US\" \/>\n<meta property=\"og:type\" content=\"article\" \/>\n<meta property=\"og:title\" content=\"Efficient Data Handling in Python with Arrow | Towards Data Science\" \/>\n<meta property=\"og:description\" content=\"Introducing Arrow to those who are still unaware of its power\" \/>\n<meta property=\"og:url\" content=\"https:\/\/towardsdatascience.com\/efficient-data-handling-in-python-with-arrow\/\" \/>\n<meta property=\"og:site_name\" content=\"Towards Data Science\" \/>\n<meta property=\"article:published_time\" content=\"2025-02-25T20:56:16+00:00\" \/>\n<meta property=\"article:modified_time\" content=\"2025-02-25T20:56:17+00:00\" \/>\n<meta property=\"og:image\" content=\"https:\/\/towardsdatascience.com\/wp-content\/uploads\/2025\/02\/unnamed-29.png\" \/>\n\t<meta property=\"og:image:width\" content=\"512\" \/>\n\t<meta property=\"og:image:height\" content=\"384\" \/>\n\t<meta property=\"og:image:type\" content=\"image\/png\" \/>\n<meta name=\"author\" content=\"Pol Marin\" \/>\n<meta name=\"twitter:card\" content=\"summary_large_image\" \/>\n<meta name=\"twitter:creator\" content=\"@TDataScience\" \/>\n<meta name=\"twitter:site\" content=\"@TDataScience\" \/>\n<meta name=\"twitter:label1\" content=\"Written by\" \/>\n\t<meta name=\"twitter:data1\" content=\"Pol Marin\" \/>\n\t<meta name=\"twitter:label2\" content=\"Est. 
reading time\" \/>\n\t<meta name=\"twitter:data2\" content=\"5 minutes\" \/>\n<script type=\"application\/ld+json\" class=\"yoast-schema-graph\">{\"@context\":\"https:\/\/schema.org\",\"@graph\":[{\"@type\":\"Article\",\"@id\":\"https:\/\/towardsdatascience.com\/efficient-data-handling-in-python-with-arrow\/#article\",\"isPartOf\":{\"@id\":\"https:\/\/towardsdatascience.com\/efficient-data-handling-in-python-with-arrow\/\"},\"author\":{\"name\":\"TDS Editors\",\"@id\":\"https:\/\/towardsdatascience.com\/#\/schema\/person\/f9925d336b6fe962b03ad8281d90b8ee\"},\"headline\":\"Efficient Data Handling in Python with Arrow\",\"datePublished\":\"2025-02-25T20:56:16+00:00\",\"dateModified\":\"2025-02-25T20:56:17+00:00\",\"mainEntityOfPage\":{\"@id\":\"https:\/\/towardsdatascience.com\/efficient-data-handling-in-python-with-arrow\/\"},\"wordCount\":979,\"publisher\":{\"@id\":\"https:\/\/towardsdatascience.com\/#organization\"},\"image\":{\"@id\":\"https:\/\/towardsdatascience.com\/efficient-data-handling-in-python-with-arrow\/#primaryimage\"},\"thumbnailUrl\":\"https:\/\/towardsdatascience.com\/wp-content\/uploads\/2025\/02\/unnamed-29.png\",\"keywords\":[\"Apache Arrow\",\"Data Analyitcs\",\"Data Handling\",\"Programming\",\"Python\"],\"articleSection\":[\"Data Science\"],\"inLanguage\":\"en-US\"},{\"@type\":\"WebPage\",\"@id\":\"https:\/\/towardsdatascience.com\/efficient-data-handling-in-python-with-arrow\/\",\"url\":\"https:\/\/towardsdatascience.com\/efficient-data-handling-in-python-with-arrow\/\",\"name\":\"Efficient Data Handling in Python with Arrow | Towards Data 
Science\",\"isPartOf\":{\"@id\":\"https:\/\/towardsdatascience.com\/#website\"},\"primaryImageOfPage\":{\"@id\":\"https:\/\/towardsdatascience.com\/efficient-data-handling-in-python-with-arrow\/#primaryimage\"},\"image\":{\"@id\":\"https:\/\/towardsdatascience.com\/efficient-data-handling-in-python-with-arrow\/#primaryimage\"},\"thumbnailUrl\":\"https:\/\/towardsdatascience.com\/wp-content\/uploads\/2025\/02\/unnamed-29.png\",\"datePublished\":\"2025-02-25T20:56:16+00:00\",\"dateModified\":\"2025-02-25T20:56:17+00:00\",\"breadcrumb\":{\"@id\":\"https:\/\/towardsdatascience.com\/efficient-data-handling-in-python-with-arrow\/#breadcrumb\"},\"inLanguage\":\"en-US\",\"potentialAction\":[{\"@type\":\"ReadAction\",\"target\":[\"https:\/\/towardsdatascience.com\/efficient-data-handling-in-python-with-arrow\/\"]}]},{\"@type\":\"ImageObject\",\"inLanguage\":\"en-US\",\"@id\":\"https:\/\/towardsdatascience.com\/efficient-data-handling-in-python-with-arrow\/#primaryimage\",\"url\":\"https:\/\/towardsdatascience.com\/wp-content\/uploads\/2025\/02\/unnamed-29.png\",\"contentUrl\":\"https:\/\/towardsdatascience.com\/wp-content\/uploads\/2025\/02\/unnamed-29.png\",\"width\":512,\"height\":384,\"caption\":\"Image generated with Grok\"},{\"@type\":\"BreadcrumbList\",\"@id\":\"https:\/\/towardsdatascience.com\/efficient-data-handling-in-python-with-arrow\/#breadcrumb\",\"itemListElement\":[{\"@type\":\"ListItem\",\"position\":1,\"name\":\"Home\",\"item\":\"https:\/\/towardsdatascience.com\/\"},{\"@type\":\"ListItem\",\"position\":2,\"name\":\"Efficient Data Handling in Python with Arrow\"}]},{\"@type\":\"WebSite\",\"@id\":\"https:\/\/towardsdatascience.com\/#website\",\"url\":\"https:\/\/towardsdatascience.com\/\",\"name\":\"Towards Data Science\",\"description\":\"Publish AI, ML &amp; data-science insights to a global community of data 
professionals.\",\"publisher\":{\"@id\":\"https:\/\/towardsdatascience.com\/#organization\"},\"alternateName\":\"TDS\",\"potentialAction\":[{\"@type\":\"SearchAction\",\"target\":{\"@type\":\"EntryPoint\",\"urlTemplate\":\"https:\/\/towardsdatascience.com\/?s={search_term_string}\"},\"query-input\":{\"@type\":\"PropertyValueSpecification\",\"valueRequired\":true,\"valueName\":\"search_term_string\"}}],\"inLanguage\":\"en-US\"},{\"@type\":\"Organization\",\"@id\":\"https:\/\/towardsdatascience.com\/#organization\",\"name\":\"Towards Data Science\",\"alternateName\":\"TDS\",\"url\":\"https:\/\/towardsdatascience.com\/\",\"logo\":{\"@type\":\"ImageObject\",\"inLanguage\":\"en-US\",\"@id\":\"https:\/\/towardsdatascience.com\/#\/schema\/logo\/image\/\",\"url\":\"https:\/\/towardsdatascience.com\/wp-content\/uploads\/2025\/02\/tds-logo.jpg\",\"contentUrl\":\"https:\/\/towardsdatascience.com\/wp-content\/uploads\/2025\/02\/tds-logo.jpg\",\"width\":696,\"height\":696,\"caption\":\"Towards Data Science\"},\"image\":{\"@id\":\"https:\/\/towardsdatascience.com\/#\/schema\/logo\/image\/\"},\"sameAs\":[\"https:\/\/x.com\/TDataScience\",\"https:\/\/www.youtube.com\/c\/TowardsDataScience\",\"https:\/\/www.linkedin.com\/company\/towards-data-science\/\"]},{\"@type\":\"Person\",\"@id\":\"https:\/\/towardsdatascience.com\/#\/schema\/person\/f9925d336b6fe962b03ad8281d90b8ee\",\"name\":\"TDS Editors\",\"image\":{\"@type\":\"ImageObject\",\"inLanguage\":\"en-US\",\"@id\":\"https:\/\/towardsdatascience.com\/#\/schema\/person\/image\/23494c9101089ad44ae88ce9d2f56aac\",\"url\":\"https:\/\/secure.gravatar.com\/avatar\/?s=96&d=mm&r=g\",\"contentUrl\":\"https:\/\/secure.gravatar.com\/avatar\/?s=96&d=mm&r=g\",\"caption\":\"TDS Editors\"},\"description\":\"Building a vibrant data science and machine learning community. 
Share your insights and projects with our global audience: bit.ly\/write-for-tds\",\"url\":\"https:\/\/towardsdatascience.com\/author\/towardsdatascience\/\"}]}<\/script>\n<!-- \/ Yoast SEO plugin. -->","yoast_head_json":{"title":"Efficient Data Handling in Python with Arrow | Towards Data Science","robots":{"index":"index","follow":"follow","max-snippet":"max-snippet:-1","max-image-preview":"max-image-preview:large","max-video-preview":"max-video-preview:-1"},"canonical":"https:\/\/towardsdatascience.com\/efficient-data-handling-in-python-with-arrow\/","og_locale":"en_US","og_type":"article","og_title":"Efficient Data Handling in Python with Arrow | Towards Data Science","og_description":"Introducing Arrow to those who are still unaware of its power","og_url":"https:\/\/towardsdatascience.com\/efficient-data-handling-in-python-with-arrow\/","og_site_name":"Towards Data Science","article_published_time":"2025-02-25T20:56:16+00:00","article_modified_time":"2025-02-25T20:56:17+00:00","og_image":[{"width":512,"height":384,"url":"https:\/\/towardsdatascience.com\/wp-content\/uploads\/2025\/02\/unnamed-29.png","type":"image\/png"}],"author":"Pol Marin","twitter_card":"summary_large_image","twitter_creator":"@TDataScience","twitter_site":"@TDataScience","twitter_misc":{"Written by":"Pol Marin","Est. 
reading time":"5 minutes"},"schema":{"@context":"https:\/\/schema.org","@graph":[{"@type":"Article","@id":"https:\/\/towardsdatascience.com\/efficient-data-handling-in-python-with-arrow\/#article","isPartOf":{"@id":"https:\/\/towardsdatascience.com\/efficient-data-handling-in-python-with-arrow\/"},"author":{"name":"TDS Editors","@id":"https:\/\/towardsdatascience.com\/#\/schema\/person\/f9925d336b6fe962b03ad8281d90b8ee"},"headline":"Efficient Data Handling in Python with Arrow","datePublished":"2025-02-25T20:56:16+00:00","dateModified":"2025-02-25T20:56:17+00:00","mainEntityOfPage":{"@id":"https:\/\/towardsdatascience.com\/efficient-data-handling-in-python-with-arrow\/"},"wordCount":979,"publisher":{"@id":"https:\/\/towardsdatascience.com\/#organization"},"image":{"@id":"https:\/\/towardsdatascience.com\/efficient-data-handling-in-python-with-arrow\/#primaryimage"},"thumbnailUrl":"https:\/\/towardsdatascience.com\/wp-content\/uploads\/2025\/02\/unnamed-29.png","keywords":["Apache Arrow","Data Analyitcs","Data Handling","Programming","Python"],"articleSection":["Data Science"],"inLanguage":"en-US"},{"@type":"WebPage","@id":"https:\/\/towardsdatascience.com\/efficient-data-handling-in-python-with-arrow\/","url":"https:\/\/towardsdatascience.com\/efficient-data-handling-in-python-with-arrow\/","name":"Efficient Data Handling in Python with Arrow | Towards Data 
Science","isPartOf":{"@id":"https:\/\/towardsdatascience.com\/#website"},"primaryImageOfPage":{"@id":"https:\/\/towardsdatascience.com\/efficient-data-handling-in-python-with-arrow\/#primaryimage"},"image":{"@id":"https:\/\/towardsdatascience.com\/efficient-data-handling-in-python-with-arrow\/#primaryimage"},"thumbnailUrl":"https:\/\/towardsdatascience.com\/wp-content\/uploads\/2025\/02\/unnamed-29.png","datePublished":"2025-02-25T20:56:16+00:00","dateModified":"2025-02-25T20:56:17+00:00","breadcrumb":{"@id":"https:\/\/towardsdatascience.com\/efficient-data-handling-in-python-with-arrow\/#breadcrumb"},"inLanguage":"en-US","potentialAction":[{"@type":"ReadAction","target":["https:\/\/towardsdatascience.com\/efficient-data-handling-in-python-with-arrow\/"]}]},{"@type":"ImageObject","inLanguage":"en-US","@id":"https:\/\/towardsdatascience.com\/efficient-data-handling-in-python-with-arrow\/#primaryimage","url":"https:\/\/towardsdatascience.com\/wp-content\/uploads\/2025\/02\/unnamed-29.png","contentUrl":"https:\/\/towardsdatascience.com\/wp-content\/uploads\/2025\/02\/unnamed-29.png","width":512,"height":384,"caption":"Image generated with Grok"},{"@type":"BreadcrumbList","@id":"https:\/\/towardsdatascience.com\/efficient-data-handling-in-python-with-arrow\/#breadcrumb","itemListElement":[{"@type":"ListItem","position":1,"name":"Home","item":"https:\/\/towardsdatascience.com\/"},{"@type":"ListItem","position":2,"name":"Efficient Data Handling in Python with Arrow"}]},{"@type":"WebSite","@id":"https:\/\/towardsdatascience.com\/#website","url":"https:\/\/towardsdatascience.com\/","name":"Towards Data Science","description":"Publish AI, ML &amp; data-science insights to a global community of data 
professionals.","publisher":{"@id":"https:\/\/towardsdatascience.com\/#organization"},"alternateName":"TDS","potentialAction":[{"@type":"SearchAction","target":{"@type":"EntryPoint","urlTemplate":"https:\/\/towardsdatascience.com\/?s={search_term_string}"},"query-input":{"@type":"PropertyValueSpecification","valueRequired":true,"valueName":"search_term_string"}}],"inLanguage":"en-US"},{"@type":"Organization","@id":"https:\/\/towardsdatascience.com\/#organization","name":"Towards Data Science","alternateName":"TDS","url":"https:\/\/towardsdatascience.com\/","logo":{"@type":"ImageObject","inLanguage":"en-US","@id":"https:\/\/towardsdatascience.com\/#\/schema\/logo\/image\/","url":"https:\/\/towardsdatascience.com\/wp-content\/uploads\/2025\/02\/tds-logo.jpg","contentUrl":"https:\/\/towardsdatascience.com\/wp-content\/uploads\/2025\/02\/tds-logo.jpg","width":696,"height":696,"caption":"Towards Data Science"},"image":{"@id":"https:\/\/towardsdatascience.com\/#\/schema\/logo\/image\/"},"sameAs":["https:\/\/x.com\/TDataScience","https:\/\/www.youtube.com\/c\/TowardsDataScience","https:\/\/www.linkedin.com\/company\/towards-data-science\/"]},{"@type":"Person","@id":"https:\/\/towardsdatascience.com\/#\/schema\/person\/f9925d336b6fe962b03ad8281d90b8ee","name":"TDS Editors","image":{"@type":"ImageObject","inLanguage":"en-US","@id":"https:\/\/towardsdatascience.com\/#\/schema\/person\/image\/23494c9101089ad44ae88ce9d2f56aac","url":"https:\/\/secure.gravatar.com\/avatar\/?s=96&d=mm&r=g","contentUrl":"https:\/\/secure.gravatar.com\/avatar\/?s=96&d=mm&r=g","caption":"TDS Editors"},"description":"Building a vibrant data science and machine learning community. 
Share your insights and projects with our global audience: bit.ly\/write-for-tds","url":"https:\/\/towardsdatascience.com\/author\/towardsdatascience\/"}]}},"distributor_meta":false,"distributor_terms":false,"distributor_media":false,"distributor_original_site_name":"Towards Data Science","distributor_original_site_url":"https:\/\/towardsdatascience.com","push-errors":false,"_links":{"self":[{"href":"https:\/\/towardsdatascience.com\/wp-json\/wp\/v2\/posts\/598426","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/towardsdatascience.com\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/towardsdatascience.com\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/towardsdatascience.com\/wp-json\/wp\/v2\/users\/18"}],"replies":[{"embeddable":true,"href":"https:\/\/towardsdatascience.com\/wp-json\/wp\/v2\/comments?post=598426"}],"version-history":[{"count":0,"href":"https:\/\/towardsdatascience.com\/wp-json\/wp\/v2\/posts\/598426\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/towardsdatascience.com\/wp-json\/wp\/v2\/media\/598428"}],"wp:attachment":[{"href":"https:\/\/towardsdatascience.com\/wp-json\/wp\/v2\/media?parent=598426"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/towardsdatascience.com\/wp-json\/wp\/v2\/categories?post=598426"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/towardsdatascience.com\/wp-json\/wp\/v2\/tags?post=598426"},{"taxonomy":"sponsor","embeddable":true,"href":"https:\/\/towardsdatascience.com\/wp-json\/wp\/v2\/sponsor?post=598426"},{"taxonomy":"author","embeddable":true,"href":"https:\/\/towardsdatascience.com\/wp-json\/wp\/v2\/coauthors?post=598426"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}