{"id":114629,"date":"2022-01-31T15:10:13","date_gmt":"2022-01-31T15:10:13","guid":{"rendered":"https:\/\/towardsdatascience.com\/top-3-python-packages-to-generate-synthetic-data-33a351a5de0c\/"},"modified":"2025-01-21T09:51:51","modified_gmt":"2025-01-21T09:51:51","slug":"top-3-python-packages-to-generate-synthetic-data-33a351a5de0c","status":"publish","type":"post","link":"https:\/\/towardsdatascience.com\/top-3-python-packages-to-generate-synthetic-data-33a351a5de0c\/","title":{"rendered":"Top 3 Python Packages to Generate Synthetic Data"},"content":{"rendered":"\n<p class=\"wp-block-paragraph\">Data is the backbone of every data project; data analysis, machine learning model training, and even a simple dashboard all need data. Acquiring data that fits your project is therefore important.<\/p>\n<p class=\"wp-block-paragraph\">However, the data you want does not always exist, and when it does, it is not always publicly available. Moreover, there are times when you want to test your data project with &quot;data&quot; that meets specific criteria. That is why generating your own data becomes important when you have certain requirements.<\/p>\n<p class=\"wp-block-paragraph\">Generating data might be important, but manually collecting data that meets our needs takes time. For that reason, we can synthesize our data with a programming language. This article outlines my top 3 Python packages for generating synthetic data. All the generated data can be used for any data project you want. Let&#8217;s get into it.<\/p>\n<hr class=\"wp-block-separator has-alpha-channel-opacity\" \/>\n<h2 class=\"wp-block-heading\">1. Faker<\/h2>\n<p class=\"wp-block-paragraph\"><a href=\"https:\/\/github.com\/joke2k\/faker\">Faker<\/a> is a Python package developed to simplify generating synthetic data. Many later synthetic data generator packages for Python are based on Faker. People love how simple and intuitive this package is, so let&#8217;s try it ourselves. 
For starters, let&#8217;s install the package.<\/p>\n<pre class=\"wp-block-prismatic-blocks\"><code class=\"language-\">pip install Faker<\/code><\/pre>\n<p class=\"wp-block-paragraph\">To generate synthetic data with Faker, we first need to instantiate the <code>Faker<\/code> class.<\/p>\n<pre class=\"wp-block-prismatic-blocks\"><code class=\"language-\">from faker import Faker\nfake = Faker()<\/code><\/pre>\n<p class=\"wp-block-paragraph\">With the instance created, we can generate various kinds of synthetic data. For example, let&#8217;s generate a synthetic name.<\/p>\n<pre class=\"wp-block-prismatic-blocks\"><code class=\"language-\">fake.name()<\/code><\/pre>\n<figure class=\"wp-block-image size-large\"><img data-dominant-color=\"eeedee\" data-has-transparency=\"true\" style=\"--dominant-color: #eeedee;\" loading=\"lazy\" decoding=\"async\" width=\"170\" height=\"38\" src=\"https:\/\/towardsdatascience.com\/wp-content\/uploads\/2022\/01\/1vS-M7b8WEsdkS19bnrtGpA.png\" alt=\"Image by Author\" class=\"wp-image-215638 has-transparency\" \/><figcaption class=\"wp-element-caption\">Image by Author<\/figcaption><\/figure>\n<p class=\"wp-block-paragraph\">Calling the <code>.name()<\/code> method of the Faker instance returns a person&#8217;s name. Faker produces a new random value each time we call the method. Let&#8217;s call it one more time.<\/p>\n<figure class=\"wp-block-image size-large\"><img data-dominant-color=\"eaeaea\" data-has-transparency=\"true\" style=\"--dominant-color: #eaeaea;\" loading=\"lazy\" decoding=\"async\" width=\"163\" height=\"32\" src=\"https:\/\/towardsdatascience.com\/wp-content\/uploads\/2022\/01\/1BWx_NUjFqlG4d7s9EL3nlQ.png\" alt=\"Image by Author\" class=\"wp-image-215639 has-transparency\" \/><figcaption class=\"wp-element-caption\">Image by Author<\/figcaption><\/figure>\n<p class=\"wp-block-paragraph\">The result is a different name from the previous call. 
The randomization process is important in generating synthetic data because we want variation in our dataset.<\/p>\n<p class=\"wp-block-paragraph\">There are many more variables we can generate using the Faker package. It is not limited to names; other examples include address, bank, job, credit score, and many more. In the Faker package, these generators are called <strong>Providers<\/strong>. If you want to see the full list of <a href=\"https:\/\/faker.readthedocs.io\/en\/stable\/providers.html\">standard providers<\/a>, <a href=\"https:\/\/faker.readthedocs.io\/en\/stable\/communityproviders.html\">community providers<\/a>, and <a href=\"https:\/\/faker.readthedocs.io\/en\/stable\/locales.html\">localized providers<\/a>, you can find them in the documentation.<\/p>\n<hr class=\"wp-block-separator has-alpha-channel-opacity\" \/>\n<h2 class=\"wp-block-heading\">2. SDV<\/h2>\n<p class=\"wp-block-paragraph\"><a href=\"https:\/\/sdv.dev\/SDV\/\">SDV<\/a>, or Synthetic Data Vault, is a Python package that generates synthetic data based on a dataset you provide. The generated data can be single-table, multi-table, or time-series, depending on the schema you provide. The generated data also has the same format, properties, and statistics as the provided dataset.<\/p>\n<p class=\"wp-block-paragraph\">SDV generates synthetic data by applying mathematical techniques and machine learning models, including deep learning models. Even if the data contains multiple data types and missing values, SDV will handle it, so we only need to provide the data (and the metadata when required).<\/p>\n<p class=\"wp-block-paragraph\">Let&#8217;s try to generate our synthetic data with SDV. 
First, we need to install the package.<\/p>\n<pre class=\"wp-block-prismatic-blocks\"><code class=\"language-\">pip install sdv<\/code><\/pre>\n<p class=\"wp-block-paragraph\">For our sample, I will use the <a href=\"https:\/\/www.kaggle.com\/yasserh\/horse-survival-dataset\">Horse Survival Dataset<\/a> from Kaggle because it contains various data types and missing data.<\/p>\n<pre class=\"wp-block-prismatic-blocks\"><code class=\"language-\">import pandas as pd\ndata = pd.read_csv(&#039;horse.csv&#039;)\ndata.head()<\/code><\/pre>\n<figure class=\"wp-block-image size-large\"><img data-dominant-color=\"efefef\" data-has-transparency=\"true\" style=\"--dominant-color: #efefef;\" loading=\"lazy\" decoding=\"async\" width=\"1178\" height=\"206\" src=\"https:\/\/towardsdatascience.com\/wp-content\/uploads\/2022\/01\/1VOIA8kcyp4F94XwQuBn7rg.png\" alt=\"Image by Author\" class=\"wp-image-215640 has-transparency\" srcset=\"https:\/\/towardsdatascience.com\/wp-content\/uploads\/2022\/01\/1VOIA8kcyp4F94XwQuBn7rg.png 1178w, https:\/\/towardsdatascience.com\/wp-content\/uploads\/2022\/01\/1VOIA8kcyp4F94XwQuBn7rg-300x52.png 300w, https:\/\/towardsdatascience.com\/wp-content\/uploads\/2022\/01\/1VOIA8kcyp4F94XwQuBn7rg-1024x179.png 1024w, https:\/\/towardsdatascience.com\/wp-content\/uploads\/2022\/01\/1VOIA8kcyp4F94XwQuBn7rg-768x134.png 768w\" sizes=\"auto, (max-width: 1178px) 100vw, 1178px\" \/><figcaption class=\"wp-element-caption\">Image by Author<\/figcaption><\/figure>\n<p class=\"wp-block-paragraph\">Our dataset is ready, and we want to generate synthetic data based on it. 
Let&#8217;s use one of the available <a href=\"https:\/\/sdv.dev\/SDV\/user_guides\/single_table\/models.html\">single-table SDV models<\/a>, <code>GaussianCopula<\/code>. Note that the <code>sdv.tabular<\/code> module used here belongs to SDV releases before 1.0; later releases reorganized the API.<\/p>\n<pre class=\"wp-block-prismatic-blocks\"><code class=\"language-\">from sdv.tabular import GaussianCopula\nmodel = GaussianCopula()\nmodel.fit(data)<\/code><\/pre>\n<p class=\"wp-block-paragraph\">The training process is easy; we only need to instantiate the class and fit the data. Let&#8217;s use the model to produce synthetic data.<\/p>\n<pre class=\"wp-block-prismatic-blocks\"><code class=\"language-\">sample = model.sample(200)\nsample.head()<\/code><\/pre>\n<figure class=\"wp-block-image size-large\"><img data-dominant-color=\"efeeee\" data-has-transparency=\"true\" style=\"--dominant-color: #efeeee;\" loading=\"lazy\" decoding=\"async\" width=\"1146\" height=\"203\" src=\"https:\/\/towardsdatascience.com\/wp-content\/uploads\/2022\/01\/1xALdmEYSGSLsrpNmN_HRyQ.png\" alt=\"Image by Author\" class=\"wp-image-215641 has-transparency\" srcset=\"https:\/\/towardsdatascience.com\/wp-content\/uploads\/2022\/01\/1xALdmEYSGSLsrpNmN_HRyQ.png 1146w, https:\/\/towardsdatascience.com\/wp-content\/uploads\/2022\/01\/1xALdmEYSGSLsrpNmN_HRyQ-300x53.png 300w, https:\/\/towardsdatascience.com\/wp-content\/uploads\/2022\/01\/1xALdmEYSGSLsrpNmN_HRyQ-1024x181.png 1024w, https:\/\/towardsdatascience.com\/wp-content\/uploads\/2022\/01\/1xALdmEYSGSLsrpNmN_HRyQ-768x136.png 768w\" sizes=\"auto, (max-width: 1146px) 100vw, 1146px\" \/><figcaption class=\"wp-element-caption\">Image by Author<\/figcaption><\/figure>\n<p class=\"wp-block-paragraph\">With the model&#8217;s <code>.sample()<\/code> method, we obtain randomized synthetic data. The number of rows generated is the number you pass to <code>.sample()<\/code>.<\/p>\n<p class=\"wp-block-paragraph\">You might notice that the data sometimes contains a unique identifier. 
For example, in the dataset above we could treat <code>hospital_number<\/code> as the identifier. The sample above contains repeated <code>hospital_number<\/code> values, which is something we don&#8217;t want if the column is supposed to be unique. In this case, we can pass the <code>primary_key<\/code> parameter to the model.<\/p>\n<pre class=\"wp-block-prismatic-blocks\"><code class=\"language-\">model = GaussianCopula(primary_key=&#039;hospital_number&#039;)\nmodel.fit(data)<\/code><\/pre>\n<p class=\"wp-block-paragraph\">Each row sampled from this model then has a unique primary key.<\/p>\n<p class=\"wp-block-paragraph\">Another question we might ask is: how good is the generated synthetic data? In this case, we can use the <code>evaluate<\/code> function from SDV. This evaluation compares the real dataset with the sample dataset. Many tests are available, but we will focus only on the Kolmogorov\u2013Smirnov (KS) and Chi-Squared (CS) tests.<\/p>\n<pre class=\"wp-block-prismatic-blocks\"><code class=\"language-\">from sdv.evaluation import evaluate\nevaluate(sample, data, metrics=[&#039;CSTest&#039;, &#039;KSTest&#039;], aggregate=False)<\/code><\/pre>\n<figure class=\"wp-block-image size-large\"><img data-dominant-color=\"eeeeee\" data-has-transparency=\"true\" style=\"--dominant-color: #eeeeee;\" loading=\"lazy\" decoding=\"async\" width=\"904\" height=\"114\" src=\"https:\/\/towardsdatascience.com\/wp-content\/uploads\/2022\/01\/1gJ4tYGl586oRbzDVBllcoA.png\" alt=\"Image by Author\" class=\"wp-image-215642 has-transparency\" srcset=\"https:\/\/towardsdatascience.com\/wp-content\/uploads\/2022\/01\/1gJ4tYGl586oRbzDVBllcoA.png 904w, https:\/\/towardsdatascience.com\/wp-content\/uploads\/2022\/01\/1gJ4tYGl586oRbzDVBllcoA-300x38.png 300w, https:\/\/towardsdatascience.com\/wp-content\/uploads\/2022\/01\/1gJ4tYGl586oRbzDVBllcoA-768x97.png 768w\" sizes=\"auto, (max-width: 904px) 100vw, 904px\" \/><figcaption 
class=\"wp-element-caption\">Image by Author<\/figcaption><\/figure>\n<p class=\"wp-block-paragraph\">KSTest compares the continuous columns, and CSTest compares the discrete columns. Both tests return a normalized score between 0 and 1, where a higher score is better. From the result above, we can see that the discrete columns in the sample are good (very similar to the real data). In contrast, the continuous columns may deviate in distribution. If you want to know all the evaluation methods available in SDV, refer to the <a href=\"https:\/\/sdv.dev\/SDV\/user_guides\/evaluation\/evaluation_framework.html\">documentation page<\/a>.<\/p>\n<hr class=\"wp-block-separator has-alpha-channel-opacity\" \/>\n<h2 class=\"wp-block-heading\">3. Gretel<\/h2>\n<p class=\"wp-block-paragraph\"><a href=\"https:\/\/synthetics.docs.gretel.ai\/en\/stable\/#\">Gretel<\/a>, or Gretel Synthetics, is an open-source Python package based on a Recurrent Neural Network (RNN) that generates structured and unstructured data. The package treats the dataset as text and trains the model on this text. The model then produces synthetic data as text (we need to transform it into our intended format).<\/p>\n<p class=\"wp-block-paragraph\"><a href=\"https:\/\/colab.research.google.com\/github\/gretelai\/gretel-synthetics\/blob\/master\/examples\/synthetic_records.ipynb#scrollTo=2rx28XYXprJB\">Gretel<\/a> requires fairly heavy computational power because it is based on an RNN, so I recommend using a free Google Colab notebook or a Kaggle notebook if your computer is not powerful enough. 
For accessibility, this article follows the tutorial provided by Gretel.<\/p>\n<p class=\"wp-block-paragraph\">The first thing we need to do is install the package using the following code.<\/p>\n<pre class=\"wp-block-prismatic-blocks\"><code class=\"language-\">pip install gretel-synthetics<\/code><\/pre>\n<p class=\"wp-block-paragraph\">We then use the following code to create a config that holds the parameters for training the RNN model. 
The parameters below assume training on a GPU, and the dataset used is the <a href=\"https:\/\/gretel-public-website.s3-us-west-2.amazonaws.com\/datasets\/uber_scooter_rides_1day.csv\">scooter journey coordinates dataset<\/a> available from Gretel.<\/p>\n<pre class=\"wp-block-prismatic-blocks\"><code class=\"language-\">from pathlib import Path\n\nfrom gretel_synthetics.config import LocalConfig\n\n# Create a config for both training and generating data\nconfig = LocalConfig(\n    # the max line length for input training data\n    max_line_len=2048,\n    # tokenizer model vocabulary size\n    vocab_size=20000,\n    # specify if the training text is structured, else ``None``\n    field_delimiter=&quot;,&quot;,\n    # overwrite previously trained model checkpoints\n    overwrite=True,\n    # checkpoint location\n    checkpoint_dir=(Path.cwd() \/ &#039;checkpoints&#039;).as_posix(),\n    # the dataset used for RNN training (filepath or S3)\n    input_data_path=&quot;https:\/\/gretel-public-website.s3-us-west-2.amazonaws.com\/datasets\/uber_scooter_rides_1day.csv&quot;,\n)<\/code><\/pre>\n<p class=\"wp-block-paragraph\">Once the config is ready, we train the RNN model using the following code.<\/p>\n<pre class=\"wp-block-prismatic-blocks\"><code class=\"language-\">from gretel_synthetics.train import train_rnn\n\ntrain_rnn(config)<\/code><\/pre>\n<p class=\"wp-block-paragraph\">Depending on your processing power, the config, and the dataset, the process might take some time. 
When it&#8217;s done, the training automatically saves your best model, which is then ready to generate synthetic data. Let&#8217;s try to generate the data using <code>generate_text<\/code>.<\/p>\n<pre class=\"wp-block-prismatic-blocks\"><code class=\"language-\">from gretel_synthetics.generate import generate_text\n\n# Simple validation function that checks whether a record contains 6 parts.\n# Feel free to tweak it.\ndef validate_record(line):\n   rec = line.split(&quot;, &quot;)\n   if len(rec) == 6:\n      float(rec[5])\n      float(rec[4])\n      float(rec[3])\n      float(rec[2])\n      int(rec[0])\n   else:\n      raise Exception(&#039;record not 6 parts&#039;)\n\n# Generate 1000 synthetic records\ndata = generate_text(config, line_validator=validate_record, num_lines=1000)\n\nprint(data)<\/code><\/pre>\n<figure class=\"wp-block-image size-large\"><img data-dominant-color=\"eeeeee\" data-has-transparency=\"true\" style=\"--dominant-color: #eeeeee;\" loading=\"lazy\" decoding=\"async\" width=\"550\" height=\"38\" src=\"https:\/\/towardsdatascience.com\/wp-content\/uploads\/2022\/01\/17iw2D03-2WUc386LjojCVA.png\" alt=\"Image by Author\" class=\"wp-image-215643 has-transparency\" srcset=\"https:\/\/towardsdatascience.com\/wp-content\/uploads\/2022\/01\/17iw2D03-2WUc386LjojCVA.png 550w, https:\/\/towardsdatascience.com\/wp-content\/uploads\/2022\/01\/17iw2D03-2WUc386LjojCVA-300x21.png 300w\" sizes=\"auto, (max-width: 550px) 100vw, 550px\" \/><figcaption class=\"wp-element-caption\">Image by Author<\/figcaption><\/figure>\n<p class=\"wp-block-paragraph\">The generated data, by default, is a generator object that contains all the synthetic data. 
To access the data, we can iterate over the object and print each value.<\/p>\n<pre class=\"wp-block-prismatic-blocks\"><code class=\"language-\">for line in data:\n   print(line)<\/code><\/pre>\n<p class=\"wp-block-paragraph\">Let&#8217;s take a closer look at one generated record.<\/p>\n<pre class=\"wp-block-prismatic-blocks\"><code class=\"language-\">line<\/code><\/pre>\n<figure class=\"wp-block-image size-large\"><img data-dominant-color=\"f0f0f0\" data-has-transparency=\"false\" style=\"--dominant-color: #f0f0f0;\" loading=\"lazy\" decoding=\"async\" width=\"1198\" height=\"38\" src=\"https:\/\/towardsdatascience.com\/wp-content\/uploads\/2022\/01\/16y_2Ky82qbbpZYxOTA_peg.png\" alt=\"Image by Author\" class=\"wp-image-215644 not-transparent\" srcset=\"https:\/\/towardsdatascience.com\/wp-content\/uploads\/2022\/01\/16y_2Ky82qbbpZYxOTA_peg.png 1198w, https:\/\/towardsdatascience.com\/wp-content\/uploads\/2022\/01\/16y_2Ky82qbbpZYxOTA_peg-300x10.png 300w, https:\/\/towardsdatascience.com\/wp-content\/uploads\/2022\/01\/16y_2Ky82qbbpZYxOTA_peg-1024x32.png 1024w, https:\/\/towardsdatascience.com\/wp-content\/uploads\/2022\/01\/16y_2Ky82qbbpZYxOTA_peg-768x24.png 768w\" sizes=\"auto, (max-width: 1198px) 100vw, 1198px\" \/><figcaption class=\"wp-element-caption\">Image by Author<\/figcaption><\/figure>\n<p class=\"wp-block-paragraph\">The image above shows an example synthetic record from the Gretel RNN model. Its <code>text<\/code> field holds our data as a text string. 
To access it, we can use the <code>.text<\/code> attribute.<\/p>\n<pre class=\"wp-block-prismatic-blocks\"><code class=\"language-\">print(line.text)<\/code><\/pre>\n<figure class=\"wp-block-image size-large\"><img data-dominant-color=\"f0f0f0\" data-has-transparency=\"false\" style=\"--dominant-color: #f0f0f0;\" loading=\"lazy\" decoding=\"async\" width=\"598\" height=\"38\" src=\"https:\/\/towardsdatascience.com\/wp-content\/uploads\/2022\/01\/1H7m2dZfrBFr15QaoZnXLPQ.png\" alt=\"Image by Author\" class=\"wp-image-215645 not-transparent\" srcset=\"https:\/\/towardsdatascience.com\/wp-content\/uploads\/2022\/01\/1H7m2dZfrBFr15QaoZnXLPQ.png 598w, https:\/\/towardsdatascience.com\/wp-content\/uploads\/2022\/01\/1H7m2dZfrBFr15QaoZnXLPQ-300x19.png 300w\" sizes=\"auto, (max-width: 598px) 100vw, 598px\" \/><figcaption class=\"wp-element-caption\">Image by Author<\/figcaption><\/figure>\n<p class=\"wp-block-paragraph\">Hence, we only need to split the text by the delimiter (<code>&#039;, &#039;<\/code>) and convert it into tabular form. 
However, not all of the generated data may be valid (depending on your validation), so it needs to be cleaned thoroughly before use.<\/p>\n<figure class=\"wp-block-image size-large\"><img data-dominant-color=\"ebebeb\" data-has-transparency=\"false\" style=\"--dominant-color: #ebebeb;\" loading=\"lazy\" decoding=\"async\" width=\"1530\" height=\"50\" src=\"https:\/\/towardsdatascience.com\/wp-content\/uploads\/2022\/01\/1Je4KzcBZDpfRcgI793bqRg.png\" alt=\"Image by Author\" class=\"wp-image-215646 not-transparent\" srcset=\"https:\/\/towardsdatascience.com\/wp-content\/uploads\/2022\/01\/1Je4KzcBZDpfRcgI793bqRg.png 1530w, https:\/\/towardsdatascience.com\/wp-content\/uploads\/2022\/01\/1Je4KzcBZDpfRcgI793bqRg-300x10.png 300w, https:\/\/towardsdatascience.com\/wp-content\/uploads\/2022\/01\/1Je4KzcBZDpfRcgI793bqRg-1024x33.png 1024w, https:\/\/towardsdatascience.com\/wp-content\/uploads\/2022\/01\/1Je4KzcBZDpfRcgI793bqRg-768x25.png 768w\" sizes=\"auto, (max-width: 1530px) 100vw, 1530px\" \/><figcaption class=\"wp-element-caption\">Image by Author<\/figcaption><\/figure>\n<p class=\"wp-block-paragraph\">The record above is not valid because it contains more than six parts, while we need exactly six. This is why we need to be careful with the data produced by the RNN model. 
However, the usability outweighs the drawbacks, so a little extra work is worthwhile if you use Gretel.<\/p>\n<hr class=\"wp-block-separator has-alpha-channel-opacity\" \/>\n<h2 class=\"wp-block-heading\"><strong>Conclusion<\/strong><\/h2>\n<p class=\"wp-block-paragraph\">Data is the backbone of any data project, but sometimes the data we want is not available or does not meet our requirements. That is why we can use Python packages to generate synthetic data. This article covered my top 3 Python packages for generating synthetic data:<\/p>\n<ol class=\"wp-block-list\">\n<li>Faker<\/li>\n<li>SDV<\/li>\n<li>Gretel<\/li>\n<\/ol>\n<p class=\"wp-block-paragraph\">I hope it helps!<\/p>\n<hr class=\"wp-block-separator has-alpha-channel-opacity\" \/>\n<p class=\"wp-block-paragraph\">Visit me on my <strong><a href=\"https:\/\/bio.link\/cornelli\">Social Media.<\/a><\/strong><\/p>","protected":false},"excerpt":{"rendered":"<p>Synthetic data for your data science project<\/p>\n","protected":false},"author":18,"featured_media":114630,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"is_member_only":true,"sub_heading":"Synthetic data for your data science 
project","footnotes":""},"categories":[17,44,102,25,155],"tags":[447,448,629,491,493],"sponsor":[],"coauthors":[27607],"class_list":["post-114629","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-artificial-intelligence","category-data-science","category-education","category-programming","category-technology","tag-artificial-intelligence","tag-data-science","tag-education","tag-programming","tag-technology"],"yoast_head":"<!-- This site is optimized with the Yoast SEO plugin v25.2 - https:\/\/yoast.com\/wordpress\/plugins\/seo\/ -->\n<title>Top 3 Python Packages to Generate Synthetic Data | Towards Data Science<\/title>\n<meta name=\"robots\" content=\"index, follow, max-snippet:-1, max-image-preview:large, max-video-preview:-1\" \/>\n<link rel=\"canonical\" href=\"https:\/\/towardsdatascience.com\/top-3-python-packages-to-generate-synthetic-data-33a351a5de0c\/\" \/>\n<meta property=\"og:locale\" content=\"en_US\" \/>\n<meta property=\"og:type\" content=\"article\" \/>\n<meta property=\"og:title\" content=\"Top 3 Python Packages to Generate Synthetic Data | Towards Data Science\" \/>\n<meta property=\"og:description\" content=\"Synthetic data for your data science project\" \/>\n<meta property=\"og:url\" content=\"https:\/\/towardsdatascience.com\/top-3-python-packages-to-generate-synthetic-data-33a351a5de0c\/\" \/>\n<meta property=\"og:site_name\" content=\"Towards Data Science\" \/>\n<meta property=\"article:published_time\" content=\"2022-01-31T15:10:13+00:00\" \/>\n<meta property=\"article:modified_time\" content=\"2025-01-21T09:51:51+00:00\" \/>\n<meta property=\"og:image\" content=\"https:\/\/towardsdatascience.com\/wp-content\/uploads\/2022\/01\/04mExSYpCXTDmq-4y-scaled.jpg\" \/>\n\t<meta property=\"og:image:width\" content=\"2560\" \/>\n\t<meta property=\"og:image:height\" content=\"1516\" \/>\n\t<meta property=\"og:image:type\" content=\"image\/jpeg\" \/>\n<meta name=\"author\" content=\"Cornellius Yudha 
Wijaya\" \/>\n<meta name=\"twitter:card\" content=\"summary_large_image\" \/>\n<meta name=\"twitter:creator\" content=\"@TDataScience\" \/>\n<meta name=\"twitter:site\" content=\"@TDataScience\" \/>\n<meta name=\"twitter:label1\" content=\"Written by\" \/>\n\t<meta name=\"twitter:data1\" content=\"Cornellius Yudha Wijaya\" \/>\n\t<meta name=\"twitter:label2\" content=\"Est. reading time\" \/>\n\t<meta name=\"twitter:data2\" content=\"8 minutes\" \/>\n<script type=\"application\/ld+json\" class=\"yoast-schema-graph\">{\"@context\":\"https:\/\/schema.org\",\"@graph\":[{\"@type\":\"Article\",\"@id\":\"https:\/\/towardsdatascience.com\/top-3-python-packages-to-generate-synthetic-data-33a351a5de0c\/#article\",\"isPartOf\":{\"@id\":\"https:\/\/towardsdatascience.com\/top-3-python-packages-to-generate-synthetic-data-33a351a5de0c\/\"},\"author\":{\"name\":\"TDS Editors\",\"@id\":\"https:\/\/towardsdatascience.com\/#\/schema\/person\/f9925d336b6fe962b03ad8281d90b8ee\"},\"headline\":\"Top 3 Python Packages to Generate Synthetic Data\",\"datePublished\":\"2022-01-31T15:10:13+00:00\",\"dateModified\":\"2025-01-21T09:51:51+00:00\",\"mainEntityOfPage\":{\"@id\":\"https:\/\/towardsdatascience.com\/top-3-python-packages-to-generate-synthetic-data-33a351a5de0c\/\"},\"wordCount\":1294,\"commentCount\":0,\"publisher\":{\"@id\":\"https:\/\/towardsdatascience.com\/#organization\"},\"image\":{\"@id\":\"https:\/\/towardsdatascience.com\/top-3-python-packages-to-generate-synthetic-data-33a351a5de0c\/#primaryimage\"},\"thumbnailUrl\":\"https:\/\/towardsdatascience.com\/wp-content\/uploads\/2022\/01\/04mExSYpCXTDmq-4y-scaled.jpg\",\"keywords\":[\"Artificial Intelligence\",\"Data Science\",\"Education\",\"Programming\",\"Technology\"],\"articleSection\":[\"Artificial Intelligence\",\"Data Science\",\"Education\",\"Programming\",\"Science and 
Technology\"],\"inLanguage\":\"en-US\"},{\"@type\":\"WebPage\",\"@id\":\"https:\/\/towardsdatascience.com\/top-3-python-packages-to-generate-synthetic-data-33a351a5de0c\/\",\"url\":\"https:\/\/towardsdatascience.com\/top-3-python-packages-to-generate-synthetic-data-33a351a5de0c\/\",\"name\":\"Top 3 Python Packages to Generate Synthetic Data | Towards Data Science\",\"isPartOf\":{\"@id\":\"https:\/\/towardsdatascience.com\/#website\"},\"primaryImageOfPage\":{\"@id\":\"https:\/\/towardsdatascience.com\/top-3-python-packages-to-generate-synthetic-data-33a351a5de0c\/#primaryimage\"},\"image\":{\"@id\":\"https:\/\/towardsdatascience.com\/top-3-python-packages-to-generate-synthetic-data-33a351a5de0c\/#primaryimage\"},\"thumbnailUrl\":\"https:\/\/towardsdatascience.com\/wp-content\/uploads\/2022\/01\/04mExSYpCXTDmq-4y-scaled.jpg\",\"datePublished\":\"2022-01-31T15:10:13+00:00\",\"dateModified\":\"2025-01-21T09:51:51+00:00\",\"breadcrumb\":{\"@id\":\"https:\/\/towardsdatascience.com\/top-3-python-packages-to-generate-synthetic-data-33a351a5de0c\/#breadcrumb\"},\"inLanguage\":\"en-US\",\"potentialAction\":[{\"@type\":\"ReadAction\",\"target\":[\"https:\/\/towardsdatascience.com\/top-3-python-packages-to-generate-synthetic-data-33a351a5de0c\/\"]}]},{\"@type\":\"ImageObject\",\"inLanguage\":\"en-US\",\"@id\":\"https:\/\/towardsdatascience.com\/top-3-python-packages-to-generate-synthetic-data-33a351a5de0c\/#primaryimage\",\"url\":\"https:\/\/towardsdatascience.com\/wp-content\/uploads\/2022\/01\/04mExSYpCXTDmq-4y-scaled.jpg\",\"contentUrl\":\"https:\/\/towardsdatascience.com\/wp-content\/uploads\/2022\/01\/04mExSYpCXTDmq-4y-scaled.jpg\",\"width\":2560,\"height\":1516,\"caption\":\"Photo by Maxim Berg on 
Unsplash\"},{\"@type\":\"BreadcrumbList\",\"@id\":\"https:\/\/towardsdatascience.com\/top-3-python-packages-to-generate-synthetic-data-33a351a5de0c\/#breadcrumb\",\"itemListElement\":[{\"@type\":\"ListItem\",\"position\":1,\"name\":\"Home\",\"item\":\"https:\/\/towardsdatascience.com\/\"},{\"@type\":\"ListItem\",\"position\":2,\"name\":\"Top 3 Python Packages to Generate Synthetic Data\"}]},{\"@type\":\"WebSite\",\"@id\":\"https:\/\/towardsdatascience.com\/#website\",\"url\":\"https:\/\/towardsdatascience.com\/\",\"name\":\"Towards Data Science\",\"description\":\"Publish AI, ML &amp; data-science insights to a global community of data professionals.\",\"publisher\":{\"@id\":\"https:\/\/towardsdatascience.com\/#organization\"},\"alternateName\":\"TDS\",\"potentialAction\":[{\"@type\":\"SearchAction\",\"target\":{\"@type\":\"EntryPoint\",\"urlTemplate\":\"https:\/\/towardsdatascience.com\/?s={search_term_string}\"},\"query-input\":{\"@type\":\"PropertyValueSpecification\",\"valueRequired\":true,\"valueName\":\"search_term_string\"}}],\"inLanguage\":\"en-US\"},{\"@type\":\"Organization\",\"@id\":\"https:\/\/towardsdatascience.com\/#organization\",\"name\":\"Towards Data Science\",\"alternateName\":\"TDS\",\"url\":\"https:\/\/towardsdatascience.com\/\",\"logo\":{\"@type\":\"ImageObject\",\"inLanguage\":\"en-US\",\"@id\":\"https:\/\/towardsdatascience.com\/#\/schema\/logo\/image\/\",\"url\":\"https:\/\/towardsdatascience.com\/wp-content\/uploads\/2025\/02\/tds-logo.jpg\",\"contentUrl\":\"https:\/\/towardsdatascience.com\/wp-content\/uploads\/2025\/02\/tds-logo.jpg\",\"width\":696,\"height\":696,\"caption\":\"Towards Data 
Science\"},\"image\":{\"@id\":\"https:\/\/towardsdatascience.com\/#\/schema\/logo\/image\/\"},\"sameAs\":[\"https:\/\/x.com\/TDataScience\",\"https:\/\/www.youtube.com\/c\/TowardsDataScience\",\"https:\/\/www.linkedin.com\/company\/towards-data-science\/\"]},{\"@type\":\"Person\",\"@id\":\"https:\/\/towardsdatascience.com\/#\/schema\/person\/f9925d336b6fe962b03ad8281d90b8ee\",\"name\":\"TDS Editors\",\"image\":{\"@type\":\"ImageObject\",\"inLanguage\":\"en-US\",\"@id\":\"https:\/\/towardsdatascience.com\/#\/schema\/person\/image\/23494c9101089ad44ae88ce9d2f56aac\",\"url\":\"https:\/\/secure.gravatar.com\/avatar\/?s=96&d=mm&r=g\",\"contentUrl\":\"https:\/\/secure.gravatar.com\/avatar\/?s=96&d=mm&r=g\",\"caption\":\"TDS Editors\"},\"description\":\"Building a vibrant data science and machine learning community. Share your insights and projects with our global audience: bit.ly\/write-for-tds\",\"url\":\"https:\/\/towardsdatascience.com\/author\/towardsdatascience\/\"}]}<\/script>\n<!-- \/ Yoast SEO plugin. 
-->","yoast_head_json":{"title":"Top 3 Python Packages to Generate Synthetic Data | Towards Data Science","robots":{"index":"index","follow":"follow","max-snippet":"max-snippet:-1","max-image-preview":"max-image-preview:large","max-video-preview":"max-video-preview:-1"},"canonical":"https:\/\/towardsdatascience.com\/top-3-python-packages-to-generate-synthetic-data-33a351a5de0c\/","og_locale":"en_US","og_type":"article","og_title":"Top 3 Python Packages to Generate Synthetic Data | Towards Data Science","og_description":"Synthetic data for your data science project","og_url":"https:\/\/towardsdatascience.com\/top-3-python-packages-to-generate-synthetic-data-33a351a5de0c\/","og_site_name":"Towards Data Science","article_published_time":"2022-01-31T15:10:13+00:00","article_modified_time":"2025-01-21T09:51:51+00:00","og_image":[{"width":2560,"height":1516,"url":"https:\/\/towardsdatascience.com\/wp-content\/uploads\/2022\/01\/04mExSYpCXTDmq-4y-scaled.jpg","type":"image\/jpeg"}],"author":"Cornellius Yudha Wijaya","twitter_card":"summary_large_image","twitter_creator":"@TDataScience","twitter_site":"@TDataScience","twitter_misc":{"Written by":"Cornellius Yudha Wijaya","Est. 
reading time":"8 minutes"},"schema":{"@context":"https:\/\/schema.org","@graph":[{"@type":"Article","@id":"https:\/\/towardsdatascience.com\/top-3-python-packages-to-generate-synthetic-data-33a351a5de0c\/#article","isPartOf":{"@id":"https:\/\/towardsdatascience.com\/top-3-python-packages-to-generate-synthetic-data-33a351a5de0c\/"},"author":{"name":"TDS Editors","@id":"https:\/\/towardsdatascience.com\/#\/schema\/person\/f9925d336b6fe962b03ad8281d90b8ee"},"headline":"Top 3 Python Packages to Generate Synthetic Data","datePublished":"2022-01-31T15:10:13+00:00","dateModified":"2025-01-21T09:51:51+00:00","mainEntityOfPage":{"@id":"https:\/\/towardsdatascience.com\/top-3-python-packages-to-generate-synthetic-data-33a351a5de0c\/"},"wordCount":1294,"commentCount":0,"publisher":{"@id":"https:\/\/towardsdatascience.com\/#organization"},"image":{"@id":"https:\/\/towardsdatascience.com\/top-3-python-packages-to-generate-synthetic-data-33a351a5de0c\/#primaryimage"},"thumbnailUrl":"https:\/\/towardsdatascience.com\/wp-content\/uploads\/2022\/01\/04mExSYpCXTDmq-4y-scaled.jpg","keywords":["Artificial Intelligence","Data Science","Education","Programming","Technology"],"articleSection":["Artificial Intelligence","Data Science","Education","Programming","Science and Technology"],"inLanguage":"en-US"},{"@type":"WebPage","@id":"https:\/\/towardsdatascience.com\/top-3-python-packages-to-generate-synthetic-data-33a351a5de0c\/","url":"https:\/\/towardsdatascience.com\/top-3-python-packages-to-generate-synthetic-data-33a351a5de0c\/","name":"Top 3 Python Packages to Generate Synthetic Data | Towards Data 
Science","isPartOf":{"@id":"https:\/\/towardsdatascience.com\/#website"},"primaryImageOfPage":{"@id":"https:\/\/towardsdatascience.com\/top-3-python-packages-to-generate-synthetic-data-33a351a5de0c\/#primaryimage"},"image":{"@id":"https:\/\/towardsdatascience.com\/top-3-python-packages-to-generate-synthetic-data-33a351a5de0c\/#primaryimage"},"thumbnailUrl":"https:\/\/towardsdatascience.com\/wp-content\/uploads\/2022\/01\/04mExSYpCXTDmq-4y-scaled.jpg","datePublished":"2022-01-31T15:10:13+00:00","dateModified":"2025-01-21T09:51:51+00:00","breadcrumb":{"@id":"https:\/\/towardsdatascience.com\/top-3-python-packages-to-generate-synthetic-data-33a351a5de0c\/#breadcrumb"},"inLanguage":"en-US","potentialAction":[{"@type":"ReadAction","target":["https:\/\/towardsdatascience.com\/top-3-python-packages-to-generate-synthetic-data-33a351a5de0c\/"]}]},{"@type":"ImageObject","inLanguage":"en-US","@id":"https:\/\/towardsdatascience.com\/top-3-python-packages-to-generate-synthetic-data-33a351a5de0c\/#primaryimage","url":"https:\/\/towardsdatascience.com\/wp-content\/uploads\/2022\/01\/04mExSYpCXTDmq-4y-scaled.jpg","contentUrl":"https:\/\/towardsdatascience.com\/wp-content\/uploads\/2022\/01\/04mExSYpCXTDmq-4y-scaled.jpg","width":2560,"height":1516,"caption":"Photo by Maxim Berg on Unsplash"},{"@type":"BreadcrumbList","@id":"https:\/\/towardsdatascience.com\/top-3-python-packages-to-generate-synthetic-data-33a351a5de0c\/#breadcrumb","itemListElement":[{"@type":"ListItem","position":1,"name":"Home","item":"https:\/\/towardsdatascience.com\/"},{"@type":"ListItem","position":2,"name":"Top 3 Python Packages to Generate Synthetic Data"}]},{"@type":"WebSite","@id":"https:\/\/towardsdatascience.com\/#website","url":"https:\/\/towardsdatascience.com\/","name":"Towards Data Science","description":"Publish AI, ML &amp; data-science insights to a global community of data 
professionals.","publisher":{"@id":"https:\/\/towardsdatascience.com\/#organization"},"alternateName":"TDS","potentialAction":[{"@type":"SearchAction","target":{"@type":"EntryPoint","urlTemplate":"https:\/\/towardsdatascience.com\/?s={search_term_string}"},"query-input":{"@type":"PropertyValueSpecification","valueRequired":true,"valueName":"search_term_string"}}],"inLanguage":"en-US"},{"@type":"Organization","@id":"https:\/\/towardsdatascience.com\/#organization","name":"Towards Data Science","alternateName":"TDS","url":"https:\/\/towardsdatascience.com\/","logo":{"@type":"ImageObject","inLanguage":"en-US","@id":"https:\/\/towardsdatascience.com\/#\/schema\/logo\/image\/","url":"https:\/\/towardsdatascience.com\/wp-content\/uploads\/2025\/02\/tds-logo.jpg","contentUrl":"https:\/\/towardsdatascience.com\/wp-content\/uploads\/2025\/02\/tds-logo.jpg","width":696,"height":696,"caption":"Towards Data Science"},"image":{"@id":"https:\/\/towardsdatascience.com\/#\/schema\/logo\/image\/"},"sameAs":["https:\/\/x.com\/TDataScience","https:\/\/www.youtube.com\/c\/TowardsDataScience","https:\/\/www.linkedin.com\/company\/towards-data-science\/"]},{"@type":"Person","@id":"https:\/\/towardsdatascience.com\/#\/schema\/person\/f9925d336b6fe962b03ad8281d90b8ee","name":"TDS Editors","image":{"@type":"ImageObject","inLanguage":"en-US","@id":"https:\/\/towardsdatascience.com\/#\/schema\/person\/image\/23494c9101089ad44ae88ce9d2f56aac","url":"https:\/\/secure.gravatar.com\/avatar\/?s=96&d=mm&r=g","contentUrl":"https:\/\/secure.gravatar.com\/avatar\/?s=96&d=mm&r=g","caption":"TDS Editors"},"description":"Building a vibrant data science and machine learning community. 
Share your insights and projects with our global audience: bit.ly\/write-for-tds","url":"https:\/\/towardsdatascience.com\/author\/towardsdatascience\/"}]}},"distributor_meta":false,"distributor_terms":false,"distributor_media":false,"distributor_original_site_name":"Towards Data Science","distributor_original_site_url":"https:\/\/towardsdatascience.com","push-errors":false,"_links":{"self":[{"href":"https:\/\/towardsdatascience.com\/wp-json\/wp\/v2\/posts\/114629","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/towardsdatascience.com\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/towardsdatascience.com\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/towardsdatascience.com\/wp-json\/wp\/v2\/users\/18"}],"replies":[{"embeddable":true,"href":"https:\/\/towardsdatascience.com\/wp-json\/wp\/v2\/comments?post=114629"}],"version-history":[{"count":0,"href":"https:\/\/towardsdatascience.com\/wp-json\/wp\/v2\/posts\/114629\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/towardsdatascience.com\/wp-json\/wp\/v2\/media\/114630"}],"wp:attachment":[{"href":"https:\/\/towardsdatascience.com\/wp-json\/wp\/v2\/media?parent=114629"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/towardsdatascience.com\/wp-json\/wp\/v2\/categories?post=114629"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/towardsdatascience.com\/wp-json\/wp\/v2\/tags?post=114629"},{"taxonomy":"sponsor","embeddable":true,"href":"https:\/\/towardsdatascience.com\/wp-json\/wp\/v2\/sponsor?post=114629"},{"taxonomy":"author","embeddable":true,"href":"https:\/\/towardsdatascience.com\/wp-json\/wp\/v2\/coauthors?post=114629"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}