{"id":606560,"date":"2025-07-11T12:56:46","date_gmt":"2025-07-11T17:56:46","guid":{"rendered":"https:\/\/towardsdatascience.com\/?p=606560"},"modified":"2025-07-11T12:57:06","modified_gmt":"2025-07-11T17:57:06","slug":"hitchhikers-guide-to-rag-from-tiny-files-to-tolstoy-with-openais-api-and-langchain","status":"publish","type":"post","link":"https:\/\/towardsdatascience.com\/hitchhikers-guide-to-rag-from-tiny-files-to-tolstoy-with-openais-api-and-langchain\/","title":{"rendered":"Hitchhiker\u2019s Guide to RAG: From Tiny Files to Tolstoy with OpenAI\u2019s API and LangChain"},"content":{"rendered":"\n<p class=\"wp-block-paragraph\">In my previous post, <a href=\"https:\/\/towardsdatascience.com\/hitchhikers-guide-to-rag-with-chatgpt-api-and-langchain\/\">I walked you through setting up a very simple RAG pipeline in Python<\/a>, using OpenAI&#8217;s API, LangChain, and your local files. In that post, I cover the very basics of creating embeddings from your local files with LangChain, storing them in a vector database with FAISS, making calls to OpenAI&#8217;s API, and ultimately generating responses relevant to your files. \ud83c\udf1f<\/p>\n\n\n\n<figure class=\"wp-block-image size-large\"><img decoding=\"async\" src=\"https:\/\/contributor.insightmediagroup.io\/wp-content\/uploads\/2025\/07\/Documents-1-1024x369.jpg\" alt=\"\" class=\"wp-image-607475\"\/><figcaption class=\"wp-element-caption\">Image by author<\/figcaption><\/figure>\n\n\n\n<p class=\"wp-block-paragraph\">Nonetheless, in this simple example, I only demonstrate how to use a tiny .txt file. 
In this post, I elaborate on how you can use larger files with your RAG pipeline by adding an extra step to the process \u2014 <em><strong>chunking<\/strong><\/em>.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\">What about chunking?<\/h2>\n\n\n\n<p class=\"wp-block-paragraph\">Chunking refers to the process of parsing a text into smaller pieces of text\u2014chunks\u2014that are then transformed into embeddings. This is very important because it allows us to effectively process and create embeddings for larger files. All embedding models come with various limitations on the size of the text that is passed \u2014 I&#8217;ll get into more detail about those limitations in a moment. These limitations allow for better performance and low-latency responses. If the text we provide doesn&#8217;t meet those size limitations, it\u2019ll get truncated or rejected.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">If we wanted to create a RAG pipeline reading from, say, Leo Tolstoy&#8217;s <em><a data-type=\"link\" data-id=\"https:\/\/en.wikipedia.org\/wiki\/War_and_Peace\" href=\"https:\/\/en.wikipedia.org\/wiki\/War_and_Peace\">War and Peace<\/a><\/em> text (a rather large book), we wouldn&#8217;t be able to directly load it and transform it into a single embedding. Instead, we need to first do the <em>chunking <\/em>\u2014 create smaller chunks of text, and create embeddings for each one. Keeping each chunk below the size limits of whatever embedding model we use allows us to effectively transform any file into embeddings. 
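<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">Before bringing LangChain into the picture, the core idea can be sketched in a few lines of plain Python (the helper name and the 100-character chunk size below are arbitrary choices for illustration, not part of any library):<\/p>\n\n\n\n

```python
def chunk_text(text, chunk_size=100):
    # split the text into consecutive, non-overlapping pieces,
    # each small enough to embed on its own
    return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

# hypothetical stand-in for a much longer book text
sample = 'Well, Prince, so Genoa and Lucca are now just family estates of the Buonapartes. ' * 50
chunks = chunk_text(sample, chunk_size=100)
# every chunk is at most 100 characters, so each one can be embedded separately
```

\n\n\n\n<p class=\"wp-block-paragraph\">Real splitters are smarter than this, preferring to break on separators such as paragraphs and sentences, but the principle is the same.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">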
So, a <em>somewhat more<\/em> realistic landscape of a RAG pipeline would look as follows:<\/p>\n\n\n\n<figure class=\"wp-block-image size-large\"><img decoding=\"async\" src=\"https:\/\/contributor.insightmediagroup.io\/wp-content\/uploads\/2025\/07\/Documents-1024x369.jpg\" alt=\"\" class=\"wp-image-607474\"\/><figcaption class=\"wp-element-caption\">Image by author<\/figcaption><\/figure>\n\n\n\n<p class=\"wp-block-paragraph\">There are several parameters to further customize the chunking process and fit it to our specific needs. A key parameter of the chunking process is the <em>chunk size<\/em>, which allows us to specify what the size of each chunk will be (in characters or in tokens). The trick here is that the chunks we create have to be small enough to be processed within the size limitations of the embedding model, but at the same time, they should also be large enough to incorporate meaningful information.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">For instance, let&#8217;s assume we want to process the following sentence from <em>War and Peace<\/em>, where Prince Andrew contemplates the battle:<\/p>\n\n\n\n<figure class=\"wp-block-image size-large\"><img decoding=\"async\" src=\"https:\/\/contributor.insightmediagroup.io\/wp-content\/uploads\/2025\/07\/10-1024x369.jpg\" alt=\"\" class=\"wp-image-608066\"\/><figcaption class=\"wp-element-caption\">Image by author<\/figcaption><\/figure>\n\n\n\n<p class=\"wp-block-paragraph\">Let&#8217;s also assume we created the following (rather small) chunks:<\/p>\n\n\n\n<figure class=\"wp-block-image size-large\"><img decoding=\"async\" src=\"https:\/\/contributor.insightmediagroup.io\/wp-content\/uploads\/2025\/07\/9-1024x369.jpg\" alt=\"\" class=\"wp-image-608067\"\/><figcaption class=\"wp-element-caption\">Image by author<\/figcaption><\/figure>\n\n\n\n<p class=\"wp-block-paragraph\">Then, if we were to ask something like <em>&#8220;What does Prince Andrew mean by &#8216;all the same now&#8217;?&#8221;,<\/em> we may 
not get a good answer because the chunk <em><mark style=\"background-color:var(--wp--custom--color--spindle)\" class=\"has-inline-color\">\u201cBut isn\u2019t it all the same now?\u201d thought he. <\/mark><\/em> does not contain any context and is vague. Instead, the meaning is scattered across multiple chunks. Thus, even though this chunk is similar to the question we ask and may be retrieved, it does not carry enough meaning to produce a relevant response. Therefore, selecting the appropriate chunk size for the chunking process, in line with the type of documents we use for the RAG, can largely influence the quality of the responses we&#8217;ll be getting. In general, the content of a chunk should make sense to a human reading it without any other information, so that it can also make sense to the model. Ultimately, a trade-off for the chunk size exists \u2014 chunks need to be small enough to meet the embedding model&#8217;s size limitations, but large enough to preserve meaning.<\/p>\n\n\n\n<p class=\"has-text-align-center wp-block-paragraph\">\u2022 \u2022 \u2022<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">Another significant parameter is the chunk overlap \u2014 that is, how much overlap consecutive chunks have with one another. 
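<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">As a plain-Python illustration (the helper name and the numbers are assumptions made for the example, not library code), overlap simply means that each new chunk starts a few characters before the previous one ends:<\/p>\n\n\n\n

```python
def chunk_text_with_overlap(text, chunk_size=100, chunk_overlap=5):
    # each chunk starts chunk_size - chunk_overlap characters after the
    # previous one, so consecutive chunks share chunk_overlap characters
    step = chunk_size - chunk_overlap
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]

chunks = chunk_text_with_overlap('abcdefghij', chunk_size=4, chunk_overlap=2)
# chunks == ['abcd', 'cdef', 'efgh', 'ghij', 'ij']
```

\n\n\n\n<p class=\"wp-block-paragraph\">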
For instance, in the <em>War and Peace<\/em> example, we would get something like the following chunks if we chose a chunk overlap of 5 characters.<\/p>\n\n\n\n<figure class=\"wp-block-image size-large\"><img decoding=\"async\" src=\"https:\/\/contributor.insightmediagroup.io\/wp-content\/uploads\/2025\/07\/Documents-3-1024x369.jpg\" alt=\"\" class=\"wp-image-608089\"\/><figcaption class=\"wp-element-caption\">Image by author<\/figcaption><\/figure>\n\n\n\n<p class=\"wp-block-paragraph\">This is also a very important decision we have to make because:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li class=\"wp-block-list-item\">Larger overlap means more calls and more tokens spent on embedding creation, which makes the pipeline more expensive and slower<\/li>\n\n\n\n<li class=\"wp-block-list-item\">Smaller overlap means a higher chance of losing relevant information at the chunk boundaries<\/li>\n<\/ul>\n\n\n\n<p class=\"wp-block-paragraph\">Choosing the correct chunk overlap largely depends on the type of text we want to process. For example, a recipe book, where the language is simple and straightforward, most probably won&#8217;t require an exotic chunking methodology. On the flip side, a classic literature book like <em>War and Peace<\/em>, where the language is very complex and meaning is interconnected across different paragraphs and sections, will most probably require a more thoughtful approach to chunking in order for the RAG to produce meaningful results.<\/p>\n\n\n\n<p class=\"has-text-align-center wp-block-paragraph\">\u2022 \u2022 \u2022 <\/p>\n\n\n\n<p class=\"wp-block-paragraph\">But what if all we need is a simpler RAG that looks up just a couple of documents, each of which fits the size limitations of whatever embedding model we use in a single chunk? Do we still need the chunking step, or can we just directly make one single embedding for the entire text? 
The short answer is that it is always better to perform the chunking step, even for a knowledge base that does fit the size limits. That is because, when dealing with large documents, we face the problem of getting <a data-type=\"link\" data-id=\"https:\/\/direct.mit.edu\/tacl\/article\/doi\/10.1162\/tacl_a_00638\/119630\/Lost-in-the-Middle-How-Language-Models-Use-Long\" href=\"https:\/\/direct.mit.edu\/tacl\/article\/doi\/10.1162\/tacl_a_00638\/119630\/Lost-in-the-Middle-How-Language-Models-Use-Long\">lost in the middle<\/a> \u2014 the model missing relevant information that sits in the middle of large documents and their correspondingly large embeddings.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\">What are those mysterious &#8216;size limitations&#8217;?<\/h2>\n\n\n\n<p class=\"wp-block-paragraph\">In general, a request to an embedding model can include one or more chunks of text. There are several different kinds of limitations we have to consider relative to the size of the text we need to embed and how it is processed. Each of these limits takes a different value depending on the embedding model we use. More specifically, these are:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li class=\"wp-block-list-item\"><strong>Chunk Size<\/strong>, also known as maximum tokens per input, or context window. This is the maximum size in tokens of each chunk. For instance, for OpenAI&#8217;s <code>text-embedding-3-small<\/code> embedding model, the <a href=\"https:\/\/learn.microsoft.com\/en-us\/azure\/ai-foundry\/openai\/concepts\/models?tabs=global-standard%2Cstandard-chat-completions#embeddings-models\" data-type=\"link\" data-id=\"https:\/\/learn.microsoft.com\/en-us\/azure\/ai-foundry\/openai\/concepts\/models?tabs=global-standard%2Cstandard-chat-completions#embeddings-models\">chunk size limit is 8,191 tokens<\/a>. 
If we provide a chunk that is larger than the chunk size limit, in most cases, it will be silently truncated\u203c\ufe0f (an embedding is going to be created, but only for the first part that meets the chunk size limit), without producing any error.<\/li>\n\n\n\n<li class=\"wp-block-list-item\"><strong>Number of Chunks per Request<\/strong>, also known as number of inputs. There is also a limit on the number of chunks that can be included in each request. For instance, all OpenAI&#8217;s embedding models have a limit of 2,048 inputs \u2014 that is, <a data-type=\"link\" data-id=\"https:\/\/learn.microsoft.com\/en-us\/azure\/ai-foundry\/openai\/how-to\/embeddings?tabs=console\" href=\"https:\/\/learn.microsoft.com\/en-us\/azure\/ai-foundry\/openai\/how-to\/embeddings?tabs=console\">a maximum of 2,048 chunks per request.<\/a><\/li>\n\n\n\n<li class=\"wp-block-list-item\"><strong>Total Tokens per Request:<\/strong> There is also a cap on the total number of tokens across all chunks in a request. For all OpenAI&#8217;s models, <a href=\"https:\/\/community.openai.com\/t\/max-total-embeddings-tokens-per-request\/1254699\" data-type=\"link\" data-id=\"https:\/\/community.openai.com\/t\/max-total-embeddings-tokens-per-request\/1254699\">the maximum total number of tokens across all chunks in a single request is 300,000 tokens.<\/a><\/li>\n<\/ul>\n\n\n\n<p class=\"wp-block-paragraph\">So, what happens if our documents are more than 300,000 tokens? As you may have imagined, the answer is that we make multiple consecutive\/parallel requests of 300,000 tokens or fewer. Many Python libraries do this automatically behind the scenes. 
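<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">The underlying logic can be sketched roughly as follows; this is an assumed illustration rather than any library&#8217;s actual code, and the 4-characters-per-token estimate is only a rule of thumb (a real implementation would count tokens with the model&#8217;s tokenizer):<\/p>\n\n\n\n

```python
def batch_chunks(chunks, max_tokens_per_request=300_000):
    # rough token estimate: ~4 characters per token (approximation only)
    def estimate_tokens(text):
        return max(1, len(text) // 4)

    batches, current, current_tokens = [], [], 0
    for chunk in chunks:
        tokens = estimate_tokens(chunk)
        # start a new batch when adding this chunk would exceed the budget
        if current and current_tokens + tokens > max_tokens_per_request:
            batches.append(current)
            current, current_tokens = [], 0
        current.append(chunk)
        current_tokens += tokens
    if current:
        batches.append(current)
    return batches  # each batch becomes one embeddings request
```

\n\n\n\n<p class=\"wp-block-paragraph\">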
For example, LangChain&#8217;s <code>OpenAIEmbeddings<\/code>, which I use in my previous post, automatically groups the documents we provide into batches of under 300,000 tokens, given that the documents are already provided in chunks.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\">Reading larger files into the RAG pipeline<\/h2>\n\n\n\n<p class=\"wp-block-paragraph\">Let&#8217;s take a look at how all these play out in a simple Python example, using the<em> <a data-type=\"link\" data-id=\"https:\/\/www.gutenberg.org\/ebooks\/2600\" href=\"https:\/\/www.gutenberg.org\/cache\/epub\/2600\/pg2600.txt\">War and Peace<\/a><\/em> text as a document to retrieve in the RAG. The data I&#8217;m using \u2014 Leo Tolstoy&#8217;s <em>War and Peace <\/em>text \u2014 is licensed as Public Domain and can be found in <a data-type=\"link\" data-id=\"https:\/\/www.gutenberg.org\/\" href=\"https:\/\/www.gutenberg.org\/\">Project Gutenberg<\/a>.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">So, first of all, let&#8217;s try to read from the <em>War and Peace<\/em> text without any setup for chunking. For this tutorial, you&#8217;ll need to have installed the <code>langchain<\/code>, <code>openai<\/code>, and <code>faiss<\/code> Python libraries. 
We can install the required packages as follows:<\/p>\n\n\n\n<pre class=\"wp-block-prismatic-blocks\"><code class=\"language-bash\">pip install openai langchain langchain-community langchain-openai faiss-cpu<\/code><\/pre>\n\n\n\n<p class=\"wp-block-paragraph\">After making sure the required libraries are installed, our code for a very simple RAG looks like this and works fine for a small and simple .txt file in the <code>text_folder<\/code>.<\/p>\n\n\n\n<pre class=\"wp-block-prismatic-blocks\"><code class=\"language-python\">import os\n\nfrom langchain_community.document_loaders import TextLoader\nfrom langchain_community.vectorstores import FAISS\nfrom langchain_openai import ChatOpenAI, OpenAIEmbeddings\n\n# OpenAI API key\napi_key = &quot;your key&quot;\n\n# initialize LLM\nllm = ChatOpenAI(openai_api_key=api_key, model=&quot;gpt-4o-mini&quot;, temperature=0.3)\n\n# loading documents to be used for RAG\ntext_folder = &quot;RAG files&quot;\n\ndocuments = []\nfor filename in os.listdir(text_folder):\n    if filename.lower().endswith(&quot;.txt&quot;):\n        file_path = os.path.join(text_folder, filename)\n        loader = TextLoader(file_path)\n        documents.extend(loader.load())\n\n# generate embeddings\nembeddings = OpenAIEmbeddings(openai_api_key=api_key)\n\n# create vector database w FAISS\nvector_store = FAISS.from_documents(documents, embeddings)\nretriever = vector_store.as_retriever()\n\n\ndef main():\n    print(&quot;Welcome to the RAG Assistant. Type &#039;exit&#039; to quit.\\n&quot;)\n\n    while True:\n        user_input = input(&quot;You: &quot;).strip()\n        if user_input.lower() == &quot;exit&quot;:\n            print(&quot;Exiting\u2026&quot;)\n            break\n\n        # get relevant documents\n        relevant_docs = retriever.invoke(user_input)\n        retrieved_context = &quot;\\n\\n&quot;.join([doc.page_content for doc in relevant_docs])\n\n        # system prompt\n        system_prompt = (\n            &quot;You are a helpful assistant. &quot;\n            &quot;Use ONLY the following knowledge base context to answer the user. &quot;\n            &quot;If the answer is not in the context, say you don&#039;t know.\\n\\n&quot;\n            f&quot;Context:\\n{retrieved_context}&quot;\n        )\n\n        # messages for LLM\n        messages = [\n            {&quot;role&quot;: &quot;system&quot;, &quot;content&quot;: system_prompt},\n            {&quot;role&quot;: &quot;user&quot;, &quot;content&quot;: user_input}\n        ]\n\n        # generate response\n        response = llm.invoke(messages)\n        assistant_message = response.content.strip()\n        print(f&quot;\\nAssistant: {assistant_message}\\n&quot;)\n\n\nif __name__ == &quot;__main__&quot;:\n    main()<\/code><\/pre>\n\n\n\n<p class=\"wp-block-paragraph\">But, if I add the <em>War and Peace<\/em> .txt file in the same folder, and try to directly create an embedding for it, I get the following error:<\/p>\n\n\n\n<figure class=\"wp-block-image size-large\"><img decoding=\"async\" src=\"https:\/\/contributor.insightmediagroup.io\/wp-content\/uploads\/2025\/07\/image-29-1024x97.png\" alt=\"\" class=\"wp-image-607481\"\/><figcaption class=\"wp-element-caption\">Image by author<\/figcaption><\/figure>\n\n\n\n<p class=\"wp-block-paragraph\">ughh \ud83d\ude43<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">So what happens here? LangChain&#8217;s <code>OpenAIEmbeddings<\/code> cannot split the text into separate requests of fewer than 300,000 tokens, because we did not provide it in chunks. It treats the whole book as a single chunk of 777,181 tokens, leading to a request that exceeds the 300,000-token maximum per request.<\/p>\n\n\n\n<p class=\"has-text-align-center wp-block-paragraph\">\u2022 \u2022 \u2022<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">Now, let&#8217;s try to set up the chunking process to create multiple embeddings from this large file. To do this, I will be using the <code>text_splitter<\/code> module provided by LangChain, and more specifically, the <code>RecursiveCharacterTextSplitter<\/code>. 
In <code>RecursiveCharacterTextSplitter<\/code>, the chunk size and chunk overlap parameters are specified as a number of characters, but other splitters, like <code>TokenTextSplitter<\/code>, allow setting these parameters as a number of tokens.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">So, we can set up an instance of the text splitter as below:<\/p>\n\n\n\n<pre class=\"wp-block-prismatic-blocks\"><code class=\"language-python\">from langchain.text_splitter import RecursiveCharacterTextSplitter\n\nsplitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=100)<\/code><\/pre>\n\n\n\n<p class=\"wp-block-paragraph\">&#8230; and then use it to split our initial document into chunks&#8230;<\/p>\n\n\n\n<pre class=\"wp-block-prismatic-blocks\"><code class=\"language-python\">from langchain_core.documents import Document\n\nsplit_docs = []\nfor doc in documents:\n    chunks = splitter.split_text(doc.page_content)\n    for chunk in chunks:\n        split_docs.append(Document(page_content=chunk))<\/code><\/pre>\n\n\n\n<p class=\"wp-block-paragraph\">&#8230; and then use those chunks to create the embeddings&#8230;<\/p>\n\n\n\n<pre class=\"wp-block-prismatic-blocks\"><code class=\"language-python\">documents = split_docs\n\n# create embeddings + FAISS index\nembeddings = OpenAIEmbeddings(openai_api_key=api_key)\nvector_store = FAISS.from_documents(documents, embeddings)\nretriever = vector_store.as_retriever()\n\n# ...<\/code><\/pre>\n\n\n\n<p class=\"wp-block-paragraph\">&#8230; and voila \ud83c\udf1f <\/p>\n\n\n\n<p class=\"wp-block-paragraph\">Now our code can effectively parse the provided document, even if it is a bit larger, and provide relevant responses.<\/p>\n\n\n\n<figure class=\"wp-block-image alignwide size-full\"><img decoding=\"async\" src=\"https:\/\/contributor.insightmediagroup.io\/wp-content\/uploads\/2025\/07\/image-31.png\" alt=\"\" class=\"wp-image-607530\"\/><figcaption class=\"wp-element-caption\">Image by author<\/figcaption><\/figure>\n\n\n\n<h2 class=\"wp-block-heading\">On my mind<\/h2>\n\n\n\n<p 
class=\"wp-block-paragraph\">Choosing a chunking approach that fits the size and complexity of the documents we want to feed into our RAG pipeline is crucial for the quality of the responses that we&#8217;ll be receiving. Of course, there are several other parameters and different chunking methodologies one needs to take into account. Nonetheless, understanding and fine-tuning chunk size and overlap is the foundation for building RAG pipelines that produce meaningful results.<\/p>\n\n\n\n<p class=\"has-text-align-center wp-block-paragraph\">\u2022 \u2022 \u2022<\/p>\n\n\n\n<p class=\"wp-block-paragraph\"><em>Loved this post? Got an interesting data or AI project?&nbsp;<\/em><\/p>\n\n\n\n<p class=\"wp-block-paragraph\"><em>Let\u2019s be friends! Join me on<\/em><\/p>\n\n\n\n<p class=\"wp-block-paragraph\"><strong>\ud83d\udcf0<\/strong><a href=\"https:\/\/datacream.substack.com\/\" target=\"_blank\" rel=\"noreferrer noopener\"><strong><em>Substack<\/em><\/strong><\/a><strong><em> <\/em>\ud83d\udcdd<\/strong><a href=\"https:\/\/medium.com\/@m.mouschoutzi\" target=\"_blank\" rel=\"noreferrer noopener\"><strong><em>Medium<\/em><\/strong><\/a><strong><em> <\/em>\ud83d\udcbc<\/strong><a href=\"https:\/\/www.linkedin.com\/in\/mariamouschoutzi\/\" target=\"_blank\" rel=\"noreferrer noopener\"><strong><em>LinkedIn<\/em><\/strong><\/a><strong> \u2615<\/strong><a href=\"http:\/\/buymeacoffee.com\/mmouschoutzi\" target=\"_blank\" rel=\"noreferrer noopener\"><strong><em>Buy me a coffee<\/em><\/strong><\/a><strong><em>!<\/em><\/strong><\/p>\n\n\n\n<p class=\"has-text-align-center wp-block-paragraph\">\u2022 \u2022 \u2022<\/p>\n","protected":false},"excerpt":{"rendered":"<p>Scaling a simple RAG pipeline from simple notes to full books<\/p>\n","protected":false},"author":18,"featured_media":606561,"comment_status":"closed","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"is_member_only":false,"sub_heading":"Scaling a simple RAG pipeline from 
simple notes to full books","footnotes":""},"categories":[21],"tags":[447,657,448,453,467],"sponsor":[],"coauthors":[29761],"class_list":["post-606560","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-large-language-models","tag-artificial-intelligence","tag-chatgpt","tag-data-science","tag-editors-pick","tag-python"],"yoast_head":"<!-- This site is optimized with the Yoast SEO plugin v25.2 - https:\/\/yoast.com\/wordpress\/plugins\/seo\/ -->\n<title>Hitchhiker\u2019s Guide to RAG: From Tiny Files to Tolstoy with OpenAI\u2019s API and LangChain | Towards Data Science<\/title>\n<meta name=\"robots\" content=\"index, follow, max-snippet:-1, max-image-preview:large, max-video-preview:-1\" \/>\n<link rel=\"canonical\" href=\"https:\/\/towardsdatascience.com\/hitchhikers-guide-to-rag-from-tiny-files-to-tolstoy-with-openais-api-and-langchain\/\" \/>\n<meta property=\"og:locale\" content=\"en_US\" \/>\n<meta property=\"og:type\" content=\"article\" \/>\n<meta property=\"og:title\" content=\"Hitchhiker\u2019s Guide to RAG: From Tiny Files to Tolstoy with OpenAI\u2019s API and LangChain | Towards Data Science\" \/>\n<meta property=\"og:description\" content=\"Scaling a simple RAG pipeline from simple notes to full books\" \/>\n<meta property=\"og:url\" content=\"https:\/\/towardsdatascience.com\/hitchhikers-guide-to-rag-from-tiny-files-to-tolstoy-with-openais-api-and-langchain\/\" \/>\n<meta property=\"og:site_name\" content=\"Towards Data Science\" \/>\n<meta property=\"article:published_time\" content=\"2025-07-11T17:56:46+00:00\" \/>\n<meta property=\"article:modified_time\" content=\"2025-07-11T17:57:06+00:00\" \/>\n<meta property=\"og:image\" content=\"https:\/\/towardsdatascience.com\/wp-content\/uploads\/2025\/07\/data-mining-3_hanna-barakat-aixdesign_archival-images-of-ai_3328x2312.png\" \/>\n\t<meta property=\"og:image:width\" content=\"1080\" \/>\n\t<meta property=\"og:image:height\" content=\"750\" \/>\n\t<meta 
property=\"og:image:type\" content=\"image\/png\" \/>\n<meta name=\"author\" content=\"Maria Mouschoutzi\" \/>\n<meta name=\"twitter:card\" content=\"summary_large_image\" \/>\n<meta name=\"twitter:creator\" content=\"@TDataScience\" \/>\n<meta name=\"twitter:site\" content=\"@TDataScience\" \/>\n<meta name=\"twitter:label1\" content=\"Written by\" \/>\n\t<meta name=\"twitter:data1\" content=\"Maria Mouschoutzi\" \/>\n\t<meta name=\"twitter:label2\" content=\"Est. reading time\" \/>\n\t<meta name=\"twitter:data2\" content=\"9 minutes\" \/>\n<script type=\"application\/ld+json\" class=\"yoast-schema-graph\">{\"@context\":\"https:\/\/schema.org\",\"@graph\":[{\"@type\":\"Article\",\"@id\":\"https:\/\/towardsdatascience.com\/hitchhikers-guide-to-rag-from-tiny-files-to-tolstoy-with-openais-api-and-langchain\/#article\",\"isPartOf\":{\"@id\":\"https:\/\/towardsdatascience.com\/hitchhikers-guide-to-rag-from-tiny-files-to-tolstoy-with-openais-api-and-langchain\/\"},\"author\":{\"name\":\"TDS Editors\",\"@id\":\"https:\/\/towardsdatascience.com\/#\/schema\/person\/f9925d336b6fe962b03ad8281d90b8ee\"},\"headline\":\"Hitchhiker\u2019s Guide to RAG: From Tiny Files to Tolstoy with OpenAI\u2019s API and LangChain\",\"datePublished\":\"2025-07-11T17:56:46+00:00\",\"dateModified\":\"2025-07-11T17:57:06+00:00\",\"mainEntityOfPage\":{\"@id\":\"https:\/\/towardsdatascience.com\/hitchhikers-guide-to-rag-from-tiny-files-to-tolstoy-with-openais-api-and-langchain\/\"},\"wordCount\":1637,\"publisher\":{\"@id\":\"https:\/\/towardsdatascience.com\/#organization\"},\"image\":{\"@id\":\"https:\/\/towardsdatascience.com\/hitchhikers-guide-to-rag-from-tiny-files-to-tolstoy-with-openais-api-and-langchain\/#primaryimage\"},\"thumbnailUrl\":\"https:\/\/towardsdatascience.com\/wp-content\/uploads\/2025\/07\/data-mining-3_hanna-barakat-aixdesign_archival-images-of-ai_3328x2312.png\",\"keywords\":[\"Artificial Intelligence\",\"ChatGPT\",\"Data Science\",\"Editors 
Pick\",\"Python\"],\"articleSection\":[\"Large Language Models\"],\"inLanguage\":\"en-US\"},{\"@type\":\"WebPage\",\"@id\":\"https:\/\/towardsdatascience.com\/hitchhikers-guide-to-rag-from-tiny-files-to-tolstoy-with-openais-api-and-langchain\/\",\"url\":\"https:\/\/towardsdatascience.com\/hitchhikers-guide-to-rag-from-tiny-files-to-tolstoy-with-openais-api-and-langchain\/\",\"name\":\"Hitchhiker\u2019s Guide to RAG: From Tiny Files to Tolstoy with OpenAI\u2019s API and LangChain | Towards Data Science\",\"isPartOf\":{\"@id\":\"https:\/\/towardsdatascience.com\/#website\"},\"primaryImageOfPage\":{\"@id\":\"https:\/\/towardsdatascience.com\/hitchhikers-guide-to-rag-from-tiny-files-to-tolstoy-with-openais-api-and-langchain\/#primaryimage\"},\"image\":{\"@id\":\"https:\/\/towardsdatascience.com\/hitchhikers-guide-to-rag-from-tiny-files-to-tolstoy-with-openais-api-and-langchain\/#primaryimage\"},\"thumbnailUrl\":\"https:\/\/towardsdatascience.com\/wp-content\/uploads\/2025\/07\/data-mining-3_hanna-barakat-aixdesign_archival-images-of-ai_3328x2312.png\",\"datePublished\":\"2025-07-11T17:56:46+00:00\",\"dateModified\":\"2025-07-11T17:57:06+00:00\",\"breadcrumb\":{\"@id\":\"https:\/\/towardsdatascience.com\/hitchhikers-guide-to-rag-from-tiny-files-to-tolstoy-with-openais-api-and-langchain\/#breadcrumb\"},\"inLanguage\":\"en-US\",\"potentialAction\":[{\"@type\":\"ReadAction\",\"target\":[\"https:\/\/towardsdatascience.com\/hitchhikers-guide-to-rag-from-tiny-files-to-tolstoy-with-openais-api-and-langchain\/\"]}]},{\"@type\":\"ImageObject\",\"inLanguage\":\"en-US\",\"@id\":\"https:\/\/towardsdatascience.com\/hitchhikers-guide-to-rag-from-tiny-files-to-tolstoy-with-openais-api-and-langchain\/#primaryimage\",\"url\":\"https:\/\/towardsdatascience.com\/wp-content\/uploads\/2025\/07\/data-mining-3_hanna-barakat-aixdesign_archival-images-of-ai_3328x2312.png\",\"contentUrl\":\"https:\/\/towardsdatascience.com\/wp-content\/uploads\/2025\/07\/data-mining-3_hanna-barakat-aixdesign_arch
ival-images-of-ai_3328x2312.png\",\"width\":1080,\"height\":750,\"caption\":\"Hanna Barakat &amp; Archival Images of AI + AIxDESIGN \/ https:\/\/betterimagesofai.org \/ https:\/\/creativecommons.org\/licenses\/by\/4.0\/\"},{\"@type\":\"BreadcrumbList\",\"@id\":\"https:\/\/towardsdatascience.com\/hitchhikers-guide-to-rag-from-tiny-files-to-tolstoy-with-openais-api-and-langchain\/#breadcrumb\",\"itemListElement\":[{\"@type\":\"ListItem\",\"position\":1,\"name\":\"Home\",\"item\":\"https:\/\/towardsdatascience.com\/\"},{\"@type\":\"ListItem\",\"position\":2,\"name\":\"Hitchhiker\u2019s Guide to RAG: From Tiny Files to Tolstoy with OpenAI\u2019s API and LangChain\"}]},{\"@type\":\"WebSite\",\"@id\":\"https:\/\/towardsdatascience.com\/#website\",\"url\":\"https:\/\/towardsdatascience.com\/\",\"name\":\"Towards Data Science\",\"description\":\"Publish AI, ML &amp; data-science insights to a global community of data professionals.\",\"publisher\":{\"@id\":\"https:\/\/towardsdatascience.com\/#organization\"},\"alternateName\":\"TDS\",\"potentialAction\":[{\"@type\":\"SearchAction\",\"target\":{\"@type\":\"EntryPoint\",\"urlTemplate\":\"https:\/\/towardsdatascience.com\/?s={search_term_string}\"},\"query-input\":{\"@type\":\"PropertyValueSpecification\",\"valueRequired\":true,\"valueName\":\"search_term_string\"}}],\"inLanguage\":\"en-US\"},{\"@type\":\"Organization\",\"@id\":\"https:\/\/towardsdatascience.com\/#organization\",\"name\":\"Towards Data Science\",\"alternateName\":\"TDS\",\"url\":\"https:\/\/towardsdatascience.com\/\",\"logo\":{\"@type\":\"ImageObject\",\"inLanguage\":\"en-US\",\"@id\":\"https:\/\/towardsdatascience.com\/#\/schema\/logo\/image\/\",\"url\":\"https:\/\/towardsdatascience.com\/wp-content\/uploads\/2025\/02\/tds-logo.jpg\",\"contentUrl\":\"https:\/\/towardsdatascience.com\/wp-content\/uploads\/2025\/02\/tds-logo.jpg\",\"width\":696,\"height\":696,\"caption\":\"Towards Data 
Science\"},\"image\":{\"@id\":\"https:\/\/towardsdatascience.com\/#\/schema\/logo\/image\/\"},\"sameAs\":[\"https:\/\/x.com\/TDataScience\",\"https:\/\/www.youtube.com\/c\/TowardsDataScience\",\"https:\/\/www.linkedin.com\/company\/towards-data-science\/\"]},{\"@type\":\"Person\",\"@id\":\"https:\/\/towardsdatascience.com\/#\/schema\/person\/f9925d336b6fe962b03ad8281d90b8ee\",\"name\":\"TDS Editors\",\"image\":{\"@type\":\"ImageObject\",\"inLanguage\":\"en-US\",\"@id\":\"https:\/\/towardsdatascience.com\/#\/schema\/person\/image\/23494c9101089ad44ae88ce9d2f56aac\",\"url\":\"https:\/\/secure.gravatar.com\/avatar\/?s=96&d=mm&r=g\",\"contentUrl\":\"https:\/\/secure.gravatar.com\/avatar\/?s=96&d=mm&r=g\",\"caption\":\"TDS Editors\"},\"description\":\"Building a vibrant data science and machine learning community. Share your insights and projects with our global audience: bit.ly\/write-for-tds\",\"url\":\"https:\/\/towardsdatascience.com\/author\/towardsdatascience\/\"}]}<\/script>\n<!-- \/ Yoast SEO plugin. 
-->"}
Share your insights and projects with our global audience: bit.ly\/write-for-tds","url":"https:\/\/towardsdatascience.com\/author\/towardsdatascience\/"}]}},"distributor_meta":false,"distributor_terms":false,"distributor_media":false,"distributor_original_site_name":"TDS Contributor Portal","distributor_original_site_url":"https:\/\/contributor.insightmediagroup.io","push-errors":false,"_links":{"self":[{"href":"https:\/\/towardsdatascience.com\/wp-json\/wp\/v2\/posts\/606560","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/towardsdatascience.com\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/towardsdatascience.com\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/towardsdatascience.com\/wp-json\/wp\/v2\/users\/18"}],"replies":[{"embeddable":true,"href":"https:\/\/towardsdatascience.com\/wp-json\/wp\/v2\/comments?post=606560"}],"version-history":[{"count":0,"href":"https:\/\/towardsdatascience.com\/wp-json\/wp\/v2\/posts\/606560\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/towardsdatascience.com\/wp-json\/wp\/v2\/media\/606561"}],"wp:attachment":[{"href":"https:\/\/towardsdatascience.com\/wp-json\/wp\/v2\/media?parent=606560"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/towardsdatascience.com\/wp-json\/wp\/v2\/categories?post=606560"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/towardsdatascience.com\/wp-json\/wp\/v2\/tags?post=606560"},{"taxonomy":"sponsor","embeddable":true,"href":"https:\/\/towardsdatascience.com\/wp-json\/wp\/v2\/sponsor?post=606560"},{"taxonomy":"author","embeddable":true,"href":"https:\/\/towardsdatascience.com\/wp-json\/wp\/v2\/coauthors?post=606560"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}