Retrieval augmented generation (RAG) is a powerful technique that combines large language models (LLMs) with vector databases to produce more accurate responses to user queries. RAG lets an LLM draw on large knowledge bases when responding, improving the quality of its answers. However, RAG also has some downsides. One is that it relies on vector similarity when retrieving context for a user query, and vector similarity search is not always reliable; it can, for example, struggle with rare or domain-specific keywords in the query. Another is that the text is divided into small chunks, which prevents the LLM from using the full context of a document when responding. Anthropic’s article on contextual retrieval attempts to solve both problems by adding BM25 indexing and by adding contexts to chunks.

Motivation
My motivation for this article is twofold. First, I would like to test out the newest models and techniques within machine learning. Keeping up to date with the latest trends within machine learning is critical for any ML engineer and data scientist to most effectively solve the real-world problems they encounter daily. Secondly, I like to work on RAG systems, and contextual retrieval is an interesting idea that can improve the performance of my RAG system. I plan to implement contextual retrieval on one of the problems I am currently working on (explained in the following section) to see how contextual retrieval can improve the ability of an LLM to respond to user queries. Anthropic is also a leading AI company, so I find their articles particularly interesting. The main inspiration for the contents of this article is Anthropic’s article on contextual retrieval.
If you want to learn more about basic RAG, you can read my article on the topic, linked below:
How to Make a RAG System to Gain Powerful Access to Your Data
Table of Contents
· Motivation · The problem I will apply contextual retrieval to · Prerequisites · Creating chunks with context ∘ Utility ∘ Creating chunks ∘ Upload to Pinecone · BM25 indexing · Combining BM25 and vector-based chunk retrieval · Conclusion
The problem I will apply contextual retrieval to
I am currently working on an application that allows users to search through previous court decisions and rulings, for example decisions made by the Supreme Court. The current situation in Norway is that these decisions are not easily accessible to the public. You essentially have two paid options: Lovdata.no (which costs a single user 12500 NOK/1169 USD per year!) or Rettsdata.no, which does not list its price publicly on its website. One free option is Rettspraksis.no, essentially a wiki for court rulings in Norway. Unfortunately, it only offers direct text search, and with tens of thousands of rulings, that makes finding the relevant literature difficult. I therefore aim to create a cheaper alternative that lets users easily search for and discover relevant court rulings in Norway. I have previously acquired the text for these rulings, and in this article, I aim to make their contents easily available using a RAG system with contextual retrieval.
Prerequisites
- Basic Python knowledge
- Access to an OpenAI API key (you will be using GPT-4o-mini, which is a relatively cheap alternative)
- A Pinecone account (you can use their free tier)

Creating chunks with context
I have 449 text files stored in a folder called extracted_info_pages_filtered, which are the texts I will make available using contextual retrieval. You can find the texts I am using in this Google Drive folder. I have many more files available, but I am using a subset of them for this article. The data is extracted from Norges Høyesterett and is exempted from copyright law (meaning it may be used commercially) in Norway due to being documents of public interest.
Utility
I will start with some important utility files. In general, using files like this makes your code more modular and easier to work with, so I highly recommend creating similar files when writing your own code.
I create a constants.py file to store all my global constants:
CHUNK_SIZE = 500
CHUNK_OVERLAP = 100
TEXT_EMBEDDING_MODEL = "text-embedding-3-small"
GPT_MODEL = "gpt-4o-mini"
INPUT_TOKEN_PRICE = 0.15/1e6
OUTPUT_TOKEN_PRICE = 0.6/1e6
I then made two utility files, one for working with OpenAI called openai_utility.py:
import tiktoken
import streamlit as st
from openai import OpenAI
from constants import INPUT_TOKEN_PRICE, OUTPUT_TOKEN_PRICE
from constants import TEXT_EMBEDDING_MODEL, GPT_MODEL

OPENAI_API_KEY = st.secrets["OPENAI_API_KEY"]
openai_client = OpenAI(api_key=OPENAI_API_KEY)

def count_tokens(text):
    encoding = tiktoken.encoding_for_model(TEXT_EMBEDDING_MODEL)
    tokens = encoding.encode(text)
    return len(tokens)

def get_embedding(text, model=TEXT_EMBEDDING_MODEL):
    return openai_client.embeddings.create(input=[text], model=model).data[0].embedding

def prompt_gpt(prompt):
    messages = [{"role": "user", "content": prompt}]
    response = openai_client.chat.completions.create(
        model=GPT_MODEL,
        messages=messages,
        temperature=0,
    )
    content = response.choices[0].message.content
    prompt_tokens = response.usage.prompt_tokens  # input tokens (in)
    completion_tokens = response.usage.completion_tokens  # output tokens (out)
    price = calculate_prompt_cost(prompt_tokens, completion_tokens)
    return content, price

def calculate_prompt_cost(input_tokens, output_tokens):
    return INPUT_TOKEN_PRICE * input_tokens + OUTPUT_TOKEN_PRICE * output_tokens
This is simply boilerplate code for sending GPT prompts, creating embeddings, and so on. In general, I recommend having a file like this so you can work easily with the OpenAI API. Remember to add your own OPENAI_API_KEY to your Streamlit secrets.
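As a quick sanity check, these helpers could be used like this (the prompt string below is just a made-up example):
from utility.openai_utility import prompt_gpt, get_embedding, count_tokens

# hypothetical example prompt, just to check that the helpers work
content, price = prompt_gpt("Summarize the role of the Norwegian Supreme Court in one sentence.")
print(content)
print(f"Cost of this prompt: {price:.6f} USD")

embedding = get_embedding("fyllekjøring")
print(len(embedding))  # 1536 dimensions for text-embedding-3-small
print(count_tokens("fyllekjøring"))  # number of tokens in the string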
I also created a utility file to work with the vector database called vector_database_utility.py:
from constants import CHUNK_OVERLAP, CHUNK_SIZE
import os
import tiktoken
import json

tokenizer = tiktoken.get_encoding("cl100k_base")

def split_text_into_chunks_with_overlap(text):
    tokens = tokenizer.encode(text)  # tokenize the input text
    chunks = []
    # loop through the tokens, creating chunks that overlap by CHUNK_OVERLAP tokens
    for i in range(0, len(tokens), CHUNK_SIZE - CHUNK_OVERLAP):
        chunk_tokens = tokens[i:i + CHUNK_SIZE]
        chunk_text = tokenizer.decode(chunk_tokens)
        chunks.append(chunk_text)
    return chunks

def load_all_chunks_from_folder(folder_path):
    chunks = []
    for chunk_filename in os.listdir(folder_path):
        with open(os.path.join(folder_path, chunk_filename), "r") as f:
            data = json.load(f)
        chunks.append(data)
    return chunks
Creating chunks
From here, I work in a notebook to create chunks and upload them to Pinecone. First, import the required packages:
import os
from jinja2 import Environment, FileSystemLoader
from tqdm.auto import tqdm
from utility.vector_database_utility import split_text_into_chunks_with_overlap
from utility.openai_utility import prompt_gpt, get_embedding
import json
import streamlit as st
Note that I am using Streamlit to store my secrets, as I will likely publish this as a Streamlit application later.
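Streamlit reads secrets from a file called .streamlit/secrets.toml in the project root. A minimal version for this project could look like the following (the keys match the ones referenced in the code; the values are placeholders you replace with your own keys):
# .streamlit/secrets.toml
OPENAI_API_KEY = "sk-..."
PINECONE_API_KEY = "your-pinecone-api-key"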
I then define the paths and constants that control where I read the documents from and where I save the chunks.
DOCUMENT_FILEPATHS = r"Datahoyesteretts_dommerfilteredextracted_info_pages_filtered"
CHUNKS_SAVE_PATH = r"Datahoyesteretts_dommerfilteredchunks"
DOCUMENT_TYPE = "hoyesterettsdommer"
- DOCUMENT_FILEPATHS should be the path to the documents you downloaded from Google Drive (or your own documents).
- CHUNKS_SAVE_PATH is the path where you want to store your chunks.
- DOCUMENT_TYPE is used for filtering in my database and buckets.
I then set up a Jinja2 environment. Jinja2 is a templating package that I tested for managing prompts while working on this project. The main advantage of Jinja is that it lets you store prompts in separate files and add logic to them. For example, you can use if statements in your prompt to include or exclude certain parts, which can be very useful for more complicated prompt engineering (see the small sketch below).
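As a purely hypothetical illustration of such conditional logic (this is not the template used in this article), a Jinja2 prompt template could look like this, where include_examples, EXAMPLES, and QUESTION are variables you would pass in when rendering:
You are a legal assistant answering questions about Norwegian court rulings.
{% if include_examples %}
Here are some example answers to imitate:
{{ EXAMPLES }}
{% endif %}
Question: {{ QUESTION }}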
# Set up Jinja2 environment
file_loader = FileSystemLoader('./templates')
env = Environment(loader=file_loader)

def get_add_context_prompt(chunk_text, document_text):
    template = env.get_template('create_context_prompt.j2')
    data = {
        'WHOLE_DOCUMENT': document_text,  # the full text of the court ruling
        'CHUNK_CONTENT': chunk_text       # the chunk we want to situate within the document
    }
    output = template.render(data)
    return output
My Jinja2 template (the file where I store the prompt) is stored in a folder called templates and saved as a file called create_context_prompt.j2. The file looks like this:
<document>
{{WHOLE_DOCUMENT}}
</document>
Here is the chunk we want to situate within the whole document
<chunk>
{{CHUNK_CONTENT}}
</chunk>
Please give a short succinct context to situate this chunk within the overall document for the purposes of improving search retrieval of the chunk. Answer only with the succinct context and nothing else.
In this case, I am using two variables in my prompt (WHOLE_DOCUMENT and CHUNK_CONTENT), which will be replaced for each prompt I create to send to GPT. The function get_add_context_prompt, defined above, replaces these variables.
I then create the folder to save the chunks to with:
os.makedirs(CHUNKS_SAVE_PATH, exist_ok=True)
And create and store chunks with the following code:
document_filenames = os.listdir(DOCUMENT_FILEPATHS)

tot_price = 0  # keep track of how much the context generation costs
for filename in tqdm(document_filenames):
    with open(f"{DOCUMENT_FILEPATHS}/{filename}", "r", encoding="utf-8") as f:
        document_text = f.read()

    # now split text into chunks
    print("Current tot price: ", tot_price)
    chunks = split_text_into_chunks_with_overlap(document_text)
    for idx, chunk in enumerate(chunks):
        # skip the chunk if it has already been created and stored
        chunk_save_filename = f"{filename.split('.')[0]}_{idx}.json"
        chunk_save_path = f"{CHUNKS_SAVE_PATH}/{chunk_save_filename}"
        if os.path.exists(chunk_save_path):
            continue

        # ask GPT for a short context and prepend it to the chunk text
        prompt = get_add_context_prompt(chunk, document_text)
        context, price = prompt_gpt(prompt)
        tot_price += price

        chunk_info = {
            "id": f"{filename}_{int(idx)}",
            "chunk_text": context + "\n\n" + chunk,
            "chunk_idx": idx,
            "filename": filename,
            "document_type": DOCUMENT_TYPE
        }
        with open(chunk_save_path, "w", encoding="utf-8") as f:
            json.dump(chunk_info, f, indent=4)
The code reads each text file that you downloaded from Google Drive and splits the text into chunks, where the chunk size and overlap are taken from the constants.py file. For each chunk, I check if it already exists (if so, I go to the next chunk) and otherwise prompt GPT-4o-mini to generate a context for the chunk. I then store the chunk with the added context, along with the chunk index and filename as metadata.
Next, I take the chunks we created (with context), create embeddings for the chunk text, and store them in the same files with the following:
# go through each chunk, create an embedding for it, and save it to the same file
for chunk_filename in os.listdir(CHUNKS_SAVE_PATH):
    # load chunk
    with open(f"{CHUNKS_SAVE_PATH}/{chunk_filename}", "r", encoding="utf-8") as f:
        chunk_info = json.load(f)

    chunk_text = chunk_info["chunk_text"]
    chunk_text_embedding = get_embedding(chunk_text)
    chunk_info["chunk_embedding"] = chunk_text_embedding

    # save chunk
    with open(f"{CHUNKS_SAVE_PATH}/{chunk_filename}", "w", encoding="utf-8") as f:
        json.dump(chunk_info, f, indent=4)
Upload to Pinecone
Finally, I upload the chunks to Pinecone, a vector database from which you can retrieve the most relevant chunks for a given prompt.
I first import Pinecone and my API key. If you do not have an API key for Pinecone, you can go to the Pinecone website and create a free account. The free tier at Pinecone is very good, so I recommend using it.
from pinecone import Pinecone
PINECONE_API_KEY = st.secrets["PINECONE_API_KEY"]
You can then create an index on the Pinecone website. I called my index lov-avgjorelser (roughly translated to "law rulings" in English).
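If you prefer to create the index from code instead of in the web console, a minimal sketch with the Pinecone SDK could look like the following. I am assuming a serverless index on AWS in us-east-1 here; the dimension must be 1536 to match text-embedding-3-small, and cosine similarity is a reasonable metric for OpenAI embeddings:
from pinecone import Pinecone, ServerlessSpec

pinecone = Pinecone(api_key=PINECONE_API_KEY)

# run this once; the dimension must match the embedding model
pinecone.create_index(
    name="lov-avgjorelser",
    dimension=1536,  # embedding size of text-embedding-3-small
    metric="cosine",
    spec=ServerlessSpec(cloud="aws", region="us-east-1"),
)
Once the index exists, you can connect to it as shown below.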
pinecone = Pinecone(api_key=PINECONE_API_KEY)
index_name = "lov-avgjorelser"
index = pinecone.Index(index_name)
Next, you can upload the chunks to Pinecone. Usually, you should not store the full chunk text directly in Pinecone. Rather, I recommend storing the text for each chunk separately (for example, in an S3 bucket on AWS); then, when you retrieve chunks from Pinecone, you can use the chunk identifier to look up the chunk text in the S3 bucket. For simplicity, however, I will store the text directly in the chunk metadata here. Note that the maximum metadata size in Pinecone is 40 KB, so the amount of text you can store directly in Pinecone is limited.
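For reference, a minimal sketch of the S3 approach with boto3 could look like this. The bucket name is hypothetical, and retrieved_id stands for the chunk id returned by a Pinecone query; with this setup you would only store the chunk id (not the text) in the Pinecone metadata:
import boto3

s3 = boto3.client("s3")
BUCKET_NAME = "my-chunk-texts"  # hypothetical bucket name

# at indexing time: store the chunk text under its id
s3.put_object(Bucket=BUCKET_NAME, Key=f"{chunk_info['id']}.txt", Body=chunk_info["chunk_text"])

# at query time: fetch the text for a retrieved chunk id
chunk_text = s3.get_object(Bucket=BUCKET_NAME, Key=f"{retrieved_id}.txt")["Body"].read().decode("utf-8")
With that said, here is the upload loop that stores the text directly in the metadata: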
# pinecone expects a list of objects like: [{"id": id, "values": embedding, "metadata": metadata}]
# upload to pinecone
for chunk_filename in os.listdir(CHUNKS_SAVE_PATH):
    # load chunk
    with open(f"{CHUNKS_SAVE_PATH}/{chunk_filename}", "r", encoding="utf-8") as f:
        chunk_info = json.load(f)

    chunk_id = chunk_info["id"]
    chunk_idx = chunk_info["chunk_idx"]
    chunk_text = chunk_info["chunk_text"]
    chunk_text_embedding = chunk_info["chunk_embedding"]
    document_type = chunk_info["document_type"]

    metadata = {
        "filename": chunk_info["filename"],
        "chunk_idx": chunk_idx,
        "chunk_text": chunk_text,
        "document_type": document_type
    }
    data_with_metadata = [{
        "id": chunk_id,  # the unique chunk id, so chunks from the same document do not overwrite each other
        "values": chunk_text_embedding,
        "metadata": metadata
    }]
    index.upsert(vectors=data_with_metadata)
Congrats, you have now made chunks with context accessible via Pinecone!
BM25 indexing
BM25 is a classic keyword-based ranking technique that scores how relevant each document in a corpus is to a given query. You can read more about BM25 here. In Anthropic’s article on contextual retrieval, BM25 is implemented as a separate chunk retrieval algorithm, and the chunks retrieved by BM25 are then combined with the chunks from the vector similarity search. This means we can implement BM25 as a completely separate retrieval step over the chunks we made.
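For reference, the standard BM25 score of a document D for a query Q with terms q_1, …, q_n is:

\mathrm{score}(D, Q) = \sum_{i=1}^{n} \mathrm{IDF}(q_i) \cdot \frac{f(q_i, D)\,(k_1 + 1)}{f(q_i, D) + k_1 \left(1 - b + b \cdot \frac{|D|}{\mathrm{avgdl}}\right)}

Here f(q_i, D) is how often the term q_i occurs in D, |D| is the document length, avgdl is the average document length in the corpus, and k_1 and b are free parameters (the BM25Okapi implementation used below defaults to k_1 = 1.5 and b = 0.75).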
You can install the BM25 package with:
pip install rank_bm25
And use the following code to query using BM25:
from rank_bm25 import BM25Okapi
from nltk.tokenize import word_tokenize
import nltk
from utility.vector_database_utility import load_all_chunks_from_folder

# Download the NLTK tokenizer if you haven't
nltk.download('punkt')
nltk.download('punkt_tab')

CHUNK_PATH = r"Data\hoyesteretts_dommer\filtered\chunks"
chunks = load_all_chunks_from_folder(CHUNK_PATH)
corpus = [chunk["chunk_text"] for chunk in chunks]

# Tokenize each document in the corpus
tokenized_corpus = [word_tokenize(doc.lower()) for doc in corpus]  # should store this somewhere for easy retrieval
bm25 = BM25Okapi(tokenized_corpus)

def retrieve_with_bm25(query: str, corpus: list[str], top_k: int = 2) -> list[str]:
    tokenized_query = word_tokenize(query.lower())
    top_docs = bm25.get_top_n(tokenized_query, corpus, n=top_k)
    return top_docs

if __name__ == "__main__":
    query = "fyllekjøring"
    response = retrieve_with_bm25(query, corpus, top_k=10)
This code re-creates the tokenized corpus every time the file is run; a better solution is to store the tokenized corpus somewhere for easy loading, as sketched below. The output of running this file is the text of several chunks, ordered from most relevant to least relevant, with only the top_k results returned.
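A minimal sketch of that caching idea using pickle (the file name is just an example) could look like this:
import pickle

# store the tokenized corpus once after building it
with open("tokenized_corpus.pkl", "wb") as f:
    pickle.dump(tokenized_corpus, f)

# later runs can load it instead of re-tokenizing every chunk
with open("tokenized_corpus.pkl", "rb") as f:
    tokenized_corpus = pickle.load(f)
bm25 = BM25Okapi(tokenized_corpus)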
I would also like to add a note on languages here. Modern text embeddings are usually multilingual, meaning a user can write a query in one language and still find relevant chunks, even though the chunks are written in another language. This is not the case for BM25, however, since BM25 relies on exact word matches. This should not be a big issue here, considering half of the retrieved chunks will come from BM25 and half from the vector database, so cross-lingual queries are still covered by the vector-based half.
Combining BM25 and vector-based chunk retrieval
You can use the previously written code to perform a full RAG search using contextual retrieval. I made a new class called RagAgent, which I use to run the retrieval and generate the final answer.
The class is defined as follows:
import streamlit as st
from pinecone import Pinecone
from typing import Optional
from utility.openai_utility import prompt_gpt, get_embedding
# NOTE this is the file where you made the bm25 index
from bm25 import retrieve_with_bm25, corpus

PINECONE_API_KEY = st.secrets["PINECONE_API_KEY"]
pinecone = Pinecone(api_key=PINECONE_API_KEY)

class RagAgent:
    def __init__(self, index_name):
        # load pinecone index
        self.index = pinecone.Index(index_name)

    def query_pinecone(self, query, top_k=2, include_metadata: bool = True):
        query_embedding = get_embedding(query)
        query_response = self._query_pinecone_index(query_embedding, top_k=top_k, include_metadata=include_metadata)
        return self._extract_info(query_response)

    def _query_pinecone_index(self,
        query_embedding: list, top_k: int = 2, include_metadata: bool = True
    ) -> dict[str, any]:
        query_response = self.index.query(
            vector=query_embedding, top_k=top_k, include_metadata=include_metadata,
        )
        return query_response

    def _extract_info(self, response) -> Optional[list[dict]]:
        """extract data from pinecone query response. Returns list of dicts with id, text and chunk idx"""
        if response is None: return None
        res_list = []
        for resp in response["matches"]:
            res_list.append(
                {
                    "id": resp["id"],
                    "chunk_text": resp["metadata"]["chunk_text"],
                    "chunk_idx": resp["metadata"]["chunk_idx"],
                })
        return res_list

    def _combine_chunks(self, chunks_bm25, chunks_vector_db, top_k=20):
        """given output from bm25 and vector database, combine them to only include unique chunks"""
        retrieved_chunks = []
        seen_texts = set()
        # assume both lists are ordered from most relevant chunk to least relevant
        for chunk1, chunk2 in zip(chunks_bm25, chunks_vector_db):
            for chunk in (chunk1, chunk2):
                if chunk["chunk_text"] not in seen_texts:
                    seen_texts.add(chunk["chunk_text"])
                    retrieved_chunks.append(chunk)
                if len(retrieved_chunks) >= top_k:
                    return retrieved_chunks
        return retrieved_chunks

    def run_bm25_rag(self, query, top_k=2):
        # BM25 returns raw chunk texts, Pinecone returns dicts, so wrap the BM25
        # results in the same structure before combining them
        chunks_bm25 = [{"chunk_text": text} for text in retrieve_with_bm25(query, corpus, top_k=top_k)]
        chunks_vector_db = self.query_pinecone(query, top_k=top_k)
        combined_chunks = self._combine_chunks(chunks_bm25, chunks_vector_db)
        context = "\n".join([chunk["chunk_text"] for chunk in combined_chunks])
        full_prompt = f"Given the following documents {context} what is the answer to the question: {query}"
        response, _ = prompt_gpt(full_prompt)
        return response
Explanation of the class:
- I first include the relevant imports and set up the Pinecone client
- The query_pinecone and _query_pinecone_index functions are basic Pinecone query functions you can read more about in the Pinecone docs
- The _extract_info function extracts the information from each retrieved chunk into an expected format.
- The _combine_chunks function combines the chunks retrieved by BM25 and by the vector database, interleaving the most relevant chunks from each list while ensuring no duplicate chunks are included.
- The run_bm25_rag function runs the contextual RAG with BM25 and the context-augmented chunks, utilizing the functions mentioned throughout the article.
You can then run the contextual RAG using the following code. Make sure to update the index_name to your Pinecone index name.
from rag_agent import RagAgent
index_name = "lov-avgjorelser"
rag_agent = RagAgent(index_name)
query = "Hva er straffen for fyllekjøring?"
rag_agent.run_bm25_rag(query, top_k=20)
And GPT gave the following response (the query asks "What is the penalty for drunk driving?"):
'Straffen for fyllekjøring varierer avhengig av alvorlighetsgraden av overtredelsen og om det foreligger gjentakelse. Generelt kan straffen for fyllekjøring i Norge være bøter, fengsel, eller tap av førerrett. \n\nFor første gangs overtredelse kan straffen være bøter og/eller fengsel i inntil 6 måneder. Ved gjentakelse eller alvorlige tilfeller, som høy promille eller ulykker, kan straffen bli betydelig strengere, med fengsel i flere år. I tillegg kan det ilegges tap av retten til å føre motorvogn for en periode, avhengig av alvorlighetsgraden av overtredelsen. \n\nDet er viktig å merke seg at straffene kan variere basert på omstendighetene rundt den spesifikke saken, inkludert om det er registrert tidligere overtredelser.'
In short, the response says that the penalty for drunk driving in Norway depends on the severity of the offence and on repeat offences, ranging from fines and/or up to six months in prison for a first offence to considerably stricter punishments, including several years in prison and loss of the driving licence, for repeated or serious cases.
I have not performed a quantitative evaluation of the contextual retrieval vs standard RAG, as it is immensely time-consuming to do, and it’s difficult to objectively evaluate the performance of language models. After using the contextual retrieval model for a while, however, I can say that I have noticed an improvement in the model responses from a qualitative perspective. First of all, the contextual RAG is better able to retrieve relevant documents when niche keywords are part of the user query. This is a result of the BM25 indexing. Additionally, I noticed that the LLM is better able to utilize the information in the contexts. For example, in standard RAG, the model will sometimes use the information in the chunk out of context (which makes sense, as the model doesn’t have a context for the chunk). This can lead to hallucinations in the model responses. With contextual RAG, however, the issue of hallucinations is significantly reduced as the model is given the context of each chunk.
Conclusion
In this article, I have discussed how you can implement contextual retrieval, the idea proposed by Anthropic, in your RAG system. First, I discussed the problem I am working on, which is making court rulings in Norway more accessible. I then took the text from different court rulings, divided it into chunks, and added context to each chunk using GPT-4o-mini. I then created embeddings for each chunk and stored them in Pinecone. Furthermore, I created a BM25 index, which is used together with vector similarity to retrieve the most relevant chunks. These chunks are then passed to GPT to answer a user’s prompt, using the context from the chunks. From qualitative testing, contextual retrieval is an improvement over standard RAG: it retrieves relevant documents more reliably and responds more accurately to user queries.