Retrieval Augmented Generation (RAG) is a powerful technique that enables AI agents to access and leverage external knowledge sources beyond their training data. In this tutorial, we'll build a RAG agent that can answer questions about the JFK assassination files using OpenAI's Agents SDK and Pinecone vector database.
RAG is particularly useful when:
- You need up-to-date information beyond the model's training cutoff
- You have domain-specific documents or proprietary data
- You want to reduce hallucinations by grounding responses in factual sources
- You need to cite sources for transparency and verification
By the end of this tutorial, you'll have built an agent that can search through historical documents and provide accurate, sourced answers about the JFK files.
Prerequisites
Before we begin, let's install the required packages:
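Something like the following should cover everything used below - the `openai-agents`, `pinecone`, `datasets`, and `semantic-chunkers` packages (versions are left unpinned here):

```python
!pip install -qU \
    openai-agents \
    pinecone \
    datasets \
    semantic-chunkers
```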
We also need API keys for OpenAI and Pinecone. You can get:
- An OpenAI API key from the OpenAI Platform
- A Pinecone API key from the Pinecone Console
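With the keys in hand, one way to make the OpenAI key available to the Agents SDK is via an environment variable - a minimal sketch using `getpass` so the key isn't hard-coded:

```python
import os
from getpass import getpass

# The Agents SDK reads the OpenAI key from the OPENAI_API_KEY environment variable
os.environ["OPENAI_API_KEY"] = os.getenv("OPENAI_API_KEY") or getpass("OpenAI API key: ")
```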
Testing LLM Knowledge Limitations
Before implementing RAG, let's first demonstrate why it's needed. We'll create a basic agent and test its knowledge about specific topics to show the limitations of relying solely on the model's training data.
We'll ask our agent "where was Oswald in October 1959?":
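Here is a minimal sketch of such an agent using the Agents SDK - the agent name, instructions, and `gpt-4.1-mini` model choice are our own assumptions, and we use `await` on the assumption that we're running inside a notebook:

```python
from agents import Agent, Runner

agent = Agent(
    name="JFK Files Agent",
    instructions="You are a helpful assistant answering questions about the JFK files.",
    model="gpt-4.1-mini",
)

result = await Runner.run(agent, "Where was Oswald in October 1959?")
print(result.final_output)
```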
According to the JFK files, Oswald was also in Helsinki, Finland in October 1959 - a detail our agent missed. We can try to tease out this information:
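For example, with a more pointed follow-up (the wording here is ours):

```python
result = await Runner.run(agent, "Did Oswald travel to Helsinki, Finland in October 1959?")
print(result.final_output)
```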
Our agent is clearly not aware of Oswald's trip to Helsinki - that is because the underlying LLM has not seen that information during its training process. We call information learned during LLM training parametric knowledge, i.e. knowledge stored within the model's parameters.
LLMs can also make use of source knowledge to answer questions. Source knowledge refers to information provided to an LLM via a prompt, whether from the user, from the LLM's instructions, or, in our case, from an external database - i.e. with Retrieval Augmented Generation (RAG). Before we build out our RAG pipeline, let's see if our LLM can answer our question when we provide the relevant information about Oswald's whereabouts via our instructions.
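A sketch of how that might look - the snippet of source knowledge placed into the instructions is paraphrased for illustration:

```python
source_knowledge = (
    "From the JFK files: in October 1959 Lee Harvey Oswald travelled via Helsinki, Finland, "
    "where he obtained a Soviet visa before continuing on to Moscow."
)

agent_with_context = Agent(
    name="JFK Files Agent",
    instructions=(
        "You are a helpful assistant answering questions about the JFK files. "
        "Use the following information when answering:\n\n" + source_knowledge
    ),
    model="gpt-4.1-mini",
)
```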
Let's ask our original query again:
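Reusing the `agent_with_context` sketched above:

```python
result = await Runner.run(agent_with_context, "Where was Oswald in October 1959?")
print(result.final_output)
```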
Perfect, this is much better! Now what we just did works for this simple example, but it doesn't scale. If we want an agent that can answer any question and use context from all of the JFK files, we need to build a RAG pipeline.
Building a RAG Pipeline
A RAG pipeline actually requires two core pipelines - an ingestion pipeline and a retrieval pipeline. At a high level, those pipelines are responsible for:
- Ingestion handles the initial data preparation, embedding, and indexing. We'll explain those steps in more detail soon, but the tl;dr is that the ingestion pipeline transforms a set of unstructured and messy PDFs into a "second brain" for our agent, i.e. the source knowledge.
- Retrieval handles the query-time retrieval of information. It defines how we access and retrieve source knowledge from our second brain.
Naturally, we need to first develop our ingestion pipeline so that we can populate our second brain before we use the retrieval pipeline to retrieve anything.
Ingestion Pipeline
The ingestion pipeline consists of three (or four) steps:
- Step 0: Process the PDFs into plain text - with the `aurelio-ai/jfk-files` dataset (below) this step has already been completed.
- Step 1: Chunk the plain text into smaller segments (a good rule of thumb is ~300-400 tokens per chunk).
- Step 2: Embed each chunk to create vectors - we'll use the `llama-text-embed-v2` model hosted by Pinecone.
- Step 3: Index those vectors in Pinecone with metadata like source URL, document title, etc.
To begin, we'll start at step 0 and download the pre-parsed JFK files.
Loading the JFK Files Dataset
We'll use a dataset of the JFK files, which we will pull from the Hugging Face Hub. This dataset contains historical documents that our agent will search through to answer questions:
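The dataset can be pulled with the `datasets` library (the `train` split name is an assumption):

```python
from datasets import load_dataset

# Download the pre-parsed JFK files from the Hugging Face Hub
data = load_dataset("aurelio-ai/jfk-files", split="train")
data
```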
Let's examine a sample document to understand the data structure:
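For example, truncating the long `content` field for readability:

```python
doc = data[0]
{k: (v[:100] + "..." if isinstance(v, str) and len(v) > 100 else v) for k, v in doc.items()}
```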
Each document contains:
- `id`: Unique identifier for the document
- `filename`: Name of the PDF file
- `url`: Link to the original document
- `date`: Publication date
- `content`: The full text content
- `pages`: Number of pages in the original document
Chunking our Data
Step 1 in our ingestion pipeline is to chunk our dataset. As mentioned, we will split each PDF into chunks of ~400 tokens. We'll also handle cases where a PDF contains little-to-no information (by not indexing that PDF) and cases where our final chunk is too small to be relevant (by appending it to the previous chunk).
We use the lightweight `semantic-chunkers` library and a simple `RegexChunker` for chunking. We will set the token limit for each chunk to `400` tokens:
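A sketch of the chunker setup - treat the `max_chunk_tokens` argument name as an assumption, as it may differ between `semantic-chunkers` versions:

```python
from semantic_chunkers import RegexChunker

# Rule-based chunker targeting ~400 tokens per chunk
chunker = RegexChunker(max_chunk_tokens=400)
```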
We chunk a doc like so:
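Assuming the chunker is callable on a list of documents:

```python
# Chunk the content of a single document
chunks = chunker(docs=[data[0]["content"]])
chunks
```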
This outputs a list of a list of `Chunk` objects. These `Chunk` objects contain many smaller `splits`, which can be thought of as chunks within chunks. We can view the chunks in a cleaner way using `chunker.print` on a `list[Chunk]` object like so:
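For example:

```python
# Pretty-print the chunks produced for the first (and only) document we passed in
chunker.print(chunks[0])
```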
We'll need the text content from our chunks, which we access via the `content` attribute:
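For example:

```python
# Extract the plain text of each chunk
chunk_texts = [chunk.content for chunk in chunks[0]]
chunk_texts[0][:300]
```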
In the next step, we'll set up our Pinecone vector DB and begin embedding and indexing our data in one step. While indexing, we'll apply the above chunking logic across all of our docs before they're embedded.
Embedding and Indexing
To enable semantic search over our documents, we'll use Pinecone - a fully managed vector database. Vector databases allow us to store and search through vector embeddings (numerical representations of text) to find semantically similar content. There are many vector DB options out there; alongside Pinecone, we also recommend Qdrant and pgvector.
First, let's set up our Pinecone API key which we can find in the Pinecone console:
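As with the OpenAI key, a `getpass`-based sketch for initializing the client:

```python
import os
from getpass import getpass
from pinecone import Pinecone

# Initialize the Pinecone client with our API key
pc = Pinecone(api_key=os.getenv("PINECONE_API_KEY") or getpass("Pinecone API key: "))
```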
Now we'll create a Pinecone index to store our vector embeddings. We specify the following:
- We want to use AWS via `cloud=CloudProvider.AWS`, in Pinecone's free tier region via `region=AwsRegion.US_EAST_1`.
- We use the `llama-text-embed-v2` embedding model hosted by Pinecone - by default, the index will be configured for this model.
- We specify that the text content to be embedded by our model will be provided to Pinecone via the `content` metadata field.
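A sketch of the index creation using Pinecone's integrated inference - the `jfk-files` index name is our own choice, and the exact `create_index_for_model` signature may vary between SDK versions:

```python
from pinecone import CloudProvider, AwsRegion

index_name = "jfk-files"  # hypothetical index name

if not pc.has_index(index_name):
    pc.create_index_for_model(
        name=index_name,
        cloud=CloudProvider.AWS,
        region=AwsRegion.US_EAST_1,
        embed={
            "model": "llama-text-embed-v2",
            # Map the model's expected text input to our "content" field
            "field_map": {"text": "content"},
        },
    )

index = pc.Index(index_name)
```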
Let's check if our index is empty (it should be on first run):
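For example:

```python
# A fresh index should report a total vector count of 0
index.describe_index_stats()
```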
To embed and index a chunk, we can do the following:
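A sketch using `upsert_records`, which lets Pinecone embed the `content` field for us server-side - the extra metadata fields here are our own choice:

```python
doc = data[0]
doc_chunks = chunker(docs=[doc["content"]])[0]

records = [
    {
        "_id": f"{doc['id']}#{i}",   # unique ID per chunk
        "content": chunk.content,     # the field Pinecone will embed
        "filename": doc["filename"],
        "url": doc["url"],
    }
    for i, chunk in enumerate(doc_chunks)
]

# Upsert into the "default" namespace; embedding happens server-side
index.upsert_records("default", records)
```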
Now we should see that our index contains three records inside the `default` namespace:
Perfect! Now we simply repeat that process for all of our docs. We will do this in batches to avoid excessive network calls with small payloads.
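A sketch of that loop - the batch size and the minimum-content threshold are arbitrary choices:

```python
batch_size = 64  # keep batches comfortably within Pinecone's per-request limits
batch = []

for doc in data:
    # Skip PDFs with little-to-no text content (threshold is arbitrary)
    if not doc["content"] or len(doc["content"]) < 200:
        continue
    for i, chunk in enumerate(chunker(docs=[doc["content"]])[0]):
        batch.append({
            "_id": f"{doc['id']}#{i}",
            "content": chunk.content,
            "filename": doc["filename"],
            "url": doc["url"],
        })
        if len(batch) >= batch_size:
            index.upsert_records("default", batch)
            batch = []

# Upsert any remaining records
if batch:
    index.upsert_records("default", batch)
```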
That's our ingestion pipeline complete and we're ready to move on to the retrieval pipeline.
Retrieval Pipeline
Our retrieval pipeline is what will be used to retrieve the right source knowledge for our agent at query time. We will implement this via an Agents SDK `@function_tool`, but before we do so, let's test retrieval directly.
As we're using Pinecone's integrated inference (i.e. both indexing and embedding are handled by Pinecone), the retrieval pipeline is incredibly simple.
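A sketch of a query via the index's `search` method - the exact query shape may differ between SDK versions:

```python
query = "Where was Oswald in October 1959?"

results = index.search(
    namespace="default",
    query={"inputs": {"text": query}, "top_k": 5},
)
results
```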
Let's format these a little nicer:
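For example, pulling out just the score, source URL, and a snippet of text from each hit (this assumes a dict-like response with hits under `result`):

```python
for hit in results["result"]["hits"]:
    print(f"score: {hit['_score']:.3f} | url: {hit['fields']['url']}")
    print(hit["fields"]["content"][:200], "\n")
```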
Now let's create a tool that our agent can use to search through the JFK documents. We use the `@function_tool` decorator to wrap the logic above and make the retrieval pipeline available to our agents.
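A sketch of the tool - the docstring becomes the tool description that the agent sees:

```python
from agents import function_tool

@function_tool
def jfk_files_search(query: str) -> str:
    """Search the JFK files for information relevant to the query."""
    results = index.search(
        namespace="default",
        query={"inputs": {"text": query}, "top_k": 5},
    )
    # Concatenate the retrieved chunks, keeping the source URLs for citation
    context = [
        f"Source: {hit['fields']['url']}\n{hit['fields']['content']}"
        for hit in results["result"]["hits"]
    ]
    return "\n---\n".join(context)
```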
Now we provide our `jfk_files_search` tool to an agent.
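For example (instructions and model are again our own assumptions):

```python
rag_agent = Agent(
    name="JFK Files RAG Agent",
    instructions=(
        "You are a helpful assistant answering questions about the JFK files. "
        "Always use the jfk_files_search tool to find relevant documents and cite their source URLs."
    ),
    model="gpt-4.1-mini",
    tools=[jfk_files_search],
)
```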
Building the Final RAG Agent
Now we can use our agent to discover who really assassinated JFK. First, let's confirm our agent is functional with our original query about Oswald's whereabouts in October 1959.
To keep things conversational, we'll append our own queries and the agent's responses to a `messages` list.
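A sketch of that conversational loop - `to_input_list()` is the Agents SDK helper that returns the accumulated run items as input for the next turn:

```python
# First turn
result = await Runner.run(rag_agent, "Where was Oswald in October 1959?")
print(result.final_output)

# Carry the conversation forward by reusing the accumulated messages
messages = result.to_input_list()
messages += [{"role": "user", "content": "Who did he meet while he was there?"}]

# Follow-up turn
result = await Runner.run(rag_agent, messages)
print(result.final_output)
```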
Great! Our retrieval pipeline is clearly returning highly relevant information to our agent - allowing us to explore the JFK files, ask follow-up questions, and try to understand the various connections and characters that appear throughout.
Once we're done asking questions, we should ideally delete our vector index to save resources.
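For example:

```python
# Clean up: delete the index once we're finished
pc.delete_index(index_name)
```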