Retrieval Augmented Generation (RAG) is a powerful technique that enables AI agents to access and leverage external knowledge sources beyond their training data. In this tutorial, we'll build a RAG agent that can answer questions about the JFK assassination files using OpenAI's Agents SDK and Pinecone vector database.
RAG is particularly useful when:
- You need up-to-date information beyond the model's training cutoff
- You have domain-specific documents or proprietary data
- You want to reduce hallucinations by grounding responses in factual sources
- You need to cite sources for transparency and verification
By the end of this tutorial, you'll have built an agent that can search through historical documents and provide accurate, sourced answers about the JFK files.
Prerequisites
Before we begin, let's install the required packages:
!pip install -qU \
"openai-agents==0.1.0" \
"pinecone==7.0.2" \
"datasets==3.6.0" \
"semantic-chunkers==0.1.1"
We also need API keys for OpenAI and Pinecone. You can get:
- An OpenAI API key from the OpenAI Platform
- A Pinecone API key from the Pinecone Console
import os
from getpass import getpass
os.environ["OPENAI_API_KEY"] = os.getenv("OPENAI_API_KEY") or getpass(
"Enter OPENAI_API_KEY: "
)
Testing LLM Knowledge Limitations
Before implementing RAG, let's first demonstrate why it's needed. We'll create a basic agent and test its knowledge about specific topics to show the limitations of relying solely on the model's training data.
from agents import Agent
agent = Agent(
    name="Agent",
    model="gpt-4.1-mini"
)
We'll ask our agent "where was Oswald in october 1959?"
:
from agents import Runner
query = "where was Lee Harvey Oswald in october 1959?"
result = await Runner.run(
    starting_agent=agent,
    input=query,
)
print(result.final_output)
According to the JFK files, Oswald was also in Helsinki, Finland in October 1959 - a detail our agent missed. We can try to tease out this information:
result = await Runner.run(
    starting_agent=agent,
    input=[
        {"role": "user", "content": query},
        {"role": "assistant", "content": result.final_output},
        {"role": "user", "content": "did he go anywhere else?"}
    ],
)
print(result.final_output)
Our agent is clearly not aware of Oswald's trip to Helsinki - that is because the underlying LLM has not seen that information during its training process. We call information learned during LLM training parametric knowledge, i.e. knowledge stored within the model's parameters.
LLMs can also make use of source knowledge to answer questions. Source knowledge refers to information provided to an LLM via the prompt - whether by the user, through the LLM's instructions, or, in our case, from an external database, i.e. with Retrieval Augmented Generation (RAG). Before we build out our RAG pipeline, let's see if our LLM can answer our question when we provide the relevant information about Oswald's whereabouts via its instructions:
source_knowledge = (
    "~SECRET~\n"
    "1 June 1964\n"
    "\n"
    "## MEMO FOR THE RECORD\n"
    "\n"
    "1. At 0900 this morning I talked with Frank Friberg recently "
    "returned COS Helsinki re Warren Commission inquiry concerning "
    "the timetable of Oswald's stay in Finland in October 1959, including "
    "his contact with the Soviet Consulate there. (Copy of the Commission "
    "letter of 25 May 64 and State Cable of 22 May 64 attached.)"
)
agent = Agent(
    name="Agent",
    instructions=(
        "You are an assistant specialized in answering questions about the JFK assassination "
        "and related documents.\n"
        "Here is some additional context that you can use:\n"
        f"{source_knowledge}\n"
    ),
    model="gpt-4.1-mini"
)
Let's ask our original query again:
result = await Runner.run(
    starting_agent=agent,
    input=query,
)
print(result.final_output)
Perfect, this is much better! What we just did works for this simple example, but it doesn't scale. If we want an agent that can answer any question using context from all of the JFK files, we need to build a RAG pipeline.
Building a RAG Pipeline
A RAG pipeline actually requires two core pipelines - an ingestion pipeline and a retrieval pipeline. At a high level, those pipelines are responsible for:
- Ingestion handles the initial data preparation, embedding, and indexing. We'll explain those steps in more detail soon, but the tl;dr is that the ingestion pipeline transforms a set of unstructured and messy PDFs into a "second brain" for our agent, i.e. the source knowledge.
- Retrieval handles the query-time retrieval of information. It defines how we access and retrieve source knowledge from our second brain.
Naturally, we need to first develop our ingestion pipeline so that we can populate our second brain before we use the retrieval pipeline to retrieve anything.
Ingestion Pipeline
The ingestion pipeline consists of three (or four) steps:
- Process the PDF into plain text - with the `aurelio-ai/jfk-files` dataset (below) this step has already been completed.
- Chunk the plain text into smaller segments (a good rule of thumb is ~300-400 tokens per chunk).
- Embed each chunk to create vectors - we'll use the `llama-text-embed-v2` embedding model hosted by Pinecone.
- Index those vectors in Pinecone with metadata like source URL, document title, etc.
To begin, we'll start at step 0 and download the pre-parsed JFK files.
Loading the JFK Files Dataset
We'll use a dataset of the JFK files, which we will pull from the Hugging Face Hub. This dataset contains historical documents that our agent will search through to answer questions:
from datasets import load_dataset
dataset = load_dataset(
    "aurelio-ai/jfk-files",
    split="train"
)
Let's examine a sample document to understand the data structure:
dataset[1]
Each document contains:
- `id`: Unique identifier for the document
- `filename`: Name of the PDF file
- `url`: Link to the original document
- `date`: Publication date
- `content`: The full text content
- `pages`: Number of pages in the original document
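For example, we can pull a few of those fields from the record above to get a feel for the data (just reading values we've already loaded):
sample = dataset[1]
# preview a few metadata fields and the start of the document text
print(sample['filename'], sample['url'], sample['pages'])
print(sample['content'][:300])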
Chunking our Data
Step 1 in our ingestion pipeline is to chunk our dataset. As mentioned, we will split each PDF into chunks of ~400 tokens. We'll also handle cases where a PDF contains little-to-no information by not indexing that PDF, and cases where our final chunk is too small to be relevant by appending it to the previous chunk.
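The chunker and the length check we apply later handle most of this for us, but as a rough sketch, merging a too-small trailing chunk could look something like the following (the helper and the character threshold are illustrative, not part of semantic-chunkers):
def merge_small_final_chunk(chunk_texts, min_chars=200):
    # illustrative: fold a very short final chunk into the previous one
    if len(chunk_texts) > 1 and len(chunk_texts[-1]) < min_chars:
        chunk_texts[-2] = chunk_texts[-2] + " " + chunk_texts[-1]
        chunk_texts = chunk_texts[:-1]
    return chunk_texts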
We use the lightweight `semantic-chunkers` library and a simple `RegexChunker` for chunking. We will set the token limit for each chunk to 400 tokens:
from semantic_chunkers import RegexChunker
chunker = RegexChunker(max_chunk_tokens=400)
We chunk a doc like so:
chunks = chunker(docs=[dataset[1]['content']])
chunks
This outputs a list of a list of `Chunk` objects. These `Chunk` objects contain many smaller splits, which can be thought of as chunks within chunks. We can view the chunks in a cleaner way using `chunker.print` on a `list[Chunk]` object like so:
chunker.print(chunks[0])
We'll need the text content from our chunks, which we access via the `content` attribute:
chunks[0][0].content
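If you want to sanity-check that chunks stay at or below the ~400-token limit, here is a minimal sketch using tiktoken (tiktoken isn't installed above, and the cl100k_base encoding is an assumption - the chunker uses its own tokenizer internally, so counts may differ slightly):
import tiktoken

encoding = tiktoken.get_encoding("cl100k_base")
for chunk in chunks[0]:
    # rough token count per chunk - these should sit around or below 400
    print(len(encoding.encode(chunk.content)))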
In the next step we'll set up our Pinecone vector DB and begin embedding and indexing our data in one step - while indexing, we'll apply the above chunking logic across all our docs before they're embedded.
Embedding and Indexing
To enable semantic search over our documents, we'll use Pinecone - a fully managed vector database. Vector databases allow us to store and search through vector embeddings (numerical representations of text) to find semantically similar content. There are many vector DB options out there; alongside Pinecone we also recommend Qdrant and pgvector.
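As a toy illustration of what "semantically similar" means under the hood: embeddings that point in similar directions score close to 1 under cosine similarity (plain Python, nothing Pinecone-specific, and the vectors below are made up):
import math

def cosine_similarity(a, b):
    # cosine similarity = dot product divided by the product of the vector magnitudes
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b)))

# two similar "embeddings" score near 1, dissimilar ones score lower
print(cosine_similarity([0.1, 0.9, 0.2], [0.12, 0.85, 0.25]))
print(cosine_similarity([0.1, 0.9, 0.2], [0.9, -0.1, 0.4]))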
First, let's set up our Pinecone API key which we can find in the Pinecone console:
from pinecone import Pinecone
os.environ["PINECONE_API_KEY"] = os.getenv("PINECONE_API_KEY") or getpass(
"Enter PINECONE_API_KEY: "
)
# initializing the pinecone client
pc = Pinecone()
Now we'll create a Pinecone index to store our vector embeddings. We specify the following:
- We want to use AWS via `cloud=CloudProvider.AWS` in Pinecone's free tier region via `region=AwsRegion.US_EAST_1`.
- We use the `llama-text-embed-v2` embedding model hosted by Pinecone - by default the index will be configured for this model.
- We specify that the text content to be embedded by the model will be provided to Pinecone via the `content` metadata field.
from pinecone import AwsRegion, CloudProvider
# set our index name, you can change this to whatever you like
index_name = "agents-sdk-course-jfk-files"
# if the index doesn't exist, create it
if index_name not in pc.list_indexes().names():
    pc.create_index_for_model(
        name=index_name,
        cloud=CloudProvider.AWS,
        region=AwsRegion.US_EAST_1,
        embed={
            "model": "llama-text-embed-v2",
            "field_map": {
                "text": "content"
            }
        }
    )
index = pc.Index(index_name)
Let's check if our index is empty (it should be on first run):
index.describe_index_stats()
To embed and index a chunk, we can do the following:
doc = dataset[1]
# chunk the doc
chunks = chunker(docs=[doc['content']])
# create a list of dictionary records
records = [
    {
        "id": doc['id'] + f"-{i}",
        "content": chunk.content,
        "filename": doc['filename'],
        "url": doc['url'],
        "date": doc['date'].isoformat(),
        "pages": doc['pages']
    } for i, chunk in enumerate(chunks[0])
]
# embed and index
index.upsert_records(
    namespace="default",
    records=records
)
Now we should see that our index contains three records inside the `default` namespace:
index.describe_index_stats()
Perfect! Now we simply repeat that process for all of our docs. We will do this in batches to avoid excessive network calls with small payloads.
from tqdm.auto import tqdm
records = []
for doc in tqdm(dataset):
    # perform a quick length check of our docs to avoid excessively small docs
    if len(doc['content']) < 100:
        # nothing less than 100 chars
        continue
    # chunk the docs
    chunks = chunker(docs=[doc['content']])
    for i, chunk in enumerate(chunks[0]):
        records.append(
            {
                "id": doc['id'] + f"-{i}",
                "content": chunk.content,
                "filename": doc['filename'],
                "url": doc['url'],
                "date": doc['date'].isoformat(),
                "pages": doc['pages']
            }
        )
    if len(records) >= 64:
        # if we have a particularly long doc, we'll need to split up the batch
        for j in range(0, len(records), 96):
            # 96 is the max number of records we can upsert in one go
            batch = records[j:j+96]
            # embed and index the batch
            index.upsert_records(
                namespace="default",
                records=batch
            )
        records = []
# upsert any leftover records that never reached the batch threshold
if records:
    index.upsert_records(
        namespace="default",
        records=records
    )
index.describe_index_stats()
That's our ingestion pipeline complete, and we're ready to move on to the retrieval pipeline.
Retrieval Pipeline
Our retrieval pipeline is what will be used to retrieve the right source knowledge for our agent at query time. We will implement this via an Agents SDK `@function_tool`, but before we do so, let's test retrieval directly.
As we're using Pinecone's integrated inference (i.e. both embedding and indexing are handled by Pinecone), the retrieval pipeline is incredibly simple.
results = index.search(
    namespace="default",
    query={
        "inputs": {"text": query},
        "top_k": 5
    },
    fields=["content", "url", "pages"]
)
results
Let's format these a little nicer:
from IPython.display import Markdown, display
# let's print out the results
results_str = """
| Score | Content | Pages | URL |
|-------|---------|-------|-----|
"""
for result in results["result"]["hits"]:
results_str += (
f"| {result['_score']:.2f} "
"| " + result['fields']['content'].replace('|', '\|') + " "
f"| {result['fields']['pages']} "
f"| {result['fields']['url']} |" "\n"
)
display(Markdown(results_str))
Now let's create a tool that our agent can use to search through the JFK documents. We use the `@function_tool` decorator to wrap the logic above and make the retrieval pipeline available to our agents.
from agents import function_tool
@function_tool
def jfk_files_search(query: str) -> str:
    """This tool gives you search access to the full JFK files. To use this tool you
    should provide search queries with as much context as possible, using natural
    language to describe the query.
    This tool will return five of the most relevant document chunks for your query,
    including each result's similarity score, the text content, the source page number,
    and the source URL.
    """
    results = index.search(
        namespace="default",
        query={
            "inputs": {"text": query},
            "top_k": 5
        },
        fields=["content", "url", "pages"]
    )
    # format the results into a markdown string - this isn't essential for our LLM but
    # it helps
    source_knowledge = """
| Score | Content | Pages | URL |
|-------|---------|-------|-----|
"""
    for result in results["result"]["hits"]:
        source_knowledge += (
            f"| {result['_score']:.2f} "
            "| " + result['fields']['content'].replace('|', '\\|') + " "
            f"| {result['fields']['pages']} "
            f"| {result['fields']['url']} |" "\n"
        )
    return source_knowledge
Now we provide our `jfk_files_search` tool to an agent.
agent = Agent(
    name="JFK Document Assistant",
    model="gpt-4.1-mini",
    instructions=(
        "You are an assistant specialized in answering questions about the JFK "
        "assassination and related documents. When users ask questions about JFK, "
        "the assassination, or related historical events, use the jfk_files_search "
        "tool to find relevant information. Please write your answers in markdown "
        "and provide sources to support your answers."
    ),
    tools=[jfk_files_search]
)
Building the Final RAG Agent
Now we can use our agent to discover who really assassinated JFK. First, let's confirm our agent is functional with our original query about Oswald's whereabouts in October 1959.
query
result = await Runner.run(
    starting_agent=agent,
    input=query,
)
display(Markdown(result.final_output))
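If you'd like to confirm the agent actually called the `jfk_files_search` tool rather than answering from parametric knowledge, you can inspect the items produced during the run (a quick sketch - the exact item classes come from the Agents SDK's run result):
# list the item types generated during the run, e.g. tool calls, tool outputs, messages
for item in result.new_items:
    print(type(item).__name__)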
To keep things conversational, we'll append our own queries and the agent's responses to a `messages` list.
messages = [
    {"role": "user", "content": query},
    {"role": "assistant", "content": result.final_output}
]
messages.append(
    {"role": "user", "content": (
        "do the JFK files contain any information about doubts on Lee Harvey Oswald's "
        "involvement in the assassination?"
    )}
)
result = await Runner.run(
    starting_agent=agent,
    input=messages,
)
display(Markdown(result.final_output))
messages.extend(
    [
        {"role": "assistant", "content": result.final_output},
        {"role": "user", "content": "I see mentions of Oswald in Mexico, what did he do there?"}
    ]
)
result = await Runner.run(
    starting_agent=agent,
    input=messages,
)
display(Markdown(result.final_output))
messages.extend(
    [
        {"role": "assistant", "content": result.final_output},
        {"role": "user", "content": "Tell me more about Valeriy, is he relevant?"}
    ]
)
result = await Runner.run(
    starting_agent=agent,
    input=messages,
)
display(Markdown(result.final_output))
Great! Our retrieval pipeline is clearly returning highly relevant information to our agent - allowing us to explore the JFK files, ask follow-up questions, and try to understand the various connections and characters that appear throughout.
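The append-and-run pattern above repeats for every turn, so if you plan on longer conversations, a small convenience helper (our own, not part of the Agents SDK) keeps it tidy:
async def chat(agent, messages, user_message):
    # append the user turn, run the agent, then append the assistant turn
    messages.append({"role": "user", "content": user_message})
    result = await Runner.run(starting_agent=agent, input=messages)
    messages.append({"role": "assistant", "content": result.final_output})
    return result.final_output

# usage (in a notebook cell):
# answer = await chat(agent, messages, "who else did Oswald meet in Mexico City?")
# display(Markdown(answer))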
Once we're done asking questions, we should ideally delete our vector index to save resources.
pc.delete_index(index_name)
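If you want to confirm the index is gone (optional - deletion can take a few moments to propagate), list the remaining indexes:
# our index name should no longer appear in the list
print(pc.list_indexes().names())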