Updated on July 10, 2025

RAG Agent with Agents SDK

AI Engineering

Retrieval Augmented Generation (RAG) is a powerful technique that enables AI agents to access and leverage external knowledge sources beyond their training data. In this tutorial, we'll build a RAG agent that can answer questions about the JFK assassination files using OpenAI's Agents SDK and Pinecone vector database.

RAG is particularly useful when:

  • You need up-to-date information beyond the model's training cutoff
  • You have domain-specific documents or proprietary data
  • You want to reduce hallucinations by grounding responses in factual sources
  • You need to cite sources for transparency and verification

By the end of this tutorial, you'll have built an agent that can search through historical documents and provide accurate, sourced answers about the JFK files.

Prerequisites

Before we begin, let's install the required packages:

python
!pip install -qU \
    "openai-agents==0.1.0" \
    "pinecone==7.0.2" \
    "datasets==3.6.0" \
    "semantic-chunkers==0.1.1"

We also need API keys for OpenAI and Pinecone. You can get your OpenAI API key from the OpenAI Platform; we'll enter the Pinecone API key later when we set up our vector index:

python
import os
from getpass import getpass

os.environ["OPENAI_API_KEY"] = os.getenv("OPENAI_API_KEY") or getpass(
"Enter OPENAI_API_KEY: "
)

Testing LLM Knowledge Limitations

Before implementing RAG, let's first demonstrate why it's needed. We'll create a basic agent and test its knowledge about specific topics to show the limitations of relying solely on the model's training data.

python
from agents import Agent

agent = Agent(
    name="Agent",
    model="gpt-4.1-mini"
)

We'll ask our agent "where was Lee Harvey Oswald in October 1959?":

python
from agents import Runner

query = "where was Lee Harvey Oswald in october 1959?"

result = await Runner.run(
starting_agent=agent,
input=query,
)

print(result.final_output)

According to the JFK files, Oswald was also in Helsinki, Finland in October 1959 - a detail our agent missed. We can try to tease out this information:

python
result = await Runner.run(
    starting_agent=agent,
    input=[
        {"role": "user", "content": query},
        {"role": "assistant", "content": result.final_output},
        {"role": "user", "content": "did he go anywhere else?"}
    ],
)

print(result.final_output)

Our agent is clearly not aware of Oswald's trip to Helsinki - that is because the underlying LLM did not see that information during its training process. We call information learned during LLM training parametric knowledge, i.e. knowledge stored within the model's parameters.

LLMs can also make use of source knowledge to answer questions. Source knowledge refers to information provided to an LLM via the prompt - whether from the user, from the LLM's instructions, or, in our case, from an external database, i.e. with Retrieval Augmented Generation (RAG). Before we build out our RAG pipeline, let's see if our LLM can answer our question when we provide the relevant information about Oswald's whereabouts via our instructions.

python
source_knowledge = (
    "~SECRET~\n"
    "1 June 1964\n"
    "\n"
    "## MEMO FOR THE RECORD\n"
    "\n"
    "1. At 0900 this morning I talked with Frank Friberg recently "
    "returned COS Helsinki re Warren Commission inquiry concerning "
    "the timetable of Oswald's stay in Finland in October 1959, including "
    "his contact with the Soviet Consulate there. (Copy of the Commission "
    "letter of 25 May 64 and State Cable of 22 May 64 attached.)"
)

agent = Agent(
    name="Agent",
    instructions=(
        "You are an assistant specialized in answering questions about the JFK assassination "
        "and related documents.\n"
        "Here is some additional context that you can use:\n"
        f"{source_knowledge}\n"
    ),
    model="gpt-4.1-mini"
)

Let's ask our original query again:

python
result = await Runner.run(
    starting_agent=agent,
    input=query,
)

print(result.final_output)

Perfect, this is much better! What we just did works for this simple example, but it doesn't scale. If we want an agent that can answer any question using context from all of the JFK files, we need to build a RAG pipeline.

Building a RAG Pipeline

A RAG pipeline actually requires two core pipelines - an ingestion pipeline and a retrieval pipeline. At a high level those pipelines are responsible for:

  • Ingestion handles the initial data preparation, embedding, and indexing. We'll explain those steps in more detail soon, but the TL;DR is that the ingestion pipeline transforms a set of unstructured and messy PDFs into a "second brain" for our agent, i.e. the source knowledge.

  • Retrieval handles the query-time retrieval of information. It defines how we access and retrieve source knowledge from our second brain.

Naturally, we need to first develop our ingestion pipeline so that we can populate our second brain before we use the retrieval pipeline to retrieve anything.

Ingestion Pipeline

The ingestion pipeline consists of three (or four) steps:

  0. Process the PDFs into plain text - with the aurelio-ai/jfk-files dataset (below) this step has already been completed.

  1. Chunk the plain text into smaller segments (a good rule of thumb is ~300-400 tokens per chunk).

  2. Embed each chunk with Pinecone's hosted llama-text-embed-v2 model to create vectors.

  3. Index those vectors in Pinecone with metadata like source URL, document title, etc.

JFK document ingestion pipeline, covering PDF text to chunked text, embedding those chunks into semantically meaningful vector embeddings, and sending those vector embeddings to a vector database

To begin, we'll start at step 0 and download the pre-parsed JFK files.

Loading the JFK Files Dataset

We'll use a dataset of the JFK files, which we will pull from the Hugging Face Hub. This dataset contains historical documents that our agent will search through to answer questions:

python
from datasets import load_dataset

dataset = load_dataset(
    "aurelio-ai/jfk-files",
    split="train"
)

Let's examine a sample document to understand the data structure:

python
dataset[1]

Each document contains:

  • id: Unique identifier for the document
  • filename: Name of the PDF file
  • url: Link to the original document
  • date: Publication date
  • content: The full text content
  • pages: Number of pages in the original document
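
As a quick sanity check, we can access each of these fields (as listed above) on the sample document we just looked at:

python
# grab a single record and inspect its metadata fields
doc = dataset[1]

print(doc["id"])             # unique identifier
print(doc["filename"])       # name of the source PDF
print(doc["url"])            # link to the original document
print(doc["date"])           # publication date
print(doc["pages"])          # number of pages in the original PDF
print(doc["content"][:300])  # first 300 characters of the parsed text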

Chunking our Data

Step 1 in our ingestion pipeline is to chunk our dataset. As mentioned, we will split each PDF into chunks of ~400 tokens. We'll also handle cases where a PDF contains little-to-no information by not indexing that PDF, and cases where our final chunk is too small to be relevant by appending it to the previous chunk (we'll sketch that merge step shortly).

We use the lightweight semantic-chunkers library and a simple RegexChunker for chunking. We will set the token limit for each chunk to 400 tokens:

python
from semantic_chunkers import RegexChunker

chunker = RegexChunker(max_chunk_tokens=400)

We chunk a doc like so:

python
chunks = chunker(docs=[dataset[1]['content']])
chunks

This outputs a list of lists of Chunk objects. Each Chunk object contains many smaller splits, which can be thought of as chunks within chunks. We can view the chunks in a cleaner way by calling chunker.print on a list[Chunk] object like so:

python
chunker.print(chunks[0])

We'll need the text content from our chunks, which we access via the content attribute:

python
chunks[0][0].content
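
As mentioned earlier, a final chunk that is too small to be relevant can be appended to the previous chunk. The chunker doesn't do this for us, so here is a minimal sketch of that merge on the raw chunk text (the 20-word threshold is an arbitrary choice for illustration, not a value from the original pipeline):

python
# a minimal sketch: fold an undersized final chunk into the previous one
chunk_texts = [chunk.content for chunk in chunks[0]]

if len(chunk_texts) > 1 and len(chunk_texts[-1].split()) < 20:
    # the final chunk is tiny, so merge it into the previous chunk instead
    chunk_texts[-2] = chunk_texts[-2] + " " + chunk_texts[-1]
    chunk_texts = chunk_texts[:-1]

len(chunk_texts)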

In the next step we'll set up our Pinecone vector DB and begin embedding and indexing our data in one step - while indexing, we'll apply the above chunking logic across all our docs before they're embedded.

Embedding and Indexing

To enable semantic search over our documents, we'll use Pinecone - a fully managed vector database. Vector databases allow us to store and search through vector embeddings (numerical representations of text) to find semantically similar content. There are many vector DB options out there; alongside Pinecone we also recommend Qdrant and pgvector.
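
To build some intuition for what "semantically similar" means here, the toy sketch below compares two made-up vectors with cosine similarity (the numbers are purely illustrative - in practice Pinecone computes these comparisons over real, high-dimensional embeddings for us):

python
import numpy as np

# toy 3-dimensional "embeddings" - real text embeddings have hundreds of dimensions
query_vec = np.array([0.1, 0.9, 0.2])
doc_vec = np.array([0.2, 0.8, 0.1])

# cosine similarity: near 1.0 means very similar meaning, near 0.0 means unrelated
similarity = np.dot(query_vec, doc_vec) / (
    np.linalg.norm(query_vec) * np.linalg.norm(doc_vec)
)
print(f"cosine similarity: {similarity:.3f}")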

First, let's set up our Pinecone API key which we can find in the Pinecone console:

python
from pinecone import Pinecone

os.environ["PINECONE_API_KEY"] = os.getenv("PINECONE_API_KEY") or getpass(
"Enter PINECONE_API_KEY: "
)

# initializing the pinecone client
pc = Pinecone()

Now we'll create a Pinecone index to store our vector embeddings. We specify the following:

  • We want to use AWS via cloud=CloudProvider.AWS in Pinecone's free tier region via region=AwsRegion.US_EAST_1.
  • We use the llama-text-embed-v2 embedding model hosted by Pinecone - by default the index will be configured for this model.
  • We specify that the text content to be embedded by our model will be provided to Pinecone via the content metadata field.

python
from pinecone import AwsRegion, CloudProvider

# set our index name, you can change this to whatever you like
index_name = "agents-sdk-course-jfk-files"

# if the index doesn't exist, create it
if index_name not in pc.list_indexes().names():
    pc.create_index_for_model(
        name=index_name,
        cloud=CloudProvider.AWS,
        region=AwsRegion.US_EAST_1,
        embed={
            "model": "llama-text-embed-v2",
            "field_map": {
                "text": "content"
            }
        }
    )

index = pc.Index(index_name)

Let's check if our index is empty (it should be on first run):

python
index.describe_index_stats()

To embed and index a chunk, we can do the following:

python
doc = dataset[1]

# chunk the doc
chunks = chunker(docs=[doc['content']])

# create a list of dictionary records
records = [
    {
        "id": doc['id'] + f"-{i}",
        "content": chunk.content,
        "filename": doc['filename'],
        "url": doc['url'],
        "date": doc['date'].isoformat(),
        "pages": doc['pages']
    } for i, chunk in enumerate(chunks[0])
]

# embed and index
index.upsert_records(
    namespace="default",
    records=records
)

Now we should see that our index contains three records inside the default namespace:

python
index.describe_index_stats()

Perfect! Now we simply repeat that process for all of our docs. We will do this in batches to avoid making excessive network calls with small payloads.

python
from tqdm.auto import tqdm

records = []
for doc in tqdm(dataset):
    # perform a quick length check to avoid indexing excessively small docs
    if len(doc['content']) < 100:
        # nothing less than 100 chars
        continue
    # chunk the doc
    chunks = chunker(docs=[doc['content']])
    for i, chunk in enumerate(chunks[0]):
        records.append(
            {
                "id": doc['id'] + f"-{i}",
                "content": chunk.content,
                "filename": doc['filename'],
                "url": doc['url'],
                "date": doc['date'].isoformat(),
                "pages": doc['pages']
            }
        )
    if len(records) >= 64:
        # if we have a particularly long doc, we'll need to split up the batch
        for i in range(0, len(records), 96):
            # 96 is the max number of records we can upsert in one go
            batch = records[i:i+96]
            # embed and index the batch
            index.upsert_records(
                namespace="default",
                records=batch
            )
        records = []

# upsert any remaining records that didn't fill a full batch
if records:
    index.upsert_records(
        namespace="default",
        records=records
    )

index.describe_index_stats()

That's our ingestion pipeline complete and we're ready to move on to the retrieval pipeline.

Retrieval Pipeline

Our retrieval pipeline is what will be used to retrieve the right source knowledge for our agent at query-time. We will implement this via an Agents SDK @function_tool, but before we do so, let's test retrieval directly.

As we're using Pinecone's integrated inference (i.e. both embedding and indexing are handled by Pinecone), the retrieval pipeline is incredibly simple.

python
results = index.search(
    namespace="default",
    query={
        "inputs": {"text": query},
        "top_k": 5
    },
    fields=["content", "url", "pages"]
)

results

Let's format these a little nicer:

python
from IPython.display import Markdown, display

# let's print out the results
results_str = """
| Score | Content | Pages | URL |
|-------|---------|-------|-----|
"""
for result in results["result"]["hits"]:
    results_str += (
        f"| {result['_score']:.2f} "
        "| " + result['fields']['content'].replace('|', '\\|') + " "
        f"| {result['fields']['pages']} "
        f"| {result['fields']['url']} |" "\n"
    )

display(Markdown(results_str))

Now let's create a tool that our agent can use to search through the JFK documents. We use the @function_tool decorator to wrap the logic above and make the retrieval pipeline available to our agents.

python
from agents import function_tool

@function_tool
def jfk_files_search(query: str) -> str:
    """This tool gives you search access to the full JFK files. To use this tool you
    should provide search queries with as much context as possible, using natural
    language to describe the query.

    This tool will return five of the most relevant document chunks for your query,
    including each result's similarity score, the text content, the source page number,
    and source URL.
    """
    results = index.search(
        namespace="default",
        query={
            "inputs": {"text": query},
            "top_k": 5
        },
        fields=["content", "url", "pages"]
    )
    # format the results into a markdown string - this isn't essential for our LLM but
    # it helps
    source_knowledge = """
| Score | Content | Pages | URL |
|-------|---------|-------|-----|
"""
    for result in results["result"]["hits"]:
        source_knowledge += (
            f"| {result['_score']:.2f} "
            "| " + result['fields']['content'].replace('|', '\\|') + " "
            f"| {result['fields']['pages']} "
            f"| {result['fields']['url']} |" "\n"
        )
    return source_knowledge

Now we provide our jfk_files_search tool to an agent.

python
agent = Agent(
    name="JFK Document Assistant",
    model="gpt-4.1-mini",
    instructions=(
        "You are an assistant specialized in answering questions about the JFK "
        "assassination and related documents. When users ask questions about JFK, "
        "the assassination, or related historical events, use the jfk_files_search "
        "tool to find relevant information. Please write your answers in markdown "
        "and provide sources to support your answers."
    ),
    tools=[jfk_files_search]
)

Building the Final RAG Agent

Now we can use our agent to discover who really assassinated JFK. First, let's confirm our agent is functional with our original query about Oswald's whereabouts in October 1959.

python
query
python
result = await Runner.run(
    starting_agent=agent,
    input=query,
)

display(Markdown(result.final_output))

To keep things conversational we'll append our own queries and the agent responses to a messages list.

python
messages = [
    {"role": "user", "content": query},
    {"role": "assistant", "content": result.final_output}
]
python
messages.append(
    {"role": "user", "content": (
        "do the JFK files contain any information about doubts on Lee Harvey Oswald's "
        "involvement in the assassination?"
    )}
)

result = await Runner.run(
    starting_agent=agent,
    input=messages,
)

display(Markdown(result.final_output))
python
messages.extend(
    [
        {"role": "assistant", "content": result.final_output},
        {"role": "user", "content": "I see mentions of Oswald in Mexico, what did he do there?"}
    ]
)

result = await Runner.run(
    starting_agent=agent,
    input=messages,
)

display(Markdown(result.final_output))
python
messages.extend(
    [
        {"role": "assistant", "content": result.final_output},
        {"role": "user", "content": "Tell me more about Valeriy, is he relevant?"}
    ]
)

result = await Runner.run(
    starting_agent=agent,
    input=messages,
)

display(Markdown(result.final_output))

Great! Our retrieval pipeline is clearly returning highly relevant information to our agent - allowing us to explore the JFK files, ask follow-up questions, and try to understand the various connections and characters that appear throughout.

Once we're done asking questions, we should ideally delete our vector index to save resources.

python
pc.delete_index(index_name)