Building a Fully Local LLM Agent with LM Studio and Cogito V1
State-of-the-art (SotA) LLMs are no longer limited to big AI superclusters, locked away behind an API key. Open source and open weight LLMs, together with open source software such as LM Studio and LiteLLM, allow us to build entirely locally.
In this post, we'll show you how to do exactly that using the excellent Cogito V1 models, LM Studio, and a dash of tool-calling magic to build your own async-capable local agent.
By the end of this tutorial, you'll have:
- A local LLM server running Cogito V1
- A working dev environment with async streaming completions
- Tool/function calling support (e.g., live web search)
- An agent abstraction for iterative reasoning
Getting Started: LM Studio + Cogito V1
The first step is to grab LM Studio, a local LLM runner that's dead simple to use. Download it from lmstudio.ai.
Next, go to the Discover tab inside LM Studio and search for `cogito`. Pick the model `cogito-v1-preview-qwen-32b` and hit that green Download button.
Once the download is done:
- Switch to the Server tab
- Start the server on port `1234`
- Load your chosen model
You should now be able to query the LLM locally at `http://localhost:1234/v1`.
Confirm the Server Is Running
Run this quick check in your terminal:
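For example, using curl against LM Studio's OpenAI-compatible models endpoint:

```bash
curl http://localhost:1234/v1/models
```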
You should see a list of available models, including `cogito-v1-preview-qwen-32b`:
Set Up Your Project
Clone the repo from the Aurelio Labs Cookbook:
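Something along these lines, assuming the repo lives at `github.com/aurelio-labs/cookbook`:

```bash
git clone https://github.com/aurelio-labs/cookbook.git
cd cookbook
```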
Use UV to manage your Python environment:
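Assuming the project ships a `pyproject.toml`, something like:

```bash
uv venv                      # create a local virtual environment
source .venv/bin/activate    # activate it
uv sync                      # install the project's dependencies
```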
You now have a ready-to-go environment with all dependencies installed, including `litellm` and `graphai`.
First Completion Call
This code snippet initializes the LiteLLM client, sets the environment variable for the local server, and sends a basic prompt to the LLM:
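A minimal sketch of that call, assuming LiteLLM's LM Studio provider and its `LM_STUDIO_API_BASE` / `LM_STUDIO_API_KEY` environment variables (the prompt here is just an example):

```python
import os
from litellm import completion

# point LiteLLM's LM Studio provider at the local server
os.environ["LM_STUDIO_API_BASE"] = "http://localhost:1234/v1"
os.environ["LM_STUDIO_API_KEY"] = "lm-studio"  # LM Studio ignores the key, but one must be set

response = completion(
    model="lm_studio/cogito-v1-preview-qwen-32b",
    messages=[{"role": "user", "content": "Hello! Tell me about yourself."}],
)
```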
From this, we will get a `ModelResponse` object, which includes the returned assistant message. We access the `content` with:
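```python
# standard OpenAI-style access on LiteLLM's ModelResponse
print(response.choices[0].message.content)
```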
Async Streaming with LiteLLM
In most use cases we're likely to be using async code to enable a more scalable application, and we'll also likely be using streaming, which allows us to build more user-friendly and responsive interfaces. For async we use the `acompletion` function from LiteLLM, and we stream the tokens by setting `stream=True`.
We can then parse and print each token as our LLM generates it like so:
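A sketch of that loop (the wrapper function and prompt handling are our own choices):

```python
from litellm import acompletion

async def stream_completion(prompt: str) -> str:
    response = await acompletion(
        model="lm_studio/cogito-v1-preview-qwen-32b",
        messages=[{"role": "user", "content": prompt}],
        stream=True,
    )
    answer = ""
    # each chunk carries a delta containing the newly generated token(s)
    async for chunk in response:
        token = chunk.choices[0].delta.content or ""
        print(token, end="", flush=True)
        answer += token
    return answer
```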
Once streaming is complete, we should see a full response:
Tool Calls and Function Calling
Now let's enable function calling (aka tool use). Not all models support tool use, and LiteLLM provides the `supports_function_calling` function to check whether an LLM supports it. However, this isn't particularly reliable for LM Studio models, and for the Cogito v1 models it returns `False`:
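```python
from litellm import supports_function_calling

supports_function_calling(model="lm_studio/cogito-v1-preview-qwen-32b")
```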
Returns:
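```
False
```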
Cogito v1 does support tool use, so LiteLLM is wrong here. However, we do need to make some modifications to how we're calling the model. For tool use with Cogito v1 on LM Studio we proxy through LiteLLM's OpenAI provider so that LiteLLM sends our endpoint OpenAI-standard requests. To do this, we replace our `lm_studio` prefix with `openai`:
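```python
# before: "lm_studio/cogito-v1-preview-qwen-32b"
model = "openai/cogito-v1-preview-qwen-32b"
```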
Then we pass the `base_url` parameter to tell LiteLLM to call our LM Studio endpoint (`http://localhost:1234/v1`) rather than the default OpenAI endpoint (`https://api.openai.com/v1`):
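A sketch of the proxied call; the prompt and the placeholder API key are only illustrative (LM Studio doesn't check the key, but LiteLLM's `openai` provider requires one):

```python
from litellm import acompletion

response = await acompletion(
    model="openai/cogito-v1-preview-qwen-32b",
    base_url="http://localhost:1234/v1",  # LM Studio, not api.openai.com
    api_key="lm-studio",                  # placeholder; LM Studio ignores it
    messages=[{"role": "user", "content": "How many 'r's are in 'strawberry'?"}],
)
print(response.choices[0].message.content)
```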
Expected response:
Add Web Search as a Tool
Now that we've put together our completion and proxy calls, let's define a tool. We'll use SerpAPI to build a simple web search tool. We do need an API key for this, but it comes with 100 free calls per month.
We'll also be using `aiohttp` to make the HTTP request asynchronously, keeping our full LLM and tool execution pipeline asynchronous. To call SerpAPI we do:
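A sketch of that request, using SerpAPI's Google Search endpoint; the `SERPAPI_API_KEY` environment variable name and the helper function are our own choices:

```python
import os
import aiohttp

async def serpapi_search(query: str) -> list[dict]:
    params = {
        "engine": "google",
        "q": query,
        "num": 10,
        "api_key": os.environ["SERPAPI_API_KEY"],
    }
    async with aiohttp.ClientSession() as session:
        async with session.get("https://serpapi.com/search", params=params) as response:
            data = await response.json()
    # the organic web results live under the "organic_results" key
    return data["organic_results"]
```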
The results contain a list of 10 records, each with a `title`, `link`, `snippet`, and `source`, among other fields.
The results are fairly messy, so we can clean them up and organize them into a Pydantic `BaseModel`, which we'll call `Article` and which will include attributes for `title`, `source`, `link`, and `snippet`. We also define the classmethod `from_serpapi_result` to convert a raw SerpAPI result into our `Article` object, and the `__str__` method to format the object as a markdown string that we will provide back to our LLM.
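A sketch of that model (the exact markdown layout in `__str__` is a stylistic choice):

```python
from pydantic import BaseModel

class Article(BaseModel):
    title: str
    source: str
    link: str
    snippet: str

    @classmethod
    def from_serpapi_result(cls, result: dict) -> "Article":
        # keep only the fields we care about from a raw SerpAPI record
        return cls(
            title=result["title"],
            source=result["source"],
            link=result["link"],
            snippet=result["snippet"],
        )

    def __str__(self) -> str:
        # markdown-formatted summary we can feed back to the LLM
        return (
            f"### {self.title}\n"
            f"*{self.source}*\n\n"
            f"{self.snippet}\n\n"
            f"[Read more]({self.link})"
        )
```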
To create a list of `Article` objects from the SerpAPI results we do:
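For example (the query string is just illustrative):

```python
results = await serpapi_search("latest developments in open source LLMs")
articles = [Article.from_serpapi_result(result) for result in results]
```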
Returning a list of `Article` objects:
Let's display one of those in markdown:
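If you're following along in a notebook, IPython's display utilities work well here:

```python
from IPython.display import Markdown, display

display(Markdown(str(articles[0])))
```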
Giving us this:
Finally, we can refactor all of this into a single async function that our LLM can call:
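A sketch of that function; the name `web_search` and its docstring are our own choices, and the docstring is worth writing carefully since it becomes part of the tool description the LLM sees:

```python
async def web_search(query: str) -> str:
    """Search the web and return a markdown summary of the top results."""
    results = await serpapi_search(query)
    articles = [Article.from_serpapi_result(result) for result in results]
    return "\n\n".join(str(article) for article in articles)
```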
Our LLM doesn't call this function directly. Instead, given a set of function schemas, the LLM decides which functions/tools to call and which arguments to provide. To generate these schemas we use the `get_schemas` function from graphai-lib:
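Something along these lines; the exact import path for `get_schemas` depends on your graphai-lib version, so check the package docs:

```python
# import path is an assumption; adjust to wherever get_schemas lives in your graphai version
from graphai.utils import get_schemas

tools = get_schemas([web_search])
```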
This gives us a list of function schemas in `tools` which look like this:
We then execute our query, passing our `tools` list to the `tools` parameter like so:
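A sketch of that call, with the same proxy settings as before and an illustrative question:

```python
query = "What is the latest news in AI agents?"

response = await acompletion(
    model="openai/cogito-v1-preview-qwen-32b",
    base_url="http://localhost:1234/v1",
    api_key="lm-studio",
    messages=[{"role": "user", "content": query}],
    tools=tools,
)
print(response.choices[0].message.tool_calls)
```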
Returning:
Our LLM has generated the tool choice and input parameters for our tool, but it has not executed the tool; we must handle that ourselves. To do so, we will create a mapping from tool names to their functions.
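With our single tool, that mapping is tiny:

```python
# map the tool name the LLM will emit to the coroutine that implements it
name2tool = {"web_search": web_search}
```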
Now we execute the tool like so:
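A sketch of that step, reading the tool call out of the response in the usual OpenAI format:

```python
import json

tool_call = response.choices[0].message.tool_calls[0]
tool_name = tool_call.function.name
tool_args = json.loads(tool_call.function.arguments)  # arguments arrive as a JSON string

# run the coroutine the LLM asked for, with the arguments it generated
tool_output = await name2tool[tool_name](**tool_args)
```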
We then format the tool output and the initial tool call from our LLM into messages, and feed them back into our LLM for a final response.
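Roughly like this, following the assistant message with a `tool` message that carries the tool's output:

```python
messages = [
    {"role": "user", "content": query},
    response.choices[0].message,  # the assistant message containing the tool call
    {
        "role": "tool",
        "tool_call_id": tool_call.id,
        "content": tool_output,
    },
]
```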
We feed this into our LLM:
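Again with the same proxied settings:

```python
# second pass: the LLM now answers using the tool output
final_response = await acompletion(
    model="openai/cogito-v1-preview-qwen-32b",
    base_url="http://localhost:1234/v1",
    api_key="lm-studio",
    messages=messages,
)
print(final_response.choices[0].message.content)
```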
Giving us:
Building a Simple Agent
We can wrap all of this up into some easier-to-use agentic logic that keeps track of the conversation, executes tools when needed, and so on, like so:
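Here's a minimal sketch of such an agent; the class name, constructor parameters, and iteration cap are all our own choices, not a fixed API:

```python
import json
from litellm import acompletion


class Agent:
    """Minimal agent loop: call the LLM, execute any requested tools,
    and feed the results back until the LLM produces a final answer."""

    def __init__(
        self,
        model: str,
        tools: list[dict],
        name2tool: dict,
        base_url: str = "http://localhost:1234/v1",
        max_iterations: int = 5,
    ):
        self.model = model
        self.tools = tools
        self.name2tool = name2tool
        self.base_url = base_url
        self.max_iterations = max_iterations
        self.messages: list = []

    async def run(self, query: str) -> str:
        self.messages.append({"role": "user", "content": query})
        for _ in range(self.max_iterations):
            response = await acompletion(
                model=self.model,
                base_url=self.base_url,
                api_key="lm-studio",  # placeholder; LM Studio ignores it
                messages=self.messages,
                tools=self.tools,
            )
            message = response.choices[0].message
            self.messages.append(message)
            # no tool calls means the LLM has produced its final answer
            if not message.tool_calls:
                return message.content
            # otherwise execute each requested tool and append its output
            for tool_call in message.tool_calls:
                args = json.loads(tool_call.function.arguments)
                output = await self.name2tool[tool_call.function.name](**args)
                self.messages.append({
                    "role": "tool",
                    "tool_call_id": tool_call.id,
                    "content": str(output),
                })
        return "Stopped after reaching the maximum number of iterations."
```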
Usage:
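The question here is just an example:

```python
agent = Agent(
    model="openai/cogito-v1-preview-qwen-32b",
    tools=tools,
    name2tool=name2tool,
)

answer = await agent.run("What's the latest news in AI agents?")
print(answer)
```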
Sample output:
Wrapping Up
We just:
- Ran Cogito V1 locally using LM Studio
- Built a completion and async streaming pipeline
- Wired up tool calling
- Created a reusable async agent
The best part? We're running everything locally.
Now swap in your favorite quantized models (Mistral, LLaMA, etc.) and extend with more tools, memory, and reasoning flows.