# Mamba: Linear-Time Sequence Modeling with Selective State Spaces

Albert Gu*1 and Tri Dao*2
1 Machine Learning Department, Carnegie Mellon University
2 Department of Computer Science, Princeton University
agu@cs.cmu.edu, tri@tridao.me
*Equal contribution.

# Abstract

Foundation models, now powering most of the exciting applications in deep learning, are almost universally based on the Transformer architecture and its core attention module. Many subquadratic-time architectures such as linear attention, gated convolution and recurrent models, and structured state space models (SSMs) have been developed to address Transformers' computational inefficiency on long sequences, but they have not performed as well as attention on important modalities such as language. We identify that a key weakness of such models is their inability to perform content-based reasoning, and make several improvements. First, simply letting the SSM parameters be functions of the input addresses their weakness with discrete modalities, allowing the model to selectively propagate or forget information along the sequence length dimension depending on the current token. Second, even though this change prevents the use of efficient convolutions, we design a hardware-aware parallel algorithm in recurrent mode. We integrate these selective SSMs into a simplified end-to-end neural network architecture without attention or even MLP blocks (Mamba). Mamba enjoys fast inference (5× higher throughput than Transformers) and linear scaling in sequence length, and its performance improves on real data up to million-length sequences. As a general sequence model backbone, Mamba achieves state-of-the-art performance across several modalities such as language, audio, and genomics. On language modeling, our Mamba-3B model outperforms Transformers of the same size and matches Transformers twice its size, both in pretraining and downstream evaluation.
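To make the selection mechanism described above concrete, the following is a minimal sketch of a selective SSM evaluated in recurrent mode. It is illustrative only, not the paper's hardware-aware parallel scan: the function name, the diagonal state matrix, the simplified (Euler-style) discretization of B, and the random projections producing the input-dependent B, C, and Δ are assumptions made for the sake of a short, runnable example.

```python
# Illustrative sketch of a selective SSM in recurrent mode (NOT the paper's
# hardware-aware parallel implementation). Shapes and the way B, C, and delta
# are derived from the input are assumptions for this example.
import numpy as np

def selective_scan(x, A, B, C, delta):
    """x: (L, D) inputs; A: (D, N) diagonal state matrix;
    B, C: (L, N) input-dependent parameters; delta: (L, D) step sizes."""
    L, D = x.shape
    N = A.shape[1]
    h = np.zeros((D, N))                               # one state per channel
    ys = []
    for t in range(L):
        # Per-timestep discretization: the input-dependent delta controls how
        # much of the previous state is kept versus how much of x[t] is written.
        A_bar = np.exp(delta[t][:, None] * A)          # (D, N), exact for diagonal A
        B_bar = delta[t][:, None] * B[t][None, :]      # (D, N), simplified Euler step
        h = A_bar * h + B_bar * x[t][:, None]          # selective recurrence
        ys.append(h @ C[t])                            # read out with C_t
    return np.stack(ys)                                # (L, D)

# Tiny usage example: B, C, and delta are *functions of the input* (here via
# random projections), which is the "selection" idea from the abstract.
L, D, N = 16, 4, 8
rng = np.random.default_rng(0)
x = rng.standard_normal((L, D))
A = -np.exp(rng.standard_normal((D, N)))               # negative reals keep exp(delta*A) stable
B, C = x @ rng.standard_normal((D, N)), x @ rng.standard_normal((D, N))
delta = np.log1p(np.exp(x @ rng.standard_normal((D, 1))))   # softplus, so delta > 0
y = selective_scan(x, A, B, C, np.broadcast_to(delta, (L, D)))
print(y.shape)  # (16, 4)
```

The key point is that A_bar and B_bar change at every timestep because Δ, B, and C are computed from x, which lets the model decide, token by token, what to keep in its state and what to overwrite; with input-independent parameters the same loop collapses back to a standard SSM that could also be evaluated as a convolution.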
# 1 Introduction

Foundation models (FMs), or large models pretrained on massive data then adapted for downstream tasks, have emerged as an effective paradigm in modern machine learning. The backbone of these FMs is often a sequence model, operating on arbitrary sequences of inputs from a wide variety of domains such as language, images, speech, audio, time series, and genomics (Brown et al. 2020; Dosovitskiy et al. 2020; Ismail Fawaz et al. 2019; Oord et al. 2016; Poli et al. 2023; Sutskever, Vinyals, and Quoc V. Le 2014). While this concept is agnostic to a particular choice of model architecture, modern FMs are predominantly based on a single type of sequence model: the Transformer (Vaswani et al. 2017) and its core attention layer (Bahdanau, Cho, and Bengio 2015). The efficacy of self-attention is attributed to its ability to route information densely within a context window, allowing it to model complex data.
However, this property brings fundamental drawbacks: an inability to model anything outside of a finite window, and quadratic scaling with respect to the window length. An enormous body of research has appeared on more efficient variants of attention to overcome these drawbacks (Tay, Dehghani, Bahri, et al. 2022), but often at the expense of the very properties that make it effective. As of yet, none of these variants have been shown to be empirically effective at scale across domains.
Recently, structured state space sequence models (SSMs) (Gu, Goel, and Ré 2022; Gu, Johnson, Goel, et al. 2021) have emerged as a promising class of architectures for sequence modeling. These models can be interpreted as a combination of recurrent neural networks (RNNs) and convolutional neural networks (CNNs), with inspiration from classical state space models (Kalman 1960). This class of models can be computed very efficiently as either a recurrence or a convolution, with linear or near-linear scaling in sequence length.
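For concreteness, the computation these models share can be written, in the conventional discretized notation (where $\bar{A}$, $\bar{B}$ denote discretizations of underlying continuous parameters $(\Delta, A, B)$), as either a recurrence or a convolution:

$$
\begin{aligned}
&\text{Recurrent mode:} && h_t = \bar{A}\,h_{t-1} + \bar{B}\,x_t, \qquad y_t = C\,h_t \\
&\text{Convolutional mode:} && \bar{K} = \big(C\bar{B},\; C\bar{A}\bar{B},\; \dots,\; C\bar{A}^{k}\bar{B},\; \dots\big), \qquad y = x * \bar{K}
\end{aligned}
$$

The recurrent form gives constant time and memory per step for autoregressive inference, while the convolutional form lets the whole output sequence be computed in parallel during training; being able to switch between the two is what yields the linear or near-linear scaling in sequence length.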
Additionally, they have principled mechanisms for modeling long-range dependencies (Gu, Dao, et al. 2020) in certain data modalities, and have dominated benchmarks such as the Long Range Arena (Tay, Dehghani, Abnar, et al. 2021). Many flavors of SSMs (Gu, Goel, and Ré 2022; Gu, Gupta, et al. 2022; Gupta, Gu, and Berant 2022; Y. Li et al. 2023; Ma et al. 2023; Orvieto et al. 2023; Smith, Warrington, and Linderman 2023) have been successful in domains involving continuous signal data such as audio and vision (Goel et al. 2022; Nguyen, Goel, et al. 2022; Saon, Gupta, and Cui 2023).