# Mamba: Linear-Time Sequence Modeling with Selective State Spaces

Albert Gu*1 and Tri Dao*2
1 Machine Learning Department, Carnegie Mellon University
2 Department of Computer Science, Princeton University
agu@cs.cmu.edu, tri@tridao.me
*Equal contribution.

# Abstract

Foundation models, now powering most of the exciting applications in deep learning, are almost universally based on the Transformer architecture and its core attention module. Many subquadratic-time architectures such as linear attention, gated convolution and recurrent models, and structured state space models (SSMs) have been developed to address Transformers' computational inefficiency on long sequences, but they have not performed as well as attention on important modalities such as language. We identify that a key weakness of such models is their inability to perform content-based reasoning, and make several improvements. First, simply letting the SSM parameters be functions of the input addresses their weakness with discrete modalities, allowing the model to selectively propagate or forget information along the sequence length dimension depending on the current token. Second, even though this change prevents the use of efficient convolutions, we design a hardware-aware parallel algorithm in recurrent mode. We integrate these selective SSMs into a simplified end-to-end neural network architecture without attention or even MLP blocks (Mamba). Mamba enjoys fast inference (5× higher throughput than Transformers) and linear scaling in sequence length, and its performance improves on real data up to million-length sequences. As a general sequence model backbone, Mamba achieves state-of-the-art performance across several modalities such as language, audio, and genomics. On language modeling, our Mamba-3B model outperforms Transformers of the same size and matches Transformers twice its size, both in pretraining and downstream evaluation.
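To make the selection mechanism described above concrete, the following is a minimal sketch of a selective SSM evaluated in recurrent mode. It is illustrative only, not the paper's hardware-aware parallel scan: the function name, the diagonal state matrix, the simplified (Euler-style) discretization of B, and the random projections producing the input-dependent B, C, and Δ are assumptions made for the sake of a short, runnable example.

```python
# Illustrative sketch of a selective SSM in recurrent mode (NOT the paper's
# hardware-aware parallel implementation). Shapes and the way B, C, and delta
# are derived from the input are assumptions for this example.
import numpy as np

def selective_scan(x, A, B, C, delta):
    """x: (L, D) inputs; A: (D, N) diagonal state matrix;
    B, C: (L, N) input-dependent parameters; delta: (L, D) step sizes."""
    L, D = x.shape
    N = A.shape[1]
    h = np.zeros((D, N))                               # one state per channel
    ys = []
    for t in range(L):
        # Per-timestep discretization: the input-dependent delta controls how
        # much of the previous state is kept versus how much of x[t] is written.
        A_bar = np.exp(delta[t][:, None] * A)          # (D, N), exact for diagonal A
        B_bar = delta[t][:, None] * B[t][None, :]      # (D, N), simplified Euler step
        h = A_bar * h + B_bar * x[t][:, None]          # selective recurrence
        ys.append(h @ C[t])                            # read out with C_t
    return np.stack(ys)                                # (L, D)

# Tiny usage example: B, C, and delta are *functions of the input* (here via
# random projections), which is the "selection" idea from the abstract.
L, D, N = 16, 4, 8
rng = np.random.default_rng(0)
x = rng.standard_normal((L, D))
A = -np.exp(rng.standard_normal((D, N)))               # negative reals keep exp(delta*A) stable
B, C = x @ rng.standard_normal((D, N)), x @ rng.standard_normal((D, N))
delta = np.log1p(np.exp(x @ rng.standard_normal((D, 1))))   # softplus, so delta > 0
y = selective_scan(x, A, B, C, np.broadcast_to(delta, (L, D)))
print(y.shape)  # (16, 4)
```

The key point is that A_bar and B_bar change at every timestep because Δ, B, and C are computed from x, which lets the model decide, token by token, what to keep in its state and what to overwrite; with input-independent parameters the same loop collapses back to a standard SSM that could also be evaluated as a convolution.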
# 1 Introduction

Foundation models (FMs), or large models pretrained on massive data then adapted for downstream tasks, have emerged as an effective paradigm in modern machine learning. The backbone of these FMs is often a sequence model, operating on arbitrary sequences of inputs from a wide variety of domains such as language, images, speech, audio, time series, and genomics (Brown et al. 2020; Dosovitskiy et al. 2020; Ismail Fawaz et al. 2019; Oord et al. 2016; Poli et al. 2023; Sutskever, Vinyals, and Quoc V. Le 2014). While this concept is agnostic to a particular choice of model architecture, modern FMs are predominantly based on a single type of sequence model: the Transformer (Vaswani et al. 2017) and its core attention layer (Bahdanau, Cho, and Bengio 2015). The efficacy of self-attention is attributed to its ability to route information densely within a context window, allowing it to model complex data.
However, this property brings fundamental drawbacks: an inability to model anything outside of a finite window, and quadratic scaling with respect to the window length. An enormous body of research has appeared on more efficient variants of attention to overcome these drawbacks (Tay, Dehghani, Bahri, et al. 2022), but often at the expense of the very properties that make it effective. As of yet, none of these variants have been shown to be empirically effective at scale across domains.
Recently, structured state space sequence models (SSMs) (Gu, Goel, and Ré 2022; Gu, Johnson, Goel, et al. 2021) have emerged as a promising class of architectures for sequence modeling. These models can be interpreted as a combination of recurrent neural networks (RNNs) and convolutional neural networks (CNNs), with inspiration from classical state space models (Kalman 1960). This class of models can be computed very efficiently as either a recurrence or a convolution, with linear or near-linear scaling in sequence length.
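For concreteness, the computation these models share can be written, in the conventional discretized notation (where $\bar{A}$, $\bar{B}$ denote discretizations of underlying continuous parameters $(\Delta, A, B)$), as either a recurrence or a convolution:

$$
\begin{aligned}
&\text{Recurrent mode:} && h_t = \bar{A}\,h_{t-1} + \bar{B}\,x_t, \qquad y_t = C\,h_t \\
&\text{Convolutional mode:} && \bar{K} = \big(C\bar{B},\; C\bar{A}\bar{B},\; \dots,\; C\bar{A}^{k}\bar{B},\; \dots\big), \qquad y = x * \bar{K}
\end{aligned}
$$

The recurrent form gives constant time and memory per step for autoregressive inference, while the convolutional form lets the whole output sequence be computed in parallel during training; being able to switch between the two is what yields the linear or near-linear scaling in sequence length.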
Additionally, they have principled mechanisms for modeling long-range dependencies (Gu, Dao, et al. 2020) in certain data modalities, and have dominated benchmarks such as the Long Range Arena (Tay, Dehghani, Abnar, et al. 2021). Many flavors of SSMs (Gu, Goel, and Ré 2022; Gu, Gupta, et al. 2022; Gupta, Gu, and Berant 2022; Y. Li et al. 2023; Ma et al. 2023; Orvieto et al. 2023; Smith, Warrington, and Linderman 2023) have been successful in domains involving continuous signal data such as audio and vision (Goel et al. 2022; Nguyen, Goel, et al. 2022; Saon, Gupta, and Cui 2023).