Summary
DeepSeek-V2, released in June 2024, built on the success of DeepSeek's previous papers to set a new standard for training and inference efficiency. The changes that set DeepSeek-V2 apart from prior open-source models occur in two core components of the transformer architecture: the attention block and the feed-forward network (see the image below).
The two key changes can be summarized as follows:
1. Feed-Forward Network Optimization: DeepSeekMoE architecture
Mixture of experts (MoE) layers are a drop-in replacement for the feed-forward layer in the standard transformer architecture. Prior to DeepSeekMoE, most MoE architectures functioned by splitting the feed-forward layer into several large parallel feed-forward layers. Each input token would then "choose" 1 or 2 of these parallel feed-forward layers, also known as "experts", for its own computation. This architecture had one key problem - namely, each expert needed to learn large amounts of redundant information, since processing any token on any topic requires an understanding of grammar, semantics, etc. DeepSeek solved this redundancy problem, thereby greatly increasing the learning efficiency of the MoE architecture, through three key innovations: more numerous, finer-grained experts; separating experts into shared and routed experts; and load balancing tokens across experts and devices. For more details on these innovations, see the previous blog post in the series (a minimal routing sketch also appears just after this list).
2. Attention Layer Optimization: Multi-head Latent Attention (MLA)
Multi-head attention, described in detail in my other post, utilizes three matrices to produce new representations of the input tokens: the Query, Key, and Value matrices. Each of these matrices has dimension n x d, where n is the length of the sequence and d is the dimension of the vector representing each token. Standard transformers cache the Key and Value matrices for every layer fully in-memory at inference time, improving speed but resulting in large memory overhead. DeepSeek-V2's solution is to compress each token's keys and values at each layer into a single low-dimensional latent vector. At inference time, only these latent vectors need to be cached, substantially reducing memory requirements.
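To make point 1 concrete, here is a minimal, illustrative sketch of a DeepSeekMoE-style layer in PyTorch. Everything here (class names, sizes, the `top_k` routing scheme) is my own simplification for illustration; the real implementation differs in many details, and the load-balancing losses are omitted entirely.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Expert(nn.Module):
    """A small feed-forward network; fine-grained experts use a reduced hidden size."""
    def __init__(self, d_model: int, d_hidden: int):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(d_model, d_hidden), nn.GELU(), nn.Linear(d_hidden, d_model))

    def forward(self, x):
        return self.net(x)

class DeepSeekMoESketch(nn.Module):
    """Illustrative MoE layer: shared experts run on every token; each token is routed to top_k routed experts."""
    def __init__(self, d_model=64, d_hidden=128, n_shared=2, n_routed=8, top_k=2):
        super().__init__()
        self.shared = nn.ModuleList(Expert(d_model, d_hidden) for _ in range(n_shared))
        self.routed = nn.ModuleList(Expert(d_model, d_hidden) for _ in range(n_routed))
        self.router = nn.Linear(d_model, n_routed, bias=False)
        self.top_k = top_k

    def forward(self, x):                                   # x: (n_tokens, d_model)
        out = sum(e(x) for e in self.shared)                # shared experts: applied to every token
        scores = F.softmax(self.router(x), dim=-1)          # routing probabilities over routed experts
        topk_scores, topk_idx = scores.topk(self.top_k, dim=-1)
        for slot in range(self.top_k):                      # routed experts: only top_k per token
            idx, w = topk_idx[:, slot], topk_scores[:, slot:slot + 1]
            for e_id in idx.unique().tolist():
                mask = idx == e_id
                out[mask] += w[mask] * self.routed[e_id](x[mask])
        return out

# Usage: route 5 tokens of dimension 64 through the sketch layer.
tokens = torch.randn(5, 64)
print(DeepSeekMoESketch()(tokens).shape)                    # torch.Size([5, 64])
```

The key structural ideas survive the simplification: many small experts instead of a few large ones, a set of shared experts that every token passes through, and a router that sends each token to only a handful of the routed experts.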
Since we already described the DeepSeekMoE architecture in detail in the previous blog post of this series, this post will focus primarily on multi-head latent attention. We'll start by describing the problem it aims to solve, then move onto describing the intuition behind MLA's solution, and finally dive into the concrete math describing the method. We'll then end this post by discussing the effects that the combination of DeepSeekMoE and multi-head latent attention have on training and inference efficiency. Let's dive in!
Note: This post is part of the “Understanding DeepSeek” Series:
[This article] Understanding DeepSeek Part II: DeepSeek-V2
[Upcoming] Understanding DeepSeek Part III: DeepSeek-V3
[Upcoming] Understanding DeepSeek Part IV: DeepSeekMath
[Upcoming] Understanding DeepSeek Part V: DeepSeek-Prover-V1.5
[Upcoming] Understanding DeepSeek Part VI: DeepSeek-R1
[Upcoming] Understanding DeepSeek Part VII: Implications for the AI Industry and the World
The Memory Efficiency Problem with Standard Multi-Head Attention
Standard multi-head attention, at its core, solves the problem of deciding how to update our understanding of one concept, given a set of other, potentially-related concepts. In the case of language modeling, we want to update our understanding of a particular token using the understanding of the other tokens present in the sequence. To accomplish this, at each attention layer in a transformer, the model learns to parametrize three key matrices: the Query, Key, and Value matrices. These three matrices work together to identify the most relevant portions of the sequence for each token, and then to update each token's representation based on the relevant portions that were found. I won't cover the full details of how this is done here, but you can reference my other blog post for more information.
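As a quick, non-DeepSeek-specific refresher, here is the core single-head computation in PyTorch; the function and variable names are my own, and causal masking is omitted for brevity.

```python
import torch
import torch.nn.functional as F

def single_head_attention(x, W_q, W_k, W_v):
    """x: (n, d) token representations; W_q, W_k, W_v: (d, d) learned projection matrices."""
    Q, K, V = x @ W_q, x @ W_k, x @ W_v       # one query, key, and value vector per token
    scores = Q @ K.T / K.shape[-1] ** 0.5     # (n, n): relevance of every token to every other token
    return F.softmax(scores, dim=-1) @ V      # weighted mix of value vectors -> new (n, d) representations

n, d = 4, 8
x = torch.randn(n, d)
out = single_head_attention(x, *(torch.randn(d, d) for _ in range(3)))
print(out.shape)                              # torch.Size([4, 8])
```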
Now, when language models produce output at inference time, we essentially need to place the transformer in a while loop. Until the transformer outputs an "End of Sequence" token, we feed the input sequence into the transformer to produce the next token; then, appending that next token to the input sequence, we feed the newly-elongated sequence back into the transformer and repeat the process.
The key insight that enables caching here is the following: since modern LLMs are causal, meaning future tokens cannot influence previous tokens, by adding a new token to the end of the input sequence, we do not change the representation of any of the previous tokens. Hence, we do not need to recompute the hidden representations for the previous tokens, since these will be identical.
The only token for which we need to compute a new representation is the next token in the sequence (that is, the one token that doesn't exist yet)! Another core insight that follows from this observation is that we only need the key and value vectors for each previous token to compute the new token's representation. Since the previous tokens' representations do not change, we don't need to use the other tokens as "queries" to update their representations. However, we do need their key and value vectors so that we can "query" these vectors with the new token's query vector.
The above observations then give us a road map for caching values in the transformer in order to limit the number of computations we perform and speed up inference time. In particular, we must cache the Key and Value matrices at each hidden layer so that we can use these to compute the hidden representation for the new token.
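Below is a minimal, runnable sketch of that roadmap for a single attention head at a single layer, using random stand-in weights. In a real model the same caching happens per head and per layer, and the sampled token's embedding (not the attention output itself) is what gets fed back in; this is only meant to show where the cache grows.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
d = 8
W_q, W_k, W_v = (torch.randn(d, d) for _ in range(3))  # stand-ins for one layer's learned projections

def decode_step(h_new, k_cache, v_cache):
    """Compute the new token's representation using only cached keys/values from previous tokens."""
    q, k, v = h_new @ W_q, h_new @ W_k, h_new @ W_v
    k_cache = torch.cat([k_cache, k[None]])             # append the new token's key ...
    v_cache = torch.cat([v_cache, v[None]])             # ... and value to the cache
    scores = k_cache @ q / d ** 0.5                     # attend over all cached tokens (plus itself)
    return F.softmax(scores, dim=-1) @ v_cache, k_cache, v_cache

k_cache = torch.empty(0, d)                             # cache starts empty ...
v_cache = torch.empty(0, d)
h = torch.randn(d)                                      # hidden state of the current token
for _ in range(5):                                      # ... and grows by one key and one value per step
    h, k_cache, v_cache = decode_step(h, k_cache, v_cache)
print(k_cache.shape, v_cache.shape)                     # torch.Size([5, 8]) torch.Size([5, 8])
```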
Now let’s compute the memory requirements to store these cached values for Llama 3.3 70B, a state-of-the-art open source model at the time of writing. (In practice, Llama 3.3 uses Grouped-Query Attention which actually reduces caching requirements. For the sake of simplicity, we'll assume it uses standard attention here.)
Llama 3.3 has 80 attention layers. Each key and value vector in these attention layers has a dimension of 8192. And Llama 3.3 has a maximum context length of 128,000 tokens.
If Llama 3.3 is used in the default floating point 16 (FP16) mode, then each stored number will take up 2 bytes (16 bits). Hence, a single vector consisting of 8192 floating point numbers will take up 16,384 bytes, or equivalently 16.384 kilobytes. For each cached token in our input, we need to store both a key vector *and* a value vector at each layer. Hence, at every layer, a cached token will require two vectors, totaling 32.768 KB in memory. Since there are 80 such layers, the cost to cache one token is thus 80 * 32.768 KB = 2621.44 KB (equivalently, 2.62 MB).
Now suppose our input is 10,000 tokens long and we are producing the next token in the sequence. To cache the necessary data for the previous tokens, we need 10,000 * 2.62 MB = 26,200 MB (equivalently, 26.2 GB).
If our input uses the full Llama 3.3 context length of 128,000 tokens, the required space is 128,000 * 2.62 MB = 335,360 MB (equivalently, 335.36 GB).
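The arithmetic above is easy to reproduce. Here is a short script mirroring the same simplifying assumptions (standard attention, FP16, one 8192-dimensional key and one value vector per token at each of the 80 layers):

```python
layers, d_model, bytes_per_value = 80, 8192, 2              # Llama 3.3 70B, simplified to standard attention, FP16

bytes_per_token = layers * 2 * d_model * bytes_per_value    # one key + one value vector per layer
print(bytes_per_token / 1e6, "MB per cached token")         # ~2.62 MB

for n_tokens in (10_000, 128_000):
    print(n_tokens, "tokens:", bytes_per_token * n_tokens / 1e9, "GB")  # ~26.2 GB and ~335 GB
```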
As can be seen by the above example, memory requirements for the cache expand quickly as the input length increases. This makes it incredibly difficult to serve models with long context windows. In order to solve this problem with the standard transformer architecture, DeepSeek introduced Multi-head Latent Attention (MLA).
Multi-head Latent Attention (MLA)
In order to overcome these memory efficiency issues, DeepSeek created the Multi-head Latent Attention layer. This layer modifies standard multi-head attention (depicted on the left side of the above image) by compressing each token's keys and values into a single low-dimensional latent vector. In practice, this looks like the following:
That is, in place of the usual key and value projections, our model now learns three matrices per layer - one down-projection matrix and two up-projection matrices. With these three matrices, we no longer need to store each token's full key and value vectors when caching previously-computed tokens. Instead, for each cached token we store one compressed latent vector per layer, and that latent vector contains all of the information needed to reconstruct the token's keys and values.
Thus, if we have L layers and n cached tokens, we now only need to store d_c * L * n values (one d_c-dimensional latent vector per token at each layer), rather than the 2 * d * L * n values required to cache the full key and value vectors.
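Here is a minimal sketch of that compression path in PyTorch, using the paper's W^DKV, W^UK, and W^UV names but random stand-in weights and my own shapes. DeepSeek's actual MLA also low-rank-compresses the queries and adds a decoupled RoPE branch, both of which are omitted here.

```python
import torch

d_model, n_heads, d_head = 8192, 64, 128
d_c = 4 * d_head                                    # latent dimension, following DeepSeek-V2's d_c = 4 * d_h

W_dkv = torch.randn(d_model, d_c) * 0.02            # shared down-projection (learned in a real model)
W_uk  = torch.randn(d_c, n_heads * d_head) * 0.02   # up-projection back to per-head keys
W_uv  = torch.randn(d_c, n_heads * d_head) * 0.02   # up-projection back to per-head values

h_t = torch.randn(d_model)                          # hidden state of one token at one layer

c_kv = h_t @ W_dkv                                  # (d_c,) -- this is all we cache for this token/layer
k_t = (c_kv @ W_uk).view(n_heads, d_head)           # keys reconstructed on the fly when needed
v_t = (c_kv @ W_uv).view(n_heads, d_head)           # values reconstructed on the fly when needed

print(c_kv.numel(), "cached values vs", k_t.numel() + v_t.numel())  # 512 vs 16384
```

(In the paper, the up-projections can even be absorbed into the query and output projections at inference time, so the full keys and values never need to be materialized at all.)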
Let's take the example of Llama 3.3 that we illustrated above to see how much this gains us - previously, caching the full Key and Value matrices for the full 128,000 token context length of Llama 3.3 required 335.36 GB. Now, instead of caching the full matrices, let's imagine we've augmented Llama 3.3 to use MLA. DeepSeek sets the dimension of the latent vector to four times the per-head dimension; Llama 3.3 has 64 attention heads, so its per-head dimension is 8192 / 64 = 128 and the latent dimension is 512. Each latent vector therefore takes up 512 * 2 bytes ≈ 1.02 KB, so caching one token at each of Llama 3.3's 80 layers requires 80 * 1.02 KB ≈ 0.08 MB. For the full 128,000 token context, the cache then requires roughly 128,000 * 0.08 MB ≈ 10.5 GB.
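Again, the arithmetic is easy to check (same simplifying assumptions as before, now with a 512-dimensional latent vector per token per layer):

```python
layers, n_tokens, bytes_per_value = 80, 128_000, 2
d_c = 4 * 128                                       # 4x the per-head dimension (8192 / 64 heads = 128)

mla_cache = layers * n_tokens * d_c * bytes_per_value
print(mla_cache / 1e9, "GB")                        # ~10.5 GB, vs ~335 GB for full key/value caching
```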
This is roughly a 32x reduction from the initial requirement of 335.36 GB for standard attention, demonstrating the efficiency gains this approach can deliver.
Training and Inference Efficiency
DeepSeek-V2 introduces significant efficiency improvements in both training and inference compared to its predecessor, DeepSeek 67B, primarily through innovations in its architecture, especially Multi-head Latent Attention (MLA). By compressing each token's keys and values into a small latent vector, MLA dramatically reduces memory consumption during inference. The reduction of the KV cache by approximately 93.3% translates directly into substantial gains in maximum generation throughput, allowing DeepSeek-V2 to achieve throughput levels up to 5.76 times greater than those observed in DeepSeek 67B. These optimizations enable DeepSeek-V2 to handle much longer contexts (up to 128K tokens) efficiently, positioning it as one of the most practical choices among large-scale language models for real-world applications where large-context inference is critical.
Additionally, the integration of DeepSeekMoE into the feed-forward network layers synergizes well with MLA, enabling significant computational savings without sacrificing model performance. By activating only a fraction (21B) of its total parameters (236B), DeepSeek-V2 trains economically, saving 42.5% of training costs compared to its dense predecessor, DeepSeek 67B. Thus, the combination of DeepSeekMoE and MLA improves not only inference-time efficiency but also the cost-effectiveness of the pretraining phase.
Results and Key Takeaways
The innovative Multi-head Latent Attention layer significantly enhances the practical deployability of DeepSeek-V2. Compared to traditional Multi-Head Attention, MLA achieves superior inference performance while simultaneously overcoming the KV cache bottleneck. With its novel low-rank joint compression strategy, MLA significantly reduces inference memory overhead, making DeepSeek-V2 particularly suited for high-throughput, real-time applications requiring extensive context management.
Empirical evaluations on various benchmarks illustrate the clear strengths of DeepSeek-V2, even when compared against other leading open-source models of the time. Notably, DeepSeek-V2 consistently achieved top-tier performance on benchmarks such as MMLU, math reasoning tasks, and coding challenges, highlighting the architectural advantages introduced by MLA. Moreover, these enhancements enabled DeepSeek-V2 to be trained and served at a fraction of the cost of comparably performing dense models (see above image).
All in all, Multi-head Latent Attention represented another significant milestone for DeepSeek on the path towards the highly-optimized training and inference that would later power DeepSeek-V3 and DeepSeek-R1. The next blog post in this series will dive into the new innovations introduced for DeepSeek-V3, which builds upon the foundations laid here and forms the base model used to train DeepSeek's state-of-the-art reasoning model.