DeepSeek-V3 Rotary Embedding Discrepancy Analysis And Reproduction Guide
Understanding the Rotary Embedding Difference
Rotary embeddings are crucial for transformer models: they encode positional information into the query and key vectors used by attention. The way these embeddings are applied can significantly impact the model's output. In this case, the discrepancy lies in how the query and key positional embeddings (q_pe and k_pe) are handled.
Hugging Face Implementation
In the Hugging Face implementation (https://huggingface.co/deepseek-ai/DeepSeek-V3/blob/main/modeling_deepseek.py#L339), there's an explicit permutation (an interleaving of odd and even columns) applied to q_pe and k_pe. This permutation is a key step in their rotary embedding process.
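For reference, the permutation applied to q_pe and k_pe is, as far as I can tell, a reshape/transpose of the form below. This is a minimal standalone sketch rather than the exact HF code, and the shapes are toy values:

```python
import torch

def interleave_permute(x: torch.Tensor) -> torch.Tensor:
    # Split the head dimension into (d/2, 2) pairs, swap the last two axes,
    # and flatten back: this regroups the even-indexed columns first and the
    # odd-indexed columns second, i.e. the "interleaving" mentioned above.
    b, h, s, d = x.shape
    return x.view(b, h, s, d // 2, 2).transpose(4, 3).reshape(b, h, s, d)

q_pe = torch.randn(1, 2, 4, 8)              # (batch, heads, seq, rope_dim), toy sizes
print(interleave_permute(q_pe)[0, 0, 0])    # channels reordered to [0, 2, 4, 6, 1, 3, 5, 7]
```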
DeepSeek-AI's Implementation
However, in DeepSeek-AI's own implementation in this repository, this permutation step is absent. Additionally, the convert.py script, which loads the HF weights, doesn't permute them to account for the ordering difference in the apply_rotary_embedding() function.
This seemingly small difference has a significant impact: it leads to different numerical results after the attention module in the first dense layer, which means the two implementations, despite using the same weights, will produce different outputs. I wanted to double-check with the team to see if I missed anything. Thanks in advance for any insights!
Recreating the Issue: A Step-by-Step Guide
To demonstrate this issue, I've set up a reproducible scenario. Here's how you can see the difference for yourself:
Environment
First, make sure you have the correct environment. I used transformers==4.54.0 for my tests, which you can install with pip install transformers==4.54.0.
Code
I've created a simple repository (https://github.com/wwwjn/DeepSeek-V3) with the code to reproduce this issue. You can clone it and follow the steps below.
Runs
I ran two separate tests with randomized inputs:
1. Hugging Face Implementation
I used the Hugging Face transformers library with weights from https://huggingface.co/deepseek-ai/DeepSeek-V3-0324/tree/main. The command I used was:
python hf_implementation/hf_implementation.py --num_layers 5 > hf_outputs.txt 2>&1
This script performs a single forward pass using the HF implementation with 5 layers. The output is redirected to hf_outputs.txt.
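For clarity, the script boils down to something like the sketch below. This is a simplified illustration of the idea, not the actual code from my repo; it glosses over the fp8 checkpoint handling and the memory needed to load the model, and the --num_layers flag corresponds to truncating num_hidden_layers:

```python
import torch
from transformers import AutoConfig, AutoModelForCausalLM

MODEL = "deepseek-ai/DeepSeek-V3-0324"

# Keep only the first 5 layers so a single forward pass is tractable to inspect.
config = AutoConfig.from_pretrained(MODEL, trust_remote_code=True)
config.num_hidden_layers = 5

model = AutoModelForCausalLM.from_pretrained(
    MODEL, config=config, trust_remote_code=True, torch_dtype=torch.bfloat16
)
model.eval()

torch.manual_seed(0)
input_ids = torch.randint(0, config.vocab_size, (1, 16))  # randomized input tokens

with torch.no_grad():
    out = model(input_ids, output_hidden_states=True)

# Print per-layer hidden-state statistics; these are what I compare against
# the repo implementation's activations.
for i, h in enumerate(out.hidden_states):
    print(f"layer {i}: mean(|h|) = {h.float().abs().mean().item():.6f}")
```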
2. This Repo's Implementation
For this test, I used the model implementation from this repository. To simplify things, I skipped the distributed training setup and focused on a single forward pass.
Step 1: Convert Weights
First, I converted the HF checkpoint weights using convert.py:
python convert.py --hf-ckpt-path /path/to/your/dsv3-weights/ --save-path /path/to/save/dsv3-weights-5-layer/ --n-experts 256 --model-parallel 8
Make sure to replace /path/to/your/dsv3-weights/ with the actual path to your HF weights and /path/to/save/dsv3-weights-5-layer/ with the desired save location. This command converts the weights for a model with 256 experts and 8-way model parallelism.
Step 2: Run Forward Pass
Next, I ran a single forward pass using the converted weights:
torchrun --nnodes 1 --nproc-per-node 8 inference/run_single_forward.py --config inference/configs/config_671B.json > dsv3-output.txt 2>&1
This command uses torchrun to launch the run_single_forward.py script with the specified configuration (config_671B.json). The output is redirected to dsv3-output.txt.
Comparing the Results
Now comes the crucial part: comparing the outputs. I've included detailed numerical comparisons in the form of images. These images clearly show the discrepancy in the outputs after the first dense layer's attention mechanism.
Expected Behavior vs. Reality
Ideally, the outputs after the first dense layer's attention should be nearly identical between the two implementations. There might be slight differences due to the use of fp8 versus bfloat16 precision, but the core results should align.
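Rather than relying only on the images, the check I have in mind is essentially the following (the file names are hypothetical placeholders; it assumes each run dumps the post-attention hidden states of the first dense layer to disk):

```python
import torch

# Hypothetical dump files -- adjust to wherever each run saves its activations.
hf_out = torch.load("hf_first_dense_attn_output.pt").float()
ds_out = torch.load("dsv3_first_dense_attn_output.pt").float()

diff = (hf_out - ds_out).abs()
print("max abs diff :", diff.max().item())
print("mean abs diff:", diff.mean().item())

# With consistent RoPE handling, the gap should be on the order of
# fp8/bfloat16 rounding noise; a large max diff points to a real divergence.
print("allclose(atol=1e-2):", torch.allclose(hf_out, ds_out, atol=1e-2))
```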
However, as the images demonstrate, the outputs are significantly different. This strongly suggests that the rotary embedding discrepancy is indeed causing a divergence in the model's behavior.
Diving Deeper into the Root Cause
To fully grasp the impact of this discrepancy, it's essential to understand why rotary embeddings are so vital for transformer models. Rotary Position Embeddings (RoPE) encode each token's position by applying a rotation to the query and key vectors, which offers several advantages over traditional positional embeddings, particularly in handling longer sequences. The interleaving of odd and even columns in the Hugging Face implementation is a specific way of applying this rotation, and its absence in the DeepSeek-AI implementation alters the rotation pattern.
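To make the two conventions concrete, here is a small self-contained sketch (not code from either repo) of the "interleaved pairs" style versus the "rotate-half" style of applying RoPE. On the same input they generally produce different per-channel outputs, and they only line up once the channels are permuted into the layout each style expects:

```python
import torch

def rope_interleaved(x, cos, sin):
    # Rotate adjacent channel pairs (x0, x1), (x2, x3), ... by the per-pair angles.
    x1, x2 = x[..., 0::2], x[..., 1::2]
    out = torch.empty_like(x)
    out[..., 0::2] = x1 * cos - x2 * sin
    out[..., 1::2] = x1 * sin + x2 * cos
    return out

def rope_half(x, cos, sin):
    # Rotate channel i together with channel i + d/2 ("rotate_half" style).
    d = x.shape[-1]
    x1, x2 = x[..., : d // 2], x[..., d // 2 :]
    return torch.cat((x1 * cos - x2 * sin, x1 * sin + x2 * cos), dim=-1)

torch.manual_seed(0)
d, seq = 8, 4
x = torch.randn(seq, d)
inv_freq = 1.0 / (10000 ** (torch.arange(0, d, 2).float() / d))
angles = torch.arange(seq).float()[:, None] * inv_freq[None, :]  # (seq, d/2)
cos, sin = angles.cos(), angles.sin()

print(torch.allclose(rope_interleaved(x, cos, sin), rope_half(x, cos, sin)))  # False

# Regrouping channels as "even indices first, odd indices second" makes the
# rotate-half convention reproduce the interleaved one on the permuted layout.
perm = torch.cat([torch.arange(0, d, 2), torch.arange(1, d, 2)])
print(torch.allclose(rope_interleaved(x, cos, sin)[..., perm],
                     rope_half(x[..., perm], cos, sin)))                      # True
```

As I understand it, the permutation in the HF code is doing exactly this kind of layout switch, which is why its absence on one side is what caught my attention.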
Exploring the Mathematical Implications
The difference in implementation affects the mathematical operations within the attention mechanism. The attention mechanism calculates a weighted sum of values based on the similarity between queries and keys. The rotary embeddings modify these queries and keys, and the permutation step in the Hugging Face implementation changes the way this modification occurs. This, in turn, affects the attention weights and the final output.
Practical Consequences of the Discrepancy
The practical consequences of this discrepancy are significant. If the rotary embeddings are not applied consistently, the model's ability to understand and utilize positional information can be compromised. This can lead to degraded performance on tasks that require understanding long-range dependencies, such as text generation and question answering.
Drawing Parallels: The Llama3 Case
Interestingly, this isn't the first time such a discrepancy has been observed. We encountered a similar issue with the Llama3 model. The weights for Llama3 on Hugging Face were permuted compared to Meta's original weights. To accommodate the rotary embedding implementation difference, we had to manually permute the weights back. You can find more details in this reference: https://github.com/pytorch/torchtitan/issues/335.
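For context, the Llama conversion handles this with a per-head row permutation of the q/k projection weights, roughly along these lines (paraphrasing the permute() helper from the HF Llama weight-conversion script; the shapes here are toy values):

```python
import torch

def permute_for_rope(w: torch.Tensor, n_heads: int) -> torch.Tensor:
    # Within each head, reorder the output rows from interleaved pair order
    # (0, 1, 2, 3, ...) to "first halves then second halves" (0, 2, ..., 1, 3, ...),
    # so the rotate-half RoPE in the HF Llama code matches the interleaved
    # RoPE the original checkpoint was trained with.
    dim1, dim2 = w.shape
    return (
        w.view(n_heads, dim1 // n_heads // 2, 2, dim2)
        .transpose(1, 2)
        .reshape(dim1, dim2)
    )

# Toy example: 4 heads with head_dim 8 -> a (32, 32) query projection.
w = torch.randn(32, 32)
print(permute_for_rope(w, n_heads=4).shape)  # torch.Size([32, 32])
```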
This experience with Llama3 reinforces the importance of carefully examining rotary embedding implementations and ensuring consistency between different versions of a model. It also highlights the potential for discrepancies to arise when models are ported or reimplemented.
Next Steps and Potential Solutions
So, what are the next steps? The most immediate action is to confirm whether this discrepancy is intentional or an oversight. If it's intentional, understanding the reasoning behind the different implementations is crucial. If it's an oversight, correcting the implementation is essential to ensure consistency and optimal model performance.
Potential Solutions
- Weight Permutation: One potential solution is to permute the weights to align with the rotary embedding implementation being used. This was the approach taken with Llama3, where the weights were permuted to match the rotary embedding implementation in the torchtitan repository. A rough sketch of what such a permutation could look like follows this list.
- Implementation Adjustment: Another solution is to modify the rotary embedding implementation itself. This could involve adding or removing the permutation step to match the desired behavior. However, this approach requires careful consideration to ensure that the modified implementation is mathematically correct and does not introduce other issues.
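As a very rough illustration of the first option, a conversion-time permutation could look something like the helper below. This is purely hypothetical: the function name, the assumption that the rope channels sit in the last rows of the projection, and where (if anywhere) it would belong in convert.py are all mine, not the repo's:

```python
import torch

def permute_rope_rows(w: torch.Tensor, rope_dim: int) -> torch.Tensor:
    # Hypothetical helper: reorder the last `rope_dim` output rows of a
    # projection weight from interleaved order (0, 1, 2, 3, ...) to
    # even-then-odd order (0, 2, ..., 1, 3, ...), so the projected q_pe / k_pe
    # channels already come out in a rotate-half-friendly layout.
    perm = torch.cat([torch.arange(0, rope_dim, 2),   # even channels first
                      torch.arange(1, rope_dim, 2)])  # then odd channels
    w = w.clone()
    w[-rope_dim:] = w[-rope_dim:][perm]
    return w

# Toy usage: a projection whose last 64 output rows produce the rope channels.
w = torch.randn(192, 128)
print(permute_rope_rows(w, rope_dim=64).shape)  # torch.Size([192, 128])
```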
Community Collaboration
Issues like this highlight the importance of community collaboration in the open-source AI world. Sharing observations, comparing results, and discussing potential solutions are all crucial for ensuring the reliability and reproducibility of models like DeepSeek-V3.
Conclusion: Ensuring Consistency for Optimal Performance
In conclusion, the discrepancy in rotary embedding implementations between the Hugging Face version and this repository's version of DeepSeek-V3 is a significant issue that warrants further investigation. The differences in how positional information is encoded can lead to substantial variations in model behavior and performance. By understanding the root cause of this discrepancy and implementing appropriate solutions, we can ensure the consistency and reliability of DeepSeek-V3 and other transformer models.
I hope this deep dive into the rotary embedding discrepancy in DeepSeek-V3 has been insightful for you guys. It’s a complex issue, but by understanding the details, we can work towards ensuring the best possible performance for these powerful models. Keep exploring, keep questioning, and let’s continue to push the boundaries of AI together!
I appreciate the DeepSeek-AI team's attention to this matter, especially @tianyu-l. Your expertise and insights are invaluable in resolving this issue and ensuring the accuracy of the DeepSeek-V3 model. Thank you for your dedication to advancing the field of AI!