  • There's a new AI model in town.

  • Chinese AI company DeepSeek recently made waves when it announced R1, an open-source reasoning model that it claimed achieved comparable performance to OpenAI's o1 at a fraction of the cost.

  • The announcement unleashed a wave of social media panic and stock market chaos.

  • NVIDIA losing nearly 600 billion dollars in market cap today alone.

  • But for those following AI developments closely, DeepSeek and R1 didn't come out of nowhere.

  • The company has been publishing its research and releasing its model weights for months, following a path similar to Meta's Llama models.

  • This is in contrast to other major AI labs like OpenAI, Google DeepMind, and Anthropic, which have closed weights and publish more limited technical details. What's changed is just that now the broader public is actually paying attention.

  • So let's decode what the real developments here are, where they come from, and why they matter.

  • First of all, it is important to distinguish between two relevant models here, DeepSeek R1 and DeepSeek V3.

  • DeepSeek V3, which was actually released this past December, is a general-purpose base model that achieves comparable performance to other base models like OpenAI's GPT-4o, Anthropic's Claude 3.5 Sonnet, and Google's Gemini 1.5.

  • DeepSeek R1, which was released at the end of January, is a reasoning model built on top of DeepSeek V3.

  • In other words, DeepSeek took V3 and applied various algorithmic improvements to it in order to optimize its reasoning ability, resulting in R1, a model that achieves comparable performance to OpenAI's o1 and Google's Gemini 2.0 Flash on certain complex reasoning benchmarks.

  • But many of the algorithmic innovations responsible for R1's remarkable performance were actually discussed in this past December's V3 paper, or even before that in DeepSeek's V2 paper, published in May 2024, and the DeepSeekMath paper, which came out in February 2024.

  • V3 stitches together many of these innovations, which were designed primarily with compute and training efficiency in mind.

  • One way DeepSeek optimized for efficiency and got more floating-point operations per second, or FLOPs, out of its GPUs was by training V3 natively in an 8-bit floating-point format, rather than the usual 16-bit or 32-bit formats.

  • This is not a new idea.

  • Many other labs are doing it too.

  • But it was key for getting such massive memory savings without sacrificing performance.

  • A crucial enhancement is their FP8 accumulation fix, which periodically merges calculations back into a higher-precision FP32 accumulator to prevent small numerical errors from compounding.
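
To make the accumulation fix concrete, here is a toy NumPy sketch of the idea, not DeepSeek's actual GPU kernel (the V3 paper describes promoting partial sums to FP32 at fine-grained intervals inside the tensor-core pipeline). NumPy has no FP8 dtype, so float16 stands in for the low-precision format:

```python
import numpy as np

# Toy sketch: multiply in low precision, but periodically fold partial sums
# into an FP32 accumulator so rounding errors don't compound across a long
# dot product. float16 stands in for FP8, which NumPy doesn't support.
def low_precision_matmul(a, b, chunk=128):
    a16, b16 = a.astype(np.float16), b.astype(np.float16)
    acc = np.zeros((a.shape[0], b.shape[1]), dtype=np.float32)  # FP32 accumulator
    for k in range(0, a.shape[1], chunk):
        # Each chunk's partial product is computed in low precision...
        partial = a16[:, k:k + chunk] @ b16[k:k + chunk, :]
        # ...then promoted and merged into the high-precision accumulator.
        acc += partial.astype(np.float32)
    return acc

a = np.random.randn(64, 4096).astype(np.float32)
b = np.random.randn(4096, 64).astype(np.float32)
err = np.abs(low_precision_matmul(a, b) - a @ b).max()
print(f"max abs error vs full-precision matmul: {err:.3f}")
```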

  • The result?

  • Far more efficient training across thousands of GPUs, cutting costs while maintaining model quality.

  • But why does this efficiency matter?

  • Given its hardware constraints and U.S. export controls on the sale of GPUs to China, DeepSeek needed to find a way to get more training throughput and more bandwidth out of its existing cluster of GPUs.

  • You see, at AI labs, these GPUs, which do the number crunching and matrix multiplication to train these models, are actually sitting idle most of the time.

  • At FP8, it is typical to only see around 35% model FLOPs utilization, or MFU, meaning GPUs are only being utilized at their peak potential about a third of the time.
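
As a rough illustration of what that 35% figure means, MFU is just achieved model FLOPs divided by hardware peak FLOPs. All numbers below are illustrative assumptions, not DeepSeek's reported figures:

```python
# Back-of-the-envelope MFU: useful model FLOPs per second / hardware peak.
peak_flops = 1.979e15            # assumed FP8 peak of one H800-class GPU, FLOPs/s
active_params = 37e9             # V3 activates ~37B parameters per token
flops_per_token = 6 * active_params   # common ~6N estimate of training FLOPs/token
tokens_per_gpu_per_sec = 3200    # hypothetical measured training throughput

mfu = tokens_per_gpu_per_sec * flops_per_token / peak_flops
print(f"MFU = {mfu:.0%}")        # ~36% with these made-up numbers
```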

  • The rest of the time, these GPUs are waiting for data to be moved, either between caches or other GPUs.

  • This is NVIDIA's key advantage.

  • It is not just about GPUs.

  • It is about an integrated solution they've been building for over a decade that includes networking with InfiniBand, software with CUDA, and developer experience.

  • Essentially, NVIDIA provides a deeply integrated system that lets AI researchers program GPU clusters less as a distributed system and closer to what Jensen Huang describes as one giant GPU.

  • Another clever way DeepSeek makes the most out of its hardware is its particular implementation of a mixture-of-experts architecture.

  • DeepSeek V3 has 671 billion model parameters, but only 37 billion are activated for a given token prediction.

  • By contrast, the largest and most capable Llama 3 model doesn't use a mixture-of-experts architecture, so it activates its full 405 billion parameters for each token prediction.

  • In other words, V3 activates 11x fewer parameters for each forward pass, saving tons of computation.
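
A minimal sketch of the mechanism (a simplified top-k router, not DeepSeek's exact design, which also includes shared experts and its own load-balancing scheme):

```python
import numpy as np

# Simplified mixture-of-experts forward pass: a gate scores every expert per
# token, but only the top-k experts actually run, so compute scales with
# activated parameters rather than total parameters.
rng = np.random.default_rng(0)
d_model, n_experts, top_k = 16, 8, 2

gate_w = rng.standard_normal((d_model, n_experts))
experts = [rng.standard_normal((d_model, d_model)) for _ in range(n_experts)]

def moe_forward(x):
    scores = x @ gate_w                    # router scores, one per expert
    chosen = np.argsort(scores)[-top_k:]   # indices of the top-k experts
    weights = np.exp(scores[chosen])
    weights /= weights.sum()               # normalize over the chosen experts
    # Only the chosen experts' parameters are touched for this token.
    return sum(w * (x @ experts[i]) for w, i in zip(weights, chosen))

token = rng.standard_normal(d_model)
print(moe_forward(token).shape)  # (16,): full output, a fraction of the compute
```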

  • Mixture of experts isn't a new concept, but it's been challenging to train models with this architecture efficiently.

  • DeepSeek introduced novel techniques that stabilize performance and increase GPU utilization.

  • Additionally, to overcome key performance bottlenecks, V3 makes use of multi-head latent attention, or MLA, which DeepSeek first revealed in its V2 paper, published in May 2024.

  • MLA is a solution designed to tackle KV cache storage limitations, one of the biggest sources of VRAM overhead in large models.

  • Instead of storing full key and value matrices, MLA compresses them down into a latent representation, reconstructing them only when needed.
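
A toy illustration of that core idea (omitting real MLA details like the decoupled rotary-embedding keys and query compression):

```python
import numpy as np

# MLA's core trick: cache one small latent vector per token instead of full
# per-head keys and values, and reconstruct K and V from the latent only
# when attention needs them. Dimensions here are made up for illustration.
rng = np.random.default_rng(0)
d_model, d_latent, n_heads, d_head = 1024, 64, 16, 64

w_down = rng.standard_normal((d_model, d_latent)) * 0.02           # compress
w_up_k = rng.standard_normal((d_latent, n_heads * d_head)) * 0.02  # rebuild keys
w_up_v = rng.standard_normal((d_latent, n_heads * d_head)) * 0.02  # rebuild values

hidden = rng.standard_normal((512, d_model))  # 512 cached tokens

latent_cache = hidden @ w_down   # what gets stored: (512, 64)
k = latent_cache @ w_up_k        # reconstructed on demand
v = latent_cache @ w_up_v

naive_cache = 2 * 512 * n_heads * d_head      # full K+V cache entries
print(f"cache entries: {latent_cache.size} vs {naive_cache} "
      f"({1 - latent_cache.size / naive_cache:.1%} smaller)")
```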

  • This helped the V2 model reduce its KV cache size by 93.3% and boosted its maximum generation throughput 5.76x.

  • Finally, unlike traditional models that predict only the next token, V3 makes use of multi-token prediction, or MTP.

  • MTP enables V3 to anticipate multiple future tokens at each step.

  • This densifies training signals, providing more feedback per step for better data efficiency and faster learning.

  • It also improves representation planning, allowing the model to pre-plan sequences for smoother, more coherent outputs.

  • During inference, MTP modules can be repurposed for speculative decoding, reducing sequential processing steps and significantly speeding up generation.
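
Here is a toy loop showing the speculative-decoding pattern this enables. The "models" are stand-in lookup functions, purely illustrative; the point is that each full-model verification pass can accept several tokens at once:

```python
# Toy speculative decoding: a cheap draft proposes a few tokens ahead, the
# full model verifies them in one batched pass, and every accepted token
# skips one sequential decoding step.
TARGET = "the quick brown fox jumps over the lazy dog".split()

def full_model(pos, k):
    """Stand-in for the big model: the 'true' next k tokens at position pos."""
    return TARGET[pos:pos + k]

def draft_model(pos, k):
    """Stand-in for the draft head: right most of the time, sometimes wrong."""
    guess = TARGET[pos:pos + k]
    if pos % 3 == 0 and guess:
        guess[-1] = "???"                   # inject an occasional bad guess
    return guess

tokens, passes = [], 0
while len(tokens) < len(TARGET):
    proposal = draft_model(len(tokens), 3)
    verified = full_model(len(tokens), 3)   # one verification pass
    passes += 1
    for g, t in zip(proposal, verified):
        tokens.append(t)                    # always keep the verified token
        if g != t:
            break                           # discard draft tokens after a miss
print(f"decoded {len(tokens)} tokens in {passes} full-model passes")
```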

  • Taken all together, this makes V3 one of the most impressive base models on the market, and it's been out for some time now.

  • However, the recent release of DeepSeek's R1 reasoning model is what really made waves.

  • Most LLMs can be improved by being prompted to think step-by-step, but what sets reasoning models apart is that they are specifically trained to break down hard problems and think about them for paragraphs at a time.

  • In September, OpenAI showed the power of this new approach with o1.

  • It achieved state-of-the-art results on math, coding, and science benchmarks.

  • With R1, DeepSeek took a similar approach and published the secret sauce.

  • OpenAI and DeepSeek achieved their impressive results through reinforcement learning, a technique to shape an LLM's behavior based on feedback and reward signals.

  • Modern LLMs use some variation of reinforcement learning from human feedback, aka RLHF, or reinforcement learning from AI feedback, aka RLAIF, to improve their usefulness and alignment.

  • But reasoning models apply RL specifically towards the task of thinking step-by-step through complex problems.

  • So how did DeepSeek apply RL to get a reasoning model?

  • At a high level, they assembled a bunch of problems with verifiable outputs, especially math and coding problems, and then designed a training pipeline to get the model to think for a bit and output the correct answers.

  • But they didn't give the model any external examples of how to think, whether from humans or AI.

  • And their grading process was extremely simple.

  • Rather than using a complex AI to give the model fine-grained feedback, DeepSeek used simple rules to evaluate the model's final output on accuracy and formatting.
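
A minimal sketch of what such a rule-based grader could look like (my illustration of the idea, not DeepSeek's code; R1's training does use tagged think/answer output formats):

```python
import re

# Rule-based reward: score the final answer's correctness plus adherence
# to a <think>...</think><answer>...</answer> output format.
def reward(output: str, gold: str) -> float:
    r = 0.0
    if re.fullmatch(r"(?s)<think>.*</think>\s*<answer>.*</answer>\s*", output):
        r += 0.5                             # format reward
    m = re.search(r"<answer>\s*(.*?)\s*</answer>", output, re.S)
    if m and m.group(1) == gold:
        r += 1.0                             # accuracy reward
    return r

print(reward("<think>6*7 is 42.</think><answer>42</answer>", "42"))  # 1.5
print(reward("<answer>42</answer>", "42"))   # 1.0: correct but wrong format
```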

  • They used these output scores to update the model through a novel technique they published in February 2024 called Group Relative Policy Optimization, or GRPO.
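
The heart of GRPO fits in a few lines: sample a group of responses per prompt, score them, and use each response's standing within its own group as the advantage, with no separate critic model (schematic only; the full objective also has PPO-style ratio clipping and a KL penalty):

```python
import numpy as np

# Group-relative advantages: reward each sample by how it compares to the
# other samples drawn for the same prompt, instead of training a critic.
rewards = np.array([1.5, 0.5, 0.0, 1.0])   # grader scores for one group of 4
advantages = (rewards - rewards.mean()) / (rewards.std() + 1e-8)
print(advantages)   # above-average samples pushed up, the rest pushed down
```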

  • Remarkably, with this process alone, DeepSeek saw reasoning emerge over thousands of RL steps.

  • The model learned skills like extended chains of thought and even experienced an aha moment where it recognized its own mistakes and backtracked to correct its reasoning.

  • This model was R1-Zero, one of the first large models to achieve top-tier results purely through reinforcement learning.

  • Pure RL has long been a subject of investigation in Western research labs. DeepMind's AlphaGo, for example, simulated thousands of games of self-play to beat the world's top Go player in 2016.

  • In 2019, OpenAI achieved notable success using reinforcement learning to train a robotic hand to solve a Rubik's Cube and beat a top human team in competitive Dota 2.

  • But unconstrained by human examples, R1-Zero's thinking steps suffered from poor readability, switching between English and Chinese at random.

  • So DeepSeek introduced a cold start phase, fine-tuning on structured reasoning examples before RL, to get R1.

  • This eliminated the language mixing issues and made outputs far more comprehensible.

  • The results are impressive.

  • R1 achieves comparable performance to o1 on certain math and coding benchmarks.

  • But the pace of innovation is speeding up.

  • Just two weeks after R1 was released, OpenAI released o3-mini, which outperforms R1 on key benchmarks.

  • So if R1 didn't actually come out of nowhere, what explains the hype cycle?

  • One explanation is the sheer accessibility of DeepSeek's model.

  • R1 is freely accessible through their website and app, and it is free to download, run locally, and customize.

  • Also, because of all the efficiency improvements, it offers near state-of-the-art performance at a fraction of the price of other reasoning models.

  • Another explanation is that a lot of the hype cycle didn't actually have to do with the specific algorithmic improvements that we described, but with misconceptions around V3's alleged $5.5 million in training costs.

  • There's some important fine print here.

  • The $5.5 million figure refers only to the cost of the final training run for V3.
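
For context, the V3 technical report arrives at that figure from GPU-hours at an assumed rental rate, covering only the final run:

```python
# The headline number is rented-compute accounting for the final run only:
h800_gpu_hours = 2.788e6    # GPU-hours reported for V3's final training run
rate_per_gpu_hour = 2.0     # assumed H800 rental price in USD
print(f"${h800_gpu_hours * rate_per_gpu_hour / 1e6:.2f}M")  # about $5.58M
```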

  • It doesn't include any of the training costs of R1 or the associated R&D or hardware operating expenses, which are presumably in the hundreds of millions.

  • Given the extreme algorithmic optimizations here, that $5.5 million training run number actually seems perfectly possible.

  • And it is worth noting that this work is reproducible.

  • A UC Berkeley lab recently applied R1-Zero's key techniques to produce complex reasoning in a smaller model for just $30.

  • What DeepSeek really proves is that there is still room for new players on the frontier.

  • In particular, there's room for rebuilding the stack by optimizing GPU workloads, improving inference-layer software and tooling, and developing AI-generated kernels.

  • Ultimately, this is fantastic news for AI applications in consumer or B2B, since it means the cost of intelligence keeps going down.

  • So the big takeaway here: this is the best possible time to be building a startup.

  • If you're accepted, you'll receive $500,000 in investment plus access to the best startup community in the world.

  • So apply now and come build the future with us.
