Subtitles auto-generated by AI

  • I thought it plateaued.

  • I thought the bubble was about to burst and the hype train was derailing.

  • I even thought my software engineering job might be safe from Devin.

  • But I couldn't have been more wrong.

  • Yesterday, OpenAI released a terrifying new state-of-the-art model named O1.

  • And it's not just another basic GPT, it's a new paradigm of deep-thinking or reasoning models that obliterate all past benchmarks on math, coding, and PhD-level science.

  • And Sam Altman had a message for all AI haters out there.

  • Before we get too hopeful that O1 will unburden us from our programming jobs though, there are many reasons to doubt this new model.

  • It's definitely not ASI, it's not AGI, and not even good enough to be called GPT-5.

  • Following its mission of openness, OpenAI is keeping all the interesting details closed off, but in today's video, we'll try to figure out how O1 actually works and what it means for the future of humanity.

  • It is Friday the 13th, and you're watching The Code Report.

  • GPT-5, Orion, Q-Star, Strawberry.

  • These are all names that leaked out of OpenAI in recent months, but yesterday the world was shocked when they released O1 ahead of schedule.

  • GPT stands for Generative Pre-trained Transformer, and O stands for Oh S*** We're All Gonna Die.

  • First, let's admire these dubious benchmarks.

  • Compared to GPT-4, it achieves massive gains in accuracy, most notably in PhD-level physics, and on the massive multitask language understanding benchmarks for math and formal logic.

  • But the craziest improvements come in its coding ability.

  • At the International Olympiad in Informatics, it was in the 49th percentile when allowed 50 submissions per problem, but then broke the gold medal threshold when it was allowed 10,000 submissions.

  • And compared to GPT-4, its Codeforces Elo went from the 11th percentile all the way up to the 93rd percentile.

  • Impressive, but they've also secretly been working with Cognition Labs, the company that wants to replace programmers with this greasy pirate gigolo named Devin.

  • When using the GPT-4 brain, it only solved 25% of problems, but with O1, the chart went up to 75%.

  • That's crazy, and our only hope is that these internal closed-source benchmarks from a VC-funded company desperate to raise more money are actually just BS.

  • Only time will tell, but O1 is no doubt a huge leap forward in the AI race.

  • And the timing is perfect, because many people have been switching from ChatGPT to Claude, and OpenAI is in talks to raise more money at a $150 billion valuation.

  • But how does a deep thinking model actually work?

  • Well technically, they released three new models, O1 Mini, O1 Preview, and O1 Regular.

  • Us plebs only have access to Mini and Preview, and O1 Regular is still locked in a cage, although they have hinted at a $2,000 premium plus plan to access it.

  • What makes these models special though is that they rely on reinforcement learning to perform complex reasoning.

  • That means when presented with a problem, they produce a chain of thought before presenting the answer to the user.

  • In other words, they think.

  • Descartes said, I think, therefore I am, but O1 is still not a sentient life form.

  • Just like a human though, it will go through a series of thoughts before reaching a final conclusion, and in the process produce what are called reasoning tokens.

  • These are intermediate outputs that help the model refine its steps and backtrack when necessary, which allows it to produce complex solutions with fewer hallucinations.

  • But the tradeoff is that the response requires more time, computing power, and money.

  • OpenAI released a bunch of examples, like this guy making a playable snake game in a single shot, or this guy creating a nonogram puzzle.

  • And the model can even reliably tell you how many R's are in the word strawberry, a question that has baffled LLMs in the past.

  • Actually, just kidding, it failed that test when I tried to run it myself.
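For context, the letter-counting task that trips up LLMs is trivial for ordinary code, which is what makes it such a popular stress test. A minimal sketch in Python (the helper name is my own):

```python
# Count how many times a letter appears in a word -- the "strawberry" test.
def count_letter(word: str, letter: str) -> int:
    return word.lower().count(letter.lower())

print(count_letter("strawberry", "r"))  # 3
```

One common explanation for why language models stumble here is that their tokenizers see words as multi-character chunks rather than individual letters.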

  • And the actual chain of thought is hidden from the end user, even though you do have to pay for those tokens at a price of $60 per 1 million.
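To put that $60-per-million price in perspective, a quick back-of-envelope estimate (the 50,000-token figure is a made-up example, not a measured number):

```python
# Rough cost estimate for hidden reasoning tokens, using the
# $60 per 1 million token price quoted above.
PRICE_PER_MILLION_USD = 60.0

def reasoning_cost_usd(reasoning_tokens: int) -> float:
    return reasoning_tokens / 1_000_000 * PRICE_PER_MILLION_USD

# A hypothetical hard prompt that burns 50,000 hidden tokens:
print(f"${reasoning_cost_usd(50_000):.2f}")  # $3.00
```

So you are billed for tokens you never get to read, and a few hard prompts can add up fast.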

  • However, they do provide some examples of chain of thought, like in this coding example that transposes a matrix in Bash.

  • You'll notice that it first looks at the shape of the inputs and outputs, then considers the constraints of the programming language, and goes through a bunch of other steps before regurgitating a response.
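OpenAI's worked example is in Bash, but the underlying task is simple; for readers curious what the model is actually being asked to do, here is the same operation sketched in Python:

```python
# The task from the chain-of-thought example: transpose a matrix,
# i.e. turn rows into columns (element [i][j] moves to [j][i]).
def transpose(matrix):
    return [list(row) for row in zip(*matrix)]

print(transpose([[1, 2, 3], [4, 5, 6]]))  # [[1, 4], [2, 5], [3, 6]]
```

The point of the demo isn't the difficulty of the task, it's watching the model reason about input shapes and language constraints before answering.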

  • But this is actually not a novel concept.

  • Google has been dominating math and coding competitions with AlphaProof and AlphaCoder for the last few years using reinforcement learning by producing synthetic data.

  • But this is the first time a model like this has become generally available to the public.

  • Let's go ahead and find out if it slaps.

  • I remember years ago when I first learned to code, I recreated the classic MS-DOS game Drug Wars, a turn-based strategy game where you play the role of a traveling salesman and have random encounters with Officer Hardass.

  • As a biological human, it took me like a hundred hours to build.

  • But let's first see how GPT-4o does with it.

  • When I ask it to build this game in C with a GUI, it produces code that almost works, but I wasn't able to get it to compile, and after a couple of follow-up prompts, I finally got something working, but the game logic was very limited.

  • Now let's give the new O1 that exact same prompt.

  • What you'll notice is that it goes through the chain of thought, like it's thinking, then assessing compliance, and so on, but what it's actually doing under the hood is creating those reasoning tokens, which should lead to a more comprehensive and accurate result.

  • In contrast to GPT-4, O1's code compiled right away, and it followed the game requirements to a T.

  • At first glance, it actually seemed like a flawless game, but it turns out the app was actually pretty buggy.

  • I kept getting into this infinite loop with Officer Hardass, and the UI was also terrible.

  • I tried to fix these issues with additional follow-up prompts, but they actually led to more hallucinations and more bugs, and it's pretty clear that this model isn't truly intelligent.

  • That being said though, there's a huge amount of potential with this chain of thought approach, and by potential, I mean potential to overstate its capabilities.

  • In 2019, they were telling us GPT-2 was too dangerous to release.

  • Now five years later, you've got Sam Altman begging the feds to regulate his strawberry.

  • It's scary stuff, but until proven otherwise, O1 is just another benign AI tool.

  • It's basically just like GPT-4, with the ability to recursively prompt itself.
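OpenAI hasn't published how O1 actually works, so take this as a purely illustrative toy of the "recursively prompt itself" idea, nothing more. The `toy_model` function is a stub standing in for a real LLM call:

```python
# Toy sketch of recursive self-prompting -- NOT OpenAI's actual mechanism.
# `toy_model` is a stub standing in for a real language model call.
def toy_model(prompt: str) -> str:
    return prompt + " -> refined"  # pretend each pass improves the draft

def self_prompting_loop(question: str, steps: int = 3) -> str:
    thought = question
    for _ in range(steps):
        thought = toy_model(thought)  # feed each "thought" back in as input
    return thought  # the user only ever sees this final output

print(self_prompting_loop("How many Rs are in strawberry?"))
```

The key idea is just the loop: intermediate outputs are fed back in as new input, and only the final pass is shown to the user, which is consistent with the hidden reasoning tokens described earlier.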

  • It's not fundamentally game-changing, but you really shouldn't listen to me.

  • I'm just like a horse influencer in 1910 telling horses a car won't take your job, but another horse driving a car will.

  • This has been The Code Report.

  • Thanks for watching, and I will see you in the next one.
