Auto-generated by AI
  • Elon Musk and his AI startup xAI have built the largest and most powerful artificial intelligence training supercomputer in the world.

  • Elon has named this beast Colossus.

  • It is equipped with the latest Nvidia GPU hardware, it is liquid-cooled with vast amounts of water, and it is powered by giant Tesla Megapack batteries.

  • Elon believes that all of this combined will create the world's most powerful artificial intelligence, one that will literally solve the mysteries of the universe.

  • And what we see today is only the beginning.

  • This is what's inside Colossus.

  • The location is Memphis, Tennessee, in an industrial park southwest of the city center, on the bank of the mighty Mississippi River.

  • The building itself wasn't constructed by xAI; it was previously home to Electrolux, the Swedish appliance manufacturer.

  • So, if you've been wondering why Elon chose Memphis and not Austin, it basically comes down to finding the right building in the right location to get this thing up and running as fast as possible.

  • Now, as unassuming as the exterior of Colossus might be, it's what's inside that counts.

  • And inside is the largest AI training cluster in the world.

  • Currently, that means over 100,000 Nvidia HGX H100 GPUs connected to exabytes of data storage over a super-fast network.

  • Nvidia CEO Jensen Huang has said himself that Colossus is, quote, "easily the fastest supercomputer on the planet."

  • And it was all built to power Grok, an AI model that Elon Musk and xAI plan to evolve into something far more capable than a simple chatbot.

  • This is the breeding ground for artificial superintelligence.

  • The entire facility as we see it was built in just 122 days.

  • That is insane.

  • A more traditional supercomputer cluster would have only a half to a quarter as many GPUs as Colossus, yet building those traditional systems takes years from start to finish.

  • The training work happens in an area called the data hall.

  • xAI uses a configuration known as a raised-floor data hall, which splits the system into three levels.

  • Above is the power, below is the cooling, and in the middle is the GPU cluster.

  • There are four data halls inside Colossus, each with 25,000 GPUs plus storage and the fiber-optic network that ties it all together.

  • Colossus uses water for liquid cooling.

  • Below the GPU cluster is a network of giant pipes that move vast amounts of water in and out of the facility.

  • Hot water from the servers is sent outside to a chiller, which lowers the temperature of the water by a few degrees before pumping it back in.

  • The water doesn't necessarily need to be cold, though.

  • Without getting too deep into thermodynamics, just remember that heat always flows from hotter to colder.

  • So as long as the water is cooler than the working GPUs, which get pretty hot, excess heat energy will be drawn into the water as it flows past and carried out of the system.

  • Here is what those GPU racks look like.

  • Each tray is loaded with eight Nvidia H100 GPUs, the current state-of-the-art chip for AI training.

  • That will change relatively soon, and Elon already plans to upgrade Colossus to the Nvidia B200 chip once it becomes widely available, but for right now, there's no time to waste.

  • There are eight of these racks built into one cabinet, for a total of 64 GPU chips and 16 CPU chips in every vertical stack.

  • Each rack has its own independent water-cooling system, with small tubes that lead directly into the GPU housing: blue tubes deliver cold water and red tubes extract hot water.

  • The beauty of these GPU racks, built for xAI by Supermicro, is that each one can be pulled out individually and serviced right on the tray.

  • That means the entire cabinet doesn't need to be shut down and disassembled just to replace one chip.

  • The technician can simply pull the rack, perform the service right there on the tray, then slide it back in and get back to training.

  • This is unique in the AI industry.

  • Only xAI has a setup like this, and it will allow them to keep downtime to an absolute minimum.

  • The same is true for the water system.

  • Each cabinet has its own cooling management unit at the base, which monitors flow rate and temperature and has an individual water pump that can easily be removed and serviced.

  • Now, the thing to keep in mind about gigantic computer systems like this is that things will break.

  • There's no way to avoid that, but having a plan to keep failures localized and fix problems as fast as possible makes an incredible difference in the overall productivity of the cluster.

  • On the back of each cabinet is a rear-door heat exchanger, which is basically a really big fan that pulls air through the rack and helps transfer heat from the hot chips to the cool water.

  • This replaces the giant air-conditioning units found in typical data centers and, again, keeps each rack self-contained.

  • Every fan glows with a colored light.

  • That's not for aesthetics.

  • It's a way for technicians to quickly identify failures.

  • A healthy fan shows a blue light, while a bad fan switches to red, and technicians simply replace those individual units as they go down.

  • While GPU chips do the heavy lifting for AI training, CPU chips are used for preparing the data and running the operating system.

  • There are two CPUs for every eight GPUs.

  • All of the data used to train Grok is held in a hard-drive storage system.

  • That's exabytes of text, images, and video fed into the training cluster.

  • One exabyte is a billion gigabytes, and all of that data is handled by a super-high-speed network.

  • Data is moved around Colossus over Ethernet, but this is nothing like your home network.

  • The xAI network is powered by Nvidia BlueField-3 DPUs.

  • DPU stands for data processing unit, and these chips can handle 400 gigabits per second through a network of fiber-optic cables.

  • That's around 400 times faster than a very fast (roughly 1 gigabit per second) home internet connection.

  • Ethernet is necessary for scaling beyond the size of a traditional supercomputer system.

  • You see, AI training requires a massive amount of storage that needs to be accessible to every server in the data center.

  • Now, this massive amount of equipment requires an equally massive amount of power.

  • And again, xAI has done something totally unique with its energy delivery.

  • They are using Tesla Energy.

  • Colossus doesn't use solar energy.

  • It draws its power from traditional generators.

  • But xAI encountered a problem when it started to bring its 100,000-GPU system online.

  • Tiny, millisecond-scale variations in the power coming from the grid would create inconsistencies in the training process.

  • We're talking about very small fluctuations, but at this giant scale, they add up quickly.

  • So the solution was to bring in Tesla Megapack battery units.

  • What they do now is pipe input power from the grid into the Megapacks, and then the batteries discharge directly into the training cluster.

  • This provides the extremely consistent power the entire cluster needs to run the most efficient training sessions physically possible.

  • This unique energy upgrade will become even more critical when xAI doubles the size of Colossus to over 200,000 H100 GPUs, something Elon claims will happen within the next two months.

  • That is an insane rate of growth, and it has the established AI giants scared.

  • There have been reports that OpenAI CEO Sam Altman has already told Microsoft executives he's concerned Elon will soon overtake them in access to computing power.

  • Of course, this stuff ain't cheap.

  • Just a few months ago, xAI raised $6 billion in venture capital funding, bringing the one-year-old company to a valuation of $24 billion.

  • That's a lot of money for a young company that had only one basic product on the market at the time.

  • But they did have the richest man in the world at the controls, and that obviously counts for a lot.

  • Now, the Wall Street Journal has just reported that Elon is already looking for a lot more money, enough to bring the value of xAI to $40 billion.

  • For a sense of scale, industry giant OpenAI is currently valued at $157 billion.

  • Meanwhile, a smaller-scale operation like Perplexity, which makes a highly regarded AI search tool, is expected to soon hit a valuation of $8 billion.

  • As for Grok, the AI chatbot continues to evolve rapidly thanks to the new power provided by Colossus. Just recently, Grok was upgraded with vision capabilities, meaning the AI can analyze and comprehend images alongside its existing text functions.

  • This new feature is integrated into the X social media platform for premium users.

  • Now, when you see an image in a post, you can click a button to send that image to Grok and ask the AI any question you want about its content.

  • Grok can analyze the image or provide additional context.

  • This is an important step for xAI on its path toward artificial general intelligence.

  • That's a big buzzword right now; it basically just means an AI that can do pretty much anything.

  • Essentially, an artificial reproduction of the human mind and its incredible versatility.

  • We can write, we can make music, we can solve complex problems and invent new things.

  • In theory, an artificial general intelligence would have all of the knowledge of the entire human race concentrated into one super-powerful computer brain, making it infinitely smarter than any human being.

  • Then the AGI could use that knowledge to learn even more: to discover the undiscoverable, solve the unsolvable, invent the uninventable.

  • According to Elon Musk, this is how we unlock the mysteries of the universe and the very nature of our own existence.

  • Or the AI will go rogue and kill us all.

  • But that's where Neuralink comes in, which is a whole other video we've already made, so make sure you check that one out next.
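The transcript quotes several hardware figures that can be cross-checked against each other. The minimal Python sketch below recomputes them using only the numbers stated in the video, plus one labeled assumption: a "very fast" home connection of about 1 Gb/s, which is what the 400x comparison implies. The constant and function names are hypothetical, and the sketch does not independently verify any of the figures.

```python
# Figures as quoted in the video (not independently verified).
DATA_HALLS = 4
GPUS_PER_HALL = 25_000
GPUS_PER_TRAY = 8        # "Each tray is loaded with 8 NVIDIA H100 GPUs"
TRAYS_PER_CABINET = 8    # "8 of these racks built into one cabinet"
CPUS_PER_8_GPUS = 2      # "two CPUs for every eight GPUs"
DPU_GBPS = 400           # quoted BlueField-3 throughput, gigabits per second

# Assumption (not from the video): a "very fast" home link of ~1 Gb/s.
HOME_GBPS = 1

# Decimal SI units: 1 EB = 10**18 bytes, 1 GB = 10**9 bytes.
GB_PER_EB = 10**18 // 10**9


def cluster_summary() -> dict:
    """Recompute the derived numbers mentioned in the transcript."""
    total_gpus = DATA_HALLS * GPUS_PER_HALL
    gpus_per_cabinet = GPUS_PER_TRAY * TRAYS_PER_CABINET
    # Two CPUs per group of eight GPUs.
    cpus_per_cabinet = (gpus_per_cabinet // 8) * CPUS_PER_8_GPUS
    return {
        "total_gpus": total_gpus,
        "gpus_per_cabinet": gpus_per_cabinet,
        "cpus_per_cabinet": cpus_per_cabinet,
        "network_speedup_vs_home": DPU_GBPS / HOME_GBPS,
        "gigabytes_per_exabyte": GB_PER_EB,
        "gpus_after_doubling": 2 * total_gpus,
    }


if __name__ == "__main__":
    for name, value in cluster_summary().items():
        print(f"{name}: {value}")
```

The results agree with the video: 100,000 GPUs in total, 64 GPUs and 16 CPUs per cabinet, a 400x speedup over a home link, and 200,000 GPUs after doubling. Because 25,000 is not a multiple of 64, the per-hall GPU count is presumably a rounded figure rather than an exact cabinet tally.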
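The video's cooling passage says the water only needs to be cooler than the GPUs, not cold. Two standard heat-transfer relations make that concrete. The symbols (heat-transfer coefficient h, contact area A, mass flow rate ṁ) and the illustrative 1 kW load with a 5 K temperature rise are assumptions chosen for illustration, not figures from the video.

```latex
% Convective heat flow from the chip surface into the coolant (Newton's law of cooling):
% positive only while the chip is hotter than the water.
\dot{Q} = h A \,(T_{\mathrm{chip}} - T_{\mathrm{water}})

% Heat carried away by the coolant stream (mass flow \dot{m}, specific heat c_p):
\dot{Q} = \dot{m}\, c_p \,(T_{\mathrm{out}} - T_{\mathrm{in}})

% Illustrative (assumed) values: \dot{Q} = 1\,\mathrm{kW},\ c_p \approx 4186\,\mathrm{J\,kg^{-1}\,K^{-1}},\ \Delta T = 5\,\mathrm{K}
\dot{m} = \frac{\dot{Q}}{c_p\,\Delta T} = \frac{1000}{4186 \times 5} \approx 0.048\,\mathrm{kg/s} \approx 2.9\,\mathrm{L/min}
```

Under these assumptions, the chiller only has to take back out the few degrees the water picked up. The absolute water temperature matters only in that it must stay below chip temperature for heat to keep flowing into it.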

Inside Elon Musk's Colossus Supercomputer!

Posted by Adam Lin on November 29, 2024