Seven years ago, back in 2015, one major development in AI research was automated image captioning. Machine learning algorithms could already label objects in images, and now they learned to put those labels into natural language descriptions. And it made one group of researchers curious: what if you flipped that process around?

"We could do image to text. Why not try doing text to images and see how it works?"

It was a more difficult task. They didn't want to retrieve existing images the way Google search does. They wanted to generate entirely novel scenes that didn't happen in the real world. So they asked their computer model for something it would have never seen before.

"Like, all the school buses you've seen are yellow. But if you write 'the red or green school bus,' would it actually try to generate something green? And it did that. It was a 32-by-32 tiny image, and then all you could see is, like, a blob of something on top of something."

They tried some other prompts, like "a herd of elephants flying in the blue skies," "a vintage photo of a cat," "a toilet seat sits open in the grass field," and "a bowl of bananas is on the table." Maybe not something to hang on your wall, but the 2016 paper from those researchers showed the potential for what might become possible in the future.

And, uh... the future has arrived. It is almost impossible to overstate how far the technology has come in just one year.

"By leaps and bounds. Leaps and bounds."

"Yeah, it's been quite dramatic. I don't know anyone who hasn't immediately been like, 'What is this? What is happening here?'"

"Could I say, like, watching waves crashing?" "Party hat guy." "Seafoam dreams." "A coral reef. Cubism." "Caterpillar. Caterpillar." "A dancing taco." "My prompt is Salvador Dali painting the skyline of New York City."

You may be thinking: wait, AI-generated images aren't new. You probably heard about this generated portrait going for over $400,000 at auction back in 2018, or this installation of morphing portraits, which Sotheby's sold the following year. It was created by Mario Klingemann, who explained to me that that type of AI art required him to collect a specific dataset of images and train his own model to mimic that data.

"Let's say, oh, I want to create landscapes, so I collect a lot of landscape images. I want to create portraits, I train on portraits. But then the portrait model would not really be able to create landscapes."

Same goes for those hyper-realistic fake faces that have been plaguing LinkedIn and Facebook: those come from a model that only knows how to make faces.
Generating a scene from any combination of words requires a different, newer, bigger approach.

"Now we kind of have these huge models, which are so huge that somebody like me actually cannot train them anymore on their own computer. But once they are there, they really kind of contain everything. I mean, to a certain extent."

What this means is that we can now create images without having to actually execute them with paint or cameras or pen tools or code. The input is just a simple line of text. I'll get to how this tech works later in the video, but to understand how we got here, we have to rewind to January 2021, when a major AI company called OpenAI announced DALL-E, which they named after these guys. They said it could create images from text captions for a wide range of concepts. They recently announced DALL-E 2, which promises more realistic results and seamless editing. But they haven't released either version to the public.

So over the past year, a community of independent, open-source developers built text-to-image generators out of other pre-trained models that they did have access to. And you can play with those online for free. Some of those developers are now working for a company called Midjourney, which created a Discord community with bots that turn your text into images in less than a minute.

"Having basically no barrier to entry to this has made it like a whole new ballgame. I've been up until, like, two or three in the morning, just really trying to change things, piece things together. I've done about 7,000 images. It's ridiculous."

Midjourney currently has a waitlist for subscriptions, but we got a chance to try it out.

"Go ahead and take a look." "Oh wow. That is so cool." "It has some work to do. I feel like it can be... it's not dancing, and it could be better."

The craft of communicating with these deep learning models has been dubbed "prompt engineering."

"What I love about prompting, for me, is that it has something like magic, where you have to know the right words for the spell. You realize that you can refine the way you talk to the machine. It becomes a kind of a dialog."

"You can say, like, 'octane render, Blender 3D.'" "Made with Unreal Engine..." "...certain types of film lenses and cameras..." "...1950s, 1960s... dates are really good." "...linocut or woodcut..." "Coming up with funny pairings, like a Faberge Egg McMuffin." "A monochromatic infographic poster about typography depicting Chinese characters."

Some of the most striking images can come from prompting the model to synthesize a long list of concepts.

"It's kind of like having a very strange collaborator to bounce ideas off of and get unpredictable ideas back."

"I love that! My prompt was 'chasing seafoam dreams,' which is a lyric from the Ted Leo and the Pharmacists song 'Biomusicology.' Can I use this as the album cover for my first album?" "Absolutely." "Alright."

For an image generator to be able to respond to so many different prompts, it needs a massive, diverse training dataset: hundreds of millions of images scraped from the internet, along with their text descriptions. Those captions come from things like the alt text that website owners upload with their images, for accessibility and for search engines. So that's how the engineers get these giant datasets.
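To make that concrete, here's a toy sketch of harvesting (image, caption) pairs from alt text, using the `requests` and `BeautifulSoup` libraries. The page list is purely illustrative; real datasets come from web-scale crawls with heavy filtering and deduplication.

```python
# Toy sketch: collect (image URL, caption) pairs from the alt text of web pages.
# The page list is a placeholder; real training sets are built from billions
# of crawled pages, then filtered and deduplicated.
import requests
from bs4 import BeautifulSoup

pages = ["https://example.com/gallery"]  # hypothetical URLs

pairs = []
for url in pages:
    html = requests.get(url, timeout=10).text
    soup = BeautifulSoup(html, "html.parser")
    for img in soup.find_all("img"):
        src, alt = img.get("src"), img.get("alt", "").strip()
        if src and alt:  # keep only images that ship with a text description
            pairs.append((src, alt))

print(pairs)  # e.g. [("cat.jpg", "a vintage photo of a cat"), ...]
```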
But then what do the models actually do with them? We might assume that when we give them a text prompt, like "a banana inside a snow globe from 1960," they search through the training data to find related images and then copy over some of those pixels. But that's not what's happening. The new generated image doesn't come from the training data; it comes from the "latent space" of the deep learning model. That'll make sense in a minute. First, let's look at how the model learns.

If I gave you these images and told you to match them to these captions, you'd have no problem. But what about now? This is what images look like to a machine: just pixel values for red, green, and blue. You'd just have to make a guess, and that's what the computer does too at first. But then you could go through thousands of rounds of this and never figure out how to get better at it, whereas a computer can eventually figure out a method that works. That's what deep learning does.

In order to understand that this arrangement of pixels is a banana, and this arrangement of pixels is a balloon, it looks for metrics that help separate these images in mathematical space. So how about color? If we measure the amount of yellow in the image, that would put the banana over here and the balloon over here in this one-dimensional space. But then what if we run into this: now our yellowness metric isn't very good at separating bananas from balloons. We need a different variable. Let's add an axis for roundness. Now we've got a two-dimensional space, with the round balloons up here and the banana down here. But if we look at more data, we may come across a banana that's pretty round, and a balloon that isn't. So maybe there's some way to measure shininess. Balloons usually have a shiny spot. Now we have a three-dimensional space. And ideally, when we get a new image, we can measure those three variables and see whether it falls in the banana region or the balloon region of the space.
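Here's that banana/balloon idea as a runnable toy: three hand-invented feature axes (yellowness, roundness, shininess) turn each image into a point in 3-D space, and a new point takes the label of whichever region it lands nearest. All the numbers are made up for illustration.

```python
# Toy version of the banana/balloon example: each image is reduced to three
# hand-picked features (yellowness, roundness, shininess), making it a point
# in 3-D space. A new image takes the label of its nearest training point.
# Deep models learn their own feature axes, and far more than three.
import math

training = {  # (yellowness, roundness, shininess) -> label; values invented
    (0.90, 0.30, 0.10): "banana",
    (0.80, 0.40, 0.20): "banana",
    (0.90, 0.90, 0.80): "balloon",
    (0.20, 0.95, 0.90): "balloon",
}

def classify(point):
    # 1-nearest-neighbor: which labeled region of the space are we in?
    nearest = min(training, key=lambda p: math.dist(p, point))
    return training[nearest]

print(classify((0.85, 0.35, 0.15)))  # -> banana
print(classify((0.50, 0.90, 0.85)))  # -> balloon (round and shiny, not yellow)
```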
But what if we want our model to recognize not just bananas and balloons, but... all these other things? Yellowness, roundness, and shininess don't capture what's distinct about these objects. That's what deep learning algorithms do as they go through all the training data: they find variables that help improve their performance on the task, and in the process, they build out a mathematical space with way more than three dimensions. We are incapable of picturing multidimensional space, but Midjourney's model offered this, and I like it. So we'll say this represents the latent space of the model. And it has more than 500 dimensions.

Those 500 axes represent variables that humans wouldn't even recognize or have names for, but the result is that the space has meaningful clusters: a region that captures the essence of banana-ness; a region that represents the textures and colors of photos from the 1960s; an area for snow, an area for globes, and snow globes somewhere in between. Any point in this space can be thought of as the recipe for a possible image. The text prompt is what navigates us to that location.

But then there's one more step. Translating a point in that mathematical space into an actual image involves a generative process called diffusion. It starts with just noise and then, over a series of iterations, arranges pixels into a composition that makes sense to humans. Because of some randomness in the process, it will never return exactly the same image for the same prompt. And if you enter the prompt into a different model, designed by different people and trained on different data, you'll get a different result, because you're in a different latent space.
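In code terms, the generation loop described here looks roughly like the sketch below. Both helper functions, `encode_text` and `denoise_step`, are toy stand-ins for a trained text encoder and denoising network; only the shape of the process is real.

```python
# Schematic of text-conditioned diffusion: encode the prompt into latent
# space, start from pure noise, and iteratively denoise toward an image.
# encode_text() and denoise_step() are toy stand-ins, not a real model.
import numpy as np

def encode_text(prompt):
    # Stand-in: a real encoder maps the prompt to a point in latent space.
    rng = np.random.default_rng(abs(hash(prompt)) % 2**32)
    return rng.normal(size=512)  # a point in a ~500-dimensional space

def denoise_step(image, condition, t, strength=0.1):
    # Stand-in: a real network removes a little predicted noise each step,
    # guided by the prompt's location. (t would set the noise schedule.)
    target = condition[:3].reshape(1, 1, 3)  # toy "direction" to move toward
    return image + strength * (target - image)

def generate(prompt, steps=50, seed=None):
    condition = encode_text(prompt)
    rng = np.random.default_rng(seed)
    image = rng.normal(size=(64, 64, 3))  # begin with pure noise
    for t in reversed(range(steps)):
        image = denoise_step(image, condition, t)
    return image

# The starting noise differs on every call, so the same prompt never returns
# exactly the same image -- unless you fix the seed.
a = generate("a banana inside a snow globe from 1960")
b = generate("a banana inside a snow globe from 1960")
print(np.allclose(a, b))  # False
```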
"No way. That is so cool. What the heck? The brush strokes, the color palette. That's fascinating. I wish I could, like... I mean, he's dead, but go up to him and be like, 'Look what I have!'" "Oh, that's pretty cool. Probably the only Dali that I could afford anyways."

The ability of deep learning to extract patterns from data means that you can copy an artist's style without copying their images, just by putting their name in the prompt. James Gurney is an American illustrator who became a popular reference for users of text-to-image models. I asked him what kind of norms he would like to see as prompting becomes widespread.

"I think it's only fair to people looking at this work that they should know what the prompt was and also what software was used. Also, I think the artists should be allowed to opt in or opt out of having the work that they worked so hard on by hand be used as a dataset for creating this other artwork."

"James Gurney, I think, was a great example of someone who was open to it, started talking with the artists. But I also heard of other artists who got actually extremely upset."

The copyright questions regarding the images that go into training the models, and the images that come out of them, are completely unresolved. And those aren't the only questions that this technology will provoke. The latent space of these models contains some dark corners that get scarier as outputs become photorealistic. It also holds an untold number of associations that we wouldn't teach our children, but that it learned from the internet.

"If you ask for an image of a CEO, it's, like, an old white guy. If you ask for images of nurses, they're all, like, women."

We don't know exactly what's in the datasets used by OpenAI or Midjourney, but we know the internet is biased toward the English language and Western concepts, with whole cultures not represented at all. In one open-sourced dataset, the word "asian" is represented first and foremost by an avalanche of porn.

"It really is just sort of an infinitely complex mirror held up to our society: what we deemed worthy enough to, you know, put on the internet in the first place, and how we think about what we do put up."

But what makes this technology so unique is that it enables any of us to direct the machine to imagine what we want it to see.

"Party hat guy, space invader, caterpillar, and a ramen bowl."

Prompting removes the obstacles between ideas and images, and eventually videos, animations, and whole virtual worlds.

"We are on a voyage here. It's a bigger deal than just, like, one decade or the immediate technical consequences. It's a change in the way humans imagine, communicate, work with their own culture. And that will have long-range good and bad consequences that we are, just by definition, not going to be capable of completely anticipating."

Over the course of researching this video, I spoke to a bunch of creative people who have played with these tools. And I asked them what they think this all means for people who make a living making images: the human artists and illustrators and designers and stock photographers out there. And they had a lot of interesting things to say. So I've compiled them into a bonus video.
Please check it out and add your own thoughts in the comments. Thank you for watching.