Name: Fake Voice Text to Speech Deep Learning ft. Elon Musk, Trump, Obama, and Joe Rogan
Uploaded: 2021-01-14T10:36:23.000Z
Duration: 28 min 28 s


And welcome to a video about neural networks that speak.

Every voice in this video is generated from a neural network.

I know people will claim they could tell: "I could hear the difference."

There's absolutely no way I would be fooled by this.

I have some of the best ears, not too big, not useful.

Give it long enough and you might realize something is wrong.

But I think this voice is pretty convincing.

It's not a question of if there will come a time where we can't trust audio or video.

We live in a time where almost all of our information comes from text, audio, and video, and we now know all of this can be faked.

It's entirely possible that many things we've already seen or heard were faked like this.

I wonder how people will use this technology.

It seems like it would mostly be used for bad things.

The reaction of many will be to just ban things like this, and this is not necessarily the answer.

The truth is, impersonating others for nefarious purposes is already illegal, and bad actors from other countries do not care about our laws.

They will add this technology to their arsenal of cyber attack weaponry.

Citizens must remain vigilant about where they get their information and must remain aware that the threat of fake audio and video is here.

Now we live in an era where most people get their information online from many sources, including social media.

While this has allowed for more freedom of the press, which underpins our democracy, it has also allowed foreign actors to penetrate our flow of information.

That is why tonight I am proposing that we build an Internet wall. It is a horrible... If we look at Young or Russia, they do not have this crowd.

I don't think you understand how this wall would stop the flow of then Information might be the dumbest idea I have ever this wall you've ever seen but know this would be a beautiful Absolutely not.

People would come to visit just to see my Internet wall.

No, I estimate that the money made from tours to see the Internet wall would pay for the Internet wall.

We might have this, Kyle, the issue here is just like any fake news.

Often the truth takes a long time to get out.

And some people will just never see the correction.

With last year's presidential election, we saw just how much fake information could impact us by clogging up information.

I wonder how much more destructive fake audio and video will be. All right, and back to reality.

Anyway, what I, uh, would like to do now is show you guys how you could get started with this, what's involved, and hopefully simplify it a decent amount.

I had no idea how complicated sound could actually be, with spectrograms and things like Fourier transforms, all this stuff.
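
To make the spectrogram and Fourier transform terms above a little more concrete, here is a minimal sketch that uses only the Python standard library. It is not the code used in the video: the function names, the frame size of 64, the hop of 32, and the 8 kHz test tone are all assumptions chosen for illustration, and a real pipeline would use an FFT library rather than this slow, naive transform.

```python
import cmath
import math

def dft_magnitudes(frame):
    """Magnitudes of the first N//2 + 1 discrete Fourier transform bins of a real frame."""
    n = len(frame)
    mags = []
    for k in range(n // 2 + 1):
        acc = sum(frame[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
        mags.append(abs(acc))
    return mags

def spectrogram(signal, frame_size=64, hop=32):
    """Cut the signal into overlapping Hann-windowed frames and transform each one."""
    window = [0.5 - 0.5 * math.cos(2 * math.pi * i / frame_size) for i in range(frame_size)]
    frames = []
    for start in range(0, len(signal) - frame_size + 1, hop):
        frame = [signal[start + i] * window[i] for i in range(frame_size)]
        frames.append(dft_magnitudes(frame))
    return frames  # frames[time_index][frequency_bin]

# Hypothetical test signal: a 1 kHz sine sampled at 8 kHz.
sample_rate = 8000
tone = [math.sin(2 * math.pi * 1000 * t / sample_rate) for t in range(512)]
spec = spectrogram(tone)

# The strongest bin of the first frame, converted back to a frequency in Hz.
peak_bin = max(range(len(spec[0])), key=lambda k: spec[0][k])
peak_hz = peak_bin * sample_rate / 64
```

Each row of `spec` is one short slice of time and each column is one frequency band, which is the kind of picture of sound that text to speech models are typically trained to predict.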

So I'm not gonna be getting too deep into the weeds, just kind of show you how you could do what I've done and kind of what I found. I'm by no means done with doing fake audio, and the next thing I'd like to do is fake video.

So, um, there's still a lot I'm trying to do, and I just I don't really feel like doing the line by line thing.

But I thought I would share with you guys what I found so far.

So with that, the model that I ended up going with is this DCTTS model, which is, um, I used this code here, but it's based on the following paper which, take note, was written in late 2017.

So almost two years ago, actually. But, you know, the history of doing text to speech on computers is, and has always been, pretty much this: someone goes into a studio and records hundreds of hours of audio, and then we use things like digital signal processing, or DSP, as I'll probably refer to it from now on, to do things like working on the pauses between words and sentences, and then to handle, like, the speech pattern and stuff like that.

It's all these, like, tools and hard-coded, rule-based things for text to speech.

In most cases, even when doing text to speech with neural networks, historically, people have applied more rule-based things to the output.

Now, obviously, the dream would be to do text to speech without any of that: just simply throw in some examples, get output.

And also a side goal is to figure out, how can we do this?

And could we just do this with, like, anything on like YouTube?

Like, could we just take just about any YouTube video and mimic a voice on that video?

So with that in mind, this is the model I ended up with.

But I did try other models, including Tacotron and Tacotron 2, and I found this one to be the best, which is kind of curious. And one of the things that you might notice, if you've been reading and not listening, or both, is that this is deep convolutional.

There is not a recurrent cell to be seen, which is also interesting because traditionally you would think of a task like this to go from text to speech.

Speech is also, it's not like a single scalar value, right?

It's a sequence, so most of the time you would think, um, I would use a sequence-to-sequence model, which is a recurrent neural network.

This is using deep convolutional neural networks with attention.

Now this is guided attention, which I'll explain in a moment.

But interestingly enough, it's turning out over the last couple years to be the case that deep convolutional neural networks are actually outperforming recurrent neural networks at these sequence-to-sequence types of tasks.

They're training faster, and they're getting better results.
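
As a rough illustration of how a convolution can process a sequence without any recurrent cell, here is a minimal standard-library sketch of a causal, dilated 1-D convolution. It is an assumption-laden toy, not the model from the video: the function name, the averaging kernel, and the dilation values are made up for the example.

```python
def causal_conv1d(sequence, kernel, dilation=1):
    """out[t] = sum over i of kernel[i] * sequence[t - i * dilation].

    Positions before the start of the sequence are treated as zero, so each
    output only depends on the current and earlier inputs.
    """
    out = []
    for t in range(len(sequence)):
        acc = 0.0
        for i, weight in enumerate(kernel):
            idx = t - i * dilation
            if idx >= 0:
                acc += weight * sequence[idx]
        out.append(acc)
    return out

x = [1.0, 2.0, 3.0, 4.0, 5.0]

# Two stacked layers; the larger dilation in the second layer widens how far
# back in the sequence each output can "see".
layer1 = causal_conv1d(x, [0.5, 0.5], dilation=1)
layer2 = causal_conv1d(layer1, [0.5, 0.5], dilation=2)
```

Unlike a recurrent network, no output position here has to wait for the previous output to be computed, so during training every position in a layer can be computed in parallel. That is one commonly given reason for the faster training mentioned above.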

I'm not sure that I would be ready to say recurrent neural networks are dead.

We just need a better way to do recurrent neural networks because the biggest thing is the amount of data that's required and how slow they are to train.

So if we could speed them up, we might see good things.

So, um, for example, I recall using a recurrent neural network on, like, MNIST data.

So it's not like recurrent neural networks aren't, like... they're just as, um, I'm trying to think, as broadly applicable as a convolutional neural network can be, so I wouldn't necessarily write them off completely.

Um, don't forget, neural networks themselves got written off completely not long ago. So anyway, um, yeah, cool.

Guided attention is basically the assumption that everything is gonna be linear.

So the input to the output will always be linear, as opposed to, consider a machine translation task where, let's say, you're going from English to Spanish and you've got "big house", which needs to translate to "casa grande".

Okay, well, the adjective to describe that noun in English comes before the noun, but in Spanish it comes after the noun.

Not in all cases; other words might stay in order, but some are gonna be flipped around.
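
The "everything is linear" assumption can be sketched as a penalty matrix that is close to zero along the diagonal (text position and audio position advancing together) and grows toward one away from it. The sketch below assumes the commonly cited form W[n][t] = 1 - exp(-(n/N - t/T)^2 / (2 g^2)) with g = 0.2; the function names and the tiny 4 by 4 example are hypothetical and not taken from the video.

```python
import math

def guided_attention_weights(num_chars, num_frames, g=0.2):
    """Penalty W[n][t]: 0 on the diagonal n/N == t/T, approaching 1 far from it."""
    return [
        [1.0 - math.exp(-((n / num_chars - t / num_frames) ** 2) / (2 * g * g))
         for t in range(num_frames)]
        for n in range(num_chars)
    ]

def guided_attention_loss(attention, weights):
    """Mean of attention * penalty: grows as attention mass moves off the diagonal."""
    total = sum(a * w
                for a_row, w_row in zip(attention, weights)
                for a, w in zip(a_row, w_row))
    return total / (len(attention) * len(attention[0]))

N, T = 4, 4
W = guided_attention_weights(N, T)

# A perfectly in-order alignment versus a fully reversed one.
diagonal = [[1.0 if n == t else 0.0 for t in range(T)] for n in range(N)]
flipped = [[1.0 if n == T - 1 - t else 0.0 for t in range(T)] for n in range(N)]

loss_diag = guided_attention_loss(diagonal, W)
loss_flip = guided_attention_loss(flipped, W)
```

Here `loss_diag` is zero and `loss_flip` is positive, so adding this term to the training loss nudges the model toward in-order alignments. That suits speech, where the audio follows the text left to right, but it would be the wrong prior for the reordering seen in the translation example.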

And so attention is there to actually help with this in sequence-to-sequence models, and it was previously mostly used with recurrent neural networks.

But now we're using it in convolutional neural networks as well.

And, it turns out, with convolutional neural networks plus attention, we can do the exact same thing with the added benefit of doing it way faster.
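
For readers who want to see what attention computes, here is a minimal standard-library sketch of dot-product attention applied to the "big house" example. The two-dimensional vectors are made-up toy embeddings and the function names are illustrative; a trained model would learn its queries, keys, and values rather than have them written by hand.

```python
import math

def softmax(scores):
    """Turn raw scores into weights that are positive and sum to 1."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(queries, keys, values):
    """For each query, weight every value by how well its key matches the query."""
    dim = len(keys[0])
    outputs, alignments = [], []
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(dim) for k in keys]
        weights = softmax(scores)
        alignments.append(weights)
        outputs.append([sum(w * v[j] for w, v in zip(weights, values))
                        for j in range(len(values[0]))])
    return outputs, alignments

# Toy source tokens: index 0 is "big", index 1 is "house".
keys = [[4.0, 0.0], [0.0, 4.0]]
values = keys

# Toy output positions: "casa" should look at "house", "grande" at "big".
queries = [[0.0, 4.0], [4.0, 0.0]]

outputs, alignments = attend(queries, keys, values)
best_source = [max(range(len(row)), key=lambda i: row[i]) for row in alignments]
```

`best_source` comes out as `[1, 0]`: the first output word attends mostly to the second input word and vice versa, which is the flipped ordering that a strictly position-by-position mapping could not express.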

The other nice benefit that I found is the end result is just better, so really cool.

So, for example, I just want to play a quick example of, um, let me see where it is, yes, of a text to speech that is pretty close to the typical ones I would find by training on a mediocre sample size of, say, 600 samples.

Okay, um, you probably might not even understand what is being said there, but I'll play it one more time.

It's saying, "Python is the one and only true programming language."

So, yeah, that doesn't work well. So that's Tacotron on 500 samples.

One thing to note: Tacotron requires 100, more like 100-plus, hours of sample audio and weeks to months to train.

So, um, just understand that that's just an example.

But I wanted to show you because every time I've ever played with text to speech, that's about what it sounded like.
