General Sequence Learning using Recurrent Neural Networks

  • Hi everyone, this is Alec and I'm going to be talking to you today about using recurrent

  • neural networks for text analysis. To get an understanding of why this is a potential

  • tool to use, it's good to look at a bit of a history of how text analysis has been done,

  • particularly from a machine learning perspective. So in machine learning typically we're used

  • to vector representation so we know how to deal with numbers.

  • For categories we would use a one-hot vectorization model.

  • But when we move to trying to understand and classify and regress sequences for instance,

  • it becomes much less clear because our tools are typically based on vector approaches.

  • The way this is typically dealt with is by computing some hard-coded feature transformations,

  • for instance, using TF-IDF vectorizers and some sort of compression model like LSA, and then

  • plugging a linear model, such as a support vector machine or a softmax classifier, on top of

  • that. The purpose of this talk today is to ask what happens

  • if we cut out those techniques and instead replace them with an RNN.

  • To get an understanding of why this might be an advantage, structure is hard.

  • N-grams are the typical way of preserving some structure.

  • This would be we take our sentence, for instance, ‘the cat sat on the mat’, and we re-represent

  • it as the occurrence of any individual word or combinations of words.

  • These combinations of words begin to give us a way to see a little bit of structure.

  • By ‘structure’ I mean preserving the ordering of words.
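As a quick illustration of that n-gram representation, here is a minimal scikit-learn sketch (the sentence and vectorizer settings are just illustrative, not from the talk):

```python
# Represent 'the cat sat on the mat' as unigram and bigram occurrence counts.
from sklearn.feature_extraction.text import CountVectorizer

docs = ["the cat sat on the mat"]
vectorizer = CountVectorizer(ngram_range=(1, 2))  # individual words plus pairs of adjacent words
X = vectorizer.fit_transform(docs)

print(vectorizer.get_feature_names_out())
# ['cat' 'cat sat' 'mat' 'on' 'on the' 'sat' 'sat on' 'the' 'the cat' 'the mat']
print(X.toarray())  # occurrence counts for each n-gram feature
```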

  • The problem with this is once we have bigrams or trigrams, the number of possible combinations of two or three

  • words quickly becomes huge. You can easily have 10 million plus features

  • and this begins to get cumbersome, requires lots of memory and slows things down in and

  • of itself. Structure, although it’s difficult, is also

  • very important. For certain tasks such as humor or sarcasm,

  • looking at whether the word ‘cat’ appeared or the word ‘dog’ appeared isn’t

  • going to cut it. And that’s what a lot of our models today

  • do. To understand though why many models today

  • are based on this and are quite successful: n-grams can get you a long way for many tasks.

  • Specific words are often very strong indicators: ‘useless’ in the case of negative sentiment

  • and ‘fantastic’ in the case of positive sentiment.

  • If you’re, for instance, trying to classify whether a document is about the stock market

  • or is a recipe: you don’t see the word ‘green tea’ come up very much in a stock

  • market conversation and you don’t see the word ‘NASDAQ’ come up very much in a recipe.

  • You can quickly separate things in those kinds of tasks.

  • It’s often a question of knowing what’s right for your task at hand.

  • If you’re trying to get at a more qualitative understanding of what’s going on in a body

  • of text this is where structure may be very important.

  • Whereas if you’re just trying to separate out something that may be very indicative

  • on a word level, then in many ways a bag of words model can be quite strong.

  • How an RNN works. To understand its potential advantages over

  • a bag of words model what an RNN does is it reads through a sequence iteratively which

  • is really nice because it’s how people do it as well.

  • It’s able to preserve some of the structure of the sequence.

  • It goes through each word and updates its hidden representation based on that word and

  • the input from the previous hidden state. At time zero where we have no previous hidden

  • state we feed in either a bunch of zeroes or we treat it as another parameter to be

  • learned in [INAUDIBLE] representation. It just continues to do this all the way through

  • the sequence. At each time step we have a 512-dimensional (in this case

  • we had 512 hidden units) vector representation of our sequence.

  • It’s a way of taking the sequence of words and, time step by time step, converting it into a fixed-

  • length representation. As a bit of notation, arrows would be projections,

  • that is, dot products, and boxes would represent activities:

  • the vectors of values. For instance, the activation of each hidden

  • unit would be this box. It just proceeds iteratively through.

  • It’s important to note these projections are largely shared across all time steps.

  • This input projection, this arrow, is shared for all inputs across all time steps, and the

  • hidden-to-hidden connections are preserved as well across the sequence.

  • This is what makes learning tractable in these models.

  • At the end of iterating through the sequence we’ve got now a learned representation of

  • the sequence, a vector form of our sequence which then can be used by slapping on a traditional

  • classifier. In this toy example what we’ve done is we’ve

  • read in an input sentence. For instance, we’re trying to teach the

  • model how to classify the subject of a sentence. You can also stack them.

  • Just as your RNN can go through an input sequence and return its internal representation of

  • that sequence, you can then train another RNN on top of it or you can jointly train

  • both.

  • The structure is actually quite flexible.

  • One final note is the way that we do this original input-to-hidden feed-forward step: this

  • is typically represented either as a traditional one-hot encoding, which really doesn’t get us too

  • much advantage, but what’s really exciting is we can represent this as what’s called

  • an ‘embedding matrix’.

  • These words, ‘the cat sat on the mat’, would be represented as indexes into a matrix.

  • ‘The’ would be represented as index 100.

  • When we read through the sequence we would look up the row, row 100, and we would return

  • as the input being fed into the RNN the learned representation in that embedding matrix.

  • Let’s say we had 128 dimensions to be learned as an input representation for our words.

  • It would be equivalent to a 128 by, let’s say, 10,000 matrix if we have 10,000 words.

  • We would then feed this in as input.
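A minimal NumPy sketch of that embedding lookup; the vocabulary size, dimensionality, and token indices here are illustrative assumptions, not values from the talk:

```python
import numpy as np

vocab_size, embedding_dim = 10000, 128
# In practice this matrix is a learned parameter; random values stand in here.
embedding = np.random.randn(vocab_size, embedding_dim) * 0.01

token_ids = [100, 57, 2203, 9, 100, 731]  # e.g. 'the cat sat on the mat' as hypothetical indices
inputs = embedding[token_ids]             # look up one row per time step
print(inputs.shape)                       # (6, 128): one 128-dimensional vector per word
```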

  • That’s really cool because it gives us a way to learn representations

  • of our words.

  • We'll look later in this presentation at what those actually look like.

  • They give the model a lot of power.

  • The big thing in the literature is RNNs have a reputation for being very difficult to learn.

  • They are often known to be unstable: simple RNNs trained with generic stochastic gradient

  • descent are actually very unstable and difficult to learn.

  • What has happened in the research literature over the last few years is there are a bunch

  • of tricks that have been developed that help them be much more stable, much more powerful

  • and much more reliable, effectively.

  • To get an understanding of these we’re going to go quickly through all these various tricks.

  • The first of these is gating units.

  • To understand what a gating unit is we first need to look in a little bit more detail at

  • how a simple RNN works.

  • What happens is we have our hidden state from our previous time step.

  • Again, at the original time step this can just be zeroes or parameters and we receive

  • input at time step T.

  • We take input from the hidden state h at time t minus one and we take the input at time t.

  • We just add them together, for instance via a dot product projection, and then we apply

  • an element-wise activation function like [INAUDIBLE], for instance.

  • Then we have a new hidden state.

  • At the next time step we receive more input, we add it together and apply another element-

  • wise activation function, and this process continues forward.
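In code, a simple (non-gated) recurrent step like the one just described might look like the following NumPy sketch; the sizes are illustrative:

```python
import numpy as np

input_dim, hidden_dim, seq_len = 128, 512, 6
W_x = np.random.randn(input_dim, hidden_dim) * 0.01   # input-to-hidden projection
W_h = np.random.randn(hidden_dim, hidden_dim) * 0.01  # hidden-to-hidden projection
b = np.zeros(hidden_dim)

x = np.random.randn(seq_len, input_dim)  # stand-in for the embedded input sequence
h = np.zeros(hidden_dim)                 # hidden state at time zero: zeros (or a learned parameter)
for t in range(seq_len):
    # The same W_x and W_h are shared across all time steps.
    h = np.tanh(x[t] @ W_x + h @ W_h + b)
# h is now a fixed-length (512-dimensional) representation of the whole sequence.
```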

  • To understand the problem that can come with this: information is always being

  • updated at each time step.

  • As a result it becomes difficult for information to persist through a model like this.

  • You can think of this as a form of exponential decay.

  • If we have a value, let’s say here, of one, and through this process we effectively end

  • up multiplying that value by, say, 0.05, what happens over the course of several

  • time steps is that value will exponentially decay, for instance, to zero.

  • Information has difficulty spreading through a structure like this.

  • There have been various changes called ‘gating units’ to make this work better.

  • What a gating unit does is instead of having the hidden state at a new time step be a direct

  • operation of the previous time step, it adds in a variety of gating units that effectively

  • transform the information in a more structured way.

  • One of these is called the ‘gated recurrent unit’, introduced recently.

  • What it does is it uses two types of gates.

  • It uses a reset gate and a dynamics gate.

  • This is the reset gate and this is the dynamics gate.

  • What the reset gate does is it takes an input from a previous time step, both the hidden

  • representation and the input, and it computes a… there should be an element-wise sigmoid squash

  • in here, and what it does is it basically computes how much of the previous time step’s

  • information should continue along this route.

  • A reset value could be anywhere between zero and one and what it does is it multiplies

  • the previous hidden state by those reset values.

  • What this does is it allows a model to adaptively forget information.

  • For instance, you can imagine for sentiment analysis that some of our information might

  • be only relevant on the sentence level and once you see a period your model can then

  • clear some of its information because it knows the sentence is over and then be able to use

  • it again.

  • Once we have the previous time step’s information effectively gated by the reset gate, we then

  • update and get our potential new hidden state, h~t.

  • What we then do is we use the dynamics gate so that instead of just using h~t, which would be a

  • somewhat similar model to the previous example, we effectively average it with the previous

  • hidden state based on the dynamics gate.

  • We take the output of h~t, multiply it by Z, which again computes values between zero

  • and one for each unit, and we multiply and add them together.

  • This effectively [INAUDIBLE]

  • If ‘Z’ is zero what it would do is it would take entirely the new updated h~t and

  • none of the previous hidden state.

  • This would be equivalent in many ways to our previous simple recurrent unit.

  • In that case we would just have the new value come through.

  • Sure, it would be gated by the reset gate, but again the new hidden state would be a

  • completely new update compared to the previous hidden state.

  • Whereas if Z is one and this value here is one and this value here is zero so then our

  • new hidden state is just a copy effectively of the previous hidden state.

  • Z values of near one are ways of effectively propagating information over longer steps

  • of time.

  • You can think of it as the easiest way to remember something is just not to change it.

  • A value can be spread over very long periods of time if we just have Z values near one

  • because that way we don't have all this noise of updating our hidden states.

  • We just lock it and let that value persist.

  • That’s a way to, for instance, on the sentence level keep context from the previous sentence

  • around if we had Z values that were locked on.
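A minimal NumPy sketch of a gated recurrent step along the lines described above. Note that the sign convention for the dynamics (update) gate varies in the literature; this follows the talk’s description, where Z near one copies the previous hidden state:

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def gru_step(x_t, h_prev, params):
    W_r, U_r, W_z, U_z, W_h, U_h = params
    r = sigmoid(x_t @ W_r + h_prev @ U_r)              # reset gate: how much past to expose to the candidate
    z = sigmoid(x_t @ W_z + h_prev @ U_z)              # dynamics gate: blend old state with candidate
    h_tilde = np.tanh(x_t @ W_h + (r * h_prev) @ U_h)  # candidate new hidden state
    return z * h_prev + (1.0 - z) * h_tilde            # z near 1 keeps the old state, z near 0 takes the candidate

input_dim, hidden_dim = 128, 512
params = [np.random.randn(input_dim if i % 2 == 0 else hidden_dim, hidden_dim) * 0.01
          for i in range(6)]  # W_r, U_r, W_z, U_z, W_h, U_h
h = gru_step(np.random.randn(input_dim), np.zeros(hidden_dim), params)
```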

  • Again, like in our previous model, this can be expanded to another time step just so you

  • see how information flows.

  • Again, there are all these calculations involved in updating our hidden states and our gating

  • values but the information at its core really flows through this upper loop of gated values

  • interacting with previous hidden states.

  • Gating is essential.

  • That was an example of the theoretical reasons why these better-designed gates might

  • help propagate information better.

  • But empirically it's also very important.

  • For sentiment analysis of longer sequences of text, for instance a paragraph or so,

  • a few hundred words for instance, a simple RNN has difficulty learning at all.

  • You can see that it initially climbs downhill a little bit but all it’s actually doing

  • here is just predicting the average sentiment, for instance, 0.5.

  • Whereas a gated unit, a recurrent neural network, is able to quickly learn and continuously

  • learn.

  • Again, you can't use simple recurrent units for these more complex tasks especially when

  • you have longer sequences of 100 plus words or tokens.

  • They just don't work well because information is hard to keep over longer sequences of time

  • in those kinds of models.

  • Now that we've talked about gating there's another question which is what kind of gating

  • do you use.

  • There are two types of models that have been proposed.

  • Gated recurrent units by Cho, recently from the University of Montreal, which are used for

  • machine translation and speech recognition tasks.

  • Then there’s also the more traditional long short-term memory (LSTM).

  • This has been around much longer and has been used in far more papers.

  • Various modifications to the classic architecture exist but for text analysis GRU seems to be

  • quite nice in general.

  • It seems to be simpler, faster and optimizes quicker at least on the sentiment analysis

  • dataset.

  • Because it only has two gates compared to LSTM’s four it’s also a little bit faster.

  • If you have a larger dataset and you don't mind waiting a little bit longer, LSTM may

  • be better in the long run especially with larger datasets because it has additional

  • complexity with more gates.

  • But again it seems like GRU does quite well in these kinds of problems generally.

  • I tend to favor it myself but you can try both.

  • The library we’ll be introducing later in this talk supports both.

  • The next question is exploding gradients.

  • Exploding gradients are a training dynamics phenomenon that happens in recurrent neural

  • networks where the values that we’re trying to update [INAUDIBLE] at each step of our

  • training algorithm can become very large and very unstable.

  • This is one of the sources of the reputation of RNNs being hard to train.

  • Typically you would see small values, for instance the norm of your gradient would be

  • around one and just bouncing around and then sometimes you’d see huge spikes.

  • Those spikes can be quite damaging because a traditional learning update would then rapidly

  • change your values and this could result in unstable oscillations and your whole model

  • explodes.

  • In 2012 there was a great paper that proposed simply clipping the norm of the gradient.

  • If the gradient exceeded a set value, for instance 15, it would just be reset and scaled

  • to that value.

  • This was a common form of making RNNs much more stable.
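A minimal sketch of that gradient norm clipping; the threshold of 15 is the example value mentioned above:

```python
import numpy as np

def clip_gradients(grads, max_norm=15.0):
    """Rescale a list of gradient arrays if their global norm exceeds max_norm."""
    total_norm = np.sqrt(sum(np.sum(g ** 2) for g in grads))
    if total_norm > max_norm:
        scale = max_norm / total_norm
        grads = [g * scale for g in grads]
    return grads
```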

  • Interestingly though, at least on text analysis for sentiment, we don't seem to see this problem

  • with modern optimizers.

  • It seems that the gradient decays pretty cleanly and becomes quite stable over the course of

  • learning.

  • There's another way of making recurrent neural networks better and this is by using better

  • gating functions.

  • There was an interesting paper this year at NIPS the basic idea of which was let's make

  • our gates steeper so they change more rapidly from being a value of zero to a value of one.

  • What this means is a traditional sigmoid would change pretty smoothly between negative five

  • and five.

  • But when you randomly initialize one of these numbers at the beginning of training typically

  • your values would lie around the average, 0.5, for instance.

  • You wouldn’t see much dynamics here.

  • If we make our gate steeper what that means is our gates begin to rapidly switch between

  • zero and one much more easily, particularly near the beginning of learning.

  • What this seems to suggest is that models that have used these steeper gating units

  • tend to learn a bit faster because they begin to learn how to use these gates quicker.

  • This is another quick easy technique to add.

  • Again, the library we’ll be introducing later in this talk supports this, to help make learning

  • better in these models.
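A minimal sketch of a steeper sigmoid gate: scaling the pre-activation by a factor greater than one makes the gate switch between zero and one more sharply. The scale value here is illustrative, not the specific one from the paper:

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def steeper_sigmoid(a, scale=3.0):
    # Larger scale -> sharper transition from 0 to 1 around a = 0.
    return sigmoid(scale * a)
```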

  • Another technique is orthogonal initialization.

  • Andrew Saxe last year did some great work on initialization.

  • When we begin training these models we don’t know the values of these parameters to use

  • in these dot products, for instance; the weight matrices effectively.

  • What the research literature typically does is initialize [INAUDIBLE] with, for instance, random

  • Gaussian or random uniform noise.

  • What this research showed is that using random orthogonal matrices worked much better.

  • It's in line with some previous other work that has also noted various forms of similar

  • initializations worked well for RNNs.
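A minimal sketch of orthogonal initialization: start from Gaussian noise and keep the orthogonal factor of its SVD (a QR decomposition works similarly) as the weight matrix:

```python
import numpy as np

def orthogonal_init(shape, gain=1.0):
    a = np.random.randn(*shape)
    u, _, vt = np.linalg.svd(a, full_matrices=False)
    q = u if u.shape == shape else vt  # pick the factor with the requested shape
    return gain * q

W_h = orthogonal_init((512, 512))  # e.g. the hidden-to-hidden weight matrix
```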

  • Now we want to understand how we train these models.

  • There are a variety of techniques that can be used.

  • This is a visualization of the training dynamics of various algorithms on toy datasets where

  • we’re trying to classify these red dots from these blue dots.

  • We only have a linear model so all it can do is learn effectively a line separating

  • these two.

  • It can't do it perfectly because there's always going to be values separating this.

  • What we see is that the traditional, most basic optimizer is stochastic gradient descent, whereas

  • there are these various other improvements and techniques.

  • The main point of this example is, effectively, to demonstrate not to use SGD.

  • SGD very early on in training can look quite similar, but once the norm of your gradients

  • becomes smaller in the later stages of optimization you want some sort of dynamism in your learning

  • algorithm, whereas SGD, once it gets out of the very steep earlier areas of learning, tends

  • to slow down.

  • This is particularly a problem oftentimes in the space of text analysis because we have

  • very sparse updates on words, for instance.

  • There are rare words that you only see once every thousand or 100,000 words and those

  • words are very difficult to learn in a traditional SGD framework.

  • Whereas these various techniques like momentum and [INAUDIBLE] accelerated gradient what

  • they do is effectively average together multiple updates and accumulate those averages.

  • They’re a form of smoothing out this stochastic noise and accelerating directions of continuous

  • updates.

  • There’s another family of acceleration methods, the ‘Ada’ family, that effectively scale the learning

  • rate, the amount by which we update a parameter given a gradient, by some dynamics, some heuristics

  • describing the local gradient.

  • In the case of Adagrad what we do is we accumulate the norm of the gradient updates seen so far

  • with respect to a parameter and we scale our learning rate.
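A minimal sketch of that Adagrad-style update: accumulate squared gradients per parameter and divide the learning rate by the square root of the accumulated sum (the learning rate here is illustrative):

```python
import numpy as np

def adagrad_update(param, grad, cache, lr=0.01, eps=1e-8):
    cache = cache + grad ** 2                           # running sum of squared gradients
    param = param - lr * grad / (np.sqrt(cache) + eps)  # per-parameter scaled step
    return param, cache
```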

  • It’s a form of learning rate decay where we can see that early on it learns quite quickly

  • and later on it begins to slow down as it reaches in this case near [INAUDIBLE].

  • Adadelta and RMSprop do something a little bit like that but make it dynamic.

  • It’s based on the local history instead of the global history of the gradients for

  • a parameter.

  • There are a variety of optimizers and one recently introduced called ‘Adam’ combines

  • the early optimization speed that we saw in that earlier example of Adagrad with the better

  • later convergence of various other methods like Adadelta and RMSprop.

  • This looks quite good for text analysis with RNNs.

  • We can see that Adam gets off to a very early learning start just like Adagrad.

  • These results… actually, there’s a slight bug in my code for this so take them with

  • a grain of salt but they still look good and it's a bug in the code so it might still be

  • okay.

  • That might actually explain one of the reasons why we saw slightly worse generalization performance.

  • It would train quite well but we would see its performance on held out data might not

  • have been as good for Adam because it learned so much more quickly.

  • We’re still looking into reasons why this happens but in general, modern optimizers

  • are essential on these kinds of problems.

  • This just gives you a background on all the various techniques for making RNNs more efficient

  • in training and it can add quite a lot.

  • Early on in learning we can see that Adam and all these other techniques added together

  • so this would be just a standard gated RNN.

  • Again, if we had a simple RNN on here it would look pretty linear.

  • If we add gradient clipping to make it more stable so we can use a slightly larger learning

  • rate it begins to learn faster.

  • If we add orthogonal initialization we can see again that it began to learn faster and

  • learn better.

  • Finally, if we add Adam we see another huge gain over traditional SGD.

  • These add up.

  • We can see that Adam and all these other techniques are able to reach lower effective minima and

  • are at least faster. Up to 10x faster.

  • Admittedly these techniques add a little bit of computation time so it might only be, for

  • instance, 7.5x faster in wall-clock time as opposed to per-update efficiency.

  • This is interesting because now RNNs can actually overfit quite a lot.

  • As they continue to fit the training data, for instance, their test performance might plateau.

  • We continue to improve on the training dataset we’re given but this is called 'overfitting'

  • where our RNN is effectively optimizing for the details of the training data that aren’t

  • true of new data.

  • To combat this, one of the techniques that is used is called ‘early stopping’, which

  • means at each iteration through our dataset we will record the training and validation

  • scores of these models and we will stop once we notice that our validation performance

  • stops improving.
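A minimal sketch of that early-stopping loop; `model`, `train_data`, and `valid_data` are hypothetical placeholders for your own model object and data splits:

```python
def train_with_early_stopping(model, train_data, valid_data, max_epochs=20):
    best_score, best_epoch = float("-inf"), 0
    for epoch in range(max_epochs):
        model.fit(train_data)            # one iteration (epoch) through the training data
        score = model.score(valid_data)  # e.g. held-out accuracy
        if score > best_score:
            best_score, best_epoch = score, epoch
        else:
            break                        # validation performance stopped improving
    return best_epoch, best_score
```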

  • Oftentimes this is going to occur in your first or second iteration through the dataset

  • with all these various techniques together.

  • That's good news because oftentimes models in this space can take ten, 50 or 100 iterations

  • through your training data to converge.

  • It seems in the case of RNNs we often overfit after one or two epochs through the data.

  • To understand and get a better sense of how these models can do, we’re going to compare

  • them to a much more standard technique in the literature.

  • We’re going to use the fantastic machine learning library, scikit-learn, and we’re going

  • to use a standard linear model approach, a traditional approach to text analysis.

  • This would be using a TF-IDF vectorizer and a linear model such as logistic regression.

  • This is by no means meant to be the best model.

  • In many cases, naïve Bayes SVM is actually better than [INAUDIBLE] regression for classification,

  • for instance, but this is just a very easily accessible, very easily comparable technique.

  • To be fair, we're going to use bigrams which is a way of getting a little bit of structure

  • into our data.

  • Again, this way we could see ‘not good’ instead of just seeing the tokens for ‘not’

  • and ‘good’ occurring.

  • We can get a little bit of structure which might be useful in sentiment analysis.

  • We’re going to use grid search to evaluate potential [INAUDIBLE] for these linear models.

  • We’re going to look at two: minimum document frequency, which is a way of controlling

  • the size of the input to our linear model.

  • This would take tokens or words that appear in fewer than a given number of documents

  • and would ignore them.

  • If we see, for instance, the word ‘dinosaur’ and we’ve only seen it once in our dataset

  • we're going to ignore it effectively.

  • Also we’re going to look at the regularization coefficient, which is a way of preventing overfitting

  • for these linear models.

  • What we’re doing is grid search so we’re looking at potential values for both of these.

  • That way we’re not just explaining a potential performance difference by poorly fitted parameters.

  • Because these linear models tend to be faster we are able to more effectively search over

  • potential parameters.

  • This is a fair way to give the linear model a potential advantage because they’re much faster

  • so we can much more quickly search through multiple values.
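A minimal scikit-learn sketch of that baseline: a TF-IDF bigram vectorizer feeding logistic regression, with a grid search over minimum document frequency and the regularization strength. The specific grids are illustrative, not the ones from the talk:

```python
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

pipeline = Pipeline([
    ("tfidf", TfidfVectorizer(ngram_range=(1, 2))),  # unigrams plus bigrams
    ("clf", LogisticRegression(max_iter=1000)),
])

param_grid = {
    "tfidf__min_df": [1, 5, 10],  # ignore very rare tokens
    "clf__C": [0.1, 1.0, 10.0],   # inverse regularization strength
}

search = GridSearchCV(pipeline, param_grid, cv=3)
# search.fit(train_texts, train_labels)  # train_texts / train_labels are placeholders for your data
```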

  • Our second model we’re going to be looking at is one of these recurrent neural networks.

  • Admittedly, this is our own personal research: take every result with a grain of salt.

  • I'm using whatever I’ve tried that worked.

  • The general message though is that using a modern optimizer such as Adam, a gated recurrent

  • unit, steeper sigmoid gates and orthogonal initialization are good defaults.

  • A medium-size model that can work quite well is a 256 dimensional embedding and a 512 dimensional

  • hidden representation.

  • Then we put on whatever output we need: logistic regression for binary sentiment classification,

  • linear regression for predicting real values, etc.

  • It's quite flexible because the RNN in its core is a way of taking these sequences of

  • values and converting them into a vector.

  • Once we’ve got that vector we can put whatever traditional model we want on top of it so

  • long as it’s differentiable and open to gradient-based training.

  • How does this work on datasets?

  • What we see quickly here is that our linear model does incredibly well for

  • smaller datasets.

  • When we have for instance only 1,000 or 10,000 training examples we see that the linear model

  • outperforms the RNN by 50%, for instance.

  • But what we notice that’s interesting is, as our datasets get bigger, the RNN tends to scale

  • better with larger dataset sizes.

  • Because the RNN is admittedly a much more complex model and operates on the sequences

  • themselves ideally with more training data it can learn a much better way to do the task

  • at hand.

  • Whereas your linear model because it's operating on unstructured bag of words and is just a

  • linear model might eventually hit a wall where it's not able to do any better.

  • You can imagine certain situations that you just aren't going to be able to classify the

  • sentiment, positiveness or negativeness of a text when it uses double negation, for instance.

  • That’s one example with sentiment analysis.

  • What’s also interesting is we see this replicated for instance for predicting the helpfulness

  • of a customer review.

  • This is interesting because this is a much more qualitative thing.

  • Sentiment is as well, but how helpful a user’s review of a product is gets at a

  • much more abstract concept.

  • We see again that, as before, with small amounts of data the linear model, in this case a

  • regression since we’re predicting real values, does much better but it doesn’t seem to scale and make

  • use of more data as effectively as an RNN.

  • This is interesting.

  • We can see that RNNs seem to have poor generalization properties with small amounts of data but

  • they seem to be doing better when we have large amounts of data.

  • At one million labeled examples we can often be between zero and 30% better than the equivalent

  • linear model.

  • Again these are just these examples with logistic regression and linear regression but that

  • crossover seems to be robust and somewhere between 100,000 and a million examples but

  • it is dependent on the dataset.

  • There's only one unfortunate caveat to this approach which is it’s quite slow.

  • For a million paragraph size text examples to converge that linear model takes about

  • 30 minutes on a single CPU core.

  • For an RNN if we use a high-powered graphics card such as the GTX 980 it takes about two

  • hours.

  • That’s not too bad.

  • Our RNN on a proper high-end graphics card is only about four times slower at a million

  • examples to converge than the linear model.

  • Again this is on a basic CPU core.

  • But if we train our RNN on just that CPU core it takes five days.

  • This is unfortunate because this means our RNN on a CPU is about 250 times slower than the linear model,

  • and that’s just not going to cut it.

  • This effectively is why we use GPUs in this research.

  • Here's the cool part of the presentation.

  • Again, an RNN when it's being fed an input sequence takes in the sequence and effectively

  • learns a representation for each word.

  • Each word gets replaced: its identifier, some value like ‘the’ being token 100,

  • gets replaced with a vector representation that is learned by our model.

  • These visualizations we’ll be showing you are what happens when you look at what representations

  • are learned by those models.

  • What we’re going to do is use an algorithm called ‘t-SNE’ to visualize these embeddings

  • that our RNN learns.

  • What we’ve done to make it a little clearer is show the representations learned from

  • training on only binary sentiment analysis.

  • We’re trying to predict whether a given customer review, for instance, likes a product

  • or doesn't like a product.

  • What we’ve done is we’ve visualized these representations in two dimensions using t-SNE

  • and we've colored each word by the average sentiment of a review it appears in.
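A minimal sketch of that visualization; the embedding matrix and per-word sentiment colors below are random placeholders standing in for the learned embedding and the average sentiment of the reviews each word appears in:

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

vocab_size, embedding_dim = 1000, 256
embedding = np.random.randn(vocab_size, embedding_dim)  # would be the learned embedding matrix
avg_sentiment = np.random.rand(vocab_size)              # average sentiment per word (placeholder)

coords = TSNE(n_components=2).fit_transform(embedding)  # project to two dimensions
plt.scatter(coords[:, 0], coords[:, 1], c=avg_sentiment, s=4, cmap="coolwarm")
plt.colorbar(label="average sentiment of reviews containing the word")
plt.show()
```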

  • What we see is a kind of axis.

  • Again, it doesn’t correspond to any actual aligned axis because it’s t-SNE.

  • But we see this continuum between very negative words and very positive words.

  • This isn't too surprising.

  • A model trained on sentiment analysis learns to separate out negative and positive words.

  • That's what you'd expect to happen.

  • We can take a little look at these very positive and very negative clusters and see that it's

  • grouped into very understandable words like ‘useless’, ‘waste’, ‘poorly’, ‘disappointed’

  • as negative.

  • You can see some interesting stuff where again this visualization tries to group similar

  • things close together.

  • We can see that it’s actually identified even though it is a very negative grouping,

  • it’s also identified ‘returned’, ‘returning’, ‘returns’, ‘return’ all together as well.

  • That’s interesting because it seems to know that ‘returned’ and return-related words

  • are very negative unsurprisingly if you find them in a review.

  • But it's also separated them out slightly from other more generic words.

  • Then on the positive side we also see very unsurprising indicators of happy sentiments.

  • So ‘fantastic’, ‘wonderful’, and ‘pleased’.

  • But what's even more interesting about this model is that we see other forms of grouping

  • and structure being learned.

  • We see that it pulls out, for instance, quantities of time: weeks, months, hours, minutes.

  • We also see that it pulls out qualifiers like ‘really’, ‘absolutely’, ‘extremely’, ‘totally’.

  • Again, qualifiers are interesting because they are by themselves neutral.

  • They don't necessarily indicate positive or negative sentiment; instead they modify it.

  • You can have ‘extremely good’ and ‘extremely bad’.

  • You see that being pulled out together.

  • You also see product nouns, for instance things that products could be, things that are products

  • like movies, books, stories, items, devices are also grouped together.

  • Additionally, punctuation is grouped together.

  • This is indicative potentially of our model learning to use these kinds of data which

  • again implies that our model may actually be learning to use some of the structure present

  • in the data.

  • By grouping punctuation together and learning similar representations for it, the model implies

  • that it’s finding some use for it.

  • We would expect again punctuation to be quite useful for segmenting out and separating out

  • meanings and notions.

  • Quantities of time are interesting.

  • They are slightly negatively associated which is understandable when you talk about ‘this

  • product took months to show up’ or ‘it worked for a total of an hour’.

  • Again, grouping them all together implies some use of it and the same thing with qualifiers.

  • We have no true evidence at least in this picture of these words being used but by learning

  • similar representations and by having them grouped together it implies it's finding a

  • use for them.

  • We can extrapolate from there that it may in fact be learning to use these words in

  • natural ways for sentiment analysis.

  • Again, this is learned purely from zero and one binary indicator variables.

  • This is a bit like seeing a sequence of numbers, 1,000, 2,000, 3046, five and then realizing

  • that tokens five and one thousand are exclamation point and period.

  • They're similar to tokens 2,000 and 7,000 which are comma and colon.

  • This is a very strong result and very interesting to see this kind of similarity being learned

  • by our model.

  • This is cool but how can we actually use these models?

  • We're also presenting today a basic library to allow developers to use these recurrent

  • neural networks for text analysis.

  • It's called Passage and it’s a tiny library built on top of the great Theano machine learning

  • framework [INAUDIBLE] math library.

  • It's incredibly alpha; we're working on it but it has a variety of features.

  • We're going to walk through now an example of how to use Passage.

  • This is Passage.

  • It’s clonable via GitHub and it has a variety of tools to make this useful.

  • This is a little example we're going to walk through and explain real quickly on how we

  • can use Passage to do analysis of text.

  • We need to import the components that are necessary.

  • One of these is the tokenizer which is a way of taking strings of text and separating them

  • out into the individual tokens which would be words and punctuation, for instance.

  • A tokenizer can just be instantiated.

  • It has a variety of parameters but has sensible defaults.

  • What we do is we emulate an SKLearn-style interface.

  • We can call fit transform on a body of training text which would be again a list of strings

  • for instance and that would return a list of these training tokens which can be used

  • natively by Passage to train RNN models.

  • Additionally, we're going to import the various layers of a model.

  • We have that embedding matrix we talked about, the gated recurrent unit, and a dense output

  • classifier.

  • The way that we compose these into a training model is by stacking them together in a list.

  • Our input is one of these embedding matrices and we're going to set it to have 128 dimensions.

  • We need to know how many of these features to learn, how many of these tokens there are,

  • and we're going to just pull that out of how many our tokenizer decided we needed.

  • Then we’re going to use one of these gated recurrent layers, in this case setting

  • its size to 128.

  • The sizes are sometimes smaller than you would use for actual models and you can see better

  • performance from larger models, for instance, but these are small enough to be run on a

  • CPU and not take forever.

  • They’ll still take quite a while though.

  • Finally, we have our dense output unit, which, if we were doing binary sentiment

  • classification, detecting if it’s negative or positive for a string of text, would be

  • one unit because we’d be predicting one value.

  • We would use a sigmoid activation as a way of quickly separating out negative and positive

  • values.

  • Then to make this model we instantiate it through the model class which is just importable

  • from Passage dot models RNN.

  • We give it the layers we want to build our custom architecture out of and we tell it

  • what cost function we want to optimize.

  • The cost function is the effective function that lets us train this model.

  • It's just a way of telling the model how good was this, how good did you do on this example,

  • effectively.

  • For binary classification we use binary [INAUDIBLE] in this example.

  • To train this model we just call a fit interface which takes in training tokens which are made

  • from training text and also takes in the training labels we want to predict given those training

  • texts.

  • Then once that model has been trained… It should be noted this only trains for one iteration

  • through your dataset.

  • As mentioned earlier, you may want to train for multiple iterations if for instance your

  • model hasn’t converged and you may want to measure your performance on held-out data

  • to know when to stop training a model if it begins to overfit.

  • Right now we've left that part to you but we will be extending this to have interfaces

  • to automatically do this.

  • Finally, if you want to have your model then predict on new data you can just call model

  • dot predict on tokenizer dot transform of your test text and this will return how the model

  • predicts new data.
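Assembled end to end, the walkthrough above corresponds roughly to the following sketch, in the style of the Passage interface described in the talk (module paths and defaults may differ between Passage versions; the training and test strings are placeholders):

```python
from passage.preprocessing import Tokenizer
from passage.layers import Embedding, GatedRecurrent, Dense
from passage.models import RNN

train_text = ["this product was fantastic", "completely useless, returned it"]  # placeholder data
train_labels = [1, 0]
test_text = ["really pleased with it"]

tokenizer = Tokenizer()
train_tokens = tokenizer.fit_transform(train_text)  # strings -> sequences of token ids

layers = [
    Embedding(size=128, n_features=tokenizer.n_features),  # learned word representations
    GatedRecurrent(size=128),                              # gated recurrent hidden layer
    Dense(size=1, activation='sigmoid'),                   # one output unit for binary sentiment
]

model = RNN(layers=layers, cost='BinaryCrossEntropy')
model.fit(train_tokens, train_labels)  # one iteration through the dataset

predictions = model.predict(tokenizer.transform(test_text))
```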

  • That's an example of how to use Passage.

  • To summarize, RNNs are now a potentially competitive tool in certain situations for text analysis.

  • Admittedly there are a lot of disclaimers there, but there seems to be a general trend

  • emerging, which is: if you have a large, for instance million-plus example

  • dataset and you have a GPU they can look quite good.

  • They potentially can outperform linear models and might not take all that much longer.

  • But if you have a smaller dataset and don't have a GPU it can be very difficult to justify

  • despite how cool these models might seem compared to linear models.

  • They are a lot slower, they have a lot of complexity, a lot of different parts and a

  • lot of different architectures you can change and they seem to have poor generalization

  • results with small datasets.

  • Thanks for listening.

  • If you have any questions you can let me know at Alec at Indico dot I-O.

  • Also if you'd like to see a more general introduction to machine learning and deep learning in Python

  • I have another video that you can check out in the upper right, introducing that as a

  • Python developer, how to use the awesome Theano library to implement these algorithms yourself.

  • Additionally if you'd like to check out and learn more about Indico feel free to visit

  • our website at Indico dot I-O where we have various tools like Passage available for developers

  • to use for machine learning.

  • Thanks.
