(upbeat ambient music)
- I'm hoping that I'm gonna tell you something
that's interesting, and of course,
I have this very biased view,
which is that I look at things through my computational lens.
Are there any computer scientists in the room?
I was anticipating not, but okay, there are,
so there's one. Maybe every now
and then I'll ask you a question,
no, no, no, I'm just kidding.
So my goal here is gonna be to basically
give you a flavor of what machine learning is,
this is my expertise, and so, actually,
again, to get a sense of who's in the room,
like, if I picked on someone here,
raise your hand if you would be able to answer
that question, like, what is machine learning?
Okay, a handful, no, actually one or two.
Great, okay, so I just want to give you a sense
of that, and I'm gonna, you know,
most of this is gonna be pretty intuitive,
I'll try to make little bits of it concrete
that I think will be helpful,
and then I'll tell you how we use machine learning
to improve guide designs, specifically
for knockdown experiments, but I think a lot
of it is probably useful for more than that,
but we haven't sort of gone down that route,
and so I can't say very much about that.
And please interrupt me if something doesn't make sense
or you have a question, I'd rather do
that so everybody can kind of stay on board, rather
than, you know, it making less
and less sense the longer I go.
Alright, so machine learning. Actually, during my PhD,
one of the big flagship conferences was peaking
at around 700 attendees, and when I go now,
it's actually capped, like, it's sold out at 8,000
months in advance, 'cause this field has just, like,
taken off, basically it's now lucrative for companies,
and it's become a really central part of Google,
Microsoft, Facebook, and all the big tech companies,
so this field has changed a lot,
and kind of similar to CRISPR,
there's an incredible amount of hype and buzz
and ridiculous media coverage and
so it's a little bit funny, in fact,
that I'm now working in these two kind of
very hyped-up areas.
But anyway, so, you know,
in the mainstream press now,
you're always hearing about artificial intelligence
and deep neural networks, and so,
I would say machine learning is a sub-branch
of artificial intelligence,
and a deep neural network is sort
of an instance of machine learning, and so, like,
what really is this thing?
So it kind of overlaps sometimes
with traditional statistics,
like, in terms of the machinery,
but the goals are very different.
Really, the core, fundamental concept here is
that we're gonna sort of posit some model, so maybe,
think of linear regression as a super simple model,
and you can, like, expose it to data, it has some parameters,
right, the weights, and then we essentially want
to fit those weights, and that's the training,
that's literally the machine learning.
So I'm sorry if that sounds super simple
and not, like, God-like, machine learning
and everything working magically,
but that really is what it is,
and so let me just also, like,
sort of drive home that point.
So we're gonna posit some sort of model,
and so here I'm giving you the simplest example
because I think most people here work
with linear regression at some point in their life,
and so you can think of this as a predictive model
in the sense that if I give it a bunch
of examples of Y and X, and I learn the parameter beta,
then for future examples where I don't have Y
but I only have X, I can just compute
X times beta, and I get a prediction of Y.
So that's the sense in which I call this a predictive model,
and that's very much how machine learning people tend
to think of it, where statisticians are often very focused
on what is beta, what are the confidence intervals
around beta and things like this.
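To make that predictive-model view concrete, here's a minimal sketch in Python. The numbers are toy data, and it's deliberately the simplest case, a single feature with no intercept, not the actual guide-design pipeline:

```python
# Linear regression as a predictive model: fit beta on (X, Y) pairs,
# then predict Y for future examples where only X is observed.
# Single feature, no intercept, to keep the algebra minimal.

def fit_beta(xs, ys):
    """Closed-form least-squares estimate for y ~ x * beta."""
    return sum(x * y for x, y in zip(xs, ys)) / sum(x * x for x in xs)

def predict(x, beta):
    """For a future example with no observed Y, just compute X times beta."""
    return x * beta

# Toy training data where y = 2x exactly:
xs = [1.0, 2.0, 3.0]
ys = [2.0, 4.0, 6.0]
beta = fit_beta(xs, ys)        # beta comes out as 2.0
print(predict(4.0, beta))      # -> 8.0
```

The point is the split between training (seeing both X and Y, fitting beta) and prediction (seeing only X, computing X times beta).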
So that's the sense
in which there's a lot of overlap,
but the goals are quite different.
We want to like, use real data
and make predictions, so here it's gonna be predictions
about guides, and which guides are effective
at cutting and at knockout.
Right, and so it has these free parameters,
and we call these things that we put in here features,
and so in the case of guide design,
the question is gonna be, what features are we gonna put
in there that allow us to make these kinds of predictions,
and, so I'm gonna get into that in a little bit,
but just as an example to make this concrete,
it might be how many GCs are in this 30mer guide,
or guide plus context.
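As a concrete sketch of that kind of feature (the 30mer here is a made-up sequence, purely for illustration):

```python
# One example feature: how many G or C bases appear in a
# guide-plus-context sequence (e.g. a 30mer).

def gc_count(seq):
    """Count of G and C bases in the sequence."""
    return sum(1 for base in seq.upper() if base in "GC")

thirty_mer = "ACGTACGTACGTACGTACGTACGTACGTAC"  # made-up 30mer
print(gc_count(thirty_mer))  # -> 15
```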
Right, and like I said,
we're gonna give it some data,
and so in this case, the data for guide design is gonna be
data from (mumbles), there's a community dataset
that's now publicly available where there are examples of,
for example, what the guide was
and how effective the knockout was,
or what the cutting frequency was.
So I get a bunch of these examples,
and then that's gonna enable me
to somehow find a good beta, and of course,
actually, we do sometimes use linear regression,
but I'll tell you a little bit more about
more sort of complex and richer models
that let us do a lot more, and then the goal is going
to be to fit this beta in a good way,
and, like, I'm not gonna do some deep dive on that here,
but the one way that you are probably familiar
with is just mean squared error,
and when you find the beta that minimizes this
for your example training data,
then you get some estimate of beta,
and you hope that on unseen examples,
when you do X times beta, it gives you a good prediction.
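A minimal sketch of that objective with toy numbers: mean squared error over the training examples, and, for the single-feature case, the closed-form beta that minimizes it. (The data values are invented for illustration.)

```python
# Mean squared error of predictions X * beta against observed Y,
# and the beta that minimizes it (single feature, no intercept).

def mse(xs, ys, beta):
    """Average squared error over the training examples."""
    return sum((y - x * beta) ** 2 for x, y in zip(xs, ys)) / len(xs)

def best_beta(xs, ys):
    """Minimizer of the MSE above: sum(x*y) / sum(x*x)."""
    return sum(x * y for x, y in zip(xs, ys)) / sum(x * x for x in xs)

xs = [1.0, 2.0, 3.0]
ys = [2.1, 3.9, 6.2]          # noisy toy measurements
b = best_beta(xs, ys)
# No nearby beta does better on the training data:
print(mse(xs, ys, b) <= mse(xs, ys, b + 0.1))  # -> True
print(mse(xs, ys, b) <= mse(xs, ys, b - 0.1))  # -> True
```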
So does that make it somewhat concrete,
what I mean by a predictive model,
how you could view linear regression
as a predictive model, and how you might use this
for guide design?
Okay, so obviously I'll tell you a lot more.
So, right, linear regression is just sort
of the simplest possible example,
and so in our work we actually use,
some of the time, what are called classification
or regression trees, and so in contrast
to here, where you might have, say,
a bunch of these features,
right, like how many GCs were in my guide,
and then another feature might be,
was there an A in position three,
and you can put in as many as you want,
and then you get all these betas estimated.
So it's very simple, because in that case,
none of these features can interact with each other,
right, you just add X one times beta one
plus X two times beta two, so we call this, like,
a linear additive model.
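That additive point can be spelled out in one line of code. The feature values and weights below are made-up numbers, just to show that each feature contributes independently through its own beta:

```python
# A linear additive model: prediction = x1*beta1 + x2*beta2 + ...
# No feature can modulate the effect of another.

def linear_predict(features, betas):
    """Sum of each feature times its own weight, nothing more."""
    return sum(x * b for x, b in zip(features, betas))

# e.g. features = [GC count, A-at-position-three (0 or 1)],
# with hypothetical weights:
print(linear_predict([15, 1], [0.02, -0.1]))  # approximately 0.2
```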
In contrast, these trees allow very sort
of deep interactions among the features,
so this might be how many GCs,
well, of course, this is just
not suited to the features I just described,
but this might be some feature like,
I don't know, proportion of GCs,
'cause now it's fractional, and then,
this algorithm, which is gonna train the model,
so find good split values, well, sort of
through a procedure that I'm not gonna go into in detail
for all these models, how it works,
but it's going to somehow look at the data
and determine that it should first split
on the second feature at this value,
and then it will sort of keep going down that.
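Here's a minimal sketch of the split-selection idea, not the actual algorithm used in the speaker's work: try candidate thresholds on one feature (proportion of GCs, with toy activity values) and keep the split that most reduces squared error; a real regression tree then recurses on each resulting partition.

```python
# Pick the single split on one feature that minimizes the squared
# error of predicting each side of the split by its mean.

def sse(ys):
    """Squared error when a partition is predicted by its mean."""
    m = sum(ys) / len(ys)
    return sum((y - m) ** 2 for y in ys)

def best_split(xs, ys):
    """Return (threshold, error) for the best split 'x <= threshold'."""
    best_t, best_err = None, float("inf")
    for t in sorted(set(xs)):
        left = [y for x, y in zip(xs, ys) if x <= t]
        right = [y for x, y in zip(xs, ys) if x > t]
        if not left or not right:
            continue  # a split must put examples on both sides
        err = sse(left) + sse(right)
        if err < best_err:
            best_t, best_err = t, err
    return best_t, best_err

# Toy data: proportion of GCs vs. measured guide activity.
gc_frac = [0.2, 0.3, 0.6, 0.7]
activity = [0.1, 0.2, 0.8, 0.9]
print(best_split(gc_frac, activity))  # splits at threshold 0.3
```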
It says, "Now partition the examples