[DING] Hello, and welcome to another Beginner's Guide to Machine Learning video tutorial. In this video, I am going to cover the pre-trained model PoseNet. I'm going to look at what PoseNet is, how to use it with the ml5.js library and the p5.js library, and how to track your body in the browser in real time. The model, as I mentioned, that I'm looking at is called PoseNet. [MUSIC PLAYING]

With any machine learning model that you use, the first question you probably want to ask is: what are the inputs? [MUSIC PLAYING] And what are the outputs? [MUSIC PLAYING] In this case, the PoseNet model is expecting an image as input. [MUSIC PLAYING] And as output, it is going to give you an array of coordinates. [MUSIC PLAYING] In addition to each of these x,y coordinates, it's going to give you a confidence score for each one. [MUSIC PLAYING]

And what do all these x,y coordinates correspond to? They correspond to the keypoints on a PoseNet skeleton. [MUSIC PLAYING] Now, the PoseNet skeleton isn't necessarily an anatomically correct skeleton. It's an arbitrary set of 17 points, which you can see right over here, from the nose all the way down to the right ankle. The model tries to estimate where those positions are on the human body and gives you x,y coordinates, as well as how confident it is about each of those points.

One other important question you should ask yourself, and do some research about, whenever you find yourself using a pre-trained model out of the box, something that somebody else trained, is: who trained that model? Why did they train it? What data was used to train it? And how was that data collected? PoseNet is a bit of an odd case, because the trained model itself is open source. You can use it, you can download it, and there are examples for it in TensorFlow.js and ml5.js. But the actual code for training the model, from what I've been able to find, is closed source. So there aren't a lot of details.

A data set that's often used in training models around images is COCO, or Common Objects in Context. It has a lot of labeled images of people striking poses, with their keypoints marked. I don't know for a fact whether COCO was used exclusively for training PoseNet, whether it was used partially, or not at all. But your best bet for a starting point for finding out as much as you can about the PoseNet model is to go directly to the source: the GitHub repository for PoseNet. In fact, there's a PoseNet 2.0 coming out. I would also highly suggest you read the blog post "Real-time Human Pose Estimation in the Browser with TensorFlow.js" by Dan Oved, with editing and illustrations from Irene Alvarado and Alexis Gallo. It has a lot of excellent background information about how the model was trained and other relevant details. If you want to learn more about the COCO image data set, I would also point you towards the Humans of AI project by Philipp Schmitt, an artwork and online exhibition that takes a critical look at the data in that data set itself.

If you found your way to this video, most likely you're here because you're making interactive media projects. And PoseNet is a tool that you can use to do real-time body tracking very quickly and easily. It's, frankly, pretty amazing that you can do this with just a webcam image.
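To make that input/output contract concrete, here is roughly the shape of one element of the array ml5.js hands back per detected person, matching the properties used later in this video. This is a sketch of the structure, not real output; all the numbers are invented for illustration.

```javascript
// Illustrative shape of one element of the `poses` array (values made up).
const examplePoseResult = {
  pose: {
    score: 0.31, // confidence score for the entire pose
    keypoints: [
      // all 17 points, from "nose" down to "rightAnkle"
      { part: "nose", position: { x: 301, y: 145 }, score: 0.98 },
      { part: "leftEye", position: { x: 285, y: 131 }, score: 0.97 },
      // ...15 more keypoints
    ],
    // each keypoint is also exposed by name for convenience:
    nose: { x: 301, y: 145, confidence: 0.98 },
  },
  skeleton: [
    // pairs of connected keypoints, e.g. [leftShoulder, leftElbow]
  ],
};
```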
So one way to get started, which in my view is one of the easiest ways, is with the p5 Web Editor and the p5.js library. I have a sketch here which connects to the camera and just draws the image in a canvas. You also want to make sure you have the ml5.js library imported, and that would be through a script tag in index.html. Once you've got all that set up, we're ready to start coding.

So I'm going to create a variable called poseNet. I'm going to say poseNet equals ml5.poseNet. All the ml5 functions are initialized the same way: by referencing the ml5 library, dot, the name of the function, in this case poseNet. Now typically, there are some arguments that go here, and we can look up what those arguments are by going to the documentation page. Here we can see there are a few different ways to call the poseNet function. I want to do it the simplest way possible. I'm just going to give it the video element and a callback for when the model is loaded, which I don't even know that I need. [MUSIC PLAYING] I'll make sure there are no errors and run this again. And we can see PoseNet is ready. So I know I've got my syntax right, I've called the poseNet function, and I've loaded the model.

The way PoseNet works is actually a bit different than everything else in the ml5 library. It works based on event handlers. So I want to set up a pose event by calling the method on(). On 'pose', I want this function to execute: whenever the PoseNet model detects a pose, call this function and give me the results of that pose. I can add that right here in setup: poseNet.on('pose'), and then I'm going to give it a callback called gotPoses. [MUSIC PLAYING] And now, presumably, every single time it detects a pose, it sees me, it sees my skeleton, it will log that to the console right here.

Now that it's working, I can see a bunch of objects being logged. Let's take a look at what's inside those objects. The p5 console is very useful for your basic debugging, but in this case I really want to dive deep into this object that I'm logging here, the poses object. So I'm going to open up the actual developer console of the browser. I can see a lot of stuff being logged here very, very quickly. I'm going to pick any one of these and unfold it. I can see that I have an array, and the first element of the array is a pose. There can be multiple poses that the model is detecting if there's more than one person; in this case, there's just one. And I can look at this object. It's got two properties, a pose property and a skeleton property. I definitely want to come back to the skeleton property, but let's start with the pose property.

I can unfold that, and we can see, oh my goodness, look at all this stuff in here. First of all, there's a score. I mentioned that with each one of these x,y positions of every keypoint there is a confidence score. There is also a confidence score for the entire pose itself. And because the camera's seeing very little of me, it's quite low, just at 30%. Then I can actually access any one of those keypoints by its name: nose, left eye, right eye, all of these, all the way down once again to right ankle.

So let's actually draw something based on any of those keypoints. We'll use my nose. I'm going to make the assumption that there's always only going to be a single person. If there were multiple people, I'd want to do this differently. And I'm going to, let me hit stop, I'm going to make a variable called pose.
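Here's a minimal sketch reconstructing what's been built so far, assuming ml5.js is loaded via a script tag in index.html and an ml5.js version (pre-1.0) where ml5.poseNet is available. The callback names match what the video uses.

```javascript
// sketch.js — minimal p5.js + ml5.js PoseNet setup
let video;
let poseNet;

function setup() {
  createCanvas(640, 480);
  video = createCapture(VIDEO);
  video.hide(); // hide the raw DOM video; we draw it to the canvas instead

  // load PoseNet, passing the video element and a "model ready" callback
  poseNet = ml5.poseNet(video, modelLoaded);

  // register an event handler that fires every time a pose is detected
  poseNet.on('pose', gotPoses);
}

function modelLoaded() {
  console.log('poseNet ready');
}

function gotPoses(poses) {
  console.log(poses); // an array: one element per detected person
}

function draw() {
  image(video, 0, 0);
}
```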
Then I'm going to say: if it's found a pose, and I can check that by just checking the length of the array. If the length of the array is greater than zero, then pose equals poses[0]. I'm going to take the first pose from the array and store it in the global variable. But actually, if you remember, the object in the array has two properties, pose and skeleton. So it seems there's some redundant naming here, but I'm going to say poses[0].pose. [MUSIC PLAYING] This could be a good place to use the confidence score, like only use a pose if it has a high confidence, but I'm just going to take any pose that it gives me.

Then in the draw function, I can draw something based on that pose. So for example, let me give myself a red nose. [MUSIC PLAYING] So now if I run the sketch, ah, I got an error. Why did I get that error? The reason is that it hasn't found a pose yet, so there is no nose for it to draw. So I should always check to make sure there is a valid pose first. [MUSIC PLAYING] Then draw that circle. And there we go. I now have a red dot always following my nose.

If you're following along, pause the video and try to add two more points where your hands are. Now, there isn't actually a hand keypoint; it's a wrist keypoint. But that'll probably work for our purposes. I'll let you try that. [TICKING] [DING] How did that go? OK, I'm going to add it for you now. [MUSIC PLAYING] Let's see if this works. Whoo. This is working terribly. I'm almost kind of getting it right. And there we go. But why is it working so poorly? Well, first of all, I'm only showing it my body from the waist up, and most likely the model was trained on full-body images. [MUSIC PLAYING] Now I've turned the camera to point at me over here, and I'm further away. And you can see how much more accurate this is, because it sees so much more of my body. I'm able to control where the wrists are and get pretty accurate tracking as I'm standing further away from the camera.

There are also some other interesting tricks we could try. For example, I could estimate distance from the camera by looking at how far apart the eyes are. [MUSIC PLAYING] So for example here, I'm storing the right eye and left eye locations in separate variables, and then calling the p5 dist() function to see how far apart they are. Then I can just take that distance and assign it to the size of the nose. So as I get closer, the nose gets bigger. And you almost can't tell, because it's sizing relative to my face, but it gives it more of a realistic appearance of an actual clown nose that's attached, by changing its size according to the proportions of what it's detecting in the face.

You might be asking yourself: well, what if I want to draw all the points, all the points that it's tracking? For convenience, I was referencing each point by name: right eye, left eye, nose, right wrist. But there's actually a keypoints array that has all 17 points in it. So I can use that to just loop through everything, if that's what I want to do. [MUSIC PLAYING] I can loop through all of the keypoints and get the x,y of each one. [MUSIC PLAYING] And then I can draw a green circle at each location. Oops. That code didn't work, because I forgot that each element, each keypoint, is more than just an x,y. It's got the confidence score, the name of the part, and a position. So I need pose.keypoints[i].position.x and pose.keypoints[i].position.y.
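Pulling those steps together, here's a rough sketch of the gotPoses callback and draw function as described above: store the first pose, guard against it not existing yet, size the nose by the eye distance, mark the wrists, and loop over the keypoints array. It assumes the setup code from the earlier sketch.

```javascript
let pose; // global; stays undefined until the first pose arrives

function gotPoses(poses) {
  // take the first detected person, assuming only one person in frame
  if (poses.length > 0) {
    pose = poses[0].pose;
  }
}

function draw() {
  image(video, 0, 0);
  if (pose) { // guard: no pose yet means nothing to draw
    // estimate scale from how far apart the eyes are
    const eyeR = pose.rightEye;
    const eyeL = pose.leftEye;
    const d = dist(eyeR.x, eyeR.y, eyeL.x, eyeL.y);

    // red "clown nose" sized by the eye distance
    fill(255, 0, 0);
    ellipse(pose.nose.x, pose.nose.y, d);

    // wrists standing in for hands
    fill(0, 255, 0);
    ellipse(pose.rightWrist.x, pose.rightWrist.y, 32);
    ellipse(pose.leftWrist.x, pose.leftWrist.y, 32);

    // green dot at every one of the 17 keypoints
    for (let i = 0; i < pose.keypoints.length; i++) {
      const { x, y } = pose.keypoints[i].position;
      fill(0, 255, 0);
      ellipse(x, y, 16);
    }
  }
}
```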
Now I believe this'll work. And here we go. The only thing I'm not seeing are my ankles. Oh, there we go! I got it kind of accurate there. Here's my pose. OK, so you can see I'm getting all the points of my body right now, standing probably about six feet away from the camera.

There's one other aspect of this that I haven't shown you yet. If you've seen demos of PoseNet and some of the examples, the points are connected with lines. On the one hand, you could just memorize: always draw a line from the shoulder to the elbow, and from the elbow to the wrist. But PoseNet, I presume based on the confidence scores, will dynamically give you back which parts are connected to which parts. And that's in the skeleton property of the object found in the array that was returned to us. So I can add a new global variable called skeleton. This would've been good for Halloween. skeleton equals, and let me just stop this for a second, poses[0].skeleton. I can loop over the skeleton. [MUSIC PLAYING] And skeleton is actually a two-dimensional array, because in the second dimension it holds the two locations that are connected. So I can say a equals skeleton[i][0], and b is [MUSIC PLAYING] skeleton[i][1]. And then I can just draw a line between the two of them. [MUSIC PLAYING] I look at every skeleton connection, I get the two parts, part a and part b, and just draw a line between the x's and y's of each of those. [MUSIC PLAYING] I'll make it a kind of thicker line and give it the color white. And let's see what this looks like. And there we go. A sketch of this skeleton-drawing step appears at the end of this section.

That's pretty much everything you can do with the ml5 PoseNet function. So you might try to do something like make googly eyes; that's something I actually did in a previous video where I looked at an earlier version of PoseNet. You could also look at some of these other examples that demonstrate other aspects. For example, you can actually find the pose in a JPEG that you load, rather than in images from a webcam.

But what I want to do, which I'm going to get to in a follow-up video to this, is not take the outputs and draw something, but rather take these outputs and feed them as training data into an ml5 neural network. What if I say: hey, every time I make this pose, label it a Y. And every time I make this pose, label it an M, a C, an A, you see where I'm going. Could I create a pose classifier? I can use all of the x,y positions, label them, and train a classifier to make guesses about my pose. This is very similar to what I did with the Teachable Machine image classifier. The difference is, with the image classifier, as soon as I move the camera to a different room with different lighting, a different background, and a different person, it's not going to be able to recognize the pose anymore, because that was trained on the raw pixels. This is trained only on the relative positions. So in theory, if somebody around the same size as me swapped in, it would recognize their pose. And there's actually a way that I could normalize all the data, so that it would potentially work for anybody's pose. So you can train your own pose classifier that'll work generically in a lot of different environments.

So if you make something with ml5 PoseNet, or with PoseNet in another environment, please share it with me. I'd love to check it out. You can find the code for everything in this video in the link in this video's description.
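For reference, here's a rough sketch of the skeleton-drawing step described above. It extends the earlier sketches: gotPoses now also stores the skeleton array, and drawSkeleton would be called from draw() after the keypoints are rendered.

```javascript
let skeleton; // global; each entry is a pair of connected keypoints

function gotPoses(poses) {
  if (poses.length > 0) {
    pose = poses[0].pose;
    skeleton = poses[0].skeleton;
  }
}

// call this from draw(), after drawing the keypoints
function drawSkeleton() {
  if (!skeleton) return; // nothing detected yet
  for (let i = 0; i < skeleton.length; i++) {
    const a = skeleton[i][0]; // one end of the connection
    const b = skeleton[i][1]; // the keypoint it connects to
    strokeWeight(2);
    stroke(255);
    line(a.position.x, a.position.y, b.position.x, b.position.y);
  }
}
```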
And I'll see you in a future "Coding Train" ml5 machine learning beginner, whatever, something video. [WHISTLE] Goodbye. [MUSIC PLAYING]