Oh, welcome to part one of an infrequent Siri's about augmenting reality in this series.
I'm going to be learning about exploring the algorithms that operate behind the scenes on enable augmented reality applications.
And I think a good place to start will be with one of the more enjoyable algorithms optic flow.
Before we get stuck into the algorithm, I thought it might be worth having a look at the set up in what I've got.
Here is my webcam at a low resolution displaying a grayscale image at the command prompt on after the whole video about that already.
I'm not going to go into the details of the Webcam application just here on we can see.
I've currently got a friend.
I've got a red rectangle stuck to my head, and the idea of optic flow is I can create a velocity map that allows me to interact with things on the scenery so you can see I can manipulate the rectangle with my hands.
I'm doing it rather gingerly and slowly, because if I hit the rectangle, it gets more velocity.
So there's some sort of physics involved with how the rectangle behaves, and if the rectangle goes off the screen.
I've made it wrap around to the bottom and we can see now rather, unfortunately, it's got stuck in my microphone.
So if I move my hands around, you can see the microphones in the way s I need to just go in front of the microphone on lift up the object.
At least flick it out the way.
There we go.
So this is a crude form of augmented reality.
But the algorithm behind it is quite nice, and it also covers some of the fundamentals off image processing.
What I'm tying to demonstrate is that for the most part, the object behaves according to how I'm interacting with it.
I can push it one way or the other.
Try and find them again one last time.
There we go.
So I'll bring the rectangle to the middle of the screen and just try and leave it the You see, I can push it to the left on.
I can push it back to the right a little bit.
I try and grab hold of it from above and bring it back down.
Where am I going?
There we go.
Well, if it.
All right, that's enough.
Waving my arms around.
Let's have a look how it works.
Let's start by considering how motion is determined between two successive images.
So here I've got the two pictures on the left, which are frames taken from a camera onda T minus zero.
Here is the most recent frame, and T minus one is the frame.
Before that's of the previous frame on.
We can see that the little Chappy inside the frame has waved his arm.
It's moved a little bit, and if we subtract the two images, what we actually get is not the whole body.
But we just get the two locations that have changed.
So we'll get the original home plus the new arm on.
Depending on how the image is encoded on the type of the movement and the luminess on the screen, one of these will be positive on.
One of them will be negative.
And if we take the absolute values of all of the pixel values, I set the multi positive.
We can work out an area where motion has occurred.
But all this map will tell us is that motion has happened and whereabouts in the map.
Has the motion happened?
We don't know what direction, and we don't know how much.
Because in the world of image processing, our motion is represented as just being the difference between two successive frames.
Before we start working out the motion vector for individual pixels, let's consider how we would do it for the whole image.
So I want to imagine a scenario where the camera is moving, but the scene that the camera's viewing remains mostly static.
In this case, I've got two successive frames shown by the black, and the blue image on our brains can work out that actually the even though they're similar, one of the frames has moved slightly.
In fact, the camera must have moved in this direction because we can see that the little man has moved towards the left of the frame.
If we overlay the frames precisely, we can see this more clearly, and the result we're interested in is how much movement has occurred.
And there's really only one way to test this now, because this is the first video in this series.
I'm trying to keep things conceptually, simply know, So I'm not going to be looking into the more advanced routines and feature extraction routines, which would typically be used to solve this problem.
Instead, we're going to brute force it and so to calculate the motion back.
So the only thing we can do is to physically test the new frame in a variety of different positions to see where does it line up with the old frame?
And so we would just need to algorithmic Lee, overlay the two and see where they have the closest match.
I worries the difference between them, the minimum.
And of course it's like this.
And once you found the closest match, we can observe the nature of the vector that we need to represent that motion.
And don't forget, this is for the whole image moving.
This is if the camera is moving on because it's the whole image.
We can assume that the factor has its origin of the middle of that image, and we only have the one vector for the whole image, and this is our global optic flow on your computer.
Mouse operates on a similar principle to this.
It rapidly takes images in succession and tries to work out how the image is transformed in the two D plane from one location to another.
Once it's done this, that vector is then translated into coordinates that ascent to the computer to move the mouse cursor around.
Now the type of optic flow we're interested in isn't global optic flow is such it's local optic flow.
We want to work it out per pixel, so every pixel in our image gets associated with the vector that suggests how that pixel might have moved in the past.
And so things that remain static don't really have a vector.
It'll so the body here didn't change.
In fact, the only thing that did change was the arm here.
Again, I've got are two successive frames, and if I overlay them, we could see the difference is really just the armed, changing position.
I'm going to try and work out where has that pixel come from?
And to do this, I create a patch that surrounds that pixel.
On this patch will be a small fragment of the image.
It will have some statistical features that make it difference to other parts of the image.
I'm going to assume that's the case.
For now, of course, it might not be, and that could lead to errors in this album.
Once I've got a small patch on going to test that patch against lots of different locations within a given search area, and we need to restrict the search area for two reasons.
One is if we searched the whole image, it would be tremendously slow.
It's not a very computational, efficient algorithm, but secondly, we can assume that the differences between frames on the whole are minimal, particularly, it's a human being.
I can't move my arms so fast, but it makes a massive difference at, say, 30 frames per second For each patch in the search area, I record the similarity with the base patch on DDE.
In this instance, it would be somewhere like here.
They're not exactly the same, but they represent a similar part of similar feature off the arm than therefore for this patch.
To get from the location and green to the location and read, it had to move with this vector.
Once we found the best matching patch for every pixel in the image.
We can then modulate our vector field map with the original motion image because a lot of the vectors we're not interested in there hasn't been any motion.
And so they'll just be the results of noise, another unwanted estimations and byproducts of the searching algorithm.
So we restrict, then to just the pixels that we're interested in on.
All of the other factors effectively get set to zero.
Now let's start with the coding, but I want to emphasize that I'm not making any effort at all to try and optimize this algorithm.
In fact, it's going to be very inefficient and run quite poorly indeed.
And the reason for this is I want to keep it clear.
I want to keep the step by step nature of the algorithm very visible for viewers to study, have us the algorithm work.
But I do intend to follow up this video with another video that specifically talks about optimizing this algorithm using a technique called integral images.
But for this video, we're going to keep it simple on brute force, and we're going to pay the price for that simplicity.
As usual, I'm using the OLC Consul game engine on.
If you've seen the Web cam video, we're going to use a lively called S Cappie Tau handle the webcam for us, and I've derived a class from the council game engine called Optic Flow because the performance is going to suffer in this algorithm, I'm using quite a low resolution console, 80 by 60 characters where each character is 16 by 16 pixels and I've taken the liberty of already entering the Webcam capture code.
We don't need to see it again, and this involved a union.
And some variables toe handle on represent the camera.
Some code to initialize the camera to a given size.
And in this case, that's going to be the size of our console.
Using the screen width and screen height functions on in the unused update function, I've created a Lambda function to simply draw on image to the screen.
In this case, an image is going to be a two dimensional floating point of rape.
It just it's a rates through the array and chooses the appropriate combination of characters and colors that represent that pixel.
The first thing they are news update function is going to do is capture the image and convert it into normalized grayscale values, which will use for the rest of the algorithm on the frame returned by the camera after processing for Lou Eminence is going to be stored in a two d ray called F New Camera.
So let's add that now for the Rays.
I'm just creating a pointer to a floating point variable and in the unused create function.
I'm going to allocate the memory depending on the size of the console, and I'm going to initialize it 20 just to make sure everything so far is in order.
I'm going to call the Lambda Function Drawer image on past two.
It's the two D array that represents our new camera, and we should just be able to see the raw image on the screen.
Let's take a look, and there we go.
It's a bit low resolution.
It's a bit fuzzy just to prove that this really does work.
I'm going to quadruple the resolution into something which is a bit more standard Webcam, and now hello.
You can see me quite clearly.
Let's start by just calculating motion within the camera scene.
I need to record the previous frame, someone to create another array called Old Camera on our motion is going to be the difference between these two arrays.
I'm not going to show this each time, but for each array we're going to allocate it and set it all to zero.
We can see here that for each on news update, we completely over right the F new camera array with the new luminous values.
So before we overwrite the new one, we should store it on well, stories into the old war.
Now this array operates on a pixel by pixel basis.
It's going through every single pixel in the image, giving us an X and A Y coordinate.
So this allows me to create a new variable called F diff, which represents the absolute difference between two corresponding pixels for the two arrays.
So the pixels are in the same place, but they raise a different, and I'm going to create a lambda function called Get pixel.
To do the sampling into these arrays, I'll put this lambda function with my other one on.
The reason for doing this is I can check that the coordinates of the array are inbounds on.
I can also specify it to return a zero which will find to be quite useful later on if it's not within the bounds.
Once I calculated the difference per pixel, I'm going to fill 1/3 array motion image with those differences.
But I'm going to ignore all the differences that are so small that they don't contain any useful information.
One of the things about working with Web cams and image processing is you're working with real world data, and it's rubbish.
The real world will throw absolutely everything at you that it can to interfere with your progress, so you'll find a lot of image processing out Madame's.
We do have to threshold and tweak values to make things work, and so to account for some of the just simple pixel noise that's going to occur between frames because it's never going to be the same value of pixel twice.
Really, I'm going to set it zero if it's below that amount, I'm curious to see our motion image on because of the drawer image, Lambda function.
We can easily do that at any time.
Let's take a look so it's still seen.
There's not much at all, but if I move, we can see we've got the difference between two successive frames, and it's quite sensitive.
So I'm just ever so slightly moving my microphone around and we can see we're getting basically an edge detect.
But we're not interested in edges for this video.
But even though I'm trying to be a still is, I possibly can.
We could see there are lots of spurious elements firing each time.
And that's just threw little vibrations of the camera mount camera noise that's being picked up and not being filtered out on.
Also an estimation of how much motion really qualifies as motion.
So I'm trying to keep my head is still a possible.
But just because I'm talking means it's moving around a little bit, and I don't really think that this should interact with world.
So I want to know, remove small movements and by small movements, I mean rapid and large changes between frames.
These air not likely to be natural to remove the temporal, high frequency components of the image, I'm going to add a low pass filter, low pass filtered image is going to be yet another two d ray.
I've created that further up in the code.
Into this image, I accumulate a small amount off the difference between the filtered image on the new image.
So here we can see I've got the new image.
I'm subtracting the filtered image on I'm taking 10% off it on accumulating it into the filtered image.
And, of course, with my draw image Lambda function.
I can see what that looks like to let's take a look so we can see the image sort of faded into existence, and it's become very stable.
In fact, I'm talking and you can't see my mouth move.
And if I do move my ghost, if I move my arm up and keep it still, it takes some time for the arm to appear.
And so we've effectively removed all of the little high frequency jitters, and it's quite ghostly.
And if I move very rapidly waving my arms around, you see it never really settled.
So it's ignoring this high frequency motion.
And of course, this 10% is quite tunable.
So if I said it to say 80% instead, what we're allowing is actually far faster motion.
So you see, there's still a little bit of ghosting, but not very much at all.
My mouth still doesn't look like it's moving as much as it normally should.
If I open it, you say it takes some time.
So my mouth is moving at such a speed that that's being filtered out by the algorithm.
But because we're taking a larger chunk now, we're allowing more high speed components through.
We can see that the boundary of the light here in the background is quite flickering and jittery.
And of course, this will look like motion toe our algorithm because it is a change of pixel values between successive frames.
Now that we've got our filtered image, we need to update our motion calculation to use the filtered images because currently they're using the new camera an old camera which are not filtered it all.
So I'm going to change this line to use filtered variants instead.
And you can see I'm going to need to create an old filtered camera variable to accommodate this, and I'm going to update the old filtered camera array in the same place that I'm updating the simple old camera array.
This means now that my motion image is really robust and only contained significant motion since it acquired the image The next phase is to calculate the optic flow vector map.
I'm going to shove the drawer image into its own section now, which is update Screen has already discussed for this video.
We're just going to do things in a brute force way.
We're not thinking about efficiency just yet, and so I'm creating two variables, both interviews.
One is patch size, which is going to be the squirt dimension off the patch that we're looking for.
So in this case feature could be a nine by nine collection of pixels on.
I'm going to specify the search size, which is a seven by seven grid off pixels, through which to search with the patch.
Let's just have a quick look at some numbers to see why this approach isn't particularly efficient.
Well, for 3 20 by 2 40 array of pixels, that's 76,800 pixels.
We're going to be performing a search over seven by seven grid.
So that's 49 possible locations for each pixel to exist in, which means we've got three million, 763,000 and 200 search locations to do on for each surge location.
We've got a patch of image to test patch is going to be nine by nine pixels.
So for each of our 3.7 million searches, we've got 81 comparisons to make, which means we've got 304 million 819,200 comparisons between pixels to perform on.
Let's assume we want to run that at 30 frames per second.
Well, that's an easy total off nine billion, 144,576,000 individual pixel comparisons to make to run at a real time rate.
And when you factor in the operating system on the fact that it's all random access to memory on the pollution of you, cash and lots of other problems, we could see that this out with him isn't particularly efficient.
It'll help us out.
I'm going to significantly reduce the resolution of the algorithm to get it to run, at least in double digits frame rates on my fairly high end machine.
But don't forget there's gonna be a follower video which talks about solving this problem.
So let's now go back to our original resolution.
It doesn't mean we can make the pixels bigger.
We're going to have quite a few nested four loops here, so it's going to be a little tricky for me to keep everything on the screen without zooming out.
We know to begin with, we're going to go through every single pixel, so let's start there.
Let's create 24 loops that it's a rate across the screen on down the screen.
I'm going to initialize the variables for this pixel.
And for each pixel, we're going to store variable called Patch Difference Max, which is set to a very large number, which is going to keep track of which one of our search patches is the closest fit to our base patch.
I'm also going to initialize 202 more two dimensional vectors flow, axe and flow white, which of the X and Y components of our overall motion for that pixel.
Now I need the next stage of my four loops for each pixel we need to go over a search area.
In this case, the search area is just going to be a rectangle surrounding the original pixel, so going to create a vector that represents the vector from the original pixel to the center off the search area.
I went to create another variable F accumulated difference, which is going to be the total difference for a single patch at this search location.
And it will be this accumulated difference, which is compared to the max difference later on to see which patch is the closest fit to the base patch.
Now we need the next set of four loops for the patch.
Since the search vector represents the middle off the patch that's being searched, I can create the actual pixel coordinates as an offset of that search factor, assuming that the search vector is in the middle of the patch.
So it's very similar to what we had before and in a similar way I can work out the indices required for the base patch, which is basically just the patch size around the pixel that were testing.
Now that I've worked out the indices for the pixels in both the search patch location on the base patch location, I can extract the luminous values of the two successive frames at those locations and then update my cumulated difference.
Make sure to use the absolute function to accumulate the difference because there will be some negative values in this, and they will subtract from the difference.
So for each pixel and for each search location and for each individual patch pixel we accumulate.
A difference on this means we can compare patches, and it's important to compare patches because I want the patch, which is the closest match to the base patch.
And so if my accumulated difference is lower than the current max accumulated difference, I there is the least difference.
Then I want to store in my two D flow fields, the search factor that we created earlier.
And as you can see, we frequently used this sort of notation, which always implies per pixel.
Why Time screen With Plus X, We've now finished the brute force algorithm, and every pixel in the flow field has been allocated a vector of motion.
It's time to modulate this flow field with the motion image.
And so if the motion image contains motion which remember has been quite severely filtered earlier on, then we know this motion must be significant else If there is no motion, then we set the flow field vectors to 02 indicators, such now the reason we need to do this modulation is that the patch search routine has to come up with the value it has to choose a vector, even if there has been no motion.
Hopefully, that value indicates 00 but it might not.
And we want to remove these spurious motion betters from contaminating of exit flow field.
Now we effectively have a to D map off velocity per pixel.
We can have some fun with this by using those vectors as part of a physics calculation to control an object I the Red Square you saw me manipulating at the start of the video.
So I've created two sets of vectors, obviously the pears, because it's an X and Y component one which represents the position on one which represents the velocity on Denon user create.
I'm going to initialize the ball's position to be in the middle of the frame.
Number of my videos this year have centered on doing some rudimentary physics for games, and this one is no different.
In fact, you can look at the Flappy Bird video.
Yes, that one to have a look at what we're about to discuss.
But very simply, we're going to update the balls velocity based on its position within the velocity field.
And so here you can see the individual components being updated.
We use the balls location in the optic flow field to find the component off the vector with which to augment the velocity.
So in this case, in our X component field, I used the balls current position to find the correct index into that field, which will give me the X component off the optic flow vector.
I multiply that by a constant that represents some sort of speed.
This will need to be tuned depending on your application, and I also include F elapsed time just to keep things running smoothly.
As the algorithm can be quite fluctuating.
I do the same for the white component.
The ball's position is simply augmented by the velocity with respect to every elapsed time.
I've included the value one here just in case we needed a tuning parameter to give the bulls some weight.
I've added a drag effect to it, and I'm just simply reducing the velocity by fixed amount over time.
This means the bulls velocity will tend toward zero when it's not being governed by the optic flow man.
This will stop it just drifting continuously in one direction.
And finally, when the ball does reach the edge of the screen, I'm going to wrap it around to the other side.
This means we've got nothing left to do other than drew all the screen.
And we'll use Thea realistic camera image, not the filtered one, because that's what the user might expect to see.
And I'm going to draw the ball.
Of course, it's not really a ball.
It's a rectangle.
I must use the Phil command provided by the game engine to do such Let's take a look.
So here we can see the red rectangle in the middle of the screen and I can push it up there and I'll try and grab hold of it and bring it back down.
And if I try, I gotta get my hands in the right order of push it over towards my other hand and bring it back and push it again.
Then you could see I've actually got quite reasonably accurate control over it.
I might not have reasonably accurately control over my own arms, but at least I can move the rectangle around some way on the screen was to try and bring you back down here.
Now it's quite interesting this because it's got stuck on the corner of the screen and we'll discuss that in a minute.
Once they're finished, playing with Descartes is actually quite addictive to do the other hand me that if I try and thump it, the high velocity send it on its way, you see.
So why was it clinging to the edge of the screen?
While in the comparison phase of the algorithm where we're comparing patches, we call this get pixel lambda function on for pixels close to the edge of the array.
These values will lie beyond the edges of the array, and we've told our Lambda function to return zero in this case.
So we get a known for comparison.
We'll be looking handling this edge effect on introducing some serious optimization through the use of integral images in the following video.
In this series, we've enjoyed this video giving a big thumbs up.
I know it's been a big, more technical of some of my other videos, but I think it's an interesting topic.