This lecture is going to serve as an overview of what a probability distribution is and
what main characteristics it has.
Simply put, a distribution shows the possible values a variable can take and how frequently
they occur.
Before we start, let us introduce some important notation we will use for the remainder of
the course.
Assume that “upper-case Y” represents the actual outcome of an event and “lowercase
y” represents one of the possible outcomes.
One way to denote the likelihood of reaching a particular outcome “y” is “P of, Y equals y”.
We can also express it as “p of y”.
For example, uppercase “Y” could represent the number of red marbles we draw out of a
bag and lowercase “y” would be a specific number, like 3 or 5.
Then, we express the probability of getting exactly 5 red marbles as “P, of Y equals
5”, or “p of 5”.
Since “p of y” expresses the probability for each distinct outcome, we call this the
probability function.
Good job, folks!
So, probability distributions, or simply probabilities, measure the likelihood of an outcome depending
on how often it features in the sample space.
Recall that we constructed the probability frequency distribution of an event in the
introductory section of the course.
We recorded the frequency for each unique value and divided it by the total number of
elements in the sample space.
Usually, that is the way we construct these probabilities when we have a finite number
of possible outcomes.
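As a rough sketch of that construction, the snippet below counts how often each value appears in a finite sample space and divides by the total number of elements. The sample space used here (the sum of two fair six-sided dice) and the function name `frequency_distribution` are assumptions chosen for illustration, not something taken from the course.

```python
from collections import Counter
from fractions import Fraction

def frequency_distribution(sample_space):
    """Map each unique value to its frequency divided by the sample space size."""
    counts = Counter(sample_space)
    total = len(sample_space)
    return {value: Fraction(count, total) for value, count in counts.items()}

# Hypothetical sample space: every possible sum of two fair six-sided dice.
sample_space = [a + b for a in range(1, 7) for b in range(1, 7)]
distribution = frequency_distribution(sample_space)

print(distribution[7])  # 1/6
```

Exact fractions are used so that the probabilities visibly add up to 1 without rounding noise.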
If we had an infinite number of possibilities, then recording the frequency for each one
becomes impossible, because… there are infinitely many of them!
For instance, imagine you are a data scientist and want to analyse the time it takes for
your code to run.
Any single compilation could take anywhere from a few milliseconds to several days.
Often the result will be between a few milliseconds and a few minutes.
If we record time in seconds, we lose precision which we want to avoid.
To do so we need to use the smallest possible measurement of time.
Since every milli-, micro-, or even nanosecond could be split in half for greater accuracy,
no such thing exists.
Less than an hour from now we will talk in more detail about continuous distributions
and how to deal with them.
Let’s introduce some key definitions.
Now, regardless of whether we have a finite or infinite number of possibilities, we define
distributions using only two characteristics – mean and variance.
Simply put, the mean of the distribution is its average value.
Variance, on the other hand, is essentially how spread out the data is.
We measure this “spread” by how far away from the mean all the values are.
We denote the mean of a distribution as the Greek letter ‘mu’ and its variance as
“sigma squared”.
When analysing distributions, it is important to understand what kind of data we have - population
data or sample data.
Population data is the formal way of referring to “all” the data, while sample data is
just a part of it.
For example, if an employer surveys an entire department about how they travel to work,
the data would represent the population of the department.
However, this same data would also be just a sample of the employees in the whole company.
Something to remember when using sample data is that we adopt different notation for the
mean and variance.
We denote sample mean as “x bar” and sample variance as “s” squared.
One flaw of variance is that it is measured in squared units.
For example, if you are measuring time in seconds, the variance would be measured in
seconds squared.
Usually, there is no direct interpretation of that value.
To make more sense of variance, we introduce a third characteristic of the distribution,
called standard deviation.
Standard deviation is simply the positive square root of variance.
As you might suspect, we denote it as “sigma” when dealing with a population, and as “s”
when dealing with a sample.
Unlike variance, standard deviation is measured in the same units as the mean.
Thus, we can interpret it directly, and it is often preferable.
One idea which we will use a lot is that any value between “mu minus sigma” and “mu
plus sigma” falls within one standard deviation away from the mean.
The more congested the middle of the distribution, the more data falls within that interval.
Similarly, the less data falls within the interval, the more dispersed the data is.
It is important to know there exists a constant relationship between mean and variance for
any distribution.
By definition, the variance equals the expected value of the squared difference from the mean
for any value.
We denote this as “sigma squared, equals the expected value of Y minus mu, squared”.
After some simplification, this is equal to the expected value of “Y squared” minus
“mu” squared.
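For those who want to see the simplification spelled out, here is one way to write it. It relies only on the linearity of the expected value and on the fact that “mu” is the expected value of Y.

```latex
\begin{aligned}
\sigma^2 &= E\left[(Y - \mu)^2\right] \\
         &= E\left[Y^2 - 2\mu Y + \mu^2\right] \\
         &= E[Y^2] - 2\mu E[Y] + \mu^2 \\
         &= E[Y^2] - 2\mu^2 + \mu^2 \\
         &= E[Y^2] - \mu^2
\end{aligned}
```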
As we will see in the coming lectures, if we are dealing with a specific distribution,
we can find a much more precise formula.
Okay, when we are getting acquainted with a certain dataset we want to analyse or make
predictions with, we are most interested in the mean, variance and type of the distribution.
In our next video we will introduce several distributions and the characteristics they possess.
Thanks for watching!
4.2 Types of distributions
Hello, again!
In this lecture we are going to talk about various types of probability distributions
and what kind of events they can be used to describe.
Certain distributions share features, so we group them into types.
Some, like rolling a die or picking a card, have a finite number of outcomes.
They follow discrete distributions and we use the formulas we already introduced to
calculate their probabilities and expected values.
Others, like recording time and distance in track & field, have infinitely many outcomes.
They follow continuous distributions and we use different formulas from the ones we mentioned
so far.
Throughout the course of this video we are going to examine the characteristics of some
of the most common distributions.
For each one we will focus on an important aspect of it or when it is used.
Before we get into the specifics, you need to know the proper notation we implement when
defining distributions.
We start off by writing down the variable name for our set of values, followed by the
“tilde” sign.
This is followed by a capital letter depicting the type of the distribution and some characteristics
of the dataset in parentheses.
The characteristics are usually mean and variance, but they may vary depending on the
type of the distribution.
Let us start by talking about the discrete ones.
We will get an overview of them and then we will devote a separate lecture to each one.
So, we looked at problems relating to drawing cards from a deck or flipping a coin.
Both examples show events where all outcomes are equally likely.
Such outcomes are called equiprobable and these sorts of events follow a Uniform Distribution.
Then there are events with only two possible outcomes – true or false.
They follow a Bernoulli Distribution, regardless of whether one outcome is more likely to occur.
Any event with two outcomes can be transformed into a Bernoulli event.
We simply assign one of them to be “true” and the other one to be “false”.
Imagine we are required to elect a captain for our college sports team.
The team consists of 7 domestic students and 3 international students.
We assign the captain being domestic to be “true” and the captain being international
to be “false”.
Since the outcome can now only be “true” or “false”, we have a Bernoulli distribution.
Now, if we want to carry out a similar experiment several times in a row, we are dealing with
a Binomial Distribution.
Just like the Bernoulli Distribution, the outcomes for each iteration are two, but we
have many iterations.
For example, we could be flipping the coin we mentioned earlier 3 times and trying to
calculate the likelihood of getting heads twice.
Lastly, we should mention the Poisson Distribution.
We use it when we want to test out how unusual an event frequency is for a given interval.
For example, imagine we know that so far LeBron James averages 35 points per game during the
regular season.
We want to know how likely it is that he will score 12 points in the first quarter of his
next game.
Since the frequency changes, so should our expectations for the outcome.
Using the Poisson distribution, we are able to determine the chance of LeBron scoring
exactly 12 points for the adjusted time interval.
Great, now on to the continuous distributions!
One thing to remember is that since we are dealing with continuous outcomes, the probability
distribution would be a curve as opposed to unconnected individual bars.
The first one we will talk about is the Normal Distribution.
The outcomes of many events in nature closely resemble this distribution, hence the name “Normal”.
For instance, according to numerous reports throughout the last few decades, the weight
of an adult male polar bear is usually around 500 kilograms.
However, there have been records of individual bears weighing anywhere between 350kg and 700kg.
Extreme values, like 350 and 700, are called outliers and do not feature very frequently
in Normal Distributions.
Sometimes, we have limited data for events that resemble a Normal distribution.
In those cases, we observe the Student’s-T distribution.
It serves as a small sample approximation of a Normal distribution.
Another difference is that the Student’s-T accommodates extreme values significantly better.
Graphically, that is represented by the curve having fatter “tails”.
Overall, this results in more values extremely far away from the mean, so the curve would
probably more closely resemble a Student’s-T distribution than a Normal distribution.
Now imagine only looking at the recorded weights of the last 10 sightings across Alaska and
The lower number of elements would make the occurrence of any extreme value represent
a much bigger part of the population than it should.
Good job, everyone!
Another continuous distribution we would like to introduce is the Chi-Squared distribution.
It is the first asymmetric continuous distribution we are dealing with as it only consists of
non-negative values.
Graphically, that means that the Chi-Squared distribution always starts from 0 on the left.
Depending on the average and maximum values within the set, the curve of the Chi-Squared
graph is usually skewed to the right, with a long tail towards larger values.
Unlike the previous two distributions, the Chi-Squared does not often mirror real life events.
However, it is often used in Hypothesis Testing to help determine goodness of fit.
The next distribution on our list is the Exponential distribution.
The Exponential distribution is usually present when we are dealing with events that are rapidly
changing early on.
An easy to understand example is how online news articles generate hits.
They get most of their clicks when the topic is still fresh.
The more time passes, the more irrelevant it becomes as interest dies off.
The last continuous distribution we will mention is the Logistic distribution.
We often find it useful in forecast analysis when we try to determine a cut-off point for
a successful outcome.
For instance, take a competitive e-sport like Dota 2. We can use a Logistic distribution
to determine how much of an in-game advantage at the 10-minute mark is necessary to confidently
predict victory for either team.
Just like with other types of forecasting, our predictions would never reach true certainty
but more on that later!
Good job, folks!
In the next video we are going to focus on discrete distributions.
We will introduce formulas for computing Expected Values and Standard Deviations before looking
into each distribution individually.
Thanks for watching!
4.3 Discrete Distributions
Welcome back!
In this video we will talk about discrete distributions and their characteristics.
Let’s get started!
Earlier in the course we mentioned that events with discrete distributions have finitely
many distinct outcomes.
Therefore, we can express the entire probability distribution with either a table, a graph
or a formula.
To do so we need to ensure that every unique outcome has a probability assigned to it.
Imagine you are playing darts.
Each distinct outcome has some probability assigned to it based on how big its associated
interval is.
Since we have finitely many possible outcomes, we are dealing with a discrete distribution.
In probability, we are often more interested in the likelihood of an interval than of an
individual value.
With discrete distributions, we can simply add up the probabilities for all the values
that fall within that range.
Recall the example where we drew a card 20 times.
Suppose we want to know the probability of drawing 3 spades or fewer.
We would first calculate the probability of getting 0, 1, 2 or 3 spades and then add them
up to find the probability of drawing 3 spades or fewer.
One peculiarity of discrete events is that “the probability of Y being less than
or equal to y equals the probability of Y being less than y plus 1”.
In our last example, that would mean getting 3 spades or fewer is the same as getting fewer
than 4 spades.
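To make the “add up the probabilities” idea concrete, here is a small sketch for the spades example. It assumes each card is drawn from a full deck and put back afterwards, so every draw has a 13 in 52 chance of being a spade, and it borrows the Binomial probability function that is only introduced later in this section. The helper name `binomial_pmf` is made up for this illustration.

```python
from math import comb

def binomial_pmf(y, n, p):
    """Probability of exactly y successes in n independent trials."""
    return comb(n, y) * p**y * (1 - p)**(n - y)

n = 20        # number of draws
p = 13 / 52   # assumed chance of a spade on each draw (with replacement)

# P(Y <= 3): add the probabilities of 0, 1, 2 and 3 spades.
at_most_3 = sum(binomial_pmf(y, n, p) for y in range(0, 4))

# P(Y < 4): the same set of values, selected with a strict inequality.
fewer_than_4 = sum(binomial_pmf(y, n, p) for y in range(n + 1) if y < 4)

print(round(at_most_3, 4))  # 0.2252
```

Both sums run over exactly the same values, which is why “3 or fewer” and “fewer than 4” give the same probability.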
Now that you have an idea about discrete distributions, we can start exploring each type in more detail.
In the next video we are going to examine the Uniform Distribution.
Thanks for watching!
4.4 Uniform Distribution
Hey, there!
In this lecture we are going to discuss the uniform distribution.
For starters, we use the letter U to define a uniform distribution, followed by the range
of the values in the dataset.
Therefore, we read the following statement as “Variable “X” follows a discrete
uniform distribution ranging from 3 to 7”.
Events which follow the uniform distribution, are ones where all outcomes have equal probability.
One such event is rolling a single standard six-sided die.
When we roll a standard 6-sided die, we have equal chance of getting any value from 1 to 6.
The graph of the probability distribution would have 6 equally tall bars, all reaching
up to one sixth.
Many events in gambling provide such odds, where each individual outcome is equally likely.
Not only that, but many everyday situations follow the Uniform distribution.
If your friend offers you 3 identical chocolate bars, the probabilities assigned to you choosing
one of them also follow the Uniform distribution.
One big drawback of uniform distributions is that the expected value provides us no
relevant information.
Because all outcomes have the same probability, the expected value, which is 3.5, brings no
predictive power.
We can still apply the formulas from earlier and get a mean of 3.5 and a variance of 105
over 36.
These values, however, are completely uninterpretable and there is no real intuition behind what
they mean.
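If you would like to check those two numbers yourself, the short sketch below applies the expected value and variance formulas from earlier to a fair six-sided die. Exact fractions are used, and the variable names are only illustrative.

```python
from fractions import Fraction

outcomes = range(1, 7)       # faces of a standard six-sided die
p = Fraction(1, 6)           # every face is equally likely

mean = sum(y * p for y in outcomes)
variance = sum(y**2 * p for y in outcomes) - mean**2

print(mean, variance)  # 7/2 35/12
```

Note that 35 over 12 is the reduced form of 105 over 36, so the result agrees with the value quoted in the lecture.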
The main takeaway is that when an event is following the Uniform distribution, each outcome
is equally likely.
Therefore, both the mean and the variance are uninterpretable and possess no predictive
power whatsoever.
Sadly, the Uniform is not the only discrete distribution, for which we cannot construct
useful prediction intervals.
In the next video we will introduce the Bernoulli Distribution.
Thanks for watching!
4.5 Bernoulli Distribution
Hello again!
In this lecture we are going to discuss the Bernoulli distribution.
Before we begin, we use “Bern” to define a Bernoulli distribution, followed by the
probability of our preferred outcome in parentheses.
Therefore, we read the following statement as “Variable “X” follows a Bernoulli
distribution with a probability of success equal to “p””.
We need to describe what types of events follow a Bernoulli distribution.
Any event where we have only 1 trial and two possible outcomes follows such a distribution.
These may include a coin flip, a single True or False quiz question, or deciding whether
to vote for the Democratic or Republican parties in the US elections.
Usually, when dealing with a Bernoulli Distribution, we either know the probability of each
outcome occurring, or have past data indicating some experimental probability.
In either case, the graph of a Bernoulli distribution is simple.
It consists of 2 bars, one for each of the possible outcomes.
One bar would rise up to its associated probability of “p”, and the other one would only reach
“1 minus p”.
For Bernoulli Distributions we often have to assign which outcome is 0, and which outcome
is 1.
After doing so, we can calculate the expected value.
Keep in mind that depending on how we assign the 0 and the 1, our expected value will be
equal to either “p” or “1 minus p”.
We usually denote the higher probability with “p”, and the lower one with “1 minus p”.
Furthermore, conventionally we also assign a value of 1 to the event with probability
equal to “p”.
That way, the expected value expresses the likelihood of the favoured event.
Since we only have 1 trial and a favoured event, we expect that outcome to occur.
By plugging in “p” and “1 minus p” into the variance formula, we get that the
variance of Bernoulli events would always equal “p, times 1 minus p”.
That is true, regardless of what the expected value is.
Here’s the first instance where we observe how elegant the characteristics of some distributions are.
Once again, we can calculate the variance and standard deviation using the formulas
we defined earlier, but they bring us little value.
For example, consider flipping an unfair coin.
This coin is called “unfair” because its weight is spread disproportionately, and it
gets tails 60% of the time.
We assign the outcome of tails to be 1, and p to equal 0.6.
Therefore, the expected value would be “p”, or 0.6.
If we plug in this result into the variance formula, we would get a variance of 0.6, times
0.4, or 0.24.
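The unfair coin can also be worked through directly from the definitions, without using the shortcut “p, times 1 minus p”. The sketch below does exactly that with exact fractions; the dictionary layout is just one convenient way to write down the two outcomes.

```python
from fractions import Fraction

p = Fraction(3, 5)               # tails, coded as 1, occurs 60% of the time
distribution = {1: p, 0: 1 - p}  # value -> probability

mean = sum(value * prob for value, prob in distribution.items())
variance = sum(value**2 * prob for value, prob in distribution.items()) - mean**2

print(mean, variance)  # 3/5 6/25
```

Here 6 over 25 is 0.24, the same as 0.6 times 0.4.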
Great job, everybody!
Sometimes, instead of wanting to know which of two outcomes is more probable, we want
to know how often it would occur over several trials.
In such cases, the outcomes follow a Binomial distribution and we will explore it further
in the next lecture.
4.6 Binomial Distribution
Welcome back!
In the last video, we mentioned Binomial Distributions.
In essence, Binomial events are a sequence of identical Bernoulli events.
Before we get into the difference and similarities between these two distributions, let us examine
the proper notation for a Binomial Distribution.
We use the letter “B” to express a Binomial distribution, followed by the number of trials
and the probability of success in each one.
Therefore, we read the following statement as “Variable “X” follows a Binomial
distribution with 10 trials and a likelihood of success of 0.6 on each individual trial”.
Additionally, we can express a Bernoulli distribution as a Binomial distribution with a single trial.
To better understand the differences between the two types of events, suppose the following scenario.
You go to class and your professor gives the class a surprise pop-quiz, for which you have
not prepared.
Luckily for you, the quiz consists of 10 true or false problems.
In this case, guessing a single true or false question is a Bernoulli event, but guessing
the entire quiz is a Binomial Event.
Let’s go back to the quiz example we just mentioned.
In it, the expected value of the Bernoulli distribution suggests which outcome we expect
for a single trial.
Now, the expected value of the Binomial distribution would suggest the number of times we expect
to get a specific outcome.
Now, the graph of the binomial distribution represents the likelihood of attaining our
desired outcome a specific number of times.
If we run n trials, our graph would consist of “n + 1”-many bars - one for each unique
value from 0 to n.
For instance, we could be flipping the same unfair coin we had from last lecture.
If we toss it twice, we need bars for the three different outcomes - zero, one or two tails.
If we wish to find the associated likelihood of getting a given outcome a precise number
of times over the course of n trials, we need to introduce the probability function of the
Binomial distribution.
For starters, each individual trial is a Bernoulli trial, so we express the probability of getting
our desired outcome as “p” and the likelihood of the other one as “1 minus p”.
In order to get our favoured outcome exactly y-many times over the n trials, we also need
to get the alternative outcome “n minus y”-many times.
If we don’t account for this, we would be estimating the likelihood of getting our desired
outcome at least y-many times.
Furthermore, there could exist more than one way to reach our desired outcome.
To account for this, we need to find the number of scenarios in which “y” out of the “n”-many
outcomes would be favourable.
But these are actually the “combinations” we already know!
For instance, If we wish to find out the number of ways in which 4 out of the 6 trials can
be successful, it is the same as picking 4 elements out of a sample space of 6.
Now you see why combinatorics are a fundamental part of probability!
Thus, we need to find the number of combinations in which “y” out of the “n” outcomes
would be favourable.
For instance, there are 3 different ways to get tails exactly twice in 3 coin flips.
Therefore, the probability function for a Binomial Distribution is the product of the
number of combinations of picking y-many elements out of n, times “p” to the power of y,
times “1 minus p” to the power of “n minus y”.
To see this in action, let us look at an example.
Imagine you bought a single stock of General Motors.
Historically, you know there is a 60% chance the price of your stock will go up on any
given day, and a 40% chance it will drop.
By the price going up, we mean that the closing price is higher than the opening price.
With the probability distribution function, you can calculate the likelihood of the stock
price increasing 3 times during the 5-work-day week.
If we wish to use the probability distribution formula, we need to plug in 3 for “y”,
5 for “n” and 0.6 for “p”.
After plugging in we get: “number of different possible combinations of picking 3 elements
out of 5, times 0.6 to the power of 3, times 0.4 to the power of 2”.
This is equivalent to 10, times 0.216, times 0.16, or 0.3456.
Thus, we have a 34.56% chance of getting exactly 3 increases over the course of a work week.
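The same calculation can be sketched in a few lines of code. The function name `binomial_pmf` is an assumption made for this example; it simply multiplies the number of combinations by the probability of y successes and of “n minus y” failures.

```python
from math import comb

def binomial_pmf(y, n, p):
    """Probability of exactly y successes in n independent trials."""
    return comb(n, y) * p**y * (1 - p)**(n - y)

# Stock example: 3 increases in a 5-day week, 60% chance of an increase per day.
probability = binomial_pmf(3, 5, 0.6)

print(round(probability, 4))  # 0.3456
```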
The big advantage of recognizing the distribution is that you can simply use these formulas
and plug-in the information you already have!
Now that we know the probability function, we can move on to the expected value.
By definition, the expected value equals the sum of all values in the sample space, multiplied
by their respective probabilities.
The expected value formula for a Binomial event equals the probability of success for
a given value, multiplied by the number of trials we carry out.
This seems familiar, because this is the exact formula we used when computing the expected
values for categorical variables in the beginning of the course.
After computing the expected value, we can finally calculate the variance.
We do so by applying the short formula we learned earlier:
“Variance of Y equals the expected value of Y squared, minus the expected value of Y,
squared”.
After some simplifications, this results in “n, times p, times 1 minus p”.
If we plug in the values from our stock market example, that gives us a variance of 5, times
0.6, times 0.4, or 1.2.
This would give us a standard deviation of approximately 1.1.
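As a sanity check, the sketch below rebuilds the whole distribution for the stock example and computes the expected value and variance straight from their definitions, so they can be compared with “n times p” and “n, times p, times 1 minus p”. The variable names are illustrative only.

```python
from math import comb, sqrt

n, p = 5, 0.6  # 5 trading days, 60% chance of an increase on each one

# Probability of each possible number of increases, from 0 to n.
pmf = {y: comb(n, y) * p**y * (1 - p)**(n - y) for y in range(n + 1)}

mean = sum(y * prob for y, prob in pmf.items())
variance = sum(y**2 * prob for y, prob in pmf.items()) - mean**2
std_dev = sqrt(variance)

print(round(mean, 4), round(variance, 4), round(std_dev, 4))  # 3.0 1.2 1.0954
```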
Knowing the expected value and the standard deviation allows us to make more accurate
future forecasts.
In the next video we are going to discuss Poisson Distributions.
Thanks for watching!
4.7 Poisson Distribution
Hello again!
In this lecture we are going to discuss the Poisson Distribution and its main characteristics.
For starters, we denote a Poisson distribution with the letters “Po” and a single value
parameter - lambda.
We read the statement below as “Variable “Y” follows a Poisson distribution with
lambda equal to 4”.
The Poisson Distribution deals with the frequency with which an event occurs in a specific interval.
Instead of the probability of an event, the Poisson Distribution requires knowing how
often it occurs for a specific period of time or distance.
For example, a firefly might light up 3 times in 10 seconds on average.
We would use a Poisson Distribution if we want to determine the likelihood of it lighting
up 8 times in 20 seconds.
The graph of the Poisson distribution plots the number of instances the event occurs in
a standard interval of time and the probability for each one.
Thus, our graph would always start from 0, since no event can happen a negative amount
of times.
However, there is no cap to the amount of times it could occur over the time interval.
Okay, let us explore an example.
Imagine you created an online course on probability.
Usually, your students ask you around 4 questions per day, but yesterday they asked 7.
Surprised by this sudden spike in interest from your students, you wonder how likely
it was that they asked exactly 7 questions.
In this example, the average questions you anticipate is 4, so lambda equals 4.
The time interval is one entire work day and the singular instance you are interested in
is 7.
Therefore, “y” is 7.
To answer this question, we need to explore the probability function for this type of distribution.
As you already saw, the Poisson Distribution is wildly different from any other we have
gone over so far.
It comes without much surprise that its probability function is much different from anything we
have examined so far.
The formula looks the following way: “p of y, equals, lambda to the power of
y, over y factorial, times Euler’s number to the power of negative lambda”.
Before we plug in the values from our course-creation example, we need to make sure you understand
the entire formula.
Let’s refresh your knowledge of the various parts of this formula.
First, the “e” you see on your screens is known as Euler’s number or Napier’s constant.
As the second name suggests, it is a fixed value approximately equal to 2.72.
We commonly observe it in physics, mathematics and nature, but for the purposes of this example
you only need to know its value.
Secondly, a number to the power of “negative n”, is the same as dividing 1 by that number
to the power of n.
In this case, “e to the power of negative lambda” is just “1 over, e to the power
of lambda”.
Going back to our example, the probability of receiving 7 questions is equal to “4,
raised to the 7th power, over 7 factorial, multiplied by “e” raised to the negative
4”.
That approximately equals 16384 over 5040, times 0.0183, or 0.06.
Therefore, there was only a 6% chance of receiving exactly 7 questions.
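Here is a minimal sketch of that calculation in code. The function name `poisson_pmf` is an assumption for this example; it is a direct transcription of “lambda to the power of y, over y factorial, times e to the power of negative lambda”.

```python
from math import exp, factorial

def poisson_pmf(y, lam):
    """Probability of exactly y occurrences when the average rate is lam."""
    return lam**y / factorial(y) * exp(-lam)

# Course example: 4 questions per day on average, 7 questions observed.
p_seven = poisson_pmf(7, 4)

print(round(p_seven, 4))  # 0.0595
```

The firefly example works the same way once the rate is rescaled: 3 flashes per 10 seconds is 6 flashes per 20 seconds, so the call would be `poisson_pmf(8, 6)`.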
So far so good!
Knowing the probability function, we can calculate the expected value.
By definition, the expected value of Y, equals the sum of all the products of a distinct
value in the sample space and its probability.
By plugging in, we get this complicated expression.
In the additional materials attached to this lecture, you can see all the complicated algebra
required to simplify this.
Eventually, we get that the expected value is simply lambda.
Similarly, by applying the formulas we already know, the variance also ends up being equal
to lambda.
Both the mean and variance being equal to lambda serves as yet another example of the
elegant statistics these distributions possess and why we can take advantage of them.
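For the curious, one common way to carry out that algebra is sketched below. It may not follow the attached materials step by step; it only uses the series expansion of “e to the power of lambda” and the variance formula from earlier.

```latex
\begin{aligned}
E(Y) &= \sum_{y=0}^{\infty} y \, \frac{\lambda^{y} e^{-\lambda}}{y!}
      = \lambda e^{-\lambda} \sum_{y=1}^{\infty} \frac{\lambda^{y-1}}{(y-1)!}
      = \lambda e^{-\lambda} e^{\lambda} = \lambda \\[4pt]
E\left[Y(Y-1)\right] &= \sum_{y=2}^{\infty} y(y-1) \, \frac{\lambda^{y} e^{-\lambda}}{y!}
      = \lambda^{2} e^{-\lambda} \sum_{y=2}^{\infty} \frac{\lambda^{y-2}}{(y-2)!}
      = \lambda^{2} \\[4pt]
\sigma^{2} &= E(Y^{2}) - \left[E(Y)\right]^{2}
      = E\left[Y(Y-1)\right] + E(Y) - \lambda^{2}
      = \lambda^{2} + \lambda - \lambda^{2} = \lambda
\end{aligned}
```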
Great job, everyone!
Now, if we wish to compute the probability of an interval of a Poisson distribution,
we take the same steps we usually do for discrete distributions.
We add up the probabilities of all individual elements within it.
You will have a chance to practice this in the exercises after this lecture.
So far, we have discussed Uniform, Bernoulli, Binomial and Poisson distributions, which
are all discrete.
In the next video we will focus on continuous distributions and see how they differ.
Thanks for watching!
4.8 Continuous Distributions
Hello again!
When we started this section of the course, we mentioned how some events have infinitely
many consecutive outcomes.
We call such distributions continuous and they vastly differ from discrete ones.
For starters, their sample space is infinite.
Therefore, we cannot record the frequency of each distinct value.
Thus, we can no longer represent these distributions with a table.
What we can do is represent them with a graph.
More precisely, the graph of the probability density function, or PDF for short.
We denote it as “f of y”, where “y” is an element of the sample space.
As the name suggests, the function depicts the associated probability for every possible
value “y”.
Since it expresses probability, the value it associates with any element of the sample
space would be greater than or equal to zero.
The graphs for continuous distributions slightly resemble the ones for discrete distributions.
However, there are more elements in the sample space, so there are more bars on the graph.
Furthermore, the more bars - the narrower each one must be.
This results in a smooth curve that goes along the top of these bars.
We call this the probability distribution curve, since it shows the likelihood of each outcome.
Now on to some further differences between Discrete and Continuous.
Imagine we used the “favoured over all” formula to calculate probabilities for such distributions.
Since the sample space is infinite, the likelihood of each individual one would be extremely small.
Algebra dictates that, assuming the numerator stays constant, the greater the denominator
becomes, the closer the fraction is to 0.
For reference, one third is closer to 0 than a half, and a quarter is closer to 0 than
either of them.
Since the denominator of the “favoured over all” formula would be so big, it is commonly
accepted that such probabilities are extremely insignificant.
In fact, we assume their likelihood of occurring to be essentially 0.
Thus, it is accepted that the probability for any individual value from a continuous
distribution to be equal to 0.
This assumption is crucial in understanding why “the likelihood of an event being strictly
greater than X, is equal to the likelihood of the event being greater than or equal to
X” for some value X within the sample space.
For example, the probability of a college student running a mile in under 6 minutes
is the same as them running it in at most 6 minutes.
That is because we consider the likelihood of finishing in exactly 6 minutes to be 0.
That wasn’t too complicated, right?
So far, we have been using the term “probability function” to refer to the Probability Density
Function of a distribution.
All the graphs we explored for discrete distributions were depicting their PDFs.
Now, we need to introduce the Cumulative Distribution Function, or CDF for short.
Since it is cumulative, this function encompasses everything up to a certain value.
We denote the CDF as capital F of y for any continuous random variable Y.
As the name suggests, it represents the probability of the random variable being lower than or
equal to a specific value.
Since no value could be lower than or equal to negative infinity, the CDF value for negative
infinity would equal 0.
Similarly, since any value would be lower than plus infinity, we would get a 1 if we
plug plus infinity into the distribution function.
Discrete distributions also have CDFs, but they are far less frequently used.
That is because we can always add up the PDF values associated with the individual probabilities
we are interested in.
Good job, folks!
The CDF is especially useful when we want to estimate the probability of some interval.
Graphically, the area under the density curve would represent the chance of getting a value
within that interval.
We find this area by computing the integral of the density curve over the interval from
“a” to “b”.
For those of you who do not know how to calculate integrals, you can use some free online software
like “Wolfram Alpha dot com”.
If you understand probability correctly, determining and calculating these integrals should feel
very intuitive.
Notice how the cumulative probability is simply the probability of the interval from negative
infinity to ‘y’.
For those that know calculus, this suggests that the CDF for a specific value “y”
is equal to the integral of the density function over the interval from minus infinity to “y”.
This gives us a way to obtain the CDF from the PDF.
The opposite of integration is differentiation, so to obtain a PDF from a CDF, we would have
to find its first derivative.
In more technical terms, the PDF for any element of the sample space ‘y’, equals the first
derivative of the CDF with respect to ‘y’.
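For anybody who prefers to see this numerically, here is a minimal sketch of both directions. It assumes a Standard Normal distribution as the example, uses a far-left cut-off of -10 to stand in for minus infinity, and the helper names cdf_from_pdf and pdf_from_cdf are hypothetical, chosen only for this illustration.

```python
from statistics import NormalDist

dist = NormalDist(mu=0.0, sigma=1.0)  # assumed example distribution


def cdf_from_pdf(pdf, y, lower=-10.0, steps=20_000):
    """Integrate the PDF from a far-left cut-off (standing in for minus infinity) up to y."""
    width = (y - lower) / steps
    return sum(pdf(lower + (i + 0.5) * width) for i in range(steps)) * width


def pdf_from_cdf(cdf, y, h=1e-5):
    """Central-difference estimate of the first derivative of the CDF at y."""
    return (cdf(y + h) - cdf(y - h)) / (2 * h)


y = 0.7
print(cdf_from_pdf(dist.pdf, y), dist.cdf(y))  # integrating the PDF recovers the CDF
print(pdf_from_cdf(dist.cdf, y), dist.pdf(y))  # differentiating the CDF recovers the PDF
```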
Oftentimes, when dealing with continuous variables, we are only given their probability
density functions.
To understand what its graph looks like, we should be able to compute the expected value
and variance for any PDF.
Let’s start with expected values!
The probability of each individual element “y” is 0.
Therefore, we cannot apply the summation formula we used for discrete outcomes.
When dealing with continuous distributions, the expected value is an integral.
More specifically, it is an integral of the product of any element “y” and its associated
PDF value, over the interval from negative infinity to positive infinity.
Now, let us quickly discuss the variance.
Luckily for us, we can still apply the same variance formula we used earlier for discrete distributions.
Namely, the variance is equal to the expected value of the squared variable, minus the expected
value of the variable, squared.
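To make the two formulas concrete, here is a small sketch that computes the expected value and the variance by numerical integration. The density f(y) = 2y on the interval from 0 to 1 is an assumption picked only because its exact answers are easy to check by hand (an expected value of 2/3 and a variance of 1/18). The helper integrate is a hypothetical midpoint-rule function, not part of any library.

```python
def integrate(func, a, b, steps=10_000):
    """Midpoint-rule approximation of the integral of func over [a, b]."""
    width = (b - a) / steps
    return sum(func(a + (i + 0.5) * width) for i in range(steps)) * width


def pdf(y):
    """Assumed example density: f(y) = 2y on [0, 1], and 0 elsewhere."""
    return 2.0 * y if 0.0 <= y <= 1.0 else 0.0


# The density is 0 outside [0, 1], so integrating over that interval is enough.
expected_value = integrate(lambda y: y * pdf(y), 0.0, 1.0)      # E(Y)
expected_square = integrate(lambda y: y**2 * pdf(y), 0.0, 1.0)  # E(Y^2)
variance = expected_square - expected_value**2                  # E(Y^2) - E(Y)^2

print(expected_value, variance)
```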
Marvellous work!
We now know the main characteristics of any continuous distribution, so we can begin exploring
specific types.
In the next lecture we will introduce the Normal Distribution and its main features.
Thanks for watching!
4.9 Normal Distribution
Welcome back!
In this lecture we are going to introduce one of the most commonly found continuous
distributions – the normal distribution.
For starters, we define a Normal Distribution using a capital letter N followed by the mean
and variance of the distribution.
We read the following notation as “Variable “X” follows a Normal Distribution with
mean “mu” and variance “sigma” squared”.
When dealing with actual data we would usually know the numerical values of mu and sigma squared.
The normal distribution frequently appears in nature, as well as in life, in various
shapes and forms.
For example, the size of a full-grown male lion follows a normal distribution.
Many records suggest that the average lion weighs between 150 and 250 kilograms, or 330
to 550 pounds.
Of course, there exist specimens which fall outside of this range.
Lions weighing less than 150, or more than 250 kilograms tend to be the exception rather
than the rule.
Such individuals serve as outliers in our set and the more data we gather, the smaller
the part of the data they represent.
Now that you know what types of events follow a Normal distribution, let us examine some
of its distinct characteristics.
For starters, the graph of a Normal Distribution is bell-shaped.
Therefore, the majority of the data is centred around the mean.
Thus, values further away from the mean are less likely to occur.
Furthermore, we can see that the graph is symmetric with regards to the mean.
That suggests values equally far away in opposing directions, would still be equally likely.
Let’s go back to the lion example from earlier.
If the mean is 400, symmetry suggests a lion is equally likely to weigh 350 pounds and
450 pounds since both are 50 pounds away from the mean.
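We can check this symmetry with a quick sketch. The mean of 400 pounds comes from the example, but the standard deviation of 50 pounds is a made-up value, since the lecture does not state one. For a continuous variable, “equally likely” here means the density curve has the same height at both weights.

```python
from math import isclose
from statistics import NormalDist

# Assumptions: mean of 400 pounds (from the example) and a made-up
# standard deviation of 50 pounds, which the lecture does not specify.
lion_weight = NormalDist(mu=400, sigma=50)

density_350 = lion_weight.pdf(350)
density_450 = lion_weight.pdf(450)

# Symmetry: the density is the same 50 pounds below and 50 pounds above the mean.
print(isclose(density_350, density_450))
```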
For anybody interested, you can find the CDF and the PDF of the Normal distribution in
the additional materials for this lecture.
Instead of going through the complex algebraic simplifications in this lecture, we are simply
going to talk about the expected value and the variance.
The expected value for a Normal distribution equals its mean - “mu”, whereas its variance
“sigma” squared is usually given when we define the distribution.
However, if it isn’t, we can deduce it from the expected value.
To do so we must apply the formula we showed earlier: “The variance of a variable is
equal to the expected value of the squared variable, minus the squared expected value
of the variable”.
Good job!
Another peculiarity of the Normal Distribution is the “68, 95, 99.7” law.
This law suggests that for any normally distributed event, 68% of all outcomes fall within 1 standard
deviation of the mean, 95% fall within two standard deviations and 99.7% fall within three.
The last part really emphasises the fact that outliers are extremely rare in Normal distributions.
It also shows how much we know about a dataset even if the only information we have is that
it is normally distributed!
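If you would like to verify the three percentages yourself, the sketch below recomputes them from the CDF of a Standard Normal distribution, using Python's built-in statistics.NormalDist. The same shares hold for any mean and standard deviation, so the choice of 0 and 1 is only for convenience.

```python
from statistics import NormalDist

standard_normal = NormalDist(mu=0.0, sigma=1.0)

shares = {}
for k in (1, 2, 3):
    # P(-k <= Z <= k) = F(k) - F(-k)
    shares[k] = standard_normal.cdf(k) - standard_normal.cdf(-k)
    print(f"within {k} standard deviation(s): {shares[k]:.1%}")
```

The printed shares come out at roughly 68.3%, 95.4% and 99.7%, which is where the rounded “68, 95, 99.7” figures come from.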
Fantastic work, everyone!
Before we move on to other types of distributions, you need to know that we can use this table
to analyse any Normal Distribution.
To do this we need to standardize the distribution, which we will explain in detail in the next lecture.
Thanks for watching!
4.9.1 Standardizing a Normal Distribution
Welcome back, everybody!
Towards the end of the last lecture we mentioned standardizing without explaining what it is
and why we use it.
Before we understand this concept, we need to explain what a transformation is.
So, a transformation is a way in which we can alter every element of a distribution
to get a new distribution with similar characteristics.
For Normal Distributions we can use addition, subtraction, multiplication and division without
changing the type of the distribution.
For instance, if we add a constant to every element of a Normal distribution, the new
distribution would still be Normal.
Let’s discuss the four algebraic operations and see how each one affects the graph.
If we add a constant, like 3, to the entire distribution, then we simply need to move
the graph 3 places to the right.
Similarly, if we subtract a number from every element, we would simply move our current
graph to the left to get the new one.
If we multiply every element by a constant, the graph will widen that many times, and if we divide
every element by a number, the graph will shrink.
However, if we multiply or divide by a number between 0 and 1, the opposite effects will occur.
For example, dividing by a half, is the same as multiplying by 2, so the graph would expand,
even though we are dividing.
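Here is a small sketch of the four operations, assuming a made-up starting distribution with mean 10 and standard deviation 2. Python's built-in statistics.NormalDist supports adding, subtracting, multiplying and dividing by a constant, and each result is again a Normal distribution.

```python
from statistics import NormalDist

# Hypothetical starting distribution, chosen only for illustration.
original = NormalDist(mu=10.0, sigma=2.0)

shifted_right = original + 3  # adding a constant moves the graph to the right
shifted_left = original - 3   # subtracting moves it to the left
widened = original * 2        # multiplying by 2 makes the graph twice as wide
narrowed = original / 2       # dividing by 2 shrinks it
expanded = original / 0.5     # dividing by a half is the same as multiplying by 2

results = [
    ("original", original),
    ("shifted_right", shifted_right),
    ("shifted_left", shifted_left),
    ("widened", widened),
    ("narrowed", narrowed),
    ("expanded", expanded),
]
for name, dist in results:
    print(name, dist.mean, dist.stdev)
```

Note that multiplying or dividing also moves the centre of the graph, because the mean is scaled together with every other element.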
Now that you know what a transformation is, we can explain standardizing.
Standardizing is a special kind of transformation in which we make the expected value equal
to 0 and the variance equal to 1.
The benefit of doing so, is that we can then use the cumulative distribution table from
last lecture on any element in the set.
The distribution we get after standardizing any Normal distribution, is called a “Standard
Normal Distribution”.
In addition to the “68, 95, 99.7” rule, there exists a table which summarizes the
most commonly used values for the CDF of a Standard Normal Distribution.
This table is known as the Standard Normal Distribution table or the “Z”- score table.
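If you do not have the table at hand, you can reproduce its entries in code. This is only a sketch: it assumes the cumulative layout, where each entry is the probability of a value lower than or equal to z, and printed tables sometimes use a different layout. The function name z_table_value is hypothetical.

```python
from statistics import NormalDist

standard_normal = NormalDist(mu=0.0, sigma=1.0)


def z_table_value(z):
    """P(Z <= z) for a Standard Normal variable, rounded to 4 decimals like a printed table."""
    return round(standard_normal.cdf(z), 4)


for z in (0.0, 0.5, 1.0, 1.5, 1.96, 2.0, 3.0):
    print(f"z = {z:4.2f}  ->  {z_table_value(z):.4f}")
```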
So far, we learned what standardizing is and why it is convenient.
What we haven’t talked about is how to do it.
First, we wish to move the graph either to the left, or to the right until its mean equals 0.
The way we would do that is by subtracting the mean “mu” from every element.
After this to make the standardization complete, we need to make sure the standard deviation
is 1.
To do so, we would have to divide every element of the newly obtained distribution by the
value of the standard deviation, sigma.
If we denote the Standard Normal Distribution with Z, then for any normally distributed
variable Y, “Z equals Y minus mu, over sigma”.
This equation expresses the transformation we use when standardizing.
Applying this single transformation for any Normal Distribution would result in a Standard
Normal Distribution, which is convenient.
Essentially, every element of the non-standardized distribution is represented in the new distribution
by the number of standard deviations it is away from the mean.
For instance, if some value y is 2.3 standard deviations away from the mean, its equivalent
value “Z” would be equal to 2.3.
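The transformation is short enough to write out as a sketch. The mean of 400 and standard deviation of 50 are hypothetical numbers used only for illustration, and standardize is a made-up helper name. The last two lines check that the probability of being lower than or equal to a value is the same before and after standardizing.

```python
from math import isclose
from statistics import NormalDist


def standardize(y, mu, sigma):
    """Z = (Y - mu) / sigma: the number of standard deviations y lies from the mean."""
    return (y - mu) / sigma


# Hypothetical numbers, used only for illustration.
mu, sigma = 400.0, 50.0
y = mu + 2.3 * sigma  # a value 2.3 standard deviations above the mean

z = standardize(y, mu, sigma)
print(z)

# The probability is the same before and after standardizing.
p_original = NormalDist(mu, sigma).cdf(y)
p_standard = NormalDist(0.0, 1.0).cdf(z)
print(isclose(p_original, p_standard))
```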
Standardizing is incredibly useful when we have a Normal Distribution, however we cannot
always anticipate that the data is spread out that way.
A crucial fact to remember about the Normal distribution is that it requires a lot of data.
If our sample is limited, we run the risk of outliers drastically affecting our analysis.
In cases where we have fewer than 30 entries, we usually avoid assuming a Normal distribution.
However, there exists a small sample size approximation of a Normal distribution called
the Students’ T distribution and we are going to focus on it in our next lecture.
Thanks for watching.
4.10 Students’ T Distribution
Hello, folks!
In this lesson we are going to talk about the Students’ T distribution and its characteristics.
Before we begin, we use the lower-case letter “t” to define a Students’ T distribution,
followed by a single parameter in parentheses, called “degrees of freedom”.
We read this next statement as “Variable “Y” follows a Students’ T distribution
with 3 degrees of freedom”.
As we mentioned in the last video, it is a small sample size approximation of a Normal distribution.
In instances where we would assume a Normal distribution were it not for the limited number
of observations, we use the Students’ T distribution.
For instance, the average lap times for the entire season of a Formula 1 race follow a
Normal Distribution, but the lap times for the first lap of the Monaco Grand Prix would
follow a Students’ T distribution.
Now, the curve of the students’ T distribution is also bell-shaped and symmetric.
However, it has fatter tails to accommodate the occurrence of values far away from the mean.
That is because if such a value features in our limited data, it would be representing
a bigger part of the total.
Another key difference between the Students’ T Distribution and the Normal one is that
apart from the mean and variance, we must also define the degrees of freedom for the distribution.
Great job!
As long as we have at least 2 degrees of freedom, the expected value of a t-distribution is
the mean “mu”.
Furthermore, as long as we have more than 2 degrees of freedom, the variance of the distribution equals
the variance of the sample, times the number of degrees of freedom, over degrees of freedom minus two.
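The variance formula is simple enough to sketch as a small function. The name t_variance and the sample variance of 1.0 are assumptions for illustration. The check on the degrees of freedom is there because the formula only gives a finite, positive result when there are more than 2 of them.

```python
def t_variance(sample_variance, degrees_of_freedom):
    """Variance as described in the lecture: s^2 * k / (k - 2).

    The formula needs more than 2 degrees of freedom; otherwise the
    denominator is zero or negative and the variance is not finite.
    """
    k = degrees_of_freedom
    if k <= 2:
        raise ValueError("the variance is only finite for more than 2 degrees of freedom")
    return sample_variance * k / (k - 2)


# Hypothetical sample variance of 1.0, to show the effect of k on its own.
for k in (3, 5, 10, 30, 100):
    print(k, round(t_variance(1.0, k), 3))
```

As the degrees of freedom grow, the multiplier k over k minus 2 approaches 1, which fits the idea that the tails get thinner as we gather more observations.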
Overall the Students’ T distribution is frequently used when conducting statistical analysis.
It plays a major role when we want to do hypothesis testing with limited data, since we also have
a table summarizing the most important values of its CDF.
Another distribution that is commonly used in statistical analysis is the Chi-squared distribution.
In the next video we will explore when we use it and what other distributions it is
related to.
Thanks for watching!
4.11 Chi-squared Distribution
Welcome back, folks!
This is going to be a short lecture where we introduce to you the Chi-squared Distribution.
For starters, we denote a Chi-Squared distribution with the capital Greek letter
Chi, squared followed by a parameter “k” depicting the degrees of freedom.
Therefore, we read the following as “Variable “Y” follows a Chi-Square distribution
with 3 degrees of freedom”.
Let’s get started!
Very few events in real life follow such a distribution.
In fact, Chi-Squared is mostly featured in statistical analysis when doing hypothesis
testing and computing confidence intervals.
In particular, we most commonly find it when determining the goodness of fit of categorical data.
That is why any example we can give you would feel extremely convoluted to anyone not familiar
with statistics.
Now, let’s explore the graph of the Chi-Squared distribution.
Just by looking at it, you can tell the distribution is not symmetric, but rather – asymmetric.
Its graph is highly skewed to the right.
Furthermore, the values depicted on the X-axis start from 0, rather than some negative number.
This, by the way, shows you yet another transformation.
Squaring a variable that follows a Standard Normal distribution gives us a Chi-squared variable with 1 degree
of freedom, and adding up the squares of k independent Standard Normal variables gives us a Chi-squared
variable with k degrees of freedom, which is why the values can never be negative.
So, a convenient feature of the Chi-Squared distribution is that it also comes with a table
of known values, just like the Normal or Students’–T distributions.
The expected value for any Chi-squared distribution is equal to its associated degrees of freedom, k.
Its variance is equal to two times the degrees of freedom, or simply 2 times k.
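A quick simulation can make these two numbers feel less abstract. The sketch below relies on the standard construction of a Chi-squared variable with k degrees of freedom as the sum of k squared independent Standard Normal values, which is an assumption of this example rather than something derived in the lecture. The seed, the sample size and the helper name chi_squared_draw are arbitrary choices.

```python
import random
from statistics import fmean, pvariance

random.seed(0)  # fixed seed so the run is repeatable


def chi_squared_draw(k):
    """One draw: the sum of k squared independent Standard Normal values."""
    return sum(random.gauss(0.0, 1.0) ** 2 for _ in range(k))


k = 3
draws = [chi_squared_draw(k) for _ in range(20_000)]

print(min(draws) >= 0)          # values never go below 0
print(fmean(draws), k)          # sample mean should be close to k
print(pvariance(draws), 2 * k)  # sample variance should be close to 2k
```

Because this is a simulation, the sample mean and variance will only be close to k and 2k, not exactly equal to them.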
To learn more about Hypothesis Testing and Confidence Intervals you can continue with
our program, where we dive into those.
For now, you know all you need to about the Chi-Squared Distribution.
Thanks for watching!
4.12 Exponential Distribution
Hello again!
In this lecture, we are going to discuss the Exponential distribution and its main characteristics.
For starters, we define the exponential distribution with the abbreviation “Exp” followed by
a rate parameter - lambda.
We read the following statement as “Variable “X” follows an exponential distribution
with a rate of a half”.
Variables which most closely follow an exponential distribution, are ones with a probability
that initially decreases, before eventually plateauing.
One such example is the aggregate number of views for a YouTube vlog video.
There is great interest upon release, so it starts off with many views in the first day
or two.
After most subscribers have had the chance to see the video, the view-counter slows down.
Even though the aggregate amount of views keeps increasing, the number of new ones diminishes.
As time goes on, the video either becomes outdated or the author produces new content,
so viewership focus shifts away.
Therefore, it is most likely for a random viewing to have occurred close to the video’s
initial release, than in any of the following periods.
Graphically, the PDF of such a function would start off very high and sharply decrease within
the first few time frames.
The curve somewhat resembles a boomerang with each handle lining up with the X and the Y axes.
We know what the PDF would look like, but what about the CDF?
In a weird way, the CDF would also resemble a boomerang.
However, this one is rotated 90 degrees to the right.
As you know the cumulative distribution eventually approaches 1, so that would be the value where
it plateaus.
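To see both shapes in numbers, here is a minimal sketch using the standard formulas for an exponential distribution with rate lambda: the PDF is lambda times e to the power of minus lambda x, and the CDF is 1 minus e to the power of minus lambda x. The value 0.5 and the function names are assumptions made for this illustration.

```python
from math import exp


def exponential_pdf(x, lam):
    """Density of an exponential distribution with rate lam (0 for negative x)."""
    return lam * exp(-lam * x) if x >= 0 else 0.0


def exponential_cdf(x, lam):
    """P(X <= x) for an exponential distribution with rate lam."""
    return 1.0 - exp(-lam * x) if x >= 0 else 0.0


lam = 0.5  # example value, treated here as the rate parameter
for x in (0, 1, 2, 5, 10, 20):
    print(x, round(exponential_pdf(x, lam), 4), round(exponential_cdf(x, lam), 4))
```

The second column starts at its highest point and drops quickly, while the third column climbs fast at first and then flattens out just below 1.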
To define an exponential distribution, we require a rate parameter denoted by the Greek
letter “lambda”.
This parameter determines how fast the PDF curve reaches the point of plateauing and
how spread out the graph is.
Let’s talk about the expected value and the variance.
The expected value for an exponential distribution is equal to 1 over the rate parameter lambda,
whilst the variance is 1 over lambda squared.
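As a hedged check of these two formulas, the sketch below draws a large sample with random.expovariate from Python's standard library, which takes the rate as its argument. The rate of 0.5, the seed and the sample size are arbitrary choices made for this example.

```python
import random
from statistics import fmean, pvariance

random.seed(1)  # fixed seed so the run is repeatable

lam = 0.5  # assumed rate parameter for this example
samples = [random.expovariate(lam) for _ in range(100_000)]

sample_mean = fmean(samples)
sample_variance = pvariance(samples)

print(sample_mean, 1 / lam)         # expected value: 1 / lambda
print(sample_variance, 1 / lam**2)  # variance: 1 / lambda squared
```

With a rate of 0.5 the sample mean should land near 2 and the sample variance near 4, give or take some simulation noise.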
In data analysis, we end up using exponential distributions quite often.
However, unlike the normal or chi-squared distributions, we do not have a table of known
values for it.
That is why sometimes we prefer to transform it.
Generally, we can take the natural logarithm of every element of an exponential distribution
to compress its long right tail, although strictly speaking the result is not exactly a normal distribution.
In statistics we can use this new transformed data to run, say, linear regressions.
This is one of the most common transformations I’ve had to perform.
Before we move on, we need to introduce an extremely important type of distribution that
is often used in mathematical modelling.
We are going to focus on the logistic distribution and its main characteristics in the next video!
Thanks for watching!
4.13 Logistic Distribution
Welcome back!
In this lecture, we are going to focus on the continuous logistic probability distribution.
We denote a Logistic Distribution with the entire word “Logistic” followed by two
parameters, its mean and its scale parameter.
We also refer to the mean parameter as the “location” and we shall use the terms
interchangeably for the remainder of the video.
Thus, we read the statement below as “Variable “Y” follows a Logistic distribution with
location 6 and a scale of 3”.
We often encounter logistic distributions when trying to determine how continuous variable
inputs can affect the probability of a binary outcome.
This approach is commonly found in forecasting competitive sports events, where there exist
only two clear outcomes – victory or defeat.
For instance, we can analyse whether the average speed of a tennis player’s serve plays a
crucial role in the outcome of the match.
Expectations dictate that sending the ball with higher velocity leaves opponents with
a shorter period to respond.
This usually results in a better hit, which could lead to a point for the server.
To reach the highest speeds tennis players often give up some control over the shot, so
they are less accurate.
Therefore, we cannot assume that there is a linear relationship between point conversion
and serve speeds.
Theory suggests there exists some optimal speed, which enables the serve to still be
accurate enough.
Then, most of the shots we convert into points will likely have similar velocities.
As tennis players go further away from the optimal speed, their shots either become too
slow and easy to handle, or too inaccurate.
This suggests that the graph of the PDF of the Logistic Distribution would look similar
to the Normal Distribution.
Actually, the graph of the Logistic Distribution is defined by two key features – its mean
and its scale parameter.
The former dictates the centre of the graph, whilst the latter shows how spread out the
graph is going to be.
Going back to the tennis example, the mean would represent the optimal speed, whilst
the scale would dictate how lenient we can be with the hit.
To elaborate, some tennis players can hit a great serve further away from their optimal
speed than others.
For instance, Serena Williams can hit fantastic serves even if the ball moves much faster
or slower than it optimally should.
Therefore, she is going to have a more spread out PDF, than some of her opponents.
Now, let’s discuss the Cumulative Distribution Function.
It should be a curve that starts off slow, then picks up rather quickly before plateauing
around the 1 mark.
That is because once we reach values near the mean, the probability of converting the
point drastically goes up.
Once again, the scale would dictate the shape of the graph.
In this case, the smaller the scale, the later the graph starts to pick up, but the quicker
it reaches values close to 1.
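A short sketch can show the effect of the scale on the CDF. It uses the standard formula for the Logistic CDF, 1 over 1 plus e to the power of minus (y minus location) over scale. The location of 6 and the scale of 3 come from the notation example, while the smaller scale of 1 and the function name logistic_cdf are made up for comparison.

```python
from math import exp


def logistic_cdf(y, location, scale):
    """CDF of a Logistic distribution: 1 / (1 + e^(-(y - location) / scale))."""
    return 1.0 / (1.0 + exp(-(y - location) / scale))


location = 6.0  # location from the lecture's notation example
for scale in (3.0, 1.0):  # 3 is from the example; 1 is a made-up smaller scale
    values = [round(logistic_cdf(y, location, scale), 3) for y in (0, 3, 6, 9, 12)]
    print(scale, values)
```

With the smaller scale the values stay lower for longer to the left of the location, then climb towards 1 much faster to the right of it.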
You can use expected values to estimate the variance of the distribution.
To avoid confusing mathematical expressions, you only need to know it is equal to the square
of the scale, times “pi” squared, over 3.
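For anybody curious, the sketch below checks this variance formula numerically. It integrates the standard Logistic density over a wide interval that stands in for minus and plus infinity, then applies the expected value formula for the variance. The location of 6 and scale of 3 come from the notation example, and the helper names are hypothetical.

```python
from math import exp, pi


def logistic_pdf(y, location, scale):
    """Density of a Logistic distribution with the given location and scale."""
    z = (y - location) / scale
    return exp(-z) / (scale * (1.0 + exp(-z)) ** 2)


def integrate(func, a, b, steps=20_000):
    """Midpoint-rule approximation of the integral of func over [a, b]."""
    width = (b - a) / steps
    return sum(func(a + (i + 0.5) * width) for i in range(steps)) * width


location, scale = 6.0, 3.0  # values from the lecture's notation example
# 40 scales on each side of the location stand in for minus and plus infinity.
low, high = location - 40 * scale, location + 40 * scale

mean = integrate(lambda y: y * logistic_pdf(y, location, scale), low, high)
second_moment = integrate(lambda y: y**2 * logistic_pdf(y, location, scale), low, high)
variance_numeric = second_moment - mean**2  # E(Y^2) - E(Y)^2
variance_formula = scale**2 * pi**2 / 3     # scale squared, times pi squared, over 3

print(variance_numeric, variance_formula)
```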
Great job, everybody!
Now that you know all these various types of distributions, we can explore how probability
features in other fields.
In the next section of the course we are going to focus on statistics, data science and other
related fields which integrate probability.
Thanks for watching!