This lecture is going to serve as an overview of what a probability distribution is and what main characteristics it has. Simply put, a distribution shows the possible values a variable can take and how frequently they occur. Before we start, let us introduce some important notation we will use for the remainder of the course. Assume that uppercase “Y” represents the actual outcome of an event and lowercase “y” represents one of the possible outcomes. One way to denote the likelihood of reaching a particular outcome “y” is “P of Y equals y”. We can also express it as “p of y”. For example, uppercase “Y” could represent the number of red marbles we draw out of a bag and lowercase “y” would be a specific number, like 3 or 5. Then, we express the probability of getting exactly 5 red marbles as “P of Y equals 5”, or “p of 5”. Since “p of y” expresses the probability for each distinct outcome, we call this the probability function. Good job, folks! So, probability distributions, or simply probabilities, measure the likelihood of an outcome depending on how often it features in the sample space. Recall that we constructed the probability frequency distribution of an event in the introductory section of the course. We recorded the frequency for each unique value and divided it by the total number of elements in the sample space. Usually, that is the way we construct these probabilities when we have a finite number of possible outcomes. If we had an infinite number of possibilities, then recording the frequency for each one becomes impossible, because… there are infinitely many of them! For instance, imagine you are a data scientist and want to analyse the time it takes for your code to run. Any single compilation could take anywhere from a few milliseconds to several days. Often the result will be between a few milliseconds and a few minutes. If we record time in seconds, we lose precision, which we want to avoid.
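The finite case described above, where we record each unique value’s frequency and divide it by the size of the sample space, can be sketched in a few lines of Python. The marble draws below are made-up values for illustration:

```python
from collections import Counter

def frequency_distribution(outcomes):
    """Build a probability function from a finite list of observed outcomes.

    Each unique value's frequency is divided by the total number of
    elements, exactly as described for the finite case above.
    """
    counts = Counter(outcomes)
    total = len(outcomes)
    return {value: count / total for value, count in counts.items()}

# Hypothetical counts of red marbles drawn from a bag on repeated trials:
draws = [3, 5, 3, 2, 5, 3, 4, 5, 3, 2]
probs = frequency_distribution(draws)
print(probs[3])  # p(3) = 4/10 = 0.4
```

Each value in the result is a probability between 0 and 1, and all of them together sum to 1.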
To avoid losing precision, we would need to use the smallest possible measurement of time. However, since every milli-, micro-, or even nanosecond could be split in half for greater accuracy, no such smallest measurement exists. Less than an hour from now we will talk in more detail about continuous distributions and how to deal with them. Let’s introduce some key definitions. Now, regardless of whether we have a finite or infinite number of possibilities, we define distributions using only two characteristics – mean and variance. Simply put, the mean of the distribution is its average value. Variance, on the other hand, is essentially how spread out the data is. We measure this “spread” by how far away from the mean all the values are. We denote the mean of a distribution with the Greek letter “mu” and its variance as “sigma squared”. Okay. When analysing distributions, it is important to understand what kind of data we have – population data or sample data. Population data is the formal way of referring to “all” the data, while sample data is just a part of it. For example, if an employer surveys an entire department about how they travel to work, the data would represent the population of the department. However, this same data would also be just a sample of the employees in the whole company. Something to remember when using sample data is that we adopt different notation for the mean and variance. We denote the sample mean as “x bar” and the sample variance as “s squared”. One flaw of variance is that it is measured in squared units. For example, if you are measuring time in seconds, the variance would be measured in seconds squared. Usually, there is no direct interpretation of that value. To make more sense of variance, we introduce a third characteristic of the distribution, called standard deviation. Standard deviation is simply the positive square root of variance. As you might suspect, we denote it as “sigma” when dealing with a population, and as “s” when dealing with a sample.
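These definitions, the mean as the average value, the variance as the average squared distance from the mean, and the standard deviation as its positive square root, can be sketched numerically. A minimal example, using a small made-up discrete distribution:

```python
from math import sqrt

# A small, made-up discrete distribution: value -> probability.
dist = {1: 0.2, 2: 0.5, 3: 0.3}

mu = sum(y * p for y, p in dist.items())                    # mean: E[Y]
variance = sum((y - mu) ** 2 * p for y, p in dist.items())  # E[(Y - mu)^2]
std_dev = sqrt(variance)                                    # positive square root

print(round(mu, 4))        # 2.1
print(round(variance, 4))  # 0.49
print(round(std_dev, 4))   # 0.7
```

Note that the variance of 0.49 is in squared units, while the standard deviation of 0.7 is back in the same units as the values themselves.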
Unlike variance, standard deviation is measured in the same units as the mean. Thus, we can directly interpret it, and it is often the preferred measure. One idea which we will use a lot is that any value between “mu minus sigma” and “mu plus sigma” falls within one standard deviation away from the mean. The more congested the middle of the distribution, the more data falls within that interval. Similarly, the less data falls within the interval, the more dispersed the data is. Fantastic! It is important to know that there exists a constant relationship between mean and variance for any distribution. By definition, the variance equals the expected value of the squared difference from the mean for any value. We denote this as “sigma squared equals the expected value of Y minus mu, squared”. After some simplification, this is equal to the expected value of “Y squared” minus “mu squared”. As we will see in the coming lectures, if we are dealing with a specific distribution, we can find a much more precise formula. Okay, when we are getting acquainted with a certain dataset we want to analyse or make predictions with, we are most interested in the mean, variance and type of the distribution. In our next video we will introduce several distributions and the characteristics they possess. Thanks for watching! 4.2 Types of distributions Hello, again! In this lecture we are going to talk about various types of probability distributions and what kind of events they can be used to describe. Certain distributions share features, so we group them into types. Some, like rolling a die or picking a card, have a finite number of outcomes. They follow discrete distributions and we use the formulas we already introduced to calculate their probabilities and expected values. Others, like recording time and distance in track & field, have infinitely many outcomes. They follow continuous distributions and we use different formulas from the ones we mentioned so far.
Throughout the course of this video we are going to examine the characteristics of some of the most common distributions. For each one we will focus on an important aspect of it or when it is used. Before we get into the specifics, you need to know the proper notation we implement when defining distributions. We start off by writing down the variable name for our set of values, followed by the “tilde” sign. This is followed by a capital letter depicting the type of the distribution and some characteristics of the dataset in parentheses. The characteristics are usually the mean and variance, but they may vary depending on the type of the distribution. Alright! Let us start by talking about the discrete ones. We will get an overview of them and then we will devote a separate lecture to each one. So, we looked at problems relating to drawing cards from a deck or flipping a coin. Both examples show events where all outcomes are equally likely. Such outcomes are called equiprobable and these sorts of events follow a Uniform Distribution. Then there are events with only two possible outcomes – true or false. They follow a Bernoulli Distribution, regardless of whether one outcome is more likely to occur. Any event with two outcomes can be transformed into a Bernoulli event. We simply assign one of them to be “true” and the other one to be “false”. Imagine we are required to elect a captain for our college sports team. The team consists of 7 native students and 3 international students. We assign the captain being a native student to be “true” and the captain being an international student to be “false”. Since the outcome can now only be “true” or “false”, we have a Bernoulli distribution. Now, if we want to carry out a similar experiment several times in a row, we are dealing with a Binomial Distribution. Just like with the Bernoulli Distribution, there are two outcomes for each iteration, but now we have many iterations.
For example, we could be flipping the coin we mentioned earlier 3 times and trying to calculate the likelihood of getting heads twice. Lastly, we should mention the Poisson Distribution. We use it when we want to test out how unusual an event frequency is for a given interval. For example, imagine we know that so far Lebron James averages 35 points per game during the regular season. We want to know how likely it is that he will score 12 points in the first quarter of his next game. Since the interval changes, so should our expectations for the outcome. Using the Poisson distribution, we are able to determine the chance of Lebron scoring exactly 12 points for the adjusted time interval. Great, now on to the continuous distributions! One thing to remember is that since we are dealing with continuous outcomes, the probability distribution would be a curve as opposed to unconnected individual bars. The first one we will talk about is the Normal Distribution. The outcomes of many events in nature closely resemble this distribution, hence the name “Normal”. For instance, according to numerous reports throughout the last few decades, the weight of an adult male polar bear is usually around 500 kilograms. However, there have been records of individual specimens weighing anywhere between 350kg and 700kg. Extreme values, like 350 and 700, are called outliers and do not feature very frequently in Normal Distributions. Sometimes, we have limited data for events that resemble a Normal distribution. In those cases, we observe the Student’s-T distribution. It serves as a small sample approximation of a Normal distribution. Another difference is that the Student’s-T accommodates extreme values significantly better. Graphically, that is represented by the curve having fatter “tails”. Overall, this allows for more values extremely far away from the mean, so the curve of a small sample containing outliers would more closely resemble a Student’s-T distribution than a Normal distribution.
Now imagine only looking at the recorded weights of the last 10 sightings across Alaska and Canada. The low number of elements would make any extreme value represent a much bigger part of the sample than it should. Good job, everyone! Another continuous distribution we would like to introduce is the Chi-Squared distribution. It is the first asymmetric continuous distribution we are dealing with, as it only consists of non-negative values. Graphically, that means that the Chi-Squared distribution always starts from 0 on the left. Depending on the average and maximum values within the set, the curve of the Chi-Squared graph is usually skewed to the right, with most of the values concentrated close to 0. Unlike the previous two distributions, the Chi-Squared does not often mirror real life events. However, it is often used in Hypothesis Testing to help determine goodness of fit. The next distribution on our list is the Exponential distribution. The Exponential distribution is usually present when we are dealing with events that are rapidly changing early on. An easy to understand example is how online news articles generate hits. They get most of their clicks when the topic is still fresh. The more time passes, the more irrelevant the article becomes as interest dies off. The last continuous distribution we will mention is the Logistic distribution. We often find it useful in forecast analysis when we try to determine a cut-off point for a successful outcome. For instance, take a competitive e-sport like Dota 2. We can use a Logistic distribution to determine how much of an in-game advantage at the 10-minute mark is necessary to confidently predict victory for either team. Just like with other types of forecasting, our predictions would never reach true certainty, but more on that later! Woah! Good job, folks! In the next video we are going to focus on discrete distributions.
We will introduce formulas for computing Expected Values and Standard Deviations before looking into each distribution individually. Thanks for watching! 4.3 Discrete Distributions Welcome back! In this video we will talk about discrete distributions and their characteristics. Let’s get started! Earlier in the course we mentioned that events with discrete distributions have finitely many distinct outcomes. Therefore, we can express the entire probability distribution with either a table, a graph or a formula. To do so, we need to ensure that every unique outcome has a probability assigned to it. Imagine you are playing darts. Each distinct outcome has some probability assigned to it based on how big its associated interval is. Since we have finitely many possible outcomes, we are dealing with a discrete distribution. Great! In probability, we are often more interested in the likelihood of an interval than of an individual value. With discrete distributions, we can simply add up the probabilities for all the values that fall within that range. Recall the example where we drew a card 20 times. Suppose we want to know the probability of drawing 3 spades or fewer. We would first calculate the probability of getting 0, 1, 2 or 3 spades and then add those up to find the probability of drawing 3 spades or fewer. One peculiarity of discrete events is that the probability of Y being less than or equal to y equals the probability of Y being strictly less than y plus 1. In our last example, that would mean getting 3 spades or fewer is the same as getting fewer than 4 spades. Alright! Now that you have an idea about discrete distributions, we can start exploring each type in more detail. In the next video we are going to examine the Uniform Distribution. Thanks for watching! 4.4 Uniform Distribution Hey, there! In this lecture we are going to discuss the uniform distribution.
For starters, we use the letter U to define a uniform distribution, followed by the range of the values in the dataset. Therefore, we read the following statement as “Variable “X” follows a discrete uniform distribution ranging from 3 to 7”. Events which follow the uniform distribution are ones where all outcomes have equal probability. One such event is rolling a single standard six-sided die. When we roll a standard 6-sided die, we have an equal chance of getting any value from 1 to 6. The graph of the probability distribution would have 6 equally tall bars, all reaching up to one sixth. Many events in gambling provide such odds, where each individual outcome is equally likely. Not only that, but many everyday situations follow the Uniform distribution. If your friend offers you 3 identical chocolate bars, the probabilities assigned to you choosing one of them also follow the Uniform distribution. One big drawback of uniform distributions is that the expected value provides us no relevant information. Because all outcomes have the same probability, the expected value, which is 3.5 for the die, brings no predictive power. We can still apply the formulas from earlier and get a mean of 3.5 and a variance of 105 over 36. These values, however, are completely uninterpretable and there is no real intuition behind what they mean. The main takeaway is that when an event follows the Uniform distribution, each outcome is equally likely. Therefore, both the mean and the variance are uninterpretable and possess no predictive power whatsoever. Okay! Sadly, the Uniform is not the only discrete distribution for which we cannot construct useful prediction intervals. In the next video we will introduce the Bernoulli Distribution. Thanks for watching! 4.5 Bernoulli Distribution Hello again! In this lecture we are going to discuss the Bernoulli distribution.
Before we begin, we use “Bern” to define a Bernoulli distribution, followed by the probability of our preferred outcome in parentheses. Therefore, we read the following statement as “Variable “X” follows a Bernoulli distribution with a probability of success equal to “p””. Okay! We need to describe what types of events follow a Bernoulli distribution. Any event where we have only 1 trial and two possible outcomes follows such a distribution. These may include a coin flip, a single True or False quiz question, or deciding whether to vote for the Democratic or the Republican party in the US elections. Usually, when dealing with a Bernoulli Distribution, we either know the probability of each outcome occurring, or have past data indicating some experimental probability. In either case, the graph of a Bernoulli distribution is simple. It consists of 2 bars, one for each of the possible outcomes. One bar would rise up to its associated probability “p”, and the other one would only reach “1 minus p”. For Bernoulli Distributions, we often have to assign which outcome is 0 and which outcome is 1. After doing so, we can calculate the expected value. Keep in mind that depending on how we assign the 0 and the 1, our expected value will be equal to either “p” or “1 minus p”. We usually denote the higher probability with “p”, and the lower one with “1 minus p”. Furthermore, conventionally we also assign a value of 1 to the event with probability equal to “p”. That way, the expected value expresses the likelihood of the favoured event. Since we only have 1 trial and a favoured event, we expect that outcome to occur. By plugging “p” and “1 minus p” into the variance formula, we get that the variance of Bernoulli events always equals “p, times 1 minus p”. That is true regardless of what the expected value is. Here’s the first instance where we observe how elegant the characteristics of some distributions are.
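These Bernoulli characteristics can be sketched in a couple of lines. The probability of success p = 0.6 below is a hypothetical value; any value between 0 and 1 works the same way:

```python
def bernoulli_stats(p):
    """Mean and variance of a Bernoulli trial whose favoured outcome
    (the one assigned the value 1) occurs with probability p."""
    mean = 1 * p + 0 * (1 - p)  # E[Y] = p
    variance = p * (1 - p)      # always p times (1 - p)
    return mean, variance

mean, variance = bernoulli_stats(0.6)
print(mean)                # 0.6
print(round(variance, 2))  # 0.24
```

Notice that the variance p times (1 minus p) is the same no matter which outcome we call a success, while the mean flips between p and 1 minus p.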
Once again, we can calculate the variance and standard deviation using the general formulas we defined earlier, but the shortcut “p, times 1 minus p” gets us there much faster. For example, consider flipping an unfair coin. This coin is called “unfair” because its weight is spread disproportionately, and it comes up tails 60% of the time. We assign the outcome of tails to be 1, and p to equal 0.6. Therefore, the expected value would be “p”, or 0.6. If we plug this result into the variance formula, we get a variance of 0.6, times 0.4, or 0.24. Great job, everybody! Sometimes, instead of wanting to know which of two outcomes is more probable, we want to know how often one of them would occur over several trials. In such cases, the outcomes follow a Binomial distribution and we will explore it further in the next lecture. 4.6 Binomial Distribution Welcome back! In the last video, we mentioned Binomial Distributions. In essence, Binomial events are a sequence of identical Bernoulli events. Before we get into the differences and similarities between these two distributions, let us examine the proper notation for a Binomial Distribution. We use the letter “B” to express a Binomial distribution, followed by the number of trials and the probability of success in each one. Therefore, we read the following statement as “Variable “X” follows a Binomial distribution with 10 trials and a likelihood of success of 0.6 on each individual trial”. Additionally, we can express a Bernoulli distribution as a Binomial distribution with a single trial. Alright! To better understand the differences between the two types of events, suppose the following scenario. You go to class and your professor gives the class a surprise pop-quiz, for which you have not prepared. Luckily for you, the quiz consists of 10 true or false problems. In this case, guessing a single true or false question is a Bernoulli event, but guessing the entire quiz is a Binomial event. Alright! Let’s go back to the quiz example we just mentioned.
In it, the expected value of the Bernoulli distribution suggests which outcome we expect for a single trial. Now, the expected value of the Binomial distribution would suggest the number of times we expect to get a specific outcome. Great! Now, the graph of the binomial distribution represents the likelihood of attaining our desired outcome a specific number of times. If we run n trials, our graph would consist of “n plus 1”-many bars – one for each unique value from 0 to n. For instance, we could be flipping the same unfair coin we had from last lecture. If we toss it twice, we need bars for the three different outcomes – zero, one or two tails. Fantastic! If we wish to find the associated likelihood of getting a given outcome a precise number of times over the course of n trials, we need to introduce the probability function of the Binomial distribution. For starters, each individual trial is a Bernoulli trial, so we express the probability of getting our desired outcome as “p” and the likelihood of the other one as “1 minus p”. In order to get our favoured outcome exactly y-many times over the n trials, we also need to get the alternative outcome “n minus y”-many times. If we don’t account for this, we would be estimating the likelihood of getting our desired outcome at least y-many times. Furthermore, there could exist more than one way to reach our desired outcome. To account for this, we need to find the number of scenarios in which “y” out of the “n”-many outcomes would be favourable. But these are actually the “combinations” we already know! For instance, if we wish to find out the number of ways in which 4 out of the 6 trials can be successful, it is the same as picking 4 elements out of a sample space of 6. Now you see why combinatorics is a fundamental part of probability! Thus, we need to find the number of combinations in which “y” out of the “n” outcomes would be favourable.
For instance, there are 3 different ways to get tails exactly twice in 3 coin flips. Therefore, the probability function for a Binomial Distribution is the product of the number of combinations of picking y-many elements out of n, times “p” to the power of y, times “1 minus p” to the power of “n minus y”. Great! To see this in action, let us look at an example. Imagine you bought a single stock of General Motors. Historically, you know there is a 60% chance the price of your stock will go up on any given day, and a 40% chance it will drop. By the price going up, we mean that the closing price is higher than the opening price. With the probability distribution function, you can calculate the likelihood of the stock price increasing 3 times during the 5-work-day week. If we wish to use the probability distribution formula, we need to plug in 3 for “y”, 5 for “n” and 0.6 for “p”. After plugging in we get: “number of different possible combinations of picking 3 elements out of 5, times 0.6 to the power of 3, times 0.4 to the power of 2”. This is equivalent to 10, times 0.216, times 0.16, or 0.3456. Thus, we have a 34.56% chance of getting exactly 3 increases over the course of a work week. The big advantage of recognizing the distribution is that you can simply use these formulas and plug in the information you already have! Alright! Now that we know the probability function, we can move on to the expected value. By definition, the expected value equals the sum of all values in the sample space, multiplied by their respective probabilities. The expected value formula for a Binomial event simplifies to the probability of success on a single trial, multiplied by the number of trials we carry out. This seems familiar, because this is the exact formula we used when computing the expected values for categorical variables in the beginning of the course. After computing the expected value, we can finally calculate the variance.
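Before moving on, the stock example above can be checked directly in Python; `math.comb` computes the number of combinations:

```python
from math import comb

def binomial_pmf(y, n, p):
    """P(exactly y successes in n trials), with success probability p per trial."""
    return comb(n, y) * p**y * (1 - p)**(n - y)

# 3 price increases in a 5-work-day week, with a 60% chance of a rise each day.
prob = binomial_pmf(3, 5, 0.6)
print(round(prob, 4))  # 0.3456

# The expected number of increases is simply n times p.
print(5 * 0.6)  # 3.0
```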
We do so by applying the short formula we learned earlier: “Variance of Y equals the expected value of Y squared, minus the expected value of Y, squared.” After some simplifications, this results in “n, times p, times 1 minus p”. If we plug in the values from our stock market example, that gives us a variance of 5, times 0.6, times 0.4, or 1.2. This would give us a standard deviation of approximately 1.1. Knowing the expected value and the standard deviation allows us to make more accurate future forecasts. Fantastic! In the next video we are going to discuss Poisson Distributions. Thanks for watching! 4.7 Poisson Distribution Hello again! In this lecture we are going to discuss the Poisson Distribution and its main characteristics. For starters, we denote a Poisson distribution with the letters “Po” and a single parameter – lambda. We read the statement below as “Variable “Y” follows a Poisson distribution with lambda equal to 4”. Okay! The Poisson Distribution deals with the frequency with which an event occurs in a specific interval. Instead of the probability of an event, the Poisson Distribution requires knowing how often it occurs for a specific period of time or distance. For example, a firefly might light up 3 times in 10 seconds on average. We would use a Poisson Distribution if we want to determine the likelihood of it lighting up 8 times in 20 seconds. The graph of the Poisson distribution plots the number of instances the event occurs in a standard interval of time and the probability for each one. Thus, our graph would always start from 0, since no event can happen a negative number of times. However, there is no cap to the number of times it could occur over the time interval. Okay, let us explore an example. Imagine you created an online course on probability. Usually, your students ask you around 4 questions per day, but yesterday they asked 7.
Surprised by this sudden spike in interest from your students, you wonder how likely it was that they asked exactly 7 questions. In this example, the average number of questions you anticipate is 4, so lambda equals 4. The time interval is one entire work day and the singular instance you are interested in is 7. Therefore, “y” is 7. To answer this question, we need to explore the probability function for this type of distribution. Alright! As you already saw, the Poisson Distribution is wildly different from any other we have gone over so far, so it comes as no surprise that its probability function is also unlike anything we have examined. The formula looks as follows: “p of y, equals, lambda to the power of y, over y factorial, times Euler’s number to the power of negative lambda”. Before we plug in the values from our course-creation example, we need to make sure you understand the entire formula. Let’s refresh your knowledge of the various parts of this formula. First, the “e” you see on your screens is known as Euler’s number or Napier’s constant. As the second name suggests, it is a fixed value approximately equal to 2.72. We commonly observe it in physics, mathematics and nature, but for the purposes of this example you only need to know its value. Secondly, a number to the power of “negative n” is the same as dividing 1 by that number to the power of n. In this case, “e to the power of negative lambda” is just “1 over, e to the power of lambda”. Right! Going back to our example, the probability of receiving 7 questions is equal to “4, raised to the 7th power, over 7 factorial, multiplied by “e” raised to the power of negative 4”. That approximately equals 16384 over 5040, times 0.0183, or 0.06. Therefore, there was only a 6% chance of receiving exactly 7 questions. So far so good! Knowing the probability function, we can calculate the expected value.
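The question-count probability we just computed can be verified in a few lines:

```python
from math import exp, factorial

def poisson_pmf(y, lam):
    """P(exactly y occurrences), where lam is the average count per interval."""
    return lam**y / factorial(y) * exp(-lam)

# Course example: students average 4 questions per day; how likely are exactly 7?
prob = poisson_pmf(7, 4)
print(round(prob, 2))  # 0.06
```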
By definition, the expected value of Y equals the sum of the products of each distinct value in the sample space and its probability. By plugging in, we get this complicated expression. In the additional materials attached to this lecture, you can see all the complicated algebra required to simplify this. Eventually, we get that the expected value is simply lambda. Similarly, by applying the formulas we already know, the variance also ends up being equal to lambda. Both the mean and the variance being equal to lambda serves as yet another example of the elegant statistics these distributions possess and why we can take advantage of them. Great job, everyone! Now, if we wish to compute the probability of an interval of a Poisson distribution, we take the same steps we usually do for discrete distributions. We add up the probabilities of all the individual elements within it. You will have a chance to practice this in the exercises after this lecture. So far, we have discussed Uniform, Bernoulli, Binomial and Poisson distributions, which are all discrete. In the next video we will focus on continuous distributions and see how they differ. Thanks for watching! 4.8 Continuous Distributions Hello again! When we started this section of the course, we mentioned how some events have infinitely many consecutive outcomes. We call such distributions continuous and they vastly differ from discrete ones. For starters, their sample space is infinite. Therefore, we cannot record the frequency of each distinct value. Thus, we can no longer represent these distributions with a table. What we can do is represent them with a graph. More precisely, the graph of the probability density function, or PDF for short. We denote it as “f of y”, where “y” is an element of the sample space. As the name suggests, the function depicts the associated probability for every possible value “y”.
Since it expresses probability, the value it associates with any element of the sample space would be greater than or equal to zero. Great! The graphs for continuous distributions slightly resemble the ones for discrete distributions. However, there are more elements in the sample space, so there are more bars on the graph. Furthermore, the more bars there are, the narrower each one must be. This results in a smooth curve that goes along the top of these bars. We call this the probability distribution curve, since it shows the likelihood of each outcome. Now on to some further differences between discrete and continuous distributions. Imagine we used the “favoured over all” formula to calculate probabilities for such variables. Since the sample space is infinite, the likelihood of each individual outcome would be extremely small. Algebra dictates that, assuming the numerator stays constant, the greater the denominator becomes, the closer the fraction is to 0. For reference, one third is closer to 0 than a half, and a quarter is closer to 0 than either of them. Since the denominator of the “favoured over all” formula would be so big, it is commonly accepted that such probabilities are extremely insignificant. In fact, we assume their likelihood of occurring to be essentially 0. Thus, it is accepted that the probability of any individual value from a continuous distribution is equal to 0. This assumption is crucial in understanding why “the likelihood of an event being strictly greater than X is equal to the likelihood of the event being greater than or equal to X” for some value X within the sample space. For example, the probability of a college student running a mile in under 6 minutes is the same as them running it in at most 6 minutes. That is because we consider the likelihood of finishing in exactly 6 minutes to be 0. That wasn’t too complicated, right? So far, we have been using the term “probability function” to refer to the Probability Density Function of a distribution.
All the graphs we explored for discrete distributions were depicting their PDFs. Now, we need to introduce the Cumulative Distribution Function, or CDF for short. Since it is cumulative, this function encompasses everything up to a certain value. We denote the CDF as capital F of y for any continuous random variable Y. As the name suggests, it represents the probability of the random variable being lower than or equal to a specific value. Since no value could be lower than or equal to negative infinity, the CDF value for negative infinity would equal 0. Similarly, since any value would be lower than plus infinity, we would get a 1 if we plug plus infinity into the distribution function. Discrete distributions also have CDFs, but they are far less frequently used. That is because we can always add up the PDF values associated with the individual outcomes we are interested in. Good job, folks! The CDF is especially useful when we want to estimate the probability of some interval. Graphically, the area under the density curve would represent the chance of getting a value within that interval. We find this area by computing the integral of the density curve over the interval from “a” to “b”. For those of you who do not know how to calculate integrals, you can use some free online software like “Wolfram Alpha dot com”. If you understand probability correctly, determining and calculating these integrals should feel very intuitive. Alright! Notice how the cumulative probability is simply the probability of the interval from negative infinity to “y”. For those that know calculus, this suggests that the CDF for a specific value “y” is equal to the integral of the density function over the interval from minus infinity to “y”. This gives us a way to obtain the CDF from the PDF. The opposite of integration is differentiation, so to attain a PDF from a CDF, we would have to find its first derivative.
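The PDF-to-CDF relationship can be illustrated numerically. The sketch below uses an Exponential distribution with a made-up rate parameter, approximates the CDF by integrating the PDF with a midpoint Riemann sum, and compares the result to the known closed-form CDF of the Exponential, “1 minus e to the power of negative lambda times y”:

```python
from math import exp

lam = 0.5  # a hypothetical rate parameter, chosen for illustration

def pdf(y):
    """PDF of an Exponential(lam) distribution: f(y) = lam * e^(-lam * y)."""
    return lam * exp(-lam * y)

def cdf_numeric(y, steps=100_000):
    """Approximate F(y) by integrating the PDF from 0 to y (midpoint rule)."""
    dy = y / steps
    return sum(pdf((i + 0.5) * dy) * dy for i in range(steps))

y = 3.0
approx = cdf_numeric(y)
exact = 1 - exp(-lam * y)  # the known closed-form CDF
print(round(approx, 6))    # agrees with round(exact, 6)
assert abs(approx - exact) < 1e-6
```

A tool like Wolfram Alpha performs this integration symbolically; the numeric sum here just makes the “area under the density curve” idea concrete.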
In more technical terms, the PDF for any element of the sample space ‘y’ equals the first derivative of the CDF with respect to ‘y’. Okay! Oftentimes, when dealing with continuous variables, we are only given their probability density functions. To understand what their graphs look like, we should be able to compute the expected value and variance for any PDF. Let’s start with expected values! The probability of each individual element “y” is 0. Therefore, we cannot apply the summation formula we used for discrete outcomes. When dealing with continuous distributions, the expected value is an integral. More specifically, it is an integral of the product of any element “y” and its associated PDF value, over the interval from negative infinity to positive infinity. Right! Now, let us quickly discuss the variance. Luckily for us, we can still apply the same variance formula we used earlier for discrete distributions. Namely, the variance is equal to the expected value of the squared variable, minus the expected value of the variable, squared. Marvellous work! We now know the main characteristics of any continuous distribution, so we can begin exploring specific types. In the next lecture we will introduce the Normal Distribution and its main features. Thanks for watching! 4.9 Normal Distribution Welcome back! In this lecture we are going to introduce one of the most commonly found continuous distributions – the normal distribution. For starters, we define a Normal Distribution using a capital letter N followed by the mean and variance of the distribution. We read the following notation as “Variable “X” follows a Normal Distribution with mean “mu” and variance “sigma” squared”. When dealing with actual data we would usually know the numerical values of mu and sigma squared. The normal distribution frequently appears in nature, as well as in life, in various shapes and forms. For example, the size of a full-grown male lion follows a normal distribution.
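The two integral formulas above can be checked numerically. A sketch, using a Normal distribution with mean 3 and variance 4 purely as an assumed example:

```python
import math

def normal_pdf(y, mu, sigma):
    """PDF of a Normal(mu, sigma^2)."""
    return math.exp(-((y - mu) ** 2) / (2.0 * sigma ** 2)) / (sigma * math.sqrt(2.0 * math.pi))

def integrate(f, a=-50.0, b=50.0, n=100_000):
    """Midpoint-rule integral of f over [a, b]; the interval is wide
    enough to stand in for negative to positive infinity here."""
    step = (b - a) / n
    return sum(f(a + (i + 0.5) * step) for i in range(n)) * step

mu, sigma = 3.0, 2.0
pdf = lambda y: normal_pdf(y, mu, sigma)

expected = integrate(lambda y: y * pdf(y))           # E[Y]: integral of y * f(y)
second_moment = integrate(lambda y: y * y * pdf(y))  # E[Y^2]
variance = second_moment - expected ** 2             # Var = E[Y^2] - (E[Y])^2

print(round(expected, 3))  # 3.0
print(round(variance, 3))  # 4.0
```

As expected, the integrals recover exactly the mean and variance the distribution was defined with.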
Many records suggest that the average lion weighs between 150 and 250 kilograms, or 330 to 550 pounds. Of course, there exist specimens which fall outside of this range. Lions weighing less than 150, or more than 250 kilograms tend to be the exception rather than the rule. Such individuals serve as outliers in our set, and the more data we gather, the smaller the share of the data they represent. Now that you know what types of events follow a Normal distribution, let us examine some of its distinct characteristics. For starters, the graph of a Normal Distribution is bell-shaped. Therefore, the majority of the data is centred around the mean. Thus, values further away from the mean are less likely to occur. Furthermore, we can see that the graph is symmetric with regard to the mean. That suggests values equally far away in opposing directions would still be equally likely. Let’s go back to the lion example from earlier. If the mean is 400, symmetry suggests a lion is equally likely to weigh 350 pounds and 450 pounds, since both are 50 pounds away from the mean. Alright! For anybody interested, you can find the CDF and the PDF of the Normal distribution in the additional materials for this lecture. Instead of going through the complex algebraic simplifications in this lecture, we are simply going to talk about the expected value and the variance. The expected value for a Normal distribution equals its mean, “mu”, whereas its variance, “sigma” squared, is usually given when we define the distribution. However, if it isn’t, we can deduce it from the expected value. To do so we must apply the formula we showed earlier: “The variance of a variable is equal to the expected value of the squared variable, minus the squared expected value of the variable”. Good job! Another peculiarity of the Normal Distribution is the “68, 95, 99.7” law.
This law suggests that for any normally distributed event, 68% of all outcomes fall within 1 standard deviation away from the mean, 95% fall within two standard deviations, and 99.7% fall within 3. The last part really emphasises the fact that outliers are extremely rare in Normal distributions. It also shows how much we already know about a dataset simply from the information that it is normally distributed! Fantastic work, everyone! Before we move on to other types of distributions, you need to know that we can use this table to analyse any Normal Distribution. To do this we need to standardize the distribution, which we will explain in detail in the next video. Thanks for watching! 4.9.1 Standardizing a Normal Distribution Welcome back, everybody! Towards the end of the last lecture we mentioned standardizing without explaining what it is and why we use it. Before we understand this concept, we need to explain what a transformation is. So, a transformation is a way in which we can alter every element of a distribution to get a new distribution with similar characteristics. For Normal Distributions we can use addition, subtraction, multiplication and division without changing the type of the distribution. For instance, if we add a constant to every element of a Normal distribution, the new distribution would still be Normal. Let’s discuss the four algebraic operations and see how each one affects the graph. If we add a constant, like 3, to the entire distribution, then we simply need to move the graph 3 places to the right. Similarly, if we subtract a number from every element, we would simply move our current graph to the left to get the new one. If we multiply every element by a constant, the graph will widen that many times, and if we divide every element by a number, the graph will shrink. However, if we multiply or divide by a number between 0 and 1, the opposite effects will occur.
For example, dividing by a half is the same as multiplying by 2, so the graph would expand, even though we are dividing. Alright! Now that you know what a transformation is, we can explain standardizing. Standardizing is a special kind of transformation in which we make the expected value equal to 0 and the variance equal to 1. The benefit of doing so is that we can then use the cumulative distribution table from last lecture on any element in the set. The distribution we get after standardizing any Normal distribution is called a “Standard Normal Distribution”. In addition to the “68, 95, 99.7” rule, there exists a table which summarizes the most commonly used values for the CDF of a Standard Normal Distribution. This table is known as the Standard Normal Distribution table or the “Z”-score table. Okay! So far, we learned what standardizing is and why it is convenient. What we haven’t talked about is how to do it. First, we wish to move the graph either to the left or to the right until its mean equals 0. The way we would do that is by subtracting the mean “mu” from every element. After this, to make the standardization complete, we need to make sure the standard deviation is 1. To do so, we would have to divide every element of the newly obtained distribution by the value of the standard deviation, sigma. If we denote the Standard Normal Distribution with Z, then for any normally distributed variable Y, “Z equals Y minus mu, over sigma”. This equation expresses the transformation we use when standardizing. Amazing! Applying this single transformation to any Normal Distribution results in a Standard Normal Distribution, which is convenient. Essentially, every element of the non-standardized distribution is represented in the new distribution by the number of standard deviations it is away from the mean. For instance, if some value y is 2.3 standard deviations away from the mean, its equivalent value “Z” would be equal to 2.3.
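The standardization formula and the “68, 95, 99.7” rule can both be sketched in a few lines of Python. The lion weights here assume a mean of 400 pounds as in the earlier example; the standard deviation of 50 pounds is made up purely for illustration.

```python
import math

def standardize(y, mu, sigma):
    """Z = (Y - mu) / sigma: how many standard deviations y is from the mean."""
    return (y - mu) / sigma

def std_normal_cdf(z):
    """CDF of the Standard Normal -- the quantity a Z-score table lists."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

# Lion weights, assumed Normal with mean 400 lb and (hypothetical) sd 50 lb.
z = standardize(450, mu=400, sigma=50)
print(z)  # 1.0 -> 450 lb sits one standard deviation above the mean

# The "68, 95, 99.7" rule read straight off the standard normal CDF:
for k in (1, 2, 3):
    share = std_normal_cdf(k) - std_normal_cdf(-k)
    print(f"within {k} sd: {share:.3f}")  # 0.683, then 0.954, then 0.997
```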
Standardizing is incredibly useful when we have a Normal Distribution; however, we cannot always anticipate that the data is spread out that way. A crucial fact to remember about the Normal distribution is that it requires a lot of data. If our sample is limited, we run the risk of outliers drastically affecting our analysis. In cases where we have fewer than 30 entries, we usually avoid assuming a Normal distribution. However, there exists a small sample size approximation of a Normal distribution called the Student’s T distribution, and we are going to focus on it in our next lecture. Thanks for watching. 4.10 Student’s T Distribution Hello, folks! In this lesson we are going to talk about the Student’s T distribution and its characteristics. Before we begin, we use the lower-case letter “t” to define a Student’s T distribution, followed by a single parameter in parentheses, called “degrees of freedom”. We read this next statement as “Variable “Y” follows a Student’s T distribution with 3 degrees of freedom”. As we mentioned in the last video, it is a small sample size approximation of a Normal Distribution. In instances where we would assume a Normal distribution were it not for the limited number of observations, we use the Student’s T distribution. For instance, the average lap times for the entire season of a Formula 1 race follow a Normal Distribution, but the lap times for the first lap of the Monaco Grand Prix would follow a Student’s T distribution. Now, the curve of the Student’s T distribution is also bell-shaped and symmetric. However, it has fatter tails to accommodate the occurrence of values far away from the mean. That is because if such a value features in our limited data, it would be representing a bigger part of the total. Another key difference between the Student’s T Distribution and the Normal one is that apart from the mean and variance, we must also define the degrees of freedom for the distribution. Great job!
As long as we have at least 2 degrees of freedom, the expected value of a t-distribution is the mean “mu”. Furthermore, the variance of the distribution equals the variance of the sample, times the number of degrees of freedom, over the degrees of freedom minus two. Overall, the Student’s T distribution is frequently used when conducting statistical analysis. It plays a major role when we want to do hypothesis testing with limited data, since we also have a table summarizing the most important values of its CDF. Great! Another distribution that is commonly used in statistical analysis is the Chi-squared Distribution. In the next video we will explore when we use it and what other distributions it is related to. Thanks for watching! 4.11 Chi-squared Distribution Welcome back, folks! This is going to be a short lecture where we introduce to you the Chi-squared Distribution. For starters, we denote a Chi-squared distribution with the capital Greek letter Chi, squared, followed by a parameter “k” depicting the degrees of freedom. Therefore, we read the following as “Variable “Y” follows a Chi-squared distribution with 3 degrees of freedom”. Alright! Let’s get started! Very few events in real life follow such a distribution. In fact, Chi-squared is mostly featured in statistical analysis when doing hypothesis testing and computing confidence intervals. In particular, we most commonly find it when determining the goodness of fit of categorical values. That is why any example we can give you would feel extremely convoluted to anyone not familiar with statistics. Alright! Now, let’s explore the graph of the Chi-squared distribution. Just by looking at it, you can tell the distribution is not symmetric, but rather asymmetric. Its graph is highly skewed to the right. Furthermore, the values depicted on the X-axis start from 0, rather than some negative number. This, by the way, shows you yet another transformation.
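The variance formula for the t-distribution can be sanity-checked by simulation. This sketch (in its standard form, with sample variance 1) builds a t-distributed variable out of normal draws; the construction used here — a standard normal divided by the square root of an independent chi-squared over its degrees of freedom — is the classical one, and it also previews the chi-squared distribution discussed next.

```python
import random
import statistics

random.seed(42)

def t_sample(k):
    """One draw from Student's T with k degrees of freedom, built from
    normals: T = Z / sqrt(chi_squared_k / k), where chi_squared_k is a
    sum of k squared standard normals."""
    z = random.gauss(0.0, 1.0)
    chi2 = sum(random.gauss(0.0, 1.0) ** 2 for _ in range(k))
    return z / (chi2 / k) ** 0.5

k = 10
draws = [t_sample(k) for _ in range(100_000)]

# The mean should be near 0 and the variance near k / (k - 2) = 1.25.
print(statistics.mean(draws))
print(statistics.variance(draws))
```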
Squaring a variable that follows the Standard Normal distribution gives us a variable that follows a Chi-squared distribution with 1 degree of freedom. Great! So, a convenient feature of the Chi-squared distribution is that it also comes with a table of known values, just like the Normal or Student’s T distributions. The expected value for any Chi-squared distribution is equal to its associated degrees of freedom, k. Its variance is equal to two times the degrees of freedom, or simply 2 times k. To learn more about Hypothesis Testing and Confidence Intervals you can continue with our program, where we dive into those. For now, you know all you need to know about the Chi-squared Distribution. Thanks for watching! 4.12 Exponential Distribution Hello again! In this lecture, we are going to discuss the Exponential distribution and its main characteristics. For starters, we define the exponential distribution with the abbreviation “Exp” followed by a rate parameter, lambda. We read the following statement as “Variable “X” follows an exponential distribution with a rate of one half”. Alright! Variables which most closely follow an exponential distribution are ones with a probability that initially decreases, before eventually plateauing. One such example is the aggregate number of views for a YouTube vlog video. There is great interest upon release, so it starts off with many views in the first day or two. After most subscribers have had the chance to see the video, the view-counter slows down. Even though the aggregate number of views keeps increasing, the number of new ones diminishes daily. As time goes on, the video either becomes outdated or the author produces new content, so viewership focus shifts away. Therefore, a random viewing is more likely to have occurred close to the video’s initial release than in any of the following periods.
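Python’s standard library can sample an Exponential distribution directly, so the behaviour described above is easy to simulate. A sketch with rate lambda = 0.5, the value from the notation example; for an Exponential distribution, the mean is 1/lambda = 2 and the variance is 1/lambda squared = 4.

```python
import random
import statistics

random.seed(1)
lam = 0.5  # the rate parameter lambda from the notation example

# random.expovariate draws from an Exponential distribution with rate lam.
draws = [random.expovariate(lam) for _ in range(100_000)]

# Mean should be near 1/lam = 2; variance near 1/lam**2 = 4.
print(statistics.mean(draws))
print(statistics.variance(draws))
```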
Graphically, the PDF of such a function would start off very high and sharply decrease within the first few time frames. The curve somewhat resembles a boomerang, with each handle lining up with the X and the Y axes. Alright! We know what the PDF would look like, but what about the CDF? In a weird way, the CDF would also resemble a boomerang. However, this one is rotated 90 degrees to the right. As you know, the cumulative distribution eventually approaches 1, so that would be the value where it plateaus. To define an exponential distribution, we require a rate parameter denoted by the Greek letter “lambda”. This parameter determines how fast the PDF curve reaches the point of plateauing and how spread out the graph is. Alright! Let’s talk about the expected value and the variance. The expected value for an exponential distribution is equal to 1 over the rate parameter lambda, whilst the variance is 1 over lambda squared. In data analysis, we end up using exponential distributions quite often. However, unlike the normal or chi-squared distributions, we do not have a table of known values for it. That is why sometimes we prefer to transform it. Generally, we can take the natural logarithm of every element of an exponentially distributed data set and obtain data that is much closer to normally distributed. In statistics we can use this new transformed data to run, say, linear regressions. This is one of the most common transformations I’ve had to perform. Before we move on, we need to introduce an extremely important type of distribution that is often used in mathematical modelling. We are going to focus on the logistic distribution and its main characteristics in the next video! Thanks for watching! 4.13 Logistic Distribution Welcome back! In this lecture, we are going to focus on the continuous logistic probability distribution. We denote a Logistic Distribution with the entire word “Logistic”, followed by two parameters: its mean and a scale parameter, like the one for the Exponential distribution.
We also refer to the mean parameter as the “location”, and we shall use the terms interchangeably for the remainder of the video. Thus, we read the statement below as “Variable “Y” follows a Logistic distribution with location 6 and a scale of 3”. Alright! We often encounter logistic distributions when trying to determine how continuous variable inputs can affect the probability of a binary outcome. This approach is commonly found in forecasting competitive sports events, where there exist only two clear outcomes – victory or defeat. For instance, we can analyse whether the average speed of a tennis player’s serve plays a crucial role in the outcome of the match. Expectations dictate that sending the ball with higher velocity leaves opponents with less time to respond. This usually results in a better hit, which could lead to a point for the server. To reach the highest speeds, tennis players often give up some control over the shot, so they are less accurate. Therefore, we cannot assume that there is a linear relationship between point conversion and serve speeds. Theory suggests there exists some optimal speed, which enables the serve to still be accurate enough. Then, most of the shots we convert into points will likely have similar velocities. As tennis players move further away from the optimal speed, their shots either become too slow and easy to handle, or too inaccurate. This suggests that the graph of the PDF of the Logistic Distribution would look similar to that of the Normal Distribution. Actually, the graph of the Logistic Distribution is defined by two key features – its mean and its scale parameter. The former dictates the centre of the graph, whilst the latter shows how spread out the graph is going to be. Going back to the tennis example, the mean would represent the optimal speed, whilst the scale would dictate how lenient we can be with the hit. To elaborate, some tennis players can hit a great serve further away from their optimal speed than others.
For instance, Serena Williams can hit fantastic serves even if the ball moves much faster or slower than it optimally should. Therefore, she is going to have a more spread out PDF than some of her opponents. Fantastic! Now, let’s discuss the Cumulative Distribution Function. It should be a curve that starts off slow, then picks up rather quickly before plateauing around the 1 mark. That is because once we reach values near the mean, the probability of converting the point drastically goes up. Once again, the scale would dictate the shape of the graph. In this case, the smaller the scale, the later the graph starts to pick up, but the quicker it reaches values close to 1. Okay! You can use expected values to estimate the variance of the distribution. To avoid confusing mathematical expressions, you only need to know that it is equal to the square of the scale, times “pi” squared, over 3. Great job, everybody! Now that you know all these various types of distributions, we can explore how probability features in other fields. In the next section of the course we are going to focus on statistics, data science and other related fields which integrate probability. Thanks for watching!
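As a closing check, the logistic variance formula can be verified by simulation. A sketch, assuming the location 6 and scale 3 from the notation example, and sampling by inverting the logistic CDF (a standard inverse-transform construction, not something from the course itself):

```python
import math
import random
import statistics

random.seed(7)
mu, s = 6.0, 3.0  # location and scale from the notation example

def logistic_sample(mu, s):
    """Draw from Logistic(mu, s) by inverting its CDF:
    F(y) = 1 / (1 + exp(-(y - mu) / s))  =>  y = mu + s * ln(u / (1 - u))."""
    u = random.random()
    return mu + s * math.log(u / (1.0 - u))

draws = [logistic_sample(mu, s) for _ in range(100_000)]

# Mean should be near the location (6); variance near s^2 * pi^2 / 3 (about 29.61).
print(statistics.mean(draws))
print(statistics.variance(draws))
```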