
  • In statistics, logistic regression, or logit regression, is a type of probabilistic statistical

  • classification model. It is also used to predict a binary response from a binary predictor,

  • used for predicting the outcome of a categorical dependent variable based on one or more predictor

  • variables. That is, it is used in estimating the parameters of a qualitative response model.

  • The probabilities describing the possible outcomes of a single trial are modeled, as

  • a function of the explanatory variables, using a logistic function. Frequently "logistic

  • regression" is used to refer specifically to the problem in which the dependent variable

  • is binary, that is, the number of available categories is two, while problems with more

  • than two categories are referred to as multinomial logistic regression or, if the multiple categories

  • are ordered, as ordered logistic regression. Logistic regression measures the relationship

  • between a categorical dependent variable and one or more independent variables, which are

  • usually continuous, by using probability scores as the predicted values of the dependent variable.

  • As such it treats the same set of problems as does probit regression using similar techniques.

  • Fields and examples of applications Logistic regression was put forth in the 1940s

  • as an alternative to Fisher's 1936 classification method, linear discriminant analysis. It is

  • used extensively in numerous disciplines, including the medical and social science fields.

  • For example, the Trauma and Injury Severity Score, which is widely used to predict mortality

  • in injured patients, was originally developed by Boyd et al. using logistic regression.

  • Logistic regression might be used to predict whether a patient has a given disease, based

  • on observed characteristics of the patient. Another example might be to predict whether

  • an American voter will vote Democratic or Republican, based on age, income, gender,

  • race, state of residence, votes in previous elections, etc. The technique can also be

  • used in engineering, especially for predicting the probability of failure of a given process,

  • system or product. It is also used in marketing applications such as prediction of a customer's

  • propensity to purchase a product or cease a subscription, etc. In economics it can be

  • used to predict the likelihood of a person's choosing to be in the labor force, and a business

  • application would be to predict the likelihood of a homeowner defaulting on a mortgage. Conditional

  • random fields, an extension of logistic regression to sequential data, are used in natural language

  • processing. Basics

  • Logistic regression can be binomial or multinomial. Binomial or binary logistic regression deals

  • with situations in which the observed outcome for a dependent variable can have only two

  • possible types. Multinomial logistic regression deals with situations where the outcome can

  • have three or more possible types. In binary logistic regression, the outcome is usually

  • coded as "0" or "1", as this leads to the most straightforward interpretation. If a

  • particular observed outcome for the dependent variable is the noteworthy possible outcome

  • it is usually coded as "1" and the contrary outcome as "0". Logistic regression is used

  • to predict the odds of being a case based on the values of the independent variables.

  • The odds are defined as the probability that a particular outcome is a case divided by

  • the probability that it is a noncase. Like other forms of regression analysis, logistic

  • regression makes use of one or more predictor variables that may be either continuous or

  • categorical data. Unlike ordinary linear regression, however, logistic regression is used for predicting

  • binary outcomes of the dependent variable rather than continuous outcomes. Given this

  • difference, it is necessary that logistic regression take the natural logarithm of the

  • odds of the dependent variable being a case to create a continuous criterion as a transformed

  • version of the dependent variable. Thus the logit transformation is referred to as the

  • link function in logistic regression: although the dependent variable in logistic regression

  • is binomial, the logit is the continuous criterion upon which linear regression is conducted.

  • The logit of success is then fit to the predictors using linear regression analysis. The predicted

  • value of the logit is converted back into predicted odds via the inverse of the natural

  • logarithm, namely the exponential function. Therefore, although the observed dependent

  • variable in logistic regression is a zero-or-one variable, the logistic regression estimates

  • the odds, as a continuous variable, that the dependent variable is a success. In some applications

  • the odds are all that is needed. In others, a specific yes-or-no prediction is needed

  • for whether the dependent variable is or is not a case; this categorical prediction can

  • be based on the computed odds of a success, with predicted odds above some chosen cut-off

  • value being translated into a prediction of a success.

  • Logistic function, odds ratio, and logit

  • An explanation of logistic regression begins with an explanation of the logistic function,

  • which always takes on values between zero and one:

  • F(t) = 1 / (1 + e^(−t)),

  • and viewing t as a linear function t = β0 + β1·x of an explanatory variable x, the logistic function can be written

  • as:

  • F(x) = 1 / (1 + e^(−(β0 + β1·x))).

  • This will be interpreted as the probability of the dependent variable equalling a "success"

  • or "case" rather than a failure or non-case. We also define the inverse of the logistic

  • function, the logit:

  • g(F(x)) = ln( F(x) / (1 − F(x)) ) = β0 + β1·x,

  • and equivalently, the odds:

  • F(x) / (1 − F(x)) = e^(β0 + β1·x).

  • A graph of the logistic function is shown in Figure 1. The input is t and

  • the output is F(t). The logistic function is useful because it can take an input with any value

  • from negative infinity to positive infinity, whereas the output is confined to values between

  • 0 and 1 and hence is interpretable as a probability. In the above equations, g(·) refers to the logit

  • function of some given linear combination of the predictors, ln denotes the natural logarithm,

  • F(x) is the probability that the dependent variable equals a case, β0 is the intercept from the linear

  • regression equation, β1·x is the regression coefficient β1 multiplied by some value x of the predictor,

  • and base e denotes the exponential function. The formula for F(x) illustrates that the probability

  • of the dependent variable equaling a case is equal to the value of the logistic function

  • of the linear regression expression. This is important in that it shows that the value

  • of the linear regression expression can vary from negative to positive infinity and yet,

  • after transformation, the resulting expression for the probability ranges between 0 and 1.

  • The equation for g(F(x)) illustrates that the logit is equivalent to the linear regression expression.

  • Likewise, the next equation illustrates that the odds of the dependent variable equaling

  • a case is equivalent to the exponential function of the linear regression expression. This

  • illustrates how the logit serves as a link function between the probability and the linear

  • regression expression. Given that the logit ranges between negative infinity and positive

  • infinity, it provides an adequate criterion upon which to conduct linear regression and

  • the logit is easily converted back into the odds.
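
As a concrete illustration of these relationships, the short Python sketch below (with made-up values for β0, β1 and x) computes the logistic function, the odds, and the logit, and checks that the logit recovers the linear expression.

```python
import math

def logistic(t):
    """Logistic function: maps any real t to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-t))

def logit(p):
    """Logit (inverse of the logistic function): maps (0, 1) back to the real line."""
    return math.log(p / (1.0 - p))

# Illustrative (made-up) intercept and slope for a single predictor x.
beta0, beta1 = -3.0, 0.5
x = 4.0

linear_part = beta0 + beta1 * x          # can range over the whole real line
p = logistic(linear_part)                # probability of a "case", always in (0, 1)
odds = p / (1.0 - p)                     # equals exp(beta0 + beta1 * x)

print(p, odds, logit(p))                 # logit(p) recovers the linear part
assert abs(logit(p) - linear_part) < 1e-9
assert abs(odds - math.exp(linear_part)) < 1e-9
```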

  • Multiple explanatory variables If there are multiple explanatory variables,

  • then the above expression β0 + β1·x can be revised to β0 + β1·x1 + β2·x2 + ⋯ + βm·xm. Then when this is used in the equation relating

  • the logged odds of a success to the values of the predictors, the linear regression will

  • be a multiple regression with m explanators; the parameters βj for all j = 0, 1, 2, ..., m

  • are all estimated. Model fitting

  • Estimation Maximum likelihood estimation

  • The regression coefficients are usually estimated using maximum likelihood estimation. Unlike

  • linear regression with normally distributed residuals, it is not possible to find a closed-form

  • expression for the coefficient values that maximize the likelihood function, so an iterative

  • process must be used instead, for example Newton's method. This process begins with

  • a tentative solution, revises it slightly to see if it can be improved, and repeats

  • this revision until improvement is minute, at which point the process is said to have

  • converged. In some instances the model may not reach

  • convergence. When a model does not converge this indicates that the coefficients are not

  • meaningful because the iterative process was unable to find appropriate solutions. A failure

  • to converge may occur for a number of reasons: having a large proportion of predictors to

  • cases, multicollinearity, sparseness, or complete separation.
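
Before those failure modes are taken up in turn, here is a minimal Python sketch of the iterative estimation just described, using a Newton-Raphson update on synthetic data; it illustrates the idea and is not a production routine.

```python
import numpy as np

def fit_logistic_newton(X, y, n_iter=25, tol=1e-8):
    """Newton-Raphson (equivalently, iteratively reweighted least squares) for the
    logistic regression log-likelihood. X should contain a leading column of ones
    for the intercept."""
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-X @ beta))      # current fitted probabilities
        w = p * (1.0 - p)                        # weights from the Bernoulli variance
        grad = X.T @ (y - p)                     # gradient of the log-likelihood
        info = X.T @ (X * w[:, None])            # observed information X' W X (negative Hessian)
        step = np.linalg.solve(info, grad)
        beta = beta + step
        if np.max(np.abs(step)) < tol:           # improvement is minute: converged
            break
    return beta

# Tiny synthetic example (made up for illustration).
rng = np.random.default_rng(0)
x = rng.normal(size=200)
X = np.column_stack([np.ones_like(x), x])
true_beta = np.array([-0.5, 1.2])
y = rng.binomial(1, 1.0 / (1.0 + np.exp(-X @ true_beta)))
print(fit_logistic_newton(X, y))                 # roughly recovers true_beta
```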

  • Having a large proportion of variables to cases results in an overly conservative Wald

  • statistic and can lead to nonconvergence. Multicollinearity refers to unacceptably high

  • correlations between predictors. As multicollinearity increases, coefficients remain unbiased but

  • standard errors increase and the likelihood of model convergence decreases. To detect

  • multicollinearity amongst the predictors, one can conduct a linear regression analysis

  • with the predictors of interest for the sole purpose of examining the tolerance statistic

  • used to assess whether multicollinearity is unacceptably high.

  • Sparseness in the data refers to having a large proportion of empty cells. Zero cell

  • counts are particularly problematic with categorical predictors. With continuous predictors, the

  • model can infer values for the zero cell counts, but this is not the case with categorical

  • predictors. The reason the model will not converge with zero cell counts for categorical

  • predictors is because the natural logarithm of zero is an undefined value, so final solutions

  • to the model cannot be reached. To remedy this problem, researchers may collapse categories

  • in a theoretically meaningful way or may consider adding a constant to all cells.

  • Another numerical problem that may lead to a lack of convergence is complete separation,

  • which refers to the instance in which the predictors perfectly predict the criterion:

  • all cases are accurately classified. In such instances, one should reexamine the data,

  • as there is likely some kind of error. Although not a precise number, as a general

  • rule of thumb, logistic regression models require a minimum of 10 events per explanatory

  • variable. Minimum chi-squared estimator for grouped

  • data While individual data will have a dependent

  • variable with a value of zero or one for every observation, with grouped data one observation

  • is on a group of people who all share the same characteristics; in this case the researcher

  • observes the proportion of people in the group for whom the response variable falls into

  • one category or the other. If this proportion is neither zero nor one for any group, the

  • minimum chi-squared estimator involves using weighted least squares to estimate a linear

  • model in which the dependent variable is the logit of the proportion: that is, the log

  • of the ratio of the fraction in one group to the fraction in the other group.
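
A minimal sketch of this grouped-data estimator, assuming hypothetical dose-response groups and using weighted least squares on the empirical logits:

```python
import numpy as np

def min_chi_square_logit(X, n, successes):
    """Weighted least squares on empirical logits for grouped binary data
    (minimum chi-squared estimator). Each row of X describes one group;
    n is the group size and successes the number of 1-outcomes in it.
    Assumes no group proportion is exactly 0 or 1."""
    p_hat = successes / n
    z = np.log(p_hat / (1.0 - p_hat))           # logit of the observed proportion
    w = n * p_hat * (1.0 - p_hat)               # inverse of the approximate variance of z
    WX = X * w[:, None]
    return np.linalg.solve(X.T @ WX, X.T @ (w * z))

# Hypothetical grouped data: dose level per group, group size, number responding.
dose = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
X = np.column_stack([np.ones_like(dose), dose])
n = np.array([50, 50, 50, 50, 50])
successes = np.array([4, 9, 21, 32, 44])
print(min_chi_square_logit(X, n, successes))
```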

  • Evaluating goodness of fit Goodness of fit in linear regression models

  • is generally measured using the R2. Since this has no direct analog in logistic regression,

  • various methods including the following can be used instead.

  • Deviance and likelihood ratio tests In linear regression analysis, one is concerned

  • with partitioning variance via the sum of squares calculations: variance in the criterion

  • is essentially divided into variance accounted for by the predictors and residual variance.

  • In logistic regression analysis, deviance is used in lieu of sum of squares calculations.

  • Deviance is analogous to the sum of squares calculations in linear regression and is a

  • measure of the lack of fit to the data in a logistic regression model. Deviance is calculated

  • by comparing a given model with the saturated model – a model with a theoretically perfect

  • fit. This computation is called the likelihood-ratio test:

  • D = −2 ln( likelihood of the fitted model / likelihood of the saturated model ).

  • In the above equation D represents the deviance and ln represents the natural logarithm. The

  • log of the likelihood ratio will produce a negative value, so it is multiplied

  • by negative two to produce a value with an approximate chi-squared

  • distribution. Smaller values indicate better fit as the fitted model deviates less from

  • the saturated model. When assessed upon a chi-square distribution, nonsignificant chi-square

  • values indicate very little unexplained variance and thus, good model fit. Conversely, a significant

  • chi-square value indicates that a significant amount of the variance is unexplained.

  • Two measures of deviance are particularly important in logistic regression: null deviance

  • and model deviance. The null deviance represents the difference between a model with only the

  • intercept and the saturated model. And, the model deviance represents the difference between

  • a model with at least one predictor and the saturated model. In this respect, the null

  • model provides a baseline upon which to compare predictor models. Given that deviance is a

  • measure of the difference between a given model and the saturated model, smaller values

  • indicate better fit. Therefore, to assess the contribution of a predictor or set of

  • predictors, one can subtract the model deviance from the null deviance and assess the difference

  • on a chi-square distribution with degree of freedom equal to the difference in the number

  • of parameters estimated. Let D_null and D_model denote the null and model deviances.

  • Then their difference, D_null − D_model, is assessed on that chi-square distribution.

  • If the model deviance is significantly smaller than the null deviance then one can conclude

  • that the predictor or set of predictors significantly improved model fit. This is analogous to the

  • F-test used in linear regression analysis to assess the significance of prediction.
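
A sketch of this comparison in Python, assuming ungrouped 0/1 data (so the saturated model's log-likelihood is zero) and that fitted probabilities are available from some estimation routine:

```python
import numpy as np
from scipy import stats

def log_likelihood(y, p):
    """Bernoulli log-likelihood of outcomes y under fitted probabilities p."""
    return np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))

def deviance_test(y, p_fitted, n_extra_params):
    """Likelihood-ratio (deviance) comparison of a fitted model against the
    intercept-only null model. For ungrouped 0/1 data the saturated model has
    log-likelihood 0, so each deviance is just -2 times a log-likelihood."""
    p_null = np.full_like(p_fitted, y.mean())    # intercept-only fitted probability
    null_dev = -2.0 * log_likelihood(y, p_null)
    model_dev = -2.0 * log_likelihood(y, p_fitted)
    lr_stat = null_dev - model_dev               # chi-squared, df = added parameters
    p_value = stats.chi2.sf(lr_stat, df=n_extra_params)
    return null_dev, model_dev, lr_stat, p_value
```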

  • Pseudo-R2s In linear regression the squared multiple

  • correlation, R2 is used to assess goodness of fit as it represents the proportion of

  • variance in the criterion that is explained by the predictors. In logistic regression

  • analysis, there is no agreed upon analogous measure, but there are several competing measures

  • each with limitations. Three of the most commonly used indices are examined on this page beginning

  • with the likelihood ratio R2, R2L:

  • R2L = (D_null − D_model) / D_null.

  • This is the most analogous index to the squared multiple correlation in linear regression.

  • It represents the proportional reduction in the deviance wherein the deviance is treated

  • as a measure of variation analogous but not identical to the variance in linear regression

  • analysis. One limitation of the likelihood ratio R2 is that it is not monotonically related

  • to the odds ratio, meaning that it does not necessarily increase as the odds ratio increases

  • and does not necessarily decrease as the odds ratio decreases.

  • The Cox and Snell R2 is an alternative index of goodness of fit related to the R2 value

  • from linear regression. The Cox and Snell index is problematic as its maximum value

  • is .75, when the variance is at its maximum. The Nagelkerke R2 provides a correction to

  • the Cox and Snell R2 so that the maximum value is equal to one. Nevertheless, the Cox and

  • Snell and likelihood ratio R2s show greater agreement with each other than either does

  • with the Nagelkerke R2. Of course, this might not be the case for values exceeding .75 as

  • the Cox and Snell index is capped at this value. The likelihood ratio R2 is often preferred

  • to the alternatives as it is most analogous to R2 in linear regression, is independent

  • of the base rate and varies between 0 and 1.

  • A word of caution is in order when interpreting pseudo-R2 statistics. The reason these indices

  • of fit are referred to as pseudo R2 is because they do not represent the proportionate reduction

  • in error as the R2 in linear regression does. Linear regression assumes homoscedasticity,

  • that the error variance is the same for all values of the criterion. Logistic regression

  • will always be heteroscedastic: the error variances differ for each value of the predicted

  • score. For each value of the predicted score there would be a different value of the proportionate

  • reduction in error. Therefore, it is inappropriate to think of R2 as a proportionate reduction

  • in error in a universal sense in logistic regression.
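
The three indices just discussed can be computed directly from the null and fitted log-likelihoods; a minimal sketch, assuming ungrouped binary data:

```python
import numpy as np

def pseudo_r2(ll_null, ll_model, n_obs):
    """Three common pseudo-R2 indices from the null and fitted log-likelihoods.
    For ungrouped binary data the deviances are -2 times these log-likelihoods."""
    r2_likelihood = 1.0 - ll_model / ll_null          # (D_null - D_model) / D_null
    r2_cox_snell = 1.0 - np.exp(2.0 * (ll_null - ll_model) / n_obs)
    r2_nagelkerke = r2_cox_snell / (1.0 - np.exp(2.0 * ll_null / n_obs))
    return r2_likelihood, r2_cox_snell, r2_nagelkerke

# Hypothetical log-likelihood values for illustration.
print(pseudo_r2(ll_null=-120.0, ll_model=-85.0, n_obs=200))
```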

  • Hosmer–Lemeshow test The Hosmer–Lemeshow test uses a test statistic

  • that asymptotically follows a chi-squared distribution to assess whether or not the observed event

  • rates match expected event rates in subgroups of the model population.
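
One common way the Hosmer–Lemeshow statistic is computed is sketched below; grouping conventions vary across implementations, so the details here (ten groups, g − 2 degrees of freedom) are one reasonable choice rather than the definitive recipe.

```python
import numpy as np
from scipy import stats

def hosmer_lemeshow(y, p, n_groups=10):
    """Sort observations by fitted probability, split them into roughly equal-sized
    groups, and compare observed with expected event counts in each group."""
    order = np.argsort(p)
    groups = np.array_split(order, n_groups)
    stat = 0.0
    for g in groups:
        observed = y[g].sum()
        expected = p[g].sum()
        n_g = len(g)
        pbar = expected / n_g
        stat += (observed - expected) ** 2 / (n_g * pbar * (1.0 - pbar))
    p_value = stats.chi2.sf(stat, df=n_groups - 2)   # usual reference distribution
    return stat, p_value
```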

  • Evaluating binary classification performance If the estimated probabilities are to be used

  • to classify each observation of independent variable values as predicting the category

  • that the dependent variable is found in, the various methods below for judging the model's

  • suitability in out-of-sample forecasting can also be used on the data that were used for

  • estimation: accuracy, precision, recall, specificity and negative predictive value.

  • In each of these evaluative methods, an aspect of the model's effectiveness in assigning

  • instances to the correct categories is measured. Coefficients

  • After fitting the model, it is likely that researchers will want to examine the contribution

  • of individual predictors. To do so, they will want to examine the regression coefficients.

  • In linear regression, the regression coefficients represent the change in the criterion for

  • each unit change in the predictor. In logistic regression, however, the regression coefficients

  • represent the change in the logit for each unit change in the predictor. Given that the

  • logit is not intuitive, researchers are likely to focus on a predictor's effect on the exponential

  • function of the regression coefficient, the odds ratio. In linear regression, the significance

  • of a regression coefficient is assessed by computing a t-test. In logistic regression,

  • there are several different tests designed to assess the significance of an individual

  • predictor, most notably the likelihood ratio test and the Wald statistic.

  • Likelihood ratio test The likelihood-ratio test discussed above

  • to assess model fit is also the recommended procedure to assess the contribution of individual

  • "predictors" to a given model. In the case of a single predictor model, one simply compares

  • the deviance of the predictor model with that of the null model on a chi-square distribution

  • with a single degree of freedom. If the predictor model has a significantly smaller deviance,

  • then one can conclude that there is a significant association between the "predictor" and the

  • outcome. Although some common statistical packages do provide likelihood ratio test

  • statistics, without this computationally intensive test it would be more difficult to assess

  • the contribution of individual predictors in the multiple logistic regression case.

  • To assess the contribution of individual predictors one can enter the predictors hierarchically,

  • comparing each new model with the previous to determine the contribution of each predictor.

  • (There is considerable debate among statisticians regarding the appropriateness of so-called

  • "stepwise" procedures. They do not preserve the nominal statistical properties and can

  • be very misleading.[1]) Wald statistic

  • Alternatively, when assessing the contribution of individual predictors in a given model,

  • one may examine the significance of the Wald statistic. The Wald statistic, analogous to

  • the t-test in linear regression, is used to assess the significance of coefficients. The

  • Wald statistic is the ratio of the square of the regression coefficient to the square

  • of the standard error of the coefficient and is asymptotically distributed as a chi-square

  • distribution.
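
A minimal sketch of the Wald test for one coefficient; the estimate and its standard error are assumed to come from a fitted model (standard errors are typically the square roots of the diagonal of the inverse of the negative Hessian at the maximum likelihood estimate).

```python
from scipy import stats

def wald_test(beta_hat, se_beta):
    """Wald statistic for a single coefficient: the squared ratio of the estimate
    to its standard error, referred to a chi-squared distribution with 1 df."""
    w = (beta_hat / se_beta) ** 2
    return w, stats.chi2.sf(w, df=1)

# Hypothetical estimate and standard error, e.g. read off a fitted model's output.
print(wald_test(0.85, 0.32))
```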

  • Although several statistical packages report the Wald statistic to assess the contribution

  • of individual predictors, the Wald statistic has limitations. When the regression coefficient

  • is large, the standard error of the regression coefficient also tends to be large, increasing

  • the probability of Type-II error. The Wald statistic also tends to be biased when data

  • are sparse. Case-control sampling

  • Suppose cases are rare. Then we might wish to sample them more frequently than their

  • prevalence in the population. For example, suppose there is a disease that affects 1

  • person in 10,000 and to collect our data we need to do a complete physical. It may be

  • too expensive to do thousands of physicals of healthy people in order to get data on

  • only a few diseased individuals. Thus, we may evaluate more diseased individuals. This

  • is also called unbalanced data. As a rule of thumb, sampling controls at a rate of five

  • times the number of cases is sufficient to get enough control data.

  • If we form a logistic model from such data, then provided the model is correct, the parameters are

  • all correct except for the intercept β0. We can correct β0 if we know the true prevalence as follows:

  • β0,corrected = β0 + ln(π / (1 − π)) − ln(p / (1 − p)),

  • where π is the true prevalence and p is the prevalence in the sample.
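
A sketch of that intercept correction, with hypothetical numbers for the fitted intercept and the two prevalences:

```python
import math

def corrected_intercept(beta0_hat, true_prevalence, sample_prevalence):
    """Adjust the intercept fitted on case-control (unbalanced) data so that
    predicted probabilities reflect the true population prevalence; the other
    coefficients are left unchanged."""
    logit = lambda p: math.log(p / (1.0 - p))
    return beta0_hat + logit(true_prevalence) - logit(sample_prevalence)

# Hypothetical numbers: disease prevalence 1/10,000, but half the sample are cases.
print(corrected_intercept(beta0_hat=-0.2, true_prevalence=1e-4, sample_prevalence=0.5))
```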

  • Formal mathematical specification There are various equivalent specifications

  • of logistic regression, which fit into different types of more general models. These different

  • specifications allow for different sorts of useful generalizations.

  • Setup The basic setup of logistic regression is

  • the same as for standard linear regression. It is assumed that we have a series of N observed

  • data points. Each data point i consists of a set of m explanatory variables x1,i ... xm,i,

  • and an associated binary-valued outcome variable Yi, i.e. it can assume only the two possible

  • values 0 or 1. The goal of logistic regression is to explain the relationship between the

  • explanatory variables and the outcome, so that an outcome can be predicted for a new

  • set of explanatory variables. Some examples:

  • The observed outcomes are the presence or absence of a given disease in a set of patients,

  • and the explanatory variables might be characteristics of the patients thought to be pertinent.

  • The observed outcomes are the votes of a set of people in an election, and the explanatory

  • variables are the demographic characteristics of each person. In such a case, one of the

  • two outcomes is arbitrarily coded as 1, and the other as 0.

  • As in linear regression, the outcome variables Yi are assumed to depend on the explanatory

  • variables x1,i ... xm,i. Explanatory variables

  • As shown in the above examples, the explanatory variables may be of any type:

  • real-valued, binary, categorical, etc. The main distinction is between continuous variables

  • and discrete variables. Discrete variables referring to more than two possible choices

  • are typically coded using dummy variables, that is, separate explanatory variables taking

  • the value 0 or 1 are created for each possible value of the discrete variable, with a 1 meaning

  • "variable does have the given value" and a 0 meaning "variable does not have that value".

  • For example, a four-way discrete variable of blood type with the possible values "A,

  • B, AB, O" can be converted to four separate two-way dummy variables, "is-A, is-B, is-AB,

  • is-O", where only one of them has the value 1 and all the rest have the value 0. This

  • allows for separate regression coefficients to be matched for each possible value of the

  • discrete variable. Outcome variables

  • Formally, the outcomes Yi are described as being Bernoulli-distributed data, where each

  • outcome is determined by an unobserved probability pi that is specific to the outcome at hand,

  • but related to the explanatory variables. This can be expressed in any of the following

  • equivalent forms:

  • Yi | x1,i, ..., xm,i ~ Bernoulli(pi)

  • E[Yi | x1,i, ..., xm,i] = pi

  • Pr(Yi = y | x1,i, ..., xm,i) = pi if y = 1, and 1 − pi if y = 0

  • Pr(Yi = y | x1,i, ..., xm,i) = pi^y (1 − pi)^(1 − y)

  • The meanings of these four lines are: The first line expresses the probability distribution

  • of each Yi: Conditioned on the explanatory variables, it follows a Bernoulli distribution

  • with parameter pi, the probability of the outcome of 1 for trial i. As noted above,

  • each separate trial has its own probability of success, just as each trial has its own

  • explanatory variables. The probability of success pi is not observed, only the outcome

  • of an individual Bernoulli trial using that probability.

  • The second line expresses the fact that the expected value of each Yi is equal to the

  • probability of success pi, which is a general property of the Bernoulli distribution. In

  • other words, if we run a large number of Bernoulli trials using the same probability of success

  • pi, then take the average of all the 1 and 0 outcomes, then the result would be close

  • to pi. This is because doing an average this way simply computes the proportion of successes

  • seen, which we expect to converge to the underlying probability of success.

  • The third line writes out the probability mass function of the Bernoulli distribution,

  • specifying the probability of seeing each of the two possible outcomes.

  • The fourth line is another way of writing the probability mass function, which avoids

  • having to write separate cases and is more convenient for certain types of calculations.

  • This relies on the fact that Yi can take only the value 0 or 1. In each case, one of the

  • exponents will be 1, "choosing" the value under it, while the other is 0, "canceling

  • out" the value under it. Hence, the outcome is either pi or 1 − pi, as in the previous

  • line. Linear predictor function

  • The basic idea of logistic regression is to use the mechanism already developed for linear

  • regression by modeling the probability pi using a linear predictor function, i.e. a

  • linear combination of the explanatory variables and a set of regression coefficients that

  • are specific to the model at hand but the same for all trials. The linear predictor

  • function for a particular data point i is written as:

  • f(i) = β0 + β1·x1,i + ⋯ + βm·xm,i,

  • where β0, β1, ..., βm are regression coefficients indicating the relative effect of a particular explanatory

  • variable on the outcome. The model is usually put into a more compact

  • form as follows: The regression coefficients β0, β1, ..., βm

  • are grouped into a single vector β of size m + 1.

  • For each data point i, an additional explanatory pseudo-variable x0,i is added, with a fixed

  • value of 1, corresponding to the intercept coefficient β0.

  • The resulting explanatory variables x0,i, x1,i, ..., xm,i are then grouped into a single

  • vector Xi of size m + 1. This makes it possible to write the linear

  • predictor function as follows:

  • f(i) = β · Xi,

  • using the notation for a dot product between two vectors.

  • As a generalized linear model The particular model used by logistic regression,

  • which distinguishes it from standard linear regression and from other types of regression

  • analysis used for binary-valued outcomes, is the way the probability of a particular

  • outcome is linked to the linear predictor function:

  • logit(E[Yi | x1,i, ..., xm,i]) = logit(pi) = ln( pi / (1 − pi) ) = β0 + β1·x1,i + ⋯ + βm·xm,i.

  • Written using the more compact notation described above, this is:

  • logit(E[Yi | Xi]) = β · Xi.

  • This formulation expresses logistic regression as a type of generalized linear model, which

  • predicts variables with various types of probability distributions by fitting a linear predictor

  • function of the above form to some sort of arbitrary transformation of the expected value

  • of the variable. The intuition for transforming using the logit

  • function was explained above. It also has the practical effect of converting the probability

  • to a variable that ranges over (−∞, +∞), thereby matching the potential range of the linear

  • prediction function on the right side of the equation.
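
A minimal sketch of this generalized-linear-model view for a single data point, with made-up coefficients: the linear predictor β · Xi is formed (with the pseudo-variable x0 = 1 for the intercept) and passed through the inverse logit link.

```python
import numpy as np

def predicted_probability(beta, x):
    """A linear predictor beta . X_i passed through the inverse logit link.
    x holds one data point's explanatory variables; a leading 1 is prepended
    for the intercept term."""
    X_i = np.concatenate(([1.0], x))          # pseudo-variable x_0 = 1
    eta = beta @ X_i                          # linear predictor, any real value
    return 1.0 / (1.0 + np.exp(-eta))         # probability p_i in (0, 1)

# Illustrative coefficients (intercept first) and one observation with m = 3 predictors.
beta = np.array([-1.0, 0.8, -0.3, 2.0])
print(predicted_probability(beta, np.array([1.5, 0.0, 0.2])))
```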

  • Note that both the probabilities pi and the regression coefficients are unobserved, and

  • the means of determining them is not part of the model itself. They are typically determined

  • by some sort of optimization procedure, e.g. maximum likelihood estimation, that finds

  • values that best fit the observed data, usually subject to regularization conditions that

  • seek to exclude unlikely values, e.g. extremely large values for any of the regression coefficients.

  • The use of a regularization condition is equivalent to doing maximum a posteriori estimation,

  • an extension of maximum likelihood. Whether or not regularization is used, it is usually

  • not possible to find a closed-form solution; instead, an iterative numerical method must

  • be used, such as iteratively reweighted least squares or, more commonly these days, a quasi-Newton

  • method such as the L-BFGS method. The interpretation of the βj parameter estimates

  • is as the additive effect on the log of the odds for a unit change in the jth explanatory

  • variable. In the case of a dichotomous explanatory variable, for instance gender, e^β is the estimate

  • of the odds of having the outcome for, say, males compared with females.
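
A small illustration with made-up fitted coefficients, turning them into odds ratios via the exponential function:

```python
import numpy as np

# Hypothetical fitted coefficients: intercept, a continuous predictor, and a
# dichotomous predictor (e.g. gender coded 0/1).
beta = np.array([-2.1, 0.04, 0.69])

odds_ratios = np.exp(beta[1:])
# exp(0.04) ~ 1.04: each unit increase in the continuous predictor multiplies
# the odds of the outcome by about 1.04.
# exp(0.69) ~ 2.0: the odds for the group coded 1 are about twice the odds for
# the group coded 0, holding the other predictor fixed.
print(odds_ratios)
```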

  • An equivalent formula uses the inverse of the logit function, which is the logistic

  • function, i.e.:

  • E[Yi | Xi] = pi = 1 / (1 + e^(−β · Xi)).

  • The formula can also be written as a probability distribution:

  • Pr(Yi = y | Xi) = pi^y (1 − pi)^(1 − y).

  • As a latent-variable model The above model has an equivalent formulation

  • as a latent-variable model. This formulation is common in the theory of discrete choice

  • models, and makes it easier to extend to certain more complicated models with multiple, correlated

  • choices, as well as to compare logistic regression to the closely related probit model.

  • Imagine that, for each trial i, there is a continuous latent variable Yi* that is distributed

  • as follows:

  • Yi* = β · Xi + ε, where ε ~ Logistic(0, 1),

  • i.e. the latent variable can be written directly in terms of the linear predictor function

  • and an additive random error variable that is distributed according to a standard logistic

  • distribution. Then Yi can be viewed as an indicator for

  • whether this latent variable is positive:

  • Yi = 1 if Yi* > 0, and Yi = 0 otherwise.
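
A Monte Carlo sketch of this latent-variable view, assuming a made-up value for the linear predictor: simulating Yi* = β · Xi + ε with a standard logistic error and thresholding at zero reproduces the logistic probability.

```python
import numpy as np

rng = np.random.default_rng(1)

eta = 0.7                                    # one data point's linear predictor (made up)

# Draw the latent variable Y* = eta + epsilon with a standard logistic error.
eps = rng.logistic(loc=0.0, scale=1.0, size=1_000_000)
y = (eta + eps > 0).astype(int)              # observed outcome: indicator of Y* > 0

print(y.mean())                              # Monte Carlo estimate of Pr(Y = 1)
print(1.0 / (1.0 + np.exp(-eta)))            # logistic(eta): the two agree closely
```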

  • The choice of modeling the error variable specifically with a standard logistic distribution,

  • rather than a general logistic distribution with the location and scale set to arbitrary

  • values, seems restrictive, but in fact it is not. It must be kept in mind that we can

  • choose the regression coefficients ourselves, and very often can use them to offset changes

  • in the parameters of the error variable's distribution. For example, a logistic error-variable

  • distribution with a non-zero location parameter μ is equivalent to a distribution with a

  • zero location parameter, where μ has been added to the intercept coefficient. Both situations

  • produce the same value for Yi* regardless of settings of explanatory variables. Similarly,

  • an arbitrary scale parameter s is equivalent to setting the scale parameter to 1 and then

  • dividing all regression coefficients by s. In the latter case, the resulting value of

  • Yi* will be smaller by a factor of s than in the former case, for all sets of explanatory

  • variables, but critically, it will always remain on the same side of 0, and hence lead

  • to the same Yi choice. (Note that this predicts that the irrelevancy

  • of the scale parameter may not carry over into more complex models where more than two

  • choices are available.) It turns out that this formulation is exactly

  • equivalent to the preceding one, phrased in terms of the generalized linear model and

  • without any latent variables. This can be shown as follows, using the fact that the

  • cumulative distribution function of the standard logistic distribution is the logistic function,

  • which is the inverse of the logit function, i.e.

  • Pr(ε < x) = logistic(x).

  • Then:

  • Pr(Yi = 1 | Xi) = Pr(Yi* > 0) = Pr(β · Xi + ε > 0) = Pr(ε > −β · Xi) = Pr(ε < β · Xi) = logistic(β · Xi) = pi.

  • This formulation, which is standard in discrete choice models, makes clear the

  • relationship between logistic regression and the probit model, which uses an error variable

  • distributed according to a standard normal distribution instead of a standard logistic

  • distribution. Both the logistic and normal distributions are symmetric with a basic unimodal,

  • "bell curve" shape. The only difference is that the logistic distribution has somewhat

  • heavier tails, which means that it is less sensitive to outlying data.

  • As a two-way latent-variable model Yet another formulation uses two separate

  • latent variables:

  • Yi0* = β0 · Xi + ε0 and Yi1* = β1 · Xi + ε1,

  • where ε0, ε1 ~ EV1(0,1), and EV1(0,1) is a standard type-1 extreme value distribution: i.e.

  • f(x) = e^(−x) e^(−e^(−x)).

  • Then Yi = 1 if Yi1* > Yi0*, and Yi = 0 otherwise.

  • This model has a separate latent variable and a separate set of regression coefficients

  • for each possible outcome of the dependent variable. The reason for this separation is

  • that it makes it easy to extend logistic regression to multi-outcome categorical variables, as

  • in the multinomial logit model. In such a model, it is natural to model each possible

  • outcome using a different set of regression coefficients. It is also possible to motivate

  • each of the separate latent variables as the theoretical utility associated with making

  • the associated choice, and thus motivate logistic regression in terms of utility theory. This

  • is the approach taken by economists when formulating discrete choice models, because it both provides

  • a theoretically strong foundation and facilitates intuitions about the model, which in turn

  • makes it easy to consider various sorts of extensions.

  • The choice of the type-1 extreme value distribution seems fairly arbitrary, but it makes the mathematics

  • work out, and it may be possible to justify its use through rational choice theory.

  • It turns out that this model is equivalent to the previous model, although this seems

  • non-obvious, since there are now two sets of regression coefficients and error variables,

  • and the error variables have a different distribution. In fact, this model reduces directly to the

  • previous one with the following substitutions:

  • β = β1 − β0 and ε = ε1 − ε0.

  • An intuition for this comes from the fact that, since we choose based on the maximum

  • of two values, only their difference matters, not the exact values, and this effectively

  • removes one degree of freedom. Another critical fact is that the difference of two type-1

  • extreme-value-distributed variables is a logistic distribution, i.e. if ε0, ε1 ~ EV1(0,1),

  • then ε1 − ε0 ~ Logistic(0,1). We can demonstrate the equivalence as follows:
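
As a numerical check of the key fact used in this demonstration, the sketch below (NumPy only) draws two independent standard type-1 extreme value (Gumbel) variables and compares the empirical distribution of their difference with the standard logistic CDF.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 1_000_000

# Two independent standard type-1 extreme value (Gumbel) draws per trial.
e1 = rng.gumbel(loc=0.0, scale=1.0, size=n)
e0 = rng.gumbel(loc=0.0, scale=1.0, size=n)
diff = e1 - e0

# Compare the empirical CDF of the difference with the standard logistic CDF.
for x in (-2.0, -1.0, 0.0, 1.0, 2.0):
    empirical = np.mean(diff < x)
    logistic_cdf = 1.0 / (1.0 + np.exp(-x))
    print(f"x={x:+.1f}  empirical={empirical:.4f}  logistic={logistic_cdf:.4f}")
```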

  • Example As an example, consider a province-level election

  • where the choice is between a right-of-center party, a left-of-center party, and a secessionist

  • party. We would then use three latent variables, one for each choice. Then, in accordance with

  • utility theory, we can then interpret the latent variables as expressing the utility

  • that results from making each of the choices. We can also interpret the regression coefficients

  • as indicating the strength that the associated factor has in contributing to the utility

  • or more correctly, the amount by which a unit change in an explanatory variable changes

  • the utility of a given choice. A voter might expect that the right-of-center party would

  • lower taxes, especially on rich people. This would give low-income people no benefit, i.e.

  • no change in utility; would cause moderate benefit for middle-income people; and would

  • cause significant benefits for high-income people. On the other hand, the left-of-center

  • party might be expected to raise taxes and offset it with increased welfare and other

  • assistance for the lower and middle classes. This would cause significant positive benefit

  • to low-income people, perhaps weak benefit to middle-income people, and significant negative

  • benefit to high-income people. Finally, the secessionist party would take no direct actions

  • on the economy, but simply secede. A low-income or middle-income voter might expect basically

  • no clear utility gain or loss from this, but a high-income voter might expect negative

  • utility, since he/she is likely to own companies, which will have a harder time doing business

  • in such an environment and probably lose money. These intuitions can be expressed as follows:

  • This clearly shows that separate sets of regression coefficients need

  • to exist for each choice. When phrased in terms of utility, this can be seen very easily.

  • Different choices have different effects on net utility; furthermore, the effects vary

  • in complex ways that depend on the characteristics of each individual, so there need to be separate

  • sets of coefficients for each characteristic, not simply a single extra per-choice characteristic.

  • Even though income is a continuous variable, its effect on utility is too complex for it

  • to be treated as a single variable. Either it needs to be directly split up into ranges,

  • or higher powers of income need to be added so that polynomial regression on income is

  • effectively done. As a "log-linear" model

  • Yet another formulation combines the two-way latent variable formulation above with the

  • original formulation higher up without latent variables, and in the process provides a link

  • to one of the standard formulations of the multinomial logit.

  • Here, instead of writing the logit of the probabilities pi as a linear predictor, we

  • separate the linear predictor into two, one for each of the two outcomes:

  • ln Pr(Yi = 0) = β0 · Xi − ln Z and ln Pr(Yi = 1) = β1 · Xi − ln Z.

  • Note that two separate sets of regression coefficients have been introduced, just as

  • in the two-way latent variable model, and the two equations appear in a form that writes

  • the logarithm of the associated probability as a linear predictor, with an extra term

  • at the end. This term, as it turns out, serves as the normalizing factor ensuring that the

  • result is a distribution. This can be seen by exponentiating both sides:

  • Pr(Yi = 0) = (1/Z) e^(β0 · Xi) and Pr(Yi = 1) = (1/Z) e^(β1 · Xi).

  • In this form it is clear that the purpose of Z is to ensure that the resulting distribution

  • over Yi is in fact a probability distribution, i.e. it sums to 1. This means that Z is simply

  • the sum of all un-normalized probabilities, and by dividing each probability by Z, the

  • probabilities become "normalized". That is:

  • Z = e^(β0 · Xi) + e^(β1 · Xi),

  • and the resulting equations are

  • Pr(Yi = 0) = e^(β0 · Xi) / (e^(β0 · Xi) + e^(β1 · Xi)) and Pr(Yi = 1) = e^(β1 · Xi) / (e^(β0 · Xi) + e^(β1 · Xi)).

  • Or generally:

  • Pr(Yi = c) = e^(βc · Xi) / Σ_k e^(βk · Xi), the softmax function of the linear predictors.

  • This shows clearly how to generalize this formulation to more than two outcomes, as

  • in multinomial logit. In order to prove that this is equivalent

  • to the previous model, note that the above model is overspecified, in that Pr(Yi = 0) and Pr(Yi = 1) cannot

  • be independently specified: rather, Pr(Yi = 0) + Pr(Yi = 1) = 1, so knowing one automatically determines the other. As

  • a result, the model is nonidentifiable, in that multiple combinations of β0 and β1

  • will produce the same probabilities for all possible explanatory variables. In fact, it

  • can be seen that adding any constant vector to both of them will produce the same probabilities.

  • As a result, we can simplify matters, and restore identifiability, by picking an arbitrary

  • value for one of the two vectors. We choose to set β0 = 0. Then,

  • e^(β0 · Xi) = e^0 = 1, and so

  • Pr(Yi = 1) = e^(β1 · Xi) / (1 + e^(β1 · Xi)) = 1 / (1 + e^(−β1 · Xi)),

  • which shows that this formulation is indeed equivalent to the previous formulation.
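
A tiny numerical illustration of this equivalence: with the class-0 coefficients pinned to zero, the normalized ("softmax") form for two outcomes collapses to the ordinary logistic form. The score value is made up.

```python
import numpy as np

def softmax_two_class(score0, score1):
    """'Log-linear' form: un-normalized scores for the two outcomes are
    exponentiated and divided by their sum Z."""
    z = np.exp(score0) + np.exp(score1)
    return np.exp(score0) / z, np.exp(score1) / z

eta = 1.3                                     # beta . X_i for class 1 (made-up value)
p0, p1 = softmax_two_class(0.0, eta)          # class-0 score fixed at 0
print(p1, 1.0 / (1.0 + np.exp(-eta)))         # identical values
```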

  • Note that most treatments of the multinomial logit model start out either by extending

  • the "log-linear" formulation presented here or the two-way latent variable formulation

  • presented above, since both clearly show the way that the model could be extended to multi-way

  • outcomes. In general, the presentation with latent variables is more common in econometrics

  • and political science, where discrete choice models and utility theory reign, while the

  • "log-linear" formulation here is more common in computer science, e.g. machine learning

  • and natural language processing. As a single-layer perceptron

  • The model has an equivalent formulation

  • pi = 1 / (1 + e^(−(β0 + β1·x1,i + ⋯ + βm·xm,i))).

  • This functional form is commonly called a single-layer perceptron or single-layer artificial

  • neural network. A single-layer neural network computes a continuous output instead of a

  • step function. The derivative of pi with respect to X = (x1, ..., xk) is computed from the

  • general form:

  • y = 1 / (1 + e^(−f(X))),

  • where f(X) is an analytic function in X. With this choice, the single-layer neural network

  • is identical to the logistic regression model. This function has a continuous derivative,

  • which allows it to be used in backpropagation. This function is also preferred because its

  • derivative is easily calculated:

  • dy/dX = y (1 − y) df/dX.
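
A quick check of this derivative, taking f(X) = w·x + b with made-up weights and comparing the analytic expression y(1 − y)·f′ against a finite-difference approximation:

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

# Take f(x) = w * x + b as a simple analytic function (made-up weights).
w, b = 2.0, 1.0
x = 0.3

y = sigmoid(w * x + b)
analytic = y * (1.0 - y) * w                  # dy/dx = y (1 - y) f'(x)

h = 1e-6
numeric = (sigmoid(w * (x + h) + b) - sigmoid(w * (x - h) + b)) / (2 * h)
print(analytic, numeric)                      # agree to several decimal places
```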

  • In terms of binomial data A closely related model assumes that each

  • i is associated not with a single Bernoulli trial but with ni independent identically

  • distributed trials, where the observation Yi is the number of successes observed, and

  • hence follows a binomial distribution:

  • Yi ~ Binomial(ni, pi).

  • An example of this distribution is the fraction of seeds that germinate after ni are planted.

  • In terms of expected values, this model is expressed as follows:

  • E[Yi / ni | Xi] = pi,

  • so that

  • logit(E[Yi / ni | Xi]) = logit(pi) = β · Xi.

  • Or equivalently:

  • Pr(Yi = y | Xi) = (ni choose y) pi^y (1 − pi)^(ni − y), with pi = 1 / (1 + e^(−β · Xi)).

  • This model can be fit using the same sorts of methods as the above more basic model.

  • Bayesian logistic regression

  • In a Bayesian statistics context, prior distributions are normally placed on the regression coefficients,

  • usually in the form of Gaussian distributions. Unfortunately, the Gaussian distribution is

  • not the conjugate prior of the likelihood function in logistic regression; in fact,

  • the likelihood function is not an exponential family and thus does not have a conjugate

  • prior at all. As a result, the posterior distribution is difficult to calculate, even using standard

  • simulation algorithms. There are various possibilities:

  • Don't do a proper Bayesian analysis, but simply compute a maximum a posteriori point estimate

  • of the parameters. This is common, for example, in "maximum entropy" classifiers in machine

  • learning (a sketch of this option follows this list). Use a more general approximation method such

  • as the Metropolis–Hastings algorithm. Draw a Markov chain Monte Carlo sample from

  • the exact posterior by using the Independent Metropolis–Hastings algorithm with heavy-tailed

  • multivariate candidate distribution found by matching the mode and curvature at the

  • mode of the normal approximation to the posterior and then using the Student’s t shape with

  • low degrees of freedom. This is shown to have excellent convergence properties.

  • Use a latent variable model and approximate the logistic distribution using a more tractable

  • distribution, e.g. a Student's t-distribution or a mixture of normal distributions.

  • Do probit regression instead of logistic regression. This is actually a special case of the previous

  • situation, using a normal distribution in place of a Student's t, mixture of normals,

  • etc. This will be less accurate but has the advantage that probit regression is extremely

  • common, and a ready-made Bayesian implementation may already be available.

  • Use the Laplace approximation of the posterior distribution. This approximates the posterior

  • with a Gaussian distribution. This is not a terribly good approximation, but it suffices

  • if all that is desired is an estimate of the posterior mean and variance. In such a case,

  • an approximation scheme such as variational Bayes can be used.
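
A minimal sketch of the first option in the list above, a maximum a posteriori point estimate under independent Gaussian priors (equivalent to L2-penalized maximum likelihood); the data and prior variance are made up for illustration.

```python
import numpy as np
from scipy.optimize import minimize

def map_logistic(X, y, prior_var=10.0):
    """Maximum a posteriori point estimate with independent Gaussian priors on
    the coefficients: equivalent to L2-penalized maximum likelihood."""
    def neg_log_posterior(beta):
        eta = X @ beta
        ll = np.sum(y * eta - np.logaddexp(0.0, eta))   # stable Bernoulli log-likelihood
        log_prior = -0.5 * np.sum(beta ** 2) / prior_var
        return -(ll + log_prior)
    result = minimize(neg_log_posterior, np.zeros(X.shape[1]), method="BFGS")
    return result.x

# Synthetic illustration.
rng = np.random.default_rng(3)
x = rng.normal(size=150)
X = np.column_stack([np.ones_like(x), x])
y = rng.binomial(1, 1.0 / (1.0 + np.exp(-(0.3 + 1.5 * x))))
print(map_logistic(X, y))
```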

  • Gibbs sampling with an approximating distribution As shown above, logistic regression is equivalent

  • to a latent variable model with an error variable distributed according to a standard logistic

  • distribution. The overall distribution of the latent variable is also a logistic distribution,

  • with the mean equal to β · Xi. This model considerably simplifies the application of techniques such

  • as Gibbs sampling. However, sampling the regression coefficients is still difficult, because of

  • the lack of conjugacy between the normal and logistic distributions. Changing the prior

  • distribution over the regression coefficients is of no help, because the logistic distribution

  • is not in the exponential family and thus has no conjugate prior.

  • One possibility is to use a more general Markov chain Monte Carlo technique, such as the Metropolis–Hastings

  • algorithm, which can sample arbitrary distributions. Another possibility, however, is to replace

  • the logistic distribution with a similar-shaped distribution that is easier to work with using

  • Gibbs sampling. In fact, the logistic and normal distributions have a similar shape,

  • and thus one possibility is simply to have normally distributed errors. Because the normal

  • distribution is conjugate to itself, sampling the regression coefficients becomes easy.

  • In fact, this model is exactly the model used in probit regression.

  • However, the normal and logistic distributions differ in that the logistic has heavier tails.

  • As a result, it is more robust to inaccuracies in the underlying model or to errors in the

  • data. Probit regression loses some of this robustness.

  • Another alternative is to use errors distributed as a Student's t-distribution. The Student's

  • t-distribution has heavy tails, and is easy to sample from because it is the compound

  • distribution of a normal distribution with variance distributed as an inverse gamma distribution.

  • In other words, if a normal distribution is used for the error variable, and another latent

  • variable, following an inverse gamma distribution, is added corresponding to the variance of

  • this error variable, the marginal distribution of the error variable will follow a Student's

  • t-distribution. Because of the various conjugacy relationships, all variables in this model

  • are easy to sample from. The Student's t-distribution that best approximates

  • a standard logistic distribution can be determined by matching the moments of the two distributions.

  • The Student's t-distribution has three parameters, and since the skewness of both distributions

  • is always 0, the first four moments can all be matched, using the following equations:

  • This yields the following values:

  • The following graphs compare the standard logistic distribution with the Student's t-distribution

  • that matches the first four moments using the above-determined values, as well as the

  • normal distribution that matches the first two moments. Note how much closer the Student's

  • t-distribution agrees, especially in the tails. Beyond about two standard deviations from

  • the mean, the logistic and normal distributions diverge rapidly, but the logistic and Student's

  • t-distributions don't start diverging significantly until more than 5 standard deviations away.

  • (Another possibility, also amenable to Gibbs sampling, is to approximate the logistic distribution

  • using a mixture density of normal distributions.) Extensions

  • There are large numbers of extensions: Multinomial logistic regression handles the

  • case of a multi-way categorical dependent variable. Note that the general case of having

  • dependent variables with more than two values is termed polytomous regression.

  • Ordered logistic regression handles ordinal dependent variables.

  • Mixed logit is an extension of multinomial logit that allows for correlations among the

  • choices of the dependent variable. An extension of the logistic model to sets

  • of interdependent variables is the conditional random field.

  • Model suitability A way to measure a model's suitability is

  • to assess the model against a set of data that was not used to create the model. The

  • class of techniques is called cross-validation. This holdout model assessment method is particularly

  • valuable when data are collected in different settings or when models are assumed to be

  • generalizable. To measure the suitability of a binary regression

  • model, one can classify both the actual value and the predicted value of each observation

  • as either 0 or 1. The predicted value of an observation can be set equal to 1 if the estimated

  • probability that the observation equals 1 is above 1/2, and set equal to 0 if the estimated

  • probability is below 1/2. Here logistic regression is being used as a binary classification model.

  • There are four possible combined classifications: prediction of 0 when the holdout sample has a 0,

  • prediction of 0 when the holdout sample has a 1,

  • prediction of 1 when the holdout sample has a 0, and

  • prediction of 1 when the holdout sample has a 1.

  • These classifications are used to calculate

  • accuracy, precision, recall, specificity and negative predictive value:

  • Accuracy = fraction of observations with correct predicted classification

  • Precision = fraction of predicted positives that are correct

  • Negative predictive value = fraction of predicted negatives that are correct

  • Recall = fraction of observations that are actually 1 with a correct predicted classification

  • Specificity = fraction of observations that are actually 0 with a correct predicted classification
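
These five measures can be computed from the four combinations above; a minimal sketch, assuming NumPy arrays of actual outcomes and estimated probabilities and a cut-off of 1/2:

```python
import numpy as np

def binary_classification_metrics(y_true, p_hat, cutoff=0.5):
    """Classify each observation by comparing its estimated probability with a
    cut-off, then compute the five evaluation measures listed above."""
    y_pred = (p_hat >= cutoff).astype(int)
    tp = np.sum((y_pred == 1) & (y_true == 1))
    tn = np.sum((y_pred == 0) & (y_true == 0))
    fp = np.sum((y_pred == 1) & (y_true == 0))
    fn = np.sum((y_pred == 0) & (y_true == 1))
    return {
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
        "precision": tp / (tp + fp),                    # predicted positives that are correct
        "negative predictive value": tn / (tn + fn),    # predicted negatives that are correct
        "recall (sensitivity)": tp / (tp + fn),         # actual 1s predicted correctly
        "specificity": tn / (tn + fp),                  # actual 0s predicted correctly
    }
```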

  • See also

  • Logistic function Discrete choice

  • Jarrow–Turnbull model Limited dependent variable

  • Multinomial logit model Ordered logit

  • Hosmer–Lemeshow test Brier score

  • MLPACK - contains a C++ implementation of logistic regression

  • References

  • Further reading Agresti, Alan. Categorical Data Analysis.

  • New York: Wiley-Interscience. ISBN 0-471-36093-7. Amemiya, T. Advanced Econometrics. Harvard

  • University Press. ISBN 0-674-00560-0. Balakrishnan, N. Handbook of the Logistic

  • Distribution. Marcel Dekker, Inc. ISBN 978-0-8247-8587-1. Greene, William H. Econometric Analysis,

  • fifth edition. Prentice Hall. ISBN 0-13-066189-9. Hilbe, Joseph M. Logistic Regression Models.

  • Chapman & Hall/CRC Press. ISBN 978-1-4200-7575-5. Howell, David C. Statistical Methods for

  • Psychology, 7th ed. Belmont, CA: Thomson Wadsworth. ISBN 978-0-495-59786-5.

  • Peduzzi, P.; J. Concato, E. Kemper, T.R. Holford, A.R. Feinstein. "A simulation study of the

  • number of events per variable in logistic regression analysis". Journal of Clinical

  • Epidemiology 49: 1373–1379. doi:10.1016/s0895-4356(96)00236-3. PMID 8970487. 

  • External links Econometrics Lecture on YouTube by Mark Thoma

  • Logistic Regression Interpretation Logistic Regression tutorial

  • Using open source software for building Logistic Regression models
