 Statistics

Logistic Regression

Logistic regression is part of a category of statistical models called generalized linear models. This broad class of models includes ordinary regression and ANOVA, as well as multivariate statistics such as ANCOVA and loglinear regression. An excellent treatment of generalized linear models is presented in Agresti (1996).

Logistic regression allows one to predict a discrete outcome, such as group membership, from a set of variables that may be continuous, discrete, dichotomous, or a mix of any of these. Generally, the dependent or response variable is dichotomous, such as presence/absence or success/failure. Discriminant analysis can also be used to predict membership in one of two groups. However, discriminant analysis can only be used with continuous independent variables. Thus, in instances where the independent variables are categorical, or a mix of continuous and categorical, logistic regression is preferred.

The Model: The dependent variable in logistic regression is usually dichotomous, that is, the dependent variable can take the value 1 with a probability of success q, or the value 0 with a probability of failure 1-q. This type of variable is called a Bernoulli (or binary) variable. Although not as common and not discussed in this treatment, applications of logistic regression have also been extended to cases where the dependent variable has more than two categories, known as multinomial or polytomous [Tabachnick and Fidell (1996) use the term polychotomous].

As mentioned previously, the independent or predictor variables in logistic regression can take any form. That is, logistic regression makes no assumption about the distribution of the independent variables. They do not have to be normally distributed, linearly related or of equal variance within each group. The relationship between the predictor and response variables is not a linear function in logistic regression; instead, the logistic regression function is used, which is the logit transformation of q:

$$q = \frac{e^{a + b_1 x_1 + b_2 x_2 + \cdots + b_k x_k}}{1 + e^{a + b_1 x_1 + b_2 x_2 + \cdots + b_k x_k}}$$

where a = the constant of the equation and b = the coefficients of the predictor variables.
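A short Python sketch may make this function concrete. It assumes the standard logistic form, in which q is the exponential of the linear predictor (the constant plus each coefficient times its predictor) divided by one plus that exponential. The function names and the numeric values are illustrative only and do not come from any dataset.

```python
import math

def linear_predictor(a, b, x):
    """Return a + b1*x1 + ... + bk*xk for one case."""
    return a + sum(bi * xi for bi, xi in zip(b, x))

def logistic(eta):
    """Map a linear predictor to a probability q between 0 and 1.

    1 / (1 + e^-eta) is algebraically the same as e^eta / (1 + e^eta).
    """
    return 1.0 / (1.0 + math.exp(-eta))

def logit(q):
    """Inverse of logistic(): the log of the odds q / (1 - q)."""
    return math.log(q / (1.0 - q))

# Hypothetical constant and coefficients for two predictors.
a, b = -1.5, [0.8, 0.25]
q = logistic(linear_predictor(a, b, [1.0, 2.0]))
```

Applying logit() to q returns the linear predictor, which is why the model is linear on the log-odds scale even though it is not linear in q itself.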

An alternative form of the logistic regression equation is:

$$\log\left(\frac{q}{1 - q}\right) = a + b_1 x_1 + b_2 x_2 + \cdots + b_k x_k$$

The goal of logistic regression is to correctly predict the category of outcome for individual cases using the most parsimonious model. To accomplish this goal, a model is created that includes all predictor variables that are useful in predicting the response variable. Several different options are available during model creation. Variables can be entered into the model in the order specified by the researcher, or the fit of the model can be tested after each coefficient is added or deleted, a procedure called stepwise regression.

Stepwise regression is used in the exploratory phase of research but it is not recommended for theory testing (Menard 1995). Theory testing is the testing of a priori theories or hypotheses of the relationships between variables. Exploratory testing makes no a priori assumptions regarding the relationships between the variables; its goal is to discover relationships.

Backward stepwise regression appears to be the preferred method of exploratory analyses, where the analysis begins with a full or saturated model and variables are eliminated from the model in an iterative process. The fit of the model is tested after the elimination of each variable to ensure that the model still adequately fits the data. When no more variables can be eliminated from the model, the analysis has been completed.
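The elimination loop can be sketched in plain Python. This is only an illustration under several assumptions that are not in the text: the model is fitted by Newton-Raphson maximum likelihood, a variable is dropped when the likelihood-ratio statistic for removing it falls below 3.84 (the chi-square critical value for 1 degree of freedom at the 0.05 level), and all function names and the toy data are hypothetical. Statistical packages implement this far more carefully.

```python
import math

def solve(A, v):
    """Solve A x = v by Gaussian elimination with partial pivoting."""
    n = len(v)
    M = [row[:] + [v[i]] for i, row in enumerate(A)]
    for c in range(n):
        piv = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[piv] = M[piv], M[c]
        for r in range(c + 1, n):
            f = M[r][c] / M[c][c]
            for j in range(c, n + 1):
                M[r][j] -= f * M[c][j]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (M[r][n] - sum(M[r][j] * x[j] for j in range(r + 1, n))) / M[r][r]
    return x

def prob(beta, row):
    """Fitted probability of success for one case."""
    return 1.0 / (1.0 + math.exp(-sum(b * x for b, x in zip(beta, row))))

def fit_loglik(rows, y, n_iter=25):
    """Fit by Newton-Raphson and return the maximized log-likelihood.

    Each row must already start with 1.0 for the constant term.
    """
    k = len(rows[0])
    beta = [0.0] * k
    for _ in range(n_iter):
        grad = [0.0] * k
        info = [[0.0] * k for _ in range(k)]  # information matrix
        for row, yi in zip(rows, y):
            p = prob(beta, row)
            for i in range(k):
                grad[i] += (yi - p) * row[i]
                for j in range(k):
                    info[i][j] += p * (1.0 - p) * row[i] * row[j]
        beta = [b + s for b, s in zip(beta, solve(info, grad))]
    return sum(math.log(prob(beta, row)) if yi == 1
               else math.log(1.0 - prob(beta, row))
               for row, yi in zip(rows, y))

def backward_eliminate(data, y, names, cutoff=3.84):
    """data maps predictor name -> list of values; returns the names kept."""
    def loglik(cols):
        rows = [[1.0] + [data[c][i] for c in cols] for i in range(len(y))]
        return fit_loglik(rows, y)

    kept = list(names)
    while kept:
        full = loglik(kept)
        # Likelihood-ratio statistic for dropping each remaining predictor.
        stats = {c: 2.0 * (full - loglik([name for name in kept if name != c]))
                 for c in kept}
        weakest = min(stats, key=stats.get)
        if stats[weakest] >= cutoff:
            break  # every remaining predictor is needed
        kept.remove(weakest)
    return kept

# Toy data: x1 is associated with the outcome; x2 is balanced and unrelated.
y, x1, x2 = [], [], []
for x1_val, successes, failures in [(0, 2, 8), (1, 8, 2)]:
    for x2_val in (0, 1):
        y += [1] * successes + [0] * failures
        x1 += [x1_val] * (successes + failures)
        x2 += [x2_val] * (successes + failures)

kept = backward_eliminate({"x1": x1, "x2": x2}, y, ["x1", "x2"])
```

In this toy example the unrelated predictor x2 is removed and x1 is retained. The sketch has no safeguards for perfectly separated data, where the maximum likelihood estimates do not exist.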

There are two main uses of logistic regression. The first is the prediction of group membership. Since logistic regression calculates the probability of success over the probability of failure, the results of the analysis are in the form of an odds ratio. For example, logistic regression is often used in epidemiological studies where the result of the analysis is the probability of developing cancer after controlling for other associated risks. The second is that logistic regression provides knowledge of the relationships and strengths among the variables (e.g., smoking 10 packs a day puts you at a higher risk for developing cancer than working in an asbestos mine).
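The link between a coefficient and an odds ratio can be checked numerically. The sketch below assumes a model with one predictor; the constant and coefficient are made-up values and the function names are illustrative. It shows that raising the predictor by one unit multiplies the odds by the exponential of the coefficient.

```python
import math

def logistic(eta):
    """Probability of success for a given linear predictor."""
    return 1.0 / (1.0 + math.exp(-eta))

def odds(q):
    """Probability of success divided by probability of failure."""
    return q / (1.0 - q)

def odds_ratio(b):
    """Multiplicative change in the odds for a one-unit increase in a predictor."""
    return math.exp(b)

a, b = -2.0, 0.7          # hypothetical constant and coefficient
q0 = logistic(a)          # predictor equal to 0
q1 = logistic(a + b)      # predictor equal to 1
ratio = odds(q1) / odds(q0)
```

Here ratio agrees with odds_ratio(b) up to rounding, which is why coefficients are commonly reported on the exponentiated scale.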

The process by which coefficients are tested for significance, for inclusion in or elimination from the model, involves several different techniques. Each of these will be discussed below.

Wald Test:

A Wald test is used to test the statistical significance of each coefficient (b) in the model. A Wald test calculates a Z statistic, which is:

$$z = \frac{b}{SE(b)}$$

where SE(b) is the standard error of the coefficient. This z value is then squared, yielding a Wald statistic with a chi-square distribution. However, several authors have identified problems with the use of the Wald statistic. Menard (1995) warns that for large coefficients, the standard error is inflated, lowering the Wald statistic (chi-square) value. Agresti (1996) states that the likelihood-ratio test is more reliable for small sample sizes than the Wald test.
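As a small worked example, the Wald calculation can be written in a few lines of Python. The coefficient and standard error are made-up values and the function name is illustrative. The p-value uses the fact that, for a chi-square variable with 1 degree of freedom, the upper-tail probability equals erfc(sqrt(w / 2)).

```python
import math

def wald_test(b, se):
    """Return the z statistic, the Wald chi-square statistic, and its p-value."""
    z = b / se
    wald = z * z
    # Upper-tail probability of a chi-square distribution with 1 df.
    p_value = math.erfc(math.sqrt(wald / 2.0))
    return z, wald, p_value

# Hypothetical coefficient and standard error.
z, wald, p = wald_test(b=1.5, se=0.5)
```

With these inputs z is 3 and the Wald statistic is 9. An inflated standard error would shrink both values, which is the weakness Menard describes.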

Likelihood-Ratio Test:

The likelihood-ratio test uses the ratio of the maximized value of the likelihood function for the full model (L1) over the maximized value of the likelihood function for the simpler model (L0). The likelihood-ratio test statistic equals:

$$-2 \log\left(\frac{L_0}{L_1}\right) = -2\left[\log(L_0) - \log(L_1)\right]$$

This log transformation of the likelihood functions yields a chi-squared statistic. This is the recommended test statistic to use when building a model through backward stepwise elimination.
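The statistic is easy to compute once the fitted probabilities of both models are available. The Python sketch below uses a made-up dataset with one binary predictor, for which the fitted probabilities are simply the observed proportions (0.5 overall for the simpler model, 0.2 and 0.8 within the two groups for the full model). The function names are illustrative, and the p-value helper only covers the case where the models differ by one coefficient (1 degree of freedom).

```python
import math

def log_likelihood(y, q):
    """Bernoulli log-likelihood of 0/1 outcomes y given fitted probabilities q."""
    return sum(math.log(qi) if yi == 1 else math.log(1.0 - qi)
               for yi, qi in zip(y, q))

def likelihood_ratio_statistic(loglik_simple, loglik_full):
    """-2 log(L0 / L1), written in terms of the two log-likelihoods."""
    return -2.0 * (loglik_simple - loglik_full)

def p_value_1df(stat):
    """Upper-tail chi-square probability for 1 degree of freedom."""
    return math.erfc(math.sqrt(max(stat, 0.0) / 2.0))

# Hypothetical data: 20 cases with x = 0 (4 successes), 20 with x = 1 (16 successes).
y = [1] * 4 + [0] * 16 + [1] * 16 + [0] * 4
q_simple = [0.5] * 40                 # constant-only model
q_full = [0.2] * 20 + [0.8] * 20      # model including x

stat = likelihood_ratio_statistic(log_likelihood(y, q_simple),
                                  log_likelihood(y, q_full))
p = p_value_1df(stat)
```

A large statistic (small p-value) indicates that the simpler model fits significantly worse, so the extra variable should be retained.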

Hosmer-Lemeshow Goodness of Fit Test:

The Hosmer-Lemeshow statistic evaluates the goodness-of-fit by creating 10 ordered groups of subjects and then comparing the number actually in each group (observed) to the number predicted by the logistic regression model (predicted). Thus, the test statistic is a chi-square statistic with a desirable outcome of non-significance, indicating that the model prediction does not significantly differ from the observed.

The 10 ordered groups are created based on their estimated probability; those with estimated probability below 0.1 form one group, and so on, up to those with probability 0.9 to 1.0. Each of these categories is further divided into two groups based on the actual observed outcome variable (success, failure). The expected frequencies for each of the cells are obtained from the model. If the model is good, then most of the subjects with success are classified in the higher deciles of risk and those with failure in the lower deciles of risk.
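The grouping and the resulting chi-square statistic can be sketched in Python as follows. The sketch follows the fixed cutpoints described above (0 to 0.1, 0.1 to 0.2, and so on), takes the expected number of successes in a group as the sum of the estimated probabilities in that group, and skips empty groups. The function name and the data are hypothetical, and the p-value is omitted.

```python
def hosmer_lemeshow(y, q, n_groups=10):
    """Chi-square statistic comparing observed and expected counts by risk group."""
    n = [0] * n_groups            # cases per group
    observed = [0.0] * n_groups   # observed successes per group
    expected = [0.0] * n_groups   # expected successes per group
    for yi, qi in zip(y, q):
        g = min(int(qi * n_groups), n_groups - 1)  # a probability of 1.0 joins the top group
        n[g] += 1
        observed[g] += yi
        expected[g] += qi
    stat = 0.0
    for g in range(n_groups):
        if n[g] == 0:
            continue
        o1, e1 = observed[g], expected[g]       # success cells
        o0, e0 = n[g] - o1, n[g] - e1           # failure cells
        stat += (o1 - e1) ** 2 / e1 + (o0 - e0) ** 2 / e0
    return stat

# Hypothetical fitted probabilities for eight cases in two risk groups.
q = [0.25] * 4 + [0.75] * 4
well_calibrated = hosmer_lemeshow([1, 0, 0, 0, 1, 1, 1, 0], q)
poorly_calibrated = hosmer_lemeshow([1, 1, 1, 0, 1, 1, 1, 0], q)
```

The first call gives a statistic of 0 because observed and expected counts match exactly; the second is larger because three successes were observed where only one was expected.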

References:

Agresti, Alan. 1996. An Introduction to Categorical Data Analysis. John Wiley and Sons, Inc.

Hosmer, David and Stanley Lemeshow. 1989. Applied Logistic Regression. John Wiley and Sons, Inc.

Menard, Scott. 1995. Applied Logistic Regression Analysis. Sage Publications. Series: Quantitative Applications in the Social Sciences, No. 106.

Tabachnick, Barbara and Linda Fidell. 1996. Using Multivariate Statistics, Third edition. Harper Collins.

Useful Websites:

Alan Agresti’s website with all the data from the worked examples in his book:
http://lib.stat.cmu.edu/datasets/agresti

Copyright 2006 eDavar Web sites. All Rights Reserved

Publisher & Author: Davar (Dave) Hamadani/Executive Biostatistician