Saturday, July 29, 2017

Discriminant Function Analysis (P-1)


Contents:

4. Data Quality
4.1 Normality
4.2 Outliers
4.3 Correlation
4.4 Regression
5. Basic assumptions
5.1 Classification theory (false positives and negatives)
5.1.1 Variance, ANOVA
5.1.2 MANOVA

What is Discriminant Function Analysis?
A statistical discrimination or classification problem consists in assigning an individual, or a group of individuals, to one of several known or unknown alternative populations on the basis of several measurements made on the individual and on samples from the populations. For example, a linear combination of the measurements, called the linear discriminator or discriminant function, is constructed, and on the basis of its value the individual is assigned to one or the other of two populations.








Discriminant function analysis (DFA) is a statistical analysis used to predict a categorical dependent variable (called a grouping variable) from one or more continuous or binary independent variables (called predictor variables). The main purpose of a discriminant function analysis is to predict group membership based on a linear combination of the interval variables. The procedure begins with a set of observations for which both group membership and the values of the interval variables are known. DFA is different from MDA, a statistical technique used to reduce the differences between variables in order to classify them into a set number of broad groups. Discriminant function analysis is the reverse of multivariate analysis of variance (MANOVA): in MANOVA, the independent variables are the groups and the dependent variables are the predictors.
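As a concrete illustration, here is a minimal sketch of this kind of prediction in Python, assuming scikit-learn is available; the data and variable names are made up for the example:

```python
# Minimal sketch: predict group membership from two interval predictors,
# assuming scikit-learn; the data below are hypothetical.
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

X = np.array([[2.1, 3.5], [1.9, 3.0], [2.2, 3.2],   # predictor values
              [3.8, 5.1], [4.0, 4.9], [3.9, 5.3]])
y = np.array([0, 0, 0, 1, 1, 1])                     # known group membership

lda = LinearDiscriminantAnalysis().fit(X, y)
print(lda.coef_, lda.intercept_)     # coefficients of the linear discriminant function
print(lda.predict([[3.0, 4.2]]))     # predicted group for a new observation
```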

Assumptions
Dependent and Independent Variables
  • The dependent variable should be categorical, with m (at least 2) values (e.g. 1 = good student, 2 = bad student; or 1 = prominent student, 2 = average student, 3 = bad student). If the dependent variable is not categorical but is measured on an interval or ratio scale, we should categorize it first. For instance, the average scholastic record is measured on a ratio scale; however, if we categorize it so that students with an average scholastic record over 4.1 are considered good, those between 2.5 and 4.0 average, and those below 2.5 bad, this variable will fulfill the requirements of discriminant analysis.
  • However, we can ask: why are students over 4.1 considered good students? Why is 2.5 the bound below which students are considered bad? Because of this subjective categorization the analysis can be biased, or the correlation coefficients can be under- or overestimated. To remedy this, try to create categories of similar size; this is easier if we plot the distribution of the dependent variable or rely on prior information. Statistical software that includes discriminant analysis, such as SPSS, has an option to recode variables. Even so, it is preferable to use variables that are already categorical, because otherwise we lose some information; for non-categorical variables it would be better to use regression analysis instead of discriminant analysis (a categorization sketch follows this list).
  • Independent variables should be metric. We do not have to standardize the variables for discriminant analysis, because the units of measurement do not have a decisive influence; therefore any metric variable can be selected as an independent variable. If we have a sufficient number of quantitative variables, we can also include dichotomies or ordinal-scaled variables with at least 5 categories in the model (Sajtos – Mitev, 2007).
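A minimal sketch of the categorization step mentioned above, assuming pandas is available; the cut points follow the scholastic-record example (below 2.5 bad, 2.5 to 4.1 average, above 4.1 good):

```python
# Hypothetical average scholastic records recoded into three categories,
# assuming pandas; the cut points follow the example in the text.
import pandas as pd

gpa = pd.Series([1.8, 2.7, 3.4, 4.3, 4.8, 2.2, 3.9, 4.5])
groups = pd.cut(gpa, bins=[0.0, 2.5, 4.1, 5.0], labels=["bad", "average", "good"])
print(groups.value_counts())   # check whether the categories are of comparable size
```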
Normality


[Figure: standard normal distribution curve – http://www.muelaner.com/wp-content/uploads/2013/07/Standard_deviation_diagram.png]


  • The mean, median, and mode of a normal distribution are equal. The area under the normal curve is equal to 1.0. Normal distributions are denser in the center and less dense in the tails. Normal distributions are defined by two parameters, the mean (μ) and the standard deviation (σ).
  • The standard normal curve, shown here, has mean 0 and standard deviation 1. If a dataset follows a normal distribution, then about 68% of the observations will fall within one standard deviation of the mean, which in this case is the interval (−1, 1).
  • As far as I understand, attention should be paid to the normal distribution of the variables as well as to the sample size (a quick check is sketched below).
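As a quick check of the points above, here is a small sketch assuming SciPy and NumPy; the sample itself is simulated:

```python
# Verify the ~68% share within one standard deviation of the mean and run a
# Shapiro-Wilk normality test on a simulated sample (assumes SciPy/NumPy).
import numpy as np
from scipy import stats

print(stats.norm.cdf(1) - stats.norm.cdf(-1))   # ~0.683: share of the standard normal in (-1, 1)

rng = np.random.default_rng(0)
sample = rng.normal(loc=0, scale=1, size=200)   # hypothetical variable
stat, p = stats.shapiro(sample)
print(p > 0.05)                                  # True: no evidence against normality
```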
OUTLIERS
  • An outlier is a person or thing situated away from or detached from the main body or system. In statistics, an outlier is an observation point that is distant from other observations (a simple detection sketch follows this list).
  • An outlier may be due to:
    • variability in the measurement,
    • experimental error (such observations are sometimes excluded from the data set), or
    • sampling error.
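A simple way to flag univariate outliers is the boxplot (1.5 × IQR) rule; a minimal sketch assuming NumPy, with made-up data:

```python
# Flag values outside the usual 1.5 * IQR fences (the boxplot convention).
import numpy as np

x = np.array([4.9, 5.1, 5.0, 5.2, 4.8, 9.7])        # 9.7 is far from the rest
q1, q3 = np.percentile(x, [25, 75])
iqr = q3 - q1
outliers = x[(x < q1 - 1.5 * iqr) | (x > q3 + 1.5 * iqr)]
print(outliers)                                      # -> [9.7]
```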
MEASUREMENT ERRORS
  • Measurement errors can be divided into two components: random error and systematic error.
  • Random errors are errors in measurement that lead to measurable values being inconsistent when repeated measurements of a constant attribute or quantity are taken. 
  • Systematic errors are errors that are not determined by chance but are introduced by an inaccuracy (involving either the observation or measurement process) inherent to the system.
  • Systematic error may also refer to an error with a nonzero mean, the effect of which is not reduced when observations are averaged.
  • Sources of systematic error may be 
    • imperfect calibration of measurement instruments (zero error)
    • changes in the environment which interfere with the measurement process 
    •  imperfect methods of observation 
  • Random errors may occur from non-sampling errors (errors that are not due to sampling).
    • Non-sampling errors in survey estimates can arise from:
      • Coverage errors, such as failure to accurately represent all population units in the sample, or the inability to obtain information about all sample cases;
      • Response errors by respondents due for example to definitional differences, misunderstandings, or deliberate misreporting;
      • Mistakes in recording the data or coding it to standard classifications;
      • Other errors of collection, nonresponse, processing, or imputation of values for missing or inconsistent data.
Measurement error can cause statistical error. Statistical error is the amount by which an observation differs from its expected value, the latter being based on the whole population from which the statistical unit was chosen randomly. For example, if the mean height in a population of 21-year-old men is 1.75 meters, and one randomly chosen man is 1.80 meters tall, then the "error" is 0.05 meters; if the randomly chosen man is 1.70 meters tall, then the "error" is −0.05 meters. The expected value, being the mean of the entire population, is typically unobservable, and hence the statistical error cannot be observed either.
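The distinction between random and systematic error can be illustrated with a small simulation; a sketch assuming NumPy, with invented numbers:

```python
# Simulated measurements of a constant quantity: random error scatters around
# zero, while a systematic error (bias) does not average out.
import numpy as np

rng = np.random.default_rng(1)
true_value = 1.75                              # e.g. the population mean height in meters
random_error = rng.normal(0, 0.02, size=1000)  # random measurement noise
systematic_error = 0.03                        # e.g. a miscalibrated instrument
measurements = true_value + random_error + systematic_error

print(measurements.mean() - true_value)        # close to the 0.03 bias, not to zero
```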










  • Sample size: as a general rule, the larger the sample size, the more significant the model. The ratio of the number of observations to the number of variables is also important: the results can be generalized more readily if we have more observations per variable. As a "rule of thumb", the maximum number of independent variables is n − 2, where n is the sample size. Moreover, the smallest sample size should be at least 20 for a few (4 or 5) predictors (Poulsen – French, 2003). It is best to have 4 or 5 times, or according to some experts (Sajtos – Mitev, 2007) 10 times, as many observations as independent variables.
  • Multivariate normal distribution: the normal distribution is the most frequently used distribution in statistics. Under normality the estimation of parameters is easier, because the parameters can be derived from the density or distribution function. Normality can be tested with histograms of frequency distributions or with hypothesis tests. Note that violations of the normality assumption are usually not "fatal", as long as the non-normality is caused by skewness and not by outliers (Tabachnick – Fidell, 1996). Non-normality can also be caused by inappropriate scales (Sajtos – Mitev, 2007). As the sample size increases, the shape of the sampling distribution becomes normal.
  • Outliers: discriminant analysis is highly sensitive to outliers, because extreme values have a great influence on the mean, the standard deviation and the statistical significance as well. In the one-variable case outliers can be identified with quartiles or a boxplot; in the multivariate case it is better to use the Mahalanobis distance. Using the Mahalanobis distance we can measure the distance between each case and the centroid of each group, taking the correlation between variables into account. Every case is assigned to the group for which its Mahalanobis distance is smallest (Hajdu, 2003). The outermost cases are regarded as outliers. For reasonable results we have to handle the problem of outliers, usually by eliminating them (see the Mahalanobis sketch after this list).
  • Homoscedasticity: constant variance and homogeneous covariance matrices across groups are assumptions of discriminant analysis as well. Heteroscedasticity can be caused by outliers. It can be evaluated through scatterplots of the variables or with the frequently used Box's M test, the standard hypothesis test for this assumption in discriminant analysis. Box's M uses the F distribution. If p < 0.05, the variances are significantly different; thus the probability value of this F should be greater than 0.05 to demonstrate that the assumption of homoscedasticity is upheld. The test is sensitive to departures from multivariate normality, in which case using Box's M is not reasonable. However, discriminant analysis can be robust even when homoscedasticity is violated.
  • Where the sample size is large, even small differences in covariance matrices may be found significant by Box's M, when in fact no substantial violation of the assumption exists. Therefore, we should also look at the log determinants of the group covariance matrices, which are printed along with Box's M. If the group log determinants are similar, a significant Box's M for a large sample is usually ignored. Dissimilar log determinants indicate a violation of the assumption of equal variance-covariance matrices, leading to greater classification errors (specifically, discriminant analysis will tend to classify cases into the group with the larger variability). When the violation occurs, quadratic discriminant analysis may be used (see also: http://faculty.chass.ncsu.edu/garson/PA765/discrim.htm). A log-determinant sketch follows the lists below.
  • Multicollinearity: the independent variables should be correlated with the dependent variable, but there must be no strong correlation among the independent variables, because it can bias the results of the analysis. In the case of strong multicollinearity we cannot discriminate properly, because the highly correlated variables would play a larger role than the others. To eliminate this bias effect we need to exclude the disturbing variables from the analysis, or create a principal component from the correlated variables. The Mahalanobis distance can also be used to check the independence of the variables.
  • Linearity: we assume a linear relation between the independent variables, which can be tested with a scatterplot. Violations of this assumption are not taken into account unless transformed variables are added as additional independent variables.
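The Mahalanobis distance mentioned in the outlier point above can be computed directly; a minimal sketch assuming NumPy and SciPy, with hypothetical cases for one group:

```python
# Mahalanobis distance of each case from the group centroid; the last case
# is deliberately extreme (hypothetical data, assumes NumPy/SciPy).
import numpy as np
from scipy.spatial.distance import mahalanobis

X = np.array([[2.0, 3.1], [2.2, 2.9], [1.9, 3.0], [2.1, 3.2],
              [2.3, 3.0], [1.8, 2.8], [2.0, 2.9], [6.5, 9.0]])
centroid = X.mean(axis=0)
inv_cov = np.linalg.inv(np.cov(X, rowvar=False))

d = np.array([mahalanobis(row, centroid, inv_cov) for row in X])
print(d)   # the last case has the largest distance and is a candidate outlier
```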
Other assumptions are (Sajtos – Mitev, 2007):
  • Independence: not only the explanatory variables but also all cases must be independent. Therefore panel data, longitudinal research data, or pre-test data cannot be used for discriminant analysis.
  • Mutually exclusive groups: all cases must belong to one group and every case of the dependent variable must belong to only one group.
  • Group size: the groups should be roughly the same size and every group should contain at least 2 cases. If this assumption is violated, it is better to use logistic regression instead of discriminant analysis. However, statistical programs such as SPSS can still give significant results when the group sizes are not equal: a correction based on group size can be optionally selected, which eliminates the problems caused by unequal groups.
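Box's M itself is not part of the common Python libraries, but the log determinants of the group covariance matrices discussed above can be compared directly; a rough sketch assuming NumPy, with simulated groups:

```python
# Compare log determinants of group covariance matrices as an informal check
# of the equal-covariance assumption (simulated data, assumes NumPy).
import numpy as np

rng = np.random.default_rng(2)
group_a = rng.normal(0, 1.0, size=(50, 2))   # hypothetical group A
group_b = rng.normal(0, 1.2, size=(50, 2))   # hypothetical group B, slightly more spread

for name, g in (("A", group_a), ("B", group_b)):
    sign, logdet = np.linalg.slogdet(np.cov(g, rowvar=False))
    print(name, logdet)   # similar values suggest roughly equal covariance matrices
```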






The assumptions of discriminant analysis are the same as those for MANOVA. The analysis is quite sensitive to outliers, and the size of the smallest group must be larger than the number of predictor variables. Multivariate normality: the independent variables are normal for each level of the grouping variable.



Correlation

Correlation is a ratio between +1 and −1 calculated so as to represent the linear interdependence of two variables or sets of data.


Pearson Product-Moment Correlation

What does this test do?

The Pearson product-moment correlation coefficient (or Pearson correlation coefficient, for short) is a measure of the strength of a linear association between two variables and is denoted by r. Basically, a Pearson product-moment correlation attempts to draw a line of best fit through the data of two variables, and the Pearson correlation coefficient, r, indicates how far away all these data points are from this line of best fit (i.e., how well the data points fit this new model/line of best fit).
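Computing r is straightforward; a minimal sketch assuming SciPy, with made-up variables:

```python
# Pearson correlation between two hypothetical variables (assumes SciPy).
from scipy import stats

height = [1.60, 1.68, 1.75, 1.82, 1.90]
performance = [12, 15, 17, 20, 24]      # invented performance scores

r, p = stats.pearsonr(height, performance)
print(r)                                 # close to +1: strong positive association
```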

What values can the Pearson correlation coefficient take?

The Pearson correlation coefficient, r, can take a range of values from +1 to -1. A value of 0 indicates that there is no association between the two variables. A value greater than 0 indicates a positive association; that is, as the value of one variable increases, so does the value of the other variable. A value less than 0 indicates a negative association; that is, as the value of one variable increases, the value of the other variable decreases. This is shown in the diagram below:
[Figure: Pearson coefficient – different values]

How can we determine the strength of association based on the Pearson correlation coefficient?

The stronger the association of the two variables, the closer the Pearson correlation coefficient, r, will be to either +1 or -1 depending on whether the relationship is positive or negative, respectively. Achieving a value of +1 or -1 means that all your data points are included on the line of best fit – there are no data points that show any variation away from this line. Values for r between +1 and -1 (for example, r = 0.8 or -0.4) indicate that there is variation around the line of best fit. The closer the value of r to 0 the greater the variation around the line of best fit. Different relationships and their correlation coefficients are shown in the diagram below:
[Figure: different relationships and their correlation coefficients]


Are there guidelines to interpreting Pearson's correlation coefficient?

Yes, the following guidelines have been proposed:
Coefficient, r

Strength of Association    Positive      Negative
Small                      0.1 to 0.3    −0.1 to −0.3
Medium                     0.3 to 0.5    −0.3 to −0.5
Large                      0.5 to 1.0    −0.5 to −1.0
Remember that these values are guidelines and whether an association is strong or not will also depend on what you are measuring.

Can you use any type of variable for Pearson's correlation coefficient?

No, the two variables have to be measured on either an interval or ratio scale. However, both variables do not need to be measured on the same scale (e.g., one variable can be ratio and one can be interval). Further information about types of variable can be found in our Types of Variable guide. If you have ordinal data, you will want to use Spearman's rank-order correlation or Kendall's tau correlation instead of the Pearson product-moment correlation.

Do the two variables have to be measured in the same units?

No, the two variables can be measured in entirely different units. For example, you could correlate a person's age with their blood sugar levels. Here, the units are completely different; age is measured in years and blood sugar level measured in mmol/L (a measure of concentration). Indeed, the calculations for Pearson's correlation coefficient were designed such that the units of measurement do not affect the calculation. This allows the correlation coefficient to be comparable and not influenced by the units of the variables used.

What about dependent and independent variables?

The Pearson product-moment correlation does not take into consideration whether a variable has been classified as a dependent or independent variable. It treats all variables equally. For example, you might want to find out whether basketball performance is correlated with a person's height. You might, therefore, plot a graph of performance against height and calculate the Pearson correlation coefficient. Let's say, for example, that r = .67. That is, as height increases so does basketball performance. This makes sense. However, if we plotted the variables the other way around and wanted to determine whether a person's height was determined by their basketball performance (which makes no sense), we would still get r = .67. This is because the Pearson correlation coefficient takes no account of any theory behind why you chose the two variables to compare. This is illustrated below:
[Figure: the correlation is not influenced by which variable is treated as dependent or independent]
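The symmetry is easy to confirm numerically; a tiny sketch assuming SciPy, reusing the invented height/performance data from the earlier sketch:

```python
# r is symmetric in its arguments: swapping the variables gives the same value.
from scipy import stats

height = [1.60, 1.68, 1.75, 1.82, 1.90]
performance = [12, 15, 17, 20, 24]

print(stats.pearsonr(height, performance)[0])
print(stats.pearsonr(performance, height)[0])   # identical coefficient
```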

Does the Pearson correlation coefficient indicate the slope of the line?

It is important to realize that the Pearson correlation coefficient, r, does not represent the slope of the line of best fit. Therefore, if you get a Pearson correlation coefficient of +1 this does not mean that for every unit increase in one variable there is a unit increase in another. It simply means that there is no variation between the data points and the line of best fit. This is illustrated below:
[Figure: the Pearson coefficient does not indicate the slope of the line of best fit]
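The same point can be checked numerically; a short sketch assuming SciPy:

```python
# Two perfectly linear relationships with very different slopes both give r = 1.
from scipy import stats

x = [1, 2, 3, 4, 5]
print(stats.pearsonr(x, [2, 4, 6, 8, 10])[0])       # slope 2,  r = 1.0
print(stats.pearsonr(x, [10, 20, 30, 40, 50])[0])   # slope 10, r = 1.0
```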


What assumptions does Pearson's correlation make?

There are five assumptions that are made with respect to Pearson's correlation:
  1. The variables must be either interval or ratio measurements (see our Types of Variable guide for further details).
  2. The variables must be approximately normally distributed (see our Testing for Normality guide for further details).
  3. There is a linear relationship between the two variables. We discuss this further below.
  4. Outliers are either kept to a minimum or are removed entirely.
  5. There is homoscedasticity of the data.

How can you detect a linear relationship?


To test whether your two variables form a linear relationship, you simply need to plot them on a graph (a scatterplot, for example) and visually inspect the graph's shape; linear and non-linear relationships are usually easy to tell apart this way. It is not appropriate to analyse a non-linear relationship using a Pearson product-moment correlation.
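A minimal plotting sketch, assuming Matplotlib, with made-up roughly linear data:

```python
# Visual check for linearity: plot the two variables as a scatterplot.
import matplotlib.pyplot as plt

x = [1, 2, 3, 4, 5, 6]
y = [2.0, 4.1, 5.9, 8.2, 9.8, 12.1]   # hypothetical, roughly linear

plt.scatter(x, y)
plt.xlabel("x")
plt.ylabel("y")
plt.show()
```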

Comparison with regression
Correlation is almost always used when you measure both variables. It is rarely appropriate when one variable is something you experimentally manipulate.
Linear regression is usually used when X is a variable you manipulate (time, concentration, etc.).
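For the regression case, a minimal sketch assuming SciPy, where the X variable (a hypothetical concentration) is the one you manipulate:

```python
# Simple linear regression of a measured response on a manipulated variable.
from scipy import stats

concentration = [1, 2, 3, 4, 5]           # manipulated X
response = [2.1, 3.9, 6.2, 7.8, 10.1]     # measured Y (invented)

result = stats.linregress(concentration, response)
print(result.slope, result.intercept, result.rvalue)
```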