#'''Data description or exploratory data analysis (EDA)'''--The basic tools of this objective include univariate statistics, such as the mean, variance, and quantiles applied to each variable separately, and the covariance or correlation matrix between any two of the P quantities. Some of the P quantities can be transformed (for example, by taking the logarithm) prior to establishing the correlation matrix. Because the matrix is symmetrical, there are P(P - 1)/2 potentially different correlation values.
 
#'''Data grouping (discrimination and clustering)'''--Discrimination or classification aims at optimally assigning multivariate data vectors (arrays) into a set of previously defined classes or groups.<ref name=Everitt_1974>Everitt, B., 1974, Cluster analysis: London, Heinemann Educational Books Ltd., 122 p.</ref> Clustering, however, aims at defining classes of multivariate similarity and regrouping the initial sample values into these classes. Discrimination is a supervised act of pattern recognition, whereas clustering is an unsupervised act of pattern cognition.<ref name=Miller_etal_1962>Miller, R. L., and J. S. Kahn, 1962, Statistical analysis in the geological sciences: New York, John Wiley, 481 p.</ref> Principal component analysis (PCA) allows analysis of the covariance (correlation) matrix with a minimum of statistical assumptions. PCA aims at reducing the dimensionality P of the multivariate data set available by defining a limited number (fewer than P) of linear combinations of these quantities, with each combination reflecting some of the data structures (relationships) implicit in the original covariance matrix.
#'''Regression'''--Regression is the generic term for relating two sets of variables. The first set, usually denoted by y, constitutes the dependent variable(s). It is related linearly to the second set, denoted x, called the independent variable(s). (For details of multiple and multivariate regression analysis, see [[Correlation and regression analysis]].)
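As a concrete sketch of the first objective, the correlation matrix and its P(P - 1)/2 potentially different values can be computed with NumPy. The data here are synthetic and purely illustrative; column 1 stands in for a quantity (such as permeability) that is log-transformed before the correlation matrix is established.

```python
import numpy as np

# Hypothetical data set: n = 6 samples, P = 3 measured quantities.
# Values are synthetic and for illustration only.
rng = np.random.default_rng(0)
data = rng.normal(size=(6, 3))
data[:, 1] = np.exp(data[:, 1])  # a quantity better treated on a log scale

# Transform selected quantities (here, the logarithm of column 1)
# prior to establishing the correlation matrix.
transformed = data.copy()
transformed[:, 1] = np.log(transformed[:, 1])

corr = np.corrcoef(transformed, rowvar=False)  # P x P symmetric matrix
P = corr.shape[0]
n_unique = P * (P - 1) // 2  # potentially different correlation values
print(n_unique)  # 3 for P = 3
```

Because the matrix is symmetrical with ones on the diagonal, only the values above (or below) the diagonal carry distinct information, which is where the P(P - 1)/2 count comes from.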
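The PCA dimensionality reduction described above can be sketched as an eigendecomposition of the correlation matrix: the eigenvectors define the linear combinations, and keeping fewer than P of them reduces the data set's dimensionality. The data below are synthetic, constructed so that two pairs of quantities are strongly correlated.

```python
import numpy as np

# Hypothetical data: n = 100 samples of P = 4 quantities, built so that
# columns (0, 1) and (2, 3) are strongly correlated pairs.
rng = np.random.default_rng(1)
base = rng.normal(size=(100, 2))
data = np.column_stack([
    base[:, 0], base[:, 0] + 0.1 * rng.normal(size=100),
    base[:, 1], base[:, 1] + 0.1 * rng.normal(size=100),
])

# Standardize, then eigendecompose the correlation matrix.
z = (data - data.mean(axis=0)) / data.std(axis=0)
corr = np.corrcoef(z, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(corr)   # returned in ascending order
order = np.argsort(eigvals)[::-1]         # sort descending
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# Keep k < P linear combinations (the principal components).
k = 2
scores = z @ eigvecs[:, :k]               # n x k reduced data set
explained = eigvals[:k].sum() / eigvals.sum()
print(scores.shape, explained)
```

With two highly correlated pairs, the first two components capture most of the total variance, so the four original quantities are summarized by two linear combinations with little loss of structure.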
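The regression setup above, relating a dependent variable y linearly to independent variables x, can be sketched with ordinary least squares. The coefficients and data here are invented for illustration.

```python
import numpy as np

# Hypothetical example: y depends linearly on two independent variables,
# with coefficients (intercept 1.5, slopes 2.0 and -0.7) chosen for
# illustration, plus a small noise term.
rng = np.random.default_rng(2)
x = rng.normal(size=(50, 2))
y = 1.5 + 2.0 * x[:, 0] - 0.7 * x[:, 1] + 0.05 * rng.normal(size=50)

# Ordinary least squares: augment x with an intercept column and solve.
X = np.column_stack([np.ones(len(x)), x])
coef, *_ = np.linalg.lstsq(X, y, rcond=None)
print(coef)  # approximately [1.5, 2.0, -0.7]
```

The solved coefficients recover the generating values closely because the noise is small relative to the signal; with noisier data the estimates would scatter around them.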
    
==A note about outliers==
==Discriminant analysis (classification)==
[[File:Charles-l-vavra-john-g-kaldi-robert-m-sneider capillary-pressure 1.jpg|thumbnail|'''Figure 1.''' Plot of two-bivariate distributions, showing overlap between groups a and b along both variables ''x''<sub>1</sub> and ''x''<sub>2</sub>. Groups can be distinguished by projecting members of the two groups onto the discriminant function line.<ref name=Davis_1986 />]]
    
''Discriminant analysis'' (DA) attempts to determine an allocation rule to classify multivariate data vectors into a set of predefined classes, with a minimum probability of misclassification.<ref name=Davis_1986>Davis, J. C., 1986, Statistics and data analysis in geology: New York, John Wiley, 646 p.</ref> Consider a set of n samples with P quantities being measured on each. Suppose that the n samples are divided into m classes or groups. Discriminant analysis consists of two steps:
