Line 19: |
Line 19: |
| The typical objectives of multivariate data analysis can be divided broadly into three categories. | | The typical objectives of multivariate data analysis can be divided broadly into three categories. |
| | | |
− | * '''1. Data description or exploratory data analysis (EDA):'''
| + | ;1. Data description or exploratory data analysis (EDA) |
− | * The basic tools of this objective include univariate statistics, such as the mean, variance, and quantiles applied to each variable separately, and the covariance or correlation matrix between any two of the P quantities. Some of the P quantities can be transformed (for example, by taking the logarithm) prior to establishing the correlation matrix. Because the matrix is symmetrical, there are P(P - 1)/2 potentially different correlation values.
| + | :The basic tools of this objective include univariate statistics, such as the mean, variance, and quantiles applied to each variable separately, and the covariance or correlation matrix between any two of the P quantities. Some of the P quantities can be transformed (for example, by taking the logarithm) prior to establishing the correlation matrix. Because the matrix is symmetrical, there are P(P - 1)/2 potentially different correlation values. |
| | | |
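The exploratory step described under objective 1 can be sketched in a few lines. This is a minimal illustration, not a prescribed workflow: the function name `correlation_summary`, the synthetic lognormal sample, and the choice of which column to log-transform are all assumptions made for the example.

```python
import numpy as np

def correlation_summary(data, log_columns=()):
    """Return the P x P correlation matrix and its P(P - 1)/2 distinct
    off-diagonal values, after an optional log transform of some columns."""
    x = np.asarray(data, dtype=float).copy()
    for j in log_columns:
        # Hypothetical choice: log-transform a skewed, strictly positive variable.
        x[:, j] = np.log(x[:, j])
    corr = np.corrcoef(x, rowvar=False)      # columns are the P quantities
    p = corr.shape[0]
    upper = corr[np.triu_indices(p, k=1)]    # strictly upper triangle
    return corr, upper

# Synthetic stand-in for n = 50 samples of P = 4 quantities (assumed data).
rng = np.random.default_rng(0)
sample = rng.lognormal(size=(50, 4))
corr, upper = correlation_summary(sample, log_columns=[2])
print(upper.size)  # number of potentially different correlation values
```

Because the matrix is symmetric with a unit diagonal, only the strictly upper triangle needs to be inspected.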
− | * '''2. Data grouping (discrimination and clustering):'''
| + | ;2. Data grouping (discrimination and clustering) |
− | * Discrimination or classification aims at optimally assigning multivariate data vectors (arrays) into a set of previously defined classes or groups (Everitt, 1974)<ref name=Everitt_1974>Everitt, B., 1974, Cluster analysis: London, Heinemann Educational Books Ltd., 122 p.</ref>. Clustering, however, aims at defining classes of multivariate similarity and regrouping the initial sample values into these classes. Discrimination is a supervised act of pattern recognition, whereas clustering is an unsupervised act of pattern cognition (Miller and Kahn, 1962)<ref name=Miller_etal_1962>Miller, R. L., and J. S. Kahn, 1962, Statistical analysis in the geological sciences: New York, John Wiley, 481 p.</ref>.
| + | :Discrimination or classification aims at optimally assigning multivariate data vectors (arrays) into a set of previously defined classes or groups (Everitt, 1974)<ref name=Everitt_1974>Everitt, B., 1974, Cluster analysis: London, Heinemann Educational Books Ltd., 122 p.</ref>. Clustering, however, aims at defining classes of multivariate similarity and regrouping the initial sample values into these classes. Discrimination is a supervised act of pattern recognition, whereas clustering is an unsupervised act of pattern cognition (Miller and Kahn, 1962)<ref name=Miller_etal_1962>Miller, R. L., and J. S. Kahn, 1962, Statistical analysis in the geological sciences: New York, John Wiley, 481 p.</ref>. |
| | | |
− | Principal component analysis (PCA) allows analysis of the covariance (correlation) matrix with a minimum of statistical assumptions. PCA aims at reducing the dimensionality P of the multivariate data set available by defining a limited number (fewer than P) of linear combinations of these quantities, with each combination reflecting some of the data structures (relationships) implicit in the original covariance matrix. | + | :Principal component analysis (PCA) allows analysis of the covariance (correlation) matrix with a minimum of statistical assumptions. PCA aims at reducing the dimensionality P of the multivariate data set available by defining a limited number (fewer than P) of linear combinations of these quantities, with each combination reflecting some of the data structures (relationships) implicit in the original covariance matrix. |
| | | |
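The dimensionality reduction described for PCA can be illustrated with an eigen-decomposition of the correlation matrix. The sketch below assumes the principal components are taken as the eigenvectors of the correlation matrix of standardized data; the function name `pca_scores` and the synthetic sample are hypothetical.

```python
import numpy as np

def pca_scores(data, n_components):
    """Eigen-decompose the correlation matrix and project the standardized
    data onto the leading n_components eigenvectors."""
    x = np.asarray(data, dtype=float)
    # Standardize each of the P quantities (zero mean, unit sample variance).
    z = (x - x.mean(axis=0)) / x.std(axis=0, ddof=1)
    corr = np.corrcoef(x, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(corr)   # eigh: symmetric input, ascending order
    order = np.argsort(eigvals)[::-1]         # re-sort by decreasing variance
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    scores = z @ eigvecs[:, :n_components]    # linear combinations of the P quantities
    return eigvals, eigvecs, scores

# Synthetic correlated sample: n = 200, P = 5 (assumed data).
rng = np.random.default_rng(1)
sample = rng.normal(size=(200, 5)) @ rng.normal(size=(5, 5))
eigvals, eigvecs, scores = pca_scores(sample, n_components=2)
print(eigvals / eigvals.sum())  # fraction of total variance per component
```

The eigenvalues sum to P for a correlation matrix, so each ratio printed above is the share of total variance carried by one component.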
− | * '''3. Regression:'''
| + | ;3. Regression |
− | * Regression is the generic term for relating two sets of variables. The first set, usually denoted by y, constitutes the dependent variable(s). It is related linearly to the second set, denoted x, called the independent variable(s). (For details of multiple and multivariate regression analysis, see the chapter on "Correlation and Regression Analysis" in Part 6.)
| + | :Regression is the generic term for relating two sets of variables. The first set, usually denoted by y, constitutes the dependent variable(s). It is related linearly to the second set, denoted x, called the independent variable(s). (For details of multiple and multivariate regression analysis, see the chapter on "Correlation and Regression Analysis" in Part 6.) |
| | | |
| ==A note about outliers== | | ==A note about outliers== |
Line 34: |
Line 34: |
| | | |
| ==A note about data representativeness== | | ==A note about data representativeness== |
− | As mentioned in the chapter on [[Statistics overview]] (in Part 6), the available sample data are an incomplete image of the underlying population. Statistical features and relationships seen in the data may not be representative of | + | As mentioned in the chapter on [[Statistics overview]] (in Part 6), the available sample data are an incomplete image of the underlying population. Statistical features and relationships seen in the data may not be representative of the characteristics of the underlying population if there are biases in the data. Sources of biases are multiple, from the most obvious measurement biases to imposed spatial clustering. Spatial clustering results from preferential location of data (drilling patterns and core plugs), a prevalent problem in exploration and development. Such preferential selection of data points can severely bias one's image of the reservoir, usually in a nonconservative way. Remedies include defining representative subsets of the data, weighting the data, and careful interpretation of the data analysis results. |
− | | |
− | the characteristics of the underlying population if there are biases in the data. Sources of biases are multiple, from the most obvious measurement biases to imposed spatial clustering. Spatial clustering results from preferential location of data (drilling patterns and core plugs), a prevalent problem in exploration and development. Such preferential selection of data points can severely bias one's image of the reservoir, usually in a nonconservative way. Remedies include defining representative subsets of the data, weighting the data, and careful interpretation of the data analysis results. | |
| | | |
| ==Principal component analysis== | | ==Principal component analysis== |
Line 55: |
Line 53: |
| This inverse relationship can be used in estimating (interpolating) a variable ''x''<sub>''i''</sub> from prior estimates of the principal components ''y''<sub>''j''</sub>. The first principal components ''y''<sub>''j''</sub>, j ≤ ''P''<sub>0</sub>, can be estimated by some type of regression procedure (such as kriging), while the higher components ''y''<sub>''j''</sub>, j > ''P''<sub>0</sub>, corresponding to random noise, can be estimated by their respective means. | | This inverse relationship can be used in estimating (interpolating) a variable ''x''<sub>''i''</sub> from prior estimates of the principal components ''y''<sub>''j''</sub>. The first principal components ''y''<sub>''j''</sub>, j ≤ ''P''<sub>0</sub>, can be estimated by some type of regression procedure (such as kriging), while the higher components ''y''<sub>''j''</sub>, j > ''P''<sub>0</sub>, corresponding to random noise, can be estimated by their respective means. |
| | | |
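The use of the inverse relationship can be sketched as follows, under two stated assumptions: the components are defined from the eigenvectors of the covariance matrix of the centered data (so that the inverse transform is the transpose of the eigenvector matrix), and the exact scores of the first ''P''<sub>0</sub> components stand in for values that would in practice come from a regression procedure such as kriging. The function name `reconstruct_from_leading` and the synthetic sample are hypothetical.

```python
import numpy as np

def reconstruct_from_leading(data, p0):
    """Rebuild the original variables from the first p0 principal components,
    replacing each higher component by its mean."""
    x = np.asarray(data, dtype=float)
    mean = x.mean(axis=0)
    centered = x - mean
    cov = np.cov(x, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(cov)
    eigvecs = eigvecs[:, np.argsort(eigvals)[::-1]]  # decreasing variance
    y = centered @ eigvecs                           # all P components
    y_est = y.copy()
    # Higher components (treated as noise) are estimated by their means,
    # which are numerically zero because the data were centered.
    y_est[:, p0:] = y[:, p0:].mean(axis=0)
    return mean + y_est @ eigvecs.T                  # inverse relationship

# Synthetic correlated sample: n = 100, P = 4 (assumed data).
rng = np.random.default_rng(2)
sample = rng.normal(size=(100, 4)) @ rng.normal(size=(4, 4))
approx = reconstruct_from_leading(sample, p0=2)
print(np.linalg.norm(sample - approx))  # residual left in the discarded components
```

Keeping all P components reproduces the data exactly; the residual grows as fewer components are retained.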
− | ===Plotting the pca results=== | + | ===Plotting the PCA results=== |
| Many plots and scattergrams can be produced using the results of PCA and can be interpreted with the help of prior knowledge about the underlying phenomenon. Such plots may reveal clusters (grouping) or trends in the physical space or the ''P''<sub>0</sub> dimensional space of the first principal components. Interpretation of these clusters and classification of the initial data set (of size n × P) into more homogeneous subsets in the multivariate and/or spatial sense may then be in order. Data analysis can then be pursued within each of these new subsets. | | Many plots and scattergrams can be produced using the results of PCA and can be interpreted with the help of prior knowledge about the underlying phenomenon. Such plots may reveal clusters (grouping) or trends in the physical space or the ''P''<sub>0</sub> dimensional space of the first principal components. Interpretation of these clusters and classification of the initial data set (of size n × P) into more homogeneous subsets in the multivariate and/or spatial sense may then be in order. Data analysis can then be pursued within each of these new subsets. |
| | | |
− | ==Discriminant analysis (CLASSIFICATION)== | + | ==Discriminant analysis (classification)== |
| ''Discriminant analysis'' (DA) attempts to determine an allocation rule to classify multivariate data vectors into a set of predefined classes, with a minimum probability of misclassification (Davis, 1986)<ref name=Davis_1986>Davis, J. C., 1986, Statistics and data analysis in geology: New York, John Wiley, 646 p.</ref>. Consider a set of n samples with P quantities being measured on each. Suppose that the n samples are divided into m classes or groups. Discriminant analysis consists of two steps: | | ''Discriminant analysis'' (DA) attempts to determine an allocation rule to classify multivariate data vectors into a set of predefined classes, with a minimum probability of misclassification (Davis, 1986)<ref name=Davis_1986>Davis, J. C., 1986, Statistics and data analysis in geology: New York, John Wiley, 646 p.</ref>. Consider a set of n samples with P quantities being measured on each. Suppose that the n samples are divided into m classes or groups. Discriminant analysis consists of two steps: |
| # The determination of what makes each group different from the others. The answer may be that not all m predefined groups are significantly different from each other. | | # The determination of what makes each group different from the others. The answer may be that not all m predefined groups are significantly different from each other. |