  | pdf    = http://archives.datapages.com/data/specpubs/methodo1/images/a095/a0950001/0300/03450.pdf
}}
Most geological phenomena are multivariate in nature; for example, a porous medium is characterized by a set of interdependent quantities or attributes such as [[grain size]], [[porosity]], [[permeability]], and saturation. Although univariate statistical analysis can characterize the distribution of each attribute separately, an understanding of porous media calls for unraveling the interrelationships among their various attributes. Multivariate statistical analysis proposes to study the joint distribution of all attributes, in which the distribution of any single variable is analyzed as a function of the distributions of the other attributes.
    
Multivariate observations are best organized and manipulated as a matrix of sample values, of size (n × P), where n is the number of samples and P is the number of attributes or variables. For example, a (5 × 3) matrix might represent five core samples at different depths on which frequencies of occurrence of three different fossils are recorded. The purposes of multivariate data analysis are to study the relationships among the P attributes, classify the n collected samples into homogeneous groups, and make inferences about the underlying populations from the sample.
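
As a minimal sketch of this organization (the counts below are hypothetical, not measured data), the (5 × 3) example can be held as a two-dimensional NumPy array with one row per sample and one column per attribute:

<syntaxhighlight lang="python">
import numpy as np

# Hypothetical (5 x 3) data matrix: n = 5 core samples (rows),
# P = 3 fossil species (columns); entries are frequencies of occurrence.
X = np.array([[12,  3,  7],
              [10,  5,  6],
              [ 4, 14,  2],
              [ 6, 11,  3],
              [11,  4,  8]])

n, P = X.shape   # n = 5 samples, P = 3 attributes
</syntaxhighlight>
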
#'''Data description or exploratory data analysis (EDA)'''--The basic tools of this objective include univariate statistics, such as the mean, variance, and quantiles applied to each variable separately, and the covariance or correlation matrix between any two of the P quantities. Some of the P quantities can be transformed (for example, by taking the logarithm) prior to establishing the correlation matrix. Because the matrix is symmetrical, there are P(P - 1)/2 potentially different correlation values.
#'''Data grouping (discrimination and clustering)'''--Discrimination or classification aims at optimally assigning multivariate data vectors (arrays) to a set of previously defined classes or groups.<ref name=Everitt_1974>Everitt, B., 1974, Cluster analysis: London, Heinemann Educational Books Ltd., 122 p.</ref> Clustering, however, aims at defining classes of multivariate similarity and regrouping the initial sample values into these classes. Discrimination is a supervised act of pattern recognition, whereas clustering is an unsupervised act of pattern cognition.<ref name=Miller_etal_1962>Miller, R. L., and J. S. Kahn, 1962, Statistical analysis in the geological sciences: New York, John Wiley, 481 p.</ref> Principal component analysis (PCA) allows analysis of the covariance (correlation) matrix with a minimum of statistical assumptions. PCA aims at reducing the dimensionality P of the available multivariate data set by defining a limited number (fewer than P) of linear combinations of these quantities, with each combination reflecting some of the data structures (relationships) implicit in the original covariance matrix (see the EDA and PCA sketch following this list).
#'''Regression'''--Regression is the generic term for relating two sets of variables. The first set, usually denoted by y, constitutes the dependent variable(s). It is related linearly to the second set, denoted x, called the independent variable(s). (For details of multiple and multivariate regression analysis, see [[Correlation and regression analysis]].)
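
The following minimal Python sketch (NumPy only, with hypothetical log-transformed data) illustrates objectives 1 and 2: univariate summaries, the symmetric P × P correlation matrix with its P(P - 1)/2 distinct correlations, and PCA obtained as an eigendecomposition of that matrix:

<syntaxhighlight lang="python">
import numpy as np

rng = np.random.default_rng(0)
X = np.log(rng.lognormal(size=(50, 4)))   # hypothetical: n = 50 samples, P = 4 attributes,
                                          # log-transformed prior to correlation

# Objective 1 -- EDA: univariate statistics per variable, then the correlation matrix.
means = X.mean(axis=0)
variances = X.var(axis=0, ddof=1)
quartiles = np.percentile(X, [25, 50, 75], axis=0)
R = np.corrcoef(X, rowvar=False)          # symmetric P x P matrix
P = R.shape[0]
print(P * (P - 1) // 2, "distinct correlation values")

# Objective 2 -- PCA: eigenvectors of R define linear combinations of the
# P variables; eigenvalues rank the variance each combination explains.
eigvals, eigvecs = np.linalg.eigh(R)
order = np.argsort(eigvals)[::-1]         # sort components by decreasing variance
eigvals, eigvecs = eigvals[order], eigvecs[:, order]
explained = eigvals / eigvals.sum()

Xs = (X - means) / X.std(axis=0, ddof=1)  # standardize before projecting
scores = Xs @ eigvecs[:, :2]              # keep fewer than P combinations (here 2)
</syntaxhighlight>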
    
==A note about outliers==
==Discriminant analysis (classification)==
[[File:Multivariate-data-analysis fig1.png|300px|thumbnail|'''Figure 1.''' Plot of two bivariate distributions, showing overlap between groups a and b along both variables ''x''<sub>1</sub> and ''x''<sub>2</sub>. Groups can be distinguished by projecting members of the two groups onto the discriminant function line.<ref name=Davis_1986 />]]

''Discriminant analysis'' (DA) attempts to determine an allocation rule to classify multivariate data vectors into a set of predefined classes, with a minimum probability of misclassification.<ref name=Davis_1986>Davis, J. C., 1986, Statistics and data analysis in geology: New York, John Wiley, 646 p.</ref> Consider a set of n samples with P quantities being measured on each. Suppose that the n samples are divided into m classes or groups. Discriminant analysis consists of two steps:
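
As a numerical illustration of the projection idea in Figure 1 (two groups, two variables, hypothetical data; a sketch, not the full procedure), the following computes a two-group Fisher discriminant direction with NumPy and an allocation rule based on it:

<syntaxhighlight lang="python">
import numpy as np

rng = np.random.default_rng(1)
# Hypothetical training data: two predefined groups a and b, P = 2 variables.
Xa = rng.normal(loc=[2.0, 3.0], scale=0.8, size=(30, 2))
Xb = rng.normal(loc=[4.0, 5.0], scale=0.8, size=(30, 2))

ma, mb = Xa.mean(axis=0), Xb.mean(axis=0)
# Pooled within-group covariance (assumed common to both groups).
Sw = (np.cov(Xa, rowvar=False) * (len(Xa) - 1)
      + np.cov(Xb, rowvar=False) * (len(Xb) - 1)) / (len(Xa) + len(Xb) - 2)

w = np.linalg.solve(Sw, mb - ma)   # discriminant direction (the line in Figure 1)
midpoint = w @ (ma + mb) / 2       # cutoff halfway between the projected group means

def classify(x):
    """Allocate a sample to group 'a' or 'b' by its projection onto w."""
    return 'b' if w @ x > midpoint else 'a'

print(classify(np.array([2.5, 3.2])))  # near the group a mean, so expected 'a'
</syntaxhighlight>
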
==Cluster analysis==
The purpose of ''cluster analysis'' (CA) is to define classes of samples with multivariate similarity.<ref name=Hartigan_1975>Hartigan, J. A., 1975, Clustering algorithms: New York, John Wiley, 351 p.</ref> No prior assumption is needed about either the number of these classes or their structures. Cluster analysis requires and (unfortunately) often depends heavily on a prior choice of a distance measure between any two samples, (x<sub>il</sub>, i = 1, ..., P) and (x<sub>il&prime;</sub>, i = 1, ..., P). Examples of distances include those represented by the following equations:
''Generalized Euclidean distance''

:<math>d_{ll'} = \Bigg[\sum_{i = 1}^P w_i\big|x_{il} - x_{il'}\big|^k\Bigg]^{1/k}</math>, with ''k'' > 0, ''w''<sub>''i''</sub> &ge; 0,

where ''w''<sub>''i''</sub> = weight indicating the relative importance of each variable ''x''<sub>''i''</sub>
''Correlation type distance''

:<math>d_{ll'} = \frac{\displaystyle\sum_{i = 1}^Px_{il}x_{il'}}{\sqrt{\displaystyle\sum_ix_{il}^2\displaystyle\sum_ix_{il'}^2}}</math>
    
One can also define a distance ''d''<sub>ii&prime;</sub> between any two variables ''x''<sub>''i''</sub> and ''x''<sub>i&prime;</sub> by taking the previous summations over all n samples. Such distances between variables lead to the definition of classes of variables having similar sample values. Such classes (clusters) of variables can help define subsets of the P variables for further studies, with reduced dimensionality.
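
A minimal Python sketch of the two distance measures above, evaluated on hypothetical sample values (unit weights and ''k'' = 2 unless stated otherwise):

<syntaxhighlight lang="python">
import numpy as np

def generalized_euclidean(x, y, w=None, k=2):
    """d = [sum_i w_i |x_i - y_i|^k]^(1/k), with k > 0 and w_i >= 0."""
    w = np.ones_like(x, dtype=float) if w is None else np.asarray(w, dtype=float)
    return (w * np.abs(x - y) ** k).sum() ** (1.0 / k)

def correlation_type(x, y):
    """d = sum_i x_i y_i / sqrt(sum_i x_i^2 * sum_i y_i^2); a similarity in [-1, 1]."""
    return (x * y).sum() / np.sqrt((x ** 2).sum() * (y ** 2).sum())

# Two hypothetical samples l and l' measured on P = 3 variables.
xl  = np.array([0.20, 150.0, 2.3])    # e.g., porosity, permeability, grain size
xlp = np.array([0.18, 120.0, 2.1])

print(generalized_euclidean(xl, xlp))                  # unweighted, k = 2
print(generalized_euclidean(xl, xlp, w=[1, 0.01, 1]))  # down-weight permeability
print(correlation_type(xl, xlp))
</syntaxhighlight>
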
[[File:Multivariate-data-analysis fig2.png|300px|thumbnail|'''Figure 2.''' Dendrogram (by aggregation). Starting from n samples, combine the two most similar samples (here 2 and 3). Then, combine the two nearest groups by either joining two samples or aggregating a third sample to the previous group of two (1 is aggregated to 2 and 3). At the next step, 4 and 5 constitute a new group, which is then aggregated to the former group (1, 2, 3). The aggregation process stops when there is only one group left. In the last step, group (1, 2, 3, 4, 5) is aggregated to group (6, 7, 8, 9).]]

There is a large (and growing) variety of types of cluster analysis techniques:<ref name=Hartigan_1975 />
* Hierarchical techniques provide nested grouping as characterized by a dendrogram (Figure 2; see the sketch following this list).
* Partitioning techniques define a set of mutually exclusive classes.
* Clumping or mixture techniques allow for classes that can overlap.
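
As a sketch of the hierarchical (aggregation) procedure of Figure 2, SciPy's agglomerative clustering can build such a dendrogram from a sample-by-variable matrix (hypothetical data; SciPy assumed available):

<syntaxhighlight lang="python">
import numpy as np
from scipy.cluster.hierarchy import linkage, dendrogram, fcluster

rng = np.random.default_rng(2)
# Hypothetical (9 x 3) data matrix: 9 samples in two loose groups.
X = np.vstack([rng.normal(0.0, 0.5, size=(5, 3)),
               rng.normal(3.0, 0.5, size=(4, 3))])

# Agglomerative clustering on Euclidean distances: each step merges the
# two nearest groups, as in the aggregation process of Figure 2.
Z = linkage(X, method='average', metric='euclidean')

labels = fcluster(Z, t=2, criterion='maxclust')  # cut the tree into 2 clusters
print(labels)

# dendrogram(Z) draws the tree itself (requires matplotlib).
</syntaxhighlight>
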
The problem of preferential sampling in high pay zones, which may lead to more samples having high [[porosity]] and saturation values, is particularly critical when performing cluster analysis. If spatial declustering is not done properly before CA, all results can be mere artifacts of that preferential sampling. A related problem is linked to sample locations ''u''<sub>l</sub> and ''u''<sub>l&prime;</sub> not being accounted for in the definition of, say, the Euclidean distance between two samples l and l&prime;.

:<math>d_{ll'} = \sqrt{\sum_{i = 1}^P(x_{il} - x_{il'})^2}</math>

with ''x''<sub>il</sub> = ''x''<sub>''i''</sub>('''u'''<sub>l</sub>) and ''x''<sub>il&prime;</sub> = ''x''<sub>''i''</sub>('''u'''<sub>l&prime;</sub>) being the two measurements on variable ''x''<sub>''i''</sub> taken at the two locations '''u'''<sub>l</sub> and '''u'''<sub>l&prime;</sub>.
    
In conclusion, although cluster analysis aims at an unsupervised classification, it is best when applied with some supervision and a prior idea of what natural or physical clusters could be. Cluster analysis can then prove to be a remarkable corroboratory tool, allowing prior speculations to be checked and quantified.
      
==See also==
[[Category:Geological methods]] [[Category:Test content]] [[Category:Pages with unformatted equations]]
[[Category:Methods in Exploration 10]]
