==Cluster analysis==
 
The purpose of ''cluster analysis'' (CA) is to define classes of samples with multivariate similarity.<ref name=Hartigan_1975>Hartigan, J. A., 1975, Clustering algorithms: New York, John Wiley, 351 p.</ref> No prior assumption is needed about either the number of these classes or their structures. Cluster analysis requires and (unfortunately) often depends heavily on a prior choice of a distance measure between any two samples, (x<sub>il</sub>, i = 1, ..., P) and (x<sub>il&prime;</sub>, i = 1, ..., P). Examples of distances include those represented by the following equations:
 
''Generalized Euclidean distance''
 
One can also define a distance ''d''<sub>ii&prime;</sub> between any two variables ''x''<sub>''i''</sub> and ''x''<sub>i&prime;</sub> by running the previous summations over all n samples instead of over the P variables. Such distances between variables lead to the definition of classes of variables having similar sample values. Such classes (clusters) of variables can help define subsets of the P variables for further study, with reduced dimensionality.
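As an illustrative sketch (not part of the original text), both kinds of distance can be computed with the same routine: apply it to the data matrix for sample-to-sample distances, and to its transpose for variable-to-variable distances. The toy data matrix below is hypothetical:

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

# Hypothetical data matrix: n = 4 samples (rows), P = 3 variables (columns).
X = np.array([
    [0.20, 1.5, 10.0],
    [0.22, 1.4,  9.5],
    [0.05, 3.0, 25.0],
    [0.07, 2.8, 24.0],
])

# Distances d_ll' between samples: the summation runs over the P variables.
d_samples = squareform(pdist(X, metric="euclidean"))

# Distances d_ii' between variables: transpose the matrix so the same
# summation now runs over the n samples.
d_variables = squareform(pdist(X.T, metric="euclidean"))

print(d_samples.shape)    # (4, 4)
print(d_variables.shape)  # (3, 3)
```

Note that samples 1 and 2 (and likewise 3 and 4) come out mutually close, which is exactly the multivariate similarity a clustering step would then exploit.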
 
[[File:Charles-l-vavra-john-g-kaldi-robert-m-sneider capillary-pressure 2.jpg|thumbnail|'''Figure 2.''' Dendrogram (by aggregation). Starting from n samples, combine the two most similar samples (here 2 and 3). Then, combine the two nearest groups by either joining two samples or aggregating a third sample to the previous group of two (1 is aggregated to 2 and 3). At the next step, 4 and 5 constitute a new group, which is then aggregated to the former group (1, 2, 3). The aggregation process stops when there is only one group left. In the last step, group (1, 2, 3, 4, 5) is aggregated to group (6, 7, 8, 9).]]
There is a large (and growing) variety of types of cluster analysis techniques:<ref name=Hartigan_1975 />
* Hierarchical techniques provide nested grouping as characterized by a dendrogram ([[:Image:Charles-l-vavra-john-g-kaldi-robert-m-sneider_capillary-pressure_2.jpg|Figure 2]]).
 
* Partitioning techniques define a set of mutually exclusive classes.
 
 
* Clumping or mixture techniques allow for classes that can overlap.
 
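The first (hierarchical) family can be sketched with SciPy's agglomerative routines, which perform exactly the pairwise merging described in the Figure 2 caption. The data values, the single-linkage choice, and the two-cluster cut below are all illustrative assumptions:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Hypothetical samples forming two loose groups, echoing the
# aggregation sequence in the Figure 2 caption.
X = np.array([
    [1.0, 1.1], [1.0, 1.0], [1.1, 1.0],   # first group
    [5.0, 5.0], [5.1, 5.1],               # second group
])

# Single-linkage agglomeration: repeatedly merge the two nearest
# groups until only one group (the dendrogram's root) remains.
Z = linkage(X, method="single", metric="euclidean")

# Cut the dendrogram so that two clusters remain.
labels = fcluster(Z, t=2, criterion="maxclust")
print(labels)
```

The linkage matrix `Z` records each of the n &minus; 1 merges, so the full nested grouping (the dendrogram) is recoverable, not just one partition.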
The problem of preferential sampling in high pay zones, which may lead to more samples having high [[porosity]] and saturation values, is particularly critical when performing cluster analysis. If spatial declustering is not done properly before CA, all results can be mere artifacts of that preferential sampling. A related problem is linked to sample locations ''u''<sub>l</sub> and ''u''<sub>l&prime;</sub> not being accounted for in the definition of, say, the Euclidean distance between two samples l and l&prime;.
 
:<math>d_{ll'} = \sqrt{\sum_{i = 1}^P(x_{il} - x_{il'})^2}</math>
with ''x''<sub>il</sub> = ''x''<sub>''i''</sub>('''u'''<sub>l</sub>) and ''x''<sub>il&prime;</sub> = ''x''<sub>''i''</sub>('''u'''<sub>l&prime;</sub>) being the two measurements on variable ''x''<sub>''i''</sub> taken at the two locations '''u'''<sub>l</sub> and '''u'''<sub>l&prime;</sub>.
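One common declustering remedy is cell declustering: weight each sample inversely by the number of samples falling in its spatial cell, so densely drilled zones do not dominate the statistics fed to CA. The sketch below uses entirely hypothetical 1-D locations and porosity values, with a hypothetical cell size:

```python
import numpy as np

# Hypothetical 1-D sample locations u_l and porosity values x(u_l);
# the first four samples cluster preferentially in a high-porosity zone.
u = np.array([1.0, 1.1, 1.2, 1.3, 5.0, 9.0])
phi = np.array([0.30, 0.31, 0.29, 0.32, 0.15, 0.10])

# Assign each sample to a cell of chosen size, then weight it
# inversely by the number of samples sharing that cell.
cell_size = 2.0
cells = np.floor(u / cell_size).astype(int)
_, inverse, counts = np.unique(cells, return_inverse=True, return_counts=True)
w = 1.0 / counts[inverse]
w /= w.sum()

naive_mean = float(phi.mean())            # biased upward by the clustered zone
declustered_mean = float(np.sum(w * phi))
print(naive_mean, declustered_mean)       # 0.245 vs 0.185
```

Here the naive mean (0.245) overstates the declustered mean (0.185) because four of the six samples sit in the same high-porosity cell; the same weighting can be carried into the distance and clustering computations.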
    
In conclusion, although cluster analysis aims at an unsupervised classification, it is best when applied with some supervision and a prior idea of what natural or physical clusters could be. Cluster analysis can then prove to be a remarkable corroboratory tool, allowing prior speculations to be checked and quantified.
 
==See also==
 