需要安装的包:

install.packages(c("NbClust", "flexclust", "rattle"))

什么是聚类分析

聚类分析是一种数据归约技术,旨在揭露一个数据集中观测值的子集。它可以把大量的观测 值归约为若干个类。这里的类被定义为若干个观测值组成的群组,群组内观测值的相似度比群间 相似度高。这不是一个精确的定义,从而导致了各种聚类方法的出现。

聚类分析被广泛用于生物和行为科学、市场以及医学研究中。例如,一名心理学研究员可能 基于抑郁症病人的症状和人口统计学数据对病人进行聚类,试图得出抑郁症的亚型,以期通过亚 型来找到更加有针对性和有效的治疗方法,同时更好地了解这种疾病。营销研究人员根据消费者 的人口统计特征与购买行为的相似性制定客户细分战略,并基于此对其中的一个或多个子组制定 相应的营销战略。医学研究人员通过对DNA微阵列数据进行聚类分析来获得基因表达模式,从而 帮助他们理解人类的正常发育以及导致许多疾病的根本原因。

聚类分析的特点:无法事先假定每个个体的分类,所以聚类分析被称为无监督的学习。

聚类分析的一般步骤

  • 选择合适的变量

  • 数据的预处理(标准化数据以消除量纲的影响,查找异常点)

  • 计算距离

  • 选择聚类算法

  • 确定类的数目

  • 最终的聚类解决方案

  • 结果可视化与解释

  • 验证

步骤

  • 1、定义每个观测值(行或单元)为一类
  • 2、计算每类和其他各类的距离
  • 3、把距离最短的两类合并成一类,这样类的个数就减少一个
  • 4、重复步骤2和步骤3,直到包含所有观测值的类合并成单个的类为止
?hclust

距离的定义

  • “single”
  • “complete”
  • “average”
  • “centroid”

实例

flexclust包中的营养数据集 nutrient作层次聚类,以期回答以下问题。

- 基于五种营养标准的27类鱼、禽、肉的相同点和不同点是什么?

- 是否有一种方法能把这些食物分成若干个有意义的类?

par(ask=TRUE)
opar <- par(no.readonly=FALSE)
data(nutrient, package="flexclust")
row.names(nutrient) <- tolower(row.names(nutrient))
nutrient.scaled <- scale(nutrient)                                  
d <- dist(nutrient.scaled)                                          
fit.average <- hclust(d, method="average")                          
plot(fit.average, hang=-1, cex=.8, main="Average Linkage Clustering")

选择聚类的数目

NbClust包提供了众多的指数来确定在一个聚类分析里类的最佳数目。

library(NbClust)
nc <- NbClust(nutrient.scaled, distance="euclidean", 
              min.nc=2, max.nc=15, method="average")

## *** : The Hubert index is a graphical method of determining the number of clusters.
##                 In the plot of Hubert index, we seek a significant knee that corresponds to a 
##                 significant increase of the value of the measure i.e the significant peak in Hubert
##                 index second differences plot. 
## 

## *** : The D index is a graphical method of determining the number of clusters. 
##                 In the plot of D index, we seek a significant knee (the significant peak in Dindex
##                 second differences plot) that corresponds to a significant increase of the value of
##                 the measure. 
##  
## ******************************************************************* 
## * Among all indices:                                                
## * 4 proposed 2 as the best number of clusters 
## * 4 proposed 3 as the best number of clusters 
## * 2 proposed 4 as the best number of clusters 
## * 4 proposed 5 as the best number of clusters 
## * 1 proposed 9 as the best number of clusters 
## * 1 proposed 10 as the best number of clusters 
## * 2 proposed 13 as the best number of clusters 
## * 1 proposed 14 as the best number of clusters 
## * 4 proposed 15 as the best number of clusters 
## 
##                    ***** Conclusion *****                            
##  
## * According to the majority rule, the best number of clusters is  2 
##  
##  
## *******************************************************************
par(opar)
table(nc$Best.n[1,])
## 
##  0  1  2  3  4  5  9 10 13 14 15 
##  2  1  4  4  2  4  1  1  2  1  4
barplot(table(nc$Best.n[1,]), 
        xlab="Numer of Clusters", ylab="Number of Criteria",
        main="Number of Clusters Chosen by 26 Criteria") 

选择2类

clusters <- cutree(fit.average, k=2) 
table(clusters)
## clusters
##  1  2 
## 26  1
aggregate(nutrient, by=list(cluster=clusters), median) 
##   cluster energy protein fat calcium iron
## 1       1  182.5      19 9.5       9 2.45
## 2       2  180.0      22 9.0     367 2.50
aggregate(as.data.frame(nutrient.scaled), by=list(cluster=clusters),
          median)
##   cluster     energy   protein        fat    calcium       iron
## 1       1 -0.2461016 0.0000000 -0.3536883 -0.4480464 0.04688857
## 2       2 -0.2708033 0.7056007 -0.3981050  4.1396825 0.08110456
plot(fit.average, hang=-1, cex=.8,  
     main="Average Linkage Clustering\n5 Cluster Solution")
rect.hclust(fit.average, k=2)

选择5类

clusters <- cutree(fit.average, k=5) 
table(clusters)
## clusters
##  1  2  3  4  5 
##  7 16  1  2  1
aggregate(nutrient, by=list(cluster=clusters), median) 
##   cluster energy protein fat calcium iron
## 1       1  340.0      19  29       9 2.50
## 2       2  170.0      20   8      13 1.45
## 3       3  160.0      26   5      14 5.90
## 4       4   57.5       9   1      78 5.70
## 5       5  180.0      22   9     367 2.50
aggregate(as.data.frame(nutrient.scaled), by=list(cluster=clusters),
          median)
##   cluster     energy    protein        fat    calcium        iron
## 1       1  1.3101024  0.0000000  1.3785620 -0.4480464  0.08110456
## 2       2 -0.3696099  0.2352002 -0.4869384 -0.3967868 -0.63743114
## 3       3 -0.4684165  1.6464016 -0.7534384 -0.3839719  2.40779157
## 4       4 -1.4811842 -2.3520023 -1.1087718  0.4361807  2.27092763
## 5       5 -0.2708033  0.7056007 -0.3981050  4.1396825  0.08110456
plot(fit.average, hang=-1, cex=.8,  
     main="Average Linkage Clustering\n5 Cluster Solution")
rect.hclust(fit.average, k=5)

K 均值聚类

步骤

    1. 选择K个中心点(随机选择K行);
    1. 把每个数据点分配到离它最近的中心点;
    1. 重新计算每类中的点到该类中心点距离的平均值(也就说,得到长度为p的均值向量 ,这里的p是变量的个数);
    1. 分配每个数据到它最近的中心点;
    1. 重复步骤(3)和步骤(4)直到所有的观测值不再被分配或是达到最大的迭代次数(R把10次作为默认迭代次数)。
  • 在R中K均值的函数格式是kmeans(x,centers)
  • 不像层次聚类方法,K均值聚类要求你事先确定要提取的聚类个数。同样,NbClust包可以 用来作为参考。
  • 由于K均值聚类在开始要随机选择k个中心点,在每次调用函数时可能获得不同的方案。使用 set.seed()函数可以保证结果是可复制的。

实例:

用K均值聚类来处理包含178种意大利葡萄酒中13种化学成分的数据集。

wssplot <- function(data, nc=15, seed=1234){
  wss <- (nrow(data)-1)*sum(apply(data,2,var))
  for (i in 2:nc){
    set.seed(seed)
    wss[i] <- sum(kmeans(data, centers=i)$withinss)}
  plot(1:nc, wss, type="b", xlab="Number of Clusters",
       ylab="Within groups sum of squares")}
data(wine, package="rattle")
head(wine)
##   Type Alcohol Malic  Ash Alcalinity Magnesium Phenols Flavanoids
## 1    1   14.23  1.71 2.43       15.6       127    2.80       3.06
## 2    1   13.20  1.78 2.14       11.2       100    2.65       2.76
## 3    1   13.16  2.36 2.67       18.6       101    2.80       3.24
## 4    1   14.37  1.95 2.50       16.8       113    3.85       3.49
## 5    1   13.24  2.59 2.87       21.0       118    2.80       2.69
## 6    1   14.20  1.76 2.45       15.2       112    3.27       3.39
##   Nonflavanoids Proanthocyanins Color  Hue Dilution Proline
## 1          0.28            2.29  5.64 1.04     3.92    1065
## 2          0.26            1.28  4.38 1.05     3.40    1050
## 3          0.30            2.81  5.68 1.03     3.17    1185
## 4          0.24            2.18  7.80 0.86     3.45    1480
## 5          0.39            1.82  4.32 1.04     2.93     735
## 6          0.34            1.97  6.75 1.05     2.85    1450
df <- scale(wine[-1])  
wssplot(df)      

library(NbClust)
set.seed(1234)
nc <- NbClust(df, min.nc=2, max.nc=15, method="kmeans")

## *** : The Hubert index is a graphical method of determining the number of clusters.
##                 In the plot of Hubert index, we seek a significant knee that corresponds to a 
##                 significant increase of the value of the measure i.e the significant peak in Hubert
##                 index second differences plot. 
## 

## *** : The D index is a graphical method of determining the number of clusters. 
##                 In the plot of D index, we seek a significant knee (the significant peak in Dindex
##                 second differences plot) that corresponds to a significant increase of the value of
##                 the measure. 
##  
## ******************************************************************* 
## * Among all indices:                                                
## * 4 proposed 2 as the best number of clusters 
## * 15 proposed 3 as the best number of clusters 
## * 1 proposed 10 as the best number of clusters 
## * 1 proposed 12 as the best number of clusters 
## * 1 proposed 14 as the best number of clusters 
## * 1 proposed 15 as the best number of clusters 
## 
##                    ***** Conclusion *****                            
##  
## * According to the majority rule, the best number of clusters is  3 
##  
##  
## *******************************************************************
par(opar)
table(nc$Best.n[1,])
## 
##  0  1  2  3 10 12 14 15 
##  2  1  4 15  1  1  1  1
barplot(table(nc$Best.n[1,]), 
        xlab="Numer of Clusters", ylab="Number of Criteria",
        main="Number of Clusters Chosen by 26 Criteria") 

set.seed(1234)

fit.km <- kmeans(df, 3, nstart=25) 

fit.km$size
## [1] 62 65 51
fit.km$centers                                               
##      Alcohol      Malic        Ash Alcalinity   Magnesium     Phenols
## 1  0.8328826 -0.3029551  0.3636801 -0.6084749  0.57596208  0.88274724
## 2 -0.9234669 -0.3929331 -0.4931257  0.1701220 -0.49032869 -0.07576891
## 3  0.1644436  0.8690954  0.1863726  0.5228924 -0.07526047 -0.97657548
##    Flavanoids Nonflavanoids Proanthocyanins      Color        Hue
## 1  0.97506900   -0.56050853      0.57865427  0.1705823  0.4726504
## 2  0.02075402   -0.03343924      0.05810161 -0.8993770  0.4605046
## 3 -1.21182921    0.72402116     -0.77751312  0.9388902 -1.1615122
##     Dilution    Proline
## 1  0.7770551  1.1220202
## 2  0.2700025 -0.7517257
## 3 -1.2887761 -0.4059428
tt <- aggregate(wine[-1], by=list(cluster=fit.km$cluster), mean)
cluster Alcohol Malic Ash Alcalinity Magnesium Phenols Flavanoids Nonflavanoids Proanthocyanins Color Hue Dilution Proline
1 13.67677 1.997903 2.466290 17.46290 107.96774 2.847581 3.0032258 0.2920968 1.922097 5.453548 1.0654839 3.163387 1100.2258
2 12.25092 1.897385 2.231231 20.06308 92.73846 2.247692 2.0500000 0.3576923 1.624154 2.973077 1.0627077 2.803385 510.1692
3 13.13412 3.307255 2.417647 21.24118 98.66667 1.683922 0.8188235 0.4519608 1.145882 7.234706 0.6919608 1.696667 619.0588

返回课程主页