需要安装的包:
install.packages(c("NbClust", "flexclust", "rattle"))
聚类分析是一种数据归约技术,旨在揭露一个数据集中观测值的子集。它可以把大量的观测 值归约为若干个类。这里的类被定义为若干个观测值组成的群组,群组内观测值的相似度比群间 相似度高。这不是一个精确的定义,从而导致了各种聚类方法的出现。
聚类分析被广泛用于生物和行为科学、市场以及医学研究中。例如,一名心理学研究员可能 基于抑郁症病人的症状和人口统计学数据对病人进行聚类,试图得出抑郁症的亚型,以期通过亚 型来找到更加有针对性和有效的治疗方法,同时更好地了解这种疾病。营销研究人员根据消费者 的人口统计特征与购买行为的相似性制定客户细分战略,并基于此对其中的一个或多个子组制定 相应的营销战略。医学研究人员通过对DNA微阵列数据进行聚类分析来获得基因表达模式,从而 帮助他们理解人类的正常发育以及导致许多疾病的根本原因。
聚类分析的特点:无法事先假定每个个体的分类,所以聚类分析被称为无监督的学习。
选择合适的变量
数据的预处理(标准化数据以消除量纲的影响,查找异常点)
计算距离
选择聚类算法
确定类的数目
最终的聚类解决方案
结果可视化与解释
验证
?hclust
对flexclust
包中的营养数据集 nutrient
作层次聚类,以期回答以下问题。
- 基于五种营养标准的27类鱼、禽、肉的相同点和不同点是什么?
- 是否有一种方法能把这些食物分成若干个有意义的类?
par(ask=TRUE)
opar <- par(no.readonly=FALSE)
data(nutrient, package="flexclust")
row.names(nutrient) <- tolower(row.names(nutrient))
nutrient.scaled <- scale(nutrient)
d <- dist(nutrient.scaled)
fit.average <- hclust(d, method="average")
plot(fit.average, hang=-1, cex=.8, main="Average Linkage Clustering")
NbClust
包提供了众多的指数来确定在一个聚类分析里类的最佳数目。
library(NbClust)
nc <- NbClust(nutrient.scaled, distance="euclidean",
min.nc=2, max.nc=15, method="average")
## *** : The Hubert index is a graphical method of determining the number of clusters.
## In the plot of Hubert index, we seek a significant knee that corresponds to a
## significant increase of the value of the measure i.e the significant peak in Hubert
## index second differences plot.
##
## *** : The D index is a graphical method of determining the number of clusters.
## In the plot of D index, we seek a significant knee (the significant peak in Dindex
## second differences plot) that corresponds to a significant increase of the value of
## the measure.
##
## *******************************************************************
## * Among all indices:
## * 4 proposed 2 as the best number of clusters
## * 4 proposed 3 as the best number of clusters
## * 2 proposed 4 as the best number of clusters
## * 4 proposed 5 as the best number of clusters
## * 1 proposed 9 as the best number of clusters
## * 1 proposed 10 as the best number of clusters
## * 2 proposed 13 as the best number of clusters
## * 1 proposed 14 as the best number of clusters
## * 4 proposed 15 as the best number of clusters
##
## ***** Conclusion *****
##
## * According to the majority rule, the best number of clusters is 2
##
##
## *******************************************************************
par(opar)
table(nc$Best.n[1,])
##
## 0 1 2 3 4 5 9 10 13 14 15
## 2 1 4 4 2 4 1 1 2 1 4
barplot(table(nc$Best.n[1,]),
xlab="Numer of Clusters", ylab="Number of Criteria",
main="Number of Clusters Chosen by 26 Criteria")
选择2类
clusters <- cutree(fit.average, k=2)
table(clusters)
## clusters
## 1 2
## 26 1
aggregate(nutrient, by=list(cluster=clusters), median)
## cluster energy protein fat calcium iron
## 1 1 182.5 19 9.5 9 2.45
## 2 2 180.0 22 9.0 367 2.50
aggregate(as.data.frame(nutrient.scaled), by=list(cluster=clusters),
median)
## cluster energy protein fat calcium iron
## 1 1 -0.2461016 0.0000000 -0.3536883 -0.4480464 0.04688857
## 2 2 -0.2708033 0.7056007 -0.3981050 4.1396825 0.08110456
plot(fit.average, hang=-1, cex=.8,
main="Average Linkage Clustering\n5 Cluster Solution")
rect.hclust(fit.average, k=2)
选择5类
clusters <- cutree(fit.average, k=5)
table(clusters)
## clusters
## 1 2 3 4 5
## 7 16 1 2 1
aggregate(nutrient, by=list(cluster=clusters), median)
## cluster energy protein fat calcium iron
## 1 1 340.0 19 29 9 2.50
## 2 2 170.0 20 8 13 1.45
## 3 3 160.0 26 5 14 5.90
## 4 4 57.5 9 1 78 5.70
## 5 5 180.0 22 9 367 2.50
aggregate(as.data.frame(nutrient.scaled), by=list(cluster=clusters),
median)
## cluster energy protein fat calcium iron
## 1 1 1.3101024 0.0000000 1.3785620 -0.4480464 0.08110456
## 2 2 -0.3696099 0.2352002 -0.4869384 -0.3967868 -0.63743114
## 3 3 -0.4684165 1.6464016 -0.7534384 -0.3839719 2.40779157
## 4 4 -1.4811842 -2.3520023 -1.1087718 0.4361807 2.27092763
## 5 5 -0.2708033 0.7056007 -0.3981050 4.1396825 0.08110456
plot(fit.average, hang=-1, cex=.8,
main="Average Linkage Clustering\n5 Cluster Solution")
rect.hclust(fit.average, k=5)
kmeans(x,centers)
set.seed()
函数可以保证结果是可复制的。用K均值聚类来处理包含178种意大利葡萄酒中13种化学成分的数据集。
wssplot <- function(data, nc=15, seed=1234){
wss <- (nrow(data)-1)*sum(apply(data,2,var))
for (i in 2:nc){
set.seed(seed)
wss[i] <- sum(kmeans(data, centers=i)$withinss)}
plot(1:nc, wss, type="b", xlab="Number of Clusters",
ylab="Within groups sum of squares")}
data(wine, package="rattle")
head(wine)
## Type Alcohol Malic Ash Alcalinity Magnesium Phenols Flavanoids
## 1 1 14.23 1.71 2.43 15.6 127 2.80 3.06
## 2 1 13.20 1.78 2.14 11.2 100 2.65 2.76
## 3 1 13.16 2.36 2.67 18.6 101 2.80 3.24
## 4 1 14.37 1.95 2.50 16.8 113 3.85 3.49
## 5 1 13.24 2.59 2.87 21.0 118 2.80 2.69
## 6 1 14.20 1.76 2.45 15.2 112 3.27 3.39
## Nonflavanoids Proanthocyanins Color Hue Dilution Proline
## 1 0.28 2.29 5.64 1.04 3.92 1065
## 2 0.26 1.28 4.38 1.05 3.40 1050
## 3 0.30 2.81 5.68 1.03 3.17 1185
## 4 0.24 2.18 7.80 0.86 3.45 1480
## 5 0.39 1.82 4.32 1.04 2.93 735
## 6 0.34 1.97 6.75 1.05 2.85 1450
df <- scale(wine[-1])
wssplot(df)
library(NbClust)
set.seed(1234)
nc <- NbClust(df, min.nc=2, max.nc=15, method="kmeans")
## *** : The Hubert index is a graphical method of determining the number of clusters.
## In the plot of Hubert index, we seek a significant knee that corresponds to a
## significant increase of the value of the measure i.e the significant peak in Hubert
## index second differences plot.
##
## *** : The D index is a graphical method of determining the number of clusters.
## In the plot of D index, we seek a significant knee (the significant peak in Dindex
## second differences plot) that corresponds to a significant increase of the value of
## the measure.
##
## *******************************************************************
## * Among all indices:
## * 4 proposed 2 as the best number of clusters
## * 15 proposed 3 as the best number of clusters
## * 1 proposed 10 as the best number of clusters
## * 1 proposed 12 as the best number of clusters
## * 1 proposed 14 as the best number of clusters
## * 1 proposed 15 as the best number of clusters
##
## ***** Conclusion *****
##
## * According to the majority rule, the best number of clusters is 3
##
##
## *******************************************************************
par(opar)
table(nc$Best.n[1,])
##
## 0 1 2 3 10 12 14 15
## 2 1 4 15 1 1 1 1
barplot(table(nc$Best.n[1,]),
xlab="Numer of Clusters", ylab="Number of Criteria",
main="Number of Clusters Chosen by 26 Criteria")
set.seed(1234)
fit.km <- kmeans(df, 3, nstart=25)
fit.km$size
## [1] 62 65 51
fit.km$centers
## Alcohol Malic Ash Alcalinity Magnesium Phenols
## 1 0.8328826 -0.3029551 0.3636801 -0.6084749 0.57596208 0.88274724
## 2 -0.9234669 -0.3929331 -0.4931257 0.1701220 -0.49032869 -0.07576891
## 3 0.1644436 0.8690954 0.1863726 0.5228924 -0.07526047 -0.97657548
## Flavanoids Nonflavanoids Proanthocyanins Color Hue
## 1 0.97506900 -0.56050853 0.57865427 0.1705823 0.4726504
## 2 0.02075402 -0.03343924 0.05810161 -0.8993770 0.4605046
## 3 -1.21182921 0.72402116 -0.77751312 0.9388902 -1.1615122
## Dilution Proline
## 1 0.7770551 1.1220202
## 2 0.2700025 -0.7517257
## 3 -1.2887761 -0.4059428
tt <- aggregate(wine[-1], by=list(cluster=fit.km$cluster), mean)
cluster | Alcohol | Malic | Ash | Alcalinity | Magnesium | Phenols | Flavanoids | Nonflavanoids | Proanthocyanins | Color | Hue | Dilution | Proline |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
1 | 13.67677 | 1.997903 | 2.466290 | 17.46290 | 107.96774 | 2.847581 | 3.0032258 | 0.2920968 | 1.922097 | 5.453548 | 1.0654839 | 3.163387 | 1100.2258 |
2 | 12.25092 | 1.897385 | 2.231231 | 20.06308 | 92.73846 | 2.247692 | 2.0500000 | 0.3576923 | 1.624154 | 2.973077 | 1.0627077 | 2.803385 | 510.1692 |
3 | 13.13412 | 3.307255 | 2.417647 | 21.24118 | 98.66667 | 1.683922 | 0.8188235 | 0.4519608 | 1.145882 | 7.234706 | 0.6919608 | 1.696667 | 619.0588 |
返回课程主页。