典型相关分析的基本思想

例如为了了解家庭的特征与其消费模式之间的关系。调查了若干个家庭的下面两组变量:

此处输入图片的描述

此处输入图片的描述

现在要研究两组变量的关系,如果是两个变量可以计算相关系数,但是两组变量如何研究呢?

此处输入图片的描述

此处输入图片的描述

此处输入图片的描述

此处输入图片的描述

  • 典型相关分析,又称为典则相关分析,(canonical correlation analysis),是分析两组变量间线性相关关系的一种统计分析方法。
  • 典型相关分析的基本思想类似主成分分析,它根据变量间的相关关系,寻找几个简单的综合变量(可看作主成分)替代关系复杂的实际观测变量,将两组变量间的多重线性相关关系转化为少数几对综合变量(主成分)间的简单线性相关。此时,少数几对综合变量(主成分)所包含的相关性信息覆盖了原变量组间所包含的大部分信息。

如果我们记两组变量的第一对线性组合为: \[\begin{align*} X=&(X_1,\cdots,X_p)\\ Y=&(Y_1,\cdots,Y_q)\\ u_1=&\alpha_1^{'}X\\ v_1=&\beta_1^{'}Y \end{align*}\] 其中: \(\alpha_1=(\alpha_{11},\alpha_{21},\cdots,\alpha_{p1})^{'}\) \(\beta_1=(\beta_{11},\beta_{21},\cdots,\beta_{q1})^{'}\) \(\Sigma\)是协方差矩阵 考虑: \[\begin{align*} Var(u_1)=& \alpha_1^{'}Var(X)\alpha_1=\alpha_1^{'}\Sigma_{11}\alpha=1\\ Var(v_1)=& \beta_1^{'}Var(Y)\beta_1=\beta_1^{'}\Sigma_{22}\beta=1\\ \rho_{u_1,v_1}=&Cov(u_1,v_1)=\alpha_1^{'}Cov(X,Y)\beta_1=\alpha_1^{'}\Sigma_{12}\beta_1 \end{align*}\] 典型相关分析就是求\(\alpha_1,\beta_1\),使两者的相关系数\(\rho\) 达到最大。

计算过程

  1. 对变量\(X\)\(Y\) 标准化
  2. \(X\)\(Y\) 的相关系数矩阵R: \[R=\begin{bmatrix} R_{XX}& R_{XY}\\ R_{YX}& R_{YY}\end{bmatrix}=\begin{bmatrix}\sum XX& \sum XY\\ \sum{YX} & \sum{YY}\end{bmatrix}_{(p+q)\times(p+q)}\]
  3. \(A\)\(B\)\[\begin{align*} A=& (R_{XX})^{-1}R_{XY}(R_{YY})^{-1}R_{YX}\\ B=&(R_{YY})^{-1}R_{YX}(R_{XX})^{-1}R_{XY} \end{align*}\]
  4. 分布求\(A\)\(B\) 的特征值和特征根,设\(A\) 的第i个特征值\(\lambda_i\)对应的特征向量为:\((a_{i1},\cdots,a_{ip})\),设\(B\) 的第i个特征值\(\lambda_i\)对应的特征向量为:\((b_{i1},\cdots,b_{iq})\)
  5. 计算\(V_i\)\(W_i\)\[\begin{align*} V_i=& a_{i1}X_1+\cdots+a_{ip}X_p\\ W_i=& b_{i1}Y_1+\cdots+b_{iq}Y_q \end{align*}\]
  6. \(V_i\)\(W_i\) 的相关系数为\(r_i=\sqrt{\lambda_i}\)

典型相关系数的检验

典型相关分析是否恰当,应该取决于两组原变量之间是否相关,如果两组变量之间毫无相关性而言,则不应该作典型相关分析。用样本来估计总体的典型相关系数是否有误,需要进行检验。

\(H_0\): \(\rho_1=\cdots=\rho_r=0\) \(H_1\): \(\rho_i\) 中,至少\(\rho_1\) 不为 \(0\)

检验统计量: 不放设:\(r=min(p,q)=p\) \[\Lambda_0=\frac{|S|}{|S_{XX}||S_{YY}|}=\prod_{i=1}^p(1-\rho_i^2)\] \(\Lambda_0\) 越小,越支持备则假设。 \[Q_0=-[(n-1)-\frac{1}{2}(p+q+1)]ln \Lambda_0\] 在原假设成立的条件下,\(Q_0\) 服从自由度为\(p\times q\) 的卡方分布。

实例

康复俱乐部对20名中年人测量了三个生理指标:体重(x1),腰围(x2),脉搏(x3);三个训练指标:引体向上次数(y1),起坐次数(y2),跳跃次数(y3)。分析生理指标与训练指标的相关性。

ex1=read.table("http://statstudy.github.io/data/9-1.txt",head=T)
ex1
##     x1 x2 x3 y1  y2  y3
## 1  191 36 50  5 162  60
## 2  189 37 52  2 110  60
## 3  193 38 58 12 101 101
## 4  162 35 62 12 105  37
## 5  189 35 46 13 155  58
## 6  182 36 56  4 101  42
## 7  211 38 56  8 101  38
## 8  167 34 60  6 125  40
## 9  176 31 74 15 200  40
## 10 154 33 56 17 251 250
## 11 169 34 50 17 120  38
## 12 166 33 52 13 210 115
## 13 154 34 64 14 215 105
## 14 247 46 50  1  50  50
## 15 193 36 46  6  70  31
## 16 202 37 62 12 210 120
## 17 176 37 54  4  60  25
## 18 157 32 52 11 230  80
## 19 156 33 54 15 225  73
## 20 138 33 68  2 110  43
scale(ex1)
##               x1         x2          x3         y1         y2         y3
##  [1,]  0.5022173  0.1873845 -0.84600343 -0.8418021  0.2629199 -0.2008679
##  [2,]  0.4212145  0.4996919 -0.56862526 -1.4093091 -0.5681948 -0.2008679
##  [3,]  0.5832201  0.8119993  0.26350927  0.4823810 -0.7120415  0.5987035
##  [4,] -0.6723232 -0.1249230  0.81826561  0.4823810 -0.6481096 -0.6494080
##  [5,]  0.4212145 -0.1249230 -1.40075978  0.6715500  0.1510391 -0.2398714
##  [6,]  0.1377048  0.1873845 -0.01386891 -1.0309711 -0.7120415 -0.5518993
##  [7,]  1.3122453  0.8119993 -0.01386891 -0.2742951 -0.7120415 -0.6299063
##  [8,] -0.4698162 -0.4372304  0.54088744 -0.6526331 -0.3284501 -0.5909028
##  [9,] -0.1053036 -1.3741527  2.48253466  1.0498880  0.8702730 -0.5909028
## [10,] -0.9963344 -0.7495378 -0.01386891  1.4282260  1.6854047  3.5044631
## [11,] -0.3888134 -0.4372304 -0.84600343  1.4282260 -0.4083650 -0.6299063
## [12,] -0.5103176 -0.7495378 -0.56862526  0.6715500  1.0301027  0.8717279
## [13,] -0.9963344 -0.4372304  1.09564379  0.8607190  1.1100176  0.6767105
## [14,]  2.7702957  3.3104588 -0.84600343 -1.5984781 -1.5271733 -0.3958854
## [15,]  0.5832201  0.1873845 -1.40075978 -0.6526331 -1.2075138 -0.7664185
## [16,]  0.9477327  0.4996919  0.81826561  0.4823810  1.0301027  0.9692366
## [17,] -0.1053036  0.4996919 -0.29124708 -1.0309711 -1.3673435 -0.8834289
## [18,] -0.8748302 -1.0618453 -0.56862526  0.2932120  1.3497622  0.1891669
## [19,] -0.9153316 -0.7495378 -0.29124708  1.0498880  1.2698474  0.0526547
## [20,] -1.6443568 -0.7495378  1.65040014 -1.4093091 -0.5681948 -0.5323976
## attr(,"scaled:center")
##     x1     x2     x3     y1     y2     y3 
## 178.60  35.40  56.10   9.45 145.55  70.30 
## attr(,"scaled:scale")
##        x1        x2        x3        y1        y2        y3 
## 24.690505  3.201973  7.210373  5.286278 62.566575 51.277470
x=ex1[,1:3]
y=ex1[,4:6]

s11=cor(x);s11
##            x1         x2         x3
## x1  1.0000000  0.8702435 -0.3657620
## x2  0.8702435  1.0000000 -0.3528921
## x3 -0.3657620 -0.3528921  1.0000000
s22=cor(y);s22
##           y1        y2        y3
## y1 1.0000000 0.6957274 0.4957602
## y2 0.6957274 1.0000000 0.6692061
## y3 0.4957602 0.6692061 1.0000000
cor(ex1)
##            x1         x2          x3         y1         y2          y3
## x1  1.0000000  0.8702435 -0.36576203 -0.3896937 -0.4930836 -0.22629556
## x2  0.8702435  1.0000000 -0.35289213 -0.5522321 -0.6455980 -0.19149937
## x3 -0.3657620 -0.3528921  1.00000000  0.1506480  0.2250381  0.03493306
## y1 -0.3896937 -0.5522321  0.15064802  1.0000000  0.6957274  0.49576018
## y2 -0.4930836 -0.6455980  0.22503808  0.6957274  1.0000000  0.66920608
## y3 -0.2262956 -0.1914994  0.03493306  0.4957602  0.6692061  1.00000000
s12=cor(ex1)[1:3,4:6];s12
##            y1         y2          y3
## x1 -0.3896937 -0.4930836 -0.22629556
## x2 -0.5522321 -0.6455980 -0.19149937
## x3  0.1506480  0.2250381  0.03493306
s21=cor(ex1)[4:6,1:3];s21
##            x1         x2         x3
## y1 -0.3896937 -0.5522321 0.15064802
## y2 -0.4930836 -0.6455980 0.22503808
## y3 -0.2262956 -0.1914994 0.03493306
A=solve(s11)%*%s12%*%solve(s22)%*%s21
A
##             x1          x2          x3
## x1 -0.24594544 -0.42556193  0.15927694
## x2  0.58342555  0.90714323 -0.32827241
## x3 -0.01679293 -0.03129267  0.01728371
eigen(A)$vectors
##            [,1]       [,2]       [,3]
## [1,]  0.4404622 -0.8428694 -0.1616014
## [2,] -0.8971428  0.5280957  0.4281731
## [3,]  0.0335830 -0.1033731  0.8891304
sqrt(eigen(A)$values)
## [1] 0.79560815 0.20055604 0.07257029
u1=scale(ex1)[,1:3]%*%eigen(A)$vectors[,1];u1
##               [,1]
##  [1,]  0.024685782
##  [2,] -0.281862072
##  [3,] -0.462743543
##  [4,] -0.156579379
##  [5,]  0.250561111
##  [6,] -0.107922651
##  [7,] -0.150950724
##  [8,]  0.203486477
##  [9,]  1.269799947
## [10,]  0.233129134
## [11,]  0.192589192
## [12,]  0.428570759
## [13,] -0.009794482
## [14,] -1.778155296
## [15,]  0.041734069
## [16,] -0.003374780
## [17,] -0.504458222
## [18,]  0.548201132
## [19,]  0.259492611
## [20,]  0.003590935
k1=c(0.031,-0.493,0.008);k1
## [1]  0.031 -0.493  0.008
u11=scale(ex1)[,1:3]%*%t(t(k1))
u11
##              [,1]
##  [1,] -0.08357983
##  [2,] -0.23783946
##  [3,] -0.38012777
##  [4,]  0.04729113
##  [5,]  0.06343860
##  [6,] -0.08822264
##  [7,] -0.35974702
##  [8,]  0.20531739
##  [9,]  0.69405315
## [10,]  0.33852484
## [11,]  0.19673335
## [12,]  0.34915331
## [13,]  0.19343338
## [14,] -1.55294506
## [15,] -0.08550679
## [16,] -0.21042227
## [17,] -0.25194249
## [18,]  0.49182098
## [19,]  0.33881690
## [20,]  0.33175030
B=solve(s22)%*%s21%*%solve(s11)%*%s12
B
##            y1         y2          y3
## y1  0.1617883  0.1718776  0.02299820
## y2  0.4824417  0.5487737  0.11144827
## y3 -0.3184295 -0.3464725 -0.03208051
eigen(B)
## $values
## [1] 0.632992335 0.040222726 0.005266446
## 
## $vectors
##            [,1]       [,2]       [,3]
## [1,] -0.2644705 -0.3313572 -0.7046067
## [2,] -0.7975886  0.1089606  0.6721095
## [3,]  0.5421327  0.9371926 -0.2275921
sqrt(eigen(B)$values)
## [1] 0.79560815 0.20055604 0.07257029
v1=scale(ex1)[,4:6]%*%eigen(B)$vectors[,1];v1
##              [,1]
##  [1,] -0.09596723
##  [2,]  0.71700920
##  [3,]  0.76491740
##  [4,]  0.03728404
##  [5,] -0.42811434
##  [6,]  0.54137496
##  [7,]  0.29896638
##  [8,]  0.11422255
##  [9,] -1.29213186
## [10,]  0.17790073
## [11,] -0.39350907
## [12,] -0.52661117
## [13,] -0.74610528
## [14,]  1.42618381
## [15,]  0.72020088
## [16,] -0.42371890
## [17,]  0.88430330
## [18,] -1.05154730
## [19,] -1.26193428
## [20,]  0.53727617
k2=c(0.066,0.017,-0.014)
k2
## [1]  0.066  0.017 -0.014
t(t(k2))
##        [,1]
## [1,]  0.066
## [2,]  0.017
## [3,] -0.014
v11=scale(ex1)[,4:6]%*%t(t(k2))
cor(u1,v1)
##            [,1]
## [1,] -0.7956082
cor(u11,v11)
##           [,1]
## [1,] 0.6287931
A01=prod(1-eigen(A)$values)
A01
## [1] 0.3503905
Q01=-(20-1-0.5*(3+3+1))*log(A01)
Q01
## [1] 16.25496
pr1=1-pchisq(Q01,9)
pr1
## [1] 0.06174456
##################

A02=(1-eigen(A)$values[2])*(1-eigen(A)$values[3])
A02
## [1] 0.9547227
Q02=-(20-1-0.5*(3+3+1))*log(A02)
Q02
## [1] 0.7181831
pr2=1-pchisq(Q02,4)
pr2
## [1] 0.9490678
corcoef.test<-function(r, n, p, q, alpha=0.1){
  #r为相关系数 n为样本个数 且n>p+q
  m<-length(r);  Q<-rep(0, m);  lambda <- 1
  for (k in m:1){
    lambda<-lambda*(1-r[k]^2);   #检验统计量
    Q[k]<- -log(lambda)   #检验统计量取对数
  }
  s<-0;  i<-m
  for (k in 1:m){
    Q[k]<- (n-k+1-1/2*(p+q+3)+s)*Q[k]  #统计量    
    chi<-1-pchisq(Q[k], (p-k+1)*(q-k+1))   
        if (chi>alpha){
      i<-k-1; break
    }
    s<-s+1/r[k]^2
  }
  i  #显示输出结果 选用第几对典型变量
}

corcoef.test(cancor(x,y)$cor,n=20,p=3,q=3,alpha=0.1)
## [1] 1