需要安装的包:

install.packages(c('reshape2', 'sqldf')) 

问卷调查实例

问卷选项 调查问卷

每位经理人的上司根据与服从权威相关的五项陈述(q1到q5)对经理人进行评分。例如,经理人1是一位在美国工作的32岁男性,上司对他的评价是惯于顺从,而经理人5是一位在英国工作的,年龄未知(99可能代表缺失)的女性,服从程度评分较低。

manager <- c(1,2,3,4,5)
date <- c("10/24/08","10/28/08","10/1/08","10/12/08","5/1/09")
gender <- c("M","F","F","M","F")
age <- c(32,45,25,39,99)
q1 <- c(5,3,3,3,2)
q2 <- c(4,5,5,3,2)
q3 <- c(5,2,5,4,1)
q4 <- c(5,5,5,NA,2)
q5 <- c(5,5,2,NA,1)
leadership <- data.frame(manager,date,gender,age,q1,q2,q3,q4,q5, stringsAsFactors=FALSE)
leadership
##   manager     date gender age q1 q2 q3 q4 q5
## 1       1 10/24/08      M  32  5  4  5  5  5
## 2       2 10/28/08      F  45  3  5  2  5  5
## 3       3  10/1/08      F  25  3  5  5  5  2
## 4       4 10/12/08      M  39  3  3  4 NA NA
## 5       5   5/1/09      F  99  2  2  1  2  1

需要的处理:

  • 五个评分(q1到q5)需要组合起来,即为每位经理人生成一个平均服从程度得分。
  • 数据检验要将99岁这样的年龄值重编码为缺失值。
  • 对数据进行筛选
  • 对年龄进行分类筛选
  • 只考虑某个时间段的分析

创建新变量

想创建一个新变量sumx 存储以上两个变量的加和,并创建一个名为meanx的新变量存储这两个变量的均值

第一种方法:

# - Creating new variables
mydata<-data.frame(x1 = c(2, 2, 6, 4),
                   x2 = c(3, 4, 2, 8))
mydata$sumx <- mydata$x1 + mydata$x2
mydata$meanx <- (mydata$x1 + mydata$x2)/2

第二种方法:

attach(mydata)
mydata$sumx <- x1 + x2
mydata$meanx <- (x1 + x2)/2
detach(mydata)

第三种方法:

mydata <- transform(mydata,
                    sumx = x1 + x2,
                    meanx = (x1 + x2)/2)
mydata
##   x1 x2 sumx meanx
## 1  2  3    5   2.5
## 2  2  4    6   3.0
## 3  6  2    8   4.0
## 4  4  8   12   6.0

变量的重编码

  • 将一个连续型变量修改为一组类别值
  • 将误编码的值替换为正确值
  • 基于一组分数线创建一个表示及格/不及格的变量

需要用到逻辑运算符:

创建agecat变量:

第一种方法:

# Recoding variables
leadership$agecat[leadership$age > 75] <- "Elder"
leadership$agecat[leadership$age >= 55 &
                    leadership$age <= 75] <- "Middle Aged"
leadership$agecat[leadership$age < 55] <- "Young"

第二种方法:

leadership <- within(leadership,{
  agecat[age > 75] <- "Elder"
  agecat[age >= 55 & age <= 75] <- "Middle Aged"
  agecat[age < 55] <- "Young" })

变量的重命名

将date修改为testDate

# Renaming variables with the plyr package
names(leadership)
##  [1] "manager" "date"    "gender"  "age"     "q1"      "q2"      "q3"     
##  [8] "q4"      "q5"      "agecat"
names(leadership)[2] <- "testDate"
leadership
##   manager testDate gender age q1 q2 q3 q4 q5 agecat
## 1       1 10/24/08      M  32  5  4  5  5  5  Young
## 2       2 10/28/08      F  45  3  5  2  5  5  Young
## 3       3  10/1/08      F  25  3  5  5  5  2  Young
## 4       4 10/12/08      M  39  3  3  4 NA NA  Young
## 5       5   5/1/09      F  99  2  2  1  2  1  Elder

用包plyr 同时修改多个变量名

library(plyr)
leadership <- rename(leadership,c(manager="managerID", date="testDate"))
## The following `from` values were not present in `x`: date

缺失值

与SAS等程序不同,R中字符型和数值 型数据使用的缺失值符号是相同的。

# Applying the is.na() function
is.na(leadership[, 6:10])
##         q2    q3    q4    q5 agecat
## [1,] FALSE FALSE FALSE FALSE  FALSE
## [2,] FALSE FALSE FALSE FALSE  FALSE
## [3,] FALSE FALSE FALSE FALSE  FALSE
## [4,] FALSE FALSE  TRUE  TRUE  FALSE
## [5,] FALSE FALSE FALSE FALSE  FALSE

将99岁的年龄值重编码为缺失值:

leadership$age[leadership$age == 99] <- NA

也可以这样:

# Recode 99 to missing for the variable age
leadership[age == 99, "age"] <- NA
leadership
##   managerID testDate gender age q1 q2 q3 q4 q5 agecat
## 1         1 10/24/08      M  32  5  4  5  5  5  Young
## 2         2 10/28/08      F  45  3  5  2  5  5  Young
## 3         3  10/1/08      F  25  3  5  5  5  2  Young
## 4         4 10/12/08      M  39  3  3  4 NA NA  Young
## 5         5   5/1/09      F  NA  2  2  1  2  1  Elder
# Excluding missing values from analyses
x <- c(1, 2, NA, 3)
y <- x[1] + x[2] + x[3] + x[4];y
## [1] NA
z <- sum(x);z
## [1] NA

na.rm=TRUE选项,可以在计算之前移除缺失值:

x <- c(1, 2, NA, 3)
y <- sum(x, na.rm=TRUE);y
## [1] 6

函数na.omit()移除所有含有缺失值的观测

# - Using na.omit() to delete incomplete observations
leadership
##   managerID testDate gender age q1 q2 q3 q4 q5 agecat
## 1         1 10/24/08      M  32  5  4  5  5  5  Young
## 2         2 10/28/08      F  45  3  5  2  5  5  Young
## 3         3  10/1/08      F  25  3  5  5  5  2  Young
## 4         4 10/12/08      M  39  3  3  4 NA NA  Young
## 5         5   5/1/09      F  NA  2  2  1  2  1  Elder
newdata <- na.omit(leadership)
newdata
##   managerID testDate gender age q1 q2 q3 q4 q5 agecat
## 1         1 10/24/08      M  32  5  4  5  5  5  Young
## 2         2 10/28/08      F  45  3  5  2  5  5  Young
## 3         3  10/1/08      F  25  3  5  5  5  2  Young

日期值

as.Date()转化为以数值形式存储的日期变量:

# Converting character values to dates
mydates <- as.Date(c("2007-06-22", "2004-02-13"))#将默认格式的字符型数据转换为对应日期

使用mm/dd/yyyy的格式读取数据:

strDates <- c("01/05/1965", "08/16/1975")
dates <- as.Date(strDates, "%m/%d/%Y");dates
## [1] "1965-01-05" "1975-08-16"
# Woring with formats
today <- Sys.Date();today
## [1] "2016-09-01"
format(today, format="%B %d %Y")
## [1] "九月 01 2016"
format(today, format="%A")
## [1] "星期四"

日期的计算

# Calculations with with dates
startdate <- as.Date("2004-02-13")
enddate   <- as.Date("2009-06-22")
enddate - startdate
## Time difference of 1956 days
# Date functions and formatted printing
today <- Sys.Date()
dob <- as.Date("1956-10-12")
difftime(today, dob, units="weeks")
## Time difference of 3125 weeks

将日期转换为字符型变量

# Listing 4.5 - Converting from one data type to another
a <- c(1,2,3)
a
## [1] 1 2 3
is.numeric(a)
## [1] TRUE
is.vector(a)
## [1] TRUE
a <- as.character(a)
a
## [1] "1" "2" "3"
is.numeric(a)
## [1] FALSE
is.vector(a)
## [1] TRUE
is.character(a)
## [1] TRUE

类型转换

数据排序

依经理人的年龄升序排序:

# Sorting a dataset
newdata <- leadership[order(leadership$age),]

将各行依女性到男性、同样性别中按年龄升序排序:

attach(leadership)
## The following objects are masked _by_ .GlobalEnv:
## 
##     age, gender, q1, q2, q3, q4, q5
newdata <- leadership[order(gender, age),];newdata
##   managerID testDate gender age q1 q2 q3 q4 q5 agecat
## 3         3  10/1/08      F  25  3  5  5  5  2  Young
## 2         2 10/28/08      F  45  3  5  2  5  5  Young
## 5         5   5/1/09      F  NA  2  2  1  2  1  Elder
## 1         1 10/24/08      M  32  5  4  5  5  5  Young
## 4         4 10/12/08      M  39  3  3  4 NA NA  Young
detach(leadership)

依经理人的性别和年龄降序排序:

attach(leadership)
## The following objects are masked _by_ .GlobalEnv:
## 
##     age, gender, q1, q2, q3, q4, q5
newdata <-leadership[order(gender, -age),];newdata
##   managerID testDate gender age q1 q2 q3 q4 q5 agecat
## 5         5   5/1/09      F  NA  2  2  1  2  1  Elder
## 2         2 10/28/08      F  45  3  5  2  5  5  Young
## 3         3  10/1/08      F  25  3  5  5  5  2  Young
## 4         4 10/12/08      M  39  3  3  4 NA NA  Young
## 1         1 10/24/08      M  32  5  4  5  5  5  Young
detach(leadership)

数据集的选择和合并

变量的选择的不同方法:

# Selecting variables
newdata <- leadership[, c(6:10)]

myvars <- c("q1", "q2", "q3", "q4", "q5")
newdata <-leadership[myvars]

myvars <- paste("q", 1:5, sep="")
newdata <- leadership[myvars]

选择除去"q3", "q4"的变量:

# Dropping variables
myvars <- names(leadership) %in% c("q3", "q4")
myvars
##  [1] FALSE FALSE FALSE FALSE FALSE FALSE  TRUE  TRUE FALSE FALSE
leadership[!myvars]
##   managerID testDate gender age q1 q2 q5 agecat
## 1         1 10/24/08      M  32  5  4  5  Young
## 2         2 10/28/08      F  45  3  5  5  Young
## 3         3  10/1/08      F  25  3  5  2  Young
## 4         4 10/12/08      M  39  3  3 NA  Young
## 5         5   5/1/09      F  NA  2  2  1  Elder

选择男性年龄大于30的经理人:

# - Selecting observations
newdata <- leadership[1:3,]
newdata <- leadership[leadership$gender=="M" &
                        leadership$age > 30,]
attach(leadership)
## The following objects are masked _by_ .GlobalEnv:
## 
##     age, gender, q1, q2, q3, q4, q5
newdata <- leadership[gender=='M' & age > 30,]
detach(leadership)

将研究范围限定在2009年1月1日到2009年12月31日之间:

# Selecting observations based on dates
startdate <- as.Date("2009-01-01")
enddate <- as.Date("2009-10-31")
newdata <- leadership[which(leadership$date >= startdate &leadership$date <= enddate),]

这里需要注意which()的用法。

选择所有age值大于等于35或age值小于24的行,保留了变量q1到q4:

# Using the subset() function
newdata <- subset(leadership, age >= 35 | age < 24,
                  select=c(q1, q2, q3, q4))

选择所有25岁以上的男性,并保留了变量gender到q4(gender、q4和其间所有列)

newdata <- subset(leadership, gender=="M" & age > 25,
                  select=gender:q4)

sample()的用法:

从leadership数据集中随机抽取一个大小为3的样本:

mysample <- leadership[sample(1:nrow(leadership), 3, replace=FALSE),]

使用SQL 语句操作数据框:

#  - Using SQL statements to manipulate data frames
library(sqldf)
## Loading required package: gsubfn
## Loading required package: proto
## Loading required package: RSQLite
## Loading required package: DBI
newdf <- sqldf("select * from mtcars where carb=1 order by mpg",
               row.names=TRUE)
## Loading required package: tcltk
newdf
##                 mpg cyl  disp  hp drat    wt  qsec vs am gear carb
## Valiant        18.1   6 225.0 105 2.76 3.460 20.22  1  0    3    1
## Hornet 4 Drive 21.4   6 258.0 110 3.08 3.215 19.44  1  0    3    1
## Toyota Corona  21.5   4 120.1  97 3.70 2.465 20.01  1  0    3    1
## Datsun 710     22.8   4 108.0  93 3.85 2.320 18.61  1  1    4    1
## Fiat X1-9      27.3   4  79.0  66 4.08 1.935 18.90  1  1    4    1
## Fiat 128       32.4   4  78.7  66 4.08 2.200 19.47  1  1    4    1
## Toyota Corolla 33.9   4  71.1  65 4.22 1.835 19.90  1  1    4    1
sqldf("select avg(mpg) as avg_mpg, avg(disp) as avg_disp, gear
from mtcars where cyl in (4, 6) group by gear")
##   avg_mpg avg_disp gear
## 1   20.33    201.0    3
## 2   24.53    123.0    4
## 3   25.37    120.1    5

从数据框mtcars中选择所有的变量(列),保留那些使用化油器(carb)的车型(行),按照mpg对车型进行了升序排序,并将结果保存为数据框newdf。参数row.names=TRUE将原始数据框中的行名延续到了新数据框中。

返回课程主页