需要安装的包:
install.packages(c('reshape2', 'sqldf'))
问卷选项 调查问卷
每位经理人的上司根据与服从权威相关的五项陈述(q1到q5)对经理人进行评分。例如,经理人1是一位在美国工作的32岁男性,上司对他的评价是惯于顺从,而经理人5是一位在英国工作的,年龄未知(99可能代表缺失)的女性,服从程度评分较低。
manager <- c(1,2,3,4,5)
date <- c("10/24/08","10/28/08","10/1/08","10/12/08","5/1/09")
gender <- c("M","F","F","M","F")
age <- c(32,45,25,39,99)
q1 <- c(5,3,3,3,2)
q2 <- c(4,5,5,3,2)
q3 <- c(5,2,5,4,1)
q4 <- c(5,5,5,NA,2)
q5 <- c(5,5,2,NA,1)
leadership <- data.frame(manager,date,gender,age,q1,q2,q3,q4,q5, stringsAsFactors=FALSE)
leadership
## manager date gender age q1 q2 q3 q4 q5
## 1 1 10/24/08 M 32 5 4 5 5 5
## 2 2 10/28/08 F 45 3 5 2 5 5
## 3 3 10/1/08 F 25 3 5 5 5 2
## 4 4 10/12/08 M 39 3 3 4 NA NA
## 5 5 5/1/09 F 99 2 2 1 2 1
需要的处理:
想创建一个新变量sumx 存储以上两个变量的加和,并创建一个名为meanx的新变量存储这两个变量的均值
第一种方法:
# - Creating new variables
mydata<-data.frame(x1 = c(2, 2, 6, 4),
x2 = c(3, 4, 2, 8))
mydata$sumx <- mydata$x1 + mydata$x2
mydata$meanx <- (mydata$x1 + mydata$x2)/2
第二种方法:
attach(mydata)
mydata$sumx <- x1 + x2
mydata$meanx <- (x1 + x2)/2
detach(mydata)
第三种方法:
mydata <- transform(mydata,
sumx = x1 + x2,
meanx = (x1 + x2)/2)
mydata
## x1 x2 sumx meanx
## 1 2 3 5 2.5
## 2 2 4 6 3.0
## 3 6 2 8 4.0
## 4 4 8 12 6.0
需要用到逻辑运算符:
创建agecat变量:
第一种方法:
# Recoding variables
leadership$agecat[leadership$age > 75] <- "Elder"
leadership$agecat[leadership$age >= 55 &
leadership$age <= 75] <- "Middle Aged"
leadership$agecat[leadership$age < 55] <- "Young"
第二种方法:
leadership <- within(leadership,{
agecat[age > 75] <- "Elder"
agecat[age >= 55 & age <= 75] <- "Middle Aged"
agecat[age < 55] <- "Young" })
将date修改为testDate
# Renaming variables with the plyr package
names(leadership)
## [1] "manager" "date" "gender" "age" "q1" "q2" "q3"
## [8] "q4" "q5" "agecat"
names(leadership)[2] <- "testDate"
leadership
## manager testDate gender age q1 q2 q3 q4 q5 agecat
## 1 1 10/24/08 M 32 5 4 5 5 5 Young
## 2 2 10/28/08 F 45 3 5 2 5 5 Young
## 3 3 10/1/08 F 25 3 5 5 5 2 Young
## 4 4 10/12/08 M 39 3 3 4 NA NA Young
## 5 5 5/1/09 F 99 2 2 1 2 1 Elder
用包plyr
同时修改多个变量名
library(plyr)
leadership <- rename(leadership,c(manager="managerID", date="testDate"))
## The following `from` values were not present in `x`: date
与SAS等程序不同,R中字符型和数值 型数据使用的缺失值符号是相同的。
# Applying the is.na() function
is.na(leadership[, 6:10])
## q2 q3 q4 q5 agecat
## [1,] FALSE FALSE FALSE FALSE FALSE
## [2,] FALSE FALSE FALSE FALSE FALSE
## [3,] FALSE FALSE FALSE FALSE FALSE
## [4,] FALSE FALSE TRUE TRUE FALSE
## [5,] FALSE FALSE FALSE FALSE FALSE
将99岁的年龄值重编码为缺失值:
leadership$age[leadership$age == 99] <- NA
也可以这样:
# Recode 99 to missing for the variable age
leadership[age == 99, "age"] <- NA
leadership
## managerID testDate gender age q1 q2 q3 q4 q5 agecat
## 1 1 10/24/08 M 32 5 4 5 5 5 Young
## 2 2 10/28/08 F 45 3 5 2 5 5 Young
## 3 3 10/1/08 F 25 3 5 5 5 2 Young
## 4 4 10/12/08 M 39 3 3 4 NA NA Young
## 5 5 5/1/09 F NA 2 2 1 2 1 Elder
# Excluding missing values from analyses
x <- c(1, 2, NA, 3)
y <- x[1] + x[2] + x[3] + x[4];y
## [1] NA
z <- sum(x);z
## [1] NA
na.rm=TRUE
选项,可以在计算之前移除缺失值:
x <- c(1, 2, NA, 3)
y <- sum(x, na.rm=TRUE);y
## [1] 6
函数na.omit()
移除所有含有缺失值的观测
# - Using na.omit() to delete incomplete observations
leadership
## managerID testDate gender age q1 q2 q3 q4 q5 agecat
## 1 1 10/24/08 M 32 5 4 5 5 5 Young
## 2 2 10/28/08 F 45 3 5 2 5 5 Young
## 3 3 10/1/08 F 25 3 5 5 5 2 Young
## 4 4 10/12/08 M 39 3 3 4 NA NA Young
## 5 5 5/1/09 F NA 2 2 1 2 1 Elder
newdata <- na.omit(leadership)
newdata
## managerID testDate gender age q1 q2 q3 q4 q5 agecat
## 1 1 10/24/08 M 32 5 4 5 5 5 Young
## 2 2 10/28/08 F 45 3 5 2 5 5 Young
## 3 3 10/1/08 F 25 3 5 5 5 2 Young
as.Date()
转化为以数值形式存储的日期变量:
# Converting character values to dates
mydates <- as.Date(c("2007-06-22", "2004-02-13"))#将默认格式的字符型数据转换为对应日期
使用mm/dd/yyyy的格式读取数据:
strDates <- c("01/05/1965", "08/16/1975")
dates <- as.Date(strDates, "%m/%d/%Y");dates
## [1] "1965-01-05" "1975-08-16"
# Woring with formats
today <- Sys.Date();today
## [1] "2016-09-01"
format(today, format="%B %d %Y")
## [1] "九月 01 2016"
format(today, format="%A")
## [1] "星期四"
# Calculations with with dates
startdate <- as.Date("2004-02-13")
enddate <- as.Date("2009-06-22")
enddate - startdate
## Time difference of 1956 days
# Date functions and formatted printing
today <- Sys.Date()
dob <- as.Date("1956-10-12")
difftime(today, dob, units="weeks")
## Time difference of 3125 weeks
# Listing 4.5 - Converting from one data type to another
a <- c(1,2,3)
a
## [1] 1 2 3
is.numeric(a)
## [1] TRUE
is.vector(a)
## [1] TRUE
a <- as.character(a)
a
## [1] "1" "2" "3"
is.numeric(a)
## [1] FALSE
is.vector(a)
## [1] TRUE
is.character(a)
## [1] TRUE
依经理人的年龄升序排序:
# Sorting a dataset
newdata <- leadership[order(leadership$age),]
将各行依女性到男性、同样性别中按年龄升序排序:
attach(leadership)
## The following objects are masked _by_ .GlobalEnv:
##
## age, gender, q1, q2, q3, q4, q5
newdata <- leadership[order(gender, age),];newdata
## managerID testDate gender age q1 q2 q3 q4 q5 agecat
## 3 3 10/1/08 F 25 3 5 5 5 2 Young
## 2 2 10/28/08 F 45 3 5 2 5 5 Young
## 5 5 5/1/09 F NA 2 2 1 2 1 Elder
## 1 1 10/24/08 M 32 5 4 5 5 5 Young
## 4 4 10/12/08 M 39 3 3 4 NA NA Young
detach(leadership)
依经理人的性别和年龄降序排序:
attach(leadership)
## The following objects are masked _by_ .GlobalEnv:
##
## age, gender, q1, q2, q3, q4, q5
newdata <-leadership[order(gender, -age),];newdata
## managerID testDate gender age q1 q2 q3 q4 q5 agecat
## 5 5 5/1/09 F NA 2 2 1 2 1 Elder
## 2 2 10/28/08 F 45 3 5 2 5 5 Young
## 3 3 10/1/08 F 25 3 5 5 5 2 Young
## 4 4 10/12/08 M 39 3 3 4 NA NA Young
## 1 1 10/24/08 M 32 5 4 5 5 5 Young
detach(leadership)
变量的选择的不同方法:
# Selecting variables
newdata <- leadership[, c(6:10)]
myvars <- c("q1", "q2", "q3", "q4", "q5")
newdata <-leadership[myvars]
myvars <- paste("q", 1:5, sep="")
newdata <- leadership[myvars]
选择除去"q3", "q4"
的变量:
# Dropping variables
myvars <- names(leadership) %in% c("q3", "q4")
myvars
## [1] FALSE FALSE FALSE FALSE FALSE FALSE TRUE TRUE FALSE FALSE
leadership[!myvars]
## managerID testDate gender age q1 q2 q5 agecat
## 1 1 10/24/08 M 32 5 4 5 Young
## 2 2 10/28/08 F 45 3 5 5 Young
## 3 3 10/1/08 F 25 3 5 2 Young
## 4 4 10/12/08 M 39 3 3 NA Young
## 5 5 5/1/09 F NA 2 2 1 Elder
选择男性年龄大于30的经理人:
# - Selecting observations
newdata <- leadership[1:3,]
newdata <- leadership[leadership$gender=="M" &
leadership$age > 30,]
attach(leadership)
## The following objects are masked _by_ .GlobalEnv:
##
## age, gender, q1, q2, q3, q4, q5
newdata <- leadership[gender=='M' & age > 30,]
detach(leadership)
将研究范围限定在2009年1月1日到2009年12月31日之间:
# Selecting observations based on dates
startdate <- as.Date("2009-01-01")
enddate <- as.Date("2009-10-31")
newdata <- leadership[which(leadership$date >= startdate &leadership$date <= enddate),]
这里需要注意which()
的用法。
选择所有age值大于等于35或age值小于24的行,保留了变量q1到q4:
# Using the subset() function
newdata <- subset(leadership, age >= 35 | age < 24,
select=c(q1, q2, q3, q4))
选择所有25岁以上的男性,并保留了变量gender到q4(gender、q4和其间所有列)
newdata <- subset(leadership, gender=="M" & age > 25,
select=gender:q4)
sample()
的用法:
从leadership数据集中随机抽取一个大小为3的样本:
mysample <- leadership[sample(1:nrow(leadership), 3, replace=FALSE),]
使用SQL 语句操作数据框:
# - Using SQL statements to manipulate data frames
library(sqldf)
## Loading required package: gsubfn
## Loading required package: proto
## Loading required package: RSQLite
## Loading required package: DBI
newdf <- sqldf("select * from mtcars where carb=1 order by mpg",
row.names=TRUE)
## Loading required package: tcltk
newdf
## mpg cyl disp hp drat wt qsec vs am gear carb
## Valiant 18.1 6 225.0 105 2.76 3.460 20.22 1 0 3 1
## Hornet 4 Drive 21.4 6 258.0 110 3.08 3.215 19.44 1 0 3 1
## Toyota Corona 21.5 4 120.1 97 3.70 2.465 20.01 1 0 3 1
## Datsun 710 22.8 4 108.0 93 3.85 2.320 18.61 1 1 4 1
## Fiat X1-9 27.3 4 79.0 66 4.08 1.935 18.90 1 1 4 1
## Fiat 128 32.4 4 78.7 66 4.08 2.200 19.47 1 1 4 1
## Toyota Corolla 33.9 4 71.1 65 4.22 1.835 19.90 1 1 4 1
sqldf("select avg(mpg) as avg_mpg, avg(disp) as avg_disp, gear
from mtcars where cyl in (4, 6) group by gear")
## avg_mpg avg_disp gear
## 1 20.33 201.0 3
## 2 24.53 123.0 4
## 3 25.37 120.1 5
从数据框mtcars中选择所有的变量(列),保留那些使用化油器(carb)的车型(行),按照mpg对车型进行了升序排序,并将结果保存为数据框newdf。参数row.names=TRUE将原始数据框中的行名延续到了新数据框中。
返回课程主页。