# R Language Study Notes

Reference book for this series: *R in Action* (Chinese edition: 《R语言实战》)

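# I. Getting Started

### 1. Workspace

`getwd()`: show the current working directory
`setwd("mydirectory")`: set the current working directory to mydirectory
`ls()`: list the objects in the current workspace
`savehistory("myfile")`: save the command history to the file myfile
`save.image("myfile")`: save the workspace to the file myfile
`q()`: quit R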
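As a minimal sketch of how these calls fit together, the session below switches to a temporary directory, saves the workspace there, and then restores the original directory. The object `x` and the file name `demo.RData` are made up for illustration.

```r
old_dir <- getwd()                   # remember the starting directory
setwd(tempdir())                     # switch to a scratch directory
x <- c(1, 2, 3)                      # hypothetical object, for illustration only
ls()                                 # lists the objects currently defined
save.image("demo.RData")             # hypothetical file name
saved <- file.exists("demo.RData")   # TRUE if the workspace file was written
setwd(old_dir)                       # restore the original working directory
```

Saving the old directory first makes it easy to undo `setwd()` once the work is done.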

### 2. Packages

`install.packages()`: install a package
`library()`: load a package

### 3. Data Structures

(1) Vectors

``````
a <- c(1, 2, 3, 4, 5)
b <- c("one", "two", "three")
``````

(2) Matrices

• Create a 5×4 matrix
``````
y <- matrix(1:20, nrow=5, ncol=4)
``````
• A 2×2 matrix filled by row
``````
cells <- c(1, 26, 24, 68)
rnames <- c("R1", "R2")
cnames <- c("C1", "C2")
mymatrix <- matrix(cells, nrow=2, ncol=2, byrow=TRUE, dimnames=list(rnames, cnames))
``````
• A 2×2 matrix filled by column
``````
mymatrix <- matrix(cells, nrow=2, ncol=2, byrow=FALSE, dimnames=list(rnames, cnames))
``````
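To make the difference between the two fill orders concrete, here is a short sketch that builds both matrices from the same `cells` vector and looks up elements by position and by name. The names `by_row` and `by_col` are introduced here only for illustration.

```r
cells  <- c(1, 26, 24, 68)
rnames <- c("R1", "R2")
cnames <- c("C1", "C2")

by_row <- matrix(cells, nrow = 2, ncol = 2, byrow = TRUE,
                 dimnames = list(rnames, cnames))
by_col <- matrix(cells, nrow = 2, ncol = 2, byrow = FALSE,
                 dimnames = list(rnames, cnames))

by_row[1, 2]      # 26: the first row is filled first
by_col[1, 2]      # 24: the first column is filled first
by_row["R2", ]    # the whole second row, selected by name
```

With the same input, the column-filled matrix is the transpose of the row-filled one.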

(3) Arrays

``````
dim1 <- c("A1", "A2")
dim2 <- c("B1", "B2", "B3")
dim3 <- c("C1", "C2", "C3", "C4")
z <- array(1:24, c(2, 3, 4), dimnames=list(dim1, dim2, dim3))
``````
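The sketch below rebuilds the same array and shows a few ways to index it. Elements are filled with the first dimension varying fastest, which is why position `[1, 2, 3]` holds 15.

```r
dim1 <- c("A1", "A2")
dim2 <- c("B1", "B2", "B3")
dim3 <- c("C1", "C2", "C3", "C4")
z <- array(1:24, c(2, 3, 4), dimnames = list(dim1, dim2, dim3))

dim(z)                   # 2 3 4
z[1, 2, 3]               # 15
z["A1", "B2", "C3"]      # the same element, selected by name
z[, , "C1"]              # the first 2 x 3 slice
```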

(4) Data frames

``````
patientID <- c(1, 2, 3, 4)
age <- c(25, 34, 28, 52)
diabetes <- c("Type1", "Type2", "Type1", "Type1")
status <- c("Poor", "Improved", "Excellent", "Poor")
patientdata <- data.frame(patientID, age, diabetes, status)
``````

``````
patientdata
patientdata[1:2]
patientdata[c("diabetes", "status")]
patientdata$age
``````

`$` selects a specific variable from a given data frame:

``````
table(patientdata$diabetes, patientdata$status)
``````

``````
summary(mtcars$mpg)
plot(mtcars$mpg, mtcars$disp)
plot(mtcars$mpg, mtcars$wt)
``````

``````
attach(mtcars)
summary(mpg)
plot(mpg, disp)
plot(mpg, wt)
detach(mtcars)
``````

``````
with(mtcars, {
  print(summary(mpg))   # print explicitly: only the last value in {} is returned
  plot(mpg, disp)
  plot(mpg, wt)
})
``````

### 4. Data Input

(1) Entering data from the keyboard

``````
mydata <- data.frame(age=numeric(0), gender=character(0), weight=numeric(0))
mydata <- edit(mydata)
``````

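(2) Importing data from a delimited text file

``````
# General form; `file` is the path to the delimited text file
# mydata <- read.table(file, header=TRUE, sep=",")
``````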
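As a hedged illustration of `read.table()`, the sketch below writes a tiny comma-delimited file to a temporary location and reads it back. The file contents and the column names `id`, `name`, and `score` are invented for this example.

```r
csv_path <- tempfile(fileext = ".csv")   # hypothetical file, for illustration only
writeLines(c("id,name,score",
             "1,Ann,90.5",
             "2,Bob,78"), csv_path)

# header = TRUE: the first line holds the variable names
# sep = ",": fields are separated by commas
grades <- read.table(csv_path, header = TRUE, sep = ",",
                     stringsAsFactors = FALSE)
str(grades)        # shows the type inferred for each column
unlink(csv_path)   # remove the temporary file
```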

### 5. Annotating Datasets

(1) Variable labels

``````
names(patientdata)[2] <- "Age at hospitalization (in years)"
``````

(2) Value labels

``````
patientdata$gender <- factor(patientdata$gender, levels=c(1,2), labels=c("male", "female"))
``````
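The `patientdata` frame built earlier has no `gender` column, so the following sketch uses a made-up vector of codes (`gender_code`, assumed to mean 1 = male, 2 = female) to show what `factor()` does with `levels` and `labels`.

```r
gender_code <- c(1, 2, 2, 1)     # hypothetical codes: 1 = male, 2 = female
gender <- factor(gender_code, levels = c(1, 2),
                 labels = c("male", "female"))

gender               # male female female male
levels(gender)       # "male" "female"
table(gender)        # counts per label
```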

# II. Getting Started with Graphs

### 1. Working with Graphs

``````
attach(mtcars)
plot(wt, mpg)
abline(lm(mpg~wt))
title("Regression of MPG on Weight")
detach(mtcars)
``````

``````
pdf("mygraph.pdf")
attach(mtcars)
plot(wt, mpg)
abline(lm(mpg~wt))
title("Regression of MPG on Weight")
detach(mtcars)
dev.off()
``````

### 2. A Simple Example

``````
dose <- c(20,30,40,45,60)
drugA <- c(16,20,27,40,60)
drugB <- c(15,18,25,31,40)
``````

``````
plot(dose, drugA, type="b")
``````

`plot()` is a generic function in R for plotting objects.

### 3. Graphical Parameters

``````
opar <- par(no.readonly=TRUE)
par(lty=2, pch=17)
plot(dose, drugA, type="b")
par(opar)
``````

`lty = 2`: changes the default line type to dashed
`pch = 17`: changes the default point symbol to a solid triangle
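The save-and-restore pattern around `par()` can be checked with a small sketch. It assumes a null PDF device (`pdf(NULL)`) so that nothing is written to disk; the variables `changed` and `restored` are introduced here only to inspect the settings.

```r
pdf(NULL)                          # null device: no file is created
dose  <- c(20, 30, 40, 45, 60)
drugA <- c(16, 20, 27, 40, 60)

opar <- par(no.readonly = TRUE)    # save the current modifiable settings
par(lty = 2, pch = 17)             # dashed lines, solid triangles
plot(dose, drugA, type = "b")
changed <- par("pch")              # plotting symbol in effect for this plot
par(opar)                          # restore the saved settings
restored <- par("pch")             # back to the value stored in opar
dev.off()
```

Restoring `opar` keeps later plots from silently inheriting the modified settings.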

# III. Basic Data Management

``````
manager <- c(1, 2, 3, 4, 5)
date <- c("10/24/08", "10/28/08", "10/1/08", "10/12/08", "5/1/09")
country <- c("US", "US", "UK", "UK", "UK")
gender <- c("M", "F", "F", "M", "F")
age <- c(32, 45, 25, 39, 99)
q1 <- c(5, 3, 3, 3, 2)
q2 <- c(4, 5, 5, 3, 2)
q3 <- c(5, 2, 5, 4, 1)
q4 <- c(5, 5, 5, NA, 2)
q5 <- c(5, 5, 2, NA, 1)
leadership <- data.frame(manager, date, country, gender, age, q1, q2, q3, q4, q5, stringsAsFactors=FALSE)
``````

### 2. Creating New Variables

``````
mydata <- data.frame(x1 = c(2, 2, 6, 4), x2 = c(3, 4, 2, 8))
mydata$sumx <- mydata$x1 + mydata$x2
mydata$meanx <- (mydata$x1 + mydata$x2)/2
``````

``````
attach(mydata)
mydata$sumx <- x1 + x2
mydata$meanx <- (x1 + x2)/2
detach(mydata)
``````

``````
mydata <- transform(mydata, sumx = x1 + x2, meanx = (x1 + x2)/2)
``````

### 3. Recoding Variables

(1) Recode an age of 99 as a missing value

``````
leadership$age[leadership$age == 99] <- NA
``````

(2) Create the variable agecat (Young, Middle Aged, Elder)

``````
leadership$agecat[leadership$age > 75] <- "Elder"
``````

``````
leadership <- within(leadership, {
  agecat <- NA
  agecat[age > 75] <- "Elder"
  agecat[age >= 55 & age <= 75] <- "Middle Aged"
  agecat[age < 55] <- "Young"
})
``````
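Base R's `cut()` offers another way to bin a numeric variable. This is a different technique from the logical-index recoding used in these notes, shown only as a sketch: the break at 54 reproduces the `age < 55` rule only under the assumption that ages are whole numbers, and the `age` vector is made up.

```r
age <- c(32, 45, 25, 39, NA, 60, 80)   # hypothetical ages, whole numbers assumed

# Intervals are closed on the right: (-Inf, 54], (54, 75], (75, Inf)
agecat <- cut(age,
              breaks = c(-Inf, 54, 75, Inf),
              labels = c("Young", "Middle Aged", "Elder"))

as.character(agecat)   # a missing age stays missing in the result
```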

### 4. Renaming Variables

(1) Rename variables in the interactive editor:

``````
fix(leadership)
``````

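(2) Rename variables programmatically with the `rename()` function:

``````
library(reshape)
leadership <- rename(leadership, c(manager = "managerID"))   # example: c(oldname = "newname")
``````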

(3) Rename variables with the `names()` function:

``````
names(leadership)[2] <- "testDate"
names(leadership)[6:10] <- c("item1", "item2", "item3", "item4", "item5")
``````

### 5. Missing Values

(1) Detect missing values:

``````
is.na()
``````
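As a small sketch of what `is.na()` returns, the example below applies it to a vector and to a data frame. The data frame `scores` is made up for illustration and reuses the `q4` and `q5` values from the leadership example.

```r
y <- c(1, 2, 3, NA)
is.na(y)                  # FALSE FALSE FALSE TRUE
sum(is.na(y))             # 1: number of missing values

scores <- data.frame(q4 = c(5, 5, 5, NA, 2),
                     q5 = c(5, 5, 2, NA, 1))   # hypothetical data frame
is.na(scores)             # logical matrix with the same shape as scores
colSums(is.na(scores))    # number of missing values per column
```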

(2) Recode certain values as missing:

``````
leadership$age[leadership$age == 99] <- NA
``````

(3) Use `na.rm=TRUE` to drop missing values before a computation and calculate with the remaining values:

``````
x <- c(1, 2, NA, 3)
y <- sum(x)               # here y is NA
y <- sum(x, na.rm=TRUE)   # here y equals 6
``````

(4) Use `na.omit()` to remove incomplete observations (listwise deletion)

`na.omit()` deletes every row that contains at least one missing value:

``````
newdata <- na.omit(leadership)
``````
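The effect of listwise deletion is easy to see on a small data frame. In this sketch `survey` is a made-up frame in which only row 4 has missing values.

```r
survey <- data.frame(id = 1:5,
                     q4 = c(5, 5, 5, NA, 2),
                     q5 = c(5, 5, 2, NA, 1))   # hypothetical data

complete.cases(survey)    # TRUE TRUE TRUE FALSE TRUE
cleaned <- na.omit(survey)
nrow(cleaned)             # 4: row 4 was dropped
cleaned$id                # 1 2 3 5
```

Because a single `NA` removes the whole row, listwise deletion can discard a lot of data when missing values are spread across many observations.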

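### 6. Date Values

``````
# General form: as.Date(x, "input_format")

# The default input format is yyyy-mm-dd
mydates <- as.Date(c("2007-06-22", "2004-02-13"))

strDates <- c("01/05/1965", "08/16/1975")
dates <- as.Date(strDates, "%m/%d/%Y")

myformat <- "%m/%d/%y"   # %y is a two-digit year, %Y a four-digit year
``````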

``````
strDates <- as.character(dates)
``````
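A short sketch of the round trip between strings and dates, using the same two example dates. Only numeric format codes are used so that the output does not depend on the locale.

```r
strDates <- c("01/05/1965", "08/16/1975")
dates <- as.Date(strDates, "%m/%d/%Y")    # parse month/day/four-digit year

dates                               # "1965-01-05" "1975-08-16"
format(dates, "%d/%m/%y")           # "05/01/65" "16/08/75"
as.numeric(format(dates, "%Y"))     # 1965 1975
dates[2] - dates[1]                 # time difference in days
```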

### 7. Type Conversion

``````
a <- c(1, 2, 3)
is.numeric(a)          # TRUE
a <- as.character(a)   # "1" "2" "3"
``````
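The `is.` functions test a type and the `as.` functions convert to it. The sketch below walks a vector from numeric to character and back; `suppressWarnings()` is used only to keep the failed conversion quiet.

```r
a <- c(1, 2, 3)
is.numeric(a)            # TRUE

b <- as.character(a)     # "1" "2" "3"
is.numeric(b)            # FALSE
is.character(b)          # TRUE

back <- as.numeric(b)    # 1 2 3 again
bad <- suppressWarnings(as.numeric("abc"))   # NA: the text is not a number
```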

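### 8. Sorting Data

``````
# Sort rows by manager age, ascending
newdata <- leadership[order(leadership$age),]

# Sort rows from female to male, then by age ascending within each gender
newdata <- leadership[order(leadership$gender, leadership$age),]

# Sort rows by gender, then by age descending within each gender
newdata <- leadership[order(leadership$gender, -leadership$age),]
``````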

### 9. Merging Datasets

(1) Adding columns

``````
# Merge dataframeA and dataframeB by ID
total <- merge(dataframeA, dataframeB, by="ID")

# Merge the two data frames by ID and Country
total <- merge(dataframeA, dataframeB, by=c("ID", "Country"))
``````

``````
total <- cbind(A, B)   # combine objects A and B horizontally
``````

(2) Adding rows

``````
total <- rbind(dataframeA, dataframeB)
``````
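The three functions behave differently when keys or shapes do not line up, which a small sketch can show. The data frames `dfA` and `dfB` and their columns are invented for this example.

```r
dfA <- data.frame(ID = c(1, 2, 3), score = c(90, 85, 70))           # hypothetical
dfB <- data.frame(ID = c(2, 3, 4), country = c("US", "UK", "UK"))   # hypothetical

# merge() matches rows on the key; by default only shared IDs are kept
inner <- merge(dfA, dfB, by = "ID")               # IDs 2 and 3

# all = TRUE keeps unmatched rows and fills the gaps with NA
outer <- merge(dfA, dfB, by = "ID", all = TRUE)   # IDs 1 to 4

# cbind() pastes columns side by side without matching on a key
side <- cbind(dfA, passed = dfA$score >= 75)

# rbind() stacks rows; both data frames need the same variables
stacked <- rbind(dfA, data.frame(ID = 5, score = 60))
```

Since `cbind()` does no matching, both objects should have the same number of rows in the same order.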

### 10. Subsetting Datasets

(1) Selecting (keeping) variables

``````
newdata <- leadership[,c(6:10)]
``````

(2) Excluding (dropping) variables

``````
myvars <- names(leadership) %in% c("q3", "q4")
newdata <- leadership[!myvars]   # keep every column except q3 and q4
``````

(3) Selecting observations

``````
newdata <- leadership[1:3,]

newdata <- leadership[which(leadership$gender=='M' & leadership$age > 30),]
``````

(4) The `subset()` function

``````
newdata <- subset(leadership, age >= 35 | age < 24, select=c(q1, q2, q3, q4))
newdata <- subset(leadership, gender=="M" & age > 25, select=gender:q4)
``````

(5) Random sampling

``````
mysample <- leadership[sample(1:nrow(leadership), 3, replace=FALSE),]
``````
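Random samples change from run to run unless the random number generator is seeded. The sketch below uses `set.seed()` with an arbitrary seed and a made-up data frame `df` to show that the same seed reproduces the same rows.

```r
df <- data.frame(id = 1:10, value = (1:10)^2)   # hypothetical data

set.seed(1234)                                  # arbitrary seed
idx <- sample(1:nrow(df), 3, replace = FALSE)   # three distinct row numbers
mysample <- df[idx, ]

set.seed(1234)                                  # same seed again
idx2 <- sample(1:nrow(df), 3, replace = FALSE)
identical(idx, idx2)                            # TRUE: the draw is reproducible
```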

### 11. Manipulating Data Frames with SQL Statements

``````
library(sqldf)
newdf <- sqldf("select * from mtcars where carb = 1 order by mpg", row.names=TRUE)
``````

The sqldf package is a handy data-management helper in R.
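For comparison, the same selection can be written in base R without sqldf. This is only a sketch of an equivalent filter and sort on the built-in `mtcars` data; the names `carb1` and `newdf_base` are introduced here for illustration.

```r
# where carb = 1
carb1 <- mtcars[mtcars$carb == 1, ]

# order by mpg (ascending)
newdf_base <- carb1[order(carb1$mpg), ]

newdf_base[, c("mpg", "carb")]   # row names keep the car models
```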
