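R Strings and Factors

Ordered Factors

Some data are categorical, yet their categories can be meaningfully compared. Such ordinal data can be stored in an ordered factor. The most common example is survey data, where a satisfaction rating might be very poor (worst), poor (bad), average (so-so), good (good), or excellent (perfect). An ordinary factor built with explicit levels looks like this: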

choices <- c("worst", "bad", "so-so", "good", "perfect")
samples <- sample(choices, 10, replace = TRUE)
samples.factor <- factor(samples, levels = choices)
samples.factor

 [1] perfect so-so   perfect good    perfect
 [6] worst   worst   so-so   good    so-so  
Levels: worst bad so-so good perfect

For data like this, use the ordered function instead to create an ordered factor (passing ordered = TRUE to the factor function works as well):

samples.ordered <- ordered(samples, levels = choices)
samples.ordered

 [1] perfect so-so   perfect good    perfect worst  
 [7] worst   so-so   good    so-so  
Levels: worst < bad < so-so < good < perfect

An ordered factor is still a factor:

is.factor(samples.ordered)

[1] TRUE

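Ordered factors are used in the same way as ordinary factors. The difference is that their levels carry an order, so comparison operators such as < and functions such as min and max are meaningful for them.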
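As a small illustration of what the ordering adds, the sketch below uses a hand-written rating vector (the name ratings and its values are made up for this example, not taken from the random sample above) and compares how an ordered factor and a plain factor respond to comparisons.

```r
choices <- c("worst", "bad", "so-so", "good", "perfect")

# Hypothetical ratings, fixed by hand so the result does not depend on sample()
ratings <- ordered(c("good", "bad", "perfect", "so-so", "bad"), levels = choices)

# Comparison operators follow the level order, not alphabetical order
ratings >= "so-so"

# Summary functions such as min(), max() and range() work on ordered factors
max(ratings)

# Count the responses rated better than "bad"
sum(ratings > "bad")

# The same comparison on a plain factor is not meaningful:
# it yields NA values together with a warning
plain <- factor(as.character(ratings), levels = choices)
suppressWarnings(plain > "bad")
```

The warning is suppressed here only to keep the output short; in real code it is a useful hint that the factor should have been ordered.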

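Converting Continuous Variables to Discrete Variables

To get a rough picture of how a set of continuous values is distributed, convert them into discrete groups. This makes the overall distribution easier to see at a glance: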

# breaks = 4 splits the range of the data into four equal-width intervals
grouped <- cut(iris$Sepal.Length, breaks = 4)
head(grouped)

[1] (4.3,5.2] (4.3,5.2] (4.3,5.2] (4.3,5.2] (4.3,5.2] (5.2,6.1]
Levels: (4.3,5.2] (5.2,6.1] (6.1,7] (7,7.9]

table(grouped)

grouped
(4.3,5.2] (5.2,6.1]   (6.1,7]   (7,7.9] 
       45        50        43        12

The resulting counts are essentially what the hist function computes when it bins data for a histogram.
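When the bin edges matter, they can also be passed explicitly. The following is a minimal sketch, assuming whole-number edges from 4 to 8 (chosen here for illustration, not taken from the text above); it also checks the resulting counts against those computed by hist with the same breaks.

```r
# Illustrative bin edges; every Sepal.Length value lies strictly between 4 and 8
edges <- c(4, 5, 6, 7, 8)

# right = TRUE (the default) closes each interval on the right: (4,5], (5,6], ...
binned <- cut(iris$Sepal.Length, breaks = edges)
counts <- table(binned)
counts

# hist() also uses right-closed intervals by default;
# plot = FALSE returns the bin counts without drawing anything
h <- hist(iris$Sepal.Length, breaks = edges, plot = FALSE)
h$counts

# Values falling outside the edges would become NA, so it is worth checking
sum(is.na(binned))
```

Explicit edges make the groups reproducible across data sets, whereas a single number of breaks adapts the edges to the range of whatever data is passed in.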

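Converting Discrete Variables to Continuous Variables

Sometimes the raw data contain errors (typos, for example) that cause R to read a numeric column as strings and then convert it to a factor. Calling as.numeric directly on such a variable does not raise an error, but it silently returns the wrong values: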

raw <- data.frame(
  x = c("1.23", "4..56", "7.89"),
  # Since R 4.0.0 strings are no longer converted to factors by default
  stringsAsFactors = TRUE
)
as.numeric(raw$x)

[1] 1 2 3

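Here as.numeric simply returns the integer codes that the factor uses internally to represent its levels, which is not what we want. Instead, convert the factor to character first and then to numeric: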

as.numeric(as.character(raw$x))

[1] 1.23   NA 7.89
Warning message:
NAs introduced by coercion

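According to the R FAQ, the following form is preferable because it is slightly more efficient: it converts each distinct level only once instead of converting every element: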

as.numeric(levels(raw$x))[as.integer(raw$x)]

[1] 1.23   NA 7.89
Warning message:
NAs introduced by coercion
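To apply this conversion repeatedly and also find out which entries failed to parse, the idiom can be wrapped in a small helper. The function name factor_to_numeric below is hypothetical, not part of base R; this is just one way to package the FAQ idiom.

```r
# Hypothetical helper: convert a factor whose levels look like numbers
factor_to_numeric <- function(f) {
  stopifnot(is.factor(f))
  # Convert each distinct level once, then index by the internal integer codes
  suppressWarnings(as.numeric(levels(f)))[as.integer(f)]
}

f <- factor(c("1.23", "4..56", "7.89", "1.23"))
nums <- factor_to_numeric(f)
nums

# Entries that were present in the factor but could not be parsed as numbers
bad <- as.character(f)[is.na(nums) & !is.na(f)]
bad
```

The coercion warning is suppressed on purpose in this sketch, because the unparseable entries are reported through bad instead.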

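Generating Factor Levels

The gl function generates a factor that follows a specified pattern. The first argument, n, is the number of levels, and the second, k, is the number of consecutive repetitions of each level: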

gl(n = 2, k = 8)

 [1] 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2
Levels: 1 2

The labels argument sets the names of the levels:

gl(n = 2, k = 8, labels = c("Control", "Treat"))

 [1] Control Control Control Control Control Control Control Control Treat  
[10] Treat   Treat   Treat   Treat   Treat   Treat   Treat  
Levels: Control Treat

The length argument sets the length of the resulting factor; the pattern is repeated as needed:

gl(n = 2, k = 2, length = 8,
  labels = c("Control", "Treat"))

[1] Control Control Treat   Treat   Control Control Treat   Treat  
Levels: Control Treat
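For intuition, the pattern produced by gl can be reproduced with rep and factor. The sketch below only illustrates that correspondence, followed by a small balanced design table whose column names (subject, group) are invented for the example.

```r
# gl(n, k) repeats each of the n levels k times in a row
g <- gl(n = 3, k = 2)
manual <- factor(rep(1:3, each = 2))
all(as.character(g) == as.character(manual))

# With length, the block pattern is recycled until the requested length is reached
g_long <- gl(n = 2, k = 2, length = 8, labels = c("Control", "Treat"))
manual_long <- factor(rep_len(rep(c("Control", "Treat"), each = 2), 8),
                      levels = c("Control", "Treat"))
all(as.character(g_long) == as.character(manual_long))

# Typical use: label the rows of a balanced experiment (made-up column names)
design <- data.frame(
  subject = 1:6,
  group = gl(n = 2, k = 3, labels = c("Control", "Treat"))
)
table(design$group)
```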

Combining Factors

To combine two (or more) factors, use the interaction function:

a <- gl(2, 4, 8)
a

[1] 1 1 1 1 2 2 2 2
Levels: 1 2

b <- gl(2, 2, 8, labels = c("ctrl", "treat"))
b

[1] ctrl  ctrl  treat treat ctrl  ctrl  treat treat
Levels: ctrl treat

interaction(a, b)

[1] 1.ctrl  1.ctrl  1.treat 1.treat 2.ctrl  2.ctrl  2.treat 2.treat
Levels: 1.ctrl 2.ctrl 1.treat 2.treat

interaction also handles more than two factors, and the sep argument sets the separator used when joining the level names:

s <- gl(2, 1, 8, labels = c("M", "F"))
s

[1] M F M F M F M F
Levels: M F

interaction(a, b, s, sep = ":")

[1] 1:ctrl:M  1:ctrl:F  1:treat:M 1:treat:F 2:ctrl:M  2:ctrl:F  2:treat:M
[8] 2:treat:F
8 Levels: 1:ctrl:M 2:ctrl:M 1:treat:M 2:treat:M 1:ctrl:F ... 2:treat:F
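Two further arguments of interaction are often useful: lex.order controls the ordering of the combined levels, and drop removes combinations that never occur in the data. The sketch below reuses a and b from above and adds two small made-up factors (x and y are names chosen for this example).

```r
a <- gl(2, 4, 8)
b <- gl(2, 2, 8, labels = c("ctrl", "treat"))

# Default: the first factor varies fastest in the level order
levels(interaction(a, b))

# lex.order = TRUE: levels are sorted by the first factor, then by the second
levels(interaction(a, b, lex.order = TRUE))

# Made-up factors in which the combination "b.v" never occurs
x <- factor(c("a", "a", "b"))
y <- factor(c("u", "v", "u"))
nlevels(interaction(x, y))               # all 2 x 2 combinations are kept
nlevels(interaction(x, y, drop = TRUE))  # unused combinations are dropped

# Count observations per combined group
table(interaction(a, b, sep = ":"))
```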


G. T. Wang
