Text Mining Preprocessing

Chinese, R, and quanteda

Jul 28, 2018

Workflow

Scraping the Data

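This post uses the Chinese-language blog posts of Yihui, a software engineer at RStudio, as practice material. The first step is to collect the post URLs, so start at the blog's post listing page (https://yihui.name/cn/) and inspect it with the browser's developer tools (press Ctrl + Shift + I to open them).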

Next, use the rvest package to extract every post link from the page and save the links to list_of_post.txt:

library(dplyr)
library(rvest)

list_of_posts <- read_html("https://yihui.name/cn/") %>% 
    html_nodes(".archive") %>% # the listing sits under div.archive
    html_nodes("p") %>% # each post title is a <p> under that <div>
    html_nodes("a") %>% html_attr("href") # the post link is the <a> inside the <p>

readr::write_lines(list_of_posts, "yihui/list_of_post.txt")

head(list_of_posts, 2)

[1] "/cn/2018/10/middle-school-teachers/"
[2] "/cn/2018/10/potato-pancake/"

tail(list_of_posts, 2)

[1] "/cn/2005/01/rtx/"      "/cn/2005/01/20-13-00/"

length(list_of_posts)

[1] 1097

There are 1,097 posts in total. Judging from the output above, they run from January 2005 to October 2018.

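Because there are so many posts, only a subset will be downloaded below, to avoid putting too much load on the server. The pages can be fetched directly in R with rvest (see Preprocessing the Data below), but I recommend the wget command in Bash¹ instead: the pages are then saved locally, so rerunning the analysis does not download them again and burden the server.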

Before downloading, decide which post URLs to fetch and store them in sub_list:

library(stringr)
set.seed(2018) # set the random seed so the sample below is reproducible

idx <- str_detect(list_of_posts, "2018|2015|2010")
sub_list <- list_of_posts[idx]
sub_list <- sub_list[sample(seq_along(sub_list), 20)]  %>% # draw 20 posts
    str_replace_all(pattern = "^/", # turn site-relative links into full URLs
                    replacement = "https://yihui.name/") %>%
    str_replace_all(pattern = "/$", "/index.html")

readr::write_lines(sub_list, "yihui/sublist.txt")

# File names for the Bash script
sub_list %>%
    str_replace_all("https://yihui.name/cn/", "") %>%
    str_replace_all("/index.html", "") %>%
    str_replace_all("/", "-") %>% 
    str_replace_all("-$", "") %>%
    readr::write_lines("yihui/sublist_name.txt")
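To make the two rewriting steps concrete, here is a minimal base R sketch that traces a single link through them. The link is one of the entries shown in the head() output earlier; base R's sub() and gsub() stand in for the stringr calls, so this is an illustration of the transformation rather than the code used in the post.

```r
# One site-relative link taken from the listing output above
link <- "/cn/2018/10/potato-pancake/"

# Site-relative link -> full URL pointing at index.html
url <- sub("^/", "https://yihui.name/", link)
url <- sub("/$", "/index.html", url)

# Full URL -> flat file name: drop the prefix and suffix, then turn "/" into "-"
name <- sub("https://yihui.name/cn/", "", url, fixed = TRUE)
name <- sub("/index.html", "", name, fixed = TRUE)
name <- gsub("/", "-", name, fixed = TRUE)

print(url)   # "https://yihui.name/cn/2018/10/potato-pancake/index.html"
print(name)  # "2018-10-potato-pancake"
```

The URL goes into sublist.txt and the flat name into sublist_name.txt, so line i of one file always matches line i of the other.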

Downloading the Pages with Bash

If you cannot run Bash commands, skip this section.

To automate the downloads, I wrote a small Bash script, wget_list. Usage:

wget_list <url-file> <name-file>²
- <url-file>: one URL per line
- <name-file>: one name per line; each name corresponds to the URL on the same line of <url-file>

Here, the following commands download the pages:

mkdir -p yihui/html  # create the target folder if it does not exist yet
cd yihui/html
wget_list ../sublist.txt ../sublist_name.txt
cd -

wget_list:

#!/bin/bash

#<<< wget_list: download webpages listed in a file >>>#
### Argument 1 is the file of links, 1 url per row   ###
### Argument 2 is the file of names, 1 name per row  ###

file1=$1
file2=$2

## Get the number of lines in the link list
num_lines=$(wc -l < "$file1")

## Loop over the lines in file1, download each page & name it as listed in file2
for (( i=1; i<=num_lines; ++i )); do
     wget "$(sed -n "${i}p" "$file1")" \
         -O "$(sed -n "${i}p" "$file2")"
done
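For readers without Bash who still want local copies, the same idea can be sketched in R. The helper below is hypothetical and not part of the original post: download_list, fetch, and pause are names made up for this illustration. It assumes utils::download.file() is good enough for these pages, skips files that already exist, and waits between requests so the server is not hit repeatedly.

```r
# Hypothetical helper, not part of the original post: download each URL to
# dest_dir under the matching name, skipping files that already exist.
# `fetch` defaults to utils::download.file and can be swapped out for testing.
download_list <- function(urls, names, dest_dir,
                          fetch = utils::download.file, pause = 1) {
    stopifnot(length(urls) == length(names))
    dir.create(dest_dir, recursive = TRUE, showWarnings = FALSE)
    dest <- file.path(dest_dir, names)
    for (i in seq_along(urls)) {
        if (file.exists(dest[i])) next  # already downloaded: do not hit the server again
        fetch(urls[i], dest[i])
        Sys.sleep(pause)                # pause between requests
    }
    invisible(dest)
}

# Usage (not run here because it needs network access):
# download_list(readLines("yihui/sublist.txt"),
#               readLines("yihui/sublist_name.txt"),
#               "yihui/html")
```

Because existing files are skipped, rerunning the call after an interruption only fetches the pages that are still missing.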

Preprocessing the Data

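Before cleaning the data, the page structure has to be analyzed first (just as was done for the post listing page). Inspecting one post (the peer-review post loaded in the code below) reveals roughly the following pieces of information: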

path <- "https://yihui.name/cn/2015/11/peer-review/"
all <- read_html(path) %>%
    html_nodes("article")
header <- all %>% html_nodes("header")

title <- header %>%      # post title
    html_nodes("h1") %>% html_text()

post_date <- header %>%  # publication date
    html_node("h3") %>% html_text() %>%
    str_extract("201[0-9]-[0-9]{2}-[0-9]{2}")

article <- all %>%       # body text
    html_nodes("p") %>% 
    html_text() %>% paste(collapse = "\n") 
    # Collapse the chr vector into a single string to simplify
    # the data structure; "\n" keeps the paragraph breaks

num_sec <- all %>%      # number of paragraphs in the body
    html_nodes("p") %>% length

links <- all %>% html_nodes("p") %>% # links in the body  
    html_nodes("a") %>% html_attr("href")
link_text <- all %>% html_nodes("p") %>% # link text in the body
    html_nodes("a") %>% html_text()

library(tibble)
library(knitr) # provides kable(), used below
df <- tibble(title = title,   # tibble() replaces the deprecated data_frame()
             date = post_date,
             content = article,
             num_sec = num_sec,
             links = list(links),
             link_text = list(link_text)
             )
df %>%
    mutate(title = str_trunc(title, 8),
           content = str_trunc(content, 8),
           links = str_trunc(links, 8),
           link_text = str_trunc(link_text, 8)) %>%
    kable("markdown", align = "c")

title	date	content	num_sec	links	link_text
同行评审	2015-11-11	看到这么一…	8	c(“ht…	c(“一则…

The code above can be wrapped into a function, post_data(), that reads a post and returns a data frame:

post_data <- function (path) {
    all <- read_html(path) %>%
        html_nodes("article")
    header <- all %>% html_nodes("header")
    
    title <- header %>%      # post title
        html_nodes("h1") %>% html_text()
    
    post_date <- header %>%  # publication date
        html_node("h3") %>% html_text() %>%
        str_extract("201[0-9]-[0-9]{2}-[0-9]{2}")
    
    article <- all %>%       # body text
        html_nodes("p") %>% 
        html_text() %>% paste(collapse = "\n")
        # Collapse the chr vector into a single string to simplify
        # the data structure; "\n" keeps the paragraph breaks
        
    num_sec <- all %>%      # number of paragraphs in the body
        html_nodes("p") %>% length
    
    links <- all %>% html_nodes("p") %>% # links in the body  
        html_nodes("a") %>% html_attr("href")
    link_text <- all %>%     # link text in the body
        html_nodes("p") %>% 
        html_nodes("a") %>% html_text()
    
    # Return the one-row tibble directly (tibble() replaces the deprecated data_frame())
    tibble::tibble(title = title,
                   date = post_date,
                   content = article,
                   num_sec = num_sec,
                   links = list(links),
                   link_text = list(link_text)
                   )
}
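One detail of post_data() is worth a note: the date pattern starts with 201, so it only matches years 2010 to 2019. That is enough for the sample drawn here (2018, 2015, 2010), but the listing goes back to 2005. The base R sketch below compares the two patterns; the header strings are hypothetical (the real h3 text on the site may be formatted differently), and extract_first is a made-up helper that mimics str_extract() by returning NA when nothing matches.

```r
# Hypothetical header strings; the real <h3> text on the site may differ
h3_text <- c("Author / 2015-11-11", "Author / 2005-01-20")

# Made-up helper: first match per element, NA when there is no match
extract_first <- function(x, pattern) {
    m <- regexpr(pattern, x)
    out <- rep(NA_character_, length(x))
    out[m != -1] <- regmatches(x, m)
    out
}

# Pattern used in the post: years 2010 to 2019 only
extract_first(h3_text, "201[0-9]-[0-9]{2}-[0-9]{2}")

# A four-digit-year pattern also covers the older posts
extract_first(h3_text, "[0-9]{4}-[0-9]{2}-[0-9]{2}")
```

If posts from other years are sampled, switching to the four-digit pattern avoids silent NA dates.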

Next, read all the posts into a single data frame, all_post:

library(dplyr)
library(tidyr)

html_list <- list.files("yihui/html/") # list the files in the folder
all_post <- vector("list", length(html_list))

for (i in seq_along(html_list)) {
    path <- paste0("yihui/html/", html_list[i])
    all_post[[i]] <- post_data(path)
}

all_post <- bind_rows(all_post) %>% arrange(desc(date))

head(all_post) %>%
    mutate(title = str_trunc(title, 8),
           content = str_trunc(content, 8),
           links = str_trunc(links, 8),
           link_text = str_trunc(link_text, 8)) %>%
    kable("markdown", align = "c")

title	date	content	num_sec	links	link_text
修辞还是真实	2018-06-21	说两封让我…	12	chara…	chara…
花椒香料	2018-05-31	古人似乎喜…	2	/cn/2…	去年的花椒
CSS 的…	2018-05-14	CSS 中…	15	c(“ht…	c(“查阅…
毛姆的文学回忆录	2018-05-04	前段时间看…	14	c(“/c…	c(“职业…
距离的组织	2018-05-03	前面《闲情…	5	/cn/2…	闲情赋
语言圣战的终结？	2018-04-19	一直以来我…	3	c(“ht…	c(“惊天…

Reading Directly from the Web

If you cannot download the pages with Bash, change html_list in the code above so that it reads the URLs in sublist.txt, and adjust path inside the for loop accordingly:

html_list <- readr::read_lines("yihui/sublist.txt") # read the URLs
all_post <- vector("list", length(html_list))

for (i in seq_along(html_list)) {
    path <- html_list[i]
    all_post[[i]] <- post_data(path)
}

all_post <- bind_rows(all_post) %>% arrange(desc(date))

Word Segmentation

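Chinese and Japanese text has to be segmented into words before analysis because, unlike English and other European languages, these languages do not mark word boundaries with spaces.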
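A quick base R check illustrates the point. Splitting on spaces is a deliberately naive tokenizer chosen only for this demonstration:

```r
english <- "I do not like you"
chinese <- "我不喜歡你"

# Splitting on spaces recovers the English words ...
strsplit(english, " ", fixed = TRUE)[[1]]

# ... but returns the Chinese sentence as a single unsplit piece
strsplit(chinese, " ", fixed = TRUE)[[1]]
```

The first call yields five words, the second a single element, which is why a dedicated segmenter is needed.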

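We will use segment() from the jiebaR package for segmentation. Its documentation (?segment) shows that segment() accepts only a text file or a sentence, so the structure of all_post has to be clear before segmenting: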

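all_post: a 20 × 6 data frame, one row per post
- $title: one value per row
- $date: one value per row
- $content: one value per row; paragraph breaks are stored as \n inside the string
- $num_sec: one value per row
- $links: one list element per row
- $link_text: one list element per row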

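The structure of all_post$content is simple (one string per post), so it needs no extra processing. The other variables do not need segmentation, so they are not discussed further here.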

jiebaR::segment

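The simple structure of all_post$content already fits what jiebaR expects by default. Data are sometimes more complicated, though, so the behavior is recorded here for future reference.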

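As mentioned above, jiebaR::segment accepts only a single sentence (one string) or a text file. So what happens when it is given a vector? The answer depends on the worker() settings: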

library(jiebaR)
seg <- worker(symbol = TRUE, bylines = FALSE)
segment(c("妳很漂亮", "我不喜歡你"), seg)

[1] "妳"     "很漂亮" " "      "我"     "不"     "喜歡"   "你"

seg <- worker(symbol = TRUE, bylines = TRUE)
segment(c("妳很漂亮", "我不喜歡你"), seg)

[[1]]
[1] "妳"     "很漂亮"

[[2]]
[1] "我"   "不"   "喜歡" "你"

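bylines = FALSE: returns a single chr vector in which each element is one word.
bylines = TRUE: returns a list whose length (number of elements) equals the length of the input vector; each element is a chr vector.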
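The two return shapes can be imitated in base R without jiebaR. In the sketch below, strsplit() on pre-segmented strings is only a stand-in for the segmenter; the point is the shape of the result, not the segmentation itself.

```r
# Pre-segmented stand-ins for the jiebaR output shown above
docs <- c("妳 很漂亮", "我 不 喜歡 你")

# One character vector per input element, similar to bylines = TRUE
by_doc <- strsplit(docs, " ", fixed = TRUE)
str(by_doc)

# One flat character vector, similar to bylines = FALSE:
# the boundary between the two inputs is no longer recoverable
# (in the output above, jiebaR additionally kept a " " between the two inputs)
flat <- unlist(by_doc)
flat
```

When each input element is a separate document whose boundary matters, the list shape is the safer choice; when a single string is passed, as in this post, the flat vector is all that is needed.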

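Setting bylines = FALSE fits our needs here. To suit the quanteda package, each segmentation result is stored as a single string (words separated by spaces) rather than as a chr vector. The following segments the first post: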

library(jiebaR)
all_post_seg <- all_post
seg <- worker(symbol = TRUE, bylines = FALSE)

all_post_seg$content[1] <- all_post$content[1] %>%
    segment(seg) %>% paste(collapse = " ")

all_post$content[1] %>% str_trunc(20)

[1] "说两封让我感到“我天，给亲友的书信..."

all_post_seg$content[1] %>% str_trunc(30)

[1] "说 两封 让 我 感到 “ 我 天 ， 给 亲友 的 ..."

To process all posts, simply wrap this in a for loop:

all_post_seg <- all_post
seg <- worker(symbol = TRUE, bylines = FALSE)

idx <- seq_along(all_post$content)
for (i in idx){
    all_post_seg$content[i] <- all_post$content[i] %>%
        segment(seg) %>% paste(collapse = " ")
}
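As an alternative to the explicit loop, the same step can be written with vapply(). This is only a sketch: segment_all and split_chars are made-up names, and the tokenizer is passed in as a function so the example runs without jiebaR. With jiebaR, the tokenizer could presumably be function(x) segment(x, seg).

```r
# `tokenize` is any function that maps one string to a character vector of tokens
segment_all <- function(texts, tokenize) {
    vapply(texts,
           function(x) paste(tokenize(x), collapse = " "),
           character(1), USE.NAMES = FALSE)
}

# Stand-in tokenizer (an assumption for this demo): split into single characters
split_chars <- function(x) strsplit(x, "", fixed = TRUE)[[1]]

segment_all(c("妳很漂亮", "我不喜歡你"), split_chars)
```

vapply() guarantees exactly one string per post, so the result can be assigned straight back to a content column of the same length.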

head(all_post$content, 3) %>% str_trunc(20)

[1] "说两封让我感到“我天，给亲友的书信..." 
[2] "古人似乎喜欢把花椒当香料用。在《古..."
[3] "CSS 中的位置（position..."

head(all_post_seg$content, 3) %>% str_trunc(30)

[1] "说 两封 让 我 感到 “ 我 天 ， 给 亲友 的 ..."  
[2] "古人 似乎 喜欢 把 花椒 当 香料 用 。 在 《 ..."
[3] "CSS   中 的 位置 （ position ） 属..."

Simplified-to-Traditional Conversion

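OpenCC is an excellent project for converting between Simplified and Traditional Chinese. It goes beyond character-by-character mapping and also handles regional usage (such as 「軟體」 vs. 「软件」). As a result, it offers many conversion options: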

s2t.json: Simplified Chinese to Traditional Chinese
t2s.json: Traditional Chinese to Simplified Chinese
s2tw.json: Simplified Chinese to Traditional Chinese (Taiwan Standard)
tw2s.json: Traditional Chinese (Taiwan Standard) to Simplified Chinese
s2hk.json: Simplified Chinese to Traditional Chinese (Hong Kong Standard)
hk2s.json: Traditional Chinese (Hong Kong Standard) to Simplified Chinese
s2twp.json: Simplified Chinese to Traditional Chinese (Taiwan Standard) with Taiwanese idiom
tw2sp.json: Traditional Chinese (Taiwan Standard) to Simplified Chinese with Mainland Chinese idiom
t2tw.json: Traditional Chinese (OpenCC Standard) to Taiwan Standard
t2hk.json: Traditional Chinese (OpenCC Standard) to Hong Kong Standard

The ropencc package is an R interface to OpenCC. It is not on CRAN, so it has to be installed from GitHub with devtools:

devtools::install_github("qinwf/ropencc")

It is very easy to use:

library(ropencc)
trans <- converter(TW2SP) # Taiwan usage to Mainland usage
run_convert(trans, "開放中文轉換軟體")

[1] "开放中文转换软件"

trans <- converter(T2S)   # plain Traditional to Simplified
run_convert(trans, "開放中文轉換軟體")

[1] "开放中文转换软体"

trans <- converter(S2TWP) # Simplified to Taiwan usage
run_convert(trans, "开放中文转换软件")

[1] "開放中文轉換軟體"
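The difference between character-level and phrase-level conversion can be imitated with a toy example in base R. This is not how OpenCC is implemented: the four-character mapping and the single phrase rule below are hand-written for this one sentence.

```r
simplified <- "开放中文转换软件"

# Step 1: character-by-character mapping (a tiny hand-written table for this demo)
char_level <- chartr("开转换软", "開轉換軟", simplified)
char_level    # "開放中文轉換軟件"

# Step 2: a phrase-level rule for regional vocabulary
phrase_level <- gsub("軟件", "軟體", char_level, fixed = TRUE)
phrase_level  # "開放中文轉換軟體"
```

Step 1 alone leaves the Mainland term in Traditional characters, while step 2 reproduces the Taiwan usage that S2TWP returned above.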

Here I use S2TWP to convert $content and S2T to convert $title:

library(ropencc)
all_post_seg$content <- run_convert(converter(S2TWP),
                                    all_post_seg$content)
all_post_seg$title <- run_convert(converter(S2T),
                                  all_post_seg$title)

head(all_post_seg) %>%
    mutate(title = str_trunc(title, 8),
           content = str_trunc(content, 8),
           links = str_trunc(links, 8),
           link_text = str_trunc(link_text, 8)) %>%
    kable("markdown", align = "c")

title	date	content	num_sec	links	link_text
修辭還是真實	2018-06-21	說兩封 …	12	chara…	chara…
花椒香料	2018-05-31	古人似乎…	2	/cn/2…	去年的花椒
CSS 的…	2018-05-14	CSS …	15	c(“ht…	c(“查阅…
毛姆的文學回憶錄	2018-05-04	前段時間 …	14	c(“/c…	c(“职业…
距離的組織	2018-05-03	前面《 …	5	/cn/2…	闲情赋
語言聖戰的終結？	2018-04-19	一直以來…	3	c(“ht…	c(“惊天…

quanteda

The preprocessing done so far has already put the data into a format that quanteda::corpus() accepts as input:

A data frame consisting of a character vector for documents, and additional vectors for document-level variables

The following commands therefore convert all_post_seg into a corpus object:

library(quanteda)
corp <- corpus(all_post_seg, 
               docid_field = "title", 
               text_field = "content") 

corp %>% summary() %>% as_tibble() %>% # as_tibble() replaces the deprecated as_data_frame()
    head(3) %>%
    mutate(links = str_trunc(links, 8),
           link_text = str_trunc(link_text, 8)) %>%
    kable("markdown", align = "c")

Text	Types	Tokens	Sentences	date	num_sec	links	link_text
修辭還是真實	217	375	15	2018-06-21	12	chara…	chara…
花椒香料	149	246	9	2018-05-31	2	/cn/2…	去年的花椒
CSS 的位置屬性以及如何居中對齊超寬元素	347	805	23	2018-05-14	15	c(“ht…	c(“查阅…

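With the data in a corpus structure, we enter the quanteda analysis framework shown in the figure below. This also ends the preprocessing stage and begins the EDA stage.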

graph TD
    C(Corpus)
    token(Tokens)
    AP["Positional analysis"]
    AN["Non-positional analysis"]
    dfm(DFM)
    tidy("Tidy Text Format")
    vis("Visualize")
    C --> token
    token --> dfm
    token -.-> AP
    dfm -.-> AN
    tidy -->|"cast_dfm()"| dfm
    dfm -->|"tidy()"| tidy
    dfm -.- vis
    tidy -.-> vis
    AP -.- vis
    style C stroke-width:0px,fill:#6bbcff
    style token stroke-width:0px,fill:#6bbcff
    style dfm stroke-width:0px,fill:#6bbcff
    style tidy stroke-width:0px,fill:orange
    linkStyle 6 stroke-width:0px,fill:none;
    linkStyle 8 stroke-width:0px,fill:none;

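quanteda has thorough tutorials and many useful functions. The tidytext package also works smoothly with quanteda, so data can be converted freely between a document-feature matrix and the tidy data frame (one-token-per-document-per-row) that tidytext promotes. The tidy data frame format fits ggplot2 well, which helps with data visualization.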
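To show what the two data structures look like, here is a base R sketch on two short made-up documents. It does not use quanteda or tidytext: table() stands in for a document-feature matrix, and a long data frame stands in for the tidy format, so the object names dfm_like and tidy_like are illustrative only.

```r
# Two short segmented documents (space-separated tokens, made up for this demo)
docs <- c(doc1 = "我 喜歡 花椒 花椒", doc2 = "我 不 喜歡 你")
tokens <- strsplit(docs, " ", fixed = TRUE)

# Wide form: one row per document, one column per feature, cells hold counts
long_tokens <- data.frame(document = rep(names(tokens), lengths(tokens)),
                          term = unlist(tokens, use.names = FALSE))
dfm_like <- table(long_tokens$document, long_tokens$term)
dfm_like

# Long form: one row per document-term pair with a nonzero count
tidy_like <- as.data.frame(dfm_like, stringsAsFactors = FALSE)
names(tidy_like) <- c("document", "term", "count")
tidy_like <- tidy_like[tidy_like$count > 0, ]
tidy_like
```

The wide form is convenient for matrix-based models, while the long form can be passed directly to ggplot2 or grouped with dplyr.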

quanteda rather than tidytext is chosen as the main framework here because tidytext supports only a bag-of-words approach, whereas quanteda, in addition to bag-of-words, keeps open the possibility of positional analysis.
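The distinction can be seen on two sentences that contain the same words in a different order. The base R sketch below uses made-up sentences and a hypothetical bigrams helper; it illustrates the idea and is not quanteda's implementation.

```r
a <- strsplit("我 不 喜歡 你", " ", fixed = TRUE)[[1]]
b <- strsplit("你 不 喜歡 我", " ", fixed = TRUE)[[1]]

# Bag of words: only the counts matter, so the two sentences look identical
identical(sort(a), sort(b))

# Positional view: adjacent pairs (bigrams) keep word order, and they differ
bigrams <- function(x) paste(head(x, -1), tail(x, -1))
bigrams(a)
bigrams(b)
setequal(bigrams(a), bigrams(b))
```

The first comparison returns TRUE and the last returns FALSE: word order, which carries who likes whom, survives only in the positional view.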

Because of limited space, the quanteda package is not discussed in detail here³. For how to use quanteda, see the quanteda tutorial, which is very thorough.

Reproduce

The source code for this post is on my GitHub. Feel free to download it and run it on your own computer.

References

Silge, Julia, and David Robinson. 2017. Text Mining with R: A Tidy Approach. 1st ed. O’Reilly Media, Inc.

Watanabe, Kohei, and Stefan Müller. 2018. “Quanteda Tutorials.” Quanteda Tutorials. https://tutorials.quanteda.io/.

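¹ macOS and Linux come with Bash built in; Windows does not.
² To run wget_list directly, it must first be given execute permission with chmod 755 <path to wget_list> and be placed somewhere the shell searches for programs automatically (such as /usr/bin/).
Alternatively, skip the permission step and run wget_list through bash:
bash <path to wget_list> <file1> <file2>
³ A follow-up post may come later.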

Last updated: 2018-11-10
