post2df
takes paths to PTT posts (either URLs or local paths to HTML files) as input, extracts information from the posts, and returns a data frame with n rows and 12 columns, one post per row.
post2df(path, board_col = FALSE)
path | Character Vector. A vector of URLs or local paths to PTT posts. |
---|---|
board_col | Logical. Whether to add the board name as a new variable. Defaults to FALSE. |
A data frame with n rows and 12 variables:
Author of the post.
Title of the post.
The date of the post.
The content of the post.
Number of characters in the post content. Whitespace and newline characters are removed before counting.
Number of comments.
Number of "Push" comments.
Number of "Boo" comments.
URL of the post with https://www.ptt.cc/bbs/ removed. For local file paths, the link is the file name.
A list-column with data frames stored inside. Contents extracted from the post comment region. See get_post_comment for information about the variables in the data frame.
A list-column with character vectors stored inside. URLs extracted from the post content are stored in the character vectors. The original URLs in the post content are replaced with 'rm_URL' in the variable 'content'.
One additional variable is optional:
The board the post belongs to. Exists only if board_col = TRUE.
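The whitespace-free character count and the 'rm_URL' replacement described above can be illustrated with base R. This is only a sketch under stated assumptions, not the package's implementation: the sample text, the URL pattern, and the choice to count characters after URL replacement are all assumptions made for illustration.

```r
# Hypothetical post content; not taken from a real PTT post.
content <- "Hello PTT\nsee https://example.com/a and http://example.org/b now"

# Assumed URL pattern: "http://" or "https://" followed by non-whitespace.
url_pattern <- "https?://[^[:space:]]+"

# Collect the URLs into a character vector
# (the kind of value a list-column cell could hold).
urls <- regmatches(content, gregexpr(url_pattern, content))[[1]]

# Replace each URL in the text with the placeholder 'rm_URL'.
content_clean <- gsub(url_pattern, "rm_URL", content)

# Count characters after removing whitespace and newline characters.
# Assumption: counting happens after URL replacement.
n_char <- nchar(gsub("[[:space:]]", "", content_clean))
```

Storing the extracted URLs separately keeps them available for analysis while the placeholder keeps long URLs from dominating the text of 'content'.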
This function rbinds the data gathered from get_post and adds some metadata about the 'content' of the post.
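The row-binding pattern can be sketched as follows. Because the return structure of get_post is not shown here, the sketch uses a hypothetical stand-in, fake_get_post, and hypothetical column names (link, n_char); it shows the general technique rather than the actual internals.

```r
# Hypothetical stand-in for get_post(): returns a one-row data frame per path.
# The real get_post() return structure is not shown in this page.
fake_get_post <- function(path) {
  data.frame(
    link = basename(path),
    content = paste("content of", basename(path)),
    stringsAsFactors = FALSE
  )
}

# Hypothetical local paths; no files are read.
paths <- c("local/a.html", "local/b.html")

# Row-bind the per-post results so that each post occupies one row.
posts <- do.call(rbind, lapply(paths, fake_get_post))

# Add metadata derived from 'content'
# (here, a whitespace-free character count).
posts$n_char <- nchar(gsub("[[:space:]]", "", posts$content))
```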
url <- "https://www.ptt.cc/bbs/Gossiping/M.1534415307.A.BE5.html"
post_df <- post2df(url)
head(post_df)

# Access information in the list column: 'comment'
head(post_df$comment[[1]])

if (FALSE) {
  # Read from local files
  post_df <- post2df(list.files('local/gossiping', full.names = TRUE))
}
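For readers unfamiliar with list-columns, the following self-contained sketch builds a toy data frame with nested data frames and summarises them row by row. The nested column names ('tag', 'text') and all values are hypothetical; see get_post_comment for the actual variables.

```r
# Toy data frame with a list-column. The column names inside the nested
# data frames ('tag', 'text') are hypothetical.
toy <- data.frame(title = c("post A", "post B"), stringsAsFactors = FALSE)
toy$comment <- list(
  data.frame(tag = c("Push", "Boo"), text = c("nice", "meh"),
             stringsAsFactors = FALSE),
  data.frame(tag = "Push", text = "agree", stringsAsFactors = FALSE)
)

# Number of comments per post: one nested data frame per row.
n_comment <- vapply(toy$comment, nrow, integer(1))

# Number of comments with a given tag in each nested data frame.
n_push <- vapply(toy$comment, function(d) sum(d$tag == "Push"), integer(1))
```

Using vapply with integer(1) makes the expected result type explicit, so a malformed nested element raises an error instead of silently changing the output type.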