post2df takes paths to PTT posts (URLs or local paths to HTML files) as input, extracts information from each post, and returns a data frame with n rows and 12 columns, one post per row.

post2df(path, board_col = FALSE)

Arguments

path

Character Vector. A vector of URLs or local paths to PTT posts.

board_col

Logical. Whether to set board name as a new variable. Defaults to FALSE.

Value

A data frame with n rows and 12 variables:

author

Author of the post.

category

Category of the post, such as "新聞", "問卦", "Re:".

title

Title of the post.

date

The date of the post.

content

The content of the post.

content_char

The number of characters in the post content. Whitespace and newline characters are removed before counting.

n_comment

Number of comments.

n_push

Number of "Push" comments.

n_boo

Number of "Boo" comments.

link

URL of the post with https://www.ptt.cc/bbs/ removed. For local file paths, the link is the file name.

comment

A list-column of data frames, each holding the contents extracted from the comment region of one post. See get_post_comment for information about the variables in these data frames.

comment_urls

A list-column of character vectors, each holding the URLs extracted from the content of one post. In the variable 'content', the original URLs are replaced with 'rm_URL'.

One additional variable is optional:

board

The board the post belongs to. Present only if board_col = TRUE.
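
The derived variables content_char, link, and the 'rm_URL' placeholder can be illustrated with a minimal base R sketch. This is not the package's actual implementation: the sample text, the URL pattern, and the order of the steps are assumptions made for illustration only.

```r
# Hypothetical inputs, for illustration only.
content <- "Hello PTT\nsee https://example.com/a for details"
url <- "https://www.ptt.cc/bbs/Gossiping/M.1534415307.A.BE5.html"
local_path <- "local/gossiping/M.1534415307.A.BE5.html"

# content_char: drop all whitespace (spaces, tabs, newlines), then count.
content_char <- nchar(gsub("\\s", "", content))

# link: strip the fixed site prefix from a URL;
# for a local file, keep only the file name.
link_from_url <- sub("https://www.ptt.cc/bbs/", "", url, fixed = TRUE)
link_from_file <- basename(local_path)

# URLs in the content: extract them, then replace each with a placeholder.
# The pattern below is an assumed, simplified URL matcher.
url_pattern <- "https?://[^[:space:]]+"
urls <- regmatches(content, gregexpr(url_pattern, content))[[1]]
content_clean <- gsub(url_pattern, "rm_URL", content)
```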

Details

This function calls get_post on each path, row-binds the results into a single data frame, and adds some metadata about the 'content' of each post.
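
The row-binding pattern described above can be sketched as follows. Here fake_get_post is a hypothetical stand-in that returns a one-row data frame; the real get_post may return a different structure, so treat this only as an outline of the approach.

```r
# Hypothetical stand-in for get_post(): one row of data per post.
fake_get_post <- function(path) {
  data.frame(
    link = basename(path),
    content = paste("content of", basename(path)),
    stringsAsFactors = FALSE
  )
}

# Hypothetical local paths, for illustration only.
paths <- c("local/gossiping/a.html", "local/gossiping/b.html")

# Build one data frame per path, then stack them row-wise.
posts <- do.call(rbind, lapply(paths, fake_get_post))

# Add metadata about the content, e.g. a whitespace-free character count.
posts$content_char <- nchar(gsub("\\s", "", posts$content))
```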

Examples

url <- "https://www.ptt.cc/bbs/Gossiping/M.1534415307.A.BE5.html"

post_df <- post2df(url)
head(post_df)

# Access information in the list column: 'comment'
head(post_df$comment[[1]])

if (FALSE) {
# Read from local files
post_df <- post2df(list.files('local/gossiping', full.names = TRUE))
}
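
Because 'comment' is a list-column, it is sometimes convenient to flatten it into one long data frame keyed by post. The sketch below uses a small hand-built data frame in place of real post2df output; the column name tag and its values are hypothetical, and the actual comment variables are those documented in get_post_comment.

```r
# Hand-built stand-in for post2df() output (hypothetical values).
post_df <- data.frame(
  link = c("Gossiping/A.html", "Gossiping/B.html"),
  stringsAsFactors = FALSE
)
post_df$comment <- list(
  data.frame(tag = c("Push", "Boo"), stringsAsFactors = FALSE),
  data.frame(tag = "Push", stringsAsFactors = FALSE)
)

# Attach the post's link to each of its comments, then stack all comments.
all_comments <- do.call(rbind, lapply(seq_len(nrow(post_df)), function(i) {
  cmt <- post_df$comment[[i]]
  cmt$link <- rep(post_df$link[i], nrow(cmt))
  cmt
}))

# Count comments per post from the flattened data frame.
comments_per_post <- table(all_comments$link)
```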