3. Corpus Input
hgct.corpusReader.PlainTextReader handles the default plain text corpus input structure in hgct.
To read in a corpus with a different structure, you can write your own class or
function that returns a corpus object following the required structure.
That structure is shown below (a corpus with two subcorpora, each containing
three texts):
[
{
"id": "01",
"m": {"label": "1st timestep", "ord": 1, "time_range": [-1000, -206]},
"text": [
{
"c": ["這是第三篇裡的一個句子。", "這是第二個句子。"],
"id": "01/text3.txt",
"m": {"about": "Text 3 in 1st timestep"}
},
{
"c": ["這是一個句子。", "這是第二個句子。"],
"id": "01/text1.txt",
"m": {"about": "Text 1 in 1st timestep"}
},
{
"c": ["這是第二篇裡的一個句子。", "這是第二個句子。"],
"id": "01/text2.txt",
"m": {"about": "Text 2 in 1st timestep"}
}
]
},
{
"id": "02",
"m": {"label": "2nd timestep", "ord": 2, "time_range": [-205, 220]},
"text": [
{
"c": ["這是第三篇裡的一個句子。", "這是第二個句子。"],
"id": "02/text3.txt",
"m": {"about": "Text 3 in 2nd timestep"}
},
{
"c": ["這是一個句子。", "這是第二個句子。"],
"id": "02/text1.txt",
"m": {"about": "Text 1 in 2nd timestep"}
},
{
"c": ["這是第二篇裡的一個句子。", "這是第二個句子。"],
"id": "02/text2.txt",
"m": {"about": "Text 2 in 2nd timestep"}
}
]
}
]
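As an illustration of the custom-reader approach described above, here is a minimal sketch of a function that turns some other in-memory layout into the required structure. The names `read_custom_corpus`, `check_corpus`, and the shape of the `raw` input are hypothetical and not part of hgct; the checker only enforces the keys visible in the example above ("id", "m", "text" for a subcorpus; "id", "m", "c" for a text) and makes no claim about which metadata fields hgct itself requires.

```python
def check_corpus(corpus):
    """Raise ValueError if `corpus` lacks the keys shown in the example structure."""
    if not isinstance(corpus, list):
        raise ValueError("corpus must be a list of subcorpora")
    for sub in corpus:
        missing = {"id", "m", "text"} - set(sub)
        if missing:
            raise ValueError(f"subcorpus is missing keys: {sorted(missing)}")
        for text in sub["text"]:
            missing = {"id", "m", "c"} - set(text)
            if missing:
                raise ValueError(f"text is missing keys: {sorted(missing)}")
            if not all(isinstance(s, str) for s in text["c"]):
                raise ValueError(f"sentences in {text['id']} must be strings")
    return True


def read_custom_corpus(raw):
    """Convert a hypothetical {sub_id: {"meta": ..., "texts": {name: [sentences]}}}
    mapping into the list-of-subcorpora structure."""
    corpus = []
    for sub_id, sub in raw.items():
        texts = [
            {"id": f"{sub_id}/{name}", "m": {"about": name}, "c": list(sentences)}
            for name, sentences in sub["texts"].items()
        ]
        corpus.append({"id": sub_id, "m": dict(sub["meta"]), "text": texts})
    check_corpus(corpus)
    return corpus


# Hypothetical input in some other layout
raw = {
    "01": {
        "meta": {"label": "1st timestep", "ord": 1, "time_range": [-1000, -206]},
        "texts": {"text1.txt": ["這是一個句子。", "這是第二個句子。"]},
    },
    "02": {
        "meta": {"label": "2nd timestep", "ord": 2, "time_range": [-205, 220]},
        "texts": {"text1.txt": ["這是一個句子。"]},
    },
}
corpus = read_custom_corpus(raw)
```

Running a structural check before handing the object on makes a malformed subcorpus fail early, at the point where it is built.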
- class hgct.corpusReader.PlainTextReader(dir_path='data/', ts_meta_filename='time.yaml', text_meta_filename='text_meta.yaml', ts_meta_loader=None, text_meta_loader=None, plain_text_reader=<function read_text_as_sentences>, auto_load=True)
Plain text corpus input handler
Examples
>>> from pprint import pprint
>>> import gdown
>>> from gdown import cached_download
>>> from hgct.corpusReader import PlainTextReader
>>> url = 'https://github.com/liao961120/dcctk/raw/main/test/minimal_plaintext_corpus.zip'
>>> cached_download(url, "minimal_plaintext_corpus.zip", postprocess=gdown.extractall)
>>> corpus = PlainTextReader("minimal_plaintext_corpus/").corpus
>>> pprint(corpus)
[{'id': '01',
  'm': {'label': '1st timestep', 'ord': 1, 'time_range': [-1000, -206]},
  'text': [{'c': ['這是第三篇裡的一個句子。', '這是第二個句子。'],
            'id': '01/text3.txt',
            'm': {'about': 'Text 3 in 1st timestep'}},
           {'c': ['這是一個句子。', '這是第二個句子。'],
            'id': '01/text1.txt',
            'm': {'about': 'Text 1 in 1st timestep'}},
           {'c': ['這是第二篇裡的一個句子。', '這是第二個句子。'],
            'id': '01/text2.txt',
            'm': {'about': 'Text 2 in 1st timestep'}}]},
 {'id': '02',
  'm': {'label': '2nd timestep', 'ord': 2, 'time_range': [-205, 220]},
  'text': [{'c': ['這是第三篇裡的一個句子。', '這是第二個句子。'],
            'id': '02/text3.txt',
            'm': {'about': 'Text 3 in 2nd timestep'}},
           {'c': ['這是一個句子。', '這是第二個句子。'],
            'id': '02/text1.txt',
            'm': {'about': 'Text 1 in 2nd timestep'}},
           {'c': ['這是第二篇裡的一個句子。', '這是第二個句子。'],
            'id': '02/text2.txt',
            'm': {'about': 'Text 2 in 2nd timestep'}}]}]
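Because the loaded corpus is an ordinary list of dicts, it can be traversed with plain Python. The sketch below uses a small literal stand-in for the `corpus` value so that it runs without downloading anything; the helpers `count_sentences` and `iter_sentences` are hypothetical and not part of hgct.

```python
# A reduced stand-in for the `corpus` object shown in the example above
corpus = [
    {
        "id": "02",
        "m": {"label": "2nd timestep", "ord": 2, "time_range": [-205, 220]},
        "text": [
            {"id": "02/text1.txt", "m": {"about": "Text 1 in 2nd timestep"},
             "c": ["這是一個句子。", "這是第二個句子。"]},
        ],
    },
    {
        "id": "01",
        "m": {"label": "1st timestep", "ord": 1, "time_range": [-1000, -206]},
        "text": [
            {"id": "01/text1.txt", "m": {"about": "Text 1 in 1st timestep"},
             "c": ["這是一個句子。", "這是第二個句子。"]},
            {"id": "01/text2.txt", "m": {"about": "Text 2 in 1st timestep"},
             "c": ["這是第二篇裡的一個句子。", "這是第二個句子。"]},
        ],
    },
]


def count_sentences(corpus):
    """Map each subcorpus id to its total number of sentences."""
    return {sub["id"]: sum(len(t["c"]) for t in sub["text"]) for sub in corpus}


def iter_sentences(corpus):
    """Yield (subcorpus id, text id, sentence index, sentence) tuples."""
    for sub in corpus:
        for text in sub["text"]:
            for i, sentence in enumerate(text["c"]):
                yield sub["id"], text["id"], i, sentence


# Order subcorpora chronologically by the "ord" field in their metadata
ordered_ids = [sub["id"] for sub in sorted(corpus, key=lambda s: s["m"]["ord"])]
counts = count_sentences(corpus)
first = next(iter_sentences(corpus))
```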
Methods
get_corpus_as_gen
- __init__(dir_path='data/', ts_meta_filename='time.yaml', text_meta_filename='text_meta.yaml', ts_meta_loader=None, text_meta_loader=None, plain_text_reader=<function read_text_as_sentences>, auto_load=True)
Read in plain text corpus
- Parameters
- dir_path : str, optional
Path to the directory containing the plain text corpus. For the directory structure of the plain text corpus, refer to the example data in the GitHub repo. By default “data/”.
- ts_meta_filename : str, optional
Path to the metadata file specifying the time info of each timestepped subcorpus, by default “time.yaml”.
- text_meta_filename : str, optional
Path to the metadata file specifying info about each corpus text, by default “text_meta.yaml”.
- ts_meta_loader : Callable, optional
Custom function to parse the file specified in ts_meta_filename, by default None.
- text_meta_loader : Callable, optional
Custom function to parse the file specified in text_meta_filename, by default None.
- plain_text_reader : Callable, optional
Function to read a corpus text file as a sequence of sentences, by default dcctk.UtilsTextProcess.read_text_as_sentences().
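The loader and reader parameters above accept custom callables. The following is a sketch of what such callables might look like, under the assumption (not confirmed by this page) that each one receives a single file path, that a metadata loader returns the parsed metadata, and that the text reader returns a list of sentence strings. The function names are hypothetical.

```python
import json
import re
from pathlib import Path

# One sentence = a run of non-terminator characters plus an optional terminator
_SENTENCE = re.compile(r"[^。！？]+[。！？]?")


def load_ts_meta_json(fp):
    """Parse a timestep metadata file stored as JSON instead of YAML.

    Assumption: the loader is called with the path to the metadata file.
    """
    with open(fp, encoding="utf-8") as f:
        return json.load(f)


def read_text_split_on_punct(fp):
    """Read a text file and split it into sentences on 。！？ and line breaks.

    Assumption: the reader is called with the path to one corpus text file.
    """
    sentences = []
    for line in Path(fp).read_text(encoding="utf-8").splitlines():
        for piece in _SENTENCE.findall(line):
            piece = piece.strip()
            if piece:
                sentences.append(piece)
    return sentences
```

If those signature assumptions hold, the functions would be passed by keyword, for example ts_meta_filename="time.json" together with ts_meta_loader=load_ts_meta_json and plain_text_reader=read_text_split_on_punct.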