3. Corpus Input

hgct.corpusReader.PlainTextReader reads the default plain text corpus layout used by hgct. To read a corpus stored in a different layout, write your own class or function that returns a corpus object in the required structure. The required structure is shown below (a corpus with two subcorpora, each containing three texts):

Required corpus structure
[
   {
      "id": "01",
      "m": {"label": "1st timestep", "ord": 1, "time_range": [-1000, -206]},
      "text": [
          {
              "c": ["這是第三篇裡的一個句子。", "這是第二個句子。"],
              "id": "01/text3.txt",
              "m": {"about": "Text 3 in 1st timestep"}
          },
          {
              "c": ["這是一個句子。", "這是第二個句子。"],
              "id": "01/text1.txt",
              "m": {"about": "Text 1 in 1st timestep"}
          },
          {
              "c": [
                  "這是第二篇裡的一個句子。", "這是第二個句子。"],
              "id": "01/text2.txt",
              "m": {"about": "Text 2 in 1st timestep"}
          }
      ]
   },
   {
      "id": "02",
      "m": {"label": "2nd timestep", "ord": 2, "time_range": [-205, 220]},
      "text": [
          {
              "c": ["這是第三篇裡的一個句子。", "這是第二個句子。"],
              "id": "02/text3.txt",
              "m": {"about": "Text 3 in 2nd timestep"}
          },
          {
              "c": ["這是一個句子。", "這是第二個句子。"],
              "id": "02/text1.txt",
              "m": {"about": "Text 1 in 2nd timestep"}
          },
          {
              "c": ["這是第二篇裡的一個句子。", "這是第二個句子。"],
              "id": "02/text2.txt",
              "m": {"about": "Text 2 in 2nd timestep"}
          }
      ]
   }
]
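A custom reader only needs to return a list shaped like the structure above. The following is a minimal sketch, not part of hgct: build_corpus and check_corpus are hypothetical helper names, and the in-memory tuple input is an assumption made for illustration. Which fields inside "m" hgct actually requires is not stated here, so the check covers only the top-level keys.

```python
def build_corpus(subcorpora):
    """Build a corpus list from (sub_id, sub_meta, texts) tuples.

    `texts` is a list of (text_id, text_meta, sentences) tuples.
    Hypothetical helper, not part of hgct.
    """
    corpus = []
    for sub_id, sub_meta, texts in subcorpora:
        corpus.append({
            "id": sub_id,
            "m": dict(sub_meta),
            "text": [
                {"id": text_id, "m": dict(text_meta), "c": list(sentences)}
                for text_id, text_meta, sentences in texts
            ],
        })
    return corpus


def check_corpus(corpus):
    """Raise ValueError if a subcorpus or text lacks the top-level keys."""
    for sub in corpus:
        if not {"id", "m", "text"} <= set(sub):
            raise ValueError(f"subcorpus missing keys: {sorted(sub)}")
        for text in sub["text"]:
            if not {"id", "m", "c"} <= set(text):
                raise ValueError(f"text missing keys: {sorted(text)}")
            if not all(isinstance(s, str) for s in text["c"]):
                raise ValueError(f"non-string sentence in {text['id']}")
    return True


# One subcorpus with one text, using values from the structure above
corpus = build_corpus([
    ("01", {"label": "1st timestep", "ord": 1, "time_range": [-1000, -206]},
     [("01/text1.txt", {"about": "Text 1 in 1st timestep"},
       ["這是一個句子。", "這是第二個句子。"])]),
])
check_corpus(corpus)
```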
class hgct.corpusReader.PlainTextReader(dir_path='data/', ts_meta_filename='time.yaml', text_meta_filename='text_meta.yaml', ts_meta_loader=None, text_meta_loader=None, plain_text_reader=<function read_text_as_sentences>, auto_load=True)

Plain text corpus input handler

Examples

>>> from pprint import pprint
>>> import gdown
>>> from gdown import cached_download
>>> from hgct.corpusReader import PlainTextReader

>>> url = 'https://github.com/liao961120/dcctk/raw/main/test/minimal_plaintext_corpus.zip'
>>> cached_download(url, "minimal_plaintext_corpus.zip", postprocess=gdown.extractall)
>>> corpus = PlainTextReader("minimal_plaintext_corpus/").corpus
>>> pprint(corpus)

[{'id': '01',
  'm': {'label': '1st timestep', 'ord': 1, 'time_range': [-1000, -206]},
  'text': [{'c': ['這是第三篇裡的一個句子。', '這是第二個句子。'],
            'id': '01/text3.txt',
            'm': {'about': 'Text 3 in 1st timestep'}},
           {'c': ['這是一個句子。', '這是第二個句子。'],
            'id': '01/text1.txt',
            'm': {'about': 'Text 1 in 1st timestep'}},
           {'c': ['這是第二篇裡的一個句子。', '這是第二個句子。'],
            'id': '01/text2.txt',
            'm': {'about': 'Text 2 in 1st timestep'}}]},
 {'id': '02',
  'm': {'label': '2nd timestep', 'ord': 2, 'time_range': [-205, 220]},
  'text': [{'c': ['這是第三篇裡的一個句子。', '這是第二個句子。'],
            'id': '02/text3.txt',
            'm': {'about': 'Text 3 in 2nd timestep'}},
           {'c': ['這是一個句子。', '這是第二個句子。'],
            'id': '02/text1.txt',
            'm': {'about': 'Text 1 in 2nd timestep'}},
           {'c': ['這是第二篇裡的一個句子。', '這是第二個句子。'],
            'id': '02/text2.txt',
            'm': {'about': 'Text 2 in 2nd timestep'}}]}]
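
Because the corpus is a plain list of dictionaries, it can be traversed with ordinary Python. The sketch below counts sentences per subcorpus; count_sentences is a hypothetical helper, not an hgct function, and sample is a small stand-in with the same shape as the output above.

```python
def count_sentences(corpus):
    """Return {subcorpus id: total number of sentences} for a corpus list."""
    return {
        sub["id"]: sum(len(text["c"]) for text in sub["text"])
        for sub in corpus
    }


# Small stand-in with the same shape as the corpus printed above
sample = [
    {"id": "01", "m": {"ord": 1}, "text": [
        {"id": "01/text1.txt", "m": {},
         "c": ["這是一個句子。", "這是第二個句子。"]},
        {"id": "01/text2.txt", "m": {},
         "c": ["這是第二篇裡的一個句子。", "這是第二個句子。"]},
    ]},
]
print(count_sentences(sample))  # {'01': 4}
```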

Methods

get_corpus_as_gen

__init__(dir_path='data/', ts_meta_filename='time.yaml', text_meta_filename='text_meta.yaml', ts_meta_loader=None, text_meta_loader=None, plain_text_reader=<function read_text_as_sentences>, auto_load=True)

Read in plain text corpus

Parameters
dir_path : str, optional

Path to the directory containing the plain text corpus. For the directory structure of the plain text corpus, refer to the example data in the GitHub repo. By default “data/”.

ts_meta_filename : str, optional

Path to the metadata file specifying the time info of each timestepped subcorpus, by default “time.yaml”.

text_meta_filename : str, optional

Path to the metadata file specifying info of each corpus text, by default “text_meta.yaml”.

ts_meta_loader : Callable, optional

Custom function to parse the file specified in ts_meta_filename, by default None.

text_meta_loader : Callable, optional

Custom function to parse the file specified in text_meta_filename, by default None.

plain_text_reader : Callable, optional

Function to read a corpus text file as a sequence of sentences, by default hgct.UtilsTextProcess.read_text_as_sentences().

__weakref__

list of weak references to the object (if defined)

hgct.UtilsTextProcess.read_text_as_sentences(fp)

Read a text file as a sequence of sentences

Parameters
fp : str

Path to UTF-8 encoded plain text file.

Yields
str

A string representing a sentence. A sentence corresponds to a line in the file.
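
The documented behavior (one sentence per line, yielded lazily) can be sketched as follows. This is an illustration, not the hgct source: read_lines_as_sentences is a hypothetical name, and stripping whitespace and skipping blank lines are assumptions.

```python
import os
import tempfile


def read_lines_as_sentences(fp):
    """Yield each line of a UTF-8 text file as one sentence."""
    with open(fp, encoding="utf-8") as f:
        for line in f:
            sent = line.strip()  # assumption: surrounding whitespace is dropped
            if sent:             # assumption: blank lines are skipped
                yield sent


# Demo on a temporary file with two sentences and a blank line between them
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "text1.txt")
    with open(path, "w", encoding="utf-8") as f:
        f.write("這是一個句子。\n\n這是第二個句子。\n")
    sentences = list(read_lines_as_sentences(path))

print(sentences)  # ['這是一個句子。', '這是第二個句子。']
```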