3. Corpus Input

hgct.corpusReader.PlainTextReader handles the default plain text corpus layout in hgct. To read a corpus stored in another layout, write your own class or function that returns a corpus object with the required structure. That structure is shown below (a corpus with two subcorpora, each containing three texts):

Required corpus structure

[
   {
      "id": "01",
      "m": {"label": "1st timestep", "ord": 1, "time_range": [-1000, -206]},
      "text": [
          {
              "c": ["這是第三篇裡的一個句子。", "這是第二個句子。"],
              "id": "01/text3.txt",
              "m": {"about": "Text 3 in 1st timestep"}
          },
          {
              "c": ["這是一個句子。", "這是第二個句子。"],
              "id": "01/text1.txt",
              "m": {"about": "Text 1 in 1st timestep"}
          },
          {
              "c": ["這是第二篇裡的一個句子。", "這是第二個句子。"],
              "id": "01/text2.txt",
              "m": {"about": "Text 2 in 1st timestep"}
          }
      ]
   },
   {
      "id": "02",
      "m": {"label": "2nd timestep", "ord": 2, "time_range": [-205, 220]},
      "text": [
          {
              "c": ["這是第三篇裡的一個句子。", "這是第二個句子。"],
              "id": "02/text3.txt",
              "m": {"about": "Text 3 in 2nd timestep"}
          },
          {
              "c": ["這是一個句子。", "這是第二個句子。"],
              "id": "02/text1.txt",
              "m": {"about": "Text 1 in 2nd timestep"}
          },
          {
              "c": ["這是第二篇裡的一個句子。", "這是第二個句子。"],
              "id": "02/text2.txt",
              "m": {"about": "Text 2 in 2nd timestep"}
          }
      ]
   }
]
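A custom reader only has to produce the nested list shown above. The following is a minimal sketch, not part of hgct: `build_corpus` and `check_corpus` are hypothetical helper names, and the list-of-tuples input is an assumed in-memory format chosen for illustration. Which metadata keys under "m" hgct actually requires is not stated here, so the sketch only checks the top-level keys that appear in the example.

```python
from typing import Any, Dict, List, Tuple

# Assumed input format: (subcorpus id, subcorpus metadata, {text name: sentences})
Subcorpus = Tuple[str, Dict[str, Any], Dict[str, List[str]]]


def build_corpus(subcorpora: List[Subcorpus]) -> List[Dict[str, Any]]:
    """Arrange in-memory data into the nested corpus structure."""
    corpus = []
    for sub_id, meta, texts in subcorpora:
        corpus.append({
            "id": sub_id,
            "m": dict(meta),
            "text": [
                # Text ids follow the "<subcorpus id>/<file name>" pattern of the example
                {"id": f"{sub_id}/{name}", "c": list(sentences), "m": {}}
                for name, sentences in texts.items()
            ],
        })
    return corpus


def check_corpus(corpus: List[Dict[str, Any]]) -> None:
    """Raise ValueError if a key seen in the example structure is missing."""
    for sub in corpus:
        missing = {"id", "m", "text"} - set(sub)
        if missing:
            raise ValueError(f"subcorpus is missing keys: {sorted(missing)}")
        for text in sub["text"]:
            missing = {"id", "m", "c"} - set(text)
            if missing:
                raise ValueError(f"text in {sub['id']!r} is missing keys: {sorted(missing)}")
            if not all(isinstance(s, str) for s in text["c"]):
                raise ValueError(f"{text['id']!r}: 'c' must be a list of strings")


corpus = build_corpus([
    ("01", {"label": "1st timestep", "ord": 1, "time_range": [-1000, -206]},
     {"text1.txt": ["這是一個句子。", "這是第二個句子。"]}),
    ("02", {"label": "2nd timestep", "ord": 2, "time_range": [-205, 220]},
     {"text1.txt": ["這是一個句子。", "這是第二個句子。"]}),
])
check_corpus(corpus)  # passes silently when the structure matches
```

Running a check like this before handing the object to the rest of the library can surface a missing key early, rather than deep inside a later processing step.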

class hgct.corpusReader.PlainTextReader(dir_path='data/', ts_meta_filename='time.yaml', text_meta_filename='text_meta.yaml', ts_meta_loader=None, text_meta_loader=None, plain_text_reader=<function read_text_as_sentences>, auto_load=True)

Plain text corpus input handler

Examples

>>> from pprint import pprint
>>> import gdown
>>> from gdown import cached_download
>>> from hgct.corpusReader import PlainTextReader

>>> url = 'https://github.com/liao961120/dcctk/raw/main/test/minimal_plaintext_corpus.zip'
>>> cached_download(url, "minimal_plaintext_corpus.zip", postprocess=gdown.extractall)
>>> corpus = PlainTextReader("minimal_plaintext_corpus/").corpus
>>> pprint(corpus)

[{'id': '01',
'm': {'label': '1st timestep', 'ord': 1, 'time_range': [-1000, -206]},
'text': [{'c': ['這是第三篇裡的一個句子。', '這是第二個句子。'],
            'id': '01/text3.txt',
            'm': {'about': 'Text 3 in 1st timestep'}},
        {'c': ['這是一個句子。', '這是第二個句子。'],
            'id': '01/text1.txt',
            'm': {'about': 'Text 1 in 1st timestep'}},
        {'c': ['這是第二篇裡的一個句子。', '這是第二個句子。'],
            'id': '01/text2.txt',
            'm': {'about': 'Text 2 in 1st timestep'}}]},
{'id': '02',
'm': {'label': '2nd timestep', 'ord': 2, 'time_range': [-205, 220]},
'text': [{'c': ['這是第三篇裡的一個句子。', '這是第二個句子。'],
            'id': '02/text3.txt',
            'm': {'about': 'Text 3 in 2nd timestep'}},
        {'c': ['這是一個句子。', '這是第二個句子。'],
            'id': '02/text1.txt',
            'm': {'about': 'Text 1 in 2nd timestep'}},
        {'c': ['這是第二篇裡的一個句子。', '這是第二個句子。'],
            'id': '02/text2.txt',
            'm': {'about': 'Text 2 in 2nd timestep'}}]}]

Methods

get_corpus_as_gen

__init__(dir_path='data/', ts_meta_filename='time.yaml', text_meta_filename='text_meta.yaml', ts_meta_loader=None, text_meta_loader=None, plain_text_reader=<function read_text_as_sentences>, auto_load=True)

Read in plain text corpus

Parameters

dir_path (str, optional): Path to the directory containing the plain text corpus. For the expected directory structure, refer to the example data in the GitHub repo. By default “data/”.
ts_meta_filename (str, optional): Path to the metadata file specifying the time info of each timestepped subcorpus, by default “time.yaml”.
text_meta_filename (str, optional): Path to the metadata file specifying info about each corpus text, by default “text_meta.yaml”.
ts_meta_loader (Callable, optional): Custom function to parse the file specified in ts_meta_filename, by default None.
text_meta_loader (Callable, optional): Custom function to parse the file specified in text_meta_filename, by default None.
plain_text_reader (Callable, optional): Function to read a corpus text file as a sequence of sentences, by default hgct.UtilsTextProcess.read_text_as_sentences().
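When the text files do not hold one sentence per line, a different function can be passed as plain_text_reader. The sketch below is illustrative only: `read_text_by_punctuation` is a hypothetical name, the set of sentence-final punctuation marks is an assumption, and the callable is assumed to have the same shape as the default reader (it takes a file path and yields sentence strings).

```python
import os
import re
import tempfile
from typing import Iterator

# Zero-width split point right after sentence-final punctuation.
# The punctuation set is an assumption; splitting on an empty match needs Python 3.7+.
_SENTENCE_END = re.compile(r"(?<=[。！？])")


def read_text_by_punctuation(fp: str) -> Iterator[str]:
    """Yield sentences from a UTF-8 text file, split on sentence-final punctuation."""
    with open(fp, encoding="utf-8") as f:
        text = f.read()
    for piece in _SENTENCE_END.split(text):
        piece = piece.strip()
        if piece:  # drop the empty remainder after the last sentence
            yield piece


# Small demonstration on a temporary file
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "text1.txt")
    with open(path, "w", encoding="utf-8") as f:
        f.write("這是一個句子。這是第二個句子。\n")
    sentences = list(read_text_by_punctuation(path))

print(sentences)

# Assumed usage with the reader (requires hgct and a corpus directory):
# corpus = PlainTextReader("minimal_plaintext_corpus/",
#                          plain_text_reader=read_text_by_punctuation).corpus
```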


hgct.UtilsTextProcess.read_text_as_sentences(fp)

Read a text file as a sequence of sentences

Parameters

fp (str): Path to a UTF-8 encoded plain text file.

Yields

str: A string representing a sentence. A sentence corresponds to a line in the file.
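Because the function yields rather than returns, its result is a generator that has to be consumed, for example with list() or a for loop. The sketch below reproduces the documented behaviour (one line, one sentence) under a hypothetical name, `read_lines_as_sentences`. It is not the library's source, and stripping whitespace and skipping blank lines are assumptions.

```python
import os
import tempfile
from typing import Iterator


def read_lines_as_sentences(fp: str) -> Iterator[str]:
    """Yield each line of a UTF-8 text file as one sentence."""
    with open(fp, encoding="utf-8") as f:
        for line in f:
            sentence = line.strip()  # assumption: surrounding whitespace is dropped
            if sentence:             # assumption: blank lines are skipped
                yield sentence


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "text1.txt")
    with open(path, "w", encoding="utf-8") as f:
        f.write("這是一個句子。\n這是第二個句子。\n")

    reader = read_lines_as_sentences(path)  # nothing is read yet
    first = next(reader)                    # reads only as far as the first sentence
    rest = list(reader)                     # consumes the remaining lines

print(first, rest)
```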