4. Corpus Search & Concordancer
- class hgct.concordancer.Concordancer(corpus)
- Attributes
- all_idx_cache
- chr_idcs
- chr_radicals
- corp_size
- cql_attrs
- lexicon
- Methods
- collocates(node_cql[, left, right, ...]): node vs. ~node contingency table
- cql_search(cql[, left, right]): Search the corpus with Corpus Query Language
- bigram_associations
- freq_distr_ngrams
- get_meta
- get_meta_by_path
- get_text
- get_texts
- index_corpus
- index_path
- list_files
- collocates(node_cql: str, left=1, right=1, subcorp_idx=None, sort_by='Gsq', alpha=0.1, chinese_only=True)
                 node    ~node
    collocate    O11     O12     R1 (char index len)
    ~collocate   O21     O22     R2
                 C1      C2      CorpSize
                 (Concord num)
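The default `sort_by='Gsq'` suggests that collocates are ranked by the log-likelihood ratio statistic (G²) computed from the four observed cells above. The following is a minimal sketch of that standard statistic, assuming this is what `Gsq` refers to; `g_squared` is a hypothetical helper written for illustration and is not part of hgct.

```python
import math


def g_squared(o11: int, o12: int, o21: int, o22: int) -> float:
    """Log-likelihood ratio (G^2) of a 2x2 contingency table.

    Hypothetical helper, not hgct's implementation.
    """
    r1, r2 = o11 + o12, o21 + o22  # row totals (R1, R2)
    c1, c2 = o11 + o21, o12 + o22  # column totals (C1, C2)
    n = r1 + r2                    # grand total (CorpSize in the table)
    if n == 0:
        return 0.0

    total = 0.0
    for observed, row, col in (
        (o11, r1, c1),
        (o12, r1, c2),
        (o21, r2, c1),
        (o22, r2, c2),
    ):
        if observed > 0:  # a zero cell contributes nothing (0 * ln 0 := 0)
            expected = row * col / n
            total += observed * math.log(observed / expected)
    return 2.0 * total


independent = g_squared(10, 10, 10, 10)  # observed counts equal expected counts
associated = g_squared(30, 10, 10, 30)   # node and collocate co-occur more than expected
```

A table whose observed counts match the expected counts gives a score of zero, and the score grows as the node and the collocate become more strongly associated.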
- cql_search(cql: str, left=5, right=5)
Search the corpus with Corpus Query Language
- Parameters
- cql : str
A CQL query
- left : int, optional
Left context size, by default 5
- right : int, optional
Right context size, by default 5
- Yields
- dict
A dictionary with the structure:
    {
        'left': [<tk>, <tk>, ...],
        'keyword': [<tk>, <tk>, ...],
        'right': [<tk>, <tk>, ...],
        'position': (<int>, <int>, <int>, <int>),
        'captureGroups': {'verb': [<tk>], 'noun': [<tk>]}
    }
where <tk> is a token (a character), represented as a string.
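As a rough illustration of how a yielded dictionary might be consumed, the sketch below joins the token lists back into strings. It runs on a hand-written sample that follows the documented structure rather than on real `cql_search` output; `render_hit` and `capture_strings` are hypothetical helpers, and the sample tokens are made up for the example.

```python
def render_hit(hit: dict) -> str:
    """Join the token lists of one result into a bracketed KWIC-style string."""
    left = "".join(hit["left"])
    keyword = "".join(hit["keyword"])
    right = "".join(hit["right"])
    return f"{left}[{keyword}]{right}"


def capture_strings(hit: dict) -> dict:
    """Map each capture-group label to the concatenation of its tokens."""
    groups = hit.get("captureGroups", {})
    return {label: "".join(tokens) for label, tokens in groups.items()}


# Hand-written sample in the documented shape (not real search output).
sample_hit = {
    "left": ["又", "喜"],
    "keyword": ["將", "軍"],
    "right": ["之", "去"],
    "position": (2, 55, 5, 208),
    "captureGroups": {"noun": ["軍"]},
}

rendered = render_hit(sample_hit)
captured = capture_strings(sample_hit)
```

With a real corpus, the same helpers could be applied to each dictionary produced by iterating over `cql_search(...)`.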
- class hgct.concordancer.ConcordLine(cc: dict)
- Methods
- get_kwic([return_keyword_idx]): Get string representation of the concordance line
- get_timestep([key]): Get time step info of the concordance line
- to_json
- __init__(cc: dict)
Initialize an instance of concordance line
- Parameters
- cc : dict
A dictionary returned by concordancer.Concordancer._kwic_single(). It has the following structure:
    {
        'left': ',又喜',
        'keyword': '將軍之去,',
        'right': '計必乘',
        'position': (2, 55, 5, 208),
        'meta': {
            'id': '03/三國志_蜀書七.txt',
            'time': {'time_range': [221, 589], 'label': '魏晉南北', 'ord': 3},
            'text': {'book': '三國志', 'sec': '蜀書七'}
        },
        'captureGroups': {'obj': {'s': '去,', 'i': [3, 4]}}
    }
- get_kwic(return_keyword_idx=True)
Get string representation of the concordance line
- Parameters
- return_keyword_idx : bool, optional
Whether to return the index of the keywords in the concordance line, by default True
- Returns
- str or tuple
If return_keyword_idx is True, returns a tuple whose second element is the index range of the keyword, (idx_from, idx_to). Otherwise, returns only the string.
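To make the relationship between the KWIC string and the keyword index concrete, here is a minimal sketch that builds both from the `left`, `keyword`, and `right` fields of the dictionary shown for `__init__`. It assumes plain concatenation and a half-open `(idx_from, idx_to)` range; the actual string format and index convention of `get_kwic` may differ, and `kwic_with_index` is a hypothetical helper.

```python
def kwic_with_index(cc: dict):
    """Return (line, (idx_from, idx_to)) for a concordance dictionary.

    Assumption: the line is left + keyword + right, and the index range
    is half-open, so line[idx_from:idx_to] recovers the keyword.
    """
    left, keyword, right = cc["left"], cc["keyword"], cc["right"]
    line = left + keyword + right
    idx_from = len(left)
    idx_to = idx_from + len(keyword)
    return line, (idx_from, idx_to)


# Fields copied from the example dictionary documented for __init__.
cc = {"left": ",又喜", "keyword": "將軍之去,", "right": "計必乘"}

line, (idx_from, idx_to) = kwic_with_index(cc)
keyword_slice = line[idx_from:idx_to]  # same text as cc["keyword"]
```

Slicing the line with the returned range gives back the keyword, which is what makes the index useful for highlighting.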
- get_timestep(key: Optional[Callable] = None)
Get time step info of the concordance line
- Parameters
- key : Callable, optional
If specified, applied on self.meta.time to return time step data. By default None, which uses subcorp_idx as time step information.
- Returns
- int
The time step that the concordance line belongs to.
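The `key` argument can be pictured as a function that receives the `time` metadata and returns the time step. The sketch below mimics the documented behavior on the `time` dictionary from the `__init__` example; `resolve_timestep` is a hypothetical stand-in for the method, and the `subcorp_idx` value of 2 is an arbitrary illustrative number.

```python
from typing import Callable, Optional


def resolve_timestep(time_meta: dict, subcorp_idx: int,
                     key: Optional[Callable] = None):
    """Stand-in for the documented behavior (not hgct's code).

    Without a key, fall back to the subcorpus index; otherwise apply
    the key to the time metadata.
    """
    if key is None:
        return subcorp_idx
    return key(time_meta)


# 'time' metadata copied from the example dictionary for __init__.
time_meta = {"time_range": [221, 589], "label": "魏晉南北", "ord": 3}

default_step = resolve_timestep(time_meta, subcorp_idx=2)
by_ord = resolve_timestep(time_meta, subcorp_idx=2, key=lambda t: t["ord"])
by_start_year = resolve_timestep(
    time_meta, subcorp_idx=2, key=lambda t: t["time_range"][0]
)
```

Passing a different callable changes the granularity of the time step, for example period order versus starting year, without changing the concordance line itself.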