4. Corpus Search & Concordancer

class hgct.concordancer.Concordancer(corpus)[source]
Attributes
all_idx_cache
chr_idcs
chr_radicals
corp_size
cql_attrs
lexicon

Methods

collocates(node_cql[, left, right, ...])

Find collocates of the node matched by a CQL query

cql_search(cql[, left, right])

Search the corpus with Corpus Query Language

bigram_associations

freq_distr_ngrams

get_meta

get_meta_by_path

get_text

get_texts

index_corpus

index_path

list_files

collocates(node_cql: str, left=1, right=1, subcorp_idx=None, sort_by='Gsq', alpha=0.1, chinese_only=True)[source]

                node    ~node
    collocate   O11     O12     R1  (char index len)
    ~collocate  O21     O22     R2
                C1      C2      CorpSize
                (Concord num)
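The default sort_by='Gsq' suggests that collocates are ranked by the log-likelihood ratio statistic (G²) derived from this table. The sketch below shows the textbook G² calculation for a 2x2 table. Whether hgct computes its score exactly this way is an assumption, and the function name g_squared and the sample counts are illustrative only.

```python
import math

def g_squared(o11: int, o12: int, o21: int, o22: int) -> float:
    """Textbook log-likelihood ratio (G²) for a 2x2 contingency table."""
    r1, r2 = o11 + o12, o21 + o22   # row totals (collocate, ~collocate)
    c1, c2 = o11 + o21, o12 + o22   # column totals (node, ~node)
    n = r1 + r2                     # corpus size
    if n == 0:
        return 0.0
    observed = (o11, o12, o21, o22)
    # Expected counts under independence: E_ij = R_i * C_j / N
    expected = (r1 * c1 / n, r1 * c2 / n, r2 * c1 / n, r2 * c2 / n)
    # Cells with a zero observed count contribute nothing to the sum.
    return 2 * sum(o * math.log(o / e) for o, e in zip(observed, expected) if o > 0)

# Made-up counts: the first table is perfectly independent, the second is not.
print(g_squared(10, 20, 30, 60))
print(g_squared(30, 10, 10, 950))
```

A larger value indicates a stronger departure from independence between node and collocate.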

cql_search(cql: str, left=5, right=5)

Search the corpus with Corpus Query Language

Parameters
cql : str

A CQL query

left : int, optional

Left context size, by default 5

right : int, optional

Right context size, by default 5

Yields
dict

A dictionary with the structure:

{
    'left': [<tk>, <tk>, ...],
    'keyword': [<tk>, <tk>, ...],
    'right': [<tk>, <tk>, ...],
    'position': ( <int>, <int>, <int>, <int> ),
    'captureGroups': {
        'verb': [<tk>],
        'noun': [<tk>]}
}

where <tk> is a token (a single character), represented as a string.
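A result with this structure can be turned into a keyword-in-context string by joining its token lists. The following is a minimal sketch that assumes only the dictionary layout documented above. The helper format_hit and the sample values are hypothetical, and no hgct call is made.

```python
def format_hit(hit: dict, sep: str = "") -> str:
    """Join the token lists of one search result into 'left[keyword]right'."""
    left = sep.join(hit["left"])
    keyword = sep.join(hit["keyword"])
    right = sep.join(hit["right"])
    return f"{left}[{keyword}]{right}"

# Hand-written sample mirroring the documented structure (not real output).
sample_hit = {
    "left": ["又", "喜"],
    "keyword": ["將", "軍"],
    "right": ["之", "去"],
    "position": (2, 55, 5, 208),
    "captureGroups": {"noun": ["軍"]},
}

print(format_hit(sample_hit))
```

Because cql_search yields its results one at a time, a helper like this could be applied inside a for loop over the generator.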

class hgct.concordancer.ConcordLine(cc: dict)[source]

Methods

get_kwic([return_keyword_idx])

Get string representation of the concordance line

get_timestep([key])

Get time step info of the concordance line

to_json

__init__(cc: dict)[source]

Initialize an instance of concordance line

Parameters
cc : dict

A dictionary returned by concordancer.Concordancer._kwic_single(). It has the following structure:

{
    'left': ',又喜',
    'keyword': '將軍之去,',
    'right': '計必乘',
    'position': (2, 55, 5, 208),
    'meta': {
        'id': '03/三國志_蜀書七.txt',
        'time': {
            'time_range': [221, 589], 
            'label': '魏晉南北', 
            'ord': 3
        },
        'text': {
            'book': '三國志', 'sec': '蜀書七'
        }
    },
    'captureGroups': {
        'obj': {'s': '去,', 'i': [3, 4]}
    }
}

get_kwic(return_keyword_idx=True)[source]

Get string representation of the concordance line

Parameters
return_keyword_idx : bool, optional

Whether to return the index of the keywords in the concordance line, by default True

Returns
str or tuple

If return_keyword_idx is True, returns a tuple whose second element is the index of the keywords (idx_from, idx_to); otherwise returns the string alone.
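To illustrate the two return shapes, here is a rough sketch of how a KWIC string and a keyword index could be derived from the 'left', 'keyword', and 'right' strings of the dictionary shown for __init__. It is not the library's implementation: the name kwic_from_dict is hypothetical, and the absence of separators and the half-open idx_to convention are assumptions.

```python
def kwic_from_dict(cc: dict, return_keyword_idx: bool = True):
    """Concatenate left, keyword, and right; optionally return the keyword span."""
    left, keyword, right = cc["left"], cc["keyword"], cc["right"]
    line = left + keyword + right
    if not return_keyword_idx:
        return line
    idx_from = len(left)
    idx_to = idx_from + len(keyword)  # half-open end (assumption)
    return line, (idx_from, idx_to)

# Strings taken from the example dictionary in the __init__ documentation.
cc = {"left": ",又喜", "keyword": "將軍之去,", "right": "計必乘"}
line, (start, end) = kwic_from_dict(cc)
print(line, start, end)
```

With a half-open span, line[start:end] recovers the keyword directly, which is convenient for highlighting.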

get_timestep(key: Optional[Callable] = None)[source]

Get time step info of the concordance line

Parameters
key : Callable, optional

If specified, it is applied to self.meta.time to return time step data. By default None, which uses subcorp_idx as the time step information.

Returns
int

The time step that the concordance line belongs to.
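As an illustration of the key parameter, the functions below are examples of callables that map a time record, shaped like the 'time' entry in the __init__ example, to a time step. The names by_ord and by_century are hypothetical, and it is assumed here that self.meta.time is passed to the callable as a plain dict.

```python
def by_ord(time: dict) -> int:
    """Use the ordinal position of the period as the time step."""
    return time["ord"]

def by_century(time: dict) -> int:
    """Use the century (CE years) of the period's start year as the time step."""
    start_year, _end_year = time["time_range"]
    return start_year // 100 + 1

# Values copied from the 'time' entry of the example dictionary.
time_meta = {"time_range": [221, 589], "label": "魏晉南北", "ord": 3}
print(by_ord(time_meta), by_century(time_meta))
```

Such a callable would then be passed as the key argument, for example get_timestep(key=by_ord) on a ConcordLine instance.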