Concordancer

class concordancer.concordancer.Concordancer(corpus: list, text_key='text')

Methods

cql_search(cql[, left, right])

Search the corpus with Corpus Query Language

set_cql_parameters(default_attr[, max_quant])

Set parameters for CQL queries in the Concordancer

get_corp_data

__init__(corpus: list, text_key='text')

Index the corpus

Parameters
corpus: list

Corpus data

text_key: str

The key under which the text is stored in the JSON data, by default “text”

Notes

The corpus can be structured as a list of dictionaries:

[
    {
        "<text_dictkey>": [
            {"word": "<word>", "pos": "<pos>"},
            {"word": "<word>", "pos": "<pos>"},
            {"word": "<word>", "pos": "<pos>"},
            ...
        ],
        ...
    }
]

Or, simply a nested list:

[
    [
        [
            ["<word>", "<pos>"],
            ["<word>", "<pos>"],
            ["<word>", "<pos>"],
            ...
        ]
    ],
    [...],  # another text
    ...
]
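As a small illustration of the nested-list layout (texts, then sentences, then `[word, pos]` tokens), the sketch below builds such a corpus from pre-tagged sentences. The helper name `to_nested_corpus` and the sample data are made up for this example and are not part of the package; the commented-out constructor call assumes the package is installed.

```python
def to_nested_corpus(texts):
    """Convert texts -> sentences -> (word, pos) pairs into nested lists.

    Hypothetical helper for illustration only; it is not part of the
    concordancer package.
    """
    return [
        [[[word, pos] for word, pos in sentence] for sentence in text]
        for text in texts
    ]


# Made-up sample data: two texts, each containing one tagged sentence.
tagged = [
    [  # first text
        [("She", "PRON"), ("hits", "V"), ("the", "DET"), ("ball", "N")],
    ],
    [  # second text
        [("He", "PRON"), ("runs", "V")],
    ],
]

corpus = to_nested_corpus(tagged)

# Second token of the first sentence of the first text.
print(corpus[0][0][1])  # ['hits', 'V']

# With the package installed, the corpus would be passed to the constructor:
# from concordancer.concordancer import Concordancer
# concordancer = Concordancer(corpus)
```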

cql_search(cql: str, left=5, right=5)

Search the corpus with Corpus Query Language

Parameters
cql: str

A CQL query

left: int, optional

Left context size, by default 5

right: int, optional

Right context size, by default 5

Yields
dict

A dictionary with the structure:

{
    'left': [<tk>, <tk>, ...],
    'keyword': [<tk>, <tk>, ...],
    'right': [<tk>, <tk>, ...],
    'position': {
        'doc_idx': <int>, 
        'sent_idx': <int>, 
        'tk_idx': <int>
    },
    'captureGroups': {
        'verb': [<tk>],
        'noun': [<tk>]}
}

where <tk> is a token, represented as a dictionary, for instance:

{
    'word': 'hits', 
    'lemma': 'hit',
    'pos': 'V',
}
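A yielded result can be turned into a keyword-in-context line by joining one attribute of each token. The following is a minimal sketch that relies only on the dictionary structure described above; `format_kwic` is a hypothetical helper, and the sample result is hand-written rather than actual output of `cql_search`.

```python
def format_kwic(result, attr="word", sep=" "):
    """Render one result as 'left [keyword] right' using a token attribute.

    Hypothetical helper; it assumes every token dictionary has `attr`.
    """
    def join(tokens):
        return sep.join(tk[attr] for tk in tokens)

    left = join(result["left"])
    keyword = join(result["keyword"])
    right = join(result["right"])
    return f"{left} [{keyword}] {right}"


# Hand-written result following the documented structure.
result = {
    "left": [{"word": "She", "lemma": "she", "pos": "PRON"}],
    "keyword": [{"word": "hits", "lemma": "hit", "pos": "V"}],
    "right": [
        {"word": "the", "lemma": "the", "pos": "DET"},
        {"word": "ball", "lemma": "ball", "pos": "N"},
    ],
    "position": {"doc_idx": 0, "sent_idx": 0, "tk_idx": 1},
    "captureGroups": {},
}

print(format_kwic(result))              # She [hits] the ball
print(format_kwic(result, attr="pos"))  # PRON [V] DET N
```

Because `cql_search` yields results one at a time, a helper like this could be applied inside a `for` loop over the search call.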

set_cql_parameters(default_attr: str, max_quant: int = 6)

Set parameters for CQL queries in the Concordancer

Parameters
default_attr: str

The default attribute of the tokens. CQL allows expressing a token without specifying its attribute, like "hits". If default_attr is set to, for example, word, "hits" is then equivalent to [word="hits"] in CQL.
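The equivalence described for `default_attr` can be pictured as a simple rewrite of bare quoted tokens. The function below is only an illustration of that equivalence; `expand_default_attr` is a hypothetical name, and this is not how the package parses CQL internally.

```python
import re


def expand_default_attr(token, default_attr):
    """Rewrite a bare token such as "hits" as [default_attr="hits"].

    Tokens that are not a plain quoted string (for example, tokens already
    written in bracket form) are returned unchanged. Illustration only.
    """
    token = token.strip()
    if re.fullmatch(r'"[^"]*"', token):
        return f"[{default_attr}={token}]"
    return token


print(expand_default_attr('"hits"', "word"))     # [word="hits"]
print(expand_default_attr('"hit"', "lemma"))     # [lemma="hit"]
print(expand_default_attr('[pos="V"]', "word"))  # [pos="V"]
```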

max_quant: int, optional

The maximum number of repetitions evaluated for the CQL token-level quantifiers. max_quant applies to two quantifiers: + and *. Their upper bounds are theoretically infinite, but since an infinite number of queries cannot be generated, an upper bound must be specified. By default, it is set to 6.
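
To make the role of the upper bound concrete, the sketch below enumerates the fixed-length token sequences that a quantified token could stand for. It assumes that `+` covers 1 to `max_quant` repetitions and `*` covers 0 to `max_quant`, both inclusive; that reading, and the name `expand_quantifier`, are assumptions made for illustration, not the package's actual implementation.

```python
def expand_quantifier(token, quantifier, max_quant=6):
    """List the bounded repetitions of `token` for the '+' or '*' quantifier.

    Assumes an inclusive upper bound of `max_quant` repetitions.
    Hypothetical helper for illustration only.
    """
    if quantifier == "+":
        low = 1  # one or more
    elif quantifier == "*":
        low = 0  # zero or more
    else:
        raise ValueError(f"unsupported quantifier: {quantifier!r}")
    return [[token] * n for n in range(low, max_quant + 1)]


plus = expand_quantifier('[pos="ADJ"]', "+", max_quant=3)
star = expand_quantifier('[pos="ADJ"]', "*", max_quant=3)

print([len(seq) for seq in plus])  # [1, 2, 3]
print([len(seq) for seq in star])  # [0, 1, 2, 3]
```

Under this reading, a larger `max_quant` lets a query match longer runs of tokens, at the cost of more variants to evaluate.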