Concordancer
- class concordancer.concordancer.Concordancer(corpus: list, text_key='text')
Methods
cql_search(cql[, left, right]): Search the corpus with Corpus Query Language
set_cql_parameters(default_attr[, max_quant]): Set parameters for CQL queries in the Concordancer
get_corp_data
- __init__(corpus: list, text_key='text')
Index the corpus.
- Parameters
- corpus: list
Corpus data
- text_key: str
The key to where text is stored in a JSON file, by default “text”
Notes
Structure of corpus could be:
[
  {
    "<text_dictkey>": [
      {"word": "<word>", "pos": "<pos>"},
      {"word": "<word>", "pos": "<pos>"},
      {"word": "<word>", "pos": "<pos>"},
      ...
    ],
    ...
  }
]
Or, simply a nested list:
[
  [
    [
      ["<word>", "<pos>"],
      ["<word>", "<pos>"],
      ["<word>", "<pos>"],
      ...
    ]
  ],
  [...],  # another text
  ...
]
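As a rough illustration of the two shapes described in the notes, the sketch below builds a tiny corpus in each form using plain Python literals. The token values, the "id" metadata field, and the variable names are made up for the example; only the "word"/"pos" keys and the default "text" key come from the description above.

```python
import json

# Dictionary form: one dict per text. The tokens sit under the text key
# ("text" by default); "id" is a hypothetical extra metadata field.
corpus_dicts = [
    {
        "id": "doc-1",
        "text": [
            {"word": "the", "pos": "DET"},
            {"word": "dog", "pos": "N"},
            {"word": "barks", "pos": "V"},
        ],
    }
]

# Nested-list form: corpus -> text -> token group -> [word, pos] pair.
corpus_lists = [
    [  # first text
        [  # first group of tokens in that text
            ["the", "DET"],
            ["dog", "N"],
            ["barks", "V"],
        ]
    ],
]

# Both forms use only JSON-compatible types, so they survive a round trip
# through a JSON file unchanged.
restored = json.loads(json.dumps(corpus_dicts))
```

Either list could then be passed as the corpus argument, with text_key left at its default for the dictionary form.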
- cql_search(cql: str, left=5, right=5)
Search the corpus with Corpus Query Language
- Parameters
- cql: str
A CQL query
- left: int, optional
Left context size, by default 5
- right: int, optional
Right context size, by default 5
- Yields
- dict
A dictionary with the structure:
{
  'left': [<tk>, <tk>, ...],
  'keyword': [<tk>, <tk>, ...],
  'right': [<tk>, <tk>, ...],
  'position': {
    'doc_idx': <int>,
    'sent_idx': <int>,
    'tk_idx': <int>
  },
  'captureGroups': {
    'verb': [<tk>],
    'noun': [<tk>]
  }
}
where <tk> is a token, represented as a dictionary, for instance:
{
  'word': 'hits',
  'lemma': 'hit',
  'pos': 'V',
}
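To show how a yielded result might be consumed, here is a minimal sketch that renders one result dictionary as a keyword-in-context line. The helper format_kwic and the hand-written sample are hypothetical and not part of the library; the sample only mimics the structure documented above.

```python
from typing import Dict, List

Token = Dict[str, str]


def format_kwic(result: dict, attr: str = "word") -> str:
    """Render a result as 'left [keyword] right' using one token attribute."""

    def join(tokens: List[Token]) -> str:
        return " ".join(tk[attr] for tk in tokens)

    return f"{join(result['left'])} [{join(result['keyword'])}] {join(result['right'])}"


# Hand-written sample that follows the documented result structure.
sample = {
    "left": [
        {"word": "the", "lemma": "the", "pos": "DET"},
        {"word": "dog", "lemma": "dog", "pos": "N"},
    ],
    "keyword": [{"word": "hits", "lemma": "hit", "pos": "V"}],
    "right": [
        {"word": "the", "lemma": "the", "pos": "DET"},
        {"word": "ball", "lemma": "ball", "pos": "N"},
    ],
    "position": {"doc_idx": 0, "sent_idx": 0, "tk_idx": 2},
    "captureGroups": {"verb": [{"word": "hits", "lemma": "hit", "pos": "V"}]},
}

line = format_kwic(sample)                 # surface forms
lemmas = format_kwic(sample, attr="lemma")  # lemmas instead of word forms
```

Because cql_search yields results one at a time, a helper like this could be applied inside a for loop over the generator without collecting all matches first.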
- set_cql_parameters(default_attr: str, max_quant: int = 6)
Set parameters for CQL queries in the Concordancer
- Parameters
- default_attr: str
The default attribute of the tokens. CQL allows expressing a token without specifying its attribute, like "hits". If default_attr is set to, for example, word, "hits" is then equivalent to [word="hits"] in CQL.
- max_quant: int, optional
The maximum number of repetitions to evaluate for a CQL token-level quantifier. max_quant is used in two CQL expressions: + and *. The upper bounds of these quantifiers are theoretically infinite, but since the computer cannot generate an infinite number of queries, an upper bound must be specified. By default, it is set to 6.
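The effect of the two parameters can be pictured with a small, self-contained sketch. Both functions below are hypothetical illustrations and not the library's implementation: apply_default_attr rewrites a bare quoted token into its attribute form, and expand_quantifier enumerates the finite set of repetitions that a bounded + or * could stand for, assuming + starts at one repetition and * at zero.

```python
from typing import List


def apply_default_attr(token: str, default_attr: str) -> str:
    """Rewrite a bare quoted token such as '"hits"' into '[attr="hits"]'."""
    if token.startswith('"') and token.endswith('"'):
        return f"[{default_attr}={token}]"
    return token  # already an explicit [attr="value"] token


def expand_quantifier(token: str, quantifier: str, max_quant: int = 6) -> List[List[str]]:
    """List every token sequence a bounded quantifier may stand for."""
    if quantifier == "+":
        lowest = 1  # assumption: one or more repetitions
    elif quantifier == "*":
        lowest = 0  # assumption: zero or more repetitions
    else:
        raise ValueError(f"unsupported quantifier: {quantifier!r}")
    return [[token] * n for n in range(lowest, max_quant + 1)]


explicit = apply_default_attr('"hits"', "word")
plus_variants = expand_quantifier('[pos="N"]', "+", max_quant=3)
star_variants = expand_quantifier('[pos="N"]', "*")
```

Under these assumptions, the number of candidate sequences grows with max_quant, which suggests why a modest bound is a sensible default.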