
class KWIC.queryDB.Corpus(db='data/asbc.sqlite', corp='data/asbc_lite.jsonl')

Query a corpus stored in a SQLite database


concordance(self, text_id, sent_id, position)

Retrieve all KWIC instances from corpus based on positional information

getNgram(self, text_id, sent_id, position[, …])

Get the ngram of a seed token from the in-memory corpus

queryNgram(self, query[, anchor, gender])

Query KWIC of phrases

queryOneGram(self, token, pos[, matchOpr, …])

Query KWIC of one token

concordance(self, text_id, sent_id, position, n=1, left=10, right=10)

Retrieve all KWIC instances from corpus based on positional information


text_id: int

Index of a text (the first level of the corpus), giving the position of the text in the corpus.


sent_id: int

Index of a sentence (the second level of the corpus), giving the position of the sentence in its text.


position: int

Index of a word (the third level of the corpus), giving the position of the word in its sentence.

n: int, optional

Keyword length, by default 1

left: int, optional

Left context size, in number of tokens, by default 10

right: int, optional

Right context size, in number of tokens, by default 10


Returns a dictionary with:

  • keyword: the keyword and its PoS tag

  • left & right: the left and right context, consisting of tokens and their PoS tags.
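
The windowing described above can be illustrated with a minimal sketch on a plain nested list. This is not the library's implementation: `kwic_window`, the toy corpus, the PoS tags, and the choice to clip the context at the sentence boundary are all assumptions made for illustration. Only the parameter names and the keyword/left/right structure come from the documentation.

```python
def kwic_window(corpus, text_id, sent_id, position, n=1, left=10, right=10):
    """Hypothetical stand-in for a concordance lookup on a nested list.

    corpus[text_id][sent_id][position] is assumed to be a (word, tag) pair.
    """
    sent = corpus[text_id][sent_id]
    start = max(0, position - left)  # clip the left context at the sentence start
    end = position + n               # first index after the keyword
    return {
        "keyword": sent[position:end],
        "left": sent[start:position],
        "right": sent[end:end + right],  # slicing clips at the sentence end
    }


# Toy corpus: one text containing one sentence (tags are illustrative only).
corpus = [
    [
        [("我", "Nh"), ("喜歡", "VK"), ("看", "VC"), ("書", "Na")],
    ],
]

result = kwic_window(corpus, text_id=0, sent_id=0, position=2, left=1, right=1)
print(result["left"], result["keyword"], result["right"])
```

With `left=1` and `right=1`, the keyword at position 2 is returned together with one token of context on each side.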

getNgram(self, text_id, sent_id, position, anchor={'n': 4, 'seed': 1})

Get the ngram of a seed token from the in-memory corpus

The three parameters text_id, sent_id, and position together locate a seed token in the corpus. The ngram in which this seed token lies is described by the parameter anchor.


text_id: int

The index of the text in the corpus.


sent_id: int

The index of the sentence in the text.


position: int

The index of the token in the sentence.

anchor: dict, optional

Information about the seed token's ngram, by default {'n': 4, 'seed': 1}.

  • seed: The token’s position in the ngram

  • n: The ngram’s length


Returns the ngram as a list of (word, tag) pairs.
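
How anchor relates the seed token to its ngram can be sketched as follows. The helper `ngram_around` is hypothetical and does not reproduce the library's code. In particular, it assumes that seed is a 0-based offset of the seed token inside the ngram, and that an ngram crossing a sentence boundary yields an empty list; the documentation does not state either point.

```python
def ngram_around(corpus, text_id, sent_id, position, anchor=None):
    """Hypothetical sketch: slice the ngram that contains a seed token.

    Assumes anchor["seed"] is the 0-based offset of the seed token
    within an ngram of length anchor["n"].
    """
    anchor = anchor or {"n": 4, "seed": 1}
    sent = corpus[text_id][sent_id]
    start = position - anchor["seed"]  # index of the ngram's first token
    end = start + anchor["n"]
    if start < 0 or end > len(sent):
        return []  # assumption: no ngram if it would leave the sentence
    return sent[start:end]


# Toy corpus: one text containing one sentence (tags are illustrative only).
corpus = [
    [
        [("我", "Nh"), ("喜歡", "VK"), ("看", "VC"),
         ("這", "Nep"), ("本", "Nf"), ("書", "Na")],
    ],
]

# Seed token "看" at position 2, sitting at offset 1 of a 4-gram.
ngram = ngram_around(corpus, text_id=0, sent_id=0, position=2)
print(ngram)
```

Under these assumptions the seed token at position 2 yields the 4-gram covering positions 1 through 4.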

queryNgram(self, query, anchor={'n': 2, 'seed': 1}, gender=None)

Query KWIC of phrases


query: list

A list of token objects (dictionaries), each representing one token of the query string (i.e. one token enclosed in brackets). Returned by queryParser.tokenize().

anchor: dict, optional

Passed to anchor in getNgram(), by default {'n': 2, 'seed': 1}.

gender: int, optional

Passed to gender in queryOneGram(), by default None.


Returns a pandas DataFrame of matching keywords and their positional information in the corpus.

queryOneGram(self, token, pos, matchOpr={'token': '=', 'pos': 'REGEXP'}, gender=None)

Query KWIC of one token


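token: str

Pattern of the keyword's form, interpreted according to matchOpr.


pos: str

Pattern of the keyword's PoS tag, interpreted according to matchOpr. With the default REGEXP operator, e.g., to search for: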

  • Nouns, use N.*

  • Verbs, use V.*

See the tag set here.

matchOpr: dict

The operator <opr> given to the SQL command in WHERE x <opr> pattern. Can be one of = (exact match), REGEXP (uses RegEx to match pattern), or LIKE (uses % to match pattern). Defaults to exact match (=) for token and RegEx match (REGEXP) for pos.

gender: int, optional

Pre-filter the SQL database based on the sex of the texts' authors.

  • 0: female

  • 1: male

  • other values: all (no filter)


Returns a pandas DataFrame of matching keywords and their positional information in the corpus.
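
The role of matchOpr in the WHERE clause can be sketched with Python's built-in sqlite3 module. The table name `tokens`, its columns, and the helper `find_tokens` are hypothetical; the actual schema of the corpus database is not documented here. Note that SQLite has no built-in implementation of REGEXP, so the sketch registers a user-defined function for it, and that an operator cannot be bound as a query parameter, so it is checked against a whitelist before being interpolated.

```python
import re
import sqlite3

ALLOWED_OPERATORS = {"=", "LIKE", "REGEXP"}


def regexp(pattern, value):
    """Backs SQLite's `value REGEXP pattern`; returns 1 on a match, else 0."""
    return int(value is not None and re.search(pattern, value) is not None)


def find_tokens(conn, token, pos, match_opr=None):
    """Hypothetical query helper: WHERE token <opr> ? AND pos <opr> ?"""
    match_opr = match_opr or {"token": "=", "pos": "REGEXP"}
    for opr in match_opr.values():
        if opr not in ALLOWED_OPERATORS:
            raise ValueError(f"unsupported operator: {opr!r}")
    sql = (
        "SELECT text_id, sent_id, position, token, pos FROM tokens "
        f"WHERE token {match_opr['token']} ? AND pos {match_opr['pos']} ? "
        "ORDER BY text_id, sent_id, position"
    )
    return conn.execute(sql, (token, pos)).fetchall()


# In-memory database with an assumed, illustrative schema and tags.
conn = sqlite3.connect(":memory:")
conn.create_function("REGEXP", 2, regexp)
conn.execute(
    "CREATE TABLE tokens "
    "(text_id INTEGER, sent_id INTEGER, position INTEGER, token TEXT, pos TEXT)"
)
conn.executemany(
    "INSERT INTO tokens VALUES (?, ?, ?, ?, ?)",
    [
        (0, 0, 0, "我", "Nh"),
        (0, 0, 1, "看", "VC"),
        (0, 0, 2, "書", "Na"),
        (0, 1, 0, "看", "Na"),
    ],
)

# Default operators: exact token match, RegEx match on the PoS tag.
exact_verbs = find_tokens(conn, "看", "V.*")

# LIKE on both fields: any token whose tag starts with "N".
like_nouns = find_tokens(conn, "%", "N%", {"token": "LIKE", "pos": "LIKE"})
print(exact_verbs, len(like_nouns))
```

With the default operators, only the row whose token is exactly 看 and whose tag matches V.* is returned; switching both operators to LIKE uses % wildcards instead.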