Corpus

class KWIC.queryDB.Corpus(db='data/asbc.sqlite', corp='data/asbc_lite.jsonl')

Query the corpus from an SQLite database

Methods

concordance(self, text_id, sent_id, position)

Retrieve all KWIC instances from corpus based on positional information

getNgram(self, text_id, sent_id, position[, …])

Get the ngram of a seed token from the in-memory corpus

queryNgram(self, query[, anchor, gender])

Query KWIC of phrases

queryOneGram(self, token, pos[, matchOpr, …])

Query KWIC of one token

concordance(self, text_id, sent_id, position, n=1, left=10, right=10)

Retrieve all KWIC instances from corpus based on positional information

Parameters
text_id: int

Index of the text in the corpus (the first, text level of the corpus). It gives the position of the text among the texts in the corpus.

sent_id: int

Index of the sentence in the text (the second, sentence level of the corpus). It gives the position of the sentence among the sentences in a text.

position: int

Index of the word in the sentence (the third, word level of the corpus). It gives the position of the word among the words in a sentence.

n: int, optional

Keyword length, by default 1

left: int, optional

Left context size, in number of tokens, by default 10

right: int, optional

Right context size, in number of tokens, by default 10

Returns
dict

A dictionary with:

  • keyword: the keyword and its PoS tag

  • left & right: the left and right context, consisting of tokens and their PoS tags.
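The parameters and return value above can be illustrated with a minimal sketch. It is not the library's implementation: it assumes the corpus is a nested list (texts > sentences > (word, tag) tokens), that the context stays inside the sentence, and that the toy data and the function name `concordance_sketch` are invented for illustration.

```python
def concordance_sketch(corpus, text_id, sent_id, position, n=1, left=10, right=10):
    """Return the keyword and its left/right context as (word, tag) lists.

    Assumption: context is clipped at the sentence boundaries.
    """
    sent = corpus[text_id][sent_id]
    start = max(position - left, 0)          # do not run past the sentence start
    end = position + n                       # first token after the keyword
    return {
        "keyword": sent[position:end],
        "left": sent[start:position],
        "right": sent[end:end + right],
    }


# Toy corpus (hypothetical data): one text, one sentence.
corpus = [[[("I", "N"), ("like", "V"), ("old", "A"), ("books", "N")]]]
kwic = concordance_sketch(corpus, text_id=0, sent_id=0, position=1, left=1, right=2)
```

Here `kwic["keyword"]` holds the single token at position 1, and the two context lists hold its neighbours within the requested window.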

getNgram(self, text_id, sent_id, position, anchor={'n': 4, 'seed': 1})

Get the ngram of a seed token from the in-memory corpus

The three parameters text_id, sent_id, and position together locate a seed token in the corpus. The parameter anchor describes the ngram in which this seed token lies.

Parameters
text_id: int

The index of the text in the corpus.

sent_id: int

The index of the sentence in the text.

position: int

The index of the token in the sentence.

anchor: dict, optional

Information about the seed token’s ngram, by default {'n': 4, 'seed': 1}.

  • seed: The token’s position in the ngram

  • n: The ngram’s length

Returns
list

An ngram stored as (word, tag) pairs in a list.
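As a rough illustration of how anchor could map a seed token to its ngram, the sketch below slices a sentence around the seed. Several details are assumptions, not documented behaviour: seed is treated as a 0-based offset inside the ngram, the ngram may not cross sentence boundaries, and an empty list is returned when it does not fit. The name `get_ngram_sketch` and the toy data are hypothetical.

```python
def get_ngram_sketch(corpus, text_id, sent_id, position, anchor=None):
    """Return the ngram around a seed token as a list of (word, tag) pairs."""
    if anchor is None:
        anchor = {"n": 4, "seed": 1}
    sent = corpus[text_id][sent_id]
    start = position - anchor["seed"]   # assumption: seed is a 0-based offset
    end = start + anchor["n"]
    if start < 0 or end > len(sent):    # ngram would leave the sentence
        return []
    return sent[start:end]


# Toy corpus (hypothetical data): one text, one sentence of five tokens.
corpus = [[[("a", "D"), ("b", "N"), ("c", "V"), ("d", "A"), ("e", "N")]]]
ngram = get_ngram_sketch(corpus, 0, 0, position=2, anchor={"n": 3, "seed": 1})
```

With the seed at position 2 and `seed=1`, the ngram starts one token to the left of the seed and spans three tokens.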

queryNgram(self, query, anchor={'n': 2, 'seed': 1}, gender=None)

Query KWIC of phrases

Parameters
query: list

A list of token objects (dictionaries), each representing one token in the query string (i.e. one token enclosed in brackets). Returned by queryParser.tokenize().

anchor: dict, optional

Passed to anchor in getNgram(), by default {'n': 2, 'seed': 1}.

gender: int, optional

Passed to gender in queryOneGram(), by default None.

Returns
pandas.DataFrame

A pandas DataFrame of matching keywords and their positional information in the corpus.
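One plausible reading of how a phrase query combines the single-token query with the ngram lookup: find candidates for the seed token, fetch the ngram around each candidate, and keep it only if every query token matches. The sketch below follows that reading over an in-memory corpus. It is an assumption, not the library's code: the dictionary keys `token` and `pos`, the 0-based seed, the use of `re.fullmatch`, and the plain list of tuples returned in place of a DataFrame are all hypothetical.

```python
import re


def token_matches(query_token, word, tag):
    """Check one (word, tag) pair against one query token (hypothetical keys)."""
    return (re.fullmatch(query_token["token"], word) is not None
            and re.fullmatch(query_token["pos"], tag) is not None)


def query_ngram_sketch(corpus, query, seed=0):
    """Return (text_id, sent_id, start) for every ngram matching the query."""
    n = len(query)
    hits = []
    for text_id, text in enumerate(corpus):
        for sent_id, sent in enumerate(text):
            for position, (word, tag) in enumerate(sent):
                # Step 1: the seed token plays the role of the one-token query.
                if not token_matches(query[seed], word, tag):
                    continue
                # Step 2: fetch the ngram around the seed.
                start = position - seed
                if start < 0 or start + n > len(sent):
                    continue
                ngram = sent[start:start + n]
                # Step 3: keep it only if every query token matches.
                if all(token_matches(q, w, t) for q, (w, t) in zip(query, ngram)):
                    hits.append((text_id, sent_id, start))
    return hits


# Toy corpus (hypothetical data): one text with two sentences.
corpus = [[
    [("the", "D"), ("red", "A"), ("book", "N")],
    [("red", "A"), ("tape", "N")],
]]
query = [{"token": "red", "pos": "A"}, {"token": ".*", "pos": "N.*"}]
hits = query_ngram_sketch(corpus, query, seed=1)
```

Choosing the seed well matters for speed in a design like this: a rare seed token yields few candidates, so fewer ngrams need to be fetched and compared.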

queryOneGram(self, token, pos, matchOpr={'token': '=', 'pos': 'REGEXP'}, gender=None)

Query KWIC of one token

Parameters
token: str

Pattern of the keyword’s form, compared using the operator given in matchOpr for token (exact match by default).

pos: str

RegEx pattern of the keyword’s PoS tag. E.g., to search for:

  • Nouns, use N.*

  • Verbs, use V.*

See the tag set here.

matchOpr: dict, optional

The operator <opr> used in the SQL clause WHERE x <opr> pattern. One of = (exact match), REGEXP (matches the pattern as a RegEx), or LIKE (matches the pattern using % wildcards). Defaults to exact match (=) for token and RegEx match (REGEXP) for pos.

gender: int, optional

Pre-filter the SQL database based on the sex of the texts’ authors.

  • 0: female

  • 1: male

  • other values: all (no filter)

Returns
pandas.DataFrame

A pandas DataFrame of matching keywords and their positional information in the corpus.
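The role of matchOpr can be sketched with Python's standard sqlite3 module. SQLite has no built-in implementation behind the REGEXP operator, so the sketch registers one with `create_function`. Everything specific here is an assumption for illustration: the table name `tokens`, its columns, the sample rows, the use of `re.fullmatch`, and the function name `query_one_gram_sketch`. It returns a list of tuples, not a DataFrame, to stay dependency-free.

```python
import re
import sqlite3


def regexp(pattern, value):
    """Backs the SQL operator: `value REGEXP pattern` calls regexp(pattern, value)."""
    return value is not None and re.fullmatch(pattern, value) is not None


conn = sqlite3.connect(":memory:")
conn.create_function("REGEXP", 2, regexp)

# Hypothetical schema: one row per token with its positional information.
conn.execute(
    "CREATE TABLE tokens (word TEXT, pos TEXT, "
    "text_id INTEGER, sent_id INTEGER, position INTEGER)"
)
conn.executemany(
    "INSERT INTO tokens VALUES (?, ?, ?, ?, ?)",
    [("book", "Na", 0, 0, 3), ("book", "VC", 0, 1, 0), ("read", "VC", 0, 0, 1)],
)


def query_one_gram_sketch(conn, token, pos, match_opr=None):
    """Return (text_id, sent_id, position) rows matching token and pos."""
    if match_opr is None:
        match_opr = {"token": "=", "pos": "REGEXP"}
    # Operators are interpolated into the SQL text, so restrict them to a whitelist.
    for opr in match_opr.values():
        if opr not in {"=", "REGEXP", "LIKE"}:
            raise ValueError(f"unsupported operator: {opr}")
    sql = (
        "SELECT text_id, sent_id, position FROM tokens "
        f"WHERE word {match_opr['token']} ? AND pos {match_opr['pos']} ? "
        "ORDER BY text_id, sent_id, position"
    )
    # The patterns themselves are bound as parameters, never interpolated.
    return conn.execute(sql, (token, pos)).fetchall()


nouns = query_one_gram_sketch(conn, "book", "N.*")
verbs = query_one_gram_sketch(conn, "b%", "V.*", {"token": "LIKE", "pos": "REGEXP"})
```

The first call uses the defaults (exact match on the form, RegEx on the tag); the second switches the form to LIKE so that `b%` matches any word starting with "b".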