`Corpus`

class KWIC.queryDB.Corpus(db='data/asbc.sqlite', corp='data/asbc_lite.jsonl')

Query corpus from sqlite database

Methods

`concordance`(self, text_id, sent_id, position)	Retrieve all KWIC instances from corpus based on positional information
`getNgram`(self, text_id, sent_id, position[, …])	Get the ngram of a seed token from the in-memory corpus
`queryNgram`(self, query[, anchor, gender])	Query KWIC of phrases
`queryOneGram`(self, token, pos[, matchOpr, …])	Query KWIC of one token

concordance(self, text_id, sent_id, position, n=1, left=10, right=10)

Retrieve all KWIC instances from corpus based on positional information

Parameters

text_id (int): Index of a text in the corpus (the first level of the corpus). It gives the order of the text within the corpus.
sent_id (int): Index of a sentence in the text (the second level of the corpus). It gives the order of the sentence within the text.
position (int): Index of a word in the sentence (the third level of the corpus). It gives the order of the word within the sentence.
n (int, optional): Keyword length, by default 1
left (int, optional): Left context size, in number of tokens, by default 10
right (int, optional): Right context size, in number of tokens, by default 10

Returns

dict

A dictionary with:

keyword: the keyword and its PoS tag
left & right: the left and right contexts, each consisting of tokens and their PoS tags.
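The shape of the returned dictionary can be illustrated with a minimal sketch. This is not the library's implementation: the helper `kwic_window` and the toy corpus are hypothetical, and the sketch assumes that the context never crosses a sentence boundary, which the description above does not confirm.

```python
# Hypothetical three-level corpus: texts -> sentences -> (word, tag) pairs.
corpus = [
    [  # text 0
        [("we", "PRON"), ("like", "VERB"), ("green", "ADJ"),
         ("tea", "NOUN"), ("very", "ADV"), ("much", "ADV")],
    ],
]


def kwic_window(corpus, text_id, sent_id, position, n=1, left=10, right=10):
    """Illustrative only: slice a keyword and its context out of one sentence."""
    sent = corpus[text_id][sent_id]
    start = max(position - left, 0)  # clamp so a short left context does not wrap around
    end = position + n               # the keyword spans n tokens starting at position
    return {
        "keyword": sent[position:end],
        "left": sent[start:position],
        "right": sent[end:end + right],
    }


hit = kwic_window(corpus, text_id=0, sent_id=0, position=2, n=2, left=1, right=1)
print(hit)
```

Here `n=2` makes the keyword two tokens long, and `left=1`, `right=1` keep one token of context on each side.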

getNgram(self, text_id, sent_id, position, anchor={'n': 4, 'seed': 1})

Get the ngram of a seed token from the in-memory corpus

The three parameters text_id, sent_id, and position together locate a seed token in the corpus. The parameter anchor describes the ngram in which this seed token lies: its length and the seed token's position within it.

Parameters

text_id: int

The index of the text in the corpus.

sent_id: int

The index of the sentence in the text.

position: int

The index of the token in the sentence.

anchor: dict, optional

Information about the seed token’s ngram, by default {'n': 4, 'seed': 1}.

seed: The token’s position in the ngram
n: The ngram’s length

Returns

list: An ngram stored as (word, tag) pairs in a list.
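How anchor maps a seed position to an ngram can be sketched as follows. The helper `ngram_around` is hypothetical, and two points are assumptions rather than documented behavior: that seed is a 0-based offset of the seed token inside the ngram, and that an ngram running past the sentence boundary yields an empty list.

```python
def ngram_around(sent, position, anchor=None):
    """Illustrative only: return the ngram that contains the seed token.

    Assumes anchor['seed'] is the 0-based index of the seed inside the ngram.
    """
    anchor = anchor or {"n": 4, "seed": 1}
    start = position - anchor["seed"]  # where the ngram begins in the sentence
    end = start + anchor["n"]          # one past the last token of the ngram
    if start < 0 or end > len(sent):
        return []                      # assumed behavior when the ngram does not fit
    return sent[start:end]


sent = [("we", "PRON"), ("like", "VERB"), ("green", "ADJ"),
        ("tea", "NOUN"), ("very", "ADV")]

# Seed token "green" sits at position 2; with seed=1 the ngram starts one token earlier.
ngram = ngram_around(sent, position=2)
print(ngram)
```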

queryNgram(self, query, anchor={'n': 2, 'seed': 1}, gender=None)

Query KWIC of phrases

Parameters

query (list): A list of token objects (dictionaries), each representing one token in the query string (i.e. a token enclosed in brackets). Returned by queryParser.tokenize().
anchor (dict, optional): Passed to anchor in getNgram(), by default {'n': 2, 'seed': 1}.
gender (int, optional): Passed to gender in queryOneGram(), by default None.

Returns

pandas.DataFrame: A pandas dataframe for matching keywords and their positional information in the corpus.
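Since anchor is passed on to getNgram() and gender to queryOneGram(), a phrase query plausibly combines a single-token lookup with an ngram check. The sketch below shows that idea on an in-memory toy corpus. It is an assumption about the general approach, not the library's code: `find_token` and `find_phrase` are hypothetical, the token dictionaries use made-up keys (`word`, `tag`) because the real output of queryParser.tokenize() is not shown here, seed is assumed to be 0-based, and the result is a plain list instead of a pandas DataFrame.

```python
import re

# Hypothetical corpus: texts -> sentences -> (word, tag) pairs.
corpus = [[
    [("we", "PRON"), ("like", "VERB"), ("green", "ADJ"), ("tea", "NOUN")],
    [("green", "ADJ"), ("light", "NOUN")],
]]


def matches(q, word, tag):
    """True if a (word, tag) pair satisfies one hypothetical query token."""
    return bool(re.fullmatch(q["word"], word) and re.fullmatch(q["tag"], tag))


def find_token(corpus, q):
    """Yield (text_id, sent_id, position) for every token matching q."""
    for t, text in enumerate(corpus):
        for s, sent in enumerate(text):
            for p, (word, tag) in enumerate(sent):
                if matches(q, word, tag):
                    yield t, s, p


def find_phrase(corpus, query, anchor):
    """Look up the seed token, then keep hits whose whole ngram matches the query."""
    seed = query[anchor["seed"]]
    hits = []
    for t, s, p in find_token(corpus, seed):
        start = p - anchor["seed"]
        if start < 0:
            continue  # ngram would begin before the sentence
        ngram = corpus[t][s][start:start + anchor["n"]]
        if len(ngram) != anchor["n"]:
            continue  # ngram would run past the sentence
        if all(matches(q, w, g) for q, (w, g) in zip(query, ngram)):
            hits.append((t, s, start))
    return hits


# "green" followed by any noun; the noun (index 1) serves as the seed token.
query = [{"word": "green", "tag": "ADJ"}, {"word": ".*", "tag": "N.*"}]
hits = find_phrase(corpus, query, {"n": 2, "seed": 1})
print(hits)
```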

queryOneGram(self, token, pos, matchOpr={'token': '=', 'pos': 'REGEXP'}, gender=None)

Query KWIC of one token

Parameters

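token: str

Pattern of the keyword’s form, interpreted according to matchOpr['token'] (exact match by default; a RegEx pattern when the operator is REGEXP).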

pos: str

RegEx pattern of the keyword’s PoS tag. E.g., to search for:

Nouns, use N.*
Verbs, use V.*

See the tag set here.

matchOpr: dict

The operator <opr> given to the SQL command in WHERE x <opr> pattern. One of = (exact match), REGEXP (uses RegEx to match the pattern), or LIKE (uses % as a wildcard to match the pattern). Defaults to exact match (=) for token and RegEx match (REGEXP) for pos.

gender: int, optional

Pre-filter the SQL database based on the sex of the texts’ authors.

0: female
1: male
other values: all (no filter)

Returns

pandas.DataFrame: A pandas dataframe for matching keywords and their positional information in the corpus.
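The effect of matchOpr and gender on the WHERE clause can be sketched with Python's built-in sqlite3 module. Everything about the database here is an assumption: the table name, the column names, and the helper `query_one` are hypothetical and do not reflect the real schema of the corpus database. Note that SQLite ships no REGEXP implementation, so a user function has to be registered; this sketch uses `re.search`, whereas the anchoring behavior of the actual backend is not documented above.

```python
import re
import sqlite3

conn = sqlite3.connect(":memory:")
# `X REGEXP Y` is evaluated by SQLite as the user function regexp(Y, X).
conn.create_function(
    "REGEXP", 2, lambda pattern, value: int(re.search(pattern, value) is not None)
)

# Hypothetical schema and rows, for illustration only.
conn.execute(
    "CREATE TABLE tokens (word TEXT, tag TEXT, text_id INTEGER, "
    "sent_id INTEGER, position INTEGER, gender INTEGER)"
)
conn.executemany("INSERT INTO tokens VALUES (?, ?, ?, ?, ?, ?)", [
    ("tea", "Na", 0, 0, 3, 0),
    ("teapot", "Na", 1, 0, 1, 1),
    ("tea", "VC", 2, 0, 0, 1),
])

ALLOWED = {"=", "REGEXP", "LIKE"}


def query_one(conn, token, pos, match_opr=None, gender=None):
    """Illustrative only: build `WHERE x <opr> pattern` from the chosen operators."""
    match_opr = match_opr or {"token": "=", "pos": "REGEXP"}
    if not set(match_opr.values()) <= ALLOWED:
        raise ValueError("unsupported operator")  # operators are interpolated, so whitelist them
    sql = (
        "SELECT word, tag, text_id, sent_id, position FROM tokens "
        f"WHERE word {match_opr['token']} ? AND tag {match_opr['pos']} ?"
    )
    params = [token, pos]
    if gender in (0, 1):  # any other value means no filter
        sql += " AND gender = ?"
        params.append(gender)
    sql += " ORDER BY text_id, sent_id, position"
    return conn.execute(sql, params).fetchall()


exact = query_one(conn, "tea", "N.*")                                    # defaults: = and REGEXP
like = query_one(conn, "tea%", "N%", {"token": "LIKE", "pos": "LIKE"})   # % wildcards
male = query_one(conn, "tea", ".*", gender=1)                            # gender pre-filter
print(exact, like, male, sep="\n")
```

The patterns themselves are passed as bound parameters; only the operator is interpolated into the SQL string, which is why the sketch restricts it to the three documented values.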
