Corpus
-
class KWIC.queryDB.Corpus(db='data/asbc.sqlite', corp='data/asbc_lite.jsonl')
Query a corpus from an SQLite database.
Methods
concordance(self, text_id, sent_id, position): Retrieve all KWIC instances from the corpus based on positional information
getNgram(self, text_id, sent_id, position[, …]): Get the ngram of a seed token from the in-memory corpus
queryNgram(self, query[, anchor, gender]): Query KWIC of phrases
queryOneGram(self, token, pos[, matchOpr, …]): Query KWIC of one token
-
concordance(self, text_id, sent_id, position, n=1, left=10, right=10)
Retrieve all KWIC instances from the corpus based on positional information.
- Parameters
- text_id: int
Index of a text in the corpus (the first level of the corpus). It indicates the order of the text among the texts in the corpus.
- sent_id: int
Index of a sentence in a text (the second level of the corpus). It indicates the order of the sentence among the sentences in the text.
- position: int
Index of a word in a sentence (the third level of the corpus). It indicates the order of the word among the words in the sentence.
- n: int, optional
Keyword length, by default 1
- left: int, optional
Left context size, in number of tokens, by default 10
- right: int, optional
Right context size, in number of tokens, by default 10
- Returns
- dict
A dictionary with:
keyword: the keyword and its PoS tag
left & right: the left and right context, consisting of tokens and their PoS tags.
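The three-level indexing and the context window described above can be illustrated with a minimal sketch. This is not the library's implementation: the toy corpus, the PoS tags, and the function name `concordance_sketch` are hypothetical, and the sketch assumes that the context never crosses a sentence boundary, which the documentation does not state.

```python
from typing import Dict, List, Tuple

Token = Tuple[str, str]  # (word, PoS tag)

# Hypothetical toy corpus: texts -> sentences -> (word, tag) tokens.
corpus: List[List[List[Token]]] = [
    [
        [("the", "DET"), ("quick", "ADJ"), ("fox", "N"),
         ("jumps", "V"), ("high", "ADV")],
    ],
]


def concordance_sketch(corpus, text_id, sent_id, position,
                       n=1, left=10, right=10) -> Dict[str, List[Token]]:
    """Return the keyword and its left/right context as (word, tag) pairs."""
    sent = corpus[text_id][sent_id]
    # Assumption: the context window is clipped to the current sentence.
    start = max(position - left, 0)
    return {
        "left": sent[start:position],
        "keyword": sent[position:position + n],
        "right": sent[position + n:position + n + right],
    }


result = concordance_sketch(corpus, text_id=0, sent_id=0, position=2,
                            n=1, left=1, right=2)
print(result)
```

Here `text_id=0`, `sent_id=0`, and `position=2` select the third token of the first sentence of the first text; `left` and `right` then bound how many neighbouring tokens are kept on each side.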
-
getNgram(self, text_id, sent_id, position, anchor={'n': 4, 'seed': 1})
Get the ngram of a seed token from the in-memory corpus.
The three parameters text_id, sent_id, and position together locate a seed token in the corpus. Information about the ngram in which this seed token lies is given in the parameter anchor.
- Parameters
- text_id: int
The index of the text in the corpus.
- sent_id: int
The index of the sentence in the text.
- position: int
The index of the token in the sentence.
- anchor: dict, optional
Information about the seed token's ngram, by default {'n': 4, 'seed': 1}.
seed: The token's position in the ngram
n: The ngram's length
- Returns
- list
An ngram stored as (word, tag) pairs in a list.
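How `anchor` relates a seed token to its ngram can be sketched as follows. This is an illustration only, not the library's code: the toy corpus, the tags, and the name `get_ngram_sketch` are hypothetical. Two assumptions are made that the documentation does not confirm: `seed` is treated as a 0-based offset inside the ngram, and an ngram that would run past the sentence boundary yields an empty list.

```python
from typing import List, Optional, Tuple

Token = Tuple[str, str]  # (word, PoS tag)

# Hypothetical toy corpus: texts -> sentences -> (word, tag) tokens.
toy_corpus: List[List[List[Token]]] = [
    [
        [("she", "PRON"), ("reads", "V"), ("many", "DET"),
         ("good", "ADJ"), ("books", "N")],
    ],
]


def get_ngram_sketch(corpus, text_id, sent_id, position,
                     anchor: Optional[dict] = None) -> List[Token]:
    """Return the ngram around a seed token as a list of (word, tag) pairs."""
    if anchor is None:
        anchor = {"n": 4, "seed": 1}
    sent = corpus[text_id][sent_id]
    # Assumption: 'seed' is the 0-based offset of the seed token in the ngram.
    start = position - anchor["seed"]
    end = start + anchor["n"]
    # Assumption: ngrams that do not fit inside the sentence are dropped.
    if start < 0 or end > len(sent):
        return []
    return sent[start:end]


ngram = get_ngram_sketch(toy_corpus, text_id=0, sent_id=0, position=2)
print(ngram)
```

With the default anchor, the seed token at `position=2` sits at offset 1 of a 4-token ngram, so the ngram starts one token to its left.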
-
queryNgram(self, query, anchor={'n': 2, 'seed': 1}, gender=None)
Query KWIC of phrases.
- Parameters
- query: list
A list of token objects (dictionaries), each representing one token in the query string (i.e. a token enclosed in brackets). Returned by queryParser.tokenize().
- anchor: dict, optional
Passed to anchor in getNgram(), by default {'n': 2, 'seed': 1}.
- gender: int, optional
Passed to gender in queryOneGram(), by default None.
- Returns
- pandas.DataFrame
A pandas dataframe for matching keywords and their positional information in the corpus.
-
queryOneGram(self, token, pos, matchOpr={'token': '=', 'pos': 'REGEXP'}, gender=None)
Query KWIC of one token.
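- Parameters
- token: str
Pattern of the keyword's form, matched with the operator given in matchOpr (exact match by default; a RegEx pattern when the operator is REGEXP).
- pos: str
RegEx pattern of the keyword's PoS tag. E.g., to search for:
Nouns, use N.*
Verbs, use V.*
See the tag set here.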
- matchOpr: dict
The operator <opr> given to the SQL command in WHERE x <opr> pattern. Could be one of = (exact match), REGEXP (uses RegEx to match pattern), or LIKE (uses % to match pattern). Defaults to exact match for token and RegEx match for pos.
- gender: int, optional
Pre-filter the SQL database based on the sex of the texts' authors.
0: female
1: male
other values: all (no filter)
- Returns
- pandas.DataFrame
A pandas dataframe for matching keywords and their positional information in the corpus.
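The effect of the three operators and of the gender pre-filter can be tried out on a small in-memory SQLite database. The sketch below is not the library's query: the table `tokens`, its columns, the sample rows, and the function `query_one_gram_sketch` are all hypothetical, and it returns plain tuples rather than a pandas DataFrame. Note that SQLite has no built-in `REGEXP` implementation, so one is registered here with Python's `re.match` (anchored at the start of the string); how the library itself implements `REGEXP` is an assumption.

```python
import re
import sqlite3


def regexp(pattern, value):
    # SQLite evaluates "X REGEXP Y" as regexp(Y, X): pattern first, value second.
    return value is not None and re.match(pattern, value) is not None


conn = sqlite3.connect(":memory:")
conn.create_function("REGEXP", 2, regexp)

# Hypothetical schema and rows, for illustration only.
conn.execute("CREATE TABLE tokens (word TEXT, pos TEXT, gender INTEGER)")
conn.executemany("INSERT INTO tokens VALUES (?, ?, ?)", [
    ("run", "VA", 0),
    ("runner", "Na", 1),
    ("running", "VA", 1),
    ("fast", "D", 0),
])

ALLOWED_OPS = {"=", "REGEXP", "LIKE"}


def query_one_gram_sketch(conn, token, pos, match_opr=None, gender=None):
    """Build 'WHERE x <opr> pattern' for the token form and its PoS tag."""
    opr = {"token": "=", "pos": "REGEXP"}
    opr.update(match_opr or {})
    # Operators are interpolated into the SQL text, so whitelist them.
    if not set(opr.values()) <= ALLOWED_OPS:
        raise ValueError(f"unsupported operator in {opr}")
    sql = (f"SELECT word, pos FROM tokens "
           f"WHERE word {opr['token']} ? AND pos {opr['pos']} ?")
    params = [token, pos]
    # 0: female, 1: male, any other value: no filter.
    if gender in (0, 1):
        sql += " AND gender = ?"
        params.append(gender)
    sql += " ORDER BY rowid"
    return conn.execute(sql, params).fetchall()


exact = query_one_gram_sketch(conn, "run", "V.*")
like = query_one_gram_sketch(conn, "run%", ".*", {"token": "LIKE"})
male = query_one_gram_sketch(conn, "run%", "V.*", {"token": "LIKE"}, gender=1)
print(exact, like, male)
```

The patterns are passed as bound parameters, while the operators, which cannot be bound, are checked against a fixed whitelist before being placed in the SQL string.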
-