Corpus
-
class KWIC.queryDB.Corpus(db='data/asbc.sqlite', corp='data/asbc_lite.jsonl')
Query corpus from SQLite database
Methods
concordance(self, text_id, sent_id, position): Retrieve all KWIC instances from corpus based on positional information
getNgram(self, text_id, sent_id, position[, …]): Get the ngram of a seed token from the in-memory corpus
queryNgram(self, query[, anchor, gender]): Query KWIC of phrases
queryOneGram(self, token, pos[, matchOpr, …]): Query KWIC of one token
-
concordance(self, text_id, sent_id, position, n=1, left=10, right=10)
Retrieve all KWIC instances from corpus based on positional information
- Parameters
- text_id: int
An index into the first level of corpus (the text level). It gives the order of the text in the corpus.
- sent_id: int
An index into the second level of corpus (the sentence level). It gives the order of the sentence in a text.
- position: int
An index into the third level of corpus (the word level). It gives the order of the word in a sentence.
- n: int, optional
Keyword length, by default 1
- left: int, optional
Left context size, in number of tokens, by default 10
- right: int, optional
Right context size, in number of tokens, by default 10
- Returns
- dict
A dictionary with:
keyword: the keyword and its PoS tag
left & right: the left and right context, consisting of tokens and their PoS tags.
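The slicing that these parameters describe can be sketched as follows. This is a minimal illustration, not the library's implementation: kwic_window and toy_corpus are hypothetical names, the corpus is assumed to be a nested list (texts, then sentences, then (word, tag) pairs), and the context is assumed to stay inside the keyword's own sentence.

```python
def kwic_window(corpus, text_id, sent_id, position, n=1, left=10, right=10):
    """Hypothetical helper: slice a keyword and its context out of one sentence."""
    sent = corpus[text_id][sent_id]      # list of (word, tag) pairs
    start = max(0, position - left)      # clamp so the slice never wraps around
    end = position + n                   # first index after the keyword
    return {
        "keyword": sent[position:end],
        "left": sent[start:position],
        "right": sent[end:end + right],
    }


# Toy data with made-up tokens and tags (assumption, not taken from the real corpus)
toy_corpus = [[[("w0", "N"), ("w1", "V"), ("w2", "N"), ("w3", "N"), ("w4", "V")]]]
result = kwic_window(toy_corpus, text_id=0, sent_id=0, position=2, n=1, left=1, right=5)
```

A right value larger than the remaining sentence simply yields a shorter context, because list slicing stops at the end of the sentence.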
-
getNgram(self, text_id, sent_id, position, anchor={'n': 4, 'seed': 1})
Get the ngram of a seed token from the in-memory corpus
The three parameters text_id, sent_id, and position together locate a seed token in the corpus. Information about the ngram that contains this seed token is given in the parameter anchor.
- Parameters
- text_id: int
The index of the text in the corpus.
- sent_id: int
The index of the sentence in the text.
- position: int
The index of the token in the sentence.
- anchor: dict, optional
Information about the seed token's ngram, by default {'n': 4, 'seed': 1}.
seed: The token's position in the ngram
n: The ngram's length
- Returns
- list
An ngram stored as (word, tag) pairs in a list.
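One way to read the anchor parameter is sketched below. The sketch rests on two assumptions that the text above does not confirm: seed is treated as a 0-based offset of the seed token inside the ngram, and an ngram that would run past the sentence boundary yields an empty list. ngram_around_seed and toy_corpus are hypothetical names.

```python
def ngram_around_seed(corpus, text_id, sent_id, position, anchor=None):
    """Hypothetical helper: return the ngram containing the seed token."""
    if anchor is None:
        anchor = {"n": 4, "seed": 1}
    sent = corpus[text_id][sent_id]      # list of (word, tag) pairs
    start = position - anchor["seed"]    # assumption: seed is a 0-based offset
    end = start + anchor["n"]
    if start < 0 or end > len(sent):
        return []                        # assumption: no partial ngrams
    return sent[start:end]


# Toy data with made-up tokens and tags
toy_corpus = [[[("w0", "N"), ("w1", "V"), ("w2", "N"), ("w3", "N"), ("w4", "V")]]]
trigram = ngram_around_seed(toy_corpus, 0, 0, 2, anchor={"n": 3, "seed": 1})
out_of_range = ngram_around_seed(toy_corpus, 0, 0, 0, anchor={"n": 3, "seed": 1})
```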
-
queryNgram(self, query, anchor={'n': 2, 'seed': 1}, gender=None)
Query KWIC of phrases
- Parameters
- query: list
A list of token objects (dictionaries), with each dictionary representing one token in the query string (i.e. a token enclosed in brackets). Returned by queryParser.tokenize().
- anchor: dict, optional
Passed to anchor in getNgram(), by default {'n': 2, 'seed': 1}.
- gender: int, optional
Passed to gender in queryOneGram(), by default None.
- Returns
- pandas.DataFrame
A pandas dataframe for matching keywords and their positional information in the corpus.
-
queryOneGram(self, token, pos, matchOpr={'token': '=', 'pos': 'REGEXP'}, gender=None)
Query KWIC of one token
- Parameters
- token: str
RegEx pattern of the keyword's form.
- pos: str
RegEx pattern of the keyword's PoS tag. E.g., to search for:
Nouns, use N.*
Verbs, use V.*
See the tag set here.
- matchOpr: dict
The operator <opr> given to the SQL command in WHERE x <opr> pattern. Could be one of = (exact match), REGEXP (uses RegEx to match pattern), or LIKE (uses % to match pattern). Defaults to exact match for token and RegEx match for pos.
- gender: int, optional
Pre-filter the SQL database based on the sex of the texts' authors.
0: female
1: male
other values: all (no filter)
- Returns
- pandas.DataFrame
A pandas dataframe for matching keywords and their positional information in the corpus.
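The way matchOpr could map onto a WHERE clause is sketched below with the standard sqlite3 module. The table name tokens, its columns, and the helper query_one_gram are hypothetical, since the real database schema is not documented here. The gender filter is left out, and rows come back as plain tuples rather than a pandas.DataFrame to keep the sketch standard-library only. SQLite has no built-in REGEXP implementation, so the sketch registers one with create_function.

```python
import re
import sqlite3

ALLOWED_OPRS = {"=", "REGEXP", "LIKE"}


def regexp(pattern, value):
    # SQLite evaluates "X REGEXP Y" as regexp(Y, X): the pattern comes first
    return int(value is not None and re.search(pattern, value) is not None)


def query_one_gram(conn, token, pos, match_opr=None):
    """Hypothetical helper: find tokens whose form and PoS tag match."""
    match_opr = match_opr or {"token": "=", "pos": "REGEXP"}
    for opr in match_opr.values():
        if opr not in ALLOWED_OPRS:      # operators are spliced into SQL, so whitelist them
            raise ValueError(f"unsupported operator: {opr}")
    sql = (
        "SELECT text_id, sent_id, position, word, pos FROM tokens "
        f"WHERE word {match_opr['token']} ? AND pos {match_opr['pos']} ? "
        "ORDER BY text_id, sent_id, position"
    )
    return conn.execute(sql, (token, pos)).fetchall()


conn = sqlite3.connect(":memory:")
conn.create_function("REGEXP", 2, regexp)
conn.execute(
    "CREATE TABLE tokens (text_id INTEGER, sent_id INTEGER, "
    "position INTEGER, word TEXT, pos TEXT)"
)
conn.executemany(
    "INSERT INTO tokens VALUES (?, ?, ?, ?, ?)",
    [(0, 0, 0, "run", "N1"), (0, 0, 1, "run", "V1"), (0, 1, 0, "runner", "N2")],  # made-up rows
)

noun_hits = query_one_gram(conn, "run", "N.*")
prefix_hits = query_one_gram(conn, "run%", "N.*", {"token": "LIKE", "pos": "REGEXP"})
```

The patterns are bound as query parameters, while the operators are checked against a fixed set before being placed in the SQL string.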
-