2. Lexical Statistics
class yft.lexstats.TF_IDF(docset: Union[dict, list])[source]

Compute tf-idf scores for terms in a list of documents.
Methods
score(word, doc_id): Compute tf-idf score of a term in a particular document
score_all(): Compute tf-idf scores for all terms in the docset
score(word: str, doc_id)[source]

Compute tf-idf score of a term in a particular document.
Parameters
word : str
    The term to compute the score for.
doc_id : int or str
    The id specifying the document in the docset used to initialize the TF_IDF object.

Returns
float
    The tf-idf score of the term.
Examples
>>> docset = {
...     'doc0': ['a', 'a', 'a', 'c'],
...     'doc1': ['a', 'a', 'a', 'a', 'c'],
...     'doc2': ['b', 'c', 'c', 'c']
... }
>>> tfidf = TF_IDF(docset)
>>> tfidf.score('a', 'doc1')
0.32437208648653154
>>> docset = [
...     ['a', 'a', 'a', 'c'],
...     ['a', 'a', 'a', 'a', 'c'],
...     ['b', 'c', 'c', 'c']
... ]
>>> tfidf = TF_IDF(docset)
>>> tfidf.score('a', 1)
0.32437208648653154
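The scores in the example are consistent with tf-idf computed as the term's relative frequency in the document times the natural-log inverse document frequency. A minimal sketch under that assumption (inferred from the outputs above, not the library's actual implementation):

```python
import math

def tf_idf(word, doc, docs):
    # Relative frequency of the term in the target document
    tf = doc.count(word) / len(doc)
    # Natural-log inverse document frequency over the whole docset
    df = sum(1 for d in docs if word in d)
    return tf * math.log(len(docs) / df)

docs = [['a', 'a', 'a', 'c'],
        ['a', 'a', 'a', 'a', 'c'],
        ['b', 'c', 'c', 'c']]
score = tf_idf('a', docs[1], docs)  # matches tfidf.score('a', 1) above
```

A term such as 'c' that occurs in every document gets idf = ln(1) = 0, which is why its score is 0.0 in the score_all() output.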
score_all()[source]

Compute tf-idf scores for all terms in the docset.
Returns
dict
    A dict in the format:

    {
        "<docid>": {
            "<word>": <tf-idf-score>,
            "<word>": <tf-idf-score>,
            ...
        },
        "<docid>": {...},
        ...
    }
Examples
>>> docset = {
...     'doc0': ['a', 'a', 'a', 'c'],
...     'doc1': ['a', 'a', 'a', 'a', 'c'],
...     'doc2': ['b', 'c', 'c', 'c']
... }
>>> tfidf = TF_IDF(docset)
>>> tfidf.score_all()
{'doc0': {'a': 0.3040988310811233, 'c': 0.0},
 'doc1': {'a': 0.32437208648653154, 'c': 0.0},
 'doc2': {'b': 0.27465307216702745, 'c': 0.0}}
class yft.lexstats.ContextualDiversity(fp: str, tk_sep: str = '\u3000', WINDOW: int = 4, SM: float = 0.75)[source]

Compute contextual diversity scores for words in a corpus.
References
[1] Levy, O., Goldberg, Y., & Dagan, I. (2015). Improving Distributional Similarity with Lessons Learned from Word Embeddings.
[2] Jurafsky, D. (2015, July). Distributional (Vector) Semantics. https://web.stanford.edu/~jurafsky/li15/lec3.vector.pdf
Methods
diversity_scores([words, return_graph, …]): Quantify contextual diversity of each word in the corpus
pmi(tgt_w, cntx_w): Compute PMI score of a word pair
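The pmi method is only summarized above. Given the SM = 0.75 default and reference [1], it plausibly applies context-distribution smoothing, which raises context counts to a power alpha < 1 before normalizing so that rare contexts are not over-rewarded. A sketch under that assumption (pair_counts, word_counts, and alpha are illustrative names, not the library's API):

```python
import math
from collections import Counter

def pmi(pair_counts, word_counts, tgt_w, cntx_w, alpha=0.75):
    # PMI with context-distribution smoothing (Levy et al., 2015):
    # log p(x, y) / (p(x) * p_alpha(y)), where p_alpha normalizes
    # context counts raised to the power alpha.
    p_xy = pair_counts[(tgt_w, cntx_w)] / sum(pair_counts.values())
    p_x = word_counts[tgt_w] / sum(word_counts.values())
    smoothed = {w: c ** alpha for w, c in word_counts.items()}
    p_y = smoothed[cntx_w] / sum(smoothed.values())
    return math.log(p_xy / (p_x * p_y))

# Toy co-occurrence counts; in practice these would be collected
# from the corpus with a sliding window of size WINDOW.
pair_counts = Counter({('a', 'b'): 2, ('a', 'c'): 1, ('b', 'c'): 1})
word_counts = Counter({'a': 3, 'b': 3, 'c': 2})
score = pmi(pair_counts, word_counts, 'a', 'b')
```

With alpha = 1 the function reduces to plain PMI; alpha = 0.75 is the value recommended in [1] and mirrors the SM default in the constructor.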
diversity_scores(words: Optional[Sequence] = None, return_graph=False, max_vocab_size: int = 10000)[source]

Quantify contextual diversity of each word in the corpus.
Parameters
words : Sequence, optional
    Words to compute scores for, by default None.
return_graph : bool, optional
    Whether to return the co-occurrence network of the words, by default False.
max_vocab_size : int, optional
    The maximum size of the vocabulary used to construct the word co-occurrence network. Only the top-n (n = max_vocab_size) most frequent words in the corpus are used.
Returns
dict
    A dictionary recording the contextual diversity scores of the words.
Notes
Implementation of the contextual diversity (polysemy) measure for words in [1]. For intuitive interpretation, the original scores (local clustering coefficients) are reversed as (1 - ori_score), such that a higher score indicates higher contextual diversity of a word.
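The reversal described in the note can be sketched on a plain adjacency-set graph; the diversity function and adj mapping below are hypothetical names for illustration, not part of yft:

```python
from itertools import combinations

def diversity(adj, word):
    # Contextual diversity as 1 minus the local clustering coefficient
    # of `word` in an undirected co-occurrence graph, per the note above.
    # `adj` maps each word to the set of words it co-occurs with.
    neighbours = adj[word]
    k = len(neighbours)
    if k < 2:
        return 0.0  # clustering coefficient is undefined for < 2 neighbours
    links = sum(1 for u, v in combinations(neighbours, 2) if v in adj[u])
    return 1 - 2 * links / (k * (k - 1))

# 'bank' co-occurs with three words, but only 'money' and 'loan'
# co-occur with each other, so its neighbourhood is loosely clustered.
adj = {'bank': {'river', 'money', 'loan'},
       'river': {'bank'},
       'money': {'bank', 'loan'},
       'loan': {'bank', 'money'}}
score = diversity(adj, 'bank')  # 1 - 1/3
```

A word whose contexts all co-occur with each other (one tight topical cluster) scores near 0; a word bridging otherwise unconnected contexts scores near 1.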
References
[1] Hamilton, W. L., Leskovec, J., & Jurafsky, D. (2016). Diachronic Word Embeddings Reveal Statistical Laws of Semantic Change.