2. Lexical Statistics

class yft.lexstats.TF_IDF(docset: Union[dict, list])[source]

Compute tf-idf scores for terms in a collection of documents

Methods

score(word, doc_id)

Compute tf-idf score of a term in a particular document

score_all()

Compute tf-idf scores for all terms in the docset

score(word: str, doc_id)[source]

Compute tf-idf score of a term in a particular document

Parameters
word : str

The term to compute the score for

doc_id : int or str

The id specifying the document in the docset passed when initializing the TF_IDF object

Returns
float

The tf-idf score of the term

Examples

>>> docset = {
...     'doc0': ['a', 'a', 'a', 'c'],
...     'doc1': ['a', 'a', 'a', 'a', 'c'],
...     'doc2': ['b', 'c', 'c', 'c']
... }
>>> tfidf = TF_IDF(docset)
>>> tfidf.score('a', 'doc1')
0.32437208648653154
>>> docset = [
...     ['a', 'a', 'a', 'c'],
...     ['a', 'a', 'a', 'a', 'c'],
...     ['b', 'c', 'c', 'c']
... ]
>>> tfidf = TF_IDF(docset)
>>> tfidf.score('a', 1)
0.32437208648653154
score_all()[source]

Compute tf-idf scores for all terms in the docset

Returns
dict

A dict in the format of:

{
    "<docid>": {
        "<word>": <tf-idf-score>,
        "<word>": <tf-idf-score>,
        ...
    },
    "<docid>": {...},
    ...
}

Examples

>>> docset = {
...     'doc0': ['a', 'a', 'a', 'c'],
...     'doc1': ['a', 'a', 'a', 'a', 'c'],
...     'doc2': ['b', 'c', 'c', 'c']
... }
>>> tfidf = TF_IDF(docset)
>>> tfidf.score_all()
{'doc0': {'a': 0.3040988310811233, 'c': 0.0}, 'doc1': {'a': 0.32437208648653154, 'c': 0.0}, 'doc2': {'b': 0.27465307216702745, 'c': 0.0}}
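The values above are consistent with a weighting of relative term frequency times a natural-log inverse document frequency, tf(t, d) · ln(N / df(t)). A minimal pure-Python sketch of that weighting (inferred from the example outputs; not a statement of the library's internals):

```python
import math

def tf_idf(word, doc, docs):
    """tf-idf with tf = relative frequency and idf = ln(N / df)."""
    tf = doc.count(word) / len(doc)            # relative term frequency
    df = sum(1 for d in docs if word in d)     # document frequency
    return tf * math.log(len(docs) / df)       # natural-log idf

docs = [
    ['a', 'a', 'a', 'c'],
    ['a', 'a', 'a', 'a', 'c'],
    ['b', 'c', 'c', 'c'],
]
print(tf_idf('b', docs[2], docs))  # 0.27465307216702745
```

Note that 'c' occurs in every document, so its idf is ln(3/3) = 0 and its score is 0.0 everywhere, matching the output above.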
class yft.lexstats.ContextualDiversity(fp: str, tk_sep: str = '\u3000', WINDOW: int = 4, SM: float = 0.75)[source]

Compute contextual diversity scores for words in a corpus

References

[1] Levy, O., Goldberg, Y., & Dagan, I. (2015). Improving Distributional Similarity with Lessons Learned from Word Embeddings.

[2] Jurafsky, D. (2015, July). Distributional (Vector) Semantics. https://web.stanford.edu/~jurafsky/li15/lec3.vector.pdf

Methods

diversity_scores([words, return_graph, …])

Quantify contextual diversity of each word in the corpus

pmi(tgt_w, cntx_w)

Compute PMI score of a word pair

diversity_scores(words: Optional[Sequence] = None, return_graph=False, max_vocab_size: int = 10000)[source]

Quantify contextual diversity of each word in the corpus

Parameters
words : Sequence, optional

Words to compute scores for, by default None

return_graph : bool, optional

Whether to return the co-occurrence network of the words, by default False

max_vocab_size : int, optional

The max size of the vocabulary used to construct the word co-occurrence network. Only the top-n (n = max_vocab_size) most frequent words in the corpus are used.

Returns
dict

A dictionary recording contextual diversity scores of the words.

Notes

Implementation of the contextual diversity (polysemy) measure for words in [1]. For intuitive interpretation, the original scores (local clustering coefficients) are reversed as (1 - ori_score), so that a higher score indicates higher contextual diversity of a word.
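The reversed measure can be sketched in pure Python: compute a word's local clustering coefficient in an unweighted co-occurrence graph, then subtract it from 1. The graph below is an illustrative toy, not data from the library:

```python
from itertools import combinations

def diversity(graph, node):
    """1 - local clustering coefficient: higher = more diverse contexts."""
    neighbors = graph[node]
    k = len(neighbors)
    if k < 2:
        return 1.0  # convention for degree < 2 (an assumption of this sketch)
    # count edges among the node's neighbours
    links = sum(1 for u, v in combinations(neighbors, 2) if v in graph[u])
    return 1 - 2 * links / (k * (k - 1))

# toy undirected co-occurrence graph (adjacency sets) -- illustrative only
graph = {
    'bank':  {'river', 'money', 'loan'},
    'river': {'bank', 'water'},
    'money': {'bank', 'loan'},
    'loan':  {'bank', 'money'},
    'water': {'river'},
}
```

Here 'bank' co-occurs with words that mostly do not co-occur with each other (only one of its three neighbour pairs is linked), so its clustering coefficient is 1/3 and its diversity score is 2/3, higher than that of a word whose contexts all cluster together.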

References

[1] Hamilton, W. L., Leskovec, J., & Jurafsky, D. (2016). Diachronic Word Embeddings Reveal Statistical Laws of Semantic Change.

pmi(tgt_w: str, cntx_w: str)[source]

Compute PMI score of a word pair

Parameters
tgt_w : str

Target word

cntx_w : str

Context word

Returns
float

PMI value
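PMI compares a pair's joint probability with what independence would predict: pmi(w, c) = log2(P(w, c) / (P(w) P(c))). Given the constructor's SM parameter, the library presumably applies the context-distribution smoothing of [1], raising context counts to the power α = 0.75; the placement of the smoothing in this sketch is an assumption based on [1], not the library's code:

```python
import math
from collections import Counter

def smoothed_pmi(pair_counts, alpha=0.75):
    """PMI with context-distribution smoothing: P_a(c) ~ count(c)**alpha."""
    total = sum(pair_counts.values())
    w_counts, c_counts = Counter(), Counter()
    for (w, c), n in pair_counts.items():
        w_counts[w] += n
        c_counts[c] += n
    c_total = sum(n ** alpha for n in c_counts.values())

    def pmi(w, c):
        p_wc = pair_counts[(w, c)] / total
        p_w = w_counts[w] / total
        p_c = c_counts[c] ** alpha / c_total  # smoothed context probability
        return math.log2(p_wc / (p_w * p_c))

    return pmi

# toy (target, context) co-occurrence counts -- illustrative only
pmi = smoothed_pmi({('ice', 'cold'): 8, ('ice', 'hot'): 1,
                    ('fire', 'cold'): 1, ('fire', 'hot'): 8})
```

A positive PMI means the pair co-occurs more often than chance, a negative PMI less often; here pmi('ice', 'cold') is positive and pmi('ice', 'hot') is negative.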