2. Lexical Statistics

class yft.lexstats.TF_IDF(docset: Union[dict, list])[source]

Compute tf-idf scores for terms in a collection of documents

Methods

score(word, doc_id)

Compute tf-idf score of a term in a particular document

score_all()

Compute tf-idf scores for all terms in the docset

score(word: str, doc_id)[source]

Compute tf-idf score of a term in a particular document

Parameters
word : str

The term to compute the score for

doc_id : int or str

The id specifying the document in the docset passed when initializing the TF_IDF object

Returns
float

The tf-idf score of the term

Examples

>>> docset = {
...     'doc0': ['a', 'a', 'a', 'c'],
...     'doc1': ['a', 'a', 'a', 'a', 'c'],
...     'doc2': ['b', 'c', 'c', 'c']
... }
>>> tfidf = TF_IDF(docset)
>>> tfidf.score('a', 'doc1')
0.32437208648653154
>>> docset = [
...     ['a', 'a', 'a', 'c'],
...     ['a', 'a', 'a', 'a', 'c'],
...     ['b', 'c', 'c', 'c']
... ]
>>> tfidf = TF_IDF(docset)
>>> tfidf.score('a', 1)
0.32437208648653154
score_all()[source]

Compute tf-idf scores for all terms in the docset

Returns
dict

A dict in the format of:

{
    "<docid>": {
        "<word>": <tf-idf-score>,
        "<word>": <tf-idf-score>,
        ...
    },
    "<docid>": {...},
    ...
}

Examples

>>> docset = {
...     'doc0': ['a', 'a', 'a', 'c'],
...     'doc1': ['a', 'a', 'a', 'a', 'c'],
...     'doc2': ['b', 'c', 'c', 'c']
... }
>>> tfidf = TF_IDF(docset)
>>> tfidf.score_all()
{'doc0': {'a': 0.3040988310811233, 'c': 0.0}, 'doc1': {'a': 0.32437208648653154, 'c': 0.0}, 'doc2': {'b': 0.27465307216702745, 'c': 0.0}}
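The values above are consistent with a weighting of relative term frequency times a natural-log inverse document frequency, tf(t, d) · ln(N / df(t)). A minimal pure-Python sketch of that weighting (inferred from the example outputs; not a statement of the library's internals):

```python
import math

def tf_idf(word, doc, docs):
    """tf-idf with tf = relative frequency and idf = ln(N / df)."""
    tf = doc.count(word) / len(doc)            # relative term frequency
    df = sum(1 for d in docs if word in d)     # document frequency
    return tf * math.log(len(docs) / df)       # natural-log idf

docs = [
    ['a', 'a', 'a', 'c'],
    ['a', 'a', 'a', 'a', 'c'],
    ['b', 'c', 'c', 'c'],
]
print(tf_idf('b', docs[2], docs))  # 0.27465307216702745
```

Note that 'c' occurs in every document, so its idf is ln(3/3) = 0 and its score is 0.0 everywhere, matching the output above.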
class yft.lexstats.ContextualDiversity(fp: str, tk_sep: str = '\u3000', WINDOW: int = 4, SM: float = 0.75)[source]

Compute contextual diversity scores for words in a corpus

References

[1] Levy, O., Goldberg, Y., & Dagan, I. (2015). Improving Distributional Similarity with Lessons Learned from Word Embeddings.

[2] Jurafsky, D. (2015, July). Distributional (Vector) Semantics. https://web.stanford.edu/~jurafsky/li15/lec3.vector.pdf

Methods

diversity_scores([words, return_graph, …])

Quantify contextual diversity of each word in the corpus

pmi(tgt_w, cntx_w)

Compute PMI score of a word pair

diversity_scores(words: Optional[Sequence] = None, return_graph=False, max_vocab_size: int = 10000)[source]

Quantify contextual diversity of each word in the corpus

Parameters
words : Sequence, optional

Words to compute scores for, by default None

return_graph : bool, optional

Whether to return the co-occurrence network of the words, by default False

max_vocab_size : int, optional

The max size of the vocabulary used to construct the word co-occurrence network. Only the top-n (n = max_vocab_size) most frequent words in the corpus are used.

Returns
dict

A dictionary recording contextual diversity scores of the words.

Notes

Implementation of the contextual diversity (polysemy) measure for words in [1]. For intuitive interpretation, the original scores (local clustering coefficients) are reversed as (1 - ori_score), so that a higher score indicates higher contextual diversity of a word.
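The reversed measure can be sketched in pure Python: compute a word's local clustering coefficient in an unweighted co-occurrence graph, then subtract it from 1. The graph below is an illustrative toy, not data from the library:

```python
from itertools import combinations

def diversity(graph, node):
    """1 - local clustering coefficient: higher = more diverse contexts."""
    neighbors = graph[node]
    k = len(neighbors)
    if k < 2:
        return 1.0  # convention for degree < 2 (an assumption of this sketch)
    # count edges among the node's neighbours
    links = sum(1 for u, v in combinations(neighbors, 2) if v in graph[u])
    return 1 - 2 * links / (k * (k - 1))

# toy undirected co-occurrence graph (adjacency sets) -- illustrative only
graph = {
    'bank':  {'river', 'money', 'loan'},
    'river': {'bank', 'water'},
    'money': {'bank', 'loan'},
    'loan':  {'bank', 'money'},
    'water': {'river'},
}
```

Here 'bank' co-occurs with words that mostly do not co-occur with each other (only one of its three neighbour pairs is linked), so its clustering coefficient is 1/3 and its diversity score is 2/3, higher than that of a word whose contexts all cluster together.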

References

[1] Hamilton, W. L., Leskovec, J., & Jurafsky, D. (2016). Diachronic Word Embeddings Reveal Statistical Laws of Semantic Change.

pmi(tgt_w: str, cntx_w: str)[source]

Compute PMI score of a word pair

Parameters
tgt_w : str

Target word

cntx_w : str

Context word

Returns
float

PMI value
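PMI compares a pair's joint probability with what independence would predict: pmi(w, c) = log2(P(w, c) / (P(w) P(c))). Given the constructor's SM parameter, the library presumably applies the context-distribution smoothing of [1], raising context counts to the power α = 0.75; the placement of the smoothing in this sketch is an assumption based on [1], not the library's code:

```python
import math
from collections import Counter

def smoothed_pmi(pair_counts, alpha=0.75):
    """PMI with context-distribution smoothing: P_a(c) ~ count(c)**alpha."""
    total = sum(pair_counts.values())
    w_counts, c_counts = Counter(), Counter()
    for (w, c), n in pair_counts.items():
        w_counts[w] += n
        c_counts[c] += n
    c_total = sum(n ** alpha for n in c_counts.values())

    def pmi(w, c):
        p_wc = pair_counts[(w, c)] / total
        p_w = w_counts[w] / total
        p_c = c_counts[c] ** alpha / c_total  # smoothed context probability
        return math.log2(p_wc / (p_w * p_c))

    return pmi

# toy (target, context) co-occurrence counts -- illustrative only
pmi = smoothed_pmi({('ice', 'cold'): 8, ('ice', 'hot'): 1,
                    ('fire', 'cold'): 1, ('fire', 'hot'): 8})
```

A positive PMI means the pair co-occurs more often than chance, a negative PMI less often; here pmi('ice', 'cold') is positive and pmi('ice', 'hot') is negative.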