1. Finding Collocations with Concordancer

This notebook demonstrates how to use Concordancer to find the collocates of a word in a corpus.

!pip install -U concordancer

Requirement already up-to-date: concordancer in /usr/local/lib/python3.6/dist-packages (0.1.13)
Requirement already satisfied, skipping upgrade: falcon-cors in /usr/local/lib/python3.6/dist-packages (from concordancer) (1.1.7)
Requirement already satisfied, skipping upgrade: cqls in /usr/local/lib/python3.6/dist-packages (from concordancer) (0.1.5)
Requirement already satisfied, skipping upgrade: falcon in /usr/local/lib/python3.6/dist-packages (from concordancer) (2.0.0)
Requirement already satisfied, skipping upgrade: tabulate in /usr/local/lib/python3.6/dist-packages (from concordancer) (0.8.7)

import json
from math import log2
from concordancer.concordancer import Concordancer
from concordancer.kwic_print import KWIC

# Use built-in example data
from concordancer.demo import download_demo_corpus
fp = download_demo_corpus(to=".")

Corpus downloaded to /content/demo_corpus.jsonl

# Load the corpus as a Concordancer object
with open(fp, encoding="utf-8") as f:
    C = Concordancer([json.loads(line) for line in f], text_key="text")

C.set_cql_parameters(default_attr="word", max_quant=5)

1.1. Extracting Collocates

The code below extracts the collocates of the node word 討厭 ("to dislike").
A word is counted as a collocate only if it occurs within a window of 4 tokens to the left or right of the node word.
Mutual information (MI) is used as the association measure.
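For reference, MI compares the observed number of co-occurrences with the number expected if the node word and the collocate were distributed independently of each other. The following is a minimal sketch with made-up counts; the function name `mutual_information` and all numbers are illustrative assumptions, not part of Concordancer.

```python
from math import log2


def mutual_information(observed, node_count, collocate_count, corpus_size):
    """Return log2(observed / expected) for one node-collocate pair.

    `expected` is the co-occurrence count predicted under independence.
    (Hypothetical helper for illustration only.)
    """
    expected = node_count * collocate_count / corpus_size
    return log2(observed / expected)


# Hypothetical counts: the node occurs 4 times and the collocate 8 times
# in a corpus of 1024 tokens, and they co-occur twice.
# expected = 4 * 8 / 1024 = 0.03125, so MI = log2(2 / 0.03125) = log2(64)
print(mutual_information(observed=2, node_count=4, collocate_count=8, corpus_size=1024))  # 6.0
```

A positive score means the pair co-occurs more often than chance would predict; a score near zero means the pair behaves roughly as if independent.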

# Count co-occurrences
NODE_WORD = '討厭'
WINDOW = 4

cql = f'[word="{NODE_WORD}"]'
results = C.cql_search(cql, left=WINDOW, right=WINDOW)

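# Map each collocate to its co-occurrence count and its total corpus frequency
collo_stats = {}
for result in results:
    # Words inside the window on both sides of the node word
    context_words = [w['word'] for w in result['left'] + result['right']]
    for collocate in context_words:
        if collocate not in collo_stats:
            collo_stats[collocate] = {
                'cooccur': 0,
                # Total number of occurrences of the collocate in the corpus
                'total': len(C.corp_idx["word"][collocate]),
            }
        collo_stats[collocate]['cooccur'] += 1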

collo_stats

{'人': {'cooccur': 1, 'total': 79},
 '外套': {'cooccur': 1, 'total': 58},
 '很': {'cooccur': 1, 'total': 244},
 '我': {'cooccur': 1, 'total': 470},
 '是': {'cooccur': 1, 'total': 450},
 '的': {'cooccur': 1, 'total': 1190},
 '真的': {'cooccur': 1, 'total': 90},
 '穿厚': {'cooccur': 1, 'total': 1}}
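The counting step above can also be pictured without the library. The following is a rough sketch on a toy token list, assuming a flat list of tokens with no sentence or document boundaries; the function `count_window_collocates` and the sample tokens are hypothetical and only mimic what the loop over `results` does.

```python
from collections import Counter


def count_window_collocates(tokens, node, window):
    """Count tokens found within `window` positions of each occurrence of `node`.

    (Hypothetical helper; it ignores sentence and document boundaries.)
    """
    counts = Counter()
    for i, token in enumerate(tokens):
        if token != node:
            continue
        left = tokens[max(0, i - window):i]      # up to `window` tokens before the node
        right = tokens[i + 1:i + 1 + window]     # up to `window` tokens after the node
        counts.update(left + right)
    return counts


tokens = ["a", "b", "NODE", "c", "d", "e", "NODE", "a"]
counts = count_window_collocates(tokens, "NODE", window=2)
print(sorted(counts.items()))
# [('a', 2), ('b', 1), ('c', 1), ('d', 2), ('e', 1)]
```

Here "d" is counted twice because it falls inside the window of both occurrences of the node.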

# Compute association measures
corpus_size = sum( len(positions) for positions in C.corp_idx['word'].values() )
node_marginal_count = len(C.corp_idx["word"][NODE_WORD])

for word, stats in collo_stats.items():
    observed_cooccur = stats['cooccur']
    collocate_marginal_count = stats['total']

    # Calculate MI = log2(observed / expected), where the expected count assumes independence
    expected_cooccur = collocate_marginal_count * node_marginal_count / corpus_size
    MI = log2(observed_cooccur / expected_cooccur)
    collo_stats[word]['MI'] = MI

# Sort collocates by MI, highest first, and show the top 10
sorted(collo_stats.items(), key=lambda x: x[1]['MI'], reverse=True)[:10]

[('穿厚', {'MI': 14.693759179520415, 'cooccur': 1, 'total': 1}),
 ('外套', {'MI': 8.835778184392844, 'cooccur': 1, 'total': 58}),
 ('人', {'MI': 8.389978431343312, 'cooccur': 1, 'total': 79}),
 ('真的', {'MI': 8.20190608319074, 'cooccur': 1, 'total': 90}),
 ('很', {'MI': 6.763021841957529, 'cooccur': 1, 'total': 244}),
 ('是', {'MI': 5.879977988303378, 'cooccur': 1, 'total': 450}),
 ('我', {'MI': 5.817242232955415, 'cooccur': 1, 'total': 470}),
 ('的', {'MI': 4.477013321325109, 'cooccur': 1, 'total': 1190})]
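In the output above, every collocate co-occurs with the node word only once, and the top-ranked item occurs only once in the whole corpus. MI tends to favor such low-frequency items, so one common precaution is to require a minimum co-occurrence count before ranking. The following is a minimal sketch under that assumption; `rank_collocates`, its `min_cooccur` parameter, and the sample statistics are hypothetical and not part of Concordancer.

```python
def rank_collocates(stats, min_cooccur=1, top_n=10):
    """Rank collocates by MI after dropping those seen fewer than `min_cooccur` times.

    `stats` has the same shape as `collo_stats`:
    {word: {'cooccur': int, 'total': int, 'MI': float}}
    (Hypothetical helper for illustration only.)
    """
    kept = {w: s for w, s in stats.items() if s["cooccur"] >= min_cooccur}
    return sorted(kept.items(), key=lambda item: item[1]["MI"], reverse=True)[:top_n]


# Made-up statistics, not taken from the demo corpus
example_stats = {
    "rare": {"cooccur": 1, "total": 1, "MI": 12.0},
    "common": {"cooccur": 5, "total": 40, "MI": 7.5},
    "frequent": {"cooccur": 3, "total": 300, "MI": 3.2},
}
ranked = rank_collocates(example_stats, min_cooccur=3)
print([word for word, _ in ranked])  # ['common', 'frequent']
```

With the demo corpus, any threshold above 1 would remove every collocate of 討厭, since each co-occurs with it only once; a threshold like this is mainly useful for node words with more hits.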