1. Finding Collocations with Concordancer
This notebook demonstrates how to use Concordancer to find the collocates of a word in a corpus.
!pip install -U concordancer
Requirement already up-to-date: concordancer in /usr/local/lib/python3.6/dist-packages (0.1.13)
Requirement already satisfied, skipping upgrade: falcon-cors in /usr/local/lib/python3.6/dist-packages (from concordancer) (1.1.7)
Requirement already satisfied, skipping upgrade: cqls in /usr/local/lib/python3.6/dist-packages (from concordancer) (0.1.5)
Requirement already satisfied, skipping upgrade: falcon in /usr/local/lib/python3.6/dist-packages (from concordancer) (2.0.0)
Requirement already satisfied, skipping upgrade: tabulate in /usr/local/lib/python3.6/dist-packages (from concordancer) (0.8.7)
import json
from math import log2
from concordancer.concordancer import Concordancer
from concordancer.kwic_print import KWIC
# Use built-in example data
from concordancer.demo import download_demo_corpus
fp = download_demo_corpus(to=".")
Corpus downloaded to /content/demo_corpus.jsonl
# Load corpus as a Concordancer object
with open(fp, encoding="utf-8") as f:
    C = Concordancer([json.loads(l) for l in f], text_key="text")
C.set_cql_parameters(default_attr="word", max_quant=5)
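Before computing collocation statistics, it can help to inspect the node word in context. The snippet below is a minimal sketch using the KWIC helper imported above; it assumes KWIC accepts the list of matches returned by cql_search, and the query string and window sizes are illustrative.
# Illustrative KWIC preview of the node word (not part of the original analysis)
preview = list(C.cql_search('[word="討厭"]', left=5, right=5))
KWIC(preview)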
1.1. Extracting Collocates
The code below extracts collocates of the node word 討厭. A word is counted as a collocate only if it occurs within a window of 4 tokens on either side of the node word, and mutual information (MI) is used as the association measure (see the formula below).
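For reference, the MI score computed in the code below is the pointwise mutual information between the node word and a collocate:

$$
\mathrm{MI} = \log_2 \frac{O}{E}, \qquad E = \frac{f(\text{node}) \times f(\text{collocate})}{N}
$$

where $O$ is the observed number of co-occurrences within the window, $f(\cdot)$ is a word's marginal frequency in the corpus, and $N$ is the corpus size in tokens.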
# Count co-occurrences
NODE_WORD = '討厭'
WINDOW = 4
cql = f'[word="{NODE_WORD}"]'
results = C.cql_search(cql, left=WINDOW, right=WINDOW)
collo_stats = {}
for result in results:
    context_words = [ w['word'] for w in result['left'] + result['right'] ]
    for collocate in context_words:
        if collocate not in collo_stats:
            collo_stats[collocate] = {
                'cooccur': 0,
                'total': len(C.corp_idx["word"][collocate]),
            }
        collo_stats[collocate]['cooccur'] += 1
collo_stats
{'人': {'cooccur': 1, 'total': 79},
'外套': {'cooccur': 1, 'total': 58},
'很': {'cooccur': 1, 'total': 244},
'我': {'cooccur': 1, 'total': 470},
'是': {'cooccur': 1, 'total': 450},
'的': {'cooccur': 1, 'total': 1190},
'真的': {'cooccur': 1, 'total': 90},
'穿厚': {'cooccur': 1, 'total': 1}}
# Compute association measures
corpus_size = sum( len(positions) for positions in C.corp_idx['word'].values() )
node_marginal_count = len(C.corp_idx["word"][NODE_WORD])
for word, stats in collo_stats.items():
    observed_cooccur = stats['cooccur']
    collocate_marginal_count = stats['total']

    # Calculate MI
    expected_cooccur = collocate_marginal_count * node_marginal_count / corpus_size
    MI = log2(observed_cooccur / expected_cooccur)
    collo_stats[word]['MI'] = MI
# Sort results
sorted(collo_stats.items(), key=lambda x:x[1]['MI'], reverse=True)[:10]
[('穿厚', {'MI': 14.693759179520415, 'cooccur': 1, 'total': 1}),
('外套', {'MI': 8.835778184392844, 'cooccur': 1, 'total': 58}),
('人', {'MI': 8.389978431343312, 'cooccur': 1, 'total': 79}),
('真的', {'MI': 8.20190608319074, 'cooccur': 1, 'total': 90}),
('很', {'MI': 6.763021841957529, 'cooccur': 1, 'total': 244}),
('是', {'MI': 5.879977988303378, 'cooccur': 1, 'total': 450}),
('我', {'MI': 5.817242232955415, 'cooccur': 1, 'total': 470}),
('的', {'MI': 4.477013321325109, 'cooccur': 1, 'total': 1190})]
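Note that MI heavily rewards low-frequency collocates: the top-ranked item 穿厚 occurs only once in the corpus. A common remedy is to require a minimum corpus frequency before ranking. The snippet below is a sketch of such a filter; MIN_FREQ is an illustrative threshold, not part of the original notebook.
# Drop collocates below an illustrative frequency threshold before ranking by MI
MIN_FREQ = 5
filtered = {w: s for w, s in collo_stats.items() if s['total'] >= MIN_FREQ}
sorted(filtered.items(), key=lambda x: x[1]['MI'], reverse=True)[:10]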