2. Stats

## Colab setup
# !gdown https://github.com/liao961120/hgct/raw/main/test/data.zip
# !unzip -q data.zip
# !pip install -qU hgct

Corpus Analysis API in hgct

In this second tutorial, we demonstrate the functions for quantitative corpus analysis in hgct. In addition to the Concordancer object introduced in the previous tutorial, we need two further objects, CompoAnalysis and Dispersion. The corpus used is identical to the one in Appendix A.

Note that when initializing CompoAnalysis() with PlainTextReader(), the argument auto_load=False needs to be given to PlainTextReader(). This prevents the full corpus from being loaded into memory, so that the functionalities provided by CompoAnalysis can be used to analyze data too large to fit into the computer's memory. For more information, refer to the source code on GitHub[1].

[1] https://github.com/liao961120/hgct/blob/main/hgct/compoAnalysis.py

import pandas as pd
from hgct import PlainTextReader, Concordancer
from hgct import CompoAnalysis, Dispersion

CC = Concordancer(PlainTextReader("data/").corpus)
CA = CompoAnalysis(PlainTextReader("data/", auto_load=False))
DP = Dispersion(PlainTextReader("data/").corpus)
Indexing corpus for text retrival...
Indexing corpus for concordance search...
Indexing corpus for text retrival...
Indexing corpus for concordance search...

Frequency List (Distribution)

Frequency lists are provided by CompoAnalysis.freq_distr(). Depending on the arguments passed, this function computes and returns the frequency distribution of characters, IDCs, Kangxi radicals, or characters containing a given radical/component. Below, we demonstrate each of these types of frequency distributions.

Character

To retrieve the frequency distribution of the characters in the corpus, set the argument tp to "chr". By default, CompoAnalysis.freq_distr() returns a Counter[1], whose convenient most_common() method can be used to retrieve the terms with the highest frequencies.

[1] https://docs.python.org/3/library/collections.html#collections.Counter

CA.freq_distr(tp="chr").most_common(4)
[('之', 210608), ('不', 129212), ('也', 107639), ('以', 104578)]

As mentioned in @sec:app-search-by-character, we can limit the scope of the calculation to a particular subcorpus by specifying its index with the argument subcorp_idx. The example below sets the subcorpus to 3, which is the subcorpus of modern Chinese (ASBC).

CA.freq_distr(tp="chr", subcorp_idx=3).most_common(4)
[('的', 15826), ('一', 5537), ('是', 5130), ('不', 4469)]

IDC

Frequency distributions of the Ideographic Description Characters (IDCs) can similarly be retrieved by setting tp to "idc". Note that the argument use_chr_types applies when tp="idc" (IDC) or tp="rad" (radical) and determines how the frequencies are computed. If it is set to False (the default), character token frequencies are taken into account. If it is set to True, token frequencies are discarded; in other words, an IDC or a radical is counted only once for each character type. See @sec:frequency-lists for a toy example.
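To make the distinction concrete, here is a minimal illustration of the two counting modes, independent of hgct (the character-to-IDC mapping below is hypothetical and hard-coded):

from collections import Counter

# Hypothetical mapping from characters to their top-level IDCs
idc_of = {'好': '⿰', '江': '⿰', '想': '⿱'}
corpus = '好好好江想'

# use_chr_types=False: one count per character *token*
Counter(idc_of[c] for c in corpus)
# Counter({'⿰': 4, '⿱': 1})

# use_chr_types=True: one count per character *type*
Counter(idc_of[c] for c in set(corpus))
# Counter({'⿰': 2, '⿱': 1})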

CA.freq_distr(tp="idc", subcorp_idx=3)
Counter({'': 48725,
         '⿰': 167681,
         '⿱': 120035,
         '⿲': 1965,
         '⿳': 4068,
         '⿴': 5744,
         '⿵': 7834,
         '⿶': 1637,
         '⿷': 537,
         '⿸': 18511,
         '⿹': 4412,
         '⿺': 13451,
         '⿻': 10324})
CA.freq_distr(tp="idc", use_chr_types=True, subcorp_idx=3)
Counter({'': 119,
         '⿰': 2454,
         '⿱': 1019,
         '⿲': 26,
         '⿳': 39,
         '⿴': 18,
         '⿵': 45,
         '⿶': 6,
         '⿷': 12,
         '⿸': 176,
         '⿹': 41,
         '⿺': 123,
         '⿻': 32})

Radical

To retrieve frequency distributions of radicals, set tp="rad". The usage of use_chr_types here is the same as for IDCs described above.

CA.freq_distr(tp="rad", subcorp_idx=3).most_common(4)
[('人', 28149), ('白', 16640), ('一', 15567), ('口', 15443)]
CA.freq_distr(tp="rad", use_chr_types=True, subcorp_idx=3).most_common(4)
[('水', 233), ('口', 207), ('手', 201), ('人', 172)]

Characters with a given radical

It is also possible to look into characters of a specific type. By setting tp=None and passing a radical to the argument radical, one retrieves the frequency distribution of the characters carrying this particular radical.

CA.freq_distr(tp=None, radical="广").most_common(4)
[('度', 4757), ('廣', 4050), ('廟', 3067), ('府', 3064)]

Characters with a given IDC component

Similarly, a frequency distribution of characters of a specific type, defined by a component and an optional IDC describing the shape, can be retrieved by specifying tp=None together with the arguments compo and idc (the latter optional).

CA.freq_distr(tp=None, compo="水", idc="vert2")
Counter({'氶': 1,
         '汞': 15,
         '沓': 89,
         '泉': 1349,
         '泵': 3,
         '淼': 4,
         '滎': 344,
         '漀': 1,
         '漐': 9,
         '漿': 153,
         '澩': 3,
         '灓': 5})
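
Note that the idc argument takes a shape name (here "vert2" for ⿱, vertical two-part composition) rather than the IDC character itself. The mapping between shape names and IDCs is stored in the chr_idcs attribute of the Concordancer object, which is used again in the Productivity section below:

CC.chr_idcs
# {'horz2': '⿰', 'vert2': '⿱', 'horz3': '⿲', 'vert3': '⿳', ...}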

Dispersion

Measures of dispersion can be calculated for a single character or for a search pattern.

Dispersion Measures for Characters

Dispersion.char_dispersion() calculates dispersion measures for a character. The examples below, based on the toy corpus in Gries (2020), demonstrate the validity of the returned measures: the values should be identical to those in Table 1 of Gries (2020).

# Gries (2020, Table 1)
DP.char_dispersion(char='a', subcorp_idx=4)
{'DP': 0.18,
 'DPnorm': 0.2195121951219512,
 'JuillandD': 0.7851504534504508,
 'KLdivergence': 0.13697172936522078,
 'Range': 5,
 'RosengrenS': 0.9498163423042408}
# return_raw=True to get the raw data for dispersion calculation
DP.char_dispersion(char='a', return_raw=True, subcorp_idx=4)
({'DP': 0.18,
  'DPnorm': 0.2195121951219512,
  'JuillandD': 0.7851504534504508,
  'KLdivergence': 0.13697172936522078,
  'Range': 5,
  'RosengrenS': 0.9498163423042408},
 {'corpus_size': 50,
  'f': 15,
  'n': 5,
  'p': [0.1111111111111111, 0.45454545454545453, 0.3, 0.2, 0.4],
  's': [0.18, 0.22, 0.2, 0.2, 0.2],
  'v': [1, 5, 3, 2, 4]})
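
As a sanity check, several of these measures can be recomputed by hand from the raw data above, following the definitions in Gries (2020). The sketch below assumes, as shown in the output, that s holds the subcorpus sizes as proportions, v the per-subcorpus frequencies, and f the overall frequency:

# Recompute three of the measures from the raw components (Gries 2020)
s = [0.18, 0.22, 0.2, 0.2, 0.2]  # expected: part sizes as proportions
v = [1, 5, 3, 2, 4]              # observed: frequency of 'a' in each part
f = sum(v)                       # overall frequency (15)

DP = 0.5 * sum(abs(vi / f - si) for vi, si in zip(v, s))
DPnorm = DP / (1 - min(s))
RosengrenS = sum((vi * si) ** 0.5 for vi, si in zip(v, s)) ** 2 / f
print(DP, DPnorm, RosengrenS)
# 0.18, 0.21951..., 0.94981... (matching the output above)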

To see how dispersion measures behave on real data, we calculate them for four characters (之, 也, 草, and 巾) in a corpus of Literary Chinese texts. The first two characters, 之 and 也, often serve as function words in Literary Chinese, whereas the last two are typically content words. Hence, we expect the first two to be distributed evenly across the corpus and the latter two unevenly.

subcorp_idx = 0
df_disp = []
for ch in '之也草巾':
    stats, raw = DP.char_dispersion(
        char=ch, subcorp_idx=subcorp_idx, return_raw=True
    )
    d = {
        'char': ch,
        'Range(%)': '{:.2f}'.format(100 * stats['Range'] / raw['n']),
        **stats
    }
    df_disp.append(d)
df_disp = pd.DataFrame(df_disp)
df_disp
  char Range(%)  Range        DP    DPnorm  KLdivergence  JuillandD  RosengrenS
0    之    90.98    666  0.128508  0.128509      0.095890   0.977316    0.961405
1    也    77.05    564  0.251459  0.251462      0.401038   0.962913    0.823893
2    草    22.40    164  0.649643  0.649649      2.331477   0.863829    0.320790
3    巾     3.69     27  0.844676  0.844683      4.077689   0.541787    0.101871

Dispersion Measures of Complex Forms (defined by CQL)

Dispersion measures for abstract units can also be calculated from the concordance lines returned by Concordancer.cql_search(). The function Dispersion.pattern_dispersion() is designed to take the query results of Concordancer.cql_search() and calculate dispersion measures from them.

cql = """
[semtag="人體精神"] [semtag="人體精神"]
"""
results = list(CC.cql_search(cql, left=3, right=3))
print('Num of results:', len(results))
for r in results[:3]: print(r)
Num of results: 8459
<Concord 。有孚{惠心},勿問>
<Concord 大澤則{惠必}及下,>
<Concord 「仁義{惠愛}而已矣>
DP.pattern_dispersion(data=results, subcorp_idx=2)
{'DP': 0.1504848557289626,
 'DPnorm': 0.15050344195568013,
 'JuillandD': 0.9387038720245429,
 'KLdivergence': 0.135483902941753,
 'Range': 134,
 'RosengrenS': 0.9428568965311757}

The example below calculates dispersion measures separately for subcorpora 0, 1, and 2. This is useful when the user is interested in contrasting dispersion across different corpora (e.g., for genre or diachronic comparisons).

# Compute separate dispersion measures for each subcorpus
df_pat_disp = []
for i in range(3):
    stats, raw = DP.pattern_dispersion(
        data=results, subcorp_idx=i, return_raw=True
    )
    d = {
        'Range(%)': '{:.2f}'.format(100 * stats['Range'] / raw['n']),
        **stats,
        'freq': raw['f'],
        'corp_size': raw['corpus_size']
    }
    df_pat_disp.append(d)
df_pat_disp = pd.DataFrame(df_pat_disp)
df_pat_disp
  Range(%)  Range        DP    DPnorm  ...  JuillandD  RosengrenS  freq  corp_size
0    44.40    325  0.399226  0.399229  ...   0.907705    0.629630  1689    1858228
1    53.38    560  0.325007  0.325008  ...   0.950161    0.753668  3500    3938310
2    85.90    134  0.150485  0.150503  ...   0.938704    0.942857  2489    2097273

[3 rows x 9 columns]

Ngram Frequency

We now turn to relationships across characters. To compute character n-gram frequencies, use Concordancer.freq_distr_ngrams().

CC.freq_distr_ngrams(n=2, subcorp_idx=0).most_common(4)
Counting 2-grams...
{"version_major":2,"version_minor":0,"model_id":"bb5037e60ea8460abcc2e2050bd94200"}
[('而不', 3913), ('天下', 3661), ('不可', 2985), ('之所', 2723)]
CC.freq_distr_ngrams(n=3, subcorp_idx=0).most_common(4)
Counting 3-grams...
{"version_major":2,"version_minor":0,"model_id":"98917c323e1943c694386d79c257604f"}
[('天下之', 946), ('歧伯曰', 766), ('之所以', 605), ('不可以', 580)]
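
Character n-grams here are simply contiguous sequences of n characters. Below is a minimal sketch of 2-gram counting, independent of hgct's implementation (which additionally handles subcorpus indexing and text boundaries):

from collections import Counter

text = '天下之大天下'
Counter(text[i:i + 2] for i in range(len(text) - 1))
# Counter({'天下': 2, '下之': 1, '之大': 1, '大天': 1})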

Collocation

Association measures can be used to quantify the strength of attraction between a pair of characters. Pairs with strong attraction can be considered collocations. hgct implements two types of collocation extraction functions. The first, Concordancer.bigram_associations(), is based on bigrams: it simply computes association scores for all bigrams. With the second, Concordancer.collocates(), users specify a node and a window size; every character falling within this window around the node is treated as forming a node-collocate pair with it, and an association score is then computed for each pair.
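
Before turning to the functions, here is a minimal sketch of the window logic behind the second approach, independent of hgct's implementation (the node and window values mirror the node-collocate example further below):

# Toy sketch: collect node-collocate pairs within a window around the node
from collections import Counter

text = '我們都愛我們的貓'
node = '我們'
left, right = 0, 1  # window: 0 characters to the left, 1 to the right

pairs = Counter()
for i in range(len(text) - len(node) + 1):
    if text[i:i + len(node)] == node:
        window = text[max(0, i - left):i] \
               + text[i + len(node):i + len(node) + right]
        for ch in window:
            pairs[(node, ch)] += 1
pairs
# Counter({('我們', '都'): 1, ('我們', '的'): 1})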

Bigram Association

bi_asso = CC.bigram_associations(subcorp_idx=3, sort_by="Gsq")
bi_asso[0]
('自己',
 {'DeltaP12': 0.9778668701918644,
  'DeltaP21': 0.36342714003090937,
  'Dice': 0.5303392259913999,
  'FisherExact': 0.0,
  'Gsq': 6188.677676112116,
  'MI': 7.855905225817536,
  'RawCount': 555,
  'Xsq': 128210.23324106314})
d = pd.DataFrame([{'bigram': x[0], **x[1]} for x in bi_asso][:5])
# print(d.to_markdown(index=False, floatfmt=".2f", numalign="left"))
d
  bigram        MI            Xsq  ...  DeltaP12  FisherExact  RawCount
0     自己  7.855905  128210.233241  ...  0.977867          0.0       555
1     什麼  9.153258  192859.824384  ...  0.547635          0.0       339
2     我們  6.183966   42280.224680  ...  0.446638          0.0       592
3     台灣  8.126771  111740.169937  ...  0.693597          0.0       401
4     沒有  6.394685   43012.134830  ...  0.164128          0.0       518

[5 rows x 9 columns]

Node-Collocate Association

The example below uses the character sequence 我們 as the node and looks for collocates occurring immediately to the right of the node (left=0 and right=1). After association scores are computed for each node-collocate pair, the pairs are sorted by the MI measure. The data frame below shows the top-5 collocates of the node 我們 with the highest MI scores (a minimum frequency threshold of 6 is applied).

cql = """
[char="我"] [char="們"]
"""
collo = CC.collocates(cql, left=0, right=1, subcorp_idx=3,
                      sort_by="MI", alpha=0)
collo[0]
('釘',
 {'DeltaP12': 0.0016848237685590844,
  'DeltaP21': 0.33204500782950214,
  'Dice': 0.0033613445378151263,
  'FisherExact': 0.003866505328061448,
  'Gsq': 9.493215334772461,
  'MI': 8.012895027477056,
  'RawCount': 1,
  'Xsq': 256.6351579547297})
d = pd.DataFrame([{'char': x[0], **x[1]} for x in collo
                  if x[1]['RawCount'] > 5][:5])
#print(d.to_markdown(index=False, floatfmt=".2f", numalign="left"))
d
  char        MI         Xsq  ...  DeltaP12   FisherExact  RawCount
0      3.979880  124.857368  ...  0.014258  9.310853e-09         9
1      3.388404   77.315368  ...  0.013769  2.970053e-07         9
2      3.328575  122.653021  ...  0.022845  6.215205e-11        15
3      3.207562  125.435532  ...  0.025641  1.218295e-11        17
4      3.047111   76.926085  ...  0.017841  4.222232e-08        12

[5 rows x 9 columns]

Productivity

Finally, we demonstrate a tentative application of productivity measures [@baayen1993;@baayen2009] to character components, implemented in CompoAnalysis.productivity(). The categories over which the productivity measures are computed are defined by the arguments passed.

# Productivity of a radical
CA.productivity(radical="广", subcorp_idx=0)
{'N': 1505967,
 'NC': 5889,
 'V1': 1896,
 'V1C': 7,
 'productivity': {'expanding': 0.003691983122362869,
  'potential': 0.0011886568177958906,
  'realized': 58}}
# Productivity of a component
CA.productivity(compo="虫", idc="horz2", pos=0, subcorp_idx=0)
{'N': 1505967,
 'NC': 1027,
 'V1': 1896,
 'V1C': 72,
 'productivity': {'expanding': 0.0379746835443038,
  'potential': 0.07010710808179163,
  'realized': 178}}
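
The returned values are consistent with Baayen's hapax-based definitions of expanding and potential productivity. The sketch below verifies this, assuming (as inferred from the output fields) that V1C is the number of hapax legomena within the category, V1 the number of hapaxes in the whole subcorpus, and NC the number of tokens in the category:

# Sanity check: expanding = V1C / V1, potential = V1C / NC
p = CA.productivity(radical="广", subcorp_idx=0)
assert abs(p['productivity']['expanding'] - p['V1C'] / p['V1']) < 1e-9
assert abs(p['productivity']['potential'] - p['V1C'] / p['NC']) < 1e-9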
# Productivity of Hanzi shapes (IDCs)
df_prod = []
for idc_nm, idc_val in CC.chr_idcs.items():
    p = CA.productivity(idc=idc_nm, subcorp_idx=0)
    df_prod.append({
        'name': idc_nm,
        'shape': idc_val,
        **p['productivity'],
        'V1C': p['V1C'],
        'V1': p['V1'],
        'NC': p['NC'],
        'N': p['N'],
    })

df_prod = pd.DataFrame(df_prod)
df_prod
     name shape  realized  expanding  potential   V1C    V1      NC        N
0   horz2     ⿰      5436   0.719409   0.003115  1364  1896  437854  1505967
1   vert2     ⿱      2045   0.219409   0.000741   416  1896  561357  1505967
2   horz3     ⿲        35   0.001582   0.000481     3  1896    6240  1505967
3   vert3     ⿳        80   0.005802   0.000765    11  1896   14371  1505967
4    encl     ⿴        27   0.001582   0.000208     3  1896   14409  1505967
5    surN     ⿵        84   0.004747   0.000357     9  1896   25231  1505967
6    surU     ⿶         6   0.000000   0.000000     0  1896    7275  1505967
7    curC     ⿷        20   0.002110   0.002438     4  1896    1641  1505967
8    surT     ⿸       332   0.026371   0.000548    50  1896   91208  1505967
9    sur7     ⿹        48   0.002637   0.000197     5  1896   25379  1505967
10   surL     ⿺       178   0.013186   0.000931    25  1896   26844  1505967
11   over     ⿻        43   0.000527   0.000026     1  1896   37846  1505967