1. Query

# # Colab setup
# !gdown https://github.com/liao961120/hgct/raw/main/test/data.zip
# !unzip -q data.zip
# !pip install -qU hgct

Search API in hgtk

In the following tutorials (Appendix A and B), we will use a small collection of texts as the example corpus. The text data is available on GitHub at https://github.com/liao961120/hgct/raw/main/test/data.zip. After extracting data.zip to the directory data, it should have the following structure:

data
├── 01
│   ├── 儀禮_公食大夫禮.txt
│   ├── ...
│   └── 黃帝內經_靈樞經.txt
├── 02
│   ├── ...
│   └── 鹽鐵論_卷四.txt
├── 03
│   ├── 三國志_吳書一.txt
│   ├── ...
│   └── 魯勝墨辯注敘_魯勝墨辯注敘.txt
├── 08
│   ├── asbc1.txt
│   └── asbc2.txt
└── 10
    ├── dispersion1.txt
    ├── ...
    └── dispersion5.txt

The directory data corresponds to the corpus in hgct’s corpus representation. It contains five directories, each of which corresponds to a subcorpus. Directory 01, 02, and 03 consists of small samples of Literary Chinese texts collected from the Chinese Text Project (https://ctext.org). Directory 08 holds modern Chinese texts sampled from ASBC. The directory 10 is a toy corpus in @gries2020 [p. 102] used for illustrating calculations of dispersion measures.

In this tutorial, we demonstrate the supported functionalities in hgct for searching the corpus.

Loading Corpus Data into Concordancer

Provided that the input corpus follows the required directory structure mentioned in @sec:corpus-structure-and-input-data, users could convert the input corpus to the internal corpus representation with PlainTextReader() as in the following code block. Since we are now demonstrating the search functions, we immediately pass the corpus to Concordancer(), which is the object used in hgct for searching the corpus.

from hgct import PlainTextReader, Concordancer
c = Concordancer(PlainTextReader("data/").corpus)

Indexing corpus for text retrival...

{"version_major":2,"version_minor":0,"model_id":"fb6872ef86e3446da7140a66c9108fc0"}

Indexing corpus for concordance search...

{"version_major":2,"version_minor":0,"model_id":"3502b87c9fab4b1aabfd8a07bcf460dc"}

The Concordancer object could be used to retrieve results matching the search pattern as a sequence[1] of concordance lines. Since many of the search patterns would return plenty of results, we define a wrapper function get_first_n() here for the purpose of demonstration.

[1] More precisely, a generator of concordance lines.

def get_first_n(cql, n=10, left=5, right=5):
    out = []
    for i, r in enumerate(c.cql_search(cql, left=left, right=right)):
        if i == n: break
        out.append(r)
    return out

Search by Character

In our first example, we define the search pattern as [char="龜"] [char="[一-龜]"], which roughly means

a sequence of two characters starting with “龜” and ending with any Chinese characters (not, e.g., punctuations)

Passing this pattern to get_first_n() (or Concordancer.cql_search()) gives us a sequence of Concord objects. A Concord object is used to represent a matched result returned from the corpus in hgct.

cql = '''
[char="龜"] [char="[一-龜]"]
'''
# left/right: left/right context size around the keyword
results = get_first_n(cql, n=5, left=6, right=3)
results

[<Concord 遷有無，貨自{龜貝}，至此>,
 <Concord 山在西北。有{龜山}。有龍>,
 <Concord ，故獸不狘；{龜以}為畜，>,
 <Concord 江郡常歲時生{龜長}尺二寸>,
 <Concord 無為頓復卜三{龜知}。聖人>]

To get more information about a particular matching result, we can look at the data attribute in a Concord object, which is a dictionary holding the relevant information of the matching result.

result_1 = results[0]
result_1.data

{'captureGroups': {},
 'keyword': '龜貝',
 'left': '遷有無，貨自',
 'meta': {'id': '02/漢書_傳.txt',
  'text': {'book': '漢書', 'sec': '傳'},
  'time': {'label': '漢', 'ord': 2, 'time_range': [-205, 220]}},
 'position': (1, 6, 3482, 42),
 'right': '，至此'}

Note the position key in Concord.data. It holds the position of the matched keyword in the corpus. The elements in the 4-tuple (1, 6, 3482, 32) correspond respectively to the indices of (subcorpus, text, sentence, character).

We did not mention above how the index of a subcorpus is determined. The index of a subcorpus is automatically determined according to the character order of the directory names. Remember that there are four directories (subcorpora) in our input corpus—01, 02, 03, 08, and 10. So by character order, 01 appears before 02, 02 before 03, 03 before 08, and so on. Hence, the first directory 01 is given the index of 0, the second is given the index of 1, and so on. These indices of the subcorpora, as seen later in Appendix B, could be used for limiting the scope of the functions in hgct in computing corpus statistical measures.

Search by Character Components

In addition to character forms, we can also describe search patterns in terms of character compositions, such as the Kangxi Radical or Ideographic Descriptions of a character.

Kangxi Radicals

To take a look at all the present Kangxi radicals in the characters of the corpus, the attribute Concordancer.chr_radicals could be used:

print(c.chr_radicals)

Building index for character radicals...
{'', '钅', '鬲', '高', '見', '广', '片', '黹', '自', '尸', '鬥', '屮', '面', '麦', '攴', '糸', '臣', '丨', '車', '鱼', '毛', '饣', '癶', '舟', '鼓', '襾', '鹿', '龜', '欠', '香', '鼻', '干', '臼', '爪', '缶', '隶', '用', '走', '爻', '风', '食', '貝', '夕', '刀', '丿', '黍', '匸', '女', '疒', '火', '目', '穴', '卜', '白', '宀', '耒', '曰', '冖', '廾', '力', '支', '老', '匚', '方', '長', '冂', '黽', '冫', '巾', '而', '虫', '尢', '齒', '耳', '入', '手', '鸟', '鳥', '厶', '瓦', '勹', '彐', '车', '凵', '气', '辵', '田', '牙', '龙', '羽', '十', '网', '匕', '辰', '氏', '皮', '角', '豆', '齿', '衣', '首', '矛', '革', '犬', '米', '禾', '生', '豸', '页', '非', '羊', '贝', '玄', '毋', '卩', '歹', '隹', '色', '见', '龟', '禸', '鼎', '鼠', '木', '弓', '至', '皿', '谷', '馬', '韦', '魚', '辛', '彳', '二', '血', '廴', '瓜', '殳', '夂', '言', '厂', '讠', '肉', '靑', '虍', '音', '牛', '豕', '髟', '囗', '石', '龠', '斤', '黑', '玉', '甘', '水', '竹', '雨', '小', '止', '黃', '示', '亠', '麥', '士', '邑', '齊', '土', '鬯', '釆', '戈', '足', '心', '月', '口', '乙', '舛', '亅', '頁', '龍', '酉', '工', '阜', '立', '弋', '日', '黾', '矢', '纟', '寸', '无', '人', '彡', '丶', '己', '麻', '聿', '鹵', '儿', '艮', '几', '艸', '骨', '门', '韋', '巛', '韭', '山', '文', '風', '門', '行', '疋', '马', '身', '又', '斗', '戶', '幺', '赤', '金', '舌', '子', '爿', '鬼', '一', '里', '大', '飛', '夊', '比', '父', '八'}

To search the corpus with Kangxi radicals, simply use the attribute radical in the description of the search pattern.

cql = '''
[radical="立"]
'''
get_first_n(cql, 5)

[<Concord 屬皆从立。{䇐}：臨也。从>,
 <Concord 从立卑聲。{竲}：北地高樓>,
 <Concord 也。从口歫{䇂}。䇂，惡聲>,
 <Concord 从口歫䇂。{䇂}，惡聲也。>,
 <Concord 曰語。从口{䇂}聲。凡言之>]

Ideographic Description Characters (IDCs)

Character components defined according to the Unicode’s Ideographic Description Characters (IDCs) could also be used for searching. The IDCs and their names in hgct are found in Concordancer.chr_idcs:

c.chr_idcs

{'curC': '⿷',
 'encl': '⿴',
 'horz2': '⿰',
 'horz3': '⿲',
 'over': '⿻',
 'sur7': '⿹',
 'surL': '⿺',
 'surN': '⿵',
 'surT': '⿸',
 'surU': '⿶',
 'vert2': '⿱',
 'vert3': '⿳'}

To search according to Ideographic Descriptions, use the attributes compo and/or idc.

cql = '''
[compo="木" & idc="vert2" & pos="0"]
'''
get_first_n(cql, 5)

[<Concord 城。趙國豪{杰}之士，多在>,
 <Concord ，百人者曰{杰}，十人者曰>,
 <Concord 者曰豪。豪{杰}俊英不相陵>,
 <Concord ：并當時之{杰}筆也。觀伯>,
 <Concord 并辭賦之英{杰}也。及仲宣>]

cql = '''
[compo="木" & idc="vert2" & pos="1"]
'''
get_first_n(cql, 5)

[<Concord 有甬，官食{槩}，不可以辟>,
 <Concord 从木午聲。{槩}：𣏙斗斛。>,
 <Concord 郢，而封夫{槩}於堂谿，為>,
 <Concord 幾夷、皓之{槩}。周羣占天>,
 <Concord 質直，皆節{槩}梗梗，有大>]

cql = '''
[compo="木" & idc="vert2"]
'''
get_first_n(cql, 5)

[<Concord 子國為客，{樂}及遍舞。鄭>,
 <Concord 舉，而況敢{樂}禍乎！今吾>,
 <Concord 歌舞不息，{樂}禍也。夫出>,
 <Concord 忘憂，是謂{樂}禍，禍必及>,
 <Concord ，君欣欣兮{樂}康。浴蘭湯>]

Either compo or idc could be left out if a more abstract search pattern is preferred. For instance, if the shape (idc) and the position (pos) are not of interest, these attributes could be left out.

cql = '''
[compo="木"]
'''
get_first_n(cql, 5)

[<Concord 藟，施于條{枚}；凱弟君子>,
 <Concord 四綍，皆銜{枚}，司馬執鐸>,
 <Concord 後。兩軍𠾑{枚}，或左或右>,
 <Concord 徒二人。銜{枚}氏：下士二>,
 <Concord 矢射之。銜{枚}氏：掌司囂>]

If one is interested only in the shape of the character, idc could be specified while all other attributes could be left out.

cql = '''
[idc="encl"] [idc="encl"]
'''
get_first_n(cql, 5)

Building index for character IDCs...

[<Concord ：『始舍之{圉圉}焉，少則洋>,
 <Concord 公朝虜而子{圉夕}立，更始尚>,
 <Concord 𢦔聲。軍：{圜圍}也。四千人>,
 <Concord 永昌」，方{圜四}寸，上紐交>,
 <Concord 行天下，雖{困四}夷，人莫不>]

Radical Semantic Type

Ma’s (2016) semantic type classification of Kangxi Radicals is also incorporated in hgct’s search function. Use the attribute semtag to specify a radical semantic type. Refer to @tbl:ma2016-radical for the 22 available semantic types.

cql = '''
[semtag="植物"] [semtag="植物"]
'''
get_first_n(cql, 5)

[<Concord 兮水中，搴{芙蓉}兮木末。心>,
 <Concord 兮陳坐，援{芙蕖}兮為蓋。水>,
 <Concord 宿兮石城。{芙蓉}蓋而蔆華車>,
 <Concord 而緣木。因{芙蓉}而為媒兮，>,
 <Concord 為衣兮，集{芙蓉}以為裳。不>]

Search by Phonetic Properties

hgct also provides searching the corpus with sound properties. The sound properties are defined according to the data from two system—Guanyun 廣韻 (Middle Chinese) and Chinese Dictionary compiled by the Ministry of Education (MOE) in Taiwan (Mandarin).

c.cql_attrs['CharPhonetic']

{'moe': ['phon', 'tone', 'tp', 'sys="moe"'],
 '廣韻': ['攝', '聲調', '韻母', '聲母', '開合', '等第', '反切', '拼音', 'IPA', 'sys="廣韻"']}

Mandarin (based on 萌典)

cql = '''
[phon="ㄨㄥ" & tone="1" & sys="moe"]
'''
get_first_n(cql, 5)

[<Concord 」耳邊不斷{嗡}嗡的縈繞著>,
 <Concord 耳邊不斷嗡{嗡}的縈繞著類>,
 <Concord 市朝也。而{翁}不爭焉，顧>,
 <Concord 發猛，塤篪{翁}博，瑟易良>,
 <Concord ，黑文而赤{翁}，名曰櫟，>]

cql = '''
[phon="^p" & tp="ipa" & sys="moe"] [phon="^p" & tp="ipa" & sys="moe"]
'''
get_first_n(cql, 5)

[<Concord 荒亂，以十{破百}。器備不行>,
 <Concord 戰而赴圍。{破伯}牙之號鍾兮>,
 <Concord 應弱燕，燕{破必}矣。燕破則>,
 <Concord 分離，陰陽{破敗}，經絡厥絕>,
 <Concord 而弓秦，秦{破必}矣。今見破>]

Middle Chinese (based on 廣韻)

cql = '''
[韻母="東" & 聲調="平" & sys="廣韻"]
'''
get_first_n(cql, 5)

[<Concord 曾參之參。{梵}：出自西域>,
 <Concord 薨奏焉。樊{梵}，字文高，>,
 <Concord 曆編訢、李{梵}等綜校其狀>,
 <Concord 行。而訢、{梵}猶以為元首>,
 <Concord 蘇統及訢、{梵}等十人。以>]