Corpora – BantUGent – UGent Centre for Bantu Studies

(Historical) Bantu corpus linguistics is one of BantUGent’s fields of expertise. Following Tognini-Bonelli (2001), corpus linguistics is the analysis and description of language use as documented in texts, including transcribed oral texts. To analyze and describe ‘real’ language, one needs “large quantities of actual occurrences of that language” assembled in an electronic corpus. Over the past two decades, BantUGent has developed such corpora for several Bantu languages to allow for the corpus-based (and in some cases even corpus-driven) study of both their grammar and their lexicon.

BantUGent warmly welcomes scholars who want to do research using one of our existing corpora, or to build and examine a new corpus of their own (Bantu) language as part of a doctoral or postdoctoral research project.

The available Bantu corpora are listed below. Corpora for Afrikaans, English, Hausa and Somali are also available and may be used for comparative purposes.

Cilubà
isiNdebele
isiXhosa
isiZulu
Kikongo*
Kirundi
Kiswahili
Lingála
Luganda
Lusoga
Northern Sotho
Sesotho
Setswana
siSwati
Tshivenda
Xitsonga

(* = a set of several dozen synchronic and diachronic corpora for the Kikongo Language Cluster (KLC))