Corpora

(Historical) Bantu corpus linguistics is one of BantUGent’s fields of expertise. Following Tognini-Bonelli (2001), corpus linguistics is the analysis and description of language use as documented in texts, including transcribed oral texts. To analyze and describe ‘real’ language, one needs “large quantities of actual occurrences of that language” assembled in an electronic corpus. Over the past two decades, BantUGent has developed such corpora for several Bantu languages to allow for their corpus-based (and in some cases even corpus-driven) study, in terms of both grammar and the lexicon.

BantUGent warmly welcomes scholars who want to do research using one of our existing corpora, or who want to build and examine a new corpus of their own (Bantu) language as part of a doctoral or postdoctoral research project.

Available Bantu corpora (in addition to corpora for Afrikaans, English, Hausa and Somali, which may be used for comparative purposes) are listed below.

  • Cilubà
  • isiNdebele
  • isiXhosa
  • isiZulu
  • Kikongo*
  • Kirundi
  • Kiswahili
  • Lingála
  • Luganda
  • Lusoga
  • Northern Sotho
  • Sesotho
  • Setswana
  • siSwati
  • Tshivenda
  • Xitsonga

(* = a set of several dozen synchronic and diachronic corpora for the Kikongo Language Cluster (KLC))