How to compare corpora: two methodological issues

Václav Cvrček

An increasing number of corpus-based discourse studies start with keyword analysis (KWA), a term coined by Scott (Scott & Tribble 2006). The first step in KWA is the identification of keywords, which results from comparing the text or corpus under examination against the backdrop of a reference corpus. In this presentation, I will focus on two methodological questions closely related to the process of keyword identification:

  1. What is the appropriate metric for measuring keyness? It has been pointed out several times (Hofland & Johansson 1982; Scott 2010; Gabrielatos & Marchi 2012) that a test statistic (log-likelihood or chi-squared) may give a misleading picture of keyness and that an effect size estimator is more adequate.
  2. What is the role of the reference corpus, and how do its composition and size affect the results? I will argue that the reference corpus helps reconstruct a model reader of a text and therefore has to be taken into account when interpreting the results.
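
The contrast in the first question, and the effect of reference corpus size raised in the second, can be illustrated with a minimal sketch. It assumes the two-term form of the log-likelihood statistic and uses the ratio of relative frequencies as the effect size; the abstract does not say which estimator the talk adopts, so that choice, the function names, and all counts below are hypothetical.

```python
import math

def log_likelihood(freq_target, freq_ref, size_target, size_ref):
    """Two-term log-likelihood (G2) for one word.

    freq_* are the word's counts, size_* the corpus sizes in tokens.
    """
    total = freq_target + freq_ref
    expected_target = size_target * total / (size_target + size_ref)
    expected_ref = size_ref * total / (size_target + size_ref)
    g2 = 0.0
    for observed, expected in ((freq_target, expected_target),
                               (freq_ref, expected_ref)):
        if observed > 0:  # a zero count contributes nothing to the sum
            g2 += observed * math.log(observed / expected)
    return 2 * g2

def relative_frequency_ratio(freq_target, freq_ref, size_target, size_ref):
    """Effect size (assumed here): relative frequency in the target
    divided by relative frequency in the reference.
    Requires freq_ref > 0."""
    return (freq_target / size_target) / (freq_ref / size_ref)

# Hypothetical counts: the word is twice as frequent (relatively)
# in the target as in the reference in both settings; the second
# setting simply multiplies every count by 100.
small = (20, 100, 10_000, 100_000)
large = (2_000, 10_000, 1_000_000, 10_000_000)

ll_small = log_likelihood(*small)
ll_large = log_likelihood(*large)
ratio_small = relative_frequency_ratio(*small)
ratio_large = relative_frequency_ratio(*large)

print(f"small corpora: LL = {ll_small:.2f}, ratio = {ratio_small:.2f}")
print(f"large corpora: LL = {ll_large:.2f}, ratio = {ratio_large:.2f}")
```

With these made-up counts the frequency ratio is identical in both settings, while the log-likelihood value grows in proportion to corpus size. A keyword list ranked by the test statistic therefore partly reflects how much data was available, not only how strongly the word's frequency differs between the two corpora.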

Both methodological issues will be demonstrated in two pilot studies: an analysis of presidential New Year’s addresses (Fidler & Cvrček 2015) and an analysis of academic texts (Cvrček & Fidler 2019).


Cvrček, V. and M. Fidler. 2019. “Up close and personal vs. birds-eye view” of discourse: a corpus study of perspective using Czech data. ICLC15 – International Cognitive Linguistics Conference, Nishinomiya, Japan.

Fidler, M. and V. Cvrček. 2015. A Data-Driven Analysis of Reader Viewpoints: Reconstructing the Historical Reader Using Keyword Analysis, Journal of Slavic Linguistics 23(2), pp. 197–239.

Gabrielatos, C. and A. Marchi. 2012. Keyness: Appropriate metrics and practical issues. CADS International Conference 2012, University of Bologna, Italy.

Hofland, K. and S. Johansson. 1982. Word frequencies in British and American English. Bergen: Norwegian Computing Centre for the Humanities.

Scott, M. 2010. Problems in investigating keyness, or clearing the undergrowth and marking out trails… In: M. Bondi and M. Scott (eds.) Keyness in Texts, pp. 43–58. Amsterdam / Philadelphia: John Benjamins.

Scott, M. and C. Tribble. 2006. Textual Patterns: Key words and corpus analysis in language education. Philadelphia: John Benjamins.

