DHASA2017 – Abstract

A stylometric analysis of Joseph Conrad’s writing

Botha, Lande; Van Zyl, Maryka; Pienaar, Wikus
North-West University

The language and style in the writings of Joseph Conrad, a multilingual, non-native speaker of English, have been the topic of various studies (Monod, 2005; Peters, 2006; Ophir, 2014; Simmons, 2014). Dowden (1973), Lucas (1991), Stubbs (2005), Moon (2007), Nofal (2013), and Hunter and Smith (2014) have paid a great deal of attention to different aspects of grammar in Conrad’s writing in order to describe his idiosyncratic style. These studies mostly focus on analysing and describing a single text or a small handful of texts in terms of their socio-cultural, political and personal affect. Digital versions of Conrad’s works and digital analysis tools now make it possible to conduct a quantitative study of Conrad’s linguistic style in all of his writing (both fiction and non-fiction) spanning his working life. The aim of this study is to statistically compare texts from various genres (novels, novellas, autobiography and notes) and publication times in order to establish whether a consistent “Conradian” style is maintained across genres and time. A “corpus” of all the published works of Conrad serves as input for Stylo (0.6.0), an R script (Eder & Rybicki, 2011). Stylo provides a cluster tree analysis in which each text is positioned according to its relation to (stylistic distance from) every other text. This analysis is based on the hundred most frequent words and gives an indication of the extent to which genre and time are factors in Conrad’s lexical choices. The wordlist and keywords functions in WordSmith Tools (6.0) (Scott, 2012) allow for further lexis-based comparison of the texts. For the purposes of a keywords analysis, the texts are grouped into two (or three) corpora based on the first branching in the Stylo-generated cluster tree. It is also possible to move beyond lexis and to study the grammatical aspects of Conrad’s style quantitatively by making use of a part-of-speech-tagged version of the corpus. CLAWS4 (Garside & Smith, 1997) is used to tag the data.
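The comparison based on the most frequent words can be illustrated with a minimal Python sketch. This is not Stylo's code: the function names are hypothetical, the tokenizer is deliberately crude, and the distance used here (mean absolute difference of z-scored relative frequencies, in the spirit of Burrows's Delta) is an assumption about the kind of measure involved rather than a statement of the settings used in this study.

```python
import re
from collections import Counter
from statistics import mean, pstdev


def tokenize(text):
    # Crude assumption: a word is a run of letters or apostrophes.
    return re.findall(r"[a-z']+", text.lower())


def mfw_profiles(texts, n=100):
    """Return the n most frequent words of the whole corpus and, for each
    text, the relative frequencies of those words."""
    counts = {name: Counter(tokenize(raw)) for name, raw in texts.items()}
    total = Counter()
    for c in counts.values():
        total.update(c)
    words = [w for w, _ in total.most_common(n)]
    profiles = {}
    for name, c in counts.items():
        size = sum(c.values()) or 1  # avoid division by zero on empty texts
        profiles[name] = [c[w] / size for w in words]
    return words, profiles


def delta_distances(profiles):
    """Pairwise distance: mean absolute difference of z-scores."""
    names = sorted(profiles)
    columns = list(zip(*(profiles[name] for name in names)))
    mus = [mean(col) for col in columns]
    sds = [pstdev(col) or 1.0 for col in columns]  # constant column: no scaling
    z = {
        name: [(x - m) / s for x, m, s in zip(profiles[name], mus, sds)]
        for name in names
    }
    dist = {}
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            dist[(a, b)] = mean(abs(p - q) for p, q in zip(z[a], z[b]))
    return dist
```

A distance table of this kind is what a hierarchical clustering step would take as input to produce a cluster tree; texts with identical word-frequency profiles are at distance zero.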
A Perl script strips the words from the tagged texts, leaving “texts” consisting entirely of word class designations. These texts serve as input for a cluster tree analysis in Stylo using bigrams, and then trigrams, which allows for comparison of the texts based on grammatical structure. This indicates the extent to which genre and time of publication are factors in the grammatical choices made by Conrad. The POS-“texts” can also be grouped into two (or three) “corpora” based on the first branching in the cluster tree analysis to serve as input for a “keywords” analysis in WordSmith Tools. Such an analysis gives an indication of the word classes involved in grammatical differences between the texts. Most of the CLAWS tags also contain morphological information such as tense, aspect and number, giving a richer picture of the author’s style.
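The tag-stripping and n-gram steps can be sketched as follows. The study itself uses a Perl script that is not reproduced here; this Python version is only an illustration, and it assumes the tagged text is in a horizontal `word_TAG` layout with whitespace between tokens. The function names are hypothetical.

```python
from collections import Counter


def tags_only(tagged_text):
    """Keep only the tag of each word_TAG token, dropping the word."""
    tags = []
    for token in tagged_text.split():
        # Split on the last underscore so words containing "_" survive.
        word, sep, tag = token.rpartition("_")
        if sep:  # skip tokens that carry no tag
            tags.append(tag)
    return tags


def tag_ngrams(tags, n):
    """Count contiguous sequences of n tags (n=2 bigrams, n=3 trigrams)."""
    return Counter(tuple(tags[i:i + n]) for i in range(len(tags) - n + 1))


if __name__ == "__main__":
    sample = "The_AT sea_NN1 was_VBDZ calm_JJ ._."
    tags = tags_only(sample)
    print(tags)
    print(tag_ngrams(tags, 2).most_common(3))
```

The resulting tag-sequence counts play the same role for grammatical structure that word frequencies play for lexis: they can be turned into per-text frequency profiles and compared across texts.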

