Prelims Paper 1: Introduction to English Language and Literature: Training - Part 2: Text Analysis

Welcome to the training page for Prelims Paper 1. This is Part 2: Text Analysis.

In the left-hand column you'll find some introductory material and links to all the key resources. As you read through, you will see opportunities to watch short videos. These are screen captures, so you can see a quick demonstration of each resource in action.

What is Text Analysis?

Many websites and software programs allow you to analyse your chosen texts. Text analysis tools allow you to explore a text quantitatively, e.g. by counting instances of one particular word, and systematically, e.g. by looking at the types of words and phrases used. This can be particularly useful for finding all instances of a specific word within a text. The tools will also list all the words in your chosen text by type, e.g. adjective or plural noun.

Using the text analysis tools allows you to compare two or more texts and lets you gather key features of the language used.  You can search for the occurrences of just one word, or a more complex pattern, e.g. pairs of words within one context. 

These tools are good for looking at the different ways authors write across genre or type, e.g. fiction and non-fiction.

Researchers also put them to use to examine questions of authorship. With the tools available you can search your own chosen texts. You can also use established corpora like the British National Corpus to look for common occurrences of words and common phrases.

Text Analysis: Statistics

A good place to start is to get some statistics about your chosen texts, to find out a bit more about them. There are many free tools online that will give you statistics about a text, but one we recommend is Voyant Tools.

Voyant Tools is a web-based text reading and analysis environment. It is a scholarly project that is designed to facilitate reading and interpretive practices. Do the exercise below to learn how to use a tool like Voyant and to see what kind of information it can give you.


Exercise One: Voyant Tools

If you found some texts in Part 1 of this training programme, then you can copy and paste those to use in this exercise, or you can choose something else. We have chosen an online text of The Tell-Tale Heart by Edgar Allan Poe.

  1. Open Voyant Tools.
  2. Paste your chosen text into the search box and press Reveal.

You should be presented with something that looks like this:

Let's look at each part in a bit more detail to see what information it contains.

In the bottom right corner look at the summary:

This will tell you how many words are in your text, and how many of them are unique words. What does this tell us about Poe's use of language? You may need to paste in other texts and compare them to get an idea about how authors tend to write in comparison with Poe. With this tool you can compare two or more different authors, or multiple texts by the same author.
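If you are curious how a summary like this can be worked out, here is a minimal sketch in Python. It is not how Voyant itself does it: the function names and the simple rule for what counts as a word are our own assumptions, and the sample sentence is just a placeholder for your own text.

```python
import re

def tokenize(text):
    """Lowercase the text and pull out simple word tokens (an assumed, rough rule)."""
    return re.findall(r"[a-z]+(?:'[a-z]+)?", text.lower())

def summary(text):
    """Return (total words, unique words, unique/total ratio) for a text."""
    tokens = tokenize(text)
    total = len(tokens)
    unique = len(set(tokens))
    ratio = unique / total if total else 0.0
    return total, unique, ratio

# Placeholder sample; paste in your own chosen text instead.
sample = "It grew louder, louder, louder! And still the men chatted pleasantly."
total, unique, ratio = summary(sample)
print(f"{total} words, {unique} unique words, ratio {ratio:.2f}")
```

A higher ratio of unique words to total words suggests a more varied vocabulary, but the ratio tends to fall as texts get longer, so it is fairest to compare texts of a similar length.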

Have a look at the most frequent words used. In this Poe extract the most frequent words used are Louder, Increased and Noise. Later, in Exercise Two, we will use this information to find out how often these words appear in the English language.
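A list of most frequent words can be approximated in a few lines of Python. This is only a sketch: the tiny stop-word list (common words such as "the" and "and" that are left out of the count) is a hypothetical one we made up for illustration, and real tools use much longer lists.

```python
import re
from collections import Counter

# A tiny, hypothetical stop-word list; real tools use far longer ones.
STOP_WORDS = {"the", "and", "a", "of", "it", "i", "to", "was", "in"}

def top_words(text, n=3):
    """Return the n most frequent words, ignoring the stop words above."""
    tokens = re.findall(r"[a-z]+(?:'[a-z]+)?", text.lower())
    counts = Counter(t for t in tokens if t not in STOP_WORDS)
    return counts.most_common(n)

# Placeholder sample; paste in your own chosen text instead.
sample = "The noise increased. The noise grew louder and louder and louder."
for word, count in top_words(sample):
    print(word, count)
```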

Next, have a look at the graph in the top right corner. This displays the appearance of those frequent words throughout the text, so you can visually see which ones appear at the same time as each other.

We can see in the Poe example that the word Sound is used a lot at the beginning of the text, but this stops, and later the words Heard and Louder appear very often together. Is this pattern typical of the English language in general? Keep working down to Exercise Two to find out.
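The idea behind a graph like this is to cut the text into equal-sized segments and count how often a word turns up in each one. The sketch below shows that idea in Python; the function name and the default of ten segments are our own assumptions, not settings taken from Voyant.

```python
import re

def word_trend(text, word, segments=10):
    """Count occurrences of a word in each of `segments` equal slices of the text."""
    tokens = re.findall(r"[a-z]+(?:'[a-z]+)?", text.lower())
    counts = [0] * segments
    target = word.lower()
    for i, token in enumerate(tokens):
        if token == target:
            # Map the token's position to the index of the slice it falls in.
            counts[i * segments // len(tokens)] += 1
    return counts

# Placeholder sample; paste in your own chosen text instead.
sample = "sound sound heard louder heard louder"
print(word_trend(sample, "sound", segments=2))
print(word_trend(sample, "louder", segments=2))
```

Comparing the lists for two words shows whether they cluster in the same parts of the text or in different ones.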

Comparing a text against a whole language

We've seen with the above tools how you can compare texts with each other, but if you want to compare a text to a sample of a whole language, then you will need to use a Corpus.

A corpus is a collection of texts or text extracts that have been put together to be used as a sample of a language or language variety. It consists of texts that have been produced in 'natural contexts' (published books, ordinary conversation, letters, newspapers, lectures etc), which means it mirrors natural language. A well-composed corpus can be used to answer questions about language use, such as:

Does 'wicked' generally mean 'good' or 'bad'? Has this meaning changed over time? Does the use differ between different kinds of text? Do different (kinds of) speakers use the word in the same way?

A reference corpus (created to be a balanced sample of a language variety) can be used as the basis of comparison between a text/genre and 'standard language'.

Specialised corpora can be used to examine or compare different language varieties, such as language from a particular area, covering a certain genre or text type, produced by particular language users, etc.

Corpora can be synchronic (covering a single time period) or diachronic (covering several time periods), consist of different media (written or spoken language) and be composed of different languages.

Annotated corpora have extra information added, usually linguistic information (part-of-speech, lemmata) or metadata (information about the material in the corpus, speakers/authors, situation, extra-linguistic information etc).

There are corpora that can be consulted online, via a custom-built interface, and ones that you explore with stand-alone tools that you install on your computer.

Exercise Two: British National Corpus

One of the most commonly used corpora for this paper is the British National Corpus. This is a large corpus of British English from the late 20th century.

It’s really important when you use a corpus that what you’re saying about language is relevant to the corpus you’re using. So if we were writing about how Wilfred Owen’s use of language compared to that of his contemporaries or predecessors, then this corpus would be no good. But if we were saying something about how his use of language differs from that of later generations, then this would be a useful resource. So just bear that in mind when you’re choosing a corpus to use. There are lots of them available, so just make sure the one you pick is relevant to what you’re trying to say.

You can find links to more corpora here.

We're going to show you some of the basic functions of the British National Corpus, so you can be guided through how it works and what it can show you.

  1. You can find the British National Corpus by searching on SOLO or clicking the direct link. When you search on SOLO you will want to click on the entry that says British National Corpus (alternative interfaces).

The BNC has two main interfaces that we're going to mention here. One is the BNCWeb and the other is the BNC-BYU Interface. They both use exactly the same set of data; it's just the interface that is different. We will demonstrate the BNCWeb here as it is a bit more straightforward for demonstration purposes, but there is guidance on the BYU interface below as well.

  1. First you will need to register for an account to use the BNCWeb. It is free, and only takes a minute. Register here.
  2. Once you have accessed the database, type in either the word Louder or the word you found was most commonly used in the text you're analysing.
  3. Press the Start Query button.
  4. Once your results have come up you can see some information at the top here about the results. It’s searched across over 4000 texts, and over 98 million words in the databases, and found over 580 instances of the word Louder.
  5. Press the button Show KWIC View. This stands for Key Word In Context and will place your word in the middle of the line, which just makes it easier to see where it is within the context of the sentence.
  6. Next we will look at what words commonly get used just before the word Louder.
  7. Click on the drop-down bar in the top right, select SORT and then press GO. This brings up another set of search parameters underneath, where you can select more options. We want to have a look at what words commonly appear just before the word Louder, so select 1 LEFT and press SUBMIT. That has now highlighted all the words one position to the left.
  8. Go back to the menu at the top and select collocations. This will show a list ordered by frequency, so you can see that the word most commonly used just before Louder is than, closely followed by grew.

This is an interesting exercise to try with authors whose use of language you think is innovative, to see whether that impression holds up.
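To make the ideas in this exercise more concrete, here is a rough sketch in Python of a Key Word In Context view and of counting the words that appear one position to the left of a search word. It only counts raw frequencies in a text you supply yourself. It does not connect to the BNC or reproduce how BNCWeb calculates its results, and the function names are our own.

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and pull out simple word tokens (an assumed, rough rule)."""
    return re.findall(r"[a-z]+(?:'[a-z]+)?", text.lower())

def kwic(tokens, keyword, width=3):
    """Return (left context, keyword, right context) for every hit."""
    lines = []
    for i, token in enumerate(tokens):
        if token == keyword:
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            lines.append((left, token, right))
    return lines

def left_neighbours(tokens, keyword):
    """Count the words found one position to the left of the keyword."""
    return Counter(
        tokens[i - 1]
        for i, token in enumerate(tokens)
        if token == keyword and i > 0
    )

# Placeholder sample; paste in your own chosen text instead.
tokens = tokenize("It grew louder than before, louder than ever.")
for left, word, right in kwic(tokens, "louder", width=2):
    # Right-align the left context so the keyword lines up in one column.
    print(f"{left:>20}  {word}  {right}")
print(left_neighbours(tokens, "louder").most_common(5))
```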

More Corpus Information

You can also use the BNC-BYU interface, which allows for more complex queries. To use this you must first register. Registering will give you access to 200 queries per day. You will need to use the license password when registering, which can be found on WebLearn. Before using the BYU Interface we would recommend reading through the five-minute tour.


American English

Old/Middle English

Language of the Internet

Next Steps

Now you've worked through the training session, you can scroll back to the top and have a look through the different tabs. You'll find sections on recommended eBooks, eJournals, Dictionaries, Primary Texts Online, Newspapers & Ephemera, Web Resources, Text Analysis Tools and Corpora.

