
  How to normalise forms in texts


What is normalisation?

Normalisation is a procedure whereby variant forms in a text or texts are replaced by a single form, the ‘normalised’ form, in order to (i) possibly make the texts more readable and (ii) make retrieval from them faster and more reliable. If you decide to normalise, it is advisable to do so on a copy of the texts of your corpus. This way you maintain the textual integrity of the originals while using the normalised versions for retrieval tasks and perhaps for easier reading (if the originals have a lot of idiosyncratic variation, as frequently happens with historical texts).

Normalisation can be carried out with the utility Corpus Presenter Text Tool. You need two things: a text or texts to normalise and a list of normalisations to perform. To test this function, download the prologue to Chaucer’s Canterbury Tales and the list of normalisations provided with this text (see the following table).

   Download test files for normalisation (and tagging) (size: 18 KB)


Normalised form    Set of variants to be normalised
WHEN               whan
WHICH              which; whiche
YOU                you; yow
THEY               they; hi
THEM               them; hem
FULL               fful; ful
HAD                hadde
HAS                hath
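
Outside the tool, the effect of a normalisation list like the one above can be sketched in a few lines of Python. This is only an illustration and not a description of how Corpus Presenter Text Tool works internally: the names NORMALISATIONS, build_pattern and normalise are hypothetical, and the whole-word, case-insensitive matching is an assumption made for this sketch.

```python
import re

# Normalisation list taken from the table above: normalised form -> variants.
NORMALISATIONS = {
    "WHEN": ["whan"],
    "WHICH": ["which", "whiche"],
    "YOU": ["you", "yow"],
    "THEY": ["they", "hi"],
    "THEM": ["them", "hem"],
    "FULL": ["fful", "ful"],
    "HAD": ["hadde"],
    "HAS": ["hath"],
}


def build_pattern(normalisations):
    """Return a compiled regex for all variants and a variant -> form lookup."""
    lookup = {
        variant.lower(): form
        for form, variants in normalisations.items()
        for variant in variants
    }
    alternation = "|".join(re.escape(v) for v in sorted(lookup, key=len, reverse=True))
    # \b restricts matches to whole words, so "hi" does not match inside "his".
    # Assumption: matching ignores case ("Whan" is treated like "whan").
    pattern = re.compile(r"\b(" + alternation + r")\b", re.IGNORECASE)
    return pattern, lookup


def normalise(text, normalisations=NORMALISATIONS, uppercase=True):
    """Replace every variant in text by its normalised form.

    With uppercase=True the normalised forms stand out in capitals;
    with uppercase=False they are written in lower case throughout
    (any original capitalisation of the variant is not preserved).
    """
    pattern, lookup = build_pattern(normalisations)

    def replace(match):
        form = lookup[match.group(1).lower()]
        return form if uppercase else form.lower()

    return pattern.sub(replace, text)


if __name__ == "__main__":
    print(normalise("Whan that Aprille with his shoures soote"))
```

Calling normalise on a copy of each text, rather than on the original file, keeps the originals intact, in line with the advice given earlier.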

The screen looks as follows when you activate the normalisation function in Corpus Presenter Text Tool. Click on the button Proceed to perform this action.

When a text has been normalised, it should look something like the following. You can of course decide if you want to have normalised forms in UPPERCASE and in RED COLOURING for highlighting purposes.