What is normalisation?
Normalisation is a procedure whereby variant forms in a text or texts are replaced by a single form, the ‘normalised’ form, in order to (i) make the texts more readable, where this is needed, and (ii) make retrieval from them faster and more reliable. If you decide to normalise, it is advisable to do so on a copy of the texts of your corpus. This way you maintain the textual integrity of the originals while using the normalised versions for retrieval tasks and perhaps for easier reading (if the originals have a lot of idiosyncratic variation, as frequently happens with historical texts).
Normalisation can be carried out with the utility Corpus Presenter Text Tool. You need two things: a text or texts to normalise and a list of normalisations to perform. To test this function, download the prologue to Chaucer’s Canterbury Tales and the list of normalisations (see following table) which is provided with this text.
The screen looks as follows when you activate the normalisation function in Corpus Presenter Text Tool. Click on the button Proceed to perform this action.
When a text has been normalised, it should look something like the following. You can of course decide if you want to have normalised forms in UPPERCASE and in RED COLOURING for highlighting purposes.
How to normalise forms in texts
Download test files for normalisation (and tagging) (size: 18 KB)
Normalised form | Set of variants to be normalised
WHEN            | whan
WHICH           | which; whiche
YOU             | you; yow
THEY            | they; hi
THEM            | them; hem
FULL            | fful; ful
HAD             | hadde
HAS             | hath
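To make the idea of a normalisation list more concrete, the following is a minimal Python sketch of how the table above could be applied to a text. It is not how Corpus Presenter Text Tool works internally; the names NORMALISATIONS, build_pattern and normalise are hypothetical, and the sketch assumes that variants are matched as whole words, regardless of case, and replaced by the uppercase normalised form.

```python
import re

# Normalised form -> set of variants, copied from the table above.
NORMALISATIONS = {
    "WHEN": ["whan"],
    "WHICH": ["which", "whiche"],
    "YOU": ["you", "yow"],
    "THEY": ["they", "hi"],
    "THEM": ["them", "hem"],
    "FULL": ["fful", "ful"],
    "HAD": ["hadde"],
    "HAS": ["hath"],
}


def build_pattern(table):
    """Return a compiled regex and a variant -> normalised-form lookup."""
    lookup = {
        variant.lower(): normalised
        for normalised, variants in table.items()
        for variant in variants
    }
    # Try longer variants first so that 'whiche' is preferred over 'which'.
    alternatives = sorted(lookup, key=len, reverse=True)
    # \b restricts matches to whole words (assumption: 'hem' must not
    # match inside 'them', nor 'hi' inside 'his').
    pattern = re.compile(
        r"\b(" + "|".join(re.escape(a) for a in alternatives) + r")\b",
        re.IGNORECASE,
    )
    return pattern, lookup


def normalise(text, table=NORMALISATIONS):
    """Replace every variant in text by its normalised form."""
    pattern, lookup = build_pattern(table)
    return pattern.sub(lambda m: lookup[m.group(1).lower()], text)


if __name__ == "__main__":
    # Work on a string read from a copy of the text, never the original.
    print(normalise("Whan that Aprille with his shoures soote"))
    # prints: WHEN that Aprille with his shoures soote
```

Because the function returns a new string, the original text is left untouched, which fits the advice to normalise only a copy of the corpus.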