Corpus processing
Q: How is Corpus Presenter organised?
A: It comes as a set of programs, the main one of which is called Corpus Presenter and which gives the name to the entire suite. You install it from the Corpus Presenter website by downloading the relevant EXE file to be found by clicking on the link Download Version 2025 on the opening page of the website, then starting the EXE file (in administrator mode via the right-button of the mouse). At the end of the installation procedure you have a shortcut on the Windows desktop called Corpus Presenter. The supplied test corpus can be viewed with Corpus Presenter straight away as can A Corpus of Irish English, originally supplied on the CD-ROM accompanying the 2003 book and now available via the relevant link on the website. The current version is Corpus Presenter 2025, which is a major overhaul of the original program and which supersedes all previous updates. These are the programs of the suite:
Corpus Presenter
The main program of the current suite is called Corpus Presenter. With it you can carry out all the processing tasks you may want with any corpus. If you do not have a corpus you can still load texts directly and carry out retrieval operations. To create the file necessary to process a corpus with Corpus Presenter, use the supplied utility Corpus Presenter Make Tree or just load any files directly from disk.
Text editor
Additional textual functions are to be found in Corpus Presenter Text Tool. You can tag and/or normalise corpus texts, something which makes historical texts easier to handle during retrieval tasks. The program will also carry out lexical clustering analysis on any text files, thus revealing recurring word patterns (this is also possible in the main program). You can also collate different inputs to a single output.
Word Processor
A word processor, unlike an editor, is intended for the processing of formatted output, e.g. when preparing a text for printing. The current word processor has many formatting options concerning document appearance which go beyond those of the text editor. It can handle all common file types, including Microsoft Word files.
File Manager
A file manager is necessary for all the house-keeping tasks which one has to carry out on a computer. This utility has many special features such as incremental backup, which is useful when dealing with large numbers of texts as with a corpus.
Find Text
Normally when compiling a corpus one is dealing with several texts and it may often be necessary to search for strings across the entire group or even through a complete drive. The present program will perform this task for you. A range of options makes it a flexible tool for text retrieval. It can be used independently of the main program and is especially useful when compiling a corpus as it permits find and replace operations.
Find Files
This utility allows you to find any file or files anywhere on your computer by just entering part of a name. A list of files is returned in a window and you can copy, move, delete, view, etc. any of the files found.
Internet File Editor
This program will enable you to edit internet files in an interactive environment. You can edit HTML files directly as normal text files, preview the output with an internet browser and edit text in a Wysiwyg environment as well. There is a range of options - including flexible text macros and a database of code - to make HTML file processing easy.
Make Tree
In order for Corpus Presenter to process any set of texts it must have access to a small file called a data set which contains a list of the files of a corpus and labels for the nodes in a tree it displays. In addition, a data set file contains information about the general appearance of the corpus on screen. To design your own data sets or edit others use this utility. The files it creates and/or processes have the extension .CPD.
Q: What is the quickest way to start?
A: The best way to start is by trying out the supplied test corpus. This contains a variety of data types: texts, databases, images and sound files. Select the dataset file TEST_CP.CPD when prompted to load a file at the beginning. If the initial help text is displayed when you first load Corpus Presenter then you can click on the button Select an existing dataset file (extension: .CPD). The dataset file TEST_CP.CPD contains references to all the supplied data files which are then displayed in the form of a structured tree. By clicking on the node of the tree you can view the file which is associated with this node.
Q: Can I use my own files directly with Corpus Presenter?
A: Yes. When the dialogue window Open a data set appears at the beginning or when you choose to work with a new corpus from within Corpus Presenter (press Ctrl-O for this) you click on the button Load text file(s) directly. Then all that is required is that you select the files you wish to use and press Shift-F12. The files are then displayed with the names of the files shown in the window on the left. Any ASCII, RTF or MS Word files can be loaded as can HTML (Internet) files.
Q: Can I make a corpus with Corpus Presenter?
A: Yes. There are basically three ways to do this: (i) generate a corpus from a selection of files on the directory listing level, (ii) make a corpus from a branch of the hard disk, again on the directory listing level (both these options are reached by clicking on the button Quick make when you have first chosen Ctrl-O Open dataset). (iii) Design a corpus using the supplied utility Corpus Presenter Make Tree. If you choose the latter method, then it is sensible to make a copy of the supplied test corpus TEST_CP.CPD and then alter this to suit your needs.
Q: Can I process files downloaded from the internet with Corpus Presenter?
A: Yes. There are two options here. The first is to load any HTML (Internet) file or files directly into Corpus Presenter and start working. However, if the texts are very large and you want to work intensively with them, it is sensible to convert them to RTF or text files first so that you can avail of the Fast text retrieval mode (select the relevant option in the Settings menu, press F3 for this). To convert files, just click on the Convert files button on the directory lister level. Files can be converted to and from the following formats: HTML, RTF, MS Word and ASCII.
Q: How can I view a corpus?
A: The default display mode uses a tree which may contain nodes on several embedded levels. Each node on the tree has a label and a file which is associated with it. Clicking on the label leads to the associated file being displayed. A second mode is also available. This is the list mode in which all files are listed in the order in which they occur in the tree. The advantage of this mode is that you can select any file by simply clicking on the tick box on the left of any line. You can then demand that a retrieval operation apply to the group of checked files in the list. If you wish you can also derive a sub-corpus from the checked files. Should you keep to the tree display then retrieval can apply to all files of a corpus, just those in a branch of the tree or simply the currently selected file.
Q: Can I generate word lists with Corpus Presenter?
A: Yes. This is done by choosing the option Search, Make a Word List (shortcut: Ctrl-W) or clicking on the tool button with the small w’s. You can generate a unique word list (a list of types in a text) either for the entire text or for a selection of input forms which you supply in a word list created with a text editor such as that supplied with the Corpus Presenter suite. Word lists can be made for a single text or more than one by specifying the range of texts to be examined for list generation. Once generated, a word list can be stored to disk or copied to the Windows clipboard.
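The idea of a unique word list (types rather than tokens) can be illustrated outside the program with a few lines of Python. This is only a sketch and not the code Corpus Presenter uses: the function name word_list and the tokenising pattern are assumptions made for the example.

```python
import re
from collections import Counter

def word_list(text):
    """Return the unique words (types) of a text with their frequencies."""
    # Assumption: a word is a run of letters, optionally with an internal apostrophe.
    tokens = re.findall(r"[a-z]+(?:['’][a-z]+)*", text.lower())
    return Counter(tokens)

sample = "The cat saw the dog. The dog didn't see the cat."
types = word_list(sample)
print(sorted(types))          # alphabetical list of types
print(types.most_common(3))   # the three most frequent types
```

A list of this kind can then be written to a file, which corresponds roughly to storing a word list to disk.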
Q: How do I search for strings with Corpus Presenter?
A: The first thing is to load a corpus, say the test corpus referred to above. Then you can either choose the option Search, Blitz search through files (shortcut: Ctrl-B), Quick search (shortcut: Ctrl-B) or the option Advanced search (shortcut: Ctrl-A) or click on the tool button with either the simple magnifying glass, the magnifying glass over a sheet of paper or the binoculars. With the first option a small window opens and you can type anything (string, word or phrase) which you wish to look for. The returns can be stored in the Windows clipboard and retrieved via the Paste option in any text editing software, including the internal editor or the storage area. The second option is similar but more flexible, e.g. you can use wildcards and the returns can be stored in a list which you can use to jump to the text position where the finds were made. In the third case you shift to the retrieval level. You see the current text, and at the top of the screen various options pertaining to retrieval operations are available. The most important one to begin with is that labelled Parameters which opens a window in which you can specify the various parameters for a search such as the strings to be located, the range of texts, the nature and expected distribution of strings in a text, etc. There are a large number of options available on the advanced search level, e.g. you can specify exactly how the retrieval information returned by Corpus Presenter is arranged. It is important to take time to explore the options put at your disposal here in order to grasp the real potential of the program. You can also search through databases, assuming that there is at least one in the corpus you are currently processing.
Q: Can I use wild cards during searches?
A: Yes. When generating a word list or when locating strings, the wildcards ? (question mark, stands for one character) and * (asterisk, stands for more than one character) are legal. For instance, you could search for do* which would return do, does, don’t, doing, done in a typical modern English text. An entry like he?d would probably return head and heed, again in a modern English text, whereas he*d could return heaved, heard as well because the asterisk can stand for more than one character.
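The behaviour of the two wildcards can be tried out with Python's standard fnmatch module, which uses the same symbols. This is an illustration only: in fnmatch the asterisk matches any run of characters, including none, and Corpus Presenter's own matching may differ in detail. The helper name match_wildcard and the word list are invented for the example.

```python
from fnmatch import fnmatchcase

def match_wildcard(pattern, words):
    """Return the words matching a pattern with ? and * wildcards."""
    return [w for w in words if fnmatchcase(w, pattern)]

words = ["do", "does", "doing", "done", "day",
         "head", "heed", "heard", "heaved"]

print(match_wildcard("do*", words))   # do, does, doing, done
print(match_wildcard("he?d", words))  # head, heed
print(match_wildcard("he*d", words))  # head, heed, heard, heaved
```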
Q: Can I search for collocations in a corpus?
A: Yes. On either the Quick search or the Advanced search level you can choose to rearrange returns from a search in such a way that up to eight words before and after the search string are arranged in a grid which can be sorted on any field by just clicking on its column. In addition the number of times a certain word occurs before or after a search string is shown and percentages are given. In both search modules there is a command Determine collocations which will initiate this process.
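The principle behind such a collocation grid can be sketched in Python as follows. This is a simplified illustration, not the program's own procedure: the function name collocates, the span of two words (the program allows up to eight) and the sample sentence are assumptions.

```python
from collections import Counter

def collocates(tokens, node, span=2):
    """Count the words found at each position before and after `node`."""
    table = {}
    for i, tok in enumerate(tokens):
        if tok != node:
            continue
        for offset in range(-span, span + 1):
            j = i + offset
            if offset == 0 or not 0 <= j < len(tokens):
                continue  # skip the node itself and positions outside the text
            table.setdefault(offset, Counter())[tokens[j]] += 1
    return table

tokens = "the old man saw the old dog and the young man".split()
table = collocates(tokens, "man", span=2)
hits = tokens.count("man")

# One line per position and word: offset, word, count, share of all finds.
for offset in sorted(table):
    for word, n in table[offset].most_common():
        print(f"{offset:+d}  {word:<6} {n}  {100 * n / hits:.0f}%")
```

Negative offsets are positions before the search string, positive ones positions after it.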
Q: Can I do complex searches with more than one string?
A: Yes. The Advanced search level provides the most sophisticated options in this respect. It allows you to search for syntactic frames, i.e. String1 followed by String2 with a specifiable amount of material in between. Furthermore, you can say whether String1 or String2 are entire words, the beginning or end of a word or contained anywhere in a word.
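A syntactic frame of this kind can be approximated with a regular expression. The sketch below is an illustration under stated assumptions and not the program's implementation: find_frames and max_gap are hypothetical names, the intervening material is measured in words, and both strings are treated as entire words only.

```python
import re

def find_frames(text, first, second, max_gap=3):
    """Find `first` followed by `second` with at most `max_gap` words between."""
    pattern = re.compile(
        rf"\b{re.escape(first)}\b"      # String1 as an entire word
        rf"(?:\W+\w+){{0,{max_gap}}}?"  # up to max_gap intervening words
        rf"\W+{re.escape(second)}\b",   # String2 as an entire word
        re.IGNORECASE,
    )
    return [m.group(0) for m in pattern.finditer(text)]

text = ("She has never been there. He has been away. "
        "It has not yet really been done.")

print(find_frames(text, "has", "been", max_gap=2))
print(find_frames(text, "has", "been", max_gap=3))
```

With a gap of two words the third sentence is not returned because three words stand between has and been; raising the gap to three includes it.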
Q: Can I search through only part of a corpus?
A: Yes. All retrieval functions allow you to specify whether the search is to apply to 1) all files in the corpus, 2) only the current branch, 3) from the current file to the end of the corpus, 4) only the current file and 5) checked files. The last option is the most flexible as it allows you to mark files in a corpus (irrespective of their position in the tree) and only have the search carried out on these checked files.
Q: Can I access Cocoa header parameters when searching through a text?
A: Yes. If you are using, say, the Helsinki corpus then on the text retrieval level you can specify certain values for certain Cocoa header parameters which must apply for texts to be included in a search. For instance you might wish to search through only those texts which are prose translations or verse by female writers. In such cases you would demand that the appropriate values for the relevant Cocoa parameters be found in a text before it is searched through.
Q: How do I deal with spelling variants in a corpus?
A: On the Quick search or Advanced search level you can specify that the search is to use an input list and not a single string. An input list can consist of any strings or words on the lines of a text. The search is carried out by examining the text for each of the items in the input file. This file need not contain just spelling variants; it can be used for any number of items which you wish to treat as a group, say a set of pronouns which you are interested in.
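Searching with an input list amounts to running the same search once for each item. A minimal Python sketch, assuming whole-word, case-insensitive matching (the function name search_with_input_list and the sample variants are invented for the example):

```python
import re

def search_with_input_list(text, items):
    """Return the character positions at which each list item occurs."""
    hits = {}
    for item in items:
        pattern = re.compile(rf"\b{re.escape(item)}\b", re.IGNORECASE)
        hits[item] = [m.start() for m in pattern.finditer(text)]
    return hits

# In practice the items would be read from the lines of an input file.
variants = ["though", "thogh", "thouh"]
text = "Thogh he was old, though not weak, he wente forth."

for item, positions in search_with_input_list(text, variants).items():
    print(item, positions)
```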
Q: Can I view texts with retrieval returns?
A: Yes. All retrieval functions have an option Goto text which will cause Corpus Presenter to jump to the position in the text where the current retrieval return was found. By these means you can check up on the context from which a return is derived.
Q: Can I have percentages for returns in texts?
A: Yes. If you select the option Only count finds on the Quick search or the Advanced search level then the program will count the returns and work out the percentage of all words in a text they represent. This might well be useful if you wish to know how the distribution of a certain form or structure varies across different texts, be they different in type or from different periods.
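The calculation involved is simple: the number of finds divided by the number of words in the text, times one hundred. A short Python sketch, in which the function name percentage_of_words and the definition of a word as a run of letters or digits are assumptions:

```python
import re

def percentage_of_words(text, target):
    """Return the finds, the total word count and the percentage of finds."""
    words = re.findall(r"\w+", text.lower())
    finds = words.count(target.lower())
    share = 100.0 * finds / len(words) if words else 0.0
    return finds, len(words), share

finds, total, share = percentage_of_words(
    "The cat and the dog and the bird", "the")
print(f"{finds} of {total} words ({share:.1f}%)")
```

Running the same function over several texts gives figures which can be compared across text types or periods.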
Q: Can I generate a reverse dictionary from input texts?
A: Yes. The text statistics window (accessible via Statistics for texts in the Search menu) includes a list in which the unique words of a text can be deposited in reverse order. You can determine how this list (as a whole or in part) is to be stored, to disk or the Windows clipboard.
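A reverse dictionary is usually understood as a word list sorted from the end of each word, so that words with the same ending stand together. Assuming that this is what is meant here, the principle can be shown in Python (reverse_dictionary is a name chosen for the example):

```python
def reverse_dictionary(words):
    """Sort the unique words by their spelling read from right to left."""
    return sorted(set(words), key=lambda w: w[::-1])

words = ["walking", "talked", "sing", "walked", "king", "talking"]
for word in reverse_dictionary(words):
    print(f"{word:>10}")  # right-aligned so that the endings line up
```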
Q: What can I do with output lists from search tasks?
A: Frequently the output from a particular task within Corpus Presenter is a list which can be copied to the Windows clipboard or saved directly to disk. Such a list can further be processed with the program List Processor. This will allow you to sort lists, create a unique list (i.e. a list of types from a list of tokens), combine two input lists to a single output one, etc. An output list, either directly from Corpus Presenter or filtered through List Processor, can be imported into a database. This would be useful when doing lexical work with corpus files as a database is a kind of dictionary.
Returns can be stored to disk for re-loading at some later point. This option applies on both the Quick search and the Advanced search levels. The advantage here is that to examine returns you do not have to carry out a search each time. Just re-load returns from a previous work session and you can continue where you left off.
Q: Is lexical cluster analysis possible with the Corpus Presenter suite?
A: Yes. The program Corpus Presenter Text Tool will allow you to do this. The principle is quite simple. You load a text or texts and then specify the number of words per cluster (from 1 to 8). The program then combs through the texts, gathers every cluster of that length and orders the clusters alphabetically or by frequency. This procedure can be useful when trying to determine a writer’s style as typical combinations of words become obvious in the analysis. There is also a lexical cluster option in the main program Corpus Presenter (in the Search menu; shortcut: Ctrl-U). It functions in a similar manner, but in this case you can specify any number of files to be used for cluster generation.
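The principle can be illustrated in Python: every run of a given number of consecutive words is collected and counted. This is a sketch of the general technique (often called n-grams), not the program's own code; the function name clusters and the sample sentence are assumptions.

```python
from collections import Counter

def clusters(tokens, size):
    """Count every sequence of `size` consecutive words."""
    grams = zip(*(tokens[i:] for i in range(size)))
    return Counter(" ".join(g) for g in grams)

tokens = "to be or not to be that is the question".split()
pairs = clusters(tokens, 2)

print(pairs.most_common(3))  # ordered by frequency
print(sorted(pairs))         # ordered alphabetically
```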
Q: Can I generate a concordance with the Corpus Presenter suite?
A: Yes. Corpus Presenter offers many options to make concordances by searching for words and having them arranged and highlighted in their contexts (in the Search menu; shortcut: Ctrl-F3). You can save the results of such actions to disk and, for instance, edit them later with the text editor in the present suite.
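What a concordance line looks like can be shown with a small keyword-in-context routine in Python. The layout, the function name concordance and the context width of three words are assumptions made for the illustration and do not reproduce the program's output.

```python
def concordance(tokens, keyword, width=3):
    """Return one line per find with `width` words of context on each side."""
    lines = []
    for i, tok in enumerate(tokens):
        if tok.lower() == keyword.lower():
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            # Right-align the left context so that the keywords line up.
            lines.append(f"{left:>25} [{tok}] {right}")
    return lines

tokens = "to be or not to be that is the question".split()
for line in concordance(tokens, "be"):
    print(line)
```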
Q: How do I collect and prepare texts?
A: The best way is to use the supplied program, Corpus Presenter Text Tool, a powerful editor with a whole range of useful functions including many shortcuts which save you from entering repetitive text. The program can handle plain ASCII texts, i.e. those without formatting, and Rich Text Format files in which attributes like bold or italic are retained. Many additional functions are also available, some of which are useful when preparing corpus texts. Corpus Presenter Text Tool allows you to tag texts, manually or automatically, and provides many analytical tools for extracting information from texts. The program can handle large files easily and so is useful when compiling comprehensive text corpora.
Q: Can I convert files from one format to another?
A: Yes. Files can also be converted on the fly on the directory lister level of the main program Corpus Presenter. Just click on the button Convert files and specify the type of conversion you require. Any files eligible for the conversion, i.e. which match the input type, will be listed and you can carry out the operation.
Q: Can I globally change the attributes for files?
A: Yes. This can be done with the Corpus Presenter File Manager. There are many situations in which this might be necessary; one is where you copy files from a CD-ROM onto hard disk. The copies may well still be read-only and this attribute needs to be removed before you can alter the files in question. Select files on the right in the directory listing and then choose the option File attributes for disk / display in the File menu.
Q: Does Corpus Presenter provide macro functions to cut down on repetitive tasks?
A: Yes. The Corpus Presenter Text Tool has an option group Macro in the menu system which offers you a number of functions which will help you avoid unnecessary typing of text which is required repeatedly. There is a text macro function which allows you to have up to 256 pre-defined strings at your disposal. Then there is the Alt-macro list which will associate user-specified strings to the key combinations Alt-0 through Alt-9. There is a further text array option and a small strings function along with an array of 4 text buffers at your disposal. All these options can be exploited gainfully to cut down on the typing of text. Try them out and see what they do.
Q: Can I tag texts with Corpus Presenter?
A: Yes. There is a special function for this in the Corpus Presenter Text Tool. Choose the option Tools, then either Interactive Tagging or Automatic Tagging. In either case a window appears and you enter the information necessary for tagging. This can be done automatically or manually, can involve words or strings, be case-sensitive and avail of any user-specified list of tags and input forms to be tagged. If you choose to tag a text manually, then you may also edit the context of a tag interactively and store it back to its original position.
Q: If I have several texts, can I link them into a single one?
A: Yes. The easiest way to do this is to load the files, one after the other in the order in which you want them with the Corpus Presenter Text Tool (Key: Alt-Z or Shift-F5 for Tools, Insert text). You then save the new composite file to disk - under a different name from any of the individual files - and use this with Corpus Presenter, for instance by loading the file directly (Key: Ctrl-L).
The supplied Corpus Presenter File Manager provides a function for building a composite file from several input files (any selected in a directory listing). There is also a mirror function with which you can extract the component files of a composite file if you wish to do this at some later point.
Q: Can I make my own corpus set with Corpus Presenter?
A: Yes. There is a supplied program Corpus Presenter Make Tree with which you can either create a new corpus control file or edit an existing one. A control file contains the information necessary to display the contents of a corpus in tree form within Corpus Presenter. Try altering the supplied file TEST_CP.CPD which controls all the supplied data files packaged with Corpus Presenter.
Q: Can I normalise texts with Corpus Presenter?
A: Yes. The program Corpus Presenter Text Tool allows you to normalise any set of texts quickly and easily. You specify the set of variants which are to be replaced by a single form and repeat the process for as many replacements as you require. This information is stored on disk and can be retrieved later. Two texts, say an original and a normalised one, can be collated to a single text if you wish. You can also carry out lexical clustering analysis with Corpus Presenter Text Tool, something which you might want to experiment with to see what recurrent word patterns are to be found in a set of texts, for instance when studying the style of an author. Lexical clustering is also possible with the main program (Key: Ctrl-U).
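The idea of replacing a set of variants by a single form can be sketched in Python. This is an illustration only: the function name normalise, the structure of the replacement table and the sample variants are assumptions, and the sketch does not preserve the capitalisation of the original.

```python
import re

def normalise(text, replacements):
    """Replace every listed variant by its normalised form."""
    # Turn {normal form: [variants]} into a lookup from variant to normal form.
    lookup = {v.lower(): norm
              for norm, variants in replacements.items()
              for v in variants}
    if not lookup:
        return text
    # Longer variants are tried first so that they are not cut short.
    alternatives = sorted(lookup, key=len, reverse=True)
    pattern = re.compile(
        r"\b(" + "|".join(map(re.escape, alternatives)) + r")\b",
        re.IGNORECASE,
    )
    return pattern.sub(lambda m: lookup[m.group(0).lower()], text)

replacements = {"though": ["thogh", "thouh"], "went": ["wente"]}
text = "Thogh he was old he wente forth."
print(normalise(text, replacements))
```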
Q: Can I collate texts with Corpus Presenter?
A: Yes. Again Corpus Presenter Text Tool provides a collation function with which you can combine two texts on a line by line basis, thus checking on differences between two versions of an original, for example. Collation can be useful when combining a normalised version with an unaltered version, for instance when processing historical texts with much spelling variation.
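Line-by-line collation can be pictured as interleaving the lines of two versions and marking those which differ. The following Python sketch shows one possible layout; the function name collate, the labels O and N and the asterisk marker are invented for the example and are not the program's output format.

```python
from itertools import zip_longest

def collate(original, normalised):
    """Interleave two versions line by line, marking lines which differ."""
    out = []
    pairs = zip_longest(original, normalised, fillvalue="")
    for n, (a, b) in enumerate(pairs, start=1):
        marker = " " if a == b else "*"
        out.append(f"{n:>4}{marker} O: {a}")
        out.append(f"{n:>4}{marker} N: {b}")
    return out

original = ["Thogh he was old", "he wente forth"]
normalised = ["Though he was old", "he wente forth"]
for line in collate(original, normalised):
    print(line)
```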
Q: Can I carry out global Find and Replace operations?
A: Yes. The program Corpus Presenter Find Text will allow you to do this. From the main level of this program you can specify a string to find and one to replace, using various parameters to ensure correct replacements.
Q: Can I use keywords in a corpus and then collect them?
A: Yes. Corpus Presenter Text Tool includes a function which will collect any strings which are delimited by a specifiable character. Say you insert keywords (or comments or text markers of any kind), delimited by < and >, then the program can collect all these and deposit them in a list from which you can copy them to the Windows clipboard or store them directly to disk.
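Collecting delimited strings is easy to illustrate in Python with a regular expression. The function name collect_delimited and the sample sentence are assumptions; the delimiters are parameters so that characters other than < and > can be used.

```python
import re

def collect_delimited(text, opener="<", closer=">"):
    """Return every string enclosed between the opener and the closer."""
    pattern = re.compile(
        re.escape(opener) + r"(.*?)" + re.escape(closer), re.DOTALL)
    return pattern.findall(text)

text = "The <battle> began at <dawn>, says the <chronicle>."
print(collect_delimited(text))
```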
Q: Can I check corpus texts for the integrity of their coding?
A: Yes. Corpus Presenter Text Tool again has a function which will examine any text and check whether embedded codes, such as comment markers, are opened and closed correctly. It will also check on whether a text contains only a user-specifiable set of legal characters, e.g. the lower ASCII area and a set of special characters for Old and Middle English, for example.
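Both checks can be sketched in Python. The example assumes that codes are not nested and that the set of legal characters is supplied by the user; check_coding and its parameters are hypothetical names, not part of the program.

```python
def check_coding(text, opener="<", closer=">", legal=None):
    """Report unbalanced delimiters and characters outside the legal set."""
    problems = []
    open_at = None  # position of a code which has been opened but not closed
    for pos, ch in enumerate(text):
        if ch == opener:
            if open_at is not None:
                problems.append((open_at, "opened but not closed"))
            open_at = pos
        elif ch == closer:
            if open_at is None:
                problems.append((pos, "closed but not opened"))
            open_at = None
        if legal is not None and ch not in legal:
            problems.append((pos, f"illegal character {ch!r}"))
    if open_at is not None:
        problems.append((open_at, "opened but not closed"))
    return problems

print(check_coding("a <comment> b"))  # no problems found
print(check_coding("a <comment b"))   # opener at position 2 never closed
print(check_coding("þa", legal=set("abc")))
```

Each problem is reported with its character position so that the place in the text can be located.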
Q: How can I keep track of alterations in corpus texts?
A: There are three basic ways of doing this. Either you mark corrections / additions using the Red marking function in the Format option group of Corpus Presenter Text Tool or you mark stretches of text as Protected which means that they cannot be changed until released again. The latter function can be useful when excluding certain parts of a text not only from alteration by a later user but also from certain functions like find and replace. The third method is to activate the Track changes in text function (press Ctrl-Shift-C) of the text editor in question. Then deleted text is represented as red and strikethrough and newly entered text is shown as blue and underlined. Note that to avail of these options, texts must be encoded in Rich Text Format and be processed by either Corpus Presenter Text Tool or Corpus Presenter Word Processor (the latter also allows the processing of MS Word files with the extension .DOC(X)).
Q: How can I view the structure of corpus texts?
A: When preparing texts with Corpus Presenter Text Tool or Corpus Presenter Word Processor you can enter any symbol you like which is to serve as a text marker followed by a number which represents the level in a tree hierarchy which this marker is to have (from 1 to 6). The Outline and the Table of Contents functions will collect these markers and display them in an Explorer-type tree. You can click on a tree node to jump to the text marker in question. You can also save the tree to disk.
Q: Can I arrange a corpus for the internet?
A: Yes. There is a special internet file editor, Corpus Presenter Net Editor, which contains a whole range of powerful options for quick and easy design of webpages. There are many ready-to-use functions (in HTML and Java) available in a code database which you can access from within the program and insert into your web page. With Corpus Presenter Net Editor webpages can be tested without uploading them to a server; this is done as a final step before disseminating information via the internet.
Q: How can I interface with Windows from within the Corpus Presenter suite?
A: In all programs you can open a Windows Explorer window by choosing the relevant option in the Miscellaneous menu or just pressing Shift-Ctrl-F9 or Alt-X. When you exit the explorer window you are automatically returned to the program from which you started.
Q: How can I interface with the internet from within the Corpus Presenter suite?
A: All the major programs have an option to load your own browser (Key: Alt-B). In addition there is a list of websites with which you can maintain the addresses of the sites you visit most commonly.
Q: What happens if I have lost a file on my hard disk and don’t remember the name?
A: You simply load the program Corpus Presenter Find Files and specify some piece of text, a string or word, which you know occurs in the text and let the program do the searching for you. It can operate in different ways, scanning entire drives or only a certain branch, or using exclusion lists to make sure that it does not examine certain file types such as programs, etc. The moment it finds the string you entered, the file containing it is loaded with an internal viewer and you can decide if this is what you are looking for; you continue until the right find is made.
Text replacements can also be made with Corpus Presenter Find Text across several texts. This can be very useful if you decide to change some item of text which is scattered across a group of texts and you do not know exactly where. Just let the current program find it for you and you can decide if you want to change it.
Q: Can I use the Corpus Presenter suite to make backups of my files?
A: Yes. There are two ways of doing this. Load the program Corpus Presenter File Manager and then choose an option from the Backup menu. This will initiate a dialogue in which you specify a date filter (if required), the drive and directory from which the backup is to start and enter a possible exclusion list, if required. This option is especially useful when making backups of corpus texts to another hard disk or to a USB drive. The file manager has many options which are useful in the area of file management and security so you are strongly advised to try it out and see what it can do.
And don’t forget...
There is a 292-page book accompanying Corpus Presenter (published by John Benjamins, Amsterdam, 2003) which contains precise descriptions of all commands in all programs and much additional information which is relevant to the tasks users will wish to perform with the current program suite.
Update information on Corpus Presenter can be found on the internet at the following address: http://www.uni-due.de/CP (the dedicated website for Corpus Presenter).
Raymond Hickey
April 2025