
  Structure of a data set file


A data set file (referred to below as a control file) is a small text file which contains all the information needed for displaying the files of a corpus correctly in tree form. For each node of a tree, three pieces of information are specified. In addition, there are 11 parameters which are set at the beginning of the file and which determine the location of the corpus files and the manner in which they are displayed.
      A control file for the Corpus Presenter is a plain ASCII file and can be edited with the Corpus Presenter Text Editor. Be careful not to save this in RTF or HTM format as it would no longer function properly as a control file. Note that any line in a control file which begins with a semicolon is regarded as a comment and ignored.
       It is not advisable to alter the text file references unless you know exactly what you are doing. If you alter anything, always save the current control file under a new, temporary name. Then check that it functions properly before deciding to let it replace the original version. There follows a brief description of the 11 parameters at the beginning of a control file.

Remember:
The easiest way to design/edit a tree is to take one of the supplied data set files (either from the test corpus or from A Corpus of Irish English) and edit this with Corpus Presenter Make Tree. What follows here is technical information for those who want to know about the internal structure of data set files.

1)       Directory where all and only the data files are located. This is very important for a corpus like the Helsinki Corpus, where the text files have no extension and there is no formal means of identifying which files in a given directory belong to the corpus and which do not. Where no extension is used, the default assumption is that all files in the specified data directory are corpus files. If there is no path before a file name then the data directory is used; a path given for a file overrides this directory. A partial path can also be used, in which case it is assumed to refer to subdirectories below the primary corpus data directory. If you wish to use the current directory as the data directory, i.e. the directory in which the present file is located, then just enter a single dollar sign: $

2)      Wallpaper file (full path necessary)

3)      Name of manual file (full path necessary)

4)      'Frequently Asked Questions', FAQ file (full path necessary)

5)      'Fact Sheet', FACT file (full path necessary)

6)      Font for text display (legal names: Arial, Courier, Courier New, Garamond, Letter Gothic, Line Printer, Modern, Tahoma, Terminal, Times New Roman).

7)      Size of font (legal values 6, 7, 8, 9, 10, 11, 12, 14, 18, 20, 24). The font name and size given here only apply when the text files are standard ASCII files. If they are in either RTF (Rich Text Format) or HTM/HTML (Hypertext Markup Language) format then font information is contained in each file header and this takes precedence over any specification here.

8)      Zoom factor (legal values 50, 75, 90, 95, 100, 105, 110, 115, 125, 150, 200). This only applies if quick text display is not selected.

9)      Standard extension of files in this corpus. A corpus may use a typical extension for its files, as with A Corpus of Irish English (by the present author), whose text files all end in .CIE. Be sure to enter the dot before the extension. Three asterisks indicate that the files have no extension (as with the Helsinki Corpus).

10)      Levels of tree visible at startup.

11)      Icon type for nodes: Book = 0, Folder = 1
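
The file lookup rules given under parameter 1 can be sketched in Python as follows. This is only an illustration of the rules as described, not the program's own code. The function names and the control file name HELSINKI.CTL are hypothetical, and reading "the present file" as the control file itself is an assumption.

```python
from pathlib import PureWindowsPath


def resolve_data_dir(param1: str, control_file: str) -> PureWindowsPath:
    """Return the data directory named by parameter 1.

    A single dollar sign stands for the directory that holds the
    control file itself (assumed reading of "the present file").
    """
    if param1.strip() == "$":
        return PureWindowsPath(control_file).parent
    return PureWindowsPath(param1.strip())


def resolve_text_file(reference: str, data_dir: PureWindowsPath) -> PureWindowsPath:
    """Return the full location of a file reference from a tree node."""
    ref = PureWindowsPath(reference.strip())
    # A full path (with a drive or a leading backslash) overrides
    # the data directory.
    if ref.drive or ref.root:
        return ref
    # A bare file name or a partial path is looked up below the
    # data directory.
    return data_dir / ref


# Hypothetical control file location, used only for this example.
data_dir = resolve_data_dir("$", r"C:\CORPUS\HELSINKI.CTL")
print(resolve_text_file("CODOCU1", data_dir))
print(resolve_text_file(r"OE\CODOCU1", PureWindowsPath(r"G:\HELSINKI\TEXTS")))
```

PureWindowsPath is used so that the sketch handles drive letters and backslashes in the same way on any operating system.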

First 11 parameters of control file for the Helsinki Corpus

G:\HELSINKI\TEXTS
G:\HELSINKI\MANUAL\HELSINKI.JPG
G:\HELSINKI\MANUAL\HEL_CORP.RTF
G:\HELSINKI\MANUAL\FAQS.RTF
G:\HELSINKI\MANUAL\FACT.RTF
Courier New
10
90
***
2
1
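
For those who want to read these parameters with a script of their own, the following Python sketch shows one way to do it. It rests on assumptions: the key names are invented for the example, the comment line in the sample has been added for illustration, and lines beginning with a semicolon are the only lines skipped.

```python
# Hypothetical key names for the 11 parameters, in file order.
PARAM_NAMES = [
    "data_dir", "wallpaper", "manual", "faq", "fact",
    "font", "font_size", "zoom", "extension",
    "visible_levels", "icons",
]
INT_PARAMS = {"font_size", "zoom", "visible_levels", "icons"}


def read_parameters(lines):
    """Return the first 11 non-comment lines as a dictionary."""
    # Lines beginning with a semicolon are comments and are ignored.
    content = [ln.rstrip("\r\n") for ln in lines if not ln.startswith(";")]
    if len(content) < len(PARAM_NAMES):
        raise ValueError("control file has fewer than 11 parameter lines")
    params = dict(zip(PARAM_NAMES, content))
    for key in INT_PARAMS:
        params[key] = int(params[key])
    # Three asterisks mean that the corpus files have no extension.
    params["has_extension"] = params["extension"] != "***"
    return params


SAMPLE = r"""; first 11 parameters for the Helsinki Corpus
G:\HELSINKI\TEXTS
G:\HELSINKI\MANUAL\HELSINKI.JPG
G:\HELSINKI\MANUAL\HEL_CORP.RTF
G:\HELSINKI\MANUAL\FAQS.RTF
G:\HELSINKI\MANUAL\FACT.RTF
Courier New
10
90
***
2
1
""".splitlines()

params = read_parameters(SAMPLE)
print(params["font"], params["font_size"], params["zoom"])
```

The sketch does not check the font name, font size or zoom factor against the lists of legal values given above; such a check could be added in the same function.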

Sample section of control file for the Helsinki Corpus (beginning, early Old English)



There are three items of information for each node of a tree. The first is the description to be used as a label for a node (plain text). The second is the file associated with this node. If you enter DUMMY.RTF here then no file is displayed. This is necessary because there will be nodes in a tree which are empty, i.e. there are just links to other nodes further down the tree. Indeed it is normal, though not essential, that only the terminal nodes of a tree contain actual file references. The third item of information usually consists of three asterisks. The reason it is there at all is that with audio files you may wish to display an image file in the background. If you now specify a WAV file as item no. 2 and an image file as item no. 3 then the latter will be displayed while the former is played. By these means you could for example display a map of a region and play an audio file with the speech of that area at the same time.
      You will notice that the description of many nodes is indented. This is deliberate and represents the means by which you specify what level in a tree the node is to be displayed on. The principle is as follows: every 4 spaces at the beginning of a node label represent an indent of one level below the first, i.e. no spaces indicate a node on the top-most level (level 1), 4 spaces indicate that the node is on level 2, 8 spaces on level 3, 12 on level 4, 16 on level 5 and 20 on level 6. A maximum of 6 levels is permissible.

Old English
DUMMY.RTF
***
    I ( - 850)
DUMMY.RTF
***
        Documents
DUMMY.RTF
***
            Documents 1 (Harmer, Robertson, Birch)
CODOCU1
***
        Undefined text type (verse)
DUMMY.RTF
***
            Caedmon's Hymn; Bede's Death Song; The Ruthwell Cross; The Leiden Riddle
CONORTHU
***
    II (850-950)
DUMMY.RTF
***
        Law
DUMMY.RTF
***
            Alfred's Introduction to Laws, Laws (Alfred), Laws (Ine)
COLAW2
***
        Documents
DUMMY.RTF
***
            Documents 2 (Harmer, Robertson, Sweet-Whitelock)
CODOCU2
***
        Handbooks, medicine
DUMMY.RTF
***
            Laeceboc
COLAECE
***
        Philosophy
DUMMY.RTF
***
            Alfred's Boethius
COBOETH
***
        Religious treatises
DUMMY.RTF
***
            Alfred's Cura Pastoralis
COCURA
***
        Prefaces
DUMMY.RTF
***
            Alfred's Preface to Cura Pastoralis
COPREFCP
***
        History
DUMMY.RTF
***
            Chronicle MS A Early
COCHROA2
***
            Bede's Ecclesiastical History
COBEDE
***
            Ohthere and Wulfstan (MS L)
COOHTWU2
***
            Alfred's Orosius
COOROSIU
***
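
The node format described above (three lines per node, with 4 spaces of indentation per level in the label) can be read with a short Python sketch like the one below. It is an illustration under stated assumptions, not the program's own routine: the names Node and parse_nodes are hypothetical, blank lines are skipped, and the input is taken to be the part of the control file that follows the 11 parameters.

```python
from typing import List, NamedTuple, Optional

MAX_LEVEL = 6          # a maximum of 6 levels is permissible
SPACES_PER_LEVEL = 4   # every 4 leading spaces mean one level down


class Node(NamedTuple):
    level: int
    label: str
    file: Optional[str]    # None where DUMMY.RTF marks an empty node
    image: Optional[str]   # None where the third item is ***


def parse_nodes(node_lines) -> List[Node]:
    """Turn groups of three lines into Node records."""
    # Comment lines (leading semicolon) are ignored; skipping blank
    # lines is an assumption made for this sketch.
    content = [ln.rstrip("\r\n") for ln in node_lines
               if ln.strip() and not ln.startswith(";")]
    if len(content) % 3 != 0:
        raise ValueError("node lines do not form complete groups of three")
    nodes = []
    for i in range(0, len(content), 3):
        raw_label, file_ref, image_ref = content[i:i + 3]
        indent = len(raw_label) - len(raw_label.lstrip(" "))
        level, remainder = divmod(indent, SPACES_PER_LEVEL)
        level += 1
        if remainder or level > MAX_LEVEL:
            raise ValueError(f"bad indentation for node label: {raw_label!r}")
        file_ref = file_ref.strip()
        image_ref = image_ref.strip()
        nodes.append(Node(
            level=level,
            label=raw_label.strip(),
            file=None if file_ref.upper() == "DUMMY.RTF" else file_ref,
            image=None if image_ref == "***" else image_ref,
        ))
    return nodes


# The first four nodes of the Helsinki sample, indented as described.
SAMPLE_NODES = [
    "Old English", "DUMMY.RTF", "***",
    "    I ( - 850)", "DUMMY.RTF", "***",
    "        Documents", "DUMMY.RTF", "***",
    "            Documents 1 (Harmer, Robertson, Birch)", "CODOCU1", "***",
]

for node in parse_nodes(SAMPLE_NODES):
    print(node.level, node.label, node.file)
```

Only the last of the four sample nodes carries a real file reference, which matches the remark above that it is normally the terminal nodes of a tree which contain file references.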