The Quest for Re-Integration : Creating New Documents across Traditional Borders

Øyvind Eide, University of Oslo

6th BOBCATSSS symposium : Shaping the Knowledge Society, January 26–28, 1998.

About the project

At the four universities in Norway there are many museums and collections that contain information on a wide range of aspects of language, culture and history. The Documentation Project is a nation-wide, multidisciplinary project which was assigned the task of transferring several of these collections from paper-based formats to computer-readable formats and of publishing the resulting computer files in order to make them more accessible to scholars, students, and the public. The project started in 1991 and the main conversion period was completed by 31 December 1997. My experience with the project has been gained through my work at the subproject at the University of Tromsø.

This university is the youngest in Norway, founded in 1968. It is located in northern Norway in a traditionally multi-ethnic and multi-lingual area. The Documentation Project's subprojects in Tromsø converted several documents and archives regarding this part of the country:

Archives of photographs from the Departments of Regional Ethnology and Sámi Ethnography. The photographs were scanned and linked to a database with meta-information.
Archives of place-names. The index cards were typed into a database, and facsimiles were scanned.
Archaeological acquisition records and topographical archives. This represented the Tromsø part of a nationwide archive system.
Printed books containing historical documentation. These were scanned and SGML tagged.
Important hand-written margin notes in one of these books. These were transcribed and integrated into the printed book with a special set of tags. The notes may easily be removed in order to allow the use of the printed version only when appropriate.
Private letters. These were transcribed and SGML tagged, and facsimiles of each letter page were scanned.
Three series of maps were scanned in high quality.
A collection of bibliographies covering northern Norwegian local history was typed into a database.

In some of the collections converted by subprojects at other universities, there are also elements of importance to northern Norway:

A printed farm registry covering all Norwegian counties. This was scanned and SGML tagged.
Transcripts and facsimiles of medieval diplomas, some of them regarding northern Norway. The text of a transcript published in print was scanned and SGML tagged.

The process of conversion has now come to an end, and the various documents and archives are being published as electronic books and databases. But, as these documents and archives each give their specific views on the history and culture of the northern Norwegian area, they are all parts of the same picture. The user will always have the possibility of reading and searching the electronic books and databases as stand-alone editions. But the original editions also included cross-references and bibliographical notes. We find it important to make new versions of such tools in the computer age, following three lines:

Convert the existing cross-references and notes in the original printed documents to hypertext links.
Use the idea of cross-reference to make hypertext links that have no explicit model in the paper-based documents.
Link similar concepts in different documents, either by hypertext links or as a combined search system.

Each of these approaches raises several problems of principle and practice. I will discuss some of them later, but first I will give some examples of how these categories are being used in the design of the information system.

Example I: Place-name index

SGML tagging is used to mark out various parts of the text and to enter meta-information. The text excerpt below is taken from one of the indexes of Dokumenter angaaende flytlapperne (Ed.: J. Qvigstad and K.B. Wiklund. Kristiania, 1909). This is a printed collection of historical documents. In this index the editor gives an analysis of the place-names mentioned in the text. In the book the text looks something like this:

A.

Aakie-vare I 354, 382, 387, 396 ved rigsrøs, 312 mellem Finland og Koutokæino.

Aasfield I 315 ovenfor Vaagsnæs V for Ulfs- fjordens munding i V paa AkNO.

Figure 1a: Tagging of index - original, untagged text

The tags we use in this example can be divided into two categories. First, the tags inserting meta-data describing the hierarchy and lay-out of the original documents, e.g. <ukap> (chapter), <avsn> (paragraph), <kur> (italic), etc.:

<UKAP><UKAPTIT>A.</UKAPTIT>

<AVSN><KUR>Aakie-vare</KUR> I 354, 382, 387, 396 ved rigsrøs, 312 mellem Finland og Koutokæino.</AVSN>

<AVSN><KUR>Aasfield</KUR> I 315 ovenfor Vaagsnæs V for Ulfs- fjordens munding i V paa AkNO.</AVSN>

Figure 1b: Tagging of index - tagged hierarchy and layout

The other category of tags is of special interest to us, i.e. the tags marking out specific types of contents. The ones used in this example are:

snavn: place-names
henv: cross-reference to a page in this document
kilde: reference to another document

The full tagging of these lines will then be:

<UKAP><UKAPTIT>A.</UKAPTIT>

<AVSN><KUR><SNAVN>Aakie-vare</SNAVN></KUR> <HENV>I 354, 382, 387, 396</HENV> ved <SNAVN>rigsrøs, 312</SNAVN> mellem <SNAVN>Finland</SNAVN> og <SNAVN>Koutokæino</SNAVN>.</AVSN>

<AVSN><KUR><SNAVN>Aasfield</SNAVN></KUR> <HENV>I 315</HENV> ovenfor <SNAVN>Vaagsnæs</SNAVN> V for <SNAVN>Ulfs- fjordens</SNAVN> munding i V paa <KILDE LOPENR="5800002">AkNO</KILDE>.</AVSN>

Figure 1c: Tagging of index - tagged hierarchy, layout and contents

As mentioned, the text in this example is taken from a place-name index. The Roman numerals "I" and "II" followed by one or more numbers are the cross-references to the pages in volumes 1 and 2 of the edition where the place-names are mentioned. As the page number is marked out with the <henv>-tag, it is easy to make hypertext links to each page in the document.
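A minimal sketch of how the contents of a <henv> element could be split into volume and page pairs before links are generated. The function names and the link pattern (`vol1.html#p354`) are assumptions made for illustration; the paper does not describe the actual address scheme.

```python
import re

# Volume numerals used in the index; the edition has two volumes.
ROMAN = {"I": 1, "II": 2}


def parse_henv(text):
    """Split the contents of a <HENV> element into (volume, page) pairs."""
    refs = []
    volume = None
    for token in re.findall(r"[IVX]+|\d+", text):
        if token.isdigit():
            if volume is None:
                raise ValueError("page number found before any volume numeral")
            refs.append((volume, int(token)))
        else:
            volume = ROMAN[token]
    return refs


def link_target(volume, page):
    # Hypothetical anchor scheme, not taken from the project.
    return f"vol{volume}.html#p{page}"


if __name__ == "__main__":
    for volume, page in parse_henv("I 354, 382, 387, 396"):
        print(link_target(volume, page))
```

A page number keeps the most recently seen volume numeral, which matches the way the index lists several pages after a single "I".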

So, the reader goes on to one of the pages in the list. It is also possible to make a "special edition" for the user made of the pages referred to - in this example of Aakie-vare, this user's "book" will contain pages 354, 382, 387 and 396 of volume I.

But this will leave the user with some work - he will have to browse the text of several pages to find the relevant information. A simple analysis of the construction of the index paragraph shows that the page number refers to a specific place-name. It would be nice to make it easier for the reader to find this in the text. This can be done by using a larger font for the place-name. We can also use the "special edition" approach, but include only the paragraphs in which the place-name is mentioned.
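The paragraph-level "special edition" can be sketched as a filter over the tagged text. This is an illustration only: the sample pages are invented, and it assumes that the body text is tagged with <AVSN> and <SNAVN> in the same way as the index.

```python
import re

AVSN = re.compile(r"<AVSN>(.*?)</AVSN>", re.DOTALL)
SNAVN = re.compile(r"<SNAVN>(.*?)</SNAVN>", re.DOTALL)


def paragraphs_mentioning(name, pages):
    """Return (page_number, paragraph) pairs whose <SNAVN> tags contain name.

    pages maps a page number to the tagged text of that page.
    """
    hits = []
    for page_no in sorted(pages):
        for paragraph in AVSN.findall(pages[page_no]):
            names = [n.strip() for n in SNAVN.findall(paragraph)]
            if name in names:
                hits.append((page_no, paragraph))
    return hits


# Invented sample pages, for illustration only.
pages = {
    354: "<AVSN>Text about <SNAVN>Aakie-vare</SNAVN>.</AVSN><AVSN>Unrelated text.</AVSN>",
    355: "<AVSN>Text about <SNAVN>Finland</SNAVN>.</AVSN>",
}

if __name__ == "__main__":
    for page_no, paragraph in paragraphs_mentioning("Aakie-vare", pages):
        print(page_no, paragraph)
```

Matching on the tagged name rather than on the raw string avoids false hits where the same letters occur inside another word.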

In the last paragraph, there is a short title "AkNO". This refers to a specific page of a map of northern Norway. If we had a digital version of this map series, we could make a note with a reference to a facsimile of the map itself. It would also be possible to indicate the part of the map page where the place-name is supposed to be found as the "V" before the reference itself points to the western part of the map page.

But, as we do not have this digital map, how do we help the users in finding the relevant information? We make a footnote with the full title of the map, and a hypertext link directly to the record in Bibsys, a shared catalogue for Norwegian university libraries. If it is not possible to make a link to a document on the WWW, we can at least try to help them to find a paper-based document:

Figure 2: Bibliographical linking

As seen in Figure 2, the "lopenr" attribute contains an arbitrary number that is equal to the number in the <lopenr> element of the file of literary references. In the same record in the literary reference file, there is a tag <objektid> in which the value is equal to the "objektid" value of the record in Bibsys. Also, note that the last line in the Bibsys record indicates a library holding of the document. "VSB" is the code for a department at the University Library of Oslo.
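The chain from a <kilde> tag to a catalogue record is a two-step lookup, which can be sketched as follows. The number 5800002 is the one used in Figure 1c; the title and "objektid" value are placeholders, and the URL pattern is hypothetical, since the real Bibsys address scheme is not given here.

```python
# Literary reference file, keyed by the <lopenr> value.
# The title and objektid below are placeholders, not real records.
literature = {
    "5800002": {
        "title": "(full title of the map series)",
        "objektid": "000000000",
    },
}


def resolve_kilde(lopenr, literature):
    """Follow a <KILDE LOPENR="..."> value to its reference record.

    Returns the footnote text and a catalogue link, or None when the
    number has no record in the literary reference file.
    """
    record = literature.get(lopenr)
    if record is None:
        return None
    return {
        "footnote": record["title"],
        # Hypothetical URL pattern, used only to show where objektid goes.
        "catalogue_link": "https://catalogue.example/record/" + record["objektid"],
    }


if __name__ == "__main__":
    print(resolve_kilde("5800002", literature))
```

Keeping the arbitrary number in the text and the catalogue identifier in the reference file means that a change in the catalogue only has to be corrected in one place.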

Example II: Archaeology, farm-names, diplomas

(This example is described more thoroughly in Christian-Emil Ore: "Making multidisciplinary resources", Digital Resources for the Humanities, Oxford, 1997)

All the archaeological museums in Norway publish annual reports of their new acquisitions. These reports have been converted by the Documentation Project. A record from one of these reports looks like the top box of Figure 3. The name and number marked out are the name and the land registry number of the farm where the artifact was found.

In the late 19th and early 20th centuries, philologist and archaeologist Oluf Rygh started publishing a complete catalogue of the names of Norwegian farms, Norwegian Farm Names (Norske gaardnavne. Kristiania, 1897-1936). The work was completed after his death and runs to 18 published volumes. This catalogue is being converted by the Documentation Project. The farms in the catalogue are sorted by registry number, which makes it possible to create a hypertext link from the acquisition catalogue.

Figure 3: Linking to other digital documents (ill. by Christian-Emil Ore)

In Norwegian Farm Names, there are numerous literary references, some of which refer to a collection of transcripts of medieval diplomas, Diplomatarium Norvegicum (Christiania, 1847-). This work is still in progress with volume 23. The Documentation Project has converted these diplomas, which makes yet another link possible, as seen in Figure 3.

With reference to the three lines mentioned earlier, note that the last of these links is a conversion of a cross-reference existing in the paper-based document, while the first is created for this electronic edition and represents information added to it.

Example III: Letters

In the project, some 260 letters written by Just Qvigstad, Sámi linguist and Headmaster of Tromsø College of Education, have been converted. In these letters there are numerous literary references. As these letters were written with only one reader in mind, the titles are often rudimentary. To try to help the users in locating material mentioned in the letters, we use the <kilde>-tag. In Figure 4, the text of the letter excerpt is translated into English. Only <kilde>-tags are shown:

Figure 4: Literary reference in a letter

This linking is supposed to be used in two directions: Letter text-->"footnote"-->Bibsys (as in Figure 4) will help the reader to identify the books mentioned and will also give information about the location of the physical documents, as Bibsys shows the collections of Norwegian university libraries. Another use of the links is to make a bibliography for the collection of letters:

...
Buch, L. von: Reise durch Norwegen und Lappland. 1810
...
Pira: Svensk-danska förhandlingar 1593-1600. 1895
...

Each of the records in this bibliography will be a hypertext link to the exact locations in the letters where the document is mentioned.
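Such a bibliography is in effect an inverted index over the <kilde>-tags. The sketch below assumes that the letters carry the same LOPENR attribute as in Example I; the letter identifiers, the sample texts and the numbers 101 and 102 are invented for illustration.

```python
import re
from collections import defaultdict

KILDE = re.compile(r'<KILDE LOPENR="(\d+)">', re.IGNORECASE)


def build_bibliography(letters):
    """Map each LOPENR value to the places where it is cited.

    letters maps a letter identifier to the tagged text of the letter.
    Each citation is recorded as (letter_id, character offset of the tag).
    """
    index = defaultdict(list)
    for letter_id in sorted(letters):
        for match in KILDE.finditer(letters[letter_id]):
            index[match.group(1)].append((letter_id, match.start()))
    return dict(index)


# Invented sample letters, for illustration only.
letters = {
    "letter-001": 'See <KILDE LOPENR="101">Buch</KILDE> and <KILDE LOPENR="102">Pira</KILDE>.',
    "letter-002": 'Again <KILDE LOPENR="101">Buch</KILDE>.',
}

if __name__ == "__main__":
    for lopenr, places in sorted(build_bibliography(letters).items()):
        print(lopenr, places)
```

Each entry in the resulting index can be shown under the full title from the literary reference file, with one link per recorded location.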

The observant reader will have noticed that there are two more documents mentioned in the excerpt from the letters: "the Swedish bailiff's reports to Karl IX about the Lappish situation on the Norwegian side of the border" and "Karl IX's instruction for the lapp bailiff Lars Larsson (of 1610)". The <kilde>-tagging of these letters could be expanded to include this type of document. But, as finding the location of every archival document mentioned would be time-consuming, we have not yet done it.

Example IV: Place-names

Several bibliographies covering topographic and local history areas in Norway have been published, the first one in 1907. We have converted the parts of these bibliographies covering northern Norway. In the original printed bibliographies the records were grouped together under geographical headers. These geographical headers were converted into geographical key-words in the database. There are some problems with this because the administrative units have changed during the last 90 years - municipalities were divided (this happened mainly in the period until 1965) and united (mainly from 1965 onwards). This problem is solved by indexing the records in a system with a time dimension. We have used Klassifikasjonsnøkkel til norsk topografi by Vegard Elvestrand (Trondheim, 1976; rev. 198-?). In this system every municipality that has ever existed is given a code; Flakstad in Figure 5 has the code "Wmf".

Parmentier, Georges En Norwège. : Les îles Lofoten et les pêcheries : À travers le maëlström / Par Georges Parmentier. - Lille, 1913. - 27 s. ; 8. ELVESTRAND CODE: Wmf GEOGRAPHICAL HEADER: Flakstad

Figure 5: Place-name indexing in a bibliographical record.

What this shows is that we have the necessary equipment to find practical solutions to the problems connected to this kind of place-name indexing. This also makes it possible to link this bibliographical database to the photograph database.

The University Museum of Tromsø houses several collections of photographs from the last 130 years. A selection of approximately 20,000 of these has been scanned and linked to a database. As in the bibliographical database, every record in this database has geographical information based on the same administrative units.

With the use of the Elvestrand standard, we can quite easily set up a table linking each place-name to the administrative region covering the same or a bigger area in 1997. Then we can make a common environment where users can browse a list of municipalities as they exist today, and get the records from both of these databases when a selection is made.
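The table and the combined search can be sketched as below. Only the code "Wmf" for Flakstad comes from the text; the mapping to 1997 units, the "X.." codes, the record identifiers and the field names are invented for illustration.

```python
# Elvestrand code -> municipality as of 1997.
# Only "Wmf" (Flakstad) is taken from the text; everything else is invented.
CODE_TO_1997 = {
    "Wmf": "Flakstad",
    "Xaa": "Example municipality",
    "Xab": "Example municipality",  # a former unit now covered by the same one
}


def records_for(municipality, *databases):
    """Collect records from every database that fall within a 1997 municipality.

    Each database is given as a (name, records) pair, and each record
    carries an Elvestrand code in its "code" field.
    """
    hits = []
    for name, records in databases:
        for record in records:
            if CODE_TO_1997.get(record["code"]) == municipality:
                hits.append((name, record["id"]))
    return hits


bibliography = [{"id": "bib-1", "code": "Wmf"}, {"id": "bib-2", "code": "Xaa"}]
photographs = [{"id": "photo-1", "code": "Xab"}, {"id": "photo-2", "code": "Wmf"}]

if __name__ == "__main__":
    print(records_for("Flakstad",
                      ("bibliography", bibliography),
                      ("photographs", photographs)))
```

Because several historical codes may map to the same present-day unit, a single selection in the list of municipalities returns records that were originally indexed under different administrative names.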

I will now remind you of Figure 1a. The descriptions following each place-name are indeed well written and were quite adequate for use at the time of original publication, but they are primarily designed for finding places on maps, and they are relationally oriented ("above", "west of", etc.). What we need is the same kind of model as in the bibliography example, i.e. a table that links each place-name to a modern map.

This is complex work, because a much higher number of names has to be analyzed than in the former example, and because the descriptions and map references are incompatible with administrative units such as municipalities. It is also impossible to do this completely: as the documents in the collection date back to the sixteenth century, not every place-name mentioned can be located - in the index, some of the descriptions are just "= ?". So, every solution will be only partial, in contrast to the total model in the previous example.

A possible solution is that a research project is initiated with the aim of creating a model for linking place-names which is as good as possible. The result of this project can be integrated into the published electronic version of the book, opening the way for a closer integration with other resources on the WWW. But to do this, we need to change the traditional concept of publication because this project will be finished a long time after the electronic publishing of the documents.

If it is possible, do it!

"Why did you climb Mt. Everest?"
"Because it's there!"

This might be a good approach to mountain-climbing. It is not so wise when it comes to knowledge management. The total time used in the Documentation Project's subprojects in Tromsø is about 60 man-years. Of this, 30-40% was spent on proof-reading. Some of this time could have been used to make more links, more references to other systems, to library collections, to paper archives, and so on. Maybe this would have resulted in a database system more exciting to work with, but because the number of errors in the texts would be higher, it would also have been of far less value to scholars. Using new technology must not imply abandoning the traditional standards of quality.

Ideally, we should only put possibilities into the system that will be used in productive research and study. The problem is that most historians do not know what possibilities exist. So, what I have described above is not what a finished system should contain. Rather, it should be implemented as a suggestion to the users. What do you want from this? Should we do more of this, maybe cut out this because it is useless or it gives a misleading impression of the material, or...

But this is not possible if one follows the traditional publication system where a book is published and then criticized by the research community. Therefore I suggest a model like this:

Figure 6: Model of publishing 1

Note that the first part of the process is quite similar to traditional publishing, while the latter part is more like the implementation phase of a system development project. A computer edition like this will be a database in the respect that we can put new elements and new understanding into it. We can also correct errors:

Figure 7: Model of publishing 2. To the left, the traditional model. To the right, our approach

Different user groups often have different needs. Some of the material we have converted will be read for fun by some users. This is done in a mainly sequential style, as a popular book edition of letter collections would be read. The study of a text, on the other hand, is done in a more paradigmatic fashion. In this kind of use, the linking with other material is most important.

It is also a question of how often the system is used. An occasional user has to be kept fully oriented about what document he is reading. On the other hand, a master's student making intense use of this system while writing her thesis would be quite tired of always having to press "go on to Bibsys" instead of following a literature link directly. Maybe one should use a system of "user profiles" where regular users can log in and have a set of standard parameters, e.g. "override footnote box when following link from book to Bibsys".

Users with special needs will also have the possibility to download documents and do the analysis on their own software.

Where did that "re" in re-integration go?

In this paper, there has been much discussion of the integration of documents and archives. But why did I use the word "re-integration" and not just "integration" in the title?

In traditional archives, the users have to be present at the institution to use the material. This is, of course, impractical and expensive, and it is much better for the users to get access to material through computer networks. Still, to visit the institution where the original documents are located also includes visiting the people working there and getting easy access to other collections at the institution. As indicated in Figure 8, not everything can be converted into computer-readable formats:

CONVERTED	NOT CONVERTED	
	Could have been converted	Could not have been converted
Books	More books, letters, etc. Also: sound recordings (music, tales, etc.), film (even though this is practically difficult)	Archaeological objects such as stone axes
Letters		
Archaeological archives		
Maps		
Pictures, both meta-information and the pictures themselves		
Pictures of clothes		Clothes (the garments themselves)
		Knowledge in the minds of people

Figure 8: What can be converted?

In some respects we are breaking down traditional borders, e.g. the borders between printed books and archives. In this situation there is a risk that pieces of information lose their interconnection and become meaningless. The publication of separate document collections and archives on the WWW may result in a kind of fragmentation because the user loses the environment that used to be around the documents and archives.

By pointing to other sources of information, both electronically available and paper-based, we try to rebuild a context. This context will be new in its appearance, but we hope it will carry with it the best traditions from the old institutions. This task is not possible for information specialists alone to solve; we will need close cooperation with scholars with experience in using different kinds of material in their research.
