The electronic encyclopaedia: current and proposed methodologies for information retrieval
Introduction

Vannevar Bush, in his 1945 essay "As We May Think", described an "enlarged supplement to [man's] memory": "Consider a future device for individual use, which is a sort of mechanised private file and library [...] in which an individual stores all his books, records, and communications, and which is mechanised so that it may be consulted with exceeding speed and flexibility" (Bush, 1945:14). Such a device bears an uncanny resemblance to what is today known as the World Wide Web, which is why the internet is seen by many as the embodiment of Bush's vision. This is true insofar as it provides instant physical access to information, which would not have been possible in Bush's time; however, whilst the internet is growing at an extraordinary rate, so that the possibility of storing a vast enough amount of knowledge to be comprehensive becomes conceivable, it still differs significantly from Bush's vision. Bush's (1945:3) key concern was with the difficulty of finding the "momentarily important item", the information one is looking for at a given time; yet although this information may be made available by the internet, the growing amount of data makes relevant information increasingly difficult to find. Thus, Bush's concerns are still echoed by World Wide Web inventor Tim Berners-Lee (1998:8), who has recently been calling for the birth of a "Semantic Web": "While search engines which index HTML pages find many answers to searches and cover a huge part of the Web, they return many inappropriate answers. There is no notion of "correctness" to such searches". Thus, whilst electronic media make the information available, that information now needs to be made accessible.

In this essay, we shall attempt to examine both the causes of and possible solutions to inefficient information retrieval. Rather than trying to solve this problem for the whole of the World Wide Web, we will concern ourselves with the particular field of encyclopaedic data, or "knowledge", the type of information Bush was concerned with. Because of the limited space available, we will focus on the theoretical aspect of retrieving this information rather than its technical applications; furthermore, we will attempt to outline the issues surrounding encyclopaedic information retrieval and some suggestions as to possible solutions, as opposed to proposing a fully implemented strategy for solving the problem. We will begin by examining current search and retrieval methods and commenting on their validity as applied to encyclopaedic data; we shall then critically examine alternative methods of document mark-up and retrieval, before turning to some of the broader technological and linguistic issues involved. To conclude, we will offer some suggestions as to how these observations can be applied practically to marking up encyclopaedic data for electronic use.

At this point, it is useful to describe what is meant by "encyclopaedic data". An encyclopaedia, according to the Macmillan Encyclopedia, is "a reference book summarising all human knowledge or comprehensively surveying a particular subject". This is helpful, but not sufficient; it does not explain what pieces of information can be categorised as "human knowledge", or which pieces of information have their place in an encyclopaedia. The above definition does provide an interesting dichotomy in separating encyclopaedias into either summaries of the whole of human knowledge or comprehensive surveys of a particular subject; one of the main advantages of electronic data is that it can afford to be both, not being limited by physical space. Thus, it is possible for an electronic encyclopaedia to contain both an overview and an in-depth analysis of a given subject area, while still being readable. This is the premise on which this essay is based: that all levels of information should be available to the user, and that their accessibility should be determined by the depth of information the user wishes to reach. For this reason, it may be more appropriate to use Eco's (1999:232) description of the Encyclopaedia as the "Library of Libraries, the postulate of the globality of human knowledge which cannot be realised by a single speaker"; and for the purpose of this essay, we shall consider encyclopaedic data to be academic and informative in nature.

Current Search Methods
Methodology

To begin with, let us consider the current methods used for retrieving information from the internet. The earliest method used by internet search engines was keyword search. This type of search is still widely used across the web today, especially by intranet and site-specific search engines such as ht://Dig and, though no longer in such a basic form, by the major internet search engines such as Google or AltaVista. The premise of keyword searching is simple: a word or set of words is entered by the user into the engine, the engine searches all of the information it has access to for occurrences of these keywords, and returns all instances of data in which the keywords have been found, ordered according to a given set of criteria. Common criteria include the frequency of occurrences, or where in the document the keywords occur, be it in the title, the address of the page or the text body (Brin and Page, 1998). This type of search is quite obviously rather limiting; for instance, if a page contains a word many times, it may end up higher in the search engine results than a page of real interest. The inventors of Google, Sergey Brin and Lawrence Page (1998:6, 15), thus found when creating their engine that some major search engines returned "Bill Clinton Sucks" and "Bill Clinton Joke of the Day" pages as the highest-ranking results for searches on "Bill Clinton", and in many cases did not return results for the White House website.

Google was first introduced in 1997, at a time when most search engines still relied on the simple form of keyword-matching we outlined above. Google however introduced some concepts which drastically improved search results, all of which hinged on a basic concept, which we can term "metadata". When search engines are queried for a set of keywords, they do not physically search the entire internet for occurrences of a given keyword, which would take too much time. Rather, they index all of the web pages they know about in a database, and search this index rather than the pages themselves. Thus, keyword-matching search engines search their database for occurrences of a keyword and return those pages which contain it. Google's innovations lay principally in the indexing process; rather than just storing part or all of the contents of a given page in the database, Google also stores and retrieves information about the page, and uses this information when assessing the relevance of a page to a particular search. Thus, Google stores data about the data, or metadata, and this data is weighed alongside the occurrences of a given keyword in a page when assessing the relevance of that page to the keyword.
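
By way of illustration, the following sketch in Python (our own; the pages and their contents are invented) shows the principle of such an index: each keyword maps to the set of pages containing it, so that a query consults the index rather than the pages themselves.

# Build a minimal inverted index over a toy set of pages.
pages = {
    "page1.html": "Schoenberg composed atonal music",
    "page2.html": "an introduction to tonal music",
}

index = {}
for url, text in pages.items():
    for word in text.lower().split():
        index.setdefault(word, set()).add(url)

# A query is answered from the index, never by scanning the pages themselves.
print(index.get("music"))    # {'page1.html', 'page2.html'}
print(index.get("atonal"))   # {'page1.html'}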

Google's main use of metadata is in its PageRank system. Google's founders realised that no other search engine was making use of citations in its ranking system, a method which is sometimes used in academia to rate the importance of different works on a given subject. The premise is that the more often a given piece is cited in others, the more likely it is to be relevant and important to the subject. The web's variation on citations is the hyperlink, which is used to link one page to another; so, if a given page links to another with reference to a given subject, that other page is likely to be a useful resource on that subject. Thus, if in HTML we have:

<a href="http://www.server.com/mypage.html">music</a>

where "<a>" describes a hyperlink, "href=´http://www.server.com/mypage.html´" describes the page that link is referring to, and "music" is the keyword attributed to that link, then Google will consider the page "http://www.server.com/mypage.html" to be a useful ressource to connect to the keyword "music". Consequently, that page is more likely to appear higher in Google´s ratings for the keyword "music" than other pages which also contain that keyword but which do not have as many citations pointing to them. The amount of citations to a page with reference to a given keyword is thus also weighed in with the occurrences of that keyword in the page to assess how relevant that page is to a keyword.

Google's algorithm for PageRank is formulated as follows:

PR(A) = (1-d) + d (PR(T1)/C(T1) + ... + PR(Tn)/C(Tn))

where PR(A) is the PageRank of page A, d is a dampening factor, pages (T1) to (Tn) are pages pointing to page (A), and PR(T1...Tn) and C(T1...Tn) are the PageRanks of pages (T1...Tn) and the number of links going out of pages (T1...Tn) respectively (Brin and Page, 1998:4). This can be read as follows: the PageRank of page (A) is a function of the PageRanks of all pages pointing to it, each divided by the number of links going out of the pointing page. The function is iterative, in that it needs to calculate the PageRank of referring pages before calculating that of the initial page, and so on. Essentially, each page is ranked according to its popularity, with the popularity of other pages referring to it taken into account when calculating the initial page's popularity. A page's PageRank is of course not the only element factored in when assessing a page's relevance to a keyword; Google also considers elements such as the number of occurrences of the keyword, the position of the keyword in the text, and the weight given to the keyword according to its font size - another instance of metadata.
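
To make the iteration concrete, the following Python sketch computes the formula above over a hypothetical three-page link graph; the graph, the value of the dampening factor and the fixed iteration count are our own assumptions for illustration, not details of Google's implementation.

# A minimal PageRank iteration over a toy link graph (page names hypothetical).
links = {
    "A": ["B", "C"],   # page A links to pages B and C
    "B": ["C"],
    "C": ["A"],
}
d = 0.85                               # dampening factor
pr = {page: 1.0 for page in links}     # initial PageRank values

for _ in range(50):                    # iterate until the values settle
    for page in links:
        incoming = [p for p in links if page in links[p]]
        pr[page] = (1 - d) + d * sum(pr[p] / len(links[p]) for p in incoming)

print(pr)   # pages with more, and better-ranked, citations score higher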

This method has proved extremely successful when applied to the internet, so much so that similar systems have since been adopted by other major engines, and Google has become the most popular search engine technology on the internet. It is thus tempting to wonder what its shortcomings are, or indeed why it could not simply be applied to encyclopaedic data.

Limitations

First of all, one is brought to consider the linguistic limitations of keyword searching. Although Google has successfully introduced some methods for metadata retrieval, thus expanding the amount of information it has about a given page, that information does not concern the data itself; in other words, Google might know what to think about a given page given its standing with other pages, but it does not actually know much about the data itself other than whether it contains a keyword or not, and the emphasis placed on that keyword. Therefore, it lays itself open to the vulnerabilities intrinsic to keyword searching, which are essentially ignorance of the meaning of the keywords it looks for, and of the relationships between them.

Let us make explicit the different relationships between words, their usefulness to us, and how knowledge of the meaning of a word would help us find more relevant information. Linguistically, words can be found to be linked to each other in several different ways. Lyons (1977:279-311) identifies seven different types of semantic links between words: synonymy, incompatibility, class inclusion or hyponymy, antonymy, complementarity, converseness and the part-whole relationship, or holonymy/meronymy. Briefly, synonymy is a bilateral relationship, where a given word is substitutable for another in a given context; incompatibility is where a word is never substitutable for another given word; class inclusion is a "kind-of" relationship, whereby one word describes a kind of another, for example a rose is a kind of flower; antonymy is where a word is a gradable opposite of another, for instance "big" to "small"; complementarity is where a word is an ungradable opposite of another, such as "male" and "female"; converseness is a two-place relationship in which the existence of that relationship implies the existence of an equal but opposite relationship between the words, as in "doctor" and "patient"; and the part-whole relationship, or holonymy/meronymy, denotes the inalienable belonging of a word to a construct, such as "arm" to "body" (Lyons, 1977:279-311). Note that these definitions are the subject of controversy, which it is beyond the scope of this essay to consider.

The usefulness of some of these categories when searching for information is immediately apparent; pages might, for instance, refer to a keyword by its synonym rather than the keyword itself. Thus, a keyword-based search for "atonality" would not return any results for pages relevant to the term "pan-tonality", even though both describe the same thing. Unless relationships of this type are taken into account by the search engine, pages linked by the same theme but using a different keyword to describe it will not be found. The same goes for class inclusion and part-whole relationships: a page which describes a concept to which the given keyword has a "kind-of" or "part-of" relationship will not be retrieved by a search engine unaware of this relationship. This can even be said to be the case for relationships of contradiction, such as antonymy, complementarity and converseness; for instance, when researching "atonality", one may be interested in the diatonic system, the approach to musical composition to which atonality placed itself in opposition.
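
By way of illustration, a search engine could take synonymy into account by expanding the query before matching keywords; the following Python sketch (ours, built around a hand-made synonym table) shows the principle.

# A hypothetical synonym table; a real system would derive this from a lexicon.
synonyms = {
    "atonality": ["pan-tonality", "atonal"],
}

def expand_query(keyword):
    # Return the keyword together with its recorded synonyms, so that pages
    # using either term can be matched by an ordinary keyword engine.
    return [keyword] + synonyms.get(keyword, [])

print(expand_query("atonality"))   # ['atonality', 'pan-tonality', 'atonal']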

These relationships can be described as semantic, in that they describe relationships between words bound together by meaning, be it a relationship of similarity, opposition, inclusion or complementarity. Thus, when Tim Berners-Lee mentioned the "Semantic Web", he was referring to a World Wide Web linked together by relationships of meaning, in which each page could be described in relationship to a set of other pages (Berners-Lee, 1998:8). Whilst the possibility of achieving this across such a diverse and unstructured collection of information as the Web remains to be seen, the concept is certainly applicable to encyclopaedic knowledge. Thus, it is possible to imagine a collection of information in which all parts of the data are interrelated by relationships of meaning. For example, a search for Schoenberg might also return results concerning atonality, diatonality, fellow atonal composer Alban Berg, or chromaticism, concepts which are all related to Schoenberg more or less directly, but which would be difficult to come across using a simple keyword search. Hence the importance of semantics in establishing the link between different subject areas, or different parts of subject areas.

Thus, we have so far seen the inadequacy of current search methods in returning relevant information on a particular topic, and we have seen how this inadequacy revolves around the absence of semantics within the data, or the lack of understanding on the part of the search engine of the different types of links between different parts of the data. We have also inferred that by introducing semantic relationships between the different concept areas covered by the data, we would be able to retrieve results more relevant to our search. Doing so in the context of encyclopaedic data however requires thinking about two different aspects of the process: first, the technical aspect of describing the semantic links in the corpus; and secondly, the conceptual approach to be taken in describing the nature of these relationships. We shall first examine the technical aspect, then look at how to approach the problem conceptually.

Encyclopaedic Data and Structure

As we have seen, Google indexes the World Wide Web, the main particularity of which is that it is unstructured. The term "unstructured", applied to data, denotes a set of information in which the underlying structure of the information is unclear; there is no clear indication of what, within the referenced data, forms which part of it. For instance, Google attributes more importance to words in bigger or bold characters, based on the assumption that such a word is either a title within the page, or generally more important than the others. This is not always the case, however, and beyond this there is little way of knowing how to interpret the different parts of a page; and it is difficult to link one part of the data to another without knowing what type of information is being considered, which makes it difficult to introduce any kind of semantics into the data. By considering an electronic encyclopaedia, we are granting ourselves the luxury of a closed system, in which there need not be any discrepancies either in the way the data is structured or in the vocabulary used; we shall bear this in mind when considering both the elements of the encyclopaedia and the semantics linking them together.

HTML

Much of the difficulty with pages of the World Wide Web resides in the fact that they are written in HTML, which at its core is a visual mark-up language. Its role as such is to inform browsers how to display the different parts of data within the pages, and how these parts should appear on screen; short of adopting a rigid layout structure, which would obviously be difficult to maintain over time and quite limited in the amount of information it could convey, there is thus no way of knowing which parts of the page refer to what. If we consider the following HTML:

<table border="0">
<tr><td><b>Die Erwartung</b></td></tr>
<tr><td>by Schoenberg, Arnold</td></tr>
<tr><td>Intercord 1991</td></tr>
</table>

whilst we may know that Schoenberg is the composer of "Die Erwartung" and that "Arnold" is his first name, the program reading this information will only know it has to display a table ("<table>...</table>") with three rows ("<tr>...</tr>") and one column ("<td>...</td>"), containing the data as above.

XML

We thus need a mark-up language which describes not so much the layout of a page as the nature of its contents; for this reason, we shall choose XML. XML functions on the same basis as HTML in that it uses tags (as in "<table>"), but differs in that those tags are not part of a given set, but user-definable. Thus, it becomes possible for us to define our own tags in order to best describe the information; the data above could become:

<recording>
<title>Die Erwartung</title>
<composer>
<lastname>Schoenberg</lastname>
<firstname>Arnold</firstname>
</composer>
<label>Intercord</label>
<date>1991</date>
</recording>

Whilst this example lacks the completeness one would require of encyclopaedic data, it does illustrate the usefulness of describing data as opposed to layout; the program reading this information would know that the word "Schoenberg" describes the last name ("<lastname>...</lastname>") of a composer ("<composer>...</composer>"), who composed a recording whose title ("<title>...</title>") is "Die Erwartung".
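
Indeed, once the data is tagged in this way a program can extract the fields directly; the following Python sketch (ours, using the standard xml.etree library) does so for the example above.

import xml.etree.ElementTree as ET

# The recording mark-up given above, held here as a string for simplicity.
doc = """<recording>
  <title>Die Erwartung</title>
  <composer><lastname>Schoenberg</lastname><firstname>Arnold</firstname></composer>
  <label>Intercord</label>
  <date>1991</date>
</recording>"""

recording = ET.fromstring(doc)
print(recording.findtext("title"))               # Die Erwartung
print(recording.findtext("composer/lastname"))   # Schoenberg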

However, whilst XML does provide for syntactic relationships, insofar as it can be inferred that the recording called "Die Erwartung" was written by a composer whose last name is "Schoenberg", and not by another composer in a similarly formulated list of recordings, it does not provide for describing the semantic relationships between the different elements in the data. Whilst the tagging method we have used above is useful for describing a limited set of relationships which are de facto syntactic, and in which the tag names suffice to describe the relationships, it soon reaches its limits. Thus, if we add this second set of data to the initial set:

<recording>
<title>Violinkonzert</title>
<composer>
<lastname>Berg</lastname>
<firstname>Alban</firstname>
</composer>
<label>Decca</label>
<date>1996</date>
</recording>

the program would have no way of knowing that the two pieces are both atonal, as are both composers; that one piece is for violin, the other for orchestra; or that Berg was Schoenberg's pupil, for instance. It thus becomes apparent that a second set of data is required to describe the relationships within the first.

Metadata

Although the notion of describing relationships between different parts of data has existed for a while, particularly in the field of Artificial Intelligence (AI), the idea of applying this type of relationship to non-mathematical data is relatively recent. There are thus two different sets of standards that can be applied: the older set of languages which were used in AI, such as KIF (Knowledge Interchange Format) or OIL (Ontology Inference Layer), and more recent initiatives which have been developed with a more general type of knowledge description in mind, such as RDF (Resource Description Framework), Topic Maps, SHOE or the Dublin Core Initiative.

Some of these standards are not useful for our purpose; for example, the Dublin Core tag-set pertains mainly to bibliographical information, and SHOE is designed as a set of add-ons to HTML pages, which limits the possibilities for tagging the data itself, as we have seen above. The other languages we have described are all similar in that they are based on a tri-partite relationship: that between what is being described, what it is being related to, and the nature of this relationship. Thus for instance, using OIL:

class-def recording

class-def Die Erwartung
  subclass-of recording

class-def composer
  subclass-of artist
  slot-constraint composes
    has-value recording

class-def label
  subclass-of record-company
  slot-constraint releases
    has-value recording

class-def Schoenberg
  subclass-of composer
  slot-constraint style
    has-value atonal

The class here is the item being described; the slot-constraint describes the types of relationship it can have with other elements, which are themselves defined as classes. Note once again that this example is rather inadequate for true encyclopaedic data, but for brevity's sake we must restrict ourselves to simple types of relationship; the flexibility of the model should nevertheless be apparent, as should the way it could be expanded to describe all manner of relationships.

RDF

As far as our purpose is concerned, there is thus no major difference between the standards as languages; what matters most for us is to remain unconstrained in defining our own set of relationships. However, being based on XML, RDF uses the same tag system as our data, and as such it might be preferable: this confers on it the type of syntactic relationship which could make the subclass-of markers used above redundant, and on a practical level it would also eliminate the need to parse the data and the metadata separately. RDF has been criticised for being difficult for humans, as opposed to machines, to read; however, the sheer number of links in an encyclopaedia would make any notation of the semantics involved almost unreadable, and it is likely in any case that the relationships would be edited via an interface rather than manually.
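
To give a flavour of the underlying model, the following Python sketch (ours; the property names are invented for illustration) represents RDF's subject-predicate-object statements as plain triples and queries them.

# RDF reduces every assertion to a (subject, predicate, object) triple.
triples = [
    ("Die Erwartung", "composedBy", "Schoenberg"),   # hypothetical properties
    ("Schoenberg", "hasStyle", "atonal"),
    ("Berg", "pupilOf", "Schoenberg"),
]

def objects(subject, predicate):
    # Return everything related to the given subject by the given predicate.
    return [o for s, p, o in triples if s == subject and p == predicate]

print(objects("Schoenberg", "hasStyle"))   # ['atonal']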

Thus, we are able to establish a technical framework on which to base an electronic encyclopaedia: as the data is structured in its nature, so it should be in its mark-up. Moreover, because it is difficult to describe semantics within a framework built for the data itself rather than for the links between the different parts of the data, it is useful to consider a second, separate set of information which describes the links between the sets of data. Because we are using XML to tag the data, it is convenient to use another XML-based language; for this reason, we shall choose RDF. Now that we have established a technical structure on which to base our thinking, let us consider how to approach the problem of connecting the different parts of the data to each other.

Encyclopaedic Knowledge and Semantics
The dictionary and the encyclopaedia

The reader will probably have noticed that in the examples given above concerning semantic mark-up, we are mixing two different kinds of information: on the one hand, there is what can be termed "dictionary" data, on the other "encyclopaedic" data. The difference between the two is subtle, but important in terms of linking information: a Saussurean approach to the dictionary sense of a word would refer to its exact meaning in the language system as defined by its relationships to other words, whilst the encyclopaedic sense of a word - or more generally of a linguistic entity - could be defined, as implied by Eco, as what we know of that entity, characterised by the relationships the different parts of that knowledge have in turn with what makes up knowledge of those parts. The openness of the latter definition is why Eco (1999:232) suggests that encyclopaedic knowledge can never be fully explored; as such, it is similar in its ungraspableness to the dictionary, no part of which is autonomous. A more practical approach to word definition has been the thesaurus, in the linguistic sense of the term, whereby words in the lexis are classified according to categories; difficulties have arisen, of course, when deciding which categories to use to classify words, with the added difficulty that meanings in language can be seen to change over time, so that a thesaurus is necessarily never authoritative. Some linguists have used the word "dictionary" to describe this process; we shall however refer here to a "thesaurus".

It follows that approaches to ordering data based on the dictionary and encyclopaedic meanings of a given term differ, yet are also complementary; it is necessary to understand the dictionary meaning of a word in order to contextualise the information we have about it or, to use the definitions given above, to make the link between the parts of data forming knowledge of a given term and the knowledge that forms those parts. It is thus advantageous to have access to both types of information; however, rather than suggesting an inexhaustible map of semantic networks which becomes larger than the information it is describing, we shall attempt to make use of this duality to minimise the amount of information we have to consider. As we have seen, the encyclopaedia is what we might term a complex unit: the number of parts and of links between the parts is so great that it becomes difficult to analyse or deconstruct the system. Thus, rather than attempt a complete definition of all of the parts and all of the semantic links, we can use several different mechanisms, drawing on several different approaches to semantic analysis, to narrow down the information we are looking at from the entirety of the collection to a relevant part of it. This set of different approaches is what we shall look at next.

Ontologies

Ontology, in the philosophical sense of the term, deals with the questions of what is and what is not; computer science has re-appropriated the term to describe something slightly different. In the early days of Artificial Intelligence research, it was noted that in order for a program to be able to construct equations as well as resolve them, it needed to understand in what ways it could use the components of those equations, in a similar way to how humans use grammar to define whether the construction of a sentence in a given language is correct or not. Thus for instance, in engineering, it was necessary for the program to understand that weight is measured in kilograms, and that speed is measured in kilometres per hour and designates movement, in order for it to understand how the two properties affect each other and how to measure the result, for example when calculating the momentum of an object. Thus, programmers designed methods of describing the different types of relationships between the different elements which made up the conceptual world of the program. This is an ontology, in the computational sense of the word: what Gruber (1993:199) has defined as "the specification of a conceptualisation", or as Guarino (1995:628) more helpfully puts it, "an axiomatic characterisation of the meaning of a logical vocabulary". The example we used above written in OIL could be seen as an incomplete (and very badly structured) ontology of music recordings. Thus, it is possible to define semantic relationships between elements in the ontology; and if we think of these elements as the parts of the encyclopaedia, it becomes possible to define semantically the relationships between the different parts of the encyclopaedia.

Originally, ontologies were used specifically in the sphere of Artificial Intelligence studies; recently, however, there have been some attempts to apply them to more general and more widespread applications. In most cases, attempts to create non-application-specific ontologies have aimed at universality, striving to cover the entire span of mappable concepts. The most widespread methodology for doing so is to create a top-level ontology, which then becomes progressively more segmented as more detailed ontologies focusing on the specifics of a given subject area are defined. In this tree-like structure and in its desire for completeness, the universal ontology is similar to the thesaurus, although it differs significantly in defining relationships between items rather than categorising them according to their properties. The universal ontology suffers from both its similarities and its differences to the thesaurus: because it attempts to index relationships between items, the number of which is potentially infinite, as opposed to the number of words, which is finite, it risks becoming infinitely complex at the more detailed levels, making it almost impossible to create by design. Moreover, like the thesaurus, it relies on a given and agreed-on set of "primitive" terms to define the relationships; however, as Guarino (1995:635) points out, there are many areas in which it would be very difficult to agree on those primitives and, as Eco (1999:252) puts it, permanently redefining these primitives in terms of others would in itself amount to infinite semiosis. In short, the ontology as a complete solution for semantic linking suffers from its aim: to be, as Cocchiarella - quoted by Guarino (1995:639) - puts it, "the development of the logic of all forms and modes of being", or of being everything to everything.

The nature of encyclopaedic data

This leads us to consider the structure of the semantic relationships within the encyclopaedia. Whilst the parts of data forming the encyclopaedia might individually be structured, as we have seen above, there are differences in the way these parts relate to each other. Encyclopaedic knowledge is de facto unstructured, or rather structured according to an irregular taxonomy; it is impossible to derive a simple structure by which to organise encyclopaedic knowledge, because the structure of knowledge is not simple. Considering the examples used above, it would be hard to account, for instance, for the link between Schoenberg and expressionism using a rigid tree-like structure, as such a taxonomy would hardly allow for links crossing several different categories and levels of the structure. The only way around this would be to authorise such links, which, if one imagines them extended to the entire corpus, would make it look not like a tree at all, but rather like the network of unstructured knowledge we had to start with (Eco, 1999:254).

We have seen that the structure of encyclopaedic knowledge is complex; however, this does not mean that it is impenetrable. Rather, we shall adopt the approach of Eco (1999:254), who suggested that a thesaurus-like classification of words and the unorganised network of information coexist, as complementary ways of thinking about knowledge; in our case, we shall think of the thesaurus and other methodological approaches as ways of traversing the bodies of knowledge and, recognising that each one of these methods is incomplete, attempt to use them together rather than individually. The use of a restricted vocabulary for defining semantic relationships by way of ontologies is an excellent starting point for semantically tagging encyclopaedic data; let us now consider some other methods which could be used to complement it.

Semantic Fields

Semantic fields are a linguistic construct first devised by Trier (1937), aiming to improve understanding of the semantic structure of language. Trier suggested that words can be grouped together by conceptual areas: for instance, "atonal", "diatonic" and "tonal" are all part of a semantic field covering the use of tonality in musical composition. In our case, this notion is useful because it allows for a type of semantic linking between concepts which is based on language structure, rather than on a set of primitive markers, and thus allows for a different set of semantics to emerge. However, there are some problems with this approach. First of all, the structure of language, or rather the structure of the fields, has to be determined; and as Lyons (1977:254) has shown, it is difficult to determine unequivocally what constitutes a field, and which words make it up. Secondly, as Eco points out, practical attempts at applying the theory have only managed to categorise restricted subsets of fields, rather than the entire lexis. Finally, as Trier himself acknowledges, the make-up of language changes over time, and thus any definition of language in terms of semantic fields would only be valid at the particular moment it was created.

Diachronicity

This is a problem which occurs across all methods of semantic linking: because language constantly shifts, and does so unpredictably, a set of semantic links does not necessarily hold through time. To palliate this problem, let us suggest that the make-up of our semantic links should also change through time, rather than being a collection of rules set in stone. The RDF language introduces the concept of reification (Lassila and Swick, 1999:19): a statement which qualifies another statement, hence allowing for the possibility of expressing something of a value judgement about a given semantic statement. This mechanism is extremely useful, as it allows us to open up the creation of semantic links; rather than operating a closed system in which all semantic links are input by a single superhuman being or group of beings (necessarily superhuman, as defining the totality of the semantic links would imply knowledge of the entire library), which would be an impossible task, we can let the index develop over time, allowing users to input new semantic links as required. This would allow not only for the progressive construction of the index across all methods of semantic tagging, thereby solving the problem of infinitely complex ontologies we saw earlier, but also for its longevity, as the links are updated according to changes in the language. The mechanism for weighting such a system could be similar to Google's, whereby the importance of one semantic link over another could be defined as an iterative function composed of parameters such as how recent the link is, how many times it has been followed, how many links the user who created it has contributed and their validity, its consistency with the rest of the body of links, and so on.
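
A minimal Python sketch of such a weighting function follows; the parameters chosen and the way they are combined are entirely our own assumptions, by way of illustration.

def link_score(age_days, times_followed, contributor_reputation):
    # Hypothetical weighting of a user-contributed semantic link: recent,
    # frequently followed links from trusted contributors score higher.
    recency = 1.0 / (1.0 + age_days / 365.0)
    return recency * times_followed * contributor_reputation

# A fresh, well-used link from a reputable contributor outranks a stale one.
print(link_score(age_days=30, times_followed=120, contributor_reputation=0.9))
print(link_score(age_days=900, times_followed=15, contributor_reputation=0.4))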

Collocation

Trier's semantic fields offer a paradigmatic semantic analysis; but we can also consider syntagmatic linguistic semantics, as in, for instance, collocational theory. Collocation essentially suggests that there is a certain determinability in discourse (Martin, 1992:275), or that given a partly-formed sentence structure, there are only a limited number of possibilities with which it could end. This concept was pioneered by Firth, and greatly expanded on by Sinclair and Halliday (Martin, 1992:275); the suggestion being that each word in a sentence calls up a set of others, which is progressively reduced as the context becomes clearer. As Halliday puts it, "collocation attempts to reduce a very large class of items to a very small subclass of words" (quoted in Martin, 1992:275). The appeal for us is twofold: this would allow the possibility of using linguistics to create links between encyclopaedic items, as well as within the lexis itself. Moreover, collocational theory could help determine the nature of the semantic links between items, without it necessarily being made explicit in an ontology. Thus for instance, if one knows Schoenberg is a composer, collocational theory could suggest that he composes musical pieces, and thus link to some of the pieces he composed. Collocation here helps in two ways: first of all in determining that the action of Schoenberg we are most likely to be looking for is "composing"; and then in reducing the set of items to which he could be linked to those that are musical pieces and, by reversing the relationship of composer to piece, to those pieces composed by him. Thus collocation is useful to us in two respects: first, in helping determine which links should be used from one item to another, which can be applied not only in query matching but also in the construction of the ontologies - for instance, if ontology modification were open as described above, collocation could be used to suggest different link types to users inputting new semantics into the system. Secondly, collocation can be used to suggest items to connect to another given a semantic link. The theory does suffer from some drawbacks: most attempts to draw out a collocational map have been based on statistical analysis, which, given the number of possible ways of combining words into grammatically correct sentences, could take some time to complete. Other, more conceptual attempts have stumbled on the problem of restructuring the map in order to reduce the amount of analysis required.
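
As a sketch of the statistical approach just mentioned, the following Python fragment (ours, over a toy corpus and an arbitrary window size) counts how often one word follows another, the raw material of a collocational map.

from collections import Counter

corpus = "Schoenberg composed atonal pieces and Berg composed atonal pieces".split()
window = 2          # how many following words count as collocates
pairs = Counter()

# Count how often each word is followed, within the window, by each other word.
for i, word in enumerate(corpus):
    for neighbour in corpus[i + 1 : i + 1 + window]:
        pairs[(word, neighbour)] += 1

print(pairs[("composed", "atonal")])   # 2: "composed" strongly calls up "atonal"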

Componential Analysis

In order for us to apply collocational analysis to encyclopaedic data, it is necessary for us to know what the data refers to. For this we might turn to componential analysis, as suggested by Katz (Katz and Fodor 1963, Katz and Postal 1964), which we can also apply to our use of the lexis. Katz suggests that each lexical item has a defined set of characterisations expressed as a set of components, or "semantic markers" (Lehrer, 1974:40). Thus, it is possible to describe each item of the lexis in terms of a set of "linguistic universals" which, when applied together, accurately describe the item in opposition to others. Transferring this analysis from the lexis to the encyclopaedia, for "Schoenberg" we might have:

"Shoenberg" (composer)(modern)(atonal)

where the terms in parentheses are the semantic markers. Note that Katz introduced the principle of redundancy rules, which save us from adding markers such as (physical object), (animate) or (living), by virtue of these being inferable from the fact that Schoenberg is a composer. It follows from this that each set of markers must in turn be defined by others; this brings us once again to the problem of the universality of the markers, which must be either absolute and thus restrictive, or flexible, which amounts to infinite semiosis (Eco, 1976:121). Note that this is a simplification of the model adjusted for our purpose, and as such we shall not consider the many other criticisms levelled at it, as they are beyond the scope of this essay.
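
The following Python sketch (ours; the marker inventory and rules are hypothetical) illustrates how redundancy rules allow the stored markers to be expanded into the full set.

# Each marker may imply others; redundancy rules let us omit the inferable ones.
implies = {
    "composer": ["human"],
    "human": ["animate", "physical object"],
}

def expand(markers):
    # Close a set of markers under the redundancy rules given above.
    result = set(markers)
    queue = list(markers)
    while queue:
        for implied in implies.get(queue.pop(), []):
            if implied not in result:
                result.add(implied)
                queue.append(implied)
    return result

print(expand({"composer", "modern", "atonal"}))
# the stored markers, plus the inferred: human, animate, physical object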

WordNet and lexical ontologies

Finally, let us consider the case of WordNet. WordNet is an initiative by Princeton University to organise the lexis into a set of given semantic relationships, of the type we described when examining the limitations of keyword search; in this case, synonymy, antonymy, hyponymy/hypernymy ("kind-of"), and holonymy/meronymy ("part-of") (Miller et al., 1993:8). These relationships are defined, in the case of WordNet, by way of an ontology. The difference between the WordNet ontology and the type of ontologies we have been considering is that whilst we have been looking at encyclopaedic data, WordNet is concerned with lexical or dictionary information. The types of links contained in a dictionary and an encyclopaedia are, as we have seen, different; yet they supplement each other, both in the construction of the encyclopaedic network (for instance, the encyclopaedic entities "tonality" and "atonality" are just as opposed as their dictionary equivalents), and in its navigation (for instance, retrieving information about "atonal" music when the user requests information about "pan-tonal" compositions). A lexical ontology such as WordNet describes the links between lexical items, using semantics specifically suited to such a task, such as the ones we have seen Lyons (1977:279-311) describe; this could be placed alongside our encyclopaedic ontology, in which semantic relationships are not necessarily as restricted, thereby providing us with a supplementary set of semantic links with which to find relevant information.
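
By way of illustration, WordNet can today be queried programmatically; the following Python sketch assumes the NLTK library and its WordNet corpus are installed, and the exact synsets returned will depend on the WordNet version used.

from nltk.corpus import wordnet as wn

# "Kind-of" links: the hypernyms of the first synset for "rose".
rose = wn.synsets("rose")[0]
print(rose.hypernyms())   # e.g. the synset for "shrub": a rose is a kind of shrub

# Synonymy: all lemma names grouped within the same synset.
print([lemma.name() for lemma in rose.lemmas()])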

Discussion

In this section we have examined various methodologies for semantically marking information, both at a dictionary and at an encyclopaedic level. In most cases, these approaches are insufficient in themselves, as seen from the criticisms levelled against them; in many cases, however, they can be seen as complementary. Thus for instance, the dictionary semantics of WordNet supplement the concept of semantic markers in Katz's componential analysis: although it does not eliminate the circularity implied by semantic markers, being able to refer to semantic markers on the basis of a given network of semantic references greatly helps in reducing the fuzziness involved in determining which markers to use on a practical level. This can be applied to any approach which requires "primitive terms" as a basis for analysis, such as ontologies. Moreover, collocational theory can be used in defining the links in an ontology; for instance, given two encyclopaedic entities, knowledge of their nature could allow one to reduce the number of potential links between them, as in our example of "Schoenberg" and "Die Erwartung" (the link here would be "composed"). Finally, as a practical application of collocation would indeed require knowledge of the nature of the entities at play, componential analysis could be used to define the entities in terms of semantic markers. All of the methods we have seen so far suffer from the rigidity of their structure; by applying them together, however, we are able to create a complex set of semantic relationships, operating on different bases, and thereby to create a structure whose nature is similar to that of the corpus. The strength of these methods thus lies not in their individual application, which as we have seen is in most cases somehow flawed, but rather in their complementarity, as they each allow us to reduce the amount of information in the corpus from its whole to its relevant parts. Moreover, by allowing users to affect the index, fuzziness in the lexis can be resolved in a more natural way, allowing for both the progressive construction of the index and its evolution through time.

It will be noted that throughout our exploration of these models, little attention was given to the debates surrounding their validity from a linguistic point of view; our aim, however, is not so much to assess this validity as to examine whether these methods can be used for our purpose, which is the retrieval of information relevant to a query from an encyclopaedic corpus. Moreover, it can be argued that very little attention has been given to the detail of these methods, and as such that their usefulness can hardly be evaluated. This is to a certain extent true; however, it was not our intention to provide an in-depth analysis of these methods, but rather an overview, as a means of understanding their general way of working and how they could be applied. A detailed account of the application of these methods to our purpose is unfortunately beyond the scope of this essay; however, insofar as we have understood the limitations of these methods and how to use them to our advantage, we consider our initial aim fulfilled, which was to provide an overview of techniques applicable to our purpose.

Conclusion

Thus, we have been able to introduce the concepts necessary to understanding the basics of semantically related information, within the particular field of encyclopaedic data. We have seen that traditional keyword-search methods are not adequate for retrieving relevant information, and that this is due to two factors: the lack of structure in the data tagging, and the limits of the semantics involved in such a search. Consequently, we have examined more efficient methods of marking up the data from a technological point of view, which involve using the highly structured XML language to tag the data, and the XML-based RDF framework to tag the semantics across the data. We noted the difference between encyclopaedic and dictionary data, and considered the way they could be used complementarily; we also saw that although all the methods of semantic linking we considered were somehow flawed individually, used together they could provide a very efficient basis for relevant information retrieval. Thus, this essay serves as an introduction to the problem of efficient retrieval of encyclopaedic data, assessing the limitations of current search systems and proposing methods of palliating them.

In doing so, we have set a strong basis on which to build further work, although many of the methods we proposed would have to be discussed in greater detail, especially from a practical point of view. Thus, it would be necessary to devise ways of implementing both the technical and the linguistic methodologies discussed here. The way the data is structured would have to be decided on, and implemented accordingly with XML; similar work would have to be carried out for the structure of the metadata using RDF. Both these tasks imply a sufficient understanding of the semantic methods employed, which in turn need to be both adapted to the specifications of our application and integrated with each other. All of this is still very much theoretical, in that it does not require the construction of a working model; but should this be considered, it would take a great amount of work to build an application able to process such a large amount of semantic information in a sufficiently short time for it to be used as a search engine.

Therefore, whilst we have seen the advantages of linking information semantically and suggested methodologies for doing so, there is a great amount of ground to cover before it becomes possible to apply them. Whilst computers have indeed made information available, that information has still to be made accessible; and given the challenges of doing so, it may still be a few more years before Bush's (1945:14) vision of an efficient and discerning "intimate supplement to man's memory" becomes a reality.

Bibliography

Bush, Vannevar (1945) "As We May Think" in The Atlantic Monthly 176(1): 101-108

Decker, Stefan, Melnik, Sergey, Van Harmelen, Frank, Fensel, Dieter, Klein, Michel, Broekstra, Jeen, Erdmann, Michael, and Horrocks, Ian (1999) "The Semantic Web: The Roles of XML and RDF" in IEEE Internet Computing, 4(5): 63-74

Eco, Umberto (1999) "Kant and the Platypus: Essays on Language and Cognition" (London: Secker & Warburg)

Eco, Umberto (1976) "A Theory of Semiotics" (London: Indiana University Press)

Gruber, T. R. (1993) "A Translation Approach to Portable Ontologies" in Knowledge Acquisition, 5(2):199-220

Guarino, Nicola (1995) "Formal Ontology, Conceptual Analysis and Knowledge Representation" in International Journal of Human and Computer Studies, 43(5/6): 625-640

Lehrer, Adrienne (1974) "Semantic Fields and Lexical Structure" (Amsterdam, London: North-Holland Publishing Co.)

Lyons, John (1977) "Semantics" Vol. 1 (Cambridge: Cambridge University Press)

Lyons John (1995) "Linguistic Semantics: an Introduction" (Cambridge: Cambridge University Press)

Martin, J.R. (1992) "English Text: System and Structure" (Philadelphia: John Benjamins Pub. Co.)

Webography

Alschuler, Liora (2000) "Going to Extremes" Proceedings of the GCA's Extreme Markup Languages Conference 2000; available online from http://www.xml.com/pub/a/2000/09/13/extremes.html?page=2

Berners-Lee, Tim (1998) "Semantic Web Roadmap" available online from http://www.w3.org/DesignIssues/Semantic

Brin, Sergey and Page, Lawrence (1998) "The Anatomy of a Large-Scale Hypertextual Web Search Engine" Computer Science Department, Stanford University; available online from http://www.stanford.edu/class/cs240/readings/google.pdf

Guarino, Nicola, Gangemi, Aldo and Oltramari, Alessandro (2001) "Conceptual Analysis of Lexical Taxonomies: The Case of WordNet Top-Level" Proceedings of the International Conference on Formal Ontologies in Information Systems (FOIS) 2001; available online from http://www.ladseb.pd.cnr.it/infor/Ontology/Papers/FOIS2001-Final.pdf

Heflin, Jeffrey Douglas (2001) "Towards the Semantic Web: Knowledge Representation in a Dynamic, Distributed Environment" Thesis (PhD) directed by Professor James A. Hendler, Department of Computer Science, University of Maryland; available online from http://www.cse.lehigh.edu/~heflin/pubs/heflin-thesis.pdf

Lassila, Ora and Swick, Ralph R. (1999) "Resource Description Framework (RDF) Model and Syntax Specification" W3C Recommendation; available online from http://www.w3.org/TR/REC-rdf-syntax

Miller, George A., Beckwith, Richard, Fellbaum, Christiane, Gross, Derek and Miller, Katherine (1993) "Introduction to WordNet: An On-line Lexical Database" Cognitive Science Laboratory, Princeton University; available online from http://www.cogsci.princeton.edu/~wn/obtain/5papers.pdf

Pepper, Steve and Moore, Graham (eds) (2001) "XML Topic Maps Specification 1.0" XML Topic Maps Specification; available online from http://www.topicmaps.org/xtm/1.0/

All links are correct at time of printing [May 2002].
