
Few topics cause more concern and confusion in the web community than the Semantic Web. The Semantic Web has been described as a vision of a web that goes beyond billions of linked web documents that lie in wait to be indexed by global search engines; it is a web where the semantics, or meaning, behind the content can be put to practical use. To some, this harks back to the failed promises of Artificial Intelligence and the non-delivery of systems that were supposed to work out the family’s budget and intelligently order groceries for the week. The World Wide Web Consortium’s (W3C) extensive work on the Semantic Web has also been characterized as taking place in a semantic “cloud” that has obscured and detracted from much-needed web standardization efforts.
If you look beyond the hype, the Semantic Web can, in some ways, be seen as a natural progression that comes from building more capabilities into every new web technology. A simple sequence describing the evolution of the Semantic Web might begin with the chaotic stage of early HTML documents, where a minimal set of tags described all manner of content. Along the way, it was realized that it would be helpful to have concepts like “author” described in more meaningful tags than “h1” or “bold”. XML emerged as the solution to ensure that the syntax and content of documents were consistent and to give applications better ways of working with groups of documents that are authored for a common purpose, such as finding aids and full text materials marked up in TEI. XML uses constructs called DTDs and Schemas to tightly control the structure of documents, and it was met with great enthusiasm by web developers who could now share information using tags with labels like “subject” that better reflect the content itself.
XML is arguably a key building block in the Semantic Web, but the first real manifestation of the W3C’s semantic work was the publication of the Resource Description Framework (RDF) specification for encoding and sharing metadata. Metadata is sometimes called “data about data”, and creating it has been one of the main activities of libraries for several centuries. The premise of RDF is that metadata can be modeled as a set of statements, each of which indicates a piece of information about something else. In RDF parlance, these are called “triples”. For example, the statement “Tim Severin is the creator of The Brendan Voyage” consists of three parts (Tim Severin, Creator, Brendan Voyage) and can be written with RDF in XML as:
<rdf:RDF xmlns:rdf="http://www.w3.org/TR/WD-rdf-syntax#" xmlns:dc="http://purl.org/dc/elements/1.0/">
<rdf:Description rdf:about="http://address_for_Brendan_Voyage">
<dc:Creator>Tim Severin</dc:Creator>
</rdf:Description>
</rdf:RDF>
This type of statement is called an assertion, and RDF specifies that every part of the assertion can be assigned a URI (Uniform Resource Identifier), much like a URL but different in the sense that it doesn’t have to map to a real web address and can represent concepts (“Creator”), living entities (“Tim Severin”), and anything else in the known and imagined universe, from animals to laundry lists. The “dc” in the example stands for Dublin Core and is associated with a special URI called a namespace (“http://purl.org/dc/elements/1.0/”) that, in turn, is associated with a set of metadata elements. On its own, this is somewhat useful, but one of the most compelling aspects of RDF is combining elements from different metadata sets. If I had a set of elements specifying a rating system, for example, I could add a namespace (xmlns) reference that would allow me to insert my rating as shown:
<rdf:RDF xmlns:rdf="http://www.w3.org/TR/WD-rdf-syntax#" xmlns:dc="http://purl.org/dc/elements/1.0/" xmlns:ar="http://www.for.me/ar/elements/">
<rdf:Description rdf:about="http://address_for_Brendan_Voyage">
<ar:Rating>Excellent</ar:Rating>
<dc:Creator>Tim Severin</dc:Creator>
</rdf:Description>
</rdf:RDF>
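To make the mixing of vocabularies concrete, here is a minimal Python sketch, not a full RDF parser, that reads the combined description with the standard library's xml.etree.ElementTree module and flattens it into (subject, predicate, object) triples. It assumes the Description element is properly closed and that every property holds plain text; the function name extract_triples is made up for illustration, and the namespace URIs are copied from the example as written.

```python
import xml.etree.ElementTree as ET

# Namespace URI copied from the article's example, not checked against
# any current W3C recommendation.
RDF_NS = "http://www.w3.org/TR/WD-rdf-syntax#"

RDF_XML = """\
<rdf:RDF xmlns:rdf="http://www.w3.org/TR/WD-rdf-syntax#"
         xmlns:dc="http://purl.org/dc/elements/1.0/"
         xmlns:ar="http://www.for.me/ar/elements/">
  <rdf:Description rdf:about="http://address_for_Brendan_Voyage">
    <ar:Rating>Excellent</ar:Rating>
    <dc:Creator>Tim Severin</dc:Creator>
  </rdf:Description>
</rdf:RDF>
"""


def extract_triples(rdf_xml):
    """Return (subject, predicate, object) tuples for plain-text properties."""
    root = ET.fromstring(rdf_xml)
    triples = []
    for description in root.findall(f"{{{RDF_NS}}}Description"):
        subject = description.get(f"{{{RDF_NS}}}about")
        for prop in description:
            # ElementTree writes tags as "{namespace}local"; joining the two
            # parts gives the full URI of the predicate.
            namespace, local = prop.tag[1:].split("}")
            value = (prop.text or "").strip()
            triples.append((subject, namespace + local, value))
    return triples


triples = extract_triples(RDF_XML)
for triple in triples:
    print(triple)
```

Each predicate comes back as a full URI, which is what lets elements from different metadata sets sit side by side in one description without clashing.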
RDF detractors are quick to point out that this type of “mixing and matching” for metadata has been slow to ignite the kind of interest that has followed HTML and XML. While there is no doubt that RDF has not captured as much of the spotlight, it is worth noting that:
- RDF is concerned with metadata, which isn’t always appreciated if you don’t have occasion to ponder information retrieval or if you think that keyword indexing can solve most information needs.
- The syntax is somewhat convoluted, even compared to HTML and XML, and may be better represented by labeled graphs or other techniques common in Computer Science but often confusing to the novice. Tim Berners-Lee, the inventor of the World Wide Web, has proposed a much simpler syntax for RDF called Notation 3 which looks something like:
:tim :creator "The Brendan Voyage" .
In addition to the need to appreciate metadata and the syntax issues, another difficulty with the Semantic Web is that RDF is only the first step along the way. Going beyond assertions to support any high level of inference, where a computer can automatically pull together concepts, really requires some understanding of RDF Schemas and ontology languages like DAML+OIL. RDF Schema allows concepts to be specified and related, for example, specifying that a “writer” is a type of “creator”. Ontologies are also formal representations of entities and concepts, and languages like DAML+OIL differ from RDF Schema in that they provide even more options for defining relationships. For example, using Notation 3, we could have this relationship:
dc:Creator daml:equivalentTo red:PreparerName .
This would allow a program to “infer” that the “PreparerName” element from the Real Estate Data (red) Consortium schema, which identifies the person who prepared a real estate agreement, is equivalent to “Creator” from Dublin Core, by way of the “equivalentTo” property from DAML+OIL. This means that in addition to titles of monographs that the author I am researching has written, I could also receive documents that represent the author’s activities as a lawyer from a semantically-aware library system.
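A rough sense of what such inference involves can be given in a few lines of Python. This is a toy sketch and not how DAML+OIL reasoners are actually built: the document identifiers, the person, the ex:Writer property, and the function names are all hypothetical, and prefixed names stand in for full URIs. It treats equivalence as symmetric and the “writer is a type of creator” relationship as one-directional.

```python
# All identifiers below are invented for illustration.
DC_CREATOR = "dc:Creator"
RED_PREPARER = "red:PreparerName"
EX_WRITER = "ex:Writer"

TRIPLES = [
    ("doc:monograph", DC_CREATOR, "Jane Doe"),
    ("doc:lease-agreement", RED_PREPARER, "Jane Doe"),
    ("doc:essay", EX_WRITER, "Jane Doe"),
    ("doc:other-book", DC_CREATOR, "John Roe"),
]

# Stand-ins for "dc:Creator daml:equivalentTo red:PreparerName" and for an
# RDF Schema statement that a writer is a type of creator.
EQUIVALENTS = [(DC_CREATOR, RED_PREPARER)]
SUB_PROPERTIES = [(EX_WRITER, DC_CREATOR)]  # (narrower, broader)


def implied_by(target, equivalents=EQUIVALENTS, sub_properties=SUB_PROPERTIES):
    """Return every property whose statements also count as `target` statements."""
    matched = {target}
    frontier = [target]
    while frontier:
        current = frontier.pop()
        neighbours = set()
        for a, b in equivalents:
            # Equivalence is symmetric: follow it in both directions.
            if a == current:
                neighbours.add(b)
            if b == current:
                neighbours.add(a)
        for narrower, broader in sub_properties:
            # A narrower property implies the broader one, not the reverse.
            if broader == current:
                neighbours.add(narrower)
        for prop in neighbours - matched:
            matched.add(prop)
            frontier.append(prop)
    return matched


def find_documents(person, target, triples=TRIPLES):
    """Return sorted subjects linked to `person` by `target` or anything implying it."""
    properties = implied_by(target)
    return sorted(s for s, p, o in triples if p in properties and o == person)


print(find_documents("Jane Doe", DC_CREATOR))
print(find_documents("Jane Doe", EX_WRITER))
```

Asking for everything Jane Doe created picks up the lease agreement and the essay alongside the monograph, while asking only for what she wrote does not widen to the other two properties.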
RDF Schemas and ontology work are crucial to the success of the Semantic Web, and have tended to emerge in subject areas that lend themselves well to defining relationships between concepts, for example, dictionaries and vocabularies, thesauri, and many branches of science. For libraries, the value of the Semantic Web may have less to do with changes in bibliographic databases than with integrating resources that don’t often show up in traditional cataloguing. Scientific datasets, for example, often don’t have access points that translate well to bibliographic descriptions, and they bring in a multitude of concepts that may be critical for the resource community the datasets are produced for. DNA sequences, solar wind movements, and other types of scientific data require specialized query languages. RDF holds the promise of wiring in the metadata and schemas or ontologies that address the complexity of the semantics of the data, rather than trying to cram this level of description into Dublin Core or MARC.
Another intriguing use of Semantic Web activity is to tie together library functions with external systems. For example, the work of the RDF Calendar initiative could be expanded to support queries like “find me all the works on XML that are due in the library before I go on vacation”. The Semantic Web could provide the plumbing to allow a system to talk to an individual’s RDF-enabled calendar system to determine the timeframe identified by the use of the term “vacation”. RDF and Semantic Web-based query languages offer a glimpse of how the semantics and vocabularies of different research communities may be combined in supporting information retrieval. It isn’t likely that the results will come close to the early promises of Artificial Intelligence, but libraries are better placed than most organizations both to appreciate the importance of sharing metadata and to understand the benefits of interoperable vocabularies and semantics. The Semantic Web may turn out to be far less audacious in practice than in concept, but it could be an important tool for trying to provide services for the growing stream of diverse web-based content and services that flows by our libraries.