How to publish Linked Data on the Web
http://www4.wiwiss.fu-berlin.de/bizer/pub/LinkedDataTutorial/
This document provides a tutorial on how to publish Linked Data on the Web. After a general overview of the concept of Linked Data, we describe several practical recipes for publishing information as Linked Data on the Web.
The goal of Linked Data is to enable people to share structured data on the Web as easily as they can share documents today.
The term Linked Data was coined by Tim Berners-Lee in his Linked Data Web architecture note. The term refers to a style of publishing and interlinking structured data on the Web. The basic assumption behind Linked Data is that the value and usefulness of data increases the more it is interlinked with other data. In summary, Linked Data is simply about using the Web to create typed links between data from different sources.
The basic tenets of Linked Data are to:
Applying both principles leads to the creation of a data commons on the Web, a space where people and organizations can post and consume data about anything. This data commons is often called the Web of Data or Semantic Web.
The Web of Data can be accessed using Linked Data browsers, just as the traditional Web of documents is accessed using HTML browsers. However, instead of following links between HTML pages, Linked Data browsers enable users to navigate between different data sources by following RDF links. This allows the user to start with one data source and then move through a potentially endless Web of data sources connected by RDF links. For instance, while looking at data about a person from one source, a user might be interested in information about the person's home town. By following an RDF link, the user can navigate to information about that town contained in another dataset.
Just as the traditional document Web can be crawled by following hypertext links, the Web of Data can be crawled by following RDF links. Working on the crawled data, search engines can provide sophisticated query capabilities, similar to those provided by conventional relational databases. Because the query results themselves are structured data, not just links to HTML pages, they can be immediately processed, thus enabling a new class of applications based on the Web of Data.
The glue that holds together the traditional document Web is the hypertext links between HTML pages. The glue of the data web is RDF links. An RDF link simply states that one piece of data has some kind of relationship to another piece of data. These relationships can have different types. For instance, an RDF link that connects data about people can state that two people know each other; an RDF link that connects information about a person with information about publications in a bibliographic database might state that a person is the author of a specific paper.
There is already a lot of structured data accessible on the Web through Web 2.0 APIs such as the eBay, Amazon, Yahoo, and Google Base APIs. Compared to these APIs, Linked Data has the advantage of providing a single, standardized access mechanism instead of relying on diverse interfaces and result formats. This allows data sources to be:
Having provided a background to Linked Data concepts, the rest of this document is structured as follows: Section 2 outlines the basic principles of Linked Data. Section 3 provides practical advice on how to name resources with URI references. Section 4 discusses terms from well-known vocabularies and data sources which should be reused to represent information. Section 5 explains what information should be included in RDF descriptions that are published on the Web. Section 6 gives practical advice on how to generate RDF links between data from different data sources. Section 7 presents several complete recipes for publishing different types of information as Linked Data on the Web using existing Linked Data publishing tools. Section 8 discusses testing and debugging Linked Data sources. Finally, Section 9 gives an overview of alternative discovery mechanisms for Linked Data on the Web.
This chapter describes the basic principles of Linked Data. As Linked Data is closely aligned to the general architecture of the Web, we first summarize the basic principles of this architecture. Then we give an overview of the RDF data model and recommend how the data model should be used in the Linked Data context.
This section summarizes the basic principles of the Web Architecture and introduces terminology such as resource and representation. For more detailed information please refer to the Architecture of the World Wide Web, Volume One W3C recommendation and the current findings of the W3C Technical Architecture Group (TAG).
To publish data on the Web, we first have to identify the items of interest in our domain. They are the things whose properties and relationships we want to describe in the data. In Web Architecture terminology, all items of interest are called resources.
In 'Dereferencing HTTP URIs' the W3C Technical Architecture Group (TAG) distinguishes between two kinds of resources: information resources and non-information resources (also called 'other resources'). This distinction is quite important in a Linked Data context. All the resources we find on the traditional document Web, such as documents, images, and other media files, are information resources. But many of the things we want to share data about are not: people, physical products, places, proteins, scientific concepts, and so on. As a rule of thumb, all “real-world objects” that exist outside of the Web are non-information resources.
Resources are identified using Uniform Resource Identifiers (URIs). In the context of Linked Data, we restrict ourselves to using HTTP URIs only and avoid other URI schemes such as URNs and DOIs. HTTP URIs make good names for two reasons: They provide a simple way to create globally unique names without centralized management; and URIs work not just as a name but also as a means of accessing information about a resource over the Web. The preference for HTTP over other URI schemes is discussed at length in the W3C TAG draft finding URNs, Namespaces and Registries.
Information resources can have representations. A representation is a stream of bytes in a certain format, such as HTML, RDF/XML, or JPEG. For example, an invoice is an information resource. It could be represented as an HTML page, as a printable PDF document, or as an RDF document. A single information resource can have many different representations, e.g. in different formats, resolution qualities, or natural languages.
URI Dereferencing is the process of looking up a URI on the Web in order to get information about the referenced resource. The W3C TAG draft finding about Dereferencing HTTP URIs introduced a distinction on how URIs identifying information resources and non-information resources are dereferenced:
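Information resources: when a URI identifying an information resource is dereferenced, the server typically returns a representation of the resource with the HTTP response code 200 OK.
Non-information resources: these cannot be dereferenced directly. Instead, the server sends the client the URI of an information resource that describes the non-information resource, using the HTTP response code 303 See Other. This is called a 303 redirect. In a second step, the client dereferences this new URI and gets a representation describing the original non-information resource.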
Note: There are two approaches that data publishers can use to provide clients with URIs of information resources describing non-information resources: Hash URIs and 303 redirects. This document focuses mostly on the 303 redirect approach. See Section 4.3 of Cool URIs for the Semantic Web for a discussion of the tradeoffs between both approaches.
HTML browsers usually display RDF representations as raw RDF code, or simply download them as RDF files without displaying them. This is not very helpful to the average user. Therefore, serving a proper HTML representation in addition to the RDF representation of a resource helps humans to figure out what a URI refers to.
This can be achieved using an HTTP mechanism called content negotiation. HTTP clients send HTTP headers with each request to indicate what kinds of representation they prefer. Servers can inspect those headers and select an appropriate response. If the headers indicate that the client prefers HTML, then the server can generate an HTML representation. If the client prefers RDF, then the server can generate RDF.
Content negotiation for non-information resources is usually implemented in the following way. When a URI identifying a non-information resource is dereferenced, the server sends a 303 redirect to an information resource appropriate for the client. Therefore, a data source often serves three URIs related to each non-information resource, for instance:
The picture below shows how dereferencing an HTTP URI identifying a non-information resource plays together with content negotiation:
The client performs an HTTP GET request on a URI identifying a non-information resource, in this case a vocabulary URI. If the client is a Linked Data browser and would prefer an RDF/XML representation of the resource, it sends an Accept: application/rdf+xml header along with the request. HTML browsers would send an Accept: text/html header instead.
The server answers with the HTTP 303 See Other response code and sends the client the URI of an information resource describing the non-information resource. In the RDF case, this is the RDF content location.
The client then requests a representation of this information resource, again asking for application/rdf+xml.
The server returns an RDF/XML document describing the original resource, the vocabulary URI.
A complete example of an HTTP session for dereferencing a URI identifying a non-information resource is given in Appendix A.
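The server-side decision in this exchange can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `/resource/`, `/data/` and `/page/` path layout is a hypothetical convention chosen for the sketch, the function name `dereference` is made up, and the Accept header is matched naively by substring instead of being parsed with quality values as a production server would do.

```python
from typing import Dict, Tuple

# Hypothetical URI layout: non-information resources live under /resource/.
RESOURCE_PREFIX = "/resource/"


def dereference(path: str, accept: str) -> Tuple[int, Dict[str, str]]:
    """Return (status code, response headers) for a GET request on `path`."""
    if not path.startswith(RESOURCE_PREFIX):
        return 404, {}
    name = path[len(RESOURCE_PREFIX):]
    # Naive check: a real server would parse media ranges and q-values.
    wants_rdf = "application/rdf+xml" in accept.lower()
    target = ("/data/" if wants_rdf else "/page/") + name
    # Vary tells caches that the redirect target depends on the Accept header.
    return 303, {"Location": target, "Vary": "Accept"}


rdf_response = dereference("/resource/Berlin", "application/rdf+xml")
html_response = dereference("/resource/Berlin", "text/html")
```

A Linked Data browser would be redirected to the RDF document and an HTML browser to the HTML page, while the URI of the non-information resource itself never returns a representation directly.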
In an open environment like the Web it often happens that different information providers talk about the same non-information resource, for instance a geographic location or a famous person. As they do not know about each other, they introduce different URIs for identifying the same real-world object. For instance: DBpedia, a data source providing information that has been extracted from Wikipedia, uses the URI http://dbpedia.org/resource/Berlin to identify Berlin. Geonames, a data source providing information about millions of geographic locations, uses the URI http://sws.geonames.org/2950159/ to identify Berlin. As both URIs refer to the same non-information resource, they are called URI aliases. URI aliases are common on the Web of Data, as it cannot realistically be expected that all information providers agree on the same URIs to identify a non-information resource. URI aliases provide an important social function to the Web of Data as they are dereferenced to different descriptions of the same non-information resource and thus allow different views and opinions to be expressed. In order to still be able to track that different information providers speak about the same non-information resource, it is common practice that information providers set owl:sameAs links to URI aliases they know about. This practice is explained in Section 6 in more detail.
Within this tutorial we use a new term which is not part of the standard Web Architecture terminology but useful in the Linked Data context. The term is associated description and it refers to the description of a non-information resource that a client obtains by dereferencing a specific URI identifying this non-information resource. For example: Dereferencing the URI http://dbpedia.org/resource/Berlin asking for application/rdf+xml gives you, after a redirect, an associated description that is equal to the RDF description of http://dbpedia.org/resource/Berlin within the information resource http://dbpedia.org/data/Berlin. Using this new term makes sense in a Linked Data context as it is common practice to use multiple URI aliases to refer to the same non-information resource and also because different URI aliases dereference to different descriptions of the resource.
When publishing Linked Data on the Web, we represent information about resources using the Resource Description Framework (RDF). RDF provides a data model that is extremely simple on the one hand but strictly tailored towards Web architecture on the other hand.
In RDF, a description of a resource is represented as a number of triples. The three parts of each triple are called its subject, predicate, and object. A triple mirrors the basic structure of a simple sentence, such as this one:
    Chris        has the email address    chris@bizer.de .
    (subject)    (predicate)              (object)
The subject of a triple is the URI identifying the described resource. The object can either be a simple literal value, like a string, number, or date; or the URI of another resource that is somehow related to the subject. The predicate, in the middle, indicates what kind of relation exists between subject and object, e.g. this is the name or date of birth (in the case of a literal), or the employer or someone the person knows (in the case of another resource). The predicate is a URI too. These predicate URIs come from vocabularies, collections of URIs that can be used to represent information about a certain domain. Please refer to Section 4 for more information about which vocabularies to use in a Linked Data context.
Some people like to imagine a set of RDF triples as an RDF graph. The URIs occurring as subject and object URIs are the nodes in the graph, and each triple is a directed arc (arrow) that connects the subject to the object.
Two principal types of RDF triples can be distinguished, Literal Triples and RDF Links:
RDF links are the foundation for the Web of Data. Dereferencing the URI that appears as the destination of a link yields a description of the linked resource. This description will usually contain additional RDF links which point to other URIs that in turn can also be dereferenced, and so on. This is how individual resource descriptions are woven into the Web of Data. This is also how the Web of Data can be navigated using a Linked Data browser or crawled by the robot of a search engine.
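To make the subject, predicate and object roles concrete, the following Python sketch models a triple and prints it in an N-Triples-like form, one statement per line. The class and function names are illustrative only and do not belong to any RDF library, and the literal escaping is deliberately simplified.

```python
from dataclasses import dataclass
from typing import Union


@dataclass(frozen=True)
class URI:
    value: str


@dataclass(frozen=True)
class Literal:
    value: str


@dataclass(frozen=True)
class Triple:
    subject: URI                 # the described resource
    predicate: URI               # predicates are always URIs
    obj: Union[URI, Literal]     # either a literal value or another resource


def term(t: Union[URI, Literal]) -> str:
    """Render one term: URIs in angle brackets, literals in double quotes."""
    if isinstance(t, URI):
        return f"<{t.value}>"
    # Simplified escaping: only backslashes and double quotes are handled.
    escaped = t.value.replace("\\", "\\\\").replace('"', '\\"')
    return f'"{escaped}"'


def to_ntriples(triple: Triple) -> str:
    return f"{term(triple.subject)} {term(triple.predicate)} {term(triple.obj)} ."


FOAF = "http://xmlns.com/foaf/0.1/"
chris = URI("http://www.bizer.de#chris")

# The object is a literal value.
name_triple = Triple(chris, URI(FOAF + "name"), Literal("Chris Bizer"))
# The object is the URI of another resource.
knows_triple = Triple(chris, URI(FOAF + "knows"),
                      URI("http://richard.cyganiak.de/foaf.rdf#cygri"))

print(to_ntriples(name_triple))
print(to_ntriples(knows_triple))
```

The first statement has a literal as its object, the second points at another resource, which is the distinction the rest of this section builds on.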
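The follow-your-nose behaviour described here can be sketched as a breadth-first traversal. To keep the example runnable without network access, the Web is simulated by a dictionary; all `example.org` URIs and property names are made up for illustration, and the rule that any object starting with `http://` is an RDF link is a simplification.

```python
from collections import deque
from typing import Callable, Dict, List, Tuple

Triple = Tuple[str, str, str]

# Simulated Web of Data: URI -> triples returned when that URI is dereferenced.
WEB: Dict[str, List[Triple]] = {
    "http://example.org/richard": [
        ("http://example.org/richard", "basedNear", "http://example.org/Berlin"),
    ],
    "http://example.org/Berlin": [
        ("http://example.org/Berlin", "population", "3405259"),
        ("http://example.org/Berlin", "memberOf", "http://example.org/GermanCities"),
    ],
    "http://example.org/GermanCities": [
        ("http://example.org/GermanCities", "member", "http://example.org/Hamburg"),
    ],
}


def is_rdf_link(triple: Triple) -> bool:
    # Simplification: an object that looks like an HTTP URI is a link,
    # anything else is treated as a literal.
    return triple[2].startswith("http://")


def crawl(start: str,
          dereference: Callable[[str], List[Triple]],
          max_uris: int = 100) -> List[str]:
    """Visit URIs breadth-first by following RDF links; return the visit order."""
    visited: List[str] = []
    seen = {start}
    queue = deque([start])
    while queue and len(visited) < max_uris:
        uri = queue.popleft()
        visited.append(uri)
        for triple in dereference(uri):
            target = triple[2]
            if is_rdf_link(triple) and target not in seen:
                seen.add(target)
                queue.append(target)
    return visited


order = crawl("http://example.org/richard", lambda uri: WEB.get(uri, []))
```

Replacing the lambda with a function that performs real HTTP requests would turn the same loop into a very small crawler; the `max_uris` bound is there because the real Web of Data has no natural end.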
Let's take an RDF browser like Disco or Tabulator as an example. The surfer uses the browser to display information about Richard from his FOAF profile. Richard has identified himself with the URI http://richard.cyganiak.de/foaf.rdf#cygri. When the surfer types this URI into the navigation bar, the browser dereferences this URI over the Web, asking for content type application/rdf+xml, and displays the retrieved information. In his profile, Richard has stated that he is based near Berlin, using the DBpedia URI http://dbpedia.org/resource/Berlin as URI alias for the non-information resource Berlin. As the surfer is interested in Berlin, he instructs the browser to dereference this URI by clicking on it. The browser now dereferences this URI asking for application/rdf+xml.
After being redirected with an HTTP 303 response code, the browser retrieves an RDF graph describing Berlin in more detail. A part of this graph is shown below. The graph contains a literal triple stating that Berlin has 3,405,259 inhabitants and another RDF link to a resource representing a list of German cities.
As both RDF graphs share the URI http://dbpedia.org/resource/Berlin, they naturally merge together, as shown below.
The surfer might also be interested in other German cities. Therefore he lets the browser dereference the URI identifying the list. The retrieved RDF graph contains more RDF links to German cities, for instance, Hamburg and München as shown below.
Seen from a Web perspective, the most valuable RDF links are those that connect a resource to external data published by other data sources, because they link up different islands of data into a Web. Technically, such an external RDF link is an RDF triple which has a subject URI from one data source and an object URI from another data source. The box below contains various external RDF links taken from different data sources on the Web.
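Why the merge is "natural" can be seen by treating each graph as a set of triples: merging is a set union, and the two graphs connect wherever they use the same URI as a node. The sketch below assumes a simple tuple representation in which literals are written with surrounding double quotes; `ex:population` is a placeholder property name, not a term from a real vocabulary.

```python
from typing import Set, Tuple

Triple = Tuple[str, str, str]

BERLIN = "http://dbpedia.org/resource/Berlin"

# Graph retrieved from Richard's FOAF profile.
richard_graph: Set[Triple] = {
    ("http://richard.cyganiak.de/foaf.rdf#cygri", "foaf:based_near", BERLIN),
}

# Graph retrieved after dereferencing the Berlin URI.
berlin_graph: Set[Triple] = {
    (BERLIN, "ex:population", '"3405259"'),  # ex:population is a placeholder
}


def uri_nodes(graph: Set[Triple]) -> Set[str]:
    """Collect the URIs that occur as subject or object (literals are skipped)."""
    nodes = set()
    for subject, _predicate, obj in graph:
        nodes.add(subject)
        if not obj.startswith('"'):
            nodes.add(obj)
    return nodes


merged = richard_graph | berlin_graph
shared = uri_nodes(richard_graph) & uri_nodes(berlin_graph)
```

No mapping step is needed: because both sources use the same global identifier, the union already contains a path from Richard to the population figure.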
    # Two RDF links taken from DBpedia
    <http://dbpedia.org/resource/Berlin>
        owl:sameAs <http://sws.geonames.org/2950159/> .

    <http://dbpedia.org/resource/Tim_Berners-Lee>
        owl:sameAs <http://www4.wiwiss.fu-berlin.de/dblp/resource/person/100007> .

    # RDF links taken from Tim Berners-Lee's FOAF profile
    <http://www.w3.org/People/Berners-Lee/card#i>
        owl:sameAs <http://dbpedia.org/resource/Tim_Berners-Lee> ;
        foaf:knows <http://www.w3.org/People/Connolly/#me> .

    # RDF links taken from Richard Cyganiak's FOAF profile
    <http://richard.cyganiak.de/foaf.rdf#cygri>
        foaf:knows <http://www.w3.org/People/Berners-Lee/card#i> ;
        foaf:topic_interest <http://dbpedia.org/resource/Semantic_Web> .
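A rough way to spot external RDF links programmatically is to compare the hosts of the subject and object URIs. Using the host name as a stand-in for "data source" is an assumption of this sketch; one publisher may use several hosts, and one host may serve several datasets.

```python
from urllib.parse import urlsplit


def is_external_link(subject_uri: str, object_uri: str) -> bool:
    """Heuristic: a link is external if subject and object live on different hosts."""
    return urlsplit(subject_uri).hostname != urlsplit(object_uri).hostname


# DBpedia linking to Geonames: different hosts.
dbpedia_to_geonames = is_external_link(
    "http://dbpedia.org/resource/Berlin",
    "http://sws.geonames.org/2950159/",
)

# Two DBpedia resources: same host.
dbpedia_to_dbpedia = is_external_link(
    "http://dbpedia.org/resource/Alec_Empire",
    "http://dbpedia.org/resource/Techno",
)
```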
The main benefits of using the RDF data model in a Linked Data context are that:
In order to make it easier for clients to merge and query your data, we recommend not to use the full expressivity of the RDF data model, but a subset of the RDF features. Especially:
Resources are named with URI references. When publishing Linked Data, you should devote some effort to choosing good URIs for your resources.
On the one hand, they should be good names that other publishers can use confidently to link to your resources in their own data. On the other hand, you will have to put technical infrastructure in place to make them dereferenceable, and this may put some constraints on what you can do.
This section lists, in loose order, some things to keep in mind.
Examples of cool URIs:
See also:
In order to make it as easy as possible for client applications to process your data, you should reuse terms from well-known vocabularies wherever possible. You should only define new terms yourself if you cannot find required terms in existing vocabularies.
A set of well-known vocabularies has evolved in the Semantic Web community. Please check whether your data can be represented using terms from these vocabularies before defining any new terms:
A more extensive list of well-known vocabularies is maintained by the W3C SWEO Linking Open Data community project in the ESW Wiki. A listing of the 100 most common RDF namespaces (August 2006) is provided by UMBC eBiquity Group.
It is common practice to mix terms from different vocabularies. We especially recommend the use of rdfs:label and foaf:depiction properties whenever possible as these terms are well-supported by client applications.
If you need URI references for geographic places, research areas, general topics, artists, books or CDs, you should consider using URIs from data sources within the W3C SWEO Linking Open Data community project, for instance Geonames, DBpedia, Musicbrainz, dbtune or the RDF Book Mashup. The two main benefits of using URIs from these data sources are:
A more extensive list of datasets with dereferenceable URIs is maintained by the Linking Open Data community project in the ESW Wiki.
Good examples of how terms from different well-known vocabularies are mixed in one document and how existing concept URIs are reused are given by the FOAF profiles of Tim Berners-Lee and Ivan Herman.
When you cannot find good existing vocabularies that cover all the classes and properties you need, then you have to define your own terms. Defining new terms is not hard. RDF classes and properties are resources themselves, identified by URIs, and published on the Web, so everything we said about publishing Linked Data applies to them as well.
You can define vocabularies using the RDF Vocabulary Description Language 1.0: RDF Schema or the Web Ontology Language (OWL). For introductions to RDFS, see the section on Vocabulary Documentation in the SWAP Tutorial, and the very detailed RDF Schema section of the RDF Primer. OWL is introduced in the OWL Overview.
Here we give some guidelines for those who are familiar with these languages:
The following example contains a definition of a class and a property following the rules above. The example uses the Turtle syntax. Namespace declarations are omitted.
    # Definition of the class "Lover"
    <http://sites.wiwiss.fu-berlin.de/suhl/bizer/pub/LoveVocabulary#Lover>
        rdf:type rdfs:Class ;
        rdfs:label "Lover"@en ;
        rdfs:label "Liebender"@de ;
        rdfs:comment "A person who loves somebody."@en ;
        rdfs:comment "Eine Person die Jemanden liebt."@de ;
        rdfs:subClassOf foaf:Person .

    # Definition of the property "loves"
    <http://sites.wiwiss.fu-berlin.de/suhl/bizer/pub/LoveVocabulary#loves>
        rdf:type rdf:Property ;
        rdfs:label "loves"@en ;
        rdfs:label "liebt"@de ;
        rdfs:comment "Relation between a lover and a loved person."@en ;
        rdfs:comment "Beziehung zwischen einem Liebenden und einer geliebten Person."@de ;
        rdfs:subPropertyOf foaf:knows ;
        rdfs:domain <http://sites.wiwiss.fu-berlin.de/suhl/bizer/pub/LoveVocabulary#Lover> ;
        rdfs:range foaf:Person .
Note that the terms are defined in a namespace that is controlled by Chris Bizer and that they are related to the FOAF vocabulary using rdfs:subPropertyOf and rdfs:subClassOf links.
So, assuming we have expressed all our data in RDF triples: What triples should go into the RDF representation that is returned (after a 303 redirect) in response to dereferencing a URI identifying a non-information resource?
application/x-turtle. In situations where you think people might want to use your data together with XML technologies such as XSLT or XQuery, you might additionally serve a TriX serialization, as TriX works better with these technologies than RDF/XML.
In the following, we give two examples of RDF descriptions following the rules above. The first example covers the case of an authoritative representation served by a URI owner. The second example covers the case of non-authoritative information served by somebody who is not the owner of the described URI.
The following example shows parts of the Turtle representation of the information resource http://dbpedia.org/data/Alec_Empire. The resource describes the German musician Alec Empire. Using Web Architecture terminology, it is an authoritative description as it is served after a 303 redirect by the owner of the URI http://dbpedia.org/resource/Alec_Empire. Namespace declarations are omitted:
    # Metadata and Licensing Information
    <http://dbpedia.org/data/Alec_Empire>
        rdfs:label "RDF description of Alec Empire" ;
        rdf:type foaf:Document ;
        dc:publisher <http://dbpedia.org/resource/DBpedia> ;
        dc:date "2007-07-13"^^xsd:date ;
        dc:rights <http://en.wikipedia.org/wiki/WP:GFDL> .

    # The description
    <http://dbpedia.org/resource/Alec_Empire>
        foaf:name "Empire, Alec" ;
        rdf:type foaf:Person ;
        rdf:type <http://dbpedia.org/class/yago/musician> ;
        rdfs:comment "Alec Empire (born May 2, 1972) is a German musician who is ..."@en ;
        rdfs:comment "Alec Empire (eigentlich Alexander Wilke) ist ein deutscher Musiker. ..."@de ;
        dbpedia:genre <http://dbpedia.org/resource/Techno> ;
        dbpedia:associatedActs <http://dbpedia.org/resource/Atari_Teenage_Riot> ;
        foaf:page <http://en.wikipedia.org/wiki/Alec_Empire> ;
        foaf:page <http://dbpedia.org/page/Alec_Empire> ;
        rdfs:isDefinedBy <http://dbpedia.org/data/Alec_Empire> ;
        owl:sameAs <http://zitgist.com/music/artist/d71ba53b-23b0-4870-a429-cce6f345763b> .

    # Backlinks
    <http://dbpedia.org/resource/60_Second_Wipeout>
        dbpedia:producer <http://dbpedia.org/resource/Alec_Empire> .
    <http://dbpedia.org/resource/Limited_Editions_1990-1994>
        dbpedia:artist <http://dbpedia.org/resource/Alec_Empire> .
Note that the description contains an owl:sameAs link stating that http://dbpedia.org/resource/Alec_Empire and http://zitgist.com/music/artist/d71ba53b-23b0-4870-a429-cce6f345763b are URI aliases referring to the same non-information resource.
In order to make it easier for Linked Data clients to understand the relation between http://dbpedia.org/resource/Alec_Empire, http://dbpedia.org/data/Alec_Empire, and http://dbpedia.org/page/Alec_Empire, the URIs can be interlinked using the rdfs:isDefinedBy and the foaf:page property as recommended in the Cool URI paper.
The following example shows the Turtle representation of the information resource http://sites.wiwiss.fu-berlin.de/suhl/bizer/pub/LinkedDataTutorial/ChrisAboutRichard which is published by Chris to provide information about Richard. Note that in Turtle, the syntactic shortcut <> can be used to refer to the URI of the current document. Richard owns the URI http://richard.cyganiak.de/foaf.rdf#cygri and is therefore the only person who can provide an authoritative description for this URI. Thus, using Web Architecture terminology, Chris is providing non-authoritative information about Richard.
    # Metadata and Licensing Information
    <>
        rdf:type foaf:Document ;
        dc:author <http://www.bizer.de#chris> ;
        dc:date "2007-07-13"^^xsd:date ;
        cc:license <http://web.resource.org/cc/PublicDomain> .

    # The description
    <http://richard.cyganiak.de/foaf.rdf#cygri>
        foaf:name "Richard Cyganiak" ;
        foaf:topic_interest <http://dbpedia.org/resource/Category:Databases> ;
        foaf:topic_interest <http://dbpedia.org/resource/MacBook_Pro> ;
        rdfs:isDefinedBy <http://richard.cyganiak.de/foaf.rdf> ;
        rdfs:seeAlso <> .

    # Backlinks
    <http://www.bizer.de#chris>
        foaf:knows <http://richard.cyganiak.de/foaf.rdf#cygri> .
    <http://www4.wiwiss.fu-berlin.de/is-group/resource/projects/Project3>
        doap:developer <http://richard.cyganiak.de/foaf.rdf#cygri> .
RDF links enable Linked Data browsers and crawlers to navigate between data sources and to discover additional data.
The application domain will determine which RDF properties are used as predicates. For instance, commonly used linking properties in the domain of describing people are foaf:knows, foaf:based_near and foaf:topic_interest. Examples of combining these properties with property values from DBpedia, the DBLP bibliography and the RDF Book Mashup are found in Tim Berners-Lee's and Ivan Herman's FOAF profiles.
It is common practice to use the owl:sameAs property for stating that another data source also provides information about a specific non-information resource. An owl:sameAs link indicates that two URI references actually refer to the same thing. Therefore, owl:sameAs is used to map between different URI aliases (see Section 2.1). Examples of using owl:sameAs to indicate that two URIs talk about the same thing are again Tim's FOAF profile which states that http://www.w3.org/People/Berners-Lee/card#i identifies the same resource as http://www4.wiwiss.fu-berlin.de/bookmashup/persons/Tim+Berners-Lee and http://www4.wiwiss.fu-berlin.de/dblp/resource/person/100007. Other usage examples are found in DBpedia and the Berlin DBLP server.
RDF links can be set manually, which is usually the case for FOAF profiles, or they can be generated by automated linking algorithms. This approach is usually taken to interlink large datasets.
Before you can set RDF links manually, you need to know something about the datasets you want to link to. In order to get an overview of different datasets that can be used as linking targets please refer to the Linking Open Data Dataset list. Once you have identified particular datasets as suitable linking targets, you can manually search in these for the URI references you want to link to. If a data source doesn't provide a search interface, for instance a SPARQL endpoint or an HTML Web form, you can use Linked Data browsers like Tabulator or Disco to explore the dataset and find the right URIs.
You can also use services such as Uriqr or Sindice to search for existing URIs and to choose the most popular one if you find several candidates. Uriqr allows you to find URIs for people you know, simply by searching for their name. Results are ranked according to how heavily a particular URI is referenced in RDF documents on the Web, but you will need to apply a little bit of human intelligence in picking the most appropriate URI to use. Sindice indexes the Semantic Web and can tell you which sources mention a certain URI. Therefore the service can help you to choose the most popular URI for a concept.
Remember that data sources might use HTTP 303 redirects to redirect clients from URIs identifying non-information resources to URIs identifying information resources that describe the non-information resources. In this case, make sure that you link to the URI reference identifying the non-information resource, and not the document about it.
The approach described above does not scale to large datasets, for instance interlinking 70,000 places in DBpedia to their corresponding entries in Geonames. In such cases it makes sense to use an automated record linkage algorithm to generate RDF links between data sources.
Record Linkage is a well-known problem in the databases community. The Linking Open Data Project collects material related to using record linkage algorithms in the Linked Data context on the Equivalence Mining wiki page.
There is still a lack of good, easy-to-use tools to auto-generate RDF links. Therefore it is common practice to implement dataset-specific record linkage algorithms to generate RDF links between data sources. In the following we describe two classes of such algorithms:
In various domains, there are generally accepted naming schemata. For instance, in the publication domain there are ISBN numbers, in the financial domain there are ISIN identifiers. If these identifiers are used as part of HTTP URIs identifying particular resources, it is possible to use simple pattern-based algorithms to generate RDF links between these resources.
An example of a data source using ISBN numbers as part of its URIs is the RDF Book Mashup, which for instance uses the URI http://www4.wiwiss.fu-berlin.de/bookmashup/books/006251587X to identify the book 'Harry Potter and the Half-blood Prince'. Having the ISBN number in these URIs made it easy for DBpedia to generate owl:sameAs links between books within DBpedia and the Book Mashup. DBpedia uses the following pattern-based algorithm:
Running this algorithm against all books in DBpedia resulted in 9000 RDF links which were merged with the DBpedia dataset. For instance, the resulting link for the Harry Potter book is:
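A client that wants to treat URI aliases as one resource can group them by following owl:sameAs links transitively. The sketch below uses a small union-find structure for this; the function name is illustrative, the input is assumed to be a plain list of (subject, object) pairs already extracted from owl:sameAs triples, and the links themselves are the ones quoted in this tutorial.

```python
from typing import Dict, Iterable, List, Set, Tuple


def alias_clusters(same_as_links: Iterable[Tuple[str, str]]) -> List[Set[str]]:
    """Group URIs into sets of aliases connected by owl:sameAs links."""
    parent: Dict[str, str] = {}

    def find(uri: str) -> str:
        parent.setdefault(uri, uri)
        while parent[uri] != uri:
            parent[uri] = parent[parent[uri]]  # path halving
            uri = parent[uri]
        return uri

    for left, right in same_as_links:
        parent[find(left)] = find(right)

    clusters: Dict[str, Set[str]] = {}
    for uri in list(parent):
        clusters.setdefault(find(uri), set()).add(uri)
    return list(clusters.values())


links = [
    ("http://www.w3.org/People/Berners-Lee/card#i",
     "http://www4.wiwiss.fu-berlin.de/bookmashup/persons/Tim+Berners-Lee"),
    ("http://www.w3.org/People/Berners-Lee/card#i",
     "http://www4.wiwiss.fu-berlin.de/dblp/resource/person/100007"),
    ("http://dbpedia.org/resource/Tim_Berners-Lee",
     "http://www4.wiwiss.fu-berlin.de/dblp/resource/person/100007"),
    ("http://dbpedia.org/resource/Berlin",
     "http://sws.geonames.org/2950159/"),
]

clusters = alias_clusters(links)
```

Note that the DBpedia URI for Tim Berners-Lee ends up in the same cluster as his FOAF URI even though no single source links the two directly in this list; whether such transitive merging is appropriate depends on how much the client trusts each publisher's owl:sameAs statements.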
    <http://dbpedia.org/resource/Harry_Potter_and_the_Half-Blood_Prince>
        owl:sameAs <http://www4.wiwiss.fu-berlin.de/bookmashup/books/006251587X> .
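The general shape of such a pattern-based linker can be sketched as follows. This is not DBpedia's actual implementation: the input format (a mapping from resource URI to a raw ISBN string) and the function names are assumptions for illustration, and only 10-character ISBNs are accepted to keep the example short.

```python
import re
from typing import Dict, Iterator, Optional

BOOK_MASHUP_BASE = "http://www4.wiwiss.fu-berlin.de/bookmashup/books/"
OWL_SAME_AS = "http://www.w3.org/2002/07/owl#sameAs"


def normalize_isbn10(raw: str) -> Optional[str]:
    """Strip hyphens and spaces; return None unless the result looks like an ISBN-10."""
    cleaned = raw.replace("-", "").replace(" ", "").upper()
    if re.fullmatch(r"[0-9]{9}[0-9X]", cleaned):
        return cleaned
    return None


def isbn_links(books: Dict[str, str]) -> Iterator[str]:
    """Yield one owl:sameAs statement (N-Triples syntax) per book with a usable ISBN."""
    for uri, raw_isbn in books.items():
        isbn = normalize_isbn10(raw_isbn)
        if isbn is None:
            continue  # no link is better than a wrong link
        yield f"<{uri}> <{OWL_SAME_AS}> <{BOOK_MASHUP_BASE}{isbn}> ."


# Hypothetical input: the second entry stands for a record without a usable ISBN.
books = {
    "http://dbpedia.org/resource/Harry_Potter_and_the_Half-Blood_Prince": "0-06-251587-X",
    "http://dbpedia.org/resource/Some_Book_Without_ISBN": "unknown",
}

generated = list(isbn_links(books))
```

The linker never needs to contact the target dataset: because the identifier is part of the target's URI pattern, the link can be computed from local data alone.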
In cases where no common identifiers across datasets exist, it is necessary to employ more complex property-based linkage algorithms. We outline two algorithms below:
This chapter provides practical recipes for publishing different types of information as Linked Data on the Web. Information has to fulfill the following minimal requirements to be considered "published as Linked Data on the Web":
When a client requests application/rdf+xml, a data source must return an RDF/XML description of the identified resource.
Which of the following recipes fits your needs depends on various factors, such as:
After you have published your information as Linked Data, you should ensure that there are external RDF links pointing at URIs from your dataset, so that RDF browsers and crawlers can find your data. There are two basic ways of doing this:
The simplest way to serve Linked Data is to produce static RDF files, and upload them to a web server. This approach is typically chosen in situations where
There are several issues to consider:
Older web servers are sometimes not yet configured to return the correct MIME type when serving RDF/XML files. Linked Data browsers may not recognize RDF data served in this way because the server claims that it is not RDF/XML but plain text. To find out if your server needs fixing, use the cURL tool and the steps outlined in this tutorial.
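The same check can also be scripted. The following is a minimal Python sketch and is not part of the cURL tutorial referenced above; the function names are hypothetical, and `fetch_content_type` would be pointed at the URL of one of your own RDF files.

```python
import urllib.request


def is_rdf_xml(content_type):
    """Return True if a Content-Type header value denotes RDF/XML."""
    if not content_type:
        return False
    # Drop parameters such as ";charset=utf-8" and compare case-insensitively.
    media_type = content_type.split(";", 1)[0].strip().lower()
    return media_type == "application/rdf+xml"


def fetch_content_type(url):
    """Issue a HEAD request and return the server's Content-Type header."""
    request = urllib.request.Request(
        url, headers={"Accept": "application/rdf+xml"}, method="HEAD"
    )
    with urllib.request.urlopen(request) as response:
        return response.headers.get("Content-Type")


if __name__ == "__main__":
    # Header values as a misconfigured and a correctly configured
    # server might send them (no network access needed for this part).
    print(is_rdf_xml("text/plain"))
    print(is_rdf_xml("application/rdf+xml;charset=utf-8"))
```

If `is_rdf_xml(fetch_content_type(url))` is false for one of your RDF/XML files, the server configuration probably needs the fix described next.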
How to fix this depends on the web server. In the case of Apache, add this line to the httpd.conf configuration file, or to an .htaccess file in the web server's directory where the RDF files are placed:
AddType application/rdf+xml .rdf
This tells Apache to serve files with an .rdf extension using the correct MIME type for RDF/XML, application/rdf+xml. Note this means you have to name your files with the .rdf extension.
While you're at it, you can also add these lines to make your web server ready for other RDF syntaxes (N3 and Turtle):
AddType text/rdf+n3;charset=utf-8 .n3
AddType application/x-turtle .ttl
On the document Web, it's considered bad form to publish huge HTML pages, because they load very slowly in browsers and consume unnecessary bandwidth. The same is true when publishing Linked Data: Your RDF files shouldn't be larger than, say, a few hundred kilobytes. If your files are larger and describe multiple resources, you should break them up into several RDF files, or use Pubby as described in recipe 7.3 to serve them in chunks.
When you serve multiple RDF files, make sure they are linked to each other through RDF triples that involve resources described in different files.
The static file approach doesn't support the 303 redirects required for the URIs of non-information resources. Fortunately there is another standards-compliant method of naming non-information resources, which works very well with static RDF files, but has a downside we will discuss later. This method relies on hash URIs.
When you serve a static RDF file at, say, http://example.com/people.rdf, then you should name the non-information resources described in the file by appending a fragment identifier to the file's URI. The identifier must be unique within the file. That way, you end up with URIs like this for your non-information resources:
This works because HTTP clients dereference hash URIs by stripping off the part after the hash and dereferencing the resulting URI. A Linked Data browser will then look into the response (the RDF file in this case), and find triples that tell it more about the non-information resource, achieving an effect quite similar to the 303 redirect.
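The stripping step can be illustrated with Python's standard library. This is only a sketch of what a client does before sending its request; the fragment `#alice` and the function name are made up for the example.

```python
from urllib.parse import urldefrag


def split_hash_uri(uri):
    """Return (document URI to request, fragment naming the resource)."""
    document_uri, fragment = urldefrag(uri)
    return document_uri, fragment


if __name__ == "__main__":
    # "#alice" is a hypothetical fragment identifier within people.rdf.
    document_uri, fragment = split_hash_uri("http://example.com/people.rdf#alice")
    print(document_uri)  # the URI that is actually sent to the server
    print(fragment)      # the part the client keeps to itself
```

The server only ever sees the document URI. The client uses the full hash URI afterwards to pick out the relevant triples from the returned RDF file.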
The downside of this naming approach is that the URIs are not very "cool" according to the criteria set out in section 3. There's a reference to a specific representation format in the identifiers (the .rdf extension). And if you choose to rename the RDF file later on, or decide to split your data into several files, then all identifiers will change and existing links to them will break.
That's why you should use this approach only if the overall structure and size of the dataset are unlikely to change much in the future, or as a quick-and-dirty solution for transient data where link stability isn't so important.
This approach can be extended to use 303 redirects and even to support content negotiation, if you are willing to jump through some extra hoops. Unfortunately this process is dependent on your web server and its configuration. The W3C has published several recipes that show how to do this for the Apache web server: Best Practice Recipes for Publishing RDF Vocabularies. The document is officially targeted at publishers of RDF vocabularies, but the recipes work for other kinds of RDF data served from static files. Note that at the time of writing there is still an issue with content negotiation in this document which might be solved by moving from Apache mod_rewrite to mod_negotiation.
If your data is stored in a relational database it is usually a good idea to leave it there and just publish a Linked Data view on your existing database.
A tool for serving Linked Data views on relational databases is D2R Server. D2R Server relies on a declarative mapping between the schemata of the database and the target RDF terms. Based on this mapping, D2R Server serves a Linked Data view on your database and provides a SPARQL endpoint for the database.
There are several D2R Servers online, for example the Berlin DBLP Bibliography Server, the Hannover DBLP Bibliography Server, http://www4.wiwiss.fu-berlin.de/is-group/ or the EuroStat Countries and Regions Server.
Publishing a relational database as Linked Data typically involves the following steps:
Alternatively, you can also use:
If your information is currently represented in formats such as CSV, Microsoft Excel, or BibTeX and you want to serve the information as Linked Data on the Web, it is usually a good idea to do the following:
The approach described above is taken by the DBpedia project, among others. The project uses PHP scripts to extract structured data from Wikipedia pages. This data is then converted to RDF and stored in an OpenLink Virtuoso repository which provides a SPARQL endpoint. In order to get a Linked Data view, Pubby is put in front of the SPARQL endpoint.
If your dataset is sufficiently small to fit completely into the web server's main memory, then you can do without the RDF repository, and instead use Pubby's conf:loadRDF option to load the RDF data from an RDF file directly into Pubby. This might be simpler, but unlike a real RDF repository, Pubby will keep everything in main memory and doesn't offer a SPARQL endpoint.
Large numbers of Web applications have started to make their data available on the Web through Web APIs. Examples of data sources providing such APIs include eBay, Amazon, Yahoo, Google and Google Base. A more comprehensive API list is found at Programmable Web. Different APIs provide diverse query and retrieval interfaces and return results using a number of different formats such as XML, JSON or Atom. This leads to three general limitations of Web APIs:
These limitations can be overcome by implementing Linked Data wrappers around APIs. In general, Linked Data wrappers do the following:
When a client requests application/rdf+xml, the wrapper rewrites the client's request into a request against the underlying API.
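The request rewriting and the conversion of an API result into RDF can be sketched as follows. Everything specific here is an assumption made for illustration: the `/books/<ISBN>` URI layout, the `api.example.org` endpoint, the shape of the parsed API result, and the choice of `dc:title` as the only property. Real wrappers such as the ones listed below differ in all of these details.

```python
import re
from urllib.parse import urlencode
from xml.sax.saxutils import escape, quoteattr

# Hypothetical URI layout and API endpoint, for illustration only.
BOOK_PATH = re.compile(r"^/books/([0-9]{9}[0-9X])$")
API_ENDPOINT = "http://api.example.org/lookup"


def rewrite_request(path):
    """Map the path of a dereferenced book URI to a request against the API."""
    match = BOOK_PATH.match(path)
    if match is None:
        return None  # not a URI minted by this wrapper
    return API_ENDPOINT + "?" + urlencode({"isbn": match.group(1)})


def to_rdf_xml(resource_uri, record):
    """Turn an already-parsed API result (a dict) into a small RDF/XML document."""
    title = escape(record.get("title", ""))
    return (
        '<?xml version="1.0"?>\n'
        '<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"\n'
        '         xmlns:dc="http://purl.org/dc/elements/1.1/">\n'
        '  <rdf:Description rdf:about=%s>\n'
        '    <dc:title>%s</dc:title>\n'
        '  </rdf:Description>\n'
        '</rdf:RDF>\n' % (quoteattr(resource_uri), title)
    )


if __name__ == "__main__":
    print(rewrite_request("/books/006251587X"))
    # Assumed API result; a real wrapper would parse the API's XML or JSON.
    print(to_rdf_xml("http://example.org/books/006251587X",
                     {"title": "Harry Potter and the Half-blood Prince"}))
```

The sketch leaves out the HTTP call to the API and the server side that dispatches on the Accept header, since both depend on the API and the hosting environment.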
Examples of Linked Data Wrappers include:
The RDF Book Mashup makes information about books, their authors, reviews, and online bookstores available as RDF on the Web. The RDF Book Mashup assigns an HTTP URI to each book that has an ISBN number. Whenever one of these URIs is dereferenced, the Book Mashup requests data about the book, its author as well as reviews and sales offers from the Amazon API and the Google Base API. This data is then transformed into RDF and returned to the client.
The RDF Book Mashup is implemented as a small PHP script which can be used as a template for implementing similar wrappers around other Web APIs. More information about the Book Mashup and the relationship of Web APIs to Linked Data in general is available in The RDF Book Mashup: From Web APIs to a Web of Data (Slides).
After you have published information as Linked Data on the Web, you should test whether your information can be accessed correctly.
An easy way of testing is to put several of your URIs into the Vapour Linked Data validation service, which generates a detailed report on how your URIs react to different HTTP requests.
Besides this, it is also important to see whether your information displays correctly in different Linked Data browsers and whether the browsers can follow RDF links within your data. Therefore, take several URIs from your dataset and enter them into the navigation bar of the following Linked Data browsers:
If you run into problems, you should do the following:
If you cannot figure out what is going wrong yourself, ask on the Linking Open Data mailing list for help.
The standard way of discovering Linked Data on the Web is by following RDF links within data the client already knows. In order to further ease discovery, information providers can decide to support additional discovery mechanisms:
It also makes sense in many cases to set links from existing webpages to RDF data, for instance from your personal home page to your FOAF profile. Such links can be set using the HTML <link> element in the <head> of your HTML page.
<link rel="alternate" type="application/rdf+xml" href="link_to_the_RDF_version" />
HTML <link> elements are used by browser extensions, like Piggybank and Semantic Radar, to discover RDF data on the Web.
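How a client might perform this kind of auto-discovery can be sketched with Python's standard HTML parser. This is an illustration only and makes no claim about how Piggybank or Semantic Radar are implemented; the class and function names are hypothetical.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class RDFLinkFinder(HTMLParser):
    """Collect the targets of <link rel="alternate" type="application/rdf+xml">."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.rdf_links = []

    def handle_starttag(self, tag, attrs):
        if tag != "link":
            return
        attributes = dict(attrs)
        rel = (attributes.get("rel") or "").lower().split()
        media_type = (attributes.get("type") or "").lower()
        href = attributes.get("href")
        if "alternate" in rel and media_type == "application/rdf+xml" and href:
            # Resolve relative references against the page's own URL.
            self.rdf_links.append(urljoin(self.base_url, href))


def find_rdf_links(html, base_url):
    """Return the absolute URIs of all RDF/XML alternates linked from a page."""
    finder = RDFLinkFinder(base_url)
    finder.feed(html)
    return finder.rdf_links


if __name__ == "__main__":
    page = ('<html><head><link rel="alternate" type="application/rdf+xml" '
            'href="foaf.rdf" /></head><body></body></html>')
    print(find_rdf_links(page, "http://example.com/index.html"))
```

The sketch does not restrict itself to the <head> element and ignores a <base> element if one is present; both would be worth handling in a real client.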
For more information about Linked Data please refer to:
This is an example HTTP session where a Linked Data browser tries to dereference the URI http://dbpedia.org/resource/Berlin, a URI for the city of Berlin, published by the DBpedia project.
To obtain a representation, the client connects to the dbpedia.org server and issues an HTTP GET request:
GET /resource/Berlin HTTP/1.1
Host: dbpedia.org
Accept: text/html;q=0.5, application/rdf+xml
The client sends an Accept: header to indicate that it would take either HTML or RDF; the q=0.5 quality value for HTML shows that it prefers RDF. The server could answer:
HTTP/1.1 303 See Other
Location: http://dbpedia.org/data/Berlin
Vary: Accept
This is a 303 redirect, which tells the client that the requested resource is a non-information resource, and its associated description can be found at the URI given in the Location: response header. Note that if the Accept: header had indicated a preference for HTML, we would have been redirected to another URI. This is indicated by the Vary: header, which is required so that caches work correctly. Next the client will try to dereference the URI of the associated description.
GET /data/Berlin HTTP/1.1
Host: dbpedia.org
Accept: text/html;q=0.5, application/rdf+xml
The server could answer:
HTTP/1.1 200 OK
Content-Type: application/rdf+xml;charset=utf-8

<?xml version="1.0"?>
<rdf:RDF
  xmlns:units="http://dbpedia.org/units/"
  xmlns:foaf="http://xmlns.com/foaf/0.1/"
  xmlns:geonames="http://www.geonames.org/ontology#"
  xmlns:rdfs="http://www.w3.org/2000/01/rdf-schema#"
  ...
The 200 status code tells the client that the response contains the representation of an information resource. The Content-Type: header tells us that the representation is in RDF/XML format. The rest of the message contains the representation. Only the beginning is shown.
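The server's side of the first exchange in this session can be sketched in Python. This is a simplified illustration, not DBpedia's implementation: the function names are hypothetical, the `/page/` URI used for the HTML case is an assumption that does not appear in the session above, and wildcard media types such as `*/*` are ignored.

```python
def parse_accept(header):
    """Return {media type: quality value} for an Accept header value."""
    preferences = {}
    for item in header.split(","):
        parts = [part.strip() for part in item.split(";")]
        media_type = parts[0].lower()
        if not media_type:
            continue
        quality = 1.0  # a missing q parameter means q=1
        for parameter in parts[1:]:
            name, _, value = parameter.partition("=")
            if name.strip().lower() == "q":
                try:
                    quality = float(value)
                except ValueError:
                    quality = 0.0
        preferences[media_type] = quality
    return preferences


def redirect_for(resource_name, accept_header):
    """Build status code and headers of the 303 answer for /resource/<name>."""
    preferences = parse_accept(accept_header)
    rdf_q = preferences.get("application/rdf+xml", 0.0)
    html_q = preferences.get("text/html", 0.0)
    if rdf_q > 0 and rdf_q >= html_q:
        location = "http://dbpedia.org/data/" + resource_name
    else:
        # Assumed location of the HTML description.
        location = "http://dbpedia.org/page/" + resource_name
    # Vary: Accept tells caches that the answer depends on the Accept header.
    return 303, {"Location": location, "Vary": "Accept"}


if __name__ == "__main__":
    print(redirect_for("Berlin", "text/html;q=0.5, application/rdf+xml"))
```

With the Accept header from the session above, the sketch selects the `/data/Berlin` location because RDF/XML carries the implicit quality value 1 while HTML is limited to 0.5.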
A great way to get started with publishing Linked Data on the Web is to serve a static RDF file; this can work well for small amounts of relatively simple data. One common example of this practice is providing a Friend-of-a-Friend (FOAF) file alongside (and interlinked with) your HTML home page. This Appendix provides step-by-step instructions on how to create and publish a FOAF description of yourself, and how to link it into the Web of Data.
FOAF-a-Matic is a web form that will generate a basic FOAF description for you in RDF/XML (which we will call your "FOAF file" from now onwards). This provides an excellent foundation to which you can add additional data. After generating your FOAF profile using FOAF-a-Matic, you'll need to save the generated RDF/XML code as a static file (using the filename foaf.rdf is a common convention) and decide where it will be hosted. Technically your FOAF file can be hosted anywhere on the Web, although it's common practice to use your own Web space and place it in the same directory as your home page. For example, Richard Cyganiak's FOAF file is located at http://richard.cyganiak.de/foaf.rdf; this is the URI of the RDF document. The document describes Richard, who is identified by the URI http://richard.cyganiak.de/foaf.rdf#cygri.
By default FOAF-a-Matic will use a fragment identifier (such as #me) to refer to you within your FOAF file. This fragment identifier is appended to the URI of your FOAF file to form your URI, of the form http://yourdomain.com/foaf.rdf#me. Congratulations, you now have a URI which can be used to identify you in other RDF statements on the Web of Data. Your URI is an example of a "hash URI", discussed in more detail in Section 7.1.
At this stage you will want to start linking your FOAF file into the Web. One good place to start is by linking to your FOAF file from your homepage, using the HTML LINK Auto-Discovery technique from Section 9, but don't stop there. To firmly embed your FOAF file into the Web of Data you need to read on and implement the guidance in the following sections.
At present FOAF-a-Matic uses "blank nodes" to identify people you know, not URI references, as the system has no way of knowing the appropriate URIs to associate with each person. Example output with blank nodes is shown below:
...
<foaf:knows>
  <foaf:Person>
    <foaf:mbox_sha1sum>362ce75324396f0aa2d3e5f1246f40bf3bb44401</foaf:mbox_sha1sum>
    <foaf:name>Dan Brickley</foaf:name>
    <rdfs:seeAlso rdf:resource="http://danbri.org/foaf.rdf"/>
  </foaf:Person>
</foaf:knows>
...
This is valid RDF but isn't good Linked Data, as blank nodes make it much harder to link and merge data across different sources (see Section 2.2 for more discussion of the issues with blank nodes). Therefore, the first thing to do in making your FOAF file into Linked Data is to look at the blank nodes representing people you know, and to replace them with URIs where possible.
Section 6.1 explains how to set RDF links manually, and describes two services (Uriqr and Sindice) that you can use to try to find existing URIs for people you know. Following this approach we can find out that the existing URI http://danbri.org/foaf.rdf#danbri identifies Dan Brickley, and replace the blank node generated by FOAF-a-Matic with this URI reference. The result would look like this:
...
<foaf:knows>
  <foaf:Person rdf:about="http://danbri.org/foaf.rdf#danbri">
    <foaf:mbox_sha1sum>362ce75324396f0aa2d3e5f1246f40bf3bb44401</foaf:mbox_sha1sum>
    <foaf:name>Dan Brickley</foaf:name>
    <rdfs:seeAlso rdf:resource="http://danbri.org/foaf.rdf"/>
  </foaf:Person>
</foaf:knows>
...
After setting links to people you know, there are many other ways in which you can link your FOAF file into the Web of Data. Here we will discuss two of these.
A common thing to include in your FOAF file is information about where you are based, using the foaf:based_near property. This isn't supported by FOAF-a-Matic, so you'll need to add the code in manually. Add the following line somewhere inside the <foaf:Person rdf:ID="me"></foaf:Person> element, replacing the object of the triple with the URI reference of the place nearest to where you are based.
<foaf:based_near rdf:resource="http://dbpedia.org/resource/Milton_Keynes"/>
Using URIs from DBpedia or Geonames ensures that you are linking your FOAF file to well-established URIs which are also likely to be used by others, therefore making the Web more interconnected.
If you have ever written a book or published an academic paper in the field of Computer Science, a URI may already have been created for you in the RDF version of DBLP or in the RDF Book Mashup. At a general level, how to handle this issue is touched upon under URI Aliases in Section 2.1. In summary, you simply need to link to them with a statement saying that they identify the same thing as the URI identifying you in your FOAF file. Section 6 describes how this should be done using the owl:sameAs property.