The semantic web is about using the infrastructure of the web to do for data what it has already done for documents. At present, data on the web is relatively siloed and can be difficult to access. In contrast, the semantic web uses many of the same technologies and protocols as the document web but is instead "a web of data that can be processed directly and indirectly by machines" [Tim Berners-Lee]. It allows data to be represented on the web in a highly efficient way, with consistent links and meaning.
Typically, organisations have tended towards publishing data using bespoke APIs or XML schemas. The data is then transformed by recipient systems into their own bespoke schemas, and the unique identifiers of the key data elements are often lost.
There are huge efficiencies to be made if instead standard identifiers (URIs) for key data elements are published on the web. The increased use of a common identifier reduces the cost of integrating disparate data. In addition, much like a link to a document URL, a URI can be followed back to its source for more information or simply a definition.
Brand Trust & Provenance
Your URIs reflect your brand and provide an indication of trust for consumers of documents and data. Geonames is an open linked geographic dataset. It draws upon a variety of data sourced from data providers around the world. Geonames is now becoming more widely adopted as organisations look for sources of strong identifiers for geographic locations (for example, the BBC). This causes problems for well-trusted data providers like the Ordnance Survey (OS). OS data is used extensively within Geonames, but it is impossible to tell which data has come from OS and which has been added from another source. OS URIs exposed in the Geonames dataset would have provided a level of confidence in that subset of the Geonames data (an indication of provenance). Unfortunately, OS has been relatively slow to provide these, so at best they will have to be retrofitted into Geonames, a cost that Geonames is unlikely to want to incur.
Relevance & Value
The more URIs are used, the more valuable they become. Companies House records all the registered companies in the UK. Accessing this data has in the past been a laborious and expensive process. By scraping Companies House data and combining it with similar data sources, OpenCorporates now provides a URI for every UK company (along with URIs for companies in many other jurisdictions). As a result, OpenCorporates has rapidly become the data hub for corporate information on the web and is in many respects more relevant than Companies House for identifying UK companies on the web. For example, a number of local councils include OpenCorporates URIs in their published spending data. Companies House has announced that it intends to publish URIs for companies, which may prove a vain attempt to displace OpenCorporates and regain its position as the central hub for corporate information.
Integration & Reach
Lowering the barrier to integration makes it more likely that a particular data service will be used or bought. MusicBrainz is an IMDB for music, and its standard offering is a free service that provides URIs. Because MusicBrainz provides URIs, it becomes increasingly cost-effective for its users, such as the BBC, to integrate new product features such as the LastFM service and Guardian music reviews through the common use of URIs. In addition, the BBC pays to subscribe to the MusicBrainz premier data service, as it is now dependent on MusicBrainz identifiers and data.
Some Technical Considerations
Instance data URIs should be designed with only the minimum amount of information embedded in them, just enough that they:
encapsulate the primary key (or possibly an alternate key) of the business domain
can be logically separated into API endpoints sufficient to meet the requirements of your use cases
can be routed within your physical architecture to the correct destination (the physical data silo)
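As a minimal sketch of the three conditions above, the following mints a URI that carries nothing but a collection segment and the key. The base address `https://data.example.org`, the collection name and the function names are all hypothetical, chosen only for illustration.

```python
from urllib.parse import quote, unquote, urlsplit

# Hypothetical base address; substitute the domain you actually control.
BASE = "https://data.example.org"


def mint_uri(collection: str, primary_key: str) -> str:
    """Build an instance URI holding only a collection segment and the key."""
    # Percent-encode both parts so a key containing "/" stays one path segment.
    return f"{BASE}/{quote(collection, safe='')}/{quote(primary_key, safe='')}"


def split_uri(uri: str) -> tuple[str, str]:
    """Recover (collection, key), which is all an endpoint or router needs."""
    path = urlsplit(uri).path.strip("/")
    collection, key = path.split("/", 1)
    return unquote(collection), unquote(key)


print(mint_uri("company", "01234567"))
```

The collection segment is what lets the URI be separated into endpoints and routed; the final segment is the key. Nothing else about the resource is encoded.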
You do not want verbose semantics embedded in linked data URIs, because developers and consumers must not be encouraged to infer semantic information from the URI. The URI is just an identifier that must fulfil the three conditions described above.
The semantics are delivered through the underlying data referenced by the URI. If the URI has been minted with embedded information and the semantics of the underlying resource change over time, you may end up with a URI that no longer reflects the resource it refers to.
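To illustrate the risk, here is a small sketch in which a URI was minted with a type embedded in its path and the resource later changed. The URIs, the type names and the in-memory `DESCRIPTIONS` table (standing in for whatever a server would return when the URI is dereferenced) are invented for the example.

```python
from urllib.parse import urlsplit

# Hypothetical data returned when each URI is dereferenced.
# The first company was private when its URI was minted and is now public.
DESCRIPTIONS = {
    "https://data.example.org/private-company/123": {"type": "PublicCompany"},
    "https://data.example.org/id/456": {"type": "PrivateCompany"},
}


def type_from_uri(uri: str) -> str:
    """What to avoid: inferring meaning from the first path segment."""
    return urlsplit(uri).path.strip("/").split("/")[0]


def type_from_data(uri: str, fetch=DESCRIPTIONS.get) -> str:
    """Read the type from the data the URI refers to."""
    description = fetch(uri)
    if description is None:
        raise LookupError(f"no description available for {uri}")
    return description["type"]


stale = "https://data.example.org/private-company/123"
print(type_from_uri(stale), "vs", type_from_data(stale))
```

A consumer that parses the URI gets a stale answer, while one that reads the data stays correct. The second URI, which embeds no type at all, can never mislead in this way.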
Strictly, the single requirement for a data instance URI is that it encapsulates the primary key (or possibly an alternate key) of the business domain. It is in essence a data modelling unique identifier in URI format. Adding anything else, e.g. to ensure that an HTTP request can be correctly routed within your physical architecture, is at best an informed deployment compromise and possibly even an anti-pattern, as it creates an unintentional coupling of physical and logical architecture that is going to cause headaches further down the road when the physical architecture changes. Unless absolutely unavoidable, the physical architecture should be adapted to work with your URIs rather than the other way round.
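One way to keep the physical architecture adaptable is to hold routing knowledge in a table outside the URI. The sketch below assumes a URI of the form `/collection/key`; the host names and collections are hypothetical.

```python
from urllib.parse import urlsplit

# Hypothetical routing tables mapping a logical collection to a physical store.
ROUTES_BEFORE = {
    "company": "http://store-a.internal:8080",
    "officer": "http://store-a.internal:8080",
}
# After a migration only the table changes; published URIs stay as they are.
ROUTES_AFTER = {
    "company": "http://store-b.internal:8080",
    "officer": "http://store-c.internal:8080",
}


def backend_for(uri: str, routes: dict[str, str]) -> str:
    """Pick the physical backend from the logical collection segment alone."""
    collection = urlsplit(uri).path.strip("/").split("/", 1)[0]
    return routes[collection]


uri = "https://data.example.org/company/01234567"
print(backend_for(uri, ROUTES_BEFORE))
print(backend_for(uri, ROUTES_AFTER))
```

Because no host, shard or silo name appears in the URI, moving a collection to a different store means editing the table and nothing that consumers have already stored or linked to.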