BBC News Labs

Over the summer of 2012 I was involved in the BBC News Labs project, led by Matt Shearer and described here on the BBC Internet Tech blog. As a follow-up to that article, I thought it would be good to provide some detail around the RDF and semantics used to power News Labs.

As described in Matt’s blog post, to enable rapid prototyping of news applications, DBpedia was used to semantically annotate an archive (60k articles) and live feeds of BBC News.



Ontologies & SPARQL

News Articles

With the primary purpose of News Labs being the rapid prototyping of novel applications and user experiences built on top of news + linked data, it was decided to use existing public domain ontologies to deliver the semantics rather than put much effort into new modelling. So along with the DBpedia Ontology, the modelling of events, news articles and semantic tagging was accomplished using a combination of Dublin Core (dcterms), the SNaP tag and event ontologies, and an internal BBC news asset ontology. An article with semantic annotations could thus be represented in RDF like this:

@prefix xsd:     <http://www.w3.org/2001/XMLSchema#> .
@prefix bbcasset: <> .
@prefix dcterms: <http://purl.org/dc/terms/> .
@prefix pnt:     <http://data.press.net/ontology/tag/> .

<> a bbcasset:Article ;
        dcterms:title "Aftershocks hit Solomon Islands relief work"@en ;
        dcterms:description "Aftershocks are continuing to rock the Solomon Islands, hampering relief efforts after a quake and tsunami."@en ;
        dcterms:created "2013-02-08T12:27:12Z"^^xsd:dateTime ;
        dcterms:subject "World"@en ;
        dcterms:subject "Asia"@en ;
        dcterms:identifier "21378032" ;
        dcterms:publisher <> ;
        pnt:about <> ,
           <> ;
        pnt:mentions <> ,
           <> ,
           <> .


A news event, and the semantic annotation of an article with that event (for example the Mars Curiosity rover landing), could be represented like so:

@prefix rdfs:    <http://www.w3.org/2000/01/rdf-schema#> .
@prefix xsd:     <http://www.w3.org/2001/XMLSchema#> .
@prefix bbcasset: <> .
@prefix dcterms: <http://purl.org/dc/terms/> .
@prefix pne:     <http://data.press.net/ontology/event/> .
@prefix event:   <http://purl.org/NET/c4dm/event.owl#> .
@prefix pnt:     <http://data.press.net/ontology/tag/> .
@prefix tl:      <http://purl.org/NET/c4dm/timeline.owl#> .

<http://the.juicer.domain/events/19> a pne:Event ;
        event:time <http://the.juicer.domain/events/19/time> ;
        pne:title "Mars landing (Mars Science Laboratory)"@en ;
        event:factor <> ;
        event:place <> ,
          <> ;
        event:agent <> .

<http://the.juicer.domain/events/19/time> a tl:Interval ;
        tl:beginsAtDateTime "2012-08-06T15:53:00Z"^^xsd:dateTime .

<> pnt:about <http://the.juicer.domain/events/19> .

This gives us the ability to exploit the DBpedia ontology, querying against the graph of DBpedia concepts semantically annotated against news articles, to deliver rich aggregations of content, for example:

“Give me the 10 most recent news articles about politicians”

PREFIX pnt: <http://data.press.net/ontology/tag/>
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX dcterms: <http://purl.org/dc/terms/>

select ?article where {
    ?article pnt:about ?person .
    ?article dcterms:created ?published .
    ?person rdf:type <http://dbpedia.org/ontology/Politician> .
}
order by desc(?published)
limit 10


As DBpedia utilises the WGS84 basic geo-positioning vocabulary to represent the latitude and longitude of location concepts, we can exploit the geospatial index and SPARQL extensions in OWLIM to make combined spatial and semantic news aggregations, for example:

“Give me the 10 most recent news articles about politicians within 20km of Manchester”, using a query such as:

PREFIX pnt: <http://data.press.net/ontology/tag/>
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX dcterms: <http://purl.org/dc/terms/>
PREFIX geo: <http://www.w3.org/2003/01/geo/wgs84_pos#>
PREFIX omgeo: <http://www.ontotext.com/owlim/geo#>

select ?article where {
    ?article pnt:about ?person .
    ?article dcterms:created ?published .
    ?person rdf:type <http://dbpedia.org/ontology/Politician> .
    ?article pnt:isTaggedWith ?place .
    <http://dbpedia.org/resource/Manchester> geo:lat ?latManchester ;
        geo:long ?longManchester .
    ?place omgeo:nearby(?latManchester ?longManchester "20km") .
}
order by desc(?published)
limit 10

We can now build some very cool localised news applications.


With the drivers of News Labs being to exploit semantics for rapid prototyping, and also to educate and expose BBC developers to semantic technologies and RDF, it was important to construct APIs that would enable this. At the same time, exposing an open SPARQL endpoint would be inherently risky, in that a consumer (a developer new to SPARQL, for instance) could potentially invoke a query that would hammer the underlying triple store, hindering other labs development teams. We also wanted to provide APIs that would produce dev-friendly BBC JSON representations of news articles to aid rapid web application development.

Accordingly, custom web service APIs were built (in Java) that expose the full power and flexibility of SPARQL 1.1 to semantically aggregate news content, while ensuring that potentially dangerous queries could not be run, and returning news article JSON to the caller.

To achieve this, an API was constructed that accepts, as request parameters, a SPARQL where clause binding a specified (arbitrary) variable to news article RDF resources, along with the name of the binding variable, as follows:

GET /articles?binding={b}&where={SPARQL where clause that binds b}

The caller passes a {SPARQL where clause} that would bind parameter {b} to news articles, for example:

GET /articles?binding=a&where=?a pnt:about <>
Accept: application/json

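To make that concrete, a client call could be assembled with Python's standard library, which takes care of percent-encoding the where clause. This is an illustrative sketch only: the host name and the DBpedia resource URI are assumptions, not part of the real API.

```python
from urllib.parse import urlencode

# Hypothetical host and DBpedia URI, for illustration only.
base = "http://the.juicer.domain/articles"
params = {
    "binding": "a",
    "where": "?a pnt:about <http://dbpedia.org/resource/Solomon_Islands>",
}
url = base + "?" + urlencode(params)
# urlencode percent-encodes the angle brackets, colons and spaces
# in the where clause so it travels safely in the query string.
```
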
Internally, when processing this request, the API constructs a full SPARQL query, enhancing the given where clause for safety and to ensure news articles are correctly bound. In this example the query compiled by the API and invoked on the triple store would end up as:

select ?a where {
    ?a rdf:type bbcasset:Article .
    ?a dcterms:created ?published .
    ?a pnt:about <> .
}
order by desc(?published)
limit 10

The API then collates the JSON for the news articles returned from the query invocation against the triple store, and returns the JSON aggregation to the caller.

By compiling the complete query in the API code, we reduce the risk of dodgy SPARQL queries being invoked, restricting the result set to only the RDF resources we are interested in: in this case, BBC news articles. The API is thus safe yet retains the flexibility of SPARQL, allowing developers to fully exploit the DBpedia graph to aggregate news content while delivering easy-to-consume JSON.
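The compilation step can be sketched in a few lines. The real API was written in Java; this is a minimal Python illustration of the idea, and the function name, default limit and assumption that prefix declarations are prepended elsewhere are all mine.

```python
def compile_article_query(binding, where, limit=10, offset=0):
    # The fixed shell types the bound variable as a news article and
    # requires a creation date, so the caller's clause can only ever
    # select articles. Prefix declarations are assumed to be prepended
    # by the API before the query is sent to the triple store.
    return (
        f"select ?{binding} where {{\n"
        f"    ?{binding} rdf:type bbcasset:Article .\n"
        f"    ?{binding} dcterms:created ?published .\n"
        f"    {where}\n"
        f"}}\n"
        f"order by desc(?published)\n"
        f"limit {limit} offset {offset}"
    )

query = compile_article_query("a", "?a pnt:about <> .")
```
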

For more substantial queries, the request can be POSTed with the where clause URL-encoded in the request body. The API also supports limit and offset parameters, providing the ability to page through article results.

Using the more complex geospatial example above, the consumer could invoke the API request:

POST /articles?binding=article&limit=20
Accept: application/json
Content-Type: application/x-www-form-urlencoded

URL-encoded request body:

where=?article pnt:about ?person .
      ?person rdf:type <http://dbpedia.org/ontology/Politician> .
      ?article pnt:isTaggedWith ?place .
      <http://dbpedia.org/resource/Manchester> <http://www.w3.org/2003/01/geo/wgs84_pos#lat> ?latManchester ;
                   <http://www.w3.org/2003/01/geo/wgs84_pos#long> ?longManchester .
      ?place omgeo:nearby(?latManchester ?longManchester "20km") .

and get back the JSON for all the news articles about politicians occurring within 20km of Manchester.
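The same POST could be assembled from Python's standard library. Again this is only a sketch: the host is hypothetical, and I am assuming the API resolves the common prefixes (geo, omgeo, pnt) server-side, as in the prefixed query examples above.

```python
from urllib.parse import urlencode
from urllib.request import Request

# Illustrative where clause, taken from the geospatial example above;
# the host and URIs are assumptions for the sake of the sketch.
where = (
    "?article pnt:about ?person . "
    "?person rdf:type <http://dbpedia.org/ontology/Politician> . "
    "?article pnt:isTaggedWith ?place . "
    "<http://dbpedia.org/resource/Manchester> geo:lat ?latManchester ; "
    "geo:long ?longManchester . "
    '?place omgeo:nearby(?latManchester ?longManchester "20km") .'
)
req = Request(
    "http://the.juicer.domain/articles?binding=article&limit=20",
    data=urlencode({"where": where}).encode("ascii"),
    headers={
        "Accept": "application/json",
        "Content-Type": "application/x-www-form-urlencoded",
    },
    method="POST",
)
# urllib.request.urlopen(req) would then return the JSON aggregation
# of matching news articles.
```
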

So there we have it: 60k news articles annotated with DBpedia concepts, an OWLIM triple store loaded with DBpedia, and an API that exposes SPARQL to the user but returns easily consumable news article JSON.

Developers can then rapidly prototype extremely cool web applications, limited only by their imagination (and the DBpedia universe), while learning about RDF and semantics at the same time!


