There are increasing benefits to identifying data held within these documents and publishing it in a form that is accessible to automated tools (as RDFa markup in pages served on the site, and through a special query server called a SPARQL endpoint).
These benefits include:
- Improved search engine rankings
- Improved search precision
- Enabling the creation of applications that combine data from different sources, for example event listings, schedules and maps
Because such applications are based on publicly available data, they can be developed by any third party with the necessary interest and technical ability. See the Southampton Open Data Service for an example of how this ecosystem of data and applications is beginning to develop.
The first step to enabling semantic search and querying on a site is identifying the entities - most importantly people, places and organisations - discussed in a document, along with external resources describing those entities. These resources act as globally unique IDs for the entities; it is this uniqueness that enables much of the power of the semantic web. If our article text mentions Portland, a conventional search engine has no way of knowing which of the places of that name is being referred to. If the article is semantically tagged with the URI "http://dbpedia.org/resource/Isle_of_Portland", this clearly identifies the article as being about Portland in Dorset, UK and not Portland, Oregon, for example. Semantic search engines such as Sindice can exploit this precision to offer extremely accurate search results.
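As a minimal sketch of what such tagging looks like in page markup (the helper function and the default type URI here are our own illustration, not GOSS iCM output), an entity mention can be wrapped in RDFa attributes that point at its unique URI:

```python
def rdfa_mention(text, entity_uri, type_uri="http://dbpedia.org/ontology/Place"):
    """Wrap an entity mention in an RDFa <span> linking it to a globally
    unique URI. Uses the RDFa `about` and `typeof` attributes; the default
    type URI is an illustrative assumption."""
    return '<span about="%s" typeof="%s">%s</span>' % (entity_uri, type_uri, text)

# The ambiguous word "Portland" in the article text becomes unambiguous:
span = rdfa_mention("Portland", "http://dbpedia.org/resource/Isle_of_Portland")
```

A semantic crawler reading the page can now extract the triple "this page mentions the Isle of Portland" regardless of how the visible text is worded.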
In order for semantic information to be generated in an efficient way, it is essential that automated tools are used to reduce the amount of labour involved. There are many different techniques for doing this, and the range of techniques and data sources available is constantly expanding (see the Linked Data Cloud diagram). For this reason it is important that our approach to the extraction of semantic information from article text is based around an extensible framework that will accommodate newly available technologies without the need for corresponding changes in the CMS. IKS FISE provides just such a framework.
The integration between GOSS iCM and FISE works as follows:
- IKS FISE is installed as an additional service with which GOSS iCM can communicate over HTTP.
- A new Semantic tab is provided inside the GOSS iCM article editor; when the user elects to retrieve entities from within this tab GOSS iCM sends a request to FISE to identify entities referred to in the article text.
- The user reviews these entities to ensure accuracy and relates them to the article. Entities are stored and related to the article by making use of special groups inside the existing metadata system.
- This information is used to generate RDFa markup on site pages that are generated from the article. It is also used to populate a public triple store that responds to SPARQL queries - this enables queries such as "Find me all places that are mentioned in articles that mention the RSPCA".
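The request in the second step can be sketched as follows. This is an illustration only: the endpoint path (`/engines`), port, and media types are assumptions based on the public FISE/Stanbol REST interface, not the actual iCM integration code.

```python
import urllib.request

def build_enhancement_request(article_text,
                              fise_url="http://localhost:8080/engines"):
    # POST the plain article text to a FISE enhancement endpoint and ask
    # for the identified entities back as RDF/XML. URL and headers are
    # illustrative defaults, not the shipped configuration.
    return urllib.request.Request(
        fise_url,
        data=article_text.encode("utf-8"),
        headers={"Content-Type": "text/plain",
                 "Accept": "application/rdf+xml"},
        method="POST")

req = build_enhancement_request(
    "The RSPCA has opened a centre on the Isle of Portland.")
# Sending it requires a running FISE instance:
#   with urllib.request.urlopen(req) as resp:
#       rdf_xml = resp.read()
```

The RDF/XML response enumerates candidate entities with their URIs and confidence scores, which is what the Semantic tab presents to the user for review.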
This is a prototype of the semantic functionality that will be available in a future GOSS iCM release, and it will continue to evolve through new developments within GOSS and the IKS project. One area of particular interest to GOSS is linking to data sources of greater relevance to our Public Sector clients, making it easier for them to meet their statutory responsibilities for publishing data, and to do so in a way that extends the "Linked Data Cloud" with 5 Star Linked Data rather than creating isolated silos of data. The flexibility of the IKS architecture ensures that we will be able to exploit new sources of data to the fullest as they become available. The future of the IKS architecture itself is assured: the associated software, including FISE, is now transitioning into a new Apache project under the name Stanbol, which will ensure its long-term development.
GOSS will be demonstrating this integration at the IKS Workshop in Paris in July 2011.