This article is a long introduction to linked data. Alternative ways of finding out more about linked data will be published in the coming months, including podcasts, webinars, and bitesize articles.
What is it
Linked data is… data you can link to. That’s because it’s built on all the mechanisms of the web, so each thing you want to describe gets a URL that is used both as an identifier for that thing and as an address you can look up to find information about it. Effectively every item in your dataset gets its own web page.
The data formats and protocols of linked data are all W3C standards, offering the possibility for different organisations to make their data available in compatible ways.
You can of course link to a whole Excel file or CSV file or Shapefile, without needing linked data, but in that case you are pointing to a whole bucket of data, and you need some other mechanism to be more specific ('open this link then go to row 125, column H'). That's much less standardised and much harder to use in an automated context.
Linked data uses the data-modeling specification RDF ('Resource Description Framework') as the underlying way to represent the data. That is a very flexible approach that allows (even requires) you to be very precise about your data and is great for machine processing and data integration. It's not necessarily the easiest thing for most users to work with, but because it is precise and easy to process in software, it's straightforward to create all kinds of other outputs from it according to what the data is and what users want - web pages, CSV, JSON, Shapefiles, etc.
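That projection from RDF into other formats can be sketched with a few lines of code. This is a minimal illustration using plain Python tuples rather than a real RDF library, and the water-body values are invented for the example; RDF stores every statement as a subject-predicate-object triple (explained in more detail later in this article), which makes generic conversions like these straightforward:

```python
import csv
import io
import json

# Each RDF statement is a (subject, predicate, object) triple.
# These example values are invented for illustration.
triples = [
    ("GB30535397", "label", "Captain's Pond"),
    ("GB30535397", "2019classification", "Moderate"),
    ("GB30535397", "waterBodyType", "Lake"),
]

# Project the same triples into CSV...
buf = io.StringIO()
writer = csv.writer(buf)
writer.writerow(["subject", "predicate", "object"])
writer.writerows(triples)
csv_text = buf.getvalue()

# ...and into JSON, grouping properties by subject.
records = {}
for s, p, o in triples:
    records.setdefault(s, {})[p] = o
json_text = json.dumps(records, indent=2)

print(csv_text)
print(json_text)
```

The same loop could just as easily emit HTML for a web page or rows for a Shapefile attribute table - the point is that one precise underlying representation feeds many user-facing views.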
Why you do it
For most organisations the reason for making data available is to help people find, understand and use it.
There are many ways to achieve those goals and an overall data strategy will combine a number of approaches, but linked data has an important contribution to make to those three pillars of making data useful.
The World Wide Web Consortium has put together the Data on the Web Best Practices and the Spatial Data on the Web Best Practices, giving guidance and recommendations on how to make use of the web to disseminate data. Effective use of linked data is well suited to implementation of those best practices.
It is also a good match to the recently released UK National Data Strategy.
Linked data directly addresses the 'Data Foundations' element of this strategy, which states:
“The true value of data can only be fully realised when it is fit for purpose, recorded in standardised formats on modern, future-proof systems and held in a condition that means it is findable, accessible, interoperable and reusable. By improving the quality of the data, we can use it more effectively and drive better insights and outcomes from its use.”
It also addresses the third priority ‘mission’ in the government’s strategy:
“[...] the government (will drive) major improvements in the way information is efficiently managed, used and shared across government. To succeed, we need a whole-government approach that ensures alignment around the best practice and standards needed to drive value and insights from data; and the creation of an appropriately safeguarded, joined-up and interoperable data infrastructure to support this”
People usually find data by following a link to it or by searching for it, whether through a web search engine or a more site-specific search.
Linked data allows URLs to be assigned not just to datasets, but to specific bits of data within a dataset, making it easy to share pointers to data. It becomes easier to back up reports or charts by citing the data that supports them.
Because everything has a URL, assuming it is publicly accessible on the web, web search engines like Google and Bing are able to index the data and direct users to it. In addition to their mainstream approach of indexing the text on web pages, they increasingly make use of metadata embedded in web pages to influence search ranking and to enrich how search results are presented. This embedded metadata takes the form of linked data using the schema.org vocabulary.
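To make the schema.org point concrete, here is the kind of metadata a dataset page might embed, sketched as a Python dict serialized to JSON-LD. The dataset name, description and URL are invented for illustration; only the structure (the `@context`, `@type` and property names) follows the schema.org convention:

```python
import json

# A hypothetical schema.org Dataset description, of the kind that can be
# embedded in a web page as JSON-LD for search engines to pick up.
# The name, description and URL below are invented examples.
metadata = {
    "@context": "https://schema.org",
    "@type": "Dataset",
    "name": "Water body classifications",
    "description": "Classification results for water bodies in England.",
    "url": "https://example.org/dataset/water-body-classifications",
}

json_ld = json.dumps(metadata, indent=2)
print(json_ld)
```

In a real page this JSON would sit inside a `script` tag of type `application/ld+json`, where crawlers look for it.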
So, in site-specific searches, it is possible to supplement text-based indexes with a more structured search using metadata. Linked data is well suited to linking datasets to standards-based, machine-readable metadata, helping people find the data they want.
Once people find data that might meet their purpose, they need to know where it came from and how it was processed. That helps them judge whether it is applicable to their problem, understand its level of quality and trustworthiness, and apply it to their own work.
As well as helping with data discovery, metadata is also important for helping people understand data once they find it. The web based nature of linked data makes it easy to connect data at a fine grained level to definitions, explanations and contextual information.
Use of standards is an important element of consistency and predictability in how data is presented, as well as making data more likely to be supported by good software tools.
What makes data easy for someone to understand depends greatly on their background and their purpose. Well-structured, documented, machine-readable linked data is a sound starting point for presenting data in many different ways: for example as a CSV download, or a map-based visualisation, or through a web-based explorer. Linked data makes it possible to offer many different views on an authoritative underlying collection of data.
The various Environment Agency linked-data-powered apps and explorers are good examples of the ways that data can be made more understandable for users.
Linked data is an enabling mechanism for these various views: directly using 'raw' linked data is generally only for specialists carrying out analysis or building more 'friendly' views for broader groups of users.
People generally want to use data to help them investigate an issue or to inform a decision. In most cases this involves using data from many different sources and the challenge of combining data from different 'data silos' is well known.
Several aspects of linked data help with this problem:
- it makes data accessible through the web
- its approach to naming makes data easier to join up: each thing being described has its own globally unique identifier (name), which means that separate datasets can re-use the same identifier when they are talking about the same thing
- the data is machine readable and queryable, which helps with automated filtering and processing of large amounts of data
- the data model or 'schema' is described in the data itself, making it possible to enrich data by adding in lookups and connections
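The joining-up benefit can be sketched in a few lines of Python. Two independently produced datasets that re-use the same identifier for the same water body can be merged with no lookup tables or matching heuristics; the property names and values below are invented for illustration, while the URL is the Catchment Data Explorer identifier used later in this article:

```python
# Two independent datasets that re-use the same identifier (a URI) for the
# same water body can be joined directly. Property values are invented.
WB = "https://environment.data.gov.uk/catchment-planning/WaterBody/GB30535397"

classifications = [(WB, "2019classification", "Moderate")]
monitoring = [(WB, "meanDepth", "1m")]

# Merging is just grouping triples by their shared subject identifier.
combined = {}
for s, p, o in classifications + monitoring:
    combined.setdefault(s, {})[p] = o

print(combined[WB])
```

Because both datasets agree on the identifier, the join is automatic - the hard work happens once, when publishers adopt shared identifiers, rather than every time someone combines the data.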
The linked data representation of information is a very precise one - it enables you, and sometimes requires you, to be explicit about aspects of the data that other approaches might leave implicit or open to interpretation. This makes it excellent for automated processing but requires some extra effort. The process of cleaning and transforming the data often helps expose mistakes or inconsistencies that might otherwise go unnoticed, and so helps improve data quality as a side effect of the data publishing process.
When publishers make linked data available (whether directly or via downloads and APIs) it enables other organisations to make use of it. There are many good examples from the Environment Agency's linked data, but also from other linked data publishers. Here are a few current examples:
- DBpedia “allows users to semantically query relationships and properties of Wikipedia resources, including links to other related datasets.” What this means in practice is that all of the information boxes that you can see on Wikipedia are queryable, so it is possible to write queries that look for relationships between things that have pages on Wikipedia. The closely related Wikidata project offers the same kind of querying; one example query finds all of the rivers on Wikidata and counts up which body of water has the most watercourses discharging into it: https://www.wikidata.org/wiki/Wikidata:SPARQL_query_service/queries/examples#Body_of_water_with_the_most_watercourses_ending_in_it
- The EA's Public Registers are used by the group Dsposal, who have developed an online platform to connect waste producers with licensed waste companies.
- The BBC use data from the flood warnings API on their website, which reaches a huge audience in times of flooding.
- FloodRe uses the flood warnings and river monitoring APIs to assess and predict their insurance liabilities.
- The 'Scottish Tech Army' (a group of furloughed IT professionals offering their skills for public benefit projects) created a dashboard of the Scottish Government’s linked data publishing of Covid statistics.
What it looks like technically
Rather than being stored as a collection of connected tables, rows and columns in a relational database, linked data is stored in a triplestore - a kind of graph database. Conceptually this is a much simpler way of storing data: it can be thought of as one big table consisting of three columns and as many rows as are needed (often hundreds of millions).
Each piece of information is stored as a triple - a statement that consists of an identifier (the subject), a property (the predicate), and a value (the object). As an analogy, it works a bit like grammar: the cat (subject) sat on (predicate) the mat (object), but it's easier to show with a real-world example.
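In code, a triple is naturally a three-part value. A minimal sketch of the grammar analogy in Python:

```python
# The grammar analogy as a (subject, predicate, object) triple.
triple = ("The cat", "sat on", "the mat")
subject, predicate, obj = triple

print(f"{subject} {predicate} {obj}")  # prints "The cat sat on the mat"
```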
If we take the Catchment Data Explorer as an example, we have a lot of data about water bodies. One of these data items is the overall 2019 classification. In the graph database, the data for one water body looks like this:
subject          predicate            object
Captain's Pond   2019classification   Moderate
For two water bodies, it looks like:
subject          predicate            object
Captain's Pond   2019classification   Moderate
Decoy Broad      2019classification   Poor
This principle can be extended, so more water bodies just means more rows (or triples). If a new water body is discovered, or created in the real world, then it's simply a case of adding more triples.
As well as accommodating new water bodies, this way of storing data makes the data model flexible. To add 2020 classification data to the database, we would simply add more triples - there is no need to change the structure of any tables.
subject          predicate            object
Captain's Pond   2019classification   Moderate
Captain's Pond   2020classification   Good
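This 'just add more rows' flexibility is easy to sketch in Python, holding the triples in an ordinary set. The values mirror the water-body examples in this section:

```python
# A tiny triplestore: each fact is one (subject, predicate, object) row.
triples = {
    ("Captain's Pond", "2019classification", "Moderate"),
    ("Decoy Broad", "2019classification", "Poor"),
}

# Adding a new kind of fact needs no schema change - just another triple.
triples.add(("Captain's Pond", "2020classification", "Good"))

# A simple query: the 2019 classification of every water body.
results = {s: o for s, p, o in triples if p == "2019classification"}
print(results)
```

A real triplestore indexes the three columns so that queries like this stay fast across hundreds of millions of rows, but the mental model is the same.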
We have a lot more information about these water bodies, and it can all be represented as more triples:
subject          predicate            object
Captain's Pond   2019classification   Moderate
Captain's Pond   meanDepth            1m
Captain's Pond   altitude             26m
Captain's Pond   waterBodyType        Lake
Captain's Pond   parentOpCatchment    Bure
...
With thousands of water bodies in England, it's easy to see how the number of triples can get into the millions.
From the examples so far, the 'linked' part of linked data hasn't surfaced. Looking at the examples above, we have the subject water body - Captain's Pond. With thousands of water bodies, there's a good chance that two could share the same name, which as well as being confusing would cause problems with this data model. To avoid this, instead of using the name of the water body as the subject, we use an identifier (a URI). These identifiers are unique within the database and are consistent - i.e. wherever we want to attach some data to Captain's Pond, we can use its identifier:
subject      predicate            object
GB30535397   label                Captain's Pond
GB30535397   2019classification   Moderate
GB30535397   meanDepth            1m
GB30535397   altitude             26m
GB30535397   waterBodyType        Lake
GB30535397   parentOpCatchment    Bure
...
But this still isn't linked, because actually we don't just use the identifier on its own - we use a URL as the identifier. In the case of Captain's Pond on the Catchment Data Explorer, this is https://environment.data.gov.uk/catchment-planning/WaterBody/GB30535397. Using a URL has several benefits:
- the identifier is globally unique - we know exactly what it is that we are talking about when we use that identifier.
- we can create a page on the internet that holds the information that we know about this water body (you can click the link to see this in practice).
- other people and applications can link to this water body. This could be from a report produced within the Environment Agency, an article about the water body on a local news website, or a reference from a separate dataset in another service, such as Flood Plan Explorer.
Wherever possible, data points within the dataset use URLs as their identifiers. In the example above, the classification, the water body type and the operational catchment that the water body is within would all be stored as URLs, each with its own page where you can find out more information, such as what 'Moderate' actually means, the definition of a 'Lake', or which other water bodies are in the same operational catchment area. This web of data has the potential to be extremely powerful, allowing people to explore, discover and use all the information in the dataset.
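The 'web of data' idea can be sketched as follows: objects in one triple can themselves be URLs that appear as subjects of further triples, so exploring the data means following links. The water-body URL is the real Catchment Data Explorer one quoted above; the classification URL and its descriptive text are invented for illustration:

```python
# A tiny 'web of data': the object of one triple can be a URL that is the
# subject of further triples. The classification URL and its description
# below are invented examples; only the water-body URL is real.
WB = "https://environment.data.gov.uk/catchment-planning/WaterBody/GB30535397"
MOD = "https://example.org/def/classification/Moderate"

triples = [
    (WB, "2019classification", MOD),
    (MOD, "label", "Moderate"),
    (MOD, "comment", "An invented explanation of the Moderate class."),
]

def describe(node, data):
    """Return every property attached to a node (a URL) in the graph."""
    return {p: o for s, p, o in data if s == node}

# Start at the water body, then follow the classification link
# to find out what 'Moderate' actually means.
classification = describe(WB, triples)["2019classification"]
print(describe(classification, triples))
```

On the real web of data, `describe` would be an HTTP request to each URL rather than a filter over a local list, but the traversal pattern - look up a thing, follow the URLs in its description - is exactly the same.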
Hopefully this introduction to Linked Data has been simple enough to understand, but detailed enough to be useful. We are building a resource library with more articles, tutorials, and podcasts, and will be making these available on the forum.