What it looks like technically
Rather than being stored in a database as a collection of connected tables, rows and columns, linked data is stored in a graph database. Conceptually, a graph database is a much simpler way of storing data: it can be thought of as one big table with three columns and as many rows as are needed (often hundreds of millions).
Each piece of information is stored as a triple - a statement that consists of an identifier (the subject), a property (the predicate), and a value (the object). As an analogy, it works a bit like grammar: the cat (subject) sat on (predicate) the mat (object). It is easier to show with a real-world example.
Take Catchment Data Explorer as an example: it holds a lot of data about water bodies. One of these data items is the overall 2019 classification. In the graph database, that looks like this for one water body:
| subject | predicate | object |
| --- | --- | --- |
| Captain's Pond | 2019classification | Moderate |
For two water bodies, it looks like:
| subject | predicate | object |
| --- | --- | --- |
| Captain's Pond | 2019classification | Moderate |
| Decoy Broad | 2019classification | Poor |
This principle can be extended, so more water bodies just means more rows (or triples). If a new water body is discovered, or created in the real world, then it's simply a case of adding more triples.
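As a rough illustration of the "one big table" idea, the rows above can be modelled in Python as a list of three-part tuples. This is a minimal sketch, not how a real graph database stores data; the `Triple` type and the water body "Example Mere" are made up for the example.

```python
from typing import NamedTuple


class Triple(NamedTuple):
    """One row of the three-column table (hypothetical helper type)."""
    subject: str
    predicate: str
    obj: str


# The two rows from the table above.
store = [
    Triple("Captain's Pond", "2019classification", "Moderate"),
    Triple("Decoy Broad", "2019classification", "Poor"),
]

# Recording another water body is just one more row.
# "Example Mere" and its classification are invented for illustration.
store.append(Triple("Example Mere", "2019classification", "Good"))

# Distinct subjects currently described in the store.
subjects = sorted({t.subject for t in store})
print(len(store), subjects)
```

Nothing about the shape of `store` changes when a water body is added: the list simply grows by one row.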
As well as accommodating more water bodies, this way of storing data makes the data model flexible. To add 2020 classification data to the database, we would simply add more triples, with no need to change the structure of any tables.
| subject | predicate | object |
| --- | --- | --- |
| Captain's Pond | 2019classification | Moderate |
| Captain's Pond | 2020classification | Good |
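To make the flexibility concrete, the sketch below treats the predicate as ordinary data, so a new kind of fact appears as soon as a row using it is appended. It assumes the same simplified tuple representation as a stand-in for the graph database; `predicates_in` is a hypothetical helper.

```python
# One existing row, as a (subject, predicate, object) tuple.
triples = [
    ("Captain's Pond", "2019classification", "Moderate"),
]


def predicates_in(rows):
    """Return the distinct predicates currently in use, sorted."""
    return sorted({predicate for _, predicate, _ in rows})


before = predicates_in(triples)

# Adding 2020 data needs no new column or table, only a new row.
triples.append(("Captain's Pond", "2020classification", "Good"))

after = predicates_in(triples)
print(before, after)
```

In a table-per-entity design the equivalent change would usually mean adding a column or a new table, which is the structural change the triple model avoids.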
We have a lot more information about these water bodies, and these can all be represented by more triples:
| subject | predicate | object |
| --- | --- | --- |
| Captain's Pond | 2019classification | Moderate |
| Captain's Pond | meanDepth | 1m |
| Captain's Pond | altitude | 26m |
| Captain's Pond | waterBodyType | Lake |
| Captain's Pond | parentOpCatchment | Bure |
| ... | ... | ... |
With thousands of water bodies in England, it's easy to see how the number of triples can get into the millions.
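With everything held as rows of the same shape, reading the data back comes down to matching a pattern against the three columns. The following is a simplified sketch of that idea in plain Python, using the Captain's Pond rows shown above; `match` is a hypothetical helper, and a real graph database would use a query language and indexes rather than scanning a list.

```python
# Rows from the table above, as (subject, predicate, object) tuples.
TRIPLES = [
    ("Captain's Pond", "2019classification", "Moderate"),
    ("Captain's Pond", "meanDepth", "1m"),
    ("Captain's Pond", "altitude", "26m"),
    ("Captain's Pond", "waterBodyType", "Lake"),
    ("Captain's Pond", "parentOpCatchment", "Bure"),
]


def match(triples, subject=None, predicate=None, obj=None):
    """Return the triples that agree with every argument that is not None."""
    return [
        (s, p, o)
        for (s, p, o) in triples
        if subject in (None, s) and predicate in (None, p) and obj in (None, o)
    ]


# Everything known about one water body, as a predicate -> object mapping.
about_pond = {p: o for _, p, o in match(TRIPLES, subject="Captain's Pond")}

# Which subjects are recorded as lakes?
lakes = [s for s, _, _ in match(TRIPLES, predicate="waterBodyType", obj="Lake")]
print(about_pond["altitude"], lakes)
```

Leaving an argument as `None` acts as a wildcard, so the same function answers both "what do we know about this water body?" and "which water bodies have this property?".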
In the examples so far, the 'linked' part of linked data hasn't surfaced. The subject in each row is the name of the water body, Captain's Pond. With thousands of water bodies, there's a good chance that two of them could share the same name, which would be confusing and would cause problems with this data model. To avoid this, instead of using the name of the water body as the subject, we use an identifier (a URI). These identifiers are unique within the database and are consistent, i.e. wherever we want to attach some data to Captain's Pond, we can use its identifier, which is GB30535397.
| subject | predicate | object |
| --- | --- | --- |
| GB30535397 | label | Captain's Pond |
| GB30535397 | 2019classification | Moderate |
| GB30535397 | meanDepth | 1m |
| GB30535397 | altitude | 26m |
| GB30535397 | waterBodyType | Lake |
| GB30535397 | parentOpCatchment | Bure |
| ... | ... | ... |
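The sketch below shows why the identifier matters when two water bodies share a name. GB30535397 is the identifier given above; "GB00000000", its label and its classification are hypothetical values invented purely to create a name clash, and `value_of` is an illustrative helper.

```python
# GB30535397 is the identifier given in the text. "GB00000000" is a
# hypothetical second water body that happens to share the same name.
TRIPLES = [
    ("GB30535397", "label", "Captain's Pond"),
    ("GB30535397", "2019classification", "Moderate"),
    ("GB00000000", "label", "Captain's Pond"),
    ("GB00000000", "2019classification", "Poor"),
]


def value_of(triples, subject, predicate):
    """Return the first object stored for (subject, predicate), or None."""
    for s, p, o in triples:
        if s == subject and p == predicate:
            return o
    return None


# Looking up by name is ambiguous: two subjects carry the same label.
same_name = [s for s, p, o in TRIPLES if p == "label" and o == "Captain's Pond"]

# Looking up by identifier is not.
classification = value_of(TRIPLES, "GB30535397", "2019classification")
print(same_name, classification)
```

The name becomes just another property (`label`) attached to the identifier, so it can be shared or changed without affecting which rows belong to which water body.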
But this still isn't linked, because we don't use the identifier on its own: we use a URL as the identifier. For Captain's Pond on Catchment Data Explorer, this is https://environment.data.gov.uk/catchment-planning/WaterBody/GB30535397. This has several benefits:
- the identifier is globally unique - we know exactly what we are talking about when we use that identifier.
- we can create a page on the internet that holds the information that we know about this water body (you can click the link to see this in practice).
- other people and applications can link to this water body. This could be from a report produced within the Environment Agency, it could be an article about the water body on a local news website, or it could be referenced from a separate dataset in another service, such as Flood Plan Explorer.
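As a small illustration of using a URL as the identifier, the sketch below builds the Captain's Pond URL from its short identifier and recovers the identifier again. The prefix is taken from the URL quoted above; treating it as a pattern that every water body follows is an assumption made for this example, and both function names are hypothetical.

```python
from urllib.parse import urlparse

# Prefix taken from the Captain's Pond URL above. Assuming every water body
# follows the same pattern is a guess made for this sketch.
WATER_BODY_BASE = "https://environment.data.gov.uk/catchment-planning/WaterBody/"


def water_body_url(water_body_id):
    """Build the URL identifier for a short water body identifier."""
    return WATER_BODY_BASE + water_body_id


def water_body_id(url):
    """Recover the short identifier: the last segment of the URL path."""
    return urlparse(url).path.rstrip("/").rsplit("/", 1)[-1]


url = water_body_url("GB30535397")

# The URL, not the bare identifier, is what goes in the subject column.
triple = (url, "label", "Captain's Pond")
print(triple[0], water_body_id(url))
```

Because the subject is now a full URL, any other dataset or document can refer to the same water body simply by quoting that URL.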
Wherever possible, data points within the dataset use URLs as their identifiers. In the example above, the classification, the water body type and the operational catchment that the water body is within would all be stored as URLs, each with its own page where you can find out more, such as what 'Moderate' actually means, the definition of a 'Lake', or which other water bodies are in the same operational catchment. This web of data has the potential to be extremely powerful, allowing people to explore, discover and use all the information in the dataset.
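The following sketch shows what "following the links" could look like once objects are URLs too. Only the Captain's Pond URL comes from the text; every `example.org` URL, the second water body and the helper functions are hypothetical placeholders, since the real URLs for the water body type and operational catchment are not given here.

```python
# Every URL under example.org is hypothetical; only the Captain's Pond
# URL comes from the text.
POND = "https://environment.data.gov.uk/catchment-planning/WaterBody/GB30535397"
OTHER = "https://example.org/WaterBody/other-lake"
LAKE = "https://example.org/def/water-body-type/lake"
BURE = "https://example.org/OperationalCatchment/bure"

TRIPLES = [
    (POND, "label", "Captain's Pond"),
    (POND, "waterBodyType", LAKE),
    (POND, "parentOpCatchment", BURE),
    (OTHER, "label", "Other Lake"),
    (OTHER, "parentOpCatchment", BURE),
    (LAKE, "label", "Lake"),
    (BURE, "label", "Bure"),
]


def objects(triples, subject, predicate):
    """All objects stored for a given subject and predicate."""
    return [o for s, p, o in triples if s == subject and p == predicate]


def subjects(triples, predicate, obj):
    """All subjects that point at a given object via a predicate."""
    return [s for s, p, o in triples if p == predicate and o == obj]


def label(triples, resource):
    """Human-readable label for a resource, falling back to its URL."""
    found = objects(triples, resource, "label")
    return found[0] if found else resource


# Follow the link from the pond to its type, then read the type's own label.
pond_type = label(TRIPLES, objects(TRIPLES, POND, "waterBodyType")[0])

# Follow the link to the catchment, then walk back to every water body in it.
catchment = objects(TRIPLES, POND, "parentOpCatchment")[0]
neighbours = [
    label(TRIPLES, s) for s in subjects(TRIPLES, "parentOpCatchment", catchment)
]
print(pond_type, neighbours)
```

Because the catchment is itself a subject with its own rows, the question "which other water bodies are in the same operational catchment?" is answered by following one link out and then the same link back.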
To explore Defra's linked data, go to https://environment.data.gov.uk/. For enquiries about the data, or to talk to the team about your data, please use the Submit Feedback/Report an issue link.