Introduction to using SPARQL
SPARQL is a query language you can use to get highly customised results sets from DEFRA Open Data. It's the most flexible and powerful way to work with the data.
This guide describes:
- What SPARQL is
- How to get to the SPARQL editor
- Writing SPARQL queries
- Worked example using the 'Milk prices and composition of milk - annual statistics’ dataset
What is SPARQL?
SPARQL is a query language, designed for querying RDF. SPARQL stands for SPARQL Protocol and RDF Query Language.
SPARQL will look familiar to anyone who has used SQL. Where SQL is for querying relational databases, SPARQL is for querying RDF (or graph) databases.
How to get to the SPARQL editor
The Defra SPARQL Query Tool is available from the following URL: https://environment.data.gov.uk/linked-data/tools/sparql
Writing SPARQL queries
To be able to write SPARQL queries to get data out of DEFRA Open Data, it is important to understand how the data is structured.
In this guide, we are only going to look at SELECT queries - to simply return data in tabular format (e.g. downloadable as CSV). There are other SPARQL variants that allow you to retrieve data in graph formats, but these are outside the scope of this guide.
At its heart, a SPARQL SELECT query is simply about pattern matching, using combinations of known-values and variables. The known values can be expressed using URIs (such as <http://royalfamily.com/charles>) or literals (e.g. strings, numbers), and the variables are of the form ?variablename. The simplest SELECT query would look like this:
SELECT ?s ?p ?o
WHERE { ?s ?p ?o }
This would return every single piece of data in the database. Because we’ve used variables in the subject, predicate and object part of the triple, there is nothing to restrict the triples being returned.
Continuing to use the Royal Family example from the Linked Data Primer guide, if we write a query that locks a part of a triple down:
SELECT ?p ?o
WHERE { <http://royalfamily.com/charles> ?p ?o }
Then this will only return those triples where Charles is the subject.
Note: these queries won't return any data from DEFRA Open Data, as this is made-up data about the Royal Family.
royalFam:charles prop:hasAge 68 .
royalFam:charles prop:hasTitle “Prince of Wales” .
royalFam:charles prop:hasTitle “Duke of Rothesay” .
royalFam:charles prop:livesAt “Clarence House” .
royalFam:charles prop:hasEyeColour “Blue” .
royalFam:charles prop:isPatron “AgeUK” .
royalFam:charles prop:wrote “The Old Man of Lochnagar” .
(We have used the ‘QName’ approach to defining prefixes here, to make the data a bit easier to read, where ‘royalFam:’ stands for <http://royalfamily.com/> and ‘prop:’ stands for ‘<http://example.com>’.
If we specify the (made up) hasAge predicate instead:
SELECT ?royal ?age
WHERE { ?royal <http://example.com/hasAge> ?age}
This would return all triples that match that pattern - giving us Charles and Camilla’s age.
royalFam:charles 68 .
royalFam:camilla 69 .
We can also use the results of one triple-pattern match in another:
SELECT ?royal ?title
WHERE {
?royal <http://example.com/hasAge> 68 .
?royal <http://example.com/hasTitle> ?title .
}
This will restrict the results to only those subjects that have an age of 68 (in this case just Charles), and then will return any the object of any triples where the subject is Charles, and the predicate is the Title. So this query would return:
royalFam:charles “Prince of Wales” .
royalFam:charles “Duke of Rothesay” .
And this is the basics of SPARQL querying. We’ve chosen a very simple database for our examples here, to illustrate the principles. In reality, it can be harder to work out how to get what we want out of the database. Thankfully, there are tricks to help guide us.
Worked example using the ‘Milk prices and composition of milk - annual statistics’ dataset
This example will describe how we query PMD using SPARQL to get the data to support this simple tool: https://swirrl.github.io/defra-open-datasets/farming-statistics/supporting-info/api-examples/
This tool finds the measure types for the dataset, allows a user to select one of these and then plots that over time on a chart.
It is written in HTML/javascript and the complete code can be seen by viewing the page source. For this example data is requested from PMD in JSON format by setting a header of 'Accept': 'application/json'. This data could also be requested as CSV or TXT.
Step 1: Find the measure types
The first query (see query in PMD) finds the measures associated with this dataset and the application adds them to the select.
In this query we are looking to get the measure type labels and also their URI’s (to use in the second query). This is set out in the SELECT line.
The first line of the WHERE clause gets the URI for each measure used in that dataset and the second line finds the label for each of these URIs (note the use of ‘?o’ in both lines).
Query:
SELECT DISTINCT ?o ?measures
WHERE { <http://environment.data.gov.uk/linked-data/data/milk-price-composition-annual/codes-used/measure_type> <http://www.w3.org/2004/02/skos/core#member> ?o .
?o <http://www.w3.org/2000/01/rdf-schema#label> ?measures .}
LIMIT 10
For the application itself this query is URL encoded and sent as a GET request to the following URL (this can be found under ‘More options’ on the SPARQL query page):
The application then uses this returned data to populate the select.
Step 2: Get the data for the selected measure type
The second query (see query in PMD) gets the data for the selected measure and sorts it to get into the correct data order. The value selected from the select by the user is input into the following query in place of <http://environment.data.gov.uk/linked-data/def/measure/price>
The first line finds all individual observations URIs for that dataset and the second line to find all the individual observations (values) that have the measure type requested.
The third line then finds the URIs for the dates for each of the Observation URIs and the final line looks up the labels we hold for these data URIs.
The data is then sorted using ‘ORDER BY ASC(?date)’ on these data labels.
Query:
SELECT ?date ?observation
WHERE {?observationUri ?b <http://environment.data.gov.uk/linked-data/data/milk-price-composition-annual> .
?observationUri <http://environment.data.gov.uk/linked-data/def/measure/price> ?observation .
?observationUri <http://purl.org/linked-data/sdmx/2009/dimension#refPeriod> ?dateUri .
?dateUri <http://www.w3.org/2000/01/rdf-schema#label> ?date .
}
ORDER BY ASC(?date)
LIMIT 100
The application then takes this data and creates a chart using Chart.js
Using prefixes
This query has been written in a verbose way, using the full URI’s to make it easier to see what is happening. However you can also make use of prefixes to shorten and simplify the query.
For example, in our example we have this URI:
‘http://www.w3.org/2000/01/rdf-schema#label’
If we set a prefix of
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
This could then be written as `rdfs:label’ in the query. The full query for step 1 in this case would then be:
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
SELECT DISTINCT ?o ?measures
WHERE { <http://environment.data.gov.uk/linked-data/data/milk-price-composition-annual/codes-used/measure_type> <http://www.w3.org/2004/02/skos/core#member> ?o .
?o rdfs:label ?measures .}
LIMIT 10
You will see in the DEFRA Sparql Query Tool that common prefixes are included in the default query. These can be used, added to, or removed as required.
Further information
Some good references to read more about SPARQL are:
- https://www.w3.org/TR/sparql11-overview/
- https://www.w3.org/TR/sparql11-query/
- https://www.dataversity.net/introduction-to-sparql/
- https://www.oreilly.com/library/view/learning-sparql/9781449311285/
To explore Defra's linked data, go to https://environment.data.gov.uk/. For enquiries about the data, or to talk to the team about your data, please Submit Feedback/Report an issue.
Comments
Please sign in to leave a comment