Describing ONS datasets with standard vocabularies

Last week I published some open data publishing principles that can inform further development of the Data Discovery Alpha. This week I’ve begun turning those principles into actionable recommendations.

For example, if we want reuse rights to be clear, then how can licensing be published in both human and machine-readable formats? This is something I’ve previously explored quite extensively at the Open Data Institute (ODI), so there’s plenty of practical guidance to build on.

Similarly, if we want datasets to be discoverable, always be presented in context and legible to users, then what information and metadata might need to be presented?

I’ve begun the process of developing this guidance by:

  • exploring the metadata already collected and managed by the ONS, and some of the ongoing work to improve it
  • reviewing existing metadata vocabularies to determine how well they align with the needs of the ONS and its reusers
  • comparing the metadata recommended by tools like open data certificates and some standard metadata profiles

You can see my brief comparison of open data certificates, Data on the Web Best Practices and some EU metadata profiles. There’s a great deal of agreement in terms of recommended metadata but there are some differences in what is considered to be mandatory.

The 3 main metadata vocabularies that I’ve been looking at are:

  • Data Catalog Vocabulary (DCAT) — which is supported by data.gov.uk, data.gov, all of the major open data portals and a variety of other open data tools. DCAT is based on standard vocabularies like Dublin Core that have been in use for many years.
  • DCAT-AP — an extension to DCAT that recommends the use of some additional metadata elements to ensure that data can be discovered and reused across different data portals in the EU
  • STAT-DCAT — an extension of DCAT-AP that adds support for describing statistical datasets. This work has been led by Eurostat and others in the statistical open data community

Collectively these standards describe how to:

  • publish descriptions of datasets and their distributions (downloads)
  • publish the structure of statistical datasets, for example, information on the dimensions and attributes used to report on observations
  • relate datasets to supporting documentation, version notes, and other material relevant to reusers

This is exactly what we need in order to present data in context and to ensure that users can understand how the data is structured.

A variety of formats can be used to publish this metadata, but JSON-LD looks like a strong candidate for a common, baseline format.
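As a rough illustration of what such a description might look like, here is a minimal Python sketch that builds a DCAT dataset record as JSON-LD. The function name, the example.org URLs and the titles are all placeholders invented for illustration, and the choice of properties is a simplified assumption rather than a full DCAT-AP or STAT-DCAT profile.

```python
import json


def describe_dataset(dataset_uri, title, description, documentation_url, downloads):
    """Build a simplified DCAT description of a dataset as a JSON-LD document.

    `downloads` is a list of (download_url, media_type) pairs, one per distribution.
    """
    return {
        "@context": {
            "dcat": "http://www.w3.org/ns/dcat#",
            "dct": "http://purl.org/dc/terms/",
            "foaf": "http://xmlns.com/foaf/0.1/",
        },
        "@id": dataset_uri,
        "@type": "dcat:Dataset",
        "dct:title": title,
        "dct:description": description,
        # Link to supporting documentation so the data is presented in context.
        "foaf:page": {"@id": documentation_url},
        # Each distribution is one downloadable form of the dataset.
        "dcat:distribution": [
            {
                "@type": "dcat:Distribution",
                "dcat:downloadURL": {"@id": url},
                "dcat:mediaType": media_type,
            }
            for url, media_type in downloads
        ],
    }


if __name__ == "__main__":
    # Hypothetical values, for illustration only.
    record = describe_dataset(
        "https://example.org/datasets/example",
        "Example dataset",
        "A placeholder description.",
        "https://example.org/datasets/example/notes",
        [("https://example.org/datasets/example.csv", "text/csv")],
    )
    print(json.dumps(record, indent=2))
```

Because the output is ordinary JSON, it can be served alongside a dataset page or embedded in it without any special tooling.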

To start testing out how well this works in practice I’ve started putting together some examples.

The examples include exploring Google’s recently launched support for describing datasets using Schema.org. This is at an early stage but is very closely aligned to existing standards and formats.
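For a sense of how close the Schema.org approach is, the sketch below builds a Schema.org `Dataset` description and wraps it in the script element used to embed JSON-LD in a web page. The helper names and all of the values passed in are hypothetical, and this is only one plausible subset of the available properties.

```python
import json


def schema_org_dataset(name, description, page_url, licence_url, download_url):
    """Build a minimal Schema.org Dataset description as a JSON-LD document."""
    return {
        "@context": "https://schema.org",
        "@type": "Dataset",
        "name": name,
        "description": description,
        "url": page_url,
        "license": licence_url,
        "distribution": [
            {
                "@type": "DataDownload",
                "encodingFormat": "text/csv",
                "contentUrl": download_url,
            }
        ],
    }


def as_script_tag(document):
    """Wrap a JSON-LD document in a script element for embedding in HTML."""
    body = json.dumps(document, indent=2)
    return '<script type="application/ld+json">\n' + body + "\n</script>"


if __name__ == "__main__":
    # Hypothetical values, for illustration only.
    doc = schema_org_dataset(
        "Example dataset",
        "A placeholder description.",
        "https://example.org/datasets/example",
        "https://example.org/licence",
        "https://example.org/datasets/example.csv",
    )
    print(as_script_tag(doc))
```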

Collectively this looks like a promising way forward, and should provide a solid foundation for implementing the open data publishing principles.

The next steps are to test this out with more examples, particularly around describing statistical datasets. I’m also keen to explore how CSV on the Web can be used to help provide metadata for the CSV files published by the ONS.
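To give a flavour of what CSV on the Web metadata involves, this is a small Python sketch that generates a metadata document describing the columns of a CSV file. The file name, title and column definitions are invented placeholders, and a real description of an ONS file would likely need more detail than this.

```python
import json


def csvw_metadata(csv_url, title, columns):
    """Build a minimal CSV on the Web metadata document for one CSV file.

    `columns` is a list of (name, human_readable_title, datatype) tuples.
    """
    return {
        "@context": "http://www.w3.org/ns/csvw",
        "url": csv_url,
        "dc:title": title,
        "tableSchema": {
            "columns": [
                {"name": name, "titles": label, "datatype": datatype}
                for name, label, datatype in columns
            ]
        },
    }


if __name__ == "__main__":
    # Hypothetical file and columns, for illustration only.
    metadata = csvw_metadata(
        "example.csv",
        "Example statistical table",
        [
            ("area_code", "Area code", "string"),
            ("year", "Year", "gYear"),
            ("value", "Value", "integer"),
        ],
    )
    print(json.dumps(metadata, indent=2))
```

By convention the output would be saved next to the data file, for example as `example.csv-metadata.json`, so that tools can find the description when they fetch the CSV.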

As ever, if you have feedback or comments then please get in touch.

Some open data publishing principles

This week I’ve started working with the Digital Publishing team at the ONS. They’re currently hard at work on the Data Discovery Alpha exploring how to better support users in finding and accessing datasets.

As our national statistics body and the UK’s largest producer of official statistics, it’s important that the ONS is seen as an exemplar of how to publish high-quality data. Open data from the ONS should be published according to current best practices. The team have asked me to help them think through how these apply to the ONS website.

This is an exciting opportunity and I’m already enjoying getting up to speed with everything that’s happening across the organisation. It’s also a big task as the ONS publish a lot of different types of data. For example, it’s not just statistics: there are also geographic datasets.

To help frame the work that we’ll be doing I’ve drafted a few high-level principles which I thought I’d share here.

The principles provide an approach for thinking about open data publishing that focuses on the outcomes: what is it that we want to enable users to do?

Importantly, the principles are aligned with the Data on the Web Best Practices, the recommendations in the Open Data Institute’s Open Data Certificates, and the Code of Practice for Official Statistics.

Obviously, implementing all this will also draw on the open principles enshrined in the GDS service manual. For example, building on open standards.

  1. Make data discoverable

Datasets need to be discoverable on the ONS website and the team are continuing to put a great deal of effort into that.

But there are various ways in which discovery can happen and not all of those need be on the ONS website. Users might find data via Google and/or specialised data aggregators and portals.

This means that data needs to have good quality descriptive metadata and be easily indexed by third parties.

  2. Ensure reuse rights are always clear

Data published by the ONS is reusable under the Open Government Licence (OGL). But individual datasets may be derived from data provided by other organisations. This means re-users may need to include additional attribution or copyright statements when reusing the data.

While these requirements are all documented, the rights of re-users, along with any obligations should be clear at the point of use.

And, as data may be distributed by third-parties, those licensing and rights statements should also be machine-readable.
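One way this could look in practice is sketched below: a JSON-LD rights statement that carries the licence link plus the attribution and copyright text a re-user would need. It assumes the ODI's Open Data Rights Statement vocabulary (the `odrs:` terms) alongside Dublin Core. The dataset URI and the attribution wording are placeholders, not official ONS text.

```python
import json

# Open Government Licence v3.0, as published by The National Archives.
OGL_V3 = "http://www.nationalarchives.gov.uk/doc/open-government-licence/version/3/"


def rights_statement(dataset_uri, licence_uri, attribution_text,
                     attribution_url, copyright_notice):
    """Attach a machine-readable licence and rights statement to a dataset."""
    return {
        "@context": {
            "dct": "http://purl.org/dc/terms/",
            "odrs": "http://schema.theodi.org/odrs#",
        },
        "@id": dataset_uri,
        # A direct licence link that generic tools can pick up.
        "dct:license": {"@id": licence_uri},
        # A fuller statement covering attribution and copyright obligations.
        "dct:rights": {
            "@type": "odrs:RightsStatement",
            "odrs:dataLicense": {"@id": licence_uri},
            "odrs:attributionText": attribution_text,
            "odrs:attributionURL": {"@id": attribution_url},
            "odrs:copyrightNotice": copyright_notice,
        },
    }


if __name__ == "__main__":
    # Hypothetical dataset and wording, for illustration only.
    statement = rights_statement(
        "https://example.org/datasets/example",
        OGL_V3,
        "Example Statistics Office",
        "https://example.org/",
        "Contains data from a hypothetical third-party supplier.",
    )
    print(json.dumps(statement, indent=2))
```

Where a dataset is derived from another organisation's data, the extra attribution or copyright requirement would travel with the data in the `odrs:copyrightNotice` text rather than living only on a separate web page.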

  3. Help users cite their sources

Clear attribution statements and stable links can do more than help users fulfil their obligations under the OGL.

Easy ways to reference and link to datasets will encourage users to cite their sources. This provides another route for potential users to discover datasets, by following links to primary sources from analysis, visualisations and applications.

Stable links, clearly labelled citation examples, and supporting metadata can make all of this easier for reusers.

  4. Always present data in context

Access to data only gets you so far. Deciding whether the data is fit for purpose and the process of turning it into insight requires access to more information.

Documentation about the contents of the dataset, notes on how it was collected and processed, and any known limitations with its quality are all important to deciding when and how a dataset might be used.

Users should be able to easily find and access this contextual information. Where possible it should be packaged with the dataset to support downloading and redistribution.

  5. Make datasets legible

Statistical datasets can be very complex. They can include multiple dimensions and use complex hierarchical coding schemes. Terms used in the data may have specific statistical definitions that are at odds with their use in common language. Individual data points may even have annotations and notes, for example, to mark provisional or revised figures.

This information needs to be as readily accessible as the data itself. This makes it easier for re-users to understand and correctly interpret the data. Ideally definitions of standard attributes, dimensions and measures should all be independently available and accessible, especially where these are reused across datasets.

  6. Data should be useful for everyone

Open formats and standards ensure that data can be used by anyone, without requiring proprietary software or systems. But there is no single approach to consuming and reusing data. Treating data as infrastructure means recognising that there are a range of communities interested in that data and they have different needs.

Supporting these user needs may require presenting a choice of formats and data access options. Some users will want customised downloads while others may want to automatically access data in bulk or via APIs.

The GDS registers framework is a good example of a system that supports multiple ways to access, use and share the same core data.

  7. Make data part of the web

Hopefully, as the other principles make clear, a dataset doesn’t stand alone. There’s a whole collection of supporting documentation, definitions and metadata that helps to describe it. And, surrounding that, are all of the other outputs of the ONS: the bulletins, visualisations and other commentary that threads together multiple datasets.

Regardless of the technology used to manage and publish data, everything that a user needs to refer to or share should have a place on the web.

Collectively these principles should give us a framework that will guide the work carried out on the alpha and beyond. Over the coming weeks I’ll be turning these principles into suggestions and recommendations for how to manage and publish open data as part of the ONS website.

If you’ve got feedback or comments then I’d love to hear from you!

A year in the life of Digital Publishing

Well then. As is only right and normal at the end of a year, it is time to look back at a year in the life of Digital Publishing. It has been a big year for us here in ONS. We launched our new site in February. We also saw a bunch of people move on to new projects and recruited some very talented people to help us ensure we continue to improve everything we do.

So – let’s dust off the time machine and take a tour of our blog post highlights of the year.

January

We were excited about hitting 200,000 followers on our corporate Twitter account; a big step for any account and especially so for a National Statistical Institute. At keyboard time, we now have closer to 250,000 and this post from Jo (our social media lead) shows how we have spent time and energy making sure we have the right content on the right social platforms.

February

February was all about the launch of the ONS website. Product Rob put together the all-important website launch blog post and, looking back, it is amazing to see how much excitement was in the team at this big moment in ONS digital history. Matt (formerly of this land) did a great post looking at the whole Beta phase of the project. Everyone who was involved in that project should be very proud of the great work they did. It enabled pretty much every other thing we did for the rest of the year.

March

March saw Product Rob again talking about datasets (something that would be a theme across the year) and Lauren from “team social” took some time to explain how a tweetalong works for a big day like the budget. Rob (the other one) also put together some thoughts on what open data means to the ONS and highlighted some nice ways of using it.

April

April saw Matt talking about what we wanted to prove. This is something that I have gone back to time and time again and have pinched for use in pretty much all other ONS projects I have been involved in. Thanks to Tom for giving us the idea in the first place.

May

May saw the arrival of Service Managers at the ONS and this post talks a little bit about what that means. It also features a picture of my face, which suggests this was the time of the year when I arrived at team ONS digital (it has been an exhausting and thrilling few months). We also started putting out some posts about the website fixes we were making (Rob the original doing most of these).

June

June saw Zoe and Rob (not that one, or that one either) talking about charts. We love a chart, so these posts are vital reading! We also looked at the accessibility of social media content, which is not talked about enough but is a very important topic.

July

July saw Matt signing off with a string of posts about the principles we try to use to guide the team here. These are a powerful reminder that codifying sensible things can be really important to a team.

August

August saw some reporting on new content formats from Rob the third. This was part of a really important project for me, experimenting with different presentation techniques for our different users. Alongside this we saw a continued collection of release notes (note to self, we need to get back into the swing of things with these – they are really useful!).

September

September saw some really helpful notes about our use of Slideshare, more of those weeknotes and a post from me around the importance of public roadmaps (posting this, with a link to our very own roadmap, was one of the most bracing things I have done in a long while. I am pleased to say it has been a really helpful process for the service team).

October

October was all about sprint notes, conference write ups, the arrival of Benjy into the team, Rhod talking about design patterns, Laura on proof reading and Rob on dashboards. This, perhaps more than any other month, shows the full range of things we work on across the service (and why it is such an exciting thing to work on).

November

November saw the Data Discovery project starting to make an appearance in our thinking with a lovely post about the inception process we used to get that whole big chunk of work moving. Data Discovery is something that we will be talking about for quite some time more. Alongside that, Lisa put together some thoughts on GDP – a subject I feel I now know a lot more about than I ever thought possible.

December

December is carrying on the Data Discovery theme with more detail on the design patterns needed for data, a check-in on sprint outputs and something from me that I haven’t finished writing yet. It also saw 2 very exciting (long overdue) digital blog post debuts. Delivery Manager Rachel took some time to discuss how we have kept our roadmap up-to-date and Al discussed the work we have been undertaking in expanding the personas we use to help guide who we undertake user testing with.

All in all a pretty busy year and I can promise many more posts to come in the New Year.

Open Educational Resources from ONS

Since my arrival at the ONS 6 months ago, I have received a good deal of training on the sort of stuff you’d expect at a big, responsible organisation – anti-bullying and harassment training, equality and diversity training, health and safety, responsibility with information, and a raft of ‘digital awareness’ modules. I think this training is of real value, but a lot of people might see the legal requirement for it as an undue burden on business, especially for small to medium sized enterprises (SMEs). I would argue that government bodies like the ONS can maximise the value of their work and reduce some of the perceived burden on SMEs by applying Open Data philosophy to all resources, pushing beyond the common misunderstanding that ‘open data’ is just the information that can be found in spreadsheets.

Open Data Day 2016

Open Data Day has come around again! Hundreds of events with thousands of attendees happened over 6 continents – what a community of developers, hackers, data wranglers and designers there is out there: talk about the Digital Revolution! I was lucky enough to attend the London event and take part in an excellent project to do with the gender of London’s street names.

Open Data Day London attendees
The gathering in Newspeak House, London

The project was all the more interesting because it was based on another project by hackers from Montevideo in Uruguay. They had collected their city’s street names from Open Data sources and then used a system called Genderize and a lot of manual curation to identify all the streets named after women. They’d then plotted this on a map on their project site, A-tu-nombre.

Montevideo streets highlighted when named after women
Streets named after women are highlighted in orange: from atunombre.uy

We decided to do the same thing for London. It was interesting to see how the same project was approached differently by us. Our assumption was that this was a project intended to highlight gender disparity and so we were concerned with plotting men vs women on our map. However, a big part of the focus in Uruguay had been to highlight the women and link to their Wikipedia pages so people could learn more about them: an approach that is more about cultural history and a bit less adversarial.

Other differences became obvious in the challenge itself. For example, street naming in Montevideo often uses the full name of the person, whereas in the UK we tend to use a surname or title and it’s much harder to automate the identification. This meant we didn’t bother with automated links to Wikipedia and just stuck with the war of the sexes (see how that looked at the end).

This is user engagement

Getting stuck in at a hackathon was a great way to build relationships with developers and Open Data users that wouldn’t normally fall into our ‘User Experience’ surveys and seminars as well as to build relationships with obvious groups like Open Knowledge. I was impressed to be working alongside local council employees and after discovering they have lots of opinions on ONS Open Data I’ll be going to visit them to hear the experience of their whole team.

Another exciting hookup was with Data Campfire who are prototyping a platform that lets data users promote their projects and link to the publishers of the data they’ve used. It’ll be so much easier to learn from our wider data users if we can get a ping from that platform whenever someone posts a new use of our data.

Perhaps the best linkup was with the original team from Uruguay who were at their own Open Data Day event and happy to give us pointers and encouragement over the course of the event. Open Data is global and it’s great to have the opportunity to engage with potential users on another continent.

For anyone that’s thinking, ‘but I don’t have the skills to go to one of these things’, I can report that it was a hugely diverse group with bloggers, designers, journalists and activists alongside the obvious programmers and data geeks. You can definitely join in and contribute at an event like this.

See our project

Open Source goes hand in hand with Open Data so check out the gender assignment code over at GitHub. Or check out the CartoDB map of London’s streets with gender.

London’s streets highlighted by gender

You might notice that although it’s ‘quite good’, it’s not perfect. Long Acre is considered female for example and we had to manually intervene to stop all the lanes being genderised because Lane is a legitimate name. However, there is a reason the Open mantra is ‘release early, release often’. Rather than sit on the project until the system is perfect – many, many months from now – we can post our code and share our ideas and hopefully inspire the community straight away, just as we were inspired by the team in Uruguay.
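To make the ‘Lane’ problem concrete, here is a toy Python sketch of the general idea. It is not the code from the project: the lookup table, the street-type list and the override entry are all small hypothetical stand-ins for the name-to-gender service and the manual curation described above.

```python
# Hypothetical name-to-gender table standing in for a real lookup service.
NAME_GENDER = {
    "victoria": "female",
    "albert": "male",
    "lane": "male",  # 'Lane' is a legitimate given name, which causes trouble
}

# Trailing words that describe the kind of street rather than who it honours.
STREET_TYPES = {"street", "road", "lane", "avenue", "close", "square"}

# Hand-curated corrections for streets known to be misclassified.
MANUAL_OVERRIDES = {"long acre": "unknown"}


def classify_street(street_name):
    """Guess whether a street is named after a man or a woman."""
    key = " ".join(street_name.lower().split())
    if key in MANUAL_OVERRIDES:
        return MANUAL_OVERRIDES[key]

    words = key.split()
    # Drop a trailing street type so 'Park Lane' is not matched on 'Lane'.
    if len(words) > 1 and words[-1] in STREET_TYPES:
        words = words[:-1]

    for word in words:
        gender = NAME_GENDER.get(word)
        if gender is not None:
            return gender
    return "unknown"


if __name__ == "__main__":
    for street in ["Victoria Street", "Albert Road", "Park Lane", "Long Acre"]:
        print(street, "->", classify_street(street))
```

Stripping only the final word keeps the rule simple, and the override table gives a place to record each manual intervention so it is visible to anyone reusing the code.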

Update: Gregor Boyd over at the Data Donkey blog has copied/extended this project for Edinburgh’s streets using a different data source and a different mapping system – check out Edinburgh streets by gender too. If you repeat/extend this project for your neighbourhood, please do comment to let us know!

Open Data is the new oil that fuels society

If you want to be followed on social media, just add #bigdata or #datascience to your posts. These are the buzzwords of the century so far and, with the aid of geek chic, have brought computer tech to its greatest prominence since #dotcombubble. We’re all going to get rich off Big Data, or so the story goes – data is the ‘new oil’ (or information is the new oil) and Data Scientist is the ‘sexiest job of the 21st Century’. These ideas have been endlessly rebutted and reinforced over the last couple of years but, regardless of how much might be hype, data is definitely the big thing of the moment.

Data is the new oil, taken from The Human Face of Data press kit

Arguably the poor cousin of Big Data is Open Data. This is probably because venture capitalists hate the idea of just giving away their IP, USP and other acronyms, but also possibly because outside of the nascent hacker communities, not many people get too excited about having machine-readable access to bus timetables or waste management data.

And yet, Open Data has been getting a lot of loving attention from governments, especially in the aftermath of the global financial crash and the ubiquitous drive to cut costs via efficiency savings and perhaps even increase economic returns from government assets.

This government-sponsored open movement is incredibly timely and important. In part this is because the Open Source and Open Data movements are really priming the pump of the Data Science industry (or Digital Economy), but it also offers to increase public trust in government, something that appears a lot in the UK Statistics Authority’s Code of Practice. It also promises more globally linked-up monitoring, evaluation and strategising, which is surely required for tackling global social challenges like climate change, food security and our ageing populations.

The UK has been at the forefront of Open Data for a few years, only just being pipped to pole position by Taiwan in this year’s Open Data Index. The Office for National Statistics has been leading that charge and currently has 1213 datasets and over 20 thousand reference tables available via the ONS website – and yet there is so much further to go in opening up and unleashing the full potential of Open Data for “UK Plc” and our society.

I have been in my new role as Open Data Lead at the ONS for 3 weeks so far. It’s still early days but I’ve been excited to see the developments underway – with a new website almost finished beta testing, an API that’s also in late stage beta, and a pilot project for a Linked Data portal/API just kicking off (watch this space).

A big part of my role is community engagement and advocacy and I’ll be hoping to create a dialogue both on this blog and on social media (@bobbledavidson) on how the ONS should be pushing forward with Open Data. What data needs to be released? What format is best? I want to hear from you.

If data is the new oil, Open Data is the oil that fuels society and we need all hands at the pump.