Building and stocking the survey question library

Data Science Campus
June 27, 2017

Categories: Apprenticeships, Projects

The Office for National Statistics (ONS) is on a fantastic journey to better serve the UK by improving the quality of our statistics and the processes by which they are produced.

Over the last month, I have been working on an important project to develop a library of survey questions. This supports our data collection transformation programme’s objective to significantly transform ONS’s data collection activity which includes replacing paper questionnaires by moving our surveys online.

The question library will be the digital store of all the questions we ask on our surveys, all the information which describes those questions, and all the information on how they are used to compile our statistics. Most importantly, it makes this information digitally accessible.

Describing this change as just “getting surveys online” is a bit simplistic though. This is because:

responding to a question online is different from responding on paper. The questions must be transformed, not simply translated, and this offers opportunities for improvement, as well as some challenges
a digital channel means that multiple different surveys on similar topics can be combined into a modular approach, which saves time for the respondent; this means that questions need to be consistent across surveys and requires some structural changes to the surveys themselves
bringing surveys online efficiently requires a common back-end to create, store and maintain surveys and questions; this is to avoid building a new digital system for each individual survey

A shared question library will help address these challenges by providing:

reference data to compare past and future questions and surveys: wording, validation, inclusions and exclusions all digitally available to those producing statistics
questions in a machine-readable format as this will enable text analytics to identify all similar and relevant questions across surveys and therefore speed up the process of streamlining surveys
a common format to store surveys enabling self-serve updating and authoring, greatly reducing the cost of moving surveys online and maintaining those surveys once they are online

My role as one of the Data Analytics apprentices is to develop the question library by converting those remaining paper surveys into a machine-readable format.

Stocking the question library

This is not as straightforward as you’d like to think. It’s not as simple as copying and pasting questions from PDF to whatever desired format someone chooses. Capturing text within a PDF is extremely easy, but capturing the information correctly in a specific order and structure is another kettle of fish. Content within the survey needs to be nested in a hierarchical format, so that values are within values.

Here I’ll explain…

The current aim is to have a program that allows a user to enter a tracking code (others call this a box number as I’ve had to clarify this many times), which brings up everything within the hierarchy above it. Such content would be the question it belongs to, any inclusions or exclusions associated with that question, the section that question is in and so on, until we know everything about that question and the context of the answer provided by the respondent.

Yes, this can be done within a tabular format but there is going to be a lot of repetitive data. This is why we have selected JavaScript Object Notation (JSON) format. Other formats such as XML have been discussed but we believe JSON enables us to do what we want quickly and efficiently.

We chose JSON over the likes of XML for a number of reasons. For example, it is easier to read and write (which is very beneficial considering the number of surveys and questionnaire types we have to go through), it can store multiple values within other values using arrays, and it’s easier to parse information. The final schema needed for a question library will likely be more verbose than JSON but for our current aims it works well. Plus, whichever future format is decided, JSON should be able to be translated easily.

An example JSON (JavaScript Object Notation) file snippet from one of ONS’ business surveys.

By storing surveys in a nested file format (such as JSON), understanding the context of a question as answered with sections and subsections is made easier.

Next steps

Despite having some “bang your head against a wall” moments, we are now at the stage where progress is extremely good. We have been joined by a graduate developer, Ian Edward, who has been developing a program using Python to process the JSON files, which then views them in a much more user-friendly HTML format – so we know that the JSON files are being structured correctly when it comes to nesting them, which is great! That said, the real fun begins once we start to complete text analytics on the captured data to enable survey transformation.

Developing a strategic question library is a great example of ONS taking a data-driven and cost-conscious approach to the process of producing statistics and I am excited to be a part of this positive change within ONS and the wider government.

Tags: Apprenticeships, code, data, Natural Language Processing, Projects, Research, surveys

Data science for the public good

Building and stocking the survey question library

Recent Posts

Data science for the public good

Share this post

Recent Posts