Optimus – A natural language processing (NLP) pipeline for turning free-text lists into hierarchical datasets

Many datasets contain variables that consist of short free-text descriptions of items or products. The Data Science Campus has been working with DEFRA to understand shipping manifests for ferry journeys, which record short descriptions of the cargo on lorries boarding the ferry. The huge variation in the level of detail, the scale of description and how items are recorded (such as incorrect spellings or syntactic differences for identical products) makes it difficult to automatically clean the data into a structured state that is ready for aggregation and analysis.

What’s the data science?

We are developing an NLP pipeline that uses a Subword-Information Skipgram (SIS) model to retrieve vector representations of item descriptions, allowing tiered grouping of syntactically and semantically similar descriptions. Within the pipeline, the relationships between the individual words in each group are assessed and labels are generated automatically.
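To make the grouping idea concrete, here is a minimal sketch in Python. It is not the Optimus implementation: instead of a trained skipgram model it uses a plain bag of character trigrams as a stand-in for subword information, so it only captures spelling-level (syntactic) similarity, not semantic similarity. The function names, the example descriptions and the 0.6 threshold are all assumptions chosen for illustration.

```python
import math
from collections import Counter

def char_ngrams(text, n=3):
    """Return the character n-grams of a description, with boundary markers."""
    padded = f"<{text.lower().strip()}>"
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]

def embed(text, n=3):
    """Toy subword vector: counts of character n-grams (no training involved)."""
    return Counter(char_ngrams(text, n))

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a if k in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def group(descriptions, threshold):
    """Greedy grouping: add each description to the first group whose
    seed (first member) is at least `threshold` similar, else start a new group."""
    groups = []
    for description in descriptions:
        vector = embed(description)
        for candidate in groups:
            if cosine(vector, candidate["seed"]) >= threshold:
                candidate["members"].append(description)
                break
        else:
            groups.append({"seed": vector, "members": [description]})
    return [g["members"] for g in groups]

# Hypothetical cargo descriptions, including one misspelling.
items = ["frozen chicken", "frozen chiken", "fresh salmon", "car parts"]
groups = group(items, threshold=0.6)
```

With these inputs the misspelt "frozen chiken" joins the "frozen chicken" group, while the other two descriptions each form their own group. Running the same function again with a lower threshold would give a coarser tier; a trained subword model would supply learned vectors in place of `embed`, which is what allows semantically related items to be grouped as well.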

What’s the impact?

The ability to produce structured datasets in which each item is classified across multiple hierarchical tiers enables data to be aggregated at different granularities and linked to existing taxonomies.
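As a rough illustration of aggregating at different granularities, the sketch below sums quantities over the first few tiers of each item's label path. The tier labels, quantities and the `aggregate` helper are hypothetical and are not taken from the project's data or code.

```python
from collections import Counter

# Hypothetical classified records: a (tier 1, tier 2, tier 3) label path plus a quantity.
records = [
    (("food", "meat", "chicken"), 10),
    (("food", "meat", "beef"), 5),
    (("food", "fish", "salmon"), 7),
    (("vehicles", "parts", "tyres"), 3),
]

def aggregate(records, depth):
    """Sum quantities over the first `depth` tiers of each label path."""
    totals = Counter()
    for tiers, quantity in records:
        totals[tiers[:depth]] += quantity
    return dict(totals)

coarse = aggregate(records, depth=1)  # totals per top-level tier
fine = aggregate(records, depth=2)    # totals per (tier 1, tier 2) pair
```

Because every item carries its full label path, the same records can be rolled up to any tier without reclassifying them.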

The processing pipeline will be released as a generalised tool that can be applied to other datasets consisting of short item or product descriptions.

If you’re interested in learning more about this project or would like to get involved then you can get in touch via email or Twitter.