Extracting, visualising and identifying emerging important terminology from patent collections

Data Science Campus
May 27, 2020

Categories: News, Operation and Automation, Projects

In this publication, we provide an overview of how pyGrams, a natural language processing tool developed by the Data Science Campus, can be used to extract emerging terminology from large document collections. We apply this approach to patent documents and perform experiments to identify emerging and declining terminologies and technologies.

The adoption of a digital life by individuals, organisations and government departments means that more and more digital data are captured every day. Nearly all these data will have associated metadata, including a time stamp, which gives users a digital diary of activity.

Although the history of patents extends back to Ancient Greece, the introduction of computers means that patent offices around the world now hold millions of patent documents in digital form (for example, the US Patent and Trademark Office (USPTO) has patent records dating from 1976). These electronically stored patents provide a detailed timeline of innovation around the world.

Analysing datasets such as patents can provide invaluable insights. Early identification of emerging technologies can inform decisions by policymakers or other decision-makers prior to large-scale take-up of technologies, components or their adoption into manufacturing processes. This early identification can allow organisations and individuals to allocate suitable resources to address future technological trends and requirements more efficiently.

For example, when unmanned aerial vehicles (UAVs), such as drones, became cheaply available to the general public, the UK Civil Aviation Authority (CAA) had to create a new set of guidance and regulations for users.

However, finding a signal in the noise is challenging. pyGrams aims to tackle this issue, allowing users to extract, visualise and identify emerging terms within large document collections such as, but not restricted to, patents. This project is ongoing, and updates will be published here as they become available.

Our work here has been developed alongside domain experts at the Intellectual Property Office (IPO).

Extracting terms from patent collections

We provide an overview of pyGrams and how it can be used to extract emerging terminology from documents. We also explore how a time series approach can be used to nowcast term usage.

Time series analysis of patents’ important terminology using e-score and net growth

We discuss two methods of analysing the time series of keywords gained from patent documents. We perform a number of experiments to show how these methods can be used to identify emerging and declining terminologies and technologies.
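To make the idea of emerging and declining terms concrete, here is a minimal Python sketch. It is not pyGrams and does not use the e-score or net growth definitions from the article above. The toy corpus, the function names and the "later half minus earlier half" growth measure are all illustrative assumptions.

```python
from collections import Counter, defaultdict

# Hypothetical toy corpus: (year, text) pairs standing in for patent abstracts.
DOCS = [
    (2015, "fuel cell stack with fuel cell membrane"),
    (2016, "fuel cell cooling system"),
    (2017, "drone rotor assembly"),
    (2018, "drone rotor control and drone battery"),
    (2019, "drone battery charging dock for drone rotor"),
]


def bigrams(text):
    """Return adjacent word pairs from a whitespace-tokenised string."""
    words = text.lower().split()
    return [" ".join(pair) for pair in zip(words, words[1:])]


def term_series(docs):
    """Map each bigram to a Counter of {year: occurrences}."""
    series = defaultdict(Counter)
    for year, text in docs:
        for term in bigrams(text):
            series[term][year] += 1
    return series


def simple_growth(counts, years):
    """Illustrative measure: occurrences in the later years minus the earlier years."""
    mid = len(years) // 2
    early = sum(counts[y] for y in years[:mid])
    late = sum(counts[y] for y in years[mid:])
    return late - early


years = sorted({year for year, _ in DOCS})
series = term_series(DOCS)
growth = {term: simple_growth(counts, years) for term, counts in series.items()}

# Rank terms from fastest growing to fastest declining.
ranked = sorted(growth.items(), key=lambda item: item[1], reverse=True)
print("emerging:", ranked[0])
print("declining:", ranked[-1])
```

On this toy data, "drone rotor" ranks as the most strongly emerging term and "fuel cell" as the most strongly declining one. A real analysis would need proper tokenisation, term weighting and far more documents than this sketch uses.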

You can read more about pyGrams in the articles linked above or by visiting the pyGrams GitHub repository or GitHub documentation pages.

Tags: Machine Learning, Natural Language Processing, Patent data, Projects, Pygrams
