Reproducible Analytical Pipeline Journey

Description

Analysis is a costly venture. Reporting is time-consuming, laborious work that often involves analysts disappearing for months behind a monitor. Once a report is published, it is essential that it stands up to scrutiny. Responding to that scrutiny may mean pulling the analyst off other pressing projects to revisit their work, adding delays to other workstreams. If only there were a better way…

Reproducible Analytical Pipelines (RAPs) are programs that automate the work of ingesting, processing, modelling and reporting data. RAPs should be robust, employing tests to provide assurance. They automate the manual elements of analytical work, presenting opportunities for impressive efficiencies in your teams. RAPs connect every step from data ingestion through to publication, freeing up that analyst to build the chat-bot you always wanted. Published figures can be independently verified by other analysts with access to the pipeline, simply by clicking ‘run’ – that’s what puts the R in RAP.
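As a rough illustration of the idea, the sketch below chains the ingest, process, model and report stages into a single function that can be re-run on demand. It is a minimal example rather than a recommended design: the data, the stage names (`ingest`, `process`, `summarise`, `report`) and the choice of a per-region mean as the "model" are all assumptions made for illustration, and a real pipeline would read from files or databases rather than an inline string.

```python
import csv
import io
from statistics import mean

# Hypothetical raw data, invented for this sketch.
RAW_CSV = """region,value
North,10
North,14
South,7
South,9
"""


def ingest(text):
    """Parse CSV text into a list of row dictionaries."""
    return list(csv.DictReader(io.StringIO(text)))


def process(rows):
    """Convert values to numbers and group them by region."""
    grouped = {}
    for row in rows:
        grouped.setdefault(row["region"], []).append(float(row["value"]))
    return grouped


def summarise(grouped):
    """Stand-in for the modelling step: the mean value per region."""
    return {region: mean(values) for region, values in grouped.items()}


def report(summary):
    """Format the results as plain text, one region per line."""
    return "\n".join(
        f"{region}: {average:.1f}" for region, average in sorted(summary.items())
    )


def run_pipeline(text):
    """Run every stage in order, from raw data to the finished report."""
    return report(summarise(process(ingest(text))))


if __name__ == "__main__":
    print(run_pipeline(RAW_CSV))
```

Because every stage is an ordinary function with no manual steps in between, anyone with the same input data and code can call `run_pipeline` and obtain the same report.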

This learning journey will help participants gain the tools to create Reproducible Analytical Pipelines (RAPs), a key part of producing high-quality outputs. Participants will learn how to use important components of RAP in Python and R, improving their programming skills. Participants are expected to have a working knowledge of Python or R for data analysis already.

Learning outcomes

Learners should be able to understand:

  • What reproducible analytical pipelines are, and what components are involved
  • How to write cleaner code
  • How to use command line tools
  • How to use Git for version control
  • How to better structure code
  • How to write basic unit tests

Pathway detail

The RAP journey begins with an introduction to the programming language of choice, R or Python. Clean code is then covered to help establish good programming habits from the outset. Command line basics introduces important principles for efficiently managing computer files and operations. It also builds familiarity with the software needed to work with Git, a powerful, free version control system widely used by programmers. Modular programming helps programmers to logically structure more complex scripts into the units of a robust pipeline. At this point, unit testing is introduced to help ensure that functions and modules behave as required and that future amendments to the developing code base do not degrade the quality of the pipeline outputs. Packaging and documentation helps analysts to take the next step in their programming journey, which is to package their code for the benefit of others within the analytical community. Finally, Continuous Integration helps to make sense of the automated tools that are available to developers when working with remote Git hosting services such as GitHub, helping to improve the efficiency of software development and providing assurance to package users.
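To make the link between modular programming and unit testing concrete, here is a small sketch of one pipeline unit and the tests that guard it. The function `percentage_change` and its test cases are hypothetical, invented for this example, and Python's built-in `unittest` module is used only as one possible test framework; the courses themselves may use different tools.

```python
import unittest


def percentage_change(old, new):
    """Return the percentage change from old to new."""
    if old == 0:
        # A zero baseline has no meaningful percentage change.
        raise ValueError("old value must be non-zero")
    return (new - old) / old * 100


class TestPercentageChange(unittest.TestCase):
    """Checks that should keep passing whenever the function is amended."""

    def test_increase(self):
        self.assertAlmostEqual(percentage_change(50, 75), 50.0)

    def test_decrease(self):
        self.assertAlmostEqual(percentage_change(80, 60), -25.0)

    def test_zero_baseline_is_rejected(self):
        with self.assertRaises(ValueError):
            percentage_change(0, 10)


if __name__ == "__main__":
    # Run the tests when the file is executed directly.
    unittest.main(argv=["percentage_change_tests"], exit=False)
```

If a later change to `percentage_change` alters its behaviour, one of these tests fails straight away, which is the kind of assurance that Continuous Integration tools can then check automatically on every change.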

Prerequisites

No prior knowledge is needed to take part in this pathway.

Courses in this learning journey

This pathway can be completed using either the R or Python programming languages.

Reproducible Analytical Pipeline journey in R

Course name                                 Skill level    Duration
Introduction to R                           Beginner       2 days
Best Practice in Programming – Clean Code   Beginner       1 hour
Command Line Basics                         Beginner       2 hours
Introduction to GIT                         Beginner       4 hours
Statistics in R                             Intermediate   16 hours
Modular Programming                         Intermediate   4 to 5 hours
Introduction to Unit Testing                Intermediate   4 hours
Packaging and Documentation                 Intermediate   4 hours
Introduction to Continuous Integration      Advanced       2 hours

Reproducible Analytical Pipeline journey in Python

Course name                                 Skill level    Duration
Introduction to Python                      Beginner       2 days
Best Practice in Programming – Clean Code   Beginner       1 hour
Command Line Basics                         Beginner       2 hours
Introduction to GIT                         Beginner       4 hours
Statistics in Python                        Intermediate   16 hours
Modular Programming                         Intermediate   4 to 5 hours
Introduction to Unit Testing                Intermediate   4 hours
Packaging and Documentation                 Intermediate   4 hours
Introduction to Continuous Integration      Advanced       2 hours