
This section focuses on the work of the PHUSE Education for the Future (EftF): Data Sciences project. With the formation of the work group in late 2018, the team took on the mission to pick up the work of the EftF: Data Engineering project and dive deeper into the topic of Data Sciences. In this section we first establish a common understanding of what data science in the pharmaceutical industry actually means.

The pharmaceutical industry is strongly regulated and has been data driven for decades. We can look back at more than 50 years of applied data science and already find many examples similar to what is nowadays called "Data Science".

The recent rise of digital technologies, open-source packages, new experimental medical methodologies, and access to open or proprietary data sources, in combination with machine learning approaches, may enhance and complement already existing methods.

Introduction to Pharmaceutical Data Science

What is Data Science in the Pharmaceutical Industry?

As described in Wikipedia, the pharmaceutical industry discovers, develops, produces, and markets pharmaceutical drugs for use as medications to be administered (or self-administered) to patients, with the aim of curing them, vaccinating them, or alleviating their symptoms. This industry is geared to provide the best possible treatment, which requires the effective gathering, storing, and distribution of data and information.

Before a new drug enters the market, a company has to conduct numerous clinical trials, following a pre-specified clinical development program plan, to collect data about the safety and efficacy of the new treatment. These clinical trials are conducted in both healthy subjects (i.e. Phase I) and patients (Phase II and III) to gather information for data-driven decisions. At the end of a clinical development program, integrated data analyses are provided to various regional regulatory agencies, such as the FDA, EMA or PMDA, to apply for new drug approval.

The collection of data and its provision to regulatory agencies for a new drug application is a strictly regulated, scientific, experimental process. Detailed ethical rules for human experimentation are harmonized globally in the ICH guidelines, which are complemented by rules described in industry guidelines published by regional regulatory agencies. Therapeutic-area-specific regulatory guidelines also provide direction on the clinical trial endpoints used to show the efficacy and safety of a new treatment. The transparency and traceability of experimental data collection is a fundamental requirement in this context and needs to be taken into account for each data processing step during data acquisition, transformation, analysis and provision.

Collected data needs to be presented and analyzed using various statistical methods. The results of the statistical analysis are displayed in tables, listings and figures following pre-planned statistical analysis methods. For a successful new drug application, the statistical analysis should demonstrate statistically significant differences of the new treatment compared to existing treatments (or placebo) and should also demonstrate its safety.

In this strongly regulated environment, the definition and application of Pharmaceutical Data Science differs considerably from Data Science in other industries such as transportation (e.g. Uber) or marketing (e.g. Amazon), as described in the EftF: Data Engineering project.

We therefore define Pharmaceutical Clinical Development Data Science, performed by the Clinical Data Scientist (in short from now on: CDS), as follows:

CDS is an inherently integrative discipline that ensures well-planned, traceable data collection, as well as the harmonized optimization, integration, analysis and display of different data sources. It reduces uncertainty and creates knowledge, whose collective use achieves progressive results in treatment processes.

Data Science BIOMETRIC roles in the pharmaceutical industry

The complexity of data collection, optimization and analysis in a strongly regulated environment requires specialized people who focus on each of these areas.

Three core BIOMETRIC functions ensure focused and compliant data flows during a clinical trial.

  • The Statistician ensures an optimal study design and plans the statistical analysis of the clinical trial. He or she estimates the patient sample size required to detect and correctly interpret statistically significant differences in the safety and efficacy of treatments.

  • The Clinical Data Manager ensures the collection of high-quality and reliable data. He or she makes sure that data is adequately collected, cleaned, and securely stored for further processing.

  • The Statistical Programmer ensures the optimization of collected data to enable statistical analysis and later submission to regulatory agencies. He or she transforms collected raw data into the CDISC SDTM and ADaM formats and prepares the statistical displays.
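The sample size estimation mentioned for the Statistician role can be sketched with the standard normal-approximation formula for comparing two means. This is an illustrative sketch only, not a validated tool; the effect size, standard deviation, alpha and power used below are assumed example values, not figures from any real trial.

```python
# Illustrative sketch: normal-approximation sample size per arm for a
# two-sample comparison of means (two-sided test).
from math import ceil
from scipy.stats import norm

def n_per_arm(delta, sigma, alpha=0.05, power=0.80):
    """Patients per arm needed to detect a true mean difference `delta`
    given a common standard deviation `sigma`."""
    z_a = norm.ppf(1 - alpha / 2)   # ~1.96 for alpha = 0.05
    z_b = norm.ppf(power)           # ~0.84 for 80% power
    return ceil(2 * sigma ** 2 * (z_a + z_b) ** 2 / delta ** 2)

# e.g. to detect a 5-point difference when the SD is 10:
print(n_per_arm(delta=5, sigma=10))  # → 63 per arm
```

In practice the Statistician refines such back-of-the-envelope numbers with exact (t-distribution) methods, dropout adjustments and trial-specific assumptions.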

These three biometrical core functions collaborate closely with each other and also cross-functionally with operational and medical staff on the clinical trial team.

Usually, new drug applications require the conduct of multiple clinical trials in different stages. Phase 1 studies collect information about the safety and pharmacokinetic profile of a new drug and are usually conducted in healthy subjects. Phase 2 studies collect information about safety and efficacy with the aim to prove the concept (i.e. efficacy) of a new treatment and to find the optimal dose for the pivotal studies. Phase 3 studies then collect information about safety and efficacy with the aim to prove the effectiveness of a new treatment.

The various phases of clinical trials mentioned above again require focused biometrical staff to adequately plan, conduct and analyze the data. Since the same drug might be efficacious in various indications/projects or even in various therapeutic areas (sildenafil [Viagra], for example, was developed for angina but proved effective in erectile dysfunction and later pulmonary hypertension; a more current example is immuno-modulators used across chronic diseases affecting different systems or organs, such as RA, MS, lupus and cancer), it is beneficial to further structure the three key data science roles mentioned above into focused study, project and compound roles, as shown in the following figure.

General structure of studies within projects/indications for different compounds in various therapeutic areas


Each hierarchy level of studies, projects, compounds and therapeutic areas provides a different type of data, which also requires a specific functional focus of Clinical Data Managers, Statistical Programmers and Statisticians on each hierarchical level. As an example, Statistical Programming functions therefore have roles of Study Statistical Programmers, Project Programmers, Compound Programmers and Therapeutic Area Programmers. Similar roles can be found within Clinical Data Management and Statistics as well. All these roles perform Pharmaceutical Data Science; however, we are not going to call any of these functional roles Pharmaceutical Data Scientist. We define the CDS as an overarching role that can work with data across various therapeutic areas and also across various functions.


If we consider a clinical trial, the Study Data Manager and the Study Statistician focus on their accountability to clean and analyze the trial data. The CDS could take an overarching approach and utilize operational data from Data Management processes (e.g. the number of queries per endpoint) and combine it with statistical results (e.g. statistical significance of trial results) to detect fraud and misconduct in a clinical trial.
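The idea of combining operational Data Management metrics with statistical results can be sketched as a simple cross-check. Everything below is a hypothetical illustration: the site IDs, column names, figures and the flagging thresholds are invented assumptions, not a real fraud-detection method.

```python
# Hypothetical sketch: cross-checking query counts from Data Management
# against per-site statistical results. All data and thresholds are
# illustrative assumptions.
import pandas as pd

# Operational data: data-cleaning queries raised per endpoint, by site
queries = pd.DataFrame({
    "site_id": ["S01", "S02", "S03"],
    "queries_per_endpoint": [12.4, 0.3, 9.8],
})

# Statistical results: observed treatment effect per site
effects = pd.DataFrame({
    "site_id": ["S01", "S02", "S03"],
    "treatment_effect": [1.1, 4.7, 0.9],
})

combined = queries.merge(effects, on="site_id")

# A site with implausibly few queries (suspiciously "clean" data) and an
# outlying treatment effect may warrant a closer look for misconduct.
z = (combined["treatment_effect"] - combined["treatment_effect"].mean()) \
    / combined["treatment_effect"].std()
combined["flag"] = (combined["queries_per_endpoint"] < 1) & (z.abs() > 1)

print(combined[combined["flag"]]["site_id"].tolist())  # → ['S02']
```

Real central statistical monitoring uses far more signals and formal methods; the point is only that the CDS sits across data sources that the individual study roles look at separately.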

Thus, the CDS can be helpful in two different ways:

  1. Finding relations between different aspects that are not included in the original study design (i.e. between different studies, different indications under a single compound, and even different compounds, drawing from a single pool of similar subjects and data).

  2. Separating data and subjects within a pool to find unsuspected classifications among them.