Since the formation of CDISC, significant progress has been made in standardising the format of data collected, analysed and submitted to health authorities. This has allowed organisations to explore ways of building automation into their data flows and reporting tools.


Various pharmaceutical companies are exploring, or have explored, automation in the creation of a standard CRF, the conversion of data into SDTM, the checking of data for submission compliance, the production of standard tables, listings and figures, the creation of components of e-submission data packages, and so on. However, in many cases the tools require rather rigid adherence to a standard, which makes it difficult to fully realise the benefits of automation and requires additional manual upkeep and maintenance.

More recently, advances in data engineering techniques, such as machine learning and big data processing, have allowed companies to curate both structured and unstructured data more easily. These concepts have been applied in the healthcare industry, as shared at the PhUSE Single Day Event (SDE) in Ridgefield, CT. At the event, Dr. Wade Schulz of the Yale School of Medicine described the technologies and tools used to build a data lake that integrates with their clinical information systems to provide historic and real-time data for research studies and clinical decision support. Tools in his “Data Science Toolkit” include, but are not limited to, Apache Kafka, Apache Storm, Apache Spark, Apache Hadoop, Apache HBase and Python, which are used to ingest, process, store and analyse clinical and healthcare data. He also anticipates that a significant amount of future healthcare data will be unstructured, which will make it more challenging to ingest and process. [1]

Given the extensive standards and structured data formats that exist in the pharmaceutical industry today, and the advancements brought forward through I4.0, the industry is well positioned to take advantage of emerging automation technologies, particularly for standardised data. This in turn frees up effort to focus on novel data types and unstructured data using new methodologies and techniques. [1]


[1] Schulz, Wade. “Baikal – Implementing and Deploying Clinical Models with a Real-Time Data Lake.” PhUSE SDE. Focus on the Patients - Bridging Data to Solutions, 26 July 2018, Ridgefield, Boehringer Ingelheim.