In the age of Industry 4.0, advancements in artificial intelligence and machine learning, along with improvements in technologies for data storage, data access and computer processing, are driving innovations in the automation of data acquisition, integration and analytics.  Automation software and tools continue to grow in availability and capability. However, key questions remain:

  • Why automate?

  • What should be automated?  

  • How might automation impact our business?

With these advancements in technology and automation capabilities, companies across all industries are looking at how they can use automation to improve their business.  A recent post on DevOps.com states that “Today’s systems are simply becoming too big and complex to run completely manually, and working without automation is largely unsustainable for many enterprises across all industries”,[1] alluding to how critical automation is to surviving in today’s competitive market.

However, determining what to automate requires careful consideration of how the automation will fit into your business workflows.  Many organisations develop an automation strategy that takes into consideration both manual tasks that are often repeated or standardised and tasks that require human problem solving, particularly in unpredictable scenarios.  As advised by Jim Higgins on Forbes.com, “The ideal level of automation is less about replacement and more about enablement. It helps users be better at their jobs, giving them analytics to make sound decisions for the business and freeing them from repetitive and monotonous tasks so they can be more strategic.”  In his post he emphasises the importance of users feeling that they are in control of the technology and that the automation adds value by making them better at their jobs.[2]  These are important considerations for the successful integration of automation into your business processes.

As discussed in the previous use cases, data is needed in real time to answer important business questions in every sector of our economy, including pharmaceuticals.  The use of software and automation is essential in making this a reality.


A few examples of programming languages and software tools that enable automation are included below.

Apache® Software Foundation


An open source software ecosystem with projects supporting big data, databases, graphics, and many more capabilities.  Some of the Apache projects are listed below – visit https://www.apache.org/ to learn more.

Apache Spark™


A unified analytics engine for large-scale data processing for big data and machine learning.  Apache Spark was originally developed at UC Berkeley in 2009.[3]

Apache Arrow™

Apache Arrow is a cross-language development platform for in-memory data. It specifies a standardised language-independent columnar memory format for flat and hierarchical data, organised for efficient analytic operations on modern hardware. It also provides computational libraries and zero-copy streaming messaging and interprocess communication. Languages currently supported include C, C++, Java, JavaScript, Python, and Ruby. [4]
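
To illustrate why a columnar layout suits analytic operations, the following is a minimal sketch using only the Python standard library. It does not use Apache Arrow or its memory format; the records and the field names subject_id and dose_mg are hypothetical and chosen purely for illustration.

```python
from array import array

# Hypothetical records in a row-oriented layout: one dict per record.
rows = [
    {"subject_id": 1, "dose_mg": 10.0},
    {"subject_id": 2, "dose_mg": 20.0},
    {"subject_id": 3, "dose_mg": 15.0},
]

# The same data in a column-oriented layout: one contiguous typed buffer per field.
columns = {
    "subject_id": array("q", (r["subject_id"] for r in rows)),  # 64-bit integers
    "dose_mg": array("d", (r["dose_mg"] for r in rows)),        # 64-bit floats
}

# An analytic operation reads only the column it needs.
mean_dose = sum(columns["dose_mg"]) / len(columns["dose_mg"])

# A memoryview exposes the column's buffer to another consumer without copying it.
view = memoryview(columns["dose_mg"])
```

Here the mean is computed from a single contiguous buffer rather than by visiting every record, and the memoryview hints at the idea behind zero-copy sharing: a second consumer reads the same bytes instead of receiving a duplicate.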

Apache Airflow (undergoing incubation)  

Airflow is a platform to programmatically author, schedule and monitor workflows.[5]
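
As a rough illustration of what authoring a workflow programmatically can mean, the sketch below declares tasks and their dependencies in code and runs them in dependency order. It uses only the Python standard library (graphlib, Python 3.9 or later) and is not Airflow's API; the task names and the run_workflow helper are hypothetical.

```python
from graphlib import TopologicalSorter


def run_workflow(dependencies, tasks):
    """Run each task once, only after all of the tasks it depends on have run."""
    order = list(TopologicalSorter(dependencies).static_order())
    for name in order:
        tasks[name]()
    return order


# Hypothetical tasks: each one simply records that it ran.
log = []
tasks = {
    name: (lambda n=name: log.append(n))
    for name in ("extract", "clean", "analyse", "report")
}

# Each task maps to the set of tasks that must finish before it starts.
dependencies = {
    "extract": set(),
    "clean": {"extract"},
    "analyse": {"clean"},
    "report": {"analyse"},
}

order = run_workflow(dependencies, tasks)
```

Because the workflow is ordinary code, it can be version controlled, reviewed and tested like any other program. A real scheduler would add timing, retries and monitoring on top of this ordering step.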

Apache Ignite™ (a project managed by the Apache Ignite Committee)

Apache Ignite In-Memory Data Fabric is designed to deliver uncompromised performance for a wide set of in-memory computing use cases, from high performance computing to the industry's most advanced data grid, in-memory SQL, in-memory file system, streaming, and more.[6]

Apache™ Hadoop®

The Apache Hadoop project develops open-source software for reliable, scalable, distributed computing.  The Apache Hadoop software library is a framework that allows for the distributed processing of large data sets across clusters of computers using simple programming models. It is designed to scale up from single servers to thousands of machines, each offering local computation and storage. Rather than rely on hardware to deliver high-availability, the library itself is designed to detect and handle failures at the application layer, so delivering a highly-available service on top of a cluster of computers, each of which may be prone to failures.[7]  
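
MapReduce is the best known of the simple programming models associated with Hadoop. The sketch below imitates its map, shuffle and reduce steps in plain Python on a single machine; it does not use any Hadoop API, and the function names and input lines are hypothetical.

```python
from collections import defaultdict


def map_phase(line):
    """Emit a (word, 1) pair for every word in one line of input."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Group emitted values by key, as a framework would between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Combine all values for one key into a single count."""
    return key, sum(values)


def word_count(lines):
    pairs = [pair for line in lines for pair in map_phase(line)]
    return dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())


# Hypothetical input: on a cluster, each line could be stored on a different machine.
lines = ["automate the boring stuff", "automate the reporting"]
counts = word_count(lines)
```

The map and reduce functions each look at one line or one key at a time, which is what allows a framework to spread the work across many machines and rerun a piece of it when a machine fails.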

Apache Mesos™

Apache Mesos is a cluster manager that provides efficient resource isolation and sharing across distributed applications, or frameworks. It can run Hadoop, MPI, Hypertable, Spark, and other frameworks on a dynamically shared pool of nodes.[8]


Java

Java is a general-purpose programming language released in 1995 by Sun Microsystems, which was later acquired by Oracle.  The original goal of Java was to develop a language that could run on consumer appliances, e.g. a home refrigerator.  Its premise was “write once, run anywhere”: in essence, code compiled once should run on any device.  However, Java became more popular for applets (small programs) that can run within a web browser.  Given the rise of the internet in the late 1990s, Java gained wide popularity and great success with this capability.[9]

Jupyter Notebook

The Jupyter Notebook is an open-source web application that allows you to create and share documents that contain live code, equations, visualisations and narrative text. Uses include: data cleaning and transformation, numerical simulation, statistical modelling, data visualisation, machine learning, and much more. [10]


Python

Dating from 1991, the Python programming language was long considered a gap-filler, a way to write scripts that “automate the boring stuff” (as one popular book on learning Python put it) or to rapidly prototype applications that would be implemented in other languages.  However, over the past few years, Python has emerged as a first-class citizen in modern software development, infrastructure management, and data analysis. It is no longer a back-room utility language, but a major force in web application creation and systems management, and a key driver of the explosion in big data analytics and machine intelligence.[11]

R and R Shiny

2018 marks the 25th anniversary of the creation of R.  This open source statistical software, used by millions of users around the world, has modernised the way we think about statistical computing.  The Comprehensive R Archive Network (CRAN) provides open sharing of code libraries, reducing the need to redevelop code. R Shiny is an R package for building web applications that let end users visualise and analyse data.

R Markdown

R Markdown provides an authoring framework for data science.  R Markdown documents are fully reproducible and support dozens of static and dynamic output formats.  You can use a single R Markdown file to both save and execute code and generate high-quality reports that can be shared with an audience.

RDF data cube

The RDF Data Cube vocabulary provides a means to publish multi-dimensional data, such as statistics, on the web in such a way that it can be linked to related data sets and concepts using the W3C RDF (Resource Description Framework) standard. The model underpinning the Data Cube vocabulary is compatible with the cube model that underlies SDMX (Statistical Data and Metadata eXchange), an ISO standard for exchanging and sharing statistical data and metadata among organisations.[12]


SQL

SQL is a domain-specific language used in programming and designed for managing data held in a relational database management system, or for stream processing in a relational data stream management system.
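
As a small illustration of managing relational data with SQL, the sketch below creates a table, loads a few rows and summarises them by group. It uses the SQLite engine bundled with Python's standard library; the visits table, its columns and the values in it are hypothetical.

```python
import sqlite3

# An in-memory database keeps the example self-contained.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE visits (subject_id INTEGER, site TEXT, systolic_bp INTEGER)"
)

# Parameterised inserts avoid building SQL statements by string concatenation.
conn.executemany(
    "INSERT INTO visits VALUES (?, ?, ?)",
    [(1, "A", 120), (2, "A", 130), (3, "B", 110)],
)

# Declarative query: count and average per site, leaving the 'how' to the engine.
rows = conn.execute(
    "SELECT site, COUNT(*), AVG(systolic_bp) "
    "FROM visits GROUP BY site ORDER BY site"
).fetchall()
conn.close()
```

The query states what result is wanted rather than how to compute it, which is part of what makes SQL convenient to embed in automated reporting steps.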


TensorFlow

TensorFlow is an open source software library for high performance numerical computation. Its flexible architecture allows easy deployment of computation across a variety of platforms (CPUs, GPUs, TPUs), and from desktops to clusters of servers to mobile and edge devices. Originally developed by researchers and engineers from the Google Brain team within Google’s AI organisation, it comes with strong support for machine learning and deep learning, and the flexible numerical computation core is used across many other scientific domains.[13]

NoSQL database systems

NoSQL is a form of database model other than that of a relational database. NoSQL databases have come to the fore with the ever-expanding volumes of data and speed of data processing in the world of e-commerce and social media.  Benefits of a NoSQL data model may include a simpler design, distributed data stores and ultimately increased speed of processing.[14]
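
One common NoSQL style is the document store. The sketch below is a toy illustration of the two ideas mentioned above, a simpler schema-free design and data distributed across several stores. It is not the API of any real NoSQL product; the TinyDocumentStore class, the keys and the document contents are hypothetical.

```python
import json
import zlib


class TinyDocumentStore:
    """Toy store: JSON documents spread over several shards by a hash of the key."""

    def __init__(self, shard_count=3):
        # Each dict stands in for a separate machine in a distributed store.
        self.shards = [{} for _ in range(shard_count)]

    def _shard_for(self, key):
        # crc32 gives the same result in every process, unlike the built-in hash().
        return self.shards[zlib.crc32(key.encode("utf-8")) % len(self.shards)]

    def put(self, key, document):
        self._shard_for(key)[key] = json.dumps(document)

    def get(self, key):
        raw = self._shard_for(key).get(key)
        return None if raw is None else json.loads(raw)


store = TinyDocumentStore(shard_count=3)

# Documents need not share a schema: each carries whatever fields it has.
store.put("subject-001", {"age": 54, "visits": [{"week": 1, "systolic_bp": 120}]})
store.put("subject-002", {"age": 61, "smoker": False})

found = store.get("subject-001")
```

Looking up a document only requires hashing its key to find the right shard, so no table joins or fixed schema are involved. That is the kind of trade-off that can make such stores simple to scale.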

Although much of the software and many of the techniques above play a role in enabling automation, current technologies will continue to evolve and mature.  Many organisations are exploring automation and may test various solutions. However, automating for the sake of automation may not lead to the optimal solution.  Developing a clear strategy that takes into consideration the problem, the people, and the desired outcome can help automation deliver greater improvements in efficiency and productivity.


[1] Wells, Marshall, and Miha Kralj. “3 Reasons Why Automation Is Critical.” DevOps.com, DevOps.com, 21 Mar. 2016, https://devops.com/3-reasons-automation-critical/.

[2] Higgins, Jim. “How Much Automation Is Too Much?” Forbes, Forbes Magazine, 6 Apr. 2018, www.forbes.com/sites/forbestechcouncil/2018/04/06/how-much-automation-is-too-much/#96106e2f9696.

[3] “What Is Apache Spark?” Databricks, Databricks, https://databricks.com/spark/about.

[4] “Apache Arrow.” Apache Arrow Homepage, The Apache Software Foundation, https://arrow.apache.org/.

[5] “Apache Airflow (Incubating) Documentation.” Apache Airflow (Incubating) Documentation - Airflow Documentation, Apache Incubator, https://airflow.apache.org/.

[6] “Apache Ignite.” Apache Ignite, The Apache Software Foundation, https://ignite.apache.org/.

[7] “Apache Hadoop.” Apache Hadoop, The Apache Software Foundation, https://hadoop.apache.org/.

[8] “Apache Mesos.” Apache Mesos, The Apache Software Foundation, http://mesos.apache.org/.

[9] Wintrich, David. “Java: What Beginners Need to Know Now.” Course Report, Course Report, 2017, www.coursereport.com/blog/what-is-java-programming-used-for.

[10] “Project Jupyter.” Project Jupyter, 2018, http://jupyter.org/.

[11] Yegulalp, Serdar. “What Is the Python Programming Language? Everything You Need to Know.” InfoWorld, IDG Communications, Inc., 1 June 2018, www.infoworld.com/article/3204016/python/what-is-python.html.

[12] “The RDF Data Cube Vocabulary.” Edited by Richard Cyganiak and Dave Reynolds, W3C - World Wide Web Consortium, Government Linked Data Working Group, 2014, www.w3.org/TR/vocab-data-cube/.

[13] “About TensorFlow.” TensorFlow, www.tensorflow.org/.

[14] Moniruzzaman, A B M, and Syed Akhter Hossain. “NoSQL Database: New Era of Databases for Big Data Analytics - Classification, Characteristics and Comparison.” Academia.edu - Share Research, International Journal of Database Theory and Application, www.academia.edu/5352898/NoSQL_Database_New_Era_of_Databases_for_Big_Data_Analytics_-_Classification_Characteristics_and_Comparison.