STRUCTURAL APPROACH OF OUR WORKING GROUP
When we started our research, we were quickly overwhelmed by the amount of existing information on the huge topic of Data Science. We soon found great learning sources for programming languages, database structures, machine learning algorithms, use cases in other industries, visualization techniques, and much more. The real challenge for us was to connect the dots and put all of this into context to reach the goal stated above. We therefore decided to use a structured approach to gather and present the knowledge on this website.
We decided to combine the PPDAC method (as described, for example, in the book “The Art of Statistics” by David Spiegelhalter) with the Six Divisions of Greater Data Science (as described, for example, in the article “50 years of Data Science” by David Donoho).
The PPDAC Cycle
The PPDAC Cycle starts with clearly defining a Problem that needs to be solved. This is followed by a Plan for a solution and by the Data, which is then Analyzed before we eventually come to a Conclusion.
Problem: Understanding and thoroughly defining the problem we want to study is the key part of the overall analytical process. You cannot find a stable, data-driven solution if you do not completely understand what you are trying to solve.
Plan: The next stage is to formulate an approach that has the potential to solve the problem. This should include hypotheses you want to “prove” with the data.
Data: In data science experiments, data is at the heart of the solutions we plan to build. Therefore, think carefully about the data you have to collect and explore.
Analysis: Analyzing the data can be seen as an iterative process. Depending on the size of the data, a first exploratory step might be a foundational statistical description of the dataset before you dive deeper into inferential statistics.
Conclusion: Once you have completed your statistical analysis, you should be able to draw conclusions from it. Often the conclusions lead to new problems, which restarts the PPDAC cycle.
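To make the Analysis stage more concrete, the following minimal Python sketch first computes a foundational description of a dataset and then makes one simple inferential statement about it. The data values and the function names describe and mean_confidence_interval are hypothetical and invented for illustration only. The interval uses a normal approximation, which is an assumption made to keep the sketch short; for a sample this small, a t-distribution would usually be more appropriate.

```python
from statistics import NormalDist, mean, median, stdev

# Hypothetical measurements, invented purely for illustration.
values = [5.1, 4.8, 5.6, 5.0, 4.7, 5.3, 5.2, 4.9]


def describe(data):
    """Step 1: a foundational statistical description of the dataset."""
    return {
        "n": len(data),
        "mean": mean(data),
        "median": median(data),
        "stdev": stdev(data),  # sample standard deviation
    }


def mean_confidence_interval(data, level=0.95):
    """Step 2: a simple inferential statement about the mean.

    Assumption: normal approximation, used here only for brevity.
    """
    z = NormalDist().inv_cdf(0.5 + level / 2)
    half_width = z * stdev(data) / len(data) ** 0.5
    centre = mean(data)
    return centre - half_width, centre + half_width


summary = describe(values)
low, high = mean_confidence_interval(values)
print(summary)
print(f"Approximate 95% interval for the mean: {low:.2f} to {high:.2f}")
```

In the spirit of the iterative process described above, the descriptive summary would normally be inspected first, and the choice of inferential method revisited if the data look skewed or contain outliers.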
The PPDAC method is, of course, meant for data science experiments. However, it also works extremely well in our working group setting as a way to communicate our different use cases effectively.
David Spiegelhalter: The Art of Statistics (level: Beginner)
The Six Divisions of Greater Data Science
Data is at the heart of Data Science, and the Six Divisions of Greater Data Science suggested by Donoho are built around it. For the Data stage of the PPDAC approach, we will address the following six divisions:
Data Exploration and Preparation
Data Representation and Transformation
Computing with Data
Data Modeling
Data Visualization and Presentation
Science about Data Science
An analytical approach is a fundamental skill that every Data Scientist should have. Our project therefore benefited from this approach during the research and publishing process. We hope that our data science students will benefit from it as well.
David Donoho: 50 years of Data Science (level: Beginner)
In line with the PPDAC Cycle, we truly believe that the best way to learn is from examples. Our working group collected many examples of data science problems, showing how they were planned, how data was gathered and explored, and how conclusions were drawn. We will present use cases of various levels of complexity, which will be used to explain different solution approaches and the underlying methodology.
There are many different ways a person can learn something. Our working group tries to collect various learning materials to address preferences for each student of data science. As a first step for our working group, we discussed various possible educational methods.
Because educational methods have changed significantly over the past few decades, we set ourselves the goal of catering to different learning preferences by using the wide variety of resources available today. We therefore plan to employ many, if not all, of the following educational methods and resources to build our repository and develop learning pathways:
Articles and Conference Papers
Professional (online) Training Courses
Our working group strives to collect material that can be accessed for free. However, this might not always be possible, and we might provide links to professional training material. PHUSE and the EftF WG: Data Science accept no liability for any of the material we link to on our website.
We acknowledge that this website will be used by readers with different educational backgrounds and different levels of experience in data science. We will mark the different sections according to the required level of expertise. Here we distinguish between the following levels:
Beginner - this is the entry level and the right one for you if you don't have any experience in the area of data science. We assume, though, that you have a basic understanding of at least one biometrical function.
Intermediate - this is the second level and the right one for you if you have a thorough understanding of a biometrical function.
Expert - this is our highest level and the right one for you if you have already earned your stripes in data science projects. This level requires a deep understanding of clinical development processes.