This is the second part of an article on Big Data and Data Science. (Here is the first). I thought I would put down a list of the skills, algorithms and tools of a Data Scientist. Clearly this is my opinion, but I would welcome feedback.
I also try to arrange requirements in a hierarchy of importance.
First and foremost a Data Scientist should be a Scientist. They need to understand what a hypothesis is and what constitutes a proof. They need to be able to communicate with the business, first to develop something useful and then to relay their success.
Clearly there is overlap with Business Analysts, Business Intelligence specialists, Software Engineers, Statisticians and more. The difference can often be seen in the tool set each role uses.
There is an awful lot in the above but it comes down to science training and domain understanding.
Then of course a Data Scientist needs to be able to successfully execute a plan. For this they need raw data skills, for example SQL against a relational database such as Postgres.
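As a rough illustration of what "raw data skills" means in practice, here is a minimal sketch of an aggregate SQL query driven from Python. It assumes the standard-library `sqlite3` module as a stand-in for Postgres, and the `orders` table, its columns and its values are hypothetical, invented purely for the example.

```python
import sqlite3

# Hypothetical rows (region, amount), invented purely for illustration.
ROWS = [
    ("north", 120.0),
    ("north", 80.0),
    ("south", 50.0),
    ("south", 70.0),
    ("south", 30.0),
]


def revenue_by_region(rows):
    """Load rows into an in-memory database and aggregate them with SQL."""
    # sqlite3 is a stand-in here; against Postgres the SQL would look much the same.
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute("CREATE TABLE orders (region TEXT, amount REAL)")
        conn.executemany("INSERT INTO orders VALUES (?, ?)", rows)
        cursor = conn.execute(
            "SELECT region, COUNT(*), SUM(amount) "
            "FROM orders GROUP BY region ORDER BY region"
        )
        return cursor.fetchall()
    finally:
        conn.close()


summary = revenue_by_region(ROWS)
for region, n_orders, total in summary:
    print(f"{region}: {n_orders} orders, total {total:.2f}")
```

The point is less the particular query than the habit: being able to pull, group and sanity-check raw data directly, without waiting for someone else to prepare it.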
A Data Scientist does not necessarily need a lot of data. Normally, though, they are involved in extracting and analysing real-time data, so they may need Big Data query skills, such as Hive or Pig.
Taking a step backwards, a Data Scientist may be required to help develop requirements for some Big Iron needed for Big Data. A whole lot of work may be needed to define the physical architecture, operating system (or overlay such as Hadoop) and Application Software. (More about Application Software later). While some tasks are not within their domain, they should be consulted and must have input.
A critical approach is to iterate this process. Initially, a Data Scientist may help set up hardware, do a static data pull, define a model and analyse the model. Later, they may do a dynamic data pull into an application that provides real-time feedback. Then, as a third iteration, they may wish to perfect the model and get arms-length feedback into the business.
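The first iteration (static data pull, define a model, analyse the model) can be sketched in a few lines. This is only an illustrative sketch: the data points are hypothetical, and a straight-line least-squares fit is assumed as the "model" simply because it is the smallest example that shows all three steps.

```python
from statistics import mean

# Hypothetical static data pull: (input, outcome) pairs, invented for illustration.
STATIC_PULL = [(1.0, 3.1), (2.0, 4.9), (3.0, 7.2), (4.0, 8.8), (5.0, 11.0)]


def fit_line(points):
    """Define the model: least-squares fit of y = slope * x + intercept."""
    xs = [x for x, _ in points]
    ys = [y for _, y in points]
    x_bar, y_bar = mean(xs), mean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in points)
    slope = sxy / sxx
    return slope, y_bar - slope * x_bar


def r_squared(points, slope, intercept):
    """Analyse the model: share of the variance in y explained by the fit."""
    ys = [y for _, y in points]
    y_bar = mean(ys)
    ss_res = sum((y - (slope * x + intercept)) ** 2 for x, y in points)
    ss_tot = sum((y - y_bar) ** 2 for y in ys)
    return 1.0 - ss_res / ss_tot


slope, intercept = fit_line(STATIC_PULL)
fit_quality = r_squared(STATIC_PULL, slope, intercept)
print(f"slope={slope:.3f} intercept={intercept:.3f} R^2={fit_quality:.3f}")
```

In a later iteration the same two functions could be fed from a dynamic data pull instead of the fixed list, which is why it helps to keep the pull, the model and the analysis as separate steps from the start.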
There are a whole lot of soft skills required for a data scientist. These skills are built on experience and expertise in the hard skills. They include:
- the ability to govern technical projects, including managing the approach, methodology and team, e.g. project management, process and budgeting,
- design and visual design,
- creating understanding of, and buy-in for, a solution, including pre-sales,
- identifying new analytics trends, teaching others and learning new techniques,
- the ability to create and manage an innovation agenda, e.g. pushing the analytical / modelling envelope,
- building community, e.g. mentoring and advising,
- communicating effectively.
A data scientist should have a broad general knowledge of PaaS, XaaS, microservices, Cognitive Computing, IoT and so on.
The data scientist should read a lot so that they have ideas for social network and online models. Typically they would exercise these models using MATLAB or Mathematica.
The following lists of skills, tools and algorithms are neither complete nor necessary, but are intended to partition technologies.
Very often a data scientist needs to help develop an application around their models and algorithms. Therefore a data scientist should be able to program and run scripts, including SQL queries:
- Java, Groovy and Gradle,
- R language,
- Python or Ruby,
- Docker and Kubernetes,
- query languages and frameworks, including Hive, Pig, MapReduce and interactive SQL,
- database knowledge, which should include relational databases, NoSQL (like Neo4j) and others.
General statistical skills include:
- deep and broad statistical modelling,
- good applied statistical and mathematical skills, such as statistical testing, multivariable calculus and linear algebra,
- the ability to test hypotheses from raw data sets,
- Artificial Intelligence,
- Machine Learning.
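To make "statistical testing" and "test hypotheses from raw data sets" concrete, here is a minimal sketch of an exact two-sided permutation test for a difference in means. The technique is chosen here only because it needs nothing beyond the standard library; the two samples are hypothetical and invented for the example. The null hypothesis is that the group labels are interchangeable, so every way of splitting the pooled values into two groups is equally likely.

```python
from itertools import combinations
from statistics import mean

# Hypothetical measurements for two groups, invented for illustration.
CONTROL = [12.1, 11.8, 12.4, 12.0]
VARIANT = [12.9, 13.1, 12.7, 13.3]


def exact_permutation_p_value(a, b):
    """Two-sided exact permutation test for a difference in means.

    Enumerates every split of the pooled data into groups of the original
    sizes and returns the fraction of splits whose absolute difference in
    means is at least as large as the observed one.
    """
    pooled = list(a) + list(b)
    observed = abs(mean(a) - mean(b))
    total = sum(pooled)
    n_a, n_b = len(a), len(b)
    extreme = 0
    count = 0
    for idx in combinations(range(len(pooled)), n_a):
        sum_a = sum(pooled[i] for i in idx)
        diff = abs(sum_a / n_a - (total - sum_a) / n_b)
        count += 1
        # Small tolerance so floating-point rounding cannot hide exact ties.
        if diff >= observed - 1e-12:
            extreme += 1
    return extreme / count


p_value = exact_permutation_p_value(CONTROL, VARIANT)
print(f"p-value: {p_value:.4f}")
```

Exhaustive enumeration is only practical for small samples like these; for larger data sets one would normally sample random permutations or use a standard parametric test instead.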
Data visualisation and reporting tools are also part of the tool set. Some companies have proprietary visualisation and reporting tools. Common commercial examples include:
- Enterprise Business Intelligence: QlikView, Lumira, Tableau, Flare, Zoomdata, Cognos, SAP BW, SSRS,
- Enterprise ETL: SSIS, SAP DS, IBM DataStage, Pentaho.
In the next section, we look at the definitions of some specific languages, distributions, algorithms and tools. These are frequently put forward as necessary requirements for a Data Scientist, but in actual fact they come after everything else: they may be needed, and they can be learned. It is far more important to be able to understand the need for a model in the context of the business (e.g. drawing actionable conclusions) than to be able to extract some data that really means very little.