Data Science Skills, Tools and Algorithms (Part 3)

Finally, in this third part of our series [1, 2] on data science, we look at the definitions of some specific languages, distributions, algorithms and tools.

These are frequently listed as necessary requirements for a data scientist, but in practice they come after everything else: they may be needed, and they can be learned.

It is far more important to understand the need for a model in the context of the business (e.g. in order to draw actionable conclusions) than to be able to extract some data that really means very little.

Hadoop: Consists of clustered computers, built from commodity hardware, managed by the Hadoop software framework.
HDFS: Hadoop Distributed File System.
YARN: Yet Another Resource Negotiator, for scheduling users’ applications.
MapReduce: A programming model whereby tasks are broken up into smaller parts and processed independently.
Apache HBase: An open-source, non-relational, distributed database.

Apache Hadoop suite
Cassandra: A column-oriented NoSQL database that supports access from Hadoop.
Spark: For programming entire clusters with implicit data parallelism and fault tolerance.
Sqoop: For transferring data between relational databases and Hadoop.
Mahout: Provides machine learning algorithms for collaborative filtering, clustering and classification.
Flume: Enables users to efficiently collect, aggregate and move large amounts of data.
Hive: An SQL-like interface to query data stored in various databases and file systems that integrate with Hadoop.
Pig: Abstracts the programming from Java MapReduce to make MapReduce programming high level.
CouchDB: A document-oriented NoSQL database.
Cloudera Impala: Enables users to issue SQL queries to data stored in HDFS and Apache HBase without requiring data movement or transformation.

Common data science toolkits
R: A 4GL programming language for statistical manipulation and analysis. It is a tool used to analyse and manipulate data.
Weka: A tool for machine learning. It is open source and written in Java.
NumPy: A library for Python that adds MATLAB-like functionality. Additional libraries include SciPy.
MATLAB, ML Studio: A 4GL language that allows matrix manipulations, plotting of functions and data, implementation of algorithms, creation of user interfaces, and interfacing with programs written in other languages, including C, C++, C#, Java, Fortran and Python.

Enterprise BI (proprietary tools)
QlikView: QlikView and Qlik Sense, business intelligence and visualization software.
Lumira: Business intelligence software developed by SAP BusinessObjects. Used to manipulate and visualize data.
Tableau: Software for interactive data visualization focused on business intelligence.
Zoomdata: A data visualization and analytics tool that allows customers to explore and analyze vast quantities of data.
SSRS: SQL Server Reporting Services (SSRS) is a server-based report generating software system from Microsoft.
SAP BW: An Enterprise Data Warehouse. It can transform and consolidate business information from virtually any source system.
Cognos: IBM business intelligence and performance management products.

Query languages
SQL: Relational database query language.
Interactive SQL on Hadoop

NoSQL databases
MongoDB: An open-source document-oriented NoSQL database. Uses JSON-like documents.
Elasticsearch: A distributed, multitenant-capable full-text search engine with an HTTP web interface and schema-free JSON documents.

IoT like Raptor

Enterprise ETL (Extract, Transform, Load)
SSIS: Microsoft’s SQL Server Integration Services (SSIS), used to perform a broad range of data migration tasks.
IBM DataStage: Ditto, from IBM.
Pentaho: Ditto, from Hitachi.

PaaS and XaaS: X as a Service, such as Platform as a Service.

Microservices
Docker: Provides abstraction and automation for “containers” that run within a single Linux instance.
Kubernetes: Automates deployment, scaling and management of containerized applications.

Cognitive Computing: Platforms that encompass machine learning, reasoning, natural language processing, speech recognition and vision.

Machine Learning: AI

Numerical techniques
k-NN: Pattern recognition method used for classification and regression.
Naive Bayes: A machine learning algorithm using probabilistic classifiers based on applying Bayes’ theorem with strong independence assumptions.
SVM: Machine learning algorithm to analyze data, used for classification and regression analysis.
Decision Forests: An ensemble learning method for classification, regression and other tasks.

Social listening: The process of tracking conversations around specific words over many sources.

CEP Stream: Complex Event Processing Stream.

Statistical modelling: Distributions, statistical testing, regression, multivariable calculus and linear algebra. Stuff you learn at school.

Data visualisation tools
D3.js: JavaScript visualisation library.
ggplot2: R visualisation library.
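
To make the MapReduce entry above a little more concrete, here is a minimal single-machine sketch of the idea in Python. It is only an illustration and not the Hadoop API: the function names (map_phase, shuffle, reduce_phase, word_count) and the sample documents are made up for this example, and a real cluster would run the map and reduce steps on different machines.

```python
from collections import defaultdict
from itertools import chain


def map_phase(document):
    """Emit a (word, 1) pair for every word in one document."""
    return [(word.lower(), 1) for word in document.split()]


def shuffle(pairs):
    """Group the emitted values by key, as a framework would between phases."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Combine all values for one key into a single result."""
    return key, sum(values)


def word_count(documents):
    # Each document is mapped independently; this is the part a cluster
    # could spread across many machines.
    mapped = chain.from_iterable(map_phase(doc) for doc in documents)
    return dict(reduce_phase(k, v) for k, v in shuffle(mapped).items())


# Hypothetical input documents.
counts = word_count(["big data big clusters", "data on clusters"])
print(counts)  # {'big': 2, 'data': 2, 'clusters': 2, 'on': 1}
```

The point to notice is that map_phase never looks at more than one document, which is what allows the work to be broken up and processed independently.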

One of the things a data scientist tries to do is make sense of data. They do this by either modelling data or finding correlations in data.
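
As a small illustration of what finding a correlation can mean in practice, the sketch below computes the Pearson correlation coefficient by hand. Pearson is just one common choice of measure and is my assumption here; the function name and the sample figures are hypothetical.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sum((x - mean_x) ** 2 for x in xs)
    spread_y = sum((y - mean_y) ** 2 for y in ys)
    return covariance / math.sqrt(spread_x * spread_y)


# Hypothetical data: advertising spend against units sold.
spend = [1, 2, 3, 4]
units_up = [2, 4, 6, 8]    # rises in step with spend
units_down = [8, 6, 4, 2]  # falls in step with spend

r_up = pearson(spend, units_up)
r_down = pearson(spend, units_down)
print(r_up, r_down)
```

A value near +1 or -1 indicates a strong linear relationship and a value near 0 indicates little linear relationship. It says nothing on its own about cause, which is where the business context comes back in.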

Machine Learning algorithms can be divided into 2 broad categories:

Supervised learning is useful in cases where a property (a label) is available for a certain dataset (the training set), but is missing and needs to be predicted for other instances. In the supervised learning category you will find:

  1. Decision Trees
  2. Naïve Bayes Classification
  3. Ordinary Least Squares Regression
  4. Logistic Regression
  5. Support Vector Machines
  6. Ensemble Methods
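
To illustrate the supervised case, here is a minimal sketch of item 3, ordinary least squares, for a single input variable. The training data (floor area and price) are invented for the example, and fit_ols is a hypothetical helper name, not a library function.

```python
def fit_ols(xs, ys):
    """Fit y = slope * x + intercept by ordinary least squares."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    slope = (
        sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
        / sum((x - mean_x) ** 2 for x in xs)
    )
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Training set: the property we care about (price) is known here.
areas = [50, 70, 90, 110]      # hypothetical floor areas
prices = [150, 210, 270, 330]  # hypothetical prices

slope, intercept = fit_ols(areas, prices)

# New instance: the price is missing, so we predict it from the model.
predicted_price = slope * 80 + intercept
print(slope, intercept, predicted_price)  # 3.0 0.0 240.0
```

The pattern is the same for the other algorithms in the list: learn from instances where the answer is known, then apply the result to instances where it is not.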

Unsupervised learning is useful in discovering implicit relationships in a given dataset. In the unsupervised learning category you will find:

  1. Clustering Algorithms
  2. Principal Component Analysis
  3. Singular Value Decomposition
  4. Independent Component Analysis

Next we summarise some distributions; that is, a statistical sample can be characterised by its mean, median and mode.
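
As a quick worked example, the three measures can be computed with Python's standard statistics module. The sample values are made up, and deliberately skewed so that the three measures differ.

```python
import statistics

# Hypothetical sample with one large value pulling the mean upwards.
sample = [2, 3, 3, 4, 5, 7, 18]

mean = statistics.mean(sample)      # arithmetic average
median = statistics.median(sample)  # middle value of the sorted sample
mode = statistics.mode(sample)      # most frequent value

print(mean, median, mode)  # 6 4 3
```

In a symmetric distribution such as the normal distribution the three coincide; here the single large value drags the mean above the median, which is a simple sign of skew.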

Then there is the question of thick (fat) or thin tails. A thick-tailed distribution is what lies behind the famous Black Swan event, where a scenario that is statistically unlikely (but whose probability is not effectively zero) has a disproportionate impact should it occur.
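
To give a feel for the difference, the sketch below compares the probability of an extreme value under a thin-tailed and a thick-tailed distribution. The choice of the standard normal and the standard Cauchy as the two examples, and the threshold of 5, are my assumptions for illustration.

```python
import math


def normal_tail(k):
    """P(X > k) for a standard normal distribution (thin tails)."""
    return 0.5 * math.erfc(k / math.sqrt(2))


def cauchy_tail(k):
    """P(X > k) for a standard Cauchy distribution (thick tails)."""
    return 0.5 - math.atan(k) / math.pi


k = 5  # hypothetical "extreme event" threshold
thin = normal_tail(k)
thick = cauchy_tail(k)

print(f"normal: {thin:.2e}")
print(f"cauchy: {thick:.2e}")
print(f"ratio:  {thick / thin:.0f}")
```

Under the normal distribution an event beyond 5 is rarer than one in a million, while under the Cauchy distribution it happens several percent of the time. A model that assumes thin tails when the data actually has thick ones will badly underestimate how often such events occur.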

This survey could be termed a random walk through technologies, tools, techniques, software, programming languages, algorithms and more. Clearly there is a lot in the detail – but what I tried to do is distinguish between what is important and what is not.
