Getting into the groove of Data Science

This article was planned to be about The Business of Data Science. However, there are already a lot of good articles on the business benefits of data science. Nor did I want to write about planning a data science technology roll-out, as that is the subject of a future blog post.

So here I am going to expand on my view of Data Science from the point of view of someone in the arena. Theodore Roosevelt colourfully described ‘life in the arena’ as follows:

The credit belongs to the man who is actually in the arena, whose face is marred by dust and sweat and blood; who strives valiantly; who errs, and comes short again and again, because there is no effort without error and shortcoming, but who does actually strive to do the deeds; who knows the great enthusiasms, the great devotions; who spends himself in a worthy cause…

Bumbling around as a scientist, software engineer and programmer has left some impressions:

I was fortunate that the first industrial programming I did was on a DEC VAX. This was a fine machine and OS. Thankfully I never got sucked into the 32-bit memory-model monkey-tricks. My first experience on a supercomputer was a Cray Y-MP. This was particularly disappointing because I was expecting a Ferrari-experience. As it turns out it was more like a truck: one submits a batch job via the same-old-same-old dirty dumb terminals. The job just came back quicker.

At some point I tried to set up a Beowulf cluster with machines scrounged from the department. This was not a huge success. It turns out that you need good network cards, otherwise you land up chasing your tail.

Seymour Cray said that he would rather have one Ox than 1024 Chickens.

Everyone knows that there is a trade-off in spec’ing out a computer. Clock speed, RAM, network access and hard-drive capacity are all considerations, but you cannot have the best of all of them. The problem that you wish to solve determines the hardware.

Thus if you have a computation that can be partitioned into independent parts, the right memory model would be the chickens. For a calculation that needs shared memory, the ox will do.
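As a rough sketch of the chickens case, here is a toy sum of squares cut into independent chunks. Everything in it is my own illustration: the function names are made up, and Python threads merely stand in for separate machines (they will not make this particular sum any faster). The point is only the shape of the problem: no chunk needs anything from any other chunk until the very last step.

```python
from concurrent.futures import ThreadPoolExecutor


def partial_sum_of_squares(chunk):
    """Work on one chunk; it needs nothing from any other chunk."""
    return sum(x * x for x in chunk)


def split(data, n_parts):
    """Cut data into n_parts contiguous chunks of nearly equal size."""
    size, extra = divmod(len(data), n_parts)
    chunks, start = [], 0
    for i in range(n_parts):
        end = start + size + (1 if i < extra else 0)
        chunks.append(data[start:end])
        start = end
    return chunks


def chicken_sum_of_squares(data, n_workers=4):
    """The 'chickens': each worker handles its own chunk independently."""
    chunks = split(data, n_workers)
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        partials = list(pool.map(partial_sum_of_squares, chunks))
    # The only communication is this final, cheap combination step.
    return sum(partials)


def ox_sum_of_squares(data):
    """The 'ox': one worker sees all of the data at once."""
    return sum(x * x for x in data)
```

Both routes give the same answer here. If the chunks had to keep talking to one another while working, the network would become the bottleneck, which is exactly where my scrounged Beowulf cluster came unstuck.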

I have not defined Big Data or Data Science properly. Basically Data Scientists don’t need Big Data – it’s just that they frequently use Big Data sets. A Data Scientist needs to generate a working hypothesis on data. Furthermore, people have been using big data sets in seismic data, weather data, rocket telemetry and telecoms for years. So big data is not new.

Here is a view as to what makes Big Data.

There are 3 things needed to do Data Science successfully:

  1. the physical architecture, algorithms and models
  2. a good physical understanding of the limitations of the models
  3. determining actionable outcomes.

The issue is that the above 3 steps may not be in order and may be repeated.

Setting up the hardware is an entertaining task, as discussed above. Furthermore there are recipes for data science algorithms in abundance – be that Machine Learning, Artificial Intelligence or something else. There are any number of academic papers out there on cool models: modelling network behaviour or what have you. (As mentioned, I will come back to algorithms and models in a later blog post.)

With regard to point 2: It is really hard to get an understanding of the model once it is implemented. One needs to understand the validity of the parameter space, analytic continuation, numerical stability and so much more. Even in Neural Networks, say, where one does not care so much about the meaning of parameters, one needs to understand the limits of the training.  Getting a computer to produce results is one thing – understanding what they mean is a whole new ball game.
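To make the numerical stability point concrete, here is a small sketch of my own (the data values and function names are illustrative assumptions, not taken from any particular project). Both functions implement the same textbook sample variance. The first uses the one-pass shortcut, which subtracts two enormous and nearly equal numbers; the second subtracts the mean first.

```python
def naive_variance(xs):
    """One-pass shortcut: (sum of squares - (sum)^2 / n) / (n - 1)."""
    n = len(xs)
    sum_x = sum(xs)
    sum_sq = sum(x * x for x in xs)
    # Two huge, nearly equal numbers are subtracted here, so most of the
    # significant digits cancel and rounding error dominates the result.
    return (sum_sq - sum_x * sum_x / n) / (n - 1)


def two_pass_variance(xs):
    """Two passes: compute the mean first, then sum squared deviations."""
    n = len(xs)
    mean = sum(xs) / n
    return sum((x - mean) ** 2 for x in xs) / (n - 1)


# Small spread sitting on top of a large offset.
# The sample variance of 4, 7, 13, 16 is 30, and shifting the data
# by a constant should not change it.
data = [1e9 + 4, 1e9 + 7, 1e9 + 13, 1e9 + 16]

print("naive   :", naive_variance(data))
print("two-pass:", two_pass_variance(data))
```

On paper the two formulas are identical. In double precision, the two-pass version recovers 30 for this data while the naive one does not come close. The computer happily produces a number either way.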

Another way of putting it is that you need to be sure that you have correctly implemented the maths. Programming can be very tricky.
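One cheap way of checking an implementation is to run it on a problem where the answer is known on paper. The sketch below is an illustration I have chosen for this purpose, not a recipe from a specific library: a hand-written trapezoid rule is tested against the integral of sin(x) from 0 to pi, which is exactly 2. The trapezoid rule should also cut its error by roughly a factor of four each time the number of steps doubles, so the error ratio is a second check.

```python
import math


def trapezoid(f, a, b, n):
    """Composite trapezoid rule for the integral of f over [a, b], n steps."""
    h = (b - a) / n
    total = 0.5 * (f(a) + f(b))
    for i in range(1, n):
        total += f(a + i * h)
    return total * h


exact = 2.0  # integral of sin(x) from 0 to pi, known analytically

err_coarse = abs(trapezoid(math.sin, 0.0, math.pi, 50) - exact)
err_fine = abs(trapezoid(math.sin, 0.0, math.pi, 100) - exact)

print("error with  50 steps:", err_coarse)
print("error with 100 steps:", err_fine)
print("ratio (expect roughly 4):", err_coarse / err_fine)
```

If the error does not shrink at the expected rate, the bug is more likely in the code than in the maths.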

Then of course the business does not care about nummies (or computation and numerical issues). They want a binary yes or no. Plus graphs. In a PowerPoint. Yesterday. Getting actionable results is where the Business hits the road.

Rinse and repeat: Normally, one would want to start with a toy problem, and then up the computational horsepower, the models and the promises.

Here is a view on Big Data maturity.

In the next blog post we will provide some of the numerical recipes and try to decombobulate some of the wizzy words from what is really needed.

 
