Science Communication

I have frequently observed the holes torn in the fabric of our society by bad communication. In this article I focus on Science and Technology.

This is a difficult argument to make, because the topic is so heated. Furthermore, the intention of the argument is to strike the right balance: it cannot be won one way or the other. I try my best to navigate the dead-pools in which such debates sink – but I am not convinced I have succeeded, which is a pity since this article is about communication. So I ask you, dear reader, to be patient, give my intentions the benefit of the doubt, and comment.

Let me start with a representative anecdote: I was sitting with some technology guys at a well known science research institute, talking and waiting for the chair. As usual, the conversation freewheeled in and out of technical minutiae, joshing and baiting each other. Quite educational and enjoyable. (I have noticed that such informal meetings are quite different when women are present – either they are disregarded, or alternatively the conversation is muted.)

It was clear that the techies did not have a clue as to where the meeting was going on a business level. They hate the word ‘business’ anyway. They like to call themselves introverts. So here we were, well-schooled guys, PhDs and all, shooting the breeze.

At some point the political commissar arrived, late. However, she knew where she was going and how she was going to get there. I did not like her, but I could work with her, because she had a plan.

(I am a trained scientist. A qualified engineer. But I work in business. This was a choice I made. For my many sins I have many hats – wearing them simultaneously is really hard.)

In the meantime, the tech guys shut up. They looked dejected and crushed. They did not say a word or even look up.

Finally, I left, the deal done. (The commissar later reneged on her undertakings, but that is another story.)

We see this frequently in science: a research institute has a communications department, and scientists are not allowed to talk freely.

Some claim ‘academic independence’ is compromised in the name of integrity and fidelity of the mission.

On the other hand, when techies speak freely, all sorts of hell break loose. Recently, at Google, an engineer published an opinion piece that was so flawed it was embarrassing. He was fired, which was a pity. I have no doubt that his screed was sorely lacking and that Google did not agree with him. Yet he was muzzled by the corporate thought police.

Let me be very clear. He had some ideas with which I profoundly disagreed. But he had hurt no-one. (I accept the fact that he was offensive and that he poisoned the atmosphere for women, etc, etc.) But in the general realm of Bad Things, he had not yet reached Sticks and Stones proportions.

Yet, being pointless is also pointless. Being only able and prepared to talk about things-in-the-small is just sad. Techies must be able and willing to look at the bigger picture. Pejoratively labelling these bigger issues “business issues” does not mean that techies can abdicate responsibility for getting involved in the broader debate.

Techies may be wrong. (We are frequently wrong.) But that does not merit a public beating. Being wrong is good; being right is better. But it is dishonest and counterproductive not to air differing views.

Sometimes it is indeed true that these views are simply bile and must be smothered. Furthermore, not all ideas are equally good, and we must be prepared to concede.

Recently I upset various people when I noted that Zimbabwe would be a counter-example to the importance of the role of Science.

It was noted that my comment was unfair against the hardworking and dedicated scientists there, who do locally relevant science, despite no funding or support. (This was a Twitter debate, so arguments were somewhat terse.)

I get Science.

What I see in this image is two countries that are on the same level, one of which is failing. Clearly, having an institution does not lead to economic (or other) success. (I have high regard for Zimbabwean software people. However, I know no S&T people.)

To expand my point: what I see is token science: countries make big institutes, filled with scientists, pursuing mirages or cottage industries. You can always go through the motions of doing science.

In my defence: is the money going to institutions the same as the money going to dedicated scientists? I agree with the practice of science (with caveats).

I am reminded of the Iraqi Institute of Science sponsoring research into djinn power (this was before they became ‘capable’ of WMD), or of Soviet science. While there were exceptions, such as Bonner and Sakharov, science in the Soviet Union was conducted according to political diktat.

I am not a bottom line kind of guy. But I do look towards actionable outcomes. When I look at the above map, I see the pretence of science.

Back to the point: scientific arguments are often quite subtle. (This essay may be a case in point.) Science journalists take the arguments and hack and change them into a story. People like stories.

Furthermore, scientists are the biggest suckers there are when it comes to swallowing facts. For example, they allow themselves to participate in climate ‘debates’ where both sides are ‘fairly’ represented. Well, they are not – 97% of scientists think humans are responsible for global warming. Scientists are being hoist by their own petard, or by that of PR.

It is so easy to make a mistake and say something that causes offence or is naive. But that should not stop us weighing-in.


PGWC Digital Strategy

This is in response to the “Dept. of Economic Development & Tourism Western Cape Government Sector Digital Disruption Impact Assessment“. I am sorry to say that I was underwhelmed by the document.
It reads like a first year marketing project. Have no doubt that the report is masterfully executed – as pretty as can be.
But I do not see any strategy or vision in the document – what I do see are really tired examples of breathless digital Use Cases.
Furthermore, why the PGWC got a global consultancy to grind out yet another vanilla report is beyond me. I read quite a lot of government strategy and this report, while being particularly pretty, is bereft of substance.
Most of the (real) digital experts I know would not even bother reading the first page because it is so lite.
Two examples:
Soekor (Petrosa) was one of the best tech companies in SA. While they found no oil, they could do data back in the day. Now they are a shadow of their former selves. They are laughable. I know the PGWC has no dog in this fight, but (a) they should have got involved when they could, (b) they cannot use it as their story, and (c) Petrosa did ‘big data’; there should be some lessons there.
The Cape Big Data facility is a confused miasma of priorities – there is the CHPC and the SKA (whatever that is). Yet proper experts did not even make it to meet the department in the tender. It was quite crushing that, even at the first Big Data meeting, a bus was driven between the ministers without even trying.
The PGWC prides itself on the vibrancy of its digital economy and startups. But there is very little tangible evidence to show that they are doing anything more than their government brethren nationally, to the north or to the east.
I gave up trying to critique the report.
What I offer instead is a few vignettes about what I would like to see:
(1) Open Data being used to stimulate SMBs and create openness in government. This could be used around water consumption, dam levels and more.
(2) Engaging and embracing Digital standards to share information. This could be around train timetables.
(3) Attracting Global software companies to take up residency. #Girls Coding would feed into this.
(4) Tenders stink. They really do. The tender program is an abject failure. Please do not blame central government or process.
I will try and read the document again, but it is just irritating due to being so lite and full of trite aphorisms, noble statements and basic bullshit.
I have stated previously that government support for SMBs is a demonstrable cockup. Please, government thinker, put the following picture in your pipe too.
I am calling this digital policy bullshit.

Infectious diseases and modelling online social networks

Models of infectious diseases are frequently used to describe the way the user base of an online social network grows.

The logic is simple – disease contagion (e.g. Avian Flu) propagates in the ‘same way’ that people pick up new information about something cool (e.g. a new social network) and therefore subscribe to that particular network.
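To make the analogy concrete, here is a minimal sketch of the idea, assuming a standard SIR-style compartment model in which ‘infection’ is reread as ‘signing up’ and ‘recovery’ as churn. The rate constants and population size are invented for illustration, not fitted to any real network.

```python
# A toy SIR-style model reused as an adoption model: "infected" means "has signed up".
# All constants below are made up for illustration.
def adoption_curve(beta=0.3, gamma=0.05, n=1_000_000, i0=10, days=365):
    """Euler-step dS/dt = -beta*S*I/N, dI/dt = beta*S*I/N - gamma*I."""
    s, i = float(n - i0), float(i0)
    active = []
    for _ in range(days):
        new_signups = beta * s * i / n   # contagion term: contacts convert susceptibles
        churn = gamma * i                # "recovery", reread as users going inactive
        s -= new_signups
        i += new_signups - churn
        active.append(i)
    return active

curve = adoption_curve()
print("active users after 30, 90 and 365 days:",
      [int(curve[d]) for d in (29, 89, 364)])
```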

I was reading “On the predictability of infectious disease outbreaks”  by Samuel V. Scarpino and Giovanni Petri, and was reminded that the link between disease spreading and people signing-up is but tenuous.

While the article does not even bother to try to make this link (its authors being intelligent academics), I am drawn to these types of articles in order to get some insights and learn new mathematical and statistical techniques.

It is very easy to forget why you arrived at the point of confusing the two, so this is a footnote to self.

P vs NP problems – progress?

The problem of computability has been with computer scientists and other types for some time.

Norbert Blum, a distinguished mathematician, appears to have made progress on the problem.  I am not a computer scientist, but John Baez (mathematical physicist) has a good blog on the subject here -> Norbert Blum on P versus NP.

I add some fairy-wheels here.

The author has published a preprint – that means that experts still need to referee the paper, but the author has made it available to the rest of us in the meantime. A lot of preprints are published that do not pass muster. Nevertheless this paper seems to have cleared the initial BS filters.

What he has done, in the best mathematical tradition, is partition the set of possible problems into NP-Hard, NP-Complete, NP and P (the picture if P ≠ NP), rather than the collapsed alternative in which P, NP and NP-Complete coincide (the picture if P = NP).


So what problems are we talking about, and what do P and NP mean?

First, problems of type NP are computational problems that can be solved in nondeterministic polynomial time – that means there is some number α such that solving the problem takes fewer than t^α steps, where t measures the size of the input.

Thus a problem that can be solved in time t is better than a problem that can be solved in time t^2. The next step is to understand that any problem that can be solved in time t^n is much quicker than one that needs e^t.

(Yes you can add some scalar factor like 0.00001 to damp the exponential time. But for t large enough, the wookie always gets you.)
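A two-line numerical check of that claim: even with a tiny prefactor, the exponential eventually overtakes any fixed power. The power 10 and the prefactor below are arbitrary.

```python
# Even 0.00001 * e^t eventually beats t^10: scan for the crossover point.
import math

for t in range(1, 200):
    if 1e-5 * math.exp(t) > t ** 10:
        print("exponential overtakes t^10 at t =", t)
        break
```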

Problems in NP are, equivalently, those whose solutions can be verified in polynomial time. NP-complete problems are the hardest of these: every problem in NP can be reduced to an NP-complete problem in polynomial time.

At a conference in 1971 the question was framed of whether all NP-complete problems are solvable in polynomial time. This question is represented as P = NP – that is, whether the set of problems whose solutions can be verified in polynomial time is the same as the set of problems that can be solved in polynomial time.

This is now one of the great unsolved problems of mathematics and profoundly affects algorithm design in computation.

So what kind of problems? For example, the travelling salesman problem: “Given a list of cities and the distances between each pair of cities, what is the shortest possible route that visits each city exactly once and returns to the origin city?” This is useful, for example, in the airline industry. Another example is the graph colouring problem, which asks whether there is a way of colouring the vertices of a graph so that no two adjacent vertices share the same colour, using at most a given number of colours.

Solving these problems is easy enough on a case-by-case basis, at least for small instances – the problem is finding a general method that stays efficient as the instances grow.
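A small sketch of the asymmetry, using the graph colouring example above: checking a proposed colouring takes one pass over the edges (polynomial time), while the naive search tries every assignment (exponential in the number of vertices). The little four-vertex graph is invented for illustration.

```python
# Verifying a colouring is cheap; finding one by brute force is not.
from itertools import product

edges = [(0, 1), (1, 2), (2, 0), (2, 3)]   # toy graph: a triangle plus a pendant vertex
n_vertices, n_colours = 4, 3

def is_valid(colouring):
    """Polynomial-time check: one pass over the edge list."""
    return all(colouring[u] != colouring[v] for u, v in edges)

def brute_force_colouring():
    """Exponential-time search: n_colours ** n_vertices candidate assignments."""
    for colouring in product(range(n_colours), repeat=n_vertices):
        if is_valid(colouring):
            return colouring
    return None

solution = brute_force_colouring()
print("one valid colouring:", solution)
print("verified:", is_valid(solution))
```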

Back to the problem at hand – it looks as though NP-complete problems are not all in P.

That is, there are some problems for which we can show that a proof is correct if we are given one, but finding that proof is not so easy.

(But there was a recent similar proof that had an error. So we wait for the egg-heads to say yay or nay.)

Inherent Trade-Offs in the Fair Determination of Risk Scores

Jon Kleinberg,  Sendhil Mullainathan & Manish Raghavan

This is a commentary on an arXiv preprint (17 Nov 2016). Please note that I have not reviewed the maths and statistics.

Abstract (edited)

Recent discussion in the public sphere about algorithmic classification has involved tension between competing notions of what it means for a probabilistic classification to be fair to different groups. We formalize three fairness conditions that lie at the heart of these debates, and we prove that there is no method that can satisfy these three conditions simultaneously.

The conditions are

  1. calibration within groups,
  2. balance for the negative class, and
  3. balance for the positive class.

Moreover, a version of this fact holds in an approximate sense as well.

Kleinberg earned his spurs with his work on the small-world (“six degrees of separation”) problem, following on from Watts and Strogatz.

To take one simple example, suppose one wants to determine the risk that a person is a carrier for a disease X, and suppose that a higher fraction of women than men are carriers. Then this result implies that in any test designed to estimate the probability that someone is a carrier of X, at least one of the following undesirable properties must hold:

(a) the test’s probability estimates are systematically skewed upward or downward for at least one gender; or

(b) the test assigns a higher average risk estimate to healthy people (non-carriers) in one gender than the other; or

(c) the test assigns a higher average risk estimate to carriers of the disease in one gender than the other.

The point is that this trade-off among (a), (b), and (c) is not a fact about medicine; it is simply a fact about risk estimates when the base rates differ between two groups.
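To see how mechanical the trade-off is, here is a back-of-the-envelope sketch. The base rates are hypothetical (10% of women and 30% of men are carriers), and the risk score is the crudest calibrated one imaginable: everyone simply gets their group’s base rate.

```python
# Two groups with different base rates for disease X. Give every person in a group
# the group's base rate as their risk score. Within each group the score is
# perfectly calibrated, yet non-carriers in one group are scored higher than
# non-carriers in the other, so balance for the negative class fails.
# The base rates below are hypothetical, purely for illustration.
base_rate = {"women": 0.10, "men": 0.30}

for group, p in base_rate.items():
    score = p                       # calibrated: of those scored p, a fraction p are carriers
    avg_score_non_carriers = score  # every non-carrier in the group gets the same score
    avg_score_carriers = score      # and so does every carrier
    print(f"{group}: avg score, non-carriers = {avg_score_non_carriers:.2f}; "
          f"carriers = {avg_score_carriers:.2f}")

print("Healthy men are scored three times higher than healthy women,")
print("even though the score is calibrated within both groups.")
```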

Thus it appears that being fair is not all that easy if you are a computer: even if fairness is your intention, you cannot satisfy every notion of it at once.

Girls Coding – the question of metrics

Recently, in an internal memo, a Google employee and Harvard Biology PhD wrote a screed [1] about intrinsic differences between men and women as a justification for, well, dunno. The article was on diversity, which was ironic, because it got him fired. I think he may have been fired because the article was extremely lightweight, perpetuating stereotypes – especially about women being more caring and co-operative than men – “research” that has been questioned in recent years [2]. The article was debunked elsewhere – but I liked [3] and [4]. Whatever the case, firing the author was OTT.
While the article perpetuated male stupidity, firing him created a demon that the Breitbart boys love and was a lost opportunity.
One of the best points against him was that, as a software engineer, he should understand diversity better, so that his systems do not perpetuate bigotry and bias in code. This is especially true for learning systems.
While no one is disputing that men and women are different (well, duh?), we have dug ourselves a hole reminiscent of the pro-slavery and anti-suffrage debates of previous centuries, with one new difference: women used to be up there with men in tech.
This note is not a contribution to the Google debate, but notes the loss in parts of the software community of the “Yeah!” moment that separates the institutionalised hacks from the people with a plan and fire in their hearts.
What I mean by this is that for me coding is fun and creative. Showing people stuff working is where I want to be. I love that feeling, working late at night, when you get something to work. Properly.
This to me is what is lacking – the opportunity for girls to make working stuff.
There are a lot of good initiatives to get girls coding. However, I do not understand what their success factors are. I would like to understand.
What I do not think will work is simply minting programming droids. We want people – women – who have a plan and can make things happen. I think what is missing is the opportunity for girls to share that making moment.
In my mind the school system does not create the right kind of vibe. (Again, there are school Expos, which are pretty good.) I have also heard that mentoring empowers women (people) in the same way the “Yeah!” moment does.
What I would like to see is someone measuring the desired end results of getting-girls-coding initiatives. In 20 years’ time we would like to see more women back coding. But what about metrics for the short term?
What do we need to measure to get more women software engineers? I think getting girls enthusiastic is a good start. But the right kind of plan is needed to turn that enthusiasm into women coding again.

Bibliography

[1] Here is the original article
plus a mealy-mouthed bit of corporate ‘communication’ from their new Vice President of Diversity, Integrity & Governance, Danielle Brown.
[2] Inferior: How Science Got Women Wrong, by Angela Saini

[3] Sabine Hossenfelder – http://backreaction.blogspot.co.za/2017/08/outraged-about-google-diversity-memo-i.html?spref=tw
[4] A debunking from an engineering point of view – Yonatan Zunger – https://medium.com/@yonatanzunger/so-about-this-googlers-manifesto-1e3773ed1788
[5] The Economist – https://www.economist.com/news/business-and-finance/21725972-james-damore-said-personality-may-explain-gender-gap-tech-google-employee-inflames?fsrc=scn/tw/te/bl/ed/

Data Science Skills, Tools and Algorithms (Part 3)

Finally, in this third part of our series [1, 2] on data science, we look at the definitions of some specific languages, distributions, algorithms and tools.

These are frequently thrown up as necessary requirements for a Data Scientist, but in fact they come after everything else: they may be needed, and they can be learned.

It is far more important to understand the need for a model in the context of the business (i.e. to draw actionable conclusions) than to be able to extract some data that really means very little.

Skills and tools, grouped by category:

Apache Hadoop suite
  • Hadoop – clusters of commodity hardware managed by the Hadoop software framework.
  • HDFS – the Hadoop Distributed File System.
  • YARN – Yet Another Resource Negotiator, for scheduling users’ applications.
  • MapReduce – a programming model whereby tasks are broken up into smaller parts and processed independently (a toy sketch follows this list).
  • Apache HBase – an open-source, non-relational, distributed database.
  • Cassandra – a column-oriented NoSQL database that supports access from Hadoop.
  • Spark – for programming entire clusters with implicit data parallelism and fault tolerance.
  • Sqoop – for transferring data between relational databases and Hadoop.
  • Mahout – machine learning algorithms for collaborative filtering, clustering and classification.
  • Flume – for efficiently collecting, aggregating and moving large amounts of data.
  • Hive – an SQL-like interface to query data stored in the databases and file systems that integrate with Hadoop.
  • Pig – abstracts Java MapReduce programming to a higher level.
  • CouchDB – a document-oriented NoSQL database.
  • Cloudera Impala – lets users issue SQL queries against data stored in HDFS and Apache HBase without requiring data movement or transformation.

Common data science toolkits
  • R – a 4GL programming language for statistical manipulation and analysis; a tool used to analyse and manipulate data.
  • Weka – an open-source machine learning tool written in Java.
  • NumPy – a Python library that adds MatLab-like functionality; SciPy is a companion library.
  • MatLab / ML Studio – a 4GL language that allows matrix manipulations, plotting of functions and data, implementation of algorithms, creation of user interfaces, and interfacing with programs written in other languages, including C, C++, C#, Java, Fortran and Python.

Enterprise BI (proprietary tools)
  • QlikView – QlikView and Qlik Sense, business intelligence and visualisation software.
  • Lumira – business intelligence software from SAP BusinessObjects, used to manipulate and visualise data.
  • Tableau – software for interactive data visualisation focused on business intelligence.
  • Zoomdata – a data visualisation and analytics tool for exploring and analysing large quantities of data.
  • SSRS – SQL Server Reporting Services, Microsoft’s server-based report-generating system.
  • SAP BW – an enterprise data warehouse; it can transform and consolidate business information from virtually any source system.
  • Cognos – IBM’s business intelligence and performance management products.

Query languages
  • SQL – the relational database query language.
  • Interactive SQL on Hadoop.

NoSQL databases
  • MongoDB – an open-source, document-oriented NoSQL database that uses JSON-like documents.
  • Elasticsearch – a distributed, multitenant-capable full-text search engine with an HTTP web interface and schema-free JSON documents.

IoT
  • Platforms such as Raptor.

Enterprise ETL (Extract, Transform, Load)
  • SSIS – Microsoft’s SQL Server Integration Services, for a broad range of data migration tasks.
  • SAP DS – SAP’s equivalent.
  • IBM DataStage – IBM’s equivalent.
  • Pentaho – Hitachi’s equivalent.

PaaS and XaaS
  • “X as a Service”, such as Platform as a Service.

Microservices
  • Docker – provides abstraction and automation for “containers”, each running as a single Linux instance.
  • Kubernetes – automated deployment, scaling and management of containerised applications.

Cognitive Computing
  • Platforms that encompass machine learning, reasoning, natural language processing, speech recognition and vision.

Machine Learning / AI

Numerical techniques
  • k-NN – a pattern recognition method used for classification and regression.
  • Naive Bayes – probabilistic classifiers based on applying Bayes’ theorem with strong independence assumptions.
  • SVM – support vector machines, used for classification and regression analysis.
  • Decision Forests – an ensemble learning method for classification, regression and other tasks.

Social listening
  • The process of tracking conversations around specific words over many sources.

CEP
  • Complex Event Processing on streams.

Statistical modelling
  • Distributions, statistical testing, regression, multivariable calculus and linear algebra – stuff you learn at school.

Data visualisation tools
  • D3.js – a Javascript visualisation library.
  • GGplot – an R visualisation library.
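As promised above, a toy sketch of the MapReduce idea in plain Python rather than Hadoop: map a word count over independent chunks of text, then reduce the partial counts into a total. The chunks are invented for illustration.

```python
# Map a task over independent chunks, then reduce the partial results.
from collections import Counter
from functools import reduce

def map_phase(chunk):
    """Map: count words in one chunk, independently of the others."""
    return Counter(chunk.lower().split())

def reduce_phase(counts_a, counts_b):
    """Reduce: merge two partial word counts."""
    return counts_a + counts_b

chunks = ["the quick brown fox", "jumps over the lazy dog", "the end"]
partials = [map_phase(c) for c in chunks]           # could run on separate nodes
totals = reduce(reduce_phase, partials, Counter())  # combine the partial results
print(totals.most_common(3))
```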

One of the things a data scientist tries to do is make sense of data. They do this either by modelling the data or by finding correlations in it.

Machine Learning algorithms can be divided into 2 broad categories:

Supervised learning is useful in cases where a property is available for a certain dataset (the training set), but is missing and needs to be predicted for other instances. In the supervised learning category you will find:

  1. Decision Trees
  2. Naïve Bayes Classification
  3. Ordinary Least Squares Regression
  4. Logistic Regression
  5. Support Vector Machines
  6. Ensemble Methods

Unsupervised learning is useful in discovering implicit relationships in a given dataset. In the unsupervised learning category you will find (a minimal sketch of both categories follows this list):

  1. Clustering Algorithms
  2. Principal Component Analysis
  3. Singular Value Decomposition
  4. Independent Component Analysis
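To make the two categories concrete, here is a minimal sketch using scikit-learn and its bundled iris data: a logistic regression as the supervised example (the species labels are used for training) and k-means plus PCA as the unsupervised examples (the labels are ignored). The dataset and parameters are chosen purely for illustration.

```python
# Supervised vs unsupervised learning in a few lines, on the toy iris dataset.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA

X, y = load_iris(return_X_y=True)

# Supervised: learn from labelled examples, then predict labels for held-out ones.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
clf = LogisticRegression(max_iter=500).fit(X_train, y_train)
print("supervised test accuracy:", round(clf.score(X_test, y_test), 3))

# Unsupervised: the labels y are never used; structure is discovered from X alone.
clusters = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)
print("cluster sizes found:", sorted(int((clusters == k).sum()) for k in range(3)))
print("variance explained by 2 principal components:",
      round(float(PCA(n_components=2).fit(X).explained_variance_ratio_.sum()), 2))
```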

Next we summarise some distributions – that is, the fact that a statistical sample will have a mean, a median and a mode.

Then there is the consideration of issues relating to thick or thin tails. This is the territory of the famous Black Swan event, where a statistically unlikely scenario (unlikely, but not effectively impossible) has a disproportionate impact should it occur.
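A quick sketch of what a thick tail means in numbers, comparing the standard normal with a Student-t with three degrees of freedom. The distributions and the threshold of 5 are chosen purely for illustration.

```python
# Tail probabilities P(X > 5) for a thin-tailed and a thick-tailed distribution.
from scipy import stats

p_normal = stats.norm.sf(5)       # standard normal: essentially never
p_student = stats.t.sf(5, df=3)   # Student-t, 3 degrees of freedom: rare but very possible
print(f"normal      P(X > 5) = {p_normal:.2e}")
print(f"student-t3  P(X > 5) = {p_student:.2e}")
print(f"ratio: ~{p_student / p_normal:.0f}x more likely under the thick tail")
```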

This survey could be termed a random walk through technologies, tools, techniques, software, programming languages, algorithms and more. Clearly there is a lot in the detail – but what I tried to do is distinguish between what is important and what is not.

Data Science Skills, Tools and Algorithms (Part 2)

This is the second part of an article on Big Data and Data Science. (Here is the first). I thought I would put down a list of skills, algorithms and the stuff of a Data Scientist. Clearly this is my opinion but I would welcome feedback.

I also try to arrange requirements in a hierarchy of importance.

First and foremost a Data Scientist should be a Scientist. They need to understand what a hypothesis is and what constitutes a proof. They need to be able to communicate with the business, first to develop something useful and then to relay their success.

Clearly there is overlap with a Business Analyst, Business Intelligence, Software Engineers, Statisticians and more. The difference often can be represented through the tool set.

There is an awful lot in the above but it comes down to science training and domain understanding.

Then of course a Data Scientist needs to be able to successfully execute a plan. For this they need raw data skills – for example SQL (e.g. Postgres).

A Data Scientist does not necessarily need a lot of data. Normally they are involved in extracting and analysing real time data. Thus, they may need Big Data query skills, such as Hive or Pig.

Taking a step backwards,  a Data Scientist may be required to help develop requirements for some Big Iron needed for Big Data. A whole lot of work may be needed to define the physical architecture, operating system (or overlay such as Hadoop) and Application Software. (More about Application Software later). While some tasks are not within their domain, they should be consulted and must have input.

A critical approach is to iterate this process. Initially a Data Scientist may help set up hardware, do a static data pull, define a model and analyse it. Later they may do a dynamic data pull into an application that provides real-time feedback. Then, as a third iteration, they may wish to perfect the model and get arms-length feedback into the business.

There are a whole lot of soft skills required of a data scientist. These skills are built on experience and expertise in the hard skills. They include

  • the ability to govern technical projects, including managing the approach methodology and team, e.g. project management, process and budgeting,
  • analysis,
  • design and visual design,
  • standards,
  • creating an understanding and solution buy-in and pre-sales,
  • identify new analytics trends, ability to teach others and ability to learn new techniques,
  • the ability to create and manage an innovation agenda, e.g. push the analytical / modelling envelope,
  • building community e.g. mentoring and advising,
  • communicate effectively.

A data scientist should have a broad general knowledge about PaaS, XaaS, micro services, Cognitive Computing, IoT and so on.

The data scientist should read a lot so that they have ideas for social network and online models. Typically they would exercise these models using MatLab or Mathematica.

The following lists of skills, tools and algorithms are neither complete nor necessary, but are intended to partition technologies.

Very often a Data Scientist needs to help develop an application around their models and algorithms. Therefore a data scientist should be able to program and run scripts; this includes SQL queries:

  • Java, Groovy & Gradle
  • C#,
  • C++,
  • R language,
  • Python or Ruby,
  • Scala,
  • Docker and Kubernetes,
  • Javascript,
  • Query languages include Hive, Pig, MapReduce and Interactive SQL,
  • Database knowledge should include relational databases, NoSQL (like Neo4j) and others.

General statistical skills include

  • deep and broad statistical modelling,
  • good applied statistical and mathematical skills, such as
    • distributions,
    • statistical testing,
    • regression,
    • multivariable calculus,
    • linear algebra,
  • test hypotheses from raw data sets (see the sketch after this list),
  • Artificial Intelligence,
  • Machine Learning.
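As promised above, a minimal sketch of testing a hypothesis from raw data: a two-sample t-test on synthetic samples. The sample sizes, means and the small ‘uplift’ are invented for illustration.

```python
# Test the hypothesis that two raw samples share the same mean.
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
control = rng.normal(loc=100.0, scale=15.0, size=200)   # e.g. a baseline metric
treated = rng.normal(loc=104.0, scale=15.0, size=200)   # hypothetical small uplift

t_stat, p_value = stats.ttest_ind(treated, control)
print(f"t = {t_stat:.2f}, p = {p_value:.4f}")
print("reject H0 (equal means) at 5%" if p_value < 0.05 else "cannot reject H0")
```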

Data visualisation tools, such as

  • D3.js,
  • GGplot,
  • etc.

Some companies have proprietary visualisation and reporting tools.

  • Enterprise Business Intelligence;
    • QlikView, Lumira, Tableau, Flare, Zoomdata, Cognos, SAP BW, SSRS
  • Enterprise ETL
    • SSIS, SAP DS, IBM DataStage, Pentaho.


In the next part, we look at the definitions of some specific languages, distributions, algorithms and tools. These are frequently thrown up as necessary requirements for a Data Scientist, but in fact they come after everything else: they may be needed, and they can be learned. It is far more important to understand the need for a model in the context of the business (i.e. to draw actionable conclusions) than to be able to extract some data that really means very little.


Getting into the groove of Data Science

This article was planned to be about The Business of Data Science. However there are a lot of good articles on the business benefits of data science. On the other hand, I did not want to write about the planning for a data science technology roll out as this is the subject of a future blog.

Therefore, here I am going to expand my view of Data Science from the point of view of someone from the arena. Theodore Roosevelt colourfully described ‘life in the arena’ as follows:

The credit belongs to the man who is actually in the arena, whose face is marred by dust and sweat and blood; who strives valiantly; who errs, and comes short again and again, because there is no effort without error and shortcoming, but who does actually strive to do the deeds; who knows the great enthusiasms, the great devotions; who spends himself in a worthy cause…

Bumbling around as a scientist, software engineer and programmer has left some impressions:

I was fortunate that the first industrial programming I did was on a DEC VAX. This was a fine machine and OS. Thankfully I never got sucked into the 32 bit memory-model monkey-tricks. My first experience on a supercomputer was a Cray YMP. This was particularly disappointing because I was expecting a Ferrari-experience. As it turns out it was more like a truck: one submits a batch job via the same-old-same-old dirty dumb terminals. The job just came back quicker.

At some point I tried to set up a Beowulf cluster with machines scrounged from the department. This was not a huge success. It turns out that one needs to have good network cards, otherwise you land up chasing your tail.

Seymour Cray said that he would rather have one Ox than 1024 Chickens.

Everyone knows that there is a trade-off in spec’ing out a computer. A fast clock speed, lots of RAM, fast network access and a big hard drive are all considerations. But you cannot have them all. The problem that you wish to solve determines the hardware.

Thus if you have a computation that can be partitioned into independent parts, the right memory model would be the chickens. For a calculation that needs shared memory, the ox will do.
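A minimal sketch of the ‘chickens’ case – an embarrassingly parallel job whose parts need no shared memory. The toy task (counting primes by trial division) and the chunk sizes are invented for illustration.

```python
# An embarrassingly parallel job split across worker processes ("chickens").
from multiprocessing import Pool

def count_primes(bounds):
    """Count primes in [lo, hi) by trial division -- deliberately naive work."""
    lo, hi = bounds
    count = 0
    for n in range(max(lo, 2), hi):
        if all(n % d for d in range(2, int(n ** 0.5) + 1)):
            count += 1
    return count

if __name__ == "__main__":
    chunks = [(i, i + 25_000) for i in range(0, 100_000, 25_000)]  # independent parts
    with Pool(processes=4) as pool:                                # four "chickens"
        total = sum(pool.map(count_primes, chunks))                # no shared memory needed
    print("primes below 100000:", total)
```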

I have not defined Big Data or Data Science properly. Basically, Data Scientists don’t need Big Data – it’s just that they frequently use Big Data sets. A Data Scientist needs to generate a working hypothesis on data. Furthermore, people have been using big data sets in seismic data, weather data, rocket telemetry and telecoms for years. So big data is not new.

Here is a view as to what makes Big Data.

There are 3 things needed to do Data Science successfully:

  1. the physical architecture, algorithms and models
  2. a good physical understanding of the limitations of the models
  3. determining actionable outcomes.

The issue is that the above 3 steps may not be in order and may be repeated.

Setting up the hardware is an entertaining task, as discussed above. Furthermore there are recipes for data science algorithms in abundance – be that Machine Learning, Artificial Intelligence or something else. There are any number of academic papers out there on cool models: modelling network behaviour or what have you. (As mentioned, I will come back to algorithms and models in a later blog post.)

With regard to point 2: It is really hard to get an understanding of the model once it is implemented. One needs to understand the validity of the parameter space, analytic continuation, numerical stability and so much more. Even in Neural Networks, say, where one does not care so much about the meaning of parameters, one needs to understand the limits of the training.  Getting a computer to produce results is one thing – understanding what they mean is a whole new ball game.

Another way of putting it is that one needs to be sure that you have correctly implemented the maths. Programming can be very tricky.
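A tiny example of how implementing the maths correctly is not the same as implementing it safely: the two expressions below are algebraically identical, yet one loses every significant digit in floating point for small x.

```python
# 1 - cos(x) and 2*sin(x/2)**2 are algebraically equal, but not numerically.
import math

x = 1e-8
naive = 1.0 - math.cos(x)            # catastrophic cancellation: cos(x) rounds to 1.0
stable = 2.0 * math.sin(x / 2) ** 2  # equivalent formula, no cancellation
exact = x * x / 2                    # leading term of the Taylor series, for comparison

print(f"naive  = {naive:.3e}")       # prints 0.000e+00: all precision lost
print(f"stable = {stable:.3e}")      # ~5.000e-17
print(f"exact  ~ {exact:.3e}")
```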

Then of course the business does not care about nummies (or computation and numerical issues). They want a binary yes or no. Plus graphs. In a Power Point. Yesterday. Getting actionable results is where the Business hits the road.

Rinse and repeat: Normally, one would want to do a toy problem, and then up the computational horse power and models and promises.

Here is a view on Big Data maturity.

In the next blog we will provide some of the numerical recipes and try and decombobulate some of the wizzy words from what is really needed.


Modeling population growth in online social networks

Calculating the number of clients of an online social network.

I recently read a paper by Zhu, Li and Fu [1] on modelling online social networks.

They derive a population growth model of online social networks as a function of time.

They use

  • the population distribution P(s,t) – the proportion of locations with population size s out of the total number of populated locations,
  • the total number of populated locations l(t), and
  • the largest population size n(t).

These distributions P(s,t), l(t) and n(t) empirically follow power laws. This conclusion was obtained by looking at statistics from three online social networks and choosing power laws as the relevant distribution functions.
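For readers who want to see what ‘follows a power law’ means in practice, here is a hedged sketch with synthetic data: draw Pareto-distributed ‘population sizes’ and check that the empirical tail is close to a straight line on log-log axes by fitting its slope. The exponent is arbitrary; these are not the paper’s fitted parameters.

```python
# Generate synthetic power-law "population sizes" and estimate the tail exponent.
import numpy as np

rng = np.random.default_rng(0)
alpha = 2.5                                    # arbitrary exponent, for illustration only
sizes = 1 + rng.pareto(alpha, size=50_000)     # P(S > s) ~ s**(-alpha)

# The empirical complementary CDF should be roughly linear on a log-log scale.
s = np.sort(sizes)
ccdf = 1.0 - np.arange(1, len(s) + 1) / len(s)
mask = ccdf > 1e-3                             # drop the noisy extreme tail
slope, _ = np.polyfit(np.log(s[mask]), np.log(ccdf[mask]), 1)
print(f"fitted tail slope ~ {slope:.2f} (expected ~ {-alpha})")
```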

These power-law forms can now be inserted into the growth formula GP(t) above and integrated out to give the number of clients as a function of time.

There are quite a few potential problems with this model, amongst others the distribution fit. The fact that the distribution has 6 parameters is not impressive.

However, the result is quite surprising in that it seems to apply to all online social networks (up to the above parameters).

We plot some of the results.

The authors give us some parameters. The rest we calculate by normalising the formula (that is we need to determine a constant to compare oranges with oranges). This we do at 18 months.

Here are our parameters

First we look at GP(t): the number of clients grows surprisingly slowly as a function of time.

Then we look at the distribution of the largest population size, n(t).

Then we look at the total number of populated locations, l(t).

Finally we look at P(s,t), the proportion of locations with a given population size.

Conclusion

These models are quite fun to work with and are quite useful. However care needs to be taken in drawing conclusions.

The paper [1] referred to in this discussion is now about 4 years old.

Reference

[1] https://link.springer.com/article/10.1186/2194-3206-1-14