Randomness is all around us. Its existence sends fear into the hearts of predictive analytics specialists everywhere -- if a process is truly random, then it is not predictable, in the analytic sense of that term. Randomness refers to the absence of patterns, order, coherence, and predictability in a system.

Unfortunately, we are often fooled by random events whenever apparent order emerges in the system. In moments of statistical weakness, some folks even develop theories to explain such "ordered" patterns. However, if the events are truly random, then any correlation is purely coincidental and not causal. I remember learning in graduate school a simple joke about erroneous scientific data analysis related to this concept: "Two points in a monotonic sequence display a tendency. Three points in a monotonic sequence display a trend. Four points in a monotonic sequence define a theory." The message was clear -- beware of apparent order in a random process, and don't be tricked into developing a theory to explain random data.

One way that randomness is most likely to undermine rational thinking is through small-numbers phenomena. For example, suppose that I ask 12 people which NFL football team they like the most, and they all say the Baltimore Ravens. Is that a statistical fluke, a fair statement about the national sentiment, or a selection effect (since all 12 people that I asked actually live in Baltimore)? The answer is probably the last of these. Okay, this example may be too obvious. So, consider the following less obvious example:

Suppose I have a fair coin (with a head or a tail being equally likely to appear when I toss the coin).

Of the following 3 sequences (each representing 12 sequential tosses of the fair coin), which sequence corresponds to a bogus sequence (i.e., a sequence that I manually typed on the computer)?

(a) HTHTHTHTHTHH

(b) TTTTTTTTTTTT

(c) HHHHHHHHHHHT

(d) None of the above.

In each case, a coin toss of head is listed as "H", and a coin toss of tail is listed as "T".

The answer is "(d) None of the above."

None of the above sequences was generated manually. They were all actual subsequences extracted from a larger sequence of random coin tosses. I admit that I selected these 3 subsequences non-randomly (which induces a statistical bias known as a selection effect) in order to try to fool you. The small-numbers phenomenon is evident here -- when only 12 coin tosses are considered, the occurrence of any "improbable result" may lead us (incorrectly) to believe that it is statistically significant. Conversely, if we saw answer (b) continuing for dozens more coin tosses (nothing but Tails, all the way down), then that would be truly significant.

So, let's try again with another sample problem (#2) in which I truly did invent one of the three sequences (i.e., a bogus sequence that I manually typed on the computer, attempting to create my own example of a random sequence). Which one of these 50-coin toss sequences is the bogus sequence?

(a) HTHHTHHTTHHTTTHTHTHHHTHTHTHHHTTHTTTHTHTHHTTHTHTHTT

(b) HHHHHHTHTHHHHHTTTHTTTTHTTHHHHTHHHHHTHTTHHHTHHHHHHH

(c) THTTTTTTHTTTTTTTTHHHTTTTHHTTTTHHHTHHTTHHTTTTTHTTHH

For the two real (non-bogus) sequences, I used a random number generator to generate each 50-toss sequence. The random number generator (common to nearly all scientific programming environments) produces a uniformly distributed random number between 0 and 1. I simply labeled the event as "H" whenever the number was 0.5 or greater, and as "T" whenever the number was less than 0.5.
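The labeling rule just described can be sketched in a few lines of Python. This is only an illustration, not the code actually used for the sequences above: the function name `toss_coins` and the `seed` parameter are hypothetical, and Python's `random.random()` (which returns a float in the range [0.0, 1.0)) stands in for whatever generator was really used.

```python
import random

def toss_coins(n, seed=None):
    """Return a string of n simulated fair-coin tosses ("H" or "T").

    Hypothetical helper: each toss draws a uniform number in [0, 1)
    and is labeled "H" if it is 0.5 or greater, otherwise "T".
    """
    rng = random.Random(seed)  # a fixed seed makes the sequence repeatable
    return "".join("H" if rng.random() >= 0.5 else "T" for _ in range(n))

# One 50-toss sequence; the seed value is arbitrary.
print(toss_coins(50, seed=2013))
```

Running it with different seeds is a quick way to see how "streaky" genuine random sequences tend to look.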

The answer to sample problem #2 is ... given at the bottom of this post (by which point you will probably have guessed it).

This topic of "fooled by randomness" came up when I was reading an article recently on the Turing Award Winners from 1966 through 2013.

This article lists many interesting statistical facts about the 61 winners of the award. The article provides a fun, interactive data visualization built with Tableau tools in which you can explore these statistical data, which include: each winner's birth year, age at time of award, nationality, gender, and... astrological sign! Being a data scientist and astrophysicist, I found the inclusion of Zodiac sign to be disconcerting. However, the author of the original post does admit that this was included jokingly.

As you look at the data, you will see that 10 of the 61 Turing Award winners were born under one specific sign of the Zodiac, and only 2 of the 61 winners were born under another sign (in fact, there are two signs with only 2 winners each). These questions then arise:

Is there significance to this apparent correlation? Is there true order here, and not randomness? Are Capricorns really five times more likely to win future Turing Awards than Scorpios?

Of course, the answer to these questions is that the distribution of astrological birth signs among the winners is just what a purely random process produces, with no astrological (or astronomical) significance whatsoever. But demonstrating this looked like a fun exercise for my random number generator once again.

So, I generated random birth months (1 through 12, corresponding equivalently to the 12 signs of the Zodiac) for 61 individuals. (For simplicity, I assumed that all birth months are equally likely, thus ignoring the differing lengths of the months.) I repeated this simulation 100,000 times (which almost certainly falls into that scientific data analysis category of "overkill").

I then examined how often, across the 100,000 simulations, each of the following apparent correlations appeared:

(1) We find 10 or more of the 61 individuals with the same birth month (astrological sign):

Answer: in 32% of the simulations

(2) We find 2 or fewer of the 61 individuals in any one of the birth months:

Answer: in 80% of the simulations

(3) We see the ratio of "maximum number of birthdays in one of the months" to "minimum number of birthdays in another month" equal to 5 or greater:

Answer: in 40% of the simulations

(4) We see the ratio of "maximum number of birthdays in one of the months" to "minimum number of birthdays in another month" equal to 4.5 or greater:

Answer: in 49% of the simulations

Therefore, it is statistically reasonable and totally expected that we would see 1 or 2 birth months that contain only two award winners. It is also statistically reasonable that we could see 5 times as many winners in the most populous month as in the least populous month. Regarding the first correlation (10 or more of the 61 individuals sharing the same birth month), 32% is a non-trivial percentage, so it is not surprising that we see it occur in real life.
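For readers who want to try this themselves, here is a minimal sketch of the kind of simulation described above. It is not the original code: the function name `simulate_birth_months`, its parameters, and the seed are hypothetical, and I am assuming that a month with zero birthdays counts as satisfying both ratio conditions (the text does not say how that case was handled).

```python
import random

def simulate_birth_months(n_trials=100_000, n_people=61, n_months=12, seed=0):
    """Estimate how often each 'apparent correlation' shows up by chance.

    Hypothetical sketch: every person gets a birth month drawn uniformly
    from n_months equally likely months, independently of everyone else.
    """
    rng = random.Random(seed)
    months = range(n_months)
    hits = {"max>=10": 0, "min<=2": 0, "ratio>=5": 0, "ratio>=4.5": 0}
    for _ in range(n_trials):
        counts = [0] * n_months
        for m in rng.choices(months, k=n_people):
            counts[m] += 1
        hi, lo = max(counts), min(counts)
        if hi >= 10:            # (1) some month has 10 or more birthdays
            hits["max>=10"] += 1
        if lo <= 2:             # (2) some month has 2 or fewer birthdays
            hits["min<=2"] += 1
        # Comparing hi with a multiple of lo avoids dividing by zero;
        # an empty month (lo == 0) is assumed to satisfy both conditions.
        if hi >= 5 * lo:        # (3) max/min ratio of 5 or greater
            hits["ratio>=5"] += 1
        if hi >= 4.5 * lo:      # (4) max/min ratio of 4.5 or greater
            hits["ratio>=4.5"] += 1
    return {name: count / n_trials for name, count in hits.items()}

# A smaller run keeps this quick; raise n_trials to 100_000 to match the text.
for name, fraction in simulate_birth_months(n_trials=20_000).items():
    print(f"{name}: {fraction:.1%}")
```

The exact fractions vary from run to run and depend on the assumptions noted above, so treat the output as a sanity check on the percentages quoted earlier rather than a reproduction of them.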

What conclusions can we draw from all of this discussion of "fooled by randomness"? What are the traps that we can fall into?

We often tend to pick out and focus on the "most interesting" results in our data, and ignore the uninteresting cases. This is selection bias, and also is an example of "a posteriori" statistics (derived from observed facts, not from logical principles).

It is easy to be fooled by randomness, especially in our rush to build predictive analytics models that actually predict interesting outcomes.

This is similar to the birthday paradox (in which the likelihood that at least two people in a group share a birthday is approximately 50% when there are only 23 people in the group). This 50-50 break point occurs at such a small number because every pair of people is a chance for a match (23 people already form 253 pairs), so as you increase the sample size, it becomes harder and harder to avoid a shared birthday (i.e., a repeating pattern in random data).
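The 23-person figure is easy to check with a short calculation. The sketch below uses the standard simplifying assumptions (365 equally likely birthdays, no leap years, independent people); the function name `prob_shared_birthday` is just an illustrative choice.

```python
def prob_shared_birthday(n, days=365):
    """Probability that at least two of n people share a birthday.

    Assumes all `days` birthdays are equally likely and independent.
    Works by computing the chance that all n birthdays are distinct.
    """
    p_all_distinct = 1.0
    for k in range(n):
        # Person k+1 must avoid the k birthdays already taken.
        p_all_distinct *= (days - k) / days
    return 1.0 - p_all_distinct

for n in (10, 22, 23, 30, 50):
    print(n, round(prob_shared_birthday(n), 3))
```

The probability first crosses one half between 22 and 23 people, which is the break point mentioned above.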

Humans are good at seeing patterns and correlations in data, but correlation does not imply causation.

The bigger the data set, the more likely you are to see an "unlikely" pattern!

What we see in the Turing Awards data is evidence of a "small-numbers phenomenon."

When asked to pick out the "random" distribution that was generated by a human (versus a distribution generated by an algorithm), we tend to confuse "randomness" with the "appearance of randomness". A distribution may appear to be more random when in fact it is less random, because it has an unrealistically small variance: lots of non-repeating values, but few long repetitions (i.e., we forget to take into account the long tail of the distribution). For example, in sample problem #1 above, the answer (b) sequence of 11 T's following the initial T has a probability of 1 part in 2^11 (about once in every 2048 twelve-toss subsequences), which is rare, but it still occurred in my real experiment!
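The 1-in-2^11 figure can be verified with exact fractions. This small sketch assumes a fair coin with independent tosses; the variable names are illustrative only.

```python
from fractions import Fraction

p_toss = Fraction(1, 2)  # chance of any single outcome on a fair coin

# The first toss can be anything; the next 11 must all match it.
p_eleven_match_first = p_toss ** 11

# One specific 12-toss string (e.g. exactly twelve T's) is half as likely.
p_specific_sequence = p_toss ** 12

# "All heads" or "all tails" together give the same 1 in 2^11.
p_all_same = 2 * p_specific_sequence

print(p_eleven_match_first, p_specific_sequence, p_all_same)
```

Note that every specific 12-toss string, including the "random-looking" ones, has the same 1-in-4096 chance; the all-T string just happens to be the one we notice.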

So, this brings us back to our sample problem #2, whose correct answer is: (a).

If that answer surprises us, it is because when we generate random sequences manually (without the aid of an objective unbiased algorithm), or when we try to judge if a data string is a random sequence, we are prone to falling into some of the traps listed above.
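One simple way to make the "too few long repetitions" symptom visible is to measure the runs of identical tosses in each sequence from sample problem #2. The sketch below does that with the three sequences copied from above; the helper names `run_lengths` and `longest_run` are hypothetical, and a run count is only a rough diagnostic, not a formal test of randomness.

```python
from itertools import groupby

def run_lengths(seq):
    """Lengths of consecutive runs of identical symbols, in order."""
    return [len(list(group)) for _, group in groupby(seq)]

def longest_run(seq):
    """Length of the longest run of identical symbols in seq."""
    return max(run_lengths(seq))

# The three 50-toss sequences from sample problem #2.
sequences = {
    "a": "HTHHTHHTTHHTTTHTHTHHHTHTHTHHHTTHTTTHTHTHHTTHTHTHTT",
    "b": "HHHHHHTHTHHHHHTTTHTTTTHTTHHHHTHHHHHTHTTHHHTHHHHHHH",
    "c": "THTTTTTTHTTTTTTTTHHHTTTTHHTTTTHHHTHHTTHHTTTTTHTTHH",
}

for label, seq in sequences.items():
    runs = run_lengths(seq)
    print(f"({label}) longest run: {max(runs)}, number of runs: {len(runs)}")
```

Sequence (a) never repeats a toss more than 3 times in a row, while (b) and (c) contain runs of 7 and 8: the hand-typed sequence switches back and forth far more often than the two generated ones.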
