A Complexity Profile is
probably the most important result of a QCM analysis, so interpreting it
correctly matters a great deal.
Before doing so, it helps to consolidate a few basic concepts.
There are two types of variables in a system:
- Inputs
- Outputs
These can also be classified into two other categories:
- Controllable
- Uncontrollable
There are different situations that one can be confronted with:
- Variables are only inputs (e.g. accelerator pedal angle)
- Variables are only outputs (e.g. stock values, survey results)
- Both inputs and outputs are present
What is Complexity?
Complexity is a measure of how much information a system “contains” and
how much this information is structured. One could simply sum up the
Shannon entropies of each variable and conclude that this is the total
amount of information in a system. However, because variables can be
correlated, they give rise to structure. Structure means the system can
“do more” and, potentially, perform new functions. Structure is present
everywhere in Nature. More structured information means more
correlations within the system. Critical complexity measures how much
information a system can contain before it starts to lose structure
(i.e. before this information becomes meaningless). Complexity is
measured in bits, since information is measured in bits.
The importance of
structure is paramount. An analogy: the mass of an atom’s nucleus is
less than the sum of the masses of its components. This is because the
energy going into the various bindings has an equivalent in terms of
mass (m=E/c^2). It is this amount that is “lost” when measuring the mass
of the nucleus as a whole. The same holds for complexity: it measures the
information within a system based not only on the sum of the Shannon
entropies of the individual variables but also on the “bindings”
between the variables. In other words, structure also carries
information, not just the individual variables.
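To make the idea of “bindings” concrete, here is a minimal Python sketch. It is not the QCM measure itself, which is not spelled out in this post. It uses a standard information-theoretic stand-in, total correlation (the sum of the individual Shannon entropies minus the joint entropy), and the toy variables a, b and c are hypothetical.

```python
import math
from collections import Counter


def shannon_entropy(values):
    """Shannon entropy, in bits, of a sequence of discrete symbols."""
    counts = Counter(values)
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


def total_correlation(columns):
    """Sum of the per-variable entropies minus the joint entropy (bits).

    Assumption: this stands in for the information held in the
    "bindings" between variables; it is not the QCM formula.
    """
    marginal_sum = sum(shannon_entropy(col) for col in columns)
    joint = shannon_entropy(list(zip(*columns)))
    return marginal_sum - joint


# Hypothetical toy data: b copies a, c is independent of both.
a = [0, 0, 1, 1, 0, 0, 1, 1]
b = list(a)
c = [0, 1, 0, 1, 1, 0, 1, 0]

naive_sum = sum(shannon_entropy(v) for v in (a, b, c))  # ignores structure
shared_ab = total_correlation([a, b])   # a and b are bound together
shared_ac = total_correlation([a, c])   # a and c are not
```

In this toy case each variable carries one bit on its own, a and b share one bit of structure, and a and c share none.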
Complexity is like
energy. The more energy one has, the more can be turned into work in order to
accomplish something. More complexity means more information, and more
information means that more can be accomplished.
What does the Complexity
Map show? It shows which groups of variables vary together. It does NOT
indicate whether A causes a variation in B or vice versa; it simply
shows how variables are grouped when they change. In other words, “when
variable A varies, B also varies” is all that can be said, unless
one knows specifically that a certain variable is independent and
controllable and that its variations are intended.
A Complexity Profile (CP), also called a Complexity Spectrum, shows how much information is “removed” from a system (a multi-dimensional data array) if a particular variable is removed. The measurement is given as a percentage, and the contributions to a Complexity Profile are ranked in descending order. When a variable is at the top of the CP, it does not necessarily mean that it is the most important one or that it dominates/controls the system in question. This is ONLY true if the variable is an input.
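The leave-one-out idea behind the profile can be sketched as follows. This is only an illustration under stated assumptions: the `measure` argument stands in for the actual complexity measure, which this post does not define, and the demo uses total correlation on hypothetical toy data. Expressing the loss as a percentage of the value for the full data set is also an assumption.

```python
import math
from collections import Counter


def entropy_bits(values):
    """Shannon entropy, in bits, of a sequence of discrete symbols."""
    counts = Counter(values)
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


def total_correlation(columns):
    """Stand-in measure (assumption): marginal entropies minus joint entropy."""
    if not columns:
        return 0.0
    joint = entropy_bits(list(zip(*columns)))
    return sum(entropy_bits(col) for col in columns) - joint


def complexity_profile(data, measure=total_correlation):
    """Rank variables by the percentage of `measure` lost when each is removed.

    `data` maps a variable name to its list of samples.
    """
    full = measure(list(data.values()))
    profile = []
    for name in data:
        rest = [col for other, col in data.items() if other != name]
        reduced = measure(rest)
        loss = 100.0 * (full - reduced) / full if full else 0.0
        profile.append((name, loss))
    # Descending order, as in a Complexity Profile.
    return sorted(profile, key=lambda item: item[1], reverse=True)


# Hypothetical toy data: b copies a, c is independent of both.
toy = {
    "a": [0, 0, 1, 1, 0, 0, 1, 1],
    "b": [0, 0, 1, 1, 0, 0, 1, 1],
    "c": [0, 1, 0, 1, 1, 0, 1, 0],
}
profile = complexity_profile(toy)
```

In this toy case a and b tie at the top and c contributes nothing, yet nothing in the numbers says whether a drives b or the reverse.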
When the first variable
in a CP is removed, all one can say for sure is that the data
set without that variable will experience the largest possible loss of
information. The fact that a variable lies at the top of the CP does not
automatically mean that it drives the business. Why is that the case?
The first important step in a complexity analysis of any system is the
synthesis of a meaningful data set. If you put in garbage, the results
will be degraded in proportion to the share of garbage in the data. It is
up to the user to collect meaningful data that correctly covers the
problem at hand, rather than gathering data indiscriminately. Therefore,
if you are completely sure that your data is correct and meaningful
(i.e. is of high quality), then indeed
the CP provides a ranking of the variables in terms of how much information each variable contributes to the whole picture.
But what does that mean physically? It means that the variable in question varies a lot AND that it does so in unison (i.e. with structure) with numerous other variables.
The CP, therefore, is an
objective way of ranking (weighing) variables because it ranks them
by how much information they carry, not by a subjective
perception of importance.
And what is meant by high quality data?
- A sufficient number of samples (fewer than 10 is generally not a good way to start)
- No outliers (a few very remote points can skew the results considerably)
- The data array is well populated (i.e. has high density, or a relatively small fraction of null entries)
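The three criteria above can be checked with a short script before running an analysis. The sketch below is a rough illustration, not part of any QCM tool: the minimum of 10 samples comes from the list above, while the 10% limit on null entries and the outlier rule (a modified z-score based on the median absolute deviation, with a cutoff of 3.5) are assumptions chosen for the example, as are the function name and the sample data.

```python
import statistics


def data_quality_report(columns, min_samples=10,
                        max_null_fraction=0.1, outlier_cutoff=3.5):
    """Check sample count, null density and outliers.

    `columns` maps a variable name to a list of numbers, with None
    marking a null entry. All thresholds are illustrative assumptions.
    """
    n_rows = max(len(col) for col in columns.values())
    cells = [v for col in columns.values() for v in col]
    null_fraction = sum(v is None for v in cells) / len(cells)

    outliers = {}
    for name, col in columns.items():
        present = [v for v in col if v is not None]
        med = statistics.median(present)
        mad = statistics.median([abs(v - med) for v in present])
        if mad == 0:
            continue  # no spread around the median; rule not applicable
        flagged = [v for v in present
                   if 0.6745 * abs(v - med) / mad > outlier_cutoff]
        if flagged:
            outliers[name] = flagged

    return {
        "enough_samples": n_rows >= min_samples,
        "null_fraction": null_fraction,
        "well_populated": null_fraction <= max_null_fraction,
        "outliers": outliers,
    }


# Hypothetical data: one remote point in "price", one null in "volume".
sample = {
    "price": [10.0, 10.2, 9.9, 10.1, 10.3, 9.8,
              10.0, 10.1, 9.9, 10.2, 10.0, 55.0],
    "volume": [1.0, None, 1.2, 1.1, 0.9, 1.0,
               1.3, 1.1, 1.0, 1.2, 0.8, 1.1],
}
report = data_quality_report(sample)
```

A median-based rule is used here because, with only a dozen samples, a single remote point inflates the standard deviation enough to hide itself from an ordinary z-score test.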
Therefore, if a variable
lies in the upper part of the CP and it is a controllable input to your
system then indeed it is an important business driver.
What about outputs? What
if you have, say, N stocks, and therefore N observable outputs from a
system (a stock exchange)? How is the CP to be interpreted then? The
comment above on what the CP ranking means physically still holds. But
can anything else be said in such a case? Probably yes.
A common question people
formulate (even though we think this is not a good question to ask) is
that of causality. If A and B vary together, is it A that causes the
variation in B or vice-versa? This question is very difficult to answer
(unless one has “insider” information). It is one of those questions
that have no answer and that are useless to ask (is pizza better than
spaghetti?). However, the Complexity Profile can help.
Let us look at an example, the DJIA Index. The Complexity Map is illustrated below.
The corresponding Complexity Profile is:
This is a case in which it is impossible, for example, to say whether it is
the price of Home Depot stock that drives the price of Citigroup stock
or vice versa. What does “to drive” mean? The relationship in
question is shown below:
What really drives both stocks is the market, but
that cannot be measured easily. So, what we can do is assume that if
two variables co-vary (vary together), the one with the higher CP
contribution “drives” the other. In this case we could say that
Citigroup “dominates” Home Depot. It is very difficult to disprove such a
statement (unless one has privileged information or the data has
been manipulated).
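The rule of thumb just described can be written down in a few lines. The sketch below is a hedged illustration only: it uses Pearson correlation with an arbitrary 0.5 threshold as a stand-in for "varying together" (the Complexity Map may group variables differently), and the variable names, series and CP percentages are invented for the example, not taken from the DJIA analysis.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length series."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    if sx == 0 or sy == 0:
        return 0.0  # a constant series cannot co-vary with anything
    return cov / (sx * sy)


def dominant_of_pair(name_a, name_b, series, cp_percent, min_abs_corr=0.5):
    """Return the co-varying variable with the higher CP contribution.

    Returns None when the pair does not co-vary strongly enough
    (threshold is an assumption) or when the contributions are equal.
    """
    r = pearson(series[name_a], series[name_b])
    if abs(r) < min_abs_corr:
        return None
    if cp_percent[name_a] == cp_percent[name_b]:
        return None
    return name_a if cp_percent[name_a] > cp_percent[name_b] else name_b


# Hypothetical series and CP contributions (percent), for illustration only.
series = {
    "x": [1.0, 2.0, 3.0, 4.0, 5.0],
    "y": [2.1, 3.9, 6.2, 8.0, 9.9],
    "z": [3.0, 1.0, 4.0, 1.0, 5.0],
}
cp_percent = {"x": 12.0, "y": 7.5, "z": 9.0}

xy = dominant_of_pair("x", "y", series, cp_percent)  # x and y co-vary
xz = dominant_of_pair("x", "z", series, cp_percent)  # x and z barely do
```

Note that the function only applies the assumption stated above; it says nothing about whether the "dominant" variable actually causes the other to move.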
In the case in question we could say that Citigroup
dominates the DJIA Index, even though market capitalization or stock
value might hint at something different. In summary, we could conclude that
a Complexity Profile may help solve the eternal issue of causality
(which seems to trouble humanity so much).