The Ratiocinator: Information through the Lens of Computation: Algorithmic Information Theory

Up to now, previous discussions about information and computation have been rather informal -- I have neglected to give detailed mathematical introductions to concepts like Shannon entropy, Turing universality, codes etc. for the simple reason that for our purposes, a qualitative understanding has been fully sufficient.
However, if we want to make any quantitative statements about the notions of information and computation, it will benefit us to take another look at the subject in a slightly more formal context.
To this end, the most efficient framework seems to me to be that of algorithmic information theory, introduced and developed mainly by the mathematicians Ray Solomonoff, Andrey Kolmogorov, and Gregory Chaitin in the 1960s. According to Chaitin, AIT is "the result of putting Shannon's information theory and Turing's computability theory into a cocktail shaker and shaking vigorously"; as such, it will aid us in achieving a better understanding, in a quantitative way, of both subjects.
In a way, AIT is concerned with a concrete realisation of the somewhat abstract-seeming notion of information or information content within the realm of computation. To this end, its central concept is the so-called Kolmogorov complexity, which, loosely speaking, provides a measure of the information content of an object -- say, a string of binary digits -- by considering the shortest program, on a given computer, that outputs this object. This might seem at first like a rather arbitrary definition: different computers generally will have different programs, of different lengths, that generate the string, leading to different complexities for one and the same object. But it turns out that despite this ambiguity, Kolmogorov complexity is useful independent of the concrete computational implementation it is based on.
First, some technicalities. We fix a notion of computer as a machine that takes as input binary strings, and outputs them, as well -- a computer thus is a (partial) function from binary strings to binary strings (where 'partial' means that not for all possible inputs, an output necessarily exists). We require that the admissible binary strings have the property that no valid string x is a prefix of another valid string y -- i.e. that y can't be written as xz, obtained from x through extending it with another string z. This ensures that programs are self-delimiting: when a machine has received the string x, it 'knows' that the input is complete, and there is thus no need for a special stop-character. Such machines are sometimes called prefix-free, self-delimiting, or simply Chaitin machines. There are universal Chaitin machines, so we can do everything with them we can do with any other concrete realisation of computation.
Let now U be such a universal Chaitin machine. We write the action of U on a string p, the program, as U(p) (a program, here, comes equipped with all its input). If U(p) is defined, the result will be another string, the output x, and we write U(p) = x. We denote the length of a string with vertical bars, such that |x| = n means that the string x is n bits long.
This is enough already to define Kolmogorov complexity formally. The Kolmogorov complexity K_U(x) of a given string x relative to a given universal Chaitin machine U is:

K_U(x) = min {|p|: U(p) = x},

which means nothing but the shortest (minimal) program p that, if executed on U, produces x as an output.
Thus, Kolmogorov complexity measures the power of a given Chaitin machine to compress a certain string. This has an intuitive connection with the notion of information content: the more information a string x contains, the harder it should be to compress it, as there is less redundancy to be exploited for the compression process. There are also obvious connections to a notion of 'orderliness': very highly ordered strings can be very highly compressed, while strings showing little or no order are barely compressible.
We've already met the precursor to these ideas, when I related the anecdote of Leibniz and the ink blots, back in the first post: a collection of ink blots on a piece of paper can be said to follow a law if there exists a simple mathematical description for their distribution; otherwise, it must be judged random.
Kolmogorov complexity can be used to formalize this: a string is random if there exists no program of significantly smaller length producing this string, i.e. if K_U(x) ≈ |x|.
But there's still the issue to consider that as defined up to now, Kolmogorov complexity only considers one specific universal machine as reference -- so it seems like it might well be that results obtained for one machine U might differ greatly from results obtained for a different machine V!
Luckily, however, that turns out not to be so. A celebrated result in AIT is the so-called invariance theorem, which roughly states that no matter what machine you choose to concretely evaluate the Kolmogorov complexity, the obtained results will hold for all possible machines -- at least qualitatively.
Concretely, by universality, we know that any given universal machine can emulate any other. In our concrete formalization, there exists a string v, such that the machine U, given v and a program, will execute the same computation as V, if V is given that program. Formally: if V(p) = x, then there exists v such that U(vp) = x (here vp simply means the concatenation of the strings v and p; note that this entails that v alone is not a valid string for U). But this means that Kolmogorov complexities relative to U and V can differ from one another at most by the length of the 'simulating' string v (there are typically many such strings; we'll just take the shortest)! Formally, there is a constant c = |v| such that: K_U(x) ≤ K_V(x) + c. Often, results in AIT will only depend on the asymptotic behaviour, rather than the concrete value, of K; thus, we can ignore the constant, which is always small in the asymptote, and write independently of any machine implementation K(x) for the Kolmogorov complexity of some string x.
In any case, the exact value of K(x) is in general rather immaterial, as K(x) is not a computable function. To see this, imagine you have a machine that, upon being given a string x, outputs K(x). Now, we can write a program that takes as input a natural number n, and produces a string with complexity greater than n, simply by trying every string in succession until it finds one. This program has a constant length c, plus the length of its input, |n|. So, we have a program of complexity c + |n|, which outputs a number of complexity -- as stipulated -- n. But this is a contradiction, since evidently, there exists a value for n such that this program is shorter than the shortest one possible, which is of length n, since n grows much faster than |n|!
Or in other words, if K(x) were computable, we could find a string of complexity n -- i.e. a string for which the shortest program that computes is is of n bits length --, and create a program that computes this number, whose length is |n| + c < n bits -- shorter than the stipulated shortest possible program -- an obvious contradiction. Hence, K(x) can't be computable.
This is related to Berry's paradox: n is the smallest number not definable in under sixty-six symbols. Yet it's just been defined using only sixty-five (counting blanks as a symbol)!

The Number of Wisdom
Let us now focus our attention on a curious object, constructed as follows: for some universal machine, if p denotes a program that eventually terminates and produces an output (rather than, say, loop forever), form 2^-|p|; then, add up all these factors for all halting programs. 2^-|p| is the probability of hitting on the program p randomly, through a series of coin tosses (recall that, since we're working with self-delimiting programs, the coin tossing sequence automatically stops when you find a valid program); the sum of all these probabilities is thus the probability that a given machine, if fed a random program, halts. Thus, it is fittingly known as the halting probability; other names include Chaitin's constant (though since every machine has its own unique halting probability, calling it a 'constant' is a bit of a misnomer), or simply Ω. Formally:

Ω = Σ2^-|p|,

where the sum is taken over all halting programs.
This might seem at first to be a rather baroque construction of dubious significance, but it has some quite remarkable properties:

It is algorithmically random -- meaning that the Kolmogorov complexity of the first n bits of its binary expansion is at least n - c, i.e. it is asymptotically incompressible -- no program significantly shorter than n bits can compute significantly more than its first n bits
It is, however, left-computably enumerable (left-c.e.), i.e. there exists a program enumerating the set of rational numbers q where q ≤ Ω
Knowledge of the first n bits of its binary expansion enables deciding the halting problem for all programs up to n bits in size

Also, it can be shown (see here) that all random left-c.e. real numbers are halting probabilities.
It follows immediately from these points that there exists no axiom system that can derive the value of more than finitely many bits of Ω's binary expansion -- since if there were, such a system could be automated in the form of a Turing machine, which then would be able to compute Ω with arbitrary precision.
This is actually a surprisingly deep statement -- it is formally equivalent to the better known concept of Gödel incompleteness, which roughly states that for any axiomatic system S of sufficient complexity, there exists a sentence G, expressible in the system, which asserts 'G can't be proven in S', leading to an obvious paradox: if G can be proven in S, then S is inconsistent, since it proves a falsehood; if G can't be proven in S, then it is incomplete, since there are truths that it is incapable of proving.
The existence of bits in the binary expansion of Ω whose value a given axiomatic system can't determine similarly exhibits that system's incompleteness; this result, in slightly different form, is consequently known as Chaitin's incompleteness theorem.
This points towards a connection between randomness and incompleteness: if there were a system that could derive arbitrarily many bits of Ω, or equivalently, a machine that could compute Ω to arbitrary accuracy, then Ω wouldn't be random -- since the amount of bits needed to specify the system would then be the consequently bounded Kolmogorov complexity of Ω.
One should take note of the fact that both incompleteness results crucially depend on self-reference: the Gödel sentence is essentially a formalization of the liar paradox, i.e. the impossibility to assign a consistent truth value to the sentence 'this sentence is false' -- if it is false, it is true, however, if it is true, it is false --, while Chaitin incompleteness depends on the uncomputability of Kolmogorov complexity, which can be traced back to the aforementioned Berry paradox.
Alternatively, it can be seen to follow directly from the fact that if there were an axiomatic system able to enumerate Ω to arbitrary accuracy, this system could be used to solve the halting problem -- which of course is known to be unsolvable.
Which is a shame, really -- and here we get to the reason why Ω is also sometimes called 'the number of wisdom --, because if one could solve the halting problem, the solution to lots of mathematics' most famous problem would instantly become trivial. Take Goldbach's conjecture: Every even integer greater than two can be expressed as a sum of two primes.
We could imagine writing a simple program that, for every even integer, simply checks whether or not it is expressible as a sum of two primes -- by simply doing a brute-force test, i.e. checking the sum of all primes less than that number. Clearly, if Goldbach's conjecture is false, it will eventually find such a number, and then terminate. However, if it is true, then it will never halt, but continue searching forever.
Now, if one had a way to solve the halting problem -- i.e. for any arbitrary program, find out whether or not it will eventually halt --, then one could just write the program that checks for Goldbach numbers, run it through the halting checker, and instantly know the truth (or falsity) of Goldbach's conjecture: if the halting checker proclaims that the Goldbach checker halts, then Goldbach's conjecture is false -- there exists an even number greater than two such that there is no way to express it as the sum of two primes. However, if the halting checker judges the Goldbach checker to be non-halting, then Goldbach's conjecture is true.
If this has a sort of eerie feel to it, it's well justified: the program that checks whether or not there is a counterexample to Goldbach's conjecture is never run, yet we know what would have happened if it had been run. This is a strange way of coming into knowledge -- it seems as if information had been created out of nothing.
The most curious property of Chaitin constants then is that knowing Ω entails exactly the ability to do the above -- a halting probability serves as an oracle for the halting problem. How is that possible?
Well, assume you knew the first n bits of some machine's halting probability. Then, on that machine, you run in sequence of ascending length every possible program up to n bits length, and record those that halt. For each halting program p, you add the factor 2^-|p|. Once your total equals the number corresponding to the first n bits of Ω, you know the programs that haven't halted up to now, never will halt, as they would add a contribution to the total that would make it exceed Ω.

The Takeaway
In this post, we have made contact with a number of results that we will meet again in the (hopefully) not too distant future. First, we have found a formalization of the notion of randomness, and a connection between randomness and the notions of incompleteness and uncomputability. Secondly, and more specifically, we have learned that a certain construction of a random number, Ω, can, due to incompleteness, be known to only finite accuracy by computable means. Thirdly, all (computably enumerable) random real numbers are Ω-numbers. And finally, knowing Ω to arbitrary precision entails being able to solve the halting problem.
These results interlock in interesting ways: if Ω were computable, it wouldn't be random; if the halting problem were solvable, Ω would be computable; if there were no incompleteness, the halting problem would be solvable, etc. Somewhere at the heart of all this, lies the concept of self-reference -- though to bring it out clearly into the open will require some more work.

Sonntag, 18. September 2011

Information through the Lens of Computation: Algorithmic Information Theory

Keine Kommentare:

Kommentar veröffentlichen