Picture a world in which there are only two things, and they're both identical -- let's say two uniform spheres of the same size and color, with no other distinguishable properties.

Now, ask yourself: How do you know there are two of them? (Apart from me telling you there are, that is.)

Most people will probably answer that they can just count the spheres, or perhaps that one is 'over there' while the other is 'right here' -- but that already depends on the introduction of extra structure, something that allows you to say: "This is sphere number 1, while that is sphere number 2". Spatial separation, or the notion of position, is such extra structure: each sphere, in addition to having a certain size and color, now also has a definite position -- a new property. But we said at the outset that the spheres don't have any properties beyond size and color. So, within those constraints, can you tell how many spheres there are?

The answer is, somewhat surprisingly, that you can't. In fact, you can't even distinguish between universes in which there is only one sphere, two identical ones, three identical ones etc. There is no fact of the matter differentiating between the cases where there are one, two, three, etc. spheres -- all identical spheres are thus essentially one and the same sphere.

This is what Leibniz (him again!) calls the identity of indiscernibles: whenever two objects share all the same properties, they are in fact one and the same object.

Now consider the same two-sphere universe, but with one sphere painted black. Suddenly, the task of determining how many spheres there are becomes trivial! There are two: the one that's been painted black, and the one that hasn't. But how has this simple trick turned the problem from impossible to child's play?

The answer is, of course, that now the two spheres are discernible. Now there exists a property that one sphere has, but the other sphere lacks: being black.

Now, what is this good for? Well, the interesting thing is that once you have distinguishable spheres (or distinguishable anythings, really), you can do stuff with them -- for instance, you can answer questions: the black sphere for no, the non-black sphere for yes. You can signal choices. Make decisions. If you work out a clever scheme, you can even communicate: a sequence of five spheres could correspond to a letter or another orthographical sign -- writing b for black and n for not black, nnnnb could stand for 'a', nnnbn for 'b', nnnbb for 'c', and so on. For instance, nbnnn, nnbnb, nbbnn, nbbnn, nbbbb would mean 'hello' (I've just added the commas to improve legibility; they're not actually necessary, since you could simply chop the sequence into groups of five). Though cumbersome, anything you could wish to communicate can be communicated this way.
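This sphere code is easy to sketch in Python -- here's a minimal version of the scheme just described, where each letter's position in the alphabet (a = 1, ..., z = 26) is written as five binary digits, b for black and n for not black:

```python
def encode(word):
    """Encode a lowercase word as b/n patterns of five 'spheres' each."""
    patterns = []
    for ch in word:
        pos = ord(ch) - ord('a') + 1   # a -> 1, b -> 2, ..., z -> 26
        bits = format(pos, '05b')      # position as five binary digits
        patterns.append(bits.replace('1', 'b').replace('0', 'n'))
    return patterns

print(encode('hello'))  # ['nbnnn', 'nnbnb', 'nbbnn', 'nbbnn', 'nbbbb']
```

Running it on 'hello' reproduces exactly the sequence of patterns given above.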

And that's not the end of it -- utilising clever enough rules, you can use the spheres to make logical deductions, carry out computations, and so on; in fact, it's easy to see that one could use them to simulate one of the cellular automata I mentioned in the previous post, in particular a computationally universal one -- which means that any computation that can be carried out at all can be carried out with these spheres.

All this -- communication, computation, decision making, etc. -- becomes possible only because the spheres have been made distinguishable from one another. Really, a difference that makes a world of difference!

Of course, most readers have probably spotted that the story so far has really been about bits, and hence, about information. In fact, this is where the title comes from -- Gregory Bateson, a noted pioneer of systems theory and cybernetics, characterized information as "a difference that makes a difference".

At first glance, this seems to have little to do with how we define 'information' in our everyday lives. When we talk about 'getting information' about something, we usually mean acquainting ourselves with relevant facts. I'll get back later to what this has to do with 0s and 1s, or black and non-black spheres.

First, we must consider the question that stood at the origin of modern information theory: given a certain signal, how can we tell how much information is contained within it? Not knowing the code, it is not at all obvious that nbnnn, nnbnb, nbbnn, nbbnn, nbbbb and 'hello' contain the same information -- one is five times as long as the other, for starters.

To this end, we will need two notions: compressibility and predictability. It's clearest to just give examples: a highly uniform string (of symbols), such as aaaaaaaaaaaaaaaaaaaa, is highly compressible -- you could just as well give me the description '20 times a', or '20a', and I would get all the information that was contained in the original. On the other hand, a random, highly diverse string of the same length, such as wtokngmnoiakahklaijt, is not compressible in any obvious way.

The reason for this difference in compressibility lies in the difference in predictability between the two strings: for the first, predicting the next symbol on the basis of the previous ones becomes easier the more of the string you read; for the second, prediction is impossible (this, in fact, can be taken as a definition of randomness: a sequence of symbols is random if it is impossible to predict the next symbol based on only knowing the previous ones). The first string thus can be given a short description easily, while for the second, every description that contains the same information will on average be at least as long as the string itself. (Note the resemblance here to the story of Leibniz and the inkblots.)

Strings of symbols that can't be compressed very much, i.e. that are very unpredictable, are also said to have high entropy; conversely, highly redundant, compressible strings have low entropy.
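You can see this difference with any general-purpose compressor -- here's a quick illustration using Python's standard zlib module (the particular strings and lengths are just for demonstration):

```python
import os
import zlib

uniform = b'a' * 1000     # highly predictable, low entropy
noisy = os.urandom(1000)  # unpredictable bytes, high entropy

# The compressor exploits predictability: the uniform string collapses
# to a handful of bytes, while the noisy one barely shrinks at all.
print(len(zlib.compress(uniform)))  # a few dozen bytes at most
print(len(zlib.compress(noisy)))    # roughly as long as the input
```

The uniform string compresses to a tiny fraction of its size; the random one, essentially not at all.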

Another notion we will need is that of a code: basically, a code is a map between certain sets of symbols. We have already encountered one instance of a code, which mapped letters to patterns of black and non-black spheres, or alternatively, bits. Generally, the symbols on one 'side' of the code are assumed to be understood, to have a certain meaning, so that using the code, a meaningless string of symbols can be translated into a meaningful one (though how one string of symbols can be meaningful at all is a very difficult question, which we shall ignore for the time being). That way, the meaningless bs and ns of the previous example get mapped to the meaningful 'hello'. This is how 0s and 1s, strings of symbols, etc. connect to the everyday notion of information: applying a suitable code, such entities can be converted into something that has meaning to us -- such as sentences in English, for example. Thus, the notion of information we've been using so far is really related to the potential of meaningful information that a string can carry -- the amount of meaningful information that can be extracted using a suitable code.
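Concretely, decoding is just applying the inverse of that map. Here's a small Python sketch of the inverse of the sphere code described earlier, taking b/n patterns back to letters:

```python
def decode(patterns):
    """Map b/n patterns (the sphere code) back to lowercase letters."""
    word = ''
    for p in patterns:
        bits = p.replace('b', '1').replace('n', '0')  # b -> 1, n -> 0
        word += chr(int(bits, 2) + ord('a') - 1)      # 1 -> a, 2 -> b, ...
    return word

print(decode(['nbnnn', 'nnbnb', 'nbbnn', 'nbbnn', 'nbbbb']))  # hello
```

Applied to the meaningless bs and ns of the earlier example, it yields the meaningful 'hello'.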

Codes are generally not unique -- I could invent another mapping that takes the same meaningless string to a different meaningful one, so the meaningless string on its own does not suffice to determine both the code and the meaningful string it is intended to map to. This is the reason ancient, extinct languages are impossible to translate without some point of reference, such as the Rosetta Stone.

The key realisation now is this: for most codes, strings that have a high entropy carry more information than strings of the same size that have a lower entropy.

The reason lies in the notion of predictability. First, let's go back to the spheres. One of the first things we discovered distinguishable spheres let us do was answer questions. How did that work again? Well, to answer 'no', one could just show the black sphere; to answer 'yes', show the non-black one instead. The distinguishability of the spheres thus lets us decide between alternatives: yes or no, non-black or black, 1 or 0. Now, any kind of message can in principle be rephrased as a set of (yes/no) questions to answer. In fact, that's basically what our simple code does -- the first spot represents the answer to the question "Is the letter found in the first fifteen letters of the alphabet?" (n for yes, b for no), the second spot answers "Of the remaining letters, is the sought one in the first seven?", and so on, until eventually a single letter is picked out.

It's like the game 'twenty questions': you successively eliminate possibilities until at the end, only one remains. The question of how much information is contained in a message thus becomes the question of how many (yes/no) questions I can answer using it.
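The elimination game is easy to simulate -- here's a Python sketch that repeatedly halves a candidate set until one item remains (the split points here are illustrative, not the exact questions of the sphere code above):

```python
import string

def questions_needed(candidates, target):
    """Count the yes/no questions needed to single out target,
    halving the candidate set at each step."""
    asked = 0
    while len(candidates) > 1:
        half = len(candidates) // 2
        first, rest = candidates[:half], candidates[half:]
        candidates = first if target in first else rest
        asked += 1
    return asked

print(questions_needed(list(string.ascii_lowercase), 'h'))  # 5
```

Five questions suffice for any of the 26 letters, matching the five spheres per letter in our code.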

It's easy from here to arrive at a precise quantification of information content: optimally, every answered question can cut the number of possibilities in half. So, with one answered question, you can decide between two alternatives; with two answered questions, between four; three answers allow you to uniquely pick out one of eight objects, and so on -- generally, n answers -- n bits -- allow you to distinguish between 2^n objects. Thus, in the game of 'twenty questions', the twenty answers allow you to distinguish between 2^20 possibilities -- which works out to a staggering 1,048,576!

Thus, any string that can decide between 1,048,576 alternatives is said to have an information content of 20 bits, or, more generally, if it can decide between n alternatives, its information content is log_2(n) (where log_2 is the base-2 logarithm, i.e. the inverse of the exponentiation operation 2^n).
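For concreteness, these are just base-2 logarithms, which Python provides directly:

```python
import math

# Information content, in bits, of a choice among n equally likely alternatives
print(math.log2(2))          # 1.0  -- one yes/no question
print(math.log2(1_048_576))  # 20.0 -- the game of 'twenty questions'
print(math.log2(26))         # about 4.7 -- picking one letter of the alphabet
```

Note that log_2(26) is not a whole number; five questions (bits) are enough for a letter, with a little room to spare.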

With every answered question, you learn something new about the sought letter, person, or object, and thus, about the content of the message.

In this way, the communication of a message can be recast as a question-answering process: the more questions are answered, the more detailed the message can be.

Now, for the string consisting of just twenty repetitions of the letter a, you learn nothing new with each symbol you read, no further questions are answered, since you could have predicted it with certainty in advance -- but with the random string, since every new symbol is unexpected, you gain some knowledge with each one, and get new answers. A redundant string can't answer many questions, while a random one can -- in fact, it can answer the maximum number of questions possible for a string of its length (written with a particular set of symbols); its entropy, and hence its information content, is maximal.
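This can be made quantitative with Shannon's entropy formula. The sketch below computes the entropy of a string's single-symbol frequencies -- a crude stand-in for full predictability, since it ignores patterns across symbols, but enough to separate our two examples:

```python
import math
from collections import Counter

def entropy_per_symbol(s):
    """Shannon entropy (bits per symbol) of a string's letter frequencies."""
    counts = Counter(s)
    n = len(s)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

print(entropy_per_symbol('a' * 20))                # 0.0 -- fully predictable
print(entropy_per_symbol('wtokngmnoiakahklaijt'))  # several bits per symbol
```

The repetitive string scores exactly zero; the diverse one scores several bits per symbol.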

However, there's a caveat attached to this: it's only true on average. In principle, nothing keeps you from creating a code that maps the letter a to the complete works of Shakespeare, so that just one letter, despite not being able to answer many questions at all, is able to transmit quite a long and detailed message. However, you always end up paying for this 'illegal' gain in efficiency at some point, in having to use a highly incompressible, long and complex string in order to encode some comparatively short and simple message.

To see this, consider Borges' Library of Babel: it contains every possible book of a certain length -- and by 'book', I mean every possible combination of characters; say, 130,000 of them, for definiteness. You're faced with the task of cataloguing this awesome collection. So, you start simple: you give the book that is identical to Shakespeare's Hamlet (which contains about 130,000 letters) the number one; the book that differs from Hamlet only in that its last character is an a (if it isn't in the original), the number two; the book whose last letter is a b, the number three; and so on.

At first, you might think you've hit upon a highly efficient scheme -- after all, strings of 130,000 characters get mapped to very short ones. But things will get bad, and go from bad to worse to terrible, rather quickly: for the books differing only in the last letter, you will need the numbers one to 26; for the books differing in the last two letters, the first 26*26 numbers; and finally, to catalogue all the books, you will need the numbers up to 26^130,000, which is a number with about 184,000 digits -- and thus, far longer than the books themselves are!
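The digit count can be checked directly: the number of decimal digits of a positive integer N is floor(log_10(N)) + 1, so for 26^130,000 that's floor(130,000 * log_10(26)) + 1.

```python
import math

# Decimal digits of 26**130_000, computed via logarithms
# (avoiding construction of the astronomically large number itself)
digits = math.floor(130_000 * math.log10(26)) + 1
print(digits)  # 183947 -- about 184,000, far more than the 130,000 letters per book
```

So even the catalogue numbers for the last few books are longer than the books they name.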

Your previously efficient coding scheme has come back to bite you: where at first you were able to save significant amounts of space, now you end up expending correspondingly more to make up for it. On average, there exists no magical space saver like that -- and thus, on average, it remains true that strings with a high entropy carry more information than low entropy ones.
