Saturday, May 16, 2020

The Shannon Paradoxon and how to solve it

(working title: Identifying symbol-tokens by measuring pixel-correlations within documents)
When looking at a sequence of black and white pixels, there's a simple way to determine how much information the sequence contains: use Shannon's formula for entropy.

$$H(X) = -\sum_{i=1}^{n} P(x_i) \log_b P(x_i)$$

Since there are only black and white pixels, i runs from 1 to n = 2, and the base b is 2 by convention, so the result is measured in bits. P is simply the probability of the corresponding color x_i appearing, which is easily calculated by dividing the number of black (respectively white) pixels by the total number of pixels.
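To make the calculation concrete, here is a minimal Python sketch. The function name `shannon_entropy` and the two sample sequences are made up for illustration (1 stands for black, 0 for white). Note that the formula yields the average information per pixel; multiply by the number of pixels to get a total for the whole sequence.

```python
import math
from collections import Counter

def shannon_entropy(pixels, base=2):
    """Average information per pixel, in bits when base is 2."""
    total = len(pixels)
    if total == 0:
        return 0.0
    counts = Counter(pixels)
    # P(x_i) = number of pixels of colour x_i / total number of pixels.
    # p * log(1/p) is the same as -p * log(p), just without the minus sign.
    return sum((c / total) * math.log(total / c, base) for c in counts.values())

# Hypothetical sample data: 1 = black, 0 = white
balanced = [0, 1, 0, 1, 1, 0, 0, 1]      # 4 black, 4 white
mostly_white = [0, 0, 0, 0, 0, 0, 1, 0]  # 1 black, 7 white

print(shannon_entropy(balanced))      # 1 bit per pixel
print(shannon_entropy(mostly_white))  # roughly 0.54 bits per pixel
```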

There's an interesting property here: whenever the numbers of black and white pixels are equal, the sequence conveys the maximum amount of information possible for that number of pixels. Whenever there are more white than black pixels, or more black than white, the information content shrinks, and it shrinks further the larger the imbalance becomes. This is pure mathematics and can be seen in the following chart:


The maximum of information is in the middle, where black equals white (the probability of black is plotted on the horizontal axis). The more the two are out of balance, the less information the sequence contains. Think of it this way: when something happens more often, it becomes more and more expected and therefore gives you less information each time it happens (that's the log part of the formula); when, on the other hand, something gets rarer and rarer, it gives you more information when it does happen! And as you can see in the illustration above, the gain in one color's information value cannot compensate for the loss in the other's, because each color's value is weighted by how often that color actually occurs.
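The shape of that curve can be reproduced with a few lines of Python. This is only a sketch: the helper names `surprisal` and `binary_entropy` and the chosen probabilities are my own picks for illustration. It prints how much information a single black or white pixel carries, and the resulting entropy once each is weighted by its probability.

```python
import math

def surprisal(p):
    """Information in bits gained when an event of probability p occurs."""
    return math.log2(1 / p)

def binary_entropy(p_black):
    """Entropy in bits per pixel when black has probability p_black."""
    total = 0.0
    for p in (p_black, 1 - p_black):
        if p > 0:  # a colour that never appears contributes nothing
            total += p * surprisal(p)
    return total

for p_black in (0.1, 0.25, 0.5, 0.75, 0.9):
    print(f"P(black)={p_black:.2f}  "
          f"black pixel: {surprisal(p_black):.3f} bits  "
          f"white pixel: {surprisal(1 - p_black):.3f} bits  "
          f"entropy: {binary_entropy(p_black):.3f} bits")
```

The rare color's pixels are worth more bits each, but there are so few of them that the weighted sum still peaks at P(black) = 0.5 and falls off symmetrically on both sides.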
The total loss of information due to this imbalance is called redundancy.
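As a rough sketch of how redundancy could be put into numbers: the function below uses the usual textbook definitions, where the absolute redundancy is the gap between the maximum entropy and the actual entropy, and the relative redundancy is that gap as a fraction of the maximum. The function name `redundancy` and the example probability of 1/8 are assumptions for illustration only.

```python
import math

def redundancy(p_black):
    """Return (absolute, relative) redundancy for a two-colour source."""
    h_max = 1.0  # log2(2): the entropy when black and white are equally likely
    h = sum(p * math.log2(1 / p) for p in (p_black, 1 - p_black) if p > 0)
    return h_max - h, 1 - h / h_max

# Hypothetical example: one pixel in eight is black
absolute, relative = redundancy(0.125)
print(f"{absolute:.3f} bits per pixel lost, {relative:.1%} redundant")
```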

This is not the "Shannon Paradoxon" - these were just the basics!