Dienstag, 1. August 2017

Why the Information-Heatmap of an Alphabet reminds me of a QR-Code



Calculating the "per pixel"-I
nformation of an Alphabet

What does this mean, "Information-Heatmap"? - How is it done? - To compute the "Information-Heatmap" of an Alphabet you calculate how much information is given by every single subpixel of an alphabet. To recap: information in the "Shannon-Weaver" sense is defined over probability of occurence of each letter. The more frequent a letter appears in a text, the less information is contained in it and a very rare letter yields more information than average. So to compute the information on pixel-level, you start with a pixel-by-pixel representation of every letter of the alphabet, i.e. a bitmap font. So each pixel in the heatmap represents how much information of the whole letter is contributed by this single pixel (at this specific position)! (Remarks: (1) we here omit the fact, that the letters of the alphabet have different appearance probabilities themselves, and assume equal probability of 1/26 per letter - (2) see below for a python script that calculates the Information-Heatmap for any given Bitmap-Font).

Here the calculation rule step-by-step summarized:
  • take the letters of an alphabet represented by a pixel matrix
  • calculate how much of the information obtained by the letter as whole, is contributed by every single pixel's position in the mean
  • visualize that information in form of a heatmap (painted with e.g. matplotlib/python)


Model used for Calculation

To find out the median information per pixel, the following model is used. Starting with a complete covered up font, pixel by pixel is uncovered. The probabilities for the next uncovered pixel for being black or white are calculated, using the know remaining possibilities of this special alphabet. (note: all probabilities multiplied along the path add up to the appearance-probability of the uncovered letter). When the final Pixel has been uncovered, all pixels are covered again and a new random "uncovering"-path through all pixels is chosen. In the end, the average probability for every pixel-position is calculated. Since we cannot compute all possible paths (permutations) of uncovering, due to algorithmical complexity, we choose only a pseudo-random subset of all paths ("Monte-Carlo"-style simulation).
Below a (small) sample alphabet is given, consisting of 4 "letters" a-d. Note, that the 4 per-pixel values for mean-information sum up to 2 bits, because the alphabet has size 4 (log2(4)=2 bits). Whereas in the introduction above the pixel-wise information quantities sum up to the median information give by any single letter of the Alphabet - in that case log2(26) ≈ 4.7... bits .


Result

The Information-Heatmap shows "hotspots" - which represent more information-gain than average (yellow). Also there are areas, where there is not so much information gathered (dark blue). The distribution of the hotspots and darker areas is not by chance - it reveals some kind of "structure" in the sense, that the hotspots are evenly distributed over the whole area of the letter, arranged in a "grid". This reminds me of how QR-Codes are constructed: there are markers, that are uniformly spaced, and in between, there are areas, where everything looks fuzzy ("white noise" as it is called in information theory).
[Remark: in fact, the setting is vice versa with qr-codes, the "hotspots" have low information - but the principle stays the same]


Interpretation

Remember, that the markers appear in every QR-code at the same position - that means, that the information of the markers is zero, because it's always there (probability=1)!!! - If their information-value is zero, this in turn means, they are redundant !!! - But hey, why is redundancy needed, when we actually want to convey information?

The markers in the QR-code are needed for positioning of the camera when reading the information coded in the pixel area - that's the area, where the information is contained (1 bit per pixel).
In the same way, in the Information-Heatmap, the areas with deep-blue coloring are needed for positional stabilizing during the cognition process, for better "reading" of the actual information.

So in every alphabet made up of letters there are places which bear no (or few) information - but in fact they are very necessary for the "real" information to be correctly read - you could say, that these former contain "positional information".

That means, that there are two types of information: positional and actual information. Both types of information require each other; this is similar to what is stated in Shannon's information theory: when a channel for information-transfer is "noisy" (defective) - i.e. some information is lost during transfer -, then this can be overcome by increasing the redundancy of the transferred information. So: to transport more bytes in a noisy channel you need more redundancy in your information!


Hypothesis

"good alphabets are redundant"
in the sense of: for an alphabet to be good (readable by humans and machines) it should contain a good portion of redundancy.
But we want to formulate an extended version:

"good alphabets are redundant and additional the areas of redundancy and information should be evenly distributed across the whole area of the letter"

Additionally it seems the redundancy shall not be evenly distributed over its constituting sub pixels, but it shall have points of concentration (hotspots) and accordingly areas where there's less redundancy (and therefore more information). That means that optimally the Information-Heatmap is evenly structured (eggbox-shaped)!!!


What are next Steps?

  • examine high-resolution fonts and bigger font-sets (e.g. chinese fonts) → therefore apply for computing-time @ aigrant.org
  • transfer idea from letter-subpixels to sentence-words to gain NLP-insights (word2vec)
  • find a good metric that favors information-distributions with hotspots and "plains" (  → kind of  "egg box"-metric :), so that we can classify "good" alphabets
  • examine how both types of information (actual & structural) are balanced in different kinds of alphabets



Bibliography

[1] http://worrydream.com/refs/Shannon%20-%20A%20Mathematical%20Theory%20of%20Communication.pdf
[2] http://www.cognitionresearch.org/papers/background/aicomm_1993/aicomm_1993_html/aicomm_1993.htm
[3] https://books.google.de/books?id=WQiBBwAAQBAJ&hl=de&num=11
[4] https://stackoverflow.com/questions/38422347/how-to-get-nth-permutation-and-vice-versa-given-a-charset
[5] https://math.stackexchange.com/questions/60742/finding-the-n-th-lexicographic-permutation-of-a-string

Code

Python3-Code (works best  with Jupyter Notebook/Anaconda):


%matplotlib inline
import matplotlib.pyplot as plt
from math import log, factorial
import numpy as np
import itertools
import sys

def sigirinc(a, chk, depth, p, per, d, z):
   b = []
   for x in a:
      if d[x][per[depth]] == chk:
         b.append(x)
   if len(b) > 0:
      p[per[depth]]=log(len(a)*1./len(b),2)
      if depth == len(d[0])-1: 
         assert (len(b)==1),"more than one element at end of path!!!"
         z[b[0]] = p 
      else:
         c = list(b)
         q = list(p)
         sigirinc(b, 0, depth+1, p, per, d, z)
         sigirinc(c, 1, depth+1, q, per, d, z)

def checkalph(d, anz):
   s = range(len(d))
   zi = []
   counter = 0
   gesamt = factorial(len(d[0]))
   if(factorial(len(d[0]))<anz):
      anz = factorial(len(d[0]))
   step = int(factorial(len(d[0]))/anz)
   for i in range(0,gesamt-1,step):
      permi = []
      nthPerm(i, list(range(len(d[0]))), permi)
      p=list(range(len(d[0])))
      q=list(range(len(d[0])))
      z = list(range(len(s)))
      sigirinc(s, 0, 0, p, permi, d, z)
      sigirinc(s, 1, 0, q, permi, d, z)
      zi.append(z)
      counter += 1
      if counter%(anz/100)==0:
         sys.stdout.write("%s" % '.') 
         sys.stdout.flush()
   return zi
    
def printheatmap(a,zzz,x,y):
   lll = []
   for zz in zzz:
      zz_ = list(itertools.chain.from_iterable(zz))
      i = np.mean(zz_, axis=0)
      lll.append(i)
   plt.imshow(np.array(lll).reshape(x,y), interpolation='spline36') 
   plt.colorbar()
   plt.show()
def nthPerm(n,elems,z):#with n from 0
   if(len(elems) == 1):
      z.append(elems[0])
   else:
      sizeGroup = factorial(len(elems)-1)
      q,r = divmod(n,sizeGroup)
      v = elems[q]
      elems.remove(v)
      z.append(v)
      nthPerm(r,elems,z)

font_x=7
font_y=9

### BigPixel-Font A-Z (7x9)
dd=([0, 1, 1, 1, 1, 1, 0, 1, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 1],
[1, 1, 1, 1, 1, 1, 0, 1, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 0, 1, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 0],
[0, 1, 1, 1, 1, 1, 0, 1, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 1, 0, 1, 1, 1, 1, 1, 0],
[1, 1, 1, 1, 1, 1, 0, 1, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 0],
[1, 1, 1, 1, 1, 1, 0, 1, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 0],
[1, 1, 1, 1, 1, 1, 0, 1, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0],
[0, 1, 1, 1, 1, 1, 0, 1, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 1, 0, 1, 1, 1, 1, 1, 0],
[1, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 1],
[1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1],
[0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 1, 0, 1, 1, 1, 1, 1, 0],
[1, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 1, 0, 0, 1, 0, 0, 1, 1, 0, 0, 1, 1, 1, 1, 0, 0, 0, 1, 0, 0, 1, 1, 0, 0, 1, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 1, 1],
[1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 0],
[0, 1, 1, 0, 1, 1, 0, 0, 1, 1, 1, 1, 1, 0, 1, 0, 0, 1, 0, 0, 1, 1, 0, 0, 1, 0, 0, 1, 1, 0, 0, 1, 0, 0, 1, 1, 0, 0, 1, 0, 0, 1, 1, 0, 0, 1, 0, 0, 1, 1, 0, 0, 1, 0, 0, 1, 1, 0, 0, 1, 0, 0, 1],
[0, 1, 1, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 1, 1, 0, 1, 1, 0, 0, 1, 1, 0, 1, 1, 0, 0, 1, 1, 0, 0, 1, 0, 0, 1, 1, 0, 0, 1, 1, 0, 1, 1, 0, 0, 1, 1, 0, 1, 1, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 1, 1, 0],
[0, 1, 1, 1, 1, 1, 0, 1, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 1, 0, 1, 1, 1, 1, 1, 0],
[1, 1, 1, 1, 1, 1, 0, 1, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0],
[0, 1, 1, 1, 1, 1, 0, 1, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 1, 1, 0, 0, 1, 0, 0, 1, 1, 0, 0, 1, 1, 0, 1, 1, 0, 0, 0, 1, 0, 1, 1, 0, 0, 0, 1, 1, 1, 0, 1, 1, 1, 1, 1, 0],
[1, 1, 1, 1, 1, 1, 0, 1, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 0, 0, 0, 0, 1, 0, 1, 1, 0, 0, 0, 1, 0, 0, 0, 1, 1, 0, 1, 0, 0, 0, 0, 1, 1],
[0, 1, 1, 1, 1, 1, 0, 1, 1, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0, 1, 1, 0, 1, 1, 1, 1, 1, 0],
[0, 1, 1, 1, 1, 1, 0, 1, 1, 0, 1, 0, 1, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0],
[1, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 1, 0, 1, 1, 1, 1, 1, 0],
[0, 1, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 1, 0, 1, 1, 0, 0, 1, 1, 0, 0, 1, 0, 0, 1, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 0, 1, 1, 0, 0],
[1, 0, 0, 1, 0, 0, 1, 1, 0, 0, 1, 0, 0, 1, 1, 0, 0, 1, 0, 0, 1, 1, 0, 0, 1, 0, 0, 1, 1, 0, 0, 1, 0, 0, 1, 1, 0, 0, 1, 0, 0, 1, 1, 0, 0, 1, 0, 0, 1, 0, 1, 1, 1, 1, 1, 0, 0, 1, 1, 0, 1, 1, 0],
[1, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0, 1, 1, 0, 1, 1, 0, 1, 1, 0, 0, 0, 1, 1, 1, 0, 0, 0, 1, 1, 0, 1, 1, 0, 1, 1, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 1],
[1, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 1, 0, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 1, 0, 1, 1, 1, 1, 1, 0],
[1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1])

cycles = 100000

zz = checkalph(dd, cycles)
zzz = []
zzz.append(zz)
printheatmap(dd,zzz,font_y,font_x)


2 Kommentare:

  1. Michel Foucault, Archäologie des Wissens. Kapitel 3: Prinzipiell gibt es auf der Welt graphische und phonetische Alphabete. Graphische Alphabete sind zum Beispiel die chinesische Schrift etc. Phonetische Alphabete setzen ein Phoneminventar voraus und Phoneme. Das ist das erste große Problem. Daran im Anschluss kann man nach graphischen Repräsentationen dieser Phoneme suchen.
    Du gehst einen Schritt weiter. Eine Aussage ist nie explizit. Bei Foucault findet sich sogar eine Begründung dafür. Ich würde einmal versuchen, zu experimentieren und tatsächlich ein Alphabet zu entwerfen, egal wie gut oder schlecht.
    Das große Problem für den Sprachwissenschaftler ist, dass du an der Phonetik vorbeiwillst. Aber auch wenn du über den künstlerischen Weg gehst (chinesische Schriftzeichen), setzt du immer ein Subjekt voraus, das diese Zeichen erfindet und wieder zurückübersetzt.
    Ich finde die Schrift der Tuareg interessant.

    Aber die Zeichen sind für den Informatiker bestimmt nicht sehr distinkt.

    Viele Grüße
    Malte Siebenrok

    AntwortenLöschen
  2. Die Tuareg-Schrift ist interessant, weil die Zeichen auf den ersten Blick sehr unterschiedlich sind.
    (s. https://upload.wikimedia.org/wikipedia/commons/8/83/UCB_Tifinagh.png)

    AntwortenLöschen