Part 2 - Information Theory

August 2019


"Where is the life we have lost in living?
Where is the wisdom we have lost in knowledge?
Where is the knowledge we have lost in information?"
- T.S. Eliot, "The Rock"

Section 2.1 - Introduction & Motivation

Information theory, aside from being an interesting line of research in itself, represents a profound intersection of disciplines. Information-theoretic questions and lines of reasoning are used by mathematicians, engineers, theoretical computer scientists, and both theoretical and experimental physicists. For our purposes, we will mainly be looking at the portions of interest to theoretical physicists and mathematicians, and in this section we will focus on understanding the fundamental results.

For the physicist, information theory has been of monumental interest since the early days of quantum mechanics. Upon discovering the correlations between certain measurements made on quantum states (a phenomenon of great importance called entanglement), basic questions of how information is shared between states have fascinated and beguiled the community. Furthermore, when coming to grips with entanglement and its implications, physicists began to wonder how it might be used in a practical setting: and so began the field of study we now know as quantum computing. Those physicists with a bent towards the theoretical continued to wonder at the basic nature of information, and to assert the maxim that information itself is a physical quantity to be measured - that it, too, alongside mass and charge and all the rest, is a quantity by which we might characterise physical phenomena.

For many, information theory as a field of research starts with Claude Shannon's landmark paper, "A Mathematical Theory of Communication", first published in the Bell System Technical Journal in 1948. In it, Shannon posits some very reasonable properties of signals coming from a random source and explores the consequences of these properties, starting with the case of discrete noiseless sources and then generalizing first to discrete noisy sources and eventually to continuous signals. So many foundational results were described in this paper that much of this article will be devoted simply to its explanation.

We start with a description of what we call a "communication system". This is something best described by a diagram.
[Diagram of a communication system]

In its simplest form, this is what signal-based communication looks like. As an example, say we're talking about landline telephones. Starting from the speaker, the sound reverberates on a diaphragm attached to some iron and placed near a magnet. The motion of the magnetized iron induces an electric current which is transferred over wiring to an operator. The operator at a switchboard then connects the speaker to the intended recipient, whose own device performs the reverse process to decode the message - turning the received current into reverberation identical (up to some noise) to that produced by the speaker. At just about any of these points there is a capacity for noise to be introduced - either from sheer loss of fidelity in the equipment, or from intentional interference by, say, the operator in this example.

Now this is stripped down enough to be generalizable, but that means for this to be useful we need to consider cases with their own structure. The first such case we'll be exploring is the simplest, and it's where Shannon starts too: the discrete, noiseless channel. Discrete, noiseless channels are communication schemes where the messages are sent and received as discrete signals, and there is no interference from the outside or within the equipment - an idealized case, to be sure! Near approximations are very well-designed telegraph machines, or shipboard signal flags on a clear day. We also assume that the messages are encoded with a finite character set - as in both of the examples above.

Section 2.2 - Channel Capacity

Shannon begins in a natural place in the description of such schemes: with the capacity of the scheme itself. This quantity is defined in the following way: $$ C = \lim_{T \to \infty} \frac{\log_2(N(T))}{T}, $$ where $N(T)$ is the number of allowed signals of duration $T$. This definition works for the discrete noiseless channel, but needs to be generalized for other cases. The capacity is finite whenever $N(T)$ grows at most exponentially in $T$ - say, bounded above by $2^{cT}$ for some constant $c$ - and it vanishes when $N(T)$ grows only polynomially in $T$; trivially, it is bounded below by zero.
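
To make the definition concrete, here is a minimal Python sketch (my own illustration, not from Shannon's paper) for a hypothetical channel whose symbols have integer durations 1 and 2: it counts the allowed signals of duration $T$ by a simple recursion and estimates $C$ from $\log_2 N(T)/T$.

```python
# A minimal sketch (not from the original text): estimating the channel
# capacity C = lim log2(N(T)) / T for a hypothetical discrete noiseless
# channel whose symbols have integer durations, here 1 and 2 time units.
import math

def count_signals(durations, T):
    """Number N(t) of allowed signals of total duration exactly t, for t = 0..T."""
    N = [0] * (T + 1)
    N[0] = 1  # the empty signal
    for t in range(1, T + 1):
        N[t] = sum(N[t - d] for d in durations if t >= d)
    return N

durations = [1, 2]          # assumed symbol durations
T = 60
N = count_signals(durations, T)
print(math.log2(N[T]) / T)  # ~0.69, approaching log2 of the golden ratio (~0.694)
```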

Messages transferred may be encoded in a character set of smaller size than the alphabet in which the message was originally written, e.g. we may encode the Latin alphabet into the 32 bitstrings of length 5, with codewords to spare. If we imagine this kind of encoding, we may also imagine a graph that represents the construction of these strings. Starting from an empty string, we may either append a 1 or a 0 to it, and repeat for as long as we wish to make our message. We may also encode separation strings, symbols that mark whether what has just been sent is a letter or a word, sending us back to a new empty string. We thus have two nodes: one for beginning new strings and one for building them. This sort of messaging system, in which there are only a few distinct options after each step, is in fact an excellent model of practical communication systems. So what is the capacity of such a system?
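
As a quick illustration of the encoding just described, here is a small sketch (my own, purely illustrative) that maps the Latin alphabet onto 5-bit strings and decodes a message back.

```python
# A minimal sketch (my own illustration): encoding the Latin alphabet into
# 5-bit strings, as in the example above -- 32 codewords, with 6 to spare.
import string

code = {c: format(i, "05b") for i, c in enumerate(string.ascii_lowercase)}
decode = {v: k for k, v in code.items()}

msg = "hello"
bits = "".join(code[c] for c in msg)
print(bits)                                               # 25 bits for 5 letters
print("".join(decode[bits[i:i+5]] for i in range(0, len(bits), 5)))  # "hello"
```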

We will prove that it comes down to the following.
THEOREM 1: Let $b^{(s)}_{ij}$ be the duration of an allowable symbol $s$ that leads from state $i$ to state $j$. Then the channel capacity $C$ is equal to $\log W$, where $W$ is given by the largest real solution to the determinant equation $$ \left| \sum_s W^{-b^{(s)}_{ij}} - \delta_{ij} \right| = 0. $$

PROOF: Let $N_i(L)$ be the number of strings of length $L$ that end in state $i$. We may write that $$N_j(L) = \sum_{i,s} N_i(L - b_{ij}^{(s)}),$$ where the $b_{ij}^{(s)}$ are the lengths of the symbols which take state $i$ to state $j$. This is a system of linear difference equations, and so the solutions behave as $N_j = A_jW^L$ as $L \rightarrow \infty$. Substituting this solution form into our difference equation, we recover $$ \begin{align} & A_jW^L = \sum_{i,s}A_iW^{L-b_{ij}^{(s)}}\\ \\ & A_j = \sum_{i,s}A_iW^{-b_{ij}^{(s)}}\\ \\ & \sum_i\left(\sum_s W^{-b_{ij}^{(s)}}-\delta_{ij}\right)A_i = 0. \end{align} $$ For a non-trivial set of coefficients $A_i$ to exist, we need the determinant to vanish: $$D(W) = \left|a_{ij}\right| = \left| \sum_s W^{-b^{(s)}_{ij}} - \delta_{ij} \right| = 0.$$ Taking $W$ to be the largest real root of this determinant equation then characterises the growth of the $N_j(L)$. $\blacksquare$

We may then give the quantity $C$ as $$ C = \lim_{L \to \infty}\frac{\log\sum_j{A_jW^L}}{L} = \log W. $$ A more general definition of the channel capacity is $C = \sup_{p_X(x)} I(X;Y)$, where $I(X;Y)$ is the mutual information shared by the random variables $X,Y$ and the supremum is taken over the marginal distribution $p_X(x)$ of the input, which together with the channel determines the joint distribution of $X$ and $Y$. We set this definition aside for now, as we will not need it yet, but it bears mentioning.
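
Theorem 1 can also be checked numerically. Below is a rough sketch (again my own, using a hypothetical one-state channel whose two symbols have durations 1 and 2) that finds the largest real root $W$ of the determinant equation by bisection and reports $C = \log_2 W$, consistent with the counting estimate above.

```python
# A minimal sketch (not from the original): finding W as the largest real root
# of D(W) = det( sum_s W^(-b_ij^(s)) - delta_ij ) = 0 for a hypothetical
# one-state channel with symbol durations 1 and 2, then C = log2(W).
import math
import numpy as np

# durations[i][j] = list of durations of symbols taking state i to state j
durations = [[[1, 2]]]   # one state, two allowed symbols

def D(W, durations):
    n = len(durations)
    M = np.array([[sum(W ** -b for b in durations[i][j]) for j in range(n)]
                  for i in range(n)]) - np.eye(n)
    return np.linalg.det(M)

# Bisection: D(W) is positive for small W > 1 and negative for large W here.
lo, hi = 1.0, 10.0
for _ in range(100):
    mid = (lo + hi) / 2
    if D(mid, durations) > 0:
        lo = mid
    else:
        hi = mid

W = (lo + hi) / 2
print(W, math.log2(W))   # W ~ 1.618 (golden ratio), C ~ 0.694 bits per unit time
```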

Section 2.3 - Markov and Ergodic Processes

We often come across information systems that produce seemingly random outputs according to a generally non-uniform distribution. One important class of such systems are "Markov chains" (also sometimes called "Markov processes"). In a Markov process, the probabilities governing the next step may be influenced by the state one is currently in. We call this re-adjustment of distributions a "residue of influence". Markov processes are another construction that is well expressed graphically.

[Diagram of a Markov process]

In this example, we can see that the distribution associated with this process is going to produce strings with many C's.
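
Since the exact probabilities in the diagram aren't reproduced here, the following sketch uses made-up transition probabilities biased toward C; sampling a long string and counting letters shows the kind of non-uniform statistics meant above.

```python
# A minimal sketch with made-up transition probabilities (the diagram's exact
# numbers aren't reproduced here): sampling a Markov process over {A, B, C}
# and checking the letter frequencies of a long output string.
import random
from collections import Counter

# transitions[state] = list of (next_state, probability); heavily biased toward C
transitions = {
    "A": [("B", 0.4), ("C", 0.6)],
    "B": [("C", 0.8), ("A", 0.2)],
    "C": [("C", 0.5), ("A", 0.3), ("B", 0.2)],
}

def sample_chain(start, steps, seed=0):
    rng = random.Random(seed)
    state, out = start, []
    for _ in range(steps):
        states, probs = zip(*transitions[state])
        state = rng.choices(states, weights=probs)[0]
        out.append(state)
    return "".join(out)

s = sample_chain("A", 100_000)
print(Counter(s))   # C dominates, reflecting the biased transition probabilities
```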

There is extensive literature on the application of Markov processes in statistics, but we're going to focus on a particular type of stochastic process called an "ergodic process". Ergodic processes are those which have the property that their statistical nature may be determined from a single, sufficiently long random sample. In plain language, that means that each sequence produced by an ergodic process has the same kind of statistical properties - individual letter distributions, digram and trigram distributions, and so on. From this definition we can infer conditions on the structure of the graph defining such a process:

  1. The graphs must not contain "islands" that are impossible to escape - that is, an ergodic process may not be described by a directed graph that takes a starting state to some terminal node or subset of nodes where it may not return to the original state. If $A \rightarrow C$ with some probability and $C$ is unable to return to $A$ either directly or through other connections along the graph, then we do not have an ergodic system.
  2. We call any closed walk on the graph a "circuit", and say that the length of the circuit is the number of edges the circuit takes to close. In the graph above, starting at node $A$ and taking the path $BCA$ is a circuit of length 3. A requirement of ergodicity is that the greatest common divisor of the circuit lengths of the graph be 1.

This first result is fairly intuitive: you can't have a uniform statistical structure if one sample may give you characters that another sample literally cannot. But what of the second result? This is due to the fact that if the circuit lengths have a common divisor $d > 1$, the sequences that arise will contain a periodic structure. Their statistical properties will be the same, but the different sequences will be shifted in their origin, the possible shifts ranging from $0$ to $d-1$. Furthermore, if the first condition is violated, your graph may be broken up into smaller subgraphs so that each individually is ergodic. If this must occur and the second condition remains intact, then we have a mixed source manifested out of a number of pure components - terminology not unfamiliar to those with some background in quantum mechanics. These pure components are associated with probabilities so that measurement or sampling identifies which ergodic subgraph we end up on.
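
Both conditions are easy to check mechanically. The sketch below (my own illustration, on a small hypothetical graph) tests strong connectivity and computes the greatest common divisor of the circuit lengths via BFS levels.

```python
# A minimal sketch (my own illustration, not from the text): checking the two
# ergodicity conditions on a transition graph given as adjacency lists --
# (1) strong connectivity and (2) gcd of circuit lengths equal to 1.
from math import gcd

graph = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}   # hypothetical example graph

def reachable(g, start):
    seen, stack = {start}, [start]
    while stack:
        for v in g[stack.pop()]:
            if v not in seen:
                seen.add(v)
                stack.append(v)
    return seen

def strongly_connected(g):
    rev = {u: [] for u in g}
    for u in g:
        for v in g[u]:
            rev[v].append(u)
    start = next(iter(g))
    return reachable(g, start) == set(g) and reachable(rev, start) == set(g)

def period(g):
    """gcd of all circuit lengths, via BFS levels: gcd of level(u)+1-level(v) over edges u->v."""
    start = next(iter(g))
    level, frontier, d = {start: 0}, [start], 0
    while frontier:
        nxt = []
        for u in frontier:
            for v in g[u]:
                if v not in level:
                    level[v] = level[u] + 1
                    nxt.append(v)
                else:
                    d = gcd(d, level[u] + 1 - level[v])
        frontier = nxt
    return abs(d)

print(strongly_connected(graph), period(graph))   # True, 1 -> consistent with ergodicity
```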

We're going to assume from this point on that the processes we're talking about are ergodic. Ergodic processes have a longstanding relation to problems in physics, and ergodic theory in general was invented as an approach to problems in statistical mechanics.

Section 2.4 - Information Entropy

Now that we have a descriptive model of information transfer (at least over a discrete source), how do we quantify the amount of information that gets transferred, and at what rate? If we take a random discrete source that produces outputs $1 , \ldots , n$ with known probabilities $p_1 , \ldots , p_n$, we would like to measure the amount of "choice" or "uncertainty" in the event measured. Say we have a function $S$ of the probabilities that is intended to do this measurement. $S$ should satisfy the following properties:

  1. $S$ must be continuous in its domain
  2. If the elements of its domain, the $p_i$, are all equal (i.e. the distribution is uniform), then $S$ is monotonically increasing in $n$.
  3. If a choice is broken down into successive choices, the total $S$ must be the weighted sum of the individual values of $S$.

THEOREM 2: There is essentially only one measure satisfying all three of these properties. This function is the Shannon entropy, written $$ H = -K\sum^n_{i=1} p_i \log p_i, $$ with $K$ a positive constant whose value amounts to a choice of units (equivalently, of logarithm base).

PROOF: We use a few easily generalizable special cases to make our calculations easier. Let's assume we're making a choice of one out of $N$ objects, so that our choice is distributed uniformly, any one object having a probability of $\frac{1}{N}$ of being selected. We write $S(\frac{1}{N}, \ldots, \frac{1}{N}) \equiv f(N)$. Now break the $N$ objects into $m$ groups, group $i$ having $n_i$ elements. Picking in two steps, the first choice of a group $i$ has probability $$ p_i = \frac{n_i}{N}, $$ and the second choice of an element within group $i$ has probability $1/n_i$. We say that the first step has an uncertainty of $S(p_1, \ldots, p_m)$ and the second step has an expected uncertainty of $\sum^m_{i=1} p_i f(n_i)$. Thus, by property 3, $$ f(N) = S(p_1, \ldots, p_m) + \sum^m_{i=1} p_i f(n_i). $$ Considering the special case in which each of the $m$ groups has $n$ elements, we have two important relations: $$ \begin{align} & f(N) = S(\frac{1}{m},\ldots,\frac{1}{m})+ \sum^m_{i=1}\frac{1}{m}f(n),\\ \\ &f(N) = f(nm) = f(n) + f(m),\\ \\ &\text{ for arbitrary } n,m. \end{align} $$ This second line is a version of the Cauchy functional equation, which has only a single well-behaved (continuous, monotone) family of solutions: $f(m) = K\log(m)$. For our purposes, we further limit allowable results to positive values of $K$, in keeping with the requirement that $S$ be monotonically increasing with $n$ (as asserted in condition 2).
Plugging this back into the general case, we get $$ \begin{align} & K\log(N) = S(p_1,\ldots,p_m) + \sum^m_{i=1}p_iK\log(n_i)\\ \\ & S(p_1,\ldots,p_m) = K\log(N) - K\sum^m_{i=1}p_i\log(n_i)\\ \\ & = K\sum^m_{i=1}p_i\log(N) - K\sum^m_{i=1}p_i\log(n_i)\\ \\ & = K\sum^m_{i=1}p_i\log\left(\frac{N}{n_i}\right)\\ \\ & =-K\sum^m_{i=1}p_i\log(p_i). \end{align} $$ $\blacksquare$
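
A quick numerical sanity check of the "successive choices" relation used in this proof (with $K = 1$, base-2 logarithms, and a made-up grouping of $N = 10$ objects):

```python
# A minimal sketch (not from the text): the entropy function with K = 1 and
# log base 2, plus a numerical check of the "successive choices" property
# used in the proof: f(N) = S(p_1..p_m) + sum_i p_i f(n_i) for N = sum n_i.
import math

def entropy(probs):
    return -sum(p * math.log2(p) for p in probs if p > 0)

def f(N):                       # entropy of a uniform choice among N objects
    return math.log2(N)

group_sizes = [2, 3, 5]         # hypothetical grouping of N = 10 objects
N = sum(group_sizes)
p = [n / N for n in group_sizes]

lhs = f(N)
rhs = entropy(p) + sum(pi * f(ni) for pi, ni in zip(p, group_sizes))
print(lhs, rhs)                 # both ~3.3219 = log2(10)
```
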
This, of course, bears resemblance to the famous Boltzmann entropy we derived in the previous section. More will be said about this relation presently - but for now, we can see that this function (which I will now call Shannon entropy, or simply entropy) shares the key properties of the Boltzmann entropy, owing to its logarithmic nature. Those are:

  1. $S=0$ iff all $p_i$ are 0 except for one, which has a value of one. Otherwise, $S$ is positive.
  2. $S$ is maximised when all probabilities are equal. This is also the situation with the most uncertainty, for purposes of intuition.
  3. The joint entropy is less than or equal to the sum of the individual entropies, i.e. $S(x,y) \leq S(x) + S(y)$.
  4. Any change towards equalizing the probabilities in the domain of $S$ increases $S$.
  5. Say we have two events $x,y$, not necessarily independent of each other. For any value $i$ that $x$ can assume, there is a conditional probability $p_i(j)$ that $y$ takes the value $j$. This conditional probability is given by $$ p_i(j) = \frac{p(i,j)}{\sum_jp(i,j)}, $$ and we define the "conditional entropy" of $y$, $S_x(y)$, as the average entropy of $y$ over the values of $x$. Substituting in the values of $p_i(j)$ we obtain $$ S_x(y) = S(x,y) - S(x). $$ The joint entropy of $x,y$ is the entropy of $x$ plus the conditional entropy of $y$ given $x$.
  6. From properties 3 and 5, we can see that $S(y) \geq S_x(y)$. (A numerical check of properties 3 and 5 is sketched just below.)
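
Here is the sketch referred to above: a check of properties 3 and 5 on a small, made-up joint distribution $p(i,j)$, with entropies in bits.

```python
# A minimal sketch (my own numbers): checking properties 3 and 5 on a small
# joint distribution p(i, j) -- subadditivity and S_x(y) = S(x, y) - S(x).
import math

def H(probs):
    return -sum(p * math.log2(p) for p in probs if p > 0)

# hypothetical joint distribution of (x, y), rows indexed by x, columns by y
p = [[0.30, 0.10],
     [0.05, 0.55]]

p_x = [sum(row) for row in p]
p_y = [sum(p[i][j] for i in range(2)) for j in range(2)]

S_xy = H([p[i][j] for i in range(2) for j in range(2)])
S_x, S_y = H(p_x), H(p_y)

print(S_xy <= S_x + S_y)        # property 3: subadditivity
print(S_xy - S_x)               # conditional entropy S_x(y) via property 5
```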

Sources and Entropy

Consider again the discrete sources with finite signals that we have been describing thus far. Each state $i$ the source can occupy has an associated probability $P_i$, and individual probabilities $p_i(j)$ of producing each symbol $j$ of the signal. Each state, then, has an associated entropy! This is given by $S_i$, and we can say that the total entropy is the individual state entropies weighted by the state probabilities - i.e. $$ S = \sum_i P_iS_i = -\sum_i\sum_j P_ip_i(j)\log p_i(j). $$ In plain language, this is the entropy of the source per symbol of the signal. The source entropy may also be described in entropy per second: $S' = \sum_i f_i S_i$, where $f_i$ is the average frequency (occurrences per second) of state $i$. This may be further reduced to $S' = nS$, where $n$ is the average number of symbols produced by the source per second.
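
As an illustration (with made-up transition probabilities, and the state probabilities $P_i$ taken to be the stationary distribution of the chain), the per-symbol entropy of a two-state Markov source can be computed directly:

```python
# A minimal sketch with made-up transition probabilities: the per-symbol
# entropy S = -sum_i sum_j P_i p_i(j) log2 p_i(j) of a Markov source, with
# the state probabilities P_i taken as the stationary distribution.
import math
import numpy as np

# p[i][j]: probability that state i emits symbol/next-state j (rows sum to 1)
p = np.array([[0.1, 0.9],
              [0.6, 0.4]])

# stationary distribution P via power iteration: P = P @ p
P = np.ones(len(p)) / len(p)
for _ in range(1000):
    P = P @ p

S = -sum(P[i] * p[i, j] * math.log2(p[i, j])
         for i in range(len(p)) for j in range(len(p)) if p[i, j] > 0)
print(P, S)   # entropy in bits per symbol
```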

For independent symbols in the source signal, $$ S = -\sum_i p_i \log p_i, $$ $p_i$ again being the probability of receiving symbol $i$. Taking a long, $n$-character message, we will observe with high probability about $np_1$ occurrences of the first symbol, and so on up to $np_m$ occurrences of the $m^{\text{th}}$, for a source drawing symbols from an alphabet of size $m$. The probability of receiving a particular signal of this type is approximately $p \approx p_1^{np_1}p_2^{np_2} \ldots p_m^{np_m}$. We take the logarithm and tidy up: $$ \begin{align} \log p & \approx n \sum_{i=1}^m p_i \log p_i = -nS,\\ \\ S &\approx \frac{\log\frac{1}{p}}{n}. \end{align} $$ This result holds approximately for any source, and we can tighten it up with the following theorem:
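
The approximation $S \approx \frac{1}{n}\log\frac{1}{p}$ is easy to see empirically. The sketch below (my own, with a hypothetical three-symbol i.i.d. source) samples a long sequence and compares $-\log_2 p/n$ with $S$.

```python
# A minimal sketch (not from the text): for an i.i.d. source, comparing
# -log2(p(sequence)) / n against the entropy S for a long random sequence.
import math
import random

probs = {"a": 0.5, "b": 0.3, "c": 0.2}     # hypothetical symbol probabilities
S = -sum(p * math.log2(p) for p in probs.values())

rng = random.Random(1)
n = 100_000
seq = rng.choices(list(probs), weights=list(probs.values()), k=n)

log_p = sum(math.log2(probs[s]) for s in seq)  # log2 of the sequence probability
print(S, -log_p / n)                           # the two numbers agree closely
```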

THEOREM 3: Given any $\epsilon > 0 \text{ and } \delta > 0$, we may find $n_0$ so that all sequences of length $n \geq n_0$ fall into two classes:

  1. A set whose total probability is less than $\epsilon$;
  2. The remainder, whose probabilities satisfy $$ \left| \frac{\log p^{-1}}{n} - S\right| < \delta. $$

PROOF: Again assuming we are dealing with an ergodic system, consider the transitions between states: state $i$ is occupied with some probability $P_i > 0$, and a particular symbol leading out of it to state $j$ is chosen with probability $p_{ij} > 0$. By the relationship between pointwise ergodicity and the strong law of large numbers, the number of times a particular transition $i \to j$ is taken in a sequence of length $N$ is approximately $P_i p_{ij} N$: proportional to the probability of being in state $i$, times the probability of then choosing that transition. If $N$ is sufficiently large, the probability of a percentage error greater than $\pm \delta$ in these counts is less than $\epsilon$. For all sequences outside a set of small total probability, then, the actual counts lie within the limits $$ (P_ip_{ij} \pm \delta)N, $$ so almost all sequences have a probability $p$ given by the expression $$ p = \prod_{i,j}p_{ij}^{(P_ip_{ij} \pm \delta)N}, $$ from which we can determine $$ \frac{\log p}{N} = \sum_{i,j} (P_i p_{ij} \pm \delta) \log p_{ij},\\ \left| \frac{\log p}{N} - \sum_{i,j} P_i p_{ij} \log p_{ij} \right| < \eta. $$ A bit of arithmetic is sufficient to show that this is equivalent to our theorem. $\blacksquare$

We also have the following theorem relating the entropy to $n(q)$, the number of the most probable strings of length $N$ we must take, in order of decreasing probability, for their total probability to reach some value $q$:

THEOREM 4: $$ \lim_{N \to \infty} \frac{\log n(q)}{N} = S, $$ when $q \neq 0,1$.

PROOF: First consider the excluded endpoints. Suppose $q = 0$; then clearly $n(q) = 0$. If, however, we have $q = 1$, then $n(q)$ must be the number of all strings of length $N$. We assume that we are working with bitstrings, and thus may express this number as $2^N$; it is easy to generalize the result to alphabets of any size. We see then that to achieve probability 1 we need all $2^N$ strings, which is why these endpoints are excluded.

From theorem 3, we have that almost all of our sequences of characters come with a probability approaching $p = 2^{-SN}$ for large $N$. If we arrange the probabilities of each string in descending order, the number of strings needed to reach probability $q$ will be about $q2^{SN}$, and the exceptional set of total probability less than $\epsilon$ contributes negligibly to our collection. Therefore, $n(q) \approx \frac{q}{2^{-SN}} = q2^{SN}$. From here, we can see very quickly that $$ \frac{\log n(q)}{N} = \frac{\log q}{N} + S \to S. $$ $\blacksquare$
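
Theorem 4 can also be observed numerically. The following sketch (my own illustration, for a hypothetical biased i.i.d. binary source) groups the length-$N$ strings by their number of ones, takes the most probable strings until their total probability reaches $q$, and compares $\log_2 n(q)/N$ with $S$.

```python
# A minimal sketch (my own illustration): estimating lim log2(n(q)) / N for a
# biased i.i.d. binary source by grouping length-N strings by their number of
# ones, sorting the groups by per-string probability, and taking strings until
# their total probability reaches q.
import math

p1 = 0.3                        # hypothetical probability of a 1
S = -(p1 * math.log2(p1) + (1 - p1) * math.log2(1 - p1))   # ~0.8813

N, q = 1000, 0.5
# (count, per-string probability) for each number of ones k; strings with the
# same k are equiprobable, so we can add them in bulk.
classes = sorted(
    ((math.comb(N, k), p1 ** k * (1 - p1) ** (N - k)) for k in range(N + 1)),
    key=lambda c: c[1], reverse=True)

total, n_q = 0.0, 0
for count, prob in classes:
    if total >= q:
        break
    # take only as many strings from this class as needed to reach q
    need = min(count, math.ceil((q - total) / prob))
    n_q += need
    total += need * prob

print(math.log2(n_q) / N, S)    # the ratio approaches S for large N
```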

Information and Thermodynamics

The real crux of the article is how we relate thermodynamic entropy and information theory. Evidently, we use similar jargon between the disciplines, in particular the use of the word "entropy". The Boltzmann thermodynamic entropy formula may in fact be viewed as a particular case of Shannon's information entropy, with Boltzmann's constant acting as a constant of proportionality between the two, and with the logarithm taking a different base. The conversion factor from base $e$ to base 2 may be absorbed into a constant itself, and so we have a simple way of converting between the two.
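
As a short worked check of this claim (my own restatement, not drawn from Shannon's paper): for a uniform distribution over $\Omega$ microstates, $p_i = 1/\Omega$, and $$ H = -K\sum_{i=1}^{\Omega}\frac{1}{\Omega}\log_2\frac{1}{\Omega} = K\log_2\Omega, \qquad S_B = k_B\ln\Omega = (k_B\ln 2)\log_2\Omega, $$ so choosing $K = k_B\ln 2$ makes the Boltzmann entropy exactly the Shannon entropy of a uniform distribution over microstates.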

There is also an interesting parallel between our assumption of ergodic processes and the canonical ensemble of statistical mechanics - I'm not well versed enough in the particulars of ergodic theory to draw any conclusions from this, but when I have understood more, I will report on any interesting findings.