A random variable is a function that assigns a numerical value to each outcome of a random experiment. For example, if we roll a die, the random variable $X$ that represents the number appearing on the top face of the die can take the values $1, 2, 3, 4, 5, 6$. In this case, we would write that:
$$
X = \{1,2,3,4,5,6\}
$$
That is, $X$ is a random variable that takes values in the set $\{1,2,3,4,5,6\}$.
In the previous example, the random variable $X$ is discrete, because it takes values in a finite or countable set (remember, some infinities are larger than others; a countable infinity is one whose elements can be put in one-to-one correspondence with the natural numbers $\mathbb{N}$). But there are also continuous random variables, which take values in an interval of real numbers. For example, if we measure a person's height, the random variable representing the height can take any value in the interval $(0,\infty)$.
Probability Function
The probability function of a random variable is a function that assigns to each value of the random variable the probability that this value occurs. For example, if we roll a die, the probability function of the random variable $X$ that represents the number appearing on the top face of the die is a function that assigns to each number in the set $\{1,2,3,4,5,6\}$ the probability of that number coming up. In this case, the probability function is uniform because all the numbers have the same probability of appearing.
For a discrete random variable, the probability function can be represented in a table or with a formula. For example, if we roll a die, the probability function of the random variable $X$ that represents the number appearing on the top face of the die can be represented with the following table:

| $x$ | $1$ | $2$ | $3$ | $4$ | $5$ | $6$ |
| --- | --- | --- | --- | --- | --- | --- |
| $P(X=x)$ | $\frac{1}{6}$ | $\frac{1}{6}$ | $\frac{1}{6}$ | $\frac{1}{6}$ | $\frac{1}{6}$ | $\frac{1}{6}$ |
For a continuous random variable, the probability function is replaced by a probability density function. For example, if we measure a person's height, the probability density function of the random variable representing height assigns a density to each value in the interval $(0,\infty)$; the probability that a person's height falls within a given interval is obtained by integrating the density over that interval.
Expected Value
The expected value of a random variable is the average value the random variable takes in a random experiment. It is calculated by multiplying each value of the random variable by its probability and summing the results:
$$
E[X] = \sum_{x} x\, f(x)
$$
For example, if we roll a die, the expected value of the random variable $X$ representing the number that appears on the top face of the die is calculated as:
$$
E[X] = \sum_{x=1}^{6} x\, f(x) = \sum_{x=1}^{6} x \cdot \frac{1}{6} = 3.5
$$
Where $f(x)$ is the probability function of the random variable $X$; for the die, $f(x) = \frac{1}{6}$ for every face.
LOTUS: Law of the Unconscious Statistician
LOTUS is a rule that allows us to calculate the expected value of a function of a random variable without needing to know the probability function of that transformed variable; only the probability function of the original random variable is required. LOTUS states that the expected value of a function $g$ of a random variable is obtained by multiplying $g$ by the probability function of the random variable and summing the results:
$$
E[g(X)] = \sum_{x} g(x)\, f(x)
$$
For example, if we roll a die, and define the random variable $Y = X^2$, where $X$ is the random variable representing the number on the top face of the die, the expected value of the random variable $Y$ is calculated as:
$$
E[Y] = E[X^2] = \sum_{x=1}^{6} x^2\, f(x) = \sum_{x=1}^{6} x^2 \cdot \frac{1}{6} = \frac{91}{6} \approx 15.17
$$
Where $f(x)$ is the probability function of the random variable $X$; note that we never needed the probability function of $Y$.
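A quick way to check both results numerically is with a few lines of NumPy (this is only an illustrative sketch; the names `x`, `f`, `E_X` and `E_Y` are arbitrary choices):

```python
import numpy as np

# Fair die: values 1..6, each with probability 1/6
x = np.arange(1, 7)
f = np.full(6, 1 / 6)

# Expected value: E[X] = sum over x of x * f(x)
E_X = np.sum(x * f)
print(E_X)       # 3.5

# LOTUS: E[X^2] = sum over x of x^2 * f(x); the distribution of Y = X^2 is never needed
E_Y = np.sum(x**2 * f)
print(E_Y)       # 15.1666... = 91/6
```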
Entropy
In physics, entropy is a measure of the amount of disorder or chaos in a system; more specifically, in thermodynamics, entropy is a measure of the amount of energy that cannot be used to do work. In information theory, entropy is a measure of the uncertainty of a random variable. It can also be interpreted as a measure of the amount of information needed to describe a random variable.
The entropy of a random variable $X$ is defined as the expected value of the information of the random variable. The information of a random variable is a measure of the "surprise" that a value of the random variable produces.
The information of a random variable is calculated as the negative logarithm of the probability of a value of the random variable occurring:
$$
I(x) = -\log_k\left(P(X=x)\right)
$$
Note that the base of the logarithm determines the unit of measurement for the information. This base can be any positive number, but the most common bases are:
2: in which case the unit of information is the bit (binary digit).
e: in which case the unit of information is the nat (natural unit of information).
10: in which case the unit of information is the hartley.
If no base is specified, we will assume the base of the logarithm is 2.
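As an illustrative sketch (the helper `information` is not a library function, just a name chosen here), the three units correspond to nothing more than a change of logarithm base:

```python
import numpy as np

def information(p, base=2):
    # Information (surprise) of an outcome with probability p: I(x) = -log_base(p)
    return -np.log(p) / np.log(base)

p = 1 / 8
print(information(p, base=2))      # 3.0    bits
print(information(p, base=np.e))   # ~2.079 nats
print(information(p, base=10))     # ~0.903 hartleys
```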
Knowing this, the entropy of a random variable $X$, which we denote as $H_k(X)$, is calculated as follows (thanks to the LOTUS theorem):
$$
H_k(X) = E[I(X)] = -\sum_{x} P(X=x)\log_k\left(P(X=x)\right)
$$
where the subscript $k$ indicates the base of the logarithm. Remember that if no base is specified, we assume the base of the logarithm is 2.
In information theory, we start with an information source defined as a pair $(S,P)$ where $S$ is a predefined alphabet and $P$ is a probability distribution over $S$.
Since the information source introduces uncertainty in the random variable defined over the predefined alphabet, entropy is useful for measuring the degree of uncertainty (the degree of randomness) of the information source, and consequently for estimating the average number of units of information needed to encode all the possible values that the information source may emit.
Due to the properties of the logarithm, $H_k(X)$ has the following alternative (but equivalent) definitions:
$$
H_k(X) = -\sum_{x} P(X=x)\log_k\left(P(X=x)\right) = \sum_{x} P(X=x)\log_k\left(\frac{1}{P(X=x)}\right) = E\left[\log_k\frac{1}{P(X)}\right]
$$
Furthermore, due to the properties of the logarithm (specifically that $\log_a(b) = \frac{\log_c(b)}{\log_c(a)}$), it holds that $H_b(X) = \log_b(a) \cdot H_a(X)$. The proof is straightforward:
$$
H_b(X) = -\sum_{x} P(X=x)\log_b\left(P(X=x)\right) = -\sum_{x} P(X=x)\frac{\log_a\left(P(X=x)\right)}{\log_a(b)} = \log_b(a)\cdot H_a(X)
$$
Let the alphabet $S = \{a,b,c,d\}$ with a probability distribution $P = \left\{\frac{1}{2},\frac{1}{4},\frac{1}{8},\frac{1}{8}\right\}$. The entropy of the random variable $X$ defined by the alphabet $S$ and the probability distribution $P$ is:
$$
H(X) = \frac{1}{2}\log_2(2) + \frac{1}{4}\log_2(4) + \frac{1}{8}\log_2(8) + \frac{1}{8}\log_2(8) = \frac{1}{2} + \frac{1}{2} + \frac{3}{8} + \frac{3}{8} = 1.75
$$
The entropy of the random variable $X$ is 1.75 bits.
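This value can be verified with a small Python sketch (the `entropy` helper below is just an illustrative implementation of the formula above, not a library function); it also checks the change-of-base relation $H_e(X) = \log_e(2)\cdot H_2(X)$:

```python
import numpy as np

def entropy(P, base=2):
    # H_k(X) = -sum over x of P(x) * log_k(P(x)); outcomes with P(x) = 0 contribute nothing
    P = np.asarray(P, dtype=float)
    P = P[P > 0]
    return -np.sum(P * np.log(P)) / np.log(base)

P = [1 / 2, 1 / 4, 1 / 8, 1 / 8]
print(entropy(P))                 # 1.75 bits
print(entropy(P, base=np.e))      # ~1.213 nats
print(np.log(2) * entropy(P))     # same value: H_e(X) = log_e(2) * H_2(X)
```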
An interesting property of entropy is the following: Let $X$ be a random variable over the alphabet $\{x_1,x_2,\dots,x_n\}$, then:
$$
0 \leq H(X) \leq \log(n)
$$
The entropy of a random variable $X$ is bounded by $0$ and $\log(n)$, where $n$ is the number of elements in the alphabet of the random variable. Entropy is maximized when the probability distribution is uniform, and minimized when the probability distribution is degenerate (i.e., when a single value of the alphabet has probability 1 and all other values have probability 0).
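Both extremes can be checked with the same kind of illustrative `entropy` helper as above (the alphabet size $n = 4$ is an arbitrary choice):

```python
import numpy as np

def entropy(P, base=2):
    P = np.asarray(P, dtype=float)
    P = P[P > 0]
    return -np.sum(P * np.log(P)) / np.log(base)

n = 4
uniform = np.full(n, 1 / n)            # every symbol equally likely
degenerate = np.array([1.0, 0, 0, 0])  # one symbol has probability 1

print(entropy(uniform))      # 2.0 = log2(4), the upper bound
print(entropy(degenerate))   # 0.0, the lower bound
```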
Example 2: Entropy of a Bernoulli Distributed Random Variable
Let $X$ be a Bernoulli distributed random variable with parameter $p$ (i.e., $X \sim \text{Bernoulli}(p)$, which means that our random variable measures "the number of successes in a Bernoulli experiment"). The probability function of $X$ is:
$$
P(X=1) = p, \qquad P(X=0) = 1-p
$$
so its entropy (in bits) is $H(X) = -p\log_2(p) - (1-p)\log_2(1-p)$.
If we represent the entropy of the random variable $X$ as a function of $p$, we obtain the following graph:
What does this mean? The entropy of a Bernoulli random variable is maximized when $p=0.5$, that is, when the probability distribution is uniform. This makes sense because in a uniform distribution, all values of the alphabet have the same probability of occurring, and therefore there is more uncertainty about the value that the random variable will take. On the other hand, the entropy of a Bernoulli random variable is minimized when $p=0$ or $p=1$, that is, when the probability distribution is degenerate. This also makes sense because in a degenerate distribution, only one value of the alphabet has probability 1 and the rest have probability 0, and therefore there is no uncertainty about the value that the random variable will take.
How would this be programmed in Python?
```python
import numpy as np
import matplotlib.pyplot as plt


def H_Bernoulli(p):
    # Entropy of a Bernoulli(p) variable: H(X) = -p*log2(p) - (1-p)*log2(1-p)
    h = (p - 1) * np.log2(1 - p) - p * np.log2(p)
    # Fix the NaN values that occur due to the logarithm of zero
    h[np.isnan(h)] = 0
    return h


plt.figure(figsize=(8, 6))
p = np.linspace(0, 1, 200)
plt.plot(p, H_Bernoulli(p), "r", linewidth=3)
plt.grid()
plt.xlim(0, 1)
plt.ylim(0, 1.005)
plt.ylabel(r"$H(X)$ (bits)", fontsize=16)
plt.xlabel(r"$p$", fontsize=16)
plt.tick_params(axis="both", which="major", labelsize=14)
```
Reminder: Joint and Conditional Probabilities
Given two random variables $X$ and $Y$, the joint probability of $X$ and $Y$ is the probability that both random variables take specific values. It is denoted as $P(X=x,Y=y)$, and is calculated as:
$$
P(X=x,Y=y) = P(Y=y|X=x)P(X=x)
$$
where $P(Y=y|X=x)$ is the conditional probability of $Y$ given $X$, i.e., the probability that $Y$ takes a specific value given that $X$ has taken a specific value.
Similarly, we can define the joint probability as:
$$
P(X=x,Y=y) = P(X=x|Y=y)P(Y=y)
$$
Equating both expressions leads to Bayes' Theorem:
$$
P(X=x|Y=y) = \frac{P(Y=y|X=x)P(X=x)}{P(Y=y)}
$$
Additionally, if the random variables $X$ and $Y$ are independent, then the joint probability of $X$ and $Y$ is equal to the product of the marginal probabilities of $X$ and $Y$:
$$
P(X=x,Y=y) = P(X=x)P(Y=y)
$$
Why? Because if $X$ and $Y$ are independent, the probability of $Y$ taking a specific value does not depend on $X$ taking a specific value, and vice versa. Hence, $P(Y=y|X=x) = P(Y=y)$ and $P(X=x|Y=y) = P(X=x)$.
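The sketch below checks these relations on a small, made-up joint distribution of two binary random variables (the numbers in `P_XY` are arbitrary, chosen only for illustration):

```python
import numpy as np

# Hypothetical joint distribution P(X=x, Y=y): rows are x in {0, 1}, columns are y in {0, 1}
P_XY = np.array([[0.30, 0.20],
                 [0.10, 0.40]])

P_X = P_XY.sum(axis=1)   # marginal P(X=x)
P_Y = P_XY.sum(axis=0)   # marginal P(Y=y)

# Conditional probability P(Y=y | X=x) = P(X=x, Y=y) / P(X=x)
P_Y_given_X = P_XY / P_X[:, None]

# Factorization P(X=x, Y=y) = P(Y=y | X=x) * P(X=x)
print(np.allclose(P_XY, P_Y_given_X * P_X[:, None]))    # True

# Bayes' theorem: P(X=x | Y=y) = P(Y=y | X=x) * P(X=x) / P(Y=y)
P_X_given_Y = P_Y_given_X * P_X[:, None] / P_Y[None, :]
print(np.allclose(P_X_given_Y.sum(axis=0), 1.0))        # True: each column is a distribution

# Independence would require P(X=x, Y=y) = P(X=x) * P(Y=y); not true for this table
print(np.allclose(P_XY, np.outer(P_X, P_Y)))            # False
```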
Joint Entropy and Conditional Entropy
So far, we have defined the entropy of a single random variable. But in many problems of information theory, we are interested in the entropy of two or more random variables.
The joint entropy of two random variables $X$ and $Y$ is defined as the entropy of the random variable $(X,Y)$, which is a random variable that takes values in the Cartesian product (i.e., all possible pairs) of the alphabets of $X$ and $Y$. The joint entropy of $X$ and $Y$ is denoted as $H(X,Y)$, and is calculated as follows:
$$
H(X,Y) = -\sum_{x \in S_X}\sum_{y \in S_Y} P(X=x,Y=y)\log\left(P(X=x,Y=y)\right) = -E\left[\log P(X,Y)\right]
$$
In some cases, this notation is clearer than the notation with the alphabets $S_X$ and $S_Y$. Usually $S_X = S_Y$, meaning we are using the same alphabet for both random variables.
Similarly to joint entropy, we can define the conditional entropy of a random variable $X$ given another random variable $Y$. The conditional entropy of $X$ given $Y$ is denoted as $H(X|Y)$, and is calculated as follows:
$$
H(X|Y) = -\sum_{x \in S_X}\sum_{y \in S_Y} P(X=x,Y=y)\log\left(P(X=x|Y=y)\right) = \sum_{y \in S_Y} P(Y=y)\, H(X|Y=y)
$$
As can be seen from these definitions, the properties of joint entropy arise from working with the joint and conditional probabilities of the random variables $X$ and $Y$.
One of these properties is:
$$
H(X,Y) = H(X) + H(Y|X) = H(Y) + H(X|Y)
$$
This property is known as the chain rule of entropy, and it can be derived from the definition of joint and conditional entropy:
$$
\begin{aligned}
H(X,Y) &= -\sum_{x,y} P(X=x,Y=y)\log\left(P(X=x,Y=y)\right) \\
&= -\sum_{x,y} P(X=x,Y=y)\log\left(P(X=x)P(Y=y|X=x)\right) \\
&= -\sum_{x,y} P(X=x,Y=y)\log\left(P(X=x)\right) - \sum_{x,y} P(X=x,Y=y)\log\left(P(Y=y|X=x)\right) \\
&= H(X) + H(Y|X)
\end{aligned}
$$
In the same way, we can prove that $H(X,Y) = H(Y) + H(X|Y)$.
From this result, we can derive the following inequalities for the random variables $X$ and $Y$:
$H(X,Y) \leq H(X) + H(Y)$
$H(X,Y) = H(X) + H(Y)$ (if $X$ and $Y$ are independent)
A simple way to think about these properties is to think about joint and conditional probabilities, from which we derive similar inequalities. However, in the case of entropy, due to the logarithm, the inequalities involve sums rather than products.
A corollary of the above properties is:
$H(X|Y) \leq H(X)$
$H(X|Y) = H(X)$ (if $X$ and $Y$ are independent)
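These relations can be verified numerically with a short sketch (using the same arbitrary joint table introduced earlier; the `entropy` helper is illustrative, not a library function):

```python
import numpy as np

def entropy(P):
    # Entropy in bits of any (possibly multidimensional) array of probabilities
    P = np.asarray(P, dtype=float).ravel()
    P = P[P > 0]
    return -np.sum(P * np.log2(P))

# Same arbitrary joint distribution as before
P_XY = np.array([[0.30, 0.20],
                 [0.10, 0.40]])
P_X = P_XY.sum(axis=1)
P_Y = P_XY.sum(axis=0)

H_XY = entropy(P_XY)   # joint entropy H(X,Y)
H_X = entropy(P_X)
H_Y = entropy(P_Y)

# Conditional entropy computed directly: H(Y|X) = sum over x of P(X=x) * H(Y | X=x)
P_Y_given_X = P_XY / P_X[:, None]
H_Y_given_X = np.sum(P_X * np.array([entropy(row) for row in P_Y_given_X]))

print(np.isclose(H_XY, H_X + H_Y_given_X))   # True: chain rule H(X,Y) = H(X) + H(Y|X)
print(H_XY <= H_X + H_Y)                     # True: H(X,Y) <= H(X) + H(Y)
print(H_XY - H_Y <= H_X)                     # True: H(X|Y) <= H(X)
```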
Returning to the chain rule, it can be extended to the case of more than two variables, and is defined as:
$$
H(X_1,X_2,\dots,X_n) = \sum_{i=1}^{n} H(X_i|X_{i-1},\dots,X_1)
$$
Relative Entropy
Relative entropy, also known as Kullback-Leibler divergence, is a measure of the difference between two probability distributions (the amount of information needed to describe one probability distribution using another). The relative entropy of two probability distributions $P$ and $Q$ is denoted as $D(P||Q)$, and is calculated as:
$$
D(P||Q) = \sum_{x} P(x)\log\left(\frac{P(x)}{Q(x)}\right)
$$
Relative entropy is always non-negative, and is equal to zero if and only if $P$ and $Q$ are equal.
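A minimal sketch of the formula (the `kl_divergence` helper is just an illustrative name; `P` and `Q` are example distributions, with `Q` uniform):

```python
import numpy as np

def kl_divergence(P, Q):
    # D(P||Q) = sum over x of P(x) * log2(P(x) / Q(x)); terms with P(x) = 0 contribute nothing
    P = np.asarray(P, dtype=float)
    Q = np.asarray(Q, dtype=float)
    mask = P > 0
    return np.sum(P[mask] * np.log2(P[mask] / Q[mask]))

P = [1 / 2, 1 / 4, 1 / 8, 1 / 8]
Q = [1 / 4, 1 / 4, 1 / 4, 1 / 4]

print(kl_divergence(P, Q))   # 0.25 bits; against the uniform Q this equals log2(4) - H(P) = 2 - 1.75
print(kl_divergence(P, P))   # 0.0: relative entropy is zero when both distributions are equal
```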
Mutual Information
The mutual information of two random variables $X$ and $Y$ is a measure of the amount of information that one random variable provides about the other. The mutual information of $X$ and $Y$ is denoted as $I(X;Y)$, and is calculated as:
$$
I(X;Y) = \sum_{x}\sum_{y} P(X=x,Y=y)\log\left(\frac{P(X=x,Y=y)}{P(X=x)P(Y=y)}\right)
$$
Mutual information is always non-negative, and it is equal to zero if and only if $X$ and $Y$ are independent. Mutual information can also be thought of as the relative entropy between the joint distribution of two random variables and the product of their marginal distributions.
From the definition of mutual information, the following properties can be deduced:
$I(X;Y) = H(X) - H(X|Y) = H(Y) - H(Y|X)$
$I(X;Y) = H(X) + H(Y) - H(X,Y)$
$I(X;Y) = I(Y;X)$
$I(X;X) = H(X)$
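These properties can also be checked numerically; the sketch below computes $I(X;Y)$ for the same arbitrary joint table used earlier, both from its definition as a relative entropy and from the identity $I(X;Y) = H(X) + H(Y) - H(X,Y)$:

```python
import numpy as np

def entropy(P):
    P = np.asarray(P, dtype=float).ravel()
    P = P[P > 0]
    return -np.sum(P * np.log2(P))

# Same arbitrary joint distribution as before
P_XY = np.array([[0.30, 0.20],
                 [0.10, 0.40]])
P_X = P_XY.sum(axis=1)
P_Y = P_XY.sum(axis=0)

# I(X;Y) as the relative entropy between P(X,Y) and P(X)P(Y)
I_XY = np.sum(P_XY * np.log2(P_XY / np.outer(P_X, P_Y)))

print(I_XY)                                                            # ~0.1245 bits
print(np.isclose(I_XY, entropy(P_X) + entropy(P_Y) - entropy(P_XY)))   # True
```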
Graphically, these properties are usually represented with a Venn-style diagram: $H(X)$ and $H(Y)$ are two overlapping regions, their intersection is the mutual information $I(X;Y)$, their union is the joint entropy $H(X,Y)$, and the non-overlapping parts are the conditional entropies $H(X|Y)$ and $H(Y|X)$.
The Continuous Case
Example 1: Uniform Distribution
Let the random variable $X$ be uniformly distributed between $0$ and $a$. Its probability density function is:
$$
f(x) = \begin{cases} \frac{1}{a} & \text{if } 0 \leq x \leq a \\ 0 & \text{otherwise} \end{cases}
$$