Understanding Marginalization, Conditional Probability, and Double Sums with Dice

A step-by-step explanation of Judea Pearl’s early probability ideas: the law of total probability, conditional probability as reasoning under an assumption, and how a triangular grid makes a double sum visually obvious.

Marginalization and the law of total probability

Pearl starts from a simple idea: if an event A can happen through several mutually exclusive and exhaustive cases B₁, B₂, …, Bₙ, then the total probability of A is obtained by adding the joint probabilities across all cases.

P(A) = Σᵢ P(A, Bᵢ)

This is the law of total probability. When we sum over all the possible values of B, we are marginalizing over B, and the result P(A) is called the marginal probability of A.

For two cases only, this becomes P(A) = P(A, B) + P(A, ¬B). The general sum is just the same idea extended to many cases.

Linking equation (1.4) to the dice example

Let A be the event “the two dice show the same number.” First split A on a single proposition B, say B = “the first die shows 1”:

P(A) = P(A, B) + P(A, ¬B)

Now interpret both terms. P(A, B) is the probability that both dice show 1, a single outcome out of 36, so P(A, B) = 1/36. P(A, ¬B) is the probability that the dice are equal while the first die is not 1, which covers the five outcomes (2,2), (3,3), (4,4), (5,5), (6,6), so P(A, ¬B) = 5/36. Adding the two terms:

P(A) = 1/36 + 5/36 = 6/36 = 1/6

If we split the event into all six possibilities for the first die instead, then we get the more general form:

P(A) = Σᵢ₌₁⁶ P(A, Bᵢ)

where Bᵢ means “the first die equals i.” Each term P(A, Bᵢ) is the probability that both dice show i, a single outcome, so each term equals 1/36:

P(A) = 6 × (1/36) = 1/6
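Both decompositions are easy to verify by enumerating all 36 equally likely outcomes. A minimal sketch in Python, using exact fractions so no rounding obscures the identities (the event names mirror the text):

```python
from fractions import Fraction
from itertools import product

# All 36 equally likely outcomes (first die, second die).
outcomes = list(product(range(1, 7), repeat=2))
p = Fraction(1, 36)  # probability of each single outcome

# A: the two dice show the same number.
p_A = sum(p for (x, y) in outcomes if x == y)

# Two-case split with B = "the first die shows 1".
p_A_and_B = sum(p for (x, y) in outcomes if x == y and x == 1)
p_A_and_notB = sum(p for (x, y) in outcomes if x == y and x != 1)
assert p_A_and_B + p_A_and_notB == p_A

# Six-case split with B_i = "the first die equals i".
p_marginal = sum(sum(p for (x, y) in outcomes if x == i and x == y)
                 for i in range(1, 7))
assert p_marginal == p_A == Fraction(1, 6)
```

Summing the joint probabilities over every value of B recovers exactly the marginal P(A) = 1/6 computed directly.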

Conditional probability as belief under an assumption

Pearl emphasizes that conditional probabilities are the basic objects of Bayesian reasoning. The expression P(A | B) means: how much do we believe A if B is known with certainty?

P(A | B)

If learning that B is true does not change our belief in A, then A and B are independent:

P(A | B) = P(A)

More generally, if:

P(A | B, C) = P(A | C)

then A and B are conditionally independent given C. Once C is known, learning B adds no extra information about A.
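These definitions can be checked numerically. In the sketch below, A = “the first die is even” and B = “the second die shows 3” are illustrative events chosen for this example (not from the text); since the dice are independent, learning B leaves the belief in A unchanged:

```python
from fractions import Fraction
from itertools import product

outcomes = list(product(range(1, 7), repeat=2))
p = Fraction(1, 36)

def prob(event):
    """P(event), where event is a predicate on an outcome (x, y)."""
    return sum(p for o in outcomes if event(o))

def cond(event, given):
    """P(event | given) = P(event, given) / P(given)."""
    return prob(lambda o: event(o) and given(o)) / prob(given)

A = lambda o: o[0] % 2 == 0   # first die is even
B = lambda o: o[1] == 3       # second die shows 3

# Independence: learning B does not change the belief in A.
assert cond(A, B) == prob(A) == Fraction(1, 2)
```

Conditional independence given some C would be checked the same way, by comparing `cond` restricted to outcomes where C holds.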

Why Pearl treats conditional probability as more basic

Traditionally, conditional probability is defined from the joint probability:

P(A | B) = P(A, B) / P(B)

Pearl reverses the emphasis. He argues that what we usually know empirically is not the joint probability table, but statements of the form “given this context, how likely is that event?” In this view, B acts like a context or frame of knowledge, and A | B means the event A within the world where B is taken as true.

Joint probabilities are then built from conditional probabilities by the product rule:

P(A, B) = P(A | B) P(B)

For the dice example, if A is “the dice are equal” and Bᵢ is “the first die is i,” then:

P(A, Bᵢ) = P(A | Bᵢ) P(Bᵢ)

Since P(Bᵢ) = 1/6 and P(A | Bᵢ) = 1/6 because the second die must also equal i, we get:

P(A, Bᵢ) = (1/6)(1/6) = 1/36
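The product rule can be confirmed term by term for every i, again by enumerating the 36 outcomes (a sketch; A and Bᵢ are the events from the text):

```python
from fractions import Fraction
from itertools import product

outcomes = list(product(range(1, 7), repeat=2))
p = Fraction(1, 36)

for i in range(1, 7):
    p_Bi = sum(p for (x, y) in outcomes if x == i)                   # P(B_i)
    p_A_and_Bi = sum(p for (x, y) in outcomes if x == i and x == y)  # P(A, B_i)
    p_A_given_Bi = p_A_and_Bi / p_Bi                                 # P(A | B_i)
    assert p_Bi == Fraction(1, 6)
    assert p_A_given_Bi == Fraction(1, 6)
    # Product rule: the joint factors into conditional times prior.
    assert p_A_and_Bi == p_A_given_Bi * p_Bi == Fraction(1, 36)
```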

The derivation of P(X > Y) using summation notation

Now let X be the first die and Y the second die. We want to compute the probability that the first die is greater than the second:

P(X > Y)
  1. Condition on all possible values of X:

    P(X > Y) = Σᵢ₌₁⁶ P(X > Y | X = i) P(X = i)
  2. Because the first die is fair:

    P(X = i) = 1/6
    P(X > Y) = (1/6) Σᵢ₌₁⁶ P(X > Y | X = i)
  3. If X = i, then the event X > Y becomes Y < i:

    P(X > Y | X = i) = P(Y < i)
    P(X > Y) = (1/6) Σᵢ₌₁⁶ P(Y < i)
  4. Expand P(Y < i) as a sum over all values of Y smaller than i:

    P(Y < i) = Σⱼ₌₁ⁱ⁻¹ P(Y = j)
  5. Because the second die is fair:

    P(Y = j) = 1/6
    P(X > Y) = (1/6) Σᵢ₌₁⁶ ( Σⱼ₌₁ⁱ⁻¹ 1/6 )
    P(X > Y) = (1/36) Σᵢ₌₁⁶ Σⱼ₌₁ⁱ⁻¹ 1
  6. The inner sum adds 1 exactly i - 1 times, so:

    Σⱼ₌₁ⁱ⁻¹ 1 = i - 1
    Σᵢ₌₁⁶ Σⱼ₌₁ⁱ⁻¹ 1 = Σᵢ₌₁⁶ (i - 1)
  7. Compute the final sum:

    Σᵢ₌₁⁶ (i - 1) = 0 + 1 + 2 + 3 + 4 + 5 = 15
    P(X > Y) = 15/36 = 5/12
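The seven steps above collapse into a few lines of code that compute P(X > Y) two ways, by direct enumeration and by the nested-sum formula, and confirm they agree (a sketch; variable names follow the derivation):

```python
from fractions import Fraction
from itertools import product

# Direct enumeration over all 36 outcomes.
p = Fraction(1, 36)
p_direct = sum(p for x, y in product(range(1, 7), repeat=2) if x > y)

# The nested sum from the derivation:
# P(X > Y) = sum_i P(X = i) * sum_{j=1}^{i-1} P(Y = j).
p_formula = sum(Fraction(1, 6) * sum(Fraction(1, 6) for j in range(1, i))
                for i in range(1, 7))

assert p_direct == p_formula == Fraction(5, 12)
```

Note that `range(1, i)` is empty when i = 1, mirroring the fact that the inner sum contributes nothing for that term.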

Why the double sum becomes a single sum

The identity

Σᵢ₌₁⁶ Σⱼ₌₁ⁱ⁻¹ 1 = Σᵢ₌₁⁶ (i - 1)

simply means that for each fixed i, the inner sum counts how many values of j satisfy 1 ≤ j ≤ i - 1. There are exactly i - 1 such values. So the inner counting operation can be replaced by the number of terms being counted.
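The same counting argument works for any number of sides n, and the collapsed sum is the triangular number n(n − 1)/2. A quick check (n = 6 reproduces the 15 used in the text):

```python
def double_sum(n):
    # Literal nested sum: add 1 for every pair (i, j) with 1 <= j <= i - 1.
    return sum(sum(1 for j in range(1, i)) for i in range(1, n + 1))

def single_sum(n):
    # Collapsed form: the inner sum contributes exactly i - 1 terms.
    return sum(i - 1 for i in range(1, n + 1))

for n in range(1, 21):
    assert double_sum(n) == single_sum(n) == n * (n - 1) // 2

assert double_sum(6) == 15
```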

Geometric interpretation: a triangular grid

A very intuitive way to see the double sum is to place all 36 dice outcomes into a 6 × 6 grid. Let the row index be i = X and the column index be j = Y. The condition X > Y means exactly j < i, which corresponds to all cells below the diagonal.

        Y = 1  Y = 2  Y = 3  Y = 4  Y = 5  Y = 6
X = 1     0      0      0      0      0      0
X = 2     X      0      0      0      0      0
X = 3     X      X      0      0      0      0
X = 4     X      X      X      0      0      0
X = 5     X      X      X      X      0      0
X = 6     X      X      X      X      X      0

Row i contributes exactly i − 1 marked cells, so summing down the rows gives:

0 + 1 + 2 + 3 + 4 + 5 = 15

That is why:

Σᵢ₌₁⁶ Σⱼ₌₁ⁱ⁻¹ 1 = 15

and therefore:

P(X > Y) = 15/36 = 5/12

Final intuition

The outer sum runs over the possible values of the first die. The inner sum counts how many values of the second die satisfy the condition once the first die is fixed. This is the essence of assumption-based reasoning: fix a context, calculate what happens in that context, weight it by how likely the context is, and add everything together.

In Pearl’s language, conditional probability is reasoning inside a context, and marginalization is what you do when you combine all possible contexts back into one final belief.