Peano arithmetic is enough, because Peano arithmetic encodes computation
This is longer than I expected, and was made longer still by some browser crashes. I'd been idly thinking of writing these ideas up, but hadn't, for these reasons:
- It is a lot of work.
- What I have to say is obvious to logicians, and they would consider the detour into programming to be only a distraction.
- Computer programmers who can appreciate the programming detour are mostly not that interested in logic.
But this question hits the perfect intersection, and so I'll outline my thoughts in some detail. First I will answer the question in detail, assuming that anything you can do by a mechanical process (for example on a computer) can be done in PA. Then I will explain how PA encodes computation by outlining how to bootstrap Lisp from PA. If you're curious about bootstrapping Lisp, search for "Prelude". Otherwise, read on.
To answer this question, it suffices for PA to be able to construct, for any particular $n$, a proof that $G(n)$ terminates, and to be able to prove that this construction works. I will show this by demonstrating that PA can construct a proof of length $O(\log^*(n) \log(\log^*(n)))$, where the funny function is the very slowly growing iterated logarithm. In general, it takes more work to write out $n$ than it does to write out a partial proof of Goodstein's theorem which covers $n$!
But since larger values of $n$ need longer proofs, this doesn't lead to a proof in PA of Goodstein's theorem.
First a detour into the roots of Goodstein's theorem.
What are ordinals?
In John von Neumann's famous construction, you get ordinals in ZFC from the following rules.
- Every ordinal is a set.
- The empty set $\{\}$ is an ordinal, we call it $0$.
- If $\text{ord}$ is an ordinal, then so is $\text{ord} \cup \{\text{ord}\}$. (In Lisp notation we can call that $(s \, \text{ord})$.)
- If $X$ is a set of ordinals, then the union of $X$ is also an ordinal.
In this construction, the ordinals have an order $<$ created by the rule that $x < y$ if and only if $x \in y$.
So the ordinals start off with $0 = \{\},\; 1 = \{0\},\; 2 = \{0, 1\},\; \ldots,\; \omega = \{0, 1, 2, \ldots\},\; \omega + 1 = \{0, 1, 2, \ldots, \omega\}$ and so on. Where "and so on" quickly boggles the mind.
The Goodstein Sequence is built around a hereditary base notation. This notation represents an ordinal written out in Cantor normal form.
Cantor normal form works like this. Any ordinal $\text{ord}$ can be written out uniquely in the form $((n_1, \text{ord}_1), (n_2, \text{ord}_2), \ldots, (n_k, \text{ord}_k))$ where each $n_i$ is a positive natural number, each $\text{ord}_i$ is an ordinal, and $\text{ord}_1 > \text{ord}_2 > \ldots > \text{ord}_k$.
I will not prove this result. Nor will I prove that this notation always represents an ordinal. But the following example gives you an idea of how it works out in practice.
In this notation, the empty list, $()$, represents the ordinal $0$. And $((n_1, \text{ord}_1), (n_2, \text{ord}_2), \ldots, (n_k, \text{ord}_k))$ represents $n_1 \omega^{\text{ord}_1} + n_2 \omega^{\text{ord}_2} + \ldots + n_k \omega^{\text{ord}_k}$.
For example, we can write $\omega^{\omega}$ as $((1, \omega))$. Since $\omega$ itself is $((1, 1))$, this expands to $((1, ((1, 1))))$.
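For a more typical example, the ordinal $3\omega^2 + \omega + 4$ is $((3, 2), (1, 1), (4, 0))$; writing the exponents $2$, $1$ and $0$ out in the same notation turns this into $((3, ((2, ()))), (1, ((1, ()))), (4, ()))$.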
In this notation, $<$ is calculated lexicographically as follows.
To compare $((n_1, \text{ord}_1), (n_2, \text{ord}_2), \ldots, (n_k, \text{ord}_k))$ to $((m_1, \text{ord2}_1), (m_2, \text{ord2}_2), \ldots, (m_j, \text{ord2}_j))$, first we compare $\text{ord}_1$ to $\text{ord2}_1$, then we compare $n_1$ to $m_1$, then we compare $\text{ord}_2$ to $\text{ord2}_2$, then we compare $n_2$ to $m_2$, and so on. We stop as soon as any comparison shows that one is larger than the other, or as soon as one sequence runs out before the other. (Shorter is smaller.)
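For example, $\omega^2 + 5 = ((1, 2), (5, 0))$ and $\omega^2 + \omega = ((1, 2), (1, 1))$ tie on their first terms, so we compare the second exponents. Since $0 < 1$, we conclude that $\omega^2 + 5 < \omega^2 + \omega$.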
Now go back and look at the Goodstein Sequence, and convince yourself that hereditary base notation represents ordinals written out in Cantor normal form.
About induction.
The fifth axiom of PA gives us induction. Here is what it says.
Suppose that $S$ is a statement that we can make about natural numbers. Suppose further that we can prove two things:
- $S$ is true of $0$.
- If $S$ is true of $n$, then $S$ is true of $(s \, n)$. (I will use Lisp notation a lot in this answer. I'm just getting you used to it right now.)
Then $S$ is true of all natural numbers.
From induction, PA can recursively define $<$ and prove that it is well-defined. And it can also prove the theorem that strong induction works:
Strong Induction: Suppose that $S$ is a statement about the natural numbers. Suppose that we can prove that, for every natural number $n$, if $S$ is true for all natural numbers less than $n$, then it must be true for $n$.
Then $S$ is true for all natural numbers.
Note that no specific reference is made to $0$. But $0$ works because there are no natural numbers that are less than $0$, and every possible statement is true about all members of the empty set. That said, it is often convenient in proofs to deal with $0$ as a special case.
In ZFC, we can go further. We can prove that strong induction works for the ordinals. The result is a kind of induction that goes up to infinite ordinals, so we call it transfinite induction.
I will demonstrate how transfinite induction works with two proofs about things written out in Cantor normal form. (Remember, I told you that these are ordinals, but I didn't prove that they actually are ordinals. So these theorems do not immediately follow for us from transfinite induction.)
Theorem 1: Any descending sequence of numbers in Cantor normal form must be of finite length.
Theorem 2: We can prove statements about numbers in Cantor normal form with transfinite induction.
Note that from the observation that hereditary base notation is something written out in Cantor normal form, together with Theorem 1, we can prove Goodstein's theorem. We will need Theorem 2 later.
Here are the proofs.
Proof of Theorem 1: Suppose that we have a descending sequence of numbers written in Cantor normal form. If it is an empty sequence, then the theorem is true, so let's assume that it is a nonempty sequence. Let's look at the first element in the sequence.
If the first entry is the empty list, then the list cannot have more members, and the theorem is true of this sequence. Therefore we may assume that the first entry has a first term of the form $(n, \text{ord})$ with $n$ a positive integer, and $\text{ord}$ an ordinal. And, by the induction hypothesis, we know that all descending sequences that start with a smaller ordinal, or with the same ordinal and a smaller natural number, must be finite in length.
We shall break the original sequence into two descending sequences, with the second possibly empty. The first sequence is the members of the sequence whose entries have $(n, \text{ord})$ as their first term. The second sequence is the members that don't.
All of the members of the first sequence can be mapped to another descending sequence that just has that common first term removed.
The second sequence might be empty, in which case it has length $0$, which is finite.
Either sequence could start with the element $()$, in which case it has length $1$, which is finite.
Otherwise each sequence must start with an element whose first term is of the form $(m, \text{ord2})$, with either $\text{ord2} < \text{ord}$, or with $\text{ord2} = \text{ord}$ and $m < n$. In which case, by the induction hypothesis, the sequence must have finite length.
Since the overall sequence is the combination of two finite sequences, it must also be finite.
And now that we have the first theorem, we can prove the second fairly easily.
Proof of Theorem 2: Suppose that we have a statement $S$, and we have shown that if this statement is true for all things written in Cantor normal form less than a given one, then it is true of that one as well.
Suppose further that we have $x_1$ in Cantor normal form, and statement $S$ is false for $x_1$.
In order for $S$ to be false for $x_1$, there must exist a smaller $x_2$ for which $S$ is false. And below $x_2$ there must be a smaller $x_3$, and so on. And so we get a descending sequence of things written in Cantor normal form.
By Theorem 1, this must be a finite sequence. It must wind up as $(x_1, x_2, \ldots, x_k)$, unable to be extended any further. By construction, the statement $S$ is false for $x_k$. By the induction hypothesis, $S$ must be true for $x_k$. This is a contradiction, and so $x_1$ cannot exist.
If $S$ can't be false for anything written in Cantor normal form, then $S$ is true of all of them. And so the theorem is proven.
Transfinite Induction in PA.
PA cannot prove that transfinite induction works for all ordinals. But it can prove it for some ordinals.
First of all, PA can prove that strong induction works. This is transfinite induction for all members of $\omega$.
If you look at the proofs of theorems 1 and 2, you'll see that from transfinite induction on $\omega$, PA can prove transfinite induction for $\omega^{\omega}$.
If you look at the proofs of theorems 1 and 2, you'll see that from transfinite induction on $\omega^{\omega}$, PA can prove transfinite induction for $\omega^{\omega^{\omega}}$.
If you look at the proofs of theorems 1 and 2, you'll see that from transfinite induction on $\omega^{\omega^{\omega}}$, PA can prove transfinite induction for $\omega^{\omega^{\omega^{\omega}}}$.
And so on. Absolutely nothing changes between these proofs other than the height of the tower of $\omega$s. Therefore they can be produced purely mechanically. Because we are writing out that tower each time, the $m$th such proof winds up with length $O(m)$. And to get there we need $m$ proofs. And so we see that a proof of this exists in PA for a tower of height $m$, and the proof is of length $O(m^2)$.
These proofs can be trivially shortened with a notation such as $\omega^{[m]}$ instead of a tower of height $m$. Writing out $m$ can be done in length $O(\log(m))$, so this makes each proof $O(\log(m))$. That makes the overall proof only $O(m \log(m))$.
And so for every ordinal in $\varepsilon_0$, there exists a proof of transfinite induction in PA for that ordinal. But, combined, these proofs are of infinite length. And so they don't give a proof of transfinite induction for $\varepsilon_0$. Which is a good thing, because from transfinite induction on $\varepsilon_0$ it is possible to prove that PA is consistent. So if PA proved transfinite induction on $\varepsilon_0$, then it would run afoul of Gödel's second incompleteness theorem, and we can't have that!
Note that for any given $n$, we only need to prove transfinite induction up to the height of the tower in its hereditary base notation. That height is $O(\log^*(n))$, where $\log^*$ is the iterated logarithm: a function that PA can easily compute, and which grows extremely slowly. This is exactly what I promised I would show at the start.
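To get a feel for how slowly it grows: $\log^*(65536) = 4$, because repeated $\log_2$ gives $65536 \to 16 \to 4 \to 2 \to 1$. Even $2^{65536}$, a number with nearly twenty thousand digits, is the tower $2^{2^{2^{2^2}}}$ of height $5$, so $\log^*(2^{65536}) = 5$.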
A bit more about what PA proves
Everything that I've described so far can be produced mechanically. A programmer can literally write out a program that takes $n$ and does the following:
- Prints proofs of common facts about PA.
- Prints a proof that $G(n)$ traces out a descending sequence in $\omega^{[m]}$ for some $m$.
- Traces out the computation of that $m$ and writes out a proof of what $m$ is.
- Writes out a proof of transfinite induction for $\omega^{[0]}$. (ie, just $0$.)
- Writes out a proof that transfinite induction for $\omega^{[i]}$ implies transfinite induction for $\omega^{[i+1]}$.
- For each $i$ from $0$ to $m-2$, writes out a proof of transfinite induction for $\omega^{[i+1]}$.
- Writes out a proof that transfinite induction on $\omega^{[m-1]}$ implies that every descending sequence in $\omega^{[m]}$ has finite length.
- Writes out the tail concluding that $G(n)$ terminates.
PA now:
- Can prove that this procedure will terminate.
- Can prove that it produces a list of statements.
- Can prove that it starts with the Peano Axioms.
- Can prove that each statement follows logically from statements that came before.
- Can conclude by induction that all of the statements are proven.
- Can verify that the very last statement is, "$G(n)$ terminates."
And therefore, PA proves that PA proves that $G(n)$ terminates for any natural number $n$.
Read on if you want to think about how one could encode Lisp in PA.
Prelude.
When I talk about "encoding", this is not strictly a mathematical concept. Concepts exist in our minds, and not in mathematical symbols. We decide what things will mean. And they can mean anything. We can, as long as we keep our context straight, even overload meanings. So the same number can "mean" one thing one time, and another thing another time.
For a simple example, consider the signal that kicked off Paul Revere's famous ride. "One if by land, two if by sea." In this way a collection of lanterns, representing a single number, could mean "by land" or "by sea". But if Robert Newman had hung 42 lanterns up, it would have just caused confusion. Because Douglas Adams hadn't yet told us the answer to life, the universe, and everything. And computer scientists had not yet decided that in ASCII 42 means *. Since programmers know that * is the symbol for everything, the punchline to Douglas Adams' joke actually arrived before the joke was made. Yes, that is recursive humor. It seemed appropriate given that I will discuss bootstrapping.
Yes, I am a dad. Why do you ask?
Moving along...
Why Lisp?
I will be giving all of my programming examples in Lisp. That's because it is easy to explain and interpret.
Here is the explanation. The important thing that you have to remember is this. Most Lisp terms look like `(command and arguments)`. The first word is a `command` that describes what to do with `and arguments`. Usually the command will be a function, so we `eval` each argument in turn. But there are some weird, but important, exceptions. They usually make sense when you see them, and there aren't that many. For example:
; Any side effects only exist in your mind...
(defvar false 0)
(defvar true 1)
It would make no sense to try to `eval` variables while we are trying to define them, so we don't. This is an example of a special form. We will see others.
Incidentally, we just encoded Booleans as numbers.
Not all terms have to have parentheses. For example `6` is a perfectly valid Lisp term, and I think that you can guess what its value is. Now that we've defined the variables, `true` and `false` are valid terms as well.
We might as well throw in basic logic. If you can't figure out how these work, come back after reading the next section.
(defun not (x)
(if x
false
true))
(defun and (x y)
(if x
y
false))
(defun or (x y)
(if x
true
y))
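These are easier to trust after evaluating one or two by hand. For example, with the definitions above:

; Working from the inside out:
(not false) ; (if 0 false true) takes the on-false branch, giving true
(and true (not false)) ; becomes (if 1 1 false), giving 1, ie true
(or false false) ; becomes (if 0 true 0), giving 0, ie false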
Strictly speaking, a programming language doesn't really need comments. But Lisp has them, and they start with a `;`.
Lisp isn't just easy to explain to humans. When you're trying to bootstrap a programming language, one of the things that you have to do is have the computer parse the language. Consider something as simple as `1 * 2 + 3 * 4`. To properly get `14` out, your parser has to figure out operator precedence. That's surprisingly tricky.

But Lisp makes it easy. The equivalent expression is `(+ (* 1 2) (* 3 4))`. The first time you see this, you may need to work to figure out what it says. That's just because it is unfamiliar. But it is really easy to program a computer to look for balanced parentheses, then always look for the command at the first term.
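For example, evaluating from the inside out:

(+ (* 1 2) (* 3 4))
; the arguments evaluate first: (* 1 2) is 2 and (* 3 4) is 12
; leaving (+ 2 12), which is 14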
What is built into PA?
The Peano axioms give us the following useful foundation.
- The number `0`.
- The notion of a successor, which gives us the function `(s n)`. (You can think of it as `(+ n 1)`, though we actually will define `+` in terms of `s` and not the other way around.)
- The notion of "sameness", which lets us define a function `(= n m)` that returns `0` or `(s 0)` depending on whether the two are different or the same.
- The notion of a predecessor, which is defined for all numbers except `0`. This gives us the function `(p n)`. (But don't call it on `0`!)
- The ability to define things recursively, as long as the recursion can be proven to work by induction. An example of a recursion that doesn't work is `(defun growing (n) (growing (s n)))`. As you start trying to evaluate it, the numbers keep growing and growing, and you never get to the end!
- The ability to do something differently when we see a `0` or a `1`. This lets us build the special form `(if test on-true on-false)`. That allows for conditional logic like we used to build boolean logic.
And that's it. Is it enough?
Basic number theory
PA lets us define familiar basic functions, and prove their basic properties. Here is an example:
; It would be nice to know when one number is larger than another...
(defun < (n m)
(if (= m 0)
false
(if (= n 0)
true
(< (p n) (p m)))))
(Exercise for the reader. Prove by induction on `n` that this is a well-defined function for any pair of natural numbers `(n, m)`.)
That nested logic gets annoying, quickly. That's why Lisp defines `cond`. It lets us write that same function like this:
; It would be nice to know when one number is larger than another...
(defun < (n m)
(cond
(= m 0) false
(= n 0) true
(< (p n) (p m))))
Here `cond` is a macro. It just takes its arguments, rewrites them as the previous nested `if` statements, and then executes that with `eval`. On more complex chains of conditional logic, this becomes a big saving.
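Concretely, the `cond` in our `<` expands back into exactly the nested `if` version that we wrote first:

; What the cond macro produces from the body of <
(if (= m 0)
false
(if (= n 0)
true
(< (p n) (p m))))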
Here are some more basic functions that you would expect from elementary school. If you work out why these are the usual functions, you'll be well on your way to reading Lisp.
; These seem useful
(defun min (n m)
(if (< n m)
n
m))
(defun max (n m)
(if (< n m)
m
n))
; We should be able to add numbers...
(defun + (n m)
(if (= n 0)
m
(s (+ (p n) m))))
; After a while, you stop noticing that stack of closing parens.
; Multiplication seems important...
(defun * (n m)
(if (= n 0)
0
(+ (* (p n) m) m)))
; Luckily this code runs at the speed of thought
; Exponentiation anyone?
(defun pow (n m)
(if (= m 0)
1
(* n (pow n (p m)))))
; You may have to think faster...
; % gives you the remainder after division
(defun % (n m)
(cond
(= n 0) 0
(= (% (p n) m) (p m)) 0
(s (% (p n) m))))
; // is integer division, it drops the remainder
(defun // (n m)
(cond
(= n 0) 0
(= (% n m) 0) (s (// (p n) m))
(// (p n) m)))
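These definitions recurse downward from `n`, which can be hard to see at a glance. Here is `(% 5 3)` built up by hand, assuming the definitions above:

; (% 0 3) is 0 by the first clause
; (% 1 3): (% 0 3) is 0, which is not (p 3), so (s 0) = 1
; (% 2 3): (% 1 3) is 1, not (p 3), so (s 1) = 2
; (% 3 3): (% 2 3) is 2, which is (p 3), so 0
; (% 4 3): (% 3 3) is 0, not (p 3), so (s 0) = 1
; (% 5 3): (% 4 3) is 1, not (p 3), so (s 1) = 2
; And indeed, 5 = 3 + 2.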
If you prove why these functions have the properties you expect, then you'll be well on your way to understanding PA. For example, about `+` we can prove from the definition the following statements:
(= (+ 0 m) m)
(= (+ (s n) m) (s (+ n m)))
From those two statements, you can prove by induction that `(+ n m)` is always well-defined. (Meaning that for any pair of arguments, there is always one and only one possible answer.)
By induction, we can then prove the following two statements:
(= (+ n 0) n)
(= (+ n (s m)) (s (+ n m)))
From the fact that the first pair of statements uniquely defined what addition must be, the second pair of statements proves that `(= (+ n m) (+ m n))` - ie the commutative law holds.
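As a taste of how these inductions go, here is `(= (+ n 0) n)` in full. It is true for `0`, because `(= (+ 0 0) 0)` is an instance of the first defining statement. And if `(= (+ n 0) n)`, then `(+ (s n) 0)` is `(s (+ n 0))` by the second defining statement, which is `(s n)`. By induction, it is true for every `n`.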
From one we get two.
So far we are just doing basic arithmetic. Let's introduce some programming. Let's see how to get two numbers out of one number, or one number out of any two numbers.
The whole idea here is this. We can write any number in base 2. If we just look at the odd bits, we get one number that we can call the `head` of the pair. If we just look at the even bits, we get another number that we can call the `tail` of the pair. We can encode 2 numbers as one just by interleaving their bits.
I've had to be slightly clever, but the following functions do exactly that. (Unless I have a bug. You know how it goes.)
; I thought about calling these car and cdr...
; ...then decided that I'm not really THAT addicted to Lisp
(defun get-binary-head (n)
(if (= n 0)
0
(+ (% n 2)
(* (get-binary-head (// n 4)) 2))))
(defun get-binary-tail (n)
(get-binary-head (// n 2)))
; All pure functional programming - return the new structure.
(defun make-binary-head (n)
(if (= n 0)
0
(+ (% n 2)
(* (make-binary-head (// n 2)) 4))))
(defun make-binary-tail (n)
(* (make-binary-head n) 2))
(defun make-binary-pair (n m)
(+ (make-binary-head n) (make-binary-tail m)))
(defun set-binary-head (pair n)
(+ (make-binary-head n) (make-binary-tail (get-binary-tail pair))))
(defun set-binary-tail (pair n)
(+ (make-binary-head (get-binary-head pair)) (make-binary-tail n)))
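Here is one pair worked through by hand as a sanity check. (If these numbers don't match, then I have that bug.)

; (make-binary-head 3): 3 is binary 11, which spreads out to binary 101, ie 5
; (make-binary-tail 2): 2 is binary 10, which spreads out to binary 100
;   and then doubles to binary 1000, ie 8
; (make-binary-pair 3 2): 5 + 8 = 13, binary 1101
; (get-binary-head 13) then recovers 3, and (get-binary-tail 13) recovers 2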
In order to prove that these work in PA, you'll need to first prove the theorem of strong induction. This theorem says that to prove that a statement `S` is true of all natural numbers, it suffices to prove the following two statements:
- `S` is true of `0`.
- If `S` is true for all `n` with `(< n m)`, then `S` is true of `m`.
(This has to be a theorem, because PA did not start with a notion of "less than".)
From two, many.
As soon as a competent programmer sees the ability to make pairs, the next question is whether they can make linked lists. Because with linked lists, we can build anything!
The following construction shows that we can do that in PA.
; Calling 0 by the right name can remind us of what we mean.
(defvar nil 0)
; Let's have empty lists
(defun make-empty-binary-list ()
nil)
; Extend a list by adding a value on the start
; (Note the use of s to ensure that nonempty lists are never nil)
(defun add-binary-list-head (head list)
(make-binary-pair (s head) list))
; Be sure to undo the s in the last function!
(defun get-binary-list-head (list)
(p (get-binary-head list)))
; What's in a name?
(defun get-binary-list-tail (list)
(get-binary-tail list))
; How long is the list?
(defun binary-list-length (list)
(if (= list nil) 0
(s (binary-list-length (get-binary-list-tail list)))))
; Let's get any element!
(defun get-binary-list-at (list n)
(if (= n 0)
(get-binary-list-head list)
(get-binary-list-at (get-binary-tail list) (p n))))
; Let's insert an element anywhere
(defun insert-binary-list-at (list n value)
(if (= n 0)
(add-binary-list-head value list)
(add-binary-list-head
(get-binary-list-head list)
(insert-binary-list-at
(get-binary-list-tail list)
(p n)
value))))
; How big are these numbers now...?
; ...doesn't matter. There are always more!
; Let's remove an element anywhere
(defun remove-binary-list-at (list n)
(if (= n 0)
(get-binary-list-tail list)
(add-binary-list-head
(get-binary-list-head list)
(remove-binary-list-at (get-binary-list-tail list) (p n)))))
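And a quick check that the plumbing hangs together. (The name `two-element-list` is mine, purely for illustration.)

; Build the list (5 7), then poke at it
(defvar two-element-list
(add-binary-list-head 5 (add-binary-list-head 7 (make-empty-binary-list))))
(binary-list-length two-element-list) ; 2
(get-binary-list-at two-element-list 0) ; 5
(get-binary-list-at two-element-list 1) ; 7
(get-binary-list-at (remove-binary-list-at two-element-list 0) 0) ; 7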
From many, any.
At this point, any competent programmer will recognize that it is game over. Once we have numbers, pairs, and lists, we can build anything at all that we want. Stacks, queues, trees, text documents, virtual machines, and so on. All can be represented. All can be manipulated in any way that we can describe with a computer.
In fact we can build a single encoding that can include all of these, all at once.
First let's create a basic list of types. Our list might begin `(type-number, type-boolean, type-pair, type-list, ...)`. We can keep on adding to this list over time. For example we can use ASCII to create a `type-text`. And on it goes.
Second, every number will be mapped to a pair `(type, value)`, where `value` is expected to be of the right type. Some numbers won't actually map to something that works as a value of that type - that's OK. We have lots of numbers. But those that do map, will map to a completely unique thing. Our list starts off with:
| n | binary | pair | interpretation |
| --- | --- | --- | --- |
| 0 | 0 | (0, 0) | the number 0 |
| 1 | 1 | (1, 0) | false |
| 2 | 10 | (0, 1) | the number 1 |
| 3 | 11 | (1, 1) | true |
| 4 | 100 | (2, 0) | the pair (0, 0) |
| 5 | 101 | (3, 0) | the empty list () |
| 6 | 110 | (2, 1) | the pair (1, 0) |
| 7 | 111 | (3, 1) | the list (0) |
| 8 | 1000 | (0, 2) | the number 2 |
| 9 | 1001 | (1, 2) | INVALID BOOLEAN |
If we're going to work with this encoding, it makes sense to rename every function that we've been using so far by adding `PA-` to the start of the name. That frees up the functions that we have defined to work on things with type `type-number`. And now we can add this handy convenience function:
(defun type-of (n)
(PA-make-binary-pair type-type (PA-get-binary-head n)))
; Lisp fans might want to implement make-type...
; ...why did you need to make it a macro?
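For example, chasing the table above: `5` encodes the empty list, so `(type-of 5)` builds the pair `(type-type, type-list)` - assuming that `type-type` is just one more entry that we add to our list of types, this is the number that means "the type of lists".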
And now it is a tedious but straightforward exercise to build up our functions to understand Lisp, a Lisp virtual machine, and a Lisp interpreter that runs on that machine.
And, of course, Lisp is Turing complete. So we can then build any other kind of virtual machine, for any other language we want, and so on.
This gives us the ability to encode any possible computable procedure, any possible state we might encounter following that procedure, and to work out the exact state of the computation after any number of steps.
This is what I mean when I say that PA can encode computation.
PA also encodes proofs in PA
Logicians have created a formal language with a formal set of symbols called First Order Logic. A proof in this language is just a list of statements. Each statement is a single reasoning step. You are free to write down statements that are complete nonsense, but some make sense. You are allowed to write down a proof with invalid reasoning steps, but that is an invalid proof.
In other words, we can create a type `type-proof`. It will consist of a list of statements. Each statement is a list of characters. You can write a program that can verify whether any individual statement is valid. And so on. In short, it is possible to make something like the following program work.
(defun proof-step-is-valid (proof step)
...)
(defun proof-is-valid (proof)
(check-proof proof 0))
; Meta-programming with a vengeance!
(defun check-proves (axioms statement n)
; This defines a local variable
(let proof-found
(if
(not (= (type-of n) type-proof))
false ; not of type proof
; need another local variable
(let proof (value-of n)
; the code in here knows what proof is
(cond
; the data structure encoded in the value may be invalid!
(not (proof-is-well-formed proof)) false
(not (proof-is-valid proof)) false
(not (proof-assumes proof axioms)) false
(not (proof-concludes proof statement)) false
true)))
; The code here knows what proof-found is
(if proof-found true (check-proves axioms statement (PA-s n)))))
; It may be hard to figure out whether this returns...
(defun proves (axioms statement)
(check-proves axioms statement PA-0))
In other words if any proof of any statement from any axioms exists, there is a specific PA number that must encode it. There is a specific function that can find that proof. And so we see that all of logic is encoded in PA.
Incidentally, the kind of recursive self-reference that I just demonstrated in `proves` is what allows us to embed self-reference, which allows us to construct statements whose meaning is some variation on, "This statement is a lie." The impossibility of proving such statements true or false is what gives us things like Gödel incompleteness, the unsolvability of the Halting problem, and so on.
Note that Gödel managed to encode logic in PA without having to also encode computation. To a logician, embedding computation is just a giant detour. But as a programmer, I find this a very natural way to understand things.
If you've actually read to this point, you're likely to be a programmer. And hopefully you agree with my point of view.