Previous-Copy Compression: Foundations and Algebraic Digit Expansions
This page collects the foundational material for the online previous-copy model and its application to algebraic digit expansions.
Part I.A
Previous-Copy Compression and Algebraic Digit Expansions
Abstract
Let \(b\geq 2\) be an integer and let
be algebraic irrational, with \(a_n\in\{0,\ldots,b-1\}\). Write
for the prefix of length \(N\) of the base-\(b\) expansion of \(\alpha\).
We study the compressibility of \(w_N(\alpha)\) under a left-to-right previous-copy model. A phrase is either a literal symbol or an exact copy of an earlier substring; self-overlap is allowed, as in self-referential variants of LZ77. Let \(\oc(W)\) be the minimum number of phrases in such a parsing of a finite word \(W\). We prove
Consequently, every standard LZ77-type previous-factor parsing, self-referential or not, uses \(\omega(\log N)\) phrases on \(w_N(\alpha)\).
The proof combines a simple multiplicative-growth lemma with Diophantine approximation. A logarithmic online previous-copy parsing forces a copied phrase whose endpoint is a fixed multiplicative factor larger than its starting boundary. If the copy displacement tends to infinity, this gives a repetition forbidden by the Adamczewski--Bugeaud Diophantine exponent criterion. If the displacement is bounded, it gives too good an approximation by rationals with denominators supported on a fixed finite set of primes, contradicting Ridout's theorem.
We also prove a structured-noise variant using Nguyen's refined Diophantine exponent theorem. If copied phrases may differ from their sources on a vanishing proportion of positions contained in a bounded number of intervals, logarithmically many phrases still do not suffice.
Finally, we add an effective finite-prefix component. We introduce the linear extension complexity \(\lambda(W)\), the least \(C\) such that \(W\) extends to an infinite word of factor complexity at most \(Cn\). Using Bugeaud's effective theorem for long prefixes of algebraic numbers, we prove
Since small online-copy parsings and small string attractors both imply small linear extension complexity, this gives the same effective lower bound for the minimum string attractor size of \(w_N(\alpha)\), and hence for all compression measures which induce string attractors of comparable size. This part is weaker than the superlogarithmic LZ77 bound, but applies to offline repetitiveness measures outside the reach of the online argument.
1 Introduction
The base-\(b\) expansion of an algebraic irrational number is expected to be highly complex. This expectation cannot be usefully expressed through unrestricted Kolmogorov complexity of finite prefixes: algebraic numbers are computable, and hence their first \(N\) digits have descriptions of length \(O(\log N)\). One must instead work with restricted models of description or compression.
A fundamental theorem of Adamczewski and Bugeaud says that the base-\(b\) expansion of an algebraic irrational number cannot have low factor complexity. More precisely, if \(p(n,\alpha,b)\) denotes the number of distinct blocks of length \(n\) occurring in the base-\(b\) expansion of an algebraic irrational \(\alpha\), then
Their method rests on a Diophantine principle: sufficiently strong repetitions in the digit expansion produce rational approximations that are too good for algebraic irrational numbers.
The purpose of this note is to apply that principle to a concrete compression model. We consider finite prefixes under a left-to-right previous-copy parsing. This model is deliberately generous. Copy sources and lengths are not charged. A copied phrase is allowed to overlap its target. Therefore a lower bound in this model is not an artefact of bit-level encoding conventions.
The first main result is the following.
Theorem 1.1 (Exact online-copy lower bound).
Let \(b\geq 2\), and let
Then
Equivalently,
As a direct consequence, every standard LZ77-type previous-factor parsing of \(w_N(\alpha)\), self-referential or not, has \(\omega(\log N)\) phrases.
The result should be read carefully. It is not a lower bound for arbitrary straight-line programs, unrestricted grammar compression, bidirectional macro schemes, or offline copy systems. The online condition is essential in the proof. However, a second and independent part of the note gives effective lower bounds for several offline repetitiveness measures through a different route.
To state that route, define the linear extension complexity \(\lambda(W)\) of a finite word \(W\) to be the least \(C\) such that \(W\) is the prefix of an infinite word \(x\) satisfying
where \(p_x(n)\) is the number of distinct factors of length \(n\) of \(x\). We prove that both online-copy parsings and string attractors control \(\lambda(W)\):
where \(\gamma(W)\) is the minimum string attractor size of \(W\). Combining this with Bugeaud's effective finite-prefix theorem gives the second main result.
Theorem 1.2 (Effective finite-prefix lower bound).
Let \(b\geq 2\), and let
Then, for all sufficiently large \(N\),
Consequently,
and the same lower bound holds for every compression measure which is known to induce a string attractor of size bounded linearly by that measure.
This effective bound is weaker than \(\oc(w_N(\alpha))=\omega(\log N)\), but it applies to offline measures for which the online multiplicative-growth argument does not apply.
We also prove a structured noisy version of the online theorem. There, a copied phrase may differ from its source on a small exceptional set, provided the exceptional set is contained in a bounded number of intervals. The unbounded-displacement case is handled by Nguyen's refined Diophantine exponent theorem; the bounded-displacement case is handled directly by Ridout's theorem.
The final section isolates a quantitative profile
measuring the strongest exact repetition in the digit sequence beyond scale \(T\). The exact theorem implies qualitatively that
Moreover, any explicit decay bound on \(\Theta_\alpha(T)\) would immediately imply an explicit superlogarithmic lower bound for \(\oc(w_N(\alpha))\). This reduces the main quantitative breakthrough problem to a finite Diophantine problem, closely related to a moving-period version of Ridout's theorem.
Throughout the note, logarithms are natural. All implicit constants may depend on fixed parameters such as \(\alpha\), its degree and height, and, where relevant, \(b\), although the effective theorem quoted from Bugeaud is itself independent of the base.
2 Diophantine inputs
Let
be an infinite word over a finite set of integers. For integers \(i\leq j\), write
For an integer \(b\geq 2\), put
We use three Diophantine inputs: the Adamczewski--Bugeaud repetition criterion, Ridout's theorem, and Bugeaud's effective finite-prefix theorem. For the noisy section we also use Nguyen's refined repetition criterion.
2.1 Exact repetitions
The following is the form of the Adamczewski--Bugeaud Diophantine exponent criterion needed below.
Theorem 2.1 (Exact repetition criterion).
Let \(b\geq 2\). Suppose that there exist \(\rho>1\) and sequences of integers
such that
and
for every \(m\). Then
is rational or transcendental.
Indeed, the prefix \(\mathbf a[1,t_m]\) contains a repetition beginning after a preperiod of length \(r_m\), with period \(s_m-r_m\), and with total length at least a fixed multiple of \(s_m\). In the language of Adamczewski and Bugeaud, the Diophantine exponent of \(\mathbf a\) is greater than \(1\). The stated conclusion is precisely the corresponding transcendence criterion.
2.2 Ridout's theorem
We shall use Ridout's theorem in the following form.
Theorem 2.2 (Ridout).
Let \(S\) be a finite set of prime numbers, let \(\theta\) be a real algebraic number, and let \(\varepsilon>0\). Then there are only finitely many rational numbers \(P/Q\), written in lowest terms, such that every prime divisor of \(Q\) belongs to \(S\) and
2.3 Structured noisy repetitions
Let \(U\) and \(V\) be two words of the same length \(L\). For \(\eta>0\) and an integer \(\Delta\geq 0\), say that \(U\) and \(V\) are \((\eta,\Delta)\)-close if the set
is contained in a union of at most \(\Delta\) intervals of \(\{1,\ldots,L\}\) whose total cardinality is at most \(\eta L\).
We use the following consequence of Nguyen's refined Diophantine exponent theorem.
Theorem 2.3 (Refined repetition criterion).
Let \(b\geq 2\). Suppose that there exist \(\rho>1\), an integer \(\Delta\geq 0\), a sequence \(\eta_m\to 0\), and integers
such that
and such that the two words
are \((\eta_m,\Delta)\)-close for every \(m\). Then
is rational or transcendental.
Only the exact criterion and Ridout's theorem are needed for the exact online-copy lower bound. The refined criterion is used solely in the noisy section.
2.4 Bugeaud's effective finite-prefix theorem
We also use the following theorem of Bugeaud.
Theorem 2.4 (Bugeaud).
Let \(b\geq 2\), and let \(\xi\) be a real algebraic irrational number of degree \(D\) and height at most \(H\), with \(H\geq e^e\). Let \(a\) be the base-\(b\) expansion of \(\xi\). Let \(w\) be an infinite word satisfying
for some integer \(C\geq 2\). If the first \(L\) digits of \(a\) coincide with the first \(L\) digits of \(w\), then either
or
The numerical constants play no role in this note. What matters is that, for fixed algebraic \(\xi\), the theorem forbids arbitrarily long prefixes from coinciding with infinite words of complexity \(Cn\) unless \(C\) grows at least like
up to a constant depending on \(\xi\).
3 Online previous-copy parsings
Let \(W=W[1]W[2]\cdots W[N]\) be a finite word.
Definition 3.1.
An online previous-copy parsing of \(W\) is a factorization
with boundaries
where
Each phrase \(F_j\) is of one of two types.
First, \(F_j\) may be a literal, in which case \(|F_j|=1\).
Second, \(F_j\) may be a copied phrase. Writing
this means that there is a source position \(p\) with
such that
The source starts strictly to the left of the target phrase. It may overlap the target.
Define
to be the minimum number of phrases in an online previous-copy parsing of \(W\).
This is a structural parsing measure, not a bit-level encoding length. In particular, source positions and phrase lengths are not charged.
Remark 3.2.
The first phrase in every online previous-copy parsing is necessarily a literal, since no previous source is available at position \(1\).
4 A multiplicative-growth lemma
The following elementary lemma is the combinatorial core of the exact lower bound.
Lemma 4.1 (Multiplicative-growth phrase).
Let \(C>0\). Suppose that, for infinitely many \(N\), a word \(W_N\) of length \(N\) has an online previous-copy parsing with at most \(C\log N\) phrases. Then there exist \(\rho>1\) and infinitely many copied phrases, arising in such parsings, with boundaries \(s and One may take
Proof.
Fix such a parsing of \(W_N\), and write its boundaries as
Since the first phrase is a literal, \(n_1=1\). Thus
Set
Suppose, for contradiction, that there is a constant \(T\) such that every phrase satisfying
has endpoint \(n_j\leq T\), along the infinite family of parsings under consideration. After the last such phrase, all remaining ratios are \(<\rho\). Therefore, for all sufficiently large \(N\),
which is impossible.
Thus there are phrases with
and unbounded endpoints. A literal phrase with starting boundary \(n_{j-1}\) has ratio
which is \(<\rho\) for all sufficiently large \(n_{j-1}\). Hence the high-growth phrases with unbounded endpoints are copied phrases.
5 Exact online-copy lower bound
Theorem 5.1 (Compression criterion).
Let \(b\geq 2\), and let
If
for infinitely many \(N\), then \(\xi_{\mathbf a,b}\) is rational or transcendental.
Proof.
It is enough to prove the contrapositive. Suppose that
Assume, for contradiction, that there are \(C>0\) and infinitely many \(N\) such that
By the multiplicative-growth lemma, after passing to an infinite subsequence, we obtain copied phrases
such that
for some fixed \(\rho>1\), and \(t\to\infty\).
For each such copied phrase, put \(L=t-s\). There is a source position \(p\), with \(1\leq p\leq s\), such that
Set
Then
and
We split into two cases.
First suppose that
along an infinite subsequence. Then the triples \(r
and the exact repetition condition above. By the exact repetition criterion, \(\alpha\) is rational or transcendental, contradicting the assumption that \(\alpha\) is algebraic irrational.
It remains to consider the case in which \(d=s-r\) is bounded along an infinite subsequence. Passing to a further subsequence, we may assume that \(d\) is fixed.
The equality
can be rewritten as
Thus the block
is periodic with period \(d\).
Define the rational number
Then \(\beta\) has a base-\(b\) expansion agreeing with that of \(\alpha\) through at least the first \(t\) digits. Hence
Write \(\beta=P/Q\) in lowest terms. Since \(\beta\) has preperiod \(r\) and period \(d\) in base \(b\), its denominator divides
Therefore every prime divisor of \(Q\) belongs to the fixed finite set
and
If \(r\) is bounded along an infinite subsequence, then only finitely many such rationals \(\beta\) can occur. Since the agreement length \(t\) tends to infinity, one of them agrees with \(\alpha\) to arbitrarily many digits, and hence equals \(\alpha\). This contradicts the irrationality of \(\alpha\).
Thus \(r\to\infty\). Since
and
there is \(\rho'>1\) such that, for all sufficiently large terms of the subsequence,
Choose \(\varepsilon>0\) with \(1+\varepsilon<\rho'\). Since \(Q\ll_d b^r\), we obtain
along the subsequence. Therefore
for infinitely many distinct rationals \(P/Q\) whose denominators have all prime factors in the fixed finite set \(S\). This contradicts Ridout's theorem.
Both cases are impossible. Hence an algebraic irrational \(\alpha\) cannot have
\[ \oc(w_N(\alpha))=O(\log N) \]along an infinite sequence of \(N\).
Corollary 5.2 (Exact online-copy lower bound).
Let \(b\geq 2\), and let
Then
6 Consequences for LZ77-type parsings
The online previous-copy model is at least as permissive as the usual left-to-right previous-factor models.
Let \(z(W)\) denote the number of phrases in a standard LZ77-type parsing of \(W\), under any convention in which each phrase is either a literal, or a previous copy, possibly followed by one trailing literal. The previous copy may be self-referential or non-self-referential.
In the self-referential case, the copied part is an online previous-copy phrase. In the non-self-referential case, it is a fortiori an online previous-copy phrase. If a convention attaches one trailing literal to a copied phrase, split that phrase into at most two online previous-copy phrases. Hence there is an absolute constant \(A\), depending only on the LZ77 convention, such that
for every finite word \(W\).
Corollary 6.1 (LZ77 lower bound).
Let \(b\geq 2\), and let
Then every standard LZ77-type previous-factor parsing of
has
phrases.
Proof.
If some such parsing had \(O(\log N)\) phrases along an infinite sequence of \(N\), then the comparison above would give
along the same sequence, contradicting the exact online-copy lower bound.
7 Linear extension complexity
The online argument gives a strong superlogarithmic lower bound, but it relies essentially on left-to-right growth. We now introduce a weaker measure which also interacts with offline repetitiveness notions such as string attractors.
Let \(\mathcal A\) be a finite alphabet. For an infinite word \(x\in\mathcal A^{\mathbb N}\), let
be the number of distinct factors of length \(k\) occurring in \(x\).
Definition 7.1.
For a finite word \(W\in\mathcal A^*\), define its linear extension complexity by
Thus \(\lambda(W)\) measures whether \(W\) can be embedded as a prefix of an infinite word of uniformly linear factor complexity.
Lemma 7.2 (Online-copy parsings give linear extensions).
For every finite word \(W\),
Proof.
Let
be an online previous-copy parsing with \(z=\oc(W)\) phrases. Let
be its phrase boundaries. Fix a symbol \(c\in\mathcal A\), and extend \(W\) to the infinite word
Let \(k\geq 1\), and consider any factor \(U\) of \(x\) of length \(k\). Choose the leftmost occurrence of \(U\) in \(x\). We claim that this occurrence crosses either the first position of a phrase, or the first position of the infinite constant tail.
Indeed, suppose first that the occurrence lies entirely inside \(W\). If it is contained in a copied phrase
and does not cross the first position \(s+1\) of that phrase, then it starts at some position \(i>s+1\) and ends at most at \(t\). Let \(p\leq s\) be the source position of the copied phrase. By the copy equality, the same length-\(k\) factor occurs starting at
contradicting the leftmost choice. Thus the occurrence crosses the beginning of a phrase, unless it lies in a literal phrase, in which case it also crosses the beginning of that phrase.
If the occurrence meets the infinite tail, then it crosses the first position \(|W|+1\) of the tail, unless it is entirely contained in the tail. In the latter case \(U=c^k\), which also occurs starting at \(|W|+1\).
There are \(z\) phrase-start positions and one tail-start position. For a fixed position \(q\), at most \(k\) factors of length \(k\) cross \(q\). Therefore
Hence \(\lambda(W)\leq z+1\).
8 Effective lower bounds from Bugeaud's theorem
The previous lemma allows us to convert Bugeaud's finite-prefix theorem into an effective lower bound for \(\lambda(w_N(\alpha))\).
Theorem 8.1 (Effective lower bound for linear extension complexity).
Let \(b\geq 2\), and let
Then there is a constant \(c_\alpha>0\) such that, for all sufficiently large \(N\),
Proof.
Let \(D\) be the degree of \(\alpha\), and let \(H\geq e^e\) be an upper bound for its height. Put
Let \(C_0>\lambda(w_N(\alpha))\). By the definition of \(\lambda\), there is an infinite word \(x\) whose first \(N\) symbols are \(w_N(\alpha)\) and such that
Choose an integer
with \(C\leq C_0+3\). Then
Bugeaud's theorem applies to \(\alpha\), \(x\), \(C\), and \(L=N\). Hence either
or
We show that both alternatives are impossible if
with \(c>0\) sufficiently small and \(N\) sufficiently large.
First, under this hypothesis,
Therefore
grows faster than every fixed power of \(\log N\). Since \(C\) is only a fixed power of \(\log N\), we have
Thus the first alternative would force \(H\to +\infty\), impossible because \(H\) is fixed.
Second, using
and
for large \(N\), we obtain
Since \(\log C\leq \log\log N\), the right-hand side is bounded below by a positive constant multiple of \(c^{-11/2}\). Choosing \(c\) sufficiently small, this lower bound exceeds \(\log D\). Thus the second alternative would force
again impossible.
Consequently,
Since \(C\leq C_0+3\), and since \(C_0>\lambda(w_N(\alpha))\) was arbitrary, the theorem follows after adjusting the implicit constant.
Corollary 8.2 (Effective online-copy lower bound).
Let \(b\geq 2\), and let
Then, for all sufficiently large \(N\),
Proof.
This follows from
and the preceding theorem.
Remark 8.3.
This effective bound is much weaker than
which was proved by the exact Diophantine exponent argument. Its value is that it comes from a finite-prefix low-complexity obstruction and will also apply to offline repetitiveness measures.
9 String attractors and offline compression
We now apply the same finite-prefix method to string attractors.
Definition 9.1.
Let \(W\) be a finite word of length \(N\). A set
is a string attractor for \(W\) if every distinct factor of \(W\) has an occurrence in \(W\) crossing at least one position of \(\Gamma\). Let
be the minimum size of a string attractor for \(W\).
Lemma 9.2 (Attractors give linear extensions).
For every finite word \(W\),
Proof.
Let \(\Gamma\) be a string attractor for \(W\), with
Fix a symbol \(c\) in the alphabet and extend \(W\) to
Let \(k\geq 1\). Every factor of length \(k\) of \(x\) either is a factor of \(W\), crosses the boundary between \(W\) and the infinite tail, or is \(c^k\).
Every distinct factor of length \(k\) of \(W\) has an occurrence crossing a position of \(\Gamma\). For each fixed position \(q\in\Gamma\), at most \(k\) length-\(k\) factors cross \(q\). Hence factors coming from \(W\) contribute at most
distinct words.
The factors crossing the tail boundary contribute at most \(k\) additional words, and the factor \(c^k\) is included among them by considering the occurrence starting at the first tail position. Therefore
and so
Theorem 9.3 (Effective string-attractor lower bound).
Let \(b\geq 2\), and let
Then, for all sufficiently large \(N\),
Proof.
By the preceding lemma,
The result follows from the effective lower bound for \(\lambda(w_N(\alpha))\).
Corollary 9.4 (Offline dictionary-compression consequences).
Let \(M(W)\) be any compression measure for which every representation of size \(M(W)\) induces a string attractor of size \(O(M(W))\), under the corresponding standard size convention. Then, for every algebraic irrational
and every integer base \(b\geq 2\),
Proof.
If a representation of size \(M(W)\) induces a string attractor of size \(O(M(W))\), then
Apply this to \(W=w_N(\alpha)\) and use the preceding theorem.
Remark 9.5.
This corollary applies to the standard repetitiveness measures for which such a linear attractor extraction is known, under the corresponding size conventions. These include, among others, LZ77, straight-line programs under usual grammar-size conventions, collage systems, macro schemes, and related measures discussed in the string-attractor literature. The principle is uniform: once a compressor induces an attractor of size \(O(M)\), the lower bound for \(\gamma\) transfers to \(M\).
Remark 9.6.
The string-attractor result is not a substitute for the exact online-copy theorem. It gives only
not \(\gamma(w_N(\alpha))=\omega(\log N)\). Proving a superlogarithmic lower bound for string attractors or straight-line programs would require a new idea beyond the online multiplicative-growth argument.
10 Noisy online-copy parsings
We now return to online parsings and introduce a structured-noise version.
Let \(U,V\) be words of the same length \(L\). Recall that \(U\) and \(V\) are \((\eta,\Delta)\)-close if their mismatch set is contained in a union of at most \(\Delta\) intervals of total cardinality at most \(\eta L\).
Definition 10.1.
Let \(W\) be a finite word. An \((\eta,\Delta)\)-noisy online-copy parsing of \(W\) is a factorization
with boundaries
such that each phrase is either a literal or an approximate previous copy in the following sense. If
then there exists a source position \(p\), with
such that
are \((\eta,\Delta)\)-close.
Let
be the minimum number of phrases in such a parsing.
This is again a structural measure. The exceptional intervals, their locations, and their contents are not charged.
11 Noisy online-copy lower bound
Theorem 11.1 (Noisy online-copy lower bound).
Let \(b\geq 2\), and let
Fix an integer \(\Delta\geq 0\), and let
Then
Proof.
Assume, for contradiction, that there exist \(C>0\) and infinitely many \(N\) such that
The multiplicative-growth lemma applies verbatim, since it uses only the phrase boundaries. Thus, after passing to an infinite subsequence, there are approximate copied phrases
such that
for some fixed \(\rho>1\).
For each such phrase, put \(L=t-s\). There exists \(p\), with \(1\leq p\leq s\), such that
and
are \((\eta_N,\Delta)\)-close. Set
Then the two words
are \((\eta_N,\Delta)\)-close.
We split into two cases.
First suppose that
along an infinite subsequence. Since \(\eta_N\to0\), the refined repetition criterion applies to the triples \(r
It remains to treat the case where \(d=s-r\) is bounded. Passing to a further subsequence, assume that \(d\) is fixed.
The approximate copy condition says that the equalities
fail only for values of \(h\) contained in at most \(\Delta\) intervals of total cardinality at most \(\eta_N L\). Equivalently, up to enlarging these intervals by \(O_\Delta(1)\) endpoints, the period-\(d\) relations
fail only on a subset of
covered by at most \(\Delta\) intervals of total cardinality at most \(\eta_N L+O_\Delta(1)\).
Remove these bad intervals. The remaining good set is a union of at most \(\Delta+1\) intervals and has cardinality at least
Thus one good interval has length
Since
there is a constant \(c>0\), depending only on \(\rho\) and \(\Delta\), such that
for all sufficiently large \(N\).
Let this good interval of edge positions begin at \(A\). Then
It follows that the digit block
is periodic with period \(d\).
Set
Since \(A\leq r+L\leq t\), we have \(m\leq t\). Hence
unless \(m\) is bounded, in which case the argument below is even stronger.
Define the rational number
The periodicity just obtained implies that \(\alpha\) and \(\beta\) agree through at least position \(m+H\). Therefore
Writing \(\beta=P/Q\) in lowest terms, we have
up to cancellation. Thus all prime divisors of \(Q\) lie in the fixed finite set
and
If \(m\) is bounded along an infinite subsequence, then only finitely many rationals \(\beta\) occur. Since \(H\to\infty\), one of them agrees with \(\alpha\) to arbitrarily many digits, and hence equals \(\alpha\), contradicting irrationality.
Thus \(m\to\infty\). Since \(H\geq cm\), choose \(\varepsilon>0\) with \(\varepsilon Therefore for infinitely many distinct rationals \(P/Q\) whose denominators have all prime factors in \(S\). This contradicts Ridout's theorem. Both cases are impossible. Hence The exact online-copy theorem is qualitative. It proves superlogarithmic divergence but gives no explicit rate. The next definitions isolate the precise finite Diophantine obstruction that controls possible quantitative improvements. Let \(\mathbf a=a_1a_2\cdots\) be the digit sequence of \(\alpha\). Definition 12.1. For \(T\geq 1\), define with the convention that the supremum is \(0\) if no such triple exists. When \(\mathbf a\) is the base-\(b\) digit sequence of \(\alpha\), we write instead of \(\Theta_{\mathbf a}(T)\). Proposition 12.2 (Qualitative decay of the repetition profile). Let \(b\geq 2\), and let Then Proof. Suppose not. Then there exist \(\varepsilon>0\) and triples such that and Put If \(d_m\to\infty\) along an infinite subsequence, the exact repetition criterion implies that \(\alpha\) is rational or transcendental, contradiction. Thus \(d_m\) is bounded along an infinite subsequence. Passing to a further subsequence, assume \(d_m=d\) is fixed. The equality above gives a period-\(d\) block of length tending to infinity. Repeating the bounded-displacement argument in the proof of the exact online-copy theorem, we obtain infinitely many rational numbers \(\beta=P/Q\), eventually periodic in base \(b\) with fixed period \(d\), such that for some \(\varepsilon'>0\), with all prime factors of \(Q\) lying in the fixed finite set This contradicts Ridout's theorem, unless \(\alpha\) is rational. Since \(\alpha\) is algebraic irrational, this is impossible. The next lemma gives the precise bridge from the repetition profile to online-copy lower bounds. Lemma 12.3 (Quantitative profile bound). Let \(\mathbf a\) be any infinite word, and let For \(T\geq 2\), put Then, for every \(N>T\), Proof. Let be an online previous-copy parsing of \(W_N\). Let \(q\) be the largest index such that Such an index exists because \(n_1=1 For every \(j>q\), the endpoint satisfies \(n_j\geq T\). If the \(j\)-th phrase is a literal, then because \(n_{j-1}\geq n_q\geq T-1\) for \(j>q+1\), and for \(j=q+1\) the claimed bound is still harmless after replacing \(n_q\) by the last boundary below \(T\) and absorbing the single crossing step into the final strict inequality below. If the \(j\)-th phrase is copied, write The copy gives a triple \(r and \(t\geq T\). Hence Therefore every relevant ratio after the boundary \(T\), except possibly the single ratio crossing \(T\), is bounded by \(1+\eta_{\mathbf a}(T)\). Since the crossing ratio contributes only to moving from a boundary below \(T\) to a boundary at least \(T\), we have Thus Taking the minimum over parsings proves the lemma. Corollary 12.4 (Conditional quantitative strengthening). Let \(b\geq 2\), and let Suppose that there are constants \(A>0\), \(c>0\), and \(T_0\) such that Then More generally, if for all large \(T\), then Proof. Apply the quantitative profile bound with For large \(N\), is negligible compared with either of the assumed upper bounds for \(\Theta_\alpha(T)\). Since for small positive \(u\), the result follows. Remark 12.5. Thus the problem of proving an explicit superlogarithmic lower bound for online-copy or LZ77 parsing is reduced to proving an explicit decay estimate for the finite repetition profile \(\Theta_\alpha(T)\). The bounded-displacement part of the proof used Ridout's theorem only after the period \(d\) had been fixed. Quantitative improvements require control of periods \(d\) that may grow with the scale. A repetition produces an eventually periodic rational approximant with preperiod \(r\), period \(d\), and agreement length at least \(t\). Its denominator divides If then the associated rational \(\beta_{r,d}\) satisfies For fixed \(d\), Ridout excludes infinitely many such approximations. For \(d\to\infty\), the set of prime divisors of \(b^d-1\) is no longer fixed. Problem 13.1 (Moving-period Ridout problem). Let \(b\geq 2\), and let Find an explicit function \(F_\alpha(\varepsilon)\) such that there are no rationals \(\beta\) with an eventually periodic base-\(b\) expansion of preperiod \(r\), period \(d\), and satisfying Up to the separate treatment of bounded values of \(r+d\), a bound of the form would imply and hence Even a double-exponential bound would imply This is the main Diophantine bottleneck for any explicit improvement over the qualitative lower bound The multiplicative-growth lemma extracts one strong repetition from an \(O(\log N)\)-phrase parsing. To prove a quantitative lower bound of the form one must handle repetitions whose exponent is only A single such weak repetition is generally not enough. A possible strengthening is to use the whole growth distribution of a parsing. Given phrase boundaries define the growth energy above a scale \(T\) by One always has Each copied phrase contributing to \(E_T\) gives a repetition with strength Problem 14.1 (Many weak repetitions). Develop a Diophantine criterion excluding, for algebraic irrational digit expansions, families of repetitions whose individual exponents may tend to \(1\), but whose total growth energy is large. Such a criterion would be more compatible with quantitative versions of the Subspace Theorem, which usually control families of solutions rather than a single exceptionally good solution. The noisy theorem above assumes that, in each approximate copy, the mismatch set is contained in at most \(\Delta\) intervals, where \(\Delta\) is fixed. This is the natural regime of Nguyen's refined Diophantine exponent criterion. A more robust compression model would allow copied phrases plus correction sets of low complexity. For example, the error set might be sparse, or might itself admit a short interval decomposition, LZ parsing, grammar, or attractor. The bounded-displacement part of the noisy proof shows the obstruction clearly. If the error set is covered by \(\Delta\) intervals, then one obtains a genuinely periodic interval of length at least For fixed \(\Delta\), this yields a rational approximation with a fixed exponent \(>1\), and Ridout applies. If \(\Delta\to\infty\), the exponent tends to \(1\), and one again needs quantitative Diophantine input. Problem 15.1 (Noisy copies with growing error complexity). Extend the noisy online-copy lower bound to regimes where \(\Delta=\Delta_N\to\infty\), or where the mismatch sets have low compressive complexity rather than bounded interval complexity. A quantitative moving-period Ridout theorem would likely be necessary even for moderately growing \(\Delta_N\). The compression criterion proved above is qualitative: if a non-rational number has logarithmically short online-copy parsings infinitely often, then it is transcendental. One may ask for quantitative transcendence measures when such parsings occur on sufficiently dense scales. Suppose that for an infinite sequence \(N_m\), and that the sequence is not too sparse, for instance for some fixed \(A\). The proof of the compression criterion then produces rational approximations at controlled heights and with controlled gaps. This is the type of input from which transcendence measures may sometimes be extracted. Problem 16.1 (Compression-to-measure principle). Develop a theorem converting dense logarithmic online-copy compression profiles into explicit transcendence measures for The refined Diophantine exponent framework suggests that such a principle should also have noisy variants. The exact online-copy theorem gives a strong lower bound for a previous-copy model and hence for LZ77-type parsings. It does not by itself imply superlogarithmic lower bounds for arbitrary grammar compression, straight-line programs, bidirectional macro schemes, or smallest string attractors. The effective finite-prefix method gives weaker lower bounds for offline repetitiveness measures: These bounds are new in nature relative to the online proof, because they pass through linear extension complexity and Bugeaud's finite-prefix theorem, not through phrase-boundary growth. The results are not normality statements. They do not imply digit equidistribution, nor any explicit digit-frequency estimate for a specific algebraic irrational number. The central open quantitative question is to improve to an explicit lower bound such as or even The quantitative repetition profile \(\Theta_\alpha(T)\) and the moving-period Ridout problem isolate the main obstruction. During the preparation of this work, ChatGPT, by OpenAI, was used to assist with mathematical drafting, formalization, review, and editing. This work is shared as a preliminary AI-assisted mathematical note. The mathematical content may have been only partially reviewed and may contain errors; it should not be treated as peer-reviewed or as a fully verified manuscript. Part I.B Let $b\geq2$ be an integer alphabet size. We study the extremal and metric behaviour of the online previous-copy parsing complexity $\oc(W)$ of a finite word $W\in\{0,\ldots,b-1\}^N$. In this model a phrase is either a literal symbol or an exact copy whose source starts earlier in the word; self-overlap is allowed. We prove the sharp universal upper bound uniformly for all words $W$ of length $N$, and show that this is best possible: For Lebesgue-almost every $\alpha\in[0,1]$, if $w_N(\alpha)$ denotes the prefix of length $N$ of the base-$b$ expansion of $\alpha$, then We also determine the finite-word entropy of low online previous-copy complexity. For $0\leq \kappa\leq1$, let Then, for $0<\kappa\leq1$, Consequently, the Hausdorff spectrum is exact: Finally, we record parallel extremal and almost-sure estimates for the normalized substring complexity and for the minimum string attractor size $\gamma(W)$: for almost every $\alpha$, and the same asymptotic scale holds in the worst case. The online previous-copy parsing model is a structural abstraction of left-to-right dictionary compression. A finite word is parsed from left to right into phrases. A phrase is either a literal symbol or a copy of an earlier substring. The source is required to start earlier than the target, but it may overlap the target. This convention includes the self-referential behaviour familiar from variants of LZ77. A companion paper studied this model on digit prefixes of algebraic irrational numbers and proved the lower bound for every algebraic irrational $\alpha$. The present note is independent of the Diophantine part. Its purpose is to determine the natural extremal and metric scale of $\oc(W)$ for arbitrary and typical finite words. The main scale is This scale appears in three complementary ways. First, every word of length $N$ has an online previous-copy parsing with at most phrases. Second, by a counting argument, this is sharp in the worst case. Third, the same asymptotic holds for almost every base-$b$ expansion. The proof of the universal upper bound is elementary. Choose a block length Parsing greedily, any phrase shorter than $L$, except possibly near the end of the word, must start at a previously unseen block of length $L$. There are at most $b^L$ such blocks, while all other phrases have length at least $L$. This gives The lower bounds come from counting parse descriptions. The number of words of length $N$ admitting an online previous-copy parse with $z$ phrases is at most For this is $b^{(\kappa+o(1))N}$. This gives both the worst-case lower bound and the entropy/Hausdorff spectrum. The last part treats two standard repetitiveness measures: the normalized substring complexity and the minimum string attractor size $\gamma(W)$. The inequalities connect these measures to the online previous-copy model. A standard collision estimate for random words gives the matching lower bound for $\delta$, and hence for $\gamma$, almost surely. Throughout the paper, logarithms without subscript are natural, while $\log_b$ denotes logarithm in base $b$. Let be a finite word over the alphabet Definition 2.1. An online previous-copy parsing of $W$ is a factorization with boundaries where Each phrase is of one of the following two types. First, $F_j$ may be a literal, in which case Second, $F_j$ may be a copy. Writing this means that there exists a source position $p$ with such that The source may overlap the target. Let be the minimum number of phrases in an online previous-copy parsing of $W$. Remark 2.2. The first phrase must be a literal. The definition is structural: source positions and lengths are not charged as bits. The quantity $\oc(W)$ is therefore a phrase-complexity measure, not a bit-complexity measure. We first prove that every word has an online previous-copy parsing with about $N/\log_b N$ phrases. Theorem 3.1 (Universal upper bound). Uniformly for all words $W\in\{0,\ldots,b-1\}^N$, More precisely, for every integer $L\geq1$, and with this gives the displayed asymptotic. Proof. Fix $L$. We construct a parsing by the following greedy rule. Suppose the current position is $i$. If $i+L-1\leq N$ and the block has an occurrence starting at some position $p
Call a phrase short if it is a literal produced before the final $L$ positions. If a short phrase starts at position $i\leq N-L+1$, then the block has no earlier occurrence. Hence two different short phrase starts $i then at position $j$ the parser could have copied at least $L$ symbols from source $i$, contradicting that the phrase at $j$ was short. There are at most $b^L$ distinct blocks of length $L$. Hence the number of short phrases before the final $L$ positions is at most $b^L$. The final part contributes at most $L$ literal phrases. All copied phrases produced by the algorithm have length $L$, so there are at most $N/L$ such phrases. Therefore Now take Then and Also Thus Let Lemma 4.1 (Counting parse descriptions). For $1\leq z\leq N/2$, Consequently, if with fixed $0<\kappa\leq1$, then Proof. Fix $m\leq z$. We count a superset of words admitting a parsing with $m$ phrases. The boundaries are determined by the $m-1$ internal cut positions, so there are choices. For each phrase, choose whether it is a literal or a copy, giving at most $2^m$ choices. Assign to every phrase a symbol in the alphabet, even if the phrase is a copy; this gives at most $b^m$ choices. Assign also to every phrase a source position in $\{1,\ldots,N\}$, even if the phrase is a literal; this gives at most $N^m$ choices. Thus the number of descriptions with $m$ phrases is at most Every such description determines at most one word. Indeed, literals prescribe their symbols, and copied phrases are deterministic once the already parsed prefix is known. If a source overlaps the target, the copied symbols are still determined progressively because the source starts earlier than the target. Therefore For $m\leq z\leq N/2$, The function is increasing for $1\leq m\leq N/2$. Hence Now take Taking logarithms in base $b$, The last term is The middle term is Also $\log_b z_N=o(N)$. Therefore as claimed. Theorem 5.1 (Worst-case asymptotics). Proof. The upper bound follows from the universal upper bound. For the lower bound, fix $0<\kappa<1$, and set By the counting lemma, Since the total number of words of length $N$ is $b^N$, for all large $N$ there exists a word $W$ with Thus Letting $\kappa\uparrow1$ gives the lower bound. For $\alpha\in[0,1]$, let be the prefix of length $N$ of the base-$b$ expansion of $\alpha$. We ignore the countable set of numbers with two base-$b$ expansions. Theorem 6.1 (Almost-sure asymptotic). For Lebesgue-almost every $\alpha\in[0,1]$, Equivalently, for almost every $\alpha$. Proof. The upper bound holds for every word by the universal upper bound. For the lower bound, fix $0<\kappa<1$, and set The set of $\alpha$ such that is a union of base-$b$ cylinders of length $N$, one for each word in $\mathcal W_{N,z_N}$. Each cylinder has Lebesgue measure $b^{-N}$. Hence its measure is at most This is summable in $N$. By the Borel--Cantelli lemma, for almost every $\alpha$, the event occurs for only finitely many $N$. Therefore for every $0<\kappa<1$, and hence the liminf is at least $1$. The limsup is at most $1$ by the universal upper bound. For $0<\kappa\leq1$, define Theorem 7.1 (Entropy spectrum). For $0<\kappa\leq1$, For $\kappa\geq1$, the limit is $1$. Proof. The upper bound for $0<\kappa\leq1$ follows directly from the counting lemma: For the lower bound, fix $0<\eta<\kappa$, and put For every word construct a word $W(U)$ of length $N$ as follows. First write $U$. Then continue periodically with period $U$ until the word has total length $N$. Different $U$'s give different $W(U)$'s, because the first $M$ symbols of $W(U)$ are $U$. Hence this construction gives $b^M$ distinct words. By the universal upper bound applied to $U$, uniformly in $U$. After $U$ has been parsed, the periodic continuation is obtained with one copied phrase, using source position $1$ and allowing self-overlap. Thus Since we get for all sufficiently large $N$. Hence Letting $\eta\to0$ gives For $\kappa\geq1$, the universal upper bound implies that all words are eventually counted up to $o(1)$ in the threshold, and the entropy limit is $1$. For $0\leq\kappa\leq1$, define Theorem 8.1 (Hausdorff spectrum). For $0\leq\kappa\leq1$, For $\kappa\geq1$, $F_\kappa$ has full Lebesgue measure and hence Hausdorff dimension $1$. Proof. We first prove the upper bound. Fix $\eta>0$. If $\alpha\in F_\kappa$, then for infinitely many $N$, Thus $F_\kappa$ is contained in the limsup of the union of cylinders of length $N$ corresponding to words in By the entropy upper bound, Each cylinder has diameter comparable to $b^{-N}$. If $s>\kappa+\eta$, then Hence the $s$-dimensional Hausdorff measure of $F_\kappa$ is zero. Therefore Letting $\eta\to0$ gives For the lower bound, assume $0<\kappa\leq1$. We construct a Cantor set contained in $F_\kappa$. Choose a sequence of integers $M_j\to\infty$ growing so rapidly that the total length constructed before stage $j$ is $o(M_j/\log M_j)$. Suppose that before stage $j$ a prefix of length $R_{j-1}$ has been constructed. Choose freely a word After writing $U_j$, continue periodically with period $U_j$ until the total length reaches Since $R_{j-1}=o(M_j)$, At time $N_j$, the already existing prefix before $U_j$ contributes $o(M_j/\log M_j)$ phrases. The block $U_j$ can be parsed using the universal upper bound: The periodic continuation after $U_j$ is copied with one phrase, since self-overlap is allowed. Thus Because we obtain Hence every point in the constructed Cantor set lies in $F_\kappa$. It remains to estimate its dimension. At stage $j$, the number of independent choices is $b^{M_j}$. Since the sequence $M_j$ grows very fast, the contribution of all previous stages is negligible compared with $M_j$. The cylinders at stage $j$ have length $b^{-N_j}$. A standard mass distribution argument gives Thus The case $\kappa=0$ follows from the upper bound and the fact that $F_0$ is nonempty, for example it contains eventually periodic expansions. Therefore $\dimH F_0=0$. For a finite word $W$, let $p_W(k)$ be the number of distinct factors of $W$ of length $k$. Define Let $\gamma(W)$ denote the size of the smallest string attractor of $W$. Recall that a set of positions is a string attractor if every distinct factor of $W$ has an occurrence crossing at least one position of $\Gamma$. Lemma 9.1. For every finite word $W$, Proof. First let $\Gamma$ be a string attractor for $W$. Fix $k$. Every distinct length-$k$ factor has an occurrence crossing some position of $\Gamma$. A fixed position can be crossed by at most $k$ length-$k$ factors. Therefore Taking the minimum over $\Gamma$ and then the maximum over $k$ gives Now take an online previous-copy parsing of $W$ with phrase starts Let $\Gamma$ be this set of phrase starts. We claim that $\Gamma$ is a string attractor. Let $U$ be any factor of $W$, and choose its leftmost occurrence. If this occurrence does not cross a phrase start, it lies strictly inside a single phrase. If that phrase is a literal, this is impossible unless $|U|=1$, in which case the occurrence crosses the phrase start. If the phrase is copied, the source of the copy gives an earlier occurrence of $U$, contradicting the choice of the leftmost occurrence. Hence every factor has an occurrence crossing a phrase start, and $\Gamma$ is an attractor. Thus Taking the minimum over parsings gives Lemma 9.2 (Universal upper bound for $\delta$). Uniformly for all words $W$ of length $N$, Proof. For every $k$, Thus If $k\leq \log_b N$, then the maximum of $b^k/k$ in this range occurs at $k=\lfloor\log_b N\rfloor+O(1)$, and is If $k\geq \log_b N$, then Taking the maximum over $k$ proves the claim. Theorem 9.3 (Worst-case for $\delta$ and $\gamma$). and Proof. The upper bound for $\delta$ is the universal bound just proved. The upper bound for $\gamma$ follows from and the universal upper bound for $\oc$. For the lower bounds, it is enough to show that there exist words $W$ with This follows, for example, from the almost-sure lower bound for $\delta$ proved in the next subsection. Since the same examples give the lower bound for $\gamma$. Theorem 9.4 (Almost-sure behaviour of $\delta$ and $\gamma$). For Lebesgue-almost every $\alpha\in[0,1]$, and Proof. The upper bound for $\delta$ is universal. The upper bound for $\gamma$ follows from and the almost-sure asymptotic for $\oc$. It remains to prove the almost-sure lower bound for $\delta$. Fix $\varepsilon>0$, and set For a random word $W_N$ of length $N$, let $C_N$ be the number of pairs $1\leq i For any pair $i Choose a subsequence with $q\varepsilon>2$. By Markov's inequality, The last series is summable. By Borel--Cantelli, almost surely. The number of distinct length-$k_{N_m}$ factors in the prefix of length $N_m$ is at least because the number of repeated occurrences is bounded by the number of colliding pairs. Now let $N_m\leq N we have $N_m\sim N$ and $\log N_m\sim\log N$. The prefix $w_N(\alpha)$ contains the prefix $w_{N_m}(\alpha)$, so Also Thus Since $\varepsilon>0$ is arbitrary, taken over a countable sequence tending to $0$, we obtain almost surely. This proves the theorem. The lower bound for $\gamma$ follows from The results above show that the online previous-copy complexity and the standard repetitiveness measures $\delta$ and $\gamma$ share the same natural scale in the worst case and almost surely: More precisely, and For Lebesgue-almost every $\alpha\in[0,1]$, The entropy spectrum and Hausdorff spectrum for $\oc$ are also determined: for $0<\kappa\leq1$, and for $0\leq\kappa\leq1$. During the preparation of this work, ChatGPT, by OpenAI, was used to assist with mathematical drafting, formalization, review, and editing. This work is shared as a preliminary AI-assisted mathematical note. The mathematical content may have been only partially reviewed and may contain errors; it should not be treated as peer-reviewed or as a fully verified manuscript.
12 A quantitative repetition profile
13 A moving-period Ridout problem
14 Many weak repetitions
15 Beyond bounded-interval noise
16 Transcendence measures from dense compression profiles
17 Scope
Declaration of generative AI and AI-assisted technologies in the writing process
References
Metric and Extremal Theory of Online Previous-Copy Compression
Abstract
1 Introduction
2 Online previous-copy parsings
3 A universal upper bound
4 Counting words with short parsings
5 Worst-case complexity
6 Metric law for online previous-copy complexity
7 Finite entropy of low-complexity words
8 Hausdorff spectrum
9 Normalized substring complexity and string attractors
9.1 Universal and worst-case bounds
9.2 Almost-sure behaviour
10 Summary of asymptotic scales
Declaration of generative AI and AI-assisted technologies in the writing process
References