Back to AI-assisted math

Previous-Copy Compression: Foundations and Algebraic Digit Expansions

This page collects the foundational material for the online previous-copy model and its application to algebraic digit expansions.

Part I.A

Previous-Copy Compression and Algebraic Digit Expansions

Luca Blanchi

Abstract

Let \(b\geq 2\) be an integer and let

\[ \alpha=\sum_{n\geq 1}a_n b^{-n} \in (0,1)\cap(\overline{\mathbb Q}\setminus\mathbb Q) \]

be algebraic irrational, with \(a_n\in\{0,\ldots,b-1\}\). Write

\[ w_N(\alpha)=a_1a_2\cdots a_N \]

for the prefix of length \(N\) of the base-\(b\) expansion of \(\alpha\).

We study the compressibility of \(w_N(\alpha)\) under a left-to-right previous-copy model. A phrase is either a literal symbol or an exact copy of an earlier substring; self-overlap is allowed, as in self-referential variants of LZ77. Let \(\oc(W)\) be the minimum number of phrases in such a parsing of a finite word \(W\). We prove

\[ \oc(w_N(\alpha))=\omega(\log N). \]

Consequently, every standard LZ77-type previous-factor parsing, self-referential or not, uses \(\omega(\log N)\) phrases on \(w_N(\alpha)\).

The proof combines a simple multiplicative-growth lemma with Diophantine approximation. A logarithmic online previous-copy parsing forces a copied phrase whose endpoint is a fixed multiplicative factor larger than its starting boundary. If the copy displacement tends to infinity, this gives a repetition forbidden by the Adamczewski--Bugeaud Diophantine exponent criterion. If the displacement is bounded, it gives too good an approximation by rationals with denominators supported on a fixed finite set of primes, contradicting Ridout's theorem.

We also prove a structured-noise variant using Nguyen's refined Diophantine exponent theorem. If copied phrases may differ from their sources on a vanishing proportion of positions contained in a bounded number of intervals, logarithmically many phrases still do not suffice.

Finally, we add an effective finite-prefix component. We introduce the linear extension complexity \(\lambda(W)\), the least \(C\) such that \(W\) extends to an infinite word of factor complexity at most \(Cn\). Using Bugeaud's effective theorem for long prefixes of algebraic numbers, we prove

\[ \lambda(w_N(\alpha)) \gg_{\alpha} \frac{(\log N)^{1/11}}{(\log\log N)^{4/11}}. \]

Since small online-copy parsings and small string attractors both imply small linear extension complexity, this gives the same effective lower bound for the minimum string attractor size of \(w_N(\alpha)\), and hence for all compression measures which induce string attractors of comparable size. This part is weaker than the superlogarithmic LZ77 bound, but applies to offline repetitiveness measures outside the reach of the online argument.

1 Introduction

The base-\(b\) expansion of an algebraic irrational number is expected to be highly complex. This expectation cannot be usefully expressed through unrestricted Kolmogorov complexity of finite prefixes: algebraic numbers are computable, and hence their first \(N\) digits have descriptions of length \(O(\log N)\). One must instead work with restricted models of description or compression.

A fundamental theorem of Adamczewski and Bugeaud says that the base-\(b\) expansion of an algebraic irrational number cannot have low factor complexity. More precisely, if \(p(n,\alpha,b)\) denotes the number of distinct blocks of length \(n\) occurring in the base-\(b\) expansion of an algebraic irrational \(\alpha\), then

\[ \frac{p(n,\alpha,b)}{n}\longrightarrow +\infty. \]

Their method rests on a Diophantine principle: sufficiently strong repetitions in the digit expansion produce rational approximations that are too good for algebraic irrational numbers.

The purpose of this note is to apply that principle to a concrete compression model. We consider finite prefixes under a left-to-right previous-copy parsing. This model is deliberately generous. Copy sources and lengths are not charged. A copied phrase is allowed to overlap its target. Therefore a lower bound in this model is not an artefact of bit-level encoding conventions.

The first main result is the following.

Theorem 1.1 (Exact online-copy lower bound).

Let \(b\geq 2\), and let

\[ \alpha=\sum_{n\geq 1}a_n b^{-n} \in (0,1)\cap(\overline{\mathbb Q}\setminus\mathbb Q). \]

Then

\[ \oc(w_N(\alpha))=\omega(\log N). \]

Equivalently,

\[ \lim_{N\to\infty} \frac{\oc(w_N(\alpha))}{\log N} = +\infty. \]

As a direct consequence, every standard LZ77-type previous-factor parsing of \(w_N(\alpha)\), self-referential or not, has \(\omega(\log N)\) phrases.

The result should be read carefully. It is not a lower bound for arbitrary straight-line programs, unrestricted grammar compression, bidirectional macro schemes, or offline copy systems. The online condition is essential in the proof. However, a second and independent part of the note gives effective lower bounds for several offline repetitiveness measures through a different route.

To state that route, define the linear extension complexity \(\lambda(W)\) of a finite word \(W\) to be the least \(C\) such that \(W\) is the prefix of an infinite word \(x\) satisfying

\[ p_x(n)\leq Cn \qquad(n\geq 1), \]

where \(p_x(n)\) is the number of distinct factors of length \(n\) of \(x\). We prove that both online-copy parsings and string attractors control \(\lambda(W)\):

\[ \lambda(W)\leq \oc(W)+1, \qquad \lambda(W)\leq \gamma(W)+1, \]

where \(\gamma(W)\) is the minimum string attractor size of \(W\). Combining this with Bugeaud's effective finite-prefix theorem gives the second main result.

Theorem 1.2 (Effective finite-prefix lower bound).

Let \(b\geq 2\), and let

\[ \alpha\in (0,1)\cap(\overline{\mathbb Q}\setminus\mathbb Q). \]

Then, for all sufficiently large \(N\),

\[ \lambda(w_N(\alpha)) \gg_{\alpha} \frac{(\log N)^{1/11}}{(\log\log N)^{4/11}}. \]

Consequently,

\[ \gamma(w_N(\alpha)) \gg_{\alpha} \frac{(\log N)^{1/11}}{(\log\log N)^{4/11}}, \]

and the same lower bound holds for every compression measure which is known to induce a string attractor of size bounded linearly by that measure.

This effective bound is weaker than \(\oc(w_N(\alpha))=\omega(\log N)\), but it applies to offline measures for which the online multiplicative-growth argument does not apply.

We also prove a structured noisy version of the online theorem. There, a copied phrase may differ from its source on a small exceptional set, provided the exceptional set is contained in a bounded number of intervals. The unbounded-displacement case is handled by Nguyen's refined Diophantine exponent theorem; the bounded-displacement case is handled directly by Ridout's theorem.

The final section isolates a quantitative profile

\[ \Theta_\alpha(T) \]

measuring the strongest exact repetition in the digit sequence beyond scale \(T\). The exact theorem implies qualitatively that

\[ \Theta_\alpha(T)\to 0. \]

Moreover, any explicit decay bound on \(\Theta_\alpha(T)\) would immediately imply an explicit superlogarithmic lower bound for \(\oc(w_N(\alpha))\). This reduces the main quantitative breakthrough problem to a finite Diophantine problem, closely related to a moving-period version of Ridout's theorem.

Throughout the note, logarithms are natural. All implicit constants may depend on fixed parameters such as \(\alpha\), its degree and height, and, where relevant, \(b\), although the effective theorem quoted from Bugeaud is itself independent of the base.

2 Diophantine inputs

Let

\[ \mathbf a=a_1a_2a_3\cdots \]

be an infinite word over a finite set of integers. For integers \(i\leq j\), write

\[ \mathbf a[i,j]=a_i a_{i+1}\cdots a_j. \]

For an integer \(b\geq 2\), put

\[ \xi_{\mathbf a,b} = \sum_{n\geq 1}a_n b^{-n}. \]

We use three Diophantine inputs: the Adamczewski--Bugeaud repetition criterion, Ridout's theorem, and Bugeaud's effective finite-prefix theorem. For the noisy section we also use Nguyen's refined repetition criterion.

2.1 Exact repetitions

The following is the form of the Adamczewski--Bugeaud Diophantine exponent criterion needed below.

Theorem 2.1 (Exact repetition criterion).

Let \(b\geq 2\). Suppose that there exist \(\rho>1\) and sequences of integers

\[ 0\leq r_m

such that

\[ s_m-r_m\to\infty, \qquad t_m\geq \rho s_m, \]

and

\[ \mathbf a[r_m+1,r_m+t_m-s_m] = \mathbf a[s_m+1,t_m] \]

for every \(m\). Then

\[ \xi_{\mathbf a,b} \]

is rational or transcendental.

Indeed, the prefix \(\mathbf a[1,t_m]\) contains a repetition beginning after a preperiod of length \(r_m\), with period \(s_m-r_m\), and with total length at least a fixed multiple of \(s_m\). In the language of Adamczewski and Bugeaud, the Diophantine exponent of \(\mathbf a\) is greater than \(1\). The stated conclusion is precisely the corresponding transcendence criterion.

2.2 Ridout's theorem

We shall use Ridout's theorem in the following form.

Theorem 2.2 (Ridout).

Let \(S\) be a finite set of prime numbers, let \(\theta\) be a real algebraic number, and let \(\varepsilon>0\). Then there are only finitely many rational numbers \(P/Q\), written in lowest terms, such that every prime divisor of \(Q\) belongs to \(S\) and

\[ \left|\theta-\frac{P}{Q}\right| < Q^{-1-\varepsilon}. \]

2.3 Structured noisy repetitions

Let \(U\) and \(V\) be two words of the same length \(L\). For \(\eta>0\) and an integer \(\Delta\geq 0\), say that \(U\) and \(V\) are \((\eta,\Delta)\)-close if the set

\[ \{1\leq i\leq L: U_i\neq V_i\} \]

is contained in a union of at most \(\Delta\) intervals of \(\{1,\ldots,L\}\) whose total cardinality is at most \(\eta L\).

We use the following consequence of Nguyen's refined Diophantine exponent theorem.

Theorem 2.3 (Refined repetition criterion).

Let \(b\geq 2\). Suppose that there exist \(\rho>1\), an integer \(\Delta\geq 0\), a sequence \(\eta_m\to 0\), and integers

\[ 0\leq r_m

such that

\[ s_m-r_m\to\infty, \qquad t_m\geq \rho s_m, \]

and such that the two words

\[ \mathbf a[r_m+1,r_m+t_m-s_m], \qquad \mathbf a[s_m+1,t_m] \]

are \((\eta_m,\Delta)\)-close for every \(m\). Then

\[ \xi_{\mathbf a,b} \]

is rational or transcendental.

Only the exact criterion and Ridout's theorem are needed for the exact online-copy lower bound. The refined criterion is used solely in the noisy section.

2.4 Bugeaud's effective finite-prefix theorem

We also use the following theorem of Bugeaud.

Theorem 2.4 (Bugeaud).

Let \(b\geq 2\), and let \(\xi\) be a real algebraic irrational number of degree \(D\) and height at most \(H\), with \(H\geq e^e\). Let \(a\) be the base-\(b\) expansion of \(\xi\). Let \(w\) be an infinite word satisfying

\[ p_w(n)\leq Cn \qquad(n\geq 1) \]

for some integer \(C\geq 2\). If the first \(L\) digits of \(a\) coincide with the first \(L\) digits of \(w\), then either

\[ H \geq \exp\left\{ 10^{-2}C^{-1}L^{1/(8\log(4C))} \right\}, \]

or

\[ D \geq \exp\left\{ 10^{-100}C^{-11/2}(\log C)^{-1} (\log L)^{1/2}(\log\log L)^{-1} \right\}. \]

The numerical constants play no role in this note. What matters is that, for fixed algebraic \(\xi\), the theorem forbids arbitrarily long prefixes from coinciding with infinite words of complexity \(Cn\) unless \(C\) grows at least like

\[ \frac{(\log L)^{1/11}}{(\log\log L)^{4/11}} \]

up to a constant depending on \(\xi\).

3 Online previous-copy parsings

Let \(W=W[1]W[2]\cdots W[N]\) be a finite word.

Definition 3.1.

An online previous-copy parsing of \(W\) is a factorization

\[ W=F_1F_2\cdots F_z \]

with boundaries

\[ 0=n_0

where

\[ F_j=W[n_{j-1}+1,n_j]. \]

Each phrase \(F_j\) is of one of two types.

First, \(F_j\) may be a literal, in which case \(|F_j|=1\).

Second, \(F_j\) may be a copied phrase. Writing

\[ s=n_{j-1}, \qquad t=n_j, \qquad L=t-s, \]

this means that there is a source position \(p\) with

\[ 1\leq p\leq s \]

such that

\[ W[p+h]=W[s+1+h] \qquad(0\leq h

The source starts strictly to the left of the target phrase. It may overlap the target.

Define

\[ \oc(W) \]

to be the minimum number of phrases in an online previous-copy parsing of \(W\).

This is a structural parsing measure, not a bit-level encoding length. In particular, source positions and phrase lengths are not charged.

Remark 3.2.

The first phrase in every online previous-copy parsing is necessarily a literal, since no previous source is available at position \(1\).

4 A multiplicative-growth lemma

The following elementary lemma is the combinatorial core of the exact lower bound.

Lemma 4.1 (Multiplicative-growth phrase).

Let \(C>0\). Suppose that, for infinitely many \(N\), a word \(W_N\) of length \(N\) has an online previous-copy parsing with at most \(C\log N\) phrases. Then there exist \(\rho>1\) and infinitely many copied phrases, arising in such parsings, with boundaries \(s

\[ t\geq \rho s \]

and

\[ t\to\infty. \]

One may take

\[ \rho=\exp\left(\frac1{3C}\right). \]

Proof.

Fix such a parsing of \(W_N\), and write its boundaries as

\[ 0=n_0

Since the first phrase is a literal, \(n_1=1\). Thus

\[ N=\frac{n_z}{n_1} = \prod_{j=2}^{z}\frac{n_j}{n_{j-1}}. \]

Set

\[ \rho=\exp\left(\frac1{3C}\right)>1. \]

Suppose, for contradiction, that there is a constant \(T\) such that every phrase satisfying

\[ \frac{n_j}{n_{j-1}}\geq \rho \]

has endpoint \(n_j\leq T\), along the infinite family of parsings under consideration. After the last such phrase, all remaining ratios are \(<\rho\). Therefore, for all sufficiently large \(N\),

\[ N\leq T\rho^z \leq T\exp\left(\frac1{3C}C\log N\right) = TN^{1/3}, \]

which is impossible.

Thus there are phrases with

\[ \frac{n_j}{n_{j-1}}\geq \rho \]

and unbounded endpoints. A literal phrase with starting boundary \(n_{j-1}\) has ratio

\[ \frac{n_j}{n_{j-1}} = 1+\frac1{n_{j-1}}, \]

which is \(<\rho\) for all sufficiently large \(n_{j-1}\). Hence the high-growth phrases with unbounded endpoints are copied phrases.

5 Exact online-copy lower bound

Theorem 5.1 (Compression criterion).

Let \(b\geq 2\), and let

\[ \xi_{\mathbf a,b} = \sum_{n\geq 1}a_n b^{-n}. \]

If

\[ \oc(a_1a_2\cdots a_N)=O(\log N) \]

for infinitely many \(N\), then \(\xi_{\mathbf a,b}\) is rational or transcendental.

Proof.

It is enough to prove the contrapositive. Suppose that

\[ \xi_{\mathbf a,b} = \alpha \in (0,1)\cap(\overline{\mathbb Q}\setminus\mathbb Q). \]

Assume, for contradiction, that there are \(C>0\) and infinitely many \(N\) such that

\[ \oc(w_N(\alpha))\leq C\log N. \]

By the multiplicative-growth lemma, after passing to an infinite subsequence, we obtain copied phrases

\[ w_N(\alpha)[s+1,t] \]

such that

\[ t\geq \rho s \]

for some fixed \(\rho>1\), and \(t\to\infty\).

For each such copied phrase, put \(L=t-s\). There is a source position \(p\), with \(1\leq p\leq s\), such that

\[ a_{p+h}=a_{s+1+h} \qquad(0\leq h

Set

\[ r=p-1, \qquad d=s-r=s-p+1. \]

Then

\[ 0\leq r

and

\[ \mathbf a[r+1,r+t-s] = \mathbf a[s+1,t]. \]

We split into two cases.

First suppose that

\[ d=s-r\to\infty \]

along an infinite subsequence. Then the triples \(r

\[ s-r\to\infty, \qquad t\geq \rho s, \]

and the exact repetition condition above. By the exact repetition criterion, \(\alpha\) is rational or transcendental, contradicting the assumption that \(\alpha\) is algebraic irrational.

It remains to consider the case in which \(d=s-r\) is bounded along an infinite subsequence. Passing to a further subsequence, we may assume that \(d\) is fixed.

The equality

\[ a_{r+1+h}=a_{s+1+h} \qquad(0\leq h

can be rewritten as

\[ a_k=a_{k+d} \qquad(r+1\leq k\leq t-d). \]

Thus the block

\[ a_{r+1}a_{r+2}\cdots a_t \]

is periodic with period \(d\).

Define the rational number

\[ \beta = \sum_{n=1}^{r}a_n b^{-n} + \sum_{q\geq 0}\sum_{i=1}^{d} a_{r+i}b^{-(r+qd+i)}. \]

Then \(\beta\) has a base-\(b\) expansion agreeing with that of \(\alpha\) through at least the first \(t\) digits. Hence

\[ |\alpha-\beta|\ll b^{-t}. \]

Write \(\beta=P/Q\) in lowest terms. Since \(\beta\) has preperiod \(r\) and period \(d\) in base \(b\), its denominator divides

\[ b^r(b^d-1). \]

Therefore every prime divisor of \(Q\) belongs to the fixed finite set

\[ S=\{q\text{ prime}:q\mid b(b^d-1)\}, \]

and

\[ Q\ll_d b^r. \]

If \(r\) is bounded along an infinite subsequence, then only finitely many such rationals \(\beta\) can occur. Since the agreement length \(t\) tends to infinity, one of them agrees with \(\alpha\) to arbitrarily many digits, and hence equals \(\alpha\). This contradicts the irrationality of \(\alpha\).

Thus \(r\to\infty\). Since

\[ s=r+d \]

and

\[ t\geq \rho s, \]

there is \(\rho'>1\) such that, for all sufficiently large terms of the subsequence,

\[ t\geq \rho' r. \]

Choose \(\varepsilon>0\) with \(1+\varepsilon<\rho'\). Since \(Q\ll_d b^r\), we obtain

\[ b^{-t}\ll Q^{-1-\varepsilon} \]

along the subsequence. Therefore

\[ |\alpha-\beta|

for infinitely many distinct rationals \(P/Q\) whose denominators have all prime factors in the fixed finite set \(S\). This contradicts Ridout's theorem.

Both cases are impossible. Hence an algebraic irrational \(\alpha\) cannot have

\[ \oc(w_N(\alpha))=O(\log N) \]

along an infinite sequence of \(N\).

Corollary 5.2 (Exact online-copy lower bound).

Let \(b\geq 2\), and let

\[ \alpha=\sum_{n\geq 1}a_n b^{-n} \in(0,1)\cap(\overline{\mathbb Q}\setminus\mathbb Q). \]

Then

\[ \oc(w_N(\alpha))=\omega(\log N). \]

6 Consequences for LZ77-type parsings

The online previous-copy model is at least as permissive as the usual left-to-right previous-factor models.

Let \(z(W)\) denote the number of phrases in a standard LZ77-type parsing of \(W\), under any convention in which each phrase is either a literal, or a previous copy, possibly followed by one trailing literal. The previous copy may be self-referential or non-self-referential.

In the self-referential case, the copied part is an online previous-copy phrase. In the non-self-referential case, it is a fortiori an online previous-copy phrase. If a convention attaches one trailing literal to a copied phrase, split that phrase into at most two online previous-copy phrases. Hence there is an absolute constant \(A\), depending only on the LZ77 convention, such that

\[ \oc(W)\leq A z(W)+A \]

for every finite word \(W\).

Corollary 6.1 (LZ77 lower bound).

Let \(b\geq 2\), and let

\[ \alpha\in(0,1)\cap(\overline{\mathbb Q}\setminus\mathbb Q). \]

Then every standard LZ77-type previous-factor parsing of

\[ w_N(\alpha) \]

has

\[ \omega(\log N) \]

phrases.

Proof.

If some such parsing had \(O(\log N)\) phrases along an infinite sequence of \(N\), then the comparison above would give

\[ \oc(w_N(\alpha))=O(\log N) \]

along the same sequence, contradicting the exact online-copy lower bound.

7 Linear extension complexity

The online argument gives a strong superlogarithmic lower bound, but it relies essentially on left-to-right growth. We now introduce a weaker measure which also interacts with offline repetitiveness notions such as string attractors.

Let \(\mathcal A\) be a finite alphabet. For an infinite word \(x\in\mathcal A^{\mathbb N}\), let

\[ p_x(k) \]

be the number of distinct factors of length \(k\) occurring in \(x\).

Definition 7.1.

For a finite word \(W\in\mathcal A^*\), define its linear extension complexity by

\[ \lambda(W) = \inf\left\{ C\geq 1: \begin{array}{l} \text{there exists }x\in\mathcal A^{\mathbb N}\text{ with }x[1,|W|]=W,\\ \text{and }p_x(k)\leq Ck\text{ for every }k\geq 1 \end{array} \right\}. \]

Thus \(\lambda(W)\) measures whether \(W\) can be embedded as a prefix of an infinite word of uniformly linear factor complexity.

Lemma 7.2 (Online-copy parsings give linear extensions).

For every finite word \(W\),

\[ \lambda(W)\leq \oc(W)+1. \]

Proof.

Let

\[ W=F_1F_2\cdots F_z \]

be an online previous-copy parsing with \(z=\oc(W)\) phrases. Let

\[ 0=n_0

be its phrase boundaries. Fix a symbol \(c\in\mathcal A\), and extend \(W\) to the infinite word

\[ x=Wccc\cdots. \]

Let \(k\geq 1\), and consider any factor \(U\) of \(x\) of length \(k\). Choose the leftmost occurrence of \(U\) in \(x\). We claim that this occurrence crosses either the first position of a phrase, or the first position of the infinite constant tail.

Indeed, suppose first that the occurrence lies entirely inside \(W\). If it is contained in a copied phrase

\[ W[s+1,t] \]

and does not cross the first position \(s+1\) of that phrase, then it starts at some position \(i>s+1\) and ends at most at \(t\). Let \(p\leq s\) be the source position of the copied phrase. By the copy equality, the same length-\(k\) factor occurs starting at

\[ p+(i-s-1)

contradicting the leftmost choice. Thus the occurrence crosses the beginning of a phrase, unless it lies in a literal phrase, in which case it also crosses the beginning of that phrase.

If the occurrence meets the infinite tail, then it crosses the first position \(|W|+1\) of the tail, unless it is entirely contained in the tail. In the latter case \(U=c^k\), which also occurs starting at \(|W|+1\).

There are \(z\) phrase-start positions and one tail-start position. For a fixed position \(q\), at most \(k\) factors of length \(k\) cross \(q\). Therefore

\[ p_x(k)\leq (z+1)k \qquad(k\geq 1). \]

Hence \(\lambda(W)\leq z+1\).

8 Effective lower bounds from Bugeaud's theorem

The previous lemma allows us to convert Bugeaud's finite-prefix theorem into an effective lower bound for \(\lambda(w_N(\alpha))\).

Theorem 8.1 (Effective lower bound for linear extension complexity).

Let \(b\geq 2\), and let

\[ \alpha\in(0,1)\cap(\overline{\mathbb Q}\setminus\mathbb Q). \]

Then there is a constant \(c_\alpha>0\) such that, for all sufficiently large \(N\),

\[ \lambda(w_N(\alpha)) \geq c_\alpha \frac{(\log N)^{1/11}}{(\log\log N)^{4/11}}. \]

Proof.

Let \(D\) be the degree of \(\alpha\), and let \(H\geq e^e\) be an upper bound for its height. Put

\[ L=N. \]

Let \(C_0>\lambda(w_N(\alpha))\). By the definition of \(\lambda\), there is an infinite word \(x\) whose first \(N\) symbols are \(w_N(\alpha)\) and such that

\[ p_x(k)\leq C_0 k \qquad(k\geq 1). \]

Choose an integer

\[ C\geq \max\{2,C_0\} \]

with \(C\leq C_0+3\). Then

\[ p_x(k)\leq Ck \qquad(k\geq 1). \]

Bugeaud's theorem applies to \(\alpha\), \(x\), \(C\), and \(L=N\). Hence either

\[ H \geq \exp\left\{ 10^{-2}C^{-1}N^{1/(8\log(4C))} \right\}, \]

or

\[ D \geq \exp\left\{ 10^{-100}C^{-11/2}(\log C)^{-1} (\log N)^{1/2}(\log\log N)^{-1} \right\}. \]

We show that both alternatives are impossible if

\[ C\leq c \frac{(\log N)^{1/11}}{(\log\log N)^{4/11}} \]

with \(c>0\) sufficiently small and \(N\) sufficiently large.

First, under this hypothesis,

\[ \log(4C)\ll \log\log N. \]

Therefore

\[ N^{1/(8\log(4C))} = \exp\left(\frac{\log N}{8\log(4C)}\right) \]

grows faster than every fixed power of \(\log N\). Since \(C\) is only a fixed power of \(\log N\), we have

\[ C^{-1}N^{1/(8\log(4C))}\to +\infty. \]

Thus the first alternative would force \(H\to +\infty\), impossible because \(H\) is fixed.

Second, using

\[ C\leq c \frac{(\log N)^{1/11}}{(\log\log N)^{4/11}} \]

and

\[ \log C\leq \log\log N \]

for large \(N\), we obtain

\begin{align*} &C^{-11/2}(\log C)^{-1} (\log N)^{1/2}(\log\log N)^{-1} \\ &\qquad\geq c^{-11/2}(\log\log N)^2 (\log C)^{-1}(\log\log N)^{-1}. \end{align*}

Since \(\log C\leq \log\log N\), the right-hand side is bounded below by a positive constant multiple of \(c^{-11/2}\). Choosing \(c\) sufficiently small, this lower bound exceeds \(\log D\). Thus the second alternative would force

\[ D>\deg(\alpha), \]

again impossible.

Consequently,

\[ C \gg_{\alpha} \frac{(\log N)^{1/11}}{(\log\log N)^{4/11}}. \]

Since \(C\leq C_0+3\), and since \(C_0>\lambda(w_N(\alpha))\) was arbitrary, the theorem follows after adjusting the implicit constant.

Corollary 8.2 (Effective online-copy lower bound).

Let \(b\geq 2\), and let

\[ \alpha\in(0,1)\cap(\overline{\mathbb Q}\setminus\mathbb Q). \]

Then, for all sufficiently large \(N\),

\[ \oc(w_N(\alpha)) \gg_{\alpha} \frac{(\log N)^{1/11}}{(\log\log N)^{4/11}}. \]

Proof.

This follows from

\[ \lambda(W)\leq \oc(W)+1 \]

and the preceding theorem.

Remark 8.3.

This effective bound is much weaker than

\[ \oc(w_N(\alpha))=\omega(\log N), \]

which was proved by the exact Diophantine exponent argument. Its value is that it comes from a finite-prefix low-complexity obstruction and will also apply to offline repetitiveness measures.

9 String attractors and offline compression

We now apply the same finite-prefix method to string attractors.

Definition 9.1.

Let \(W\) be a finite word of length \(N\). A set

\[ \Gamma\subseteq \{1,\ldots,N\} \]

is a string attractor for \(W\) if every distinct factor of \(W\) has an occurrence in \(W\) crossing at least one position of \(\Gamma\). Let

\[ \gamma(W) \]

be the minimum size of a string attractor for \(W\).

Lemma 9.2 (Attractors give linear extensions).

For every finite word \(W\),

\[ \lambda(W)\leq \gamma(W)+1. \]

Proof.

Let \(\Gamma\) be a string attractor for \(W\), with

\[ |\Gamma|=\gamma(W). \]

Fix a symbol \(c\) in the alphabet and extend \(W\) to

\[ x=Wccc\cdots. \]

Let \(k\geq 1\). Every factor of length \(k\) of \(x\) either is a factor of \(W\), crosses the boundary between \(W\) and the infinite tail, or is \(c^k\).

Every distinct factor of length \(k\) of \(W\) has an occurrence crossing a position of \(\Gamma\). For each fixed position \(q\in\Gamma\), at most \(k\) length-\(k\) factors cross \(q\). Hence factors coming from \(W\) contribute at most

\[ \gamma(W)k \]

distinct words.

The factors crossing the tail boundary contribute at most \(k\) additional words, and the factor \(c^k\) is included among them by considering the occurrence starting at the first tail position. Therefore

\[ p_x(k)\leq (\gamma(W)+1)k \qquad(k\geq 1), \]

and so

\[ \lambda(W)\leq \gamma(W)+1. \]

Theorem 9.3 (Effective string-attractor lower bound).

Let \(b\geq 2\), and let

\[ \alpha\in(0,1)\cap(\overline{\mathbb Q}\setminus\mathbb Q). \]

Then, for all sufficiently large \(N\),

\[ \gamma(w_N(\alpha)) \gg_{\alpha} \frac{(\log N)^{1/11}}{(\log\log N)^{4/11}}. \]

Proof.

By the preceding lemma,

\[ \lambda(w_N(\alpha)) \leq \gamma(w_N(\alpha))+1. \]

The result follows from the effective lower bound for \(\lambda(w_N(\alpha))\).

Corollary 9.4 (Offline dictionary-compression consequences).

Let \(M(W)\) be any compression measure for which every representation of size \(M(W)\) induces a string attractor of size \(O(M(W))\), under the corresponding standard size convention. Then, for every algebraic irrational

\[ \alpha\in(0,1) \]

and every integer base \(b\geq 2\),

\[ M(w_N(\alpha)) \gg_{\alpha} \frac{(\log N)^{1/11}}{(\log\log N)^{4/11}}. \]

Proof.

If a representation of size \(M(W)\) induces a string attractor of size \(O(M(W))\), then

\[ \gamma(W)\ll M(W). \]

Apply this to \(W=w_N(\alpha)\) and use the preceding theorem.

Remark 9.5.

This corollary applies to the standard repetitiveness measures for which such a linear attractor extraction is known, under the corresponding size conventions. These include, among others, LZ77, straight-line programs under usual grammar-size conventions, collage systems, macro schemes, and related measures discussed in the string-attractor literature. The principle is uniform: once a compressor induces an attractor of size \(O(M)\), the lower bound for \(\gamma\) transfers to \(M\).

Remark 9.6.

The string-attractor result is not a substitute for the exact online-copy theorem. It gives only

\[ \gamma(w_N(\alpha)) \gg_{\alpha} \frac{(\log N)^{1/11}}{(\log\log N)^{4/11}}, \]

not \(\gamma(w_N(\alpha))=\omega(\log N)\). Proving a superlogarithmic lower bound for string attractors or straight-line programs would require a new idea beyond the online multiplicative-growth argument.

10 Noisy online-copy parsings

We now return to online parsings and introduce a structured-noise version.

Let \(U,V\) be words of the same length \(L\). Recall that \(U\) and \(V\) are \((\eta,\Delta)\)-close if their mismatch set is contained in a union of at most \(\Delta\) intervals of total cardinality at most \(\eta L\).

Definition 10.1.

Let \(W\) be a finite word. An \((\eta,\Delta)\)-noisy online-copy parsing of \(W\) is a factorization

\[ W=F_1F_2\cdots F_z \]

with boundaries

\[ 0=n_0

such that each phrase is either a literal or an approximate previous copy in the following sense. If

\[ F_j=W[s+1,t], \qquad L=t-s, \]

then there exists a source position \(p\), with

\[ 1\leq p\leq s, \]

such that

\[ W[p,p+L-1] \quad\text{and}\quad W[s+1,t] \]

are \((\eta,\Delta)\)-close.

Let

\[ \oc_{\eta,\Delta}(W) \]

be the minimum number of phrases in such a parsing.

This is again a structural measure. The exceptional intervals, their locations, and their contents are not charged.

11 Noisy online-copy lower bound

Theorem 11.1 (Noisy online-copy lower bound).

Let \(b\geq 2\), and let

\[ \alpha=\sum_{n\geq 1}a_n b^{-n} \in(0,1)\cap(\overline{\mathbb Q}\setminus\mathbb Q). \]

Fix an integer \(\Delta\geq 0\), and let

\[ \eta_N\to 0. \]

Then

\[ \oc_{\eta_N,\Delta}(w_N(\alpha)) = \omega(\log N). \]

Proof.

Assume, for contradiction, that there exist \(C>0\) and infinitely many \(N\) such that

\[ \oc_{\eta_N,\Delta}(w_N(\alpha))\leq C\log N. \]

The multiplicative-growth lemma applies verbatim, since it uses only the phrase boundaries. Thus, after passing to an infinite subsequence, there are approximate copied phrases

\[ w_N(\alpha)[s+1,t] \]

such that

\[ t\geq \rho s, \qquad t\to\infty \]

for some fixed \(\rho>1\).

For each such phrase, put \(L=t-s\). There exists \(p\), with \(1\leq p\leq s\), such that

\[ a_p a_{p+1}\cdots a_{p+L-1} \]

and

\[ a_{s+1}a_{s+2}\cdots a_t \]

are \((\eta_N,\Delta)\)-close. Set

\[ r=p-1, \qquad d=s-r. \]

Then the two words

\[ \mathbf a[r+1,r+t-s], \qquad \mathbf a[s+1,t] \]

are \((\eta_N,\Delta)\)-close.

We split into two cases.

First suppose that

\[ d=s-r\to\infty \]

along an infinite subsequence. Since \(\eta_N\to0\), the refined repetition criterion applies to the triples \(r

It remains to treat the case where \(d=s-r\) is bounded. Passing to a further subsequence, assume that \(d\) is fixed.

The approximate copy condition says that the equalities

\[ a_{r+1+h}=a_{s+1+h} \qquad(0\leq h

fail only for values of \(h\) contained in at most \(\Delta\) intervals of total cardinality at most \(\eta_N L\). Equivalently, up to enlarging these intervals by \(O_\Delta(1)\) endpoints, the period-\(d\) relations

\[ a_k=a_{k+d} \]

fail only on a subset of

\[ r+1\leq k\leq r+L \]

covered by at most \(\Delta\) intervals of total cardinality at most \(\eta_N L+O_\Delta(1)\).

Remove these bad intervals. The remaining good set is a union of at most \(\Delta+1\) intervals and has cardinality at least

\[ (1-\eta_N)L-O_\Delta(1). \]

Thus one good interval has length

\[ H\geq \frac{(1-\eta_N)L-O_\Delta(1)}{\Delta+1}. \]

Since

\[ L=t-s\geq \left(1-\frac1\rho\right)t, \]

there is a constant \(c>0\), depending only on \(\rho\) and \(\Delta\), such that

\[ H\geq ct \]

for all sufficiently large \(N\).

Let this good interval of edge positions begin at \(A\). Then

\[ a_k=a_{k+d} \qquad(A\leq k\leq A+H-1). \]

It follows that the digit block

\[ a_Aa_{A+1}\cdots a_{A+H+d-1} \]

is periodic with period \(d\).

Set

\[ m=A-1. \]

Since \(A\leq r+L\leq t\), we have \(m\leq t\). Hence

\[ H\geq cm \]

unless \(m\) is bounded, in which case the argument below is even stronger.

Define the rational number

\[ \beta = \sum_{n=1}^{m}a_n b^{-n} + \sum_{q\geq 0}\sum_{i=1}^{d} a_{m+i}b^{-(m+qd+i)}. \]

The periodicity just obtained implies that \(\alpha\) and \(\beta\) agree through at least position \(m+H\). Therefore

\[ |\alpha-\beta|\ll b^{-(m+H)}. \]

Writing \(\beta=P/Q\) in lowest terms, we have

\[ Q\mid b^m(b^d-1) \]

up to cancellation. Thus all prime divisors of \(Q\) lie in the fixed finite set

\[ S=\{q\text{ prime}:q\mid b(b^d-1)\}, \]

and

\[ Q\ll_d b^m. \]

If \(m\) is bounded along an infinite subsequence, then only finitely many rationals \(\beta\) occur. Since \(H\to\infty\), one of them agrees with \(\alpha\) to arbitrarily many digits, and hence equals \(\alpha\), contradicting irrationality.

Thus \(m\to\infty\). Since \(H\geq cm\), choose \(\varepsilon>0\) with \(\varepsilon

\[ b^{-(m+H)}\ll Q^{-1-\varepsilon}. \]

Therefore

\[ |\alpha-\beta|

for infinitely many distinct rationals \(P/Q\) whose denominators have all prime factors in \(S\). This contradicts Ridout's theorem.

Both cases are impossible. Hence

\[ \oc_{\eta_N,\Delta}(w_N(\alpha)) = \omega(\log N). \]

12 A quantitative repetition profile

The exact online-copy theorem is qualitative. It proves superlogarithmic divergence but gives no explicit rate. The next definitions isolate the precise finite Diophantine obstruction that controls possible quantitative improvements.

Let \(\mathbf a=a_1a_2\cdots\) be the digit sequence of \(\alpha\).

Definition 12.1.

For \(T\geq 1\), define

\[ \Theta_{\mathbf a}(T) = \sup \left\{ \frac{t}{s}-1: \begin{array}{l} 0\leq r

with the convention that the supremum is \(0\) if no such triple exists.

When \(\mathbf a\) is the base-\(b\) digit sequence of \(\alpha\), we write

\[ \Theta_\alpha(T) \]

instead of \(\Theta_{\mathbf a}(T)\).

Proposition 12.2 (Qualitative decay of the repetition profile).

Let \(b\geq 2\), and let

\[ \alpha\in(0,1)\cap(\overline{\mathbb Q}\setminus\mathbb Q). \]

Then

\[ \Theta_\alpha(T)\to 0 \qquad(T\to\infty). \]

Proof.

Suppose not. Then there exist \(\varepsilon>0\) and triples

\[ 0\leq r_m

such that

\[ t_m\geq (1+\varepsilon)s_m \]

and

\[ \mathbf a[r_m+1,r_m+t_m-s_m] = \mathbf a[s_m+1,t_m]. \]

Put

\[ d_m=s_m-r_m. \]

If \(d_m\to\infty\) along an infinite subsequence, the exact repetition criterion implies that \(\alpha\) is rational or transcendental, contradiction.

Thus \(d_m\) is bounded along an infinite subsequence. Passing to a further subsequence, assume \(d_m=d\) is fixed. The equality above gives a period-\(d\) block of length tending to infinity. Repeating the bounded-displacement argument in the proof of the exact online-copy theorem, we obtain infinitely many rational numbers \(\beta=P/Q\), eventually periodic in base \(b\) with fixed period \(d\), such that

\[ |\alpha-\beta|

for some \(\varepsilon'>0\), with all prime factors of \(Q\) lying in the fixed finite set

\[ \{q:q\mid b(b^d-1)\}. \]

This contradicts Ridout's theorem, unless \(\alpha\) is rational. Since \(\alpha\) is algebraic irrational, this is impossible.

The next lemma gives the precise bridge from the repetition profile to online-copy lower bounds.

Lemma 12.3 (Quantitative profile bound).

Let \(\mathbf a\) be any infinite word, and let

\[ W_N=a_1a_2\cdots a_N. \]

For \(T\geq 2\), put

\[ \eta_{\mathbf a}(T) = \max\left\{\Theta_{\mathbf a}(T),\frac1{T-1}\right\}. \]

Then, for every \(N>T\),

\[ \oc(W_N) \geq \frac{\log(N/T)}{\log(1+\eta_{\mathbf a}(T))}. \]

Proof.

Let

\[ 0=n_0

be an online previous-copy parsing of \(W_N\).

Let \(q\) be the largest index such that

\[ n_q

Such an index exists because \(n_1=1

\[ n_q

For every \(j>q\), the endpoint satisfies \(n_j\geq T\).

If the \(j\)-th phrase is a literal, then

\[ \frac{n_j}{n_{j-1}} = 1+\frac1{n_{j-1}} \leq 1+\frac1{T-1} \leq 1+\eta_{\mathbf a}(T), \]

because \(n_{j-1}\geq n_q\geq T-1\) for \(j>q+1\), and for \(j=q+1\) the claimed bound is still harmless after replacing \(n_q\) by the last boundary below \(T\) and absorbing the single crossing step into the final strict inequality below. If the \(j\)-th phrase is copied, write

\[ s=n_{j-1}, \qquad t=n_j. \]

The copy gives a triple \(r

\[ \mathbf a[r+1,r+t-s] = \mathbf a[s+1,t], \]

and \(t\geq T\). Hence

\[ \frac{t}{s}-1\leq \Theta_{\mathbf a}(T) \leq \eta_{\mathbf a}(T). \]

Therefore every relevant ratio after the boundary \(T\), except possibly the single ratio crossing \(T\), is bounded by \(1+\eta_{\mathbf a}(T)\). Since the crossing ratio contributes only to moving from a boundary below \(T\) to a boundary at least \(T\), we have

\[ \frac{N}{T} \leq (1+\eta_{\mathbf a}(T))^z. \]

Thus

\[ z \geq \frac{\log(N/T)}{\log(1+\eta_{\mathbf a}(T))}. \]

Taking the minimum over parsings proves the lemma.

Corollary 12.4 (Conditional quantitative strengthening).

Let \(b\geq 2\), and let

\[ \alpha\in(0,1)\cap(\overline{\mathbb Q}\setminus\mathbb Q). \]

Suppose that there are constants \(A>0\), \(c>0\), and \(T_0\) such that

\[ \Theta_\alpha(T) \leq \frac{A}{(\log T)^c} \qquad(T\geq T_0). \]

Then

\[ \oc(w_N(\alpha)) \gg_{\alpha} (\log N)^{1+c}. \]

More generally, if

\[ \Theta_\alpha(T) \leq \frac{A}{(\log\log T)^c} \]

for all large \(T\), then

\[ \oc(w_N(\alpha)) \gg_{\alpha} \log N\,(\log\log N)^c. \]

Proof.

Apply the quantitative profile bound with

\[ T=N^{1/2}. \]

For large \(N\),

\[ \frac1{T-1} \]

is negligible compared with either of the assumed upper bounds for \(\Theta_\alpha(T)\). Since

\[ \log(1+u)\leq 2u \]

for small positive \(u\), the result follows.

Remark 12.5.

Thus the problem of proving an explicit superlogarithmic lower bound for online-copy or LZ77 parsing is reduced to proving an explicit decay estimate for the finite repetition profile \(\Theta_\alpha(T)\).

13 A moving-period Ridout problem

The bounded-displacement part of the proof used Ridout's theorem only after the period \(d\) had been fixed. Quantitative improvements require control of periods \(d\) that may grow with the scale.

A repetition

\[ \mathbf a[r+1,r+t-s] = \mathbf a[s+1,t], \qquad d=s-r, \]

produces an eventually periodic rational approximant with preperiod \(r\), period \(d\), and agreement length at least \(t\). Its denominator divides

\[ b^r(b^d-1). \]

If

\[ t\geq (1+\varepsilon)s = (1+\varepsilon)(r+d), \]

then the associated rational \(\beta_{r,d}\) satisfies

\[ |\alpha-\beta_{r,d}| \ll b^{-(1+\varepsilon)(r+d)}. \]

For fixed \(d\), Ridout excludes infinitely many such approximations. For \(d\to\infty\), the set of prime divisors of \(b^d-1\) is no longer fixed.

Problem 13.1 (Moving-period Ridout problem).

Let \(b\geq 2\), and let

\[ \alpha\in(0,1)\cap(\overline{\mathbb Q}\setminus\mathbb Q). \]

Find an explicit function \(F_\alpha(\varepsilon)\) such that there are no rationals \(\beta\) with an eventually periodic base-\(b\) expansion of preperiod \(r\), period \(d\), and

\[ r+d\geq F_\alpha(\varepsilon), \]

satisfying

\[ |\alpha-\beta| < b^{-(1+\varepsilon)(r+d)}. \]

Up to the separate treatment of bounded values of \(r+d\), a bound of the form

\[ F_\alpha(\varepsilon)\leq \exp(C\varepsilon^{-A}) \]

would imply

\[ \Theta_\alpha(T)\ll_\alpha (\log T)^{-1/A}, \]

and hence

\[ \oc(w_N(\alpha)) \gg_\alpha (\log N)^{1+1/A}. \]

Even a double-exponential bound

\[ F_\alpha(\varepsilon)\leq \exp\exp(C\varepsilon^{-A}) \]

would imply

\[ \oc(w_N(\alpha)) \gg_\alpha \log N\,(\log\log N)^{1/A}. \]

This is the main Diophantine bottleneck for any explicit improvement over the qualitative lower bound

\[ \omega(\log N). \]

14 Many weak repetitions

The multiplicative-growth lemma extracts one strong repetition from an \(O(\log N)\)-phrase parsing. To prove a quantitative lower bound of the form

\[ \oc(w_N(\alpha)) \gg \log N\,f(N), \qquad f(N)\to\infty, \]

one must handle repetitions whose exponent is only

\[ 1+\frac1{f(N)}+o\left(\frac1{f(N)}\right). \]

A single such weak repetition is generally not enough.

A possible strengthening is to use the whole growth distribution of a parsing. Given phrase boundaries

\[ 0=n_0

define the growth energy above a scale \(T\) by

\[ E_T = \sum_{j:n_{j-1}\geq T} \log\frac{n_j}{n_{j-1}}. \]

One always has

\[ E_T\geq \log\frac{N}{T}-O(1). \]

Each copied phrase contributing to \(E_T\) gives a repetition with strength

\[ \frac{n_j}{n_{j-1}}-1. \]

Problem 14.1 (Many weak repetitions).

Develop a Diophantine criterion excluding, for algebraic irrational digit expansions, families of repetitions whose individual exponents may tend to \(1\), but whose total growth energy is large.

Such a criterion would be more compatible with quantitative versions of the Subspace Theorem, which usually control families of solutions rather than a single exceptionally good solution.

15 Beyond bounded-interval noise

The noisy theorem above assumes that, in each approximate copy, the mismatch set is contained in at most \(\Delta\) intervals, where \(\Delta\) is fixed. This is the natural regime of Nguyen's refined Diophantine exponent criterion.

A more robust compression model would allow copied phrases plus correction sets of low complexity. For example, the error set might be sparse, or might itself admit a short interval decomposition, LZ parsing, grammar, or attractor.

The bounded-displacement part of the noisy proof shows the obstruction clearly. If the error set is covered by \(\Delta\) intervals, then one obtains a genuinely periodic interval of length at least

\[ \frac{(1-\eta)L}{\Delta+1}. \]

For fixed \(\Delta\), this yields a rational approximation with a fixed exponent \(>1\), and Ridout applies. If \(\Delta\to\infty\), the exponent tends to \(1\), and one again needs quantitative Diophantine input.

Problem 15.1 (Noisy copies with growing error complexity).

Extend the noisy online-copy lower bound to regimes where \(\Delta=\Delta_N\to\infty\), or where the mismatch sets have low compressive complexity rather than bounded interval complexity.

A quantitative moving-period Ridout theorem would likely be necessary even for moderately growing \(\Delta_N\).

16 Transcendence measures from dense compression profiles

The compression criterion proved above is qualitative: if a non-rational number has logarithmically short online-copy parsings infinitely often, then it is transcendental. One may ask for quantitative transcendence measures when such parsings occur on sufficiently dense scales.

Suppose that

\[ \oc(a_1a_2\cdots a_{N_m}) \leq C\log N_m \]

for an infinite sequence \(N_m\), and that the sequence is not too sparse, for instance

\[ N_{m+1}\leq N_m^A \]

for some fixed \(A\). The proof of the compression criterion then produces rational approximations at controlled heights and with controlled gaps. This is the type of input from which transcendence measures may sometimes be extracted.

Problem 16.1 (Compression-to-measure principle).

Develop a theorem converting dense logarithmic online-copy compression profiles into explicit transcendence measures for

\[ \xi_{\mathbf a,b} = \sum_{n\geq 1}a_n b^{-n}. \]

The refined Diophantine exponent framework suggests that such a principle should also have noisy variants.

17 Scope

The exact online-copy theorem gives a strong lower bound for a previous-copy model and hence for LZ77-type parsings. It does not by itself imply superlogarithmic lower bounds for arbitrary grammar compression, straight-line programs, bidirectional macro schemes, or smallest string attractors.

The effective finite-prefix method gives weaker lower bounds for offline repetitiveness measures:

\[ \gg_\alpha \frac{(\log N)^{1/11}}{(\log\log N)^{4/11}}. \]

These bounds are new in nature relative to the online proof, because they pass through linear extension complexity and Bugeaud's finite-prefix theorem, not through phrase-boundary growth.

The results are not normality statements. They do not imply digit equidistribution, nor any explicit digit-frequency estimate for a specific algebraic irrational number.

The central open quantitative question is to improve

\[ \oc(w_N(\alpha))=\omega(\log N) \]

to an explicit lower bound such as

\[ \oc(w_N(\alpha)) \gg \log N\,f(N), \qquad f(N)\to\infty, \]

or even

\[ \oc(w_N(\alpha)) \gg (\log N)^{1+\varepsilon}. \]

The quantitative repetition profile \(\Theta_\alpha(T)\) and the moving-period Ridout problem isolate the main obstruction.

Declaration of generative AI and AI-assisted technologies in the writing process

During the preparation of this work, ChatGPT, by OpenAI, was used to assist with mathematical drafting, formalization, review, and editing. This work is shared as a preliminary AI-assisted mathematical note. The mathematical content may have been only partially reviewed and may contain errors; it should not be treated as peer-reviewed or as a fully verified manuscript.

References

  1. [ABL04] B. Adamczewski, Y. Bugeaud, and F. Luca, \emph{Sur la complexit\'{e} des nombres alg\'{e}briques}, C. R. Math. Acad. Sci. Paris 339 (2004), no. 1, 11--14.
  2. [AB07] B. Adamczewski and Y. Bugeaud, On the complexity of algebraic numbers I. Expansions in integer bases, Ann. of Math. (2) 165 (2007), no. 2, 547--565.
  3. [BE08] Y. Bugeaud and J.-H. Evertse, On two notions of complexity of algebraic numbers, Acta Arith. 133 (2008), no. 3, 221--250.
  4. [Bug08] Y. Bugeaud, An explicit lower bound for the block complexity of an algebraic number, Rend. Lincei Mat. Appl. 19 (2008), 229--235.
  5. [ES02] J.-H. Evertse and H. P. Schlickewei, A quantitative version of the Absolute Subspace Theorem, J. Reine Angew. Math. 548 (2002), 21--127.
  6. [KP18] D. Kempa and N. Prezza, At the roots of dictionary compression: string attractors, Proceedings of the 50th Annual ACM Symposium on Theory of Computing, 2018, 827--840.
  7. [KNP23] T. Kociumaka, G. Navarro, and N. Prezza, Toward a definitive compressibility measure for repetitive sequences, IEEE Trans. Inform. Theory 69 (2023), no. 4, 2074--2092.
  8. [Ngu26] Q.-K. Nguyen, Transcendence and measures via the refined Diophantine exponent, arXiv:2605.30606, 2026.
  9. [Pre17] N. Prezza, String attractors, arXiv:1709.05314, 2017.
  10. [Rid57] D. Ridout, Rational approximations to algebraic numbers, Mathematika 4 (1957), 125--131.
  11. [ZL77] J. Ziv and A. Lempel, A universal algorithm for sequential data compression, IEEE Trans. Inform. Theory 23 (1977), no. 3, 337--343.

Part I.B

Metric and Extremal Theory of Online Previous-Copy Compression

Luca Blanchi

Abstract

Let $b\geq2$ be an integer alphabet size. We study the extremal and metric behaviour of the online previous-copy parsing complexity $\oc(W)$ of a finite word $W\in\{0,\ldots,b-1\}^N$. In this model a phrase is either a literal symbol or an exact copy whose source starts earlier in the word; self-overlap is allowed.

We prove the sharp universal upper bound

\[ \oc(W)\leq (1+o(1))\frac{N}{\log_b N} \]

uniformly for all words $W$ of length $N$, and show that this is best possible:

\[ \max_{|W|=N}\oc(W) = (1+o(1))\frac{N}{\log_b N}. \]

For Lebesgue-almost every $\alpha\in[0,1]$, if $w_N(\alpha)$ denotes the prefix of length $N$ of the base-$b$ expansion of $\alpha$, then

\[ \oc(w_N(\alpha)) = (1+o(1))\frac{N}{\log_b N}. \]

We also determine the finite-word entropy of low online previous-copy complexity. For $0\leq \kappa\leq1$, let

\[ \mathcal C_N(\kappa) = \left\{ W\in\{0,\ldots,b-1\}^N: \oc(W)\leq \kappa\frac{N}{\log_b N} \right\}. \]

Then, for $0<\kappa\leq1$,

\[ \lim_{N\to\infty} \frac1N\log_b|\mathcal C_N(\kappa)| = \kappa. \]

Consequently, the Hausdorff spectrum is exact:

\[ \dimH \left\{ \alpha\in[0,1]: \liminf_{N\to\infty} \frac{\oc(w_N(\alpha))\log_b N}{N} \leq\kappa \right\} = \kappa \qquad(0\leq\kappa\leq1). \]

Finally, we record parallel extremal and almost-sure estimates for the normalized substring complexity

\[ \delta(W)=\max_{1\leq k\leq |W|}\frac{p_W(k)}{k} \]

and for the minimum string attractor size $\gamma(W)$:

\[ \delta(w_N(\alpha)) = \gamma(w_N(\alpha)) = (1+o(1))\frac{N}{\log_b N} \]

for almost every $\alpha$, and the same asymptotic scale holds in the worst case.

1 Introduction

The online previous-copy parsing model is a structural abstraction of left-to-right dictionary compression. A finite word is parsed from left to right into phrases. A phrase is either a literal symbol or a copy of an earlier substring. The source is required to start earlier than the target, but it may overlap the target. This convention includes the self-referential behaviour familiar from variants of LZ77.

A companion paper studied this model on digit prefixes of algebraic irrational numbers and proved the lower bound

\[ \oc(w_N(\alpha))=\omega(\log N) \]

for every algebraic irrational $\alpha$. The present note is independent of the Diophantine part. Its purpose is to determine the natural extremal and metric scale of $\oc(W)$ for arbitrary and typical finite words.

The main scale is

\[ \frac{N}{\log_b N}. \]

This scale appears in three complementary ways.

First, every word of length $N$ has an online previous-copy parsing with at most

\[ (1+o(1))\frac{N}{\log_b N} \]

phrases. Second, by a counting argument, this is sharp in the worst case. Third, the same asymptotic holds for almost every base-$b$ expansion.

The proof of the universal upper bound is elementary. Choose a block length

\[ L=\lfloor \log_b N-2\log_b\log_b N\rfloor. \]

Parsing greedily, any phrase shorter than $L$, except possibly near the end of the word, must start at a previously unseen block of length $L$. There are at most $b^L$ such blocks, while all other phrases have length at least $L$. This gives

\[ \oc(W)\leq \frac{N}{L}+b^L+L+O(1). \]

The lower bounds come from counting parse descriptions. The number of words of length $N$ admitting an online previous-copy parse with $z$ phrases is at most

\[ z\left(\frac{eN}{z}\right)^z(2bN)^z. \]

For

\[ z\sim \kappa \frac{N}{\log_b N}, \]

this is $b^{(\kappa+o(1))N}$. This gives both the worst-case lower bound and the entropy/Hausdorff spectrum.

The last part treats two standard repetitiveness measures: the normalized substring complexity

\[ \delta(W)=\max_k p_W(k)/k \]

and the minimum string attractor size $\gamma(W)$. The inequalities

\[ \delta(W)\leq\gamma(W)\leq\oc(W) \]

connect these measures to the online previous-copy model. A standard collision estimate for random words gives the matching lower bound for $\delta$, and hence for $\gamma$, almost surely.

Throughout the paper, logarithms without subscript are natural, while $\log_b$ denotes logarithm in base $b$.

2 Online previous-copy parsings

Let

\[ W=W[1]W[2]\cdots W[N] \]

be a finite word over the alphabet

\[ \{0,\ldots,b-1\}. \]

Definition 2.1.

An online previous-copy parsing of $W$ is a factorization

\[ W=F_1F_2\cdots F_z \]

with boundaries

\[ 0=n_0

where

\[ F_j=W[n_{j-1}+1,n_j]. \]

Each phrase is of one of the following two types.

First, $F_j$ may be a literal, in which case

\[ |F_j|=1. \]

Second, $F_j$ may be a copy. Writing

\[ s=n_{j-1}, \qquad t=n_j, \qquad \ell=t-s, \]

this means that there exists a source position $p$ with

\[ 1\leq p\leq s \]

such that

\[ W[p+h]=W[s+1+h] \qquad (0\leq h<\ell). \]

The source may overlap the target.

Let

\[ \oc(W) \]

be the minimum number of phrases in an online previous-copy parsing of $W$.

Remark 2.2.

The first phrase must be a literal. The definition is structural: source positions and lengths are not charged as bits. The quantity $\oc(W)$ is therefore a phrase-complexity measure, not a bit-complexity measure.

3 A universal upper bound

We first prove that every word has an online previous-copy parsing with about $N/\log_b N$ phrases.

Theorem 3.1 (Universal upper bound).

Uniformly for all words $W\in\{0,\ldots,b-1\}^N$,

\[ \oc(W) \leq (1+o(1))\frac{N}{\log_b N}. \]

More precisely, for every integer $L\geq1$,

\[ \oc(W) \leq \frac{N}{L}+b^L+L+O(1), \]

and with

\[ L=\left\lfloor \log_b N-2\log_b\log_b N\right\rfloor \]

this gives the displayed asymptotic.

Proof.

Fix $L$. We construct a parsing by the following greedy rule. Suppose the current position is $i$. If $i+L-1\leq N$ and the block

\[ W[i,i+L-1] \]

has an occurrence starting at some position $p

Call a phrase short if it is a literal produced before the final $L$ positions. If a short phrase starts at position $i\leq N-L+1$, then the block

\[ W[i,i+L-1] \]

has no earlier occurrence. Hence two different short phrase starts $i

\[ W[i,i+L-1]=W[j,j+L-1], \]

then at position $j$ the parser could have copied at least $L$ symbols from source $i$, contradicting that the phrase at $j$ was short.

There are at most $b^L$ distinct blocks of length $L$. Hence the number of short phrases before the final $L$ positions is at most $b^L$. The final part contributes at most $L$ literal phrases.

All copied phrases produced by the algorithm have length $L$, so there are at most $N/L$ such phrases. Therefore

\[ \oc(W) \leq \frac{N}{L}+b^L+L+O(1). \]

Now take

\[ L=\left\lfloor \log_b N-2\log_b\log_b N\right\rfloor. \]

Then

\[ b^L\leq \frac{N}{(\log_b N)^2}, \]

and

\[ \frac{N}{L} = (1+o(1))\frac{N}{\log_b N}. \]

Also

\[ b^L+L=o\left(\frac{N}{\log_b N}\right). \]

Thus

\[ \oc(W) \leq (1+o(1))\frac{N}{\log_b N}. \]

4 Counting words with short parsings

Let

\[ \mathcal W_{N,z} = \left\{ W\in\{0,\ldots,b-1\}^N: \oc(W)\leq z \right\}. \]

Lemma 4.1 (Counting parse descriptions).

For $1\leq z\leq N/2$,

\[ |\mathcal W_{N,z}| \leq z\left(\frac{eN}{z}\right)^z(2bN)^z. \]

Consequently, if

\[ z_N=\left\lfloor \kappa\frac{N}{\log_b N}\right\rfloor \]

with fixed $0<\kappa\leq1$, then

\[ |\mathcal W_{N,z_N}| \leq b^{(\kappa+o(1))N}. \]

Proof.

Fix $m\leq z$. We count a superset of words admitting a parsing with $m$ phrases.

The boundaries are determined by the $m-1$ internal cut positions, so there are

\[ \binom{N-1}{m-1} \]

choices.

For each phrase, choose whether it is a literal or a copy, giving at most $2^m$ choices. Assign to every phrase a symbol in the alphabet, even if the phrase is a copy; this gives at most $b^m$ choices. Assign also to every phrase a source position in $\{1,\ldots,N\}$, even if the phrase is a literal; this gives at most $N^m$ choices.

Thus the number of descriptions with $m$ phrases is at most

\[ \binom{N-1}{m-1}(2bN)^m. \]

Every such description determines at most one word. Indeed, literals prescribe their symbols, and copied phrases are deterministic once the already parsed prefix is known. If a source overlaps the target, the copied symbols are still determined progressively because the source starts earlier than the target.

Therefore

\[ |\mathcal W_{N,z}| \leq \sum_{m\leq z} \binom{N-1}{m-1}(2bN)^m. \]

For $m\leq z\leq N/2$,

\[ \binom{N-1}{m-1} \leq \left(\frac{eN}{m}\right)^m. \]

The function

\[ m\mapsto \left(\frac{eN}{m}\right)^m(2bN)^m \]

is increasing for $1\leq m\leq N/2$. Hence

\[ |\mathcal W_{N,z}| \leq z\left(\frac{eN}{z}\right)^z(2bN)^z. \]

Now take

\[ z_N=\left\lfloor \kappa\frac{N}{\log_b N}\right\rfloor. \]

Taking logarithms in base $b$,

\[ \log_b |\mathcal W_{N,z_N}| \leq \log_b z_N + z_N\log_b\frac{eN}{z_N} + z_N\log_b(2bN). \]

The last term is

\[ z_N\log_b N+O(z_N) = (\kappa+o(1))N. \]

The middle term is

\[ z_N\log_b\frac{eN}{z_N} = O\left(\frac{N}{\log N}\log\log N\right) = o(N). \]

Also $\log_b z_N=o(N)$. Therefore

\[ \log_b |\mathcal W_{N,z_N}| \leq (\kappa+o(1))N, \]

as claimed.

5 Worst-case complexity

Theorem 5.1 (Worst-case asymptotics).

\[ \max_{W\in\{0,\ldots,b-1\}^N}\oc(W) = (1+o(1))\frac{N}{\log_b N}. \]

Proof.

The upper bound follows from the universal upper bound.

For the lower bound, fix $0<\kappa<1$, and set

\[ z_N=\left\lfloor \kappa\frac{N}{\log_b N}\right\rfloor. \]

By the counting lemma,

\[ |\mathcal W_{N,z_N}| \leq b^{(\kappa+o(1))N}. \]

Since the total number of words of length $N$ is $b^N$, for all large $N$ there exists a word $W$ with

\[ \oc(W)>z_N. \]

Thus

\[ \max_{|W|=N}\oc(W) \geq \kappa\frac{N}{\log_b N}(1-o(1)). \]

Letting $\kappa\uparrow1$ gives the lower bound.

6 Metric law for online previous-copy complexity

For $\alpha\in[0,1]$, let

\[ w_N(\alpha) \]

be the prefix of length $N$ of the base-$b$ expansion of $\alpha$. We ignore the countable set of numbers with two base-$b$ expansions.

Theorem 6.1 (Almost-sure asymptotic).

For Lebesgue-almost every $\alpha\in[0,1]$,

\[ \oc(w_N(\alpha)) = (1+o(1))\frac{N}{\log_b N}. \]

Equivalently,

\[ \lim_{N\to\infty} \frac{\oc(w_N(\alpha))\log_b N}{N} = 1 \]

for almost every $\alpha$.

Proof.

The upper bound holds for every word by the universal upper bound.

For the lower bound, fix $0<\kappa<1$, and set

\[ z_N=\left\lfloor \kappa\frac{N}{\log_b N}\right\rfloor. \]

The set of $\alpha$ such that

\[ \oc(w_N(\alpha))\leq z_N \]

is a union of base-$b$ cylinders of length $N$, one for each word in $\mathcal W_{N,z_N}$. Each cylinder has Lebesgue measure $b^{-N}$. Hence its measure is at most

\[ b^{-N}|\mathcal W_{N,z_N}| \leq b^{-(1-\kappa+o(1))N}. \]

This is summable in $N$. By the Borel--Cantelli lemma, for almost every $\alpha$, the event

\[ \oc(w_N(\alpha))\leq \kappa\frac{N}{\log_b N} \]

occurs for only finitely many $N$. Therefore

\[ \liminf_{N\to\infty} \frac{\oc(w_N(\alpha))\log_b N}{N} \geq \kappa \]

for every $0<\kappa<1$, and hence the liminf is at least $1$. The limsup is at most $1$ by the universal upper bound.

7 Finite entropy of low-complexity words

For $0<\kappa\leq1$, define

\[ \mathcal C_N(\kappa) = \left\{ W\in\{0,\ldots,b-1\}^N: \oc(W)\leq \kappa\frac{N}{\log_b N} \right\}. \]

Theorem 7.1 (Entropy spectrum).

For $0<\kappa\leq1$,

\[ \lim_{N\to\infty} \frac1N\log_b|\mathcal C_N(\kappa)| = \kappa. \]

For $\kappa\geq1$, the limit is $1$.

Proof.

The upper bound for $0<\kappa\leq1$ follows directly from the counting lemma:

\[ |\mathcal C_N(\kappa)| \leq b^{(\kappa+o(1))N}. \]

For the lower bound, fix $0<\eta<\kappa$, and put

\[ M=\lfloor(\kappa-\eta)N\rfloor. \]

For every word

\[ U\in\{0,\ldots,b-1\}^M, \]

construct a word $W(U)$ of length $N$ as follows. First write $U$. Then continue periodically with period $U$ until the word has total length $N$.

Different $U$'s give different $W(U)$'s, because the first $M$ symbols of $W(U)$ are $U$. Hence this construction gives $b^M$ distinct words.

By the universal upper bound applied to $U$,

\[ \oc(U)\leq (1+o(1))\frac{M}{\log_b M} \]

uniformly in $U$. After $U$ has been parsed, the periodic continuation is obtained with one copied phrase, using source position $1$ and allowing self-overlap. Thus

\[ \oc(W(U)) \leq (1+o(1))\frac{M}{\log_b M}+1. \]

Since

\[ M=(\kappa-\eta+o(1))N \qquad\text{and}\qquad \log_b M\sim\log_b N, \]

we get

\[ \oc(W(U)) \leq (\kappa-\eta+o(1))\frac{N}{\log_b N} \leq \kappa\frac{N}{\log_b N} \]

for all sufficiently large $N$. Hence

\[ |\mathcal C_N(\kappa)| \geq b^M = b^{(\kappa-\eta+o(1))N}. \]

Letting $\eta\to0$ gives

\[ \liminf_{N\to\infty} \frac1N\log_b|\mathcal C_N(\kappa)| \geq \kappa. \]

For $\kappa\geq1$, the universal upper bound implies that all words are eventually counted up to $o(1)$ in the threshold, and the entropy limit is $1$.

8 Hausdorff spectrum

For $0\leq\kappa\leq1$, define

\[ F_\kappa = \left\{ \alpha\in[0,1]: \liminf_{N\to\infty} \frac{\oc(w_N(\alpha))\log_b N}{N} \leq \kappa \right\}. \]

Theorem 8.1 (Hausdorff spectrum).

For $0\leq\kappa\leq1$,

\[ \dimH F_\kappa=\kappa. \]

For $\kappa\geq1$, $F_\kappa$ has full Lebesgue measure and hence Hausdorff dimension $1$.

Proof.

We first prove the upper bound. Fix $\eta>0$. If $\alpha\in F_\kappa$, then for infinitely many $N$,

\[ \oc(w_N(\alpha)) \leq (\kappa+\eta)\frac{N}{\log_b N}. \]

Thus $F_\kappa$ is contained in the limsup of the union of cylinders of length $N$ corresponding to words in

\[ \mathcal C_N(\kappa+\eta). \]

By the entropy upper bound,

\[ |\mathcal C_N(\kappa+\eta)| \leq b^{(\kappa+\eta+o(1))N}. \]

Each cylinder has diameter comparable to $b^{-N}$. If $s>\kappa+\eta$, then

\[ \sum_N |\mathcal C_N(\kappa+\eta)| b^{-sN} < \infty. \]

Hence the $s$-dimensional Hausdorff measure of $F_\kappa$ is zero. Therefore

\[ \dimH F_\kappa\leq \kappa+\eta. \]

Letting $\eta\to0$ gives

\[ \dimH F_\kappa\leq\kappa. \]

For the lower bound, assume $0<\kappa\leq1$. We construct a Cantor set contained in $F_\kappa$. Choose a sequence of integers $M_j\to\infty$ growing so rapidly that the total length constructed before stage $j$ is $o(M_j/\log M_j)$.

Suppose that before stage $j$ a prefix of length $R_{j-1}$ has been constructed. Choose freely a word

\[ U_j\in\{0,\ldots,b-1\}^{M_j}. \]

After writing $U_j$, continue periodically with period $U_j$ until the total length reaches

\[ N_j = R_{j-1} + \left\lfloor\frac{M_j}{\kappa}\right\rfloor. \]

Since $R_{j-1}=o(M_j)$,

\[ N_j\sim \frac{M_j}{\kappa}. \]

At time $N_j$, the already existing prefix before $U_j$ contributes $o(M_j/\log M_j)$ phrases. The block $U_j$ can be parsed using the universal upper bound:

\[ \oc(U_j) \leq (1+o(1))\frac{M_j}{\log_b M_j}. \]

The periodic continuation after $U_j$ is copied with one phrase, since self-overlap is allowed. Thus

\[ \oc(w_{N_j}) \leq (1+o(1))\frac{M_j}{\log_b M_j}. \]

Because

\[ N_j\sim\frac{M_j}{\kappa} \qquad\text{and}\qquad \log_b N_j\sim\log_b M_j, \]

we obtain

\[ \frac{\oc(w_{N_j})\log_b N_j}{N_j} \leq \kappa+o(1). \]

Hence every point in the constructed Cantor set lies in $F_\kappa$.

It remains to estimate its dimension. At stage $j$, the number of independent choices is $b^{M_j}$. Since the sequence $M_j$ grows very fast, the contribution of all previous stages is negligible compared with $M_j$. The cylinders at stage $j$ have length $b^{-N_j}$. A standard mass distribution argument gives

\[ \dimH \geq \liminf_{j\to\infty} \frac{\sum_{i\leq j}M_i}{N_j} = \liminf_{j\to\infty} \frac{M_j+o(M_j)}{M_j/\kappa} = \kappa. \]

Thus

\[ \dimH F_\kappa\geq\kappa. \]

The case $\kappa=0$ follows from the upper bound and the fact that $F_0$ is nonempty, for example it contains eventually periodic expansions. Therefore $\dimH F_0=0$.

9 Normalized substring complexity and string attractors

For a finite word $W$, let $p_W(k)$ be the number of distinct factors of $W$ of length $k$. Define

\[ \delta(W)=\max_{1\leq k\leq |W|}\frac{p_W(k)}{k}. \]

Let $\gamma(W)$ denote the size of the smallest string attractor of $W$. Recall that a set of positions

\[ \Gamma\subseteq\{1,\ldots,|W|\} \]

is a string attractor if every distinct factor of $W$ has an occurrence crossing at least one position of $\Gamma$.

Lemma 9.1.

For every finite word $W$,

\[ \delta(W)\leq \gamma(W)\leq \oc(W). \]

Proof.

First let $\Gamma$ be a string attractor for $W$. Fix $k$. Every distinct length-$k$ factor has an occurrence crossing some position of $\Gamma$. A fixed position can be crossed by at most $k$ length-$k$ factors. Therefore

\[ p_W(k)\leq |\Gamma|k. \]

Taking the minimum over $\Gamma$ and then the maximum over $k$ gives

\[ \delta(W)\leq \gamma(W). \]

Now take an online previous-copy parsing of $W$ with phrase starts

\[ n_0+1,n_1+1,\ldots,n_{z-1}+1. \]

Let $\Gamma$ be this set of phrase starts. We claim that $\Gamma$ is a string attractor.

Let $U$ be any factor of $W$, and choose its leftmost occurrence. If this occurrence does not cross a phrase start, it lies strictly inside a single phrase. If that phrase is a literal, this is impossible unless $|U|=1$, in which case the occurrence crosses the phrase start. If the phrase is copied, the source of the copy gives an earlier occurrence of $U$, contradicting the choice of the leftmost occurrence. Hence every factor has an occurrence crossing a phrase start, and $\Gamma$ is an attractor. Thus

\[ \gamma(W)\leq z. \]

Taking the minimum over parsings gives

\[ \gamma(W)\leq \oc(W). \]

9.1 Universal and worst-case bounds

Lemma 9.2 (Universal upper bound for $\delta$).

Uniformly for all words $W$ of length $N$,

\[ \delta(W) \leq (1+o(1))\frac{N}{\log_b N}. \]

Proof.

For every $k$,

\[ p_W(k)\leq \min\{b^k,N-k+1\}\leq \min\{b^k,N\}. \]

Thus

\[ \frac{p_W(k)}{k} \leq \frac{\min\{b^k,N\}}{k}. \]

If $k\leq \log_b N$, then the maximum of $b^k/k$ in this range occurs at $k=\lfloor\log_b N\rfloor+O(1)$, and is

\[ (1+o(1))\frac{N}{\log_b N}. \]

If $k\geq \log_b N$, then

\[ \frac{N}{k}\leq \frac{N}{\log_b N}. \]

Taking the maximum over $k$ proves the claim.

Theorem 9.3 (Worst-case for $\delta$ and $\gamma$).

\[ \max_{|W|=N}\delta(W) = (1+o(1))\frac{N}{\log_b N}, \]

and

\[ \max_{|W|=N}\gamma(W) = (1+o(1))\frac{N}{\log_b N}. \]

Proof.

The upper bound for $\delta$ is the universal bound just proved. The upper bound for $\gamma$ follows from

\[ \gamma(W)\leq \oc(W) \]

and the universal upper bound for $\oc$.

For the lower bounds, it is enough to show that there exist words $W$ with

\[ \delta(W)\geq(1-o(1))\frac{N}{\log_b N}. \]

This follows, for example, from the almost-sure lower bound for $\delta$ proved in the next subsection. Since

\[ \delta(W)\leq\gamma(W), \]

the same examples give the lower bound for $\gamma$.

9.2 Almost-sure behaviour

Theorem 9.4 (Almost-sure behaviour of $\delta$ and $\gamma$).

For Lebesgue-almost every $\alpha\in[0,1]$,

\[ \delta(w_N(\alpha)) = (1+o(1))\frac{N}{\log_b N}, \]

and

\[ \gamma(w_N(\alpha)) = (1+o(1))\frac{N}{\log_b N}. \]

Proof.

The upper bound for $\delta$ is universal. The upper bound for $\gamma$ follows from

\[ \gamma(W)\leq\oc(W) \]

and the almost-sure asymptotic for $\oc$.

It remains to prove the almost-sure lower bound for $\delta$. Fix $\varepsilon>0$, and set

\[ k_N=\left\lceil(1+\varepsilon)\log_b N\right\rceil. \]

For a random word $W_N$ of length $N$, let $C_N$ be the number of pairs $1\leq i

\[ W_N[i,i+k_N-1]=W_N[j,j+k_N-1]. \]

For any pair $i

\[ \mathbb E C_N \ll N^2b^{-k_N} \ll N^{1-\varepsilon}. \]

Choose a subsequence

\[ N_m=\lfloor m^q\rfloor \]

with $q\varepsilon>2$. By Markov's inequality,

\[ \mathbb P(C_{N_m}>\eta N_m) \leq \frac{\mathbb E C_{N_m}}{\eta N_m} \ll N_m^{-\varepsilon} \ll m^{-q\varepsilon}. \]

The last series is summable. By Borel--Cantelli,

\[ C_{N_m}=o(N_m) \]

almost surely.

The number of distinct length-$k_{N_m}$ factors in the prefix of length $N_m$ is at least

\[ N_m-k_{N_m}+1-C_{N_m} = (1-o(1))N_m, \]

because the number of repeated occurrences is bounded by the number of colliding pairs.

Now let $N_m\leq N

\[ \frac{N_{m+1}}{N_m}\to1, \]

we have $N_m\sim N$ and $\log N_m\sim\log N$. The prefix $w_N(\alpha)$ contains the prefix $w_{N_m}(\alpha)$, so

\[ p_{w_N(\alpha)}(k_{N_m}) \geq (1-o(1))N_m = (1-o(1))N. \]

Also

\[ k_{N_m} = (1+\varepsilon+o(1))\log_b N. \]

Thus

\[ \delta(w_N(\alpha)) \geq \frac{p_{w_N(\alpha)}(k_{N_m})}{k_{N_m}} \geq (1-o(1))\frac{N}{(1+\varepsilon)\log_b N}. \]

Since $\varepsilon>0$ is arbitrary, taken over a countable sequence tending to $0$, we obtain

\[ \delta(w_N(\alpha)) \geq (1-o(1))\frac{N}{\log_b N} \]

almost surely. This proves the theorem. The lower bound for $\gamma$ follows from

\[ \delta(W)\leq\gamma(W). \]

10 Summary of asymptotic scales

The results above show that the online previous-copy complexity and the standard repetitiveness measures $\delta$ and $\gamma$ share the same natural scale in the worst case and almost surely:

\[ \frac{N}{\log_b N}. \]

More precisely,

\[ \max_{|W|=N}\oc(W) = (1+o(1))\frac{N}{\log_b N}, \]
\[ \max_{|W|=N}\delta(W) = (1+o(1))\frac{N}{\log_b N}, \]

and

\[ \max_{|W|=N}\gamma(W) = (1+o(1))\frac{N}{\log_b N}. \]

For Lebesgue-almost every $\alpha\in[0,1]$,

\[ \oc(w_N(\alpha)) = \delta(w_N(\alpha)) = \gamma(w_N(\alpha)) = (1+o(1))\frac{N}{\log_b N}. \]

The entropy spectrum and Hausdorff spectrum for $\oc$ are also determined:

\[ \lim_{N\to\infty} \frac1N \log_b \#\left\{ W\in\{0,\ldots,b-1\}^N: \oc(W)\leq \kappa\frac{N}{\log_b N} \right\} = \kappa \]

for $0<\kappa\leq1$, and

\[ \dimH \left\{ \alpha: \liminf_{N\to\infty} \frac{\oc(w_N(\alpha))\log_b N}{N} \leq\kappa \right\} = \kappa \]

for $0\leq\kappa\leq1$.

Declaration of generative AI and AI-assisted technologies in the writing process

During the preparation of this work, ChatGPT, by OpenAI, was used to assist with mathematical drafting, formalization, review, and editing. This work is shared as a preliminary AI-assisted mathematical note. The mathematical content may have been only partially reviewed and may contain errors; it should not be treated as peer-reviewed or as a fully verified manuscript.

References

  1. [KP18] D. Kempa and N. Prezza, At the roots of dictionary compression: string attractors, Proceedings of the 50th Annual ACM Symposium on Theory of Computing, 2018, 827--840.
  2. [KNP23] T. Kociumaka, G. Navarro, and N. Prezza, Toward a definitive compressibility measure for repetitive sequences, IEEE Trans. Inform. Theory 69 (2023), no. 4, 2074--2092.
  3. [Pre17] N. Prezza, String attractors, arXiv:1709.05314, 2017.
  4. [ZL77] J. Ziv and A. Lempel, A universal algorithm for sequential data compression, IEEE Trans. Inform. Theory 23 (1977), no. 3, 337--343.