June 2026

Previous-Copy Compression: Foundations and Algebraic Digit Expansions

This page collects the foundational material for the online previous-copy model and its application to algebraic digit expansions.

Part I.A

June 2026

Previous-Copy Compression and Algebraic Digit Expansions

Luca Blanchi


Abstract

Let $b\geq 2$ be an integer and let

\alpha=\sum_{n\geq 1}a_n b^{-n} \in (0,1)\cap(\overline{\mathbb Q}\setminus\mathbb Q)

be algebraic irrational, with $a_n\in\{0,\ldots,b-1\}$. Write

w_N(\alpha)=a_1a_2\cdots a_N

for the prefix of length $N$ of the base-$b$ expansion of $\alpha$.

We study the compressibility of $w_N(\alpha)$ under a left-to-right previous-copy model. A phrase is either a literal symbol or an exact copy of an earlier substring; self-overlap is allowed, as in self-referential variants of LZ77. Let $\oc(W)$ be the minimum number of phrases in such a parsing of a finite word $W$. We prove

\oc(w_N(\alpha))=\omega(\log N).

Consequently, every standard LZ77-type previous-factor parsing, self-referential or not, uses $\omega(\log N)$ phrases on $w_N(\alpha)$.

The proof combines a simple multiplicative-growth lemma with Diophantine approximation. A logarithmic online previous-copy parsing forces a copied phrase whose endpoint is a fixed multiplicative factor larger than its starting boundary. If the copy displacement tends to infinity, this gives a repetition forbidden by the Adamczewski--Bugeaud Diophantine exponent criterion. If the displacement is bounded, it gives too good an approximation by rationals with denominators supported on a fixed finite set of primes, contradicting Ridout's theorem.

We also prove a structured-noise variant using Nguyen's refined Diophantine exponent theorem. If copied phrases may differ from their sources on a vanishing proportion of positions contained in a bounded number of intervals, logarithmically many phrases still do not suffice.

Finally, we add an effective finite-prefix component. We introduce the linear extension complexity $\lambda(W)$, the least $C$ such that $W$ extends to an infinite word of factor complexity at most $Cn$. Using Bugeaud's effective theorem for long prefixes of algebraic numbers, we prove

\lambda(w_N(\alpha)) \gg_{\alpha} \frac{(\log N)^{1/11}}{(\log\log N)^{4/11}}.

Since small online-copy parsings and small string attractors both imply small linear extension complexity, this gives the same effective lower bound for the minimum string attractor size of $w_N(\alpha)$, and hence for all compression measures which induce string attractors of comparable size. This part is weaker than the superlogarithmic LZ77 bound, but applies to offline repetitiveness measures outside the reach of the online argument.

1 Introduction

The base-$b$ expansion of an algebraic irrational number is expected to be highly complex. This expectation cannot be usefully expressed through unrestricted Kolmogorov complexity of finite prefixes: algebraic numbers are computable, and hence their first $N$ digits have descriptions of length $O(\log N)$. One must instead work with restricted models of description or compression.

A fundamental theorem of Adamczewski and Bugeaud says that the base-$b$ expansion of an algebraic irrational number cannot have low factor complexity. More precisely, if $p(n,\alpha,b)$ denotes the number of distinct blocks of length $n$ occurring in the base-$b$ expansion of an algebraic irrational $\alpha$, then

\frac{p(n,\alpha,b)}{n}\longrightarrow +\infty.

Their method rests on a Diophantine principle: sufficiently strong repetitions in the digit expansion produce rational approximations that are too good for algebraic irrational numbers.

The purpose of this note is to apply that principle to a concrete compression model. We consider finite prefixes under a left-to-right previous-copy parsing. This model is deliberately generous. Copy sources and lengths are not charged. A copied phrase is allowed to overlap its target. Therefore a lower bound in this model is not an artefact of bit-level encoding conventions.

The first main result is the following.

Theorem 1.1 (Exact online-copy lower bound).

Let $b\geq 2$, and let

\alpha=\sum_{n\geq 1}a_n b^{-n} \in (0,1)\cap(\overline{\mathbb Q}\setminus\mathbb Q).

Then

\oc(w_N(\alpha))=\omega(\log N).

Equivalently,

\lim_{N\to\infty} \frac{\oc(w_N(\alpha))}{\log N} = +\infty.

As a direct consequence, every standard LZ77-type previous-factor parsing of $w_N(\alpha)$, self-referential or not, has $\omega(\log N)$ phrases.

The result should be read carefully. It is not a lower bound for arbitrary straight-line programs, unrestricted grammar compression, bidirectional macro schemes, or offline copy systems. The online condition is essential in the proof. However, a second and independent part of the note gives effective lower bounds for several offline repetitiveness measures through a different route.

To state that route, define the linear extension complexity $\lambda(W)$ of a finite word $W$ to be the least $C$ such that $W$ is the prefix of an infinite word $x$ satisfying

p_x(n)\leq Cn \qquad(n\geq 1),

where $p_x(n)$ is the number of distinct factors of length $n$ of $x$. We prove that both online-copy parsings and string attractors control $\lambda(W)$:

\lambda(W)\leq \oc(W)+1, \qquad \lambda(W)\leq \gamma(W)+1,

where $\gamma(W)$ is the minimum string attractor size of $W$. Combining this with Bugeaud's effective finite-prefix theorem gives the second main result.

Theorem 1.2 (Effective finite-prefix lower bound).

Let $b\geq 2$, and let

\alpha\in (0,1)\cap(\overline{\mathbb Q}\setminus\mathbb Q).

Then, for all sufficiently large $N$,

\lambda(w_N(\alpha)) \gg_{\alpha} \frac{(\log N)^{1/11}}{(\log\log N)^{4/11}}.

Consequently,

\gamma(w_N(\alpha)) \gg_{\alpha} \frac{(\log N)^{1/11}}{(\log\log N)^{4/11}},

and the same lower bound holds for every compression measure which is known to induce a string attractor of size bounded linearly by that measure.

This effective bound is weaker than $\oc(w_N(\alpha))=\omega(\log N)$, but it applies to offline measures for which the online multiplicative-growth argument does not apply.

We also prove a structured noisy version of the online theorem. There, a copied phrase may differ from its source on a small exceptional set, provided the exceptional set is contained in a bounded number of intervals. The unbounded-displacement case is handled by Nguyen's refined Diophantine exponent theorem; the bounded-displacement case is handled directly by Ridout's theorem.

The final section isolates a quantitative profile

\Theta_\alpha(T)

measuring the strongest exact repetition in the digit sequence beyond scale $T$. The exact theorem implies qualitatively that

\Theta_\alpha(T)\to 0.

Moreover, any explicit decay bound on $\Theta_\alpha(T)$ would immediately imply an explicit superlogarithmic lower bound for $\oc(w_N(\alpha))$. This reduces the main quantitative breakthrough problem to a finite Diophantine problem, closely related to a moving-period version of Ridout's theorem.

Throughout the note, logarithms are natural. All implicit constants may depend on fixed parameters such as $\alpha$, its degree and height, and, where relevant, $b$, although the effective theorem quoted from Bugeaud is itself independent of the base.

2 Diophantine inputs

Let

\mathbf a=a_1a_2a_3\cdots

be an infinite word over a finite set of integers. For integers $i\leq j$, write

\mathbf a[i,j]=a_i a_{i+1}\cdots a_j.

For an integer $b\geq 2$, put

\xi_{\mathbf a,b} = \sum_{n\geq 1}a_n b^{-n}.

We use three Diophantine inputs: the Adamczewski--Bugeaud repetition criterion, Ridout's theorem, and Bugeaud's effective finite-prefix theorem. For the noisy section we also use Nguyen's refined repetition criterion.

2.1 Exact repetitions

The following is the form of the Adamczewski--Bugeaud Diophantine exponent criterion needed below.

Theorem 2.1 (Exact repetition criterion).

Let $b\geq 2$. Suppose that there exist $\rho>1$ and sequences of integers

\[ 0\leq r_m<s_m<t_m \]

such that

\[ s_m-r_m\to\infty, \qquad t_m\geq \rho s_m, \]

and

\[ \mathbf a[r_m+1,r_m+t_m-s_m] = \mathbf a[s_m+1,t_m] \]

for every \(m\). Then \(\xi_{\mathbf a,b}\) is rational or transcendental.

Indeed, the prefix $\mathbf a[1,t_m]$ contains a repetition beginning after a preperiod of length $r_m$, with period $s_m-r_m$, and with total length at least a fixed multiple of $s_m$. In the language of Adamczewski and Bugeaud, the Diophantine exponent of $\mathbf a$ is greater than $1$. The stated conclusion is precisely the corresponding transcendence criterion.

2.2 Ridout's theorem

We shall use Ridout's theorem in the following form.

Theorem 2.2 (Ridout).

Let $S$ be a finite set of prime numbers, let $\theta$ be a real algebraic number, and let $\varepsilon>0$. Then there are only finitely many rational numbers $P/Q$, written in lowest terms, such that every prime divisor of $Q$ belongs to $S$ and

\left|\theta-\frac{P}{Q}\right| < Q^{-1-\varepsilon}.

2.3 Structured noisy repetitions

Let $U$ and $V$ be two words of the same length $L$. For $\eta>0$ and an integer $\Delta\geq 0$, say that $U$ and $V$ are $(\eta,\Delta)$-close if the set

\{1\leq i\leq L: U_i\neq V_i\}

is contained in a union of at most $\Delta$ intervals of $\{1,\ldots,L\}$ whose total cardinality is at most $\eta L$.
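The definition can be tested mechanically. The following Python sketch is an illustration only: the name `is_close` and the gap-removal strategy are assumptions of this example, not part of the cited theorem. It uses the observation that the cheapest cover of the mismatch set by at most $\Delta$ intervals is the span of that set minus its $\Delta-1$ largest internal gaps.

```python
def is_close(U, V, eta, Delta):
    """Return True if U and V are (eta, Delta)-close in the sense above.

    Hypothetical helper for illustration. U and V must have equal length.
    """
    if len(U) != len(V):
        raise ValueError("words must have the same length")
    L = len(U)
    mismatches = [i for i in range(L) if U[i] != V[i]]
    if not mismatches:
        return True          # the empty set is covered by zero intervals
    if Delta == 0:
        return False         # a nonempty set needs at least one interval
    # Number of matching positions between consecutive mismatches.
    gaps = sorted(
        (mismatches[k + 1] - mismatches[k] - 1 for k in range(len(mismatches) - 1)),
        reverse=True,
    )
    span = mismatches[-1] - mismatches[0] + 1
    # Splitting the span at the Delta-1 largest gaps gives the cheapest cover.
    cheapest_cover = span - sum(gaps[: Delta - 1])
    return cheapest_cover <= eta * L
```

For instance, two words of length $10$ that differ only at the second and ninth positions are $(0.25,2)$-close but not $(0.25,1)$-close, since a single interval covering both mismatches has $8$ elements.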

We use the following consequence of Nguyen's refined Diophantine exponent theorem.

Theorem 2.3 (Refined repetition criterion).

Let $b\geq 2$. Suppose that there exist $\rho>1$, an integer $\Delta\geq 0$, a sequence $\eta_m\to 0$, and integers

\[ 0\leq r_m<s_m<t_m \]

such that

\[ s_m-r_m\to\infty, \qquad t_m\geq \rho s_m, \]

and such that the two words

\[ \mathbf a[r_m+1,r_m+t_m-s_m], \qquad \mathbf a[s_m+1,t_m] \]

are \((\eta_m,\Delta)\)-close for every \(m\). Then \(\xi_{\mathbf a,b}\) is rational or transcendental.

Only the exact criterion and Ridout's theorem are needed for the exact online-copy lower bound. The refined criterion is used solely in the noisy section.

2.4 Bugeaud's effective finite-prefix theorem

We also use the following theorem of Bugeaud.

Theorem 2.4 (Bugeaud).

Let $b\geq 2$, and let $\xi$ be a real algebraic irrational number of degree $D$ and height at most $H$, with $H\geq e^e$. Let $a$ be the base-$b$ expansion of $\xi$. Let $w$ be an infinite word satisfying

p_w(n)\leq Cn \qquad(n\geq 1)

for some integer $C\geq 2$. If the first $L$ digits of $a$ coincide with the first $L$ digits of $w$, then either

H \geq \exp\left\{ 10^{-2}C^{-1}L^{1/(8\log(4C))} \right\},

or

D \geq \exp\left\{ 10^{-100}C^{-11/2}(\log C)^{-1} (\log L)^{1/2}(\log\log L)^{-1} \right\}.

The numerical constants play no role in this note. What matters is that, for fixed algebraic $\xi$, the theorem forbids arbitrarily long prefixes from coinciding with infinite words of complexity $Cn$ unless $C$ grows at least like

\frac{(\log L)^{1/11}}{(\log\log L)^{4/11}}

up to a constant depending on $\xi$.
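The exponents $1/11$ and $4/11$ can be traced as follows. This is only a sketch under the theorem as stated above; the constant $c_\xi$ is notation introduced here, and one may assume $C\leq(\log L)^{1/11}$, since otherwise there is nothing to prove.

```latex
% For fixed \xi, the quantities D >= 2 and H are constants.
% Step 1: if C <= (\log L)^{1/11}, then \log(4C) <= (1/11)\log\log L + \log 4, so
\[
  C^{-1}L^{1/(8\log(4C))}
  = C^{-1}\exp\!\left(\frac{\log L}{8\log(4C)}\right)\longrightarrow\infty ,
\]
% and the first alternative (the bound on H) fails for all large L.
% Step 2: the second alternative must then hold. Taking logarithms,
\[
  C^{11/2}\log C \;\geq\; c_\xi\,\frac{(\log L)^{1/2}}{\log\log L},
  \qquad c_\xi=\frac{10^{-100}}{\log D}.
\]
% Step 3: insert \log C <= (1/11)\log\log L and solve for C:
\[
  C^{11/2}\;\geq\; 11\,c_\xi\,\frac{(\log L)^{1/2}}{(\log\log L)^{2}}
  \quad\Longrightarrow\quad
  C\;\geq\;(11\,c_\xi)^{2/11}\,\frac{(\log L)^{1/11}}{(\log\log L)^{4/11}} .
\]
```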

3 Online previous-copy parsings

Let $W=W[1]W[2]\cdots W[N]$ be a finite word.

Definition 3.1.

An online previous-copy parsing of $W$ is a factorization

W=F_1F_2\cdots F_z

with boundaries

\[ 0=n_0<n_1<\cdots<n_z=N, \]

where

\[ F_j=W[n_{j-1}+1,n_j]. \]

Each phrase \(F_j\) is of one of two types. First, \(F_j\) may be a literal, in which case \(|F_j|=1\). Second, \(F_j\) may be a copied phrase. Writing

\[ s=n_{j-1}, \qquad t=n_j, \qquad L=t-s, \]

this means that there is a source position \(p\) with

\[ 1\leq p\leq s \]

such that

\[ W[p+h]=W[s+1+h] \qquad(0\leq h<L). \]

The source starts strictly to the left of the target phrase. It may overlap the target.

Define

\oc(W)

to be the minimum number of phrases in an online previous-copy parsing of $W$.

This is a structural parsing measure, not a bit-level encoding length. In particular, source positions and phrase lengths are not charged.

Remark 3.2.

The first phrase in every online previous-copy parsing is necessarily a literal, since no previous source is available at position $1$.
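For small words, $\oc(W)$ can be computed directly from the definition. The sketch below is a brute-force dynamic programme over phrase boundaries, written for illustration rather than efficiency. The function names are hypothetical, and indices are $0$-based, so the boundary `s` means that `s` symbols have already been parsed.

```python
def longest_copy(w, s):
    """Longest length L such that w[s:s+L] equals w[p:p+L] for some p < s.

    The comparison runs inside the full word, so the source may overlap
    the target, as the definition allows.
    """
    n = len(w)
    best = 0
    for p in range(s):
        length = 0
        while s + length < n and w[p + length] == w[s + length]:
            length += 1
        best = max(best, length)
    return best


def oc(w):
    """Minimum number of phrases in an online previous-copy parsing of w."""
    n = len(w)
    INF = float("inf")
    dp = [0] + [INF] * n          # dp[t] = fewest phrases covering w[:t]
    for s in range(n):
        # A literal always advances the boundary by one symbol.
        dp[s + 1] = min(dp[s + 1], dp[s] + 1)
        # Every prefix of the longest copy at s is itself a valid copy.
        for t in range(s + 1, s + longest_copy(w, s) + 1):
            dp[t] = min(dp[t], dp[s] + 1)
    return dp[n]
```

For example, `oc("aaaaaaaa")` is $2$ (one literal followed by one self-overlapping copy), and `oc("abababab")` is $3$.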

4 A multiplicative-growth lemma

The following elementary lemma is the combinatorial core of the exact lower bound.

Lemma 4.1 (Multiplicative-growth phrase).

Let $C>0$. Suppose that, for infinitely many $N$, a word $W_N$ of length $N$ has an online previous-copy parsing with at most $C\log N$ phrases. Then there exist $\rho>1$ and infinitely many copied phrases, arising in such parsings, with boundaries \(s<t\) satisfying

t\geq \rho s

and

t\to\infty.

One may take

\rho=\exp\left(\frac1{3C}\right).

Proof.

Fix such a parsing of $W_N$, and write its boundaries as

\[ 0=n_0<n_1<\cdots<n_z=N. \]

Since the first phrase is a literal, \(n_1=1\). Thus

\[ N=\frac{n_z}{n_1} = \prod_{j=2}^{z}\frac{n_j}{n_{j-1}}. \]

Set

\[ \rho=\exp\left(\frac1{3C}\right)>1. \]

Suppose, for contradiction, that there is a constant \(T\) such that every phrase satisfying

\[ \frac{n_j}{n_{j-1}}\geq \rho \]

has endpoint \(n_j\leq T\), along the infinite family of parsings under consideration. After the last such phrase, all remaining ratios are \(<\rho\). Therefore, for all sufficiently large \(N\),

\[ N\leq T\rho^z \leq T\exp\left(\frac1{3C}C\log N\right) = TN^{1/3}, \]

which is impossible. Thus there are phrases with

\[ \frac{n_j}{n_{j-1}}\geq \rho \]

and unbounded endpoints.

A literal phrase with starting boundary \(n_{j-1}\) has ratio

\[ \frac{n_j}{n_{j-1}} = 1+\frac1{n_{j-1}}, \]

which is \(<\rho\) for all sufficiently large \(n_{j-1}\). Hence the high-growth phrases with unbounded endpoints are copied phrases.
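The telescoping product in the proof is easy to inspect on a concrete boundary sequence. The Python sketch below is illustrative only: the function name and the sample boundaries are assumptions of this example. It lists the phrases whose ratio $n_j/n_{j-1}$ is at least $\rho=\exp(1/(3C))$.

```python
import math


def high_growth_phrases(boundaries, C):
    """Return rho and the pairs (n_{j-1}, n_j), j >= 2, with n_j >= rho * n_{j-1}.

    `boundaries` is the list [n_0, n_1, ..., n_z] with n_0 = 0 and n_1 = 1.
    """
    rho = math.exp(1.0 / (3.0 * C))
    found = []
    for j in range(2, len(boundaries)):
        s, t = boundaries[j - 1], boundaries[j]
        if t >= rho * s:
            found.append((s, t))
    return rho, found


# Hypothetical boundaries of a parsing of a word of length 10.
boundaries = [0, 1, 2, 4, 5, 10]
rho, found = high_growth_phrases(boundaries, C=1.0)
# The ratios n_j / n_{j-1} for j >= 2 multiply to N = n_z.
product = math.prod(boundaries[j] / boundaries[j - 1] for j in range(2, len(boundaries)))
```

Here the ratios are $2$, $2$, $1.25$, $2$, with product $10=N$, and the phrase with boundaries $(4,5)$ is the only one below $\rho\approx1.396$.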

5 Exact online-copy lower bound

Theorem 5.1 (Compression criterion).

Let $b\geq 2$, and let

\xi_{\mathbf a,b} = \sum_{n\geq 1}a_n b^{-n}.

If there is a constant $C>0$ such that

\oc(a_1a_2\cdots a_N)\leq C\log N

for infinitely many $N$, then $\xi_{\mathbf a,b}$ is rational or transcendental.

Proof.

It is enough to prove the contrapositive. Suppose that

\xi_{\mathbf a,b} = \alpha \in (0,1)\cap(\overline{\mathbb Q}\setminus\mathbb Q).

Assume, for contradiction, that there are $C>0$ and infinitely many $N$ such that

\oc(w_N(\alpha))\leq C\log N.

By the multiplicative-growth lemma, after passing to an infinite subsequence, we obtain copied phrases

w_N(\alpha)[s+1,t]

such that

t\geq \rho s

for some fixed $\rho>1$, and $t\to\infty$.

For each such copied phrase, put $L=t-s$. There is a source position $p$, with $1\leq p\leq s$, such that

a_{p+h}=a_{s+1+h} \qquad(0\leq h<L).

Part I.B

June 2026

Metric and Extremal Theory of Online Previous-Copy Compression

Luca Blanchi


Abstract

Let $b\geq2$ be an integer alphabet size. We study the extremal and metric behaviour of the online previous-copy parsing complexity $\oc(W)$ of a finite word $W\in\{0,\ldots,b-1\}^N$. In this model a phrase is either a literal symbol or an exact copy whose source starts earlier in the word; self-overlap is allowed.

We prove the sharp universal upper bound

\oc(W)\leq (1+o(1))\frac{N}{\log_b N}

uniformly for all words $W$ of length $N$, and show that this is best possible:

\max_{|W|=N}\oc(W) = (1+o(1))\frac{N}{\log_b N}.

For Lebesgue-almost every $\alpha\in[0,1]$, if $w_N(\alpha)$ denotes the prefix of length $N$ of the base-$b$ expansion of $\alpha$, then

\oc(w_N(\alpha)) = (1+o(1))\frac{N}{\log_b N}.

We also determine the finite-word entropy of low online previous-copy complexity. For $0\leq \kappa\leq1$, let

\mathcal C_N(\kappa) = \left\{ W\in\{0,\ldots,b-1\}^N: \oc(W)\leq \kappa\frac{N}{\log_b N} \right\}.

Then, for $0<\kappa\leq1$,

\lim_{N\to\infty} \frac1N\log_b|\mathcal C_N(\kappa)| = \kappa.

Consequently, the Hausdorff spectrum is exact:

\dimH \left\{ \alpha\in[0,1]: \liminf_{N\to\infty} \frac{\oc(w_N(\alpha))\log_b N}{N} \leq\kappa \right\} = \kappa \qquad(0\leq\kappa\leq1).

Finally, we record parallel extremal and almost-sure estimates for the normalized substring complexity

\delta(W)=\max_{1\leq k\leq |W|}\frac{p_W(k)}{k}

and for the minimum string attractor size $\gamma(W)$:

\delta(w_N(\alpha)) = \gamma(w_N(\alpha)) = (1+o(1))\frac{N}{\log_b N}

for almost every $\alpha$, and the same asymptotic scale holds in the worst case.

1 Introduction

The online previous-copy parsing model is a structural abstraction of left-to-right dictionary compression. A finite word is parsed from left to right into phrases. A phrase is either a literal symbol or a copy of an earlier substring. The source is required to start earlier than the target, but it may overlap the target. This convention includes the self-referential behaviour familiar from variants of LZ77.

A companion paper studied this model on digit prefixes of algebraic irrational numbers and proved the lower bound

\oc(w_N(\alpha))=\omega(\log N)

for every algebraic irrational $\alpha$. The present note is independent of the Diophantine part. Its purpose is to determine the natural extremal and metric scale of $\oc(W)$ for arbitrary and typical finite words.

The main scale is

\frac{N}{\log_b N}.

This scale appears in three complementary ways.

First, every word of length $N$ has an online previous-copy parsing with at most

(1+o(1))\frac{N}{\log_b N}

phrases. Second, by a counting argument, this is sharp in the worst case. Third, the same asymptotic holds for almost every base-$b$ expansion.

The proof of the universal upper bound is elementary. Choose a block length

L=\lfloor \log_b N-2\log_b\log_b N\rfloor.

Parsing greedily, any phrase shorter than $L$, except possibly near the end of the word, must start at a previously unseen block of length $L$. There are at most $b^L$ such blocks, while all other phrases have length at least $L$. This gives

\oc(W)\leq \frac{N}{L}+b^L+L+O(1).

The lower bounds come from counting parse descriptions. The number of words of length $N$ admitting an online previous-copy parse with $z$ phrases is at most

z\left(\frac{eN}{z}\right)^z(2bN)^z.

For

z\sim \kappa \frac{N}{\log_b N},

this is $b^{(\kappa+o(1))N}$. This gives both the worst-case lower bound and the entropy/Hausdorff spectrum.

The last part treats two standard repetitiveness measures: the normalized substring complexity

\delta(W)=\max_k p_W(k)/k

and the minimum string attractor size $\gamma(W)$. The inequalities

\delta(W)\leq\gamma(W)\leq\oc(W)

connect these measures to the online previous-copy model. A standard collision estimate for random words gives the matching lower bound for $\delta$, and hence for $\gamma$, almost surely.

Throughout the paper, logarithms without subscript are natural, while $\log_b$ denotes logarithm in base $b$.

2 Online previous-copy parsings

Let

W=W[1]W[2]\cdots W[N]

be a finite word over the alphabet

\{0,\ldots,b-1\}.

Definition 2.1.

An online previous-copy parsing of $W$ is a factorization

W=F_1F_2\cdots F_z

with boundaries

\[ 0=n_0<n_1<\cdots<n_z=N, \]

where

\[ F_j=W[n_{j-1}+1,n_j]. \]

Each phrase is of one of the following two types. First, $F_j$ may be a literal, in which case

\[ |F_j|=1. \]

Second, $F_j$ may be a copy. Writing

\[ s=n_{j-1}, \qquad t=n_j, \qquad \ell=t-s, \]

this means that there exists a source position $p$ with

\[ 1\leq p\leq s \]

such that

\[ W[p+h]=W[s+1+h] \qquad (0\leq h<\ell). \]

The source may overlap the target.

Let

\oc(W)

be the minimum number of phrases in an online previous-copy parsing of $W$.

Remark 2.2.

The first phrase must be a literal. The definition is structural: source positions and lengths are not charged as bits. The quantity $\oc(W)$ is therefore a phrase-complexity measure, not a bit-complexity measure.

3 A universal upper bound

We first prove that every word has an online previous-copy parsing with about $N/\log_b N$ phrases.

Theorem 3.1 (Universal upper bound).

Uniformly for all words $W\in\{0,\ldots,b-1\}^N$,

\oc(W) \leq (1+o(1))\frac{N}{\log_b N}.

More precisely, for every integer $L\geq1$,

\oc(W) \leq \frac{N}{L}+b^L+L+O(1),

and with

L=\left\lfloor \log_b N-2\log_b\log_b N\right\rfloor

this gives the displayed asymptotic.

Proof.

Fix $L$. We construct a parsing by the following greedy rule. Suppose the current position is $i$. If $i+L-1\leq N$ and the block

\[ W[i,i+L-1] \]

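has an occurrence starting at some position $p<i$, output the copied phrase $W[i,i+L-1]$ with source $p$ and move to position $i+L$. Otherwise output the literal $W[i]$ and move to position $i+1$.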

Call a phrase short if it is a literal produced before the final $L$ positions. If a short phrase starts at position $i\leq N-L+1$, then the block

\[ W[i,i+L-1] \]

has no earlier occurrence. Hence two different short phrase starts $i

\[ W[i,i+L-1]=W[j,j+L-1], \]

then at position $j$ the parser could have copied at least $L$ symbols from source $i$, contradicting that the phrase at $j$ was short.

There are at most $b^L$ distinct blocks of length $L$. Hence the number of short phrases before the final $L$ positions is at most $b^L$. The final part contributes at most $L$ literal phrases.

All copied phrases produced by the algorithm have length $L$, so there are at most $N/L$ such phrases. Therefore

\oc(W) \leq \frac{N}{L}+b^L+L+O(1).

Now take

L=\left\lfloor \log_b N-2\log_b\log_b N\right\rfloor.

Then

b^L\leq \frac{N}{(\log_b N)^2},

and

\frac{N}{L} = (1+o(1))\frac{N}{\log_b N}.

Also

b^L+L=o\left(\frac{N}{\log_b N}\right).

Thus

\oc(W) \leq (1+o(1))\frac{N}{\log_b N}.
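The greedy rule in the proof can be sketched in Python as follows. This is an illustrative implementation under stated assumptions, not a tuned parser: the names `greedy_block_parse` and `decode` are hypothetical, indices are $0$-based, and a dictionary records the earliest start of each block of length $L$ that begins strictly before the current position.

```python
def greedy_block_parse(w, L):
    """Parse w into literals and copies of length exactly L (L >= 1)."""
    n = len(w)
    first_seen = {}   # block of length L -> earliest start index
    indexed = 0       # every full block starting before `indexed` is registered
    phrases = []
    i = 0
    while i < n:
        # Register all full blocks that start strictly before position i.
        while indexed < i and indexed + L <= n:
            first_seen.setdefault(w[indexed:indexed + L], indexed)
            indexed += 1
        block = w[i:i + L]
        if i + L <= n and block in first_seen:
            phrases.append(("copy", first_seen[block], L))
            i += L
        else:
            phrases.append(("lit", w[i]))
            i += 1
    return phrases


def decode(phrases):
    """Rebuild the word; overlapping copies are resolved symbol by symbol."""
    out = []
    for phrase in phrases:
        if phrase[0] == "lit":
            out.append(phrase[1])
        else:
            _, p, length = phrase
            for h in range(length):
                out.append(out[p + h])
    return "".join(out)
```

On a word `w` over an alphabet of size $b$, the number of phrases returned should not exceed $N/L+b^L+L$, which is the bound used in the proof, and `decode(greedy_block_parse(w, L))` returns `w`.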

4 Counting words with short parsings

Let

\mathcal W_{N,z} = \left\{ W\in\{0,\ldots,b-1\}^N: \oc(W)\leq z \right\}.

Lemma 4.1 (Counting parse descriptions).

For $1\leq z\leq N/2$,

|\mathcal W_{N,z}| \leq z\left(\frac{eN}{z}\right)^z(2bN)^z.

Consequently, if

z_N=\left\lfloor \kappa\frac{N}{\log_b N}\right\rfloor

with fixed $0<\kappa\leq1$, then

|\mathcal W_{N,z_N}| \leq b^{(\kappa+o(1))N}.

Proof.

Fix $m\leq z$. We count a superset of words admitting a parsing with $m$ phrases.

The boundaries are determined by the $m-1$ internal cut positions, so there are

\binom{N-1}{m-1}

choices.

For each phrase, choose whether it is a literal or a copy, giving at most $2^m$ choices. Assign to every phrase a symbol in the alphabet, even if the phrase is a copy; this gives at most $b^m$ choices. Assign also to every phrase a source position in $\{1,\ldots,N\}$, even if the phrase is a literal; this gives at most $N^m$ choices.

Thus the number of descriptions with $m$ phrases is at most

\binom{N-1}{m-1}(2bN)^m.

Every such description determines at most one word. Indeed, literals prescribe their symbols, and copied phrases are deterministic once the already parsed prefix is known. If a source overlaps the target, the copied symbols are still determined progressively because the source starts earlier than the target.

Therefore

|\mathcal W_{N,z}| \leq \sum_{m\leq z} \binom{N-1}{m-1}(2bN)^m.

For $m\leq z\leq N/2$,

\binom{N-1}{m-1} \leq \left(\frac{eN}{m}\right)^m.

The function

m\mapsto \left(\frac{eN}{m}\right)^m(2bN)^m

is increasing for $1\leq m\leq N/2$. Hence

|\mathcal W_{N,z}| \leq z\left(\frac{eN}{z}\right)^z(2bN)^z.

Now take

z_N=\left\lfloor \kappa\frac{N}{\log_b N}\right\rfloor.

Taking logarithms in base $b$,

\log_b |\mathcal W_{N,z_N}| \leq \log_b z_N + z_N\log_b\frac{eN}{z_N} + z_N\log_b(2bN).

The last term is

z_N\log_b N+O(z_N) = (\kappa+o(1))N.

The middle term is

z_N\log_b\frac{eN}{z_N} = O\left(\frac{N}{\log N}\log\log N\right) = o(N).

Also $\log_b z_N=o(N)$. Therefore

\log_b |\mathcal W_{N,z_N}| \leq (\kappa+o(1))N,

as claimed.
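The convergence of the normalized exponent to $\kappa$ is slow, of relative order $\log\log N/\log N$. The following Python sketch evaluates the right-hand side of the lemma in logarithmic form; the function name is hypothetical and the sample values of $N$ are chosen only for illustration.

```python
import math


def normalized_log_bound(N, kappa, b=2):
    """(1/N) * log_b of  z (eN/z)^z (2bN)^z  with z = floor(kappa N / log_b N)."""
    z = math.floor(kappa * N / math.log(N, b))
    total = (
        math.log(z, b)
        + z * math.log(math.e * N / z, b)
        + z * math.log(2 * b * N, b)
    )
    return total / N


# Exponents for kappa = 1/2 and b = 2 at three scales.
values = [normalized_log_bound(2 ** e, 0.5) for e in (10, 20, 40)]
```

The three values decrease with $N$ but stay visibly above $\kappa=1/2$, which reflects the $o(1)$ term in the exponent.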

5 Worst-case complexity

Theorem 5.1 (Worst-case asymptotics).

\max_{W\in\{0,\ldots,b-1\}^N}\oc(W) = (1+o(1))\frac{N}{\log_b N}.

Proof.

The upper bound follows from the universal upper bound.

For the lower bound, fix $0<\kappa<1$, and set

z_N=\left\lfloor \kappa\frac{N}{\log_b N}\right\rfloor.

By the counting lemma,

|\mathcal W_{N,z_N}| \leq b^{(\kappa+o(1))N}.

Since the total number of words of length $N$ is $b^N$, for all large $N$ there exists a word $W$ with

\oc(W)>z_N.

Thus

\max_{|W|=N}\oc(W) \geq \kappa\frac{N}{\log_b N}(1-o(1)).

Letting $\kappa\uparrow1$ gives the lower bound.

6 Metric law for online previous-copy complexity

For $\alpha\in[0,1]$, let

w_N(\alpha)

be the prefix of length $N$ of the base-$b$ expansion of $\alpha$. We ignore the countable set of numbers with two base-$b$ expansions.

Theorem 6.1 (Almost-sure asymptotic).

For Lebesgue-almost every $\alpha\in[0,1]$,

\oc(w_N(\alpha)) = (1+o(1))\frac{N}{\log_b N}.

Equivalently,

\lim_{N\to\infty} \frac{\oc(w_N(\alpha))\log_b N}{N} = 1

for almost every $\alpha$.

Proof.

The upper bound holds for every word by the universal upper bound.

For the lower bound, fix $0<\kappa<1$, and set

z_N=\left\lfloor \kappa\frac{N}{\log_b N}\right\rfloor.

The set of $\alpha$ such that

\oc(w_N(\alpha))\leq z_N

is a union of base-$b$ cylinders of length $N$, one for each word in $\mathcal W_{N,z_N}$. Each cylinder has Lebesgue measure $b^{-N}$. Hence its measure is at most

b^{-N}|\mathcal W_{N,z_N}| \leq b^{-(1-\kappa+o(1))N}.

This is summable in $N$. By the Borel--Cantelli lemma, for almost every $\alpha$, the event

\oc(w_N(\alpha))\leq \kappa\frac{N}{\log_b N}

occurs for only finitely many $N$. Therefore

\liminf_{N\to\infty} \frac{\oc(w_N(\alpha))\log_b N}{N} \geq \kappa

for every $0<\kappa<1$, and hence the liminf is at least $1$. The limsup is at most $1$ by the universal upper bound.

7 Finite entropy of low-complexity words

For $0<\kappa\leq1$, define

\mathcal C_N(\kappa) = \left\{ W\in\{0,\ldots,b-1\}^N: \oc(W)\leq \kappa\frac{N}{\log_b N} \right\}.

Theorem 7.1 (Entropy spectrum).

For $0<\kappa\leq1$,

\lim_{N\to\infty} \frac1N\log_b|\mathcal C_N(\kappa)| = \kappa.

For $\kappa\geq1$, the limit is $1$.

Proof.

The upper bound for $0<\kappa\leq1$ follows directly from the counting lemma:

|\mathcal C_N(\kappa)| \leq b^{(\kappa+o(1))N}.

For the lower bound, fix $0<\eta<\kappa$, and put

M=\lfloor(\kappa-\eta)N\rfloor.

For every word

U\in\{0,\ldots,b-1\}^M,

construct a word $W(U)$ of length $N$ as follows. First write $U$. Then continue periodically with period $U$ until the word has total length $N$.

Different $U$'s give different $W(U)$'s, because the first $M$ symbols of $W(U)$ are $U$. Hence this construction gives $b^M$ distinct words.

By the universal upper bound applied to $U$,

\oc(U)\leq (1+o(1))\frac{M}{\log_b M}

uniformly in $U$. After $U$ has been parsed, the periodic continuation is obtained with one copied phrase, using source position $1$ and allowing self-overlap. Thus

\oc(W(U)) \leq (1+o(1))\frac{M}{\log_b M}+1.

Since

M=(\kappa-\eta+o(1))N \qquad\text{and}\qquad \log_b M\sim\log_b N,

we get

\oc(W(U)) \leq (\kappa-\eta+o(1))\frac{N}{\log_b N} \leq \kappa\frac{N}{\log_b N}

for all sufficiently large $N$. Hence

|\mathcal C_N(\kappa)| \geq b^M = b^{(\kappa-\eta+o(1))N}.

Letting $\eta\to0$ gives

\liminf_{N\to\infty} \frac1N\log_b|\mathcal C_N(\kappa)| \geq \kappa.

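For $\kappa>1$, the universal upper bound shows that every word of length $N$ lies in $\mathcal C_N(\kappa)$ once $N$ is sufficiently large, so $|\mathcal C_N(\kappa)|=b^N$ and the entropy limit is $1$. The case $\kappa=1$ is covered by the two bounds above.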
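The key step in the lower bound is that the periodic continuation of $U$ is a single copied phrase with source position $1$. The Python sketch below checks this on a small example. The helper names are hypothetical, and `is_valid_copy` uses the $1$-based convention of the definition of a copied phrase.

```python
def periodic_extension(U, N):
    """The word W(U): the first N symbols of U U U ..."""
    M = len(U)
    return "".join(U[i % M] for i in range(N))


def is_valid_copy(W, s, t, p):
    """Check W[p+h] == W[s+1+h] for 0 <= h < t-s, with 1 <= p <= s (1-based)."""
    if not (1 <= p <= s < t <= len(W)):
        return False
    # 1-based position q corresponds to the Python index q - 1.
    return all(W[p + h - 1] == W[s + h] for h in range(t - s))


W = periodic_extension("abc", 10)
# After the prefix U of length 3, positions 4..10 are one copy from source 1.
ok = is_valid_copy(W, s=3, t=10, p=1)
```

The copy has length $7$ while its source starts only $3$ positions earlier, so this example relies on self-overlap.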

8 Hausdorff spectrum

For $0\leq\kappa\leq1$, define

F_\kappa = \left\{ \alpha\in[0,1]: \liminf_{N\to\infty} \frac{\oc(w_N(\alpha))\log_b N}{N} \leq \kappa \right\}.

Theorem 8.1 (Hausdorff spectrum).

For $0\leq\kappa\leq1$,

\dimH F_\kappa=\kappa.

For $\kappa\geq1$, $F_\kappa$ has full Lebesgue measure and hence Hausdorff dimension $1$.

Proof.

We first prove the upper bound. Fix $\eta>0$. If $\alpha\in F_\kappa$, then for infinitely many $N$,

\oc(w_N(\alpha)) \leq (\kappa+\eta)\frac{N}{\log_b N}.

Thus $F_\kappa$ is contained in the limsup of the union of cylinders of length $N$ corresponding to words in

\mathcal C_N(\kappa+\eta).

By the entropy upper bound,

|\mathcal C_N(\kappa+\eta)| \leq b^{(\kappa+\eta+o(1))N}.

Each cylinder has diameter comparable to $b^{-N}$. If $s>\kappa+\eta$, then

\sum_N |\mathcal C_N(\kappa+\eta)| b^{-sN} < \infty.

Hence the $s$-dimensional Hausdorff measure of $F_\kappa$ is zero. Therefore

\dimH F_\kappa\leq \kappa+\eta.

Letting $\eta\to0$ gives

\dimH F_\kappa\leq\kappa.

For the lower bound, assume $0<\kappa\leq1$. We construct a Cantor set contained in $F_\kappa$. Choose a sequence of integers $M_j\to\infty$ growing so rapidly that the total length constructed before stage $j$ is $o(M_j/\log M_j)$.

Suppose that before stage $j$ a prefix of length $R_{j-1}$ has been constructed. Choose freely a word

U_j\in\{0,\ldots,b-1\}^{M_j}.

After writing $U_j$, continue periodically with period $U_j$ until the total length reaches

N_j = R_{j-1} + \left\lfloor\frac{M_j}{\kappa}\right\rfloor.

Since $R_{j-1}=o(M_j)$,

N_j\sim \frac{M_j}{\kappa}.

At time $N_j$, the already existing prefix before $U_j$ contributes $o(M_j/\log M_j)$ phrases. The block $U_j$ can be parsed using the universal upper bound:

\oc(U_j) \leq (1+o(1))\frac{M_j}{\log_b M_j}.

The periodic continuation after $U_j$ is copied with one phrase, since self-overlap is allowed. Thus

\oc(w_{N_j}) \leq (1+o(1))\frac{M_j}{\log_b M_j}.

Because

N_j\sim\frac{M_j}{\kappa} \qquad\text{and}\qquad \log_b N_j\sim\log_b M_j,

we obtain

\frac{\oc(w_{N_j})\log_b N_j}{N_j} \leq \kappa+o(1).

Hence every point in the constructed Cantor set lies in $F_\kappa$.

It remains to estimate its dimension. At stage $j$, the number of independent choices is $b^{M_j}$. Since the sequence $M_j$ grows very fast, the contribution of all previous stages is negligible compared with $M_j$. The cylinders at stage $j$ have length $b^{-N_j}$. A standard mass distribution argument gives

\dimH \geq \liminf_{j\to\infty} \frac{\sum_{i\leq j}M_i}{N_j} = \liminf_{j\to\infty} \frac{M_j+o(M_j)}{M_j/\kappa} = \kappa.

Thus

\dimH F_\kappa\geq\kappa.

The case $\kappa=0$ follows from the upper bound and the fact that $F_0$ is nonempty: for example, it contains every number with an eventually periodic expansion. Therefore $\dimH F_0=0$.

9 Normalized substring complexity and string attractors

For a finite word $W$, let $p_W(k)$ be the number of distinct factors of $W$ of length $k$. Define

\delta(W)=\max_{1\leq k\leq |W|}\frac{p_W(k)}{k}.

Let $\gamma(W)$ denote the size of the smallest string attractor of $W$. Recall that a set of positions

\Gamma\subseteq\{1,\ldots,|W|\}

is a string attractor if every distinct factor of $W$ has an occurrence crossing at least one position of $\Gamma$.

Lemma 9.1.

For every finite word $W$,

\delta(W)\leq \gamma(W)\leq \oc(W).

Proof.

First let $\Gamma$ be a string attractor for $W$. Fix $k$. Every distinct length-$k$ factor has an occurrence crossing some position of $\Gamma$. A fixed position can be crossed by at most $k$ length-$k$ factors. Therefore

p_W(k)\leq |\Gamma|k.

Taking the minimum over $\Gamma$ and then the maximum over $k$ gives

\delta(W)\leq \gamma(W).

Now take an online previous-copy parsing of $W$ with phrase starts

n_0+1,n_1+1,\ldots,n_{z-1}+1.

Let $\Gamma$ be this set of phrase starts. We claim that $\Gamma$ is a string attractor.

Let $U$ be any factor of $W$, and choose its leftmost occurrence. Suppose this occurrence contains no phrase start. Then it lies inside a single phrase and begins strictly after the first position of that phrase. The phrase cannot be a literal, because a literal has length $1$ and its only position is a phrase start. So the phrase is copied, and since the source starts strictly to the left of the phrase, the corresponding part of the source is an occurrence of $U$ starting further left, contradicting the choice of the leftmost occurrence. Hence every factor has an occurrence crossing a phrase start, and $\Gamma$ is an attractor. Thus

\gamma(W)\leq z.

Taking the minimum over parsings gives

\gamma(W)\leq \oc(W).
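On very short words the chain $\delta(W)\leq\gamma(W)\leq\oc(W)$ can be checked by exhaustive search. The Python sketch below is a brute-force illustration whose running time is exponential in $|W|$. All function names are hypothetical, positions in `is_attractor` are $1$-based as in the text, and the word is assumed nonempty.

```python
from itertools import combinations


def factors(w, k):
    """Set of distinct factors of w of length k."""
    return {w[i:i + k] for i in range(len(w) - k + 1)}


def delta(w):
    """Normalized substring complexity max_k p_W(k) / k."""
    return max(len(factors(w, k)) / k for k in range(1, len(w) + 1))


def is_attractor(w, positions):
    """True if every factor has an occurrence crossing some given position."""
    pos = set(positions)
    n = len(w)
    for k in range(1, n + 1):
        covered = set()
        for i in range(n - k + 1):
            # This occurrence occupies the 1-based positions i+1 .. i+k.
            if any(i + 1 <= q <= i + k for q in pos):
                covered.add(w[i:i + k])
        if covered != factors(w, k):
            return False
    return True


def gamma(w):
    """Minimum string attractor size, by exhaustive search over position sets."""
    n = len(w)
    for size in range(1, n + 1):
        for candidate in combinations(range(1, n + 1), size):
            if is_attractor(w, candidate):
                return size
    return 0
```

For $W=\texttt{abab}$ one finds $\delta(W)=\gamma(W)=2$, while the parsing $\texttt{a}\,|\,\texttt{b}\,|\,\texttt{ab}$ has three phrases, and its phrase starts $\{1,2,3\}$ form an attractor.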

9.1 Universal and worst-case bounds

Lemma 9.2 (Universal upper bound for $\delta$).

Uniformly for all words $W$ of length $N$,

\delta(W) \leq (1+o(1))\frac{N}{\log_b N}.

Proof.

For every $k$,

p_W(k)\leq \min\{b^k,N-k+1\}\leq \min\{b^k,N\}.

Thus

\frac{p_W(k)}{k} \leq \frac{\min\{b^k,N\}}{k}.

If $k\leq \log_b N$, then the maximum of $b^k/k$ in this range is attained at $k=\lfloor\log_b N\rfloor$, because $b^k/k$ is nondecreasing in $k$, and it is at most

(1+o(1))\frac{N}{\log_b N}.

If $k\geq \log_b N$, then

\frac{N}{k}\leq \frac{N}{\log_b N}.

Taking the maximum over $k$ proves the claim.

Theorem 9.3 (Worst-case for $\delta$ and $\gamma$).

\max_{|W|=N}\delta(W) = (1+o(1))\frac{N}{\log_b N},

and

\max_{|W|=N}\gamma(W) = (1+o(1))\frac{N}{\log_b N}.

Proof.

The upper bound for $\delta$ is the universal bound just proved. The upper bound for $\gamma$ follows from

\gamma(W)\leq \oc(W)

and the universal upper bound for $\oc$.

For the lower bounds, it is enough to show that there exist words $W$ with

\delta(W)\geq(1-o(1))\frac{N}{\log_b N}.

This follows, for example, from the almost-sure lower bound for $\delta$ proved in the next subsection. Since

\delta(W)\leq\gamma(W),

the same examples give the lower bound for $\gamma$.

9.2 Almost-sure behaviour

Theorem 9.4 (Almost-sure behaviour of $\delta$ and $\gamma$).

For Lebesgue-almost every $\alpha\in[0,1]$,

\delta(w_N(\alpha)) = (1+o(1))\frac{N}{\log_b N},

and

\gamma(w_N(\alpha)) = (1+o(1))\frac{N}{\log_b N}.

Proof.

The upper bound for $\delta$ is universal. The upper bound for $\gamma$ follows from

\gamma(W)\leq\oc(W)

and the almost-sure asymptotic for $\oc$.

It remains to prove the almost-sure lower bound for $\delta$. Fix $\varepsilon>0$, and set

k_N=\left\lceil(1+\varepsilon)\log_b N\right\rceil.

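For a random word $W_N$ of length $N$, let $C_N$ be the number of pairs $1\leq i<j\leq N-k_N+1$ such that

W_N[i,i+k_N-1]=W_N[j,j+k_N-1].

For any pair $i<j$, this equality has probability $b^{-k_N}$, even when the two windows overlap. Hence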

\mathbb E C_N \ll N^2b^{-k_N} \ll N^{1-\varepsilon}.

Choose a subsequence

N_m=\lfloor m^q\rfloor

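with $q\varepsilon>2$, and fix $\eta>0$. By Markov's inequality,

\mathbb P(C_{N_m}>\eta N_m) \leq \frac{\mathbb E C_{N_m}}{\eta N_m} \ll N_m^{-\varepsilon} \ll m^{-q\varepsilon}.

The last bound is summable over $m$. By Borel--Cantelli, almost surely $C_{N_m}\leq\eta N_m$ for all sufficiently large $m$. Applying this along a countable sequence of values $\eta\to0$ gives

C_{N_m}=o(N_m)

almost surely.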

The number of distinct length-$k_{N_m}$ factors in the prefix of length $N_m$ is at least

N_m-k_{N_m}+1-C_{N_m} = (1-o(1))N_m,

because the number of repeated occurrences is bounded by the number of colliding pairs.
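The relation between colliding pairs and distinct factors can be observed on a simulated word. The following sketch assumes a binary alphabet, a fixed pseudo-random seed, and the illustrative values $N=5000$ and $\varepsilon=1/2$; `collision_stats` is a hypothetical helper name. It counts the windows of length $k_N$, the distinct factors among them, and the colliding pairs $C_N$.

```python
import math
import random
from collections import Counter


def collision_stats(w, k):
    """Return (windows, distinct factors, colliding pairs) for length k."""
    counts = Counter(w[i:i + k] for i in range(len(w) - k + 1))
    windows = sum(counts.values())  # equals len(w) - k + 1
    distinct = len(counts)
    # a factor occurring c times contributes c * (c - 1) / 2 pairs i < j
    pairs = sum(c * (c - 1) // 2 for c in counts.values())
    return windows, distinct, pairs


rng = random.Random(1)  # fixed seed, only for reproducibility
b, n, eps = 2, 5000, 0.5
k = math.ceil((1 + eps) * math.log(n, b))
w = "".join(rng.choice("01") for _ in range(n))
windows, distinct, pairs = collision_stats(w, k)
print(k, windows, distinct, pairs)
```

A factor occurring $c$ times accounts for $c-1$ repeated occurrences and $c(c-1)/2$ colliding pairs, and $c-1\leq c(c-1)/2$ for every $c\geq1$; this is the inequality behind the displayed lower bound.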

Now let $N_m\leq N<N_{m+1}$. Since

\frac{N_{m+1}}{N_m}\to1,

we have $N_m\sim N$ and $\log N_m\sim\log N$. The prefix $w_N(\alpha)$ contains the prefix $w_{N_m}(\alpha)$, so

p_{w_N(\alpha)}(k_{N_m}) \geq (1-o(1))N_m = (1-o(1))N.

Also

k_{N_m} = (1+\varepsilon+o(1))\log_b N.

Thus

\delta(w_N(\alpha)) \geq \frac{p_{w_N(\alpha)}(k_{N_m})}{k_{N_m}} \geq (1-o(1))\frac{N}{(1+\varepsilon)\log_b N}.

Applying this along a countable sequence of values $\varepsilon\to0$, we obtain

\delta(w_N(\alpha)) \geq (1-o(1))\frac{N}{\log_b N}

almost surely. The lower bound for $\gamma$ then follows from

\delta(W)\leq\gamma(W),

which proves the theorem.

10 Summary of asymptotic scales

The results above show that the online previous-copy complexity and the standard repetitiveness measures $\delta$ and $\gamma$ share the same natural scale in the worst case and almost surely:

\frac{N}{\log_b N}.

More precisely,

\max_{|W|=N}\oc(W) = (1+o(1))\frac{N}{\log_b N},

\max_{|W|=N}\delta(W) = (1+o(1))\frac{N}{\log_b N},

and

\max_{|W|=N}\gamma(W) = (1+o(1))\frac{N}{\log_b N}.

For Lebesgue-almost every $\alpha\in[0,1]$,

\oc(w_N(\alpha)) = \delta(w_N(\alpha)) = \gamma(w_N(\alpha)) = (1+o(1))\frac{N}{\log_b N}.

The entropy spectrum and Hausdorff spectrum for $\oc$ are also determined:

\lim_{N\to\infty} \frac1N \log_b \#\left\{ W\in\{0,\ldots,b-1\}^N: \oc(W)\leq \kappa\frac{N}{\log_b N} \right\} = \kappa

for $0<\kappa\leq1$, and

\dimH \left\{ \alpha: \liminf_{N\to\infty} \frac{\oc(w_N(\alpha))\log_b N}{N} \leq\kappa \right\} = \kappa

for $0\leq\kappa\leq1$.

Declaration of generative AI and AI-assisted technologies in the writing process

During the preparation of this work, ChatGPT, by OpenAI, was used to assist with mathematical drafting, formalization, review, and editing. This work is shared as a preliminary AI-assisted mathematical note. The mathematical content may have been only partially reviewed and may contain errors; it should not be treated as peer-reviewed or as a fully verified manuscript.

