DeepMind

通过从数万亿个 token 检索来改进语言模型

Sebastian Borgeaud†, Arthur Mensch†, Jordan Hoffmann†, Trevor Cai, Eliza Rutherford, Katie Millican, George van den Driessche, Jean-Baptiste Lespiau, Bogdan Damoc, Aidan Clark, Diego de Las Casas, Aurelia Guy, Jacob Menick, Roman Ring, Tom Hennigan, Saffron Huang, Loren Maggiore, Chris Jones, Albin Cassirer, Andy Brock, Michela Paganini, Geoffrey Irving, Oriol Vinyals, Simon Osindero, Karen Simonyan, Jack W. Rae‡, Erich Elsen‡ and Laurent Sifre†,‡

来自 DeepMind 的所有作者,†同等贡献,‡同等资深作者

我们通过以从大规模语料库中检索到的、与前文 token 局部相似的文档块为条件,来增强自回归语言模型。使用 2 万亿 token 的数据库,我们的 Retrieval-Enhanced Transformer(RETRO)在 Pile 上取得了与 GPT-3 和 Jurassic-1 相当的表现,尽管参数量少 25 倍。微调后,RETRO 的性能提升可以迁移到问答等知识密集型下游任务。RETRO 结合了一个冻结的 BERT 检索器、一个可微分的编码器以及一个分块交叉注意力机制,从而基于比训练期间通常消耗的数据多一个数量级的数据来预测 token。我们通常从零开始训练 RETRO,但也可以快速地对预训练 Transformer 进行 RETROfit(加装检索)并仍获得良好表现。我们的工作为通过前所未有规模的显式记忆来改进语言模型开辟了新的途径。

1. 引言

语言建模(LM)是一种无监督任务,其目标是对文本的概率建模,通常将其分解为条件化的下一 token 预测 p(x_1, \dots, x_n) = \prod_i p(x_i|x_{<i})。神经网络已被证明是强大的语言模型,最早以循环架构的形式出现(Graves, 2013;Jozefowicz 等,2016;Mikolov 等,2010),最近则以 Transformer(Vaswani 等,2017)的形式出现,后者利用注意力机制对过去的上下文进行整合。性能的大幅提升来自于增加数据量、训练计算量或模型参数量。Transformer 在过去两年里已从开创性工作中的 1 亿参数模型扩展到超过千亿参数(Brown 等,2020;Radford 等,2019),由此产生的模型在众多任务上具有出色的零样本或少样本能力。增大模型规模能够可预测地提升各类下游任务上的表现(Kaplan 等,2020)。参数数量增加的好处来源于两个方面:训练和推理时额外的计算,以及对训练数据更强的记忆。
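
下面用一个极简的 Python 片段说明上述分解:给定模型在每个位置上的下一 token 分布,序列对数概率即各条件对数概率之和。此片段仅为示意,`next_token_logits` 是假设的回调接口,并非任何具体模型的 API。

```python
import numpy as np

def sequence_log_prob(tokens, next_token_logits):
    """按 p(x_1..x_n) = prod_i p(x_i | x_<i) 计算序列对数概率。

    tokens: 长度为 n 的整数 token 序列。
    next_token_logits: 假设的回调,输入前缀 tokens[:i],返回词表上的 logits。
    """
    total = 0.0
    for i, token in enumerate(tokens):
        logits = next_token_logits(tokens[:i])             # 以 x_<i 为条件
        log_probs = logits - np.logaddexp.reduce(logits)   # log-softmax
        total += log_probs[token]
    return total
```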

在本工作中,我们试图通过探索在不显著增加计算量的前提下用大规模记忆增强语言模型的高效方式,来将这些目标解耦。具体而言,我们提出从大型文本数据库进行检索,作为扩展语言模型的一条互补路径。与其增大模型规模并使用更多数据进行训练,我们选择赋予模型直接访问大型数据库以进行预测的能力——这是一种半参数化方法。从宏观上看,我们的 Retrieval Transformer(RETRO)模型将输入序列划分为若干块,并检索与前一块相似的文本,以改进当前块的预测。现有针对语言建模的检索工作仅考虑小型 Transformer(约 1 亿参数)和规模有限的数据库(至多数十亿 token)(Guu 等,2020;Khandelwal 等,2020;Lewis 等,2020;Yogatama 等,2021)。据我们所知,我们的工作首次展示了将检索数据库扩展到万亿级 token 可为大型参数化语言模型带来收益。

通讯作者:{sborgeaud|amensch|jordanhoffmann|sifre}@deepmind.com

arXiv:2112.04426v3 [cs.CL] 2022年2月7日



图 1 | RETRO 的扩展性。 左:我们的检索模型带来的性能提升随模型规模保持恒定,其效果相当于将参数化模型的规模扩大约 10 倍。中、右:在 C4 验证集上,性能随检索数据库的规模(中)以及检索邻居数量(右)的增加而提升,最多可用约 40 个邻居;超过之后性能开始下降,可能是由于检索质量降低。评估时,RETRO 也可以在不使用检索的情况下运行(RETRO[OFF]),相比基线 Transformer 仅有有限的性能下降。

我们的主要贡献如下。

2. 方法

我们设计了一种能够从数万亿 token 的数据库中检索的检索增强架构。为此,我们在连续 token 块而非单个 token 的级别进行检索,这显著降低了存储和计算需求。我们的方法首先构建一个键值数据库,其中值存储原始文本 token 块,键为冻结的 BERT 嵌入(Devlin 等,2019)。我们使用冻结模型,以避免在训练期间需要定期重新计算整个数据库的嵌入。随后,每个训练序列被划分为块,并用从数据库中检索到的 k 近邻进行增强。编码器-解码器架构将检索到的块整合进模型的预测中。我们在图 2 中概述了 RETRO 架构,并在本节中详细阐述。在本节结尾,我们将引入



图 2 | RETRO 架构。左侧:简化版本,长度为 n = 12 的序列被划分为 l = 3 个大小为 m = 4 的块。对于每个块,我们检索 k = 2 个邻居,每个邻居包含 r = 5 个标记。检索通路如顶部所示。右侧:CCA 操作符中交互的细节。因果性保持不变,第一块的邻居仅影响第一块的最后一个标记以及第二块的标记。

一种新的语言模型评估方法,用于评估集部分出现在训练集中的情形。

2.1. 训练数据集

我们使用多语言版本的 MassiveText(Rae 等人,2021)作为训练和检索数据。该数据集由多种来源、多种语言的文本文档组成,总计超过 5 万亿 token(详见表 1)。训练序列从 MassiveText 的子集采样,采样权重见表 1 最右列。我们使用 SentencePiece(Kudo 和 Richardson,2018)对数据集进行分词,词表大小为 128,000。训练期间(除非另有说明),我们从训练数据的一个 600B token 子集中进行检索;训练用检索数据库由与训练数据相同的子集构成,比例与训练采样频率一致。评估时,检索数据库为上述数据集的并集,唯一例外是书籍,我们对其使用 4% 的子采样;评估检索数据库因此包含 1.75T token。为限制测试集泄漏,我们使用 MinHash 方案计算训练文档与测试文档之间的 13-gram Jaccard 相似度,并删除与验证集或测试集文档相似度高于 0.8 的所有训练文档。此外,我们还从 Wikipedia 训练数据中删除了 Wikitext103(Merity 等人,2017)的所有验证和测试文章。
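
作为说明,下面给出一个按 13-gram Jaccard 相似度过滤训练文档的极简示意(这里直接精确计算相似度;论文中实际使用 MinHash 方案在大规模数据上近似该相似度,函数名与输入形式均为示意性假设)。

```python
def ngrams(tokens, n=13):
    """返回 token 序列的全部连续 n-gram 集合。"""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def jaccard(a, b):
    return len(a & b) / max(1, len(a | b))

def filter_train_docs(train_docs, eval_docs, threshold=0.8, n=13):
    """移除与任一验证/测试文档的 13-gram Jaccard 相似度高于阈值的训练文档。"""
    eval_sets = [ngrams(doc, n) for doc in eval_docs]
    kept = []
    for doc in train_docs:
        doc_set = ngrams(doc, n)
        if all(jaccard(doc_set, e) <= threshold for e in eval_sets):
            kept.append(doc)
    return kept
```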

2.2. 检索增强的自回归 token 模型

我们的方法将检索用作一种在小的 token 块粒度上增强输入示例的方式。形式上,我们考虑由文本分词器¹得到、取值于 V = [1, v] 的整数 token 序列。我们将每个长度为 n 的示例 X = (x_1, \dots, x_n) 拆分为由 l 个大小为 m = \frac{n}{l} 的块 (C_1, \dots, C_l) 组成的序列,即 C_1 \triangleq (x_1, \dots, x_m), \dots, C_l \triangleq (x_{n-m+1}, \dots, x_n) \in V^m。我们使用 n = 2048 和 m = 64。我们用从数据库 \mathcal{D} 中检索到的 k 个邻居组成的集合 \mathrm{RET}_{\mathcal{D}}(C_u) 来增强每个块 C_u。\mathrm{RET}_{\mathcal{D}}

¹我们在全文中使用记号 [1, v] \triangleq \{1, \dots, v\}。



(为简洁起见记为 \mathrm{RET})是 §2.3 中指定的不可训练算子。token 的似然由一个以 \theta 为参数的模型给出,该模型的输入包括此前的 token 及其检索到的邻居。由此定义了如下检索增强的序列对数似然:

L(X|\theta, \mathcal{D}) \triangleq \sum_{u=1}^{l} \sum_{i=1}^{m} \ell_{\theta}(x_{(u-1)m+i} | (x_j)_{j<(u-1)m+i}, (\mathrm{RET}_{\mathcal{D}}(C_{u'}))_{u'<u}). \quad (1)

我们设置 \mathrm{RET}(C_1) = \emptyset,即第一块中 token 的似然性不依赖任何检索数据。此似然定义保持自回归性:第 u 块的第 i 个 token 的概率,x_{(u-1)m+i},仅依赖之前见过的 token (x_j)_{1\le j<(u-1)m+i} 以及从前块检索到的数据 (\mathrm{RET}(C_{u'}))_{u'<u}。因此我们可以直接用对数概率 \ell 进行采样,其中在块 C_u 内的采样受邻居 (\mathrm{RET}(C_{u'}))_{u'<u} 的条件约束。这使得检索增强模型与通过采样评估的最大语言模型直接可比较。
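
下面的示意片段展示式 (1) 中的分块方式,以及第 u 块中的 token 只能以之前块的检索结果为条件这一约束(假设 n 可被 m 整除,函数名为示意)。

```python
def split_into_chunks(tokens, m=64):
    """将长度为 n 的序列拆分为 l = n // m 个大小为 m 的块。"""
    assert len(tokens) % m == 0
    return [tokens[i:i + m] for i in range(0, len(tokens), m)]

def allowed_retrieval_sets(position, m=64):
    """位置 position(0 起始)的 token 属于块 u = position // m,
    其似然只能以更早块的检索结果 RET(C_1..C_u-1) 为条件。"""
    u = position // m
    return list(range(u))  # 第一块返回空列表,对应 RET(C_1) = ∅
```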

2.3. 最近邻检索

检索邻居。 我们的数据库由键值对记忆构成。每个值由两个连续的 token 块组成,记为 [N, F],其中 N 是用于计算键的邻居块,F 是它在原始文档中的续写。对应的键是 N 的 BERT 嵌入在时间维上的平均,记为 \mathrm{BERT}(N)。对于每个块 C,我们基于 BERT 嵌入上的平方 L_2 距离 d(C, N) = \|\mathrm{BERT}(C) - \mathrm{BERT}(N)\|_2^2,从键值数据库中检索其近似 k 最近邻。模型接收对应的值 \mathrm{RET}(C) \triangleq ([N^1, F^1], \dots, [N^k, F^k])。如我们的消融研究(附录 D)所示,邻居块及其续写都带来了显著改进。我们对 N^j 和 F^j 都使用长度 64,因此 \mathrm{RET}(C) 的形状为 k \times r,其中 r = 128。为避免块 C_{u+1} 出现在 C_u 的检索集合 \mathrm{RET}(C_u) 中(这会在训练时破坏因果性),我们过滤掉来源于与训练序列 X 相同文档的邻居。
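
下面给出检索键值构建与查询的极简示意:键为冻结编码器对邻居块激活的时间平均,查询时以平方 L_2 距离做暴力 k 近邻(论文实际使用 SCaNN 做近似检索;`frozen_bert_embed` 为假设的接口)。

```python
import numpy as np

def chunk_key(token_embeddings):
    """键为冻结 BERT 激活在时间维上的平均(token_embeddings: [长度, 维度])。"""
    return token_embeddings.mean(axis=0)

def knn(query_key, db_keys, k=2):
    """按平方 L2 距离返回 k 个最近邻的下标(db_keys: [条目数, 维度])。"""
    d2 = ((db_keys - query_key[None, :]) ** 2).sum(axis=1)
    return np.argsort(d2)[:k]

# 用法示意:db_values[i] 存放 [N_i, F_i],即邻居块及其在原文档中的续写
# idx = knn(chunk_key(frozen_bert_embed(C)), db_keys, k=2)
# ret = [db_values[i] for i in idx]
```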

对于包含 T 个元素的数据库,我们可以在 O(\log T) 时间内查询近似最近邻。我们使用 SCaNN 库(Guo 等,2020)来实现这一点。这意味着我们可以在 10 毫秒内查询我们的 2 万亿 token 数据库,同时进行模型评估或采样;该成本会在块长度上摊销。对训练而言,实时检索仍然过慢,无法跟上训练计算的节奏——我们利用 BERT 嵌入算子冻结这一特性,预先计算所有近似最近邻,并将结果作为数据的一部分保存。在附录的图 9 中,我们展示了仅在 Wikipedia 内检索邻居的结果:邻居往往来自与给定文章相距 2–3 个链接的文章,而随机文章之间的链接距离通常超过 5。

Table 1 | MassiveText。最后一列表示训练期间的采样权重。多语种子集包含 10 种语言的文档。完整细分见 §A.1。

Source | Token count (M) | Documents (M) | Multilingual | Sampling frequency
Web | 977,563 | 1,208 | Yes | 55%
Books | 3,423,740 | 20 | No | 25%
News | 236,918 | 398 | No | 10%
Wikipedia | 13,288 | 23 | Yes | 5%
GitHub | 374,952 | 143 | No | 5%


2.4. RETRO 模型架构

我们的模型依赖于编码器-解码器 Transformer 架构,通过 Vaswani 等人(2017)引入的交叉注意力机制来整合检索到的数据。首先,检索到的 token \mathrm{RET}(C) 被送入编码器 Transformer,由其计算编码后的邻居集合 E。将中间激活记为 H,我们的 Transformer 解码器随后交错使用 RETRO 块 \mathrm{RETRO}(H, E) 与标准 Transformer 块 \mathrm{LM}(H)(超参数 P \subseteq [1, L] 决定在哪些层使用 RETRO 块)。这些块由三种不同的残差算子组成,其签名均为 \mathbb{R}^{n \times d} \to \mathbb{R}^{n \times d}:一个全连接层 FFW、标准的序列级自注意力层 ATTN,以及一个融合检索编码器信息的分块交叉注意力层 \mathrm{CCA}(\cdot, E):

\text{RETRO}(H, E) \triangleq \text{FFW}(\text{CCA}(\text{ATTN}(H), E)), \quad \text{and} \quad \text{LM}(H) \triangleq \text{FFW}(\text{ATTN}(H)) \quad (2)

由于 FFW、ATTN 和 CCA 都是自回归算子,它们在位置 i 的输出仅取决于 (h_j)_{j \le i},任何 RETRO 和 LM 层的连续堆叠,随后接上一个 token 分类头,都定义了一个自回归对数似然 (1)。模型架构的概览见 Algorithm 1 和图 2。接下来我们将更详细地描述检索编码器和分块跨注意力层,并说明如何从 RETRO 采样。
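
作为示意,下面的伪实现展示了如何按超参数 P 在解码器各层之间交错使用 RETRO 块与标准 LM 块(`attn`、`cca`、`ffw` 视作已给定的残差算子,均为假设的函数名)。

```python
def decoder_stack(H, E, num_layers, P, attn, cca, ffw):
    """按式 (2) 在层索引属于 P 的位置使用 RETRO 块,其余位置使用 LM 块。"""
    for p in range(1, num_layers + 1):
        H = attn(H)                # 因果自注意力(残差算子)
        if p in P:
            H = cca(H, E)          # 分块交叉注意力,融合检索编码 E
        H = ffw(H)                 # 全连接层(残差算子)
    return H
```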

编码检索到的邻居。 对于每个块 C_u,其 k 个检索邻居 \mathrm{RET}(C_u) 被送入一个双向 Transformer 编码器 \text{ENCODER},得到输出 E_u^j \triangleq \text{ENCODER}(\mathrm{RET}(C_u)^j, H_u) \in \mathbb{R}^{r \times d'},其中 j \in [1, k] 为邻居索引。检索编码器是一个非因果 Transformer,它通过交叉注意力层以 H_u(块 C_u 的激活)为条件;这使得检索编码器的表示可以被当前块以可微分的方式调制。更准确地说,第 u 块的第 j 个邻居 \mathrm{RET}(C_u)^j 的编码依赖于第 \min(P) 层处块 C_u 的关注激活 H_u \triangleq (h_{(u-1)m+i})_{i \in [1,m]} \in \mathbb{R}^{m \times d}。所有块的所有邻居并行编码,得到完整的编码集合 E \triangleq (E_u^j)_{u \in [1,l], j \in [1,k]} \in \mathbb{R}^{l \times k \times r \times d'}。我们将 E_u \in \mathbb{R}^{k \times r \times d'} 记为块 u \in [1, l] 的编码邻居。

分块交叉注意力。 为了执行 CCA 操作,我们首先将给定的中间激活 H \in \mathbb{R}^{n \times d} 拆分为 l-1 个关注块 (H_u^+ \triangleq (h_{um+i-1})_{i \in [1,m]} \in \mathbb{R}^{m \times d})_{u \in [1,l-1]},如图 2 右侧所示。H_u^+ 包含块 C_u 的最后一个 token 以及块 C_{u+1} 的前 m-1 个 token 的中间嵌入²。我们计算 H_u^+ 与 E_u(从块 C_u 检索并编码得到的检索集合)之间的交叉注意力。由于我们在应用交叉注意力之前将 E_u 的邻居维与时间维合并,注意力同时在时间和邻居两个维度上进行。由于数据块与检索邻居之间存在自然的对齐关系,我们使用 §B.1.2 所述的相对位置编码。

我们将各个分块交叉注意力的 l-1 个输出(每个形状为 m \times d)在时间维上拼接,并对结果进行适当填充,从而得到输出激活 \text{CCA}(H, E) \in \mathbb{R}^{n \times d}。形式上,对于每个块 C_u 和每个 token i \in [1, m],我们设

\text{CCA}(H, E)_{um+i-1} \triangleq \text{CA}(h_{um+i-1}, E_u), \quad (3)

²chunk C_u 的最后一个 token 是第一个能够访问检索到的内容 E_u 的 token,同时保持 (1) 中的自回归性。因此,chunk C_u = (x_{(u-1)m+i})_{i \in [1,m]} 与对应的 attending chunk C_u^+ = (x_{um+i-1})_{i \in [1,m]} 之间存在一个 token 的重叠。



算法 1:RETRO 模型架构概览。

超参数: PP_{enc},分别为解码器和编码器中具有交叉注意力层的索引。

超参数: LL_{enc},分别为解码器层数和编码器层数。

输入: X \in \mathcal{V}^n:token 序列。(\text{RET}(C_u))_{1 \le u \le l}:检索到的邻居。

输出: O \in \mathbb{R}^{n \times |\mathcal{V}|}:输出 logits。

def ENCODER(\text{RET}(C_u)_{1 \le u \le l}, H):

$(H_u)_{u \in [1, l]} \leftarrow \text{SPLIT}(H)$
**for** $j \in [1, k], u \in [1, l]$ **do** // Encoder shared across neighbours and chunks
    $E_u^j = \text{EMB}_{\text{enc}}(\text{RET}(C_u)^j)$ // May be shared with the decoder $\text{EMB}$
    **for** $p' \in [1, L_{enc}]$ **do**
        $E_u^j \leftarrow \text{ATTN}_{\text{enc}}(E_u^j)$ // Bi-directional attention
        **if** $p' \in P_{enc}$ **then**
            $E_u^j \leftarrow \text{CA}_{\text{enc}}(E_u^j, H_u)$
            $E_u^j \leftarrow \text{FFW}_{\text{enc}}(E_u^j)$
    **return** $E$
$H \leftarrow \text{EMB}(X)$
**for** $p \in [1, L]$ **do**
    $H \leftarrow \text{ATTN}(H)$ // Causal attention
    **if** $p = \min(P)$ **then**
        // The neighbour ENCODER is conditioned with the decoder activations
        // of the last layer before the first cross-attention
        $E = \text{ENCODER}(\text{RET}(C_u)_{1 \le u \le l}, H)$
    **if** $p \in P$ **then**
        $H \leftarrow \text{CCA}(H, E)$
        $H \leftarrow \text{FFW}(H)$
$O \leftarrow \text{READ}(H)$

其中 CA 是作用在时间维拼接后的编码邻居上的交叉注意力残差算子。我们回顾,该算子在其最简单的版本中由三个参数矩阵 K \in \mathbb{R}^{d \times c}、Q \in \mathbb{R}^{d \times c} 和 V \in \mathbb{R}^{d \times d} 定义。对于所有 h \in \mathbb{R}^d 和 Y \in \mathbb{R}^{T \times d},我们定义

\mathrm{CA}(h, Y) \triangleq \text{softmax}(YKQ^T h)YV, \qquad (4)

其中 softmax 在第二维度上执行,所有乘积均为矩阵乘积。我们使用多头交叉注意力,并在 softmax 上添加位置编码(参见 §B.1.2)。
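
式 (4) 的单头、单查询版本可以用几行 NumPy 直接写出,作为理解分块交叉注意力内部计算的参考(省略多头与相对位置编码,仅为示意)。

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def ca(h, Y, K, Q, V):
    """式 (4):CA(h, Y) = softmax(Y K Q^T h) Y V。
    h: [d] 查询向量;Y: [T, d] 被关注序列;K, Q: [d, c];V: [d, d]。"""
    logits = (Y @ K) @ (Q.T @ h)   # [T],softmax 的输入
    weights = softmax(logits)      # 在时间维(第二维)上归一化
    return weights @ (Y @ V)       # [d]
```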

前 m-1 个 token 不能关注任何先前块的邻居;在这些位置,我们将 CCA 定义为恒等映射,即对所有 token j \in [1, m-1] 设 \text{CCA}(H, E)_j \triangleq h_j。最后,最末一个 token h_{lm} 关注最后一个检索集合 E_l,我们设 h_{lm} \triangleq \mathrm{CA}(h_{lm}, E_l)(图 2 中未画出)。列表 1 给出了 CCA 的简化实现。请注意,分块交叉注意力是自回归的:CCA 在位置 i 的输出仅取决于输入到 CCA 的位置 0 到 i 的 token 序列。

在 RETRO 模型中,尽管每次 CCA 交叉注意力只关注前一块 \text{RET}(C_{u-1}) 的邻居,跨越更早块的依赖仍会通过自注意力操作传播。因此,第 u 块中第 i 个 token 的激活可能依赖于之前所有邻居 \text{RET}(C_{u'})_{u'<u} 的集合,而无需承担对该集合做交叉注意力的二次方代价。



采样。 在采样时,在块 C_u 的末尾,我们使用 SCaNN 根据嵌入 BERT(C_u) 检索邻居 RET(C_u)。随后,编码后的邻居 E_u = ENCODER(RET(C_u)) 用于对下一块 C_{u+1} 的生成进行条件化,我们以增量方式完成:总体而言,采样的成本随着被采样序列大小呈二次方增长,类似于从常规 Transformer 采样;检索所增加的成本与块数 l 成线性关系,且在实践中相比 token 采样成本可忽略不计。
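
下面的伪代码概述了该采样流程(`retrieve`、`encoder`、`sample_chunk` 均为假设的接口,仅示意在块边界处触发检索,并用编码结果条件化下一块的生成)。

```python
def sample_with_retro(prompt_chunk, num_chunks, retrieve, encoder, sample_chunk):
    """以块为单位的增量采样示意:每个块采样完毕后检索一次,
    其编码结果用于条件化下一块的生成(第一块即提示,不依赖检索)。"""
    chunks = [list(prompt_chunk)]
    for u in range(1, num_chunks):
        neighbours = retrieve(chunks[-1])          # 基于 BERT(C_u) 检索 RET(C_u)
        E_u = encoder(neighbours)                  # 编码邻居
        context = [t for c in chunks for t in c]
        chunks.append(sample_chunk(context, E_u))  # 逐 token 地增量生成 C_{u+1}
    return [t for c in chunks for t in c]
```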

2.5. 基线 Transformer 架构

我们使用的 Transformer(Vaswani 等人,2017)与 Radford 等人(2019)所述类似,仅做了少量改动:我们将 LayerNorm 替换为 RMSNorm(Zhang 和 Sennrich,2019),并使用相对位置编码(Dai 等人,2019)。作为基线,我们训练了 132M、368M、1.3B 和 7.0B 参数的无检索 Transformer(嵌入矩阵不计入参数量),所用超参数详见表 2。所有检索模型对检索数据使用相同大小的编码器,d' = 896、共 2 层,约增加 19M 参数;该编码器使用相对位置编码。检索模型从第 6 层开始、每 3 层包含一个 RETRO 块。对于最小的模型,CCA 应用在主路径的第 6、9、12 层,并在编码器中应用一次用于查询条件化,这又增加了约 12M 参数。随着基线模型规模增大,额外参数的相对占比会降低。所有模型均使用 JAX(Bradbury 等人,2018)和 Haiku(Hennigan 等人,2020)实现。

2.6. 定量评估数据集泄漏利用

RETRO 模型可能更容易从评估数据集泄漏中受益,即评估所用的数据也出现在训练集中。为更好地理解检索如何影响语言建模性能,我们将评估似然量化为评估集与训练集重叠程度的函数。

以下方法可以与任何语言模型一起使用,仅依赖于 §2.3 中提出的冻结检索系统。我们将评估序列 (X_i)_i 拆分为长度 m \le 64 的块,并将训练数据同样视为块的集合。对于每个评估块 C \in \mathcal{C},我们在训练数据中检索 10 个最接近的块(长度最多 128),然后计算该评估块与其邻居之间最长的公共 token 子串,得到一个数字 s \in [0, m]。比值 r(C) = \frac{s}{m} 的取值从 0(完全未见过的块)到 1(完全见过的块),可靠地反映了评估块与训练数据之间的重叠程度。对于给定模型,我们随后获得每个块 C 的对数似然 \ell(C) 及其编码的字节数 N(C),并考虑模型的过滤 bits-per-bytes:

\forall \alpha \in [0, 1], \quad \mathcal{C}_{\alpha} \triangleq \{C \in \mathcal{C},\ r(C) \le \alpha\}, \quad \text{bpb}(\alpha) \triangleq \frac{\sum_{C \in \mathcal{C}_{\alpha}} \ell(C)}{\sum_{C \in \mathcal{C}_{\alpha}} N(C)}, \qquad (5)

表 2 | 参数数量(不包含嵌入层)以及我们基准和 RETRO 模型对应的超参数。

Baseline parameters | RETRO | d | d_ffw | # heads | Head size | # layers
132M | 172M (+30%) | 896 | 3,584 | 16 | 64 | 12
368M | 425M (+15%) | 1,536 | 6,144 | 12 | 128 | 12
1,309M | 1,451M (+11%) | 2,048 | 8,192 | 16 | 128 | 24
6,982M | 7,532M (+8%) | 4,096 | 16,384 | 32 | 128 | 32


上式对应于在与训练块重叠不超过 \alpha 的评估块集合上计算的 bits-per-bytes。注意,完整的评估 bits-per-bytes 性能可由 bpb(1) 恢复。函数 bpb(\cdot) 使我们能够评估泄漏对预测性能的影响:对于较小的 \alpha,bpb(\alpha) 给出模型在全新块上的表现;bpb(\cdot) 的斜率则反映模型利用评估泄漏的程度。
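
下面给出该过滤指标的一个极简实现示意:r(C) 取评估块与其检索邻居之间最长公共 token 子串长度占块长的比例,bpb(α) 只在 r(C) ≤ α 的块上累计损失与字节数(数据组织形式为示意性假设)。

```python
def longest_common_substring(a, b):
    """返回两个 token 序列的最长公共连续子串长度(O(|a||b|) 动态规划)。"""
    best = 0
    prev = [0] * (len(b) + 1)
    for x in a:
        cur = [0] * (len(b) + 1)
        for j, y in enumerate(b, start=1):
            if x == y:
                cur[j] = prev[j - 1] + 1
                best = max(best, cur[j])
        prev = cur
    return best

def filtered_bpb(chunks, alpha):
    """chunks: 列表,每项为 (token 序列, 邻居 token 序列列表, 损失 ℓ(C), 字节数 N(C))。"""
    loss_sum, byte_sum = 0.0, 0
    for tokens, neighbours, loss, num_bytes in chunks:
        s = max((longest_common_substring(tokens, nb) for nb in neighbours), default=0)
        if s / len(tokens) <= alpha:
            loss_sum += loss
            byte_sum += num_bytes
    return loss_sum / max(1, byte_sum)
```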

3. 相关工作

我们首先回顾使用检索进行语言建模的现有工作,并将 RETRO 与这些工作进行比较(见表 3)。由于我们在包含大量互联网上内容的大型数据集上训练 RETRO 模型,我们的工作引发了潜在的隐私、安全和公平性问题,我们随后对此进行审查。

3.1. 检索式语言建模

Brants 等人 (2007) 表明,将训练数据规模扩展到万亿级令牌可以提升 n-gram 模型的机器翻译性能。更近的研究中,GPT-2 (Radford 等人,2019)、GPT-3 (Brown 等人,2020) 和 Jurassic-1 (Lieber 等人,2021) 显示,扩大语言模型规模可在许多下游任务中带来巨大改进。同时,Carlini 等人 (2021) 证明大规模语言模型可以完美记忆其训练数据的一部分,暗示通过检索增强模型可能进一步提升性能。然而,训练集与测试集之间显著的数据泄漏 (Lee 等人,2021;Lewis 等人,2021) 使得在大数据集上训练的大模型的比较与评估变得困难,尤其是在为训练数据集添加检索功能后。

历史上,文本信息检索依赖于倒排索引匹配,例如 TF-IDF 和 BM25(Robertson 和 Zaragoza,2009)。早期的基础性工作使用 LDA(Blei 等人,2003)等潜在主题建模方法来识别相关邻居(Wei 和 Croft,2006)。在机器翻译方面,Zhang 等人(2018)和 Gu 等人(2018)基于源句子之间的编辑距离检索翻译对,并使用检索到的最接近的目标句子来指导翻译输出。检索数据库也可以是结构化的——例如,Ahn 等人(2016)使用符号知识图谱来改进 RNN 语言模型。

随着深度学习的成功,检索系统部分转向基于神经网络激活的密集学习表示。Continuous cache (Grave 等人, 2017) 为与当前激活向量相似的先前激活对应的标记添加概率质量,将模型的上下文扩展到局部历史。kNN-LM (Khandelwal 等人, 2020) 将这一思路应用于 transformer 并将检索数据库扩展到英文维基百科,从而得到

表 3 | RETRO 与现有检索方法的比较。

Model | # Retrieval tokens | Granularity | Retriever training | Retrieval integration
Continuous Cache | O(10³) | Token | Frozen (LSTM) | Add to probs
kNN-LM | O(10⁹) | Token | Frozen (Transformer) | Add to probs
SPALM | O(10⁹) | Token | Frozen (Transformer) | Gated logits
DPR | O(10⁹) | Prompt | Contrastive proxy | Extractive QA
REALM | O(10⁹) | Prompt | End-to-End | Prepend to prompt
RAG | O(10⁹) | Prompt | Fine-tuned DPR | Cross-attention
FID | O(10⁹) | Prompt | Frozen DPR | Cross-attention
EMDR² | O(10⁹) | Prompt | End-to-End (EM) | Cross-attention
RETRO (ours) | O(10¹²) | Chunk | Frozen (BERT) | Chunked cross-attention


在 Wikitext103 评估中取得了显著提升。连续缓存和 kNN-LM 并不会修改基础神经网络模型,而是在推理时在语言模型输出与从检索到的 token 计算出的分布之间进行插值。因此,这些方法可以无须额外训练即可插入任何模型,尽管这限制了模型对检索文本进行推理的能力。SPALM (Yogatama et al., 2021) 通过添加一个额外的门控网络来后处理检索数据以解决此限制;但在推理期间,大部分网络未受检索影响。

检索表示可以直接训练,而不是依赖预训练模型——为此目的已开发检索系统,主要用于开放域问答。 例如,DPR (Karpukhin 等, 2020) 使用对比损失训练两个 BERT 模型(分别用于查询和键),以对齐问题及其答案的表示。Lee 等人 (2019) 使用逆向填空任务来寻找段落的语义表示以进行检索。这些工作与连续缓存和 kNN-LM 的不同之处在于,它们将整段文本(或块)一起嵌入,而不是逐个 token 嵌入。检索网络是在不与使用检索数据的下游任务耦合的情况下训练的。REALM (Guu 等, 2020) 专门解决了这一潜在问题,它将检索系统端到端训练,以最大化最终训练交叉熵。这带来了在训练期间搜索数据库并周期性更新嵌入表的额外复杂性,严重限制了其可操作的规模。RAG (Lewis 等, 2020) 和 FID (Izacard 与 Grave, 2021) 在 DPR 的基础上,通过训练编码器-解码器 transformer 模型,在问答基准上设定了最先进的水平。最近,EMDR² (Sachan 等, 2021) 通过使用期望最大化算法端到端训练检索器来扩展 FID,并在与同等规模模型相比时取得了最先进的结果。

在开放域对话场景中,BlenderBot 2.0(Komeili 等,2021)学习生成文本化的互联网查询,在以与人类回答的相似度评估模型回答的任务上优于密集检索方法。这需要收集带有关联搜索查询的人类对话数据集,从而限制了该方法的可扩展性。Hashemi 等(2020)提出了 Guided Transformer,一种与 RETRO 类似的改进型 Transformer,用于文档检索和澄清性问题的选择。虽然这些方法在问答和其他具有强条件化的任务上效果良好,但与 RETRO 不同,它们都不是为建模任意文本序列而设计的。

RETRO 与 kNN-LM 和 DPR 一样使用冻结的检索表示。与 QA 示例相比,RETRO 处理的序列更长,这要求在子序列层面进行推理,并为序列的不同块检索不同的文档。与 FID 类似,RETRO 在编码器中分别处理检索到的邻居,并在分块交叉注意力中将它们组合起来。不同于 REALM 等将检索文档前置到提示中的做法,RETRO 的分块设计使模型在生成序列的过程中可以反复检索,而不是仅基于提示检索一次。此外,检索贯穿 RETRO 的整个预训练过程,而不是为了解决某个特定下游任务才临时加入。最后,以往基于密集查询向量的方法使用的是小模型和不超过 3B token 的检索数据(英文维基百科)。表 3 总结了 RETRO 与现有方法的区别。

3.2. 隐私、安全与公平

Bender 等(2021);Weidinger 等(2021)强调大型语言模型的若干风险。这些风险源于它们记忆训练数据的能力、训练成本高昂、训练数据的静态特性(Lazaridou 等,2021)、倾向于放大训练数据中固有偏见,以及生成有害语言的能力(Gehman 等,2020)。在本节中,我们审视这些风险,重点关注检索增强语言模型可能如何放大或



缓解这些风险。

大型语言模型可以完美地记住其训练数据的一部分 (Carlini 等,2021)。当与从网络或其他来源收集的大型训练数据集相结合时,这会产生明显的隐私和安全隐患。诸如 RETRO 的检索模型在推理期间可以访问整个训练数据集,通过能够直接复制训练数据,进一步加剧了这些隐私问题。然而,检索系统通过在推理时消除可检索的数据,为缓解这些问题提供了一条路径。另外,检索模型的差分隐私训练 (Abadi 等,2016) 可以保证模型权重中不存储任何私人信息,而对私人数据的个性化可以通过在推理时更新检索数据库来实现。

由于训练成本高昂,定期重新训练大型语言模型以纳入新数据、新语言和新规范的代价高得令人却步。为了让检索模型保持最新,通常只需更新检索数据库即可,这比从零开始重新训练模型便宜得多。除了在公平性和偏见方面更新模型带来的好处之外,训练大型语言模型本身也消耗大量能源(Schwartz 等,2020;Strubell 等,2019)。检索机制提供了一条降低达到给定性能所需的训练与更新计算量的途径。

大型语言模型容易生成有毒输出,Gehman 等(2020)已证明这一点。Bender 等(2021);Jo 与 Gebru(2020)则主张改进训练数据的策划和文档记录的重要性。此外,如果在训练后发现部分训练数据会引发偏见或有毒输出,检索可以进行一定程度的纠正,因为违规检索数据可以被事后过滤。然而,在没有细致分析和干预的情况下,检索模型也可能放大训练数据中已存在的偏见。检索模型还可能通过检索文档的选择机制再添一层偏见来源。该领域还需进一步研究,以更好地理解检索如何影响模型输出的偏见与毒性。

最后,大型模型生成的样本难以解释,这使得缓解上述问题更加困难(Belinkov 等,2020;Jain 和 Wallace,2019)。检索为模型输出提供了更多洞察,因为可以直接查看或修改模型所使用的邻居。表 6、7、20 和 21 中的示例说明了检索如何使语言模型的输出更透明,从而更符合事实、更可解释。

4. 结果

我们首先报告语言建模基准的结果。其次,我们展示如何将预训练的 Transformer 语言模型 RETROfit 到检索模型中,仅需少量额外 FLOPs。接下来,我们报告 RETRO 在问答上的结果。最后,我们报告带有泄漏过滤的评估指标,以更好地了解检索带来收益的来源。

4.1. 语言建模

数据集。 我们在 C4(Raffel 等,2020)、Wikitext103(Merity 等,2017)、Curation Corpus(Curation,2020)、Lambada(Paperno 等,2016)和 Pile(Gao 等,2020)上评估我们的模型。我们还在一组手工挑选的维基百科文章上进行评估,这些文章在 2021 年 9 月被添加或大量编辑,距我们预训练和检索数据集收集已过数月(详情见 §A.2)。我们从“未来”文章构建数据集,并手动删除与训练数据中文档高度重叠的新文章。这样可以保证评估文档未泄漏到我们的训练数据中。



图 3 | 随模型规模的扩展。 (a) LAMBADA top-1 准确率。(b) Curation Corpus 上的评估损失。(c) Wikitext103 验证集上的困惑度。(d) 2021 年 9 月精选维基百科文章上的 bits-per-byte。

在 C4、Wikitext103、The Pile 以及我们的 Wikipedia 数据集上,我们评估语言模型在整篇文档上的性能,并测量每字节比特(bpb)。 我们更倾向于使用每字节比特(bpb),因为它与分词器无关。 我们以 2048 个标记的序列长度进行评估,但在文档内部使用 1024 的步幅,以减轻边界效应。 在 Curation Corpus 上,我们将文章、“TL;DR:” 字符串和摘要拼接在一起,但仅在摘要上评估 bpb。 对于 Lambada,我们评估最后一个词的准确率,使用贪婪生成。

Model scaling. 在图 1(左)和图 3 中,我们展示了随着模型规模从 1.5 亿到 70 亿(非嵌入)参数扩展,语言模型性能的变化。 我们发现,在所有数据集上,RETRO 在所有模型规模下均优于基线。 此外,我们观察到,随着模型规模扩大,改进并未减弱。 性能取决于数据集,在 Wikitext103 和 C4 上的提升最大。 Wikipedia 文章和其他网页与 Wikitext103 文档相似,即使不是完全复制(§4.4),因此我们的检索模型能够直接利用这些重叠,从而在 Wikitext103 上获得显著提升。 Curation Corpus 的提升最小,RETRO 仅略优于基线。 这可以预料,因为 Curation Corpus 的摘要仅包含来源文章的信息,且未包含在我们的检索数据库中。 在我们的“未来” Wikipedia 2021 年 9 月数据集上,我们同样观察到所有模型规模均有持续提升。

Data scaling. 图 1(中)展示了在评估时扩大检索数据库如何提升语言模型性能:当检索数据从 Wikipedia(40 亿 token)扩展到完整的 MassiveText(1.7 万亿 token)时,提升十分显著。图 1(右)展示了性能随检索块数量的变化。尽管训练时只使用 2 个邻居,当邻居数量从 1 增加到 10 时,所有模型都持续改进。此外,更大的模型能更好地利用更多邻居:172M 模型的提升在约 10 个邻居处饱和,而 7B 模型最多可利用 40 个邻居。

The Pile. 我们在 Pile 测试集³上评估我们的 7B 模型,并与 178B 参数的 Jurassic-1(Lieber 等,2021)模型以及 280B 参数的 Gopher(Rae 等,2021)模型进行比较。我们不与 GPT-3 比较,因为它在几乎所有子集上都被 Jurassic-1 和 Gopher 超越。图 4 显示了相对于我们 7B Transformer 基线的 bits-per-byte 相对提升,对比对象为我们的

³由于与其使用相关的法律和伦理问题,我们排除了 Enron Emails 和 Youtube Subtitles 数据集。



Figure 4 | The Pile: Comparison of our 7B baseline against Jurassic-1, Gopher, and RETRO. 我们观察到检索模型在所有测试集上均优于基线,并且在大多数测试集上优于 Jurassic-1,尽管其规模比前者小一个数量级。

7.5B RETRO 模型、Jurassic-1 和 Gopher。Jurassic-1 在除书籍外的所有数据集上都优于基线,这很可能是由于我们的训练数据中包含书籍。Gopher 和 RETRO 在所有测试集上都优于基线。总体而言,RETRO 7.5B 在大多数测试集上优于 Jurassic-1 和 Gopher。在 dm.mathematics 和 ubuntu_irc 子集上,我们的 RETRO 模型并未优于 7B 基线,并且低于 Jurassic-1。我们推测,在这些数据集上检索到的邻居并不有用,这可能是由于检索数据集的内容以及最近邻搜索的有效性所致。

Wikitext103. 为了在受控环境下验证我们的方法,我们在表 4 中将其与 kNN-LM(Khandelwal 等,2020)在 Wikitext103 数据集上的表现进行比较。我们在 Wikitext103 的训练集上训练了一个基线 Transformer,它具有 24 层、1024 个隐藏单元、16 个注意力头和 64 的键维度,与 Baevski 和 Auli(2019)类似。与 Baevski 和 Auli(2019)不同,我们的基线没有自适应输入,且分词器使用开放词表,这使得我们的基线

Table 4 | Wikitext103 上的困惑度。 当使用 Wikipedia 数据集进行检索时,RETRO 的表现与我们实现的 kNN-LM 相似。随着检索数据集的扩大,RETRO 的表现显著提升。检索完整 MassiveText 的困惑度相当低,这在一定程度上归因于与 Wikitext103 的部分重叠未被我们的去重检测到。

Model | Retrieval Set | # Database tokens | # Database keys | Valid | Test
Adaptive Inputs (Baevski and Auli, 2019) | - | - | - | 17.96 | 18.65
SPALM (Yogatama et al., 2021) | Wikipedia | 3B | 3B | 17.20 | 17.60
kNN-LM (Khandelwal et al., 2020) | Wikipedia | 3B | 3B | 16.06 | 16.12
Megatron (Shoeybi et al., 2019) | - | - | - | - | 10.81
Baseline transformer (ours) | - | - | - | 21.53 | 22.96
kNN-LM (ours) | Wikipedia | 4B | 4B | 18.52 | 19.54
RETRO | Wikipedia | 4B | 0.06B | 18.46 | 18.97
RETRO | C4 | 174B | 2.9B | 12.87 | 10.23
RETRO | MassiveText (1%) | 18B | 0.8B | 18.92 | 20.33
RETRO | MassiveText (10%) | 179B | 4B | 13.54 | 14.95
RETRO | MassiveText (100%) | 1792B | 28B | 3.21 | 3.92


困惑度略高一些。完整的实验细节和超参数见 §C.2 与表 11。

我们用自己的 tokenizer 和基线 transformer 重新实现 kNN-LM,以为 Wikitext103 的每个 token 生成 1024 维的嵌入。kNN-LM 的概率为 p_{kNN-LM} = \lambda p_{kNN} + (1 - \lambda)p_{LM},对应 p_{kNN}(n_k) \propto \exp(-\alpha d_k)。我们在验证集(图 7)上调优 \lambda = 0.118\alpha = 0.00785,并报告这些超参数在验证集和测试集上的表现。
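
作为参考,下面是该插值的一个极简实现示意:p_kNN 由邻居距离的指数加权(温度由 α 控制)按各邻居对应的下一 token 聚合得到,再与语言模型分布按 λ 线性插值(输入组织形式为示意性假设)。

```python
import numpy as np

def knn_lm_probs(p_lm, neighbour_tokens, neighbour_dists, vocab_size,
                 lam=0.118, alpha=0.00785):
    """p_lm: [V] 语言模型分布;neighbour_tokens / neighbour_dists: k 个检索邻居
    对应的下一 token 及其与查询的距离。返回插值后的分布。"""
    weights = np.exp(-alpha * np.asarray(neighbour_dists))   # p_kNN(n_k) ∝ exp(-α d_k)
    weights /= weights.sum()
    p_knn = np.zeros(vocab_size)
    for token, w in zip(neighbour_tokens, weights):
        p_knn[token] += w
    return lam * p_knn + (1.0 - lam) * p_lm
```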

我们将基线 transformer 微调为 RETRO 模型(图 7),使用 Wikitext103 训练数据,并以 2 个邻居从 Wikipedia 检索。我们仅训练新的权重,正如 §4.2 所述,并将编码器与主通路共享嵌入权重。由于 Wikitext103 数据量相当小,在此环境下从头训练 RETRO 会导致过拟合,因此此做法是必要的。

我们使用不同检索集评估微调后的 RETRO 模型。评估时对 RETRO 和 kNN-LM 均使用 10 个邻居。从 Wikipedia 检索时,得到的结果与我们实现的 kNN-LM 相当。进一步将检索数据库扩展到 MassiveText,带来了显著提升,部分原因是泄漏(见 §4.4)。为了可复现性,我们还给出了从 C4 检索时的结果,这些结果接近先前的最先进水平,并可与使用 10% MassiveText 的结果相媲美。

值得注意的是,kNN-LM 对检索数据集中的每个 token 需要存储 1024 个浮点数,对我们 40 亿 token 的 Wikipedia 数据集而言总计约 15 TB。因此,kNN-LM 以及其他按 token 检索的方法无法扩展到 MassiveText 这类万亿级 token 的检索数据库。相比之下,RETRO 索引我们的 Wikipedia 数据集仅需 215 GB,索引 MassiveText 需要 93 TB。查看表 4 中检索数据库条目的数量即可明白,在扩展到万亿级 token 数据集时,为什么必须在块的级别上进行检索。

4.2. RETRO-fitting 基线模型

我们通过冻结预训练权重,只训练分块交叉注意力和邻居编码器参数(对于 7B 模型而言不到 10% 的权重)将基线模型扩展为 RETRO 模型,如图 5 所示。这为使用检索增强 transformer 提供了一条高效的替代路径,只需 600 万条序列(我们使用的预训练序列的 3%)。此外,仅训练新权重可确保在不使用检索时评估时原始模型的性能完全保持。RETRO‑fitting 模型迅速超过基线模型的性能,甚至接近从头训练的 RETRO 模型的性能。实验的超参数见 §C.3。

4.3. 问答

我们在 Natural Questions(Kwiatkowski 等,2019)数据集上微调我们的检索模型,以证明我们的检索通路可以用来注入来自任意数据源的信息。我们使用 Izacard 和 Grave(2021)提供的版本⁴,其中已用 DPR(Karpukhin 等,2020)检索到的段落进行增强。我们在 7.5B 预训练 RETRO 模型的基础上,使用前 20 条检索段落,对所有权重微调 25,000 步。我们将数据格式化为 "question: {question} answer: {answer}",并对数据进行左填充,使 "answer:" 恰好与第一个 64-token 块的末尾对齐,从而与第一个检索块对齐。模型通过序列中的前置 token 访问问题,并通过分块交叉注意力机制访问前 20 条 DPR 维基百科段落及其标题。

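
下面是该左填充对齐的一个示意实现(`pad_id` 等记号为示意性假设):将以 "answer:" 结尾的提示 token 序列左填充,使其结束位置恰好落在第一个 64-token 块的末尾。

```python
def left_pad_to_chunk_boundary(prompt_tokens, answer_tokens, pad_id=0, m=64):
    """左填充,使提示(以 "answer:" 结尾)恰好占满第一个 m-token 块。"""
    assert len(prompt_tokens) <= m, "提示必须放得进第一个块"
    padding = [pad_id] * (m - len(prompt_tokens))
    return padding + list(prompt_tokens) + list(answer_tokens)
```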

⁴https://github.com/facebookresearch/FiD



图 5 | 对基准 transformer 进行 RETRO 拟合。任何 transformer 都可以通过随机初始化并仅训练分块的交叉注意力和检索编码器权重,微调成检索增强型 transformer。以这种方式微调能够快速恢复并超越非检索模型的性能,并且几乎能达到从零开始训练检索模型的相同性能(在每个图的右侧箭头所示)。我们发现,仅用预训练期间看到的令牌数量的 3% 来训练我们的模型即可得到良好的 RETRO 拟合性能。

精确匹配准确率见表 5,完整的微调细节见 §C.4。我们的方法与 REALM、RAG 和 DPR 等此前的方法相比具有竞争力,但落后于更近期的 FID。与这些工作不同,我们发现把邻居数量增加到 20 以上并不会提升 RETRO 在此任务上的表现。我们推测,FID 所基于的 T5 编码器-解码器结构及其预训练目标,使得该模型比 RETRO 更依赖编码器输出,而这一点在 QA 场景中尤为重要。若要与微调后的 T5 模型竞争,未来工作应考虑在生成 token 时促使 RETRO 进一步依赖检索编码器的输出。

4.4. 将检索性能与数据集泄露关联

我们在图 6 中报告了 §2.6 所述的 C4、Curation Corpus 和 Wikitext103 上的过滤评估损失。在 C4 和 Wikitext103 上,由于训练集中存在泄漏,基线模型和 RETRO 模型的曲线斜率均为负。RETRO 模型比基线模型更强烈地利用泄漏,这体现在更负的斜率上,原因在于它能够显式地复制粘贴已有的训练块来预测泄漏的评估块(此类模型行为的定性示例见基于 Wikitext103 文章的表 19)。

表 5 | 问答结果。 在 Natural Questions 上的精确匹配准确率。

Model | Test Accuracy
REALM (Guu et al., 2020) | 40.4
DPR (Karpukhin et al., 2020) | 41.5
RAG (Lewis et al., 2020) | 44.5
EMDR² (Sachan et al., 2021) | 52.5
FID (Izacard and Grave, 2021) | 51.4
FID + Distill. (Izacard et al., 2020) | 54.7
Baseline 7B (closed book) | 30.4
RETRO 7.5B (DPR retrieval) | 45.5


图 6 | 性能与最长公共检索子串。 评估损失随允许的最长公共子串长度(评估数据块与最近邻块之间)而变化。当考虑与训练数据块重叠不超过 8 个连续词元的块时,检索仍然有帮助。

在 Curation Corpus 上,检索提供了一个常数偏移量,这符合预期,因为 Curation Corpus 与训练数据集之间在设计上不存在泄漏。

另一方面,RETRO 在所有泄漏水平上都优于基线模型,低至 \alpha = 12.5\%。在该水平下,损失只在与训练数据中最接近的块共享不超过 8 个连续 token 的评估块上计算——这是一个合理的重叠水平,我们认为此时不存在局部泄漏。因此,检索既改进了与训练集中某些块在句法上相似的块的预测,也改进了与所有训练块在句法上都不同的块的预测。这表明 RETRO 借助其模型参数和检索数据库实现了非平凡的泛化。在 Pile 上也有类似的结果(见图 12,§F.3)。

4.5. 使用 RETRO 进行采样

我们展示了使用7.5B RETRO模型在表6、表7和附录E中获得的样本示例。对于每个块(第一个块为提示),我们将采样块 C_u 与检索到的相邻块 RET(C_u) 并排比较。为了给出局部重叠的指示,我们根据检索到的块 RET(C_{u-1}) 中最长公共前缀(LCP)的长度,对块 C_u 中的每个采样标记进行着色。同样,我们根据采样块中的 LCP 对检索到的块进行着色。在表6中的样本中,对于我们选择的提示,我们观察到检索块影响样本,因为采样标记与相邻标记之间存在重叠。总体而言,检索降低了幻觉(与Shuster等人 (2021) 的发现相一致),并使模型更具知识性,与禁用检索时生成的样本相比。在表7中的样本中,模型识别出提示是《哈姆雷特》第一幕的开始,并利用检索数据继续,仅出现少量错误。我们在附录E中提供了更多示例,包括来自评估集的示例,以及用于给表格着色的详细过程。

5. 结论

我们提出了 Retrieval-Enhanced Transformer(RETRO),这是一种在从数万亿 token 的数据库中检索的同时对任意文本序列进行建模的方法——相比训练期间通常消耗的数据量,它将模型可用的数据规模扩大了一个数量级。RETRO 模型的收益在至少到 7B 参数的规模上都没有减弱,并且在某些数据集上相当于参数量多 10 倍的非检索模型。在 Wikitext103 和 Pile 上,RETRO 优于此前在大规模数据集上训练的模型。我们还展示了 RETRO 在问答等检索密集型下游任务上具有竞争力。

RETRO模型具有灵活性,可在评估时无需检索即可使用,并且仍能达到与基线模型相当的性能。相反,基线模型可以迅速微调为RETRO模型,以获得几乎与从零开始训练相同的性能。仔细分析表明,RETRO所获得的收益中只有少量是由于测试集泄漏。一般而言,我们警告大型语言数据集中可能存在此类泄漏,并建议进一步研究以更好地理解测试集泄漏在大型语言模型性能中的作用。

总体而言,我们的工作在前所未有的规模上证明,半参数化方法可以为纯参数模型扩展提供一条正交且更高效的路径,帮助我们构建更强大的语言模型。

致谢

我们想感谢 Nikolai Grigorev、Marc'aurelio Ranzato、Cyprien de Masson d'Autume、Po-Sen Huang、Johannes Welbl、Lisa Anne Hendricks、Ethan Perez、Jeff Stanway、Eric Noland、Gregory Wayne、John Jumper、Julian Schrittwieser、Lorrayne Bennett、Devang Agrawal、Dani Yogatama、Susannah Young、Nando de Freitas、Demis Hassabis 和 Koray Kavukcuoglu 的帮助、建议和评审。此外,我们还想感谢 Zonglin Li、David Simcha 和 ScaNN 开发者的帮助。



表 6 | 示例 - 海狸是有趣的动物。 RETRO[OFF] 样本迅速偏离其他动物,而 RETRO[ON] 样本由于邻居条件,倾向于保持关注海狸主题。

Prompt and sample of RETRO[OFF]Prompt and sample of RETRO[ON] colored by LCP with RET(Cu-1)
LCP = 0, 1, 2, 3,4,≥ 5
[N1u, F1u] colored by LCP with Cu+1
LCP = 0, 1, 2, 3,4,≥ 5
[N2u, F2u] colored by LCP with Cu+1
LCP = 0, 1, 2, 3,4,≥ 5
Beavers are interesting animals that live near rivers. They buildBeavers are interesting animals that live near rivers. They buildBeavers build their lodges in ponds they have created in wooded areas. Like many things in nature, there is a connection between creatures in the wild. Beaver ponds cause trees to drown, but the dead trees attract the great blue heron, which often return year after year. Over time, a beaver pond can attract more than 50 nests in a colony, called a rookery. An example of this can be found in the large pond off Bradford Road at Carter Fields near the Boxford line. Chris Leahy, an expert with the Massachusetts Audubon Society who wrotenaw them into smaller sections and drag them into the water. Engineers of the Pond Beavers are interesting animals because they change the habitat in which they live. Beavers do this by blocking up streams to create ponds. Then they build their homes, called lodges, in these ponds. Beavers' bodies make them well-suited for underwater building. Special muscles close off their noses, ears, and throats to keep the water out. Beavers' broad tails act like rudders for steering. Their two very large, orange front teeth are used to gnaw down trees. They begin building their dam
dams to create ponds. Frogs are amphibians, so they can live in both land and water. They have great camouflage to hide from predators. The Golden Retriever, or Golden, is a very big dog. They are very strong, and have a thick coat to help them live intheir houses called beaver dams in the riverbeds. They also live on land. Beavers use their strong teeth and strong jaws to cut down trees and branches to build their homes. They also use their teeth and jaws to chew up food. Beavers use their big, flat tails to swim. They usethen they mean that you are very busy. Beavers swim easily in streams, picking up rocks and sticks to build their dams. They gnaw at trees with their big front teeth to cut them down. Then they use parts of the trees to build their houses. Beavers are the clever builders. They know exactly what they need to build their beaver dams. They use mud from the stream to make their dams stay together. They use their tails to pat down the mud. Beavers put a snug room at the top of their dams for their babies. They store their food underwater. Beavers eat the bark from thear-like tail, and two protruding teeth that are strong enough to gnaw down trees. The beaver uses trees, branches, and mud to build dams across rivers and streams. These dams create a deep pool of water in which the beaver builds its home. A beaver home is called a lodge. A baby beaver or “kit” remains in the family lodge until the age of two. Beaver fur, known as pelt, was once highly popular as a trim for hats and coats. How might the popularity of beaver fur contribute to the colonization of New York? www.Ww
cold areas. A kangaroo is an animal that is very large. It can jump very high. They can run very fast. A meerkat is a small animal that lives in Africa. They live in colonies of up to a hundred of them. They can climb trees very easily.their sharp teeth to gnaw at rocks and cut trees. They use their flat tails to steer when they are swimming. In the winter, beavers have to build a house called a beaver lodge. They build a house that is very strong. The walls are made of twigs. The roof is made ofthen they mean that you are very busy. Beavers swim easily in streams, picking up rocks and sticks to build their dams. They gnaw at trees with their big front teeth to cut them down. Then they use parts of the trees to build their houses. Beavers are the clever builders. They know exactly what they need to build their beaver dams. They use mud from the stream to make their dams stay together. They use their tails to pat down the mud. Beavers put a snug room at the top of their dams for their babies. They store their food underwater. Beavers eat the bark from thew at trees with their big front teeth to cut them down. Then they use parts of the trees to build their houses. Beavers are clever builders. They know exactly what they need to build their beaver dams. They use mud from the stream to make their dams stay together. They use their tails to pat down the mud. Beavers put a snug room at the top of their dams for their babies. They store their food underwater. Beavers eat the bark from the trees that they cut down! What is the main idea of the first paragraph? 2. What is the main idea of the second paragraph?
A mouse is a small mammal that lives on land. It is a very good climber and it can run very fast. Penguins are birds that live on Antarctica. They have a thick coat to keep them warm. Rabbits are small animals that live in the ground. Theyof branches and other tree parts. They also use their strong jaws to cut trees. They bring them to their home. They also use their sharp teeth to chew up the tree parts. They use their flat tails to swim to the top of their house. Then they use their teeth and jaws to chew up the tree


表 7 | 示例 - 哈姆雷特,第一幕,第一场。 RETRO[OFF] 样本语法正确,但内容为幻觉,并以某个角色台词的重复结束(FRANCISCO Approach me not)。RETRO[ON] 样本正确地续写了原文,并且对我们的提示与检索数据之间的格式差异具有鲁棒性。

Prompt and sample of RETRO [Off]Prompt and sample of RETRO [On][N1u, F1u] colored by LCP with Cu+1[N2u, F2u] colored by LCP with Cu+1
LCP = 0, 1, 2, 3, 4, ≥ 5LCP = 0, 1, 2, 3, 4, ≥ 5LCP = 0, 1, 2, 3, 4, ≥ 5
ACT I SCENE I. Elsinore. A pla tform before the castle. FRANC ISCO at his post. Enter to him BERNARDO BERNARDO Who's there ? FRANCISCO Nay, answer me: st and, and unfold yourself. BERN ARDOACT I SCENE I. Elsinore. A pla tform before the castle. FRANC ISCO at his post. Enter to him BERNARDO BERNARDO Who's there ? FRANCISCO Nay, answer me: st and, and unfold yourself. BERN ARDOACT I SCENE I. Elsinore. A pla tform before the castle. (FRANC ISCO at his post. Enter to him BERNARDO) BERNARDO Who's the re? FRANCISCO Nay, answer me: stand, and unfold yourself. BE RNARDO Long live the king! FRA NCISCO Bernardo? BERNARDO He. FRANCISCO You come most careful ly upon your hour. BERNARDO 'Tis now struck twelve; get the e to bed, Francisco. FRANCISCO For this relief much thanks: 'tis bitter cold, And I am sic k at heart. BERNARDO Have you><TEXT>ACT I SCENE I. Elsin ore. A platform before the cas tle. FRANCISCO at his post. E nter to him BERNARDO BERNARDO Who's there? FRANCISCO Nay, an swer me: stand, and unfold you rself. BERNARDO Long live the king! FRANCISCO Bernardo? BERN ARDO He. FRANCISCO You come mo st carefully upon your hour. B ERNARDO 'Tis now struck twelve ; get thee to bed, Francisco. FRANCISCO For this relief much thanks: 'tis bitter cold, And I am sick at heart.
Who calls ? FRANCISCO I am th e lord here; I, Francisco, tha t am sick of grief. [ Aside. B ERNARDO The king ! FRANCISCO I am sick of that also. BERNARD O My lord ? FRANCISCO Do not a pproach me. BERNARDOLong live the king! FRANCISCO Bernardo? BERNARDO He. FRANCI SCO You come most carefully up on your hour. BERNARDO 'Tis no w struck twelve; get thee to b ed, Francisco. FRANCISCO For t his relief much thanks: 'tis b itter cold, And I am sick at h eart. BLong live the king! FRANCISCO Bernardo? BERNARDO He. FRANCI SCO You come most carefully up on your hour. BERNARDO 'Tis no w struck twelve; get thee to b ed, Francisco. FRANCISCO For t his relief much thanks: 'tis b itter cold, And I am sick at h eart.live the king! FRANCISCO Bern ardo? BERNARDO He. FRANCISCO Y ou come most carefully upon yo ur hour. BERNARDO 'Tis now str uck twelve: get thee to bed, F rancisco. FRANCISCO For this t elief much thanks: 'tis bitter cold, And I am sick at heart. BERNARDO Have you had quiet g uard? FRANCISCO Not a mouse st irring. BERNARDO Well, good ni ght. If you do meet Horatio and Marcellus, The rivals2 of my watch, bid them make haste. FRANCISCO I think I hear them. Stand, ho! Who's th ere? (Enter HORATIO and MARCEL LUS) HORATIO Friends to this g round. MARCELLUS And liegemen to the Dane. FRANCISCO Give yo u good night. MARCELLUS O, far ewell, honest soldier: Who hat h relieved you? FRANCISCO Bern ardo has my place. Give you go od night. (Exit)
Francisco, I would speak with you. FRANCISCO Approach me not , but speak. BERNARDO Your han d, your voice FRANCISCO I will not hear thee speak. BERNARDO Francisco, your hand, I entre at thee. FRANCISCO Approach me not. BERNARDO Francisco FRANCERNARDO Have you had quiet gua rd? FRANCISCO Not a mouse stir ring. BERNARDO Well, good nigh t. If you do meet Horatio and Marcellus, The rivals of my wa tch, bid them make haste. FRAN CISCO I think I hear them. Sta nd, ho! who is there? Enterhad quiet guard? FRANCISCO No t a mouse stirring. BERNARDO W ell, good night. If you do mee t Horatio and Marcellus, The r ivals of my watch, bid them ma ke haste. FRANCISCO I think I hear them. Stand, ho! Who's th ere? (Enter HORATIO and MARCEL LUS) HORATIO Friends to this g round. MARCELLUS And liegemen to the Dane. FRANCISCO Give yo u good night. MARCELLUS O, far ewell, honest soldier: Who hat h relieved you? FRANCISCO Bern ardo has my place. Give you go od night. (Exit)ARDO Have you had quiet guard? FRANCISCO Not a mouse stirrin g. BERNARDO Well, good night. If you do meet Horatio and Marc ellus, The rivals2 of my watch , bid them make haste. FRANCIS CO I think I hear them.— Stand , ho! who is there? ENTER HORA TIO AND MARCELLUS, HORATIO Fri ends to this ground. MARCELLUS And liegemen to the Dane, 3 FR ANCISCO Give you good night. M ARCELLUS O, farewell, honest s oldier: Who hath relieved you? FRANCISCO Bernardo hath my pl ace. Give you good night
ISCO Approach me not. BERNARDO I have a letter FRANCISCO App roach me not. BERNARDO For the king. FRANCISCO Approach me n ot. BERNARDO There's no treaso n in't. FRANCISCO Approach me not. BERNARDO I willHORATIO and MARCELLUS HORATIO Friends to this ground. MARCE LLUS And liegemen to the Dane. FRANCISCO Give you good night . MARCELLUS O, farewell, hones t soldier: Who hath relieved y ou? FRANCISCO Bernardo hath my place. Give you good night.


参考文献

M. Abadi, A. Chu, I. Goodfellow, H. B. McMahan, I. Mironov, K. Talwar, and L. Zhang. 深度学习与差分隐私. In ACM SIGSAC 计算机与通信安全会议, 2016.

S. Ahn, H. Choi, T. Pärnamaa, and Y. Bengio. 一种神经知识语言模型. arXiv 预印本 arXiv:1608.00318, 2016.

A. Baevski and M. Auli. 神经语言建模的自适应输入表示. In 国际学习表征会议, 2019. URL https://openreview.net/forum?id=ByxZX20qFQ.

Y. Belinkov, S. Gehrmann, and E. Pavlick. 解释性与神经自然语言处理中的分析. In 第58届计算语言学协会年会教程摘要论文集, pages 1–5, Online, July 2020. Association for Computational Linguistics. doi: 10.18653/v1/2020.acl-tutorials.1. URL https://aclanthology.org/2020.acl-tutorials.1.

E. M. Bender, T. Gebru, A. McMillan-Major, and S. Shmitchell. 随机鹦鹉的危险:语言模型可以过大吗? In ACM 公平、责任与透明度会议, 2021.

D. M. Blei, A. Y. Ng, and M. I. Jordan. 潜在狄利克雷分配. 机器学习研究期刊, 3(Jan):993–1022, 2003. URL https://jmlr.csail.mit.edu/papers/v3/blei03a.html.

J. Bradbury, R. Frostig, P. Hawkins, M. J. Johnson, C. Leary, D. Maclaurin, G. Necula, A. Paszke, J. V. der Plas, S. Wanderman-Milne, and Q. Zhang. JAX: Python+NumPy 程序的可组合变换, 2018. URL http://github.com/google/jax.

T. Brants, A. C. Popat, P. Xu, F. J. Och, and J. Dean. 大型语言模型在机器翻译中的应用. In 自然语言处理与计算语言学习经验方法联合会议, pages 858–867, 2007.

T. Brown, B. Mann, N. Ryder, M. Subbiah, J. D. Kaplan, P. Dhariwal, A. Neelakantan, P. Shyam, G. Sastry, A. Askell, S. Agarwal, A. Herbert-Voss, G. Krueger, T. Henighan, R. Child, A. Ramesh, D. Ziegler, J. Wu, C. Winter, C. Hesse, M. Chen, E. Sigler, M. Litwin, S. Gray, B. Chess, J. Clark, C. Berner, S. McCandlish, A. Radford, I. Sutskever, and D. Amodei. 语言模型是少样本学习者。 In Advances in Neural Information Processing Systems, 2020. URL https://proceedings.neurips.cc/paper/2020/file/1457c0d6bfcb4967418bfb8ac142f64a-Paper.pdf.

N. Carlini, F. Tramer, E. Wallace, M. Jagielski, A. Herbert-Voss, K. Lee, A. Roberts, T. Brown, D. Song, U. Erlingsson, A. Oprea, and C. Raffel. 从大型语言模型中提取训练数据。 Preprint, 2021.

C. Consonni, D. Laniado, and A. Montresor. Wikilinkgraphs: 关于维基百科链接网络的完整、纵向和多语言数据集。 In AAAI International Conference on Web and Social Media, volume 13, 2019.

Curation. Curation corpus base, 2020.

Z. Dai, Z. Yang, Y. Yang, J. Carbonell, Q. Le, and R. Salakhutdinov. Transformer-XL: Attentive language models beyond a fixed-length context. In Annual Meeting of the Association for Computational Linguistics, July 2019. URL https://aclanthology.org/P19-1285.



J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova. BERT: Pre-training of deep bidirectional transformers for language understanding. In 北美计算语言学协会北美分会会议, 2019年6月. URL https://aclanthology.org/N19-1423.

L. Gao, S. Biderman, S. Black, L. Golding, T. Hoppe, C. Foster, J. Phang, H. He, A. Thite, N. Nabeshima, S. Presser, and C. Leahy. The Pile: An 800GB dataset of diverse text for language modeling. arXiv preprint arXiv:2101.00027, 2020.

S. Gehman, S. Gururangan, M. Sap, Y. Choi, and N. A. Smith. RealToxicityPrompts: Evaluating neural toxic degeneration in language models. In 自然语言处理经验方法会议, 2020年11月. URL https://aclanthology.org/2020.findings-emnlp.301.

E. Grave, A. Joulin, and N. Usunier. Improving neural language models with a continuous cache. In 国际学习表示会议, 2017. URL https://openreview.net/forum?id=B184E5qee.

A. Graves. Generating sequences with recurrent neural networks. arXiv preprint arXiv:1308.0850, 2013.

J. Gu, Y. Wang, K. Cho, and V. O. Li. 基于搜索引擎的神经机器翻译. In AAAI Conference on Artificial Intelligence, 2018.

R. Guo, P. Sun, E. Lindgren, Q. Geng, D. Simcha, F. Chern, and S. Kumar. 加速大规模推理采用各向异性向量量化. In International Conference on Machine Learning, 2020. URL https://arxiv.org/abs/1908.10396.

K. Guu, K. Lee, Z. Tung, P. Pasupat, and M. Chang. 检索增强语言模型预训练. In International Conference on Machine Learning, 2020.

H. Hashemi, H. Zamani, and W. B. Croft. 指导型变换器:利用多种外部来源进行会话检索中的表示学习. In Proceedings of the 43rd International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 1131–1140, 2020.

T. Hennigan, T. Cai, T. Norman, and I. Babuschkin. Haiku:为 JAX 的十四行诗,2020. URL http://github.com/deepmind/dm-haiku.

G. Izacard and E. Grave. 利用生成模型进行段落检索以实现开放域问答. In Conference of the European Chapter of the Association for Computational Linguistics, Apr. 2021. URL https://aclanthology.org/2021.eacl-main.74.

G. Izacard, F. Petroni, L. Hosseini, N. De Cao, S. Riedel, and E. Grave. 一种内存高效的开放域问答基线. arXiv preprint arXiv:2012.15156, 2020.

S. Jain and B. C. Wallace. 注意力并非解释. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 3543–3556, Minneapolis, Minnesota, June 2019. Association for Computational Linguistics. doi: 10.18653/v1/N19-1357. URL https://aclanthology.org/N19-1357.

E. S. Jo and T. Gebru. 来自档案的经验:在机器学习中收集社会文化数据的策略. In Proceedings of the 2020 Conference on Fairness, Accountability, and Transparency, pages 306–316, 2020.

R. Jozefowicz, O. Vinyals, M. Schuster, N. Shazeer, and Y. Wu. 探究语言建模的极限. arXiv preprint arXiv:1602.02410, 2016.



J. Kaplan, S. McCandlish, T. Henighan, T. B. Brown, B. Chess, R. Child, S. Gray, A. Radford, J. Wu, and D. Amodei. 神经语言模型的规模定律. CoRR, 2020. URL https://arxiv.org/abs/2001.08361.

V. Karpukhin, B. Oguz, S. Min, P. Lewis, L. Wu, S. Edunov, D. Chen, and W.-t. Yih. 针对开放领域问答的稠密段落检索. In Conference on Empirical Methods in Natural Language Processing, Nov. 2020. URL https://aclanthology.org/2020.emnlp-main.550.

U. Khandelwal, O. Levy, D. Jurafsky, L. Zettlemoyer, and M. Lewis. 通过记忆实现泛化:最近邻语言模型. In International Conference on Learning Representations, 2020. URL https://openreview.net/forum?id=HklBjCEKvH.

M. Komeili, K. Shuster, and J. Weston. 互联网增强式对话生成. arXiv preprint arXiv:2107.07566, 2021.

T. Kudo and J. Richardson. Sentencepiece: 一种简单且语言无关的子词分词器和还原器,用于神经文本处理. arXiv preprint arXiv:1808.06226, 2018.

T. Kwiatkowski, J. Palomaki, O. Redfield, M. Collins, A. Parikh, C. Alberti, D. Epstein, I. Polosukhin, M. Kelcey, J. Devlin, K. Lee, K. N. Toutanova, L. Jones, M.-W. Chang, A. Dai, J. Uszkoreit, Q. Le, and S. Petrov. Natural Questions: 问答研究的基准. Transactions of the Association of Computational Linguistics, 7:452–466, Mar. 2019. URL https://aclanthology.org/Q19-1026.

A. Lazaridou, A. Kuncoro, E. Gribovskaya, D. Agrawal, A. Liska, T. Terzi, M. Gimenez, C. de Masson d’Autume, S. Ruder, D. Yogatama, K. Cao, T. Kociský, S. Young, and P. Blunsom. 静态语言建模的陷阱. CoRR, 2021. URL https://arxiv.org/abs/2102.01951.

K. Lee, M.-W. Chang, and K. Toutanova. 弱监督开放域问答的潜在检索. In Annual Meeting of the Association for Computational Linguistic, June 2019. URL http://arxiv.org/abs/1906.00300.

K. Lee, D. Ippolito, A. Nystrom, C. Zhang, D. Eck, C. Callison-Burch, and N. Carlini. 去重训练数据可提升语言模型. arXiv preprint arXiv:2107.06499, 2021.

P. Lewis, E. Perez, A. Piktus, F. Petroni, V. Karpukhin, N. Goyal, H. Küttler, M. Lewis, W.-t. Yih, T. Rocktäschel, S. Riedel, and D. Kiela. 面向知识密集型 NLP 任务的检索增强生成. In Advances in Neural Information Processing Systems, 2020. URL https://proceedings.neurips.cc/paper/2020/file/6b493230205f780e1bc26945df7481e5-Paper.pdf.

P. Lewis, P. Stenetorp, and S. Riedel. 开放领域问答数据集中问答测试-训练重叠. In Conference of the European Chapter of the Association for Computational Linguistics, Apr. 2021. URL https://aclanthology.org/2021.eacl-main.86.

O. Lieber, O. Sharir, B. Lenz, and Y. Shoham. Jurassic-1:技术细节与评估. White Paper. AI21 Labs, 2021.

I. Loshchilov and F. Hutter. 分离权重衰减正则化. In International Conference on Learning Representations, 2019. URL https://openreview.net/forum?id=Bkg6RiCqY7.

S. Merity, C. Xiong, J. Bradbury, and R. Socher. 指针哨兵混合模型. In International Conference on Learning Representations, 2017. URL https://openreview.net/forum?id=Byj72udxe.



T. Mikolov, M. Karafiát, L. Burget, J. Cernocký, and S. Khudanpur. 基于循环神经网络的语言模型. Interspeech, 2(3):1045–1048, 2010.

D. Paperno, G. Kruszewski, A. Lazaridou, N. Q. Pham, R. Bernardi, S. Pezzelle, M. Baroni, G. Boleda, and R. Fernández. LAMBADA 数据集:需要广泛语篇上下文的词预测. In Annual Meeting of the Association for Computational Linguistics, Aug. 2016. URL https://aclanthology.org/P16-1144.

A. Radford, J. Wu, R. Child, D. Luan, D. Amodei, and I. Sutskever. 语言模型是无监督多任务学习者. Preprint, 2019.

J. Rae, S. Borgeaud, T. Cai, K. Millican, J. Hoffmann, F. Song, J. Aslanides, S. Henderson, R. Ring, S. Young, E. Rutherford, T. Hennigan, J. Menick, A. Cassirer, R. Powell, G. van den Driessche, L. A. Hendricks, M. Rauh, P.-S. Huang, A. Glaese, J. Welbl, S. Dathathri, S. Huang, J. Uesato, J. Mellor, I. Higgins, A. Creswell, N. McAleese, A. Wu, E. Elsen, S. Jayakumar, E. Buchatskaya, D. Budden, E. Sutherland, K. Simonyan, M. Paganini, L. Sifre, L. Martens, X. L. Li, A. Kuncoro, A. Nematzadeh, E. Gribovskaya, D. Donato, A. Lazaridou, A. Mensch, J.-B. Lespiau, M. Tsimpoukelli, N. Grigorev, D. Fritz, T. Sottiaux, M. Pajarskas, T. Pohlen, Z. Gong, D. Toyama, C. de Masson d’Autume, Y. Li, T. Terzi, V. Mikulik, I. Babuschkin, A. Clark, D. de Las Casas, A. Guy, J. Bradbury, M. Johnson, B. Hechtman, L. Weidinger, I. Gabriel, W. Isaac, E. Lockhart, S. Osindero, L. Rimell, C. Dyer, O. Vinyals, K. Ayoub, J. Stanway, L. Bennett, D. Hassabis, K. Kavukcuoglu, and G. Irving. 扩展语言模型:从训练 Gopher 中的方法、分析与见解。 arXiv submission, 2021.

C. Raffel, N. Shazeer, A. Roberts, K. Lee, S. Narang, M. Matena, Y. Zhou, W. Li, and P. J. Liu. 探索统一文本转文本变换器的迁移学习极限。 Journal of Machine Learning Research, 21(140):1–67, 2020. URL http://jmlr.org/papers/v21/20-074.html.

S. Rajbhandari, J. Rasley, O. Ruwase, and Y. He. Zero: 面向训练万亿参数模型的内存优化。 In IEEE International Conference for High Performance Computing, Networking, Storage and Analysis, 2020.

S. Robertson and H. Zaragoza. 概率相关框架:BM25 与其延伸。 Foundations and Trends in Information Retrieval, 3:333–389, Jan 2009.

D. S. Sachan, S. Reddy, W. Hamilton, C. Dyer, and D. Yogatama. 多文档阅读器与检索器的端到端训练,用于开放领域问答。 arXiv preprint arXiv:2106.05346, 2021.

R. Schwartz, J. Dodge, N. A. Smith, and O. Etzioni. 绿色人工智能。 Communications of the Association for Computing Machinery, 63(12):54–63, Nov. 2020.

M. Shoeybi, M. Patwary, R. Puri, P. LeGresley, J. Casper, 和 B. Catanzaro. Megatron-LM: 使用模型并行训练多十亿参数语言模型. CoRR, 2019. URL http://arxiv.org/abs/1909.08053.

K. Shuster, S. Poff, M. Chen, D. Kiela, 和 J. Weston. 检索增强减少对话中的幻觉. arXiv:2104.07567 [cs], 2021年4月. URL http://arxiv.org/abs/2104.07567.

E. Strubell, A. Ganesh, 和 A. McCallum. 深度学习在NLP中的能源与政策考量. 见 Association for Computational Linguistics, 2019年7月. URL https://aclanthology.org/P19-1355.

A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, 和 I. Polosukhin. Attention is all you need. 见 Advances in Neural Information Processing Systems, 2017. URL https://proceedings.neurips.cc/paper/2017/file/3f5ee243547dee91fbd053c1c4a845aa-Paper.pdf.



X. Wei 和 W. B. Croft. 基于LDA的文档模型用于突发检索. 见 ACM SIGIR International Conference on Research and Development in Information Retrieval, 2006. URL http://portal.acm.org/citation.cfm?doid=1148170.1148204.

L. Weidinger, I. Gabriel, C. Griffin, M. Rauh, J. Uesato, J. Mellor, W. Isaac, P.-S. Huang, L. A. Hendricks, M. Cheng, B. Balle, J. Haas, C. Biles, L. Rimell, W. Hawkins, M. Glaese, A. Kasirzadeh, Z. Kenton, S. Brown, A. Birhane, T. Stepleton, G. Irving, 和 S. Legassick. 来自语言模型的伤害的伦理与社会风险. arXiv submission, 2021.

D. Yogatama, C. de Masson d'Autume, 和 L. Kong. 自适应半参数语言模型. Transactions of the Association for Computational Linguistics, 9:362–373, 2021.

B. Zhang 和 R. Sennrich. 均方根层归一化. 见 Advances in Neural Information Processing Systems, 2019. URL https://proceedings.neurips.cc/paper/2019/file/1e8a19426224ca89e83cef47f1e7f53b-Paper.pdf.

J. Zhang, M. Utiyama, E. Sumita, G. Neubig, 和 S. Nakamura. 通过检索翻译片段指导神经机器翻译. 见 Conference of the North American Chapter of the Association for Computational Linguistics, 2018.



A. 数据集

我们提供了MassiveText的完整描述以及我们从最近的维基百科文章中提取的内容。

A.1. MassiveText的完整描述

MassiveText按来源和语言的完整细分见表 8。关于MassiveText的完整描述与分析,请参阅Rae等人(2021)。

Source | Language | Token count (M) | Documents | Sampling weight
Web | En | 483,002 | 604,938,816 | 0.314
 | Ru | 103,954 | 93,004,882 | 0.033
 | Es | 95,762 | 126,893,286 | 0.033
 | Zh | 95,152 | 121,813,451 | 0.033
 | Fr | 59,450 | 76,612,205 | 0.033
 | De | 57,546 | 77,242,640 | 0.033
 | Pt | 44,561 | 62,524,362 | 0.033
 | It | 35,255 | 42,565,093 | 0.033
 | Sw | 2,246 | 1,971,234 | 0.0044
 | Ur | 631 | 455,429 | 0.0011
Books | En | 3,423,740 | 20,472,632 | 0.25
News | En | 236,918 | 397,852,713 | 0.1
Wikipedia | En | 3,977 | 6,267,214 | 0.0285
 | De | 2,155 | 3,307,818 | 0.003
 | Fr | 1,783 | 2,310,040 | 0.003
 | Ru | 1,411 | 2,767,039 | 0.003
 | Es | 1,270 | 2,885,013 | 0.003
 | It | 1,071 | 2,014,291 | 0.003
 | Zh | 927 | 1,654,772 | 0.003
 | Pt | 614 | 1,423,335 | 0.003
 | Ur | 61 | 344,811 | 0.0001
 | Sw | 15 | 58,090 | 0.0004
Github | - | 374,952 | 142,881,832 | 0.05
Total | - | 5,026,463 | 1,792,260,998 | 1

表 8 | MassiveText 数据集。 最后一列表示训练期间每个数据集的采样权重。对于检索数据库,使用整个数据集,除非书籍,我们使用4%的子样本。

A.2. Wikipedia 2021年9月

我们创建了一个评估数据集,包含 23 篇在我们收集训练数据集之后、于 2021 年 9 月添加或大量编辑的维基百科文章。此外,我们使用 §2.6 中详述的方法识别其块与邻居高度重叠的文章,过滤掉过度依赖模板化内容的文章。图 10 显示我们的测试数据集与从训练数据集中检索到的邻居之间几乎没有重叠。完整的文章列表见表 9。



表 9 | 我们的 Wikipedia 2021年9月 评估数据集中包含的完整文章集。

Megan Rohrer | Aakashavaani
Emma Raducanu | Junior Eurovision Song Contest 2021
Ambra Sabatini | Pavilion Bukit Jalil
WhyDonate | Blake Desjarlais
The Juggernaut (company) | 2021 All-Ireland Senior Football Championship Final
Angela Diaz | Drift-barrier hypothesis
2020 Summer Paralympics | Venomics
2021 Afghan protests | Great Circle (novel)
Rexh Xhakli | Hurricane Ida
Julia Laskin | 2021 Montenegrin episcopal enthronement protests
Cuijk | At War With the Silverfish
Ghoubet Wind Power Station |

我们首先使用 mwparserfromhell⁵ 解析文章。随后删除具有以下标题的章节: “references”, “external links”, “sources”, “further reading”, “see also”, “citations”, 和 “note”。在剩余章节中,我们删除 Wikilinks 并移除以下模板: “reflist”, “notelist”, “notelist-ua”, “notelist-lr”, “notelist-ur”, 和 “notelist-lg”。我们还排除带有 “ref” 或 “table” 标签的对象,并使用 strip_code 函数清理剩余文本。最后,我们将标题与所有章节拼接,并使用 \n\n 进行分隔。
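
下面给出上述清洗流程的一个简化示意(基于 mwparserfromhell 的公开接口;具体的模板与标签过滤细节以正文描述为准,章节标题集合与函数名按正文假设)。

```python
import mwparserfromhell

SKIP_HEADINGS = {"references", "external links", "sources", "further reading",
                 "see also", "citations", "note"}

def clean_article(title, wikitext):
    """解析 wiki 源码,跳过指定标题的章节,剥离标记,用 \n\n 拼接标题与正文。"""
    parsed = mwparserfromhell.parse(wikitext)
    sections = []
    for section in parsed.get_sections(include_lead=True, flat=True):
        headings = section.filter_headings()
        if headings and str(headings[0].title).strip().lower() in SKIP_HEADINGS:
            continue
        text = section.strip_code().strip()
        if text:
            sections.append(text)
    return "\n\n".join([title] + sections)
```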

B. 检索架构的细节

我们介绍了 RETRO 架构以及用于 RETROfitting 现有语言模型的微调过程。

B.1. RETRO 架构与实现

B.1.1. 前馈架构

正如正文所述,整体的编码器-解码器架构是完全前馈式的。我们从序列 X \in \mathcal{V}^n = (C_u)_{1 \le u \le l} 及其预先计算的邻居 (\text{RET}(C_u))_{1 \le u \le l} 出发,返回 \mathbb{R}^{n \times |\mathcal{V}|} 中的 logits。除了正文中引入的 \text{ATTN}、\text{FFW}、\text{CCA} 与 \text{CA} 算子之外,我们还定义了解码器嵌入层 \text{EMB}: \mathcal{V}^n \to \mathbb{R}^{n \times d}、提取分块中间嵌入的算子 \text{SPLIT}(H) \triangleq (H_u)_{1 \le u \le l} \in \mathbb{R}^{l \times m \times d},以及读出层 \text{READ}: \mathbb{R}^{n \times d} \to \mathbb{R}^{n \times |\mathcal{V}|}。我们在 Algorithm 1 中描述前向传播。除常规 Transformer 超参数外,RETRO 架构的超参数还包括编码器与解码器执行交叉注意力的层索引 P_{\text{enc}} 与 P。

B.1.2. 分块跨注意力层中的相对位置编码

\text{CA} 算子使用相对位置 logits,其计算基于数据 token 与检索 token 之间特定的相对距离。事实上,我们期望任意检索邻居 \text{RET}(C_u)^j 与块 C_u 大致对齐,并假设它们从相同位置开始。因此,在计算 \text{CA}(H_u^+, E_u) 时,我们将属于块 C_u^+ 的数据 token i \in [1, l] 与

⁵https://github.com/earwig/mwparserfromhell



属于 \text{RET}(C_u)^j 的检索 token i' \in [1, 2l] 之间的距离设为

d(i, i') \triangleq i - i' + l - 1. \qquad (6)

在计算 encoder cross-attentions C_A(RET(C_u)^j, H_u) 时,我们将检索 token i' \in [1, 2l] 与数据 token i \in [1, l] 之间的距离设定为

d_{\text{enc}}(i', i) \triangleq i' - i. \qquad (7)

位置 logits 作为对 (d(i, i'))_{i,i'} 计算得到的余弦向量的线性变换获得,并像在常规自注意力块中一样添加到 content logits。
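 
下面给出基于式 (6) 的相对位置 logits 的一个简化示意:先按 d(i, i′) 构造余弦特征,再做线性变换得到加到内容 logits 上的位置 logits(特征维度、频率取法与线性权重均为示意性假设)。

```python
import numpy as np

def relative_position_logits(m, r, w, num_freq=32, offset=None):
    """按 d(i, i') = i - i' + offset 构造余弦特征并线性映射为位置 logits。
    m: 关注侧长度;r: 被关注侧长度;w: [num_freq] 线性权重(示意)。"""
    if offset is None:
        offset = m - 1
    i = np.arange(m)[:, None]
    j = np.arange(r)[None, :]
    d = i - j + offset                                    # [m, r] 相对距离
    freqs = 1.0 / (10000 ** (np.arange(num_freq) / num_freq))
    feats = np.cos(d[..., None] * freqs)                  # [m, r, num_freq] 余弦特征
    return feats @ w                                      # [m, r] 位置 logits
```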

B.1.3. 分块跨注意力实现

我们的 CCA 操作符实现(见列表 1)基于跨注意力层的向量化应用。为简化起见,我们省略了多头注意力逻辑,并使用最简单的 Q,K,V 注意力。我们省略了前面描述的相对位置 logits 的计算。

B.1.4. 可选的嵌入矩阵共享

我们默认为编码器和解码器使用分离的嵌入,这使我们能够为编码器使用不同的维度(通常保持为 d_{\text{ENC}} = 896),而为解码器使用 d = 8192(我们将其缩放到此维度)。我们可以共享嵌入,训练时几乎没有差异,正如我们在消融实验中所示。

B.2. 基准至 RETRO 模型微调

如图 5 所示,我们发现可以通过微调,在预训练的基线 Transformer 上加装 RETRO。我们在所有情况下都冻结预训练阶段的全部权重,并重新初始化检索编码器与交叉注意力的权重。交叉注意力均从第 6 层开始、每隔 3 层添加一次。三个较小模型的学习率设为 2 \times 10^{-4},最大模型为其一半。我们也尝试在微调期间让整个模型继续训练,但始终发现冻结预训练模型效果最好:这样可以保持关闭检索时的性能不变,而当所有权重都参与微调时,关闭检索时的性能会下降。
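
下面用 Optax 的 multi_transform 给出"冻结预训练权重、只训练新增检索权重"的一个实现示意。这里假设参数以"名称 → 参数数组"的扁平字典给出,并按名称是否含 "cca" / "retrieval_encoder"(假设的命名)来划分,实际命名取决于具体实现。

```python
import optax

def make_retrofit_optimiser(params, learning_rate=2e-4):
    """冻结预训练权重,仅训练新增的分块交叉注意力与检索编码器参数。"""
    def is_new(name):
        return "cca" in name or "retrieval_encoder" in name

    labels = {name: ("train" if is_new(name) else "freeze") for name in params}
    return optax.multi_transform(
        {"train": optax.adamw(learning_rate, weight_decay=0.1),
         "freeze": optax.set_to_zero()},   # 冻结参数的更新恒为零
        labels)
```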

C. 训练细节与超参数

我们提供了 §4 中各种实验使用的超参数。

C.1. 语言模型预训练

在 Table 10 中,我们展示了我们训练的不同模型的超参数。在所有情况下,我们训练了 419,430,400,000 个训练标记。三个较小的模型以 256 的批大小训练,最大的模型以 1024 的批大小训练。最小学习率设为最大学习率的 0.1 倍,如 Table 10 所示。学习率使用与总训练标记数相匹配的余弦周期长度进行衰减。所有模型使用 AdamW (Loshchilov and Hutter, 2019) 训练,权重衰减参数为 0.1。学习率在训练的前 750 步内线性增加,从 10^{-7} 到最大学习率。所有模型使用 ZeRO 分片优化器状态 (Rajbhandari et al., 2020)。附加基础设施细节可在 Rae et al. (2021) 中找到。



列表 1 | 分块交叉注意力的简化 JAX 实现。

import jax
import jax.numpy as jnp

n = 128  # Sequence length
m = 16   # Chunk length
r = 32   # Retrieval length
k = 4    # Number of neighbours
d = 16   # Embedding size
l = n // m  # Number of chunks

# Parameters
Q = jnp.zeros((d, d))
K = jnp.zeros((d, d))
V = jnp.zeros((d, d))

def relative_positional_encodings(attending_length, attended_length):
    # Classical relative positional encodings
    ...

def cross_attention(chunk, neighbour):
    m, d = chunk.shape
    r, d = neighbour.shape
    queries = chunk @ Q
    keys = neighbour @ K
    logits = queries @ keys.T
    values = neighbour @ V
    return logits, values

def multi_neighbour_cross_attention(chunk, neighbours):
    m, d = chunk.shape
    k, r, d = neighbours.shape
    logits, values = jnp.vectorize(cross_attention,
                                   signature='(m,d),(r,d)->(m,r),(r,d)')(
                                   chunk, neighbours)
    assert logits.shape == (k, m, r)
    assert values.shape == (k, r, d)
    logits += relative_positional_encodings(m, r)[None, :, :]
    logits = jnp.moveaxis(logits, 0, -1).reshape((m, r * k))
    values = jnp.moveaxis(values, 0, 1).reshape((r * k, d))
    return jax.nn.softmax(logits) @ values

def multi_chunk_cross_attention(observation, neighbours):
    attending_chunks = jnp.pad(observation[m-1:],
                               ((0, m - 1), (0, 0)),
                               mode='constant').reshape(l, m, d)
    chunked_output = jnp.vectorize(multi_neighbour_cross_attention,
                                   signature='(m,d),(k,r,d)->(m,d)')(
                                   attending_chunks, neighbours)
    assert chunked_output.shape == (l, m, d)
    output = jnp.pad(chunked_output.reshape(n, d),
                     ((m - 1, 0), (0, 0)),
                     mode='constant')[:n]
    return output

observation = jnp.zeros((n, d))  # Input
neighbours = jnp.zeros((l, k, r, d))

h = multi_chunk_cross_attention(observation, neighbours)
assert h.shape == (n, d)  # Output


表10 | RETRO模型超参数以及解码器大小

Baseline | d_model | d_ffw | # heads | Head size | # layers | P | P_enc | Max LR
247M | 896 | 3,584 | 16 | 64 | 12 | [6, 9, 12] | [1] | 2×10⁻⁴
564M | 1,536 | 6,144 | 12 | 128 | 12 | [6, 9, 12] | [1] | 2×10⁻⁴
1,574M | 2,048 | 8,192 | 16 | 128 | 24 | [9, 12, ..., 24] | [1] | 2×10⁻⁴
7,505M | 4,096 | 16,384 | 32 | 128 | 32 | [9, 12, ..., 32] | [1] | 1×10⁻⁴

表11 | Wikitext103实验的超参数,见表4。我们为基线和RETRO拟合使用相同的学习率调度。对于RETRO拟合,我们重置调度,即调度从步骤0开始,而不是从步骤35,000开始。

Model | Number of layers | 18
 | d | 1024
 | d_FFW | 4096
 | Key size | 64
 | Value size | 64
 | Number of heads | 16
Training data | Dataset | Wikitext103 train
 | Sequence length | 3072
 | Batch size | 128
 | Tokenizer vocabulary size | 128,000
Optimisation | Optimiser | Adam
 | Adam's β1 | 0.9
 | Adam's β2 | 0.95
 | Adam's ε | 1e-8
 | Dropout rate | 0.25
Schedule | Learning rate start | 1e-7
 | Learning rate max | 2.5e-4
 | Learning rate min | 2e-5
 | Warmup steps | 4,000
 | Cosine cycle steps | 100,000
Evaluation | Overlapping proportion | 87.5%

C.2. Wikitext103比较

我们提供关于 §4.1 与表 4 中 Wikitext103 结果的更多细节。我们使用表 11 中列出的超参数,在 Wikitext103 训练集上训练了基线 Transformer。学习率在前 4,000 步中从 1 \times 10^{-7} 线性增加到 2.5 \times 10^{-4},随后按余弦调度在 100,000 步时衰减到 2 \times 10^{-5}。在重叠比例为 75%(滑动窗口评估,仅使用上下文长度不低于序列长度 75% 的 token 的概率)的设定下,第 35,000 步的基线检查点在 Wikitext103 验证集上的困惑度最低,为 21.58。我们将该检查点用于本文报告的所有基线和 kNN-LM 数值;表 4 是例外,它按 87.5% 的重叠比例报告,这使我们的基线困惑度略降为 21.53。

我们还将 35,000 步的基线检查点作为 RETROfit 的初始化,后者除了使用相同的优化器和调度超参数外,只训练新的检索权重,具体如 §4.2 所述。我们的最佳 RETROfit 检查点在从 Wikipedia 检索时的 Wikitext103 验证困惑度为 18.46。我们在表 4 中使用此 RETRO 检查点来报告所有其他检索集的结果。我们的基线和 RETROfit 的评估曲线如图 7(左)所示。在这种特定情况下,



由于 Wikitext103 规模相当小,直接从零开始训练 RETRO 模型导致的结果弱于基线,至少在从 Wikipedia 检索时如此,因为我们未能找到有效的办法来缓解因 RETRO 的额外权重而导致的过拟合增加。

我们也使用与基线和 RETROfit 实验相同的分词器和数据集重新实现了 kNN-LM。kNN-LM 的概率为 p_{kNN\text{-}LM} = \lambda p_{kNN} + (1 - \lambda)p_{LM},其中 p_{kNN}(n_k) \propto \exp(-\alpha d_k)。为调优 \lambda 和 \alpha,我们从 \alpha = 0.0012 开始,它对应于我们用作 kNN-LM 键和查询的嵌入范数标准差的倒数,并据此找到最佳的 \lambda = 0.118;随后针对该 \lambda 值找到最佳的 \alpha = 0.00785。图 7 的中图和右图分别展示了 kNN-LM 困惑度随 \lambda 和 \alpha 的变化。

图 7 | Wikitext103 验证集困惑度。左:基线与 RETROfit(从基线第 35,000 步检查点初始化)的困惑度随训练步数的变化。中、右:kNN-LM 困惑度分别随 \lambda(固定 \alpha = 0.0012)和 \alpha(固定 \lambda = 0.12)的变化。

C.3. RETROfitting 基线模型实验

在表 12 中,我们给出了在 Massive Text 上对模型进行 RETROfitting 时使用的超参数。

表 12 | RETROfitting 实验的超参数

Model | Layers with RETRO-block (P) | Learning rate | Batch size
172M | Every 3rd from 6 | 2 × 10⁻⁴ → 2 × 10⁻⁵ | 256
425M | Every 3rd from 6 | 2 × 10⁻⁴ → 2 × 10⁻⁵ | 256
1.5B | Every 3rd from 6 | 2 × 10⁻⁴ → 2 × 10⁻⁵ | 256
7.5B | Every 3rd from 6 | 1 × 10⁻⁴ → 1 × 10⁻⁵ | 256

C.4. 问答实验

我们对 7.5B RETRO 模型进行 25,000 步微调,批量大小为 128,学习率余弦调度从 10^{-6} 变到 10^{-7},并使用 750 步的线性上升。我们仅在解码器中使用 dropout,因为这比在编码器和解码器中同时使用 dropout 效果更好。每个邻居的格式为 title: {title}, source: {source}。我们在训练和评估时使用来自 DPR 的前 20 个邻居。



表 13 | RETRO 在不同变体中的性能。 在 C4 评估集上测量字节/比特的模型性能,针对使用 1570 亿令牌计划训练的 247M 参数模型。

Ablation group | Ablation | C4 eval bpb
Model | RETRO | 0.822
 | No query conditioning | 0.829
 | No CA positional encodings | 0.826
 | Shared embeddings | 0.823
 | 6-layer encoder | 0.821
Retrieval values | Neighbours N | 0.950
 | Continuations F | 0.895
 | No retrieval | 0.987
Training neighbours | 1 training neighbours | 0.858
 | 4 training neighbours | 0.847
Cross attention position | CA top layer (1/12) | 0.827
 | CA mid layer (6/12) | 0.823
 | CA top layer (12/12) | 0.831
 | CA all layers | 0.860
 | CA every 3 from 1 | 0.823

D. 模型消融

我们通过评估去掉这些设计选择后的结果来验证重要的设计决策。所有消融实验均使用 247M 参数模型,并按压缩的 1570 亿 token 调度进行训练。我们将结果与正文中的默认设置(此处一并回顾)进行对比:报告训练结束时的 C4 评估损失,并比较评估损失随训练时间(相对基线训练时间)下降的情况。结果见图 8 和表 13。

在交叉注意力中使用相对编码。 如 §B.1.2 所述,在交叉注意力中使用相对编码,既能在达到给定性能所需的步骤数上提供纯粹的提升,也能提升计算效率。

在编码器上对前一个块进行条件化。 如 §B.1.1 所述,将编码器对前一个块的中间嵌入进行条件化,既能在步骤数方面提供纯粹的提升,也能提升计算效率。

共享嵌入。 在编码器和解码器之间共享嵌入不会影响性能。这促使我们使用单独的嵌入,因为这允许在我们扩大解码器尺寸时,使编码器比解码器更窄。

关注邻居及其续写。 RETRO 模型在训练时同时关注给定块的前一块的邻居及其在时间上的续写。我们分别测量仅在邻居上、以及仅在其续写上训练和评估 RETRO 模型对性能的影响。总体而言,仅关注邻居只能带来 RETRO 中检索所带来改进的 22%,仅关注其续写则带来该改进的 56%。



图 8 | 不同变体的计算效率。 我们报告了绘制 C4 评估字节/比特与时间的训练曲线,相对于训练基线 RETRO 模型所需时间的比例。总体而言,我们的设计选择在计算效率方面是最优的。

同时关注两个邻居及其续写,无论在最终性能还是训练效率方面都是最优选择。

训练更深的编码器。 文中所有模型均使用相对较小的 RETRO 编码器。我们尝试了深度为原来的 3 倍的编码器。发现这导致损失仅略微下降 0.15%,但训练时间增加了 20%。总体而言,在训练效率方面,使用浅层编码器是最佳选择。

使用多个邻居进行训练。 我们测量了在单个检索到的邻居上训练以及在 4 个邻居上训练(RETRO 在训练时使用 2 个邻居)的效果。单个邻居训练导致性能大幅下降,而 4 个邻居训练在训练结束时并未带来显著的性能提升,却增加了大量计算开销。总体而言,我们发现使用 2 个邻居在训练效率方面是最佳选择。此外,评估时也可以使用额外的邻居。

交叉注意力的频率。 我们测量了解码器中交叉注意力频率对性能的影响。总体而言,仅在最顶层或最底层进行一次交叉注意力是糟糕的选择,而在中间层进行一次相对合理。我们决定每 3 层进行一次交叉注意力,因为这在性能与运行时间之间提供了良好的折衷。



E. 定性实验

我们通过查看评估样本的困惑度并自回归生成样本来展示 RETRO 模型的使用。

E.1. 检视评估数据中的邻居和困惑度

为了直观地了解 RETRO 模型利用了哪类信息,我们建议仔细查看若干评估文档及其对应的检索数据,见表 16、17、18 与 19。在这些表格中,4 行对应文档的前 4 个块。最左列显示正在评估的文档中的块 C_u,其中每个 token 按负交叉熵损失差值 L_{RETRO[OFF]} - L_{RETRO} 着色,正值(黄色)表示 RETRO 在获得邻居数据时表现更好。第二列同样显示评估块 C_u,但每个 token i 按与前置邻居的最长公共前缀(LCP)长度着色,即使前缀 (x_{i-j-1}, \dots, x_i) 同时出现在 \text{RET}(C_{u-1}) 中的最大整数 j。相对地,第三列和第四列分别显示前两个邻居及其续写 [N_u^1, F_u^1] 与 [N_u^2, F_u^2],它们按与后续块 C_{u+1} 的 LCP 着色。LCP 着色有助于直观识别评估文档与检索数据的重叠部分。注意,第二列中的第一个块 C_1 未着色,因为它没有可用于计算 LCP 的前置邻居;同样,我们不展示第四个块的邻居,因为它们并不用于条件化前四个块中的任何一个。

我们的定性分析展示了两种主要行为。

首先,我们观察到有时 C_u 中的特定事实可以从前置邻居 RET(C_{u-1}) 中提取,这可以导致 RETRO 模型对相应 token 的损失显著降低。此类行为的例子包括表 16 中的期刊名称 Publishers Weekly,表 17 中的足球队名称 Tyrone,或表 18 中的事件日期 25 August 到 6 September 2020。在这三个例子中,评估数据由 2021 年 9 月撰写的最新维基百科文章组成,在我们构建检索数据集之后(参见 §A.2 节)。然而,用于预测这条新数据的相关信息已存在于预先的检索数据中,RETRO 模型似乎能够正确利用这些信息。

另一方面,我们还观察到,尽管使用了去重,但仍有部分评估数据在我们的训练和检索数据中部分泄露。 RETRO 能够极大地利用这种泄漏。 Table 19 说明了这种行为,其中块 C_2C_3RET(C_1)RET(C_2) 大部分重叠,尽管存在细微的格式差异,这导致所有相应的 token 的 RETRO 损失显著降低。 Fig. 6 显示,通过过滤与检索集合重叠的评估块,可以量化这两种行为分别对 RETRO 损失降低的贡献。

E.2. 检查样本

我们可以在使用 RETRO 模型生成的样本上执行与上述相同的程序,以更好地了解检索数据对采样的影响。我们在 Table 6、7、20 和 21 中展示了使用 7.5B RETRO 模型获取的样本示例。

E.3. 邻居量化

为了量化源文档与检索块之间的距离概念,我们可以在仅从 Wikipedia 检索时,测量源文章之间的距离。Consonni 等人(2019)



Figure 9 | Wikipedia 链接距离(检索文章)。 对于每个序列和块组合,我们仅使用 Wikipedia 计算目标与前五个邻居之间的链接距离。 排名显示相对邻居距离,其中 rank-1 是第一个邻居,rank-5 是第五个邻居。 不同颜色代表链接距离。 因为我们不从同一文档检索,1 是最小值。 我们发现,平均而言,随机文章之间存在路径时的距离超过 5.0

提供了一个维基百科链接数据集,其中每篇文章都附有其链接到的文章列表。我们利用这些数据构建有向图,并计算从一个页面到另一个页面的链接距离。在图 9 中,我们统计了训练序列与其检索邻居之间的链接距离。我们发现,检索到的文档往往来自与包含目标块的文章相当接近的文章;此外,平均距离随邻居排名增加而增大,表明我们的邻居既有用,排序也合理。这为我们更大规模的实验提供了信心,尽管在那些实验中文档距离的定义不那么明确。
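下面给出一个计算两篇维基百科文章之间链接距离的示意性 Python 片段,做法是在链接有向图上做广度优先搜索。函数名 link_distance 及参数 links、max_dist 为本文档引入的假设性命名,与论文实际使用的数据处理流程无关。

```python
from collections import deque

def link_distance(links, source, target, max_dist=5):
    """示意:在 {页面: 出链页面列表} 形式的有向图 links 上,
    用 BFS 计算 source 到 target 的最短链接距离;超过 max_dist 则返回 None。"""
    if source == target:
        return 0
    seen = {source}
    queue = deque([(source, 0)])
    while queue:
        page, dist = queue.popleft()
        if dist >= max_dist:
            continue
        for nxt in links.get(page, ()):
            if nxt == target:
                return dist + 1
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, dist + 1))
    return None  # 在 max_dist 步内不可达
```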

F. 补充定量结果

我们报告与正文各图对应的定量结果,以及在 Pile 上经过泄漏过滤的进一步语言建模结果。

F.1. 主文本数据集

我们报告 RETRO 与基线模型在评估集上的比特/字节性能,见表 14。

F.2. The Pile

在图 4中,我们将 RETRO 与 Jurassic-1(Lieber 等,2021)进行比较。完整的比特/字节结果见表 15。

F.3. 过滤结果

主要评估集上泄漏块的分布。 我们通过测量具有一定重叠 r(C) 的评估块所占比例,来评估评估集与训练集之间的泄漏程度。



表 14 | 主要语言建模数据集的完整结果。前三组行对应图 1,最后几行对应图 3。

| | Baseline (172M / 425M / 1.5B / 7.5B) | RETRO[OFF] (172M / 425M / 1.5B / 7.5B) | RETRO[ON] (172M / 425M / 1.5B / 7.5B) |
| C4 Eval bpb | 0.98 / 0.92 / 0.84 / 0.78 | 0.98 / 0.92 / 0.84 / 0.78 | 0.82 / 0.77 / 0.71 / 0.66 |
| C4 Eval bpb (900B) | - | - | 0.88 / 0.83 / 0.76 / 0.71 |
| C4 Eval bpb (360B) | - | - | 0.92 / 0.87 / 0.80 / 0.74 |
| C4 Eval bpb (180B) | - | - | 0.94 / 0.89 / 0.81 / 0.75 |
| C4 Eval bpb (90B) | - | - | 0.95 / 0.89 / 0.82 / 0.76 |
| C4 Eval bpb (36B) | - | - | 0.96 / 0.90 / 0.83 / 0.77 |
| C4 Eval bpb (18B) | - | - | 0.96 / 0.91 / 0.83 / 0.77 |
| C4 Eval bpb (9B) | - | - | 0.96 / 0.91 / 0.83 / 0.77 |
| C4 Eval bpb (4B) | - | - | 0.97 / 0.91 / 0.84 / 0.78 |
| C4 Eval bpb (2B) | - | - | 0.97 / 0.91 / 0.84 / 0.78 |
| C4 Eval bpb (k = 1) | - | - | 0.84 / 0.79 / 0.73 / 0.67 |
| C4 Eval bpb (k = 2) | - | - | 0.83 / 0.78 / 0.72 / 0.67 |
| C4 Eval bpb (k = 3) | - | - | 0.82 / 0.78 / 0.71 / 0.66 |
| C4 Eval bpb (k = 4) | - | - | 0.82 / 0.77 / 0.71 / 0.66 |
| C4 Eval bpb (k = 5) | - | - | 0.82 / 0.77 / 0.71 / 0.66 |
| C4 Eval bpb (k = 10) | - | - | 0.82 / 0.77 / 0.71 / 0.66 |
| C4 Eval bpb (k = 20) | - | - | 0.82 / 0.77 / 0.71 / 0.66 |
| C4 Eval bpb (k = 30) | - | - | 0.82 / 0.77 / 0.71 / 0.65 |
| C4 Eval bpb (k = 40) | - | - | 0.83 / 0.77 / 0.71 / 0.65 |
| C4 Eval bpb (k = 50) | - | - | 0.83 / 0.78 / 0.71 / 0.66 |
| C4 Eval bpb (k = 60) | - | - | 0.84 / 0.78 / 0.72 / 0.66 |
| C4 Eval bpb (k = 70) | - | - | 0.84 / 0.79 / 0.72 / 0.66 |
| C4 Eval bpb (k = 80) | - | - | 0.85 / 0.79 / 0.73 / 0.66 |
| C4 Eval bpb (k = 90) | - | - | 0.85 / 0.79 / 0.73 / 0.66 |
| C4 Eval bpb (k = 100) | - | - | 0.85 / 0.79 / - / 0.67 |
| Lambada Accuracy | 0.42 / 0.51 / 0.61 / 0.69 | 0.47 / 0.54 / 0.63 / 0.70 | 0.52 / 0.60 / 0.67 / 0.73 |
| Curation Corpus bpb | 0.69 / 0.63 / 0.56 / 0.52 | 0.68 / 0.64 / 0.57 / 0.51 | 0.66 / 0.61 / 0.55 / 0.50 |
| Wikitext103 Perplexity | 25.62 / 19.29 / 13.98 / 10.65 | 25.88 / 19.78 / 13.89 / 10.40 | 3.32 / 2.96 / 2.53 / 2.22 |
| Wikipedia Sept. 2021 bpb | 0.85 / 0.78 / 0.71 / 0.65 | 0.86 / 0.79 / 0.71 / 0.65 | 0.79 / 0.73 / 0.66 / 0.61 |

我们在图 10 中展示了相应的直方图。可以看到,C4 的评估集与训练集之间存在轻微重叠。类似地,Wikitext103 的块也出现在训练集中,尽管我们已经从训练集中移除了 Wikitext103 的评估文档本身。另一方面,我们的 Wikipedia September 2021 数据集几乎没有泄漏(这些是创建训练数据时尚不存在的新文档),Curation Corpus 同样没有泄漏。

Pile 上的过滤结果。 我们分别在图 12 和图 11 中报告 Pile 上的块重叠分布与过滤后的性能曲线。对过滤曲线的定性解读与之前相同:RETRO 模型确实更多地利用了泄漏,但即便在训练集中未出现过的原始块上,它们带来的性能提升仍然显著。
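下面给出一个计算块重叠度 r(C) 并据此过滤评估块的示意性 Python 片段。这里假设 r(C) 取评估块与其训练集近邻之间最长公共 token 子串的长度除以块长度(精确定义见正文 §2.6,可能与此略有出入);chunk_overlap、filtered_loss 等名称均为假设性命名。

```python
def _contains(seq, sub):
    """判断 token 列表 sub 是否作为连续子串出现在 seq 中。"""
    n, k = len(seq), len(sub)
    return any(seq[i:i + k] == sub for i in range(n - k + 1))

def chunk_overlap(eval_chunk, train_neighbours):
    """示意:r(C) = 与任一训练近邻的最长公共 token 子串长度 / 块长度。"""
    best, m = 0, len(eval_chunk)
    for neighbour in train_neighbours:
        for a in range(m):
            for b in range(a + best + 1, m + 1):  # 只考察能超过当前最优长度的片段
                if _contains(neighbour, eval_chunk[a:b]):
                    best = b - a
                else:
                    break
    return best / max(1, m)

def filtered_loss(chunks, losses, neighbours_per_chunk, alpha):
    """示意:仅在重叠度 r(C) <= alpha 的评估块上取平均损失,得到过滤后的指标。"""
    kept = [loss for chunk, loss, nbs in zip(chunks, losses, neighbours_per_chunk)
            if chunk_overlap(chunk, nbs) <= alpha]
    return sum(kept) / len(kept) if kept else float("nan")
```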



表 15 | The Pile 上的完整结果,以比特/字节(bpb)为度量。Jurassic-1 与 GPT-3 的数值取自 Lieber 等(2021),Gopher 的数值取自 Rae 等(2021)。

| Subset | 7B Baseline (Ours) | GPT-3 | Jurassic-1 | Gopher | 7.5B RETRO |
| arxiv | 0.742 | 0.838 | 0.680 | 0.641 | 0.714 |
| books3 | 0.792 | 0.802 | 0.835 | 0.706 | 0.653 |
| dm_mathematics | 1.177 | 1.371 | 1.037 | 1.135 | 1.164 |
| freelaw | 0.576 | 0.612 | 0.514 | 0.506 | 0.499 |
| github | 0.420 | 0.645 | 0.358 | 0.367 | 0.199 |
| gutenberg pg 19 | 0.803 | 1.163 | 0.890 | 0.652 | 0.400 |
| hackernews | 0.971 | 0.975 | 0.869 | 0.888 | 0.860 |
| nih_exporter | 0.650 | 0.612 | 0.590 | 0.590 | 0.635 |
| opensubtitles | 0.974 | 0.932 | 0.879 | 0.894 | 0.930 |
| philpapers | 0.760 | 0.723 | 0.742 | 0.682 | 0.699 |
| pile cc | 0.771 | 0.698 | 0.669 | 0.688 | 0.626 |
| pubmed abstracts | 0.639 | 0.625 | 0.587 | 0.578 | 0.542 |
| pubmed central | 0.588 | 0.690 | 0.579 | 0.512 | 0.419 |
| stackexchange | 0.714 | 0.773 | 0.655 | 0.638 | 0.624 |
| ubuntu_irc | 1.200 | 0.946 | 0.857 | 1.081 | 1.178 |
| uspto backgrounds | 0.603 | 0.566 | 0.537 | 0.545 | 0.583 |

图 10 | C4、Curation Corpus、Wikitext103 与 Wikipedia Sept. 2021 上评估块与训练块之间的重叠分布。



图 11 | 基线 Transformer 与 RETRO 在 Pile 上的过滤评估损失。



图 12 | Pile 评估集上评估块与训练块之间的重叠分布。



表 16 | Great Circle(小说),来自 Wikipedia September 2021。该文章介绍一部近期出版的小说,块 C₃ 和 C₄ 专门讨论其评价。为该小说撰写书评的期刊名称 Publishers Weekly 同时出现在块 C₃ 的邻居 [N_3^1, F_3^1]、[N_3^2, F_3^2] 以及后续块 C₄ 中,RETRO 在 C₄ 中显著降低了这些 token 的损失。

列说明:第 1 列为 C_u,按损失差 L_{RETRO[OFF]} - L_{RETRO} 着色(色标:≤ -0.5、= 0、≥ 0.5);第 2 列为 C_u,按与 RET(C_{u-1}) 的 LCP 着色;第 3、4 列为 [N_u^1, F_u^1] 与 [N_u^2, F_u^2],按与 C_{u+1} 的 LCP 着色(色标:LCP = 0, 1, 2, 3, 4, ≥ 5)。
Great Circle (novel)Great Circle is a 2021 novel by Maggie Shipstead, published on May 4, 2021, by Alfred A. Knopf. The novel has been shortl- isted for the 2021 Booker Prize. Sy-nopsis The novel consists of two pa- rallel narratives about two fictiona- l women. One isGreat Circle (novel) Great Circle is a 2021 novel by Maggie Shipstead, published on May 4, 2021, by Alfred A. Knopf. The novel has been shortl- isted for the 2021 Booker Prize. Sy-nopsis The novel consists of two pa- rallel narratives about two fictiona- l women. One isThe Dutch House (novel)The Dutch House is a 2019 novel by Ann Patchett . It was published by Harper on Sept- ober 24, 2019. It tells the story o f a brother and sister over the cour- se of five decades. The novel was a finalist for the 2020 Pulitzer Priz e for Fiction. Plot The Dutch House is a mansion located in Elkins Park , Pennsylvania, a suburb of Philadel- phia. It was built in 1922 by the Va- nHoebeek family, a husband and wife originally from the Netherlands who made their fortune in the tobacco in dustry. Cyril Conroy, a self-made re- al estate mogulThe Dutch House (novel)The Dutch House is a 2019 novel by Ann Patchett . It was published by Harper on Sept- ober 24, 2019. It tells the story o f a brother and sister over the cour- se of five decades. [2] The novel wa s a finalist for the 2020 Pulitzer Priz e for Fiction. [3] Plot[edit]Th e Dutch House is a mansion located in Elkins Park, Pennsylvania, a subur b of Philadelphia. It was built in 1 922 by the VanHoebeek family, a husb and and wife originally from the Net- herlands who made their fortune in t he tobacco industry. Cyril Conroy, a self-
about the disappeared 20th-century aviator Marian Graves, while the oth- er is about the struggling 21st-cent- ury Hollywood actress Hadley Baxter, who is attempting to make a film ab- out Marian. Hadley's narrative is to ld in the first-person, while Marian 's sections are told in the third-pe rsonabout the disappeared 20th-century aviator Marian Graves, while the oth- er is about the struggling 21st-cent- ury Hollywood actress Hadley Baxter, who is attempting to make a film ab- out Marian. Hadley's narrative is to ld in the first-person, while Marian 's sections are told in the third-pe rsonon becoming a filmmaker. She has fo und a subject for her film project, an obscure African American actress credited only as “the watermelon wom an” in old Hollywood films, and the subsequent film recounts her search for this woman even as it covers, in the manner of the earlier Dunyement aries, Dunye's friendships and her l ove life. In The Watermelon Woman, D unye makes the film she set out to m ake in 1990 about African American w omen artists, a film that both inven ts an artistic predecessor with whom she can identify and also “finds” C heryl herself as the artist that she seeks. As Dunye identifies herselfbased closely on her own youthful ex-periences. (She plans the film to be the first of two parts, the second dealing with the aftermath of the f irst's events.) Byrne plays a young film student named Julie (Hogg's ava tar), who starts her artistic educa- tion with high hopes of making a movi e about a boy named Tony, living in working-class Sunderland, who adores his mother — “is almost obsessed wi th her,” as eager Julie tells her ad visers. Her idealism is evident from the start. The advisers are skepti cal, and no wonder; Julie's family i s posh, with a comfortable country e state and
Reception Great Circle received very favorable reviews, with a cumu- lative "Rave" rating at the review ag gregator website Book Marks, based o n 22 book reviews from mainstream li terary critics. The novel debuted at number fourteen on The New York Tim es Hardcover fiction best-seller lis t for the week ending MayReception Great Circle received very favorable reviews, with a cumu- lative "Rave" rating at the review ag gregator website Book Marks, based o n 22 book reviews from mainstream li terary critics. The novel debuted at number fourteen on The New York Tim es Hardcover fiction best-seller lis t for the week ending Mayfirst edition hardcover Reception The novel debuted at number one on T he New York Times fiction best-selle r list. As of the week ending Febru ary 20, 2021, the novel has spent 38 weeks on the list. At the review ag gregator website Book Marks, which a assigns individual ratings to book re views from mainstream literary criti cs, the novel received a cumulative "Rave" rating based on 38 reviews, w ith only one "mixed" review. Publ ish ers Weekly wrote, "Bennett renders h er characters and their struggles wi th great compassion, and explores th e complicated state of mind that Ste lla finds herself in while passing a s white." In itsThe book also debuted at number tw o on The New York Times Hardcover No nfiction best-sellers list on July 2 8, 2019. [5] It spent eleven weeks on the list. [6] Reception[edit]At t he review aggregator website Book Ma rks, which assigns individual rating s to book reviews from mainstream li terary critics, the book received a cumulative "Positive" rating based o n 29 reviews: 12 "Rave" reviews, 6 " Positive" reviews, 9 "Mixed" reviews , and 2 "Pan" reviews. [7] Publ isher s Weekly gave the book a mixed revie w, writing, "Unfortunately, all thre e
8, 2021. Critics praised the novel for sustaining its length and for Sh ipstead's research and intricate nov el structure for perfectly interweav- ing the parallel narratives, despite the time and circumstances separati ng them. In its starred review, Pub lishers Weekly wrote, "Shipstead man ages to portray both Marian's and Ha dley's8, 2021. Critics praised the novel for sustaining its length and for Sh ipstead's research and intricate nov el structure for perfectly interweav- ing the parallel narratives, despite the time and circumstances separati ng them. In its starred review, Pub lishers Weekly wrote, "Shipstead man ages to portray both Marian's and Ha dley's


表 17 | 2021 年全爱尔兰高级足球锦标赛决赛,来自 Wikipedia September 2021。球队名称 Tyrone 既出现在块 C₁ 的第二个邻居 [N_1^2, F_1^2] 中,也出现在后续块 C₂ 中,RETRO 在 C₂ 中显著降低了这些 token 的损失。

列说明:第 1 列为 C_u,按损失差 L_{RETRO[OFF]} - L_{RETRO} 着色(色标:≤ -0.5、= 0、≥ 0.5);第 2 列为 C_u,按与 RET(C_{u-1}) 的 LCP 着色;第 3、4 列为 [N_u^1, F_u^1] 与 [N_u^2, F_u^2],按与 C_{u+1} 的 LCP 着色(色标:LCP = 0, 1, 2, 3, 4, ≥ 5)。
2021 All-Ireland Senior Football Championship Final The 2021 All-Ireland Senior Football Championship Final was the 134th final of the All-Ireland Senior Football Championship and the culmination of the 2021 All-Ireland Senior Football Championship. The match was played at Croke Park in Dublin on 11 September 2021. It was originally scheduled2021 All-Ireland Senior Football Championship Final The 2021 All-Ireland Senior Football Championship Final was the 134th final of the All-Ireland Senior Football Championship and the culmination of the 2021 All-Ireland Senior Football Championship. The match was played at Croke Park in Dublin on 11 September 2021. It was originally scheduled2018 All-Ireland Senior Football Championship Final The 2018 All-Ireland Senior Football Championship Final was the 131st final of the All-Ireland Senior Football Championship and the culmination of the 2018 All-Ireland Senior Football Championship in Gaelic football. The match was played at Croke Park in Dublin on 2 September 2018.[3] It was the second time the teams had met in the final; Dublin won the first encounter in 1995. The final was shown live in Ireland and on RTÉ Two as part of The Sunday Game live programme, presented by Michael Lyster from Croke Park, with studio analysis from Joe Brolly,2018 All-Ireland Senior Football Championship Final The 2018 All-Ireland Senior Football Championship Final was the 131st final of the All-Ireland Senior Football Championship and the culmination of the 2018 All-Ireland Senior Football Championship in Gaelic football. The match was played at Croke Park in Dublin on 2 September 2018. It was the second time the teams had met in the final; Dublin won the first encounter in 1995. It was the third consecutive year that a team qualified under the system of second chances introduced in 2001; Tyrone qualified despite defeat in its provincial championship. Dublin won the final by a margin of six points
for 28 August but had to be postponed by two weeks when the semi-final was postponed due to a COVID-19 outbreak. Ulster champions Tyrone took on Connacht champions Mayo, in what was their first ever meeting in a final, winning their 4th title after a 2-14 to 0-15 win. Mayo lostfor 28 August but had to be postponed by two weeks when the semi-final was postponed due to a COVID-19 outbreak. Ulster champions Tyrone took on Connacht champions Mayo, in what was their first ever meeting in a final, winning their 4th title after a 2-14 to 0-15 win. Mayo lostgame 23-23 after extra time, however Ulster progressed under the competition rules as they scored three tries in the match against Leinster's two. The semi-finals took place in mid November and saw both the away teams win, as Ulster beat Glasgow and Edinburgh beat Connacht. The final was held on Saturday December 20 at Muarrayfield Stadium and saw Ulster beat Edinburgh 21-27 to win the Celtic Cup. 2004-05 season The format of the competition was changed for the second edition of the competition. The competition was moved to April and May to run after the conclusion of the Celtic League competition, with only eightwith a last-ditch plan of action – play the Munster/Ulster Semi-Final on March 16th, with the winners to play Connacht in the following day's Final. On March 16th then Munster had an easy win over Ulster (9-07 to 0-00) but thankfully for the Munster players, the pitch cut up so badly during the game, it was decided to postpone the following day's hurling Final (until Easter Sunday) with the football Final going ahead on its own on St. Patrick's Day. Less than a week later, on March 23rd, seven
their 11th consecutive final since 1989, losing 6 finals in 9 years, with this latest defeat on an identical scoreline to 2020, when Mayo lost to Dublin. Background were aiming to win their fourth title and first All-Ireland since 1951. Since then, they had lost ten finals (1989, 1996, 1997, 2004, 2006,their 11th consecutive final since 1989, losing 6 finals in 9 years, with this latest defeat on an identical scoreline to 2020, when Mayo lost to Dublin. Background were aiming to win their fourth title and first All-Ireland since 1951. Since then, they had lost ten finals (1989, 1996, 1997, 2004, 2006,1-16 to 0-15 winners to qualify for their 10th league final in the past 13 years. They have won seven of their previous league finals underody since 2002, losing the other two to Waterford (2007) and Dublin (2011). Despite the defeat there were some distinct positives from a Galway perspective - most notably the solid displays of Daithí Burke at centre-back, Joseph Cooney at wing-back and Ronan Burke at full-back. Colm Calanan continued his excellent form in goal and also hit a stunning free from distance. Indeed it was not the Galway defence that was the problemwhich Dublin won by 0-12 to 0-9. Dublin are going for an unprecedented fourth successive Championship win over Kerry. Prior to their current run, which started with the 2011 All-Ireland final, they had only managed two consecutive victories over them on two separate occasions - 1909 and '24, 1976 and '77. The longest winning sequence in the rivalry was set by Kerry between 1941 and 1975, when they won each of the six Championship meetings. Kerry went nine games unbeaten between 1978 and 2009, with four victories either side of a dramatic draw at the quarter-final stage in Thurles in 2001. Sunday will mark their 11th
2012, 2013, 2016, 2017, 2020). appeared in their seventh final, winning on three occasions in 2003, 2005 and 2008. This final was the fifth to be contested by county teams from Connacht and Ulster, the other finals were 1925 (Galway beat Cavan), 1943 (Roscommon beat Cavan), 1948 (Cavan beat2012, 2013, 2016, 2017, 2020). appeared in their seventh final, winning on three occasions in 2003, 2005 and 2008. This final was the fifth to be contested by county teams from Connacht and Ulster, the other finals were 1925 (Galway beat Cavan), 1943 (Roscommon beat Cavan), 1948 (Cavan beat


表 18 | 2020 年夏季残奥会,来自 Wikipedia September 2021。该赛事的原定日期 2020 年 8 月 25 日至 9 月 6 日出现在块 C₁ 的邻居 [N_1^1, F_1^1]、[N_1^2, F_1^2] 以及后续块 C₂ 中,RETRO 在 C₂ 中显著降低了这些 token 的损失。值得注意的是,在这个例子中,这些邻居撰写于该赛事被推迟之前。

列说明:第 1 列为 C_u,按损失差 L_{RETRO[OFF]} - L_{RETRO} 着色(色标:≤ -0.5、= 0、≥ 0.5);第 2 列为 C_u,按与 RET(C_{u-1}) 的 LCP 着色;第 3、4 列为 [N_u^1, F_u^1] 与 [N_u^2, F_u^2],按与 C_{u+1} 的 LCP 着色(色标:LCP = 0, 1, 2, 3, 4, ≥ 5)。
2020 Summer Paralympics The, branded as the Tokyo 2020 Paralympic Games, was an international multi-sport parasports event held from 24 August to 5 September 2021 in Tokyo, Japan. They were the 16th Summer Paralympic Games as organized by the International Paralympic Committee (IPC).2020 Summer Paralympics The, branded as the Tokyo 2020 Paralympic Games, was an international multi-sport parasports event held from 24 August to 5 September 2021 in Tokyo, Japan. They were the 16th Summer Paralympic Games as organized by the International Paralympic Committee (IPC).pic Games.* The 2020 Summer Paralympics are an upcoming major international multi-sport event for athletes with disabilities governed by the International Paralympic Committee. Scheduled as the 16th Summer Paralympic Games, it is planned to be held in Tokyo, Japan from 25 August to 6 September 2020.3. 2019 BWF Para-Badminton World Championships- The 2019 BWF Para-Badminton World Championships was held from 20 to 25 August 2019 in Basel, Switzerland.- Men's event: Gold Medal: Pramod Bhagat in Singles SL3 Event and Pramod Bhagat and Manoj2020 Summer Paralympics The are an upcoming major international multi-sport event for athletes with disabilities governed by the International Paralympic Committee. Scheduled as the 16th Summer Paralympic Games, they are scheduled to be held in Tokyo, Japan between 24 August and 5 September 2021. Originally due to take place between 25 August and 6 September 2020. On 24 March 2020, the IOC and the Tokyo Organizing Committee officially announced that the 2020 Summer Olympics and 2020 Summer Paralympics would be postponed to 2021, due to the COVID-19 pandemic, marking the first time that the Paralympics has been postponed. They will still be publicly marketed as
Originally scheduled to take place from 25 August to 6 September 2020, in March 2020 both the 2020 Summer Olympics and Paralympics were postponed by one year due to the COVID-19 pandemic, with the rescheduled Games still referred to as Tokyo 2020 for marketing and branding purposes. As with the Olympics, the Games were largely held behindOriginally scheduled to take place from 25 August to 6 September 2020, in March 2020 both the 2020 Summer Olympics and Paralympics were postponed by one year due to the COVID-19 pandemic, with the rescheduled Games still referred to as Tokyo 2020 for marketing and branding purposes. As with the Olympics, the Games were largely held behindonce submitted. This process was undertaken following the postponement of the Tokyo 2020 Games due to the COVID-19 pandemic, with both the Olympics and Paralympics pushed back a year. Now, the Tokyo 2020 Olympics are scheduled for July 23 to August 8 while the Paralympics are due to follow from August 24 to September 5. The refund process is separate for ticketholders outside of Japan, who purchased tickets through authorised ticket resellers (ATR). Each ATR has its own individual refund procedure. Early figures from the refund process for the Tokyo 2020 Olympics stated that around 18 per centOlympiad, have now been postponed and rescheduled for 23 July to 8 August 2021 in Tokyo, Japan. The Games were postponed in March 2020 as a result of the worldwide Covid-19 pandemic, although they will still keep the name Tokyo 2020 for marketing and branding purposes. This will be the first time the Olympic Games have been postponed rather than cancelled.
closed doors with no outside spectators due to a state of emergency in the Greater Tokyo Area and other prefectures. The Games were the second Summer Paralympics hosted by Tokyo since 1964, and the third Paralympics held in Japan overall since the 1998 Winter Paralympics in Nagano. The Games featuredclosed doors with no outside spectators due to a state of emergency in the Greater Tokyo Area and other prefectures. The Games were the second Summer Paralympics hosted by Tokyo since 1964, and the third Paralympics held in Japan overall since the 1998 Winter Paralympics in Nagano. The Games featuredhas been rescheduled to May 1-4 because of travel restrictions under the current state of emergency in Tokyo and other 10 prefectures across Japan. The Tokyo 2020 organizing committee announced that the first of 18 test events for the Olympic and Paralympic Games will involve wheelchair rugby, which will be held in Yoyogai National Stadium from April 3 to 4. The FINA Diving World Cup will follow from April 18 to 23 at the Tokyo Aquatics Centre, which will also serve as an Olympic qualifying event. The spread of the COVID-19 pandemic has slowed down in Tokyo three weeks after the Japanese capital entered a state of emergency onOlympic Games, when Tokyo became the first city in Asia to host the Olympic and Paralympic Games, but unfortunately strong winds made it an impossible task this time around. Members of the Tokyo Organising Committee of the Olympic and Paralympic Games (Tokyo 2020), Tokyo Metropolitan Government officials, Tokyo 2020 Torch Relay Official Ambassadors and representatives from Miyagi Prefecture joined the arrival ceremony. FLAME OF RECOVERY The Olympic flame will now be put on display at various locations in the Tohoku region, to highlight the message of hope in the area as worst affected by the 2011 Great East Japan Earthquake
539 medal events in 22 sports, with badminton and taekwondo both making their Paralympic debut to replace football 7-a-side and sailing. China topped the medal table for the fifth consecutive Paralympics, with 96 golds and 207 total medals. Great Britain finished second for the ninth time,539 medal events in 22 sports, with badminton and taekwondo both making their Paralympic debut to replace football 7-a-side and sailing. China topped the medal table for the fifth consecutive Paralympics, with 96 golds and 207 total medals. Great Britain finished second for the ninth time,


表 19 | Daniel Radcliffe,来自 Wikitext103 验证集,检索数据来自 C4。块 C₂ 和 C₃ 与邻居 [N₁, F₁] 和 [N₂, F₂] 的内容大部分相同,仅有格式差异,这大幅降低了相应 token 的损失。此例说明,即使经过去重,当训练数据泄漏到评估集中时,我们的 RETRO 模型也能直接利用这种泄漏。

列说明:第 1 列为 C_u,按损失差 L_{RETRO[OFF]} - L_{RETRO} 着色(色标:≤ -0.5、= 0、≥ 0.5);第 2 列为 C_u,按与 RET(C_{u-1}) 的 LCP 着色;第 3、4 列为 [N_u^1, F_u^1] 与 [N_u^2, F_u^2],按与 C_{u+1} 的 LCP 着色(色标:LCP = 0, 1, 2, 3, 4, ≥ 5)。
= Daniel Radcliffe = Daniel Jacob R
adcliffe (born 23 July 1989) is an
English actor who rose to prominenc
e as the title character in the Harr
y Potter film series. He made his ac
ting debut at 10 years of age in BBC
One's 1999 television film David Co
pperfield, followed by his cinematic
debut
= Daniel Radcliffe = Daniel Jacob R
adcliffe (born 23 July 1989) is an
English actor who rose to prominenc
e as the title character in the Harr
y Potter film series. He made his ac
ting debut at 10 years of age in BBC
One's 1999 television film David Co
pperfield, followed by his cinematic
debut
Daniel Jacob Radcliffe (born 23 July
1989) is an English actor who rose
to prominence as the title character
in the Harry Potter film series. He
made his acting debut at 10 years o
f age in BBC One's 1999 television f
ilm David Copperfield, followed by h
is cinematic debut in 2001's The Tai
lor of Panama. At age 11, he was cas
t as Harry Potter in the first Harry
Potter film, and starred in the ser
ies for 10 years until the release o
f the eighth and final film in 2011.
Radcliffe began to branch out to s
tage acting in 2007, starring in the
London and New York productions of
Equus, and
Daniel Jacob Radcliffe (born 23 July
1989) is an English actor who rose
to prominence as the title character
in the Harry Potter film series. He
made his acting debut at 10 years o
f age in BBC One's 1999 television m
ovie David Copperfield, followed by
his film debut in 2001's The Tailor
of Panama. At age 11, he was cast as
Harry Potter in the first Harry Pot
ter film, and starred in the series
for 10 years until the release of th
e eighth and final film in 2011. Rad
cliffe began to branch out to stage
acting in 2007, starring in the Lond
on and New York productions of Equus
, and in the
in 2001's The Tailor of Panama. At
age 11, he was cast as Harry Potter
in the first Harry Potter film, and
starred in the series for 10 years u
ntil the release of the eighth and f
inal film in 2011. Radcliffe began
to branch out to stage acting in 200
7, starring in the London and New
in 2001's The Tailor of Panama. At
age 11, he was cast as Harry Potter
in the first Harry Potter film, and
starred in the series for 10 years u
ntil the release of the eighth and f
inal film in 2011. Radcliffe began
to branch out to stage acting in 200
7, starring in the London and New
in 2001's The Tailor of Panama. At
age 11, he was cast as Harry Potter
in the first Harry Potter film, and
starred in the series for 10 years u
ntil the release of the eighth and f
inal film in 2011. Radcliffe began
to branch out to stage acting in 200
7, starring in the London and New Yo
rk productions of Equus, and in the
2011 Broadway revival of the musical
How to Succeed in Business Without
Really Trying. He starred in the 201
2 horror film The Woman in Black, an
d played beat poet Allen Ginsberg in
the 2013 independent film Kill Your
Darlings. He has contributed to ma
ny charities
of Panama. At age 11, he was cast a
s Harry Potter in the first Harry Po
tter film, and starred in the series
for 10 years until the release of t
he eighth and final film in 2011.R
adcliffe began to branch out to stag
e acting in 2007, starring in the Lo
ndon and New York productions of Equ
us, and in the 2011 Broadway revival
of the musical How to Succeed in Bu
siness Without Really Trying. He sta
rred in the 2012 horror film The Wom
an in Black, and played beat poet Al
len Ginsberg in the 2013 independent
film Kill Your Darlings. He has con
tributed to many charities, includin
g Demelza House Children's
York productions of Equus, and in t
he 2011 Broadway revival of the musi
cal How to Succeed in Business Witho
ut Really Trying. He starred in the
2012 horror film The Woman in Black,
and played beat poet Allen Ginsberg
in the 2013 independent film Kill Y
our <unk>. He has contributed to ma
ny charities,
York productions of Equus, and in t
he 2011 Broadway revival of the musi
cal How to Succeed in Business Witho
ut Really Trying. He starred in the
2012 horror film The Woman in Black,
and played beat poet Allen Ginsberg
in the 2013 independent film Kill Y
our <unk>. He has contributed to ma
ny charities,
York productions of Equus, and in t
he 2011 Broadway revival of the musi
cal How to Succeed in Business Witho
ut Really Trying. He starred in the
2012 horror film The Woman in Black,
and played beat poet Allen Ginsberg
in the 2013 independent film Kill Y
our Darlings. He has contributed to
many charities, including Demelza H
ouse Children's Hospice and The Trev
or Project. He also made public serv
ice announcements for the latter. In
2011, he was awarded the Trevor Pro
ject's "Hero Award." Sources
ee about Radcliffe's personal wealth
; he was reported to have earned £1
million for the first Harry Potter
in the 2011 Broadway revival of the
musical How to Succeed in Business
Without Really Trying. He starred in
the 2012 horror film The Woman in B
lack, and played beat poet Allen Gin
sberg in the 2013 independent film K
ill Your Darlings. He has contribute
d to many charities, including Demel
za House Children's Hospice and The
Trevor Project. He also made public
service announcements for the latter.
In 2011, he was awarded the Trevor
Project's "Hero Award."
including <unk> Hospice Care for Ch
ildren, and The Trevor Project for s
uicide prevention among LGBTQ youth,
which gave him its Hero Award in 20
11. = Early life = Radcliffe w
as born in West London, England. He
is the only child of Alan George Rad
cliffe, a literary agent, and
including <unk> Hospice Care for Ch
ildren, and The Trevor Project for s
uicide prevention among LGBTQ youth,
which gave him its Hero Award in 20
11. = Early life = Radcliffe w
as born in West London, England. He
is the only child of Alan George Rad
cliffe, a literary agent, and


表 20 | 示例:《人权宣言》第一条。RETRO[OFF] 的样本语法正确、内容看似合理,但属于幻觉;RETRO[ON] 的样本则正确地复制自邻居数据,并能稳健地适配我们提示的格式。

列说明:第 1 列为 RETRO[OFF] 的提示与样本;第 2 列为 RETRO[ON] 的提示与样本,按与 RET(C_{u-1}) 的 LCP 着色;第 3、4 列为 [N_u^1, F_u^1] 与 [N_u^2, F_u^2],按与 C_{u+1} 的 LCP 着色(色标:LCP = 0, 1, 2, 3, 4, ≥ 5)。
Article premier - Les hommes naissent et demeurent libres et égaux en droits. Les distinctions sociales ne peuvent être fondées que sur l'utilité com mune. Article 2. - Le but de toute association politique e st la conservation des droits naturels etArticle premier - Les hommes naissent et demeurent libres et égaux en droits. Les distinctions sociales ne peuvent être fondées que sur l'utilité com mune. Article 2. - Le but de toute association politique e st la conservation des droits naturels etde l'homme et du citoyen. Article 1er Les hommes naissent et demeurent libres et égaux en droits. Les distinctions so ciales ne peuvent être fondées que sur l'utilité commune. Article 2Le but de toute asso ciation politique est la conse rvation des droits naturels et imprescriptibles de l'homme. Ces droits sont la liberté, la propriété, la sûreté, et la r ésistance à l'oppression. Article 3Le principe de toute s ouveraineté réside essentielle ment dans la nation. Nul corps , nul individu ne peut exercer d'autorité qui n'enLes hommes naissent et demeur ent libres et égaux en droits. Les distinctions sociales ne peuvent être fondées que sur l 'utilité commune. Article 2. Le but de toute association po litique est la conservation de s droits naturels et imprescri ptibles de l'Homme. Ces droits sont la liberté, la propriété , la sûreté, et la résistance à l'oppression. Article 3. -Le principe de toute Souverainet é réside essentiellement dans la Nation. Nul corps, nul indi vidu ne peut exercer d'autorit é qui n'en émane expressément. Art
imprescriptibles de l'homme, et par conséquent la garantie à chacun des droits suivants : Article 3. - La propriété est un droit inviolable et sa cré. Toute personne a le droit de procéder à sa propre cons ervation. Article 4. - Leimprescriptibles de l'homme. Ces droits sont la liberté, la propriété, la sûreté et la ré sistance à l'oppression. Article 3.- Le principe de toute souveraineté réside essentiel lement dans la nation. Nul cor ps, nul individu ne peut exerc er d'autorité qui ncriptibles de l'homme. Ces dro its sont la liberté, la propri été, la sûreté et la résistanc e à l'oppression. Article 3 - Le principe de toute souverai neté résideessentiellement da ns la Nation. Nul corps, nul i ndividu nepeut exercer d'auto rité qui n'en émane expresséme nt. Article 4 - La liberté co nsiste à pouvoir faire tout ce quine nuit pas à autrui : ai nsi, l'exercice des droits nat urelsde chaque homme n'a de b ornes que celles qui assurent auxautres membres de la socié té la jouissance deet imprescriptibles de l'homm e. Ces droits sont la liberté, la propriété, la sûreté et la résistance à l'oppression. A rticle 3 - Le principe de tout e souveraineté réside essentie llement dans la Nation. Nul co rps, nul individu ne peut exer cer d'autorité qui n'en émane expressément. Article 4 - La liberté consiste à pouvoir fai re tout ce qui ne nuit pas à a utrui : ainsi, l'exercice des droits naturels de chaque homm e n'a de bornes que celles qui assurent aux autres membres d e la société la jouissance de ces mêmes droits. Ces bornes
but de toute association est la défense des droits de l'hom me et du citoyen. Tout citoye n a le droit de participer à l a direction des affaires publi ques. Article 5. - L'impuni té n'a jamais été et ne sera j amais une fin en elle-même. L' imp'en émane expressément. Article 4.- La liberté consiste à pouvoir faire tout ce qui ne nuit pas à autrui : ainsi, l'e xercice des droits naturels de chaque homme n'a de bornes qu e celles qui assurent aux autr es membres de la société la jo uissance de ces mêmesmane expressément. Article 4 - La liberté consiste à pouvoi r faire tout ce qui ne nuit pa s à autrui : ainsi, l'exercice des droits naturels de chaque homme n'a de bornes que celle s qui assurent aux autres memb res de la société la jouissanc e de ces mêmes droits. Ces bor nes ne peuvent être déterminée s que par la loi. Article 5 - La loi n'a le droit de défend re que les actions nuisibles à la société. Tout ce qui n'est pas défendu par la loi ne peu t être empêché, et nul ne peut être contraint à faire ce qu' elle nmane expressément. Article 4 - La liberté consiste à pouvoi r faire tout ce qui ne nuit pa s à autrui : ainsi, l'exercice des droits naturels de chaque homme n'a de bornes que celle s qui assurent aux autres memb res de la société la jouissanc e de ces mêmes droits. Ces bor nes ne peuvent être déterminée s que par la loi. Article 5 - La loi n'a le droit de défend re que les actions nuisibles à la société. Tout ce qui n'est pas défendu par la loi ne peu t être empêché, et nul ne peut être contraint à faire ce qu' elle n
unité, comme le despotisme, s 'est toujours révélée être un instrument d'oppression. La ty rannie qui s'est élue juge su prême de la conscience des aut res ne peut être jugée. La jus tice se trouve dans la consci ence de chaque citoyen,droits. Ces bornes ne peuvent être déterminées que par la l oi. Article 5.- La loi n'a le droit de défendre que les a ctions nuisibles à la société. Tout ce qui n'est pas défendu par la loi ne peut être empêc hé, et nul ne peut être


表 21 | 示例:π 的小数位。RETRO[OFF] 的样本在提示结束后仅两位数字便偏离了 π 的真实数值,而 RETRO[ON] 则直接从邻居数据中复制,正确地输出了大量 π 的小数位。

列说明:第 1 列为 RETRO[OFF] 的提示与样本;第 2 列为 RETRO[ON] 的提示与样本,按与 RET(C_{u-1}) 的 LCP 着色;第 3、4 列为 [N_u^1, F_u^1] 与 [N_u^2, F_u^2],按与 C_{u+1} 的 LCP 着色(色标:LCP = 0, 1, 2, 3, 4, ≥ 5)。
Pi = 3.1415926535 8979323846
2643383279 5028841971 69399375
10 5820974944 5923078164 06286
20899 8628034825 3421170679
Pi = 3.1415926535 8979323846
2643383279 5028841971 69399375
10 5820974944 5923078164 06286
20899 8628034825 3421170679
"1415926535 8979323846 26433
83279 5028841971 693993751058
20974944 5923078164 0628620899
8628034825 34211706798214808
651 3282306647 0938446095 5058
223172 53594081284811174502 8
410270193 8521105559 644622948
9 54930381964428810975 665933
4461 2847564823 3786783
46 2643383279 5028841971 69399
37510 5820974944 592307816406
28620899 8628034825 3421170679
8214808651 3282306647 0938446
095 50582231725359408128 4811
174502 8410270193 8521105559 6
446229489 5493038196 442881097
56659334461 2847564823 378678
3165 2712019091 4564856692 346
0
8294049602 8988496069 9858349
065 9873246379 9644789435 8628
730709 6540159079 5944069810 5
992965913 7095378412 69378359
8214808651 3282306647 0938446
095 5058223172 53594081284811
174502 8410270193 8521105559 6
446229489 5493038196442881097
5 6659334461 284
651 3282306647 0938446095 5058
223172 5359408128 4811174502
8410270193 8521105559 64462294
89 54930381964428810975 66593
34461 2847564823 3786783165 27
12019091 4564856692 346034861
0 4543266482 1339360726 024914
12737245870066 0631558817 488
1520920 9628292540 91715364
47 0938446095 5058223172 53594
081284811174502 8410270193 85
21105559 6446229489 5493038196
4428810975 6659334461 2847564
823 3786783165 27120190914564
856692 3460348610 4543266482 1
339360726 0249141273724587006
6 0631558817 4881520920 962829
2540 91715364367892590360
10 6940372045 7088679512 85612
30857 9046461290 9276642155 56
54603269 5656128798 6366475705
6294954741 5886335339 57657
7564823 3786783165 2712019091
4564856692 3460348610 45432664
82 1339360726 024914127372458
70066 0631558817 4881520920 96
28292540 91715
23 3786783165 2712019091 4564
856692 3460348610 4543266482 1
339360726 0249141273724587006
6 0631558817 4881520920 962829
2540 9171536436 7892590360 01
13305305 4882046652 1384146951
94151160943305727036 5759591
953 0921861173 8193261179 3105
118548 0744623799 627495
165 27120190914564856692 3460
348610 4543266482 1339360726 0
2491412737245870066 063155881
7 4881520920 9628292540 917153
64367892590360 0113305305 488
2046652 1384146951 9415116094
3305727036 5759591953 09218611
73 8193261179 31051185480744
23799 6274956735 1885752724 89
1227
76345 5770886953 7988876910 79
66169745 6493974637 6345801550
6663542854 6333764630 6356284
271 7885339804 5672434
364367892590360 0113305305 48
82046652 1384146951 9415116094
3305727036 5759591953 0921861
173 8193261179 31051185480744
623799 6274