Tree Indexes

上节提到，DBMS 使用一些特定的数据结构来存储信息：

Internal Meta-data
Core Data Storage
Temporary Data Structures
Table Indexes

本节将介绍存储 table index 最常用的树形数据结构：B+ Tree，Skip Lists，Radix Tree

Table Index

table index 为提供 DBMS 数据查询的快速索引，它本身存储着某表某列排序后的数据，并包含指向相应 tuple 的指针。DBMS 需要保证表信息与索引信息在逻辑上保持同步。用户可以在 DBMS 中为任意表建立多个索引，DBMS 负责选择最优的索引提升查询效率。但索引自身需要占用存储空间，因此在索引数量与索引存储、维护成本之间存在权衡。

B+ Tree

B-Tree Family

B-Tree 中的 B 指的是 Balanced，实际上 B-Tree Family 中的 Tree 都是 Balanced Tree。B-Tree 就是其中之一，以之为基础又衍生出了 $B^+ \ Tree$ 、 $B^{link} \ Tree$ 、 $B^* \ Tree$ 等一系列数据结构。B-Tree 与 B+ Tree 最主要的区别就是 B+ Tree 的所有 key 都存储在 leaf nodes，而 B-Tree 的 key 会存在所有节点中。

B+ Tree

B+ Tree 是一种自平衡树，它将数据有序地存储，且在 search、sequential access、insertions 以及 deletions 操作的复杂度上都满足 $O(logn)$ ，其中 sequential access 的最终复杂度还与所需数据总量有关。

B+ Tree 可以看作是 BST (Binary Search Tree) 的衍生结构，它的每个节点可以有多个 children，这特别契合 disk-oriented database 的数据存储方式，每个 page 存储一个节点，使得树的结构扁平化，减少获取索引给查询带来的 I/O 成本。其基本结构如下图所示：

以 M-way B+tree 为例，它的特点总结如下：

每个节点最多存储 M 个 key，有 M+1 个 children
B+ Tree 是 perfectly balanced，即每个 leaf node 的深度都一样
除了 root 节点，所有其它节点中至少处于半满状态，即 $M/2-1 \leq \#keys \leq M-1$
假设每个 inner node 中包含 k 个 keys，那么它必然有 k+1 个 children
B+ Tree 的 leaf nodes 通过双向链表串联，从而为 sequential access 提供更搞笑的支持

B+ Tree Nodes

B+ Tree 中的每个 node 都包含一列按 key 排好序的 key/value pairs，key 就是 table index 对应的 column，value 的取值与 node 类型相关，在 inner nodes 和 leaf nodes 中存的内容不同。

leaf node 的 values 取值在不同数据库中、不同索引优先级中也不同，但总而言之，通常有两种做法：

Record/Tuple Ids：存储指向最终 tuple 的指针
Tuple Data：直接将 tuple data 存在 leaf node 中，但这种方式对于 Secondary Indexes 不适用，因为 DBMS 只能将 tuple 数据存储到一个 index 中，否则数据的存储就会出现冗余，同时带来额外的维护成本。

此外，leaf node 还需要存储相邻 siblings 的地址以及其它一下元信息，如下图所示：

B+ Tree Operations

Insert

找到对应的 leaf node，L
将 key/value pair 按顺序插入到 L 中
如果 L 还有足够的空间，操作结束；如果空间不足，则需要将 L 分裂成两个节点，同时在 parent node 上新增 entry，若 parent node 也空间不足，则递归地分裂，直到 root node 为止。

Delete

从 root 开始，找到目标 entry 所处的 leaf node, L
删除该 entry
如果 L 仍然至少处于半满状态，则操作结束；否则先尝试从 siblings 那里拆借 entries，如果失败，则将 L 与相应的 sibling 合并
如果合并发生了，则可能需要递归地删除 parent node 中的 entry

B+Tree 的 Insert、Delete 过程，可参考这里。

In Practice

Typical Fill-Factor: 67%
- Average Fanout = 2*100*0.67 = 234
Typical Capacities:
- Height 4: 312,900,721 entries
- Height 3: 2,406,104 entries
Pages per level:
- Level 1 = 1 page = 8KB
- Level 2 = 134 pages = 1MB
- Level 3 = 17956 pages = 140 MB

Clustered Indexes

Clustered Indexes 规定了 table 本身的物理存储方式，通常即按 primary key 排序存储，因此一个 table 只能建立一个 cluster index。有些 DBMSs 对每个 table 都添加聚簇索引，如果该 table 没有 primary key，则 DBMS 会为其自动生成一个；而有些 DBMSs 则不支持 clustered indexes。

Compound Index

DBMS 支持同时对多个字段建立 table index（B+ Tree），即 compound index，如

CREATE INDEX compound_idx ON table (a, b, c);

它可以被用在包含 a 的 condition 的查询中，如：

SELECT c
  FROM table
 WHERE a = 5 AND b >= 42 AND c < 77;
 
SELECT c
  FROM table
 WHERE a = 5 AND b >= 42;

SELECT c
  FROM table
 WHERE a = 5;

尽管它可以被用在不包含 a 相关 condition 的查询中，但这约等于全表扫描，因此 DBMS 通常会使用全表扫描。如果使用 hash index 作为 table index，则必须对 (a, b, c) 完全匹配才可以使用。

B+ Tree Design Choices

Node Size

通常来说，disk 的数据读取速度越慢，node size 就越大：

Disk Type

Node Size

HDD

~1MB

SSD

~10KB

In-Memory

~512B

具体情境下的最优大小由 workload 决定。

Merge Threshold

由于 merge 操作引起的修改较大，有些 DBMS 选择延迟 merge 操作的发生时间，甚至可以利用其它进程来负责周期性地重建 table index。

Variable Length Keys

B+ Tree 中存储的 key 经常是变长的，通常有三种手段来应对：

Pointers：存储指向 key 的指针
Variable Length Nodes：需要精细的内存管理操作
Key Map：内嵌一个指针数组，指向 node 中的 key/val list

Non-unique Indexes

索引针对的 key 可能是非唯一的，通常有两种手段来应对：

Duplicate Keys：存储多次相同的 key
Value Lists：每个 key 只出现一次，但同时维护另一个链表，存储 key 对应的多个 values，类似 chained hashing

分别如下面两张图所示：

Intra-node Search

在节点内部搜索，就是在排好序的序列中检索元素，手段通常有：

Linear Scan
Binary Search
Interpolation：通过 keys 的分布统计信息来估计大概位置进行检索

Optimizations

Prefix Compression

同一个 leaf node 中的 keys 通常有相同的 prefix，如下图所示：

为了节省空间，可以只存所有 keys 的不同的 suffix。

Suffix Truncation

由于 inner nodes 只用于引导搜索，因此没有必要在 inner nodes 中储存完整的 key，我们可以只存储足够的 prefix 即可，如下图所示：

Bulk Insert

建 B+ Tree 的最快方式是先将 keys 排好序后，再从下往上建树，如下图所示：

因此如果有大量插入操作，可以利用这种方式提高效率

Pointer Swizzling

Nodes 使用 page id 来存储其它 nodes 的引用，DBMS 每次需要首先从 page table 中获取对应的内存地址，然后才能获取相应的 nodes 本身，如果 page 已经在 buffer pool 中，我们可以直接存储其它 page 在 buffer pool 中的内存地址作为引用，从而提高访问效率。

Additional Index Usage

Implicit Indexes

许多 DBMSs 会自动创建 index，来帮助施行 integrity constraints，情形包括：

Primary Keys
Unique Constraints
Foreign Keys

如当我们创建下面的 foo table 时：

CREATE TABLE foo (
  id SERIAL PRIMARY KEY,
  val1 INT NOT NULL,
  val2 VARCHAR(32) UNIQUE
);

CREATE TABLE bar (
  id INT REFERENCES foo (val1),
  val VARCHAR(32)
);

DBMS 可能会自动建立以下索引：

CREATE UNIQUE INDEX foo_pkey ON foo (id);        /* primary keys */
CREATE INDEX foo_val1_key ON foo (val1);         /* foreign keys */
CREATE UNIQUE INDEX foo_val2_key ON foo (val2);  /* Unique Constraints */

Partial Indexes

只针对 table 的子集建立 index，这可以大大减少 index 本身的大小，如下所示：

CREATE INDEX idx_foo
          ON foo (a, b)
       WHERE c = 'WuTang';

一种常见的使用场景就是通过事件来为 indexes 分区，即为不同的年、月、日分别建立索引。

Covering Indexes

如果 query 所需的所有 column 都存在于 index 中，则 DBMS 甚至不用去获取 tuple 本身即可得到查询结果，如下所示：

CREATE INDEX idx_foo ON foo (a, b);

SELECT b FROM foo
 WHERE a = 123;

甚至可以在 index 中有意加入别的 column （INDEX INCLUDE COLUMNS）：

CREATE INDEX idx_foo
          ON foo (a, b)
     INCLUDE (c);

Functional/Expression Indexes

index 中的 key 不一定是 column 中的原始值，也可以是通过计算得到的值，如针对以下查询：

SELECT * FROM users
 WHERE EXTRACT(dow FROM login) = 2;

直接针对 login 建立索引是没用的，这时候可以针对计算后的值建立索引：

CREATE INDEX idx_user_login 
          ON users (EXTRACT(dow FROM login));

Skip Lists

从上文的介绍中，可以发现 table index 实际上就是一个 dynamic order-preserving 的数据结构，同时对增删改查的操作应提供较高的性能（B+ Tree， $O(logn)$ ）。

dynamic order-preserving 的数据结构中，最简单的就是 sorted linked list，所有操作的复杂度均在 $O(n)$ ，性能较 B+ Tree 相比逊色许多，但如果将多个 sorted linked list 垒起来，就可能提供与 B+ Tree 相媲美的性能：

这就是 Skip Lists。在 MIT 6.046 课程中有专门的一节课介绍 Skip Lists，可以点击查看。Skip Lists 是一种基于概率的数据结构，它提供的复杂度都是期望的结果。它的 Pros & Cons 总结如下：

Pros
- 比 B+ Tree 的一般实现使用更少的内存
- 插入和删除操作不需要重新平衡树结构
Cons
- 并不像 B+ Tree 一样对 disk/cache 的结构友好
- 反向搜索实现复杂

Radix Tree

Radix Tree vs. Trie

radix tree 实际上就是压缩后的 trie

简介

radix tree 将每个 key 拆成一个序列：

树的高度取决于 keys 的长度
不需要重新平衡树（rebalancing）
从 root 到 leaf 的路径代表相应的 leaf

Binary Comparable Keys

为了让 keys 能够合理地拆解成序列，许多类型的 key 都需要特殊处理：

Unsigned Integers：对小端存储的机器要把 bits 翻转一遍
Signed Integers：需要翻转 2's-complement 从而使得负数小于正数
Floats：需要分成多个组，然后存储成 unsigned integer
Compound：分别转化各个 attributes 然后组合起来

举例如下：

Inverted Indexes

尽管 tree index 非常有利于 point 和 range 查询，如：

Find all customers in the 15217 zip code
Find all orders between June 2018 and September 2018

但对于 keyword search，tree index 就显得无能为力，如：

Find all Wikipedia articles that contain the word "Pavlo"

尽管 DBMSs 在一定程度上支持这种搜索，但更复杂、灵活的查询就不是它们的专长。有一部分 DBMSs 专门提供复杂、灵活的 keyword search 功能，如 ElasticSearch、Solr、Sphinx 等。

参考

slides1, slides2

PreviousHash Tables NextIndex Concurrency Control

Last updated 7 years ago

hashtagTable Index

hashtagB+ Tree

hashtagB-Tree Family

hashtagB+ Tree

hashtagB+ Tree Nodes

hashtagB+ Tree Operations

hashtagIn Practice

hashtagClustered Indexes

hashtagCompound Index

hashtagB+ Tree Design Choices

hashtagNode Size

hashtagMerge Threshold

hashtagVariable Length Keys

hashtagNon-unique Indexes

hashtagIntra-node Search

hashtagOptimizations

hashtagPrefix Compression

hashtagSuffix Truncation

hashtagBulk Insert

hashtagPointer Swizzling

hashtagAdditional Index Usage

hashtagImplicit Indexes

hashtagPartial Indexes

hashtagCovering Indexes

hashtagFunctional/Expression Indexes

hashtagSkip Lists

hashtagRadix Tree

hashtagRadix Tree vs. Trie

hashtag简介

hashtagBinary Comparable Keys

hashtagInverted Indexes

hashtag参考

Table Index

B+ Tree

B-Tree Family

B+ Tree

B+ Tree Nodes

B+ Tree Operations

In Practice

Clustered Indexes

Compound Index

B+ Tree Design Choices

Node Size

Merge Threshold

Variable Length Keys

Non-unique Indexes

Intra-node Search

Optimizations

Prefix Compression

Suffix Truncation

Bulk Insert

Pointer Swizzling

Additional Index Usage

Implicit Indexes

Partial Indexes

Covering Indexes

Functional/Expression Indexes

Skip Lists

Radix Tree

Radix Tree vs. Trie

简介

Binary Comparable Keys

Inverted Indexes

参考