Paper Reading: The Volcano Optimizer Generator: Extensibility and Efficient Search

2022-02-27 3630 words 8 minutes

Contents

这篇paper是 The Volcano Optimizer Generator: Extensibility and Efficient Search, in ICDE, 1993。

对应15-721的Optimizer Implementation (Overview)，主要涉及数据库的查询优化器的 search engine，这里的 volcano 也是之前 paper reading 中提到的基于迭代器的执行引擎，但整个 volcano project 还包含了 optimizer 部分，也是今天介绍的主题。

开始前需要简要说明，大部分内容都是来源 15-721 Optimizater1 部分与 An Overview of Query Optimization in Relational Systems，而这篇 paper 总体较抽象，只会提到一些 terms 和 search engine 。

Query Optimization 前置知识点

Background

关系查询语言，是访问关系型数据库的声明式接口，其中 SQL 已经成为关系查询语言的标准。

对一个 SQL Database 来说，查询系统的核心组件是，query optimizer 和 query execution engine。

一个 query execution engine 实现了一系列的 physical operators。operator 的作用是，处理 input data stream，并生成 output data stream。

而 query execution engine 抽象表现来看，它就是一个 physical operator tree。如下图

/images/the-volcano-optimizer-generator-extensibility-and-efficient-search/operator-tree.jpg — operator-tree

query optimizer 的作用，将 SQL 产生成高效的执行计划。但这并不简单，因为对一个 SQL query 来说，可能存在大量的等价的 operator trees。比如，不同的join顺序、一个join表达式对应不同的operator算法。

如下图展示了 join 对 operator tree 的复杂度：

/images/the-volcano-optimizer-generator-extensibility-and-efficient-search/table1.jpg — operator-tree

因此，query optimization 的问题也被看成是一个复杂的搜索问题。为了实现复杂的最优搜索，optimizer 需要提供几个方面：

plan search spaces：构造 operator tree 的搜索空间，即依赖于等价的表达式转换集合与一系列支持的 physical operators
cost estimation：估算每个 plan 需要执行的 cost
enumeration algorithm：高效的执行算法

查询优化是什么

find the lowest cost plan ：找到数据库认为的 cost 最低的执行计划，cost 可能是 time 、memory、monetary in cloud 等
hardest ：需要从所有可能性的 plan 中选择 best one，因此需要 simplify/reduce search space，estimate plan 真实执行的 cost
responsibility ：在有限的资源限制内，计算资源或者优化时间，找到尽可能足够好的执行计划

Architecture Overview

Parser ：SQL Query String 解析成 AST，包含一系列 string token，比如 column name、table name
Binder ：查询系统的 catalog，并将上一步骤中的 string token 转为数据库的 internal identifier，比如 column name 替换为 Datachunk 的 InputRef。Binder 会生成最终的 logical plan，它是 high level description 描述 query 要做的事
Tree Rewriter (optional) ：rewrite logical plan 到 logical plan，比如 heuristic rules：constant folding / predicate pushdown
Optimizer (cost based) : 基于数据库收集的统计信息，选择 lowest cost plan，生成 physical plan 并最终给 executor 执行

Logical Vs. Physical Plans

逻辑计划，描述 what’s want to do，比如 access table
物理计划，描述 how actually want to do，比如 access table using index X
Not always a 1:1 mapping from logical to physical，比如 logical join 对应 physical hash-join / merge-join

Relational Algebra Equivalence

关系代数的等价性，如下表达式 A B C 相互 join，优化器会选择最佳的 join order 来完成

$$ (A \bowtie (B \bowtie C)) = (B \bowtie (A \bowtie C)) $$

Heuristic-Based Optimization

启发式/探索式的优化器：产生于1979年的Query Processing In A Relational Database Management System，全是静态规则，没有成本模型，也是由于当时数据量很小的原因，这也是最早出现的 Optimization Search Strategy

static rules to transform logical operators to a physical plan

Perform most restrictive selection early
Perform all selection before joins
Predicate / Limit / Projection pushdowns
Join ordering based on cardinality

优劣势：

Advantages：容易实现与调试，对简单查询很有效速度很快
Disadvantage：是否应用 rule 依赖于 magic constants 去决定，并且对于复杂的查询，包含内部依赖关系时，无法产生 good plan

Heuristic + Cost Based Join Search

使用 static rules 去做最开始的优化，再使用动态规划去决定 lowest cost plan，这也是第一个基于成本模型的优化器，广为人知的示例是 System R，对应的paper：Access path selection in a relational database management system

System R 优化策略：

将 query 分为不同的 block，分别生成 logical operators
对每个 logical operator 生成一系列的 physical operators
采用 Bottom-up search 策略与动态规划寻求局部最优解思路

Volcano Optimizer Introduction

General purpose cost-bsaed query optimizer, based on equivalence rules on algebras.

Easily add new operations and equivalence rules.
Treats physical properties of data as first-class entities during planning.
Top-down approach (backward chaining) using branch-and-bound search.

Design Principles

logical 和 physical 设计原因：

关系代数一直是用于关系型系统查询，但同时，它也可用于可扩展的和面向对象的查询处理系统，即通过定义 algebra operator、代数等价法则、合适的实现算法来实现。它们的共同点是：代数算子 consume 一个或多个类型 (e.g., set, array, list, time series)，并产生另一个适合下一个 operator 的输入类型。
执行引擎也是基于 algebra operators ，即消耗和产生批量的数据类型。然而，operators 和算法是不同的。
因此，基于上面两点（查询表达中的 algebra operator 与物理执行引擎中的 algebra operator），volcano optimizer 提出了两阶段优化。使用 logical algebra 来代表各种关系代数算子，使用 physical algebra 来代表关系代数算子的具体实现算法，logical algebra 和 physical algebra 之间的转换通过 cost-based mapping。

rules 设计原因：

rules 是一个通用的概念，通过简洁 + 模块化的方式，来描述查询优化中的等价转换。
rules 是相互独立的，只有在 optimize query 的时候，才会被组合起来。
modularization 对于构建复杂的优化器来说，构建和维护都有很大优势。

Optimizer Implementor

Optimization 过程如前面所说，将 logical algebra 转换为 physical algebra，这个过程中会 reorder operators + 执行算法选择。

logical algebra 之间转换通过 transformation rule，logical 到 physical algebra 之间转换通过 implementation rule。多个 logical operators 可能会转为一个 physical operator。

实现一个 volcano optimizer implementor 需要提供的模块：

一系列 logical operators
transformation rules + conditions : 用于在 logical algebra 之间做等价转换，在 condition 满足情况下。
一系列算法 + enforcers : 物理实现的具体算法 + 可添加的 enforcer（存在于 physical algebra，但没有对应的 logical algebra，它不进行任何数据操作，而是在 outputs 中 强制物理属性给后续的处理算法，比如，sort enforcer 强制要求输出数据保持有序）。
implementation rules + conditions : 转换具体的物理实现算法的规则与需要满足的condition。
cost functions ：用于基本运算和比较的成本函数。
logical properties : 来源于 logical algebra expression，包含结果集schema、expected size 等。
physical properties : 用于具体的实现算法的属性，比如数据是否排序、数据分区情况等。
applicability function : 适用性函数，用于物理算法 + enforcers。决定物理算法的输出，是否满足父算子的 physical properties 要求。

Search Engine

有了上面的知识铺垫，search engine 是 volcano optimizer 的核心部分。主要流程如下伪代码：

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
// LogExpr:当前的逻辑表达式, PhysProp:对LogExpr应该具有的物理输出属性的要求, Limit:cost限制
FindBestPlan (LogExpr, PhysProp, Limit)
// 1. 在现有lookup table中查找最优计划
if the pair LogExpr and PhysProp is in the look-up table 
    if the cost in the look-up table < Limit 
        // 满足cost则返回上层
        return Plan and Cost 
    else 
        // 不满足cost表示没有找到
        return failure 

// 2. 执行优化
/* else: optimization required */ 
create the set of possible "moves" from 
    applicable transformations 
    algorithms that give the required PhysProp 
    enforcers for required PhysProp 
order the set of moves by promise 
for the most promising moves 
    if the move uses a transformation
        // 在transformation rules中，依次尝试apply，并生成新的LogExpr，再继续递归调用FindBestPlan
        // 比如，各种 Heuristic Rules
        apply the transformation creating NewLogExpr 
        call FindBestPlan (NewLogExpr, PhysProp, Limit) 
    else if the move uses an algorithm 
        // 在implementation rules中，选择具体的算法后，会给算法的 input 加入 PhysProp 要求，再继续下层的递归
        // 比如，如果选择了 sort-merge join，则要求 input 必须是join key有序的，从而计算这部分 cost 加在 TotalCost中，同时 Limit 也会起到 pruning 的作用
        TotalCost := cost of the algorithm 
        for each input I while TotalCost < Limit 
            determine required physical properties PP for I 
            Cost = FindBestPlan (I, PP, Limit - TotalCost) 
            add Cost to TotalCost 
    else
        TotalCost := cost of the enforcer 
        modify PhysProp for enforced property 
        call FindBestPlan for LogExpr with new PhysProp 

/* maintain the look-up table of explored facts */ 
// 3. 将获取到的LogExpr和局部最优plan记录在hash lookup table中，后续递归回来可能再访问到相同的logical expression, 直接获取局部最优解即可
if LogExpr is not in the look-up table 
    insert LogExpr into the look-up table 
insert PhysProp and best plan found into look-up table 
// 4. 在本次递归中，返回局部最优解和该最优解的cost
return best Plan and Cost Figure

总的来说，上面的代码核心逻辑：列举所有可能的 LogExpr ，然后从 Top-Down 开始深度优先递归，直到找到最优的 plan。

其中的优化会用到：transformation rules、implementation rules 和 enforcer。cost 的比较计算，贯穿在每个环节中。

如下具体示例，来源开头的 slides ，它是 Artist 、Appears、Album 三表的 join，并需要 Artist.Id 排序。

第一步：会列举出所有可能的 logical expression，在 branch-and-bound search 中使用
第二步：选取一个 top logical expression 的 physical operator (黑色背景)，这里是 sort-merge join 算子开始，深度优先递归，找出该分支的 lowest cost 路径，如图中的 SM_JOIN 路径，并且它子节点也是采用相同方式，如图中的 HASH_JOIN(A1, A2) 和 SM_JOIN(A1, A2)，从这两个中选择 cost 最低的。
注意，这里 SM_JOIN 会对下层 logical expression 产生 physical properties 的要求，比如，这里是对 join 的两个输入，需要在 join key 上先排序。
右边的红叉标记，HASH_JOIN 算法，由于无法 order by 要求，则无法作为 top node 下层节点，会被 prune 掉
左边的红叉标记，HASH_JOIN 上层加入了 quicksort enforcer，来满足物理有序性，但由于 cost 较高也会被 prune 掉

/images/the-volcano-optimizer-generator-extensibility-and-efficient-search/volcano-optimizer-flow.jpg — volcano optimizer flow

优劣势

Advantages :
- 使用 declarative rules 方式来生成 transformations
- search engine 具有很好的 extensibility
Disadvantages :
- 需要 generate 所有可能的 logical operators，在执行 optimization search 之前
- not easy to modify predicates，即执行 rewrite expression 不太容易(where clauses)，因为优化器只知道 physical algebra 的转换

总结

整理来说，这篇paper内容偏抽象，需要借助其它资料帮助理解。同时，对Volcano进行改进的Cascades，给出了更多工程上的实现细节，可能更方便理解，会在下一篇学习。