动态符号执行

初始概念介绍

动态符号执行的核心思想是在所谓的符号数据域上模拟程序执行

cpu工作在1和0上，如果使用read系统调用读入用户输入，得到的是1和0.作为发送程序的人，无法给它发送一个未知值的变量，但是可以创造一个模拟器来实现它，可以是0，1或者x.随着知道已知数据是什么，未知数据是什么，根据cmp比较未知数据，并基于结果进行跳转

mov rax, 0  > rax = 0
mov rdi, 0  >rax = 0, rdi = 0
mov rdx, 1  > rax = 0, rdi = 0, rdx = 1
syscall -   >rax = 0, rdi = 0, rdx = 1, [rsi] = user_input
cmp qword ptr [rsi], 42   >rax = 0, rdi = 0, rdx = 1, [rsi] = user_input, user_input == 42?
je .get_flag

就可以创造出一个公式，描述程序的行为，获得程序的输入、检查条件和持续输出之间的关系

1
2

(user_inp(user_input == 42 AND program_output == the_flag) OR
(user input != 42 AND program output == "INCORRECT")ut == 42 AND program_output == the_flag) OR(user input != 42 AND program output == "INCORRECT")

然后加上一些约束关系，就可以得到有效答案

(
    (user_input == 42 AND program_output == the_flag) OR
    (user_input != 42 AND program_output == "INCORRECT")  >user_input == 42
) AND (program_output != "INCORRECT")

原理

现代的动态符号执行会模拟代码，转化为数据公式

符号执行必须像普通CPU一样按顺序执行代码，不过它有一个可修改的模拟状态。

在它执行程序时，每当遇到分支，模拟器就会多分出一个内部状态，一个状态根据条件进入一个分支，而另一个不会进入同一个分支。当这些状态分离后它们的行为就不一样了，

angr

一些其他工具

angr: our binary analysis framework that does (among other things) symboli

execution.

miasm: another binary analysis framework that does symbolic execution.

s2e: another binary analysis framework that does symbolic execution.

manticore: another binary analysis framework that does symbolic execution.

klee: symbolic execution on source code.

angr 安装（加个镜像源安装更快）

https://github.com/angr/angr

1	pip install angr -i https://pypi.tuna.tsinghua.edu.cn/simple

angr 依赖于 pycparser 来解析 C 代码。如果没有安装 pycparser，就会出现错误。

1	pip install pycparser -i https://pypi.tuna.tsinghua.edu.cn/simple

如果还有报错，就尝试更新

1	pip install --upgrade pycparser

angr基本使用

使用的测试代码 do.c

int main()
{
    char buffer[1024];
    read(0, buffer, 1);

    if (buffer[0] == 42)
        puts("Success!");
    else
        puts("Failure!");
}

//angr有项目的概念，基本就是打开二进制程序以及他用的所有库
In [1]: import angr

In [3]: p=angr.Project("do")

//可以得到各种各样的信息，比如入口地址
In [4]: hex(p.entry)
Out[4]: '0x401050'

In [5]: p.factory.block(0x401050).pp()
        _start:
401050  xor     ebp, ebp
401052  mov     r9, rdx
401055  pop     rsi
401056  mov     rdx, rsp
401059  and     rsp, 0xfffffffffffffff0
40105d  push    rax
40105e  push    rsp
40105f  xor     r8d, r8d
401062  xor     ecx, ecx
401064  mov     rdi, main
40106b  call    qword ptr [0x403fd8]

//现在来用它做一个模拟器，它从一个激活状态开始
In [6]: sm=p.factory.simulation_manager()

In [7]: sm.active
Out[7]: [<SimState @ 0x401050>]

In [8]: sm.active[0]
Out[8]: <SimState @ 0x401050>

//这个状态有寄存器之类的东西，这些是寄存器的初始值
In [9]: sm.active[0].regs
Out[9]: <angr.state_plugins.view.SimRegNameView at 0x7fa4c530f190>

In [10]: sm.active[0].regs.rax
Out[10]: <BV64 0x1c>

In [11]: sm.active[0].regs.rsp
Out[11]: <BV64 0x7fffffffffeff98>
//还有内存，比如我们加载elf的基址，加载4字节
In [12]: sm.active[0].memory.load(0x400000,4)
Out[12]: <BV32 0x7f454c46>

In [13]: "\x7f\x45\x4c\x46"
Out[13]: '\x7fELF'

现在来看执行，模拟程序

In [14]: sm.step()
Out[14]: <SimulationManager with 1 active>

In [15]: sm.step()
Out[15]: <SimulationManager with 1 active>

In [16]: sm.step()
Out[16]: <SimulationManager with 1 active>

In [17]: sm.step()
Out[17]: <SimulationManager with 1 active>

In [18]: sm.step()
Out[18]: <SimulationManager with 1 active>

我们可以看到之前的状态，它在执行过程中走过的路径

In [19]: list(sm.active[0].history.descriptions)
Out[19]: 
['<IRSB from 0x401050: 1 sat>',
 '<SimProcedure __libc_start_main from 0x527cc0: 1 sat>',
 '<IRSB from 0x401136: 1 sat>',
 '<IRSB from 0x401040: 1 sat>',
 '<SimProcedure read from 0x5fea10: 1 sat>']

继续尝试步进，这次使用函数来步进

这条指令让符号执行开始运行，并在活动路径数量超过 1 时停止。换句话说，符号执行会继续探索路径，直到程序出现了多个可能的执行路径。这通常用于查找分支点或分析程序中不同输入条件引起的执行分歧。

1 2	In [20]: sm.run(until=lambda sm2 :len(sm.active) >1) Out[20]: <SimulationManager with 2 active>

来看看发生了什么，这是2者的执行路径，它们最后执行的都是 '<IRSB from 0x40115f: 2 sat>']

In [21]: list(sm.active[0].history.descriptions)
Out[21]: 
['<IRSB from 0x401050: 1 sat>',
 '<SimProcedure __libc_start_main from 0x527cc0: 1 sat>',
 '<IRSB from 0x401136: 1 sat>',
 '<IRSB from 0x401040: 1 sat>',
 '<SimProcedure read from 0x5fea10: 1 sat>',
 '<IRSB from 0x40115f: 2 sat>']

In [22]: list(sm.active[1].history.descriptions)
Out[22]: 
['<IRSB from 0x401050: 1 sat>',
 '<SimProcedure __libc_start_main from 0x527cc0: 1 sat>',
 '<IRSB from 0x401136: 1 sat>',
 '<IRSB from 0x401040: 1 sat>',
 '<SimProcedure read from 0x5fea10: 1 sat>',
 '<IRSB from 0x40115f: 2 sat>']

查看产生分支的地方，这里就是cmp比较区域，也是一开始程序要点处

In [23]: p.factory.block(0x40115f).pp()
40115f  movzx   eax, byte ptr [rbp-0x400]
401166  cmp     al, 0x2a
401168  jne     0x40117b

可以来看这两个模拟器的状态，到底是产生了什么分歧。可以看到是“约束条件”

In [24]: sm.active[0].solver.constraints
Out[24]: [<Bool packet_0_stdin_55_8 == 42>]

In [25]: sm.active[1].solver.constraints
Out[25]: [<Bool packet_0_stdin_55_8 != 42>]

现在来看看“运行”会是怎样，成功运行并找到了分歧。但这是一个非常简单的程序，所以可以运行完并有结果。

1 2	In [26]: sm.run() Out[26]: <SimulationManager with 2 deadended>

可以看一下现在的约束条件,还是42

//注意，运行结束了，它们不再是activate，而是deadended
In [27]: sm.active[1].solver.constraints
---------------------------------------------------------------------------
IndexError                                Traceback (most recent call last)
Cell In[27], line 1
----> 1 sm.active[1].solver.constraints

IndexError: list index out of range


In [28]: sm.deadended[1].solver.constraints
Out[28]: [<Bool packet_0_stdin_55_8 == 42>]

In [29]: sm.deadended[0].solver.constraints
Out[29]: [<Bool packet_0_stdin_55_8 != 42>]

还可以看看他们的标准输出（可以称为“具体化”，具体化是查询约束求解器的过程）

In [30]: sm.deadended[0].posix.stdout.concretize()
Out[30]: [b'Failure!', b'\n']

In [31]: sm.deadended[1].posix.stdout.concretize()
Out[31]: [b'Success!', b'\n']

可以利用这个计算出标准输入

1 2	In [32]: sm.deadended[1].posix.stdin.concretize() Out[32]: [b'*']

一些问题

看起来很完美，但是符号执行有严重的限制：它必须追踪每一条路径来理解正在发生的事情

路径爆炸

比如此代码

unsigned long atoi(char *s) {
    unsigned long n = 0;
    while (*s != '\0') {
        if (*s == '0') n = n * 10;
        else if (*s == '1') n = n * 10 + 1;
        else if (*s == '2') n = n * 10 + 2;
        else if (*s == '3') n = n * 10 + 3;
        else if (*s == '4') n = n * 10 + 4;
        else if (*s == '5') n = n * 10 + 5;
        else if (*s == '6') n = n * 10 + 6;
        else if (*s == '7') n = n * 10 + 7;
        else if (*s == '8') n = n * 10 + 8;
        else if (*s == '9') n = n * 10 + 9;
        else break;
        s++;
    }
    return n;
}
int main() {
    char buffer[1024];
    read(0, buffer, 10); 
    int i = atoi(buffer); 

    if (i == 42) {
        puts("Success!");
    } else {
        puts("Failure!"); 
    }
    return 0; // 添加返回语句
}

每个状态都会再分出10个不同的状态，最终会导致状态爆炸，这就是符号执行不可行的首要原因，它会遍历每一条路径，面对复杂流程的代码就先一步卡在了路径遍历上。

现在执行此代码，来看到实时的状态爆炸

In [1]: import angr

In [2]: p=angr.Project("atoi")

In [3]: sm=p.factory.simulation_manager()

In [4]: sm.run(until=lamnda sm2 :len(sm.active)>1)
  Cell In[4], line 1
    sm.run(until=lamnda sm2 :len(sm.active)>1)
                 ^
SyntaxError: invalid syntax. Perhaps you forgot a comma?


In [5]: sm.run(until=lambda sm2 :len(sm.active)>1)
Out[5]: <SimulationManager with 2 active>

In [6]: sm.step()
Out[6]: <SimulationManager with 3 active>

In [7]: sm.step()
Out[7]: <SimulationManager with 4 active>

In [8]: sm.step()
Out[8]: <SimulationManager with 6 active>

In [9]: sm.step()
Out[9]: <SimulationManager with 9 active>

In [10]: sm.step()
Out[10]: <SimulationManager with 13 active>

In [11]: sm.step()
Out[11]: <SimulationManager with 19 active>

In [12]: sm.step()
Out[12]: <SimulationManager with 28 active>

In [13]: sm.step()
Out[13]: <SimulationManager with 41 active>

In [14]: sm.step()
Out[14]: <SimulationManager with 59 active, 1 deadended>

In [15]: sm.deadended[0].posix.stdin.concretizer()
---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
Cell In[15], line 1
----> 1 sm.deadended[0].posix.stdin.concretizer()

AttributeError: 'SimPacketsStream' object has no attribute 'concretizer'

In [16]: sm.deadended[0].posix.stdin.concretize()
Out[16]: [b'\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00']

这是第一个结束状态，得到的都是空字节，继续步进，会发现需要的时间越来愈长

In [17]: sm.step()
Out[17]: <SimulationManager with 87 active, 1 deadended>

In [18]: sm.step()
Out[18]: <SimulationManager with 128 active, 1 deadended>

In [19]: sm.step()
Out[19]: <SimulationManager with 186 active, 2 deadended>

In [20]: sm.deadended[0].posix.stdin.concretize()
Out[20]: [b'\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00']

In [21]: sm.deadended[1].posix.stdin.concretize()
Out[21]: [b'0\x00\x00\x00\x00\x00\x00\x00\x00\x00']

In [22]: sm.step()
Out[22]: <SimulationManager with 271 active, 3 deadended>

In [23]: sm.deadended[2].posix.stdin.concretize()
Out[23]: [b'1\x00\x00\x00\x00\x00\x00\x00\x00\x00']

可以得到数字0，1，但这些离42还有很远,而且越到后面路径越多

如何解决？

可以使用一些技术合并状态，使用复杂的数学公式，将多种状态合并成一种

但也有可能得到越来越复杂的数学公式

约束一般使用 SMT解答器来解答，可满足性模块化理论

SMT的基本思想是：理论本质上编码了不同类型的简化结果，将一些常见的情况套用那些化简，然后不断地用那些理论简化结果，用逻辑表达式构建一个模型，它表示实际涉及的比特，即使这些比特是未知的，然后通过布尔可满足性（SAT）原理解决问题，SAT是一个NP完全解决的问题https://en.wikipedia.org/wiki/Boolean_satisfiability_problem

这意味着我们没有一个多项式时间的算法来解决这个问题，求解SAT模型的算法是指数型的，所以解决复杂的程序会很慢

当面对复杂数据，复杂操作时，符号执行会失败，不只是路径爆炸的问题，也由于约束求解的局限性

环境建模

模拟CPU：相对简单

模拟OS：难

符号执行引擎必须识别各种内存错误（如缓冲区溢出、空指针解引用），这需要精确地跟踪内存块的边界、访问权限等信息。内存模拟中出现的错误检测约束，会进一步增加符号执行的复杂性。
动态符号执行中需要模拟对指针、数组、结构体等复杂数据结构的访问。这些结构通常涉及多级指针和嵌套的内存引用，导致在执行过程中需要频繁地进行地址计算和内存对象查找。例如，访问链表、树等数据结构时，指针的符号化会导致路径爆炸，增加符号执行的开销。如何准确地追踪指针的指向并处理其符号化表达式，是内存模拟中的一大挑战
在符号执行中，程序中的变量、地址、数组索引等都可能被符号化，从而表示为一组符号变量，而非具体值。这样一来，内存地址本身也可能是符号化的，这意味着一个内存访问操作（如读、写）可能对应多个潜在的内存位置。当使用符号化地址进行内存访问时，符号执行引擎必须考虑所有可能的地址，并在模拟内存时维护这些可能性。这极大地增加了内存访问的复杂性和计算成本
动态符号执行在探索程序路径时，需要不断克隆内存状态，以探索不同的执行分支。每当程序执行一条条件分支指令，符号执行引擎就会创建两个新的状态，并且需要分别模拟它们的内存操作。这会导致大量的内存复制和状态克隆操作。
符号执行引擎通常使用一种分层表示的方法来共享内存对象。然而，这种优化需要精确地管理哪些对象被共享，哪些对象需要克隆，且要确保在不同执行路径上的内存操作不会互相干扰

参考资料

pwn.college Module 12 - Automated Vulnerability Discovery - Dynamic Symbolic Execution

https://www.diag.uniroma1.it/~delia/papers/svtr19.pdf

https://srg.doc.ic.ac.uk/files/papers/ngmem-ase-2019.pdf