What kind of semantics information Triton can provide?

by @Jonathan Salwan - 2016-05-18

1 - Abstract

2 - Working at the instruction level

2.1 - Semantics expressions

2.2 - Explicit and implicit information

2.3 - LOAD / STORE expressions

2.4 - Segments

2.5 - The Abstract Syntax Tree of expressions

2.5.1 - Subtrees to keep the SSA form

2.5.2 - Commutativity and the equal operator

3 - Sequence of instructions

3.1 - Symbolic emulation

4 - How do we test our semantics?

5 - How many semantics Triton supports?

5.1 - Few words about floating-point instructions

6 - Conclusion

1 - Abstract

Triton is a binary analysis framework which is currently mainly focused around the symbolic execution research area. Triton is basically a C++11 library and exposes a C++ API as well as Python bindings. Through this blog post we will see how Triton (build 1189) manages and deals with instruction semantics, then, what kind of information you can get back after processing an instruction into Triton.

2 - Working at the instruction level

Triton mainly works at the instruction level. It means that the highest class is Instruction. This class takes as inputs opcodes and a potential context (memory state or registers state). If you process this instruction, you can get back expressions (as AST) about all side effects (semantics) of the instruction as well as several information about implicit and explicit effects (GET, PUT, LOAD and STORE).

2.1 - Semantics expressions

As an example, we will use the add rax, rbx x86_64 instruction. What you need to do is just to give opcodes and a potential context. Note that if you don't give a context, the concretization will be processed on zero.

>>> from triton import *

>>> setArchitecture(ARCH.X86_64)
>>> setAstRepresentationMode(AST_REPRESENTATION.PYTHON)

>>> inst = Instruction()
>>> inst.setOpcodes("\x48\x01\xD8") # add rax, rbx
>>> inst.setAddress(0x400000)
>>> inst.updateContext(Register(REG.RAX, 0x4444444455555555))
>>> inst.updateContext(Register(REG.RBX, 0x1111111122222222))

>>> processing(inst)

>>> print inst
400000: add rax, rbx

>>> for expression in inst.getSymbolicExpressions():
...     print expression
...
ref_0 = ((0x4444444455555555 + 0x1111111122222222) & 0xFFFFFFFFFFFFFFFF) # ADD operation
ref_1 = (0x1 if (0x10 == (0x10 & (ref_0 ^ (0x4444444455555555 ^ 0x1111111122222222)))) else 0x0) # Adjust flag
ref_2 = ((((0x4444444455555555 & 0x1111111122222222) ^ (((0x4444444455555555 ^ 0x1111111122222222) ^ ref_0) & (0x4444444455555555 ^ 0x1111111122222222))) >> 63) & 0x1) # Carry flag
ref_3 = ((((0x4444444455555555 ^ ~0x1111111122222222) & (0x4444444455555555 ^ ref_0)) >> 63) & 0x1) # Overflow flag
ref_4 = ((((((((0x1 ^ (((ref_0 & 0xFF) >> 0x0) & 0x1)) ^ (((ref_0 & 0xFF) >> 0x1) & 0x1)) ^ (((ref_0 & 0xFF) >> 0x2) & 0x1)) ^ (((ref_0 & 0xFF) >> 0x3) & 0x1)) ^ (((ref_0 & 0xFF) >> 0x4) & 0x1)) ^ (((ref_0 & 0xFF) >> 0x5) & 0x1)) ^ (((ref_0 & 0xFF) >> 0x6) & 0x1)) ^ (((ref_0 & 0xFF) >> 0x7) & 0x1)) # Parity flag
ref_5 = ((ref_0 >> 63) & 0x1) # Sign flag
ref_6 = (0x1 if (ref_0 == 0x0) else 0x0) # Zero flag
ref_7 = 0x400003 # Program Counter

Each ref_ is a SymbolicExpression object which is unique and represents the output of the instruction. These references represent our SSA form.

This object (SymbolicExpression) provides several methods which give information about the output assignment - is the output assigned to a register? To a memory? Which register, which memory cells? etc. For instance, if we want to know what the ref_0 is assigned to, we can do something like this:

>>> ref_0 = inst.getSymbolicExpressions()[0]

>>> ref_0.isRegister()
True

>>> ref_0.getOriginRegister()
rax:64 bv[63..0]

>>> hex(ref_0.getOriginRegister().getConcreteValue())
0x5555555577777777L

Note that all references point to the largest register. For example, if the destination is AL, the output reference contains the RAX expression.

2.2 - Explicit and implicit information

Triton doesn't make a distinction between what information are implicit or explicit, everything is processed at the same level. However, you can classify information into 6 categories:

Output semantics expressions
Memory LOAD accesses
Memory STORE accesses
Input (read) registers
Output (write) registers
Input (read) immediates

Example with the previous add rax, rbx:

>>> for reg, ast in inst.getReadRegisters():
...     print reg, hex(reg.getConcreteValue())
...
rax:64 bv[63..0] 0x4444444455555555L
rbx:64 bv[63..0] 0x1111111122222222L

>>> for reg, ast in inst.getWrittenRegisters():
...     print reg, hex(reg.getConcreteValue())
...
rax:64 bv[63..0] 0x5555555577777777L
rip:64 bv[63..0] 0x400003L
af:1 bv[0..0] 0x0L
cf:1 bv[0..0] 0x0L
of:1 bv[0..0] 0x0L
pf:1 bv[0..0] 0x1L
sf:1 bv[0..0] 0x0L
zf:1 bv[0..0] 0x0L

Another example with a ret instruction:

>>> inst = Instruction()
>>> inst.setOpcodes("\xc3") # ret
>>> inst.setAddress(0x40000)
>>> inst.updateContext(Register(REG.RSP, 0xfff5000))
>>> inst.updateContext(Memory(0xfff5000, 8, 0x50000))

>>> processing(inst)

>>> print inst
40000: ret

>>> for load, ast in inst.getLoadAccess():
...     print load, hex(load.getConcreteValue())
...
[@0xfff5000]:64 bv[63..0] 0x50000L

>>> for reg, ast in inst.getWrittenRegisters():
...     print reg, hex(reg.getConcreteValue())
...
rsp:64 bv[63..0] 0xfff5008L
rip:64 bv[63..0] 0x50000L

2.3 - LOAD / STORE expressions

When an instruction performs a memory access (LOAD, STORE or both), you can get information about its base, index, scale and displacement as well as the AST of the access expression.

>>> inst = Instruction()
>>> inst.setAddress(0x400000)
>>> inst.setOpcodes("\x8A\x84\x4B\x00\x10\x00\x00") # mov al, BYTE PTR [rbx+rcx*2+0x1000]
>>> inst.updateContext(Register(REG.RAX, 0x4444444455555555))
>>> inst.updateContext(Register(REG.RBX, 0x10000))
>>> inst.updateContext(Register(REG.RCX, 0x200))

>>> processing(inst)
>>> print inst
0x400000: mov al, byte ptr [rbx + rcx*2 + 0x1000]

>>> op0 = inst.getOperands()[0]
>>> op1 = inst.getOperands()[1]

>>> print op0
al:8 bv[7..0]

>>> print op1
[@0x11400]:8 bv[7..0]

>>> print op1.getBaseRegister()
rbx:64 bv[63..0]

>>> print op1.getIndexRegister()
rcx:64 bv[63..0]

>>> print op1.getScale()
0x2:8 bv[7..0]

>>> print op1.getDisplacement()
0x1000:8 bv[7..0]

>>> print op1.getLeaAst()
((0x10000 + ((((0x200 * 0x2) & 0xFFFFFFFFFFFFFFFF) + 0x1000) & 0xFFFFFFFFFFFFFFFF)) & 0xFFFFFFFFFFFFFFFF)

>>> print hex(op1.getLeaAst().evaluate())
0x11400L

This LEA AST may contain previous expressions (SSA) and symbolic variables which imply a symbolic memory access and allow you to know the max and min range of the memory access.

2.4 - Segments

When you develop a binary analysis framework or an emulator, the main difficulty is to manage correctly segments and their content - Trust me, that's not too easy if you want a design which makes sense :-). Triton provides two lazy ways to deal with segments and their contents:

Using emulation: You have to define a base address (instead of an offset into a GDT) and a content.
Using concrete value: You can directly define at the process level the target address and its content (e.g: in the case where you use a DBI).

First example with the emulation way:

>>> enableSymbolicEmulation(True)
>>> setLastRegisterValue(Register(REG.FS, 0xfff400))
>>> setLastRegisterValue(Register(REG.RSI, 0x10))
>>> setLastMemoryAreaValue(0xfff400, [1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26])

>>> inst = Instruction()
>>> inst.setAddress(0x40000)
>>> inst.setOpcodes("\x64\x8A\x46\x04") # mov al, BYTE PTR fs:[rsi+0x4]

>>> processing(inst)

>>> print inst.getOperands()[1].getSegmentRegister()
fs:64 bv[63..0]

>>> print inst.getOperands()[1]
[@0xfff414]:8 bv[7..0]

>>> print inst.getOperands()[1].getConcreteValue()
21

Second example is the case where you use a DBI like Pin. Pin can provide the runtime memory access and its content which has been read or written. That's typically what we use into our libpintool.so tracer.

>>> inst = Instruction()
>>> inst.setAddress(0x40000)
>>> inst.setOpcodes("\x64\x8A\x46\x04") # mov al, BYTE PTR fs:[rsi+0x4]
>>> inst.updateContext(Memory(0xfff414, CPUSIZE.BYTE, 21)) # This information comes from the DBI

>>> processing(inst)

>>> print inst.getOperands()[1].getSegmentRegister()
fs:64 bv[63..0]

>>> print inst.getOperands()[1]
[@0xfff414]:8 bv[7..0]

>>> print inst.getOperands()[1].getConcreteValue()
21

2.5 - The Abstract Syntax Tree of expressions

Triton represents its expressions into an Abstract Syntax Tree and can display this AST into two different representations, the SMT2-Lib or the Python syntax. These modes can be enabled using their appropriate flags.

An abstract syntax tree (AST) is a representation of a grammar as tree. Triton uses a custom AST for its expressions. As all expressions are built at runtime, an AST is available at each program point. For example, let assume this set of instructions:

1. mov al, 1
2. mov cl, 10
3. mov dl, 20
4. xor cl, dl
5. add al, cl

At the line 5, the AST of the AL register looks like this:

This AST represents the semantics of the AL register at the program point 5 from the program point 1. Note that this AST has been simplified for a better comprehension. The real AST contains some concat and extract which represent the real Intel behavior. According to the API, you can build and modify your own AST. Then, you can perform some modifications and simplifications before sending it to the solver. Below, a little example about the node manipulation.

>>> node = bvadd(bv(1, 8), bvxor(bv(10, 8), bv(20, 8)))
>>> print node
(bvadd (_ bv1 8) (bvxor (_ bv10 8) (_ bv20 8)))

>>> subchild = node.getChilds()[1].getChilds()[0]
>>> print subchild
(_ bv10 8)
>>> print subchild.getChilds()[0].getValue()
10
>>> print subchild.getChilds()[1].getValue()
8

>>> node = bvadd(bv(1, 8), bvxor(bv(10, 8), bv(20, 8)))
>>> print node
(bvadd (_ bv1 8) (bvxor (_ bv10 8) (_ bv20 8)))

>>> node.setChild(0, bv(123, 8))
>>> print node
(bvadd (_ bv123 8) (bvxor (_ bv10 8) (_ bv20 8)))

>>> node.evaluate()
153L
>>> node.isSigned()
True
>>> node.getBitvectorSize()
8

2.5.1 - Subtrees to keep the SSA form

To manage more easily the subtrees and to keep the SSA form of registers and memory cells, we have added a REFERENCE node which is a "terminate" node of a tree but which contains a reference to another subtree. Below, an example of one "partial" tree linked with two other subtrees.

If you try to go through the full AST you will fail at the first reference node because a reference node does not contains child nodes. The only way to jump from a reference node to the targeted node is to use the getFullAst() function.

>>> setAstRepresentationMode(AST_REPRESENTATION.SMT)

>>> zfId = getSymbolicRegisterId(REG.ZF)
>>> partialTree = getSymbolicExpressionFromId(zfId).getAst()
>>> print partialTree
(ite (= ref!89 (_ bv0 32)) (_ bv1 1) (_ bv0 1))

>>> fullTree = getFullAst(partialTree)
>>> print fullTree
(ite (= (bvsub ((_ extract 31 0) ((_ zero_extend 32) ((_ extract 31 0) ((_ zero_extend 32) (bvxor ((_ extract 31 0) ((_ zero_extend 32) (bvsub ((_ extract 31 0) ((_ zero_extend 32) ((_ sign_extend 24) ((_ extract 7 0) SymVar_0)))) (_ bv1 32)))) (_ bv85 32)))))) ((_ extract 31 0) ((_ zero_extend 32) ((_ sign_extend 24) ((_ extract 7 0) ((_ zero_extend 32) ((_ zero_extend 24) ((_ extract 7 0) (_ bv49 8))))))))) (_ bv0 32)) (_ bv1 1) (_ bv0 1))