What kind of semantics information Triton can provide?
by @Jonathan Salwan - 2016-05-18-
1 - Abstract
2 - Working at the instruction level
-
2.1 - Semantics expressions
2.2 - Explicit and implicit information
2.3 - LOAD / STORE expressions
2.4 - Segments
2.5 - The Abstract Syntax Tree of expressions
1 - Abstract
Triton is a binary analysis framework which is currently mainly focused around the symbolic execution research area. Triton is basically a C++11 library and exposes a C++ API as well as Python bindings. Through this blog post we will see how Triton (build 1189) manages and deals with instruction semantics, then, what kind of information you can get back after processing an instruction into Triton.
2 - Working at the instruction level
Triton mainly works at the instruction level. It means that the highest class is Instruction. This class takes as
inputs opcodes and a potential context (memory state or registers state). If you process
this instruction, you can get back expressions (as AST) about all side effects (semantics) of the instruction as well as
several information about implicit and explicit effects (GET
, PUT
, LOAD
and STORE
).
2.1 - Semantics expressions
As an example, we will use the add rax, rbx
x86_64 instruction. What you need to do is just to give opcodes and a potential context. Note that if you don't give a context, the concretization will
be processed on zero.
>>> from triton import *
>>> setArchitecture(ARCH.X86_64)
>>> setAstRepresentationMode(AST_REPRESENTATION.PYTHON)
>>> inst = Instruction()
>>> inst.setOpcodes("\x48\x01\xD8") # add rax, rbx
>>> inst.setAddress(0x400000)
>>> inst.updateContext(Register(REG.RAX, 0x4444444455555555))
>>> inst.updateContext(Register(REG.RBX, 0x1111111122222222))
>>> processing(inst)
>>> print inst
400000: add rax, rbx
>>> for expression in inst.getSymbolicExpressions():
... print expression
...
ref_0 = ((0x4444444455555555 + 0x1111111122222222) & 0xFFFFFFFFFFFFFFFF) # ADD operation
ref_1 = (0x1 if (0x10 == (0x10 & (ref_0 ^ (0x4444444455555555 ^ 0x1111111122222222)))) else 0x0) # Adjust flag
ref_2 = ((((0x4444444455555555 & 0x1111111122222222) ^ (((0x4444444455555555 ^ 0x1111111122222222) ^ ref_0) & (0x4444444455555555 ^ 0x1111111122222222))) >> 63) & 0x1) # Carry flag
ref_3 = ((((0x4444444455555555 ^ ~0x1111111122222222) & (0x4444444455555555 ^ ref_0)) >> 63) & 0x1) # Overflow flag
ref_4 = ((((((((0x1 ^ (((ref_0 & 0xFF) >> 0x0) & 0x1)) ^ (((ref_0 & 0xFF) >> 0x1) & 0x1)) ^ (((ref_0 & 0xFF) >> 0x2) & 0x1)) ^ (((ref_0 & 0xFF) >> 0x3) & 0x1)) ^ (((ref_0 & 0xFF) >> 0x4) & 0x1)) ^ (((ref_0 & 0xFF) >> 0x5) & 0x1)) ^ (((ref_0 & 0xFF) >> 0x6) & 0x1)) ^ (((ref_0 & 0xFF) >> 0x7) & 0x1)) # Parity flag
ref_5 = ((ref_0 >> 63) & 0x1) # Sign flag
ref_6 = (0x1 if (ref_0 == 0x0) else 0x0) # Zero flag
ref_7 = 0x400003 # Program Counter
Each ref_
is a SymbolicExpression object which is unique and represents the output of the instruction. These references
represent our SSA form.
This object (SymbolicExpression
) provides several methods which give information about the output assignment - is the output assigned to a register? To a memory? Which register, which memory cells? etc.
For instance, if we want to know what the ref_0
is assigned to, we can do something like this:
>>> ref_0 = inst.getSymbolicExpressions()[0]
>>> ref_0.isRegister()
True
>>> ref_0.getOriginRegister()
rax:64 bv[63..0]
>>> hex(ref_0.getOriginRegister().getConcreteValue())
0x5555555577777777L
Note that all references point to the largest register. For example, if the destination is AL
, the output reference contains the RAX
expression.
2.2 - Explicit and implicit information
Triton doesn't make a distinction between what information are implicit or explicit, everything is processed at the same level. However, you can classify information into 6 categories:
- Output semantics expressions
- Memory LOAD accesses
- Memory STORE accesses
- Input (read) registers
- Output (write) registers
- Input (read) immediates
Example with the previous add rax, rbx
:
>>> for reg, ast in inst.getReadRegisters():
... print reg, hex(reg.getConcreteValue())
...
rax:64 bv[63..0] 0x4444444455555555L
rbx:64 bv[63..0] 0x1111111122222222L
>>> for reg, ast in inst.getWrittenRegisters():
... print reg, hex(reg.getConcreteValue())
...
rax:64 bv[63..0] 0x5555555577777777L
rip:64 bv[63..0] 0x400003L
af:1 bv[0..0] 0x0L
cf:1 bv[0..0] 0x0L
of:1 bv[0..0] 0x0L
pf:1 bv[0..0] 0x1L
sf:1 bv[0..0] 0x0L
zf:1 bv[0..0] 0x0L
Another example with a ret
instruction:
>>> inst = Instruction()
>>> inst.setOpcodes("\xc3") # ret
>>> inst.setAddress(0x40000)
>>> inst.updateContext(Register(REG.RSP, 0xfff5000))
>>> inst.updateContext(Memory(0xfff5000, 8, 0x50000))
>>> processing(inst)
>>> print inst
40000: ret
>>> for load, ast in inst.getLoadAccess():
... print load, hex(load.getConcreteValue())
...
[@0xfff5000]:64 bv[63..0] 0x50000L
>>> for reg, ast in inst.getWrittenRegisters():
... print reg, hex(reg.getConcreteValue())
...
rsp:64 bv[63..0] 0xfff5008L
rip:64 bv[63..0] 0x50000L
2.3 - LOAD / STORE expressions
When an instruction performs a memory access (LOAD
, STORE
or both), you can get information about its base, index,
scale and displacement as well as the AST of the access expression.
>>> inst = Instruction()
>>> inst.setAddress(0x400000)
>>> inst.setOpcodes("\x8A\x84\x4B\x00\x10\x00\x00") # mov al, BYTE PTR [rbx+rcx*2+0x1000]
>>> inst.updateContext(Register(REG.RAX, 0x4444444455555555))
>>> inst.updateContext(Register(REG.RBX, 0x10000))
>>> inst.updateContext(Register(REG.RCX, 0x200))
>>> processing(inst)
>>> print inst
0x400000: mov al, byte ptr [rbx + rcx*2 + 0x1000]
>>> op0 = inst.getOperands()[0]
>>> op1 = inst.getOperands()[1]
>>> print op0
al:8 bv[7..0]
>>> print op1
[@0x11400]:8 bv[7..0]
>>> print op1.getBaseRegister()
rbx:64 bv[63..0]
>>> print op1.getIndexRegister()
rcx:64 bv[63..0]
>>> print op1.getScale()
0x2:8 bv[7..0]
>>> print op1.getDisplacement()
0x1000:8 bv[7..0]
>>> print op1.getLeaAst()
((0x10000 + ((((0x200 * 0x2) & 0xFFFFFFFFFFFFFFFF) + 0x1000) & 0xFFFFFFFFFFFFFFFF)) & 0xFFFFFFFFFFFFFFFF)
>>> print hex(op1.getLeaAst().evaluate())
0x11400L
This LEA AST may contain previous expressions (SSA) and symbolic variables which imply a symbolic memory access and allow you to know the max and min range of the memory access.
2.4 - Segments
When you develop a binary analysis framework or an emulator, the main difficulty is to manage correctly segments and their content - Trust me, that's not too easy if you want a design which makes sense :-). Triton provides two lazy ways to deal with segments and their contents:
- Using emulation: You have to define a base address (instead of an offset into a GDT) and a content.
- Using concrete value: You can directly define at the process level the target address and its content (e.g: in the case where you use a DBI).
First example with the emulation way:
>>> enableSymbolicEmulation(True)
>>> setLastRegisterValue(Register(REG.FS, 0xfff400))
>>> setLastRegisterValue(Register(REG.RSI, 0x10))
>>> setLastMemoryAreaValue(0xfff400, [1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26])
>>> inst = Instruction()
>>> inst.setAddress(0x40000)
>>> inst.setOpcodes("\x64\x8A\x46\x04") # mov al, BYTE PTR fs:[rsi+0x4]
>>> processing(inst)
>>> print inst.getOperands()[1].getSegmentRegister()
fs:64 bv[63..0]
>>> print inst.getOperands()[1]
[@0xfff414]:8 bv[7..0]
>>> print inst.getOperands()[1].getConcreteValue()
21
Second example is the case where you use a DBI like Pin. Pin can provide the runtime memory access and its content which has been read or written. That's typically what we use into our libpintool.so tracer.
>>> inst = Instruction()
>>> inst.setAddress(0x40000)
>>> inst.setOpcodes("\x64\x8A\x46\x04") # mov al, BYTE PTR fs:[rsi+0x4]
>>> inst.updateContext(Memory(0xfff414, CPUSIZE.BYTE, 21)) # This information comes from the DBI
>>> processing(inst)
>>> print inst.getOperands()[1].getSegmentRegister()
fs:64 bv[63..0]
>>> print inst.getOperands()[1]
[@0xfff414]:8 bv[7..0]
>>> print inst.getOperands()[1].getConcreteValue()
21
2.5 - The Abstract Syntax Tree of expressions
Triton represents its expressions into an Abstract Syntax Tree and can display this AST into two different representations, the SMT2-Lib or the Python syntax. These modes can be enabled using their appropriate flags.
An abstract syntax tree (AST) is a representation of a grammar as tree. Triton uses a custom AST for its expressions. As all expressions are built at runtime, an AST is available at each program point. For example, let assume this set of instructions:
1. mov al, 1
2. mov cl, 10
3. mov dl, 20
4. xor cl, dl
5. add al, cl
At the line 5, the AST of the AL
register looks like this:
This AST represents the semantics of the AL
register at the program point 5 from the program point 1. Note that this AST has been simplified for
a better comprehension. The real AST contains some concat
and extract
which represent the real Intel behavior. According to the API, you can build
and modify your own AST. Then, you can perform some modifications and simplifications
before sending it to the solver. Below, a little example about the node manipulation.
>>> node = bvadd(bv(1, 8), bvxor(bv(10, 8), bv(20, 8)))
>>> print node
(bvadd (_ bv1 8) (bvxor (_ bv10 8) (_ bv20 8)))
>>> subchild = node.getChilds()[1].getChilds()[0]
>>> print subchild
(_ bv10 8)
>>> print subchild.getChilds()[0].getValue()
10
>>> print subchild.getChilds()[1].getValue()
8
>>> node = bvadd(bv(1, 8), bvxor(bv(10, 8), bv(20, 8)))
>>> print node
(bvadd (_ bv1 8) (bvxor (_ bv10 8) (_ bv20 8)))
>>> node.setChild(0, bv(123, 8))
>>> print node
(bvadd (_ bv123 8) (bvxor (_ bv10 8) (_ bv20 8)))
>>> node.evaluate()
153L
>>> node.isSigned()
True
>>> node.getBitvectorSize()
8
2.5.1 - Subtrees to keep the SSA form
To manage more easily the subtrees and to keep the SSA form of registers and memory cells, we have added a REFERENCE
node which is a "terminate" node of a
tree but which contains a reference to another subtree. Below, an example of one "partial" tree linked with two other subtrees.
If you try to go through the full AST you will fail at the first reference node because a reference node does not contains child nodes. The only way to jump from a reference node to the targeted node is to use the getFullAst() function.
>>> setAstRepresentationMode(AST_REPRESENTATION.SMT)
>>> zfId = getSymbolicRegisterId(REG.ZF)
>>> partialTree = getSymbolicExpressionFromId(zfId).getAst()
>>> print partialTree
(ite (= ref!89 (_ bv0 32)) (_ bv1 1) (_ bv0 1))
>>> fullTree = getFullAst(partialTree)
>>> print fullTree
(ite (= (bvsub ((_ extract 31 0) ((_ zero_extend 32) ((_ extract 31 0) ((_ zero_extend 32) (bvxor ((_ extract 31 0) ((_ zero_extend 32) (bvsub ((_ extract 31 0) ((_ zero_extend 32) ((_ sign_extend 24) ((_ extract 7 0) SymVar_0)))) (_ bv1 32)))) (_ bv85 32)))))) ((_ extract 31 0) ((_ zero_extend 32) ((_ sign_extend 24) ((_ extract 7 0) ((_ zero_extend 32) ((_ zero_extend 24) ((_ extract 7 0) (_ bv49 8))))))))) (_ bv0 32)) (_ bv1 1) (_ bv0 1))