# Wiki

# bindead-monolith / Home

# Welcome

This Wiki gives some rudimentary documentation about the BinDead analysis framework. The analyzer is able to perform a reachability analysis on executable code. What makes it special is its hierarchy of abstract domains.

## Hierarchy of Abstract Domains

The domains in this hierarchy come in four flavors:

### The Segment Domain

The root of the hierarchy is a set of segment domains. Technically, it is a sum of abstract domains in that each segment model a part of the state space of the program. The segment domain is able to resolve a pointer access, that is, to determine to which memory region the pointer points to and whether it is within the bounds of a memory region.

The segments that the root domain tracks are stack frames, heap allocated
memory regions, global data, constant data, thread-local storage (*this is a todo*) and the
processor-internal state. Even the segments of the executable file are segments in our analyzer. These domains partition the state space of the
analyzer. They share the
Segment
interface that allows the root domain to resolve constant addresses to memory
regions. In addition, each of these domains may implement a set of primitive operations that allow, for example, the creation of new
heap-allocated memory regions. Thus, it is necessary to pick the right set of segment domains for a specific analysis to work.

The interface to the root domain is defined in RootDomain. It allows the evaluation of assignments, pointer stores and loads, and tests.

### The Memory Domain

The memory domain tracks the content of a memory region. A memory region MemVar is a consecutive piece of memory with no bounds. If we were to analyse C/Java rather than assembler, a memory region would correspond to the notion of a program variable which is also consecutive and has a determined size. In the context of assembler analysis, a memory region becomes a larger entity, since boundaries between variables are now known. The purpose of the Memory Domain is to translate pointer loads and stores to accesses to fixed-sized variables whose content is tracked by a numeric domain. The interface of the memory domain is defined in MemoryDomain. Instances of this class form a hierarchy. Although any number of memory domains is possible, only a few behaviors make sense:

- The Array Domain (todo)

The array domain observes accesses to a memory region at offsets that are non-constant. It tries to summarize the span of memory that can be accessed into an access at a constant offset into an auxiliary memory region that represents the summary of all array elements.

- The Field Domain

The field domain creates a so-called field for each write access with a constant offset. Each field is associated with a numeric variable NumVar for which the numeric domain hierarchy tracks a value. Overlapping fields are possible and enable e.g. the access to an integers as bytes. In these cases the field domain performs the re-interpretation of the numeric values transparently to the caller of this domain.

- The Content Domain

(needs re-working to fit description, currently the content is managed within the fields domain)

Certain memory regions are pre-defined upon the start of a program and may
even stay constant throughout the program. In order *not* to store this
content in the numeric domain, a content domain tracks these constant values.
Whenever the field domain has no field associated with a piece of memory, it
forwards the (read) request to the constant domain which might have a constant
associated with that offset. Particular example of memory regions that use the content domain are the data and code segment of a program.

### The Finite Domain

The finite domain is what is commonly known as numeric domain in that it maps variables to values. Its interface FiniteDomain only provides operations to evaluate expression over variables. Each assignment and test has a fixed bit-size associated with it, the notable exceptions are operations that perform size conversion (these are also modelled as assignments). When a variable is introduced, it has a size in bits and one of the types Boolean, address or numeric. The type merely serves as a hint to domains and makes no guarantee about the range of the variable. Indeed, all variables have to be interpreted modulus their bit-size, which makes numeric operations at this level rather cumbersome. The following domains are implemented as finite domains:

- The Flow-Sensitive Points-to Domain

This domain observes assignment of address variables to program variables.
Address variables represent the address of a memory region and are usually
mapped to a very large interval, such as `[4096,2^31]`

. The points-to
domain uses addresses as a symbolic value and, whenever it observes an
assignment of the form `v=a`

with `a`

being and address, it stores the
information that `v=f·a`

where `f=1`

is a new Boolean variable stored
in its child. The benefit is that, after a join with another domain state in
which the variable points to `a'`

the combined points-to information of
`v`

is `v=f·a+f'·a'`

with `f=1-f', 0 ≤ f ≤ 1`

, indicating that
`v`

points to either `a`

or `a'`

but neither to both nor to none.
This information is queried by the root domain whenever a pointer is
dereferenced. See also PointerDereference.

- The Predicate Domain

The predicate domain observes assignments of the form `v = a ≤s32 b`

where
`≤s32`

is the test if the 32-bit value `a`

is less or equal to `b`

when interpreting their contents as signed integers. This domain allows any
form of predicate abstraction but has actually a very specific purpose: In the
context of analyzing assembler, a conditional jump is executed using two
operations: a test, which sets several flags, and a jump that branches
depending on the value of one of these flags. The predicate domain allows for
tracking the semantic test associated with each of the flags that the test
instruction sets. When the conditional branch is executed, it merely tests the flag which the predicate domain translates into the associated numeric test.

- The SAT Domain (in beta status)

The relationship between the various Boolean variables can be exactly expressed as a single Boolean formula. The SAT domain tracks this Boolean formula and manipulates it according the the performed operations. A join on two Boolean formulae creates a new Boolean formulae which is, simply speaking, a disjunction of the two Boolean input formule whose variables have been renamed. The stale variables are projected out during widening. This projection is calculated using a SAT solver, hence the name of the domain.

- The Wrapping Domain

The wrapping domain is the interface between the finite domain and the lowest
level, the Zeno domains. Its purpose is to wrap the argument of each operation
whose semantics change, depending on the signedness and the size of its
operands. Examples for such operations are tests and divisions. The wrapping
domain will check if the value of the operands is within the required signed
or unsigned range and, if not, wrap their values. For example, wrapping the
variable `x`

whose value is currently `[-2,5]`

to an unsigned byte will
result in calculating the join of `[253,255]`

and `[0,5]`

(which is
likely to be `[0,255]`

). After wrapping has been applied, a comparison of
two unsigned bytes can safely be performed.

Note that some operations, such as addition and subtraction, are agnostic to the signedness of their arguments and, in particular, have the correct semantics even in the modulus-interpretation of the finite domain. Hence, these agnostic operations are passed-on unaltered.

### The Zeno Domain

The Zeno domain (named after the set `Z`

of whole numbers) is defined in
ZenoDomain.
It differs from the Finite domains in that variables no longer have a size. In
particular, all calculations correspond to their mathematical counterparts. Currently, the following domains are implemented:

- The Affine Domain

The affine domain tracks equality relations between variables. For each known
equation `x=…`

the variable `x`

is removed from the child (since it is
fully determined by the variables on its right-hand side). If its child domain
observes that a new equality relation holds after a test, it propagates this
equality up and the affine domain will insert it into its system of
equalities, thereby removing another variable form the child.

- The Redundant Affine Domain

This is a variation of the affine domain that does not remove any variables
from its child, even if they are fully determined by an equation. The reason
for this domain is that our current child domains (namely intervals) are
not closed under linear transformation involving three or more variables which
incurs a precision loss (e.g. if `x=y+z`

holds then the affine domain
would remove `x`

but the interval of the sum `y+z`

is larger than that
of `x`

).

- The Narrowing Domain

This domain tracks tests that are applied but turn out to be redundant. Once widening is applied, all collected tests are applied that have been collected, thereby narrowing the state space. The net result is narrowing is not needed any more (which would be very imprecise anyway since any variable whose bound is widened away would be wrapped).

- The Gauge Domain (under development)

This is Venet's domain that infers wedges, that is, convex polyhedra of the consisting of a single vertex and rays.

- The Congruence Domain

Tracks if a variable `x`

is multiple of a constant `k`

. It stores
`x/k`

in its child, thereby enabling automatic reduction of bounds since
it is assumed that all variables are integral.

- The Value-Set Domain

Tracks a set of values. If this set is not stable during widening, the variable is promoted to an interval and widened.

- The Interval Domain

Tracks an interval for every variable. Unlike all previous domains, it has no child domain.

- The IntervalSet Domain

This is a domain tracking sets of intervals and replaces both, the Value-Set and the Interval domain.

# Domain Interfaces

The transfer functions of the Memory, Finite, and Zeno domains differ slightly from each other since they become more and more abstract. However, certain operations like the standard domain operations join, widen and subset test are common to all domain interfaces, see SemiLattice.

It is important that all transfer functions only mention variables that are
known by the domains. We call this set of variables the *support set* of the
domain. The functions in the
VariableSupport
interface change the support set. These functions are duplicated by the Finite
and Zeno interfaces since their signatures are slightly different.

Moreover, there are a few operations common to the Finite and Zeno domains that are non-standard and that allow the analyser to work with summaries of memory regions. The are given by the Summarization interface and include the following operations:

- fold and expand

Both of these take a list of variable pairs,
VarPair.
One variable in each pair is called *permanent* and one is called
*ephemeral*. The purpose of fold is to remove the ephemeral variables by
merging the information contained in them with that of the permanent
variables. The expand operation is dual in that it duplicates the content in
the permanent variables to the newly introduced ephemeral variables.

- copy and paste

The *copyAndPaste* operation extracts the information over the given set of variables in one abstract domain and inserts it into another domain.

# Implementation Strategy

Every abstract state is a pure value in the sense that no operation can modify
it: Applying any operations that modifies a domain will return a new domain.
This design allows the sharing of (parts of) the abstract state between
different program locations. This is made possible by using a special kind of
data structure, namely *persistent trees*, defined by e.g.
AVLMap:
Similar to trees in a function language, inserting or deleting an element will
return a new tree. Each state should use such a tree to store its state. When
facing the task of calculating a join, a special *split* operation can be
called on the two trees that hold the two input states. This split operation
only returns mappings that are unique to one of the trees or whose value
differs. In particular, all subtrees that are the same in both trees are not
even considered when calculating *split* thereby enabling a sub-linear join
on two domains.

The pure implementation of domains leads to the following convention for its
implementation: For a domain *D* the transfer functions are implemented in a
module *D*. The immutable state of the domain is defined in a module
*DState*. Implementing a transfer function commences by passing the internal
data structures of *DState* to a class called *DBuilder*. The *DBuilder*
class implements the actual semantic methods that update the state destructively.
Once the construction of the resulting domain element is finished, the state
in the *DBuilder* object is wrapped in an immutable *DState* object.

# Implementation of the Parent/Child Relation

The hierarchy of domains is an instance of co-fibered domains. The idea is very simple: A parent domain contains a handle to its child and translates each call to its semantic transfer functions into zero or more calls to transfer functions on the child. Since the parent domain has complete control over the child domain, it may chose to use a different support set on the child than its own support set. The principles of forwarding the domain operations to a child are implemented in MemoryFunctor, FiniteFunctor, and ZenoFunctor. An instance of these classes will, by default, forward the operation to the child.

### Child Operations

When calculating the new state within a co-fibered domain, it might be
necessary to issue several operations on the child. In order to perform these
operations from inside the *DBuilder* object, the handle to the child would
have to be passed in and out whenever a method in the *DBuilder* object
might have to perform an operation on the child. Moreover, certain sequences
of operations, like introducing and immediately projecting out a variable,
effectively do nothing on the child and time can be saved by not performing
these operations on the child.

In order to provide a common infrastructure for the re-occurring need to
execute child operations, each Functor class provides a way to enqueue *child
operations* that can be executed when the new domain state is finally built.
These child operations are defined in
MemoryChildOp,
FiniteChildOp,
and
ZenoChildOp.
They implement rudimentary optimisations. It might be possible to even gather
child operations across several domain operation on the parent, thereby
opening up more opportunities for optimization. However, so far it's not clear
if it's worth the effort so it's not implemented.

### Building the Infrastructure

The project uses Maven to orchestrate the compilation of the various packages. Under a more recent version of Eclipse, the following steps should work:

- Install the Mercurial versioning system into Eclipse
- Install the Maven build system into Eclipse
- Import a new project by cloning the Mercurial repository
- Import each top-level module into the work space
- In order to build the parsers in the
`rreil`

package open a shell and issue`mvn compile`

- Refresh the whole project

Updated