This Wiki gives some rudimentary documentation about the BinDead analysis framework. The analyzer is able to perform a reachability analysis on executable code. What makes it special is its hierarchy of abstract domains.
Hierarchy of Abstract Domains
The domains in this hierarchy come in four flavors:
The Segment Domain
The root of the hierarchy is a set of segment domains. Technically, it is a sum of abstract domains in that each segment model a part of the state space of the program. The segment domain is able to resolve a pointer access, that is, to determine to which memory region the pointer points to and whether it is within the bounds of a memory region.
The segments that the root domain tracks are stack frames, heap allocated memory regions, global data, constant data, thread-local storage (this is a todo) and the processor-internal state. Even the segments of the executable file are segments in our analyzer. These domains partition the state space of the analyzer. They share the Segment interface that allows the root domain to resolve constant addresses to memory regions. In addition, each of these domains may implement a set of primitive operations that allow, for example, the creation of new heap-allocated memory regions. Thus, it is necessary to pick the right set of segment domains for a specific analysis to work.
The interface to the root domain is defined in RootDomain. It allows the evaluation of assignments, pointer stores and loads, and tests.
The Memory Domain
The memory domain tracks the content of a memory region. A memory region MemVar is a consecutive piece of memory with no bounds. If we were to analyse C/Java rather than assembler, a memory region would correspond to the notion of a program variable which is also consecutive and has a determined size. In the context of assembler analysis, a memory region becomes a larger entity, since boundaries between variables are now known. The purpose of the Memory Domain is to translate pointer loads and stores to accesses to fixed-sized variables whose content is tracked by a numeric domain. The interface of the memory domain is defined in MemoryDomain. Instances of this class form a hierarchy. Although any number of memory domains is possible, only a few behaviors make sense:
- The Array Domain (todo)
The array domain observes accesses to a memory region at offsets that are non-constant. It tries to summarize the span of memory that can be accessed into an access at a constant offset into an auxiliary memory region that represents the summary of all array elements.
- The Field Domain
The field domain creates a so-called field for each write access with a constant offset. Each field is associated with a numeric variable NumVar for which the numeric domain hierarchy tracks a value. Overlapping fields are possible and enable e.g. the access to an integers as bytes. In these cases the field domain performs the re-interpretation of the numeric values transparently to the caller of this domain.
- The Content Domain
(needs re-working to fit description, currently the content is managed within the fields domain)
Certain memory regions are pre-defined upon the start of a program and may even stay constant throughout the program. In order not to store this content in the numeric domain, a content domain tracks these constant values. Whenever the field domain has no field associated with a piece of memory, it forwards the (read) request to the constant domain which might have a constant associated with that offset. Particular example of memory regions that use the content domain are the data and code segment of a program.
The Finite Domain
The finite domain is what is commonly known as numeric domain in that it maps variables to values. Its interface FiniteDomain only provides operations to evaluate expression over variables. Each assignment and test has a fixed bit-size associated with it, the notable exceptions are operations that perform size conversion (these are also modelled as assignments). When a variable is introduced, it has a size in bits and one of the types Boolean, address or numeric. The type merely serves as a hint to domains and makes no guarantee about the range of the variable. Indeed, all variables have to be interpreted modulus their bit-size, which makes numeric operations at this level rather cumbersome. The following domains are implemented as finite domains:
- The Flow-Sensitive Points-to Domain
This domain observes assignment of address variables to program variables.
Address variables represent the address of a memory region and are usually
mapped to a very large interval, such as
[4096,2^31]. The points-to
domain uses addresses as a symbolic value and, whenever it observes an
assignment of the form
a being and address, it stores the
f=1 is a new Boolean variable stored
in its child. The benefit is that, after a join with another domain state in
which the variable points to
a' the combined points-to information of
f=1-f', 0 ≤ f ≤ 1, indicating that
v points to either
a' but neither to both nor to none.
This information is queried by the root domain whenever a pointer is
dereferenced. See also PointerDereference.
- The Predicate Domain
The predicate domain observes assignments of the form
v = a ≤s32 b where
≤s32 is the test if the 32-bit value
a is less or equal to
when interpreting their contents as signed integers. This domain allows any
form of predicate abstraction but has actually a very specific purpose: In the
context of analyzing assembler, a conditional jump is executed using two
operations: a test, which sets several flags, and a jump that branches
depending on the value of one of these flags. The predicate domain allows for
tracking the semantic test associated with each of the flags that the test
instruction sets. When the conditional branch is executed, it merely tests the flag which the predicate domain translates into the associated numeric test.
- The SAT Domain (in beta status)
The relationship between the various Boolean variables can be exactly expressed as a single Boolean formula. The SAT domain tracks this Boolean formula and manipulates it according the the performed operations. A join on two Boolean formulae creates a new Boolean formulae which is, simply speaking, a disjunction of the two Boolean input formule whose variables have been renamed. The stale variables are projected out during widening. This projection is calculated using a SAT solver, hence the name of the domain.
- The Wrapping Domain
The wrapping domain is the interface between the finite domain and the lowest
level, the Zeno domains. Its purpose is to wrap the argument of each operation
whose semantics change, depending on the signedness and the size of its
operands. Examples for such operations are tests and divisions. The wrapping
domain will check if the value of the operands is within the required signed
or unsigned range and, if not, wrap their values. For example, wrapping the
x whose value is currently
[-2,5] to an unsigned byte will
result in calculating the join of
[0,5] (which is
likely to be
[0,255]). After wrapping has been applied, a comparison of
two unsigned bytes can safely be performed.
Note that some operations, such as addition and subtraction, are agnostic to the signedness of their arguments and, in particular, have the correct semantics even in the modulus-interpretation of the finite domain. Hence, these agnostic operations are passed-on unaltered.
The Zeno Domain
The Zeno domain (named after the set
Z of whole numbers) is defined in
It differs from the Finite domains in that variables no longer have a size. In
particular, all calculations correspond to their mathematical counterparts. Currently, the following domains are implemented:
- The Affine Domain
The affine domain tracks equality relations between variables. For each known
x=… the variable
x is removed from the child (since it is
fully determined by the variables on its right-hand side). If its child domain
observes that a new equality relation holds after a test, it propagates this
equality up and the affine domain will insert it into its system of
equalities, thereby removing another variable form the child.
- The Redundant Affine Domain
This is a variation of the affine domain that does not remove any variables
from its child, even if they are fully determined by an equation. The reason
for this domain is that our current child domains (namely intervals) are
not closed under linear transformation involving three or more variables which
incurs a precision loss (e.g. if
x=y+z holds then the affine domain
x but the interval of the sum
y+z is larger than that
- The Narrowing Domain
This domain tracks tests that are applied but turn out to be redundant. Once widening is applied, all collected tests are applied that have been collected, thereby narrowing the state space. The net result is narrowing is not needed any more (which would be very imprecise anyway since any variable whose bound is widened away would be wrapped).
- The Gauge Domain (under development)
This is Venet's domain that infers wedges, that is, convex polyhedra of the consisting of a single vertex and rays.
- The Congruence Domain
Tracks if a variable
x is multiple of a constant
k. It stores
x/k in its child, thereby enabling automatic reduction of bounds since
it is assumed that all variables are integral.
- The Value-Set Domain
Tracks a set of values. If this set is not stable during widening, the variable is promoted to an interval and widened.
- The Interval Domain
Tracks an interval for every variable. Unlike all previous domains, it has no child domain.
- The IntervalSet Domain
This is a domain tracking sets of intervals and replaces both, the Value-Set and the Interval domain.
The transfer functions of the Memory, Finite, and Zeno domains differ slightly from each other since they become more and more abstract. However, certain operations like the standard domain operations join, widen and subset test are common to all domain interfaces, see SemiLattice.
It is important that all transfer functions only mention variables that are known by the domains. We call this set of variables the support set of the domain. The functions in the VariableSupport interface change the support set. These functions are duplicated by the Finite and Zeno interfaces since their signatures are slightly different.
Moreover, there are a few operations common to the Finite and Zeno domains that are non-standard and that allow the analyser to work with summaries of memory regions. The are given by the Summarization interface and include the following operations:
- fold and expand
Both of these take a list of variable pairs, VarPair. One variable in each pair is called permanent and one is called ephemeral. The purpose of fold is to remove the ephemeral variables by merging the information contained in them with that of the permanent variables. The expand operation is dual in that it duplicates the content in the permanent variables to the newly introduced ephemeral variables.
- copy and paste
The copyAndPaste operation extracts the information over the given set of variables in one abstract domain and inserts it into another domain.
Every abstract state is a pure value in the sense that no operation can modify it: Applying any operations that modifies a domain will return a new domain. This design allows the sharing of (parts of) the abstract state between different program locations. This is made possible by using a special kind of data structure, namely persistent trees, defined by e.g. AVLMap: Similar to trees in a function language, inserting or deleting an element will return a new tree. Each state should use such a tree to store its state. When facing the task of calculating a join, a special split operation can be called on the two trees that hold the two input states. This split operation only returns mappings that are unique to one of the trees or whose value differs. In particular, all subtrees that are the same in both trees are not even considered when calculating split thereby enabling a sub-linear join on two domains.
The pure implementation of domains leads to the following convention for its implementation: For a domain D the transfer functions are implemented in a module D. The immutable state of the domain is defined in a module DState. Implementing a transfer function commences by passing the internal data structures of DState to a class called DBuilder. The DBuilder class implements the actual semantic methods that update the state destructively. Once the construction of the resulting domain element is finished, the state in the DBuilder object is wrapped in an immutable DState object.
Implementation of the Parent/Child Relation
The hierarchy of domains is an instance of co-fibered domains. The idea is very simple: A parent domain contains a handle to its child and translates each call to its semantic transfer functions into zero or more calls to transfer functions on the child. Since the parent domain has complete control over the child domain, it may chose to use a different support set on the child than its own support set. The principles of forwarding the domain operations to a child are implemented in MemoryFunctor, FiniteFunctor, and ZenoFunctor. An instance of these classes will, by default, forward the operation to the child.
When calculating the new state within a co-fibered domain, it might be necessary to issue several operations on the child. In order to perform these operations from inside the DBuilder object, the handle to the child would have to be passed in and out whenever a method in the DBuilder object might have to perform an operation on the child. Moreover, certain sequences of operations, like introducing and immediately projecting out a variable, effectively do nothing on the child and time can be saved by not performing these operations on the child.
In order to provide a common infrastructure for the re-occurring need to execute child operations, each Functor class provides a way to enqueue child operations that can be executed when the new domain state is finally built. These child operations are defined in MemoryChildOp, FiniteChildOp, and ZenoChildOp. They implement rudimentary optimisations. It might be possible to even gather child operations across several domain operation on the parent, thereby opening up more opportunities for optimization. However, so far it's not clear if it's worth the effort so it's not implemented.
Building the Infrastructure
The project uses Maven to orchestrate the compilation of the various packages. Under a more recent version of Eclipse, the following steps should work:
- Install the Mercurial versioning system into Eclipse
- Install the Maven build system into Eclipse
- Import a new project by cloning the Mercurial repository
- Import each top-level module into the work space
- In order to build the parsers in the
rreilpackage open a shell and issue
- Refresh the whole project