Bitsliced Trivium in CorePy

Paul Crowley,, 2008

Copyright Paul Crowley, LShift Ltd, 2008, released under MIT licence

This implementation of Trivium computes 128 independent streams of
output simultaneously at under 1 cycle/byte - over four times faster
than standard implementations - using the SSE2 instruction set.  This
is not particularly useful for normal use, but is very handy for
cryptanalytic applications, and in particular for investigating the
"cube attack" on Trivium.

Rather than using the traditional combination of C and assembler, this
implementation is based on the tool CorePy, which makes it easy to
create and call sequences of machine instructions from Python code.  I
commend this tool to crypto implementers: it made implementation much
faster and I believe resulted in clearer code.  You will need to fetch
and build CorePy-1.0 to use this code, and change the path in the
"sys.path.append" line in source files to refer to it.  The URL is at
the end of this README.

Note that the overhead of executing assembly or accessing memory is
very high - run "accesstest" to measure it on your system.  In order
to make the speed gained in assembly useful, I have moved much of the
work of summing a cube into an assembler inner loop, resulting in a
tremendous speedup.  On my 2.4 GHz Intel Core 2 system, I can sum a
30-dimensional cube using output bit 771 in around 43 seconds.

Please email me if you find yourself looking at this code, I am very
keen to hear from you.

The following files can be found here:

    Tests a "maxterm" listed in the tables at the end of the cube
    attack paper.

    List of maxterms in the Dinur and Shamir paper on the cube attack.

    Maxterms emailed to me by Itai Dinur to replace the problematic
    ones in the paper.
    Utility to convert maxterms copy-pasted from the PDF of the paper
    to the maxterms.txt format.

    Find correct key polynomial for a maxterm
    Crudely test a candidate maxterm for linearity

    A short test to exercise our SSE2 implementation, and compare the
    results to the simple implementation.

    Tests the speed of this implementation and reports in clocks/byte.
    Note that the clock speed is hardwired and may need to change on
    your system.
    Prints an assembly representation of our implementation.

    Times access to various CorePy primitives.




The "cube attack":
This code: