PDF Scrutinizer

PDF Scrutinizer is a library for detecting and analyzing malicious PDF documents.

Author: Florian Schmitt florian@florianschmitt.de

Thanks to Jan Gassen and Elmar Gerhards-Padilla

Additional thanks to Mila, who operates the contagio dump, for providing vast amounts of malicious PDF samples which were essential for the development of this tool!

Info

As the CVE-ID dates show, the specific detection mechanisms are currently somewhat outdated. However, adding detection for newer vulnerabilities should generally be straightforward, and there are plenty of existing examples to use as a reference.

Additionally, PDF Scrutinizer does not depend on knowledge of specific vulnerabilities alone; it also uses generic mechanisms, so there is a good chance that current malicious documents can be classified correctly.

The ever-present problem is that the PDF specification is not very restrictive and Adobe Reader itself tries to open malformed documents at all costs. Thus, errors will always occur when processing malformed documents with a third-party library (such as PDFBox). I tried to follow a "path of least resistance", meaning that not every single document can be processed without error, but hopefully most of them can.

Even if this project does not stay up to date with current exploits, I think it remains a worthwhile example of mechanisms that can be used to detect malicious techniques in client-side scripting languages. I am fairly sure that the ideas implemented here can also be applied in other scenarios.

Reference

Florian Schmitt, Jan Gassen and Elmar Gerhards-Padilla. PDF Scrutinizer: Detecting JavaScript-based Attacks in PDF Documents. Proc. of the 10th Annual Conference on Privacy, Security and Trust, PST, Paris, France, July 16-18, 2012.

(ask me for a draft copy if you don't have access to IEEE Xplore)

Features

  • completely automatic, meaning no user interaction needed
  • extracts code from different locations in the document
    • document catalog -> names,
    • document catalog -> open action,
    • additional actions of pages,
    • AcroForms,
    • brute force any object and test whether it has a JS value
  • detection of exploits targeting static vulnerabilities, currently:
    • CVE-2009-0658
    • CVE-2010-0188
  • static heuristics examining extracted and dynamically evaluated code strings
    • RegexMalicious, RegexSuspicious - search the code using regular expressions and mark the document as malicious or suspicious
    • VulnerableAPICalls - marks the document as suspicious if vulnerable API calls are found
  • execution of extracted code
    • emulation of the JavaScript for Acrobat API
    • when necessary, the emulation provides functionality (e.g. doc.getAnnots, doc.getPageNthWord, ...)
    • asynchronous execution of code (app.setTimeOut, app.setInterval, app.clearTimeOut and app.clearInterval)
  • dynamic emulation/detection of JavaScript-based exploits by hooking the vulnerable calls in the emulation, currently:
    • CVE-2007-5659
    • CVE-2008-2992
    • CVE-2009-0927
    • CVE-2009-1492
    • CVE-2009-1493
    • CVE-2009-4324
    • CVE-2010-4091
  • dynamic heuristics
    • HeapSprayDetector - detects the presence of heap spraying
    • StringLengthTester - checks whether the length of the strings used exceeds a threshold
    • ShellcodeTester - tests strings that show certain characteristics for shellcode (using the libemu library)
  • "reasonable" performance
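
To illustrate the idea behind a dynamic heuristic such as HeapSprayDetector, here is a minimal Python sketch. It is not the tool's actual implementation (which is written in Java); the class name, the thresholds and the rule "many large strings add up to a huge total" are all assumptions chosen for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class HeapSprayHeuristic:
    """Toy heuristic: flag scripts that create a large volume of big strings.

    Both thresholds are assumed values for illustration, not the ones
    PDF Scrutinizer uses.
    """

    min_block_len: int = 64 * 1024          # ignore strings shorter than this
    max_total_len: int = 32 * 1024 * 1024   # alarm once big strings add up to this
    total_len: int = field(default=0, init=False)

    def observe(self, value: str) -> None:
        """Record one string produced by the emulated script."""
        if len(value) >= self.min_block_len:
            self.total_len += len(value)

    @property
    def triggered(self) -> bool:
        """True once the accumulated length of large strings hits the limit."""
        return self.total_len >= self.max_total_len
```

An emulator would call `observe` for every string the script builds and check `triggered` afterwards. Counting only large strings keeps ordinary scripts, which create many small strings, from tripping the alarm.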

Preparations

Because I am currently unsure about the different licenses, the required libraries are not included for now. These libraries are Apache PDFBox and Mozilla Rhino. I had to modify both of them to make the Scrutinizer functionality work, so I created forks of both libraries, which need to be cloned before the Scrutinizer is usable.

clone PDF Scrutinizer
$ git clone https://bitbucket.org/florianschmitt/pdf-scrutinizer.git

$ mkdir pdf-scrutinizer/lib/
$ cd pdf-scrutinizer/lib/

clone PDFBox fork
$ git clone https://github.com/florianschmitt/pdfbox.git

an additional file that is needed:
$ wget -P pdfbox/pdfbox/src/main/resources/org/apache/pdfbox/resources/ http://partners.adobe.com/public/developer/en/opentype/glyphlist.txt

clone rhino-mirror fork
$ git clone https://github.com/florianschmitt/rhino-mirror.git

the rhino-mirror fork has one more special dependency:
$ git clone https://github.com/joelhockey/jcodings.git

Usage

Once the libraries are in place, a simple Maven call should be sufficient. To run some acceptance tests:

mvn test

To build a jar package with all dependencies:

mvn package assembly:single -DskipTests

To run it against a single document, use, for example, my simple run script:

./run.sh -pdf src/test/resources/2Collection/CVE-2009-4324_PDF_2009-11-30_note200911.pdf\=1ST0DAYFILE

Apart from the logging output, the tool saves data in a result directory named after the MD5 hash of the sample.
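
If you want to locate the output for a given sample from a script, the directory name can be derived by hashing the file. The following Python sketch assumes the hash is the plain hex MD5 digest of the file contents and that the base directory is called `result`; both the function name and the base path are hypothetical, so adjust them to what the tool actually writes.

```python
import hashlib
from pathlib import Path


def result_dir_for(sample: Path, base: Path = Path("result")) -> Path:
    """Return the assumed result directory for a sample file.

    The base directory name is an assumption; only the use of the
    sample's MD5 hash as the directory name comes from the README.
    """
    digest = hashlib.md5()
    with sample.open("rb") as handle:
        # Read in chunks so large samples do not need to fit in memory.
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return base / digest.hexdigest()
```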

TODO/Contribute

  • migrate to newest PDFBox and Rhino version
  • improve PDF Reader emulation quality by
    • extending the JavaScript for Acrobat API emulation
    • better handling the differences between the various environments (browser and standalone)
    • developing emulation for different PDF Reader versions (this is a tough one...)
    • developing emulation for PDF Reader plugins
  • add DocumentExposure classes for detection of specific exploits targeting static vulnerabilities
  • improve RegexMalicious, RegexSuspicious and VulnerableAPICalls static heuristic data
  • add more dynamic heuristics and improve the quality of the existing ones
  • have fun! :)
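
For contributors who want to extend the static heuristic data, the following Python sketch shows one way regex-based rules in the spirit of RegexMalicious and RegexSuspicious could be organized. It is only an illustration: the real heuristics are Java classes in this repository, and the rule patterns, the `Verdict` enum and the `classify` function are hypothetical.

```python
import re
from enum import IntEnum


class Verdict(IntEnum):
    BENIGN = 0
    SUSPICIOUS = 1
    MALICIOUS = 2


# Hypothetical rule data: each pattern maps to the verdict it implies.
RULES = [
    # A run of encoded NOP-like units is a common shellcode sled marker.
    (re.compile(r"(?:%u9090){4,}", re.IGNORECASE), Verdict.MALICIOUS),
    # Decoding and dynamic evaluation are merely suspicious on their own.
    (re.compile(r"\bunescape\s*\("), Verdict.SUSPICIOUS),
    (re.compile(r"\beval\s*\("), Verdict.SUSPICIOUS),
]


def classify(code: str) -> Verdict:
    """Return the most severe verdict among all rules matching the code."""
    return max(
        (verdict for pattern, verdict in RULES if pattern.search(code)),
        default=Verdict.BENIGN,
    )
```

Keeping the patterns in a plain data table means new rules can be added without touching the matching logic.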