PDF Scrutinizer is a library for detecting and analyzing malicious PDF documents.
Author: Florian Schmitt firstname.lastname@example.org
Thanks to Jan Gassen and Elmar Gerhards-Padilla
Additional thanks to Mila, who operates the contagio dump, for providing vast amounts of malicious PDF samples which were essential for the development of this tool!
As the CVE-ID dates show, the specific detection mechanisms are currently a bit outdated. However, adding detection for newer vulnerabilities should generally be straightforward, and there are plenty of existing examples to use as a reference.
Additionally, PDF Scrutinizer does not depend on the knowledge of specific vulnerabilities but uses generic mechanisms, meaning there is a good chance that current malicious documents can be classified correctly.
The omnipresent problem is that the PDF specification is not very restrictive and Adobe Reader itself tries to open malformed documents at all costs. Thus, errors will always occur when processing malformed documents with a third-party library (such as PDFBox). I took a "path of least resistance": not every single document can be processed without error, but hopefully most of them can.
In case this project does not stay up-to-date with current exploits, I think it is still a worthy example of mechanisms which can be used to detect malicious techniques in client-side scripting languages. I am pretty sure that the ideas implemented here can also be used in other scenarios.
(ask me for a draft copy if you don't have access to IEEE Xplore)
- completely automatic, meaning no user interaction needed
- extracts code from different locations in the document
- document catalog -> names,
- document catalog -> open action,
- additional actions of pages,
- brute force any object and test whether it has a JS value
- detection of exposures targeting static vulnerabilities, currently:
- static heuristics examining extracted and dynamically evaluated code strings
- RegexMalicious, RegexSuspicious - search using regular expressions, marks as malicious/suspicious
- VulnerableAPICalls - marks the document as suspicious, in case vulnerable API calls are found
- execution of extracted code
- when necessary, the emulation provides functionality (e.g. doc.getAnnots, doc.getPageNthWord, ...)
- asynchronous execution of code (app.setTimeout, app.setInterval, app.clearTimeOut & app.clearInterval)
- dynamic heuristics
- HeapSprayDetector - is able to detect the presence of heap spraying
- StringLengthTester - checks whether the length of used strings exceeds a threshold
- ShellcodeTester - tests strings which have certain characteristics for shellcode (using the libemu library)
- "reasonable" performance
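As a rough illustration of how the regex-based static heuristics and the string-length check listed above could work, here is a minimal Java sketch. The class name, the patterns, and the threshold are assumptions chosen for this example; they are not taken from PDF Scrutinizer's actual heuristic data.

```java
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical sketch: names, patterns and threshold are illustrative
// assumptions, not PDF Scrutinizer's real RegexMalicious/RegexSuspicious data.
class StaticHeuristicSketch {

    enum Verdict { BENIGN, SUSPICIOUS, MALICIOUS }

    // Assumed example: four or more consecutive "%u9090" units (a NOP-sled literal).
    private static final List<Pattern> MALICIOUS = List.of(
        Pattern.compile("(?:%u9090){4,}", Pattern.CASE_INSENSITIVE)
    );

    // Assumed examples: calls that are common in obfuscated scripts but also in benign ones.
    private static final List<Pattern> SUSPICIOUS = List.of(
        Pattern.compile("\\bunescape\\s*\\("),
        Pattern.compile("\\beval\\s*\\(")
    );

    // Assumed threshold (in characters) for the string-length check.
    static final int MAX_STRING_LENGTH = 65536;

    // Malicious patterns take precedence over suspicious ones.
    static Verdict classify(String code) {
        for (Pattern p : MALICIOUS) {
            if (p.matcher(code).find()) return Verdict.MALICIOUS;
        }
        for (Pattern p : SUSPICIOUS) {
            if (p.matcher(code).find()) return Verdict.SUSPICIOUS;
        }
        return Verdict.BENIGN;
    }

    // Mirrors the idea of StringLengthTester: flag strings longer than the threshold.
    static boolean exceedsLengthThreshold(String s) {
        return s.length() > MAX_STRING_LENGTH;
    }

    public static void main(String[] args) {
        System.out.println(classify("var s = unescape('%u4141');"));
        System.out.println(exceedsLengthThreshold("short string"));
    }
}
```

In this sketch, `classify` would be applied to both extracted and dynamically evaluated code strings, while the length check would be applied to strings observed during execution.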
Because I am unsure about the different licenses right now, I will not include the used libraries at the moment. These libraries are Apache PDFBox and Mozilla Rhino. I had to make some changes to both of the libraries in order to make the Scrutinizer functionality work. I made forks for both of the libraries which need to be cloned before the Scrutinizer is usable.
clone PDF Scrutinizer
$ git clone https://bitbucket.org/florianschmitt/pdf-scrutinizer.git
$ mkdir pdf-scrutinizer/lib/
$ cd pdf-scrutinizer/lib/
clone PDFBox fork
$ git clone https://github.com/florianschmitt/pdfbox.git
some additional file that is needed:
$ wget -P pdfbox/pdfbox/src/main/resources/org/apache/pdfbox/resources/ http://partners.adobe.com/public/developer/en/opentype/glyphlist.txt
clone rhino-mirror fork
$ git clone https://github.com/florianschmitt/rhino-mirror.git
the used rhino mirror has another special dependency:
$ git clone https://github.com/joelhockey/jcodings.git
Once the libraries are in place, a simple Maven call should be sufficient:
in order to run some acceptance tests.
mvn package assembly:single -DskipTests
to build a jar package with all dependencies.
To run it against a single document use for example my simple run script:
./run.sh -pdf src/test/resources/2Collection/CVE-2009-4324_PDF_2009-11-30_note200911.pdf\=1ST0DAYFILE
Apart from the logging output, the tool saves data in the result directory, in a subdirectory named after the MD5 hash of the sample.
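To locate the output for a given sample, you need its MD5 hash. The following Java sketch shows one way to compute it and derive a directory path from it. The class and method names are hypothetical, and the base directory passed in is an assumption; check the tool's actual output for the real layout.

```java
import java.nio.charset.StandardCharsets;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical helper, not part of PDF Scrutinizer: derives a directory
// name from the MD5 hash of a sample's bytes.
class ResultDirSketch {

    // Returns the lowercase hexadecimal MD5 digest of the given bytes.
    static String md5Hex(byte[] data) {
        try {
            MessageDigest md = MessageDigest.getInstance("MD5");
            StringBuilder sb = new StringBuilder();
            for (byte b : md.digest(data)) {
                sb.append(String.format("%02x", b & 0xff));
            }
            return sb.toString();
        } catch (NoSuchAlgorithmException e) {
            // MD5 is a required algorithm on every Java platform.
            throw new IllegalStateException("MD5 not available", e);
        }
    }

    // Resolves the per-sample directory below an assumed base directory.
    static Path resultDirFor(byte[] sample, Path base) {
        return base.resolve(md5Hex(sample));
    }

    public static void main(String[] args) {
        byte[] sample = "abc".getBytes(StandardCharsets.UTF_8);
        System.out.println(resultDirFor(sample, Paths.get("result")));
    }
}
```

For a real document, the bytes would come from something like `java.nio.file.Files.readAllBytes` on the sample path.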
- migrate to newest PDFBox and Rhino version
- improve PDF Reader emulation quality by
- better emulating the differences between the various environments (browser and standalone)
- developing emulation for different PDF Reader versions (this is a tough one...)
- developing emulation for PDF Reader plugins
- add DocumentExposure classes for detection of specific exploits targeting static vulnerabilities
- improve RegexMalicious, RegexSuspicious and VulnerableAPICalls static heuristic data
- add more dynamic heuristics and improve the quality of the existing ones
- have fun! :)