SpiderLabs Blog

Analyzing PDF Malware - Part 1

Written by | Sep 22, 2011 11:46:00 AM

Background

I'd like to think that security awareness has gotten to the point where the average end user thinks twice before opening an 'exe' file sent to them as an email attachment. I like to think that. I really do. But when it comes to opening PDF documents, whether it be an email attachment or their latest online utility bill, I can't even begin to convince myself that there is ever a moment of hesitation. And am I the only one who finds it ironic that security publications covering recent PDF attacks can often be downloaded in PDF form? How long would it take to live down being compromised by a document that is warning users about itself? Maybe it's just my affinity for self-referential humor.

The Portable Document Format is already a huge winner for presenting content, for a number of reasons: the proliferation of easily accessible cross-platform readers and relatively small file sizes are two obvious ones. Since many attackers tend to be opportunistic, PDF's popularity among end users and its ability to perform dynamic actions make it a natural choice as an attack vector. Attackers go where the victims are, so to speak. Oftentimes it comes down to a simple numbers game: more users on a particular platform means more potential victims for the attacker.

If you pay attention to the news, you don't have to think back too far to remember various incidents involving malformed PDF documents. From hackers who leverage malicious PDF documents to gain a foothold on the internal network of a major corporation, to reversers taking advantage of a weakness in a rendering engine to jailbreak their smartphones, PDFs are being used to bypass established security protections. How do we defend ourselves against maliciously crafted PDFs? There are a variety of methods that can be employed, but I think the best first move for those with the technical inclination is to understand the problem at hand by looking at a sample.

In this first part of a multi-part write-up we will analyze a sample PDF, aptly named sample1.pdf, and attempt to determine whether the file is malicious. We will analyze it using a blend of static and dynamic methodologies. If we determine that the file is malicious (spoiler alert: it is), we will dissect the attacks that were employed. We will trace the code of the document through various rounds of obfuscation, root out common techniques employed by the attackers, and identify the vulnerabilities that were targeted.

Don't forget…

First things first. It may seem to go without saying, but in your zeal to dig into the tech, you may forget to check whether someone has already encountered your file and done the heavy lifting for you. Before you even try opening the file, run a quick MD5 sum and do an online search to see if you get any hits. If you know that there aren't any confidentiality issues regarding your file, you may also want to submit it to one of the many online services. A fairly comprehensive list of online services, from anti-virus scanners to automated sandboxes, can be found over at cleanbytes.net. You may find that your answers are already well documented or easily detected through automated analysis. If that is the case, "Bob's your uncle," as they say. However, if you are not able to get a clear answer one way or the other through your searching, or if your particular file has the potential to contain sensitive information, you may need to take the next step in analysis.
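If you would rather compute the hash from a script than from the command line, a minimal Python sketch might look like the following. The function name file_md5 is my own choice for illustration; it reads the file in chunks so that a large sample does not have to fit in memory.

```python
import hashlib


def file_md5(path, chunk_size=65536):
    """Return the MD5 hex digest of the file at `path`, read in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # iter() with a sentinel keeps calling read() until it returns b"".
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

Calling file_md5("sample1.pdf") returns a 32-character hex string that can be pasted into a search engine or an online scanner's hash lookup. The file is only read, never opened in a PDF viewer.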

What would you say "ya do" here?

Since you are investigating the nature of your file, you will want to use a few tools to peek inside the file without dynamically executing the contents. There are a growing number of tools to choose from when analyzing PDFs. I will demonstrate a sampling of them throughout the post. To begin with, a simple strings dump will give us any printable characters in the file:

A quick look at this output gives a bit of helpful information immediately: we find JavaScript content mixed in with common PDF objects. JavaScript is supported by PDF and is often the workhorse that attackers use to set up and execute attacks. Many JavaScript obfuscation tricks commonly used in web browsers can also be used with success in a PDF. In addition to using the strings command, you may also want to use your favorite hex editor to statically view the contents of the PDF. In some cases, such as with the 010 Editor, there are templates that can do some minimal parsing of the file's structure, wrapping a bit of context around the printable characters and giving you a sense of each object's overall structure.
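For readers without the strings utility at hand, the same idea can be approximated in a few lines of Python. This is only a rough sketch of what such a dump does, not the real tool: the name printable_strings, the minimum length of 4, and the demo bytes are assumptions of mine and are not taken from sample1.pdf.

```python
import re


def printable_strings(data, min_length=4):
    """Yield runs of printable ASCII (plus tab) at least `min_length` bytes long."""
    pattern = re.compile(rb"[\x20-\x7e\t]{%d,}" % min_length)
    for match in pattern.finditer(data):
        yield match.group().decode("ascii")


# Made-up bytes standing in for a file read with open(path, "rb").read().
demo = b"%PDF-1.4\n\x00\x01\x02/JS (app.alert)\x00ab\x00"
for text in printable_strings(demo):
    print(text)
# prints "%PDF-1.4" and then "/JS (app.alert)"; "ab" is too short to be reported
```

Nothing in the file is interpreted or executed here, so this is as safe as reading the raw bytes.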

Running a second tool, PDFiD from Didier Stevens, confirms what we saw in the strings output by displaying the structure of the objects and actions:

PDFiD shows us that there are three objects, but more importantly it counts "/JS" and "/JavaScript" occurrences, which matches what we see in the strings dump.
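To make the idea of counting names concrete, here is a small Python sketch of a keyword counter. It is an illustration of the technique only and is not how PDFiD itself is implemented. The keyword list, the function names, and the demo bytes are my own assumptions. It also undoes #xx hex escapes in names (so /J#53 is counted as /JS), and it does so crudely across the whole buffer.

```python
import re

# A few names that are often worth a second look; this list is an assumption.
KEYWORDS = ("/JS", "/JavaScript", "/OpenAction", "/AA", "/Launch")


def decode_name_escapes(data):
    """Replace #xx hex escapes with the byte they stand for, e.g. #53 -> S."""
    return re.sub(rb"#([0-9A-Fa-f]{2})",
                  lambda m: bytes([int(m.group(1), 16)]), data)


def count_keywords(data, keywords=KEYWORDS):
    """Count whole-name occurrences of each keyword in the raw bytes."""
    normalized = decode_name_escapes(data)
    counts = {}
    for keyword in keywords:
        # The lookahead stops /JS from matching inside a longer name.
        pattern = re.compile(re.escape(keyword.encode("ascii")) + rb"(?![A-Za-z0-9])")
        counts[keyword] = len(pattern.findall(normalized))
    return counts


# Made-up object data, not taken from sample1.pdf.
demo = (b"1 0 obj << /Type /Action /S /JavaScript /JS (x) >> endobj "
        b"2 0 obj << /J#53 3 0 R >> endobj")
print(count_keywords(demo))
# prints {'/JS': 2, '/JavaScript': 1, '/OpenAction': 0, '/AA': 0, '/Launch': 0}
```

A non-zero count is only a hint that a closer look is warranted; plenty of benign documents contain JavaScript.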

A third and final tool we can use to take a static look at our file is pdfscan.rb from Origami, a Ruby framework used to analyze PDF documents:

Again we get confirmation with pdfscan's slightly more verbose layout that "/JavaScript" is present in our sample1.pdf.

In the past, much of the JavaScript encountered within PDF files was very straightforward in nature. Today, there is a veritable cornucopia of obfuscation techniques employed by attackers. In addition to the scripting capabilities of JavaScript, the current PDF specification supports a number of different encoding types in the form of "Standard Filters". Described at a high level, Filters support actions such as data compression, modification of character representation, and encryption. Filters can be used by attackers to hide from anti-virus signatures and to create havoc for security analysts trying to manually untangle the gnarly mess, especially when they are called in succession, or "cascaded". Didier Stevens has a must-read, albeit old, write-up covering the ways that PDFs can be encoded in more detail.
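To give a feel for what cascading means in practice, here is a minimal Python sketch that undoes two of the standard filters, ASCIIHexDecode and FlateDecode, in the order they would be listed in a stream's /Filter array. The function names and the demo script are assumptions of mine. Real streams may use other filters or extra parameters that this sketch does not handle.

```python
import binascii
import zlib


def ascii_hex_decode(data):
    """Undo ASCIIHexDecode: hex digits, optional whitespace, '>' as end marker."""
    body = data.split(b">", 1)[0]
    digits = b"".join(body.split())
    if len(digits) % 2:
        digits += b"0"  # an odd final digit is treated as if followed by 0
    return binascii.unhexlify(digits)


def flate_decode(data):
    """Undo FlateDecode (zlib/deflate compression)."""
    return zlib.decompress(data)


DECODERS = {"ASCIIHexDecode": ascii_hex_decode, "FlateDecode": flate_decode}


def apply_filters(data, filter_names):
    """Decode `data` by applying each named filter in the order given."""
    for name in filter_names:
        data = DECODERS[name](data)
    return data


# Build a made-up cascaded stream: compress first, then hex-encode the result.
script = b"app.alert('hi');"
encoded = binascii.hexlify(zlib.compress(script)) + b">"

# Decoding walks the filter list front to back, hex first and inflate second.
print(apply_filters(encoded, ["ASCIIHexDecode", "FlateDecode"]))
```

The point of the exercise is that neither the hex text nor the compressed bytes contain the string a signature might look for; it only appears after both layers are removed.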

Get into the light where you belong.

Luckily, there are tools available to help extract the JavaScript we noticed in our static analysis. First, let's take a look at the output of extractjs.rb run against our sample1.pdf. This Ruby script is from the same Origami project as pdfscan.rb, mentioned earlier:

All right, now we are getting somewhere! We've extracted the main chunk of JavaScript from the /JS tag in Object 1, but some key pieces are still missing from the code. One of the techniques that attackers have adopted over time is to hide snippets of code within document variables normally used to describe the PDF itself, such as the document's title or subject. These meta-fields are shown below with the help of yet another command-line tool, pdf.py, from the jsunpack-n suite of scripts:

We see from the output that Object 3 contains three additional tags:

1) Producer = substr

2) Subject = spli

3) Title = [data 45194 bytes]

The "Title" tag actually contains a very large string of data 45,194 bytes long. We will come back to that later. You may notice that at the end of pdf.py's output it says that it wrote JavaScript to ../sample1.pdf.out. If we view the beginning of that file we see that the script makes a good attempt to capture those extra Tag values and make them accessible. Very handy.
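As a rough illustration of pulling such fields out by hand, the Python sketch below searches raw object data for simple literal-string entries like /Title (...). It is not how pdf.py works, and the names used are my own. It assumes the object is stored uncompressed and does not handle hex strings or unescaped nested parentheses. The Producer and Subject values in the demo are the ones reported above; the Title payload is a short placeholder.

```python
import re

# Matches /Name (literal string) and allows backslash escapes inside the string.
INFO_FIELD = re.compile(
    rb"/(Title|Subject|Author|Keywords|Creator|Producer)\s*\(((?:\\.|[^\\()])*)\)",
    re.DOTALL,
)


def info_fields(data):
    """Return {field name: raw value bytes} for simple literal-string entries."""
    return {m.group(1).decode("ascii"): m.group(2)
            for m in INFO_FIELD.finditer(data)}


def describe(fields, preview=32):
    """Print short values as-is and long ones as a byte count."""
    for name, value in sorted(fields.items()):
        if len(value) > preview:
            print(f"{name} = [data {len(value)} bytes]")
        else:
            print(f"{name} = {value.decode('latin-1')}")


# Placeholder object: the Title here is 100 filler bytes, not the real payload.
demo = (b"3 0 obj << /Producer (substr) /Subject (spli) /Title ("
        + b"A" * 100 + b") >> endobj")
describe(info_fields(demo))
```

A metadata value that is tens of kilobytes long, or that looks like a fragment of a method name, is a strong hint that the field is being used as storage for the script rather than as a description of the document.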

Once the three document variables from Object 3 have been combined with the JavaScript we extracted from Object 1, we can trace, de-obfuscate, and simplify the small amount of code by hand to produce the following readable output:

Untangling this code isn't a required step, but it gives you a more complete view into what is going on under the hood, and can help prevent missing a branch of conditional code that might be hiding some unknown functionality.

We can see that the large amount of data stored in the title variable is being decoded and evaluated. The next step is to execute our code in a controlled manner to see what the code is doing with that data. There are a couple of ways to accomplish this. If your preference is for command-line tools, SpiderMonkey is the way to go. Like many of these great analysis tools, it comes pre-compiled on Lenny Zeltser's REMnux 2 Linux distro. If you prefer a GUI for this stage, Malzilla and PDFStreamDumper are both nice visual solutions. We are going to mix it up a bit and check out one of the GUIs.

We have used Malzilla to run our JavaScript (top pane in image), which produces a second stage of JavaScript (lower pane in image). Fortunately this second stage code is much easier to read than the first and only contains some minor obfuscation. Malzilla conveniently saves this new output to a file for us. The initial line of the newly produced script is the variable 'bjsg' containing escaped shellcode. This will be a primary target for analysis later. After some beautification, a bit of formatting, and some renaming we can investigate the rest of the file:

It is interesting to note that the code attempts to detect the version of the PDF viewer and the version of the EScript plugin used to execute JavaScript within the PDF. It then uses that garnered info to specifically target ranges and combinations of those versions. There are also five additional functions defined:

1) function build_nop()

2) function collabExploit()

3) function printf()

4) function geticon()

5) function a()

The first function creates a NOP sled. The remaining four each exploit a known vulnerability in PDF viewing software. In the same order as the functions listed above:

1) NOP sled

2) CVE-2007-5659

3) CVE-2008-2992

4) CVE-2009-0927

5) CVE-2009-4324

At this point our initial suspicions have been confirmed: our sample1.pdf file is indeed malicious. But what is this malicious file attempting to do on our system after exploiting one of these known vulnerabilities? To find out, we need to investigate the contents of the shellcode we discovered in our second-stage JavaScript, which we will do in the next post of this multi-part series.
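As a small head start on that step, escaped shellcode usually has to be turned back into raw bytes before it can be examined. The Python sketch below assumes the common %uXXXX form of JavaScript escaping, where each escape is a 16-bit value that ends up little-endian in memory. The function name and demo string are my own and are not taken from the 'bjsg' variable.

```python
import re
import struct


def unescape_u(escaped):
    """Convert a run of %uXXXX escapes into the bytes they occupy in memory."""
    words = re.findall(r"%u([0-9A-Fa-f]{4})", escaped)
    # Each 16-bit value is stored low byte first, so %u4142 becomes 42 41.
    return b"".join(struct.pack("<H", int(word, 16)) for word in words)


# Made-up example: two NOP bytes followed by the letters B and A.
raw = unescape_u("%u9090%u4142")
print(raw.hex())
# prints 90904241
```

The resulting bytes can be written to a file and loaded into a disassembler or shellcode emulator. Anything in the input that is not a %uXXXX escape is silently ignored by this sketch.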

To be continued…