Introduction
Does your Java web application have to allow users to upload files?
My condolences!
Uploading a file to your web server is like inviting the Trojans to bring in their horse. A big horse, made of wood, that sounds suspiciously hollow. Except for whatever’s rolling around in there. In some cases, that’s literally what’s happening! Such a dangerous operation requires extraordinary protections, and in this post, I’ll cover one of the most important: verifying that the file is what you expect it to be.
Note: I’m not going to cover anti-virus in this article. That’s a complex-enough topic that I think it deserves its own post later.
Technology and Approach
There are a variety of ways a file can be dangerous. For example:
- The file might contain an illicit payload designed to infect your server-side application runtime when it tries to process the file or its contents
- The file might contain an illicit payload designed to look innocuous to the server but actually contain a payload designed to harm users who download or interact with the file
- The file might be large enough to cause a denial of service by filling up your server’s disk space
The last point isn’t something I’ll talk about at length, since the solution is so straightforward: you should always limit the maximum upload size to match your business needs. If you only allow customers to upload small-ish PNGs, limit the maximum upload size to 1 MB or whatever makes sense for the quality of image you’re expecting.
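As a minimal sketch of that limit, assuming you copy the upload stream yourself, the helper below stops as soon as the stream grows past a cap. The class name BoundedCopy, the method copyWithLimit, and the 1 MB constant are illustrative names of mine, not part of the example application. In a servlet container, the maxFileSize attribute of @MultipartConfig is another place to enforce a cap.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;

// Hypothetical helper: not part of the original example application.
final class BoundedCopy {
    /** Illustrative cap; pick a value that matches your business needs. */
    static final long ONE_MEGABYTE = 1024L * 1024L;

    /**
     * Copies everything from in to out, but fails as soon as more than
     * maxBytes have been read. Returns the number of bytes copied.
     */
    static long copyWithLimit(InputStream in, OutputStream out, long maxBytes)
            throws IOException {
        byte[] buffer = new byte[8192];
        long total = 0;
        int read;
        while ((read = in.read(buffer)) != -1) {
            total += read;
            if (total > maxBytes) {
                // Stop before writing the chunk that crosses the limit.
                throw new IOException("Upload exceeds limit of " + maxBytes + " bytes");
            }
            out.write(buffer, 0, read);
        }
        return total;
    }
}
```

A caller would pass the upload's input stream, the destination stream, and BoundedCopy.ONE_MEGABYTE, then delete the partial file if the IOException fires.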
Setting that limit won’t do anything to protect against illicit payloads, though. The Introduction alluded to one way to protect yourself: anti-virus software. The other thing you can do is verify that the uploaded file is the kind of file you’re expecting. If you allow uploads of PDFs, you need to make sure the file’s a PDF. Allow uploads of GIF files? Confirm it’s a GIF.
Does that sound easy? It’s really not. You can’t rely at all on file extensions. The user sets those during upload, and even if you arbitrarily change the extension to what you expect after receiving the file, the extension only gives you a hint as to what’s in the file — and an unreliable hint at that.
Actually, scratch that. The extension is completely worthless in terms of determining what’s in the file. At least it is from a security perspective.
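To see why the extension says nothing about the content, compare it with the first bytes of the file. The sketch below is only an illustration: the class LeadingBytes and its methods are names I made up, and it recognizes just three well-known signatures (PDF, PNG, and GIF). It is not the technique this article goes on to recommend.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.Locale;

// Hypothetical illustration: not part of the original example application.
final class LeadingBytes {
    private static final byte[] PDF = "%PDF-".getBytes(StandardCharsets.US_ASCII);
    private static final byte[] PNG = {(byte) 0x89, 'P', 'N', 'G', 0x0D, 0x0A, 0x1A, 0x0A};
    private static final byte[] GIF87 = "GIF87a".getBytes(StandardCharsets.US_ASCII);
    private static final byte[] GIF89 = "GIF89a".getBytes(StandardCharsets.US_ASCII);

    /** True when data begins with exactly the bytes in prefix. */
    static boolean startsWith(byte[] data, byte[] prefix) {
        return data.length >= prefix.length
                && Arrays.equals(Arrays.copyOf(data, prefix.length), prefix);
    }

    /** What the file name claims, based on nothing but its extension. */
    static String typeForExtension(String fileName) {
        String lower = fileName.toLowerCase(Locale.ROOT);
        if (lower.endsWith(".pdf")) return "application/pdf";
        if (lower.endsWith(".png")) return "image/png";
        if (lower.endsWith(".gif")) return "image/gif";
        return "application/octet-stream";
    }

    /** What the first bytes of the content suggest. */
    static String sniff(byte[] head) {
        if (startsWith(head, PDF)) return "application/pdf";
        if (startsWith(head, PNG)) return "image/png";
        if (startsWith(head, GIF87) || startsWith(head, GIF89)) return "image/gif";
        return "application/octet-stream";
    }
}
```

A file named invoice.pdf whose content starts with the bytes "MZ" gets "application/pdf" from typeForExtension but "application/octet-stream" from sniff. The name and the content disagree, and only the content is under your inspection rather than the uploader's control.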
So, why not just try to perform an operation on the file? Why not, for example, attempt to extract the text from a PDF to see if it works?
The problem is the ingenuity of malicious actors. PDFs, JPGs, GIFs, and in fact most file types can be crafted so that attempting to perform a routine operation can exploit a vulnerability, either in the low-level operating system APIs that allow our application runtime to process the file, or in the application runtime itself. The more complex the operation against the file, the higher the probability that something untoward could happen.
Yes, even just touching the file could be problematic.
Of course, throwing our hands up in despair isn’t helpful, so I recommend a different approach: attempt a simple operation to determine the file type, and if that passes without incident, try a more complex operation. The technique I’ll present tries these two tests:
- Perform basic validation using the JDK’s MimetypesFileTypeMap
- Perform a more complex validation using Apache Tika
Implementation Examples
Figure 1 shows the overall flow of the example:
The Java Server Page (JSP) named input_05 allows a user to upload a file of whatever type. The servlet/controller named processinput_05 accepts the file as multi-part MIME data and reassembles it. Then, it uses the JDK’s MimetypesFileTypeMap to do a preliminary check. Assuming all’s well, the code then does a more sophisticated check using Apache Tika. It then displays the results to the browser.
Here’s a look at the most important bits of code.
First, we reassemble the attachment and place it in /tmp so we can look at it:
    @MultipartConfig
    public class processinput_05 extends HttpServlet {
        ....
        protected void processRequest(HttpServletRequest request, HttpServletResponse response)
                throws ServletException, IOException {
            // See http://stackoverflow.com/questions/2422468/how-to-upload-files-to-server-using-jsp-servlet
            // for a description of this approach to uploading files
            Part filePart = request.getPart("file2scrub");
            String strScrubFileName = Paths.get(filePart.getSubmittedFileName()).getFileName().toString(); // MSIE fix.
            InputStream fileContent = filePart.getInputStream();
            OutputStream os = new FileOutputStream(new File("/tmp/random-string-" + strScrubFileName));
            int read = 0;
            byte[] bytes = new byte[1024];
            while ((read = fileContent.read(bytes)) != -1) {
                os.write(bytes, 0, read);
            }
            os.flush();
            os.close();
            fileContent.close();
Notice that I found a great example on Stack Overflow. I highly recommend that site. Its voting system helps reliable examples rise to the top, and it’s rare that I don’t find what I’m looking for there.
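One caution about the snippet above: it builds the path in /tmp from the name the browser submitted. A variation, sketched here under my own assumptions, lets the JDK pick a unique name with Files.createTempFile and copies the stream with Files.copy. The class UploadStore, the method saveToTemp, and the "upload-" and ".bin" prefix and suffix are hypothetical and do not appear in the example application.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

// Hypothetical helper: not part of the original example application.
final class UploadStore {
    /**
     * Writes the uploaded bytes to a new file in the default temp directory
     * and returns its path. The file name is generated by the JDK, so nothing
     * the uploader typed ends up in the path.
     */
    static Path saveToTemp(InputStream uploaded) throws IOException {
        Path target = Files.createTempFile("upload-", ".bin");
        try (InputStream in = uploaded) {
            // createTempFile already created the file, so overwrite it.
            Files.copy(in, target, StandardCopyOption.REPLACE_EXISTING);
        } catch (IOException e) {
            // Do not leave a partial upload behind.
            Files.deleteIfExists(target);
            throw e;
        }
        return target;
    }
}
```

In the servlet, this would be called as UploadStore.saveToTemp(filePart.getInputStream()), and the returned path would be handed to the validation steps instead of the hand-built /tmp path.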
Once we have the file safely stored in /tmp, we can begin validating it. I’ve placed the logic for both the JDK and Apache Tika checks in a class called Input05_Detector. So, I instantiate that class and invoke the method that supports the JDK check:
    String strDetectedType01 = null;
    Input05_Detector i5d = new Input05_Detector();
    try {
        strDetectedType01 = i5d.tryUsingMimetypesFileTypeMapgetContentType("/tmp/random-string-" + strScrubFileName);
    } catch (Exception e) {
        strDetectedType01 = "Error detecting type.";
        e.printStackTrace();
    }
The method tryUsingMimetypesFileTypeMapgetContentType is surprisingly simple:
    public String tryUsingMimetypesFileTypeMapgetContentType(String strCandidateFile) throws Exception {
        String strReturn = null;
        MimetypesFileTypeMap typeMTF = new MimetypesFileTypeMap();
        strReturn = typeMTF.getContentType(strCandidateFile);
        typeMTF = null;
        return strReturn;
    }
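This code returns the file type to the extent it can with a basic check. You can read about how the logic works at Oracle’s site. In short, getContentType(String) looks up the type from the extension in the file name and never opens the file, so the check involves little logic within the JDK and is safer than a complex operation. (MimetypesFileTypeMap lives in the javax.activation package, which shipped with the JDK through Java 8; on newer JDKs it comes from a separate library.) Here’s a table that describes what the method returns for a variety of file types: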
Type | Results of Check |
---|---|
Apple Pages with *.pages | application/octet-stream |
BZ2 with *.bz2 | application/octet-stream |
Calendar with *.ics | application/octet-stream |
CSV with *.csv | application/octet-stream |
DOCX with *.docx | application/octet-stream |
Excel with *.xls | application/octet-stream |
Excel with *.xlsx | application/octet-stream |
GIF with *.gif | image/gif |
HTML with *.html | text/html |
iDVD with *.dvdproj | application/octet-stream |
JPG with *.jpg | image/jpeg |
Old Word with *.doc | application/octet-stream |
PDF with *.pdf | application/octet-stream |
PNG with *.png | application/octet-stream |
PowerPoint with *.pptx | application/octet-stream |
RTF with *.rtf | application/rtf |
Scrivener with *.scriv | application/octet-stream |
SQL with *.sql | application/octet-stream |
Text file with *.docx | application/octet-stream |
Text file with *.txt | text/plain |
There are a couple of things I’d like you to notice about the table. First, the JDK’s MimetypesFileTypeMap maps some file types to a specific type, like the JPG or GIF files. Second, there are a lot of files that it simply identifies as “application/octet-stream.” It’s true that’s not very precise, but it’s still a step forward. You could, for example, verify that what you’re expecting to be a PDF file (application/octet-stream) doesn’t show as “application/rtf.” Keep in mind that getContentType(String) works from the file name alone, which is why a text file named *.docx still shows as “application/octet-stream.” Treat a match as a cheap first filter on what the uploader claims, not as evidence about the content. That evidence comes from the deeper check with Apache Tika.
Apache Tika is a robust library that allows you to interact with a wide variety of file types. The demonstration here only scratches the surface. In fact, we’re only going to use one method called detect().
First, from the servlet, processinput_05 invokes Input05_Detector:
    String strDetectedType02 = null;
    try {
        strDetectedType02 = i5d.tryUsingTikaDetect("/tmp/random-string-" + strScrubFileName);
    } catch (Exception e) {
        strDetectedType02 = "Error detecting type.";
        e.printStackTrace(); // Yes, I know, this is bad form. It's only for the example!
    }
Within Input05_Detector, here’s what the method tryUsingTikaDetect does:
    public String tryUsingTikaDetect(String strCandidateFile) throws Exception {
        String strReturn = null;
        Tika tika = new Tika();
        strReturn = tika.detect(new File(strCandidateFile));
        tika = null;
        return strReturn;
    }
Again, a very simple method, but it yields a lot of functionality. Like the JDK’s method of file checking, Tika returns what kind of file it thinks the candidate is. However, it’s a lot more precise, because detect(File) looks at the file’s content as well as its name. This table summarizes its results for the same files as above:
Type | Tika Reports As... |
---|---|
Old Word with *.doc | application/msword |
PDF with *.pdf | application/pdf |
RTF with *.rtf | application/rtf |
Excel with *.xls | application/vnd.ms-excel |
PowerPoint with *.pptx | application/vnd.openxmlformats-officedocument.presentationml.presentation |
Excel with *.xlsx | application/vnd.openxmlformats-officedocument.spreadsheetml.sheet |
DOCX with *.docx | application/vnd.openxmlformats-officedocument.wordprocessingml.document |
BZ2 with *.bz2 | application/x-bzip2 |
iDVD with *.dvdproj | application/zip |
Apple Pages with *.pages | application/zip |
Scrivener with *.scriv | application/zip |
GIF with *.gif | image/gif |
JPG with *.jpg | image/jpeg |
PNG with *.png | image/png |
Calendar with *.ics | text/calendar |
CSV with *.csv | text/csv |
HTML with *.html | text/html |
Text file with *.txt | text/plain |
SQL with *.sql | text/x-sql |
Text file with *.docx | text/plain |
Some of the results are the same. For example, checking a *.jpg file yields a type of “image/jpeg,” just like with the JDK check. However, where Tika really shines is with the more complex file types. Tika identifies a modern Microsoft Word *.docx file as “application/vnd.openxmlformats-officedocument.wordprocessingml.document,” and it offers similar precision with the other Microsoft formats. It’s also harder to fool. When I gave a plain text file an extension of docx, Tika correctly identified it as “text/plain,” whereas the JDK thought it was an “application/octet-stream.”
In our example, if you wanted to allow people to upload PDF files, you’d first use the JDK to make sure it saw the type as “application/octet-stream.” If the file met that first criterion, you could then check it with Tika to make sure its type was “application/pdf.” At that point, you can be reasonably sure the file’s a PDF.
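The decision just described can be sketched as a small allowlist lookup. This is an assumption-laden illustration, not code from the example application: the class UploadPolicy, the method accept, and the short keys ("pdf", "gif", "jpg") are names I invented, and the expected values are copied from the two tables above.

```java
import java.util.Map;

// Hypothetical helper: not part of the original example application.
final class UploadPolicy {
    /**
     * For each kind of upload the application allows: the type the first
     * (JDK) check should report, then the type Tika should report.
     */
    private static final Map<String, String[]> EXPECTED = Map.of(
            "pdf", new String[] {"application/octet-stream", "application/pdf"},
            "gif", new String[] {"image/gif", "image/gif"},
            "jpg", new String[] {"image/jpeg", "image/jpeg"});

    /** Accepts the file only when both checks report exactly what is expected. */
    static boolean accept(String kind, String firstPassType, String tikaType) {
        if (kind == null) {
            return false;
        }
        String[] expected = EXPECTED.get(kind);
        if (expected == null) {
            return false; // anything not on the allowlist is rejected
        }
        return expected[0].equals(firstPassType) && expected[1].equals(tikaType);
    }
}
```

Because the comparison is an exact match, the "Error detecting type." strings that the servlet stores when a check throws are rejected automatically.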
We’ll cover checking it with anti-virus in another article.
If you’re interested in more options or permutations for Java file checking, I found a blog post on Inspired by Actual Events to be very helpful. Check it out!
Wrapping Up
Accepting files from the wild is a dangerous business. It might seem that the approach I’ve recommended is overly cautious, but consider this: it only takes one malformed file to really wreck your company’s reputation. During a Root Cause Analysis session, do you really want your code singled out as the cause of an undesired data disclosure? Put another way, isn’t it worth a little extra time to check the files to make sure you protect your customers’ data? Or your company’s reputation? But maybe the best reason to perform these checks is that they offer a level of protection for a very, very low cost. In other words, you can improve safety and security without impacting the project schedule. Everyone wins!
by Terrance A. Crow
Terrance has been writing professionally since the late 1990s — yes, he’s been writing since the last century! Though he started writing about programming techniques and security for Lotus Notes Domino, he went on to write about Microsoft technologies like SQL Server, ActiveX Data Objects, and C#. He now focuses on application security for professional developers because… Well, you’ve watched the news. You know why!