Scanning XML at gigabytes per second
I recently added SIMD acceleration to libxml2-ee,
making it possible to parse and serialize at least some XML (and HTML)
files at gigabytes per second.
The changes speed up scanning of text data, for example normal text,
attribute values and the contents of comments or CDATA sections. When
parsing such constructs, we have to scan the input until we find certain
terminating characters like < or & and make sure that the scanned
part is valid. When serializing, some values must be escaped and we have
to stop at a similar set of characters.
These ideas aren't new and build upon work by people like Daniel Lemire. See some of his papers:
- Validating UTF-8 In Less Than One Instruction Per Byte (2021)
- Parsing Gigabytes of JSON per Second (2019)
- Scanning HTML at Tens of Gigabytes per Second on ARM Processors (2024)
I also implemented the algorithm in the first paper to validate UTF-8 using SIMD instructions.
Finding stop bytes
Finding the end of character data is straightforward. Depending on what we are currently parsing, we have to consume all input until we find one of several terminating characters. These can include:
- characters that start new elements or entity references: < &
- characters that end attribute values, comments, CDATA sections or processing instructions: " ' - ] ?
- whitespace when normalizing attribute values
- invalid characters like C0 control chars or < inside attribute values
- invalid UTF-8
When serializing, we also have to stop at certain characters that have
to be escaped. We ignore UTF-8 for now and only concern ourselves with
simple ASCII bytes. What we are looking for is a SIMD version of
strcspn in C or std::string::find_first_of in C++.
The basic idea is to load a 16- or 32-byte chunk into a SIMD register, compare against the stop bytes, and move a bitmask representing the matches back to a general purpose register. If we found a stop byte, we terminate and count the trailing zeros of the bitmask to find the position of the first stop byte in the current chunk.
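Modeled in scalar C, the chunked scan might look like the following sketch. find_stop is a hypothetical helper, the stop set is reduced to < and &, and the inner loop stands in for a single SIMD compare plus movemask:

```c
#include <stdint.h>
#include <stddef.h>

/* Scalar model of the chunked scan. In the SIMD version, the 16
 * comparisons per chunk happen in one register and the bitmask comes
 * from pmovmskb (x86) or the shrn trick (ARM). */
static size_t find_stop(const uint8_t *buf, size_t len) {
    size_t i = 0;
    for (; i + 16 <= len; i += 16) {
        uint32_t mask = 0;
        for (int k = 0; k < 16; k++)   /* one SIMD compare + movemask */
            if (buf[i + k] == '<' || buf[i + k] == '&')
                mask |= 1u << k;
        if (mask)                      /* first stop byte in the chunk */
            return i + __builtin_ctz(mask);
    }
    for (; i < len; i++)               /* scalar tail */
        if (buf[i] == '<' || buf[i] == '&')
            return i;
    return len;
}
```

Counting the trailing zeros of the bitmask yields the index of the first match because bit k of the mask corresponds to byte k of the chunk.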
The trickiest part is classifying stop bytes, which essentially means
looking up a bit in a 256-element bit vector. Such an operation isn't
directly supported by current SIMD ISAs, but there's a clever solution
that uses the pshufb (Intel) or tbl (ARM) instructions, which look up
elements from a 16-byte table. We simply perform two such table lookups
with the high and low nibble of each input byte, then AND the results
together.
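Here's a scalar model of the two-table nibble lookup; in the real SIMD code the two tables live in vector registers and a single pshufb/tbl classifies 16 bytes at once. The tables below cover only a small illustrative stop set (", &, ' and <), not libxml2-ee's actual character classes:

```c
#include <stdint.h>

/* Scalar model of vectorized classification. Each table entry holds
 * class bits: bit 0 marks stop bytes with high nibble 2 (0x22 '"',
 * 0x26 '&', 0x27 '\''), bit 1 marks those with high nibble 3
 * (0x3C '<'). A byte is a stop byte iff the lookups for its high and
 * low nibble share a bit. */
static const uint8_t table_lo[16] = {
    [0x2] = 0x01, [0x6] = 0x01, [0x7] = 0x01, [0xC] = 0x02,
};
static const uint8_t table_hi[16] = {
    [0x2] = 0x01, [0x3] = 0x02,
};

static int is_stop_byte(uint8_t b) {
    /* two pshufb/tbl lookups and an AND in the SIMD version */
    return (table_hi[b >> 4] & table_lo[b & 0x0F]) != 0;
}
```

Assigning one class bit per high-nibble group avoids false positives: a byte like ',' (0x2C) shares its high nibble with '&' and its low nibble with '<', but the two lookups return different bits, so the AND is zero.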
Validating UTF-8
The approach described above is called vectorized classification and is also used to validate UTF-8 with a SIMD algorithm described by Lemire in Validating UTF-8 In Less Than One Instruction Per Byte. The core idea is to use a third nibble from the following byte to classify byte pairs. For pairs of continuation bytes, only a little extra work is required. This ingenious algorithm only takes around 20 SIMD instructions to process a chunk of bytes.
Scanning with UTF-8 validation still turns out to be 2-3 times slower, so libxml2-ee takes a hybrid approach. It starts to scan data as ASCII and switches to the UTF-8 algorithm whenever a non-ASCII byte is encountered.
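A minimal scalar sketch of this hybrid strategy, assuming a toy stop set of < and & and a simplified length-plus-continuation check in place of the full SIMD validator (which also rejects overlong encodings and surrogates):

```c
#include <stdint.h>
#include <stddef.h>

/* Returns the sequence length for a UTF-8 lead byte, 0 if invalid. */
static size_t utf8_len(uint8_t b) {
    if (b < 0xC2) return 0;           /* continuation or invalid lead */
    if (b < 0xE0) return 2;
    if (b < 0xF0) return 3;
    if (b < 0xF5) return 4;
    return 0;
}

/* Scans as ASCII and falls back to UTF-8 handling only when a
 * non-ASCII byte appears. Returns bytes consumed before a stop byte
 * or invalid sequence. */
static size_t scan_text(const uint8_t *buf, size_t len) {
    size_t i = 0;
    while (i < len) {
        uint8_t b = buf[i];
        if (b < 0x80) {               /* ASCII fast path */
            if (b == '<' || b == '&') /* illustrative stop set */
                break;
            i++;
            continue;
        }
        size_t n = utf8_len(b);       /* UTF-8 slow path */
        if (n == 0 || i + n > len)
            break;                    /* invalid or truncated */
        for (size_t k = 1; k < n; k++)
            if ((buf[i + k] & 0xC0) != 0x80)
                return i;             /* bad continuation byte */
        i += n;
    }
    return i;
}
```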
Handling line and column numbers
Unlike other, more performance-oriented parsers, libxml2 has to keep track of line and column numbers while parsing, which complicates and slows things down a little. To track line numbers, we have to count all newline bytes; to track columns, we have to count UTF-8 sequences after the last newline. For example, we may end up with a chunk like
UUcNUNUcUcc.NU.N
U: start of UTF-8 sequence
c: continuation byte
N: newline
.: stop byte
To begin, we must ignore the first stop byte and following bytes:
UUcNUNUcUcc.NU.N
UUcNUNUcUcc#####
Now we can count the newlines and add them to the line count. If there were any newlines, we reset the column count to 1. Then we mask off the bytes before the last newline:
UUcNUNUcUcc#####
######UcUcc#####
Now we can count the remaining sequences and add them to the column count.
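The steps above can be sketched in scalar C. update_position is a hypothetical helper; the masks are assumed to follow the pmovmsk convention (bit i = byte i), with starts marking all non-continuation bytes:

```c
#include <stdint.h>

/* Scalar sketch of per-chunk line/column accounting. The masks would
 * come from SIMD comparisons: stops marks stop bytes, newlines marks
 * '\n' bytes, starts marks the first byte of each UTF-8 sequence. */
static void update_position(uint32_t stops, uint32_t newlines,
                            uint32_t starts, int *line, int *col) {
    /* keep only the bytes before the first stop byte */
    uint32_t valid = stops ? (1u << __builtin_ctz(stops)) - 1 : 0xFFFF;
    uint32_t nl = newlines & valid;
    uint32_t counted = starts & valid;

    *line += __builtin_popcount(nl);
    if (nl) {
        *col = 1;
        /* drop everything up to and including the last newline */
        counted &= ~((2u << (31 - __builtin_clz(nl))) - 1);
    }
    *col += __builtin_popcount(counted);
}
```

For the example chunk UUcNUNUcUcc.NU.N, the masks work out to stops = 0x4800, newlines = 0x9028 and starts = 0xF97B; starting at line 1, column 1, the update yields line 3, column 3, matching the walkthrough above.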
This mostly involves bit operations like counting leading or trailing
zeros, population counts, and creating and applying bit masks. Most of
these steps aren't suitable for SIMD processing, so we extract
scalar bit masks from SIMD into general purpose (GP) registers with
instructions like pmovmsk.
Popcount on ARM Neon
Unfortunately, a pmovmsk equivalent isn't available on ARM Neon.
Neither is a popcount instruction in the GP integer domain. With
clang or gcc, the popcount built-in uses the SIMD instructions cnt
and addv, which requires a SIMD/GP roundtrip.
To work around the missing pmovmsk, there's a well-known trick
involving shrn which results in a 64-bit mask containing four
consecutive zero or one bits for each of the 16 vector elements.
In binary, this looks like:
0000111111111111000011110000000000001111000011111111000011111111
Or in hex:
0x0FFF0F000F0FF0FF
If we proceed to isolate the lowest of the four bits, we end up with something like:
0x0111010001011011
With gaps between the bits, the population count can be computed by simply
multiplying this value with 0x1111111111111111. Long multiplication
in base 16 as illustration:
1111111111111111 * 0111010001011011
-----------------------------------
0 111010001011011
0 1 11010001011011
01 1 1010001011011
011 1 010001011011
0111 0 10001011011
01110 1 0001011011
011101 0 001011011
0111010 0 01011011
01110100 0 1011011
011101000 1 011011
0111010001 0 11011
01110100010 1 1011
011101000101 1 011
0111010001011 0 11
01110100010110 1 1
011101000101101 1
-----------------------------------
012334444556778 9 987665555443221
The sum of all nibbles ends up in the highest nibble of the truncated 64-bit result. If the popcount is below 16, none of the nibbles will overflow. Unfortunately, a popcount of 16 is possible as a corner case, so I ended up with the following:
((v & 0x1111111111111111) * 0x0111111111111111 >> 60) + (v & 1)
This seems to be faster than __builtin_popcount, especially on older
ARM Cortex CPUs.
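Expressed in portable C, the expression above might be wrapped like this; mask_popcount is a hypothetical helper name, and v is assumed to be the 64-bit shrn mask with each nibble equal to 0x0 or 0xF:

```c
#include <stdint.h>

/* Counts the set elements in a Neon-style nibble mask, where v holds
 * one hex digit per vector element, each 0x0 or 0xF. Isolating the
 * low bit of every nibble and multiplying by 0x0111... sums nibbles
 * 1..15 into the top nibble; adding (v & 1) covers nibble 0, so the
 * 16-element corner case can't overflow a nibble. */
static unsigned mask_popcount(uint64_t v) {
    v &= 0x1111111111111111ULL;
    return (unsigned)(((v * 0x0111111111111111ULL) >> 60) + (v & 1));
}
```

For the example mask 0x0FFF0F000F0FF0FF this returns 9, the nibble sum from the long multiplication above.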
Results
Here are the results from a synthetic benchmark repeatedly scanning 10MB of data on several of my desktops, notebooks and smartphones. All numbers in GB/s.
Intel AVX (SSE)
- Core i5-14600, 5.2 GHz
- Core i5-4590T, 3.0 GHz
The AVX benchmark was compiled with -march=x86-64-v3 and the SSE
benchmark with -march=x86-64-v2.
i5-14600 i5-4590T
xmlScan: 50.5 (29.4) 14.6 (9.0)
xmlScanLines: 36.0 (15.4) 10.8 (4.8)
xmlScanUtf8: 20.3 (11.8) 6.8 (3.7)
xmlScanLinesUtf8: 14.9 (7.5) 5.4 (2.7)
ARM Neon
- Apple Macbook Air M2 (Apple M2, 3.5 GHz)
- Google Pixel 10 Pro XL (Cortex-X4, 3.78 GHz)
- Huawei P20 Pro (Cortex-A73, 2.4 GHz)
M2 X4 A73
xmlScan: 30.3 27.7 3.5
xmlScanLines: 14.9 15.7 2.5
xmlScanUtf8: 11.6 7.9 1.1
xmlScanLinesUtf8: 6.8 5.8 0.9
Performance on ARM Neon is hampered by the narrow 128-bit SIMD
registers and the lack of instructions like pmovmsk and popcnt.
But it is well known that Apple Silicon can make up for this by executing
an extraordinarily large number of instructions per cycle. The
Cortex-X4 performance core is remarkable as well, even beating
Apple's M2 in one of the benchmarks.
On modern platforms, scanning with UTF-8 validation and line/column accounting achieves speeds of 5-15 GB/s. ASCII text without line tracking can be scanned at 25-50 GB/s.
These improvements mostly apply to files with lots of text data. But large text sections are one of the main reasons for multi-gigabyte XML files. Then there are applications like SVG, which typically have large attribute values containing vector paths and benefit immensely from these optimizations. Here's a real-world example with an 88MB SVG file taken from Wikipedia, demonstrating a 13.8× speed-up with xmllint's built-in benchmark:
$ xmllint --timing --repeat --sax --noout \
Municipalities_of_Italy.svg
100 iterations took 13156 ms
$ build/xmllint --timing --repeat --sax --noout \
Municipalities_of_Italy.svg
100 iterations took 951 ms
Text-heavy and lightly marked-up XML like DocBook files are typically parsed 1.5-2× faster. You can also see 2-3× speed-ups when parsing UTF-8 files with many Unicode characters. The SIMD validator is branchless and operates at GB/s speed regardless of input data. On the other hand, files containing mostly elements and small ASCII text segments won't see any improvements. While there's still some potential for optimization, libxml2 was never designed to be the most performant XML parser.