Scanning XML at gigabytes per second

I recently added SIMD acceleration to libxml2-ee, making it possible to parse and serialize at least some XML (and HTML) files at gigabytes per second. The changes speed up scanning of text data, for example normal text, attribute values and the contents of comments or CDATA sections. When parsing such constructs, we have to scan the input until we find certain terminating characters like < or & and make sure that the scanned part is valid. When serializing, some values must be escaped and we have to stop at a similar set of characters.

These ideas aren't new and build upon work by people like Daniel Lemire.

I also implemented the algorithm from his paper Validating UTF-8 In Less Than One Instruction Per Byte to validate UTF-8 using SIMD instructions.

Finding stop bytes

Finding the end of character data is straightforward. Depending on what we are currently parsing, we have to consume all input until we find one of several terminating characters. These can include the markup delimiters < and &, the quote character ending an attribute value, or the characters terminating a comment or CDATA section.

When serializing, we also have to stop at certain characters that have to be escaped. We ignore UTF-8 for now and only concern ourselves with simple ASCII bytes. What we are looking for is a SIMD version of strcspn in C or std::string::find_first_of in C++.
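As a scalar baseline, strcspn already expresses exactly this operation. A minimal sketch, with an illustrative stop set:

```c
#include <string.h>
#include <stddef.h>

/* Scalar baseline: length of the text run before the first '<', '&'
 * or '\r' (an illustrative stop set). The SIMD code computes the same
 * result, 16 or 32 bytes at a time. */
static size_t text_length(const char *buf) {
    return strcspn(buf, "<&\r");
}
```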

The basic idea is to load a 16- or 32-byte chunk of bytes into a SIMD register, find the stop bytes, and move a bitmask representing stop bytes back to general purpose registers. If we found a stop byte, we terminate and count the trailing zeros of the bitmask (or the leading zeros, depending on the ISA's bit order) to find the position of the first stop byte in the current chunk.
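With SSE2 intrinsics, one such chunk scan might look like the sketch below. It only handles a two-byte stop set, scan_chunk is a made-up name, and real code would also deal with the tail bytes of the input:

```c
#include <emmintrin.h> /* SSE2 */
#include <stddef.h>

/* Return the index of the first '<' or '&' in a 16-byte chunk,
 * or 16 if the chunk contains no stop byte. */
static size_t scan_chunk(const unsigned char *p) {
    __m128i chunk = _mm_loadu_si128((const __m128i *)p);
    __m128i lt  = _mm_cmpeq_epi8(chunk, _mm_set1_epi8('<'));
    __m128i amp = _mm_cmpeq_epi8(chunk, _mm_set1_epi8('&'));
    int mask = _mm_movemask_epi8(_mm_or_si128(lt, amp));
    if (mask == 0)
        return 16; /* no stop byte in this chunk, load the next one */
    /* bit i of mask corresponds to byte i, so the first stop byte
     * is found by counting trailing zeros */
    return (size_t)__builtin_ctz(mask);
}
```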

The trickiest part is classifying stop bytes, which essentially means looking up a bit in a 256-element bit vector. Such an operation isn't directly supported by current SIMD ISAs, but there's a clever solution that uses the pshufb (Intel) or tbl (ARM) instructions, which look up elements from a 16-byte table. We simply perform two such table lookups, one indexed by the high and one by the low nibble of each input byte, then AND the results together.
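Here's a scalar model of that nibble-table classification. The stop set of '<', '&' and NUL is illustrative; in real code the loop body is a single pshufb or tbl over a whole register:

```c
#include <stdint.h>

/* Each stop byte gets its own bit: '<' (0x3C) -> 1, '&' (0x26) -> 2,
 * NUL (0x00) -> 4. A byte b is a stop byte iff
 * hi_table[b >> 4] & lo_table[b & 15] is nonzero. */
static const uint8_t lo_table[16] = {
    [0x0] = 4, [0x6] = 2, [0xC] = 1,
};
static const uint8_t hi_table[16] = {
    [0x0] = 4, [0x2] = 2, [0x3] = 1,
};

/* Scalar stand-in for one pshufb/tbl round: classify 16 bytes and
 * return a bitmask with bit i set iff byte i is a stop byte. */
static uint16_t classify16(const unsigned char *p) {
    uint16_t mask = 0;
    for (int i = 0; i < 16; i++) {
        uint8_t hi = hi_table[p[i] >> 4];
        uint8_t lo = lo_table[p[i] & 15];
        if (hi & lo)
            mask |= (uint16_t)1 << i;
    }
    return mask;
}
```

Because each stop byte owns a distinct bit, a byte that merely shares a high nibble with one stop byte and a low nibble with another (like '6', 0x36) yields zero after the AND.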

Validating UTF-8

The approach described above is called vectorized classification and is also used to validate UTF-8 with a SIMD algorithm described by Lemire in Validating UTF-8 In Less Than One Instruction Per Byte. The core idea is to use a third nibble from the following byte to classify byte pairs. For pairs of continuation bytes, only a little extra work is required. This ingenious algorithm only takes around 20 SIMD instructions to process a chunk of bytes.

Scanning with UTF-8 validation still turns out to be 2-3 times slower, so libxml2-ee takes a hybrid approach. It starts to scan data as ASCII and switches to the UTF-8 algorithm whenever a non-ASCII byte is encountered.
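A scalar sketch of that dispatch (scan_hybrid is a made-up name; real code processes whole SIMD chunks and jumps into the full validator instead of just setting a flag):

```c
#include <stddef.h>
#include <stdbool.h>

/* Scan until a stop byte, starting on the cheap ASCII path and
 * noting when UTF-8-aware processing would have to take over.
 * Returns the number of bytes scanned. */
static size_t scan_hybrid(const unsigned char *p, size_t len,
                          bool *seen_utf8) {
    size_t i = 0;
    *seen_utf8 = false;
    for (; i < len; i++) {
        unsigned char c = p[i];
        if (c == '<' || c == '&')
            break;              /* stop byte found */
        if (c >= 0x80)
            *seen_utf8 = true;  /* real code: switch to UTF-8 validator */
    }
    return i;
}
```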

Handling line and column numbers

Unlike other, more performance-oriented parsers, libxml2 has to keep track of line and column numbers when parsing, which complicates and slows things down a little. To track line numbers, we have to count all newline bytes; to track columns, we have to count UTF-8 sequences after the last newline. For example, we may end up with a chunk like

UUcNUNUcUcc.NU.N

U: start of UTF-8 sequence
c: continuation byte
N: newline
.: stop byte

To begin, we must ignore the first stop byte and following bytes:

UUcNUNUcUcc.NU.N
UUcNUNUcUcc#####

Now we can count the newlines and add them to the line count. If there were any newlines, we reset the column count to 1. Then we mask off all bytes up to and including the last newline:

UUcNUNUcUcc#####
######UcUcc#####

Now we can count the remaining sequences and add them to the column count.

This mostly involves bit operations like counting leading or trailing zeroes, population count, and creating and applying bit masks. Most of these steps aren't suitable for SIMD processing, so we extract scalar bit vectors from SIMD into general purpose (GP) registers with instructions like pmovmsk.
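The whole accounting step can be modeled with scalar bit operations on such extracted masks. A sketch, where account_chunk is a made-up name and bit i of each mask corresponds to byte i of a 16-byte chunk:

```c
#include <stdint.h>

/* Update 1-based *line and *col from the masks of one 16-byte chunk:
 * stop_mask marks stop bytes, nl_mask newlines, seq_mask the start
 * bytes of UTF-8 sequences. */
static void account_chunk(uint32_t stop_mask, uint32_t nl_mask,
                          uint32_t seq_mask, int *line, int *col) {
    /* 1. Ignore the first stop byte and everything after it. */
    uint32_t keep = stop_mask ? (stop_mask & -stop_mask) - 1 : 0xFFFF;
    nl_mask &= keep;
    seq_mask &= keep;

    if (nl_mask) {
        /* 2. Count newlines; any newline resets the column. */
        *line += __builtin_popcount(nl_mask);
        *col = 1;
        /* 3. Mask off bytes up to and including the last newline. */
        uint32_t last_nl = 31 - (uint32_t)__builtin_clz(nl_mask);
        seq_mask &= ~(((uint32_t)1 << (last_nl + 1)) - 1);
    }

    /* 4. Remaining sequence starts advance the column. */
    *col += __builtin_popcount(seq_mask);
}
```

Fed the masks of the example chunk UUcNUNUcUcc.NU.N, this adds the two newlines before the first stop byte and ends with column 3: the reset to 1 plus the two sequences starting at positions 6 and 8.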

Popcount on ARM Neon

Unfortunately, a pmovmsk equivalent isn't available on ARM Neon. Neither is a popcount instruction in the GP integer domain. With clang or gcc, the popcount built-in uses the SIMD instructions cnt and addv, which requires a SIMD/GP round trip.

To work around the missing pmovmsk, there's a well-known trick involving shrn, which results in a 64-bit mask containing four consecutive zero or one bits for each of the 16 vector elements. In binary, this looks like:

0000111111111111000011110000000000001111000011111111000011111111

Or in hex:

0x0FFF0F000F0FF0FF

If we proceed to isolate the lowest of the four bits, we end up with something like:

0x0111010001011011

With gaps between the bits, the population count can be computed by simply multiplying this value by 0x1111111111111111. Long multiplication in base 16 as illustration:

1111111111111111 * 0111010001011011
-----------------------------------
                  0 111010001011011
                0 1 11010001011011
               01 1 1010001011011
              011 1 010001011011
             0111 0 10001011011
            01110 1 0001011011
           011101 0 001011011
          0111010 0 01011011
         01110100 0 1011011
        011101000 1 011011
       0111010001 0 11011
      01110100010 1 1011
     011101000101 1 011
    0111010001011 0 11
   01110100010110 1 1
  011101000101101 1 
-----------------------------------
  012334444556778 9 987665555443221

The sum of all nibbles ends up in the highest nibble of the truncated 64-bit result. If the popcount is below 16, none of the nibbles will overflow. Unfortunately, a popcount of 16 is possible as a corner case, so I ended up with the following:

((v & 0x1111111111111111) * 0x0111111111111111 >> 60) + (v & 1)

This seems to be faster than __builtin_popcount, especially on older ARM Cortex CPUs.
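The expression can be checked against an ordinary popcount with a scalar sketch (nibble_popcount is a made-up name):

```c
#include <stdint.h>

/* Popcount of a 64-bit value whose set bits sit in the lowest bit of
 * each nibble. The multiply sums nibbles 1..15 into the top nibble;
 * bit 0 is added separately so a popcount of 16 can't overflow. */
static unsigned nibble_popcount(uint64_t v) {
    v &= 0x1111111111111111u;
    return (unsigned)(((v * 0x0111111111111111u) >> 60) + (v & 1));
}
```

For the example mask 0x0111010001011011 from above, this yields 9, matching the highest digit of the long multiplication.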

Results

Here are the results from a synthetic benchmark repeatedly scanning 10MB of data on several of my desktops, notebooks and smartphones. All numbers in GB/s.

Intel AVX (SSE)

The AVX benchmark was compiled with -march=x86-64-v3 and the SSE benchmark with -march=x86-64-v2.

                      i5-14600    i5-4590T

         xmlScan:  50.5 (29.4)  14.6 (9.0)
    xmlScanLines:  36.0 (15.4)  10.8 (4.8)
     xmlScanUtf8:  20.3 (11.8)   6.8 (3.7)
xmlScanLinesUtf8:  14.9  (7.5)   5.4 (2.7)

ARM Neon

                     M2   X4  A73

         xmlScan:  30.3 27.7  3.5
    xmlScanLines:  14.9 15.7  2.5
     xmlScanUtf8:  11.6  7.9  1.1
xmlScanLinesUtf8:   6.8  5.8  0.9

Performance on ARM Neon is hampered by the narrow 128-bit SIMD registers and the lack of instructions like pmovmsk and popcnt. But it is well known that Apple Silicon can make up for this by executing an extraordinarily large number of instructions per cycle. The Cortex-X4 performance core is remarkable as well, even beating Apple's M2 in one of the benchmarks.

Scanning with UTF-8 validation and line/column accounting achieves speeds of 5-15 GB/s on modern platforms. ASCII text without line numbers can be scanned at 25-50 GB/s.

These improvements mostly apply to files with lots of text data. But large text sections are one of the main reasons for multi-gigabyte XML files. Then there are applications like SVG, which typically have large attribute values containing vector paths and benefit immensely from these optimizations. Here's a real-world example with an 88MB SVG file taken from Wikipedia demonstrating a 13.8× speed-up with xmllint's built-in benchmark:

$ xmllint --timing --repeat --sax --noout \
    Municipalities_of_Italy.svg
100 iterations took 13156 ms
$ build/xmllint --timing --repeat --sax --noout \
    Municipalities_of_Italy.svg
100 iterations took 951 ms

Text-heavy and lightly marked-up XML like DocBook files are typically parsed 1.5-2× faster. You can also see 2-3× speed-ups when parsing UTF-8 files with many Unicode characters. The SIMD validator is branchless and operates at GB/s speed regardless of input data. On the other hand, files containing mostly elements and small ASCII text segments won't see any improvements. While there's still some potential for optimization, libxml2 was never designed to be the most performant XML parser.