Everything I know about iconv
The iconv function converts between character encodings. It is part
of the POSIX specification, making it the obvious choice on UNIX-like
systems. In this article, I'll discuss some of the gotchas, review
several implementations of this API and present the results of fuzz
testing most of them.
Basic usage
The iconv API is simple and straightforward. You create a so-called
conversion descriptor with iconv_open, convert character data
piece by piece using iconv and finally dispose of the descriptor
with iconv_close. Conversion can be stateful with some encodings,
so in addition to the input and output encoding, the conversion
descriptor can hold some state. The signature of iconv is defined
by POSIX
as follows:
size_t iconv(iconv_t cd,
char **restrict inbuf, size_t *restrict inbytesleft,
char **restrict outbuf, size_t *restrict outbytesleft);
Unfortunately, there are some older UNIX systems that use a const char
pointer for inbuf. This is technically more correct but makes it more
difficult to write cross-platform code. While char * can be converted
to const char * implicitly, this doesn't work for double pointers, so
you'll need some kind of work-around if you want your code to compile
cleanly on legacy systems. Some projects use feature tests during
configuration, trying to detect what kind of pointer is used.
But there's another very simple solution which I'd recommend instead.
Just cast the inbuf pointer to void *. Void pointers convert
implicitly to all other object pointer types, so the following should compile
without warnings:
result = iconv(cd,
(void *) inbuf, inbytesleft,
outbuf, outbytesleft);
You'll lose a tiny bit of type safety, but if you care about that, you'll
probably use a const char * for the input buffer to begin with and need
an explicit cast anyway.
Conversion errors
If you can be sure that the conversion will always succeed, there's not much more to consider, but most of the time you have to handle input which might be impossible to convert. There are two reasons why a conversion can fail.
- Invalid input byte sequences: In multi-byte encodings, there are typically some byte sequences with no defined meaning. In UTF-8, for instance, a single 0xFF byte is invalid. Another example is an invalid continuation byte, as in 0xF1 0x80 0x80 0x40. Even in single-byte encodings, input can be invalid if there's a range of bytes that isn't mapped to any character.
- No suitable mapping in the output encoding: Conversion can fail if an input character cannot be represented in the output encoding, for example when trying to convert non-ASCII characters like ä to ASCII.
In the first case, iconv will return an error and set
errno to EILSEQ. But there's a major flaw in the iconv API as
defined by POSIX. There's simply no way to detect the second kind of
conversion error. Before POSIX 2024, the specification simply
said:
If iconv() encounters a character in the input buffer that is valid, but for which an identical character does not exist in the target codeset, iconv() performs an implementation-dependent conversion on this character.
POSIX 2024 added indicator suffixes
like //IGNORE or //TRANSLIT
which allow controlling this implementation-defined behavior a bit
more, but there's still no way to make the conversion stop on such
errors. In many situations you don't really care. But unlike the first
case of invalid byte sequences, the second case of unmappable
characters can sometimes be handled gracefully. In XML or HTML, for
example, you can use numeric character references and a character like
ä could be escaped as &#xE4;.
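As an illustration, such an escaping fallback could be built on a small helper that decodes one UTF-8 sequence and formats it as a numeric character reference. This is only a sketch with a made-up function name, not how any particular library does it, and it assumes the sequence was already validated (iconv reported it as valid but unmappable):

```c
#include <stddef.h>
#include <stdio.h>

/*
 * Hypothetical helper: decode one UTF-8 sequence at p and write a
 * numeric character reference to out. Returns the number of input
 * bytes consumed, or -1 on error. Assumes the input was already
 * validated, so continuation bytes aren't rechecked.
 */
int escape_utf8_char(const unsigned char *p, size_t len,
                     char *out, size_t outsize)
{
    unsigned cp;
    int seqlen;

    if (len >= 1 && p[0] < 0x80) {
        cp = p[0];
        seqlen = 1;
    } else if (len >= 2 && (p[0] & 0xE0) == 0xC0) {
        cp = ((p[0] & 0x1Fu) << 6) | (p[1] & 0x3Fu);
        seqlen = 2;
    } else if (len >= 3 && (p[0] & 0xF0) == 0xE0) {
        cp = ((p[0] & 0x0Fu) << 12) | ((p[1] & 0x3Fu) << 6) | (p[2] & 0x3Fu);
        seqlen = 3;
    } else if (len >= 4 && (p[0] & 0xF8) == 0xF0) {
        cp = ((p[0] & 0x07u) << 18) | ((p[1] & 0x3Fu) << 12) |
             ((p[2] & 0x3Fu) << 6) | (p[3] & 0x3Fu);
        seqlen = 4;
    } else {
        return -1;
    }

    int n = snprintf(out, outsize, "&#x%X;", cp);
    if (n < 0 || (size_t) n >= outsize)
        return -1;
    return seqlen;
}
```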
On the positive side, GNU implementations like glibc or libiconv
ignore the POSIX spec and stop the conversion with EILSEQ on
unmappable characters unless an indicator suffix was provided.
Some implementations like FreeBSD also adopt the GNU behavior
for compatibility. Other implementations like musl focus on strict
adherence to the POSIX standard, even if this makes it impossible
to handle unmappable characters gracefully.
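In code, a wrapper around iconv might classify the possible errno values like this. The enum and function names are made up, and the comment on EILSEQ only holds for GNU-style implementations:

```c
#include <errno.h>
#include <iconv.h>
#include <stddef.h>

/*
 * Hypothetical classification of iconv errors. On GNU-style
 * implementations (glibc, GNU libiconv), EILSEQ covers both invalid
 * input and unmappable characters; strictly POSIX implementations
 * only report the former and substitute something for the latter.
 */
typedef enum {
    CONV_OK,        /* conversion completed */
    CONV_BAD_INPUT, /* invalid sequence (or unmappable char on GNU) */
    CONV_TOO_SMALL, /* output buffer full; grow it and call again */
    CONV_TRUNCATED  /* incomplete multi-byte sequence at end of input */
} conv_status;

conv_status convert_chunk(iconv_t cd,
                          char **in, size_t *inleft,
                          char **out, size_t *outleft)
{
    if (iconv(cd, (void *) in, inleft, out, outleft) != (size_t) -1)
        return CONV_OK;

    switch (errno) {
    case E2BIG:
        return CONV_TOO_SMALL;
    case EINVAL:
        return CONV_TRUNCATED;
    default: /* EILSEQ */
        return CONV_BAD_INPUT;
    }
}
```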
Another issue is that even with the GNU quirk, there's no way to distinguish between the two kinds of error. But most of the time, you're either converting from legacy encodings to Unicode, where only the first kind of error can happen, or from validated Unicode to legacy encodings, where you'll only encounter the second kind of error.
At least, the Austin Group responsible for the POSIX standard is aware of the issue. There's a defect report and some hope that this issue will be addressed in future versions of the specification.
Quadratic behavior with erroneous input data
Regrettably, there's another issue with implementations that follow the GNU behavior. It seems that they always scan the whole input string, no matter whether a conversion error is detected later. Now if there's an error near the beginning of the string, you have to reprocess the remainder. If there are many unmappable characters, this can lead to quadratic behavior. You can find a test program in this libxml2 issue report. If I remember correctly, glibc, GNU libiconv and BSD iconv behave in this manner.
Review of iconv implementations
If you're serious about fuzz testing, you might have encountered situations where your fuzzer finds bugs not in your own code, but in other libraries you use. I ran into this issue several times when fuzz testing libxml2 which uses iconv to handle XML files with legacy encodings. On various platforms, the fuzzer found bugs which were actually in the system's iconv code.
This seemed like a clear sign that many iconv libraries were never fuzz
tested before, and I decided to fuzz some iconv implementations over a
couple of weekends.
The iconv API works on char buffers and it's trivial to write a
basic fuzzer. One important detail is that such a fuzzer should not
only invoke iconv but also verify that certain postconditions hold.
Most importantly, the number of bytes by which the in/out
buffer pointers are advanced should match the changes to the length
values.
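Such a postcondition check might look like the following sketch. The encodings are arbitrary examples; a real fuzzer would also vary them and feed the data in randomly sized chunks:

```c
#include <assert.h>
#include <iconv.h>
#include <stddef.h>
#include <string.h>

/*
 * Sketch of a fuzz target that checks iconv's postconditions: the
 * amount by which the buffer pointers advance must match the change
 * in the corresponding length values.
 */
int fuzz_one_input(const char *data, size_t size)
{
    iconv_t cd = iconv_open("UTF-8", "ISO-8859-1");
    if (cd == (iconv_t) -1)
        return -1;

    char inbuf[256], outbuf[1024];
    size_t inlen = size < sizeof inbuf ? size : sizeof inbuf;
    memcpy(inbuf, data, inlen);

    char *in = inbuf, *out = outbuf;
    size_t inleft = inlen, outleft = sizeof outbuf;

    iconv(cd, (void *) &in, &inleft, &out, &outleft);

    /* Postconditions: pointer advancement must match the length deltas. */
    assert((size_t) (in - inbuf) == inlen - inleft);
    assert((size_t) (out - outbuf) == sizeof outbuf - outleft);

    iconv_close(cd);
    return 0;
}
```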
In this endeavor, I learned a bit about several iconv implementations and I'll share my impressions here.
glibc
This is probably the most-used implementation around. It is incredibly complex and one could even say overdesigned. Like a few other implementations, it makes use of shared objects loaded at runtime which complicates the code quite a bit. To support legacy encodings, you often need large tables to map codepoints to and from Unicode. These tables can be several hundred KB in size, so I can see the motivation to load these tables on demand. But if I were to design such a system, I'd keep the conversion code which is comparably small in the main binary and only separate the data tables.
The whole subsystem is named “gconv” and is highly configurable. In theory, you could add your own conversion modules on top of the ones provided by glibc but I doubt that anyone needs this feature, at least not today. It might have seemed like a good idea 20 years ago when Unicode wasn't as dominant. You can also disable encodings and tailor transliterations at runtime using configuration files, but this seems overkill to me and increases the attack surface for no good reason.
I didn't actually fuzz test the glibc implementation because it's almost impossible to extract the iconv code and make it run in a standalone way. Nevertheless, I found a functional bug when fuzzing libxml2 which I reported here.
GNU libiconv
Somewhat surprisingly, GNU libiconv is a completely separate code base and mostly unrelated to the glibc implementation. It's designed for portability and often used to make POSIX code run on Windows. Other than that, its design is similar to the glibc implementation.
When fuzz testing, I found a minor memory safety issue which was introduced only recently. It didn't receive a CVE ID, but I don't have a problem with that.
Citrus iconv (BSDs)
The iconv implementation used in FreeBSD, NetBSD and other BSDs
originates from the Citrus project,
started in
2001.
Like the glibc implementation, it is overly complex for my taste,
mostly due to a design relying on loadable modules. It takes countless
indirections until a call to iconv finally starts to convert
input bytes.
Fuzz testing discovered multiple memory safety issues which I reported privately to FreeBSD and NetBSD in January 2025. The issues were acknowledged but are still unfixed as of March 2026.
Apple
Before macOS 14.0 (Sonoma), Apple shipped GNU libiconv with their
desktop operating system. After some utility programs like iconv(1)
switched the license to GPLv3 in 2009, Apple stopped updating to
newer versions and was stuck with version 1.11, released in 2006.
Starting with Sonoma, they replaced this 17-year-old code with a
completely new version based on the BSD Citrus code.
They made multiple changes to the BSD code, most likely to fix compatibility issues with GNU libiconv, but this didn't stop users from complaining about incompatibilities.
After I updated to macOS 14, I also ran into bugs when fuzz testing libxml2 on macOS. One of them was a memory safety issue which I reported to Apple's bug bounty program. About a year later, I received a reply that the issue wasn't deemed severe enough, so I didn't receive a bounty. Nevertheless, they assigned CVE-2024-27811 and credited me in the release notes of macOS 14.5.
These discoveries were more or less accidental. When I started to fuzz test their implementation directly, I found several other memory safety and DoS issues caused by Apple additions on top of the defects in the original BSD code. Since I didn't expect any rewards from their bug bounty program and I didn't want to wait another year, I made my findings public.
Interestingly, you can find some code to support fuzz testing in
Apple's sources,
so at least some people at Apple are aware that fuzzing is absolutely
crucial if you want to write high-quality software. Most of the issues I
found involve indicator suffixes like //IGNORE, so I guess that's
what they missed.
As mentioned above, I also found a few non-security, functional issues in their implementation, probably inherited from BSD. This regularly interfered with testing libxml2, so I simply switched to GNU libiconv from Homebrew.
musl
Reading musl's iconv implementation felt like a breath of fresh air: a single, small file which is almost self-contained. musl can perform stateless conversion without any memory allocations by storing the source and destination encodings in a few bits of the conversion descriptor. Rich Felker, the musl author, also went to incredibly great lengths to optimize for data table size.
Of course it must be said that, compared to other libraries, musl only supports a comparably small number of encodings. But all encodings from the WHATWG Encoding Standard are supported. Some of them lack bidirectional support, though.
Fuzz testing the code led to the discovery of CVE-2025-26519, an almost arbitrary write primitive, but I must admit that it was Rich who discovered how serious the issue was. I still suck at offensive security research.
Bionic
Just for fun, I also tested the iconv code of the Bionic C library used in Android. This is an extremely limited implementation, only supporting ASCII, UTF-8 and UTF-16. I discovered a minor memory safety issue when using an indicator suffix. The issue was reported to Android's bug bounty program and closed without reward.
I didn't really expect to receive a bounty for a minor issue, but what
puzzled me was that the issue was closed because I didn't “demonstrate
reachability.” As an OSS maintainer, I received a few bogus bug reports
myself where somebody calls an internal function with the wrong
arguments and without understanding the contract of that function,
thinking they found a security issue. But the rules of the reward
program only require a “POC that uses an API that is callable in
production to trigger the issue”
and iconv is advertised as a public API entry point.
I asked Google to clarify their program rules but didn't receive
a reply. This wasn't the first time that I felt that Google makes up
the rules of their bug bounty programs as they go. The least they
could do is to spell out the rules more clearly.
Alternatives to iconv
There's an excellent and comprehensive blog post by JeanHeyd Meneide titled The Wonderfully Terrible World of C and C++ Encoding APIs (with Some Rust). It has some nice tables at the bottom of the article showing the features of several encoding APIs. What you typically need, at least if you're working with something like XML, HTML or MIME email, is support for legacy encodings. Then you'll probably need an API that can handle streaming conversion, which requires updating the input/output pointers. If you also require a C API, this leaves the following options:
- ICU
- iconv
- encoding_c (C bindings to encoding_rs written in Rust)
I never used encoding_c which is relatively new, but I have some experience with ICU. On the plus side, it avoids the iconv gotchas mentioned above.
On the negative side, ICU is a huge library (35 MB installation
footprint). The API is more complicated, especially if you
have to use “pivot buffers,” and the documentation is a bit lacking.
Through trial and error as well as fuzz testing, I found that
ucnv_convertEx can return the following error codes, some of
which aren't documented at all:
- U_TRUNCATED_CHAR_FOUND
- U_BUFFER_OVERFLOW_ERROR
- U_INVALID_CHAR_FOUND
- U_ILLEGAL_CHAR_FOUND
- U_ILLEGAL_ESCAPE_SEQUENCE
- U_UNSUPPORTED_ESCAPE_SEQUENCE
- U_MEMORY_ALLOCATION_ERROR
While you might think that a project sponsored, supported and used by IBM is properly maintained, I'm still waiting for a fix to a segfault that I reported more than three years ago. If you're curious, this issue was found accidentally by a third party when fuzzing libxml2.
In conclusion, the state of affairs is far from ideal, even in 2026. But there's hope that some necessary amendments are made to the POSIX spec and that all the bugs I discovered will eventually be fixed.