We all know the absolute minimum every software developer must know about Unicode. We aren’t bad citizens of the internet, spreading malice and encoding errors willy-nilly around the world. We’re good people, just trying to get by.
But knowing the minimum is starting to bother us. And the more we think about UTF-8 and text encoding, the more we realize that we don’t “get it” at a messy, squishy level. We’ve read about it, sure, but we’ve never actually wrapped our hands around it and felt the blood pump through its veins.
So that’s what we’re going to do in this post: grab UTF-8 by the bytes and squeeze some real understanding out of it.
Well alright how
|Lower bound||Upper bound||Pattern (binary)|
And I wondered if it would be possible to write a decoding function that looked like this:
byteSequence ["0xxxxxxx"] <|> byteSequence ["110xxxxx", "10xxxxxx"] <|> byteSequence ["1110xxxx", "10xxxxxx", "10xxxxxx"] <|> byteSequence ["11110xxx", "10xxxxxx", "10xxxxxx", "10xxxxxx"]
And it turns out that it is. Sort of.
For whatever reason, the idea of writing a function like this – basically a transliteration of the UTF-8 spec – got me excited. When I sat down to write this post, I didn’t know how I was going to do it. But I was determined to figure it out.
This post chronicles that journey.
If you too are excited, or just a little bit curious, come dream with me. Along the way, we’ll:
- learn quite a bit about UTF-8
- use Attoparsec, a very fun parser combinator library
- learn what “parser combinator” means
- watch me struggle to write Haskell
Or you could skip all that noise and cut to the code. That link includes getting started instructions if you want to follow along at home. Otherwise just sit back, relax, and let’s do this thing together.
A brief recap of UTF-8
You’ve read that Joel on Software post, right? The one I linked to in the first paragraph? I believe you. But just in case, let’s review:
- Unicode 7.0 defines 252,603 different codepoints
- less than half of these codepoints are what we’d call “characters”
- they could be diacritics (like ¨ or ´), one stroke of a character in a fancy writing system, or symbols
- the first published version of this post got that number very wrong because the character vs. codepoint distinction is confusing
- a codepoint is an abstract number, not a physical byte pattern
- there are multiple ways to represent codepoints in concrete ways, called encodings
- one such encoding is UTF-8
- there are others, UTF-16 being pretty popular
- despite the name, UTF-8 takes anywhere from 8 to 32 bits to represent a single codepoint
- UTF-8 has a bunch of bytes
- the order of the bytes is quite important
- …other stuff??
That’s all I got. Let’s learn more.
I showed you this table before:
|Lower bound||Upper bound||Pattern (binary)|
Now let’s talk about what it means.
First, note that
0xxxxxxx is not hexadecimal, but a bit pattern that means the “control” bit
0 followed by seven “data” bits, which could be anything.
The four rows in the table correspond to the four flavors of UTF-8 codepoints, distinguished by the number of bytes you need to encode them (one to four).
“Lower bound” and “upper bound” tell us which codepoints live in which flavor. So if we want to encode
U+2603 (☃), we can see that we need to use the three-byte form, because
0x7FF < 0x2603 < 0xFFFF
The single byte codepoints all start with
0. The multi-byte codepoints start with a “leader” byte followed by one or more “continuation” bytes. Leader bytes start with
11 and tell us how many continuation bytes to expect, and every continuation byte starts with
Boring UTF-8 details Fun UTF-8 facts
Look at the upper bound on the four-byte codepoint. It’s
0x10FFFF, even though we could represent numbers all the way up to
0x1FFFFF (221 - 1), since we have twenty-one data bits to work with. So why does it end there?
Because it turns out UTF-16, another very popular encoding scheme, can’t represent codepoints that big. So we say that anything bigger than that is Strictly Not Allowed in UTF-8 in order to make UTF-8 and UTF-16 fully compatible: we’ll never be able to write something in UTF-8 that we can’t transcode into UTF-16.
It turns out this limitation doesn’t matter in practice:
0x10FFFF (1,114,111 in decimal) is about four and a half times as many representable codepoints as there are codepoints currently defined in the latest Unicode spec, so we’re not in danger of running out of bits any time soon.
If we didn’t care about UTF-16 at all, we could keep going all the way up to a leader byte of
11111111 followed by seven continuation bytes without changing the underlying scheme. We could represent forty-two bits of codepoints! But we don’t do that, because, well, we don’t need to.
We’d also lose the interesting property that UTF-8 doesn’t allow the bytes
0xFF to appear anywhere, which makes it impossible to confuse UTF-16 text (which uses those bytes in its BOM) with UTF-8. But then again, this is not-caring-about-UTF-16-land that we’re dreaming in.
Show me the bits
Alright, fun sidetrack, but there’s a lotta hexadecimal floating around and it’s getting hard to keep it all straight. I wanna see some real world examples. I wanna see some binary.
So let’s look at the string
naïveté. In UTF-8, we’d write that as…
Well, actually, we have multiple choices.
ï could be the single codepoint
U+00EF (LATIN SMALL LETTER I WITH DIAERESIS) or it could be
U+0069 (LATIN SMALL LETTER I) followed by
U+0308 (COMBINING DIAERESIS). Same with
é. Code points ≠ characters, and we have to keep that in mind as we write this parser.
I chose to represent those characters as the following codepoints, for no particular reason:
n a i (combining diaeresis) v e t (e with acute)
Which we can laboriously encode as the following UTF-8 (I’m waving my hands here; this is a post about decoding):
0 1101110 (n) 0 1100001 (a) 0 1101001 (i) 110 01100 (¨ byte 1) 10 001000 (¨ byte 2) 0 1110110 (v) 0 1100101 (e) 0 1110100 (t) 110 00011 (é byte 1) 10 101001 (é byte 2)
I put a space between the “control” bits and the “data” bits, because that’s the easiest way to set them apart with markdown. And we can see that the one-byte characters, the “normal” ASCII ones, all begin with a
0. Both of the two-byte codepoints begin with
110, which tells us to expect one subsequent continuation byte beginning with
Look, you get it. You can see it right there. Let’s get to the code.
It sounds like an odd language choice, but Haskell is actually great for bit twiddling. You don’t have to worry about
int being the wrong size when you compile on a different platform, or accidentally shifting a signed value and tearing your hair out in the resultant mess. No. Haskell has your back. Haskell knows what you want.
But Haskell doesn’t give you a binary literal, and I didn’t want to convert those lovely binary representations up there to hex or something by hand, so the first thing I did was write a
readByte function that can turn strings like
Word8, the Haskell equivalent of
unsigned char or
readByte :: String -> Word8
You read that as “
readByte is a function that takes a
String and returns a
Word8.” It’s called the type declaration, and we’ll be seeing a lot of those, because in Haskell country types are the benevolent monarchs of the land.
I’m not including my implementation of this function here because it isn’t that important: we want to focus on the parsing part. You can always look at the full code if you’re curious how I did it.
Now let’s sanity check that it works:
import BasePrelude main :: IO () main = do let naïveté = readByte <$> ["01101110", "01100001", "01101001", "11001100", "10001000", "01110110", "01100101", "01110100", "11000011", "10101001"] print naïveté
It does print
[110,97,105,204,136,118,101,116,195,169], as we’d expect, but… it’s a little crazy.
readByte over a
[String], giving us a
[Word8] – a linked list of bytes. As Real Human Haskell Developers, we realize that that’s not a great representation for our needs: we’re not about to parse a linked list here, so let’s turn that into a
ByteString (basically, an
unsigned char *) with a nice helper:
import Data.ByteString (ByteString) import qualified Data.ByteString as BS readBytes :: [String] -> ByteString readBytes = BS.pack . fmap readByte
And then we’ll use that instead:
main :: IO () main = do let naïveté = readBytes ["01101110", "01100001", "01101001", "11001100", "10001000", "01110110", "01100101", "01110100", "11000011", "10101001"] print naïveté
If all is well, we’ll see:
ByteString is not encoding aware, and it doesn’t know what to do with those non-ASCII bytes. That’s what we’re here for!
The actual parsing
We’re going to parse
ByteStrings, but we’re going to give the output back as a list of
Word32s, because thirty-two bits is more than enough to encode every Unicode codepoint and there’s no
Word24 in the standard library.
So at the very end of this, we want a function that converts a
ByteString to a
[Word32]. So let’s write that:
parseUtf8 :: ByteString -> [Word32]
Wait, what about invalid input? We could make it a
Maybe [Word32], but let’s say it’ll give us an error message in that case, because that just so happens to be what Attoparsec does:
parseUtf8 :: ByteString -> Either String [Word32]
Ah, but wait. We sorta know some other stuff, too. Like that we want to make an Attoparsec parser to do the parsing (as that is the point of this blog post):
import Data.Attoparsec.ByteString utf8Parser :: Parser [Word32]
And we can guess that the easiest way to write that is to make a parser for an individual codepoint.
codePointParser :: Parser Word32
Cool. Now let’s… implement them:
parseUtf8 = parseOnly utf8Parser utf8Parser = manyTill' codePointParser endOfInput codePointPars--
We just used our first parser combinator! But we haven’t talked about what that means yet.
Parser combinators are functions that take smaller parsers and return new, slightly bigger parsers. They are the glue that allows us to build sophisticated parsers out of small, independent pieces.
Before I heard of parser combinators, I thought of parsers as either:
- a big scary thing that a “parser generator” creates for you from some arcane specification language
- a one-off “recursive descent” (whatever that means) parentheses matcher that works pretty well as long as you never need to change it
- some horrible rat’s nest of regular expressions that you aren’t proud to admit that you wrote
Now I think of them as simple tools. Nothing scary. Nothing fragile. You can build parsers out of a bunch of smaller parsers, and it’s parsers all the way down. It’s a beautifully simple model for solving problems like this.
If we look at the type of
manyTill' :: MonadPlus m => m a -> m b -> m [a]
…we are confused and afraid. But that’s because it’s very general. If we specialize it to only talk about the types we’re using in
manyTill' :: Parser Word32 -> Parser () -> Parser [Word32]
…and fill in the type variables, we can see that it takes a
Parser Word32, a
Parser of something else, and returns a
Parser [Word32]. It’s a function that creates this first-class
Parser thing out of two smaller
Parser things. It’s a parser combinator. We did it!
The second argument,
endOfInput, is a
Parser (), which means that it doesn’t parse any useful data; it can only tell us whether it succeeds or fails. Since
manyTill' just runs its first parser over and over until its second parser succeeds, this is all we need.
manyTill' is one of many built-in combinators that Attoparsec provides us. It’s fun to read the docs to get an idea of some common things you can do, but you’re not limited to that list – before too long we’ll be learning how to define our own combinators!
But first let’s get back to the code.
How do we write
codePointParser is, well, kind of the whole point here. That’s where the magic happens. But how do we write it?
The way I want to write it is like this:
codePointParser :: Parser Word32 codePointParser = byteSequence ["0xxxxxxx"] <|> byteSequence ["110xxxxx", "10xxxxxx"] <|> byteSequence ["1110xxxx", "10xxxxxx", "10xxxxxx"] <|> byteSequence ["11110xxx", "10xxxxxx", "10xxxxxx", "10xxxxxx"]
And fortunately, that totally works. Yes, we can make it shorter with a helper for the continuation bytes, and we will – but for now, bask with me in the explicit glory of the code as it now stands.
What is that weird bird you drew
<|> as “or”. We want to match this byte sequence or that byte sequence or this other one.
It’s a parser combinator (surprise) that takes a parser on the left, a parser on the right, and gives you a parser that tries the left and, if it fails, tries the right. Simple.
Okay cool what about
I want to start writing it – I really do – but before we dive in I want to be very crystal clear about the work it’s going to be doing. Because there’s a very important aspect of UTF-8 that we haven’t mentioned before, and I didn’t really understand before doing this exercise:
When we have those “data bits”, like in
110xxxxx 10xxxxxx, how do we interpret those? Do we just glue those
xs together and treat it like an eleven-bit number? Or do we add 128 to that, since we know that the first 128 codepoints are already covered by the
Another way to phrase that: if I write
110000000 10000001, does that represent the same codepoint as
00000001? And if so, is that… allowed?
Well, it turns out that the answer is yes it does and no it most certainly is not.
We don’t have to do any math. That’s an eleven-bit number and we treat it exactly as it stands. Which means that it is in theory possible to encode something like
11000000 10001000, and this phenomenon is called an overlong encoding.
Overlong encodings are very unambiguously explicitly crystal clearly not allowed. The UTF-8 spec has these scary words to say about it:
Implementations of the decoding algorithm above MUST protect against decoding invalid sequences. For instance, a naive implementation may decode the overlong UTF-8 sequence
C0 80into the character
U+0000, or the surrogate pair
ED A1 8C ED BE B4into
U+233B4. Decoding invalid sequences may have security consequences or cause other problems. See Security Considerations (Section 10) below.
Section 10 contains this gem:
Another example might be a parser which prohibits the octet sequence
2F 2E 2E 2F(
"/../"), yet permits the illegal octet sequence
2F C0 AE 2E 2F. This last exploit has actually been used in a widespread virus attacking Web servers in 2001; thus, the security threat is very real.
Well, alright. We’re going to have to handle this, because this isn’t “Let’s write a dumbed down UTF-8 parser because the real world is too hard.” This is “Parser combinators, hell yeah; let’s crank the tunes and start conforming to an IETF standard.”
Which means that we need to modify what we wrote above to support the concept of overlong encodings.
It won’t be so bad! And we’ll get to, you know, actually parsing before you know it. First, though, I’m going to clean up the code a tiny bit:
codePointParser :: Parser Word32 codePointParser = byteSequence ["0xxxxxxx"] <|> multibyte "110xxxxx" 1 <|> multibyte "1110xxxx" 2 <|> multibyte "11110xxx" 3 where multibyte leader count = byteSequence (leader : replicate count "10xxxxxx")
That’s the same as what we had before, just with less repetition. Now let’s protect against overlong encodings:
codePointParser = byteSequence ["0xxxxxxx"] <|> overlong 0x7F (multibyte "110xxxxx" 1) <|> overlong 0x7FF (multibyte "1110xxxx" 2) <|> overlong 0xFFFF (multibyte "11110xxx" 3) where multibyte leader count = byteSequence (leader : replicate count "10xxxxxx")
Then define this
overlong helper we keep using:
overlong :: Word32 -> Parser Word32 -> Parser Word32 overlong m parser = checkedParser parser (> m) "overlong codepoint!"
Now, we defined this in terms of
checkedParser, but that isn’t something that Attoparsec provides us. We’re going to have to define it ourselves. It’s going to look something like this:
checkedParser :: Parser a -> (a -> Bool) -> String -> Parser a
Read that as “Given a parser, a predicate, and an error message, give me a new parser that fails when the predicate returns
False.” Alright. Let’s write that.
At first this looks a little daunting. Writing our own parser combinator? It sounds like deep Haskell magic. But it turns out it’s quite simple, once we realize that
Parser is a monad:
checkedParser parser pred msg = do byte <- parser unless (pred byte) (fail msg) return byte
Yeah yeah, I used the M-word. But that’s not so bad, is it? Just nice, normal-looking code. Haskell: it’s not so terrifying™.
This almost seems too easy, though. It looks like we “called” the
parser, but what did we pass it? How does it “know” what to parse?
It’s a semi-complicated topic and I won’t try to explain it here, but I will try to offer some intuition: we aren’t returning a “function that parses” here. We’re returning an honest-to-goodness
Parser. The parser doesn’t have a series of steps, it is a series of steps, which we’ve written out with
Crystal clear? Okay yeah perfect let’s pretend this never happened.
So about that
I guess at some point we’ll have to do some actual UTF-8.
byteSequence :: [String] -> Parser Word32
So we have a list of pattern strings that look something like
"110xxxxx". We want to parse a series of bytes that match that pattern, extracting the data bits each time, then glue them all together (somehow) at the end.
But doing lists of things is hard. So let’s break it down even further, and make a parser that matches just one of these patterns.
bitPattern :: String -> Parser ???
But wait: should this give us a
At first I said “yes.” Then I struggled to write the rest of the code, thought about it a bit more, and realized that just couldn’t work. Because there’s an additional, important piece of information: the number of
xs in the pattern.
We need that information when it comes time to glue these babies back together. If we don’t remember how many bits of each byte are actually significant, we’re gonna end up with a garbled mess with all the bits in the wrong places.
So instead of making
bitPattern return a
Parser Word8, let’s define something new:
type SubByte = (Word8, Int)
Because it’s such a simple type, we can just alias a two-tuple here. Now let’s make a couple helpers:
subZero :: SubByte subZero = (0, 0) pushBit :: Bool -> SubByte -> SubByte pushBit True (b, n) = (setBit (shiftL b 1) 0, n + 1) pushBit False (b, n) = (shiftL b 1, n + 1)
Now we can write the type of
bitPattern :: String -> Parser SubByte
And we can almost write the definition too.
Well get to it then
I will – but first I want to break out of the parser world.
Let’s pretend, for a second, that we’re not using Attoparsec. Because having that “we’re making a parser” thing weighing over our head isn’t helping us right now.
It turns out if we can write a normal function to do this work, we can turn it into a
Parser very easily. So let’s not worry about making a
bitPattern parser; let’s just write a function.
matchByte :: String -> Word8 -> Maybe SubByte
Note that I made it return a
Maybe SubByte, to indicate that the pattern could fail – think
matchByte "0xxxxxxx" (readByte "10101001").
This is another case where I wanna say “the implementation doesn’t matter too much”. There are lots of ways to do this. But since this is, well, the meat of the parser, it would seem silly to skip it.
So here’s how I approached this, with my thought process sketched out here:
- we’re reducing many values (the pattern and the bits) into a single value
- that sounds like a
- but we want to short-circuit if something goes wrong
- short-circuiting makes me think of
- I don’t know how to fold-with-maybe
- but I do know how to Hoogle that
- and now I know how to
matchByte pattern byte = foldM check subZero (zip pattern bits) where bits = testBit byte <$> [7, 6 .. 0] check subByte ('1', True) = Just subByte check subByte ('0', False) = Just subByte check subByte ('x', bit) = Just (pushBit bit subByte) check _ _ = Nothing
That was a lot of code at the same time! So let’s go through it, assuming that
Data.Bits, and gives us
Falsewhen we ask it if the bit at a certain index is set. So
testBit byte <$> [7, 6, .. 0]gives us
[True, False, True, False, False, True, False, False].
- Originally I wrote
range (7, 0)instead of using the goofy list range syntax. But
range (7, 0) == , and I was very sad. This is a good example of a program that typechecks but is nonetheless wrong – it can happen!
- Originally I wrote
zip pattern bitsis
[('0', True), ('x', False), ('x', True), ...]. We just line up the pattern with the actual byte.
subZeroand uses the
checkfunction to advance its value each step of the way
checkrejects if it sees
checkleaves the output unchanged if the pattern is
checkpushes a bit onto the output when the pattern character is an
Great! Now we have a normal function (not a parser) that does what we want. But how do we turn it into a parser?
Documentation to the rescue
Attoparsec ships with two ways to turn functions into
satisfy :: (Word8 -> Bool) -> Parser Word8 satisfyWith :: (Word8 -> a) -> (a -> Bool) -> Parser a
The former just checks a byte against a predicate, and the latter allows us to transform the byte into something else, run a predicate on the result, and then return it if it passes.
Both of these are nice, but neither are quite what we want. For example, we could write something like:
bitPattern :: String -> Parser SubByte bitPattern pattern = satisfyWith (matchByte pattern) isJust
But that wouldn’t work, because it’d give us a
Parser (Maybe SubByte). But we know it’s a
Just! Can’t we unwrap it? No can do,
satisfyWith whispers sinisterly. And sure, we could then
fmap fromJust over the result or something. But wouldn’t it be nice if we just had:
satisfyMaybe :: (Word8 -> Maybe a) -> Parser a satisfyMaybe f = do byte <- anyWord8 maybe (fail "maybe not satisfied") return (f byte)
With this helper,
bitPattern becomes trivial:
bitPattern pattern = satisfyMaybe (matchByte pattern)
THE LAST PIECE
Really! We’re so close! We can parse bytes that match the pattern, and get our
SubBytes back. We just need to glue them together!
So now we return to
byteSequence (remember her?), the only unimplemented function we have left:
byteSequence :: [String] -> Parser Word32 byteSequence patterns = do subBytes <- mapM bitPattern patterns return (foldl mergeSubByte 0 subBytes)
mapM does two things for us in one go: it turns our pattern strings into
Parser SubBytes and it executes each of them in order, handing us back the list of parsed values. But since that’s a list of
SubBytes, we need to glue them together into a single
Word32, which we do with a
fold and a simple helper:
mergeSubByte :: Word32 -> SubByte -> Word32 mergeSubByte whole (byte, bitCount) = shiftL whole bitCount .|. fromIntegral byte
And now we’re done. That’s it. We win. It’s over. We test our parser, and it works!
But it has really bad error messages
It does. In particular, we almost always see
"maybe not satisfied" when we test it on invalid input. This is because we don’t have any control over the backtracking: when something fails with an
overlong error (for example), the alternation in
codePointParser will try parsing the next one. There’s no way to distinguish between “irrecoverable failures” (which we want, in this case) and “failures which should cause backtracking.”
That I know of. It’s totally possible! I’m very new at this stuff. But I’m pretty sure Attoparsec is in the “nothing matters but performance” camp, and if we’re after user-friendly error messages, we should look into Parsec, which is a tiny bit harder to use but quite a bit more flexible.
Hey wait that’s awful
It’s not really so bad. We could, instead of using
fail to track our error messages, actually parse to an
Either String [Word32]. That way error conditions are Real Actual Values and we never need to backtrack once we find a definite failure (as opposed to what Attoparsec thinks of as a failure).
I don’t know if that would actually work when you consider that it should short-circuit on the first failure… but it sounds like fun homework for next time.
In the meantime, though, this has gone on for quite a while. I’d like to kick back, relax, and–
What about that spec thing
Ugh. Specs. Who needs ‘em? Well, we came so close to actually being able to validate UTF-8 that it seems silly to stop here. There are only two things we’re missing, and both are very easy to add:
The definition of UTF-8 prohibits encoding character numbers between
U+DFFF, which are reserved for use with the UTF-16 encoding form (as surrogate pairs) and do not directly represent characters.
And the fact that codepoints from
0x1FFFFF are not valid, because of the UTF-16 restriction we talked about before. So heck, let’s just handle these:
codePointParser :: Parser Word32 codePointParser = byteSequence ["0xxxxxxx"] <|> overlong 0x7F (multibyte "110xxxxx" 1) <|> overlong 0x7FF (multibyte "1110xxxx" 2) <|> checkedParser withMax (not . isSurrogate) "illegal surrogate pair" where multibyte leader count = byteSequence (leader : replicate count "10xxxxxx") fourByteParser = (overlong 0xFFFF (multibyte "11110xxx" 3)) withMax = checkedParser fourByteParser (< 0x110000) "illegal codepoint over 0x10FFFF" isSurrogate w = w >= 0xD800 && w <= 0xDFFF
And now we’re really, truly done.
Why… why did we just do that
Well, for fun! And to learn about parser combinators. And UTF-8. And because we really didn’t think it would take that long when we started writing this blog post.
The end result is certainly not a useful UTF-8 parser. UTF-8 is such a simple format that throwing parser combinators at it is overkill, and
Data.Text comes with a
decodeUtf8' function that behaves much like my
parseUtf8 – except that it has actual error messages, it results in a useful type (
Text instead of
[Word32]), and it doesn’t do ridiculous things like turning bytes into linked lists of bits.
But the end result is hopefully a rather educational parser.
decodeUtf8' is so highly optimized that it’s very difficult to follow what’s going on – which is expected; performance is more important if you actually want to use these functions.
And hopefully it’s something that we could return to, some day, to fix the error messages. Or perhaps we could profile it, and try to measure the performance. Maybe even optimize the generated assembly.
But not today. Today I’m going to go outside.
Here’s some interesting further reading, because if you got this far you clearly have some sort of crippling reading addiction:
- The code for this blog post
- This very approachable trio of articles by Brian O’Sullivan about Attoparsec’s development:
- That Spolsky article you said you already read