We all know the absolute minimum every software developer must know about Unicode. We aren’t bad citizens of the internet, spreading malice and encoding errors willy-nilly around the world. We’re good people, just trying to get by.

But knowing the minimum is starting to bother us. And the more we think about UTF-8 and text encoding, the more we realize that we don’t “get it” at a messy, squishy level. We’ve read about it, sure, but we’ve never actually wrapped our hands around it and felt the blood pump through its veins.

So that’s what we’re going to do in this post: grab UTF-8 by the bytes and squeeze some real understanding out of it.

Well alright how

UTF-8 looks like this:

Lower bound Upper bound Pattern (binary)
0x00000 0x00007F 0xxxxxxx
0x00080 0x0007FF 110xxxxx 10xxxxxx
0x00800 0x00FFFF 1110xxxx 10xxxxxx 10xxxxxx
0x10000 0x10FFFF 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx

And I wondered if it would be possible to write a decoding function that looked like this:

byteSequence ["0xxxxxxx"] <|>
byteSequence ["110xxxxx", "10xxxxxx"] <|>
byteSequence ["1110xxxx", "10xxxxxx", "10xxxxxx"] <|>
byteSequence ["11110xxx", "10xxxxxx", "10xxxxxx", "10xxxxxx"]

And it turns out that it is. Sort of.

For whatever reason, the idea of writing a function like this – basically a transliteration of the UTF-8 spec – got me excited. When I sat down to write this post, I didn’t know how I was going to do it. But I was determined to figure it out.

This post chronicles that journey.

If you too are excited, or just a little bit curious, come dream with me. Along the way, we’ll:

  • learn quite a bit about UTF-8
  • use Attoparsec, a very fun parser combinator library
  • learn what “parser combinator” means
  • watch me struggle to write Haskell

Or you could skip all that noise and cut to the code. That link includes getting started instructions if you want to follow along at home. Otherwise just sit back, relax, and let’s do this thing together.

A brief recap of UTF-8

You’ve read that Joel on Software post, right? The one I linked to in the first paragraph? I believe you. But just in case, let’s review:

  • Unicode 7.0 defines 252,603 different codepoints
  • less than half of these codepoints are what we’d call “characters”
  • they could be diacritics (like ¨ or ´), one stroke of a character in a fancy writing system, or symbols
  • the first published version of this post got that number very wrong because the character vs. codepoint distinction is confusing
  • a codepoint is an abstract number, not a physical byte pattern
  • there are multiple ways to represent codepoints in concrete ways, called encodings
  • one such encoding is UTF-8
  • there are others, UTF-16 being pretty popular
  • despite the name, UTF-8 takes anywhere from 8 to 32 bits to represent a single codepoint
  • UTF-8 has a bunch of bytes
  • the order of the bytes is quite important
  • …other stuff??

That’s all I got. Let’s learn more.

I showed you this table before:

Lower bound Upper bound Pattern (binary)
0x00000 0x00007F 0xxxxxxx
0x00080 0x0007FF 110xxxxx 10xxxxxx
0x00800 0x00FFFF 1110xxxx 10xxxxxx 10xxxxxx
0x10000 0x10FFFF 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx

Now let’s talk about what it means.

First, note that 0xxxxxxx is not hexadecimal, but a bit pattern that means the “control” bit 0 followed by seven “data” bits, which could be anything.

The four rows in the table correspond to the four flavors of UTF-8 codepoints, distinguished by the number of bytes you need to encode them (one to four).

“Lower bound” and “upper bound” tell us which codepoints live in which flavor. So if we want to encode U+2603 (), we can see that we need to use the three-byte form, because 0x7FF < 0x2603 < 0xFFFF

The single byte codepoints all start with 0. The multi-byte codepoints start with a “leader” byte followed by one or more “continuation” bytes. Leader bytes start with 11 and tell us how many continuation bytes to expect, and every continuation byte starts with 10. Simple.

Boring UTF-8 details Fun UTF-8 facts

Look at the upper bound on the four-byte codepoint. It’s 0x10FFFF, even though we could represent numbers all the way up to 0x1FFFFF (221 - 1), since we have twenty-one data bits to work with. So why does it end there?

Because it turns out UTF-16, another very popular encoding scheme, can’t represent codepoints that big. So we say that anything bigger than that is Strictly Not Allowed in UTF-8 in order to make UTF-8 and UTF-16 fully compatible: we’ll never be able to write something in UTF-8 that we can’t transcode into UTF-16.

It turns out this limitation doesn’t matter in practice: 0x10FFFF (1,114,111 in decimal) is about four and a half times as many representable codepoints as there are codepoints currently defined in the latest Unicode spec, so we’re not in danger of running out of bits any time soon.

If we didn’t care about UTF-16 at all, we could keep going all the way up to a leader byte of 11111111 followed by seven continuation bytes without changing the underlying scheme. We could represent forty-two bits of codepoints! But we don’t do that, because, well, we don’t need to.

We’d also lose the interesting property that UTF-8 doesn’t allow the bytes 0xFE and 0xFF to appear anywhere, which makes it impossible to confuse UTF-16 text (which uses those bytes in its BOM) with UTF-8. But then again, this is not-caring-about-UTF-16-land that we’re dreaming in.

Show me the bits

Alright, fun sidetrack, but there’s a lotta hexadecimal floating around and it’s getting hard to keep it all straight. I wanna see some real world examples. I wanna see some binary.

So let’s look at the string naïveté. In UTF-8, we’d write that as…

Well, actually, we have multiple choices. ï could be the single codepoint U+00EF (LATIN SMALL LETTER I WITH DIAERESIS) or it could be U+0069 (LATIN SMALL LETTER I) followed by U+0308 (COMBINING DIAERESIS). Same with é. Code points ≠ characters, and we have to keep that in mind as we write this parser.

I chose to represent those characters as the following codepoints, for no particular reason:

n a i (combining diaeresis) v e t (e with acute)

Which we can laboriously encode as the following UTF-8 (I’m waving my hands here; this is a post about decoding):

0 1101110 (n)
0 1100001 (a)
0 1101001 (i)
110 01100 (¨ byte 1)
10 001000 (¨ byte 2)
0 1110110 (v)
0 1100101 (e)
0 1110100 (t)
110 00011 (é byte 1)
10 101001 (é byte 2)

I put a space between the “control” bits and the “data” bits, because that’s the easiest way to set them apart with markdown. And we can see that the one-byte characters, the “normal” ASCII ones, all begin with a 0. Both of the two-byte codepoints begin with 110, which tells us to expect one subsequent continuation byte beginning with 10

Look, you get it. You can see it right there. Let’s get to the code.

Quick encoding side-note:

There’s a neat command-line tool you can use called xxd to see the binary representation of some UTF-8 text. Unfortunately, I have no idea how to type i + combining diaresis, so this only shows the i with diaresis. Still, it’s better than doing it all by hand:

~ ➜ echo -n "naïveté" | xxd --binary
0000000: 01101110 01100001 11000011 10101111 01110110 01100101  na..ve
0000006: 01110100 11000011 10101001                             t..

I can use echo (with -n to suppress the newline) because my shell happens to use UTF-8 to represent text, but saving the UTF-8 to a file and then using xxd -b filename would probably be a smarter thing to do in real life.

Haskell time

It sounds like an odd language choice, but Haskell is actually great for bit twiddling. You don’t have to worry about int being the wrong size when you compile on a different platform, or accidentally shifting a signed value and tearing your hair out in the resultant mess. No. Haskell has your back. Haskell knows what you want.

You want some nice, fixed-size values and some friendly bitwise operations for them. And Haskell gives them to you, right in the standard library.

But Haskell doesn’t give you a binary literal, and I didn’t want to convert those lovely binary representations up there to hex or something by hand, so the first thing I did was write a readByte function that can turn strings like "10010100" into Word8, the Haskell equivalent of unsigned char or byte.

readByte :: String -> Word8

You read that as “readByte is a function that takes a String and returns a Word8.” It’s called the type declaration, and we’ll be seeing a lot of those, because in Haskell country types are the benevolent monarchs of the land.

I’m not including my implementation of this function here because it isn’t that important: we want to focus on the parsing part. You can always look at the full code if you’re curious how I did it.

Now let’s sanity check that it works:

import BasePrelude

main :: IO ()
main = do
  let naïveté = readByte <$> ["01101110", "01100001", "01101001", "11001100", "10001000", "01110110", "01100101", "01110100", "11000011", "10101001"]
  print naïveté

It does print [110,97,105,204,136,118,101,116,195,169], as we’d expect, but… it’s a little crazy.

We’re mapping readByte over a [String], giving us a [Word8] – a linked list of bytes. As Real Human Haskell Developers, we realize that that’s not a great representation for our needs: we’re not about to parse a linked list here, so let’s turn that into a ByteString (basically, an unsigned char *) with a nice helper:

import Data.ByteString (ByteString)
import qualified Data.ByteString as BS

readBytes :: [String] -> ByteString
readBytes = BS.pack . fmap readByte

And then we’ll use that instead:

main :: IO ()
main = do
  let naïveté = readBytes ["01101110", "01100001", "01101001", "11001100", "10001000", "01110110", "01100101", "01110100", "11000011", "10101001"]
  print naïveté

If all is well, we’ll see:

"nai\204\136vet\195\169"

Because ByteString is not encoding aware, and it doesn’t know what to do with those non-ASCII bytes. That’s what we’re here for!

The actual parsing

We’re going to parse ByteStrings, but we’re going to give the output back as a list of Word32s, because thirty-two bits is more than enough to encode every Unicode codepoint and there’s no Word24 in the standard library.

So at the very end of this, we want a function that converts a ByteString to a [Word32]. So let’s write that:

parseUtf8 :: ByteString -> [Word32]

Wait, what about invalid input? We could make it a Maybe [Word32], but let’s say it’ll give us an error message in that case, because that just so happens to be what Attoparsec does:

parseUtf8 :: ByteString -> Either String [Word32]

Ah, but wait. We sorta know some other stuff, too. Like that we want to make an Attoparsec parser to do the parsing (as that is the point of this blog post):

import Data.Attoparsec.ByteString
utf8Parser :: Parser [Word32]

And we can guess that the easiest way to write that is to make a parser for an individual codepoint.

codePointParser :: Parser Word32

Cool. Now let’s… implement them:

parseUtf8 = parseOnly utf8Parser

utf8Parser = manyTill' codePointParser endOfInput

codePointPars--

WAIT STOP

We just used our first parser combinator! But we haven’t talked about what that means yet.

Parser combinators are functions that take smaller parsers and return new, slightly bigger parsers. They are the glue that allows us to build sophisticated parsers out of small, independent pieces.

Before I heard of parser combinators, I thought of parsers as either:

  • a big scary thing that a “parser generator” creates for you from some arcane specification language
  • a one-off “recursive descent” (whatever that means) parentheses matcher that works pretty well as long as you never need to change it
  • some horrible rat’s nest of regular expressions that you aren’t proud to admit that you wrote

Now I think of them as simple tools. Nothing scary. Nothing fragile. You can build parsers out of a bunch of smaller parsers, and it’s parsers all the way down. It’s a beautifully simple model for solving problems like this.

Brief, unhelpful side note:

“Parser combinators” isn’t a great term. I’d rather we talked about “first class parsers,” since “parser combinators” are merely a class of helper functions you use to build parsers.

Sure, the cool thing about these first class parsers is that they’re, umm, combinated from a bunch of smaller ones. But still, “parser combinators” are the operations we do to them, not the parsers themselves. Maybe we should talk about “combinated parsers.” Yeah, I bet I can get that to stick.

So about manyTill'

If we look at the type of manyTill'

manyTill' :: MonadPlus m => m a -> m b -> m [a]

…we are confused and afraid. But that’s because it’s very general. If we specialize it to only talk about the types we’re using in utf8Parser

manyTill' :: Parser Word32 -> Parser () -> Parser [Word32]

…and fill in the type variables, we can see that it takes a Parser Word32, a Parser of something else, and returns a Parser [Word32]. It’s a function that creates this first-class Parser thing out of two smaller Parser things. It’s a parser combinator. Huzzah!

The second argument, endOfInput, is a Parser (), which means that it doesn’t parse any useful data; it can only tell us whether it succeeds or fails. Since manyTill' just runs its first parser over and over until its second parser succeeds, this is all we need.

The weird apostrophe in the name is because manyTill' is the strict version of a lazy function, manyTill. If you don’t know what that means, don’t worry about it. We could’ve just used manyTill here, but it would probably perform worse in this particular case. Though we should really profile it to find out.

manyTill' is one of many built-in combinators that Attoparsec provides us. It’s fun to read the docs to get an idea of some common things you can do, but you’re not limited to that list – before too long we’ll be learning how to define our own combinators!

But first let’s get back to the code.

How do we write codePointParser

codePointParser is, well, kind of the whole point here. That’s where the magic happens. But how do we write it?

The way I want to write it is like this:

codePointParser :: Parser Word32
codePointParser = byteSequence ["0xxxxxxx"] <|>
                  byteSequence ["110xxxxx", "10xxxxxx"] <|>
                  byteSequence ["1110xxxx", "10xxxxxx", "10xxxxxx"] <|>
                  byteSequence ["11110xxx", "10xxxxxx", "10xxxxxx", "10xxxxxx"]

And fortunately, that totally works. Yes, we can make it shorter with a helper for the continuation bytes, and we will – but for now, bask with me in the explicit glory of the code as it now stands.

What is that weird bird you drew

I read <|> as “or”. We want to match this byte sequence or that byte sequence or this other one.

It’s a parser combinator (surprise) that takes a parser on the left, a parser on the right, and gives you a parser that tries the left and, if it fails, tries the right. Simple.

Okay cool what about byteSequence

I want to start writing it – I really do – but before we dive in I want to be very crystal clear about the work it’s going to be doing. Because there’s a very important aspect of UTF-8 that we haven’t mentioned before, and I didn’t really understand before doing this exercise:

When we have those “data bits”, like in 110xxxxx 10xxxxxx, how do we interpret those? Do we just glue those xs together and treat it like an eleven-bit number? Or do we add 128 to that, since we know that the first 128 codepoints are already covered by the 0xxxxxxx flavor?

Another way to phrase that: if I write 110000000 10000001, does that represent the same codepoint as 00000001? And if so, is that… allowed?

Well, it turns out that the answer is yes it does and no it most certainly is not.

We don’t have to do any math. That’s an eleven-bit number and we treat it exactly as it stands. Which means that it is in theory possible to encode something like 00001000 as 11000000 10001000, and this phenomenon is called an overlong encoding.

Overlong encodings are very unambiguously explicitly crystal clearly not allowed. The UTF-8 spec has these scary words to say about it:

Implementations of the decoding algorithm above MUST protect against decoding invalid sequences. For instance, a naive implementation may decode the overlong UTF-8 sequence C0 80 into the character U+0000, or the surrogate pair ED A1 8C ED BE B4 into U+233B4. Decoding invalid sequences may have security consequences or cause other problems. See Security Considerations (Section 10) below.

Section 10 contains this gem:

Another example might be a parser which prohibits the octet sequence 2F 2E 2E 2F ("/../"), yet permits the illegal octet sequence 2F C0 AE 2E 2F. This last exploit has actually been used in a widespread virus attacking Web servers in 2001; thus, the security threat is very real.

Well, alright. We’re going to have to handle this, because this isn’t “Let’s write a dumbed down UTF-8 parser because the real world is too hard.” This is “Parser combinators, hell yeah; let’s crank the tunes and start conforming to an IETF standard.”

Which means that we need to modify what we wrote above to support the concept of overlong encodings.

It won’t be so bad! And we’ll get to, you know, actually parsing before you know it. First, though, I’m going to clean up the code a tiny bit:

codePointParser :: Parser Word32
codePointParser =
  byteSequence ["0xxxxxxx"] <|>
  multibyte "110xxxxx" 1 <|>
  multibyte "1110xxxx" 2 <|>
  multibyte "11110xxx" 3
  where multibyte leader count =
    byteSequence (leader : replicate count "10xxxxxx")

That’s the same as what we had before, just with less repetition. Now let’s protect against overlong encodings:

codePointParser =
  byteSequence ["0xxxxxxx"] <|>
  overlong 0x7F (multibyte "110xxxxx" 1) <|>
  overlong 0x7FF (multibyte "1110xxxx" 2) <|>
  overlong 0xFFFF (multibyte "11110xxx" 3)
  where multibyte leader count =
    byteSequence (leader : replicate count "10xxxxxx")

Then define this overlong helper we keep using:

overlong :: Word32 -> Parser Word32 -> Parser Word32
overlong m parser = checkedParser parser (> m) "overlong codepoint!"

Non-Haskell people: this is what the type declaration for a function of multiple arguments looks like. Basically it takes a Word32 and a Parser Word32 and returns a Parser Word32. Why is it written like that? Currying. (it’s really cool)

Now, we defined this in terms of checkedParser, but that isn’t something that Attoparsec provides us. We’re going to have to define it ourselves. It’s going to look something like this:

checkedParser :: Parser a -> (a -> Bool) -> String -> Parser a

Read that as “Given a parser, a predicate, and an error message, give me a new parser that fails when the predicate returns False.” Alright. Let’s write that.

Wait how

At first this looks a little daunting. Writing our own parser combinator? It sounds like deep Haskell magic. But it turns out it’s quite simple, once we realize that Parser is a monad:

checkedParser parser pred msg = do
  byte <- parser
  unless (pred byte) (fail msg)
  return byte

Yeah yeah, I used the M-word. But that’s not so bad, is it? Just nice, normal-looking code. Haskell: it’s not so terrifying™.

This almost seems too easy, though. It looks like we “called” the parser, but what did we pass it? How does it “know” what to parse?

It’s a semi-complicated topic and I won’t try to explain it here, but I will try to offer some intuition: we aren’t returning a “function that parses” here. We’re returning an honest-to-goodness Parser. The parser doesn’t have a series of steps, it is a series of steps, which we’ve written out with do notation.

Crystal clear? Okay yeah perfect let’s pretend this never happened.

So about that byteSequence thing

I guess at some point we’ll have to do some actual UTF-8.

byteSequence :: [String] -> Parser Word32

So we have a list of pattern strings that look something like "110xxxxx". We want to parse a series of bytes that match that pattern, extracting the data bits each time, then glue them all together (somehow) at the end.

But doing lists of things is hard. So let’s break it down even further, and make a parser that matches just one of these patterns.

bitPattern :: String -> Parser ???

But wait: should this give us a Word8 back?

At first I said “yes.” Then I struggled to write the rest of the code, thought about it a bit more, and realized that just couldn’t work. Because there’s an additional, important piece of information: the number of xs in the pattern.

We need that information when it comes time to glue these babies back together. If we don’t remember how many bits of each byte are actually significant, we’re gonna end up with a garbled mess with all the bits in the wrong places.

So instead of making bitPattern return a Parser Word8, let’s define something new:

type SubByte = (Word8, Int)

Because it’s such a simple type, we can just aliasing a two-tuple here. Now let’s make a couple helpers:

subZero :: SubByte
subZero = (0, 0)

pushBit :: Bool -> SubByte -> SubByte
pushBit True  (b, n) = (setBit (shiftL b 1) 0, n + 1)
pushBit False (b, n) = (shiftL b 1, n + 1)

shiftL and setBit are from Data.Bits, which we imported via BasePrelude. Their arguments are in a stupid order. Oh well; Haskell isn’t perfect.

Now we can write the type of bitPattern:

bitPattern :: String -> Parser SubByte

And we can almost write the definition too.

Well get to it then

I will – but first I want to break out of the parser world.

Let’s pretend, for a second, that we’re not using Attoparsec. Because having that “we’re making a parser” thing weighing over our head isn’t helping us right now.

It turns out if we can write a normal function to do this work, we can turn it into a Parser very easily. So let’s not worry about making a bitPattern parser; let’s just write a function.

matchByte :: String -> Word8 -> Maybe SubByte

Note that I made it return a Maybe SubByte, to indicate that the pattern could fail – think matchByte "0xxxxxxx" (readByte "10101001").

This is another case where I wanna say “the implementation doesn’t matter too much”. There are lots of ways to do this. But since this is, well, the meat of the parser, it would seem silly to skip it.

So here’s how I approached this, with my thought process sketched out here:

  • we’re reducing many values (the pattern and the bits) into a single value
  • that sounds like a fold
  • but we want to short-circuit if something goes wrong
  • short-circuiting makes me think of Maybe
  • I don’t know how to fold-with-maybe
  • but I do know how to Hoogle that
  • and now I know how to foldM
matchByte pattern byte = foldM check subZero (zip pattern bits)
  where
    bits = testBit byte <$> [7, 6 .. 0]
    check subByte ('1', True)  = Just subByte
    check subByte ('0', False) = Just subByte
    check subByte ('x', bit)   = Just (pushBit bit subByte)
    check _       _            = Nothing

That was a lot of code at the same time! So let’s go through it, assuming that pattern is "0xxxxxxx" and byte is 10100100.

  • testBit is from Data.Bits, and gives us True or False when we ask it if the bit at a certain index is set. So testBit byte <$> [7, 6, .. 0] gives us [True, False, True, False, False, True, False, False].
    • Originally I wrote range (7, 0) instead of using the goofy list range syntax. But range (7, 0) == [], and I was very sad. This is a good example of a program that typechecks but is nonetheless wrong – it can happen!
  • zip pattern bits is [('0', True), ('x', False), ('x', True), ...]. We just line up the pattern with the actual byte.
  • foldM starts with subZero and uses the check function to advance its value each step of the way
  • check rejects if it sees ('1', False) or ('0', True)
  • check leaves the output unchanged if the pattern is ('1', True) or ('0', False)
  • check pushes a bit onto the output when the pattern character is an x

Great! Now we have a normal function (not a parser) that does what we want. But how do we turn it into a parser?

Documentation to the rescue

Attoparsec ships with two ways to turn functions into Parsers:

satisfy :: (Word8 -> Bool) -> Parser Word8
satisfyWith :: (Word8 -> a) -> (a -> Bool) -> Parser a

The former just checks a byte against a predicate, and the latter allows us to transform the byte into something else, run a predicate on the result, and then return it if it passes.

Both of these are nice, but neither are quite what we want. For example, we could write something like:

bitPattern :: String -> Parser SubByte
bitPattern pattern = satisfyWith (matchByte pattern) isJust

But that wouldn’t work, because it’d give us a Parser (Maybe SubByte). But we know it’s a Just! Can’t we unwrap it? No can do, satisfyWith whispers sinisterly. And sure, we could then fmap fromJust over the result or something. But wouldn’t it be nice if we just had:

satisfyMaybe :: (Word8 -> Maybe a) -> Parser a
satisfyMaybe f = do
  byte <- anyWord8
  maybe (fail "maybe not satisfied") return (f byte)

When I first wrote that, I did a peekWord8 and then a skip (const True) on success. But that was silly! Attoparsec will always backtrack after failure. Still, an interesting lesson learned.

With this helper, bitPattern becomes trivial:

bitPattern pattern = satisfyMaybe (matchByte pattern)

THE LAST PIECE

Really! We’re so close! We can parse bytes that match the pattern, and get our SubBytes back. We just need to glue them together!

So now we return to byteSequence (remember her?), the only unimplemented method we have left:

byteSequence :: [String] -> Parser Word32
byteSequence patterns = do
  subBytes <- mapM bitPattern patterns
  return (foldl mergeSubByte 0 subBytes)

mapM does two things for us in one go: it turns our pattern strings into Parser SubBytes and it executes each of them in order, handing us back the list of parsed values. But since that’s a list of SubBytes, we need to glue them together into a single Word32, which we do with a fold and a simple helper:

mergeSubByte :: Word32 -> SubByte -> Word32
mergeSubByte whole (byte, bitCount) =
  shiftL whole bitCount .|. fromIntegral byte

.|. is bitwise or, from Data.Bits. We’re using fromIntegral to convert a Word8 into a Word32 here.

And now we’re done. That’s it. We win. It’s over. We test our parser, and it works!

But it has really bad error messages

It does. In particular, we almost always see "maybe not satisfied" when we test it on invalid input. This is because we don’t have any control over the backtracking: when something fails with an overlong error (for example), the alternation in codePointParser will try parsing the next one. There’s no way to distinguish between “irrecoverable failures” (which we want, in this case) and “failures which should cause backtracking.”

That I know of. It’s totally possible! I’m very new at this stuff. But I’m pretty sure attoparsec is in the “nothing matters but performance” camp, and if we’re after user-friendly error messages, we should look into Parsec, which is a tiny bit harder to use but quite a bit more flexible.

Hey wait that’s awful

It’s not really so bad. We could, instead of using fail to track our error messages, actually parse to an Either String [Word32]. That way error conditions are Real Actual Values and we never need to backtrack once we find a definite failure (as opposed to what Attoparsec thinks of as a failure).

I don’t know if that would actually work when you consider that it should short-circuit on the first failure… but it sounds like fun homework for next time.

In the meantime, though, this has gone on for quite a while. I’d like to kick back, relax, and–

What about that spec thing

Ugh. Specs. Who needs ‘em? Well, we came so close to actually being able to validate UTF-8 that it seems silly to stop here. There are only two things we’re missing, and both are very easy to add:

The definition of UTF-8 prohibits encoding character numbers between U+D800 and U+DFFF, which are reserved for use with the UTF-16 encoding form (as surrogate pairs) and do not directly represent characters.

And the fact that codepoints from 0x110000 to 0x1FFFFF are not valid, because of the UTF-16 restriction we talked about before. So heck, let’s just handle these:

codePointParser :: Parser Word32
codePointParser =
  byteSequence ["0xxxxxxx"] <|>
  overlong 0x7F (multibyte "110xxxxx" 1) <|>
  overlong 0x7FF (multibyte "1110xxxx" 2) <|>
  checkedParser withMax (not . isSurrogate) "illegal surrogate pair"
  where
    multibyte leader count =
      byteSequence (leader : replicate count "10xxxxxx")
    fourByteParser = (overlong 0xFFFF (multibyte "11110xxx" 3))
    withMax = checkedParser fourByteParser (< 0x110000) "illegal codepoint over 0x10FFFF"
    isSurrogate w = w >= 0xD800 && w <= 0xDFFF

And now we’re really, truly done.

Why… why did we just do that

Well, for fun! And to learn about parser combinators. And UTF-8. And because we really didn’t think it would take that long when we started writing this blog post.

The end result is certainly not a useful UTF-8 parser. UTF-8 is such a simple format that throwing parser combinators at it is overkill, and Data.Text comes with a decodeUtf8' function that behaves much like my parseUtf8 – except that it has actual error messages, it results in a useful type (Text instead of [Word32]), and it doesn’t do ridiculous things like turning bytes into linked lists of bits.

But the end result is hopefully a rather educational parser. decodeUtf8' is so highly optimized that it’s very difficult to follow what’s going on – which is expected; performance is more important if you actually want to use these functions.

And hopefully it’s something that we could return to, some day, to fix the error messages. Or perhaps we could profile it, and try to measure the performance. Maybe even optimize the generated assembly.

But not today. Today I’m going to go outside.


Here’s some interesting further reading, because if you got this far you clearly have some sort of crippling reading addiction: