We all know the absolute minimum every software developer must know about Unicode. We aren’t bad citizens of the internet, spreading malice and encoding errors willy-nilly around the world. We’re good people, just trying to get by.
But knowing the minimum is starting to bother us. And the more we think about UTF-8 and text encoding, the more we realize that we don’t “get it” at a messy, squishy level. We’ve read about it, sure, but we’ve never actually wrapped our hands around it and felt the blood pump through its veins.
So that’s what we’re going to do in this post: grab UTF-8 by the bytes and squeeze some real understanding out of it.
Well alright how
Lower bound | Upper bound | Pattern (binary) |
---|---|---|
0x00000 |
0x00007F |
0xxxxxxx |
0x00080 |
0x0007FF |
110xxxxx 10xxxxxx |
0x00800 |
0x00FFFF |
1110xxxx 10xxxxxx 10xxxxxx |
0x10000 |
0x10FFFF |
11110xxx 10xxxxxx 10xxxxxx 10xxxxxx |
And I wondered if it would be possible to write a decoding function that looked like this:
byteSequence ["0xxxxxxx"] <|>
byteSequence ["110xxxxx", "10xxxxxx"] <|>
byteSequence ["1110xxxx", "10xxxxxx", "10xxxxxx"] <|>
byteSequence ["11110xxx", "10xxxxxx", "10xxxxxx", "10xxxxxx"]
And it turns out that it is. Sort of.
For whatever reason, the idea of writing a function like this – basically a transliteration of the UTF-8 spec – got me excited. When I sat down to write this post, I didn’t know how I was going to do it. But I was determined to figure it out.
This post chronicles that journey.
If you too are excited, or just a little bit curious, come dream with me. Along the way, we’ll:
- learn quite a bit about UTF-8
- use Attoparsec, a very fun parser combinator library
- learn what “parser combinator” means
- watch me struggle to write Haskell
Or you could skip all that noise and cut to the code. That link includes getting started instructions if you want to follow along at home. Otherwise just sit back, relax, and let’s do this thing together.
A brief recap of UTF-8
You’ve read that Joel on Software post, right? The one I linked to in the first paragraph? I believe you. But just in case, let’s review:
- Unicode 7.0 defines 252,603 different codepoints
- less than half of these codepoints are what we’d call “characters”
- they could be diacritics (like ¨ or ´), one stroke of a character in a fancy writing system, or symbols
- the first published version of this post got that number very wrong because the character vs. codepoint distinction is confusing
- a codepoint is an abstract number, not a physical byte pattern
- there are multiple ways to represent codepoints in concrete ways, called encodings
- one such encoding is UTF-8
- there are others, UTF-16 being pretty popular
- despite the name, UTF-8 takes anywhere from 8 to 32 bits to represent a single codepoint
- UTF-8 has a bunch of bytes
- the order of the bytes is quite important
- …other stuff??
That’s all I got. Let’s learn more.
I showed you this table before:
Lower bound | Upper bound | Pattern (binary) |
---|---|---|
0x00000 |
0x00007F |
0xxxxxxx |
0x00080 |
0x0007FF |
110xxxxx 10xxxxxx |
0x00800 |
0x00FFFF |
1110xxxx 10xxxxxx 10xxxxxx |
0x10000 |
0x10FFFF |
11110xxx 10xxxxxx 10xxxxxx 10xxxxxx |
Now let’s talk about what it means.
First, note that 0xxxxxxx
is not hexadecimal, but a bit pattern that means the “control” bit 0
followed by seven “data” bits, which could be anything.
The four rows in the table correspond to the four flavors of UTF-8 codepoints, distinguished by the number of bytes you need to encode them (one to four).
“Lower bound” and “upper bound” tell us which codepoints live in which flavor. So if we want to encode U+2603
(☃), we can see that we need to use the three-byte form, because 0x7FF < 0x2603 < 0xFFFF
The single byte codepoints all start with 0
. The multi-byte codepoints start with a “leader” byte followed by one or more “continuation” bytes. Leader bytes start with 11
and tell us how many continuation bytes to expect, and every continuation byte starts with 10
. Simple.
Boring UTF-8 details Fun UTF-8 facts
Look at the upper bound on the four-byte codepoint. It’s 0x10FFFF
, even though we could represent numbers all the way up to 0x1FFFFF
(221 - 1), since we have twenty-one data bits to work with. So why does it end there?
Because it turns out UTF-16, another very popular encoding scheme, can’t represent codepoints that big. So we say that anything bigger than that is Strictly Not Allowed in UTF-8 in order to make UTF-8 and UTF-16 fully compatible: we’ll never be able to write something in UTF-8 that we can’t transcode into UTF-16.
It turns out this limitation doesn’t matter in practice: 0x10FFFF
(1,114,111 in decimal) is about four and a half times as many representable codepoints as there are codepoints currently defined in the latest Unicode spec, so we’re not in danger of running out of bits any time soon.
If we didn’t care about UTF-16 at all, we could keep going all the way up to a leader byte of 11111111
followed by seven continuation bytes without changing the underlying scheme. We could represent forty-two bits of codepoints! But we don’t do that, because, well, we don’t need to.
We’d also lose the interesting property that UTF-8 doesn’t allow the bytes 0xFE
and 0xFF
to appear anywhere, which makes it impossible to confuse UTF-16 text (which uses those bytes in its BOM) with UTF-8. But then again, this is not-caring-about-UTF-16-land that we’re dreaming in.
Show me the bits
Alright, fun sidetrack, but there’s a lotta hexadecimal floating around and it’s getting hard to keep it all straight. I wanna see some real world examples. I wanna see some binary.
So let’s look at the string naïveté
. In UTF-8, we’d write that as…
Well, actually, we have multiple choices. ï
could be the single codepoint U+00EF
(LATIN SMALL LETTER I WITH DIAERESIS) or it could be U+0069
(LATIN SMALL LETTER I) followed by U+0308
(COMBINING DIAERESIS). Same with é
. Code points ≠ characters, and we have to keep that in mind as we write this parser.
I chose to represent those characters as the following codepoints, for no particular reason:
n a i (combining diaeresis) v e t (e with acute)
Which we can laboriously encode as the following UTF-8 (I’m waving my hands here; this is a post about decoding):
0 1101110 (n)
0 1100001 (a)
0 1101001 (i)
110 01100 (¨ byte 1)
10 001000 (¨ byte 2)
0 1110110 (v)
0 1100101 (e)
0 1110100 (t)
110 00011 (é byte 1)
10 101001 (é byte 2)
I put a space between the “control” bits and the “data” bits, because that’s the easiest way to set them apart with markdown. And we can see that the one-byte characters, the “normal” ASCII ones, all begin with a 0
. Both of the two-byte codepoints begin with 110
, which tells us to expect one subsequent continuation byte beginning with 10
…
Look, you get it. You can see it right there. Let’s get to the code.
Haskell time
It sounds like an odd language choice, but Haskell is actually great for bit twiddling. You don’t have to worry about int
being the wrong size when you compile on a different platform, or accidentally shifting a signed value and tearing your hair out in the resultant mess. No. Haskell has your back. Haskell knows what you want.
You want some nice, fixed-size values and some friendly bitwise operations for them. And Haskell gives them to you, right in the standard library.
But Haskell doesn’t give you a binary literal, and I didn’t want to convert those lovely binary representations up there to hex or something by hand, so the first thing I did was write a readByte
function that can turn strings like "10010100"
into Word8
, the Haskell equivalent of unsigned char
or byte
.
readByte :: String -> Word8
You read that as “readByte
is a function that takes a String
and returns a Word8
.” It’s called the type declaration, and we’ll be seeing a lot of those, because in Haskell country types are the benevolent monarchs of the land.
I’m not including my implementation of this function here because it isn’t that important: we want to focus on the parsing part. You can always look at the full code if you’re curious how I did it.
Now let’s sanity check that it works:
import BasePrelude
main :: IO ()
main = do
let naïveté = readByte <$> ["01101110", "01100001", "01101001", "11001100", "10001000", "01110110", "01100101", "01110100", "11000011", "10101001"]
print naïveté
It does print [110,97,105,204,136,118,101,116,195,169]
, as we’d expect, but… it’s a little crazy.
We’re mapping readByte
over a [String]
, giving us a [Word8]
– a linked list of bytes. As Real Human Haskell Developers, we realize that that’s not a great representation for our needs: we’re not about to parse a linked list here, so let’s turn that into a ByteString
(basically, an unsigned char *
) with a nice helper:
import Data.ByteString (ByteString)
import qualified Data.ByteString as BS
readBytes :: [String] -> ByteString
readBytes = BS.pack . fmap readByte
And then we’ll use that instead:
main :: IO ()
main = do
let naïveté = readBytes ["01101110", "01100001", "01101001", "11001100", "10001000", "01110110", "01100101", "01110100", "11000011", "10101001"]
print naïveté
If all is well, we’ll see:
"nai\204\136vet\195\169"
Because ByteString
is not encoding aware, and it doesn’t know what to do with those non-ASCII bytes. That’s what we’re here for!
The actual parsing
We’re going to parse ByteString
s, but we’re going to give the output back as a list of Word32
s, because thirty-two bits is more than enough to encode every Unicode codepoint and there’s no Word24
in the standard library.
So at the very end of this, we want a function that converts a ByteString
to a [Word32]
. So let’s write that:
parseUtf8 :: ByteString -> [Word32]
Wait, what about invalid input? We could make it a Maybe [Word32]
, but let’s say it’ll give us an error message in that case, because that just so happens to be what Attoparsec does:
parseUtf8 :: ByteString -> Either String [Word32]
Ah, but wait. We sorta know some other stuff, too. Like that we want to make an Attoparsec parser to do the parsing (as that is the point of this blog post):
import Data.Attoparsec.ByteString
utf8Parser :: Parser [Word32]
And we can guess that the easiest way to write that is to make a parser for an individual codepoint.
codePointParser :: Parser Word32
Cool. Now let’s… implement them:
parseUtf8 = parseOnly utf8Parser
utf8Parser = manyTill' codePointParser endOfInput
codePointPars--
WAIT STOP
We just used our first parser combinator! But we haven’t talked about what that means yet.
Parser combinators are functions that take smaller parsers and return new, slightly bigger parsers. They are the glue that allows us to build sophisticated parsers out of small, independent pieces.
Before I heard of parser combinators, I thought of parsers as either:
- a big scary thing that a “parser generator” creates for you from some arcane specification language
- a one-off “recursive descent” (whatever that means) parentheses matcher that works pretty well as long as you never need to change it
- some horrible rat’s nest of regular expressions that you aren’t proud to admit that you wrote
Now I think of them as simple tools. Nothing scary. Nothing fragile. You can build parsers out of a bunch of smaller parsers, and it’s parsers all the way down. It’s a beautifully simple model for solving problems like this.
So about manyTill'
If we look at the type of manyTill'
…
manyTill' :: MonadPlus m => m a -> m b -> m [a]
…we are confused and afraid. But that’s because it’s very general. If we specialize it to only talk about the types we’re using in utf8Parser
…
manyTill' :: Parser Word32 -> Parser () -> Parser [Word32]
…and fill in the type variables, we can see that it takes a Parser Word32
, a Parser
of something else, and returns a Parser [Word32]
. It’s a function that creates this first-class Parser
thing out of two smaller Parser
things. It’s a parser combinator. We did it!
The second argument, endOfInput
, is a Parser ()
, which means that it doesn’t parse any useful data; it can only tell us whether it succeeds or fails. Since manyTill'
just runs its first parser over and over until its second parser succeeds, this is all we need.
manyTill'
is one of many built-in combinators that Attoparsec provides us. It’s fun to read the docs to get an idea of some common things you can do, but you’re not limited to that list – before too long we’ll be learning how to define our own combinators!
But first let’s get back to the code.
How do we write codePointParser
codePointParser
is, well, kind of the whole point here. That’s where the magic happens. But how do we write it?
The way I want to write it is like this:
codePointParser :: Parser Word32
codePointParser = byteSequence ["0xxxxxxx"] <|>
byteSequence ["110xxxxx", "10xxxxxx"] <|>
byteSequence ["1110xxxx", "10xxxxxx", "10xxxxxx"] <|>
byteSequence ["11110xxx", "10xxxxxx", "10xxxxxx", "10xxxxxx"]
And fortunately, that totally works. Yes, we can make it shorter with a helper for the continuation bytes, and we will – but for now, bask with me in the explicit glory of the code as it now stands.
What is that weird bird you drew
I read <|>
as “or”. We want to match this byte sequence or that byte sequence or this other one.
It’s a parser combinator (surprise) that takes a parser on the left, a parser on the right, and gives you a parser that tries the left and, if it fails, tries the right. Simple.
Okay cool what about byteSequence
I want to start writing it – I really do – but before we dive in I want to be very crystal clear about the work it’s going to be doing. Because there’s a very important aspect of UTF-8 that we haven’t mentioned before, and I didn’t really understand before doing this exercise:
When we have those “data bits”, like in 110xxxxx 10xxxxxx
, how do we interpret those? Do we just glue those x
s together and treat it like an eleven-bit number? Or do we add 128 to that, since we know that the first 128 codepoints are already covered by the 0xxxxxxx
flavor?
Another way to phrase that: if I write 110000000 10000001
, does that represent the same codepoint as 00000001
? And if so, is that… allowed?
Well, it turns out that the answer is yes it does and no it most certainly is not.
We don’t have to do any math. That’s an eleven-bit number and we treat it exactly as it stands. Which means that it is in theory possible to encode something like 00001000
as 11000000 10001000
, and this phenomenon is called an overlong encoding.
Overlong encodings are very unambiguously explicitly crystal clearly not allowed. The UTF-8 spec has these scary words to say about it:
Implementations of the decoding algorithm above MUST protect against decoding invalid sequences. For instance, a naive implementation may decode the overlong UTF-8 sequence
C0 80
into the characterU+0000
, or the surrogate pairED A1 8C ED BE B4
intoU+233B4
. Decoding invalid sequences may have security consequences or cause other problems. See Security Considerations (Section 10) below.
Section 10 contains this gem:
Another example might be a parser which prohibits the octet sequence
2F 2E 2E 2F
("/../"
), yet permits the illegal octet sequence2F C0 AE 2E 2F
. This last exploit has actually been used in a widespread virus attacking Web servers in 2001; thus, the security threat is very real.
Well, alright. We’re going to have to handle this, because this isn’t “Let’s write a dumbed down UTF-8 parser because the real world is too hard.” This is “Parser combinators, hell yeah; let’s crank the tunes and start conforming to an IETF standard.”
Which means that we need to modify what we wrote above to support the concept of overlong encodings.
It won’t be so bad! And we’ll get to, you know, actually parsing before you know it. First, though, I’m going to clean up the code a tiny bit:
codePointParser :: Parser Word32
codePointParser =
byteSequence ["0xxxxxxx"] <|>
multibyte "110xxxxx" 1 <|>
multibyte "1110xxxx" 2 <|>
multibyte "11110xxx" 3
where multibyte leader count =
byteSequence (leader : replicate count "10xxxxxx")
That’s the same as what we had before, just with less repetition. Now let’s protect against overlong encodings:
codePointParser =
byteSequence ["0xxxxxxx"] <|>
overlong 0x7F (multibyte "110xxxxx" 1) <|>
overlong 0x7FF (multibyte "1110xxxx" 2) <|>
overlong 0xFFFF (multibyte "11110xxx" 3)
where multibyte leader count =
byteSequence (leader : replicate count "10xxxxxx")
Then define this overlong
helper we keep using:
overlong :: Word32 -> Parser Word32 -> Parser Word32
overlong m parser = checkedParser parser (> m) "overlong codepoint!"
Now, we defined this in terms of checkedParser
, but that isn’t something that Attoparsec provides us. We’re going to have to define it ourselves. It’s going to look something like this:
checkedParser :: Parser a -> (a -> Bool) -> String -> Parser a
Read that as “Given a parser, a predicate, and an error message, give me a new parser that fails when the predicate returns False
.” Alright. Let’s write that.
Wait how
At first this looks a little daunting. Writing our own parser combinator? It sounds like deep Haskell magic. But it turns out it’s quite simple, once we realize that Parser
is a monad:
checkedParser parser pred msg = do
byte <- parser
unless (pred byte) (fail msg)
return byte
Yeah yeah, I used the M-word. But that’s not so bad, is it? Just nice, normal-looking code. Haskell: it’s not so terrifying™.
This almost seems too easy, though. It looks like we “called” the parser
, but what did we pass it? How does it “know” what to parse?
It’s a semi-complicated topic and I won’t try to explain it here, but I will try to offer some intuition: we aren’t returning a “function that parses” here. We’re returning an honest-to-goodness Parser
. The parser doesn’t have a series of steps, it is a series of steps, which we’ve written out with do
notation.
Crystal clear? Okay yeah perfect let’s pretend this never happened.
So about that byteSequence
thing
I guess at some point we’ll have to do some actual UTF-8.
byteSequence :: [String] -> Parser Word32
So we have a list of pattern strings that look something like "110xxxxx"
. We want to parse a series of bytes that match that pattern, extracting the data bits each time, then glue them all together (somehow) at the end.
But doing lists of things is hard. So let’s break it down even further, and make a parser that matches just one of these patterns.
bitPattern :: String -> Parser ???
But wait: should this give us a Word8
back?
At first I said “yes.” Then I struggled to write the rest of the code, thought about it a bit more, and realized that just couldn’t work. Because there’s an additional, important piece of information: the number of x
s in the pattern.
We need that information when it comes time to glue these babies back together. If we don’t remember how many bits of each byte are actually significant, we’re gonna end up with a garbled mess with all the bits in the wrong places.
So instead of making bitPattern
return a Parser Word8
, let’s define something new:
type SubByte = (Word8, Int)
Because it’s such a simple type, we can just alias a two-tuple here. Now let’s make a couple helpers:
subZero :: SubByte
subZero = (0, 0)
pushBit :: Bool -> SubByte -> SubByte
pushBit True (b, n) = (setBit (shiftL b 1) 0, n + 1)
pushBit False (b, n) = (shiftL b 1, n + 1)
Now we can write the type of bitPattern
:
bitPattern :: String -> Parser SubByte
And we can almost write the definition too.
Well get to it then
I will – but first I want to break out of the parser world.
Let’s pretend, for a second, that we’re not using Attoparsec. Because having that “we’re making a parser” thing weighing over our head isn’t helping us right now.
It turns out if we can write a normal function to do this work, we can turn it into a Parser
very easily. So let’s not worry about making a bitPattern
parser; let’s just write a function.
matchByte :: String -> Word8 -> Maybe SubByte
Note that I made it return a Maybe SubByte
, to indicate that the pattern could fail – think matchByte "0xxxxxxx" (readByte "10101001")
.
This is another case where I wanna say “the implementation doesn’t matter too much”. There are lots of ways to do this. But since this is, well, the meat of the parser, it would seem silly to skip it.
So here’s how I approached this, with my thought process sketched out here:
- we’re reducing many values (the pattern and the bits) into a single value
- that sounds like a
fold
- but we want to short-circuit if something goes wrong
- short-circuiting makes me think of
Maybe
- I don’t know how to fold-with-maybe
- but I do know how to Hoogle that
- and now I know how to
foldM
matchByte pattern byte = foldM check subZero (zip pattern bits)
where
bits = testBit byte <$> [7, 6 .. 0]
check subByte ('1', True) = Just subByte
check subByte ('0', False) = Just subByte
check subByte ('x', bit) = Just (pushBit bit subByte)
check _ _ = Nothing
That was a lot of code at the same time! So let’s go through it, assuming that pattern
is "0xxxxxxx"
and byte
is 10100100
.
testBit
is fromData.Bits
, and gives usTrue
orFalse
when we ask it if the bit at a certain index is set. SotestBit byte <$> [7, 6, .. 0]
gives us[True, False, True, False, False, True, False, False]
.- Originally I wrote
range (7, 0)
instead of using the goofy list range syntax. Butrange (7, 0) == []
, and I was very sad. This is a good example of a program that typechecks but is nonetheless wrong – it can happen!
- Originally I wrote
zip pattern bits
is[('0', True), ('x', False), ('x', True), ...]
. We just line up the pattern with the actual byte.foldM
starts withsubZero
and uses thecheck
function to advance its value each step of the waycheck
rejects if it sees('1', False)
or('0', True)
check
leaves the output unchanged if the pattern is('1', True)
or('0', False)
check
pushes a bit onto the output when the pattern character is anx
Great! Now we have a normal function (not a parser) that does what we want. But how do we turn it into a parser?
Documentation to the rescue
Attoparsec ships with two ways to turn functions into Parser
s:
satisfy :: (Word8 -> Bool) -> Parser Word8
satisfyWith :: (Word8 -> a) -> (a -> Bool) -> Parser a
The former just checks a byte against a predicate, and the latter allows us to transform the byte into something else, run a predicate on the result, and then return it if it passes.
Both of these are nice, but neither are quite what we want. For example, we could write something like:
bitPattern :: String -> Parser SubByte
bitPattern pattern = satisfyWith (matchByte pattern) isJust
But that wouldn’t work, because it’d give us a Parser (Maybe SubByte)
. But we know it’s a Just
! Can’t we unwrap it? No can do, satisfyWith
whispers sinisterly. And sure, we could then fmap fromJust
over the result or something. But wouldn’t it be nice if we just had:
satisfyMaybe :: (Word8 -> Maybe a) -> Parser a
satisfyMaybe f = do
byte <- anyWord8
maybe (fail "maybe not satisfied") return (f byte)
With this helper, bitPattern
becomes trivial:
bitPattern pattern = satisfyMaybe (matchByte pattern)
THE LAST PIECE
Really! We’re so close! We can parse bytes that match the pattern, and get our SubByte
s back. We just need to glue them together!
So now we return to byteSequence
(remember her?), the only unimplemented function we have left:
byteSequence :: [String] -> Parser Word32
byteSequence patterns = do
subBytes <- mapM bitPattern patterns
return (foldl mergeSubByte 0 subBytes)
mapM
does two things for us in one go: it turns our pattern strings into Parser SubByte
s and it executes each of them in order, handing us back the list of parsed values. But since that’s a list of SubByte
s, we need to glue them together into a single Word32
, which we do with a fold
and a simple helper:
mergeSubByte :: Word32 -> SubByte -> Word32
mergeSubByte whole (byte, bitCount) =
shiftL whole bitCount .|. fromIntegral byte
And now we’re done. That’s it. We win. It’s over. We test our parser, and it works!
But it has really bad error messages
It does. In particular, we almost always see "maybe not satisfied"
when we test it on invalid input. This is because we don’t have any control over the backtracking: when something fails with an overlong
error (for example), the alternation in codePointParser
will try parsing the next one. There’s no way to distinguish between “irrecoverable failures” (which we want, in this case) and “failures which should cause backtracking.”
That I know of. It’s totally possible! I’m very new at this stuff. But I’m pretty sure Attoparsec is in the “nothing matters but performance” camp, and if we’re after user-friendly error messages, we should look into Parsec, which is a tiny bit harder to use but quite a bit more flexible.
Hey wait that’s awful
It’s not really so bad. We could, instead of using fail
to track our error messages, actually parse to an Either String [Word32]
. That way error conditions are Real Actual Values and we never need to backtrack once we find a definite failure (as opposed to what Attoparsec thinks of as a failure).
I don’t know if that would actually work when you consider that it should short-circuit on the first failure… but it sounds like fun homework for next time.
In the meantime, though, this has gone on for quite a while. I’d like to kick back, relax, and–
What about that spec thing
Ugh. Specs. Who needs ‘em? Well, we came so close to actually being able to validate UTF-8 that it seems silly to stop here. There are only two things we’re missing, and both are very easy to add:
The definition of UTF-8 prohibits encoding character numbers between
U+D800
andU+DFFF
, which are reserved for use with the UTF-16 encoding form (as surrogate pairs) and do not directly represent characters.
And the fact that codepoints from 0x110000
to 0x1FFFFF
are not valid, because of the UTF-16 restriction we talked about before. So heck, let’s just handle these:
codePointParser :: Parser Word32
codePointParser =
byteSequence ["0xxxxxxx"] <|>
overlong 0x7F (multibyte "110xxxxx" 1) <|>
overlong 0x7FF (multibyte "1110xxxx" 2) <|>
checkedParser withMax (not . isSurrogate) "illegal surrogate pair"
where
multibyte leader count =
byteSequence (leader : replicate count "10xxxxxx")
fourByteParser = (overlong 0xFFFF (multibyte "11110xxx" 3))
withMax = checkedParser fourByteParser (< 0x110000) "illegal codepoint over 0x10FFFF"
isSurrogate w = w >= 0xD800 && w <= 0xDFFF
And now we’re really, truly done.
Why… why did we just do that
Well, for fun! And to learn about parser combinators. And UTF-8. And because we really didn’t think it would take that long when we started writing this blog post.
The end result is certainly not a useful UTF-8 parser. UTF-8 is such a simple format that throwing parser combinators at it is overkill, and Data.Text
comes with a decodeUtf8'
function that behaves much like my parseUtf8
– except that it has actual error messages, it results in a useful type (Text
instead of [Word32]
), and it doesn’t do ridiculous things like turning bytes into linked lists of bits.
But the end result is hopefully a rather educational parser. decodeUtf8'
is so highly optimized that it’s very difficult to follow what’s going on – which is expected; performance is more important if you actually want to use these functions.
And hopefully it’s something that we could return to, some day, to fix the error messages. Or perhaps we could profile it, and try to measure the performance. Maybe even optimize the generated assembly.
But not today. Today I’m going to go outside.
Here’s some interesting further reading, because if you got this far you clearly have some sort of crippling reading addiction:
- The code for this blog post
- This very approachable trio of articles by Brian O’Sullivan about Attoparsec’s development:
- That Spolsky article you said you already read