Revisiting 'Monadic Parsing in Haskell'
Translated to Russian by Clipart Team and Sindhi by Samuel Badree
‘Monadic Parsing in Haskell’ is a short paper that laid the groundwork for libraries like Parsec and Attoparsec. Although it was published in 1998 (almost 20 years ago!) it has aged gracefully and the code samples will run with almost no changes. However, the state of the art has advanced since then and I think the use of modern Haskell can make this material simpler to follow and implement.
Monadic parsing in Haskell is what sold me on all three. Before Haskell my experiences with parsing had involved buggy regexes for lexers and wrangling tools like bison
and flex
, and although I’d heard that Haskell was good for parsing I couldn’t see how this could be the case when I couldn’t find any robust regex libraries! An aside in some documentation pointed me to Attoparsec and when I saw the example RFC2616 parser it seemed like a magic trick. How could it be so small? After a few weeks of trying it myself I was convinced and I’ve never looked back. This was the first application of monads I encountered that actually made my life simpler, and I started to realise that there was more to monads than smugness and being inaccessible to newcomers.
The first change I want to make is the type definition. The paper uses the type
newtype Parser a = Parser (String -> [(a,String)])
and although this is a famous enough definition that it has its own rhyme, I think the flexibility of lists is wasted here. The authors don’t use it, and instead define a ‘deterministic choice’ operator (+++)
that gives at most one result and use that everywhere instead. There is already a perfectly good datatype in Haskell for lists of at most one element, Maybe
, so I’ll use that instead of []
:
newtype Parser a = Parser (String -> Maybe (a, String))
If we rename String
to s
and Maybe
to m
, a more interesting pattern is revealed:
newtype Parser s m a = Parser (s -> m (a, s))
This is StateT
! Recognising this pattern makes instance definitions much easier, so much easier in fact that GHC can do it for us automatically with -XGeneralizedNewtypeDeriving
! For completeness I will resist the temptation to do this, but you can try it yourself with
{-# LANGUAGE GeneralizedNewtypeDeriving #-}
newtype Parser a = Parser (StateT String Maybe a) deriving (Functor, Applicative, Alternative, Monad)
The second change is also for completeness: the authors jump straight into the Monad
instance without defining Functor
and Applicative
first. To be fair, the Applicative
abstraction hadn’t been discovered yet, and this is also the reason why the authors define mzero
and mplus
(which they call (++)
) instead of the more general Alternative
methods empty
and (<|>)
. Because of our Maybe
change, defining Alternative
means I won’t need to bother with their (+++)
.
Finally, I’ll try to avoid do-notation where possible in favour of a more Applicative style using e.g. <*>
(which can be pronounced ‘splat’ if you don’t already have a name for it) because most of these parsers don’t require it.
Let’s begin!
{-# LANGUAGE InstanceSigs #-}
import Control.Applicative (Alternative(..))
import Control.Monad.Trans.State.Strict
import Control.Monad (guard)
import Data.Char (isSpace, isDigit, ord)
For convenience I’ve defined an unParser
that unwraps a Parser a
to its underlying StateT String Maybe a
.
newtype Parser a = Parser { unParser :: StateT String Maybe a }
runParser = runStateT . unParser
fmap
is as simple as unwrapping the Parser
and using the underlying StateT
’s fmap
.
instance Functor Parser where
fmap :: (a -> b) -> Parser a -> Parser b
fmap f p = Parser $ f <$> unParser p
More unwrapping for Applicative
and Alternative
.
The Alternative
typeclass allows us to express the idea of running one parser or another parser, resulting in the first successful parse. empty
handles the case where both parsers fail, and (<|>)
(which can be pronounced ‘alt’) performs the alternation. This is convenient enough on its own, but Alternative
also provides many
and some
which correspond exactly to many
and many1
from the paper:
-- many v = some v <|> pure []
-- some v = (:) <$> v <*> many v
but only after replacing []
with Maybe
like I’ve done here so that (<|>)
corresponds to (+++)
.
instance Applicative Parser where
pure :: a -> Parser a
pure a = Parser $ pure a
(<*>) :: Parser (a -> b) -> Parser a -> Parser b
f <*> a = Parser $ unParser f <*> unParser a
instance Alternative Parser where
empty :: Parser a
empty = Parser empty
(<|>) :: Parser a -> Parser a -> Parser a
a <|> b = Parser $ unParser a <|> unParser b
The Monad
definition is slightly more interesting, because we have to manually construct the StateT
value, but this also boils down to unwrapping and rewrapping.
instance Monad Parser where
(>>=) :: Parser a -> (a -> Parser b) -> Parser b
a >>= f = Parser $ StateT $ \s -> do
(a', s') <- runParser a s
runParser (f a') s'
Notice that anyChar
is the only function below that manually constructs a Parser
, and satisfy
is the only one that requires the Monad
interface.
anyChar :: Parser Char
anyChar = Parser . StateT $ \s -> case s of
[] -> empty
(c:cs) -> pure (c, cs)
satisfy :: (Char -> Bool) -> Parser Char
satisfy pred = do
c <- anyChar
guard $ pred c
pure c
char :: Char -> Parser Char
char = satisfy . (==)
string :: String -> Parser String
string [] = pure []
string (c:cs) = (:) <$> char c <*> string cs
Again, many
and many1
don’t need to be defined because they are provided for free!
sepBy :: Parser a -> Parser b -> Parser [a]
sepBy p sep = (p `sepBy1` sep) <|> pure []
sepBy1 :: Parser a -> Parser b -> Parser [a]
sepBy1 p sep = (:) <$> p <*> many (sep *> p)
These are almost identical to the definitions in the paper. I’ve included chainr
for completeness.
chainl :: Parser a -> Parser (a -> a -> a) -> a -> Parser a
chainl p op a = (p `chainl1` op) <|> pure a
chainl1 :: Parser a -> Parser (a -> a -> a) -> Parser a
chainl1 p op = p >>= rest
where
rest a = (do
f <- op
b <- p
rest (f a b)) <|> pure a
chainr :: Parser a -> Parser (a -> a -> a) -> a -> Parser a
chainr p op a = (p `chainr1` op) <|> pure a
chainr1 :: Parser a -> Parser (a -> a -> a) -> Parser a
chainr1 p op = scan
where
scan = p >>= rest
rest a = (do
f <- op
b <- scan
rest (f a b)) <|> pure a
The only difference here is the replacement of (>>)
with (*>)
. These have the same effect, except that the latter works on Applicative
s and also comes with a counterpart (<*)
. When writing parsers I find these especially useful because they allow me to combine multiple parsers together when I only care about the output of one of them, e.g. ignored *> ignored *> value <* ignored
. The calculator example uses this in factor
.
space :: Parser String
space = many (satisfy isSpace)
token :: Parser a -> Parser a
token p = p <* space
symbol :: String -> Parser String
symbol = token . string
apply :: Parser a -> String -> Maybe (a, String)
apply p = runParser (space *> p)
The calculator example is almost unchanged.
expr, term, factor, digit :: Parser Int
expr = term `chainl1` addop
term = factor `chainl1` mulop
factor = digit <|> (symbol "(" *> expr <* symbol ")")
digit = subtract (ord '0') . ord <$> token (satisfy isDigit)
addop, mulop :: Parser (Int -> Int -> Int)
addop = (symbol "+" *> pure (+)) <|> (symbol "-" *> pure (-))
mulop = (symbol "*" *> pure (*)) <|> (symbol "/" *> pure (div))
Finally, the payoff!
runParser expr "(1 + 2 * 4) / 3 + 5"
Just (8,"")
What have we gained in 20 years? With only minor changes, the code is more composable and uses finer-grained abstractions. For example, if we change our minds about replacing []
with Maybe
, we can switch it back and would only have to update the type signature of apply
:
apply :: Parser a -> String -> [(a, String)]
apply p = runParser (space *> p) -- the implementation stays the same!
If we want better error messages, we could use a type such as Either String
to keep track of locations and error messages. The yoctoparsec
library takes this even further, allowing to you to choose your own stream type.
Another big difference is the Applicative
family of functions, which we can leverage whenever we don’t have to branch on a previously parsed value (which turns out to be surprisingly often). I’m a huge fan of the x <$> y <*> z
and the ignored *> value <* ignored
idioms and I think it’s useful to be able to parse this way.
Otherwise, the code is largely the same and I think it’s pretty incredible that so little has changed in 20 years! This code is available as an IHaskell notebook if you would like to experiment with it yourself.
Edit: I just found ‘From monoids to near-semirings: the essence of MonadPlus
and Alternative
’, which demonstrates how my Maybe
-based parser doesn’t strictly obey the Alternative
laws. Something to keep in mind if you plan to use it or something like it!
Thanks to Alan O’Donnell, Andrey Mokhov, Annie Cherkaev, Julia Evans, and Noah Luck Easterly for comments and feedback!