Using `ghc-syntax-highlighter` with Hakyll
In 2018, Mark Karpov announced `ghc-syntax-highlighter`, a project which uses GHC’s own lexer to tokenise Haskell source code for the best possible syntax highlighting. I thought this was extremely cool, and really wanted to use it for this blog. Unfortunately, this is what the post had to say about `pandoc`, which Hakyll uses to process Markdown:
`skylighting` is what Pandoc uses btw. And from what I can tell it’s hardcoded to use only that library for highlighting, so some creativity may be necessary to get it work.
I briefly looked into this and reached the same conclusion (and as of this writing it is still the case) so, as a deeply uncreative individual, I sighed deeply and resigned myself to never knowing this particular joy.
Until, just a few days ago, I read this lovely blog post by Tony Zorman about customising Hakyll’s syntax highlighting which included this gem of a sentence in the very first paragraph:
Using `pygmentize` as an example, I will show you how you can swap out pandoc’s native syntax highlighting with pretty much any third party tool that can output HTML.
And in fact this is an accurate description of what follows. This sounded like exactly what I wanted to do, and between this and Mark’s `mmark-ext` (which implements `ghc-syntax-highlighter` support as an extension for `mmark`) I was able to successfully follow the instructions to get `ghc-syntax-highlighter` working with my blog. Let me walk you through what I did.
Here are the language extensions I will be using:
{-# LANGUAGE LambdaCase #-}
{-# LANGUAGE OverloadedStrings #-}
{-# LANGUAGE ViewPatterns #-}
and these additional imports:
import GHC.SyntaxHighlighter (Token(..), tokenizeHaskell)
import Text.Blaze.Html.Renderer.Text (renderHtml)
import Text.Pandoc.Definition (Block (CodeBlock, RawBlock), Pandoc)
import Text.Pandoc.Walk (walk)
import qualified Data.Text as T
import qualified Data.Text.Lazy as L
import qualified Text.Blaze.Html5 as H
import qualified Text.Blaze.Html5.Attributes as A
I chose to use `blaze-html` since it is already a transitive dependency of `pandoc` and using it has no impact on our dependency tree.
Tony uses `walkM` since an external program (`pygmentize`) is involved, but since we are working with pure Haskell code we can get away with just `walk`:
ghcSyntaxHighlight :: Pandoc -> Pandoc
ghcSyntaxHighlight = walk $ \case
  -- First class is "haskell" and the lexer accepted the body.
  CodeBlock (_, (isHaskell -> True):_, _) (tokenizeHaskell -> Just tokens) ->
    RawBlock "html" . L.toStrict . renderHtml $ formatHaskellTokens tokens
  -- Anything else is left for pandoc to handle as usual.
  block -> block
  where isHaskell = (== "haskell")
This only matches code blocks whose first class is `haskell` and which `tokenizeHaskell` is able to successfully tokenise; everything else falls back on existing `pandoc` behaviour.
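If the view patterns in that `case` look unfamiliar, here is a minimal sketch of the same match-or-fall-through shape using only `base`. The `Node` type, `parseNumber` and `rewrite` are hypothetical stand-ins for pandoc's `Block`, `tokenizeHaskell` and `ghcSyntaxHighlight`; they are not part of any library.

```haskell
{-# LANGUAGE LambdaCase #-}
{-# LANGUAGE ViewPatterns #-}

import Text.Read (readMaybe)

-- Hypothetical stand-in for pandoc's Block: a code block carries its
-- classes and its body, anything already rendered is opaque text.
data Node
  = Code [String] String
  | Rendered String
  deriving (Eq, Show)

-- Hypothetical stand-in for tokenizeHaskell: succeeds only on valid input.
parseNumber :: String -> Maybe Int
parseNumber = readMaybe

-- Rewrite a node only when the first class matches AND parsing succeeds.
-- Every other node falls through to the catch-all and is returned unchanged.
rewrite :: Node -> Node
rewrite = \case
  Code ((isNumber -> True) : _) (parseNumber -> Just n) ->
    Rendered ("<b>" ++ show n ++ "</b>")
  node -> node
  where isNumber = (== "number")
```

Both view patterns have to succeed for the first alternative to fire, so a block with the wrong class, no class at all, or an unparseable body is passed through untouched.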
`formatHaskellTokens` generates markup very similarly to what `pandoc` already does:
formatHaskellTokens :: [(Token, T.Text)] -> H.Html
formatHaskellTokens tokens =
  H.div H.! A.class_ "sourceCode" $
    H.pre H.! A.class_ "sourceCode haskell" $
      H.code H.! A.class_ "sourceCode haskell" $
        mapM_ tokenToHtml tokens
`tokenizeHaskell` produces a list of pairs of the token type (`KeywordTok`, `VariableTok`, etc.) and the matched text, and the `tokenToHtml` function (adapted from `mmark-ext`) creates a `span` element with the appropriate class name for our CSS to style:
tokenToHtml :: (Token, T.Text) -> H.Html
tokenToHtml (tokenClass -> className, text) =
  H.span H.!? (not $ T.null className, A.class_ (H.toValue className)) $
    H.toHtml text
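To see what this produces without pulling in the lexer, here is a small sketch that takes the class name directly. `spanWithClass` and `render` are hypothetical helpers of mine, not part of `blaze-html` or `mmark-ext`; the blaze calls themselves are the same ones used above.

```haskell
{-# LANGUAGE OverloadedStrings #-}

import qualified Data.Text as T
import qualified Data.Text.Lazy as L
import Text.Blaze.Html.Renderer.Text (renderHtml)
import qualified Text.Blaze.Html5 as H
import qualified Text.Blaze.Html5.Attributes as A

-- Hypothetical helper: same shape as tokenToHtml, but the class name is
-- passed in directly instead of being derived from a Token.
spanWithClass :: T.Text -> T.Text -> H.Html
spanWithClass className text =
  -- (!?) attaches the attribute only when the Bool is True, so an empty
  -- class name yields a bare <span>.
  H.span H.!? (not $ T.null className, A.class_ (H.toValue className)) $
    H.toHtml text

-- Render to strict Text, the same conversion ghcSyntaxHighlight applies
-- before wrapping the result in a RawBlock.
render :: H.Html -> T.Text
render = L.toStrict . renderHtml
```

In GHCi, `render (spanWithClass "kw" "where")` gives `<span class="kw">where</span>`, and `render (spanWithClass "op" "<$>")` gives `<span class="op">&lt;$&gt;</span>`. `H.toHtml` escapes the text, which matters here because Haskell operators are full of angle brackets and ampersands.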
`tokenClass` (also adapted from `mmark-ext`) outputs the appropriate class name for each token, and I made only minor changes for styling purposes:
tokenClass :: Token -> T.Text
tokenClass = \case
  KeywordTok     -> "kw"
  PragmaTok      -> "pp" -- Preprocessor
  SymbolTok      -> "ot" -- Other
  VariableTok    -> "va"
  ConstructorTok -> "dt" -- DataType
  OperatorTok    -> "op"
  CharTok        -> "ch"
  StringTok      -> "st"
  IntegerTok     -> "dv" -- DecVal
  RationalTok    -> "dv" -- DecVal
  CommentTok     -> "co"
  SpaceTok       -> ""
  OtherTok       -> "ot"
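When adjusting this mapping, it helps to see which token type the lexer assigns to a given snippet. The following is a rough sketch for poking at that in GHCi; `tokenTypes` is a hypothetical helper name, while `tokenizeHaskell` and the `Token` constructors come from `ghc-syntax-highlighter`.

```haskell
{-# LANGUAGE OverloadedStrings #-}

import qualified Data.Text as T
import GHC.SyntaxHighlighter (Token (..), tokenizeHaskell)

-- Hypothetical helper: pair each lexeme in a snippet with its token type,
-- dropping whitespace tokens to keep the output short.
-- Returns Nothing when the lexer rejects the input.
tokenTypes :: T.Text -> Maybe [(Token, T.Text)]
tokenTypes src = filter ((/= SpaceTok) . fst) <$> tokenizeHaskell src
```

Load it in GHCi and try something like `tokenTypes "module Main where"` to check which constructors show up before deciding how to style them.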
Finally we have to actually use `ghcSyntaxHighlight`, for which we define a replacement for `pandocCompiler` called (imaginatively) `customPandocCompiler` and use it everywhere:
customPandocCompiler :: Compiler (Item String)
customPandocCompiler =
  pandocCompilerWithTransform
    defaultHakyllReaderOptions
    defaultHakyllWriterOptions
    ghcSyntaxHighlight
Again, since we are using pure functions, we can get away with `pandocCompilerWithTransform` instead of `pandocCompilerWithTransformM`.
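As a sketch of what "everywhere" might look like, here is one rule using the new compiler. The `posts/*` pattern, the `templates/default.html` path and the `postRules` name are assumptions about a typical Hakyll site, not taken from this blog, and the compiler is stubbed with `id` only so the snippet compiles on its own.

```haskell
{-# LANGUAGE OverloadedStrings #-}

import Hakyll

-- Placeholder so this sketch stands alone; in the real site this is the
-- customPandocCompiler defined above, with ghcSyntaxHighlight instead of id.
customPandocCompiler :: Compiler (Item String)
customPandocCompiler =
  pandocCompilerWithTransform
    defaultHakyllReaderOptions
    defaultHakyllWriterOptions
    id

-- Hypothetical rule: the pattern and the template path are assumptions.
postRules :: Rules ()
postRules =
  match "posts/*" $ do
    route $ setExtension "html"
    compile $
      customPandocCompiler
        >>= loadAndApplyTemplate "templates/default.html" defaultContext
        >>= relativizeUrls
```

The only change from a stock Hakyll rule is swapping `pandocCompiler` for `customPandocCompiler`; routes, templates and contexts stay as they were.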
And we’re done! I also had to tweak my CSS slightly since `pandoc` was generating a `span` for each line of source code instead of each token like `ghc-syntax-highlighter` does. For the complete listing, see here.