Haskell Comic Scraper: Part 2
Last time, we looked at the basics for a webcomic scraper. In this post, we’ll look at extending our code a little to incorporate some date handling and some simple XML processing.
The second version of the code is here.
Date based output directories
The first thing we’ll do is set things up so that the webcomic images we download are stored in a directory whose name is taken from today’s date. Date and time handling functions are in the
Data.Time module, so we import that:
    import Data.Time
Following our “does it always return the same value” heuristic, it seems pretty clear that a function to get today’s date has to live in the
IO monad, since it wouldn’t be very useful for such a function to return the same value every time it’s called. The
Data.Time module has a number of functions for dealing with dates and times and timezones, mostly modelled on the Unix C library functions for doing the same jobs. We can get a representation of the current date (the
Day type is just a wrapper for an integer giving the modified Julian day) in the local timezone using the following code:
    getToday :: IO Day
    getToday = do
      tz <- getCurrentTimeZone
      tm <- getCurrentTime
      return (localDay $ utcToLocalTime tz tm)
Here, both getting the current time zone and getting the current time require IO actions (the current time zone can change from time to time if the machine where our code is running is moved, for instance), and obviously the current time changes with time. The functions used here,
getCurrentTimeZone, getCurrentTime, utcToLocalTime and localDay, are all defined in the
Data.Time module, whose documentation you can read here.
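The dependence on the time zone is easiest to see if the pure conversion is separated from the two IO actions. The following is a minimal sketch rather than part of the scraper itself: the names dayInZone, exampleInstant and exampleDays are made up for illustration, and it assumes only functions exported by Data.Time.

```haskell
import Data.Time

-- Pure core of getToday: given a time zone and a UTC instant, work out
-- the calendar day in that zone. (dayInZone is a hypothetical helper.)
dayInZone :: TimeZone -> UTCTime -> Day
dayInZone tz tm = localDay (utcToLocalTime tz tm)

-- A fixed instant chosen for illustration: 23:00 UTC on 1 January 2024.
exampleInstant :: UTCTime
exampleInstant =
  UTCTime (fromGregorian 2024 1 1) (secondsToDiffTime (23 * 3600))

-- The same instant, rendered as a date in UTC and in a zone two hours ahead.
exampleDays :: (String, String)
exampleDays =
  ( showGregorian (dayInZone utc exampleInstant)
  , showGregorian (dayInZone (hoursToTimeZone 2) exampleInstant)
  )
```

Evaluating exampleDays in GHCi gives ("2024-01-01","2024-01-02"): the instant is the same, but in the zone two hours ahead of UTC it is already the next day. This is why getToday needs both the clock and the time zone.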
Once we have a date, we’d like to turn it into a suitable directory name. This is a simple manipulation of the result of
getToday, but because
getToday is an IO action, our function to make the pathname returns an IO action too:
    makeDateName :: String -> IO String
    makeDateName base = do
      date <- getToday
      return (base ++ "/" ++ showGregorian date)
The showGregorian function is a handy utility from
Data.Time that formats a date in ISO YYYY-MM-DD format, and we pass in a “base directory” where all the webcomic download directories should go.
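Joining the two parts with a literal "/" works on Unix-like systems. As a possible alternative, the sketch below uses the </> operator from System.FilePath (part of the filepath package), which inserts the platform's own separator. The names dateDirName and exampleDir are hypothetical, and the function is kept pure by taking the day as an argument instead of calling getToday.

```haskell
import Data.Time (Day, fromGregorian, showGregorian)
import System.FilePath ((</>))

-- Pure variant of makeDateName: join the base directory and the ISO date
-- using the platform's path separator. (dateDirName is a hypothetical name.)
dateDirName :: FilePath -> Day -> FilePath
dateDirName base date = base </> showGregorian date

-- Example value for a fixed date, 2 January 2024.
exampleDir :: FilePath
exampleDir = dateDirName "comics" (fromGregorian 2024 1 2)
```

On a POSIX system exampleDir is "comics/2024-01-02". An IO version in the style of makeDateName could be obtained with fmap (dateDirName base) getToday.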
We can now modify the
processAll function from last time. We’ll need to pass in a base directory name (parameter
bd in the function below), plus a list of comics to get. We can determine the path to save the downloads to using
makeDateName, and we then use the
createDirectoryIfMissing function from
System.Directory to make the relevant directory (the first
True argument acts the same as the
-p flag to
mkdir, making parent directories as required), then change working directory to the new directory, then download the files:
    processAll :: String -> [Comic] -> IO ()
    processAll bd cs = do
      dateDirectory <- makeDateName bd
      createDirectoryIfMissing True dateDirectory
      setCurrentDirectory dateDirectory
      mapM_ writeImageToFile cs
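One thing to keep in mind is that setCurrentDirectory changes the working directory of the whole process for the rest of the run. If that matters, a possible variation is sketched below. It assumes the withCurrentDirectory function, which recent versions of the directory package export from System.Directory, and the helper name inDirectory is made up for illustration.

```haskell
import System.Directory

-- Create dir (and any missing parents), then run the action with dir as the
-- working directory. withCurrentDirectory restores the previous working
-- directory afterwards, even if the action throws an exception.
-- (inDirectory is a hypothetical helper, not part of the original code.)
inDirectory :: FilePath -> IO a -> IO a
inDirectory dir action = do
  createDirectoryIfMissing True dir
  withCurrentDirectory dir action
```

With this helper, the last two steps of processAll could be written as inDirectory dateDirectory (mapM_ writeImageToFile cs).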
As before, a lot of this looks just like what we would write in a traditional imperative language. The monadic structure of the code that the Haskell compiler generates from a
do block deals with all the details of routing the results of one step of the computation to the next in a way that preserves referential transparency.
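To make that routing a little more concrete, here is a sketch of roughly how a do block like the one in makeDateName unfolds into the >>= (bind) operator. It is an illustration, not the compiler's literal output. The two functions are generalised over any monad and take the day-producing action as a parameter, purely so that they can be tried out with Maybe instead of IO. The names dateNameDo and dateNameBind are hypothetical.

```haskell
import Data.Time (Day, fromGregorian, showGregorian)

-- Written with do notation, as in makeDateName.
dateNameDo :: Monad m => m Day -> String -> m String
dateNameDo getDay base = do
  date <- getDay
  return (base ++ "/" ++ showGregorian date)

-- The same computation with the bind written out by hand: the result of
-- getDay is passed to a function that builds the next step.
dateNameBind :: Monad m => m Day -> String -> m String
dateNameBind getDay base =
  getDay >>= \date -> return (base ++ "/" ++ showGregorian date)
```

Both dateNameDo (Just (fromGregorian 2024 1 2)) "comics" and the dateNameBind equivalent evaluate to Just "comics/2024-01-02". Passing getToday as the first argument gives back the IO behaviour of makeDateName.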
An XML configuration file
Until now, we’ve been specifying the list of comics to read directly in our code. This isn’t very convenient, so it would be good to read the comic details from a configuration file. We’ll use a simple XML file that we’ll call
comics.xml: an example is here.
Haskell has a number of very sophisticated libraries for dealing with XML documents, but sometimes these are slight overkill. If all you want to do is pull some information out of an XML file without too much fuss, then the TagSoup package is what you want. This has a nice simple interface for reading XML (or HTML) that may or may not be well-formed, from which you can extract the data items you need. We import the package as:
    import Text.HTML.TagSoup
and we use it as shown in the
getConfig function, which reads the contents of a configuration file and parses it using the
parseTags function. The result of this is a list of HTML/XML tags with attached attributes that you can process using Haskell’s usual list processing functions. Here, we first use
filter to pick out the entry in the tag list having an open tag of
baseDirectory, from which we extract the name attribute to use as our base directory. If that works, we pick out all the entries with an open tag of
comic and process them with the
makeComic function, which simply pulls the relevant items out of attributes in the comic tags.
    getConfig :: FilePath -> IO (String, [Comic])
    getConfig cfg_path = do
      fc <- readFile cfg_path
      let tags = parseTags fc
      let bdtags = filter (isTagOpenName "baseDirectory") tags
      case bdtags of
        -- No baseDirectory tag at all: report the problem.
        [] -> return (error "baseDirectory field missing from comics.xml")
        -- Otherwise use the first baseDirectory tag and collect the comics.
        _  -> do
          let bd = fromAttrib "name" $ head bdtags
          let cs = map makeComic $ filter (isTagOpenName "comic") tags
          return (bd, cs)
      where
        -- fromAttrib returns "" for a missing attribute, so an empty
        -- string means the attribute was not supplied.
        makeComic t =
          if n == "" || u == "" || r == "" || p == ""
            then error "attribute missing in comic tag"
            else Comic n u r p
          where
            n = fromAttrib "name" t
            u = fromAttrib "url" t
            r = fromAttrib "regex" t
            p = fromAttrib "prefix" t
While it’s not up to more complex XML processing tasks that require walking and transforming the tree of entries in an XML file, for this type of application,
TagSoup is just about perfect. It’s lightweight, easy to use, and interfaces with standard ways of working with lists in Haskell in a seamless way.
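The attribute check in makeComic reports problems by calling error. As a possible alternative, here is a minimal sketch that returns the problem as a value using Either. It makes two assumptions: that Comic has four String fields in the order used by Comic n u r p, and that attributes arrive as a list of name/value pairs, which is how TagSoup stores them in an open tag. The names requireAttr and comicFromAttrs are hypothetical.

```haskell
-- Assumed shape of the Comic type from the first post: four String fields
-- (name, url, regex, prefix).
data Comic = Comic String String String String
  deriving (Eq, Show)

-- Look up one attribute, treating a missing or empty value as an error.
requireAttr :: String -> [(String, String)] -> Either String String
requireAttr key attrs =
  case lookup key attrs of
    Just v | not (null v) -> Right v
    _                     -> Left ("attribute missing in comic tag: " ++ key)

-- Build a Comic from an attribute list. The Either applicative stops at
-- the first missing attribute and returns its error message.
comicFromAttrs :: [(String, String)] -> Either String Comic
comicFromAttrs attrs =
  Comic <$> requireAttr "name" attrs
        <*> requireAttr "url" attrs
        <*> requireAttr "regex" attrs
        <*> requireAttr "prefix" attrs
```

A caller can then decide what to do with a Left value, for instance skipping that comic and carrying on with the rest, instead of having the whole program stop.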
To make use of this, we just modify our main program to read the configuration information, which we pass right along to processAll:
    main :: IO ()
    main = do
      (bd, cs) <- getConfig "comics.xml"
      processAll bd cs