Haskell Comic Scraper: Part 2
Last time, we looked at the basics for a webcomic scraper. In this post, we’ll look at extending our code a little to incorporate some date handling and some simple XML processing.
The second version of the code is here.
Date based output directories
The first thing we’ll do is set things up so that the webcomic images we download are stored in a directory whose name is taken from today’s date. Date and time handling functions are in the Data.Time module, so we import that:
import Data.Time
Following our “does it always return the same value” heuristic, it seems pretty clear that a function to get today’s date has to live in the IO monad, since it wouldn’t be very useful for such a function to return the same value every time it’s called. The Data.Time module has a number of functions for dealing with dates, times and time zones, mostly modelled on the Unix C library functions for doing the same jobs. We can get a representation of the current date in the local time zone (the Day type is just a wrapper for an integer giving the modified Julian day) using the following code:
getToday :: IO Day
getToday = do
  tz <- getCurrentTimeZone
  tm <- getCurrentTime
  return (localDay $ utcToLocalTime tz tm)
Here, both getting the current time zone and getting the current time require IO actions: the current time zone can change if, for instance, the machine where our code is running is moved, and the current time obviously changes from one call to the next. The other two functions, localDay and utcToLocalTime, are ordinary pure functions. All of getCurrentTimeZone, getCurrentTime, localDay and utcToLocalTime are defined in the Data.Time module, whose documentation you can read here.
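To see why the time zone matters, here is a minimal sketch of the pure part of getToday, applied to a fixed instant rather than the real clock. The helper name dayIn and the sample instant are made up for illustration; utc and hoursToTimeZone are existing values from Data.Time.

```haskell
import Data.Time

-- Hypothetical helper: the pure core of getToday, with the time zone
-- and the instant passed in instead of being read from the system.
dayIn :: TimeZone -> UTCTime -> Day
dayIn tz tm = localDay (utcToLocalTime tz tm)

-- An arbitrary sample instant: 23:30 UTC on 1 January 2024.
sampleTime :: UTCTime
sampleTime =
  UTCTime (fromGregorian 2024 1 1) (secondsToDiffTime (23 * 3600 + 30 * 60))

main :: IO ()
main = do
  -- Same instant, two time zones, two different calendar days.
  print (dayIn utc sampleTime)                  -- 2024-01-01
  print (dayIn (hoursToTimeZone 2) sampleTime)  -- 2024-01-02
```

The same instant falls on different calendar days depending on the zone, which is why getToday has to ask for both the time and the time zone.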
Once we have a date, we’d like to turn it into a good name to use for a pathname. This is a simple manipulation of the result of getToday, but because getToday is an IO action, our function to make the pathname returns an IO action too:
makeDateName :: String -> IO String
makeDateName base = do
  date <- getToday
  return (base ++ "/" ++ showGregorian date)
Here, the showGregorian function is a handy utility from Data.Time that formats a date in ISO YYYY-MM-DD format, and we pass in a “base directory” where all the webcomic download directories should go.
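One way to check the formatting by hand is to separate the pure string-building from the IO step. This is only a sketch: dateName is a hypothetical helper, not part of the code linked above, and the date is an arbitrary example.

```haskell
import Data.Time

-- Hypothetical pure helper: build the directory name from a base
-- directory and a given day, exactly as makeDateName does after it
-- has obtained the date.
dateName :: String -> Day -> String
dateName base date = base ++ "/" ++ showGregorian date

main :: IO ()
main = putStrLn (dateName "comics" (fromGregorian 2024 3 7))
-- prints: comics/2024-03-07
```

Because showGregorian zero-pads the month and day, the resulting directory names sort in date order.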
We can now modify the processAll function from last time. We’ll need to pass in a base directory name (parameter bd in the function below), plus a list of comics to get. We determine the path to save the downloads to using makeDateName, then use the createDirectoryIfMissing function from System.Directory to make the relevant directory (the first argument, True, acts like the -p flag to mkdir, making parent directories as required), then change the working directory to the new directory with setCurrentDirectory, and finally download the files:
processAll :: String -> [Comic] -> IO ()
processAll bd cs = do
  dateDirectory <- makeDateName bd
  createDirectoryIfMissing True dateDirectory
  setCurrentDirectory dateDirectory
  mapM_ writeImageToFile cs
As before, a lot of this looks just like what we would write in a traditional imperative language. The monadic structure of the code that the Haskell compiler generates from a do block deals with all the details of routing the results of one step of the computation to the next in a way that preserves referential transparency.
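To make that routing concrete, here is roughly what a do block turns into. The sketch below writes the body of makeDateName twice, once with do and once with the >>= operator that do is shorthand for. Both functions are illustrative only: they take the date-producing action as a parameter, an assumption made here so that they can be tried with a fixed date instead of the real clock.

```haskell
import Data.Time

-- With do notation, in the same shape as makeDateName.
nameWithDo :: Monad m => m Day -> String -> m String
nameWithDo getDay base = do
  date <- getDay
  return (base ++ "/" ++ showGregorian date)

-- The same thing written with >>= directly: the result of getDay is
-- handed to the lambda as its argument, date.
nameWithBind :: Monad m => m Day -> String -> m String
nameWithBind getDay base =
  getDay >>= \date -> return (base ++ "/" ++ showGregorian date)

main :: IO ()
main = do
  -- A stand-in for getToday that always yields the same, arbitrary day.
  let fixedDay = return (fromGregorian 2024 3 7) :: IO Day
  a <- nameWithDo fixedDay "comics"
  b <- nameWithBind fixedDay "comics"
  print (a == b)  -- True
```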
An XML configuration file
Until now, we’ve been specifying the list of comics to read directly in our code. This isn’t very convenient, so it would be good to read the comic details from a configuration file. We’ll use a simple XML file that we’ll call comics.xml: an example is here.
Haskell has a number of very sophisticated libraries for dealing with XML documents, but sometimes these are slight overkill. If all you want to do is pull some information out of an XML file without too much fuss, then the TagSoup package is what you want. This has a nice simple interface for reading XML (or HTML) that may or may not be well-formed, from which you can extract the data items you need. We import the package as:
import Text.HTML.TagSoup
and we use it as shown in the getConfig function, which reads the contents of a configuration file and parses it using the TagSoup parseTags function. The result of this is a list of HTML/XML tags with attached attributes that you can process using Haskell’s usual list processing functions. Here, we first use filter to pick out the entries in the tag list that are open tags named baseDirectory, and extract the name attribute of the first one to use as our base directory. If that works, we pick out all the open tags named comic and process them with the makeComic function, which simply pulls the relevant items out of the attributes of each comic tag.
getConfig :: FilePath -> IO (String, [Comic])
getConfig cfg_path = do
  fc <- readFile cfg_path
  let tags = parseTags fc
  let bdtags = filter (isTagOpenName "baseDirectory") tags
  case bdtags of
    -- Fail straight away rather than returning a deferred error value.
    [] -> error "baseDirectory field missing from comics.xml"
    -- Use the first baseDirectory tag and ignore any others.
    (bdtag : _) -> do
      let bd = fromAttrib "name" bdtag
      let cs = map makeComic $ filter (isTagOpenName "comic") tags
      return (bd, cs)
  where
    -- fromAttrib returns "" for a missing attribute, so an empty
    -- string here means the comic tag is incomplete.
    makeComic t =
      if n == "" || u == "" || r == "" || p == ""
        then error "attribute missing in comic tag"
        else Comic n u r p
      where
        n = fromAttrib "name" t
        u = fromAttrib "url" t
        r = fromAttrib "regex" t
        p = fromAttrib "prefix" t
While it’s not up to more complex XML processing tasks that require walking and transforming the tree of entries in an XML file, for this type of application TagSoup is just about perfect. It’s lightweight, easy to use, and works seamlessly with Haskell’s standard list functions.
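To try the TagSoup calls in isolation, the following sketch parses a small configuration held in a string. The original comics.xml is not reproduced here, so the sample is only a guess at its shape based on the attributes getConfig reads; the config wrapper element, all the attribute values and the namesOf helper are made up for illustration.

```haskell
import Text.HTML.TagSoup

-- Hypothetical configuration, inlined as a string so that the example
-- needs no file on disk. Only the attribute names follow getConfig.
sampleConfig :: String
sampleConfig = unlines
  [ "<config>"
  , "  <baseDirectory name=\"/tmp/comics\"/>"
  , "  <comic name=\"example\" url=\"http://example.com/\" regex=\"img/.*png\" prefix=\"http://example.com/\"/>"
  , "</config>"
  ]

-- The name attribute of every open tag with the given tag name.
namesOf :: String -> [String]
namesOf tagName =
  [ fromAttrib "name" t
  | t <- parseTags sampleConfig
  , isTagOpenName tagName t
  ]

main :: IO ()
main = do
  print (namesOf "baseDirectory")  -- ["/tmp/comics"]
  print (namesOf "comic")          -- ["example"]
```

The list comprehension does the same job as the filter and map in getConfig: parseTags produces a flat list of tags, and everything after that is ordinary list processing.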
To make use of this, we just modify our main program to read the configuration information, which we pass right along to processAll.
main :: IO ()
main = do
  (bd, cs) <- getConfig "comics.xml"
  processAll bd cs
Simple!