Server-side Web Analytics with GoAccess
Let’s start with a digression.
It will come as no surprise to anyone who knows me even a little bit that I’m just a tad disappointed with the modern internet. You look back at the visions of the early workers in hypertext and networking, and their open-hearted dreams of a world of free information bring a little tear to the eye. Fast forward to 2021, and the web is 90% spam, social media giants and ad-tech. It’s not the shiny future we were promised! It’s a swamp.
I think that’s totally duff. There should be a clear split between information and presentation, and the user should be in control of presentation. Imagine for a moment, if you will, a kinder, gentler internet from an alternate stream of history, a stream where we are allowed to have nice things. In this world, NiceNet serves up data in agreed standardised formats, and your info-portal ingests that data and renders it for you, according to rules you decide on. (You could think of the formats as being like standardised XML schemas, but we’re trying to imagine a better world, so don’t think of XML.) Every site that provides, for example, event data (live music, cinema, art exhibitions) offers up its data in EventML, a representation your portal understands. News sites serve up NewsML, and so on.
Why is this better? Because you control the presentation, you can make it uniform across different providers of information. (And you can aggregate and mash information up. And other people can do the same and sell it as a service, maybe.) Instead of hunting around for links on a custom-designed website, your portal displays information organised in ways that you’re completely familiar with. This matters not just for reducing everyday frustration: it also makes a huge difference to accessibility. Web accessibility now too often means reluctantly scattering some ARIA tags around, and maybe doing a bit of testing with a screenreader. It doesn’t address deeper issues of accessibility, like the issues that people with cognitive impairments have. (Who has an elderly relative who’s said “My internets are broken!” when links move around on the websites they use every day? And in any case, we all have days when we’re cognitively impaired to one extent or another. We all need all the help we can get.) Maintaining consistent and familiar display of information is essentially impossible with the web that we have.
And that makes me grumpy.
OK, let’s de-digress now, and get to the point. I got a new laptop recently, and using the web browser on it (empty cache, no cookies yet) really brought home just how much tracking there is on the web, even just the tracking that you can see. Every single damn site wants to give you a cookie “to improve our service to you” (you believe that? No, me neither) or get you to agree to some blanket abrogation of your GDPR rights so that they can increase their advertising click-through by a fraction of a percentage point. It’s crap. (Yes, I know things have to be paid for somehow somewhere. Call me a Luddite, but real-time targeted advertising markets don’t seem like the best way to do that.)
I’d been thinking about things I’d like to do with my website, as part of sprucing it up to use Pollen. And I realised that for years, I’d been using Google Analytics, which, while not among the worst of the “track your eyeballs, sell your attention” variety of software, still collects information about people who access your website and feeds it into the insatiable maw of the Google data consumption engine.
That seemed a little hypocritical. I didn’t have any other tracker things on my website, but I squinted at that Google Analytics cookie and it made me feel bad.
Did I even need analytics at all? It’s a useful way to see which pages or articles people are actually reading, obviously. I do (unlikely as it may seem) use this site for some professional activities, and it can be useful to get some idea of whether something is getting seen. So it’s not all about ego stroking. But how to do it without those nasty cookies, and without sharing user information with Google?
My first thought was Piwik, which is now called Matomo, an open source project that sells itself as a
Google Analytics alternative that protects your data and your customers’ privacy.
The problem with Matomo is that it really does do all that Google Analytics does, and possibly more. There’s lots of stuff for click tracking, invisible image tags, and so on. Nothing I wanted. It’s all very well talking about protecting your data and your users’ privacy from Google, but you are still tracking more about your users than you need to.
In hindsight, the solution is kind of obvious. Your web server has to collect a minimal set of information to be able to service requests for the pages on your site (requesting IP address, request timestamp, requesting browser, requested URL) and this is saved to the server log. You can analyse those logs to answer all sorts of questions about who is reading what on your site, without breaking user privacy and without cookies. (The question of whether the requesting IP address is Personally Identifying Information seems complicated, but for most users, their ISP will be using NAT, so knowing their IP address would probably only allow you to identify the ISP, but not the individual originating the request.)
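To make the idea concrete, here is a minimal sketch of the kind of analysis a server log supports. It assumes the log uses the common "combined" access-log format; the sample lines, the `LOG_LINE` pattern and the `count_page_hits` helper are made up for illustration and are not how GoAccess itself works.

```python
import re
from collections import Counter

# Pattern for the widely used "combined" access-log format.
# Assumption: your server writes this format; a custom log format
# would need a different pattern.
LOG_LINE = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) [^"]*" '
    r'(?P<status>\d{3}) \S+ "(?P<referrer>[^"]*)" "(?P<agent>[^"]*)"'
)


def count_page_hits(lines):
    """Count successful GET requests per requested path."""
    hits = Counter()
    for line in lines:
        match = LOG_LINE.match(line)
        if match is None:
            continue  # skip lines that don't fit the assumed format
        if match["method"] == "GET" and match["status"] == "200":
            hits[match["path"]] += 1
    return hits


# Invented sample entries (documentation-range IP addresses).
SAMPLE_LOG = [
    '203.0.113.7 - - [12/Mar/2021:10:15:32 +0000] "GET /blog/goaccess.html HTTP/1.1" 200 5120 "-" "Mozilla/5.0"',
    '198.51.100.4 - - [12/Mar/2021:10:16:01 +0000] "GET /blog/goaccess.html HTTP/1.1" 200 5120 "https://example.org/" "Mozilla/5.0"',
    '203.0.113.7 - - [12/Mar/2021:10:17:45 +0000] "GET /missing.html HTTP/1.1" 404 162 "-" "Mozilla/5.0"',
    '192.0.2.10 - - [12/Mar/2021:10:18:10 +0000] "GET /index.html HTTP/1.1" 200 2048 "-" "curl/7.68.0"',
]

if __name__ == "__main__":
    # Print the most-requested pages first.
    for path, n in count_page_hits(SAMPLE_LOG).most_common():
        print(n, path)
```

Everything used here (IP address, timestamp, browser string, URL) is already in the log, so no cookie or client-side script is involved.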
A convenient solution for doing this is GoAccess, which, as well as having a delightfully retro terminal interface, also produces nice live updating HTML reports that you can host on the same server that serves your site.
GoAccess is pleasantly simple to set up. You need to set up the single GoAccess binary to run as a Linux system service, you need to make sure your web server log files are organised right, and you need to set up your web server. If you’re using nginx, you need to set up a virtual host as a reverse proxy to the GoAccess service’s port. You’ll need to do the usual HTTPS termination for that, and if you want live updating, you need to set up secure WebSocket termination and proxying as well. I followed the instructions here and it worked more or less first time. (Isn’t it nice when that happens?)
I set up nginx’s log rotation to keep 30 days of logs, so I get a history of site accesses over that time frame.
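As a rough illustration of what a 30-day window of logs gives you, the sketch below counts requests per day and ignores anything older than the window. It assumes timestamps in the usual `[12/Mar/2021:10:15:32 +0000]` access-log style; the `daily_hits` helper and the sample lines are hypothetical, and in practice you would feed it the lines from the current and rotated log files.

```python
import re
from collections import Counter
from datetime import date, datetime, timedelta

# Grabs the bracketed timestamp from an access-log line.
TIMESTAMP = re.compile(r"\[([^\]]+)\]")


def daily_hits(lines, today, days=30):
    """Count log entries per calendar day within the last `days` days."""
    cutoff = today - timedelta(days=days)
    counts = Counter()
    for line in lines:
        found = TIMESTAMP.search(line)
        if found is None:
            continue  # no timestamp on this line
        try:
            stamp = datetime.strptime(found.group(1), "%d/%b/%Y:%H:%M:%S %z")
        except ValueError:
            continue  # timestamp not in the assumed format
        day = stamp.date()
        if cutoff < day <= today:
            counts[day] += 1
    return counts


# Invented sample entries spanning more than 30 days.
SAMPLE_LOG = [
    '203.0.113.7 - - [12/Mar/2021:10:15:32 +0000] "GET /index.html HTTP/1.1" 200 2048 "-" "Mozilla/5.0"',
    '198.51.100.4 - - [12/Mar/2021:18:02:11 +0000] "GET /about.html HTTP/1.1" 200 1024 "-" "Mozilla/5.0"',
    '192.0.2.10 - - [01/Mar/2021:09:00:00 +0000] "GET /index.html HTTP/1.1" 200 2048 "-" "Mozilla/5.0"',
    '192.0.2.10 - - [01/Jan/2021:09:00:00 +0000] "GET /index.html HTTP/1.1" 200 2048 "-" "Mozilla/5.0"',
]

if __name__ == "__main__":
    window = daily_hits(SAMPLE_LOG, today=date(2021, 3, 15))
    for day in sorted(window):
        print(day.isoformat(), window[day])
```

The January entry falls outside the window and is dropped, which mirrors what happens once rotation has deleted the older log files.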
You almost certainly don’t need the kind of intrusive analytics on your website that Google Analytics provides. And if you think you do, you might want to take a good hard look at your business model, because you might be one of the baddies.