This is going to be the oldest of old hat for the cool Haskell kids who invent existential higher-kinded polymorphic whatsits before breakfast, but it amused me, and it’s the first time I’ve used some of these more interesting language extensions for something “real”.
(There’s no code in this post, just some examples to explain what we’re going to do next.)
Suppose we define the state of the system whose evolution we want to study by a probability vector – at any moment in time, we have a probability distribution over a finite partition of the state space of the system (so that if we partition the state space into components, then ). Evolution of the system as a Markov chain is then defined by the evolution rule
where is a Markov matrix. This approach to modelling the evolution of probability densities has the benefit both of being simple to understand and to implement (in terms of estimating the matrix from data) and, as we’ll see, of allowing us to distinguish between random “diffusive” evolution and conservative “non-diffusive” dynamics.
We’ll see how this works by examining a very simple example.
The analysis of preferred flow regimes in the previous article is all very well, and in its way quite illuminating, but it was an entirely static analysis – we didn’t make any use of the fact that the original data we used was a time series, so we couldn’t gain any information about transitions between different states of atmospheric flow. We’ll attempt to remedy that situation now.
What sort of approach can we use to look at the dynamics of changes in patterns of ? Our parameterisation of flow patterns seems like a good start, but we need some way to model transitions between different flow states, i.e. between different points on the sphere. Each of our original maps corresponds to a point on this sphere, so we might hope that we can some up with a way of looking at trajectories of points in space that will give us some insight into the dynamics of atmospheric flow.
A quick post today to round off the “static” part of our atmospheric flow analysis.
Now that we’ve satisfied ourselves that the bumps in the spherical PDF in article 8 of this series are significant (in the narrowly defined sense of the word “significant” that we’ve discussed), we might ask what to sort of atmospheric flow regimes these bumps correspond. Since each point on our unit sphere is really a point in the three-dimensional space spanned by the first three PCA eigenpatterns that we calculated earlier, we can construct composite maps to look at the spatial patterns of flow for each bump just by combining the first three PCA eigenpatterns in proportions given by the “” coordinates of points on the unit sphere.
The spherical PDF we constructed by kernel density estimation in the article before last appeared to have “bumps”, i.e. it’s not uniform in and . We’d like to interpret these bumps as preferred regimes of atmospheric flow, but before we do that, we need to decide whether these bumps are significant. There is a huge amount of confusion that surrounds this idea of significance, mostly caused by blind use of “standard recipes” in common data analysis cases. Here, we have some data analysis that’s anything but standard, and that will rather paradoxically make it much easier to understand what we really mean by significance.
The Haskell kernel density estimation code in the last article does work, but it’s distressingly slow. Timing with the Unix
time command (not all that accurate, but it gives a good idea of orders of magnitude) reveals that this program takes about 6.3 seconds to run. For a one-off, that’s not too bad, but in the next article, we’re going to want to run this type of KDE calculation thousands of times, in order to generate empirical distributions of null hypothesis PDF values for significance testing. So we need something faster.
Up to this point, all the analysis that we’ve done has been what might be called “normal”, or “pedestrian” (or even “boring”). In climate data analysis, you almost always need to do some sort of spatial and temporal subsetting and you very often do some sort of anomaly processing. And everyone does PCA! So there’s not really been anything to get excited about yet.
Now that we have our PCA-transformed anomalies though, we can start to do some more interesting things. In this article, we’re going to look at how we can use the new representation of atmospheric flow patterns offered by the PCA eigenpatterns to reduce the dimensionality of our data, making it much easier to handle. We’ll then look at our data in an interesting geometrical way that allows us to focus on the patterns of flow while ignoring the strengths of different flows, i.e. we’ll be treating strong and weak blocking events as being the same, and strong and weak “normal” flow patterns as being the same. This simplification of things will allow us to do some statistics with our data to get an idea of whether there are statistically significant (in a sense we’ll define) flow patterns visible in our data.
Although the basics of the “project onto eigenvectors of the covariance matrix” prescription do hold just the same in the case of spatio-temporal data as in the simple two-dimensional example we looked at in the earlier article, there are a number of things we need to think about when we come to look at PCA for spatio-temporal data. Specifically, we need to think bout data organisation, the interpretation of the output of the PCA calculation, and the interpretation of PCA as a change of basis in a spatio-temporal setting. Let’s start by looking at data organisation.
The pre-processing that we’ve done hasn’t really got us anywhere in terms of the main analysis we want to do – it’s just organised the data a little and removed the main source of variability (the seasonal cycle) that we’re not interested in. Although we’ve subsetted the original geopotential height data both spatially and temporally, there is still a lot of data: 66 years of 181-day winters, each day of which has values. This is a very common situation to find yourself in if you’re dealing with climate, meteorological, oceanographic or remote sensing data. One approach to this glut of data is something called dimensionality reduction, a term that refers to a range of techniques for extracting “interesting” or “important” patterns from data so that we can then talk about the data in terms of how strong these patterns are instead of what data values we have at each point in space and time.
I’ve put the words “interesting” and “important” in quotes here because what’s interesting or important is up to us to define, and determines the dimensionality reduction method we use. Here, we’re going to side-step the question of determining what’s interesting or important by using the de facto default dimensionality reduction method, principal components analysis (PCA). We’ll take a look in detail at what kind of “interesting” and “important” PCA give us a little later.
Note: there are a couple of earlier articles that I didn’t tag as “haskell” so they didn’t appear in Planet Haskell. They don’t contain any Haskell code, but they cover some background material that’s useful to know (#3 talks about reanalysis data and what is, and #4 displays some of the characteristics of the data we’re going to be using). If you find terms here that are unfamiliar, they might be explained in one of these earlier articles.
The code for this post is available in a Gist.
Update: I missed a bit out of the pre-processing calculation here first time round. I’ve updated this post to reflect this now. Specifically, I forgot to do the running mean smoothing of the mean annual cycle in the anomaly calculation – doesn’t make much difference to the final results, but it’s worth doing just for the data manipulation practice…
Before we can get into the “main analysis”, we need to do some pre-processing of the data. In particular, we are interested in large-scale spatial structures, so we want to subsample the data spatially. We are also going to look only at the Northern Hemisphere winter, so we need to extract temporal subsets for each winter season. (The reason for this is that winter is the season where we see the most interesting changes between persistent flow regimes. And we look at the Northern Hemisphere because it’s where more people live, so it’s more familiar to more people.) Finally, we want to look at variability about the seasonal cycle, so we are going to calculate “anomalies” around the seasonal cycle.
We’ll do the spatial and temporal subsetting as one pre-processing step and then do the anomaly calculation seperately, just for simplicity.