Non-diffusive atmospheric flow #10: significance of flow patterns

data-analysis, haskell

The spherical PDF we constructed by kernel density estimation in the article before last appeared to have “bumps”, i.e. it’s not uniform in $\theta$ and $\phi$. We’d like to interpret these bumps as preferred regimes of atmospheric flow, but before we do that, we need to decide whether they are significant. There is a huge amount of confusion surrounding this idea of significance, mostly caused by blind use of “standard recipes” in common data analysis settings. Here, we have some data analysis that’s anything but standard, which will, rather paradoxically, make it much easier to understand what we really mean by significance.
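
A minimal sketch (with illustrative names, not the article’s code) of the basic ingredient of such a significance test: if the data points were distributed uniformly over the sphere, how bumpy would a kernel density estimate from a sample of the same size typically be? To answer that empirically, we need to be able to generate uniform “null hypothesis” samples on the sphere:

```haskell
-- Generate points uniformly distributed on the unit sphere, to serve as
-- "null hypothesis" samples for comparison with the real data.  This is
-- an illustrative sketch, not the code developed in the article.
import System.Random (randomRIO)

-- One random point, uniform on the unit sphere, as (theta, phi).
uniformSpherePoint :: IO (Double, Double)
uniformSpherePoint = do
  u   <- randomRIO (-1.0, 1.0)     -- u = cos(theta) is uniform on [-1, 1]
  phi <- randomRIO (0.0, 2 * pi)   -- longitude uniform on [0, 2*pi)
  return (acos u, phi)

-- A null sample of the same size as the real data set.
nullSample :: Int -> IO [(Double, Double)]
nullSample n = mapM (const uniformSpherePoint) [1 .. n]
```

Running the KDE on many such samples gives an empirical distribution of density values under the null hypothesis of “no preferred flow patterns”, against which the bumps in the real PDF can be compared.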


Non-diffusive atmospheric flow #9: speeding up KDE

data-analysis, haskell

The Haskell kernel density estimation code in the last article does work, but it’s distressingly slow. Timing with the Unix time command (not all that accurate, but it gives a good idea of orders of magnitude) reveals that this program takes about 6.3 seconds to run. For a one-off, that’s not too bad, but in the next article, we’re going to want to run this type of KDE calculation thousands of times, in order to generate empirical distributions of null hypothesis PDF values for significance testing. So we need something faster.
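
To see why the straightforward implementation is slow, it helps to make the cost explicit: a naive KDE evaluates, at every point of the $(\theta, \phi)$ grid, a kernel-weighted sum over every data point, so the work grows as grid size times data size. A sketch of that naive form (an assumed implementation, not the article’s actual code):

```haskell
-- Naive spherical KDE: the density at each evaluation point is a
-- kernel-weighted sum over *all* data points, so evaluating it on a
-- full grid costs O(grid points * data points).
import Data.List (foldl')

type SpherePt = (Double, Double)   -- (theta, phi), theta = colatitude

-- Cosine of the angular distance between two points on the sphere.
cosDist :: SpherePt -> SpherePt -> Double
cosDist (t1, p1) (t2, p2) =
  cos t1 * cos t2 + sin t1 * sin t2 * cos (p1 - p2)

-- Simple truncated kernel: the weight falls off linearly in
-- (1 - cos(angular distance)) and is zero beyond the bandwidth bw.
kernel :: Double -> SpherePt -> SpherePt -> Double
kernel bw x y = max 0 (1 - (1 - cosDist x y) / bw)

-- Unnormalised density estimate at a single evaluation point.
kde :: Double -> [SpherePt] -> SpherePt -> Double
kde bw dat x = foldl' (\acc d -> acc + kernel bw x d) 0 dat
```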


Non-diffusive atmospheric flow #8: flow pattern distribution

data-analysis, haskell

Up to this point, all the analysis that we’ve done has been what might be called “normal”, or “pedestrian” (or even “boring”). In climate data analysis, you almost always need to do some sort of spatial and temporal subsetting and you very often do some sort of anomaly processing. And everyone does PCA! So there’s not really been anything to get excited about yet.

Now that we have our PCA-transformed $Z_{500}$ anomalies though, we can start to do some more interesting things. In this article, we’re going to look at how we can use the new representation of atmospheric flow patterns offered by the PCA eigenpatterns to reduce the dimensionality of our data, making it much easier to handle. We’ll then look at our data in an interesting geometrical way that allows us to focus on the patterns of flow while ignoring their strengths, i.e. we’ll treat strong and weak blocking events as the same, and likewise strong and weak “normal” flow patterns. This simplification will allow us to do some statistics with our data to get an idea of whether there are statistically significant (in a sense we’ll define) flow patterns visible in our data.
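
For concreteness, the geometrical step of ignoring flow strength amounts to normalising each day’s low-dimensional PCA state vector and keeping only its direction, i.e. a point on the unit sphere. A minimal sketch for a three-component state (illustrative names, not the article’s code):

```haskell
-- Map a three-component PCA state to a point on the unit sphere,
-- discarding the amplitude and keeping only the "pattern" direction.
toSphere :: (Double, Double, Double) -> (Double, Double)
toSphere (x, y, z) = (theta, phi)
  where
    r     = sqrt (x * x + y * y + z * z)
    theta = acos (z / r)        -- colatitude: 0 at the "north pole"
    phi   = atan2 y x           -- longitude in (-pi, pi]
```

With each day reduced to a point $(\theta, \phi)$ on the sphere, the question of preferred flow patterns becomes a question about whether those points are distributed uniformly.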


Non-diffusive atmospheric flow #7: PCA for spatio-temporal data

data-analysis, haskell

Although the basics of the “project onto eigenvectors of the covariance matrix” prescription do hold just the same in the case of spatio-temporal data as in the simple two-dimensional example we looked at in the earlier article, there are a number of things we need to think about when we come to look at PCA for spatio-temporal data. Specifically, we need to think about data organisation, the interpretation of the output of the PCA calculation, and the interpretation of PCA as a change of basis in a spatio-temporal setting. Let’s start by looking at data organisation.
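
Data organisation is the least glamorous of the three, but everything else depends on it. The usual convention, and the one assumed in this sketch (which uses hmatrix and is not the article’s code), is to flatten each day’s spatial field into one row of a large data matrix, so that time runs down the rows and space runs across the columns:

```haskell
-- Build a (days x grid points) data matrix from daily spatial fields.
import Numeric.LinearAlgebra (Matrix, Vector, fromList, fromRows)

-- One day's Z500 field, as a list of latitude rows of longitude values.
type Field = [[Double]]

-- Flatten a spatial field into a single row vector.
flattenField :: Field -> Vector Double
flattenField = fromList . concat

-- Stack the daily fields: rows are time steps, columns are grid points.
dataMatrix :: [Field] -> Matrix Double
dataMatrix = fromRows . map flattenField
```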


Non-diffusive atmospheric flow #6: principal components analysis

data-analysis, haskell

The pre-processing that we’ve done hasn’t really got us anywhere in terms of the main analysis we want to do–it’s just organised the data a little and removed the main source of variability (the seasonal cycle) that we’re not interested in. Although we’ve subsetted the original geopotential height data both spatially and temporally, there is still a lot of data: 66 years of 181-day winters, each day of which has $72 \times 15$ $Z_{500}$ values. This is a very common situation to find yourself in if you’re dealing with climate, meteorological, oceanographic or remote sensing data. One approach to this glut of data is something called dimensionality reduction, a term that refers to a range of techniques for extracting “interesting” or “important” patterns from data so that we can then talk about the data in terms of how strong these patterns are instead of what data values we have at each point in space and time.

I’ve put the words “interesting” and “important” in quotes here because what’s interesting or important is up to us to define, and determines the dimensionality reduction method we use. Here, we’re going to side-step the question of determining what’s interesting or important by using the de facto default dimensionality reduction method, principal components analysis (PCA). We’ll take a more detailed look a little later at what kind of “interesting” and “important” PCA gives us.
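
As a reminder of what the calculation itself involves, here is a minimal PCA sketch using hmatrix. It is an assumed implementation for illustration, computed via the SVD of the centred data matrix rather than an explicit covariance matrix, and not the code developed in this series:

```haskell
import Numeric.LinearAlgebra

-- PCA of a data matrix whose rows are time steps and whose columns are
-- grid points: returns the spatial patterns and the fraction of total
-- variance explained by each.
pca :: Matrix Double -> (Matrix Double, Vector Double)
pca x = (v, fracVar)
  where
    n         = rows x
    colMeans  = fromList [ sumElements c / fromIntegral n | c <- toColumns x ]
    centred   = fromRows [ r - colMeans | r <- toRows x ]   -- remove column means
    (_, s, v) = svd centred             -- columns of v are the PCA patterns
    vars      = cmap (\sv -> sv * sv / fromIntegral (n - 1)) s
    fracVar   = cmap (/ sumElements vars) vars
```

The columns of v are the PCA eigenvectors (the “eigenpatterns” in the spatio-temporal case), and fracVar gives the fraction of the total variance each one explains, which is what lets us keep only the leading few components.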


Non-diffusive atmospheric flow #5: pre-processing

data-analysis, haskell

Note: there are a couple of earlier articles that I didn’t tag as “haskell” so they didn’t appear in Planet Haskell. They don’t contain any Haskell code, but they cover some background material that’s useful to know (#3 talks about reanalysis data and what $Z_{500}$ is, and #4 displays some of the characteristics of the data we’re going to be using). If you find terms here that are unfamiliar, they might be explained in one of these earlier articles.

The code for this post is available in a Gist.

Update: I missed a bit out of the pre-processing calculation here first time round. I’ve updated this post to reflect this now. Specifically, I forgot to do the running mean smoothing of the mean annual cycle in the anomaly calculation–doesn’t make much difference to the final results, but it’s worth doing just for the data manipulation practice...

Before we can get into the “main analysis”, we need to do some pre-processing of the $Z_{500}$ data. In particular, we are interested in large-scale spatial structures, so we want to subsample the data spatially. We are also going to look only at the Northern Hemisphere winter, so we need to extract temporal subsets for each winter season. (The reason for this is that winter is the season where we see the most interesting changes between persistent flow regimes. And we look at the Northern Hemisphere because it’s where more people live, so it’s more familiar to more people.) Finally, we want to look at variability about the seasonal cycle, so we are going to calculate “anomalies” around the seasonal cycle.

We’ll do the spatial and temporal subsetting as one pre-processing step and then do the anomaly calculation separately, just for simplicity.
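
As a rough picture of the anomaly step (plain lists for clarity; the names and data structures here are illustrative, not the article’s): average each calendar day across all the years, smooth that mean annual cycle with a running mean, and subtract the smoothed cycle from each individual winter:

```haskell
import Data.List (transpose)

-- Mean annual cycle: average the same calendar day across all seasons.
-- Each inner list is one winter season; all seasons have the same length.
meanAnnualCycle :: [[Double]] -> [Double]
meanAnnualCycle seasons =
  [ sum ds / fromIntegral (length ds) | ds <- transpose seasons ]

-- Running mean of width 2 * halfWidth + 1 (handled crudely at the ends).
runningMean :: Int -> [Double] -> [Double]
runningMean halfWidth xs =
  [ mean (take (2 * halfWidth + 1) (drop (max 0 (i - halfWidth)) xs))
  | i <- [0 .. length xs - 1] ]
  where mean ys = sum ys / fromIntegral (length ys)

-- Anomalies: each season minus the smoothed mean annual cycle.
anomalies :: Int -> [[Double]] -> [[Double]]
anomalies halfWidth seasons = map (zipWith subtract cycle') seasons
  where cycle' = runningMean halfWidth (meanAnnualCycle seasons)
```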


Non-diffusive atmospheric flow #4: exploring Z500

data-analysis

In the last article, I talked a little about geopotential height and the $Z_{500}$ data we’re going to use for this analysis. Earlier, I talked about how to read data from the NetCDF files that the NCEP reanalysis data comes in. Now we’re going to take a look at some of the features in the data set to get some idea of what we might see in our analysis. In order to do this, we’re going to have to produce some plots. As I’ve said before, I tend not to be very dogmatic about what software to use for plotting–for simple things (scatter plots, line plots, and so on) there are lots of tools that will do the job (including some Haskell tools, like the Chart library), but for more complex things, it tends to be much more efficient to use specialised tools. For example, for 3-D plotting, something like Paraview or Mayavi is a good choice. Here, we’re mostly going to be looking at geospatial data, i.e. maps, and for this there aren’t really any good Haskell tools. Instead, we’re going to use something called NCL (NCAR Command Language). This isn’t by any stretch of the imagination a pretty language from a computer science point of view, but it has a lot of specialised features for plotting climate and meteorological data and is pretty perfect for the needs of this task (the sea level pressure and $Z_{500}$ plots in the last post were made using NCL). I’m not going to talk about the NCL scripts used to produce the plots here, but I might write about NCL a bit more later since it’s a very good tool for this sort of thing.


Non-diffusive atmospheric flow #3: reanalysis data and Z500

data-analysis

In this article, we’re going to look at some of the details of the data that we’re going to be using in our study of non-diffusive flow in the atmosphere. This is still all background material, so there’s no Haskell code here!


Non-diffusive atmospheric flow #2: outline & plan

data-analysis

As I said in the last article, the next bit of this data analysis series is going to attempt to use Haskell to reproduce the analysis in the paper: D. T. Crommelin (2004). Observed nondiffusive dynamics in large-scale atmospheric flow. J. Atmos. Sci. 61(19), 2384–2396. Before we can do this, we need to cover some background, which I’m going to do in this and the next couple of articles. There won’t be any Haskell code in any of these three articles, so I’m not tagging them as “Haskell” so that they don’t end up on Planet Haskell, annoying category theorists who have no interest in atmospheric dynamics. I’ll refer to these background articles from the later “codey” articles as needed.


Haskell data analysis: Reading NetCDF files

haskell, data-analysis

I never really intended the FFT stuff to go on for as long as it did, since that sort of thing wasn’t really what I was planning as the focus for this Data Analysis in Haskell series. The FFT was intended primarily as a “warm-up” exercise. After fourteen blog articles and about 10,000 words, everyone ought to be sufficiently warmed up now...

Instead of trying to lay out any kind of fundamental principles for data analysis before we get going, I’m just going to dive into a real example. I’ll talk about generalities as we go along when we have some context in which to place them.

All of the analysis described in this next series of articles closely follows that in the paper: D. T. Crommelin (2004). Observed nondiffusive dynamics in large-scale atmospheric flow. J. Atmos. Sci. 61(19), 2384–2396. We’re going to replicate most of the data analysis and visualisation from this paper, maybe adding a few interesting extras towards the end.

It’s going to take a couple of articles to lay out some of the background to this problem, but I want to start here with something very practical and not specific to this particular problem. We’re going to look at how to gain access to meteorological and climate data stored in the NetCDF file format from Haskell. This will be useful not only for the low-frequency atmospheric variability problem we’re going to look at, but for other things in the future too.
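
One practical point that is independent of which NetCDF binding you end up using: a variable like $Z_{500}$ comes back as a flat block of values plus its dimension sizes (time, latitude, longitude), and it is up to you to reshape it into something meaningful. A small illustrative sketch of that reshaping (the actual file reading is left to the NetCDF library):

```haskell
-- Split a list into chunks of the given length.
chunksOf :: Int -> [a] -> [[a]]
chunksOf _ [] = []
chunksOf n xs = let (h, t) = splitAt n xs in h : chunksOf n t

-- Reshape a flat (time, lat, lon) block into [time][lat][lon]; the
-- number of time steps is implied by the length of the data.
reshape :: Int -> Int -> [Double] -> [[[Double]]]
reshape nLat nLon flat = map (chunksOf nLon) (chunksOf (nLat * nLon) flat)
```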


Link Round-up

Here’s a mixed bag of interesting links, some sciencey, some mathsy, some miscellany:

  1. Network Rail Virtual Archives: OK, this might not, at first sight, sound like something interesting, but it really is. This site has original Victorian-era engineering drawings for a whole range of British railway infrastructure. Bridges, viaducts, stations, tunnels. All rendered in lovely 19th Century penmanship. The Forth Bridge is particularly nice.

  2. open.NASA: A couple of years ago, NASA started a project to open-source code and data from their Earth observing and planetary missions. Open.NASA is a gateway to these resources. I’ve not had a chance to look at it in huge detail yet, but there is a lot of stuff there. The list of projects on the code.NASA part looks particularly entertaining.

  3. Game of Primes: Giganotosaurus is a science fiction site that publishes one (longish) short story each month. They’re often very good, and this one was particularly striking–it’s quite beautifully done, full of mystery, and feels like it could be a part of something much larger and deeper.

  4. Surprising connections in mathematics: This one is a bit more technical, from the Math Overflow Q&A website. A lot of the connections people mention are very technical, but some are more accessible, for instance the link between algebra and geometry developed by Descartes and others in the 17th Century. This is something we learn about in school, and something that we don’t think about too much because it seems “obvious”. Only obvious in retrospect, of course, since it took hundreds of years for the connection to be discovered!

  5. De Bruijn grids and tilings: Another technical one, but very interesting. Aperiodic tilings of the plane, like Penrose tilings, are slightly mysterious. This article gives a really clear description of one systematic method for generating such tilings. It’s a very odd and intriguing little bit of mathematics.

  6. Atul Gawande on end-of-life care: Atul Gawande is one of my favourite writers on medical and ethical issues. This article is quite long, but well worth a read.


Command and Control

book-reviews

by Eric Schlosser

My reading list recently has been chock-full of light-hearted and mood-lifting material: some Irvine Welsh novels (always guaranteed to shed a gentle light on all that’s best about the human condition), a long book about clinical depression, M. R. Carey’s interesting sort-of-zombie apocalypse/extreme mycology novel, The Girl With All The Gifts, de Becker’s The Gift Of Fear, a book all about fear and violence, and Piper Kerman’s prison memoir, Orange Is The New Black (which did spoil the mood a little by having a few sparks of hope in among the gloom).

Among all this bleakness and blackness, Command and Control somehow manages to stand out as a particularly grim monument to human folly and our collective crimes against all sense and reason. It’s a book about nuclear weapons, so it never really had much chance of being too jolly, but even so, Schlosser’s decision to focus in parallel on US nuclear doctrine and nuclear weapons safety makes for some horrifying reading. It’s something of a mystery how we made it through the Cold War without either a “hot” war or at least some sort of unintended detonation of a nuclear weapon.


Many Books & Their Reviews #2

book-reviews

Second round of “many books”...


Many Books & Their Reviews #1

book-reviews

I’ve been doing quite a bit of reading lately, so I have 28 novels to review! All but one are from series of novels, so that’s not quite as daunting as it sounds. Still, I’ll split this into two posts to make it manageable.


Getting From Here To There

day-job, gis

In particular, getting from where you are now to where you want to be, in terms of your career.

As a result of an email I sent to the Haskell-Cafe mailing list a couple of weeks ago looking for someone to take over a contract I had been working on, someone contacted me asking for career advice. Clearly not someone who knew me at all, otherwise they would have known what a crazy idea that was. Anyway, this person was asking about one of the fundamental problems when you’re starting out in more or less any profession: how do you acquire the experience you need to apply for jobs that say “experience required”, which is more or less all of them?

They asked: “What is the path to getting involved in this stuff? How do I bridge the gap from just playing around with these technologies to having real world experience? It seems that most opportunities are for people with experience.” And this is exactly right. Particularly for contracting, no-one wants to hire someone they think will have to learn on the job. You need to know what you’re doing, which means getting experience somehow. And it would of course be nice to be able to eat and have a life while getting that experience.

I wrote an epic email in reply, and was told that it would have worked better as a blog post (or perhaps a short novel). So here I am, turning it into a blog post!