Non-diffusive atmospheric flow #10: significance of flow patterns
The spherical PDF we constructed by kernel density estimation in the article before last appeared to have “bumps”, i.e. it’s not uniform in θ and φ. The obvious question to ask is whether those bumps are significant features of the data, or whether they could have arisen purely by chance.
So what do we mean by “significance”? A phenomenon is significant if it is unlikely to have occurred by chance. Right away, this definition raises two questions. First, chance implies some sort of probability, a continuous quantity, which leads to the idea of different levels of significance. Second, if we are going to be thinking about probabilities, we are going to need to talk about a distribution for those probabilities. The basic idea is thus to compare our data to a distribution that we explicitly decide based on a null hypothesis. A bump in our PDF will be called significant if it is unlikely to have occurred in data generated under the assumptions in our null hypothesis.
So, what’s a good null hypothesis in this case? We’re trying to determine whether these bumps are a significant deviation from “nothing happening”. In this case, “nothing happening” would mean that the data points we use to generate the PDF are distributed uniformly over the unit sphere parametrised by the angles θ and φ.
We thus want some way of comparing the PDF generated by doing KDE on our data points to PDFs generated by doing KDE on “fake” data points sampled from our null hypothesis distribution. We’re going to follow a sampling-based approach: we will generate “fake” data sets, do exactly the same analysis we did on our real data points to produce “fake” PDFs, then compare these “fake” PDFs to our real one (in a way that will be demonstrated below).
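In outline, and with purely illustrative names (samplingNull, genFake and analyse are not from the programs used later; in our case the “analysis” is the whole KDE pipeline and its result is a gridded PDF rather than a single number), the pattern looks something like this minimal sketch:

import Control.Monad (replicateM)

-- Run the same analysis on many fake data sets drawn from the null
-- hypothesis; the results are what we will compare the real data
-- analysis against.
samplingNull :: Int          -- number of null hypothesis realisations
             -> IO a         -- generate one fake data set
             -> (a -> b)     -- the analysis, applied identically to fake and real data
             -> IO [b]       -- analysis results under the null hypothesis
samplingNull n genFake analyse = map analyse <$> replicateM n genFake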
There are a couple of important things to note here, which I usually think of under the heading of “minimisation of cleverness”. First, we do exactly the same analysis on our “fake” data as we do on our “real” data. That means that we can treat arbitrarily complex chains of data analysis without drama: here, we’re doing KDE, which is quite complicated from a statistical perspective, but we don’t really need to think about that, because the fact that we treat the fake and real data sets identically means that we’re comparing like with like and the complexity just “divides out”. Second, because we’re simulating, in the sense that we generate fake data based on a null hypothesis on the data and run it through whatever data analysis steps we’re doing, we make any assumptions that go into our null hypothesis perfectly explicit. If we assume Gaussian data, we can see that, because we’ll be sampling from a Gaussian to generate our fake data. If, as in this case, our null hypothesis distribution is something else, we’ll see that perfectly clearly because we sample directly from that distribution to generate our fake data.
This is a huge contrast to the usual case for “normal” statistics, where one chooses some standard test (a t-test, say) off the shelf, and the assumptions behind it, about the distribution of the data, about independence, and so on, are fixed by the choice of test and all too easy to lose sight of.
So, we’re going to do sampling-based significance testing here. It is shockingly easy to do and, if you’ve been exposed to the confusion of “traditional” significance testing, shockingly easy to understand.
Simulating from the null hypothesis
In order to test the significance of the bumps we see in our spherical PDF, we need some way of sampling points from our null hypothesis distribution, i.e. from a probability distribution that is uniform on the unit sphere. The simplest way to do this is to sample points from a spherically symmetric three-dimensional probability distribution, then project those points onto the unit sphere. The most suitable three-dimensional distribution, at least from the point of view of convenience, is a three-dimensional Gaussian with zero mean and unit covariance matrix: its coordinates are independent standard normals, so sampling is easy, and because the distribution is spherically symmetric, normalising each sample vector to unit length gives points that are uniformly distributed on the unit sphere. Code to do this using the mwc-random package is shown below; here, nData is the number of points we want to sample, and the randPt function generates a single normalised point on the unit sphere.
-- Random data point generation.
gen <- create
let randPt gen = do
      unnorm <- SV.replicateM 3 (standard gen)
      let mag = sqrt $ unnorm `dot` unnorm
          norm = scale (1.0 / mag) unnorm
      return (norm ! 0, norm ! 1, norm ! 2)

-- Create random data points, flatten to vector and allocate on
-- device.
dataPts <- replicateM nData (randPt gen)
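If you want to experiment with this sampling trick on its own, here is a self-contained sketch of the same idea; the imports (System.Random.MWC and System.Random.MWC.Distributions from mwc-random, Numeric.LinearAlgebra from hmatrix) and the uniformSpherePt name are my assumptions about a minimal setup, not part of the article’s programs:

import Control.Monad (replicateM)
import qualified Data.Vector.Storable as SV
import Numeric.LinearAlgebra (dot, scale, (!))
import System.Random.MWC (GenIO, createSystemRandom)
import System.Random.MWC.Distributions (standard)

-- Sample one point uniformly on the unit sphere: draw a three-dimensional
-- standard Gaussian sample and normalise it to unit length.
uniformSpherePt :: GenIO -> IO (Double, Double, Double)
uniformSpherePt gen = do
  unnorm <- SV.replicateM 3 (standard gen)
  let mag  = sqrt (unnorm `dot` unnorm)
      norm = scale (1.0 / mag) unnorm
  return (norm ! 0, norm ! 1, norm ! 2)

main :: IO ()
main = do
  gen <- createSystemRandom
  pts <- replicateM 5 (uniformSpherePt gen)
  mapM_ print pts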
If we sample the same number of points from this distribution that we have in our real data and then use the same KDE approach that we used for the real data to generate an empirical PDF on the unit sphere, what do we get? Here’s what one distribution generated by this procedure looks like, using the same colour scale as the “real data” distribution in the earlier article to aid in comparison (darker colours show regions of greater probability density):
We can see that our sample from the null hypothesis distribution also has “bumps”, although they seem to be less prominent than the bumps in the PDF for our real data. Why do we see bumps here? Our null hypothesis distribution is uniform, so why is the simulated empirical PDF bumpy? The answer, of course, is sampling variation. If we sample 9966 points on the unit sphere, we are going to get some clustering of points (leading to bumps in the KDE-derived distribution) just by chance. Those chance concentrations of points are what lead to the bumps in the plot above.
What we ultimately want to do then is to answer the question: how likely is it that the bumps in the distribution of our real data could have arisen by chance, assuming that our real data arose from a process matching our null hypothesis?
Testing for significance
The way we’re going to answer the question posed in the last section is purely empirical. At each grid point on the unit sphere, we’re going to build an empirical distribution (a histogram) of the PDF values that arise under the null hypothesis, to get a picture of the sampling variability that is possible, then we’re going to look at the value of our “real data” PDF at that point and calculate the proportion of the null hypothesis values that fall below it. The larger that proportion, the less likely it is that the real data value could have arisen by chance if the data really came from the null hypothesis distribution.
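Stripped down to a single grid point, the comparison we have in mind is just the following (a hypothetical helper, not part of the article’s programs); the per-point histograms built below are simply a compact way of recording the same information for every grid point at once:

-- Fraction of null hypothesis PDF values at a grid point that are smaller
-- than the real data PDF value there: values near 1 mean the real data
-- value is unlikely to have arisen under the null hypothesis.
significanceAt :: [Double]  -- PDF values at one grid point, one per realisation
               -> Double    -- real data PDF value at the same grid point
               -> Double
significanceAt nullVals realVal =
  fromIntegral (length (filter (< realVal) nullVals)) /
  fromIntegral (length nullVals)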
In words, it sounds complicated. In reality and in code, it’s not. First, we generate a large number of realisations of the null hypothesis distribution, by sampling points on the unit sphere and using KDE to produce PDFs from those point distributions in exactly the same way that we did for our real data, as shown here (code from the make-hist.hs program):
-- Generate PDF realisations.
pdfs <- forM [1..nrealisations] $ \r -> do
  putStrLn $ "REALISATION: " ++ show r

  -- Create random data points.
  dataPts <- SV.concat <$> replicateM nData (randPt gen)
  SV.unsafeWith dataPts $ \p -> CUDA.pokeArray (3 * nData) p dDataPts

  -- Calculate kernel values for each grid point/data point
  -- combination and accumulate into grid.
  CUDA.launchKernel fun gridSize blockSize 0 Nothing
    [CUDA.IArg (fromIntegral nData), CUDA.VArg dDataPts, CUDA.VArg dPdf]
  CUDA.sync
  res <- SVM.new (ntheta * nphi)
  SVM.unsafeWith res $ \p -> CUDA.peekArray (ntheta * nphi) dPdf p
  unnormv <- SV.unsafeFreeze res
  let unnorm = reshape nphi unnormv

  -- Normalise.
  let int = dtheta * dphi * sum (zipWith doone sinths (toRows unnorm))
  return $ cmap (realToFrac . (/ int)) unnorm :: IO (Matrix Double)
(This is really the key aspect of this sampling-based approach: we perform exactly the same data analysis on the test data sampled from the null hypothesis distribution that we perform on our real data.) We generate 10,000 realisations of the null hypothesis distribution (stored in the pdfs value), in this case using CUDA to do the actual KDE calculation, so that it doesn’t take too long.
Then, for each spatial point on our unit sphere, i.e. each point of our (θ, φ) grid, we collect the PDF values from all 10,000 realisations; these per-point collections are the samples value in this code:
-- Convert to per-point samples and generate per-point histograms.
let samples = [SV.generate nrealisations (\s -> (pdfs !! s) ! i ! j) :: V |
               i <- [0..ntheta-1], j <- [0..nphi-1]]
    (rngmin, rngmax) = range nbins $ SV.concat samples
    hists = map (histogram_ nbins rngmin rngmax) samples :: [V]
    step i = rngmin + d * fromIntegral i
    d = (rngmax - rngmin) / fromIntegral nbins
    bins = SV.generate nbins step
For each point on our grid on the unit sphere, we then calculate a histogram of the samples from the 10,000 empirical PDF realisations, using the histogram_ function from the statistics package. We use the same bins for all the histograms to make life easier in the next step.
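In isolation, the shared-bin idea amounts to something like this sketch (sharedHistograms is an illustrative name; range and histogram_ come from Statistics.Sample.Histogram in the statistics package): compute one common value range over all the per-point samples, then bin every sample vector against that range, so that a given bin index means the same PDF value everywhere on the grid.

import qualified Data.Vector.Storable as SV
import Statistics.Sample.Histogram (histogram_, range)

-- Bin every per-point sample vector using one shared set of bins.
sharedHistograms :: Int -> [SV.Vector Double] -> [SV.Vector Double]
sharedHistograms nbins samples =
  let (lo, hi) = range nbins (SV.concat samples)  -- common range for all points
  in map (histogram_ nbins lo hi) samples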
There’s one thing that’s worth commenting on here. You might think that we’re doing excess work: our null hypothesis distribution is spherically symmetric, so shouldn’t the histograms be the same for all points on the unit sphere? Well, should they? Or might the exact distribution of samples depend on θ and φ in some way? Rather than trying to settle that by being clever, it’s simpler and safer just to calculate a separate histogram for each grid point and let the sampling itself answer the question: another instance of “minimisation of cleverness”.
Once we have the histograms for each grid point on the unit sphere, we can calculate the significance of the values of the real data distribution (this is from the make-significance.hs program; I split these things up to make it easier to check what was going on during development):
-- Split histogram values for later processing.
let nhistvals = SV.length histvals
    oneslice i = SV.slice i nbin histvals
    histvecs = map oneslice [0, nbin .. nhistvals - nbin]
    hists = A.listArray ((0, 0), (ntheta-1, nphi-1)) histvecs
    nrealisations = SV.sum $ hists A.! (0, 0)

-- Calculate significance values.
let doone :: CDouble -> CDouble -> CDouble
    doone dith diph =
      let ith = truncate dith ; iph = truncate diph
          pdfval = pdf ! ith ! iph
          hist = hists A.! (ith, iph)
          pdfbin0 = truncate $ (pdfval - minbin) / binwidth
          -- Clamp to a valid bin index.
          pdfbin = pdfbin0 `max` 0 `min` (nbin - 1)
      in (SV.sum $ SV.take pdfbin hist) / nrealisations
    sig = build (ntheta, nphi) doone :: Matrix CDouble
We read the histograms into the histvals value from an intermediate NetCDF file and build an array of histograms indexed by grid cell in the hists value. The doone function then finds, for each grid cell, the histogram bin into which the real data PDF value falls and sums the counts in all the bins below it; dividing by the number of realisations gives the fraction of the null hypothesis PDF values that are smaller than the real data PDF value at that cell, which is exactly the significance measure we want.
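Expressed as a pure function with illustrative names (assuming minbin, binwidth, the bin count and the per-point histogram come from the previous steps), the per-cell calculation performed by doone is roughly:

import qualified Data.Vector.Storable as SV

-- Sum the histogram counts in the bins strictly below the bin containing the
-- real data PDF value, and divide by the total count to get the fraction of
-- null hypothesis values that lie below the real data value.
sigFromHist :: Double            -- lower edge of the first bin (minbin)
            -> Double            -- bin width
            -> Int               -- number of bins
            -> SV.Vector Double  -- per-point histogram of null hypothesis PDF values
            -> Double            -- real data PDF value at this grid point
            -> Double
sigFromHist minbin binwidth nbin hist pdfval =
  let bin = max 0 (min (nbin - 1) (truncate ((pdfval - minbin) / binwidth)))
  in SV.sum (SV.take bin hist) / SV.sum hist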
For instance, if the real data distribution value is greater than 95% of the values generated by the null hypothesis distribution simulation, then we say that we have a significance level of 95% at that point on the unit sphere. We can plot these significance levels in the same way that we’ve been plotting the spherical PDFs. Here’s what those significance levels look like, choosing contour levels for the plot to highlight the most significant regions, i.e. the regions least likely to have occurred by chance if the null hypothesis is true:
In particular, we see that each of the three bumps picked out with labels in the “real data” PDF plot in the earlier article is among the most significant regions of the PDF according to this analysis, being larger than 99.9% of values generated from the null hypothesis uniform distribution.
It’s sort of traditional to use some more careful language to talk about these kinds of results, attaching precise terminological meanings to phrases like “significance level”, but I’m not going to worry about that here: what the plot shows at each grid point is simply the fraction of the null hypothesis PDF values that lie below the real data PDF value there.
In the next article we’ll take a quick look at what these “bumps” in our “real data” PDF represent in terms of atmospheric flow.