Are your kids home for the holidays? Here's how you can teach them least squares, the Kalman filter, and even - as we will see - a hierarchical Bayesian model. I doodled on this topic in an old blog post and was inspired to port the notes here by a nice article called Least Squares as Springs recently written by Professor Joshua Loftus. I also recommend his note. He remarks:
I think there's a lot of potential here for statistics education, especially for younger students... I was surprised by how difficult it was to find any references for this elementary idea that provides physics intuition for such an important method as least squares.
I could not agree more, and would go further: the analogy is quite helpful to practitioners and researchers too - it helps us think, and communicate. On Professor Loftus' prompting, I'm tempted to adopt Hadley Wickham's beautiful spring plotting example next time I produce a line of best fit for any business purpose!
How does that not convey least squares? Professor Loftus suggests that educational labs could be set up with analog examples of least squares problems.
The Kalman filter is a special case of least squares (indeed it is a special case of regression, as can be seen from the Duncan and Horn formulation of the Kalman filter), so it must be possible to construct a system of rods and springs that replicates its function. Least squares is simply energy minimization in a Hookean universe - one where forces are proportional to distance.
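If you'd like to see that claim in silico rather than with actual springs, here is a tiny sketch (illustrative only; the data and stiffnesses are made up) showing that the minimum-energy configuration of vertical, unit-stiffness Hookean springs is exactly the ordinary least squares line:

```python
import numpy as np
from scipy.optimize import minimize

# Toy illustration (made-up data): attach each data point to a candidate line with a
# vertical Hookean spring of unit stiffness. The stored energy is half the sum of
# squared residuals, so the minimum-energy line is the ordinary least squares line.
rng = np.random.default_rng(0)
x = np.linspace(0.0, 1.0, 25)
y = 0.3 + 1.5 * x + 0.1 * rng.standard_normal(x.size)

def spring_energy(params):
    a, b = params                       # intercept and slope of the candidate line
    stretch = y - (a + b * x)           # extension of each spring
    return 0.5 * np.sum(stretch ** 2)   # Hookean potential energy with unit stiffness

equilibrium = minimize(spring_energy, x0=[0.0, 0.0]).x   # where the forces balance
ols = np.polyfit(x, y, deg=1)[::-1]                      # [intercept, slope] from least squares

print(equilibrium, ols)                                  # the two agree to numerical precision
```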
You'll recall the Kalman filtering setup. We periodically observe Brownian motion subject to gaussian measurement error and attempt to infer the ground truth from those measurements. But rather than dive into a bunch of conditional expectations for gaussian linear systems, we'll look for some physical intuition. We'll pretend that we've been robbed of our computers and are forced to create analog varieties just as in the good old days.
We make an observation that isn't always stressed up front in the statistical literature (Harrison and West, for example) or from the control systems perspective (such as you will find at the Wikipedia entry for "Kalman filter"). Then we pursue the analogy between statistics and physics a little further, and show how the updating of a location estimate of a gaussian distribution amounts to a combination of center of mass and reduced mass calculations.
The idea is that we start with masses that represent observations and a prior, and then try to simplify the physical system. Because they contain both fixed and free masses, resolving these physical systems in this fashion requires that we not only consolidate masses but also change reference frame. This leads to two different types of calculation, both familiar.
Suppose the prior estimate of location for a particle is \(m\) and the prior covariance is \(P\). Suppose we make an observation \(y\) with error variance \(R\). Our posterior belief is gaussian with location \(m'\) say and variance \(P'\). The update is usually written: \begin{eqnarray} m' & = & m + K ( y - m ) \\ P' & = & P(1-K), \ \ {\rm where} \\ K & = & \frac{P}{P+R} \end{eqnarray} However, it is in many ways more natural to use the inverses of covariances instead. Substituting for \(K\) and putting everything over a common denominator gives \( m' = \frac{mR + Py}{P+R} \). If we write \(\varphi = 1/R\), \(p = 1/P\) and \( p' = 1/P'\) and divide numerator and denominator by \(PR\), we notice that the Kalman filter update is merely a center of mass calculation: \begin{eqnarray} m' & = & \frac{m/P + y/R} { 1/R + 1/P } = \frac{ pm + \varphi y }{ \varphi + p } \\ p' & = & \frac{1}{P'} = \frac{P+ R}{PR} = \frac{1}{P} + \frac{1}{R} = \varphi + p \end{eqnarray} The analogy works if we treat precision as mass. And in what follows we'll be equally interested in the analogy between force and the derivative of the negative log likelihood function.
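A quick numerical check, if you don't believe the algebra (the numbers are arbitrary):

```python
import numpy as np

# Arbitrary prior and observation, just to check the two forms of the update agree.
m, P = 1.0, 4.0        # prior mean and variance
y, R = 2.5, 1.0        # observation and measurement variance

# Gain form
K = P / (P + R)
m_gain, P_gain = m + K * (y - m), P * (1 - K)

# Center of mass form: precision plays the role of mass
p, phi = 1.0 / P, 1.0 / R
m_com, P_com = (p * m + phi * y) / (p + phi), 1.0 / (p + phi)

assert np.isclose(m_gain, m_com) and np.isclose(P_gain, P_com)
print(m_com, P_com)    # 2.2 and 0.8 with these numbers
```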
An "analogue" gaussian smoother using perfect Hookean springs
Minimizing energy corresponds to maximizing log-likelihood. And setting the derivative of log-likelihood to zero corresponds to finding the equilibrium where forces cancel out.
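To spell that out for a single gaussian factor (the notation here is mine): if \(x\) is a latent location and \(a\) an observation of it with variance \(1/k\), then up to an additive constant the negative log-likelihood is the potential energy of a Hookean spring of stiffness \(k\), and the score is the spring force: \begin{eqnarray} -\log L(x) & = & \frac{k}{2}(x-a)^2 + {\rm const} \ = \ E(x) \\ F(x) & = & -\frac{dE}{dx} \ = \ -k(x-a) \ = \ \frac{d \log L}{dx} \end{eqnarray} A zero of the score is exactly a point where the forces balance.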
Furthermore, the fact that combining two pieces of evidence for one latent variable can sometimes be as simple as merging the two observations at their "center of precision" corresponds to a nice accident when forces grow linearly with distance: the impact of two masses on a third is unchanged if they coalesce at their center of mass.
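Again, a toy check of that accident (made-up positions and stiffnesses), with stiffness standing in for mass:

```python
import numpy as np

# Two Hookean pulls on a point x can be replaced by a single pull from their center of
# mass (weights = stiffnesses), with the stiffnesses simply added. Numbers are made up.
x = 0.7                      # position of the third mass
y1, k1 = 2.0, 3.0            # first anchor and its stiffness
y2, k2 = -1.0, 5.0           # second anchor and its stiffness

force_separate = k1 * (y1 - x) + k2 * (y2 - x)

y_bar = (k1 * y1 + k2 * y2) / (k1 + k2)      # center of mass, i.e. "center of precision"
force_coalesced = (k1 + k2) * (y_bar - x)

assert np.isclose(force_separate, force_coalesced)
```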
But there is more to the story...
Figure 1. Hierarchical model where location of a gaussian distribution is itself gaussian

Figure 3. Spring diagram representing noisy evidence

Figure 4. Prior location belief plus a noisy measurement

Figure 5. Prior belief plus a noisy measurement simplified using reduced mass

Figure 6. Simplification of Figure 4 by reduced mass and center of mass calculation.
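The reduced-mass step can be checked numerically too. On my reading of the diagrams it is nothing more than the hierarchical fact that variances add: springs in series combine by the same harmonic rule as a reduced mass (the numbers below are arbitrary):

```python
import numpy as np

# Springs in series: the prior anchors the latent location with stiffness p = 1/P, and
# the measurement noise links the latent location to the observation with stiffness
# phi = 1/R. Their series combination follows the reduced-mass (harmonic) rule, which is
# just the statement that the marginal variance of the observation is P + R.
P, R = 4.0, 1.0
p, phi = 1.0 / P, 1.0 / R

k_series = 1.0 / (1.0 / p + 1.0 / phi)   # series springs / reduced-mass combination
k_marginal = 1.0 / (P + R)               # precision of the marginal distribution of y

assert np.isclose(k_series, k_marginal)
```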
If you are interested in the relationship between Physics and Statistics, I suggest going (a lot) further than my noodling. I recently discovered a nice paper by Gábor Székely and Maria Rizzo titled Energy Statistics: A Class of Statistics Based on Distances (pdf). It may help you carry over your physics intuition to statistics. It is also very practical: you can use scipy.stats.energy_distance (or, for the multivariate case, the R energy package) to obtain a rotation-invariant distance measure between distributions. The properties of this metric are non-trivial, as you'll see in the paper. It is possible that you have already encountered energy distance in another guise: maximum mean discrepancy (MMD) is the term used in machine learning, and a paper by Sejdinovic et al demonstrates the equivalence. Indeed, they argue that MMD provides a generalization and thus more powerful tests.
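A minimal usage sketch (made-up samples); note that scipy's implementation works on one-dimensional samples, which is why the multivariate case sends you to the R energy package:

```python
import numpy as np
from scipy.stats import energy_distance

# Energy distance between two sets of (made-up) one-dimensional samples.
rng = np.random.default_rng(1)
u = rng.normal(loc=0.0, scale=1.0, size=500)
v = rng.normal(loc=0.5, scale=1.0, size=500)

print(energy_distance(u, v))   # a small positive number; larger shifts give larger distances
```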
On the other hand, if you are interested in analog computing there's plenty out there. I'm sure you're aware of Babbage, but here are a few lesser-known examples:
Send me your favorites or join the discussion here.