A quick derivation of geostatistical Kriging

William S. Harlan


Date: July 2013


Introduction

A casual encounter with geostatistics can be baffling because of some non-standard terminology. As it turns out, “Kriging,” the core numerical method of geostatistics, can be derived on a napkin, if you are already familiar with some standard least-squares methods.

If you want to appreciate geostatistical applications, then try the popular book “An Introduction to Applied Geostatistics” by Isaaks and Srivastava [3]. If you want a clean mathematical notation, then try “Multivariate Geostatistics: An Introduction with Applications” by Wackernagel [5].

Kriging is just the least-squares solution of a purely under-determined linear inverse problem. Once you see this equivalence, you can see some simple ways to improve distance-weighted interpolation methods.

Posing the problem

Geostatistics poses a problem that you are unlikely to have encountered elsewhere in least-squares. The solution looks familiar, but it has unique elements that make it very useful.

Assume that you have a continuous function $v ({\svector x}) $ of a spatial vector ${\svector x}$. Usually ${\svector x}$ has two dimensions, but nothing about the derivation limits us to two dimensions.

The actual function $v ({\svector x}) $ is unknown and needs to be reconstructed from a collection of samples $\{ v_i\}$ at $n$ arbitrarily chosen locations.

$\displaystyle v_i\equiv v({\svector x}_i)~$for$\displaystyle ~ i{=}1,n.$ (1)

We will interpolate a value $v_0$ at an unsampled location ${\svector x}_0$ as a linearly weighted sum of the $n$ known values $\{ v_i\}$, at sampled locations $\{{\svector x}_i\}$:

$\displaystyle v({\svector x}_0)$ $\displaystyle =$ $\displaystyle \sum_{i=1}^nw_i\cdot v({\svector x}_i),$ (2)
$\displaystyle v_0$ $\displaystyle =$ $\displaystyle \sum_{i=1}^nw_iv_i, ~$or  
$\displaystyle v_0$ $\displaystyle =$ $\displaystyle {{\svector w}^\intercal}\cdot {\svector v},$  
where$\displaystyle ~ {{\svector w}^\intercal} \equiv [w_1, w_2, \ldots, w_n], ~$and$\displaystyle ~ {\svector v}\equiv [v_1, v_2, \ldots, v_n]^\intercal .$  

In the last version we switch to vector notation.

Our problem: how do we optimally estimate the weights ${\svector w}$, given what we know about the distribution of $v({\svector x})$?

Assumptions

We will make a few more assumptions appropriate to a linear least-squares solution.

We will treat all our values $\{ v_i\}$ as random variables. Without loss of generality (changing variables if necessary), we will assume these variables have zero expected mean:

$\displaystyle E (v_i) \equiv 0, ~$for$\displaystyle ~ i{=}0,n.$ (3)

Assume a constant stationary variance $\sigma_v^2$ for all samples:
$\displaystyle \sigma_v^2\equiv E (v_iv_i), ~$for$\displaystyle ~ i{=}0,n.$     (4)

Notice that we have included the unmeasured value $v_0$ in our assumptions.

We will find it convenient to construct a covariance matrix $\stensor C$ for our measured values:

$\displaystyle C_{ij}$ $\displaystyle \equiv$ $\displaystyle E (v_i v_j ), ~$$\displaystyle \mbox {for}~ i{=}1,n~\mbox{and}~ j{=}1,n, ~\mbox{or}$ (5)
$\displaystyle \stensor C$ $\displaystyle \equiv$ $\displaystyle E ({\svector v}{\svector v}^\intercal ).$  

Note that the covariance is symmetric: $C_{ij}= C_{ji}$.

Define a covariance vector $\svector c$ for our unknown value relative to the measured values:

$\displaystyle c_i$ $\displaystyle \equiv$ $\displaystyle E (v_0 v_i ) , ~$$\displaystyle \mbox {for}~ i{=}1,n, ~\mbox{or}$ (6)
$\displaystyle \svector c$ $\displaystyle \equiv$ $\displaystyle E (v_0 {\svector v}).$  

Much of geostatistics concentrates on the estimation of these covariances. For now we assume they are known.

Notice that none of our assumptions require that our interpolated value share the same physical units as the values we are interpolating from. Co-kriging differs from Simple Kriging largely by combining values with mixed units. (See Wackernagel for a demonstration of their equivalence.)

Least-squares solution

Next we estimate our weights as a least-squares optimization, and take advantage of our assumptions.

Define an estimation error $\sigma_E^2$ as the expected squared difference between the true value and our interpolated value.

$\displaystyle \sigma_E^2$ $\displaystyle \equiv$ $\displaystyle E [ ( v_0- \sum_{i=1}^nw_iv_i)^2 ]$ (7)
  $\displaystyle =$ $\displaystyle E(v_0v_0) - 2 \sum_{i=1}^nE ( v_iv_0) w_i+ \sum_{i=1}^n\sum_{j=1}^nE (v_iv_j) w_iw_j$  
  $\displaystyle =$ $\displaystyle \sigma_v^2- 2 \sum_{i=1}^nc_iw_i+ \sum_{i=1}^n\sum_{j=1}^nC_{ij}w_iw_j.$ (8)

Naturally we want this error to be small. Since this expression is a quadratic of the weights ${\svector w}$, we can minimize $\sigma_E^2$ by differentiating with respect to the weights and setting the value to zero.
    $\displaystyle {\frac{\partial \sigma_E^2}{\partial w_i}} = - 2 c_i+ 2 \sum_{j=1}^nC_{ij}w_j= 0$ (9)
  $\displaystyle \Rightarrow$ $\displaystyle \sum_{j=1}^nC_{ij}w_j= c_i$ (10)
  $\displaystyle \Rightarrow$ $\displaystyle \stensor C\cdot {\svector w}= \svector c$  
  $\displaystyle \Rightarrow$ $\displaystyle {\svector w}= \stensor C^{-1}\cdot \svector c.$ (11)

This result is entirely equivalent to Simple Kriging. The optimum weights are intuitive when you see them in this form.

Naive weights ${\svector w}$ would simply use the covariance vector $\svector c$ between values at the known and the unknown locations. Because the known locations are also correlated with each other, we must remove that effect by multiplying by their inverse covariance $\stensor C^{-1}$.

Once we have our estimated weights ${\svector w}$, we can explicitly quantify the error in our estimate. This will allow us to put Gaussian error bars on each interpolated value. Substitute our solution (10) into our estimation error (8) to get

$\displaystyle \sigma_E^2$ $\displaystyle =$ $\displaystyle \sigma_v^2- 2 \sum_{i=1}^nc_iw_i+ \sum_{i=1}^nc_iw_i$  
  $\displaystyle =$ $\displaystyle \sigma_v^2- \sum_{i=1}^nc_iw_i$  
  $\displaystyle =$ $\displaystyle \sigma_v^2- \svector c^\intercal \cdot {\svector w}.$ (12)

Notice that, with these optimal weights, the residual error $v_0 - \sum_i w_iv_i$ is uncorrelated with every measured sample; that is exactly what (10) states. If it were not, then the estimate could be improved.
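
For readers who prefer code to formulas, here is a minimal numerical sketch of equations (11) and (12), assuming NumPy and a hypothetical user-supplied covariance function cov (any positive-definite function of two locations will do):

    import numpy as np

    def simple_kriging(cov, x_known, v_known, x0):
        """Interpolate one value at x0 by Simple Kriging.

        cov     : covariance function of two locations (assumed known)
        x_known : (n, d) array of sampled locations
        v_known : (n,) array of sampled (zero-mean) values
        x0      : (d,) location to interpolate
        Returns the interpolated value and its error variance.
        """
        n = len(x_known)
        C = np.array([[cov(x_known[i], x_known[j]) for j in range(n)]
                      for i in range(n)])                      # covariance matrix (5)
        c = np.array([cov(x0, x_known[i]) for i in range(n)])  # covariance vector (6)
        w = np.linalg.solve(C, c)                              # weights (11): C w = c
        v0 = w @ v_known                                       # interpolated value (2)
        sigma_e2 = cov(x0, x0) - c @ w                         # error variance (12)
        return v0, sigma_e2

Under the Gaussian assumption, a rough 95% confidence interval is then $v_0 \pm 2\sqrt{\sigma_E^2}$.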

Choosing covariances

Geostatistics usually concentrates on the estimation of covariances (or variograms, discussed later).

Here is the most geostatistical assumption of all. The covariance between values depends only on the vector distance between their sampled positions:

$\displaystyle C_{ij}\equiv E [v_i v_j ] \equiv E [v ({\svector x}_i) v({\svector x}_j) ] \equiv C ( {\svector x}_i - {\svector x}_j ), ~$for$\displaystyle ~ i{=}0,n~$and$\displaystyle ~ j{=}0,n.$     (13)

In the absence of other information, it would be appealing to choose a scale-invariant version of a covariance. Scale-invariance allows covariances to be independent of the units chosen for spatial distance. Covariance would be similar at any scale, like a fractal.

Unfortunately, this assumption leads to a degenerate solution. Scale-invariance forces the covariance to grow without bound as the distance shrinks to zero, where it should instead equal the finite variance $\sigma_v^2$. As the zero-distance covariance grows, the covariance matrix $\stensor C$ approaches a scaled identity matrix, and so the inverse covariance has no effect on the weights.

We see this degeneracy with a popular interpolation known as Shepard's “Inverse Distance Weighting”:

$\displaystyle v({\svector x}_0)$ $\displaystyle =$ $\displaystyle \sum_{i=1}^nw_i\cdot v({\svector x}_i)~$where  
$\displaystyle w_i$ $\displaystyle =$ $\displaystyle \Vert{\svector x}_i- {\svector x}_0\Vert^{-2} / \sum_{j=1}^n\Vert{\svector x}_j- {\svector x}_0\Vert^{-2} .$ (14)

This weight decreases with distance squared, like gravity. The denominator is just a normalization over all samples.
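
As a point of comparison, equation (14) is nearly a one-liner; this sketch assumes NumPy and that ${\svector x}_0$ does not coincide exactly with any sampled location:

    import numpy as np

    def shepard_weights(x_known, x0):
        """Inverse-distance-squared weights of equation (14).

        x_known : (n, d) array of sampled locations
        x0      : (d,) location to interpolate
        """
        d2 = np.sum((np.asarray(x_known, float) - np.asarray(x0, float))**2, axis=1)
        w = 1.0 / d2              # weight falls off as distance squared
        return w / w.sum()        # normalize so the weights sum to one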

If we are assuming a specific correlation between our measured and interpolated values, then we ought to be able to assume the same correlation between measured values.

Unfortunately, this covariance approaches infinity at a zero distance:

$\displaystyle \stensor C({\svector x}_i- {\svector x}_j)$ $\displaystyle \propto$ $\displaystyle \Vert{\svector x}_i- {\svector x}_j\Vert^{-2}$  
$\displaystyle \Rightarrow ~~ \sigma_v^2$ $\displaystyle \approx$ $\displaystyle \infty$ (15)
$\displaystyle \Rightarrow ~~ \stensor C$ $\displaystyle \propto$ $\displaystyle \stensor I$  
$\displaystyle \Rightarrow ~~ {\svector w}$ $\displaystyle \approx$ $\displaystyle \svector c.$ (16)

Sure enough, this degenerate solution tells us to ignore covariances between our known values.

As it turns out, the degeneracy is easily remedied if we sacrifice scale invariance at distances smaller than our sample spacing. Choose a small distance $\Delta x$ below which the correlation does not increase any further.

$\displaystyle \stensor C({\svector x}_i- {\svector x}_j)$ $\displaystyle \propto$ $\displaystyle \min(\Delta x^{-2}, \Vert{\svector x}_i- {\svector x}_j\Vert ^{-2} ) .$ (17)

Covariances at any smaller distance will all equal the variance $\sigma_v^2$.
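
Here is a sketch of the clipped covariance (17), with the proportionality constant set to one and the clipping distance $\Delta x$ exposed as a parameter (the default 0.5 is just the value used in the example of the next section):

    import numpy as np

    def truncated_cov(xi, xj, dx=0.5):
        """Covariance (17): inverse distance squared, clipped below dx
        so that the zero-offset variance stays finite."""
        h2 = float(np.sum((np.asarray(xi, float) - np.asarray(xj, float))**2))
        return dx**-2 if h2 == 0.0 else min(dx**-2, 1.0 / h2)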

Let's see how this assumption works with a simple example.

A simple example

Let's say we have two measured values to interpolate along a single dimension $x$.

One value $v_2$ is twice as far as $v_1$ from the interpolated point $v_0$. Specifically $\Vert x_1- x_0\Vert = 1$ and $\Vert x_2- x_0\Vert = 2$.

If we are using Shepard's interpolation, then we already know enough to estimate the weights as ${{\svector w}^\intercal}= [ 0.8 , 0.2 ]$.

However, it should matter very much whether the point we are interpolating is between the two measured points or on the same side of both.

For a non-degenerate covariance, we will assume $\Delta x= 0.5$ in our covariance (17).

On opposite sides of the interpolated point, the two measured points are weakly correlated with $\Vert x_2 - x_1\Vert = 3$. We get adjusted weights of ${{\svector w}^\intercal}= [ 0.82 , 0.18 ]$, which is very close to Shepard's interpolation weights.

On the same side of the interpolated point, the two measured points are strongly correlated with $\Vert x_2 - x_1\Vert = 1$. The value $v_2$ is actually better correlated with $v_1$ than it is with the interpolated $v_0$. We expect the nearer value $v_1$ to override and diminish the contribution of $v_2$. And sure enough, the adjusted weights are actually ${{\svector w}^\intercal}= [ 1 ,0 ]$, which ignores the second point entirely.
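
These numbers are easy to reproduce. The sketch below (assuming NumPy and the clipped covariance (17) with $\Delta x = 0.5$) computes the Simple Kriging weights for both configurations; the weights are normalized to sum to one, as Shepard's weights do, which appears to be how the quoted values were obtained:

    import numpy as np

    def cov(xi, xj, dx=0.5):
        # clipped inverse-distance-squared covariance (17)
        h2 = float((xi - xj)**2)
        return dx**-2 if h2 == 0.0 else min(dx**-2, 1.0 / h2)

    def normalized_kriging_weights(x_known, x0):
        n = len(x_known)
        C = np.array([[cov(x_known[i], x_known[j]) for j in range(n)]
                      for i in range(n)])
        c = np.array([cov(x0, x_known[i]) for i in range(n)])
        w = np.linalg.solve(C, c)     # Simple Kriging weights (11)
        return w / w.sum()            # normalized for comparison with Shepard's weights

    x0 = 0.0
    print(normalized_kriging_weights([-1.0, 2.0], x0))  # opposite sides: about [0.82, 0.18]
    print(normalized_kriging_weights([1.0, 2.0], x0))   # same side: [1, 0] to rounding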

Variograms

Most geostatistics literature prefers “variograms” to covariances. This is one of the biggest and most unnecessary obstacles to newcomers. The two quantities have a simple equivalence:

$\displaystyle \gamma_v ({\svector x}_i, {\svector x}_j)$ $\displaystyle \equiv$ $\displaystyle \frac{1}{2} E \{ [v ({\svector x}_i) - v({\svector x}_j) ]^2 \} = \sigma_v^2- C_{ij}.$ (18)

Variograms have the minor convenience that values no longer need a zero mean.

The geostatistical assumption (13) can be rewritten

$\displaystyle \gamma_v ({\svector x}_i- {\svector x}_j) = \sigma_v^2- C( {\svector x}_i - {\svector x}_j ).$     (19)

Usually variograms are defined in terms of an offset ${\svector h}$:
$\displaystyle {\svector h}_{ij}$ $\displaystyle =$ $\displaystyle {\svector x}_i- {\svector x}_j;$  
$\displaystyle \gamma_v ({\svector h}_{ij})$ $\displaystyle =$ $\displaystyle \sigma_v^2- C( {\svector h}_{ij}).$ (20)

Considering only offsets and ignoring sampling, we can write the equivalence this way:
$\displaystyle \gamma ({\svector h}) = C ({\svector 0} ) - C ({\svector h}).$     (21)

If we assume values are uncorrelated at infinite distance, we find
$\displaystyle C( {\svector \infty})$ $\displaystyle =$ 0  
$\displaystyle \Rightarrow ~~ \gamma( {\svector \infty})$ $\displaystyle =$ $\displaystyle C({\svector 0})$  
$\displaystyle \Rightarrow ~~ C ({\svector h})$ $\displaystyle =$ $\displaystyle \gamma ({\svector \infty}) - \gamma({\svector h}).$ (22)

The most popular of standard analytic variograms is an exponential:

$\displaystyle \gamma_{\exp} ({\svector h})$ $\displaystyle \propto$ $\displaystyle 1 - \exp( - 3 \Vert {\svector h}\Vert / a), ~~$or (23)
$\displaystyle C_{\exp} ({\svector h})$ $\displaystyle \propto$ $\displaystyle \exp( - 3 \Vert {\svector h}\Vert / a).$ (24)

The constant $a$ is the distance over which about 95% of the correlation decays. Unlike our distance weighting, this function is not scale invariant over any range of offsets.
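
A sketch of the exponential pair (23)-(24), with the proportionality constant exposed as a parameter (here named sill, a common geostatistical term, defaulting to one):

    import numpy as np

    def exp_variogram(h, a, sill=1.0):
        """Exponential variogram (23): about 95% of the sill is reached at distance a."""
        return sill * (1.0 - np.exp(-3.0 * np.linalg.norm(h) / a))

    def exp_covariance(h, a, sill=1.0):
        """Matching covariance (24), via C(h) = gamma(inf) - gamma(h) as in (22)."""
        return sill * np.exp(-3.0 * np.linalg.norm(h) / a)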

A user must prepare scatterplots and histograms to scan possible values for the constant $a$, then choose a best-fitting value before Kriging can begin. For this sort of analysis, variograms have proven more popular than covariances.

Kriging as a posterior estimate

It is also possible to recognize Kriging as a special case of a least-squares inverse problem, without postulating the specific weighted solution (2). I thank Dave Hale [1] for explaining this equivalence to me. Kriging is a purely underdetermined inverse problem, which makes it a bit unusual and less familiar.

We assume that we have a model vector ${\svector m}$ and a data vector ${\svector d}$. The data are assumed to be a linear transform (matrix multiplication) ${\stensor F}$ of the model, plus a vector ${\svector n}$ of noise:

$\displaystyle {\svector d}= {\stensor F}{\svector m}+ {\svector n}.$     (25)

We assume that our model and noise are selected from stationary correlated Gaussian random processes. The expected value of a model given the data (a posterior estimate) should minimize the following objective function:
$\displaystyle S ({\svector m}) = ({\svector d}- {\stensor F}{\svector m})^{\intercal} {\stensor C}_n^{-1}({\svector d}- {\stensor F}{\svector m}) + {\svector m}^{\intercal} {\stensor C}_m^{-1}{\svector m}.$     (26)

The covariance ${\stensor C}_m$ is the Gaussian covariance of the model ${\svector m}$, and covariance ${\stensor C}_n$ is the Gaussian covariance of the noise. We also assume that the noise is uncorrelated to the model. The definition (25) determines the data covariance ${\stensor C}_d$:
$\displaystyle {\stensor C}_d= {\stensor F}{\stensor C}_m{\stensor F}^\intercal + {\stensor C}_n.$     (27)

Since this objective function is a quadratic in ${\svector m}$, we can find an optimum $\hat{{\svector m}}$ by setting the derivative to zero:

$\displaystyle \frac{\partial S}{\partial {\svector m}^{\intercal}} (\hat{{\svector m}})$ $\displaystyle =$ $\displaystyle - {\stensor F}^\intercal {\stensor C}_n^{-1}( {\svector d}- {\stensor F}\hat{{\svector m}}) + {\stensor C}_m^{-1}\hat{{\svector m}}= 0 ;$ (28)
$\displaystyle \Rightarrow ~~ \hat{{\svector m}}$ $\displaystyle =$ $\displaystyle ( {\stensor F}^\intercal {\stensor C}_n^{-1}{\stensor F}+ {\stensor C}_m^{-1}) ^ {-1} {\stensor F}^\intercal {\stensor C}_n^{-1}{\svector d};$ (29)
$\displaystyle \Rightarrow ~~ \hat{{\svector m}}$ $\displaystyle =$ $\displaystyle {\stensor C}_m{\stensor F}^\intercal ( {\stensor F}{\stensor C}_m{\stensor F}^\intercal + {\stensor C}_n)^{-1} {\svector d}~=~ {\stensor C}_m{\stensor F}^\intercal {\stensor C}_d^{-1}{\svector d}.$ (30)

The first solution (29) follows directly from setting the derivative (28) to zero. The second solution (30) was shown by Tarantola [4] (Appendix 6.30) to be entirely equivalent. If (29) is equivalent to (30), then
$\displaystyle ( {\stensor F}^\intercal {\stensor C}_n^{-1}{\stensor F}+ {\stensor C}_m^{-1}) ^ {-1} {\stensor F}^\intercal {\stensor C}_n^{-1}$ $\displaystyle =$ $\displaystyle {\stensor C}_m{\stensor F}^\intercal ( {\stensor F}{\stensor C}_m{\stensor F}^\intercal + {\stensor C}_n)^{-1} ;$ (31)
$\displaystyle \Rightarrow ~~ {\stensor F}^\intercal {\stensor C}_n^{-1}( {\stensor F}{\stensor C}_m{\stensor F}^\intercal + {\stensor C}_n)$ $\displaystyle =$ $\displaystyle ( {\stensor F}^\intercal {\stensor C}_n^{-1}{\stensor F}+ {\stensor C}_m^{-1}) {\stensor C}_m{\stensor F}^\intercal$ (32)
  $\displaystyle =$ $\displaystyle {\stensor F}^\intercal {\stensor C}_n^{-1}{\stensor F}{\stensor C}_m{\stensor F}^\intercal + {\stensor F}^\intercal .$ (33)

Both sides of (32) are trivially equal to (33).
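
The identity is also easy to check numerically. Here is a small sketch (assuming NumPy and randomly generated symmetric positive-definite covariances) that verifies both sides of (31) agree for a forward operator with more model values than data values, as in Kriging:

    import numpy as np

    rng = np.random.default_rng(0)
    nm, nd = 8, 3                           # more model values than data values

    F = rng.standard_normal((nd, nm))       # arbitrary forward operator
    A = rng.standard_normal((nm, nm))
    C_m = A @ A.T + np.eye(nm)              # symmetric positive-definite model covariance
    B = rng.standard_normal((nd, nd))
    C_n = B @ B.T + np.eye(nd)              # symmetric positive-definite noise covariance

    # left side of (31): (F^T C_n^-1 F + C_m^-1)^-1 F^T C_n^-1
    lhs = np.linalg.solve(F.T @ np.linalg.solve(C_n, F) + np.linalg.inv(C_m),
                          F.T @ np.linalg.inv(C_n))
    # right side of (31): C_m F^T (F C_m F^T + C_n)^-1
    rhs = C_m @ F.T @ np.linalg.inv(F @ C_m @ F.T + C_n)

    assert np.allclose(lhs, rhs)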

This alternate solution (30) is less often used because it requires a matrix inversion in the data space rather than in the model space. Most typically, an overdetermined problem has a greater number of data values than model values. In the case of Kriging, however, the situation is reversed.

Others, such as Hansen et al. [2], have observed that this alternate solution (30) is equivalent to Kriging. We will elaborate below.

In the case of geostatistical Kriging, we assume that our model ${\svector m}$ is a large dense grid of regularly distributed values. The forward operator ${\stensor F}$ merely subsamples that grid for the values ${\svector d}$. The dimensionality of ${\svector d}$ is possibly orders of magnitude smaller than that of the model ${\svector m}$. The transform ${\stensor F}$ is a rectangular matrix that has many more columns than rows. Each row contains mostly zeros and a single column with a value of 1. Most columns do not have a value of 1 on any row.

Applying the transposed operator ${\stensor F}^\intercal $ to subsampled data ${\svector d}$ merely repopulates the sampled values of ${\svector m}$ and leaves all other values set to 0. This is also the right inverse of the original transform, so that ${\stensor F}{\stensor F}^\intercal = \stensor I$.

If we have sampled the model directly, there is no need to assume any noise at all. If we allow ${\stensor C}_n= \epsilon \stensor I; \epsilon \rightarrow 0$, then the usual estimate (29) becomes degenerate, and $\hat{{\svector m}}\rightarrow {\stensor F}^\intercal {\svector d}$. Rather than interpolating any unsampled values, this estimate sets them all to zero.

The alternate form (30) lets us set the noise variance directly to 0. We can then find the following covariances of our sampled data ${\stensor C}_d$:

Assume$\displaystyle ~~ {\stensor C}_n$ $\displaystyle \equiv$ $\displaystyle 0 ;$ (34)
$\displaystyle \Rightarrow ~~ {\stensor C}_d$ $\displaystyle =$ $\displaystyle {\stensor F}{\stensor C}_m{\stensor F}^\intercal$ (35)

and then rewrite our alternate estimate (30) as
$\displaystyle \hat{{\svector m}}$ $\displaystyle =$ $\displaystyle ({\stensor C}_m{\stensor F}^\intercal ) ( {\stensor F}{\stensor C}_m{\stensor F}^\intercal )^{-1} {\svector d}$ (36)
  $\displaystyle =$ $\displaystyle ({\stensor F}{\stensor C}_m)^{\intercal} {\stensor C}_d^{-1}{\svector d}$ (37)
  $\displaystyle =$ $\displaystyle \stensor W^{\intercal}{\svector d};$ (38)
where$\displaystyle ~~ \stensor W$ $\displaystyle \equiv$ $\displaystyle {\stensor C}_d^{-1}({\stensor F}{\stensor C}_m) .$ (39)

These weights (39) for all samples are entirely equivalent to those for a single sample (11). Each measured sample of ${\svector d}$ is weighted by its covariance to an interpolated point, then these weights are adjusted by the inverse covariance of the data samples. We now have the advantage of directly relating data covariances to those of the model. This expression is also much easier to generalize to data that do more than merely subsample the model.
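
To make the grid formulation concrete, here is a minimal sketch (assuming NumPy, a one-dimensional model grid, an exponential model covariance with an arbitrary range, and made-up sample values) that builds the subsampling operator ${\stensor F}$ explicitly and applies (36)-(39). A practical implementation would never form these dense matrices, but the small case shows the mechanics:

    import numpy as np

    # a dense 1-D model grid and a few sampled locations (all numbers here are made up)
    x_grid = np.linspace(0.0, 10.0, 101)           # model locations
    sampled = np.array([5, 30, 62, 90])            # grid indices that were measured
    d = np.array([1.2, -0.4, 0.7, 0.1])            # measured (zero-mean) values

    # model covariance C_m from an exponential covariance (24) with range a = 3
    a = 3.0
    H = np.abs(x_grid[:, None] - x_grid[None, :])  # all pairwise offsets
    C_m = np.exp(-3.0 * H / a)

    # F merely subsamples the grid: one 1 per row, so F F^T = I
    F = np.zeros((len(sampled), len(x_grid)))
    F[np.arange(len(sampled)), sampled] = 1.0

    C_d = F @ C_m @ F.T                            # data covariance (35), no noise
    W = np.linalg.solve(C_d, F @ C_m)              # weights (39): C_d^{-1} (F C_m)
    m_hat = W.T @ d                                # interpolated model (38)

    assert np.allclose(F @ F.T, np.eye(len(sampled)))  # F^T is a right inverse of F
    assert np.allclose(m_hat[sampled], d)              # the estimate honors the data exactly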

References

[1] Dave Hale. Personal communication, 2013.

[2] Thomas Mejer Hansen, Andre Journel, Albert Tarantola, and Klaus Mosegaard. Linear inverse Gaussian theory and geostatistics. Geophysics, pages 101–111, 2006.

[3] Edward H. Isaaks and Mohan R. Srivastava. An Introduction to Applied Geostatistics. Oxford University Press, USA, January 1990.

[4] Albert Tarantola. Inverse Problem Theory and Methods for Model Parameter Estimation. Society for Industrial and Applied Mathematics, Philadelphia, PA, USA, 2004.

[5] H. Wackernagel. Multivariate Geostatistics: An Introduction with Applications. Springer-Verlag, 2nd edition, 2003.