A quick derivation of geostatistical Kriging

William S. Harlan


Date: July 2013


Introduction

A casual encounter with geostatistics can be baffling because of some non-standard terminology. As it turns out, “Kriging,” the core numerical method of geostatistics, can be derived on a napkin, if you are already familiar with some standard least-squares methods.

If you want to appreciate geostatistical applications, then try the popular book “An Introduction to Applied Geostatistics” by Isaaks and Srivastava [3]. If you want a clean mathematical notation, then try “Multivariate Geostatistics: An Introduction with Applications” by Wackernagel [5].

Kriging is just the least-squares solution of a purely under-determined linear inverse problem. Once you see this equivalence, you can see some simple ways to improve distance-weighted interpolation methods.

Posing the problem

Geostatistics poses a problem that you are unlikely to have encountered elsewhere in least-squares. The solution looks familiar, but it has unique elements that make it very useful.

Assume that you have a continuous function $v ({\svector x}) $ of a spatial vector ${\svector x}$. Usually ${\svector x}$ has two dimensions, but nothing about the derivation limits us to two dimensions.

The actual function $v ({\svector x}) $ is unknown and needs to be reconstructed from a collection of samples $\{ v_i\}$ at $n$ arbitrarily chosen locations.

$\displaystyle v_i\equiv v({\svector x}_i)~$for$\displaystyle ~ i{=}1,n.$ (1)

We will interpolate a value $v_0$ at an unsampled location ${\svector x}_0$ as a linearly weighted sum of the $n$ known values $\{ v_i\}$, at sampled locations $\{{\svector x}_i\}$:

$\displaystyle v({\svector x}_0)$ $\displaystyle =$ $\displaystyle \sum_{i=1}^nw_i\cdot v({\svector x}_i),$ (2)
$\displaystyle v_0$ $\displaystyle =$ $\displaystyle \sum_{i=1}^nw_iv_i, ~$or  
$\displaystyle v_0$ $\displaystyle =$ $\displaystyle {{\svector w}^\intercal}\cdot {\svector v},$  
where$\displaystyle ~ {{\svector w}^\intercal} \equiv [w_1, w_2, \ldots, w_n], ~$and$\displaystyle ~ {\svector v}\equiv [v_1, v_2, \ldots, v_n]^\intercal .$  

In the last version we switch to vector notation.

Our problem: how do we optimally estimate the weights ${\svector w}$, given what we know about the distribution of $v({\svector x})$?

Assumptions

We will make a few more assumptions appropriate to a linear least-squares solution.

We will treat all our values $\{ v_i\}$ as random variables. Without loss of generality (changing variables if necessary), we will assume these variables have zero expected mean:

$\displaystyle E (v_i) \equiv 0, ~$for$\displaystyle ~ i{=}0,n.$ (3)

Assume a constant stationary variance $\sigma_v^2$ for all samples:
$\displaystyle \sigma_v^2\equiv E (v_iv_i), ~$for$\displaystyle ~ i{=}0,n.$     (4)

Notice that we have included the unmeasured value $v_0$ in our assumptions.

We will find it convenient to construct a covariance matrix $\stensor C$ for our measured values:

$\displaystyle C_{ij}$ $\displaystyle \equiv$ $\displaystyle E (v_i v_j ), ~$$\displaystyle \mbox {for}~ i{=}1,n~\mbox{and}~ j{=}1,n, ~\mbox{or}$ (5)
$\displaystyle \stensor C$ $\displaystyle \equiv$ $\displaystyle E ({\svector v}{\svector v}^\intercal ).$  

Note that the covariance is symmetric: $C_{ij}= C_{ji}$.

Define a covariance vector $\svector c$ for our unknown value relative to the measured values:

$\displaystyle c_i$ $\displaystyle \equiv$ $\displaystyle E (v_0 v_i ) , ~$$\displaystyle \mbox {for}~ i{=}1,n, ~\mbox{or}$ (6)
$\displaystyle \svector c$ $\displaystyle \equiv$ $\displaystyle E (v_0 {\svector v}).$  

Much of geostatistics concentrates on the estimation of these covariances. For now we assume they are known.

Notice that none of our assumptions require that our interpolated value share the same physical units as the values we are interpolating from. Co-kriging differs from Simple Kriging largely by combining values with mixed units. (See Wackernagel for a demonstration of their equivalence.)

Least-squares solution

Next we estimate our weights as a least-squares optimization, and take advantage of our assumptions.

Define an estimation error $\sigma_E^2$ as the expected squared difference between the true value and our interpolated value.

$\displaystyle \sigma_E^2$ $\displaystyle \equiv$ $\displaystyle E [ ( v_0- \sum_{i=1}^nw_iv_i)^2 ]$ (7)
  $\displaystyle =$ $\displaystyle E(v_0v_0) - 2 \sum_{i=1}^nE ( v_iv_0) w_i+ \sum_{i=1}^n\sum_{j=1}^nE (v_iv_j) w_iw_j$  
  $\displaystyle =$ $\displaystyle \sigma_v^2- 2 \sum_{i=1}^nc_iw_i+ \sum_{i=1}^n\sum_{j=1}^nC_{ij}w_iw_j.$ (8)

Naturally we want this error to be small. Since this expression is a quadratic of the weights ${\svector w}$, we can minimize $\sigma_E^2$ by differentiating with respect to the weights and setting the value to zero.
    $\displaystyle {\frac{\partial \sigma_E^2}{\partial w_i}} = - 2 c_i+ 2 \sum_{j=1}^nC_{ij}w_j= 0$ (9)
  $\displaystyle \Rightarrow$ $\displaystyle \sum_{j=1}^nC_{ij}w_j= c_i$ (10)
  $\displaystyle \Rightarrow$ $\displaystyle \stensor C\cdot {\svector w}= \svector c$  
  $\displaystyle \Rightarrow$ $\displaystyle {\svector w}= \stensor C^{-1}\cdot \svector c.$ (11)

This result is entirely equivalent to Simple Kriging. The optimum weights are intuitive when you see them in this form.

Naive weights ${\svector w}$ would simply use the covariance vector $\svector c$ between values at the known and the unknown locations. Because the known locations are also correlated with each other, we must remove that effect by multiplying by their inverse covariance $\stensor C^{-1}$.

Once we have our estimated weights ${\svector w}$, we can explicitly quantify the error in our estimate. This will allow us to put Gaussian error bars on each interpolated value. Substitute our solution (10) into our estimation error (8) to get

$\displaystyle \sigma_E^2$ $\displaystyle =$ $\displaystyle \sigma_v^2- 2 \sum_{i=1}^nc_iw_i+ \sum_{i=1}^nc_iw_i$  
  $\displaystyle =$ $\displaystyle \sigma_v^2- \sum_{i=1}^nc_iw_i$  
  $\displaystyle =$ $\displaystyle \sigma_v^2- \svector c^\intercal \cdot {\svector w}.$ (12)

Notice that, with these optimal weights, the residual error $v_0 - \sum_i w_iv_i$ is uncorrelated with every measured sample; that is exactly what (10) states. If it were not, then the estimate could be improved.
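
For readers who prefer code to formulas, here is a minimal numerical sketch of equations (11) and (12), assuming NumPy and a hypothetical user-supplied covariance function cov (any positive-definite function of two locations will do):

    import numpy as np

    def simple_kriging(cov, x_known, v_known, x0):
        """Interpolate one value at x0 by Simple Kriging.

        cov     : covariance function of two locations (assumed known)
        x_known : (n, d) array of sampled locations
        v_known : (n,) array of sampled (zero-mean) values
        x0      : (d,) location to interpolate
        Returns the interpolated value and its error variance.
        """
        n = len(x_known)
        C = np.array([[cov(x_known[i], x_known[j]) for j in range(n)]
                      for i in range(n)])                      # covariance matrix (5)
        c = np.array([cov(x0, x_known[i]) for i in range(n)])  # covariance vector (6)
        w = np.linalg.solve(C, c)                              # weights (11): C w = c
        v0 = w @ v_known                                       # interpolated value (2)
        sigma_e2 = cov(x0, x0) - c @ w                         # error variance (12)
        return v0, sigma_e2

Under the Gaussian assumption, a rough 95% confidence interval is then $v_0 \pm 2\sqrt{\sigma_E^2}$.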

Choosing covariances

Geostatistics usually concentrates on the estimation of covariances (or variograms, discussed later).

Here is the most geostatistical assumption of all. The covariance between values depends only on the vector distance between their sampled positions:

$\displaystyle C_{ij}\equiv E [v_i v_j ] \equiv E [v ({\svector x}_i) v({\svector x}_j) ] \equiv C ( {\svector x}_i - {\svector x}_j ), ~$for$\displaystyle ~ i{=}0,n~$and$\displaystyle ~ j{=}0,n.$     (13)

In the absence of other information, it would be appealing to choose a scale-invariant version of a covariance. Scale-invariance allows covariances to be independent of the units chosen for spatial distance. Covariance would be similar at any scale, like a fractal.

Unfortunately, this assumption leads to a degenerate solution. Scale-invariance forces the covariance to grow without bound as the distance shrinks to zero, where it should instead equal the finite variance $\sigma_v^2$. As the zero-distance covariance grows, the covariance matrix $\stensor C$ approaches a scaled identity matrix, and so the inverse covariance has no effect on the weights.

We see this degeneracy with a popular interpolation known as Shepard's “Inverse Distance Weighting”:

$\displaystyle v({\svector x}_0)$ $\displaystyle =$ $\displaystyle \sum_{i=1}^nw_i\cdot v({\svector x}_i)~$where  
$\displaystyle w_i$ $\displaystyle =$ $\displaystyle \Vert{\svector x}_i- {\svector x}_0\Vert^{-2} / \sum_{j=1}^n\Vert{\svector x}_j- {\svector x}_0\Vert^{-2} .$ (14)

This weight decreases with distance squared, like gravity. The denominator is just a normalization over all samples.
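
As a point of comparison, equation (14) is nearly a one-liner; this sketch assumes NumPy and that ${\svector x}_0$ does not coincide exactly with any sampled location:

    import numpy as np

    def shepard_weights(x_known, x0):
        """Inverse-distance-squared weights of equation (14).

        x_known : (n, d) array of sampled locations
        x0      : (d,) location to interpolate
        """
        d2 = np.sum((np.asarray(x_known, float) - np.asarray(x0, float))**2, axis=1)
        w = 1.0 / d2              # weight falls off as distance squared
        return w / w.sum()        # normalize so the weights sum to one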

If we are assuming a specific correlation between our measured and interpolated values, then we ought to be able to assume the same correlation between measured values.

Unfortunately, this covariance approaches infinity at a zero distance:

$\displaystyle \stensor C({\svector x}_i- {\svector x}_j)$ $\displaystyle \propto$ $\displaystyle \Vert{\svector x}_i- {\svector x}_j\Vert^{-2}$  
$\displaystyle \Rightarrow ~~ \sigma_v^2$ $\displaystyle \approx$ $\displaystyle \infty$ (15)
$\displaystyle \Rightarrow ~~ \stensor C$ $\displaystyle \propto$ $\displaystyle \stensor I$  
$\displaystyle \Rightarrow ~~ {\svector w}$ $\displaystyle \approx$ $\displaystyle \svector c.$ (16)

Sure enough, this degenerate solution tells us to ignore covariances between our known values.

As it turns out, the degeneracy is easily remedied if we sacrifice scale invariance at distances smaller than our sample spacing. Choose a small distance $\Delta x$ below which the correlation does not increase any further.

$\displaystyle \stensor C({\svector x}_i- {\svector x}_j)$ $\displaystyle \propto$ $\displaystyle \min(\Delta x^{-2}, \Vert{\svector x}_i- {\svector x}_j\Vert ^{-2} ) .$ (17)

Covariances at any smaller distance will all equal the variance $\sigma_v^2$.
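
Here is a sketch of the clipped covariance (17), with the proportionality constant set to one and the clipping distance $\Delta x$ exposed as a parameter (the default 0.5 is just the value used in the example of the next section):

    import numpy as np

    def truncated_cov(xi, xj, dx=0.5):
        """Covariance (17): inverse distance squared, clipped below dx
        so that the zero-offset variance stays finite."""
        h2 = float(np.sum((np.asarray(xi, float) - np.asarray(xj, float))**2))
        return dx**-2 if h2 == 0.0 else min(dx**-2, 1.0 / h2)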

Let's see how this assumption works with a simple example.

A simple example

Let's say we have two measured values to interpolate along a single dimension $x$.

One value $v_2$ is twice as far as $v_1$ from the interpolated point $v_0$. Specifically $\Vert x_1- x_0\Vert = 1$ and $\Vert x_2- x_0\Vert = 2$.

If we are using Shepard's interpolation, then we already know enough to estimate the weights as ${{\svector w}^\intercal}= [ 0.8 , 0.2 ]$.

However, it should matter very much whether the point we are interpolating is between the two measured points or on the same side of both.

For a non-degenerate covariance, we will assume $\Delta x= 0.5$ in our covariance (17).

On opposite sides of the interpolated point, the two measured points are weakly correlated with $\Vert x_2 - x_1\Vert = 3$. We get adjusted weights of ${{\svector w}^\intercal}= [ 0.82 , 0.18 ]$, which is very close to Shepard's interpolation weights.

On the same side of the interpolated point, the two measured points are strongly correlated with $\Vert x_2 - x_1\Vert = 1$. The value $v_2$ is actually better correlated with $v_1$ than it is with the interpolated $v_0$. We expect the nearer value $v_1$ to override and diminish the contribution of $v_2$. And sure enough, the adjusted weights are actually ${{\svector w}^\intercal}= [ 1 ,0 ]$, which ignores the second point entirely.
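
These numbers are easy to reproduce. The sketch below (assuming NumPy and the clipped covariance (17) with $\Delta x = 0.5$) computes the Simple Kriging weights for both configurations; the weights are normalized to sum to one, as Shepard's weights do, which appears to be how the quoted values were obtained:

    import numpy as np

    def cov(xi, xj, dx=0.5):
        # clipped inverse-distance-squared covariance (17)
        h2 = float((xi - xj)**2)
        return dx**-2 if h2 == 0.0 else min(dx**-2, 1.0 / h2)

    def normalized_kriging_weights(x_known, x0):
        n = len(x_known)
        C = np.array([[cov(x_known[i], x_known[j]) for j in range(n)]
                      for i in range(n)])
        c = np.array([cov(x0, x_known[i]) for i in range(n)])
        w = np.linalg.solve(C, c)     # Simple Kriging weights (11)
        return w / w.sum()            # normalized for comparison with Shepard's weights

    x0 = 0.0
    print(normalized_kriging_weights([-1.0, 2.0], x0))  # opposite sides: about [0.82, 0.18]
    print(normalized_kriging_weights([1.0, 2.0], x0))   # same side: [1, 0] to rounding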

Variograms

Most geostatistics literature prefers “variograms” to covariances. This is one of the biggest and most unnecessary obstacles to newcomers. The two quantities have a simple equivalence:

$\displaystyle \gamma_v ({\svector x}_i, {\svector x}_j)$ $\displaystyle \equiv$ $\displaystyle \frac{1}{2} E \{ [v ({\svector x}_i) - v({\svector x}_j) ]^2 \} = \sigma_v^2- C_{ij}.$ (18)

Variograms have the minor convenience that values no longer need a zero mean.

The geostatistical assumption (13) can be rewritten

$\displaystyle \gamma_v ({\svector x}_i- {\svector x}_j) = \sigma_v^2- C( {\svector x}_i - {\svector x}_j ).$     (19)

Usually variograms are defined in terms of an offset ${\svector h}$:
$\displaystyle {\svector h}_{ij}$ $\displaystyle =$ $\displaystyle {\svector x}_i- {\svector x}_j;$  
$\displaystyle \gamma_v ({\svector h}_{ij})$ $\displaystyle =$ $\displaystyle \sigma_v^2- C( {\svector h}_{ij}).$ (20)

Considering only offsets and ignoring sampling, we can write the equivalence this way:
$\displaystyle \gamma ({\svector h}) = C ({\svector 0} ) - C ({\svector h}).$     (21)

If we assume values are uncorrelated at infinite distance, we find
$\displaystyle C( {\svector \infty})$ $\displaystyle =$ 0  
$\displaystyle \Rightarrow ~~ \gamma( {\svector \infty})$ $\displaystyle =$ $\displaystyle C({\svector 0})$  
$\displaystyle \Rightarrow ~~ C ({\svector h})$ $\displaystyle =$ $\displaystyle \gamma ({\svector \infty}) - \gamma({\svector h}).$ (22)

The most popular of standard analytic variograms is an exponential:

$\displaystyle \gamma_{\exp} ({\svector h})$ $\displaystyle \propto$ $\displaystyle 1 - \exp( - 3 \Vert {\svector h}\Vert / a), ~~$or (23)
$\displaystyle C_{\exp} ({\svector h})$ $\displaystyle \propto$ $\displaystyle \exp( - 3 \Vert {\svector h}\Vert / a).$ (24)

The constant $a$ is the distance over which about 95% of the correlation decays. Unlike our distance weighting, this function is not scale invariant over any range of offsets.
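
A sketch of the exponential pair (23)-(24), with the proportionality constant exposed as a parameter (here named sill, a common geostatistical term, defaulting to one):

    import numpy as np

    def exp_variogram(h, a, sill=1.0):
        """Exponential variogram (23): about 95% of the sill is reached at distance a."""
        return sill * (1.0 - np.exp(-3.0 * np.linalg.norm(h) / a))

    def exp_covariance(h, a, sill=1.0):
        """Matching covariance (24), via C(h) = gamma(inf) - gamma(h) as in (22)."""
        return sill * np.exp(-3.0 * np.linalg.norm(h) / a)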

A user must prepare scatterplots and histograms to scan possible values for the constant $a$, then choose a best-fitting value before Kriging can begin. For this sort of analysis, variograms have proven more popular than covariances.

Kriging as a posterior estimate

It is also possible to recognize Kriging as a special case of a least-squares inverse problem, without postulating the specific weighted solution (2). I thank Dave Hale [1] for explaining this equivalence to me. Kriging is a purely underdetermined inverse problem, which makes it a bit unusual and less familiar.

We assume that we have a model vector ${\svector m}$ and a data vector ${\svector d}$. The data are assumed to be a linear transform (matrix multiplication) ${\stensor F}$ of the model, plus a vector ${\svector n}$ of noise:

$\displaystyle {\svector d}= {\stensor F}{\svector m}+ {\svector n}.$     (25)

We assume that our model and noise are selected from stationary correlated Gaussian random processes. The expected value of a model given the data (a posterior estimate) should minimize the following objective function:
$\displaystyle S ({\svector m}) = ({\svector d}- {\stensor F}{\svector m})^{\intercal} {\stensor C}_n^{-1}({\svector d}- {\stensor F}{\svector m}) + {\svector m}^{\intercal} {\stensor C}_m^{-1}{\svector m}.$     (26)

The covariance ${\stensor C}_m$ is the Gaussian covariance of the model ${\svector m}$, and covariance ${\stensor C}_n$ is the Gaussian covariance of the noise. We also assume that the noise is uncorrelated to the model. The definition (25) determines the data covariance ${\stensor C}_d$:
$\displaystyle {\stensor C}_d= {\stensor F}{\stensor C}_m{\stensor F}^\intercal + {\stensor C}_n.$     (27)

Since this objective function is a quadratic in ${\svector m}$, we can find an optimum $\hat{{\svector m}}$ by setting the derivative to zero:

$\displaystyle \frac{\partial S}{\partial {\svector m}^{\intercal}} (\hat{{\svector m}})$ $\displaystyle =$ $\displaystyle - {\stensor F}^\intercal {\stensor C}_n^{-1}( {\svector d}- {\stensor F}\hat{{\svector m}}) + {\stensor C}_m^{-1}\hat{{\svector m}}= 0 ;$ (28)
$\displaystyle \Rightarrow ~~ \hat{{\svector m}}$ $\displaystyle =$ $\displaystyle ( {\stensor F}^\intercal {\stensor C}_n^{-1}{\stensor F}+ {\stensor C}_m^{-1}) ^ {-1} {\stensor F}^\intercal {\stensor C}_n^{-1}{\svector d};$ (29)
$\displaystyle \Rightarrow ~~ \hat{{\svector m}}$ $\displaystyle =$ $\displaystyle {\stensor C}_m{\stensor F}^\intercal ( {\stensor F}{\stensor C}_m{\stensor F}^\intercal + {\stensor C}_n)^{-1} {\svector d}~=~ {\stensor C}_m{\stensor F}^\intercal {\stensor C}_d^{-1}{\svector d}.$ (30)

The first solution (29) follows directly from setting the derivative (28) to zero. The second solution (30) was shown by Tarantola [4] (Appendix 6.30) to be entirely equivalent. If (29) is equivalent to (30), then
$\displaystyle ( {\stensor F}^\intercal {\stensor C}_n^{-1}{\stensor F}+ {\stensor C}_m^{-1}) ^ {-1} {\stensor F}^\intercal {\stensor C}_n^{-1}$ $\displaystyle =$ $\displaystyle {\stensor C}_m{\stensor F}^\intercal ( {\stensor F}{\stensor C}_m{\stensor F}^\intercal + {\stensor C}_n)^{-1} ;$ (31)
$\displaystyle \Rightarrow ~~ {\stensor F}^\intercal {\stensor C}_n^{-1}( {\stensor F}{\stensor C}_m{\stensor F}^\intercal + {\stensor C}_n)$ $\displaystyle =$ $\displaystyle ( {\stensor F}^\intercal {\stensor C}_n^{-1}{\stensor F}+ {\stensor C}_m^{-1}) {\stensor C}_m{\stensor F}^\intercal$ (32)
  $\displaystyle =$ $\displaystyle {\stensor F}^\intercal {\stensor C}_n^{-1}{\stensor F}{\stensor C}_m{\stensor F}^\intercal + {\stensor F}^\intercal .$ (33)

Both sides of (32) are trivially equal to (33).
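
The identity is also easy to check numerically. Here is a small sketch (assuming NumPy and randomly generated symmetric positive-definite covariances) that verifies both sides of (31) agree for a forward operator with more model values than data values, as in Kriging:

    import numpy as np

    rng = np.random.default_rng(0)
    nm, nd = 8, 3                           # more model values than data values

    F = rng.standard_normal((nd, nm))       # arbitrary forward operator
    A = rng.standard_normal((nm, nm))
    C_m = A @ A.T + np.eye(nm)              # symmetric positive-definite model covariance
    B = rng.standard_normal((nd, nd))
    C_n = B @ B.T + np.eye(nd)              # symmetric positive-definite noise covariance

    # left side of (31): (F^T C_n^-1 F + C_m^-1)^-1 F^T C_n^-1
    lhs = np.linalg.solve(F.T @ np.linalg.solve(C_n, F) + np.linalg.inv(C_m),
                          F.T @ np.linalg.inv(C_n))
    # right side of (31): C_m F^T (F C_m F^T + C_n)^-1
    rhs = C_m @ F.T @ np.linalg.inv(F @ C_m @ F.T + C_n)

    assert np.allclose(lhs, rhs)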

This alternate solution (30) is less often used because it requires a matrix inversion in the data space rather than in the model space. Most typically, an overdetermined problem has a greater number of data values than model values. In the case of Kriging, however, the situation is reversed.

Others, such as Hansen et al. [2], have observed that this alternate solution (30) is equivalent to Kriging. We will elaborate below.

In the case of geostatistical Kriging, we assume that our model ${\svector m}$ is a large dense grid of regularly distributed values. The forward operator ${\stensor F}$ merely subsamples that grid for the values ${\svector d}$. The dimensionality of ${\svector d}$ is possibly orders of magnitude smaller than that of the model ${\svector m}$. The transform ${\stensor F}$ is a rectangular matrix that has many more columns than rows. Each row contains mostly zeros and a single column with a value of 1. Most columns do not have a value of 1 on any row.

Applying the transposed operator ${\stensor F}^\intercal $ to subsampled data ${\svector d}$ merely repopulates the sampled values of ${\svector m}$ and leaves all other values set to 0. This is also the right inverse of the original transform, so that ${\stensor F}{\stensor F}^\intercal = \stensor I$.

If we have sampled the model directly, there is no need to assume any noise at all. If we allow ${\stensor C}_n= \epsilon \stensor I; \epsilon \rightarrow 0$, then the usual estimate (29) becomes degenerate, and $\hat{{\svector m}}\rightarrow {\stensor F}^\intercal {\svector d}$. Rather than interpolating any unsampled values, this estimate sets them all to zero.

The alternate form (30) lets us set the noise variance directly to 0. We can then find the following covariances of our sampled data ${\stensor C}_d$:

Assume$\displaystyle ~~ {\stensor C}_n$ $\displaystyle \equiv$ $\displaystyle 0 ;$ (34)
$\displaystyle \Rightarrow ~~ {\stensor C}_d$ $\displaystyle =$ $\displaystyle {\stensor F}{\stensor C}_m{\stensor F}^\intercal$ (35)

and then rewrite our alternate estimate (30) as
$\displaystyle \hat{{\svector m}}$ $\displaystyle =$ $\displaystyle ({\stensor C}_m{\stensor F}^\intercal ) ( {\stensor F}{\stensor C}_m{\stensor F}^\intercal )^{-1} {\svector d}$ (36)
  $\displaystyle =$ $\displaystyle ({\stensor F}{\stensor C}_m)^{\intercal} {\stensor C}_d^{-1}{\svector d}$ (37)
  $\displaystyle =$ $\displaystyle \stensor W^{\intercal}{\svector d};$ (38)
where$\displaystyle ~~ \stensor W$ $\displaystyle \equiv$ $\displaystyle {\stensor C}_d^{-1}({\stensor F}{\stensor C}_m) .$ (39)

These weights (39) for all samples are entirely equivalent to those for a single sample (11). Each measured sample of ${\svector d}$ is weighted by its covariance to an interpolated point, then these weights are adjusted by the inverse covariance of the data samples. We now have the advantage of directly relating data covariances to those of the model. This expression is also much easier to generalize to data that do more than merely subsample the model.
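
To make the grid formulation concrete, here is a minimal sketch (assuming NumPy, a one-dimensional model grid, an exponential model covariance with an arbitrary range, and made-up sample values) that builds the subsampling operator ${\stensor F}$ explicitly and applies (36)-(39). A practical implementation would never form these dense matrices, but the small case shows the mechanics:

    import numpy as np

    # a dense 1-D model grid and a few sampled locations (all numbers here are made up)
    x_grid = np.linspace(0.0, 10.0, 101)           # model locations
    sampled = np.array([5, 30, 62, 90])            # grid indices that were measured
    d = np.array([1.2, -0.4, 0.7, 0.1])            # measured (zero-mean) values

    # model covariance C_m from an exponential covariance (24) with range a = 3
    a = 3.0
    H = np.abs(x_grid[:, None] - x_grid[None, :])  # all pairwise offsets
    C_m = np.exp(-3.0 * H / a)

    # F merely subsamples the grid: one 1 per row, so F F^T = I
    F = np.zeros((len(sampled), len(x_grid)))
    F[np.arange(len(sampled)), sampled] = 1.0

    C_d = F @ C_m @ F.T                            # data covariance (35), no noise
    W = np.linalg.solve(C_d, F @ C_m)              # weights (39): C_d^{-1} (F C_m)
    m_hat = W.T @ d                                # interpolated model (38)

    assert np.allclose(F @ F.T, np.eye(len(sampled)))  # F^T is a right inverse of F
    assert np.allclose(m_hat[sampled], d)              # the estimate honors the data exactly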

References

[1] Dave Hale. Personal communication, 2013.

[2] Thomas Mejer Hansen, Andre Journel, Albert Tarantola, and Klaus Mosegaard. Linear inverse Gaussian theory and geostatistics. Geophysics, pages 101–111, 2006.

[3] Edward H. Isaaks and Mohan R. Srivastava. An Introduction to Applied Geostatistics. Oxford University Press, USA, January 1990.

[4] Albert Tarantola. Inverse Problem Theory and Methods for Model Parameter Estimation. Society for Industrial and Applied Mathematics, Philadelphia, PA, USA, 2004.

[5] H. Wackernagel. Multivariate Geostatistics: An Introduction with Applications. Springer-Verlag, 2nd edition, 2003.