\documentclass[12pt]{article}
\usepackage{wsh}
\usepackage{amsmath,amsfonts,amssymb}
\begin{document}
\bibliographystyle{plain}
\markright{Kriging --- W.S. Harlan}
\title{A quick derivation of
geostatistical Kriging}
\author{William S. Harlan}
\date{July 2013}
\def \vi {v_i}
\def \vj {v_j}
\def \vo {v_0}
\def \wi {w_i}
\def \wj {w_j}
\def \xv {{\svector x}}
\def \xo {\xv_0}
\def \xi {\xv_i}
\def \xj {\xv_j}
\def \xij {\xi - \xj}
\def \hv {{\svector h}}
\def \dx {\Delta x}
\def \hvij {\hv_{ij}}
\def \vxi {v(\xi)}
\def \vxo {v(\xo)}
\def \vv {{\svector v}}
\def \vx {v (\xv) }
\def \pxi {\{\xi\}}
\def \pvi {\{ \vi\}}
\def \pwi {\{ \wi\}}
\def \sumi {\sum_{i=1}^n}
\def \sumj {\sum_{j=1}^n}
\def \ion {i{=}1,n}
\def \iono {i{=}0,n}
\def \jon {j{=}1,n}
\def \jono {j{=}0,n}
\def \xifor {\{\xi,\mbox{for}~\ion\}}
\def \wv {{\svector w}}
\def \wvt {{\wv^\intercal}}
\def \Cij {C_{ij}}
\def \Cji {C_{ji}}
\def \ci {c_i}
\def \cj {c_j}
\def \Ct {\stensor C}
\def \It {\stensor I}
\def \Cti {\Ct^{-1}}
\def \cv {\svector c}
\def \cvt {\cv^{\intercal}}
\def \st {\sigma_E^2}
\def \sv {\sigma_v^2}
\def \zv {{\svector 0}}
\def \iv {{\svector \infty}}
\def \mv {{\svector m}}
\def \nv {{\svector n}}
\def \mvh {\hat{\mv}}
\def \mvt {\mv^{\intercal}}
\def \dv {{\svector d}}
\def \dvt {\dv^{\intercal}}
\def \Ft {{\stensor F}}
\def \Ftt {\Ft^\intercal}
\def \Cm {{\stensor C}_m}
\def \Cmi {\Cm^{-1}}
\def \Cn {{\stensor C}_n}
\def \Cni {\Cn^{-1}}
\def \Cd {{\stensor C}_d}
\def \Cdi {\Cd^{-1}}
\def \Wt {\stensor W}
\def \Wtt {\Wt^{\intercal}}
\maketitle
\section {Introduction}
A casual encounter with geostatistics can be baffling because of
some non-standard terminology. As it turns out, ``Kriging,''
the core numerical method of geostatistics, can be derived on a napkin,
if you are already familiar with some standard least-squares methods.
If you want to appreciate geostatistical applications, then
try the popular book ``An Introduction to Applied
Geostatistics,'' by Isaaks and Srivastava \cite{isaaks:intro}. If you want
a clean mathematical notation, then try ``Multivariate
Geostatistics: An Introduction with Applications,'' by Wackernagel \cite{wackernagel}.
Kriging is just the least-squares solution of a purely
under-determined linear inverse problem. Once you see this equivalence, you
can see some simple ways to improve distance-weighted interpolation methods.
\section {Posing the problem}
Geostatistics poses a useful problem that you are unlikely
to have encountered elsewhere in least-squares. The solution
looks very familiar, but it has unique elements that
make it especially useful.
Assume that you have a continuous function $\vx$
of a spatial vector $\xv$. Usually $\xv$ has two dimensions,
but nothing about the derivation limits us to two dimensions.
The actual function $\vx$ is unknown and needs to be reconstructed
from a collection of samples $\pvi$ at $n$ arbitrarily chosen locations.
\begin{equation}
\vi \equiv \vxi ~\mbox{for}~ \ion.
\end{equation}
We will interpolate a value $\vo$ at an unsampled location $\xo$
as a linearly weighted sum of the $n$ known values $\pvi$, at sampled locations $\pxi$:
\begin{eqnarray}
\label{eq:weight}
\vxo &=& \sumi \wi \cdot \vxi , \\
\vo &=& \sumi \wi \vi , ~\mbox{or} \nonumber \\
\vo &=& \wvt \cdot \vv , \nonumber \\
\mbox{where} ~
\wvt &\equiv& [w_1, w_2,..., w_n], ~\mbox{and}~
\vv \equiv [v_1, v_2,..., v_n]^\intercal . \nonumber
\end{eqnarray}
In the last version we switch to vector notation.
Our problem: how do we optimally estimate the weights $\wv$, given what we know about the distribution of the values $\vx$?
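Numerically, the interpolation (\ref{eq:weight}) is just a dot product. A minimal Python sketch, with made-up values and weights:

```python
import numpy as np

# The weighted sum v_0 = w^T v of equation (eq:weight).
# All numbers here are made up for illustration; estimating
# the weights optimally is the subject of the next sections.
v = np.array([1.2, 0.7, -0.4])   # known values v_i
w = np.array([0.5, 0.3, 0.2])    # weights w_i
v0 = w @ v                       # v_0 = w^T v
```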
\section{Assumptions}
We will make a few more assumptions appropriate to a linear least-squares solution.
We will treat all our values $\pvi$ as random variables.
Without loss of generality (changing variables if necessary),
we will assume these variables have zero expected mean:
\begin{equation}
E (\vi ) \equiv 0, ~\mbox{for}~ \iono.
\end{equation}
Assume a constant stationary variance $\sv$ for all samples:
\begin{eqnarray}
\sv \equiv E (\vi \vi ), ~\mbox{for}~ \iono .
\end{eqnarray}
Notice that we have included the unmeasured value $\vo$ in our assumptions.
We will find it convenient to construct a covariance matrix $\Ct$ for our measured values:
\begin{eqnarray}
\Cij &\equiv&
E (v_i v_j ), ~ \mbox {for}~ \ion ~\mbox{and}~ \jon , ~\mbox{or} \\
\Ct &\equiv& E (\vv \vv^\intercal ). \nonumber
\end{eqnarray}
Note that the covariance is symmetric: $\Cij = \Cji$.
Define a covariance vector $\cv$ for our unknown value relative to the measured values:
\begin{eqnarray}
\ci &\equiv& E (v_0 v_i ) , ~ \mbox {for}~ \ion , ~\mbox{or} \\
\cv &\equiv& E (v_0 \vv ). \nonumber
\end{eqnarray}
Much of geostatistics concentrates on the estimation of these covariances. For now we assume
they are known.
Notice that none of our assumptions require that our interpolated value share the same
physical units as the values we are interpolating from.
Co-kriging differs from Simple Kriging largely by combining values with mixed units.
(See Wackernagel for a demonstration of their equivalence.)
\section{Least-squares solution}
Next we estimate our weights as a least-squares optimization, and take advantage of our assumptions.
Define an estimation error $\st$ as the expected
squared error between the true value and our interpolated value.
\begin{eqnarray}
\st &\equiv& E [ ( \vo - \sumi \wi \vi )^2 ] \\
&=& E(\vo \vo) - 2 \sumi E ( \vi \vo ) \wi + \sumi \sumj E (\vi \vj ) \wi \wj \nonumber \\
\label{eq:err}
&=& \sv - 2 \sumi \ci \wi + \sumi \sumj \Cij \wi \wj .
\end{eqnarray}
Naturally we want this error to be small.
Since this expression is quadratic in the weights $\wv$, we can minimize $\st$ by
differentiating with respect to the weights and setting the derivative to zero.
\begin{eqnarray}
&&{\frac{\partial \st }{\partial \wi }} = - 2 \ci + 2 \sumj \Cij \wj = 0 \\
\label{eq:solved}
&\Rightarrow& \sumj \Cij \wj = \ci \\
&\Rightarrow& \Ct \cdot \wv = \cv \nonumber \\
\label{eq:solvedw}
&\Rightarrow& \wv = \Cti \cdot \cv .
\end{eqnarray}
This result is entirely equivalent to Simple Kriging.
The optimum weights are intuitive when you see them in this form.
Naive weights $\wv$ would simply use the covariance vector $\cv$ between
values at the known and the unknown locations. Because the known locations
are also correlated with each other, we must remove that effect by multiplying
by their inverse covariance $\Cti$.
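The weights (\ref{eq:solvedw}) take only a few lines to compute. A Python sketch, where the exponential covariance model, its range $a$, and all locations are illustrative assumptions rather than anything prescribed by the derivation:

```python
import numpy as np

# Simple Kriging weights w = C^{-1} c in two dimensions.
# The covariance model and the locations are made-up assumptions.
def cov(h, a=2.0):
    # Exponential covariance of a vector offset h (range a assumed).
    return np.exp(-3.0 * np.linalg.norm(h, axis=-1) / a)

x = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 2.0]])  # sampled locations x_i
x0 = np.array([0.5, 0.5])                           # unsampled location x_0

C = cov(x[:, None, :] - x[None, :, :])   # C_ij = C(x_i - x_j)
c = cov(x - x0)                          # c_i  = C(x_i - x_0)
w = np.linalg.solve(C, c)                # optimal weights w = C^{-1} c
```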
Once we have our estimated weights $\wv$, we can explicitly quantify the error in our estimate.
This will allow us to put Gaussian error bars on each interpolated value.
Substitute our solution (\ref{eq:solved}) into our estimation error (\ref{eq:err}) for
\begin{eqnarray}
\st &=& \sv - 2 \sumi \ci \wi + \sumi \ci \wi \nonumber \\
&=& \sv - \sumi \ci \wi \nonumber \\
&=& \sv - \cv^\intercal \cdot \wv .
\end{eqnarray}
Notice that this final expression no longer refers explicitly to correlations between
measured samples; equivalently, the estimation error is uncorrelated with each measured value.
If it were not, then the estimate could be improved.
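The error bar $\st = \sv - \cvt \cdot \wv$ is equally direct to evaluate. A one-dimensional Python sketch, again with an assumed exponential covariance and made-up locations:

```python
import numpy as np

# Expected squared error sigma_E^2 = sigma_v^2 - c^T w of the estimate.
# Exponential covariance with range a = 2 is an assumption here.
def cov(h, a=2.0):
    return np.exp(-3.0 * np.abs(h) / a)

x = np.array([0.0, 1.0, 3.0])       # sampled locations on a line
x0 = 2.0                            # unsampled location
C = cov(x[:, None] - x[None, :])    # C_ij
c = cov(x - x0)                     # c_i
w = np.linalg.solve(C, c)           # Kriging weights
var_e = cov(0.0) - c @ w            # sigma_v^2 = C(0) = 1 for this model
```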
\section{Choosing covariances}
Geostatistics usually concentrates on the estimation of covariances (or variograms, discussed later).
Here is the most geostatistical assumption of all.
The covariance between values depends only on the vector distance between their sampled positions:
\begin{eqnarray}
\label{eq:off}
\Cij \equiv
E [v_i v_j ] \equiv
E [v (\xi) v(\xj) ] \equiv C ( \xv_i - \xv_j ) \\
~\mbox{for}~ \iono ~\mbox{and}~ \jono. \nonumber
\end{eqnarray}
In the absence of other information, it would be appealing to choose a scale-invariant covariance.
Scale-invariance makes covariances independent of the units chosen for spatial distance.
The covariance would look similar at any scale, like a fractal.
Unfortunately, this assumption leads to a degenerate solution. Scale-invariance requires covariances
to go to infinity at a distance of zero, where the covariance should instead equal the finite variance $\sv$.
As the diagonal grows toward infinity, the covariance matrix $\Ct$ becomes proportional to an identity matrix,
and the inverse covariance has no effect on the weights.
We see this degeneracy with a popular interpolation known as Shepard's ``Inverse Distance Weighting'':
\begin{eqnarray}
\vxo &=& \sumi \wi \cdot \vxi ~\mbox{where} \nonumber \\
\wi &=& \|\xi - \xo\|^{-2} / \sumj ( \|\xj - \xo\|^{-2}) .
\end{eqnarray}
This weight decreases with distance squared, like gravity. The denominator is just a normalization over all samples.
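Shepard's weights can be sketched in a few lines of Python; the sample locations below are made up:

```python
import numpy as np

# Shepard's inverse-distance-squared weights (a sketch).
def shepard_weights(x, x0):
    d2 = np.sum((x - x0) ** 2, axis=-1)  # squared distances to x0
    inv = 1.0 / d2                       # assumes x0 is not a sample point
    return inv / inv.sum()               # normalize so the weights sum to 1

x = np.array([[1.0], [2.0]])   # distances 1 and 2 from x0
x0 = np.array([0.0])
w = shepard_weights(x, x0)     # -> [0.8, 0.2]
```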
If we are assuming a specific correlation between our measured and interpolated values,
then we ought to be able to assume the same correlation between measured values.
Unfortunately, this covariance approaches infinity at a zero distance:
\begin{eqnarray}
C(\xij) &\propto& \|\xij \|^{-2}\nonumber \\
\Rightarrow ~~ \sv &\approx& \infty \\
\Rightarrow ~~ \Ct &\propto& \It \nonumber\\
\Rightarrow ~~ \wv &\approx& \cv .
\end{eqnarray}
Sure enough, this degenerate solution tells us to ignore covariances between our known values.
As it turns out, the degeneracy is easily remedied if we sacrifice scale invariance
below the density of our samples. Choose a small distance $\dx$ below which the
correlation does not increase any further.
\begin{eqnarray}
\label{eq:clipped}
% \Ct(\xij) &\propto& [ \max(\dx, \|\xij \| ) ] ^{-2}
C(\xij) &\propto& \min(\dx ^{-2}, \|\xij \| ^{-2} ) .
\end{eqnarray}
Covariances at any smaller distance will all equal the marginal variance $\sv$.
Let's see how this assumption works with a simple example.
\section{A simple example}
Let's say we have two measured values to interpolate along a single dimension $x$.
One value $v_2$ is twice as far as $v_1$ from the interpolated point $v_0$.
Specifically $\|x_1- x_0\| = 1$ and $\|x_2- x_0\| = 2$.
If we are using Shepard's interpolation, then we already know enough to estimate the weights as
$\wvt = [ 0.8 , 0.2 ]$.
However, it should matter very much whether the point we are interpolating is between the two measured points
or on the same side of both.
For a non-degenerate covariance, we will assume $\dx = 0.5$ in our covariance (\ref{eq:clipped});
for comparison with Shepard's weights, the resulting Kriging weights are normalized to sum to one.
On opposite sides of the interpolated point,
the two measured points are weakly correlated with $\|x_2 - x_1\| = 3$.
We get adjusted weights of $\wvt = [ 0.82 , 0.18 ]$, which is very close to Shepard's weights.
On the same side of the interpolated point, the two measured points are strongly correlated with
$\|x_2 - x_1\| = 1$.
The value $v_2$ is actually better correlated with $v_1$
than it is with the interpolated $v_0$. We expect the nearer value $v_1$ to override and diminish
the contribution of $v_2$. And sure enough, the adjusted weights are actually
$\wvt = [ 1 ,0 ]$, which ignores the second point entirely.
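The whole example fits in a short Python sketch. The clipped covariance is written as $[\max(\dx, \|\hv\|)]^{-2}$, which equals $\min(\dx^{-2}, \|\hv\|^{-2})$ and is safe to evaluate at zero offset; the raw Kriging weights are normalized to sum to one for direct comparison with Shepard's weights:

```python
import numpy as np

# Two-point example with the clipped covariance (eq:clipped), dx = 0.5.
def cov(h, dx=0.5):
    # [max(dx, |h|)]^{-2}, the same as min(dx^{-2}, |h|^{-2}).
    return np.maximum(np.abs(h), dx) ** -2.0

def kriging_weights(x, x0, dx=0.5):
    C = cov(x[:, None] - x[None, :], dx)  # diagonal = dx^{-2} = sigma_v^2
    c = cov(x - x0, dx)
    w = np.linalg.solve(C, c)
    return w / w.sum()     # normalized to compare with Shepard's weights

w_opposite = kriging_weights(np.array([1.0, -2.0]), 0.0)  # |x2 - x1| = 3
w_same = kriging_weights(np.array([1.0, 2.0]), 0.0)       # |x2 - x1| = 1
print(w_opposite)   # roughly [0.82, 0.18]
print(w_same)       # roughly [1.0, 0.0]
```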
\section{Variograms}
Most geostatistics literature prefers ``variograms'' to covariances.
This is one of the biggest and most unnecessary obstacles to newcomers.
The two quantities have a simple equivalence:
\begin{eqnarray}
\gamma_v (\xi, \xj) &\equiv& \frac{1}{2} E \{ [v (\xi) - v(\xj) ]^2 \} = \sv - \Cij .
\end{eqnarray}
Variograms have the minor convenience that values no longer need a zero mean.
The geostatistical assumption (\ref{eq:off}) can be rewritten
\begin{eqnarray}
\gamma_v (\xi - \xj) = \sv - C( \xv_i - \xv_j ) .
\end{eqnarray}
Usually variograms are defined in terms of an offset $\hv$:
\begin{eqnarray}
\hvij &=& \xi - \xj ; \nonumber \\
\gamma_v (\hvij) &=& \sv - C( \hvij).
\end{eqnarray}
Considering only offsets and ignoring sampling, we can write the equivalence this way:
\begin{eqnarray}
\gamma (\hv) = C ({\svector 0} ) - C (\hv ).
\end{eqnarray}
If we assume values are uncorrelated at infinite distance, we find
\begin{eqnarray}
C( \iv ) &=& 0 \nonumber \\
\Rightarrow ~~ \gamma( \iv ) &=& C(\zv ) \nonumber \\
\Rightarrow ~~ C (\hv ) &=& \gamma (\iv) - \gamma(\hv).
\end{eqnarray}
The most popular of standard analytic variograms is an exponential:
\begin{eqnarray}
\gamma_{\exp} (\hv) &\propto& 1 - \exp( - 3 \| \hv \| / a), ~~\mbox{or} \\
C_{\exp} (\hv) &\propto& \exp( - 3 \| \hv \| / a).
\end{eqnarray}
The constant $a$ is the practical range: the distance at which the variogram reaches about 95\% of its sill, where only about 5\% of the correlation remains.
Unlike our distance weighting, this function is not scale invariant over any range of offsets.
A user must prepare scatterplots and histograms to scan possible values for the
constant $a$, then choose a best-fitting value before Kriging can begin.
For this sort of analysis, variograms have proven more popular than covariances.
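The exponential pair above is simple to evaluate; a brief Python sketch, with an arbitrary range $a$:

```python
import numpy as np

# Exponential variogram/covariance pair; the range a = 10 is arbitrary.
def gamma_exp(h, a):
    return 1.0 - np.exp(-3.0 * np.abs(h) / a)

def cov_exp(h, a):
    return np.exp(-3.0 * np.abs(h) / a)

a = 10.0
# At offset h = a the variogram has reached 1 - e^{-3}, about 95% of its sill:
frac = gamma_exp(a, a)   # roughly 0.9502
```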
\section{Kriging as a posterior estimate}
It is also possible to recognize Kriging as a special case of a least-squares inverse problem,
without postulating the specific weighted solution (\ref{eq:weight}).
I thank Dave Hale \cite{hale201307} for explaining this equivalence to me.
Kriging is a purely underdetermined inverse problem, which makes it a bit unusual and less familiar.
We assume that we have a model vector $\mv$ and a data vector $\dv$.
The data are assumed to be a linear transform (matrix multiplication) $\Ft$ of the model,
plus a vector $\nv$ of noise:
\begin{eqnarray}
\label{eq:ddef}
\dv = \Ft \mv + \nv.
\end{eqnarray}
We assume that our model and noise are selected from stationary correlated Gaussian random processes.
The expected value of a model given the data (a posterior estimate) should minimize the following
objective function:
\begin{eqnarray}
S (\mv) = (\dv - \Ft \mv )^{\intercal} \Cni (\dv - \Ft \mv ) + \mv^{\intercal} \Cmi \mv .
\end{eqnarray}
The covariance $\Cm$ is the Gaussian covariance of the model $\mv$, and
covariance $\Cn$ is the Gaussian covariance of the noise.
We also assume that the noise is uncorrelated with the model.
The definition (\ref{eq:ddef}) determines the data covariance $\Cd$:
\begin{eqnarray}
\Cd = \Ft \Cm \Ftt + \Cn .
\end{eqnarray}
Since this objective function is a quadratic in $\mv$, we can find an optimum $\mvh$ by setting the derivative to zero:
\begin{eqnarray}
\label{eq:sderiv}
\frac{\partial S}{\partial \mvt} (\mvh) &=& - \Ftt \Cni ( \dv - \Ft \mvh ) + \Cmi \mvh = 0 ; \\
\label{eq:msolve}
\Rightarrow ~~ \mvh &=& ( \Ftt \Cni \Ft + \Cmi ) ^ {-1} \Ftt \Cni \dv ; \\
\label{eq:dsolve}
\Rightarrow ~~ \mvh &=& \Cm \Ftt ( \Ft \Cm \Ftt + \Cn )^{-1} \dv ~=~ \Cm \Ftt \Cdi \dv .
\end{eqnarray}
The first solution (\ref{eq:msolve}) is directly solved from the derivative (\ref{eq:sderiv}).
The second solution (\ref{eq:dsolve}) was shown by Tarantola \cite{Tarantola:2004} (Appendix 6.30) to be
entirely equivalent. If (\ref{eq:msolve}) is equivalent to (\ref{eq:dsolve}),
then
\begin{eqnarray}
( \Ftt \Cni \Ft + \Cmi ) ^ {-1} \Ftt \Cni &=&\Cm \Ftt ( \Ft \Cm \Ftt + \Cn )^{-1} ; \\
\label{eq:almosteq}
\Rightarrow ~~ \Ftt \Cni ( \Ft \Cm \Ftt + \Cn ) &=& ( \Ftt \Cni \Ft + \Cmi ) \Cm \Ftt \\
\label{eq:trivialeq}
&=& \Ftt \Cni \Ft \Cm \Ftt + \Ftt.
\end{eqnarray}
Both sides of (\ref{eq:almosteq}) are trivially equal to (\ref{eq:trivialeq}).
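The equivalence of (\ref{eq:msolve}) and (\ref{eq:dsolve}) is easy to confirm numerically. A Python sketch with small random positive-definite covariances, whose dimensions are arbitrary:

```python
import numpy as np

# Check that the model-space and data-space posterior estimates agree.
rng = np.random.default_rng(0)
nm, nd = 6, 3                                  # model and data sizes (arbitrary)
F = rng.standard_normal((nd, nm))              # forward operator
A = rng.standard_normal((nm, nm))
Cm = A @ A.T + nm * np.eye(nm)                 # SPD model covariance
B = rng.standard_normal((nd, nd))
Cn = B @ B.T + nd * np.eye(nd)                 # SPD noise covariance
d = rng.standard_normal(nd)                    # data vector

Cmi, Cni = np.linalg.inv(Cm), np.linalg.inv(Cn)
m1 = np.linalg.solve(F.T @ Cni @ F + Cmi, F.T @ Cni @ d)  # model-space (eq:msolve)
m2 = Cm @ F.T @ np.linalg.solve(F @ Cm @ F.T + Cn, d)     # data-space (eq:dsolve)
```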
This alternate solution (\ref{eq:dsolve}) is less often used because it requires a matrix inversion in the data
space rather than in the model space. Most typically, an overdetermined problem has a greater number
of data values than model values. In the case of Kriging, however, the situation is reversed.
Others such as Hansen et al.\ \cite{HansenTarantola:2006} have observed that this alternate solution
(\ref{eq:dsolve}) is equivalent to Kriging. We will elaborate below.
In the case of geostatistical Kriging, we assume that our model $\mv$ is a large dense grid of regularly distributed
values. The forward operator $\Ft$ merely subsamples that grid for the values $\dv$. The dimensionality of $\dv$
is possibly orders of magnitude smaller than that of the model $\mv$. The transform $\Ft$ is a rectangular
matrix that has many more columns than rows. Each row contains mostly zeros and a single column with a value of 1.
Most columns do not have a value of 1 on any row.
Applying the transposed operator $\Ftt$ to subsampled data $\dv$ merely repopulates the sampled
values of $\mv$ and leaves all other values set to 0.
This is also the right inverse of the original transform, so that
$\Ft \Ftt = \It$.
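The subsampling operator $\Ft$ can be written as an explicit selection matrix of zeros and ones. A small Python sketch, with a made-up grid size and sample indices:

```python
import numpy as np

# F as a 0/1 selection matrix: each row picks one model grid point.
n_grid = 8                              # model grid size (made up)
sampled = [1, 4, 6]                     # indices of measured grid points
F = np.zeros((len(sampled), n_grid))
F[np.arange(len(sampled)), sampled] = 1.0

m = np.arange(n_grid, dtype=float)      # a stand-in model vector
d = F @ m                               # F merely subsamples the model
m_back = F.T @ d                        # F^T repopulates sampled values, rest 0
```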
If we have sampled the model directly, there is no need to assume any noise at all.
Unfortunately, the usual estimate (\ref{eq:msolve}) does not allow us to set $\Cn = \zv$
directly, because the inverse $\Cni$ would then be undefined. If we substitute
$\Cn = \epsilon \It$ and let $\epsilon \rightarrow 0$, we are left with an increasingly
ill-conditioned system to invert.
The alternate form (\ref{eq:dsolve}) lets us set the noise covariance directly to $\zv$.
We can then find the covariance $\Cd$ of our sampled data:
\begin{eqnarray}
\mbox{Assume} ~~
\Cn &\equiv& 0 ; \\
\Rightarrow ~~ \Cd &=& \Ft \Cm \Ftt
\end{eqnarray}
and then rewrite our alternate estimate (\ref{eq:dsolve}) as
\begin{eqnarray}
\mvh &=& (\Cm \Ftt ) ( \Ft \Cm \Ftt )^{-1} \dv \\
&=& (\Ft \Cm )^{\intercal} \Cdi \dv \\
&=& \Wtt \dv ; \\
\label{eq:krigw}
\mbox{where} ~~ \Wt &\equiv& \Cdi (\Ft \Cm) .
\end{eqnarray}
These weights (\ref{eq:krigw}) for all samples are entirely equivalent to those for a single sample (\ref{eq:solvedw}).
Each measured sample of $\dv$ is weighted by its covariance to an interpolated point, then these weights
are adjusted by the inverse covariance of the data samples. We now have the advantage of directly relating
data covariances to those of the model. This expression is also much easier to generalize to data that do
more than merely subsample the model.
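Gridded Kriging through (\ref{eq:krigw}) can be sketched end to end in Python. The one-dimensional grid, the exponential model covariance, its range, and the data values below are all illustrative assumptions:

```python
import numpy as np

# Kriging a full grid through m = Cm F^T Cd^{-1} d, with Cn = 0.
n_grid = 50
xg = np.arange(n_grid, dtype=float)           # 1-D model grid (assumed)
a = 15.0                                      # covariance range (assumed)
Cm = np.exp(-3.0 * np.abs(xg[:, None] - xg[None, :]) / a)

sampled = [5, 20, 40]                         # measured grid indices (made up)
F = np.zeros((len(sampled), n_grid))
F[np.arange(len(sampled)), sampled] = 1.0
d = np.array([1.0, -0.5, 0.25])               # measured values (made up)

Cd = F @ Cm @ F.T                             # data covariance with Cn = 0
W = np.linalg.solve(Cd, F @ Cm)               # W = Cd^{-1} (F Cm), eq (eq:krigw)
m_hat = W.T @ d                               # interpolated model on the grid
```

With zero noise, the estimate honors the data exactly at the sampled points, since $\Ft \mvh = \Cd \Cdi \dv = \dv$.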
\bibliography{wsh}
\end{document}