Next: The Parameter Estimates
Up: The Details of the
Previous: Posterior Probability Distribution
  Contents
Parameter Estimation and Model Comparison
The posterior probability distributions of section
5.3.4 constitute a complete description of the
author's beliefs about the models and their parameters, once the data
(chapter 5) are taken into account. However, it is a very
unwieldy description, consisting as it does of two scalar fields
defined on a
-dimensional space. A summary is required for
presentation. The author believes that the most useful summary will
be to give, for each parameter
, estimates of its posterior
marginal expectations
 |
(5.50) |
and
 |
(5.51) |
and of its posterior marginal standard deviations,
 |
(5.52) |
and
 |
(5.53) |
where
 |
(5.54) |
The summary should also include the estimates of the model posterior
probabilities
(equations 5.47,
5.48.)
If a means can be found of drawing samples
from the distribution
, where
runs
from
to
,
runs from
to
, and the total number of
samples is, therefore,
,
can [48]
be estimated by
 |
(5.55) |
where
 |
(5.56) |
Similarly,
can [48] be
estimated by
 |
(5.57) |
where
 |
(5.58) |
and
 |
(5.59) |
Similarly, if a means can be found of drawing samples
from the distribution
, where
runs from
to
,
runs from
to
, and the total number of samples is, therefore,
,
can [48] be estimated by
 |
(5.60) |
where
 |
(5.61) |
All of these estimators are [90] non-Bayesian,
i.e. they coincide exactly with Bayesian posterior expectations of the
quantities being estimated, given the samples, only for specific, but
not specified, prior probability distributions over moments of the
distribution
. The estimators will, therefore,
differ by an amount
from the posterior expectations that
would be obtained with explicit statements of plausible prior
probability distributions over these moments, if this were technically
feasible. Fortunately, however, the size of
tends
[10] to decrease with decreasing marginal
likelihood, i.e. with increasing number of samples. Equally
fortunately, when the inference process aims to estimate moments of a
probability distribution, from samples from that distribution, a
critical
for each moment is made available, by the degree of
variation implied by the higher moments, such that significantly
smaller values of
can be safely ignored. All this means
that, for large numbers of samples, the non-Bayesian estimators are
acceptable.
The means of obtaining samples is the leapfrog method. This is
explained in detail in Information Theory, Inference and
Learning Algorithms [90], where it is attributed
to Skilling. Each iteration of the method consists of the application
of a leapfrog proposal density, followed by the application of a
Metropolis decision algorithm. For sampling from the posterior
probability distribution, after the
th iteration, the method's
state is a set of
parameter vectors
, which are the samples to be used in the estimators. The
leapfrog proposal density sets
 |
(5.62) |
where, for each
, a
is chosen at random from the other
integers from
to
, with each of these being equi-probable. The
Metropolis decision algorithm then works as follows:
- if
, then
is set to
. Otherwise,
-
is set at random, either
to
, with probability
, or to
, with probability
.
An analogous process is used to draw samples
from the prior probability distribution.
As part of the preparation of the present thesis, the author has
written Perl code that implements the leapfrog method; in the
transparent copy of this thesis, the code for the null model
is
in the file null, and the code for the main model
is in
the file main. As far as the author has been able to
discover, the present thesis is the first use of the leapfrog method
to infer parameters and model likelihoods from real experimental data.
Some further details are needed to define the particular version of
the leapfrog method used in this thesis:
- Initial vector sets
and
are chosen, centred around the
analytically-derived prior expectation values
of the
parameters. The displacements of the
initial vectors from the
prior expectation are picked randomly from a perfect top-hat
distribution out to a maximum size of each displacement component,
which is the product of a constant factor
or
, and the prior standard deviation
of the relevant parameter. That is to say,
is drawn from a distribution
and
is drawn from a distribution
It has [90] been shown that the leapfrog method
rapidly grows an initial vector set that is narrower than the
distribution, from which one wishes to sample. However, the author
does not know of any analogous proof that the algorithm can shrink an
initial vector set that is broader than the distribution, from which
one wishes to sample. Therefore, the initial vector set, for sampling
from the prior probability distribution is made a factor of
, in each dimension, narrower than the prior probability
distribution. An attempt is made to achieve the same for the
posterior probability distribution, using a guess at the width of the
posterior probability distribution. This gives
, and
, where
is the number
of data points, and the number of adjustable parameters is
for the null model
, and
for the main model
,
because a parameter whose prior probability distribution is a Dirac
delta function is not truly adjustable. One further complication is
that it is necessary to ensure that, in each dimension, the width of
the initial vector set is comfortably greater than the quantum of
floating point numbers of the order of magnitude of
.
Mantissae have 53 bits `on typical hardware' [98].
One bit is assumed to be taken up by the sign, and four bits are left,
to provide 16 possible starting ordinates, i.e. if the half-width of
the top-hat distribution is less than
, it is
adjusted to
.
- The author believes that the number
of state vectors
maintained simultaneously, in the Monte Carlo algorithm, must be at
least the number
of adjustable parameters. This is because
every state vector ever obtained will be some linear combination of
the
initial state vectors, and to obtain a representative
sample, the
initial state vectors must, therefore, span the
parameter space. In this thesis, therefore,
is chosen,
rather than the
or
suggested in Information
Theory, Inference and Learning Algorithms [90].
- The author also believes that having a large value of
will
ameliorate the problem, noted by MacKay [90], of
the state vectors in the leapfrog algorithm becoming stuck
permanently at one vector set, after it has converged, because, as
long as the parameter of interest varies smoothly in
space,
as the
must, if
, there will be a reasonable
sampling of the distribution even if all the individual state
vectors are stationary.
, however, might not vary
smoothly is
space. Therefore, the quantitative estimates
of
obtained herein should be treated with some
scepticism.
Next: The Parameter Estimates
Up: The Details of the
Previous: Posterior Probability Distribution
  Contents
Daniel Christopher Hatton
2004-11-30