Replace known-problematic variance algorithm (for column_variance)

The existing algorithm for `column_variance` uses the textbook
formula `E[X^2] - E[X]^2`, which is well established to have
numerical issues (catastrophic cancellation). While the intention
of the extant algorithm (traversal of the elements in column-major
order) is apparent, we should not sacrifice precision when we do not
need to -- the two-pass algorithm for variance, based on the formula
`E[(x - E[x])(x - E[x])]`, can be substituted without issue. (N.B. the
existing algorithm is already a two-pass algorithm anyway, since it
computes the column mean first.)
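
As a standalone illustration (not part of this PR; the data and helper
functions below are hypothetical), the cancellation is easy to reproduce
with plain `f64` values whose mean is large relative to their spread:

```rust
// Hypothetical helpers, shown only to demonstrate the precision difference.
fn variance_textbook(data: &[f64]) -> f64 {
    let n = data.len() as f64;
    let mean = data.iter().sum::<f64>() / n;
    let mean_sq = data.iter().map(|x| x * x).sum::<f64>() / n;
    // E[X^2] - E[X]^2: subtracting two large, nearly equal numbers.
    mean_sq - mean * mean
}

fn variance_two_pass(data: &[f64]) -> f64 {
    let n = data.len() as f64;
    // Pass 1: the mean.
    let mean = data.iter().sum::<f64>() / n;
    // Pass 2: E[(x - E[x])(x - E[x])], i.e. the mean squared deviation.
    data.iter().map(|x| (x - mean) * (x - mean)).sum::<f64>() / n
}

fn main() {
    // Large offset, small spread; the true (population) variance is 2/3.
    let data = [1e8 + 1.0, 1e8 + 2.0, 1e8 + 3.0];
    println!("textbook: {}", variance_textbook(&data)); // typically 0 here (cancellation)
    println!("two-pass: {}", variance_two_pass(&data)); // ~0.6666666666666666
}
```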

Notably, the other variance implementations in the `statistics`
module already use `E[(x - E[x])(x - E[x])]`. Loss of precision aside,
keeping the existing implementation of `column_variance` leads to the
following obvious absurdity:

```rust
use nalgebra::Matrix2x3;

let m = Matrix2x3::new(1.0, 2.0, 3.0, 4.0, 5.0, 6.0);
assert_ne!(m.column_variance().transpose(), m.transpose().row_variance());
```

We can eliminate both the loss of precision and the glaring inconsistency
by switching to the implementation provided by this PR.

For a comprehensive analysis of variance algorithms, see this
[reference](https://ds.ifi.uni-heidelberg.de/files/Team/eschubert/publications/SSDBM18-covariance-authorcopy.pdf),
in particular Table 2. The "two-pass" algorithm described in the paper
is the implementation given in this PR. In terms of simplicity (and
hence maintainability), "two-pass" is a suitable choice; in terms of
runtime performance and precision, it strikes a good balance (cf.
Youngs & Cramer and "textbook"). Furthermore, it is consistent with the
variance algorithm used by the other `*variance` implementations in the
`statistics` module.
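
To make the per-row, column-major accumulation concrete, here is a
plain-Rust sketch (illustrative only; `column_variance_sketch` and its
`Vec`-of-columns input are hypothetical, not nalgebra code) of the same
two-pass scheme the diff below applies, dividing by the number of columns
as the existing code does:

```rust
/// Two-pass population variance per row of a matrix given as a slice of
/// columns, each of length `nrows` -- the quantity `column_variance` computes.
fn column_variance_sketch(columns: &[Vec<f64>], nrows: usize) -> Vec<f64> {
    let ncols = columns.len();
    // Pass 1: per-row mean across all columns (column-major traversal).
    let mut mean = vec![0.0; nrows];
    for col in columns {
        for i in 0..nrows {
            mean[i] += col[i];
        }
    }
    for m in &mut mean {
        *m /= ncols as f64;
    }
    // Pass 2: per-row sum of squared deviations from the mean.
    let mut sumsq = vec![0.0; nrows];
    for col in columns {
        for i in 0..nrows {
            let delta = col[i] - mean[i];
            sumsq[i] += delta * delta;
        }
    }
    // Normalize by the number of samples (columns).
    for s in &mut sumsq {
        *s /= ncols as f64;
    }
    sumsq
}
```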

```diff
@@ -407,19 +407,23 @@ impl<T: Scalar, R: Dim, C: Dim, S: RawStorage<T, R, C>> Matrix<T, R, C, S> {
         DefaultAllocator: Allocator<T, R>,
     {
         let (nrows, ncols) = self.shape_generic();
+        let mut sumsq = OVector::zeros_generic(nrows, Const::<1>);
+        let mean = self.column_mean();
 
-        let mut mean = self.column_mean();
-        mean.apply(|e| *e = -(e.clone() * e.clone()));
-
-        let denom = T::one() / crate::convert::<_, T>(ncols.value() as f64);
-        self.compress_columns(mean, |out, col| {
+        for j in 0..ncols.value() {
+            let col = self.column(j);
             for i in 0..nrows.value() {
                 unsafe {
-                    let val = col.vget_unchecked(i);
-                    *out.vget_unchecked_mut(i) += denom.clone() * val.clone() * val.clone()
+                    let x = col.vget_unchecked(i);
+                    let m = mean.vget_unchecked(i);
+                    let delta = x.clone() - m.clone();
+                    *sumsq.vget_unchecked_mut(i) += delta.clone() * delta.clone();
                 }
             }
-        })
+        }
+        let denom = T::one() / crate::convert::<_, T>(ncols.value() as f64);
+        sumsq.apply(move |e| *e *= denom.clone());
+        sumsq
     }
 
     /*
```
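
With the new implementation, the inconsistency shown earlier should
disappear; for small, exactly representable inputs like the example above,
one would expect the two sides to agree exactly (a sanity check, not part
of this diff):

```rust
use nalgebra::Matrix2x3;

let m = Matrix2x3::new(1.0, 2.0, 3.0, 4.0, 5.0, 6.0);
// Both sides now compute E[(x - E[x])(x - E[x])] over the same elements.
assert_eq!(m.column_variance().transpose(), m.transpose().row_variance());
```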