Replace known-problematic variance algorithm (for `column_variance`)
The existing algorithm for `column_variance` uses the textbook formula `E[X^2] - E[X]^2`, which is well established to have numerical issues. While the intention of the extant algorithm (a traversal of the elements in column-major order) is apparent, we should not sacrifice precision when we do not need to: the two-pass algorithm for variance, which uses the formula `E[(x - E[x])(x - E[x])]`, can be substituted without issue (N.B. the existing algorithm is already a two-pass algorithm anyway). Notably, the other variance implementations in the `statistics` module already use `E[(x - E[x])(x - E[x])]`.

Loss of precision aside, keeping the existing implementation of `column_variance` leads to an obvious absurdity:

```rust
use nalgebra::Matrix2x3;

let m = Matrix2x3::new(1.0, 2.0, 3.0, 4.0, 5.0, 6.0);
assert_ne!(m.column_variance().transpose(), m.transpose().row_variance());
```

We can eliminate both the loss of precision and the glaring inconsistency by switching to the implementation provided by this PR.

For a comprehensive analysis of variance algorithms, see this [reference](https://ds.ifi.uni-heidelberg.de/files/Team/eschubert/publications/SSDBM18-covariance-authorcopy.pdf), in particular Table 2. The "two-pass" algorithm described in the paper is the implementation given in this PR. In terms of simplicity (and hence maintainability), "two-pass" is a suitable choice; in terms of runtime performance and precision, it strikes a good balance (cf. Youngs & Cramer and "textbook"). Furthermore, it is consistent with the variance algorithm used by the other `*_variance` functions in the `statistics` module.
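To make the precision argument concrete, here is a standalone sketch (plain Rust, no nalgebra; the function names are ours, not the library's) contrasting the two formulas on data with a large mean and a small spread. The textbook formula subtracts two nearly equal large quantities and cancels catastrophically, while the two-pass formula computes deviations from the mean first and stays accurate:

```rust
// Textbook formula E[X^2] - E[X]^2: subtracts two nearly equal
// quantities of magnitude ~mean^2, losing the low-order bits that
// actually carry the variance.
fn variance_textbook(xs: &[f64]) -> f64 {
    let n = xs.len() as f64;
    let mean = xs.iter().sum::<f64>() / n;
    let mean_sq = xs.iter().map(|x| x * x).sum::<f64>() / n;
    mean_sq - mean * mean
}

// Two-pass formula E[(x - E[x])(x - E[x])]: first pass computes the
// mean, second pass accumulates squared deviations, which are small
// and well-conditioned.
fn variance_two_pass(xs: &[f64]) -> f64 {
    let n = xs.len() as f64;
    let mean = xs.iter().sum::<f64>() / n;
    xs.iter().map(|x| (x - mean) * (x - mean)).sum::<f64>() / n
}

fn main() {
    // Small spread around a large mean; the true population variance
    // of {x, x + 1, x + 2} is 2/3 regardless of x.
    let xs = [1.0e8, 1.0e8 + 1.0, 1.0e8 + 2.0];
    println!("textbook: {}", variance_textbook(&xs));
    println!("two-pass: {}", variance_two_pass(&xs));
}
```

On this input the two-pass result is 2/3 to full precision, while the textbook result is visibly wrong, which is exactly the failure mode the cited paper's Table 2 quantifies.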
parent a91e3b0d89
commit 184f38b227
```diff
@@ -407,19 +407,23 @@ impl<T: Scalar, R: Dim, C: Dim, S: RawStorage<T, R, C>> Matrix<T, R, C, S> {
         DefaultAllocator: Allocator<T, R>,
     {
         let (nrows, ncols) = self.shape_generic();
-        let mut mean = self.column_mean();
-        mean.apply(|e| *e = -(e.clone() * e.clone()));
-
-        let denom = T::one() / crate::convert::<_, T>(ncols.value() as f64);
-        self.compress_columns(mean, |out, col| {
+        let mut sumsq = OVector::zeros_generic(nrows, Const::<1>);
+        let mean = self.column_mean();
+
+        for j in 0..ncols.value() {
+            let col = self.column(j);
             for i in 0..nrows.value() {
                 unsafe {
-                    let val = col.vget_unchecked(i);
-                    *out.vget_unchecked_mut(i) += denom.clone() * val.clone() * val.clone()
+                    let x = col.vget_unchecked(i);
+                    let m = mean.vget_unchecked(i);
+                    let delta = x.clone() - m.clone();
+                    *sumsq.vget_unchecked_mut(i) += delta.clone() * delta.clone();
                 }
             }
-        })
+        }
+
+        let denom = T::one() / crate::convert::<_, T>(ncols.value() as f64);
+        sumsq.apply(move |e| *e *= denom.clone());
+        sumsq
     }

     /*
```
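The patched logic can be sketched in isolation. The following is a plain-Rust analogue (nested `Vec<f64>` in row-major layout rather than nalgebra's generic storage, and without the `unsafe` element accesses), mirroring the structure of the diff: a first pass computes the per-row mean across columns (what nalgebra's `column_mean` returns), a second pass accumulates squared deviations into `sumsq`, and a final step scales by `1 / ncols`:

```rust
// Two-pass column variance, sketched over row-major Vec<Vec<f64>>.
// Entry i of the result is the population variance of row i's entries
// across all columns, matching what nalgebra's column_variance computes.
fn column_variance(m: &[Vec<f64>]) -> Vec<f64> {
    let nrows = m.len();
    let ncols = m[0].len();
    let denom = 1.0 / ncols as f64;

    // First pass: per-row mean across columns.
    let mean: Vec<f64> = m.iter().map(|row| row.iter().sum::<f64>() * denom).collect();

    // Second pass: accumulate squared deviations, column-major order
    // as in the diff.
    let mut sumsq = vec![0.0; nrows];
    for j in 0..ncols {
        for i in 0..nrows {
            let delta = m[i][j] - mean[i];
            sumsq[i] += delta * delta;
        }
    }

    // Scale by 1/ncols at the end, as the patched code does via apply().
    sumsq.iter_mut().for_each(|e| *e *= denom);
    sumsq
}

fn main() {
    // Rows of the Matrix2x3 example from the commit message; each row
    // is {x, x + 1, x + 2}, so each population variance is 2/3.
    let m = vec![vec![1.0, 2.0, 3.0], vec![4.0, 5.0, 6.0]];
    println!("{:?}", column_variance(&m));
}
```

Deferring the `1 / ncols` scaling until after accumulation, rather than folding `denom` into every term as the old code did, also saves a multiplication per element.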