From 184f38b227ffab2a5d92cce4142bfcf0743e8534 Mon Sep 17 00:00:00 2001
From: Andrew Radcliffe
Date: Thu, 16 Nov 2023 16:25:22 -0800
Subject: [PATCH] Replace known-problematic variance algorithm (for
 `column_variance`)

The existing algorithm for `column_variance` uses the textbook formula
`E[X^2] - E[X]^2`, which is well established to have numerical issues.
While the intention of the extant algorithm (traversal of the elements
in column-major order) is apparent, we should not sacrifice precision
when we do not need to. The two-pass algorithm for variance, based on
the formula `E[(x - E[x])(x - E[x])]`, can be substituted without issue
(N.B. the existing algorithm already makes two passes over the data
anyway). Notably, the other variance implementations in the
`statistics` module already use `E[(x - E[x])(x - E[x])]`.

Loss of precision aside, keeping the existing implementation of
`column_variance` produces an obvious absurdity:

```rust
use nalgebra::Matrix2x3;

let m = Matrix2x3::new(1.0, 2.0, 3.0, 4.0, 5.0, 6.0);
assert_ne!(m.column_variance().transpose(), m.transpose().row_variance());
```

Switching to the implementation provided by this PR eliminates both the
loss of precision and the glaring inconsistency.

For a comprehensive analysis of variance algorithms, see this
[reference](https://ds.ifi.uni-heidelberg.de/files/Team/eschubert/publications/SSDBM18-covariance-authorcopy.pdf),
in particular Table 2. The "two-pass" algorithm described in the paper
is the implementation given in this PR. In terms of simplicity (hence
maintainability), "two-pass" is a suitable choice; in terms of runtime
performance and precision, it strikes a good balance (cf. Youngs &
Cramer and "textbook"). Furthermore, it is consistent with the variance
algorithm used by the other `*variance` functions in the `statistics`
module.
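As a quick standalone illustration of the instability (plain Rust, not
part of the patch and not nalgebra code): for a sample with a large mean
and small spread, the textbook formula can lose all significant digits
and even return a negative "variance", while the two-pass formula does
not.

```rust
// Population variance via the textbook formula E[X^2] - E[X]^2.
// Squares of large values round away the small deviations.
fn textbook_variance(xs: &[f64]) -> f64 {
    let n = xs.len() as f64;
    let mean = xs.iter().sum::<f64>() / n;
    let mean_sq = xs.iter().map(|x| x * x).sum::<f64>() / n;
    mean_sq - mean * mean
}

// Population variance via the two-pass formula E[(x - E[x])(x - E[x])]:
// first pass computes the mean, second accumulates squared deviations.
fn two_pass_variance(xs: &[f64]) -> f64 {
    let n = xs.len() as f64;
    let mean = xs.iter().sum::<f64>() / n;
    xs.iter().map(|x| (x - mean) * (x - mean)).sum::<f64>() / n
}

fn main() {
    // Values around 1e9 with deviations of a few units; the true
    // population variance is 22.5.
    let xs = [1e9 + 4.0, 1e9 + 7.0, 1e9 + 13.0, 1e9 + 16.0];
    println!("textbook: {}", textbook_variance(&xs)); // negative!
    println!("two-pass: {}", two_pass_variance(&xs)); // 22.5
}
```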
---
 src/base/statistics.rs | 20 ++++++++++++--------
 1 file changed, 12 insertions(+), 8 deletions(-)

diff --git a/src/base/statistics.rs b/src/base/statistics.rs
index 6007f8c7..915e5043 100644
--- a/src/base/statistics.rs
+++ b/src/base/statistics.rs
@@ -407,19 +407,23 @@ impl> Matrix {
         DefaultAllocator: Allocator,
     {
         let (nrows, ncols) = self.shape_generic();
+        let mut sumsq = OVector::zeros_generic(nrows, Const::<1>);
+        let mean = self.column_mean();
-        let mut mean = self.column_mean();
-        mean.apply(|e| *e = -(e.clone() * e.clone()));
-
-        let denom = T::one() / crate::convert::<_, T>(ncols.value() as f64);
-        self.compress_columns(mean, |out, col| {
+        for j in 0..ncols.value() {
+            let col = self.column(j);
             for i in 0..nrows.value() {
                 unsafe {
-                    let val = col.vget_unchecked(i);
-                    *out.vget_unchecked_mut(i) += denom.clone() * val.clone() * val.clone()
+                    let x = col.vget_unchecked(i);
+                    let m = mean.vget_unchecked(i);
+                    let delta = x.clone() - m.clone();
+                    *sumsq.vget_unchecked_mut(i) += delta.clone() * delta.clone();
                 }
             }
-        })
+        }
+        let denom = T::one() / crate::convert::<_, T>(ncols.value() as f64);
+        sumsq.apply(move |e| *e *= denom.clone());
+        sumsq
     }

     /*
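For reference, the loop structure introduced by the patch can be
sketched in plain Rust over a column-major buffer. The function name and
layout here are illustrative only, not nalgebra's API: it returns the
per-row population variance taken across the columns, which is what
`column_variance` computes.

```rust
// Two-pass per-row variance across columns, traversing a column-major
// buffer (`data[j * nrows + i]` is element (i, j)) in storage order,
// mirroring the patched loop. Hypothetical helper, not nalgebra code.
fn column_variance(data: &[f64], nrows: usize, ncols: usize) -> Vec<f64> {
    // First pass: per-row mean across the columns.
    let mut mean = vec![0.0; nrows];
    for j in 0..ncols {
        for i in 0..nrows {
            mean[i] += data[j * nrows + i];
        }
    }
    for m in &mut mean {
        *m /= ncols as f64;
    }
    // Second pass: accumulate squared deviations, still column-major.
    let mut sumsq = vec![0.0; nrows];
    for j in 0..ncols {
        for i in 0..nrows {
            let delta = data[j * nrows + i] - mean[i];
            sumsq[i] += delta * delta;
        }
    }
    // Scale once at the end, as the patch does with `sumsq.apply`.
    for s in &mut sumsq {
        *s /= ncols as f64;
    }
    sumsq
}

fn main() {
    // Column-major storage of the 2x3 matrix [[1, 2, 3], [4, 5, 6]]
    // from the commit message's example.
    let data = [1.0, 4.0, 2.0, 5.0, 3.0, 6.0];
    println!("{:?}", column_variance(&data, 2, 3)); // each row: 2/3
}
```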