If you compute the standard deviation of a data set by directly implementing the definition, you’ll need to pass through the data twice: once to find the mean, then a second time to accumulate the squared differences from the mean. But there is an equivalent algorithm that requires only one pass and that is more accurate than the most direct method. You can find the code for implementing it here.
You can also compute the higher sample moments in one pass. I’ve extended the previous code to compute skewness and kurtosis in one pass as well.
The new code also lets you split your data, say to process it in parallel on different threads, and then combine the statistics, in the spirit of map-reduce.
Lastly, I’ve posted analogous code for simple linear regression.
Boost.Accumulators does this for a whole bunch of other statistics.
BTW, your C++ code has quite a bit of room for performance improvements. The RunningStats and RunningRegression objects are 40 and 96 bytes each, and their operator+ takes both parameters by-value, rather than by-reference-to-const. Furthermore, the canonical way to overload operators is to define operator+ in terms of operator+=, instead of the other way around. This will save at least one 40 or 96 byte assignment per observation. Such can be the benefits of using libraries :-)
Why does RunningStat store both m_oldM and m_newM when, after initialization, they’re always the same between calls? Ditto m_old/newS?
My new strategy for optimizing code is going to be to post it online as bait for C++ gurus on break.
So beyond streaming there are also distributed formulas for moments. This is how the moments group works for algebird (see: https://github.com/twitter/algebird/blob/develop/algebird-core/src/main/scala/com/twitter/algebird/MomentsGroup.scala)
Comparing the results of the RunningStats skewness() and Kurtosis() functions on the http://www.johndcook.com/blog/skewness_kurtosis/ page gave different results to results in Excel, and Python’s Pandas DataFrame results.
They use corrected sample rather than population formulas, so you could extend the code
double RunningStats::SkewnessCS() const
{
// Calculate the population skewness:
double g = sqrt(double(n)) * M3/ pow(M2, 1.5);
// Return the corrected sample skewness:
return sqrt( double(n*n-1))*g / double(n-2);
}
double RunningStats::KurtosisCS() const
{
// Calculate the population excess kurtosis
double g = double(n)*M4 / (M2*M2) – 3.0;
// Return the corrected sample excess kurtosis
return double(n-1) / double( (n-2)*(n-3) ) * ( double(n+1)*g + 6 );
}
Thank you James for correcting the metrics to sample instead of population!