use epsilon thresholding for variance calculation #551

bournejt · 2025-06-23T03:17:49Z

I noticed that the variance compute method only guards against negative values (cpp link) and does not use the epsilon-based thresholding that the other statistics use. This causes numerical error when all the values in the window are identical: csp.stats.stddev returns a small positive number instead of exactly zero. I have prepared a self-contained test case that shows the error. I think the variance calculation should use the same approach as the other statistics. My proposed change is below:

double compute() const
{
    if( m_count > m_ddof )
    {
        double var = m_unnormVar / ( m_count - m_ddof );
        return ( var < EPSILON ? 0 : var );  // Use epsilon threshold like others
    }
    
    return std::numeric_limits<double>::quiet_NaN();
}

Test case

"""
Simple test showing CSP stddev bug vs pandas behavior
"""

import csp
import pandas as pd

from datetime import datetime, timedelta


def test_pandas():
    """Test with pandas - expected behavior"""
    print("=== PANDAS TEST ===")
    
    # Create test data: 3 periods, each with identical values
    data = []
    timestamps = []
    
    base_time = pd.Timestamp('2023-01-01 09:00:00')
    
    # Period 1: 5 ticks of -1.0
    for i in range(5):
        timestamps.append(base_time + pd.Timedelta(seconds=i*10))
        data.append(-1.0)
    
    # Period 2: 5 ticks of -2.0  
    for i in range(5):
        timestamps.append(base_time + pd.Timedelta(minutes=1, seconds=i*10))
        data.append(-2.0)
    
    # Period 3: 5 ticks of -3.0
    for i in range(5):
        timestamps.append(base_time + pd.Timedelta(minutes=2, seconds=i*10))
        data.append(-3.0)
    
    df = pd.DataFrame({'value': data}, index=timestamps)
    
    print("Raw data:")
    for timestamp, value in zip(timestamps, data):
        print(f"  {timestamp}: {value}")
    print()
    
    # Resample to 1-minute periods and calculate stddev
    # Use closed='right', label='right' to match CSP behavior
    result = df.resample('1min', closed='right', label='right').std()
    
    print("Pandas stddev results:")
    for timestamp, stddev in result.iterrows():
        print(f"  {timestamp}: {stddev.iloc[0]:.10f}")

def test_csp():
    """Test with CSP - shows the bug"""
    print("\n=== CSP TEST ===")
    
    @csp.node
    def generate_data() -> csp.ts[float]:
        """Generate the exact same data as pandas test"""
        with csp.alarms():
            alarm = csp.alarm(bool)
        
        with csp.start():
            # Period 1: 5 ticks of -1.0
            for i in range(5):
                csp.schedule_alarm(alarm, timedelta(seconds=i*10), True)
            
            # Period 2: 5 ticks of -2.0
            for i in range(5):
                csp.schedule_alarm(alarm, timedelta(minutes=1, seconds=i*10), True)
            
            # Period 3: 5 ticks of -3.0
            for i in range(5):
                csp.schedule_alarm(alarm, timedelta(minutes=2, seconds=i*10), True)
        
        if csp.ticked(alarm):
            current_time = csp.now()
            seconds = (current_time - datetime(2023, 1, 1, 9, 0, 0)).total_seconds()
            
            if seconds < 60:
                return -1.0    # Period 1
            elif seconds < 120:
                return -2.0    # Period 2
            else:
                return -3.0    # Period 3
    
    @csp.graph
    def csp_test():
        # Same resampling setup as in the actual code
        resample_interval = timedelta(seconds=60)
        timer = csp.timer(interval=resample_interval, value=True)
        
        data = generate_data()
        stddev_result = csp.stats.stddev(data, interval=resample_interval, trigger=timer)
        
        # Print the raw data
        csp.print('RAW_DATA', data)
        
        # Print the results
        csp.print('CSP_STDDEV', stddev_result)
    
    start_time = datetime(2023, 1, 1, 9, 0, 0)
    end_time = start_time + timedelta(minutes=4)
    csp.run(csp_test, starttime=start_time, endtime=end_time)

if __name__ == "__main__":
    test_pandas()
    test_csp()

Output from test case

=== PANDAS TEST ===
Raw data:
  2023-01-01 09:00:00: -1.0
  2023-01-01 09:00:10: -1.0
  2023-01-01 09:00:20: -1.0
  2023-01-01 09:00:30: -1.0
  2023-01-01 09:00:40: -1.0
  2023-01-01 09:01:00: -2.0
  2023-01-01 09:01:10: -2.0
  2023-01-01 09:01:20: -2.0
  2023-01-01 09:01:30: -2.0
  2023-01-01 09:01:40: -2.0
  2023-01-01 09:02:00: -3.0
  2023-01-01 09:02:10: -3.0
  2023-01-01 09:02:20: -3.0
  2023-01-01 09:02:30: -3.0
  2023-01-01 09:02:40: -3.0

Pandas stddev results:
  2023-01-01 09:00:00: nan
  2023-01-01 09:01:00: 0.4472135955
  2023-01-01 09:02:00: 0.4472135955
  2023-01-01 09:03:00: 0.0000000000

=== CSP TEST ===
2023-01-01 09:00:00 RAW_DATA:-1.0
2023-01-01 09:00:10 RAW_DATA:-1.0
2023-01-01 09:00:20 RAW_DATA:-1.0
2023-01-01 09:00:30 RAW_DATA:-1.0
2023-01-01 09:00:40 RAW_DATA:-1.0
2023-01-01 09:01:00 RAW_DATA:-2.0
2023-01-01 09:01:00 CSP_STDDEV:0.4472135954999579
2023-01-01 09:01:10 RAW_DATA:-2.0
2023-01-01 09:01:20 RAW_DATA:-2.0
2023-01-01 09:01:30 RAW_DATA:-2.0
2023-01-01 09:01:40 RAW_DATA:-2.0
2023-01-01 09:02:00 RAW_DATA:-3.0
2023-01-01 09:02:00 CSP_STDDEV:0.44721359549995804
2023-01-01 09:02:10 RAW_DATA:-3.0
2023-01-01 09:02:20 RAW_DATA:-3.0
2023-01-01 09:02:30 RAW_DATA:-3.0
2023-01-01 09:02:40 RAW_DATA:-3.0
2023-01-01 09:03:00 CSP_STDDEV:1.609509363375073e-08
2023-01-01 09:04:00 CSP_STDDEV:nan
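
The 1.6e-08 at 09:03:00 is the sort of residual that incremental add/remove updates can leave behind once older, different values have passed through the window. Below is a minimal Python sketch of that effect, assuming a Welford-style update. RollingVariance and window_variance_at are hypothetical names for illustration, not the CSP implementation, and any residual this sketch produces need not match the C++ value.

```python
from collections import deque


class RollingVariance:
    """Hypothetical sliding-window variance with Welford-style add/remove."""

    def __init__(self, ddof=1):
        self.n = 0
        self.mean = 0.0
        self.m2 = 0.0  # running sum of squared deviations from the mean
        self.ddof = ddof

    def add(self, x):
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)

    def remove(self, x):
        if self.n == 1:
            # Removing the last value: reset to the empty state.
            self.n, self.mean, self.m2 = 0, 0.0, 0.0
            return
        self.n -= 1
        delta = x - self.mean
        self.mean -= delta / self.n
        self.m2 -= delta * (x - self.mean)

    def value(self):
        if self.n > self.ddof:
            return self.m2 / (self.n - self.ddof)
        return float("nan")


def window_variance_at(ticks, eval_time, window=60):
    """Feed (time, value) ticks through a (t - window, t] window.

    Returns the incremental variance at eval_time and the values still live.
    """
    rv = RollingVariance()
    live = deque()
    for t, x in ticks:
        if t > eval_time:
            break
        # Expire ticks that fell out of the window before adding the new one.
        while live and live[0][0] <= t - window:
            rv.remove(live.popleft()[1])
        rv.add(x)
        live.append((t, x))
    while live and live[0][0] <= eval_time - window:
        rv.remove(live.popleft()[1])
    return rv.value(), [x for _, x in live]


# Same data as the test case: three periods of five identical ticks each.
ticks = [(60 * p + 10 * i, -(p + 1.0)) for p in range(3) for i in range(5)]
incremental, live_values = window_variance_at(ticks, eval_time=180)

# Two-pass reference computed directly on the values left in the window.
ref_mean = sum(live_values) / len(live_values)
two_pass = sum((x - ref_mean) ** 2 for x in live_values) / (len(live_values) - 1)

print("live values:", live_values)
print("incremental variance:", incremental)
print("two-pass variance:", two_pass)
```

For the last window the two-pass result is exactly zero, because every deviation from the mean is zero. The incremental result only has to be close to zero, since it carries whatever rounding error accumulated while the earlier values were added and removed.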

AdamGlustein · 2025-06-23T12:41:19Z

AdamGlustein
Jun 23, 2025
Maintainer

Thanks for bringing this up. I checked how pandas handles this case. Rather than using an epsilon to mask the instability, they have logic in their variance method that checks whether the current window is all the same value (https://github.com/pandas-dev/pandas/blob/1da0d022057862f4352113d884648606efd60099/pandas/_libs/window/aggregations.pyx#L337). This seems like a better solution to me: a small positive variance is a legitimate result for certain inputs, so rounding it down to zero is not ideal.
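
To illustrate the concern about rounding down, here is a small Python sketch. The EPSILON value is an assumption made for the example (the constant CSP uses may differ), and sample_variance and clamp_with_epsilon are hypothetical helpers.

```python
import math

EPSILON = 1e-9  # assumed threshold for illustration; the value CSP uses may differ


def sample_variance(values, ddof=1):
    """Plain two-pass sample variance."""
    n = len(values)
    if n <= ddof:
        return math.nan
    mean = sum(values) / n
    return sum((x - mean) ** 2 for x in values) / (n - ddof)


def clamp_with_epsilon(var):
    """Thresholding as in the proposal: anything below EPSILON becomes 0."""
    return 0.0 if var < EPSILON else var


# Inputs that genuinely differ, but on a very small scale.
small_but_real = [1e-6, 2e-6, 3e-6]
true_var = sample_variance(small_but_real)
clamped = clamp_with_epsilon(true_var)

print("true variance:", true_var)
print("after epsilon clamp:", clamped)
```

The inputs are not identical, so their variance is positive (roughly 1e-12), yet the clamp reports zero. A check that the window holds a single repeated value, as pandas does, avoids this.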

One downside of the above approach is that we'll need to compare each new value with the previous one and track the number of consecutive identical values, which will have a slight performance cost. But I don't think it will make a huge difference, so I am good with making that change.
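
A rough Python sketch of the run-counting idea, for discussion only. It assumes a count-based window and a Welford-style update; RunAwareVariance and rolling_var are hypothetical names, and this is a paraphrase of the approach rather than the pandas or CSP code.

```python
import math
from collections import deque


class RunAwareVariance:
    """Hypothetical rolling variance that returns exactly 0 for a constant window."""

    def __init__(self, window, ddof=1):
        assert window >= 1
        self.window = window  # number of most recent values to keep
        self.ddof = ddof
        self.values = deque()
        self.n = 0
        self.mean = 0.0
        self.m2 = 0.0         # running sum of squared deviations
        self.prev = math.nan  # last value added (nan never equals anything)
        self.run = 0          # length of the current run of identical values

    def push(self, x):
        # The extra cost: one comparison and one counter update per value.
        self.run = self.run + 1 if x == self.prev else 1
        self.prev = x

        # Welford-style add.
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)
        self.values.append(x)

        # Evict the oldest value once the window is over-full.
        if len(self.values) > self.window:
            old = self.values.popleft()
            self.n -= 1
            delta = old - self.mean
            self.mean -= delta / self.n
            self.m2 -= delta * (old - self.mean)

    def value(self):
        if self.n <= self.ddof:
            return math.nan
        # If the current run covers the whole window, every value is identical.
        if self.run >= self.n:
            return 0.0
        return self.m2 / (self.n - self.ddof)


def rolling_var(data, window):
    """Variance after each value is pushed."""
    rv = RunAwareVariance(window)
    out = []
    for x in data:
        rv.push(x)
        out.append(rv.value())
    return out


data = [-1.0] * 5 + [-2.0] * 5 + [-3.0] * 5
results = rolling_var(data, window=4)
print(results)
```

Removing a value never needs to touch the run counter: the run always ends at the newest value, so once it is at least as long as the window, everything still inside the window must be equal.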

Can you go ahead and copy what you have here into an Issue for tracking? Or alternatively, if you are comfortable making the change, your contribution is always welcome.

14 replies

bournejt Jun 26, 2025
Author

And it's up: #554. I'm not sure how to run the tests, though. Are the tests configured to run automatically on a GitHub PR?

timkpaine Jun 26, 2025
Maintainer

https://docs.github.com/en/actions/how-tos/managing-workflow-runs-and-deployments/managing-workflow-runs/approving-workflow-runs-from-public-forks#about-workflow-runs-from-public-forks

Please follow https://github.com/Point72/csp/wiki/Local-Development-Setup#guidelines

AdamGlustein Jun 26, 2025
Maintainer

@bournejt the DCO step failed. To fix it, simply run git commit -s --amend --no-edit and then git push -f origin <branch>. This adds a Signed-off-by line to the commit you already have.

AdamGlustein Jun 26, 2025
Maintainer

And it's up: #554. I don't understand how to run tests actually. Is test running automatically configured to run in github PR?

You can run tests locally by running make test, assuming you have already built csp

bournejt Jun 30, 2025
Author

@AdamGlustein @timkpaine Thanks for the guidance here. I was trying to do the build in my existing conda environment, but there were apparently some conflicts, so I followed the guide to create a new conda env and finished the local build. I also followed the steps @AdamGlustein gave me to sign off for the DCO. However, the check is still failing in the PR.
