Skip to content

Conversation

deblasis
Copy link

@deblasis deblasis commented Oct 5, 2025

This PR introduces a test to repro the data race issue and also the fix for it in a separate commit.

Issue discovered in getsops/sops integration tests.
Related: getsops/sops#1226
Failed job: https://github.com/getsops/sops/actions/runs/18238383955/job/51962367579?pr=1226

How to Review

Step 1: Reproduce the Bug (Commit f38c967)

# Checkout the first commit (test only, no fix)
git checkout f38c967789

# Run test with race detector
cd common
go test -race -run TestRetry_RaceCondition

# Expected output (BUGGY):
# WARNING: DATA RACE
# Write at 0x... by goroutine X: retry.go:871
# Previous read at 0x... by goroutine Y: retry.go:907
# Found 1 data race(s)
# exit status 66
# FAIL

Step 2: Verify the Fix (Commit 7770835)

# Checkout the fix commit
git checkout 7770835982

# Run same test with race detector
cd common
go test -race -run TestRetry_RaceCondition

# Expected output (FIXED):
# PASS
# ok    github.com/oracle/oci-go-sdk/v65/common 1.020s
# ✅ No race detected!

Step 3: Run Full Test Suite

# Run all common package tests with race detector
go test -race ./common/ -v

# Run broader test suite
go test -race ./... -short

Problem Description

Root Cause

The Retry function had a data race when the context was cancelled:

// Line 807: Shared variable
var response OCIResponse
var err error

go func() {
    // Line 871: Goroutine WRITES to response
    response, err = operation(ctx, request, rsc, extraHeaders)
    // ...
}()

// Lines 905-910: RACE CONDITION
select {
case <-ctx.Done():
    // Line 907: Main thread READS response (concurrent access!)
    return response, ctx.Err()
case result := <-retrierChannel:
    return result.response, result.err
}

Impact

  • Data Race: Undefined behavior per Go memory model
  • Panic Risk: nil pointer dereference when accessing response fields
  • Production Impact: Discovered in getsops/sops integration tests

Original Error

panic: runtime error: invalid memory address or nil pointer dereference
at github.com/oracle/oci-go-sdk/v65/common.newServiceFailureFromResponse
    common/errors.go:111

Solution

Changes Made

The fix ensures the retry goroutine checks for context cancellation and exits promptly:

1. Added context check before each retry attempt
2. Made sleep interruptible
3. Removed the racing select statement

Signed-off-by: Alessandro De Blasis [email protected]

This test demonstrates the data race bug when context is cancelled
during a retry operation.

The race occurs between:
- Line 871: Goroutine writes to shared 'response' variable
- Line 907: Main thread reads 'response' when ctx.Done()

To reproduce:
  go test -race -run TestRetry_RaceCondition ./common/

Expected result (BEFORE fix):
  WARNING: DATA RACE
  Found 1 data race(s)
  exit status 66
  FAIL

Issue discovered in getsops/sops integration tests.
Related: https://github.com/getsops/sops

Signed-off-by: Alessandro De Blasis <[email protected]>
This commit fixes the data race by ensuring the retry goroutine checks
for context cancellation and exits promptly, eliminating the race
between the goroutine writing to 'response' and the main thread reading it.

Changes:
1. Check ctx.Done() before each retry attempt
2. Make sleep interruptible by checking ctx.Done() during wait
3. Remove racing select statement - now always wait for channel result

Benefits:
- No data race: All 'response' access synchronized through channel
- No goroutine leak: Goroutine exits promptly on cancellation
- Preserves response: Returns last OCIResponse even when cancelled
- Deterministic behavior: No timing races

Before this fix:
  go test -race -run TestRetry_RaceCondition ./common/
  WARNING: DATA RACE
  Found 1 data race(s)
  exit status 66
  FAIL

After this fix:
  go test -race -run TestRetry_RaceCondition ./common/
  PASS
  ok (no race detected)

Fixes: Data race between retry.go:871 (write) and retry.go:907 (read)
Related: https://github.com/getsops/sops

Signed-off-by: Alessandro De Blasis <[email protected]>
Copy link

Thank you for your pull request and welcome to our community! To contribute, please sign the Oracle Contributor Agreement (OCA).
The following contributors of this PR have not signed the OCA:

To sign the OCA, please create an Oracle account and sign the OCA in Oracle's Contributor Agreement Application.

When signing the OCA, please provide your GitHub username. After signing the OCA and getting an OCA approval from Oracle, this PR will be automatically updated.

If you are an Oracle employee, please make sure that you are a member of the main Oracle GitHub organization, and your membership in this organization is public.

@oracle-contributor-agreement oracle-contributor-agreement bot added the OCA Required At least one contributor does not have an approved Oracle Contributor Agreement. label Oct 5, 2025
Copy link

Thank you for signing the OCA.

@oracle-contributor-agreement oracle-contributor-agreement bot added OCA Verified All contributors have signed the Oracle Contributor Agreement. and removed OCA Required At least one contributor does not have an approved Oracle Contributor Agreement. labels Oct 7, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

OCA Verified All contributors have signed the Oracle Contributor Agreement.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant