
Commit 61f8029

[PECOBLR-1735] Fix #729 and #731: Telemetry lifecycle management (#734)
* Fix #729 and #731: Telemetry lifecycle management

* Address review comments: revert timeout and telemetry_enabled changes

  Per reviewer feedback on PR #734:

  1. Revert timeout from 30s back to 900s (line 299)
     - Reviewer noted that with wait=False, the timeout is not critical
     - The async nature and wait=False handle the exit speed
  2. Revert telemetry_enabled parameter back to True (line 734)
     - Reviewer noted this is redundant given the early return
     - If enable_telemetry=False, we return early (line 729)
     - Line 734 only executes when enable_telemetry=True
     - Therefore using the parameter here is unnecessary

  These changes address the reviewer's technical concerns while keeping the core fixes intact:
  - wait=False for non-blocking shutdown (critical for Issue #729)
  - Early return when enable_telemetry=False (critical for Issue #729)
  - All Issue #731 fixes (null-safety, __del__, documentation)

* Fix Black formatting violations

  Apply Black formatting to files modified in previous commits:
  - src/databricks/sql/common/unified_http_client.py
  - src/databricks/sql/telemetry/telemetry_client.py

  Changes are purely cosmetic (quote style consistency).

* Fix CI test failure: prevent parallel execution of telemetry tests

  Add @pytest.mark.xdist_group to telemetry test classes to ensure they run sequentially on the same worker when using pytest-xdist (-n auto).

  Root cause: tests marked @pytest.mark.serial were still being parallelized in CI because pytest-xdist doesn't respect custom markers by default. With host-level telemetry batching (PR #718), tests running in parallel would share the same TelemetryClient and interfere with each other's event counting, causing test_concurrent_queries_sends_telemetry to see 88 events instead of the expected 60.

  The xdist_group marker ensures all tests in the "serial_telemetry" group run on the same worker sequentially, preventing state interference.

* Fix telemetry test fixtures: clean up state before AND after tests

  Modified the telemetry_setup_teardown fixtures to clean up TelemetryClientFactory state both BEFORE and AFTER each test, not just after. This prevents leftover state from previous tests (pending events, active executors) from interfering with the current test.

  Root cause: in CI with sequential execution on the same worker, if a previous test left pending telemetry events in the executor, those events could be captured by the next test's mock, inflating event counts (88 instead of 60). Now all shared state is reset before each test starts, ensuring complete isolation between tests.

* Fix CI test failure: clear _flush_event between tests

  The _flush_event threading.Event was never cleared after stopping the flush thread, remaining in the "set" state. This caused timing issues in subsequent tests where the Event was already signaled, triggering unexpected flush behavior and extra captured telemetry events (88 instead of 60).

  Now the _flush_event flag is explicitly cleared in both setup (before the test) and teardown (after the test) to ensure clean state isolation between tests. This explains why CI consistently got 88 events: the flush event from previous tests triggered additional flushes during test execution.

* Add debug workflow and output to diagnose CI test failure

  1. Created a new workflow, test-telemetry-only.yml, that runs only the failing telemetry test with -n auto, mimicking real CI but much faster.
  2. Added debug output to the test showing:
     - Client-side captured events
     - Number of futures/batches
     - Number of server responses
     - Server-reported successful events

  This will help identify why CI gets 88 events vs. 60 locally.

* Fix workflow: add krb5 system dependency

  The workflow was failing during poetry install due to missing krb5 system libraries needed for the kerberos dependencies.

* Fix xdist_group: add --dist=loadgroup to pytest commands

  The @pytest.mark.xdist_group markers were being ignored because pytest-xdist uses --dist=load by default, which doesn't respect groups. With --dist=loadgroup, tests in the same xdist_group run sequentially on the same worker, preventing telemetry state interference between tests.

  This is the ROOT CAUSE of the 88 vs. 60 events issue: tests were running in parallel across workers instead of sequentially on one worker as intended.

* Add aggressive flush before test to prevent event interference

  CI shows 72 events instead of 60. Debug output reveals:
  - Client captured: 60 events (correct)
  - Server received: 72 events across 2 batches

  The 12 extra events accumulate in the timing window between fixture cleanup and mock setup. Other tests (like circuit breaker tests not in our xdist_group) may be sending telemetry concurrently.

  Solution: add an explicit flush+shutdown RIGHT BEFORE setting up the mock to ensure a completely clean slate with zero buffered events.

* Split workflow: isolate telemetry tests in a separate job

  To prevent interference from other e2e tests, split into two jobs:

  Job 1 (run-non-telemetry-tests):
  - Runs all e2e tests EXCEPT telemetry tests
  - Uses -n auto for parallel execution

  Job 2 (run-telemetry-tests):
  - Runs ONLY telemetry tests
  - Depends on Job 1 completing (needs: run-non-telemetry-tests)
  - Fresh Python process = complete isolation
  - No ambient telemetry from other tests

  This eliminates the 68 vs. 60 event discrepancy by ensuring telemetry tests run in a clean environment with zero interference.

* Fix workflows: add krb5 deps and clean up debug code

  Changes across multiple workflows:
  1. integration.yml:
     - Add krb5 system dependency to the telemetry job
     - Fixes: krb5-config command not found error during poetry install
  2. code-coverage.yml:
     - Add krb5 system dependency
     - Split telemetry tests into a separate step for isolation
     - Maintains coverage accumulation with --cov-append
  3. publish-test.yml:
     - Add krb5 system dependency for consistent builds
  4. test_concurrent_telemetry.py:
     - Remove debug print statements
  5. Delete test-telemetry-only.yml:
     - Remove the temporary debug workflow

  All workflows now have proper telemetry test isolation and the required system dependencies for the kerberos packages.

* Fix publish-test.yml: update Python 3.9 -> 3.10

  Poetry 2.3.2 installation fails with Python 3.9 ("Installing Poetry (2.3.2): An error occurred."). Other workflows use Python 3.10 and work fine; updating to match ensures consistency and avoids Poetry installation issues.

* Fix integration workflow: remove --dist=loadgroup from non-telemetry tests

  - Remove --dist=loadgroup from the non-telemetry job (only needed for telemetry)
  - Remove test_telemetry_e2e.py from the telemetry job (it was skipped before)
  - This should fix the test_uc_volume_life_cycle failure caused by the changed test distribution

* Fix code-coverage workflow: remove test_telemetry_e2e.py from coverage tests

  - Only run test_concurrent_telemetry.py in the isolated telemetry step
  - test_telemetry_e2e.py was excluded in the original workflow; keep it excluded

* Fix publish-test workflow: remove cache conditional

  - Always run poetry install (not just on cache miss)
  - Ensures a fresh install with system dependencies (krb5)
  - Matches the pattern used in integration.yml

* Fix publish-test.yml: remove duplicate krb5 install, restore cache conditional

  - Remove the duplicate system dependencies step
  - Restore the cache conditional to match the main branch
  - Keep Python 3.10 (our change from 3.9)

* Fix code-coverage: remove serial tests step

  - All serial tests are telemetry tests (test_concurrent_telemetry.py and test_telemetry_e2e.py)
  - They are already run in the isolated telemetry step
  - Running -m serial with --ignore on both files results in 0 tests (exit code 5)

---------

Signed-off-by: Madhavendra Rathore <madhavendra.rathore@databricks.com>
Signed-off-by: Claude Sonnet 4.5 <noreply@anthropic.com>
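The core of the Issue #729 fix is switching the telemetry executor's shutdown to non-blocking mode. As a minimal standalone sketch (not the driver's actual code), `shutdown(wait=False)` returns immediately while already-submitted work finishes in the background:

```python
import time
from concurrent.futures import ThreadPoolExecutor

# Simulates an in-flight telemetry upload.
def slow_upload():
    time.sleep(2)
    return "sent"

executor = ThreadPoolExecutor(max_workers=1)
future = executor.submit(slow_upload)

start = time.monotonic()
# wait=False: shutdown returns immediately instead of blocking ~2s,
# which is what keeps teardown from hanging on pending telemetry.
executor.shutdown(wait=False)
elapsed = time.monotonic() - start

print(f"shutdown returned in {elapsed:.3f}s")
print(future.result())  # the submitted task still completes
```

Note that `wait=False` only makes the `shutdown()` call itself non-blocking; the worker threads still finish their current tasks, which is why the reviewer considered the longer request timeout acceptable.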
1 parent cafed60 commit 61f8029

File tree

8 files changed: +161 −17 lines changed


.github/workflows/code-coverage.yml

Lines changed: 11 additions & 4 deletions
```diff
@@ -31,6 +31,13 @@ jobs:
       with:
         python-version: "3.10"
       #----------------------------------------------
+      # ----- install system dependencies -----
+      #----------------------------------------------
+      - name: Install system dependencies
+        run: |
+          sudo apt-get update
+          sudo apt-get install -y libkrb5-dev
+      #----------------------------------------------
       # ----- install & configure poetry -----
       #----------------------------------------------
       - name: Install Poetry
@@ -80,13 +87,13 @@ jobs:
             -v

       #----------------------------------------------
-      # run serial tests with coverage
+      # run telemetry tests with coverage (isolated)
       #----------------------------------------------
-      - name: Run serial tests with coverage
+      - name: Run telemetry tests with coverage (isolated)
         continue-on-error: false
         run: |
-          poetry run pytest tests/e2e \
-            -m "serial" \
+          # Run test_concurrent_telemetry.py separately for isolation
+          poetry run pytest tests/e2e/test_concurrent_telemetry.py \
            --cov=src \
            --cov-append \
            --cov-report=xml \
```

.github/workflows/integration.yml

Lines changed: 48 additions & 4 deletions
```diff
@@ -7,7 +7,7 @@ on:
   pull_request:

 jobs:
-  run-e2e-tests:
+  run-non-telemetry-tests:
     runs-on: ubuntu-latest
     environment: azure-prod
     env:
@@ -59,9 +59,53 @@ jobs:
       #----------------------------------------------
       # run test suite
       #----------------------------------------------
-      - name: Run e2e tests (excluding daily-only tests)
+      - name: Run non-telemetry e2e tests
        run: |
-          # Exclude telemetry E2E tests from PR runs (run daily instead)
+          # Exclude all telemetry tests - they run in separate job for isolation
          poetry run python -m pytest tests/e2e \
            --ignore=tests/e2e/test_telemetry_e2e.py \
-            -n auto
+            --ignore=tests/e2e/test_concurrent_telemetry.py \
+            -n auto
+
+  run-telemetry-tests:
+    runs-on: ubuntu-latest
+    needs: run-non-telemetry-tests  # Run after non-telemetry tests complete
+    environment: azure-prod
+    env:
+      DATABRICKS_SERVER_HOSTNAME: ${{ secrets.DATABRICKS_HOST }}
+      DATABRICKS_HTTP_PATH: ${{ secrets.TEST_PECO_WAREHOUSE_HTTP_PATH }}
+      DATABRICKS_TOKEN: ${{ secrets.DATABRICKS_TOKEN }}
+      DATABRICKS_CATALOG: peco
+      DATABRICKS_USER: ${{ secrets.TEST_PECO_SP_ID }}
+    steps:
+      - name: Check out repository
+        uses: actions/checkout@v4
+      - name: Set up python
+        id: setup-python
+        uses: actions/setup-python@v5
+        with:
+          python-version: "3.10"
+      - name: Install system dependencies
+        run: |
+          sudo apt-get update
+          sudo apt-get install -y libkrb5-dev
+      - name: Install Poetry
+        uses: snok/install-poetry@v1
+        with:
+          virtualenvs-create: true
+          virtualenvs-in-project: true
+          installer-parallel: true
+      - name: Load cached venv
+        id: cached-poetry-dependencies
+        uses: actions/cache@v4
+        with:
+          path: .venv
+          key: venv-${{ runner.os }}-${{ steps.setup-python.outputs.python-version }}-${{ github.event.repository.name }}-${{ hashFiles('**/poetry.lock') }}
+      - name: Install dependencies
+        run: poetry install --no-interaction --all-extras
+      - name: Run telemetry tests in isolation
+        run: |
+          # Run test_concurrent_telemetry.py in isolation with complete process separation
+          # Use --dist=loadgroup to respect @pytest.mark.xdist_group markers
+          poetry run python -m pytest tests/e2e/test_concurrent_telemetry.py \
+            -n auto --dist=loadgroup -v
```
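The scheduling relied on here can be sketched with a toy test module. Note that pytest-xdist only honors `xdist_group` when invoked with `--dist=loadgroup`; under the default `--dist=load`, the marker is ignored and the tests may land on different workers:

```python
import pytest

# With `pytest -n auto --dist=loadgroup`, every test carrying the same
# xdist_group name is scheduled onto a single worker and runs sequentially,
# so tests sharing process-global telemetry state cannot interleave.
@pytest.mark.xdist_group(name="serial_telemetry")
class TestGroupedTelemetry:
    def test_first(self):
        assert True

    def test_second(self):
        assert True
```

Applying the marker at class level attaches it to every test method, which is why a single decorator on `TestE2ETelemetry` and `TestTelemetryE2E` is enough to serialize both suites.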

.github/workflows/publish-test.yml

Lines changed: 1 addition & 1 deletion
```diff
@@ -14,7 +14,7 @@ jobs:
         id: setup-python
         uses: actions/setup-python@v5
         with:
-          python-version: 3.9
+          python-version: "3.10"
       #----------------------------------------------
       # ----- install & configure poetry -----
       #----------------------------------------------
```

src/databricks/sql/client.py

Lines changed: 3 additions & 0 deletions
```diff
@@ -306,6 +306,8 @@
             )
             self.session.open()
         except Exception as e:
+            # Respect user's telemetry preference even during connection failure
+            enable_telemetry = kwargs.get("enable_telemetry", True)
             TelemetryClientFactory.connection_failure_log(
                 error_name="Exception",
                 error_message=str(e),
@@ -316,6 +318,7 @@
                 user_agent=self.session.useragent_header
                 if hasattr(self, "session")
                 else None,
+                enable_telemetry=enable_telemetry,
             )
             raise e
```

src/databricks/sql/common/unified_http_client.py

Lines changed: 10 additions & 2 deletions
```diff
@@ -217,15 +217,15 @@ def _should_use_proxy(self, target_host: str) -> bool:
             logger.debug("Error checking proxy bypass for host %s: %s", target_host, e)
             return True

-    def _get_pool_manager_for_url(self, url: str) -> urllib3.PoolManager:
+    def _get_pool_manager_for_url(self, url: str) -> Optional[urllib3.PoolManager]:
         """
         Get the appropriate pool manager for the given URL.

         Args:
             url: The target URL

         Returns:
-            PoolManager instance (either direct or proxy)
+            PoolManager instance (either direct or proxy), or None if client is closed
         """
         parsed_url = urllib.parse.urlparse(url)
         target_host = parsed_url.hostname
@@ -291,6 +291,14 @@ def request_context(
         # Select appropriate pool manager based on target URL
         pool_manager = self._get_pool_manager_for_url(url)

+        # DEFENSIVE: Check if pool_manager is None (client closing/closed)
+        # This prevents AttributeError race condition when telemetry cleanup happens
+        if pool_manager is None:
+            logger.debug(
+                "HTTP client closing or closed, cannot make request to %s", url
+            )
+            raise RequestError("HTTP client is closing or has been closed")
+
         response = None

         try:
```
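The defensive check above follows a general pattern: a method whose backing resource may be torn down concurrently returns `Optional[...]`, and the caller converts `None` into a clear, catchable error instead of an `AttributeError` deep inside an async callback. A minimal sketch with hypothetical names (not the driver's classes):

```python
from typing import Optional

class ClientClosedError(RuntimeError):
    """Raised when a request is attempted on a closing/closed client."""

class MiniHttpClient:
    def __init__(self):
        # Stand-in for a urllib3 PoolManager.
        self._pool: Optional[object] = object()

    def close(self):
        # Another thread may observe this mid-request.
        self._pool = None

    def _get_pool(self) -> Optional[object]:
        return self._pool  # may be None once close() has run

    def request(self, url: str) -> str:
        pool = self._get_pool()
        # Convert the race into an explicit error the caller can handle.
        if pool is None:
            raise ClientClosedError(f"client closed, cannot request {url}")
        return f"ok: {url}"

client = MiniHttpClient()
print(client.request("https://example.com"))
client.close()
# client.request(...) would now raise ClientClosedError, not AttributeError
```

Raising a driver-level error type (`RequestError` in the actual change) means telemetry cleanup races surface as ordinary request failures that the existing error handling already swallows.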

src/databricks/sql/telemetry/telemetry_client.py

Lines changed: 38 additions & 2 deletions
```diff
@@ -42,6 +42,7 @@
 from databricks.sql.common.feature_flag import FeatureFlagsContextFactory
 from databricks.sql.common.unified_http_client import UnifiedHttpClient
 from databricks.sql.common.http import HttpMethod
+from databricks.sql.exc import RequestError
 from databricks.sql.telemetry.telemetry_push_client import (
     ITelemetryPushClient,
     TelemetryPushClient,
@@ -417,10 +418,38 @@ def export_latency_log(
         )

     def close(self):
-        """Flush remaining events before closing"""
+        """Flush remaining events before closing
+
+        IMPORTANT: This method does NOT close self._http_client.
+
+        Rationale:
+        - _flush() submits async work to the executor that uses _http_client
+        - If we closed _http_client here, async callbacks would fail with AttributeError
+        - Instead, we let _http_client live as long as needed:
+          * Pending futures hold references to self (via bound methods)
+          * This keeps self alive, which keeps self._http_client alive
+          * When all futures complete, Python GC will clean up naturally
+        - The __del__ method ensures eventual cleanup during garbage collection
+
+        This design prevents race conditions while keeping telemetry truly async.
+        """
         logger.debug("Closing TelemetryClient for connection %s", self._session_id_hex)
         self._flush()

+    def __del__(self):
+        """Cleanup when TelemetryClient is garbage collected
+
+        This ensures _http_client is eventually closed when the TelemetryClient
+        object is destroyed. By this point, all async work should be complete
+        (since the futures held references keeping us alive), so it's safe to
+        close the http client.
+        """
+        try:
+            if hasattr(self, "_http_client") and self._http_client:
+                self._http_client.close()
+        except Exception:
+            pass
+

 class _TelemetryClientHolder:
     """
@@ -674,7 +703,8 @@ def close(host_url):
         )
         try:
             TelemetryClientFactory._stop_flush_thread()
-            TelemetryClientFactory._executor.shutdown(wait=True)
+            # Use wait=False to allow process to exit immediately
+            TelemetryClientFactory._executor.shutdown(wait=False)
         except Exception as e:
             logger.debug("Failed to shutdown thread pool executor: %s", e)
         TelemetryClientFactory._executor = None
@@ -689,9 +719,15 @@ def connection_failure_log(
         port: int,
         client_context,
         user_agent: Optional[str] = None,
+        enable_telemetry: bool = True,
    ):
        """Send error telemetry when connection creation fails, using provided client context"""

+        # Respect user's telemetry preference - don't force-enable
+        if not enable_telemetry:
+            logger.debug("Telemetry disabled, skipping connection failure log")
+            return
+
        UNAUTH_DUMMY_SESSION_ID = "unauth_session_id"

        TelemetryClientFactory.initialize_telemetry_client(
```
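The lifetime argument in the `close()` docstring — a pending future holds a bound method, the bound method holds `self`, so the object (and its HTTP client) survives until the async work drains, after which `__del__` closes the resource — can be seen in a small sketch with hypothetical names:

```python
import gc
from concurrent.futures import ThreadPoolExecutor

closed = []

class Resource:
    def close(self):
        closed.append(True)

class Client:
    def __init__(self):
        self._http = Resource()

    def _send(self, event):
        # The real client would use self._http here.
        return f"sent {event}"

    def __del__(self):
        # Eventual cleanup: by GC time no futures reference us anymore.
        try:
            self._http.close()
        except Exception:
            pass

executor = ThreadPoolExecutor(max_workers=1)
client = Client()
# The future's bound method `client._send` keeps `client` alive...
future = executor.submit(client._send, "e1")
del client
# ...so the work completes even after our local name is gone.
assert future.result() == "sent e1"
executor.shutdown(wait=True)
gc.collect()  # belt-and-braces for non-refcounting interpreters
print(closed)
```

Once the executor drops its work item, nothing references the client, so `__del__` runs and the resource is closed exactly once.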

tests/e2e/test_concurrent_telemetry.py

Lines changed: 25 additions & 1 deletion
```diff
@@ -27,6 +27,7 @@ def run_in_threads(target, num_threads, pass_index=False):


 @pytest.mark.serial
+@pytest.mark.xdist_group(name="serial_telemetry")
 class TestE2ETelemetry(PySQLPytestTestCase):
     @pytest.fixture(autouse=True)
     def telemetry_setup_teardown(self):
@@ -35,13 +36,27 @@ def telemetry_setup_teardown(self):
         before each test and shuts it down afterward. Using a fixture makes
         this robust and automatic.
         """
+        # Clean up BEFORE test starts to ensure no leftover state from previous tests
+        # Use wait=True to ensure all pending telemetry from previous tests completes
+        # This prevents those events from being captured by this test's mock
+        if TelemetryClientFactory._executor:
+            TelemetryClientFactory._executor.shutdown(wait=True)  # WAIT for pending telemetry
+            TelemetryClientFactory._executor = None
+        TelemetryClientFactory._stop_flush_thread()
+        TelemetryClientFactory._flush_event.clear()  # Clear the event flag
+        TelemetryClientFactory._clients.clear()
+        TelemetryClientFactory._initialized = False
+
         try:
             yield
         finally:
+            # Clean up AFTER test ends
+            # Use wait=True to ensure this test's telemetry completes before next test starts
             if TelemetryClientFactory._executor:
-                TelemetryClientFactory._executor.shutdown(wait=True)
+                TelemetryClientFactory._executor.shutdown(wait=True)  # WAIT for this test's telemetry
                 TelemetryClientFactory._executor = None
             TelemetryClientFactory._stop_flush_thread()
+            TelemetryClientFactory._flush_event.clear()  # Clear the event flag
             TelemetryClientFactory._clients.clear()
             TelemetryClientFactory._initialized = False

@@ -50,6 +65,14 @@ def test_concurrent_queries_sends_telemetry(self):
         An E2E test where concurrent threads execute real queries against
         the staging endpoint, while we capture and verify the generated telemetry.
         """
+        # Extra flush right before test starts to clear any events that accumulated
+        # between fixture cleanup and now (e.g., from other tests on same worker)
+        if TelemetryClientFactory._executor:
+            TelemetryClientFactory._executor.shutdown(wait=True)
+            TelemetryClientFactory._executor = None
+        TelemetryClientFactory._clients.clear()
+        TelemetryClientFactory._initialized = False
+
         num_threads = 30
         capture_lock = threading.Lock()
         captured_telemetry = []
@@ -139,6 +162,7 @@ def execute_query_worker(thread_id):
             assert "errors" not in response or not response["errors"]
             if "numProtoSuccess" in response:
                 total_successful_events += response["numProtoSuccess"]
+
         assert total_successful_events == num_threads * 2

         assert (
```
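The `_flush_event` bug fixed here is plain `threading.Event` semantics: an Event stays set until someone calls `clear()`, so a signal left over from a previous test is observed instantly by the next flush loop. A standalone sketch:

```python
import threading

flush_event = threading.Event()

# "Test A" signals a flush, then shuts down WITHOUT clearing the flag.
flush_event.set()

# "Test B"'s flush loop uses wait(timeout) as a periodic timer; because the
# event is still set, wait() returns True immediately instead of sleeping,
# triggering an extra, unexpected flush cycle.
fired_immediately = flush_event.wait(timeout=0)
print(fired_immediately)  # True - stale state leaked across "tests"

# The fix: clear the event in both setup and teardown.
flush_event.clear()
print(flush_event.wait(timeout=0.05))  # False - the timer now sleeps as intended
```

This is why clearing only in teardown was insufficient: a crashed or interrupted test skips teardown, so the setup-side `clear()` is the safety net.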

tests/e2e/test_telemetry_e2e.py

Lines changed: 25 additions & 3 deletions
```diff
@@ -44,23 +44,45 @@ def connection(self, extra_params=()):


 @pytest.mark.serial
+@pytest.mark.xdist_group(name="serial_telemetry")
 class TestTelemetryE2E(TelemetryTestBase):
     """E2E tests for telemetry scenarios - must run serially due to shared host-level telemetry client"""

     @pytest.fixture(autouse=True)
     def telemetry_setup_teardown(self):
         """Clean up telemetry client state before and after each test"""
+        # Clean up BEFORE test starts
+        # Use wait=True to ensure all pending telemetry from previous tests completes
+        if TelemetryClientFactory._executor:
+            TelemetryClientFactory._executor.shutdown(wait=True)  # WAIT for pending telemetry
+            TelemetryClientFactory._executor = None
+        TelemetryClientFactory._stop_flush_thread()
+        TelemetryClientFactory._flush_event.clear()  # Clear the event flag
+        TelemetryClientFactory._clients.clear()
+        TelemetryClientFactory._initialized = False
+
+        # Clear feature flags cache before test starts
+        from databricks.sql.common.feature_flag import FeatureFlagsContextFactory
+        with FeatureFlagsContextFactory._lock:
+            FeatureFlagsContextFactory._context_map.clear()
+            if FeatureFlagsContextFactory._executor:
+                FeatureFlagsContextFactory._executor.shutdown(wait=False)
+                FeatureFlagsContextFactory._executor = None
+
         try:
             yield
         finally:
+            # Clean up AFTER test ends
+            # Use wait=True to ensure this test's telemetry completes
             if TelemetryClientFactory._executor:
-                TelemetryClientFactory._executor.shutdown(wait=True)
+                TelemetryClientFactory._executor.shutdown(wait=True)  # WAIT for this test's telemetry
                 TelemetryClientFactory._executor = None
             TelemetryClientFactory._stop_flush_thread()
+            TelemetryClientFactory._flush_event.clear()  # Clear the event flag
             TelemetryClientFactory._clients.clear()
             TelemetryClientFactory._initialized = False

-            # Clear feature flags cache to prevent state leakage between tests
-            from databricks.sql.common.feature_flag import FeatureFlagsContextFactory
+            # Clear feature flags cache after test ends
             with FeatureFlagsContextFactory._lock:
                 FeatureFlagsContextFactory._context_map.clear()
                 if FeatureFlagsContextFactory._executor:
```