Conversation

@skattoju commented on Oct 9, 2025

Overview

Consolidates and fixes e2e tests to run successfully in Kind-based GitHub Actions with OpenShift/MicroShift compatibility.

Key Changes

Tests & Workflows

  • Merged duplicate workflows (removed health-check.yaml)
  • Consolidated test functions with auto-detection for model availability
  • Fixed Kind cluster creation (explicit config file)
  • Added comprehensive diagnostic logging

CRDs & Dependencies

  • Installed 5 OpenShift/KServe/Kubeflow CRDs for helm chart compatibility
  • Added helm dependency build step
  • Removed --wait flag to avoid PVC binding timeouts

Configuration

  • Disabled components not needed for basic e2e: llm-service, configure-pipeline, ingestion-pipeline, mcp-servers (see the values sketch below)
  • Updated to flexible version constraints (>=)
  • Fixed port formatting in values-e2e.yaml
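
For reference, a minimal sketch of what the e2e values override might look like. The key names (llm-service.enabled, mcp-servers.enabled, the rag-ui service port) are assumptions based on the bullets above, not the chart's actual schema:

```yaml
# values-e2e.yaml -- hypothetical sketch; key names are assumptions
llm-service:
  enabled: false
configure-pipeline:
  enabled: false
ingestion-pipeline:
  enabled: false
mcp-servers:
  enabled: false

rag-ui:              # subchart name is an assumption
  service:
    port: 8080       # port formatting fix shown as a plain integer (assumption)
```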

What Gets Tested

✅ RAG UI accessibility
✅ Llama Stack connectivity
✅ API endpoints & health checks
⏭️ Model inference (auto-skipped if no models configured)

Results

  • Duration: ~15-20 minutes
  • Resources: 4 CPU, 16GB RAM
  • Environment: Kind + OpenShift CRDs

Note

This is a lightweight deployment validation test. For full functionality testing with models, enable llm-service and set SKIP_MODEL_TESTS=false.

- Add user workflow test simulating real application usage
- Deploy full RAG stack in Kind for CI testing
- Optimize Helm values for a CPU-only environment
- Run on PRs, pushes, and manual dispatch
@skattoju marked this pull request as ready for review on October 13, 2025 19:40

@yashoza19 left a comment


/lgtm

- Install OpenShift Route CRD in Kind cluster for compatibility
- Update workflow to support OpenShift-specific resources
- Add fallback CRD definition if upstream Route CRD unavailable
- Update documentation to reflect MicroShift compatibility testing
- Ensure helm install works with OpenShift Route resources

This enables testing the RAG application in an environment that
mirrors MicroShift/OpenShift deployments while using Kind for CI.
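
For illustration, a fallback Route CRD of the kind described could be a skeletal, schema-permissive definition along these lines; this is a sketch, not the upstream OpenShift CRD:

```yaml
# Minimal stand-in for routes.route.openshift.io (sketch only)
apiVersion: apiextensions.k8s.io/v1
kind: CustomResourceDefinition
metadata:
  name: routes.route.openshift.io
spec:
  group: route.openshift.io
  names:
    kind: Route
    listKind: RouteList
    plural: routes
    singular: route
  scope: Namespaced
  versions:
    - name: v1
      served: true
      storage: true
      schema:
        openAPIV3Schema:
          type: object
          x-kubernetes-preserve-unknown-fields: true
```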
The kind-action was failing because the inline config YAML wasn't being
parsed correctly. Creating the config file explicitly before passing it
to kind-action resolves the issue.
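
A rough sketch of that pattern in the workflow, assuming helm/kind-action and a single-node cluster (both placeholders for whatever the workflow actually uses):

```yaml
# Hypothetical workflow steps: write the Kind config to disk, then pass its path
- name: Write Kind config
  run: |
    cat > kind-config.yaml <<'EOF'
    kind: Cluster
    apiVersion: kind.x-k8s.io/v1alpha4
    nodes:
      - role: control-plane
    EOF
- name: Create Kind cluster
  uses: helm/kind-action@v1
  with:
    config: kind-config.yaml
```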
This step is required to fetch chart dependencies (pgvector, minio,
llm-service, configure-pipeline, ingestion-pipeline, llama-stack)
before helm install. Without this, the installation fails with missing
dependencies error.
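
As a sketch, the added step can be a one-liner; the chart path here is a placeholder:

```yaml
# Hypothetical workflow step: fetch subchart dependencies before helm install
- name: Build Helm chart dependencies
  run: helm dependency build ./deploy/helm/rag   # chart path is an assumption
```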
Disable llm-service and configure-pipeline components that require:
- InferenceService (serving.kserve.io/v1beta1)
- ServingRuntime (serving.kserve.io/v1alpha1)
- DataSciencePipelinesApplication (datasciencepipelinesapplications.opendatahub.io/v1)
- Notebook (kubeflow.org/v1)

These CRDs are not available in Kind clusters. The llama-stack component
provides the inference capabilities we need for basic e2e testing without
requiring KServe.

Install minimal CRD definitions to satisfy Helm chart validation even
though the actual components (llm-service, configure-pipeline,
ingestion-pipeline) are disabled in e2e tests.

CRDs installed:
- routes.route.openshift.io (OpenShift)
- inferenceservices.serving.kserve.io (KServe)
- servingruntimes.serving.kserve.io (KServe)
- datasciencepipelinesapplications.datasciencepipelinesapplications.opendatahub.io (OpenDataHub)
- notebooks.kubeflow.org (Kubeflow)

This approach allows Kind-based e2e tests to work with helm charts that
reference these CRDs without requiring full MicroShift/OpenShift setup.
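
Each of these can be a skeletal definition that only needs to make the API group and kind discoverable; a sketch for one of the five (the others follow the same pattern):

```yaml
# Minimal stand-in for inferenceservices.serving.kserve.io (sketch only)
apiVersion: apiextensions.k8s.io/v1
kind: CustomResourceDefinition
metadata:
  name: inferenceservices.serving.kserve.io
spec:
  group: serving.kserve.io
  names:
    kind: InferenceService
    listKind: InferenceServiceList
    plural: inferenceservices
    singular: inferenceservice
  scope: Namespaced
  versions:
    - name: v1beta1
      served: true
      storage: true
      schema:
        openAPIV3Schema:
          type: object
          x-kubernetes-preserve-unknown-fields: true
```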
Even with enabled: false, the configure-pipeline subchart was trying
to create a PVC. Explicitly disable persistence and PVC creation to
prevent the PersistentVolumeClaim pipeline-vol from blocking deployment.
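
The override described could look roughly like this; the persistence key names are assumptions about the subchart's values, not its documented schema:

```yaml
# Hypothetical values override to stop the pipeline-vol PVC from being templated
configure-pipeline:
  enabled: false
  persistence:
    enabled: false   # key name is an assumption
```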
Disabled subcharts (configure-pipeline, llm-service, ingestion-pipeline)
still create resources, including PVCs that may never bind. Remove --wait
from helm install and instead wait explicitly for only the core
deployments we need (the RAG UI and llamastack).

This prevents the 20-minute timeout waiting for unused resources.
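
A sketch of that flow in the workflow; the release name, chart path, namespace, and deployment names are placeholders:

```yaml
# Hypothetical install without --wait, followed by targeted waits on core deployments
- name: Install chart
  run: |
    helm install rag ./deploy/helm/rag -f values-e2e.yaml \
      --namespace rag --create-namespace
- name: Wait for core deployments
  run: |
    kubectl rollout status deployment/rag-ui -n rag --timeout=600s
    kubectl rollout status deployment/llamastack -n rag --timeout=600s
```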
Added detailed logging throughout the wait process:
- List all resources before waiting
- Show deployment and pod status
- Describe deployments to see configuration
- Show events to catch scheduling/image pull issues
- Add failure handlers with detailed diagnostics
- Show logs on failure
- Exit with error on timeout for faster feedback

This will help identify why deployments get stuck (image pull, resource
constraints, scheduling issues, etc.).
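
In workflow terms, the diagnostics might boil down to a failure-only step like the sketch below; resource and namespace names are placeholders:

```yaml
# Hypothetical diagnostics step, run only when an earlier step fails
- name: Dump diagnostics
  if: failure()
  run: |
    kubectl get all -n rag
    kubectl get events -n rag --sort-by=.lastTimestamp
    kubectl describe deployments -n rag
    kubectl logs -n rag deployment/rag-ui --tail=200 || true
    kubectl logs -n rag deployment/llamastack --tail=200 || true
```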
Disabled in e2e tests:
- minio.sampleFileUpload: Job was failing with ImagePullBackOff
- mcp-servers: Not needed for basic e2e tests
- ingestion-pipeline: Add top-level enabled: false

These components were creating pods with image pull issues that blocked
deployment. We only need the core stack (rag UI + llamastack + pgvector + minio)
for basic e2e testing.

The llamastack init container was waiting for a model service endpoint
created by llm-service (which we disabled). For basic e2e tests:
- Removed global.models configuration
- Disabled llamastack init containers
- Focus on testing UI/backend connectivity without full model inference

This allows the e2e tests to validate the application stack without
requiring KServe/llm-service infrastructure.

Modified test_user_workflow.py to focus on connectivity and health checks:
- Skip model inference tests when SKIP_MODEL_TESTS=true (default)
- Test UI accessibility
- Test backend connectivity
- Test API endpoint availability
- Test health endpoints

This allows e2e tests to validate application deployment without
requiring full model serving infrastructure, significantly reducing
resource requirements and startup time.
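
One way the test step could be wired up, with skipping controlled through an environment variable; the test path and variable plumbing are assumptions:

```yaml
# Hypothetical workflow step invoking the e2e tests
- name: Run e2e tests
  env:
    SKIP_MODEL_TESTS: "true"   # skip inference tests; the default described above
  run: pytest tests/e2e/test_user_workflow.py -v   # test path is an assumption
```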
- Fixed NameError by removing INFERENCE_MODEL print statement
- Set ingestion-pipeline replicaCount: 0 to prevent pod creation
- Restored INFERENCE_MODEL variable from environment
- Added intelligent model detection (SKIP_MODEL_TESTS=auto by default)
- Tests will automatically skip inference if no models configured
- Tests will run inference if models are available (future-proof)
- Gracefully handles both scenarios without errors

The Llama Stack API returns 404 on the root endpoint (/), which is valid
behavior for API-only services. Allow both 200 and 404 status codes
to pass the connectivity test.

@skattoju (Author)

closing in favour of #65

@skattoju closed this on Oct 24, 2025