Conversation

@skattoju commented on Oct 9, 2025

Overview

Consolidates and fixes e2e tests to run successfully in Kind-based GitHub Actions with OpenShift/MicroShift compatibility.

Key Changes

Tests & Workflows

  • Merged duplicate workflows (removed health-check.yaml)
  • Consolidated test functions with auto-detection for model availability
  • Fixed Kind cluster creation (explicit config file)
  • Added comprehensive diagnostic logging

CRDs & Dependencies

  • Installed 5 OpenShift/KServe/Kubeflow CRDs for helm chart compatibility
  • Added helm dependency build step
  • Removed --wait flag to avoid PVC binding timeouts

Configuration

  • Disabled components not needed for basic e2e: llm-service, configure-pipeline, ingestion-pipeline, mcp-servers (see the values sketch below)
  • Updated to flexible version constraints (>=)
  • Fixed port formatting in values-e2e.yaml
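
For reference, a minimal sketch of what the e2e values override might look like. The key names (llm-service.enabled, mcp-servers.enabled, the rag-ui service port) are assumptions based on the bullets above, not the chart's actual schema:

```yaml
# values-e2e.yaml -- hypothetical sketch; key names are assumptions
llm-service:
  enabled: false
configure-pipeline:
  enabled: false
ingestion-pipeline:
  enabled: false
mcp-servers:
  enabled: false

rag-ui:              # subchart name is an assumption
  service:
    port: 8080       # port formatting fix shown as a plain integer (assumption)
```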

What Gets Tested

✅ RAG UI accessibility
✅ Llama Stack connectivity
✅ API endpoints & health checks
⏭️ Model inference (auto-skipped if no models configured)

Results

  • Duration: ~15-20 minutes
  • Resources: 4 CPU, 16GB RAM
  • Environment: Kind + OpenShift CRDs

Note

This is a lightweight deployment validation test. For full functionality testing with models, enable llm-service and set SKIP_MODEL_TESTS=false.

- Add user workflow test simulating real application usage
- Deploy full RAG stack in Kind for CI testing
- Optimize Helm values for a CPU-only environment
- Run on PRs, pushes, and manual dispatch
@skattoju marked this pull request as ready for review on October 13, 2025 19:40

@yashoza19 left a comment


/lgtm

- Install OpenShift Route CRD in Kind cluster for compatibility
- Update workflow to support OpenShift-specific resources
- Add fallback CRD definition if upstream Route CRD unavailable
- Update documentation to reflect MicroShift compatibility testing
- Ensure helm install works with OpenShift Route resources

This enables testing the RAG application in an environment that
mirrors MicroShift/OpenShift deployments while using Kind for CI.
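
For illustration, a fallback Route CRD of the kind described could be a skeletal, schema-permissive definition along these lines; this is a sketch, not the upstream OpenShift CRD:

```yaml
# Minimal stand-in for routes.route.openshift.io (sketch only)
apiVersion: apiextensions.k8s.io/v1
kind: CustomResourceDefinition
metadata:
  name: routes.route.openshift.io
spec:
  group: route.openshift.io
  names:
    kind: Route
    listKind: RouteList
    plural: routes
    singular: route
  scope: Namespaced
  versions:
    - name: v1
      served: true
      storage: true
      schema:
        openAPIV3Schema:
          type: object
          x-kubernetes-preserve-unknown-fields: true
```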
The kind-action was failing because the inline config YAML wasn't being
parsed correctly. Creating the config file explicitly before passing it
to kind-action resolves the issue.
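
A rough sketch of that pattern in the workflow, assuming helm/kind-action and a single-node cluster (both placeholders for whatever the workflow actually uses):

```yaml
# Hypothetical workflow steps: write the Kind config to disk, then pass its path
- name: Write Kind config
  run: |
    cat > kind-config.yaml <<'EOF'
    kind: Cluster
    apiVersion: kind.x-k8s.io/v1alpha4
    nodes:
      - role: control-plane
    EOF
- name: Create Kind cluster
  uses: helm/kind-action@v1
  with:
    config: kind-config.yaml
```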
This step is required to fetch chart dependencies (pgvector, minio,
llm-service, configure-pipeline, ingestion-pipeline, llama-stack)
before helm install. Without this, the installation fails with missing
dependencies error.
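
As a sketch, the added step can be a one-liner; the chart path here is a placeholder:

```yaml
# Hypothetical workflow step: fetch subchart dependencies before helm install
- name: Build Helm chart dependencies
  run: helm dependency build ./deploy/helm/rag   # chart path is an assumption
```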
Disable llm-service and configure-pipeline components that require:
- InferenceService (serving.kserve.io/v1beta1)
- ServingRuntime (serving.kserve.io/v1alpha1)
- DataSciencePipelinesApplication (datasciencepipelinesapplications.opendatahub.io/v1)
- Notebook (kubeflow.org/v1)

These CRDs are not available in Kind clusters. The llama-stack component
provides the inference capabilities we need for basic e2e testing without
requiring KServe.

Install minimal CRD definitions to satisfy Helm chart validation even
though the actual components (llm-service, configure-pipeline,
ingestion-pipeline) are disabled in e2e tests.

CRDs installed:
- routes.route.openshift.io (OpenShift)
- inferenceservices.serving.kserve.io (KServe)
- servingruntimes.serving.kserve.io (KServe)
- datasciencepipelinesapplications.datasciencepipelinesapplications.opendatahub.io (OpenDataHub)
- notebooks.kubeflow.org (Kubeflow)

This approach allows Kind-based e2e tests to work with helm charts that
reference these CRDs without requiring full MicroShift/OpenShift setup.
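
Each of these can be a skeletal definition that only needs to make the API group and kind discoverable; a sketch for one of the five (the others follow the same pattern):

```yaml
# Minimal stand-in for inferenceservices.serving.kserve.io (sketch only)
apiVersion: apiextensions.k8s.io/v1
kind: CustomResourceDefinition
metadata:
  name: inferenceservices.serving.kserve.io
spec:
  group: serving.kserve.io
  names:
    kind: InferenceService
    listKind: InferenceServiceList
    plural: inferenceservices
    singular: inferenceservice
  scope: Namespaced
  versions:
    - name: v1beta1
      served: true
      storage: true
      schema:
        openAPIV3Schema:
          type: object
          x-kubernetes-preserve-unknown-fields: true
```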
Even with enabled: false, the configure-pipeline subchart was trying
to create a PVC. Explicitly disable persistence and PVC creation to
prevent the PersistentVolumeClaim pipeline-vol from blocking deployment.
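
The override described could look roughly like this; the persistence key names are assumptions about the subchart's values, not its documented schema:

```yaml
# Hypothetical values override to stop the pipeline-vol PVC from being templated
configure-pipeline:
  enabled: false
  persistence:
    enabled: false   # key name is an assumption
```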
Disabled subcharts (configure-pipeline, llm-service, ingestion-pipeline)
still create resources, including PVCs that may never bind. Remove --wait
from helm install and instead wait explicitly for only the core
deployments we need (the RAG UI and llamastack).

This prevents the 20-minute timeout waiting for unused resources.
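
A sketch of that flow in the workflow; the release name, chart path, namespace, and deployment names are placeholders:

```yaml
# Hypothetical install without --wait, followed by targeted waits on core deployments
- name: Install chart
  run: |
    helm install rag ./deploy/helm/rag -f values-e2e.yaml \
      --namespace rag --create-namespace
- name: Wait for core deployments
  run: |
    kubectl rollout status deployment/rag-ui -n rag --timeout=600s
    kubectl rollout status deployment/llamastack -n rag --timeout=600s
```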
Added detailed logging throughout the wait process:
- List all resources before waiting
- Show deployment and pod status
- Describe deployments to see configuration
- Show events to catch scheduling/image pull issues
- Add failure handlers with detailed diagnostics
- Show logs on failure
- Exit with error on timeout for faster feedback

This will help identify why deployments get stuck (image pull, resource
constraints, scheduling issues, etc.).
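
In workflow terms, the diagnostics might boil down to a failure-only step like the sketch below; resource and namespace names are placeholders:

```yaml
# Hypothetical diagnostics step, run only when an earlier step fails
- name: Dump diagnostics
  if: failure()
  run: |
    kubectl get all -n rag
    kubectl get events -n rag --sort-by=.lastTimestamp
    kubectl describe deployments -n rag
    kubectl logs -n rag deployment/rag-ui --tail=200 || true
    kubectl logs -n rag deployment/llamastack --tail=200 || true
```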
Disabled in e2e tests:
- minio.sampleFileUpload: Job was failing with ImagePullBackOff
- mcp-servers: Not needed for basic e2e tests
- ingestion-pipeline: Add top-level enabled: false

These components were creating pods with image pull issues that blocked
deployment. We only need the core stack (rag UI + llamastack + pgvector + minio)
for basic e2e testing.

The llamastack init container was waiting for a model service endpoint
created by llm-service (which we disabled). For basic e2e tests:
- Removed global.models configuration
- Disabled llamastack init containers
- Focus on testing UI/backend connectivity without full model inference

This allows the e2e tests to validate the application stack without
requiring KServe/llm-service infrastructure.

Modified test_user_workflow.py to focus on connectivity and health checks:
- Skip model inference tests when SKIP_MODEL_TESTS=true (default)
- Test UI accessibility
- Test backend connectivity
- Test API endpoint availability
- Test health endpoints

This allows e2e tests to validate application deployment without
requiring full model serving infrastructure, significantly reducing
resource requirements and startup time.
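
One way the test step could be wired up, with skipping controlled through an environment variable; the test path and variable plumbing are assumptions:

```yaml
# Hypothetical workflow step invoking the e2e tests
- name: Run e2e tests
  env:
    SKIP_MODEL_TESTS: "true"   # skip inference tests; the default described above
  run: pytest tests/e2e/test_user_workflow.py -v   # test path is an assumption
```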
- Fixed NameError by removing INFERENCE_MODEL print statement
- Set ingestion-pipeline replicaCount: 0 to prevent pod creation
- Restored INFERENCE_MODEL variable from environment
- Added intelligent model detection (SKIP_MODEL_TESTS=auto by default)
- Tests will automatically skip inference if no models configured
- Tests will run inference if models are available (future-proof)
- Gracefully handles both scenarios without errors

The Llama Stack API returns 404 on the root endpoint (/), which is valid
behavior for API-only services. Allow both 200 and 404 status codes
to pass the connectivity test.

@skattoju (Author)

closing in favour of #65

@skattoju closed this on Oct 24, 2025