Skip to content

Conversation

@QxBytes
Copy link
Contributor

@QxBytes QxBytes commented Jan 29, 2026

Reason for Change:

Adds cilium log collector image which is based on mcr fluent bit, with the added app insights output plugin. The plugin takes in an instrumentation key and enriches it with information from the imds, pushing it to app insights.

Moves cilium to server side apply so we can test the cilium log collector by applying it to the cilium daemonset (does not affect ebpf host routing since that might break a dependent pipeline).

To add cilium log collector to any cilium cluster, should be as simple as setting the log collector's image registry and version, and then running make -C ./hack/aks add-cilium-log-collector

pipeline.yaml put through a text diff checker the majority is a CRLF -> LF change

Builds cilium log collector in cni release, acn pr, and acn official build

Puts makefile targets in a separate included makefile. For the image, the dockerfile $VERSION is passed in from $TAG which is actually $(CILIUM_LOG_COLLECTOR_PLATFORM_TAG). Added the log collector to ACN VERSION exclude, and added it to the all binaries target

Cilium log collector is only for linux

Issue Fixed:

Requirements:

Notes:
ACN PR Pass: https://msazure.visualstudio.com/One/_build/results?buildId=151173566&view=results
CNI Release Test Mostly Pass: https://msazure.visualstudio.com/One/_build/results?buildId=151174815&view=results
ACN Test Unofficial Pass: https://dev.azure.com/msazure/One/_build/results?buildId=151174219&view=results
Cilium Nightly Mostly Pass: https://msazure.visualstudio.com/One/_build/results?buildId=151143234&view=results
Cilium private 1.18 w/ upgrade doesn't pass on upgrade but that's not related this change (even without that ebpf stage fails): https://msazure.visualstudio.com/One/_build/results?buildId=151451236&view=results
You can also confirm the image is being deployed by clicking on the "install cilium" job under each e2e stage

currently puts makefile targets in a separate included makefile, which can be changed
currently uses server side apply to apply the cilium log collector sidecar, which can be changed
currently adds cilium log collector only to overlay cilium scenario
log collector only available on linux
builds and publishes images for cni release and acn pr but only used in acn pr for now
files are server side applied and we expect the cilium daemonset to be server side applied first before adding the log collector
the container is host network, so we must use ClusterFirstWithHostNet or dns will fail
if cilium has dnsPolicy ClusterFirstWithHostNet, k8s will add coredns's ip to the pod's /etc/resolv.conf
cilium must come up for coredns to come up, but when cilium starts it makes a dns request to core dns
as a result of the /etc/resolv.conf specifying core dns's ip. core dns at this point is down, however,
and the request times out, so cilium can't start up either

ClusterFirst with hostNetwork is actually "default" (uses vm's /etc/resolv.conf, likely 168.63.129.16/azure dns)
ideally we use a filter plugin but there's no go framework for that right now
take a dependency on the acn method we use for getting imds info (can be changed)
sets kubernetes filter plugin to use tag for pod name etc.; unlike the default setting
we can enrich without a connection to core dns/k8s api/kubelet-- this is important
because on cilium startup coredns will not be running

tested data pipeline through adx in development
…g collector

affects: cni release test, acn pr, acn official build, nightly, ebpf host routing/msft cil
see what breaks
previous commit passes on overlay only
we can try out server side apply with cilium ebpf later-- for now keep as-is
@QxBytes QxBytes self-assigned this Jan 29, 2026
@QxBytes QxBytes requested a review from a team as a code owner January 29, 2026 23:26
@QxBytes QxBytes added the ci Infra or tooling. label Jan 29, 2026
Copilot AI review requested due to automatic review settings January 29, 2026 23:26
@QxBytes QxBytes requested a review from msvik January 29, 2026 23:26
@QxBytes QxBytes added the cilium Related to Cilium. label Jan 29, 2026
Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR introduces a Cilium log collector based on Fluent Bit and wires it into Cilium deployments, build pipelines, and tests. It also switches Cilium-related manifests to server-side apply to support patching the daemonset with the log collector container.

Changes:

  • Add a new cilium-log-collector Go module, Fluent Bit output plugin, unit tests, Dockerfile, and Makefile wiring (including build, image, and manifest targets).
  • Introduce Cilium log collector DaemonSet patches and ConfigMaps for Cilium v1.17 and v1.18, update AKS deployment Make targets to use server-side apply, and add a target to patch Cilium with the log collector.
  • Extend CI/CD pipelines to build/publish the cilium-log-collector images and to deploy the log collector into Cilium-based test clusters; update validation tests to use the correct Cilium container.

Reviewed changes

Copilot reviewed 26 out of 27 changed files in this pull request and generated 3 comments.

Show a summary per file
File Description
test/validate/linux_validate.go Ensures the Cilium validation check targets the cilium-agent container explicitly.
test/integration/manifests/cilium/v1.17/cilium-log-collector/daemonset-patch.yaml Adds a DaemonSet patch to inject the log collector sidecar into Cilium v1.17 agents.
test/integration/manifests/cilium/v1.17/cilium-log-collector/cilium-log-collector-configmap.yaml Defines the Fluent Bit + App Insights configuration for the Cilium log collector on v1.17, including input/filter/output settings.
test/integration/manifests/cilium/v1.18/cilium-log-collector/daemonset-patch.yaml Same Cilium log collector DaemonSet patching logic for Cilium v1.18.
test/integration/manifests/cilium/v1.18/cilium-log-collector/cilium-log-collector-configmap.yaml Same Fluent Bit/App Insights configuration for Cilium v1.18 clusters.
hack/aks/deploy.mk Switches Cilium deployment to server-side apply and adds an add-cilium-log-collector target and related env vars for injecting the log collector.
cilium-log-collector/out_azure_app_insights.go Implements the Fluent Bit output plugin to send Cilium logs to App Insights, including IMDS metadata enrichment and record processing helpers.
cilium-log-collector/out_azure_app_insights_test.go Provides unit tests for record processing, type conversion, metadata enrichment, and the tracker abstraction for the log collector plugin.
cilium-log-collector/go.mod Declares the new module and its dependencies for the log collector plugin.
cilium-log-collector/go.sum Locks dependency versions for the new cilium-log-collector module.
cilium-log-collector/Makefile Adds build, archive, test, and image/manifest targets specific to the Cilium log collector.
cilium-log-collector/Dockerfile Defines a multi-stage build image that compiles the plugin and layers it into the Fluent Bit base image.
Makefile Includes the log collector’s Makefile, excludes its tags from ACN_VERSION, adds it to all-binaries, and adds it to the Go workspace.
.pipelines/singletenancy/cilium/*.yaml After deploying Cilium in various single-tenancy scenarios, exports log collector image vars and patches the DaemonSet with the log collector.
.pipelines/singletenancy/cilium-overlay/**/*.yaml Same wiring for overlay and overlay-with-Hubble variants in single-tenancy pipelines.
.pipelines/singletenancy/cilium-nodesubnet/*.yaml Ensures Cilium node-subnet scenarios also deploy the log collector.
.pipelines/singletenancy/cilium-dualstack-overlay/*.yaml Adds log collector deployment to dualstack overlay Cilium scenarios.
.pipelines/pipeline.yaml Adds cilium-log-collector to the image build and multi-arch manifest stages.
.pipelines/cni/pipeline.yaml Adds cilium-log-collector to CNI release image builds and manifests and keeps the rest of the pipeline logically equivalent while normalizing line endings.
.pipelines/cni/cilium/cilium-scale-test.yaml Ensures the Cilium scale test environment includes the log collector.
.pipelines/cni/cilium/cilium-overlay-load-test-template.yaml Ensures Cilium overlay and overlay+Hubble load test clusters are patched with the log collector.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Copy link
Contributor

@santhoshmprabhu santhoshmprabhu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Left some comments. Also, any thoughts on the ability to turn off to mitigate any urgent issues?

hostMetadata *common.Metadata
)

func convertToString(v interface{}) string {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you need to define this? I'd assume there's a utility function for this.

Copy link
Contributor Author

@QxBytes QxBytes Feb 2, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I needed some specific behavior that the built-in functions don't provide-- for example the conversion would usually be just like Sprintf("%v", val) which is the default case here, but that by itself prints out bytes or map[field: value] which doesn't show up/parse well in azure data explorer/kusto. Also straight marshaling into a string doesn't work if the "key" value isn't a string.


telemetryConfig := appinsights.NewTelemetryConfiguration(instrumentationKey)
telemetryConfig.MaxBatchInterval = 1 * time.Second
telemetryConfig.MaxBatchSize = 10
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What does this set? Batch size in MB? Please add comments.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maximum number of telemetry items that can be submitted in each request-- I'll add a comment

if logKey == "" {
logKey = "log"
}
debug = output.FLBPluginConfigKey(plugin, "debug")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is to debug the plugin itself, unrelated to the logs, correct?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, any chance of storing the filename here, to catch the exact container whose logs are being collected?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is for debugging the plugin, yes.

If we wanted a specific container's logs, we can modify the input section of the fluent bit config to regex match on it

If you're referring to storing something identifying the file the log came from, each log collected should be already enriched with the container name

}

// ProcessSingleRecord handles processing of an individual record
func (rp *RecordProcessor) ProcessSingleRecord(record ProcessRecord, recordIndex int, metadata *common.Metadata) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I recall you mentioned reusing the AppInsights logic from CNI. Is the code duplicated? If yes, it'd be good to move to a common location and reuse.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am reusing the code to get the IMDS metadata so that part isn't duplicated. The logic to send the logs out doesn't need to be as complex as the CNI-- the below for loop is meant to parse through the record's fields and get them into the custom fields map. This would need to be done regardless.

The populating of the custom fields with metadata.* I believe is clearer than what we do in aitelemetry/telemetrywrapper.go/TrackLog, and the metadata fields we use here are different from TrackLog.

We then directly call new trace telemetry. Although you could import and use NewAITelemetry or similar, you would just be replacing this 4-5 line call to NewTraceTelemetry + Track with something of similar complexity.

@@ -0,0 +1,36 @@
# Note: Daemonset must be server side applied prior. Then just do: kubectl apply --server-side daemonset-patch.yaml
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe you do need to set the field-manager to log-collector, otherwise you might overwrite. I think that's how you are applying in this PR, but this comment may cause an unfamiliar person to try just apply, without the filed-manager setting. Please verify manual apply, and update the above comment if necessary.

Copy link
Contributor Author

@QxBytes QxBytes Feb 2, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah true-- maybe I'll just instruct them to use the make target instead

@@ -0,0 +1,36 @@
# Note: Daemonset must be server side applied prior. Then just do: kubectl apply --server-side daemonset-patch.yaml
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same comment from the 1.17 file.

camrynl
camrynl previously approved these changes Feb 2, 2026
Copy link
Contributor

@camrynl camrynl left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

discussed some questions offline for log storage. ~168 hours or potentially longer and no logs should be missed since this will go back to the start of the container

tested and previous disable file works, and cilium log collector
version does get piped through the data pipeline successfully
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

ci Infra or tooling. cilium Related to Cilium. linux telemetry

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants