feat: add cilium log collector #4208
base: master
Conversation
- Currently puts Makefile targets in a separate included Makefile (can be changed).
- Currently uses server-side apply to apply the Cilium log collector sidecar (can be changed).
- Currently adds the Cilium log collector only to the overlay Cilium scenario.
- The log collector is only available on Linux. Images are built and published for the CNI release and ACN PR pipelines, but only used in the ACN PR pipeline for now.
- Files are server-side applied, and we expect the Cilium daemonset to be server-side applied first, before the log collector is added.
The container uses host networking, so ordinarily we would need dnsPolicy ClusterFirstWithHostNet or in-cluster DNS would fail. But if Cilium has dnsPolicy ClusterFirstWithHostNet, k8s will add CoreDNS's IP to the pod's /etc/resolv.conf. Cilium must come up for CoreDNS to come up, yet when Cilium starts it makes a DNS request to CoreDNS as a result of /etc/resolv.conf specifying CoreDNS's IP. CoreDNS at that point is down, however, and the request times out, so Cilium can't start up either. ClusterFirst with hostNetwork is actually "Default" (it uses the VM's /etc/resolv.conf, likely 168.63.129.16/Azure DNS).
- Ideally we would use a filter plugin, but there's no Go framework for that right now.
- Takes a dependency on the ACN method we use for getting IMDS info (can be changed); see the sketch after this list.
- Sets the Kubernetes filter plugin to use the tag for pod name etc.; unlike the default setting, we can enrich without a connection to CoreDNS, the k8s API, or the kubelet. This is important because CoreDNS will not be running when Cilium starts up.
- Tested the data pipeline through ADX in development.
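For context, IMDS enrichment only needs the link-local metadata endpoint, so it works before CoreDNS (or anything in-cluster) is up. A minimal sketch of such a lookup against the standard Azure IMDS endpoint follows; it is illustrative only — the PR actually reuses ACN's existing `common.Metadata` helper, and the struct fields here are a hypothetical subset:

```go
package main

import (
	"encoding/json"
	"fmt"
	"net/http"
	"time"
)

// computeMetadata holds a few illustrative fields; ACN's common.Metadata
// (the helper actually reused in this PR) has its own shape.
type computeMetadata struct {
	Location       string `json:"location"`
	Name           string `json:"name"`
	SubscriptionID string `json:"subscriptionId"`
}

// fetchIMDSMetadata queries the link-local Azure IMDS endpoint. This works
// from a host-network pod with no dependency on CoreDNS, the Kubernetes API,
// or the kubelet, which is why enrichment still works before CoreDNS is up.
func fetchIMDSMetadata() (*computeMetadata, error) {
	req, err := http.NewRequest(http.MethodGet,
		"http://169.254.169.254/metadata/instance/compute?api-version=2021-02-01&format=json", nil)
	if err != nil {
		return nil, err
	}
	req.Header.Set("Metadata", "true") // required by IMDS

	client := &http.Client{Timeout: 5 * time.Second}
	resp, err := client.Do(req)
	if err != nil {
		return nil, err
	}
	defer resp.Body.Close()

	var m computeMetadata
	if err := json.NewDecoder(resp.Body).Decode(&m); err != nil {
		return nil, err
	}
	return &m, nil
}

func main() {
	m, err := fetchIMDSMetadata()
	if err != nil {
		fmt.Println("IMDS lookup failed:", err)
		return
	}
	fmt.Printf("vm=%s location=%s sub=%s\n", m.Name, m.Location, m.SubscriptionID)
}
```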
…g collector — affects: CNI release test, ACN PR, ACN official build, nightly, eBPF host routing/MSFT Cilium; see what breaks. The previous commit passes on overlay only.
We can try out server-side apply with Cilium eBPF later; for now, keep as-is.
Pull request overview
This PR introduces a Cilium log collector based on Fluent Bit and wires it into Cilium deployments, build pipelines, and tests. It also switches Cilium-related manifests to server-side apply to support patching the daemonset with the log collector container.
Changes:
- Add a new `cilium-log-collector` Go module, Fluent Bit output plugin, unit tests, Dockerfile, and Makefile wiring (including build, image, and manifest targets).
- Introduce Cilium log collector DaemonSet patches and ConfigMaps for Cilium v1.17 and v1.18, update AKS deployment Make targets to use server-side apply, and add a target to patch Cilium with the log collector.
- Extend CI/CD pipelines to build/publish the `cilium-log-collector` images and to deploy the log collector into Cilium-based test clusters; update validation tests to use the correct Cilium container.
Reviewed changes
Copilot reviewed 26 out of 27 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
| `test/validate/linux_validate.go` | Ensures the Cilium validation check targets the `cilium-agent` container explicitly. |
| `test/integration/manifests/cilium/v1.17/cilium-log-collector/daemonset-patch.yaml` | Adds a DaemonSet patch to inject the log collector sidecar into Cilium v1.17 agents. |
| `test/integration/manifests/cilium/v1.17/cilium-log-collector/cilium-log-collector-configmap.yaml` | Defines the Fluent Bit + App Insights configuration for the Cilium log collector on v1.17, including input/filter/output settings. |
| `test/integration/manifests/cilium/v1.18/cilium-log-collector/daemonset-patch.yaml` | Same Cilium log collector DaemonSet patching logic for Cilium v1.18. |
| `test/integration/manifests/cilium/v1.18/cilium-log-collector/cilium-log-collector-configmap.yaml` | Same Fluent Bit/App Insights configuration for Cilium v1.18 clusters. |
| `hack/aks/deploy.mk` | Switches Cilium deployment to server-side apply and adds an `add-cilium-log-collector` target and related env vars for injecting the log collector. |
| `cilium-log-collector/out_azure_app_insights.go` | Implements the Fluent Bit output plugin to send Cilium logs to App Insights, including IMDS metadata enrichment and record processing helpers. |
| `cilium-log-collector/out_azure_app_insights_test.go` | Provides unit tests for record processing, type conversion, metadata enrichment, and the tracker abstraction for the log collector plugin. |
| `cilium-log-collector/go.mod` | Declares the new module and its dependencies for the log collector plugin. |
| `cilium-log-collector/go.sum` | Locks dependency versions for the new cilium-log-collector module. |
| `cilium-log-collector/Makefile` | Adds build, archive, test, and image/manifest targets specific to the Cilium log collector. |
| `cilium-log-collector/Dockerfile` | Defines a multi-stage build image that compiles the plugin and layers it into the Fluent Bit base image. |
| `Makefile` | Includes the log collector's Makefile, excludes its tags from ACN_VERSION, adds it to all-binaries, and adds it to the Go workspace. |
| `.pipelines/singletenancy/cilium/*.yaml` | After deploying Cilium in various single-tenancy scenarios, exports log collector image vars and patches the DaemonSet with the log collector. |
| `.pipelines/singletenancy/cilium-overlay/**/*.yaml` | Same wiring for overlay and overlay-with-Hubble variants in single-tenancy pipelines. |
| `.pipelines/singletenancy/cilium-nodesubnet/*.yaml` | Ensures Cilium node-subnet scenarios also deploy the log collector. |
| `.pipelines/singletenancy/cilium-dualstack-overlay/*.yaml` | Adds log collector deployment to dualstack overlay Cilium scenarios. |
| `.pipelines/pipeline.yaml` | Adds cilium-log-collector to the image build and multi-arch manifest stages. |
| `.pipelines/cni/pipeline.yaml` | Adds cilium-log-collector to CNI release image builds and manifests and keeps the rest of the pipeline logically equivalent while normalizing line endings. |
| `.pipelines/cni/cilium/cilium-scale-test.yaml` | Ensures the Cilium scale test environment includes the log collector. |
| `.pipelines/cni/cilium/cilium-overlay-load-test-template.yaml` | Ensures Cilium overlay and overlay+Hubble load test clusters are patched with the log collector. |
The previous commit passes all of: CNI release test, ACN PR, and ACN official build.
santhoshmprabhu left a comment
Left some comments. Also, any thoughts on the ability to turn this off, to mitigate any urgent issues?
```go
	hostMetadata *common.Metadata
)

func convertToString(v interface{}) string {
```
Do you need to define this? I'd assume there's a utility function for this.
I needed some specific behavior that the built-in functions don't provide. For example, the conversion would usually just be Sprintf("%v", val), which is the default case here, but that by itself prints out bytes or map[field: value], which doesn't show up or parse well in Azure Data Explorer/Kusto. Also, straight marshaling into a string doesn't work if the "key" value isn't a string.
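The exact implementation isn't shown in this thread; a hypothetical sketch of the behavior described (byte slices rendered as text, non-string map keys stringified before JSON marshaling, Sprintf("%v", val) as the default case) might look like:

```go
package main

import (
	"encoding/json"
	"fmt"
)

// convertToString renders a msgpack-decoded Fluent Bit value as a string
// that parses well in ADX/Kusto. A sketch, not the PR's exact code.
func convertToString(v interface{}) string {
	switch val := v.(type) {
	case nil:
		return ""
	case string:
		return val
	case []byte:
		// Fluent Bit delivers strings as byte slices; Sprintf("%v") would
		// print the raw byte values instead of the text.
		return string(val)
	case map[interface{}]interface{}:
		// json.Marshal fails when map keys aren't strings, so stringify
		// the keys first, then marshal; fall back to Sprintf on error.
		m := make(map[string]interface{}, len(val))
		for k, inner := range val {
			m[convertToString(k)] = inner
		}
		if b, err := json.Marshal(m); err == nil {
			return string(b)
		}
		return fmt.Sprintf("%v", val)
	default:
		return fmt.Sprintf("%v", val)
	}
}

func main() {
	fmt.Println(convertToString([]byte("hello")))                               // hello
	fmt.Println(convertToString(map[interface{}]interface{}{"field": "value"})) // {"field":"value"}
}
```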
```go
	telemetryConfig := appinsights.NewTelemetryConfiguration(instrumentationKey)
	telemetryConfig.MaxBatchInterval = 1 * time.Second
	telemetryConfig.MaxBatchSize = 10
```
What does this set? Batch size in MB? Please add comments.
It's the maximum number of telemetry items that can be submitted in each request (a count, not megabytes); I'll add a comment.
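For reference, a commented version of that configuration using the `github.com/microsoft/ApplicationInsights-Go/appinsights` package shown in the diff (the surrounding function is illustrative):

```go
package sketch

import (
	"time"

	"github.com/microsoft/ApplicationInsights-Go/appinsights"
)

// newTelemetryClient builds an App Insights client with the batching
// settings discussed above.
func newTelemetryClient(instrumentationKey string) appinsights.TelemetryClient {
	telemetryConfig := appinsights.NewTelemetryConfiguration(instrumentationKey)
	// Longest interval the client buffers items before submitting a batch,
	// so collected logs are flushed at least once per second.
	telemetryConfig.MaxBatchInterval = 1 * time.Second
	// Maximum number of telemetry items (a count, not megabytes) that can
	// be submitted in each request.
	telemetryConfig.MaxBatchSize = 10
	return appinsights.NewTelemetryClientFromConfig(telemetryConfig)
}
```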
```go
	if logKey == "" {
		logKey = "log"
	}
	debug = output.FLBPluginConfigKey(plugin, "debug")
```
This is to debug the plugin itself, unrelated to the logs, correct?
Also, any chance of storing the filename here, to catch the exact container whose logs are being collected?
This is for debugging the plugin, yes.
If we wanted a specific container's logs, we could modify the input section of the Fluent Bit config to regex-match on it.
If you're referring to storing something that identifies the file the log came from, each log collected should already be enriched with the container name.
```go
}

// ProcessSingleRecord handles processing of an individual record
func (rp *RecordProcessor) ProcessSingleRecord(record ProcessRecord, recordIndex int, metadata *common.Metadata) {
```
I recall you mentioned reusing the AppInsights logic from CNI. Is the code duplicated? If yes, it'd be good to move to a common location and reuse.
I am reusing the code that gets the IMDS metadata, so that part isn't duplicated. The logic to send the logs out doesn't need to be as complex as the CNI's; the for loop below is meant to parse through the record's fields and get them into the custom fields map, which would need to be done regardless.
Populating the custom fields with metadata.* is, I believe, clearer than what we do in aitelemetry/telemetrywrapper.go's TrackLog, and the metadata fields we use here are different from TrackLog's.
We then directly call new trace telemetry. Although you could import and use NewAITelemetry or similar, you would just be replacing this 4-5-line call to NewTraceTelemetry + Track with something of similar complexity.
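A sketch of that direct call path, assuming a TelemetryClient built as in the configuration snippet above and a customFields map produced by the record-processing loop:

```go
package sketch

import "github.com/microsoft/ApplicationInsights-Go/appinsights"

// sendTrace shows the "4-5 line" NewTraceTelemetry + Track path described
// above; client and customFields are assumed to come from the surrounding
// plugin code.
func sendTrace(client appinsights.TelemetryClient, logMessage string, customFields map[string]string) {
	trace := appinsights.NewTraceTelemetry(logMessage, appinsights.Information)
	// Copy record fields (and metadata.* entries) into the trace properties.
	for k, v := range customFields {
		trace.Properties[k] = v
	}
	client.Track(trace)
}
```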
```yaml
# Note: Daemonset must be server side applied prior. Then just do: kubectl apply --server-side daemonset-patch.yaml
```
I believe you do need to set the field manager to log-collector, otherwise you might overwrite. I think that's how you are applying in this PR, but this comment may cause an unfamiliar person to try a plain apply, without the field-manager setting. Please verify manual apply, and update the above comment if necessary.
Yeah, true; maybe I'll just instruct them to use the make target instead.
```yaml
# Note: Daemonset must be server side applied prior. Then just do: kubectl apply --server-side daemonset-patch.yaml
```
Same comment from the 1.17 file.
camrynl left a comment
Discussed some questions offline about log storage: ~168 hours or potentially longer, and no logs should be missed since collection goes back to the start of the container.
Tested, and the previous disable file works; the Cilium log collector version does get piped through the data pipeline successfully.
Reason for Change:
Adds a Cilium log collector image, based on the MCR Fluent Bit image, with the added App Insights output plugin. The plugin takes in an instrumentation key, enriches each log with information from the IMDS, and pushes it to App Insights.
Moves Cilium to server-side apply so we can test the Cilium log collector by applying it to the Cilium daemonset (does not touch eBPF host routing, since that might break a dependent pipeline).
To add the Cilium log collector to any Cilium cluster, it should be as simple as setting the log collector's image registry and version and then running `make -C ./hack/aks add-cilium-log-collector`.
pipeline.yaml, put through a text diff checker: the majority of the diff is a CRLF -> LF change.
Builds the Cilium log collector in the CNI release, ACN PR, and ACN official build pipelines.
Puts Makefile targets in a separate included Makefile. For the image, the Dockerfile's `$VERSION` is passed in from `$TAG`, which is actually `$(CILIUM_LOG_COLLECTOR_PLATFORM_TAG)`. Added the log collector to the ACN_VERSION exclude, and added it to the all-binaries target. The Cilium log collector is Linux-only.
Issue Fixed:
Requirements:
Notes:
ACN PR Pass: https://msazure.visualstudio.com/One/_build/results?buildId=151173566&view=results
CNI Release Test Mostly Pass: https://msazure.visualstudio.com/One/_build/results?buildId=151174815&view=results
ACN Test Unofficial Pass: https://dev.azure.com/msazure/One/_build/results?buildId=151174219&view=results
Cilium Nightly Mostly Pass: https://msazure.visualstudio.com/One/_build/results?buildId=151143234&view=results
Cilium private 1.18 w/ upgrade doesn't pass on upgrade, but that's not related to this change (even without it, the eBPF stage fails): https://msazure.visualstudio.com/One/_build/results?buildId=151451236&view=results
You can also confirm the image is being deployed by clicking on the "install cilium" job under each e2e stage.