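Startup Alarm Reporting Pipeline Design Document
1. Overview
1.1 Background
During startup, LoongCollector may raise alarms before the AlarmPipeline is fully ready. Because the alarm-sending mechanism depends on:
As a result, alarms raised early in startup (while the pipeline is not yet ready) cannot be sent normally, and important alarm information may be lost.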
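1.2 Solution
This change implements a startup-phase on-disk alarm buffer. Alarms raised before the AlarmPipeline is ready are persisted to disk, then automatically recovered and sent once the pipeline becomes ready. The mechanism has the following characteristics:
1.3 Core Components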
2. Flowchart
graph TB
    Start([System startup]) --> Init[AlarmManager initialization<br/>mAlarmPipelineReady = false<br/>file path: alarm_disk_buffer.json]
    Init --> ReceiveAlarm[Receive alarm via SendAlarm]
    ReceiveAlarm --> CheckPipeline{Is AlarmPipeline<br/>ready?}
    CheckPipeline -->|ready| WriteBuffer[Write to in-memory buffer<br/>deduplicate and count]
    CheckPipeline -->|not ready| CheckLevel{Does the alarm level<br/>meet the condition?}
    CheckLevel -->|no| Discard1[Discard alarm]
    CheckLevel -->|yes| CheckWindow{Within the startup<br/>window?}
    CheckWindow -->|yes| CheckSize{Does the file size<br/>exceed the limit?}
    CheckWindow -->|no| Discard2
    CheckSize -->|yes| Discard2[Stop writing<br/>log an error]
    CheckSize -->|no| OpenFile[Open the file handle on first write]
    OpenFile --> WriteFile[Write to disk file<br/>alarm_disk_buffer.json]
    Discard2 --> CloseFile1[Close file handle]
    WriteBuffer --> End1([Alarm handled])
    WriteFile --> End1
    Discard1 --> End1
    CloseFile1 --> End1
    PipelineReady[AlarmPipeline ready] --> SendAlarms[SelfMonitorServer::SendAlarms]
    SendAlarms --> CheckFirst{First send?<br/>CheckAndSetAlarmPipelineReady}
    CheckFirst -->|yes, returned false| CloseFile2[Close the write file handle<br/>so that data is flushed]
    CloseFile2 --> ReadFile[Read disk file<br/>AlarmManager::ReadAlarmsFromFile<br/>group by key and sum count<br/>also return the raw JSON strings]
    ReadFile --> ParseFile[Parse the JSON file<br/>build PipelineEventGroup<br/>keep raw JSON grouped by region]
    ParseFile --> SendFile[Send the alarms from the file]
    SendFile --> LogError{Send failed?}
    LogError -->|yes| RecordError[Log an error<br/>including the raw JSON string]
    LogError -->|no| Continue[Continue with the next group]
    RecordError --> Continue
    Continue --> DeleteFile[Delete disk file<br/>AlarmManager::DeleteAlarmFile]
    DeleteFile --> SendBuffer[Send the alarms in the in-memory buffer]
    CheckFirst -->|no, returned true| SendBuffer
    SendBuffer --> FlushBuffer[FlushAllRegionAlarm<br/>fetch alarms from the in-memory buffer]
    FlushBuffer --> PushQueue[Push to the processing queue]
    PushQueue --> End2([Send complete])
    style WriteFile fill:#e1f5ff
    style ReadFile fill:#fff4e1
    style SendFile fill:#e8f5e9
    style Discard1 fill:#ffebee
    style Discard2 fill:#ffebee
    style OpenFile fill:#f3e5f5
    style CloseFile1 fill:#f3e5f5
    style CloseFile2 fill:#f3e5f5
3. Design Advantages
3.1 Efficiency
atomic_bool provides lock-free state checks, avoiding lock contention
3.2 Low Resource Overhead
3.3 Controlled Risk
3.4 Data Integrity
4. Configuration Parameters
logtail_startup_alarm_window_seconds
logtail_startup_alarm_file_max_size
logtail_startup_alarm_file_min_level
5. File Format
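5.1 File Naming
Format:
alarm_disk_buffer.json
File path:
{AgentDataDir}/alarm_disk_buffer.json
Note: the file path is fixed and contains no timestamp. Across consecutive restarts, all alarms are appended to the same file; on read, they are grouped by key and their counts are summed.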
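The grouping step described in the note can be sketched as follows. This is a minimal illustration, not the code in AlarmManager::ReadAlarmsFromFile: the AlarmRecord struct, the MergeAlarms function, and the choice of grouping key (region, alarm type, project, category, config) are all assumptions made for the example, and JSON parsing is left out.

```cpp
#include <cstdint>
#include <map>
#include <string>
#include <tuple>
#include <utility>
#include <vector>

// Hypothetical in-memory form of one line of alarm_disk_buffer.json,
// after JSON parsing (parsing itself is omitted in this sketch).
struct AlarmRecord {
    std::string region;
    std::string alarmType;
    std::string projectName;
    std::string category;
    std::string config;
    std::string message;
    int64_t timestamp = 0;
    uint32_t count = 1; // each line on disk carries alarm_count "1"
};

// Assumed grouping key; the key actually used by AlarmManager may differ.
using AlarmKey = std::tuple<std::string, std::string, std::string, std::string, std::string>;

// Groups records by key and sums their counts. The first record seen for a
// key supplies the message and timestamp of the merged entry.
std::map<AlarmKey, AlarmRecord> MergeAlarms(const std::vector<AlarmRecord>& records) {
    std::map<AlarmKey, AlarmRecord> merged;
    for (const auto& r : records) {
        AlarmKey key{r.region, r.alarmType, r.projectName, r.category, r.config};
        auto it = merged.find(key);
        if (it == merged.end()) {
            merged.emplace(std::move(key), r);
        } else {
            it->second.count += r.count;
        }
    }
    return merged;
}
```

Under these assumptions, if the same alarm is written once in each of three consecutive restarts, it appears as three lines in the file and comes back as a single entry with a count of 3.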
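5.2 JSON Format
One JSON object per line, with the following fields:
{ "region": "cn-hangzhou", "alarm_type": "USER_CONFIG_ALARM", "alarm_level": "1", "alarm_message": "Config file not found", "alarm_count": "1", "timestamp": 1704067200, "project_name": "my-project", "category": "my-category", "config": "my-config" }
Field descriptions:
region: required, the region the alarm belongs to
alarm_type: required, the alarm type
alarm_level: required, the alarm level (1=WARNING, 2=ERROR, 3=CRITICAL)
alarm_message: required, the alarm message
alarm_count: required, the alarm count (always written as "1"; summed per key on read)
timestamp: required, the alarm time as a Unix timestamp
project_name: optional, the project name
category: optional, the category
config: optional, the config name
Important notes:
alarm_count is fixed at "1"
6. Test Coverage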
Unit test file:
core/unittest/monitor/AlarmDiskBufferUnittest.cpp
Test cases include:
7. Summary
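By buffering alarms on disk, the startup alarm reporting pipeline prevents alarms from being lost while the pipeline is not yet ready. The design preserves data integrity and, through efficient file operations, resource limits, and fault tolerance, persists and recovers alarms with low overhead and low risk.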
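As a closing illustration, the write-side decisions from the flowchart in section 2 can be sketched in a few lines. This is a simplified model under stated assumptions, not the AlarmManager implementation: StartupAlarmGate, StartupAlarmLimits, and AlarmRoute are hypothetical names, the limit values are placeholders rather than real defaults, and "level meets the condition" is assumed to mean level >= minimum level.

```cpp
#include <atomic>
#include <cstdint>

// Hypothetical stand-ins for the three startup-alarm parameters in section 4.
// The values are placeholders, not the real defaults.
struct StartupAlarmLimits {
    int64_t windowSeconds = 600;
    uint64_t fileMaxSize = 1024 * 1024;
    int minLevel = 2;
};

enum class AlarmRoute { MemoryBuffer, DiskFile, Discard };

class StartupAlarmGate {
public:
    StartupAlarmGate(StartupAlarmLimits limits, int64_t startTime)
        : mLimits(limits), mStartTime(startTime) {}

    // Decides where an alarm goes, following the order of checks in the flowchart.
    AlarmRoute Route(int level, int64_t now, uint64_t currentFileSize) const {
        // Lock-free readiness check: once the pipeline is ready, use the normal path.
        if (mPipelineReady.load(std::memory_order_acquire)) {
            return AlarmRoute::MemoryBuffer;
        }
        if (level < mLimits.minLevel) {
            return AlarmRoute::Discard; // assumed meaning of "level condition not met"
        }
        if (now - mStartTime > mLimits.windowSeconds) {
            return AlarmRoute::Discard; // outside the startup window
        }
        if (currentFileSize > mLimits.fileMaxSize) {
            return AlarmRoute::Discard; // file size limit exceeded
        }
        return AlarmRoute::DiskFile;
    }

    // Marks the pipeline ready and returns the previous state, so the caller
    // sees false exactly once and can replay the disk file on that first call.
    bool CheckAndSetPipelineReady() {
        return mPipelineReady.exchange(true, std::memory_order_acq_rel);
    }

private:
    StartupAlarmLimits mLimits;
    int64_t mStartTime;
    std::atomic<bool> mPipelineReady{false};
};
```

Using an atomic exchange for the ready flag means the first-send check and the state change happen in one step, so two concurrent senders cannot both decide to replay the file.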