Takuka0311 commented Jan 15, 2026

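Design Document: Alarm Reporting Path During Startup

1. Overview

1.1 Background

During startup, LoongCollector may raise alarms before the AlarmPipeline is fully ready. Alarm delivery depends on three conditions:

  1. AlarmManager has started
  2. AlarmPipeline has loaded successfully (including the modules it depends on)
  3. The first send, which pulls data from the buffer, runs 3 seconds after startup

As a result, alarms raised early in startup, while the pipeline is not yet ready, cannot be sent through the normal path, and important alarm information may be lost.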

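1.2 Solution

This change implements a startup-phase disk buffer for alarms. Alarms raised before the AlarmPipeline is ready are persisted to disk, then automatically recovered and sent once the pipeline is ready. The mechanism has the following properties:

  • Efficient: uses append-only writes to minimize I/O overhead
  • Low resource cost: writes to the file only during the startup window (60 seconds by default) and stops automatically afterwards
  • Controlled risk: file operation failures do not affect the main flow, and writing stops automatically on timeout
  • Data integrity: supports deduplicated counting so alarm counts stay accurate; send failures are logged in detail to ease troubleshooting

1.3 Core components

  • AlarmManager: the alarm manager. It receives, buffers, and sends alarms, and contains the startup-phase disk write and recovery logic
  • SelfMonitorServer: the self-monitoring service. It sends alarms on a timer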

2. Flowchart

graph TB
    Start([System startup]) --> Init[AlarmManager initialization<br/>mAlarmPipelineReady = false<br/>File path: alarm_disk_buffer.json]
    Init --> ReceiveAlarm[Receive alarm: SendAlarm]

    ReceiveAlarm --> CheckPipeline{Is AlarmPipeline<br/>ready?}

    CheckPipeline -->|Ready| WriteBuffer[Write to in-memory buffer<br/>deduplicate and count]
    CheckPipeline -->|Not ready| CheckLevel{Does the alarm level<br/>meet the threshold?}

    CheckLevel -->|No| Discard1[Discard alarm]
    CheckLevel -->|Yes| CheckWindow{Within the startup<br/>window?}

    CheckWindow -->|Yes| CheckSize{File size<br/>over the limit?}
    CheckWindow -->|No| Discard2
    CheckSize -->|Yes| Discard2[Stop writing<br/>log an error]
    CheckSize -->|No| OpenFile[Open the file handle on first write]
    OpenFile --> WriteFile[Write to disk file<br/>alarm_disk_buffer.json]
    Discard2 --> CloseFile1[Close file handle]

    WriteBuffer --> End1([Alarm handled])
    WriteFile --> End1
    Discard1 --> End1
    CloseFile1 --> End1

    PipelineReady[AlarmPipeline ready] --> SendAlarms[SelfMonitorServer::SendAlarms]

    SendAlarms --> CheckFirst{First send?<br/>CheckAndSetAlarmPipelineReady}
    CheckFirst -->|Flag was false| CloseFile2[Close the write handle<br/>so data is flushed]
    CloseFile2 --> ReadFile[Read disk file<br/>AlarmManager::ReadAlarmsFromFile<br/>group by key and sum count<br/>also return the raw JSON strings]

    ReadFile --> ParseFile[Parse JSON file<br/>build PipelineEventGroup<br/>keep raw JSON grouped by region]
    ParseFile --> SendFile[Send alarms from the file]

    SendFile --> LogError{Send failed?}
    LogError -->|Yes| RecordError[Log an error<br/>including the raw JSON string]
    LogError -->|No| Continue[Continue with the next group]
    RecordError --> Continue
    Continue --> DeleteFile[Delete disk file<br/>AlarmManager::DeleteAlarmFile]
    DeleteFile --> SendBuffer[Send alarms from the in-memory buffer]

    CheckFirst -->|Flag was true| SendBuffer

    SendBuffer --> FlushBuffer[FlushAllRegionAlarm<br/>fetch alarms from the in-memory buffer]
    FlushBuffer --> PushQueue[Push to the processing queue]
    PushQueue --> End2([Send complete])

    style WriteFile fill:#e1f5ff
    style ReadFile fill:#fff4e1
    style SendFile fill:#e8f5e9
    style Discard1 fill:#ffebee
    style Discard2 fill:#ffebee
    style OpenFile fill:#f3e5f5
    style CloseFile1 fill:#f3e5f5
    style CloseFile2 fill:#f3e5f5
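
The write-path branch of the flowchart can be read as a single decision function. The sketch below is only an illustration of that reading, not the code in this PR: `StartupAlarmConfig`, `AlarmAction`, and `DecideAlarmAction` are hypothetical names, the defaults mirror the documented configuration parameters, and treating a value exactly equal to the limit as still allowed is an assumption.

```cpp
#include <cstdint>

// Hypothetical config holder; only the default values come from the design doc.
struct StartupAlarmConfig {
    int32_t windowSeconds = 60;      // logtail_startup_alarm_window_seconds
    int32_t fileMaxSize = 10485760;  // logtail_startup_alarm_file_max_size, in bytes
    int32_t fileMinLevel = 1;        // logtail_startup_alarm_file_min_level
};

// Hypothetical outcome type, one value per leaf of the write-path flowchart.
enum class AlarmAction {
    kWriteMemoryBuffer,  // pipeline ready: normal in-memory path
    kDiscard,            // level below the configured minimum
    kStopWriting,        // window elapsed or file too large: close handle, log error
    kWriteDiskFile       // buffer the alarm in alarm_disk_buffer.json
};

// Checks are ordered as in the flowchart: readiness, level, window, size.
inline AlarmAction DecideAlarmAction(const StartupAlarmConfig& cfg,
                                     bool pipelineReady,
                                     int32_t level,
                                     int64_t secondsSinceStart,
                                     int64_t currentFileSize) {
    if (pipelineReady) {
        return AlarmAction::kWriteMemoryBuffer;
    }
    if (level < cfg.fileMinLevel) {
        return AlarmAction::kDiscard;
    }
    if (secondsSinceStart > cfg.windowSeconds) {  // assumption: boundary is inclusive
        return AlarmAction::kStopWriting;
    }
    if (currentFileSize > cfg.fileMaxSize) {      // assumption: boundary is inclusive
        return AlarmAction::kStopWriting;
    }
    return AlarmAction::kWriteDiskFile;
}
```

Keeping the decision free of I/O makes each branch easy to unit test without touching the file system.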

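3. Design advantages

3.1 Efficiency

  • Uses atomic_bool for lock-free state checks, avoiding lock contention
  • Keeps a persistent file handle, avoiding repeated open/close calls and reducing system call overhead
  • Writes to the file in append mode, avoiding random I/O
  • Processes alarms in batches grouped by region, reducing the number of sends
  • Switches to the in-memory buffer as soon as the pipeline is ready, avoiding unnecessary file I/O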
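
The lock-free readiness check can be sketched with `std::atomic<bool>::exchange`. This is a minimal illustration, not the PR's implementation: the wrapper class `PipelineReadyFlag` is hypothetical, and the idea that `CheckAndSetAlarmPipelineReady` returns the previous flag value (false only for the first caller) is inferred from the flowchart rather than confirmed by the source.

```cpp
#include <atomic>

// Hypothetical wrapper around the mAlarmPipelineReady flag from the flowchart.
class PipelineReadyFlag {
public:
    // Lock-free read for the hot alarm-receiving path.
    bool IsReady() const { return mReady.load(std::memory_order_acquire); }

    // Marks the pipeline ready and returns the previous value.
    // Assumed semantics: false is returned exactly once, to the first caller,
    // which is then responsible for replaying the disk file.
    bool CheckAndSetReady() { return mReady.exchange(true, std::memory_order_acq_rel); }

private:
    std::atomic<bool> mReady{false};
};
```

Because `exchange` is a single atomic read-modify-write, two threads racing on the first send cannot both observe `false`, so the file would be replayed at most once per process.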

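3.2 Low resource cost

  • 60-second window by default; writing stops automatically once it elapses
  • File size limit (10 MB by default); writing stops automatically once it is exceeded
  • Alarm level filtering (by default only level > 0 is written), which reduces writes of low-level alarms
  • Configurable parameters control the window, the file size, and the level filter
  • The file is deleted as soon as its contents are sent successfully, freeing disk space
  • Uses a lightweight JSON format that is loaded on demand at parse time

3.3 Controlled risk

  • Writing stops automatically after the time window, so disk space cannot be exhausted
  • Writing stops automatically once the file exceeds the size limit, so the file cannot grow too large
  • File operation failures return silently and do not affect the main flow
  • Send failures are logged in detail, including the raw JSON string
  • The file path is fixed, so alarms from consecutive restarts are merged automatically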
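
The points above about a persistent handle, append-only writes, a size cap, and silent failure can be combined in one small writer. The following is a sketch under stated assumptions, not the AlarmManager code: `StartupAlarmFileWriter` and its methods are hypothetical, it uses C stdio for portability, and it refuses a record that would push the file past the limit, whereas the design doc only says writing stops once the limit is exceeded.

```cpp
#include <cstdint>
#include <cstdio>
#include <string>
#include <utility>

// Hypothetical append-only writer that keeps one handle open across writes.
class StartupAlarmFileWriter {
public:
    StartupAlarmFileWriter(std::string path, int64_t maxSize)
        : mPath(std::move(path)), mMaxSize(maxSize) {}
    ~StartupAlarmFileWriter() { Close(); }
    StartupAlarmFileWriter(const StartupAlarmFileWriter&) = delete;
    StartupAlarmFileWriter& operator=(const StartupAlarmFileWriter&) = delete;

    // Appends one record plus a newline. Never throws: on any failure or
    // limit hit it stops permanently and returns false, so the caller's
    // main flow is unaffected.
    bool Append(const std::string& line) {
        if (mStopped) {
            return false;
        }
        if (mFile == nullptr && !Open()) {
            return Stop();
        }
        const int64_t recordSize = static_cast<int64_t>(line.size()) + 1;
        if (mSize + recordSize > mMaxSize) {
            return Stop();
        }
        if (std::fwrite(line.data(), 1, line.size(), mFile) != line.size()
            || std::fputc('\n', mFile) == EOF) {
            return Stop();
        }
        mSize += recordSize;
        return true;
    }

    // Closing flushes buffered data so a later reader sees every record.
    void Close() {
        if (mFile != nullptr) {
            std::fclose(mFile);
            mFile = nullptr;
        }
    }

private:
    // Lazily opens in append mode; an existing file (for example from a
    // previous restart) is kept and its size counts toward the limit.
    bool Open() {
        mFile = std::fopen(mPath.c_str(), "ab");
        if (mFile == nullptr) {
            return false;
        }
        std::fseek(mFile, 0, SEEK_END);
        const long pos = std::ftell(mFile);
        mSize = pos < 0 ? 0 : static_cast<int64_t>(pos);
        return true;
    }

    bool Stop() {
        mStopped = true;
        Close();
        return false;
    }

    std::string mPath;
    int64_t mMaxSize;
    int64_t mSize = 0;
    std::FILE* mFile = nullptr;
    bool mStopped = false;
};
```

A caller would construct the writer once with the fixed file path, call `Append` for each qualifying alarm, and call `Close` before the file is read back.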

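3.4 Data integrity

  • count is fixed at 1 on write and summed per key on read, so counts stay accurate
  • Alarms are persisted to a file, so they can be recovered even if the process crashes
  • Across consecutive restarts, all alarms remain in the same file and are processed together on read
  • All alarm fields are saved (region, type, level, message, count, timestamp, and so on)
  • Alarms from the file are sent before alarms in the in-memory buffer, preserving chronological order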

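4. Configuration parameters

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| logtail_startup_alarm_window_seconds | int32 | 60 | Time window (seconds) for writing startup alarms to disk; writing to the file stops after it elapses |
| logtail_startup_alarm_file_max_size | int32 | 10485760 (10 MB) | Maximum size (bytes) of the alarm disk buffer file; writing to the file stops once it is exceeded |
| logtail_startup_alarm_file_min_level | int32 | 1 | Minimum alarm level written to the file (1 writes all alarms, 2 writes only ERROR and CRITICAL, 3 writes only CRITICAL) |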

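5. File format

5.1 File name

File name: alarm_disk_buffer.json

File path: {AgentDataDir}/alarm_disk_buffer.json

Note: the file path is fixed and contains no timestamp. Across consecutive restarts, all alarms are appended to the same file; on read they are grouped by key and their counts are summed.

5.2 JSON format

The file holds one JSON object per line. The example below is expanded across several lines for readability. Fields: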

{
  "region": "cn-hangzhou",
  "alarm_type": "USER_CONFIG_ALARM",
  "alarm_level": "1",
  "alarm_message": "Config file not found",
  "alarm_count": "1",
  "timestamp": 1704067200,
  "project_name": "my-project",
  "category": "my-category",
  "config": "my-config"
}

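Field descriptions

  • region: required, the region the alarm belongs to
  • alarm_type: required, the alarm type
  • alarm_level: required, the alarm level (1=WARNING, 2=ERROR, 3=CRITICAL)
  • alarm_message: required, the alarm message
  • alarm_count: required, the alarm count (fixed at "1" on write, summed per key on read)
  • timestamp: required, the alarm time as a Unix timestamp
  • project_name: optional, the project name
  • category: optional, the category
  • config: optional, the configuration name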
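
To make the one-object-per-line layout concrete, here is a small hand-rolled serializer. It is a sketch only: `AlarmRecord`, `EscapeJson`, and `ToJsonLine` are hypothetical names, omitting empty optional fields is an assumption, and the actual change may well use a JSON library instead.

```cpp
#include <cstdint>
#include <cstdio>
#include <string>

// Hypothetical in-memory form of one alarm; field names follow the list above.
struct AlarmRecord {
    std::string region;
    std::string alarmType;
    std::string level;
    std::string message;
    int64_t timestamp = 0;
    std::string projectName;  // optional
    std::string category;     // optional
    std::string config;       // optional
};

// Escapes quotes, backslashes, and control characters so that a record
// never contains a raw newline and always fits on one line.
inline std::string EscapeJson(const std::string& in) {
    std::string out;
    out.reserve(in.size());
    for (unsigned char c : in) {
        switch (c) {
            case '"': out += "\\\""; break;
            case '\\': out += "\\\\"; break;
            case '\n': out += "\\n"; break;
            case '\r': out += "\\r"; break;
            case '\t': out += "\\t"; break;
            default:
                if (c < 0x20) {
                    char buf[8];
                    std::snprintf(buf, sizeof(buf), "\\u%04x", static_cast<unsigned>(c));
                    out += buf;
                } else {
                    out += static_cast<char>(c);
                }
        }
    }
    return out;
}

// Builds a single-line JSON object; alarm_count is always "1" on write.
inline std::string ToJsonLine(const AlarmRecord& r) {
    std::string line = "{";
    auto addString = [&line](const char* key, const std::string& value) {
        if (line.size() > 1) {
            line += ',';
        }
        line += '"';
        line += key;
        line += "\":\"";
        line += EscapeJson(value);
        line += '"';
    };
    addString("region", r.region);
    addString("alarm_type", r.alarmType);
    addString("alarm_level", r.level);
    addString("alarm_message", r.message);
    addString("alarm_count", "1");
    line += ",\"timestamp\":" + std::to_string(r.timestamp);
    if (!r.projectName.empty()) addString("project_name", r.projectName);
    if (!r.category.empty()) addString("category", r.category);
    if (!r.config.empty()) addString("config", r.config);
    line += '}';
    return line;
}
```

Escaping newlines inside `alarm_message` matters here because the reader splits the file on line boundaries.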

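Important notes

  • On write, every alarm record has alarm_count fixed at "1"
  • On read, records are grouped by key (region_alarmType_projectName_category_config_level) and the counts of identical alarms are summed
  • Across consecutive restarts, all alarms are appended to the same file and are processed together on read, with their counts summed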
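
The read-side aggregation described above can be sketched as follows. The names `ParsedAlarm`, `MakeAlarmKey`, and `AggregateAlarms` are hypothetical, the records are assumed to be already parsed from the file, and keeping the first message seen for each key is an assumption the source does not specify.

```cpp
#include <cstdint>
#include <map>
#include <string>
#include <vector>

// Hypothetical parsed form of one line of alarm_disk_buffer.json.
struct ParsedAlarm {
    std::string region;
    std::string alarmType;
    std::string projectName;
    std::string category;
    std::string config;
    std::string level;
    std::string message;
    int32_t count = 1;  // every record on disk carries count 1
};

// Key layout from the doc: region_alarmType_projectName_category_config_level.
inline std::string MakeAlarmKey(const ParsedAlarm& a) {
    return a.region + "_" + a.alarmType + "_" + a.projectName + "_" + a.category + "_"
        + a.config + "_" + a.level;
}

// Merges records that share a key, summing their counts. The first record
// seen for a key supplies the remaining fields (an assumption).
inline std::map<std::string, ParsedAlarm> AggregateAlarms(const std::vector<ParsedAlarm>& records) {
    std::map<std::string, ParsedAlarm> merged;
    for (const auto& rec : records) {
        auto result = merged.emplace(MakeAlarmKey(rec), rec);
        if (!result.second) {
            result.first->second.count += rec.count;
        }
    }
    return merged;
}
```

Because records from several restarts land in the same file, the same pass also merges alarms across process lifetimes.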

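6. Test coverage

Unit test file: core/unittest/monitor/AlarmDiskBufferUnittest.cpp

Test cases:

  1. Alarms are written to the file while the pipeline is not ready
  2. Counts are summed per key when the file is read
  3. Alarms are read from the file and sent
  4. An error is logged when sending from the file fails
  5. Writing to the file stops after 60 seconds
  6. Writing stops once the file exceeds the size limit
  7. Alarm level filtering
  8. Data integrity and absence of duplicates across consecutive restarts

7. Summary

The startup alarm reporting path uses a disk buffer to prevent alarms from being lost while the pipeline is not ready. The design preserves data integrity and, through efficient file operations, resource limits, and fault tolerance, provides alarm persistence and recovery with low overhead and low risk.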

…tricManager code structure

- Adjusted naming conventions in .clang-tidy for better readability and consistency.
- Refactored AlarmManager to enhance code clarity and maintainability, including changes to method structures and variable handling.
- Improved MetricManager by refining metric event handling and ensuring proper memory management.
- Updated SelfMonitorServer to utilize move semantics for efficiency in metric event processing.
- Cleaned up includes and removed unnecessary dependencies across various files for better organization.
- Removed unnecessary includes and dependencies in Monitor.cpp, Monitor.h, SelfMonitorServer.cpp, and ProfileSender.cpp.
- Simplified ProfileSender's handling of profile project names by replacing FlusherSLS with a string map for region project names.
- Enhanced unit tests in OnetimeConfigUpdateUnittest to ensure accurate expiration time validation and improved readability.
- Overall code organization and clarity improvements across multiple files.
- Introduced mechanisms for writing alarms to a disk buffer during startup when the alarm pipeline is not ready.
- Added methods to read alarms from the disk buffer and process them into PipelineEventGroups.
- Implemented logic to manage the alarm disk buffer file, including checks for file size limits and time windows for writing.
- Enhanced unit tests for AlarmManager to validate the new disk buffer functionality and ensure no data loss or duplication occurs.
- Updated AlarmManager and SelfMonitorServer to integrate the new alarm handling logic effectively.
…e clarity

- Changed the maximum size for the alarm disk buffer file from MB to bytes for better precision.
- Simplified the iteration over alarm messages using structured bindings for improved readability.
- Updated logging to reflect the new size unit in error messages, enhancing clarity in file size limits.
- Replaced the Provider include with FlusherSLS to streamline alarm processing.
- Added instance_id and hostname to alarm event content for better traceability.
- Updated logic to handle multiple projects when sending alarms, ensuring all relevant regions are notified.
- Enhanced alarm key construction to include additional metadata such as IP, OS, version, instance_id, and hostname.
- Improved unit tests to validate the new alarm handling logic and ensure accurate state management during tests.
- Introduced a set to track unique regions for alarm notifications, ensuring each region receives the alarm only once.
- Updated the SendAlarm method to iterate over distinct regions instead of projects, improving efficiency and clarity in alarm handling.
- Removed redundant comments and cleaned up method declarations for better code organization.