Apache SkyWalking Alarm Configuration Guide
2022-04-22 13:33:00 【Wanmao Society】
Apache SkyWalking
Apache SkyWalking is an application performance monitoring (APM) tool for distributed systems. It is designed for microservices, cloud-native and container-based (Docker, K8s, Mesos) architectures.
It provides an integrated solution for distributed tracing, service mesh telemetry analysis, metric aggregation and visualization.
Apache SkyWalking Alarms
Apache SkyWalking alarms are driven by a set of rules defined in the config/alarm-settings.yml file.
An alarm definition consists of the following parts:
- Alarm rules: Define the conditions that trigger an alarm.
- Webhook: The list of service endpoints called when an alarm is triggered.
- gRPCHook: The host and port of the remote gRPC method called when an alarm is triggered.
- Slack Chat Hook: The Slack Chat interface called when an alarm is triggered.
- WeChat Hook: The WeChat interface called when an alarm is triggered.
- DingTalk Hook: The DingTalk interface called when an alarm is triggered.
Alarm rules
There are two types of alarm rules: individual rules and composite rules. A composite rule is a combination of individual rules.
Individual Rules
The key settings of an individual rule are as follows:
- Rule name: The unique name shown in the alarm message. It must end with _rule.
- metrics-name: The metric name, which is also the metric name defined in the OAL script. In the default configuration, metrics of the following scopes can be used for alarms: Service, Service Instance, Endpoint, Service Relation, Service Instance Relation and Endpoint Relation. Only long, double and int types are supported.
- include-names: The list of entity names this rule applies to.
- exclude-names: The list of entity names excluded from this rule.
- include-names-regex: A regular expression matching the entity names to include. If both include-names and include-names-regex are set, both take effect.
- exclude-names-regex: A regular expression matching the entity names to exclude. If both exclude-names and exclude-names-regex are set, both take effect.
- include-labels: The labels included in this rule.
- exclude-labels: The labels excluded from this rule.
- include-labels-regex: A regular expression matching the labels to include. If both include-labels and include-labels-regex are set, both take effect.
- exclude-labels-regex: A regular expression matching the labels to exclude. If both exclude-labels and exclude-labels-regex are set, both take effect.
The label settings require the data to be stored in the meter-system, for example metrics from Prometheus or Micrometer. Metrics used with the four label settings above must implement the LabeledValueHolder interface.
- threshold: The threshold value.
For multiple-value metrics such as percentile, the threshold is an array written as value1, value2, value3, value4, value5. Each value is the threshold for the corresponding value of the metric. If you don't want one or more of these values to trigger the alarm, set them to -. For example, for percentile, value1 is the threshold for P50 and value2 is the threshold for P75, so -,-,value3,value4,value5 means the percentile alarm rule has no threshold for P50 and P75.
- op: The comparison operator. Supports >, >=, <, <= and =.
- period: How long the alarm rule should be checked. This is a time window (in minutes) that matches the backend deployment environment time.
- count: Within one period window, if the number of times the value breaches the threshold (as compared by op) reaches count, an alarm is sent.
- only-as-condition: true or false. Specifies whether the rule can send an alarm on its own, or serves only as a condition of a composite rule.
- silence-period: After an alarm is triggered at time N, no alarm is sent again during N -> N + silence-period. By default it equals period, which means the same alarm (same metric name and same ID) is sent only once within one period.
- message: The notification message sent when the rule is triggered.
For example:
rules:
  service_resp_time_rule:
    metrics-name: service_resp_time
    op: ">"
    threshold: 1000
    period: 10
    count: 2
    silence-period: 10
    message: The average response time of service 【{name}】 exceeded 1 second in 2 of the last 10 minutes
  service_instance_resp_time_rule:
    metrics-name: service_instance_resp_time
    op: ">"
    threshold: 1000
    period: 10
    count: 2
    silence-period: 10
    message: The average response time of instance 【{name}】 exceeded 1 second in 2 of the last 10 minutes
  endpoint_resp_time_rule:
    metrics-name: endpoint_avg
    threshold: 1000
    op: ">"
    period: 10
    count: 2
    message: The average response time of endpoint 【{name}】 exceeded 1 second in 2 of the last 10 minutes
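The settings above can be modelled in a few lines of code. The Python sketch below only illustrates the semantics described in this section and is not SkyWalking's implementation. It assumes one observation per minute for each entity, treats unset name filters as "no restriction", treats a multi-value metric as breached when any slot with a threshold breaches, and lets an alarm fire again exactly at N + silence-period. The names RuleSketch, parse_threshold and WindowEvaluator are hypothetical.

```python
import operator
import re
from dataclasses import dataclass, field
from typing import List, Optional

# Maps the rule's "op" strings to Python comparisons.
OPS = {">": operator.gt, ">=": operator.ge, "<": operator.lt,
       "<=": operator.le, "=": operator.eq}


def parse_threshold(raw: str) -> List[Optional[float]]:
    """Parse "1000" or "-,-,1000,1000,1000"; "-" disables that slot."""
    return [None if part.strip() == "-" else float(part) for part in raw.split(",")]


@dataclass
class RuleSketch:
    op: str
    threshold: List[Optional[float]]
    period: int            # window length in minutes
    count: int             # breaches needed inside the window
    silence_period: int    # minutes to stay quiet after an alarm
    include_names: List[str] = field(default_factory=list)
    exclude_names: List[str] = field(default_factory=list)
    include_names_regex: Optional[str] = None
    exclude_names_regex: Optional[str] = None

    def applies_to(self, name: str) -> bool:
        """All configured name filters must pass; unset filters are ignored."""
        if self.include_names and name not in self.include_names:
            return False
        if name in self.exclude_names:
            return False
        if self.include_names_regex and not re.fullmatch(self.include_names_regex, name):
            return False
        if self.exclude_names_regex and re.fullmatch(self.exclude_names_regex, name):
            return False
        return True

    def breached(self, values: List[float]) -> bool:
        """True if any value whose threshold is set breaches it."""
        compare = OPS[self.op]
        return any(t is not None and compare(v, t)
                   for v, t in zip(values, self.threshold))


class WindowEvaluator:
    """Tracks one rule for one entity, one observation per minute."""

    def __init__(self, rule: RuleSketch):
        self.rule = rule
        self.history: List[bool] = []
        self.next_allowed = 0  # first minute at which an alarm may fire

    def observe(self, minute: int, values: List[float]) -> bool:
        """Record one minute of values; return True if an alarm is sent."""
        self.history.append(self.rule.breached(values))
        window = self.history[-self.rule.period:]  # the last `period` minutes
        if sum(window) >= self.rule.count and minute >= self.next_allowed:
            self.next_allowed = minute + self.rule.silence_period
            return True
        return False
```

With the service_resp_time_rule values above (threshold 1000, period 10, count 2, silence-period 10), feeding the per-minute values 1200, 500, 1500 fires an alarm at the third minute (minute 2), and further breaches stay silent until minute 12.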
Composite Rules
Composite rules apply only to alarm rules of the same entity level, for example two service-level rules: service_percent_rule && service_resp_time_percentile_rule. You cannot combine alarm rules of different entity levels, such as a service-level rule and an endpoint-level rule: service_percent_rule && endpoint_percent_rule.
The key settings of a composite rule are as follows:
- Rule name: The unique name shown in the alarm message. It must end with _rule.
- expression: Specifies how the rules are combined. Supports the &&, || and () operators.
- message: The notification message sent when the rule is triggered.
For example:
rules:
  service_resp_time_rule:
    metrics-name: service_resp_time
    op: ">"
    threshold: 1000
    period: 10
    count: 2
    silence-period: 10
    message: The average response time of service 【{name}】 exceeded 1 second in 2 of the last 10 minutes
  service_sla_rule:
    metrics-name: service_sla
    op: "<"
    threshold: 8000
    period: 10
    count: 2
    silence-period: 10
    message: The success rate of service 【{name}】 was below 80% in 2 of the last 10 minutes
composite-rules:
  comp_rule:
    expression: service_resp_time_rule && service_sla_rule
    message: The average response time of service 【{name}】 exceeded 1 second and its success rate was below 80% in 2 of the last 10 minutes
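A composite rule fires when its expression, evaluated over the current state of the individual rules it names, is true. The Python sketch below evaluates such an expression with a small recursive-descent parser. It assumes the usual precedence (&& binds tighter than ||) and takes the rule states as a plain dictionary. The function names are hypothetical, and this is not SkyWalking's own evaluator.

```python
import re
from typing import Dict, List

# Tokens: the two operators, parentheses, and rule names such as service_sla_rule.
TOKEN = re.compile(r"\s*(&&|\|\||\(|\)|[A-Za-z_][A-Za-z0-9_]*)")


def tokenize(expr: str) -> List[str]:
    tokens, pos, expr = [], 0, expr.strip()
    while pos < len(expr):
        match = TOKEN.match(expr, pos)
        if match is None:
            raise ValueError(f"unexpected input: {expr[pos:]!r}")
        tokens.append(match.group(1))
        pos = match.end()
    return tokens


def evaluate(expr: str, fired: Dict[str, bool]) -> bool:
    """Evaluate a composite expression; fired maps rule names to their current state."""
    tokens = tokenize(expr)
    pos = 0

    def peek():
        return tokens[pos] if pos < len(tokens) else None

    def take():
        nonlocal pos
        token = peek()
        pos += 1
        return token

    def parse_or() -> bool:          # lowest precedence: ||
        value = parse_and()
        while peek() == "||":
            take()
            rhs = parse_and()        # always parse the right side before combining
            value = value or rhs
        return value

    def parse_and() -> bool:         # binds tighter than ||
        value = parse_atom()
        while peek() == "&&":
            take()
            rhs = parse_atom()
            value = value and rhs
        return value

    def parse_atom() -> bool:        # a rule name or a parenthesised sub-expression
        token = take()
        if token == "(":
            value = parse_or()
            if take() != ")":
                raise ValueError("missing ')'")
            return value
        if token is None or token in ("&&", "||", ")"):
            raise ValueError(f"unexpected token {token!r}")
        return fired[token]          # KeyError if the rule name is unknown

    result = parse_or()
    if peek() is not None:
        raise ValueError(f"unexpected trailing token {peek()!r}")
    return result
```

For comp_rule above, evaluate("service_resp_time_rule && service_sla_rule", states) is true only while both individual rules are in the fired state. The right-hand operand is parsed before combining so that short-circuiting never skips tokens.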
Webhook
A webhook requires the peer to be a web container. The alarm message is sent as an HTTP POST request with Content-Type application/json. The JSON body is an array of alarms, and each alarm contains the following fields:
- scopeId: The ID of the target scope.
- scope: The name of the target scope.
- name: The name of the target scope entity.
- id0: The ID of the scope entity.
- id1: Not used.
- ruleName: The name of the rule configured in alarm-settings.yml.
- alarmMessage: The alarm message content.
- startTime: The alarm timestamp, as the number of milliseconds between the current time and midnight, January 1, 1970 UTC.
For example:
[{
    "scopeId": 1,
    "scope": "SERVICE",
    "name": "one-more-service",
    "id0": "b3JkZXItY2VudGVyLXNlYXJjaC1hcGk=.1",
    "id1": "",
    "ruleName": "service_resp_time_rule",
    "alarmMessage": "The average response time of service 【one-more-service】 exceeded 1 second in 2 of the last 10 minutes",
    "startTime": 1617670815000
}, {
    "scopeId": 2,
    "scope": "SERVICE_INSTANCE",
    "name": "e4b31262acaa47ef92a22b6a2b8a7cb1@172.24.30.138 of one-more-service",
    "id0": "dWF0LWxib2Mtc2VydmljZQ==.1_ZTRiMzEyNjJhY2FhNDdlZjkyYTIyYjZhMmI4YTdjYjFAMTcyLjI0LjMwLjEzOA==",
    "id1": "",
    "ruleName": "instance_jvm_young_gc_count_rule",
    "alarmMessage": "The Young GC count of instance 【e4b31262acaa47ef92a22b6a2b8a7cb1@172.24.30.138 of one-more-service】 exceeded 10 in 2 of the last 10 minutes",
    "startTime": 1617670815000
}, {
    "scopeId": 3,
    "scope": "ENDPOINT",
    "name": "/one/more/endpoint in one-more-service",
    "id0": "b25lcGllY2UtYXBp.1_L3RlYWNoZXIvc3R1ZGVudC92aXBsZXNzb25z",
    "id1": "",
    "ruleName": "endpoint_resp_time_rule",
    "alarmMessage": "The average response time of endpoint 【/one/more/endpoint in one-more-service】 exceeded 1 second in 2 of the last 10 minutes",
    "startTime": 1617670815000
}]
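On the receiving side, any HTTP server that accepts this JSON array will do. Below is a minimal sketch using only Python's standard library. The Alarm class, the parse_alarms and run functions and port 8080 are assumptions for illustration. Add the server's URL to the webhooks list in alarm-settings.yml.

```python
import json
from dataclasses import dataclass
from datetime import datetime, timezone
from http.server import BaseHTTPRequestHandler, HTTPServer
from typing import List


@dataclass
class Alarm:
    scope_id: int
    scope: str
    name: str
    id0: str
    id1: str
    rule_name: str
    alarm_message: str
    start_time: datetime


def parse_alarms(body: bytes) -> List[Alarm]:
    """Decode the JSON array of alarms sent in one webhook POST."""
    return [
        Alarm(
            scope_id=item["scopeId"],
            scope=item.get("scope", ""),
            name=item["name"],
            id0=item["id0"],
            id1=item.get("id1", ""),
            rule_name=item["ruleName"],
            alarm_message=item["alarmMessage"],
            # startTime is milliseconds since 1970-01-01 UTC.
            start_time=datetime.fromtimestamp(item["startTime"] / 1000, tz=timezone.utc),
        )
        for item in json.loads(body)
    ]


class AlarmHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        try:
            alarms = parse_alarms(self.rfile.read(length))
        except (ValueError, KeyError, TypeError):
            self.send_response(400)  # body was not the expected JSON array
            self.end_headers()
            return
        for alarm in alarms:
            print(f"[{alarm.start_time.isoformat()}] {alarm.rule_name}: {alarm.alarm_message}")
        self.send_response(200)
        self.end_headers()


def run(port: int = 8080) -> None:
    """Serve forever on the given (hypothetical) port."""
    HTTPServer(("0.0.0.0", port), AlarmHandler).serve_forever()
```

Calling run() starts the receiver. For the example payload above, startTime 1617670815000 decodes to 2021-04-06 01:00:15 UTC.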
gRPCHook
The alarm message is sent as a Protobuf message through a remote gRPC method. The key parts of the message format are defined as follows:
syntax = "proto3";
option java_multiple_files = true;
option java_package = "org.apache.skywalking.oap.server.core.alarm.grpc";
service AlarmService {
    rpc doAlarm (stream AlarmMessage) returns (Response) {
    }
}
message AlarmMessage {
    int64 scopeId = 1;
    string scope = 2;
    string name = 3;
    string id0 = 4;
    string id1 = 5;
    string ruleName = 6;
    string alarmMessage = 7;
    int64 startTime = 8;
}
message Response {
}
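A receiver for the gRPC hook implements the client-streaming doAlarm method. The sketch below keeps the message handling in a plain class (AlarmCollector, a hypothetical name) so it can run without generated code. serve() assumes the definition above was saved as alarm.proto and compiled with grpcio-tools, which produces the alarm_pb2 and alarm_pb2_grpc modules. The port must match the host and port configured for the gRPC hook in alarm-settings.yml. 50051 is only a placeholder.

```python
from concurrent import futures


class AlarmCollector:
    """Stub-independent core: drains the client stream of AlarmMessage objects."""

    def __init__(self):
        self.received = []

    def consume(self, request_iterator) -> int:
        count = 0
        for msg in request_iterator:
            # Field names mirror the AlarmMessage definition above.
            self.received.append((msg.scopeId, msg.ruleName, msg.name, msg.alarmMessage))
            count += 1
        return count


def serve(port: int = 50051) -> None:
    # Assumes the .proto above was compiled with
    #   python -m grpc_tools.protoc -I. --python_out=. --grpc_python_out=. alarm.proto
    # producing alarm_pb2 and alarm_pb2_grpc (hypothetical module names here).
    import grpc
    import alarm_pb2
    import alarm_pb2_grpc

    collector = AlarmCollector()

    class AlarmServicer(alarm_pb2_grpc.AlarmServiceServicer):
        def doAlarm(self, request_iterator, context):
            collector.consume(request_iterator)
            return alarm_pb2.Response()

    server = grpc.server(futures.ThreadPoolExecutor(max_workers=4))
    alarm_pb2_grpc.add_AlarmServiceServicer_to_server(AlarmServicer(), server)
    server.add_insecure_port(f"[::]:{port}")
    server.start()
    server.wait_for_termination()
```

Because doAlarm takes a stream, the servicer receives an iterator and returns a single empty Response once the client closes the stream.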
Slack Chat Hook
Follow the Getting Started with Incoming Webhooks guide to create new webhooks.
If you configure Slack Incoming Webhooks as follows, the alarm message is sent as an HTTP POST request with Content-Type application/json.
For example:
slackHooks:
  textTemplate: |-
    {
      "type": "section",
      "text": {
        "type": "mrkdwn",
        "text": ":alarm_clock: *Apache Skywalking Alarm* \n **%s**."
      }
    }
  webhooks:
    - https://hooks.slack.com/services/x/y/z
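The %s in textTemplate is the placeholder for the alarm message. The Python sketch below uses render_payload, a hypothetical helper that is not part of SkyWalking. It fills the Slack template from the example and parses the result to confirm the payload is valid JSON. It JSON-escapes the message first, so a message containing quotes cannot break the payload. Whether SkyWalking itself escapes the message is not covered here.

```python
import json

# The textTemplate from the slackHooks example; "\\n" keeps the JSON escape literal.
SLACK_TEMPLATE = """{
  "type": "section",
  "text": {
    "type": "mrkdwn",
    "text": ":alarm_clock: *Apache Skywalking Alarm* \\n **%s**."
  }
}"""


def render_payload(template: str, alarm_message: str) -> dict:
    """Fill the %s placeholder with the alarm message and parse the result as JSON."""
    # json.dumps produces a quoted JSON string; strip the quotes to get an escaped body.
    escaped = json.dumps(alarm_message)[1:-1]
    return json.loads(template % escaped)
```

Note that Python's % operator would also treat any other % sign in the template as a placeholder, so this sketch only suits templates whose sole % is the %s shown.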
WeChat Hook
Only WeChat Work (the enterprise edition of WeChat) supports webhooks. For how to use WeChat webhooks, see the guide on configuring group robots.
If you configure WeChat webhooks as follows, the alarm message is sent as an HTTP POST request with Content-Type application/json.
For example:
wechatHooks:
  textTemplate: |-
    {
      "msgtype": "text",
      "text": {
        "content": "Apache SkyWalking Alarm: \n %s."
      }
    }
  webhooks:
    - https://qyapi.weixin.qq.com/cgi-bin/webhook/send?key=dummy_key
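The request described above is an ordinary JSON POST. As a sketch, the Python code below builds such a request for the WeChat template using the standard library's urllib. build_request is a hypothetical helper, and dummy_key is the placeholder from the example, so the request is only constructed, not sent.

```python
import json
import urllib.request

# The textTemplate from the wechatHooks example; "\\n" keeps the JSON escape literal.
WECHAT_TEMPLATE = """{
  "msgtype": "text",
  "text": {
    "content": "Apache SkyWalking Alarm: \\n %s."
  }
}"""


def build_request(webhook_url: str, alarm_message: str) -> urllib.request.Request:
    """Build (but do not send) a JSON POST carrying the filled-in template."""
    body = WECHAT_TEMPLATE % json.dumps(alarm_message)[1:-1]  # JSON-escape the message
    json.loads(body)  # raise early if the filled template is not valid JSON
    return urllib.request.Request(
        webhook_url,
        data=body.encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
```

Passing the returned object to urllib.request.urlopen would send it. With a real key, the robot posts the content text to its group.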
DingTalk Hook
Follow the guide on enabling a custom robot to create a new webhook. For security, you can configure an optional secret for the webhook URL.
If you configure DingTalk webhooks as follows, the alarm message is sent as an HTTP POST request with Content-Type application/json.
For example:
dingtalkHooks:
  textTemplate: |-
    {
      "msgtype": "text",
      "text": {
        "content": "Apache SkyWalking Alarm: \n %s."
      }
    }
  webhooks:
    - url: https://oapi.dingtalk.com/robot/send?access_token=dummy_token
      secret: dummysecret
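When a secret is configured, DingTalk's signature security setting expects each request URL to carry timestamp and sign query parameters. The Python sketch below follows the scheme described in DingTalk's custom-robot documentation as we understand it: an HMAC-SHA256 over the millisecond timestamp and the secret joined by a newline, keyed with the secret, then Base64 and URL encoding. signed_url is a hypothetical helper, not SkyWalking code. Check the current DingTalk documentation before relying on it.

```python
import base64
import hashlib
import hmac
import time
import urllib.parse
from typing import Optional


def signed_url(webhook_url: str, secret: str, timestamp_ms: Optional[int] = None) -> str:
    """Append DingTalk-style timestamp and sign parameters to a robot webhook URL."""
    if timestamp_ms is None:
        timestamp_ms = int(time.time() * 1000)  # current time in milliseconds
    string_to_sign = f"{timestamp_ms}\n{secret}"
    digest = hmac.new(secret.encode("utf-8"), string_to_sign.encode("utf-8"),
                      hashlib.sha256).digest()
    sign = urllib.parse.quote_plus(base64.b64encode(digest).decode("ascii"))
    # The example URL already has a query string (?access_token=...), so append with "&".
    return f"{webhook_url}&timestamp={timestamp_ms}&sign={sign}"
```

The timestamp is part of the signed string, so a fresh signature is needed for each request rather than one computed at startup.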
Note: This article uses SkyWalking 8.2.0 as an example. Other versions may differ slightly.
Copyright notice
This article was written by [Wanmao Society]. Please include a link to the original when reposting. Thank you.
https://yzsam.com/2022/04/202204221324098347.html