
Major upgrade of MSE Governance Center - Traffic Governance, Database Governance, Same AZ Priority

2022-08-10 18:44:00 InfoQ

Author: 流士

In this release, the MSE Governance Center delivers major upgrades in three areas: rate limiting and degradation, database governance, and same-AZ priority. The upgrades comprehensively strengthen the resilience of microservice governance, the stability of the middleware that services depend on, and the performance of traffic scheduling, as part of an effort to build a microservice governance platform for the cloud-native era.

Background recap

Before introducing the upgraded capabilities, here is a brief recap of the core capabilities of MSE. They cover the development, testing, and runtime stages. Among them, the functions most commonly used in service governance are lossless online and offline, full-link grayscale release, and daily environment isolation.

1.png
  • Lossless online and offline

Small-traffic warm-up is supported so that newly launched applications are not overwhelmed by traffic. The warm-up model can be adjusted dynamically to meet the needs of complex scenarios. In addition, the warm-up process can be associated with Kubernetes checks.

  • Full-link grayscale release

Swim lanes can be configured across gateways, RPC, RocketMQ, and more. Traffic can be switched dynamically with one click, and the effect of the switch can be observed through monitoring. In addition, an end-to-end stable baseline environment is provided so that users can verify new versions quickly and safely.

  • Daily environment isolation

Traffic flows within the feature environment, which enables efficient agile development. Because environments are isolated logically, only one baseline environment needs to be maintained, which greatly reduces cost. With the Cloud Toolkit end-cloud interconnection in IDEA, locally started applications can be connected to the development environment, reducing development and debugging costs.

2.png
The following sections describe the problems addressed by traffic governance, database governance, and same-AZ priority, along with the specific solutions.

Traffic governance: comprehensive upgrade of rate limiting and degradation

The traffic governance model forms an extensible closed loop for traffic protection, so that the various problems that may arise in a system's online environment can be governed effectively. The model starts with fault identification, which detects problems at different layers, such as status codes and exception types at the interface layer and anomalies at the operating system layer. Once a problem is identified, an alert is raised so that users can apply targeted traffic governance, for example adaptive rate limiting or scenario-based rate limiting. After protection rules are set, the system protects itself according to the preset thresholds and protection measures. The protection effect can be observed through monitoring, and monitoring can in turn be used to check whether the protection rules are reasonable and to adjust them in time.

3.jpeg
For a first-time integration with no historical data for reference, the system can be stress tested: set the stress test parameters according to the business scenario, configure traffic governance rules for the problems that may arise online, and put the protection strategy in place.

Single-node traffic protection

4.png
First, flow control. It works by monitoring the QPS metric of application or service traffic and blocking traffic as soon as the metric reaches the configured threshold, which prevents the application from being overwhelmed by an instantaneous traffic peak and so keeps the application highly available. The product provides several rate limiting modes, including single-node rate limiting, cluster flow control, minute-level and hour-level rate limiting, and associated rate limiting, and it supports several rate limiting algorithms: sliding window, token bucket, and leaky bucket.
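To make the token bucket idea concrete, here is a minimal single-node sketch. It is not the MSE implementation: the class name `TokenBucket`, its parameters, and the injectable `clock` are assumptions made for illustration only.

```python
import time


class TokenBucket:
    """Hypothetical single-node QPS limiter (illustrative, not the MSE API).

    Tokens refill at `rate` per second, up to `capacity`.
    """

    def __init__(self, rate, capacity, clock=time.monotonic):
        self.rate = rate
        self.capacity = capacity
        self.clock = clock            # injectable so behaviour can be tested
        self.tokens = capacity
        self.last = clock()

    def try_acquire(self):
        now = self.clock()
        # Refill for the elapsed time, never beyond the bucket capacity.
        elapsed = now - self.last
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True               # request passes
        return False                  # request is blocked
```

A caller would invoke `try_acquire()` before handling each request and reject the request when it returns `False`. The capacity bounds how large a burst can pass, while the rate bounds the sustained QPS.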

Next, concurrency control. When a strongly dependent method or interface becomes unstable, the impact of that dependency can be contained by configuring a limit on the number of concurrent threads. If the response time of requests grows, threads pile up and concurrency rises. When the concurrency exceeds the threshold, AHAS rejects the excess requests until the backlog is processed and the number of concurrent threads falls. This isolates the exception and reduces instability.
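The rejection behavior described above can be sketched with a semaphore. This is an illustration under assumptions, not AHAS code: `ConcurrencyLimiter` and its `call` method are hypothetical names.

```python
import threading


class ConcurrencyLimiter:
    """Hypothetical concurrency guard (illustrative, not the AHAS API)."""

    def __init__(self, max_concurrent):
        self._sem = threading.BoundedSemaphore(max_concurrent)

    def call(self, fn, *args, **kwargs):
        # Non-blocking acquire: reject immediately instead of queueing,
        # so a slow dependency cannot pile up threads.
        if not self._sem.acquire(blocking=False):
            raise RuntimeError("rejected: too many concurrent calls")
        try:
            return fn(*args, **kwargs)
        finally:
            self._sem.release()
```

Rejecting rather than waiting is the design choice that isolates the unstable dependency: callers fail fast, and the slot count recovers as in-flight calls finish.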

For system protection, both adaptive flow control and manually configured system rules are supported. Adaptive flow control automatically and dynamically adjusts the application's ingress traffic based on system CPU usage. System rules are manually configured rules that control the application's ingress traffic from an overall dimension. The goal is to balance the system's ingress traffic against its load so that the system runs stably at maximum throughput.

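Circuit breaking monitors the response time or exception ratio of the application's internal or downstream dependencies and immediately lowers the priority of a downstream dependency when the specified threshold is reached. For the specified period, the system does not call the unstable resource, which keeps the application from being affected and so keeps it highly available. After the specified period has elapsed, calls to the resource resume.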
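A minimal sketch of the exception-ratio variant follows. The class `CircuitBreaker`, its parameter names, and the simplification of resuming calls directly once the open period ends (with no half-open probing) are assumptions for illustration, not the product's behavior.

```python
import time


class CircuitBreaker:
    """Hypothetical exception-ratio circuit breaker (illustrative only)."""

    def __init__(self, max_error_ratio, min_calls, open_seconds,
                 clock=time.monotonic):
        self.max_error_ratio = max_error_ratio
        self.min_calls = min_calls        # avoid tripping on tiny samples
        self.open_seconds = open_seconds
        self.clock = clock
        self.calls = 0
        self.errors = 0
        self.opened_at = None             # None means the circuit is closed

    def call(self, fn, fallback):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.open_seconds:
                return fallback()         # open: skip the unstable resource
            self.opened_at = None         # period elapsed: resume calls
            self.calls = self.errors = 0
        self.calls += 1
        try:
            return fn()
        except Exception:
            self.errors += 1
            ratio = self.errors / self.calls
            if self.calls >= self.min_calls and ratio >= self.max_error_ratio:
                self.opened_at = self.clock()
            return fallback()
```

While the circuit is open, the protected function is never invoked, so a failing downstream gets time to recover and callers receive the fallback immediately.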

Active degradation protection lets you designate specific interfaces to be degraded. A degraded interface triggers a custom degradation behavior (such as returning specified content) instead of running its original logic.

Hotspot protection analyzes the parameters that appear most frequently in resource calls and, according to the configured hotspot rules, rate limits resource calls that carry hotspot parameters, protecting system stability.
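One way to picture hotspot rules is a separate counter per parameter value. The sketch below uses a fixed time window for simplicity. `HotParamLimiter` and its parameters are hypothetical and say nothing about how MSE detects hot parameters.

```python
import time
from collections import defaultdict


class HotParamLimiter:
    """Hypothetical per-parameter limiter using a fixed window."""

    def __init__(self, per_value_limit, window_seconds=1.0,
                 clock=time.monotonic):
        self.limit = per_value_limit
        self.window = window_seconds
        self.clock = clock
        self.window_start = clock()
        self.counts = defaultdict(int)    # calls per parameter value

    def try_pass(self, param_value):
        now = self.clock()
        if now - self.window_start >= self.window:
            # A new window begins: forget the previous counts.
            self.window_start = now
            self.counts.clear()
        if self.counts[param_value] >= self.limit:
            return False                  # this value is hot: block it
        self.counts[param_value] += 1
        return True
```

Only calls carrying the hot value are blocked. Calls with other parameter values to the same resource still pass, which is the point of limiting by parameter rather than by resource.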

Finally, when the system encounters non-fatal errors (such as occasional timeouts), automatic retries can keep them from turning into an eventual failure.
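A minimal retry helper might look like the following. The function name, the exponential backoff, and the default of retrying only on `TimeoutError` are assumptions chosen for illustration. The article does not describe the retry policy MSE applies.

```python
import time


def call_with_retry(fn, max_attempts=3, base_delay=0.1,
                    retry_on=(TimeoutError,), sleep=time.sleep):
    """Hypothetical helper: retry `fn` on non-fatal errors with backoff."""
    for attempt in range(1, max_attempts + 1):
        try:
            return fn()
        except retry_on:
            if attempt == max_attempts:
                raise                     # out of attempts: surface the error
            # Exponential backoff: base_delay, 2 * base_delay, ...
            sleep(base_delay * 2 ** (attempt - 1))
```

Retries are only safe for idempotent operations, and the attempt cap keeps a persistent failure from being amplified into extra load on the dependency.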

Cluster Traffic Protection

Cluster traffic protection addresses the cases where single-node flow control performs poorly: uneven traffic across nodes, frequent changes in the number of machines, and per-node thresholds that become too small when the total is split evenly. Cluster flow control can precisely control the real-time total call volume of a service interface across the whole cluster. It is best suited to the following scenarios:

1. Service call traffic is uneven

Traffic reaching each service instance is unbalanced, so single-node rate limiting is inaccurate (in aggregate, limiting kicks in too early) and the total cannot be controlled precisely.

2. Precise control of small cluster-wide traffic

When the cluster-wide limit is small, single-node rate limiting stops working. For example, if the total for an interface must not exceed 10 QPS but there are 50 machines, the total can still exceed the threshold even with the single-node threshold set to 1.

3. Business-level cluster flow control

Minute-level and hour-level flow control with business meaning protects downstream systems from being overwhelmed (for example, limiting at the gateway layer how many times per minute each user may call an API).
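The arithmetic behind the first two scenarios can be checked with a short sketch. The function names are hypothetical. The 10 QPS and 50 machine figures come from the example above, while the two-node traffic split is an assumed illustration.

```python
def worst_case_total(per_node_threshold, node_count):
    """Upper bound on cluster-wide passed QPS with independent node limits."""
    return per_node_threshold * node_count


def passed_with_per_node_limit(per_node_traffic, per_node_threshold):
    """Total QPS that passes when each node caps only its own traffic."""
    return sum(min(qps, per_node_threshold) for qps in per_node_traffic)
```

With a per-node threshold of 1 on 50 machines, `worst_case_total(1, 50)` is 50 QPS, five times the intended 10 QPS. With a 10 QPS target split as 5 per node across two nodes and incoming traffic of 8 and 2 QPS, `passed_with_per_node_limit([8, 2], 5)` is 7, so 3 QPS is rejected even though the cluster total is within the limit.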

5.png
Cluster flow control offers rich scenario coverage, low cost of use, and fully automatic management:

Rich scenarios: comprehensive coverage, from precise protection of gateway ingress traffic, to precise flow control of Web/RPC service calls, to minute-level and hour-level flow control along business dimensions

Low cost of use: no special integration is required; it works out of the box

Fully automatic management: server resources are managed and allocated automatically, and automated operations safeguard availability, so users do not need to care about the details of resource preparation and allocation and can focus on their business

Gateway Traffic Protection

Gateway traffic protection precisely controls the traffic of one API or a group of APIs. It provides protection up front and keeps excess traffic from hitting backend systems. If rules were configured per node, changes in the number of gateway machines would be hard to track, and uneven gateway traffic could make rate limiting ineffective.

6.png
Gateway protection has four core capabilities:

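1. Real-time monitoring and flow control at the API/Host dimension

2. Dynamic rule configuration that takes effect in real time

3. Cluster flow control that precisely controls the total number of API calls

4. Flow control and circuit breaking at the request parameter/header dimension

Full link & multi-language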

7.png
The upgraded MSE traffic governance applies to the entire microservice link. At the traffic ingress layer, it is integrated through the gateway. At the microservice layer, it protects not only the microservices themselves but also the middleware and third-party dependencies they rely on, such as caches and databases. If you integrate through ACK or the Agent, no code changes are needed. If you have advanced traffic governance needs, such as custom instrumentation points, you can integrate through the SDK.

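New database governance capabilities

Typical governance scenarios

  • A system exposes a query interface whose SQL statement joins multiple tables. In some cases it triggers a slow query that takes up to 30 s, which eventually fills the DB connection pool and the Tomcat thread pool and makes the whole application unavailable.
  • An application has just started and its Druid database connection pool is still initializing, but a large number of requests are already arriving. The Dubbo thread pool fills up quickly, many threads are stuck initializing database connections, and a large number of business requests fail.
  • In a full-link grayscale scenario, the new application version changed the contents of a database table, grayscale traffic corrupted data in the online database, and the business team had to correct online data manually overnight.
  • SQL performance was not properly considered early in the project. As the business grew and the user base increased, the SQL of legacy online interfaces gradually became a performance bottleneck. Effective SQL insight is therefore needed to find that legacy SQL and optimize it in time.
  • Long SQL processing times cause a large number of slow calls on online business interfaces. The problematic slow SQL needs to be located quickly and isolated through governance measures so that the business recovers quickly. Real-time SQL insight at the point where microservices access the data layer helps locate slow SQL calls quickly.

For most backend applications, the system bottleneck lies mainly in the database, and complex business logic inevitably involves database operations. Database problems are therefore among the highest-priority work, and database governance is an indispensable part of microservice governance.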

8.png

Core solutions

9.png
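  • Slow SQL governance

Slow SQL is one of the most damaging factors for system stability. Slow SQL in a system can cause abnormal CPU usage and load, and exhaustion of system resources. Severe slow SQL can bring down the entire database and poses a blocking risk to online business. Possible causes of slow SQL in an online production environment include:

  • Hardware causes such as a slow network, insufficient memory, low I/O throughput, or a full disk.
  • Missing or ineffective indexes.
  • Too much data in the system.
  • SQL performance was not properly considered early in the project.
  • Connection pool governance

Connection pool governance is a very important part of database governance. Real-time connection pool metrics make it possible to identify risks in the system early. The following are some common connection pool governance scenarios.

  • Pre-established connections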

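During application release or elastic scale-out, the connections in a newly started instance may not yet be established even though the instance has finished starting and has passed its Readiness check, which means a large amount of business traffic enters the newly started pod. Many requests then block while acquiring a connection from the pool, the service's thread pool fills up, and many business requests fail. If the application can establish connections in advance, the number of connections can be brought up to at least minIdle before traffic arrives. Combined with small-traffic warm-up, this resolves the troublesome cold start problem described above.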
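The idea of filling the pool before reporting readiness can be sketched as follows. `WarmPool`, `warm_up`, and `is_ready` are hypothetical names, not Druid or MSE APIs. Only the `minIdle` concept comes from the text above.

```python
class WarmPool:
    """Hypothetical pool that pre-establishes connections before traffic."""

    def __init__(self, connect, min_idle):
        self._connect = connect     # factory that opens one connection
        self.min_idle = min_idle
        self.idle = []

    def warm_up(self):
        # Open connections eagerly until the idle set reaches min_idle.
        while len(self.idle) < self.min_idle:
            self.idle.append(self._connect())

    def is_ready(self):
        # A readiness probe could report success only once this is True.
        return len(self.idle) >= self.min_idle
```

Tying the readiness signal to `is_ready()` means traffic reaches the instance only after the expensive connection setup is done.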

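  • Evicting "bad" connections

Sometimes the connection pool contains problematic connections, caused perhaps by jitter in the underlying network or by slow execution or deadlocks in the business logic. If abnormal connections can be detected from the connection pool's perspective and evicted and reclaimed in time, the overall stability of the pool is preserved and it is not dragged down by individual problematic operations or network jitter.

  • Access control

In principle, not every database table should be freely accessible. Sometimes an important table should be write-protected and read-only for less important services. When the database is jittering or the thread pool is full, it helps to reduce the execution of time-consuming read SQL. And some tables holding sensitive data should be readable and writable by one application only. Dynamic access control addresses these cases: access control rules are pushed in real time to apply blacklist and whitelist controls, such as read or write bans, on the SQL of individual methods and applications against database instances and tables.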
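A blacklist check of this kind could be modeled as below. The rule shape (`DenyRule` with `app`, `table`, `operation`) and the `"*"` wildcard are assumptions for illustration. The actual MSE rule format is not described in the article.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DenyRule:
    """Hypothetical blacklist rule; "*" as app matches any application."""
    app: str
    table: str
    operation: str   # "read" or "write"


def is_allowed(rules, app, table, operation):
    """Return False if any deny rule matches this access."""
    for rule in rules:
        if (rule.app in ("*", app)
                and rule.table == table
                and rule.operation == operation):
            return False
    return True
```

For example, `DenyRule("*", "orders", "write")` would make a hypothetical `orders` table read-only for every application, while reads continue to pass.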

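  • Database grayscale

In a microservice architecture, dependencies between services are intricate, and releasing a feature sometimes requires several services to be upgraded and launched together. The goal is to validate the new versions of these services together with a small amount of grayscale traffic. This is the full-link grayscale scenario specific to microservice architectures, in which multiple service versions are validated by building environment isolation from the gateway through all backend services. With the shadow table approach in MSE, users can achieve full-link grayscale at the database layer without modifying any business code.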
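The shadow table idea can be illustrated with a deliberately naive rewrite of table names for grayscale traffic. How MSE actually implements it is not described here. The function name, the `_shadow` suffix, and the regex approach are all assumptions, and a real implementation would need a proper SQL parser.

```python
import re


def to_shadow_sql(sql, tables, is_gray, suffix="_shadow"):
    """Hypothetical rewrite: point gray traffic at shadow tables.

    Naive by design: it matches whole-word table names with a regex,
    so it would also rewrite a table name inside a string literal.
    """
    if not is_gray:
        return sql                        # baseline traffic is untouched
    for table in tables:
        pattern = rf"\b{re.escape(table)}\b"
        sql = re.sub(pattern, table + suffix, sql)
    return sql
```

Because the rewrite happens below the business code, grayscale writes land in the shadow table and the online table stays clean, which is what prevents the data corruption scenario listed earlier.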

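  • Dynamic read/write splitting

With the SQL insight provided by MSE, combined with an understanding of the business, interface requests can be quickly identified and classified as weak requests. Read operations that heavily affect the performance and stability of the primary database are routed to RDS read-only instances, which effectively reduces the read and write pressure on the primary database and further improves the stability of microservice applications.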
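A routing decision of this kind might be sketched as follows. The function name, the data source labels `"primary"` and `"read_replica"`, and the SELECT-prefix check are hypothetical simplifications, not the MSE mechanism.

```python
def choose_datasource(sql, interface, weak_interfaces):
    """Hypothetical router: send reads of 'weak' interfaces to a replica."""
    is_read = sql.lstrip().lower().startswith("select")
    if is_read and interface in weak_interfaces:
        return "read_replica"     # tolerate replication lag, offload primary
    return "primary"              # writes and strongly consistent reads
```

Keeping the set of weak interfaces as data rather than code is what makes the split dynamic: an interface can be moved to or from the replica without redeploying the application.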

10.png
The above is a preview of the database governance capabilities that MSE is about to release. Starting from the application's perspective, we have distilled our practical experience with stability governance, performance optimization, and efficiency improvement when accessing and using databases. For every backend application the database is of the utmost importance, and we hope these database governance capabilities help everyone use database services better.

Same-AZ priority

Within the same city, RT is generally low (under 3 ms). By default, therefore, a large local area network can be built across different data centers in the same city, and applications can be distributed across multiple data centers to handle the risk of traffic loss when a single data center fails. Compared with remote multi-active deployment, this infrastructure is cheaper to build and requires fewer architectural changes. Under a microservice architecture, however, the links between applications are intricate, and governance complexity grows as links get deeper. In the scenario shown in the figure below, front-end traffic is likely to see a sudden RT increase because applications call each other across data centers, resulting in traffic loss.

Usage scenarios

When applications are deployed in multiple data centers, calls between applications may cross data centers.

11.png
Application A in data center 1 calls application B in data center 2. Cross-data-center calls increase network latency, which increases HTTP response time.

After same-AZ priority is enabled, the consumer calls provider services in the same data center first:

12.png

Solution

Based on routing rules, the same availability zone is identified automatically and given priority, which reduces call latency and improves performance. Traffic can also be switched in disaster recovery scenarios to safeguard availability.
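The routing preference can be sketched as a filter with a fallback. `Instance` and `select_candidates` are hypothetical names, and falling back to all instances when the local zone has no provider is an assumption about how the disaster recovery case could behave.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Instance:
    """Hypothetical provider instance with its availability zone."""
    address: str
    zone: str


def select_candidates(instances, consumer_zone):
    """Prefer providers in the consumer's zone; otherwise use all of them."""
    same_zone = [i for i in instances if i.zone == consumer_zone]
    # An empty same-zone list (e.g. the zone's providers are down)
    # falls back to cross-zone providers so calls still succeed.
    return same_zone or list(instances)
```

A load balancer would then pick from the returned candidates, so cross-zone latency is paid only when no local provider is available.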

13.png

Conclusion

With capability upgrades in rate limiting and degradation, database governance, and same-AZ priority, the MSE Governance Center helps enterprises achieve system resilience more easily, detect abnormal SQL states in time, and apply targeted governance and protection. Same-AZ priority improves overall system performance and helps build a robust and stable runtime environment. This upgrade is the first stage of the Governance Center upgrade, and more governance capabilities will be introduced over time to safeguard your systems.

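MSE Registration and Configuration Center Professional Edition is 10% off for first-time purchases, and all specifications of the prepaid MSE cloud-native gateway are 10% off. Click here to get the discount!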

Copyright notice
This article was created by [InfoQ]. Please include the original link when reposting. Thanks.
https://yzsam.com/2022/222/202208101804465525.html