[Cloud Native] This article explains how to add Tencent Crane to Kubevela addon

2022-08-10 00:32:00 Digital China cloud base

I. Project background

Huge cloud waste

According to Gartner research:

On average, enterprises that move to the cloud save 14% of costs, but by 2020, lacking cost-optimization measures, 80% of enterprises will significantly overshoot their cloud resource budgets;
meanwhile, 45% of enterprises, lacking optimization measures, over-purchase 55% more resources during lift-and-shift migrations to the cloud, and spend 70% more in their first 18 months on the cloud.

Faced with the huge investment costs and the waste of cloud resources, those who were once fascinated by the cloud have even begun to dread the very word "cloud".

How should costs across different public clouds be optimized and managed? This has become an urgent problem.

In this situation, in order to reduce the waste of resources on the cloud, a cultural transformation, FinOps (cloud cost optimization), came into being.

What is FinOps

Let's first look at the FinOps Foundation's definition of FinOps:

FinOps is a transformation that integrates DevOps, finance, and the business. Its goal is the financial discipline and technical solutions that optimize an organization's cloud spending: based on historical spending records and information about expected load, FinOps can pre-allocate resources or estimate costs when needed.
FinOps can be called "financial operations", or more bluntly "cost optimization": it brings financial accountability to cloud IT spending, making adjustments to optimize both quality and spend.

FinOps can be understood simply as cloud cost optimization. To better plan and forecast spending on cloud consumption, more and more companies are turning to FinOps today.

And Crane was born precisely for FinOps.

What is Crane

Crane is a cloud-native open-source project. To help cloud-native users achieve truly extreme cost reduction while ensuring business stability, Tencent launched Crane (Cloud Resource Analytics and Economics), the first domestic open-source cost-optimization project based on cloud-native technology.

Crane follows the FinOps standard and aims to provide cloud-native users with a one-stop solution for cloud cost optimization.

Crane's goal is to provide a one-stop solution that helps Kubernetes users save on cloud resource usage through a rich set of features, including:

  • Time series prediction based on monitoring data
  • Usage and cost visibility
  • Usage and cost optimization, including:

      R2 (Resource Re-allocation)
      R3 (Request & Replicas Recommendation)
      Effective Pod Autoscaling (effective horizontal and vertical Pod autoscaling)
      Cost optimization

  • Enhanced QoS based on Pod PriorityClass
  • Load-aware scheduling

At the same time, because Crane builds on Prometheus and Grafana, it can also play a role in monitoring resources on the cloud.

For fast, one-click deployment of Crane, we chose to package Crane as a KubeVela addon and integrate it into the cluster for quick use.

II. Integrating Crane as an addon

# 1 Project directory structure

│ metadata.yaml
│ readme.md
│ template.yaml
│
├─image
│     crane-overview.png
│     crane.png
│     wechat.jpeg
│
└─resources
    ├─config
    │     grafana-config.yaml
    │     namespace.yaml
    │     prometheus-config.yaml
    │
    ├─release
    │     crane-release.yaml
    │     fadvisor-release.yaml
    │     grafana-release.yaml
    │     prometheus-release.yaml
    │
    └─repo
            crane-repo.yaml
            grafana-repo.yaml
            prometheus-repo.yaml

The pictures under image are the images required by readme.md; each yaml file under resources defines one k8s resource.

The crane addon mainly takes the Helm route (via the KubeVela Helm/FluxCD addon): it first introduces the repo links for Prometheus, Grafana, and Crane, and then uses the configuration in the corresponding ConfigMaps to deploy the matching Helm Releases.
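
The directory tree above also lists metadata.yaml and template.yaml, whose contents the article does not show. As a rough sketch only, a metadata.yaml for an addon of this shape might look like the following under the legacy KubeVela addon format; every field value here is illustrative, and the fluxcd dependency reflects the fact that the addon relies on the HelmRepository/HelmRelease CRDs:

# Illustrative sketch - the article does not show the real file
name: crane
version: 1.0.0
description: Cloud cost optimization with Tencent Crane
tags:
  - finops
dependencies:
  - name: fluxcd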

# 2 Adding native K8s resources

Create the crane-system namespace:

apiVersion: v1
kind: Namespace
metadata:
  name: crane-system

Create the ConfigMap holding the Prometheus configuration:

apiVersion: v1
kind: ConfigMap
metadata:
  name: prometheus
  namespace: crane-system
data:
  override_values: |
    ## Prometheus server ConfigMap entries
    ##
    serverFiles:
      ## Records configuration
      ## Ref: https://prometheus.io/docs/prometheus/latest/configuration/recording_rules/
      recording_rules.yml:
        groups:
          - name: costs.rules
            interval: 3600s
            rules:
              - expr: |
                  sum(label_replace(irate(container_cpu_usage_seconds_total{container!="POD", container!="",image!=""}[1h]), "node", "$1", "instance",  "(.*)")) by (container, pod, node, namespace) * on (node) group_left() avg(avg_over_time(node_cpu_hourly_cost[1h])) by (node)
                record: namespace:container_cpu_usage_costs_hourly:sum_rate
              - expr: |
                  sum(label_replace(avg_over_time(container_memory_working_set_bytes{container!="POD",container!="",image!=""}[1h]), "node", "$1", "instance",  "(.*)")) by (container, pod, node, namespace) / 1024.0 / 1024.0 / 1024.0 * on (node) group_left() avg(avg_over_time(node_ram_hourly_cost[1h])) by (node)
                record: namespace:container_memory_usage_costs_hourly:sum_rate
              - expr: |
                  avg(avg_over_time(node_cpu_hourly_cost[1h])) by (node)
                record: node:node_cpu_hourly_cost:avg
              - expr: |
                  avg(avg_over_time(node_ram_hourly_cost[1h])) by (node)
                record: node:node_ram_hourly_cost:avg
              - expr: |
                  avg(avg_over_time(node_total_hourly_cost[1h])) by (node)
                record: node:node_total_hourly_cost:avg
          - name: scheduler.rules.30s
            interval: 30s
            rules:
              - record: cpu_usage_active
                expr: 100 - (avg by (instance) (irate(node_cpu_seconds_total{mode="idle"}[90s])) * 100)
              - record: mem_usage_active
                expr: 100*(1-node_memory_MemAvailable_bytes/node_memory_MemTotal_bytes)
          - name: scheduler.rules.1m
            interval: 1m
            rules:
              - record: cpu_usage_avg_5m
                expr: avg_over_time(cpu_usage_active[5m])
              - record: mem_usage_avg_5m
                expr: avg_over_time(mem_usage_active[5m])
          - name: scheduler.rules.5m
            interval: 5m
            rules:
              - record: cpu_usage_max_avg_1h
                expr: max_over_time(cpu_usage_avg_5m[1h])
              - record: cpu_usage_max_avg_1d
                expr: max_over_time(cpu_usage_avg_5m[1d])
              - record: mem_usage_max_avg_1h
                expr: max_over_time(mem_usage_avg_5m[1h])
              - record: mem_usage_max_avg_1d
                expr: max_over_time(mem_usage_avg_5m[1d])
     
     
    # adds additional scrape configs to prometheus.yml
    # must be a string so you have to add a | after extraScrapeConfigs:
    # example adds prometheus-blackbox-exporter scrape config
    extraScrapeConfigs: |-
      # this is used to scrape fadvisor
      - job_name: "fadvisor"
        honor_timestamps: true
        scheme: http
        metrics_path: /metrics
        static_configs:
          - targets: ['fadvisor.crane-system.svc.cluster.local:8081']
     
     
    server:
      persistentVolume:
        enabled: false
      service:
        annotations: { }
        labels: { }
        clusterIP: ""
     
        ## List of IP addresses at which the Prometheus server service is available
        ## Ref: https://kubernetes.io/docs/user-guide/services/#external-ips
        ##
        externalIPs: [ ]
     
        loadBalancerIP: ""
        loadBalancerSourceRanges: [ ]
        servicePort: 8080
        sessionAffinity: None
        type: ClusterIP
     
    nodeExporter:
      hostRootfs: false
     
    alertmanager:
      enabled: false
     
    pushgateway:
      enabled: false
     
    kubeStateMetrics:
      ## If false, kube-state-metrics sub-chart will not be installed
      ##
      enabled: true
     
    ## kube-state-metrics sub-chart configurable values
    ## Please see https://github.com/prometheus-community/helm-charts/tree/main/charts/kube-state-metrics
    ##
    kube-state-metrics:
      prometheus:
        monitor:
          honorLabels: true
      image:
        repository: ccr.ccs.tencentyun.com/tkeimages/kube-state-metrics
        pullPolicy: IfNotPresent
        tag: "2.2.4"
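
With the costs.rules group above loaded, the recorded series can be combined into a simple per-namespace cost view. The following PromQL query is an illustrative example built only from the recording rules defined above; it is not part of the original article:

# Illustrative query: hourly CPU + memory cost per namespace
sum by (namespace) (
    namespace:container_cpu_usage_costs_hourly:sum_rate
  + namespace:container_memory_usage_costs_hourly:sum_rate
)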

Create the ConfigMap holding the Grafana configuration:

apiVersion: v1
kind: ConfigMap
metadata:
  labels:
    app.kubernetes.io/managed-by: Helm
  annotations:
    meta.helm.sh/release-name: grafana
    meta.helm.sh/release-namespace: crane-system
  name: grafana
  namespace: crane-system
data:
  override_values: |
    service:
      enabled: true
      type: ClusterIP
      port: 8082
      targetPort: 3000
      # targetPort: 4181 To be used with a proxy extraContainer
      annotations: {}
      labels: {}
      portName: service
     
    # Administrator credentials when not using an existing secret (see below)
    adminUser: admin
    adminPassword: admin
    # ...
    # Full configuration: follow the WeChat official account 【神州数码云基地】 and reply 【grafana-config.yaml】 to receive the complete file

# 3 Adding the repos

Add the repos for Prometheus, Grafana, and Crane as HelmRepository resources.

By adding these three repos, we gain access to the remote Helm repositories from which Helm will install the charts.

prometheus-repo.yaml:

apiVersion: source.toolkit.fluxcd.io/v1beta1
kind: HelmRepository
metadata:
  name: prometheus
  namespace: crane-system
spec:
  interval: 10m
  timeout: 5m
  url: https://prometheus-community.github.io/helm-charts

grafana-repo.yaml:

apiVersion: source.toolkit.fluxcd.io/v1beta1
kind: HelmRepository
metadata:
  name: grafana
  namespace: crane-system
spec:
  interval: 10m
  timeout: 5m
  url: https://grafana.github.io/helm-charts

crane-repo.yaml:

apiVersion: source.toolkit.fluxcd.io/v1beta1
kind: HelmRepository
metadata:
  name: crane
  namespace: crane-system
spec:
  interval: 10m
  timeout: 5m
  url: https://gocrane.github.io/helm-charts
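
Once the addon has applied these resources, you can confirm that FluxCD has fetched all three repository indexes; the check below is our suggestion rather than a step from the original article:

# All three HelmRepository objects should eventually report Ready
kubectl get helmrepositories -n crane-system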

# 4 Deploying the Helm releases with Helm

Deploy Prometheus:

apiVersion: helm.toolkit.fluxcd.io/v2beta1
kind: HelmRelease
metadata:
  name: prometheus
  namespace: crane-system
spec:
  timeout: 10m
  interval: 5m
  chart:
    spec:
      chart: prometheus
      version: 15.8.5
      sourceRef:
        kind: HelmRepository
        name: prometheus
      interval: 5m
  targetNamespace: crane-system
  releaseName: prometheus
  valuesFrom: [{kind: ConfigMap,name: prometheus,valuesKey: override_values}] 

sourceRef points to the Helm repo configured in step 3;

targetNamespace is the target namespace for the deployment;

releaseName selects the release to manage; because the same repository may contain more than one, a specific release must be named here;

valuesFrom reads the configuration values: here the kind is ConfigMap, the name is that of the Prometheus ConfigMap created in step 2, and valuesKey is the key of the data entry inside that ConfigMap;
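
For readability, the inline valuesFrom used above is exactly equivalent to the block-style form:

  # same meaning as the one-line flow style in the HelmRelease above
  valuesFrom:
    - kind: ConfigMap
      name: prometheus
      valuesKey: override_values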

Deploy Grafana:

apiVersion: helm.toolkit.fluxcd.io/v2beta1
kind: HelmRelease
metadata:
  name: grafana
  namespace: crane-system
spec:
  timeout: 10m
  interval: 5m
  chart:
    spec:
      chart: grafana
      version: 6.28.0
      sourceRef:
        kind: HelmRepository
        name: grafana
      interval: 5m
  targetNamespace: crane-system
  releaseName: grafana
  valuesFrom: [{kind: ConfigMap,name: grafana,valuesKey: override_values}]

Deploy fadvisor:

apiVersion: helm.toolkit.fluxcd.io/v2beta1
kind: HelmRelease
metadata:
  name: fadvisor
  namespace: crane-system
spec:
  timeout: 10m
  interval: 5m
  chart:
    spec:
      chart: fadvisor
      version: 0.2.0
      sourceRef:
        kind: HelmRepository
        name: crane
      interval: 5m
  targetNamespace: crane-system
  releaseName: fadvisor

fadvisor here is the metrics collector officially developed by the Crane project; it is the target of the fadvisor scrape job in the Prometheus configuration above.

Deploy Crane:

apiVersion: helm.toolkit.fluxcd.io/v2beta1
kind: HelmRelease
metadata:
  name: crane
  namespace: crane-system
spec:
  timeout: 10m
  interval: 5m
  chart:
    spec:
      chart: crane
      version: 0.3.0
      sourceRef:
        kind: HelmRepository
        name: crane
      interval: 5m
  targetNamespace: crane-system
  releaseName: crane

PS:

The deployments above need to be performed in order, but in a KubeVela addon deployment all resources are applied at the same time, so the execution order may sometimes be scrambled.

This does not cause the overall installation to fail, however, because k8s has a periodic re-reconciliation mechanism that recreates the failed resources whenever it is triggered.
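
To watch that convergence happen, one possible check (ours, not the article's) is to list the Flux HelmRelease objects until they all report Ready:

# Failed releases are retried on the next reconciliation interval
kubectl get helmreleases -n crane-system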

# 5 Enabling the addon

Use the vela CLI to enable the crane addon from the local directory;

first go to the local crane directory, then run:

vela addon enable crane

You can see each component being deployed, and the results can be viewed under the crane-system namespace.
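
The original article's screenshot is not reproduced here; two illustrative commands for the same check (assumed, not from the original article):

# Addon status as seen by KubeVela
vela addon status crane
# Pods of Prometheus, Grafana, fadvisor and Crane should all be running
kubectl get pods -n crane-system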

# 6 Changing the crane service access type to NodePort

In the deployed environment, run the kubectl get svc -n crane-system command to find the corresponding service.


Use kubectl edit svc craned -n crane-system and change type to NodePort.

At this point, k8s will automatically assign a port.

Use kubectl get svc craned -n crane-system -o yaml to check the assigned port, i.e., the dashboard-service port.
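
If you prefer a non-interactive change over the kubectl edit above, an equivalent one-liner is shown below (our suggestion, not a step from the original article):

# Same effect as editing spec.type by hand
kubectl patch svc craned -n crane-system -p '{"spec":{"type":"NodePort"}}'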

You can also use the following command to temporarily forward the port instead:

kubectl port-forward -n crane-system svc/craned 9090

# 7 Monitored metrics

  • CPU and memory utilization of the node servers

  • Resource usage of each namespace

At the same time, the corresponding CRD resources, such as EHPA, are also added to the cluster.
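
As an illustration of one such CRD, below is a minimal EffectiveHorizontalPodAutoscaler sketch modeled on the Crane documentation; the php-apache target is hypothetical, and field names may vary across Crane versions:

# Illustrative sketch based on the Crane docs - not from the original article
apiVersion: autoscaling.crane.io/v1alpha1
kind: EffectiveHorizontalPodAutoscaler
metadata:
  name: php-apache
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: php-apache
  minReplicas: 1
  maxReplicas: 10
  scaleStrategy: Auto
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 50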
