Aaron's Dev Path

About Me

Dev Path

gitGraph:
  commit id:"Graduate From High School" tag:"Linfen, China"
  commit id:"Got Driver Licence" tag:"2013.08"
  branch TYUT
  commit id:"Enrollment TYUT 🥰"  tag:"Taiyuan, China"
  commit id:"Develop Game App" tag:"“Hello Hell”" type: HIGHLIGHT
  commit id:"Plan:3+1" tag:"2016.09"
  branch Briup.Ltd
  commit id:"First Internship" tag:"Suzhou, China"
  commit id:"CRUD boy" 
  commit id:"Dimission" tag:"2017.01" type:REVERSE
  checkout TYUT
  merge Briup.Ltd id:"Final Presentation" tag:"2017.04"
  checkout Briup.Ltd
  branch Enjoyor.PLC
  commit id:"Second Internship" tag:"Hangzhou,China"
  checkout TYUT
  merge Enjoyor.PLC id:"Got SE Bachelor Degree " tag:"2017.07"
  checkout Enjoyor.PLC
  commit id:"First Full Time Job" tag:"2017.07"
  commit id:"Dimssion" tag:"2018.04"
  checkout main
  merge Enjoyor.PLC id:"Plan To Study Aboard"
  commit id:"Get Some Rest" tag:"2018.06"
  branch TOEFL-GRE
  commit id:"Learning At Huahua.Ltd" tag:"Beijing,China"
  commit id:"Got USC Admission" tag:"2018.11" type: HIGHLIGHT
  checkout main
  merge TOEFL-GRE id:"Prepare To Leave" tag:"2018.12"
  branch USC
  commit id:"Pass Pre-School" tag:"Los Angeles,USA"
  checkout main
  merge USC id:"Back Home,Summer Break" tag:"2019.06"
  commit id:"Back School" tag:"2019.07"
  checkout USC
  merge main id:"Got Straight As"
  commit id:"Leaning ML, DL, GPT"
  checkout main
  merge USC id:"Back,Due to COVID-19" tag:"2021.02"
  checkout USC
  commit id:"Got DS Master Degree" tag:"2021.05"
  checkout main
  commit id:"Got An offer" tag:"2021.06"
  branch Zhejianglab
  commit id:"Second Full Time" tag:"Hangzhou,China"
  commit id:"Got Promotion" tag:"2024.01"
  commit id:"For Now"
Mar 7, 2024

Subsections of Aaron's Dev Path

🐙Argo (CI/CD)

Content

CheatSheets

argoCD

  • decode password
kubectl -n argocd get secret argocd-initial-admin-secret -o jsonpath="{.data.password}" | base64 -d
  • relogin
ARGOCD_PASS=$(kubectl -n argocd get secret argocd-initial-admin-secret -o jsonpath="{.data.password}" | base64 -d)
MASTER_IP=$(kubectl get nodes --selector=node-role.kubernetes.io/control-plane -o jsonpath='{$.items[0].status.addresses[?(@.type=="InternalIP")].address}')
argocd login --insecure --username admin $MASTER_IP:30443 --password $ARGOCD_PASS
  • force terminate a stuck operation
argocd app terminate-op <$>

argo Workflow

argo Rollouts

Mar 7, 2024

Subsections of 🐙Argo (CI/CD)

Subsections of Argo CD

Subsections of App Template

Deploy A Nginx App

Sync

When your k8s resource files are located in the `mainfests` folder, you can use the following Application manifest to deploy your app;
you only need to set `spec.source.path: mainfests`.

  • sample-repo
    • content
    • src
    • mainfests
      • deploy.yaml
      • svc.yaml
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: hugo-blog
spec:
  project: default
  source:
    repoURL: 'git@github.com:AaronYang0628/home-site.git'
    targetRevision: main
    path: mainfests
  syncPolicy:
    automated:
      prune: true
      selfHeal: true
    syncOptions:
      - CreateNamespace=true
      - ApplyOutOfSyncOnly=true
  destination:
    server: https://kubernetes.default.svc
    namespace: application

If you need not only the files in the `mainfests` folder but also files in the root folder,

you have to create an extra file `kustomization.yaml` and set `spec.source.path: .`

  • sample-repo
    • kustomization.yaml
    • content
    • src
    • mainfests
      • deploy.yaml
      • svc.yaml
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: hugo-blog
spec:
  project: default
  source:
    repoURL: 'git@github.com:AaronYang0628/home-site.git'
    targetRevision: main
    path: .
  syncPolicy:
    automated:
      prune: true
      selfHeal: true
    syncOptions:
      - CreateNamespace=true
      - ApplyOutOfSyncOnly=true
  destination:
    server: https://kubernetes.default.svc
    namespace: application
# kustomization.yaml
resources:
  - manifests/pvc.yaml
  - manifests/job.yaml
  - manifests/deployment.yaml
  - ...
Oct 22, 2025

Deploy N Clusters

With its declarative GitOps model, ArgoCD handles multi-cluster application delivery very elegantly. It lets you manage deployments to multiple Kubernetes clusters from one central Git repository, keeping their state consistent and making rollbacks fast.

The diagram below summarizes a typical multi-cluster release workflow with ArgoCD and gives you an overall picture:

flowchart TD
    A[Git Repository] --> B{ArgoCD Server}

    B --> C[ApplicationSet<br>Cluster Generator]
    B --> D[ApplicationSet<br>Git Generator]
    B --> E[Manually created<br>Application resources]

    C --> F[Cluster A<br>App1 & App2]
    C --> G[Cluster B<br>App1 & App2]

    D --> H[Cluster A<br>App1]
    D --> I[Cluster A<br>App2]

    E --> J[Specific cluster<br>Specific app]

🔗 Connecting Clusters to ArgoCD

Before ArgoCD can manage external clusters, you first need to add the target clusters' access credentials.

  1. Get the target cluster's credentials: make sure you have the kubeconfig file of the target cluster.
  2. Add the cluster to ArgoCD: use the ArgoCD CLI to add it. This creates a Secret holding the cluster credentials in the ArgoCD namespace.
    argocd cluster add <context-name> --name <cluster-name> --kubeconfig ~/.kube/config
    • <context-name> is the context name in your kubeconfig.
    • <cluster-name> is the alias you give this cluster inside ArgoCD.
  3. Verify the connection: afterwards you can check the cluster list on the ArgoCD UI under "Settings" > "Clusters", or via the CLI:
    argocd cluster list

💡 Choosing a Multi-Cluster Deployment Strategy

Once the clusters are connected, the core task is defining deployment rules. ArgoCD describes deployments mainly through the Application and ApplicationSet resources.

  • Application resource: defines the deployment of one application to one specific cluster. When managing many clusters and applications, creating each Application by hand becomes tedious.
  • ApplicationSet resource: the recommended way to do multi-cluster deployment. Based on generators, it automatically creates Application resources for multiple clusters or multiple applications.

The flowchart above shows the two main ApplicationSet generators as well as the manually created Application approach.

Comparison of Common ApplicationSet Generators

| Generator type | How it works | When to use |
| --- | --- | --- |
| List Generator | Clusters and URLs are listed statically in YAML. | A fixed set of clusters that rarely changes. |
| Cluster Generator | Dynamically uses the clusters already registered in ArgoCD. | Clusters change dynamically and new ones should be picked up automatically. |
| Git Generator | Generates applications from the directory structure of a Git repository. | Many microservices, each living in its own directory. |

🛠️ Practical Configuration Example

Here is an ApplicationSet YAML configuration using the Cluster Generator as an example:

apiVersion: argoproj.io/v1alpha1
kind: ApplicationSet
metadata:
  name: my-app-multi-cluster
spec:
  generators:
    - clusters: {} # automatically discover all clusters registered in ArgoCD
  template:
    metadata:
      name: '{{name}}-my-app'
    spec:
      project: default
      source:
        repoURL: 'https://your-git-repo.com/your-app.git'
        targetRevision: HEAD
        path: k8s-manifests
      destination:
        server: '{{server}}' # cluster API server address provided by the generator
        namespace: my-app-namespace
      syncPolicy:
        syncOptions:
        - CreateNamespace=true # create the namespace automatically
        automated:
          prune: true # prune resources automatically
          selfHeal: true # automatically heal drift

In this template:

  • clusters: {} under generators makes ArgoCD automatically discover all registered clusters.
  • In the template, {{name}} and {{server}} are variables that the Cluster Generator fills in for every registered cluster.
  • The syncPolicy settings enable automated sync, automatic namespace creation and resource pruning.
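
If your applications live one-per-directory in Git instead, a Git directory generator achieves a similar effect. The following is only a sketch under assumed values (the repository URL and the apps/* directory layout are placeholders):

apiVersion: argoproj.io/v1alpha1
kind: ApplicationSet
metadata:
  name: my-apps-from-git
spec:
  generators:
    - git:
        repoURL: 'https://your-git-repo.com/your-app.git'
        revision: HEAD
        directories:
          - path: apps/*   # one Application per sub-directory
  template:
    metadata:
      name: '{{path.basename}}'
    spec:
      project: default
      source:
        repoURL: 'https://your-git-repo.com/your-app.git'
        targetRevision: HEAD
        path: '{{path}}'
      destination:
        server: https://kubernetes.default.svc
        namespace: '{{path.basename}}'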

⚠️ Key Points for Multi-Cluster Management

  1. Cluster access and network: make sure the ArgoCD control plane has network connectivity to every target cluster's API server and the RBAC permissions to create resources in the target namespaces.
  2. Flexible sync policies:
    • For development environments you can enable automated sync so that Git changes are deployed automatically.
    • For production environments it is recommended to disable auto-sync and trigger syncs manually or through a PR approval flow for extra control.
  3. High availability and performance: when managing many clusters and applications, consider an HA deployment. You may need to tune the replica counts and resource limits of argocd-repo-server and argocd-application-controller (see the sketch after this list).
  4. Consider Argo CD Agent: for large-scale fleets, explore Argo CD Agent. It distributes part of the control plane (such as the application-controller) onto the managed clusters, which improves scalability. Note that as of October 2025 this feature is still in Tech Preview in OpenShift GitOps.
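
For example, a rough sketch of scaling the repo server (the resource name is the ArgoCD default; scaling argocd-application-controller additionally requires enabling controller sharding, so treat this only as a starting point):

kubectl -n argocd scale deployment argocd-repo-server --replicas=3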

💎 Summary

The key to managing multi-cluster application delivery with ArgoCD is mastering ApplicationSet and its Generators. With the Cluster Generator or the Git Generator you can flexibly achieve "define once, deploy everywhere".

Hopefully this helps you get started with building a multi-cluster release workflow.

Mar 14, 2025

ArgoCD Cheatsheets

  • decode password
kubectl -n argocd get secret argocd-initial-admin-secret -o jsonpath="{.data.password}" | base64 -d
  • relogin
ARGOCD_PASS=$(kubectl -n argocd get secret argocd-initial-admin-secret -o jsonpath="{.data.password}" | base64 -d)
MASTER_IP=$(kubectl get nodes --selector=node-role.kubernetes.io/control-plane -o jsonpath='{$.items[0].status.addresses[?(@.type=="InternalIP")].address}')
argocd login --insecure --username admin $MASTER_IP:30443 --password $ARGOCD_PASS
  • force terminate a stuck operation
argocd app terminate-op <$>
Mar 14, 2024

Argo CD Agent

Installation

Content

    Mar 7, 2024

    Argo WorkFlow

    What is Argo Workflow?

    Argo Workflows is an open source container-native workflow engine for orchestrating parallel jobs on Kubernetes. Argo Workflows is implemented as a Kubernetes CRD.

    • Define workflows where each step in the workflow is a container.
    • Model multi-step workflows as a sequence of tasks or capture the dependencies between tasks using a graph (DAG).
    • Easily run compute intensive jobs for machine learning or data processing in a fraction of the time using Argo Workflows on Kubernetes.
    • Run CI/CD pipelines natively on Kubernetes without configuring complex software development products.
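
    To make the container-per-step idea concrete, here is a minimal hello-world Workflow sketch (namespace and service account are left to your setup; the image tag is only an example):

    apiVersion: argoproj.io/v1alpha1
    kind: Workflow
    metadata:
      generateName: hello-world-   # each submission gets a unique generated name
    spec:
      entrypoint: main
      templates:
      - name: main
        container:
          image: alpine:3.7
          command: [echo, "hello from argo workflows"]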

    Installation

    Content

    Mar 7, 2024

    Subsections of Argo WorkFlow

    Argo Workflows Cheatsheets

    Mar 14, 2024

    Subsections of Workflow Template

    DAG Template

    DAG Template

    apiVersion: argoproj.io/v1alpha1
    kind: Workflow
    metadata:
      generateName: dag-diamond-
    spec:
      entrypoint: entry
      serviceAccountName: argo-workflow
      templates:
      - name: echo
        inputs:
          parameters:
          - name: message
        container:
          image: alpine:3.7
          command: [echo, "{{inputs.parameters.message}}"]
      - name: entry
        dag:
          tasks:
          - name: start
            template: echo
            arguments:
                parameters: [{name: message, value: DAG initialized}]
          - name: diamond
            template: diamond
            dependencies: [start]
      - name: diamond
        dag:
          tasks:
          - name: A
            template: echo
            arguments:
              parameters: [{name: message, value: A}]
          - name: B
            dependencies: [A]
            template: echo
            arguments:
              parameters: [{name: message, value: B}]
          - name: C
            dependencies: [A]
            template: echo
            arguments:
              parameters: [{name: message, value: C}]
          - name: D
            dependencies: [B, C]
            template: echo
            arguments:
              parameters: [{name: message, value: D}]
          - name: end
            dependencies: [D]
            template: echo
            arguments:
              parameters: [{name: message, value: end}]
    kubectl -n business-workflow apply -f - << EOF
    apiVersion: argoproj.io/v1alpha1
    kind: Workflow
    metadata:
      generateName: dag-diamond-
    spec:
      entrypoint: entry
      serviceAccountName: argo-workflow
      templates:
      - name: echo
        inputs:
          parameters:
          - name: message
        container:
          image: alpine:3.7
          command: [echo, "{{inputs.parameters.message}}"]
      - name: entry
        dag:
          tasks:
          - name: start
            template: echo
            arguments:
                parameters: [{name: message, value: DAG initialized}]
          - name: diamond
            template: diamond
            dependencies: [start]
      - name: diamond
        dag:
          tasks:
          - name: A
            template: echo
            arguments:
              parameters: [{name: message, value: A}]
          - name: B
            dependencies: [A]
            template: echo
            arguments:
              parameters: [{name: message, value: B}]
          - name: C
            dependencies: [A]
            template: echo
            arguments:
              parameters: [{name: message, value: C}]
          - name: D
            dependencies: [B, C]
            template: echo
            arguments:
              parameters: [{name: message, value: D}]
          - name: end
            dependencies: [D]
            template: echo
            arguments:
              parameters: [{name: message, value: end}]
    EOF
    Mar 7, 2024

    Subsections of Argo Rollouts

    Blue–Green Deploy

    Argo Rollouts is a Kubernetes CRD controller that extends the native Deployment resource with more advanced deployment strategies. Its core principle can be summarized as: precisely controlling the replica counts and traffic split of multiple ReplicaSets (one per application version) to implement a controlled, automated release process.


    1. Blue-Green Deployment Principles

    The core idea of blue-green deployment is to keep two completely independent environments (blue and green), with only one of them serving production traffic at any time.

    How it works

    1. Initial state

      • Assume the current production environment is the blue version (v1); all traffic points at the blue ReplicaSet.
      • The green environment may exist (for example, with zero replicas) but receives no traffic.
    2. Releasing a new version

      • When a new version (v2) needs to be released, Argo Rollouts creates a green ReplicaSet that is completely isolated from the blue environment and starts all of the required Pods.
      • Key point: at this moment user traffic still goes 100% to the blue v1 version. While the green v2 version starts up and warms up, live users are not affected at all.
    3. Testing and validation

      • Operators or automated scripts can test the green v2 version, for example by calling its APIs, checking logs or running integration tests. This happens without disturbing production traffic.
    4. Switching traffic

      • Once v2 is confirmed to be stable, a single atomic operation switches all production traffic from blue (v1) to green (v2).
      • The switch is usually implemented by updating the selector of a Kubernetes Service or Ingress, for example changing the selector of app: my-app from version: v1 to version: v2.
    5. After the release

      • After the switch, green (v2) becomes the new production environment.
      • The blue (v1) environment is not deleted immediately; it is kept around for a while as a fast rollback path.
      • If v2 turns out to have problems, simply switching traffic back to blue (v1) rolls back quickly with minimal impact.

    Schematic

    [User] --> [Service (selector: version=v1)] --> [Blue ReplicaSet (v1, 100% traffic)]
                                          |
                                          +--> [Green ReplicaSet (v2, 0% traffic, standby)]

    After the switch:

    [User] --> [Service (selector: version=v2)] --> [Green ReplicaSet (v2, 100% traffic)]
                                          |
                                          +--> [Blue ReplicaSet (v1, 0% traffic, kept for rollback)]

    Pros: fast releases and rollbacks, low risk, the service stays available throughout the release. Cons: requires double the hardware resources, and there may be brief traffic handling issues (such as dropped connections) at the moment of the switch.
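
    To make this concrete, a minimal Rollout sketch using the blue-green strategy could look like the following (the service names, image and replica count are assumptions for illustration):

    apiVersion: argoproj.io/v1alpha1
    kind: Rollout
    metadata:
      name: my-app
    spec:
      replicas: 3
      selector:
        matchLabels:
          app: my-app
      template:
        metadata:
          labels:
            app: my-app
        spec:
          containers:
          - name: my-app
            image: my-app:v2              # the new (green) version
      strategy:
        blueGreen:
          activeService: my-app-active    # Service carrying production traffic
          previewService: my-app-preview  # Service pointing at the green ReplicaSet for testing
          autoPromotionEnabled: false     # wait for a manual promote before switching traffic

    With autoPromotionEnabled: false, the traffic switch described above only happens after you run kubectl argo rollouts promote my-app.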


    2. Canary Release Principles

    The core idea of a canary release is to shift traffic from the old version to the new version gradually instead of all at once. This lets you validate the stability and performance of the new version while exposing only a small fraction of users.

    How it works

    1. Initial state

      • As with blue-green, the ReplicaSet of the current stable version (v1) carries 100% of the traffic.
    2. Releasing the canary version

      • Argo Rollouts creates a ReplicaSet for the new version (v2) but starts only a few Pods (for example, 1/10 of the total).
      • Traffic management tooling (a service mesh such as Istio or Linkerd, or an ingress controller such as Nginx) then routes a small share of production traffic (for example 10%) to the v2 Pods, while the remaining 90% still goes to v1.
    3. Progressive promotion

      • This is a multi-step, automated process. The Rollout CRD of Argo Rollouts can define a detailed list of steps.
      • Example steps
        • setWeight: 10 - shift 10% of the traffic to v2 and hold for 5 minutes.
        • pause: {duration: 5m} - pause the rollout and observe v2's runtime metrics.
        • setWeight: 40 - if everything looks good, raise the traffic share to 40%.
        • pause: {duration: 10m} - pause and observe again.
        • setWeight: 100 - finally switch all traffic to v2.
    4. Automated analysis and rollback

      • This is one of the most powerful features of Argo Rollouts. During every pause it keeps querying a metrics analysis service.
      • The metrics analysis can be configured with a set of rules (AnalysisTemplate), for example:
        • check that the HTTP request error rate stays below 1%;
        • check that the average response time stays below 200ms;
        • check custom business metrics (such as order failure rate).
      • If any metric fails, Argo Rollouts automatically aborts the release and rolls all traffic back to v1, with no human intervention. (See the AnalysisTemplate sketch after this list.)
    5. Completing the release

      • Once all steps complete successfully, the v2 ReplicaSet takes over 100% of the traffic and the v1 ReplicaSet is eventually scaled down to zero.
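
    A sketch of such an AnalysisTemplate, assuming a Prometheus provider and a standard http_requests_total metric (the address, metric name and thresholds are placeholders):

    apiVersion: argoproj.io/v1alpha1
    kind: AnalysisTemplate
    metadata:
      name: success-rate
    spec:
      args:
      - name: service-name
      metrics:
      - name: success-rate
        interval: 1m            # evaluate once per minute during a pause
        successCondition: result[0] >= 0.99
        failureLimit: 3         # abort and roll back after 3 failed measurements
        provider:
          prometheus:
            address: http://prometheus.monitoring.svc:9090
            query: |
              sum(rate(http_requests_total{service="{{args.service-name}}", status!~"5.."}[5m]))
              /
              sum(rate(http_requests_total{service="{{args.service-name}}"}[5m]))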

    Schematic

    [User] --> [Istio VirtualService] -- 90% --> [v1 ReplicaSet]
                         |
                         +-- 10% --> [v2 ReplicaSet (canary)]

    (During promotion)

    [User] --> [Istio VirtualService] -- 40% --> [v1 ReplicaSet]
                         |
                         +-- 60% --> [v2 ReplicaSet (canary)]

    (After completion)

    [User] --> [Istio VirtualService] -- 100% --> [v2 ReplicaSet]

    Pros: very low release risk; validation can be automated against real traffic and metrics, enabling safe "hands-off" releases. Cons: the release takes longer and requires integration with more complex traffic management tooling.
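
    The canary steps described above map almost one-to-one onto the Rollout spec. A sketch (the analysis reference and traffic routing depend on your mesh or ingress setup and are assumptions):

    apiVersion: argoproj.io/v1alpha1
    kind: Rollout
    metadata:
      name: my-app
    spec:
      replicas: 10
      selector:
        matchLabels:
          app: my-app
      template:
        metadata:
          labels:
            app: my-app
        spec:
          containers:
          - name: my-app
            image: my-app:v2
      strategy:
        canary:
          steps:
          - setWeight: 10                   # shift 10% of traffic to v2
          - pause: {duration: 5m}           # observe metrics
          - analysis:
              templates:
              - templateName: success-rate  # the AnalysisTemplate sketched earlier
              args:
              - name: service-name
                value: my-app
          - setWeight: 40
          - pause: {duration: 10m}
          - setWeight: 100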


    Summary and Core Value

    | Aspect | Blue-Green Deployment | Canary Release |
    | --- | --- | --- |
    | Core idea | Full switch between isolated environments | Progressive traffic shifting |
    | Traffic control | 100% or 0%, an atomic switch | Fine-grained ratios (1%, 5%, 50%, ...) |
    | Resource cost | High (two full environments) | Lower (old and new Pods share the resource pool) |
    | Release speed | Fast (the switch is instant) | Slow (multiple stages) |
    | Risk control | Risk controlled via fast rollback | Risk controlled via limited exposure and automated analysis |
    | Automation | Relatively simple, mainly automates the switch | Highly automated, decisions driven by metric analysis |

    The core value of Argo Rollouts lies in:

    1. Declarative: just like a Kubernetes Deployment, you declare your release strategy (blue-green or canary steps) in a YAML file.
    2. Controller pattern: the Argo Rollouts controller continuously watches the state of Rollout objects and drives the whole system (K8s API, service mesh, metrics server) toward the declared target state.
    3. Extensibility: through CRDs and AnalysisTemplate it is highly flexible and can integrate with any compatible traffic provider and metrics system.
    4. Automation and safety: it turns "human judgement" into "data-driven automated rules", greatly improving release reliability and efficiency - a key building block for GitOps and continuous delivery.
    Mar 14, 2025

    Argo Rollouts Cheatsheets

    Mar 14, 2024

    Subsections of 🧯BuckUp

    Subsections of ElasticSearch

    ES [Local Disk]

    Preliminary

    • ElasticSearch has been installed; if not, check this link

    • The elasticsearch.yml has configured path.repo, which should be set to the same value as settings.location (this is handled by the helm chart, don't worry)

      ES argocd-app yaml
      apiVersion: argoproj.io/v1alpha1
      kind: Application
      metadata:
        name: elastic-search
      spec:
        syncPolicy:
          syncOptions:
          - CreateNamespace=true
        project: default
        source:
          repoURL: https://charts.bitnami.com/bitnami
          chart: elasticsearch
          targetRevision: 19.11.3
          helm:
            releaseName: elastic-search
            values: |
              global:
                kibanaEnabled: true
              clusterName: elastic
              image:
                registry: m.zjvis.net/docker.io
                pullPolicy: IfNotPresent
              security:
                enabled: false
              service:
                type: ClusterIP
              extraConfig:
                path:
                  repo: /tmp
              ingress:
                enabled: true
                annotations:
                  cert-manager.io/cluster-issuer: self-signed-ca-issuer
                  nginx.ingress.kubernetes.io/rewrite-target: /$1
                hostname: elastic-search.dev.tech
                ingressClassName: nginx
                path: /?(.*)
                tls: true
              master:
                masterOnly: false
                replicaCount: 1
                persistence:
                  enabled: false
                resources:
                  requests:
                    cpu: 2
                    memory: 1024Mi
                  limits:
                    cpu: 4
                    memory: 4096Mi
                heapSize: 2g
              data:
                replicaCount: 0
                persistence:
                  enabled: false
              coordinating:
                replicaCount: 0
              ingest:
                enabled: true
                replicaCount: 0
                service:
                  enabled: false
                  type: ClusterIP
                ingress:
                  enabled: false
              metrics:
                enabled: false
                image:
                  registry: m.zjvis.net/docker.io
                  pullPolicy: IfNotPresent
              volumePermissions:
                enabled: false
                image:
                  registry: m.zjvis.net/docker.io
                  pullPolicy: IfNotPresent
              sysctlImage:
                enabled: true
                registry: m.zjvis.net/docker.io
                pullPolicy: IfNotPresent
              kibana:
                elasticsearch:
                  hosts:
                    - '{{ include "elasticsearch.service.name" . }}'
                  port: '{{ include "elasticsearch.service.ports.restAPI" . }}'
              esJavaOpts: "-Xmx2g -Xms2g"        
        destination:
          server: https://kubernetes.default.svc
          namespace: application

      Diff from the original file:

      extraConfig:
          path:
            repo: /tmp

    Methods

    There are two ways to back up Elasticsearch:

    1. Export the data to text files, e.g. using tools such as elasticdump or esm to dump the data stored in Elasticsearch into files.
    2. Use the snapshot API, which supports incremental backups.

    The first approach is simple and works well for small data volumes, but for large data volumes the snapshot API is the recommended way.
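
    For the first approach, a typical elasticdump invocation might look like this (host, index name and output paths are assumptions):

    # dump the mapping and the data of one index into local JSON files
    elasticdump --input=https://elastic-search.dev.tech:32443/books --output=/tmp/books_mapping.json --type=mapping
    elasticdump --input=https://elastic-search.dev.tech:32443/books --output=/tmp/books_data.json --type=data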

    Steps

    Backup

    1. Create a snapshot repository -> my_fs_repository
    curl -k -X PUT "https://elastic-search.dev.tech:32443/_snapshot/my_fs_repository?pretty" -H 'Content-Type: application/json' -d'
    {
      "type": "fs",
      "settings": {
        "location": "/tmp"
      }
    }
    '

    You can also use a storage class to mount an external path into the pod and store the snapshot files on that mounted path.

    2. Verify that every node in the cluster can use this snapshot repository
    curl -k -X POST "https://elastic-search.dev.tech:32443/_snapshot/my_fs_repository/_verify?pretty"
    3. List all snapshot repositories
    curl -k -X GET "https://elastic-search.dev.tech:32443/_snapshot/_all?pretty"
    4. Check the settings of a specific snapshot repository
    curl -k -X GET "https://elastic-search.dev.tech:32443/_snapshot/my_fs_repository?pretty"
    5. Analyze a snapshot repository
    curl -k -X POST "https://elastic-search.dev.tech:32443/_snapshot/my_fs_repository/_analyze?blob_count=10&max_blob_size=1mb&timeout=120s&pretty"
    6. Take a snapshot manually
    curl -k -X PUT "https://elastic-search.dev.tech:32443/_snapshot/my_fs_repository/ay_snap_02?pretty"
    Use SLM to take snapshots automatically (not working yet)


    7. List the snapshots available in a given snapshot repository
    curl -k -X GET "https://elastic-search.dev.tech:32443/_snapshot/my_fs_repository/*?verbose=false&pretty"
    8. Test the restore
    # Delete an index
    curl -k -X DELETE "https://elastic-search.dev.tech:32443/books?pretty"
    
    # restore that index
    curl -k -X POST "https://elastic-search.dev.tech:32443/_snapshot/my_fs_repository/ay_snap_02/_restore?pretty" -H 'Content-Type: application/json' -d'
    {
      "indices": "books"
    }
    '
    
    # query
    curl -k -X GET "https://elastic-search.dev.tech:32443/books/_search?pretty" -H 'Content-Type: application/json' -d'
    {
      "query": {
        "match_all": {}
      }
    }
    '
    Oct 7, 2024

    ES [S3 Compatible]

    Preliminary

    • ElasticSearch has been installed; if not, check this link

      ES argocd-app yaml
      apiVersion: argoproj.io/v1alpha1
      kind: Application
      metadata:
        name: elastic-search
      spec:
        syncPolicy:
          syncOptions:
          - CreateNamespace=true
        project: default
        source:
          repoURL: https://charts.bitnami.com/bitnami
          chart: elasticsearch
          targetRevision: 19.11.3
          helm:
            releaseName: elastic-search
            values: |
              global:
                kibanaEnabled: true
              clusterName: elastic
              image:
                registry: m.zjvis.net/docker.io
                pullPolicy: IfNotPresent
              security:
                enabled: true
              service:
                type: ClusterIP
              extraEnvVars:
              - name: S3_ACCESSKEY
                value: admin
              - name: S3_SECRETKEY
                value: ZrwpsezF1Lt85dxl
              extraConfig:
                s3:
                  client:
                    default:
                      protocol: http
                      endpoint: "http://192.168.31.111:9090"
                      path_style_access: true
              initScripts:
                configure-s3-client.sh: |
                  elasticsearch_set_key_value "s3.client.default.access_key" "${S3_ACCESSKEY}"
                  elasticsearch_set_key_value "s3.client.default.secret_key" "${S3_SECRETKEY}"
              hostAliases:
              - ip: 192.168.31.111
                hostnames:
                - minio-api.dev.tech
              ingress:
                enabled: true
                annotations:
                  cert-manager.io/cluster-issuer: self-signed-ca-issuer
                  nginx.ingress.kubernetes.io/rewrite-target: /$1
                hostname: elastic-search.dev.tech
                ingressClassName: nginx
                path: /?(.*)
                tls: true
              master:
                masterOnly: false
                replicaCount: 1
                persistence:
                  enabled: false
                resources:
                  requests:
                    cpu: 2
                    memory: 1024Mi
                  limits:
                    cpu: 4
                    memory: 4096Mi
                heapSize: 2g
              data:
                replicaCount: 0
                persistence:
                  enabled: false
              coordinating:
                replicaCount: 0
              ingest:
                enabled: true
                replicaCount: 0
                service:
                  enabled: false
                  type: ClusterIP
                ingress:
                  enabled: false
              metrics:
                enabled: false
                image:
                  registry: m.zjvis.net/docker.io
                  pullPolicy: IfNotPresent
              volumePermissions:
                enabled: false
                image:
                  registry: m.zjvis.net/docker.io
                  pullPolicy: IfNotPresent
              sysctlImage:
                enabled: true
                registry: m.zjvis.net/docker.io
                pullPolicy: IfNotPresent
              kibana:
                elasticsearch:
                  hosts:
                    - '{{ include "elasticsearch.service.name" . }}'
                  port: '{{ include "elasticsearch.service.ports.restAPI" . }}'
              esJavaOpts: "-Xmx2g -Xms2g"        
        destination:
          server: https://kubernetes.default.svc
          namespace: application

      Diff from the original file:

      extraEnvVars:
      - name: S3_ACCESSKEY
        value: admin
      - name: S3_SECRETKEY
        value: ZrwpsezF1Lt85dxl
      extraConfig:
        s3:
          client:
            default:
              protocol: http
              endpoint: "http://192.168.31.111:9090"
              path_style_access: true
      initScripts:
        configure-s3-client.sh: |
          elasticsearch_set_key_value "s3.client.default.access_key" "${S3_ACCESSKEY}"
          elasticsearch_set_key_value "s3.client.default.secret_key" "${S3_SECRETKEY}"
      hostAliases:
      - ip: 192.168.31.111
        hostnames:
        - minio-api.dev.tech

    Methods

    There are two ways to back up Elasticsearch:

    1. Export the data to text files, e.g. using tools such as elasticdump or esm to dump the data stored in Elasticsearch into files.
    2. Use the snapshot API, which supports incremental backups.

    The first approach is simple and works well for small data volumes, but for large data volumes the snapshot API is the recommended way.

    Steps

    Backup

    1. Create a snapshot repository -> my_s3_repository
    curl -k -X PUT "https://elastic-search.dev.tech:32443/_snapshot/my_s3_repository?pretty" -H 'Content-Type: application/json' -d'
    {
      "type": "s3",
      "settings": {
        "bucket": "local-test",
        "client": "default",
        "endpoint": "http://192.168.31.111:9000"
      }
    }
    '

    You can also use a storage class to mount an external path into the pod and store the snapshot files on that mounted path.

    2. Verify that every node in the cluster can use this snapshot repository
    curl -k -X POST "https://elastic-search.dev.tech:32443/_snapshot/my_s3_repository/_verify?pretty"
    3. List all snapshot repositories
    curl -k -X GET "https://elastic-search.dev.tech:32443/_snapshot/_all?pretty"
    4. Check the settings of a specific snapshot repository
    curl -k -X GET "https://elastic-search.dev.tech:32443/_snapshot/my_s3_repository?pretty"
    5. Analyze a snapshot repository
    curl -k -X POST "https://elastic-search.dev.tech:32443/_snapshot/my_s3_repository/_analyze?blob_count=10&max_blob_size=1mb&timeout=120s&pretty"
    6. Take a snapshot manually
    curl -k -X PUT "https://elastic-search.dev.tech:32443/_snapshot/my_s3_repository/ay_s3_snap_02?pretty"
    Use SLM to take snapshots automatically (not working yet)


    7. List the snapshots available in a given snapshot repository
    curl -k -X GET "https://elastic-search.dev.tech:32443/_snapshot/my_s3_repository/*?verbose=false&pretty"
    8. Test the restore
    # Delete an index
    curl -k -X DELETE "https://elastic-search.dev.tech:32443/books?pretty"
    
    # restore that index
    curl -k -X POST "https://elastic-search.dev.tech:32443/_snapshot/my_s3_repository/ay_s3_snap_02/_restore?pretty" -H 'Content-Type: application/json' -d'
    {
      "indices": "books"
    }
    '
    
    # query
    curl -k -X GET "https://elastic-search.dev.tech:32443/books/_search?pretty" -H 'Content-Type: application/json' -d'
    {
      "query": {
        "match_all": {}
      }
    }
    '
    Oct 7, 2024

    ES Auto BackUp

    Preliminary

    • ElasticSearch has been installed; if not, check this link

    • We use the local disk to save the snapshots; for more details check this link

    • And security is enabled.

      ES argocd-app yaml
      apiVersion: argoproj.io/v1alpha1
      kind: Application
      metadata:
        name: elastic-search
      spec:
        syncPolicy:
          syncOptions:
          - CreateNamespace=true
        project: default
        source:
          repoURL: https://charts.bitnami.com/bitnami
          chart: elasticsearch
          targetRevision: 19.11.3
          helm:
            releaseName: elastic-search
            values: |
              global:
                kibanaEnabled: true
              clusterName: elastic
              image:
                registry: m.zjvis.net/docker.io
                pullPolicy: IfNotPresent
              security:
                enabled: true
                tls:
                  autoGenerated: true
              service:
                type: ClusterIP
              extraConfig:
                path:
                  repo: /tmp
              ingress:
                enabled: true
                annotations:
                  cert-manager.io/cluster-issuer: self-signed-ca-issuer
                  nginx.ingress.kubernetes.io/rewrite-target: /$1
                hostname: elastic-search.dev.tech
                ingressClassName: nginx
                path: /?(.*)
                tls: true
              master:
                masterOnly: false
                replicaCount: 1
                persistence:
                  enabled: false
                resources:
                  requests:
                    cpu: 2
                    memory: 1024Mi
                  limits:
                    cpu: 4
                    memory: 4096Mi
                heapSize: 2g
              data:
                replicaCount: 0
                persistence:
                  enabled: false
              coordinating:
                replicaCount: 0
              ingest:
                enabled: true
                replicaCount: 0
                service:
                  enabled: false
                  type: ClusterIP
                ingress:
                  enabled: false
              metrics:
                enabled: false
                image:
                  registry: m.zjvis.net/docker.io
                  pullPolicy: IfNotPresent
              volumePermissions:
                enabled: false
                image:
                  registry: m.zjvis.net/docker.io
                  pullPolicy: IfNotPresent
              sysctlImage:
                enabled: true
                registry: m.zjvis.net/docker.io
                pullPolicy: IfNotPresent
              kibana:
                elasticsearch:
                  hosts:
                    - '{{ include "elasticsearch.service.name" . }}'
                  port: '{{ include "elasticsearch.service.ports.restAPI" . }}'
              esJavaOpts: "-Xmx2g -Xms2g"        
        destination:
          server: https://kubernetes.default.svc
          namespace: application

      Diff from the original file:

      security:
        enabled: true
      extraConfig:
          path:
            repo: /tmp

    Methods

    Steps

    Auto backup
    1. Create a snapshot repository -> slm_fs_repository
    curl --user elastic:L9shjg6csBmPZgCZ -k -X PUT "https://10.88.0.143:30294/_snapshot/slm_fs_repository?pretty" -H 'Content-Type: application/json' -d'
    {
      "type": "fs",
      "settings": {
        "location": "/tmp"
      }
    }
    '

    You can also use a storage class to mount an external path into the pod and store the snapshot files on that mounted path.

    2. Verify that every node in the cluster can use this snapshot repository
    curl --user elastic:L9shjg6csBmPZgCZ  -k -X POST "https://10.88.0.143:30294/_snapshot/slm_fs_repository/_verify?pretty"
    3. List all snapshot repositories
    curl --user elastic:L9shjg6csBmPZgCZ  -k -X GET "https://10.88.0.143:30294/_snapshot/_all?pretty"
    4. Check the settings of a specific snapshot repository
    curl --user elastic:L9shjg6csBmPZgCZ  -k -X GET "https://10.88.0.143:30294/_snapshot/slm_fs_repository?pretty"
    5. Analyze a snapshot repository
    curl --user elastic:L9shjg6csBmPZgCZ  -k -X POST "https://10.88.0.143:30294/_snapshot/slm_fs_repository/_analyze?blob_count=10&max_blob_size=1mb&timeout=120s&pretty"
    6. List the snapshots available in a given snapshot repository
    curl --user elastic:L9shjg6csBmPZgCZ  -k -X GET "https://10.88.0.143:30294/_snapshot/slm_fs_repository/*?verbose=false&pretty"
    7. Create an SLM admin role
    curl --user elastic:L9shjg6csBmPZgCZ -k -X POST "https://10.88.0.143:30294/_security/role/slm-admin?pretty" -H 'Content-Type: application/json' -d'
    {
      "cluster": [ "manage_slm", "cluster:admin/snapshot/*" ],
      "indices": [
        {
          "names": [ ".slm-history-*" ],
          "privileges": [ "all" ]
        }
      ]
    }
    '
    8. Create the automatic backup cron job
    curl --user elastic:L9shjg6csBmPZgCZ -k -X PUT "https://10.88.0.143:30294/_slm/policy/nightly-snapshots?pretty" -H 'Content-Type: application/json' -d'
    {
      "schedule": "0 30 1 * * ?",       
      "name": "<nightly-snap-{now/d}>", 
      "repository": "slm_fs_repository",    
      "config": {
        "indices": "*",                 
        "include_global_state": true    
      },
      "retention": {                    
        "expire_after": "30d",
        "min_count": 5,
        "max_count": 50
      }
    }
    '
    9. Run the backup policy immediately
    curl --user elastic:L9shjg6csBmPZgCZ -k -X POST "https://10.88.0.143:30294/_slm/policy/nightly-snapshots/_execute?pretty"
    10. Check the SLM backup history
    curl --user elastic:L9shjg6csBmPZgCZ -k -X GET "https://10.88.0.143:30294/_slm/stats?pretty"
    11. Test the restore
    # Delete an index
    curl --user elastic:L9shjg6csBmPZgCZ  -k -X DELETE "https://10.88.0.143:30294/books?pretty"
    
    # restore that index
    curl --user elastic:L9shjg6csBmPZgCZ  -k -X POST "https://10.88.0.143:30294/_snapshot/slm_fs_repository/my_snapshot_2099.05.06/_restore?pretty" -H 'Content-Type: application/json' -d'
    {
      "indices": "books"
    }
    '
    
    # query
    curl --user elastic:L9shjg6csBmPZgCZ  -k -X GET "https://10.88.0.143:30294/books/_search?pretty" -H 'Content-Type: application/json' -d'
    {
      "query": {
        "match_all": {}
      }
    }
    '
    Oct 7, 2024

    Example Shell Script

    Init ES Backup Setting

    Create an ES backup setting in S3, and take a snapshot after creation.

    #!/bin/bash
    ES_HOST="http://192.168.58.2:30910"
    ES_BACKUP_REPO_NAME="s3_fs_repository"
    S3_CLIENT="default"
    ES_BACKUP_BUCKET_IN_S3="es-snapshot"
    ES_SNAPSHOT_TAG="auto"
    
    CHECK_RESPONSE=$(curl -s -k -X POST "$ES_HOST/_snapshot/$ES_BACKUP_REPO_NAME/_verify?pretty" )
    CHECKED_NODES=$(echo "$CHECK_RESPONSE" | jq -r '.nodes')
    
    
    if [ "$CHECKED_NODES" == null ]; then
      echo "Doesn't exist an ES backup setting..."
      echo "A default backup setting will be generated. (using '$S3_CLIENT' s3 client and all backup files will be saved in a bucket : '$ES_BACKUP_BUCKET_IN_S3'"
    
      CREATE_RESPONSE=$(curl -s -k -X PUT "$ES_HOST/_snapshot/$ES_BACKUP_REPO_NAME?pretty" -H 'Content-Type: application/json' -d "{\"type\":\"s3\",\"settings\":{\"bucket\":\"$ES_BACKUP_BUCKET_IN_S3\",\"client\":\"$S3_CLIENT\"}}")
      CREATE_ACKNOWLEDGED_FLAG=$(echo "$CREATE_RESPONSE" | jq -r '.acknowledged')
    
      if [ "$CREATE_ACKNOWLEDGED_FLAG" == true ]; then
        echo "Buckup setting '$ES_BACKUP_REPO_NAME' has been created successfully!"
      else
        echo "Failed to create backup setting '$ES_BACKUP_REPO_NAME', since $$CREATE_RESPONSE"
      fi
    else
      echo "Already exist an ES backup setting '$ES_BACKUP_REPO_NAME'"
    fi
    
    CHECK_RESPONSE=$(curl -s -k -X POST "$ES_HOST/_snapshot/$ES_BACKUP_REPO_NAME/_verify?pretty" )
    CHECKED_NODES=$(echo "$CHECK_RESPONSE" | jq -r '.nodes')
    
    if [ "$CHECKED_NODES" != null ]; then
      SNAPSHOT_NAME="meta-data-$ES_SNAPSHOT_TAG-snapshot-$(date +%s)"
      SNAPSHOT_CREATION=$(curl -s -k -X PUT "$ES_HOST/_snapshot/$ES_BACKUP_REPO_NAME/$SNAPSHOT_NAME")
      echo "Snapshot $SNAPSHOT_NAME has been created."
    else
      echo "Failed to create snapshot $SNAPSHOT_NAME ."
    fi
    Mar 14, 2024

    Subsections of Git

    Minio

      Mar 7, 2024

      Subsections of ☁️CSP Related

      Subsections of Aliyun

      OSSutil

      download ossutil

      First, you need to download ossutil.

      OS:
      curl https://gosspublic.alicdn.com/ossutil/install.sh  | sudo bash
      curl -o ossutil-v1.7.19-windows-386.zip https://gosspublic.alicdn.com/ossutil/1.7.19/ossutil-v1.7.19-windows-386.zip

      config ossutil

      ./ossutil config
      | Params | Description | Instruction |
      | --- | --- | --- |
      | endpoint | the Endpoint of the region where the Bucket is located | |
      | accessKeyID | OSS AccessKey | get from user info panel |
      | accessKeySecret | OSS AccessKeySecret | get from user info panel |
      | stsToken | token for sts service | could be empty |
      Info

      you can also modify /home/<$user>/.ossutilconfig file directly to change the configuration.
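
      A sketch of what that file typically contains (the endpoint value is an assumption; fill in your own keys):

      [Credentials]
      language=EN
      endpoint=oss-cn-hangzhou.aliyuncs.com
      accessKeyID=<$accessKeyID>
      accessKeySecret=<$accessKeySecret>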

      list files

      ossutil ls oss://<$PATH>
      For example
      ossutil ls oss://csst-data/CSST-20240312/dfs/

      download file/dir

      you can use cp to download or upload file

      ossutil cp -r oss://<$PATH> <$OTHER_PATH>
      For example
      ossutil cp -r oss://csst-data/CSST-20240312/dfs/ /data/nfs/data/pvc...

      upload file/dir

      ossutil cp -r <$SOURCE_PATH> oss://<$PATH>
      For example
      ossutil cp -r /data/nfs/data/pvc/a.txt  oss://csst-data/CSST-20240312/dfs/b.txt
      Mar 24, 2024

      ECS DNS

      ZJADC (Aliyun Directed Cloud)

      Append content in /etc/resolv.conf

      options timeout:2 attempts:3 rotate
      nameserver 10.255.9.2
      nameserver 10.200.12.5

      And then you probably need to modify the repo files under /etc/yum.repos.d as well; check this link


      YQGCY (Aliyun Directed Cloud)

      Append content in /etc/resolv.conf

      nameserver 172.27.205.79

      And then restart kube-system.coredns-xxxx


      Google DNS

      nameserver 8.8.8.8
      nameserver 4.4.4.4
      nameserver 223.5.5.5
      nameserver 223.6.6.6

      Restart DNS

      OS:
      vim /etc/NetworkManager/NetworkManager.conf
      vim /etc/NetworkManager/NetworkManager.conf
      sudo systemctl is-active systemd-resolved
      sudo resolvectl flush-caches
      # or sudo systemd-resolve --flush-caches

      add "dns=none" under '[main]' part

      systemctl restart NetworkManager

      Modify ifcfg-ethX [Optional]

      If you cannot get an IPv4 address, you can try modifying ifcfg-ethX.

      vim /etc/sysconfig/network-scripts/ifcfg-ens33

      set ONBOOT=yes
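
      A sketch of what the relevant part of ifcfg-ens33 might look like (device name and BOOTPROTO=dhcp are assumptions):

      TYPE=Ethernet
      BOOTPROTO=dhcp
      NAME=ens33
      DEVICE=ens33
      ONBOOT=yes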

      Mar 14, 2024

      OS Mirrors

      Fedora

      • Fedora 40 located in /etc/yum.repos.d/
        Fedora Mirror
        [updates]
        name=Fedora $releasever - $basearch - Updates
        #baseurl=http://download.example/pub/fedora/linux/updates/$releasever/Everything/$basearch/
        metalink=https://mirrors.fedoraproject.org/metalink?repo=updates-released-f$releasever&arch=$basearch
        enabled=1
        countme=1
        repo_gpgcheck=0
        type=rpm
        gpgcheck=1
        metadata_expire=6h
        gpgkey=file:///etc/pki/rpm-gpg/RPM-GPG-KEY-fedora-$releasever-$basearch
        skip_if_unavailable=False
        
        [updates-debuginfo]
        name=Fedora $releasever - $basearch - Updates - Debug
        #baseurl=http://download.example/pub/fedora/linux/updates/$releasever/Everything/$basearch/debug/
        metalink=https://mirrors.fedoraproject.org/metalink?repo=updates-released-debug-f$releasever&arch=$basearch
        enabled=0
        repo_gpgcheck=0
        type=rpm
        gpgcheck=1
        metadata_expire=6h
        gpgkey=file:///etc/pki/rpm-gpg/RPM-GPG-KEY-fedora-$releasever-$basearch
        skip_if_unavailable=False
        
        [updates-source]
        name=Fedora $releasever - Updates Source
        #baseurl=http://download.example/pub/fedora/linux/updates/$releasever/Everything/SRPMS/
        metalink=https://mirrors.fedoraproject.org/metalink?repo=updates-released-source-f$releasever&arch=$basearch
        enabled=0
        repo_gpgcheck=0
        type=rpm
        gpgcheck=1
        metadata_expire=6h
        gpgkey=file:///etc/pki/rpm-gpg/RPM-GPG-KEY-fedora-$releasever-$basearch
        skip_if_unavailable=False

      CentOS

      • CentOS 7 located in /etc/yum.repos.d/

        CentOS Mirror
        [base]
        name=CentOS-$releasever
        #mirrorlist=http://mirrorlist.centos.org/?release=$releasever&arch=$basearch&repo=os
        baseurl=http://mirror.centos.org/centos/$releasever/os/$basearch/
        gpgcheck=1
        gpgkey=http://mirror.centos.org/centos/RPM-GPG-KEY-CentOS-7
        
        [extras]
        name=CentOS-$releasever
        #mirrorlist=http://mirrorlist.centos.org/?release=$releasever&arch=$basearch&repo=extras
        baseurl=http://mirror.centos.org/centos/$releasever/extras/$basearch/
        gpgcheck=1
        gpgkey=http://mirror.centos.org/centos/RPM-GPG-KEY-CentOS-7
        Aliyun Mirror
        [base]
        name=CentOS-$releasever - Base - mirrors.aliyun.com
        failovermethod=priority
        baseurl=http://mirrors.aliyun.com/centos/$releasever/os/$basearch/
        gpgcheck=1
        gpgkey=http://mirrors.aliyun.com/centos/RPM-GPG-KEY-CentOS-7
        
        [extras]
        name=CentOS-$releasever - Extras - mirrors.aliyun.com
        failovermethod=priority
        baseurl=http://mirrors.aliyun.com/centos/$releasever/extras/$basearch/
        gpgcheck=1
        gpgkey=http://mirrors.aliyun.com/centos/RPM-GPG-KEY-CentOS-7
        163 Mirror
        [base]
        name=CentOS-$releasever - Base - 163.com
        #mirrorlist=http://mirrorlist.centos.org/?release=$releasever&arch=$basearch&repo=os
        baseurl=http://mirrors.163.com/centos/$releasever/os/$basearch/
        gpgcheck=1
        gpgkey=http://mirrors.163.com/centos/RPM-GPG-KEY-CentOS-7
        
        [extras]
        name=CentOS-$releasever - Extras - 163.com
        #mirrorlist=http://mirrorlist.centos.org/?release=$releasever&arch=$basearch&repo=extras
        baseurl=http://mirrors.163.com/centos/$releasever/extras/$basearch/
        gpgcheck=1
        gpgkey=http://mirrors.163.com/centos/RPM-GPG-KEY-CentOS-7

      • CentOS 8 stream located in /etc/yum.repos.d/

        CentOS Mirror
        [baseos]
        name=CentOS Linux - BaseOS
        #mirrorlist=http://mirrorlist.centos.org/?release=$releasever&arch=$basearch&repo=BaseOS&infra=$infra
        baseurl=http://mirror.centos.org/centos/8-stream/BaseOS/$basearch/os/
        gpgcheck=1
        enabled=1
        gpgkey=file:///etc/pki/rpm-gpg/RPM-GPG-KEY-centosofficial
        
        [extras]
        name=CentOS Linux - Extras
        #mirrorlist=http://mirrorlist.centos.org/?release=$releasever&arch=$basearch&repo=extras&infra=$infra
        baseurl=http://mirror.centos.org/centos/8-stream/extras/$basearch/os/
        gpgcheck=1
        enabled=1
        gpgkey=file:///etc/pki/rpm-gpg/RPM-GPG-KEY-centosofficial
        
        [appstream]
        name=CentOS Linux - AppStream
        #mirrorlist=http://mirrorlist.centos.org/?release=$releasever&arch=$basearch&repo=AppStream&infra=$infra
        baseurl=http://mirror.centos.org/centos/8-stream/AppStream/$basearch/os/
        gpgcheck=1
        enabled=1
        gpgkey=file:///etc/pki/rpm-gpg/RPM-GPG-KEY-centosofficial
        Aliyun Mirror
        [base]
        name=CentOS-8.5.2111 - Base - mirrors.aliyun.com
        baseurl=http://mirrors.aliyun.com/centos-vault/8.5.2111/BaseOS/$basearch/os/
        gpgcheck=0
        gpgkey=http://mirrors.aliyun.com/centos/RPM-GPG-KEY-CentOS-Official
        
        [extras]
        name=CentOS-8.5.2111 - Extras - mirrors.aliyun.com
        baseurl=http://mirrors.aliyun.com/centos-vault/8.5.2111/extras/$basearch/os/
        gpgcheck=0
        gpgkey=http://mirrors.aliyun.com/centos/RPM-GPG-KEY-CentOS-Official
        
        [AppStream]
        name=CentOS-8.5.2111 - AppStream - mirrors.aliyun.com
        baseurl=http://mirrors.aliyun.com/centos-vault/8.5.2111/AppStream/$basearch/os/
        gpgcheck=0
        gpgkey=http://mirrors.aliyun.com/centos/RPM-GPG-KEY-CentOS-Official

      Ubuntu

      • Ubuntu 18.04 located in /etc/apt/sources.list

        Ubuntu Mirror
        deb http://archive.ubuntu.com/ubuntu/ bionic main restricted
        deb http://archive.ubuntu.com/ubuntu/ bionic-updates main restricted
        deb http://archive.ubuntu.com/ubuntu/ bionic-backports main restricted universe multiverse
        deb http://security.ubuntu.com/ubuntu/ bionic-security main restricted

      • Ubuntu 20.04 located in /etc/apt/sources.list

        Ubuntu Mirror
        deb http://archive.ubuntu.com/ubuntu/ focal main restricted universe multiverse
        deb http://archive.ubuntu.com/ubuntu/ focal-updates main restricted universe multiverse
        deb http://archive.ubuntu.com/ubuntu/ focal-backports main restricted universe multiverse
        deb http://security.ubuntu.com/ubuntu/ focal-security main restricted

      • Ubuntu 22.04 located in /etc/apt/sources.list

        Ubuntu Mirror
        deb http://archive.ubuntu.com/ubuntu/ jammy main restricted
        deb http://archive.ubuntu.com/ubuntu/ jammy-updates main restricted
        deb http://archive.ubuntu.com/ubuntu/ jammy-backports main restricted universe multiverse
        deb http://security.ubuntu.com/ubuntu/ jammy-security main restricted

      Debian

      • Debian Buster located in /etc/apt/sources.list

        Debian Mirror
        deb http://deb.debian.org/debian buster main
        deb http://security.debian.org/debian-security buster/updates main
        deb http://deb.debian.org/debian buster-updates main
        Aliyun Mirror
        deb http://mirrors.aliyun.com/debian/ buster main non-free contrib
        deb http://mirrors.aliyun.com/debian-security buster/updates main
        deb http://mirrors.aliyun.com/debian/ buster-updates main non-free contrib
        deb http://mirrors.aliyun.com/debian/ buster-backports main non-free contrib
        Tuna Mirror
        deb http://mirrors.tuna.tsinghua.edu.cn/debian/ buster main contrib non-free
        deb http://mirrors.tuna.tsinghua.edu.cn/debian/ buster-updates main contrib non-free
        deb http://mirrors.tuna.tsinghua.edu.cn/debian/ buster-backports main contrib non-free
        deb http://security.debian.org/debian-security buster/updates main contrib non-free

      • Debian Bullseye located in /etc/apt/sources.list

        Debian Mirror
        deb http://deb.debian.org/debian bullseye main
        deb http://security.debian.org/debian-security bullseye-security main
        deb http://deb.debian.org/debian bullseye-updates main
        Aliyun Mirror
        deb http://mirrors.aliyun.com/debian/ bullseye main non-free contrib
        deb http://mirrors.aliyun.com/debian-security/ bullseye-security main
        deb http://mirrors.aliyun.com/debian/ bullseye-updates main non-free contrib
        deb http://mirrors.aliyun.com/debian/ bullseye-backports main non-free contrib
        Tuna Mirror
        deb http://mirrors.tuna.tsinghua.edu.cn/debian/ bullseye main contrib non-free
        deb http://mirrors.tuna.tsinghua.edu.cn/debian/ bullseye-updates main contrib non-free
        deb http://mirrors.tuna.tsinghua.edu.cn/debian/ bullseye-backports main contrib non-free
        deb http://security.debian.org/debian-security bullseye-security main contrib non-free

      Anolis

      • Anolis 3 located in /etc/yum.repos.d/

        Aliyun Mirror
        [alinux3-module]
        name=alinux3-module
        baseurl=http://mirrors.aliyun.com/alinux/3/module/$basearch/
        gpgkey=http://mirrors.aliyun.com/alinux/3/RPM-GPG-KEY-ALINUX-3
        enabled=1
        gpgcheck=1
        
        [alinux3-os]
        name=alinux3-os
        baseurl=http://mirrors.aliyun.com/alinux/3/os/$basearch/
        gpgkey=http://mirrors.aliyun.com/alinux/3/RPM-GPG-KEY-ALINUX-3
        enabled=1
        gpgcheck=1
        
        [alinux3-plus]
        name=alinux3-plus
        baseurl=http://mirrors.aliyun.com/alinux/3/plus/$basearch/
        gpgkey=http://mirrors.aliyun.com/alinux/3/RPM-GPG-KEY-ALINUX-3
        enabled=1
        gpgcheck=1
        
        [alinux3-powertools]
        name=alinux3-powertools
        baseurl=http://mirrors.aliyun.com/alinux/3/powertools/$basearch/
        gpgkey=http://mirrors.aliyun.com/alinux/3/RPM-GPG-KEY-ALINUX-3
        enabled=1
        gpgcheck=1
        
        [alinux3-updates]
        name=alinux3-updates
        baseurl=http://mirrors.aliyun.com/alinux/3/updates/$basearch/
        gpgkey=http://mirrors.aliyun.com/alinux/3/RPM-GPG-KEY-ALINUX-3
        enabled=1
        gpgcheck=1
        
        [epel]
        name=Extra Packages for Enterprise Linux 8 - $basearch
        baseurl=http://mirrors.aliyun.com/epel/8/Everything/$basearch
        failovermethod=priority
        enabled=1
        gpgcheck=1
        gpgkey=http://mirrors.aliyun.com/epel/RPM-GPG-KEY-EPEL-8
        
        [epel-module]
        name=Extra Packages for Enterprise Linux 8 - $basearch
        baseurl=http://mirrors.aliyun.com/epel/8/Modular/$basearch
        failovermethod=priority
        enabled=0
        gpgcheck=1
        gpgkey=http://mirrors.aliyun.com/epel/RPM-GPG-KEY-EPEL-8

      • Anolis 2 located in /etc/yum.repos.d/

        Aliyun Mirror


      Refresh Repo

      OS:
      dnf clean all && dnf makecache
      yum clean all && yum makecache
      apt-get clean all
      Mar 14, 2024

      Subsections of 🧪Demo

      Subsections of Game

      LOL Overlay Assistant

      Using deep learning techniques to help you win the game.

      State Machine Event Bus Python 3.6 TensorFlow2 Captain InfoNew Awesome

      ScreenShots

      There are four main funcs in this tool.

      1. The first one is to detect your game client thread and recognize which
        status you are in.
        func1 func1

      2. The second one is to recommend some champions to play.
        Based on your enemy’s team banned champion, this tool will provide you three
        more choices to counter your enemies.
        func2 func2

      3. The third one scans the mini-map, and when someone is heading toward you,
        a notification window will pop up.
        func3 func3

      4. The last one provides gear recommendations based on your
        enemies' item lists.
        fun4 fun4

      Framework

      mvc mvc

      Checkout in Bilibili

      Checkout in Youtube

      Repo

      you can get code from github, gitee

      Mar 8, 2024

      Roller Coin Assistant

      Using deep learning techniques to help you mine cryptos such as BTC, ETH and DOGE.

      ScreenShots

      There are two main funcs in this tool.

      1. Help you crack the game
      • only the 'Coin-Flip' game is supported for now.

        Right, rollercoin.com has decreased the rewards from this game; that's why I made the repo public. update

      2. Help you pass the GeeTest.

      How to use

      1. open a web browser.
      2. go to https://rollercoin.com and create an account.
      3. keep the language set to 'English' (you can click the button at the bottom to change it).
      4. click the ‘Game’ button.
      5. start the application, and enjoy it.

      Tips

      1. only 1920*1080, 2560*1440 and higher resolution screens are supported.
      2. if you use a 1920*1080 screen, it is strongly recommended to run your web browser in fullscreen.

      Repo

      you can get code from gitee

      Mar 8, 2024

      Subsections of HPC

      Slurm On K8S

      slurm_on_k8s slurm_on_k8s

      Trying to run a Slurm cluster on Kubernetes

      Install

      You can directly use helm to manage this slurm chart

      1. helm repo add ay-helm-mirror https://aaronyang0628.github.io/helm-chart-mirror/charts
      2. helm install slurm ay-helm-mirror/slurm --version 1.0.4

      And then, you should see something like this func1 func1

      Also, you can modify the values.yaml yourself and reinstall the Slurm cluster:

      helm upgrade --create-namespace -n slurm --install -f ./values.yaml slurm ay-helm-mirror/slurm --version=1.0.4
      Important

      You can even build your own images, especially if you want to use your own libraries. For now, the images we use are:

      login –> docker.io/aaron666/slurm-login:intel-mpi

      slurmd –> docker.io/aaron666/slurm-slurmd:intel-mpi

      slurmctld -> docker.io/aaron666/slurm-slurmctld:latest

      slurmdbd –> docker.io/aaron666/slurm-slurmdbd:latest

      munged –> docker.io/aaron666/slurm-munged:latest

      Aug 7, 2024

      Slurm Operator

      If you want to change the Slurm configuration, please check the Slurm configuration generator: click

      • for helm user

        just run for fun!

        1. helm repo add ay-helm-repo https://aaronyang0628.github.io/helm-chart-mirror/charts
        2. helm install slurm ay-helm-repo/slurm --version 1.0.4
      • for operator user

        pull an image and apply

        1. docker pull aaron666/slurm-operator:latest
        2. kubectl apply -f https://raw.githubusercontent.com/AaronYang0628/helm-chart-mirror/refs/heads/main/templates/slurm/install.yaml
        3. kubectl apply -f https://raw.githubusercontent.com/AaronYang0628/helm-chart-mirror/refs/heads/main/templates/slurm/slurmdeployment.values.yaml
      Aug 7, 2024

      Subsections of Plugins

      Flink S3 F3 Multiple

      Normally, Flink can only access one S3 endpoint at runtime, but we need to process files from multiple MinIO instances simultaneously.

      So I modified the original flink-s3-fs-hadoop plugin to enable Flink to do so.

      StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
      env.enableCheckpointing(5000L, CheckpointingMode.EXACTLY_ONCE);
      env.setParallelism(1);
      env.setStateBackend(new HashMapStateBackend());
      env.getCheckpointConfig().setCheckpointStorage("file:///./checkpoints");
      
      final FileSource<String> source =
          FileSource.forRecordStreamFormat(
                  new TextLineInputFormat(),
                  new Path(
                      "s3u://admin:ZrwpsezF1Lt85dxl@10.11.33.132:9000/user-data/home/conti/2024-02-08--10"))
              .build();
      
      final FileSource<String> source2 =
          FileSource.forRecordStreamFormat(
                  new TextLineInputFormat(),
                  new Path(
                      "s3u://minioadmin:minioadmin@10.101.16.72:9000/user-data/home/conti"))
              .build();
      
      env.fromSource(source, WatermarkStrategy.noWatermarks(), "file-source")
          .union(env.fromSource(source2, WatermarkStrategy.noWatermarks(), "file-source2"))
          .print("union-result");
          
      env.execute();
      original usage example

      Using the default flink-s3-fs-hadoop, the configuration values are written into the Hadoop configuration map. Only one set of values is in effect at a time, so there is no way to access different S3 endpoints within a single job context.

      Configuration pluginConfiguration = new Configuration();
      pluginConfiguration.setString("s3a.access-key", "admin");
      pluginConfiguration.setString("s3a.secret-key", "ZrwpsezF1Lt85dxl");
      pluginConfiguration.setString("s3a.connection.maximum", "1000");
      pluginConfiguration.setString("s3a.endpoint", "http://10.11.33.132:9000");
      pluginConfiguration.setBoolean("s3a.path.style.access", Boolean.TRUE);
      FileSystem.initialize(
          pluginConfiguration, PluginUtils.createPluginManagerFromRootFolder(pluginConfiguration));
      
      StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
      env.enableCheckpointing(5000L, CheckpointingMode.EXACTLY_ONCE);
      env.setParallelism(1);
      env.setStateBackend(new HashMapStateBackend());
      env.getCheckpointConfig().setCheckpointStorage("file:///./checkpoints");
      
      final FileSource<String> source =
          FileSource.forRecordStreamFormat(
                  new TextLineInputFormat(), new Path("s3a://user-data/home/conti/2024-02-08--10"))
              .build();
      env.fromSource(source, WatermarkStrategy.noWatermarks(), "file-source").print();
      
      env.execute();

      Usage

      There

      Install From

      For now, you can directly download flink-s3-fs-hadoop-$VERSION.jar and load it into your project.
      $VERSION is the Flink version you are using.

        implementation(files("flink-s3-fs-hadoop-$flinkVersion.jar"))
        <dependency>
            <groupId>org.apache</groupId>
            <artifactId>flink</artifactId>
            <version>$flinkVersion</version>
            <scope>system</scope>
            <systemPath>${project.basedir}flink-s3-fs-hadoop-$flinkVersion.jar</systemPath>
        </dependency>
      The jar we provide is based on the original flink-s3-fs-hadoop plugin, so you should use the original protocol prefix s3a://

      Or you can wait for the PR; after it is merged into flink-master, you won't need to do anything, just update your Flink version
      and directly use s3u://

      Repo

      you can get code from github, gitlab

      Mar 8, 2024

      Subsections of Stream

      Cosmic Antenna

      Design Architecture

      • objects

      Continuously process antenna signal records, convert them into 3-dimensional data matrices, and send them to different astronomical algorithm endpoints.

      • how data flows


      Building From Zero

      Following these steps, you can build cosmic-antenna from scratch.

      1. install podman

      you can check article Install Podman

      2. install kind and kubectl

      you can check article install kubectl

      # create a cluster using podman
      curl -o kind.cluster.yaml -L https://gitlab.com/-/snippets/3686427/raw/main/kind-cluster.yaml \
      && export KIND_EXPERIMENTAL_PROVIDER=podman \
      && kind create cluster --name cs-cluster --image m.daocloud.io/docker.io/kindest/node:v1.27.3 --config=./kind.cluster.yaml
      Modify ~/.kube/config

      vim ~/.kube/config

      in line 5, change server: http://::xxxx -> server: http://0.0.0.0:xxxxx


3. [Optional] pre-download slow images

      DOCKER_IMAGE_PATH=/root/docker-images && mkdir -p $DOCKER_IMAGE_PATH
      BASE_URL="https://resource-ops-dev.lab.zjvis.net:32443/docker-images"
      for IMAGE in "quay.io_argoproj_argocd_v2.9.3.dim" \
          "ghcr.io_dexidp_dex_v2.37.0.dim" \
          "docker.io_library_redis_7.0.11-alpine.dim" \
          "docker.io_library_flink_1.17.dim"
      do
          IMAGE_FILE=$DOCKER_IMAGE_PATH/$IMAGE
          if [ ! -f $IMAGE_FILE ]; then
              TMP_FILE=$IMAGE_FILE.tmp \
              && curl -o "$TMP_FILE" -L "$BASE_URL/$IMAGE" \
              && mv $TMP_FILE $IMAGE_FILE
          fi
          kind -n cs-cluster load image-archive $IMAGE_FILE
      done
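Optionally, you can verify the archives actually landed in the node's image store; the node container name cs-cluster-control-plane is the same one used in step 8 below:

podman exec -it cs-cluster-control-plane crictl images | grep -E "argocd|flink|redis|dex"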

      4. install argocd

      you can check article Install ArgoCD

      5. install essential app on argocd

      # install cert manger    
      curl -LO https://gitlab.com/-/snippets/3686424/raw/main/cert-manager.yaml \
      && kubectl -n argocd apply -f cert-manager.yaml \
      && argocd app sync argocd/cert-manager
      
      # install ingress
      curl -LO https://gitlab.com/-/snippets/3686426/raw/main/ingress-nginx.yaml \
      && kubectl -n argocd apply -f ingress-nginx.yaml \
      && argocd app sync argocd/ingress-nginx
      
      # install flink-kubernetes-operator
      curl -LO https://gitlab.com/-/snippets/3686429/raw/main/flink-operator.yaml \
      && kubectl -n argocd apply -f flink-operator.yaml \
      && argocd app sync argocd/flink-operator

      6. install git

      sudo dnf install -y git \
      && rm -rf $HOME/cosmic-antenna-demo \
      && mkdir $HOME/cosmic-antenna-demo \
      && git clone --branch pv_pvc_template https://github.com/AaronYang2333/cosmic-antenna-demo.git $HOME/cosmic-antenna-demo

      7. prepare application image

      # cd into  $HOME/cosmic-antenna-demo
      sudo dnf install -y java-11-openjdk.x86_64 \
      && $HOME/cosmic-antenna-demo/gradlew :s3sync:buildImage \
      && $HOME/cosmic-antenna-demo/gradlew :fpga-mock:buildImage
      # save and load into cluster
      VERSION="1.0.3"
      podman save --quiet -o $DOCKER_IMAGE_PATH/fpga-mock_$VERSION.dim localhost/fpga-mock:$VERSION \
      && kind -n cs-cluster load image-archive $DOCKER_IMAGE_PATH/fpga-mock_$VERSION.dim
      podman save --quiet -o $DOCKER_IMAGE_PATH/s3sync_$VERSION.dim localhost/s3sync:$VERSION \
      && kind -n cs-cluster load image-archive $DOCKER_IMAGE_PATH/s3sync_$VERSION.dim
      Modify role config
      kubectl -n flink edit role/flink -o yaml

Add services and endpoints to the rules.resources list, as shown in the sketch below.
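For reference, the edited rules block should end up looking roughly like this; the resources and verbs already present in your role may differ, the point is only that services and endpoints are appended:

rules:
  - apiGroups: [""]
    resources:
      - pods
      - configmaps
      - services
      - endpoints
    verbs:
      - get
      - list
      - watch
      - create
      - update
      - delete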

      8. prepare k8s resources [pv, pvc, sts]

      cp -rf $HOME/cosmic-antenna-demo/flink/*.yaml /tmp \
      && podman exec -d cs-cluster-control-plane mkdir -p /mnt/flink-job
      # create persist volume
      kubectl -n flink create -f /tmp/pv.template.yaml
      # create pv claim
      kubectl -n flink create -f /tmp/pvc.template.yaml
      # start up flink application
      kubectl -n flink create -f /tmp/job.template.yaml
      # start up ingress
      kubectl -n flink create -f /tmp/ingress.forward.yaml
      # start up fpga UDP client, sending data 
      cp $HOME/cosmic-antenna-demo/fpga-mock/client.template.yaml /tmp \
      && kubectl -n flink create -f /tmp/client.template.yaml

      9. check dashboard in browser

      http://job-template-example.flink.lab.zjvis.net

      Repo

You can get the code from github.


      Reference

      1. https://github.com/ben-wangz/blog/tree/main/docs/content/6.kubernetes/7.installation/ha-cluster
      2. xxx
      Mar 7, 2024

      Subsections of Design

      Yaml Crawler

      Steps

1. Define which web URL you want to crawl, let's say https://www.xxx.com/aaa.apex
2. Create a page POJO org.example.business.page.MainPage to describe that page

Then you can create a YAML file named root-pages.yaml with the following content:

      - '@class': "org.example.business.page.MainPage"
        url: "https://www.xxx.com/aaa.apex"
3. Then define a process-flow YAML file describing how to process the web pages the crawler will encounter.
      processorChain:
        - '@class': "org.example.crawler.core.processor.decorator.ExceptionRecord"
          processor:
            '@class': "org.example.crawler.core.processor.decorator.RetryControl"
            processor:
              '@class': "org.example.crawler.core.processor.decorator.SpeedControl"
              processor:
                '@class': "org.example.business.hs.code.MainPageProcessor"
                application: "app-name"
              time: 100
              unit: "MILLISECONDS"
            retryTimes: 1
        - '@class': "org.example.crawler.core.processor.decorator.ExceptionRecord"
          processor:
            '@class': "org.example.crawler.core.processor.decorator.RetryControl"
            processor:
              '@class': "org.example.crawler.core.processor.decorator.SpeedControl"
              processor:
                '@class': "org.example.crawler.core.processor.download.DownloadProcessor"
                pagePersist:
                  '@class': "org.example.business.persist.DownloadPageDatabasePersist"
                  downloadPageRepositoryBeanName: "downloadPageRepository"
                downloadPageTransformer:
                  '@class': "org.example.crawler.download.DefaultDownloadPageTransformer"
                skipExists:
                  '@class': "org.example.crawler.download.SkipExistsById"
              time: 1
              unit: "SECONDS"
            retryTimes: 1
      nThreads: 1
      pollWaitingTime: 30
      pollWaitingTimeUnit: "SECONDS"
      waitFinishedTimeout: 180
      waitFinishedTimeUnit: "SECONDS" 

ExceptionRecord, RetryControl, and SpeedControl are provided by the yaml crawler itself, so don't worry about them. You only need to define how to process your own page type MainPage, for example by implementing a MainPageProcessor. Each processor produces a set of further pages or DownloadPage objects. A DownloadPage is like a ship carrying the information you need, and the framework takes care of downloading and persisting it for you.
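To make that concrete, here is a rough sketch of what a custom processor could look like. The interface name, method signature, and constructor below are assumptions for illustration only; follow the actual interfaces shipped with the crawler framework.

package org.example.business.hs.code;

import java.util.List;

import org.example.business.page.MainPage;

// PageProcessor and DownloadPage are assumed framework types, named here only for illustration
public class MainPageProcessor implements PageProcessor<MainPage> {

    private String application; // populated from the "application" field in the YAML above

    @Override
    public List<Object> process(MainPage page) {
        // parse the fetched content of the main page here, then emit follow-up pages
        // and/or DownloadPage objects; DownloadPage instances are downloaded and
        // persisted by the framework according to the process-flow YAML
        DownloadPage attachment = new DownloadPage(page.getUrl() + "/detail.pdf"); // hypothetical constructor
        return List.of(attachment);
    }
}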

4. Voilà, now run your crawler.

      Repo

You can get the code from github or gitlab.

      Mar 8, 2024

      Utils

Projects

      Mar 7, 2024

      Subsections of Utils

      Cowsay

Since the previous cowsay image was built ten years ago, on newer k8s you will hit an exception like:

      Failed to pull image “docker/whalesay:latest”: [DEPRECATION NOTICE] Docker Image Format v1 and Docker Image manifest version 2, schema 1 support is disabled by default and will be removed in an upcoming release. Suggest the author of docker.io/docker/whalesay:latest to upgrade the image to the OCI Format or Docker Image manifest v2, schema 2. More information at https://docs.docker.com/go/deprecated-image-specs/

So I built a new one. Please try docker.io/aaron666/cowsay:v2

      Build

      docker build -t whalesay:v2 .
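The build expects a Dockerfile in the current directory. The real one in the repo may differ; below is a minimal sketch that would produce a comparable image (the base image, package source, and the docker.cow file are assumptions):

FROM docker.io/library/debian:bookworm-slim
RUN apt-get update \
    && apt-get install -y --no-install-recommends cowsay \
    && rm -rf /var/lib/apt/lists/*
# Debian installs cowsay under /usr/games
ENV PATH="/usr/games:${PATH}"
# "whalesay" is just cowsay with a whale cow file shipped next to the Dockerfile
COPY docker.cow /usr/share/cowsay/cows/docker.cow
RUN printf '#!/bin/sh\nexec cowsay -f docker "$@"\n' > /usr/local/bin/whalesay \
    && chmod +x /usr/local/bin/whalesay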

      Usage

      docker run -it localhost/whalesay:v2 whalesay  "hello world"
      
      [root@ay-zj-ecs cowsay]# docker run -it localhost/whalesay:v2 whalesay  "hello world"
       _____________
      < hello world >
       -------------
        \
         \
          \     
                            ##        .            
                      ## ## ##       ==            
                   ## ## ## ##      ===            
               /""""""""""""""""___/ ===        
          ~~~ {~~ ~~~~ ~~~ ~~~~ ~~ ~ /  ===- ~~~   
               \______ o          __/            
                \    \        __/             
                  \____\______/   
      docker run -it localhost/whalesay:v2 cowsay  "hello world"
      
      [root@ay-zj-ecs cowsay]# docker run -it localhost/whalesay:v2 cowsay  "hello world"
       _____________
      < hello world >
       -------------
              \   ^__^
               \  (oo)\_______
                  (__)\       )\/\
                      ||----w |
                      ||     ||

      Upload

      registry
      docker tag 5b01b0c3c7ce docker-registry.lab.zverse.space/ay-dev/whalesay:v2
      docker push docker-registry.lab.zverse.space/ay-dev/whalesay:v2
      export DOCKER_PAT=dckr_pat_bBN_Xkgz-TRdxirM2B6EDYCjjrg
      echo $DOCKER_PAT | docker login docker.io -u aaron666  --password-stdin
      docker tag 5b01b0c3c7ce docker.io/aaron666/whalesay:v2
      docker push docker.io/aaron666/whalesay:v2
      export GITHUB_PAT=XXXX
      echo $GITHUB_PAT | docker login ghcr.io -u aaronyang0628 --password-stdin
      docker tag 5b01b0c3c7ce ghcr.io/aaronyang0628/whalesay:v2
      docker push ghcr.io/aaronyang0628/whalesay:v2
      Mar 7, 2025

      Subsections of 🐿️Apache Flink

      Subsections of On K8s Operator

Job Privileges

      Template

      apiVersion: rbac.authorization.k8s.io/v1
      kind: Role
      metadata:
        namespace: flink
        name: flink-deployment-manager
      rules:
      - apiGroups: 
        - flink.apache.org
        resources: 
        - flinkdeployments
        verbs: 
        - 'get'
        - 'list'
        - 'create'
        - 'update'
      ---
      apiVersion: rbac.authorization.k8s.io/v1
      kind: RoleBinding
      metadata:
        name: flink-deployment-manager-binding
        namespace: flink
      subjects:
      - kind: User
        name: "277293711358271379"  
        apiGroup: rbac.authorization.k8s.io
      roleRef:
        kind: Role
        name: flink-deployment-manager
        apiGroup: rbac.authorization.k8s.io
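To confirm the binding grants what you expect, you can impersonate the user with kubectl auth can-i (the user ID below is the one from the template):

kubectl auth can-i create flinkdeployments.flink.apache.org -n flink --as 277293711358271379
# delete is not in the verbs list above, so with only this role bound the next check should print "no"
kubectl auth can-i delete flinkdeployments.flink.apache.org -n flink --as 277293711358271379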
      Jul 7, 2024

      OSS Template

      Template

      apiVersion: "flink.apache.org/v1beta1"
      kind: "FlinkDeployment"
      metadata:
        name: "financial-job"
      spec:
        image: "cr.registry.res.cloud.wuxi-yqgcy.cn/mirror/financial-topic:1.5-oss"
        flinkVersion: "v1_17"
        flinkConfiguration:
          taskmanager.numberOfTaskSlots: "8"
          fs.oss.endpoint: http://ay-test.oss-cn-jswx-xuelang-d01-a.ops.cloud.wuxi-yqgcy.cn/
          fs.oss.accessKeyId: 4gqOVOfQqCsCUwaC
          fs.oss.accessKeySecret: xxx
        ingress:
          template: "flink.k8s.io/{{namespace}}/{{name}}(/|$)(.*)"
          className: "nginx"
          annotations:
            cert-manager.io/cluster-issuer: "self-signed-ca-issuer"
            nginx.ingress.kubernetes.io/rewrite-target: "/$2"
        serviceAccount: "flink"
        podTemplate:
          apiVersion: "v1"
          kind: "Pod"
          metadata:
            name: "financial-job"
          spec:
            containers:
              - name: "flink-main-container"
                env:
                  - name: ENABLE_BUILT_IN_PLUGINS
                    value: flink-oss-fs-hadoop-1.17.2.jar
        jobManager:
          resource:
            memory: "2048m"
            cpu: 1
        taskManager:
          resource:
            memory: "2048m"
            cpu: 1
        job:
          jarURI: "local:///app/application.jar"
          parallelism: 1
          upgradeMode: "stateless"
      Apr 7, 2024

      S3 Template

      Template

      apiVersion: "flink.apache.org/v1beta1"
      kind: "FlinkDeployment"
      metadata:
        name: "financial-job"
      spec:
        image: "cr.registry.res.cloud.wuxi-yqgcy.cn/mirror/financial-topic:1.5"
        flinkVersion: "v1_17"
        flinkConfiguration:
          taskmanager.numberOfTaskSlots: "8"
          s3a.endpoint: http://172.27.253.89:9000
          s3a.access-key: minioadmin
          s3a.secret-key: minioadmin
        ingress:
          template: "flink.k8s.io/{{namespace}}/{{name}}(/|$)(.*)"
          className: "nginx"
          annotations:
            cert-manager.io/cluster-issuer: "self-signed-ca-issuer"
            nginx.ingress.kubernetes.io/rewrite-target: "/$2"
        serviceAccount: "flink"
        podTemplate:
          apiVersion: "v1"
          kind: "Pod"
          metadata:
            name: "financial-job"
          spec:
            containers:
              - name: "flink-main-container"
                env:
                  - name: ENABLE_BUILT_IN_PLUGINS
                    value: flink-s3-fs-hadoop-1.17.2.jar
        jobManager:
          resource:
            memory: "2048m"
            cpu: 1
        taskManager:
          resource:
            memory: "2048m"
            cpu: 1
        job:
          jarURI: "local:///app/application.jar"
          parallelism: 1
          upgradeMode: "stateless"
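Assuming the Flink Kubernetes Operator is already installed, the template can be applied and inspected like this (the file name and namespace are assumptions):

kubectl -n flink apply -f s3-financial-job.yaml
kubectl -n flink get flinkdeployments
kubectl -n flink describe flinkdeployment financial-job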
      Apr 7, 2024

      Subsections of CDC

      Mysql CDC

More often than not, you can grab a minimal example from CDC Connectors, but there are still a few unavoidable problems you would otherwise have to google before using it.

      preliminary

      Flink: 1.17 JDK: 11

      Flink CDC version mapping
| Flink CDC Version | Flink Version |
| --- | --- |
| 1.0.0 | 1.11.* |
| 1.1.0 | 1.11.* |
| 1.2.0 | 1.12.* |
| 1.3.0 | 1.12.* |
| 1.4.0 | 1.13.* |
| 2.0.* | 1.13.* |
| 2.1.* | 1.13.* |
| 2.2.* | 1.13.*, 1.14.* |
| 2.3.* | 1.13.*, 1.14.*, 1.15.* |
| 2.4.* | 1.13.*, 1.14.*, 1.15.* |
| 3.0.* | 1.14.*, 1.15.*, 1.16.* |

      usage for DataStream API

Importing com.ververica:flink-connector-mysql-cdc alone is not enough.

      implementation("com.ververica:flink-connector-mysql-cdc:2.4.0")
      
      //you also need these following dependencies
      implementation("org.apache.flink:flink-shaded-guava:30.1.1-jre-16.1")
      implementation("org.apache.flink:flink-connector-base:1.17")
      implementation("org.apache.flink:flink-table-planner_2.12:1.17")
      <dependency>
        <groupId>com.ververica</groupId>
        <!-- add the dependency matching your database -->
        <artifactId>flink-connector-mysql-cdc</artifactId>
        <!-- The dependency is available only for stable releases, SNAPSHOT dependencies need to be built based on master or release- branches by yourself. -->
        <version>2.4.0</version>
      </dependency>
      
      <!-- https://mvnrepository.com/artifact/org.apache.flink/flink-shaded-guava -->
      <dependency>
        <groupId>org.apache.flink</groupId>
        <artifactId>flink-shaded-guava</artifactId>
        <version>30.1.1-jre-16.1</version>
      </dependency>
      
      <!-- https://mvnrepository.com/artifact/org.apache.flink/flink-connector-base -->
      <dependency>
        <groupId>org.apache.flink</groupId>
        <artifactId>flink-connector-base</artifactId>
        <version>1.17.1</version>
      </dependency>
      
      <!-- https://mvnrepository.com/artifact/org.apache.flink/flink-table-planner -->
      <dependency>
        <groupId>org.apache.flink</groupId>
        <artifactId>flink-table-planner_2.12</artifactId>
        <version>1.17.1</version>
      </dependency>

      Example Code

      MySqlSource<String> mySqlSource =
          MySqlSource.<String>builder()
              .hostname("192.168.56.107")
              .port(3306)
              .databaseList("test") // set captured database
              .tableList("test.table_a") // set captured table
              .username("root")
              .password("mysql")
              .deserializer(
                  new JsonDebeziumDeserializationSchema()) // converts SourceRecord to JSON String
              .serverTimeZone("UTC")
              .build();
      
      StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
      
      // enable checkpoint
      env.enableCheckpointing(3000);
      
      env.fromSource(mySqlSource, WatermarkStrategy.noWatermarks(), "MySQL Source")
          // set 4 parallel source tasks
          .setParallelism(4)
          .print()
          .setParallelism(1); // use parallelism 1 for sink to keep message ordering
      
      env.execute("Print MySQL Snapshot + Binlog");

      usage for table/SQL API
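A minimal sketch of the same source defined through the Table/SQL API; the connection values mirror the DataStream example above, the column list is an assumption you should match to your own table, and for SQL usage you may additionally need the shaded flink-sql-connector-mysql-cdc jar on the classpath.

import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.table.api.bridge.java.StreamTableEnvironment;

public class MySqlCdcSqlExample {
    public static void main(String[] args) {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        env.enableCheckpointing(3000);
        StreamTableEnvironment tEnv = StreamTableEnvironment.create(env);

        // register a CDC-backed table using the mysql-cdc connector
        tEnv.executeSql(
            "CREATE TABLE table_a (\n"
                + "  id INT,\n"
                + "  name STRING,\n"
                + "  PRIMARY KEY (id) NOT ENFORCED\n"
                + ") WITH (\n"
                + "  'connector' = 'mysql-cdc',\n"
                + "  'hostname' = '192.168.56.107',\n"
                + "  'port' = '3306',\n"
                + "  'username' = 'root',\n"
                + "  'password' = 'mysql',\n"
                + "  'database-name' = 'test',\n"
                + "  'table-name' = 'table_a'\n"
                + ")");

        // continuously print the snapshot plus binlog changes
        tEnv.executeSql("SELECT * FROM table_a").print();
    }
}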

      Mar 7, 2024

      Connector

      Mar 7, 2024

      Subsections of 🐸Git

      Cheatsheet

      Init global config

      git config --list
      git config --global user.name "AaronYang"
      git config --global user.email aaron19940628@gmail.com
      git config --global pager.branch false
      git config --global pull.ff only
      git --no-pager diff

      change user and email (locally)

      git config user.name ""
      git config user.email ""

      list all remote repo

      git remote -v
      modify remote repo
      git remote set-url origin git@github.com:<$user>/<$repo>.git

      Get specific file from remote

      git archive --remote=git@github.com:<$user>/<$repo>.git <$branch>:<$source_file_path> -o <$target_source_path>
      for example
      git archive --remote=git@github.com:AaronYang2333/LOL_Overlay_Assistant_Tool.git master:paper/2003.11755.pdf -o a.pdf

      Clone specific branch

      git clone -b slurm-23.02 --single-branch --depth=1 https://github.com/SchedMD/slurm.git

      Update submodule

git submodule add --depth 1 https://github.com/xxx/xxxx a/b/c

      git submodule update --init --recursive

      Save credential

      login first and then execute this

      git config --global credential.helper store

      Delete Branch

      • Deleting a remote branch
        git push origin --delete <branch>  # Git version 1.7.0 or newer
        git push origin -d <branch>        # Shorter version (Git 1.7.0 or newer)
        git push origin :<branch>          # Git versions older than 1.7.0
      • Deleting a local branch
        git branch --delete <branch>
        git branch -d <branch> # Shorter version
        git branch -D <branch> # Force-delete un-merged branches

      Prune remote branches

      git remote prune origin
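To preview which stale remote-tracking branches would be removed before actually pruning:

git remote prune --dry-run origin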

      Add a new remote repo

      git remote add dev https://xxxxxxxxxxx.git

      Update remote repo

      git remote set-url origin http://xxxxx.git
      Mar 7, 2024

      Subsections of Action

      Customize A Gitea Action

      Introduction

      In this guide, you’ll learn about the basic components needed to create and use a packaged composite action. To focus this guide on the components needed to package the action, the functionality of the action’s code is minimal. The action prints “Hello World” and then “Goodbye”, or if you provide a custom name, it prints “Hello [who-to-greet]” and then “Goodbye”. The action also maps a random number to the random-number output variable, and runs a script named goodbye.sh.

      Once you complete this project, you should understand how to build your own composite action and test it in a workflow.

      Warning

      When creating workflows and actions, you should always consider whether your code might execute untrusted input from possible attackers. Certain contexts should be treated as untrusted input, as an attacker could insert their own malicious content. For more information, see Secure use reference.

      Composite actions and reusable workflows

      Composite actions allow you to collect a series of workflow job steps into a single action which you can then run as a single job step in multiple workflows. Reusable workflows provide another way of avoiding duplication, by allowing you to run a complete workflow from within other workflows. For more information, see Reusing workflow configurations.

      Prerequisites

      Note

      This example explains how to create a composite action within a separate repository. However, it is possible to create a composite action within the same repository. For more information, see Creating a composite action.

      Before you begin, you’ll create a repository on GitHub.

      1. Create a new public repository on GitHub. You can choose any repository name, or use the following hello-world-composite-action example. You can add these files after your project has been pushed to GitHub.

      2. Clone your repository to your computer.

      3. From your terminal, change directories into your new repository.

      cd hello-world-composite-action
4. In the hello-world-composite-action repository, create a new file called goodbye.sh with example code:
      echo "echo Goodbye" > goodbye.sh
5. From your terminal, make goodbye.sh executable.
      chmod +x goodbye.sh
6. From your terminal, check in your goodbye.sh file.
      git add goodbye.sh
      git commit -m "Add goodbye script"
      git push

      Creating an action metadata file

      1. In the hello-world-composite-action repository, create a new file called action.yml and add the following example code. For more information about this syntax, see Metadata syntax reference.
      name: 'Hello World'
      description: 'Greet someone'
      inputs:
        who-to-greet:  # id of input
          description: 'Who to greet'
          required: true
          default: 'World'
      outputs:
        random-number:
          description: "Random number"
          value: ${{ steps.random-number-generator.outputs.random-number }}
      runs:
        using: "composite"
        steps:
          - name: Set Greeting
            run: echo "Hello $INPUT_WHO_TO_GREET."
            shell: bash
            env:
              INPUT_WHO_TO_GREET: ${{ inputs.who-to-greet }}
      
          - name: Random Number Generator
            id: random-number-generator
            run: echo "random-number=$(echo $RANDOM)" >> $GITHUB_OUTPUT
            shell: bash
      
          - name: Set GitHub Path
            run: echo "$GITHUB_ACTION_PATH" >> $GITHUB_PATH
            shell: bash
            env:
              GITHUB_ACTION_PATH: ${{ github.action_path }}
      
          - name: Run goodbye.sh
            run: goodbye.sh
            shell: bash

      This file defines the who-to-greet input, maps the random generated number to the random-number output variable, adds the action’s path to the runner system path (to locate the goodbye.sh script during execution), and runs the goodbye.sh script.

      For more information about managing outputs, see Metadata syntax reference.

      For more information about how to use github.action_path, see Contexts reference.

2. From your terminal, check in your action.yml file.
      git add action.yml
      git commit -m "Add action"
      git push
3. From your terminal, add a tag. This example uses a tag called v1. For more information, see About custom actions.
      git tag -a -m "Description of this release" v1
      git push --follow-tags

      Testing out your action in a workflow

      The following workflow code uses the completed hello world action that you made in Creating a composite action.

      Copy the workflow code into a .github/workflows/main.yml file in another repository, replacing OWNER and SHA with the repository owner and the SHA of the commit you want to use, respectively. You can also replace the who-to-greet input with your name.

      on: [push]
      
      jobs:
        hello_world_job:
          runs-on: ubuntu-latest
          name: A job to say hello
          steps:
            - uses: actions/checkout@v5
            - id: foo
              uses: OWNER/hello-world-composite-action@SHA
              with:
                who-to-greet: 'Mona the Octocat'
            - run: echo random-number "$RANDOM_NUMBER"
              shell: bash
              env:
                RANDOM_NUMBER: ${{ steps.foo.outputs.random-number }}

      From your repository, click the Actions tab, and select the latest workflow run. The output should include: “Hello Mona the Octocat”, the result of the “Goodbye” script, and a random number.

      Creating a composite action within the same repository

1. Create a new subfolder called hello-world-composite-action. This can be placed in any subfolder within the repository; however, it is recommended to place it in the .github/actions subfolder to make organization easier.

      2. In the hello-world-composite-action folder, do the same steps to create the goodbye.sh script

      echo "echo Goodbye" > goodbye.sh
      chmod +x goodbye.sh
      git add goodbye.sh
      git commit -m "Add goodbye script"
      git push
3. In the hello-world-composite-action folder, create the action.yml file based on the steps in Creating a composite action.

4. When using the action, use the relative path to the folder where the composite action's action.yml file is located in the uses key. The below example assumes it is in the .github/actions/hello-world-composite-action folder.

      on: [push]
      
      jobs:
        hello_world_job:
          runs-on: ubuntu-latest
          name: A job to say hello
          steps:
            - uses: actions/checkout@v5
            - id: foo
              uses: ./.github/actions/hello-world-composite-action
              with:
                who-to-greet: 'Mona the Octocat'
            - run: echo random-number "$RANDOM_NUMBER"
              shell: bash
              env:
                RANDOM_NUMBER: ${{ steps.foo.outputs.random-number }}
      Mar 7, 2024

      Customize A Github Action

      Introduction

      In this guide, you’ll learn about the basic components needed to create and use a packaged composite action. To focus this guide on the components needed to package the action, the functionality of the action’s code is minimal. The action prints “Hello World” and then “Goodbye”, or if you provide a custom name, it prints “Hello [who-to-greet]” and then “Goodbye”. The action also maps a random number to the random-number output variable, and runs a script named goodbye.sh.

      Once you complete this project, you should understand how to build your own composite action and test it in a workflow.

      Warning

      When creating workflows and actions, you should always consider whether your code might execute untrusted input from possible attackers. Certain contexts should be treated as untrusted input, as an attacker could insert their own malicious content. For more information, see Secure use reference.

      Composite actions and reusable workflows

      Composite actions allow you to collect a series of workflow job steps into a single action which you can then run as a single job step in multiple workflows. Reusable workflows provide another way of avoiding duplication, by allowing you to run a complete workflow from within other workflows. For more information, see Reusing workflow configurations.

      Prerequisites

      Note

      This example explains how to create a composite action within a separate repository. However, it is possible to create a composite action within the same repository. For more information, see Creating a composite action.

      Before you begin, you’ll create a repository on GitHub.

      1. Create a new public repository on GitHub. You can choose any repository name, or use the following hello-world-composite-action example. You can add these files after your project has been pushed to GitHub.

      2. Clone your repository to your computer.

      3. From your terminal, change directories into your new repository.

      cd hello-world-composite-action
4. In the hello-world-composite-action repository, create a new file called goodbye.sh with example code:
      echo "echo Goodbye" > goodbye.sh
5. From your terminal, make goodbye.sh executable.
      chmod +x goodbye.sh
6. From your terminal, check in your goodbye.sh file.
      git add goodbye.sh
      git commit -m "Add goodbye script"
      git push

      Creating an action metadata file

      1. In the hello-world-composite-action repository, create a new file called action.yml and add the following example code. For more information about this syntax, see Metadata syntax reference.
      name: 'Hello World'
      description: 'Greet someone'
      inputs:
        who-to-greet:  # id of input
          description: 'Who to greet'
          required: true
          default: 'World'
      outputs:
        random-number:
          description: "Random number"
          value: ${{ steps.random-number-generator.outputs.random-number }}
      runs:
        using: "composite"
        steps:
          - name: Set Greeting
            run: echo "Hello $INPUT_WHO_TO_GREET."
            shell: bash
            env:
              INPUT_WHO_TO_GREET: ${{ inputs.who-to-greet }}
      
          - name: Random Number Generator
            id: random-number-generator
            run: echo "random-number=$(echo $RANDOM)" >> $GITHUB_OUTPUT
            shell: bash
      
          - name: Set GitHub Path
            run: echo "$GITHUB_ACTION_PATH" >> $GITHUB_PATH
            shell: bash
            env:
              GITHUB_ACTION_PATH: ${{ github.action_path }}
      
          - name: Run goodbye.sh
            run: goodbye.sh
            shell: bash

      This file defines the who-to-greet input, maps the random generated number to the random-number output variable, adds the action’s path to the runner system path (to locate the goodbye.sh script during execution), and runs the goodbye.sh script.

      For more information about managing outputs, see Metadata syntax reference.

      For more information about how to use github.action_path, see Contexts reference.

2. From your terminal, check in your action.yml file.
      git add action.yml
      git commit -m "Add action"
      git push
3. From your terminal, add a tag. This example uses a tag called v1. For more information, see About custom actions.
      git tag -a -m "Description of this release" v1
      git push --follow-tags

      Testing out your action in a workflow

      The following workflow code uses the completed hello world action that you made in Creating a composite action.

      Copy the workflow code into a .github/workflows/main.yml file in another repository, replacing OWNER and SHA with the repository owner and the SHA of the commit you want to use, respectively. You can also replace the who-to-greet input with your name.

      on: [push]
      
      jobs:
        hello_world_job:
          runs-on: ubuntu-latest
          name: A job to say hello
          steps:
            - uses: actions/checkout@v5
            - id: foo
              uses: OWNER/hello-world-composite-action@SHA
              with:
                who-to-greet: 'Mona the Octocat'
            - run: echo random-number "$RANDOM_NUMBER"
              shell: bash
              env:
                RANDOM_NUMBER: ${{ steps.foo.outputs.random-number }}

      From your repository, click the Actions tab, and select the latest workflow run. The output should include: “Hello Mona the Octocat”, the result of the “Goodbye” script, and a random number.

      Creating a composite action within the same repository

1. Create a new subfolder called hello-world-composite-action. This can be placed in any subfolder within the repository; however, it is recommended to place it in the .github/actions subfolder to make organization easier.

      2. In the hello-world-composite-action folder, do the same steps to create the goodbye.sh script

      echo "echo Goodbye" > goodbye.sh
      chmod +x goodbye.sh
      git add goodbye.sh
      git commit -m "Add goodbye script"
      git push
3. In the hello-world-composite-action folder, create the action.yml file based on the steps in Creating a composite action.

4. When using the action, use the relative path to the folder where the composite action's action.yml file is located in the uses key. The below example assumes it is in the .github/actions/hello-world-composite-action folder.

      on: [push]
      
      jobs:
        hello_world_job:
          runs-on: ubuntu-latest
          name: A job to say hello
          steps:
            - uses: actions/checkout@v5
            - id: foo
              uses: ./.github/actions/hello-world-composite-action
              with:
                who-to-greet: 'Mona the Octocat'
            - run: echo random-number "$RANDOM_NUMBER"
              shell: bash
              env:
                RANDOM_NUMBER: ${{ steps.foo.outputs.random-number }}
      Mar 7, 2024

      Gitea Variables

      Preset Variables

| Variable | Description / Usage |
| --- | --- |
| gitea.actor | The username of the user that triggered the workflow. (docs.gitea.com) |
| gitea.event_name | The name of the event, e.g. push, pull_request. (docs.gitea.com) |
| gitea.ref | The Git ref (branch/tag) that triggered the workflow. (docs.gitea.com) |
| gitea.repository | The repository identifier, usually owner/name. (docs.gitea.com) |
| gitea.workspace | The working directory path where the repository is checked out on the runner. (docs.gitea.com) |

      Common Variables

| Variable | Description / Usage |
| --- | --- |
| runner.os | The operating environment of the runner, e.g. ubuntu-latest. (docs.gitea.com) |
| job.status | The status of the current job (e.g. success or failure). (docs.gitea.com) |
| env.xxxx | Custom configuration variables defined at the user/organization/repository level, referenced in uppercase. (docs.gitea.com) |
| secrets.XXXX | Secrets holding sensitive information, which can likewise be defined at the user/organization/repository level. (docs.gitea.com) |

      Sample

      name: Gitea Actions Demo
      run-name: ${{ gitea.actor }} is testing out Gitea Actions 🚀
      on: [push]
      
      env:
          author: gitea_admin
      jobs:
        Explore-Gitea-Actions:
          runs-on: ubuntu-latest
          steps:
            - run: echo "🎉 The job was automatically triggered by a ${{ gitea.event_name }} event."
            - run: echo "🐧 This job is now running on a ${{ runner.os }} server hosted by Gitea!"
            - run: echo "🔎 The name of your branch is ${{ gitea.ref }} and your repository is ${{ gitea.repository }}."
            - name: Check out repository code
              uses: actions/checkout@v4
            - run: echo "💡 The ${{ gitea.repository }} repository has been cloned to the runner."
            - run: echo "🖥️ The workflow is now ready to test your code on the runner."
            - name: List files in the repository
              run: |
                ls ${{ gitea.workspace }}
            - run: echo "🍏 This job's status is ${{ job.status }}."

      Result

      🎉 The job was automatically triggered by a `push` event.
      
      🐧 This job is now running on a `Linux` server hosted by Gitea!
      
      🔎 The name of your branch is `refs/heads/main` and your repository is `gitea_admin/data-warehouse`.
      
      💡 The `gitea_admin/data-warehouse` repository has been cloned to the runner.
      
      🖥️ The workflow is now ready to test your code on the runner.
      
          Dockerfile  README.md  environments  pom.xml  src  templates
      
      🍏 This job's status is `success`.
      Mar 7, 2024

      Github Variables

      Context Variables

| Variable | Description / Usage |
| --- | --- |
| github.actor | The username of the user that triggered the workflow. ([docs.gitea.com][1]) |
| github.event_name | The name of the event, e.g. push, pull_request. ([docs.gitea.com][1]) |
| github.ref | The Git ref (branch/tag) that triggered the workflow. ([docs.gitea.com][1]) |
| github.repository | The repository identifier, usually owner/name. ([docs.gitea.com][1]) |
| github.workspace | The working directory path where the repository is checked out on the runner. ([docs.gitea.com][1]) |
| env.xxxx | Variables defined in the workflow, e.g. ${{ env.xxxx }}. |
| secrets.XXXX | Secrets created via Settings -> Actions -> Secrets and variables. |
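For a quick check of these values, here is a minimal workflow sketch analogous to the Gitea sample above (job and step names are illustrative):

name: GitHub Actions Context Demo
run-name: ${{ github.actor }} is testing out GitHub Actions 🚀
on: [push]

jobs:
  explore-context:
    runs-on: ubuntu-latest
    steps:
      - run: echo "🎉 Triggered by a ${{ github.event_name }} event on ${{ github.ref }}."
      - run: echo "🔎 Repository is ${{ github.repository }}, workspace is ${{ github.workspace }}."
      - name: Check out repository code
        uses: actions/checkout@v4
      - run: echo "🍏 This job's status is ${{ job.status }}."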
      Mar 7, 2024

      Subsections of Template

      Apply And Sync Argocd APP

      name: apply-and-sync-app
      run-name: ${{ gitea.actor }} is going to sync an sample argocd app 🚀
      on: [push]
      
      jobs:
        sync-argocd-app:
          runs-on: ubuntu-latest
          steps:
            - name: Sync App
              uses: AaronYang0628/apply-and-sync-argocd@v1.0.6
              with:
                argocd-server: '192.168.100.125:30443'
                argocd-token: ${{ secrets.ARGOCD_TOKEN }}
                application-yaml-path: "environments/ops/argocd/operator.app.yaml"
      Mar 7, 2025

      Publish Chart 2 Harbor

      name: publish-chart-to-harbor-registry
      run-name: ${{ gitea.actor }} is testing out Gitea Push Chart 🚀
      on: [push]
      
      env:
        REGISTRY: harbor.zhejianglab.com
        USER: byang628@zhejianglab.com
        REPOSITORY_NAMESPACE: ay-dev
        CHART_NAME: data-warehouse
      jobs:
        build-and-push-charts:
          runs-on: ubuntu-latest
          permissions:
            packages: write
            contents: read
          strategy:
            matrix:
              include:
                - chart_path: "environments/helm/metadata-environment"
          steps:
            - name: Checkout Repository
              uses: actions/checkout@v4
              with:
                fetch-depth: 0
      
            - name: Log in to Harbor
              uses: docker/login-action@f4ef78c080cd8ba55a85445d5b36e214a81df20a
              with:
                registry: "${{ env.REGISTRY }}"
                username: "${{ env.USER }}"
                password: "${{ secrets.ZJ_HARBOR_TOKEN }}"
      
            - name: Helm Publish Action
              uses: AaronYang0628/push-helm-chart-to-oci@v0.0.3
              with:
                working-dir: ${{ matrix.chart_path }}
                oci-repository: oci://${{ env.REGISTRY }}/${{ env.REPOSITORY_NAMESPACE }}
                username: ${{ env.USER }}
                password: ${{ secrets.ZJ_HARBOR_TOKEN }}
      Mar 7, 2025

      Publish Image 2 Dockerhub

      name: publish-image-to-ghcr
      run-name: ${{ gitea.actor }} is testing out Gitea Push Image 🚀
      on: [push]
      
      env:
        REGISTRY: ghcr.io
        USER: aaronyang0628
        REPOSITORY_NAMESPACE: aaronyang0628
      jobs:
        build-and-push-images:
          strategy:
            matrix:
              include:
                - name_suffix: "aria-ng"
                  container_path: "application/aria2/container/aria-ng"
                  dockerfile_path: "application/aria2/container/aria-ng/Dockerfile"
                - name_suffix: "aria2"
                  container_path: "application/aria2/container/aria2"
                  dockerfile_path: "application/aria2/container/aria2/Dockerfile"
          runs-on: ubuntu-latest
          steps:
          - name: checkout-repository
            uses: actions/checkout@v4
          - name: log in to the container registry
            uses: docker/login-action@v3
            with:
              registry: "${{ env.REGISTRY }}"
              username: "${{ env.USER }}"
              password: "${{ secrets.GIT_REGISTRY_PWD }}"
          - name: build and push container image
            uses: docker/build-push-action@v6
            with:
              context: "${{ matrix.container_path }}"
              file: "${{ matrix.dockerfile_path }}"
              push: true
              tags: |
                ${{ env.REGISTRY }}/${{ env.REPOSITORY_NAMESPACE }}/${{ github.repository }}-${{ matrix.name_suffix }}:${{ inputs.tag || 'latest' }}
                ${{ env.REGISTRY }}/${{ env.REPOSITORY_NAMESPACE }}/${{ github.repository }}-${{ matrix.name_suffix }}:${{ github.ref_name }}
              labels: |
                org.opencontainers.image.source=${{ github.server_url }}/${{ github.repository }}
      Mar 7, 2025

      Publish Image 2 Ghcr

      name: publish-image-to-ghcr
      run-name: ${{ gitea.actor }} is testing out Gitea Push Image 🚀
      on: [push]
      
      env:
        REGISTRY: ghcr.io
        USER: aaronyang0628
        REPOSITORY_NAMESPACE: aaronyang0628
      jobs:
        build-and-push-images:
          strategy:
            matrix:
              include:
                - name_suffix: "aria-ng"
                  container_path: "application/aria2/container/aria-ng"
                  dockerfile_path: "application/aria2/container/aria-ng/Dockerfile"
                - name_suffix: "aria2"
                  container_path: "application/aria2/container/aria2"
                  dockerfile_path: "application/aria2/container/aria2/Dockerfile"
          runs-on: ubuntu-latest
          steps:
          - name: checkout-repository
            uses: actions/checkout@v4
          - name: log in to the container registry
            uses: docker/login-action@v3
            with:
              registry: "${{ env.REGISTRY }}"
              username: "${{ env.USER }}"
              password: "${{ secrets.GIT_REGISTRY_PWD }}"
          - name: build and push container image
            uses: docker/build-push-action@v6
            with:
              context: "${{ matrix.container_path }}"
              file: "${{ matrix.dockerfile_path }}"
              push: true
              tags: |
                ${{ env.REGISTRY }}/${{ env.REPOSITORY_NAMESPACE }}/${{ github.repository }}-${{ matrix.name_suffix }}:${{ inputs.tag || 'latest' }}
                ${{ env.REGISTRY }}/${{ env.REPOSITORY_NAMESPACE }}/${{ github.repository }}-${{ matrix.name_suffix }}:${{ github.ref_name }}
              labels: |
                org.opencontainers.image.source=${{ github.server_url }}/${{ github.repository }}
      Mar 7, 2025

      Publish Image 2 Harbor

      name: publish-image-to-harbor-registry
      run-name: ${{ gitea.actor }} is testing out Gitea Push Image 🚀
      on: [push]
      
      
      env:
        REGISTRY: harbor.zhejianglab.com
        USER: byang628@zhejianglab.com
        REPOSITORY_NAMESPACE: ay-dev
        IMAGE_NAME: metadata-crd-operator
      jobs:
        build-and-push-images:
          runs-on: ubuntu-latest
          permissions:
            packages: write
            contents: read
          strategy:
            matrix:
              include:
                - name_suffix: "dev"
                  container_path: "."
                  dockerfile_path: "./Dockerfile"
          steps:
            - name: Checkout Repository
              uses: actions/checkout@v4
      
            - name: Log in to Harbor
              uses: docker/login-action@f4ef78c080cd8ba55a85445d5b36e214a81df20a
              with:
                registry: "${{ env.REGISTRY }}"
                username: "${{ env.USER }}"
                password: "${{ secrets.ZJ_HARBOR_TOKEN }}"
      
            - name: Extract Current Date
              id: extract-date
              run: |
                echo "current-date=$(date +'%Y%m%d')" >> $GITHUB_OUTPUT
echo "will push image: ${{ env.REGISTRY }}/${{ env.REPOSITORY_NAMESPACE }}/${{ env.IMAGE_NAME }}-${{ matrix.name_suffix }}:v$(date +'%Y%m%d')" # the step's own output is not visible to ${{ }} yet, so recompute the date here
      
            - name: Build And Push Container Image
              uses: docker/build-push-action@v6
              with:
                context: "${{ matrix.container_path }}"
                file: "${{ matrix.dockerfile_path }}"
                push: true
                tags: |
                  ${{ env.REGISTRY }}/${{ env.REPOSITORY_NAMESPACE }}/${{ env.IMAGE_NAME }}-${{ matrix.name_suffix }}:v${{ steps.extract-date.outputs.current-date }}
                labels: |
                  org.opencontainers.image.source=${{ github.server_url }}/${{ github.repository }}
      Mar 7, 2025

      Subsections of Notes

      Not Allow Push

      Cannot push to your own branch


1. Edit the .git/config file under your repo directory.

2. Find the url= entry under the [remote "origin"] section.

3. Change it from:

  url=https://gitlab.com/AaronYang2333/ska-src-dm-local-data-preparer.git/

  to:

  url=ssh://git@gitlab.com/AaronYang2333/ska-src-dm-local-data-preparer.git

4. Try pushing again.
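Equivalently, you can switch the remote to SSH without editing .git/config by hand (the remote name origin is assumed):

git remote set-url origin ssh://git@gitlab.com/AaronYang2333/ska-src-dm-local-data-preparer.git
git remote -v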

      Mar 12, 2025

      ☸️Kubernetes

      Mar 7, 2024

      Subsections of ☸️Kubernetes

      Prepare k8s Cluster

To build a K8s cluster, you can choose one of the following methods.

Install Kubectl

      Build Cluster

      Install By

      Prerequisites

      • Hardware Requirements:

        1. At least 2 GB of RAM per machine (minimum 1 GB)
        2. 2 CPUs on the master node
        3. Full network connectivity among all machines (public or private network)
      • Operating System:

        1. Ubuntu 20.04/18.04, CentOS 7/8, or any other supported Linux distribution.
      • Network Requirements:

        1. Unique hostname, MAC address, and product_uuid for each node.
        2. Certain ports need to be open (e.g., 6443, 2379-2380, 10250, 10251, 10252, 10255, etc.)
      • Disable Swap:

        sudo swapoff -a

      Steps to Setup Kubernetes Cluster

1. Prepare Your Servers. Update the package index and install the necessary packages on all your nodes (both master and worker):
      sudo apt-get update
      sudo apt-get install -y apt-transport-https ca-certificates curl

      Add the Kubernetes APT Repository

      curl -s https://packages.cloud.google.com/apt/doc/apt-key.gpg | sudo apt-key add -
      cat <<EOF | sudo tee /etc/apt/sources.list.d/kubernetes.list
      deb http://apt.kubernetes.io/ kubernetes-xenial main
      EOF

      Install kubeadm, kubelet, and kubectl

      sudo apt-get update
      sudo apt-get install -y kubelet kubeadm kubectl
      sudo apt-mark hold kubelet kubeadm kubectl
2. Initialize the Master Node. On the master node, initialize the Kubernetes control plane:
      sudo kubeadm init --pod-network-cidr=192.168.0.0/16

The --pod-network-cidr flag is used to set the Pod network range. You might need to adjust this based on your network provider.

      Set up Local kubeconfig

      mkdir -p $HOME/.kube
      sudo cp -i /etc/kubernetes/admin.conf $HOME/.kube/config
      sudo chown $(id -u):$(id -g) $HOME/.kube/config
3. Install a Pod Network Add-on. You can install a network add-on like Flannel, Calico, or Weave.

For example, to install Flannel:

kubectl apply -f https://github.com/coreos/flannel/raw/master/Documentation/kube-flannel.yml

or to install Calico:

kubectl apply -f https://docs.projectcalico.org/v3.14/manifests/calico.yaml

4. Join Worker Nodes to the Cluster. On each worker node, run the kubeadm join command provided at the end of the kubeadm init output on the master node. It will look something like this:
      sudo kubeadm join <master-ip>:6443 --token <token> --discovery-token-ca-cert-hash sha256:<hash>

      If you lost the join command, you can create a new token on the master node:

      sudo kubeadm token create --print-join-command
5. Verify the Cluster. Once all nodes have joined, you can verify the cluster status from the master node:
      kubectl get nodes

      This command should list all your nodes with the status “Ready”.

      Mar 7, 2025

      Subsections of Prepare k8s Cluster

      Kind

      Preliminary

      • Kind binary has installed, if not check 🔗link

      • Hardware Requirements:

        1. At least 2 GB of RAM per machine (minimum 1 GB)
        2. 2 CPUs on the master node
        3. Full network connectivity among all machines (public or private network)
      • Operating System:

        1. Ubuntu 22.04/14.04, CentOS 7/8, or any other supported Linux distribution.
      • Network Requirements:

        1. Unique hostname, MAC address, and product_uuid for each node.
        2. Certain ports need to be open (e.g., 6443, 2379-2380, 10250, 10251, 10252, 10255, etc.)

      Customize your cluster

      Creating a Kubernetes cluster is as simple as kind create cluster

      kind create cluster --name test
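If you need more than the defaults, pass a config file. Below is a minimal multi-node sketch; the extra port mapping is just an example for exposing a NodePort such as 30443:

cat > kind.cluster.yaml <<EOF
kind: Cluster
apiVersion: kind.x-k8s.io/v1alpha4
nodes:
  - role: control-plane
    extraPortMappings:
      - containerPort: 30443
        hostPort: 30443
  - role: worker
  - role: worker
EOF
kind create cluster --name test --config kind.cluster.yaml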

      Reference

and then you can visit https://kind.sigs.k8s.io/docs/user/quick-start/ for more detail.

      Mar 7, 2024

      K3s

      Preliminary

      • Hardware Requirements:

        1. Server need to have at least 2 cores, 2 GB RAM
        2. Agent need 1 core , 512 MB RAM
      • Operating System:

        1. K3s is expected to work on most modern Linux systems.
      • Network Requirements:

        1. The K3s server needs port 6443 to be accessible by all nodes.
        2. If you wish to utilize the metrics server, all nodes must be accessible to each other on port 10250.

      Init server

      curl -sfL https://rancher-mirror.rancher.cn/k3s/k3s-install.sh | INSTALL_K3S_MIRROR=cn sh -s - server --cluster-init --flannel-backend=vxlan --node-taint "node-role.kubernetes.io/control-plane=true:NoSchedule"

      Get token

      cat /var/lib/rancher/k3s/server/node-token

      Join worker

      curl -sfL https://rancher-mirror.rancher.cn/k3s/k3s-install.sh | INSTALL_K3S_MIRROR=cn K3S_URL=https://<master-ip>:6443 K3S_TOKEN=<join-token> sh -

      Copy kubeconfig

      mkdir -p $HOME/.kube
      cp /etc/rancher/k3s/k3s.yaml $HOME/.kube/config
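After copying the kubeconfig, point kubectl at it and confirm the node has registered:

export KUBECONFIG=$HOME/.kube/config
kubectl get nodes -o wide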

      Uninstall k3s

      # exec on server
      /usr/local/bin/k3s-uninstall.sh
      
      # exec on agent 
      /usr/local/bin/k3s-agent-uninstall.sh
      Mar 7, 2024

      Minikube

      Preliminary

      • Minikube binary has installed, if not check 🔗link

      • Hardware Requirements:

        1. At least 2 GB of RAM per machine (minimum 1 GB)
        2. 2 CPUs on the master node
        3. Full network connectivity among all machines (public or private network)
      • Operating System:

        1. Ubuntu 20.04/18.04, CentOS 7/8, or any other supported Linux distribution.
      • Network Requirements:

        1. Unique hostname, MAC address, and product_uuid for each node.
        2. Certain ports need to be open (e.g., 6443, 2379-2380, 10250, 10251, 10252, 10255, etc.)

      [Optional] Disable aegis service and reboot system for Aliyun

      sudo systemctl disable aegis && sudo reboot

      Customize your cluster

      minikube start --driver=podman  --image-mirror-country=cn --kubernetes-version=v1.33.1 --image-repository=registry.cn-hangzhou.aliyuncs.com/google_containers --cpus=6 --memory=20g --disk-size=50g --force

      Restart minikube

      minikube stop && minikube start

      Add alias

      alias kubectl="minikube kubectl --"

      Stop And Clean

      minikube stop && minikube delete --all --purge

      Forward

      ssh -i ~/.minikube/machines/minikube/id_rsa docker@$(minikube ip) -L '*:30443:0.0.0.0:30443' -N -f

      and then you can visit https://minikube.sigs.k8s.io/docs/start/ for more detail.

      FAQ

      Q1: couldn’t get resource list for external.metrics.k8s.io/v1beta1: the server is currently unable to handle…

This is usually caused by the Metrics Server not being installed correctly or the External Metrics API being missing.

# enable Minikube's metrics-server addon
minikube addons enable metrics-server

# wait for the deployment to complete (about 1-2 minutes)
kubectl wait --for=condition=available deployment/metrics-server -n kube-system --timeout=180s

# verify that the Metrics Server is running
kubectl -n kube-system get pods  | grep metrics-server


Q2: Expose minikube to connections from outside the host
      minikube start --driver=podman  --image-mirror-country=cn --kubernetes-version=v1.33.1 --image-repository=registry.cn-hangzhou.aliyuncs.com/google_containers  --listen-address=0.0.0.0 --cpus=6 --memory=20g --disk-size=100g --force
      Mar 7, 2024

      Subsections of Command

      Kubectl CheatSheet

      Switch Context

      • use different config
      kubectl --kubeconfig /root/.kube/config_ack get pod

      Resource

      • create resource

        Resource From
          kubectl create -n <$namespace> -f <$file_url>
        temp-file.yaml
apiVersion: v1
kind: Service
metadata:
  labels:
    app.kubernetes.io/component: server
    app.kubernetes.io/instance: argo-cd
    app.kubernetes.io/name: argocd-server-external
    app.kubernetes.io/part-of: argocd
    app.kubernetes.io/version: v2.8.4
  name: argocd-server-external
spec:
  ports:
  - name: https
    port: 443
    protocol: TCP
    targetPort: 8080
    nodePort: 30443
  selector:
    app.kubernetes.io/instance: argo-cd
    app.kubernetes.io/name: argocd-server
  type: NodePort
        
          helm install <$resource_id> <$resource_id> \
              --namespace <$namespace> \
              --create-namespace \
              --version <$version> \
              --repo <$repo_url> \
              --values resource.values.yaml \
              --atomic
        resource.values.yaml
        crds:
            install: true
            keep: false
        global:
            revisionHistoryLimit: 3
            image:
                repository: m.daocloud.io/quay.io/argoproj/argocd
                imagePullPolicy: IfNotPresent
        redis:
            enabled: true
            image:
                repository: m.daocloud.io/docker.io/library/redis
            exporter:
                enabled: false
                image:
                    repository: m.daocloud.io/bitnami/redis-exporter
            metrics:
                enabled: false
        redis-ha:
            enabled: false
            image:
                repository: m.daocloud.io/docker.io/library/redis
            configmapTest:
                repository: m.daocloud.io/docker.io/koalaman/shellcheck
    haproxy:
        enabled: false
        image:
            repository: m.daocloud.io/docker.io/library/haproxy
            exporter:
                enabled: false
                image: m.daocloud.io/docker.io/oliver006/redis_exporter
        dex:
            enabled: true
            image:
                repository: m.daocloud.io/ghcr.io/dexidp/dex
        

      • debug resource

      kubectl -n <$namespace> describe <$resource_id>
      • logging resource
      kubectl -n <$namespace> logs -f <$resource_id>
      • port forwarding resource
      kubectl -n <$namespace> port-forward  <$resource_id> --address 0.0.0.0 8080:80 # local:pod
• delete all resources under a specific namespace
kubectl delete all --all -n <$namespace>
if you want to delete everything across all namespaces
kubectl delete all --all --all-namespaces
      • delete error pods
      kubectl -n <$namespace> delete pods --field-selector status.phase=Failed
      • force delete
      kubectl -n <$namespace> delete pod <$resource_id> --force --grace-period=0
      • opening a Bash Shell inside a Pod
      kubectl -n <$namespace> exec -it <$resource_id> -- bash  
      • copy secret to another namespace
      kubectl -n <$namespaceA> get secret <$secret_name> -o json \
          | jq 'del(.metadata["namespace","creationTimestamp","resourceVersion","selfLink","uid"])' \
          | kubectl -n <$namespaceB> apply -f -
      • copy secret to another name
      kubectl -n <$namespace> get secret <$old_secret_name> -o json | \
      jq 'del(.metadata["namespace","creationTimestamp","resourceVersion","selfLink","uid","ownerReferences","annotations","labels"]) | .metadata.name = "<$new_secret_name>"' | \
      kubectl apply -n <$namespace> -f -
      • delete all completed job
      kubectl delete jobs -n <$namespace> --field-selector status.successful=1 

      Nodes

• add taint
kubectl taint nodes <$node_ip> <$key>=<$value>:<$effect>
for example
kubectl taint nodes node1 dedicated:NoSchedule
• remove taint (append a trailing dash)
kubectl taint nodes <$node_ip> <$key>:<$effect>-
for example
kubectl taint nodes node1 dedicated:NoSchedule-
      • show info extract by json path
      kubectl get nodes -o jsonpath='{.items[*].spec.podCIDR}'

      Deploy

• rollout: show rollout history
      kubectl -n <$namespace> rollout history deploy/<$deploy_resource_id>

      undo rollout

      kubectl -n <$namespace> rollout undo deploy <$deploy_resource_id>  --to-revision=1

      Patch

clear finalizers on resources that are stuck terminating and no longer managed by k8s

      kubectl -n metadata patch flinkingest ingest-table-or-fits-from-oss -p '{"metadata":{"finalizers":[]}}' --type=merge
      Mar 8, 2024

      Helm Chart CheatSheet

      Finding Charts

      helm search hub wordpress

      Adding Repositories

      helm repo add ay-helm-mirror https://aaronyang0628.github.io/helm-chart-mirror/charts
      helm repo update

      Showing Chart Values

      helm show values bitnami/wordpress

      Packaging Charts

      helm package --dependency-update --destination /tmp/ /root/metadata-operator/environments/helm/metadata-environment/charts

      Uninstall Chart

      helm uninstall -n warehouse warehouse

if that fails, you can try

      helm uninstall -n warehouse warehouse --no-hooks --cascade=foreground
      Mar 7, 2024

Subsections of Container

CheatSheet

      1. remove specific image
      podman rmi <$image_id>
      1. remove all <none> images
podman rmi `podman images | grep  '<none>' | awk '{print $3}'`
      1. remove all stopped containers
      podman container prune
      1. remove all docker images not used
      podman image prune

      sudo podman volume prune

      1. find ip address of a container
      podman inspect --format='{{.NetworkSettings.IPAddress}}' minio-server
      1. exec into container
podman exec -it <$container_id> /bin/bash
      1. run with environment
podman run -d --replace \
          -p 18123:8123 -p 19000:9000 \
          --name clickhouse-server \
          -e ALLOW_EMPTY_PASSWORD=yes \
          --ulimit nofile=262144:262144 \
          quay.m.daocloud.io/kryptonite/clickhouse-docker-rootless:20.9.3.45 

--ulimit nofile=262144:262144: 262144 is the maximum number of open file descriptors allowed for the container process (soft:hard limit)

ulimit is a Linux shell builtin (raising some limits requires admin access) used to view, set, or restrict the resource usage of the current user, for example the number of open file descriptors each process may hold.
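To confirm the limit actually took effect, a quick sketch (assuming the clickhouse-server container from the command above is running):

podman exec clickhouse-server sh -c 'ulimit -n'
# should print 262144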

      1. login registry
      export ZJLAB_CR_PAT=ghp_xxxxxxxxxxxx
      echo $ZJLAB_CR_PAT | podman login --tls-verify=false cr.registry.res.cloud.zhejianglab.com -u ascm-org-1710208820455 --password-stdin
      
      export GITHUB_CR_PAT=ghp_xxxxxxxxxxxx
      echo $GITHUB_CR_PAT | podman login ghcr.io -u aaronyang0628 --password-stdin
      
      export DOCKER_CR_PAT=dckr_pat_bBN_Xkgz-xxxx
      echo $DOCKER_CR_PAT | podman login docker.io -u aaron666 --password-stdin
      1. tag image
      podman tag 76fdac66291c cr.registry.res.cloud.zhejianglab.com/ay-dev/datahub-s3-fits:1.0.0
      1. push image
      podman push cr.registry.res.cloud.zhejianglab.com/ay-dev/datahub-s3-fits:1.0.0
      1. remove specific image
      docker rmi <$image_id>
      1. remove all <none> images
      docker rmi `docker images | grep  '<none>' | awk '{print $3}'`
      1. remove all stopped containers
      docker container prune
      1. remove all docker images not used
      docker image prune
      1. find ip address of a container
      docker inspect --format='{{.NetworkSettings.IPAddress}}' minio-server
      1. exec into container
      docker exec -it <$container_id> /bin/bash
      1. run with environment
docker run -d -p 18123:8123 -p 19000:9000 --name clickhouse-server -e ALLOW_EMPTY_PASSWORD=yes --ulimit nofile=262144:262144 quay.m.daocloud.io/kryptonite/clickhouse-docker-rootless:20.9.3.45 

--ulimit nofile=262144:262144: same as above, the maximum number of open file descriptors for the container process

      1. copy file

        Copy a local file into container

        docker cp ./some_file CONTAINER:/work

        or copy files from container to local path

        docker cp CONTAINER:/var/logs/ /tmp/app_logs
      2. load a volume

      docker run --rm \
          --entrypoint bash \
          -v $PWD/data:/app:ro \
          -it docker.io/minio/mc:latest \
          -c "mc --insecure alias set minio https://oss-cn-hangzhou-zjy-d01-a.ops.cloud.zhejianglab.com/ g83B2sji1CbAfjQO 2h8NisFRELiwOn41iXc6sgufED1n1A \
              && mc --insecure ls minio/csst-prod/ \
              && mc --insecure mb --ignore-existing minio/csst-prod/crp-test \
              && mc --insecure cp /app/modify.pdf minio/csst-prod/crp-test/ \
              && mc --insecure ls --recursive minio/csst-prod/"
      Mar 7, 2024

      Subsections of Template

      Subsections of DevContainer Template

      Java 21 + Go 1.24

      prepare .devcontainer.json

      {
        "name": "Go & Java DevContainer",
        "build": {
          "dockerfile": "Dockerfile"
        },
        "mounts": [
          "source=/root/.kube/config,target=/root/.kube/config,type=bind",
          "source=/root/.minikube/profiles/minikube/client.crt,target=/root/.minikube/profiles/minikube/client.crt,type=bind",
          "source=/root/.minikube/profiles/minikube/client.key,target=/root/.minikube/profiles/minikube/client.key,type=bind",
          "source=/root/.minikube/ca.crt,target=/root/.minikube/ca.crt,type=bind"
        ],
        "customizations": {
          "vscode": {
            "extensions": [
              "golang.go",
              "vscjava.vscode-java-pack",
              "redhat.java",
              "vscjava.vscode-maven",
              "Alibaba-Cloud.tongyi-lingma",
              "vscjava.vscode-java-debug",
              "vscjava.vscode-java-dependency",
              "vscjava.vscode-java-test"
            ]
          }
        },
        "remoteUser": "root",
        "postCreateCommand": "go version && java -version && mvn -v"
      }

      prepare Dockerfile

      FROM m.daocloud.io/docker.io/ubuntu:24.04
      
      ENV DEBIAN_FRONTEND=noninteractive
      
      RUN apt-get update && \
          apt-get install -y --no-install-recommends \
          ca-certificates \
          curl \
          git \
          wget \
          gnupg \
          vim \
          lsb-release \
          apt-transport-https \
          && apt-get clean \
          && rm -rf /var/lib/apt/lists/*
      
      # install OpenJDK 21 
      RUN mkdir -p /etc/apt/keyrings && \
          wget -qO - https://packages.adoptium.net/artifactory/api/gpg/key/public | gpg --dearmor -o /etc/apt/keyrings/adoptium.gpg && \
          echo "deb [signed-by=/etc/apt/keyrings/adoptium.gpg arch=amd64] https://packages.adoptium.net/artifactory/deb $(awk -F= '/^VERSION_CODENAME/{print$2}' /etc/os-release) main" | tee /etc/apt/sources.list.d/adoptium.list > /dev/null && \
          apt-get update && \
          apt-get install -y temurin-21-jdk && \
          apt-get clean && \
          rm -rf /var/lib/apt/lists/*
      
      # set java env
      ENV JAVA_HOME=/usr/lib/jvm/temurin-21-jdk-amd64
      
      # install maven
      ARG MAVEN_VERSION=3.9.10
      RUN wget https://dlcdn.apache.org/maven/maven-3/${MAVEN_VERSION}/binaries/apache-maven-${MAVEN_VERSION}-bin.tar.gz -O /tmp/maven.tar.gz && \
          mkdir -p /opt/maven && \
          tar -C /opt/maven -xzf /tmp/maven.tar.gz --strip-components=1 && \
          rm /tmp/maven.tar.gz
      
      ENV MAVEN_HOME=/opt/maven
      ENV PATH="${MAVEN_HOME}/bin:${PATH}"
      
      # install go 1.24.4 
      ARG GO_VERSION=1.24.4
      RUN wget https://dl.google.com/go/go${GO_VERSION}.linux-amd64.tar.gz -O /tmp/go.tar.gz && \
          tar -C /usr/local -xzf /tmp/go.tar.gz && \
          rm /tmp/go.tar.gz
      
      # set go env
      ENV GOROOT=/usr/local/go
      ENV GOPATH=/go
      ENV PATH="${GOROOT}/bin:${GOPATH}/bin:${PATH}"
      
      # install other binarys
      ARG KUBECTL_VERSION=v1.33.0
      RUN wget https://files.m.daocloud.io/dl.k8s.io/release/${KUBECTL_VERSION}/bin/linux/amd64/kubectl -O /tmp/kubectl && \
          chmod u+x /tmp/kubectl && \
          mv -f /tmp/kubectl /usr/local/bin/kubectl 
      
      ARG HELM_VERSION=v3.13.3
      RUN wget https://files.m.daocloud.io/get.helm.sh/helm-${HELM_VERSION}-linux-amd64.tar.gz -O /tmp/helm-${HELM_VERSION}-linux-amd64.tar.gz && \
          mkdir -p /opt/helm && \
          tar -C /opt/helm -xzf /tmp/helm-${HELM_VERSION}-linux-amd64.tar.gz && \
          rm /tmp/helm-${HELM_VERSION}-linux-amd64.tar.gz
      
      ENV HELM_HOME=/opt/helm/linux-amd64
      ENV PATH="${HELM_HOME}:${PATH}"
      
      USER root
      WORKDIR /workspace
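If you prefer to build and start this workspace from a terminal rather than the VS Code UI, the Dev Containers CLI can consume the same two files. A minimal sketch, assuming Node.js/npm is available on the host:

npm install -g @devcontainers/cli
devcontainer up --workspace-folder .
devcontainer exec --workspace-folder . go version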
      Mar 7, 2024

      Subsections of DEV

      Devpod

      Preliminary

      • Kubernetes has installed, if not check 🔗link
      • Devpod has installed, if not check 🔗link

      1. Get provider config

      # just copy ~/.kube/config

      for example, the original config

      apiVersion: v1
      clusters:
      - cluster:
          certificate-authority: <$file_path>
          extensions:
          - extension:
              provider: minikube.sigs.k8s.io
              version: v1.33.0
            name: cluster_info
          server: https://<$minikube_ip>:8443
        name: minikube
      contexts:
      - context:
          cluster: minikube
          extensions:
          - extension:
              provider: minikube.sigs.k8s.io
              version: v1.33.0
            name: context_info
          namespace: default
          user: minikube
        name: minikube
      current-context: minikube
      kind: Config
      preferences: {}
      users:
      - name: minikube
        user:
          client-certificate: <$file_path>
          client-key: <$file_path>

      you need to rename clusters.cluster.certificate-authority, clusters.cluster.server, users.user.client-certificate, users.user.client-key.

      clusters.cluster.certificate-authority -> clusters.cluster.certificate-authority-data
      clusters.cluster.server -> ip set to `localhost`
      users.user.client-certificate -> users.user.client-certificate-data
      users.user.client-key -> users.user.client-key-data

the data you paste after each key should be base64-encoded

      cat <$file_path> | base64
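If you prefer to script the encoding, a rough sketch (assumes GNU base64 for -w0 and the default minikube file locations shown above):

CA_DATA=$(base64 -w0 ~/.minikube/ca.crt)
CRT_DATA=$(base64 -w0 ~/.minikube/profiles/minikube/client.crt)
KEY_DATA=$(base64 -w0 ~/.minikube/profiles/minikube/client.key)
# paste $CA_DATA, $CRT_DATA and $KEY_DATA into the corresponding *-data fields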

then, the modified config file should look like this:

      apiVersion: v1
      clusters:
      - cluster:
          certificate-authority-data: xxxxxxxxxxxxxx
          extensions:
          - extension:
              provider: minikube.sigs.k8s.io
              version: v1.33.0
            name: cluster_info
          server: https://127.0.0.1:8443 
        name: minikube
      contexts:
      - context:
          cluster: minikube
          extensions:
          - extension:
              provider: minikube.sigs.k8s.io
              version: v1.33.0
            name: context_info
          namespace: default
          user: minikube
        name: minikube
      current-context: minikube
      kind: Config
      preferences: {}
      users:
      - name: minikube
        user:
          client-certificate-data: xxxxxxxxxxxx
          client-key-data: xxxxxxxxxxxxxxxx

then forward the minikube API server port to your own PC

      #where you host minikube
      MACHINE_IP_ADDRESS=10.200.60.102
      USER=ayay
      MINIKUBE_IP_ADDRESS=$(ssh -o 'UserKnownHostsFile /dev/null' $USER@$MACHINE_IP_ADDRESS '$HOME/bin/minikube ip')
      ssh -o 'UserKnownHostsFile /dev/null' $USER@$MACHINE_IP_ADDRESS -L "*:8443:$MINIKUBE_IP_ADDRESS:8443" -N -f

      2. Create workspace

      1. get git repo link
      2. choose appropriate provider
      3. choose ide type and version
      4. and go!

      Useful Command

      Install Kubectl

      for more information, you can check 🔗link to install kubectl
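If you just need a quick manual install inside the workspace, a minimal sketch (reusing the files.m.daocloud.io mirror URL used elsewhere in these notes; adjust the version as needed):

KUBECTL_VERSION=v1.33.0
wget "https://files.m.daocloud.io/dl.k8s.io/release/${KUBECTL_VERSION}/bin/linux/amd64/kubectl" -O /tmp/kubectl
chmod +x /tmp/kubectl && mv /tmp/kubectl /usr/local/bin/kubectl
kubectl version --client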

• How to use it in devpod

  Everything works fine.

  When you are inside the pod and using kubectl, change clusters.cluster.server in ~/.kube/config to https://<$minikube_ip>:8443 (a sketch follows at the end of this list).

      • exec into devpod

kubectl -n devpod exec -it <$resource_id> -c devpod -- /bin/bash
      • add DNS item
      10.aaa.bbb.ccc gitee.zhejianglab.com
      • shutdown ssh tunnel
        # check if port 8443 is already open
        netstat -aon|findstr "8443"
        
        # find PID
        ps | grep ssh
        
        # kill the process
        taskkill /PID <$PID> /T /F
        # check if port 8443 is already open
        netstat -aon|findstr "8443"
        
        # find PID
        ps | grep ssh
        
        # kill the process
        kill -9 <$PID>
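For the clusters.cluster.server change mentioned above, a small sketch that rewrites the kubeconfig in place (assumes kubectl is available inside the devpod and <$minikube_ip> is reachable from it):

kubectl config set-cluster minikube --server=https://<$minikube_ip>:8443
kubectl get nodes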
      Mar 7, 2024

Dev Container

      write .devcontainer.json
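As a starting point, a minimal sketch that writes a bare-bones .devcontainer.json (the base image here is just an example, swap in whatever toolchain you need):

cat > .devcontainer.json <<'EOF'
{
  "name": "Minimal Dev Container",
  "image": "mcr.microsoft.com/devcontainers/base:ubuntu",
  "remoteUser": "root",
  "postCreateCommand": "uname -a"
}
EOF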

      Mar 7, 2024

      Deploy

        Mar 7, 2024

        Subsections of Operator

        KubeBuilder

        Basic

Kubebuilder is an SDK for building Kubernetes APIs with CRDs. In short, it:

• is built on top of controller-runtime and client-go;
• provides an extensible API framework so users can develop CRDs, Controllers, and Admission Webhooks from scratch to extend K8s;
• ships scaffolding tools that initialize a CRD project and auto-generate boilerplate template code and configuration.

        Architecture

mvc

        Main.go

        import (
        	_ "k8s.io/client-go/plugin/pkg/client/auth"
        
        	ctrl "sigs.k8s.io/controller-runtime"
        )
        // nolint:gocyclo
        func main() {
            ...
        
            mgr, err := ctrl.NewManager(ctrl.GetConfigOrDie(), ctrl.Options{}
        
            ...
            if err = (&controller.GuestbookReconciler{
                Client: mgr.GetClient(),
                Scheme: mgr.GetScheme(),
            }).SetupWithManager(mgr); err != nil {
                setupLog.Error(err, "unable to create controller", "controller", "Guestbook")
                os.Exit(1)
            }
        
            ...
            if os.Getenv("ENABLE_WEBHOOKS") != "false" {
                if err = webhookwebappv1.SetupGuestbookWebhookWithManager(mgr); err != nil {
                    setupLog.Error(err, "unable to create webhook", "webhook", "Guestbook")
                    os.Exit(1)
                }
            }

        Manager

The Manager is the core component: it coordinates multiple controllers and handles caching, clients, leader election, and more. See https://github.com/kubernetes-sigs/controller-runtime/blob/v0.20.0/pkg/manager/manager.go

• Client handles communication with the Kubernetes API Server, operations on resource objects, and cache reads/writes. It comes in two flavors:
  • Reader: reads from the Cache first to avoid hitting the API Server too often; results of Get are cached.
  • Writer: performs write operations (Create, Update, Delete, Patch) directly against the API Server.
  • Informers, core components provided by client-go, watch the API Server for change events (Create/Update/Delete) on specific resource types (such as Pods, Deployments, or custom CRDs).
    • The Client relies on the Informer mechanism to keep the cache in sync: when a resource changes in the API Server, the Informer updates the local cache so that subsequent reads return the latest data.
• Cache
  • The Cache watches API Server resource changes through the built-in client's ListWatcher mechanism.
  • Events are written into the local cache (e.g. an Indexer) to avoid frequent API Server access.
  • The Cache reduces direct requests to the API Server while still letting controllers read the latest resource state quickly.
• Event

  The Kubernetes API Server pushes resource change events over long-lived HTTP connections, and client-go's Informer listens for these messages.

  • Event: an event is the message passed between the Kubernetes API Server and the Controller. It carries the resource type, resource name, and event type (ADDED, MODIFIED, DELETED), and is converted into requests, check link
  • API Server → Manager's Informer → Cache → Controller's Watch → Predicate filtering → WorkQueue → Controller's Reconcile() method

        Controller

        It’s a controller’s job to ensure that, for any given object the actual state of the world matches the desired state in the object. Each controller focuses on one root Kind, but may interact with other Kinds.

        func (r *GuestbookReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
            ...
        }
        func (r *GuestbookReconciler) SetupWithManager(mgr ctrl.Manager) error {
        	return ctrl.NewControllerManagedBy(mgr).
        		For(&webappv1.Guestbook{}).
        		Named("guestbook").
        		Complete(r)
        }

        If you wanna build your own controller, please check https://github.com/kubernetes/community/blob/master/contributors/devel/sig-api-machinery/controllers.md

1. During initialization, each Controller registers the resource types it cares about with the Manager (for example, Owns(&v1.Pod{}) declares interest in Pod resources).

2. Based on that registration, the Manager creates the corresponding Informers and Watches for those resources, check link

3. When a resource change event occurs, the Informer takes the event from the cache and runs it through the Predicate (filter) to decide whether reconciliation should be triggered.

4. If the event passes the filter, the Controller enqueues it (WorkQueue) and eventually calls the user-implemented Reconcile() function to handle it, check link

        func (c *Controller[request]) Start(ctx context.Context) error {
        
        	c.ctx = ctx
        
        	queue := c.NewQueue(c.Name, c.RateLimiter)
        
            c.Queue = &priorityQueueWrapper[request]{TypedRateLimitingInterface: queue}
        
        	err := func() error {
        
                    // start to sync event sources
                    if err := c.startEventSources(ctx); err != nil {
                        return err
                    }
        
                    for i := 0; i < c.MaxConcurrentReconciles; i++ {
                        go func() {
                            for c.processNextWorkItem(ctx) {
        
                            }
                        }()
                    }
        	}()
        
        	c.LogConstructor(nil).Info("All workers finished")
        }
        func (c *Controller[request]) processNextWorkItem(ctx context.Context) bool {
        	obj, priority, shutdown := c.Queue.GetWithPriority()
        
        	c.reconcileHandler(ctx, obj, priority)
        
        }

        Webhook

        Webhooks are a mechanism to intercept requests to the Kubernetes API server. They can be used to validate, mutate, or even proxy requests.

        func (d *GuestbookCustomDefaulter) Default(ctx context.Context, obj runtime.Object) error {}
        
        func (v *GuestbookCustomValidator) ValidateCreate(ctx context.Context, obj runtime.Object) (admission.Warnings, error) {}
        
        func (v *GuestbookCustomValidator) ValidateUpdate(ctx context.Context, oldObj, newObj runtime.Object) (admission.Warnings, error) {}
        
        func (v *GuestbookCustomValidator) ValidateDelete(ctx context.Context, obj runtime.Object) (admission.Warnings, error) {}
        
        func SetupGuestbookWebhookWithManager(mgr ctrl.Manager) error {
        	return ctrl.NewWebhookManagedBy(mgr).For(&webappv1.Guestbook{}).
        		WithValidator(&GuestbookCustomValidator{}).
        		WithDefaulter(&GuestbookCustomDefaulter{}).
        		Complete()
        }
        Mar 7, 2024

        Subsections of KubeBuilder

        Quick Start

        Prerequisites

        • go version v1.23.0+
        • docker version 17.03+.
        • kubectl version v1.11.3+.
        • Access to a Kubernetes v1.11.3+ cluster.

        Installation

        # download kubebuilder and install locally.
        curl -L -o kubebuilder "https://go.kubebuilder.io/dl/latest/$(go env GOOS)/$(go env GOARCH)"
        chmod +x kubebuilder && sudo mv kubebuilder /usr/local/bin/

        Create A Project

        mkdir -p ~/projects/guestbook
        cd ~/projects/guestbook
        kubebuilder init --domain my.domain --repo my.domain/guestbook
        Error: unable to scaffold with “base.go.kubebuilder.io/v4”:exit status 1

        Just try again!

        rm -rf ~/projects/guestbook/*
        kubebuilder init --domain my.domain --repo my.domain/guestbook

        Create An API

        kubebuilder create api --group webapp --version v1 --kind Guestbook
        Error: unable to run post-scaffold tasks of “base.go.kubebuilder.io/v4”: exec: “make”: executable file not found in $PATH
        apt-get -y install make
        rm -rf ~/projects/guestbook/*
        kubebuilder init --domain my.domain --repo my.domain/guestbook
        kubebuilder create api --group webapp --version v1 --kind Guestbook

        Prepare a K8s Cluster

cluster in minikube:
minikube start --kubernetes-version=v1.27.10 --image-mirror-country=cn --image-repository=registry.cn-hangzhou.aliyuncs.com/google_containers --cpus=4 --memory=4g --disk-size=50g --force

        Modify API [Optional]

you can modify the file ~/projects/guestbook/api/v1/guestbook_types.go

        type GuestbookSpec struct {
        	// INSERT ADDITIONAL SPEC FIELDS - desired state of cluster
        	// Important: Run "make" to regenerate code after modifying this file
        
        	// Foo is an example field of Guestbook. Edit guestbook_types.go to remove/update
        	Foo string `json:"foo,omitempty"`
        }

which corresponds to the file ~/projects/guestbook/config/samples/webapp_v1_guestbook.yaml

        If you are editing the API definitions, generate the manifests such as Custom Resources (CRs) or Custom Resource Definitions (CRDs) using

        make manifests
        Modify Controller [Optional]

you can modify the file ~/projects/guestbook/internal/controller/guestbook_controller.go

        // 	"fmt"
        // "k8s.io/apimachinery/pkg/api/errors"
        // "k8s.io/apimachinery/pkg/types"
        // 	appsv1 "k8s.io/api/apps/v1"
        //	corev1 "k8s.io/api/core/v1"
        //	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
        func (r *GuestbookReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
        	// The context is used to allow cancellation of requests, and potentially things like tracing. 
        	_ = log.FromContext(ctx)
        
        	fmt.Printf("I am a controller ->>>>>>")
        	fmt.Printf("Name: %s, Namespace: %s", req.Name, req.Namespace)
        
        	guestbook := &webappv1.Guestbook{}
        	if err := r.Get(ctx, req.NamespacedName, guestbook); err != nil {
        		return ctrl.Result{}, err
        	}
        
        	fooString := guestbook.Spec.Foo
        	replicas := int32(1)
        	fmt.Printf("Foo String: %s", fooString)
        
        	// labels := map[string]string{
        	// 	"app": req.Name,
        	// }
        
        	// dep := &appsv1.Deployment{
        	// 	ObjectMeta: metav1.ObjectMeta{
        	// 		Name:      fooString + "-deployment",
        	// 		Namespace: req.Namespace,
        	// 		Labels:    labels,
        	// 	},
        	// 	Spec: appsv1.DeploymentSpec{
        	// 		Replicas: &replicas,
        	// 		Selector: &metav1.LabelSelector{
        	// 			MatchLabels: labels,
        	// 		},
        	// 		Template: corev1.PodTemplateSpec{
        	// 			ObjectMeta: metav1.ObjectMeta{
        	// 				Labels: labels,
        	// 			},
        	// 			Spec: corev1.PodSpec{
        	// 				Containers: []corev1.Container{{
        	// 					Name:  fooString,
        	// 					Image: "busybox:latest",
        	// 				}},
        	// 			},
        	// 		},
        	// 	},
        	// }
        
        	// existingDep := &appsv1.Deployment{}
        	// err := r.Get(ctx, types.NamespacedName{Name: dep.Name, Namespace: dep.Namespace}, existingDep)
        	// if err != nil {
        	// 	if errors.IsNotFound(err) {
        	// 		if err := r.Create(ctx, dep); err != nil {
        	// 			return ctrl.Result{}, err
        	// 		}
        	// 	} else {
        	// 		return ctrl.Result{}, err
        	// 	}
        	// }
        
        	return ctrl.Result{}, nil
        }

        And you can use make run to test your controller.

        make run

and use the following command to send a request

make sure you have installed the CRDs (make install) before executing the following command

        make install
        kubectl apply -k config/samples/

your controller terminal should look like this

        I am a controller ->>>>>>Name: guestbook-sample, Namespace: defaultFoo String: foo-value

        Install CRDs

        check installed crds in k8s

        kubectl get crds

        install guestbook crd in k8s

        cd ~/projects/guestbook
        make install

        uninstall CRDs

        make uninstall
        
        make undeploy

        Deploy to cluster

        make docker-build IMG=aaron666/guestbook-operator:test
        make docker-build docker-push IMG=<some-registry>/<project-name>:tag
        make deploy IMG=<some-registry>/<project-name>:tag
        Mar 7, 2024

        Operator-SDK

          Mar 7, 2024

          Subsections of Proxy

          Daocloud Binary

Usage

Just add the files.m.daocloud.io prefix to the original URL. For example:

# original Helm download URL
wget https://get.helm.sh/helm-v3.9.1-linux-amd64.tar.gz

# accelerated URL
wget https://files.m.daocloud.io/get.helm.sh/helm-v3.9.1-linux-amd64.tar.gz

This accelerates the download. If the requested file has not been cached yet, the request will block until caching finishes; subsequent downloads are no longer bandwidth-limited.

Best Practices

Scenario 1 - Install Helm

          cd /tmp
          export HELM_VERSION="v3.9.3"
          
          wget "https://files.m.daocloud.io/get.helm.sh/helm-${HELM_VERSION}-linux-amd64.tar.gz"
          tar -zxvf helm-${HELM_VERSION}-linux-amd64.tar.gz
          mv linux-amd64/helm /usr/local/bin/helm
          helm version

Scenario 2 - Install KubeSpray

Just add the following configuration:

          files_repo: "https://files.m.daocloud.io"
          
          ## Kubernetes components
          kubeadm_download_url: "{{ files_repo }}/dl.k8s.io/release/{{ kubeadm_version }}/bin/linux/{{ image_arch }}/kubeadm"
          kubectl_download_url: "{{ files_repo }}/dl.k8s.io/release/{{ kube_version }}/bin/linux/{{ image_arch }}/kubectl"
          kubelet_download_url: "{{ files_repo }}/dl.k8s.io/release/{{ kube_version }}/bin/linux/{{ image_arch }}/kubelet"
          
          ## CNI Plugins
          cni_download_url: "{{ files_repo }}/github.com/containernetworking/plugins/releases/download/{{ cni_version }}/cni-plugins-linux-{{ image_arch }}-{{ cni_version }}.tgz"
          
          ## cri-tools
          crictl_download_url: "{{ files_repo }}/github.com/kubernetes-sigs/cri-tools/releases/download/{{ crictl_version }}/crictl-{{ crictl_version }}-{{ ansible_system | lower }}-{{ image_arch }}.tar.gz"
          
          ## [Optional] etcd: only if you **DON'T** use etcd_deployment=host
          etcd_download_url: "{{ files_repo }}/github.com/etcd-io/etcd/releases/download/{{ etcd_version }}/etcd-{{ etcd_version }}-linux-{{ image_arch }}.tar.gz"
          
          # [Optional] Calico: If using Calico network plugin
          calicoctl_download_url: "{{ files_repo }}/github.com/projectcalico/calico/releases/download/{{ calico_ctl_version }}/calicoctl-linux-{{ image_arch }}"
          calicoctl_alternate_download_url: "{{ files_repo }}/github.com/projectcalico/calicoctl/releases/download/{{ calico_ctl_version }}/calicoctl-linux-{{ image_arch }}"
          # [Optional] Calico with kdd: If using Calico network plugin with kdd datastore
          calico_crds_download_url: "{{ files_repo }}/github.com/projectcalico/calico/archive/{{ calico_version }}.tar.gz"
          
          # [Optional] Flannel: If using Falnnel network plugin
          flannel_cni_download_url: "{{ files_repo }}/kubernetes/flannel/{{ flannel_cni_version }}/flannel-{{ image_arch }}"
          
          # [Optional] helm: only if you set helm_enabled: true
          helm_download_url: "{{ files_repo }}/get.helm.sh/helm-{{ helm_version }}-linux-{{ image_arch }}.tar.gz"
          
          # [Optional] crun: only if you set crun_enabled: true
          crun_download_url: "{{ files_repo }}/github.com/containers/crun/releases/download/{{ crun_version }}/crun-{{ crun_version }}-linux-{{ image_arch }}"
          
          # [Optional] kata: only if you set kata_containers_enabled: true
          kata_containers_download_url: "{{ files_repo }}/github.com/kata-containers/kata-containers/releases/download/{{ kata_containers_version }}/kata-static-{{ kata_containers_version }}-{{ ansible_architecture }}.tar.xz"
          
          # [Optional] cri-dockerd: only if you set container_manager: docker
          cri_dockerd_download_url: "{{ files_repo }}/github.com/Mirantis/cri-dockerd/releases/download/v{{ cri_dockerd_version }}/cri-dockerd-{{ cri_dockerd_version }}.{{ image_arch }}.tgz"
          
          # [Optional] runc,containerd: only if you set container_runtime: containerd
          runc_download_url: "{{ files_repo }}/github.com/opencontainers/runc/releases/download/{{ runc_version }}/runc.{{ image_arch }}"
          containerd_download_url: "{{ files_repo }}/github.com/containerd/containerd/releases/download/v{{ containerd_version }}/containerd-{{ containerd_version }}-linux-{{ image_arch }}.tar.gz"
          nerdctl_download_url: "{{ files_repo }}/github.com/containerd/nerdctl/releases/download/v{{ nerdctl_version }}/nerdctl-{{ nerdctl_version }}-{{ ansible_system | lower }}-{{ image_arch }}.tar.gz"

In practice the download speed reaches Downloaded: 19 files, 603M in 23s (25.9 MB/s), so fetching all files finishes within 23s! For the complete procedure see https://gist.github.com/yankay/a863cf2e300bff6f9040ab1c6c58fbae

Scenario 3 - Install KIND

          cd /tmp
          export KIND_VERSION="v0.22.0"
          
          curl -Lo ./kind https://files.m.daocloud.io/github.com/kubernetes-sigs/kind/releases/download/${KIND_VERSION}/kind-linux-amd64
          chmod +x ./kind
          mv ./kind /usr/bin/kind
          kind version

Scenario 4 - Install K9S

          cd /tmp
          export K9S_VERSION="v0.32.4"
          
          wget https://files.m.daocloud.io/github.com/derailed/k9s/releases/download/${K9S_VERSION}/k9s_Linux_x86_64.tar.gz
          tar -zxvf k9s_Linux_x86_64.tar.gz
          chmod +x k9s
          mv k9s /usr/bin/k9s
          k9s version

Scenario 5 - Install istio

          cd /tmp
          export ISTIO_VERSION="1.14.3"
          
          wget "https://files.m.daocloud.io/github.com/istio/istio/releases/download/${ISTIO_VERSION}/istio-${ISTIO_VERSION}-linux-amd64.tar.gz"
          tar -zxvf istio-${ISTIO_VERSION}-linux-amd64.tar.gz
          # Do follow the istio docs to install istio

Scenario 6 - Install nerdctl (as a replacement for the docker CLI)

This installs as root; for other installation methods see the upstream project: https://github.com/containerd/nerdctl

          export NERDCTL_VERSION="1.7.6"
          mkdir -p nerdctl ;cd nerdctl
          wget https://files.m.daocloud.io/github.com/containerd/nerdctl/releases/download/v${NERDCTL_VERSION}/nerdctl-full-${NERDCTL_VERSION}-linux-amd64.tar.gz
          tar -zvxf nerdctl-full-${NERDCTL_VERSION}-linux-amd64.tar.gz
          mkdir -p /opt/cni/bin ;cp -f libexec/cni/* /opt/cni/bin/ ;cp bin/* /usr/local/bin/ ;cp lib/systemd/system/*.service /usr/lib/systemd/system/
          systemctl enable containerd ;systemctl start containerd --now
          systemctl enable buildkit;systemctl start buildkit --now

Contributions of more scenarios are welcome.

File extensions that are never accelerated

Requests for files with the following extensions get a 403 response directly:

          • .bmp
          • .jpg
          • .jpeg
          • .png
          • .gif
          • .webp
          • .tiff
          • .mp4
          • .webm
          • .ogg
          • .avi
          • .mov
          • .flv
          • .mkv
          • .mp3
          • .wav
          • .rar
          Mar 7, 2024

          Daocloud Image

Quick Start

          docker run -d -P m.daocloud.io/docker.io/library/nginx

Usage

Add a prefix (recommended). For example:

                        docker.io/library/busybox
                           |
                           V
          m.daocloud.io/docker.io/library/busybox

Alternatively, prefix substitution works for the supported registries. For example:

                     docker.io/library/busybox
                       |
                       V
          docker.m.daocloud.io/library/busybox

No cache

If DaoCloud has not cached the image at pull time, a task to sync and cache it is added to the sync queue.

Registries that support prefix substitution (not recommended)

Adding the prefix is the recommended approach.

The prefix-substitution rules per registry are configured manually; open an Issue if you need more.

Source registry → replace with (note):

• docker.elastic.co → elastic.m.daocloud.io
• docker.io → docker.m.daocloud.io
• gcr.io → gcr.m.daocloud.io
• ghcr.io → ghcr.m.daocloud.io
• k8s.gcr.io → k8s-gcr.m.daocloud.io (k8s.gcr.io has been migrated to registry.k8s.io)
• registry.k8s.io → k8s.m.daocloud.io
• mcr.microsoft.com → mcr.m.daocloud.io
• nvcr.io → nvcr.m.daocloud.io
• quay.io → quay.m.daocloud.io
• registry.ollama.ai → ollama.m.daocloud.io

Best Practices

Accelerate Kubernetes

Accelerate kubeadm image pulls

          kubeadm config images pull --image-repository k8s-gcr.m.daocloud.io

Accelerate kind installation

          kind create cluster --name kind --image m.daocloud.io/docker.io/kindest/node:v1.22.1

Accelerate Containerd
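For containerd 1.7+ the mirror can be wired in with a hosts.toml drop-in; a rough sketch, assuming /etc/containerd/config.toml sets config_path = "/etc/containerd/certs.d":

mkdir -p /etc/containerd/certs.d/docker.io
cat > /etc/containerd/certs.d/docker.io/hosts.toml <<'EOF'
server = "https://docker.io"

[host."https://docker.m.daocloud.io"]
  capabilities = ["pull", "resolve"]
EOF
systemctl restart containerd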

Accelerate Docker

Add the following to /etc/docker/daemon.json:

          {
            "registry-mirrors": [
              "https://docker.m.daocloud.io"
            ]
          }

Accelerate Ollama & DeepSeek

Accelerate Ollama installation

          CPU:

          docker run -d -v ollama:/root/.ollama -p 11434:11434 --name ollama docker.m.daocloud.io/ollama/ollama

GPU version:

1. First install the Nvidia Container Toolkit
2. Run the following command to start the Ollama container:
          docker run -d --gpus=all -v ollama:/root/.ollama -p 11434:11434 --name ollama docker.m.daocloud.io/ollama/ollama

For more information, see:

Accelerate the Deepseek-R1 model

With the ollama container already running as described above, you can also use the mirror to speed up pulling DeepSeek-related models.

Note: the official Ollama registry is already quite fast these days, so you can also use it directly.

# use the mirror
          docker exec -it ollama ollama run ollama.m.daocloud.io/library/deepseek-r1:1.5b
          
# or pull the model directly from the official registry
          # docker exec -it ollama ollama run deepseek-r1:1.5b
          Mar 7, 2024

          KubeVPN

          1.install krew

1. download and install krew (a sketch of the upstream install snippet follows after this list)
2. Add the $HOME/.krew/bin directory to your PATH environment variable.
export PATH="${KREW_ROOT:-$HOME/.krew}/bin:$PATH"
3. Run kubectl krew to check the installation
kubectl krew list
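For step 1, the upstream install snippet from the krew docs looks roughly like this (verify against the latest krew documentation before running):

(
  set -x; cd "$(mktemp -d)" &&
  OS="$(uname | tr '[:upper:]' '[:lower:]')" &&
  ARCH="$(uname -m | sed -e 's/x86_64/amd64/' -e 's/\(arm\)\(64\)\?.*/\1\2/' -e 's/aarch64$/arm64/')" &&
  KREW="krew-${OS}_${ARCH}" &&
  curl -fsSLO "https://github.com/kubernetes-sigs/krew/releases/latest/download/${KREW}.tar.gz" &&
  tar zxvf "${KREW}.tar.gz" &&
  ./"${KREW}" install krew
)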

          2. Download from kubevpn source from github

          kubectl krew index add kubevpn https://gitclone.com/github.com/kubenetworks/kubevpn.git
          kubectl krew install kubevpn/kubevpn
          kubectl kubevpn 

          3. Deploy VPN in some cluster

          Using different config to access different cluster and deploy vpn in that k8s.

          kubectl kubevpn connect
If you want to connect to another k8s cluster …
          kubectl kubevpn connect --kubeconfig /root/.kube/xxx_config

          Your terminal should look like this:

          ➜  ~ kubectl kubevpn connect
          Password:
          Starting connect
          Getting network CIDR from cluster info...
          Getting network CIDR from CNI...
          Getting network CIDR from services...
          Labeling Namespace default
          Creating ServiceAccount kubevpn-traffic-manager
          Creating Roles kubevpn-traffic-manager
          Creating RoleBinding kubevpn-traffic-manager
          Creating Service kubevpn-traffic-manager
          Creating MutatingWebhookConfiguration kubevpn-traffic-manager
          Creating Deployment kubevpn-traffic-manager
          
          Pod kubevpn-traffic-manager-66d969fd45-9zlbp is Pending
          Container     Reason            Message
          control-plane ContainerCreating
          vpn           ContainerCreating
          webhook       ContainerCreating
          
          Pod kubevpn-traffic-manager-66d969fd45-9zlbp is Running
          Container     Reason           Message
          control-plane ContainerRunning
          vpn           ContainerRunning
          webhook       ContainerRunning
          
          Forwarding port...
          Connected tunnel
          Adding route...
          Configured DNS service
          +----------------------------------------------------------+
          | Now you can access resources in the kubernetes cluster ! |
          +----------------------------------------------------------+

          already connected to cluster network, use command kubectl kubevpn status to check status

          ➜  ~ kubectl kubevpn status
          ID Mode Cluster   Kubeconfig                  Namespace            Status      Netif
          0  full ops-dev   /root/.kube/zverse_config   data-and-computing   Connected   utun0

          use pod productpage-788df7ff7f-jpkcs IP 172.29.2.134

          ➜  ~ kubectl get pods -o wide
          NAME                                       AGE     IP                NODE              NOMINATED NODE  GATES
          authors-dbb57d856-mbgqk                    7d23h   172.29.2.132      192.168.0.5       <none>         
          details-7d8b5f6bcf-hcl4t                   61d     172.29.0.77       192.168.104.255   <none>         
          kubevpn-traffic-manager-66d969fd45-9zlbp   74s     172.29.2.136      192.168.0.5       <none>         
          productpage-788df7ff7f-jpkcs               61d     172.29.2.134      192.168.0.5       <none>         
          ratings-77b6cd4499-zvl6c                   61d     172.29.0.86       192.168.104.255   <none>         
          reviews-85c88894d9-vgkxd                   24d     172.29.2.249      192.168.0.5       <none>         

          use ping to test connection, seems good

          ➜  ~ ping 172.29.2.134
          PING 172.29.2.134 (172.29.2.134): 56 data bytes
          64 bytes from 172.29.2.134: icmp_seq=0 ttl=63 time=55.727 ms
          64 bytes from 172.29.2.134: icmp_seq=1 ttl=63 time=56.270 ms
          64 bytes from 172.29.2.134: icmp_seq=2 ttl=63 time=55.228 ms
          64 bytes from 172.29.2.134: icmp_seq=3 ttl=63 time=54.293 ms
          ^C
          --- 172.29.2.134 ping statistics ---
          4 packets transmitted, 4 packets received, 0.0% packet loss
          round-trip min/avg/max/stddev = 54.293/55.380/56.270/0.728 ms

          use service productpage IP 172.21.10.49

          ➜  ~ kubectl get services -o wide
          NAME                      TYPE        CLUSTER-IP     PORT(S)              SELECTOR
          authors                   ClusterIP   172.21.5.160   9080/TCP             app=authors
          details                   ClusterIP   172.21.6.183   9080/TCP             app=details
          kubernetes                ClusterIP   172.21.0.1     443/TCP              <none>
          kubevpn-traffic-manager   ClusterIP   172.21.2.86    84xxxxxx0/TCP        app=kubevpn-traffic-manager
          productpage               ClusterIP   172.21.10.49   9080/TCP             app=productpage
          ratings                   ClusterIP   172.21.3.247   9080/TCP             app=ratings
          reviews                   ClusterIP   172.21.8.24    9080/TCP             app=reviews

          use command curl to test service connection

          ➜  ~ curl 172.21.10.49:9080
          <!DOCTYPE html>
          <html>
            <head>
              <title>Simple Bookstore App</title>
          <meta charset="utf-8">
          <meta http-equiv="X-UA-Compatible" content="IE=edge">
          <meta name="viewport" content="width=device-width, initial-scale=1">

          seems good too~

          if you wanna resolve domain

          Domain resolve

          a Pod/Service named productpage in the default namespace can successfully resolve by following name:

          • productpage
          • productpage.default
          • productpage.default.svc.cluster.local
          ➜  ~ curl productpage.default.svc.cluster.local:9080
          <!DOCTYPE html>
          <html>
            <head>
              <title>Simple Bookstore App</title>
          <meta charset="utf-8">
          <meta http-equiv="X-UA-Compatible" content="IE=edge">
          <meta name="viewport" content="width=device-width, initial-scale=1">

          Short domain resolve

To access a service in the cluster, you can use its service name or the short domain name, such as productpage

          ➜  ~ curl productpage:9080
          <!DOCTYPE html>
          <html>
            <head>
              <title>Simple Bookstore App</title>
          <meta charset="utf-8">
          <meta http-equiv="X-UA-Compatible" content="IE=edge">
          ...

          Disclaimer: This only works on the namespace where kubevpn-traffic-manager is deployed.

          Mar 7, 2024

          Subsections of Serverless

          Subsections of Kserve

          Install Kserve

          Preliminary

• Kubernetes v1.30+ has been installed, if not check 🔗link
          • Helm has installed, if not check 🔗link

          Installation

          Install By

          Preliminary

          1. Kubernetes has installed, if not check 🔗link


          2. Helm binary has installed, if not check 🔗link


          1.install from script directly

          Details
          curl -s "https://raw.githubusercontent.com/kserve/kserve/release-0.15/hack/quick_install.sh" | bash
Expected Output

          Installing Gateway API CRDs …

          😀 Successfully installed Istio

          😀 Successfully installed Cert Manager

          😀 Successfully installed Knative

But you will probably encounter some errors due to the network, like this:
          Error: INSTALLATION FAILED: context deadline exceeded

          you need to reinstall some components

          export KSERVE_VERSION=v0.15.2
          export deploymentMode=Serverless
          helm upgrade --namespace kserve kserve-crd oci://ghcr.io/kserve/charts/kserve-crd --version $KSERVE_VERSION
          helm upgrade --namespace kserve kserve oci://ghcr.io/kserve/charts/kserve --version $KSERVE_VERSION --set-string kserve.controller.deploymentMode="$deploymentMode"
          # helm upgrade knative-operator --namespace knative-serving  https://github.com/knative/operator/releases/download/knative-v1.15.7/knative-operator-v1.15.7.tgz

          Preliminary

          1. If you have only one node in your cluster, you need at least 6 CPUs, 6 GB of memory, and 30 GB of disk storage.


          2. If you have multiple nodes in your cluster, for each node you need at least 2 CPUs, 4 GB of memory, and 20 GB of disk storage.


          1.install knative serving CRD resources

          Details
          kubectl apply -f https://github.com/knative/serving/releases/download/knative-v1.18.0/serving-crds.yaml

          2.install knative serving components

          Details
          kubectl apply -f https://github.com/knative/serving/releases/download/knative-v1.18.0/serving-core.yaml
          # kubectl apply -f https://raw.githubusercontent.com/AaronYang0628/assets/refs/heads/main/knative/serving/release/download/knative-v1.18.0/serving-core.yaml

          3.install network layer Istio

          Details
          kubectl apply -l knative.dev/crd-install=true -f https://github.com/knative/net-istio/releases/download/knative-v1.18.0/istio.yaml
          kubectl apply -f https://github.com/knative/net-istio/releases/download/knative-v1.18.0/istio.yaml
          kubectl apply -f https://github.com/knative/net-istio/releases/download/knative-v1.18.0/net-istio.yaml
Expected Output

          Monitor the Knative components until all of the components show a STATUS of Running or Completed.

          kubectl get pods -n knative-serving
          
          #NAME                                      READY   STATUS    RESTARTS   AGE
          #3scale-kourier-control-54cc54cc58-mmdgq   1/1     Running   0          81s
          #activator-67656dcbbb-8mftq                1/1     Running   0          97s
          #autoscaler-df6856b64-5h4lc                1/1     Running   0          97s
          #controller-788796f49d-4x6pm               1/1     Running   0          97s
          #domain-mapping-65f58c79dc-9cw6d           1/1     Running   0          97s
          #domainmapping-webhook-cc646465c-jnwbz     1/1     Running   0          97s
          #webhook-859796bc7-8n5g2                   1/1     Running   0          96s
          Check Knative Hello World

          4.install cert manager

          Details
          kubectl apply -f https://github.com/cert-manager/cert-manager/releases/download/v1.17.2/cert-manager.yaml

          5.install kserve

          Details
          kubectl apply --server-side -f https://github.com/kserve/kserve/releases/download/v0.15.0/kserve.yaml
          kubectl apply --server-side -f https://github.com/kserve/kserve/releases/download/v0.15.0/kserve-cluster-resources.yaml
          Reference

          Preliminary

          1. Kubernetes has installed, if not check 🔗link


          2. ArgoCD has installed, if not check 🔗link


          3. Helm binary has installed, if not check 🔗link


          1.install gateway API CRDs

          Details
          kubectl apply -f https://github.com/kubernetes-sigs/gateway-api/releases/download/v1.3.0/standard-install.yaml

          2.install cert manager

          Reference

          following 🔗link to install cert manager

          3.install istio system

          Reference

          following 🔗link to install three istio components (istio-base, istiod, istio-ingressgateway)

          4.install Knative Operator

          Details
          kubectl -n argocd apply -f - << EOF
          apiVersion: argoproj.io/v1alpha1
          kind: Application
          metadata:
            name: knative-operator
          spec:
            syncPolicy:
              syncOptions:
              - CreateNamespace=true
            project: default
            source:
              repoURL: https://knative.github.io/operator
              chart: knative-operator
              targetRevision: v1.18.1
              helm:
                releaseName: knative-operator
                values: |
                  knative_operator:
                    knative_operator:
                      image: m.daocloud.io/gcr.io/knative-releases/knative.dev/operator/cmd/operator
                      tag: v1.18.1
                      resources:
                        requests:
                          cpu: 100m
                          memory: 100Mi
                        limits:
                          cpu: 1000m
                          memory: 1000Mi
                    operator_webhook:
                      image: m.daocloud.io/gcr.io/knative-releases/knative.dev/operator/cmd/webhook
                      tag: v1.18.1
                      resources:
                        requests:
                          cpu: 100m
                          memory: 100Mi
                        limits:
                          cpu: 500m
                          memory: 500Mi
            destination:
              server: https://kubernetes.default.svc
              namespace: knative-serving
          EOF

          5.sync by argocd

          Details
          argocd app sync argocd/knative-operator

          6.install kserve serving CRD

          kubectl apply -f - <<EOF
          apiVersion: operator.knative.dev/v1beta1
          kind: KnativeServing
          metadata:
            name: knative-serving
            namespace: knative-serving
          spec:
            version: 1.18.0 # this is knative serving version
            config:
              domain:
                example.com: ""
          EOF
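Once applied, you can watch the KnativeServing resource and its pods until everything reports Ready (a quick sanity check, not part of the original steps):

kubectl -n knative-serving get knativeserving knative-serving
kubectl -n knative-serving get pods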

          7.install kserve CRD

          Details
          kubectl -n argocd apply -f - << EOF
          apiVersion: argoproj.io/v1alpha1
          kind: Application
          metadata:
            name: kserve-crd
            annotations:
              argocd.argoproj.io/sync-options: ServerSideApply=true
              argocd.argoproj.io/compare-options: IgnoreExtraneous
          spec:
            syncPolicy:
              syncOptions:
              - CreateNamespace=true
              - ServerSideApply=true
            project: default
            source:
              repoURL: https://aaronyang0628.github.io/helm-chart-mirror/charts
              chart: kserve-crd
              targetRevision: v0.15.2
              helm:
                releaseName: kserve-crd 
            destination:
              server: https://kubernetes.default.svc
              namespace: kserve
          EOF
Expected Output
          knative-serving    activator-cbf5b6b55-7gw8s                                 Running        116s
          knative-serving    autoscaler-c5d454c88-nxrms                                Running        115s
          knative-serving    autoscaler-hpa-6c966695c6-9ld24                           Running        113s
          knative-serving    cleanup-serving-serving-1.18.0-45nhg                      Completed      113s
          knative-serving    controller-84f96b7676-jjqfp                               Running        115s
          knative-serving    net-istio-controller-574679cd5f-2sf4d                     Running        112s
          knative-serving    net-istio-webhook-85c99487db-mmq7n                        Running        111s
          knative-serving    storage-version-migration-serving-serving-1.18.0-k28vf    Completed      113s
          knative-serving    webhook-75d4fb6db5-qqcwz                                  Running        114s

          8.sync by argocd

          Details
          argocd app sync argocd/kserve-crd

          9.install kserve Controller

          Details
          kubectl -n argocd apply -f - << EOF
          apiVersion: argoproj.io/v1alpha1
          kind: Application
          metadata:
            name: kserve
            annotations:
              argocd.argoproj.io/sync-options: ServerSideApply=true
              argocd.argoproj.io/compare-options: IgnoreExtraneous
          spec:
            syncPolicy:
              syncOptions:
              - CreateNamespace=true
              - ServerSideApply=true
            project: default
            source:
              repoURL: https://aaronyang0628.github.io/helm-chart-mirror/charts
              chart: kserve
              targetRevision: v0.15.2
              helm:
                releaseName: kserve
                values: |
                  kserve:
                    agent:
                      image: m.daocloud.io/docker.io/kserve/agent
                    router:
                      image: m.daocloud.io/docker.io/kserve/router
                    storage:
                      image: m.daocloud.io/docker.io/kserve/storage-initializer
                      s3:
                        accessKeyIdName: AWS_ACCESS_KEY_ID
                        secretAccessKeyName: AWS_SECRET_ACCESS_KEY
                        endpoint: ""
                        region: ""
                        verifySSL: ""
                        useVirtualBucket: ""
                        useAnonymousCredential: ""
                    controller:
                      deploymentMode: "Serverless"
                      rbacProxyImage: m.daocloud.io/quay.io/brancz/kube-rbac-proxy:v0.18.0
                      rbacProxy:
                        resources:
                          limits:
                            cpu: 100m
                            memory: 300Mi
                          requests:
                            cpu: 100m
                            memory: 300Mi
                      gateway:
                        domain: example.com
                      image: m.daocloud.io/docker.io/kserve/kserve-controller
                      resources:
                        limits:
                          cpu: 100m
                          memory: 300Mi
                        requests:
                          cpu: 100m
                          memory: 300Mi
                    servingruntime:
                      tensorflow:
                        image: tensorflow/serving
                        tag: 2.6.2
                      mlserver:
                        image: m.daocloud.io/docker.io/seldonio/mlserver
                        tag: 1.5.0
                      sklearnserver:
                        image: m.daocloud.io/docker.io/kserve/sklearnserver
                      xgbserver:
                        image: m.daocloud.io/docker.io/kserve/xgbserver
                      huggingfaceserver:
                        image: m.daocloud.io/docker.io/kserve/huggingfaceserver
                        devShm:
                          enabled: false
                          sizeLimit: ""
                        hostIPC:
                          enabled: false
                      huggingfaceserver_multinode:
                        shm:
                          enabled: true
                          sizeLimit: "3Gi"
                      tritonserver:
                        image: nvcr.io/nvidia/tritonserver
                      pmmlserver:
                        image: m.daocloud.io/docker.io/kserve/pmmlserver
                      paddleserver:
                        image: m.daocloud.io/docker.io/kserve/paddleserver
                      lgbserver:
                        image: m.daocloud.io/docker.io/kserve/lgbserver
                      torchserve:
                        image: pytorch/torchserve-kfs
                        tag: 0.9.0
                      art:
                        image: m.daocloud.io/docker.io/kserve/art-explainer
                    localmodel:
                      enabled: false
                      controller:
                        image: m.daocloud.io/docker.io/kserve/kserve-localmodel-controller
                      jobNamespace: kserve-localmodel-jobs
                      agent:
                        hostPath: /mnt/models
                        image: m.daocloud.io/docker.io/kserve/kserve-localmodelnode-agent
                    inferenceservice:
                      resources:
                        limits:
                          cpu: "1"
                          memory: "2Gi"
                        requests:
                          cpu: "1"
                          memory: "2Gi"
            destination:
              server: https://kubernetes.default.svc
              namespace: kserve
          EOF
          if you have ‘failed calling webhook …’
Internal error occurred: failed calling webhook "clusterservingruntime.kserve-webhook-server.validator": failed to call webhook: Post "https://kserve-webhook-server-service.kserve.svc:443/validate-serving-kserve-io-v1alpha1-clusterservingruntime?timeout=10s": no endpoints available for service "kserve-webhook-server-service"

Just wait for a while and then resync; it will be fine.
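To see whether the webhook service has picked up endpoints yet, a quick check:

kubectl -n kserve get endpoints kserve-webhook-server-service
kubectl -n kserve get pods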

          10.sync by argocd

          Details
          argocd app sync argocd/kserve

          11.install kserve eventing CRD

          Details
          kubectl apply -f https://github.com/knative/eventing/releases/download/knative-v1.18.1/eventing-crds.yaml

          12.install kserve eventing

          Details
          kubectl apply -f https://github.com/knative/eventing/releases/download/knative-v1.18.1/eventing-core.yaml
Expected Output
          knative-eventing   eventing-controller-cc45869cd-fmhg8        1/1     Running       0          3m33s
          knative-eventing   eventing-webhook-67fcc6959b-lktxd          1/1     Running       0          3m33s
          knative-eventing   job-sink-7f5d754db-tbf2z                   1/1     Running       0          3m33s

          FAQ

          Q1: Show me almost endless possibilities

          You can add standard markdown syntax:

          • multiple paragraphs
          • bullet point lists
          • emphasized, bold and even bold emphasized text
          • links
          • etc.
          ...and even source code

          the possibilities are endless (almost - including other shortcodes may or may not work)


          Mar 7, 2024

          Subsections of Serving

          Subsections of Inference

          First Pytorch ISVC

          Mnist Inference

          More Information about mnist service can be found 🔗link

          1. create a namespace
          kubectl create namespace kserve-test
1. deploy a sample PyTorch (MNIST) InferenceService
          kubectl apply -n kserve-test -f - <<EOF
          apiVersion: "serving.kserve.io/v1beta1"
          kind: "InferenceService"
          metadata:
            name: "first-torchserve"
            namespace: kserve-test
          spec:
            predictor:
              model:
                modelFormat:
                  name: pytorch
                storageUri: gs://kfserving-examples/models/torchserve/image_classifier/v1
                resources:
                  limits:
                    memory: 4Gi
          EOF
          1. Check InferenceService status
          kubectl -n kserve-test get inferenceservices first-torchserve 
Expected Output
          kubectl -n kserve-test get pod
          #NAME                                           READY   STATUS    RESTARTS   AGE
          #first-torchserve-predictor-00001-deplo...      2/2     Running   0          25s
          
          kubectl -n kserve-test get inferenceservices first-torchserve
          #NAME           URL   READY     PREV   LATEST   PREVROLLEDOUTREVISION   LATESTREADYREVISION   AGE
          #kserve-test   first-torchserve      http://first-torchserve.kserve-test.example.com   True           100                              first-torchserve-predictor-00001   2m59s

          After all pods are ready, you can access the service by using the following command

          Access By

          If the EXTERNAL-IP value is set, your environment has an external load balancer that you can use for the ingress gateway.

          export INGRESS_HOST=$(kubectl -n istio-system get service istio-ingressgateway -o jsonpath='{.status.loadBalancer.ingress[0].ip}')
          export INGRESS_PORT=$(kubectl -n istio-system get service istio-ingressgateway -o jsonpath='{.spec.ports[?(@.name=="http2")].nodePort}')

          If the EXTERNAL-IP value is none (or perpetually pending), your environment does not provide an external load balancer for the ingress gateway. In this case, you can access the gateway using the service’s node port.

          export INGRESS_HOST=$(minikube ip)
          export INGRESS_PORT=$(kubectl -n istio-system get service istio-ingressgateway -o jsonpath='{.spec.ports[?(@.name=="http2")].nodePort}')
          export INGRESS_HOST=$(minikube ip)
          kubectl port-forward --namespace istio-system svc/istio-ingressgateway 30080:80
          export INGRESS_PORT=30080
1. Perform a prediction. First, prepare your inference input request inside a file:
          wget -O ./mnist-input.json https://raw.githubusercontent.com/kserve/kserve/refs/heads/master/docs/samples/v1beta1/torchserve/v1/imgconv/input.json
          Remember to forward port if using minikube
          ssh -i ~/.minikube/machines/minikube/id_rsa docker@$(minikube ip) -L "*:${INGRESS_PORT}:0.0.0.0:${INGRESS_PORT}" -N -f
5. Invoke the service
          SERVICE_HOSTNAME=$(kubectl -n kserve-test get inferenceservice first-torchserve  -o jsonpath='{.status.url}' | cut -d "/" -f 3)
          # http://first-torchserve.kserve-test.example.com 
          curl -v -H "Host: ${SERVICE_HOSTNAME}" -H "Content-Type: application/json" "http://${INGRESS_HOST}:${INGRESS_PORT}/v1/models/mnist:predict" -d @./mnist-input.json
Expected Output
          *   Trying 192.168.58.2...
          * TCP_NODELAY set
          * Connected to 192.168.58.2 (192.168.58.2) port 32132 (#0)
          > POST /v1/models/mnist:predict HTTP/1.1
> Host: first-torchserve.kserve-test.example.com
          > User-Agent: curl/7.61.1
          > Accept: */*
          > Content-Type: application/json
          > Content-Length: 401
          > 
          * upload completely sent off: 401 out of 401 bytes
          < HTTP/1.1 200 OK
          < content-length: 19
          < content-type: application/json
          < date: Mon, 09 Jun 2025 09:27:27 GMT
          < server: istio-envoy
          < x-envoy-upstream-service-time: 1128
          < 
          * Connection #0 to host 192.168.58.2 left intact
          {"predictions":[2]}
          Mar 7, 2024

          First Custom Model

          AlexNet Inference

          More Information about AlexNet service can be found 🔗link

          1. Implement Custom Model using KServe API
import argparse
import base64
import io
import time

from fastapi.middleware.cors import CORSMiddleware
from torchvision import models, transforms
from typing import Dict
import torch
from PIL import Image

import kserve
from kserve import Model, ModelServer, logging
from kserve.model_server import app
from kserve.utils.utils import generate_uuid


class AlexNetModel(Model):
    def __init__(self, name: str):
        super().__init__(name, return_response_headers=True)
        self.name = name
        # start as not ready; load() flips this to True once the model is loaded
        self.ready = False
        self.load()

    def load(self):
        self.model = models.alexnet(pretrained=True)
        self.model.eval()
        # The ready flag is used by model ready endpoint for readiness probes,
        # set to True when model is loaded successfully without exceptions.
        self.ready = True

    async def predict(
        self,
        payload: Dict,
        headers: Dict[str, str] = None,
        response_headers: Dict[str, str] = None,
    ) -> Dict:
        start = time.time()
        # Input follows the Tensorflow V1 HTTP API for binary values
        # https://www.tensorflow.org/tfx/serving/api_rest#encoding_binary_values
        img_data = payload["instances"][0]["image"]["b64"]
        raw_img_data = base64.b64decode(img_data)
        input_image = Image.open(io.BytesIO(raw_img_data))
        preprocess = transforms.Compose([
            transforms.Resize(256),
            transforms.CenterCrop(224),
            transforms.ToTensor(),
            transforms.Normalize(mean=[0.485, 0.456, 0.406],
                                 std=[0.229, 0.224, 0.225]),
        ])
        input_tensor = preprocess(input_image).unsqueeze(0)
        output = self.model(input_tensor)
        torch.nn.functional.softmax(output, dim=1)
        values, top_5 = torch.topk(output, 5)
        result = values.flatten().tolist()
        end = time.time()
        response_id = generate_uuid()

        # Custom response headers can be added to the inference response
        if response_headers is not None:
            response_headers.update(
                {"prediction-time-latency": f"{round((end - start) * 1000, 9)}"}
            )

        return {"predictions": result}


parser = argparse.ArgumentParser(parents=[kserve.model_server.parser])
args, _ = parser.parse_known_args()

if __name__ == "__main__":
    # Configure kserve and uvicorn logger
    if args.configure_logging:
        logging.configure_logging(args.log_config_file)
    model = AlexNetModel(args.model_name)
    model.load()
    # Custom middlewares can be added to the model
    app.add_middleware(
        CORSMiddleware,
        allow_origins=["*"],
        allow_credentials=True,
        allow_methods=["*"],
        allow_headers=["*"],
    )
    ModelServer().start([model])
2. create requirements.txt
          kserve
          torchvision==0.18.0
          pillow>=10.3.0,<11.0.0
3. create Dockerfile
          FROM m.daocloud.io/docker.io/library/python:3.11-slim
          
          WORKDIR /app
          
          COPY requirements.txt .
          RUN pip install --no-cache-dir  -r requirements.txt 
          
          COPY model.py .
          
          CMD ["python", "model.py", "--model_name=custom-model"]
4. build and push custom docker image
docker build -t ay-custom-model .
docker tag ay-custom-model:latest docker-registry.lab.zverse.space/ay/ay-custom-model:latest
          docker push docker-registry.lab.zverse.space/ay/ay-custom-model:latest
5. create a namespace
          kubectl create namespace kserve-test
6. deploy a sample custom-model service
          kubectl apply -n kserve-test -f - <<EOF
          apiVersion: serving.kserve.io/v1beta1
          kind: InferenceService
          metadata:
            name: ay-custom-model
          spec:
            predictor:
              containers:
                - name: kserve-container
                  image: docker-registry.lab.zverse.space/ay/ay-custom-model:latest
          EOF
7. Check InferenceService status
          kubectl -n kserve-test get inferenceservices ay-custom-model
Expected Output
          kubectl -n kserve-test get pod
          #NAME                                           READY   STATUS    RESTARTS   AGE
          #ay-custom-model-predictor-00003-dcf4rk         2/2     Running   0        167m
          
          kubectl -n kserve-test get inferenceservices ay-custom-model
          #NAME           URL   READY     PREV   LATEST   PREVROLLEDOUTREVISION   LATESTREADYREVISION   AGE
          #ay-custom-model   http://ay-custom-model.kserve-test.example.com   True           100                              ay-custom-model-predictor-00003   177m

          After all pods are ready, you can access the service by using the following command

          Access By

          If the EXTERNAL-IP value is set, your environment has an external load balancer that you can use for the ingress gateway.

          export INGRESS_HOST=$(kubectl -n istio-system get service istio-ingressgateway -o jsonpath='{.status.loadBalancer.ingress[0].ip}')
          export INGRESS_PORT=$(kubectl -n istio-system get service istio-ingressgateway -o jsonpath='{.spec.ports[?(@.name=="http2")].nodePort}')

If the EXTERNAL-IP value is none (or perpetually pending), your environment does not provide an external load balancer for the ingress gateway. In this case, you can access the gateway using the service’s node port:

export INGRESS_HOST=$(minikube ip)
export INGRESS_PORT=$(kubectl -n istio-system get service istio-ingressgateway -o jsonpath='{.spec.ports[?(@.name=="http2")].nodePort}')

Alternatively, port-forward the ingress gateway to a fixed local port:

export INGRESS_HOST=$(minikube ip)
kubectl port-forward --namespace istio-system svc/istio-ingressgateway 30080:80
export INGRESS_PORT=30080
8. Perform a prediction

          First, prepare your inference input request inside a file:

          wget -O ./alex-net-input.json https://kserve.github.io/website/0.15/modelserving/v1beta1/custom/custom_model/input.json
          Remember to forward port if using minikube
          ssh -i ~/.minikube/machines/minikube/id_rsa docker@$(minikube ip) -L "*:${INGRESS_PORT}:0.0.0.0:${INGRESS_PORT}" -N -f
9. Invoke the service
          export SERVICE_HOSTNAME=$(kubectl -n kserve-test get inferenceservice ay-custom-model  -o jsonpath='{.status.url}' | cut -d "/" -f 3)
          # http://ay-custom-model.kserve-test.example.com
          curl -v -H "Host: ${SERVICE_HOSTNAME}" -H "Content-Type: application/json" -X POST "http://${INGRESS_HOST}:${INGRESS_PORT}/v1/models/custom-model:predict" -d @.//alex-net-input.json
Expected Output
          *   Trying 192.168.58.2:30704...
          * Connected to 192.168.58.2 (192.168.58.2) port 30704
          > POST /v1/models/custom-model:predict HTTP/1.1
          > Host: ay-custom-model.kserve-test.example.com
          > User-Agent: curl/8.5.0
          > Accept: */*
          > Content-Type: application/json
          > Content-Length: 105339
          > 
          * We are completely uploaded and fine
          < HTTP/1.1 200 OK
          < content-length: 110
          < content-type: application/json
          < date: Wed, 11 Jun 2025 03:38:30 GMT
          < prediction-time-latency: 89.966773987
          < server: istio-envoy
          < x-envoy-upstream-service-time: 93
          < 
          * Connection #0 to host 192.168.58.2 left intact
          {"predictions":[14.975619316101074,14.0368070602417,13.966034889221191,12.252280235290527,12.086270332336426]}
          Mar 7, 2024

          First Model In Minio

          Inference Model In Minio

          More Information about Deploy InferenceService with a saved model on S3 can be found 🔗link

          Create Service Account


          apiVersion: v1
          kind: ServiceAccount
          metadata:
            name: sa
            annotations:
              eks.amazonaws.com/role-arn: arn:aws:iam::123456789012:role/s3access # replace with your IAM role ARN
              serving.kserve.io/s3-endpoint: s3.amazonaws.com # replace with your s3 endpoint e.g minio-service.kubeflow:9000
              serving.kserve.io/s3-usehttps: "1" # by default 1, if testing with minio you can set to 0
              serving.kserve.io/s3-region: "us-east-2"
              serving.kserve.io/s3-useanoncredential: "false" # omitting this is the same as false, if true will ignore provided credential and use anonymous credentials

Apply it:

          kubectl apply -f create-s3-sa.yaml

          Create S3 Secret and attach to Service Account

          Create a secret with your S3 user credential, KServe reads the secret annotations to inject the S3 environment variables on storage initializer or model agent to download the models from S3 storage.

          Create S3 secret


          apiVersion: v1
          kind: Secret
          metadata:
            name: s3creds
            annotations:
               serving.kserve.io/s3-endpoint: s3.amazonaws.com # replace with your s3 endpoint e.g minio-service.kubeflow:9000
               serving.kserve.io/s3-usehttps: "1" # by default 1, if testing with minio you can set to 0
               serving.kserve.io/s3-region: "us-east-2"
               serving.kserve.io/s3-useanoncredential: "false" # omitting this is the same as false, if true will ignore provided credential and use anonymous credentials
          type: Opaque
          stringData: # use `stringData` for raw credential string or `data` for base64 encoded string
            AWS_ACCESS_KEY_ID: XXXX
            AWS_SECRET_ACCESS_KEY: XXXXXXXX
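The stringData field above accepts raw credential strings. If you prefer the data field instead, the values must be base64 encoded first, for example:

echo -n 'XXXX' | base64
echo -n 'XXXXXXXX' | base64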

          Attach secret to a service account


          apiVersion: v1
          kind: ServiceAccount
          metadata:
            name: sa
          secrets:
          - name: s3creds

Apply it:

          kubectl apply -f create-s3-secret.yaml

Note: If you are running KServe with Istio sidecars enabled, there can be a race condition between the Istio proxy being ready and the agent pulling models. This will result in a tcp dial connection refused error when the agent tries to download from S3.

To resolve it, Istio allows blocking the other containers in a pod until the proxy container is ready.

You can enable this by setting `proxy.holdApplicationUntilProxyStarts: true` in the `istio-sidecar-injector` configmap. The `proxy.holdApplicationUntilProxyStarts` flag was introduced in Istio 1.7 as an experimental feature and is turned off by default.
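A hedged sketch of turning the flag on mesh-wide, assuming Istio keeps its mesh configuration in the istio configmap in istio-system (adjust to how your Istio was installed):

kubectl -n istio-system edit configmap istio
# under the mesh section, add:
#   defaultConfig:
#     holdApplicationUntilProxyStarts: true

Alternatively, the same behavior can be enabled per workload with the pod annotation proxy.istio.io/config: '{ "holdApplicationUntilProxyStarts": true }'.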
          

          Deploy the model on S3 with InferenceService

          Create the InferenceService with the s3 storageUri and the service account with s3 credential attached.

New Schema:

apiVersion: "serving.kserve.io/v1beta1"
kind: "InferenceService"
metadata:
  name: "mnist-s3"
spec:
  predictor:
    serviceAccountName: sa
    model:
      modelFormat:
        name: tensorflow
      storageUri: "s3://kserve-examples/mnist"

Old Schema:

apiVersion: "serving.kserve.io/v1beta1"
kind: "InferenceService"
metadata:
  name: "mnist-s3"
spec:
  predictor:
    serviceAccountName: sa
    tensorflow:
      storageUri: "s3://kserve-examples/mnist"

Apply the mnist-s3.yaml.


          kubectl apply -f mnist-s3.yaml

          Run a prediction

Now, the ingress can be accessed at ${INGRESS_HOST}:${INGRESS_PORT}; follow the ingress setup from the earlier examples to find out the ingress IP and port.
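For example, on a minikube/NodePort setup like the one used in the earlier examples (a sketch; adjust to your own ingress setup):

export INGRESS_HOST=$(minikube ip)
export INGRESS_PORT=$(kubectl -n istio-system get service istio-ingressgateway -o jsonpath='{.spec.ports[?(@.name=="http2")].nodePort}')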

          SERVICE_HOSTNAME=$(kubectl get inferenceservice mnist-s3 -o jsonpath='{.status.url}' | cut -d "/" -f 3)
          
          MODEL_NAME=mnist-s3
          INPUT_PATH=@./input.json
          curl -v -H "Host: ${SERVICE_HOSTNAME}" -H "Content-Type: application/json" http://${INGRESS_HOST}:${INGRESS_PORT}/v1/models/$MODEL_NAME:predict -d $INPUT_PATH

Expected Output
          Note: Unnecessary use of -X or --request, POST is already inferred.
          *   Trying 35.237.217.209...
          * TCP_NODELAY set
          * Connected to mnist-s3.default.35.237.217.209.xip.io (35.237.217.209) port 80 (#0)
          > POST /v1/models/mnist-s3:predict HTTP/1.1
          > Host: mnist-s3.default.35.237.217.209.xip.io
          > User-Agent: curl/7.55.1
          > Accept: */*
          > Content-Length: 2052
          > Content-Type: application/x-www-form-urlencoded
          > Expect: 100-continue
          >
          < HTTP/1.1 100 Continue
          * We are completely uploaded and fine
          < HTTP/1.1 200 OK
          < content-length: 251
          < content-type: application/json
          < date: Sun, 04 Apr 2021 20:06:27 GMT
          < x-envoy-upstream-service-time: 5
          < server: istio-envoy
          <
          * Connection #0 to host mnist-s3.default.35.237.217.209.xip.io left intact
          {
              "predictions": [
                  {
                      "predictions": [0.327352405, 2.00153053e-07, 0.0113353515, 0.203903764, 3.62863029e-05, 0.416683704, 0.000281196437, 8.36911859e-05, 0.0403052084, 1.82206513e-05],
                      "classes": 5
                  }
              ]
          }
          
          Mar 7, 2024

          Kafka Sink Transformer

          AlexNet Inference

          More Information about Custom Transformer service can be found 🔗link

1. Implement Custom Transformer ./model.py using the KServe API
import os
import argparse
import json

from typing import Dict, Union
from kafka import KafkaProducer
from cloudevents.http import CloudEvent
from cloudevents.conversion import to_structured

from kserve import (
    Model,
    ModelServer,
    model_server,
    logging,
    InferRequest,
    InferResponse,
)

from kserve.logging import logger
from kserve.utils.utils import generate_uuid

kafka_producer = KafkaProducer(
    value_serializer=lambda v: json.dumps(v).encode('utf-8'),
    bootstrap_servers=os.environ.get('KAFKA_BOOTSTRAP_SERVERS', 'localhost:9092')
)

class ImageTransformer(Model):
    def __init__(self, name: str):
        super().__init__(name, return_response_headers=True)
        self.ready = True


    def preprocess(
        self, payload: Union[Dict, InferRequest], headers: Dict[str, str] = None
    ) -> Union[Dict, InferRequest]:
        logger.info("Received inputs %s", payload)
        logger.info("Received headers %s", headers)
        self.request_trace_key = os.environ.get('REQUEST_TRACE_KEY', 'algo.trace.requestId')
        if self.request_trace_key not in payload:
            logger.error("Request trace key '%s' not found in payload, you cannot trace the prediction result", self.request_trace_key)
            if "instances" not in payload:
                raise ValueError(
                    f"Request trace key '{self.request_trace_key}' not found in payload and 'instances' key is missing."
                )
        else:
            headers[self.request_trace_key] = payload.get(self.request_trace_key)

        return {"instances": payload["instances"]}

    def postprocess(
        self,
        infer_response: Union[Dict, InferResponse],
        headers: Dict[str, str] = None,
        response_headers: Dict[str, str] = None,
    ) -> Union[Dict, InferResponse]:
        logger.info("postprocess headers: %s", headers)
        logger.info("postprocess response headers: %s", response_headers)
        logger.info("postprocess response: %s", infer_response)

        attributes = {
            "source": "data-and-computing/kafka-sink-transformer",
            "type": "org.zhejianglab.zverse.data-and-computing.kafka-sink-transformer",
            "request-host": headers.get('host', 'unknown'),
            "kserve-isvc-name": headers.get('kserve-isvc-name', 'unknown'),
            "kserve-isvc-namespace": headers.get('kserve-isvc-namespace', 'unknown'),
            self.request_trace_key: headers.get(self.request_trace_key, 'unknown'),
        }

        _, cloudevent = to_structured(CloudEvent(attributes, infer_response))
        try:
            kafka_producer.send(os.environ.get('KAFKA_TOPIC', 'test-topic'), value=cloudevent.decode('utf-8').replace("'", '"'))
            kafka_producer.flush()
        except Exception as e:
            logger.error("Failed to send message to Kafka: %s", e)
        return infer_response

parser = argparse.ArgumentParser(parents=[model_server.parser])
args, _ = parser.parse_known_args()

if __name__ == "__main__":
    if args.configure_logging:
        logging.configure_logging(args.log_config_file)
    logging.logger.info("available model name: %s", args.model_name)
    logging.logger.info("all args: %s", args.model_name)
    model = ImageTransformer(args.model_name)
    ModelServer().start([model])
2. modify ./pyproject.toml
          [tool.poetry]
          name = "custom_transformer"
          version = "0.15.2"
          description = "Custom Transformer Examples. Not intended for use outside KServe Frameworks Images."
          authors = ["Dan Sun <dsun20@bloomberg.net>"]
          license = "Apache-2.0"
          packages = [
              { include = "*.py" }
          ]
          
          [tool.poetry.dependencies]
          python = ">=3.9,<3.13"
          kserve = {path = "../kserve", develop = true}
          pillow = "^10.3.0"
          kafka-python = "^2.2.15"
          cloudevents = "^1.11.1"
          
          [[tool.poetry.source]]
          name = "pytorch"
          url = "https://download.pytorch.org/whl/cpu"
          priority = "explicit"
          
          [tool.poetry.group.test]
          optional = true
          
          [tool.poetry.group.test.dependencies]
          pytest = "^7.4.4"
          mypy = "^0.991"
          
          [tool.poetry.group.dev]
          optional = true
          
          [tool.poetry.group.dev.dependencies]
          black = { version = "~24.3.0", extras = ["colorama"] }
          
          [tool.poetry-version-plugin]
          source = "file"
          file_path = "../VERSION"
          
          [build-system]
          requires = ["poetry-core>=1.0.0"]
          build-backend = "poetry.core.masonry.api"
3. prepare ../custom_transformer.Dockerfile
          ARG PYTHON_VERSION=3.11
          ARG BASE_IMAGE=python:${PYTHON_VERSION}-slim-bookworm
          ARG VENV_PATH=/prod_venv
          
          FROM ${BASE_IMAGE} AS builder
          
          # Install Poetry
          ARG POETRY_HOME=/opt/poetry
          ARG POETRY_VERSION=1.8.3
          
          RUN python3 -m venv ${POETRY_HOME} && ${POETRY_HOME}/bin/pip install poetry==${POETRY_VERSION}
          ENV PATH="$PATH:${POETRY_HOME}/bin"
          
          # Activate virtual env
          ARG VENV_PATH
          ENV VIRTUAL_ENV=${VENV_PATH}
          RUN python3 -m venv $VIRTUAL_ENV
          ENV PATH="$VIRTUAL_ENV/bin:$PATH"
          
          COPY kserve/pyproject.toml kserve/poetry.lock kserve/
          RUN cd kserve && poetry install --no-root --no-interaction --no-cache
          COPY kserve kserve
          RUN cd kserve && poetry install --no-interaction --no-cache
          
          COPY custom_transformer/pyproject.toml custom_transformer/poetry.lock custom_transformer/
          RUN cd custom_transformer && poetry install --no-root --no-interaction --no-cache
          COPY custom_transformer custom_transformer
          RUN cd custom_transformer && poetry install --no-interaction --no-cache
          
          
          FROM ${BASE_IMAGE} AS prod
          
          COPY third_party third_party
          
          # Activate virtual env
          ARG VENV_PATH
          ENV VIRTUAL_ENV=${VENV_PATH}
          ENV PATH="$VIRTUAL_ENV/bin:$PATH"
          
          RUN useradd kserve -m -u 1000 -d /home/kserve
          
          COPY --from=builder --chown=kserve:kserve $VIRTUAL_ENV $VIRTUAL_ENV
          COPY --from=builder kserve kserve
          COPY --from=builder custom_transformer custom_transformer
          
          USER 1000
          ENTRYPOINT ["python", "-m", "custom_transformer.model"]
4. regenerate poetry.lock
          poetry lock --no-update
5. build and push custom docker image
          cd python
          podman build -t docker-registry.lab.zverse.space/data-and-computing/ay-dev/msg-transformer:dev9 -f custom_transformer.Dockerfile .
          
          podman push docker-registry.lab.zverse.space/data-and-computing/ay-dev/msg-transformer:dev9
          Mar 7, 2024

          Subsections of Generative

          First Generative Service

flowchart LR
    B[KServe InferenceService]
    B --> C[[Knative Serving]] --> D[Autoscaling / canary rollout]
    B --> E[[Istio]] --> F[Traffic management / security]
    B --> G[[Storage]] --> H[S3/GCS/PVC]

Deploy an inference service with a single YAML
          apiVersion: "serving.kserve.io/v1beta1"
          kind: "InferenceService"
          metadata:
            name: "sklearn-iris"
            namespace: kserve-test
          spec:
            predictor:
              model:
                modelFormat:
                  name: sklearn
                resources: {}
                storageUri: "gs://kfserving-examples/models/sklearn/1.0/model"

Check the InferenceService and invoke it:

          kubectl -n kserve-test get inferenceservices sklearn-iris 
          kubectl -n istio-system get svc istio-ingressgateway 
          export INGRESS_HOST=$(minikube ip)
          export INGRESS_PORT=$(kubectl -n istio-system get service istio-ingressgateway -o jsonpath='{.spec.ports[?(@.name=="http2")].nodePort}')
          SERVICE_HOSTNAME=$(kubectl -n kserve-test get inferenceservice sklearn-iris  -o jsonpath='{.status.url}' | cut -d "/" -f 3)
          # http://sklearn-iris.kserve-test.example.com 
          curl -v -H "Host: ${SERVICE_HOSTNAME}" -H "Content-Type: application/json" "http://${INGRESS_HOST}:${INGRESS_PORT}/v1/models/sklearn-iris:predict" -d @./iris-input.json

          How to deploy your own ML model

          apiVersion: serving.kserve.io/v1beta1
          kind: InferenceService
          metadata:
            name: huggingface-llama3
            namespace: kserve-test
            annotations:
              serving.kserve.io/deploymentMode: RawDeployment
              serving.kserve.io/autoscalerClass: none
          spec:
            predictor:
              model:
                modelFormat:
                  name: huggingface
                storageUri: pvc://llama-3-8b-pvc/hf/8b_instruction_tuned
              workerSpec:
                pipelineParallelSize: 2
                tensorParallelSize: 1
      containers:
        - name: worker-container
          resources:
            requests:
              nvidia.com/gpu: "8"

For more details on workerSpec and ServingRuntime, check https://kserve.github.io/website/0.15/modelserving/v1beta1/llm/huggingface/multi-node/#workerspec-and-servingruntime
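The storageUri above assumes a PVC named llama-3-8b-pvc that already contains the model weights under hf/8b_instruction_tuned. A minimal sketch of such a claim (the size is an assumption, and copying the weights into it is left to you):

kubectl apply -n kserve-test -f - <<EOF
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: llama-3-8b-pvc
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 100Gi
EOF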

          Mar 7, 2024

          Canary Policy

          KServe supports canary rollouts for inference services. Canary rollouts allow for a new version of an InferenceService to receive a percentage of traffic. Kserve supports a configurable canary rollout strategy with multiple steps. The rollout strategy can also be implemented to rollback to the previous revision if a rollout step fails.

          KServe automatically tracks the last good revision that was rolled out with 100% traffic. The canaryTrafficPercent field in the component’s spec needs to be set with the percentage of traffic that should be routed to the new revision. KServe will then automatically split the traffic between the last good revision and the revision that is currently being rolled out according to the canaryTrafficPercent value.

When the first revision of an InferenceService is deployed, it will receive 100% of the traffic. When multiple revisions are deployed, as in step 2, and the canary rollout strategy is configured to route 10% of the traffic to the new revision, 90% of the traffic will go to the LatestRolledoutRevision. If there is an unhealthy or bad revision applied, traffic will not be routed to that bad revision. In step 3, the rollout strategy promotes the LatestReadyRevision from step 2 to the LatestRolledoutRevision. Since it is now promoted, the LatestRolledoutRevision gets 100% of the traffic and is fully rolled out. If a rollback needs to happen, 100% of the traffic will be pinned to the previous healthy/good revision, the PreviousRolledoutRevision.
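As a minimal sketch of where the field lives (the name and storageUri below are placeholders; a full walkthrough follows in the Rollout Example):

kubectl apply -n kserve-test -f - <<EOF
apiVersion: "serving.kserve.io/v1beta1"
kind: "InferenceService"
metadata:
  name: "my-model"                     # placeholder
spec:
  predictor:
    canaryTrafficPercent: 10           # route 10% of traffic to the newly applied revision
    model:
      modelFormat:
        name: sklearn
      storageUri: "gs://bucket/path/to/new-model"   # placeholder
EOF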

Figure: canary rollout strategy, steps 1-2 and step 3.

          Reference

          For more information, see Canary Rollout.

          Mar 7, 2024

          Subsections of Canary Policy

          Rollout Example

          Create the InferenceService

          Follow the First Inference Service tutorial. Set up a namespace kserve-test and create an InferenceService.

          After rolling out the first model, 100% traffic goes to the initial model with service revision 1.

          kubectl -n kserve-test get isvc sklearn-iris
Expected Output
          NAME       URL              READY   PREV   LATEST   PREVROLLEDOUTREVISION   LATESTREADYREVISION                AGE
sklearn-iris   http://sklearn-iris.kserve-test.example.com   True           100                              sklearn-iris-predictor-00001   2m39s

          Apply Canary Rollout Strategy

          • Add the canaryTrafficPercent field to the predictor component
          • Update the storageUri to use a new/updated model.
          kubectl apply -n kserve-test -f - <<EOF
          apiVersion: "serving.kserve.io/v1beta1"
          kind: "InferenceService"
          metadata:
            name: "sklearn-iris"
            namespace: kserve-test
          spec:
            predictor:
              canaryTrafficPercent: 10
              model:
                args: ["--enable_docs_url=True"]
                modelFormat:
                  name: sklearn
                resources: {}
                runtime: kserve-sklearnserver
                storageUri: "gs://kfserving-examples/models/sklearn/1.0/model-2"
          EOF

          After rolling out the canary model, traffic is split between the latest ready revision 2 and the previously rolled out revision 1.

          kubectl -n kserve-test get isvc sklearn-iris
Expected Output
          NAME       URL              READY   PREV   LATEST   PREVROLLEDOUTREVISION   LATESTREADYREVISION                AGE
          sklearn-iris   http://sklearn-iris.kserve-test.example.com   True    90     10       sklearn-iris-predictor-00002   sklearn-iris-predictor-00003   19h

Check the running pods. You should now see two pods running, one for the old model and one for the new model, with 10% of the traffic routed to the new model. Notice that revision 1 contains 00002 in its name, while revision 2 contains 00003.

          kubectl get pods 
          
          NAME                                                        READY   STATUS    RESTARTS   AGE
          sklearn-iris-predictor-00002-deployment-c7bb6c685-ktk7r     2/2     Running   0          71m
          sklearn-iris-predictor-00003-deployment-8498d947-fpzcg      2/2     Running   0          20m

          Run a prediction

          Follow the next two steps (Determine the ingress IP and ports and Perform inference) in the First Inference Service tutorial.

          Send more requests to the InferenceService to observe the 10% of traffic that routes to the new revision.

          Promote the canary model

If the canary model is healthy and passes your tests, you can promote it by removing the canaryTrafficPercent field and re-applying the InferenceService custom resource with the same name sklearn-iris.

          kubectl apply -n kserve-test -f - <<EOF
          apiVersion: "serving.kserve.io/v1beta1"
          kind: "InferenceService"
          metadata:
            name: "sklearn-iris"
            namespace: kserve-test
          spec:
            predictor:
              model:
                args: ["--enable_docs_url=True"]
                modelFormat:
                  name: sklearn
                resources: {}
                runtime: kserve-sklearnserver
                storageUri: "gs://kfserving-examples/models/sklearn/1.0/model-2"
          EOF

          Now all traffic goes to the revision 2 for the new model.

          kubectl get isvc sklearn-iris
          NAME       URL                                   READY   PREV   LATEST   PREVROLLEDOUTREVISION   LATESTREADYREVISION                AGE
          sklearn-iris   http://sklearn-iris.kserve-test.example.com   True           100                              sklearn-iris-predictor-00002   17m

The pods for revision generation 1 automatically scale down to 0 as they are no longer receiving traffic.

          kubectl get pods -l serving.kserve.io/inferenceservice=sklearn-iris
          NAME                                                           READY   STATUS        RESTARTS   AGE
          sklearn-iris-predictor-00001-deployment-66c5f5b8d5-gmfvj   1/2     Terminating   0          17m
          sklearn-iris-predictor-00002-deployment-5bd9ff46f8-shtzd   2/2     Running       0          15m

          Rollback and pin the previous model

          You can pin the previous model (model v1, for example) by setting the canaryTrafficPercent to 0 for the current model (model v2, for example). This rolls back from model v2 to model v1 and decreases model v2’s traffic to zero.

          Apply the custom resource to set model v2’s traffic to 0%.

          kubectl apply -n kserve-test -f - <<EOF
          apiVersion: "serving.kserve.io/v1beta1"
          kind: "InferenceService"
          metadata:
            name: "sklearn-iris"
          spec:
            predictor:
              canaryTrafficPercent: 0
              model:
                modelFormat:
                  name: sklearn
                storageUri: "gs://kfserving-examples/models/sklearn/1.0/model-2"
          EOF

Check the traffic split; now 100% of the traffic goes to the previous good model (model v1) from revision generation 1.

          kubectl get isvc sklearn-iris
          NAME       URL                                   READY   PREV   LATEST   PREVROLLEDOUTREVISION              LATESTREADYREVISION                AGE
          sklearn-iris   http://sklearn-iris.kserve-test.example.com   True    100    0        sklearn-iris-predictor-00002   sklearn-iris-predictor-00003   18m

The previous revision (model v1) now receives 100% of the traffic, while the new model (model v2) receives 0%.

          kubectl get pods -l serving.kserve.io/inferenceservice=sklearn-iris
          
          NAME                                                       READY   STATUS        RESTARTS   AGE
          sklearn-iris-predictor-00002-deployment-66c5f5b8d5-gmfvj   1/2     Running       0          35s
          sklearn-iris-predictor-00003-deployment-5bd9ff46f8-shtzd   2/2     Running       0          16m

          Route traffic using a tag

          You can enable tag based routing by adding the annotation serving.kserve.io/enable-tag-routing, so traffic can be explicitly routed to the canary model (model v2) or the old model (model v1) via a tag in the request URL.

          Apply model v2 with canaryTrafficPercent: 10 and serving.kserve.io/enable-tag-routing: "true".

          kubectl apply -n kserve-test -f - <<EOF
          apiVersion: "serving.kserve.io/v1beta1"
          kind: "InferenceService"
          metadata:
            name: "sklearn-iris"
            annotations:
              serving.kserve.io/enable-tag-routing: "true"
          spec:
            predictor:
              canaryTrafficPercent: 10
              model:
                modelFormat:
                  name: sklearn
                storageUri: "gs://kfserving-examples/models/sklearn/1.0/model-2"
          EOF

          Check the InferenceService status to get the canary and previous model URL.

          kubectl get isvc sklearn-iris -ojsonpath="{.status.components.predictor}"  | jq

          The output should look like

Expected Output
          {
              "address": {
              "url": "http://sklearn-iris-predictor-.kserve-test.svc.cluster.local"
              },
              "latestCreatedRevision": "sklearn-iris-predictor--00003",
              "latestReadyRevision": "sklearn-iris-predictor--00003",
              "latestRolledoutRevision": "sklearn-iris-predictor--00001",
              "previousRolledoutRevision": "sklearn-iris-predictor--00001",
              "traffic": [
              {
                  "latestRevision": true,
                  "percent": 10,
                  "revisionName": "sklearn-iris-predictor--00003",
                  "tag": "latest",
                  "url": "http://latest-sklearn-iris-predictor-.kserve-test.example.com"
              },
              {
                  "latestRevision": false,
                  "percent": 90,
                  "revisionName": "sklearn-iris-predictor--00001",
                  "tag": "prev",
                  "url": "http://prev-sklearn-iris-predictor-.kserve-test.example.com"
              }
              ],
              "url": "http://sklearn-iris-predictor-.kserve-test.example.com"
          }

          Since we updated the annotation on the InferenceService, model v2 now corresponds to sklearn-iris-predictor--00003.

          You can now send the request explicitly to the new model or the previous model by using the tag in the request URL. Use the curl command from Perform inference and add latest- or prev- to the model name to send a tag based request.

          For example, set the model name and use the following commands to send traffic to each service based on the latest or prev tag.

          curl the latest revision

          MODEL_NAME=sklearn-iris
          curl -v -H "Host: latest-${MODEL_NAME}-predictor-.kserve-test.example.com" -H "Content-Type: application/json" http://${INGRESS_HOST}:${INGRESS_PORT}/v1/models/$MODEL_NAME:predict -d @./iris-input.json

          or curl the previous revision

          curl -v -H "Host: prev-${MODEL_NAME}-predictor-.kserve-test.example.com" -H "Content-Type: application/json" http://${INGRESS_HOST}:${INGRESS_PORT}/v1/models/$MODEL_NAME:predict -d @./iris-input.json
          Mar 7, 2024

          Auto Scaling

          Soft Limit

You can configure an InferenceService with the annotation autoscaling.knative.dev/target for a soft limit. The soft limit is a targeted limit rather than a strictly enforced bound; particularly if there is a sudden burst of requests, this value can be exceeded.

          apiVersion: "serving.kserve.io/v1beta1"
          kind: "InferenceService"
          metadata:
            name: "sklearn-iris"
            namespace: kserve-test
            annotations:
              autoscaling.knative.dev/target: "5"
          spec:
            predictor:
              model:
                args: ["--enable_docs_url=True"]
                modelFormat:
                  name: sklearn
                resources: {}
                runtime: kserve-sklearnserver
                storageUri: "gs://kfserving-examples/models/sklearn/1.0/model"

          Hard Limit

You can also configure an InferenceService with the containerConcurrency field for a hard limit. The hard limit is an enforced upper bound: if concurrency reaches the hard limit, surplus requests will be buffered and must wait until enough capacity is free to execute them.

          apiVersion: "serving.kserve.io/v1beta1"
          kind: "InferenceService"
          metadata:
            name: "sklearn-iris"
            namespace: kserve-test
          spec:
            predictor:
              containerConcurrency: 5
              model:
                args: ["--enable_docs_url=True"]
                modelFormat:
                  name: sklearn
                resources: {}
                runtime: kserve-sklearnserver
                storageUri: "gs://kfserving-examples/models/sklearn/1.0/model"

          Scale with QPS

          apiVersion: "serving.kserve.io/v1beta1"
          kind: "InferenceService"
          metadata:
            name: "sklearn-iris"
            namespace: kserve-test
          spec:
            predictor:
              scaleTarget: 1
              scaleMetric: qps
              model:
                args: ["--enable_docs_url=True"]
                modelFormat:
                  name: sklearn
                resources: {}
                runtime: kserve-sklearnserver
                storageUri: "gs://kfserving-examples/models/sklearn/1.0/model"

          Scale with GPU

          apiVersion: "serving.kserve.io/v1beta1"
          kind: "InferenceService"
          metadata:
            name: "flowers-sample-gpu"
            namespace: kserve-test
          spec:
            predictor:
              scaleTarget: 1
              scaleMetric: concurrency
              model:
                modelFormat:
                  name: tensorflow
                storageUri: "gs://kfserving-examples/models/tensorflow/flowers"
                runtimeVersion: "2.6.2-gpu"
                resources:
                  limits:
                    nvidia.com/gpu: 1

          Enable Scale To Zero

          apiVersion: "serving.kserve.io/v1beta1"
          kind: "InferenceService"
          metadata:
            name: "sklearn-iris"
            namespace: kserve-test
          spec:
            predictor:
              minReplicas: 0
              model:
                args: ["--enable_docs_url=True"]
                modelFormat:
                  name: sklearn
                resources: {}
                runtime: kserve-sklearnserver
                storageUri: "gs://kfserving-examples/models/sklearn/1.0/model"

          Prepare Concurrent Requests Container

          # export INGRESS_PORT=$(kubectl -n istio-system get service istio-ingressgateway -o jsonpath='{.spec.ports[?(@.name=="http2")].nodePort}')
          podman run --rm \
                -v /root/kserve/iris-input.json:/tmp/iris-input.json \
                --privileged \
                -e INGRESS_HOST=$(minikube ip) \
                -e INGRESS_PORT=32132 \
                -e MODEL_NAME=sklearn-iris \
                -e INPUT_PATH=/tmp/iris-input.json \
                -e SERVICE_HOSTNAME=sklearn-iris.kserve-test.example.com \
                -it m.daocloud.io/docker.io/library/golang:1.22  bash -c "go install github.com/rakyll/hey@latest; bash"

          Fire

Send traffic for 30 seconds, maintaining 100 concurrent in-flight requests.

          hey -z 30s -c 100 -m POST -host ${SERVICE_HOSTNAME} -D $INPUT_PATH http://${INGRESS_HOST}:${INGRESS_PORT}/v1/models/$MODEL_NAME:predict
Expected Output
          Summary:
            Total:        30.1390 secs
            Slowest:      0.5015 secs
            Fastest:      0.0252 secs
            Average:      0.1451 secs
            Requests/sec: 687.3483
            
            Total data:   4371076 bytes
            Size/request: 211 bytes
          
          Response time histogram:
            0.025 [1]     |
            0.073 [14]    |
            0.120 [33]    |
            0.168 [19363] |■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■
            0.216 [1171]  |■■
            0.263 [28]    |
            0.311 [6]     |
            0.359 [0]     |
            0.406 [0]     |
            0.454 [0]     |
            0.502 [100]   |
          
          
          Latency distribution:
            10% in 0.1341 secs
            25% in 0.1363 secs
            50% in 0.1388 secs
            75% in 0.1462 secs
            90% in 0.1587 secs
            95% in 0.1754 secs
            99% in 0.1968 secs
          
          Details (average, fastest, slowest):
            DNS+dialup:   0.0000 secs, 0.0252 secs, 0.5015 secs
            DNS-lookup:   0.0000 secs, 0.0000 secs, 0.0000 secs
            req write:    0.0000 secs, 0.0000 secs, 0.0005 secs
            resp wait:    0.1451 secs, 0.0251 secs, 0.5015 secs
            resp read:    0.0000 secs, 0.0000 secs, 0.0003 secs
          
          Status code distribution:
            [500] 20716 responses

          Reference

          For more information, please refer to the KPA documentation.

          Mar 7, 2024

          Subsections of Knative

          Subsections of Eventing

          Broker

The Knative Broker is the core component of the Knative Eventing system. Its main role is to act as the hub for event routing and distribution, providing decoupled, reliable event delivery between event producers (event sources) and event consumers (services).

The key roles of the Knative Broker are detailed below:

Event ingestion hub:

The Broker is the entry point where event streams converge. All kinds of event sources (such as Kafka topics, HTTP sources, Cloud Pub/Sub, GitHub webhooks, timers, custom sources, etc.) send their events to the Broker.

Event producers only need to know the Broker's address; they do not need to care which consumers exist or where they run.

Event storage and buffering:

A Broker is usually backed by a persistent messaging system (such as Apache Kafka, Google Cloud Pub/Sub, RabbitMQ, NATS Streaming, or the in-memory InMemoryChannel). This provides:

Persistence: events are not lost before consumers process them (depending on the underlying channel implementation).

Buffering: when consumers are temporarily unavailable or cannot keep up with the rate of event production, the Broker buffers events, preventing event loss and protecting producers/consumers from being overwhelmed.

Retries: if a consumer fails to process an event, the Broker can redeliver it (usually in combination with the retry policies of Triggers and Subscriptions).

Decoupling event sources from event consumers:

This is one of the most important roles of the Broker. Event sources are only responsible for sending events to the Broker and have no knowledge of which services will consume them.

Event consumers declare which events they are interested in by creating Triggers against the Broker. Consumers only need to know that the Broker exists; they do not need to know which specific source produced an event.

This decoupling greatly improves the flexibility and maintainability of the system:

Independent evolution: event sources or consumers can be added, removed, or modified independently, as long as they follow the Broker's contract.

Dynamic routing: events are routed to different consumers dynamically based on event attributes (such as type and source), without changing producer or consumer code.

Multicast: the same event can be consumed by multiple different consumers at the same time (one event -> Broker -> multiple matching Triggers -> multiple services).

Event filtering and routing (via Triggers):

The Broker itself does not handle complex filtering logic directly. Filtering and routing are implemented by Trigger resources.

A Trigger resource is bound to a specific Broker.

A Trigger defines:

Subscriber: the address of the target service (Knative Service, Kubernetes Service, Channel, etc.).

Filter: a condition expression based on event attributes (mainly type and source, plus other extensible attributes). Only events that satisfy the condition are routed by the Broker through that Trigger to the corresponding subscriber.

After the Broker receives an event, it checks the filters of all Triggers bound to it. For every matching Trigger, the Broker sends the event to the subscriber specified by that Trigger.
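As a minimal sketch of that contract (the broker name, event type, and subscriber service below are hypothetical):

kubectl apply -f - <<EOF
apiVersion: eventing.knative.dev/v1
kind: Trigger
metadata:
  name: my-trigger                # hypothetical name
  namespace: default
spec:
  broker: my-broker               # hypothetical Broker in the same namespace
  filter:
    attributes:
      type: my.event.type         # only events with this CloudEvents type are delivered
  subscriber:
    ref:
      apiVersion: serving.knative.dev/v1
      kind: Service
      name: my-consumer           # hypothetical Knative Service
EOF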

Providing a standard event interface:

The Broker follows the CloudEvents specification; the events it receives and delivers are all in CloudEvents format. This provides a unified format standard for events from different sources and for different consumers, which simplifies integration.

Multi-tenancy and namespace isolation:

Brokers are usually deployed in a specific Kubernetes namespace, and multiple Brokers can be created within one namespace.

This makes it possible to isolate event streams for different teams, applications, or environments (e.g. dev, staging) within the same cluster. Each team/application manages the Brokers and Triggers in its own namespace.

A summary analogy:

You can think of the Knative Broker as a highly intelligent postal sorting center:

Receiving mail (events): letters (events) from all over the world (different event sources) are mailed to the sorting center (Broker).

Storing mail: the sorting center has a warehouse (persistence/buffering) to temporarily hold letters, making sure they are kept safe and not lost.

Sorting rules (Triggers): the sorting center employs many sorters (Triggers). Each sorter is responsible for letters of a specific type or from a specific region (filtering based on event attributes).

Delivering mail: when a sorter (Trigger) finds letters (events) that match its rules, it delivers them to the right recipient's (subscriber service's) doorstep.

Decoupling: the sender (event source) only needs to know the address of the sorting center (Broker) and does not need to know who or where the recipients (consumers) are. A recipient (consumer) only needs to give its address to the sorter responsible for its kind of mail (by creating a Trigger) and does not need to care who sent the letters. The sorting center (Broker) and the sorters (Triggers) handle the complex routing in between.

The core value a Broker brings:

Loose coupling: completely decouples event producers and consumers.

Flexibility: consumers can be added or removed dynamically, and routing rules can be changed dynamically (by modifying/creating/deleting Triggers).

Reliability: provides event persistence and retry mechanisms (depending on the underlying implementation).

Scalability: the Broker and consumers can each scale independently.

Standardization: based on CloudEvents.

Simplified development: developers focus on business logic (producing or consuming events) without having to build complex event-bus infrastructure themselves.

          Mar 7, 2024

          Subsections of Broker

          Install Kafka Broker

          About

Figure: broker event flow (Source → Broker → Trigger → Sink).

• Source: curl, KafkaSource
          • Broker
          • Trigger
          • Sink: ksvc, isvc

          Install a Channel (messaging) layer

          kubectl apply -f https://github.com/knative-extensions/eventing-kafka-broker/releases/download/knative-v1.18.0/eventing-kafka-controller.yaml
Expected Output
          configmap/kafka-broker-config created
          configmap/kafka-channel-config created
          customresourcedefinition.apiextensions.k8s.io/kafkachannels.messaging.knative.dev created
          customresourcedefinition.apiextensions.k8s.io/consumers.internal.kafka.eventing.knative.dev created
          customresourcedefinition.apiextensions.k8s.io/consumergroups.internal.kafka.eventing.knative.dev created
          customresourcedefinition.apiextensions.k8s.io/kafkasinks.eventing.knative.dev created
          customresourcedefinition.apiextensions.k8s.io/kafkasources.sources.knative.dev created
          clusterrole.rbac.authorization.k8s.io/eventing-kafka-source-observer created
          configmap/config-kafka-source-defaults created
          configmap/config-kafka-autoscaler created
          configmap/config-kafka-features created
          configmap/config-kafka-leader-election created
          configmap/kafka-config-logging created
          configmap/config-namespaced-broker-resources created
          configmap/config-tracing configured
          clusterrole.rbac.authorization.k8s.io/knative-kafka-addressable-resolver created
          clusterrole.rbac.authorization.k8s.io/knative-kafka-channelable-manipulator created
          clusterrole.rbac.authorization.k8s.io/kafka-controller created
          serviceaccount/kafka-controller created
          clusterrolebinding.rbac.authorization.k8s.io/kafka-controller created
          clusterrolebinding.rbac.authorization.k8s.io/kafka-controller-addressable-resolver created
          deployment.apps/kafka-controller created
          clusterrole.rbac.authorization.k8s.io/kafka-webhook-eventing created
          serviceaccount/kafka-webhook-eventing created
          clusterrolebinding.rbac.authorization.k8s.io/kafka-webhook-eventing created
          mutatingwebhookconfiguration.admissionregistration.k8s.io/defaulting.webhook.kafka.eventing.knative.dev created
          mutatingwebhookconfiguration.admissionregistration.k8s.io/pods.defaulting.webhook.kafka.eventing.knative.dev created
          secret/kafka-webhook-eventing-certs created
          validatingwebhookconfiguration.admissionregistration.k8s.io/validation.webhook.kafka.eventing.knative.dev created
          deployment.apps/kafka-webhook-eventing created
          service/kafka-webhook-eventing created
          kubectl apply -f https://github.com/knative-extensions/eventing-kafka-broker/releases/download/knative-v1.18.0/eventing-kafka-channel.yaml
Expected Output
          configmap/config-kafka-channel-data-plane created
          clusterrole.rbac.authorization.k8s.io/knative-kafka-channel-data-plane created
          serviceaccount/knative-kafka-channel-data-plane created
          clusterrolebinding.rbac.authorization.k8s.io/knative-kafka-channel-data-plane created
          statefulset.apps/kafka-channel-dispatcher created
          deployment.apps/kafka-channel-receiver created
          service/kafka-channel-ingress created

          Install a Broker layer

          kubectl apply -f https://github.com/knative-extensions/eventing-kafka-broker/releases/download/knative-v1.18.0/eventing-kafka-broker.yaml
Expected Output
          configmap/config-kafka-broker-data-plane created
          clusterrole.rbac.authorization.k8s.io/knative-kafka-broker-data-plane created
          serviceaccount/knative-kafka-broker-data-plane created
          clusterrolebinding.rbac.authorization.k8s.io/knative-kafka-broker-data-plane created
          statefulset.apps/kafka-broker-dispatcher created
          deployment.apps/kafka-broker-receiver created
          service/kafka-broker-ingress created
          Reference
If you cannot find kafka-channel-dispatcher

please check the StatefulSets:

          root@ay-k3s01:~# kubectl -n knative-eventing  get sts
          NAME                       READY   AGE
          kafka-broker-dispatcher    1/1     19m
          kafka-channel-dispatcher   0/0     22m

Here the kafka-channel-dispatcher StatefulSet has 0 replicas; scale it up if you need the channel dispatcher.
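If you do need the channel dispatcher running, a manual scale-up along these lines can bring up a pod (a sketch; note that the Kafka controller may manage this replica count itself):

kubectl -n knative-eventing scale statefulset kafka-channel-dispatcher --replicas=1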

          [Optional] Install Eventing extensions

          • kafka sink
          kubectl apply -f https://github.com/knative-extensions/eventing-kafka-broker/releases/download/knative-v1.18.0/eventing-kafka-sink.yaml
          Reference

          for more information, you can check 🔗https://knative.dev/docs/eventing/sinks/kafka-sink/

          • kafka source
          kubectl apply -f https://github.com/knative-extensions/eventing-kafka-broker/releases/download/knative-v1.18.0/eventing-kafka-source.yaml
          Reference

          for more information, you can check 🔗https://knative.dev/docs/eventing/sources/kafka-source/

          Mar 7, 2024

          Display Broker Message

          Flow

          flowchart LR
              A[Curl] -->|HTTP| B{Broker}
              B -->|Subscribe| D[Trigger1]
              B -->|Subscribe| E[Trigger2]
              B -->|Subscribe| F[Trigger3]
              E --> G[Display Service]

Steps

          1. Create Broker Setting

          kubectl apply -f - <<EOF
          apiVersion: v1
          kind: ConfigMap
          metadata:
            name: kafka-broker-config
            namespace: knative-eventing
          data:
            default.topic.partitions: "10"
            default.topic.replication.factor: "1"
            bootstrap.servers: "kafka.database.svc.cluster.local:9092" #kafka service address
            default.topic.config.retention.ms: "3600"
          EOF

          2. Create Broker

          kubectl apply -f - <<EOF
          apiVersion: eventing.knative.dev/v1
          kind: Broker
          metadata:
            annotations:
              eventing.knative.dev/broker.class: Kafka
            name: first-broker
            namespace: kserve-test
          spec:
            config:
              apiVersion: v1
              kind: ConfigMap
              name: kafka-broker-config
              namespace: knative-eventing
          EOF

(Optionally, a delivery.deadLetterSink can be configured on the Broker to catch undeliverable events; see the isvc-broker example in the next section.)

          3. Create Trigger

          kubectl apply -f - <<EOF
          apiVersion: eventing.knative.dev/v1
          kind: Trigger
          metadata:
            name: display-service-trigger
            namespace: kserve-test
          spec:
            broker: first-broker
            subscriber:
              ref:
                apiVersion: serving.knative.dev/v1
                kind: Service
                name: event-display
          EOF

          4. Create Sink Service (Display Message)

          kubectl apply -f - <<EOF
          apiVersion: serving.knative.dev/v1
          kind: Service
          metadata:
            name: event-display
            namespace: kserve-test
          spec:
            template:
              spec:
                containers:
                  - image: gcr.io/knative-releases/knative.dev/eventing/cmd/event_display
          EOF

          5. Test

          kubectl run curl-test --image=curlimages/curl -it --rm --restart=Never -- \
            -v "http://kafka-broker-ingress.knative-eventing.svc.cluster.local/kserve-test/first-broker" \
            -X POST \
            -H "Ce-Id: $(date +%s)" \
            -H "Ce-Specversion: 1.0" \
            -H "Ce-Type: test.type" \
            -H "Ce-Source: curl-test" \
            -H "Content-Type: application/json" \
            -d '{"test": "Broker is working"}'

          6. Check message

          kubectl -n kserve-test logs -f deploy/event-display-00001-deployment 
Expected Output
          2025/07/02 09:01:25 Failed to read tracing config, using the no-op default: empty json tracing config
          ☁️  cloudevents.Event
          Context Attributes,
            specversion: 1.0
            type: test.type
            source: curl-test
            id: 1751446880
            datacontenttype: application/json
          Extensions,
            knativekafkaoffset: 6
            knativekafkapartition: 6
          Data,
            {
              "test": "Broker is working"
            }
          Mar 7, 2024

          Kafka Broker Invoke ISVC

          1. Prepare RBAC

          • create cluster role to access CRD isvc
          kubectl apply -f - <<EOF
          apiVersion: rbac.authorization.k8s.io/v1
          kind: ClusterRole
          metadata:
            name: kserve-access-for-knative
          rules:
          - apiGroups: ["serving.kserve.io"]
            resources: ["inferenceservices", "inferenceservices/status"]
            verbs: ["get", "list", "watch"]
          EOF
          • create rolebinding and grant privileges
          kubectl apply -f - <<EOF
          apiVersion: rbac.authorization.k8s.io/v1
          kind: ClusterRoleBinding
          metadata:
            name: kafka-controller-kserve-access
          roleRef:
            apiGroup: rbac.authorization.k8s.io
            kind: ClusterRole
            name: kserve-access-for-knative
          subjects:
          - kind: ServiceAccount
            name: kafka-controller
            namespace: knative-eventing
          EOF
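
To sanity-check the binding, you can impersonate the service account with kubectl auth can-i (a quick check; it requires impersonation rights, e.g. cluster-admin):

kubectl auth can-i get inferenceservices.serving.kserve.io \
  --as=system:serviceaccount:knative-eventing:kafka-controller -n kserve-test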

          2. Create Broker Setting

          kubectl apply -f - <<EOF
          apiVersion: v1
          kind: ConfigMap
          metadata:
            name: kafka-broker-config
            namespace: knative-eventing
          data:
            default.topic.partitions: "10"
            default.topic.replication.factor: "1"
            bootstrap.servers: "kafka.database.svc.cluster.local:9092" #kafka service address
            default.topic.config.retention.ms: "3600"
          EOF

          3. Create Broker

          kubectl apply -f - <<EOF
          apiVersion: eventing.knative.dev/v1
          kind: Broker
          metadata:
            annotations:
              eventing.knative.dev/broker.class: Kafka
            name: isvc-broker
            namespace: kserve-test
          spec:
            config:
              apiVersion: v1
              kind: ConfigMap
              name: kafka-broker-config
              namespace: knative-eventing
            delivery:
              deadLetterSink:
                ref:
                  apiVersion: serving.knative.dev/v1
                  kind: Service
                  name: event-display
          EOF

          4. Create InferenceService

          Reference

You can create the first-torchserve InferenceService by following this 🔗link; a minimal sketch is also shown below.
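
A minimal sketch (assuming the same TorchServe MNIST model used elsewhere on this page; adjust the storageUri if yours differs):

kubectl apply -f - <<EOF
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: first-torchserve
  namespace: kserve-test
spec:
  predictor:
    model:
      modelFormat:
        name: pytorch
      storageUri: gs://kfserving-examples/models/torchserve/image_classifier/v1
EOF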

          5. Create Trigger

          kubectl apply -f - << EOF
          apiVersion: eventing.knative.dev/v1
          kind: Trigger
          metadata:
            name: kserve-trigger
            namespace: kserve-test
          spec:
            broker: isvc-broker
            filter:
              attributes:
                type: prediction-request
            subscriber:
              uri: http://first-torchserve.kserve-test.svc.cluster.local/v1/models/mnist:predict
          EOF

          6. Test

Normally, we can invoke first-torchserve by executing

          export MASTER_IP=192.168.100.112
          export ISTIO_INGRESS_PORT=$(kubectl -n istio-system get service istio-ingressgateway -o jsonpath='{.spec.ports[?(@.name=="http2")].nodePort}')
          export SERVICE_HOSTNAME=$(kubectl -n kserve-test get inferenceservice first-torchserve  -o jsonpath='{.status.url}' | cut -d "/" -f 3)
          # http://first-torchserve.kserve-test.example.com 
          curl -v -H "Host: ${SERVICE_HOSTNAME}" -H "Content-Type: application/json" "http://${MASTER_IP}:${ISTIO_INGRESS_PORT}/v1/models/mnist:predict" -d @./mnist-input.json

          Now, you can access model by executing

          export KAFKA_BROKER_INGRESS_PORT=$(kubectl -n knative-eventing get service kafka-broker-ingress -o jsonpath='{.spec.ports[?(@.name=="http-container")].nodePort}')
          curl -v "http://${MASTER_IP}:${KAFKA_BROKER_INGRESS_PORT}/kserve-test/isvc-broker" \
            -X POST \
            -H "Ce-Id: $(date +%s)" \
            -H "Ce-Specversion: 1.0" \
            -H "Ce-Type: prediction-request" \
            -H "Ce-Source: event-producer" \
            -H "Content-Type: application/json" \
            -d @./mnist-input.json 
if you cannot see the prediction result

please check Kafka

          # list all topics, find suffix is `isvc-broker` -> knative-broker-kserve-test-isvc-broker
          kubectl -n database exec -it deployment/kafka-client-tools -- bash -c \
              'kafka-topics.sh --bootstrap-server $BOOTSTRAP_SERVER --command-config $CLIENT_CONFIG_FILE --list'
          # retrieve msg from that topic
          kubectl -n database exec -it deployment/kafka-client-tools -- bash -c \
            'kafka-console-consumer.sh --bootstrap-server $BOOTSTRAP_SERVER --consumer.config $CLIENT_CONFIG_FILE --topic knative-broker-kserve-test-isvc-broker --from-beginning'

          And then, you could see

          {
              "instances": [
                  {
                      "data": "iVBORw0KGgoAAAANSUhEUgAAABwAAAAcCAAAAABXZoBIAAAAw0lEQVR4nGNgGFggVVj4/y8Q2GOR83n+58/fP0DwcSqmpNN7oOTJw6f+/H2pjUU2JCSEk0EWqN0cl828e/FIxvz9/9cCh1zS5z9/G9mwyzl/+PNnKQ45nyNAr9ThMHQ/UG4tDofuB4bQIhz6fIBenMWJQ+7Vn7+zeLCbKXv6z59NOPQVgsIcW4QA9YFi6wNQLrKwsBebW/68DJ388Nun5XFocrqvIFH59+XhBAxThTfeB0r+vP/QHbuDCgr2JmOXoSsAAKK7bU3vISS4AAAAAElFTkSuQmCC"
                  }
              ]
          }
          {
              "predictions": [
                  2
              ]
          }
          Mar 7, 2024

          Subsections of Plugin

          Subsections of Eventing Kafka Broker

          Prepare Dev Environment

          1. update go -> 1.24

2. install ko -> 0.18.0

          go install github.com/google/ko@latest
          # wget https://github.com/ko-build/ko/releases/download/v0.18.0/ko_0.18.0_Linux_x86_64.tar.gz
          # tar -xzf ko_0.18.0_Linux_x86_64.tar.gz  -C /usr/local/bin/ko
          # cp /usr/local/bin/ko/ko /root/bin
3. install protoc -> 30.2
          PB_REL="https://github.com/protocolbuffers/protobuf/releases"
          curl -LO $PB_REL/download/v30.2/protoc-30.2-linux-x86_64.zip
          # mkdir -p ${HOME}/bin/
          mkdir -p /usr/local/bin/protoc
          unzip protoc-30.2-linux-x86_64.zip -d /usr/local/bin/protoc
          cp /usr/local/bin/protoc/bin/protoc /root/bin
          # export PATH="$PATH:/root/bin"
          rm -rf protoc-30.2-linux-x86_64.zip
4. install protoc-gen-go -> 1.5.4
          go install google.golang.org/protobuf/cmd/protoc-gen-go@latest
          export GOPATH=/usr/local/go/bin
5. clone the source code
          mkdir -p ${GOPATH}/src/knative.dev
          cd ${GOPATH}/src/knative.dev
          git clone git@github.com:knative/eventing.git # clone eventing repo
          git clone git@github.com:AaronYang0628/eventing-kafka-broker.git
          cd eventing-kafka-broker
          git remote add upstream https://github.com/knative-extensions/eventing-kafka-broker.git
          git remote set-url --push upstream no_push
          export KO_DOCKER_REPO=docker-registry.lab.zverse.space/data-and-computing/ay-dev
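
With KO_DOCKER_REPO set, ko can build and push an image straight from a Go main package; a quick sketch (the package path below is only illustrative, not a path from this repo):

ko build ./cmd/some-component --bare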
          Mar 7, 2024

Build Async Prediction Flow

          Flow

          flowchart LR
              A[User Curl] -->|HTTP| B{ISVC-Broker:Kafka}
              B -->|Subscribe| D[Trigger1]
    B -->|Subscribe| E[Kserve-Trigger]
              B -->|Subscribe| F[Trigger3]
              E --> G[Mnist Service]
              G --> |Kafka-Sink| B

Steps

          1. Create Broker Setting

          kubectl apply -f - <<EOF
          apiVersion: v1
          kind: ConfigMap
          metadata:
            name: kafka-broker-config
            namespace: knative-eventing
          data:
            default.topic.partitions: "10"
            default.topic.replication.factor: "1"
            bootstrap.servers: "kafka.database.svc.cluster.local:9092" #kafka service address
            default.topic.config.retention.ms: "3600"
          EOF

          2. Create Broker

          kubectl apply -f - <<EOF
          apiVersion: eventing.knative.dev/v1
          kind: Broker
          metadata:
            annotations:
              eventing.knative.dev/broker.class: Kafka
            name: isvc-broker
            namespace: kserve-test
          spec:
            config:
              apiVersion: v1
              kind: ConfigMap
              name: kafka-broker-config
              namespace: knative-eventing
          EOF

          3. Create Trigger

          kubectl apply -f - << EOF
          apiVersion: eventing.knative.dev/v1
          kind: Trigger
          metadata:
            name: kserve-trigger
            namespace: kserve-test
          spec:
            broker: isvc-broker
            filter:
              attributes:
                type: prediction-request-udf-attr # you can change this
            subscriber:
              uri: http://prediction-and-sink.kserve-test.svc.cluster.local/v1/models/mnist:predict
          EOF

          4. Create InferenceService

kubectl apply -f - <<EOF
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: prediction-and-sink
  namespace: kserve-test
spec:
  predictor:
    model:
      modelFormat:
        name: pytorch
      storageUri: gs://kfserving-examples/models/torchserve/image_classifier/v1
  transformer:
    containers:
      - image: docker-registry.lab.zverse.space/data-and-computing/ay-dev/msg-transformer:dev9
        name: kserve-container
        env:
        - name: KAFKA_BOOTSTRAP_SERVERS
          value: kafka.database.svc.cluster.local
        - name: KAFKA_TOPIC
          value: test-topic # result will be saved in this topic
        - name: REQUEST_TRACE_KEY
          value: test-trace-id # use this key to retrieve the prediction result
        command:
          - "python"
          - "-m"
          - "model"
        args:
          - --model_name
          - mnist
EOF
Expected Output
          root@ay-k3s01:~# kubectl -n kserve-test get pod
          NAME                                                              READY   STATUS    RESTARTS   AGE
          prediction-and-sink-predictor-00001-deployment-f64bb76f-jqv4m     2/2     Running   0          3m46s
          prediction-and-sink-transformer-00001-deployment-76cccd867lksg9   2/2     Running   0          4m3s

          Source code of the docker-registry.lab.zverse.space/data-and-computing/ay-dev/msg-transformer:dev9 could be found 🔗here

          [Optional] 5. Invoke InferenceService Directly

          • preparation
          wget -O ./mnist-input.json https://raw.githubusercontent.com/kserve/kserve/refs/heads/master/docs/samples/v1beta1/torchserve/v1/imgconv/input.json
          SERVICE_NAME=prediction-and-sink
          MODEL_NAME=mnist
          INPUT_PATH=@./mnist-input.json
          PLAIN_SERVICE_HOSTNAME=$(kubectl -n kserve-test get inferenceservice $SERVICE_NAME -o jsonpath='{.status.url}' | cut -d "/" -f 3)
          • fire!!
          export INGRESS_HOST=192.168.100.112
          export INGRESS_PORT=$(kubectl -n istio-system get service istio-ingressgateway -o jsonpath='{.spec.ports[?(@.name=="http2")].nodePort}')
          curl -v -H "Host: ${PLAIN_SERVICE_HOSTNAME}" -H "Content-Type: application/json" -d $INPUT_PATH http://${INGRESS_HOST}:${INGRESS_PORT}/v1/models/$MODEL_NAME:predict
Expected Output
          curl -v -H "Host: ${PLAIN_SERVICE_HOSTNAME}" -H "Content-Type: application/json" -d $INPUT_PATH http://${INGRESS_HOST}:${INGRESS_PORT}/v1/models/$MODEL_NAME:predict
          *   Trying 192.168.100.112:31855...
          * Connected to 192.168.100.112 (192.168.100.112) port 31855
          > POST /v1/models/mnist:predict HTTP/1.1
          > Host: prediction-and-sink.kserve-test.ay.test.dev
          > User-Agent: curl/8.5.0
          > Accept: */*
          > Content-Type: application/json
          > Content-Length: 401
          > 
          < HTTP/1.1 200 OK
          < content-length: 19
          < content-type: application/json
          < date: Wed, 02 Jul 2025 08:55:05 GMT,Wed, 02 Jul 2025 08:55:04 GMT
          < server: istio-envoy
          < x-envoy-upstream-service-time: 209
          < 
          * Connection #0 to host 192.168.100.112 left intact
          {"predictions":[2]}

          6. Invoke Broker

          • preparation
          cat > image-with-trace-id.json << EOF
          {
              "test-trace-id": "16ec3446-48d6-422e-9926-8224853e84a7",
              "instances": [
                  {
                      "data": "iVBORw0KGgoAAAANSUhEUgAAABwAAAAcCAAAAABXZoBIAAAAw0lEQVR4nGNgGFggVVj4/y8Q2GOR83n+58/fP0DwcSqmpNN7oOTJw6f+/H2pjUU2JCSEk0EWqN0cl828e/FIxvz9/9cCh1zS5z9/G9mwyzl/+PNnKQ45nyNAr9ThMHQ/UG4tDofuB4bQIhz6fIBenMWJQ+7Vn7+zeLCbKXv6z59NOPQVgsIcW4QA9YFi6wNQLrKwsBebW/68DJ388Nun5XFocrqvIFH59+XhBAxThTfeB0r+vP/QHbuDCgr2JmOXoSsAAKK7bU3vISS4AAAAAElFTkSuQmCC"
                  }
              ]
          }
          EOF
          • fire!!
          export MASTER_IP=192.168.100.112
          export KAFKA_BROKER_INGRESS_PORT=$(kubectl -n knative-eventing get service kafka-broker-ingress -o jsonpath='{.spec.ports[?(@.name=="http-container")].nodePort}')
          
          curl -v "http://${MASTER_IP}:${KAFKA_BROKER_INGRESS_PORT}/kserve-test/isvc-broker" \
            -X POST \
            -H "Ce-Id: $(date +%s)" \
            -H "Ce-Specversion: 1.0" \
            -H "Ce-Type: prediction-request-udf-attr" \
            -H "Ce-Source: event-producer" \
            -H "Content-Type: application/json" \
            -d @./image-with-trace-id.json 
          • check input data in kafka topic knative-broker-kserve-test-isvc-broker
          kubectl -n database exec -it deployment/kafka-client-tools -- bash -c \
            'kafka-console-consumer.sh --bootstrap-server $BOOTSTRAP_SERVER --consumer.config $CLIENT_CONFIG_FILE --topic knative-broker-kserve-test-isvc-broker --from-beginning'
Expected Output
          {
              "test-trace-id": "16ec3446-48d6-422e-9926-8224853e84a7",
              "instances": [
              {
                  "data": "iVBORw0KGgoAAAANSUhEUgAAABwAAAAcCAAAAABXZoBIAAAAw0lEQVR4nGNgGFggVVj4/y8Q2GOR83n+58/fP0DwcSqmpNN7oOTJw6f+/H2pjUU2JCSEk0EWqN0cl828e/FIxvz9/9cCh1zS5z9/G9mwyzl/+PNnKQ45nyNAr9ThMHQ/UG4tDofuB4bQIhz6fIBenMWJQ+7Vn7+zeLCbKXv6z59NOPQVgsIcW4QA9YFi6wNQLrKwsBebW/68DJ388Nun5XFocrqvIFH59+XhBAxThTfeB0r+vP/QHbuDCgr2JmOXoSsAAKK7bU3vISS4AAAAAElFTkSuQmCC"
              }]
          }
          {
              "predictions": [2] // result will be saved in this topic as well
          }
          • check response result in kafka topic test-topic
          kubectl -n database exec -it deployment/kafka-client-tools -- bash -c \
            'kafka-console-consumer.sh --bootstrap-server $BOOTSTRAP_SERVER --consumer.config $CLIENT_CONFIG_FILE --topic test-topic --from-beginning'

{
    "specversion": "1.0",
    "id": "822e3115-0185-4752-9967-f408dda72004",
    "source": "data-and-computing/kafka-sink-transformer",
    "type": "org.zhejianglab.zverse.data-and-computing.kafka-sink-transformer",
    "time": "2025-07-02T08:57:04.133497+00:00",
    "data":
    {
        "predictions": [2]
    },
    "request-host": "prediction-and-sink-transformer.kserve-test.svc.cluster.local",
    "kserve-isvc-name": "prediction-and-sink",
    "kserve-isvc-namespace": "kserve-test",
    "test-trace-id": "16ec3446-48d6-422e-9926-8224853e84a7"
}
Use the test-trace-id field to correlate the prediction result with the original request, as sketched below.
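
For example, you can filter the consumer output by that trace id (a sketch, assuming each record is emitted as a single JSON line):

kubectl -n database exec -it deployment/kafka-client-tools -- bash -c \
  'kafka-console-consumer.sh --bootstrap-server $BOOTSTRAP_SERVER --consumer.config $CLIENT_CONFIG_FILE --topic test-topic --from-beginning | grep "16ec3446-48d6-422e-9926-8224853e84a7"'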

          Mar 7, 2024

          Subsections of 🏗️Linux

          Cheatsheet

          useradd

          sudo useradd <$name> -m -r -s /bin/bash -p <$password>
add as sudoer
          echo '<$name> ALL=(ALL) NOPASSWD: ALL' >> /etc/sudoers
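
for example (a sketch; the user name and password are placeholders, and -p expects an already-encrypted password, so generate one with openssl)

sudo useradd deploy -m -r -s /bin/bash -p "$(openssl passwd -6 'changeme')"
echo 'deploy ALL=(ALL) NOPASSWD: ALL' >> /etc/sudoers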

          telnet

a command line interface for communication with a remote device or server

          telnet <$ip> <$port>
          for example
          telnet 172.27.253.50 9000 #test application connectivity

lsof (list open files)

          everything is a file

          lsof <$option:value>
          for example

          -a List processes that have open files

          -c <process_name> List files opened by the specified process

          -g List GID number process details

-d <file_number> List processes using the given file descriptor number

+d <directory> List open files in a directory

+D <directory> Recursively list open files in a directory

          -n List files using NFS

          -i List eligible processes. (protocol, :port, @ip)

          -p List files opened by the specified process ID

          -u List UID number process details

          lsof -i:30443 # find port 30443 
          lsof -i -P -n # list all connections

          awk (Aho, Weinberger, and Kernighan [Names])

          awk is a scripting language used for manipulating data and generating reports.

          # awk [params] 'script' 
          awk <$params> <$string_content>
          for example

print lines where the first field is bigger than 3

          echo -e "1\n2\n3\n4\n5\n" | awk '$1>3'
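
another quick sketch: sum the second column of whitespace-separated input

echo -e "a 1\nb 2\nc 3" | awk '{ sum += $2 } END { print sum }' # prints 6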


          ss (socket statistics)

          view detailed information about your system’s network connections, including TCP/IP, UDP, and Unix domain sockets

          ss [options]
          for example
Option    Description
-t        Display TCP sockets
-l        Display listening sockets
-n        Show numerical addresses instead of resolving
-a        Display all sockets (listening and non-listening)
          #show all listening TCP connection
          ss -tln
          #show all established TCP connections
          ss -tan

clean files older than 3 days

          find /aaa/bbb/ccc/*.gz -mtime +3 -exec rm {} \;
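
to preview what would be deleted before running the removal, list the matches first

find /aaa/bbb/ccc/*.gz -mtime +3 -print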

          ssh without affect $HOME/.ssh/known_hosts

          ssh -o "UserKnownHostsFile /dev/null" root@aaa.domain.com
          ssh -o "UserKnownHostsFile /dev/null" -o "StrictHostKeyChecking=no" root@aaa.domain.com

          sync clock

          [yum|dnf] install -y chrony \
              && systemctl enable chronyd \
              && (systemctl is-active chronyd || systemctl start chronyd) \
              && chronyc sources \
              && chronyc tracking \
              && timedatectl set-timezone 'Asia/Shanghai'

          set hostname

          hostnamectl set-hostname develop

          add remote key to other server

          ssh -o "UserKnownHostsFile /dev/null" \
              root@aaa.bbb.ccc \
              "mkdir -p /root/.ssh && chmod 700 /root/.ssh && echo '$SOME_PUBLIC_KEY' \
              >> /root/.ssh/authorized_keys && chmod 600 /root/.ssh/authorized_keys"
          for example
          ssh -o "UserKnownHostsFile /dev/null" \
              root@17.27.253.67 \
              "mkdir -p /root/.ssh && chmod 700 /root/.ssh && echo 'ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAABgQC00JLKF/Cd//rJcdIVGCX3ePo89KAgEccvJe4TEHs5pI5FSxs/7/JfQKZ+by2puC3IT88bo/d7nStw9PR3BXgqFXaBCknNBpSLWBIuvfBF+bcL+jGnQYo2kPjrO+2186C5zKGuPRi9sxLI5AkamGB39L5SGqwe5bbKq2x/8OjUP25AlTd99XsNjEY2uxNVClHysExVad/ZAcl0UVzG5xmllusXCsZVz9HlPExqB6K1sfMYWvLVgSCChx6nUfgg/NZrn/kQG26X0WdtXVM2aXpbAtBioML4rWidsByDb131NqYpJF7f+x3+I5pQ66Qpc72FW1G4mUiWWiGhF9tL8V9o1AY96Rqz0AVaxAQrBEuyCWKrXbA97HeC3Xp57Luvlv9TqUd8CIJYq+QTL0hlIDrzK9rJsg34FRAvf9sh8K2w/T/gC9UnRjRXgkPUgKldq35Y6Z9wP6KY45gCXka1PU4nVqb6wicO+RHcZ5E4sreUwqfTypt5nTOgW2/p8iFhdN8= Administrator@AARON-X1-8TH' \
              >> /root/.ssh/authorized_keys && chmod 600 /root/.ssh/authorized_keys"

          set -x

          This will print each command to the standard error before executing it, which is useful for debugging scripts.

          set -x

          set -e

          Exit immediately if a command exits with a non-zero status.

set -e
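
a tiny sketch of the difference (hypothetical script)

set -e
false          # the script exits here because false returns a non-zero status
echo "never reached"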

          sed (Stream Editor)

          sed <$option> <$file_path>
          for example

          replace unix -> linux

          echo "linux is great os. unix is opensource. unix is free os." | sed 's/unix/linux/'

          or you can check https://www.geeksforgeeks.org/sed-command-in-linux-unix-with-examples/

          fdisk

          list all disk

          fdisk -l

create XFS file system

          Use mkfs.xfs command to create xfs file system and internal log on the same disk, example is shown below:

          mkfs.xfs <$path>

          modprobe

          program to add and remove modules from the Linux Kernel

          modprobe nfs && modprobe nfsd
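
verify the modules are loaded

lsmod | grep nfs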

          disown

          disown command in Linux is used to remove jobs from the job table.

          disown [options] jobID1 jobID2 ... jobIDN
          for example

          for example, there is a job running in the background

          ping google.com > /dev/null &

using jobs -l to list all running jobs

jobs -l

using disown -a to remove all jobs from the job table

          disown -a

          using disown %2 to remove the #2 job

          disown %2

          generate SSH key

          ssh-keygen -t rsa -b 4096 -C "aaron19940628@gmail.com"
          sudo ln -sf <$install_path>/bin/* /usr/local/bin

          append dir into $PATH (temporary)

          export PATH="/root/bin:$PATH"
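
to make it persistent for future shells (a sketch, assuming bash)

echo 'export PATH="/root/bin:$PATH"' >> ~/.bashrc
source ~/.bashrc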

          copy public key to ECS

          ssh-copy-id -i ~/.ssh/id_rsa.pub root@10.200.60.53

set DNS nameservers

echo "nameserver 8.8.8.8" | sudo tee /etc/resolv.conf
echo "nameserver 8.8.4.4" | sudo tee -a /etc/resolv.conf

          Mar 12, 2024

          Subsections of Command

          Echo

In a Windows batch file (using the ECHO command)

ECHO content to write > filename.txt
ECHO content to append >> filename.txt

In a Linux/macOS shell

echo "content to write" > filename.txt
echo "content to append" >> filename.txt

In Python

# write to a file (overwrite)
with open('filename.txt', 'w', encoding='utf-8') as f:
    f.write("content to write\n")

# append content
with open('filename.txt', 'a', encoding='utf-8') as f:
    f.write("content to append\n")

In PowerShell

"content to write" | Out-File -FilePath filename.txt
"content to append" | Out-File -FilePath filename.txt -Append

In JavaScript (Node.js)

const fs = require('fs');

// write to a file (overwrite)
fs.writeFileSync('filename.txt', 'content to write\n');

// append content
fs.appendFileSync('filename.txt', 'content to append\n');
          Sep 7, 2025

          Grep

grep is a powerful text-search tool on Linux; its name comes from "Global Regular Expression Print". Common usages of the grep command are listed below:

Basic syntax

grep [options] pattern [file...]

Common options

1. Basic search

# search a file for lines containing "error"
grep "error" filename.log

# ignore case while searching
grep -i "error" filename.log

# show lines that do NOT match
grep -v "success" filename.log

# show line numbers of matches
grep -n "pattern" filename.txt

2. Recursive search

# search recursively in the current directory and subdirectories
grep -r "function_name" .

# search recursively and only show file names
grep -r -l "text" /path/to/directory

3. Output control

# only show the names of matching files (not the matching lines)
grep -l "pattern" *.txt

# show context around matching lines
grep -A 3 "error" logfile.txt    # 3 lines after the match
grep -B 2 "error" logfile.txt    # 2 lines before the match
grep -C 2 "error" logfile.txt    # 2 lines before and after the match

# only show the matching part (not the whole line)
grep -o "pattern" file.txt

4. Regular expressions

# use extended regular expressions
grep -E "pattern1|pattern2" file.txt

# match lines starting with "start"
grep "^start" file.txt

# match lines ending with "end"
grep "end$" file.txt

# match empty lines
grep "^$" file.txt

# use character classes
grep "[0-9]" file.txt           # lines containing digits
grep "[a-zA-Z]" file.txt        # lines containing letters

5. Multiple files and stdin

# search several files
grep "text" file1.txt file2.txt

# use wildcards
grep "pattern" *.log

# read from standard input
cat file.txt | grep "pattern"
echo "some text" | grep "text"

6. Counting matches

# count matching lines
grep -c "pattern" file.txt

# count matches (a line may contain several)
grep -o "pattern" file.txt | wc -l

Practical examples

1. Log analysis

# find today's error logs
grep "ERROR" /var/log/syslog | grep "$(date '+%Y-%m-%d')"

# find lines containing IP addresses
grep -E "[0-9]+\.[0-9]+\.[0-9]+\.[0-9]+" access.log

2. Code search

# find a function definition in a project
grep -r "function_name(" src/

# find TODO or FIXME comments
grep -r -E "TODO|FIXME" ./

# find empty lines and count them
grep -c "^$" source_code.py

3. System monitoring

# look for a specific process
ps aux | grep "nginx"

# check port usage
netstat -tulpn | grep ":80"

4. File content checks

# show effective settings in a config file (skip comments and blank lines)
grep -v "^#" /etc/ssh/sshd_config | grep -v "^$"

# find lines containing email addresses
grep -E "\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Z|a-z]{2,}\b" file.txt

Advanced tips

1. Show context

# show errors with surrounding context
grep -C 3 -i "error" application.log

2. Back references

# use groups in extended regular expressions
grep -E "(abc).*\1" file.txt  # find repeated "abc"

3. Searching binary files

# search for text strings inside a binary file
grep -a "text" binaryfile

4. Color highlighting

# enable color highlighting (usually on by default)
grep --color=auto "pattern" file.txt

Common combinations

With other commands

# search and sort
grep "pattern" file.txt | sort

# search and count occurrences
grep -o "pattern" file.txt | sort | uniq -c

# search and save the results
grep "error" logfile.txt > errors.txt

These are the most common grep usages; mastering them greatly improves text-processing efficiency on Linux.

          Sep 7, 2025

          Sed

sed (Stream Editor) is a powerful stream editor on Linux for filtering and transforming text. Common usages of the sed command are listed below:

Basic syntax

sed [options] 'command' file
sed [options] -e 'command1' -e 'command2' file
sed [options] -f script-file file

Common options

1. Basic options

# edit the file in place and keep a backup of the original
sed -i.bak 's/old/new/g' file.txt

# edit the file in place (no backup)
sed -i 's/old/new/g' file.txt

# only print lines selected by the command
sed -n 'command' file.txt

# use extended regular expressions
sed -E 'command' file.txt

Text substitution

1. Basic substitution

# replace the first match on each line
sed 's/old/new/' file.txt

# replace all matches (global)
sed 's/old/new/g' file.txt

# replace the Nth occurrence
sed 's/old/new/2' file.txt    # replace the second occurrence

# substitute only on matching lines
sed '/pattern/s/old/new/g' file.txt

2. Alternative delimiters

# when the pattern contains slashes, use a different delimiter
sed 's|/usr/local|/opt|g' file.txt
sed 's#old#new#g' file.txt

3. References and escaping

# use & to reference the whole matched text
sed 's/[0-9]*/[&]/g' file.txt

# use group references
sed 's/\([a-z]*\) \([a-z]*\)/\2 \1/' file.txt
sed -E 's/([a-z]*) ([a-z]*)/\2 \1/' file.txt  # extended regular expressions

Line operations

1. Line addressing

# address by line number
sed '5s/old/new/' file.txt        # substitute only on line 5
sed '1,5s/old/new/g' file.txt     # substitute on lines 1-5
sed '5,$s/old/new/g' file.txt     # from line 5 to the last line

# address lines with a regular expression
sed '/^#/s/old/new/' file.txt     # only lines starting with #
sed '/start/,/end/s/old/new/g' file.txt  # lines between start and end

2. Deleting lines

# delete empty lines
sed '/^$/d' file.txt

# delete comment lines
sed '/^#/d' file.txt

# delete by line number
sed '5d' file.txt                 # delete line 5
sed '1,5d' file.txt               # delete lines 1-5
sed '/pattern/d' file.txt         # delete lines matching a pattern

3. Inserting and appending

# insert before a given line
sed '5i\inserted text' file.txt

# append after a given line
sed '5a\appended text' file.txt

# insert at the beginning of the file
sed '1i\first line text' file.txt

# append at the end of the file
sed '$a\last line text' file.txt

4. Replacing lines

# replace an entire line
sed '5c\new line content' file.txt

# replace lines matching a pattern
sed '/pattern/c\new line content' file.txt

Advanced operations

1. Print control

# only print matching lines (similar to grep)
sed -n '/pattern/p' file.txt

# print line numbers of matches
sed -n '/pattern/=' file.txt

# print both line numbers and content
sed -n '/pattern/{=;p}' file.txt

2. Multiple commands

# separate commands with semicolons
sed 's/old/new/g; s/foo/bar/g' file.txt

# use the -e option
sed -e 's/old/new/' -e 's/foo/bar/' file.txt

# run several operations on the same matching lines
sed '/pattern/{s/old/new/; s/foo/bar/}' file.txt

3. File operations

# read a file and insert it after matching lines
sed '/pattern/r otherfile.txt' file.txt

# write matching lines to a file
sed '/pattern/w output.txt' file.txt

4. Hold space operations

# exchange pattern space and hold space
sed '1!G;h;$!d' file.txt          # reverse the order of lines in a file

# copy to the hold space
sed '/pattern/h' file.txt

# get content back from the hold space
sed '/pattern/g' file.txt

Practical examples

1. Modifying configuration files

# change the SSH port
sed -i 's/^#Port 22/Port 2222/' /etc/ssh/sshd_config

# enable root login
sed -i 's/^#PermitRootLogin yes/PermitRootLogin yes/' /etc/ssh/sshd_config

# comment out a line
sed -i '/pattern/s/^/#/' file.txt

# uncomment a line
sed -i '/pattern/s/^#//' file.txt

2. Log processing

# extract timestamps
sed -n 's/.*\([0-9]\{2\}:[0-9]\{2\}:[0-9]\{2\}\).*/\1/p' logfile

# strip leading/trailing whitespace
sed 's/^[ \t]*//;s/[ \t]*$//' file.txt

3. Text formatting

# append a comma to the end of each line
sed 's/$/,/' file.txt

# squeeze consecutive blank lines
sed '/^$/{N;/^\n$/D}' file.txt

# prepend line numbers to each line
sed = file.txt | sed 'N;s/\n/\t/'

4. Data conversion

# CSV to TSV
sed 's/,/\t/g' data.csv

# convert the date format
sed -E 's/([0-9]{4})-([0-9]{2})-([0-9]{2})/\3\/\2\/\1/g' dates.txt

# URL-encode spaces (simple version)
echo "hello world" | sed 's/ /%20/g'

5. Using a script file

# create a sed script
cat > script.sed << EOF
s/old/new/g
/^#/d
/^$/d
EOF

# use the script file
sed -f script.sed file.txt

Common combination tricks

1. With pipes

# search and replace
grep "pattern" file.txt | sed 's/old/new/g'

# process command output
ls -l | sed -n '2,$p' | awk '{print $9}'

2. Complex text processing

# extract XML/HTML tag content
sed -n 's/.*<title>\(.*\)<\/title>.*/\1/p' file.html

# extract a section of a configuration file
sed -n '/^\[database\]/,/^\[/p' config.ini | sed '/^\[/d'

These sed usages cover most day-to-day text-processing needs; mastering them lets you batch-edit and transform text efficiently.

          Sep 7, 2025

          Subsections of Components

What Can cgroups Do

Beyond CPU limits, cgroups are feature-rich and provide control over many other system resources:

1. Memory management (memory)

1.1 Memory limits

# set the memory usage limit
echo "100M" > /sys/fs/cgroup/memory/group1/memory.limit_in_bytes

# set the memory + swap limit
echo "200M" > /sys/fs/cgroup/memory/group1/memory.memsw.limit_in_bytes

1.2 Memory statistics and monitoring

# check memory usage
cat /sys/fs/cgroup/memory/group1/memory.usage_in_bytes
cat /sys/fs/cgroup/memory/group1/memory.stat

1.3 Memory pressure control

# set the memory reclaim (swappiness) pressure
echo 100 > /sys/fs/cgroup/memory/group1/memory.swappiness

2. Block device I/O control (blkio)

2.1 I/O bandwidth limits

# limit read bandwidth to 1 MB/s
echo "8:0 1048576" > /sys/fs/cgroup/blkio/group1/blkio.throttle.read_bps_device

# limit write bandwidth to 2 MB/s
echo "8:0 2097152" > /sys/fs/cgroup/blkio/group1/blkio.throttle.write_bps_device

2.2 IOPS limits

# limit read operations per second
echo "8:0 100" > /sys/fs/cgroup/blkio/group1/blkio.throttle.read_iops_device

# limit write operations per second
echo "8:0 50" > /sys/fs/cgroup/blkio/group1/blkio.throttle.write_iops_device

2.3 I/O weight allocation

# set the I/O priority weight (100-1000)
echo 500 > /sys/fs/cgroup/blkio/group1/blkio.weight

3. Process control (pids)

3.1 Process count limits

# limit the maximum number of processes
echo 100 > /sys/fs/cgroup/pids/group1/pids.max

# check the current number of processes
cat /sys/fs/cgroup/pids/group1/pids.current

4. Device access control (devices)

4.1 Device permission management

# allow access to a device
echo "c 1:3 rwm" > /sys/fs/cgroup/devices/group1/devices.allow

# deny access to a device
echo "c 1:5 rwm" > /sys/fs/cgroup/devices/group1/devices.deny

5. Network control (net_cls, net_prio)

5.1 Network traffic classification

# set the network traffic class ID
echo 0x100001 > /sys/fs/cgroup/net_cls/group1/net_cls.classid

5.2 Network priority

# set the per-interface network priority
echo "eth0 5" > /sys/fs/cgroup/net_prio/group1/net_prio.ifpriomap

6. Mount point control (devices)

6.1 File system access restrictions

# restrict mount namespace operations
echo 1 > /sys/fs/cgroup/group1/devices.deny

7. Unified hierarchy (cgroup v2) features

cgroup v2 provides a more unified management interface:

7.1 Resource protection

# memory low-watermark protection
echo "min 50M" > /sys/fs/cgroup/group1/memory.low

# CPU weight protection
echo 100 > /sys/fs/cgroup/group1/cpu.weight

7.2 I/O control

# I/O weight
echo "default 100" > /sys/fs/cgroup/group1/io.weight

# I/O maximum bandwidth
echo "8:0 rbps=1048576 wbps=2097152" > /sys/fs/cgroup/group1/io.max

8. Practical use cases

8.1 Container resource limits

# Docker container resource limits
docker run -it \
  --cpus="0.5" \
  --memory="100m" \
  --blkio-weight=500 \
  --pids-limit=100 \
  ubuntu:latest

8.2 systemd service limits

[Service]
MemoryMax=100M
IOWeight=500
TasksMax=100
DeviceAllow=/dev/null rw
DeviceAllow=/dev/zero rw
DeviceAllow=/dev/full rw

8.3 Kubernetes resource management

apiVersion: v1
kind: Pod
spec:
  containers:
  - name: app
    resources:
      limits:
        cpu: "500m"
        memory: "128Mi"
        ephemeral-storage: "1Gi"
      requests:
        cpu: "250m"
        memory: "64Mi"

9. Monitoring and statistics

9.1 Resource usage statistics

# view cgroup resource usage
cat /sys/fs/cgroup/memory/group1/memory.stat
cat /sys/fs/cgroup/cpu/group1/cpu.stat
cat /sys/fs/cgroup/io/group1/io.stat

9.2 Pressure stall information

# view memory pressure
cat /sys/fs/cgroup/memory/group1/memory.pressure

10. Advanced features

10.1 Resource delegation (cgroup v2)

# allow child cgroups to manage specific resources
echo "+memory +io" > /sys/fs/cgroup/group1/cgroup.subtree_control

10.2 Freezing processes

# pause all processes in the cgroup
echo 1 > /sys/fs/cgroup/group1/cgroup.freeze

# resume execution
echo 0 > /sys/fs/cgroup/group1/cgroup.freeze

These capabilities make cgroups the foundation of container technologies such as Docker and Kubernetes, providing complete resource isolation, limiting, and accounting for modern Linux systems.

          Mar 7, 2024

          IPVS

What is IPVS?

IPVS (IP Virtual Server) is a layer-4 (transport-layer) load balancer built into the Linux kernel and the core component of the LVS (Linux Virtual Server) project.

Basic concepts

• Layer: transport layer (TCP/UDP)
• Implementation: in kernel space, high performance
• Purpose: load-balance TCP/UDP requests across multiple real servers

IPVS core architecture

Client request
    ↓
Virtual Service - VIP:Port
    ↓
Load-balancing scheduling algorithm
    ↓
Real server pool (Real Servers)

What IPVS is used for

1. High-performance load balancing

# IPVS can handle hundreds of thousands of concurrent connections
# and performs much better than iptables-based load balancing

2. Multiple load-balancing algorithms

# list the scheduling algorithms built into the kernel
grep -i ip_vs /lib/modules/$(uname -r)/modules.builtin

# common algorithms:
rr      # Round Robin
wrr     # Weighted Round Robin
lc      # Least Connection
wlc     # Weighted Least Connection
sh      # Source Hashing
dh      # Destination Hashing

3. Multiple forwarding modes

NAT mode (Network Address Translation)

# both requests and responses pass through the load balancer
# example configuration
ipvsadm -A -t 192.168.1.100:80 -s rr
ipvsadm -a -t 192.168.1.100:80 -r 10.244.1.10:80 -m
ipvsadm -a -t 192.168.1.100:80 -r 10.244.1.11:80 -m

DR mode (Direct Routing)

# responses return directly to the client, bypassing the load balancer
# high-performance mode
ipvsadm -A -t 192.168.1.100:80 -s rr
ipvsadm -a -t 192.168.1.100:80 -r 10.244.1.10:80 -g
ipvsadm -a -t 192.168.1.100:80 -r 10.244.1.11:80 -g

TUN mode (IP tunneling)

# forward requests through an IP tunnel
ipvsadm -A -t 192.168.1.100:80 -s rr
ipvsadm -a -t 192.168.1.100:80 -r 10.244.1.10:80 -i
ipvsadm -a -t 192.168.1.100:80 -r 10.244.1.11:80 -i

IPVS in Kubernetes

Advantages of the kube-proxy IPVS mode

# performance comparison
iptables: O(n) chain lookup; performance drops as rules grow
ipvs:     O(1) hash-table lookup; high performance

IPVS configuration in Kubernetes

# check whether kube-proxy runs in IPVS mode
kubectl -n kube-system get pods -l k8s-app=kube-proxy -o yaml | grep mode

# view the IPVS rules
ipvsadm -Ln

Core IPVS features

1. Connection scheduling

# which scheduler fits which scenario
rr      # general purpose, servers with similar capacity
wrr     # servers with noticeably different capacity
lc      # long-lived connections, e.g. databases
sh      # session affinity

2. Health checks

# IPVS itself does not perform health checks
# pair it with keepalived or another health-check tool

3. Session persistence

# use source-address hashing for session persistence
ipvsadm -A -t 192.168.1.100:80 -s sh

IPVS management commands

Basic operations

# add a virtual service
ipvsadm -A -t|u|f <service-address> [-s scheduler]

# add a real server
ipvsadm -a -t|u|f <service-address> -r <server-address> [-g|i|m] [-w weight]

# examples
ipvsadm -A -t 192.168.1.100:80 -s wlc
ipvsadm -a -t 192.168.1.100:80 -r 192.168.1.10:8080 -m -w 1
ipvsadm -a -t 192.168.1.100:80 -r 192.168.1.11:8080 -m -w 2

Monitoring and statistics

# view connection statistics
ipvsadm -Ln --stats
ipvsadm -Ln --rate

# view current connections
ipvsadm -Lnc

# view timeout settings
ipvsadm -L --timeout

IPVS compared with related technologies

IPVS vs iptables

Feature       IPVS                        iptables
Performance   O(1) hash lookup            O(n) chain lookup
Scale         Handles many services       Degrades with many rules
Role          Dedicated load balancer     General-purpose firewall
Algorithms    Multiple schedulers         Simple round-robin

IPVS vs Nginx

Feature       IPVS                        Nginx
Layer         Layer 4 (transport)         Layer 7 (application)
Performance   Kernel-level, higher        User space, feature-rich
Features      Basic load balancing        Content routing, SSL termination, etc.

Practical scenarios

1. Kubernetes Service proxying

# kube-proxy creates IPVS rules for every Service
ipvsadm -Ln
# sample output:
TCP  10.96.0.1:443 rr
  -> 192.168.1.10:6443    Masq    1      0          0
TCP  10.96.0.10:53 rr
  -> 10.244.0.5:53        Masq    1      0          0

2. Highly available load balancing

# combine with keepalived for high availability
# active/standby load balancers + IPVS

3. Database read/write splitting

# use IPVS to distribute database connections
ipvsadm -A -t 192.168.1.100:3306 -s lc
ipvsadm -a -t 192.168.1.100:3306 -r 192.168.1.20:3306 -m
ipvsadm -a -t 192.168.1.100:3306 -r 192.168.1.21:3306 -m

Summary

Main uses of IPVS:

1. High-performance load balancing - kernel-level implementation with high throughput
2. Multiple scheduling algorithms - fits different workloads
3. Multiple forwarding modes - NAT/DR/TUN cover different network topologies
4. Large-scale clusters - suitable for cloud-native and microservice architectures
5. Kubernetes integration - as the kube-proxy backend it provides efficient Service proxying

In Kubernetes, IPVS mode has clear performance advantages over iptables mode at large scale and is the recommended load-balancing option for production clusters.

          Mar 7, 2024

          Subsections of Interface

POSIX Standard

          Mar 7, 2024

          Subsections of Scripts

          Disable Service

Disable the firewall, SELinux, dnsmasq, and swap services

          systemctl disable --now firewalld 
          systemctl disable --now dnsmasq
          systemctl disable --now NetworkManager
          
          setenforce 0
          sed -i 's#SELINUX=permissive#SELINUX=disabled#g' /etc/sysconfig/selinux
          sed -i 's#SELINUX=permissive#SELINUX=disabled#g' /etc/selinux/config
          reboot
          getenforce
          
          
          swapoff -a && sysctl -w vm.swappiness=0
          sed -ri '/^[^#]*swap/s@^@#@' /etc/fstab
          Mar 14, 2024

          Free Space

          Cleanup

1. find the 10 biggest files
dnf install ncdu

# find the 10 largest files/directories under the current directory
du -ah . | sort -rh | head -n 10

# find files larger than 100M under the home directory
find ~ -type f -size +100M -exec ls -lh {} \; | awk '{ print $9 ": " $5 }'
2. clean caches
rm -rf ~/.cache/*
sudo rm -rf /tmp/*
sudo rm -rf /var/tmp/*
3. clean images
# remove all stopped containers
podman container prune -f

# remove all images not referenced by any container (dangling images)
podman image prune

# more aggressive: remove all images not used by any running container
podman image prune -a

# clean the build cache
podman builder prune

# the most thorough cleanup: remove all stopped containers, unused networks, dangling images, and build cache
podman system prune
podman system prune -a # even more thorough: removes all unused images, not only dangling ones
          Mar 14, 2024

          Login Without Pwd

          copy id_rsa to other nodes

          yum install sshpass -y
          mkdir -p /extend/shell
          
          cat >>/extend/shell/distribute_pub.sh<< EOF
          #!/bin/bash
          ROOT_PASS=root123
          ssh-keygen -t rsa -f ~/.ssh/id_rsa -P ''
          for ip in 101 102 103 
          do
          sshpass -p\$ROOT_PASS ssh-copy-id -o StrictHostKeyChecking=no 192.168.29.\$ip
          done
          EOF
          
          cd /extend/shell
          chmod +x distribute_pub.sh
          
          ./distribute_pub.sh
          Mar 14, 2024

          Set Http Proxy

          [Optional] Install Proxy Server

          Set Http Proxy

          export https_proxy=http://localhost:20171

          Use Proxy
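
A quick sketch of using the proxy for a single command (curl honors the https_proxy variable; the URL is just an example):

https_proxy=http://localhost:20171 curl -I https://www.google.com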

          Mar 14, 2024

          🪀Install Shit

          Aug 7, 2024

          Subsections of 🪀Install Shit

          Subsections of Application

          Datahub

          Preliminary

          • Kubernetes has installed, if not check link
          • argoCD has installed, if not check link
          • Elasticsearch has installed, if not check link
          • MariaDB has installed, if not check link
          • Kafka has installed, if not check link

          Steps

          1. prepare datahub credentials secret

          kubectl -n application \
              create secret generic datahub-credentials \
              --from-literal=mysql-root-password="$(kubectl get secret mariadb-credentials --namespace database -o jsonpath='{.data.mariadb-root-password}' | base64 -d)"
          kubectl -n application \
              create secret generic datahub-credentials \
              --from-literal=mysql-root-password="$(kubectl get secret mariadb-credentials --namespace database -o jsonpath='{.data.mariadb-root-password}' | base64 -d)" \
              --from-literal=security.protocol="SASL_PLAINTEXT" \
              --from-literal=sasl.mechanism="SCRAM-SHA-256" \
              --from-literal=sasl.jaas.config="org.apache.kafka.common.security.scram.ScramLoginModule required username=\"user1\" password=\"$(kubectl get secret kafka-user-passwords --namespace database -o jsonpath='{.data.client-passwords}' | base64 -d | cut -d , -f 1)\";"

2. prepare deploy-datahub.yaml

          apiVersion: argoproj.io/v1alpha1
          kind: Application
          metadata:
            name: datahub
          spec:
            syncPolicy:
              syncOptions:
              - CreateNamespace=true
            project: default
            source:
              repoURL: https://helm.datahubproject.io
              chart: datahub
              targetRevision: 0.4.8
              helm:
                releaseName: datahub
                values: |
                  global:
                    elasticsearch:
                      host: elastic-search-elasticsearch.application.svc.cluster.local
                      port: 9200
                      skipcheck: "false"
                      insecure: "false"
                      useSSL: "false"
                    kafka:
                      bootstrap:
                        server: kafka.database.svc.cluster.local:9092
                      zookeeper:
                        server: kafka-zookeeper.database.svc.cluster.local:2181
                    sql:
                      datasource:
                        host: mariadb.database.svc.cluster.local:3306
                        hostForMysqlClient: mariadb.database.svc.cluster.local
                        port: 3306
                        url: jdbc:mysql://mariadb.database.svc.cluster.local:3306/datahub?verifyServerCertificate=false&useSSL=true&useUnicode=yes&characterEncoding=UTF-8&enabledTLSProtocols=TLSv1.2
                        driver: com.mysql.cj.jdbc.Driver
                        username: root
                        password:
                          secretRef: datahub-credentials
                          secretKey: mysql-root-password
                  datahub-gms:
                    enabled: true
                    replicaCount: 1
                    image:
                      repository: m.daocloud.io/docker.io/acryldata/datahub-gms
                    service:
                      type: ClusterIP
                    ingress:
                      enabled: false
                  datahub-frontend:
                    enabled: true
                    replicaCount: 1
                    image:
                      repository: m.daocloud.io/docker.io/acryldata/datahub-frontend-react
                    defaultUserCredentials:
                      randomAdminPassword: true
                    service:
                      type: ClusterIP
                    ingress:
                      enabled: true
                      className: nginx
                      annotations:
                        cert-manager.io/cluster-issuer: self-signed-ca-issuer
                      hosts:
                      - host: datahub.dev.geekcity.tech
                        paths:
                        - /
                      tls:
                      - secretName: "datahub.dev.geekcity.tech-tls"
                        hosts:
                        - datahub.dev.geekcity.tech
                  acryl-datahub-actions:
                    enabled: true
                    replicaCount: 1
                    image:
                      repository: m.daocloud.io/docker.io/acryldata/datahub-actions
                  datahub-mae-consumer:
                    replicaCount: 1
                    image:
                      repository: m.daocloud.io/docker.io/acryldata/datahub-mae-consumer
                    ingress:
                      enabled: false
                  datahub-mce-consumer:
                    replicaCount: 1
                    image:
                      repository: m.daocloud.io/docker.io/acryldata/datahub-mce-consumer
                    ingress:
                      enabled: false
                  datahub-ingestion-cron:
                    enabled: false
                    image:
                      repository: m.daocloud.io/docker.io/acryldata/datahub-ingestion
                  elasticsearchSetupJob:
                    enabled: true
                    image:
                      repository: m.daocloud.io/docker.io/acryldata/datahub-elasticsearch-setup
                  kafkaSetupJob:
                    enabled: true
                    image:
                      repository: m.daocloud.io/docker.io/acryldata/datahub-kafka-setup
                  mysqlSetupJob:
                    enabled: true
                    image:
                      repository: m.daocloud.io/docker.io/acryldata/datahub-mysql-setup
                  postgresqlSetupJob:
                    enabled: false
                    image:
                      repository: m.daocloud.io/docker.io/acryldata/datahub-postgres-setup
                  datahubUpgrade:
                    enabled: true
                    image:
                      repository: m.daocloud.io/docker.io/acryldata/datahub-upgrade
                  datahubSystemUpdate:
                    image:
                      repository: m.daocloud.io/docker.io/acryldata/datahub-upgrade
            destination:
              server: https://kubernetes.default.svc
              namespace: application
          apiVersion: argoproj.io/v1alpha1
          kind: Application
          metadata:
            name: datahub
          spec:
            syncPolicy:
              syncOptions:
              - CreateNamespace=true
            project: default
            source:
              repoURL: https://helm.datahubproject.io
              chart: datahub
              targetRevision: 0.4.8
              helm:
                releaseName: datahub
                values: |
                  global:
                    springKafkaConfigurationOverrides:
                      security.protocol: SASL_PLAINTEXT
                      sasl.mechanism: SCRAM-SHA-256
                    credentialsAndCertsSecrets:
                      name: datahub-credentials
                      secureEnv:
                        sasl.jaas.config: sasl.jaas.config
                    elasticsearch:
                      host: elastic-search-elasticsearch.application.svc.cluster.local
                      port: 9200
                      skipcheck: "false"
                      insecure: "false"
                      useSSL: "false"
                    kafka:
                      bootstrap:
                        server: kafka.database.svc.cluster.local:9092
                      zookeeper:
                        server: kafka-zookeeper.database.svc.cluster.local:2181
                    neo4j:
                      host: neo4j.database.svc.cluster.local:7474
                      uri: bolt://neo4j.database.svc.cluster.local
                      username: neo4j
                      password:
                        secretRef: datahub-credentials
                        secretKey: neo4j-password
                    sql:
                      datasource:
                        host: mariadb.database.svc.cluster.local:3306
                        hostForMysqlClient: mariadb.database.svc.cluster.local
                        port: 3306
                        url: jdbc:mysql://mariadb.database.svc.cluster.local:3306/datahub?verifyServerCertificate=false&useSSL=true&useUnicode=yes&characterEncoding=UTF-8&enabledTLSProtocols=TLSv1.2
                        driver: com.mysql.cj.jdbc.Driver
                        username: root
                        password:
                          secretRef: datahub-credentials
                          secretKey: mysql-root-password
                  datahub-gms:
                    enabled: true
                    replicaCount: 1
                    image:
                      repository: m.daocloud.io/docker.io/acryldata/datahub-gms
                    service:
                      type: ClusterIP
                    ingress:
                      enabled: false
                  datahub-frontend:
                    enabled: true
                    replicaCount: 1
                    image:
                      repository: m.daocloud.io/docker.io/acryldata/datahub-frontend-react
                    defaultUserCredentials:
                      randomAdminPassword: true
                    service:
                      type: ClusterIP
                    ingress:
                      enabled: true
                      className: nginx
                      annotations:
                        cert-manager.io/cluster-issuer: self-signed-ca-issuer
                      hosts:
                      - host: datahub.dev.geekcity.tech
                        paths:
                        - /
                      tls:
                      - secretName: "datahub.dev.geekcity.tech-tls"
                        hosts:
                        - datahub.dev.geekcity.tech
                  acryl-datahub-actions:
                    enabled: true
                    replicaCount: 1
                    image:
                      repository: m.daocloud.io/docker.io/acryldata/datahub-actions
                  datahub-mae-consumer:
                    replicaCount: 1
                    image:
                      repository: m.daocloud.io/docker.io/acryldata/datahub-mae-consumer
                    ingress:
                      enabled: false
                  datahub-mce-consumer:
                    replicaCount: 1
                    image:
                      repository: m.daocloud.io/docker.io/acryldata/datahub-mce-consumer
                    ingress:
                      enabled: false
                  datahub-ingestion-cron:
                    enabled: false
                    image:
                      repository: m.daocloud.io/docker.io/acryldata/datahub-ingestion
                  elasticsearchSetupJob:
                    enabled: true
                    image:
                      repository: m.daocloud.io/docker.io/acryldata/datahub-elasticsearch-setup
                  kafkaSetupJob:
                    enabled: true
                    image:
                      repository: m.daocloud.io/docker.io/acryldata/datahub-kafka-setup
                  mysqlSetupJob:
                    enabled: true
                    image:
                      repository: m.daocloud.io/docker.io/acryldata/datahub-mysql-setup
                  postgresqlSetupJob:
                    enabled: false
                    image:
                      repository: m.daocloud.io/docker.io/acryldata/datahub-postgres-setup
                  datahubUpgrade:
                    enabled: true
                    image:
                      repository: m.daocloud.io/docker.io/acryldata/datahub-upgrade
                  datahubSystemUpdate:
                    image:
                      repository: m.daocloud.io/docker.io/acryldata/datahub-upgrade
            destination:
              server: https://kubernetes.default.svc
              namespace: application
if you want to run more than one GMS

add this under global if you want to run more than one GMS (it enables the standalone MAE/MCE consumers):

  datahub_standalone_consumers_enabled: true

          3. apply to k8s

          kubectl -n argocd apply -f deploy-datahub.yaml

          4. sync by argocd

          argocd app sync argocd/datahub

5. extract credentials

          kubectl -n application get secret datahub-user-secret -o jsonpath='{.data.user\.props}' | base64 -d

[Optional] Visit through browser

          add $K8S_MASTER_IP datahub.dev.geekcity.tech to /etc/hosts
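
for example (a sketch; substitute your control-plane IP for $K8S_MASTER_IP)

echo "$K8S_MASTER_IP datahub.dev.geekcity.tech" | sudo tee -a /etc/hosts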

[Optional] Visit through Datahub CLI

          We recommend Python virtual environments (venv-s) to namespace pip modules. Here’s an example setup:

          python3 -m venv venv             # create the environment
          source venv/bin/activate         # activate the environment

          NOTE: If you install datahub in a virtual environment, that same virtual environment must be re-activated each time a shell window or session is created.

          Once inside the virtual environment, install datahub using the following commands

          # Requires Python 3.8+
          python3 -m pip install --upgrade pip wheel setuptools
          python3 -m pip install --upgrade acryl-datahub
          # validate that the install was successful
          datahub version
          # If you see "command not found", try running this instead: python3 -m datahub version
          datahub init
          # authenticate your datahub CLI with your datahub instance
          Mar 7, 2024

          Subsections of Auth

          Deploy GateKeeper Server

          Official Website: https://open-policy-agent.github.io/gatekeeper/website/

          Preliminary

• Kubernetes version must be newer than v1.16

          Components

Gatekeeper is a Kubernetes admission controller built on Open Policy Agent (OPA). It lets users define and enforce custom policies that control the creation, update, and deletion of resources in a Kubernetes cluster.

• Core components
  • Constraint Templates: define the rule logic of a policy, written in Rego. A template is an abstract policy definition that can be reused by multiple constraint instances.
  • Constraint Instances: concrete policy instances created from a constraint template; they specify the actual parameters and match rules, i.e. which resources the policy applies to.
  • Admission Controller (no modification required): intercepts requests to the Kubernetes API server and evaluates them against the defined constraints; if a request violates any constraint, the request is rejected.
    Core pod roles


• gatekeeper-audit
  • Periodic compliance checks: at a preset interval, this component scans all existing resources in the cluster and checks whether they comply with the defined constraints (periodic, batch checking).
  • Audit reports: after each scan, gatekeeper-audit generates a detailed audit report stating which resources violate which constraints, so administrators can quickly see the cluster's compliance status.
• gatekeeper-controller-manager
  • Real-time admission control: acting as the admission controller, gatekeeper-controller-manager intercepts resource create, update, and delete requests as they happen and evaluates the resources against the predefined constraint templates and constraints (real-time, event driven).
  • Decision handling: based on the evaluation, requests that satisfy all constraints are allowed to proceed; requests that violate any rule are rejected, keeping non-compliant resources out of the cluster.

          Features

1. Constraint management

  • Custom constraint templates: users can write their own constraint templates in Rego to implement arbitrarily complex policy logic.

    For example, a policy can require every Namespace to set specific labels, or restrict certain namespaces to specific images.

    View existing constraint templates and instances
        ```shell
        kubectl get constrainttemplates
        kubectl get constraints
        ```
              
```shell
kubectl apply -f - <<EOF
apiVersion: templates.gatekeeper.sh/v1
kind: ConstraintTemplate
metadata:
  name: k8srequiredlabels
spec:
  crd:
    spec:
      names:
        kind: K8sRequiredLabels
      validation:
        openAPIV3Schema:
          type: object
          properties:
            labels:
              type: array
              items:
                type: string
  targets:
    - target: admission.k8s.gatekeeper.sh
      rego: |
        package k8srequiredlabels

        violation[{"msg": msg, "details": {"missing_labels": missing}}] {
            provided := {label | input.review.object.metadata.labels[label]}
            required := {label | label := input.parameters.labels[_]}
            missing := required - provided
            count(missing) > 0
            msg := sprintf("you must provide labels: %v", [missing])
        }
EOF
```
              

• Constraint template reuse: one constraint template can back multiple constraint instances, which keeps policies maintainable and reusable.

  For example, you can create a single generic label template and then create different constraint instances in different Namespaces, each requiring different labels.

A sample constraint instance yaml (a quick way to test it is sketched after this list)
    It requires every Namespace to carry the label "gatekeeper"
              
```yaml
apiVersion: constraints.gatekeeper.sh/v1beta1
kind: K8sRequiredLabels
metadata:
  name: ns-must-have-gk-label
spec:
  enforcementAction: dryrun
  match:
    kinds:
      - apiGroups: [""]
        kinds: ["Namespace"]
  parameters:
    labels: ["gatekeeper"]
```
              

• Constraint updates: when a constraint template or a constraint changes, Gatekeeper automatically re-evaluates all affected resources so that the updated policy takes effect immediately.
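With the dryrun instance above applied, a minimal way to exercise the policy (the namespace name gk-demo is just an example) is to create a namespace without the required label and then, after the next audit cycle, read the recorded violations from the constraint's status:

```shell
kubectl create namespace gk-demo   # allowed, because enforcementAction is dryrun
kubectl get k8srequiredlabels ns-must-have-gk-label -o jsonpath='{.status.totalViolations}'
```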

2. Resource control

• Admission interception: when a resource create or update request arrives, Gatekeeper intercepts it in real time and evaluates it against the policies. If the request violates a policy, it is rejected immediately with a detailed error message that helps the user locate the problem.

• Restricting resource creation and updates: Gatekeeper can block create and update requests for resources that do not comply with a policy.

  For example, if a policy requires every Deployment to set resource requests and limits, a request to create or update a Deployment without them will be rejected.

  This behaviour is controlled via enforcementAction, one of: dryrun | deny | warn (a quick way to switch modes is sketched after this list)

  check https://open-policy-agent.github.io/gatekeeper-library/website/validation/containerlimits

• Resource type filtering: the match field of a constraint specifies which resource kinds and namespaces the policy applies to.

  For example, you can apply a policy only to Pods in specific namespaces, or only to resources of a particular API group and version.

  You can also use a syncSet (sync configuration) to specify which resources are synced for evaluation and which are ignored.

  Sync all Namespaces and Pods, ignoring namespaces whose names start with kube
```yaml
apiVersion: config.gatekeeper.sh/v1alpha1
kind: Config
metadata:
  name: config
  namespace: "gatekeeper-system"
spec:
  sync:
    syncOnly:
      - group: ""
        version: "v1"
        kind: "Namespace"
      - group: ""
        version: "v1"
        kind: "Pod"
  match:
    - excludedNamespaces: ["kube-*"]
      processes: ["*"]
```
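To switch the earlier instance from audit-only to blocking mode, one option is to patch enforcementAction in place (a sketch; you can equally edit and re-apply the YAML):

```shell
kubectl patch k8srequiredlabels ns-must-have-gk-label --type merge -p '{"spec":{"enforcementAction":"deny"}}'
```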
              

3. Compliance assurance

• Industry standards and custom rules: Gatekeeper can ensure that resources in a Kubernetes cluster comply with industry standards and with the organisation's own internal security rules.

  For example, a policy can require that all containers use the latest security patches, or that all storage volumes are encrypted.

  Gatekeeper already ships nearly 50 ready-made constraint policies covering all kinds of resource restrictions; see https://open-policy-agent.github.io/gatekeeper-library/website/ to browse and download them.

• Auditing and reporting: Gatekeeper records every policy evaluation result, which makes auditing and reporting straightforward. From the audit logs, administrators can see which resources violate which policies.

• Audit export: audit logs can be exported and fed into downstream systems.

  See https://open-policy-agent.github.io/gatekeeper/website/docs/pubsub/ for details.

          Installation

Install from the official manifest:
kubectl apply -f https://raw.githubusercontent.com/open-policy-agent/gatekeeper/v3.18.2/deploy/gatekeeper.yaml
Or install with Helm:
helm repo add gatekeeper https://open-policy-agent.github.io/gatekeeper/charts
          helm install gatekeeper/gatekeeper --name-template=gatekeeper --namespace gatekeeper-system --create-namespace

To build and deploy Gatekeeper from source instead, make sure that:

          • You have Docker version 20.10 or later installed.
          • Your kubectl context is set to the desired installation cluster.
          • You have a container registry you can write to that is readable by the target cluster.
          git clone https://github.com/open-policy-agent/gatekeeper.git \
          && cd gatekeeper 
          • Build and push Gatekeeper image:
          export DESTINATION_GATEKEEPER_IMAGE=<add registry like "myregistry.docker.io/gatekeeper">
          make docker-buildx REPOSITORY=$DESTINATION_GATEKEEPER_IMAGE OUTPUT_TYPE=type=registry
          • And the deploy
          make deploy REPOSITORY=$DESTINATION_GATEKEEPER_IMAGE
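Whichever install path you choose, a quick sanity check is to confirm that the Gatekeeper pods (audit and controller-manager) are running:

kubectl -n gatekeeper-system get pods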
          Mar 12, 2024

          Subsections of Binary

          Argo Workflow Binary

          MIRROR="files.m.daocloud.io/"
          VERSION=v3.5.4
          curl -sSLo argo-linux-amd64.gz "https://${MIRROR}github.com/argoproj/argo-workflows/releases/download/${VERSION}/argo-linux-amd64.gz"
          gunzip argo-linux-amd64.gz
          chmod u+x argo-linux-amd64
          mkdir -p ${HOME}/bin
          mv -f argo-linux-amd64 ${HOME}/bin/argo
          rm -f argo-linux-amd64.gz
          Apr 7, 2024

          ArgoCD Binary

          MIRROR="files.m.daocloud.io/"
          VERSION=v3.1.8
          [ $(uname -m) = x86_64 ] && curl -sSLo argocd "https://${MIRROR}github.com/argoproj/argo-cd/releases/download/${VERSION}/argocd-linux-amd64"
          [ $(uname -m) = aarch64 ] && curl -sSLo argocd "https://${MIRROR}github.com/argoproj/argo-cd/releases/download/${VERSION}/argocd-linux-arm64"
          chmod u+x argocd
          mkdir -p ${HOME}/bin
          mv -f argocd ${HOME}/bin

          [Optional] add to PATH

          cat >> ~/.bashrc  << EOF
          export PATH=$PATH:/root/bin
          EOF
          source ~/.bashrc
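A quick check that the binary is on PATH and runnable:

argocd version --client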
          Apr 7, 2024

          Golang Binary

          # sudo rm -rf /usr/local/go  # 删除旧版本
          wget https://go.dev/dl/go1.24.4.linux-amd64.tar.gz
          tar -C /usr/local -xzf go1.24.4.linux-amd64.tar.gz
          vim ~/.bashrc
          export PATH=$PATH:/usr/local/go/bin
          source ~/.bashrc
          rm -rf ./go1.24.4.linux-amd64.tar.gz
          Apr 7, 2024

          Gradle Binary

          MIRROR="files.m.daocloud.io/"
          VERSION=v3.5.4
          curl -sSLo argo-linux-amd64.gz "https://${MIRROR}github.com/argoproj/argo-workflows/releases/download/${VERSION}/argo-linux-amd64.gz"
          gunzip argo-linux-amd64.gz
          chmod u+x argo-linux-amd64
          mkdir -p ${HOME}/bin
          mv -f argo-linux-amd64 ${HOME}/bin/argo
          rm -f argo-linux-amd64.gz
          Apr 7, 2024

          Helm Binary

          ARCH_IN_FILE_NAME=linux-amd64
          FILE_NAME=helm-v3.18.3-${ARCH_IN_FILE_NAME}.tar.gz
          curl -sSLo ${FILE_NAME} "https://files.m.daocloud.io/get.helm.sh/${FILE_NAME}"
          tar zxf ${FILE_NAME}
          mkdir -p ${HOME}/bin
          mv -f ${ARCH_IN_FILE_NAME}/helm ${HOME}/bin
          rm -rf ./${FILE_NAME}
          rm -rf ./${ARCH_IN_FILE_NAME}
          chmod u+x ${HOME}/bin/helm
          Apr 7, 2024

          JQ Binary

JQ_VERSION=1.7
JQ_BINARY=jq-linux-amd64
wget "https://github.com/jqlang/jq/releases/download/jq-${JQ_VERSION}/${JQ_BINARY}" -O /usr/bin/jq && chmod +x /usr/bin/jq
          Apr 7, 2024

          Kind Binary

          MIRROR="files.m.daocloud.io/"
          VERSION=v0.29.0
          [ $(uname -m) = x86_64 ] && curl -sSLo kind "https://${MIRROR}github.com/kubernetes-sigs/kind/releases/download/${VERSION}/kind-linux-amd64"
          [ $(uname -m) = aarch64 ] && curl -sSLo kind "https://${MIRROR}github.com/kubernetes-sigs/kind/releases/download/${VERSION}/kind-linux-arm64"
          chmod u+x kind
          mkdir -p ${HOME}/bin
          mv -f kind ${HOME}/bin
          Apr 7, 2025

          Krew Binary

          cd "$(mktemp -d)" &&
          OS="$(uname | tr '[:upper:]' '[:lower:]')" &&
          ARCH="$(uname -m | sed -e 's/x86_64/amd64/' -e 's/\(arm\)\(64\)\?.*/\1\2/' -e 's/aarch64$/arm64/')" &&
          KREW="krew-${OS}_${ARCH}" &&
          curl -fsSLO "https://github.com/kubernetes-sigs/krew/releases/latest/download/${KREW}.tar.gz" &&
          tar zxvf "${KREW}.tar.gz" &&
          ./"${KREW}" install krew
          Apr 7, 2024

          Kubectl Binary

          MIRROR="files.m.daocloud.io/"
          VERSION=$(curl -L -s https://${MIRROR}dl.k8s.io/release/stable.txt)
          [ $(uname -m) = x86_64 ] && curl -sSLo kubectl "https://${MIRROR}dl.k8s.io/release/${VERSION}/bin/linux/amd64/kubectl"
          [ $(uname -m) = aarch64 ] && curl -sSLo kubectl "https://${MIRROR}dl.k8s.io/release/${VERSION}/bin/linux/arm64/kubectl"
          chmod u+x kubectl
          mkdir -p ${HOME}/bin
          mv -f kubectl ${HOME}/bin
          Apr 7, 2024

          Kustomize Binary

          MIRROR="github.com"
          VERSION="v5.7.1"
          [ $(uname -m) = x86_64 ] && curl -sSLo kustomize "https:///${MIRROR}/kubernetes-sigs/kustomize/releases/download/kustomize/${VERSION}/kustomize_${VERSION}_linux_amd64.tar.gz"
          [ $(uname -m) = aarch64 ] && curl -sSLo kustomize "https:///${MIRROR}/kubernetes-sigs/kustomize/releases/download/kustomize/${VERSION}/kustomize_${VERSION}_linux_arm64.tar.gz"
          chmod u+x kustomize
          mkdir -p ${HOME}/bin
          mv -f kustomize ${HOME}/bin
          Apr 7, 2024

          Maven Binary

          wget https://dlcdn.apache.org/maven/maven-3/3.9.6/binaries/apache-maven-3.9.6-bin.tar.gz
          tar xzf apache-maven-3.9.6-bin.tar.gz -C /usr/local
          ln -sfn /usr/local/apache-maven-3.9.6/bin/mvn /root/bin/mvn  
          export PATH=$PATH:/usr/local/apache-maven-3.9.6/bin
          source ~/.bashrc
          Apr 7, 2024

          Minikube Binary

          MIRROR="files.m.daocloud.io/"
          [ $(uname -m) = x86_64 ] && curl -sSLo minikube "https://${MIRROR}storage.googleapis.com/minikube/releases/latest/minikube-linux-amd64"
          [ $(uname -m) = aarch64 ] && curl -sSLo minikube "https://${MIRROR}storage.googleapis.com/minikube/releases/latest/minikube-linux-arm64"
          chmod u+x minikube
          mkdir -p ${HOME}/bin
          mv -f minikube ${HOME}/bin
          Apr 7, 2024

          Open Java

          mkdir -p /etc/apt/keyrings && \
          wget -qO - https://packages.adoptium.net/artifactory/api/gpg/key/public | gpg --dearmor -o /etc/apt/keyrings/adoptium.gpg && \
          echo "deb [signed-by=/etc/apt/keyrings/adoptium.gpg arch=amd64] https://packages.adoptium.net/artifactory/deb $(awk -F= '/^VERSION_CODENAME/{print$2}' /etc/os-release) main" | tee /etc/apt/sources.list.d/adoptium.list > /dev/null && \
          apt-get update && \
          apt-get install -y temurin-21-jdk && \
          apt-get clean && \
          rm -rf /var/lib/apt/lists/*
          Apr 7, 2025

          YQ Binary

          YQ_VERSION=v4.40.5
          YQ_BINARY=yq_linux_amd64
          wget https://github.com/mikefarah/yq/releases/download/${YQ_VERSION}/${YQ_BINARY}.tar.gz -O - | tar xz && mv ${YQ_BINARY} /usr/bin/yq
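A minimal usage check once yq is on PATH:

echo 'metadata: {name: demo}' | yq '.metadata.name'   # prints: demo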
          Apr 7, 2024

          CICD

          Articles

FAQ

Q1: What is the difference between docker, podman, and buildah?

See the answer under the Container section's FAQ below.

          Mar 7, 2025

          Subsections of CICD

          Install Argo CD

          Preliminary

• Kubernetes has been installed; if not, check 🔗link
• The Helm binary has been installed; if not, check 🔗link

          1. install argoCD binary

          2. install components

          Install By
          1. Prepare argocd.values.yaml
          crds:
            install: true
            keep: false
          global:
            domain: argo-cd.ay.dev
            revisionHistoryLimit: 3
            image:
              repository: m.daocloud.io/quay.io/argoproj/argocd
              imagePullPolicy: IfNotPresent
          redis:
            enabled: true
            image:
              repository: m.daocloud.io/docker.io/library/redis
            exporter:
              enabled: false
              image:
                repository: m.daocloud.io/bitnami/redis-exporter
            metrics:
              enabled: false
          redis-ha:
            enabled: false
            image:
              repository: m.daocloud.io/docker.io/library/redis
            configmapTest:
              repository: m.daocloud.io/docker.io/koalaman/shellcheck
            haproxy:
              enabled: false
              image:
                repository: m.daocloud.io/docker.io/library/haproxy
            exporter:
              enabled: false
              image: m.daocloud.io/docker.io/oliver006/redis_exporter
          dex:
            enabled: true
            image:
              repository: m.daocloud.io/ghcr.io/dexidp/dex
          server:
            ingress:
              enabled: true
              ingressClassName: nginx
              annotations:
                nginx.ingress.kubernetes.io/ssl-passthrough: "true"
                cert-manager.io/cluster-issuer: self-signed-ca-issuer
                nginx.ingress.kubernetes.io/backend-protocol: HTTPS
              hostname: argo-cd.ay.dev
              path: /
              pathType: Prefix
              tls: true
          
          2. Install argoCD
          helm upgrade --install argo-cd argo-cd \
            --namespace argocd \
            --create-namespace \
            --version 8.3.5 \
            --repo https://aaronyang0628.github.io/helm-chart-mirror/charts \
            --values argocd.values.yaml \
            --atomic
          
          helm install argo-cd argo-cd \
            --namespace argocd \
            --create-namespace \
            --version 8.3.5 \
            --repo https://argoproj.github.io/argo-helm \
            --values argocd.values.yaml \
            --atomic
          

By default, you can install Argo CD directly from the official manifests:

          kubectl create namespace argocd \
          && kubectl apply -n argocd -f https://raw.githubusercontent.com/argoproj/argo-cd/stable/manifests/install.yaml

Or you can point kubectl at your own manifest file.

3. prepare argocd-server-external.yaml

          Install By
          kubectl -n argocd apply -f - <<EOF
          apiVersion: v1
          kind: Service
          metadata:
            labels:
              app.kubernetes.io/component: server
              app.kubernetes.io/instance: argo-cd
              app.kubernetes.io/name: argocd-server-external
              app.kubernetes.io/part-of: argocd
            name: argocd-server-external
          spec:
            ports:
            - name: https
              port: 443
              protocol: TCP
              targetPort: 8080
              nodePort: 30443
            selector:
              app.kubernetes.io/instance: argo-cd
              app.kubernetes.io/name: argocd-server
            type: NodePort
          EOF
          kubectl -n argocd apply -f - <<EOF
          apiVersion: v1
          kind: Service
          metadata:
            labels:
              app.kubernetes.io/component: server
              app.kubernetes.io/instance: argo-cd
              app.kubernetes.io/name: argocd-server-external
              app.kubernetes.io/part-of: argocd
              app.kubernetes.io/version: v2.8.4
            name: argocd-server-external
          spec:
            ports:
            - name: https
              port: 443
              protocol: TCP
              targetPort: 8080
              nodePort: 30443
            selector:
              app.kubernetes.io/instance: argo-cd
              app.kubernetes.io/name: argocd-server
            type: NodePort
          EOF

4. create external service

          kubectl -n argocd apply -f argocd-server-external.yaml
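You can confirm the NodePort service exists and exposes 443 on nodePort 30443:

kubectl -n argocd get svc argocd-server-external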

5. [Optional] prepare argocd-server-ingress.yaml

Before you create the ingress, cert-manager must be installed and a ClusterIssuer named self-signed-ca-issuer must exist; if not, please check 🔗link

          Install By
          kubectl -n argocd apply -f - <<EOF
          apiVersion: networking.k8s.io/v1
          kind: Ingress
          metadata:
            annotations:
              cert-manager.io/cluster-issuer: self-signed-ca-issuer
              nginx.ingress.kubernetes.io/backend-protocol: HTTPS
              nginx.ingress.kubernetes.io/ssl-passthrough: "true"
            name: argo-cd-argocd-server
            namespace: argocd
          spec:
            ingressClassName: nginx
            rules:
            - host: argo-cd.ay.dev
              http:
                paths:
                - backend:
                    service:
                      name: argo-cd-argocd-server
                      port:
                        number: 443
                  path: /
                  pathType: Prefix
            tls:
            - hosts:
              - argo-cd.ay.dev
              secretName: argo-cd.ay.dev-tls
          EOF
          apiVersion: networking.k8s.io/v1
          kind: Ingress
          metadata:
            annotations:
              cert-manager.io/cluster-issuer: self-signed-ca-issuer
              nginx.ingress.kubernetes.io/backend-protocol: HTTPS
              nginx.ingress.kubernetes.io/ssl-passthrough: "true"
            name: argo-cd-argocd-server
            namespace: argocd
          spec:
            ingressClassName: nginx
            rules:
            - host: argo-cd.ay.dev
              http:
                paths:
                - backend:
                    service:
                      name: argo-cd-argocd-server
                      port:
                        number: 443
                  path: /
                  pathType: Prefix
            tls:
            - hosts:
              - argo-cd.ay.dev
              secretName: argo-cd.ay.dev-tls

6. [Optional] create ingress

kubectl -n argocd apply -f argocd-server-ingress.yaml

7. get argocd initialized password

          kubectl -n argocd get secret argocd-initial-admin-secret -o jsonpath="{.data.password}" | base64 -d

8. login argocd

          ARGOCD_PASS=$(kubectl -n argocd get secret argocd-initial-admin-secret -o jsonpath="{.data.password}" | base64 -d)
          MASTER_IP=$(kubectl get nodes --selector=node-role.kubernetes.io/control-plane -o jsonpath='{$.items[0].status.addresses[?(@.type=="InternalIP")].address}')
          argocd login --insecure --username admin $MASTER_IP:30443 --password $ARGOCD_PASS

          if you deploy argocd in minikube, you might need to forward this port

          ssh -i ~/.minikube/machines/minikube/id_rsa docker@$(minikube ip) -L '*:30443:0.0.0.0:30443' -N -f
          open https://$(minikube ip):30443

if you use the ingress, you might need to trust the self-signed CA (or allow an insecure connection) in your browser

          kubectl -n basic-components get secret root-secret -o jsonpath='{.data.tls\.crt}' | base64 -d > cert-manager-self-signed-ca-secret.crt
          open https://argo-cd.ay.dev
          Mar 7, 2024

          Install Argo WorkFlow

          Preliminary

• Kubernetes has been installed; if not, check 🔗link
• Argo CD has been installed; if not, check 🔗link
• cert-manager has been installed via argocd and a ClusterIssuer named self-signed-ca-issuer exists; if not, check 🔗link
          kubectl get namespace business-workflows > /dev/null 2>&1 || kubectl create namespace business-workflows

          1. prepare argo-workflows.yaml

          content
          apiVersion: argoproj.io/v1alpha1
          kind: Application
          metadata:
            name: argo-workflows
          spec:
            syncPolicy:
              syncOptions:
              - CreateNamespace=true
            project: default
            source:
              repoURL: https://argoproj.github.io/argo-helm
              chart: argo-workflows
              targetRevision: 0.45.27
              helm:
                releaseName: argo-workflows
                values: |
                  crds:
                    install: true
                    keep: false
                  singleNamespace: false
                  controller:
                    image:
                      registry: m.daocloud.io/quay.io
                    workflowNamespaces:
                      - business-workflows
                  executor:
                    image:
                      registry: m.daocloud.io/quay.io
                  workflow:
                    serviceAccount:
                      create: true
                    rbac:
                      create: true
                  server:
                    enabled: true
                    image:
                      registry: m.daocloud.io/quay.io
                    ingress:
                      enabled: true
                      ingressClassName: nginx
                      annotations:
                        cert-manager.io/cluster-issuer: self-signed-ca-issuer
                        nginx.ingress.kubernetes.io/rewrite-target: /$1
                        nginx.ingress.kubernetes.io/use-regex: "true"
                      hosts:
                        - argo-workflows.ay.dev
                      paths:
                        - /?(.*)
                      pathType: ImplementationSpecific
                      tls:
                        - secretName: argo-workflows.ay.dev-tls
                          hosts:
                            - argo-workflows.ay.dev
                    authModes:
                      - server
                      - client
                    sso:
                      enabled: false
            destination:
              server: https://kubernetes.default.svc
              namespace: workflows
          kubectl -n argocd apply -f - << EOF
          apiVersion: argoproj.io/v1alpha1
          kind: Application
          metadata:
            name: argo-workflows
          spec:
            syncPolicy:
              syncOptions:
              - CreateNamespace=true
            project: default
            source:
              repoURL: https://argoproj.github.io/argo-helm
              chart: argo-workflows
              targetRevision: 0.45.27
              helm:
                releaseName: argo-workflows
                values: |
                  crds:
                    install: true
                    keep: false
                  singleNamespace: false
                  controller:
                    image:
                      registry: m.daocloud.io/quay.io
                    workflowNamespaces:
                      - business-workflows
                  executor:
                    image:
                      registry: m.daocloud.io/quay.io
                  workflow:
                    serviceAccount:
                      create: true
                    rbac:
                      create: true
                  server:
                    enabled: true
                    image:
                      registry: m.daocloud.io/quay.io
                    ingress:
                      enabled: true
                      ingressClassName: nginx
                      annotations:
                        cert-manager.io/cluster-issuer: self-signed-ca-issuer
                        nginx.ingress.kubernetes.io/rewrite-target: /$1
                        nginx.ingress.kubernetes.io/use-regex: "true"
                      hosts:
                        - argo-workflows.ay.dev
                      paths:
                        - /?(.*)
                      pathType: ImplementationSpecific
                      tls:
                        - secretName: argo-workflows.ay.dev-tls
                          hosts:
                            - argo-workflows.ay.dev
                    authModes:
                      - server
                      - client
                    sso:
                      enabled: false
            destination:
              server: https://kubernetes.default.svc
              namespace: workflows
          EOF

          2. install argo workflow binary

          3. [Optional] apply to k8s

          kubectl -n argocd apply -f argo-workflows.yaml

          4. sync by argocd

          argocd app sync argocd/argo-workflows

          5. submit a test workflow

          argo -n business-workflows submit https://raw.githubusercontent.com/argoproj/argo-workflows/master/examples/hello-world.yaml --serviceaccount=argo-workflow

          6. check workflow status

          # list all flows
          argo -n business-workflows list
          # get specific flow status
          argo -n business-workflows get <$flow_name>
          # get specific flow log
          argo -n business-workflows logs <$flow_name>
          # get specific flow log continuously
          argo -n business-workflows logs <$flow_name> --watch
          Mar 7, 2024

          Install Argo Event

          Preliminary

• Kubernetes has been installed; if not, check 🔗link
• Argo CD has been installed; if not, check 🔗link

          1. prepare argo-events.yaml

          apiVersion: argoproj.io/v1alpha1
          kind: Application
          metadata:
            name: argo-events
          spec:
            syncPolicy:
              syncOptions:
              - CreateNamespace=true
            project: default
            source:
              repoURL: https://argoproj.github.io/argo-helm
              chart: argo-events
              targetRevision: 2.4.2
              helm:
                releaseName: argo-events
                values: |
                  openshift: false
                  createAggregateRoles: true
                  crds:
                    install: true
                    keep: true
                  global:
                    image:
                      repository: m.daocloud.io/quay.io/argoproj/argo-events
                  controller:
                    replicas: 1
                    resources: {}
                  webhook:
                    enabled: true
                    replicas: 1
                    port: 12000
                    resources: {}
                  extraObjects:
                    - apiVersion: networking.k8s.io/v1
                      kind: Ingress
                      metadata:
                        annotations:
                          cert-manager.io/cluster-issuer: self-signed-ca-issuer
                          nginx.ingress.kubernetes.io/rewrite-target: /$1
                        labels:
                          app.kubernetes.io/instance: argo-events
                          app.kubernetes.io/managed-by: Helm
                          app.kubernetes.io/name: argo-events-events-webhook
                          app.kubernetes.io/part-of: argo-events
                          argocd.argoproj.io/instance: argo-events
                        name: argo-events-webhook
                      spec:
                        ingressClassName: nginx
                        rules:
                        - host: argo-events.webhook.ay.dev
                          http:
                            paths:
                            - backend:
                                service:
                                  name: events-webhook
                                  port:
                                    number: 12000
                              path: /?(.*)
                              pathType: ImplementationSpecific
                        tls:
                        - hosts:
                          - argo-events.webhook.ay.dev
                          secretName: argo-events-webhook-tls
            destination:
              server: https://kubernetes.default.svc
              namespace: argocd

          4. apply to k8s

          kubectl -n argocd apply -f argo-events.yaml

          5. sync by argocd

          argocd app sync argocd/argo-events
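A simple way to confirm the install registered the Argo Events CRDs (EventBus, EventSource, Sensor):

kubectl get crd | grep -E 'eventbus|eventsources|sensors'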
          Mar 7, 2024

          Container

          Articles

FAQ

Q1: What is the difference between docker, podman, and buildah?

In short:

• Docker uses a long-running daemon (dockerd) plus a CLI that talks to it; it builds, runs and distributes containers.
• Podman is a daemonless, rootless-capable container engine whose CLI is largely compatible with docker (alias docker=podman usually just works).
• Buildah focuses on building OCI/Docker images and manipulating image layers without a daemon; podman build actually reuses buildah's code under the hood.

          Mar 7, 2025

          Subsections of Container

          Install Buildah

          Reference

          Prerequisites

• Kernel Version Requirements: to run Buildah on Red Hat Enterprise Linux or CentOS, version 7.4 or higher is required. On other Linux distributions Buildah requires a kernel version that supports the OverlayFS and/or fuse-overlayfs filesystem – you’ll need to consult your distribution’s documentation to determine a minimum version number.

• runc Requirement: Buildah uses runc to run commands when buildah run is used, or when buildah build encounters a RUN instruction, so you’ll also need to build and install a compatible version of runc for Buildah to call for those cases. If Buildah is installed via a package manager such as yum, dnf or apt-get, runc will be installed as part of that process.

• CNI Requirement: when Buildah uses runc to run commands, it defaults to running those commands in the host’s network namespace. If the command is being run in a separate user namespace, though, for example when ID mapping is used, then the command will also be run in a separate network namespace.

          A newly-created network namespace starts with no network interfaces, so commands which are run in that namespace are effectively disconnected from the network unless additional setup is done. Buildah relies on the CNI library and plugins to set up interfaces and routing for network namespaces.

          something wrong with CNI

          If Buildah is installed via a package manager such as yum, dnf or apt-get, a package containing CNI plugins may be available (in Fedora, the package is named containernetworking-cni). If not, they will need to be installed, for example using:

          git clone https://github.com/containernetworking/plugins
          ( cd ./plugins; ./build_linux.sh )
          sudo mkdir -p /opt/cni/bin
          sudo install -v ./plugins/bin/* /opt/cni/bin

          The CNI library needs to be configured so that it will know which plugins to call to set up namespaces. Usually, this configuration takes the form of one or more configuration files in the /etc/cni/net.d directory. A set of example configuration files is included in the docs/cni-examples directory of this source tree.

          Installation

          Caution

If you already have something wrong with apt update, please check the following 🔗link; adding the docker source won't help you solve that problem.

          sudo dnf update -y 
          sudo dnf -y install buildah

Once the installation is complete, the buildah images command will list all the images:

          buildah images
          sudo yum -y install buildah

Once the installation is complete, verify it:

sudo buildah --version
1. Install buildah from your distribution's apt repository.
sudo apt-get -y update
sudo apt-get -y install buildah
2. Verify that the installation was successful:
sudo buildah --version

Info

• Buildah stores its images under /var/lib/containers/storage (or ~/.local/share/containers/storage when running rootless)

          Mirror

Buildah (and Podman) read registry settings from /etc/containers/registries.conf, so you can add a mirror there:

[[registry]]
prefix = "docker.io"
location = "docker.io"

  [[registry.mirror]]
  location = "<$mirror_url>"

for example:

• docker.mirrors.ustc.edu.cn
          Mar 7, 2025

          Install Docker

          Mar 7, 2025

          Install Podman

          Reference

          Installation

          Caution

If you already have something wrong with apt update, please check the following 🔗link; adding the docker source won't help you solve that problem.

          sudo dnf update -y 
          sudo dnf -y install podman
          sudo yum install -y podman
          sudo apt-get update
          sudo apt-get -y install podman

          Run Params

start a container

podman run [params]

--rm: automatically remove the container when it exits

-v: bind mount a volume into the container

          Example

podman run --rm \
                -v /root/kserve/iris-input.json:/tmp/iris-input.json \
                --privileged \
               -e MODEL_NAME=sklearn-iris \
               -e INPUT_PATH=/tmp/iris-input.json \
               -e SERVICE_HOSTNAME=sklearn-iris.kserve-test.example.com \
                -it m.daocloud.io/docker.io/library/golang:1.22  sh -c "command A; command B; exec bash"
          Mar 7, 2025

          Subsections of Database

          Install Clickhouse

          Installation

          Install By

          Preliminary

1. Kubernetes has been installed; if not, check 🔗link


2. Helm has been installed; if not, check 🔗link


          Preliminary

1. Kubernetes has been installed; if not, check 🔗link


2. argoCD has been installed; if not, check 🔗link


3. cert-manager has been installed on argocd and a ClusterIssuer named `self-signed-ca-issuer` exists; if not, check 🔗link


          1.prepare admin credentials secret

          Details
          kubectl get namespaces database > /dev/null 2>&1 || kubectl create namespace database
          kubectl -n database create secret generic clickhouse-admin-credentials \
              --from-literal=password=$(tr -dc A-Za-z0-9 </dev/urandom | head -c 16)

          2.prepare `deploy-clickhouse.yaml`

          Details
          apiVersion: argoproj.io/v1alpha1
          kind: Application
          metadata:
            name: clickhouse
          spec:
            syncPolicy:
              syncOptions:
              - CreateNamespace=true
            project: default
            source:
              repoURL: https://charts.bitnami.com/bitnami
              chart: clickhouse
              targetRevision: 4.5.1
              helm:
                releaseName: clickhouse
                values: |
                  serviceAccount:
                    name: clickhouse
                  image:
                    registry: m.daocloud.io/docker.io
                    pullPolicy: IfNotPresent
                  volumePermissions:
                    enabled: false
                    image:
                      registry: m.daocloud.io/docker.io
                      pullPolicy: IfNotPresent
                  zookeeper:
                    enabled: true
                    image:
                      registry: m.daocloud.io/docker.io
                      pullPolicy: IfNotPresent
                    replicaCount: 3
                    persistence:
                      enabled: true
                      storageClass: nfs-external
                      size: 8Gi
                    volumePermissions:
                      enabled: false
                      image:
                        registry: m.daocloud.io/docker.io
                        pullPolicy: IfNotPresent
                  shards: 2
                  replicaCount: 3
                  ingress:
                    enabled: true
                    annotations:
                      cert-manager.io/cluster-issuer: self-signed-ca-issuer
                      nginx.ingress.kubernetes.io/rewrite-target: /$1
                    hostname: clickhouse.dev.geekcity.tech
                    ingressClassName: nginx
                    path: /?(.*)
                    tls: true
                  persistence:
                    enabled: false
                  resources:
                    requests:
                      cpu: 2
                      memory: 512Mi
                    limits:
                      cpu: 3
                      memory: 1024Mi
                  auth:
                    username: admin
                    existingSecret: clickhouse-admin-credentials
                    existingSecretKey: password
                  metrics:
                    enabled: true
                    image:
                      registry: m.daocloud.io/docker.io
                      pullPolicy: IfNotPresent
                    serviceMonitor:
                      enabled: true
                      namespace: monitor
                      jobLabel: clickhouse
                      selector:
                        app.kubernetes.io/name: clickhouse
                        app.kubernetes.io/instance: clickhouse
                      labels:
                        release: prometheus-stack
                  extraDeploy:
                    - |
                      apiVersion: apps/v1
                      kind: Deployment
                      metadata:
                        name: clickhouse-tool
                        namespace: database
                        labels:
                          app.kubernetes.io/name: clickhouse-tool
                      spec:
                        replicas: 1
                        selector:
                          matchLabels:
                            app.kubernetes.io/name: clickhouse-tool
                        template:
                          metadata:
                            labels:
                              app.kubernetes.io/name: clickhouse-tool
                          spec:
                            containers:
                              - name: clickhouse-tool
                                image: m.daocloud.io/docker.io/clickhouse/clickhouse-server:23.11.5.29-alpine
                                imagePullPolicy: IfNotPresent
                                env:
                                  - name: CLICKHOUSE_USER
                                    value: admin
                                  - name: CLICKHOUSE_PASSWORD
                                    valueFrom:
                                      secretKeyRef:
                                        key: password
                                        name: clickhouse-admin-credentials
                                  - name: CLICKHOUSE_HOST
value: clickhouse
                                  - name: CLICKHOUSE_PORT
                                    value: "9000"
                                  - name: TZ
                                    value: Asia/Shanghai
                                command:
                                  - tail
                                args:
                                  - -f
                                  - /etc/hosts
            destination:
              server: https://kubernetes.default.svc
              namespace: database

          3.deploy clickhouse

          Details
          kubectl -n argocd apply -f deploy-clickhouse.yaml

          4.sync by argocd

          Details
          argocd app sync argocd/clickhouse

          5.prepare `clickhouse-interface.yaml`

          Details
          apiVersion: v1
          kind: Service
          metadata:
            labels:
              app.kubernetes.io/component: clickhouse
              app.kubernetes.io/instance: clickhouse
            name: clickhouse-interface
          spec:
            ports:
            - name: http
              port: 8123
              protocol: TCP
              targetPort: http
              nodePort: 31567
            - name: tcp
              port: 9000
              protocol: TCP
              targetPort: tcp
              nodePort: 32005
            selector:
              app.kubernetes.io/component: clickhouse
              app.kubernetes.io/instance: clickhouse
              app.kubernetes.io/name: clickhouse
            type: NodePort

          6.apply to k8s

          Details
          kubectl -n database apply -f clickhouse-interface.yaml

          7.extract clickhouse admin credentials

          Details
          kubectl -n database get secret clickhouse-admin-credentials -o jsonpath='{.data.password}' | base64 -d

          8.invoke http api

          Details
          add `$K8S_MASTER_IP clickhouse.dev.geekcity.tech` to **/etc/hosts**
          CK_PASS=$(kubectl -n database get secret clickhouse-admin-credentials -o jsonpath='{.data.password}' | base64 -d)
          echo 'SELECT version()' | curl -k "https://admin:${CK_PASS}@clickhouse.dev.geekcity.tech:32443/" --data-binary @-
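Alternatively, a sketch of querying over the native TCP protocol through the clickhouse-tool Deployment defined in the extraDeploy block above (this assumes the ClickHouse service inside the database namespace is named clickhouse):

CK_PASS=$(kubectl -n database get secret clickhouse-admin-credentials -o jsonpath='{.data.password}' | base64 -d)
kubectl -n database exec deploy/clickhouse-tool -- \
    clickhouse-client --host clickhouse --port 9000 --user admin --password "${CK_PASS}" --query "SELECT version()"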

          Preliminary

1. Docker has been installed; if not, check 🔗link


          Using Proxy

you can run an additional daocloud image to accelerate your pulling, check Daocloud Proxy

          1.init server

          Details
          mkdir -p clickhouse/{data,logs}
          podman run --rm \
              --ulimit nofile=262144:262144 \
              --name clickhouse-server \
              -p 18123:8123 \
              -p 19000:9000 \
              -v $(pwd)/clickhouse/data:/var/lib/clickhouse \
              -v $(pwd)/clickhouse/logs:/var/log/clickhouse-server \
              -e CLICKHOUSE_DB=my_database \
              -e CLICKHOUSE_DEFAULT_ACCESS_MANAGEMENT=1 \
              -e CLICKHOUSE_USER=ayayay \
              -e CLICKHOUSE_PASSWORD=123456 \
              -d m.daocloud.io/docker.io/clickhouse/clickhouse-server:23.11.5.29-alpine

          2.check dashboard

          And then you can visit 🔗http://localhost:18123

          3.use cli api

The native TCP interface listens on 🔗localhost:19000; you can query it with clickhouse-client:
          Details
          podman run --rm \
            --entrypoint clickhouse-client \
            -it m.daocloud.io/docker.io/clickhouse/clickhouse-server:23.11.5.29-alpine \
            --host host.containers.internal \
            --port 19000 \
            --user ayayay \
            --password 123456 \
            --query "select version()"

          4.use visual client

          Details
          podman run --rm -p 8080:80 -d m.daocloud.io/docker.io/spoonest/clickhouse-tabix-web-client:stable

          Preliminary

1. Kubernetes has been installed; if not, check 🔗link


2. ArgoCD has been installed; if not, check 🔗link


3. Argo Workflow has been installed; if not, check 🔗link


          1.prepare `argocd-login-credentials`

          Details
kubectl get namespace business-workflows > /dev/null 2>&1 || kubectl create namespace business-workflows
ARGOCD_PASSWORD=$(kubectl -n argocd get secret argocd-initial-admin-secret -o jsonpath='{.data.password}' | base64 -d)
kubectl -n business-workflows create secret generic argocd-login-credentials \
    --from-literal=username=admin \
    --from-literal=password=${ARGOCD_PASSWORD}

          2.apply rolebinding to k8s

          Details
          kubectl apply -f - <<EOF
          ---
          apiVersion: rbac.authorization.k8s.io/v1
          kind: ClusterRole
          metadata:
            name: application-administrator
          rules:
            - apiGroups:
                - argoproj.io
              resources:
                - applications
              verbs:
                - '*'
            - apiGroups:
                - apps
              resources:
                - deployments
              verbs:
                - '*'
          
          ---
          apiVersion: rbac.authorization.k8s.io/v1
          kind: RoleBinding
          metadata:
            name: application-administration
            namespace: argocd
          roleRef:
            apiGroup: rbac.authorization.k8s.io
            kind: ClusterRole
            name: application-administrator
          subjects:
            - kind: ServiceAccount
              name: argo-workflow
              namespace: business-workflows
          
          ---
          apiVersion: rbac.authorization.k8s.io/v1
          kind: RoleBinding
          metadata:
            name: application-administration
            namespace: application
          roleRef:
            apiGroup: rbac.authorization.k8s.io
            kind: ClusterRole
            name: application-administrator
          subjects:
            - kind: ServiceAccount
              name: argo-workflow
              namespace: business-workflows
          EOF

          4.prepare clickhouse admin credentials secret

          Details
          kubectl get namespace application > /dev/null 2>&1 || kubectl create namespace application
          kubectl -n application create secret generic clickhouse-admin-credentials \
            --from-literal=password=$(tr -dc A-Za-z0-9 </dev/urandom | head -c 16)

          5.prepare deploy-clickhouse-flow.yaml

          Details
          apiVersion: argoproj.io/v1alpha1
          kind: Workflow
          metadata:
            generateName: deploy-argocd-app-ck-
          spec:
            entrypoint: entry
            artifactRepositoryRef:
              configmap: artifact-repositories
              key: default-artifact-repository
            serviceAccountName: argo-workflow
            templates:
            - name: entry
              inputs:
                parameters:
                - name: argocd-server
                  value: argo-cd-argocd-server.argocd:443
                - name: insecure-option
                  value: --insecure
              dag:
                tasks:
                - name: apply
                  template: apply
                - name: prepare-argocd-binary
                  template: prepare-argocd-binary
                  dependencies:
                  - apply
                - name: sync
                  dependencies:
                  - prepare-argocd-binary
                  template: sync
                  arguments:
                    artifacts:
                    - name: argocd-binary
                      from: "{{tasks.prepare-argocd-binary.outputs.artifacts.argocd-binary}}"
                    parameters:
                    - name: argocd-server
                      value: "{{inputs.parameters.argocd-server}}"
                    - name: insecure-option
                      value: "{{inputs.parameters.insecure-option}}"
                - name: wait
                  dependencies:
                  - sync
                  template: wait
                  arguments:
                    artifacts:
                    - name: argocd-binary
                      from: "{{tasks.prepare-argocd-binary.outputs.artifacts.argocd-binary}}"
                    parameters:
                    - name: argocd-server
                      value: "{{inputs.parameters.argocd-server}}"
                    - name: insecure-option
                      value: "{{inputs.parameters.insecure-option}}"
            - name: apply
              resource:
                action: apply
                manifest: |
                  apiVersion: argoproj.io/v1alpha1
                  kind: Application
                  metadata:
                    name: app-clickhouse
                    namespace: argocd
                  spec:
                    syncPolicy:
                      syncOptions:
                      - CreateNamespace=true
                    project: default
                    source:
                      repoURL: https://charts.bitnami.com/bitnami
                      chart: clickhouse
                      targetRevision: 4.5.3
                      helm:
                        releaseName: app-clickhouse
                        values: |
                          image:
                            registry: docker.io
                            repository: bitnami/clickhouse
                            tag: 23.12.3-debian-11-r0
                            pullPolicy: IfNotPresent
                          service:
                            type: ClusterIP
                          volumePermissions:
                            enabled: false
                            image:
                              registry: m.daocloud.io/docker.io
                              pullPolicy: IfNotPresent
                          ingress:
                            enabled: true
                            ingressClassName: nginx
                            annotations:
                              cert-manager.io/cluster-issuer: self-signed-ca-issuer
                              nginx.ingress.kubernetes.io/rewrite-target: /$1
                            path: /?(.*)
                            hostname: clickhouse.dev.geekcity.tech
                            tls: true
                          shards: 2
                          replicaCount: 3
                          persistence:
                            enabled: false
                          auth:
                            username: admin
                            existingSecret: clickhouse-admin-credentials
                            existingSecretKey: password
                          zookeeper:
                            enabled: true
                            image:
                              registry: m.daocloud.io/docker.io
                              repository: bitnami/zookeeper
                              tag: 3.8.3-debian-11-r8
                              pullPolicy: IfNotPresent
                            replicaCount: 3
                            persistence:
                              enabled: false
                            volumePermissions:
                              enabled: false
                              image:
                                registry: m.daocloud.io/docker.io
                                pullPolicy: IfNotPresent
                    destination:
                      server: https://kubernetes.default.svc
                      namespace: application
            - name: prepare-argocd-binary
              inputs:
                artifacts:
                - name: argocd-binary
                  path: /tmp/argocd
                  mode: 755
                  http:
                    url: https://files.m.daocloud.io/github.com/argoproj/argo-cd/releases/download/v2.9.3/argocd-linux-amd64
              outputs:
                artifacts:
                - name: argocd-binary
                  path: "{{inputs.artifacts.argocd-binary.path}}"
              container:
                image: m.daocloud.io/docker.io/library/fedora:39
                command:
                - sh
                - -c
                args:
                - |
                  ls -l {{inputs.artifacts.argocd-binary.path}}
            - name: sync
              inputs:
                artifacts:
                - name: argocd-binary
                  path: /usr/local/bin/argocd
                parameters:
                - name: argocd-server
                - name: insecure-option
                  value: ""
              container:
                image: m.daocloud.io/docker.io/library/fedora:39
                env:
                - name: ARGOCD_USERNAME
                  valueFrom:
                    secretKeyRef:
                      name: argocd-login-credentials
                      key: username
                - name: ARGOCD_PASSWORD
                  valueFrom:
                    secretKeyRef:
                      name: argocd-login-credentials
                      key: password
                - name: WITH_PRUNE_OPTION
                  value: --prune
                command:
                - sh
                - -c
                args:
                - |
                  set -e
                  export ARGOCD_SERVER={{inputs.parameters.argocd-server}}
                  export INSECURE_OPTION={{inputs.parameters.insecure-option}}
                  export ARGOCD_USERNAME=${ARGOCD_USERNAME:-admin}
                  argocd login ${INSECURE_OPTION} --username ${ARGOCD_USERNAME} --password ${ARGOCD_PASSWORD} ${ARGOCD_SERVER}
                  argocd app sync argocd/app-clickhouse ${WITH_PRUNE_OPTION} --timeout 300
            - name: wait
              inputs:
                artifacts:
                - name: argocd-binary
                  path: /usr/local/bin/argocd
                parameters:
                - name: argocd-server
                - name: insecure-option
                  value: ""
              container:
                image: m.daocloud.io/docker.io/library/fedora:39
                env:
                - name: ARGOCD_USERNAME
                  valueFrom:
                    secretKeyRef:
                      name: argocd-login-credentials
                      key: username
                - name: ARGOCD_PASSWORD
                  valueFrom:
                    secretKeyRef:
                      name: argocd-login-credentials
                      key: password
                command:
                - sh
                - -c
                args:
                - |
                  set -e
                  export ARGOCD_SERVER={{inputs.parameters.argocd-server}}
                  export INSECURE_OPTION={{inputs.parameters.insecure-option}}
                  export ARGOCD_USERNAME=${ARGOCD_USERNAME:-admin}
                  argocd login ${INSECURE_OPTION} --username ${ARGOCD_USERNAME} --password ${ARGOCD_PASSWORD} ${ARGOCD_SERVER}
                  argocd app wait argocd/app-clickhouse

6.submit to argo workflow client

          Details
          argo -n business-workflows submit deploy-clickhouse-flow.yaml

          7.extract clickhouse admin credentials

          Details
          kubectl -n application get secret clickhouse-admin-credentials -o jsonpath='{.data.password}' | base64 -d

          8.invoke http api

          Details
          add `$K8S_MASTER_IP clickhouse.dev.geekcity.tech` to **/etc/hosts**
          CK_PASSWORD=$(kubectl -n application get secret clickhouse-admin-credentials -o jsonpath='{.data.password}' | base64 -d) && echo 'SELECT version()' | curl -k "https://admin:${CK_PASSWORD}@clickhouse.dev.geekcity.tech/" --data-binary @-

          9.create external interface

          Details
          kubectl -n application apply -f - <<EOF
          apiVersion: v1
          kind: Service
          metadata:
            labels:
              app.kubernetes.io/component: clickhouse
              app.kubernetes.io/instance: app-clickhouse
              app.kubernetes.io/managed-by: Helm
              app.kubernetes.io/name: clickhouse
              app.kubernetes.io/version: 23.12.2
              argocd.argoproj.io/instance: app-clickhouse
              helm.sh/chart: clickhouse-4.5.3
            name: app-clickhouse-service-external
          spec:
            ports:
            - name: tcp
              port: 9000
              protocol: TCP
              targetPort: tcp
              nodePort: 30900
            selector:
              app.kubernetes.io/component: clickhouse
              app.kubernetes.io/instance: app-clickhouse
              app.kubernetes.io/name: clickhouse
            type: NodePort
          EOF

          Mar 7, 2024

          Install ElasticSearch

          Installation

          Install By

          Preliminary

          1. Kubernetes has installed, if not check 🔗link


          2. Helm has installed, if not check 🔗link


          1.get helm repo

          Details
          helm repo add ay-helm-mirror https://aaronyang0628.github.io/helm-chart-mirror/charts
          helm repo update

          2.install chart

          Details
          helm install ay-helm-mirror/kube-prometheus-stack --generate-name
          Using Proxy

          Preliminary

          1. Kubernetes has installed, if not check 🔗link


          2. Helm has installed, if not check 🔗link


          3. ArgoCD has installed, if not check 🔗link


          1.prepare `deploy-elasticsearch.yaml`

          Details
          kubectl apply -f - << EOF
          apiVersion: argoproj.io/v1alpha1
          kind: Application
          metadata:
            name: elastic-search
          spec:
            syncPolicy:
              syncOptions:
              - CreateNamespace=true
            project: default
            source:
              repoURL: https://charts.bitnami.com/bitnami
              chart: elasticsearch
              targetRevision: 19.11.3
              helm:
                releaseName: elastic-search
                values: |
                  global:
                    kibanaEnabled: true
                  clusterName: elastic
                  image:
                    registry: m.zjvis.net/docker.io
                    pullPolicy: IfNotPresent
                  security:
                    enabled: false
                  service:
                    type: ClusterIP
                  ingress:
                    enabled: true
                    annotations:
                      cert-manager.io/cluster-issuer: self-signed-ca-issuer
                      nginx.ingress.kubernetes.io/rewrite-target: /$1
                    hostname: elastic-search.dev.tech
                    ingressClassName: nginx
                    path: /?(.*)
                    tls: true
                  master:
                    masterOnly: false
                    replicaCount: 1
                    persistence:
                      enabled: false
                    resources:
                      requests:
                        cpu: 2
                        memory: 1024Mi
                      limits:
                        cpu: 4
                        memory: 4096Mi
                    heapSize: 2g
                  data:
                    replicaCount: 0
                    persistence:
                      enabled: false
                  coordinating:
                    replicaCount: 0
                  ingest:
                    enabled: true
                    replicaCount: 0
                    service:
                      enabled: false
                      type: ClusterIP
                    ingress:
                      enabled: false
                  metrics:
                    enabled: false
                    image:
                      registry: m.zjvis.net/docker.io
                      pullPolicy: IfNotPresent
                  volumePermissions:
                    enabled: false
                    image:
                      registry: m.zjvis.net/docker.io
                      pullPolicy: IfNotPresent
                  sysctlImage:
                    enabled: true
                    registry: m.zjvis.net/docker.io
                    pullPolicy: IfNotPresent
                  kibana:
                    elasticsearch:
                      hosts:
                        - '{{ include "elasticsearch.service.name" . }}'
                      port: '{{ include "elasticsearch.service.ports.restAPI" . }}'
                  esJavaOpts: "-Xmx2g -Xms2g"        
            destination:
              server: https://kubernetes.default.svc
              namespace: application
          EOF

          3.sync by argocd

          Details
          argocd app sync argocd/elastic-search

4.extract elasticsearch admin credentials

Details
security is disabled in the chart values above (`security.enabled: false`), so no admin credentials are created and the HTTP API can be called without authentication

          5.invoke http api

          Details
          add `$K8S_MASTER_IP elastic-search.dev.tech` to `/etc/hosts`
          curl -k -H "Content-Type: application/json" \
              -X POST "https://elastic-search.dev.tech:32443/books/_doc?pretty" \
              -d '{"name": "Snow Crash", "author": "Neal Stephenson", "release_date": "1992-06-01", "page_count": 470}'

          Preliminary

          1. Docker|Podman|Buildah has installed, if not check 🔗link


          Using Mirror

you can run an additional daocloud image to accelerate your pulling, check Daocloud Proxy

          1.init server

          Details
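a minimal single-node sketch, assuming the official `docker.io/library/elasticsearch` image pulled through the daocloud mirror, with security disabled for local testing:

mkdir -p elasticsearch/data
chmod -R 777 elasticsearch/data
podman run --rm \
    --name elasticsearch \
    -p 9200:9200 \
    -p 9300:9300 \
    -e discovery.type=single-node \
    -e xpack.security.enabled=false \
    -v $(pwd)/elasticsearch/data:/usr/share/elasticsearch/data \
    -d m.daocloud.io/docker.io/library/elasticsearch:7.17.10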

          Preliminary

          1. Kubernetes has installed, if not check 🔗link


          2. Helm has installed, if not check 🔗link


          3. ArgoCD has installed, if not check 🔗link


          4. Argo Workflow has installed, if not check 🔗link


          1.prepare `argocd-login-credentials`

          Details
          kubectl get namespaces database > /dev/null 2>&1 || kubectl create namespace database

          2.apply rolebinding to k8s

          Details
          kubectl apply -f - <<EOF
          ---
          apiVersion: rbac.authorization.k8s.io/v1
          kind: ClusterRole
          metadata:
            name: application-administrator
          rules:
            - apiGroups:
                - argoproj.io
              resources:
                - applications
              verbs:
                - '*'
            - apiGroups:
                - apps
              resources:
                - deployments
              verbs:
                - '*'
          
          ---
          apiVersion: rbac.authorization.k8s.io/v1
          kind: RoleBinding
          metadata:
            name: application-administration
            namespace: argocd
          roleRef:
            apiGroup: rbac.authorization.k8s.io
            kind: ClusterRole
            name: application-administrator
          subjects:
            - kind: ServiceAccount
              name: argo-workflow
              namespace: business-workflows
          
          ---
          apiVersion: rbac.authorization.k8s.io/v1
          kind: RoleBinding
          metadata:
            name: application-administration
            namespace: application
          roleRef:
            apiGroup: rbac.authorization.k8s.io
            kind: ClusterRole
            name: application-administrator
          subjects:
            - kind: ServiceAccount
              name: argo-workflow
              namespace: business-workflows
          EOF

          4.prepare `deploy-xxxx-flow.yaml`

          Details

6.submit to argo workflow client

          Details
          argo -n business-workflows submit deploy-xxxx-flow.yaml

          7.decode password

          Details
          kubectl -n application get secret xxxx-credentials -o jsonpath='{.data.xxx-password}' | base64 -d


          Apr 12, 2024

          Install Kafka

          Installation

          Install By

          Preliminary

          1. Kubernetes has installed, if not check 🔗link


          2. Helm binary has installed, if not check 🔗link


          1.get helm repo

          Details
helm repo add bitnami https://charts.bitnami.com/bitnami
          helm repo update

          2.install chart

helm upgrade --create-namespace -n database kafka --install bitnami/kafka \
  --set global.imageRegistry=m.daocloud.io/docker.io \
  --set zookeeper.enabled=false \
  --set controller.replicaCount=1 \
  --set broker.replicaCount=1 \
  --set controller.persistence.enabled=false \
  --set broker.persistence.enabled=false \
  --version 28.0.3
          
          Details
          kubectl -n database \
            create secret generic client-properties \
            --from-literal=client.properties="$(printf "security.protocol=SASL_PLAINTEXT\nsasl.mechanism=SCRAM-SHA-256\nsasl.jaas.config=org.apache.kafka.common.security.scram.ScramLoginModule required username=\"user1\" password=\"$(kubectl get secret kafka-user-passwords --namespace database -o jsonpath='{.data.client-passwords}' | base64 -d | cut -d , -f 1)\";\n")"
          Details
          kubectl -n database apply -f - << EOF
          apiVersion: apps/v1
          kind: Deployment
          metadata:
            name: kafka-client-tools
            labels:
              app: kafka-client-tools
          spec:
            replicas: 1
            selector:
              matchLabels:
                app: kafka-client-tools
            template:
              metadata:
                labels:
                  app: kafka-client-tools
              spec:
                volumes:
                - name: client-properties
                  secret:
                    secretName: client-properties
                containers:
                - name: kafka-client-tools
                  image: m.daocloud.io/docker.io/bitnami/kafka:3.6.2
                  volumeMounts:
                  - name: client-properties
                    mountPath: /bitnami/custom/client.properties
                    subPath: client.properties
                    readOnly: true
                  env:
                  - name: BOOTSTRAP_SERVER
                    value: kafka.database.svc.cluster.local:9092
                  - name: CLIENT_CONFIG_FILE
                    value: /bitnami/custom/client.properties
                  command:
                  - tail
                  - -f
                  - /etc/hosts
                  imagePullPolicy: IfNotPresent
          EOF

          3.validate function

          - list topics
          Details
          kubectl -n database exec -it deployment/kafka-client-tools -- bash -c \
              'kafka-topics.sh --bootstrap-server $BOOTSTRAP_SERVER --command-config $CLIENT_CONFIG_FILE --list'
          - create topic
          Details
          kubectl -n database exec -it deployment/kafka-client-tools -- bash -c \
            'kafka-topics.sh --bootstrap-server $BOOTSTRAP_SERVER --command-config $CLIENT_CONFIG_FILE --create --if-not-exists --topic test-topic'
          - describe topic
          Details
          kubectl -n database exec -it deployment/kafka-client-tools -- bash -c \
            'kafka-topics.sh --bootstrap-server $BOOTSTRAP_SERVER --command-config $CLIENT_CONFIG_FILE --describe --topic test-topic'
          - produce message
          Details
          kubectl -n database exec -it deployment/kafka-client-tools -- bash -c \
            'for message in $(seq 0 10); do echo $message | kafka-console-producer.sh --bootstrap-server $BOOTSTRAP_SERVER --producer.config $CLIENT_CONFIG_FILE --topic test-topic; done'
          - consume message
          Details
          kubectl -n database exec -it deployment/kafka-client-tools -- bash -c \
            'kafka-console-consumer.sh --bootstrap-server $BOOTSTRAP_SERVER --consumer.config $CLIENT_CONFIG_FILE --topic test-topic --from-beginning'
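- delete topic (optional clean-up)
When you are done testing, the topic can be removed with the same client deployment; a sketch reusing the flags already used above:
Details
kubectl -n database exec -it deployment/kafka-client-tools -- bash -c \
  'kafka-topics.sh --bootstrap-server $BOOTSTRAP_SERVER --command-config $CLIENT_CONFIG_FILE --delete --topic test-topic'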

          Preliminary

          1. Kubernetes has installed, if not check 🔗link


          2. ArgoCD has installed, if not check 🔗link


          3. Helm binary has installed, if not check 🔗link


          1.prepare `deploy-kafka.yaml`

          kubectl -n argocd apply -f - << EOF
          apiVersion: argoproj.io/v1alpha1
          kind: Application
          metadata:
            name: kafka
          spec:
            syncPolicy:
              syncOptions:
              - CreateNamespace=true
            project: default
            source:
              repoURL: https://charts.bitnami.com/bitnami
              chart: kafka
              targetRevision: 28.0.3
              helm:
                releaseName: kafka
                values: |
                  image:
                    registry: m.daocloud.io/docker.io
                  controller:
                    replicaCount: 1
                    persistence:
                      enabled: false
                    logPersistence:
                      enabled: false
                    extraConfig: |
                      message.max.bytes=5242880
                      default.replication.factor=1
                      offsets.topic.replication.factor=1
                      transaction.state.log.replication.factor=1
                  broker:
                    replicaCount: 1
                    persistence:
                      enabled: false
                    logPersistence:
                      enabled: false
                    extraConfig: |
                      message.max.bytes=5242880
                      default.replication.factor=1
                      offsets.topic.replication.factor=1
                      transaction.state.log.replication.factor=1
                  externalAccess:
                    enabled: false
                    autoDiscovery:
                      enabled: false
                      image:
                        registry: m.daocloud.io/docker.io
                  volumePermissions:
                    enabled: false
                    image:
                      registry: m.daocloud.io/docker.io
                  metrics:
                    kafka:
                      enabled: false
                      image:
                        registry: m.daocloud.io/docker.io
                    jmx:
                      enabled: false
                      image:
                        registry: m.daocloud.io/docker.io
                  provisioning:
                    enabled: false
                  kraft:
                    enabled: true
                  zookeeper:
                    enabled: false
            destination:
              server: https://kubernetes.default.svc
              namespace: database
          EOF
          kubectl -n argocd apply -f - << EOF
          apiVersion: argoproj.io/v1alpha1
          kind: Application
          metadata:
            name: kafka
          spec:
            syncPolicy:
              syncOptions:
              - CreateNamespace=true
            project: default
            source:
              repoURL: https://charts.bitnami.com/bitnami
              chart: kafka
              targetRevision: 28.0.3
              helm:
                releaseName: kafka
                values: |
                  image:
                    registry: m.daocloud.io/docker.io
                  listeners:
                    client:
                      protocol: PLAINTEXT
                    interbroker:
                      protocol: PLAINTEXT
                  controller:
                    replicaCount: 0
                    persistence:
                      enabled: false
                    logPersistence:
                      enabled: false
                    extraConfig: |
                      message.max.bytes=5242880
                      default.replication.factor=1
                      offsets.topic.replication.factor=1
                      transaction.state.log.replication.factor=1
                  broker:
                    replicaCount: 1
                    minId: 0
                    persistence:
                      enabled: false
                    logPersistence:
                      enabled: false
                    extraConfig: |
                      message.max.bytes=5242880
                      default.replication.factor=1
                      offsets.topic.replication.factor=1
                      transaction.state.log.replication.factor=1
                  externalAccess:
                    enabled: false
                    autoDiscovery:
                      enabled: false
                      image:
                        registry: m.daocloud.io/docker.io
                  volumePermissions:
                    enabled: false
                    image:
                      registry: m.daocloud.io/docker.io
                  metrics:
                    kafka:
                      enabled: false
                      image:
                        registry: m.daocloud.io/docker.io
                    jmx:
                      enabled: false
                      image:
                        registry: m.daocloud.io/docker.io
                  provisioning:
                    enabled: false
                  kraft:
                    enabled: false
                  zookeeper:
                    enabled: true
                    image:
                      registry: m.daocloud.io/docker.io
                    replicaCount: 1
                    auth:
                      client:
                        enabled: false
                      quorum:
                        enabled: false
                    persistence:
                      enabled: false
                    volumePermissions:
                      enabled: false
                      image:
                        registry: m.daocloud.io/docker.io
                      metrics:
                        enabled: false
                    tls:
                      client:
                        enabled: false
                      quorum:
                        enabled: false
            destination:
              server: https://kubernetes.default.svc
              namespace: database
          EOF

          2.sync by argocd

          Details
          argocd app sync argocd/kafka

          3.set up client tool

          kubectl -n database \
              create secret generic client-properties \
              --from-literal=client.properties="$(printf "security.protocol=SASL_PLAINTEXT\nsasl.mechanism=SCRAM-SHA-256\nsasl.jaas.config=org.apache.kafka.common.security.scram.ScramLoginModule required username=\"user1\" password=\"$(kubectl get secret kafka-user-passwords --namespace database -o jsonpath='{.data.client-passwords}' | base64 -d | cut -d , -f 1)\";\n")"
          kubectl -n database \
              create secret generic client-properties \
              --from-literal=client.properties="security.protocol=PLAINTEXT"

4.prepare `kafka-client-tools.yaml`

          Details
          kubectl -n database apply -f - << EOF
          apiVersion: apps/v1
          kind: Deployment
          metadata:
            name: kafka-client-tools
            labels:
              app: kafka-client-tools
          spec:
            replicas: 1
            selector:
              matchLabels:
                app: kafka-client-tools
            template:
              metadata:
                labels:
                  app: kafka-client-tools
              spec:
                volumes:
                - name: client-properties
                  secret:
                    secretName: client-properties
                containers:
                - name: kafka-client-tools
                  image: m.daocloud.io/docker.io/bitnami/kafka:3.6.2
                  volumeMounts:
                  - name: client-properties
                    mountPath: /bitnami/custom/client.properties
                    subPath: client.properties
                    readOnly: true
                  env:
                  - name: BOOTSTRAP_SERVER
                    value: kafka.database.svc.cluster.local:9092
                  - name: CLIENT_CONFIG_FILE
                    value: /bitnami/custom/client.properties
                  - name: ZOOKEEPER_CONNECT
                    value: kafka-zookeeper.database.svc.cluster.local:2181
                  command:
                  - tail
                  - -f
                  - /etc/hosts
                  imagePullPolicy: IfNotPresent
          EOF

5.validate function

          - list topics
          Details
          kubectl -n database exec -it deployment/kafka-client-tools -- bash -c \
              'kafka-topics.sh --bootstrap-server $BOOTSTRAP_SERVER --command-config $CLIENT_CONFIG_FILE --list'
          - create topic
          Details
          kubectl -n database exec -it deployment/kafka-client-tools -- bash -c \
            'kafka-topics.sh --bootstrap-server $BOOTSTRAP_SERVER --command-config $CLIENT_CONFIG_FILE --create --if-not-exists --topic test-topic'
          - describe topic
          Details
          kubectl -n database exec -it deployment/kafka-client-tools -- bash -c \
            'kafka-topics.sh --bootstrap-server $BOOTSTRAP_SERVER --command-config $CLIENT_CONFIG_FILE --describe --topic test-topic'
          - produce message
          Details
          kubectl -n database exec -it deployment/kafka-client-tools -- bash -c \
            'for message in $(seq 0 10); do echo $message | kafka-console-producer.sh --bootstrap-server $BOOTSTRAP_SERVER --producer.config $CLIENT_CONFIG_FILE --topic test-topic; done'
          - consume message
          Details
          kubectl -n database exec -it deployment/kafka-client-tools -- bash -c \
            'kafka-console-consumer.sh --bootstrap-server $BOOTSTRAP_SERVER --consumer.config $CLIENT_CONFIG_FILE --topic test-topic --from-beginning'
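- list consumer groups (optional)
Beyond topics, the same client can report consumer group state; a sketch reusing the mounted config file, assuming `kafka-consumer-groups.sh` from the same image:
Details
kubectl -n database exec -it deployment/kafka-client-tools -- bash -c \
  'kafka-consumer-groups.sh --bootstrap-server $BOOTSTRAP_SERVER --command-config $CLIENT_CONFIG_FILE --list'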

          Preliminary

          1. Docker has installed, if not check 🔗link


          Using Proxy

you can run an additional daocloud image to accelerate your pulling, check Daocloud Proxy

          1.init server

          Details
          mkdir -p kafka/data
          chmod -R 777 kafka/data
          podman run --rm \
              --name kafka-server \
              --hostname kafka-server \
              -p 9092:9092 \
              -p 9094:9094 \
              -v $(pwd)/kafka/data:/bitnami/kafka/data \
              -e KAFKA_CFG_NODE_ID=0 \
              -e KAFKA_CFG_PROCESS_ROLES=controller,broker \
              -e KAFKA_CFG_CONTROLLER_QUORUM_VOTERS=0@kafka-server:9093 \
              -e KAFKA_CFG_LISTENERS=PLAINTEXT://:9092,CONTROLLER://:9093,EXTERNAL://:9094 \
              -e KAFKA_CFG_ADVERTISED_LISTENERS=PLAINTEXT://kafka:9092,EXTERNAL://host.containers.internal:9094 \
              -e KAFKA_CFG_LISTENER_SECURITY_PROTOCOL_MAP=CONTROLLER:PLAINTEXT,EXTERNAL:PLAINTEXT,PLAINTEXT:PLAINTEXT \
              -e KAFKA_CFG_CONTROLLER_LISTENER_NAMES=CONTROLLER \
              -d m.daocloud.io/docker.io/bitnami/kafka:3.6.2

          2.list topic

          Details
          BOOTSTRAP_SERVER=host.containers.internal:9094
          podman run --rm \
              -it m.daocloud.io/docker.io/bitnami/kafka:3.6.2 kafka-topics.sh \
                  --bootstrap-server $BOOTSTRAP_SERVER --list

3.create topic

          Details
          BOOTSTRAP_SERVER=host.containers.internal:9094
          # BOOTSTRAP_SERVER=10.200.60.64:9094
          TOPIC=test-topic
          podman run --rm \
              -it m.daocloud.io/docker.io/bitnami/kafka:3.6.2 kafka-topics.sh \
                  --bootstrap-server $BOOTSTRAP_SERVER \
                  --create \
                  --if-not-exists \
                  --topic $TOPIC

4.consume record

          Details
          BOOTSTRAP_SERVER=host.containers.internal:9094
          # BOOTSTRAP_SERVER=10.200.60.64:9094
          TOPIC=test-topic
          podman run --rm \
              -it m.daocloud.io/docker.io/bitnami/kafka:3.6.2 kafka-console-consumer.sh \
                  --bootstrap-server $BOOTSTRAP_SERVER \
                  --topic $TOPIC \
                  --from-beginning
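5.produce record

The producer counterpart is not listed above; a minimal sketch mirroring the consumer command, assuming the same bootstrap server and topic:

Details
BOOTSTRAP_SERVER=host.containers.internal:9094
TOPIC=test-topic
echo "hello kafka" | podman run --rm -i \
    m.daocloud.io/docker.io/bitnami/kafka:3.6.2 kafka-console-producer.sh \
        --bootstrap-server $BOOTSTRAP_SERVER \
        --topic $TOPIC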


          Mar 7, 2024

          Install MariaDB

          Installation

          Install By

          Preliminary

          1. Kubernetes has installed, if not check 🔗link


          2. Helm has installed, if not check 🔗link


          Preliminary

          1. Kubernetes has installed, if not check 🔗link


          2. argoCD has installed, if not check 🔗link


3. cert-manager has installed, and a ClusterIssuer named `self-signed-ca-issuer` exists, if not check 🔗link


          1.prepare mariadb credentials secret

          Details
          kubectl get namespaces database > /dev/null 2>&1 || kubectl create namespace database
          kubectl -n database create secret generic mariadb-credentials \
              --from-literal=mariadb-root-password=$(tr -dc A-Za-z0-9 </dev/urandom | head -c 16) \
              --from-literal=mariadb-replication-password=$(tr -dc A-Za-z0-9 </dev/urandom | head -c 16) \
              --from-literal=mariadb-password=$(tr -dc A-Za-z0-9 </dev/urandom | head -c 16)

          2.prepare `deploy-mariadb.yaml`

          Details
          apiVersion: argoproj.io/v1alpha1
          kind: Application
          metadata:
            name: mariadb
          spec:
            syncPolicy:
              syncOptions:
              - CreateNamespace=true
            project: default
            source:
              repoURL: https://charts.bitnami.com/bitnami
              chart: mariadb
              targetRevision: 16.3.2
              helm:
                releaseName: mariadb
                values: |
                  architecture: standalone
                  auth:
                    database: test-mariadb
                    username: aaron.yang
                    existingSecret: mariadb-credentials
                  primary:
                    extraFlags: "--character-set-server=utf8mb4 --collation-server=utf8mb4_bin"
                    persistence:
                      enabled: false
                  secondary:
                    replicaCount: 1
                    persistence:
                      enabled: false
                  image:
                    registry: m.daocloud.io/docker.io
                    pullPolicy: IfNotPresent
                  volumePermissions:
                    enabled: false
                    image:
                      registry: m.daocloud.io/docker.io
                      pullPolicy: IfNotPresent
                  metrics:
                    enabled: false
                    image:
                      registry: m.daocloud.io/docker.io
                      pullPolicy: IfNotPresent
            destination:
              server: https://kubernetes.default.svc
              namespace: database

          3.deploy mariadb

          Details
          kubectl -n argocd apply -f deploy-mariadb.yaml

          4.sync by argocd

          Details
          argocd app sync argocd/mariadb

          5.check mariadb

          Details
          kubectl -n database get secret mariadb-credentials -o jsonpath='{.data.mariadb-root-password}' | base64 -d
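To verify the deployment end to end, you can run a throwaway client pod against the chart's Service; a minimal sketch, assuming the release exposes a Service reachable as `mariadb.database.svc.cluster.local`:

Details
MARIADB_ROOT_PASSWORD=$(kubectl -n database get secret mariadb-credentials -o jsonpath='{.data.mariadb-root-password}' | base64 -d)
kubectl -n database run mariadb-client --rm -it --restart=Never \
    --image=m.daocloud.io/docker.io/library/mariadb:11.2.2-jammy -- \
    mariadb -h mariadb.database.svc.cluster.local -uroot -p${MARIADB_ROOT_PASSWORD} -e 'SELECT VERSION();'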

          Preliminary

          1. Kubernetes has installed, if not check 🔗link


          2. ArgoCD has installed, if not check 🔗link


          3. Argo Workflow has installed, if not check 🔗link


          1.prepare `argocd-login-credentials`

          Details
          kubectl get namespaces database > /dev/null 2>&1 || kubectl create namespace database
          kubectl -n database create secret generic mariadb-credentials \
              --from-literal=mariadb-root-password=$(tr -dc A-Za-z0-9 </dev/urandom | head -c 16) \
              --from-literal=mariadb-replication-password=$(tr -dc A-Za-z0-9 </dev/urandom | head -c 16) \
              --from-literal=mariadb-password=$(tr -dc A-Za-z0-9 </dev/urandom | head -c 16)

          2.apply rolebinding to k8s

          Details
          kubectl -n argocd apply -f - <<EOF
          ---
          apiVersion: rbac.authorization.k8s.io/v1
          kind: ClusterRole
          metadata:
            name: application-administrator
          rules:
            - apiGroups:
                - argoproj.io
              resources:
                - applications
              verbs:
                - '*'
            - apiGroups:
                - apps
              resources:
                - deployments
              verbs:
                - '*'
          
          ---
          apiVersion: rbac.authorization.k8s.io/v1
          kind: RoleBinding
          metadata:
            name: application-administration
            namespace: argocd
          roleRef:
            apiGroup: rbac.authorization.k8s.io
            kind: ClusterRole
            name: application-administrator
          subjects:
            - kind: ServiceAccount
              name: argo-workflow
              namespace: business-workflows
          
          ---
          apiVersion: rbac.authorization.k8s.io/v1
          kind: RoleBinding
          metadata:
            name: application-administration
            namespace: application
          roleRef:
            apiGroup: rbac.authorization.k8s.io
            kind: ClusterRole
            name: application-administrator
          subjects:
            - kind: ServiceAccount
              name: argo-workflow
              namespace: business-workflows
          EOF

          3.prepare mariadb credentials secret

          Details
          kubectl -n application create secret generic mariadb-credentials \
            --from-literal=mariadb-root-password=$(tr -dc A-Za-z0-9 </dev/urandom | head -c 16) \
            --from-literal=mariadb-replication-password=$(tr -dc A-Za-z0-9 </dev/urandom | head -c 16) \
            --from-literal=mariadb-password=$(tr -dc A-Za-z0-9 </dev/urandom | head -c 16)

          4.prepare `deploy-mariadb-flow.yaml`

          Details
          apiVersion: argoproj.io/v1alpha1
          kind: Workflow
          metadata:
            generateName: deploy-argocd-app-mariadb-
          spec:
            entrypoint: entry
            artifactRepositoryRef:
              configmap: artifact-repositories
              key: default-artifact-repository
            serviceAccountName: argo-workflow
            templates:
            - name: entry
              inputs:
                parameters:
                - name: argocd-server
                  value: argo-cd-argocd-server.argocd:443
                - name: insecure-option
                  value: --insecure
              dag:
                tasks:
                - name: apply
                  template: apply
                - name: prepare-argocd-binary
                  template: prepare-argocd-binary
                  dependencies:
                  - apply
                - name: sync
                  dependencies:
                  - prepare-argocd-binary
                  template: sync
                  arguments:
                    artifacts:
                    - name: argocd-binary
                      from: "{{tasks.prepare-argocd-binary.outputs.artifacts.argocd-binary}}"
                    parameters:
                    - name: argocd-server
                      value: "{{inputs.parameters.argocd-server}}"
                    - name: insecure-option
                      value: "{{inputs.parameters.insecure-option}}"
                - name: wait
                  dependencies:
                  - sync
                  template: wait
                  arguments:
                    artifacts:
                    - name: argocd-binary
                      from: "{{tasks.prepare-argocd-binary.outputs.artifacts.argocd-binary}}"
                    parameters:
                    - name: argocd-server
                      value: "{{inputs.parameters.argocd-server}}"
                    - name: insecure-option
                      value: "{{inputs.parameters.insecure-option}}"
                - name: init-db-tool
                  template: init-db-tool
                  dependencies:
                  - wait
            - name: apply
              resource:
                action: apply
                manifest: |
                  apiVersion: argoproj.io/v1alpha1
                  kind: Application
                  metadata:
                    name: app-mariadb
                    namespace: argocd
                  spec:
                    syncPolicy:
                      syncOptions:
                      - CreateNamespace=true
                    project: default
                    source:
                      repoURL: https://charts.bitnami.com/bitnami
                      chart: mariadb
                      targetRevision: 16.5.0
                      helm:
                        releaseName: app-mariadb
                        values: |
                          architecture: standalone
                          auth:
                            database: geekcity
                            username: aaron.yang
                            existingSecret: mariadb-credentials
                          primary:
                            persistence:
                              enabled: false
                          secondary:
                            replicaCount: 1
                            persistence:
                              enabled: false
                          image:
                            registry: m.daocloud.io/docker.io
                            pullPolicy: IfNotPresent
                          volumePermissions:
                            enabled: false
                            image:
                              registry: m.daocloud.io/docker.io
                              pullPolicy: IfNotPresent
                          metrics:
                            enabled: false
                            image:
                              registry: m.daocloud.io/docker.io
                              pullPolicy: IfNotPresent
                    destination:
                      server: https://kubernetes.default.svc
                      namespace: application
            - name: prepare-argocd-binary
              inputs:
                artifacts:
                - name: argocd-binary
                  path: /tmp/argocd
                  mode: 755
                  http:
                    url: https://files.m.daocloud.io/github.com/argoproj/argo-cd/releases/download/v2.9.3/argocd-linux-amd64
              outputs:
                artifacts:
                - name: argocd-binary
                  path: "{{inputs.artifacts.argocd-binary.path}}"
              container:
                image: m.daocloud.io/docker.io/library/fedora:39
                command:
                - sh
                - -c
                args:
                - |
                  ls -l {{inputs.artifacts.argocd-binary.path}}
            - name: sync
              inputs:
                artifacts:
                - name: argocd-binary
                  path: /usr/local/bin/argocd
                parameters:
                - name: argocd-server
                - name: insecure-option
                  value: ""
              container:
                image: m.daocloud.io/docker.io/library/fedora:39
                env:
                - name: ARGOCD_USERNAME
                  valueFrom:
                    secretKeyRef:
                      name: argocd-login-credentials
                      key: username
                - name: ARGOCD_PASSWORD
                  valueFrom:
                    secretKeyRef:
                      name: argocd-login-credentials
                      key: password
                - name: WITH_PRUNE_OPTION
                  value: --prune
                command:
                - sh
                - -c
                args:
                - |
                  set -e
                  export ARGOCD_SERVER={{inputs.parameters.argocd-server}}
                  export INSECURE_OPTION={{inputs.parameters.insecure-option}}
                  export ARGOCD_USERNAME=${ARGOCD_USERNAME:-admin}
                  argocd login ${INSECURE_OPTION} --username ${ARGOCD_USERNAME} --password ${ARGOCD_PASSWORD} ${ARGOCD_SERVER}
                  argocd app sync argocd/app-mariadb ${WITH_PRUNE_OPTION} --timeout 300
            - name: wait
              inputs:
                artifacts:
                - name: argocd-binary
                  path: /usr/local/bin/argocd
                parameters:
                - name: argocd-server
                - name: insecure-option
                  value: ""
              container:
                image: m.daocloud.io/docker.io/library/fedora:39
                env:
                - name: ARGOCD_USERNAME
                  valueFrom:
                    secretKeyRef:
                      name: argocd-login-credentials
                      key: username
                - name: ARGOCD_PASSWORD
                  valueFrom:
                    secretKeyRef:
                      name: argocd-login-credentials
                      key: password
                command:
                - sh
                - -c
                args:
                - |
                  set -e
                  export ARGOCD_SERVER={{inputs.parameters.argocd-server}}
                  export INSECURE_OPTION={{inputs.parameters.insecure-option}}
                  export ARGOCD_USERNAME=${ARGOCD_USERNAME:-admin}
                  argocd login ${INSECURE_OPTION} --username ${ARGOCD_USERNAME} --password ${ARGOCD_PASSWORD} ${ARGOCD_SERVER}
                  argocd app wait argocd/app-mariadb
            - name: init-db-tool
              resource:
                action: apply
                manifest: |
                  apiVersion: apps/v1
                  kind: Deployment
                  metadata:
                    name: app-mariadb-tool
                    namespace: application
                    labels:
                      app.kubernetes.io/name: mariadb-tool
                  spec:
                    replicas: 1
                    selector:
                      matchLabels:
                        app.kubernetes.io/name: mariadb-tool
                    template:
                      metadata:
                        labels:
                          app.kubernetes.io/name: mariadb-tool
                      spec:
                        containers:
                          - name: mariadb-tool
                            image:  m.daocloud.io/docker.io/bitnami/mariadb:10.5.12-debian-10-r0
                            imagePullPolicy: IfNotPresent
                            env:
                              - name: MARIADB_ROOT_PASSWORD
                                valueFrom:
                                  secretKeyRef:
                                    key: mariadb-root-password
                                    name: mariadb-credentials
                              - name: TZ
                                value: Asia/Shanghai

5.submit to argo workflow client

          Details
          argo -n business-workflows submit deploy-mariadb-flow.yaml

          6.decode password

          Details
          kubectl -n application get secret mariadb-credentials -o jsonpath='{.data.mariadb-root-password}' | base64 -d
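To confirm the instance is reachable, you can reuse the `app-mariadb-tool` deployment created by the workflow; a minimal sketch, assuming the chart exposes a Service named `app-mariadb` in the `application` namespace:

Details
kubectl -n application exec -it deployment/app-mariadb-tool -- bash -c \
    'mysql -h app-mariadb.application -uroot -p$MARIADB_ROOT_PASSWORD -e "SELECT VERSION();"'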

          Preliminary

          1. Docker has installed, if not check 🔗link


          Using Proxy

you can run an additional daocloud image to accelerate your pulling, check Daocloud Proxy

          1.init server

          Details
mkdir -p mariadb/data
podman run \
    -p 3306:3306 \
    -e MARIADB_ROOT_PASSWORD=mysql \
    -v $(pwd)/mariadb/data:/var/lib/mysql \
    -d m.daocloud.io/docker.io/library/mariadb:11.2.2-jammy \
    --log-bin \
    --binlog-format=ROW

          2.use web console

          And then you can visit 🔗http://localhost:8080

          username: `root`

          password: `mysql`

          Details
          podman run --rm -p 8080:80 \
              -e PMA_ARBITRARY=1 \
              -d m.daocloud.io/docker.io/library/phpmyadmin:5.1.1-apache

          3.use internal client

          Details
          podman run --rm \
              -e MYSQL_PWD=mysql \
              -it m.daocloud.io/docker.io/library/mariadb:11.2.2-jammy \
              mariadb \
              --host host.containers.internal \
              --port 3306 \
              --user root \
              --database mysql \
              --execute 'select version()'

          Useful SQL

          1. list all bin logs
          SHOW BINARY LOGS;
2. delete previous bin logs
PURGE BINARY LOGS TO 'mysqld-bin.0000003'; # delete mysqld-bin.0000001 and mysqld-bin.0000002
PURGE BINARY LOGS BEFORE 'yyyy-MM-dd HH:mm:ss';
PURGE BINARY LOGS BEFORE DATE_SUB(NOW(), INTERVAL 3 DAY); # delete bin log files older than three days
Details

If you are using master-slave mode, you can replace BINARY with MASTER in the statements above.


          Mar 7, 2024

          Install Milvus

          Preliminary

          • Kubernetes has installed, if not check link
          • argoCD has installed, if not check link
• cert-manager has installed, and a ClusterIssuer named self-signed-ca-issuer exists, if not check link
          • minio has installed, if not check link

          Steps

          1. copy minio credentials secret

          kubectl get namespaces database > /dev/null 2>&1 || kubectl create namespace database
          kubectl -n storage get secret minio-secret -o json \
              | jq 'del(.metadata["namespace","creationTimestamp","resourceVersion","selfLink","uid"])' \
              | kubectl -n database apply -f -
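You can check that the copy worked before moving on; the key names below match the `existingSecretAccessKeyIDKey`/`existingSecretKeySecretKey` settings used in the values further down:

kubectl -n database get secret minio-secret -o jsonpath='{.data.root-user}' | base64 -d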

          2. prepare deploy-milvus.yaml

          apiVersion: argoproj.io/v1alpha1
          kind: Application
          metadata:
            name: milvus
          spec:
            syncPolicy:
              syncOptions:
                - CreateNamespace=true
            project: default
            source:
              repoURL: registry-1.docker.io/bitnamicharts
              chart: milvus
              targetRevision: 11.2.4
              helm:
                releaseName: milvus
                values: |
                  global:
                    security:
                      allowInsecureImages: true
                  milvus:
                    image:
                      registry: m.lab.zverse.space/docker.io
                      repository: bitnami/milvus
                      tag: 2.5.7-debian-12-r0
                      pullPolicy: IfNotPresent
                    auth:
                      enabled: false
                  initJob:
                    forceRun: false
                    image:
                      registry: m.lab.zverse.space/docker.io
                      repository: bitnami/pymilvus
                      tag: 2.5.6-debian-12-r0
                      pullPolicy: IfNotPresent
                    resources:
                      requests:
                        cpu: 2
                        memory: 512Mi
                      limits:
                        cpu: 2
                        memory: 2Gi
                  dataCoord:
                    replicaCount: 1
                    resources:
                      requests:
                        cpu: 500m
                        memory: 512Mi
                      limits:
                        cpu: 2
                        memory: 2Gi
                    metrics:
                      enabled: true
                      
                  rootCoord:
                    replicaCount: 1
                    resources:
                      requests:
                        cpu: 500m
                        memory: 1Gi
                      limits:
                        cpu: 2
                        memory: 4Gi
                  queryCoord:
                    replicaCount: 1
                    resources:
                      requests:
                        cpu: 500m
                        memory: 1Gi
                      limits:
                        cpu: 2
                        memory: 4Gi
                  indexCoord:
                    replicaCount: 1
                    resources:
                      requests:
                        cpu: 500m
                        memory: 1Gi
                      limits:
                        cpu: 2
                        memory: 4Gi
                  dataNode:
                    replicaCount: 1
                    resources:
                      requests:
                        cpu: 500m
                        memory: 1Gi
                      limits:
                        cpu: 2
                        memory: 4Gi
                  queryNode:
                    replicaCount: 1
                    resources:
                      requests:
                        cpu: 500m
                        memory: 1Gi
                      limits:
                        cpu: 2
                        memory: 2Gi
                  indexNode:
                    resources:
                      requests:
                        cpu: 500m
                        memory: 1Gi
                      limits:
                        cpu: 2
                        memory: 2Gi
                  proxy:
                    replicaCount: 1
                    service:
                      type: ClusterIP
                    resources:
                      requests:
                        cpu: 500m
                        memory: 1Gi
                      limits:
                        cpu: 2
                        memory: 2Gi
                  attu:
                    image:
                      registry: m.lab.zverse.space/docker.io
                      repository: bitnami/attu
                      tag: 2.5.5-debian-12-r1
                    resources:
                      requests:
                        cpu: 500m
                        memory: 1Gi
                      limits:
                        cpu: 2
                        memory: 4Gi
                    service:
                      type: ClusterIP
                    ingress:
                      enabled: true
                      ingressClassName: "nginx"
                      annotations:
                        cert-manager.io/cluster-issuer: alidns-webhook-zverse-letsencrypt
                      hostname: milvus.dev.tech
                      path: /
                      pathType: ImplementationSpecific
                      tls: true
                  waitContainer:
                    image:
                      registry: m.lab.zverse.space/docker.io
                      repository: bitnami/os-shell
                      tag: 12-debian-12-r40
                      pullPolicy: IfNotPresent
                    resources:
                      requests:
                        cpu: 500m
                        memory: 1Gi
                      limits:
                        cpu: 2
                        memory: 4Gi
                  externalS3:
                    host: "minio.storage"
                    port: 9000
                    existingSecret: "minio-secret"
                    existingSecretAccessKeyIDKey: "root-user"
                    existingSecretKeySecretKey: "root-password"
                    bucket: "milvus"
                    rootPath: "file"
                  etcd:
                    enabled: true
                    image:
                      registry: m.lab.zverse.space/docker.io
                    replicaCount: 1
                    auth:
                      rbac:
                        create: false
                      client:
                        secureTransport: false
                    resources:
                      requests:
                        cpu: 500m
                        memory: 1Gi
                      limits:
                        cpu: 2
                        memory: 2Gi
                    persistence:
                      enabled: true
                      storageClass: ""
                      size: 2Gi
                    preUpgradeJob:
                      enabled: false
                  minio:
                    enabled: false
                  kafka:
                    enabled: true
                    image:
                      registry: m.lab.zverse.space/docker.io
                    controller:
                      replicaCount: 1
                      livenessProbe:
                        failureThreshold: 8
                      resources:
                        requests:
                          cpu: 500m
                          memory: 1Gi
                        limits:
                          cpu: 2
                          memory: 2Gi
                      persistence:
                        enabled: true
                        storageClass: ""
                        size: 2Gi
                    service:
                      ports:
                        client: 9092
                    extraConfig: |-
                      offsets.topic.replication.factor=3
                    listeners:
                      client:
                        protocol: PLAINTEXT
                      interbroker:
                        protocol: PLAINTEXT
                      external:
                        protocol: PLAINTEXT
                    sasl:
                      enabledMechanisms: "PLAIN"
                      client:
                        users:
                          - user
                    broker:
                      replicaCount: 0
            destination:
              server: https://kubernetes.default.svc
              namespace: database

          3. apply to k8s

          kubectl -n argocd apply -f deploy-milvus.yaml

          4. sync by argocd

          argocd app sync argocd/milvus

          5. check Attu WebUI

          milvus address: milvus-proxy:19530

          milvus database: default

          https://milvus.dev.tech:32443/#/
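If the ingress is not reachable from your machine, a port-forward against the proxy Service exposes the same 19530 endpoint for Attu or any SDK; a minimal sketch, assuming the Service name `milvus-proxy` shown above:

kubectl -n database port-forward svc/milvus-proxy 19530:19530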

6. [Optional] import data

          import data by using sql file

          MARIADB_ROOT_PASSWORD=$(kubectl -n database get secret mariadb-credentials -o jsonpath='{.data.mariadb-root-password}' | base64 -d)
          POD_NAME=$(kubectl get pod -n database -l "app.kubernetes.io/name=mariadb-tool" -o jsonpath="{.items[0].metadata.name}") \
          && export SQL_FILENAME="Dump20240301.sql" \
          && kubectl -n database cp ${SQL_FILENAME} ${POD_NAME}:/tmp/${SQL_FILENAME} \
          && kubectl -n database exec -it deployment/app-mariadb-tool -- bash -c \
              'echo "create database ccds;" | mysql -h mariadb.database -uroot -p$MARIADB_ROOT_PASSWORD' \
          && kubectl -n database exec -it ${POD_NAME} -- bash -c \
              "mysql -h mariadb.database -uroot -p\${MARIADB_ROOT_PASSWORD} \
              ccds < /tmp/Dump20240301.sql"

7. [Optional] decode password

          kubectl -n database get secret mariadb-credentials -o jsonpath='{.data.mariadb-root-password}' | base64 -d

8. [Optional] execute sql in pod

kubectl -n database exec -it xxxx -- bash
mariadb -h 127.0.0.1 -u root -p$MARIADB_ROOT_PASSWORD

          And then you can check connection by

          show status like  'Threads%';
          May 26, 2025

          Install Neo4j

          Installation

          Install By

          Preliminary

          1. Kubernetes has installed, if not check 🔗link


          2. Helm has installed, if not check 🔗link


          1.get helm repo

          Details
          helm repo add ay-helm-mirror https://aaronyang0628.github.io/helm-chart-mirror/charts
          helm repo update

          2.install chart

          Details
          helm install ay-helm-mirror/kube-prometheus-stack --generate-name
          Using Proxy

          Preliminary

          1. Kubernetes has installed, if not check 🔗link


          2. Helm has installed, if not check 🔗link


          3. ArgoCD has installed, if not check 🔗link


          1.prepare `deploy-xxxxx.yaml`

          Details

          2.apply to k8s

          Details
          kubectl -n argocd apply -f xxxx.yaml

          3.sync by argocd

          Details
          argocd app sync argocd/xxxx

          4.prepare yaml-content.yaml

          Details

          5.apply to k8s

          Details
          kubectl apply -f xxxx.yaml

          6.apply xxxx.yaml directly

          Details
          kubectl apply -f - <<EOF
          
          EOF

          Preliminary

          1. Docker|Podman|Buildah has installed, if not check 🔗link


          Using Proxy

you can run an additional daocloud image to accelerate your pulling, check Daocloud Proxy

          1.init server

          Details
          mkdir -p neo4j/data
          podman run --rm \
              --name neo4j \
              -p 7474:7474 \
              -p 7687:7687 \
-e NEO4J_AUTH=neo4j/mysql \
              -v $(pwd)/neo4j/data:/data \
              -d docker.io/library/neo4j:5.18.0-community-bullseye
and then you can visit 🔗http://localhost:7474

username: `neo4j`
password: `mysql`
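To verify the container is up, you can run cypher-shell inside it; a minimal sketch, assuming the credentials set via `NEO4J_AUTH` above:

Details
podman exec -it neo4j cypher-shell -u neo4j -p mysql 'RETURN 1;'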

          Preliminary

          1. Kubernetes has installed, if not check 🔗link


          2. Helm has installed, if not check 🔗link


          3. ArgoCD has installed, if not check 🔗link


          4. Argo Workflow has installed, if not check 🔗link


          1.prepare `argocd-login-credentials`

          Details
          kubectl get namespaces database > /dev/null 2>&1 || kubectl create namespace database

          2.apply rolebinding to k8s

          Details
          kubectl apply -f - <<EOF
          ---
          apiVersion: rbac.authorization.k8s.io/v1
          kind: ClusterRole
          metadata:
            name: application-administrator
          rules:
            - apiGroups:
                - argoproj.io
              resources:
                - applications
              verbs:
                - '*'
            - apiGroups:
                - apps
              resources:
                - deployments
              verbs:
                - '*'
          
          ---
          apiVersion: rbac.authorization.k8s.io/v1
          kind: RoleBinding
          metadata:
            name: application-administration
            namespace: argocd
          roleRef:
            apiGroup: rbac.authorization.k8s.io
            kind: ClusterRole
            name: application-administrator
          subjects:
            - kind: ServiceAccount
              name: argo-workflow
              namespace: business-workflows
          
          ---
          apiVersion: rbac.authorization.k8s.io/v1
          kind: RoleBinding
          metadata:
            name: application-administration
            namespace: application
          roleRef:
            apiGroup: rbac.authorization.k8s.io
            kind: ClusterRole
            name: application-administrator
          subjects:
            - kind: ServiceAccount
              name: argo-workflow
              namespace: business-workflows
          EOF
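
Before moving on, you can ask the API server whether the workflow's service account is actually allowed to manage Applications (a quick sanity check using `kubectl auth can-i`; the expected answer is `yes`):

kubectl auth can-i create applications.argoproj.io -n argocd \
    --as=system:serviceaccount:business-workflows:argo-workflow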

3.prepare `deploy-xxxx-flow.yaml`

          Details

4.submit to Argo Workflow client

          Details
          argo -n business-workflows submit deploy-xxxx-flow.yaml

5.decode password

          Details
          kubectl -n application get secret xxxx-credentials -o jsonpath='{.data.xxx-password}' | base64 -d

          Mar 7, 2024

          Install Postgresql

          Installation

          Install By

          Preliminary

          1. Kubernetes has installed, if not check 🔗link


          2. Helm has installed, if not check 🔗link


          1.get helm repo

          Details
          helm repo add ay-helm-mirror https://aaronyang0628.github.io/helm-chart-mirror/charts
          helm repo update

          2.install chart

          Details
          helm install ay-helm-mirror/kube-prometheus-stack --generate-name
          Using Proxy

          Preliminary

          1. Kubernetes has installed, if not check 🔗link


          2. Helm has installed, if not check 🔗link


          3. ArgoCD has installed, if not check 🔗link


          1.prepare `deploy-xxxxx.yaml`

          Details

          2.apply to k8s

          Details
          kubectl -n argocd apply -f xxxx.yaml

          3.sync by argocd

          Details
          argocd app sync argocd/xxxx

          4.prepare yaml-content.yaml

          Details

          5.apply to k8s

          Details
          kubectl apply -f xxxx.yaml

          6.apply xxxx.yaml directly

          Details
          kubectl apply -f - <<EOF
          
          EOF

          Preliminary

          1. Docker|Podman|Buildah has installed, if not check 🔗link


          Using Proxy

you can run an additional DaoCloud image to accelerate your pulls, check Daocloud Proxy

          1.init server

          Details
          mkdir -p $(pwd)/postgresql/data
          podman run --rm \
              --name postgresql \
              -p 5432:5432 \
              -e POSTGRES_PASSWORD=postgresql \
              -e PGDATA=/var/lib/postgresql/data/pgdata \
              -v $(pwd)/postgresql/data:/var/lib/postgresql/data \
              -d docker.io/library/postgres:15.2-alpine3.17

          2.use web console

          Details
          podman run --rm \
            -p 8080:80 \
            -e 'PGADMIN_DEFAULT_EMAIL=ben.wangz@foxmail.com' \
            -e 'PGADMIN_DEFAULT_PASSWORD=123456' \
            -d docker.io/dpage/pgadmin4:6.15
          And then you can visit 🔗[http://localhost:8080]


          3.use internal client

          Details
          podman run --rm \
              --env PGPASSWORD=postgresql \
              --entrypoint psql \
              -it docker.io/library/postgres:15.2-alpine3.17 \
              --host host.containers.internal \
              --port 5432 \
              --username postgres \
              --dbname postgres \
              --command 'select version()'

          Preliminary

          1. Kubernetes has installed, if not check 🔗link


          2. Helm has installed, if not check 🔗link


          3. ArgoCD has installed, if not check 🔗link


          4. Argo Workflow has installed, if not check 🔗link


          5. Minio artifact repository has been configured, if not check 🔗link


          - endpoint: minio.storage:9000

          1.prepare `argocd-login-credentials`

          Details
          kubectl get namespaces database > /dev/null 2>&1 || kubectl create namespace database
          ARGOCD_USERNAME=admin
          ARGOCD_PASSWORD=$(kubectl -n argocd get secret argocd-initial-admin-secret -o jsonpath="{.data.password}" | base64 -d)
          kubectl -n business-workflows create secret generic argocd-login-credentials \
              --from-literal=username=${ARGOCD_USERNAME} \
              --from-literal=password=${ARGOCD_PASSWORD}

          2.apply rolebinding to k8s

          Details
          kubectl apply -f - <<EOF
          ---
          apiVersion: rbac.authorization.k8s.io/v1
          kind: ClusterRole
          metadata:
            name: application-administrator
          rules:
            - apiGroups:
                - argoproj.io
              resources:
                - applications
              verbs:
                - '*'
            - apiGroups:
                - apps
              resources:
                - deployments
              verbs:
                - '*'
          
          ---
          apiVersion: rbac.authorization.k8s.io/v1
          kind: RoleBinding
          metadata:
            name: application-administration
            namespace: argocd
          roleRef:
            apiGroup: rbac.authorization.k8s.io
            kind: ClusterRole
            name: application-administrator
          subjects:
            - kind: ServiceAccount
              name: argo-workflow
              namespace: business-workflows
          
          ---
          apiVersion: rbac.authorization.k8s.io/v1
          kind: RoleBinding
          metadata:
            name: application-administration
            namespace: application
          roleRef:
            apiGroup: rbac.authorization.k8s.io
            kind: ClusterRole
            name: application-administrator
          subjects:
            - kind: ServiceAccount
              name: argo-workflow
              namespace: business-workflows
          EOF

          3.prepare postgresql admin credentials secret

          Details
          kubectl -n application create secret generic postgresql-credentials \
            --from-literal=postgres-password=$(tr -dc A-Za-z0-9 </dev/urandom | head -c 16) \
            --from-literal=password=$(tr -dc A-Za-z0-9 </dev/urandom | head -c 16) \
            --from-literal=replication-password=$(tr -dc A-Za-z0-9 </dev/urandom | head -c 16)
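
To confirm all three keys were written, you can list them without revealing the values:

kubectl -n application describe secret postgresql-credentials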

          4.prepare `deploy-postgresql-flow.yaml`

          Details
          apiVersion: argoproj.io/v1alpha1
          kind: Workflow
          metadata:
            generateName: deploy-argocd-app-pg-
          spec:
            entrypoint: entry
            artifactRepositoryRef:
              configmap: artifact-repositories
              key: default-artifact-repository
            serviceAccountName: argo-workflow
            templates:
            - name: entry
              inputs:
                parameters:
                - name: argocd-server
                  value: argo-cd-argocd-server.argocd:443
                - name: insecure-option
                  value: --insecure
              dag:
                tasks:
                - name: apply
                  template: apply
                - name: prepare-argocd-binary
                  template: prepare-argocd-binary
                  dependencies:
                  - apply
                - name: sync
                  dependencies:
                  - prepare-argocd-binary
                  template: sync
                  arguments:
                    artifacts:
                    - name: argocd-binary
                      from: "{{tasks.prepare-argocd-binary.outputs.artifacts.argocd-binary}}"
                    parameters:
                    - name: argocd-server
                      value: "{{inputs.parameters.argocd-server}}"
                    - name: insecure-option
                      value: "{{inputs.parameters.insecure-option}}"
                - name: wait
                  dependencies:
                  - sync
                  template: wait
                  arguments:
                    artifacts:
                    - name: argocd-binary
                      from: "{{tasks.prepare-argocd-binary.outputs.artifacts.argocd-binary}}"
                    parameters:
                    - name: argocd-server
                      value: "{{inputs.parameters.argocd-server}}"
                    - name: insecure-option
                      value: "{{inputs.parameters.insecure-option}}"
                - name: init-db-tool
                  template: init-db-tool
                  dependencies:
                  - wait
            - name: apply
              resource:
                action: apply
                manifest: |
                  apiVersion: argoproj.io/v1alpha1
                  kind: Application
                  metadata:
                    name: app-postgresql
                    namespace: argocd
                  spec:
                    syncPolicy:
                      syncOptions:
                      - CreateNamespace=true
                    project: default
                    source:
                      repoURL: https://charts.bitnami.com/bitnami
                      chart: postgresql
                      targetRevision: 14.2.2
                      helm:
                        releaseName: app-postgresql
                        values: |
                          architecture: standalone
                          auth:
                            database: geekcity
                            username: aaron.yang
                            existingSecret: postgresql-credentials
                          primary:
                            persistence:
                              enabled: false
                          readReplicas:
                            replicaCount: 1
                            persistence:
                              enabled: false
                          backup:
                            enabled: false
                          image:
                            registry: m.daocloud.io/docker.io
                            pullPolicy: IfNotPresent
                          volumePermissions:
                            enabled: false
                            image:
                              registry: m.daocloud.io/docker.io
                              pullPolicy: IfNotPresent
                          metrics:
                            enabled: false
                            image:
                              registry: m.daocloud.io/docker.io
                              pullPolicy: IfNotPresent
                    destination:
                      server: https://kubernetes.default.svc
                      namespace: application
            - name: prepare-argocd-binary
              inputs:
                artifacts:
                - name: argocd-binary
                  path: /tmp/argocd
                  mode: 755
                  http:
                    url: https://files.m.daocloud.io/github.com/argoproj/argo-cd/releases/download/v2.9.3/argocd-linux-amd64
              outputs:
                artifacts:
                - name: argocd-binary
                  path: "{{inputs.artifacts.argocd-binary.path}}"
              container:
                image: m.daocloud.io/docker.io/library/fedora:39
                command:
                - sh
                - -c
                args:
                - |
                  ls -l {{inputs.artifacts.argocd-binary.path}}
            - name: sync
              inputs:
                artifacts:
                - name: argocd-binary
                  path: /usr/local/bin/argocd
                parameters:
                - name: argocd-server
                - name: insecure-option
                  value: ""
              container:
                image: m.daocloud.io/docker.io/library/fedora:39
                env:
                - name: ARGOCD_USERNAME
                  valueFrom:
                    secretKeyRef:
                      name: argocd-login-credentials
                      key: username
                - name: ARGOCD_PASSWORD
                  valueFrom:
                    secretKeyRef:
                      name: argocd-login-credentials
                      key: password
                - name: WITH_PRUNE_OPTION
                  value: --prune
                command:
                - sh
                - -c
                args:
                - |
                  set -e
                  export ARGOCD_SERVER={{inputs.parameters.argocd-server}}
                  export INSECURE_OPTION={{inputs.parameters.insecure-option}}
                  export ARGOCD_USERNAME=${ARGOCD_USERNAME:-admin}
                  argocd login ${INSECURE_OPTION} --username ${ARGOCD_USERNAME} --password ${ARGOCD_PASSWORD} ${ARGOCD_SERVER}
                  argocd app sync argocd/app-postgresql ${WITH_PRUNE_OPTION} --timeout 300
            - name: wait
              inputs:
                artifacts:
                - name: argocd-binary
                  path: /usr/local/bin/argocd
                parameters:
                - name: argocd-server
                - name: insecure-option
                  value: ""
              container:
                image: m.daocloud.io/docker.io/library/fedora:39
                env:
                - name: ARGOCD_USERNAME
                  valueFrom:
                    secretKeyRef:
                      name: argocd-login-credentials
                      key: username
                - name: ARGOCD_PASSWORD
                  valueFrom:
                    secretKeyRef:
                      name: argocd-login-credentials
                      key: password
                command:
                - sh
                - -c
                args:
                - |
                  set -e
                  export ARGOCD_SERVER={{inputs.parameters.argocd-server}}
                  export INSECURE_OPTION={{inputs.parameters.insecure-option}}
                  export ARGOCD_USERNAME=${ARGOCD_USERNAME:-admin}
                  argocd login ${INSECURE_OPTION} --username ${ARGOCD_USERNAME} --password ${ARGOCD_PASSWORD} ${ARGOCD_SERVER}
                  argocd app wait argocd/app-postgresql
            - name: init-db-tool
              resource:
                action: apply
                manifest: |
                  apiVersion: apps/v1
                  kind: Deployment
                  metadata:
                    name: app-postgresql-tool
                    namespace: application
                    labels:
                      app.kubernetes.io/name: postgresql-tool
                  spec:
                    replicas: 1
                    selector:
                      matchLabels:
                        app.kubernetes.io/name: postgresql-tool
                    template:
                      metadata:
                        labels:
                          app.kubernetes.io/name: postgresql-tool
                      spec:
                        containers:
                          - name: postgresql-tool
                            image: m.daocloud.io/docker.io/bitnami/postgresql:14.4.0-debian-11-r9
                            imagePullPolicy: IfNotPresent
                            env:
                              - name: POSTGRES_PASSWORD
                                valueFrom:
                                  secretKeyRef:
                                    key: postgres-password
                                    name: postgresql-credentials
                              - name: TZ
                                value: Asia/Shanghai
                            command:
                              - tail
                            args:
                              - -f
                              - /etc/hosts

5.submit to Argo Workflow client

          Details
argo -n business-workflows submit deploy-postgresql-flow.yaml
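
You can watch the workflow while it deploys; `@latest` refers to the most recently submitted workflow in the namespace:

argo -n business-workflows list
argo -n business-workflows logs @latest --follow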

6.decode password

          Details
          kubectl -n application get secret postgresql-credentials -o jsonpath='{.data.postgres-password}' | base64 -d

7.import data

          Details
          POSTGRES_PASSWORD=$(kubectl -n application get secret postgresql-credentials -o jsonpath='{.data.postgres-password}' | base64 -d) \
          POD_NAME=$(kubectl get pod -n application -l "app.kubernetes.io/name=postgresql-tool" -o jsonpath="{.items[0].metadata.name}") \
          && export SQL_FILENAME="init_dfs_table_data.sql" \
          && kubectl -n application cp ${SQL_FILENAME} ${POD_NAME}:/tmp/${SQL_FILENAME} \
          && kubectl -n application exec -it deployment/app-postgresql-tool -- bash -c \
              'echo "CREATE DATABASE csst;" | PGPASSWORD="$POSTGRES_PASSWORD" \
              psql --host app-postgresql.application -U postgres -d postgres -p 5432' \
          && kubectl -n application exec -it deployment/app-postgresql-tool -- bash -c \
              'PGPASSWORD="$POSTGRES_PASSWORD" psql --host app-postgresql.application \
              -U postgres -d csst -p 5432 < /tmp/init_dfs_table_data.sql'
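
To confirm the import succeeded, list the tables in the `csst` database from the same tool pod (reusing the in-pod `POSTGRES_PASSWORD` environment variable defined by the deployment above):

kubectl -n application exec -it deployment/app-postgresql-tool -- bash -c \
    'PGPASSWORD="$POSTGRES_PASSWORD" psql --host app-postgresql.application \
    -U postgres -d csst -p 5432 -c "\dt"'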

          Mar 7, 2024

          Install Redis

          Installation

          Install By

          Preliminary

          1. Kubernetes has installed, if not check 🔗link


          2. Helm has installed, if not check 🔗link


          1.get helm repo

          Details
          helm repo add ay-helm-mirror https://aaronyang0628.github.io/helm-chart-mirror/charts
          helm repo update

          2.install chart

          Details
          helm install ay-helm-mirror/kube-prometheus-stack --generate-name
          Using Proxy

          Preliminary

          1. Kubernetes has installed, if not check 🔗link


          2. Helm has installed, if not check 🔗link


          3. ArgoCD has installed, if not check 🔗link


          1.prepare `deploy-xxxxx.yaml`

          Details

          2.apply to k8s

          Details
          kubectl -n argocd apply -f xxxx.yaml

          3.sync by argocd

          Details
          argocd app sync argocd/xxxx

          4.prepare yaml-content.yaml

          Details

          5.apply to k8s

          Details
          kubectl apply -f xxxx.yaml

          6.apply xxxx.yaml directly

          Details
          kubectl apply -f - <<EOF
          
          EOF

          Preliminary

          1. Docker|Podman|Buildah has installed, if not check 🔗link


          Using Proxy

you can run an additional DaoCloud image to accelerate your pulls, check Daocloud Proxy

          1.init server

          Details
          mkdir -p $(pwd)/redis/data
          podman run --rm \
              --name redis \
              -p 6379:6379 \
              -d docker.io/library/redis:7.2.4-alpine

          1.use internal client

          Details
          podman run --rm \
              -it docker.io/library/redis:7.2.4-alpine \
              redis-cli \
              -h host.containers.internal \
              set mykey somevalue
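
Reading the key back from another throw-away client confirms the write reached the server:

podman run --rm \
    -it docker.io/library/redis:7.2.4-alpine \
    redis-cli \
    -h host.containers.internal \
    get mykey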

          Preliminary

          1. Kubernetes has installed, if not check 🔗link


          2. Helm has installed, if not check 🔗link


          3. ArgoCD has installed, if not check 🔗link


          4. Argo Workflow has installed, if not check 🔗link


          5. Minio artifact repository has been configured, if not check 🔗link


          - endpoint: minio.storage:9000

          1.prepare `argocd-login-credentials`

          Details
          ARGOCD_USERNAME=admin
          ARGOCD_PASSWORD=$(kubectl -n argocd get secret argocd-initial-admin-secret -o jsonpath="{.data.password}" | base64 -d)
          kubectl -n business-workflows create secret generic argocd-login-credentials \
              --from-literal=username=${ARGOCD_USERNAME} \
              --from-literal=password=${ARGOCD_PASSWORD}

          2.apply rolebinding to k8s

          Details
          kubectl apply -f - <<EOF
          ---
          apiVersion: rbac.authorization.k8s.io/v1
          kind: ClusterRole
          metadata:
            name: application-administrator
          rules:
            - apiGroups:
                - argoproj.io
              resources:
                - applications
              verbs:
                - '*'
            - apiGroups:
                - apps
              resources:
                - deployments
              verbs:
                - '*'
          
          ---
          apiVersion: rbac.authorization.k8s.io/v1
          kind: RoleBinding
          metadata:
            name: application-administration
            namespace: argocd
          roleRef:
            apiGroup: rbac.authorization.k8s.io
            kind: ClusterRole
            name: application-administrator
          subjects:
            - kind: ServiceAccount
              name: argo-workflow
              namespace: business-workflows
          
          ---
          apiVersion: rbac.authorization.k8s.io/v1
          kind: RoleBinding
          metadata:
            name: application-administration
            namespace: application
          roleRef:
            apiGroup: rbac.authorization.k8s.io
            kind: ClusterRole
            name: application-administrator
          subjects:
            - kind: ServiceAccount
              name: argo-workflow
              namespace: business-workflows
          EOF

          3.prepare redis credentials secret

          Details
          kubectl -n application create secret generic redis-credentials \
            --from-literal=redis-password=$(tr -dc A-Za-z0-9 </dev/urandom | head -c 16)

          4.prepare `deploy-redis-flow.yaml`

          Details
          apiVersion: argoproj.io/v1alpha1
          kind: Workflow
          metadata:
            generateName: deploy-argocd-app-redis-
          spec:
            entrypoint: entry
            artifactRepositoryRef:
              configmap: artifact-repositories
              key: default-artifact-repository
            serviceAccountName: argo-workflow
            templates:
            - name: entry
              inputs:
                parameters:
                - name: argocd-server
                  value: argocd-server.argocd:443
                - name: insecure-option
                  value: --insecure
              dag:
                tasks:
                - name: apply
                  template: apply
                - name: prepare-argocd-binary
                  template: prepare-argocd-binary
                  dependencies:
                  - apply
                - name: sync
                  dependencies:
                  - prepare-argocd-binary
                  template: sync
                  arguments:
                    artifacts:
                    - name: argocd-binary
                      from: "{{tasks.prepare-argocd-binary.outputs.artifacts.argocd-binary}}"
                    parameters:
                    - name: argocd-server
                      value: "{{inputs.parameters.argocd-server}}"
                    - name: insecure-option
                      value: "{{inputs.parameters.insecure-option}}"
                - name: wait
                  dependencies:
                  - sync
                  template: wait
                  arguments:
                    artifacts:
                    - name: argocd-binary
                      from: "{{tasks.prepare-argocd-binary.outputs.artifacts.argocd-binary}}"
                    parameters:
                    - name: argocd-server
                      value: "{{inputs.parameters.argocd-server}}"
                    - name: insecure-option
                      value: "{{inputs.parameters.insecure-option}}"
            - name: apply
              resource:
                action: apply
                manifest: |
                  apiVersion: argoproj.io/v1alpha1
                  kind: Application
                  metadata:
                    name: app-redis
                    namespace: argocd
                  spec:
                    syncPolicy:
                      syncOptions:
                      - CreateNamespace=true
                    project: default
                    source:
                      repoURL: https://charts.bitnami.com/bitnami
                      chart: redis
                      targetRevision: 18.16.0
                      helm:
                        releaseName: app-redis
                        values: |
                          architecture: replication
                          auth:
                            enabled: true
                            sentinel: true
                            existingSecret: redis-credentials
                          master:
                            count: 1
                            disableCommands:
                              - FLUSHDB
                              - FLUSHALL
                            persistence:
                              enabled: false
                          replica:
                            replicaCount: 3
                            disableCommands:
                              - FLUSHDB
                              - FLUSHALL
                            persistence:
                              enabled: false
                          image:
                            registry: m.daocloud.io/docker.io
                            pullPolicy: IfNotPresent
                          sentinel:
                            enabled: false
                            persistence:
                              enabled: false
                            image:
                              registry: m.daocloud.io/docker.io
                              pullPolicy: IfNotPresent
                          metrics:
                            enabled: false
                            image:
                              registry: m.daocloud.io/docker.io
                              pullPolicy: IfNotPresent
                          volumePermissions:
                            enabled: false
                            image:
                              registry: m.daocloud.io/docker.io
                              pullPolicy: IfNotPresent
                          sysctl:
                            enabled: false
                            image:
                              registry: m.daocloud.io/docker.io
                              pullPolicy: IfNotPresent
                    destination:
                      server: https://kubernetes.default.svc
                      namespace: application
            - name: prepare-argocd-binary
              inputs:
                artifacts:
                - name: argocd-binary
                  path: /tmp/argocd
                  mode: 755
                  http:
                    url: https://files.m.daocloud.io/github.com/argoproj/argo-cd/releases/download/v2.9.3/argocd-linux-amd64
              outputs:
                artifacts:
                - name: argocd-binary
                  path: "{{inputs.artifacts.argocd-binary.path}}"
              container:
                image: m.daocloud.io/docker.io/library/fedora:39
                command:
                - sh
                - -c
                args:
                - |
                  ls -l {{inputs.artifacts.argocd-binary.path}}
            - name: sync
              inputs:
                artifacts:
                - name: argocd-binary
                  path: /usr/local/bin/argocd
                parameters:
                - name: argocd-server
                - name: insecure-option
                  value: ""
              container:
                image: m.daocloud.io/docker.io/library/fedora:39
                env:
                - name: ARGOCD_USERNAME
                  valueFrom:
                    secretKeyRef:
                      name: argocd-login-credentials
                      key: username
                - name: ARGOCD_PASSWORD
                  valueFrom:
                    secretKeyRef:
                      name: argocd-login-credentials
                      key: password
                - name: WITH_PRUNE_OPTION
                  value: --prune
                command:
                - sh
                - -c
                args:
                - |
                  set -e
                  export ARGOCD_SERVER={{inputs.parameters.argocd-server}}
                  export INSECURE_OPTION={{inputs.parameters.insecure-option}}
                  export ARGOCD_USERNAME=${ARGOCD_USERNAME:-admin}
                  argocd login ${INSECURE_OPTION} --username ${ARGOCD_USERNAME} --password ${ARGOCD_PASSWORD} ${ARGOCD_SERVER}
                  argocd app sync argocd/app-redis ${WITH_PRUNE_OPTION} --timeout 300
            - name: wait
              inputs:
                artifacts:
                - name: argocd-binary
                  path: /usr/local/bin/argocd
                parameters:
                - name: argocd-server
                - name: insecure-option
                  value: ""
              container:
                image: m.daocloud.io/docker.io/library/fedora:39
                env:
                - name: ARGOCD_USERNAME
                  valueFrom:
                    secretKeyRef:
                      name: argocd-login-credentials
                      key: username
                - name: ARGOCD_PASSWORD
                  valueFrom:
                    secretKeyRef:
                      name: argocd-login-credentials
                      key: password
                command:
                - sh
                - -c
                args:
                - |
                  set -e
                  export ARGOCD_SERVER={{inputs.parameters.argocd-server}}
                  export INSECURE_OPTION={{inputs.parameters.insecure-option}}
                  export ARGOCD_USERNAME=${ARGOCD_USERNAME:-admin}
                  argocd login ${INSECURE_OPTION} --username ${ARGOCD_USERNAME} --password ${ARGOCD_PASSWORD} ${ARGOCD_SERVER}
                  argocd app wait argocd/app-redis

5.submit to Argo Workflow client

          Details
          argo -n business-workflows submit deploy-redis-flow.yaml

6.decode password

          Details
          kubectl -n application get secret redis-credentials -o jsonpath='{.data.redis-password}' | base64 -d
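
With the decoded password you can ping the master from a temporary pod (a sketch only; the service name `app-redis-master` follows the usual Bitnami naming for release `app-redis` and may differ in your cluster):

REDIS_PASSWORD=$(kubectl -n application get secret redis-credentials -o jsonpath='{.data.redis-password}' | base64 -d)
kubectl -n application run redis-client --rm -it --restart=Never \
    --image=m.daocloud.io/docker.io/library/redis:7.2.4-alpine -- \
    redis-cli -h app-redis-master -a "$REDIS_PASSWORD" ping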

          Mar 7, 2024

          Subsections of Git

          Install Act Runner

          Installation

          Install By

          Preliminary

          1. Kubernetes has installed, if not check 🔗link


          2. Helm binary has installed, if not check 🔗link


          1.get helm repo

          Details
          helm repo add ay-helm-mirror https://aaronyang0628.github.io/helm-chart-mirror/charts
          helm repo update

          2.prepare `act-runner-secret`

          Details
          kubectl -n application create secret generic act-runner-secret \
            --from-literal=act-runner-token=4w3Sx0Hwe6VFevl473ZZ4nFVDvFvhKcEUBvpJ09L

          3.prepare values

          Details
          echo "
          replicas: 1
          runner:
            instanceURL: http://192.168.100.125:30300
            token:
              fromSecret:
                name: "act-runner-secret"
                key: "act-runner-token"" > act-runner-values.yaml

          4.install chart

          Details
          helm upgrade  --create-namespace -n application --install -f ./act-runner-values.yaml act-runner ay-helm-mirror/act-runner
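
After the release installs, a quick status check (the label selector assumes the chart follows the standard `app.kubernetes.io/instance` convention):

helm -n application status act-runner
kubectl -n application get pods -l app.kubernetes.io/instance=act-runner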

          Preliminary

          1. Kubernetes has installed, if not check 🔗link


          2. ArgoCD has installed, if not check 🔗link


          3. Helm binary has installed, if not check 🔗link


          1.prepare `act-runner-secret`

          Details
          kubectl -n application create secret generic act-runner-secret \
            --from-literal=act-runner-token=4w3Sx0Hwe6VFevl473ZZ4nFVDvFvhKcEUBvpJ09L
The act-runner token can be obtained from here.

A token is used for authentication and identification, e.g. P2U1U0oB4XaRCi8azcngmPCLbRpUGapalhmddh23. Each token can be used to create multiple runners until it is replaced with a new token via the reset link. You can obtain tokens at different levels from the following places to create runners of the corresponding level:

          Instance level: The admin settings page, like <your_gitea.com>/-/admin/actions/runners.

(screenshot: act_runner_token)

          2.prepare act-runner.yaml

          Storage In
          kubectl -n argocd apply -f - <<EOF
          apiVersion: argoproj.io/v1alpha1
          kind: Application
          metadata:
            name: act-runner
          spec:
            syncPolicy:
              syncOptions:
              - CreateNamespace=true
            project: default
            source:
              repoURL: https://aaronyang0628.github.io/helm-chart-mirror/charts
              chart: act-runner
              targetRevision: 0.2.2
              helm:
                releaseName: act-runner
                values: |
                  image:
                    name: vegardit/gitea-act-runner
                    tag: "dind-0.2.13"
                    repository: m.daocloud.io/docker.io
                  runner:
                    instanceURL: https://192.168.100.125:30300
                    token:
                      fromSecret:
                        name: "act-runner-secret"
                        key: "act-runner-token"
                    config:
                      enabled: true
                      data: |
                        log:
                          level: info
                        runner:
                          labels:
                            - ubuntu-latest:docker://m.daocloud.io/docker.gitea.com/runner-images:ubuntu-latest
                        container:
                          force_pull: true
                  persistence:
                    enabled: true
                    storageClassName: ""
                    accessModes: ReadWriteOnce
                    size: 10Gi
                  autoscaling:
                    enabled: true
                    minReplicas: 1
                    maxReplicas: 3
                  replicas: 1  
                  securityContext:
                    privileged: true
                    runAsUser: 0
                    runAsGroup: 0
                    fsGroup: 0
                    capabilities:
                      add: ["NET_ADMIN", "SYS_ADMIN"]
                  podSecurityContext:
                    runAsUser: 0
                    runAsGroup: 0
                    fsGroup: 0
                  resources: 
                    requests:
                      cpu: 200m
                      memory: 512Mi
                    limits:
                      cpu: 1000m
                      memory: 2048Mi
            destination:
              server: https://kubernetes.default.svc
              namespace: application
          EOF
          

3.sync by argocd

          Details
          argocd app sync argocd/act-runner

4.use action

          Details

          Even if Actions is enabled for the Gitea instance, repositories still disable Actions by default.

          To enable it, go to the settings page of your repository like your_gitea.com/<owner>/repo/settings and enable Enable Repository Actions.

(screenshot: act_runner_token)

          Preliminary

          1. Podman has installed, and the `podman` command is available in your PATH.


          1.prepare data and config dir

          Details
          mkdir -p /opt/gitea_act_runner/{data,config} \
          && chown -R 1000:1000 /opt/gitea_act_runner \
          && chmod -R 755 /opt/gitea_act_runner

          2.run container

          Details
          podman run -it \
            --name gitea_act_runner \
            --rm \
            --privileged \
            --network=host \
            -v /opt/gitea_act_runner/data:/data \
            -v /opt/gitea_act_runner/config:/config \
            -v /var/run/podman/podman.sock:/var/run/docker.sock \
            -e GITEA_INSTANCE_URL="http://10.200.60.64:30300" \
            -e GITEA_RUNNER_REGISTRATION_TOKEN="5lgsrOzfKz3RiqeMWxxUb9RmUPEWNnZ6hTTZV0DL" \
            m.daocloud.io/docker.io/gitea/act_runner:latest-dind-rootless
          Using Mirror

you can run an additional DaoCloud image to accelerate your pulls, check Daocloud Proxy

          Preliminary

1. Docker has installed, and the `docker` command is available in your PATH.

          1.prepare data and config dir

          Details
          mkdir -p /opt/gitea_act_runner/{data,config} \
          && chown -R 1000:1000 /opt/gitea_act_runner \
          && chmod -R 755 /opt/gitea_act_runner

          2.run container

          Details
          docker run -it \
            --name gitea_act_runner \
            --rm \
            --privileged \
            --network=host \
            -v /opt/gitea_act_runner/data:/data \
            -v /opt/gitea_act_runner/config:/config \
            -e GITEA_INSTANCE_URL="http://192.168.100.125:30300" \
            -e GITEA_RUNNER_REGISTRATION_TOKEN="5lgsrOzfKz3RiqeMWxxUb9RmUPEWNnZ6hTTZV0DL" \
            m.daocloud.io/docker.io/gitea/act_runner:latest-dind
          Using Mirror

you can run an additional DaoCloud image to accelerate your pulls, check Daocloud Proxy

          Jun 7, 2025

          Install Gitea

          Installation

          Install By

          Preliminary

          1. Kubernetes has installed, if not check 🔗link


          2. Helm binary has installed, if not check 🔗link


          3. CertManager has installed, if not check 🔗link


          4. Ingress has installed, if not check 🔗link


          1.get helm repo

          Details
          helm repo add gitea-charts https://dl.gitea.com/charts/
          helm repo update

          2.install chart

          Details
helm install gitea gitea-charts/gitea
          Using Mirror
          helm repo add ay-helm-mirror https://aaronyang0628.github.io/helm-chart-mirror/charts \
            && helm install ay-helm-mirror/gitea --generate-name --version 12.1.3

          for more information, you can check 🔗https://aaronyang0628.github.io/helm-chart-mirror/

          Preliminary

          1. Kubernetes has installed, if not check 🔗link


          2. ArgoCD has installed, if not check 🔗link


          3. Helm binary has installed, if not check 🔗link


4. Ingress has installed on argoCD, if not check 🔗link


          5. Minio has installed, if not check 🔗link


1.prepare `gitea-admin-credentials`

          Storage In
          kubectl get namespaces application > /dev/null 2>&1 || kubectl create namespace application
          kubectl -n application create secret generic gitea-admin-credentials \
              --from-literal=username=gitea_admin \
              --from-literal=password=$(tr -dc A-Za-z0-9 </dev/urandom | head -c 16)
          
          

          2.prepare `gitea.yaml`

          Storage In
          apiVersion: argoproj.io/v1alpha1
          kind: Application
          metadata:
            name: gitea
          spec:
            syncPolicy:
              syncOptions:
              - CreateNamespace=true
            project: default
            source:
              repoURL: https://dl.gitea.com/charts/
              chart: gitea
              targetRevision: 10.1.4
              helm:
                releaseName: gitea
                values: |
                  image:
                    registry: m.daocloud.io/docker.io
                  service:
                    http:
                      type: NodePort
                      port: 3000
                      nodePort: 30300
                    ssh:
                      type: NodePort
                      port: 22
                      nodePort: 32022
                  ingress:
                    enabled: true
                    ingressClassName: nginx
                    annotations:
                      kubernetes.io/ingress.class: nginx
                      nginx.ingress.kubernetes.io/rewrite-target: /$1
                      cert-manager.io/cluster-issuer: self-signed-ca-issuer
                    hosts:
                    - host: gitea.ay.dev
                      paths:
                      - path: /?(.*)
                        pathType: ImplementationSpecific
                    tls:
                    - secretName: gitea.ay.dev-tls
                      hosts:
                      - gitea.ay.dev
                  persistence:
                    enabled: true
                    size: 8Gi
                    storageClass: ""
                  redis-cluster:
                    enabled: false
                  postgresql-ha:
                    enabled: false
                  postgresql:
                    enabled: true
                    architecture: standalone
                    image:
                      registry: m.daocloud.io/docker.io
                    primary:
                      persistence:
                        enabled: false
                        storageClass: ""
                        size: 8Gi
                    readReplicas:
                      replicaCount: 1
                      persistence:
                        enabled: true
                        storageClass: ""
                        size: 8Gi
                    backup:
                      enabled: false
                    volumePermissions:
                      enabled: false
                      image:
                        registry: m.daocloud.io/docker.io
                    metrics:
                      enabled: false
                      image:
                        registry: m.daocloud.io/docker.io
                  gitea:
                    admin:
                      existingSecret: gitea-admin-credentials
                      email: aaron19940628@gmail.com
                    config:
                      database:
                        DB_TYPE: postgres
                      session:
                        PROVIDER: db
                      cache:
                        ADAPTER: memory
                      queue:
                        TYPE: level
                      indexer:
                        ISSUE_INDEXER_TYPE: bleve
                        REPO_INDEXER_ENABLED: true
                      repository:
                        MAX_CREATION_LIMIT: 10
                        DISABLED_REPO_UNITS: "repo.wiki,repo.ext_wiki,repo.projects"
                        DEFAULT_REPO_UNITS: "repo.code,repo.releases,repo.issues,repo.pulls"
                      server:
                        PROTOCOL: http
                        LANDING_PAGE: login
                        DOMAIN: gitea.ay.dev
                        ROOT_URL: https://gitea.ay.dev:32443/
                        SSH_DOMAIN: ssh.gitea.ay.dev
                        SSH_PORT: 32022
                        SSH_AUTHORIZED_PRINCIPALS_ALLOW: email
                      admin:
                        DISABLE_REGULAR_ORG_CREATION: true
                      security:
                        INSTALL_LOCK: true
                      service:
                        REGISTER_EMAIL_CONFIRM: true
                        DISABLE_REGISTRATION: true
                        ENABLE_NOTIFY_MAIL: false
                        DEFAULT_ALLOW_CREATE_ORGANIZATION: false
                        SHOW_MILESTONES_DASHBOARD_PAGE: false
                      migrations:
                        ALLOW_LOCALNETWORKS: true
                      mailer:
                        ENABLED: false
                      i18n:
                        LANGS: "en-US,zh-CN"
                        NAMES: "English,简体中文"
                      oauth2:
                        ENABLE: false
            destination:
              server: https://kubernetes.default.svc
              namespace: application
          
          

          3.apply to k8s

          Details
          kubectl -n argocd apply -f gitea.yaml

          4.sync by argocd

          Details
          argocd app sync argocd/gitea

          5.decode admin password

Login to 🔗https://gitea.ay.dev:32443/ using the user `gitea_admin` and the password decoded below:
          Details
          kubectl -n application get secret gitea-admin-credentials -o jsonpath='{.data.password}' | base64 -d
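
As a smoke test, a request through the ingress should answer with an HTTP status code (this assumes `gitea.ay.dev` already resolves to your ingress, e.g. via an /etc/hosts entry):

curl -ks -o /dev/null -w '%{http_code}\n' https://gitea.ay.dev:32443/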

          Jun 7, 2025

          HPC

            Mar 7, 2024

            Subsections of Monitor

            Install Homepage

Official Documentation: https://gethomepage.dev/

            Installation

            Install By

            Preliminary

            1. Kubernetes has installed, if not check 🔗link


            2. Helm has installed, if not check 🔗link


            1.install chart directly

            Details
            helm install homepage oci://ghcr.io/m0nsterrr/helm-charts/homepage

2.[Optional] modify the values.yaml and re-install

            Related values files
            Details
            helm install homepage oci://ghcr.io/m0nsterrr/helm-charts/homepage -f homepage.values.yaml
            Using Mirror
            helm repo add ay-helm-mirror https://aaronyang0628.github.io/helm-chart-mirror/charts \
              && helm install ay-helm-mirror/homepage  --generate-name --version 4.2.0

            for more information, you can check 🔗https://aaronyang0628.github.io/helm-chart-mirror/

            Preliminary

            1. Kubernetes has installed, if not check 🔗link


            2. ArgoCD has installed, if not check 🔗link


            3. Helm binary has installed, if not check 🔗link


4. Ingress has installed on argoCD, if not check 🔗link


            1.prepare `homepage.yaml`

            Details
            kubectl -n argocd apply -f - << EOF
              apiVersion: argoproj.io/v1alpha1
              kind: Application
              metadata:
                name: homepage
              spec:
                syncPolicy:
                  syncOptions:
                    - CreateNamespace=true
                    - ServerSideApply=true
                project: default
                source:
                  repoURL: oci://ghcr.io/m0nsterrr/helm-charts/homepage
                  chart: homepage
                  targetRevision: 4.2.0
                  helm:
                    releaseName: homepage
                    values: |
                      image:
                        registry: m.daocloud.io/ghcr.io
                        repository: gethomepage/homepage
                        pullPolicy: IfNotPresent
                        tag: "v1.5.0"
                      config:
                        allowedHosts: 
                        - "home.72602.online"
                      ingress:
                        enabled: true
                        ingressClassName: "nginx"
                        annotations:
                          kubernetes.io/ingress.class: nginx
                        hosts:
                          - host: home.72602.online
                            paths:
                              - path: /
                                pathType: ImplementationSpecific
                      resources:
                        limits:
                          cpu: 500m
                          memory: 512Mi
                        requests:
                          cpu: 100m
                          memory: 128Mi
                destination:
                  server: https://kubernetes.default.svc
                  namespace: monitor
            EOF

2.sync by argocd

            Details
            argocd app sync argocd/homepage

3.check in the web browser

            Details
            K8S_MASTER_IP=$(kubectl get nodes --selector=node-role.kubernetes.io/control-plane -o jsonpath='{$.items[0].status.addresses[?(@.type=="InternalIP")].address}')
            echo "$K8S_MASTER_IP home.72602.online" >> /etc/hosts

            Preliminary

1. Docker has installed, if not check 🔗link


            docker run -d \
            --name homepage \
            -e HOMEPAGE_ALLOWED_HOSTS=47.110.67.161:3000 \
            -e PUID=1000 \
            -e PGID=1000 \
            -p 3000:3000 \
            -v /root/home-site/static/icons:/app/public/icons  \
            -v /root/home-site/content/Ops/HomePage/config:/app/config \
            -v /var/run/docker.sock:/var/run/docker.sock:ro \
            --restart unless-stopped \
            ghcr.io/gethomepage/homepage:v1.5.0

            Preliminary

1. Podman has installed, if not check 🔗link


            podman run -d \
            --name homepage \
            -e HOMEPAGE_ALLOWED_HOSTS=127.0.0.1:3000 \
            -e PUID=1000 \
            -e PGID=1000 \
            -p 3000:3000 \
            -v /root/home-site/static/icons:/app/public/icons \
            -v /root/home-site/content/Ops/HomePage/config:/app/config \
            --restart unless-stopped \
            ghcr.io/gethomepage/homepage:v1.5.0
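
Once the container is up, you can check the logs and confirm it serves on the allowed host configured above (`127.0.0.1:3000`):

podman logs homepage
curl -s -o /dev/null -w '%{http_code}\n' http://127.0.0.1:3000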

            Oct 7, 2025

Install Prometheus Stack

            Installation

            Install By

            Preliminary

            1. Kubernetes has installed, if not check 🔗link


            2. Helm has installed, if not check 🔗link


            1.get helm repo

            Details
            helm repo add ay-helm-mirror https://aaronyang0628.github.io/helm-chart-mirror/charts
            helm repo update

            2.install chart

            Details
            helm install ay-helm-mirror/kube-prometheus-stack --generate-name
            Using Mirror
            helm repo add ay-helm-mirror https://aaronyang0628.github.io/helm-chart-mirror/charts \
              && helm install ay-helm-mirror/kube-prometheus-stack  --generate-name --version 1.17.2

            for more information, you can check 🔗https://aaronyang0628.github.io/helm-chart-mirror/

            Preliminary

            1. Kubernetes has installed, if not check 🔗link


            2. ArgoCD has installed, if not check 🔗link


            3. Helm binary has installed, if not check 🔗link


4. Ingress has installed on argoCD, if not check 🔗link


1.prepare `prometheus-stack-credentials`

            Details
            kubectl get namespaces monitor > /dev/null 2>&1 || kubectl create namespace monitor
            kubectl -n monitor create secret generic prometheus-stack-credentials \
              --from-literal=grafana-username=admin \
              --from-literal=grafana-password=$(tr -dc A-Za-z0-9 </dev/urandom | head -c 16)
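
You will need the Grafana admin password again after the stack is synced; it can be recovered from the same secret:

kubectl -n monitor get secret prometheus-stack-credentials -o jsonpath='{.data.grafana-password}' | base64 -d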

            2.prepare `prometheus-stack.yaml`

            Details
            kubectl -n argocd apply -f - << EOF
              apiVersion: argoproj.io/v1alpha1
              kind: Application
              metadata:
                name: prometheus-stack
              spec:
                syncPolicy:
                  syncOptions:
                    - CreateNamespace=true
                    - ServerSideApply=true
                project: default
                source:
                  repoURL: https://aaronyang0628.github.io/helm-chart-mirror/charts
                  chart: kube-prometheus-stack
                  targetRevision: 72.9.1
                  helm:
                    releaseName: prometheus-stack
                    values: |
                      crds:
                        enabled: true
                      global:
                        rbac:
                          create: true
                        imageRegistry: ""
                        imagePullSecrets: []
                      alertmanager:
                        enabled: true
                        ingress:
                          enabled: false
                        serviceMonitor:
                          selfMonitor: true
                          interval: ""
                        alertmanagerSpec:
                          image:
                            registry: m.daocloud.io/quay.io
                            repository: prometheus/alertmanager
                            tag: v0.28.1
                          replicas: 1
                          resources: {}
                          storage:
                            volumeClaimTemplate:
                              spec:
                                storageClassName: ""
                                accessModes: ["ReadWriteOnce"]
                                resources:
                                  requests:
                                    storage: 2Gi
                      grafana:
                        enabled: true
                        ingress:
                          enabled: true
                          annotations:
                            cert-manager.io/clusterissuer: self-signed-issuer
                            kubernetes.io/ingress.class: nginx
                          hosts:
                            - grafana.dev.tech
                          path: /
                          pathtype: ImplementationSpecific
                          tls:
                          - secretName: grafana.dev.tech-tls
                            hosts:
                            - grafana.dev.tech
                      prometheusOperator:
                        admissionWebhooks:
                          patch:
                            resources: {}
                            image:
                              registry: m.daocloud.io/registry.k8s.io
                              repository: ingress-nginx/kube-webhook-certgen
                              tag: v1.5.3  
                        image:
                          registry: m.daocloud.io/quay.io
                          repository: prometheus-operator/prometheus-operator
                        prometheusConfigReloader:
                          image:
                            registry: m.daocloud.io/quay.io
                            repository: prometheus-operator/prometheus-config-reloader
                          resources: {}
                        thanosImage:
                          registry: m.daocloud.io/quay.io
                          repository: thanos/thanos
                          tag: v0.38.0
                      prometheus:
                        enabled: true
                        ingress:
                          enabled: true
                          annotations:
                            cert-manager.io/clusterissuer: self-signed-issuer
                            kubernetes.io/ingress.class: nginx
                          hosts:
                            - prometheus.dev.tech
                          path: /
                          pathtype: ImplementationSpecific
                          tls:
                          - secretName: prometheus.dev.tech-tls
                            hosts:
                            - prometheus.dev.tech
                        prometheusSpec:
                          image:
                            registry: m.daocloud.io/quay.io
                            repository: prometheus/prometheus
                            tag: v3.4.0
                          replicas: 1
                          shards: 1
                          resources: {}
                          storageSpec: 
                            volumeClaimTemplate:
                              spec:
                                storageClassName: ""
                                accessModes: ["ReadWriteOnce"]
                                resources:
                                  requests:
                                    storage: 2Gi
                      thanosRuler:
                        enabled: false
                        ingress:
                          enabled: false
                        thanosRulerSpec:
                          replicas: 1
                          storage: {}
                          resources: {}
                          image:
                            registry: m.daocloud.io/quay.io
                            repository: thanos/thanos
                            tag: v0.38.0
                destination:
                  server: https://kubernetes.default.svc
                  namespace: monitor
            EOF

            3.sync by argocd

            Details
            argocd app sync argocd/prometheus-stack

4.extract grafana admin credentials

            Details
              kubectl -n monitor get secret prometheus-stack-credentials -o jsonpath='{.data.grafana-password}' | base64 -d

            5.check the web browser

            Details
              > add `$K8S_MASTER_IP grafana.dev.tech` to **/etc/hosts**
            
              > add `$K8S_MASTER_IP prometheus.dev.tech` to **/etc/hosts**
prometheus-server: https://prometheus.dev.tech:32443/


            grafana-console: https://grafana.dev.tech:32443/
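Before opening a browser, you can verify both ingresses answer from the command line; a minimal check, assuming the ingress-nginx HTTPS NodePort 32443 used elsewhere in this guide:

K8S_MASTER_IP=$(kubectl get node -l node-role.kubernetes.io/control-plane -o jsonpath='{.items[0].status.addresses[?(@.type=="InternalIP")].address}')
curl -kI --resolve grafana.dev.tech:32443:${K8S_MASTER_IP} https://grafana.dev.tech:32443/login
curl -kI --resolve prometheus.dev.tech:32443:${K8S_MASTER_IP} https://prometheus.dev.tech:32443/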




            Jun 7, 2024

            Subsections of Networking

            Install Cert Manager

            Installation

            Install By

            Preliminary

            1. Kubernetes has installed, if not check 🔗link


            2. Helm binary has installed, if not check 🔗link


            1.get helm repo

            Details
            helm repo add cert-manager-repo https://charts.jetstack.io
            helm repo update

            2.install chart

            Details
            helm install cert-manager-repo/cert-manager --generate-name --version 1.17.2
            Using Mirror
            helm repo add ay-helm-mirror https://aaronyang0628.github.io/helm-chart-mirror/charts \
              && helm install ay-helm-mirror/cert-manager --generate-name --version 1.17.2

            for more information, you can check 🔗https://aaronyang0628.github.io/helm-chart-mirror/

            Preliminary

            1. Kubernetes has installed, if not check 🔗link


            2. ArgoCD has installed, if not check 🔗link


            3. Helm binary has installed, if not check 🔗link


            1.prepare `cert-manager.yaml`

            Details
            kubectl -n argocd apply -f - << EOF
            apiVersion: argoproj.io/v1alpha1
            kind: Application
            metadata:
              name: cert-manager
            spec:
              syncPolicy:
                syncOptions:
                - CreateNamespace=true
              project: default
              source:
                repoURL: https://aaronyang0628.github.io/helm-chart-mirror/charts
                chart: cert-manager
                targetRevision: 1.17.2
                helm:
                  releaseName: cert-manager
                  values: |
                    installCRDs: true
                    image:
                      repository: m.daocloud.io/quay.io/jetstack/cert-manager-controller
                      tag: v1.17.2
                    webhook:
                      image:
                        repository: m.daocloud.io/quay.io/jetstack/cert-manager-webhook
                        tag: v1.17.2
                    cainjector:
                      image:
                        repository: m.daocloud.io/quay.io/jetstack/cert-manager-cainjector
                        tag: v1.17.2
                    acmesolver:
                      image:
                        repository: m.daocloud.io/quay.io/jetstack/cert-manager-acmesolver
                        tag: v1.17.2
                    startupapicheck:
                      image:
                        repository: m.daocloud.io/quay.io/jetstack/cert-manager-startupapicheck
                        tag: v1.17.2
              destination:
                server: https://kubernetes.default.svc
                namespace: basic-components
            EOF

            3.sync by argocd

            Details
            argocd app sync argocd/cert-manager

            4.prepare self-signed.yaml

            Details
            kubectl apply  -f - <<EOF
            ---
            apiVersion: cert-manager.io/v1
            kind: Issuer
            metadata:
              namespace: basic-components
              name: self-signed-issuer
            spec:
              selfSigned: {}
            
            ---
            apiVersion: cert-manager.io/v1
            kind: Certificate
            metadata:
              namespace: basic-components
              name: my-self-signed-ca
            spec:
              isCA: true
              commonName: my-self-signed-ca
              secretName: root-secret
              privateKey:
                algorithm: ECDSA
                size: 256
              issuerRef:
                name: self-signed-issuer
                kind: Issuer
                group: cert-manager.io
            
            ---
            apiVersion: cert-manager.io/v1
            kind: ClusterIssuer
            metadata:
              name: self-signed-ca-issuer
            spec:
              ca:
                secretName: root-secret
            EOF
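To confirm the issuers above are ready before wiring them into ingresses, a quick check (resource names as defined in the manifest above):

kubectl -n basic-components get issuer self-signed-issuer
kubectl -n basic-components get certificate my-self-signed-ca
kubectl -n basic-components get secret root-secret
kubectl get clusterissuer self-signed-ca-issuer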

            Preliminary

            1. Docker|Podman|Buildah has installed, if not check 🔗link


            1.just run

            Details
            docker run --name cert-manager -e ALLOW_EMPTY_PASSWORD=yes bitnami/cert-manager:latest
            Using Proxy

you can use an additional DaoCloud mirror image to accelerate your pulling, check Daocloud Proxy

docker run --name cert-manager \
  -e ALLOW_EMPTY_PASSWORD=yes \
  m.daocloud.io/docker.io/bitnami/cert-manager:latest

            Preliminary

            1. Kubernetes has installed, if not check 🔗link


            1.just run

            Details
            kubectl create -f https://github.com/jetstack/cert-manager/releases/download/v1.17.2/cert-manager.yaml
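The static manifest installs everything into the `cert-manager` namespace; a minimal way to wait for it to become ready:

kubectl -n cert-manager wait --for=condition=Available deployment --all --timeout=300s
kubectl -n cert-manager get pods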

            FAQ

            Q1: The browser doesn’t trust this self-signed certificate

            Basically, you need to import the certificate into your browser.

            kubectl -n basic-components get secret root-secret -o jsonpath='{.data.tls\.crt}' | base64 -d > cert-manager-self-signed-ca-secret.crt

            And then import it into your browser.


            Jun 7, 2024

            Install HAProxy

            Mar 7, 2024

            Install Ingress

            Installation

            Install By

            Preliminary

            1. Kubernetes has installed, if not check 🔗link


            2. Helm has installed, if not check 🔗link


            1.get helm repo

            Details
            helm repo add ingress-nginx https://kubernetes.github.io/ingress-nginx
            helm repo update

            2.install chart

            Details
            helm install ingress-nginx/ingress-nginx --generate-name
            Using Mirror
            helm repo add ay-helm-mirror https://aaronyang0628.github.io/helm-chart-mirror/charts &&
              helm install ay-helm-mirror/ingress-nginx --generate-name --version 4.11.3

            for more information, you can check 🔗https://aaronyang0628.github.io/helm-chart-mirror/

            Preliminary

            1. Kubernetes has installed, if not check 🔗link


            2. argoCD has installed, if not check 🔗link


            1.prepare `ingress-nginx.yaml`

            Details
            kubectl -n argocd apply -f - <<EOF
            apiVersion: argoproj.io/v1alpha1
            kind: Application
            metadata:
              name: ingress-nginx
            spec:
              syncPolicy:
                syncOptions:
                - CreateNamespace=true
              project: default
              source:
                repoURL: https://kubernetes.github.io/ingress-nginx
                chart: ingress-nginx
                targetRevision: 4.12.3
                helm:
                  releaseName: ingress-nginx
                  values: |
                    controller:
                      image:
                        registry: m.daocloud.io/registry.k8s.io
                      service:
                        enabled: true
                        type: NodePort
                        nodePorts:
                          http: 32080
                          https: 32443
                          tcp:
                            8080: 32808
                      resources:
                        requests:
                          cpu: 100m
                          memory: 128Mi
                      admissionWebhooks:
                        enabled: true
                        patch:
                          enabled: true
                          image:
                            registry: m.daocloud.io/registry.k8s.io
                    metrics:
                      enabled: false
                    defaultBackend:
                      enabled: false
                      image:
                        registry: m.daocloud.io/registry.k8s.io
              destination:
                server: https://kubernetes.default.svc
                namespace: basic-components
            EOF

            [Optional] 2.apply to k8s

            Details
            kubectl -n argocd apply -f ingress-nginx.yaml

            3.sync by argocd

            Details
            argocd app sync argocd/ingress-nginx
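Once synced, you can sanity-check the controller from outside the cluster; a request for an unknown host should return the controller's default 404, which confirms the NodePorts configured above are reachable (a minimal check, assuming ports 32080/32443):

K8S_MASTER_IP=$(kubectl get node -l node-role.kubernetes.io/control-plane -o jsonpath='{.items[0].status.addresses[?(@.type=="InternalIP")].address}')
curl -I http://${K8S_MASTER_IP}:32080
curl -kI https://${K8S_MASTER_IP}:32443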

            FAQ

Q1: Using minikube, cannot access the website
            ssh -i ~/.minikube/machines/minikube/id_rsa docker@$(minikube ip) -L '*:30443:0.0.0.0:30443' -N -f
            ssh -i ~/.minikube/machines/minikube/id_rsa docker@$(minikube ip) -L '*:32443:0.0.0.0:32443' -N -f
            ssh -i ~/.minikube/machines/minikube/id_rsa docker@$(minikube ip) -L '*:32080:0.0.0.0:32080' -N -f


            Jun 7, 2024

            Install Istio

            Installation

            Install By

            Preliminary

            1. Kubernetes has installed, if not check 🔗link


            2. Helm has installed, if not check 🔗link


            1.get helm repo

            Details
helm repo add istio https://istio-release.storage.googleapis.com/charts
helm repo update

2.install chart

Details
helm install istio-base istio/base -n istio-system --create-namespace
helm install istiod istio/istiod -n istio-system --wait
Using Proxy

you can pull the Istio images through the DaoCloud mirror by adding `--set global.hub=m.daocloud.io/docker.io/istio` to the istiod install command, check Daocloud Proxy

            Preliminary

            1. Kubernetes has installed, if not check 🔗link


            2. Helm has installed, if not check 🔗link


            3. ArgoCD has installed, if not check 🔗link


            1.prepare `deploy-istio-base.yaml`

            Details
            kubectl -n argocd apply -f - << EOF
            apiVersion: argoproj.io/v1alpha1
            kind: Application
            metadata:
              name: istio-base
            spec:
              syncPolicy:
                syncOptions:
                - CreateNamespace=true
              project: default
              source:
                repoURL: https://istio-release.storage.googleapis.com/charts
                chart: base
                targetRevision: 1.23.2
                helm:
                  releaseName: istio-base
                  values: |
                    defaults:
                      global:
                        istioNamespace: istio-system
                      base:
                        enableCRDTemplates: false
                        enableIstioConfigCRDs: true
                      defaultRevision: "default"
              destination:
                server: https://kubernetes.default.svc
                namespace: istio-system
            EOF

            2.sync by argocd

            Details
            argocd app sync argocd/istio-base

            3.prepare `deploy-istiod.yaml`

            Details
            kubectl -n argocd apply -f - << EOF
            apiVersion: argoproj.io/v1alpha1
            kind: Application
            metadata:
              name: istiod
            spec:
              syncPolicy:
                syncOptions:
                - CreateNamespace=true
              project: default
              source:
                repoURL: https://istio-release.storage.googleapis.com/charts
                chart: istiod
                targetRevision: 1.23.2
                helm:
                  releaseName: istiod
                  values: |
                    defaults:
                      global:
                        istioNamespace: istio-system
                        defaultResources:
                          requests:
                            cpu: 10m
                            memory: 128Mi
                          limits:
                            cpu: 100m
                            memory: 128Mi
                        hub: m.daocloud.io/docker.io/istio
                        proxy:
                          autoInject: disabled
                          resources:
                            requests:
                              cpu: 100m
                              memory: 128Mi
                            limits:
                              cpu: 2000m
                              memory: 1024Mi
                      pilot:
                        autoscaleEnabled: true
                        resources:
                          requests:
                            cpu: 500m
                            memory: 2048Mi
                        cpu:
                          targetAverageUtilization: 80
                        podAnnotations:
                          cluster-autoscaler.kubernetes.io/safe-to-evict: "true"
              destination:
                server: https://kubernetes.default.svc
                namespace: istio-system
            EOF

            4.sync by argocd

            Details
            argocd app sync argocd/istiod

            5.prepare `deploy-istio-ingressgateway.yaml`

            Details
            kubectl -n argocd apply -f - << EOF
            apiVersion: argoproj.io/v1alpha1
            kind: Application
            metadata:
              name: istio-ingressgateway
            spec:
              syncPolicy:
                syncOptions:
                - CreateNamespace=true
              project: default
              source:
                repoURL: https://istio-release.storage.googleapis.com/charts
                chart: gateway
                targetRevision: 1.23.2
                helm:
                  releaseName: istio-ingressgateway
                  values: |
                    defaults:
                      replicaCount: 1
                      podAnnotations:
                        inject.istio.io/templates: "gateway"
                        sidecar.istio.io/inject: "true"
                        cluster-autoscaler.kubernetes.io/safe-to-evict: "true"
                      resources:
                        requests:
                          cpu: 100m
                          memory: 128Mi
                        limits:
                          cpu: 2000m
                          memory: 1024Mi
                      service:
                        type: LoadBalancer
                        ports:
                        - name: status-port
                          port: 15021
                          protocol: TCP
                          targetPort: 15021
                        - name: http2
                          port: 80
                          protocol: TCP
                          targetPort: 80
                        - name: https
                          port: 443
                          protocol: TCP
                          targetPort: 443
                      autoscaling:
                        enabled: true
                        minReplicas: 1
                        maxReplicas: 5
              destination:
                server: https://kubernetes.default.svc
                namespace: istio-system
            EOF

            6.sync by argocd

            Details
            argocd app sync argocd/istio-ingressgateway
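A quick verification of the control plane and the gateway (assuming the gateway Service takes the release name istio-ingressgateway):

kubectl -n istio-system get pods
kubectl -n istio-system get svc istio-ingressgateway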

            Preliminary

            1. Kubernetes has installed, if not check 🔗link


            2. Helm has installed, if not check 🔗link


            3. ArgoCD has installed, if not check 🔗link


            4. Argo Workflow has installed, if not check 🔗link


            1.prepare `argocd-login-credentials`

            Details
            kubectl get namespaces database > /dev/null 2>&1 || kubectl create namespace database

            2.apply rolebinding to k8s

            Details
            kubectl apply -f - <<EOF
            ---
            apiVersion: rbac.authorization.k8s.io/v1
            kind: ClusterRole
            metadata:
              name: application-administrator
            rules:
              - apiGroups:
                  - argoproj.io
                resources:
                  - applications
                verbs:
                  - '*'
              - apiGroups:
                  - apps
                resources:
                  - deployments
                verbs:
                  - '*'
            
            ---
            apiVersion: rbac.authorization.k8s.io/v1
            kind: RoleBinding
            metadata:
              name: application-administration
              namespace: argocd
            roleRef:
              apiGroup: rbac.authorization.k8s.io
              kind: ClusterRole
              name: application-administrator
            subjects:
              - kind: ServiceAccount
                name: argo-workflow
                namespace: business-workflows
            
            ---
            apiVersion: rbac.authorization.k8s.io/v1
            kind: RoleBinding
            metadata:
              name: application-administration
              namespace: application
            roleRef:
              apiGroup: rbac.authorization.k8s.io
              kind: ClusterRole
              name: application-administrator
            subjects:
              - kind: ServiceAccount
                name: argo-workflow
                namespace: business-workflows
            EOF

            4.prepare `deploy-xxxx-flow.yaml`

            Details

6.submit to argo workflow client

            Details
            argo -n business-workflows submit deploy-xxxx-flow.yaml
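After submitting, you can follow the workflow from the CLI, for example:

argo -n business-workflows list
argo -n business-workflows get @latest
argo -n business-workflows logs @latest --follow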

            7.decode password

            Details
            kubectl -n application get secret xxxx-credentials -o jsonpath='{.data.xxx-password}' | base64 -d


            Jun 7, 2024

            Install Nginx

            1. prepare server.conf

            cat << EOF > default.conf
            server {
              listen 80;
              location / {
                  root   /usr/share/nginx/html;
                  autoindex on;
              }
            }
            EOF

            2. install

            mkdir $(pwd)/data
            podman run --rm -p 8080:80 \
                -v $(pwd)/data:/usr/share/nginx/html:ro \
                -v $(pwd)/default.conf:/etc/nginx/conf.d/default.conf:ro \
                -d docker.io/library/nginx:1.19.9-alpine
            echo 'this is a test' > $(pwd)/data/some-data.txt
            Tip

you can use an additional DaoCloud mirror image to accelerate your pulling, check Daocloud Proxy

            visit http://localhost:8080
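You can also fetch the test file from the command line; since autoindex is on, the directory listing works too:

curl http://localhost:8080/some-data.txt
curl http://localhost:8080/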

            Mar 7, 2024

            Subsections of RPC

gRPC

            This guide gets you started with gRPC in C++ with a simple working example.

            In the C++ world, there’s no universally accepted standard for managing project dependencies. You need to build and install gRPC before building and running this quick start’s Hello World example.

Build and locally install gRPC and Protocol Buffers. The steps in this section explain how to build and locally install gRPC and Protocol Buffers using cmake. If you’d rather use bazel, see Building from source.

            1. Setup

            Choose a directory to hold locally installed packages. This page assumes that the environment variable MY_INSTALL_DIR holds this directory path. For example:

            export MY_INSTALL_DIR=$HOME/.local

            Ensure that the directory exists:

            mkdir -p $MY_INSTALL_DIR

            Add the local bin folder to your path variable, for example:

            export PATH="$MY_INSTALL_DIR/bin:$PATH"
            Important

            We strongly encourage you to install gRPC locally — using an appropriately set CMAKE_INSTALL_PREFIX — because there is no easy way to uninstall gRPC after you’ve installed it globally.

            2. Install Essentials

            2.1 Install Cmake

            You need version 3.13 or later of cmake. Install it by following these instructions:

            Install on
            sudo apt install -y cmake
            brew install cmake
            Check the version of cmake
            cmake --version
            2.2 Install basic tools required to build gRPC
            Install on
            sudo apt install -y build-essential autoconf libtool pkg-config
            brew install autoconf automake libtool pkg-config
            2.3 Clone the grpc repo

            Clone the grpc repo and its submodules:

            git clone --recurse-submodules -b v1.62.0 --depth 1 --shallow-submodules https://github.com/grpc/grpc
            2.4 Build and install gRPC and Protocol Buffers

            While not mandatory, gRPC applications usually leverage Protocol Buffers for service definitions and data serialization, and the example code uses proto3.

            The following commands build and locally install gRPC and Protocol Buffers:

            cd grpc
            mkdir -p cmake/build
            pushd cmake/build
            cmake -DgRPC_INSTALL=ON \
                  -DgRPC_BUILD_TESTS=OFF \
                  -DCMAKE_INSTALL_PREFIX=$MY_INSTALL_DIR \
                  ../..
            make -j 4
            make install
            popd
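Before moving on, it is worth confirming that the locally installed tools (placed into $MY_INSTALL_DIR/bin above) are the ones on your PATH:

which protoc grpc_cpp_plugin
protoc --version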

            3. Run the example

            The example code is part of the grpc repo source, which you cloned as part of the steps of the previous section.

3.1 change to the example’s directory:
            cd examples/cpp/helloworld
            3.2 build the example project by using cmake

make sure `echo $MY_INSTALL_DIR` still returns a valid path

            mkdir -p cmake/build
            pushd cmake/build
            cmake -DCMAKE_PREFIX_PATH=$MY_INSTALL_DIR ../..
            make -j 4

            3.3 run the server

            ./greeter_server

            3.4 from a different terminal, run the client and see the client output:

            ./greeter_client

            and the result should be like this:

            Greeter received: Hello world
            Apr 7, 2024

            Subsections of Storage

Deploy Artifact Repository

            Preliminary

            • Kubernetes has installed, if not check link
            • minio is ready for artifact repository

              endpoint: minio.storage:9000

            Steps

            1. prepare bucket for s3 artifact repository

            # K8S_MASTER_IP could be you master ip or loadbalancer external ip
            K8S_MASTER_IP=172.27.253.27
            MINIO_ACCESS_SECRET=$(kubectl -n storage get secret minio-secret -o jsonpath='{.data.rootPassword}' | base64 -d)
            podman run --rm \
            --entrypoint bash \
            --add-host=minio-api.dev.geekcity.tech:${K8S_MASTER_IP} \
            -it docker.io/minio/mc:latest \
            -c "mc alias set minio http://minio-api.dev.geekcity.tech admin ${MINIO_ACCESS_SECRET} \
                && mc ls minio \
                && mc mb --ignore-existing minio/argo-workflows-artifacts"

            2. prepare secret s3-artifact-repository-credentials

need to create the business-workflows namespace first

kubectl get namespaces business-workflows > /dev/null 2>&1 || kubectl create namespace business-workflows
MINIO_ACCESS_KEY=$(kubectl -n storage get secret minio-secret -o jsonpath='{.data.rootUser}' | base64 -d)
kubectl -n business-workflows create secret generic s3-artifact-repository-credentials \
    --from-literal=accessKey=${MINIO_ACCESS_KEY} \
    --from-literal=secretKey=${MINIO_ACCESS_SECRET}

            3. prepare configMap artifact-repositories.yaml

            apiVersion: v1
            kind: ConfigMap
            metadata:
              name: artifact-repositories
              annotations:
                workflows.argoproj.io/default-artifact-repository: default-artifact-repository
            data:
              default-artifact-repository: |
                s3:
                  endpoint: minio.storage:9000
                  insecure: true
                  accessKeySecret:
                    name: s3-artifact-repository-credentials
                    key: accessKey
                  secretKeySecret:
                    name: s3-artifact-repository-credentials
                    key: secretKey
                  bucket: argo-workflows-artifacts

            4. apply artifact-repositories.yaml to k8s

            kubectl -n business-workflows apply -f artifact-repositories.yaml
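To verify the default artifact repository end to end, you can submit a tiny workflow that produces one output artifact. This is a minimal sketch: the file name and the mirrored busybox image are assumptions, and depending on your Argo Workflows version the namespace's default ServiceAccount may need extra RBAC.

cat > artifact-smoke-test.yaml << 'EOF'
apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: artifact-smoke-test-
spec:
  entrypoint: make-artifact
  templates:
    - name: make-artifact
      container:
        image: m.daocloud.io/docker.io/library/busybox:latest
        command: [sh, -c]
        args: ["echo hello > /tmp/hello.txt"]
      outputs:
        artifacts:
          - name: hello
            path: /tmp/hello.txt
EOF
argo -n business-workflows submit artifact-smoke-test.yaml --watch

If the workflow ends up Succeeded, the artifact should appear under the `argo-workflows-artifacts` bucket in minio.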
            Mar 7, 2024

            Install Chart Museum

            Installation

            Install By

            Preliminary

            1. Kubernetes has installed, if not check 🔗link


            2. Helm binary has installed, if not check 🔗link


            1.get helm repo

            Details
helm repo add chartmuseum-repo https://chartmuseum.github.io/charts
helm repo update

2.install chart

Details
helm install chartmuseum-repo/chartmuseum --generate-name --version 3.10.3
Using Mirror
helm repo add ay-helm-mirror https://aaronyang0628.github.io/helm-chart-mirror/charts \
  && helm install ay-helm-mirror/chartmuseum --generate-name --version 3.10.3

            for more information, you can check 🔗https://aaronyang0628.github.io/helm-chart-mirror/

            Preliminary

            1. Kubernetes has installed, if not check 🔗link


            2. ArgoCD has installed, if not check 🔗link


            3. Helm binary has installed, if not check 🔗link


4. Ingress has been installed on argoCD, if not check 🔗link


            5. Minio has installed, if not check 🔗link


            1.prepare `chart-museum-credentials`

            Storage In
            kubectl get namespaces basic-components > /dev/null 2>&1 || kubectl create namespace basic-components
            kubectl -n basic-components create secret generic chart-museum-credentials \
                --from-literal=username=admin \
                --from-literal=password=$(tr -dc A-Za-z0-9 </dev/urandom | head -c 16)
            
            kubectl get namespaces basic-components > /dev/null 2>&1 || kubectl create namespace basic-components
            kubectl -n basic-components create secret generic chart-museum-credentials \
                --from-literal=username=admin \
                --from-literal=password=$(tr -dc A-Za-z0-9 </dev/urandom | head -c 16) \
                --from-literal=aws_access_key_id=$(kubectl -n storage get secret minio-credentials -o jsonpath='{.data.rootUser}' | base64 -d) \
                --from-literal=aws_secret_access_key=$(kubectl -n storage get secret minio-credentials -o jsonpath='{.data.rootPassword}' | base64 -d)
            

            2.prepare `chart-museum.yaml`

            Storage In
            kubectl apply -f - << EOF
            apiVersion: argoproj.io/v1alpha1
            kind: Application
            metadata:
              name: chart-museum
            spec:
              syncPolicy:
                syncOptions:
                  - CreateNamespace=true
              project: default
              source:
                repoURL: https://chartmuseum.github.io/charts
                chart: chartmuseum
                targetRevision: 3.10.3
                helm:
                  releaseName: chart-museum
                  values: |
                    replicaCount: 1
                    image:
                      repository: m.daocloud.io/ghcr.io/helm/chartmuseum
                    env:
                      open:
                        DISABLE_API: false
                        STORAGE: local
                        AUTH_ANONYMOUS_GET: true
                      existingSecret: "chart-museum-credentials"
                      existingSecretMappings:
                        BASIC_AUTH_USER: "username"
                        BASIC_AUTH_PASS: "password"
                    persistence:
                      enabled: false
                      storageClass: ""
                    volumePermissions:
                      image:
                        registry: m.daocloud.io/docker.io
                    ingress:
                      enabled: true
                      ingressClassName: nginx
                      annotations:
                        cert-manager.io/cluster-issuer: self-signed-ca-issuer
                        nginx.ingress.kubernetes.io/rewrite-target: /$1
                      hosts:
                        - name: chartmuseum.ay.dev
                          path: /?(.*)
                          tls: true
                          tlsSecret: chartmuseum.ay.dev-tls
              destination:
                server: https://kubernetes.default.svc
                namespace: basic-components
            EOF
            
            kubectl apply -f - << EOF
            apiVersion: argoproj.io/v1alpha1
            kind: Application
            metadata:
              name: chart-museum
            spec:
              syncPolicy:
                syncOptions:
                  - CreateNamespace=true
              project: default
              source:
                repoURL: https://chartmuseum.github.io/charts
                chart: chartmuseum
                targetRevision: 3.10.3
                helm:
                  releaseName: chart-museum
                  values: |
                    replicaCount: 1
                    image:
                      repository: m.daocloud.io/ghcr.io/helm/chartmuseum
                    env:
                      open:
                        DISABLE_API: false
                        STORAGE: amazon
                        STORAGE_AMAZON_ENDPOINT: http://minio-api.ay.dev:32080
                        STORAGE_AMAZON_BUCKET: chart-museum
                        STORAGE_AMAZON_PREFIX: charts
                        STORAGE_AMAZON_REGION: us-east-1
                        AUTH_ANONYMOUS_GET: true
                      existingSecret: "chart-museum-credentials"
                      existingSecretMappings:
                        BASIC_AUTH_USER: "username"
                        BASIC_AUTH_PASS: "password"
                        AWS_ACCESS_KEY_ID: "aws_access_key_id"
                        AWS_SECRET_ACCESS_KEY: "aws_secret_access_key"
                    persistence:
                      enabled: false
                      storageClass: ""
                    volumePermissions:
                      image:
                        registry: m.daocloud.io/docker.io
                    ingress:
                      enabled: true
                      ingressClassName: nginx
                      annotations:
                        cert-manager.io/cluster-issuer: self-signed-ca-issuer
                        nginx.ingress.kubernetes.io/rewrite-target: /$1
                      hosts:
                        - name: chartmuseum.ay.dev
                          path: /?(.*)
                          tls: true
                          tlsSecret: chartmuseum.ay.dev-tls
              destination:
                server: https://kubernetes.default.svc
                namespace: basic-components
            EOF
            

            3.sync by argocd

            Details
            argocd app sync argocd/chart-museum

            Uploading a Chart Package

Assuming ChartMuseum is up and running and reachable at http://localhost:8080 (for the in-cluster install above, you can port-forward the ChartMuseum service to localhost:8080), you can upload charts as follows.

            First create mychart-0.1.0.tgz using the Helm CLI:

            cd mychart/
            helm package .

            Upload mychart-0.1.0.tgz:

            curl --data-binary "@mychart-0.1.0.tgz" http://localhost:8080/api/charts

            If you’ve signed your package and generated a provenance file, upload it with:

            curl --data-binary "@mychart-0.1.0.tgz.prov" http://localhost:8080/api/prov

            Both files can also be uploaded at once (or one at a time) on the /api/charts route using the multipart/form-data format:

            curl -F "chart=@mychart-0.1.0.tgz" -F "prov=@mychart-0.1.0.tgz.prov" http://localhost:8080/api/charts

            You can also use the helm-push plugin:

            helm cm-push mychart/ chartmuseum

            Installing Charts into Kubernetes

            Add the URL to your ChartMuseum installation to the local repository list:

            helm repo add chartmuseum http://localhost:8080

            Search for charts:

            helm search repo chartmuseum/

            Install chart:

            helm install chartmuseum/mychart --generate-name
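For the in-cluster deployment above, the same commands work against the ingress instead of localhost; a minimal sketch, assuming the chartmuseum.ay.dev host and the ingress-nginx HTTPS NodePort 32443 used elsewhere in this guide (add `$K8S_MASTER_IP chartmuseum.ay.dev` to /etc/hosts first):

CM_PASS=$(kubectl -n basic-components get secret chart-museum-credentials -o jsonpath='{.data.password}' | base64 -d)
helm repo add my-chartmuseum https://chartmuseum.ay.dev:32443 \
  --username admin --password ${CM_PASS} \
  --insecure-skip-tls-verify
helm search repo my-chartmuseum/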


            Jun 7, 2024

            Install Harbor

            Mar 7, 2025

            Install Minio

            Installation

            Install By

            Preliminary

            1. Kubernetes has installed, if not check 🔗link


            2. Helm binary has installed, if not check 🔗link


            Preliminary

            1. Kubernetes has installed, if not check 🔗link


            2. ArgoCD has installed, if not check 🔗link


3. Ingress has been installed on argoCD, if not check 🔗link


4. Cert-manager has been installed on argocd and a ClusterIssuer named `self-signed-ca-issuer` exists, if not check 🔗link


            1.prepare minio credentials secret

            Details
            kubectl get namespaces storage > /dev/null 2>&1 || kubectl create namespace storage
            kubectl -n storage create secret generic minio-secret \
                --from-literal=root-user=admin \
                --from-literal=root-password=$(tr -dc A-Za-z0-9 </dev/urandom | head -c 16)

            2.prepare `deploy-minio.yaml`

            Details
            kubectl -n argocd apply -f - << EOF
            apiVersion: argoproj.io/v1alpha1
            kind: Application
            metadata:
              name: minio
            spec:
              syncPolicy:
                syncOptions:
                - CreateNamespace=true
              project: default
              source:
                repoURL: https://aaronyang0628.github.io/helm-chart-mirror/charts
                chart: minio
                targetRevision: 16.0.10
                helm:
                  releaseName: minio
                  values: |
                    global:
                      imageRegistry: "m.daocloud.io/docker.io"
                      imagePullSecrets: []
                      storageClass: ""
                      security:
                        allowInsecureImages: true
                      compatibility:
                        openshift:
                          adaptSecurityContext: auto
                    image:
                      registry: m.daocloud.io/docker.io
                      repository: bitnami/minio
                    clientImage:
                      registry: m.daocloud.io/docker.io
                      repository: bitnami/minio-client
                    mode: standalone
                    defaultBuckets: ""
                    auth:
                      # rootUser: admin
                      # rootPassword: ""
                      existingSecret: "minio-secret"
                    statefulset:
                      updateStrategy:
                        type: RollingUpdate
                      podManagementPolicy: Parallel
                      replicaCount: 1
                      zones: 1
                      drivesPerNode: 1
                    resourcesPreset: "micro"
                    resources: 
                      requests:
                        memory: 512Mi
                        cpu: 250m
                      limits:
                        memory: 512Mi
                        cpu: 250m
                    ingress:
                      enabled: true
                      ingressClassName: "nginx"
                      hostname: minio-console.ay.online
                      path: /?(.*)
                      pathType: ImplementationSpecific
                      annotations:
                        kubernetes.io/ingress.class: nginx
                        nginx.ingress.kubernetes.io/rewrite-target: /$1
                        cert-manager.io/cluster-issuer: self-signed-ca-issuer
                      tls: true
                      selfSigned: true
                      extraHosts: []
                    apiIngress:
                      enabled: true
                      ingressClassName: "nginx"
                      hostname: minio-api.ay.online
                      path: /?(.*)
                      pathType: ImplementationSpecific
                      annotations: 
                        kubernetes.io/ingress.class: nginx
                        nginx.ingress.kubernetes.io/rewrite-target: /$1
                        cert-manager.io/cluster-issuer: self-signed-ca-issuer
                      tls: true
                      selfSigned: true
                      extraHosts: []
                    persistence:
                      enabled: false
                      storageClass: ""
                      mountPath: /bitnami/minio/data
                      accessModes:
                        - ReadWriteOnce
                      size: 8Gi
                      annotations: {}
                      existingClaim: ""
                    metrics:
                      prometheusAuthType: public
                      enabled: false
                      serviceMonitor:
                        enabled: false
                        namespace: ""
                        labels: {}
                        jobLabel: ""
                        paths:
                          - /minio/v2/metrics/cluster
                          - /minio/v2/metrics/node
                        interval: 30s
                        scrapeTimeout: ""
                        honorLabels: false
                      prometheusRule:
                        enabled: false
                        namespace: ""
                        additionalLabels: {}
                        rules: []
              destination:
                server: https://kubernetes.default.svc
                namespace: storage
            EOF

            3.sync by argocd

            Details
            argocd app sync argocd/minio

            4.decode minio secret

            Details
            kubectl -n storage get secret minio-secret -o jsonpath='{.data.root-password}' | base64 -d

            5.visit web console

            Login Credentials

            add $K8S_MASTER_IP minio-console.ay.online to /etc/hosts

            address: 🔗http://minio-console.ay.online:32080/login

            access key: admin

            secret key: ``

            6.using mc

            Details
            K8S_MASTER_IP=$(kubectl get node -l node-role.kubernetes.io/control-plane -o jsonpath='{.items[0].status.addresses[?(@.type=="InternalIP")].address}')
            MINIO_ACCESS_SECRET=$(kubectl -n storage get secret minio-secret -o jsonpath='{.data.root-password}' | base64 -d)
            podman run --rm \
                --entrypoint bash \
                --add-host=minio-api.dev.tech:${K8S_MASTER_IP} \
                -it m.daocloud.io/docker.io/minio/mc:latest \
                -c "mc alias set minio http://minio-api.dev.tech:32080 admin ${MINIO_ACCESS_SECRET} \
                    && mc ls minio \
                    && mc mb --ignore-existing minio/test \
                    && mc cp /etc/hosts minio/test/etc/hosts \
                    && mc ls --recursive minio"
            Details
            K8S_MASTER_IP=$(kubectl get node -l node-role.kubernetes.io/control-plane -o jsonpath='{.items[0].status.addresses[?(@.type=="InternalIP")].address}')
            MINIO_ACCESS_SECRET=$(kubectl -n storage get secret minio-secret -o jsonpath='{.data.root-password}' | base64 -d)
            podman run --rm \
                --entrypoint bash \
                --add-host=minio-api.dev.tech:${K8S_MASTER_IP} \
                -it m.daocloud.io/docker.io/minio/mc:latest

            Preliminary

            1. Docker has installed, if not check 🔗link


            Using Proxy

you can use an additional DaoCloud mirror image to accelerate your pulling, check Daocloud Proxy

            1.init server

            Details
            mkdir -p $(pwd)/minio/data
            podman run --rm \
                --name minio-server \
                -p 9000:9000 \
                -p 9001:9001 \
                -v $(pwd)/minio/data:/data \
                -d docker.io/minio/minio:latest server /data --console-address :9001

            2.use web console

            And then you can visit 🔗http://localhost:9001

            username: `minioadmin`

            password: `minioadmin`

            3.use internal client

            Details
            podman run --rm \
                --entrypoint bash \
                -it docker.io/minio/mc:latest \
                -c "mc alias set minio http://host.docker.internal:9000 minioadmin minioadmin \
                    && mc ls minio \
                    && mc mb --ignore-existing minio/test \
                    && mc cp /etc/hosts minio/test/etc/hosts \
                    && mc ls --recursive minio"


            Mar 7, 2024

            Install NFS

            Installation

            Install By

            Preliminary

            1. Kubernetes has installed, if not check 🔗link


            2. Helm has installed, if not check 🔗link


            Preliminary

            1. Kubernetes has installed, if not check 🔗link


            2. argoCD has installed, if not check 🔗link


3. Ingress has been installed on argoCD, if not check 🔗link


            1.prepare `nfs-provisioner.yaml`

            Details
            apiVersion: argoproj.io/v1alpha1
            kind: Application
            metadata:
              name: nfs-provisioner
            spec:
              syncPolicy:
                syncOptions:
                - CreateNamespace=true
              project: default
              source:
                repoURL: https://kubernetes-sigs.github.io/nfs-subdir-external-provisioner
                chart: nfs-subdir-external-provisioner
                targetRevision: 4.0.18
                helm:
                  releaseName: nfs-provisioner
                  values: |
                    image:
                      repository: m.daocloud.io/registry.k8s.io/sig-storage/nfs-subdir-external-provisioner
                      pullPolicy: IfNotPresent
                    nfs:
                      server: nfs.services.test
                      path: /
                      mountOptions:
                        - vers=4
                        - minorversion=0
                        - rsize=1048576
                        - wsize=1048576
                        - hard
                        - timeo=600
                        - retrans=2
                        - noresvport
                      volumeName: nfs-subdir-external-provisioner-nas
                      reclaimPolicy: Retain
                    storageClass:
                      create: true
                      defaultClass: true
                      name: nfs-external-nas
              destination:
                server: https://kubernetes.default.svc
                namespace: storage

3.apply `nfs-provisioner.yaml` to k8s

            Details
            kubectl -n argocd apply -f nfs-provisioner.yaml

            4.sync by argocd

            Details
            argocd app sync argocd/nfs-provisioner
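To verify the provisioner, you can create a throwaway PVC against the `nfs-external-nas` storage class defined above (the PVC name is just an example); it should become Bound within a few seconds:

kubectl apply -f - << EOF
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: nfs-test-pvc
spec:
  accessModes:
    - ReadWriteMany
  storageClassName: nfs-external-nas
  resources:
    requests:
      storage: 1Gi
EOF
kubectl get pvc nfs-test-pvc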

            Preliminary

            1. Docker has installed, if not check 🔗link


            Using Proxy

you can use an additional DaoCloud mirror image to accelerate your pulling, check Daocloud Proxy

            1.init server

            Details
            echo -e "nfs\nnfsd" > /etc/modules-load.d/nfs4.conf
            modprobe nfs && modprobe nfsd
            mkdir -p $(pwd)/data/nfs/data
            echo '/data *(rw,fsid=0,no_subtree_check,insecure,no_root_squash)' > $(pwd)/data/nfs/exports
            podman run \
                --name nfs4 \
                --rm \
                --privileged \
                -p 2049:2049 \
                -v $(pwd)/data/nfs/data:/data \
                -v $(pwd)/data/nfs/exports:/etc/exports:ro \
                -d docker.io/erichough/nfs-server:2.2.1

            Preliminary

            1. centos yum repo source has updated, if not check 🔗link



            1.install nfs util

# Debian / Ubuntu
sudo apt update -y
sudo apt-get install -y nfs-common
# CentOS / RHEL / Fedora
dnf update -y
dnf install -y nfs-utils rpcbind

            2. create share folder

            Details
            mkdir /data && chmod 755 /data

            3.edit `/etc/exports`

            Details
            /data *(rw,sync,insecure,no_root_squash,no_subtree_check)

            4.start nfs server

            Details
            systemctl enable rpcbind
            systemctl enable nfs-server
            systemctl start rpcbind
            systemctl start nfs-server

            5.test load on localhost

            Details
            showmount -e localhost
Expected Output
            Export list for localhost:
            /data *

            6.test load on other ip

            Details
            showmount -e 192.168.aa.bb
Expected Output
Export list for 192.168.aa.bb:
            /data *

            7.mount nfs disk

            Details
            mkdir -p $(pwd)/mnt/nfs
            sudo mount -v 192.168.aa.bb:/data $(pwd)/mnt/nfs  -o proto=tcp -o nolock

            8.set nfs auto mount

            Details
            echo "192.168.aa.bb:/data /data nfs rw,auto,nofail,noatime,nolock,intr,tcp,actimeo=1800 0 0" >> /etc/fstab
            df -h
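
To apply the new fstab entry without rebooting, you can remount everything and check the result:

sudo mount -a
df -h | grep /data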

            Notes

            [Optional] create new partition
            disk size:
            fdisk /dev/vdb
            
            # n
            # p
            # w
            parted
            
            #select /dev/vdb 
            #mklabel gpt 
            #mkpart primary 0 -1
            #Cancel
            #mkpart primary 0% 100%
            #print
[Optional] Format disk
            mkfs.xfs /dev/vdb1 -f
            [Optional] mount disk to folder
            mount /dev/vdb1 /data
            [Optional] mount when restart
            #vim `/etc/fstab` 
            /dev/vdb1     /data  xfs   defaults   0 0



            Mar 7, 2025

            Install Rook Ceph

            Mar 7, 2025

Install Redis

            Installation

            Install By

            Preliminary

            1. Kubernetes has installed, if not check 🔗link


            2. Helm has installed, if not check 🔗link


            1.get helm repo

            Details
            helm repo add ay-helm-mirror https://aaronyang0628.github.io/helm-chart-mirror/charts
            helm repo update

            2.install chart

            Details
helm install ay-helm-mirror/redis --generate-name
            Using Proxy

            Preliminary

            1. Kubernetes has installed, if not check 🔗link


            2. Helm has installed, if not check 🔗link


            3. ArgoCD has installed, if not check 🔗link


            1.prepare redis secret

            Details
            kubectl get namespaces storage > /dev/null 2>&1 || kubectl create namespace storage
            kubectl -n storage create secret generic redis-credentials \
              --from-literal=redis-password=$(tr -dc A-Za-z0-9 </dev/urandom | head -c 16)

            2.prepare `deploy-redis.yaml`

            Details
            kubectl -n argocd apply -f - << EOF
            apiVersion: argoproj.io/v1alpha1
            kind: Application
            metadata:
              name: redis
            spec:
              syncPolicy:
                syncOptions:
                - CreateNamespace=true
              project: default
              source:
                repoURL: https://charts.bitnami.com/bitnami
                chart: redis
                targetRevision: 18.16.0
                helm:
                  releaseName: redis
                  values: |
                    architecture: replication
                    auth:
                      enabled: true
                      sentinel: true
                      existingSecret: redis-credentials
                    master:
                      count: 1
                      disableCommands:
                        - FLUSHDB
                        - FLUSHALL
                      persistence:
                        enabled: true
                        storageClass: nfs-external
                        size: 8Gi
                    replica:
                      replicaCount: 3
                      disableCommands:
                        - FLUSHDB
                        - FLUSHALL
                      persistence:
                        enabled: true
                        storageClass: nfs-external
                        size: 8Gi
                    image:
                      registry: m.daocloud.io/docker.io
                      pullPolicy: IfNotPresent
                    sentinel:
                      enabled: false
                      persistence:
                        enabled: false
                      image:
                        registry: m.daocloud.io/docker.io
                        pullPolicy: IfNotPresent
                    metrics:
                      enabled: false
                      image:
                        registry: m.daocloud.io/docker.io
                        pullPolicy: IfNotPresent
                    volumePermissions:
                      enabled: false
                      image:
                        registry: m.daocloud.io/docker.io
                        pullPolicy: IfNotPresent
                    sysctl:
                      enabled: false
                      image:
                        registry: m.daocloud.io/docker.io
                        pullPolicy: IfNotPresent
                    extraDeploy:
                      - |
                        apiVersion: apps/v1
                        kind: Deployment
                        metadata:
                          name: redis-tool
namespace: storage
                          labels:
                            app.kubernetes.io/name: redis-tool
                        spec:
                          replicas: 1
                          selector:
                            matchLabels:
                              app.kubernetes.io/name: redis-tool
                          template:
                            metadata:
                              labels:
                                app.kubernetes.io/name: redis-tool
                            spec:
                              containers:
                              - name: redis-tool
                                image: m.daocloud.io/docker.io/bitnami/redis:7.2.4-debian-12-r8
                                imagePullPolicy: IfNotPresent
                                env:
                                - name: REDISCLI_AUTH
                                  valueFrom:
                                    secretKeyRef:
                                      key: redis-password
                                      name: redis-credentials
                                - name: TZ
                                  value: Asia/Shanghai
                                command:
                                - tail
                                - -f
                                - /etc/hosts
              destination:
                server: https://kubernetes.default.svc
                namespace: storage
            EOF

            3.sync by argocd

            Details
            argocd app sync argocd/redis

            4.decode password

            Details
            kubectl -n storage get secret redis-credentials -o jsonpath='{.data.redis-password}' | base64 -d

            Preliminary

            1. Docker|Podman|Buildah has installed, if not check 🔗link


            Using Proxy

you can run an additional daocloud image to accelerate your image pulling, check Daocloud Proxy

            1.init server

            Details
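The container-based steps are not filled in here; a minimal sketch using the same bitnami image as the chart above (container name, published port, and password handling are assumptions) could be:

REDIS_PASS=$(tr -dc A-Za-z0-9 </dev/urandom | head -c 16)
podman run -d \
    --name redis \
    -p 6379:6379 \
    -e REDIS_PASSWORD=$REDIS_PASS \
    m.daocloud.io/docker.io/bitnami/redis:7.2.4-debian-12-r8
podman exec -it redis redis-cli -a $REDIS_PASS ping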

            Preliminary

            1. Kubernetes has installed, if not check 🔗link


            2. Helm has installed, if not check 🔗link


            3. ArgoCD has installed, if not check 🔗link


            4. Argo Workflow has installed, if not check 🔗link


            1.prepare `argocd-login-credentials`

            Details
            kubectl get namespaces database > /dev/null 2>&1 || kubectl create namespace database
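
The step name suggests a secret with ArgoCD login credentials is expected by the workflow; a minimal sketch (the secret name, key names, and the business-workflows namespace are assumptions based on the rolebinding below) could be:

kubectl -n business-workflows create secret generic argocd-login-credentials \
    --from-literal=ARGOCD_USERNAME=admin \
    --from-literal=ARGOCD_PASSWORD=$(kubectl -n argocd get secret argocd-initial-admin-secret -o jsonpath="{.data.password}" | base64 -d)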

            2.apply rolebinding to k8s

            Details
            kubectl apply -f - <<EOF
            ---
            apiVersion: rbac.authorization.k8s.io/v1
            kind: ClusterRole
            metadata:
              name: application-administrator
            rules:
              - apiGroups:
                  - argoproj.io
                resources:
                  - applications
                verbs:
                  - '*'
              - apiGroups:
                  - apps
                resources:
                  - deployments
                verbs:
                  - '*'
            
            ---
            apiVersion: rbac.authorization.k8s.io/v1
            kind: RoleBinding
            metadata:
              name: application-administration
              namespace: argocd
            roleRef:
              apiGroup: rbac.authorization.k8s.io
              kind: ClusterRole
              name: application-administrator
            subjects:
              - kind: ServiceAccount
                name: argo-workflow
                namespace: business-workflows
            
            ---
            apiVersion: rbac.authorization.k8s.io/v1
            kind: RoleBinding
            metadata:
              name: application-administration
              namespace: application
            roleRef:
              apiGroup: rbac.authorization.k8s.io
              kind: ClusterRole
              name: application-administrator
            subjects:
              - kind: ServiceAccount
                name: argo-workflow
                namespace: business-workflows
            EOF

            4.prepare `deploy-xxxx-flow.yaml`

            Details

6.submit to argo workflow client

            Details
            argo -n business-workflows submit deploy-xxxx-flow.yaml

            7.decode password

            Details
            kubectl -n application get secret xxxx-credentials -o jsonpath='{.data.xxx-password}' | base64 -d


Tests

            • kubectl -n storage exec -it deployment/redis-tool -- \
                  redis-cli -c -h redis-master.storage ping
            • kubectl -n storage exec -it deployment/redis-tool -- \
                  redis-cli -c -h redis-master.storage set mykey somevalue
            • kubectl -n storage exec -it deployment/redis-tool -- \
                  redis-cli -c -h redis-master.storage get mykey
            • kubectl -n storage exec -it deployment/redis-tool -- \
                  redis-cli -c -h redis-master.storage del mykey
            • kubectl -n storage exec -it deployment/redis-tool -- \
                  redis-cli -c -h redis-master.storage get mykey
            May 7, 2024

            Subsections of Streaming

            Install Flink Operator

            Installation

            Install By

            Preliminary

            1. Kubernetes has installed, if not check 🔗link


            2. Helm has installed, if not check 🔗link


            3. Cert-manager has installed, if not check 🔗link


            1.get helm repo

            Details
            helm repo add flink-operator-repo https://downloads.apache.org/flink/flink-kubernetes-operator-1.11.0/
            helm repo update

            latest version : 🔗https://flink.apache.org/downloads/#apache-flink-kubernetes-operator

            2.install chart

            Details
helm install --create-namespace -n flink flink-kubernetes-operator \
  flink-operator-repo/flink-kubernetes-operator \
  --set image.repository=m.lab.zverse.space/ghcr.io/apache/flink-kubernetes-operator \
  --set image.tag=1.11.0 \
  --set webhook.create=false
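
To confirm the operator started, check its pod and the installed CRDs:

kubectl -n flink get pods
kubectl get crd | grep flink.apache.org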
            Reference

            Preliminary

            1. Kubernetes has installed, if not check 🔗link


            2. ArgoCD has installed, if not check 🔗link


3. Cert-manager has installed on argocd and a ClusterIssuer named self-signed-ca-issuer exists, if not check 🔗link


4. Ingress has installed on argoCD, if not check 🔗link


            2.prepare `flink-operator.yaml`

            Details
            kubectl -n argocd apply -f - << EOF
            apiVersion: argoproj.io/v1alpha1
            kind: Application
            metadata:
              name: flink-operator
            spec:
              syncPolicy:
                syncOptions:
                - CreateNamespace=true
              project: default
              source:
                repoURL: https://downloads.apache.org/flink/flink-kubernetes-operator-1.11.0
                chart: flink-kubernetes-operator
                targetRevision: 1.11.0
                helm:
                  releaseName: flink-operator
                  values: |
                    image:
                      repository: m.daocloud.io/ghcr.io/apache/flink-kubernetes-operator
                      pullPolicy: IfNotPresent
                      tag: "1.11.0"
                  version: v3
              destination:
                server: https://kubernetes.default.svc
                namespace: flink
            EOF

            3.sync by argocd

            Details
            argocd app sync argocd/flink-operator
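
A quick way to exercise the operator is to submit a small FlinkDeployment. This sketch follows the operator's basic example; the image tag, resources, and the bundled example jar path are assumptions:

kubectl -n flink apply -f - << EOF
apiVersion: flink.apache.org/v1beta1
kind: FlinkDeployment
metadata:
  name: basic-example
spec:
  image: flink:1.17
  flinkVersion: v1_17
  flinkConfiguration:
    taskmanager.numberOfTaskSlots: "2"
  serviceAccount: flink
  jobManager:
    resource:
      memory: "2048m"
      cpu: 1
  taskManager:
    resource:
      memory: "2048m"
      cpu: 1
  job:
    jarURI: local:///opt/flink/examples/streaming/StateMachineExample.jar
    parallelism: 2
    upgradeMode: stateless
EOF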


            Jun 7, 2025

            👨‍💻Schedmd Slurm

            The Slurm Workload Manager, formerly known as Simple Linux Utility for Resource Management (SLURM), or simply Slurm, is a free and open-source job scheduler for Linux and Unix-like kernels, used by many of the world’s supercomputers and computer clusters.

            It provides three key functions:

            • allocating exclusive and/or non-exclusive access to resources (computer nodes) to users for some duration of time so they can perform work,
            • providing a framework for starting, executing, and monitoring work, typically a parallel job such as Message Passing Interface (MPI) on a set of allocated nodes, and
            • arbitrating contention for resources by managing a queue of pending jobs.


            Content

            Aug 7, 2024

            Subsections of 👨‍💻Schedmd Slurm

            Build & Install

            Aug 7, 2024

            Subsections of Build & Install

            Install On Debian

            Cluster Setting

            • 1 Manager
            • 1 Login Node
            • 2 Compute nodes
| hostname | IP | role | quota |
| --- | --- | --- | --- |
| manage01 (slurmctld, slurmdbd) | 192.168.56.115 | manager | 2C4G |
| login01 (login) | 192.168.56.116 | login | 2C4G |
| compute01 (slurmd) | 192.168.56.117 | compute | 2C4G |
| compute02 (slurmd) | 192.168.56.118 | compute | 2C4G |

            Software Version:

| software | version |
| --- | --- |
| os | Debian 12 bookworm |
| slurm | 24.05.2 |

            Important

            when you see (All Nodes), you need to run the following command on all nodes

            when you see (Manager Node), you only need to run the following command on manager node

            when you see (Login Node), you only need to run the following command on login node

            Prepare Steps (All Nodes)

1. Modify the /etc/apt/sources.list file to use the tuna mirror
            cat > /etc/apt/sources.list << EOF
            deb https://mirrors.tuna.tsinghua.edu.cn/debian/ bookworm main contrib non-free non-free-firmware
            deb-src https://mirrors.tuna.tsinghua.edu.cn/debian/ bookworm main contrib non-free non-free-firmware
            
            deb https://mirrors.tuna.tsinghua.edu.cn/debian/ bookworm-updates main contrib non-free non-free-firmware
            deb-src https://mirrors.tuna.tsinghua.edu.cn/debian/ bookworm-updates main contrib non-free non-free-firmware
            
            deb https://mirrors.tuna.tsinghua.edu.cn/debian/ bookworm-backports main contrib non-free non-free-firmware
            deb-src https://mirrors.tuna.tsinghua.edu.cn/debian/ bookworm-backports main contrib non-free non-free-firmware
            
            deb https://mirrors.tuna.tsinghua.edu.cn/debian-security/ bookworm-security main contrib non-free non-free-firmware
            deb-src https://mirrors.tuna.tsinghua.edu.cn/debian-security/ bookworm-security main contrib non-free non-free-firmware
            EOF
If you cannot get an IPv4 address

            Modify the /etc/network/interfaces

allow-hotplug enp0s8
iface enp0s8 inet dhcp

            restart the network

            systemctl restart networking
2. Update apt cache
            apt clean all && apt update
3. Set hostname on each node (run the matching command on the corresponding node)
            hostnamectl set-hostname manage01
            hostnamectl set-hostname login01
            hostnamectl set-hostname compute01
            hostnamectl set-hostname compute02
4. Set hosts file
            cat >> /etc/hosts << EOF
            192.168.56.115 manage01
            192.168.56.116 login01
            192.168.56.117 compute01
            192.168.56.118 compute02
            EOF
5. Disable firewall
            systemctl stop nftables && systemctl disable nftables
6. Install the ntpdate package
            apt-get -y install ntpdate
7. Sync server time
            ln -sf /usr/share/zoneinfo/Asia/Shanghai /etc/localtime
            echo 'Asia/Shanghai' >/etc/timezone
            ntpdate time.windows.com
8. Add cron job to sync time
            crontab -e
            */5 * * * * /usr/sbin/ntpdate time.windows.com
9. Create ssh key pair on each node
            ssh-keygen -t rsa -b 4096 -C $HOSTNAME
10. Set up passwordless ssh login to the other nodes (run the matching commands on each node)
            ssh-copy-id -i ~/.ssh/id_rsa.pub root@login01
            ssh-copy-id -i ~/.ssh/id_rsa.pub root@compute01
            ssh-copy-id -i ~/.ssh/id_rsa.pub root@compute02
            ssh-copy-id -i ~/.ssh/id_rsa.pub root@manage01
            ssh-copy-id -i ~/.ssh/id_rsa.pub root@compute01
            ssh-copy-id -i ~/.ssh/id_rsa.pub root@compute02

            Install Components

            1. Install NFS server (Manager Node)

            there are many ways to install NFS server

            create shared folder

            mkdir /data
            chmod 755 /data

            modify vim /etc/exports

            /data *(rw,sync,insecure,no_subtree_check,no_root_squash)

            start nfs server

            systemctl start rpcbind 
            systemctl start nfs-server 
            
            systemctl enable rpcbind 
            systemctl enable nfs-server

            check nfs server

            showmount -e localhost
            
            # Output
            Export list for localhost:
            /data *
2. Install munge service
            • add user munge (All Nodes)
            groupadd -g 1108 munge
            useradd -m -c "Munge Uid 'N' Gid Emporium" -d /var/lib/munge -u 1108 -g munge -s /sbin/nologin munge
            • Install rng-tools-debian (Manager Nodes)
            apt-get install -y rng-tools-debian
            # modify service script
            vim /usr/lib/systemd/system/rngd.service
            [Service]
            ExecStart=/usr/sbin/rngd -f -r /dev/urandom
            systemctl daemon-reload
            systemctl start rngd
            systemctl enable rngd
• install munge packages (All Nodes)
apt-get install -y libmunge-dev libmunge2 munge
            • generate secret key (Manager Nodes)
            dd if=/dev/urandom bs=1 count=1024 > /etc/munge/munge.key
• copy munge.key from the manager node to the rest of the nodes (Manager Node)
            scp -p /etc/munge/munge.key root@login01:/etc/munge/
            scp -p /etc/munge/munge.key root@compute01:/etc/munge/
            scp -p /etc/munge/munge.key root@compute02:/etc/munge/
            • grant privilege on munge.key (All Nodes)
            chown munge: /etc/munge/munge.key
            chmod 400 /etc/munge/munge.key
            
            systemctl start munge
            systemctl enable munge

            Using systemctl status munge to check if the service is running

            • test munge
            munge -n | ssh compute01 unmunge
3. Install MariaDB (Manager Node)
            apt-get install -y mariadb-server
            • create database and user
            systemctl start mariadb
            systemctl enable mariadb
            
            ROOT_PASS=$(tr -dc A-Za-z0-9 </dev/urandom | head -c 16) 
            mysql -e "CREATE USER root IDENTIFIED BY '${ROOT_PASS}'"
            mysql -uroot -p$ROOT_PASS -e 'create database slurm_acct_db'
• create user slurm, and grant all privileges on database slurm_acct_db
            mysql -uroot -p$ROOT_PASS
            create user slurm;
            
            grant all on slurm_acct_db.* TO 'slurm'@'localhost' identified by '123456' with grant option;
            
            flush privileges;
            • create Slurm user
            groupadd -g 1109 slurm
            useradd -m -c "Slurm manager" -d /var/lib/slurm -u 1109 -g slurm -s /bin/bash slurm

            Install Slurm (All Nodes)

            • Install basic Debian package build requirements:
            apt-get install -y build-essential fakeroot devscripts equivs
            • Unpack the distributed tarball:
            wget https://download.schedmd.com/slurm/slurm-24.05.2.tar.bz2 -O slurm-24.05.2.tar.bz2 &&
            tar -xaf slurm*tar.bz2
            • cd to the directory containing the Slurm source:
            cd slurm-24.05.2 &&   mkdir -p /etc/slurm && ./configure 
            • compile slurm
            make install
            • modify configuration files (Manager Nodes)

              cp /root/slurm-24.05.2/etc/slurm.conf.example /etc/slurm/slurm.conf
              vim /etc/slurm/slurm.conf

              focus on these options:

  SlurmctldHost=manage01
              
              AccountingStorageEnforce=associations,limits,qos
  AccountingStorageHost=manage01
              AccountingStoragePass=/var/run/munge/munge.socket.2
              AccountingStoragePort=6819  
              AccountingStorageType=accounting_storage/slurmdbd  
              
              JobCompHost=localhost
              JobCompLoc=slurm_acct_db
              JobCompPass=123456
              JobCompPort=3306
              JobCompType=jobcomp/mysql
              JobCompUser=slurm
              JobContainerType=job_container/none
              JobAcctGatherType=jobacct_gather/linux
              cp /root/slurm-24.05.2/etc/slurmdbd.conf.example /etc/slurm/slurmdbd.conf
              vim /etc/slurm/slurmdbd.conf
              • modify /etc/slurm/cgroup.conf
              cp /root/slurm-24.05.2/etc/cgroup.conf.example /etc/slurm/cgroup.conf
              • send configuration files to other nodes
              scp -r /etc/slurm/*.conf  root@login01:/etc/slurm/
              scp -r /etc/slurm/*.conf  root@compute01:/etc/slurm/
              scp -r /etc/slurm/*.conf  root@compute02:/etc/slurm/
            • grant privilege on some directories (All Nodes)

            mkdir /var/spool/slurmd
            chown slurm: /var/spool/slurmd
            mkdir /var/log/slurm
            chown slurm: /var/log/slurm
            
            mkdir /var/spool/slurmctld
            chown slurm: /var/spool/slurmctld
            
            chown slurm: /etc/slurm/slurmdbd.conf
            chmod 600 /etc/slurm/slurmdbd.conf
• start slurm services on each node (manager node: slurmdbd, slurmctld and slurmd; login and compute nodes: slurmd only)
            systemctl start slurmdbd
            systemctl enable slurmdbd
            
            systemctl start slurmctld
            systemctl enable slurmctld
            
            systemctl start slurmd
            systemctl enable slurmd
            Using `systemctl status xxxx` to check if the `xxxx` service is running
Example slurmdbd.service
            ```text
            # vim /usr/lib/systemd/system/slurmdbd.service
            
            
            [Unit]
            Description=Slurm DBD accounting daemon
            After=network-online.target remote-fs.target munge.service mysql.service mysqld.service mariadb.service sssd.service
            Wants=network-online.target
            ConditionPathExists=/etc/slurm/slurmdbd.conf
            
            [Service]
            Type=simple
            EnvironmentFile=-/etc/sysconfig/slurmdbd
            EnvironmentFile=-/etc/default/slurmdbd
            User=slurm
            Group=slurm
            RuntimeDirectory=slurmdbd
            RuntimeDirectoryMode=0755
            ExecStart=/usr/local/sbin/slurmdbd -D -s $SLURMDBD_OPTIONS
            ExecReload=/bin/kill -HUP $MAINPID
            LimitNOFILE=65536
            
            
            # Uncomment the following lines to disable logging through journald.
            # NOTE: It may be preferable to set these through an override file instead.
            #StandardOutput=null
            #StandardError=null
            
            [Install]
            WantedBy=multi-user.target
            ```
            
Example slurmctld.service
            ```text
            # vim /usr/lib/systemd/system/slurmctld.service
            
            
            [Unit]
            Description=Slurm controller daemon
            After=network-online.target remote-fs.target munge.service sssd.service
            Wants=network-online.target
            ConditionPathExists=/etc/slurm/slurm.conf
            
            [Service]
            Type=notify
            EnvironmentFile=-/etc/sysconfig/slurmctld
            EnvironmentFile=-/etc/default/slurmctld
            User=slurm
            Group=slurm
            RuntimeDirectory=slurmctld
            RuntimeDirectoryMode=0755
            ExecStart=/usr/local/sbin/slurmctld --systemd $SLURMCTLD_OPTIONS
            ExecReload=/bin/kill -HUP $MAINPID
            LimitNOFILE=65536
            
            
            # Uncomment the following lines to disable logging through journald.
            # NOTE: It may be preferable to set these through an override file instead.
            #StandardOutput=null
            #StandardError=null
            
            [Install]
            WantedBy=multi-user.target
            ```
            
Example slurmd.service
            ```text
            # vim /usr/lib/systemd/system/slurmd.service
            
            
            [Unit]
            Description=Slurm node daemon
            After=munge.service network-online.target remote-fs.target sssd.service
            Wants=network-online.target
            #ConditionPathExists=/etc/slurm/slurm.conf
            
            [Service]
            Type=notify
            EnvironmentFile=-/etc/sysconfig/slurmd
            EnvironmentFile=-/etc/default/slurmd
            RuntimeDirectory=slurm
            RuntimeDirectoryMode=0755
            ExecStart=/usr/local/sbin/slurmd --systemd $SLURMD_OPTIONS
            ExecReload=/bin/kill -HUP $MAINPID
            KillMode=process
            LimitNOFILE=131072
            LimitMEMLOCK=infinity
            LimitSTACK=infinity
            Delegate=yes
            
            
            # Uncomment the following lines to disable logging through journald.
            # NOTE: It may be preferable to set these through an override file instead.
            #StandardOutput=null
            #StandardError=null
            
            [Install]
            WantedBy=multi-user.target
            ```
            
(Login and compute nodes)
systemctl start slurmd
systemctl enable slurmd
Using `systemctl status slurmd` to check if the `slurmd` service is running

            Test Your Slurm Cluster (Login Node)

            • check cluster configuration
            scontrol show config
            • check cluster status
            sinfo
            scontrol show partition
            scontrol show node
            • submit job
            srun -N2 hostname
            scontrol show jobs
            • check job status
            squeue -a
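If you also want to exercise sbatch, a minimal batch script (the file name and its contents are just an illustrative sketch) could look like this:

cat > hello.slurm << 'EOF'
#!/bin/bash
#SBATCH --job-name=hello
#SBATCH --nodes=2
#SBATCH --output=hello_%j.out
srun hostname
EOF
sbatch hello.slurm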
            Aug 7, 2024

            Install From Binary

            Important

            (All Nodes) means all type nodes should install this component.

            (Manager Node) means only the manager node should install this component.

(Login Node) means only the Login node should install this component.

(Cmp) means only the Compute node should install this component.

Typically, three types of nodes are required to run Slurm:

1 Manager (Manager Node), 1 Login Node and N Compute nodes (Cmp).

but you can also choose to install all services on a single node.

Prerequisites

            1. change hostname (All Nodes)
              hostnamectl set-hostname (manager|auth|computeXX)
            2. modify /etc/hosts (All Nodes)
              echo "192.aa.bb.cc (manager|auth|computeXX)" >> /etc/hosts
            3. disable firewall, selinux, dnsmasq, swap (All Nodes). more detail here
4. NFS Server (Manager Node). NFS is used as the shared file system for the Slurm cluster.
            5. [NFS Client] (All Nodes). all node should mount the NFS share
              Install NFS Client
              mount <$nfs_server>:/data /data -o proto=tcp -o nolock
            6. Munge (All Nodes). The auth/munge plugin will be built if the MUNGE authentication development library is installed. MUNGE is used as the default authentication mechanism.
              Install Munge

All nodes need to have the munge user and group.

              groupadd -g 1108 munge
              useradd -m -c "Munge Uid 'N' Gid Emporium" -d /var/lib/munge -u 1108 -g munge -s /sbin/nologin munge
              yum install epel-release -y
              yum install munge munge-libs munge-devel -y

              Create global secret key

              /usr/sbin/create-munge-key -r
              dd if=/dev/urandom bs=1 count=1024 > /etc/munge/munge.key

sync the secret to the rest of the nodes

              scp -p /etc/munge/munge.key root@<$rest_node>:/etc/munge/
              ssh root@<$rest_node> "chown munge: /etc/munge/munge.key && chmod 400 /etc/munge/munge.key"
              ssh root@<$rest_node> "systemctl start munge && systemctl enable munge"

test whether munge works

              munge -n | unmunge
            7. Database (Manager Node). MySQL support for accounting will be built if the MySQL or MariaDB development library is present. A currently supported version of MySQL or MariaDB should be used.
              Install MariaDB

              install mariadb

              yum -y install mariadb-server
              systemctl start mariadb && systemctl enable mariadb
              ROOT_PASS=$(tr -dc A-Za-z0-9 </dev/urandom | head -c 16) 
              mysql -e "CREATE USER root IDENTIFIED BY '${ROOT_PASS}'"

              login mysql

              mysql -u root -p${ROOT_PASS}
              create database slurm_acct_db;
              create user slurm;
              grant all on slurm_acct_db.* TO 'slurm'@'localhost' identified by '123456' with grant option;
              flush privileges;
              quit

            Install Slurm

            1. create slurm user (All Nodes)
              groupadd -g 1109 slurm
              useradd -m -c "slurm manager" -d /var/lib/slurm -u 1109 -g slurm -s /bin/bash slurm
            Install Slurm from

            Build RPM package

1. install dependencies (Manager Node)

              yum -y install gcc gcc-c++ readline-devel perl-ExtUtils-MakeMaker pam-devel rpm-build mysql-devel python3
            2. build rpm package (Manager Node)

              wget https://download.schedmd.com/slurm/slurm-24.05.2.tar.bz2 -O slurm-24.05.2.tar.bz2
              rpmbuild -ta --nodeps slurm-24.05.2.tar.bz2

  The rpm files will be installed under the $HOME/rpmbuild directory of the user building them.

            3. send rpm to rest nodes (Manager Node)

              ssh root@<$rest_node> "mkdir -p /root/rpmbuild/RPMS/"
  scp -r $HOME/rpmbuild/RPMS/x86_64 root@<$rest_node>:/root/rpmbuild/RPMS/
            4. install rpm (Manager Node)

              ssh root@<$rest_node> "yum localinstall /root/rpmbuild/RPMS/x86_64/slurm-*"
            5. modify configuration file (Manager Node)

              cp /etc/slurm/cgroup.conf.example /etc/slurm/cgroup.conf
              cp /etc/slurm/slurm.conf.example /etc/slurm/slurm.conf
              cp /etc/slurm/slurmdbd.conf.example /etc/slurm/slurmdbd.conf
              chmod 600 /etc/slurm/slurmdbd.conf
              chown slurm: /etc/slurm/slurmdbd.conf

  cgroup.conf doesn't need to be changed.

              edit /etc/slurm/slurm.conf, you can use this link as a reference

              edit /etc/slurm/slurmdbd.conf, you can use this link as a reference

            Install yum repo directly

            1. install slurm (All Nodes)

  yum -y install slurm-wlm slurmdbd
            2. modify configuration file (All Nodes)

              vim /etc/slurm-llnl/slurm.conf
              vim /etc/slurm-llnl/slurmdbd.conf

  cgroup.conf doesn't need to be changed.

              edit /etc/slurm/slurm.conf, you can use this link as a reference

              edit /etc/slurm/slurmdbd.conf, you can use this link as a reference

1. send configuration (Manager Node)
   scp -r /etc/slurm/*.conf  root@<$rest_node>:/etc/slurm/
   ssh root@<$rest_node> "mkdir /var/spool/slurmd && chown slurm: /var/spool/slurmd"
   ssh root@<$rest_node> "mkdir /var/log/slurm && chown slurm: /var/log/slurm"
   ssh root@<$rest_node> "mkdir /var/spool/slurmctld && chown slurm: /var/spool/slurmctld"
2. start service (Manager Node)
  systemctl start slurmdbd && systemctl enable slurmdbd
  systemctl start slurmctld && systemctl enable slurmctld
3. start service (All Nodes)
  ssh root@<$rest_node> "systemctl start slurmd && systemctl enable slurmd"

            Test

            1. show cluster status
            scontrol show config
            sinfo
            scontrol show partition
            scontrol show node
2. submit job
            srun -N2 hostname
            scontrol show jobs
3. check job status
            squeue -a

            Reference:

            1. https://slurm.schedmd.com/documentation.html
            2. https://wiki.fysik.dtu.dk/Niflheim_system/Slurm_installation/
            3. https://github.com/Artlands/Install-Slurm
            Aug 7, 2024

            Install From Helm Chart

Compared with the complex binary installation, a Helm chart is an easier way to install Slurm.

            Source code could be found from https://github.com/AaronYang0628/slurm-on-k8s

Prerequisites

            1. Kubernetes has installed, if not check 🔗link
            2. Helm binary has installed, if not check 🔗link

            Installation

1. get helm repo and update

  helm repo add ay-helm-mirror https://aaronyang0628.github.io/helm-chart-mirror/charts
  helm repo update
            2. install slurm chart

              # wget -O slurm.values.yaml https://raw.githubusercontent.com/AaronYang0628/slurm-on-k8s/refs/heads/main/chart/values.yaml
              helm install slurm ay-helm-mirror/chart -f slurm.values.yaml --version 1.0.10

              Or you can get template values.yaml from https://raw.githubusercontent.com/AaronYang0628/helm-chart-mirror/refs/heads/main/templates/slurm/slurm.values.yaml

            3. check chart status

              helm -n slurm list
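
After the release is deployed, you can also check the pods (assuming the chart was installed into the slurm namespace, as the helm list command above implies):

kubectl -n slurm get pods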
            Aug 7, 2024

            Install From K8s Operator

Compared with the complex binary installation, using a k8s operator is an easier way to install Slurm.

            Source code could be found from https://github.com/AaronYang0628/slurm-on-k8s

Prerequisites

            1. Kubernetes has installed, if not check 🔗link
            2. Helm binary has installed, if not check 🔗link

            Installation

            1. deploy slurm operator

              kubectl apply -f https://raw.githubusercontent.com/AaronYang0628/helm-chart-mirror/refs/heads/main/templates/slurm/operator_install.yaml
Expected Output
              [root@ay-zj-ecs operator]# kubectl apply -f https://raw.githubusercontent.com/AaronYang0628/helm-chart-mirror/refs/heads/main/templates/slurm/operator_install.yaml
              namespace/slurm created
              customresourcedefinition.apiextensions.k8s.io/slurmdeployments.slurm.ay.dev created
              serviceaccount/slurm-operator-controller-manager created
              role.rbac.authorization.k8s.io/slurm-operator-leader-election-role created
              clusterrole.rbac.authorization.k8s.io/slurm-operator-manager-role created
              clusterrole.rbac.authorization.k8s.io/slurm-operator-metrics-auth-role created
              clusterrole.rbac.authorization.k8s.io/slurm-operator-metrics-reader created
              clusterrole.rbac.authorization.k8s.io/slurm-operator-slurmdeployment-admin-role created
              clusterrole.rbac.authorization.k8s.io/slurm-operator-slurmdeployment-editor-role created
              clusterrole.rbac.authorization.k8s.io/slurm-operator-slurmdeployment-viewer-role created
              rolebinding.rbac.authorization.k8s.io/slurm-operator-leader-election-rolebinding created
              clusterrolebinding.rbac.authorization.k8s.io/slurm-operator-manager-rolebinding created
              clusterrolebinding.rbac.authorization.k8s.io/slurm-operator-metrics-auth-rolebinding created
              service/slurm-operator-controller-manager-metrics-service created
              deployment.apps/slurm-operator-controller-manager created
            2. check operator status

              kubectl -n slurm get pod
Expected Output
              [root@ay-zj-ecs operator]# kubectl -n slurm get pod
              NAME                                READY   STATUS    RESTARTS   AGE
              slurm-operator-controller-manager   1/1     Running   0          27s
            3. apply CRD slurmdeployment

              kubectl apply -f https://raw.githubusercontent.com/AaronYang0628/helm-chart-mirror/refs/heads/main/templates/slurm/slurmdeployment.zj.values.yaml
Expected Output
              [root@ay-zj-ecs operator]# kubectl apply -f https://raw.githubusercontent.com/AaronYang0628/helm-chart-mirror/refs/heads/main/templates/slurm/slurmdeployment.zj.values.yaml
              slurmdeployment.slurm.ay.dev/lensing created
            4. check operator status

              kubectl get slurmdeployment
              kubectl -n slurm logs -f deploy/slurm-operator-controller-manager
              # kubectl get slurmdep
              # kubectl -n test get pods
Expected Output
              [root@ay-zj-ecs ~]# kubectl get slurmdep -w
              NAME      CPU   GPU   LOGIN   CTLD   DBD   DBSVC   JOB COMMAND                     STATUS
              lensing   0/1   0/0   0/1     0/1    0/1   0/1     sh -c srun -N 2 /bin/hostname   
              lensing   1/2   0/0   1/1     1/1    1/1   1/1     sh -c srun -N 2 /bin/hostname   
              lensing   2/2   0/0   1/1     1/1    1/1   1/1     sh -c srun -N 2 /bin/hostname   
            5. upgrade slurmdep

              kubectl edit slurmdep lensing
              # set SlurmCPU.replicas = 3
Expected Output
              [root@ay-zj-ecs ~]# kubectl edit slurmdep lensing
              slurmdeployment.slurm.ay.dev/lensing edited
              
              [root@ay-zj-ecs ~]# kubectl get slurmdep -w
              NAME      CPU   GPU   LOGIN   CTLD   DBD   DBSVC   JOB COMMAND                     STATUS
              lensing   2/2   0/0   1/1     1/1    1/1   1/1     sh -c srun -N 2 /bin/hostname   
              lensing   2/3   0/0   1/1     1/1    1/1   1/1     sh -c srun -N 2 /bin/hostname   
              lensing   3/3   0/0   1/1     1/1    1/1   1/1     sh -c srun -N 2 /bin/hostname   
            Aug 7, 2024

            Try OpenSCOW

            What is SCOW?

SCOW is an HPC cluster management system built by PKU.

The SCOW tutorial uses four virtual machines to run a Slurm cluster. It is a good choice for learning how to use Slurm.

You should check https://pkuhpc.github.io/OpenSCOW/docs/hpccluster; it works well.

            Aug 7, 2024

            Subsections of CheatSheet

            Common Environment Variables

| Variable | Description |
| --- | --- |
| $SLURM_JOB_ID | The Job ID. |
| $SLURM_JOBID | Deprecated. Same as $SLURM_JOB_ID. |
| $SLURM_SUBMIT_HOST | The hostname of the node used for job submission. |
| $SLURM_JOB_NODELIST | Contains the definition (list) of the nodes that is assigned to the job. |
| $SLURM_NODELIST | Deprecated. Same as $SLURM_JOB_NODELIST. |
| $SLURM_CPUS_PER_TASK | Number of CPUs per task. |
| $SLURM_CPUS_ON_NODE | Number of CPUs on the allocated node. |
| $SLURM_JOB_CPUS_PER_NODE | Count of processors available to the job on this node. |
| $SLURM_CPUS_PER_GPU | Number of CPUs requested per allocated GPU. |
| $SLURM_MEM_PER_CPU | Memory per CPU. Same as --mem-per-cpu. |
| $SLURM_MEM_PER_GPU | Memory per GPU. |
| $SLURM_MEM_PER_NODE | Memory per node. Same as --mem. |
| $SLURM_GPUS | Number of GPUs requested. |
| $SLURM_NTASKS | Same as -n, --ntasks. The number of tasks. |
| $SLURM_NTASKS_PER_NODE | Number of tasks requested per node. |
| $SLURM_NTASKS_PER_SOCKET | Number of tasks requested per socket. |
| $SLURM_NTASKS_PER_CORE | Number of tasks requested per core. |
| $SLURM_NTASKS_PER_GPU | Number of tasks requested per GPU. |
| $SLURM_NPROCS | Same as -n, --ntasks. See $SLURM_NTASKS. |
| $SLURM_TASKS_PER_NODE | Number of tasks to be initiated on each node. |
| $SLURM_ARRAY_JOB_ID | Job array's master job ID number. |
| $SLURM_ARRAY_TASK_ID | Job array ID (index) number. |
| $SLURM_ARRAY_TASK_COUNT | Total number of tasks in a job array. |
| $SLURM_ARRAY_TASK_MAX | Job array's maximum ID (index) number. |
| $SLURM_ARRAY_TASK_MIN | Job array's minimum ID (index) number. |

            A full list of environment variables for SLURM can be found by visiting the SLURM page on environment variables.
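
A tiny job script that prints a few of these variables is an easy way to see what Slurm injects; this is just an illustrative sketch:

#!/bin/bash
#SBATCH --job-name=env-check
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=1
#SBATCH --output=env_check_%j.out

echo "Job ID:      ${SLURM_JOB_ID}"
echo "Submit host: ${SLURM_SUBMIT_HOST}"
echo "Node list:   ${SLURM_JOB_NODELIST}"
echo "Tasks:       ${SLURM_NTASKS}"
echo "CPUs/task:   ${SLURM_CPUS_PER_TASK:-1}"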

            Aug 7, 2024

            File Operations

            File Distribution

            • sbcast is used to transfer a file from local disk to local disk on the nodes allocated to a job. This can be used to effectively use diskless compute nodes or provide improved performance relative to a shared file system.
              • Feature
1. distribute files: quickly copy files to all compute nodes assigned to the job, avoiding the hassle of distributing them manually. It is faster than traditional scp or rsync, especially when distributing to many nodes.
2. simplify scripts: a single command distributes files to every node assigned to the job.
3. improve performance: parallel transfers speed up file distribution, especially for large files or many files.
              • Usage
                1. Alone
                sbcast <source_file> <destination_path>
2. Embedded in a job script
                #!/bin/bash
                #SBATCH --job-name=example_job
                #SBATCH --output=example_job.out
                #SBATCH --error=example_job.err
                #SBATCH --partition=compute
                #SBATCH --nodes=4
                
                # Use sbcast to distribute the file to the /tmp directory of each node
                sbcast data.txt /tmp/data.txt
                
                # Run your program using the distributed files
                srun my_program /tmp/data.txt

            File Collection

1. File Redirection: when submitting a job, you can use the #SBATCH --output and #SBATCH --error directives to redirect standard output and standard error to specified files.

               #SBATCH --output=output.txt
               #SBATCH --error=error.txt

              Or

              sbatch -N2 -w "compute[01-02]" -o result/file/path xxx.slurm
2. Send files manually: use scp or rsync in the job to copy the result files from the compute nodes to the submit node.

3. Using NFS: if a shared file system (such as NFS, Lustre, or GPFS) is configured in the computing cluster, the result files can be written directly to the shared directory. In this way, the result files generated by all nodes are automatically stored in the same location.

            4. Using sbcast

            Aug 7, 2024

            Submit Jobs

3 Types of Jobs

            • srun is used to submit a job for execution or initiate job steps in real time.

              • Example
1. run a shell command
srun -N2 /bin/hostname
2. run a script
srun -N1 test.sh
3. exec into a slurmd node
srun -w slurm-lensing-slurm-slurmd-cpu-2 --pty /bin/bash
            • sbatch is used to submit a job script for later execution. The script will typically contain one or more srun commands to launch parallel tasks.

              • Example

                1. submit a batch job
                sbatch -N2 -w "compute[01-02]" -o job.stdout /data/jobs/batch-job.slurm
                batch-job.slurm
                #!/bin/bash
                
                #SBATCH -N 1
                #SBATCH --job-name=cpu-N1-batch
                #SBATCH --partition=compute
                #SBATCH --mail-type=end
                #SBATCH --mail-user=xxx@email.com
                #SBATCH --output=%j.out
                #SBATCH --error=%j.err
                
                srun -l /bin/hostname #you can still write srun <command> in here
                srun -l pwd
                
2. submit a parallel task to process different data partitions
                sbatch /data/jobs/parallel.slurm
                parallel.slurm
                #!/bin/bash
                #SBATCH -N 2 
                #SBATCH --job-name=cpu-N2-parallel
                #SBATCH --partition=compute
                #SBATCH --time=01:00:00
#SBATCH --array=1-4  # define a job array, here assuming 4 data shards
#SBATCH --ntasks-per-node=1 # run only one task per node
                #SBATCH --output=process_data_%A_%a.out
                #SBATCH --error=process_data_%A_%a.err
                
                TASK_ID=${SLURM_ARRAY_TASK_ID}
                
                DATA_PART="data_part_${TASK_ID}.txt" #make sure you have that file
                
                if [ -f ${DATA_PART} ]; then
                    echo "Processing ${DATA_PART} on node $(hostname)"
                    # python process_data.py --input ${DATA_PART}
                else
                    echo "File ${DATA_PART} does not exist!"
                fi
                
                how to split file
                split -l 1000 data.txt data_part_ 
                && mv data_part_aa data_part_1 
                && mv data_part_ab data_part_2
                
            • salloc is used to allocate resources for a job in real time. Typically this is used to allocate resources and spawn a shell. The shell is then used to execute srun commands to launch parallel tasks.

              • Example
1. allocate resources (similar to creating a virtual machine)
                salloc -N2 bash
This command creates a job that allocates 2 nodes and spawns a bash shell, and you can execute srun commands in that environment. After your computing task finishes, remember to shut down your job.
                scancel <$job_id>
                when you exit the job, the resources will be released.
            Aug 7, 2024

            Configuration Files

            Aug 7, 2024

            Subsections of MPI Libs

            Test Intel MPI Jobs

Using MPI (Message Passing Interface) for parallel computing on a SLURM cluster typically involves the following steps:

1. Install an MPI library

Make sure an MPI library is already installed on your cluster nodes. Common MPI implementations include:

• OpenMPI
• Intel MPI
• MPICH

You can check whether MPI is installed on the cluster with:
mpicc --version  # check the MPI compiler
mpirun --version # check the MPI runtime environment

2. Test MPI performance

mpirun -n 2 IMB-MPI1 pingpong

3. Write the MPI program

You can compile MPI programs with mpicc (for C) or mpic++ (for C++). For example:

Below is a simple MPI "Hello, World!" example program, assumed to be saved as hello_mpi.c, plus a vector dot-product example saved as dot_product.c; pick either one:

            #include <stdio.h>
            #include <mpi.h>
            
            int main(int argc, char *argv[]) {
                int rank, size;
                
    // initialize the MPI environment
                MPI_Init(&argc, &argv);
            
    // get the rank of the current process and the total number of processes
                MPI_Comm_rank(MPI_COMM_WORLD, &rank);
                MPI_Comm_size(MPI_COMM_WORLD, &size);
            
    // print the process information
                printf("Hello, World! I am process %d out of %d processes.\n", rank, size);
            
    // finalize the MPI environment
                MPI_Finalize();
            
                return 0;
            }
            #include <stdio.h>
            #include <stdlib.h>
            #include <mpi.h>
            
#define N 8  // vector size
            
// compute the local dot product of the two vectors
            double compute_local_dot_product(double *A, double *B, int start, int end) {
                double local_dot = 0.0;
                for (int i = start; i < end; i++) {
                    local_dot += A[i] * B[i];
                }
                return local_dot;
            }
            
            void print_vector(double *Vector) {
                for (int i = 0; i < N; i++) {
                    printf("%f ", Vector[i]);   
                }
                printf("\n");
            }
            
            int main(int argc, char *argv[]) {
                int rank, size;
            
    // initialize the MPI environment
                MPI_Init(&argc, &argv);
                MPI_Comm_rank(MPI_COMM_WORLD, &rank);
                MPI_Comm_size(MPI_COMM_WORLD, &size);
            
    // vectors A and B
                double A[N], B[N];
            
    // process 0 initializes vectors A and B
                if (rank == 0) {
                    for (int i = 0; i < N; i++) {
            A[i] = i + 1;  // sample data
            B[i] = (i + 1) * 2;  // sample data
                    }
                }
            
    // broadcast vectors A and B to all processes
                MPI_Bcast(A, N, MPI_DOUBLE, 0, MPI_COMM_WORLD);
                MPI_Bcast(B, N, MPI_DOUBLE, 0, MPI_COMM_WORLD);
            
    // each process computes the part it is responsible for
    int local_n = N / size;  // number of elements handled by each process
                int start = rank * local_n;
                int end = (rank + 1) * local_n;
                
    // the last process handles any remaining elements (the N % size leftover)
                if (rank == size - 1) {
                    end = N;
                }
            
                double local_dot_product = compute_local_dot_product(A, B, start, end);
            
    // use MPI_Reduce to combine the local dot products from all processes on process 0
                double global_dot_product = 0.0;
                MPI_Reduce(&local_dot_product, &global_dot_product, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
            
    // process 0 prints the final result
                if (rank == 0) {
                    printf("Vector A is\n");
                    print_vector(A);
                    printf("Vector B is\n");
                    print_vector(B);
                    printf("Dot Product of A and B: %f\n", global_dot_product);
                }
            
    // finalize the MPI environment
                MPI_Finalize();
                return 0;
            }

4. Create the Slurm job script

Create a SLURM job script to run the MPI program. Below is a basic SLURM job script, assumed to be saved as mpi_test.slurm:

            #!/bin/bash
            #SBATCH --job-name=mpi_job       # Job name
            #SBATCH --nodes=2                # Number of nodes to use
            #SBATCH --ntasks-per-node=1      # Number of tasks per node
            #SBATCH --time=00:10:00          # Time limit
            #SBATCH --output=mpi_test_output_%j.log     # Standard output file
            #SBATCH --error=mpi_test_output_%j.err     # Standard error file
            
            # Manually set Intel OneAPI MPI and Compiler environment
            export I_MPI_PMI=pmi2
            export I_MPI_PMI_LIBRARY=/usr/lib/x86_64-linux-gnu/slurm/mpi_pmi2.so
            export I_MPI_ROOT=/opt/intel/oneapi/mpi/2021.14
            export INTEL_COMPILER_ROOT=/opt/intel/oneapi/compiler/2025.0
            export PATH=$I_MPI_ROOT/bin:$INTEL_COMPILER_ROOT/bin:$PATH
            export LD_LIBRARY_PATH=$I_MPI_ROOT/lib:$INTEL_COMPILER_ROOT/lib:$LD_LIBRARY_PATH
            export MANPATH=$I_MPI_ROOT/man:$INTEL_COMPILER_ROOT/man:$MANPATH
            
            # Compile the MPI program
            icx-cc -I$I_MPI_ROOT/include  hello_mpi.c -o hello_mpi -L$I_MPI_ROOT/lib -lmpi
            
            # Run the MPI job
            
            mpirun -np 2 ./hello_mpi
            #!/bin/bash
            #SBATCH --job-name=mpi_job       # Job name
            #SBATCH --nodes=2                # Number of nodes to use
            #SBATCH --ntasks-per-node=1      # Number of tasks per node
            #SBATCH --time=00:10:00          # Time limit
            #SBATCH --output=mpi_test_output_%j.log     # Standard output file
            #SBATCH --error=mpi_test_output_%j.err     # Standard error file
            
            # Manually set Intel OneAPI MPI and Compiler environment
            export I_MPI_PMI=pmi2
            export I_MPI_PMI_LIBRARY=/usr/lib/x86_64-linux-gnu/slurm/mpi_pmi2.so
            export I_MPI_ROOT=/opt/intel/oneapi/mpi/2021.14
            export INTEL_COMPILER_ROOT=/opt/intel/oneapi/compiler/2025.0
            export PATH=$I_MPI_ROOT/bin:$INTEL_COMPILER_ROOT/bin:$PATH
            export LD_LIBRARY_PATH=$I_MPI_ROOT/lib:$INTEL_COMPILER_ROOT/lib:$LD_LIBRARY_PATH
            export MANPATH=$I_MPI_ROOT/man:$INTEL_COMPILER_ROOT/man:$MANPATH
            
            # Compile the MPI program
            icx-cc -I$I_MPI_ROOT/include  dot_product.c -o dot_product -L$I_MPI_ROOT/lib -lmpi
            
            # Run the MPI job
            
            mpirun -np 2 ./dot_product

4. Compile the MPI program

Before running the job, the MPI program needs to be compiled. On the cluster you can use mpicc. Assuming the sources are saved as hello_mpi.c and dot_product.c, compile them with:

            mpicc -o hello_mpi hello_mpi.c
            mpicc -o dot_product dot_product.c

5. Submit the Slurm job

Save the job script above (mpi_test.slurm) and submit it with:

            sbatch mpi_test.slurm

6. Check the job status

You can check the status of the job with:

            squeue -u <your_username>

7. Check the output

After the job finishes, the output is written to the file(s) specified in the job script (e.g. mpi_test_output_<job_id>.log). You can view it with cat or any text editor:

            cat mpi_test_output_*.log

Example output: if everything works, hello_mpi prints something like

Hello, World! I am process 0 out of 2 processes.
Hello, World! I am process 1 out of 2 processes.

and dot_product prints the two vectors followed by the result:

Vector A is
1.000000 2.000000 3.000000 4.000000 5.000000 6.000000 7.000000 8.000000 
Vector B is
2.000000 4.000000 6.000000 8.000000 10.000000 12.000000 14.000000 16.000000 
Dot Product of A and B: 408.000000
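If the job fails with PMI-related errors instead, it may help to confirm that the PMI plugin and the library path referenced in the script are actually available on the nodes. A quick, hedged sanity check (run on a login or compute node):

# List the PMI plugin types this Slurm installation supports
srun --mpi=list

# Confirm the PMI2 library path used in the script above exists
ls -l /usr/lib/x86_64-linux-gnu/slurm/mpi_pmi2.so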
            Aug 7, 2024

            Test Open MPI Jobs

Running MPI (Message Passing Interface) jobs on a SLURM cluster usually involves the following steps:

1. Install an MPI library

Make sure an MPI library is installed on the cluster nodes. Common MPI implementations include:

• OpenMPI
• Intel MPI
• MPICH

You can check whether MPI is already available with:

mpicc --version  # check the MPI compiler wrapper
mpirun --version # check the MPI runtime
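If the cluster manages toolchains with Environment Modules or Lmod (which the module load openmpi line later on this page assumes), you can also list what is installed; this is only a sketch and depends on the site's module setup:

# List MPI-related modules (module traditionally prints to stderr, hence the redirect)
module avail 2>&1 | grep -i -E 'openmpi|mpich|intel'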

2. Prepare the MPI program

MPI programs are compiled with mpicc (for C) or mpic++ (for C++). Below are two example programs: a simple MPI "Hello, World!" program (hello_mpi.c) and a vector dot-product program (dot_product.c). Pick either one:

            #include <stdio.h>
            #include <mpi.h>
            
            int main(int argc, char *argv[]) {
                int rank, size;
                
                // 初始化MPI环境
                MPI_Init(&argc, &argv);
            
                // 获取当前进程的rank和总进程数
                MPI_Comm_rank(MPI_COMM_WORLD, &rank);
                MPI_Comm_size(MPI_COMM_WORLD, &size);
            
                // 输出进程的信息
                printf("Hello, World! I am process %d out of %d processes.\n", rank, size);
            
                // 退出MPI环境
                MPI_Finalize();
            
                return 0;
            }
            #include <stdio.h>
            #include <stdlib.h>
            #include <mpi.h>
            
            #define N 8  // 向量大小
            
            // 计算向量的局部点积
            double compute_local_dot_product(double *A, double *B, int start, int end) {
                double local_dot = 0.0;
                for (int i = start; i < end; i++) {
                    local_dot += A[i] * B[i];
                }
                return local_dot;
            }
            
            void print_vector(double *Vector) {
                for (int i = 0; i < N; i++) {
                    printf("%f ", Vector[i]);   
                }
                printf("\n");
            }
            
            int main(int argc, char *argv[]) {
                int rank, size;
            
                // 初始化MPI环境
                MPI_Init(&argc, &argv);
                MPI_Comm_rank(MPI_COMM_WORLD, &rank);
                MPI_Comm_size(MPI_COMM_WORLD, &size);
            
                // 向量A和B
                double A[N], B[N];
            
                // 进程0初始化向量A和B
                if (rank == 0) {
                    for (int i = 0; i < N; i++) {
                        A[i] = i + 1;  // 示例数据
                        B[i] = (i + 1) * 2;  // 示例数据
                    }
                }
            
                // 广播向量A和B到所有进程
                MPI_Bcast(A, N, MPI_DOUBLE, 0, MPI_COMM_WORLD);
                MPI_Bcast(B, N, MPI_DOUBLE, 0, MPI_COMM_WORLD);
            
                // 每个进程计算自己负责的部分
                int local_n = N / size;  // 每个进程处理的元素个数
                int start = rank * local_n;
                int end = (rank + 1) * local_n;
                
                // 如果是最后一个进程,确保处理所有剩余的元素(处理N % size)
                if (rank == size - 1) {
                    end = N;
                }
            
                double local_dot_product = compute_local_dot_product(A, B, start, end);
            
                // 使用MPI_Reduce将所有进程的局部点积结果汇总到进程0
                double global_dot_product = 0.0;
                MPI_Reduce(&local_dot_product, &global_dot_product, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
            
                // 进程0输出最终结果
                if (rank == 0) {
                    printf("Vector A is\n");
                    print_vector(A);
                    printf("Vector B is\n");
                    print_vector(B);
                    printf("Dot Product of A and B: %f\n", global_dot_product);
                }
            
                // 结束MPI环境
                MPI_Finalize();
                return 0;
            }

3. Create the Slurm job script

Create a SLURM job script to run the MPI program. A basic script, assuming it is saved as mpi_test.slurm:

            #!/bin/bash
            #SBATCH --job-name=mpi_test                 # 作业名称
            #SBATCH --nodes=2                           # 请求节点数
            #SBATCH --ntasks-per-node=1                 # 每个节点上的任务数
            #SBATCH --time=00:10:00                     # 最大运行时间
            #SBATCH --output=mpi_test_output_%j.log     # 输出日志文件
            
            # 加载MPI模块(如果使用模块化环境)
            module load openmpi
            
            # 运行MPI程序
            mpirun --allow-run-as-root -np 2 ./hello_mpi
            #!/bin/bash
            #SBATCH --job-name=mpi_test                 # 作业名称
            #SBATCH --nodes=2                           # 请求节点数
            #SBATCH --ntasks-per-node=1                 # 每个节点上的任务数
            #SBATCH --time=00:10:00                     # 最大运行时间
            #SBATCH --output=mpi_test_output_%j.log     # 输出日志文件
            
            # 加载MPI模块(如果使用模块化环境)
            module load openmpi
            
            # 运行MPI程序
            mpirun --allow-run-as-root -np 2 ./dot_product
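Note that --allow-run-as-root is only needed when the job actually runs as root. Depending on how Slurm and Open MPI were built, it can also be preferable (or required) to launch the ranks through srun with PMIx instead of calling mpirun inside the allocation; a hedged alternative for the launch line, assuming PMIx support is compiled into Slurm:

# Alternative launch via Slurm's own launcher
srun --mpi=pmix -n 2 ./hello_mpi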

4. Compile the MPI program

Before running the job, the MPI program needs to be compiled. On the cluster you can use mpicc. Assuming the sources are saved as hello_mpi.c and dot_product.c, compile them with:

            mpicc -o hello_mpi hello_mpi.c
            mpicc -o dot_product dot_product.c

5. Submit the Slurm job

Save the job script above (mpi_test.slurm) and submit it with:

            sbatch mpi_test.slurm

6. Check the job status

You can check the status of the job with:

            squeue -u <your_username>
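Once the job has left the queue, squeue no longer shows it. If job accounting (slurmdbd) is enabled on the cluster, sacct can report the final state of a finished job; <job_id> is a placeholder:

sacct -j <job_id> --format=JobID,JobName,Partition,State,Elapsed,ExitCode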

7. Check the output

After the job finishes, the output is written to the file(s) specified in the job script (e.g. mpi_test_output_<job_id>.log). You can view it with cat or any text editor:

            cat mpi_test_output_*.log

Example output: if everything works, hello_mpi prints something like

Hello, World! I am process 0 out of 2 processes.
Hello, World! I am process 1 out of 2 processes.

and dot_product prints the two vectors followed by the result:

Vector A is
1.000000 2.000000 3.000000 4.000000 5.000000 6.000000 7.000000 8.000000 
Vector B is
2.000000 4.000000 6.000000 8.000000 10.000000 12.000000 14.000000 16.000000 
Dot Product of A and B: 408.000000
            Aug 7, 2024

            🗃️Usage Notes

            Aug 7, 2024

            Subsections of 🗃️Usage Notes

            Subsections of Application

Stateful vs. Stateless Applications

Clearly classifying an application as "stateful" or "stateless" directly determines how it is deployed in Kubernetes, which resource type it uses, and how complex it is to operate.


I. Core Definitions

1. Stateless applications

Definition: an application instance does not keep the context or data state needed by each request. Any request can be handled by any instance, and the result is exactly the same.

Key characteristics

• Self-contained requests: every request carries all the information needed to process it (auth token, session ID, payload, etc.).
• Replaceable instances: all instances are identical and can be created or destroyed at any time; destroying one loses no data.
• No local persistence: the instance's local disk is not used to store data that must survive; any temporary data can simply be discarded when the instance is gone.
• Easy horizontal scaling: because the instances are identical, scaling is just a matter of adding more of them.

Typical examples

• Web front-end servers: e.g. Nginx, Apache.
• API gateways: e.g. Kong, Tyk.
• JWT token validation services.
• Stateless compute services: image conversion, data format conversion, etc., where input and output both live in the request.

A vivid analogy: a fast-food cashier. Any cashier can serve you: you place an order (the request), they process it, and the transaction is done. They do not need to remember what you ordered last time (the state), and next time you can walk up to any counter.

2. Stateful applications

Definition: an application instance must store and maintain specific state. Handling later requests depends on state saved by earlier requests, or changes that state.

Key characteristics

• State dependency: the outcome of a request depends on state held by that particular instance (user sessions, database records, cached data, and so on).
• Unique instances: each instance is distinct and has a stable identity (ID, hostname); instances cannot be swapped arbitrarily.
• Persistent storage required: the state must live on persistent storage that can be re-mounted and accessed even if the instance restarts, migrates, or is rebuilt.
• Harder horizontal scaling: scaling out requires careful handling of data sharding, replica synchronization, and instance identity.

Typical examples

• Databases: MySQL, PostgreSQL, MongoDB, Redis.
• Message queues: Kafka, RabbitMQ.
• Stateful middleware: Etcd, Zookeeper.
• Session servers that keep user sessions in local memory or on local disk.

A vivid analogy: a bank account manager. You have a dedicated manager (a specific instance) who knows your entire financial history and needs (the state). If you switch to a new manager, they have to learn your situation from scratch and may not immediately have all of your historical documents (the data).


II. Key Differences in Kubernetes

This classification is crucial in K8s because it decides which workload resource you use.

| Aspect | Stateless application | Stateful application |
|---|---|---|
| Core K8s resource | Deployment | StatefulSet |
| Pod identity | Fully interchangeable, no unique identity; names are random (e.g. app-7c8b5f6d9-abcde). | Stable, unique identifiers generated in order (e.g. mysql-0, mysql-1, mysql-2). |
| Start/stop order | Parallel, no ordering. | Ordered deployment (0 to N-1), ordered scale-down (N-1 to 0), ordered rolling updates. |
| Network identity | Unstable Pod IPs; accessed through a load-balancing Service. | Stable network identity; each Pod gets a stable DNS record: <pod-name>.<svc-name>.<namespace>.svc.cluster.local |
| Storage | May use PersistentVolumeClaims, but Pods either share one PVC or use independent, unrelated ones. | Stable, dedicated storage: each Pod mounts its own PVC based on its identity (e.g. mysql-0 -> pvc-mysql-0). |
| Data durability | When a Pod is deleted, its associated PVC is usually deleted as well (depending on the reclaim policy). | Even if a Pod is rescheduled to another node, it re-mounts its own persistent data through its stable identity. |
| Typical scenarios | Web servers, microservices, APIs | Databases, message queues, clustered applications (e.g. Zookeeper) |

III. A common pitfall: "looks stateless, is actually stateful"

Some applications look stateless at first glance but turn out to be stateful on closer inspection.

• The pitfall: a web application that keeps user sessions in local memory.
  • It looks stateless: it is a web service and can be deployed with a Deployment as multiple replicas.
  • It is actually stateful: if a user's first request is handled by pod-a, the session lives in pod-a's memory. If the next request is load-balanced to pod-b, pod-b cannot see that session and the user has to log in again.
  • Solutions:
    1. Make it truly stateless: move session data out to a centralized Redis or database.
    2. Accept that it is stateful: use a StatefulSet together with session affinity, so that requests from the same user always land on the same Pod instance.

Summary

How do you decide whether an application is stateful or stateless?

Ask yourself these questions:

1. Can an instance of this application be killed at any time and immediately replaced by a new one? Can the replacement seamlessly take over all of its work?
  • Yes -> stateless
  • No -> stateful
2. Are all instances of the application identical? Does adding an instance require copying data?
  • Identical, no data copy -> stateless
  • Different, data copy needed -> stateful
3. Does handling a request depend on non-temporary data kept locally on the instance (in memory or on disk)?
  • No -> stateless
  • Yes -> stateful

Understanding this distinction is the foundation of designing and deploying cloud-native applications correctly. In K8s, prefer a Deployment for stateless applications and always use a StatefulSet for stateful ones.
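To make the contrast concrete, below is a minimal StatefulSet sketch. The demo-redis name, image and sizes are illustrative, and it assumes a matching headless Service named demo-redis already exists in the target namespace:

kubectl apply -f - <<'EOF'
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: demo-redis
spec:
  serviceName: demo-redis          # headless Service that gives each Pod a stable DNS name
  replicas: 3
  selector:
    matchLabels:
      app: demo-redis
  template:
    metadata:
      labels:
        app: demo-redis
    spec:
      containers:
      - name: redis
        image: redis:7
        volumeMounts:
        - name: data
          mountPath: /data
  volumeClaimTemplates:            # one dedicated PVC per Pod: data-demo-redis-0, -1, -2
  - metadata:
      name: data
    spec:
      accessModes: ["ReadWriteOnce"]
      resources:
        requests:
          storage: 1Gi
EOF

The volumeClaimTemplates block is what ties each Pod's identity to its own storage, which is exactly the property a Deployment cannot give you.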

            Mar 7, 2025

            Subsections of Building Tool

            Maven

            1. build from submodule

You don't need to build from the root of the project.

            ./mvnw clean package -DskipTests  -rf :<$submodule-name>

You can find the <$submodule-name> in the submodule's pom.xml:

            <project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
            		xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/maven-v4_0_0.xsd">
            
            	<modelVersion>4.0.0</modelVersion>
            
            	<parent>
            		<groupId>org.apache.flink</groupId>
            		<artifactId>flink-formats</artifactId>
            		<version>1.20-SNAPSHOT</version>
            	</parent>
            
            	<artifactId>flink-avro</artifactId>
            	<name>Flink : Formats : Avro</name>

Then you can modify the command as follows:

            ./mvnw clean package -DskipTests  -rf :flink-avro
            The result will look like this
            [WARNING] For this reason, future Maven versions might no longer support building such malformed projects.
            [WARNING] 
            [INFO] ------------------------------------------------------------------------
            [INFO] Detecting the operating system and CPU architecture
            [INFO] ------------------------------------------------------------------------
            [INFO] os.detected.name: linux
            [INFO] os.detected.arch: x86_64
            [INFO] os.detected.bitness: 64
            [INFO] os.detected.version: 6.7
            [INFO] os.detected.version.major: 6
            [INFO] os.detected.version.minor: 7
            [INFO] os.detected.release: fedora
            [INFO] os.detected.release.version: 38
            [INFO] os.detected.release.like.fedora: true
            [INFO] os.detected.classifier: linux-x86_64
            [INFO] ------------------------------------------------------------------------
            [INFO] Reactor Build Order:
            [INFO] 
            [INFO] Flink : Formats : Avro                                             [jar]
            [INFO] Flink : Formats : SQL Avro                                         [jar]
            [INFO] Flink : Formats : Parquet                                          [jar]
            [INFO] Flink : Formats : SQL Parquet                                      [jar]
            [INFO] Flink : Formats : Orc                                              [jar]
            [INFO] Flink : Formats : SQL Orc                                          [jar]
            [INFO] Flink : Python                                                     [jar]
            ...

Normally, a full Flink build starts from the flink-parent module.
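Besides -rf (resume from), Maven's -pl / -am options are another way to restrict the build to one submodule plus whatever it depends on; a sketch using the same flink-avro module:

# Build only flink-avro and the modules it depends on
./mvnw clean package -DskipTests -pl :flink-avro -am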

2. skip other checks

For example, you can skip the RAT check like this:

            ./mvnw clean package -DskipTests '-Drat.skip=true'
            Mar 11, 2024

            Gradle

            1. spotless

Keep your code spotless; check more details at https://github.com/diffplug/spotless

see how to configure it

There are several files that need to be configured.

            1. settings.gradle.kts
            plugins {
                id("org.gradle.toolchains.foojay-resolver-convention") version "0.7.0"
            }
2. build.gradle.kts
            plugins {
                id("com.diffplug.spotless") version "6.23.3"
            }
            configure<com.diffplug.gradle.spotless.SpotlessExtension> {
                kotlinGradle {
                    target("**/*.kts")
                    ktlint()
                }
                java {
                    target("**/*.java")
                    googleJavaFormat()
                        .reflowLongStrings()
                        .skipJavadocFormatting()
                        .reorderImports(false)
                }
                yaml {
                    target("**/*.yaml")
                    jackson()
                        .feature("ORDER_MAP_ENTRIES_BY_KEYS", true)
                }
                json {
                    target("**/*.json")
                    targetExclude(".vscode/settings.json")
                    jackson()
                        .feature("ORDER_MAP_ENTRIES_BY_KEYS", true)
                }
            }

And then you can execute one of the following commands to format your code (Gradle or Maven):

            ./gradlew spotlessApply
            ./mvnw spotless:apply
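If you only want to verify formatting (for example in CI) without rewriting any files, Spotless also provides a check task/goal:

./gradlew spotlessCheck
./mvnw spotless:check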

            2. shadowJar

ShadowJar can bundle a project's dependency classes and resources into a single jar; check https://imperceptiblethoughts.com/shadow/

see how to configure it

You need to modify your build.gradle.kts:

            import com.github.jengelman.gradle.plugins.shadow.tasks.ShadowJar
            
            plugins {
                java // Optional 
                id("com.github.johnrengelman.shadow") version "8.1.1"
            }
            
            tasks.named<ShadowJar>("shadowJar") {
                archiveBaseName.set("connector-shadow")
                archiveVersion.set("1.0")
                archiveClassifier.set("")
                manifest {
                    attributes(mapOf("Main-Class" to "com.example.xxxxx.Main"))
                }
            }
            ./gradlew shadowJar
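With the archiveBaseName, archiveVersion and empty archiveClassifier configured above, the fat jar ends up in build/libs/ and can be run directly (the Main-Class attribute makes it executable with java -jar):

# Locate and run the shaded jar
ls build/libs/
java -jar build/libs/connector-shadow-1.0.jar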

3. check dependencies

List your project's dependencies as a tree.

see how to configure it

You need to modify your build.gradle.kts:

            configurations {
                compileClasspath
            }
            ./gradlew dependencies --configuration compileClasspath
            ./gradlew :<$module_name>:dependencies --configuration compileClasspath
            Check Potential Result

The result will look like this:

            compileClasspath - Compile classpath for source set 'main'.
            +--- org.projectlombok:lombok:1.18.22
            +--- org.apache.flink:flink-hadoop-fs:1.17.1
            |    \--- org.apache.flink:flink-core:1.17.1
            |         +--- org.apache.flink:flink-annotations:1.17.1
            |         |    \--- com.google.code.findbugs:jsr305:1.3.9 -> 3.0.2
            |         +--- org.apache.flink:flink-metrics-core:1.17.1
            |         |    \--- org.apache.flink:flink-annotations:1.17.1 (*)
            |         +--- org.apache.flink:flink-shaded-asm-9:9.3-16.1
            |         +--- org.apache.flink:flink-shaded-jackson:2.13.4-16.1
            |         +--- org.apache.commons:commons-lang3:3.12.0
            |         +--- org.apache.commons:commons-text:1.10.0
            |         |    \--- org.apache.commons:commons-lang3:3.12.0
            |         +--- commons-collections:commons-collections:3.2.2
            |         +--- org.apache.commons:commons-compress:1.21 -> 1.24.0
            |         +--- org.apache.flink:flink-shaded-guava:30.1.1-jre-16.1
            |         \--- com.google.code.findbugs:jsr305:1.3.9 -> 3.0.2
            ...
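When the tree shows a version conflict (like the jsr305 1.3.9 -> 3.0.2 entries above), dependencyInsight explains which constraint or rule selected the final version; commons-lang3 here is just an example taken from the tree:

# Trace why a particular dependency/version ended up on the classpath
./gradlew dependencyInsight --dependency commons-lang3 --configuration compileClasspath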
            Mar 7, 2024

            CICD

            Articles

              FQA

Q1: What is the difference between docker, podman and buildah?

              You can add standard markdown syntax:

              • multiple paragraphs
              • bullet point lists
              • emphasized, bold and even bold emphasized text
              • links
              • etc.
              ...and even source code

              the possibilities are endless (almost - including other shortcodes may or may not work)

              Mar 7, 2025

              Container

              Articles

              FQA

Q1: What is the difference between docker, podman and buildah?

              You can add standard markdown syntax:

              • multiple paragraphs
              • bullet point lists
              • emphasized, bold and even bold emphasized text
              • links
              • etc.
              ...and even source code

              the possibilities are endless (almost - including other shortcodes may or may not work)

              Mar 7, 2025

              Subsections of Container

              Build Smaller Image

              减小 Dockerfile 生成镜像体积的方法

              1. 选择更小的基础镜像

              # ❌ 避免使用完整版本
              FROM ubuntu:latest
              
              # ✅ 使用精简版本
              FROM alpine:3.18
              FROM python:3.11-slim
              FROM node:18-alpine

              2. 使用多阶段构建 (Multi-stage Build)

              这是最有效的方法之一:

              # 构建阶段
              FROM golang:1.21 AS builder
              WORKDIR /app
              COPY . .
              RUN go build -o myapp
              
              # 运行阶段 - 只复制必要文件
              FROM alpine:3.18
              WORKDIR /app
              COPY --from=builder /app/myapp .
              CMD ["./myapp"]

              3. 合并 RUN 指令

              每个 RUN 命令都会创建一个新层:

              # ❌ 多层,体积大
              RUN apt-get update
              RUN apt-get install -y package1
              RUN apt-get install -y package2
              
              # ✅ 单层,并清理缓存
              RUN apt-get update && \
                  apt-get install -y package1 package2 && \
                  apt-get clean && \
                  rm -rf /var/lib/apt/lists/*

              4. 清理不必要的文件

              RUN apt-get update && \
                  apt-get install -y build-essential && \
                  # 构建操作... && \
                  apt-get purge -y build-essential && \
                  apt-get autoremove -y && \
                  rm -rf /var/lib/apt/lists/* /tmp/* /var/tmp/*

              5. 使用 .dockerignore 文件

              # .dockerignore
              node_modules
              .git
              *.md
              .env
              test/

              6. 只复制必要的文件

              # ❌ 复制所有内容
              COPY . .
              
              # ✅ 只复制需要的文件
              COPY package.json package-lock.json ./
              RUN npm ci --only=production
              COPY src/ ./src/

              7. 移除调试工具和文档

              RUN apk add --no-cache python3 && \
                  rm -rf /usr/share/doc /usr/share/man

              8. 压缩和优化层

              # 在单个 RUN 中完成所有操作
              RUN set -ex && \
                  apk add --no-cache --virtual .build-deps gcc musl-dev && \
                  pip install --no-cache-dir -r requirements.txt && \
                  apk del .build-deps

              9. 使用专门的工具

              • dive: 分析镜像层
                dive your-image:tag
              • docker-slim: 自动精简镜像
                docker-slim build your-image:tag
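Even without extra tools, Docker itself can show where the size comes from and let you compare results; your-image:tag is the same placeholder used above:

# Show the size contributed by each layer of an image
docker history your-image:tag

# Compare total image sizes before and after optimization
docker image ls | grep your-image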

              实际案例对比

              优化前 (1.2GB):

              FROM ubuntu:20.04
              RUN apt-get update
              RUN apt-get install -y python3 python3-pip
              COPY . /app
              WORKDIR /app
              RUN pip3 install -r requirements.txt
              CMD ["python3", "app.py"]

              优化后 (50MB):

              FROM python:3.11-alpine AS builder
              WORKDIR /app
              COPY requirements.txt .
              RUN pip install --no-cache-dir --user -r requirements.txt
              
              FROM python:3.11-alpine
              WORKDIR /app
              COPY --from=builder /root/.local /root/.local
              COPY app.py .
              ENV PATH=/root/.local/bin:$PATH
              CMD ["python", "app.py"]

              关键要点总结

              ✅ 使用 Alpine 或 slim 镜像
              ✅ 采用多阶段构建
              ✅ 合并命令并清理缓存
              ✅ 配置 .dockerignore
              ✅ 只安装生产环境依赖
              ✅ 删除构建工具和临时文件

              通过这些方法,镜像体积通常可以减少 60-90%!

              Mar 7, 2024

              Network Mode

              Docker的网络模式决定了容器如何与宿主机、其他容器以及外部网络进行通信。

              Docker主要提供了以下五种网络模式,默认创建的是 bridge 模式。


              1. Bridge 模式

              这是 默认 的网络模式。当你创建一个容器而不指定网络时,它就会连接到这个默认的 bridge 网络(名为 bridge)。

              • 工作原理:Docker守护进程会创建一个名为 docker0 的虚拟网桥,它相当于一个虚拟交换机。所有使用该模式的容器都会通过一个虚拟网卡(veth pair)连接到这个网桥上。Docker会为每个容器分配一个IP地址,并配置其网关为 docker0 的地址。
              • 通信方式
                • 容器间通信:在同一个自定义桥接网络下的容器,可以通过容器名(Container Name)直接通信(Docker内嵌了DNS)。但在默认的 bridge 网络下,容器只能通过IP地址通信。
                • 访问外部网络:容器数据包通过 docker0 网桥,再经过宿主机的IPtables进行NAT转换,使用宿主机的IP访问外网。
                • 从外部访问容器:需要做端口映射,例如 -p 8080:80,将宿主机的8080端口映射到容器的80端口。

              优劣分析

              • 优点
                • 隔离性:容器拥有独立的网络命名空间,与宿主机和其他网络隔离,安全性较好。
                • 端口管理灵活:通过端口映射,可以灵活地管理哪些宿主机端口暴露给外部。
                • 通用性:是最常用、最通用的模式,适合大多数应用场景。
              • 缺点
                • 性能开销:相比 host 模式,多了一层网络桥接和NAT,性能有轻微损失。
                • 复杂度:在默认桥接网络中,容器间通信需要使用IP,不如自定义网络方便。

              使用场景:绝大多数需要网络隔离的独立应用,例如Web后端服务、数据库等。

              命令示例

              # 使用默认bridge网络(不推荐用于多容器应用)
              docker run -d --name my-app -p 8080:80 nginx
              
              # 创建自定义bridge网络(推荐)
              docker network create my-network
              docker run -d --name app1 --network my-network my-app
              docker run -d --name app2 --network my-network another-app
              # 现在 app1 和 app2 可以通过容器名直接互相访问

              2. Host 模式

              在这种模式下,容器不会虚拟出自己的网卡,也不会分配独立的IP,而是直接使用宿主机的IP和端口

              • 工作原理:容器与宿主机共享同一个Network Namespace。

              优劣分析

              • 优点
                • 高性能:由于没有NAT和网桥开销,网络性能最高,几乎与宿主机原生网络一致。
                • 简单:无需进行复杂的端口映射,容器内使用的端口就是宿主机上的端口。
              • 缺点
                • 安全性差:容器没有网络隔离,可以直接操作宿主机的网络。
                • 端口冲突:容器使用的端口如果与宿主机服务冲突,会导致容器无法启动。
                • 灵活性差:无法在同一台宿主机上运行多个使用相同端口的容器。

              使用场景:对网络性能要求极高的场景,例如负载均衡器、高频交易系统等。在生产环境中需谨慎使用

              命令示例

              docker run -d --name my-app --network host nginx
              # 此时,直接访问 http://<宿主机IP>:80 即可访问容器中的Nginx

              3. None 模式

              在这种模式下,容器拥有自己独立的网络命名空间,但不进行任何网络配置。容器内部只有回环地址 127.0.0.1

              • 工作原理:容器完全与世隔绝。

              优劣分析

              • 优点
                • 绝对隔离:安全性最高,容器完全无法进行任何网络通信。
              • 缺点
                • 无法联网:容器无法与宿主机、其他容器或外部网络通信。

              使用场景

              1. 需要完全离线处理的批处理任务。
              2. 用户打算使用自定义网络驱动(或手动配置)来完全自定义容器的网络栈。

              命令示例

              docker run -d --name my-app --network none alpine
              # 进入容器后,使用 `ip addr` 查看,只能看到 lo 网卡

              4. Container 模式

              这种模式下,新创建的容器不会创建自己的网卡和IP,而是与一个已经存在的容器共享一个Network Namespace。通俗讲,就是两个容器在同一个网络环境下,看到的IP和端口是一样的。

              • 工作原理:新容器复用指定容器的网络栈。

              优劣分析

              • 优点
                • 高效通信:容器间通信直接通过本地回环地址 127.0.0.1,效率极高。
                • 共享网络视图:可以方便地为一个主容器(如Web服务器)搭配一个辅助容器(如日志收集器),它们看到的网络环境完全一致。
              • 缺点
                • 紧密耦合:两个容器的生命周期和网络配置紧密绑定,缺乏灵活性。
                • 隔离性差:共享网络命名空间,存在一定的安全风险。

              使用场景:Kubernetes中的"边车"模式,例如一个Pod内的主容器和日志代理容器。

              命令示例

              docker run -d --name main-container nginx
              docker run -d --name helper-container --network container:main-container busybox
              # 此时,helper-container 中访问 127.0.0.1:80 就是在访问 main-container 的Nginx服务

              5. Overlay 模式

              这是为了实现 跨主机的容器通信 而设计的,是Docker Swarm和Kubernetes等容器编排系统的核心网络方案。

              • 工作原理:它会在多个Docker宿主机之间创建一个虚拟的分布式网络(Overlay Network),通过VXLAN等隧道技术,让不同宿主机上的容器感觉像是在同一个大的局域网内。

              优劣分析

              • 优点
                • 跨节点通信:解决了集群环境下容器间通信的根本问题。
                • 安全:支持网络加密。
              • 缺点
                • 配置复杂:需要额外的Key-Value存储(如Consul、Etcd)来同步网络状态(Docker Swarm模式内置了此功能)。
                • 性能开销:数据包需要封装和解封装,有一定性能损耗,但现代硬件上通常可以接受。

              使用场景:Docker Swarm集群、Kubernetes集群等分布式应用环境。

              命令示例(在Swarm模式下):

              # 初始化Swarm
              docker swarm init
              
              # 创建Overlay网络
              docker network create -d overlay my-overlay-net
              
              # 在Overlay网络中创建服务
              docker service create --name web --network my-overlay-net -p 80:80 nginx

              总结对比

              网络模式隔离性性能灵活性适用场景
              Bridge(默认)良好通用场景,单机多容器应用
              Host最高对性能要求极致,不介意端口冲突
              None最高-离线任务,完全自定义网络
              Container容器紧密协作(如边车模式)
              Overlay良好集群场景,跨主机容器通信

              最佳实践建议

              1. 单机应用:优先使用 自定义的Bridge网络,它比默认Bridge网络提供了更好的DNS服务发现功能,方便容器间通过名称通信。
              2. 集群应用:必须使用 Overlay网络
              3. 性能极致追求:在确认端口安全和无冲突的前提下,可考虑 Host模式
              4. 安全隔离:对于无需网络的容器,使用 None模式
              5. 避免在生产环境大量使用默认的bridge网络和container模式,因为它们分别在DNS发现和容器耦合度上存在不足。
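A few commands that help verify which mode a running container actually uses; app1 and my-network refer to the bridge-mode example earlier in this page:

# Check which networks a container is attached to and its IP address
docker inspect -f '{{json .NetworkSettings.Networks}}' app1

# List networks and inspect a user-defined bridge
docker network ls
docker network inspect my-network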
              Mar 7, 2024

              Container Principle

              Linux 容器技术的基础原理

              容器技术是现代云原生架构的基石。让我深入讲解 Linux 容器的核心原理。


              🎯 容器的本质

              容器 = 特殊的进程

              容器不是虚拟机,它本质上是宿主机上的一个受限进程,通过 Linux 内核的三大核心技术实现隔离和资源限制:

              ┌─────────────────────────────────────────┐
              │         Linux 容器三大支柱               │
              ├─────────────────────────────────────────┤
              │  1. Namespace (命名空间) - 隔离         │
              │  2. Cgroups (控制组) - 资源限制         │
              │  3. Union FS (联合文件系统) - 镜像分层   │
              └─────────────────────────────────────────┘

              🔒 Namespace (命名空间) - 视图隔离

              Namespace 是 Linux 内核提供的一种资源隔离机制,让进程只能看到属于自己的资源。

              七种 Namespace

              Namespace隔离内容内核版本示例
              PID进程 ID2.6.24容器内 PID 1 = 宿主机 PID 12345
              Network网络栈2.6.29独立的 IP、端口、路由表
              Mount文件系统挂载点2.4.19独立的根目录
              UTS主机名和域名2.6.19容器有自己的 hostname
              IPC进程间通信2.6.19消息队列、信号量、共享内存
              User用户和组 ID3.8容器内 root ≠ 宿主机 root
              CgroupCgroup 根目录4.6隔离 cgroup 视图

              1️⃣ PID Namespace (进程隔离)

              原理

              每个容器有独立的进程树,容器内看不到宿主机或其他容器的进程。

              演示

              # 在宿主机上查看进程
              ps aux | grep nginx
              # root  12345  nginx: master process
              
              # 进入容器
              docker exec -it my-container bash
              
              # 在容器内查看进程
              ps aux
              # PID   USER     COMMAND
              # 1     root     nginx: master process  ← 容器内看到的 PID 是 1
              # 25    root     nginx: worker process
              
              # 实际上宿主机上这个进程的真实 PID 是 12345

              手动创建 PID Namespace

              // C 代码示例
              #define _GNU_SOURCE
              #include <sched.h>
              #include <stdio.h>
              #include <unistd.h>
              #include <sys/wait.h>
              
              int child_func(void* arg) {
                  printf("Child PID: %d\n", getpid());  // 输出: 1
                  sleep(100);
                  return 0;
              }
              
              int main() {
                  printf("Parent PID: %d\n", getpid());  // 输出: 真实 PID
                  
                  // 创建新的 PID namespace
                  char stack[1024*1024];
                  int flags = CLONE_NEWPID;
                  
                  pid_t pid = clone(child_func, stack + sizeof(stack), flags | SIGCHLD, NULL);
                  waitpid(pid, NULL, 0);
                  return 0;
              }

              核心特点

              • 容器内第一个进程 PID = 1 (init 进程)
              • 父进程(宿主机)可以看到子进程的真实 PID
              • 子进程(容器)看不到父进程和其他容器的进程
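For a quick demo without writing C, util-linux's unshare can start a shell in a fresh PID namespace; this is a sketch that assumes root (or the needed capabilities) on a Linux host:

# Start a shell in a new PID namespace with its own /proc view
sudo unshare --pid --fork --mount-proc /bin/bash

# Inside the new namespace the shell believes it is PID 1
ps aux
echo $$    # prints 1

# Leave the namespace
exit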

              2️⃣ Network Namespace (网络隔离)

              原理

              每个容器有独立的网络栈:独立的 IP、端口、路由表、防火墙规则。

              架构图

              宿主机网络栈
              ├─ eth0 (物理网卡)
              ├─ docker0 (网桥)
              └─ veth pairs (虚拟网卡对)
                  ├─ vethXXX (宿主机端) ←→ eth0 (容器端)
                  └─ vethYYY (宿主机端) ←→ eth0 (容器端)

              演示

              # 创建新的 network namespace
              ip netns add myns
              
              # 列出所有 namespace
              ip netns list
              
              # 在新 namespace 中执行命令
              ip netns exec myns ip addr
              # 输出: 只有 loopback,没有 eth0
              
              # 创建 veth pair (虚拟网卡对)
              ip link add veth0 type veth peer name veth1
              
              # 将 veth1 移到新 namespace
              ip link set veth1 netns myns
              
              # 配置 IP
              ip addr add 192.168.1.1/24 dev veth0
              ip netns exec myns ip addr add 192.168.1.2/24 dev veth1
              
              # 启动网卡
              ip link set veth0 up
              ip netns exec myns ip link set veth1 up
              ip netns exec myns ip link set lo up
              
              # 测试连通性
              ping 192.168.1.2

              容器网络模式

              Bridge 模式(默认)

              Container A                Container B
                  │                          │
                [eth0]                    [eth0]
                  │                          │
               vethA ←─────┬─────────→ vethB
                           │
                      [docker0 网桥]
                           │
                       [iptables NAT]
                           │
                       [宿主机 eth0]
                           │
                        外部网络

              Host 模式

              Container
                  │
                  └─ 直接使用宿主机网络栈 (没有网络隔离)

              3️⃣ Mount Namespace (文件系统隔离)

              原理

              每个容器有独立的挂载点视图,看到不同的文件系统树。

              演示

              # 创建隔离的挂载环境
              unshare --mount /bin/bash
              
              # 在新 namespace 中挂载
              mount -t tmpfs tmpfs /tmp
              
              # 查看挂载点
              mount | grep tmpfs
              # 这个挂载只在当前 namespace 可见
              
              # 退出后,宿主机看不到这个挂载
              exit
              mount | grep tmpfs  # 找不到

              容器的根文件系统

              # Docker 使用 chroot + pivot_root 切换根目录
              # 容器内 / 实际是宿主机的某个目录
              
              # 查看容器的根文件系统位置
              docker inspect my-container | grep MergedDir
              # "MergedDir": "/var/lib/docker/overlay2/xxx/merged"
              
              # 在宿主机上访问容器文件系统
              ls /var/lib/docker/overlay2/xxx/merged
              # bin  boot  dev  etc  home  lib  ...

              4️⃣ UTS Namespace (主机名隔离)

              演示

              # 在宿主机
              hostname
              # host-machine
              
              # 创建新 UTS namespace
              unshare --uts /bin/bash
              
              # 修改主机名
              hostname my-container
              
              # 查看主机名
              hostname
              # my-container
              
              # 退出后,宿主机主机名不变
              exit
              hostname
              # host-machine

              5️⃣ IPC Namespace (进程间通信隔离)

              原理

              隔离 System V IPC 和 POSIX 消息队列。

              演示

              # 在宿主机创建消息队列
              ipcmk -Q
              # Message queue id: 0
              
              # 查看消息队列
              ipcs -q
              # ------ Message Queues --------
              # key        msqid      owner
              # 0x52020055 0          root
              
              # 进入容器
              docker exec -it my-container bash
              
              # 在容器内查看消息队列
              ipcs -q
              # ------ Message Queues --------
              # (空,看不到宿主机的消息队列)

              6️⃣ User Namespace (用户隔离)

              原理

              容器内的 root 用户可以映射到宿主机的普通用户,增强安全性。

              配置示例

              # 启用 User Namespace 的容器
              docker run --userns-remap=default -it ubuntu bash
              
              # 容器内
              whoami
              # root
              
              id
              # uid=0(root) gid=0(root) groups=0(root)
              
              # 但在宿主机上,这个进程实际运行在普通用户下
              ps aux | grep bash
              # 100000  12345  bash  ← UID 100000,不是 root

              UID 映射配置

              # /etc/subuid 和 /etc/subgid
              cat /etc/subuid
              # dockremap:100000:65536
              # 表示将容器内的 UID 0-65535 映射到宿主机的 100000-165535

              📊 Cgroups (Control Groups) - 资源限制

              Cgroups 用于限制、记录、隔离进程组的资源使用(CPU、内存、磁盘 I/O 等)。

              Cgroups 子系统

              子系统功能示例
              cpu限制 CPU 使用率容器最多用 50% CPU
              cpuset绑定特定 CPU 核心容器只能用 CPU 0-3
              memory限制内存使用容器最多用 512MB 内存
              blkio限制块设备 I/O容器磁盘读写 100MB/s
              devices控制设备访问容器不能访问 /dev/sda
              net_cls网络流量分类为容器流量打标签
              pids限制进程数量容器最多创建 100 个进程

              CPU 限制

              原理

              使用 CFS (Completely Fair Scheduler) 调度器限制 CPU 时间。

              关键参数

              cpu.cfs_period_us  # 周期时间(默认 100ms = 100000us)
              cpu.cfs_quota_us   # 配额时间
              
              # CPU 使用率 = quota / period
              # 例如: 50000 / 100000 = 50% CPU

              Docker 示例

              # 限制容器使用 0.5 个 CPU 核心
              docker run --cpus=0.5 nginx
              
              # 等价于
              docker run --cpu-period=100000 --cpu-quota=50000 nginx
              
              # 查看 cgroup 配置
              cat /sys/fs/cgroup/cpu/docker/<container-id>/cpu.cfs_quota_us
              # 50000

              手动配置 Cgroups

              # 创建 cgroup
              mkdir -p /sys/fs/cgroup/cpu/mycontainer
              
              # 设置 CPU 限制为 50%
              echo 50000 > /sys/fs/cgroup/cpu/mycontainer/cpu.cfs_quota_us
              echo 100000 > /sys/fs/cgroup/cpu/mycontainer/cpu.cfs_period_us
              
              # 将进程加入 cgroup
              echo $$ > /sys/fs/cgroup/cpu/mycontainer/cgroup.procs
              
              # 运行 CPU 密集任务
              yes > /dev/null &
              
              # 在另一个终端查看 CPU 使用率
              top -p $(pgrep yes)
              # CPU 使用率被限制在 50% 左右
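The paths above are for cgroup v1. On a host that uses cgroup v2 (the unified hierarchy), the same 50% cap is expressed through a single cpu.max file; a rough sketch, noting that on systemd-managed hosts you may need to delegate the cpu controller first and that creating groups directly under /sys/fs/cgroup is only suitable for experiments:

# Make the cpu controller available to child groups (may already be enabled)
echo "+cpu" > /sys/fs/cgroup/cgroup.subtree_control

# cpu.max takes "<quota> <period>" in microseconds
mkdir -p /sys/fs/cgroup/mycontainer
echo "50000 100000" > /sys/fs/cgroup/mycontainer/cpu.max
echo $$ > /sys/fs/cgroup/mycontainer/cgroup.procs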

              内存限制

              关键参数

              memory.limit_in_bytes        # 硬限制
              memory.soft_limit_in_bytes   # 软限制
              memory.oom_control           # OOM 行为控制
              memory.usage_in_bytes        # 当前使用量

              Docker 示例

              # 限制容器使用最多 512MB 内存
              docker run -m 512m nginx
              
              # 查看内存限制
              cat /sys/fs/cgroup/memory/docker/<container-id>/memory.limit_in_bytes
              # 536870912 (512MB)
              
              # 查看当前内存使用
              cat /sys/fs/cgroup/memory/docker/<container-id>/memory.usage_in_bytes

              OOM (Out of Memory) 行为

              # 当容器超过内存限制时
              # 1. 内核触发 OOM Killer
              # 2. 杀死容器内的进程(通常是内存占用最大的)
              # 3. 容器退出,状态码 137
              
              docker ps -a
              # CONTAINER ID   STATUS
              # abc123         Exited (137) 1 minute ago  ← OOM killed

              避免 OOM 的策略

              # 设置 OOM Score Adjustment
              docker run --oom-score-adj=-500 nginx
              # 数值越低,越不容易被 OOM Killer 杀死
              
              # 禁用 OOM Killer (不推荐生产环境)
              docker run --oom-kill-disable nginx

              磁盘 I/O 限制

              Docker 示例

              # 限制读取速度为 10MB/s
              docker run --device-read-bps /dev/sda:10mb nginx
              
              # 限制写入速度为 5MB/s
              docker run --device-write-bps /dev/sda:5mb nginx
              
              # 限制 IOPS
              docker run --device-read-iops /dev/sda:100 nginx
              docker run --device-write-iops /dev/sda:50 nginx

              测试 I/O 限制

              # 在容器内测试写入速度
              docker exec -it my-container bash
              
              dd if=/dev/zero of=/tmp/test bs=1M count=100
              # 写入速度会被限制在 5MB/s

              📦 Union FS (联合文件系统) - 镜像分层

              Union FS 允许多个文件系统分层叠加,实现镜像的复用和高效存储。

              核心概念

              容器可写层 (Read-Write Layer)     ← 容器运行时的修改
              ─────────────────────────────────
              镜像层 4 (Image Layer 4)          ← 只读
              镜像层 3 (Image Layer 3)          ← 只读
              镜像层 2 (Image Layer 2)          ← 只读
              镜像层 1 (Base Layer)             ← 只读
              ─────────────────────────────────
                       统一挂载点
                    (Union Mount Point)

              常见实现

              文件系统特点使用情况
              OverlayFS性能好,内核原生支持Docker 默认(推荐)
              AUFS成熟稳定,但不在主线内核早期 Docker 默认
              Btrfs支持快照,写时复制适合大规模存储
              ZFS企业级功能,但有许可问题高级用户
              Device Mapper块级存储Red Hat 系列

              OverlayFS 原理

              目录结构

              /var/lib/docker/overlay2/<image-id>/
              ├── diff/          # 当前层的文件变更
              ├── link           # 短链接名称
              ├── lower          # 指向下层的链接
              ├── merged/        # 最终挂载点(容器看到的)
              └── work/          # 工作目录(临时文件)

              实际演示

              # 查看镜像的层结构
              docker image inspect nginx:latest | jq '.[0].RootFS.Layers'
              # [
              #   "sha256:abc123...",  ← Layer 1
              #   "sha256:def456...",  ← Layer 2
              #   "sha256:ghi789..."   ← Layer 3
              # ]
              
              # 启动容器
              docker run -d --name web nginx
              
              # 查看容器的文件系统
              docker inspect web | grep MergedDir
              # "MergedDir": "/var/lib/docker/overlay2/xxx/merged"
              
              # 查看挂载信息
              mount | grep overlay
              # overlay on /var/lib/docker/overlay2/xxx/merged type overlay (rw,lowerdir=...,upperdir=...,workdir=...)

              文件操作的 Copy-on-Write (写时复制)

              # 1. 读取文件(从镜像层)
              docker exec web cat /etc/nginx/nginx.conf
              # 直接从只读的镜像层读取,无需复制
              
              # 2. 修改文件
              docker exec web bash -c "echo 'test' >> /etc/nginx/nginx.conf"
              # 触发 Copy-on-Write:
              # - 从下层复制文件到容器可写层
              # - 在可写层修改文件
              # - 下次读取时,从可写层读取(覆盖下层)
              
              # 3. 删除文件
              docker exec web rm /var/log/nginx/access.log
              # 创建 whiteout 文件,标记删除
              # 文件在镜像层仍存在,但容器内看不到

              Whiteout 文件(删除标记)

              # 在容器可写层
              ls -la /var/lib/docker/overlay2/xxx/diff/var/log/nginx/
              # c--------- 1 root root 0, 0 Oct 11 10:00 .wh.access.log
              # 字符设备文件,主次设备号都是 0,表示删除标记

              镜像分层的优势

              1. 共享层,节省空间

              # 假设有 10 个基于 ubuntu:20.04 的镜像
              # 不使用分层:10 × 100MB = 1GB
              # 使用分层:100MB (ubuntu base) + 10 × 10MB (应用层) = 200MB
              # 节省空间:80%

              2. 快速构建

              FROM ubuntu:20.04                    # Layer 1 (缓存)
              RUN apt-get update                   # Layer 2 (缓存)
              RUN apt-get install -y nginx         # Layer 3 (缓存)
              COPY app.conf /etc/nginx/            # Layer 4 (需要重建)
              COPY app.js /var/www/                # Layer 5 (需要重建)
              
              # 如果只修改 app.js,只需要重建 Layer 5
              # 前面的层都从缓存读取

              3. 快速分发

              # 拉取镜像时,只下载本地没有的层
              docker pull nginx:1.21
              # Already exists: Layer 1 (ubuntu base)
              # Downloading:    Layer 2 (nginx files)
              # Downloading:    Layer 3 (config)

              🔗 容器技术完整流程

              Docker 创建容器的完整过程

              docker run -d --name web \
                --cpus=0.5 \
                -m 512m \
                -p 8080:80 \
                nginx:latest

              内部执行流程

              1. 拉取镜像(如果本地没有)
                 └─ 下载各层,存储到 /var/lib/docker/overlay2/
              
              2. 创建 Namespace
                 ├─ PID Namespace (隔离进程)
                 ├─ Network Namespace (隔离网络)
                 ├─ Mount Namespace (隔离文件系统)
                 ├─ UTS Namespace (隔离主机名)
                 ├─ IPC Namespace (隔离进程间通信)
                 └─ User Namespace (隔离用户)
              
              3. 配置 Cgroups
                 ├─ cpu.cfs_quota_us = 50000 (50% CPU)
                 └─ memory.limit_in_bytes = 536870912 (512MB)
              
              4. 挂载文件系统 (OverlayFS)
                 ├─ lowerdir: 镜像只读层
                 ├─ upperdir: 容器可写层
                 ├─ workdir: 工作目录
                 └─ merged: 统一视图挂载点
              
              5. 配置网络
                 ├─ 创建 veth pair
                 ├─ 一端连接到容器的 Network Namespace
                 ├─ 另一端连接到 docker0 网桥
                 ├─ 分配 IP 地址
                 └─ 配置 iptables NAT 规则 (端口映射)
              
              6. 切换根目录
                 ├─ chroot 或 pivot_root
                 └─ 容器内看到的 / 是 merged 目录
              
              7. 启动容器进程
                 ├─ 在新的 Namespace 中
                 ├─ 受 Cgroups 限制
                 └─ 使用新的根文件系统
                 └─ 执行 ENTRYPOINT/CMD
              
              8. 容器运行中
                 └─ containerd-shim 监控进程

              🛠️ 手动创建容器(无 Docker)

              完整示例:从零创建容器

              #!/bin/bash
              # 手动创建一个简单的容器
              
              # 1. 准备根文件系统
              mkdir -p /tmp/mycontainer/rootfs
              cd /tmp/mycontainer/rootfs
              
              # 下载 busybox 作为基础系统
              wget https://busybox.net/downloads/binaries/1.35.0-x86_64-linux-musl/busybox
              chmod +x busybox
              ./busybox --install -s .
              
              # 创建必要的目录
              mkdir -p bin sbin etc proc sys tmp dev
              
              # 2. 创建启动脚本
              cat > /tmp/mycontainer/start.sh <<'EOF'
              #!/bin/bash
              
              # 创建新的 namespace
              unshare --pid --net --mount --uts --ipc --fork /bin/bash -c '
                  # 挂载 proc
                  mount -t proc proc /proc
                  
                  # 设置主机名
                  hostname mycontainer
                  
                  # 启动 shell
                  /bin/sh
              '
              EOF
              
              chmod +x /tmp/mycontainer/start.sh
              
              # 3. 启动容器
              chroot /tmp/mycontainer/rootfs /tmp/mycontainer/start.sh

              配置 Cgroups 限制

              # 创建 cgroup
              mkdir -p /sys/fs/cgroup/memory/mycontainer
              mkdir -p /sys/fs/cgroup/cpu/mycontainer
              
              # 设置内存限制 256MB
              echo 268435456 > /sys/fs/cgroup/memory/mycontainer/memory.limit_in_bytes
              
              # 设置 CPU 限制 50%
              echo 50000 > /sys/fs/cgroup/cpu/mycontainer/cpu.cfs_quota_us
              echo 100000 > /sys/fs/cgroup/cpu/mycontainer/cpu.cfs_period_us
              
              # 将容器进程加入 cgroup
              echo $CONTAINER_PID > /sys/fs/cgroup/memory/mycontainer/cgroup.procs
              echo $CONTAINER_PID > /sys/fs/cgroup/cpu/mycontainer/cgroup.procs

              🔍 容器 vs 虚拟机

              架构对比

              虚拟机架构:
              ┌─────────────────────────────────────┐
              │  App A  │  App B  │  App C          │
              ├─────────┼─────────┼─────────────────┤
              │ Bins/Libs│ Bins/Libs│ Bins/Libs      │
              ├─────────┼─────────┼─────────────────┤
              │ Guest OS│ Guest OS│ Guest OS        │  ← 每个 VM 都有完整 OS
              ├─────────┴─────────┴─────────────────┤
              │       Hypervisor (VMware/KVM)       │
              ├─────────────────────────────────────┤
              │         Host Operating System       │
              ├─────────────────────────────────────┤
              │         Hardware                    │
              └─────────────────────────────────────┘
              
              容器架构:
              ┌─────────────────────────────────────┐
              │  App A  │  App B  │  App C          │
              ├─────────┼─────────┼─────────────────┤
              │ Bins/Libs│ Bins/Libs│ Bins/Libs      │
              ├─────────────────────────────────────┤
              │  Docker Engine / containerd         │
              ├─────────────────────────────────────┤
              │    Host Operating System (Linux)    │  ← 共享内核
              ├─────────────────────────────────────┤
              │         Hardware                    │
              └─────────────────────────────────────┘

              性能对比

              维度虚拟机容器
              启动时间分钟级秒级
              资源占用GB 级内存MB 级内存
              性能开销5-10%< 1%
              隔离程度完全隔离(硬件级)进程隔离(OS 级)
              安全性更高(独立内核)较低(共享内核)
              密度每台物理机 10-50 个每台物理机 100-1000 个

              ⚠️ 容器的安全性考虑

              1. 共享内核的风险

              # 容器逃逸:如果内核有漏洞,容器可能逃逸到宿主机
              
              # 缓解措施:
              # - 使用 User Namespace
              # - 运行容器为非 root 用户
              # - 使用 Seccomp 限制系统调用
              # - 使用 AppArmor/SELinux

              2. 特权容器的危险

              # 特权容器可以访问宿主机所有设备
              docker run --privileged ...
              
              # ❌ 危险:容器内可以:
              # - 加载内核模块
              # - 访问宿主机所有设备
              # - 修改宿主机网络配置
              # - 读写宿主机任意文件
              
              # ✅ 最佳实践:避免使用特权容器

              3. Capability 控制

              # 只授予容器必要的权限
              docker run --cap-drop ALL --cap-add NET_BIND_SERVICE nginx
              
              # 默认 Docker 授予的 Capabilities:
              # - CHOWN, DAC_OVERRIDE, FOWNER, FSETID
              # - KILL, SETGID, SETUID, SETPCAP
              # - NET_BIND_SERVICE, NET_RAW
              # - SYS_CHROOT, MKNOD, AUDIT_WRITE, SETFCAP

              💡 关键要点总结

              容器 = Namespace + Cgroups + Union FS

              1. Namespace (隔离)

                • PID: 进程隔离
                • Network: 网络隔离
                • Mount: 文件系统隔离
                • UTS: 主机名隔离
                • IPC: 进程间通信隔离
                • User: 用户隔离
              2. Cgroups (限制)

                • CPU: 限制处理器使用
                • Memory: 限制内存使用
                • Block I/O: 限制磁盘 I/O
                • Network: 限制网络带宽
              3. Union FS (分层)

                • 镜像分层存储
                • Copy-on-Write
                • 节省空间和带宽

              容器不是虚拟机

              • ✅ 容器是特殊的进程
              • ✅ 共享宿主机内核
              • ✅ 启动快、资源占用少
              • ⚠️ 隔离性不如虚拟机
              • ⚠️ 需要注意安全配置
              Mar 7, 2024

              Subsections of Database

              Elastic Search DSL

              Basic Query

              exist query

              Returns documents that contain an indexed value for a field.

              GET /_search
              {
                "query": {
                  "exists": {
                    "field": "user"
                  }
                }
              }

              The following search returns documents that are missing an indexed value for the user.id field.

              GET /_search
              {
                "query": {
                  "bool": {
                    "must_not": {
                      "exists": {
                        "field": "user.id"
                      }
                    }
                  }
                }
              }
fuzzy query

              Returns documents that contain terms similar to the search term, as measured by a Levenshtein edit distance.

              GET /_search
              {
                "query": {
                  "fuzzy": {
                    "filed_A": {
                      "value": "ki"
                    }
                  }
                }
              }

The same query with its optional tuning parameters spelled out:

              GET /_search
              {
                "query": {
                  "fuzzy": {
                    "filed_A": {
                      "value": "ki",
                      "fuzziness": "AUTO",
                      "max_expansions": 50,
                      "prefix_length": 0,
                      "transpositions": true,
                      "rewrite": "constant_score_blended"
                    }
                  }
                }
              }

              rewrite:

              • constant_score_boolean
              • constant_score_filter
              • top_terms_blended_freqs_N
              • top_terms_boost_N, top_terms_N
              • frequent_terms, score_delegating
              ids query

              Returns documents based on their IDs. This query uses document IDs stored in the _id field.

              GET /_search
              {
                "query": {
                  "ids" : {
                    "values" : ["2NTC5ZIBNLuBWC6V5_0Y"]
                  }
                }
              }
              prefix query

              The following search returns documents where the filed_A field contains a term that begins with ki.

              GET /_search
              {
                "query": {
                  "prefix": {
                    "filed_A": {
                      "value": "ki",
                       "rewrite": "constant_score_blended",
                       "case_insensitive": true
                    }
                  }
                }
              }

              You can simplify the prefix query syntax by combining the <field> and value parameters.

              GET /_search
              {
                "query": {
                  "prefix" : { "filed_A" : "ki" }
                }
              }
              range query

              Returns documents that contain terms within a provided range.

              GET /_search
              {
                "query": {
                  "range": {
                    "filed_number": {
                      "gte": 10,
                      "lte": 20,
                      "boost": 2.0
                    }
                  }
                }
              }
              GET /_search
              {
                "query": {
                  "range": {
                    "filed_timestamp": {
                      "time_zone": "+01:00",        
                      "gte": "2020-01-01T00:00:00", 
                      "lte": "now"                  
                    }
                  }
                }
              }
              regex query

              Returns documents that contain terms matching a regular expression.

              GET /_search
              {
                "query": {
                  "regexp": {
                    "filed_A": {
                      "value": "k.*y",
                      "flags": "ALL",
                      "case_insensitive": true,
                      "max_determinized_states": 10000,
                      "rewrite": "constant_score_blended"
                    }
                  }
                }
              }
              term query

              Returns documents that contain an exact term in a provided field.

              You can use the term query to find documents based on a precise value such as a price, a product ID, or a username.

              GET /_search
              {
                "query": {
                  "term": {
                    "filed_A": {
                      "value": "kimchy",
                      "boost": 1.0
                    }
                  }
                }
              }
              wildcard query

              Returns documents that contain terms matching a wildcard pattern.

              A wildcard operator is a placeholder that matches one or more characters. For example, the * wildcard operator matches zero or more characters. You can combine wildcard operators with other characters to create a wildcard pattern.

              GET /_search
              {
                "query": {
                  "wildcard": {
                    "filed_A": {
                      "value": "ki*y",
                      "boost": 1.0,
                      "rewrite": "constant_score_blended"
                    }
                  }
                }
              }
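In practice these leaf queries are usually combined inside a bool query. A sketch sent with curl, assuming Elasticsearch is reachable at localhost:9200 and reusing the field names from the examples above (deleted_at is a hypothetical field used only for illustration):

curl -s -X GET 'http://localhost:9200/_search' \
  -H 'Content-Type: application/json' \
  -d '{
    "query": {
      "bool": {
        "must":     [ { "term":  { "filed_A": "kimchy" } } ],
        "filter":   [ { "range": { "filed_number": { "gte": 10, "lte": 20 } } } ],
        "must_not": [ { "exists": { "field": "deleted_at" } } ]
      }
    }
  }'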
              Oct 7, 2024

              HPC

                Mar 7, 2024

                K8s

                Mar 7, 2024

                Subsections of K8s

                K8s的理解

                一、核心定位:云时代的操作系统

                我对 K8s 最根本的理解是:它正在成为数据中心/云环境的“操作系统”。

                • 传统操作系统(如 Windows、Linux):管理的是单台计算机的硬件资源(CPU、内存、硬盘、网络),并为应用程序(进程)提供运行环境。
                • Kubernetes:管理的是一个集群(由多台计算机组成)的资源,并将这些物理机/虚拟机抽象成一个巨大的“资源池”。它在这个池子上调度和运行的不再是简单的进程,而是容器化了的应用程序

                所以,你可以把 K8s 看作是一个分布式的、面向云原生应用的操作系统。


                二、要解决的核心问题:从“动物园”到“牧场”

                在 K8s 出现之前,微服务和容器化架构带来了新的挑战:

                1. 编排混乱:我有成百上千个容器,应该在哪台机器上启动?如何知道它们是否健康?挂了怎么办?如何扩容缩容?
                2. 网络复杂:容器之间如何发现和通信?如何实现负载均衡?
                3. 存储管理:有状态应用的数据如何持久化?容器漂移后数据怎么跟走?
                4. 部署麻烦:如何实现蓝绿部署、金丝雀发布?如何回滚?

                这个时期被称为“集装箱革命”后的“编排战争”时期,各种工具(如 Docker Swarm, Mesos, Nomad)就像是一个混乱的“动物园”。

                K8s 的诞生(源于 Google 内部系统 Borg 的经验)就是为了系统地解决这些问题,它将混乱的“动物园”管理成了一个井然有序的“牧场”。它的核心能力可以概括为:声明式 API 和控制器模式


                三、核心架构与工作模型:大脑与肢体

                K8s 集群主要由控制平面工作节点 组成。

                • 控制平面:集群的大脑

                  • kube-apiserver:整个系统的唯一入口,所有组件都必须通过它来操作集群状态。它是“前台总机”。
                  • etcd:一个高可用的键值数据库,持久化存储集群的所有状态数据。它是“集群的记忆中心”。
                  • kube-scheduler:负责调度,决定 Pod 应该在哪个节点上运行。它是“人力资源部”。
                  • kube-controller-manager:运行着各种控制器,不断检查当前状态是否与期望状态一致,并努力驱使其一致。例如,节点控制器、副本控制器等。它是“自动化的管理团队”。
                • 工作节点:干活的肢体

                  • kubelet:节点上的“监工”,负责与控制平面通信,管理本节点上 Pod 的生命周期,确保容器健康运行。
                  • kube-proxy:负责节点上的网络规则,实现 Service 的负载均衡和网络代理。
                  • 容器运行时:如 containerd 或 CRI-O,负责真正拉取镜像和运行容器。

                工作模型的核心:声明式 API 与控制器模式

                1. 你向 kube-apiserver 提交一个 YAML/JSON 文件,声明你期望的应用状态(例如:我要运行 3 个 Nginx 实例)。
                2. etcd 记录下这个期望状态。
                3. 各种控制器会持续地“观察”当前状态,并与 etcd 中的期望状态进行对比。
                4. 如果发现不一致(例如,只有一个 Nginx 实例在运行),控制器就会主动采取行动(例如,再创建两个 Pod),直到当前状态与期望状态一致。
                5. 这个过程是自愈的、自动的
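A minimal sketch of this control loop in action, assuming kubectl access to a test cluster (nginx-demo is just an illustrative name):

# Declare the desired state: 3 nginx replicas
kubectl create deployment nginx-demo --image=nginx --replicas=3

# Delete the Pods; the Deployment controller immediately recreates them
kubectl delete pod -l app=nginx-demo
kubectl get pods -l app=nginx-demo -w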

                四、关键对象与抽象:乐高积木

                K8s 通过一系列抽象对象来建模应用,这些对象就像乐高积木:

                1. Pod最小部署和管理单元。一个 Pod 可以包含一个或多个紧密关联的容器(如主容器和 Sidecar 容器),它们共享网络和存储。这是 K8s 的“原子”。
                2. Deployment定义无状态应用。它管理 Pod 的多个副本(Replicas),并提供滚动更新、回滚等强大的部署策略。这是最常用的对象。
                3. Service定义一组 Pod 的访问方式。Pod 是“ ephemeral ”的,IP 会变。Service 提供一个稳定的 IP 和 DNS 名称,并作为负载均衡器,将流量分发给后端的健康 Pod。它是“服务的门户”。
                4. ConfigMap & Secret:将配置信息和敏感数据与容器镜像解耦,实现配置的灵活管理。
                5. Volume:抽象了各种存储解决方案,为 Pod 提供持久化存储。
                6. Namespace:在物理集群内部创建多个虚拟集群,实现资源隔离和多租户管理。
                7. StatefulSet用于部署有状态应用(如数据库)。它为每个 Pod 提供稳定的标识符、有序的部署和扩缩容,以及稳定的持久化存储。
                8. Ingress:管理集群外部访问内部服务的入口,通常提供 HTTP/HTTPS 路由、SSL 终止等功能。它是“集群的流量总入口”。

                五、核心价值与优势

                1. 自动化运维:自动化了应用的部署、扩缩容、故障恢复(自愈)、滚动更新等,极大降低了运维成本。
                2. 声明式配置与不可变基础设施:通过 YAML 文件定义一切,基础设施可版本化、可追溯、可重复。这是 DevOps 和 GitOps 的基石。
                3. 环境一致性 & 可移植性:实现了“一次编写,随处运行”。无论是在本地开发机、测试环境,还是在公有云、混合云上,应用的行为都是一致的。
                4. 高可用性与弹性伸缩:轻松实现应用的多副本部署,并能根据 CPU、内存等指标或自定义指标进行自动扩缩容,从容应对流量高峰。
                5. 丰富的生态系统:拥有一个极其庞大和活跃的社区,提供了大量的工具和扩展(Helm, Operator, Istio等),能解决几乎所有你能想到的问题。

                六、挑战与学习曲线

                K8s 并非银弹,它也有自己的挑战:

                • 复杂性高:概念繁多,架构复杂,学习和运维成本非常高。
                • “配置”沉重:YAML 文件可能非常多,管理起来本身就是一门学问。
                • 网络与存储:虽然是核心抽象,但其底层实现和理解起来依然有相当的门槛。

                总结

                在我看来,Kubernetes 不仅仅是一个容器编排工具,它更是一套云原生应用的管理范式。它通过一系列精妙的抽象,将复杂的分布式系统管理问题标准化、自动化和简单化。虽然入门有门槛,但它已经成为现代应用基础设施的事实标准,是任何从事后端开发、运维、架构设计的人员都必须理解和掌握的核心技术。

                简单来说,K8s 让你能够像管理一台超级计算机一样,去管理一个由成千上万台机器组成的集群。

                Mar 7, 2024

                Cgroup在K8S中起什么作用

                Kubernetes 深度集成 cgroup 来实现容器资源管理和隔离。以下是 cgroup 与 K8s 结合的详细方式:

                1. K8s 资源模型与 cgroup 映射

                1.1 资源请求和限制

                apiVersion: v1
                kind: Pod
                spec:
                  containers:
                  - name: app
                    resources:
                      requests:
                        memory: "64Mi"
                        cpu: "250m"
                      limits:
                        memory: "128Mi"
                        cpu: "500m"
                        ephemeral-storage: "2Gi"

                对应 cgroup 配置:

                • cpu.shares = 256 (250m × 1024 / 1000)
                • cpu.cfs_quota_us = 50000 (500m × 100000 / 1000)
                • memory.limit_in_bytes = 134217728 (128Mi)

                2. K8s cgroup 驱动

                2.1 cgroupfs 驱动

                # kubelet 配置
                --cgroup-driver=cgroupfs
                --cgroup-root=/sys/fs/cgroup

                2.2 systemd 驱动(推荐)

                # kubelet 配置
                --cgroup-driver=systemd
                --cgroup-root=/sys/fs/cgroup

                3. K8s cgroup 层级结构

                3.1 cgroup v1 层级

                /sys/fs/cgroup/
                ├── cpu,cpuacct/kubepods/
                │   ├── burstable/pod-uid-1/
                │   │   ├── container-1/
                │   │   └── container-2/
                │   └── guaranteed/pod-uid-2/
                │       └── container-1/
                ├── memory/kubepods/
                └── pids/kubepods/

                3.2 cgroup v2 统一层级

                /sys/fs/cgroup/kubepods/
                ├── pod-uid-1/
                │   ├── container-1/
                │   └── container-2/
                └── pod-uid-2/
                    └── container-1/

                4. QoS 等级与 cgroup 配置

                4.1 Guaranteed (最高优先级)

                resources:
                  limits:
                    cpu: "500m"
                    memory: "128Mi"
                  requests:
                    cpu: "500m" 
                    memory: "128Mi"

                cgroup 配置:

                • cpu.shares = 512
                • cpu.cfs_quota_us = 50000
• oom_score_adj = -997

                4.2 Burstable (中等优先级)

                resources:
                  requests:
                    cpu: "250m"
                    memory: "64Mi"
                  # limits 未设置或大于 requests

                cgroup 配置:

                • cpu.shares = 256
                • cpu.cfs_quota_us = -1 (无限制)
• oom_score_adj = 2~999(根据内存 request 占节点可分配内存的比例计算)

                4.3 BestEffort (最低优先级)

                # 未设置 resources

                cgroup 配置:

                • cpu.shares = 2
                • memory.limit_in_bytes = 9223372036854771712 (极大值)
                • oom_score_adj = 1000
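
可以直接查看某个 Pod 被划分到的 QoS 等级,从而推断它落在哪棵 cgroup 子树下(Pod 名称为示例):

kubectl get pod my-app-pod -o jsonpath='{.status.qosClass}{"\n"}'
# 或批量查看所有 Pod 的 QoS 等级
kubectl get pods -A -o custom-columns=NAME:.metadata.name,QOS:.status.qosClass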

                5. 实际 cgroup 配置示例

                5.1 查看 Pod 的 cgroup

                # 找到 Pod 的 cgroup 路径
                cat /sys/fs/cgroup/cpu/kubepods/pod-uid-1/cgroup.procs
                
                # 查看 CPU 配置
                cat /sys/fs/cgroup/cpu/kubepods/pod-uid-1/cpu.shares
                cat /sys/fs/cgroup/cpu/kubepods/pod-uid-1/cpu.cfs_quota_us
                
                # 查看内存配置
                cat /sys/fs/cgroup/memory/kubepods/pod-uid-1/memory.limit_in_bytes

                5.2 使用 cgroup-tools 监控

                # 安装工具
                apt-get install cgroup-tools
                
                # 查看 cgroup 统计
                cgget -g cpu:/kubepods/pod-uid-1
                cgget -g memory:/kubepods/pod-uid-1

                6. K8s 特性与 cgroup 集成

                6.1 垂直 Pod 自动缩放 (VPA)

apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: my-app-vpa   # 名称为示例
spec:
  targetRef:
    apiVersion: "apps/v1"
    kind: Deployment
    name: my-app
  updatePolicy:
    updateMode: "Auto"

                VPA 根据历史使用数据动态调整:

• 修改 resources.requests 和 resources.limits
                • kubelet 更新对应的 cgroup 配置
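
假设集群中已经安装了 VPA 组件,可以查看它给出的推荐值(沿用上面示例中的 my-app-vpa,名称仅为假设):

kubectl describe vpa my-app-vpa      # Status 下的 Recommendation 即为建议的资源配置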

                6.2 水平 Pod 自动缩放 (HPA)

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: my-app-hpa   # 名称为示例
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: my-app
  minReplicas: 1
  maxReplicas: 10
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 50

                HPA 依赖 cgroup 的 CPU 使用率统计进行决策。

                6.3 资源监控

                # 通过 cgroup 获取容器资源使用
                cat /sys/fs/cgroup/cpu/kubepods/pod-uid-1/cpuacct.usage
                cat /sys/fs/cgroup/memory/kubepods/pod-uid-1/memory.usage_in_bytes
                
                # 使用 metrics-server 收集
                kubectl top pods
                kubectl top nodes

                7. 节点资源管理

                7.1 系统预留资源

                # kubelet 配置
                apiVersion: kubelet.config.k8s.io/v1beta1
                kind: KubeletConfiguration
                systemReserved:
                  cpu: "100m"
                  memory: "256Mi"
                  ephemeral-storage: "1Gi"
                kubeReserved:
                  cpu: "200m"
                  memory: "512Mi"
                  ephemeral-storage: "2Gi"
                evictionHard:
                  memory.available: "100Mi"
                  nodefs.available: "10%"

                7.2 驱逐策略

                当节点资源不足时,kubelet 根据 cgroup 统计:

                • 监控 memory.usage_in_bytes
                • 监控 cpuacct.usage
                • 触发 Pod 驱逐

                8. 故障排查和调试

                8.1 检查 cgroup 配置

                # 进入节点检查
                docker exec -it node-shell /bin/bash
                
                # 查看 Pod cgroup
                find /sys/fs/cgroup -name "*pod-uid*" -type d
                
                # 检查资源限制
                cat /sys/fs/cgroup/memory/kubepods/pod-uid-1/memory.limit_in_bytes
                cat /sys/fs/cgroup/cpu/kubepods/pod-uid-1/cpu.cfs_quota_us

                8.2 监控 OOM 事件

                # 查看内核日志
                dmesg | grep -i "killed process"
                
                # 查看 cgroup OOM 事件
                grep "kubepods" /var/log/kern.log | grep -i oom

                9. 最佳实践

                9.1 合理设置资源限制

                resources:
                  requests:
                    cpu: "100m"
                    memory: "128Mi"
                  limits:
                    cpu: "200m"    # 不要设置过大
                    memory: "256Mi" # 避免内存浪费

                9.2 使用 LimitRange

                apiVersion: v1
                kind: LimitRange
                metadata:
                  name: mem-limit-range
                spec:
                  limits:
                  - default:
                      memory: "256Mi"
                    defaultRequest:
                      memory: "128Mi"
                    type: Container

                cgroup 是 K8s 资源管理的基石,通过精细的 cgroup 配置,K8s 实现了多租户环境下的资源隔离、公平调度和稳定性保障。

                Mar 7, 2024

                Headless VS ClusterIP

                Q: headless service 和 普通的service 有什么区别? 只是有没有clusterIP?

                “有没有 ClusterIP” 只是表面现象,其背后是根本不同的服务发现模式和适用场景。


                核心区别:服务发现模式

                • 普通 Service:提供的是 “负载均衡” 式的服务发现。
                  • 它抽象了一组 Pod,你访问的是这个抽象的、稳定的 VIP(ClusterIP),然后由 kube-proxy 将流量转发到后端的某个 Pod。
                  • 客户端不知道、也不关心具体是哪个 Pod 在处理请求。
                • Headless Service:提供的是 “直接 Pod IP” 式的服务发现。
                  • 不会给你一个统一的 VIP,而是直接返回后端所有 Pod 的 IP 地址。
                  • 客户端可以直接与任何一个 Pod 通信,并且知道它正在和哪个具体的 Pod 对话。

                详细对比

特性 | 普通 Service | Headless Service
clusterIP 字段 | 自动分配一个 VIP(如 10.96.123.45) | 必须显式设置为 None,这是定义 Headless Service 的标志
核心功能 | 负载均衡,作为流量的代理和分发器 | 服务发现,作为 Pod 的 DNS 记录注册器,不负责流量转发
DNS 解析结果 | 解析到 Service 的 ClusterIP | 解析到所有与 Selector 匹配的 Pod 的 IP 地址
网络拓扑 | 客户端 -> ClusterIP (VIP) -> (由 kube-proxy 负载均衡) -> 某个 Pod | 客户端 -> Pod IP
适用场景 | 标准的微服务、Web 前端/后端 API,任何需要负载均衡的场景 | 有状态应用集群(如 MySQL, MongoDB, Kafka, Redis Cluster)、需要直接连接特定 Pod 的场景(如 gRPC 长连接、游戏服务器)

                DNS 解析行为的深入理解

                这是理解两者差异的最直观方式。

                假设我们有一个名为 my-app 的 Service,它选择了 3 个 Pod。

                1. 普通 Service 的 DNS 解析

                • 在集群内,你执行 nslookup my-app(或在 Pod 里用代码查询)。
                • 返回结果1 条 A 记录,指向 Service 的 ClusterIP。
                  Name:      my-app
Address 1: 10.96.123.45
                • 你的应用:连接到 10.96.123.456:port,剩下的交给 Kubernetes 的网络层。

                2. Headless Service 的 DNS 解析

                • 在集群内,你执行 nslookup my-app(注意:Service 的 clusterIP: None)。
                • 返回结果多条 A 记录,直接指向后端所有 Pod 的 IP。
                  Name:      my-app
                  Address 1: 172.17.0.10
                  Address 2: 172.17.0.11
                  Address 3: 172.17.0.12
                • 你的应用:会拿到这个 IP 列表,并由客户端自己决定如何连接。比如,它可以:
                  • 随机选一个。
                  • 实现自己的负载均衡逻辑。
                  • 需要连接所有 Pod(比如收集状态)。

                与 StatefulSet 结合的“杀手级应用”

                Headless Service 最经典、最强大的用途就是与 StatefulSet 配合,为有状态应用集群提供稳定的网络标识。

                回顾之前的 MongoDB 例子:

                • StatefulSet: mongodb (3个副本)
                • Headless Service: mongodb-service

                此时,DNS 系统会创建出稳定且可预测的 DNS 记录,而不仅仅是返回 IP 列表:

                • 每个 Pod 获得一个稳定的 DNS 名称

                  • mongodb-0.mongodb-service.default.svc.cluster.local
                  • mongodb-1.mongodb-service.default.svc.cluster.local
                  • mongodb-2.mongodb-service.default.svc.cluster.local
                • 查询 Headless Service 本身的 DNS (mongodb-service) 会返回所有 Pod IP。

                这带来了巨大优势:

                1. 稳定的成员身份:在初始化 MongoDB 副本集时,你可以直接用这些稳定的 DNS 名称来配置成员列表。即使 Pod 重启、IP 变了,它的 DNS 名称永远不变,配置也就永远不会失效。
                2. 直接 Pod 间通信:在 Kafka 或 Redis Cluster 这样的系统中,节点之间需要直接通信来同步数据。它们可以使用这些稳定的 DNS 名称直接找到对方,而不需要经过一个不必要的负载均衡器。
                3. 主从选举与读写分离:客户端应用可以通过固定的 DNS 名称(如 mongodb-0...)直接连接到主节点执行写操作,而通过其他名称连接到从节点进行读操作。
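
上面这些稳定的 DNS 名称可以在集群内直接验证(示意,假设前述 mongodb StatefulSet 与 mongodb-service 已部署在 default 命名空间):

# 解析某个 Pod 的专属域名
kubectl run dns-test --rm -it --restart=Never --image=busybox:1.36 -- \
  nslookup mongodb-0.mongodb-service.default.svc.cluster.local
# 解析 Headless Service 本身,返回全部 Pod IP
kubectl run dns-test2 --rm -it --restart=Never --image=busybox:1.36 -- \
  nslookup mongodb-service.default.svc.cluster.local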

                总结

                你可以这样形象地理解:

                • 普通 Service 像一个公司的“总机号码”

                  • 你打电话给总机(ClusterIP),说“我要找技术支持”,接线员(kube-proxy)会帮你转接到一个空闲的技术支持人员(Pod)那里。你不需要知道具体是谁在为你服务。
                • Headless Service 像一个公司的“内部通讯录”

                  • 它不提供总机转接服务。它只给你一份所有员工(Pod)的姓名和直拨电话(IP)列表。
                  • 特别是对于 StatefulSet,这份通讯录里的每个员工还有自己固定、专属的座位和分机号(稳定的 DNS 名称),比如“张三座位在 A区-001,分机是 8001”。你知道要找谁时,直接打他的分机就行。

                所以,“有没有 ClusterIP” 只是一个开关,这个开关背后选择的是两种截然不同的服务发现和流量治理模式。 对于需要直接寻址、有状态、集群化的应用,Headless Service 是必不可少的基石。

                Mar 7, 2024

                Creating A Pod

                描述 Kubernetes 中一个 Pod 的创建过程,可以清晰地展示了 K8s 各个核心组件是如何协同工作的。

                我们可以将整个过程分为两个主要阶段:控制平面的决策阶段工作节点的执行阶段


                第一阶段:控制平面决策(大脑决策)

                1. 用户提交请求

                  • 用户使用 kubectl apply -f pod.yamlkube-apiserver 提交一个 Pod 定义文件。
                  • kubectl 会验证配置并将其转换为 JSON 格式,通过 REST API 调用发送给 kube-apiserver。
                2. API Server 处理与验证

                  • kube-apiserver 接收到请求后,会进行一系列操作:
                    • 身份认证:验证用户身份。
                    • 授权:检查用户是否有权限创建 Pod。
                    • 准入控制:可能调用一些准入控制器来修改或验证 Pod 对象(例如,注入 Sidecar 容器、设置默认资源限制等)。
                  • 所有验证通过后,kube-apiserver 将 Pod 的元数据对象写入 etcd 数据库。此时,Pod 在 etcd 中的状态被标记为 Pending
                  • 至此,Pod 的创建请求已被记录,但还未被调度到任何节点。
                3. 调度器决策

                  • kube-scheduler 作为一个控制器,通过 watch 机制持续监听 kube-apiserver,发现有一个新的 Pod 被创建且其 nodeName 为空。
                  • 调度器开始为这个 Pod 选择一个最合适的节点,它执行两阶段操作:
                    • 过滤:根据节点资源(CPU、内存)、污点、节点选择器、存储、镜像拉取等因素过滤掉不合适的节点。
                    • 评分:对剩下的节点进行打分(例如,考虑资源均衡、亲和性等),选择得分最高的节点。
• 做出决策后,kube-scheduler 以补丁(patch)的方式更新 kube-apiserver 中该 Pod 的定义,将其 nodeName 字段设置为选定的节点名称。
                  • kube-apiserver 再次将这个更新后的信息写入 etcd

                第二阶段:工作节点执行(肢体行动)

                1. kubelet 监听到任务

                  • 目标节点上的 kubelet 同样通过 watch 机制监听 kube-apiserver,发现有一个 Pod 被“分配”到了自己所在的节点(即其 nodeName 与自己的节点名匹配)。
                  • kubelet 会从 kube-apiserver 读取完整的 Pod 定义。
                2. kubelet 控制容器运行时

                  • kubelet 通过 CRI 接口调用本地的容器运行时(如 containerd、CRI-O)。
                  • 容器运行时负责:
                    • 从指定的镜像仓库拉取容器镜像(如果本地不存在)。
                    • 根据 Pod 定义创建启动容器。
                3. 配置容器环境

                  • 在启动容器前后,kubelet 还会通过其他接口完成一系列配置:
                    • CNI:调用网络插件(如 Calico、Flannel)为 Pod 分配 IP 地址并配置网络。
                    • CSI:如果 Pod 使用了持久化存储,会调用存储插件挂载存储卷。
                4. 状态上报

                  • 当 Pod 中的所有容器都成功启动并运行后,kubelet 会持续监控容器的健康状态。
                  • 它将 Pod 的当前状态(如 Running)和 IP 地址等信息作为状态更新,上报kube-apiserver
                  • kube-apiserver 最终将这些状态信息写入 etcd

                总结流程图

                用户 kubectl -> API Server -> (写入) etcd -> Scheduler (绑定节点) -> API Server -> (更新) etcd -> 目标节点 kubelet -> 容器运行时 (拉镜像,启容器) -> CNI/CSI (配网络/存储) -> kubelet -> API Server -> (更新状态) etcd

                核心要点:

                • 声明式 API:用户声明“期望状态”,系统驱动“当前状态”向其靠拢。
                • 监听与协同:所有组件都通过监听 kube-apiserver 来获取任务并协同工作。
                • etcd 作为唯一信源:整个集群的状态始终以 etcd 中的数据为准。
                • 组件职责分离:Scheduler 只管调度,kubelet 只管执行,API Server 只管交互和存储。
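
可以通过事件时间线直观地看到上述各组件的接力过程(示意):

kubectl run trace-demo --image=nginx:1.25
kubectl get events --field-selector involvedObject.name=trace-demo --sort-by=.lastTimestamp
# 典型事件顺序:Scheduled(调度器)→ Pulling / Pulled / Created / Started(kubelet)
kubectl delete pod trace-demo   # 清理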
                Mar 7, 2024

                Deleting A Pod

                删除一个 Pod 的流程与创建过程相对应,但它更侧重于如何优雅地、安全地终止一个运行中的实例。这个过程同样涉及多个组件的协同。

                下面是一个 Pod 的删除流程,但它的核心是体现 Kubernetes 的优雅终止机制。


                删除流程的核心阶段

                阶段一:用户发起删除指令

                1. 用户执行命令:用户执行 kubectl delete pod <pod-name>
                2. API Server 接收请求
                  • kubectlkube-apiserver 发送一个 DELETE 请求。
                  • kube-apiserver 会进行认证、授权等验证。
                3. “标记为删除”:验证通过后,kube-apiserver 不会立即从 etcd 中删除该 Pod 对象,而是会执行一个关键操作:为 Pod 对象设置一个“删除时间戳”(deletionTimestamp)并将其标记为 Terminating 状态。这个状态会更新到 etcd 中。

                阶段二:控制平面与节点的通知

                1. 组件感知变化
                  • 所有监听 kube-apiserver 的组件(如 kube-scheduler, 各个节点的 kubelet)都会立刻感知到这个 Pod 的状态已变为 Terminating
                  • Endpoint Controller 会立刻将这个 Pod 的 IP 从关联的 Service 的 Endpoints(或 EndpointSlice)列表中移除。这意味着新的流量不会再被负载均衡到这个 Pod 上

                阶段三:节点上的优雅终止

                这是最关键的阶段,发生在 Pod 所在的工作节点上。

                1. kubelet 监听到状态变化:目标节点上的 kubelet 通过 watch 机制发现它管理的某个 Pod 被标记为 Terminating

                2. 触发优雅关闭序列

                  • 第1步:执行 PreStop Hook(如果配置了的话) kubelet 会首先执行 Pod 中容器定义的 preStop 钩子。这是一个在发送终止信号之前执行的特定命令或 HTTP 请求。常见用途包括:
                    • 通知上游负载均衡器此实例正在下线。
                    • 让应用完成当前正在处理的请求。
                    • 执行一些清理任务。
                  • 第2步:发送 SIGTERM 信号 kubelet 通过容器运行时向 Pod 中的每个容器的主进程发送 SIGTERM(信号 15)信号。这是一个“优雅关闭”信号,通知应用:“你即将被终止,请保存状态、完成当前工作并自行退出”。
                    • 注意SIGTERMpreStop Hook 是并行执行的。Kubernetes 会等待两者中的一个先完成,再进入下一步。
                3. 等待终止宽限期

                  • 在发送 SIGTERM 之后,Kubernetes 不会立即杀死容器。它会等待一个称为 terminationGracePeriodSeconds 的时长(默认为 30 秒)。
                  • 理想情况下,容器内的应用程序捕获到 SIGTERM 信号后,会开始优雅关闭流程,并在宽限期内自行退出。

                阶段四:强制终止与清理

                1. 宽限期后的处理

                  • 情况A:优雅关闭成功:如果在宽限期内,所有容器都成功停止,kubelet 会通知容器运行时清理容器资源,然后进行下一步。
                  • 情况B:优雅关闭失败:如果宽限期结束后,容器仍未停止,kubelet 会触发强制杀死。它向容器的主进程发送 SIGKILL(信号 9) 信号,该信号无法被捕获或忽略,会立即终止进程。
                2. 清理资源

                  • 容器被强制或优雅地终止后,kubelet 会通过容器运行时清理容器资源。
                  • 同时,kubelet 会清理 Pod 的网络资源(通过 CNI 插件)和存储资源(卸载 Volume)。
                3. 上报最终状态

                  • kubelet 向 kube-apiserver 发送最终信息,确认 Pod 已完全停止。
                  • kube-apiserver 随后从 etcd正式删除该 Pod 的对象记录。至此,这个 Pod 才真正从系统中消失。

                总结流程图

                用户 kubectl delete -> API Server -> (在etcd中标记Pod为 Terminating) -> Endpoint Controller (从Service中移除IP) -> 目标节点 kubelet -> 执行 PreStop Hook -> 发送 SIGTERM 信号 -> (等待 terminationGracePeriodSeconds) -> [成功则清理 / 失败则发送 SIGKILL] -> 清理网络/存储 -> kubelet -> API Server -> (从etcd中删除对象)

                关键要点

                1. 优雅终止是核心:Kubernetes 给了应用一个自我清理的机会,这是保证服务无损发布和滚动更新的基石。
                2. 流量切断先行:Pod 被从 Service 的 Endpoints 中移除是第一步,这确保了在 Pod 开始关闭前,不会有新流量进来。
                3. 两个关键配置
                  • terminationGracePeriodSeconds:决定了应用有多长时间来自行关闭。
                  • preStop Hook:提供了一个主动执行关闭脚本的机会,比单纯等待 SIGTERM 更可靠。
                4. 强制终止作为保障:如果应用无法响应优雅关闭信号,Kubernetes 有最后的强制手段来保证资源被释放。

                理解这个流程对于设计健壮的、能够正确处理关闭信号的微服务至关重要。
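
下面是一个最小的优雅终止配置示意(镜像与 sleep 命令仅作演示),把 preStop 与 terminationGracePeriodSeconds 放在一起看:

cat <<'EOF' | kubectl apply -f -
apiVersion: v1
kind: Pod
metadata:
  name: graceful-demo
spec:
  terminationGracePeriodSeconds: 60        # 最多给应用 60 秒自行退出
  containers:
  - name: app
    image: nginx:1.25
    lifecycle:
      preStop:
        exec:
          command: ["sh", "-c", "sleep 10"]   # 模拟排空连接等清理动作
EOF
# 执行 kubectl delete pod graceful-demo 时,大约会在 preStop 的 10 秒结束、SIGTERM 生效后完成删除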

                Mar 7, 2024

                Deployment VS ReplicaSet

                下面我会从 架构、工作流、控制循环、数据结构与事件链 等层面详细说明它们是怎么工作的。


                🧩 一、核心概念层次关系

                先看一下层级:

                Deployment → ReplicaSet → Pod
层级 | 职责 | 控制器类型
Deployment | 负责声明“应用版本”和“滚动更新策略” | 高级控制器(managing controller)
ReplicaSet | 保证指定数量的 Pod 副本数 | 基础控制器(ensuring controller)
Pod | 最小可调度单元,运行实际容器 | 工作负载对象

                可以理解为:

                Deployment 是策略控制器,ReplicaSet 是数量控制器,Pod 是执行单元。


                ⚙️ 二、Deployment 的工作原理(上层控制器)

                1️⃣ Deployment 对象定义

                你在创建一个 Deployment 时,例如:

                apiVersion: apps/v1
                kind: Deployment
                metadata:
                  name: webapp
                spec:
                  replicas: 3
                  selector:
                    matchLabels:
                      app: webapp
                  template:
                    metadata:
                      labels:
                        app: webapp
                    spec:
                      containers:
                      - name: nginx
                        image: nginx:1.25

                这会创建一个 Deployment 对象并写入 etcd。


                2️⃣ Deployment Controller 发现新对象

                kube-controller-manager 中的 Deployment Controller 通过 Informer + Shared Indexer Cache 订阅(watch)Deployment 资源变化。

                一旦发现新 Deployment,它会执行以下逻辑:

                syncDeployment(deployment):
                    rsList := list ReplicaSets matching deployment.selector
                    newRS := findReplicaSetMatchingPodTemplate(deployment.spec.template)
                    if newRS == nil:
                        newRS = createReplicaSet(deployment.spec.template)
                    adjustReplicaCounts(newRS, oldRSList)

                3️⃣ ReplicaSet 的创建与管理

                • 如果模板(spec.template)发生变化(例如镜像从 nginx:1.25 改为 nginx:1.26), Deployment Controller 会创建一个新的 ReplicaSet
                • 旧 ReplicaSet 会被缩容,新的被扩容,形成滚动更新

                例如:

                Deployment: webapp
                 ├── ReplicaSet (nginx:1.25) — scale 2
                 └── ReplicaSet (nginx:1.26) — scale 1

                当更新完成后:

                Deployment: webapp
                 └── ReplicaSet (nginx:1.26) — scale 3
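
这一此消彼长的滚动过程可以用 kubectl 直接观察(示意,沿用上面的 webapp Deployment):

kubectl set image deployment/webapp nginx=nginx:1.26   # 触发创建新的 ReplicaSet
kubectl rollout status deployment/webapp               # 跟踪滚动更新进度
kubectl get rs -l app=webapp                           # 新旧 ReplicaSet 的副本数此消彼长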

                4️⃣ 滚动更新策略(核心逻辑)

                Deployment 的滚动更新通过 RollingUpdateDeployment 类型实现:

                strategy:
                  type: RollingUpdate
                  rollingUpdate:
                    maxUnavailable: 1
                    maxSurge: 1

                控制循环每次检查:

                • 有多少可用副本?
                • 是否可以再启动新的 Pod(受 maxSurge 限制)?
                • 是否可以终止旧 Pod(受 maxUnavailable 限制)?

                Controller 会在每次 sync 时:

                1. 创建新 ReplicaSet 的一个 Pod;
                2. 等待其 Ready
                3. 再删除旧 ReplicaSet 的一个 Pod;
                4. 重复直到达到目标状态。

                5️⃣ ReplicaSet 与 Deployment 的“拥有关系”

                Deployment 创建 ReplicaSet 时会设置 OwnerReference:

                metadata:
                  ownerReferences:
                  - apiVersion: apps/v1
                    kind: Deployment
                    name: webapp
                    uid: <deployment-uid>

                这样 kube-controller-manager 知道 该 ReplicaSet 属于哪个 Deployment, 而 Kubernetes 的垃圾回收器(GC Controller)会自动删除孤立的 ReplicaSet。
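
可以用 kubectl 验证这种从属关系(示意):

kubectl get rs -l app=webapp \
  -o jsonpath='{range .items[*]}{.metadata.name}{" -> "}{.metadata.ownerReferences[0].kind}{"/"}{.metadata.ownerReferences[0].name}{"\n"}{end}'
# 输出类似:webapp-6d4f8c7b9 -> Deployment/webapp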


                🧮 三、ReplicaSet 的内部机制(下层控制器)

                ReplicaSet Controller 的逻辑相对简单:

                syncReplicaSet(rs):
                    desired := rs.spec.replicas
                    actual := countPodsMatchingSelector(rs.selector)
                    if desired > actual:
                        createPods(desired - actual)
                    else if desired < actual:
                        deletePods(actual - desired)

                也就是说,ReplicaSet 只关注Pod 数量是否符合期望。 它并不关心版本,也不关心策略。


                1️⃣ Pod 的创建

                ReplicaSet Controller 调用 PodTemplateSpec 生成新的 Pod:

                pod := newPodFromTemplate(rs.spec.template)
                pod.OwnerReferences = [rs.UID]

                然后通过 API Server 创建 Pod 对象。


                2️⃣ Pod 的调度与状态同步

                创建后的 Pod 会由 kube-scheduler 调度到某个 Node 上, kubelet 拉起容器后汇报状态,ReplicaSet 通过 Informer 感知到变化并更新 status.availableReplicas


                🧠 四、两者的控制循环(Control Loop)

                可以用伪代码总结整个层次的循环:

                while true:
                  # Deployment Controller Loop
                  for each Deployment:
                    reconcile Deployment → ensure right ReplicaSets exist and scaled properly
                
                  # ReplicaSet Controller Loop
                  for each ReplicaSet:
                    reconcile ReplicaSet → ensure correct number of Pods exist
                
                  # Pod Controller Loop (via kubelet)
                  for each Pod:
                    reconcile Pod → ensure container running

                整个系统靠 Informer + WorkQueue + Reconcile 实现最终一致性。 Kubernetes 的控制循环是幂等的(idempotent)——无论运行多少次,结果都一致。


                🔁 五、事件链(从创建到Pod运行)

阶段 | 执行者 | 操作
用户 | kubectl apply | 提交 Deployment
apiserver | – | 将 Deployment 写入 etcd
controller-manager | Deployment Controller | 创建 ReplicaSet
controller-manager | ReplicaSet Controller | 创建 Pod
scheduler | kube-scheduler | 绑定 Node
kubelet | – | 在节点上拉取镜像并运行容器
controller-manager | Deployment Controller | 更新 ReplicaSet 状态,完成滚动更新

                🧰 六、垃圾回收与历史版本

                • Deployment 默认保留最近 10 个历史 ReplicaSet:

                  revisionHistoryLimit: 10
                • 超过限制的旧 ReplicaSet 会自动被清理;

                • 你可以通过:

                  kubectl rollout history deployment webapp
                  kubectl rollout undo deployment webapp --to-revision=2

                  实现回滚。


                🧩 七、关键点总结

项目 | Deployment | ReplicaSet
职责 | 管理版本与更新策略 | 管理副本数量
是否直接创建 Pod | 否,通过 ReplicaSet | 是
更新策略 | 支持滚动、暂停、回滚 | 不支持
典型控制循环 | 调整 ReplicaSet | 调整 Pod
与 Pod 的关系 | 间接控制 | 直接控制

                💡 八、类比理解

                你可以这样比喻:

                • Deployment = “项目经理” 管理不同版本的 ReplicaSet,控制滚动更新节奏。
                • ReplicaSet = “小组长” 保证自己手下(Pods)的人数正确。
                • Pod = “员工” 实际干活的单位。
                Mar 7, 2024

                Endpoint VS EndpointSlice

                EndpointEndpointSlice 都是 Kubernetes 中用于管理服务后端端点的资源,但 EndpointSlice 是更现代、更高效的解决方案。以下是它们的详细区别:

                一、基本概念对比

                Endpoint(传统方式)

                apiVersion: v1
                kind: Endpoints
                metadata:
                  name: my-service
                subsets:
                  - addresses:
                    - ip: 10.244.1.5
                      targetRef:
                        kind: Pod
                        name: pod-1
                    - ip: 10.244.1.6
                      targetRef:
                        kind: Pod
                        name: pod-2
                    ports:
                    - port: 8080
                      protocol: TCP

                EndpointSlice(现代方式)

                apiVersion: discovery.k8s.io/v1
                kind: EndpointSlice
                metadata:
                  name: my-service-abc123
                  labels:
                    kubernetes.io/service-name: my-service
                addressType: IPv4
                ports:
                  - name: http
                    protocol: TCP
                    port: 8080
                endpoints:
                  - addresses:
                    - "10.244.1.5"
                    conditions:
                      ready: true
                    targetRef:
                      kind: Pod
                      name: pod-1
                    zone: us-west-2a
                  - addresses:
                    - "10.244.1.6"
                    conditions:
                      ready: true
                    targetRef:
                      kind: Pod
                      name: pod-2
                    zone: us-west-2b

                二、核心架构差异

                1. 数据模型设计

特性 | Endpoint | EndpointSlice
存储结构 | 单个大对象 | 多个分片对象
规模限制 | 所有端点在一个对象中 | 自动分片(默认最多 100 个端点/片)
更新粒度 | 全量更新 | 增量更新

                2. 性能影响对比

                # Endpoint 的问题:单个大对象
                # 当有 1000 个 Pod 时:
                kubectl get endpoints my-service -o yaml
                # 返回一个包含 1000 个地址的庞大 YAML
                
                # EndpointSlice 的解决方案:自动分片
                # 当有 1000 个 Pod 时:
                kubectl get endpointslices -l kubernetes.io/service-name=my-service
                # 返回 10 个 EndpointSlice,每个包含 100 个端点

                三、详细功能区别

                1. 地址类型支持

                Endpoint

                • 仅支持 IP 地址
                • 有限的元数据

                EndpointSlice

                addressType: IPv4  # 支持 IPv4, IPv6, FQDN
                endpoints:
                  - addresses:
                    - "10.244.1.5"
                    conditions:
                      ready: true
                      serving: true
                      terminating: false
                    hostname: pod-1.subdomain  # 支持主机名
                    nodeName: worker-1
                    zone: us-west-2a
                    hints:
                      forZones:
                      - name: us-west-2a

                2. 拓扑感知和区域信息

                EndpointSlice 独有的拓扑功能

                endpoints:
                  - addresses:
                    - "10.244.1.5"
                    conditions:
                      ready: true
                    # 拓扑信息
                    nodeName: node-1
                    zone: us-west-2a
                    # 拓扑提示,用于优化路由
                    hints:
                      forZones:
                      - name: us-west-2a

                3. 端口定义方式

                Endpoint

                subsets:
                  - ports:
                    - name: http
                      port: 8080
                      protocol: TCP
                    - name: metrics
                      port: 9090
                      protocol: TCP

                EndpointSlice

                ports:
                  - name: http
                    protocol: TCP
                    port: 8080
                    appProtocol: http  # 支持应用层协议标识
                  - name: metrics
                    protocol: TCP  
                    port: 9090
                    appProtocol: https

                四、实际使用场景

                1. 大规模服务(500+ Pods)

                Endpoint 的问题

                # 更新延迟:单个大对象的序列化/反序列化
                # 网络开销:每次更新传输整个端点列表
                # 内存压力:客户端需要缓存整个端点列表

                EndpointSlice 的优势

                # 增量更新:只更新变化的切片
                # 并行处理:多个切片可以并行处理
                # 内存友好:客户端只需关注相关切片

                2. 多区域部署

                EndpointSlice 的拓扑感知

                apiVersion: discovery.k8s.io/v1
                kind: EndpointSlice
                metadata:
                  name: multi-zone-service-1
                  labels:
                    kubernetes.io/service-name: multi-zone-service
                addressType: IPv4
                ports:
                  - name: http
                    protocol: TCP
                    port: 8080
                endpoints:
                  - addresses:
                    - "10.244.1.10"
                    conditions:
                      ready: true
                    zone: zone-a
                    nodeName: node-zone-a-1
                ---
                apiVersion: discovery.k8s.io/v1
                kind: EndpointSlice  
                metadata:
                  name: multi-zone-service-2
                  labels:
                    kubernetes.io/service-name: multi-zone-service
                addressType: IPv4
                ports:
                  - name: http
                    protocol: TCP
                    port: 8080
                endpoints:
                  - addresses:
                    - "10.244.2.10"
                    conditions:
                      ready: true
                    zone: zone-b
                    nodeName: node-zone-b-1

                3. 金丝雀发布和流量管理

                EndpointSlice 提供更细粒度的控制

                # 金丝雀版本的 EndpointSlice
                apiVersion: discovery.k8s.io/v1
                kind: EndpointSlice
                metadata:
                  name: canary-service-version2
                  labels:
                    kubernetes.io/service-name: my-service
                    version: "v2"  # 自定义标签用于选择
                addressType: IPv4
                ports:
                  - name: http
                    protocol: TCP
                    port: 8080
                endpoints:
                  - addresses:
                    - "10.244.3.10"
                    conditions:
                      ready: true

                五、运维和管理差异

                1. 监控方式

                Endpoint 监控

                # 检查单个 Endpoint 对象
                kubectl get endpoints my-service
                kubectl describe endpoints my-service
                
                # 监控端点数量
                kubectl get endpoints my-service -o jsonpath='{.subsets[0].addresses[*].ip}' | wc -w

                EndpointSlice 监控

                # 检查所有相关切片
                kubectl get endpointslices -l kubernetes.io/service-name=my-service
                
                # 查看切片详细信息
                kubectl describe endpointslices my-service-abc123
                
                # 统计总端点数量
kubectl get endpointslices -l kubernetes.io/service-name=my-service -o jsonpath='{.items[*].endpoints[*].addresses[*]}' | wc -w

                2. 故障排查

                Endpoint 排查

                # 检查端点状态
kubectl get endpoints my-service -o yaml | grep -A 5 -B 5 "notReadyAddresses"
                
                # 检查控制器日志
                kubectl logs -n kube-system kube-controller-manager-xxx | grep endpoints

                EndpointSlice 排查

                # 检查切片状态
                kubectl get endpointslices --all-namespaces
                
                # 检查端点就绪状态
                kubectl get endpointslices -l kubernetes.io/service-name=my-service -o jsonpath='{range .items[*]}{.endpoints[*].conditions.ready}{end}'
                
                # 检查 EndpointSlice Controller
                kubectl logs -n kube-system deployment/endpointslice-controller

                六、迁移和兼容性

                1. 自动迁移

                Kubernetes 1.21+ 默认同时维护两者:

                # 启用 EndpointSlice 特性门控
                kube-apiserver --feature-gates=EndpointSlice=true
                kube-controller-manager --feature-gates=EndpointSlice=true
                kube-proxy --feature-gates=EndpointSlice=true

                2. 检查集群状态

                # 检查 EndpointSlice 是否启用
                kubectl get apiservices | grep discovery.k8s.io
                
                # 检查特性门控
                kube-apiserver -h | grep EndpointSlice
                
                # 验证控制器运行状态
                kubectl get pods -n kube-system -l k8s-app=endpointslice-controller

                七、性能基准对比

场景 | Endpoint | EndpointSlice | 改进
1000 个 Pod 更新 | 2-3 秒 | 200-300ms | 10 倍
网络带宽使用 | 高(全量传输) | 低(增量传输) | 减少 60-80%
内存使用 | 高(大对象缓存) | 低(分片缓存) | 减少 50-70%
CPU 使用 | 高(序列化成本) | 低(并行处理) | 减少 40-60%

                八、最佳实践

                1. 新集群配置

                # kube-apiserver 配置
                apiVersion: v1
                kind: Pod
                metadata:
                  name: kube-apiserver
                  namespace: kube-system
                spec:
                  containers:
                  - command:
                    - kube-apiserver
                    - --feature-gates=EndpointSlice=true
    # 注:--endpointslice-updates-batch-period=1s(批量更新周期)是 kube-controller-manager 的参数,
    # 应加在 kube-controller-manager 的启动命令中,而不是 kube-apiserver

                2. 应用程序适配

                // 使用 EndpointSlice 感知的客户端
                import (
                    "k8s.io/client-go/kubernetes"
                    "k8s.io/client-go/tools/cache"
                    listers "k8s.io/client-go/listers/discovery/v1"
                )
                
                // 监听 EndpointSlice 变化
                endpointSliceInformer.Informer().AddEventHandler(cache.ResourceEventHandlerFuncs{
                    AddFunc: func(obj interface{}) {
                        endpointSlice := obj.(*discoveryv1.EndpointSlice)
                        if endpointSlice.Labels["kubernetes.io/service-name"] == "my-service" {
                            updateLoadBalancerConfig(endpointSlice)
                        }
                    },
                })

                总结

维度 | Endpoint | EndpointSlice
设计年代 | Kubernetes 早期 | Kubernetes 1.16+
扩展性 | 差(单对象限制) | 优秀(自动分片)
性能 | 一般(全量更新) | 优秀(增量更新)
功能特性 | 基础功能 | 拓扑感知、多协议等
推荐使用 | 传统集群/兼容性 | 新集群/大规模部署
未来方向 | 逐步淘汰 | 标准解决方案

                简单建议

                • 新集群:直接使用 EndpointSlice
                • 现有集群:逐步迁移到 EndpointSlice
                • 大规模服务:必须使用 EndpointSlice
                • 小规模测试:两者均可,但建议使用 EndpointSlice 以获得更好的未来兼容性

EndpointSlice 代表了 Kubernetes 服务发现架构的现代化方向,特别是在大规模和高动态环境中提供了显著的性能与可扩展性改进。

                Mar 7, 2024

                ETCD如何调优

                好的,Kubernetes 集群的稳定性和性能极大地依赖于其数据存储组件 etcd。对 etcd 进行调优是保障生产环境 K8s 集群高效、稳定运行的关键步骤。

                下面我将从核心原则、性能调优参数、操作系统调优、Kubernetes 相关配置、监控与维护等多个维度,详细讲解如何对 K8s 上的 etcd 进行调优。

                一、核心原则与前提

                1. 硬件是基础:在考虑软件参数调优前,必须确保硬件资源充足且高性能。

                  • CPU:需要足够的计算能力,特别是在高负载下进行压缩、序列化等操作时。
                  • 内存:etcd 的内存消耗与总键值对数量和大小正相关。足够的内存是保证性能的关键。建议至少 8GB,生产环境推荐 16GB 或以上。
                  • 磁盘这是最重要的因素必须使用高性能的 SSD(NVMe SSD 最佳)。etcd 的每次写入都需持久化到磁盘,磁盘的写入延迟(Write Latency)直接决定了 etcd 的写入性能。避免使用网络存储(如 NFS)。
                  • 网络:低延迟、高带宽的网络对于 etcd 节点间同步至关重要。如果 etcd 以集群模式运行,所有节点应位于同一个数据中心或低延迟的可用区。
                2. 备份!备份!备份!:在进行任何调优或配置更改之前,务必对 etcd 数据进行完整备份。误操作可能导致数据损坏或集群不可用。
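
一个最小的快照备份示意(证书路径按 kubeadm 默认值假设,请按实际环境调整):

ETCDCTL_API=3 etcdctl --endpoints=https://127.0.0.1:2379 \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/server.crt \
  --key=/etc/kubernetes/pki/etcd/server.key \
  snapshot save /var/backups/etcd-snapshot.db
# 校验快照完整性
ETCDCTL_API=3 etcdctl snapshot status /var/backups/etcd-snapshot.db --write-out=table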

                二、etcd 命令行参数调优

                etcd 主要通过其启动时的命令行参数进行调优。如果你使用 kubeadm 部署,这些参数通常配置在 /etc/kubernetes/manifests/etcd.yaml 静态 Pod 清单中。

                1. 存储配额与压缩

                为了防止磁盘耗尽,etcd 设有存储配额。一旦超过配额,它将进入维护模式,只能读不能写,并触发告警。

                • --quota-backend-bytes:设置 etcd 数据库的后端存储大小上限。默认是 2GB。对于生产环境,建议设置为 8GB 到 16GB(例如 8589934592 表示 8GB)。设置过大会影响备份和恢复时间。
                • --auto-compaction-mode--auto-compaction-retention:etcd 会累积历史版本,需要定期压缩来回收空间。
                  • --auto-compaction-mode:通常设置为 periodic(按时间周期)。
• --auto-compaction-retention:设置保留多长时间的历史数据。例如 "1h" 表示保留 1 小时,"10m" 表示保留 10 分钟。对于变更频繁的集群(例如运行大量 CronJob 的集群),建议设置为较短的周期,如 "10m" 或 "30m"。

                示例配置片段(在 etcd.yaml 中):

                spec:
                  containers:
                  - command:
                    - etcd
                    ...
                    - --quota-backend-bytes=8589934592    # 8GB
                    - --auto-compaction-mode=periodic
                    - --auto-compaction-retention=10m     # 每10分钟压缩一次历史版本
                    ...

                2. 心跳与选举超时

                这些参数影响集群的领导者选举和节点间的心跳检测,对网络延迟敏感。

                • --heartbeat-interval:领导者向追随者发送心跳的间隔。建议设置为 100300 毫秒之间。网络环境好可以设小(如 100),不稳定则设大(如 300)。
                • --election-timeout:追随者等待多久没收到心跳后开始新一轮选举。此值必须是心跳间隔的 5-10 倍。建议设置在 10003000 毫秒之间。

                规则:heartbeat-interval * 10 >= election-timeout

                示例配置:

                    - --heartbeat-interval=200
                    - --election-timeout=2000

                3. 快照

                etcd 通过快照来持久化其状态。

                • --snapshot-count:指定在制作一次快照前,最多提交多少次事务。默认值是 100,000。在内存充足且磁盘 IO 性能极高的环境下,可以适当调低此值(如 50000)以在崩溃后更快恢复,但这会略微增加磁盘 IO 负担。通常使用默认值即可。

                三、操作系统与运行时调优

                1. 磁盘 I/O 调度器

                对于 SSD,将 I/O 调度器设置为 nonenoop 通常能获得更好的性能。

                # 查看当前调度器
                cat /sys/block/[你的磁盘,如 sda]/queue/scheduler
                
                # 临时修改
                echo 'noop' > /sys/block/sda/queue/scheduler
                
                # 永久修改,在 /etc/default/grub 中添加或修改
                GRUB_CMDLINE_LINUX_DEFAULT="... elevator=noop"
                
                # 然后更新 grub 并重启
                sudo update-grub

                2. 文件系统

                使用 XFSext4 文件系统。它们对 etcd 的工作负载有很好的支持。确保使用 ssd 挂载选项。

                /etc/fstab 中为 etcd 数据目录所在分区添加 ssdnoatime 选项:

                UUID=... /var/lib/etcd ext4 defaults,ssd,noatime 0 0

                3. 提高文件描述符和进程数限制

                etcd 可能会处理大量并发连接。

                # 在 /etc/security/limits.conf 中添加
                * soft nofile 65536
                * hard nofile 65536
                * soft nproc 65536
                * hard nproc 65536

                4. 网络参数调优

                调整内核网络参数,特别是在高负载环境下。

                /etc/sysctl.conf 中添加:

                net.core.somaxconn = 1024
                net.ipv4.tcp_keepalive_time = 600
                net.ipv4.tcp_keepalive_intvl = 60
                net.ipv4.tcp_keepalive_probes = 10

                执行 sysctl -p 使其生效。

                四、Kubernetes 相关调优

                1. 资源请求和限制

                etcd.yaml 中为 etcd 容器设置合适的资源限制,防止其因资源竞争而饿死。

                    resources:
                      requests:
                        memory: "1Gi"
                        cpu: "500m"
                      limits:
                        memory: "8Gi"  # 根据你的 --quota-backend-bytes 设置,确保内存足够
                        cpu: "2"

                2. API Server 的 --etcd-compaction-interval

                在 kube-apiserver 的启动参数中,这个参数控制它请求 etcd 进行压缩的周期。建议与 etcd 的 --auto-compaction-retention 保持一致或略大。

                五、监控与维护

                1. 监控关键指标

                使用 Prometheus 等工具监控 etcd,重点关注以下指标:

• etcd_disk_wal_fsync_duration_seconds:WAL 日志同步到磁盘的延迟。这是最重要的指标,P99 值应低于 10ms。
• etcd_disk_backend_commit_duration_seconds:后端数据库提交的延迟,P99 值应低于 25ms。
                • etcd_server_leader_changes_seen_total:领导者变更次数。频繁变更表明集群不稳定。
                • etcd_server_has_leader:当前节点是否认为有领导者(1 为是,0 为否)。
                • etcd_mvcc_db_total_size_in_bytes:当前数据库大小,用于判断是否接近存储配额。

                2. 定期进行碎片整理

                即使开启了自动压缩,etcd 的数据库文件内部仍会产生碎片。当 etcd_mvcc_db_total_size_in_bytes 接近 --quota-backend-bytes 时,即使实际数据量没那么多,也需要在线进行碎片整理。

                # 在任一 etcd 节点上执行
                ETCDCTL_API=3 etcdctl --endpoints=https://127.0.0.1:2379 \
                  --cacert=/path/to/ca.crt \
                  --cert=/path/to/etcd-client.crt \
                  --key=/path/to/etcd-client.key \
                  defrag

                注意:执行 defrag 会阻塞所有请求,应在业务低峰期进行,并逐个对集群成员执行。

                调优总结与检查清单

                1. 硬件过关:确认使用 SSD,内存充足。
                2. 设置存储配额和自动压缩--quota-backend-bytes=8G, --auto-compaction-retention=10m
                3. 调整心跳与选举超时--heartbeat-interval=200, --election-timeout=2000
                4. 操作系统优化:I/O 调度器、文件系统挂载选项、文件描述符限制。
                5. 配置合理的资源限制:防止 etcd 容器因资源不足被 Kill。
                6. 开启并关注监控:特别是磁盘同步延迟和领导者变更。
                7. 定期维护:根据监控指标,在需要时进行碎片整理。

                对于大多数场景,调整存储配额与压缩心跳与选举超时以及确保高性能磁盘,就能解决绝大部分性能问题。调优是一个持续的过程,需要结合监控数据不断调整。

                Mar 7, 2024

                Flannel VS Calico

                Calico 和 Flannel 是 Kubernetes 中最著名和最常见的两种网络插件(CNI),但它们的设计哲学、实现方式和能力有显著区别。

                简单来说:

                • Flannel 追求的是简单和易用,提供足够的基础网络功能。
                • Calico 追求的是性能和功能,提供强大的网络策略和高性能网络。

                下面我们从多个维度进行详细对比。


                核心对比一览表

特性 | Flannel | Calico
核心设计哲学 | 简单、最小化 | 高性能、功能丰富
网络模型 | Overlay 网络 | 纯三层路由(可选 Overlay)
数据平面 | VXLAN(推荐)、Host-gw、UDP | BGP(推荐)、VXLAN、Windows
性能 | 较好(VXLAN 有封装开销) | 极高(BGP 模式下无封装开销)
网络策略 | 不支持(需安装 Cilium 等) | 原生支持(强大的网络策略)
安全性 | 基础 | 高级(基于标签的微隔离)
配置与维护 | 非常简单,几乎无需配置 | 相对复杂,功能多配置项也多
适用场景 | 学习、测试、中小型集群,需求简单 | 生产环境、大型集群、对性能和安全要求高

                深入剖析

                1. 网络模型与工作原理

                这是最根本的区别。

                • Flannel (Overlay Network)

                  • 工作原理:它在底层物理网络之上再构建一个虚拟的“覆盖网络”。当数据包从一个节点的Pod发送到另一个节点的Pod时,Flannel会将它封装在一个新的网络包中(如VXLAN)。
                  • 类比:就像在一封普通信件(Pod的原始数据包)外面套了一个标准快递袋(VXLAN封装),快递系统(底层网络)只关心快递袋上的地址(节点IP),不关心里面的内容。到达目标节点后,再拆开快递袋,取出里面的信。
                  • 优势:对底层网络要求低,只要节点之间IP能通即可,兼容性好。
• 劣势:封装和解封装有额外的 CPU 开销,并且会增加数据包的大小(overhead),导致性能略有下降。
                • Calico (Pure Layer 3)

                  • 工作原理(BGP模式):它不使用封装,而是使用BGP路由协议。每个K8s节点都像一个路由器,它通过BGP协议向集群中的其他节点宣告:“发往这些Pod IP的流量,请送到我这里来”。
                  • 类比:就像整个数据中心是一个大的邮政系统,每个邮局(节点)都知道去往任何地址(Pod IP)的最短路径,信件(数据包)可以直接投递,无需额外包装。
                  • 优势性能高,无封装开销,延迟低,吞吐量高。
                  • 劣势:要求底层网络必须支持BGP或者支持主机路由(某些云平台或网络设备可能需要特定配置)。

                注意:Calico也支持VXLAN模式(通常用于网络策略要求BGP但底层网络不支持的场景),但其最佳性能是在BGP模式下实现的。

                2. 网络策略

                这是两者功能性的一个巨大分水岭。

• Flannel:本身不提供任何网络策略能力。它只负责打通网络,让所有 Pod 默认可以相互通信。如果你需要实现 Pod 之间的访问控制(微隔离),你必须额外安装一个网络策略控制器,如 Cilium 或 Calico 本身(可以只使用其策略部分,与 Flannel 叠加使用)。

• Calico:原生支持强大的 Kubernetes NetworkPolicy。你可以定义基于 Pod 标签、命名空间、端口、协议甚至 DNS 名称的精细规则,来控制 Pod 的入站和出站流量。这对于实现“零信任”安全模型至关重要(下面给出一个最小示例)。
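
下面是一个最小的 NetworkPolicy 示意(标签、端口均为假设),只允许带 app=frontend 标签的 Pod 访问 app=api 的 Pod:

cat <<'EOF' | kubectl apply -f -
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-frontend-to-api
spec:
  podSelector:
    matchLabels:
      app: api            # 策略作用于 api Pod
  policyTypes:
  - Ingress
  ingress:
  - from:
    - podSelector:
        matchLabels:
          app: frontend    # 只放行 frontend 的入站流量
    ports:
    - protocol: TCP
      port: 8080
EOF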

                3. 性能

                • Calico (BGP模式):由于其纯三层的转发机制,无需封装,数据包是原生IP包,其延迟更低,吞吐量更高,CPU消耗也更少。
                • Flannel (VXLAN模式):由于存在VXLAN的封装头(通常50字节 overhead),最大传输单元会变小,封装/解封装操作也需要CPU参与,性能相比Calico BGP模式要低一些。但其 Host-gw 后端模式性能很好,前提是节点在同一个二层网络。

                4. 生态系统与高级功能

                • Calico:功能非常丰富,远不止基础网络。
                  • 网络策略:如上所述,非常强大。
                  • IPAM:灵活的IP地址管理。
                  • 服务网格集成:与Istio有深度集成,可以实施全局的服务到服务策略。
                  • Windows支持:对Windows节点有良好的支持。
                  • 网络诊断工具:提供了 calicoctl 等强大的运维工具。
                • Flannel:功能相对单一,就是做好网络连通性。它“小而美”,但缺乏高级功能。

                如何选择?

                选择 Flannel 的情况:

                • 新手用户:想要快速搭建一个K8s集群,不想纠结于复杂的网络配置。
                • 测试或开发环境:需求简单,只需要Pod能通。
                • 中小型集群:对性能和高级网络策略没有硬性要求。
                • 底层网络受限:无法配置BGP或主机路由的环境(例如某些公有云基础网络)。

                选择 Calico 的情况:

                • 生产环境:对稳定性和性能有高要求。
                • 大型集群:需要高效的路由和可扩展性。
                • 安全要求高:需要实现Pod之间的网络隔离(微隔离)。
                • 对网络性能极度敏感:例如AI/ML训练、高频交易等场景。
                • 底层网络可控:例如在自建数据中心或云上支持BGP的环境。

                总结

对比项 | Flannel | Calico
核心价值 | 简单可靠 | 功能强大
好比买车 | 丰田卡罗拉:皮实、省心、够用 | 宝马/奥迪:性能强劲、功能齐全、操控精准
一句话总结 | “让我快速把网络打通” | “我要一个高性能、高安全性的生产级网络”

                在现代Kubernetes部署中,尤其是生产环境,Calico因其卓越的性能和原生的安全能力,已经成为更主流和推荐的选择。而Flannel则在那些“只要能通就行”的简单场景中,依然保持着它的价值。

                Mar 7, 2024

                Headless Service VS ClusterIP

                Headless Service vs ClusterIP 详解

                这是 Kubernetes 中两种常见的 Service 类型,它们在服务发现和负载均衡方面有本质区别。


                🎯 核心区别总结

维度 | ClusterIP | Headless Service
ClusterIP 值 | 有固定的虚拟 IP | None(无 ClusterIP)
DNS 解析 | 返回 Service IP | 直接返回 Pod IP 列表
负载均衡 | ✅ kube-proxy 自动负载均衡 | ❌ 客户端自行选择 Pod
适用场景 | 无状态服务 | 有状态服务、服务发现
典型用例 | Web 应用、API 服务 | 数据库集群、Kafka、Zookeeper

                📋 ClusterIP Service (默认类型)

                定义

                ClusterIP 是 Kubernetes 默认的 Service 类型,会分配一个虚拟 IP(Cluster IP),作为访问后端 Pod 的统一入口。

                YAML 示例

                apiVersion: v1
                kind: Service
                metadata:
                  name: my-web-service
                spec:
                  type: ClusterIP  # 默认类型,可以省略
                  selector:
                    app: web
                  ports:
                  - protocol: TCP
                    port: 80        # Service 端口
                    targetPort: 8080  # Pod 端口

                工作原理

                ┌─────────────────────────────────────────┐
                │          ClusterIP Service              │
                │     (虚拟 IP: 10.96.100.50)             │
                └────────────┬────────────────────────────┘
                             │ kube-proxy 负载均衡
                             │
                     ┌───────┴───────┬──────────┐
                     ▼               ▼          ▼
                  Pod-1          Pod-2      Pod-3
                  10.244.1.5     10.244.2.8  10.244.3.12
                  (app=web)      (app=web)   (app=web)

                DNS 解析行为

                # 在集群内部查询 DNS
                nslookup my-web-service.default.svc.cluster.local
                
                # 输出:
                # Name:    my-web-service.default.svc.cluster.local
                # Address: 10.96.100.50  ← 返回 Service 的虚拟 IP
                
                # 客户端访问这个 IP
                curl http://my-web-service:80
                
                # 请求会被 kube-proxy 自动转发到后端 Pod
                # 默认使用 iptables 或 IPVS 做负载均衡

                特点

                统一入口:客户端只需知道 Service IP,不关心后端 Pod
                自动负载均衡:kube-proxy 自动在多个 Pod 间分发流量
                服务发现简单:通过 DNS 获取稳定的 Service IP
                屏蔽 Pod 变化:Pod 重启或扩缩容,Service IP 不变
                会话保持:可配置 sessionAffinity: ClientIP

                负载均衡方式

                apiVersion: v1
                kind: Service
                metadata:
                  name: my-service
                spec:
                  type: ClusterIP
                  sessionAffinity: ClientIP  # 可选:会话保持(同一客户端固定到同一 Pod)
                  sessionAffinityConfig:
                    clientIP:
                      timeoutSeconds: 10800   # 会话超时时间
                  selector:
                    app: web
                  ports:
                  - port: 80
                    targetPort: 8080

                🔍 Headless Service (无头服务)

                定义

                Headless Service 是不分配 ClusterIP 的特殊 Service,通过设置 clusterIP: None 创建。

                YAML 示例

                apiVersion: v1
                kind: Service
                metadata:
                  name: my-headless-service
                spec:
                  clusterIP: None  # 🔑 关键:设置为 None
                  selector:
                    app: database
                  ports:
                  - protocol: TCP
                    port: 3306
                    targetPort: 3306

                工作原理

                ┌─────────────────────────────────────────┐
                │       Headless Service (无 ClusterIP)   │
                │              DNS 直接返回               │
                └────────────┬────────────────────────────┘
                             │ 没有负载均衡
                             │ DNS 返回所有 Pod IP
                             │
                     ┌───────┴───────┬──────────┐
                     ▼               ▼          ▼
                  Pod-1          Pod-2      Pod-3
                  10.244.1.5     10.244.2.8  10.244.3.12
                  (app=database) (app=database) (app=database)

                DNS 解析行为

                # 在集群内部查询 DNS
                nslookup my-headless-service.default.svc.cluster.local
                
                # 输出:
                # Name:    my-headless-service.default.svc.cluster.local
                # Address: 10.244.1.5   ← Pod-1 IP
                # Address: 10.244.2.8   ← Pod-2 IP
                # Address: 10.244.3.12  ← Pod-3 IP
                
                # 客户端获得所有 Pod IP,自己选择连接哪个

                特点

                服务发现:客户端可以获取所有后端 Pod 的 IP
                自主选择:客户端自己决定连接哪个 Pod(负载均衡逻辑由客户端实现)
                稳定 DNS:每个 Pod 有独立的 DNS 记录
                适合有状态服务:数据库主从、集群成员发现
                无自动负载均衡:需要客户端或应用层实现

                与 StatefulSet 结合(最常见用法)

                # StatefulSet + Headless Service
                apiVersion: v1
                kind: Service
                metadata:
                  name: mysql-headless
                spec:
                  clusterIP: None
                  selector:
                    app: mysql
                  ports:
                  - port: 3306
                    name: mysql
                ---
                apiVersion: apps/v1
                kind: StatefulSet
                metadata:
                  name: mysql
                spec:
                  serviceName: mysql-headless  # 🔑 关联 Headless Service
                  replicas: 3
                  selector:
                    matchLabels:
                      app: mysql
                  template:
                    metadata:
                      labels:
                        app: mysql
                    spec:
                      containers:
                      - name: mysql
                        image: mysql:8.0
                        ports:
                        - containerPort: 3306

                每个 Pod 的独立 DNS 记录

                # StatefulSet 的 Pod 命名规则:
                # <statefulset-name>-<ordinal>.<service-name>.<namespace>.svc.cluster.local
                
                # 示例:
                mysql-0.mysql-headless.default.svc.cluster.local → 10.244.1.5
                mysql-1.mysql-headless.default.svc.cluster.local → 10.244.2.8
                mysql-2.mysql-headless.default.svc.cluster.local → 10.244.3.12
                
                # 可以直接访问特定 Pod
                mysql -h mysql-0.mysql-headless.default.svc.cluster.local -u root -p
                
                # 查询所有 Pod
                nslookup mysql-headless.default.svc.cluster.local

                🔄 实际对比演示

                场景 1:Web 应用(使用 ClusterIP)

                # ClusterIP Service
                apiVersion: v1
                kind: Service
                metadata:
                  name: web-service
                spec:
                  type: ClusterIP
                  selector:
                    app: nginx
                  ports:
                  - port: 80
                    targetPort: 80
                ---
                # Deployment
                apiVersion: apps/v1
                kind: Deployment
                metadata:
                  name: nginx
                spec:
                  replicas: 3
                  selector:
                    matchLabels:
                      app: nginx
                  template:
                    metadata:
                      labels:
                        app: nginx
                    spec:
                      containers:
                      - name: nginx
                        image: nginx:latest
                # 测试访问
                kubectl run test --rm -it --image=busybox -- /bin/sh
                
                # 在 Pod 内执行
                nslookup web-service
                # 输出:只有一个 Service IP
                
                wget -q -O- http://web-service
                # 请求会被自动负载均衡到 3 个 nginx Pod

                场景 2:MySQL 主从(使用 Headless Service)

                # Headless Service
                apiVersion: v1
                kind: Service
                metadata:
                  name: mysql
                spec:
                  clusterIP: None
                  selector:
                    app: mysql
                  ports:
                  - port: 3306
                ---
                # StatefulSet
                apiVersion: apps/v1
                kind: StatefulSet
                metadata:
                  name: mysql
                spec:
                  serviceName: mysql
                  replicas: 3
                  selector:
                    matchLabels:
                      app: mysql
                  template:
                    metadata:
                      labels:
                        app: mysql
                    spec:
                      containers:
                      - name: mysql
                        image: mysql:8.0
                        env:
                        - name: MYSQL_ROOT_PASSWORD
                          value: "password"
                # 测试服务发现
                kubectl run test --rm -it --image=busybox -- /bin/sh
                
                # 在 Pod 内执行
                nslookup mysql
                # 输出:返回 3 个 Pod IP
                
                # 可以连接到特定的 MySQL 实例(如主节点)
                mysql -h mysql-0.mysql.default.svc.cluster.local -u root -p
                
                # 也可以连接到从节点
                mysql -h mysql-1.mysql.default.svc.cluster.local -u root -p
                mysql -h mysql-2.mysql.default.svc.cluster.local -u root -p

                📊 详细对比

                1. DNS 解析差异

                # ClusterIP Service
                $ nslookup web-service
                Server:    10.96.0.10
                Address:   10.96.0.10:53
                
                Name:      web-service.default.svc.cluster.local
                Address:   10.96.100.50  ← Service 虚拟 IP
                
                # Headless Service
                $ nslookup mysql-headless
                Server:    10.96.0.10
                Address:   10.96.0.10:53
                
                Name:      mysql-headless.default.svc.cluster.local
                Address:   10.244.1.5  ← Pod-1 IP
                Address:   10.244.2.8  ← Pod-2 IP
                Address:   10.244.3.12 ← Pod-3 IP

                2. 流量路径差异

                ClusterIP 流量路径:
                Client → Service IP (10.96.100.50)
                       → kube-proxy (iptables/IPVS)
                       → 随机选择一个 Pod
                
                Headless 流量路径:
                Client → DNS 查询
                       → 获取所有 Pod IP
                       → 客户端自己选择 Pod
                       → 直接连接 Pod IP

                3. 使用场景对比

场景 | ClusterIP | Headless
无状态应用 | ✅ 推荐 | ❌ 不需要
有状态应用 | ❌ 不适合 | ✅ 推荐
数据库主从 | ❌ 无法区分主从 | ✅ 可以指定连接主节点
集群成员发现 | ❌ 无法获取成员列表 | ✅ 可以获取所有成员
需要负载均衡 | ✅ 自动负载均衡 | ❌ 需要客户端实现
客户端连接池 | ⚠️ 只能连接到 Service IP | ✅ 可以为每个 Pod 建立连接

                🎯 典型应用场景

                ClusterIP Service 适用场景

                1. 无状态 Web 应用

                apiVersion: v1
                kind: Service
                metadata:
                  name: frontend
                spec:
                  type: ClusterIP
                  selector:
                    app: frontend
                  ports:
                  - port: 80
                    targetPort: 3000

                2. RESTful API 服务

                apiVersion: v1
                kind: Service
                metadata:
                  name: api-service
                spec:
                  type: ClusterIP
                  selector:
                    app: api
                  ports:
                  - port: 8080

                3. 微服务之间的调用

                # Service A 调用 Service B
                apiVersion: v1
                kind: Service
                metadata:
                  name: service-b
                spec:
                  type: ClusterIP
                  selector:
                    app: service-b
                  ports:
                  - port: 9090

                Headless Service 适用场景

                1. MySQL 主从复制

                apiVersion: v1
                kind: Service
                metadata:
                  name: mysql
                spec:
                  clusterIP: None
                  selector:
                    app: mysql
                  ports:
                  - port: 3306
                ---
                # 应用连接时:
                # 写操作 → mysql-0.mysql (主节点)
                # 读操作 → mysql-1.mysql, mysql-2.mysql (从节点)

                2. Kafka 集群

                apiVersion: v1
                kind: Service
                metadata:
                  name: kafka
                spec:
                  clusterIP: None
                  selector:
                    app: kafka
                  ports:
                  - port: 9092
                ---
                # Kafka 客户端可以发现所有 broker:
                # kafka-0.kafka:9092
                # kafka-1.kafka:9092
                # kafka-2.kafka:9092

                3. Elasticsearch 集群

                apiVersion: v1
                kind: Service
                metadata:
                  name: elasticsearch
                spec:
                  clusterIP: None
                  selector:
                    app: elasticsearch
                  ports:
                  - port: 9200
                    name: http
                  - port: 9300
                    name: transport
                ---
                # 集群内部节点通过 DNS 发现彼此:
                # elasticsearch-0.elasticsearch
                # elasticsearch-1.elasticsearch
                # elasticsearch-2.elasticsearch

                4. Redis 集群模式

                apiVersion: v1
                kind: Service
                metadata:
                  name: redis-cluster
                spec:
                  clusterIP: None
                  selector:
                    app: redis
                  ports:
                  - port: 6379
                    name: client
                  - port: 16379
                    name: gossip
                ---
                # Redis 客户端获取所有节点进行 cluster slots 查询

                🔧 混合使用:两种 Service 同时存在

                对于有状态服务,常见做法是同时创建两个 Service:

                # 1. Headless Service:用于 StatefulSet 和 Pod 间通信
                apiVersion: v1
                kind: Service
                metadata:
                  name: mysql-headless
                spec:
                  clusterIP: None
                  selector:
                    app: mysql
                  ports:
                  - port: 3306
                ---
                # 2. ClusterIP Service:用于客户端负载均衡访问(只读副本)
                apiVersion: v1
                kind: Service
                metadata:
                  name: mysql-read
                spec:
                  type: ClusterIP
                  selector:
                    app: mysql
                    role: replica  # 只选择从节点
                  ports:
                  - port: 3306
                ---
                # StatefulSet
                apiVersion: apps/v1
                kind: StatefulSet
                metadata:
                  name: mysql
                spec:
                  serviceName: mysql-headless  # 使用 Headless Service
                  replicas: 3
                  # ...

                使用方式:

                # 写操作:直接连接主节点
                mysql -h mysql-0.mysql-headless -u root -p
                
                # 读操作:通过 ClusterIP 自动负载均衡到所有从节点
                mysql -h mysql-read -u root -p

                🛠️ 常见问题

                Q1: 如何选择使用哪种 Service?

                决策流程:

                应用是无状态的? 
                  ├─ 是 → 使用 ClusterIP
                  └─ 否 → 继续
                
                需要客户端感知所有 Pod?
                  ├─ 是 → 使用 Headless Service
                  └─ 否 → 继续
                
                需要区分不同 Pod(如主从)?
                  ├─ 是 → 使用 Headless Service + StatefulSet
                  └─ 否 → 使用 ClusterIP

                Q2: Headless Service 没有负载均衡怎么办?

                方案:

                1. 客户端负载均衡:应用层实现(如 Kafka 客户端)
                2. DNS 轮询:部分 DNS 客户端会自动轮询
                3. 混合方案:同时创建 ClusterIP Service 用于负载均衡

                Q3: 如何测试 Headless Service?

                # 创建测试 Pod
                kubectl run -it --rm debug --image=busybox --restart=Never -- sh
                
                # 测试 DNS 解析
                nslookup mysql-headless.default.svc.cluster.local
                
                # 测试连接特定 Pod
                wget -O- http://mysql-0.mysql-headless:3306
                
                # 测试所有 Pod
                for i in 0 1 2; do
                  echo "Testing mysql-$i"
                  wget -O- http://mysql-$i.mysql-headless:3306
                done

                Q4: ClusterIP Service 能否用于 StatefulSet?

                可以,但不推荐:

                • ✅ 可以提供负载均衡
                • ❌ 无法通过稳定的 DNS 名访问特定 Pod
                • ❌ 不适合主从架构(无法区分主节点)

                最佳实践:

                • StatefulSet 使用 Headless Service
                • 如需负载均衡,额外创建 ClusterIP Service

                💡 关键要点总结

                ClusterIP Service

                ✅ 默认类型,有虚拟 IP
                ✅ 自动负载均衡(kube-proxy)
                ✅ 适合无状态应用
                ✅ 客户端无需感知后端 Pod
                ✅ DNS 解析返回 Service IP

                Headless Service

                ✅ 设置 clusterIP: None
                ✅ DNS 解析返回所有 Pod IP
                ✅ 适合有状态应用
                ✅ 支持 Pod 级别的服务发现
                ✅ 常与 StatefulSet 配合使用

                选型建议

                • Web 应用、API 服务 → ClusterIP
                • 数据库、消息队列、分布式存储 → Headless Service
                • 有主从/分片的应用 → Headless Service + StatefulSet
                • 需要同时支持负载均衡和直接访问 → 两种 Service 都创建
                Mar 7, 2024

                Helm Principle

                Helm 是 Kubernetes 的包管理工具,类似于 Linux 的 apt/yum 或 Python 的 pip,它的核心作用是: 👉 用模板化的方式定义、安装和升级 Kubernetes 应用。


                🧩 一、Helm 的核心概念

                在理解原理前,先明确 Helm 的几个关键对象:

                 | 概念 | 说明 |
                 | --- | --- |
                 | Chart | 一个 Helm 包,描述一组 Kubernetes 资源的模板集合(即一个应用的安装包) |
                 | Values.yaml | Chart 的参数配置文件,用于填充模板变量 |
                 | Release | Helm 将 Chart 安装到某个命名空间后的实例,每次安装或升级都是一个 release |
                 | Repository | 存放打包后 chart(.tgz)的仓库,可以是 HTTP/OCI 类型(如 Harbor, Artifactory) |

                ⚙️ 二、Helm 的工作原理流程

                从用户角度来看,Helm Client 发出命令(如 helm install),Helm 会通过一系列步骤在集群中生成 Kubernetes 资源。

                下面是核心流程图概念(文字版):

                       ┌────────────┐
                       │ helm client│
                       └─────┬──────┘
                             │
                             ▼
                      1. 解析Chart与Values
                             │
                             ▼
                      2. 模板渲染(Helm Template Engine)
                             │
                             ▼
                      3. 生成纯YAML清单
                             │
                             ▼
                      4. 调用Kubernetes API
                             │
                             ▼
                      5. 创建/更新资源(Deployment、Service等)
                             │
                             ▼
                      6. 记录Release历史(ConfigMap/Secret)

                🔍 三、Helm 工作机制分解

                1️⃣ Chart 渲染阶段

                Helm 使用 Go 的 text/template 模板引擎 + Sprig 函数库,将模板与 values.yaml 合并生成 Kubernetes YAML 清单。

                例如:

                # templates/deployment.yaml
                apiVersion: apps/v1
                kind: Deployment
                metadata:
                  name: {{ .Release.Name }}-app
                spec:
                  replicas: {{ .Values.replicas }}

                通过:

                helm template myapp ./mychart -f myvalues.yaml

                Helm 会本地生成纯 YAML 文件(不部署到集群)。


                2️⃣ 部署阶段(Install/Upgrade)

                执行:

                helm install myapp ./mychart

                Helm Client 会将渲染好的 YAML 通过 Kubernetes API 提交到集群(相当于执行 kubectl apply)。

                Helm 同时在命名空间中创建一个 “Release 记录”,默认存放在:

                namespace: <your-namespace>
                kind: Secret
                name: sh.helm.release.v1.<release-name>.vN

                其中保存了:

                • Chart 模板和 values 的快照
                • 渲染后的 manifest
                • Release 状态(deployed、failed 等)
                • 版本号(v1, v2, …)
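
                 可以直接解码这个 Secret 来验证上述内容。下面是一个示意(假设 release 名为 myapp、当前版本为 v1,位于 default 命名空间;Helm 3 中 release 数据是 gzip 压缩后再 base64 编码的 JSON):

                 # 解码 release secret,查看 Helm 保存的快照信息
                 kubectl -n default get secret sh.helm.release.v1.myapp.v1 \
                   -o jsonpath='{.data.release}' \
                   | base64 -d | base64 -d | gunzip \
                   | jq '{name: .name, version: .version, status: .info.status}'
                 
                 # 更常用的做法是直接用 Helm 命令查看已部署的 manifest 与 values
                 helm get manifest myapp
                 helm get values myapp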

                3️⃣ 升级与回滚机制

                当执行:

                helm upgrade myapp ./mychart

                时,Helm 会:

                1. 读取旧版本 release secret
                2. 渲染新模板
                3. 比较新旧差异(Diff)
                4. 调用 Kubernetes API 更新对象
                5. 写入新的 release secret(版本号 +1)

                回滚时:

                helm rollback myapp 2

                 Helm 会取出 v2 的记录,重新提交其中的 manifest(相当于再次 kubectl apply)。
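
                 可以用 helm history 直观地观察版本号的变化(示意,release 名 myapp 为假设):

                 # 查看 release 的历史版本与状态
                 helm history myapp
                 
                 # 回滚到版本 2;注意回滚本身也会生成一个新的、递增的版本号
                 helm rollback myapp 2
                 helm history myapp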


                4️⃣ 仓库机制(Helm Repository / OCI Registry)

                Helm 支持两种包分发方式:

                • HTTP 仓库(传统)

                  • 有一个 index.yaml 索引文件
                  • Chart 以 .tgz 格式存储
                • OCI 仓库(现代推荐)

                  • Chart 存储在 OCI registry(如 Harbor, GHCR)

                  • 推送方式:

                    helm push mychart/ oci://harbor.example.com/helm
                  • 拉取方式:

                    helm pull oci://harbor.example.com/helm/mychart --version 1.0.0

                🧠 四、Helm 与 Kubernetes 的关系

                Helm 本身 不直接管理容器运行,它只是:

                • 模板引擎 + 应用生命周期管理器;
                • 所有资源最终仍由 Kubernetes 控制器(如 Deployment controller)调度、运行。

                Helm 类似于 “上层应用打包器”:

                Helm = Chart 模板系统 + Kubernetes API 客户端 + Release 历史追踪

                💡 五、常见命令原理对照

                 | 命令 | Helm 行为 |
                 | --- | --- |
                 | helm install | 渲染模板 → 提交资源 → 创建 release |
                 | helm upgrade | 渲染模板 → diff 旧版本 → 更新资源 → 新 release |
                 | helm rollback | 获取旧版本记录 → 重新提交旧 manifest |
                 | helm uninstall | 删除 Kubernetes 资源 + 删除 release secret |
                 | helm template | 本地渲染模板,不与集群交互 |
                 | helm diff | (插件)比较新旧渲染结果差异 |

                🧩 六、Helm 3 与 Helm 2 的区别(核心)

                 | Helm 2 | Helm 3 |
                 | --- | --- |
                 | 需要 Tiller(集群内控制组件) | 无需 Tiller,完全 client-side |
                 | 安全模型复杂(基于 RBAC 授权) | 安全性更好,直接使用 kubeconfig 权限 |
                 | Release 存储在 ConfigMap | 默认存储在 Secret |
                 | 需要部署 Helm Server | 纯客户端 |
                Mar 7, 2024

                HPA

                HPA(Horizontal Pod Autoscaler)是 Kubernetes 中实现自动水平扩缩容的核心组件。它的实现涉及多个 Kubernetes 组件和复杂的控制逻辑。

                一、HPA 架构组成

                1. 核心组件

                ┌─────────────────┐    ┌──────────────────┐    ┌─────────────────┐
                │   HPA Controller │ ◄──│   Metrics API    │ ◄──│  Metrics Server │
                │   (kube-controller)│    │    (聚合层)     │    │   (cAdvisor)    │
                └─────────────────┘    └──────────────────┘    └─────────────────┘
                         │                       │                       │
                         ▼                       ▼                       ▼
                ┌─────────────────┐    ┌──────────────────┐    ┌─────────────────┐
                │ Deployment/     │    │  Custom Metrics  │    │  External       │
                │ StatefulSet     │    │   Adapter        │    │  Metrics        │
                └─────────────────┘    └──────────────────┘    └─────────────────┘

                二、HPA 工作流程

                1. 完整的控制循环

                // 简化的 HPA 控制逻辑
                for {
                    // 1. 获取 HPA 对象
                    hpa := client.AutoscalingV2().HorizontalPodAutoscalers(namespace).Get(name)
                    
                    // 2. 获取缩放目标(Deployment/StatefulSet等)
                    scaleTarget := hpa.Spec.ScaleTargetRef
                    target := client.AppsV1().Deployments(namespace).Get(scaleTarget.Name)
                    
                    // 3. 查询指标
                    metrics := []autoscalingv2.MetricStatus{}
                    for _, metricSpec := range hpa.Spec.Metrics {
                        metricValue := getMetricValue(metricSpec, target)
                        metrics = append(metrics, metricValue)
                    }
                    
                    // 4. 计算期望副本数
                    desiredReplicas := calculateDesiredReplicas(hpa, metrics, currentReplicas)
                    
                    // 5. 执行缩放
                    if desiredReplicas != currentReplicas {
                        scaleTarget.Spec.Replicas = &desiredReplicas
                        client.AppsV1().Deployments(namespace).UpdateScale(scaleTarget.Name, scaleTarget)
                    }
                    
                    time.Sleep(15 * time.Second) // 默认扫描间隔
                }

                2. 详细步骤分解

                步骤 1:指标收集

                # HPA 通过 Metrics API 获取指标
                kubectl get --raw "/apis/metrics.k8s.io/v1beta1/namespaces/default/pods" | jq .
                
                # 或者通过自定义指标 API
                kubectl get --raw "/apis/custom.metrics.k8s.io/v1beta1" | jq .

                步骤 2:指标计算

                // 计算当前指标值与目标值的比率
                func calculateMetricRatio(currentValue, targetValue int64) float64 {
                    return float64(currentValue) / float64(targetValue)
                }
                
                 // 示例:CPU 使用量比率计算
                 currentCPUUsage := int64(800) // 当前使用 800 milli-cores
                 targetCPUUsage := int64(500)  // 目标使用 500 milli-cores
                 ratio := calculateMetricRatio(currentCPUUsage, targetCPUUsage) // = 1.6

                三、HPA 配置详解

                1. HPA 资源定义

                apiVersion: autoscaling/v2
                kind: HorizontalPodAutoscaler
                metadata:
                  name: myapp-hpa
                  namespace: default
                spec:
                  # 缩放目标
                  scaleTargetRef:
                    apiVersion: apps/v1
                    kind: Deployment
                    name: myapp
                  # 副本数范围
                  minReplicas: 2
                  maxReplicas: 10
                  # 指标定义
                  metrics:
                  - type: Resource
                    resource:
                      name: cpu
                      target:
                        type: Utilization
                        averageUtilization: 50
                  - type: Resource
                    resource:
                      name: memory
                      target:
                        type: Utilization
                        averageUtilization: 70
                  - type: Pods
                    pods:
                      metric:
                        name: packets-per-second
                      target:
                        type: AverageValue
                        averageValue: 1k
                  - type: Object
                    object:
                      metric:
                        name: requests-per-second
                      describedObject:
                        apiVersion: networking.k8s.io/v1
                        kind: Ingress
                        name: main-route
                      target:
                        type: Value
                        value: 10k
                  # 行为配置(Kubernetes 1.18+)
                  behavior:
                    scaleDown:
                      stabilizationWindowSeconds: 300
                      policies:
                      - type: Percent
                        value: 50
                        periodSeconds: 60
                      - type: Pods
                        value: 5
                        periodSeconds: 60
                      selectPolicy: Min
                    scaleUp:
                      stabilizationWindowSeconds: 0
                      policies:
                      - type: Percent
                        value: 100
                        periodSeconds: 15
                      - type: Pods
                        value: 4
                        periodSeconds: 15
                      selectPolicy: Max

                四、指标类型和计算方式

                1. 资源指标(CPU/Memory)

                metrics:
                - type: Resource
                  resource:
                    name: cpu
                    target:
                      type: Utilization    # 利用率模式
                      averageUtilization: 50
                      
                - type: Resource  
                  resource:
                    name: memory
                    target:
                      type: AverageValue  # 平均值模式
                      averageValue: 512Mi

                计算逻辑

                 // CPU 利用率计算(利用率按百分比表示,简化示意)
                 func calculateCPUReplicas(currentUtilization, targetUtilization, currentReplicas int32) int32 {
                     // 期望副本数 = ceil(当前副本数 * 当前平均利用率 / 目标利用率)
                     desiredReplicas := int32(math.Ceil(
                         float64(currentReplicas) * float64(currentUtilization) / float64(targetUtilization)))
                     return desiredReplicas
                 }

                2. 自定义指标(Pods 类型)

                metrics:
                - type: Pods
                  pods:
                    metric:
                      name: http_requests_per_second
                    target:
                      type: AverageValue
                      averageValue: 100

                计算方式

                期望副本数 = ceil(当前总指标值 / 目标平均值)

                3. 对象指标(Object 类型)

                metrics:
                - type: Object
                  object:
                    metric:
                      name: latency
                    describedObject:
                      apiVersion: networking.k8s.io/v1
                      kind: Ingress
                      name: my-ingress
                    target:
                      type: Value
                      value: 100

                五、HPA 算法详解

                1. 核心算法

                // 计算期望副本数
                func GetDesiredReplicas(
                    currentReplicas int32,
                    metricValues []metrics,
                    hpa *HorizontalPodAutoscaler,
                ) int32 {
                    ratios := make([]float64, 0)
                    
                    // 1. 计算每个指标的比率
                    for _, metric := range metricValues {
                        ratio := calculateMetricRatio(metric.current, metric.target)
                        ratios = append(ratios, ratio)
                    }
                    
                    // 2. 选择最大的比率(最需要扩容的指标)
                    maxRatio := getMaxRatio(ratios)
                    
                    // 3. 计算期望副本数
                    desiredReplicas := math.Ceil(float64(currentReplicas) * maxRatio)
                    
                    // 4. 应用边界限制
                    desiredReplicas = applyBounds(desiredReplicas, hpa.Spec.MinReplicas, hpa.Spec.MaxReplicas)
                    
                    return int32(desiredReplicas)
                }

                2. 平滑算法和冷却机制

                // 考虑历史记录的缩放决策
                func withStabilization(desiredReplicas int32, hpa *HorizontalPodAutoscaler) int32 {
                    now := time.Now()
                    
                    if isScaleUp(desiredReplicas, hpa.Status.CurrentReplicas) {
                        // 扩容:通常立即执行
                        stabilizationWindow = hpa.Spec.Behavior.ScaleUp.StabilizationWindowSeconds
                    } else {
                        // 缩容:应用稳定窗口
                        stabilizationWindow = hpa.Spec.Behavior.ScaleDown.StabilizationWindowSeconds
                    }
                    
                    // 过滤稳定窗口内的历史推荐值
                    validRecommendations := filterRecommendationsByTime(
                        hpa.Status.Conditions, 
                        now.Add(-time.Duration(stabilizationWindow)*time.Second)
                    )
                    
                    // 选择策略(Min/Max)
                    finalReplicas := applyPolicy(validRecommendations, hpa.Spec.Behavior)
                    
                    return finalReplicas
                }

                六、高级特性实现

                1. 多指标支持

                 当配置多个指标时,HPA 会为每个指标分别计算期望副本数,然后取其中的最大值:

                func calculateFromMultipleMetrics(metrics []Metric, currentReplicas int32) int32 {
                    desiredReplicas := make([]int32, 0)
                    
                    for _, metric := range metrics {
                        replicas := calculateForSingleMetric(metric, currentReplicas)
                        desiredReplicas = append(desiredReplicas, replicas)
                    }
                    
                    // 选择最大的期望副本数
                    return max(desiredReplicas...)
                }

                2. 扩缩容行为控制

                behavior:
                  scaleDown:
                    # 缩容稳定窗口:5分钟
                    stabilizationWindowSeconds: 300
                    policies:
                    - type: Percent   # 每分钟最多缩容 50%
                      value: 50
                      periodSeconds: 60
                    - type: Pods      # 或每分钟最多减少 5 个 Pod
                      value: 5
                      periodSeconds: 60
                    selectPolicy: Min # 选择限制更严格的策略
                    
                  scaleUp:
                    stabilizationWindowSeconds: 0  # 扩容立即执行
                    policies:
                    - type: Percent   # 每分钟最多扩容 100%
                      value: 100
                      periodSeconds: 60
                    - type: Pods      # 或每分钟最多增加 4 个 Pod
                      value: 4
                      periodSeconds: 60
                    selectPolicy: Max # 选择限制更宽松的策略

                七、监控和调试

                1. 查看 HPA 状态

                # 查看 HPA 详情
                kubectl describe hpa myapp-hpa
                
                # 输出示例:
                # Name: myapp-hpa
                # Namespace: default
                # Reference: Deployment/myapp
                # Metrics: ( current / target )
                #   resource cpu on pods  (as a percentage of request):  65% (130m) / 50%
                #   resource memory on pods:                             120Mi / 100Mi
                # Min replicas: 2
                # Max replicas: 10
                # Deployment pods: 3 current / 3 desired

                2. HPA 相关事件

                # 查看 HPA 事件
                kubectl get events --field-selector involvedObject.kind=HorizontalPodAutoscaler
                
                # 查看缩放历史
                kubectl describe deployment myapp | grep -A 10 "Events"

                3. 指标调试

                # 检查 Metrics API 是否正常工作
                kubectl get --raw "/apis/metrics.k8s.io/v1beta1/nodes" | jq .
                
                # 检查自定义指标
                kubectl get --raw "/apis/custom.metrics.k8s.io/v1beta1" | jq .
                
                # 直接查询 Pod 指标
                kubectl top pods
                kubectl top nodes

                八、常见问题排查

                1. HPA 不扩容

                # 检查指标是否可用
                kubectl describe hpa myapp-hpa
                # 查看 Events 部分是否有错误信息
                
                # 检查 Metrics Server
                kubectl get apiservices | grep metrics
                kubectl logs -n kube-system -l k8s-app=metrics-server
                
                # 检查资源请求配置
                kubectl get deployment myapp -o yaml | grep resources -A 5

                2. HPA 频繁震荡

                # 调整行为配置
                behavior:
                  scaleDown:
                    stabilizationWindowSeconds: 600  # 延长缩容稳定窗口
                    policies:
                    - type: Pods
                      value: 1                       # 每次只缩容 1 个 Pod
                      periodSeconds: 300             # 每 5 分钟最多一次

                九、性能优化建议

                1. 大规模集群优化

                # kube-controller-manager 配置优化
                apiVersion: v1
                kind: Pod
                metadata:
                  name: kube-controller-manager
                  namespace: kube-system
                spec:
                  containers:
                  - command:
                    - kube-controller-manager
                    - --horizontal-pod-autoscaler-sync-period=30s           # 调整同步周期
                    - --horizontal-pod-autoscaler-downscale-stabilization=5m0s
                    - --horizontal-pod-autoscaler-initial-readiness-delay=30s
                    - --horizontal-pod-autoscaler-cpu-initialization-period=5m0s

                总结

                HPA 的实现涉及:

                1. 指标收集:通过 Metrics API 聚合资源和使用情况
                2. 算法计算:基于指标值和目标值计算期望副本数
                3. 决策平滑:应用稳定窗口和策略避免震荡
                4. 执行缩放:更新目标资源的副本数
                5. 状态跟踪:记录缩放历史和当前状态

                HPA 的智能化体现在多指标决策、行为控制、冷却机制等方面,使得自动扩缩容既响应迅速又稳定可靠。
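
                 作为补充,下面是一个最小的实操示意(假设集群已部署 Metrics Server,且存在名为 myapp 的 Deployment):

                 # 基于 CPU 利用率创建 HPA(等价于上文 Resource/cpu 指标的简化形式)
                 kubectl autoscale deployment myapp --cpu-percent=50 --min=2 --max=10
                 
                 # 观察当前指标值、目标值与副本数的变化
                 kubectl get hpa myapp -w
                 kubectl describe hpa myapp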

                Mar 7, 2024

                More than 1k Nodes

                在这个量级上,K8s 不再只是“能跑就行”,而是进入可扩展性、稳定性、可观测性和资源效率的工程化挑战。下面我从架构、控制面、节点管理、网络、存储、安全和运维几个方面系统讲解。


                🧠 一、总体思路:大规模集群的本质挑战

                当节点规模超过 500~1000 时,Kubernetes 的瓶颈通常出现在:

                • 控制平面(API Server / etcd)压力过大
                • 调度器吞吐不足
                • 资源对象(Pod / Node / Secret / ConfigMap 等)过多,导致 List/Watch 延迟
                • 网络和 CNI 插件在高并发下性能下降
                • 监控、日志、事件系统的数据量爆炸
                • 维护和升级变得极度复杂

                所以,大规模集群的重点是:

                控制平面分层、节点池分区、流量隔离、观测与调优。


                🏗️ 二、控制平面(Control Plane)

                1. etcd 优化

                • 独立部署:不要和 kube-apiserver 混布,最好是独立的高性能节点(NVMe SSD、本地盘)。
                • 使用 etcd v3.5+(性能改进明显),并开启压缩和快照机制。
                 • 调大 --max-request-bytes 与 --quota-backend-bytes,避免过载。
                 • 定期 defrag:可用 CronJob 自动化(命令示意见下)。
                • 不要存放短生命周期对象(例如频繁更新的 CRD 状态),可以考虑用外部缓存系统(如 Redis 或 SQL)。
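
                 例如上面提到的定期 defrag,可以先手动验证效果,再封装进 CronJob(命令仅作示意,endpoint 与证书路径按 kubeadm 默认值假设):

                 # 查看各 etcd 成员的 DB 大小,评估碎片情况
                 ETCDCTL_API=3 etcdctl --endpoints=https://127.0.0.1:2379 \
                   --cacert=/etc/kubernetes/pki/etcd/ca.crt \
                   --cert=/etc/kubernetes/pki/etcd/server.crt \
                   --key=/etc/kubernetes/pki/etcd/server.key \
                   endpoint status --write-out=table
                 
                 # 执行碎片整理(生产环境建议逐台、错峰进行)
                 ETCDCTL_API=3 etcdctl --endpoints=https://127.0.0.1:2379 \
                   --cacert=/etc/kubernetes/pki/etcd/ca.crt \
                   --cert=/etc/kubernetes/pki/etcd/server.crt \
                   --key=/etc/kubernetes/pki/etcd/server.key \
                   defrag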

                2. API Server 扩展与保护

                • 使用 负载均衡(HAProxy、NGINX、ELB)在多 API Server 之间分流;

                • 调整:

                  • --max-mutating-requests-inflight
                  • --max-requests-inflight
                  • --target-ram-mb
                • 合理设置 --request-timeout,防止 watch 卡死;

                • 限制大量 client watch 行为(Prometheus、controller-manager 等);

                 • 对 client 侧使用 aggregator 或 read-only proxy 来降低负载。

                3. Scheduler & Controller Manager

                • 多调度器实例(leader election)

                • 启用 调度缓存(SchedulerCache)优化

                • 调整:

                   • --kube-api-qps 与 --kube-api-burst;
                  • 调度算法的 backoff 策略;
                • 对自定义 Operator 建议使用 workqueue with rate limiters 防止风暴。


                🧩 三、节点与 Pod 管理

                1. 节点分区与拓扑

                • 按功能/位置划分 Node Pool(如 GPU/CPU/IO 密集型);
                • 使用 Topology Spread Constraints 避免集中调度;
                • 考虑用 Cluster Federation (KubeFed)多个集群 + 集中管理(如 ArgoCD 多集群、Karmada、Fleet)

                2. 节点生命周期

                • 控制 kubelet 心跳频率 (--node-status-update-frequency);
                • 通过 Node Problem Detector (NPD) 自动标记异常节点;
                • 监控 Pod eviction rate,防止节点频繁漂移;
                • 启用 graceful node shutdown 支持。

                3. 镜像与容器运行时

                • 镜像预热(Image pre-pull);
                • 使用 镜像仓库代理(Harbor / registry-mirror)
                • 考虑 containerd 代替 Docker;
                • 定期清理 /var/lib/containerd/io.containerd.snapshotter.v1.overlayfs/snapshots

                🌐 四、网络(CNI)

                1. CNI 选择与调优

                • 大规模下优选:

                  • Calico (BGP 模式)
                  • Cilium (eBPF)
                  • 或使用云原生方案(AWS CNI, Azure CNI)。
                • 降低 ARP / 路由表压力:

                  • 使用 IPAM 子网分段
                  • 开启 Cilium 的 ClusterMesh 分层;
                • 调整 conntrack 表大小(net.netfilter.nf_conntrack_max)。
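
                 其中 conntrack 表大小可以直接用 sysctl 调整并持久化(示意,具体数值需按节点内存与连接规模评估):

                 # 查看当前连接跟踪表的使用量与上限
                 cat /proc/sys/net/netfilter/nf_conntrack_count
                 cat /proc/sys/net/netfilter/nf_conntrack_max
                 
                 # 临时调大
                 sysctl -w net.netfilter.nf_conntrack_max=1048576
                 
                 # 持久化到配置文件
                 echo "net.netfilter.nf_conntrack_max = 1048576" >> /etc/sysctl.d/99-conntrack.conf
                 sysctl --system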

                2. Service & DNS

                • 启用 CoreDNS 缓存

                • 对大规模 Service 场景,考虑 Headless Service + ExternalName

                • 优化 kube-proxy:

                  • 使用 IPVS 模式
                  • Cilium service LB
                • 如果 Service 数量非常多,可拆分 namespace 级 DNS 域。


                💾 五、存储(CSI)

                • 使用 分布式存储系统(Ceph、Longhorn、OpenEBS、CSI-HostPath);
                • 避免高频小 I/O 的 PVC;
                • 定期清理僵尸 PV/PVC;
                • 对 CSI driver 开启限流与重试机制。

                🔒 六、安全与访问控制

                • 开启 RBAC 严格控制
                • 限制 namespace 级资源上限(ResourceQuota, LimitRange);
                • 审计日志(Audit Policy)异步存储;
                • 对外接口统一走 Ingress Controller;
                • 如果有 Operator 或 CRD 资源暴涨,记得定期清理过期对象。

                📈 七、可观测性与维护

                1. 监控

                • Prometheus 集群化(Thanos / VictoriaMetrics);
                • 不直接监控所有 Pod,可抽样或聚合;
                • kube-state-metrics 与 cAdvisor 数据要限流。

                2. 日志

                • 统一日志收集(Loki / Elasticsearch / Vector);
                • 日志量控制策略(采样、压缩、清理)。

                3. 升级与测试

                • 使用 灰度升级 / Node pool rolling
                • 每次升级前跑 e2e 测试;
                • 对控制平面单独做快照和备份(etcd snapshot)。

                ⚙️ 八、性能调优与实践经验

                • 调整 kubelet QPS 限制:

                  --kube-api-qps=100 --kube-api-burst=200
                • 合理的 Pod 数量控制:

                  • 单节点不超过 110 Pods;
                  • 单 namespace 建议 < 5000 Pods;
                  • 总体目标:1k 节点 → 5~10 万 Pods 以内。
                • 使用 CRD Sharding / 缩减 CRD 状态字段

                • 避免大量短生命周期 Job,可用 CronJob + TTLController 清理。


                🧭 九、扩展方向

                当规模继续上升(>3000 节点)时,可以考虑:

                • 多集群架构(Cluster Federation / Karmada / Rancher Fleet)
                • 控制平面分层(cell-based control plane)
                • API Aggregation Layer + Custom Scheduler

                Mar 7, 2024

                Network Policy

                1. Network Policy 的设计原理

                Kubernetes Network Policy 的设计核心思想是:在默认允许的集群网络中,引入一个“默认拒绝”的、声明式的、基于标签的防火墙

                让我们来分解这个核心思想:

                1. 从“默认允许”到“默认拒绝”

                  • 默认行为:在没有任何 Network Policy 的情况下,Kubernetes 集群内的 Pod 之间是可以自由通信的(取决于 CNI 插件),甚至来自外部的流量也可能直接访问到 Pod。这就像在一个没有防火墙的开放网络里。
                  • Network Policy 的作用:一旦在某个 Namespace 中创建了一个 Network Policy,它就会像一个“开关”,将这个 Namespace 或特定 Pod 的默认行为变为 “默认拒绝”。之后,只有策略中明确允许的流量才能通过。
                2. 声明式模型

                  • 和其他的 Kubernetes 资源(如 Deployment、Service)一样,Network Policy 也是声明式的。你只需要告诉 Kubernetes“你期望的网络状态是什么”(例如,“允许来自带有 role=frontend 标签的 Pod 的流量访问带有 role=backend 标签的 Pod 的 6379 端口”),而不需要关心如何通过 iptables 或 eBPF 命令去实现它。Kubernetes 和其下的 CNI 插件会负责实现你的声明。
                3. 基于标签的选择机制

                  • 这是 Kubernetes 的核心设计模式。Network Policy 不关心 Pod 的 IP 地址,因为 IP 是动态且易变的。它通过 标签 来选择一组 Pod。
                  • podSelector: 选择策略所应用的 Pod(即目标 Pod)。
                  • namespaceSelector: 根据命名空间的标签来选择来源或目标命名空间。
                   • namespaceSelector 和 podSelector 可以组合使用,实现非常精细的访问控制。
                4. 策略是叠加的

                  • 多个 Network Policy 可以同时作用于同一个 Pod。最终的规则是所有相关策略的 并集。如果任何一个策略允许了某条流量,那么该流量就是被允许的。这意味着你可以分模块、分层次地定义策略,而不会相互覆盖。
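
                 上面第 1 点提到的“默认拒绝”开关,本质上就是一条选中所有 Pod、但不放行任何流量的策略。下面是一个最小示意(命名空间 demo 为假设):

                 kubectl apply -n demo -f - <<'EOF'
                 apiVersion: networking.k8s.io/v1
                 kind: NetworkPolicy
                 metadata:
                   name: default-deny-all
                 spec:
                   podSelector: {}        # 选中该命名空间内所有 Pod
                   policyTypes:
                   - Ingress
                   - Egress               # 未被其他策略显式允许的入站/出站流量都会被拒绝
                 EOF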

                2. Network Policy 的实现方式

                一个非常重要的概念是:Network Policy 本身只是一个 API 对象,它定义了一套规范。它的具体实现依赖于 Container Network Interface 插件。

                Kubernetes 不会自己实现网络策略,而是由 CNI 插件来负责。这意味着:

                • 如果你的 CNI 插件不支持 Network Policy,那么你创建的 Policy 将不会产生任何效果。
                • 不同的 CNI 插件使用不同的底层技术来实现相同的 Network Policy 规范。

                主流的实现方式和技术包括:

                1. 基于 iptables

                  • 工作原理:CNI 插件(如 Calico 的部分模式、Weave Net 等)会监听 Kubernetes API,当有 Network Policy 被创建时,它会在节点上生成相应的 iptables 规则。这些规则会对进出 Pod 网络接口(veth pair)的数据包进行过滤。
                  • 优点:成熟、稳定、通用。
                  • 缺点:当策略非常复杂时,iptables 规则链会变得很长,可能对性能有一定影响。
                2. 基于 eBPF

                  • 工作原理:这是更现代和高效的方式,被 Cilium 等项目广泛采用。eBPF 允许将程序直接注入到 Linux 内核中,在内核层面高效地执行数据包过滤、转发和策略检查。
                  • 优点:高性能、灵活性极强(可以实现 L3/L4/L7 所有层面的策略)、对系统影响小。
                  • 缺点:需要较新的 Linux 内核版本。
                3. 基于 IPVS 或自有数据平面

                  • 一些 CNI 插件(如 Antrea,它底层使用 OVS)可能有自己独立的数据平面,并在其中实现策略的匹配和执行。

                常见的支持 Network Policy 的 CNI 插件:

                • Calico: 功能强大,支持复杂的网络策略,既可以使用 iptables 模式也可以使用 eBPF 模式。
                • Cilium: 基于 eBPF,原生支持 Network Policy,并扩展到了 L7(HTTP、gRPC 等)网络策略。
                • Weave Net: 提供了对 Kubernetes Network Policy 的基本支持。
                • Antrea: 基于 Open vSwitch,也提供了强大的策略支持。

                3. Network Policy 的用途

                 Network Policy 是实现 Kubernetes “零信任”与“微隔离” 安全模型的核心工具。其主要用途包括:

                1. 实现最小权限原则

                  • 这是最核心的用途。通过精细的策略,确保一个 Pod 只能与它正常工作所 必需 的其他 Pod 或外部服务通信,除此之外的一切连接都被拒绝。这极大地减少了攻击面。
                2. 隔离多租户环境

                  • 在共享的 Kubernetes 集群中,可以为不同的团队、项目或环境(如 dev, staging)创建不同的命名空间。然后使用 Network Policy 严格限制跨命名空间的访问,确保它们相互隔离,互不干扰。
                3. 保护关键基础服务

                  • 数据库、缓存(如 Redis)、消息队列等后端服务通常不应该被所有 Pod 访问。可以创建策略,只允许特定的前端或中间件 Pod(通过标签选择)访问这些后端服务的特定端口。
                  # 示例:只允许 role=api 的 Pod 访问 role=db 的 Pod 的 5432 端口
                  apiVersion: networking.k8s.io/v1
                  kind: NetworkPolicy
                  metadata:
                    name: allow-api-to-db
                  spec:
                    podSelector:
                      matchLabels:
                        role: db
                    policyTypes:
                    - Ingress
                    ingress:
                    - from:
                      - podSelector:
                          matchLabels:
                            role: api
                      ports:
                      - protocol: TCP
                        port: 5432
                4. 控制外部访问

                  • 使用 ipBlock 字段,可以限制只有来自特定 IP 段(例如公司办公室的 IP)的流量才能访问集群内部的服务。这可以用来替代或补充传统的防火墙规则。
                5. 划分应用层次安全边界

                  • 在一个典型的 Web 应用中,可以创建清晰的层次:
                    • 前端层: 可以接收来自外部的流量(80/443端口),但只能与后端层通信。
                    • 后端层: 只能接收来自前端层的流量,并只能与数据层通信。
                    • 数据层: 只能接收来自后端层的流量,不接受任何其他来源的请求。

                总结

                 | 特性 | 描述 |
                 | --- | --- |
                 | 设计原理 | 在默认允许的网络中,通过声明式和基于标签的机制,实现“默认拒绝”的精细流量控制。 |
                 | 实现方式 | 由 CNI 插件负责实现,底层技术包括 iptables、eBPF 等。策略本身是 Kubernetes 的 API 资源。 |
                 | 主要用途 | 实现微隔离、最小权限原则、多租户隔离、保护关键服务、控制外部访问,是 Kubernetes 网络安全的基石。 |

                简单来说,Network Policy 就是 Kubernetes 世界的防火墙规则,它让你能够定义“谁在什么条件下可以访问什么”,是生产环境中保障应用安全不可或缺的一部分。

                Mar 7, 2024

                Node NotReady

                当 Kubernetes 中某些 Node 节点状态变为 NotReady 时,这往往意味着 kubelet 无法与控制平面(API Server)正常通信,或该节点上某些关键组件/资源异常。

                我们可以从以下两个层面来分析: 1️⃣ 导致节点 NotReady 的常见原因 2️⃣ NotReady 状态对整个集群和业务的影响


                🧩 一、Node NotReady 的常见原因分类

                kubelet 每 10 秒(默认)向 API Server 报告一次心跳(NodeStatus)。 如果连续 40 秒(默认 --node-monitor-grace-period=40s)没有收到更新,Controller Manager 会将节点标记为 NotReady

                下面按类别详细分析👇


                🖧 1. 网络层异常(最常见)

                症状:节点能 ping 通外网,但与 control plane 交互超时。 原因包括:

                • 节点与 kube-apiserver 之间的网络中断(如防火墙、路由异常、VPC 问题);
                • API Server 负载均衡异常(L4/L7 LB 停止转发流量);
                • Pod 网络插件(CNI)崩溃,kubelet 无法汇报 Pod 状态;
                • 节点 DNS 解析异常(影响 kubelet 访问 API Server)。

                排查方式:

                # 在节点上检查 API Server 可达性
                curl -k https://<apiserver-ip>:6443/healthz
                # 检查 kubelet 日志
                journalctl -u kubelet | grep -E "error|fail|timeout"

                ⚙️ 2. kubelet 本身异常

                症状:节点长时间 NotReady,重启 kubelet 后恢复。

                原因包括:

                • kubelet 崩溃 / 死循环;
                • 磁盘满,导致 kubelet 无法写临时目录(/var/lib/kubelet);
                • 证书过期(/var/lib/kubelet/pki/kubelet-client-current.pem);
                • CPU/Mem 资源耗尽,kubelet 被 OOM;
                • kubelet 配置文件被改动,重启后加载失败。

                排查方式:

                systemctl status kubelet
                journalctl -u kubelet -n 100
                df -h /var/lib/kubelet

                💾 3. 节点资源耗尽

                 症状:Node 状态为 NotReady 或 Unknown,Pod 被驱逐。

                可能原因:

                • 磁盘使用率 > 90%,触发 kubelet DiskPressure
                • 内存 / CPU 长期 100%,触发 MemoryPressure
                • inode 用尽(df -i);
                 • 临时目录 /var/lib/docker/tmp 或 /tmp 爆满。

                排查方式:

                kubectl describe node <node-name>
                # 查看 conditions
                # Conditions:
                #   Type              Status
                #   ----              ------
                #   MemoryPressure    True
                #   DiskPressure      True

                🧱 4. 控制面通信问题(API Server / Controller Manager)

                症状:多个节点同时 NotReady

                可能原因:

                • API Server 压力过大,导致心跳包无法及时处理;
                • etcd 异常(写延迟高);
                • Controller Manager 无法更新 NodeStatus;
                • 集群负载均衡器(如 haproxy)挂掉。

                排查方式:

                kubectl get componentstatuses
                # 或直接检查控制平面节点
                kubectl -n kube-system get pods -l tier=control-plane

                🔌 5. 容器运行时 (containerd/docker/crio) 异常

                症状:kubelet 报 “Failed to list pod sandbox”。

                原因包括:

                • containerd 服务挂掉;
                • 版本不兼容(kubelet 与 runtime 版本差异过大);
                • runtime socket 权限错误;
                • overlayfs 损坏;
                 • /var/lib/containerd 或 /run/containerd 文件系统只读。

                排查方式:

                systemctl status containerd
                journalctl -u containerd | tail
                crictl ps

                ⏱️ 6. 时间同步错误

                症状:kubelet 心跳被判定过期,但实际节点正常。

                原因:

                • 节点时间漂移(未启用 NTP / chrony);
                • 控制面和节点时间差 > 5 秒;
                • TLS 校验失败(证书时间不合法)。

                🧰 7. 节点维护或人为操作

                包括:

                • 节点被 cordon/drain;
                • 网络策略阻断 kubelet;
                • 人为停掉 kubelet;
                • 节点被重装后未清理旧状态(Node UID 冲突)。

                ⚠️ 二、Node NotReady 的后果与影响

                 | 影响范围 | 描述 |
                 | --- | --- |
                 | 1️⃣ Pod 调度 | Scheduler 会避免调度新 Pod 到该节点。 |
                 | 2️⃣ Pod 驱逐 | Controller Manager 默认在节点 NotReady 超过 300s(--pod-eviction-timeout)后,会驱逐所有 Pod。 |
                 | 3️⃣ Service Endpoint 缺失 | 该节点上运行的 Pod 从 Service Endpoint 列表中移除,导致负载均衡流量下降。 |
                 | 4️⃣ DaemonSet 中断 | DaemonSet Controller 不再在该节点上创建/管理 Pod。 |
                 | 5️⃣ 数据丢失风险 | 若节点上的 Pod 使用本地卷(emptyDir、hostPath),被驱逐后数据会丢失。 |
                 | 6️⃣ 集群监控告警 | Prometheus / Alertmanager 触发告警(如 KubeNodeNotReady、KubeletDown)。 |
                 | 7️⃣ 自动扩缩容失效 | Cluster Autoscaler 无法正确评估资源利用率。 |

                🧭 三、最佳实践与预防建议

                1. 启用 Node Problem Detector (NPD) 自动标记系统级异常;

                2. 监控 NodeConditionsReadyMemoryPressureDiskPressure);

                 3. 统一节点健康检查策略(如通过 taints 与 tolerations);

                4. 自动修复机制

                  • 结合 Cluster API 或自研 Controller 实现 Node 自动替换;
                  • 若节点 NotReady 超过 10 分钟,自动重建;
                5. 定期巡检:

                  • kubelet、containerd 状态;
                  • 系统时间同步;
                  • 磁盘使用率;
                  • API Server QPS 和 etcd 延迟。
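
                 配合第 2 点,可以用一条命令快速巡检所有节点的 Ready 状态(示意):

                 # 列出所有节点及其 Ready Condition
                 kubectl get nodes -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.conditions[?(@.type=="Ready")].status}{"\n"}{end}'
                 
                 # 只关注 NotReady 的节点
                 kubectl get nodes | grep -w NotReady
                 
                 # 进一步查看某个节点的完整 Conditions 与事件
                 kubectl describe node <node-name> | grep -A 10 Conditions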
                Mar 7, 2024

                Pause 容器

                Kubernetes Pause 容器的用途

                Pause 容器是 Kubernetes 中一个非常小但极其重要的基础设施容器。很多人会忽略它,但它是 Pod 网络和命名空间共享的核心。


                🎯 核心作用

                1. 作为 Pod 的"根容器"(Infrastructure Container)

                Pause 容器是每个 Pod 中第一个启动的容器,它的生命周期代表整个 Pod 的生命周期。

                Pod 生命周期:
                创建 Pod → 启动 Pause 容器 → 启动业务容器 → ... → 业务容器结束 → 删除 Pause 容器 → Pod 销毁

                2. 持有和共享 Linux 命名空间

                Pause 容器创建并持有以下命名空间,供 Pod 内其他容器共享:

                • Network Namespace (网络命名空间) - 最重要!
                • IPC Namespace (进程间通信)
                • UTS Namespace (主机名)
                # 查看 Pod 中的容器
                docker ps | grep pause
                
                # 你会看到类似输出:
                # k8s_POD_mypod_default_xxx  k8s.gcr.io/pause:3.9
                # k8s_app_mypod_default_xxx  myapp:latest

                🌐 网络命名空间共享(最关键的用途)

                工作原理

                ┌─────────────────── Pod ───────────────────┐
                │                                            │
                │  ┌─────────────┐                          │
                │  │   Pause     │ ← 创建网络命名空间        │
                │  │  Container  │ ← 拥有 Pod IP            │
                │  └──────┬──────┘                          │
                │         │ (共享网络栈)                     │
                │  ┌──────┴──────┬──────────┬──────────┐   │
                │  │ Container A │Container B│Container C│  │
                │  │  (业务容器)  │  (业务容器)│ (业务容器) │  │
                │  └─────────────┴──────────┴──────────┘   │
                │                                            │
                │  所有容器共享:                              │
                │  - 同一个 IP 地址 (Pod IP)                 │
                │  - 同一个网络接口                           │
                │  - 同一个端口空间                           │
                │  - 可以通过 localhost 互相访问              │
                └────────────────────────────────────────────┘

                实际效果

                # 示例 Pod
                apiVersion: v1
                kind: Pod
                metadata:
                  name: multi-container-pod
                spec:
                  containers:
                  - name: nginx
                    image: nginx
                    ports:
                    - containerPort: 80
                  - name: sidecar
                    image: busybox
                    command: ['sh', '-c', 'while true; do wget -O- localhost:80; sleep 5; done']

                在这个例子中:

                • Pause 容器创建网络命名空间并获得 Pod IP (如 10.244.1.5)
                • nginx 容器加入这个网络命名空间,监听 80 端口
                • sidecar 容器也加入同一网络命名空间
                • sidecar 可以通过 localhost:80 访问 nginx,因为它们共享网络栈
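
                 可以基于上面的示例 Pod 从外部验证这种共享(命令仅作示意):

                 # 在 sidecar 容器内通过 localhost 访问同 Pod 的 nginx
                 kubectl exec multi-container-pod -c sidecar -- wget -qO- http://localhost:80 | head -n 5
                 
                 # Pod IP 与容器内 /etc/hosts 的记录一致,两个容器共享同一网络身份
                 kubectl get pod multi-container-pod -o jsonpath='{.status.podIP}{"\n"}'
                 kubectl exec multi-container-pod -c sidecar -- cat /etc/hosts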

                🔍 为什么需要 Pause 容器?

                问题场景:如果没有 Pause 容器

                假设 Pod 中有两个容器 A 和 B:

                场景 1:容器 A 先启动,创建网络命名空间
                ├─ 容器 A 持有网络命名空间 → 拥有 Pod IP
                └─ 容器 B 加入容器 A 的网络命名空间
                
                问题:如果容器 A 崩溃重启或被删除,网络命名空间消失
                → 容器 B 失去网络连接
                → Pod IP 改变
                → Service 路由失效 ❌

                解决方案:引入 Pause 容器

                Pause 容器(持有命名空间) ← 永远不会主动退出
                ├─ 容器 A 加入
                └─ 容器 B 加入
                
                优势:
                ✅ 容器 A 或 B 崩溃不影响网络命名空间
                ✅ Pod IP 始终保持稳定
                ✅ 业务容器可以独立重启
                ✅ 简化容器间的依赖关系

                📦 Pause 容器的特点

                1. 极其精简

                 // pause 容器的代码只有几十行 C,核心功能就是:永远 sleep
                 #include <unistd.h>
                 
                 int main() {
                     for (;;) pause();  // 无限暂停,等待信号
                     return 0;
                 }

                镜像大小:约 700KB (相比普通镜像动辄几百 MB)

                2. 资源占用极低

                # 查看 Pause 容器资源占用
                docker stats <pause-container-id>
                
                # 典型输出:
                # CPU: 0.00%
                # MEM: 0.5 MiB

                3. 生命周期管理

                • Kubelet 创建 Pod 时首先启动 Pause 容器
                • Pause 容器退出 = Pod 销毁
                • 业务容器重启不影响 Pause 容器

                🛠️ 实际用途场景

                场景 1:Sidecar 模式

                # 应用 + 日志收集器
                spec:
                  containers:
                  - name: app
                    image: myapp
                    volumeMounts:
                    - name: logs
                      mountPath: /var/log
                  - name: log-collector
                    image: fluentd
                    volumeMounts:
                    - name: logs
                      mountPath: /var/log
                • Pause 容器保证两个容器可以通过共享卷和 localhost 通信
                • 即使 app 重启,log-collector 仍能正常工作

                场景 2:Service Mesh (如 Istio)

                # 应用 + Envoy 代理
                spec:
                  containers:
                  - name: app
                    image: myapp
                    ports:
                    - containerPort: 8080
                  - name: istio-proxy  # Envoy sidecar
                    image: istio/proxyv2
                • Pause 容器持有网络命名空间
                • Envoy 代理拦截所有进出流量
                • 应用无需感知代理存在

                场景 3:初始化和主容器协作

                spec:
                  initContainers:
                  - name: init-config
                    image: busybox
                    command: ['sh', '-c', 'echo "config" > /config/app.conf']
                    volumeMounts:
                    - name: config
                      mountPath: /config
                  containers:
                  - name: app
                    image: myapp
                    volumeMounts:
                    - name: config
                      mountPath: /config
                • Pause 容器在整个过程中保持网络和 IPC 命名空间稳定

                🔧 查看和调试 Pause 容器

                查看 Pod 中的 Pause 容器

                # 方式 1:通过 crictl (推荐)
                crictl pods
                crictl ps -a | grep pause
                
                # 方式 2:通过 docker (如果使用 Docker 运行时)
                docker ps -a | grep pause
                docker inspect <pause-container-id>
                
                # 方式 3:查看 Pod 的容器列表
                kubectl get pod <pod-name> -o jsonpath='{.status.containerStatuses[*].name}'

                Pause 镜像版本

                # 查看 kubelet 使用的 pause 镜像
                kubectl get pod <pod-name> -o yaml | grep pause
                
                # 常见版本:
                # registry.k8s.io/pause:3.9 (最新)
                # registry.k8s.io/pause:3.8
                # k8s.gcr.io/pause:3.7

                ⚠️ 常见问题

                Q1: 为什么看不到 Pause 容器?

                 A: 使用 kubectl get pods 不会显示 Pause 容器,它对用户是透明的。需要使用底层运行时命令(如 crictl 或 docker ps)才能看到。

                Q2: Pause 容器会影响性能吗?

                A: 几乎不会。Pause 容器只占用约 0.5MB 内存,CPU 使用为 0,不执行任何业务逻辑。

                Q3: 可以删除 Pause 容器吗?

                A: 不能手动删除。删除 Pause 容器会导致整个 Pod 被销毁。

                Q4: 不同 Pod 的 Pause 容器是否共享?

                A: 不共享。每个 Pod 都有独立的 Pause 容器,确保 Pod 之间的网络和命名空间隔离。


                📝 总结

                 | 用途 | 说明 |
                 | --- | --- |
                 | 命名空间持有者 | 创建并持有 Network、IPC、UTS 命名空间 |
                 | 网络基础 | 使 Pod 内所有容器共享同一 IP 和网络栈 |
                 | 生命周期锚点 | 代表 Pod 的生命周期,容器可独立重启 |
                 | 简化架构 | 解耦容器间依赖,避免级联故障 |
                 | 资源高效 | 极小的镜像和资源占用 |

                核心价值: Pause 容器是 Kubernetes Pod 抽象的基石,让多个容器能像在同一主机上一样协作,同时保持各自的独立性和可重启性。


                Mar 7, 2024

                Pod在K8S中DNS解析流程和顺序

                核心概念

                1. CoreDNS: 从Kubernetes 1.11开始,CoreDNS是默认的DNS服务。它作为一个或多个Pod运行在kube-system命名空间下,并配有一个Kubernetes Service(通常叫kube-dns)。
                2. resolv.conf 文件: 每个Pod的/etc/resolv.conf文件是DNS解析的蓝图。Kubelet会自动生成这个文件并挂载到Pod中。
                3. DNS策略: 你可以通过Pod Spec中的dnsPolicy字段来配置DNS策略。

                Pod 的 /etc/resolv.conf 解析

                这是一个典型的Pod内的/etc/resolv.conf文件内容:

                nameserver 10.96.0.10
                search <namespace>.svc.cluster.local svc.cluster.local cluster.local
                options ndots:5

                让我们逐行分析:

                1. nameserver 10.96.0.10

                • 这是CoreDNS Service的集群IP地址。所有Pod的DNS查询默认都会发送到这个地址。
                • 这个IP来自kubelet的--cluster-dns标志,在启动时确定。

                2. search <namespace>.svc.cluster.local svc.cluster.local cluster.local

                • 搜索域列表。当你使用不完整的域名(即不是FQDN)时,系统会按照这个列表的顺序,依次将搜索域附加到主机名后面,直到找到匹配的记录。
                • <namespace>是你的Pod所在的命名空间,例如default
                • 搜索顺序
                  • <pod-namespace>.svc.cluster.local
                  • svc.cluster.local
                  • cluster.local

                3. options ndots:5

                • 这是一个关键的优化/控制选项。
                • 规则: 如果一个域名中的点(.)数量大于或等于这个值(这里是5),系统会将其视为绝对域名(FQDN),并首先尝试直接解析,不会走搜索域列表。
                • 反之,如果点数少于5,系统会依次尝试搜索域,如果都失败了,最后再尝试名称本身。

                DNS 解析流程与顺序(详解)

                假设你的Pod在default命名空间,并且resolv.conf如上所示。

                场景1:解析Kubernetes Service(短名称)

                你想解析同一个命名空间下的Service:my-svc

                1. 应用程序请求解析 my-svc
                2. 系统检查名称 my-svc,点数(0) < 5。
                3. 进入搜索流程
                  • 第一次尝试: my-svc.default.svc.cluster.local -> 成功! 返回ClusterIP。
                  • 解析结束。

                场景2:解析不同命名空间的Service

                你想解析另一个命名空间prod下的Service:my-svc.prod

                1. 应用程序请求解析 my-svc.prod
                2. 系统检查名称 my-svc.prod,点数(1) < 5。
                3. 进入搜索流程
                  • 第一次尝试: my-svc.prod.default.svc.cluster.local -> 失败(因为该Service不在default命名空间)。
                  • 第二次尝试: my-svc.prod.svc.cluster.local -> 成功! 返回ClusterIP。
                  • 解析结束。

                场景3:解析外部域名(例如 www.google.com

                1. 应用程序请求解析 www.google.com
                 2. 系统检查名称 www.google.com,点数(2) < 5。
                3. 进入搜索流程
                  • 第一次尝试: www.google.com.default.svc.cluster.local -> 失败
                  • 第二次尝试: www.google.com.svc.cluster.local -> 失败
                  • 第三次尝试: www.google.com.cluster.local -> 失败
                4. 所有搜索域都失败了,系统最后尝试名称本身:www.google.com -> 成功! CoreDNS会将其转发给上游DNS服务器(例如宿主机上的DNS或网络中配置的DNS)。

                场景4:解析被认为是FQDN的域名(点数 >= 5)

                假设你有一个StatefulSet,Pod的FQDN是web-0.nginx.default.svc.cluster.local

                1. 应用程序请求解析 web-0.nginx.default.svc.cluster.local
                 2. 系统检查名称:它一共包含 5 个点(web-0、nginx、default、svc、cluster、local 之间),已达到 ndots:5 的阈值,因此会先按绝对域名直接查询,通常一次即可命中。
                   • 但如果应用使用更短的形式(例如 web-0.nginx.default.svc,只有 3 个点),就会先走搜索流程,产生 web-0.nginx.default.svc.default.svc.cluster.local 这类注定失败的查询。
                   • 为了避免这种低效行为,最佳实践是在应用程序中配置或使用绝对域名(尾部带点),或适当调低 ndots。

                绝对域名示例: 应用程序请求解析 web-0.nginx.default.svc.cluster.local.(注意最后有一个点)。

                • 系统识别其为FQDN,直接查询,不经过任何搜索域。这是最有效的方式。

                DNS 策略

                Pod的dnsPolicy字段决定了如何生成resolv.conf

                • ClusterFirst(默认): DNS查询首先被发送到Kubernetes集群的CoreDNS。如果域名不在集群域内(例如cluster.local),查询会被转发到上游DNS。
                • ClusterFirstWithHostNet: 对于使用hostNetwork: true的Pod,如果你想让它使用集群DNS,就需要设置这个策略。
                • Default: Pod直接从宿主机继承DNS配置(即使用宿主的/etc/resolv.conf)。这意味着它不会使用CoreDNS。
                • None: 忽略所有默认的DNS设置。你必须使用dnsConfig字段来提供自定义的DNS配置。

                总结与流程图

                解析顺序可以简化为以下决策流程:

                flowchart TD
                    A[应用程序发起DNS查询] --> B{查询名称的<br>点数 '.' >= 5?}
                    
                    B -- 是<br>(视为FQDN) --> C[直接查询该名称]
                    C --> D{解析成功?}
                    D -- 是 --> E[返回结果]
                    D -- 否 --> F[解析失败]
                    
                    B -- 否<br>(视为短名称) --> G
                    subgraph G [循环搜索域列表]
                        direction LR
                        H[依次将搜索域附加<br>到名称后并查询] --> I{解析成功?}
                        I -- 是 --> J[返回结果]
                    end
                    
                    I -- 循环结束仍失败 --> K[直接查询原始名称]
                    K --> L{解析成功?}
                    L -- 是 --> E
                    L -- 否 --> F

                关键要点:

                1. 默认流向: Pod -> CoreDNS Service -> CoreDNS Pod -> (根据域判断)返回K8s记录或转发到上游DNS。
                2. 搜索域顺序: 命名空间 -> svc -> cluster.local
                3. ndots:5的影响: 这是为了在便利性和性能之间取得平衡。对于需要频繁访问的外部域名,为了性能最好在应用程序中配置FQDN(尾部带点)或调整ndots选项。
                 4. 调试技巧: 进入 Pod 并执行 cat /etc/resolv.conf,配合 nslookup、dig 命令,是诊断 DNS 问题的第一步。
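
                 针对要点 3,如果确实需要调整 ndots,可以在 Pod 级别通过 dnsConfig 覆盖,下面是一个最小示意(镜像与命名仅作演示):

                 kubectl apply -f - <<'EOF'
                 apiVersion: v1
                 kind: Pod
                 metadata:
                   name: ndots-demo
                 spec:
                   dnsPolicy: ClusterFirst
                   dnsConfig:
                     options:
                     - name: ndots
                       value: "2"          # 降低 ndots,减少解析外部域名时的无效搜索查询
                   containers:
                   - name: app
                     image: busybox
                     command: ["sleep", "3600"]
                 EOF
                 
                 # 验证 kubelet 生成的 resolv.conf
                 kubectl exec ndots-demo -- cat /etc/resolv.conf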
                Mar 7, 2024

                当执行kubectl exec 命令时,发生了什么?

                kubectl exec 的实现原理涉及多个组件协同工作,以下是详细原理分析:

                1. 整体架构流程

                用户 -> kubectl -> API Server -> Kubelet -> 容器运行时 -> 目标容器

                2. 详细执行步骤

                步骤1:kubectl 客户端处理

                kubectl exec -it <pod-name> -- /bin/bash
                • kubectl 解析命令参数
                • 构造 Exec API 请求
                • 建立与 API Server 的长连接

                步骤2:API Server 处理

                // API 路径示例
                POST /api/v1/namespaces/{namespace}/pods/{name}/exec
                • 认证和授权检查
                • 验证用户是否有 exec 权限
                • 查找目标 Pod 所在节点
                • 将请求代理到对应节点的 Kubelet
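
                 这一步可以从客户端侧直观验证:pods/exec 是一个子资源,权限与实际请求路径都能直接观察到(示意,Pod 名 mypod 为假设):

                 # 检查当前用户是否有 exec 权限(对应 pods/exec 子资源的 create 动作)
                 kubectl auth can-i create pods/exec -n default
                 
                 # 提高日志级别,可以看到 kubectl 实际发出的 POST .../pods/mypod/exec 请求
                 kubectl exec -v=8 mypod -- ls / 2>&1 | grep "pods/mypod/exec"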

                步骤3:Kubelet 处理

                // Kubelet 的 exec 处理逻辑
                func (h *ExecHandler) serveExec(w http.ResponseWriter, req *http.Request) {
                    // 获取容器信息
                    // 调用容器运行时接口
                    // 建立数据流传输
                }
                • 通过 CRI(Container Runtime Interface)调用容器运行时
                • 创建到容器的连接
                • 管理标准输入、输出、错误流

                步骤4:容器运行时执行

                // CRI 接口定义
                service RuntimeService {
                    rpc Exec(ExecRequest) returns (ExecResponse) {}
                }
                • Docker: 使用 docker exec 底层机制
                • Containerd: 通过 task 执行命令
                • CRI-O: 通过 conmon 管理执行会话

                3. 关键技术机制

                3.1 流式传输协议

                // 使用 SPDY 或 WebSocket 协议
                // 支持多路复用的数据流
                type StreamProtocol interface {
                    Stream(stdin io.Reader, stdout, stderr io.Writer) error
                }

                3.2 终端处理(TTY)

                // 伪终端配置
                type ExecOptions struct {
                    Stdin     io.Reader
                    Stdout    io.Writer
                    Stderr    io.Writer
                    TTY       bool
                    ptyMaster *os.File
                }

                3.3 会话管理

                // ExecSession 管理执行会话
                type ExecSession struct {
                    id        string
                    stdinPipe io.WriteCloser
                    stdoutPipe io.ReadCloser
                    stderrPipe io.ReadCloser
                    done      chan struct{}
                }

                4. 网络通信流程

                客户端 (kubectl)
                    ↓ HTTPS with SPDY/WebSocket
                API Server
                    ↓ 代理连接
                Kubelet (节点)
                    ↓ CRI gRPC
                容器运行时
                    ↓ 容器命名空间
                目标容器进程

                5. 安全机制

                5.1 认证授权

                # RBAC 配置示例
                apiVersion: rbac.authorization.k8s.io/v1
                kind: ClusterRole
                metadata:
                  name: pod-exec
                rules:
                - apiGroups: [""]
                  resources: ["pods/exec"]
                  verbs: ["create"]

                5.2 安全上下文

                // 安全配置
                securityContext := &v1.SecurityContext{
                    RunAsUser:  &uid,
                    RunAsGroup: &gid,
                    Capabilities: &v1.Capabilities{
                        Drop: []v1.Capability{"ALL"},
                    },
                }

                6. 实际代码示例

                kubectl 端实现

                func (o *ExecOptions) Run() error {
                    // 建立与 API Server 的连接
                    executor, err := remotecommand.NewSPDYExecutor(
                        o.Config, "POST", req.URL())
                    
                    // 执行命令
                    return executor.Stream(remotecommand.StreamOptions{
                        Stdin:  o.In,
                        Stdout: o.Out,
                        Stderr: o.ErrOut,
                        Tty:    o.TTY,
                    })
                }

                Kubelet 端处理

                func (h *ExecHandler) serveExec(w http.ResponseWriter, req *http.Request) {
                    // 获取容器 ID
                    containerID := podContainer.ContainerID
                    
                    // 通过 CRI 执行命令
                    execRequest := &runtimeapi.ExecRequest{
                        ContainerId: containerID.ID,
                        Cmd:         cmd,
                        Tty:         tty,
                        Stdin:       stdin,
                        Stdout:      stdout,
                        Stderr:      stderr,
                    }
                    
                    // 调用容器运行时
                    runtimeService.Exec(execRequest)
                }

                7. 容器运行时差异

                Docker

                // 使用 Docker Engine API
                client.ContainerExecCreate()
                client.ContainerExecAttach()

                Containerd

                // 使用 CRI 插件
                task.Exec()

                8. 故障排查要点

                1. 权限问题: 检查 RBAC 配置
                2. 网络连通性: API Server ↔ Kubelet 网络
                3. 容器状态: 目标容器必须处于 Running 状态
                4. 资源限制: 容器资源是否充足
                5. 安全策略: Pod Security Policies 限制

                这种设计使得 kubectl exec 能够在分布式环境中安全、可靠地执行容器内命令,同时保持了良好的用户体验。

                Mar 7, 2024

                QoS 详解

                Kubernetes QoS (Quality of Service) 等级详解

                QoS 等级是 Kubernetes 用来管理 Pod 资源和在资源不足时决定驱逐优先级的机制。


                🎯 三种 QoS 等级

                Kubernetes 根据 Pod 的资源配置自动分配 QoS 等级,共有三种:

                1. Guaranteed (保证型) - 最高优先级

                2. Burstable (突发型) - 中等优先级

                3. BestEffort (尽力而为型) - 最低优先级
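
                 QoS 等级不需要手动指定,Kubernetes 会根据资源配置自动计算并写入 Pod 状态,可以直接查询(示意,Pod 名称为假设):

                 # 查看单个 Pod 的 QoS 等级
                 kubectl get pod guaranteed-pod -o jsonpath='{.status.qosClass}{"\n"}'
                 
                 # 批量查看命名空间内所有 Pod 的 QoS 等级
                 kubectl get pods -o custom-columns=NAME:.metadata.name,QOS:.status.qosClass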


                📊 QoS 等级详解

                1️⃣ Guaranteed (保证型)

                定义条件(必须同时满足)

                 • Pod 中每个容器(包括 Init 容器)都必须设置 requests 和 limits
                • 对于每个容器,CPU 和内存的 requests 必须等于 limits

                YAML 示例

                apiVersion: v1
                kind: Pod
                metadata:
                  name: guaranteed-pod
                spec:
                  containers:
                  - name: app
                    image: nginx
                    resources:
                      requests:
                        memory: "200Mi"
                        cpu: "500m"
                      limits:
                        memory: "200Mi"  # 必须等于 requests
                        cpu: "500m"      # 必须等于 requests

                特点

                资源保证:Pod 获得请求的全部资源,不会被其他 Pod 抢占
                最高优先级:资源不足时最后被驱逐
                性能稳定:资源使用可预测,适合关键业务
                OOM 保护:不会因为节点内存压力被 Kill(除非超过自己的 limit)

                适用场景

                • 数据库(MySQL, PostgreSQL, Redis)
                • 消息队列(Kafka, RabbitMQ)
                • 核心业务应用
                • 有状态服务

                2️⃣ Burstable (突发型)

                定义条件(满足以下任一条件)

• Pod 中至少有一个容器设置了 requests 或 limits
• requests 和 limits 不相等
                • 部分容器设置了资源限制,部分没有

                YAML 示例

                场景 1:只设置 requests

                apiVersion: v1
                kind: Pod
                metadata:
                  name: burstable-pod-1
                spec:
                  containers:
                  - name: app
                    image: nginx
                    resources:
                      requests:
                        memory: "100Mi"
                        cpu: "200m"
                      # 没有设置 limits,可以使用超过 requests 的资源

                场景 2:requests < limits

                apiVersion: v1
                kind: Pod
                metadata:
                  name: burstable-pod-2
                spec:
                  containers:
                  - name: app
                    image: nginx
                    resources:
                      requests:
                        memory: "100Mi"
                        cpu: "200m"
                      limits:
                        memory: "500Mi"  # 允许突发到 500Mi
                        cpu: "1000m"     # 允许突发到 1 核

                场景 3:混合配置

                apiVersion: v1
                kind: Pod
                metadata:
                  name: burstable-pod-3
                spec:
                  containers:
                  - name: app1
                    image: nginx
                    resources:
                      requests:
                        memory: "100Mi"
                      limits:
                        memory: "200Mi"
                  - name: app2
                    image: busybox
                    resources:
                      requests:
                        cpu: "100m"
                      # 只设置 CPU,没有内存限制

                特点

                弹性使用:可以使用超过 requests 的资源(burst)
                ⚠️ 中等优先级:资源不足时,在 BestEffort 之后被驱逐
                ⚠️ 可能被限流:超过 limits 会被限制(CPU)或 Kill(内存)
                成本优化:平衡资源保证和利用率

                适用场景

                • Web 应用(流量有波峰波谷)
                • 定时任务
                • 批处理作业
                • 微服务(大部分场景)

                3️⃣ BestEffort (尽力而为型)

                定义条件

• Pod 中所有容器都没有设置 requests 和 limits

                YAML 示例

                apiVersion: v1
                kind: Pod
                metadata:
                  name: besteffort-pod
                spec:
                  containers:
                  - name: app
                    image: nginx
                    # 完全没有 resources 配置
                  - name: sidecar
                    image: busybox
                    # 也没有 resources 配置

                特点

                无资源保证:能用多少资源完全看节点剩余
                最低优先级:资源不足时第一个被驱逐
                性能不稳定:可能被其他 Pod 挤占资源
                灵活性高:可以充分利用节点空闲资源

                适用场景

                • 开发测试环境
                • 非关键后台任务
                • 日志收集(可以容忍中断)
                • 临时性工作负载

                🔍 QoS 等级判定流程图

                开始
                  │
                  ├─→ 所有容器都没设置 requests/limits?
                  │   └─→ 是 → BestEffort
                  │
                  ├─→ 所有容器的 requests == limits (CPU和内存)?
                  │   └─→ 是 → Guaranteed
                  │
                  └─→ 其他情况 → Burstable

                🚨 资源不足时的驱逐顺序

                当节点资源不足(如内存压力)时,Kubelet 按以下顺序驱逐 Pod:

                驱逐优先级(从高到低):
                
                1. BestEffort Pod
                   └─→ 超出 requests 最多的先被驱逐
                
                2. Burstable Pod
                   └─→ 按内存使用量排序
                   └─→ 超出 requests 越多,越先被驱逐
                
                3. Guaranteed Pod (最后才驱逐)
                   └─→ 只有在没有其他选择时才驱逐

                实际驱逐示例

                # 节点内存不足场景:
                节点总内存: 8GB
                已用内存: 7.8GB (达到驱逐阈值)
                
                Pod 列表:
                - Pod A (BestEffort): 使用 1GB 内存 → 第一个被驱逐 ❌
                - Pod B (Burstable):  requests=200Mi, 使用 500Mi → 第二个 ❌
                - Pod C (Burstable):  requests=500Mi, 使用 600Mi → 第三个 ❌
                - Pod D (Guaranteed): requests=limits=1GB, 使用 1GB → 保留 ✅

                📝 查看 Pod 的 QoS 等级

                方法 1:使用 kubectl describe

                kubectl describe pod <pod-name>
                
                # 输出中会显示:
                # QoS Class:       Burstable

                方法 2:使用 kubectl get

                # 查看所有 Pod 的 QoS
                kubectl get pods -o custom-columns=NAME:.metadata.name,QOS:.status.qosClass
                
                # 输出:
                # NAME              QOS
                # nginx-guaranteed  Guaranteed
                # app-burstable     Burstable
                # test-besteffort   BestEffort

                方法 3:使用 YAML 输出

                kubectl get pod <pod-name> -o yaml | grep qosClass
                
                # 输出:
                # qosClass: Burstable
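
如果想在程序里批量检查,也可以用 client-go 直接读取 Pod 的 status.qosClass。下面是一个仅作示意的最小示例(假设 kubeconfig 在默认路径、目标命名空间为 default):

package main

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	config, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	clientset, err := kubernetes.NewForConfig(config)
	if err != nil {
		panic(err)
	}
	// 列出 default 命名空间下所有 Pod 及其 QoS 等级
	pods, err := clientset.CoreV1().Pods("default").List(context.TODO(), metav1.ListOptions{})
	if err != nil {
		panic(err)
	}
	for _, p := range pods.Items {
		fmt.Printf("%s\t%s\n", p.Name, p.Status.QOSClass)
	}
}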

                🎨 QoS 配置最佳实践

                生产环境推荐配置

                关键业务 - Guaranteed

                apiVersion: apps/v1
                kind: Deployment
                metadata:
                  name: critical-app
                spec:
                  replicas: 3
                  template:
                    spec:
                      containers:
                      - name: app
                        image: myapp:v1
                        resources:
                          requests:
                            memory: "2Gi"
                            cpu: "1000m"
                          limits:
                            memory: "2Gi"      # requests == limits
                            cpu: "1000m"

                一般业务 - Burstable

                apiVersion: apps/v1
                kind: Deployment
                metadata:
                  name: web-app
                spec:
                  replicas: 5
                  template:
                    spec:
                      containers:
                      - name: web
                        image: nginx:latest
                        resources:
                          requests:
                            memory: "256Mi"    # 保证最低资源
                            cpu: "200m"
                          limits:
                            memory: "512Mi"    # 允许突发到 2 倍
                            cpu: "500m"

                后台任务 - BestEffort 或 Burstable

                apiVersion: batch/v1
                kind: CronJob
                metadata:
                  name: cleanup-job
                spec:
                  schedule: "0 2 * * *"
                  jobTemplate:
                    spec:
                      template:
                        spec:
                          containers:
                          - name: cleanup
                            image: cleanup:v1
                            resources:
                              requests:
                                memory: "128Mi"
                                cpu: "100m"
                              # 不设置 limits,允许使用空闲资源

                🔧 QoS 与资源限制的关系

                CPU 限制行为

                resources:
                  requests:
                    cpu: "500m"    # 保证至少 0.5 核
                  limits:
                    cpu: "1000m"   # 最多使用 1 核
                • requests:节点调度的依据,保证的资源
                • limits:硬限制,超过会被限流(throttle),但不会被 Kill
                • 超过 limits 时,进程会被 CPU throttle,导致性能下降

                内存限制行为

                resources:
                  requests:
                    memory: "256Mi"  # 保证至少 256Mi
                  limits:
                    memory: "512Mi"  # 最多使用 512Mi
                • requests:调度保证,但可以使用更多
                • limits:硬限制,超过会触发 OOM Kill 💀
                • Pod 会被标记为 OOMKilled 并重启

                🛠️ 常见问题

                Q1: 为什么我的 Pod 总是被驱逐?

                # 检查 QoS 等级
                kubectl get pod <pod-name> -o yaml | grep qosClass
                
                # 如果是 BestEffort 或 Burstable,建议:
                # 1. 设置合理的 requests
                # 2. 考虑升级到 Guaranteed(关键服务)
                # 3. 增加节点资源

                Q2: 如何为所有 Pod 设置默认资源限制?

                # 使用 LimitRange
                apiVersion: v1
                kind: LimitRange
                metadata:
                  name: default-limits
                  namespace: default
                spec:
                  limits:
                  - default:              # 默认 limits
                      cpu: "500m"
                      memory: "512Mi"
                    defaultRequest:       # 默认 requests
                      cpu: "100m"
                      memory: "128Mi"
                    type: Container

                Q3: Guaranteed Pod 也会被驱逐吗?

                会! 但只在以下情况:

                • 使用超过自己的 limits(OOM Kill)
                • 节点完全不可用(如节点宕机)
                • 手动删除 Pod
                • DaemonSet 或系统级 Pod 需要资源

                Q4: 如何监控 QoS 相关的问题?

                # 查看节点资源压力
                kubectl describe node <node-name> | grep -A 5 "Conditions:"
                
                # 查看被驱逐的 Pod
                kubectl get events --field-selector reason=Evicted
                
                # 查看 OOM 事件
                kubectl get events --field-selector reason=OOMKilling

                📊 QoS 等级对比表

| 维度 | Guaranteed | Burstable | BestEffort |
| --- | --- | --- | --- |
| 配置要求 | requests=limits | requests≠limits 或部分配置 | 无配置 |
| 资源保证 | ✅ 完全保证 | ⚠️ 部分保证 | ❌ 无保证 |
| 驱逐优先级 | 最低(最后驱逐) | 中等 | 最高(第一个驱逐) |
| 性能稳定性 | ⭐⭐⭐⭐⭐ | ⭐⭐ | ⭐ |
| 资源利用率 | 低(固定资源) | 高(可突发) | 最高(充分利用) |
| 成本 | 高 | 中 | 低 |
| 适用场景 | 关键业务 | 一般业务 | 测试/临时任务 |

                🎯 选型建议

                使用 Guaranteed 的场景

                • 🗄️ 数据库(MySQL, MongoDB, Cassandra)
                • 📨 消息队列(Kafka, RabbitMQ)
                • 🔐 认证服务
                • 💰 支付系统
                • 📊 实时数据处理

                使用 Burstable 的场景

                • 🌐 Web 应用(80% 的场景)
                • 🔄 API 服务
                • 🎨 前端应用
                • 📦 微服务
                • ⚙️ 后台处理

                使用 BestEffort 的场景

                • 🧪 开发测试
                • 📝 日志收集(可容忍中断)
                • 🔍 数据探索
                • 🛠️ 一次性脚本

                💡 关键要点总结

                1. QoS 是自动分配的,不能手动指定,由资源配置决定
2. Guaranteed ≠ 不会被驱逐,只是被驱逐的优先级最低(最后才会被考虑驱逐)
                3. 生产环境建议至少使用 Burstable,避免 BestEffort
                4. requests 影响调度,limits 影响运行时限制
                5. 内存超限会 OOM,CPU 超限会限流
                6. 使用 LimitRange 强制资源限制,避免 BestEffort Pod
                Mar 7, 2024

                Scheduler

                Kubernetes 调度器(kube-scheduler) 是整个系统中非常关键的组件,它负责决定 哪个 Pod 应该运行在哪个 Node 上

                下面我会分层、逐步详细说明 K8s 调度流程(以 v1.28+ 为例),并解释背后机制。


                🌐 整体架构概览

                Kubernetes 调度器主要完成以下职责:

                1. 监听待调度的 Pod(即 spec.nodeName 为空的 Pod)
                2. 为 Pod 选择最合适的 Node
                3. 将绑定结果写回到 apiserver

                🧩 一、调度总体流程

                Kubernetes 调度流程主要分为三个阶段:

                [Pending Pod] --> [Scheduling Queue] 
                     ↓
                 [PreFilter] → [Filter] → [PostFilter] → [Score] → [Reserve] → [Permit] → [Bind]

                1️⃣ 调度入口:监听未绑定的 Pod

                • Scheduler 通过 informer 监听所有 Pod 资源。
                • 当发现 Pod 没有 spec.nodeName 时,认为它是待调度的。
                • Pod 被放入 调度队列(SchedulingQueue) 中。

                🧮 二、调度核心阶段详解

                🧩 1. PreFilter 阶段

                在调度之前,对 Pod 进行一些准备性检查,例如:

                • 解析 Pod 所需的资源。
                • 检查 PVC、Affinity、Taint/Toleration 是否合理。
                • 计算调度所需的 topology spread 信息。

                🧠 类似于“预处理”,提前准备好过滤阶段要用的数据。


                🧩 2. Filter 阶段(Predicates)

                Scheduler 遍历所有可调度的 Node,筛选出满足条件的节点。

                常见的过滤插件包括:

| 插件 | 作用 |
| --- | --- |
| NodeUnschedulable | 过滤掉被标记 unschedulable 的节点 |
| NodeName | 如果 Pod 指定了 nodeName,只匹配该节点 |
| TaintToleration | 检查 taint / toleration 是否匹配 |
| NodeAffinity / PodAffinity | 检查亲和性/反亲和性 |
| NodeResourcesFit | 检查 CPU/Memory 等资源是否够用 |
| VolumeBinding | 检查 Pod 使用的 PVC 是否能在节点挂载 |

                🔎 输出结果:

                得到一个候选节点列表(通常是几十个或几百个)。


                🧩 3. PostFilter 阶段

                • 若没有节点符合条件(即调度失败),进入 抢占逻辑(Preemption)
                • 调度器会尝试在某些节点上“抢占”低优先级的 Pod,以便高优先级 Pod 能调度成功。

                🧩 4. Score 阶段(优选)

                对剩余候选节点进行打分。 每个插件给节点打分(0–100),然后汇总加权。

                常见的评分插件:

| 插件 | 作用 |
| --- | --- |
| LeastAllocated | 资源使用最少的节点得分高 |
| BalancedAllocation | CPU 和内存使用更均衡的节点得分高 |
| NodeAffinity | 符合 affinity 的节点加分 |
| ImageLocality | 本地已缓存镜像的节点加分 |
| InterPodAffinity | 满足 Pod 间亲和性需求的节点加分 |

                这些分数会经过 NormalizeScore 标准化到统一范围后求和。

                最终输出:

                最优节点(Score最高)


                🧩 5. Reserve 阶段

                暂时在该节点上 预留资源(在调度器内部缓存中标记),防止并发调度冲突。

                如果后续失败,会执行 Unreserve 回滚。


                🧩 6. Permit 阶段

                某些插件可在绑定前再进行校验或等待,例如:

                • PodGroup(批调度)
                • Scheduler Framework 自定义策略

                可能返回:

                • Success → 继续绑定
                • Wait → 等待事件
                • Reject → 放弃调度

                🧩 7. Bind 阶段

                最终由 Bind 插件 调用 API 将 Pod 绑定到 Node:

                spec:
                  nodeName: node-123

                绑定成功后,kubelet 监听到该 Pod,会拉起容器。


                ⚙️ 三、调度框架(Scheduler Framework)

                K8s 1.19+ 后,调度器是通过 插件化框架 实现的。 每个阶段都有对应的插件点:

| 阶段 | 插件接口 | 示例插件 |
| --- | --- | --- |
| PreFilter | PreFilterPlugin | PodTopologySpread |
| Filter | FilterPlugin | NodeAffinity |
| Score | ScorePlugin | LeastAllocated |
| Reserve | ReservePlugin | VolumeBinding |
| Bind | BindPlugin | DefaultBinder |

                你可以通过写一个自定义调度插件(Go)扩展调度逻辑。
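
下面是一个极简的自定义 Score 插件骨架,仅作示意(基于 k8s.io/kubernetes/pkg/scheduler/framework 的插件接口;插件名 SampleScore 与"节点上 Pod 越少得分越高"的打分逻辑均为假设,实际使用时还需注册 New 构造函数并在 KubeSchedulerConfiguration 中启用):

package sample

import (
	"context"

	v1 "k8s.io/api/core/v1"
	"k8s.io/kubernetes/pkg/scheduler/framework"
)

// SampleScore 是一个假设的 Score 插件,演示打分扩展点的形态
type SampleScore struct {
	handle framework.Handle
}

var _ framework.ScorePlugin = &SampleScore{}

func (s *SampleScore) Name() string { return "SampleScore" }

// Score 为每个候选节点返回 0-100 之间的分数
func (s *SampleScore) Score(ctx context.Context, state *framework.CycleState, pod *v1.Pod, nodeName string) (int64, *framework.Status) {
	nodeInfo, err := s.handle.SnapshotSharedLister().NodeInfos().Get(nodeName)
	if err != nil {
		return 0, framework.AsStatus(err)
	}
	// 假设的逻辑:节点上已运行的 Pod 越少,得分越高
	score := framework.MaxNodeScore - int64(len(nodeInfo.Pods))
	if score < 0 {
		score = 0
	}
	return score, nil
}

// ScoreExtensions 返回 nil 表示不需要额外的 NormalizeScore 逻辑
func (s *SampleScore) ScoreExtensions() framework.ScoreExtensions { return nil }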


                🧭 四、调度失败的情况

                常见调度失败的原因:

| 原因 | 表现 |
| --- | --- |
| 所有节点资源不足 | Pod 一直 Pending |
| 亲和性限制太严格 | Pod 无法找到符合要求的节点 |
| PVC 无法绑定 | VolumeBinding 阶段失败 |
| 节点被打 taint | 没有 toleration |
| 镜像拉取失败 | Pod 已绑定但容器起不来(kubelet 问题) |

                🧠 五、总结

| 阶段 | 目的 | 关键点 |
| --- | --- | --- |
| SchedulingQueue | 缓冲待调度 Pod | FIFO + 优先级调度 |
| PreFilter | 准备数据 | 校验 Pod 需求 |
| Filter | 过滤节点 | 资源与约束 |
| Score | 打分选优 | 平衡与局部性 |
| Reserve | 预留资源 | 防并发冲突 |
| Bind | 绑定 Node | 调度结果落地 |

                Mar 7, 2024

                服务发现

                最常见的说法是 “两种核心机制”,但这指的是服务发现的两种基本模式,而不是具体的实现方式。


                维度一:两种核心模式

                这是从服务发现的基本原理上划分的。

                1. 基于客户端服务发现

                  • 工作原理:客户端(服务消费者)通过查询一个中心化的服务注册中心(如 Consul、Eureka、Zookeeper)来获取所有可用服务实例的列表(通常是 IP 和端口),然后自己选择一个实例并直接向其发起请求。
                  • 类比:就像你去餐厅吃饭,先看门口的电子菜单(服务注册中心)了解所有菜品和价格,然后自己决定点什么,再告诉服务员。
                  • 特点:客户端需要内置服务发现逻辑,与服务注册中心耦合。这种方式更灵活,但增加了客户端的复杂性。
                2. 基于服务端服务发现

                  • 工作原理:客户端不关心具体的服务实例,它只需要向一个固定的访问端点(通常是 Load Balancer 或 Proxy,如 Kubernetes Service)发起请求。这个端点负责去服务注册中心查询可用实例,并进行负载均衡,将请求转发给其中一个。
                  • 类比:就像你去餐厅直接告诉服务员“来份招牌菜”,服务员(负载均衡器)帮你和后厨(服务实例)沟通,最后把菜端给你。
                  • 特点:客户端无需知道服务发现的具体细节,简化了客户端。这是 Kubernetes 默认采用的方式

                维度二:Kubernetes 中具体的实现方式

                在 Kubernetes 内部,我们通常讨论以下几种具体的服务发现实现手段,它们共同构成了 Kubernetes 强大的服务发现能力。

                1. 环境变量

                当 Pod 被调度到某个节点上时,kubelet 会为当前集群中存在的每个 Service 添加一组环境变量到该 Pod 中。

                • 格式{SVCNAME}_SERVICE_HOST{SVCNAME}_SERVICE_PORT
                • 例子:一个名为 redis-master 的 Service 会生成 REDIS_MASTER_SERVICE_HOST=10.0.0.11REDIS_MASTER_SERVICE_PORT=6379 这样的环境变量。
                • 局限性:环境变量必须在 Pod 创建之前就存在。后创建的 Service 无法将环境变量注入到已运行的 Pod 中。因此,这通常作为辅助手段
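
一个仅作示意的读取方式(沿用上文 redis-master 的例子,变量名按该格式推导):

package main

import (
	"fmt"
	"os"
)

func main() {
	// kubelet 注入的格式:{SVCNAME}_SERVICE_HOST / {SVCNAME}_SERVICE_PORT
	host := os.Getenv("REDIS_MASTER_SERVICE_HOST")
	port := os.Getenv("REDIS_MASTER_SERVICE_PORT")
	fmt.Printf("redis-master 地址: %s:%s\n", host, port)
}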

                2. DNS(最核心、最推荐的方式)

                这是 Kubernetes 最主要和最优雅的服务发现方式。

                • 工作原理:Kubernetes 集群内置了一个 DNS 服务器(通常是 CoreDNS)。当你创建一个 Service 时,Kubernetes 会自动为这个 Service 注册一个 DNS 记录。
                • DNS 记录格式
                  • 同一命名空间<service-name>.<namespace>.svc.cluster.local -> 指向 Service 的 Cluster IP。
                    • 在同一个命名空间内,你可以直接使用 <service-name> 来访问服务。例如,前端 Pod 访问后端服务,只需使用 http://backend-service
                  • 不同命名空间:需要使用全限定域名,例如 backend-service.production.svc.cluster.local
                • 优点:行为符合标准,应用无需修改代码,直接使用域名即可访问其他服务。
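
在集群内的 Pod 中,用标准库就能完成这种解析。下面是仅作示意的示例(backend-service 与命名空间 production 沿用上文):

package main

import (
	"fmt"
	"net"
)

func main() {
	// ClusterIP Service 通常解析到一个虚拟 IP;Headless Service 则返回所有后端 Pod IP
	ips, err := net.LookupIP("backend-service.production.svc.cluster.local")
	if err != nil {
		panic(err)
	}
	for _, ip := range ips {
		fmt.Println(ip.String())
	}
}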

                3. Kubernetes Service

                Service 资源对象本身就是服务发现的载体。它提供了一个稳定的访问端点(VIP 或 DNS 名称),背后对应一组动态变化的 Pod。

                • ClusterIP:默认类型,提供一个集群内部的虚拟 IP,只能从集群内部访问。结合 DNS 使用,是服务间通信的基石。
                • NodePort:在 ClusterIP 基础上,在每个节点上暴露一个静态端口。可以从集群外部通过 <NodeIP>:<NodePort> 访问服务。
                • LoadBalancer:在 NodePort 基础上,利用云服务商提供的负载均衡器,将一个外部 IP 地址暴露给 Service。是向公网暴露服务的主要方式。
                • Headless Service:一种特殊的 Service,当你不需要负载均衡和单个 Service IP 时,可以通过设置 clusterIP: None 来创建。DNS 查询会返回该 Service 后端所有 Pod 的 IP 地址列表,而不是一个 VIP。这常用于有状态应用(如 Kafka、MySQL 集群)的自定义负载均衡或状态同步。

                4. Ingress

                虽然 Ingress 主要被用作 HTTP/HTTPS 路由规则的七层代理,但它也是一种高级的服务发现形式。

                • 它通过规则将外部流量路由到集群内部相应的 Service。
                • 客户端(外部用户)通过访问 Ingress Controller 的地址来发现和访问后端服务。

                总结与对比

| 方式 | 原理 | 适用场景 | 特点 |
| --- | --- | --- | --- |
| 环境变量 | 将 Service 信息注入 Pod 环境变量 | 旧式应用,辅助手段 | 简单,但有局限性(需先于 Pod 创建) |
| DNS | 为 Service 自动注册域名 | 服务间通信的标准方式 | 推荐,符合惯例,无需代码改造 |
| Service | 提供稳定的虚拟 IP 或 DNS 名称 | 服务暴露和负载均衡的核心抽象 | Kubernetes 服务发现的基石 |
| Ingress | 七层 HTTP 路由 | 对外暴露 Web 服务,基于域名和路径路由 | 更高级的 API 网关模式 |

                结论:

                • 从模式上讲,Kubernetes 主要采用基于服务端的服务发现模式。
                • 从具体实现上讲,Kubernetes 的服务发现是一个以 DNS 为核心、以 Service 为基石的完整体系,并辅以环境变量和 Ingress 等机制。

                所以,当有人问“K8s服务发现有哪些方式”时,最准确的回答是:主要有基于 DNS 和 环境变量这两种内部发现机制,而它们都依赖于 Service 这个核心抽象。同时,Ingress 提供了外部到内部的服务发现和路由能力。

                Mar 7, 2024

                Service VS Endpoint

                Service 和 Endpoint/EndpointSlice 在 Kubernetes 中有明确的功能分工,它们共同构成了服务发现和负载均衡的基础。以下是详细的区别分析:

                一、核心功能定位

                Service - 抽象服务层

                apiVersion: v1
                kind: Service
                metadata:
                  name: web-service
                spec:
                  selector:
                    app: web-server
                  ports:
                    - protocol: TCP
                      port: 80           # 服务端口
                      targetPort: 8080   # 后端 Pod 端口
                  type: ClusterIP        # 服务类型

                Service 的核心功能:

                • 服务抽象:提供稳定的虚拟 IP 和 DNS 名称
                • 访问入口:定义客户端如何访问服务
                • 负载均衡策略:指定流量分发方式
                • 服务类型:ClusterIP、NodePort、LoadBalancer、ExternalName

                Endpoint/EndpointSlice - 后端实现层

                apiVersion: v1
                kind: Endpoints
                metadata:
                  name: web-service      # 必须与 Service 同名
                subsets:
                  - addresses:
                    - ip: 10.244.1.5
                      targetRef:
                        kind: Pod
                        name: web-pod-1
                    - ip: 10.244.1.6
                      targetRef:
                        kind: Pod  
                        name: web-pod-2
                    ports:
                    - port: 8080
                      protocol: TCP

                Endpoints 的核心功能:

                • 后端发现:记录实际可用的 Pod IP 地址
                • 健康状态:只包含通过就绪探针检查的 Pod
                • 动态更新:实时反映后端 Pod 的变化
                • 端口映射:维护 Service port 到 Pod port 的映射

                二、详细功能对比

| 功能特性 | Service | Endpoint/EndpointSlice |
| --- | --- | --- |
| 抽象级别 | 逻辑抽象层 | 物理实现层 |
| 数据内容 | 虚拟 IP、端口、选择器 | 实际 Pod IP 地址、端口 |
| 稳定性 | 稳定的 VIP 和 DNS | 动态变化的 IP 列表 |
| 创建方式 | 手动定义 | 自动生成(或手动) |
| 更新频率 | 低频变更 | 高频动态更新 |
| DNS 解析 | 返回 Service IP | 不直接参与 DNS |
| 负载均衡 | 定义策略 | 提供后端目标 |

                三、实际工作流程

                1. 服务访问流程

                客户端请求 → Service VIP → kube-proxy → Endpoints → 实际 Pod
                    ↓           ↓           ↓           ↓           ↓
                  DNS解析     虚拟IP      iptables/   后端IP列表   具体容器
                             10.96.x.x   IPVS规则    10.244.x.x   应用服务

                2. 数据流向示例

                # 客户端访问
                curl http://web-service.default.svc.cluster.local
                
                # DNS 解析返回 Service IP
                nslookup web-service.default.svc.cluster.local
                # 返回: 10.96.123.456
                
                # kube-proxy 根据 Endpoints 配置转发
                iptables -t nat -L KUBE-SERVICES | grep 10.96.123.456
                # 转发到: 10.244.1.5:8080, 10.244.1.6:8080
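
也可以用 client-go 读取 Endpoints 对象,直接看到 Service 背后的真实后端。以下为仅作示意的示例(kubeconfig 加载方式为假设,web-service/default 沿用上文):

package main

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	config, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	clientset, err := kubernetes.NewForConfig(config)
	if err != nil {
		panic(err)
	}
	// 读取与 Service 同名的 Endpoints,打印实际的后端 Pod IP:Port
	ep, err := clientset.CoreV1().Endpoints("default").Get(context.TODO(), "web-service", metav1.GetOptions{})
	if err != nil {
		panic(err)
	}
	for _, subset := range ep.Subsets {
		for _, addr := range subset.Addresses {
			for _, port := range subset.Ports {
				fmt.Printf("%s:%d\n", addr.IP, port.Port)
			}
		}
	}
}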

                四、使用场景差异

                Service 的使用场景

                # 1. 内部服务访问
                apiVersion: v1
                kind: Service
                metadata:
                  name: internal-api
                spec:
                  type: ClusterIP
                  selector:
                    app: api-server
                  ports:
                    - port: 8080
                
                # 2. 外部访问
                apiVersion: v1
                kind: Service  
                metadata:
                  name: external-web
                spec:
                  type: LoadBalancer
                  selector:
                    app: web-frontend
                  ports:
                    - port: 80
                      nodePort: 30080
                
                # 3. 外部服务代理
                apiVersion: v1
                kind: Service
                metadata:
                  name: external-database
                spec:
                  type: ExternalName
                  externalName: database.example.com

                Endpoints 的使用场景

                # 1. 自动后端管理(默认)
                # Kubernetes 自动维护匹配 Pod 的 Endpoints
                
                # 2. 外部服务集成
                apiVersion: v1
                kind: Service
                metadata:
                  name: legacy-system
                spec:
                  ports:
                    - port: 3306
                ---
                apiVersion: v1
                kind: Endpoints
                metadata:
                  name: legacy-system
                subsets:
                  - addresses:
                    - ip: 192.168.1.100  # 外部数据库
                    ports:
                    - port: 3306
                
                # 3. 多端口复杂服务
                apiVersion: v1
                kind: Service
                metadata:
                  name: complex-app
                spec:
                  ports:
                  - name: http
                    port: 80
                  - name: https
                    port: 443
                  - name: metrics
                    port: 9090

                五、配置和管理差异

                Service 配置重点

                apiVersion: v1
                kind: Service
                metadata:
                  name: optimized-service
                  annotations:
                    # 负载均衡配置
                    service.beta.kubernetes.io/aws-load-balancer-type: "nlb"
                    # 会话保持
                    service.kubernetes.io/aws-load-balancer-backend-protocol: "http"
                spec:
                  type: LoadBalancer
                  selector:
                    app: optimized-app
                  sessionAffinity: ClientIP
                  sessionAffinityConfig:
                    clientIP:
                      timeoutSeconds: 10800
                  ports:
                  - name: http
                    port: 80
                    targetPort: 8080
                  # 流量策略(仅对外部流量)
                  externalTrafficPolicy: Local

                Endpoints 配置重点

                apiVersion: v1
                kind: Endpoints
                metadata:
                  name: custom-endpoints
                  labels:
                    # 用于网络策略选择
                    environment: production
                subsets:
                - addresses:
                  - ip: 10.244.1.10
                    nodeName: worker-1
                    targetRef:
                      kind: Pod
                      name: app-pod-1
                      namespace: production
                  - ip: 10.244.1.11
                    nodeName: worker-2  
                    targetRef:
                      kind: Pod
                      name: app-pod-2
                      namespace: production
                  # 多端口定义
                  ports:
                  - name: http
                    port: 8080
                    protocol: TCP
                  - name: metrics
                    port: 9090
                    protocol: TCP
                  - name: health
                    port: 8081
                    protocol: TCP

                六、监控和调试差异

                Service 监控重点

                # 检查 Service 状态
                kubectl get services
                kubectl describe service web-service
                
                # Service 相关指标
                kubectl top services  # 如果支持
                kubectl get --raw /api/v1/namespaces/default/services/web-service/proxy/metrics
                
                # DNS 解析测试
                kubectl run test-$RANDOM --image=busybox --rm -it -- nslookup web-service

                Endpoints 监控重点

                # 检查后端可用性
                kubectl get endpoints
                kubectl describe endpoints web-service
                
                # 验证后端 Pod 状态
                kubectl get pods -l app=web-server -o wide
                
                # 检查就绪探针
                kubectl get pods -l app=web-server -o jsonpath='{.items[*].spec.containers[*].readinessProbe}'
                
                # 直接测试后端连通性
kubectl run test-$RANDOM --image=busybox --rm -it -- sh
# 在容器内: telnet 10.244.1.5 8080

                七、性能考虑差异

                Service 性能优化

                apiVersion: v1
                kind: Service
                metadata:
                  name: high-performance
                  annotations:
                    # 使用 IPVS 模式提高性能
service.kubernetes.io/ipvs-scheduler: "wrr"
                spec:
                  type: ClusterIP
                  clusterIP: None  # Headless Service,减少一层转发
                  selector:
                    app: high-perf-app

                Endpoints 性能优化

                # 使用 EndpointSlice 提高大规模集群性能
                apiVersion: discovery.k8s.io/v1
                kind: EndpointSlice
                metadata:
                  name: web-service-abc123
                  labels:
                    kubernetes.io/service-name: web-service
                addressType: IPv4
                ports:
                - name: http
                  protocol: TCP
                  port: 8080
                endpoints:
                - addresses:
                  - "10.244.1.5"
                  conditions:
                    ready: true
                  # 拓扑感知,优化路由
                  zone: us-west-2a
                  hints:
                    forZones:
                    - name: us-west-2a

                八、总结

| 维度 | Service | Endpoint/EndpointSlice |
| --- | --- | --- |
| 角色 | 服务门面 | 后端实现 |
| 稳定性 | 高(VIP/DNS 稳定) | 低(IP 动态变化) |
| 关注点 | 如何访问 | 谁能被访问 |
| 配置频率 | 低频 | 高频自动更新 |
| 网络层级 | L4 负载均衡 | 后端目标发现 |
| 扩展性 | 通过类型扩展 | 通过 EndpointSlice 扩展 |

                简单比喻:

                • Service 就像餐厅的接待台和菜单 - 提供统一的入口和访问方式
                • Endpoints 就像后厨的厨师列表 - 记录实际提供服务的人员和位置

                两者协同工作,Service 定义"什么服务可用",Endpoints 定义"谁可以提供这个服务",共同实现了 Kubernetes 强大的服务发现和负载均衡能力。

                Mar 7, 2024

                StatefulSet

                StatefulSet 如何具体解决有状态应用的挑战


                StatefulSet 的四大核心机制

                StatefulSet 通过一系列精心设计的机制,为有状态应用提供了稳定性和可预测性。

                1. 稳定的网络标识

                解决的问题:有状态应用(如数据库节点)需要稳定的主机名来相互发现和通信,不能使用随机名称。

                StatefulSet 的实现

                • 固定的 Pod 名称:Pod 名称遵循固定模式:<statefulset-name>-<ordinal-index>
                  • 例如:redis-cluster-0redis-cluster-1redis-cluster-2
                • 稳定的 DNS 记录:每个 Pod 都会自动获得一个唯一的、稳定的 DNS 记录:
                  • 格式<pod-name>.<svc-name>.<namespace>.svc.cluster.local
                  • 例子redis-cluster-0.redis-service.default.svc.cluster.local

                应对场景

                • 在 Redis 集群中,redis-cluster-0 可以告诉 redis-cluster-1:“我的地址是 redis-cluster-0.redis-service",这个地址在 Pod 的一生中都不会改变,即使它被重新调度到其他节点。

                2. 有序的部署与管理

                解决的问题:像 Zookeeper、Etcd 这样的集群化应用,节点需要按顺序启动和加入集群,主从数据库也需要先启动主节点。

                StatefulSet 的实现

                • 有序部署:当创建 StatefulSet 时,Pod 严格按照索引顺序(0, 1, 2…)依次创建。必须等 Pod-0 完全就绪(Ready)后,才会创建 Pod-1
                • 有序扩缩容
                  • 扩容:按顺序创建新 Pod(如从 3 个扩展到 5 个,会先创建 pod-3,再 pod-4)。
                  • 缩容:按逆序终止 Pod(从 pod-4 开始,然后是 pod-3)。
                • 有序滚动更新:同样遵循逆序策略,确保在更新过程中大部分节点保持可用。

                应对场景

                • 部署 MySQL 主从集群时,StatefulSet 会确保 mysql-0(主节点)先启动并初始化完成,然后才启动 mysql-1(从节点),从节点在启动时就能正确连接到主节点进行数据同步。

                3. 稳定的持久化存储

                这是 StatefulSet 最核心的特性!

                解决的问题:有状态应用的数据必须持久化,并且当 Pod 发生故障或被调度到新节点时,必须能够重新挂载到它自己的那部分数据

                StatefulSet 的实现

                • Volume Claim Template:在 StatefulSet 的 YAML 中,你可以定义一个 volumeClaimTemplate(存储卷申请模板)。
                • 专属的 PVC:StatefulSet 会为每个 Pod 实例根据这个模板创建一个独立的、专用的 PersistentVolumeClaim (PVC)。
                  • mysql-0 -> pvc-name-mysql-0
                  • mysql-1 -> pvc-name-mysql-1
                  • mysql-2 -> pvc-name-mysql-2

                工作流程

                1. 当你创建名为 mysql、副本数为 3 的 StatefulSet 时,K8s 会:
                  • 创建 Pod mysql-0,并同时创建 PVC data-mysql-0,然后将它们绑定。
                  • mysql-0 就绪后,创建 Pod mysql-1 和 PVC data-mysql-1,然后绑定。
                  • 以此类推。
                2. 如果节点故障导致 mysql-1 被删除,K8s 调度器会在其他健康节点上重新创建一个同名的 Pod mysql-1
                3. 这个新 Pod mysql-1 会自动挂载到之前为它创建的、存有它专属数据的 PVC data-mysql-1 上。
                4. 这样,Pod 虽然"漂移"了,但数据依然跟随,应用可以无缝恢复。

                应对场景

                • 对于数据库,每个 Pod 都有自己独立的数据目录。mysql-0 的数据永远不会和 mysql-1 的数据混淆。这为数据分片(Sharding)和主从复制提供了基础。

                4. 稳定的启动顺序与唯一身份

                解决的问题:应用启动脚本或配置可能需要知道当前实例的索引号(如,判断自己是否是第一个节点,从而决定是否要初始化集群)。

                StatefulSet 的实现

                • Pod 的序号(0, 1, 2...)就是其稳定的、唯一的身份标识。
                • 这个序号可以通过环境变量注入到 Pod 的容器中,供应用程序使用。
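
例如,应用可以在启动时从主机名解析出自己的序号(仅作示意;"0 号实例负责初始化集群"的判断逻辑是假设的业务约定):

package main

import (
	"fmt"
	"os"
	"strconv"
	"strings"
)

func main() {
	// StatefulSet Pod 的主机名就是 Pod 名,例如 mysql-0
	hostname, err := os.Hostname()
	if err != nil {
		panic(err)
	}
	idx := strings.LastIndex(hostname, "-")
	ordinal, err := strconv.Atoi(hostname[idx+1:])
	if err != nil {
		panic(err)
	}
	if ordinal == 0 {
		fmt.Println("我是 0 号实例,负责初始化集群")
	} else {
		fmt.Printf("我是 %d 号实例,加入已有集群\n", ordinal)
	}
}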

                总结

                StatefulSet 通过「稳定的网络标识 + 有序的部署管理 + 稳定的专属存储」这三驾马车,将有状态应用在动态的、易变的容器环境中"锚定”,使其获得了类似物理机或虚拟机的稳定运行特性。

                它并没有让有状态应用本身变得简单,而是为这些应用的复杂需求(如集群化、数据持久化)提供了一个可以在 K8s 上稳定运行的框架和平台。对于运维人员来说,这意味着可以用声明式的方式像管理无状态应用一样,去管理数据库、消息队列等复杂的有状态服务。

                Mar 7, 2024

                StatefulSet 2

                答案是:

                StatefulSet 不使用 ReplicaSet 来创建 Pod。 ✅ StatefulSet Controller 直接创建并管理 Pod 对象,而且每个 Pod 都是有序的、具名的、不可替代的。

                接下来我会详细讲清楚 StatefulSet 与 Deployment 的架构差异、实现机制、核心控制循环,让你能在实现 Operator 或高可靠应用时真正理解它的行为。


                🧩 一、StatefulSet 与 Deployment 的根本区别

| 项目 | Deployment | StatefulSet |
| --- | --- | --- |
| 控制下层对象 | ReplicaSet → Pod | 直接管理 Pod |
| Pod 命名 | 随机 | 固定命名(有序索引) |
| Pod 身份 | 可替换 | 唯一、持久身份(Stable Identity) |
| 更新策略 | 滚动更新(无序) | 有序更新(从 0 开始逐个) |
| 存储 | 通常无状态 | 绑定 PVC,数据与 Pod 一一对应 |
| 常见场景 | Web 服务、API、Job | 数据库、Zookeeper、Kafka、Etcd 等 |

                ⚙️ 二、StatefulSet 控制器工作原理

                StatefulSet Controller 运行在 kube-controller-manager 中。 它同样是一个典型的 Controller + Informer + WorkQueue + Reconcile Loop 架构。

                但是它的 核心循环逻辑 与 Deployment/ReplicaSet 明显不同:

                1️⃣ Reconcile 流程概览

                伪代码如下:

                syncStatefulSet(set):
                    pods := listPodsMatching(set.selector)
                    sortPodsByOrdinal(pods)
                
                    for i in 0..set.spec.replicas-1:
                        ensurePodExists(set, ordinal=i)
                        ensurePodPVCExists(set, ordinal=i)
                        if updateRequired(set, pod[i]):
                            deletePodAndWaitForReplacement(set, ordinal=i)

                也就是说,StatefulSet controller 自己直接创建 Pod 和 PVC,并且顺序控制整个过程。


                2️⃣ Pod 命名与序号机制

                StatefulSet 会给每个 Pod 一个固定的命名模式:

                <prefix>-<ordinal>

                例如:

                web-0
                web-1
                web-2

                这些名字不是随机生成的 UID,而是根据序号确定的。 这让 Pod 可以拥有稳定的网络标识

                • Pod DNS:<podname>.<headless-service-name>.<namespace>.svc.cluster.local

                • 例如:

                  web-0.web.default.svc.cluster.local
                  web-1.web.default.svc.cluster.local

                这对 Zookeeper、MySQL 集群等“节点必须有固定身份”的应用至关重要。


                3️⃣ 存储(PVC)绑定机制

                每个 StatefulSet Pod 都可以声明一个 volumeClaimTemplate

                volumeClaimTemplates:
                - metadata:
                    name: data
                  spec:
                    accessModes: ["ReadWriteOnce"]
                    resources:
                      requests:
                        storage: 10Gi

                Controller 会为每个 Pod 创建一个独立的 PVC:

                data-web-0
                data-web-1
                data-web-2

                即使 Pod 被删除,这个 PVC 也不会被清理(除非手动删除), 这就实现了状态持久化


                4️⃣ Pod 创建顺序与滚动更新顺序

                StatefulSet 有严格的有序逻辑:

                创建顺序

                • 先创建 web-0
                • web-0 变成 Running + Ready
                • 再创建 web-1
                • 依次类推。

                删除顺序

                • 先删除最高序号 Pod;
                • 依次向下。

                更新顺序(滚动更新)

                • 先更新最高序号 Pod;

                • 等该 Pod Ready 后,再更新下一个;

• podManagementPolicy 可配置为以下两种(注意:该字段只影响扩缩容时 Pod 的创建/删除顺序,滚动更新始终按序进行):

                  • OrderedReady(默认,有序)
                  • Parallel(并行)

                5️⃣ StatefulSet 更新控制逻辑

                当你修改 StatefulSet 的模板(例如更换镜像),控制器会:

                1. 比较 spec.template hash;
                2. 找出需要更新的 Pod;
                3. 按序更新;
                4. 每个 Pod 更新完毕、Ready 后,再进行下一个。

                伪代码示例:

                for i = replicas-1 downTo 0:
                    if pod[i].template != set.template:
                        delete(pod[i])
                        waitUntilNewPodReady(i)

                📦 三、StatefulSet 与 PVC 的生命周期

| 对象 | 何时创建 | 何时删除 |
| --- | --- | --- |
| Pod | 当 StatefulSet 第一次启动或扩容时 | 缩容时(最高序号先删) |
| PVC | Pod 第一次被创建时 | 不会自动删除(需手动清理,或为 StatefulSet 配置 PVC 保留策略) |

                这意味着:

                StatefulSet 的 数据持久性是通过 PVC 与序号绑定实现的

                例如:

                data-web-0 → 永远属于 web-0
                data-web-1 → 永远属于 web-1

                🔁 四、StatefulSet 的控制链路(事件流)

| 步骤 | 执行者 | 说明 |
| --- | --- | --- |
| 1 | 用户 | kubectl apply 创建 StatefulSet |
| 2 | API Server | 写入 etcd |
| 3 | StatefulSet Controller | 检测 StatefulSet 对象变化 |
| 4 | Controller | 创建第一个 Pod + PVC(ordinal=0) |
| 5 | Kubelet | 调度并运行容器 |
| 6 | Controller | 检测 web-0 Ready 后,继续创建 web-1 |
| 7 | Controller | 重复上一步,直到所有 replicas Ready |
| 8 | Controller | 更新时按序滚动更新 Pod |

                🧠 五、StatefulSet 为什么不复用 ReplicaSet?

                主要是因为 ReplicaSet 的特性与 StatefulSet 的设计目标相冲突:

| 功能 | ReplicaSet | StatefulSet |
| --- | --- | --- |
| Pod 匿名性 | ✅ 所有 Pod 等价 | ❌ 每个 Pod 唯一 |
| 更新无序性 | ✅ 任意替换 | ❌ 必须有序 |
| PVC 管理 | ❌ 无 | ✅ 一一绑定 |
| 网络身份 | ❌ 无 | ✅ 稳定 DNS 名称 |

                所以 Kubernetes 设计了专门的 StatefulSet Controller 来实现这一类“有状态副本控制”。


                💡 六、StatefulSet 小结

| 特性 | 描述 |
| --- | --- |
| 控制对象 | 直接管理 Pod |
| Pod 命名 | 固定命名(带序号) |
| 更新策略 | 有序(OrderedReady)或并行(Parallel) |
| 持久化 | 通过 PVC 一对一绑定 |
| 典型场景 | 数据库、分布式存储、消息队列、协调服务等 |
| 优势 | 稳定网络身份、持久存储、严格更新顺序 |
| 缺点 | 扩缩容慢,调度受限,不适合无状态应用 |

                🧭 七、视觉化理解(结构示意)

                StatefulSet: kafka
                │
                ├── Pod: kafka-0
                │   ├── PVC: data-kafka-0
                │   └── Hostname: kafka-0.kafka.default.svc
                │
                ├── Pod: kafka-1
                │   ├── PVC: data-kafka-1
                │   └── Hostname: kafka-1.kafka.default.svc
                │
                └── Pod: kafka-2
                    ├── PVC: data-kafka-2
                    └── Hostname: kafka-2.kafka.default.svc
                Mar 7, 2024

                Talk between 2 pods in different nodes

                好的,这是一个非常核心的 Kubernetes 网络问题。不同 Node 上的 Pod 之间的通信过程,清晰地展示了 Kubernetes 网络模型的核心思想:每个 Pod 都拥有一个独立的、扁平的 IP 地址空间,无论它运行在哪个节点上,Pod 之间都可以直接通过这个 IP 进行通信,而无需使用 NAT

                这个过程的实现完全依赖于容器网络接口(CNI)插件,如 Calico、Flannel、Weave Net 等。下面我们以最经典的 Flannel (VXLAN 模式)Calico (BGP 模式) 为例,来阐述这个通信过程。


                核心原则

                1. Pod IP 可达性:Kubernetes 网络模型要求,任何 Pod 的 IP 地址都能被任何其他 Pod 直接访问,无论它们是否在同一个节点上。
                2. 无 NAT:Pod 到 Pod 的通信不应该经过源地址转换(SNAT)或目的地址转换(DNAT)。Pod 看到的源 IP 和目标 IP 就是真实的 Pod IP。

                通用通信流程(抽象模型)

                假设有两个 Pod:

                • Pod A:在 Node 1 上,IP 为 10.244.1.10
                • Pod B:在 Node 2 上,IP 为 10.244.2.20

                Pod A 试图 ping Pod B 的 IP (10.244.2.20) 时,过程如下:

                1. 出站:从 Pod A 到 Node 1

                • Pod A 根据其内部路由表,将数据包从自己的网络命名空间内的 eth0 接口发出。
                • 目标 IP 是 10.244.2.20
                • Node 1 上,有一个网桥(如 cni0)充当了所有本地 Pod 的虚拟交换机。Pod A 的 eth0 通过一对 veth pair 连接到这个网桥。
                • 数据包到达网桥 cni0

                2. 路由决策:在 Node 1 上

                • Node 1内核路由表 由 CNI 插件配置。它查看数据包的目标 IP 10.244.2.20
                • 路由表规则大致如下:
                  Destination     Gateway         Interface
                  10.244.1.0/24   ...            cni0      # 本地 Pod 网段,走 cni0 网桥
                  10.244.2.0/24   192.168.1.102  eth0      # 非本地 Pod 网段,通过网关(即 Node 2 的 IP)从物理网卡 eth0 发出
                • 路由表告诉内核,去往 10.244.2.0/24 网段的数据包,下一跳是 192.168.1.102(即 Node 2 的物理 IP),并通过 Node 1 的物理网络接口 eth0 发出。

                从这里开始,不同 CNI 插件的工作机制产生了差异。


                场景一:使用 Flannel (VXLAN 模式)

                Flannel 通过创建一个覆盖网络 来解决跨节点通信。

                1. 封装

                  • 数据包(源 10.244.1.10,目标 10.244.2.20)到达 Node 1eth0 之前,会被一个特殊的虚拟网络设备 flannel.1 截获。
                  • flannel.1 是一个 VXLAN 隧道端点
                  • 封装flannel.1 会将整个原始数据包(作为 payload)封装在一个新的 UDP 数据包 中。
                    • 外层 IP 头:源 IP 是 Node 1 的 IP (192.168.1.101),目标 IP 是 Node 2 的 IP (192.168.1.102)。
                    • 外层 UDP 头:目标端口通常是 8472 (VXLAN)。
                    • VXLAN 头:包含一个 VNI,用于标识不同的虚拟网络。
                    • 内层原始数据包:原封不动。
                2. 物理网络传输

                  • 这个封装后的 UDP 数据包通过 Node 1 的物理网络 eth0 发送出去。
                  • 它经过底层物理网络(交换机、路由器)顺利到达 Node 2,因为外层 IP 是节点的真实 IP,底层网络是认识的。
                3. 解封装

                  • 数据包到达 Node 2 的物理网卡 eth0
                  • 内核发现这是一个发往 VXLAN 端口 (8472) 的 UDP 包,于是将其交给 Node 2 上的 flannel.1 设备处理。
                  • flannel.1 设备解封装,剥掉外层 UDP 和 IP 头,露出原始的 IP 数据包(源 10.244.1.10,目标 10.244.2.20)。
                4. 入站:从 Node 2 到 Pod B

                  • 解封后的原始数据包被送入 Node 2 的网络栈。
                  • Node 2 的路由表查看目标 IP 10.244.2.20,发现它属于本地的 cni0 网桥管理的网段。
                  • 数据包被转发到 cni0 网桥,网桥再通过 veth pair 将数据包送达 Pod Beth0 接口。

                简单比喻:Flannel 就像在两个节点之间建立了一条邮政专线。你的原始信件(Pod IP 数据包)被塞进一个标准快递信封(外层 UDP 包)里,通过公共邮政系统(物理网络)寄到对方邮局(Node 2),对方邮局再拆开快递信封,把原始信件交给收件人(Pod B)。


                场景二:使用 Calico (BGP 模式)

Calico 通常不使用隧道,而是利用 BGP 协议实现纯三层路由,效率更高。

                1. 路由通告

                  • Node 1Node 2 上都运行着 Calico 的 BGP 客户端 Felix 和 BGP 路由反射器 BIRD
                  • Node 2 会通过 BGP 协议向网络中的其他节点(包括 Node 1)通告一条路由信息:“目标网段 10.244.2.0/24 的下一跳是我 192.168.1.102”。
                  • Node 1 学习到了这条路由,并写入自己的内核路由表(就是我们之前在步骤2中看到的那条)。
                2. 直接路由

                  • 数据包(源 10.244.1.10,目标 10.244.2.20)根据路由表,直接通过 Node 1 的物理网卡 eth0 发出。
                  • 没有封装! 数据包保持原样,源 IP 是 10.244.1.10,目标 IP 是 10.244.2.20
                  • 这个数据包被发送到 Node 2 的物理 IP (192.168.1.102)。
                3. 物理网络传输

                  • 数据包经过底层物理网络。这就要求底层网络必须能够路由 Pod IP 的网段。在云环境中,这通常通过配置 VPC 路由表来实现;在物理机房,需要核心交换机学习到这些 BGP 路由或配置静态路由。
                4. 入站:从 Node 2 到 Pod B

                  • 数据包到达 Node 2 的物理网卡 eth0
                  • Node 2 的内核查看目标 IP 10.244.2.20,发现这个 IP 属于一个本地虚拟接口(如 caliXXX,这是 Calico 为每个 Pod 创建的),于是直接将数据包转发给该接口,最终送达 Pod B

                简单比喻:Calico 让每个节点都成为一个智能路由器。它们互相告知“哪个 Pod 网段在我这里”。当 Node 1 要发数据给 Node 2 上的 Pod 时,它就像路由器一样,根据已知的路由表,直接找到 Node 2 的地址并把数据包发过去,中间不拆包。


                总结对比

| 特性 | Flannel (VXLAN) | Calico (BGP) |
| --- | --- | --- |
| 网络模型 | Overlay Network | Pure Layer 3 |
| 原理 | 隧道封装 | 路由通告 |
| 性能 | 有封装/解封装开销,性能稍低 | 无隧道开销,性能更高 |
| 依赖 | 对底层网络无要求,只要节点 IP 通即可 | 依赖底层网络支持路由(云平台 VPC 或物理网络配置) |
| 数据包 | 外层 Node IP,内层 Pod IP | 始终是 Pod IP |

                无论采用哪种方式,Kubernetes 和 CNI 插件共同协作,最终实现了一个对应用开发者透明的、扁平的 Pod 网络。开发者只需关心 Pod IP 和 Service,而无需理解底层复杂的跨节点通信机制。

                如果pod之间访问不通怎么排查?

                核心排查思路:从 Pod 内部到外部,从简单到复杂

                整个排查过程可以遵循下图所示的路径,逐步深入:

                flowchart TD
                    A[Pod 之间访问不通] --> B[确认基础连通性<br>ping & telnet]
                
                    B --> C{ping 是否通?}
                    C -- 通 --> D[telnet 端口是否通?]
                    C -- 不通 --> E[检查 NetworkPolicy<br>kubectl get networkpolicy]
                
                    D -- 通 --> F[检查应用日志与配置]
                    D -- 不通 --> G[检查 Service 与 Endpoints<br>kubectl describe svc]
                
                    E --> H[检查 CNI 插件状态<br>kubectl get pods -n kube-system]
                    
                    subgraph G_ [Service排查路径]
                        G --> G1[Endpoints 是否为空?]
                        G1 -- 是 --> G2[检查 Pod 标签与 Selector]
                        G1 -- 否 --> G3[检查 kube-proxy 与 iptables]
                    end
                
                    F --> Z[问题解决]
                    H --> Z
                    G2 --> Z
                    G3 --> Z

                第一阶段:基础信息收集与初步检查

                1. 获取双方 Pod 信息

                  kubectl get pods -o wide
                  • 确认两个 Pod 都处于 Running 状态。
                  • 记录下它们的 IP 地址所在节点
                  • 确认它们不在同一个节点上(如果是,排查方法会略有不同)。
                2. 明确访问方式

                  • 直接通过 Pod IP 访问? (ping <pod-ip>curl <pod-ip>:<port>)
                  • 通过 Service 名称访问? (ping <service-name>curl <service-name>:<port>)
                  • 这个问题决定了后续的排查方向。

                第二阶段:按访问路径深入排查

                场景一:直接通过 Pod IP 访问不通(跨节点)

                这通常是底层网络插件(CNI) 的问题。

                1. 检查 Pod 内部网络

                  • 进入源 Pod,检查其网络配置:
                  kubectl exec -it <source-pod> -- sh
                  # 在 Pod 内部执行:
                  ip addr show eth0 # 查看 IP 是否正确
                  ip route # 查看路由表
                  ping <destination-pod-ip> # 测试连通性
                  • 如果 ping 不通,继续下一步。
                2. 检查目标 Pod 的端口监听

                  • 进入目标 Pod,确认应用在正确端口上监听:
                  kubectl exec -it <destination-pod> -- netstat -tulpn | grep LISTEN
                  # 或者用 ss 命令
                  kubectl exec -it <destination-pod> -- ss -tulpn | grep LISTEN
                  • 如果这里没监听,是应用自身问题,检查应用日志和配置。
                3. 检查 NetworkPolicy(网络策略)

                  • 这是 Kubernetes 的“防火墙”,很可能阻止了访问。
                  kubectl get networkpolicies -A
                  kubectl describe networkpolicy <policy-name> -n <namespace>
                  • 查看是否有策略限制了源 Pod 或目标 Pod 的流量。特别注意 ingress 规则
                4. 检查 CNI 插件状态

                  • CNI 插件(如 Calico、Flannel)的异常会导致跨节点网络瘫痪。
                  kubectl get pods -n kube-system | grep -e calico -e flannel -e weave
                  • 确认所有 CNI 相关的 Pod 都在运行。如果有 CrashLoopBackOff 等状态,查看其日志。
                5. 节点层面排查

                  • 如果以上都正常,问题可能出现在节点网络层面。
                  • 登录到源 Pod 所在节点,尝试 ping 目标 Pod IP。
                  • 检查节点路由表
                    # 在节点上执行
                    ip route
                    • 对于 Flannel,你应该能看到到其他节点 Pod 网段的路由。
                    • 对于 Calico,你应该能看到到每个其他节点 Pod 网段的精确路由。
                  • 检查节点防火墙:在某些环境中(如安全组、iptables 规则)可能阻止了 VXLAN(8472端口)或节点间 Pod IP 的通信。
                    # 检查 iptables 规则
                    sudo iptables-save | grep <pod-ip>

                场景二:通过 Service 名称访问不通

                这通常是 Kubernetes 服务发现kube-proxy 的问题。

                1. 检查 Service 和 Endpoints

                  kubectl get svc <service-name>
                  kubectl describe svc <service-name> # 查看 Selector 和 Port 映射
                  kubectl get endpoints <service-name> # 这是关键!检查是否有健康的 Endpoints
                  • 如果 ENDPOINTS 列为空:说明 Service 的 Label Selector 没有匹配到任何健康的 Pod。请检查:
                    • Pod 的 labels 是否与 Service 的 selector 匹配。
                    • Pod 的 readinessProbe 是否通过。
                2. 检查 DNS 解析

                  • 进入源 Pod,测试是否能解析 Service 名称:
                  kubectl exec -it <source-pod> -- nslookup <service-name>
                  # 或者
                  kubectl exec -it <source-pod> -- cat /etc/resolv.conf
                  • 如果解析失败,检查 kube-dnscoredns Pod 是否正常。
                  kubectl get pods -n kube-system | grep -e coredns -e kube-dns
                3. 检查 kube-proxy

                  • kube-proxy 负责实现 Service 的负载均衡规则(通常是 iptables 或 ipvs)。
                  kubectl get pods -n kube-system | grep kube-proxy
                  • 确认所有 kube-proxy Pod 都在运行。
                  • 可以登录到节点,检查是否有对应的 iptables 规则:
                    sudo iptables-save | grep <service-name>
                    # 或者查看 ipvs 规则(如果使用 ipvs 模式)
                    sudo ipvsadm -ln

                第三阶段:高级调试技巧

                如果上述步骤仍未解决问题,可以尝试以下方法:

                1. 使用网络调试镜像

                  • 部署一个包含网络工具的临时 Pod(如 nicolaka/netshoot)来进行高级调试。
                  kubectl run -it --rm debug-pod --image=nicolaka/netshoot -- /bin/bash
                  • 在这个 Pod 里,你可以使用 tcpdump, tracepath, dig 等强大工具。
                  • 例如,在目标 Pod 的节点上抓包:
                    # 在节点上执行,监听 Pod 网络对应的接口
                    sudo tcpdump -i any -n host <source-pod-ip> and host <destination-pod-ip>
                2. 检查节点网络连接

                  • 确认两个节点之间网络是通的(通过节点 IP)。
                  • 确认 CNI 所需的端口(如 Flannel 的 VXLAN 端口 8472)在节点间是开放的。

                总结与排查命令清单

                当 Pod 间访问不通时,按顺序执行以下命令:

                # 1. 基本信息
                kubectl get pods -o wide
                kubectl get svc,ep -o wide
                
                # 2. 检查 NetworkPolicy
                kubectl get networkpolicies -A
                
                # 3. 检查核心插件
                kubectl get pods -n kube-system | grep -e coredns -e kube-proxy -e calico -e flannel
                
                # 4. 进入 Pod 测试
                kubectl exec -it <source-pod> -- ping <destination-pod-ip>
                kubectl exec -it <source-pod> -- nslookup <service-name>
                
                # 5. 检查目标 Pod 应用
                kubectl exec -it <destination-pod> -- netstat -tulpn
                kubectl logs <destination-pod>
                
                # 6. 节点层面检查
                # 在节点上执行
                ip route
                sudo iptables-save | grep <relevant-ip>

                记住,90% 的 Pod 网络不通问题源于 NetworkPolicy 配置、Service Endpoints 为空,或 CNI 插件故障。按照这个路径排查,绝大多数问题都能被定位和解决。

                Mar 7, 2024

                Talk with API Server

                Kubernetes 各模块与 API Server 通信详解

                这是理解 Kubernetes 架构的核心问题。API Server 是整个集群的"大脑",所有组件都通过它进行通信。


                🎯 Kubernetes 通信架构总览

                ┌─────────────────────────────────────────────────────────┐
                │                    API Server (核心)                     │
                │  - RESTful API (HTTP/HTTPS)                             │
                │  - 认证、授权、准入控制                                   │
                │  - etcd 唯一入口                                         │
                └───────┬─────────────────┬─────────────────┬─────────────┘
                        │                 │                 │
                    ┌───▼───┐         ┌───▼───┐        ┌───▼────┐
                    │Kubelet│         │Scheduler│      │Controller│
                    │(Node) │         │         │      │ Manager  │
                    └───────┘         └─────────┘      └──────────┘
                        │
                    ┌───▼────┐
                    │kube-proxy│
                    └────────┘

                🔐 通信基础:认证、授权、准入

                1. 认证 (Authentication)

                所有组件访问 API Server 必须先通过认证。

                常见认证方式

| 认证方式 | 使用场景 | 实现方式 |
| --- | --- | --- |
| X.509 证书 | 集群组件(kubelet/scheduler) | 客户端证书 |
| ServiceAccount Token | Pod 内应用 | JWT Token |
| Bootstrap Token | 节点加入集群 | 临时 Token |
| 静态 Token 文件 | 简单测试 | 不推荐生产 |
| OIDC | 用户认证 | 外部身份提供商 |

                X.509 证书认证示例

                # 1. API Server 启动参数包含 CA 证书
                kube-apiserver \
                  --client-ca-file=/etc/kubernetes/pki/ca.crt \
                  --tls-cert-file=/etc/kubernetes/pki/apiserver.crt \
                  --tls-private-key-file=/etc/kubernetes/pki/apiserver.key
                
                # 2. Kubelet 使用客户端证书
                kubelet \
                  --kubeconfig=/etc/kubernetes/kubelet.conf \
                  --client-ca-file=/etc/kubernetes/pki/ca.crt
                
                # 3. kubeconfig 文件内容
                apiVersion: v1
                kind: Config
                clusters:
                - cluster:
                    certificate-authority: /etc/kubernetes/pki/ca.crt  # CA 证书
                    server: https://192.168.1.10:6443                  # API Server 地址
                  name: kubernetes
                users:
                - name: system:node:worker-1
                  user:
                    client-certificate: /var/lib/kubelet/pki/kubelet-client.crt  # 客户端证书
                    client-key: /var/lib/kubelet/pki/kubelet-client.key          # 客户端密钥
                contexts:
                - context:
                    cluster: kubernetes
                    user: system:node:worker-1
                  name: default
                current-context: default

                ServiceAccount Token 认证

                # Pod 内自动挂载的 Token
                cat /var/run/secrets/kubernetes.io/serviceaccount/token
                # eyJhbGciOiJSUzI1NiIsImtpZCI6Ij...
                
                # 使用 Token 访问 API Server
                TOKEN=$(cat /var/run/secrets/kubernetes.io/serviceaccount/token)
                curl -k -H "Authorization: Bearer $TOKEN" \
                  https://kubernetes.default.svc/api/v1/namespaces/default/pods
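
在 Pod 内的 Go 程序里,一般不用手工拼 curl,而是用 client-go 的 rest.InClusterConfig 自动读取上述 Token 和 CA。下面是仅作示意的最小示例(列出 default 命名空间的 Pod,前提是 ServiceAccount 有相应 RBAC 权限):

package main

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
)

func main() {
	// 自动读取 /var/run/secrets/kubernetes.io/serviceaccount/ 下的 token 与 ca.crt
	config, err := rest.InClusterConfig()
	if err != nil {
		panic(err)
	}
	clientset, err := kubernetes.NewForConfig(config)
	if err != nil {
		panic(err)
	}
	pods, err := clientset.CoreV1().Pods("default").List(context.TODO(), metav1.ListOptions{})
	if err != nil {
		panic(err)
	}
	fmt.Println("default 命名空间 Pod 数量:", len(pods.Items))
}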

                2. 授权 (Authorization)

                认证通过后,检查是否有权限执行操作。

                RBAC (Role-Based Access Control) - 最常用

                # 1. Role - 定义权限
                apiVersion: rbac.authorization.k8s.io/v1
                kind: Role
                metadata:
                  namespace: default
                  name: pod-reader
                rules:
                - apiGroups: [""]
                  resources: ["pods"]
                  verbs: ["get", "list", "watch"]
                
                ---
                # 2. RoleBinding - 绑定用户/ServiceAccount
                apiVersion: rbac.authorization.k8s.io/v1
                kind: RoleBinding
                metadata:
                  name: read-pods
                  namespace: default
                subjects:
                - kind: ServiceAccount
                  name: my-app
                  namespace: default
                roleRef:
                  kind: Role
                  name: pod-reader
                  apiGroup: rbac.authorization.k8s.io
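
排查 RBAC 问题时,除了 kubectl auth can-i,也可以在代码中通过 SelfSubjectAccessReview 自查权限。下面是仅作示意的 Go 示例(kubeconfig 路径与要检查的资源/动作均为假设):

package main

import (
	"context"
	"fmt"

	authorizationv1 "k8s.io/api/authorization/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	config, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	clientset, err := kubernetes.NewForConfig(config)
	if err != nil {
		panic(err)
	}
	// 等价于 kubectl auth can-i list pods -n default
	ssar := &authorizationv1.SelfSubjectAccessReview{
		Spec: authorizationv1.SelfSubjectAccessReviewSpec{
			ResourceAttributes: &authorizationv1.ResourceAttributes{
				Namespace: "default",
				Verb:      "list",
				Resource:  "pods",
			},
		},
	}
	resp, err := clientset.AuthorizationV1().SelfSubjectAccessReviews().Create(context.TODO(), ssar, metav1.CreateOptions{})
	if err != nil {
		panic(err)
	}
	fmt.Println("allowed:", resp.Status.Allowed)
}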

                授权模式对比

| 模式 | 说明 | 使用场景 |
| --- | --- | --- |
| RBAC | 基于角色 | 生产环境(推荐) |
| ABAC | 基于属性 | 复杂策略(已过时) |
| Webhook | 外部授权服务 | 自定义授权逻辑 |
| Node | 节点授权 | Kubelet 专用 |
| AlwaysAllow | 允许所有 | 测试环境(危险) |

                3. 准入控制 (Admission Control)

                授权通过后,准入控制器可以修改或拒绝请求。

                常用准入控制器

                # API Server 启用的准入控制器
                kube-apiserver \
                  --enable-admission-plugins=NamespaceLifecycle,LimitRanger,ServiceAccount,\
                DefaultStorageClass,DefaultTolerationSeconds,MutatingAdmissionWebhook,\
                ValidatingAdmissionWebhook,ResourceQuota,PodSecurityPolicy

| 准入控制器 | 作用 |
| --- | --- |
| NamespaceLifecycle | 防止在删除中的 namespace 创建资源 |
| LimitRanger | 强制资源限制 |
| ResourceQuota | 强制命名空间配额 |
| PodSecurityPolicy | 强制 Pod 安全策略 |
| MutatingAdmissionWebhook | 修改资源(如注入 sidecar) |
| ValidatingAdmissionWebhook | 验证资源(自定义校验) |

                📡 各组件通信详解

                1. Kubelet → API Server

                Kubelet 是唯一主动连接 API Server 的组件。

                通信方式

                Kubelet (每个 Node)
                    │
                    ├─→ List-Watch Pods (监听分配给自己的 Pod)
                    ├─→ Report Node Status (定期上报节点状态)
                    ├─→ Report Pod Status (上报 Pod 状态)
                    └─→ Get Secrets/ConfigMaps (拉取配置)

                实现细节

                // Kubelet 启动时创建 Informer 监听资源
                // 伪代码示例
                func (kl *Kubelet) syncLoop() {
                    // 1. 创建 Pod Informer
                    podInformer := cache.NewSharedIndexInformer(
                        &cache.ListWatch{
                            ListFunc: func(options metav1.ListOptions) (runtime.Object, error) {
                                // 列出分配给当前节点的所有 Pod
                                options.FieldSelector = fields.OneTermEqualSelector("spec.nodeName", kl.nodeName).String()
                                return kl.kubeClient.CoreV1().Pods("").List(context.TODO(), options)
                            },
                            WatchFunc: func(options metav1.ListOptions) (watch.Interface, error) {
                                // 持续监听 Pod 变化
                                options.FieldSelector = fields.OneTermEqualSelector("spec.nodeName", kl.nodeName).String()
                                return kl.kubeClient.CoreV1().Pods("").Watch(context.TODO(), options)
                            },
                        },
                        &v1.Pod{},
                        0, // 不缓存
                        cache.Indexers{},
                    )
                    
                    // 2. 注册事件处理器
                    podInformer.AddEventHandler(cache.ResourceEventHandlerFuncs{
                        AddFunc:    kl.handlePodAdditions,
                        UpdateFunc: kl.handlePodUpdates,
                        DeleteFunc: kl.handlePodDeletions,
                    })
                    
                    // 3. 定期上报节点状态
                    go wait.Until(kl.syncNodeStatus, 10*time.Second, stopCh)
                }
                
                // 上报节点状态
                func (kl *Kubelet) syncNodeStatus() {
                    node := &v1.Node{
                        ObjectMeta: metav1.ObjectMeta{Name: kl.nodeName},
                        Status: v1.NodeStatus{
                            Conditions: []v1.NodeCondition{
                                {Type: v1.NodeReady, Status: v1.ConditionTrue},
                            },
                            Capacity: kl.getNodeCapacity(),
                            // ...
                        },
                    }
                    
                    // 调用 API Server 更新节点状态
                    kl.kubeClient.CoreV1().Nodes().UpdateStatus(context.TODO(), node, metav1.UpdateOptions{})
                }

                Kubelet 配置示例

                # /var/lib/kubelet/config.yaml
                apiVersion: kubelet.config.k8s.io/v1beta1
                kind: KubeletConfiguration
                # API Server 连接配置(通过 kubeconfig)
                authentication:
                  x509:
                    clientCAFile: /etc/kubernetes/pki/ca.crt
                  webhook:
                    enabled: true
                  anonymous:
                    enabled: false
                authorization:
                  mode: Webhook
                clusterDomain: cluster.local
                clusterDNS:
                - 10.96.0.10
                # 定期上报间隔
                nodeStatusUpdateFrequency: 10s
                nodeStatusReportFrequency: 1m

                List-Watch 机制详解

                ┌─────────────────────────────────────────┐
                │  Kubelet List-Watch 工作流程             │
                ├─────────────────────────────────────────┤
                │                                          │
                │  1. List(初始化)                         │
                │     GET /api/v1/pods?fieldSelector=...  │
                │     ← 返回所有当前 Pod                   │
                │                                          │
                │  2. Watch(持续监听)                      │
                │     GET /api/v1/pods?watch=true&...     │
                │     ← 保持长连接                         │
                │                                          │
                │  3. 接收事件                             │
                │     ← ADDED: Pod nginx-xxx created      │
                │     ← MODIFIED: Pod nginx-xxx updated   │
                │     ← DELETED: Pod nginx-xxx deleted    │
                │                                          │
                │  4. 本地处理                             │
                │     - 缓存更新                           │
                │     - 触发 Pod 生命周期管理              │
                │                                          │
                │  5. 断线重连                             │
                │     - 检测到连接断开                     │
                │     - 重新 List + Watch                  │
                │     - ResourceVersion 确保不丢事件       │
                └─────────────────────────────────────────┘
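
                The same List-then-Watch sequence can be reproduced with client-go. The sketch below is a minimal illustration, assuming a reachable kubeconfig at /root/.kube/config and a node named worker-1 (both hypothetical): it lists the Pods bound to that node, then resumes watching from the ResourceVersion returned by the List so that no event in between is lost.

                package main

                import (
                    "context"
                    "fmt"

                    metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
                    "k8s.io/client-go/kubernetes"
                    "k8s.io/client-go/tools/clientcmd"
                )

                func main() {
                    // Assumed kubeconfig path; in-cluster code would use rest.InClusterConfig() instead.
                    config, err := clientcmd.BuildConfigFromFlags("", "/root/.kube/config")
                    if err != nil {
                        panic(err)
                    }
                    clientset, err := kubernetes.NewForConfig(config)
                    if err != nil {
                        panic(err)
                    }

                    // 1. List: initial sync, scoped to one node the way the Kubelet does it.
                    pods, err := clientset.CoreV1().Pods("").List(context.TODO(), metav1.ListOptions{
                        FieldSelector: "spec.nodeName=worker-1", // assumed node name
                    })
                    if err != nil {
                        panic(err)
                    }
                    fmt.Printf("initial list: %d pods, resourceVersion=%s\n", len(pods.Items), pods.ResourceVersion)

                    // 2. Watch: continue from the ResourceVersion that List returned.
                    w, err := clientset.CoreV1().Pods("").Watch(context.TODO(), metav1.ListOptions{
                        FieldSelector:   "spec.nodeName=worker-1",
                        ResourceVersion: pods.ResourceVersion,
                    })
                    if err != nil {
                        panic(err)
                    }
                    defer w.Stop()

                    // 3. Consume ADDED / MODIFIED / DELETED events over the long-lived connection.
                    for event := range w.ResultChan() {
                        fmt.Printf("event: %s\n", event.Type)
                    }
                }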

                HTTP 长连接(Chunked Transfer)

                # Kubelet 发起 Watch 请求
                GET /api/v1/pods?watch=true&resourceVersion=12345&fieldSelector=spec.nodeName=worker-1 HTTP/1.1
                Host: 192.168.1.10:6443
                Authorization: Bearer eyJhbGc...
                Connection: keep-alive
                
                # API Server 返回(Chunked 编码)
                HTTP/1.1 200 OK
                Content-Type: application/json
                Transfer-Encoding: chunked
                
                {"type":"ADDED","object":{"kind":"Pod","apiVersion":"v1",...}}
                {"type":"MODIFIED","object":{"kind":"Pod","apiVersion":"v1",...}}
                {"type":"DELETED","object":{"kind":"Pod","apiVersion":"v1",...}}
                ...
                # 连接保持打开,持续推送事件

                2. Scheduler → API Server

                Scheduler 也使用 List-Watch 机制。

                通信流程

                Scheduler
                    │
                    ├─→ Watch Pods (监听未调度的 Pod)
                    │   └─ spec.nodeName == ""
                    │
                    ├─→ Watch Nodes (监听节点状态)
                    │
                    ├─→ Get PVs, PVCs (获取存储信息)
                    │
                    └─→ Bind Pod (绑定 Pod 到 Node)
                        POST /api/v1/namespaces/{ns}/pods/{name}/binding

                Scheduler 伪代码

                // Scheduler 主循环
                func (sched *Scheduler) scheduleOne() {
                    // 1. 从队列获取待调度的 Pod
                    pod := sched.NextPod()
                    
                    // 2. 执行调度算法(过滤 + 打分)
                    feasibleNodes := sched.findNodesThatFit(pod)
                    if len(feasibleNodes) == 0 {
                        // 无可用节点,标记为不可调度
                        return
                    }
                    
                    priorityList := sched.prioritizeNodes(pod, feasibleNodes)
                    selectedNode := sched.selectHost(priorityList)
                    
                    // 3. 绑定 Pod 到 Node(调用 API Server)
                    binding := &v1.Binding{
                        ObjectMeta: metav1.ObjectMeta{
                            Name:      pod.Name,
                            Namespace: pod.Namespace,
                        },
                        Target: v1.ObjectReference{
                            Kind: "Node",
                            Name: selectedNode,
                        },
                    }
                    
                    // 发送 Binding 请求到 API Server
                    err := sched.client.CoreV1().Pods(pod.Namespace).Bind(
                        context.TODO(),
                        binding,
                        metav1.CreateOptions{},
                    )
                    
                    // 4. API Server 更新 Pod 的 spec.nodeName
                    // 5. Kubelet 监听到 Pod,开始创建容器
                }
                
                // Watch 未调度的 Pod
                func (sched *Scheduler) watchUnscheduledPods() {
                    podInformer := cache.NewSharedIndexInformer(
                        &cache.ListWatch{
                            ListFunc: func(options metav1.ListOptions) (runtime.Object, error) {
                                // 只监听 spec.nodeName 为空的 Pod
                                options.FieldSelector = "spec.nodeName="
                                return sched.client.CoreV1().Pods("").List(context.TODO(), options)
                            },
                            WatchFunc: func(options metav1.ListOptions) (watch.Interface, error) {
                                options.FieldSelector = "spec.nodeName="
                                return sched.client.CoreV1().Pods("").Watch(context.TODO(), options)
                            },
                        },
                        &v1.Pod{},
                        0,
                        cache.Indexers{},
                    )
                    
                    podInformer.AddEventHandler(cache.ResourceEventHandlerFuncs{
                        AddFunc: func(obj interface{}) {
                            pod := obj.(*v1.Pod)
                            sched.queue.Add(pod)  // 加入调度队列
                        },
                    })
                }

                Binding 请求详解

                # Scheduler 发送的 HTTP 请求
                POST /api/v1/namespaces/default/pods/nginx-xxx/binding HTTP/1.1
                Host: 192.168.1.10:6443
                Authorization: Bearer eyJhbGc...
                Content-Type: application/json
                
                {
                  "apiVersion": "v1",
                  "kind": "Binding",
                  "metadata": {
                    "name": "nginx-xxx",
                    "namespace": "default"
                  },
                  "target": {
                    "kind": "Node",
                    "name": "worker-1"
                  }
                }
                
                # API Server 处理:
                # 1. 验证 Binding 请求
                # 2. 更新 Pod 对象的 spec.nodeName = "worker-1"
                # 3. 返回成功响应
                # 4. Kubelet 监听到 Pod 更新,开始创建容器

                3. Controller Manager → API Server

                Controller Manager 包含多个控制器,每个控制器独立与 API Server 通信。

                常见控制器

                Controller Manager
                    │
                    ├─→ Deployment Controller
                    │   └─ Watch Deployments, ReplicaSets
                    │
                    ├─→ ReplicaSet Controller
                    │   └─ Watch ReplicaSets, Pods
                    │
                    ├─→ Node Controller
                    │   └─ Watch Nodes (节点健康检查)
                    │
                    ├─→ Service Controller
                    │   └─ Watch Services (管理 LoadBalancer)
                    │
                    ├─→ Endpoint Controller
                    │   └─ Watch Services, Pods (创建 Endpoints)
                    │
                    └─→ PV Controller
                        └─ Watch PVs, PVCs (卷绑定)

                ReplicaSet Controller 示例

                // ReplicaSet Controller 的核心逻辑
                func (rsc *ReplicaSetController) syncReplicaSet(key string) error {
                    // 1. 从缓存获取 ReplicaSet
                    rs := rsc.rsLister.Get(namespace, name)
                    
                    // 2. 获取当前 Pod 列表(通过 Selector)
                    allPods := rsc.podLister.List(labels.Everything())
                    filteredPods := rsc.filterActivePods(rs.Spec.Selector, allPods)
                    
                    // 3. 计算差异
                    diff := len(filteredPods) - int(*rs.Spec.Replicas)
                    
                    if diff < 0 {
                        // 需要创建新 Pod
                        diff = -diff
                        for i := 0; i < diff; i++ {
                            // 调用 API Server 创建 Pod
                            pod := newPod(rs)
                            _, err := rsc.kubeClient.CoreV1().Pods(rs.Namespace).Create(
                                context.TODO(),
                                pod,
                                metav1.CreateOptions{},
                            )
                        }
                    } else if diff > 0 {
                        // 需要删除多余 Pod
                        podsToDelete := getPodsToDelete(filteredPods, diff)
                        for _, pod := range podsToDelete {
                            // 调用 API Server 删除 Pod
                            err := rsc.kubeClient.CoreV1().Pods(pod.Namespace).Delete(
                                context.TODO(),
                                pod.Name,
                                metav1.DeleteOptions{},
                            )
                        }
                    }
                    
                    // 4. 更新 ReplicaSet 状态
                    rs.Status.Replicas = int32(len(filteredPods))
                    _, err := rsc.kubeClient.AppsV1().ReplicaSets(rs.Namespace).UpdateStatus(
                        context.TODO(),
                        rs,
                        metav1.UpdateOptions{},
                    )
                }

                Node Controller 心跳检测

                // Node Controller 监控节点健康
                func (nc *NodeController) monitorNodeHealth() {
                    for {
                        // 1. 列出所有节点
                        nodes, _ := nc.kubeClient.CoreV1().Nodes().List(context.TODO(), metav1.ListOptions{})
                        
                        for _, node := range nodes.Items {
                            // 2. 检查节点状态
                            now := time.Now()
                            lastHeartbeat := getNodeCondition(&node, v1.NodeReady).LastHeartbeatTime
                            
                            if now.Sub(lastHeartbeat.Time) > 40*time.Second {
                                // 3. 节点超时,标记为 NotReady
                                setNodeCondition(&node, v1.NodeCondition{
                                    Type:   v1.NodeReady,
                                    Status: v1.ConditionUnknown,
                                    Reason: "NodeStatusUnknown",
                                })
                                
                                // 4. 更新节点状态
                                nc.kubeClient.CoreV1().Nodes().UpdateStatus(
                                    context.TODO(),
                                    &node,
                                    metav1.UpdateOptions{},
                                )
                                
                                // 5. 如果节点长时间 NotReady,驱逐 Pod
                                if now.Sub(lastHeartbeat.Time) > 5*time.Minute {
                                    nc.evictPods(node.Name)
                                }
                            }
                        }
                        
                        time.Sleep(10 * time.Second)
                    }
                }

                4. kube-proxy → API Server

                kube-proxy 监听 Service 和 Endpoints,配置网络规则。

                通信流程

                kube-proxy (每个 Node)
                    │
                    ├─→ Watch Services
                    │   └─ 获取 Service 定义
                    │
                    ├─→ Watch Endpoints
                    │   └─ 获取后端 Pod IP 列表
                    │
                    └─→ 配置本地网络
                        ├─ iptables 模式:更新 iptables 规则
                        ├─ ipvs 模式:更新 IPVS 规则
                        └─ userspace 模式:代理转发(已废弃)

                iptables 模式示例

                // kube-proxy 监听 Service 和 Endpoints
                func (proxier *Proxier) syncProxyRules() {
                    // 1. 获取所有 Service
                    services := proxier.serviceStore.List()
                    
                    // 2. 获取所有 Endpoints
                    endpoints := proxier.endpointsStore.List()
                    
                    // 3. 生成 iptables 规则
                    for _, svc := range services {
                        // Service ClusterIP
                        clusterIP := svc.Spec.ClusterIP
                        
                        // 对应的 Endpoints
                        eps := endpoints[svc.Namespace+"/"+svc.Name]
                        
                        // 生成 DNAT 规则
                        // -A KUBE-SERVICES -d 10.96.100.50/32 -p tcp -m tcp --dport 80 -j KUBE-SVC-XXXX
                        chain := generateServiceChain(svc)
                        
                        for _, ep := range eps.Subsets {
                            for _, addr := range ep.Addresses {
                                // -A KUBE-SVC-XXXX -m statistic --mode random --probability 0.33 -j KUBE-SEP-XXXX
                                // -A KUBE-SEP-XXXX -p tcp -m tcp -j DNAT --to-destination 10.244.1.5:8080
                                generateEndpointRule(addr.IP, ep.Ports[0].Port)
                            }
                        }
                    }
                    
                    // 4. 应用 iptables 规则
                    iptables.Restore(rules)
                }

                生成的 iptables 规则示例

                # Service: nginx-service (ClusterIP: 10.96.100.50:80)
                # Endpoints: 10.244.1.5:8080, 10.244.2.8:8080
                
                # 1. KUBE-SERVICES 链(入口)
                -A KUBE-SERVICES -d 10.96.100.50/32 -p tcp -m tcp --dport 80 -j KUBE-SVC-NGINX
                
                # 2. KUBE-SVC-NGINX 链(Service 链)
                -A KUBE-SVC-NGINX -m statistic --mode random --probability 0.5 -j KUBE-SEP-EP1
                -A KUBE-SVC-NGINX -j KUBE-SEP-EP2
                
                # 3. KUBE-SEP-EP1 链(Endpoint 1)
                -A KUBE-SEP-EP1 -p tcp -m tcp -j DNAT --to-destination 10.244.1.5:8080
                
                # 4. KUBE-SEP-EP2 链(Endpoint 2)
                -A KUBE-SEP-EP2 -p tcp -m tcp -j DNAT --to-destination 10.244.2.8:8080

                5. kubectl → API Server

                kubectl 是用户与 API Server 交互的客户端工具。

                通信流程

                kubectl get pods
                    │
                    ├─→ 1. 读取 kubeconfig (~/.kube/config)
                    │      - API Server 地址
                    │      - 证书/Token
                    │
                    ├─→ 2. 发送 HTTP 请求
                    │      GET /api/v1/namespaces/default/pods
                    │
                    ├─→ 3. API Server 处理
                    │      - 认证
                    │      - 授权
                    │      - 从 etcd 读取数据
                    │
                    └─→ 4. 返回结果
                           JSON 格式的 Pod 列表

                kubectl 底层实现

                // kubectl get pods 的简化实现
                func getPods(namespace string) {
                    // 1. 加载 kubeconfig
                    config, _ := clientcmd.BuildConfigFromFlags("", kubeconfig)
                    
                    // 2. 创建 Clientset
                    clientset, _ := kubernetes.NewForConfig(config)
                    
                    // 3. 发起 GET 请求
                    pods, _ := clientset.CoreV1().Pods(namespace).List(
                        context.TODO(),
                        metav1.ListOptions{},
                    )
                    
                    // 4. 输出结果
                    for _, pod := range pods.Items {
                        fmt.Printf("%s\t%s\t%s\n", pod.Name, pod.Status.Phase, pod.Spec.NodeName)
                    }
                }

                HTTP 请求详解

                # kubectl get pods 发送的实际 HTTP 请求
                GET /api/v1/namespaces/default/pods HTTP/1.1
                Host: 192.168.1.10:6443
                Authorization: Bearer eyJhbGciOiJSUzI1NiIsImtpZCI6Ij...
                Accept: application/json
                User-Agent: kubectl/v1.28.0
                
                # API Server 响应
                HTTP/1.1 200 OK
                Content-Type: application/json
                
                {
                  "kind": "PodList",
                  "apiVersion": "v1",
                  "metadata": {
                    "resourceVersion": "12345"
                  },
                  "items": [
                    {
                      "metadata": {
                        "name": "nginx-xxx",
                        "namespace": "default"
                      },
                      "spec": {
                        "nodeName": "worker-1",
                        "containers": [...]
                      },
                      "status": {
                        "phase": "Running"
                      }
                    }
                  ]
                }
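
                The same request can also be issued programmatically through client-go's underlying REST client. The sketch below (kubeconfig path assumed) sends exactly the GET /api/v1/namespaces/default/pods call shown above and prints the raw PodList JSON.

                package main

                import (
                    "context"
                    "fmt"

                    "k8s.io/client-go/kubernetes"
                    "k8s.io/client-go/tools/clientcmd"
                )

                func main() {
                    config, _ := clientcmd.BuildConfigFromFlags("", "/root/.kube/config") // assumed path
                    clientset, _ := kubernetes.NewForConfig(config)

                    // Same verb and path as the request above: GET /api/v1/namespaces/default/pods,
                    // issued through the typed client's underlying REST client, returning the raw JSON body.
                    raw, err := clientset.CoreV1().RESTClient().
                        Get().
                        Namespace("default").
                        Resource("pods").
                        DoRaw(context.TODO())
                    if err != nil {
                        panic(err)
                    }
                    fmt.Println(string(raw)) // PodList JSON, as shown above
                }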

                🔄 核心机制:List-Watch

                List-Watch 是 Kubernetes 最核心的通信模式。

                List-Watch 架构

                ┌───────────────────────────────────────────────┐
                │              Client (Kubelet/Controller)      │
                ├───────────────────────────────────────────────┤
                │                                                │
                │  1. List(初始同步)                             │
                │     GET /api/v1/pods                          │
                │     → 获取所有资源                             │
                │     → 本地缓存(Informer Cache)                │
                │                                                │
                │  2. Watch(增量更新)                            │
                │     GET /api/v1/pods?watch=true               │
                │     → 长连接(HTTP Chunked)                    │
                │     → 实时接收 ADDED/MODIFIED/DELETED 事件    │
                │                                                │
                │  3. ResourceVersion(一致性保证)               │
                │     → 每个资源有版本号                         │
                │     → Watch 从指定版本开始                     │
                │     → 断线重连不丢失事件                       │
                │                                                │
                │  4. 本地缓存(Indexer)                         │
                │     → 减少 API Server 压力                    │
                │     → 快速查询                                 │
                │     → 自动同步                                 │
                └───────────────────────────────────────────────┘

                Informer 机制详解

                // Informer 是 List-Watch 的高级封装
                type Informer struct {
                    Indexer   Indexer       // 本地缓存
                    Controller Controller    // List-Watch 控制器
                    Processor  *sharedProcessor  // 事件处理器
                }
                
                // 使用 Informer 监听资源
                func watchPodsWithInformer() {
                    // 1. 创建 SharedInformerFactory
                    factory := informers.NewSharedInformerFactory(clientset, 30*time.Second)
                    
                    // 2. 获取 Pod Informer
                    podInformer := factory.Core().V1().Pods()
                    
                    // 3. 注册事件处理器
                    podInformer.Informer().AddEventHandler(cache.ResourceEventHandlerFuncs{
                        AddFunc: func(obj interface{}) {
                            pod := obj.(*v1.Pod)
                            fmt.Printf("Pod ADDED: %s\n", pod.Name)
                        },
                        UpdateFunc: func(oldObj, newObj interface{}) {
                            pod := newObj.(*v1.Pod)
                            fmt.Printf("Pod UPDATED: %s\n", pod.Name)
                        },
                        DeleteFunc: func(obj interface{}) {
                            pod := obj.(*v1.Pod)
                            fmt.Printf("Pod DELETED: %s\n", pod.Name)
                        },
                    })
                    
                    // 4. 启动 Informer
                    factory.Start(stopCh)
                    
                    // 5. 等待缓存同步完成
                    factory.WaitForCacheSync(stopCh)
                    
                    // 6. 从本地缓存查询(不访问 API Server)
                    pod, _ := podInformer.Lister().Pods("default").Get("nginx-xxx")
                }

                ResourceVersion 机制

                事件流:
                ┌────────────────────────────────────────┐
                │ Pod nginx-xxx created                  │ ResourceVersion: 100
                ├────────────────────────────────────────┤
                │ Pod nginx-xxx updated (image changed)  │ ResourceVersion: 101
                ├────────────────────────────────────────┤
                │ Pod nginx-xxx updated (status changed) │ ResourceVersion: 102
                ├────────────────────────────────────────┤
                │ Pod nginx-xxx deleted                  │ ResourceVersion: 103
                └────────────────────────────────────────┘
                
                Watch 请求:
                1. 初始 Watch: GET /api/v1/pods?watch=true&resourceVersion=100
                   → 从版本 100 开始接收事件
                
                2. 断线重连: GET /api/v1/pods?watch=true&resourceVersion=102
                   → 从版本 102 继续,不会丢失版本 103 的删除事件
                
                3. 版本过期: 如果 resourceVersion 太旧(etcd 已压缩)
                   → API Server 返回 410 Gone
                   → Client 重新 List 获取最新状态,然后 Watch

                🔐 通信安全细节

                1. TLS 双向认证

                ┌────────────────────────────────────────┐
                │        API Server TLS 配置              │
                ├────────────────────────────────────────┤
                │                                         │
                │  Server 端证书:                         │
                │  - apiserver.crt (服务端证书)          │
                │  - apiserver.key (服务端私钥)          │
                │  - ca.crt (CA 证书)                    │
                │                                         │
                │  Client CA:                             │
                │  - 验证客户端证书                       │
                │  - --client-ca-file=/etc/kubernetes/pki/ca.crt │
                │                                         │
                │  启动参数:                              │
                │  --tls-cert-file=/etc/kubernetes/pki/apiserver.crt │
                │  --tls-private-key-file=/etc/kubernetes/pki/apiserver.key │
                │  --client-ca-file=/etc/kubernetes/pki/ca.crt │
                └────────────────────────────────────────┘
                
                ┌────────────────────────────────────────┐
                │        Kubelet TLS 配置                 │
                ├────────────────────────────────────────┤
                │                                         │
                │  Client 证书:                           │
                │  - kubelet-client.crt (客户端证书)     │
                │  - kubelet-client.key (客户端私钥)     │
                │  - ca.crt (CA 证书,验证 API Server)    │
                │                                         │
                │  kubeconfig 配置:                       │
                │  - certificate-authority: ca.crt       │
                │  - client-certificate: kubelet-client.crt │
                │  - client-key: kubelet-client.key      │
                └────────────────────────────────────────┘
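
                A client built on client-go can authenticate the same way, by pointing rest.Config at the CA, client certificate and key. The sketch below is illustrative only; the host and file paths are assumptions, not your cluster's actual locations.

                package main

                import (
                    "context"
                    "fmt"

                    metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
                    "k8s.io/client-go/kubernetes"
                    "k8s.io/client-go/rest"
                )

                func main() {
                    // X.509 client-certificate authentication, mirroring the kubelet kubeconfig entries above.
                    // Host and file paths are placeholders for illustration.
                    config := &rest.Config{
                        Host: "https://192.168.1.10:6443",
                        TLSClientConfig: rest.TLSClientConfig{
                            CAFile:   "/etc/kubernetes/pki/ca.crt",              // verify the API Server's certificate
                            CertFile: "/var/lib/kubelet/pki/kubelet-client.crt", // client certificate
                            KeyFile:  "/var/lib/kubelet/pki/kubelet-client.key", // client private key
                        },
                    }

                    clientset, err := kubernetes.NewForConfig(config)
                    if err != nil {
                        panic(err)
                    }

                    nodes, err := clientset.CoreV1().Nodes().List(context.TODO(), metav1.ListOptions{})
                    if err != nil {
                        panic(err)
                    }
                    fmt.Printf("authenticated with a client certificate, saw %d nodes\n", len(nodes.Items))
                }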

                2. ServiceAccount Token 详解

                # 每个 Pod 自动挂载 ServiceAccount
                apiVersion: v1
                kind: Pod
                metadata:
                  name: my-pod
                spec:
                  serviceAccountName: default  # 使用的 ServiceAccount
                  containers:
                  - name: app
                    image: nginx
                    volumeMounts:
                    - name: kube-api-access-xxxxx
                      mountPath: /var/run/secrets/kubernetes.io/serviceaccount
                      readOnly: true
                  volumes:
                  - name: kube-api-access-xxxxx
                    projected:
                      sources:
                      - serviceAccountToken:
                          path: token                    # JWT Token
                          expirationSeconds: 3607
                      - configMap:
                          name: kube-root-ca.crt
                          items:
                          - key: ca.crt
                            path: ca.crt                 # CA 证书
                      - downwardAPI:
                          items:
                          - path: namespace
                            fieldRef:
                              fieldPath: metadata.namespace  # 命名空间

                Pod 内访问 API Server

                # 进入 Pod
                kubectl exec -it my-pod -- sh
                
                # 1. 读取 Token
                TOKEN=$(cat /var/run/secrets/kubernetes.io/serviceaccount/token)
                
                # 2. 读取 CA 证书
                CACERT=/var/run/secrets/kubernetes.io/serviceaccount/ca.crt
                
                # 3. 读取命名空间
                NAMESPACE=$(cat /var/run/secrets/kubernetes.io/serviceaccount/namespace)
                
                # 4. 访问 API Server
                curl --cacert $CACERT \
                     --header "Authorization: Bearer $TOKEN" \
                     https://kubernetes.default.svc/api/v1/namespaces/$NAMESPACE/pods
                
                # 5. 使用 kubectl proxy(简化方式)
                kubectl proxy --port=8080 &
                curl http://localhost:8080/api/v1/namespaces/default/pods
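
                The Go equivalent of the curl commands above is rest.InClusterConfig, which reads the same projected token, CA bundle and environment variables. A minimal in-Pod sketch:

                package main

                import (
                    "context"
                    "fmt"
                    "os"

                    metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
                    "k8s.io/client-go/kubernetes"
                    "k8s.io/client-go/rest"
                )

                func main() {
                    // InClusterConfig reads the projected files shown above (token, ca.crt)
                    // plus the KUBERNETES_SERVICE_HOST / KUBERNETES_SERVICE_PORT environment variables.
                    config, err := rest.InClusterConfig()
                    if err != nil {
                        panic(err)
                    }
                    clientset, err := kubernetes.NewForConfig(config)
                    if err != nil {
                        panic(err)
                    }

                    // The namespace comes from the same projected volume as in the curl example.
                    ns, err := os.ReadFile("/var/run/secrets/kubernetes.io/serviceaccount/namespace")
                    if err != nil {
                        panic(err)
                    }

                    pods, err := clientset.CoreV1().Pods(string(ns)).List(context.TODO(), metav1.ListOptions{})
                    if err != nil {
                        panic(err)
                    }
                    fmt.Printf("found %d pods in namespace %s\n", len(pods.Items), string(ns))
                }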

                ServiceAccount Token 结构

                # 解码 JWT Token
                TOKEN=$(cat /var/run/secrets/kubernetes.io/serviceaccount/token)
                echo $TOKEN | cut -d. -f2 | base64 -d | jq
                
                # 输出:
                {
                  "aud": [
                    "https://kubernetes.default.svc"
                  ],
                  "exp": 1696867200,        # 过期时间
                  "iat": 1696863600,        # 签发时间
                  "iss": "https://kubernetes.default.svc.cluster.local",  # 签发者
                  "kubernetes.io": {
                    "namespace": "default",  # 命名空间
                    "pod": {
                      "name": "my-pod",
                      "uid": "abc-123"
                    },
                    "serviceaccount": {
                      "name": "default",     # ServiceAccount 名称
                      "uid": "def-456"
                    }
                  },
                  "nbf": 1696863600,
                  "sub": "system:serviceaccount:default:default"  # Subject
                }
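
                The same decoding can be done in a few lines of Go; the sketch below splits the JWT into its three segments and base64url-decodes the payload, mirroring the shell pipeline above.

                package main

                import (
                    "encoding/base64"
                    "fmt"
                    "os"
                    "strings"
                )

                func main() {
                    // Read the projected ServiceAccount token and decode its payload (second segment),
                    // the Go equivalent of `cut -d. -f2 | base64 -d` above.
                    raw, err := os.ReadFile("/var/run/secrets/kubernetes.io/serviceaccount/token")
                    if err != nil {
                        panic(err)
                    }

                    parts := strings.Split(strings.TrimSpace(string(raw)), ".")
                    if len(parts) != 3 {
                        panic("not a JWT: expected header.payload.signature")
                    }

                    // JWT segments are base64url-encoded without padding.
                    payload, err := base64.RawURLEncoding.DecodeString(parts[1])
                    if err != nil {
                        panic(err)
                    }
                    fmt.Println(string(payload)) // the claims JSON shown above
                }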

                📊 通信模式总结

                1. 主动推送 vs 被动拉取

                | Component | Communication mode | Notes |
                |---|---|---|
                | Kubelet | Actively connects | List-Watch against the API Server |
                | Scheduler | Actively connects | List-Watch against the API Server |
                | Controller Manager | Actively connects | List-Watch against the API Server |
                | kube-proxy | Actively connects | List-Watch against the API Server |
                | kubectl | Actively requests | RESTful API calls |
                | API Server → etcd | Actively reads/writes | gRPC connection to etcd |

                重要: API Server 从不主动连接其他组件,都是组件主动连接 API Server。

                2. 通信协议

                ┌─────────────────────────────────────────┐
                │  API Server 对外暴露的协议               │
                ├─────────────────────────────────────────┤
                │                                          │
                │  1. HTTPS (主要协议)                     │
                │     - RESTful API                       │
                │     - 端口: 6443 (默认)                  │
                │     - 所有组件使用                       │
                │                                          │
                │  2. HTTP (不推荐)                        │
                │     - 仅用于本地测试                     │
                │     - 端口: 8080 (默认,已废弃)          │
                │     - 生产环境禁用                       │
                │                                          │
                │  3. WebSocket (特殊场景)                │
                │     - kubectl exec/logs/port-forward    │
                │     - 基于 HTTPS 升级                    │
                └─────────────────────────────────────────┘
                
                ┌─────────────────────────────────────────┐
                │  API Server 对 etcd 的协议               │
                ├─────────────────────────────────────────┤
                │                                          │
                │  gRPC (HTTP/2)                          │
                │  - 端口: 2379                            │
                │  - mTLS 双向认证                         │
                │  - 高性能二进制协议                      │
                └─────────────────────────────────────────┘
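
                As an example of those long-lived connections, the sketch below streams Pod logs the way kubectl logs -f does, over a single persistent HTTPS response (exec and port-forward additionally upgrade the connection). The kubeconfig path, Pod name my-pod and container name app are assumptions.

                package main

                import (
                    "context"
                    "io"
                    "os"

                    v1 "k8s.io/api/core/v1"
                    "k8s.io/client-go/kubernetes"
                    "k8s.io/client-go/tools/clientcmd"
                )

                func main() {
                    config, _ := clientcmd.BuildConfigFromFlags("", "/root/.kube/config") // assumed path
                    clientset, _ := kubernetes.NewForConfig(config)

                    // `kubectl logs -f` style streaming over a single long-lived HTTPS response.
                    // Pod name "my-pod" and container "app" are assumptions.
                    req := clientset.CoreV1().Pods("default").GetLogs("my-pod", &v1.PodLogOptions{
                        Container: "app",
                        Follow:    true, // keep the connection open and stream new lines as they appear
                    })

                    stream, err := req.Stream(context.TODO())
                    if err != nil {
                        panic(err)
                    }
                    defer stream.Close()

                    // Copy the log stream to stdout until the server closes the connection.
                    io.Copy(os.Stdout, stream)
                }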

                🛠️ 实战:监控各组件通信

                1. 查看组件连接状态

                # 1. 查看 API Server 监听端口
                netstat -tlnp | grep kube-apiserver
                # tcp   0   0 :::6443   :::*   LISTEN   12345/kube-apiserver
                
                # 2. 查看连接到 API Server 的客户端
                netstat -anp | grep :6443 | grep ESTABLISHED
                # tcp   0   0 192.168.1.10:6443   192.168.1.11:45678   ESTABLISHED   (Kubelet)
                # tcp   0   0 192.168.1.10:6443   192.168.1.10:45679   ESTABLISHED   (Scheduler)
                # tcp   0   0 192.168.1.10:6443   192.168.1.10:45680   ESTABLISHED   (Controller Manager)
                
                # 3. 查看 API Server 日志
                journalctl -u kube-apiserver -f
                # I1011 10:00:00.123456   12345 httplog.go:89] "HTTP" verb="GET" URI="/api/v1/pods?watch=true" latency="30.123ms" userAgent="kubelet/v1.28.0" srcIP="192.168.1.11:45678"
                
                # 4. 查看 Kubelet 连接
                journalctl -u kubelet -f | grep "Connecting to API"

                2. 使用 tcpdump 抓包

                # 抓取 API Server 通信(6443 端口)
                tcpdump -i any -n port 6443 -A -s 0
                
                # 抓取特定主机的通信
                tcpdump -i any -n host 192.168.1.11 and port 6443
                
                # 保存到文件,用 Wireshark 分析
                tcpdump -i any -n port 6443 -w api-traffic.pcap

                3. API Server Audit 日志

                # API Server 审计配置
                apiVersion: audit.k8s.io/v1
                kind: Policy
                rules:
                # 记录所有请求元数据
                - level: Metadata
                  verbs: ["get", "list", "watch"]
                # 记录创建/更新/删除的完整请求和响应
                - level: RequestResponse
                  verbs: ["create", "update", "patch", "delete"]
                # 启用 Audit 日志
                kube-apiserver \
                  --audit-policy-file=/etc/kubernetes/audit-policy.yaml \
                  --audit-log-path=/var/log/kubernetes/audit.log \
                  --audit-log-maxage=30 \
                  --audit-log-maxbackup=10 \
                  --audit-log-maxsize=100
                
                # 查看审计日志
                tail -f /var/log/kubernetes/audit.log | jq
                
                # 示例输出:
                {
                  "kind": "Event",
                  "apiVersion": "audit.k8s.io/v1",
                  "level": "Metadata",
                  "auditID": "abc-123",
                  "stage": "ResponseComplete",
                  "requestURI": "/api/v1/namespaces/default/pods?watch=true",
                  "verb": "watch",
                  "user": {
                    "username": "system:node:worker-1",
                    "groups": ["system:nodes"]
                  },
                  "sourceIPs": ["192.168.1.11"],
                  "userAgent": "kubelet/v1.28.0",
                  "responseStatus": {
                    "code": 200
                  }
                }

                🔍 高级话题

                1. API Server 聚合层 (API Aggregation)

                允许扩展 API Server,添加自定义 API。

                ┌────────────────────────────────────────┐
                │       Main API Server (kube-apiserver) │
                │         /api, /apis                    │
                └───────────────┬────────────────────────┘
                                │ 代理请求
                        ┌───────┴────────┐
                        ▼                ▼
                ┌──────────────┐  ┌─────────────────┐
                │ Metrics API  │  │ Custom API      │
                │ /apis/metrics│  │ /apis/my.api/v1 │
                └──────────────┘  └─────────────────┘

                注册 APIService

                apiVersion: apiregistration.k8s.io/v1
                kind: APIService
                metadata:
                  name: v1beta1.metrics.k8s.io
                spec:
                  service:
                    name: metrics-server
                    namespace: kube-system
                    port: 443
                  group: metrics.k8s.io
                  version: v1beta1
                  insecureSkipTLSVerify: true
                  groupPriorityMinimum: 100
                  versionPriority: 100

                请求路由

                # 客户端请求
                kubectl top nodes
                # 等价于: GET /apis/metrics.k8s.io/v1beta1/nodes
                
                # API Server 处理:
                # 1. 检查路径 /apis/metrics.k8s.io/v1beta1
                # 2. 查找对应的 APIService
                # 3. 代理请求到 metrics-server Service
                # 4. 返回结果给客户端
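
                The routing can be exercised directly from client-go by requesting the aggregated path; kube-apiserver matches the prefix against the registered APIService and proxies to metrics-server. A minimal sketch (kubeconfig path assumed; requires metrics-server to be installed):

                package main

                import (
                    "context"
                    "fmt"

                    "k8s.io/client-go/kubernetes"
                    "k8s.io/client-go/tools/clientcmd"
                )

                func main() {
                    config, _ := clientcmd.BuildConfigFromFlags("", "/root/.kube/config") // assumed path
                    clientset, _ := kubernetes.NewForConfig(config)

                    // Hit the aggregated API path directly; the API Server matches the
                    // /apis/metrics.k8s.io/v1beta1 prefix against the APIService registered above
                    // and proxies the request to the metrics-server Service.
                    raw, err := clientset.CoreV1().RESTClient().
                        Get().
                        AbsPath("/apis/metrics.k8s.io/v1beta1/nodes").
                        DoRaw(context.TODO())
                    if err != nil {
                        panic(err) // e.g. the metrics.k8s.io APIService is not installed
                    }
                    fmt.Println(string(raw)) // NodeMetricsList JSON
                }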

                2. API Priority and Fairness (APF)

                控制 API Server 的请求优先级和并发限制。

                # FlowSchema - 定义请求匹配规则
                apiVersion: flowcontrol.apiserver.k8s.io/v1beta3
                kind: FlowSchema
                metadata:
                  name: system-nodes
                spec:
                  priorityLevelConfiguration:
                    name: system  # 关联到优先级配置
                  matchingPrecedence: 900
                  distinguisherMethod:
                    type: ByUser
                  rules:
                  - subjects:
                    - kind: Group
                      group:
                        name: system:nodes  # 匹配 Kubelet 请求
                    resourceRules:
                    - verbs: ["*"]
                      apiGroups: ["*"]
                      resources: ["*"]
                      namespaces: ["*"]
                
                ---
                # PriorityLevelConfiguration - 定义并发限制
                apiVersion: flowcontrol.apiserver.k8s.io/v1beta3
                kind: PriorityLevelConfiguration
                metadata:
                  name: system
                spec:
                  type: Limited
                  limited:
                    nominalConcurrencyShares: 30  # guaranteed concurrency shares (renamed from assuredConcurrencyShares in v1beta3)
                    limitResponse:
                      type: Queue
                      queuing:
                        queues: 64           # 队列数量
                        queueLengthLimit: 50 # 每个队列长度
                        handSize: 6          # 洗牌算法参数

                APF 工作流程

                请求进入 API Server
                    │
                    ├─→ 1. 匹配 FlowSchema (按 precedence 排序)
                    │      - 检查 subject (user/group/serviceaccount)
                    │      - 检查 resource (API 路径)
                    │
                    ├─→ 2. 确定 PriorityLevel
                    │      - system (高优先级,Kubelet/Scheduler)
                    │      - leader-election (中优先级,Controller Manager)
                    │      - workload-high (用户请求)
                    │      - catch-all (默认)
                    │
                    ├─→ 3. 检查并发限制
                    │      - 当前并发数 < nominalConcurrencyShares: 立即执行
                    │      - 超过限制: 进入队列等待
                    │
                    └─→ 4. 执行或拒绝
                           - 队列有空位: 等待执行
                           - 队列满: 返回 429 Too Many Requests

                查看 APF 状态

                # 查看所有 FlowSchema
                kubectl get flowschemas
                
                # 查看 PriorityLevelConfiguration
                kubectl get prioritylevelconfigurations
                
                # 查看实时指标
                kubectl get --raw /metrics | grep apiserver_flowcontrol
                
                # 关键指标:
                # apiserver_flowcontrol_current_inqueue_requests: 当前排队请求数
                # apiserver_flowcontrol_rejected_requests_total: 被拒绝的请求数
                # apiserver_flowcontrol_request_concurrency_limit: 并发限制

                3. Watch Bookmark

                优化 Watch 性能,减少断线重连的代价。

                // Enable Watch Bookmarks (event type constants come from k8s.io/apimachinery/pkg/watch)
                w, err := clientset.CoreV1().Pods("default").Watch(
                    context.TODO(),
                    metav1.ListOptions{
                        Watch:               true,
                        AllowWatchBookmarks: true,  // 🔑 enable Bookmark events
                    },
                )
                if err != nil {
                    panic(err)
                }
                defer w.Stop()
                
                for event := range w.ResultChan() {
                    switch event.Type {
                    case watch.Added:
                        // handle ADDED events
                    case watch.Modified:
                        // handle MODIFIED events
                    case watch.Deleted:
                        // handle DELETED events
                    case watch.Bookmark:
                        // 🔑 Bookmark event: no actual data change,
                        // it only tells the client the current ResourceVersion,
                        // which makes resuming after a disconnect cheaper.
                        pod := event.Object.(*v1.Pod)
                        currentRV := pod.ResourceVersion
                        fmt.Printf("Bookmark at ResourceVersion: %s\n", currentRV)
                    }
                }

                Bookmark 的作用

                没有 Bookmark:
                ┌──────────────────────────────────────┐
                │ 客户端 Watch 从 ResourceVersion 100  │
                │ 长时间没有事件(如 1 小时)             │
                │ 连接断开                              │
                │ 重连时: Watch from RV 100            │
                │ API Server 需要回放 100-200 之间的    │
                │ 所有事件(即使客户端不需要)            │
                └──────────────────────────────────────┘
                
                有 Bookmark:
                ┌──────────────────────────────────────┐
                │ 客户端 Watch 从 ResourceVersion 100  │
                │ 每 10 分钟收到 Bookmark              │
                │   RV 110 (10 分钟后)                 │
                │   RV 120 (20 分钟后)                 │
                │   RV 130 (30 分钟后)                 │
                │ 连接断开                              │
                │ 重连时: Watch from RV 130 ✅         │
                │ 只需回放 130-200 之间的事件           │
                └──────────────────────────────────────┘

                4. 客户端限流 (Client-side Rate Limiting)

                防止客户端压垮 API Server。

                // client-go 的默认限流配置
                config := &rest.Config{
                    Host: "https://192.168.1.10:6443",
                    // QPS 限制
                    QPS: 50.0,        // 每秒 50 个请求
                    // Burst 限制
                    Burst: 100,       // 突发最多 100 个请求
                }
                
                clientset, _ := kubernetes.NewForConfig(config)
                
                // 自定义限流器
                import "golang.org/x/time/rate"
                
                rateLimiter := rate.NewLimiter(
                    rate.Limit(50),  // 每秒 50 个
                    100,             // Burst 100
                )
                
                // 在发送请求前等待
                rateLimiter.Wait(context.Background())
                clientset.CoreV1().Pods("default").List(...)

                📈 性能优化

                1. API Server 侧优化

                # kube-apiserver startup flags (illustrative values):
                #   --max-requests-inflight / --max-mutating-requests-inflight : in-flight request limits (read-only / mutating)
                #   --watch-cache-sizes / --default-watch-cache-size           : per-resource and default watch cache sizes
                #   --etcd-servers-overrides                                   : store Events in a dedicated etcd
                #   --enable-aggregator-routing                                : route aggregated API requests directly to service endpoints
                kube-apiserver \
                  --max-requests-inflight=400 \
                  --max-mutating-requests-inflight=200 \
                  --watch-cache-sizes=pods#1000,nodes#100 \
                  --default-watch-cache-size=100 \
                  --etcd-servers-overrides=/events#https://etcd-1:2379 \
                  --enable-aggregator-routing=true

                2. Client 侧优化

                // 1. 使用 Informer (本地缓存)
                factory := informers.NewSharedInformerFactory(clientset, 30*time.Second)
                podInformer := factory.Core().V1().Pods()
                
                // 从本地缓存读取,不访问 API Server
                pod, _ := podInformer.Lister().Pods("default").Get("nginx")
                
                // 2. Use a Field Selector to cut down the data returned
                listOptions := metav1.ListOptions{
                    FieldSelector: "spec.nodeName=worker-1",  // only Pods scheduled to a specific node
                }
                
                // 3. Use a Label Selector
                listOptions = metav1.ListOptions{
                    LabelSelector: "app=nginx",  // only Pods carrying a specific label
                }
                
                // 4. Paginate large List calls
                listOptions = metav1.ListOptions{
                    Limit: 100,  // return at most 100 objects per page
                }
                
                // 5. 批量操作
                // 不推荐: 循环创建 100 个 Pod(100 次 API 调用)
                for i := 0; i < 100; i++ {
                    clientset.CoreV1().Pods("default").Create(...)
                }
                
                // 推荐: 使用 Job/Deployment(1 次 API 调用)
                deployment := &appsv1.Deployment{
                    Spec: appsv1.DeploymentSpec{
                        Replicas: int32Ptr(100),
                        ...
                    },
                }
                clientset.AppsV1().Deployments("default").Create(deployment)

                💡 关键要点总结

                通信模式

                1. 所有组件主动连接 API Server (API Server 从不主动推送)
                2. List-Watch 是核心机制 (初始 List + 持续 Watch)
                3. HTTP 长连接 (Chunked Transfer Encoding)
                4. ResourceVersion 保证一致性 (断线重连不丢事件)

                认证授权

                1. X.509 证书 (集群组件)
                2. ServiceAccount Token (Pod 内应用)
                3. RBAC 授权 (细粒度权限控制)
                4. 准入控制 (请求验证和修改)

                性能优化

                1. Informer 本地缓存 (减少 API Server 压力)
                2. Field/Label Selector (减少数据传输)
                3. APF 流量控制 (防止 API Server 过载)
                4. 客户端限流 (防止客户端压垮 API Server)

                最佳实践

                1. 使用 Informer 而不是轮询
                2. 合理设置 QPS 和 Burst
                3. 避免频繁的 List 操作
                4. 使用 Field Selector 过滤数据
                5. 启用 Watch Bookmark
                6. 监控 API Server 指标
                Mar 7, 2024

                Monitor

                  Mar 7, 2025

                  Subsections of Networking

                  Ingress

                  Kubernetes Ingress 原理详解

                  Ingress 是 Kubernetes 中用于管理集群外部访问集群内服务的 API 对象,提供 HTTP/HTTPS 路由功能。


                  🎯 Ingress 的作用

                  没有 Ingress 的问题

                  问题 1:每个服务需要一个 LoadBalancer
                  ┌────────────────────────────────────┐
                  │  Service A (LoadBalancer)  $$$     │
                  │  Service B (LoadBalancer)  $$$     │
                  │  Service C (LoadBalancer)  $$$     │
                  └────────────────────────────────────┘
                  成本高、管理复杂、IP 地址浪费
                  
                  问题 2:无法基于域名/路径路由
                  客户端 → NodePort:30001 (Service A)
                  客户端 → NodePort:30002 (Service B)
                  需要记住不同的端口,不友好

                  使用 Ingress 的方案

                  单一入口 + 智能路由
                  ┌───────────────────────────────────────┐
                  │         Ingress Controller            │
                  │    (一个 LoadBalancer 或 NodePort)    │
                  └───────────┬───────────────────────────┘
                              │ 根据域名/路径路由
                      ┌───────┴───────┬──────────┐
                      ▼               ▼          ▼
                  Service A       Service B   Service C
                  (ClusterIP)     (ClusterIP) (ClusterIP)

                  🏗️ Ingress 架构组成

                  核心组件

                  ┌─────────────────────────────────────────────┐
                  │              Ingress 生态系统                │
                  ├─────────────────────────────────────────────┤
                  │  1. Ingress Resource (资源对象)             │
                  │     └─ 定义路由规则(YAML)                   │
                  │                                              │
                  │  2. Ingress Controller (控制器)             │
                  │     └─ 读取 Ingress,配置负载均衡器          │
                  │                                              │
                  │  3. 负载均衡器 (Nginx/Traefik/HAProxy)      │
                  │     └─ 实际处理流量的组件                   │
                  └─────────────────────────────────────────────┘

                  📋 Ingress Resource (资源定义)

                  基础示例

                  apiVersion: networking.k8s.io/v1
                  kind: Ingress
                  metadata:
                    name: example-ingress
                    annotations:
                      nginx.ingress.kubernetes.io/rewrite-target: /
                  spec:
                    # 1. 基于域名路由
                    rules:
                    - host: example.com
                      http:
                        paths:
                        - path: /
                          pathType: Prefix
                          backend:
                            service:
                              name: web-service
                              port:
                                number: 80
                    
                    # 2. TLS/HTTPS 配置
                    tls:
                    - hosts:
                      - example.com
                      secretName: example-tls

                  完整功能示例

                  apiVersion: networking.k8s.io/v1
                  kind: Ingress
                  metadata:
                    name: advanced-ingress
                    namespace: default
                    annotations:
                      # Nginx 特定配置
                      nginx.ingress.kubernetes.io/rewrite-target: /$2
                      nginx.ingress.kubernetes.io/ssl-redirect: "true"
                      nginx.ingress.kubernetes.io/limit-rps: "100"
                      # 自定义响应头
                      nginx.ingress.kubernetes.io/configuration-snippet: |
                        add_header X-Custom-Header "Hello from Ingress";
                  spec:
                    # IngressClass (指定使用哪个 Ingress Controller)
                    ingressClassName: nginx
                    
                    # TLS 配置
                    tls:
                    - hosts:
                      - app.example.com
                      - api.example.com
                      secretName: example-tls-secret
                    
                    # 路由规则
                    rules:
                    # 规则 1:app.example.com
                    - host: app.example.com
                      http:
                        paths:
                        - path: /
                          pathType: Prefix
                          backend:
                            service:
                              name: frontend-service
                              port:
                                number: 80
                    
                    # 规则 2:api.example.com
                    - host: api.example.com
                      http:
                        paths:
                        # /v1/* 路由到 api-v1
                        - path: /v1
                          pathType: Prefix
                          backend:
                            service:
                              name: api-v1-service
                              port:
                                number: 8080
                        
                        # /v2/* 路由到 api-v2
                        - path: /v2
                          pathType: Prefix
                          backend:
                            service:
                              name: api-v2-service
                              port:
                                number: 8080
                    
                    # 规则 3:默认后端(可选)
                    defaultBackend:
                      service:
                        name: default-backend
                        port:
                          number: 80

                  🎛️ PathType (路径匹配类型)

                  三种匹配类型

                  | PathType | Matching rule | Example |
                  |---|---|---|
                  | Prefix | Prefix match | /foo matches /foo, /foo/, /foo/bar |
                  | Exact | Exact match | /foo matches only /foo, not /foo/ |
                  | ImplementationSpecific | Left to the Ingress Controller | Depends on the implementation |

                  示例对比

                  apiVersion: networking.k8s.io/v1
                  kind: Ingress
                  metadata:
                    name: path-types-demo
                  spec:
                    rules:
                    - host: example.com
                      http:
                        paths:
                        # Prefix 匹配
                        - path: /api
                          pathType: Prefix
                          backend:
                            service:
                              name: api-service
                              port:
                                number: 8080
                        # 匹配:
                        # ✅ /api
                        # ✅ /api/
                        # ✅ /api/users
                        # ✅ /api/v1/users
                        
                        # Exact 匹配
                        - path: /login
                          pathType: Exact
                          backend:
                            service:
                              name: auth-service
                              port:
                                number: 80
                        # 匹配:
                        # ✅ /login
                        # ❌ /login/
                        # ❌ /login/oauth

                  🚀 Ingress Controller (控制器)

                  常见 Ingress Controller

                  | Controller | Characteristics | Typical use case |
                  |---|---|---|
                  | Nginx Ingress | Most popular, feature-rich | General purpose; recommended for production |
                  | Traefik | Cloud native, dynamic configuration | Microservices, automatic service discovery |
                  | HAProxy | High performance, enterprise grade | Heavy traffic, high concurrency |
                  | Kong | API gateway features | API management, plugin ecosystem |
                  | Istio Gateway | Service mesh integration | Complex microservice architectures |
                  | AWS ALB | Cloud native (AWS) | AWS environments |
                  | GCE | Cloud native (GCP) | GCP environments |

                  🔧 Ingress Controller 工作原理

                  核心流程

                  ┌─────────────────────────────────────────────┐
                  │  1. 用户创建/更新 Ingress Resource          │
                  │     kubectl apply -f ingress.yaml           │
                  └────────────────┬────────────────────────────┘
                                   │
                                   ▼
                  ┌─────────────────────────────────────────────┐
                  │  2. Ingress Controller 监听 API Server      │
                  │     - Watch Ingress 对象                    │
                  │     - Watch Service 对象                    │
                  │     - Watch Endpoints 对象                  │
                  └────────────────┬────────────────────────────┘
                                   │
                                   ▼
                  ┌─────────────────────────────────────────────┐
                  │  3. 生成配置文件                             │
                  │     Nginx:  /etc/nginx/nginx.conf          │
                  │     Traefik: 动态配置                       │
                  │     HAProxy: /etc/haproxy/haproxy.cfg      │
                  └────────────────┬────────────────────────────┘
                                   │
                                   ▼
                  ┌─────────────────────────────────────────────┐
                  │  4. 重载/更新负载均衡器                      │
                  │     nginx -s reload                         │
                  └────────────────┬────────────────────────────┘
                                   │
                                   ▼
                  ┌─────────────────────────────────────────────┐
                  │  5. 流量路由生效                             │
                  │     客户端请求 → Ingress → Service → Pod    │
                  └─────────────────────────────────────────────┘
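
                  Step 2 of this workflow is ordinary List-Watch. The sketch below is a toy illustration, not the real ingress-nginx controller: it watches Ingress objects with a shared informer (a real controller also watches Services, Endpoints and Secrets), which is where configuration regeneration would be triggered. The kubeconfig path is an assumption.

                  package main

                  import (
                      "fmt"
                      "time"

                      networkingv1 "k8s.io/api/networking/v1"
                      "k8s.io/client-go/informers"
                      "k8s.io/client-go/kubernetes"
                      "k8s.io/client-go/tools/cache"
                      "k8s.io/client-go/tools/clientcmd"
                  )

                  func main() {
                      config, _ := clientcmd.BuildConfigFromFlags("", "/root/.kube/config") // assumed path
                      clientset, _ := kubernetes.NewForConfig(config)

                      stopCh := make(chan struct{})
                      defer close(stopCh)

                      // Watch Ingress objects; a real controller also watches Services, Endpoints and Secrets,
                      // and regenerates/reloads the proxy configuration inside these handlers.
                      factory := informers.NewSharedInformerFactory(clientset, 30*time.Second)
                      ingressInformer := factory.Networking().V1().Ingresses().Informer()

                      ingressInformer.AddEventHandler(cache.ResourceEventHandlerFuncs{
                          AddFunc: func(obj interface{}) {
                              ing := obj.(*networkingv1.Ingress)
                              fmt.Printf("Ingress added: %s/%s -> regenerate config\n", ing.Namespace, ing.Name)
                          },
                          UpdateFunc: func(oldObj, newObj interface{}) {
                              ing := newObj.(*networkingv1.Ingress)
                              fmt.Printf("Ingress updated: %s/%s -> regenerate config\n", ing.Namespace, ing.Name)
                          },
                          DeleteFunc: func(obj interface{}) {
                              fmt.Println("Ingress deleted -> regenerate config")
                          },
                      })

                      factory.Start(stopCh)
                      factory.WaitForCacheSync(stopCh)
                      select {} // block forever; a real controller would run work queues here
                  }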

                  📦 部署 Nginx Ingress Controller

                  方式 1:使用官方 Helm Chart (推荐)

                  # 添加 Helm 仓库
                  helm repo add ingress-nginx https://kubernetes.github.io/ingress-nginx
                  helm repo update
                  
                  # 安装
                  helm install ingress-nginx ingress-nginx/ingress-nginx \
                    --namespace ingress-nginx \
                    --create-namespace \
                    --set controller.service.type=LoadBalancer
                  
                  # 查看部署状态
                  kubectl get pods -n ingress-nginx
                  kubectl get svc -n ingress-nginx

                  方式 2:使用 YAML 部署

                  # 下载官方 YAML
                  kubectl apply -f https://raw.githubusercontent.com/kubernetes/ingress-nginx/main/deploy/static/provider/cloud/deploy.yaml
                  
                  # 查看部署
                  kubectl get all -n ingress-nginx

                  核心组件

                  # 1. Deployment - Ingress Controller Pod
                  apiVersion: apps/v1
                  kind: Deployment
                  metadata:
                    name: ingress-nginx-controller
                    namespace: ingress-nginx
                  spec:
                    replicas: 2  # 高可用建议 2+
                    selector:
                      matchLabels:
                        app.kubernetes.io/name: ingress-nginx
                    template:
                      metadata:
                        labels:
                          app.kubernetes.io/name: ingress-nginx
                      spec:
                        serviceAccountName: ingress-nginx
                        containers:
                        - name: controller
                          image: registry.k8s.io/ingress-nginx/controller:v1.9.0
                          args:
                          - /nginx-ingress-controller
                          - --election-id=ingress-nginx-leader
                          - --controller-class=k8s.io/ingress-nginx
                          - --configmap=$(POD_NAMESPACE)/ingress-nginx-controller
                          ports:
                          - name: http
                            containerPort: 80
                          - name: https
                            containerPort: 443
                          livenessProbe:
                            httpGet:
                              path: /healthz
                              port: 10254
                          readinessProbe:
                            httpGet:
                              path: /healthz
                              port: 10254
                  
                  ---
                  # 2. Service - 暴露 Ingress Controller
                  apiVersion: v1
                  kind: Service
                  metadata:
                    name: ingress-nginx-controller
                    namespace: ingress-nginx
                  spec:
                    type: LoadBalancer  # 或 NodePort
                    ports:
                    - name: http
                      port: 80
                      targetPort: 80
                      protocol: TCP
                    - name: https
                      port: 443
                      targetPort: 443
                      protocol: TCP
                    selector:
                      app.kubernetes.io/name: ingress-nginx
                  
                  ---
                  # 3. ConfigMap - Nginx 全局配置
                  apiVersion: v1
                  kind: ConfigMap
                  metadata:
                    name: ingress-nginx-controller
                    namespace: ingress-nginx
                  data:
                    # 自定义 Nginx 配置
                    proxy-body-size: "100m"
                    proxy-connect-timeout: "15"
                    proxy-read-timeout: "600"
                    proxy-send-timeout: "600"
                    use-forwarded-headers: "true"

                  🌐 End-to-End Traffic Path

                  Request Flow in Detail

                  Client
                    │ 1. DNS resolution
                    │    example.com → LoadBalancer IP (1.2.3.4)
                    ▼
                  LoadBalancer / NodePort
                    │ 2. Forward to an Ingress Controller Pod
                    ▼
                  Ingress Controller (Nginx Pod)
                    │ 3. Read the Ingress rules
                    │    Host: example.com
                    │    Path: /api/users
                    │ 4. Match a rule
                    │    rule: host=example.com, path=/api
                    │    backend: api-service:8080
                    ▼
                  Service (api-service)
                    │ 5. Service selector matches Pods
                    │    selector: app=api
                    │ 6. Look up the Endpoints
                    │    endpoints: 10.244.1.5:8080, 10.244.2.8:8080
                    │ 7. Load balance (round-robin by default)
                    ▼
                  Pod (api-xxxx)
                    │ 8. Container handles the request
                    │    Container Port: 8080
                    ▼
                  Application response
                    │ 9. Returns along the same path
                    ▼
                  Client receives the response

                  Network Packet Trace

                  # Client sends a request
                  curl -H "Host: example.com" http://1.2.3.4/api/users
                  
                  # 1. DNS resolution
                  example.com → 1.2.3.4 (LoadBalancer External IP)
                  
                  # 2. TCP connection
                  Client:54321 → LoadBalancer:80
                  
                  # 3. LoadBalancer forwards
                  LoadBalancer:80 → Ingress Controller Pod:80 (10.244.0.5:80)
                  
                  # 4. Ingress Controller processing (conceptual Nginx config)
                  Nginx reads its generated configuration:
                    location /api {
                      proxy_pass http://api-service.default.svc.cluster.local:8080;
                    }
                  
                  # 5. Service resolution
                  The controller watches Endpoints and, by default, balances to Pod IPs
                  directly rather than going through the kube-proxy Service VIP:
                    api-service:8080 → Endpoints
                  
                  # 6. Load balance to a Pod
                  10.244.0.5 → 10.244.1.5:8080 (Pod IP)
                  
                  # 7. Response returns
                  Pod → Ingress Controller → LoadBalancer → Client

                  🔒 HTTPS/TLS Configuration

                  Create a TLS Secret

                  # Option 1: self-signed certificate (test environments)
                  openssl req -x509 -nodes -days 365 -newkey rsa:2048 \
                    -keyout tls.key -out tls.crt \
                    -subj "/CN=example.com"
                  
                  kubectl create secret tls example-tls \
                    --cert=tls.crt \
                    --key=tls.key
                  
                  # Option 2: Let's Encrypt (recommended for production)
                  # Install cert-manager
                  kubectl apply -f https://github.com/cert-manager/cert-manager/releases/download/v1.13.0/cert-manager.yaml
                  
                  # Create a ClusterIssuer
                  kubectl apply -f - <<EOF
                  apiVersion: cert-manager.io/v1
                  kind: ClusterIssuer
                  metadata:
                    name: letsencrypt-prod
                  spec:
                    acme:
                      server: https://acme-v02.api.letsencrypt.org/directory
                      email: admin@example.com
                      privateKeySecretRef:
                        name: letsencrypt-prod
                      solvers:
                      - http01:
                          ingress:
                            class: nginx
                  EOF

                  Configure an HTTPS Ingress

                  apiVersion: networking.k8s.io/v1
                  kind: Ingress
                  metadata:
                    name: https-ingress
                    annotations:
                      # Automatically redirect HTTP to HTTPS
                      nginx.ingress.kubernetes.io/ssl-redirect: "true"
                      # Let cert-manager request the certificate automatically
                      cert-manager.io/cluster-issuer: "letsencrypt-prod"
                  spec:
                    ingressClassName: nginx
                    tls:
                    - hosts:
                      - example.com
                      - www.example.com
                      secretName: example-tls  # cert-manager creates this Secret automatically
                    rules:
                    - host: example.com
                      http:
                        paths:
                        - path: /
                          pathType: Prefix
                          backend:
                            service:
                              name: web-service
                              port:
                                number: 80

                  Verify HTTPS

                  # Check the certificate
                  curl -v https://example.com
                  
                  # Inspect the Secret
                  kubectl get secret example-tls
                  kubectl describe secret example-tls
                  
                  # Test the automatic HTTP→HTTPS redirect
                  curl -I http://example.com
                  # HTTP/1.1 308 Permanent Redirect
                  # Location: https://example.com/

                  🎨 Advanced Routing Scenarios

                  Scenario 1: Path-Based Routing

                  apiVersion: networking.k8s.io/v1
                  kind: Ingress
                  metadata:
                    name: path-based-routing
                    annotations:
                      nginx.ingress.kubernetes.io/rewrite-target: /$2
                  spec:
                    rules:
                    - host: myapp.com
                      http:
                        paths:
                        # /api/v1/* → api-v1-service
                        - path: /api/v1(/|$)(.*)
                          pathType: Prefix
                          backend:
                            service:
                              name: api-v1-service
                              port:
                                number: 8080
                        
                        # /api/v2/* → api-v2-service
                        - path: /api/v2(/|$)(.*)
                          pathType: Prefix
                          backend:
                            service:
                              name: api-v2-service
                              port:
                                number: 8080
                        
                        # /admin/* → admin-service
                        - path: /admin
                          pathType: Prefix
                          backend:
                            service:
                              name: admin-service
                              port:
                                number: 3000
                        
                        # /* → frontend-service (default)
                        - path: /
                          pathType: Prefix
                          backend:
                            service:
                              name: frontend-service
                              port:
                                number: 80

                  Scenario 2: Subdomain-Based Routing

                  apiVersion: networking.k8s.io/v1
                  kind: Ingress
                  metadata:
                    name: subdomain-routing
                  spec:
                    rules:
                    # www.example.com
                    - host: www.example.com
                      http:
                        paths:
                        - path: /
                          pathType: Prefix
                          backend:
                            service:
                              name: website-service
                              port:
                                number: 80
                    
                    # api.example.com
                    - host: api.example.com
                      http:
                        paths:
                        - path: /
                          pathType: Prefix
                          backend:
                            service:
                              name: api-service
                              port:
                                number: 8080
                    
                    # blog.example.com
                    - host: blog.example.com
                      http:
                        paths:
                        - path: /
                          pathType: Prefix
                          backend:
                            service:
                              name: blog-service
                              port:
                                number: 80
                    
                    # *.dev.example.com (wildcard)
                    - host: "*.dev.example.com"
                      http:
                        paths:
                        - path: /
                          pathType: Prefix
                          backend:
                            service:
                              name: dev-environment
                              port:
                                number: 80

                  Scenario 3: Canary Deployment

                  # Primary (stable) Ingress
                  apiVersion: networking.k8s.io/v1
                  kind: Ingress
                  metadata:
                    name: production
                  spec:
                    rules:
                    - host: myapp.com
                      http:
                        paths:
                        - path: /
                          pathType: Prefix
                          backend:
                            service:
                              name: app-v1
                              port:
                                number: 80
                  
                  ---
                  # Canary Ingress
                  apiVersion: networking.k8s.io/v1
                  kind: Ingress
                  metadata:
                    name: canary
                    annotations:
                      nginx.ingress.kubernetes.io/canary: "true"
                      # Send 10% of traffic to the canary
                      nginx.ingress.kubernetes.io/canary-weight: "10"
                      
                      # Or route by request header
                      # nginx.ingress.kubernetes.io/canary-by-header: "X-Canary"
                      # nginx.ingress.kubernetes.io/canary-by-header-value: "always"
                      
                      # Or route by cookie
                      # nginx.ingress.kubernetes.io/canary-by-cookie: "canary"
                  spec:
                    rules:
                    - host: myapp.com
                      http:
                        paths:
                        - path: /
                          pathType: Prefix
                          backend:
                            service:
                              name: app-v2-canary
                              port:
                                number: 80

                  Scenario 4: A/B Testing

                  apiVersion: networking.k8s.io/v1
                  kind: Ingress
                  metadata:
                    name: ab-testing
                    annotations:
                      # A/B test based on a request header
                      nginx.ingress.kubernetes.io/canary: "true"
                      nginx.ingress.kubernetes.io/canary-by-header: "X-Version"
                      nginx.ingress.kubernetes.io/canary-by-header-value: "beta"
                  spec:
                    rules:
                    - host: myapp.com
                      http:
                        paths:
                        - path: /
                          pathType: Prefix
                          backend:
                            service:
                              name: app-beta
                              port:
                                number: 80
                  Test:

                  # Regular users get version A
                  curl http://myapp.com
                  
                  # Beta users get version B
                  curl -H "X-Version: beta" http://myapp.com

                  🔧 Common Annotations (Nginx)

                  Basic Configuration

                  metadata:
                    annotations:
                      # SSL redirect
                      nginx.ingress.kubernetes.io/ssl-redirect: "true"
                      
                      # Force HTTPS redirect
                      nginx.ingress.kubernetes.io/force-ssl-redirect: "true"
                      
                      # Backend protocol
                      nginx.ingress.kubernetes.io/backend-protocol: "HTTPS"  # or HTTP, GRPC
                      
                      # Path rewrite
                      nginx.ingress.kubernetes.io/rewrite-target: /$2
                      
                      # Enable regex path matching
                      nginx.ingress.kubernetes.io/use-regex: "true"

                  Advanced Configuration

                  metadata:
                    annotations:
                      # Maximum upload/body size
                      nginx.ingress.kubernetes.io/proxy-body-size: "100m"
                      
                      # Timeouts
                      nginx.ingress.kubernetes.io/proxy-connect-timeout: "600"
                      nginx.ingress.kubernetes.io/proxy-send-timeout: "600"
                      nginx.ingress.kubernetes.io/proxy-read-timeout: "600"
                      
                      # Session affinity (sticky sessions)
                      nginx.ingress.kubernetes.io/affinity: "cookie"
                      nginx.ingress.kubernetes.io/session-cookie-name: "route"
                      nginx.ingress.kubernetes.io/session-cookie-expires: "172800"
                      nginx.ingress.kubernetes.io/session-cookie-max-age: "172800"
                      
                      # Rate limiting
                      nginx.ingress.kubernetes.io/limit-rps: "100"  # requests per second
                      nginx.ingress.kubernetes.io/limit-connections: "10"  # concurrent connections
                      
                      # CORS
                      nginx.ingress.kubernetes.io/enable-cors: "true"
                      nginx.ingress.kubernetes.io/cors-allow-origin: "*"
                      nginx.ingress.kubernetes.io/cors-allow-methods: "GET, POST, PUT, DELETE, OPTIONS"
                      
                      # Source IP allowlist
                      nginx.ingress.kubernetes.io/whitelist-source-range: "10.0.0.0/8,192.168.0.0/16"
                      
                      # Basic authentication
                      nginx.ingress.kubernetes.io/auth-type: basic
                      nginx.ingress.kubernetes.io/auth-secret: basic-auth
                      nginx.ingress.kubernetes.io/auth-realm: "Authentication Required"
                      
                      # Custom Nginx configuration snippet
                      nginx.ingress.kubernetes.io/configuration-snippet: |
                        more_set_headers "X-Custom-Header: MyValue";
                        add_header X-Request-ID $request_id;

                  🛡️ Security Configuration

                  1. Basic Authentication

                  # Create the password file
                  htpasswd -c auth admin
                  # (enter the password when prompted)
                  
                  # Create the Secret
                  kubectl create secret generic basic-auth --from-file=auth
                  
                  # Apply it in an Ingress
                  kubectl apply -f - <<EOF
                  apiVersion: networking.k8s.io/v1
                  kind: Ingress
                  metadata:
                    name: secure-ingress
                    annotations:
                      nginx.ingress.kubernetes.io/auth-type: basic
                      nginx.ingress.kubernetes.io/auth-secret: basic-auth
                      nginx.ingress.kubernetes.io/auth-realm: "Authentication Required - Please enter your credentials"
                  spec:
                    rules:
                    - host: admin.example.com
                      http:
                        paths:
                        - path: /
                          pathType: Prefix
                          backend:
                            service:
                              name: admin-service
                              port:
                                number: 80
                  EOF

                  2. IP Allowlist

                  apiVersion: networking.k8s.io/v1
                  kind: Ingress
                  metadata:
                    name: whitelist-ingress
                    annotations:
                      # Only allow specific source IP ranges
                      nginx.ingress.kubernetes.io/whitelist-source-range: "10.0.0.0/8,192.168.1.100/32"
                  spec:
                    rules:
                    - host: internal.example.com
                      http:
                        paths:
                        - path: /
                          pathType: Prefix
                          backend:
                            service:
                              name: internal-service
                              port:
                                number: 80

                  3. OAuth2 Authentication

                  apiVersion: networking.k8s.io/v1
                  kind: Ingress
                  metadata:
                    name: oauth2-ingress
                    annotations:
                      nginx.ingress.kubernetes.io/auth-url: "https://oauth2-proxy.example.com/oauth2/auth"
                      nginx.ingress.kubernetes.io/auth-signin: "https://oauth2-proxy.example.com/oauth2/start?rd=$escaped_request_uri"
                  spec:
                    rules:
                    - host: app.example.com
                      http:
                        paths:
                        - path: /
                          pathType: Prefix
                          backend:
                            service:
                              name: protected-service
                              port:
                                number: 80

                  📊 Monitoring and Debugging

                  Inspect Ingress Status

                  # List all Ingresses
                  kubectl get ingress
                  
                  # Detailed information
                  kubectl describe ingress example-ingress
                  
                  # Tail the Ingress Controller logs
                  kubectl logs -n ingress-nginx -l app.kubernetes.io/name=ingress-nginx -f
                  
                  # View the generated Nginx configuration
                  kubectl exec -n ingress-nginx <ingress-controller-pod> -- cat /etc/nginx/nginx.conf

                  Test Ingress Rules

                  # Test DNS resolution
                  nslookup example.com
                  
                  # Test HTTP
                  curl -H "Host: example.com" http://<ingress-ip>/
                  
                  # Test HTTPS
                  curl -k -H "Host: example.com" https://<ingress-ip>/
                  
                  # Inspect the response headers
                  curl -I -H "Host: example.com" http://<ingress-ip>/
                  
                  # Test a specific path
                  curl -H "Host: example.com" http://<ingress-ip>/api/users

                  Common Troubleshooting Checks

                  # 1. Check whether the Ingress has an Address
                  kubectl get ingress
                  # If the ADDRESS column is empty, the Ingress Controller is not ready
                  
                  # 2. Check Services and Endpoints
                  kubectl get svc
                  kubectl get endpoints
                  
                  # 3. Check the Ingress Controller Pods
                  kubectl get pods -n ingress-nginx
                  kubectl logs -n ingress-nginx <pod-name>
                  
                  # 4. Check DNS resolution
                  kubectl run -it --rm debug --image=busybox --restart=Never -- nslookup example.com
                  
                  # 5. Check network connectivity
                  kubectl run -it --rm debug --image=curlimages/curl --restart=Never -- \
                    curl -H "Host: example.com" http://web-service.default.svc.cluster.local

                  🎯 Ingress vs Service Type

                  Comparison Table

                  | Dimension          | Ingress             | LoadBalancer                    | NodePort         |
                  |--------------------|---------------------|---------------------------------|------------------|
                  | Cost               | 1 LB total          | 1 LB per service                | Free             |
                  | Host-based routing | ✅ Supported        | ❌ Not supported                | ❌ Not supported |
                  | Path-based routing | ✅ Supported        | ❌ Not supported                | ❌ Not supported |
                  | TLS termination    | ✅ Supported        | ⚠️ Needs extra configuration    | ❌ Not supported |
                  | L7 features        | ✅ Rich             | ❌ L4 only                      | ❌ L4 only       |
                  | Typical use        | HTTP/HTTPS services | Services needing a dedicated LB | Dev and testing  |

                  💡 Key Takeaways

                  Why Ingress Matters

                  1. Cost optimization: many services share a single LoadBalancer
                  2. Smart routing: L7 routing by host and path
                  3. TLS management: centralized handling of HTTPS certificates
                  4. Advanced features: rate limiting, authentication, rewrites, CORS, and more
                  5. Easier operations: declarative configuration with a single entry point

                  Core Concepts

                  • Ingress Resource: the YAML that defines the routing rules
                  • Ingress Controller: the controller that reads those rules and implements the routing
                  • Load balancer: the component that actually handles the traffic (Nginx/Traefik/HAProxy)

                  Typical Use Cases

                  • ✅ Microservice API gateway
                  • ✅ Multi-tenant applications (isolated by subdomain)
                  • ✅ Blue-green deployments / canary releases
                  • ✅ A single entry point for web applications
                  • ❌ Non-HTTP protocols (for TCP/UDP, consider the Gateway API)

                  🚀 Advanced Topics

                  1. IngressClass (Multiple Ingress Controllers)

                  Run more than one Ingress Controller in the same cluster:

                  # Define the IngressClasses
                  apiVersion: networking.k8s.io/v1
                  kind: IngressClass
                  metadata:
                    name: nginx
                    annotations:
                      ingressclass.kubernetes.io/is-default-class: "true"
                  spec:
                    controller: k8s.io/ingress-nginx
                  
                  ---
                  apiVersion: networking.k8s.io/v1
                  kind: IngressClass
                  metadata:
                    name: traefik
                  spec:
                    controller: traefik.io/ingress-controller
                  
                  ---
                  # Use a specific IngressClass
                  apiVersion: networking.k8s.io/v1
                  kind: Ingress
                  metadata:
                    name: my-ingress
                  spec:
                    ingressClassName: nginx  # 🔑 pin this Ingress to the nginx controller
                    rules:
                    - host: example.com
                      http:
                        paths:
                        - path: /
                          pathType: Prefix
                          backend:
                            service:
                              name: web-service
                              port:
                                number: 80

                  Use cases:

                  • Internal services on Nginx, external services on Traefik
                  • Different teams using different Ingress Controllers
                  • Split by environment (e.g. Traefik for dev, Nginx for prod)

                  2. Default Backend

                  Handle requests that do not match any rule:

                  # Create the default backend Service
                  apiVersion: v1
                  kind: Service
                  metadata:
                    name: default-backend
                  spec:
                    selector:
                      app: default-backend
                    ports:
                    - port: 80
                      targetPort: 8080
                  
                  ---
                  apiVersion: apps/v1
                  kind: Deployment
                  metadata:
                    name: default-backend
                  spec:
                    replicas: 1
                    selector:
                      matchLabels:
                        app: default-backend
                    template:
                      metadata:
                        labels:
                          app: default-backend
                      spec:
                        containers:
                        - name: default-backend
                          image: registry.k8s.io/defaultbackend-amd64:1.5
                          ports:
                          - containerPort: 8080
                  
                  ---
                  # Reference the default backend in the Ingress
                  apiVersion: networking.k8s.io/v1
                  kind: Ingress
                  metadata:
                    name: ingress-with-default
                  spec:
                    defaultBackend:
                      service:
                        name: default-backend
                        port:
                          number: 80
                    rules:
                    - host: example.com
                      http:
                        paths:
                        - path: /app
                          pathType: Prefix
                          backend:
                            service:
                              name: app-service
                              port:
                                number: 80

                  Effect:

                  • Requests to example.com/app → app-service
                  • Requests to example.com/other → default-backend (404 page)
                  • Requests to unknown.com → default-backend
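
                  A quick way to see the fallback in action (a minimal sketch; <ingress-ip> is whatever address your controller exposes):

                  # Matches the /app rule → app-service
                  curl -H "Host: example.com" http://<ingress-ip>/app
                  
                  # No rule matches this path → default-backend (404 page)
                  curl -H "Host: example.com" http://<ingress-ip>/other
                  
                  # Unknown host → default-backend
                  curl -H "Host: unknown.com" http://<ingress-ip>/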

                  3. ExternalName Services with Ingress

                  Route Ingress traffic to a service outside the cluster:

                  # Create an ExternalName Service
                  apiVersion: v1
                  kind: Service
                  metadata:
                    name: external-api
                  spec:
                    type: ExternalName
                    externalName: api.external-service.com  # external domain
                  
                  ---
                  # Ingress that routes to the external service
                  apiVersion: networking.k8s.io/v1
                  kind: Ingress
                  metadata:
                    name: external-ingress
                    annotations:
                      nginx.ingress.kubernetes.io/backend-protocol: "HTTPS"
                      nginx.ingress.kubernetes.io/upstream-vhost: "api.external-service.com"
                  spec:
                    rules:
                    - host: myapp.com
                      http:
                        paths:
                        - path: /external
                          pathType: Prefix
                          backend:
                            service:
                              name: external-api
                              port:
                                number: 443

                  Use cases:

                  • Integrating third-party APIs
                  • Hybrid-cloud architectures (some services live outside the cluster)
                  • Gradual migration (moving services from outside into the cluster step by step)

                  4. Cross-Namespace References (ExternalName Workaround)

                  By default an Ingress can only reference Services in its own namespace; crossing namespaces needs a workaround:

                  # Namespace: backend
                  apiVersion: v1
                  kind: Service
                  metadata:
                    name: api-service
                    namespace: backend
                  spec:
                    selector:
                      app: api
                    ports:
                    - port: 8080
                  
                  ---
                  # Namespace: frontend
                  # ExternalName Service pointing at the Service in the backend namespace
                  apiVersion: v1
                  kind: Service
                  metadata:
                    name: api-proxy
                    namespace: frontend
                  spec:
                    type: ExternalName
                    externalName: api-service.backend.svc.cluster.local
                    ports:
                    - port: 8080
                  
                  ---
                  # Ingress lives in the frontend namespace
                  apiVersion: networking.k8s.io/v1
                  kind: Ingress
                  metadata:
                    name: cross-ns-ingress
                    namespace: frontend
                  spec:
                    rules:
                    - host: myapp.com
                      http:
                        paths:
                        - path: /api
                          pathType: Prefix
                          backend:
                            service:
                              name: api-proxy  # references the ExternalName Service in the same namespace
                              port:
                                number: 8080

                  5. Exposing TCP/UDP Services

                  Ingress natively supports only HTTP/HTTPS; TCP/UDP traffic needs extra configuration:

                  TCP Configuration for the Nginx Ingress Controller

                  # ConfigMap defining the TCP services
                  apiVersion: v1
                  kind: ConfigMap
                  metadata:
                    name: tcp-services
                    namespace: ingress-nginx
                  data:
                    # Format: "external port": "namespace/service:port"
                    "3306": "default/mysql:3306"
                    "6379": "default/redis:6379"
                    "27017": "databases/mongodb:27017"
                  
                  ---
                  # Update the Ingress Controller Service to expose the TCP ports
                  apiVersion: v1
                  kind: Service
                  metadata:
                    name: ingress-nginx-controller
                    namespace: ingress-nginx
                  spec:
                    type: LoadBalancer
                    ports:
                    - name: http
                      port: 80
                      targetPort: 80
                    - name: https
                      port: 443
                      targetPort: 443
                    # Additional TCP ports
                    - name: mysql
                      port: 3306
                      targetPort: 3306
                    - name: redis
                      port: 6379
                      targetPort: 6379
                    - name: mongodb
                      port: 27017
                      targetPort: 27017
                    selector:
                      app.kubernetes.io/name: ingress-nginx
                  
                  ---
                  # Update the Ingress Controller Deployment to reference the ConfigMap
                  apiVersion: apps/v1
                  kind: Deployment
                  metadata:
                    name: ingress-nginx-controller
                    namespace: ingress-nginx
                  spec:
                    template:
                      spec:
                        containers:
                        - name: controller
                          args:
                          - /nginx-ingress-controller
                          - --tcp-services-configmap=$(POD_NAMESPACE)/tcp-services
                          # ...other args

                  Access:

                  # Connect to MySQL
                  mysql -h <ingress-lb-ip> -P 3306 -u root -p
                  
                  # Connect to Redis
                  redis-cli -h <ingress-lb-ip> -p 6379

                  6. Canary Release Strategies in Detail

                  Weight-Based Traffic Split

                  # Production version (90% of traffic)
                  apiVersion: networking.k8s.io/v1
                  kind: Ingress
                  metadata:
                    name: production
                  spec:
                    rules:
                    - host: myapp.com
                      http:
                        paths:
                        - path: /
                          pathType: Prefix
                          backend:
                            service:
                              name: app-v1
                              port:
                                number: 80
                  
                  ---
                  # Canary version (10% of traffic)
                  apiVersion: networking.k8s.io/v1
                  kind: Ingress
                  metadata:
                    name: canary
                    annotations:
                      nginx.ingress.kubernetes.io/canary: "true"
                      nginx.ingress.kubernetes.io/canary-weight: "10"
                  spec:
                    rules:
                    - host: myapp.com
                      http:
                        paths:
                        - path: /
                          pathType: Prefix
                          backend:
                            service:
                              name: app-v2
                              port:
                                number: 80

                  Header-Based Canary

                  apiVersion: networking.k8s.io/v1
                  kind: Ingress
                  metadata:
                    name: canary-header
                    annotations:
                      nginx.ingress.kubernetes.io/canary: "true"
                      nginx.ingress.kubernetes.io/canary-by-header: "X-Canary"
                      nginx.ingress.kubernetes.io/canary-by-header-value: "true"
                  spec:
                    rules:
                    - host: myapp.com
                      http:
                        paths:
                        - path: /
                          pathType: Prefix
                          backend:
                            service:
                              name: app-v2
                              port:
                                number: 80

                  Test:

                  # Regular users get v1
                  curl http://myapp.com
                  
                  # Users with the special header get v2
                  curl -H "X-Canary: true" http://myapp.com

                  Cookie-Based Canary

                  apiVersion: networking.k8s.io/v1
                  kind: Ingress
                  metadata:
                    name: canary-cookie
                    annotations:
                      nginx.ingress.kubernetes.io/canary: "true"
                      nginx.ingress.kubernetes.io/canary-by-cookie: "canary"
                  spec:
                    rules:
                    - host: myapp.com
                      http:
                        paths:
                        - path: /
                          pathType: Prefix
                          backend:
                            service:
                              name: app-v2
                              port:
                                number: 80

                  Usage:

                  • Cookie canary=always → routed to v2
                  • Cookie canary=never → routed to v1
                  • No cookie → routed according to the canary weight
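
                  For a quick check from the command line (a minimal sketch; hostnames match the example above):

                  # Force the canary version via the cookie
                  curl -b "canary=always" http://myapp.com
                  
                  # Force the stable version
                  curl -b "canary=never" http://myapp.com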

                  7. Performance Tuning

                  Nginx Ingress Controller Tuning ConfigMap

                  apiVersion: v1
                  kind: ConfigMap
                  metadata:
                    name: ingress-nginx-controller
                    namespace: ingress-nginx
                  data:
                    # Number of worker processes (auto ≈ one per CPU core)
                    worker-processes: "auto"
                    
                    # Connections per worker process
                    max-worker-connections: "65536"
                    
                    # Enable HTTP/2
                    use-http2: "true"
                    
                    # Enable gzip compression
                    use-gzip: "true"
                    gzip-level: "6"
                    gzip-types: "text/plain text/css application/json application/javascript text/xml application/xml"
                    
                    # Client request body buffering
                    client-body-buffer-size: "128k"
                    client-max-body-size: "100m"
                    
                    # Keepalive connections
                    keep-alive: "75"
                    keep-alive-requests: "1000"
                    
                    # Proxy buffering
                    proxy-buffer-size: "16k"
                    proxy-buffers: "4 16k"
                    
                    # Logging (access logs can be disabled in production)
                    disable-access-log: "false"
                    access-log-params: "buffer=16k flush=5s"
                    
                    # SSL tuning
                    ssl-protocols: "TLSv1.2 TLSv1.3"
                    ssl-ciphers: "ECDHE-ECDSA-AES128-GCM-SHA256:ECDHE-RSA-AES128-GCM-SHA256"
                    ssl-prefer-server-ciphers: "true"
                    ssl-session-cache: "true"
                    ssl-session-cache-size: "10m"
                    ssl-session-timeout: "10m"
                    
                    # Upstream connection reuse
                    upstream-keepalive-connections: "100"
                    upstream-keepalive-timeout: "60"
                    
                    # Status codes returned when limits are hit
                    limit-req-status-code: "429"
                    limit-conn-status-code: "429"

                  Ingress Controller Pod Resources

                  apiVersion: apps/v1
                  kind: Deployment
                  metadata:
                    name: ingress-nginx-controller
                    namespace: ingress-nginx
                  spec:
                    replicas: 3  # 3+ recommended for high availability
                    template:
                      spec:
                        containers:
                        - name: controller
                          image: registry.k8s.io/ingress-nginx/controller:v1.9.0
                          resources:
                            requests:
                              cpu: "500m"
                              memory: "512Mi"
                            limits:
                              cpu: "2000m"
                              memory: "2Gi"
                          # Health probes
                          livenessProbe:
                            httpGet:
                              path: /healthz
                              port: 10254
                            initialDelaySeconds: 10
                            periodSeconds: 10
                          readinessProbe:
                            httpGet:
                              path: /healthz
                              port: 10254
                            periodSeconds: 5

                  8. Monitoring and Observability

                  Prometheus Integration

                  # ServiceMonitor for Prometheus Operator
                  apiVersion: monitoring.coreos.com/v1
                  kind: ServiceMonitor
                  metadata:
                    name: ingress-nginx
                    namespace: ingress-nginx
                  spec:
                    selector:
                      matchLabels:
                        app.kubernetes.io/name: ingress-nginx
                    endpoints:
                    - port: metrics
                      interval: 30s

                  Inspect Ingress Controller Metrics

                  # Port-forward the metrics endpoint
                  kubectl port-forward -n ingress-nginx svc/ingress-nginx-controller-metrics 10254:10254
                  
                  # Then open in a browser
                  http://localhost:10254/metrics
                  
                  # Key metrics:
                  # - nginx_ingress_controller_requests: total requests
                  # - nginx_ingress_controller_request_duration_seconds: request latency
                  # - nginx_ingress_controller_response_size: response size
                  # - nginx_ingress_controller_ssl_expire_time_seconds: SSL certificate expiry time

                  Grafana Dashboards

                  # Import the community Grafana dashboards
                  # Dashboard ID: 9614 (Nginx Ingress Controller)
                  # Dashboard ID: 11875 (Nginx Ingress Controller Request Handling Performance)

                  9. Troubleshooting Checklist

                  Problem 1: The Ingress Has No Address

                  # Check
                  kubectl get ingress
                  # NAME       CLASS   HOSTS         ADDRESS   PORTS   AGE
                  # my-app     nginx   example.com             80      5m
                  
                  # Possible causes:
                  # 1. The Ingress Controller is not running
                  kubectl get pods -n ingress-nginx
                  
                  # 2. The Service type is not LoadBalancer
                  kubectl get svc -n ingress-nginx
                  
                  # 3. The cloud provider has not assigned a LoadBalancer IP
                  kubectl describe svc -n ingress-nginx ingress-nginx-controller

                  Problem 2: 502 Bad Gateway

                  # Cause 1: the backend Service does not exist
                  kubectl get svc
                  
                  # Cause 2: the backend Pods are unhealthy
                  kubectl get pods
                  kubectl describe pod <pod-name>
                  
                  # Cause 3: the port configuration is wrong
                  kubectl get svc <service-name> -o yaml | grep -A 5 ports
                  
                  # Cause 4: a NetworkPolicy is blocking traffic
                  kubectl get networkpolicies
                  
                  # Check the controller logs
                  kubectl logs -n ingress-nginx -l app.kubernetes.io/name=ingress-nginx --tail=100

                  Problem 3: 503 Service Unavailable

                  # Cause: no healthy Endpoints
                  kubectl get endpoints <service-name>
                  
                  # If the ENDPOINTS column is empty:
                  # 1. Check that the Service selector matches the Pod labels
                  kubectl get svc <service-name> -o yaml | grep -A 3 selector
                  kubectl get pods --show-labels
                  
                  # 2. Check that the Pods are Ready
                  kubectl get pods
                  
                  # 3. Check that the container port is correct
                  kubectl get pods <pod-name> -o yaml | grep -A 5 ports

                  Problem 4: TLS Certificate Issues

                  # Check that the Secret exists
                  kubectl get secret <tls-secret-name>
                  
                  # Inspect the certificate
                  kubectl get secret <tls-secret-name> -o jsonpath='{.data.tls\.crt}' | base64 -d | openssl x509 -text -noout
                  
                  # Check the certificate expiry dates
                  kubectl get secret <tls-secret-name> -o jsonpath='{.data.tls\.crt}' | base64 -d | openssl x509 -noout -dates
                  
                  # cert-manager issues
                  kubectl get certificate
                  kubectl describe certificate <cert-name>
                  kubectl get certificaterequests

                  Problem 5: Routing Rules Not Taking Effect

                  # 1. Check the Ingress configuration
                  kubectl describe ingress <ingress-name>
                  
                  # 2. Inspect the generated Nginx configuration
                  kubectl exec -n ingress-nginx <controller-pod> -- cat /etc/nginx/nginx.conf | grep -A 20 "server_name example.com"
                  
                  # 3. Test DNS resolution
                  nslookup example.com
                  
                  # 4. Test with an explicit Host header
                  curl -v -H "Host: example.com" http://<ingress-ip>/path
                  
                  # 5. Check that the annotations are correct
                  kubectl get ingress <ingress-name> -o yaml | grep -A 10 annotations

                  10. Production Best Practices

                  ✅ High Availability

                  # 1. Multiple Ingress Controller replicas
                  spec:
                    replicas: 3
                    
                    # 2. Pod anti-affinity (spread replicas across nodes)
                    affinity:
                      podAntiAffinity:
                        requiredDuringSchedulingIgnoredDuringExecution:
                        - labelSelector:
                            matchExpressions:
                            - key: app.kubernetes.io/name
                              operator: In
                              values:
                              - ingress-nginx
                          topologyKey: kubernetes.io/hostname
                  
                    # 3. PodDisruptionBudget (keep at least 2 replicas running)
                  ---
                  apiVersion: policy/v1
                  kind: PodDisruptionBudget
                  metadata:
                    name: ingress-nginx
                    namespace: ingress-nginx
                  spec:
                    minAvailable: 2
                    selector:
                      matchLabels:
                        app.kubernetes.io/name: ingress-nginx

                  ✅ Resource Limits

                  resources:
                    requests:
                      cpu: "500m"
                      memory: "512Mi"
                    limits:
                      cpu: "2"
                      memory: "2Gi"
                  
                  # HPA autoscaling
                  ---
                  apiVersion: autoscaling/v2
                  kind: HorizontalPodAutoscaler
                  metadata:
                    name: ingress-nginx
                    namespace: ingress-nginx
                  spec:
                    scaleTargetRef:
                      apiVersion: apps/v1
                      kind: Deployment
                      name: ingress-nginx-controller
                    minReplicas: 3
                    maxReplicas: 10
                    metrics:
                    - type: Resource
                      resource:
                        name: cpu
                        target:
                          type: Utilization
                          averageUtilization: 70
                    - type: Resource
                      resource:
                        name: memory
                        target:
                          type: Utilization
                          averageUtilization: 80

                  ✅ Security Hardening

                  # 1. Expose only the ports you need
                  # 2. Enforce TLS 1.2+
                  # 3. Set security headers
                  metadata:
                    annotations:
                      nginx.ingress.kubernetes.io/configuration-snippet: |
                        more_set_headers "X-Frame-Options: DENY";
                        more_set_headers "X-Content-Type-Options: nosniff";
                        more_set_headers "X-XSS-Protection: 1; mode=block";
                        more_set_headers "Strict-Transport-Security: max-age=31536000; includeSubDomains";
                  
                  # 4. Enable a WAF (Web Application Firewall)
                  nginx.ingress.kubernetes.io/enable-modsecurity: "true"
                  nginx.ingress.kubernetes.io/enable-owasp-core-rules: "true"
                  
                  # 5. Rate limiting
                  nginx.ingress.kubernetes.io/limit-rps: "100"
                  nginx.ingress.kubernetes.io/limit-burst-multiplier: "5"

                  ✅ Monitoring and Alerting

                  # Example Prometheus alerting rules
                  groups:
                  - name: ingress
                    rules:
                    - alert: IngressControllerDown
                      expr: up{job="ingress-nginx-controller-metrics"} == 0
                      for: 5m
                      annotations:
                        summary: "Ingress Controller is down"
                    
                    - alert: HighErrorRate
                      expr: rate(nginx_ingress_controller_requests{status=~"5.."}[5m]) > 0.05
                      for: 5m
                      annotations:
                        summary: "High 5xx error rate"
                    
                    - alert: HighLatency
                      expr: histogram_quantile(0.95, nginx_ingress_controller_request_duration_seconds_bucket) > 1
                      for: 10m
                      annotations:
                        summary: "High request latency (p95 > 1s)"

                  📚 Summary: Ingress vs Alternatives

                  Ingress vs LoadBalancer Service

                  Scenario: deploying 10 microservices
                  
                  Option A: one LoadBalancer per service
                  - Cost: 10 LoadBalancers × $20/month = $200/month
                  - Management: 10 separate IP addresses
                  - Routing: no smart routing
                  - TLS: configured separately for every service
                  
                  Option B: a single Ingress
                  - Cost: 1 LoadBalancer × $20/month = $20/month ✅
                  - Management: 1 IP address ✅
                  - Routing: smart routing by host/path ✅
                  - TLS: certificates managed centrally ✅

                  Ingress vs API Gateway

                  | Capability                       | Ingress    | API Gateway (Kong/Tyk) |
                  |----------------------------------|------------|------------------------|
                  | Basic routing                    | ✅         | ✅                     |
                  | Authentication / authorization   | ⚠️ Basic   | ✅ Full-featured       |
                  | Rate limiting / circuit breaking | ⚠️ Basic   | ✅ Advanced            |
                  | Plugin ecosystem                 | ❌ Limited | ✅ Rich                |
                  | Learning curve                   | ✅ Simple  | ⚠️ Steeper             |
                  | Performance                      | ✅ High    | ⚠️ Moderate            |

                  🎓 Suggested Learning Path

                  1. Getting started (1-2 weeks)
                  
                    • Understand the Ingress concepts
                    • Deploy the Nginx Ingress Controller
                    • Create basic Ingress rules
                    • Configure HTTP/HTTPS access
                  2. Intermediate (2-4 weeks)
                  
                    • Master the different routing strategies
                    • TLS certificate management (cert-manager)
                    • Canary releases
                    • Performance tuning
                  3. Advanced (1-2 months)
                  
                    • Managing multiple Ingress Controllers
                    • WAF and security hardening
                    • Monitoring and alerting
                    • Troubleshooting
                  4. Expert (ongoing)
                  
                    • Reading the source code
                    • Developing custom plugins
                    • Migrating to the Gateway API


                  Mar 7, 2024

                  Nginx Performance Tuning

                  This guide covers Nginx performance tuning across several dimensions: the operating system and hardware layer, Nginx configuration, architecture and deployment, and monitoring.


                  I. Operating System and Hardware Tuning

                  This is the foundation: give Nginx a high-performance environment to run on.

                  1. Raise the file descriptor limit. Every Nginx connection (especially when serving static files) consumes a file descriptor; under high concurrency the default limit quickly becomes a bottleneck.

                    # Takes effect for the current session only
                    ulimit -n 65536
                    
                    # Permanent: edit /etc/security/limits.conf
                    * soft nofile 65536
                    * hard nofile 65536
                    
                    # Also make sure nginx.conf sets a matching worker_rlimit_nofile
                    worker_rlimit_nofile 65536;
                  2. Tune the network stack

                    • Adjust net.core.somaxconn: the maximum queue length of connections waiting for Nginx to accept(). Increase it if you see accept() queue overflow errors.
                      sysctl -w net.core.somaxconn=65535
                      Also set the backlog parameter explicitly on the Nginx listen directive:
                      listen 80 backlog=65535;
                    • Enable TCP Fast Open: reduces latency from the TCP three-way handshake.
                      sysctl -w net.ipv4.tcp_fastopen=3
                    • Widen the ephemeral port range: as a reverse proxy, Nginx needs many local ports to connect to upstream servers.
                      sysctl -w net.ipv4.ip_local_port_range="1024 65535"
                    • Reduce TCP TIME_WAIT pressure: with many short-lived connections, sockets stuck in TIME_WAIT can exhaust local ports.
                      # Allow reusing TIME_WAIT sockets for new outbound connections
                      sysctl -w net.ipv4.tcp_tw_reuse=1
                      # Leave tcp_tw_recycle disabled: it breaks clients behind NAT and was removed in Linux 4.12
                      sysctl -w net.ipv4.tcp_tw_recycle=0
                      # Shorten the FIN_WAIT_2 timeout (default is 60s)
                      sysctl -w net.ipv4.tcp_fin_timeout=30
                  3. Use fast storage. For static content, SSDs dramatically improve I/O performance.


                  II. Nginx Configuration Tuning

                  This is the core of the work: it directly determines how Nginx behaves.

                  1. Worker processes and connections

                    • worker_processes auto;: let Nginx size the number of worker processes from the CPU core count (usually one per core).
                    • worker_connections: the maximum number of connections each worker process can handle. Together with worker_rlimit_nofile it determines Nginx's total concurrency.
                      events {
                          worker_connections 10240; # e.g. 10240
                          use epoll; # use the high-performance epoll event model on Linux
                      }
                  2. Efficient static file serving

                    • Enable sendfile: transfers file data inside the kernel, bypassing user space; very efficient.
                      sendfile on;
                    • Enable tcp_nopush: used with sendfile on; waits for packets to fill before sending, improving network efficiency.
                      tcp_nopush on;
                    • Enable tcp_nodelay: for keepalive connections, sends data immediately to reduce latency. Usually used together with tcp_nopush.
                      tcp_nodelay on;
                  3. Connection and request timeouts. Sensible timeouts release idle resources and keep connections from being held open indefinitely.

                    # Client keepalive timeout
                    keepalive_timeout 30s;
                    # Timeouts for connections to upstream servers
                    proxy_connect_timeout 5s;
                    proxy_send_timeout 60s;
                    proxy_read_timeout 60s;
                    # Timeout for reading the client request headers
                    client_header_timeout 15s;
                    # Timeout for reading the client request body
                    client_body_timeout 15s;
                  4. Buffering and caching

                    • Buffer sizes: size the client header and body buffers appropriately so Nginx does not spill requests to temporary files, reducing I/O.
                      client_header_buffer_size 1k;
                      large_client_header_buffers 4 4k;
                      client_body_buffer_size 128k;
                    • Proxy buffers: when Nginx acts as a reverse proxy, these control the buffers used to receive data from upstream servers.
                      proxy_buffering on;
                      proxy_buffer_size 4k;
                      proxy_buffers 8 4k;
                    • Enable caching
                      • Static asset caching: use the expires and add_header directives to give static assets long browser cache lifetimes.
                        location ~* \.(jpg|jpeg|png|gif|ico|css|js)$ {
                            expires 1y;
                            add_header Cache-Control "public, immutable";
                        }
                      • Reverse-proxy caching: use the proxy_cache module to cache dynamic content from upstream servers, greatly reducing backend load.
                        proxy_cache_path /path/to/cache levels=1:2 keys_zone=my_cache:10m max_size=10g inactive=60m;
                        location / {
                            proxy_cache my_cache;
                            proxy_cache_valid 200 302 10m;
                            proxy_cache_use_stale error timeout updating http_500 http_502 http_503 http_504;
                        }
                  5. Logging

                    • Disable access logging: for extremely high concurrency where access logs are not needed, turn off access_log.
                    • Buffered log writes: with the buffer parameter, Nginx writes log entries to an in-memory buffer first and flushes them to disk when it fills.
                      access_log /var/log/nginx/access.log main buffer=64k flush=1m;
                    • Log only what matters: trim the log format down to the fields you actually need.
                  6. Gzip compression. Compress text responses to reduce the amount of data on the wire.

                    gzip on;
                    gzip_vary on;
                    gzip_min_length 1024; # responses smaller than this are not compressed
                    gzip_types text/plain text/css application/json application/javascript text/xml application/xml application/xml+rss text/javascript;
                  7. Upstream keepalive. When proxying to backend services, keep a pool of idle connections to avoid the overhead of repeatedly opening and closing TCP connections.

                    upstream backend_servers {
                        server 10.0.1.100:8080;
                        keepalive 32; # number of idle upstream connections to keep (per worker)
                    }
                    
                    location / {
                        proxy_pass http://backend_servers;
                        proxy_http_version 1.1;
                        proxy_set_header Connection "";
                    }

                  III. Architecture and Deployment

                  1. Load balancing. Use the upstream module to spread traffic across multiple backend servers for horizontal scaling and high availability.

                    upstream app_cluster {
                        least_conn; # least-connections balancing algorithm
                        server 10.0.1.101:8080;
                        server 10.0.1.102:8080;
                        server 10.0.1.103:8080;
                    }
                  2. Separate static and dynamic content. Serve static assets (images, CSS, JS) directly from Nginx and proxy dynamic requests to the application servers (Tomcat, Node.js, etc.); a sketch follows below.
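
                    A minimal sketch of this split, assuming static files live under /var/www/static and the application listens on 127.0.0.1:8080:

                    server {
                        listen 80;
                        server_name example.com;

                        # Static assets served straight from disk by Nginx
                        location /static/ {
                            root /var/www;      # files resolve to /var/www/static/...
                            expires 30d;
                            access_log off;
                        }

                        # Everything else is proxied to the application server
                        location / {
                            proxy_pass http://127.0.0.1:8080;
                            proxy_set_header Host $host;
                            proxy_set_header X-Real-IP $remote_addr;
                        }
                    }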

                  3. Enable HTTP/2. HTTP/2 adds multiplexing, header compression, and other features that noticeably improve page load times.

                    listen 443 ssl http2;
                  4. Use third-party modules. Compile in third-party modules as needed, for example (a Brotli sketch follows this list):

                    • OpenResty: built on Nginx and LuaJIT, adds powerful scripting capabilities.
                    • ngx_brotli: Brotli compression, which usually achieves better ratios than gzip.
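
                    If ngx_brotli is compiled in (an assumption; it is not part of stock Nginx), a minimal configuration looks like this:

                    brotli on;
                    brotli_comp_level 6;
                    brotli_static on;   # serve pre-compressed .br files when present
                    brotli_types text/plain text/css application/json application/javascript text/xml application/xml image/svg+xml;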

                  IV. Monitoring and Debugging

                  Tuning is not a one-off task; it requires continuous monitoring.

                  1. Enable the status module. Use stub_status_module to expose Nginx's basic runtime status.

                    location /nginx_status {
                        stub_status;
                        allow 127.0.0.1; # only allow access from localhost
                        deny all;
                    }

                    The page shows active connections, total requests, and similar counters.

                  2. Analyze the logs. Use tools such as goaccess or awstats to analyze access logs and understand traffic patterns and bottlenecks (a goaccess sketch follows).
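
                    For example, goaccess (assuming it is installed and your access log uses the combined format) can produce a quick report:

                    # Interactive terminal report
                    goaccess /var/log/nginx/access.log --log-format=COMBINED

                    # Static HTML report
                    goaccess /var/log/nginx/access.log --log-format=COMBINED -o /var/www/html/report.html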

                  3. Profile when needed. In extreme cases, use debug logging or system tools such as perf and strace for deep performance analysis.

                  Summary and Recommendations

                  1. Change one thing at a time: do not modify everything at once. Adjust one or two parameters, load-test (e.g. with wrk, ab, or JMeter; see the sketch after this list), and observe the effect.
                  2. Monitor first: have reliable monitoring data before, during, and after each change.
                  3. Understand the workload: the right strategy depends heavily on whether you serve many concurrent connections, large file downloads, or lots of short dynamic requests.
                  4. Be careful with kernel parameters: validate any kernel tuning thoroughly in a test environment before touching production.
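
                  A minimal load-test sketch for such before/after comparisons (assuming wrk and ab are installed and the site is reachable at http://localhost/):

                  # wrk: 4 threads, 200 open connections, 30 seconds
                  wrk -t4 -c200 -d30s http://localhost/

                  # ApacheBench: 10,000 requests, 100 concurrent
                  ab -n 10000 -c 100 http://localhost/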

                  Used together, these techniques can significantly improve Nginx's performance and stability and let it handle very high levels of concurrency on suitable hardware.

                  Oct 7, 2024

                  Traefik VS Nginx

                  This is a classic comparison. Traefik and Nginx Ingress are both top-tier Ingress Controllers in the Kubernetes ecosystem, but they differ significantly in design philosophy, user experience, and focus.

                  In short:

                  • Traefik is more like a dynamic, automated API gateway built for cloud-native and microservice environments.
                  • Nginx Ingress is more like a powerful, stable reverse proxy/load balancer built on the battle-tested Nginx, with very deep configurability.

                  Below is a detailed look at Traefik's main advantages over Nginx Ingress.

                  Traefik's Core Advantages

                  1. Fully Dynamic Configuration and Automation

                  This is Traefik's biggest selling point.

                  • How it works: Traefik watches the Kubernetes API server and reacts in real time to changes in Services, IngressRoutes, Secrets, and so on. When you create or modify an Ingress resource, Traefik updates its routing within seconds, with no restart or reload.
                  • How Nginx Ingress compares: the nginx-ingress-controller component watches for changes, renders a new nginx.conf, and then sends Nginx a reload signal to pick it up. The process is fast, but it is fundamentally a "render and reload" model; with very heavy traffic or complex configuration, a reload can introduce small latency or performance blips.

                  Bottom line: in cloud-native environments that want full automation and zero reloads, Traefik's dynamic model is more attractive.

                  2. A Simpler Configuration Model and the IngressRoute CRD

                  Traefik fully supports the standard Kubernetes Ingress resource, but it encourages its own Custom Resource Definition (CRD), the IngressRoute.

                  • Why it is better: the standard Ingress resource is relatively limited, and many advanced features (retries, rate limiting, circuit breaking, request mirroring, and so on) have to be expressed through verbose annotations, which hurts readability and maintainability.
                  • Traefik's IngressRoute: a declarative, structured YAML/JSON configuration. Everything, including TLS, middlewares, and routing rules, is defined with a clear structure in a single CRD, which fits Kubernetes' native philosophy and is easier to version-control and review.

                  Example comparison: path rewriting via an Nginx Ingress annotation:

                  apiVersion: networking.k8s.io/v1
                  kind: Ingress
                  metadata:
                    name: my-ingress
                    annotations:
                      nginx.ingress.kubernetes.io/rewrite-target: /

                  The same with a Traefik IngressRoute plus a middleware:

                  apiVersion: traefik.containo.us/v1alpha1
                  kind: IngressRoute
                  metadata:
                    name: my-ingressroute
                  spec:
                    routes:
                    - match: PathPrefix(`/api`)
                      kind: Rule
                      services:
                      - name: my-service
                        port: 80
                      middlewares:
                      - name: strip-prefix # reference a separate, reusable Middleware resource
                  ---
                  apiVersion: traefik.containo.us/v1alpha1
                  kind: Middleware
                  metadata:
                    name: strip-prefix
                  spec:
                    stripPrefix:
                      prefixes:
                        - /api

                  As you can see, the Traefik configuration is more modular and easier to read.

                  3. A built-in, feature-rich Dashboard

                  Traefik ships with an intuitive web UI. Once it is enabled, you can watch all routers, services, and middlewares in the browser in real time, along with their health and how they relate to each other.

                  • This is a huge help for development and debugging: you can see at a glance how traffic is being routed, without digging through configuration files or command-line output.
                  • Nginx Ingress has no official graphical dashboard. You can monitor it with third-party tooling (such as Prometheus + Grafana) or query state with kubectl, but that is far less direct than Traefik's native dashboard.
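
                  Enabling the dashboard in Traefik v2 usually takes only a couple of lines of static configuration; a minimal sketch (the insecure flag skips authentication and is only suitable for local experiments):

                  # traefik.yml (static configuration)
                  api:
                    dashboard: true
                    insecure: true  # serve the dashboard on the internal :8080 entrypoint without auth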

                  4. Native support for multiple providers

                  Traefik is designed to be multi-provider. Besides Kubernetes, it can read configuration from Docker, Consul, Etcd, Rancher, or even a plain static file, all at the same time. If your stack is mixed (for example, some services on K8s and some on Docker Compose), Traefik can act as a single entry point and simplify the architecture.

                  Nginx Ingress can be extended in other ways, but its core is built for Kubernetes.

                  5. A powerful, flexible middleware model

                  Traefik's "middleware" concept is very powerful. Features such as authentication, rate limiting, header manipulation, redirects, and circuit breaking are defined as independent, reusable components that you attach to any routing rule by reference; see the rate-limit sketch below.

                  This model greatly improves configuration reuse and flexibility and is ideal for building complex traffic policies.
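
                  A minimal sketch of such a reusable middleware, here a rate limiter (the name and limits are assumptions):

                  apiVersion: traefik.containo.us/v1alpha1
                  kind: Middleware
                  metadata:
                    name: api-rate-limit
                  spec:
                    rateLimit:
                      average: 100  # allowed requests per second on average
                      burst: 50     # additional requests tolerated in short bursts

                  Like strip-prefix in the earlier example, it is attached to a route through the middlewares list of an IngressRoute.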

                  Areas Where Nginx Ingress Shines (for balance)

                  To make a well-rounded choice, it helps to know Nginx Ingress's strengths as well:

                  1. Top-tier performance and stability: built on Nginx, the most mature web server in the world, hardened by decades of production use; it excels at very high concurrency, static content, and long-lived connections.
                  2. Extremely rich feature set: Nginx itself has a huge feature set, and the Nginx Ingress Controller exposes much of it through annotations, so its capability ceiling may exceed Traefik's in some areas.
                  3. Huge community and ecosystem: Nginx's user base is enormous; almost any problem you run into already has a documented solution or write-up.
                  4. Fine-grained control: deep Nginx experts can inject custom configuration snippets through a ConfigMap and achieve almost anything they want, with very strong control.
                  5. Licensing: Nginx and the community Nginx Ingress Controller are under the permissive Apache 2.0 license. The open-source Traefik proxy itself is MIT-licensed; the more restrictive, source-available terms apply to Traefik Labs' commercial products (such as Traefik Enterprise), which can still raise compliance questions in some larger organizations. Nginx Ingress has no such concern.

                  Summary and Selection Guide

                  Aspect | Traefik | Nginx Ingress
                  Configuration model | Dynamic and automated, no reloads | "Generate and reload" model
                  Configuration syntax | Declarative CRDs, clearly structured | Mostly annotations, more verbose
                  Dashboard | Built in, powerful, works out of the box | No official UI, needs third-party integration
                  Design philosophy | Cloud-native first, microservice friendly | Features and performance first, robust and reliable
                  Learning curve | Low, easy to adopt and operate | Medium, requires Nginx knowledge
                  Performance | Excellent, enough for the vast majority of workloads | Top tier, especially for static content and very high concurrency
                  Extensibility | Middlewares, highly modular | Lua scripts or custom templates, very high ceiling
                  License | MIT (open-source proxy; commercial add-ons licensed separately) | Apache 2.0

                  How to choose?

                  • Choose Traefik if:

                    • You want the most cloud-native experience possible, with simple, automated configuration.
                    • Your team prefers Kubernetes-native, declarative configuration.
                    • You value the built-in dashboard for day-to-day operations and debugging.
                    • Your application landscape is dynamic, with frequent releases and changes.
                    • You do not need to squeeze out the last bit of performance and care more about development speed and operational simplicity.
                  • Choose Nginx Ingress if:

                    • You have extreme performance and stability requirements (for example, very large gateways or CDN edge nodes).
                    • You need very complex or niche Nginx features and fine-grained control.
                    • Your team already knows Nginx deeply and has built up solid expertise.
                    • You have strict licensing requirements and want to stay on permissive licenses such as Apache 2.0.
                    • Your environment is relatively stable and routing configuration changes rarely.

                  In short, Traefik wins on experience and automation and is an ideal companion for modern microservice and cloud-native environments, while Nginx Ingress wins on performance and feature depth as a battle-hardened, reliable engine.

                  Mar 7, 2024

                  RPC

                    Mar 7, 2025

                    Subsections of Storage

                    User Based Policy

                    User Based Policy

                     You can change <$bucket> to control which bucket the permission applies to.

                     • ${aws:username} is a built-in AWS policy variable that expands to the name of the logged-in IAM user.
                    {
                        "Version": "2012-10-17",
                        "Statement": [
                            {
                                "Sid": "AllowUserToSeeBucketListInTheConsole",
                                "Action": [
                                    "s3:ListAllMyBuckets",
                                    "s3:GetBucketLocation"
                                ],
                                "Effect": "Allow",
                                "Resource": [
                                    "arn:aws:s3:::*"
                                ]
                            },
                            {
                                "Sid": "AllowRootAndHomeListingOfCompanyBucket",
                                "Action": [
                                    "s3:ListBucket"
                                ],
                                "Effect": "Allow",
                                "Resource": [
                                    "arn:aws:s3:::<$bucket>"
                                ],
                                "Condition": {
                                    "StringEquals": {
                                        "s3:prefix": [
                                            "",
                                            "<$path>/",
                                            "<$path>/${aws:username}"
                                        ],
                                        "s3:delimiter": [
                                            "/"
                                        ]
                                    }
                                }
                            },
                            {
                                "Sid": "AllowListingOfUserFolder",
                                "Action": [
                                    "s3:ListBucket"
                                ],
                                "Effect": "Allow",
                                "Resource": [
                                    "arn:aws:s3:::<$bucket>"
                                ],
                                "Condition": {
                                    "StringLike": {
                                        "s3:prefix": [
                                            "<$path>/${aws:username}/*"
                                        ]
                                    }
                                }
                            },
                            {
                                "Sid": "AllowAllS3ActionsInUserFolder",
                                "Effect": "Allow",
                                "Action": [
                                    "s3:*"
                                ],
                                "Resource": [
                                    "arn:aws:s3:::<$bucket>/<$path>/${aws:username}/*"
                                ]
                            }
                        ]
                    }
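
                     If you are on AWS IAM, a policy like this can be attached with the CLI; a minimal sketch (the user name, policy name, and file name are assumptions):

                     aws iam put-user-policy --user-name alice \
                         --policy-name user-folder-access \
                         --policy-document file://user-policy.json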
                    • <$uid> is the Aliyun account UID.
                    {
                        "Version": "1",
                        "Statement": [{
                            "Effect": "Allow",
                            "Action": [
                                "oss:*"
                            ],
                            "Principal": [
                                "<$uid>"
                            ],
                            "Resource": [
                                "acs:oss:*:<$oss_id>:<$bucket>/<$path>/*"
                            ]
                        }, {
                            "Effect": "Allow",
                            "Action": [
                                "oss:ListObjects",
                                "oss:GetObject"
                            ],
                            "Principal": [
                                 "<$uid>"
                            ],
                            "Resource": [
                                "acs:oss:*:<$oss_id>:<$bucket>"
                            ],
                            "Condition": {
                                "StringLike": {
                                "oss:Prefix": [
                                        "<$path>/*"
                                    ]
                                }
                            }
                        }]
                    }
                    Example:
                    {
                    	"Version": "1",
                    	"Statement": [{
                    		"Effect": "Allow",
                    		"Action": [
                    			"oss:*"
                    		],
                    		"Principal": [
                    			"203415213249511533"
                    		],
                    		"Resource": [
                    			"acs:oss:*:1007296819402486:conti-csst/test/*"
                    		]
                    	}, {
                    		"Effect": "Allow",
                    		"Action": [
                    			"oss:ListObjects",
                    			"oss:GetObject"
                    		],
                    		"Principal": [
                    			"203415213249511533"
                    		],
                    		"Resource": [
                    			"acs:oss:*:1007296819402486:conti-csst"
                    		],
                    		"Condition": {
                    			"StringLike": {
                    				"oss:Prefix": [
                    					"test/*"
                    				]
                    			}
                    		}
                    	}]
                    }
                    Mar 14, 2024

                    Mirrors

                    Gradle Tencent Mirror

                    https://mirrors.cloud.tencent.com/gradle/gradle-8.0-bin.zip
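
                     A common way to use it is to point the Gradle wrapper at the mirror; a minimal sketch of gradle/wrapper/gradle-wrapper.properties (the version in the URL is just the example above):

                     distributionUrl=https://mirrors.cloud.tencent.com/gradle/gradle-8.0-bin.zip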

                     PIP Tuna Mirror

                    pip install -i https://pypi.tuna.tsinghua.edu.cn/simple some-package

                    Maven Mirror

                    <mirror>
                        <id>aliyunmaven</id>
                        <mirrorOf>*</mirrorOf>
                        <name>Aliyun Public Repository</name>
                        <url>https://maven.aliyun.com/repository/public</url>
                    </mirror>
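
                     This block belongs inside the <mirrors> element of Maven's settings.xml (typically ~/.m2/settings.xml); a minimal sketch of the surrounding file:

                     <settings xmlns="http://maven.apache.org/SETTINGS/1.0.0">
                         <mirrors>
                             <!-- paste the <mirror> block above here -->
                         </mirrors>
                     </settings>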
                    Mar 7, 2024

                    Languages

                    Mar 7, 2024

                    Subsections of Languages

                    ♨️JAVA

                      Mar 7, 2024

                      Subsections of ♨️JAVA

                      Subsections of JVM Related

                      AOT or JIT

                      JDK 9 introduced a new compilation mode, AOT (Ahead-of-Time Compilation). Unlike JIT, it compiles the program to machine code before it ever runs, which is static compilation (the model used by languages such as C, C++, Rust, and Go). AOT avoids JIT warm-up and its associated overhead, so it speeds up Java program startup and eliminates the long warm-up phase. It also reduces memory footprint and improves security (AOT-compiled code is harder to decompile and modify), which makes it a particularly good fit for cloud-native scenarios.

                      In short, AOT's main advantages are startup time, memory footprint, and artifact size, while JIT's main advantage is a higher peak throughput ceiling, which helps keep worst-case request latency down.

                      https://cn.dubbo.apache.org/zh-cn/blog/2023/06/28/%e8%b5%b0%e5%90%91-native-%e5%8c%96springdubbo-aot-%e6%8a%80%e6%9c%af%e7%a4%ba%e4%be%8b%e4%b8%8e%e5%8e%9f%e7%90%86%e8%ae%b2%e8%a7%a3/

                      https://mp.weixin.qq.com/s/4haTyXUmh8m-dBQaEzwDJw

                      If AOT has so many advantages, why not use it for everything?

                      As compared above, JIT and AOT each have their strengths; AOT is simply a better fit for today's cloud-native scenarios and is friendlier to microservice architectures. Beyond that, AOT compilation cannot support some of Java's dynamic features, such as reflection, dynamic proxies, dynamic class loading, and JNI (Java Native Interface). Yet many frameworks and libraries (Spring, CGLIB, and so on) rely on exactly these features, so using AOT alone would mean giving them up or adapting and tuning them specifically. For example, CGLIB's dynamic proxies are built on ASM, whose basic principle is to generate and load modified bytecode directly in memory at run time, and that kind of runtime class generation is exactly what a purely AOT-compiled program cannot do.
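
                      As a concrete illustration (a minimal sketch, assuming GraalVM with the native-image tool installed; not part of the original text), compiling a trivial class ahead of time looks like this:

                      javac Hello.java        # normal bytecode compilation
                      native-image Hello      # AOT-compile the class and everything reachable from it into ./hello
                      ./hello                 # starts in milliseconds, with no JVM warm-up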

                      Mar 7, 2024

                      Volatile

                      volatile is a lightweight synchronization mechanism provided by the Java Virtual Machine, with three key properties (see the sketch after the list):

                      Guarantees visibility

                      Does not guarantee atomicity

                      Forbids instruction reordering
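
                      A minimal sketch of the visibility guarantee (the class and field names are made up for illustration): without volatile the worker thread might never observe the update, while with volatile the change is guaranteed to become visible.

                      public class VolatileVisibilityDemo {
                          // volatile guarantees that a write made by one thread becomes visible to others;
                          // it does NOT make compound actions such as counter++ atomic
                          private static volatile boolean running = true;

                          public static void main(String[] args) throws InterruptedException {
                              Thread worker = new Thread(() -> {
                                  while (running) {
                                      // busy-wait until another thread flips the flag
                                  }
                                  System.out.println("worker observed running == false, exiting");
                              });
                              worker.start();

                              Thread.sleep(1000);
                              running = false; // this write is published to the worker thread
                              worker.join();
                          }
                      }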

                      Mar 7, 2024

                      ♨️JAVA

                        Mar 7, 2024

                        Subsections of Design Pattern

                        Observers

                        Mar 7, 2024

                        Subsections of Web Pattern

                        HTTP Code

                        1xx - Informational (provisional responses)

                        The request has been received and processing continues. Rarely seen in a browser.

                        • 100 Continue: the client should continue sending the rest of the request. Typically used before sending a large POST or PUT body, to ask whether the server is willing to accept it.
                        • 101 Switching Protocols: the client asked to switch protocols (for example to WebSocket) and the server has agreed.

                        2xx - Success (the request succeeded)

                        The request was successfully received, understood, and processed by the server.

                        • 200 OK: the most common success code. The request succeeded and the response body carries the requested data (an HTML page, JSON, and so on).
                        • 201 Created: a resource was successfully created, usually after a POST or PUT. The Location response header normally contains the URL of the new resource.
                        • 202 Accepted: the request has been accepted but not yet processed. Suitable for asynchronous tasks ("the request is queued and being processed").
                        • 204 No Content: the request succeeded but the response carries no body. Common for successful DELETE requests, or AJAX calls where the frontend only needs to know the operation succeeded.

                        3xx - Redirection (further action required)

                        The client must take additional action to complete the request, usually by following a redirect.

                        • 301 Moved Permanently: the resource has been permanently moved to a new URL. Search engines update their links to the new address, and browsers cache this redirect.
                        • 302 Found: a temporary redirect; the resource is temporarily served from another URL. Search engines keep the old link. This is the most common redirect type; the spec says the method should not change, though browsers often switch to GET in practice.
                        • 304 Not Modified: the resource has not changed; used for cache control. When the client holds a cached copy and asks (via headers such as If-Modified-Since) whether it has been updated, the server answers 304 so the client reuses its cache, saving bandwidth (see the curl sketch after this list).
                        • 307 Temporary Redirect: a stricter temporary redirect. Like 302, but the client must not change the original request method (a POST stays a POST). More standards-compliant than 302.
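
                        A minimal sketch of such a conditional request (the URL and date are placeholders):

                        # ask whether the cached copy from this date is still fresh
                        curl -I https://example.com/logo.png \
                             -H 'If-Modified-Since: Tue, 01 Oct 2024 00:00:00 GMT'
                        # an unchanged resource is answered with "304 Not Modified" and an empty body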

                        4xx - Client Errors (the request is at fault)

                        The client appears to have erred and the server cannot process the request.

                        • 400 Bad Request: the server cannot understand the request because its syntax is invalid, like a sentence the server simply cannot parse.
                        • 401 Unauthorized: the request requires user authentication, usually a login or a token. Despite the name, it really means "unauthenticated" rather than "unauthorized".
                        • 403 Forbidden: the server understands the request but refuses to fulfil it. Unlike 401, authenticating does not help (for example, an ordinary user trying to open an admin page).
                        • 404 Not Found: the most famous error code. The server cannot find the requested resource; the URL may be wrong or the resource deleted.
                        • 405 Method Not Allowed: the method in the request line (GET, POST, and so on) cannot be used for this resource, for example sending a POST to a URL that only accepts GET.
                        • 408 Request Timeout: the server waited too long for the client to send the request.
                        • 409 Conflict: the request conflicts with the server's current state, commonly seen with PUT (for example a version conflict when modifying a file).
                        • 429 Too Many Requests: the client sent too many requests in a given time window (rate limiting).

                        5xx - Server Errors (the server failed while handling the request)

                        The server ran into an error or internal fault while processing the request.

                        • 500 Internal Server Error: the most generic server-side error code. The server hit an unexpected condition that prevented it from completing the request, typically an uncaught exception in backend code.
                        • 502 Bad Gateway: acting as a gateway or proxy, the server received an invalid response from an upstream server. Common when the application server behind Nginx (PHP-FPM, for instance) has crashed or has not started.
                        • 503 Service Unavailable: the server cannot handle the request right now (overload or maintenance). Usually a temporary state; the Retry-After response header may tell the client when to retry.
                        • 504 Gateway Timeout: acting as a gateway or proxy, the server did not receive a timely response from the upstream server. Common with network latency or a slow upstream service.

                        Quick-Reference Table

                        Code | Class | Meaning | Typical scenario
                        200 | Success | OK | Page or data fetched normally
                        201 | Success | Created | A new user or article was created
                        204 | Success | No Content | Delete succeeded, or an AJAX call that returns no data
                        301 | Redirection | Moved Permanently | Site restructured; old links permanently point to the new ones
                        302 | Redirection | Found (temporary) | Redirect back to the home page after login
                        304 | Redirection | Not Modified | Browser cache reused, saving bandwidth
                        400 | Client error | Bad Request | Malformed request parameters
                        401 | Client error | Unauthorized | Login required
                        403 | Client error | Forbidden | Insufficient permissions
                        404 | Client error | Not Found | The requested URL does not exist
                        429 | Client error | Too Many Requests | API call rate limit exceeded
                        500 | Server error | Internal Server Error | Backend bug or database connection failure
                        502 | Server error | Bad Gateway | Nginx cannot reach the backend service
                        503 | Server error | Service Unavailable | Server under maintenance or overloaded
                        504 | Server error | Gateway Timeout | Backend service responds too slowly
