Aaron's Dev Path

About Me

Dev Path

gitGraph:
  commit id:"Graduate From High School" tag:"Linfen, China"
  commit id:"Got Driver Licence" tag:"2013.08"
  branch TYUT
  commit id:"Enrollment TYUT 🥰"  tag:"Taiyuan, China"
  commit id:"Develop Game App" tag:"“Hello Hell”" type: HIGHLIGHT
  commit id:"Plan:3+1" tag:"2016.09"
  branch Briup.Ltd
  commit id:"First Internship" tag:"Suzhou, China"
  commit id:"CRUD boy" 
  commit id:"Resignation" tag:"2017.01" type:REVERSE
  checkout TYUT
  merge Briup.Ltd id:"Final Presentation" tag:"2017.04"
  checkout Briup.Ltd
  branch Enjoyor.PLC
  commit id:"Second Internship" tag:"Hangzhou,China"
  checkout TYUT
  merge Enjoyor.PLC id:"Got SE Bachelor Degree " tag:"2017.07"
  checkout Enjoyor.PLC
  commit id:"First Full Time Job" tag:"2017.07"
  commit id:"Resignation" tag:"2018.04"
  checkout main
  merge Enjoyor.PLC id:"Plan To Study Abroad"
  commit id:"Get Some Rest" tag:"2018.06"
  branch TOEFL-GRE
  commit id:"Learning At Huahua.Ltd" tag:"Beijing,China"
  commit id:"Got USC Admission" tag:"2018.11" type: HIGHLIGHT
  checkout main
  merge TOEFL-GRE id:"Prepare To Leave" tag:"2018.12"
  branch USC
  commit id:"Pass Pre-School" tag:"Los Angeles,USA"
  checkout main
  merge USC id:"Back Home,Summer Break" tag:"2019.06"
  commit id:"Back To School" tag:"2019.07"
  checkout USC
  merge main id:"Got Straight As"
  commit id:"Learning ML, DL, GPT"
  checkout main
  merge USC id:"Back,Due to COVID-19" tag:"2021.02"
  checkout USC
  commit id:"Got DS Master Degree" tag:"2021.05"
  checkout main
  commit id:"Got An offer" tag:"2021.06"
  branch Zhejianglab
  commit id:"Second Full-Time Job" tag:"Hangzhou,China"
  commit id:"Got Promotion" tag:"2024.01"
  commit id:"For Now"
Mar 7, 2024

Subsections of Aaron's Dev Path

🐙Argo (CI/CD)

Content

CheatSheets

argoCD

  • decode password
kubectl -n argocd get secret argocd-initial-admin-secret -o jsonpath="{.data.password}" | base64 -d
  • relogin
ARGOCD_PASS=$(kubectl -n argocd get secret argocd-initial-admin-secret -o jsonpath="{.data.password}" | base64 -d)
MASTER_IP=$(kubectl get nodes --selector=node-role.kubernetes.io/control-plane -o jsonpath='{$.items[0].status.addresses[?(@.type=="InternalIP")].address}')
argocd login --insecure --username admin $MASTER_IP:30443 --password $ARGOCD_PASS
  • force delete
argocd app terminate-op <$>

argo Workflow

argo Rollouts

Mar 7, 2024

Subsections of 🐙Argo (CI/CD)

Subsections of Argo CD

Subsections of App Template

Deploy A Nginx App

Sync

When your k8s resource files are located in a `manifests` folder, you can deploy your app with the Application below.
You only need to set `spec.source.path: manifests`.

  • sample-repo
    • content
    • src
    • manifests
      • deploy.yaml
      • svc.yaml
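
For reference, the two files in the manifests folder could look like this minimal nginx sketch (the names, image, and port are illustrative, not taken from the actual repo):

# manifests/deploy.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: nginx
spec:
  replicas: 1
  selector:
    matchLabels:
      app: nginx
  template:
    metadata:
      labels:
        app: nginx
    spec:
      containers:
      - name: nginx
        image: docker.io/library/nginx:1.25
        ports:
        - containerPort: 80
---
# manifests/svc.yaml
apiVersion: v1
kind: Service
metadata:
  name: nginx
spec:
  selector:
    app: nginx
  ports:
  - port: 80
    targetPort: 80

The Application below then tells Argo CD to apply everything found under that path: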
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: hugo-blog
spec:
  project: default
  source:
    repoURL: 'git@github.com:AaronYang0628/home-site.git'
    targetRevision: main
    path: manifests
  syncPolicy:
    automated:
      prune: true
      selfHeal: true
    syncOptions:
      - CreateNamespace=true
      - ApplyOutOfSyncOnly=true
  destination:
    server: https://kubernetes.default.svc
    namespace: application

When you need not only the files in the `manifests` folder but also files in the repository root, you have to create an extra `kustomization.yaml` file and set `spec.source.path: .`

  • sample-repo
    • kustomization.yaml
    • content
    • src
    • manifests
      • deploy.yaml
      • svc.yaml
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: hugo-blog
spec:
  project: default
  source:
    repoURL: 'git@github.com:AaronYang0628/home-site.git'
    targetRevision: main
    path: .
  syncPolicy:
    automated:
      prune: true
      selfHeal: true
    syncOptions:
      - CreateNamespace=true
      - ApplyOutOfSyncOnly=true
  destination:
    server: https://kubernetes.default.svc
    namespace: application

And the kustomization.yaml simply lists the manifests to apply:

resources:
  - manifests/pvc.yaml
  - manifests/job.yaml
  - manifests/deployment.yaml
  - ...
Oct 22, 2025

Deploy N Clusters

ArgoCD, with its declarative GitOps approach, handles application releases across multiple Kubernetes clusters very elegantly. It lets you manage deployments to many clusters from one central Git repository, keeping their state consistent and enabling fast rollbacks.

The diagram below summarizes a typical multi-cluster release workflow with ArgoCD, to give you an overall picture first:

flowchart TD
    A[Git repository] --> B{ArgoCD Server}
    
    B --> C[ApplicationSet<br>Cluster generator]
    B --> D[ApplicationSet<br>Git generator]
    B --> E[Manually created<br>Application resources]
    
    C --> F[Cluster A<br>App1 & App2]
    C --> G[Cluster B<br>App1 & App2]
    
    D --> H[Cluster A<br>App1]
    D --> I[Cluster A<br>App2]
    
    E --> J[Specific cluster<br>specific app]

🔗 Connecting Clusters to ArgoCD

To let ArgoCD manage external clusters, you first need to add the target clusters' access credentials.

  1. Get the target cluster credentials: make sure you have the kubeconfig file of the target cluster.
  2. Add the cluster to ArgoCD: use the ArgoCD CLI to add it. This operation creates a Secret holding the cluster credentials in the ArgoCD namespace (a declarative equivalent is sketched right after this list).
    argocd cluster add <context-name> --name <cluster-name> --kubeconfig ~/.kube/config
    • <context-name> is the context name in your kubeconfig.
    • <cluster-name> is the alias you give this cluster inside ArgoCD.
  3. Verify the cluster connection: once added, you can check the list of clusters in the ArgoCD UI under “Settings” > “Clusters”, or via the CLI:
    argocd cluster list
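
The same registration can also be done declaratively. A sketch of the Secret that `argocd cluster add` effectively creates (the cluster name, server address, and token are placeholders):

apiVersion: v1
kind: Secret
metadata:
  name: prod-cluster
  namespace: argocd
  labels:
    argocd.argoproj.io/secret-type: cluster   # tells ArgoCD this Secret describes a cluster
type: Opaque
stringData:
  name: prod-cluster                          # alias shown in ArgoCD
  server: https://10.0.0.10:6443              # target cluster API server
  config: |
    {
      "bearerToken": "<service-account-token>",
      "tlsClientConfig": {
        "insecure": false,
        "caData": "<base64-encoded-ca>"
      }
    }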

💡 Choosing a Multi-Cluster Deployment Strategy

Once the clusters are connected, the core task is to define the deployment rules. ArgoCD describes deployments mainly through the Application and ApplicationSet resources.

  • Application resource: defines the deployment of one application to one specific cluster. When managing many clusters and applications, creating every Application by hand becomes tedious (a minimal sketch follows below).
  • ApplicationSet resource: the recommended way to implement multi-cluster deployments. Based on generators, it automatically creates Application resources for multiple clusters or multiple applications.

The flowchart above shows the two main ApplicationSet generators as well as the manually created Application approach.
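
For completeness, a manually created Application targeting one registered cluster is just the familiar resource with destination.server pointing at that cluster's API address (the repo URL, path, and server below are placeholders):

apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: my-app-cluster-a
  namespace: argocd
spec:
  project: default
  source:
    repoURL: 'https://your-git-repo.com/your-app.git'
    targetRevision: HEAD
    path: k8s-manifests
  destination:
    server: https://10.0.0.10:6443     # API address of the registered cluster
    namespace: my-app-namespace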

Comparison of common ApplicationSet generators

  • List Generator: statically lists clusters and their URLs in YAML. Suited to a fixed set of clusters that rarely changes.
  • Cluster Generator: dynamically uses the clusters already registered in ArgoCD. Suited to fleets where clusters come and go and new clusters should be picked up automatically.
  • Git Generator: generates applications from the directory structure of a Git repository. Suited to managing many microservices, each in its own directory.
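
As a counterpart to the Cluster Generator example in the next section, here is a sketch of a Git directory generator that creates one Application per sub-directory of the repo (the repo URL and directory layout are placeholders):

apiVersion: argoproj.io/v1alpha1
kind: ApplicationSet
metadata:
  name: services-from-git
spec:
  generators:
    - git:
        repoURL: 'https://your-git-repo.com/your-app.git'
        revision: HEAD
        directories:
          - path: services/*            # one Application per matching directory
  template:
    metadata:
      name: '{{path.basename}}'         # e.g. "orders" for services/orders
    spec:
      project: default
      source:
        repoURL: 'https://your-git-repo.com/your-app.git'
        targetRevision: HEAD
        path: '{{path}}'
      destination:
        server: https://kubernetes.default.svc
        namespace: '{{path.basename}}'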

🛠️ Configuration Example

Taking the Cluster Generator as an example, here is an ApplicationSet YAML configuration:

apiVersion: argoproj.io/v1alpha1
kind: ApplicationSet
metadata:
  name: my-app-multi-cluster
spec:
  generators:
    - clusters: {} # automatically discover all clusters registered in ArgoCD
  template:
    metadata:
      name: '{{name}}-my-app'
    spec:
      project: default
      source:
        repoURL: 'https://your-git-repo.com/your-app.git'
        targetRevision: HEAD
        path: k8s-manifests
      destination:
        server: '{{server}}' # cluster API address provided by the generator
        namespace: my-app-namespace
      syncPolicy:
        syncOptions:
        - CreateNamespace=true # create the namespace automatically
        automated:
          prune: true # prune resources automatically
          selfHeal: true # automatically heal drift

In this template:

  • clusters: {} under generators makes ArgoCD automatically discover all registered clusters.
  • In the template, {{name}} and {{server}} are variables that the Cluster Generator fills in for every registered cluster.
  • The syncPolicy settings enable automatic sync, automatic namespace creation, and resource pruning.

⚠️ Key Points for Multi-Cluster Management

  1. Cluster access and networking: make sure the ArgoCD control plane has network connectivity to every target cluster's API server and the RBAC permissions to create resources in the target namespaces.
  2. Flexible sync policies
    • For development environments you can enable automated sync so that Git changes are deployed automatically.
    • For production environments it is usually better to disable automated sync and rely on manual sync or a PR approval flow for tighter control.
  3. High availability and performance: when managing many clusters and applications, consider an HA deployment. You may need to tune the replica counts and resource limits of argocd-repo-server and argocd-application-controller.
  4. Consider Argo CD Agent: for large fleets, look into Argo CD Agent. It moves part of the control plane (such as the application-controller) onto the managed clusters, which improves scalability. Note that, as of October 2025, this feature is still in Tech Preview in OpenShift GitOps.

💎 Summary

The key to managing multi-cluster application releases with ArgoCD is mastering ApplicationSet and its Generators. With the Cluster Generator or the Git Generator you can flexibly achieve "define once, deploy everywhere".

Hopefully this helps you get started with your own multi-cluster release workflow.

Mar 14, 2025

ArgoCD Cheatsheets

  • decode password
kubectl -n argocd get secret argocd-initial-admin-secret -o jsonpath="{.data.password}" | base64 -d
  • relogin
ARGOCD_PASS=$(kubectl -n argocd get secret argocd-initial-admin-secret -o jsonpath="{.data.password}" | base64 -d)
MASTER_IP=$(kubectl get nodes --selector=node-role.kubernetes.io/control-plane -o jsonpath='{$.items[0].status.addresses[?(@.type=="InternalIP")].address}')
argocd login --insecure --username admin $MASTER_IP:30443 --password $ARGOCD_PASS
  • force delete
argocd app terminate-op <$>
Mar 14, 2024

Argo CD Agent

Installation

Content

    Mar 7, 2024

    Argo WorkFlow

    What is Argo Workflow?

    Argo Workflows is an open source container-native workflow engine for orchestrating parallel jobs on Kubernetes. Argo Workflows is implemented as a Kubernetes CRD.

    • Define workflows where each step in the workflow is a container.
    • Model multi-step workflows as a sequence of tasks or capture the dependencies between tasks using a graph (DAG).
    • Easily run compute intensive jobs for machine learning or data processing in a fraction of the time using Argo Workflows on Kubernetes.
    • Run CI/CD pipelines natively on Kubernetes without configuring complex software development products.
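
    To make the first point concrete, a minimal Workflow whose single step is a container might look like the sketch below (it reuses the alpine echo pattern from the DAG template later in this section):

    apiVersion: argoproj.io/v1alpha1
    kind: Workflow
    metadata:
      generateName: hello-world-
    spec:
      entrypoint: main
      templates:
      - name: main
        container:
          image: alpine:3.7
          command: [echo, "hello from a single container step"]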

    Installation

    Content

    Mar 7, 2024

    Subsections of Argo WorkFlow

    Argo Workflows Cheatsheets

    Mar 14, 2024

    Subsections of Workflow Template

    DAG Template

    DAG Template

    apiVersion: argoproj.io/v1alpha1
    kind: Workflow
    metadata:
      generateName: dag-diamond-
    spec:
      entrypoint: entry
      serviceAccountName: argo-workflow
      templates:
      - name: echo
        inputs:
          parameters:
          - name: message
        container:
          image: alpine:3.7
          command: [echo, "{{inputs.parameters.message}}"]
      - name: entry
        dag:
          tasks:
          - name: start
            template: echo
            arguments:
                parameters: [{name: message, value: DAG initialized}]
          - name: diamond
            template: diamond
            dependencies: [start]
      - name: diamond
        dag:
          tasks:
          - name: A
            template: echo
            arguments:
              parameters: [{name: message, value: A}]
          - name: B
            dependencies: [A]
            template: echo
            arguments:
              parameters: [{name: message, value: B}]
          - name: C
            dependencies: [A]
            template: echo
            arguments:
              parameters: [{name: message, value: C}]
          - name: D
            dependencies: [B, C]
            template: echo
            arguments:
              parameters: [{name: message, value: D}]
          - name: end
            dependencies: [D]
            template: echo
            arguments:
              parameters: [{name: message, value: end}]
    kubectl -n business-workflow apply -f - << EOF
    apiVersion: argoproj.io/v1alpha1
    kind: Workflow
    metadata:
      generateName: dag-diamond-
    spec:
      entrypoint: entry
      serviceAccountName: argo-workflow
      templates:
      - name: echo
        inputs:
          parameters:
          - name: message
        container:
          image: alpine:3.7
          command: [echo, "{{inputs.parameters.message}}"]
      - name: entry
        dag:
          tasks:
          - name: start
            template: echo
            arguments:
                parameters: [{name: message, value: DAG initialized}]
          - name: diamond
            template: diamond
            dependencies: [start]
      - name: diamond
        dag:
          tasks:
          - name: A
            template: echo
            arguments:
              parameters: [{name: message, value: A}]
          - name: B
            dependencies: [A]
            template: echo
            arguments:
              parameters: [{name: message, value: B}]
          - name: C
            dependencies: [A]
            template: echo
            arguments:
              parameters: [{name: message, value: C}]
          - name: D
            dependencies: [B, C]
            template: echo
            arguments:
              parameters: [{name: message, value: D}]
          - name: end
            dependencies: [D]
            template: echo
            arguments:
              parameters: [{name: message, value: end}]
    EOF
    Mar 7, 2024

    Subsections of Argo Rollouts

    Blue–Green Deploy

    Argo Rollouts is a Kubernetes CRD controller that extends the native Deployment resource with more advanced deployment strategies. Its core principle: it precisely controls the replica counts of multiple ReplicaSets (each corresponding to a different Pod version) and the traffic split between them, which yields a controlled, automated release process.


    1. How Blue-Green Deployment Works

    The core idea of blue-green deployment is to keep two fully independent environments (blue and green), with only one of them serving production traffic at any given time.

    How it works

    1. Initial state

      • Assume the current production environment is the blue version (v1); all traffic points to the blue ReplicaSet.
      • The green environment may already exist (for example with a replica count of 0), but it receives no traffic.
    2. Releasing a new version

      • When a new version (v2) needs to be released, Argo Rollouts creates a green ReplicaSet fully isolated from the blue environment and starts all the Pods it needs.
      • Key point: at this moment user traffic still goes 100% to the blue v1 version. While the green v2 version starts up and warms up, live users are completely unaffected.
    3. Testing and verification

      • Operators or automated scripts can test the green v2 version, for example by calling its APIs, checking logs, or running integration tests, all without disturbing production traffic.
    4. Switching traffic

      • Once v2 is confirmed stable, a single atomic operation switches all production traffic from blue (v1) to green (v2).
      • The switch is usually implemented by updating the selector of a Kubernetes Service or Ingress, for example changing the selector for app: my-app from version: v1 to version: v2.
    5. After the release

      • After the switch, green (v2) becomes the new production environment.
      • The blue (v1) environment is not deleted immediately; it is kept for a while as a fast rollback path.
      • If v2 misbehaves, traffic is simply switched back to blue (v1); the rollback is just as fast and just as low-impact.

    Diagram

    [Users] --> [Service (selector: version=v1)] --> [Blue ReplicaSet (v1, 100% traffic)]
                                          |
                                          +--> [Green ReplicaSet (v2, 0% traffic, standing by)]

    After the switch:

    [Users] --> [Service (selector: version=v2)] --> [Green ReplicaSet (v2, 100% traffic)]
                                          |
                                          +--> [Blue ReplicaSet (v1, 0% traffic, kept for rollback)]

    Pros: releases and rollbacks are fast, risk is low, and the service stays available throughout the release. Cons: it needs twice the hardware resources, and the instant switch can cause brief traffic hiccups (such as dropped connections).
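
    Declared with Argo Rollouts, the blue-green strategy above boils down to a Rollout like the following sketch (the Service names and image are placeholders; the two Services must exist separately):

    apiVersion: argoproj.io/v1alpha1
    kind: Rollout
    metadata:
      name: my-app
    spec:
      replicas: 3
      selector:
        matchLabels:
          app: my-app
      template:
        metadata:
          labels:
            app: my-app
        spec:
          containers:
          - name: my-app
            image: my-registry/my-app:v2   # pushing a new tag starts the blue-green flow
      strategy:
        blueGreen:
          activeService: my-app-active     # Service carrying production traffic
          previewService: my-app-preview   # Service pointing at the new ReplicaSet for testing
          autoPromotionEnabled: false      # wait for manual promotion before switching traffic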


    2. How Canary Releases Work

    The core idea of a canary release is to shift traffic from the old version to the new version gradually instead of all at once. This allows the stability and performance of the new version to be validated while only a small share of users is affected.

    How it works

    1. Initial state

      • As with blue-green deployment, the ReplicaSet of the current stable version (v1) carries 100% of the traffic.
    2. Releasing the canary version

      • Argo Rollouts creates a ReplicaSet for the new version (v2) but starts only a few Pods (for example, one tenth of the total).
      • A traffic-management tool (a service mesh such as Istio or Linkerd, or an ingress controller such as Nginx) then routes a small share of production traffic (for example 10%) to the v2 Pods, while the remaining 90% still flows to v1.
    3. Progressive promotion

      • This is a multi-step, automated process. The Argo Rollouts Rollout CRD can define a detailed list of steps (see the Rollout sketch after this section).
      • Example steps:
        • setWeight: 10 - shift 10% of the traffic to v2.
        • pause: {duration: 5m} - pause the rollout and watch v2's runtime metrics.
        • setWeight: 40 - if everything looks good, raise the traffic share to 40%.
        • pause: {duration: 10m} - pause and observe again.
        • setWeight: 100 - finally switch all traffic to v2.
    4. Automated analysis and rollback

      • This is one of the most powerful features of Argo Rollouts. During every pause step it keeps querying a metrics analysis service.
      • The metrics analysis service can be configured with a set of rules (an AnalysisTemplate), for example:
        • the HTTP request error rate must stay below 1%;
        • the average response time must stay below 200 ms;
        • custom business metrics (such as the order failure rate) must stay healthy.
      • If any of these checks fails, Argo Rollouts automatically aborts the release and rolls all traffic back to v1, with no human intervention.
    5. Completion

      • Once all steps finish successfully, the v2 ReplicaSet takes over 100% of the traffic and the v1 ReplicaSet is eventually scaled down to zero.

    Diagram

    [Users] --> [Istio VirtualService] -- 90% --> [v1 ReplicaSet]
                         |
                         +-- 10% --> [v2 ReplicaSet (canary)]

    (during promotion)

    [Users] --> [Istio VirtualService] -- 40% --> [v1 ReplicaSet]
                         |
                         +-- 60% --> [v2 ReplicaSet (canary)]

    (after completion)

    [Users] --> [Istio VirtualService] -- 100% --> [v2 ReplicaSet]

    Pros: release risk is extremely low, and validation can be automated against real traffic and real metrics, enabling safe, "hands-off" releases. Cons: the release process takes longer and requires integration with more complex traffic-management tooling.
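
    The progressive steps described above map directly onto the canary strategy of a Rollout. A minimal sketch (the image is a placeholder; real traffic percentages additionally require a trafficRouting section for Istio, Nginx, etc., otherwise the weight is approximated by the replica ratio):

    apiVersion: argoproj.io/v1alpha1
    kind: Rollout
    metadata:
      name: my-app
    spec:
      replicas: 10
      selector:
        matchLabels:
          app: my-app
      template:
        metadata:
          labels:
            app: my-app
        spec:
          containers:
          - name: my-app
            image: my-registry/my-app:v2    # placeholder image
      strategy:
        canary:
          steps:
          - setWeight: 10                   # shift 10% of traffic to v2
          - pause: {duration: 5m}           # observe metrics for 5 minutes
          - setWeight: 40
          - pause: {duration: 10m}
          - setWeight: 100                  # full cutover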


    Summary and Core Value

    Blue-green deployment vs. canary release:
    • Core idea: blue-green does a full cutover between isolated environments; canary shifts traffic progressively.
    • Traffic control: blue-green is 100% or 0% in one atomic switch; canary uses fine-grained percentages (1%, 5%, 50%, ...).
    • Resource consumption: blue-green is high (two complete environments); canary is low (old and new Pods share the resource pool).
    • Release speed: blue-green is fast (instant switch); canary is slow (multiple stages).
    • Risk control: blue-green relies on fast rollback; canary relies on limited exposure plus automated analysis.
    • Automation: blue-green is relatively simple and mainly automates the switch; canary is highly automated, with decisions driven by metric analysis.

    The core value of the Argo Rollouts approach lies in:

    1. Declarative: you declare your release strategy (blue-green or canary steps) in a YAML file, just like a Kubernetes Deployment.
    2. Controller pattern: the Argo Rollouts controller continuously watches the state of Rollout objects and drives the whole system (the K8s API, the service mesh, the metrics server) toward the declared target state.
    3. Extensibility: through CRDs and AnalysisTemplates it offers great flexibility and can integrate with any compatible traffic provider and metrics system.
    4. Automation and safety: it turns "human judgment" into "data-driven automated rules", greatly improving release reliability and efficiency, and it is a key building block for GitOps and continuous delivery.
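
    To make the AnalysisTemplate idea concrete, here is a sketch that would abort a rollout when the HTTP error rate exceeds 1%; the Prometheus address, metric names, and the service-name argument are placeholders:

    apiVersion: argoproj.io/v1alpha1
    kind: AnalysisTemplate
    metadata:
      name: http-error-rate
    spec:
      args:
      - name: service-name                     # passed in from the Rollout's analysis step
      metrics:
      - name: error-rate
        interval: 1m                           # query every minute during the analysis
        successCondition: result[0] < 0.01     # error rate must stay below 1%
        failureLimit: 3                        # abort and roll back after 3 failed measurements
        provider:
          prometheus:
            address: http://prometheus.monitoring:9090   # placeholder address
            query: |
              sum(rate(http_requests_total{service="{{args.service-name}}", status=~"5.."}[5m]))
              /
              sum(rate(http_requests_total{service="{{args.service-name}}"}[5m]))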
    Mar 14, 2025

    Argo Rollouts Cheatsheets

    Mar 14, 2024

    Subsections of 🧯BackUp

    Subsections of ElasticSearch

    ES [Local Disk]

    Preliminary

    • ElasticSearch has been installed; if not, check link

    • The elasticsearch.yml has path.repo configured, which should be set to the same value as settings.location (this is handled by the helm chart, don't worry)

      ES argocd-app yaml
      apiVersion: argoproj.io/v1alpha1
      kind: Application
      metadata:
        name: elastic-search
      spec:
        syncPolicy:
          syncOptions:
          - CreateNamespace=true
        project: default
        source:
          repoURL: https://charts.bitnami.com/bitnami
          chart: elasticsearch
          targetRevision: 19.11.3
          helm:
            releaseName: elastic-search
            values: |
              global:
                kibanaEnabled: true
              clusterName: elastic
              image:
                registry: m.zjvis.net/docker.io
                pullPolicy: IfNotPresent
              security:
                enabled: false
              service:
                type: ClusterIP
              extraConfig:
                path:
                  repo: /tmp
              ingress:
                enabled: true
                annotations:
                  cert-manager.io/cluster-issuer: self-signed-ca-issuer
                  nginx.ingress.kubernetes.io/rewrite-target: /$1
                hostname: elastic-search.dev.tech
                ingressClassName: nginx
                path: /?(.*)
                tls: true
              master:
                masterOnly: false
                replicaCount: 1
                persistence:
                  enabled: false
                resources:
                  requests:
                    cpu: 2
                    memory: 1024Mi
                  limits:
                    cpu: 4
                    memory: 4096Mi
                heapSize: 2g
              data:
                replicaCount: 0
                persistence:
                  enabled: false
              coordinating:
                replicaCount: 0
              ingest:
                enabled: true
                replicaCount: 0
                service:
                  enabled: false
                  type: ClusterIP
                ingress:
                  enabled: false
              metrics:
                enabled: false
                image:
                  registry: m.zjvis.net/docker.io
                  pullPolicy: IfNotPresent
              volumePermissions:
                enabled: false
                image:
                  registry: m.zjvis.net/docker.io
                  pullPolicy: IfNotPresent
              sysctlImage:
                enabled: true
                registry: m.zjvis.net/docker.io
                pullPolicy: IfNotPresent
              kibana:
                elasticsearch:
                  hosts:
                    - '{{ include "elasticsearch.service.name" . }}'
                  port: '{{ include "elasticsearch.service.ports.restAPI" . }}'
              esJavaOpts: "-Xmx2g -Xms2g"        
        destination:
          server: https://kubernetes.default.svc
          namespace: application

      Diff from the original values:

      extraConfig:
          path:
            repo: /tmp

    Methods

    There are two ways to back up Elasticsearch:

    1. Export the data to files, for example with tools such as elasticdump or esm, which dump the data stored in Elasticsearch into text files.
    2. Use the snapshot API, which supports incremental backups.

    The first approach is simple and practical for small data volumes, but for large data volumes the snapshot API is the recommended way.

    Steps

    Backup

    1. Create a snapshot repository -> my_fs_repository
    curl -k -X PUT "https://elastic-search.dev.tech:32443/_snapshot/my_fs_repository?pretty" -H 'Content-Type: application/json' -d'
    {
      "type": "fs",
      "settings": {
        "location": "/tmp"
      }
    }
    '

    You can also use a storage class to mount an extra path into the pod and keep the snapshot files on that mounted volume.

    2. Verify that every node in the cluster can use this snapshot repository
    curl -k -X POST "https://elastic-search.dev.tech:32443/_snapshot/my_fs_repository/_verify?pretty"
    3. List all snapshot repositories
    curl -k -X GET "https://elastic-search.dev.tech:32443/_snapshot/_all?pretty"
    4. Show the settings of a specific snapshot repository
    curl -k -X GET "https://elastic-search.dev.tech:32443/_snapshot/my_fs_repository?pretty"
    5. Analyze a snapshot repository
    curl -k -X POST "https://elastic-search.dev.tech:32443/_snapshot/my_fs_repository/_analyze?blob_count=10&max_blob_size=1mb&timeout=120s&pretty"
    6. Take a snapshot manually
    curl -k -X PUT "https://elastic-search.dev.tech:32443/_snapshot/my_fs_repository/ay_snap_02?pretty"
    Taking snapshots automatically with SLM is covered separately (it did not take effect here).

    7. List the snapshots available in a given repository
    curl -k -X GET "https://elastic-search.dev.tech:32443/_snapshot/my_fs_repository/*?verbose=false&pretty"
    8. Test a restore
    # Delete an index
    curl -k -X DELETE "https://elastic-search.dev.tech:32443/books?pretty"
    
    # restore that index
    curl -k -X POST "https://elastic-search.dev.tech:32443/_snapshot/my_fs_repository/ay_snap_02/_restore?pretty" -H 'Content-Type: application/json' -d'
    {
      "indices": "books"
    }
    '
    
    # query
    curl -k -X GET "https://elastic-search.dev.tech:32443/books/_search?pretty" -H 'Content-Type: application/json' -d'
    {
      "query": {
        "match_all": {}
      }
    }
    '
    Oct 7, 2024

    ES [S3 Compatible]

    Preliminary

    • ElasticSearch has been installed; if not, check link

      ES argocd-app yaml
      apiVersion: argoproj.io/v1alpha1
      kind: Application
      metadata:
        name: elastic-search
      spec:
        syncPolicy:
          syncOptions:
          - CreateNamespace=true
        project: default
        source:
          repoURL: https://charts.bitnami.com/bitnami
          chart: elasticsearch
          targetRevision: 19.11.3
          helm:
            releaseName: elastic-search
            values: |
              global:
                kibanaEnabled: true
              clusterName: elastic
              image:
                registry: m.zjvis.net/docker.io
                pullPolicy: IfNotPresent
              security:
                enabled: true
              service:
                type: ClusterIP
              extraEnvVars:
              - name: S3_ACCESSKEY
                value: admin
              - name: S3_SECRETKEY
                value: ZrwpsezF1Lt85dxl
              extraConfig:
                s3:
                  client:
                    default:
                      protocol: http
                      endpoint: "http://192.168.31.111:9090"
                      path_style_access: true
              initScripts:
                configure-s3-client.sh: |
                  elasticsearch_set_key_value "s3.client.default.access_key" "${S3_ACCESSKEY}"
                  elasticsearch_set_key_value "s3.client.default.secret_key" "${S3_SECRETKEY}"
              hostAliases:
              - ip: 192.168.31.111
                hostnames:
                - minio-api.dev.tech
              ingress:
                enabled: true
                annotations:
                  cert-manager.io/cluster-issuer: self-signed-ca-issuer
                  nginx.ingress.kubernetes.io/rewrite-target: /$1
                hostname: elastic-search.dev.tech
                ingressClassName: nginx
                path: /?(.*)
                tls: true
              master:
                masterOnly: false
                replicaCount: 1
                persistence:
                  enabled: false
                resources:
                  requests:
                    cpu: 2
                    memory: 1024Mi
                  limits:
                    cpu: 4
                    memory: 4096Mi
                heapSize: 2g
              data:
                replicaCount: 0
                persistence:
                  enabled: false
              coordinating:
                replicaCount: 0
              ingest:
                enabled: true
                replicaCount: 0
                service:
                  enabled: false
                  type: ClusterIP
                ingress:
                  enabled: false
              metrics:
                enabled: false
                image:
                  registry: m.zjvis.net/docker.io
                  pullPolicy: IfNotPresent
              volumePermissions:
                enabled: false
                image:
                  registry: m.zjvis.net/docker.io
                  pullPolicy: IfNotPresent
              sysctlImage:
                enabled: true
                registry: m.zjvis.net/docker.io
                pullPolicy: IfNotPresent
              kibana:
                elasticsearch:
                  hosts:
                    - '{{ include "elasticsearch.service.name" . }}'
                  port: '{{ include "elasticsearch.service.ports.restAPI" . }}'
              esJavaOpts: "-Xmx2g -Xms2g"        
        destination:
          server: https://kubernetes.default.svc
          namespace: application

      Diff from the original values:

      extraEnvVars:
      - name: S3_ACCESSKEY
        value: admin
      - name: S3_SECRETKEY
        value: ZrwpsezF1Lt85dxl
      extraConfig:
        s3:
          client:
            default:
              protocol: http
              endpoint: "http://192.168.31.111:9090"
              path_style_access: true
      initScripts:
        configure-s3-client.sh: |
          elasticsearch_set_key_value "s3.client.default.access_key" "${S3_ACCESSKEY}"
          elasticsearch_set_key_value "s3.client.default.secret_key" "${S3_SECRETKEY}"
      hostAliases:
      - ip: 192.168.31.111
        hostnames:
        - minio-api.dev.tech

    Methods

    There are two ways to back up Elasticsearch:

    1. Export the data to files, for example with tools such as elasticdump or esm, which dump the data stored in Elasticsearch into text files.
    2. Use the snapshot API, which supports incremental backups.

    The first approach is simple and practical for small data volumes, but for large data volumes the snapshot API is the recommended way.

    Steps

    Backup

    1. Create a snapshot repository -> my_s3_repository
    curl -k -X PUT "https://elastic-search.dev.tech:32443/_snapshot/my_s3_repository?pretty" -H 'Content-Type: application/json' -d'
    {
      "type": "s3",
      "settings": {
        "bucket": "local-test",
        "client": "default",
        "endpoint": "http://192.168.31.111:9000"
      }
    }
    '

    You can also use a storage class to mount an extra path into the pod and keep the snapshot files on that mounted volume.

    2. Verify that every node in the cluster can use this snapshot repository
    curl -k -X POST "https://elastic-search.dev.tech:32443/_snapshot/my_s3_repository/_verify?pretty"
    3. List all snapshot repositories
    curl -k -X GET "https://elastic-search.dev.tech:32443/_snapshot/_all?pretty"
    4. Show the settings of a specific snapshot repository
    curl -k -X GET "https://elastic-search.dev.tech:32443/_snapshot/my_s3_repository?pretty"
    5. Analyze a snapshot repository
    curl -k -X POST "https://elastic-search.dev.tech:32443/_snapshot/my_s3_repository/_analyze?blob_count=10&max_blob_size=1mb&timeout=120s&pretty"
    6. Take a snapshot manually
    curl -k -X PUT "https://elastic-search.dev.tech:32443/_snapshot/my_s3_repository/ay_s3_snap_02?pretty"
    Taking snapshots automatically with SLM is covered separately (it did not take effect here).

    7. List the snapshots available in a given repository
    curl -k -X GET "https://elastic-search.dev.tech:32443/_snapshot/my_s3_repository/*?verbose=false&pretty"
    8. Test a restore
    # Delete an index
    curl -k -X DELETE "https://elastic-search.dev.tech:32443/books?pretty"
    
    # restore that index
    curl -k -X POST "https://elastic-search.dev.tech:32443/_snapshot/my_s3_repository/ay_s3_snap_02/_restore?pretty" -H 'Content-Type: application/json' -d'
    {
      "indices": "books"
    }
    '
    
    # query
    curl -k -X GET "https://elastic-search.dev.tech:32443/books/_search?pretty" -H 'Content-Type: application/json' -d'
    {
      "query": {
        "match_all": {}
      }
    }
    '
    Oct 7, 2024

    ES Auto BackUp

    Preliminary

    • ElasticSearch has been installed; if not, check link

    • We use the local disk to save the snapshots; for more details check link

    • And the security is enabled.

      ES argocd-app yaml
      apiVersion: argoproj.io/v1alpha1
      kind: Application
      metadata:
        name: elastic-search
      spec:
        syncPolicy:
          syncOptions:
          - CreateNamespace=true
        project: default
        source:
          repoURL: https://charts.bitnami.com/bitnami
          chart: elasticsearch
          targetRevision: 19.11.3
          helm:
            releaseName: elastic-search
            values: |
              global:
                kibanaEnabled: true
              clusterName: elastic
              image:
                registry: m.zjvis.net/docker.io
                pullPolicy: IfNotPresent
              security:
                enabled: true
                tls:
                  autoGenerated: true
              service:
                type: ClusterIP
              extraConfig:
                path:
                  repo: /tmp
              ingress:
                enabled: true
                annotations:
                  cert-manager.io/cluster-issuer: self-signed-ca-issuer
                  nginx.ingress.kubernetes.io/rewrite-target: /$1
                hostname: elastic-search.dev.tech
                ingressClassName: nginx
                path: /?(.*)
                tls: true
              master:
                masterOnly: false
                replicaCount: 1
                persistence:
                  enabled: false
                resources:
                  requests:
                    cpu: 2
                    memory: 1024Mi
                  limits:
                    cpu: 4
                    memory: 4096Mi
                heapSize: 2g
              data:
                replicaCount: 0
                persistence:
                  enabled: false
              coordinating:
                replicaCount: 0
              ingest:
                enabled: true
                replicaCount: 0
                service:
                  enabled: false
                  type: ClusterIP
                ingress:
                  enabled: false
              metrics:
                enabled: false
                image:
                  registry: m.zjvis.net/docker.io
                  pullPolicy: IfNotPresent
              volumePermissions:
                enabled: false
                image:
                  registry: m.zjvis.net/docker.io
                  pullPolicy: IfNotPresent
              sysctlImage:
                enabled: true
                registry: m.zjvis.net/docker.io
                pullPolicy: IfNotPresent
              kibana:
                elasticsearch:
                  hosts:
                    - '{{ include "elasticsearch.service.name" . }}'
                  port: '{{ include "elasticsearch.service.ports.restAPI" . }}'
              esJavaOpts: "-Xmx2g -Xms2g"        
        destination:
          server: https://kubernetes.default.svc
          namespace: application

      Diff from the original values:

      security:
        enabled: true
      extraConfig:
          path:
            repo: /tmp

    Methods

    Steps

    Auto backup
    1. Create a snapshot repository -> slm_fs_repository
    curl --user elastic:L9shjg6csBmPZgCZ -k -X PUT "https://10.88.0.143:30294/_snapshot/slm_fs_repository?pretty" -H 'Content-Type: application/json' -d'
    {
      "type": "fs",
      "settings": {
        "location": "/tmp"
      }
    }
    '

    You can also use a storage class to mount an extra path into the pod and keep the snapshot files on that mounted volume.

    2. Verify that every node in the cluster can use this snapshot repository
    curl --user elastic:L9shjg6csBmPZgCZ  -k -X POST "https://10.88.0.143:30294/_snapshot/slm_fs_repository/_verify?pretty"
    3. List all snapshot repositories
    curl --user elastic:L9shjg6csBmPZgCZ  -k -X GET "https://10.88.0.143:30294/_snapshot/_all?pretty"
    4. Show the settings of a specific snapshot repository
    curl --user elastic:L9shjg6csBmPZgCZ  -k -X GET "https://10.88.0.143:30294/_snapshot/slm_fs_repository?pretty"
    5. Analyze a snapshot repository
    curl --user elastic:L9shjg6csBmPZgCZ  -k -X POST "https://10.88.0.143:30294/_snapshot/slm_fs_repository/_analyze?blob_count=10&max_blob_size=1mb&timeout=120s&pretty"
    6. List the snapshots available in a given repository
    curl --user elastic:L9shjg6csBmPZgCZ  -k -X GET "https://10.88.0.143:30294/_snapshot/slm_fs_repository/*?verbose=false&pretty"
    7. Create an SLM admin role
    curl --user elastic:L9shjg6csBmPZgCZ -k -X POST "https://10.88.0.143:30294/_security/role/slm-admin?pretty" -H 'Content-Type: application/json' -d'
    {
      "cluster": [ "manage_slm", "cluster:admin/snapshot/*" ],
      "indices": [
        {
          "names": [ ".slm-history-*" ],
          "privileges": [ "all" ]
        }
      ]
    }
    '
    8. Create the automatic backup cron policy
    curl --user elastic:L9shjg6csBmPZgCZ -k -X PUT "https://10.88.0.143:30294/_slm/policy/nightly-snapshots?pretty" -H 'Content-Type: application/json' -d'
    {
      "schedule": "0 30 1 * * ?",       
      "name": "<nightly-snap-{now/d}>", 
      "repository": "slm_fs_repository",    
      "config": {
        "indices": "*",                 
        "include_global_state": true    
      },
      "retention": {                    
        "expire_after": "30d",
        "min_count": 5,
        "max_count": 50
      }
    }
    '
    9. Trigger the automatic backup policy once
    curl --user elastic:L9shjg6csBmPZgCZ -k -X POST "https://10.88.0.143:30294/_slm/policy/nightly-snapshots/_execute?pretty"
    10. Check the SLM backup history
    curl --user elastic:L9shjg6csBmPZgCZ -k -X GET "https://10.88.0.143:30294/_slm/stats?pretty"
    11. Test a restore
    # Delete an index
    curl --user elastic:L9shjg6csBmPZgCZ  -k -X DELETE "https://10.88.0.143:30294/books?pretty"
    
    # restore that index
    curl --user elastic:L9shjg6csBmPZgCZ  -k -X POST "https://10.88.0.143:30294/_snapshot/slm_fs_repository/my_snapshot_2099.05.06/_restore?pretty" -H 'Content-Type: application/json' -d'
    {
      "indices": "books"
    }
    '
    
    # query
    curl --user elastic:L9shjg6csBmPZgCZ  -k -X GET "https://10.88.0.143:30294/books/_search?pretty" -H 'Content-Type: application/json' -d'
    {
      "query": {
        "match_all": {}
      }
    }
    '
    Oct 7, 2024

    Example Shell Script

    Init ES Backup Setting

    Create an ES backup repository in S3, and take a snapshot after its creation.

    #!/bin/bash
    ES_HOST="http://192.168.58.2:30910"
    ES_BACKUP_REPO_NAME="s3_fs_repository"
    S3_CLIENT="default"
    ES_BACKUP_BUCKET_IN_S3="es-snapshot"
    ES_SNAPSHOT_TAG="auto"
    
    CHECK_RESPONSE=$(curl -s -k -X POST "$ES_HOST/_snapshot/$ES_BACKUP_REPO_NAME/_verify?pretty" )
    CHECKED_NODES=$(echo "$CHECK_RESPONSE" | jq -r '.nodes')
    
    
    if [ "$CHECKED_NODES" == null ]; then
      echo "Doesn't exist an ES backup setting..."
      echo "A default backup setting will be generated. (using '$S3_CLIENT' s3 client and all backup files will be saved in a bucket : '$ES_BACKUP_BUCKET_IN_S3'"
    
      CREATE_RESPONSE=$(curl -s -k -X PUT "$ES_HOST/_snapshot/$ES_BACKUP_REPO_NAME?pretty" -H 'Content-Type: application/json' -d "{\"type\":\"s3\",\"settings\":{\"bucket\":\"$ES_BACKUP_BUCKET_IN_S3\",\"client\":\"$S3_CLIENT\"}}")
      CREATE_ACKNOWLEDGED_FLAG=$(echo "$CREATE_RESPONSE" | jq -r '.acknowledged')
    
      if [ "$CREATE_ACKNOWLEDGED_FLAG" == true ]; then
        echo "Buckup setting '$ES_BACKUP_REPO_NAME' has been created successfully!"
      else
        echo "Failed to create backup setting '$ES_BACKUP_REPO_NAME', since $$CREATE_RESPONSE"
      fi
    else
      echo "Already exist an ES backup setting '$ES_BACKUP_REPO_NAME'"
    fi
    
    CHECK_RESPONSE=$(curl -s -k -X POST "$ES_HOST/_snapshot/$ES_BACKUP_REPO_NAME/_verify?pretty" )
    CHECKED_NODES=$(echo "$CHECK_RESPONSE" | jq -r '.nodes')
    
    if [ "$CHECKED_NODES" != null ]; then
      SNAPSHOT_NAME="meta-data-$ES_SNAPSHOT_TAG-snapshot-$(date +%s)"
      SNAPSHOT_CREATION=$(curl -s -k -X PUT "$ES_HOST/_snapshot/$ES_BACKUP_REPO_NAME/$SNAPSHOT_NAME")
      echo "Snapshot $SNAPSHOT_NAME has been created."
    else
      echo "Failed to create snapshot $SNAPSHOT_NAME ."
    fi
    Mar 14, 2024

    Subsections of Git

    Minio

      Mar 7, 2024

      Redis

        Mar 7, 2024

        Subsections of ☁️CSP Related

        Subsections of Aliyun

        OSSutil

        download ossutil

        First, you need to download ossutil.

        Linux:
        curl https://gosspublic.alicdn.com/ossutil/install.sh  | sudo bash
        Windows:
        curl -o ossutil-v1.7.19-windows-386.zip https://gosspublic.alicdn.com/ossutil/1.7.19/ossutil-v1.7.19-windows-386.zip

        config ossutil

        ./ossutil config
        • endpoint: the endpoint of the region where the bucket is located
        • accessKeyID: OSS AccessKey, get it from the user info panel
        • accessKeySecret: OSS AccessKeySecret, get it from the user info panel
        • stsToken: token for the STS service, can be left empty
        Info

        you can also modify /home/<$user>/.ossutilconfig file directly to change the configuration.

        list files

        ossutil ls oss://<$PATH>
        For example
        ossutil ls oss://csst-data/CSST-20240312/dfs/

        download file/dir

        You can use cp to download or upload files.

        ossutil cp -r oss://<$PATH> <$PTHER_PATH>
        For example
        ossutil cp -r oss://csst-data/CSST-20240312/dfs/ /data/nfs/data/pvc...

        upload file/dir

        ossutil cp -r <$SOURCE_PATH> oss://<$PATH>
        For example
        ossutil cp -r /data/nfs/data/pvc/a.txt  oss://csst-data/CSST-20240312/dfs/b.txt
        Mar 24, 2024

        ECS DNS

        ZJADC (Aliyun Directed Cloud)

        Append content in /etc/resolv.conf

        options timeout:2 attempts:3 rotate
        nameserver 10.255.9.2
        nameserver 10.200.12.5

        And then you probably need to modify yum.repo.d as well, check link


        YQGCY (Aliyun Directed Cloud)

        Append content in /etc/resolv.conf

        nameserver 172.27.205.79

        And then restart kube-system.coredns-xxxx


        Google DNS

        nameserver 8.8.8.8
        nameserver 4.4.4.4
        nameserver 223.5.5.5
        nameserver 223.6.6.6

        Restart DNS

        If the host uses NetworkManager:
        vim /etc/NetworkManager/NetworkManager.conf

        add "dns=none" under the '[main]' part, then restart NetworkManager:

        systemctl restart NetworkManager

        If the host uses systemd-resolved:
        sudo systemctl is-active systemd-resolved
        sudo resolvectl flush-caches
        # or sudo systemd-resolve --flush-caches

        Modify ifcfg-ethX [Optional]

        If you cannot get an IPv4 address, you can try to modify ifcfg-ethX:

        vim /etc/sysconfig/network-scripts/ifcfg-ens33

        set ONBOOT=yes

        Mar 14, 2024

        OS Mirrors

        Fedora

        • Fedora 40 located in /etc/yum.repos.d/
          Fedora Mirror
          [updates]
          name=Fedora $releasever - $basearch - Updates
          #baseurl=http://download.example/pub/fedora/linux/updates/$releasever/Everything/$basearch/
          metalink=https://mirrors.fedoraproject.org/metalink?repo=updates-released-f$releasever&arch=$basearch
          enabled=1
          countme=1
          repo_gpgcheck=0
          type=rpm
          gpgcheck=1
          metadata_expire=6h
          gpgkey=file:///etc/pki/rpm-gpg/RPM-GPG-KEY-fedora-$releasever-$basearch
          skip_if_unavailable=False
          
          [updates-debuginfo]
          name=Fedora $releasever - $basearch - Updates - Debug
          #baseurl=http://download.example/pub/fedora/linux/updates/$releasever/Everything/$basearch/debug/
          metalink=https://mirrors.fedoraproject.org/metalink?repo=updates-released-debug-f$releasever&arch=$basearch
          enabled=0
          repo_gpgcheck=0
          type=rpm
          gpgcheck=1
          metadata_expire=6h
          gpgkey=file:///etc/pki/rpm-gpg/RPM-GPG-KEY-fedora-$releasever-$basearch
          skip_if_unavailable=False
          
          [updates-source]
          name=Fedora $releasever - Updates Source
          #baseurl=http://download.example/pub/fedora/linux/updates/$releasever/Everything/SRPMS/
          metalink=https://mirrors.fedoraproject.org/metalink?repo=updates-released-source-f$releasever&arch=$basearch
          enabled=0
          repo_gpgcheck=0
          type=rpm
          gpgcheck=1
          metadata_expire=6h
          gpgkey=file:///etc/pki/rpm-gpg/RPM-GPG-KEY-fedora-$releasever-$basearch
          skip_if_unavailable=False

        CentOS

        • CentOS 7 located in /etc/yum.repos.d/

          CentOS Mirror
          [base]
          name=CentOS-$releasever
          #mirrorlist=http://mirrorlist.centos.org/?release=$releasever&arch=$basearch&repo=os
          baseurl=http://mirror.centos.org/centos/$releasever/os/$basearch/
          gpgcheck=1
          gpgkey=http://mirror.centos.org/centos/RPM-GPG-KEY-CentOS-7
          
          [extras]
          name=CentOS-$releasever
          #mirrorlist=http://mirrorlist.centos.org/?release=$releasever&arch=$basearch&repo=extras
          baseurl=http://mirror.centos.org/centos/$releasever/extras/$basearch/
          gpgcheck=1
          gpgkey=http://mirror.centos.org/centos/RPM-GPG-KEY-CentOS-7
          Aliyun Mirror
          [base]
          name=CentOS-$releasever - Base - mirrors.aliyun.com
          failovermethod=priority
          baseurl=http://mirrors.aliyun.com/centos/$releasever/os/$basearch/
          gpgcheck=1
          gpgkey=http://mirrors.aliyun.com/centos/RPM-GPG-KEY-CentOS-7
          
          [extras]
          name=CentOS-$releasever - Extras - mirrors.aliyun.com
          failovermethod=priority
          baseurl=http://mirrors.aliyun.com/centos/$releasever/extras/$basearch/
          gpgcheck=1
          gpgkey=http://mirrors.aliyun.com/centos/RPM-GPG-KEY-CentOS-7
          163 Mirror
          [base]
          name=CentOS-$releasever - Base - 163.com
          #mirrorlist=http://mirrorlist.centos.org/?release=$releasever&arch=$basearch&repo=os
          baseurl=http://mirrors.163.com/centos/$releasever/os/$basearch/
          gpgcheck=1
          gpgkey=http://mirrors.163.com/centos/RPM-GPG-KEY-CentOS-7
          
          [extras]
          name=CentOS-$releasever - Extras - 163.com
          #mirrorlist=http://mirrorlist.centos.org/?release=$releasever&arch=$basearch&repo=extras
          baseurl=http://mirrors.163.com/centos/$releasever/extras/$basearch/
          gpgcheck=1
          gpgkey=http://mirrors.163.com/centos/RPM-GPG-KEY-CentOS-7

        • CentOS 8 stream located in /etc/yum.repos.d/

          CentOS Mirror
          [baseos]
          name=CentOS Linux - BaseOS
          #mirrorlist=http://mirrorlist.centos.org/?release=$releasever&arch=$basearch&repo=BaseOS&infra=$infra
          baseurl=http://mirror.centos.org/centos/8-stream/BaseOS/$basearch/os/
          gpgcheck=1
          enabled=1
          gpgkey=file:///etc/pki/rpm-gpg/RPM-GPG-KEY-centosofficial
          
          [extras]
          name=CentOS Linux - Extras
          #mirrorlist=http://mirrorlist.centos.org/?release=$releasever&arch=$basearch&repo=extras&infra=$infra
          baseurl=http://mirror.centos.org/centos/8-stream/extras/$basearch/os/
          gpgcheck=1
          enabled=1
          gpgkey=file:///etc/pki/rpm-gpg/RPM-GPG-KEY-centosofficial
          
          [appstream]
          name=CentOS Linux - AppStream
          #mirrorlist=http://mirrorlist.centos.org/?release=$releasever&arch=$basearch&repo=AppStream&infra=$infra
          baseurl=http://mirror.centos.org/centos/8-stream/AppStream/$basearch/os/
          gpgcheck=1
          enabled=1
          gpgkey=file:///etc/pki/rpm-gpg/RPM-GPG-KEY-centosofficial
          Aliyun Mirror
          [base]
          name=CentOS-8.5.2111 - Base - mirrors.aliyun.com
          baseurl=http://mirrors.aliyun.com/centos-vault/8.5.2111/BaseOS/$basearch/os/
          gpgcheck=0
          gpgkey=http://mirrors.aliyun.com/centos/RPM-GPG-KEY-CentOS-Official
          
          [extras]
          name=CentOS-8.5.2111 - Extras - mirrors.aliyun.com
          baseurl=http://mirrors.aliyun.com/centos-vault/8.5.2111/extras/$basearch/os/
          gpgcheck=0
          gpgkey=http://mirrors.aliyun.com/centos/RPM-GPG-KEY-CentOS-Official
          
          [AppStream]
          name=CentOS-8.5.2111 - AppStream - mirrors.aliyun.com
          baseurl=http://mirrors.aliyun.com/centos-vault/8.5.2111/AppStream/$basearch/os/
          gpgcheck=0
          gpgkey=http://mirrors.aliyun.com/centos/RPM-GPG-KEY-CentOS-Official

        Ubuntu

        • Ubuntu 18.04 located in /etc/apt/sources.list

          Ubuntu Mirror
          deb http://archive.ubuntu.com/ubuntu/ bionic main restricted
          deb http://archive.ubuntu.com/ubuntu/ bionic-updates main restricted
          deb http://archive.ubuntu.com/ubuntu/ bionic-backports main restricted universe multiverse
          deb http://security.ubuntu.com/ubuntu/ bionic-security main restricted

        • Ubuntu 20.04 located in /etc/apt/sources.list

          Ubuntu Mirror
          deb http://archive.ubuntu.com/ubuntu/ focal main restricted universe multiverse
          deb http://archive.ubuntu.com/ubuntu/ focal-updates main restricted universe multiverse
          deb http://archive.ubuntu.com/ubuntu/ focal-backports main restricted universe multiverse
          deb http://security.ubuntu.com/ubuntu/ focal-security main restricted

        • Ubuntu 22.04 located in /etc/apt/sources.list

          Ubuntu Mirror
          deb http://archive.ubuntu.com/ubuntu/ jammy main restricted
          deb http://archive.ubuntu.com/ubuntu/ jammy-updates main restricted
          deb http://archive.ubuntu.com/ubuntu/ jammy-backports main restricted universe multiverse
          deb http://security.ubuntu.com/ubuntu/ jammy-security main restricted

        Debian

        • Debian Buster located in /etc/apt/sources.list

          Debian Mirror
          deb http://deb.debian.org/debian buster main
          deb http://security.debian.org/debian-security buster/updates main
          deb http://deb.debian.org/debian buster-updates main
          Aliyun Mirror
          deb http://mirrors.aliyun.com/debian/ buster main non-free contrib
          deb http://mirrors.aliyun.com/debian-security buster/updates main
          deb http://mirrors.aliyun.com/debian/ buster-updates main non-free contrib
          deb http://mirrors.aliyun.com/debian/ buster-backports main non-free contrib
          Tuna Mirror
          deb http://mirrors.tuna.tsinghua.edu.cn/debian/ buster main contrib non-free
          deb http://mirrors.tuna.tsinghua.edu.cn/debian/ buster-updates main contrib non-free
          deb http://mirrors.tuna.tsinghua.edu.cn/debian/ buster-backports main contrib non-free
          deb http://security.debian.org/debian-security buster/updates main contrib non-free

        • Debian Bullseye located in /etc/apt/sources.list

          Debian Mirror
          deb http://deb.debian.org/debian bullseye main
          deb http://security.debian.org/debian-security bullseye-security main
          deb http://deb.debian.org/debian bullseye-updates main
          Aliyun Mirror
          deb http://mirrors.aliyun.com/debian/ bullseye main non-free contrib
          deb http://mirrors.aliyun.com/debian-security/ bullseye-security main
          deb http://mirrors.aliyun.com/debian/ bullseye-updates main non-free contrib
          deb http://mirrors.aliyun.com/debian/ bullseye-backports main non-free contrib
          Tuna Mirror
          deb http://mirrors.tuna.tsinghua.edu.cn/debian/ bullseye main contrib non-free
          deb http://mirrors.tuna.tsinghua.edu.cn/debian/ bullseye-updates main contrib non-free
          deb http://mirrors.tuna.tsinghua.edu.cn/debian/ bullseye-backports main contrib non-free
          deb http://security.debian.org/debian-security bullseye-security main contrib non-free

        Anolis

        • Anolis 3 located in /etc/yum.repos.d/

          Aliyun Mirror
          [alinux3-module]
          name=alinux3-module
          baseurl=http://mirrors.aliyun.com/alinux/3/module/$basearch/
          gpgkey=http://mirrors.aliyun.com/alinux/3/RPM-GPG-KEY-ALINUX-3
          enabled=1
          gpgcheck=1
          
          [alinux3-os]
          name=alinux3-os
          baseurl=http://mirrors.aliyun.com/alinux/3/os/$basearch/
          gpgkey=http://mirrors.aliyun.com/alinux/3/RPM-GPG-KEY-ALINUX-3
          enabled=1
          gpgcheck=1
          
          [alinux3-plus]
          name=alinux3-plus
          baseurl=http://mirrors.aliyun.com/alinux/3/plus/$basearch/
          gpgkey=http://mirrors.aliyun.com/alinux/3/RPM-GPG-KEY-ALINUX-3
          enabled=1
          gpgcheck=1
          
          [alinux3-powertools]
          name=alinux3-powertools
          baseurl=http://mirrors.aliyun.com/alinux/3/powertools/$basearch/
          gpgkey=http://mirrors.aliyun.com/alinux/3/RPM-GPG-KEY-ALINUX-3
          enabled=1
          gpgcheck=1
          
          [alinux3-updates]
          name=alinux3-updates
          baseurl=http://mirrors.aliyun.com/alinux/3/updates/$basearch/
          gpgkey=http://mirrors.aliyun.com/alinux/3/RPM-GPG-KEY-ALINUX-3
          enabled=1
          gpgcheck=1
          
          [epel]
          name=Extra Packages for Enterprise Linux 8 - $basearch
          baseurl=http://mirrors.aliyun.com/epel/8/Everything/$basearch
          failovermethod=priority
          enabled=1
          gpgcheck=1
          gpgkey=http://mirrors.aliyun.com/epel/RPM-GPG-KEY-EPEL-8
          
          [epel-module]
          name=Extra Packages for Enterprise Linux 8 - $basearch
          baseurl=http://mirrors.aliyun.com/epel/8/Modular/$basearch
          failovermethod=priority
          enabled=0
          gpgcheck=1
          gpgkey=http://mirrors.aliyun.com/epel/RPM-GPG-KEY-EPEL-8

        • Anolis 2 located in /etc/yum.repos.d/

          Aliyun Mirror


        Refresh Repo

        Fedora / Anolis:
        dnf clean all && dnf makecache
        CentOS:
        yum clean all && yum makecache
        Ubuntu / Debian:
        apt-get clean all
        Mar 14, 2024

        Tencent

          Mar 7, 2024

          Subsections of 🧪Demo

          Agent

          Aug 7, 2024

          Subsections of Game

          LOL Overlay Assistant

          Using deep learning techniques to help you to win the game.

          Tags: State Machine, Event Bus, Python 3.6, TensorFlow2, Captain Info, New, Awesome

          ScreenShots

          There are four main funcs in this tool.

          1. The first one is to detect your game client thread and recognize which
            status you are in.

          2. The second one is to recommend some champions to play.
            Based on the champions banned by the enemy team, this tool will provide you with three
            more choices to counter your enemies.

          3. The third func scans the mini-map, and when someone is heading toward you,
            a notification window will pop up.

          4. The last func provides you with some gear recommendations based on your
            enemy’s item list.

          Framework

          (MVC architecture diagram)

          Checkout in Bilibili

          Checkout in Youtube

          Repo

          you can get code from github, gitee

          Mar 8, 2024

          Roller Coin Assistant

          Using deep learning techniques to help you mine cryptos, such as BTC, ETH and DOGE.

          ScreenShots

          There are two main funcs in this tool.

          1. Help you crack the games
          • only the ‘Coin-Flip’ game is supported for now.

            Right, rollercoin.com has decreased the rewards from this game; that's why I made the repo public.

          2. Help you pass the geetest.

          How to use

          1. open a web browser.
          2. go to https://rollercoin.com and create an account.
          3. keep the language set to ‘English’ (you can click the bottom button to change it).
          4. click the ‘Game’ button.
          5. start the application, and enjoy it.

          Tips

          1. only 1920*1080, 2560*1440 and higher resolution screens are supported.
          2. if you use a 1920*1080 screen, it is strongly recommended to fullscreen your web browser.

          Repo

          you can get code from gitee

          Mar 8, 2024

          Subsections of HPC

          Slurm On K8S


          Trying to run a Slurm cluster on Kubernetes.

          Install

          You can directly use helm to manage this slurm chart

          1. helm repo add ay-helm-mirror https://aaronyang0628.github.io/helm-chart-mirror/charts
          2. helm install slurm ay-helm-mirror/slurm --version 1.0.4

          And then, you should see something like this:

          Also, you can modify the values.yaml yourself and reinstall the slurm cluster:

          helm upgrade --create-namespace -n slurm --install -f ./values.yaml slurm ay-helm-mirror/slurm --version=1.0.4
          Important

          And you can even build your own images, which is especially useful if you want to use your own libs. For now, the images we use are:

          login -> docker.io/aaron666/slurm-login:intel-mpi

          slurmd -> docker.io/aaron666/slurm-slurmd:intel-mpi

          slurmctld -> docker.io/aaron666/slurm-slurmctld:latest

          slurmdbd -> docker.io/aaron666/slurm-slurmdbd:latest

          munged -> docker.io/aaron666/slurm-munged:latest
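
          If you want to point the chart at your own builds, the override usually looks something like the sketch below. The keys shown here are assumptions about this chart's values schema, so check the chart's defaults first (helm show values ay-helm-mirror/slurm):

          # values.yaml (hypothetical keys, verify against the chart's defaults)
          login:
            image: docker.io/aaron666/slurm-login:intel-mpi
          slurmd:
            image: docker.io/aaron666/slurm-slurmd:intel-mpi
          slurmctld:
            image: docker.io/aaron666/slurm-slurmctld:latest
          slurmdbd:
            image: docker.io/aaron666/slurm-slurmdbd:latest
          munged:
            image: docker.io/aaron666/slurm-munged:latest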

          Aug 7, 2024

          Slurm Operator

          If you want to change the slurm configuration, please check the slurm configuration generator: click

          • for helm user

            just run for fun!

            1. helm repo add ay-helm-repo https://aaronyang0628.github.io/helm-chart-mirror/charts
            2. helm install slurm ay-helm-repo/slurm --version 1.0.4
          • for operator users

            pull an image and apply

            1. docker pull aaron666/slurm-operator:latest
            2. kubectl apply -f https://raw.githubusercontent.com/AaronYang0628/helm-chart-mirror/refs/heads/main/templates/slurm/install.yaml
            3. kubectl apply -f https://raw.githubusercontent.com/AaronYang0628/helm-chart-mirror/refs/heads/main/templates/slurm/slurmdeployment.values.yaml
          Aug 7, 2024

          Subsections of Plugins

          Flink S3 FS Multiple

          Normally, Flink can access only one S3 endpoint at runtime, but we need to process files from multiple MinIO instances simultaneously.

          So I modified the original flink-s3-fs-hadoop plugin to enable Flink to do so.

          StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
          env.enableCheckpointing(5000L, CheckpointingMode.EXACTLY_ONCE);
          env.setParallelism(1);
          env.setStateBackend(new HashMapStateBackend());
          env.getCheckpointConfig().setCheckpointStorage("file:///./checkpoints");
          
          final FileSource<String> source =
              FileSource.forRecordStreamFormat(
                      new TextLineInputFormat(),
                      new Path(
                          "s3u://admin:ZrwpsezF1Lt85dxl@10.11.33.132:9000/user-data/home/conti/2024-02-08--10"))
                  .build();
          
          final FileSource<String> source2 =
              FileSource.forRecordStreamFormat(
                      new TextLineInputFormat(),
                      new Path(
                          "s3u://minioadmin:minioadmin@10.101.16.72:9000/user-data/home/conti"))
                  .build();
          
          env.fromSource(source, WatermarkStrategy.noWatermarks(), "file-source")
              .union(env.fromSource(source2, WatermarkStrategy.noWatermarks(), "file-source2"))
              .print("union-result");
              
          env.execute();
          original usage example

          With the default flink-s3-fs-hadoop, the configuration values are set into a single Hadoop configuration map. Only one set of values can be in effect at a time, so there is no way to target different endpoints within a single job context.
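
          For reference, the classic single-endpoint setup is configured globally, e.g. in flink-conf.yaml (a sketch reusing the MinIO endpoint and credentials shown above), which is exactly why two different endpoints cannot coexist in one job:

          # flink-conf.yaml (sketch) - cluster-wide settings, one S3 endpoint for every job
          s3.endpoint: http://10.11.33.132:9000
          s3.access-key: admin
          s3.secret-key: ZrwpsezF1Lt85dxl
          s3.path.style.access: true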

          Configuration pluginConfiguration = new Configuration();
          pluginConfiguration.setString("s3a.access-key", "admin");
          pluginConfiguration.setString("s3a.secret-key", "ZrwpsezF1Lt85dxl");
          pluginConfiguration.setString("s3a.connection.maximum", "1000");
          pluginConfiguration.setString("s3a.endpoint", "http://10.11.33.132:9000");
          pluginConfiguration.setBoolean("s3a.path.style.access", Boolean.TRUE);
          FileSystem.initialize(
              pluginConfiguration, PluginUtils.createPluginManagerFromRootFolder(pluginConfiguration));
          
          StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
          env.enableCheckpointing(5000L, CheckpointingMode.EXACTLY_ONCE);
          env.setParallelism(1);
          env.setStateBackend(new HashMapStateBackend());
          env.getCheckpointConfig().setCheckpointStorage("file:///./checkpoints");
          
          final FileSource<String> source =
              FileSource.forRecordStreamFormat(
                      new TextLineInputFormat(), new Path("s3a://user-data/home/conti/2024-02-08--10"))
                  .build();
          env.fromSource(source, WatermarkStrategy.noWatermarks(), "file-source").print();
          
          env.execute();

          Usage

          There are two ways to get the modified plugin:

          Install From

          For now, you can directly download flink-s3-fs-hadoop-$VERSION.jar and load it into your project.
          $VERSION is the Flink version you are using.

            implementation(files("flink-s3-fs-hadoop-$flinkVersion.jar"))
            <dependency>
                <groupId>org.apache</groupId>
                <artifactId>flink</artifactId>
                <version>$flinkVersion</version>
                <scope>system</scope>
                <systemPath>${project.basedir}/flink-s3-fs-hadoop-$flinkVersion.jar</systemPath>
            </dependency>
The jar we provide is based on the original flink-s3-fs-hadoop plugin, so you should use the original protocol prefix s3a://.

Or you can wait for the PR: once it is merged into flink-master, you won't need to do anything beyond updating your Flink version,
and you can directly use s3u://.

          Repo

You can get the code from github or gitlab.

          Mar 8, 2024

          Subsections of Stream

          Cosmic Antenna

          Design Architecture

          • objects

Continuously processing antenna signal records, converting them into 3-dimensional data matrices, and sending them to different astronomical algorithm endpoints (a rough sketch of the flow follows below).

          • how data flows

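As a rough orientation, the sketch below shows one way this flow could be wired up in Flink: records from the fpga-mock client are keyed by antenna id, regrouped per antenna into [antenna][channel][timeSlice] matrices, and handed off to the algorithm endpoints. The class, field, and source names here are hypothetical, not taken from the repository.

// Hypothetical sketch only: class, field and source names are illustrative, not from the repo.
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class AntennaPipelineSketch {

    /** One sample from the FPGA mock: antenna id, frequency channel, time slice and measured value. */
    public static class AntennaRecord {
        public int antennaId;
        public int channel;
        public int timeSlice;
        public double value;

        public AntennaRecord() {}

        public AntennaRecord(int antennaId, int channel, int timeSlice, double value) {
            this.antennaId = antennaId;
            this.channel = channel;
            this.timeSlice = timeSlice;
            this.value = value;
        }
    }

    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Placeholder source; the real job would consume UDP packets produced by the fpga-mock client.
        DataStream<AntennaRecord> records =
                env.fromElements(
                        new AntennaRecord(0, 0, 0, 1.0),
                        new AntennaRecord(0, 1, 0, 2.0),
                        new AntennaRecord(1, 0, 0, 3.0));

        // Group samples per antenna; in the real pipeline each group would be assembled into a
        // [antenna][channel][timeSlice] matrix and pushed to an astronomical algorithm endpoint.
        records
                .keyBy(r -> r.antennaId)
                .map(r -> String.format("antenna=%d channel=%d slice=%d value=%.2f",
                        r.antennaId, r.channel, r.timeSlice, r.value))
                .print("to-algorithm-endpoint");

        env.execute("cosmic-antenna-sketch");
    }
}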

          Building From Zero

Following these steps, you can build cosmic-antenna from nothing.

          1. install podman

          you can check article Install Podman

          2. install kind and kubectl

          you can check article install kubectl

          # create a cluster using podman
          curl -o kind.cluster.yaml -L https://gitlab.com/-/snippets/3686427/raw/main/kind-cluster.yaml \
          && export KIND_EXPERIMENTAL_PROVIDER=podman \
          && kind create cluster --name cs-cluster --image m.daocloud.io/docker.io/kindest/node:v1.27.3 --config=./kind.cluster.yaml
          Modify ~/.kube/config

          vim ~/.kube/config

          in line 5, change server: http://::xxxx -> server: http://0.0.0.0:xxxxx


          3. [Optional] pre-downloaded slow images

          DOCKER_IMAGE_PATH=/root/docker-images && mkdir -p $DOCKER_IMAGE_PATH
          BASE_URL="https://resource-ops-dev.lab.zjvis.net:32443/docker-images"
          for IMAGE in "quay.io_argoproj_argocd_v2.9.3.dim" \
              "ghcr.io_dexidp_dex_v2.37.0.dim" \
              "docker.io_library_redis_7.0.11-alpine.dim" \
              "docker.io_library_flink_1.17.dim"
          do
              IMAGE_FILE=$DOCKER_IMAGE_PATH/$IMAGE
              if [ ! -f $IMAGE_FILE ]; then
                  TMP_FILE=$IMAGE_FILE.tmp \
                  && curl -o "$TMP_FILE" -L "$BASE_URL/$IMAGE" \
                  && mv $TMP_FILE $IMAGE_FILE
              fi
              kind -n cs-cluster load image-archive $IMAGE_FILE
          done

          4. install argocd

          you can check article Install ArgoCD

          5. install essential app on argocd

          # install cert manger    
          curl -LO https://gitlab.com/-/snippets/3686424/raw/main/cert-manager.yaml \
          && kubectl -n argocd apply -f cert-manager.yaml \
          && argocd app sync argocd/cert-manager
          
          # install ingress
          curl -LO https://gitlab.com/-/snippets/3686426/raw/main/ingress-nginx.yaml \
          && kubectl -n argocd apply -f ingress-nginx.yaml \
          && argocd app sync argocd/ingress-nginx
          
          # install flink-kubernetes-operator
          curl -LO https://gitlab.com/-/snippets/3686429/raw/main/flink-operator.yaml \
          && kubectl -n argocd apply -f flink-operator.yaml \
          && argocd app sync argocd/flink-operator

          6. install git

          sudo dnf install -y git \
          && rm -rf $HOME/cosmic-antenna-demo \
          && mkdir $HOME/cosmic-antenna-demo \
          && git clone --branch pv_pvc_template https://github.com/AaronYang2333/cosmic-antenna-demo.git $HOME/cosmic-antenna-demo

          7. prepare application image

          # cd into  $HOME/cosmic-antenna-demo
          sudo dnf install -y java-11-openjdk.x86_64 \
          && $HOME/cosmic-antenna-demo/gradlew :s3sync:buildImage \
          && $HOME/cosmic-antenna-demo/gradlew :fpga-mock:buildImage
          # save and load into cluster
          VERSION="1.0.3"
          podman save --quiet -o $DOCKER_IMAGE_PATH/fpga-mock_$VERSION.dim localhost/fpga-mock:$VERSION \
          && kind -n cs-cluster load image-archive $DOCKER_IMAGE_PATH/fpga-mock_$VERSION.dim
          podman save --quiet -o $DOCKER_IMAGE_PATH/s3sync_$VERSION.dim localhost/s3sync:$VERSION \
          && kind -n cs-cluster load image-archive $DOCKER_IMAGE_PATH/s3sync_$VERSION.dim
          Modify role config
          kubectl -n flink edit role/flink -o yaml

          add services and endpoints to the rules.resources

          8. prepare k8s resources [pv, pvc, sts]

          cp -rf $HOME/cosmic-antenna-demo/flink/*.yaml /tmp \
          && podman exec -d cs-cluster-control-plane mkdir -p /mnt/flink-job
          # create persist volume
          kubectl -n flink create -f /tmp/pv.template.yaml
          # create pv claim
          kubectl -n flink create -f /tmp/pvc.template.yaml
          # start up flink application
          kubectl -n flink create -f /tmp/job.template.yaml
          # start up ingress
          kubectl -n flink create -f /tmp/ingress.forward.yaml
          # start up fpga UDP client, sending data 
          cp $HOME/cosmic-antenna-demo/fpga-mock/client.template.yaml /tmp \
          && kubectl -n flink create -f /tmp/client.template.yaml

          9. check dashboard in browser

          http://job-template-example.flink.lab.zjvis.net

          Repo

          you can get code from github


          Reference

          1. https://github.com/ben-wangz/blog/tree/main/docs/content/6.kubernetes/7.installation/ha-cluster
          2. xxx
          Mar 7, 2024

          Subsections of Design

          Yaml Crawler

          Steps

1. Define which web URL you want to crawl, let's say https://www.xxx.com/aaa.apex
2. Create a page POJO org.example.business.page.MainPage to describe that page

          Then you can create a yaml file named root-pages.yaml and its content is

          - '@class': "org.example.business.page.MainPage"
            url: "https://www.xxx.com/aaa.apex"
3. Then define a process-flow yaml file, describing how the crawler should process the web pages it will meet.
          processorChain:
            - '@class': "org.example.crawler.core.processor.decorator.ExceptionRecord"
              processor:
                '@class': "org.example.crawler.core.processor.decorator.RetryControl"
                processor:
                  '@class': "org.example.crawler.core.processor.decorator.SpeedControl"
                  processor:
                    '@class': "org.example.business.hs.code.MainPageProcessor"
                    application: "app-name"
                  time: 100
                  unit: "MILLISECONDS"
                retryTimes: 1
            - '@class': "org.example.crawler.core.processor.decorator.ExceptionRecord"
              processor:
                '@class': "org.example.crawler.core.processor.decorator.RetryControl"
                processor:
                  '@class': "org.example.crawler.core.processor.decorator.SpeedControl"
                  processor:
                    '@class': "org.example.crawler.core.processor.download.DownloadProcessor"
                    pagePersist:
                      '@class': "org.example.business.persist.DownloadPageDatabasePersist"
                      downloadPageRepositoryBeanName: "downloadPageRepository"
                    downloadPageTransformer:
                      '@class': "org.example.crawler.download.DefaultDownloadPageTransformer"
                    skipExists:
                      '@class': "org.example.crawler.download.SkipExistsById"
                  time: 1
                  unit: "SECONDS"
                retryTimes: 1
          nThreads: 1
          pollWaitingTime: 30
          pollWaitingTimeUnit: "SECONDS"
          waitFinishedTimeout: 180
          waitFinishedTimeUnit: "SECONDS" 

ExceptionRecord, RetryControl, and SpeedControl are provided by the yaml crawler itself, so don't worry about them. You only need to implement how your own pages are processed: for MainPage, for example, you would define a MainPageProcessor. Each processor produces a set of further pages or DownloadPages. A DownloadPage is like a ship carrying the information you need, and the framework takes care of processing, downloading, and persisting it (see the sketch after these steps).

4. Voilà, then run your crawler.
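For orientation, here is a minimal, hypothetical Java sketch of a page POJO and its processor. The real interfaces live in the yaml-crawler framework, so the Processor shape, method names, and parsing logic below are assumptions based on the description above, not the crawler's actual API.

// Hypothetical sketch only: the real interfaces come from the yaml-crawler framework.
import java.util.List;

public class MainPageSketch {

    /** A page POJO describing one URL to crawl, mirroring org.example.business.page.MainPage. */
    public static class MainPage {
        public String url;
    }

    /** A page that only carries data for the framework to download/persist. */
    public static class DownloadPage {
        public String url;
        public String content;
    }

    /** A processor turns one page into a set of follow-up pages (here: DownloadPages). */
    public interface Processor<P> {
        List<DownloadPage> process(P page) throws Exception;
    }

    /** Sketch of a MainPageProcessor: fetch the page and emit one DownloadPage with the parsed content. */
    public static class MainPageProcessor implements Processor<MainPage> {
        @Override
        public List<DownloadPage> process(MainPage page) {
            DownloadPage download = new DownloadPage();
            download.url = page.url;
            download.content = "...parsed content of " + page.url + "...";
            return List.of(download);
        }
    }
}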

          Repo

You can get the code from github or gitlab.

          Mar 8, 2024

          MCP

          Aug 7, 2024

          RAG

          Aug 7, 2024

          Utils

Projects

          Mar 7, 2024

          Subsections of Utils

          Cowsay

Since the previous cowsay image was built ten years ago, on newer Kubernetes versions you will hit an exception like

          Failed to pull image “docker/whalesay:latest”: [DEPRECATION NOTICE] Docker Image Format v1 and Docker Image manifest version 2, schema 1 support is disabled by default and will be removed in an upcoming release. Suggest the author of docker.io/docker/whalesay:latest to upgrade the image to the OCI Format or Docker Image manifest v2, schema 2. More information at https://docs.docker.com/go/deprecated-image-specs/

So I built a new one. Please try docker.io/aaron666/cowsay:v2.

          Build

          docker build -t whalesay:v2 .

          Usage

          docker run -it localhost/whalesay:v2 whalesay  "hello world"
          
          [root@ay-zj-ecs cowsay]# docker run -it localhost/whalesay:v2 whalesay  "hello world"
           _____________
          < hello world >
           -------------
            \
             \
              \     
                                ##        .            
                          ## ## ##       ==            
                       ## ## ## ##      ===            
                   /""""""""""""""""___/ ===        
              ~~~ {~~ ~~~~ ~~~ ~~~~ ~~ ~ /  ===- ~~~   
                   \______ o          __/            
                    \    \        __/             
                      \____\______/   
          docker run -it localhost/whalesay:v2 cowsay  "hello world"
          
          [root@ay-zj-ecs cowsay]# docker run -it localhost/whalesay:v2 cowsay  "hello world"
           _____________
          < hello world >
           -------------
                  \   ^__^
                   \  (oo)\_______
                      (__)\       )\/\
                          ||----w |
                          ||     ||

          Upload

          registry
          docker tag 5b01b0c3c7ce docker-registry.lab.zverse.space/ay-dev/whalesay:v2
          docker push docker-registry.lab.zverse.space/ay-dev/whalesay:v2
          export DOCKER_PAT=dckr_pat_bBN_Xkgz-TRdxirM2B6EDYCjjrg
          echo $DOCKER_PAT | docker login docker.io -u aaron666  --password-stdin
          docker tag 5b01b0c3c7ce docker.io/aaron666/whalesay:v2
          docker push docker.io/aaron666/whalesay:v2
          export GITHUB_PAT=XXXX
          echo $GITHUB_PAT | docker login ghcr.io -u aaronyang0628 --password-stdin
          docker tag 5b01b0c3c7ce ghcr.io/aaronyang0628/whalesay:v2
          docker push ghcr.io/aaronyang0628/whalesay:v2
          Mar 7, 2025

          Subsections of 🐿️Apache Flink

          Subsections of On K8s Operator

Job Privileges

          Template

          apiVersion: rbac.authorization.k8s.io/v1
          kind: Role
          metadata:
            namespace: flink
            name: flink-deployment-manager
          rules:
          - apiGroups: 
            - flink.apache.org
            resources: 
            - flinkdeployments
            verbs: 
            - 'get'
            - 'list'
            - 'create'
            - 'update'
          ---
          apiVersion: rbac.authorization.k8s.io/v1
          kind: RoleBinding
          metadata:
            name: flink-deployment-manager-binding
            namespace: flink
          subjects:
          - kind: User
            name: "277293711358271379"  
            apiGroup: rbac.authorization.k8s.io
          roleRef:
            kind: Role
            name: flink-deployment-manager
            apiGroup: rbac.authorization.k8s.io
          Jul 7, 2024

          OSS Template

          Template

          apiVersion: "flink.apache.org/v1beta1"
          kind: "FlinkDeployment"
          metadata:
            name: "financial-job"
          spec:
            image: "cr.registry.res.cloud.wuxi-yqgcy.cn/mirror/financial-topic:1.5-oss"
            flinkVersion: "v1_17"
            flinkConfiguration:
              taskmanager.numberOfTaskSlots: "8"
              fs.oss.endpoint: http://ay-test.oss-cn-jswx-xuelang-d01-a.ops.cloud.wuxi-yqgcy.cn/
              fs.oss.accessKeyId: 4gqOVOfQqCsCUwaC
              fs.oss.accessKeySecret: xxx
            ingress:
              template: "flink.k8s.io/{{namespace}}/{{name}}(/|$)(.*)"
              className: "nginx"
              annotations:
                cert-manager.io/cluster-issuer: "self-signed-ca-issuer"
                nginx.ingress.kubernetes.io/rewrite-target: "/$2"
            serviceAccount: "flink"
            podTemplate:
              apiVersion: "v1"
              kind: "Pod"
              metadata:
                name: "financial-job"
              spec:
                containers:
                  - name: "flink-main-container"
                    env:
                      - name: ENABLE_BUILT_IN_PLUGINS
                        value: flink-oss-fs-hadoop-1.17.2.jar
            jobManager:
              resource:
                memory: "2048m"
                cpu: 1
            taskManager:
              resource:
                memory: "2048m"
                cpu: 1
            job:
              jarURI: "local:///app/application.jar"
              parallelism: 1
              upgradeMode: "stateless"
          Apr 7, 2024

          S3 Template

          Template

          apiVersion: "flink.apache.org/v1beta1"
          kind: "FlinkDeployment"
          metadata:
            name: "financial-job"
          spec:
            image: "cr.registry.res.cloud.wuxi-yqgcy.cn/mirror/financial-topic:1.5"
            flinkVersion: "v1_17"
            flinkConfiguration:
              taskmanager.numberOfTaskSlots: "8"
              s3a.endpoint: http://172.27.253.89:9000
              s3a.access-key: minioadmin
              s3a.secret-key: minioadmin
            ingress:
              template: "flink.k8s.io/{{namespace}}/{{name}}(/|$)(.*)"
              className: "nginx"
              annotations:
                cert-manager.io/cluster-issuer: "self-signed-ca-issuer"
                nginx.ingress.kubernetes.io/rewrite-target: "/$2"
            serviceAccount: "flink"
            podTemplate:
              apiVersion: "v1"
              kind: "Pod"
              metadata:
                name: "financial-job"
              spec:
                containers:
                  - name: "flink-main-container"
                    env:
                      - name: ENABLE_BUILT_IN_PLUGINS
                        value: flink-s3-fs-hadoop-1.17.2.jar
            jobManager:
              resource:
                memory: "2048m"
                cpu: 1
            taskManager:
              resource:
                memory: "2048m"
                cpu: 1
            job:
              jarURI: "local:///app/application.jar"
              parallelism: 1
              upgradeMode: "stateless"
          Apr 7, 2024

          Subsections of CDC

          Mysql CDC

More often than not, we can get a simple example from CDC Connectors. But people still need to google a few unavoidable problems before using it.

          preliminary

          Flink: 1.17 JDK: 11

Flink CDC version mapping

| Flink CDC Version | Flink Version |
| --- | --- |
| 1.0.0 | 1.11.* |
| 1.1.0 | 1.11.* |
| 1.2.0 | 1.12.* |
| 1.3.0 | 1.12.* |
| 1.4.0 | 1.13.* |
| 2.0.* | 1.13.* |
| 2.1.* | 1.13.* |
| 2.2.* | 1.13.*, 1.14.* |
| 2.3.* | 1.13.*, 1.14.*, 1.15.* |
| 2.4.* | 1.13.*, 1.14.*, 1.15.* |
| 3.0.* | 1.14.*, 1.15.*, 1.16.* |

          usage for DataStream API

Importing com.ververica:flink-connector-mysql-cdc alone is not enough.

          implementation("com.ververica:flink-connector-mysql-cdc:2.4.0")
          
          //you also need these following dependencies
          implementation("org.apache.flink:flink-shaded-guava:30.1.1-jre-16.1")
          implementation("org.apache.flink:flink-connector-base:1.17")
          implementation("org.apache.flink:flink-table-planner_2.12:1.17")
          <dependency>
            <groupId>com.ververica</groupId>
            <!-- add the dependency matching your database -->
            <artifactId>flink-connector-mysql-cdc</artifactId>
            <!-- The dependency is available only for stable releases, SNAPSHOT dependencies need to be built based on master or release- branches by yourself. -->
            <version>2.4.0</version>
          </dependency>
          
          <!-- https://mvnrepository.com/artifact/org.apache.flink/flink-shaded-guava -->
          <dependency>
            <groupId>org.apache.flink</groupId>
            <artifactId>flink-shaded-guava</artifactId>
            <version>30.1.1-jre-16.1</version>
          </dependency>
          
          <!-- https://mvnrepository.com/artifact/org.apache.flink/flink-connector-base -->
          <dependency>
            <groupId>org.apache.flink</groupId>
            <artifactId>flink-connector-base</artifactId>
            <version>1.17.1</version>
          </dependency>
          
          <!-- https://mvnrepository.com/artifact/org.apache.flink/flink-table-planner -->
          <dependency>
            <groupId>org.apache.flink</groupId>
            <artifactId>flink-table-planner_2.12</artifactId>
            <version>1.17.1</version>
          </dependency>

          Example Code

          MySqlSource<String> mySqlSource =
              MySqlSource.<String>builder()
                  .hostname("192.168.56.107")
                  .port(3306)
                  .databaseList("test") // set captured database
                  .tableList("test.table_a") // set captured table
                  .username("root")
                  .password("mysql")
                  .deserializer(
                      new JsonDebeziumDeserializationSchema()) // converts SourceRecord to JSON String
                  .serverTimeZone("UTC")
                  .build();
          
          StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
          
          // enable checkpoint
          env.enableCheckpointing(3000);
          
          env.fromSource(mySqlSource, WatermarkStrategy.noWatermarks(), "MySQL Source")
              // set 4 parallel source tasks
              .setParallelism(4)
              .print()
              .setParallelism(1); // use parallelism 1 for sink to keep message ordering
          
          env.execute("Print MySQL Snapshot + Binlog");

          usage for table/SQL API
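As a minimal sketch, assuming the SQL connector shipped with flink-connector-mysql-cdc ('connector' = 'mysql-cdc') and the same MySQL instance as in the DataStream example above, a Table/SQL API job could look like the following; the column list for test.table_a is an assumption about its schema.

import org.apache.flink.table.api.EnvironmentSettings;
import org.apache.flink.table.api.TableEnvironment;

public class MySqlCdcSqlExample {
    public static void main(String[] args) {
        TableEnvironment tEnv =
                TableEnvironment.create(EnvironmentSettings.inStreamingMode());

        // Register a CDC source table over test.table_a (same database as the DataStream example).
        tEnv.executeSql(
                "CREATE TABLE table_a_source (" +
                "  id INT," +
                "  name STRING," +
                "  PRIMARY KEY (id) NOT ENFORCED" +
                ") WITH (" +
                "  'connector' = 'mysql-cdc'," +
                "  'hostname' = '192.168.56.107'," +
                "  'port' = '3306'," +
                "  'username' = 'root'," +
                "  'password' = 'mysql'," +
                "  'database-name' = 'test'," +
                "  'table-name' = 'table_a'" +
                ")");

        // Print snapshot + binlog changes; a real job would INSERT INTO a sink table instead.
        tEnv.executeSql("SELECT * FROM table_a_source").print();
    }
}

Depending on how the job is packaged, the fat com.ververica:flink-sql-connector-mysql-cdc jar is typically used for the SQL/Table path instead of the thin connector dependency.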

          Mar 7, 2024

          Connector

          Mar 7, 2024

          Subsections of 🐸Git

          Cheatsheet

          List config

          git config --list

          Init global config

          git config --global user.name "AaronYang"
          git config --global user.email aaron19940628@gmail.com
          git config --global user.email byang628@alumni.usc.edu
          git config --global pager.branch false
          git config --global pull.ff only
          git --no-pager diff

          change user and email (locally)

          # git config user.name ""
          # git config user.email ""
          git config user.name "AaronYang"
          git config user.email byang628@alumni.usc.edu

          list all remote repo

          git remote -v
          modify remote repo
          git remote set-url origin git@github.com:<$user>/<$repo>.git
          # git remote set-url origin http://xxxxxxxxxxx.git
          add a new remote repo
          git remote add dev https://xxxxxxxxxxx.git

          Clone specific branch

          git clone -b slurm-23.02 --single-branch --depth=1 https://github.com/SchedMD/slurm.git

          Get specific file from remote

          git archive --remote=git@github.com:<$user>/<$repo>.git <$branch>:<$source_file_path> -o <$target_source_path>
          for example
          git archive --remote=git@github.com:AaronYang2333/LOL_Overlay_Assistant_Tool.git master:paper/2003.11755.pdf -o a.pdf

          Update submodule

git submodule add --depth 1 https://github.com/xxx/xxxx a/b/c

          git submodule update --init --recursive

          Save credential

          login first and then execute this

          git config --global credential.helper store

          Delete Branch

          • Deleting a remote branch
            git push origin --delete <branch>  # Git version 1.7.0 or newer
            git push origin -d <branch>        # Shorter version (Git 1.7.0 or newer)
            git push origin :<branch>          # Git versions older than 1.7.0
          • Deleting a local branch
            git branch --delete <branch>
            git branch -d <branch> # Shorter version
            git branch -D <branch> # Force-delete un-merged branches

          Prune remote branches

          git remote prune origin
          Mar 7, 2024

          Subsections of Action

          Customize A Gitea Action

          Introduction

          In this guide, you’ll learn about the basic components needed to create and use a packaged composite action. To focus this guide on the components needed to package the action, the functionality of the action’s code is minimal. The action prints “Hello World” and then “Goodbye”, or if you provide a custom name, it prints “Hello [who-to-greet]” and then “Goodbye”. The action also maps a random number to the random-number output variable, and runs a script named goodbye.sh.

          Once you complete this project, you should understand how to build your own composite action and test it in a workflow.

          Warning

          When creating workflows and actions, you should always consider whether your code might execute untrusted input from possible attackers. Certain contexts should be treated as untrusted input, as an attacker could insert their own malicious content. For more information, see Secure use reference.

          Composite actions and reusable workflows

          Composite actions allow you to collect a series of workflow job steps into a single action which you can then run as a single job step in multiple workflows. Reusable workflows provide another way of avoiding duplication, by allowing you to run a complete workflow from within other workflows. For more information, see Reusing workflow configurations.

          Prerequisites

          Note

          This example explains how to create a composite action within a separate repository. However, it is possible to create a composite action within the same repository. For more information, see Creating a composite action.

          Before you begin, you’ll create a repository on GitHub.

          1. Create a new public repository on GitHub. You can choose any repository name, or use the following hello-world-composite-action example. You can add these files after your project has been pushed to GitHub.

          2. Clone your repository to your computer.

          3. From your terminal, change directories into your new repository.

          cd hello-world-composite-action
4. In the hello-world-composite-action repository, create a new file called goodbye.sh with example code:
          echo "echo Goodbye" > goodbye.sh
5. From your terminal, make goodbye.sh executable.
          chmod +x goodbye.sh
6. From your terminal, check in your goodbye.sh file.
          git add goodbye.sh
          git commit -m "Add goodbye script"
          git push

          Creating an action metadata file

          1. In the hello-world-composite-action repository, create a new file called action.yml and add the following example code. For more information about this syntax, see Metadata syntax reference.
          name: 'Hello World'
          description: 'Greet someone'
          inputs:
            who-to-greet:  # id of input
              description: 'Who to greet'
              required: true
              default: 'World'
          outputs:
            random-number:
              description: "Random number"
              value: ${{ steps.random-number-generator.outputs.random-number }}
          runs:
            using: "composite"
            steps:
              - name: Set Greeting
                run: echo "Hello $INPUT_WHO_TO_GREET."
                shell: bash
                env:
                  INPUT_WHO_TO_GREET: ${{ inputs.who-to-greet }}
          
              - name: Random Number Generator
                id: random-number-generator
                run: echo "random-number=$(echo $RANDOM)" >> $GITHUB_OUTPUT
                shell: bash
          
              - name: Set GitHub Path
                run: echo "$GITHUB_ACTION_PATH" >> $GITHUB_PATH
                shell: bash
                env:
                  GITHUB_ACTION_PATH: ${{ github.action_path }}
          
              - name: Run goodbye.sh
                run: goodbye.sh
                shell: bash

This file defines the who-to-greet input, maps the randomly generated number to the random-number output variable, adds the action’s path to the runner system path (to locate the goodbye.sh script during execution), and runs the goodbye.sh script.

          For more information about managing outputs, see Metadata syntax reference.

          For more information about how to use github.action_path, see Contexts reference.

2. From your terminal, check in your action.yml file.
          git add action.yml
          git commit -m "Add action"
          git push
3. From your terminal, add a tag. This example uses a tag called v1. For more information, see About custom actions.
          git tag -a -m "Description of this release" v1
          git push --follow-tags

          Testing out your action in a workflow

          The following workflow code uses the completed hello world action that you made in Creating a composite action.

          Copy the workflow code into a .github/workflows/main.yml file in another repository, replacing OWNER and SHA with the repository owner and the SHA of the commit you want to use, respectively. You can also replace the who-to-greet input with your name.

          on: [push]
          
          jobs:
            hello_world_job:
              runs-on: ubuntu-latest
              name: A job to say hello
              steps:
                - uses: actions/checkout@v5
                - id: foo
                  uses: OWNER/hello-world-composite-action@SHA
                  with:
                    who-to-greet: 'Mona the Octocat'
                - run: echo random-number "$RANDOM_NUMBER"
                  shell: bash
                  env:
                    RANDOM_NUMBER: ${{ steps.foo.outputs.random-number }}

          From your repository, click the Actions tab, and select the latest workflow run. The output should include: “Hello Mona the Octocat”, the result of the “Goodbye” script, and a random number.

          Creating a composite action within the same repository

1. Create a new subfolder called hello-world-composite-action. This can be placed in any subfolder within the repository; however, it is recommended that it be placed in the .github/actions subfolder to make organization easier.

          2. In the hello-world-composite-action folder, do the same steps to create the goodbye.sh script

          echo "echo Goodbye" > goodbye.sh
          chmod +x goodbye.sh
          git add goodbye.sh
          git commit -m "Add goodbye script"
          git push
3. In the hello-world-composite-action folder, create the action.yml file based on the steps in Creating a composite action.

4. When using the action, use the relative path to the folder where the composite action’s action.yml file is located in the uses key. The below example assumes it is in the .github/actions/hello-world-composite-action folder.

          on: [push]
          
          jobs:
            hello_world_job:
              runs-on: ubuntu-latest
              name: A job to say hello
              steps:
                - uses: actions/checkout@v5
                - id: foo
                  uses: ./.github/actions/hello-world-composite-action
                  with:
                    who-to-greet: 'Mona the Octocat'
                - run: echo random-number "$RANDOM_NUMBER"
                  shell: bash
                  env:
                    RANDOM_NUMBER: ${{ steps.foo.outputs.random-number }}
          Mar 7, 2024

          Customize A Github Action

          Introduction

          In this guide, you’ll learn about the basic components needed to create and use a packaged composite action. To focus this guide on the components needed to package the action, the functionality of the action’s code is minimal. The action prints “Hello World” and then “Goodbye”, or if you provide a custom name, it prints “Hello [who-to-greet]” and then “Goodbye”. The action also maps a random number to the random-number output variable, and runs a script named goodbye.sh.

          Once you complete this project, you should understand how to build your own composite action and test it in a workflow.

          Warning

          When creating workflows and actions, you should always consider whether your code might execute untrusted input from possible attackers. Certain contexts should be treated as untrusted input, as an attacker could insert their own malicious content. For more information, see Secure use reference.

          Composite actions and reusable workflows

          Composite actions allow you to collect a series of workflow job steps into a single action which you can then run as a single job step in multiple workflows. Reusable workflows provide another way of avoiding duplication, by allowing you to run a complete workflow from within other workflows. For more information, see Reusing workflow configurations.

          Prerequisites

          Note

          This example explains how to create a composite action within a separate repository. However, it is possible to create a composite action within the same repository. For more information, see Creating a composite action.

          Before you begin, you’ll create a repository on GitHub.

          1. Create a new public repository on GitHub. You can choose any repository name, or use the following hello-world-composite-action example. You can add these files after your project has been pushed to GitHub.

          2. Clone your repository to your computer.

          3. From your terminal, change directories into your new repository.

          cd hello-world-composite-action
4. In the hello-world-composite-action repository, create a new file called goodbye.sh with example code:
          echo "echo Goodbye" > goodbye.sh
5. From your terminal, make goodbye.sh executable.
          chmod +x goodbye.sh
6. From your terminal, check in your goodbye.sh file.
          git add goodbye.sh
          git commit -m "Add goodbye script"
          git push

          Creating an action metadata file

          1. In the hello-world-composite-action repository, create a new file called action.yml and add the following example code. For more information about this syntax, see Metadata syntax reference.
          name: 'Hello World'
          description: 'Greet someone'
          inputs:
            who-to-greet:  # id of input
              description: 'Who to greet'
              required: true
              default: 'World'
          outputs:
            random-number:
              description: "Random number"
              value: ${{ steps.random-number-generator.outputs.random-number }}
          runs:
            using: "composite"
            steps:
              - name: Set Greeting
                run: echo "Hello $INPUT_WHO_TO_GREET."
                shell: bash
                env:
                  INPUT_WHO_TO_GREET: ${{ inputs.who-to-greet }}
          
              - name: Random Number Generator
                id: random-number-generator
                run: echo "random-number=$(echo $RANDOM)" >> $GITHUB_OUTPUT
                shell: bash
          
              - name: Set GitHub Path
                run: echo "$GITHUB_ACTION_PATH" >> $GITHUB_PATH
                shell: bash
                env:
                  GITHUB_ACTION_PATH: ${{ github.action_path }}
          
              - name: Run goodbye.sh
                run: goodbye.sh
                shell: bash

This file defines the who-to-greet input, maps the randomly generated number to the random-number output variable, adds the action’s path to the runner system path (to locate the goodbye.sh script during execution), and runs the goodbye.sh script.

          For more information about managing outputs, see Metadata syntax reference.

          For more information about how to use github.action_path, see Contexts reference.

2. From your terminal, check in your action.yml file.
          git add action.yml
          git commit -m "Add action"
          git push
3. From your terminal, add a tag. This example uses a tag called v1. For more information, see About custom actions.
          git tag -a -m "Description of this release" v1
          git push --follow-tags

          Testing out your action in a workflow

          The following workflow code uses the completed hello world action that you made in Creating a composite action.

          Copy the workflow code into a .github/workflows/main.yml file in another repository, replacing OWNER and SHA with the repository owner and the SHA of the commit you want to use, respectively. You can also replace the who-to-greet input with your name.

          on: [push]
          
          jobs:
            hello_world_job:
              runs-on: ubuntu-latest
              name: A job to say hello
              steps:
                - uses: actions/checkout@v5
                - id: foo
                  uses: OWNER/hello-world-composite-action@SHA
                  with:
                    who-to-greet: 'Mona the Octocat'
                - run: echo random-number "$RANDOM_NUMBER"
                  shell: bash
                  env:
                    RANDOM_NUMBER: ${{ steps.foo.outputs.random-number }}

          From your repository, click the Actions tab, and select the latest workflow run. The output should include: “Hello Mona the Octocat”, the result of the “Goodbye” script, and a random number.

          Creating a composite action within the same repository

1. Create a new subfolder called hello-world-composite-action. This can be placed in any subfolder within the repository; however, it is recommended that it be placed in the .github/actions subfolder to make organization easier.

          2. In the hello-world-composite-action folder, do the same steps to create the goodbye.sh script

          echo "echo Goodbye" > goodbye.sh
          chmod +x goodbye.sh
          git add goodbye.sh
          git commit -m "Add goodbye script"
          git push
3. In the hello-world-composite-action folder, create the action.yml file based on the steps in Creating a composite action.

4. When using the action, use the relative path to the folder where the composite action’s action.yml file is located in the uses key. The below example assumes it is in the .github/actions/hello-world-composite-action folder.

          on: [push]
          
          jobs:
            hello_world_job:
              runs-on: ubuntu-latest
              name: A job to say hello
              steps:
                - uses: actions/checkout@v5
                - id: foo
                  uses: ./.github/actions/hello-world-composite-action
                  with:
                    who-to-greet: 'Mona the Octocat'
                - run: echo random-number "$RANDOM_NUMBER"
                  shell: bash
                  env:
                    RANDOM_NUMBER: ${{ steps.foo.outputs.random-number }}
          Mar 7, 2024

          Gitea Variables

          Preset Variables

| Variable | Description / Usage |
| --- | --- |
| gitea.actor | The username of the user that triggered the workflow. (docs.gitea.com) |
| gitea.event_name | The name of the event, e.g. push, pull_request. (docs.gitea.com) |
| gitea.ref | The Git ref (branch/tag) that triggered the workflow. (docs.gitea.com) |
| gitea.repository | The repository identifier, usually owner/name. (docs.gitea.com) |
| gitea.workspace | The working directory path on the runner where the repository is checked out. (docs.gitea.com) |

          Common Variables

| Variable | Description / Usage |
| --- | --- |
| runner.os | The operating system environment the runner is on, e.g. ubuntu-latest. (docs.gitea.com) |
| job.status | The status of the current job (e.g. success or failure). (docs.gitea.com) |
| env.xxxx | Custom configuration variables, defined at the user/organization/repository level and referenced in uppercase. (docs.gitea.com) |
| secrets.XXXX | Secrets holding sensitive information, which can likewise be defined at the user/organization/repository level. (docs.gitea.com) |

          Sample

          name: Gitea Actions Demo
          run-name: ${{ gitea.actor }} is testing out Gitea Actions 🚀
          on: [push]
          
          env:
              author: gitea_admin
          jobs:
            Explore-Gitea-Actions:
              runs-on: ubuntu-latest
              steps:
                - run: echo "🎉 The job was automatically triggered by a ${{ gitea.event_name }} event."
                - run: echo "🐧 This job is now running on a ${{ runner.os }} server hosted by Gitea!"
                - run: echo "🔎 The name of your branch is ${{ gitea.ref }} and your repository is ${{ gitea.repository }}."
                - name: Check out repository code
                  uses: actions/checkout@v4
                - run: echo "💡 The ${{ gitea.repository }} repository has been cloned to the runner."
                - run: echo "🖥️ The workflow is now ready to test your code on the runner."
                - name: List files in the repository
                  run: |
                    ls ${{ gitea.workspace }}
                - run: echo "🍏 This job's status is ${{ job.status }}."

          Result

          🎉 The job was automatically triggered by a `push` event.
          
          🐧 This job is now running on a `Linux` server hosted by Gitea!
          
          🔎 The name of your branch is `refs/heads/main` and your repository is `gitea_admin/data-warehouse`.
          
          💡 The `gitea_admin/data-warehouse` repository has been cloned to the runner.
          
          🖥️ The workflow is now ready to test your code on the runner.
          
              Dockerfile  README.md  environments  pom.xml  src  templates
          
          🍏 This job's status is `success`.
          Mar 7, 2024

          Github Variables

          Context Variables

| Variable | Description / Usage |
| --- | --- |
| github.actor | The username of the user that triggered the workflow. (docs.gitea.com) |
| github.event_name | The name of the event, e.g. push, pull_request. (docs.gitea.com) |
| github.ref | The Git ref (branch/tag) that triggered the workflow. (docs.gitea.com) |
| github.repository | The repository identifier, usually owner/name. (docs.gitea.com) |
| github.workspace | The working directory path on the runner where the repository is checked out. (docs.gitea.com) |
| env.xxxx | Variables defined in the workflow, e.g. ${{ env.xxxx }} |
| secrets.XXXX | Secrets created via Settings -> Actions -> Secrets and variables. |
          Mar 7, 2024

          Subsections of Template

          Apply And Sync Argocd APP

          name: apply-and-sync-app
          run-name: ${{ gitea.actor }} is going to sync an sample argocd app 🚀
          on: [push]
          
          jobs:
            sync-argocd-app:
              runs-on: ubuntu-latest
              steps:
                - name: Sync App
                  uses: AaronYang0628/apply-and-sync-argocd@v1.0.6
                  with:
                    argocd-server: '192.168.100.125:30443'
                    argocd-token: ${{ secrets.ARGOCD_TOKEN }}
                    application-yaml-path: "environments/ops/argocd/operator.app.yaml"
          Mar 7, 2025

          Publish Chart 2 Harbor

          name: publish-chart-to-harbor-registry
          run-name: ${{ gitea.actor }} is testing out Gitea Push Chart 🚀
          on: [push]
          
          env:
            REGISTRY: harbor.zhejianglab.com
            USER: byang628@zhejianglab.com
            REPOSITORY_NAMESPACE: ay-dev
            CHART_NAME: data-warehouse
          jobs:
            build-and-push-charts:
              runs-on: ubuntu-latest
              permissions:
                packages: write
                contents: read
              strategy:
                matrix:
                  include:
                    - chart_path: "environments/helm/metadata-environment"
              steps:
                - name: Checkout Repository
                  uses: actions/checkout@v4
                  with:
                    fetch-depth: 0
          
                - name: Log in to Harbor
                  uses: docker/login-action@f4ef78c080cd8ba55a85445d5b36e214a81df20a
                  with:
                    registry: "${{ env.REGISTRY }}"
                    username: "${{ env.USER }}"
                    password: "${{ secrets.ZJ_HARBOR_TOKEN }}"
          
                - name: Helm Publish Action
                  uses: AaronYang0628/push-helm-chart-to-oci@v0.0.3
                  with:
                    working-dir: ${{ matrix.chart_path }}
                    oci-repository: oci://${{ env.REGISTRY }}/${{ env.REPOSITORY_NAMESPACE }}
                    username: ${{ env.USER }}
                    password: ${{ secrets.ZJ_HARBOR_TOKEN }}
          Mar 7, 2025

          Publish Image 2 Dockerhub

          name: publish-image-to-ghcr
          run-name: ${{ gitea.actor }} is testing out Gitea Push Image 🚀
          on: [push]
          
          env:
            REGISTRY: ghcr.io
            USER: aaronyang0628
            REPOSITORY_NAMESPACE: aaronyang0628
          jobs:
            build-and-push-images:
              strategy:
                matrix:
                  include:
                    - name_suffix: "aria-ng"
                      container_path: "application/aria2/container/aria-ng"
                      dockerfile_path: "application/aria2/container/aria-ng/Dockerfile"
                    - name_suffix: "aria2"
                      container_path: "application/aria2/container/aria2"
                      dockerfile_path: "application/aria2/container/aria2/Dockerfile"
              runs-on: ubuntu-latest
              steps:
              - name: checkout-repository
                uses: actions/checkout@v4
              - name: log in to the container registry
                uses: docker/login-action@v3
                with:
                  registry: "${{ env.REGISTRY }}"
                  username: "${{ env.USER }}"
                  password: "${{ secrets.GIT_REGISTRY_PWD }}"
              - name: build and push container image
                uses: docker/build-push-action@v6
                with:
                  context: "${{ matrix.container_path }}"
                  file: "${{ matrix.dockerfile_path }}"
                  push: true
                  tags: |
                    ${{ env.REGISTRY }}/${{ env.REPOSITORY_NAMESPACE }}/${{ github.repository }}-${{ matrix.name_suffix }}:${{ inputs.tag || 'latest' }}
                    ${{ env.REGISTRY }}/${{ env.REPOSITORY_NAMESPACE }}/${{ github.repository }}-${{ matrix.name_suffix }}:${{ github.ref_name }}
                  labels: |
                    org.opencontainers.image.source=${{ github.server_url }}/${{ github.repository }}
          Mar 7, 2025

          Publish Image 2 Ghcr

          name: publish-image-to-ghcr
          run-name: ${{ gitea.actor }} is testing out Gitea Push Image 🚀
          on: [push]
          
          env:
            REGISTRY: ghcr.io
            USER: aaronyang0628
            REPOSITORY_NAMESPACE: aaronyang0628
          jobs:
            build-and-push-images:
              strategy:
                matrix:
                  include:
                    - name_suffix: "aria-ng"
                      container_path: "application/aria2/container/aria-ng"
                      dockerfile_path: "application/aria2/container/aria-ng/Dockerfile"
                    - name_suffix: "aria2"
                      container_path: "application/aria2/container/aria2"
                      dockerfile_path: "application/aria2/container/aria2/Dockerfile"
              runs-on: ubuntu-latest
              steps:
              - name: checkout-repository
                uses: actions/checkout@v4
              - name: log in to the container registry
                uses: docker/login-action@v3
                with:
                  registry: "${{ env.REGISTRY }}"
                  username: "${{ env.USER }}"
                  password: "${{ secrets.GIT_REGISTRY_PWD }}"
              - name: build and push container image
                uses: docker/build-push-action@v6
                with:
                  context: "${{ matrix.container_path }}"
                  file: "${{ matrix.dockerfile_path }}"
                  push: true
                  tags: |
                    ${{ env.REGISTRY }}/${{ env.REPOSITORY_NAMESPACE }}/${{ github.repository }}-${{ matrix.name_suffix }}:${{ inputs.tag || 'latest' }}
                    ${{ env.REGISTRY }}/${{ env.REPOSITORY_NAMESPACE }}/${{ github.repository }}-${{ matrix.name_suffix }}:${{ github.ref_name }}
                  labels: |
                    org.opencontainers.image.source=${{ github.server_url }}/${{ github.repository }}
          Mar 7, 2025

          Publish Image 2 Harbor

          name: publish-image-to-harbor-registry
          run-name: ${{ gitea.actor }} is testing out Gitea Push Image 🚀
          on: [push]
          
          
          env:
            REGISTRY: harbor.zhejianglab.com
            USER: byang628@zhejianglab.com
            REPOSITORY_NAMESPACE: ay-dev
            IMAGE_NAME: metadata-crd-operator
          jobs:
            build-and-push-images:
              runs-on: ubuntu-latest
              permissions:
                packages: write
                contents: read
              strategy:
                matrix:
                  include:
                    - name_suffix: "dev"
                      container_path: "."
                      dockerfile_path: "./Dockerfile"
              steps:
                - name: Checkout Repository
                  uses: actions/checkout@v4
          
                - name: Log in to Harbor
                  uses: docker/login-action@f4ef78c080cd8ba55a85445d5b36e214a81df20a
                  with:
                    registry: "${{ env.REGISTRY }}"
                    username: "${{ env.USER }}"
                    password: "${{ secrets.ZJ_HARBOR_TOKEN }}"
          
                - name: Extract Current Date
                  id: extract-date
                  run: |
                    echo "current-date=$(date +'%Y%m%d')" >> $GITHUB_OUTPUT
                    echo will push image: ${{ env.REGISTRY }}/${{ env.REPOSITORY_NAMESPACE }}/${{ env.IMAGE_NAME }}-${{ matrix.name_suffix }}:v${{ steps.extract-date.outputs.current-date }}
          
                - name: Build And Push Container Image
                  uses: docker/build-push-action@v6
                  with:
                    context: "${{ matrix.container_path }}"
                    file: "${{ matrix.dockerfile_path }}"
                    push: true
                    tags: |
                      ${{ env.REGISTRY }}/${{ env.REPOSITORY_NAMESPACE }}/${{ env.IMAGE_NAME }}-${{ matrix.name_suffix }}:v${{ steps.extract-date.outputs.current-date }}
                    labels: |
                      org.opencontainers.image.source=${{ github.server_url }}/${{ github.repository }}
          Mar 7, 2025

          Subsections of Notes

          Not Allow Push

          Cannot push to your own branch


1. Edit the .git/config file under your repo directory.

2. Find the url= entry under the [remote "origin"] section.

          3. Change it from:

            url=https://gitlab.com/AaronYang2333/ska-src-dm-local-data-preparer.git/

to:

  url=ssh://git@gitlab.com/AaronYang2333/ska-src-dm-local-data-preparer.git

4. Try pushing again.

          Mar 12, 2025

          ☸️Kubernetes

          Mar 7, 2024

          Subsections of ☸️Kubernetes

          Prepare k8s Cluster

To build a K8s cluster, you can choose one of the following methods.

Install Kubectl

          Build Cluster

          Install By

          Prerequisites

          • Hardware Requirements:

            1. At least 2 GB of RAM per machine (minimum 1 GB)
            2. 2 CPUs on the master node
            3. Full network connectivity among all machines (public or private network)
          • Operating System:

            1. Ubuntu 20.04/18.04, CentOS 7/8, or any other supported Linux distribution.
          • Network Requirements:

            1. Unique hostname, MAC address, and product_uuid for each node.
            2. Certain ports need to be open (e.g., 6443, 2379-2380, 10250, 10251, 10252, 10255, etc.)
          • Disable Swap:

            sudo swapoff -a

          Steps to Setup Kubernetes Cluster

1. Prepare Your Servers. Update the package index and install the necessary packages on all your nodes (both master and worker):
          sudo apt-get update
          sudo apt-get install -y apt-transport-https ca-certificates curl

          Add the Kubernetes APT Repository

          curl -s https://packages.cloud.google.com/apt/doc/apt-key.gpg | sudo apt-key add -
          cat <<EOF | sudo tee /etc/apt/sources.list.d/kubernetes.list
          deb http://apt.kubernetes.io/ kubernetes-xenial main
          EOF

          Install kubeadm, kubelet, and kubectl

          sudo apt-get update
          sudo apt-get install -y kubelet kubeadm kubectl
          sudo apt-mark hold kubelet kubeadm kubectl
2. Initialize the Master Node. On the master node, initialize the Kubernetes control plane:
          sudo kubeadm init --pod-network-cidr=192.168.0.0/16

The --pod-network-cidr flag is used to set the Pod network range. You might need to adjust this based on your network provider.

          Set up Local kubeconfig

          mkdir -p $HOME/.kube
          sudo cp -i /etc/kubernetes/admin.conf $HOME/.kube/config
          sudo chown $(id -u):$(id -g) $HOME/.kube/config
3. Install a Pod Network Add-on. You can install a network add-on like Flannel, Calico, or Weave. For example:

Flannel:
kubectl apply -f https://github.com/coreos/flannel/raw/master/Documentation/kube-flannel.yml

Calico:
kubectl apply -f https://docs.projectcalico.org/v3.14/manifests/calico.yaml

4. Join Worker Nodes to the Cluster. On each worker node, run the kubeadm join command provided at the end of the kubeadm init output on the master node. It will look something like this:
          sudo kubeadm join <master-ip>:6443 --token <token> --discovery-token-ca-cert-hash sha256:<hash>

          If you lost the join command, you can create a new token on the master node:

          sudo kubeadm token create --print-join-command
5. Verify the Cluster. Once all nodes have joined, you can verify the cluster status from the master node:
          kubectl get nodes

          This command should list all your nodes with the status “Ready”.

          Mar 7, 2025

          Subsections of Prepare k8s Cluster

          Kind

          Preliminary

• The kind binary is installed; if not, check this 🔗link

          • Hardware Requirements:

            1. At least 2 GB of RAM per machine (minimum 1 GB)
            2. 2 CPUs on the master node
            3. Full network connectivity among all machines (public or private network)
          • Operating System:

            1. Ubuntu 22.04/14.04, CentOS 7/8, or any other supported Linux distribution.
          • Network Requirements:

            1. Unique hostname, MAC address, and product_uuid for each node.
            2. Certain ports need to be open (e.g., 6443, 2379-2380, 10250, 10251, 10252, 10255, etc.)

          Customize your cluster

          Creating a Kubernetes cluster is as simple as kind create cluster

          kind create cluster --name test

          Reference

and then you can visit https://kind.sigs.k8s.io/docs/user/quick-start/ for more detail.

          Mar 7, 2024

          K3s

          Preliminary

          • Hardware Requirements:

1. A server needs at least 2 cores and 2 GB RAM
  2. An agent needs 1 core and 512 MB RAM
          • Operating System:

            1. K3s is expected to work on most modern Linux systems.
          • Network Requirements:

            1. The K3s server needs port 6443 to be accessible by all nodes.
            2. If you wish to utilize the metrics server, all nodes must be accessible to each other on port 10250.

          Init server

          curl -sfL https://rancher-mirror.rancher.cn/k3s/k3s-install.sh | INSTALL_K3S_MIRROR=cn sh -s - server --cluster-init --flannel-backend=vxlan --node-taint "node-role.kubernetes.io/control-plane=true:NoSchedule"

          Get token

          cat /var/lib/rancher/k3s/server/node-token

          Join worker

          curl -sfL https://rancher-mirror.rancher.cn/k3s/k3s-install.sh | INSTALL_K3S_MIRROR=cn K3S_URL=https://<master-ip>:6443 K3S_TOKEN=<join-token> sh -

          Copy kubeconfig

          mkdir -p $HOME/.kube
          cp /etc/rancher/k3s/k3s.yaml $HOME/.kube/config

          Uninstall k3s

          # exec on server
          /usr/local/bin/k3s-uninstall.sh
          
          # exec on agent 
          /usr/local/bin/k3s-agent-uninstall.sh
          Mar 7, 2024

          Minikube

          Preliminary

• The Minikube binary is installed; if not, check this 🔗link

          • Hardware Requirements:

            1. At least 2 GB of RAM per machine (minimum 1 GB)
            2. 2 CPUs on the master node
            3. Full network connectivity among all machines (public or private network)
          • Operating System:

            1. Ubuntu 20.04/18.04, CentOS 7/8, or any other supported Linux distribution.
          • Network Requirements:

            1. Unique hostname, MAC address, and product_uuid for each node.
            2. Certain ports need to be open (e.g., 6443, 2379-2380, 10250, 10251, 10252, 10255, etc.)

          [Optional] Disable aegis service and reboot system for Aliyun

          sudo systemctl disable aegis && sudo reboot

          Customize your cluster

          minikube start --driver=podman  --image-mirror-country=cn --kubernetes-version=v1.33.1 --image-repository=registry.cn-hangzhou.aliyuncs.com/google_containers --cpus=6 --memory=20g --disk-size=50g --force

          Restart minikube

          minikube stop && minikube start

          Add alias

          alias kubectl="minikube kubectl --"

          Stop And Clean

          minikube stop && minikube delete --all --purge

          Forward

          # execute on your local machine
                  Remote                                                  Local⬇️
             __________________                               ________________________________   
            ╱                  ╲╲         wire/wifi          ╱ [ Minikube ] 17.100.x.y        ╲╲
           ╱                   ╱╱   --------------------    ╱                                 ╱╱
          ╱ telnet 192.168.a.b ╱                           ╱  > execute ssh... at 192.168.a.b ╱ 
          ╲___________________╱  IP: 10.45.m.n             ╲_________________________________╱ IP: 192.168.a.b
          ssh -i ~/.minikube/machines/minikube/id_rsa docker@$(minikube ip) -L '*:30443:0.0.0.0:30443' -N -f

          and then you can visit https://minikube.sigs.k8s.io/docs/start/ for more detail.

          FAQ

          Q1: couldn’t get resource list for external.metrics.k8s.io/v1beta1: the server is currently unable to handle…

This is usually caused by the Metrics Server not being installed correctly or the External Metrics API being missing.

# Enable the Minikube metrics-server addon
minikube addons enable metrics-server

# Wait for the deployment to become available (about 1-2 minutes)
kubectl wait --for=condition=available deployment/metrics-server -n kube-system --timeout=180s

# Verify that the Metrics Server is running
kubectl -n kube-system get pods  | grep metrics-server


Q2: Expose minikube to the local network directly
          minikube start --driver=podman  --image-mirror-country=cn --kubernetes-version=v1.33.1 --image-repository=registry.cn-hangzhou.aliyuncs.com/google_containers  --listen-address=0.0.0.0 --cpus=6 --memory=20g --disk-size=100g --force
          Mar 7, 2024

          Subsections of Command

          Kubectl CheatSheet

          Switch Context

          • use different config
          kubectl --kubeconfig /root/.kube/config_ack get pod

          Resource

          • create resource

            Resource From
              kubectl create -n <$namespace> -f <$file_url>
            temp-file.yaml
apiVersion: v1
kind: Service
metadata:
  labels:
    app.kubernetes.io/component: server
    app.kubernetes.io/instance: argo-cd
    app.kubernetes.io/name: argocd-server-external
    app.kubernetes.io/part-of: argocd
    app.kubernetes.io/version: v2.8.4
  name: argocd-server-external
spec:
  ports:
  - name: https
    port: 443
    protocol: TCP
    targetPort: 8080
    nodePort: 30443
  selector:
    app.kubernetes.io/instance: argo-cd
    app.kubernetes.io/name: argocd-server
  type: NodePort
            
              helm install <$resource_id> <$resource_id> \
                  --namespace <$namespace> \
                  --create-namespace \
                  --version <$version> \
                  --repo <$repo_url> \
                  --values resource.values.yaml \
                  --atomic
            resource.values.yaml
crds:
    install: true
    keep: false
global:
    revisionHistoryLimit: 3
    image:
        repository: m.daocloud.io/quay.io/argoproj/argocd
        imagePullPolicy: IfNotPresent
redis:
    enabled: true
    image:
        repository: m.daocloud.io/docker.io/library/redis
    exporter:
        enabled: false
        image:
            repository: m.daocloud.io/bitnami/redis-exporter
    metrics:
        enabled: false
redis-ha:
    enabled: false
    image:
        repository: m.daocloud.io/docker.io/library/redis
    configmapTest:
        repository: m.daocloud.io/docker.io/koalaman/shellcheck
    haproxy:
        enabled: false
        image:
            repository: m.daocloud.io/docker.io/library/haproxy
    exporter:
        enabled: false
        image: m.daocloud.io/docker.io/oliver006/redis_exporter
dex:
    enabled: true
    image:
        repository: m.daocloud.io/ghcr.io/dexidp/dex
            

          • debug resource

          kubectl -n <$namespace> describe <$resource_id>
          • logging resource
          kubectl -n <$namespace> logs -f <$resource_id>
          • port forwarding resource
          kubectl -n <$namespace> port-forward  <$resource_id> --address 0.0.0.0 8080:80 # local:pod
          • delete all resource under specific namespace
          kubectl delete all --all -n <$namespace>
if you want to delete everything across all namespaces
          kubectl delete all --all --all-namespaces
          • delete error pods
          kubectl -n <$namespace> delete pods --field-selector status.phase=Failed
          • force delete
          kubectl -n <$namespace> delete pod <$resource_id> --force --grace-period=0
          • opening a Bash Shell inside a Pod
          kubectl -n <$namespace> exec -it <$resource_id> -- bash  
          • copy secret to another namespace
          kubectl -n <$namespaceA> get secret <$secret_name> -o json \
              | jq 'del(.metadata["namespace","creationTimestamp","resourceVersion","selfLink","uid"])' \
              | kubectl -n <$namespaceB> apply -f -
          • copy secret to another name
          kubectl -n <$namespace> get secret <$old_secret_name> -o json | \
          jq 'del(.metadata["namespace","creationTimestamp","resourceVersion","selfLink","uid","ownerReferences","annotations","labels"]) | .metadata.name = "<$new_secret_name>"' | \
          kubectl apply -n <$namespace> -f -
          • delete all completed job
          kubectl delete jobs -n <$namespace> --field-selector status.successful=1 

          Nodes

          • add taint
          kubectl taint nodes <$node_ip> <key:value>
          for example
          kubectl taint nodes node1 dedicated:NoSchedule
• remove taint (append a trailing "-" to the taint)
kubectl taint nodes <$node_ip> <key:effect>-
          for example
          kubectl taint nodes node1 dedicated:NoSchedule-
          • show info extract by json path
          kubectl get nodes -o jsonpath='{.items[*].spec.podCIDR}'

          Deploy

• show rollout history
          kubectl -n <$namespace> rollout history deploy/<$deploy_resource_id>

          undo rollout

          kubectl -n <$namespace> rollout undo deploy <$deploy_resource_id>  --to-revision=1
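you can then watch the rollout progress with

kubectl -n <$namespace> rollout status deploy/<$deploy_resource_id>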

          Patch

remove finalizers from resources that are stuck and no longer managed by k8s, so they can be deleted

          kubectl -n metadata patch flinkingest ingest-table-or-fits-from-oss -p '{"metadata":{"finalizers":[]}}' --type=merge
          Mar 8, 2024

          Helm Chart CheatSheet

          Finding Charts

          helm search hub wordpress

          Adding Repositories

          helm repo add ay-helm-mirror https://aaronyang0628.github.io/helm-chart-mirror/charts
          helm repo update

          Showing Chart Values

          helm show values bitnami/wordpress

          Packaging Charts

          helm package --dependency-update --destination /tmp/ /root/metadata-operator/environments/helm/metadata-environment/charts

          Uninstall Chart

          helm uninstall -n warehouse warehouse

if that fails, you can try

          helm uninstall -n warehouse warehouse --no-hooks --cascade=foreground
          Mar 7, 2024

          Resource CheatSheet

          Create Secret From Literal

          kubectl -n application create secret generic xxxx-secrets \
            --from-literal=xxx_uri='https://in03-891eca6c21bd4e5.serverless.aws-eu-central-1.cloud.zilliz.com' \
            --from-literal=xxxx_token='<$the uncoded value, do not base64 and paste here>' \
            --from-literal=tongyi_api_key='sk-xxxxxxxxxxx'
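To double-check what was stored (values are shown base64-encoded), you can read the secret back:

kubectl -n application get secret xxxx-secrets -o yaml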

          Forward external service

          kubectl -n basic-components apply -f - <<EOF
          apiVersion: v1
          kind: Service
          metadata:
            name: proxy-server-service
          spec:
            type: ClusterIP
            ports:
            - port: 80
              targetPort: 32080
              protocol: TCP
              name: http
---
          apiVersion: v1
          kind: Endpoints
          metadata:
            name: proxy-server-service
          subsets:
            - addresses:
              - ip: "47.xxx.xxx.xxx"
              ports:
              - port: 32080
                protocol: TCP
                name: http
          ---
          apiVersion: networking.k8s.io/v1
          kind: Ingress
          metadata:
            name: proxy-server-ingress
            annotations:
              nginx.ingress.kubernetes.io/proxy-connect-timeout: "300"
              nginx.ingress.kubernetes.io/proxy-read-timeout: "300"
              nginx.ingress.kubernetes.io/proxy-send-timeout: "300"
          spec:
            ingressClassName: nginx
            rules:
            - host: server.proxy.72602.online
              http:
                paths:
                - path: /
                  pathType: Prefix
                  backend:
                    service:
                      name: proxy-server-service
                      port:
                        number: 80
          EOF
          Mar 7, 2025

Subsections of Container

CheatSheet

Podman
          1. remove specific image
          podman rmi <$image_id>
          1. remove all <none> images
podman rmi `podman images | grep  '<none>' | awk '{print $3}'`
          1. remove all stopped containers
          podman container prune
1. remove all unused images
          podman image prune

          sudo podman volume prune

          1. find ip address of a container
          podman inspect --format='{{.NetworkSettings.IPAddress}}' minio-server
          1. exec into container
podman exec -it <$container_id> /bin/bash
          1. run with environment
podman run -d --replace \
              -p 18123:8123 -p 19000:9000 \
              --name clickhouse-server \
              -e ALLOW_EMPTY_PASSWORD=yes \
              --ulimit nofile=262144:262144 \
              quay.m.daocloud.io/kryptonite/clickhouse-docker-rootless:20.9.3.45 

--ulimit nofile=262144:262144: sets the soft and hard limits on the number of open file descriptors inside the container to 262144.

ulimit is a Linux shell built-in (some settings require admin access) used to view, set, or limit the resource usage of the current user, for example the number of open file descriptors a process may hold.
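For reference, you can inspect the limits of your current shell before tuning them:

ulimit -n    # current limit on open file descriptors
ulimit -a    # all resource limits for the current user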

          1. login registry
          export ZJLAB_CR_PAT=ghp_xxxxxxxxxxxx
          echo $ZJLAB_CR_PAT | podman login --tls-verify=false cr.registry.res.cloud.zhejianglab.com -u ascm-org-1710208820455 --password-stdin
          
          export GITHUB_CR_PAT=ghp_xxxxxxxxxxxx
          echo $GITHUB_CR_PAT | podman login ghcr.io -u aaronyang0628 --password-stdin
          
          export DOCKER_CR_PAT=dckr_pat_bBN_Xkgz-xxxx
          echo $DOCKER_CR_PAT | podman login docker.io -u aaron666 --password-stdin
          1. tag image
          podman tag 76fdac66291c cr.registry.res.cloud.zhejianglab.com/ay-dev/datahub-s3-fits:1.0.0
          1. push image
          podman push cr.registry.res.cloud.zhejianglab.com/ay-dev/datahub-s3-fits:1.0.0
Docker

1. remove specific image
docker rmi <$image_id>
          1. remove all <none> images
          docker rmi `docker images | grep  '<none>' | awk '{print $3}'`
          1. remove all stopped containers
          docker container prune
          1. remove all docker images not used
          docker image prune
          1. find ip address of a container
          docker inspect --format='{{.NetworkSettings.IPAddress}}' minio-server
          1. exec into container
          docker exec -it <$container_id> /bin/bash
          1. run with environment
docker run -d -p 18123:8123 -p 19000:9000 --name clickhouse-server -e ALLOW_EMPTY_PASSWORD=yes --ulimit nofile=262144:262144 quay.m.daocloud.io/kryptonite/clickhouse-docker-rootless:20.9.3.45 

--ulimit nofile=262144:262144: raises the container's open-file-descriptor limit to 262144 (same as the Podman example above).

          1. copy file

            Copy a local file into container

            docker cp ./some_file CONTAINER:/work

            or copy files from container to local path

            docker cp CONTAINER:/var/logs/ /tmp/app_logs
          2. load a volume

          docker run --rm \
              --entrypoint bash \
              -v $PWD/data:/app:ro \
              -it docker.io/minio/mc:latest \
              -c "mc --insecure alias set minio https://oss-cn-hangzhou-zjy-d01-a.ops.cloud.zhejianglab.com/ g83B2sji1CbAfjQO 2h8NisFRELiwOn41iXc6sgufED1n1A \
                  && mc --insecure ls minio/csst-prod/ \
                  && mc --insecure mb --ignore-existing minio/csst-prod/crp-test \
                  && mc --insecure cp /app/modify.pdf minio/csst-prod/crp-test/ \
                  && mc --insecure ls --recursive minio/csst-prod/"
          Mar 7, 2024

          Subsections of Template

          Subsections of DevContainer Template

          Java 21 + Go 1.24

          prepare .devcontainer.json

          {
            "name": "Go & Java DevContainer",
            "build": {
              "dockerfile": "Dockerfile"
            },
            "mounts": [
              "source=/root/.kube/config,target=/root/.kube/config,type=bind",
              "source=/root/.minikube/profiles/minikube/client.crt,target=/root/.minikube/profiles/minikube/client.crt,type=bind",
              "source=/root/.minikube/profiles/minikube/client.key,target=/root/.minikube/profiles/minikube/client.key,type=bind",
              "source=/root/.minikube/ca.crt,target=/root/.minikube/ca.crt,type=bind"
            ],
            "customizations": {
              "vscode": {
                "extensions": [
                  "golang.go",
                  "vscjava.vscode-java-pack",
                  "redhat.java",
                  "vscjava.vscode-maven",
                  "Alibaba-Cloud.tongyi-lingma",
                  "vscjava.vscode-java-debug",
                  "vscjava.vscode-java-dependency",
                  "vscjava.vscode-java-test"
                ]
              }
            },
            "remoteUser": "root",
            "postCreateCommand": "go version && java -version && mvn -v"
          }

          prepare Dockerfile

          FROM m.daocloud.io/docker.io/ubuntu:24.04
          
          ENV DEBIAN_FRONTEND=noninteractive
          
          RUN apt-get update && \
              apt-get install -y --no-install-recommends \
              ca-certificates \
              curl \
              git \
              wget \
              gnupg \
              vim \
              lsb-release \
              apt-transport-https \
              && apt-get clean \
              && rm -rf /var/lib/apt/lists/*
          
          # install OpenJDK 21 
          RUN mkdir -p /etc/apt/keyrings && \
              wget -qO - https://packages.adoptium.net/artifactory/api/gpg/key/public | gpg --dearmor -o /etc/apt/keyrings/adoptium.gpg && \
              echo "deb [signed-by=/etc/apt/keyrings/adoptium.gpg arch=amd64] https://packages.adoptium.net/artifactory/deb $(awk -F= '/^VERSION_CODENAME/{print$2}' /etc/os-release) main" | tee /etc/apt/sources.list.d/adoptium.list > /dev/null && \
              apt-get update && \
              apt-get install -y temurin-21-jdk && \
              apt-get clean && \
              rm -rf /var/lib/apt/lists/*
          
          # set java env
          ENV JAVA_HOME=/usr/lib/jvm/temurin-21-jdk-amd64
          
          # install maven
          ARG MAVEN_VERSION=3.9.10
          RUN wget https://dlcdn.apache.org/maven/maven-3/${MAVEN_VERSION}/binaries/apache-maven-${MAVEN_VERSION}-bin.tar.gz -O /tmp/maven.tar.gz && \
              mkdir -p /opt/maven && \
              tar -C /opt/maven -xzf /tmp/maven.tar.gz --strip-components=1 && \
              rm /tmp/maven.tar.gz
          
          ENV MAVEN_HOME=/opt/maven
          ENV PATH="${MAVEN_HOME}/bin:${PATH}"
          
          # install go 1.24.4 
          ARG GO_VERSION=1.24.4
          RUN wget https://dl.google.com/go/go${GO_VERSION}.linux-amd64.tar.gz -O /tmp/go.tar.gz && \
              tar -C /usr/local -xzf /tmp/go.tar.gz && \
              rm /tmp/go.tar.gz
          
          # set go env
          ENV GOROOT=/usr/local/go
          ENV GOPATH=/go
          ENV PATH="${GOROOT}/bin:${GOPATH}/bin:${PATH}"
          
# install other binaries
          ARG KUBECTL_VERSION=v1.33.0
          RUN wget https://files.m.daocloud.io/dl.k8s.io/release/${KUBECTL_VERSION}/bin/linux/amd64/kubectl -O /tmp/kubectl && \
              chmod u+x /tmp/kubectl && \
              mv -f /tmp/kubectl /usr/local/bin/kubectl 
          
          ARG HELM_VERSION=v3.13.3
          RUN wget https://files.m.daocloud.io/get.helm.sh/helm-${HELM_VERSION}-linux-amd64.tar.gz -O /tmp/helm-${HELM_VERSION}-linux-amd64.tar.gz && \
              mkdir -p /opt/helm && \
              tar -C /opt/helm -xzf /tmp/helm-${HELM_VERSION}-linux-amd64.tar.gz && \
              rm /tmp/helm-${HELM_VERSION}-linux-amd64.tar.gz
          
          ENV HELM_HOME=/opt/helm/linux-amd64
          ENV PATH="${HELM_HOME}:${PATH}"
          
          USER root
          WORKDIR /workspace
          Mar 7, 2024

          Subsections of DEV

          Devpod

          Preliminary

• Kubernetes has been installed; if not, check 🔗link
• Devpod has been installed; if not, check 🔗link

          1. Get provider config

          # just copy ~/.kube/config

          for example, the original config

          apiVersion: v1
          clusters:
          - cluster:
              certificate-authority: <$file_path>
              extensions:
              - extension:
                  provider: minikube.sigs.k8s.io
                  version: v1.33.0
                name: cluster_info
              server: https://<$minikube_ip>:8443
            name: minikube
          contexts:
          - context:
              cluster: minikube
              extensions:
              - extension:
                  provider: minikube.sigs.k8s.io
                  version: v1.33.0
                name: context_info
              namespace: default
              user: minikube
            name: minikube
          current-context: minikube
          kind: Config
          preferences: {}
          users:
          - name: minikube
            user:
              client-certificate: <$file_path>
              client-key: <$file_path>

          you need to rename clusters.cluster.certificate-authority, clusters.cluster.server, users.user.client-certificate, users.user.client-key.

          clusters.cluster.certificate-authority -> clusters.cluster.certificate-authority-data
          clusters.cluster.server -> ip set to `localhost`
          users.user.client-certificate -> users.user.client-certificate-data
          users.user.client-key -> users.user.client-key-data

the data you paste after each key should be base64-encoded

          cat <$file_path> | base64
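Alternatively, kubectl can embed the certificate data for you; a possible shortcut (the output file name here is just an example, and you still need to point server to localhost afterwards):

kubectl config view --flatten --minify > devpod_config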

then the modified config file should look like this:

          apiVersion: v1
          clusters:
          - cluster:
              certificate-authority-data: xxxxxxxxxxxxxx
              extensions:
              - extension:
                  provider: minikube.sigs.k8s.io
                  version: v1.33.0
                name: cluster_info
              server: https://127.0.0.1:8443 
            name: minikube
          contexts:
          - context:
              cluster: minikube
              extensions:
              - extension:
                  provider: minikube.sigs.k8s.io
                  version: v1.33.0
                name: context_info
              namespace: default
              user: minikube
            name: minikube
          current-context: minikube
          kind: Config
          preferences: {}
          users:
          - name: minikube
            user:
              client-certificate-data: xxxxxxxxxxxx
              client-key-data: xxxxxxxxxxxxxxxx

then forward the minikube API server port to your own PC

          #where you host minikube
          MACHINE_IP_ADDRESS=10.200.60.102
          USER=ayay
          MINIKUBE_IP_ADDRESS=$(ssh -o 'UserKnownHostsFile /dev/null' $USER@$MACHINE_IP_ADDRESS '$HOME/bin/minikube ip')
          ssh -o 'UserKnownHostsFile /dev/null' $USER@$MACHINE_IP_ADDRESS -L "*:8443:$MINIKUBE_IP_ADDRESS:8443" -N -f

          2. Create workspace

          1. get git repo link
          2. choose appropriate provider
          3. choose ide type and version
          4. and go!

          Useful Command

          Install Kubectl

          for more information, you can check 🔗link to install kubectl

          • How to use it in devpod

            Everything works fine.

When you are inside the pod and using kubectl, change clusters.cluster.server in ~/.kube/config to https://<$minikube_ip>:8443

          • exec into devpod

          kubectl -n devpod exec -it <$resource_id> -c devpod -- bin/bash
          • add DNS item
          10.aaa.bbb.ccc gitee.zhejianglab.com
          • shutdown ssh tunnel
            # check if port 8443 is already open
            netstat -aon|findstr "8443"
            
            # find PID
            ps | grep ssh
            
            # kill the process
            taskkill /PID <$PID> /T /F
# check if port 8443 is already open
netstat -anp | grep 8443
            
            # find PID
            ps | grep ssh
            
            # kill the process
            kill -9 <$PID>
          Mar 7, 2024

Dev Container

          write .devcontainer.json

          Mar 7, 2024

          JumpServer

                    Local             Jumpserver        virtual node (develop/k3s)
                  ________            _______                ________ 
                 ╱        ╲          ╱       ╲╲             ╱        ╲
                ╱         ╱ ------  ╱        ╱╱  --------  ╱         ╱
               ╱         ╱         ╱         ╱            ╱         ╱ 
               ╲________╱          ╲________╱             ╲________╱  
              IP: 10.A.B.C    IP: jumpserver.ay.dev   IP: 192.168.100.xxx                                

          Modify SSH Config

Port 30022 is the SSH service port on the jumpserver

          cat .ssh/config
          Host jumpserver
            HostName jumpserver.ay.dev
            Port 30022
            User ay
            IdentityFile ~/.ssh/id_rsa
          
          Host virtual
            HostName 192.168.100.xxx
            Port 22
            User ay
            ProxyJump jumpserver
            IdentityFile ~/.ssh/id_rsa

          And then you can directly connect to the virtual node
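for example, with the config above a single command is enough; ssh hops through the jumpserver transparently:

ssh virtual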

          Forward port in virtual node

Port 30022 is the SSH service port on the jumpserver

Port 32524 is the service port you want to forward

          ssh -o 'UserKnownHostsFile /dev/null' -o 'ServerAliveInterval=60' -L 32524:192.168.100.xxx:32524 -p 30022 ay@jumpserver.ay.dev
          Mar 7, 2024

          Subsections of Operator SDK

          KubeBuilder

          Basic

Kubebuilder is an SDK for building K8s APIs with CRDs. It mainly:

• builds on top of controller-runtime and client-go;
• provides an extensible API framework that makes it easy to develop CRDs, Controllers, and Admission Webhooks from scratch to extend K8s;
• also provides scaffolding tools to initialize a CRD project and auto-generate boilerplate code and configuration.

          Architecture

(architecture diagram)

          Main.go

          import (
          	_ "k8s.io/client-go/plugin/pkg/client/auth"
          
          	ctrl "sigs.k8s.io/controller-runtime"
          )
          // nolint:gocyclo
          func main() {
              ...
          
    mgr, err := ctrl.NewManager(ctrl.GetConfigOrDie(), ctrl.Options{})
          
              ...
              if err = (&controller.GuestbookReconciler{
                  Client: mgr.GetClient(),
                  Scheme: mgr.GetScheme(),
              }).SetupWithManager(mgr); err != nil {
                  setupLog.Error(err, "unable to create controller", "controller", "Guestbook")
                  os.Exit(1)
              }
          
              ...
              if os.Getenv("ENABLE_WEBHOOKS") != "false" {
                  if err = webhookwebappv1.SetupGuestbookWebhookWithManager(mgr); err != nil {
                      setupLog.Error(err, "unable to create webhook", "webhook", "Guestbook")
                      os.Exit(1)
                  }
              }

          Manager

The Manager is the core component: it coordinates multiple controllers and handles the cache, clients, leader election, and so on. See https://github.com/kubernetes-sigs/controller-runtime/blob/v0.20.0/pkg/manager/manager.go

• Client is responsible for talking to the Kubernetes API Server, operating on resource objects, and reading/writing the cache. It comes in two flavors:
  • Reader: reads from the Cache first to avoid hitting the API Server too often; results of Get are cached.
  • Writer: supports write operations (Create, Update, Delete, Patch) and talks to the API Server directly.
  • Informers are the core component provided by client-go, used to watch change events (Create/Update/Delete) of a specific resource type (such as Pod, Deployment, or a custom CRD) on the API Server.
    • The Client relies on the Informer mechanism to keep the cache in sync automatically. When a resource changes on the API Server, the Informer updates the local cache so that subsequent reads return the latest data.
• Cache
  • The Cache watches resource changes on the API Server through the built-in client's ListWatcher mechanism.
  • Events are written into the local cache (e.g., the Indexer), avoiding frequent calls to the API Server.
  • The purpose of the Cache is to reduce direct requests to the API Server while still letting controllers quickly read the latest state of resources.
• Event

  The Kubernetes API Server pushes resource change events over long-lived HTTP connections, and client-go's Informer listens for these messages.

  • Event: an event is the message passed from the Kubernetes API Server to the Controller; it contains the resource type, resource name, event type (ADDED, MODIFIED, DELETED), etc., and is converted into a request, check link
  • API Server → the Manager's Informer → Cache → the Controller's Watch → Predicate filtering → WorkQueue → the Controller's Reconcile() method

          Controller

          It’s a controller’s job to ensure that, for any given object the actual state of the world matches the desired state in the object. Each controller focuses on one root Kind, but may interact with other Kinds.

          func (r *GuestbookReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
              ...
          }
          func (r *GuestbookReconciler) SetupWithManager(mgr ctrl.Manager) error {
          	return ctrl.NewControllerManagedBy(mgr).
          		For(&webappv1.Guestbook{}).
          		Named("guestbook").
          		Complete(r)
          }

          If you wanna build your own controller, please check https://github.com/kubernetes/community/blob/master/contributors/devel/sig-api-machinery/controllers.md

1. When a Controller is initialized, it registers the resource types it cares about with the Manager (for example, declaring interest in Pod resources via Owns(&v1.Pod{})).

2. Based on the Controller's registration, the Manager creates the corresponding Informer and Watch for those resources, check link

3. When a resource change event occurs, the Informer takes the event from the cache and a Predicate (filter) decides whether the reconcile logic should be triggered.

4. If the event passes the filter, the Controller puts it on the WorkQueue and finally calls the user-implemented Reconcile() function, check link
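To make steps 1-4 concrete, here is a minimal sketch of how a controller could register its watches and a predicate with the Manager (assuming the scaffolded GuestbookReconciler and webappv1 types used elsewhere on this page; the Owns target and the predicate are only illustrative choices):

import (
	appsv1 "k8s.io/api/apps/v1"
	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/builder"
	"sigs.k8s.io/controller-runtime/pkg/predicate"
)

// SetupWithManager registers this controller with the Manager:
// - For() declares the root Kind to reconcile (the Manager creates an Informer/Watch for it)
// - Owns() watches secondary resources created by the controller and maps events back to the owner
// - WithPredicates() filters events before they are put on the WorkQueue
func (r *GuestbookReconciler) SetupWithManager(mgr ctrl.Manager) error {
	return ctrl.NewControllerManagedBy(mgr).
		For(&webappv1.Guestbook{}, builder.WithPredicates(predicate.GenerationChangedPredicate{})).
		Owns(&appsv1.Deployment{}).
		Complete(r)
}

Internally, controller-runtime's Controller then pulls requests off that queue and calls the reconcile handler, as the (abridged) snippet below shows: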

          func (c *Controller[request]) Start(ctx context.Context) error {
          
          	c.ctx = ctx
          
          	queue := c.NewQueue(c.Name, c.RateLimiter)
          
              c.Queue = &priorityQueueWrapper[request]{TypedRateLimitingInterface: queue}
          
          	err := func() error {
          
                      // start to sync event sources
                      if err := c.startEventSources(ctx); err != nil {
                          return err
                      }
          
                      for i := 0; i < c.MaxConcurrentReconciles; i++ {
                          go func() {
                              for c.processNextWorkItem(ctx) {
          
                              }
                          }()
                      }
          	}()
          
          	c.LogConstructor(nil).Info("All workers finished")
          }
          func (c *Controller[request]) processNextWorkItem(ctx context.Context) bool {
          	obj, priority, shutdown := c.Queue.GetWithPriority()
          
          	c.reconcileHandler(ctx, obj, priority)
          
          }

          Webhook

          Webhooks are a mechanism to intercept requests to the Kubernetes API server. They can be used to validate, mutate, or even proxy requests.

          func (d *GuestbookCustomDefaulter) Default(ctx context.Context, obj runtime.Object) error {}
          
          func (v *GuestbookCustomValidator) ValidateCreate(ctx context.Context, obj runtime.Object) (admission.Warnings, error) {}
          
          func (v *GuestbookCustomValidator) ValidateUpdate(ctx context.Context, oldObj, newObj runtime.Object) (admission.Warnings, error) {}
          
          func (v *GuestbookCustomValidator) ValidateDelete(ctx context.Context, obj runtime.Object) (admission.Warnings, error) {}
          
          func SetupGuestbookWebhookWithManager(mgr ctrl.Manager) error {
          	return ctrl.NewWebhookManagedBy(mgr).For(&webappv1.Guestbook{}).
          		WithValidator(&GuestbookCustomValidator{}).
          		WithDefaulter(&GuestbookCustomDefaulter{}).
          		Complete()
          }
          Mar 7, 2024

          Subsections of KubeBuilder

          Quick Start

          Prerequisites

          • go version v1.23.0+
          • docker version 17.03+.
          • kubectl version v1.11.3+.
          • Access to a Kubernetes v1.11.3+ cluster.

          Installation

          # download kubebuilder and install locally.
          curl -L -o kubebuilder "https://go.kubebuilder.io/dl/latest/$(go env GOOS)/$(go env GOARCH)"
          chmod +x kubebuilder && sudo mv kubebuilder /usr/local/bin/

          Create A Project

          mkdir -p ~/projects/guestbook
          cd ~/projects/guestbook
          kubebuilder init --domain my.domain --repo my.domain/guestbook
          Error: unable to scaffold with “base.go.kubebuilder.io/v4”:exit status 1

          Just try again!

          rm -rf ~/projects/guestbook/*
          kubebuilder init --domain my.domain --repo my.domain/guestbook

          Create An API

          kubebuilder create api --group webapp --version v1 --kind Guestbook
          Error: unable to run post-scaffold tasks of “base.go.kubebuilder.io/v4”: exec: “make”: executable file not found in $PATH
          apt-get -y install make
          rm -rf ~/projects/guestbook/*
          kubebuilder init --domain my.domain --repo my.domain/guestbook
          kubebuilder create api --group webapp --version v1 --kind Guestbook

          Prepare a K8s Cluster

Create a cluster in minikube:
          minikube start --kubernetes-version=v1.27.10 --image-mirror-country=cn --image-repository=registry.cn-hangzhou.aliyuncs.com/google_containers --cpus=4 --memory=4g --disk-size=50g --force


          Modify API [Optional]

you can modify the file ~/projects/guestbook/api/v1/guestbook_types.go

          type GuestbookSpec struct {
          	// INSERT ADDITIONAL SPEC FIELDS - desired state of cluster
          	// Important: Run "make" to regenerate code after modifying this file
          
          	// Foo is an example field of Guestbook. Edit guestbook_types.go to remove/update
          	Foo string `json:"foo,omitempty"`
          }

which corresponds to the file ~/projects/guestbook/config/samples/webapp_v1_guestbook.yaml
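For reference, the sample CR could then look something like this (a sketch; the apiVersion follows from the --domain and --group flags used above, and foo-value is just an example value):

apiVersion: webapp.my.domain/v1
kind: Guestbook
metadata:
  name: guestbook-sample
spec:
  foo: foo-value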

          If you are editing the API definitions, generate the manifests such as Custom Resources (CRs) or Custom Resource Definitions (CRDs) using

          make manifests
          Modify Controller [Optional]

you can modify the file ~/projects/guestbook/internal/controller/guestbook_controller.go

          // 	"fmt"
          // "k8s.io/apimachinery/pkg/api/errors"
          // "k8s.io/apimachinery/pkg/types"
          // 	appsv1 "k8s.io/api/apps/v1"
          //	corev1 "k8s.io/api/core/v1"
          //	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
          func (r *GuestbookReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
          	// The context is used to allow cancellation of requests, and potentially things like tracing. 
          	_ = log.FromContext(ctx)
          
          	fmt.Printf("I am a controller ->>>>>>")
          	fmt.Printf("Name: %s, Namespace: %s", req.Name, req.Namespace)
          
          	guestbook := &webappv1.Guestbook{}
          	if err := r.Get(ctx, req.NamespacedName, guestbook); err != nil {
          		return ctrl.Result{}, err
          	}
          
          	fooString := guestbook.Spec.Foo
	// replicas := int32(1) // uncomment together with the Deployment example below
          	fmt.Printf("Foo String: %s", fooString)
          
          	// labels := map[string]string{
          	// 	"app": req.Name,
          	// }
          
          	// dep := &appsv1.Deployment{
          	// 	ObjectMeta: metav1.ObjectMeta{
          	// 		Name:      fooString + "-deployment",
          	// 		Namespace: req.Namespace,
          	// 		Labels:    labels,
          	// 	},
          	// 	Spec: appsv1.DeploymentSpec{
          	// 		Replicas: &replicas,
          	// 		Selector: &metav1.LabelSelector{
          	// 			MatchLabels: labels,
          	// 		},
          	// 		Template: corev1.PodTemplateSpec{
          	// 			ObjectMeta: metav1.ObjectMeta{
          	// 				Labels: labels,
          	// 			},
          	// 			Spec: corev1.PodSpec{
          	// 				Containers: []corev1.Container{{
          	// 					Name:  fooString,
          	// 					Image: "busybox:latest",
          	// 				}},
          	// 			},
          	// 		},
          	// 	},
          	// }
          
          	// existingDep := &appsv1.Deployment{}
          	// err := r.Get(ctx, types.NamespacedName{Name: dep.Name, Namespace: dep.Namespace}, existingDep)
          	// if err != nil {
          	// 	if errors.IsNotFound(err) {
          	// 		if err := r.Create(ctx, dep); err != nil {
          	// 			return ctrl.Result{}, err
          	// 		}
          	// 	} else {
          	// 		return ctrl.Result{}, err
          	// 	}
          	// }
          
          	return ctrl.Result{}, nil
          }

          And you can use make run to test your controller.

          make run

          and use following command to send a request

make sure you have installed the CRDs (make install) before executing the following command

          make install
          kubectl apply -k config/samples/

your controller terminal should look like this

          I am a controller ->>>>>>Name: guestbook-sample, Namespace: defaultFoo String: foo-value

          Install CRDs

          check installed crds in k8s

          kubectl get crds

          install guestbook crd in k8s

          cd ~/projects/guestbook
          make install

          uninstall CRDs

          make uninstall
          
          make undeploy

          Deploy to cluster

          make docker-build IMG=aaron666/guestbook-operator:test
          make docker-build docker-push IMG=<some-registry>/<project-name>:tag
          make deploy IMG=<some-registry>/<project-name>:tag
          Mar 7, 2024

          Operator-SDK

            Mar 7, 2024

            Subsections of Proxy

            Daocloud Binary

Usage

Just add the files.m.daocloud.io prefix in front of the original URL. For example:

# original Helm download URL
wget https://get.helm.sh/helm-v3.9.1-linux-amd64.tar.gz

# accelerated URL
wget https://files.m.daocloud.io/get.helm.sh/helm-v3.9.1-linux-amd64.tar.gz

This speeds up the download. If the requested file has not been cached yet, the request will block until caching completes; subsequent downloads are not bandwidth-limited.

Best Practices

Scenario 1 - Install Helm

            cd /tmp
            export HELM_VERSION="v3.9.3"
            
            wget "https://files.m.daocloud.io/get.helm.sh/helm-${HELM_VERSION}-linux-amd64.tar.gz"
            tar -zxvf helm-${HELM_VERSION}-linux-amd64.tar.gz
            mv linux-amd64/helm /usr/local/bin/helm
            helm version

Scenario 2 - Install KubeSpray

Add the following configuration:

            files_repo: "https://files.m.daocloud.io"
            
            ## Kubernetes components
            kubeadm_download_url: "{{ files_repo }}/dl.k8s.io/release/{{ kubeadm_version }}/bin/linux/{{ image_arch }}/kubeadm"
            kubectl_download_url: "{{ files_repo }}/dl.k8s.io/release/{{ kube_version }}/bin/linux/{{ image_arch }}/kubectl"
            kubelet_download_url: "{{ files_repo }}/dl.k8s.io/release/{{ kube_version }}/bin/linux/{{ image_arch }}/kubelet"
            
            ## CNI Plugins
            cni_download_url: "{{ files_repo }}/github.com/containernetworking/plugins/releases/download/{{ cni_version }}/cni-plugins-linux-{{ image_arch }}-{{ cni_version }}.tgz"
            
            ## cri-tools
            crictl_download_url: "{{ files_repo }}/github.com/kubernetes-sigs/cri-tools/releases/download/{{ crictl_version }}/crictl-{{ crictl_version }}-{{ ansible_system | lower }}-{{ image_arch }}.tar.gz"
            
            ## [Optional] etcd: only if you **DON'T** use etcd_deployment=host
            etcd_download_url: "{{ files_repo }}/github.com/etcd-io/etcd/releases/download/{{ etcd_version }}/etcd-{{ etcd_version }}-linux-{{ image_arch }}.tar.gz"
            
            # [Optional] Calico: If using Calico network plugin
            calicoctl_download_url: "{{ files_repo }}/github.com/projectcalico/calico/releases/download/{{ calico_ctl_version }}/calicoctl-linux-{{ image_arch }}"
            calicoctl_alternate_download_url: "{{ files_repo }}/github.com/projectcalico/calicoctl/releases/download/{{ calico_ctl_version }}/calicoctl-linux-{{ image_arch }}"
            # [Optional] Calico with kdd: If using Calico network plugin with kdd datastore
            calico_crds_download_url: "{{ files_repo }}/github.com/projectcalico/calico/archive/{{ calico_version }}.tar.gz"
            
# [Optional] Flannel: If using Flannel network plugin
            flannel_cni_download_url: "{{ files_repo }}/kubernetes/flannel/{{ flannel_cni_version }}/flannel-{{ image_arch }}"
            
            # [Optional] helm: only if you set helm_enabled: true
            helm_download_url: "{{ files_repo }}/get.helm.sh/helm-{{ helm_version }}-linux-{{ image_arch }}.tar.gz"
            
            # [Optional] crun: only if you set crun_enabled: true
            crun_download_url: "{{ files_repo }}/github.com/containers/crun/releases/download/{{ crun_version }}/crun-{{ crun_version }}-linux-{{ image_arch }}"
            
            # [Optional] kata: only if you set kata_containers_enabled: true
            kata_containers_download_url: "{{ files_repo }}/github.com/kata-containers/kata-containers/releases/download/{{ kata_containers_version }}/kata-static-{{ kata_containers_version }}-{{ ansible_architecture }}.tar.xz"
            
            # [Optional] cri-dockerd: only if you set container_manager: docker
            cri_dockerd_download_url: "{{ files_repo }}/github.com/Mirantis/cri-dockerd/releases/download/v{{ cri_dockerd_version }}/cri-dockerd-{{ cri_dockerd_version }}.{{ image_arch }}.tgz"
            
            # [Optional] runc,containerd: only if you set container_runtime: containerd
            runc_download_url: "{{ files_repo }}/github.com/opencontainers/runc/releases/download/{{ runc_version }}/runc.{{ image_arch }}"
            containerd_download_url: "{{ files_repo }}/github.com/containerd/containerd/releases/download/v{{ containerd_version }}/containerd-{{ containerd_version }}-linux-{{ image_arch }}.tar.gz"
            nerdctl_download_url: "{{ files_repo }}/github.com/containerd/nerdctl/releases/download/v{{ nerdctl_version }}/nerdctl-{{ nerdctl_version }}-{{ ansible_system | lower }}-{{ image_arch }}.tar.gz"

Measured download speed can reach Downloaded: 19 files, 603M in 23s (25.9 MB/s), so all files can be downloaded within 23s! For the complete approach see https://gist.github.com/yankay/a863cf2e300bff6f9040ab1c6c58fbae

Scenario 3 - Install KIND

            cd /tmp
            export KIND_VERSION="v0.22.0"
            
            curl -Lo ./kind https://files.m.daocloud.io/github.com/kubernetes-sigs/kind/releases/download/${KIND_VERSION}/kind-linux-amd64
            chmod +x ./kind
            mv ./kind /usr/bin/kind
            kind version

Scenario 4 - Install K9S

            cd /tmp
            export K9S_VERSION="v0.32.4"
            
            wget https://files.m.daocloud.io/github.com/derailed/k9s/releases/download/${K9S_VERSION}/k9s_Linux_x86_64.tar.gz
            tar -zxvf k9s_Linux_x86_64.tar.gz
            chmod +x k9s
            mv k9s /usr/bin/k9s
            k9s version

Scenario 5 - Install istio

            cd /tmp
            export ISTIO_VERSION="1.14.3"
            
            wget "https://files.m.daocloud.io/github.com/istio/istio/releases/download/${ISTIO_VERSION}/istio-${ISTIO_VERSION}-linux-amd64.tar.gz"
            tar -zxvf istio-${ISTIO_VERSION}-linux-amd64.tar.gz
            # Do follow the istio docs to install istio

Scenario 6 - Install nerdctl (as a replacement for the docker CLI)

This installs as root; for other installation methods see the upstream project: https://github.com/containerd/nerdctl

            export NERDCTL_VERSION="1.7.6"
            mkdir -p nerdctl ;cd nerdctl
            wget https://files.m.daocloud.io/github.com/containerd/nerdctl/releases/download/v${NERDCTL_VERSION}/nerdctl-full-${NERDCTL_VERSION}-linux-amd64.tar.gz
            tar -zvxf nerdctl-full-${NERDCTL_VERSION}-linux-amd64.tar.gz
            mkdir -p /opt/cni/bin ;cp -f libexec/cni/* /opt/cni/bin/ ;cp bin/* /usr/local/bin/ ;cp lib/systemd/system/*.service /usr/lib/systemd/system/
            systemctl enable containerd ;systemctl start containerd --now
            systemctl enable buildkit;systemctl start buildkit --now

Contributions of more scenarios are welcome.

Suffixes that are never accelerated

Files with the following suffixes get a direct 403 response:

            • .bmp
            • .jpg
            • .jpeg
            • .png
            • .gif
            • .webp
            • .tiff
            • .mp4
            • .webm
            • .ogg
            • .avi
            • .mov
            • .flv
            • .mkv
            • .mp3
            • .wav
            • .rar
            Mar 7, 2024

            Daocloud Image

Quick Start

            docker run -d -P m.daocloud.io/docker.io/library/nginx

Usage

Add the prefix (recommended). For example:

                          docker.io/library/busybox
                             |
                             V
            m.daocloud.io/docker.io/library/busybox

Or, for supported registries, you can replace the prefix instead. For example:

                       docker.io/library/busybox
                         |
                         V
            docker.m.daocloud.io/library/busybox

No cache

If DaoCloud does not have the image cached when you pull, a task to sync the cache is added to the sync queue.

Registries that support prefix replacement (not recommended)

Adding the prefix is the recommended approach.

The prefix-replacement rules for these registries are configured manually; open an issue if you need another one.

Source                 Replace with               Notes
docker.elastic.co      elastic.m.daocloud.io
docker.io              docker.m.daocloud.io
gcr.io                 gcr.m.daocloud.io
ghcr.io                ghcr.m.daocloud.io
k8s.gcr.io             k8s-gcr.m.daocloud.io      k8s.gcr.io has been migrated to registry.k8s.io
registry.k8s.io        k8s.m.daocloud.io
mcr.microsoft.com      mcr.m.daocloud.io
nvcr.io                nvcr.m.daocloud.io
quay.io                quay.m.daocloud.io
registry.ollama.ai     ollama.m.daocloud.io

Best Practices

Accelerate Kubernetes

Accelerate installing kubeadm

kubeadm config images pull --image-repository k8s-gcr.m.daocloud.io

Accelerate installing kind

            kind create cluster --name kind --image m.daocloud.io/docker.io/kindest/node:v1.22.1

Accelerate Containerd
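A minimal sketch (assuming containerd 1.x with the CRI plugin and its config at /etc/containerd/config.toml): add a mirror for docker.io and restart containerd.

# /etc/containerd/config.toml
[plugins."io.containerd.grpc.v1.cri".registry.mirrors."docker.io"]
  endpoint = ["https://docker.m.daocloud.io"]

# then restart containerd
# sudo systemctl restart containerd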

Accelerate Docker

Add the following to /etc/docker/daemon.json

            {
              "registry-mirrors": [
                "https://docker.m.daocloud.io"
              ]
            }
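After editing daemon.json, restart Docker so the mirror takes effect (assuming a systemd-managed Docker):

sudo systemctl daemon-reload && sudo systemctl restart docker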

Accelerate Ollama & DeepSeek

Accelerate installing Ollama

CPU:

docker run -d -v ollama:/root/.ollama -p 11434:11434 --name ollama docker.m.daocloud.io/ollama/ollama

GPU version:

1. First install the NVIDIA Container Toolkit
2. Run the following command to start the Ollama container:
            docker run -d --gpus=all -v ollama:/root/.ollama -p 11434:11434 --name ollama docker.m.daocloud.io/ollama/ollama

For more information, please refer to:

Accelerate the Deepseek-R1 model

As in the steps above, with the ollama container already running, you can also use the mirror to speed up launching DeepSeek-related model services.

Note: the official Ollama registry is already quite fast, so you can also use the official source directly.

# use the mirror
docker exec -it ollama ollama run ollama.m.daocloud.io/library/deepseek-r1:1.5b

# or pull the model directly from the official source
            # docker exec -it ollama ollama run deepseek-r1:1.5b
            Mar 7, 2024

            KubeVPN

            1.install krew

              1. download and install krew
              1. Add the $HOME/.krew/bin directory to your PATH environment variable.
            export PATH="${KREW_ROOT:-$HOME/.krew}/bin:$PATH"
              1. Run kubectl krew to check the installation
            kubectl krew list

2. Download the kubevpn plugin from its GitHub source

            kubectl krew index add kubevpn https://gitclone.com/github.com/kubenetworks/kubevpn.git
            kubectl krew install kubevpn/kubevpn
            kubectl kubevpn 

            3. Deploy VPN in some cluster

Use a different kubeconfig to access a different cluster and deploy the VPN in that Kubernetes cluster.

            kubectl kubevpn connect
If you want to connect to another cluster …
            kubectl kubevpn connect --kubeconfig /root/.kube/xxx_config

            Your terminal should look like this:

            ➜  ~ kubectl kubevpn connect
            Password:
            Starting connect
            Getting network CIDR from cluster info...
            Getting network CIDR from CNI...
            Getting network CIDR from services...
            Labeling Namespace default
            Creating ServiceAccount kubevpn-traffic-manager
            Creating Roles kubevpn-traffic-manager
            Creating RoleBinding kubevpn-traffic-manager
            Creating Service kubevpn-traffic-manager
            Creating MutatingWebhookConfiguration kubevpn-traffic-manager
            Creating Deployment kubevpn-traffic-manager
            
            Pod kubevpn-traffic-manager-66d969fd45-9zlbp is Pending
            Container     Reason            Message
            control-plane ContainerCreating
            vpn           ContainerCreating
            webhook       ContainerCreating
            
            Pod kubevpn-traffic-manager-66d969fd45-9zlbp is Running
            Container     Reason           Message
            control-plane ContainerRunning
            vpn           ContainerRunning
            webhook       ContainerRunning
            
            Forwarding port...
            Connected tunnel
            Adding route...
            Configured DNS service
            +----------------------------------------------------------+
            | Now you can access resources in the kubernetes cluster ! |
            +----------------------------------------------------------+

Already connected to the cluster network; use the command kubectl kubevpn status to check the status

            ➜  ~ kubectl kubevpn status
            ID Mode Cluster   Kubeconfig                  Namespace            Status      Netif
            0  full ops-dev   /root/.kube/zverse_config   data-and-computing   Connected   utun0

            use pod productpage-788df7ff7f-jpkcs IP 172.29.2.134

            ➜  ~ kubectl get pods -o wide
            NAME                                       AGE     IP                NODE              NOMINATED NODE  GATES
            authors-dbb57d856-mbgqk                    7d23h   172.29.2.132      192.168.0.5       <none>         
            details-7d8b5f6bcf-hcl4t                   61d     172.29.0.77       192.168.104.255   <none>         
            kubevpn-traffic-manager-66d969fd45-9zlbp   74s     172.29.2.136      192.168.0.5       <none>         
            productpage-788df7ff7f-jpkcs               61d     172.29.2.134      192.168.0.5       <none>         
            ratings-77b6cd4499-zvl6c                   61d     172.29.0.86       192.168.104.255   <none>         
            reviews-85c88894d9-vgkxd                   24d     172.29.2.249      192.168.0.5       <none>         

            use ping to test connection, seems good

            ➜  ~ ping 172.29.2.134
            PING 172.29.2.134 (172.29.2.134): 56 data bytes
            64 bytes from 172.29.2.134: icmp_seq=0 ttl=63 time=55.727 ms
            64 bytes from 172.29.2.134: icmp_seq=1 ttl=63 time=56.270 ms
            64 bytes from 172.29.2.134: icmp_seq=2 ttl=63 time=55.228 ms
            64 bytes from 172.29.2.134: icmp_seq=3 ttl=63 time=54.293 ms
            ^C
            --- 172.29.2.134 ping statistics ---
            4 packets transmitted, 4 packets received, 0.0% packet loss
            round-trip min/avg/max/stddev = 54.293/55.380/56.270/0.728 ms

            use service productpage IP 172.21.10.49

            ➜  ~ kubectl get services -o wide
            NAME                      TYPE        CLUSTER-IP     PORT(S)              SELECTOR
            authors                   ClusterIP   172.21.5.160   9080/TCP             app=authors
            details                   ClusterIP   172.21.6.183   9080/TCP             app=details
            kubernetes                ClusterIP   172.21.0.1     443/TCP              <none>
            kubevpn-traffic-manager   ClusterIP   172.21.2.86    84xxxxxx0/TCP        app=kubevpn-traffic-manager
            productpage               ClusterIP   172.21.10.49   9080/TCP             app=productpage
            ratings                   ClusterIP   172.21.3.247   9080/TCP             app=ratings
            reviews                   ClusterIP   172.21.8.24    9080/TCP             app=reviews

            use command curl to test service connection

            ➜  ~ curl 172.21.10.49:9080
            <!DOCTYPE html>
            <html>
              <head>
                <title>Simple Bookstore App</title>
            <meta charset="utf-8">
            <meta http-equiv="X-UA-Compatible" content="IE=edge">
            <meta name="viewport" content="width=device-width, initial-scale=1">

            seems good too~

if you want to resolve domains

            Domain resolve

a Pod/Service named productpage in the default namespace can be resolved successfully by the following names:

            • productpage
            • productpage.default
            • productpage.default.svc.cluster.local
            ➜  ~ curl productpage.default.svc.cluster.local:9080
            <!DOCTYPE html>
            <html>
              <head>
                <title>Simple Bookstore App</title>
            <meta charset="utf-8">
            <meta http-equiv="X-UA-Compatible" content="IE=edge">
            <meta name="viewport" content="width=device-width, initial-scale=1">

            Short domain resolve

To access a service in the cluster, you can use its service name or the short domain name, such as productpage

            ➜  ~ curl productpage:9080
            <!DOCTYPE html>
            <html>
              <head>
                <title>Simple Bookstore App</title>
            <meta charset="utf-8">
            <meta http-equiv="X-UA-Compatible" content="IE=edge">
            ...

Disclaimer: This only works in the namespace where kubevpn-traffic-manager is deployed.

            Mar 7, 2024

            Subsections of Serverless

            Subsections of Kserve

            Install Kserve

            Preliminary

• Kubernetes v1.30+ has been installed; if not, check 🔗link
• Helm has been installed; if not, check 🔗link

            Installation

            Install By

            Preliminary

1. Kubernetes has been installed; if not, check 🔗link


2. Helm binary has been installed; if not, check 🔗link


            1.install from script directly

            Details
            curl -s "https://raw.githubusercontent.com/kserve/kserve/release-0.15/hack/quick_install.sh" | bash
Expected Output

            Installing Gateway API CRDs …

            😀 Successfully installed Istio

            😀 Successfully installed Cert Manager

            😀 Successfully installed Knative

But you will probably encounter some errors due to the network, like this:
            Error: INSTALLATION FAILED: context deadline exceeded

            you need to reinstall some components

            export KSERVE_VERSION=v0.15.2
            export deploymentMode=Serverless
            helm upgrade --namespace kserve kserve-crd oci://ghcr.io/kserve/charts/kserve-crd --version $KSERVE_VERSION
            helm upgrade --namespace kserve kserve oci://ghcr.io/kserve/charts/kserve --version $KSERVE_VERSION --set-string kserve.controller.deploymentMode="$deploymentMode"
            # helm upgrade knative-operator --namespace knative-serving  https://github.com/knative/operator/releases/download/knative-v1.15.7/knative-operator-v1.15.7.tgz

            Preliminary

            1. If you have only one node in your cluster, you need at least 6 CPUs, 6 GB of memory, and 30 GB of disk storage.


            2. If you have multiple nodes in your cluster, for each node you need at least 2 CPUs, 4 GB of memory, and 20 GB of disk storage.


            1.install knative serving CRD resources

            Details
            kubectl apply -f https://github.com/knative/serving/releases/download/knative-v1.18.0/serving-crds.yaml

            2.install knative serving components

            Details
            kubectl apply -f https://github.com/knative/serving/releases/download/knative-v1.18.0/serving-core.yaml
            # kubectl apply -f https://raw.githubusercontent.com/AaronYang0628/assets/refs/heads/main/knative/serving/release/download/knative-v1.18.0/serving-core.yaml

            3.install network layer Istio

            Details
            kubectl apply -l knative.dev/crd-install=true -f https://github.com/knative/net-istio/releases/download/knative-v1.18.0/istio.yaml
            kubectl apply -f https://github.com/knative/net-istio/releases/download/knative-v1.18.0/istio.yaml
            kubectl apply -f https://github.com/knative/net-istio/releases/download/knative-v1.18.0/net-istio.yaml
Expected Output

            Monitor the Knative components until all of the components show a STATUS of Running or Completed.

            kubectl get pods -n knative-serving
            
            #NAME                                      READY   STATUS    RESTARTS   AGE
            #3scale-kourier-control-54cc54cc58-mmdgq   1/1     Running   0          81s
            #activator-67656dcbbb-8mftq                1/1     Running   0          97s
            #autoscaler-df6856b64-5h4lc                1/1     Running   0          97s
            #controller-788796f49d-4x6pm               1/1     Running   0          97s
            #domain-mapping-65f58c79dc-9cw6d           1/1     Running   0          97s
            #domainmapping-webhook-cc646465c-jnwbz     1/1     Running   0          97s
            #webhook-859796bc7-8n5g2                   1/1     Running   0          96s
            Check Knative Hello World

            4.install cert manager

            Details
            kubectl apply -f https://github.com/cert-manager/cert-manager/releases/download/v1.17.2/cert-manager.yaml

            5.install kserve

            Details
            kubectl apply --server-side -f https://github.com/kserve/kserve/releases/download/v0.15.0/kserve.yaml
            kubectl apply --server-side -f https://github.com/kserve/kserve/releases/download/v0.15.0/kserve-cluster-resources.yaml
            Reference

            Preliminary

1. Kubernetes has been installed; if not, check 🔗link


2. ArgoCD has been installed; if not, check 🔗link


3. Helm binary has been installed; if not, check 🔗link


            1.install gateway API CRDs

            Details
            kubectl apply -f https://github.com/kubernetes-sigs/gateway-api/releases/download/v1.3.0/standard-install.yaml

            2.install cert manager

            Reference

            following 🔗link to install cert manager

            3.install istio system

            Reference

            following 🔗link to install three istio components (istio-base, istiod, istio-ingressgateway)

            4.install Knative Operator

            Details
            kubectl -n argocd apply -f - << EOF
            apiVersion: argoproj.io/v1alpha1
            kind: Application
            metadata:
              name: knative-operator
            spec:
              syncPolicy:
                syncOptions:
                - CreateNamespace=true
              project: default
              source:
                repoURL: https://knative.github.io/operator
                chart: knative-operator
                targetRevision: v1.18.1
                helm:
                  releaseName: knative-operator
                  values: |
                    knative_operator:
                      knative_operator:
                        image: m.daocloud.io/gcr.io/knative-releases/knative.dev/operator/cmd/operator
                        tag: v1.18.1
                        resources:
                          requests:
                            cpu: 100m
                            memory: 100Mi
                          limits:
                            cpu: 1000m
                            memory: 1000Mi
                      operator_webhook:
                        image: m.daocloud.io/gcr.io/knative-releases/knative.dev/operator/cmd/webhook
                        tag: v1.18.1
                        resources:
                          requests:
                            cpu: 100m
                            memory: 100Mi
                          limits:
                            cpu: 500m
                            memory: 500Mi
              destination:
                server: https://kubernetes.default.svc
                namespace: knative-serving
            EOF

            5.sync by argocd

            Details
            argocd app sync argocd/knative-operator
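Before creating the KnativeServing resource in the next step, it is worth confirming that the operator came up. A quick check (pod names will differ):

argocd app get argocd/knative-operator
# the operator and its webhook should be Running in the destination namespace
kubectl -n knative-serving get pods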

6.install knative serving

Details
kubectl apply -f - <<EOF
            apiVersion: operator.knative.dev/v1beta1
            kind: KnativeServing
            metadata:
              name: knative-serving
              namespace: knative-serving
            spec:
              version: 1.18.0 # this is knative serving version
              config:
                domain:
                  example.com: ""
            EOF
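The operator reconciles this resource asynchronously; a hedged way to wait for it is to watch the Ready condition on the KnativeServing object created above:

kubectl -n knative-serving get knativeserving knative-serving \
  -o jsonpath='{.status.conditions[?(@.type=="Ready")].status}'
# expect "True" once all serving components are up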

            7.install kserve CRD

            Details
            kubectl -n argocd apply -f - << EOF
            apiVersion: argoproj.io/v1alpha1
            kind: Application
            metadata:
              name: kserve-crd
              annotations:
                argocd.argoproj.io/sync-options: ServerSideApply=true
                argocd.argoproj.io/compare-options: IgnoreExtraneous
            spec:
              syncPolicy:
                syncOptions:
                - CreateNamespace=true
                - ServerSideApply=true
              project: default
              source:
                repoURL: https://aaronyang0628.github.io/helm-chart-mirror/charts
                chart: kserve-crd
                targetRevision: v0.15.2
                helm:
                  releaseName: kserve-crd 
              destination:
                server: https://kubernetes.default.svc
                namespace: kserve
            EOF
Expected Output
            knative-serving    activator-cbf5b6b55-7gw8s                                 Running        116s
            knative-serving    autoscaler-c5d454c88-nxrms                                Running        115s
            knative-serving    autoscaler-hpa-6c966695c6-9ld24                           Running        113s
            knative-serving    cleanup-serving-serving-1.18.0-45nhg                      Completed      113s
            knative-serving    controller-84f96b7676-jjqfp                               Running        115s
            knative-serving    net-istio-controller-574679cd5f-2sf4d                     Running        112s
            knative-serving    net-istio-webhook-85c99487db-mmq7n                        Running        111s
            knative-serving    storage-version-migration-serving-serving-1.18.0-k28vf    Completed      113s
            knative-serving    webhook-75d4fb6db5-qqcwz                                  Running        114s

            8.sync by argocd

            Details
            argocd app sync argocd/kserve-crd
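A quick sanity check that the sync actually registered the CRDs (names as shipped by the kserve-crd chart):

kubectl get crd | grep serving.kserve.io
# expect inferenceservices.serving.kserve.io, servingruntimes.serving.kserve.io, etc.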

            9.install kserve Controller

            Details
            kubectl -n argocd apply -f - << EOF
            apiVersion: argoproj.io/v1alpha1
            kind: Application
            metadata:
              name: kserve
              annotations:
                argocd.argoproj.io/sync-options: ServerSideApply=true
                argocd.argoproj.io/compare-options: IgnoreExtraneous
            spec:
              syncPolicy:
                syncOptions:
                - CreateNamespace=true
                - ServerSideApply=true
              project: default
              source:
                repoURL: https://aaronyang0628.github.io/helm-chart-mirror/charts
                chart: kserve
                targetRevision: v0.15.2
                helm:
                  releaseName: kserve
                  values: |
                    kserve:
                      agent:
                        image: m.daocloud.io/docker.io/kserve/agent
                      router:
                        image: m.daocloud.io/docker.io/kserve/router
                      storage:
                        image: m.daocloud.io/docker.io/kserve/storage-initializer
                        s3:
                          accessKeyIdName: AWS_ACCESS_KEY_ID
                          secretAccessKeyName: AWS_SECRET_ACCESS_KEY
                          endpoint: ""
                          region: ""
                          verifySSL: ""
                          useVirtualBucket: ""
                          useAnonymousCredential: ""
                      controller:
                        deploymentMode: "Serverless"
                        rbacProxyImage: m.daocloud.io/quay.io/brancz/kube-rbac-proxy:v0.18.0
                        rbacProxy:
                          resources:
                            limits:
                              cpu: 100m
                              memory: 300Mi
                            requests:
                              cpu: 100m
                              memory: 300Mi
                        gateway:
                          domain: example.com
                        image: m.daocloud.io/docker.io/kserve/kserve-controller
                        resources:
                          limits:
                            cpu: 100m
                            memory: 300Mi
                          requests:
                            cpu: 100m
                            memory: 300Mi
                      servingruntime:
                        tensorflow:
                          image: tensorflow/serving
                          tag: 2.6.2
                        mlserver:
                          image: m.daocloud.io/docker.io/seldonio/mlserver
                          tag: 1.5.0
                        sklearnserver:
                          image: m.daocloud.io/docker.io/kserve/sklearnserver
                        xgbserver:
                          image: m.daocloud.io/docker.io/kserve/xgbserver
                        huggingfaceserver:
                          image: m.daocloud.io/docker.io/kserve/huggingfaceserver
                          devShm:
                            enabled: false
                            sizeLimit: ""
                          hostIPC:
                            enabled: false
                        huggingfaceserver_multinode:
                          shm:
                            enabled: true
                            sizeLimit: "3Gi"
                        tritonserver:
                          image: nvcr.io/nvidia/tritonserver
                        pmmlserver:
                          image: m.daocloud.io/docker.io/kserve/pmmlserver
                        paddleserver:
                          image: m.daocloud.io/docker.io/kserve/paddleserver
                        lgbserver:
                          image: m.daocloud.io/docker.io/kserve/lgbserver
                        torchserve:
                          image: pytorch/torchserve-kfs
                          tag: 0.9.0
                        art:
                          image: m.daocloud.io/docker.io/kserve/art-explainer
                      localmodel:
                        enabled: false
                        controller:
                          image: m.daocloud.io/docker.io/kserve/kserve-localmodel-controller
                        jobNamespace: kserve-localmodel-jobs
                        agent:
                          hostPath: /mnt/models
                          image: m.daocloud.io/docker.io/kserve/kserve-localmodelnode-agent
                      inferenceservice:
                        resources:
                          limits:
                            cpu: "1"
                            memory: "2Gi"
                          requests:
                            cpu: "1"
                            memory: "2Gi"
              destination:
                server: https://kubernetes.default.svc
                namespace: kserve
            EOF
If you see a 'failed calling webhook …' error
Internal error occurred: failed calling webhook "clusterservingruntime.kserve-webhook-server.validator": failed to call webhook: Post "https://kserve-webhook-server-service.kserve.svc:443/validate-serving-kserve-io-v1alpha1-clusterservingruntime?timeout=10s": no endpoints available for service "kserve-webhook-server-service"

Just wait for a while and then resync; it will be fine.
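If it keeps failing, the usual cause is that the controller pod backing the webhook Service has no ready endpoints yet; you can confirm with the Service named in the error message:

kubectl -n kserve get endpoints kserve-webhook-server-service
# an empty ENDPOINTS column means the controller is still starting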

            10.sync by argocd

            Details
            argocd app sync argocd/kserve
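Once the sync finishes, the controller should be running (the deployment name below is the chart default and may differ in other versions):

kubectl -n kserve get pods
kubectl -n kserve get deployment kserve-controller-manager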

            11.install kserve eventing CRD

            Details
            kubectl apply -f https://github.com/knative/eventing/releases/download/knative-v1.18.1/eventing-crds.yaml

            12.install kserve eventing

            Details
            kubectl apply -f https://github.com/knative/eventing/releases/download/knative-v1.18.1/eventing-core.yaml
Expected Output
            knative-eventing   eventing-controller-cc45869cd-fmhg8        1/1     Running       0          3m33s
            knative-eventing   eventing-webhook-67fcc6959b-lktxd          1/1     Running       0          3m33s
            knative-eventing   job-sink-7f5d754db-tbf2z                   1/1     Running       0          3m33s


            Mar 7, 2024

            Subsections of Serving

            Subsections of Inference

            First Pytorch ISVC

            Mnist Inference

            More Information about mnist service can be found 🔗link

            1. create a namespace
            kubectl create namespace kserve-test
2. deploy a sample torchserve mnist service
            kubectl apply -n kserve-test -f - <<EOF
            apiVersion: "serving.kserve.io/v1beta1"
            kind: "InferenceService"
            metadata:
              name: "first-torchserve"
              namespace: kserve-test
            spec:
              predictor:
                model:
                  modelFormat:
                    name: pytorch
                  storageUri: gs://kfserving-examples/models/torchserve/image_classifier/v1
                  resources:
                    limits:
                      memory: 4Gi
            EOF
3. Check InferenceService status
            kubectl -n kserve-test get inferenceservices first-torchserve 
Expected Output
            kubectl -n kserve-test get pod
            #NAME                                           READY   STATUS    RESTARTS   AGE
            #first-torchserve-predictor-00001-deplo...      2/2     Running   0          25s
            
            kubectl -n kserve-test get inferenceservices first-torchserve
            #NAME           URL   READY     PREV   LATEST   PREVROLLEDOUTREVISION   LATESTREADYREVISION   AGE
#first-torchserve   http://first-torchserve.kserve-test.example.com   True           100                              first-torchserve-predictor-00001   2m59s

            After all pods are ready, you can access the service by using the following command

            Access By

            If the EXTERNAL-IP value is set, your environment has an external load balancer that you can use for the ingress gateway.

            export INGRESS_HOST=$(kubectl -n istio-system get service istio-ingressgateway -o jsonpath='{.status.loadBalancer.ingress[0].ip}')
            export INGRESS_PORT=$(kubectl -n istio-system get service istio-ingressgateway -o jsonpath='{.spec.ports[?(@.name=="http2")].nodePort}')

            If the EXTERNAL-IP value is none (or perpetually pending), your environment does not provide an external load balancer for the ingress gateway. In this case, you can access the gateway using the service’s node port.

            export INGRESS_HOST=$(minikube ip)
            export INGRESS_PORT=$(kubectl -n istio-system get service istio-ingressgateway -o jsonpath='{.spec.ports[?(@.name=="http2")].nodePort}')
            export INGRESS_HOST=$(minikube ip)
            kubectl port-forward --namespace istio-system svc/istio-ingressgateway 30080:80
            export INGRESS_PORT=30080
4. Perform a prediction. First, prepare your inference input request inside a file:
            wget -O ./mnist-input.json https://raw.githubusercontent.com/kserve/kserve/refs/heads/master/docs/samples/v1beta1/torchserve/v1/imgconv/input.json
Remember to forward the port if using minikube
            ssh -i ~/.minikube/machines/minikube/id_rsa docker@$(minikube ip) -L "*:${INGRESS_PORT}:0.0.0.0:${INGRESS_PORT}" -N -f
5. Invoke the service
            SERVICE_HOSTNAME=$(kubectl -n kserve-test get inferenceservice first-torchserve  -o jsonpath='{.status.url}' | cut -d "/" -f 3)
            # http://first-torchserve.kserve-test.example.com 
            curl -v -H "Host: ${SERVICE_HOSTNAME}" -H "Content-Type: application/json" "http://${INGRESS_HOST}:${INGRESS_PORT}/v1/models/mnist:predict" -d @./mnist-input.json
Expected Output
            *   Trying 192.168.58.2...
            * TCP_NODELAY set
            * Connected to 192.168.58.2 (192.168.58.2) port 32132 (#0)
            > POST /v1/models/mnist:predict HTTP/1.1
            > Host: my-torchserve.kserve-test.example.com
            > User-Agent: curl/7.61.1
            > Accept: */*
            > Content-Type: application/json
            > Content-Length: 401
            > 
            * upload completely sent off: 401 out of 401 bytes
            < HTTP/1.1 200 OK
            < content-length: 19
            < content-type: application/json
            < date: Mon, 09 Jun 2025 09:27:27 GMT
            < server: istio-envoy
            < x-envoy-upstream-service-time: 1128
            < 
            * Connection #0 to host 192.168.58.2 left intact
            {"predictions":[2]}
            Mar 7, 2024

            First Custom Model

            AlexNet Inference

            More Information about AlexNet service can be found 🔗link

            1. Implement Custom Model using KServe API
import argparse
import base64
import io
import time

from fastapi.middleware.cors import CORSMiddleware
from torchvision import models, transforms
from typing import Dict
import torch
from PIL import Image

import kserve
from kserve import Model, ModelServer, logging
from kserve.model_server import app
from kserve.utils.utils import generate_uuid


class AlexNetModel(Model):
    def __init__(self, name: str):
        super().__init__(name, return_response_headers=True)
        self.name = name
        self.load()
        self.ready = False

    def load(self):
        self.model = models.alexnet(pretrained=True)
        self.model.eval()
        # The ready flag is used by model ready endpoint for readiness probes,
        # set to True when model is loaded successfully without exceptions.
        self.ready = True

    async def predict(
        self,
        payload: Dict,
        headers: Dict[str, str] = None,
        response_headers: Dict[str, str] = None,
    ) -> Dict:
        start = time.time()
        # Input follows the Tensorflow V1 HTTP API for binary values
        # https://www.tensorflow.org/tfx/serving/api_rest#encoding_binary_values
        img_data = payload["instances"][0]["image"]["b64"]
        raw_img_data = base64.b64decode(img_data)
        input_image = Image.open(io.BytesIO(raw_img_data))
        preprocess = transforms.Compose([
            transforms.Resize(256),
            transforms.CenterCrop(224),
            transforms.ToTensor(),
            transforms.Normalize(mean=[0.485, 0.456, 0.406],
                                 std=[0.229, 0.224, 0.225]),
        ])
        input_tensor = preprocess(input_image).unsqueeze(0)
        output = self.model(input_tensor)
        torch.nn.functional.softmax(output, dim=1)
        values, top_5 = torch.topk(output, 5)
        result = values.flatten().tolist()
        end = time.time()
        response_id = generate_uuid()

        # Custom response headers can be added to the inference response
        if response_headers is not None:
            response_headers.update(
                {"prediction-time-latency": f"{round((end - start) * 1000, 9)}"}
            )

        return {"predictions": result}


parser = argparse.ArgumentParser(parents=[kserve.model_server.parser])
args, _ = parser.parse_known_args()

if __name__ == "__main__":
    # Configure kserve and uvicorn logger
    if args.configure_logging:
        logging.configure_logging(args.log_config_file)
    model = AlexNetModel(args.model_name)
    model.load()
    # Custom middlewares can be added to the model
    app.add_middleware(
        CORSMiddleware,
        allow_origins=["*"],
        allow_credentials=True,
        allow_methods=["*"],
        allow_headers=["*"],
    )
    ModelServer().start([model])
2. create requirements.txt
            kserve
            torchvision==0.18.0
            pillow>=10.3.0,<11.0.0
3. create Dockerfile
            FROM m.daocloud.io/docker.io/library/python:3.11-slim
            
            WORKDIR /app
            
            COPY requirements.txt .
            RUN pip install --no-cache-dir  -r requirements.txt 
            
            COPY model.py .
            
            CMD ["python", "model.py", "--model_name=custom-model"]
4. build and push the custom docker image
            docker build -t ay-custom-model .
            docker tag ddfd0186813e docker-registry.lab.zverse.space/ay/ay-custom-model:latest
            docker push docker-registry.lab.zverse.space/ay/ay-custom-model:latest
5. create a namespace
            kubectl create namespace kserve-test
6. deploy a sample custom-model service
            kubectl apply -n kserve-test -f - <<EOF
            apiVersion: serving.kserve.io/v1beta1
            kind: InferenceService
            metadata:
              name: ay-custom-model
            spec:
              predictor:
                containers:
                  - name: kserve-container
                    image: docker-registry.lab.zverse.space/ay/ay-custom-model:latest
            EOF
7. Check InferenceService status
            kubectl -n kserve-test get inferenceservices ay-custom-model
Expected Output
            kubectl -n kserve-test get pod
            #NAME                                           READY   STATUS    RESTARTS   AGE
            #ay-custom-model-predictor-00003-dcf4rk         2/2     Running   0        167m
            
            kubectl -n kserve-test get inferenceservices ay-custom-model
            #NAME           URL   READY     PREV   LATEST   PREVROLLEDOUTREVISION   LATESTREADYREVISION   AGE
            #ay-custom-model   http://ay-custom-model.kserve-test.example.com   True           100                              ay-custom-model-predictor-00003   177m

            After all pods are ready, you can access the service by using the following command

            Access By

            If the EXTERNAL-IP value is set, your environment has an external load balancer that you can use for the ingress gateway.

            export INGRESS_HOST=$(kubectl -n istio-system get service istio-ingressgateway -o jsonpath='{.status.loadBalancer.ingress[0].ip}')
            export INGRESS_PORT=$(kubectl -n istio-system get service istio-ingressgateway -o jsonpath='{.spec.ports[?(@.name=="http2")].nodePort}')

            If the EXTERNAL-IP value is none (or perpetually pending), your environment does not provide an external load balancer for the ingress gateway. In this case, you can access the gateway using the service’s node port.

            export INGRESS_HOST=$(minikube ip)
            export INGRESS_PORT=$(kubectl -n istio-system get service istio-ingressgateway -o jsonpath='{.spec.ports[?(@.name=="http2")].nodePort}')
            export INGRESS_HOST=$(minikube ip)
            kubectl port-forward --namespace istio-system svc/istio-ingressgateway 30080:80
            export INGRESS_PORT=30080
8. Perform a prediction

            First, prepare your inference input request inside a file:

            wget -O ./alex-net-input.json https://kserve.github.io/website/0.15/modelserving/v1beta1/custom/custom_model/input.json
Remember to forward the port if using minikube
            ssh -i ~/.minikube/machines/minikube/id_rsa docker@$(minikube ip) -L "*:${INGRESS_PORT}:0.0.0.0:${INGRESS_PORT}" -N -f
9. Invoke the service
            export SERVICE_HOSTNAME=$(kubectl -n kserve-test get inferenceservice ay-custom-model  -o jsonpath='{.status.url}' | cut -d "/" -f 3)
            # http://ay-custom-model.kserve-test.example.com
curl -v -H "Host: ${SERVICE_HOSTNAME}" -H "Content-Type: application/json" -X POST "http://${INGRESS_HOST}:${INGRESS_PORT}/v1/models/custom-model:predict" -d @./alex-net-input.json
Expected Output
            *   Trying 192.168.58.2:30704...
            * Connected to 192.168.58.2 (192.168.58.2) port 30704
            > POST /v1/models/custom-model:predict HTTP/1.1
            > Host: ay-custom-model.kserve-test.example.com
            > User-Agent: curl/8.5.0
            > Accept: */*
            > Content-Type: application/json
            > Content-Length: 105339
            > 
            * We are completely uploaded and fine
            < HTTP/1.1 200 OK
            < content-length: 110
            < content-type: application/json
            < date: Wed, 11 Jun 2025 03:38:30 GMT
            < prediction-time-latency: 89.966773987
            < server: istio-envoy
            < x-envoy-upstream-service-time: 93
            < 
            * Connection #0 to host 192.168.58.2 left intact
            {"predictions":[14.975619316101074,14.0368070602417,13.966034889221191,12.252280235290527,12.086270332336426]}
            Mar 7, 2024

            First Model In Minio

            Inference Model In Minio

            More Information about Deploy InferenceService with a saved model on S3 can be found 🔗link

            Create Service Account


            apiVersion: v1
            kind: ServiceAccount
            metadata:
              name: sa
              annotations:
                eks.amazonaws.com/role-arn: arn:aws:iam::123456789012:role/s3access # replace with your IAM role ARN
                serving.kserve.io/s3-endpoint: s3.amazonaws.com # replace with your s3 endpoint e.g minio-service.kubeflow:9000
                serving.kserve.io/s3-usehttps: "1" # by default 1, if testing with minio you can set to 0
                serving.kserve.io/s3-region: "us-east-2"
                serving.kserve.io/s3-useanoncredential: "false" # omitting this is the same as false, if true will ignore provided credential and use anonymous credentials


            kubectl apply -f create-s3-sa.yaml

            Create S3 Secret and attach to Service Account

            Create a secret with your S3 user credential, KServe reads the secret annotations to inject the S3 environment variables on storage initializer or model agent to download the models from S3 storage.

            Create S3 secret


            apiVersion: v1
            kind: Secret
            metadata:
              name: s3creds
              annotations:
                 serving.kserve.io/s3-endpoint: s3.amazonaws.com # replace with your s3 endpoint e.g minio-service.kubeflow:9000
                 serving.kserve.io/s3-usehttps: "1" # by default 1, if testing with minio you can set to 0
                 serving.kserve.io/s3-region: "us-east-2"
                 serving.kserve.io/s3-useanoncredential: "false" # omitting this is the same as false, if true will ignore provided credential and use anonymous credentials
            type: Opaque
            stringData: # use `stringData` for raw credential string or `data` for base64 encoded string
              AWS_ACCESS_KEY_ID: XXXX
              AWS_SECRET_ACCESS_KEY: XXXXXXXX

            Attach secret to a service account


            apiVersion: v1
            kind: ServiceAccount
            metadata:
              name: sa
            secrets:
            - name: s3creds


            kubectl apply -f create-s3-secret.yaml

Note: If you are running KServe with istio sidecars enabled, there can be a race condition between the istio proxy becoming ready and the agent pulling models. This results in a "tcp dial connection refused" error when the agent tries to download from S3.

To resolve it, istio allows blocking the other containers in a pod until the proxy container is ready.

You can enable this by setting `proxy.holdApplicationUntilProxyStarts: true` in the `istio-sidecar-injector` configmap. The `proxy.holdApplicationUntilProxyStarts` flag was introduced in Istio 1.7 as an experimental feature and is turned off by default.
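A hedged sketch of the mesh-wide setting: the flag lives under meshConfig.defaultConfig in Istio's configuration, so with the istiod helm chart it goes into the chart's meshConfig values (the exact values file layout depends on how istio was installed):

meshConfig:
  defaultConfig:
    holdApplicationUntilProxyStarts: true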
            

            Deploy the model on S3 with InferenceService

            Create the InferenceService with the s3 storageUri and the service account with s3 credential attached.

New Schema

apiVersion: "serving.kserve.io/v1beta1"
kind: "InferenceService"
metadata:
  name: "mnist-s3"
spec:
  predictor:
    serviceAccountName: sa
    model:
      modelFormat:
        name: tensorflow
      storageUri: "s3://kserve-examples/mnist"

Old Schema

apiVersion: "serving.kserve.io/v1beta1"
kind: "InferenceService"
metadata:
  name: "mnist-s3"
spec:
  predictor:
    serviceAccountName: sa
    tensorflow:
      storageUri: "s3://kserve-examples/mnist"

Save the manifest as mnist-s3.yaml and apply it:

kubectl apply -f mnist-s3.yaml

            Run a prediction

            Now, the ingress can be accessed at ${INGRESS_HOST}:${INGRESS_PORT} or follow this instruction to find out the ingress IP and port.

            SERVICE_HOSTNAME=$(kubectl get inferenceservice mnist-s3 -o jsonpath='{.status.url}' | cut -d "/" -f 3)
            
            MODEL_NAME=mnist-s3
            INPUT_PATH=@./input.json
            curl -v -H "Host: ${SERVICE_HOSTNAME}" -H "Content-Type: application/json" http://${INGRESS_HOST}:${INGRESS_PORT}/v1/models/$MODEL_NAME:predict -d $INPUT_PATH

Expected Output
            Note: Unnecessary use of -X or --request, POST is already inferred.
            *   Trying 35.237.217.209...
            * TCP_NODELAY set
            * Connected to mnist-s3.default.35.237.217.209.xip.io (35.237.217.209) port 80 (#0)
            > POST /v1/models/mnist-s3:predict HTTP/1.1
            > Host: mnist-s3.default.35.237.217.209.xip.io
            > User-Agent: curl/7.55.1
            > Accept: */*
            > Content-Length: 2052
            > Content-Type: application/x-www-form-urlencoded
            > Expect: 100-continue
            >
            < HTTP/1.1 100 Continue
            * We are completely uploaded and fine
            < HTTP/1.1 200 OK
            < content-length: 251
            < content-type: application/json
            < date: Sun, 04 Apr 2021 20:06:27 GMT
            < x-envoy-upstream-service-time: 5
            < server: istio-envoy
            <
            * Connection #0 to host mnist-s3.default.35.237.217.209.xip.io left intact
            {
                "predictions": [
                    {
                        "predictions": [0.327352405, 2.00153053e-07, 0.0113353515, 0.203903764, 3.62863029e-05, 0.416683704, 0.000281196437, 8.36911859e-05, 0.0403052084, 1.82206513e-05],
                        "classes": 5
                    }
                ]
            }
            Mar 7, 2024

            Kafka Sink Transformer

            AlexNet Inference

            More Information about Custom Transformer service can be found 🔗link

1. Implement the custom transformer ./model.py using the KServe API
import os
import argparse
import json

from typing import Dict, Union
from kafka import KafkaProducer
from cloudevents.http import CloudEvent
from cloudevents.conversion import to_structured

from kserve import (
    Model,
    ModelServer,
    model_server,
    logging,
    InferRequest,
    InferResponse,
)

from kserve.logging import logger
from kserve.utils.utils import generate_uuid

kafka_producer = KafkaProducer(
    value_serializer=lambda v: json.dumps(v).encode('utf-8'),
    bootstrap_servers=os.environ.get('KAFKA_BOOTSTRAP_SERVERS', 'localhost:9092')
)

class ImageTransformer(Model):
    def __init__(self, name: str):
        super().__init__(name, return_response_headers=True)
        self.ready = True


    def preprocess(
        self, payload: Union[Dict, InferRequest], headers: Dict[str, str] = None
    ) -> Union[Dict, InferRequest]:
        logger.info("Received inputs %s", payload)
        logger.info("Received headers %s", headers)
        self.request_trace_key = os.environ.get('REQUEST_TRACE_KEY', 'algo.trace.requestId')
        if self.request_trace_key not in payload:
            logger.error("Request trace key '%s' not found in payload, you cannot trace the prediction result", self.request_trace_key)
            if "instances" not in payload:
                raise ValueError(
                    f"Request trace key '{self.request_trace_key}' not found in payload and 'instances' key is missing."
                )
        else:
            headers[self.request_trace_key] = payload.get(self.request_trace_key)

        return {"instances": payload["instances"]}

    def postprocess(
        self,
        infer_response: Union[Dict, InferResponse],
        headers: Dict[str, str] = None,
        response_headers: Dict[str, str] = None,
    ) -> Union[Dict, InferResponse]:
        logger.info("postprocess headers: %s", headers)
        logger.info("postprocess response headers: %s", response_headers)
        logger.info("postprocess response: %s", infer_response)

        attributes = {
            "source": "data-and-computing/kafka-sink-transformer",
            "type": "org.zhejianglab.zverse.data-and-computing.kafka-sink-transformer",
            "request-host": headers.get('host', 'unknown'),
            "kserve-isvc-name": headers.get('kserve-isvc-name', 'unknown'),
            "kserve-isvc-namespace": headers.get('kserve-isvc-namespace', 'unknown'),
            self.request_trace_key: headers.get(self.request_trace_key, 'unknown'),
        }

        _, cloudevent = to_structured(CloudEvent(attributes, infer_response))
        try:
            kafka_producer.send(os.environ.get('KAFKA_TOPIC', 'test-topic'), value=cloudevent.decode('utf-8').replace("'", '"'))
            kafka_producer.flush()
        except Exception as e:
            logger.error("Failed to send message to Kafka: %s", e)
        return infer_response

parser = argparse.ArgumentParser(parents=[model_server.parser])
args, _ = parser.parse_known_args()

if __name__ == "__main__":
    if args.configure_logging:
        logging.configure_logging(args.log_config_file)
    logging.logger.info("available model name: %s", args.model_name)
    logging.logger.info("all args: %s", args.model_name)
    model = ImageTransformer(args.model_name)
    ModelServer().start([model])
2. modify ./pyproject.toml
            [tool.poetry]
            name = "custom_transformer"
            version = "0.15.2"
            description = "Custom Transformer Examples. Not intended for use outside KServe Frameworks Images."
            authors = ["Dan Sun <dsun20@bloomberg.net>"]
            license = "Apache-2.0"
            packages = [
                { include = "*.py" }
            ]
            
            [tool.poetry.dependencies]
            python = ">=3.9,<3.13"
            kserve = {path = "../kserve", develop = true}
            pillow = "^10.3.0"
            kafka-python = "^2.2.15"
            cloudevents = "^1.11.1"
            
            [[tool.poetry.source]]
            name = "pytorch"
            url = "https://download.pytorch.org/whl/cpu"
            priority = "explicit"
            
            [tool.poetry.group.test]
            optional = true
            
            [tool.poetry.group.test.dependencies]
            pytest = "^7.4.4"
            mypy = "^0.991"
            
            [tool.poetry.group.dev]
            optional = true
            
            [tool.poetry.group.dev.dependencies]
            black = { version = "~24.3.0", extras = ["colorama"] }
            
            [tool.poetry-version-plugin]
            source = "file"
            file_path = "../VERSION"
            
            [build-system]
            requires = ["poetry-core>=1.0.0"]
            build-backend = "poetry.core.masonry.api"
3. prepare ../custom_transformer.Dockerfile
            ARG PYTHON_VERSION=3.11
            ARG BASE_IMAGE=python:${PYTHON_VERSION}-slim-bookworm
            ARG VENV_PATH=/prod_venv
            
            FROM ${BASE_IMAGE} AS builder
            
            # Install Poetry
            ARG POETRY_HOME=/opt/poetry
            ARG POETRY_VERSION=1.8.3
            
            RUN python3 -m venv ${POETRY_HOME} && ${POETRY_HOME}/bin/pip install poetry==${POETRY_VERSION}
            ENV PATH="$PATH:${POETRY_HOME}/bin"
            
            # Activate virtual env
            ARG VENV_PATH
            ENV VIRTUAL_ENV=${VENV_PATH}
            RUN python3 -m venv $VIRTUAL_ENV
            ENV PATH="$VIRTUAL_ENV/bin:$PATH"
            
            COPY kserve/pyproject.toml kserve/poetry.lock kserve/
            RUN cd kserve && poetry install --no-root --no-interaction --no-cache
            COPY kserve kserve
            RUN cd kserve && poetry install --no-interaction --no-cache
            
            COPY custom_transformer/pyproject.toml custom_transformer/poetry.lock custom_transformer/
            RUN cd custom_transformer && poetry install --no-root --no-interaction --no-cache
            COPY custom_transformer custom_transformer
            RUN cd custom_transformer && poetry install --no-interaction --no-cache
            
            
            FROM ${BASE_IMAGE} AS prod
            
            COPY third_party third_party
            
            # Activate virtual env
            ARG VENV_PATH
            ENV VIRTUAL_ENV=${VENV_PATH}
            ENV PATH="$VIRTUAL_ENV/bin:$PATH"
            
            RUN useradd kserve -m -u 1000 -d /home/kserve
            
            COPY --from=builder --chown=kserve:kserve $VIRTUAL_ENV $VIRTUAL_ENV
            COPY --from=builder kserve kserve
            COPY --from=builder custom_transformer custom_transformer
            
            USER 1000
            ENTRYPOINT ["python", "-m", "custom_transformer.model"]
4. regenerate poetry.lock
            poetry lock --no-update
5. build and push the custom docker image
            cd python
            podman build -t docker-registry.lab.zverse.space/data-and-computing/ay-dev/msg-transformer:dev9 -f custom_transformer.Dockerfile .
            
            podman push docker-registry.lab.zverse.space/data-and-computing/ay-dev/msg-transformer:dev9
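The pushed image can then be wired in front of a predictor as the transformer of an InferenceService. Below is a minimal sketch under stated assumptions: the mnist torchserve predictor from the earlier example, a reachable Kafka bootstrap address, and the image tag built above; the env names match what model.py reads, and the args/model name are placeholders to adjust.

kubectl apply -n kserve-test -f - <<EOF
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: mnist-with-kafka-sink
spec:
  predictor:
    model:
      modelFormat:
        name: pytorch
      storageUri: gs://kfserving-examples/models/torchserve/image_classifier/v1
  transformer:
    containers:
      - name: kserve-container
        image: docker-registry.lab.zverse.space/data-and-computing/ay-dev/msg-transformer:dev9
        args: ["--model_name=mnist"]   # must match the model name the client calls
        env:
          - name: KAFKA_BOOTSTRAP_SERVERS
            value: "my-kafka.kafka.svc.cluster.local:9092"   # assumption: point to your Kafka
          - name: KAFKA_TOPIC
            value: "test-topic"
          - name: REQUEST_TRACE_KEY
            value: "algo.trace.requestId"
EOF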
            Mar 7, 2024

            Subsections of Generative

            First Generative Service

The KServe inference service builds on Knative Serving (autoscaling and canary releases), Istio (traffic management and security), and a storage system (S3 / GCS / PVC).

Deploy an inference service with a single YAML

            apiVersion: "serving.kserve.io/v1beta1"
            kind: "InferenceService"
            metadata:
              name: "sklearn-iris"
              namespace: kserve-test
            spec:
              predictor:
                model:
                  modelFormat:
                    name: sklearn
                  resources: {}
                  storageUri: "gs://kfserving-examples/models/sklearn/1.0/model"

            check CRD

            kubectl -n kserve-test get inferenceservices sklearn-iris 
            kubectl -n istio-system get svc istio-ingressgateway 
            export INGRESS_HOST=$(minikube ip)
            export INGRESS_PORT=$(kubectl -n istio-system get service istio-ingressgateway -o jsonpath='{.spec.ports[?(@.name=="http2")].nodePort}')
            SERVICE_HOSTNAME=$(kubectl -n kserve-test get inferenceservice sklearn-iris  -o jsonpath='{.status.url}' | cut -d "/" -f 3)
            # http://sklearn-iris.kserve-test.example.com 
            curl -v -H "Host: ${SERVICE_HOSTNAME}" -H "Content-Type: application/json" "http://${INGRESS_HOST}:${INGRESS_PORT}/v1/models/sklearn-iris:predict" -d @./iris-input.json
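The iris-input.json referenced above is the standard two-row payload from the KServe sklearn example; if you do not have it yet, it can be created like this:

cat <<EOF > ./iris-input.json
{
  "instances": [
    [6.8, 2.8, 4.8, 1.4],
    [6.0, 3.4, 4.5, 1.6]
  ]
}
EOF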

            How to deploy your own ML model

            apiVersion: serving.kserve.io/v1beta1
            kind: InferenceService
            metadata:
              name: huggingface-llama3
              namespace: kserve-test
              annotations:
                serving.kserve.io/deploymentMode: RawDeployment
                serving.kserve.io/autoscalerClass: none
            spec:
              predictor:
                model:
                  modelFormat:
                    name: huggingface
                  storageUri: pvc://llama-3-8b-pvc/hf/8b_instruction_tuned
                workerSpec:
                  pipelineParallelSize: 2
                  tensorParallelSize: 1
      containers:
        - name: worker-container
          resources:
            requests:
              nvidia.com/gpu: "8"

            check https://kserve.github.io/website/0.15/modelserving/v1beta1/llm/huggingface/multi-node/#workerspec-and-servingruntime

            Mar 7, 2024

            Canary Policy

KServe supports canary rollouts for inference services. Canary rollouts allow a new version of an InferenceService to receive a percentage of traffic. KServe supports a configurable canary rollout strategy with multiple steps. The rollout strategy can also be implemented to roll back to the previous revision if a rollout step fails.

            KServe automatically tracks the last good revision that was rolled out with 100% traffic. The canaryTrafficPercent field in the component’s spec needs to be set with the percentage of traffic that should be routed to the new revision. KServe will then automatically split the traffic between the last good revision and the revision that is currently being rolled out according to the canaryTrafficPercent value.

When the first revision of an InferenceService is deployed, it receives 100% of the traffic. When multiple revisions are deployed, as in step 2, and the canary rollout strategy is configured to route 10% of the traffic to the new revision, 90% of the traffic goes to the LatestRolledoutRevision. If an unhealthy or bad revision is applied, traffic is not routed to that bad revision. In step 3, the rollout strategy promotes the LatestReadyRevision from step 2 to the LatestRolledoutRevision. Since it is now promoted, the LatestRolledoutRevision gets 100% of the traffic and is fully rolled out. If a rollback needs to happen, 100% of the traffic is pinned to the previous healthy/good revision, the PreviousRolledoutRevision.
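As a minimal illustration (the model format and storageUri are placeholders), the only change needed on the component spec is the canaryTrafficPercent field next to the updated model:

spec:
  predictor:
    canaryTrafficPercent: 10    # 10% to the latest ready revision, 90% stays on the last rolled-out revision
    model:
      modelFormat:
        name: sklearn
      storageUri: "gs://kfserving-examples/models/sklearn/1.0/model-2"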

Figures: Canary Rollout Strategy, Steps 1-2 and Step 3.

            Reference

            For more information, see Canary Rollout.

            Mar 7, 2024

            Subsections of Canary Policy

            Rollout Example

            Create the InferenceService

            Follow the First Inference Service tutorial. Set up a namespace kserve-test and create an InferenceService.

            After rolling out the first model, 100% traffic goes to the initial model with service revision 1.

            kubectl -n kserve-test get isvc sklearn-iris
Expected Output
            NAME       URL              READY   PREV   LATEST   PREVROLLEDOUTREVISION   LATESTREADYREVISION                AGE
sklearn-iris   http://sklearn-iris.kserve-test.example.com   True           100                              sklearn-iris-predictor-00001   46s

            Apply Canary Rollout Strategy

            • Add the canaryTrafficPercent field to the predictor component
            • Update the storageUri to use a new/updated model.
            kubectl apply -n kserve-test -f - <<EOF
            apiVersion: "serving.kserve.io/v1beta1"
            kind: "InferenceService"
            metadata:
              name: "sklearn-iris"
              namespace: kserve-test
            spec:
              predictor:
                canaryTrafficPercent: 10
                model:
                  args: ["--enable_docs_url=True"]
                  modelFormat:
                    name: sklearn
                  resources: {}
                  runtime: kserve-sklearnserver
                  storageUri: "gs://kfserving-examples/models/sklearn/1.0/model-2"
            EOF

            After rolling out the canary model, traffic is split between the latest ready revision 2 and the previously rolled out revision 1.

            kubectl -n kserve-test get isvc sklearn-iris
Expected Output
            NAME       URL              READY   PREV   LATEST   PREVROLLEDOUTREVISION   LATESTREADYREVISION                AGE
            sklearn-iris   http://sklearn-iris.kserve-test.example.com   True    90     10       sklearn-iris-predictor-00002   sklearn-iris-predictor-00003   19h

Check the running pods; you should now see two pods running, one for the old model and one for the new model, with 10% of the traffic routed to the new model. Notice revision 1 contains 0002 in its name, while revision 2 contains 0003.

            kubectl get pods 
            
            NAME                                                        READY   STATUS    RESTARTS   AGE
            sklearn-iris-predictor-00002-deployment-c7bb6c685-ktk7r     2/2     Running   0          71m
            sklearn-iris-predictor-00003-deployment-8498d947-fpzcg      2/2     Running   0          20m

            Run a prediction

            Follow the next two steps (Determine the ingress IP and ports and Perform inference) in the First Inference Service tutorial.

            Send more requests to the InferenceService to observe the 10% of traffic that routes to the new revision.

            Promote the canary model

            If the canary model is healthy/passes your tests,

            you can promote it by removing the canaryTrafficPercent field and re-applying the InferenceService custom resource with the same name sklearn-iris

            kubectl apply -n kserve-test -f - <<EOF
            apiVersion: "serving.kserve.io/v1beta1"
            kind: "InferenceService"
            metadata:
              name: "sklearn-iris"
              namespace: kserve-test
            spec:
              predictor:
                model:
                  args: ["--enable_docs_url=True"]
                  modelFormat:
                    name: sklearn
                  resources: {}
                  runtime: kserve-sklearnserver
                  storageUri: "gs://kfserving-examples/models/sklearn/1.0/model-2"
            EOF

            Now all traffic goes to the revision 2 for the new model.

            kubectl get isvc sklearn-iris
            NAME       URL                                   READY   PREV   LATEST   PREVROLLEDOUTREVISION   LATESTREADYREVISION                AGE
            sklearn-iris   http://sklearn-iris.kserve-test.example.com   True           100                              sklearn-iris-predictor-00002   17m

The pods for revision generation 1 automatically scale down to 0 as they are no longer receiving traffic.

            kubectl get pods -l serving.kserve.io/inferenceservice=sklearn-iris
            NAME                                                           READY   STATUS        RESTARTS   AGE
            sklearn-iris-predictor-00001-deployment-66c5f5b8d5-gmfvj   1/2     Terminating   0          17m
            sklearn-iris-predictor-00002-deployment-5bd9ff46f8-shtzd   2/2     Running       0          15m

            Rollback and pin the previous model

            You can pin the previous model (model v1, for example) by setting the canaryTrafficPercent to 0 for the current model (model v2, for example). This rolls back from model v2 to model v1 and decreases model v2’s traffic to zero.

            Apply the custom resource to set model v2’s traffic to 0%.

            kubectl apply -n kserve-test -f - <<EOF
            apiVersion: "serving.kserve.io/v1beta1"
            kind: "InferenceService"
            metadata:
              name: "sklearn-iris"
            spec:
              predictor:
                canaryTrafficPercent: 0
                model:
                  modelFormat:
                    name: sklearn
                  storageUri: "gs://kfserving-examples/models/sklearn/1.0/model-2"
            EOF

            Check the traffic split, now 100% traffic goes to the previous good model (model v1) for revision generation 1.

            kubectl get isvc sklearn-iris
            NAME       URL                                   READY   PREV   LATEST   PREVROLLEDOUTREVISION              LATESTREADYREVISION                AGE
            sklearn-iris   http://sklearn-iris.kserve-test.example.com   True    100    0        sklearn-iris-predictor-00002   sklearn-iris-predictor-00003   18m

The previous revision (model v1) now routes 100% of the traffic to its pods, while the new model (model v2) routes 0% of the traffic to its pods.

            kubectl get pods -l serving.kserve.io/inferenceservice=sklearn-iris
            
            NAME                                                       READY   STATUS        RESTARTS   AGE
            sklearn-iris-predictor-00002-deployment-66c5f5b8d5-gmfvj   1/2     Running       0          35s
            sklearn-iris-predictor-00003-deployment-5bd9ff46f8-shtzd   2/2     Running       0          16m

            Route traffic using a tag

            You can enable tag based routing by adding the annotation serving.kserve.io/enable-tag-routing, so traffic can be explicitly routed to the canary model (model v2) or the old model (model v1) via a tag in the request URL.

            Apply model v2 with canaryTrafficPercent: 10 and serving.kserve.io/enable-tag-routing: "true".

            kubectl apply -n kserve-test -f - <<EOF
            apiVersion: "serving.kserve.io/v1beta1"
            kind: "InferenceService"
            metadata:
              name: "sklearn-iris"
              annotations:
                serving.kserve.io/enable-tag-routing: "true"
            spec:
              predictor:
                canaryTrafficPercent: 10
                model:
                  modelFormat:
                    name: sklearn
                  storageUri: "gs://kfserving-examples/models/sklearn/1.0/model-2"
            EOF

            Check the InferenceService status to get the canary and previous model URL.

            kubectl get isvc sklearn-iris -ojsonpath="{.status.components.predictor}"  | jq

            The output should look like

Expected Output
            {
                "address": {
                "url": "http://sklearn-iris-predictor-.kserve-test.svc.cluster.local"
                },
                "latestCreatedRevision": "sklearn-iris-predictor--00003",
                "latestReadyRevision": "sklearn-iris-predictor--00003",
                "latestRolledoutRevision": "sklearn-iris-predictor--00001",
                "previousRolledoutRevision": "sklearn-iris-predictor--00001",
                "traffic": [
                {
                    "latestRevision": true,
                    "percent": 10,
                    "revisionName": "sklearn-iris-predictor--00003",
                    "tag": "latest",
                    "url": "http://latest-sklearn-iris-predictor-.kserve-test.example.com"
                },
                {
                    "latestRevision": false,
                    "percent": 90,
                    "revisionName": "sklearn-iris-predictor--00001",
                    "tag": "prev",
                    "url": "http://prev-sklearn-iris-predictor-.kserve-test.example.com"
                }
                ],
                "url": "http://sklearn-iris-predictor-.kserve-test.example.com"
            }

            Since we updated the annotation on the InferenceService, model v2 now corresponds to sklearn-iris-predictor--00003.

            You can now send the request explicitly to the new model or the previous model by using the tag in the request URL. Use the curl command from Perform inference and add latest- or prev- to the model name to send a tag based request.

            For example, set the model name and use the following commands to send traffic to each service based on the latest or prev tag.

            curl the latest revision

            MODEL_NAME=sklearn-iris
            curl -v -H "Host: latest-${MODEL_NAME}-predictor-.kserve-test.example.com" -H "Content-Type: application/json" http://${INGRESS_HOST}:${INGRESS_PORT}/v1/models/$MODEL_NAME:predict -d @./iris-input.json

            or curl the previous revision

            curl -v -H "Host: prev-${MODEL_NAME}-predictor-.kserve-test.example.com" -H "Content-Type: application/json" http://${INGRESS_HOST}:${INGRESS_PORT}/v1/models/$MODEL_NAME:predict -d @./iris-input.json
            Mar 7, 2024

            Auto Scaling

            Soft Limit

You can configure the InferenceService with the annotation autoscaling.knative.dev/target to set a soft limit. The soft limit is a targeted limit rather than a strictly enforced bound; particularly if there is a sudden burst of requests, this value can be exceeded.

            apiVersion: "serving.kserve.io/v1beta1"
            kind: "InferenceService"
            metadata:
              name: "sklearn-iris"
              namespace: kserve-test
              annotations:
                autoscaling.knative.dev/target: "5"
            spec:
              predictor:
                model:
                  args: ["--enable_docs_url=True"]
                  modelFormat:
                    name: sklearn
                  resources: {}
                  runtime: kserve-sklearnserver
                  storageUri: "gs://kfserving-examples/models/sklearn/1.0/model"

            Hard Limit

            You can also configure InferenceService with field containerConcurrency with a hard limit. The hard limit is an enforced upper bound. If concurrency reaches the hard limit, surplus requests will be buffered and must wait until enough capacity is free to execute the requests.

            apiVersion: "serving.kserve.io/v1beta1"
            kind: "InferenceService"
            metadata:
              name: "sklearn-iris"
              namespace: kserve-test
            spec:
              predictor:
                containerConcurrency: 5
                model:
                  args: ["--enable_docs_url=True"]
                  modelFormat:
                    name: sklearn
                  resources: {}
                  runtime: kserve-sklearnserver
                  storageUri: "gs://kfserving-examples/models/sklearn/1.0/model"

            Scale with QPS

            apiVersion: "serving.kserve.io/v1beta1"
            kind: "InferenceService"
            metadata:
              name: "sklearn-iris"
              namespace: kserve-test
            spec:
              predictor:
                scaleTarget: 1
                scaleMetric: qps
                model:
                  args: ["--enable_docs_url=True"]
                  modelFormat:
                    name: sklearn
                  resources: {}
                  runtime: kserve-sklearnserver
                  storageUri: "gs://kfserving-examples/models/sklearn/1.0/model"

            Scale with GPU

            apiVersion: "serving.kserve.io/v1beta1"
            kind: "InferenceService"
            metadata:
              name: "flowers-sample-gpu"
              namespace: kserve-test
            spec:
              predictor:
                scaleTarget: 1
                scaleMetric: concurrency
                model:
                  modelFormat:
                    name: tensorflow
                  storageUri: "gs://kfserving-examples/models/tensorflow/flowers"
                  runtimeVersion: "2.6.2-gpu"
                  resources:
                    limits:
                      nvidia.com/gpu: 1

            Enable Scale To Zero

            apiVersion: "serving.kserve.io/v1beta1"
            kind: "InferenceService"
            metadata:
              name: "sklearn-iris"
              namespace: kserve-test
            spec:
              predictor:
                minReplicas: 0
                model:
                  args: ["--enable_docs_url=True"]
                  modelFormat:
                    name: sklearn
                  resources: {}
                  runtime: kserve-sklearnserver
                  storageUri: "gs://kfserving-examples/models/sklearn/1.0/model"

            Prepare Concurrent Requests Container

            # export INGRESS_PORT=$(kubectl -n istio-system get service istio-ingressgateway -o jsonpath='{.spec.ports[?(@.name=="http2")].nodePort}')
            podman run --rm \
                  -v /root/kserve/iris-input.json:/tmp/iris-input.json \
                  --privileged \
                  -e INGRESS_HOST=$(minikube ip) \
                  -e INGRESS_PORT=32132 \
                  -e MODEL_NAME=sklearn-iris \
                  -e INPUT_PATH=/tmp/iris-input.json \
                  -e SERVICE_HOSTNAME=sklearn-iris.kserve-test.example.com \
                  -it m.daocloud.io/docker.io/library/golang:1.22  bash -c "go install github.com/rakyll/hey@latest; bash"

            Fire

Send traffic for 30 seconds, maintaining 100 concurrent in-flight requests (matching the -z and -c flags below).

            hey -z 30s -c 100 -m POST -host ${SERVICE_HOSTNAME} -D $INPUT_PATH http://${INGRESS_HOST}:${INGRESS_PORT}/v1/models/$MODEL_NAME:predict
Expected Output
            Summary:
              Total:        30.1390 secs
              Slowest:      0.5015 secs
              Fastest:      0.0252 secs
              Average:      0.1451 secs
              Requests/sec: 687.3483
              
              Total data:   4371076 bytes
              Size/request: 211 bytes
            
            Response time histogram:
              0.025 [1]     |
              0.073 [14]    |
              0.120 [33]    |
              0.168 [19363] |■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■
              0.216 [1171]  |■■
              0.263 [28]    |
              0.311 [6]     |
              0.359 [0]     |
              0.406 [0]     |
              0.454 [0]     |
              0.502 [100]   |
            
            
            Latency distribution:
              10% in 0.1341 secs
              25% in 0.1363 secs
              50% in 0.1388 secs
              75% in 0.1462 secs
              90% in 0.1587 secs
              95% in 0.1754 secs
              99% in 0.1968 secs
            
            Details (average, fastest, slowest):
              DNS+dialup:   0.0000 secs, 0.0252 secs, 0.5015 secs
              DNS-lookup:   0.0000 secs, 0.0000 secs, 0.0000 secs
              req write:    0.0000 secs, 0.0000 secs, 0.0005 secs
              resp wait:    0.1451 secs, 0.0251 secs, 0.5015 secs
              resp read:    0.0000 secs, 0.0000 secs, 0.0003 secs
            
            Status code distribution:
              [500] 20716 responses
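
While hey is running, you can watch the autoscaler add predictor pods in a second terminal. A minimal sketch, using the same namespace as above:

kubectl -n kserve-test get pods -w
# new sklearn-iris predictor pods should appear while the load test runs
# and scale back down after it finishes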

            Reference

            For more information, please refer to the KPA documentation.

            Mar 7, 2024

            Subsections of Knative

            Subsections of Eventing

            Broker

The Knative Broker is the core component of the Knative Eventing system. Its main role is to act as the hub for event routing and distribution, providing decoupled, reliable event delivery between event producers (event sources) and event consumers (services).

The key roles of a Knative Broker are detailed below:

Event ingestion hub:

The Broker is the entry point where event streams converge. Event sources of all kinds (Kafka topics, HTTP sources, Cloud Pub/Sub, GitHub webhooks, timers, custom sources, and so on) send their events to the Broker.

Event producers only need to know the Broker's address; they do not need to care which consumers exist or where they run.

Event storage and buffering:

A Broker is usually backed by a persistent messaging system (such as Apache Kafka, Google Cloud Pub/Sub, RabbitMQ, NATS Streaming, or the in-memory InMemoryChannel). This provides:

Persistence: events are not lost before consumers have processed them (depending on the underlying channel implementation).

Buffering: when consumers are temporarily unavailable or cannot keep up with the event production rate, the Broker buffers events, preventing loss and protecting producers/consumers from being overwhelmed.

Retries: if a consumer fails to process an event, the Broker can redeliver it (usually in combination with the retry policy of the Trigger and Subscription).

Decoupling event sources from event consumers:

This is one of the Broker's most important roles. An event source is only responsible for sending events to the Broker; it has no knowledge of which services will consume them.

An event consumer declares which events it is interested in by creating a Trigger against the Broker. The consumer only needs to know that the Broker exists, not which specific source produced the events.

This decoupling greatly improves the flexibility and maintainability of the system:

Independent evolution: event sources and consumers can be added, removed, or modified independently, as long as they honor the Broker's contract.

Dynamic routing: events are routed to different consumers based on event attributes (such as type and source), without changing producer or consumer code.

Multicast: the same event can be consumed by several different consumers at once (one event -> Broker -> multiple matching Triggers -> multiple services).

Event filtering and routing (via Triggers):

The Broker itself does not implement complex filtering logic; filtering and routing are handled by Trigger resources.

A Trigger resource is bound to a specific Broker.

A Trigger defines:

Subscriber: the address of the target service (a Knative Service, Kubernetes Service, Channel, and so on).

Filter: a condition on event attributes (mainly type and source, plus other extensible attributes). Only events that satisfy the condition are routed by the Broker through that Trigger to its subscriber.

After receiving an event, the Broker evaluates the filters of every Trigger bound to it. For each matching Trigger, the Broker delivers the event to the subscriber that Trigger specifies.
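
As a minimal sketch of that contract (the Trigger name here is hypothetical; complete, working examples follow in the next subsections), a Trigger that forwards only events of one CloudEvents type to a single service looks like this:

kubectl apply -f - <<EOF
apiVersion: eventing.knative.dev/v1
kind: Trigger
metadata:
  name: example-filtering-trigger   # hypothetical name, for illustration only
  namespace: kserve-test
spec:
  broker: first-broker              # assumes this Broker already exists in the namespace
  filter:
    attributes:
      type: test.type               # only events with this CloudEvents type are delivered
  subscriber:
    ref:
      apiVersion: serving.knative.dev/v1
      kind: Service
      name: event-display           # assumes this Knative Service exists
EOF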

A standard event interface:

The Broker follows the CloudEvents specification: every event it receives and delivers is in CloudEvents format. This gives events from different sources, and the consumers that process them, a single uniform format and simplifies integration.

Multi-tenancy and namespace isolation:

Brokers are deployed into specific Kubernetes namespaces, and one namespace can contain several Brokers.

This makes it possible to isolate event streams per team, application, or environment (e.g. dev, staging) within the same cluster. Each team/application manages the Brokers and Triggers in its own namespace.

A summarizing analogy:

Think of the Knative Broker as a highly intelligent postal sorting center:

Receiving letters (events): letters (events) from all over the world (different event sources) are mailed to the sorting center (the Broker).

Storing letters: the sorting center has a warehouse (persistence/buffering) to hold letters temporarily and keep them from getting lost.

Sorting rules (Triggers): the sorting center employs many sorters (Triggers). Each sorter is responsible for letters of a particular type or from a particular region (filtering on event attributes).

Delivering letters: a sorter (Trigger) picks out the letters (events) that match its rules and delivers them to the right recipient's door (the subscriber service).

Decoupling: the sender (event source) only needs the address of the sorting center (Broker) and does not need to know who the recipients (consumers) are or where they live. A recipient (consumer) only tells the sorter responsible for its kind of letters (by creating a Trigger) its own address, without caring who sent them. The sorting center (Broker) and the sorters (Triggers) handle the complex routing in between.

The core value a Broker brings:

Loose coupling: event producers and consumers are fully decoupled.

Flexibility: consumers can be added/removed and routing rules changed dynamically (by creating/modifying/deleting Triggers).

Reliability: event persistence and retry mechanisms (depending on the underlying implementation).

Scalability: the Broker and the consumers scale independently.

Standardization: built on CloudEvents.

Simpler development: developers focus on business logic (producing or consuming events) instead of building their own event-bus infrastructure.

            Mar 7, 2024

            Subsections of Broker

            Install Kafka Broker

            About


• Source: curl, KafkaSource
            • Broker
            • Trigger
            • Sink: ksvc, isvc

            Install a Channel (messaging) layer

            kubectl apply -f https://github.com/knative-extensions/eventing-kafka-broker/releases/download/knative-v1.18.0/eventing-kafka-controller.yaml
Expected Output
            configmap/kafka-broker-config created
            configmap/kafka-channel-config created
            customresourcedefinition.apiextensions.k8s.io/kafkachannels.messaging.knative.dev created
            customresourcedefinition.apiextensions.k8s.io/consumers.internal.kafka.eventing.knative.dev created
            customresourcedefinition.apiextensions.k8s.io/consumergroups.internal.kafka.eventing.knative.dev created
            customresourcedefinition.apiextensions.k8s.io/kafkasinks.eventing.knative.dev created
            customresourcedefinition.apiextensions.k8s.io/kafkasources.sources.knative.dev created
            clusterrole.rbac.authorization.k8s.io/eventing-kafka-source-observer created
            configmap/config-kafka-source-defaults created
            configmap/config-kafka-autoscaler created
            configmap/config-kafka-features created
            configmap/config-kafka-leader-election created
            configmap/kafka-config-logging created
            configmap/config-namespaced-broker-resources created
            configmap/config-tracing configured
            clusterrole.rbac.authorization.k8s.io/knative-kafka-addressable-resolver created
            clusterrole.rbac.authorization.k8s.io/knative-kafka-channelable-manipulator created
            clusterrole.rbac.authorization.k8s.io/kafka-controller created
            serviceaccount/kafka-controller created
            clusterrolebinding.rbac.authorization.k8s.io/kafka-controller created
            clusterrolebinding.rbac.authorization.k8s.io/kafka-controller-addressable-resolver created
            deployment.apps/kafka-controller created
            clusterrole.rbac.authorization.k8s.io/kafka-webhook-eventing created
            serviceaccount/kafka-webhook-eventing created
            clusterrolebinding.rbac.authorization.k8s.io/kafka-webhook-eventing created
            mutatingwebhookconfiguration.admissionregistration.k8s.io/defaulting.webhook.kafka.eventing.knative.dev created
            mutatingwebhookconfiguration.admissionregistration.k8s.io/pods.defaulting.webhook.kafka.eventing.knative.dev created
            secret/kafka-webhook-eventing-certs created
            validatingwebhookconfiguration.admissionregistration.k8s.io/validation.webhook.kafka.eventing.knative.dev created
            deployment.apps/kafka-webhook-eventing created
            service/kafka-webhook-eventing created
            kubectl apply -f https://github.com/knative-extensions/eventing-kafka-broker/releases/download/knative-v1.18.0/eventing-kafka-channel.yaml
Expected Output
            configmap/config-kafka-channel-data-plane created
            clusterrole.rbac.authorization.k8s.io/knative-kafka-channel-data-plane created
            serviceaccount/knative-kafka-channel-data-plane created
            clusterrolebinding.rbac.authorization.k8s.io/knative-kafka-channel-data-plane created
            statefulset.apps/kafka-channel-dispatcher created
            deployment.apps/kafka-channel-receiver created
            service/kafka-channel-ingress created

            Install a Broker layer

            kubectl apply -f https://github.com/knative-extensions/eventing-kafka-broker/releases/download/knative-v1.18.0/eventing-kafka-broker.yaml
Expected Output
            configmap/config-kafka-broker-data-plane created
            clusterrole.rbac.authorization.k8s.io/knative-kafka-broker-data-plane created
            serviceaccount/knative-kafka-broker-data-plane created
            clusterrolebinding.rbac.authorization.k8s.io/knative-kafka-broker-data-plane created
            statefulset.apps/kafka-broker-dispatcher created
            deployment.apps/kafka-broker-receiver created
            service/kafka-broker-ingress created
            Reference
if you cannot find the kafka-channel-dispatcher pod

please check the StatefulSets

            root@ay-k3s01:~# kubectl -n knative-eventing  get sts
            NAME                       READY   AGE
            kafka-broker-dispatcher    1/1     19m
            kafka-channel-dispatcher   0/0     22m

if a StatefulSet shows 0 replicas, it needs to be scaled up (see below)
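
A minimal fix, assuming the default StatefulSet name shown above:

kubectl -n knative-eventing scale statefulset kafka-channel-dispatcher --replicas=1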

            [Optional] Install Eventing extensions

            • kafka sink
            kubectl apply -f https://github.com/knative-extensions/eventing-kafka-broker/releases/download/knative-v1.18.0/eventing-kafka-sink.yaml
            Reference

            for more information, you can check 🔗https://knative.dev/docs/eventing/sinks/kafka-sink/

            • kafka source
            kubectl apply -f https://github.com/knative-extensions/eventing-kafka-broker/releases/download/knative-v1.18.0/eventing-kafka-source.yaml
            Reference

            for more information, you can check 🔗https://knative.dev/docs/eventing/sources/kafka-source/

            Mar 7, 2024

            Display Broker Message

            Flow

            flowchart LR
                A[Curl] -->|HTTP| B{Broker}
                B -->|Subscribe| D[Trigger1]
                B -->|Subscribe| E[Trigger2]
                B -->|Subscribe| F[Trigger3]
                E --> G[Display Service]

Steps

            1. Create Broker Setting

            kubectl apply -f - <<EOF
            apiVersion: v1
            kind: ConfigMap
            metadata:
              name: kafka-broker-config
              namespace: knative-eventing
            data:
              default.topic.partitions: "10"
              default.topic.replication.factor: "1"
              bootstrap.servers: "kafka.database.svc.cluster.local:9092" #kafka service address
              default.topic.config.retention.ms: "3600"
            EOF

            2. Create Broker

            kubectl apply -f - <<EOF
            apiVersion: eventing.knative.dev/v1
            kind: Broker
            metadata:
              annotations:
                eventing.knative.dev/broker.class: Kafka
              name: first-broker
              namespace: kserve-test
            spec:
              config:
                apiVersion: v1
                kind: ConfigMap
                name: kafka-broker-config
                namespace: knative-eventing
            EOF

(Optionally, the Broker spec can also define a delivery.deadLetterSink for events that cannot be delivered; see the isvc-broker example in the next subsection.)

            3. Create Trigger

            kubectl apply -f - <<EOF
            apiVersion: eventing.knative.dev/v1
            kind: Trigger
            metadata:
              name: display-service-trigger
              namespace: kserve-test
            spec:
              broker: first-broker
              subscriber:
                ref:
                  apiVersion: serving.knative.dev/v1
                  kind: Service
                  name: event-display
            EOF

            4. Create Sink Service (Display Message)

            kubectl apply -f - <<EOF
            apiVersion: serving.knative.dev/v1
            kind: Service
            metadata:
              name: event-display
              namespace: kserve-test
            spec:
              template:
                spec:
                  containers:
                    - image: gcr.io/knative-releases/knative.dev/eventing/cmd/event_display
            EOF

            5. Test

            kubectl run curl-test --image=curlimages/curl -it --rm --restart=Never -- \
              -v "http://kafka-broker-ingress.knative-eventing.svc.cluster.local/kserve-test/first-broker" \
              -X POST \
              -H "Ce-Id: $(date +%s)" \
              -H "Ce-Specversion: 1.0" \
              -H "Ce-Type: test.type" \
              -H "Ce-Source: curl-test" \
              -H "Content-Type: application/json" \
              -d '{"test": "Broker is working"}'

            6. Check message

            kubectl -n kserve-test logs -f deploy/event-display-00001-deployment 
Expected Output
            2025/07/02 09:01:25 Failed to read tracing config, using the no-op default: empty json tracing config
            ☁️  cloudevents.Event
            Context Attributes,
              specversion: 1.0
              type: test.type
              source: curl-test
              id: 1751446880
              datacontenttype: application/json
            Extensions,
              knativekafkaoffset: 6
              knativekafkapartition: 6
            Data,
              {
                "test": "Broker is working"
              }
            Mar 7, 2024

            Kafka Broker Invoke ISVC

            1. Prepare RBAC

• Create a ClusterRole that grants read access to the InferenceService CRD
            kubectl apply -f - <<EOF
            apiVersion: rbac.authorization.k8s.io/v1
            kind: ClusterRole
            metadata:
              name: kserve-access-for-knative
            rules:
            - apiGroups: ["serving.kserve.io"]
              resources: ["inferenceservices", "inferenceservices/status"]
              verbs: ["get", "list", "watch"]
            EOF
• Create a ClusterRoleBinding to grant those privileges to the kafka-controller service account
            kubectl apply -f - <<EOF
            apiVersion: rbac.authorization.k8s.io/v1
            kind: ClusterRoleBinding
            metadata:
              name: kafka-controller-kserve-access
            roleRef:
              apiGroup: rbac.authorization.k8s.io
              kind: ClusterRole
              name: kserve-access-for-knative
            subjects:
            - kind: ServiceAccount
              name: kafka-controller
              namespace: knative-eventing
            EOF
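
To verify the binding took effect, you can ask the API server whether the kafka-controller service account is now allowed to read InferenceServices (kubectl auth can-i supports impersonating a service account):

kubectl auth can-i list inferenceservices.serving.kserve.io \
  --as=system:serviceaccount:knative-eventing:kafka-controller -n kserve-test
# expected output: yes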

            2. Create Broker Setting

            kubectl apply -f - <<EOF
            apiVersion: v1
            kind: ConfigMap
            metadata:
              name: kafka-broker-config
              namespace: knative-eventing
            data:
              default.topic.partitions: "10"
              default.topic.replication.factor: "1"
              bootstrap.servers: "kafka.database.svc.cluster.local:9092" #kafka service address
              default.topic.config.retention.ms: "3600"
            EOF

            3. Create Broker

            kubectl apply -f - <<EOF
            apiVersion: eventing.knative.dev/v1
            kind: Broker
            metadata:
              annotations:
                eventing.knative.dev/broker.class: Kafka
              name: isvc-broker
              namespace: kserve-test
            spec:
              config:
                apiVersion: v1
                kind: ConfigMap
                name: kafka-broker-config
                namespace: knative-eventing
              delivery:
                deadLetterSink:
                  ref:
                    apiVersion: serving.knative.dev/v1
                    kind: Service
                    name: event-display
            EOF

            4. Create InferenceService

            Reference

you can create the first-torchserve InferenceService by following this 🔗link

            5. Create Trigger

            kubectl apply -f - << EOF
            apiVersion: eventing.knative.dev/v1
            kind: Trigger
            metadata:
              name: kserve-trigger
              namespace: kserve-test
            spec:
              broker: isvc-broker
              filter:
                attributes:
                  type: prediction-request
              subscriber:
                uri: http://first-torchserve.kserve-test.svc.cluster.local/v1/models/mnist:predict
            EOF

            6. Test

Normally, we can invoke first-torchserve directly by executing

            export MASTER_IP=192.168.100.112
            export ISTIO_INGRESS_PORT=$(kubectl -n istio-system get service istio-ingressgateway -o jsonpath='{.spec.ports[?(@.name=="http2")].nodePort}')
            export SERVICE_HOSTNAME=$(kubectl -n kserve-test get inferenceservice first-torchserve  -o jsonpath='{.status.url}' | cut -d "/" -f 3)
            # http://first-torchserve.kserve-test.example.com 
            curl -v -H "Host: ${SERVICE_HOSTNAME}" -H "Content-Type: application/json" "http://${MASTER_IP}:${ISTIO_INGRESS_PORT}/v1/models/mnist:predict" -d @./mnist-input.json

            Now, you can access model by executing

            export KAFKA_BROKER_INGRESS_PORT=$(kubectl -n knative-eventing get service kafka-broker-ingress -o jsonpath='{.spec.ports[?(@.name=="http-container")].nodePort}')
            curl -v "http://${MASTER_IP}:${KAFKA_BROKER_INGRESS_PORT}/kserve-test/isvc-broker" \
              -X POST \
              -H "Ce-Id: $(date +%s)" \
              -H "Ce-Specversion: 1.0" \
              -H "Ce-Type: prediction-request" \
              -H "Ce-Source: event-producer" \
              -H "Content-Type: application/json" \
              -d @./mnist-input.json 
if you cannot see the prediction result

please check Kafka

# list all topics; the one whose suffix is `isvc-broker` is knative-broker-kserve-test-isvc-broker
            kubectl -n database exec -it deployment/kafka-client-tools -- bash -c \
                'kafka-topics.sh --bootstrap-server $BOOTSTRAP_SERVER --command-config $CLIENT_CONFIG_FILE --list'
            # retrieve msg from that topic
            kubectl -n database exec -it deployment/kafka-client-tools -- bash -c \
              'kafka-console-consumer.sh --bootstrap-server $BOOTSTRAP_SERVER --consumer.config $CLIENT_CONFIG_FILE --topic knative-broker-kserve-test-isvc-broker --from-beginning'

You should then see output like the following:

            {
                "instances": [
                    {
                        "data": "iVBORw0KGgoAAAANSUhEUgAAABwAAAAcCAAAAABXZoBIAAAAw0lEQVR4nGNgGFggVVj4/y8Q2GOR83n+58/fP0DwcSqmpNN7oOTJw6f+/H2pjUU2JCSEk0EWqN0cl828e/FIxvz9/9cCh1zS5z9/G9mwyzl/+PNnKQ45nyNAr9ThMHQ/UG4tDofuB4bQIhz6fIBenMWJQ+7Vn7+zeLCbKXv6z59NOPQVgsIcW4QA9YFi6wNQLrKwsBebW/68DJ388Nun5XFocrqvIFH59+XhBAxThTfeB0r+vP/QHbuDCgr2JmOXoSsAAKK7bU3vISS4AAAAAElFTkSuQmCC"
                    }
                ]
            }
            {
                "predictions": [
                    2
                ]
            }
            Mar 7, 2024

            Subsections of Plugin

            Subsections of Eventing Kafka Broker

            Prepare Dev Environment

1. Update Go to 1.24

2. Install ko (v0.18.0)

            go install github.com/google/ko@latest
            # wget https://github.com/ko-build/ko/releases/download/v0.18.0/ko_0.18.0_Linux_x86_64.tar.gz
            # tar -xzf ko_0.18.0_Linux_x86_64.tar.gz  -C /usr/local/bin/ko
            # cp /usr/local/bin/ko/ko /root/bin
3. Install protoc (v30.2)
            PB_REL="https://github.com/protocolbuffers/protobuf/releases"
            curl -LO $PB_REL/download/v30.2/protoc-30.2-linux-x86_64.zip
            # mkdir -p ${HOME}/bin/
            mkdir -p /usr/local/bin/protoc
            unzip protoc-30.2-linux-x86_64.zip -d /usr/local/bin/protoc
            cp /usr/local/bin/protoc/bin/protoc /root/bin
            # export PATH="$PATH:/root/bin"
            rm -rf protoc-30.2-linux-x86_64.zip
4. Install protoc-gen-go (v1.5.4)
            go install google.golang.org/protobuf/cmd/protoc-gen-go@latest
            export GOPATH=/usr/local/go/bin
5. Clone the source code
            mkdir -p ${GOPATH}/src/knative.dev
            cd ${GOPATH}/src/knative.dev
            git clone git@github.com:knative/eventing.git # clone eventing repo
            git clone git@github.com:AaronYang0628/eventing-kafka-broker.git
            cd eventing-kafka-broker
            git remote add upstream https://github.com/knative-extensions/eventing-kafka-broker.git
            git remote set-url --push upstream no_push
            export KO_DOCKER_REPO=docker-registry.lab.zverse.space/data-and-computing/ay-dev
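
A quick sanity check that the tools installed above are on PATH and at the expected versions:

go version
ko version
protoc --version
protoc-gen-go --version
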
            Mar 7, 2024

Build Async Prediction Flow

            Flow

            flowchart LR
                A[User Curl] -->|HTTP| B{ISVC-Broker:Kafka}
                B -->|Subscribe| D[Trigger1]
    B -->|Subscribe| E[Kserve-Trigger]
                B -->|Subscribe| F[Trigger3]
                E --> G[Mnist Service]
                G --> |Kafka-Sink| B

Steps

            1. Create Broker Setting

            kubectl apply -f - <<EOF
            apiVersion: v1
            kind: ConfigMap
            metadata:
              name: kafka-broker-config
              namespace: knative-eventing
            data:
              default.topic.partitions: "10"
              default.topic.replication.factor: "1"
              bootstrap.servers: "kafka.database.svc.cluster.local:9092" #kafka service address
              default.topic.config.retention.ms: "3600"
            EOF

            2. Create Broker

            kubectl apply -f - <<EOF
            apiVersion: eventing.knative.dev/v1
            kind: Broker
            metadata:
              annotations:
                eventing.knative.dev/broker.class: Kafka
              name: isvc-broker
              namespace: kserve-test
            spec:
              config:
                apiVersion: v1
                kind: ConfigMap
                name: kafka-broker-config
                namespace: knative-eventing
            EOF

            3. Create Trigger

            kubectl apply -f - << EOF
            apiVersion: eventing.knative.dev/v1
            kind: Trigger
            metadata:
              name: kserve-trigger
              namespace: kserve-test
            spec:
              broker: isvc-broker
              filter:
                attributes:
                  type: prediction-request-udf-attr # you can change this
              subscriber:
                uri: http://prediction-and-sink.kserve-test.svc.cluster.local/v1/models/mnist:predict
            EOF

            4. Create InferenceService

kubectl apply -f - <<EOF
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: prediction-and-sink
  namespace: kserve-test
spec:
  predictor:
    model:
      modelFormat:
        name: pytorch
      storageUri: gs://kfserving-examples/models/torchserve/image_classifier/v1
  transformer:
    containers:
      - image: docker-registry.lab.zverse.space/data-and-computing/ay-dev/msg-transformer:dev9
        name: kserve-container
        env:
        - name: KAFKA_BOOTSTRAP_SERVERS
          value: kafka.database.svc.cluster.local
        - name: KAFKA_TOPIC
          value: test-topic # result will be saved in this topic
        - name: REQUEST_TRACE_KEY
          value: test-trace-id # use this key to retrieve the prediction result
        command:
          - "python"
          - "-m"
          - "model"
        args:
          - --model_name
          - mnist
EOF
Expected Output
            root@ay-k3s01:~# kubectl -n kserve-test get pod
            NAME                                                              READY   STATUS    RESTARTS   AGE
            prediction-and-sink-predictor-00001-deployment-f64bb76f-jqv4m     2/2     Running   0          3m46s
            prediction-and-sink-transformer-00001-deployment-76cccd867lksg9   2/2     Running   0          4m3s
Note

The source code of the docker-registry.lab.zverse.space/data-and-computing/ay-dev/msg-transformer:dev9 image can be found 🔗here

            [Optional] 5. Invoke InferenceService Directly

            • preparation
            wget -O ./mnist-input.json https://raw.githubusercontent.com/kserve/kserve/refs/heads/master/docs/samples/v1beta1/torchserve/v1/imgconv/input.json
            SERVICE_NAME=prediction-and-sink
            MODEL_NAME=mnist
            INPUT_PATH=@./mnist-input.json
            PLAIN_SERVICE_HOSTNAME=$(kubectl -n kserve-test get inferenceservice $SERVICE_NAME -o jsonpath='{.status.url}' | cut -d "/" -f 3)
            • fire!!
            export INGRESS_HOST=192.168.100.112
            export INGRESS_PORT=$(kubectl -n istio-system get service istio-ingressgateway -o jsonpath='{.spec.ports[?(@.name=="http2")].nodePort}')
            curl -v -H "Host: ${PLAIN_SERVICE_HOSTNAME}" -H "Content-Type: application/json" -d $INPUT_PATH http://${INGRESS_HOST}:${INGRESS_PORT}/v1/models/$MODEL_NAME:predict
Expected Output
            curl -v -H "Host: ${PLAIN_SERVICE_HOSTNAME}" -H "Content-Type: application/json" -d $INPUT_PATH http://${INGRESS_HOST}:${INGRESS_PORT}/v1/models/$MODEL_NAME:predict
            *   Trying 192.168.100.112:31855...
            * Connected to 192.168.100.112 (192.168.100.112) port 31855
            > POST /v1/models/mnist:predict HTTP/1.1
            > Host: prediction-and-sink.kserve-test.ay.test.dev
            > User-Agent: curl/8.5.0
            > Accept: */*
            > Content-Type: application/json
            > Content-Length: 401
            > 
            < HTTP/1.1 200 OK
            < content-length: 19
            < content-type: application/json
            < date: Wed, 02 Jul 2025 08:55:05 GMT,Wed, 02 Jul 2025 08:55:04 GMT
            < server: istio-envoy
            < x-envoy-upstream-service-time: 209
            < 
            * Connection #0 to host 192.168.100.112 left intact
            {"predictions":[2]}

            6. Invoke Broker

            • preparation
            cat > image-with-trace-id.json << EOF
            {
                "test-trace-id": "16ec3446-48d6-422e-9926-8224853e84a7",
                "instances": [
                    {
                        "data": "iVBORw0KGgoAAAANSUhEUgAAABwAAAAcCAAAAABXZoBIAAAAw0lEQVR4nGNgGFggVVj4/y8Q2GOR83n+58/fP0DwcSqmpNN7oOTJw6f+/H2pjUU2JCSEk0EWqN0cl828e/FIxvz9/9cCh1zS5z9/G9mwyzl/+PNnKQ45nyNAr9ThMHQ/UG4tDofuB4bQIhz6fIBenMWJQ+7Vn7+zeLCbKXv6z59NOPQVgsIcW4QA9YFi6wNQLrKwsBebW/68DJ388Nun5XFocrqvIFH59+XhBAxThTfeB0r+vP/QHbuDCgr2JmOXoSsAAKK7bU3vISS4AAAAAElFTkSuQmCC"
                    }
                ]
            }
            EOF
            • fire!!
            export MASTER_IP=192.168.100.112
            export KAFKA_BROKER_INGRESS_PORT=$(kubectl -n knative-eventing get service kafka-broker-ingress -o jsonpath='{.spec.ports[?(@.name=="http-container")].nodePort}')
            
            curl -v "http://${MASTER_IP}:${KAFKA_BROKER_INGRESS_PORT}/kserve-test/isvc-broker" \
              -X POST \
              -H "Ce-Id: $(date +%s)" \
              -H "Ce-Specversion: 1.0" \
              -H "Ce-Type: prediction-request-udf-attr" \
              -H "Ce-Source: event-producer" \
              -H "Content-Type: application/json" \
              -d @./image-with-trace-id.json 
            • check input data in kafka topic knative-broker-kserve-test-isvc-broker
            kubectl -n database exec -it deployment/kafka-client-tools -- bash -c \
              'kafka-console-consumer.sh --bootstrap-server $BOOTSTRAP_SERVER --consumer.config $CLIENT_CONFIG_FILE --topic knative-broker-kserve-test-isvc-broker --from-beginning'
Expected Output
            {
                "test-trace-id": "16ec3446-48d6-422e-9926-8224853e84a7",
                "instances": [
                {
                    "data": "iVBORw0KGgoAAAANSUhEUgAAABwAAAAcCAAAAABXZoBIAAAAw0lEQVR4nGNgGFggVVj4/y8Q2GOR83n+58/fP0DwcSqmpNN7oOTJw6f+/H2pjUU2JCSEk0EWqN0cl828e/FIxvz9/9cCh1zS5z9/G9mwyzl/+PNnKQ45nyNAr9ThMHQ/UG4tDofuB4bQIhz6fIBenMWJQ+7Vn7+zeLCbKXv6z59NOPQVgsIcW4QA9YFi6wNQLrKwsBebW/68DJ388Nun5XFocrqvIFH59+XhBAxThTfeB0r+vP/QHbuDCgr2JmOXoSsAAKK7bU3vISS4AAAAAElFTkSuQmCC"
                }]
            }
            {
                "predictions": [2] // result will be saved in this topic as well
            }
            • check response result in kafka topic test-topic
            kubectl -n database exec -it deployment/kafka-client-tools -- bash -c \
              'kafka-console-consumer.sh --bootstrap-server $BOOTSTRAP_SERVER --consumer.config $CLIENT_CONFIG_FILE --topic test-topic --from-beginning'

{
    "specversion": "1.0",
    "id": "822e3115-0185-4752-9967-f408dda72004",
    "source": "data-and-computing/kafka-sink-transformer",
    "type": "org.zhejianglab.zverse.data-and-computing.kafka-sink-transformer",
    "time": "2025-07-02T08:57:04.133497+00:00",
    "data":
    {
        "predictions": [2]
    },
    "request-host": "prediction-and-sink-transformer.kserve-test.svc.cluster.local",
    "kserve-isvc-name": "prediction-and-sink",
    "kserve-isvc-namespace": "kserve-test",
    "test-trace-id": "16ec3446-48d6-422e-9926-8224853e84a7"
}
Use the test-trace-id field to correlate the response with the original request.
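
A minimal sketch for pulling out just the matching record, reusing the consumer command above and the trace id sent in the request (this assumes the transformer writes each event as a single JSON line):

kubectl -n database exec -it deployment/kafka-client-tools -- bash -c \
  'kafka-console-consumer.sh --bootstrap-server $BOOTSTRAP_SERVER --consumer.config $CLIENT_CONFIG_FILE --topic test-topic --from-beginning' \
  | grep '16ec3446-48d6-422e-9926-8224853e84a7'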

            Mar 7, 2024

            Subsections of 🏗️Linux

            Cheatsheet

            useradd

            sudo useradd <$name> -m -r -s /bin/bash -p <$password>
add as a sudoer
            echo '<$name> ALL=(ALL) NOPASSWD: ALL' >> /etc/sudoers
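
for example (the user name and password below are placeholders; note that useradd -p expects an already-hashed password, so generate one first)

sudo useradd deploy -m -s /bin/bash -p "$(openssl passwd -6 'changeme')"
echo 'deploy ALL=(ALL) NOPASSWD: ALL' | sudo tee -a /etc/sudoers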

            telnet

a command line interface for communication with a remote device or server

            telnet <$ip> <$port>
            for example
            telnet 172.27.253.50 9000 #test application connectivity

lsof (list open files)

            everything is a file

            lsof <$option:value>
            for example

            -a List processes that have open files

            -c <process_name> List files opened by the specified process

            -g List GID number process details

-d <file_descriptor> List processes using the given file descriptor

+d List open files in a directory

+D Recursively list open files in a directory

            -n List files using NFS

            -i List eligible processes. (protocol, :port, @ip)

            -p List files opened by the specified process ID

            -u List UID number process details

            lsof -i:30443 # find port 30443 
            lsof -i -P -n # list all connections

            awk (Aho, Weinberger, and Kernighan [Names])

            awk is a scripting language used for manipulating data and generating reports.

            # awk [params] 'script' 
            awk <$params> <$string_content>
            for example

print lines whose first field is greater than 3

            echo -e "1\n2\n3\n4\n5\n" | awk '$1>3'

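another small example: sum the first column

echo -e "1\n2\n3\n4\n5" | awk '{ sum += $1 } END { print sum }'   # prints 15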

            ss (socket statistics)

            view detailed information about your system’s network connections, including TCP/IP, UDP, and Unix domain sockets

            ss [options]
            for example
Option   Description
-t       Display TCP sockets
-l       Display listening sockets
-n       Show numerical addresses instead of resolving
-a       Display all sockets (listening and non-listening)
            #show all listening TCP connection
            ss -tln
            #show all established TCP connections
            ss -tan

delete files older than 3 days

            find /aaa/bbb/ccc/*.gz -mtime +3 -exec rm {} \;
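
a safer two-step variant: list the matches first, then delete (the path and pattern are placeholders)

find /aaa/bbb/ccc -name '*.gz' -mtime +3 -print
find /aaa/bbb/ccc -name '*.gz' -mtime +3 -delete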

ssh without affecting $HOME/.ssh/known_hosts

            ssh -o "UserKnownHostsFile /dev/null" root@aaa.domain.com
            ssh -o "UserKnownHostsFile /dev/null" -o "StrictHostKeyChecking=no" root@aaa.domain.com

            sync clock

            [yum|dnf] install -y chrony \
                && systemctl enable chronyd \
                && (systemctl is-active chronyd || systemctl start chronyd) \
                && chronyc sources \
                && chronyc tracking \
                && timedatectl set-timezone 'Asia/Shanghai'

            set hostname

            hostnamectl set-hostname develop

            add remote key to other server

            ssh -o "UserKnownHostsFile /dev/null" \
                root@aaa.bbb.ccc \
                "mkdir -p /root/.ssh && chmod 700 /root/.ssh && echo '$SOME_PUBLIC_KEY' \
                >> /root/.ssh/authorized_keys && chmod 600 /root/.ssh/authorized_keys"
            for example
            ssh -o "UserKnownHostsFile /dev/null" \
                root@17.27.253.67 \
                "mkdir -p /root/.ssh && chmod 700 /root/.ssh && echo 'ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAABgQC00JLKF/Cd//rJcdIVGCX3ePo89KAgEccvJe4TEHs5pI5FSxs/7/JfQKZ+by2puC3IT88bo/d7nStw9PR3BXgqFXaBCknNBpSLWBIuvfBF+bcL+jGnQYo2kPjrO+2186C5zKGuPRi9sxLI5AkamGB39L5SGqwe5bbKq2x/8OjUP25AlTd99XsNjEY2uxNVClHysExVad/ZAcl0UVzG5xmllusXCsZVz9HlPExqB6K1sfMYWvLVgSCChx6nUfgg/NZrn/kQG26X0WdtXVM2aXpbAtBioML4rWidsByDb131NqYpJF7f+x3+I5pQ66Qpc72FW1G4mUiWWiGhF9tL8V9o1AY96Rqz0AVaxAQrBEuyCWKrXbA97HeC3Xp57Luvlv9TqUd8CIJYq+QTL0hlIDrzK9rJsg34FRAvf9sh8K2w/T/gC9UnRjRXgkPUgKldq35Y6Z9wP6KY45gCXka1PU4nVqb6wicO+RHcZ5E4sreUwqfTypt5nTOgW2/p8iFhdN8= Administrator@AARON-X1-8TH' \
                >> /root/.ssh/authorized_keys && chmod 600 /root/.ssh/authorized_keys"

            set -x

            This will print each command to the standard error before executing it, which is useful for debugging scripts.

            set -x

            set -e

            Exit immediately if a command exits with a non-zero status.

set -e

            sed (Stream Editor)

            sed <$option> <$file_path>
            for example

            replace unix -> linux

            echo "linux is great os. unix is opensource. unix is free os." | sed 's/unix/linux/'

            or you can check https://www.geeksforgeeks.org/sed-command-in-linux-unix-with-examples/

            fdisk

            list all disk

            fdisk -l

create XFS file system

Use the mkfs.xfs command to create an XFS file system with the internal log on the same disk; an example is shown below:

            mkfs.xfs <$path>
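
for example (assuming /dev/sdb1 is an empty partition you intend to format; this erases its contents)

mkfs.xfs /dev/sdb1
mkdir -p /data && mount /dev/sdb1 /data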

            modprobe

            program to add and remove modules from the Linux Kernel

            modprobe nfs && modprobe nfsd

            disown

            disown command in Linux is used to remove jobs from the job table.

            disown [options] jobID1 jobID2 ... jobIDN
            for example

            for example, there is a job running in the background

            ping google.com > /dev/null &

use jobs -l to list all running jobs

jobs -l

use disown -a to remove all jobs from the job table

disown -a

use disown %2 to remove job #2

            disown %2

            generate SSH key

ssh-keygen -t rsa -b 4096 -C "aaron19940628@gmail.com"

link installed binaries into /usr/local/bin

sudo ln -sf <$install_path>/bin/* /usr/local/bin

            append dir into $PATH (temporary)

            export PATH="/root/bin:$PATH"
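
to make it persistent, append the same line to your shell profile (assuming bash)

echo 'export PATH="/root/bin:$PATH"' >> ~/.bashrc
source ~/.bashrc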

            copy public key to ECS

            ssh-copy-id -i ~/.ssh/id_rsa.pub root@10.200.60.53

set DNS nameservers

echo "nameserver 8.8.8.8" | sudo tee /etc/resolv.conf
echo "nameserver 8.8.4.4" | sudo tee -a /etc/resolv.conf

            Mar 12, 2024

            Subsections of Command

            Echo

In a Windows batch file (using the ECHO command)

ECHO content to write > filename.txt
ECHO content to append >> filename.txt

In a Linux/macOS shell

echo "content to write" > filename.txt
echo "content to append" >> filename.txt

In Python

# write to a file (overwrite)
with open('filename.txt', 'w', encoding='utf-8') as f:
    f.write("content to write\n")

# append content
with open('filename.txt', 'a', encoding='utf-8') as f:
    f.write("content to append\n")

In PowerShell

"content to write" | Out-File -FilePath filename.txt
"content to append" | Out-File -FilePath filename.txt -Append

In JavaScript (Node.js)

const fs = require('fs');

// write to a file (overwrite)
fs.writeFileSync('filename.txt', 'content to write\n');

// append content
fs.appendFileSync('filename.txt', 'content to append\n');
            Sep 7, 2025

            Grep

grep is a powerful text-search tool on Linux; its name comes from "Global Regular Expression Print". The most common uses of the grep command are listed below.

Basic syntax

grep [options] pattern [file...]

Common options

1. Basic searching

# search a file for lines containing "error"
grep "error" filename.log

# ignore case while searching
grep -i "error" filename.log

# show lines that do NOT match
grep -v "success" filename.log

# show the line numbers of matches
grep -n "pattern" filename.txt

2. Recursive search

# search recursively in the current directory and its subdirectories
grep -r "function_name" .

# search recursively and show only the file names
grep -r -l "text" /path/to/directory

3. Output control

# show only the names of matching files (not the matching lines)
grep -l "pattern" *.txt

# show context around matching lines
grep -A 3 "error" logfile.txt    # 3 lines after the match
grep -B 2 "error" logfile.txt    # 2 lines before the match
grep -C 2 "error" logfile.txt    # 2 lines before and after the match

# print only the matching part (not the whole line)
grep -o "pattern" file.txt

4. Regular expressions

# use extended regular expressions
grep -E "pattern1|pattern2" file.txt

# match lines starting with "start"
grep "^start" file.txt

# match lines ending with "end"
grep "end$" file.txt

# match empty lines
grep "^$" file.txt

# use character classes
grep "[0-9]" file.txt           # lines containing digits
grep "[a-zA-Z]" file.txt        # lines containing letters

5. Working with files

# search several files
grep "text" file1.txt file2.txt

# use wildcards
grep "pattern" *.log

# read from standard input
cat file.txt | grep "pattern"
echo "some text" | grep "text"

6. Counting

# count matching lines
grep -c "pattern" file.txt

# count matches (a single line may contain several)
grep -o "pattern" file.txt | wc -l

Practical examples

1. Log analysis

# find today's error logs
grep "ERROR" /var/log/syslog | grep "$(date '+%Y-%m-%d')"

# find lines containing IP addresses
grep -E "[0-9]+\.[0-9]+\.[0-9]+\.[0-9]+" access.log

2. Code search

# find a function definition in the project
grep -r "function_name(" src/

# find comments containing TODO or FIXME
grep -r -E "TODO|FIXME" ./

# find blank lines and count them
grep -c "^$" source_code.py

3. System monitoring

# look for a specific process
ps aux | grep "nginx"

# check which process occupies a port
netstat -tulpn | grep ":80"

4. File content checks

# show the effective settings of a config file (ignore comments and blank lines)
grep -v "^#" /etc/ssh/sshd_config | grep -v "^$"

# find lines containing email addresses
grep -E "\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Z|a-z]{2,}\b" file.txt

Advanced tips

1. Using context

# show errors together with their context
grep -C 3 -i "error" application.log

2. Back references

# use groups in extended regular expressions
grep -E "(abc).*\1" file.txt  # find a repeated "abc"

3. Searching binary files

# search for text strings inside a binary file
grep -a "text" binaryfile

4. Color highlighting

# enable color highlighting (usually on by default)
grep --color=auto "pattern" file.txt

Common combinations

Working with other commands

# search and sort
grep "pattern" file.txt | sort

# search and count occurrences
grep -o "pattern" file.txt | sort | uniq -c

# search and save the results
grep "error" logfile.txt > errors.txt

These are the most common uses of grep; mastering them greatly improves the efficiency of text processing on Linux.

            Sep 7, 2025

            Sed

sed (Stream Editor) is a powerful stream editor on Linux for filtering and transforming text. The most common uses of the sed command are listed below.

Basic syntax

sed [options] 'command' file
sed [options] -e 'command1' -e 'command2' file
sed [options] -f script-file file

Common options

1. Basic options

# edit the file in place and keep a backup of the original
sed -i.bak 's/old/new/g' file.txt

# edit the file in place (no backup)
sed -i 's/old/new/g' file.txt

# print only the lines selected by the command
sed -n 'command' file.txt

# use extended regular expressions
sed -E 'command' file.txt

Text substitution

1. Basic substitution

# replace the first match on each line
sed 's/old/new/' file.txt

# replace all matches (global substitution)
sed 's/old/new/g' file.txt

# replace the Nth occurrence
sed 's/old/new/2' file.txt    # replace the second occurrence

# substitute only on matching lines
sed '/pattern/s/old/new/g' file.txt

2. Alternative delimiters

# when the pattern contains slashes, another delimiter can be used
sed 's|/usr/local|/opt|g' file.txt
sed 's#old#new#g' file.txt

3. References and escaping

# use & to reference the whole matched text
sed 's/[0-9]*/[&]/g' file.txt

# use group references
sed 's/\([a-z]*\) \([a-z]*\)/\2 \1/' file.txt
sed -E 's/([a-z]*) ([a-z]*)/\2 \1/' file.txt  # extended regular expressions

Line operations

1. Line addressing

# address by line number
sed '5s/old/new/' file.txt        # substitute only on line 5
sed '1,5s/old/new/g' file.txt     # substitute on lines 1-5
sed '5,$s/old/new/g' file.txt     # from line 5 to the last line

# address lines with regular expressions
sed '/^#/s/old/new/' file.txt     # only lines starting with #
sed '/start/,/end/s/old/new/g' file.txt  # lines between start and end

2. Deleting lines

# delete blank lines
sed '/^$/d' file.txt

# delete comment lines
sed '/^#/d' file.txt

# delete specific line numbers
sed '5d' file.txt                 # delete line 5
sed '1,5d' file.txt               # delete lines 1-5
sed '/pattern/d' file.txt         # delete lines matching the pattern

3. Inserting and appending

# insert before a given line
sed '5i\inserted text' file.txt

# append after a given line
sed '5a\appended text' file.txt

# insert at the beginning of the file
sed '1i\header text' file.txt

# append at the end of the file
sed '$a\footer text' file.txt

4. Changing lines

# replace a whole line
sed '5c\new line content' file.txt

# replace lines matching a pattern
sed '/pattern/c\new line content' file.txt

Advanced operations

1. Print control

# print only matching lines (similar to grep)
sed -n '/pattern/p' file.txt

# print the line numbers of matches
sed -n '/pattern/=' file.txt

# print both line numbers and content
sed -n '/pattern/{=;p}' file.txt

2. Multiple commands

# separate commands with semicolons
sed 's/old/new/g; s/foo/bar/g' file.txt

# use the -e option
sed -e 's/old/new/' -e 's/foo/bar/' file.txt

# run several operations on the same line
sed '/pattern/{s/old/new/; s/foo/bar/}' file.txt

3. File operations

# read another file and insert it after matching lines
sed '/pattern/r otherfile.txt' file.txt

# write matching lines to a file
sed '/pattern/w output.txt' file.txt

4. Hold-space operations

# exchange the pattern space and the hold space
sed '1!G;h;$!d' file.txt          # reverse the order of lines in a file

# copy to the hold space
sed '/pattern/h' file.txt

# get back from the hold space
sed '/pattern/g' file.txt

Practical examples

1. Editing config files

# change the SSH port
sed -i 's/^#Port 22/Port 2222/' /etc/ssh/sshd_config

# enable root login
sed -i 's/^#PermitRootLogin yes/PermitRootLogin yes/' /etc/ssh/sshd_config

# comment out a line
sed -i '/pattern/s/^/#/' file.txt

# uncomment a line
sed -i '/pattern/s/^#//' file.txt

2. Log processing

# extract timestamps
sed -n 's/.*\([0-9]\{2\}:[0-9]\{2\}:[0-9]\{2\}\).*/\1/p' logfile

# strip leading and trailing whitespace
sed 's/^[ \t]*//;s/[ \t]*$//' file.txt

3. Text formatting

# add a comma at the end of every line
sed 's/$/,/' file.txt

# squeeze consecutive blank lines
sed '/^$/{N;/^\n$/D}' file.txt

# prepend line numbers to each line
sed = file.txt | sed 'N;s/\n/\t/'

4. Data conversion

# CSV to TSV
sed 's/,/\t/g' data.csv

# convert a date format
sed -E 's/([0-9]{4})-([0-9]{2})-([0-9]{2})/\3\/\2\/\1/g' dates.txt

# URL-encode spaces (simplistic version)
echo "hello world" | sed 's/ /%20/g'

5. Using a script file

# create a sed script
cat > script.sed << EOF
s/old/new/g
/^#/d
/^$/d
EOF

# run the script file
sed -f script.sed file.txt

Common combinations

1. With pipes

# search, then substitute
grep "pattern" file.txt | sed 's/old/new/g'

# process command output
ls -l | sed -n '2,$p' | awk '{print $9}'

2. Complex text processing

# extract the content of an XML/HTML tag
sed -n 's/.*<title>\(.*\)<\/title>.*/\1/p' file.html

# extract a section of a config file
sed -n '/^\[database\]/,/^\[/p' config.ini | sed '/^\[/d'

These sed usages cover most everyday text-processing needs; mastering them makes batch text editing and transformation far more efficient.

            Sep 7, 2025

            Subsections of Components

What Does Cgroup Do

cgroup offers far more than CPU limits; it provides control over many other kinds of system resources:

1. Memory management (memory)

1.1 Memory limits

# set the memory usage limit
echo "100M" > /sys/fs/cgroup/memory/group1/memory.limit_in_bytes

# set the memory + swap limit
echo "200M" > /sys/fs/cgroup/memory/group1/memory.memsw.limit_in_bytes

1.2 Memory statistics and monitoring

# check memory usage
cat /sys/fs/cgroup/memory/group1/memory.usage_in_bytes
cat /sys/fs/cgroup/memory/group1/memory.stat

1.3 Memory pressure control

# set memory reclaim pressure (swappiness)
echo 100 > /sys/fs/cgroup/memory/group1/memory.swappiness

2. Block device I/O control (blkio)

2.1 I/O bandwidth limits

# limit read bandwidth to 1 MB/s
echo "8:0 1048576" > /sys/fs/cgroup/blkio/group1/blkio.throttle.read_bps_device

# limit write bandwidth to 2 MB/s
echo "8:0 2097152" > /sys/fs/cgroup/blkio/group1/blkio.throttle.write_bps_device

2.2 IOPS limits

# limit read operations per second
echo "8:0 100" > /sys/fs/cgroup/blkio/group1/blkio.throttle.read_iops_device

# limit write operations per second
echo "8:0 50" > /sys/fs/cgroup/blkio/group1/blkio.throttle.write_iops_device

2.3 I/O weight allocation

# set the I/O priority weight (100-1000)
echo 500 > /sys/fs/cgroup/blkio/group1/blkio.weight

3. Process control (pids)

3.1 Process count limits

# limit the maximum number of processes
echo 100 > /sys/fs/cgroup/pids/group1/pids.max

# check the current number of processes
cat /sys/fs/cgroup/pids/group1/pids.current

4. Device access control (devices)

4.1 Device permission management

# allow access to a device
echo "c 1:3 rwm" > /sys/fs/cgroup/devices/group1/devices.allow

# deny access to a device
echo "c 1:5 rwm" > /sys/fs/cgroup/devices/group1/devices.deny

5. Network control (net_cls, net_prio)

5.1 Network traffic classification

# set the network traffic class ID
echo 0x100001 > /sys/fs/cgroup/net_cls/group1/net_cls.classid

5.2 Network priority

# set priority per network interface
echo "eth0 5" > /sys/fs/cgroup/net_prio/group1/net_prio.ifpriomap

6. Mount point control (devices)

6.1 Filesystem access restrictions

# restrict mount namespace operations
echo 1 > /sys/fs/cgroup/group1/devices.deny

7. Unified-hierarchy cgroup v2 features

cgroup v2 provides a more unified management interface:

7.1 Resource protection

# memory low-watermark protection
echo "min 50M" > /sys/fs/cgroup/group1/memory.low

# CPU weight protection
echo 100 > /sys/fs/cgroup/group1/cpu.weight

7.2 I/O control

# I/O weight
echo "default 100" > /sys/fs/cgroup/group1/io.weight

# maximum I/O bandwidth
echo "8:0 rbps=1048576 wbps=2097152" > /sys/fs/cgroup/group1/io.max

8. Practical use cases

8.1 Container resource limits

# Docker container resource limits
docker run -it \
  --cpus="0.5" \
  --memory="100m" \
  --blkio-weight=500 \
  --pids-limit=100 \
  ubuntu:latest

8.2 systemd service limits

[Service]
MemoryMax=100M
IOWeight=500
TasksMax=100
DeviceAllow=/dev/null rw
DeviceAllow=/dev/zero rw
DeviceAllow=/dev/full rw

8.3 Kubernetes resource management

apiVersion: v1
kind: Pod
spec:
  containers:
  - name: app
    resources:
      limits:
        cpu: "500m"
        memory: "128Mi"
        ephemeral-storage: "1Gi"
      requests:
        cpu: "250m"
        memory: "64Mi"

9. Monitoring and statistics

9.1 Resource usage statistics

# inspect cgroup resource usage
cat /sys/fs/cgroup/memory/group1/memory.stat
cat /sys/fs/cgroup/cpu/group1/cpu.stat
cat /sys/fs/cgroup/io/group1/io.stat

9.2 Pressure stall information

# check memory pressure
cat /sys/fs/cgroup/memory/group1/memory.pressure

10. Advanced features

10.1 Resource delegation (cgroup v2)

# allow child cgroups to manage specific resources
echo "+memory +io" > /sys/fs/cgroup/group1/cgroup.subtree_control

10.2 Freezing processes

# pause all processes in the cgroup
echo 1 > /sys/fs/cgroup/group1/cgroup.freeze

# resume execution
echo 0 > /sys/fs/cgroup/group1/cgroup.freeze

These capabilities make cgroups the foundation of container technologies such as Docker and Kubernetes: they provide complete resource isolation, limiting, and accounting, and are the core resource-management mechanism of modern Linux systems.

            Mar 7, 2024

            IPVS

What is IPVS?

IPVS (IP Virtual Server) is a layer-4 (transport-layer) load balancer built into the Linux kernel and the core component of the LVS (Linux Virtual Server) project.

Basic concepts

• Layer: transport layer (TCP/UDP)
• Implementation: runs in kernel space, high performance
• Function: load-balances TCP/UDP requests across multiple real servers

Core architecture of IPVS

Client request
    ↓
Virtual Service - VIP:Port
    ↓
Load-balancing scheduling algorithm
    ↓
Real server pool (Real Servers)

Main roles of IPVS

1. High-performance load balancing

# IPVS can handle hundreds of thousands of concurrent connections
# and performs noticeably better than iptables

2. Multiple load-balancing algorithms

# check the available scheduling algorithms
grep -i ip_vs /lib/modules/$(uname -r)/modules.builtin

# common algorithms:
rr      # Round Robin
wrr     # Weighted Round Robin
lc      # Least Connection
wlc     # Weighted Least Connection
sh      # Source Hashing
dh      # Destination Hashing

3. Multiple forwarding modes

NAT mode (network address translation)

# both requests and responses pass through the load balancer
# example configuration
ipvsadm -A -t 192.168.1.100:80 -s rr
ipvsadm -a -t 192.168.1.100:80 -r 10.244.1.10:80 -m
ipvsadm -a -t 192.168.1.100:80 -r 10.244.1.11:80 -m

DR mode (direct routing)

# responses return directly to the client, bypassing the load balancer
# high-performance mode
ipvsadm -A -t 192.168.1.100:80 -s rr
ipvsadm -a -t 192.168.1.100:80 -r 10.244.1.10:80 -g
ipvsadm -a -t 192.168.1.100:80 -r 10.244.1.11:80 -g

TUN mode (IP tunneling)

# requests are forwarded through IP tunnels
ipvsadm -A -t 192.168.1.100:80 -s rr
ipvsadm -a -t 192.168.1.100:80 -r 10.244.1.10:80 -i
ipvsadm -a -t 192.168.1.100:80 -r 10.244.1.11:80 -i
IPVS in Kubernetes

Advantages of kube-proxy IPVS mode

# performance comparison
iptables: O(n) chain lookup, performance degrades as the number of rules grows
ipvs:   O(1) hash-table lookup, high performance

IPVS configuration in Kubernetes

# check whether kube-proxy is running in IPVS mode
kubectl -n kube-system get pods -l k8s-app=kube-proxy -o yaml | grep mode

# inspect the IPVS rules
ipvsadm -Ln

Core IPVS features

1. Connection scheduling

# which algorithm suits which scenario
rr      # general purpose, servers with similar capacity
wrr     # servers with noticeably different capacity
lc      # long-lived connections, e.g. databases
sh      # session affinity requirements

2. Health checks

# IPVS itself does not provide health checks;
# pair it with keepalived or another health-check tool

3. Session persistence

# use source-address hashing for session persistence
ipvsadm -A -t 192.168.1.100:80 -s sh

IPVS management commands in detail

Basic operations

# add a virtual service
ipvsadm -A -t|u|f <service-address> [-s scheduler]

# add a real server
ipvsadm -a -t|u|f <service-address> -r <server-address> [-g|i|m] [-w weight]

# example
ipvsadm -A -t 192.168.1.100:80 -s wlc
ipvsadm -a -t 192.168.1.100:80 -r 192.168.1.10:8080 -m -w 1
ipvsadm -a -t 192.168.1.100:80 -r 192.168.1.11:8080 -m -w 2

Monitoring and statistics

# connection statistics
ipvsadm -Ln --stats
ipvsadm -Ln --rate

# current connections
ipvsadm -Lnc

# timeout settings
ipvsadm -L --timeout

IPVS compared with related technologies

IPVS vs iptables

Feature       IPVS                             iptables
Performance   O(1) hash lookup                 O(n) chain lookup
Scale         handles many services            degrades as rules grow
Function      dedicated load balancing         general-purpose firewall
Algorithms    multiple scheduling algorithms   simple round robin

IPVS vs Nginx

Feature       IPVS                     Nginx
Layer         layer 4 (transport)      layer 7 (application)
Performance   kernel space, higher     user space, feature-rich
Function      basic load balancing     content routing, SSL termination, etc.

Practical scenarios

1. Kubernetes Service proxying

# kube-proxy creates IPVS rules for every Service
ipvsadm -Ln
# sample output:
TCP  10.96.0.1:443 rr
  -> 192.168.1.10:6443    Masq    1      0          0
TCP  10.96.0.10:53 rr
  -> 10.244.0.5:53        Masq    1      0          0

2. Highly available load balancing

# combine with keepalived for high availability
# active/standby load balancers + IPVS

3. Database read/write splitting

# distribute database connections with IPVS
ipvsadm -A -t 192.168.1.100:3306 -s lc
ipvsadm -a -t 192.168.1.100:3306 -r 192.168.1.20:3306 -m
ipvsadm -a -t 192.168.1.100:3306 -r 192.168.1.21:3306 -m

Summary

The main uses of IPVS:

1. High-performance load balancing - kernel-level implementation with very high throughput
2. Multiple scheduling algorithms - fits different workloads
3. Multiple forwarding modes - NAT/DR/TUN cover different network requirements
4. Large-scale cluster support - well suited to cloud-native and microservice architectures
5. Kubernetes integration - backs kube-proxy with an efficient Service proxy

In Kubernetes, IPVS mode has a clear performance advantage over iptables mode when there are many Services, and it is the recommended load-balancing option for production environments.

            Mar 7, 2024

            Subsections of Interface

POSIX Standard

            Mar 7, 2024

            Subsections of Scripts

            Create Systemd Service

1. Create the systemd service file
vim /etc/systemd/system/your-service-name.service
2. Add the following content to the file

Run a script

[Unit]
Description=Your Service Description
After=network.target # After: start this unit after the listed targets/services

[Service]
Type=simple  # simple: run a long-lived program | forking: the service forks a child process | oneshot: run once | notify: run and wait for a readiness notification | exec: run a command
User=root
ExecStart=/bin/bash -c "your-bash-command-here"
Restart=always
RestartSec=5

[Install]
WantedBy=multi-user.target # multi-user.target: the target for the multi-user run level

Run a program

            [Unit]
            Description=Backup Service
            After=network.target
            
            [Service]
            Type=simple
            User=root
            ExecStart=/bin/bash -c "tar -czf /backup/backup-$(date +%Y%m%d).tar.gz /home/user/data"
            Restart=on-failure
            
            [Install]
            WantedBy=multi-user.target
3. Start the service
# reload the systemd configuration
sudo systemctl daemon-reload

# start the service
sudo systemctl start your-service-name

# enable it at boot
sudo systemctl enable your-service-name

# check the service status
sudo systemctl status your-service-name

# stop the service
sudo systemctl stop your-service-name

# disable it at boot
sudo systemctl disable your-service-name

# follow the service logs
sudo journalctl -u your-service-name -f
            Mar 14, 2025

            Disable Service

Disable the firewalld, SELinux, dnsmasq and swap services

            systemctl disable --now firewalld 
            systemctl disable --now dnsmasq
            systemctl disable --now NetworkManager
            
            setenforce 0
sed -i 's#^SELINUX=.*#SELINUX=disabled#g' /etc/sysconfig/selinux
sed -i 's#^SELINUX=.*#SELINUX=disabled#g' /etc/selinux/config
            reboot
            getenforce
            
            
            swapoff -a && sysctl -w vm.swappiness=0
            sed -ri '/^[^#]*swap/s@^@#@' /etc/fstab
            Mar 14, 2024

            Free Disk Space

            Cleanup

1. find the 10 biggest files
dnf install ncdu

# find the 10 largest files/directories under the current directory
du -ah . | sort -rh | head -n 10

# find files larger than 100M under the home directory
find ~ -type f -size +100M -exec ls -lh {} \; | awk '{ print $9 ": " $5 }'
2. clean cache
            rm -rf ~/.cache/*
            sudo rm -rf /tmp/*
            sudo rm -rf /var/tmp/*
3. clean images
# remove all stopped containers
podman container prune -f

# remove images not referenced by any container (dangling images)
podman image prune

# more aggressive: remove all images not used by a running container
podman image prune -a

# clear the build cache
podman builder prune

# the most thorough cleanup: remove all stopped containers, unused networks, dangling images and build cache
podman system prune
podman system prune -a # even more thorough: removes all unused images, not just dangling ones
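Before and after pruning, it can help to check how much space the container storage actually uses; `podman system df` summarizes images, containers and volumes:

```shell
# show disk usage of images, containers and local volumes
podman system df
```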
            Mar 14, 2024

            Login Without Pwd

            copy id_rsa to other nodes

            yum install sshpass -y
            mkdir -p /extend/shell
            
            cat >>/extend/shell/distribute_pub.sh<< EOF
            #!/bin/bash
            ROOT_PASS=root123
            ssh-keygen -t rsa -f ~/.ssh/id_rsa -P ''
            for ip in 101 102 103 
            do
            sshpass -p\$ROOT_PASS ssh-copy-id -o StrictHostKeyChecking=no 192.168.29.\$ip
            done
            EOF
            
            cd /extend/shell
            chmod +x distribute_pub.sh
            
            ./distribute_pub.sh
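Once the script has run, password-less login can be verified by running a command on each node non-interactively (the IPs below match the example above):

```shell
for ip in 101 102 103; do
  # BatchMode makes ssh fail instead of prompting for a password
  ssh -o BatchMode=yes 192.168.29.${ip} hostname
done
```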
            Mar 14, 2024

            Set Http Proxy

            [Optional] Install Proxy Server

            go and check http://port.72602.online/ops/hugo/index.html

            Set Http Proxy

            export https_proxy=http://47.110.67.161:30890
            export http_proxy=http://47.110.67.161:30890
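It is usually worth excluding local and in-cluster addresses from the proxy as well; the CIDRs below are common defaults and should be adjusted to your cluster:

```shell
export no_proxy=localhost,127.0.0.1,10.96.0.0/12,10.244.0.0/16,.svc,.cluster.local
export NO_PROXY=$no_proxy
```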

            Use Proxy in Pod
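One way to use the proxy from inside a Pod, sketched below, is to inject the settings as environment variables in the Pod spec (the Pod name, image and proxy address are illustrative):

```shell
kubectl apply -f - <<EOF
apiVersion: v1
kind: Pod
metadata:
  name: proxy-demo
spec:
  containers:
    - name: main
      image: m.daocloud.io/docker.io/library/busybox:latest
      command: ["sh", "-c", "env | grep -i proxy && sleep 3600"]
      env:
        - name: HTTP_PROXY
          value: "http://47.110.67.161:30890"
        - name: HTTPS_PROXY
          value: "http://47.110.67.161:30890"
        - name: NO_PROXY
          value: "localhost,127.0.0.1,10.96.0.0/12,.svc,.cluster.local"
  restartPolicy: Never
EOF
```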

            Mar 14, 2024

            Subsections of 🌐Language

            Subsections of ♨️JAVA

            Subsections of JVM Related

            AOT or JIT

JDK 9 introduced a new compilation mode, AOT (Ahead-of-Time Compilation). Unlike JIT, this mode compiles the program into machine code before it ever runs, i.e. static compilation (the same model used by C, C++, Rust, Go and similar languages). AOT avoids JIT warm-up and its associated overhead, so it shortens Java startup time and removes the long warm-up phase. It also reduces memory footprint and improves security (AOT-compiled code is harder to decompile and tamper with), which makes it particularly well suited to cloud-native scenarios.

In short, AOT's main advantages are startup time, memory footprint and artifact size, while JIT's main advantage is higher peak throughput, which lowers the maximum latency of requests.

https://cn.dubbo.apache.org/zh-cn/blog/2023/06/28/%e8%b5%b0%e5%90%91-native-%e5%8c%96springdubbo-aot-%e6%8a%80%e6%9c%af%e7%a4%ba%e4%be%8b%e4%b8%8e%e5%8e%9f%e7%90%86%e8%ae%b2%e8%a7%a3/

https://mp.weixin.qq.com/s/4haTyXUmh8m-dBQaEzwDJw

If AOT has so many advantages, why not use it everywhere?

As compared above, JIT and AOT each have their strengths; AOT is simply a better fit for today's cloud-native workloads and microservice architectures. Beyond that, AOT compilation cannot support some of Java's dynamic features, such as reflection, dynamic proxies, dynamic class loading and JNI (Java Native Interface). Many frameworks and libraries (such as Spring and CGLIB) rely on exactly these features, so with AOT alone they either cannot be used at all or need dedicated adaptation work. For example, CGLIB dynamic proxies are built on ASM, a technique that essentially generates and loads modified bytecode in memory at run time, that is, proxy classes an AOT compiler cannot know about ahead of time.
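For reference, a minimal sketch of AOT-style compilation with GraalVM's native-image tool (this assumes GraalVM with native-image is installed and app.jar is a self-contained application jar):

```shell
# compile the jar ahead of time into a native executable named "app"
native-image -jar app.jar app

# the resulting binary starts without JIT warm-up
./app
```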

            Mar 7, 2024

Volatile

volatile is a lightweight synchronization mechanism provided by the Java Virtual Machine, with three key properties:

It guarantees visibility

It does not guarantee atomicity

It forbids instruction reordering

            Mar 7, 2024

            🐍Python

              Mar 7, 2024

              🐹Go

                Mar 7, 2024

                Subsections of Design Pattern

                Observers

                Mar 7, 2024

                Subsections of Web Pattern

                HTTP Code

1xx - Informational (provisional responses)

The request has been received and processing continues. You rarely see these in a browser.

• 100 Continue: the client should continue sending the rest of the request. Typically used before POSTing or PUTting a large body, to ask whether the server is willing to accept it.
• 101 Switching Protocols: the client asked to switch protocols (e.g. to WebSocket) and the server has agreed.

2xx - Success (the request succeeded)

The request was successfully received, understood and processed by the server.

• 200 OK: the most common success code. The response body contains the requested data (an HTML page, JSON data, etc.).
• 201 Created: creation succeeded. Usually returned after a POST or PUT, meaning a new resource was created on the server; the Location response header normally contains the new resource's URL.
• 202 Accepted: the request has been accepted but not yet processed. Useful for asynchronous tasks, e.g. "the request has been queued and is being processed".
• 204 No Content: the request succeeded but the response has no body. Common for successful DELETE requests, or AJAX calls where the frontend only needs to know the operation succeeded.

3xx - Redirection (further action required)

The client must take additional action to complete the request, usually by following a redirect.

• 301 Moved Permanently: permanent redirect. The resource has moved to a new URL for good; search engines update their links to the new address, and browsers cache the redirect.
• 302 Found: temporary redirect. The resource is temporarily served from another URL; search engines do not update their links. This is the most common redirect type; the spec says the method must not change, but in practice browsers often switch to GET.
• 304 Not Modified: the resource has not changed. Used for caching: when the client has a cached copy and asks via headers such as If-Modified-Since whether the resource was updated, the server returns 304 to tell the client to use its cache, which saves bandwidth.
• 307 Temporary Redirect: strict temporary redirect. Like 302, but the client must not change the original request method (a POST stays a POST). More standards-compliant than 302.

4xx - Client errors (the request is at fault)

The client appears to have made a mistake and the server cannot process the request.

• 400 Bad Request: the server cannot understand the request because its syntax is invalid, like a sentence with broken grammar.
• 401 Unauthorized: authentication required. The client usually needs to log in or provide a token. Despite the name, this really means "unauthenticated", not "unauthorized".
• 403 Forbidden: the server understands the request but refuses to execute it. Unlike 401, authenticating does not help (e.g. a normal user trying to open an admin page).
• 404 Not Found: the most famous error code. The server cannot find the requested resource; the URL may be wrong or the resource may have been deleted.
• 405 Method Not Allowed: the method in the request line (GET, POST, etc.) cannot be used on this resource, e.g. sending a POST to a URL that only accepts GET.
• 408 Request Timeout: the server waited too long for the client to send the request.
• 409 Conflict: the request conflicts with the server's current state. Common with PUT requests (e.g. a version conflict when modifying a file).
• 429 Too Many Requests: the client sent too many requests within a given time window (rate limiting).

5xx - Server errors (the server failed to process the request)

The server encountered an error or internal failure while handling the request.

• 500 Internal Server Error: the most generic server error. The server hit an unexpected condition it could not handle, typically an uncaught exception in backend code.
• 502 Bad Gateway: acting as a gateway or proxy, the server received an invalid response from the upstream server. Common when the application server behind Nginx (such as PHP-FPM) is down or not started.
• 503 Service Unavailable: the server cannot currently handle the request (overloaded or down for maintenance). Usually temporary; the Retry-After response header may tell the client when to retry.
• 504 Gateway Timeout: acting as a gateway or proxy, the server did not receive a response from the upstream server in time. Common with network latency or a slow upstream service.

Quick reference

| Code | Category     | Meaning               | Typical scenario                                |
|------|--------------|-----------------------|-------------------------------------------------|
| 200  | Success      | OK                    | normal page or data retrieval                   |
| 201  | Success      | Created               | a new user or article was created               |
| 204  | Success      | No Content            | successful delete, or AJAX with no return data  |
| 301  | Redirection  | Moved Permanently     | site redesign, old links permanently forwarded  |
| 302  | Redirection  | Found (temporary)     | redirect back to the home page after login      |
| 304  | Redirection  | Not Modified          | browser cache used, bandwidth saved             |
| 400  | Client error | Bad Request           | malformed request parameters                    |
| 401  | Client error | Unauthorized          | login required                                  |
| 403  | Client error | Forbidden             | insufficient permissions                        |
| 404  | Client error | Not Found             | the requested URL does not exist                |
| 429  | Client error | Too Many Requests     | API rate limit exceeded                         |
| 500  | Server error | Internal Server Error | backend bug, database connection failure        |
| 502  | Server error | Bad Gateway           | Nginx cannot reach the backend service          |
| 503  | Server error | Service Unavailable   | server maintenance or overload                  |
| 504  | Server error | Gateway Timeout       | backend service responds too slowly             |
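When debugging, a quick way to see just the status code a server returns is curl's write-out format; the URL below is only an example:

```shell
# print only the HTTP status code
curl -s -o /dev/null -w "%{http_code}\n" https://example.com

# follow redirects and show the status line of every hop
curl -sIL https://example.com | grep HTTP
```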

Hopefully this list is useful as a quick lookup.

                Mar 7, 2024

                🪀Install Shit

                Aug 7, 2024

                Subsections of 🪀Install Shit

                Subsections of Application

                Datahub

                Preliminary

• Kubernetes has been installed; if not, check 🔗link
• argoCD has been installed; if not, check 🔗link
• Elasticsearch has been installed; if not, check 🔗link
• MariaDB has been installed; if not, check 🔗link
• Kafka has been installed; if not, check 🔗link

                Steps

                1. prepare datahub credentials secret

                kubectl -n application \
                    create secret generic datahub-credentials \
                    --from-literal=mysql-root-password="$(kubectl get secret mariadb-credentials --namespace database -o jsonpath='{.data.mariadb-root-password}' | base64 -d)"
                kubectl -n application \
                    create secret generic datahub-credentials \
                    --from-literal=mysql-root-password="$(kubectl get secret mariadb-credentials --namespace database -o jsonpath='{.data.mariadb-root-password}' | base64 -d)" \
                    --from-literal=security.protocol="SASL_PLAINTEXT" \
                    --from-literal=sasl.mechanism="SCRAM-SHA-256" \
                    --from-literal=sasl.jaas.config="org.apache.kafka.common.security.scram.ScramLoginModule required username=\"user1\" password=\"$(kubectl get secret kafka-user-passwords --namespace database -o jsonpath='{.data.client-passwords}' | base64 -d | cut -d , -f 1)\";"
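Whichever variant you created, it can help to confirm the secret exists and contains the expected keys before deploying:

```shell
kubectl -n application describe secret datahub-credentials
```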

2. prepare deploy-datahub.yaml

                apiVersion: argoproj.io/v1alpha1
                kind: Application
                metadata:
                  name: datahub
                spec:
                  syncPolicy:
                    syncOptions:
                    - CreateNamespace=true
                  project: default
                  source:
                    repoURL: https://helm.datahubproject.io
                    chart: datahub
                    targetRevision: 0.4.8
                    helm:
                      releaseName: datahub
                      values: |
                        global:
                          elasticsearch:
                            host: elastic-search-elasticsearch.application.svc.cluster.local
                            port: 9200
                            skipcheck: "false"
                            insecure: "false"
                            useSSL: "false"
                          kafka:
                            bootstrap:
                              server: kafka.database.svc.cluster.local:9092
                            zookeeper:
                              server: kafka-zookeeper.database.svc.cluster.local:2181
                          sql:
                            datasource:
                              host: mariadb.database.svc.cluster.local:3306
                              hostForMysqlClient: mariadb.database.svc.cluster.local
                              port: 3306
                              url: jdbc:mysql://mariadb.database.svc.cluster.local:3306/datahub?verifyServerCertificate=false&useSSL=true&useUnicode=yes&characterEncoding=UTF-8&enabledTLSProtocols=TLSv1.2
                              driver: com.mysql.cj.jdbc.Driver
                              username: root
                              password:
                                secretRef: datahub-credentials
                                secretKey: mysql-root-password
                        datahub-gms:
                          enabled: true
                          replicaCount: 1
                          image:
                            repository: m.daocloud.io/docker.io/acryldata/datahub-gms
                          service:
                            type: ClusterIP
                          ingress:
                            enabled: false
                        datahub-frontend:
                          enabled: true
                          replicaCount: 1
                          image:
                            repository: m.daocloud.io/docker.io/acryldata/datahub-frontend-react
                          defaultUserCredentials:
                            randomAdminPassword: true
                          service:
                            type: ClusterIP
                          ingress:
                            enabled: true
                            className: nginx
                            annotations:
                              cert-manager.io/cluster-issuer: self-signed-ca-issuer
                            hosts:
                            - host: datahub.dev.geekcity.tech
                              paths:
                              - /
                            tls:
                            - secretName: "datahub.dev.geekcity.tech-tls"
                              hosts:
                              - datahub.dev.geekcity.tech
                        acryl-datahub-actions:
                          enabled: true
                          replicaCount: 1
                          image:
                            repository: m.daocloud.io/docker.io/acryldata/datahub-actions
                        datahub-mae-consumer:
                          replicaCount: 1
                          image:
                            repository: m.daocloud.io/docker.io/acryldata/datahub-mae-consumer
                          ingress:
                            enabled: false
                        datahub-mce-consumer:
                          replicaCount: 1
                          image:
                            repository: m.daocloud.io/docker.io/acryldata/datahub-mce-consumer
                          ingress:
                            enabled: false
                        datahub-ingestion-cron:
                          enabled: false
                          image:
                            repository: m.daocloud.io/docker.io/acryldata/datahub-ingestion
                        elasticsearchSetupJob:
                          enabled: true
                          image:
                            repository: m.daocloud.io/docker.io/acryldata/datahub-elasticsearch-setup
                        kafkaSetupJob:
                          enabled: true
                          image:
                            repository: m.daocloud.io/docker.io/acryldata/datahub-kafka-setup
                        mysqlSetupJob:
                          enabled: true
                          image:
                            repository: m.daocloud.io/docker.io/acryldata/datahub-mysql-setup
                        postgresqlSetupJob:
                          enabled: false
                          image:
                            repository: m.daocloud.io/docker.io/acryldata/datahub-postgres-setup
                        datahubUpgrade:
                          enabled: true
                          image:
                            repository: m.daocloud.io/docker.io/acryldata/datahub-upgrade
                        datahubSystemUpdate:
                          image:
                            repository: m.daocloud.io/docker.io/acryldata/datahub-upgrade
                  destination:
                    server: https://kubernetes.default.svc
                    namespace: application
                apiVersion: argoproj.io/v1alpha1
                kind: Application
                metadata:
                  name: datahub
                spec:
                  syncPolicy:
                    syncOptions:
                    - CreateNamespace=true
                  project: default
                  source:
                    repoURL: https://helm.datahubproject.io
                    chart: datahub
                    targetRevision: 0.4.8
                    helm:
                      releaseName: datahub
                      values: |
                        global:
                          springKafkaConfigurationOverrides:
                            security.protocol: SASL_PLAINTEXT
                            sasl.mechanism: SCRAM-SHA-256
                          credentialsAndCertsSecrets:
                            name: datahub-credentials
                            secureEnv:
                              sasl.jaas.config: sasl.jaas.config
                          elasticsearch:
                            host: elastic-search-elasticsearch.application.svc.cluster.local
                            port: 9200
                            skipcheck: "false"
                            insecure: "false"
                            useSSL: "false"
                          kafka:
                            bootstrap:
                              server: kafka.database.svc.cluster.local:9092
                            zookeeper:
                              server: kafka-zookeeper.database.svc.cluster.local:2181
                          neo4j:
                            host: neo4j.database.svc.cluster.local:7474
                            uri: bolt://neo4j.database.svc.cluster.local
                            username: neo4j
                            password:
                              secretRef: datahub-credentials
                              secretKey: neo4j-password
                          sql:
                            datasource:
                              host: mariadb.database.svc.cluster.local:3306
                              hostForMysqlClient: mariadb.database.svc.cluster.local
                              port: 3306
                              url: jdbc:mysql://mariadb.database.svc.cluster.local:3306/datahub?verifyServerCertificate=false&useSSL=true&useUnicode=yes&characterEncoding=UTF-8&enabledTLSProtocols=TLSv1.2
                              driver: com.mysql.cj.jdbc.Driver
                              username: root
                              password:
                                secretRef: datahub-credentials
                                secretKey: mysql-root-password
                        datahub-gms:
                          enabled: true
                          replicaCount: 1
                          image:
                            repository: m.daocloud.io/docker.io/acryldata/datahub-gms
                          service:
                            type: ClusterIP
                          ingress:
                            enabled: false
                        datahub-frontend:
                          enabled: true
                          replicaCount: 1
                          image:
                            repository: m.daocloud.io/docker.io/acryldata/datahub-frontend-react
                          defaultUserCredentials:
                            randomAdminPassword: true
                          service:
                            type: ClusterIP
                          ingress:
                            enabled: true
                            className: nginx
                            annotations:
                              cert-manager.io/cluster-issuer: self-signed-ca-issuer
                            hosts:
                            - host: datahub.dev.geekcity.tech
                              paths:
                              - /
                            tls:
                            - secretName: "datahub.dev.geekcity.tech-tls"
                              hosts:
                              - datahub.dev.geekcity.tech
                        acryl-datahub-actions:
                          enabled: true
                          replicaCount: 1
                          image:
                            repository: m.daocloud.io/docker.io/acryldata/datahub-actions
                        datahub-mae-consumer:
                          replicaCount: 1
                          image:
                            repository: m.daocloud.io/docker.io/acryldata/datahub-mae-consumer
                          ingress:
                            enabled: false
                        datahub-mce-consumer:
                          replicaCount: 1
                          image:
                            repository: m.daocloud.io/docker.io/acryldata/datahub-mce-consumer
                          ingress:
                            enabled: false
                        datahub-ingestion-cron:
                          enabled: false
                          image:
                            repository: m.daocloud.io/docker.io/acryldata/datahub-ingestion
                        elasticsearchSetupJob:
                          enabled: true
                          image:
                            repository: m.daocloud.io/docker.io/acryldata/datahub-elasticsearch-setup
                        kafkaSetupJob:
                          enabled: true
                          image:
                            repository: m.daocloud.io/docker.io/acryldata/datahub-kafka-setup
                        mysqlSetupJob:
                          enabled: true
                          image:
                            repository: m.daocloud.io/docker.io/acryldata/datahub-mysql-setup
                        postgresqlSetupJob:
                          enabled: false
                          image:
                            repository: m.daocloud.io/docker.io/acryldata/datahub-postgres-setup
                        datahubUpgrade:
                          enabled: true
                          image:
                            repository: m.daocloud.io/docker.io/acryldata/datahub-upgrade
                        datahubSystemUpdate:
                          image:
                            repository: m.daocloud.io/docker.io/acryldata/datahub-upgrade
                  destination:
                    server: https://kubernetes.default.svc
                    namespace: application
If you want to start one more GMS with standalone consumers, add this under global:

                  datahub_standalone_consumers_enabled: true

                3. apply to k8s

                kubectl -n argocd apply -f deploy-datahub.yaml

                4. sync by argocd

                argocd app sync argocd/datahub

5. extract credentials

                kubectl -n application get secret datahub-user-secret -o jsonpath='{.data.user\.props}' | base64 -d

[Optional] Visit through browser

                add $K8S_MASTER_IP datahub.dev.geekcity.tech to /etc/hosts
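For a test cluster this can be done in one line (assuming $K8S_MASTER_IP is set to your control-plane node's IP):

```shell
echo "$K8S_MASTER_IP datahub.dev.geekcity.tech" | sudo tee -a /etc/hosts
```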

[Optional] Visit through Datahub CLI

                We recommend Python virtual environments (venv-s) to namespace pip modules. Here’s an example setup:

                python3 -m venv venv             # create the environment
                source venv/bin/activate         # activate the environment

                NOTE: If you install datahub in a virtual environment, that same virtual environment must be re-activated each time a shell window or session is created.

                Once inside the virtual environment, install datahub using the following commands

                # Requires Python 3.8+
                python3 -m pip install --upgrade pip wheel setuptools
                python3 -m pip install --upgrade acryl-datahub
                # validate that the install was successful
                datahub version
                # If you see "command not found", try running this instead: python3 -m datahub version
                datahub init
                # authenticate your datahub CLI with your datahub instance
                Mar 7, 2024

                N8N

                🚀Installation

                Install By

1.prepare `xxxxx-credentials.yaml`

                Details

                2.prepare `deploy-xxxxx.yaml`

                Details
                kubectl -n argocd apply -f - << EOF
                apiVersion: argoproj.io/v1alpha1
                kind: Application
                metadata:
                  name: n8n
                spec:
                  project: default
                  source:
                    repoURL: https://aaronyang0628.github.io/helm-chart-mirror/charts
                    chart: n8n
                    targetRevision: 1.16.1
                    helm:
                      releaseName: n8n
                      values: |
                        global:
                          security:
                            allowInsecureImages: true
                        image:
                          repository: m.daocloud.io/docker.io/n8nio/n8n
                          tag: 1.119.1-amd64
                        log:
                          level: info
                        encryptionKey: "72602-n8n"
                        timezone: Asia/Shanghai
                        db:
                          type: postgresdb
                        externalPostgresql:
                          host: postgresql-hl.database.svc.cluster.local
                          port: 5432
                          username: "n8n"
                          database: "n8n"
                          existingSecret: "n8n-middleware-credential"
                        main:
                          count: 1
                          extraEnvVars:
                            "N8N_BLOCK_ENV_ACCESS_IN_NODE": "false"
                            "EXECUTIONS_TIMEOUT": "300"
                            "EXECUTIONS_TIMEOUT_MAX": "600"
                            "DB_POSTGRESDB_POOL_SIZE": "10"
                            "CACHE_ENABLED": "true"
                            "N8N_CONCURRENCY_PRODUCTION_LIMIT": "5"
                            "N8N_SECURE_COOKIE": "false"
                            "WEBHOOK_URL": "https://webhook.72602.online"
                            "QUEUE_BULL_REDIS_TIMEOUT_THRESHOLD": "30000"
                            "N8N_COMMUNITY_PACKAGES_ENABLED": "false"
                            "N8N_GIT_NODE_DISABLE_BARE_REPOS": "true"
                          persistence:
                            enabled: true
                            accessMode: ReadWriteOnce
                            storageClass: "local-path"
                            size: 5Gi
                          resources:
                            requests:
                              cpu: 1000m
                              memory: 1024Mi
                            limits:
                              cpu: 2000m
                              memory: 2048Mi
                        worker:
                          mode: queue
                          count: 2
                          waitMainNodeReady:
                            enabled: false
                          extraEnvVars:
                            "EXECUTIONS_TIMEOUT": "300"
                            "EXECUTIONS_TIMEOUT_MAX": "600"
                            "DB_POSTGRESDB_POOL_SIZE": "5"
                            "QUEUE_BULL_REDIS_TIMEOUT_THRESHOLD": "30000"
                            "N8N_GIT_NODE_DISABLE_BARE_REPOS": "true"
                          persistence:
                            enabled: true
                            accessMode: ReadWriteOnce
                            storageClass: "local-path"
                            size: 5Gi
                          resources:
                            requests:
                              cpu: 500m
                              memory: 1024Mi
                            limits:
                              cpu: 1000m
                              memory: 2048Mi
                        redis:
                          enabled: true
                          image:
                            registry: m.daocloud.io/docker.io
                            repository: bitnamilegacy/redis
                          master:
                            persistence:
                              enabled: true
                              accessMode: ReadWriteOnce
                              storageClass: "local-path"
                              size: 2Gi
                        ingress:
                          enabled: true
                          className: nginx
                          annotations:
                            kubernetes.io/ingress.class: nginx
                            cert-manager.io/cluster-issuer: letsencrypt
                            nginx.ingress.kubernetes.io/proxy-connect-timeout: "300"
                            nginx.ingress.kubernetes.io/proxy-send-timeout: "300"
                            nginx.ingress.kubernetes.io/proxy-read-timeout: "300"
                            nginx.ingress.kubernetes.io/proxy-body-size: "50m"
                            nginx.ingress.kubernetes.io/upstream-keepalive-connections: "50"
                            nginx.ingress.kubernetes.io/upstream-keepalive-timeout: "60"
                          hosts:
                            - host: n8n.72602.online
                              paths:
                                - path: /
                                  pathType: Prefix
                          tls:
                          - hosts:
                            - n8n.72602.online
                            secretName: n8n.72602.online-tls
                        webhook:
                          mode: queue
                          url: "https://webhook.72602.online"
                          autoscaling:
                            enabled: false
                          waitMainNodeReady:
                            enabled: true
                          resources:
                            requests:
                              cpu: 200m
                              memory: 256Mi
                            limits:
                              cpu: 512m
                              memory: 512Mi
                  syncPolicy:
                    automated:
                      prune: true
                      selfHeal: true
                    syncOptions:
                      - CreateNamespace=true
                      - ApplyOutOfSyncOnly=true
                  destination:
                    server: https://kubernetes.default.svc
                    namespace: n8n
EOF
kubectl -n argocd apply -f - << EOF
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: n8n
spec:
  project: default
  source:
    repoURL: https://aaronyang0628.github.io/helm-chart-mirror/charts
                    chart: n8n
                    targetRevision: 1.16.1
                    helm:
                      releaseName: n8n
                      values: |
                        image:
                          repository: m.daocloud.io/docker.io/n8nio/n8n
                          tag: 1.119.1-amd64
                        log:
                          level: info
                        encryptionKey: 72602-aaron
                        db:
                          type: postgresdb
                        externalPostgresql:
                          host: postgresql.database.svc.cluster.local
                          port: 5432
                          username: "postgres.kconxfeltufjzqtjznfb"
                          database: "postgres"
                          existingSecret: "n8n-middleware-credential"
                        main:
                          count: 1
                          persistence:
                            enabled: true
                            accessMode: ReadWriteOnce
                            storageClass: "local-path"
                            size: 5Gi
                          resources:
                            requests:
                              cpu: 100m
                              memory: 128Mi
                            limits:
                              cpu: 512m
                              memory: 512Mi
                        worker:
                          mode: queue
                          count: 2
                          waitMainNodeReady:
                            enabled: true
                          persistence:
                            enabled: true
                            accessMode: ReadWriteOnce
                            storageClass: "local-path"
                            size: 5Gi
                          resources:
                            requests:
                              cpu: 500m
                              memory: 250Mi
                            limits:
                              cpu: 1000m
                              memory: 1024Mi
                        externalRedis:
                          host: redis.72602.online
                          port: 30679
                          existingSecret: n8n-middleware-credential
                        ingress:
                          enabled: true
                          className: nginx
                          annotations:
                            kubernetes.io/ingress.class: nginx
                            cert-manager.io/cluster-issuer: letsencrypt
                          hosts:
                            - host: n8n.72602.online
                              paths:
                                - path: /
                                  pathType: Prefix
                          tls:
                          - hosts:
                            - n8n.72602.online
                            secretName: n8n.72602.online-tls
                        webhook:
                          mode: queue
                          url: "https://webhook.72602.online"
                          autoscaling:
                            enabled: false
                          waitMainNodeReady:
                            enabled: true
                          resources:
                            requests:
                              cpu: 100m
                              memory: 128Mi
                            limits:
                              cpu: 512m
                              memory: 512Mi
                  syncPolicy:
                    automated:
                      prune: true
                      selfHeal: true
                    syncOptions:
                      - CreateNamespace=true
                      - ApplyOutOfSyncOnly=true
                  destination:
                    server: https://kubernetes.default.svc
                    namespace: n8n
                EOF

                3.sync by argocd

                Details
                argocd app sync argocd/xxxx
                Using AY Helm Mirror
                Using AY ACR Image Mirror
                Using DaoCloud Mirror

                🛎️FAQ

                Q1: Show me almost endless possibilities

                You can add standard markdown syntax:

                • multiple paragraphs
                • bullet point lists
                • emphasized, bold and even bold emphasized text
                • links
                • etc.
                ...and even source code

                the possibilities are endless (almost - including other shortcodes may or may not work)

                Q2: Show me almost endless possibilities

                You can add standard markdown syntax:

                • multiple paragraphs
                • bullet point lists
                • emphasized, bold and even bold emphasized text
                • links
                • etc.
                ...and even source code

                the possibilities are endless (almost - including other shortcodes may or may not work)

                Mar 7, 2024

                Wechat Markdown Editor

                🚀Installation

                Install By

                1.get helm repo

                Details
                helm repo add xxxxx https://xxxx
                helm repo update

                2.install chart

                Details
                helm install xxxxx/chart-name --generate-name --version a.b.c
                Using AY Helm Mirror

1.prepare `xxxxx-credentials.yaml`

                Details

                2.prepare `deploy-xxxxx.yaml`

                Details
                kubectl -n argocd apply -f -<< EOF
                apiVersion: argoproj.io/v1alpha1
                kind: Application
                metadata:
                  name: xxxx
                spec:
                  project: default
                  source:
                    repoURL: https://xxxxx
                    chart: xxxx
                    targetRevision: a.b.c
                EOF

                3.sync by argocd

                Details
                argocd app sync argocd/xxxx
                Using AY Helm Mirror
                Using AY ACR Image Mirror
                Using DaoCloud Mirror

                1.init server

                Details
                Using AY ACR Image Mirror
                Using DaoCloud Mirror

                1.init server

                Details
                Using AY ACR Image Mirror
                Using DaoCloud Mirror

                🛎️FAQ

                Q1: Show me almost endless possibilities

                You can add standard markdown syntax:

                • multiple paragraphs
                • bullet point lists
                • emphasized, bold and even bold emphasized text
                • links
                • etc.
                ...and even source code

                the possibilities are endless (almost - including other shortcodes may or may not work)

                Q2: Show me almost endless possibilities

                You can add standard markdown syntax:

                • multiple paragraphs
                • bullet point lists
                • emphasized, bold and even bold emphasized text
                • links
                • etc.
                ...and even source code

                the possibilities are endless (almost - including other shortcodes may or may not work)

                Mar 7, 2024

                Subsections of Auth

                Deploy GateKeeper Server

                Official Website: https://open-policy-agent.github.io/gatekeeper/website/

                Preliminary

• Kubernetes version must be greater than v1.16

                Components

Gatekeeper is a Kubernetes admission controller built on Open Policy Agent (OPA). It lets users define and enforce custom policies that control how resources in the cluster are created, updated and deleted.

• Core components
  • Constraint Templates: define the rule logic of a policy, written in Rego. A template is an abstract, reusable policy definition that multiple constraint instances can share.
  • Constraints (instances): concrete policies created from a constraint template, with specific parameters and match rules that determine which resources the policy applies to.
  • Admission controller (no modification required): intercepts requests to the Kubernetes API server and evaluates them against the defined constraints; any request that violates a constraint is rejected.
    Core Pods and their roles


• gatekeeper-audit
  • Periodic compliance checks: at a configured interval this component scans all existing resources in the cluster and checks whether they comply with the defined constraints (periodic, batch checking).
  • Audit reports: after each scan, gatekeeper-audit produces a detailed audit report stating which resources violate which constraints, so administrators can easily see the cluster's compliance status.
• gatekeeper-controller-manager
  • Real-time admission control: as the admission controller, gatekeeper-controller-manager intercepts create, update and delete requests as they happen and evaluates the resources in each request against the constraint templates and constraints (real-time, event-driven).
  • Decision handling: based on the evaluation, the request is allowed to proceed if the resource satisfies all constraints; if any rule is violated, the request is rejected so that non-compliant resources never enter the cluster.

                Features

1. Constraint management

  • Custom constraint templates: users can write their own constraint templates in Rego to implement complex policy logic of all kinds.

    For example, a policy can require every Namespace to carry specific labels, or restrict certain namespaces to specific images.

View existing constraint templates and constraints
    ```shell
    kubectl get constrainttemplates
    kubectl get constraints
    ```

    ```shell
    kubectl apply -f - <<EOF
    apiVersion: templates.gatekeeper.sh/v1
    kind: ConstraintTemplate
    metadata:
      name: k8srequiredlabels
    spec:
      crd:
        spec:
          names:
            kind: K8sRequiredLabels
          validation:
            openAPIV3Schema:
              type: object
              properties:
                labels:
                  type: array
                  items:
                    type: string
      targets:
        - target: admission.k8s.gatekeeper.sh
          rego: |
            package k8srequiredlabels

            violation[{"msg": msg, "details": {"missing_labels": missing}}] {
              provided := {label | input.review.object.metadata.labels[label]}
              required := {label | label := input.parameters.labels[_]}
              missing := required - provided
              count(missing) > 0
              msg := sprintf("you must provide labels: %v", [missing])
            }
    EOF
    ```
                    

  • Constraint template reuse: a constraint template can be reused by multiple constraint instances, which makes policies easier to maintain and reuse.

    For example, you can create one generic label-constraint template and then create different constraint instances in different Namespaces, each requiring different labels.

    Example constraint instance
        Require every Namespace to carry the label "gatekeeper".

        ```yaml
        apiVersion: constraints.gatekeeper.sh/v1beta1
        kind: K8sRequiredLabels
        metadata:
          name: ns-must-have-gk-label
        spec:
          enforcementAction: dryrun
          match:
            kinds:
              - apiGroups: [""]
                kinds: ["Namespace"]
          parameters:
            labels: ["gatekeeper"]
        ```
                    

  • Constraint updates: when a constraint template or constraint is updated, Gatekeeper automatically re-evaluates all affected resources so the policy takes effect immediately.

2. Resource control

  • Admission interception: when a resource create or update request arrives, Gatekeeper intercepts it in real time and evaluates it against the policies. If the request violates a policy it is rejected immediately, with a detailed error message that helps the user locate the problem quickly.

  • Restricting resource creation and updates: Gatekeeper can block create and update requests that do not comply with policy.

    For example, if a policy requires every Deployment to set resource requests and limits, an attempt to create or update a Deployment without them will be rejected.

    This behaviour is controlled by enforcementAction, which can be: dryrun | deny | warn

    check https://open-policy-agent.github.io/gatekeeper-library/website/validation/containerlimits

  • Resource type filtering: the match field of a constraint specifies which resource types and namespaces the policy applies to.

    For example, a policy can target only Pods in a particular namespace, or only resources of a particular API group and version.

    A syncSet (sync configuration) can be used to specify which resources are synced and which are ignored.

Sync all Namespaces and Pods, ignoring namespaces whose names start with "kube"
    ```yaml
    apiVersion: config.gatekeeper.sh/v1alpha1
    kind: Config
    metadata:
      name: config
      namespace: "gatekeeper-system"
    spec:
      sync:
        syncOnly:
          - group: ""
            version: "v1"
            kind: "Namespace"
          - group: ""
            version: "v1"
            kind: "Pod"
      match:
        - excludedNamespaces: ["kube-*"]
          processes: ["*"]
    ```
                    

3. Compliance assurance

  • Industry standards and custom rules: Gatekeeper can ensure that resources in the cluster comply with industry standards and with the internal security rules administrators require.

    For example, policies can require all containers to run with the latest security patches, or require all storage volumes to be encrypted.

    Gatekeeper already ships close to 50 ready-made constraint policies for various resource restrictions; see https://open-policy-agent.github.io/gatekeeper-library/website/ to browse and obtain them.

  • Audit and reporting: Gatekeeper records every policy evaluation, which makes auditing and reporting straightforward. By reading the audit log, administrators can see which resources violated which policies.

  • Audit export: the audit log can be exported and fed into downstream systems.

    See https://open-policy-agent.github.io/gatekeeper/website/docs/pubsub/ for details.

                Installation

                install from
                kubectl apply -f https://raw.githubusercontent.com/open-policy-agent/gatekeeper/v3.18.2/deploy/gatekeeper.yaml
                helm repo add gatekeeper https://open-policy-agent.github.io/gatekeeper/charts
                helm install gatekeeper/gatekeeper --name-template=gatekeeper --namespace gatekeeper-system --create-namespace

                Make sure that:

                • You have Docker version 20.10 or later installed.
                • Your kubectl context is set to the desired installation cluster.
                • You have a container registry you can write to that is readable by the target cluster.
                git clone https://github.com/open-policy-agent/gatekeeper.git \
                && cd gatekeeper 
                • Build and push Gatekeeper image:
                export DESTINATION_GATEKEEPER_IMAGE=<add registry like "myregistry.docker.io/gatekeeper">
                make docker-buildx REPOSITORY=$DESTINATION_GATEKEEPER_IMAGE OUTPUT_TYPE=type=registry
                • And the deploy
                make deploy REPOSITORY=$DESTINATION_GATEKEEPER_IMAGE
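Whichever install method you used, it is worth confirming that the Gatekeeper Pods are running and the CRDs are registered before applying any templates:

```shell
kubectl -n gatekeeper-system get pods
kubectl get crd | grep gatekeeper
```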
                Mar 12, 2024

                Subsections of Binary

                Argo Workflow Binary

                MIRROR="files.m.daocloud.io/"
                VERSION=v3.5.4
                curl -sSLo argo-linux-amd64.gz "https://${MIRROR}github.com/argoproj/argo-workflows/releases/download/${VERSION}/argo-linux-amd64.gz"
                gunzip argo-linux-amd64.gz
                chmod u+x argo-linux-amd64
                mkdir -p ${HOME}/bin
                mv -f argo-linux-amd64 ${HOME}/bin/argo
                rm -f argo-linux-amd64.gz
                Apr 7, 2024

                ArgoCD Binary

                MIRROR="files.m.daocloud.io/"
                VERSION=v3.1.8
                [ $(uname -m) = x86_64 ] && curl -sSLo argocd "https://${MIRROR}github.com/argoproj/argo-cd/releases/download/${VERSION}/argocd-linux-amd64"
                [ $(uname -m) = aarch64 ] && curl -sSLo argocd "https://${MIRROR}github.com/argoproj/argo-cd/releases/download/${VERSION}/argocd-linux-arm64"
                chmod u+x argocd
                mkdir -p ${HOME}/bin
                mv -f argocd ${HOME}/bin

                [Optional] add to PATH

                cat >> ~/.bashrc  << EOF
                export PATH=$PATH:/root/bin
                EOF
                source ~/.bashrc
                Apr 7, 2024

                Golang Binary

# sudo rm -rf /usr/local/go  # remove any previous installation
                wget https://go.dev/dl/go1.24.4.linux-amd64.tar.gz
                tar -C /usr/local -xzf go1.24.4.linux-amd64.tar.gz
                vim ~/.bashrc
                export PATH=$PATH:/usr/local/go/bin
                source ~/.bashrc
                rm -rf ./go1.24.4.linux-amd64.tar.gz
                Apr 7, 2024

                Gradle Binary

                MIRROR="files.m.daocloud.io/"
                VERSION=v3.5.4
                curl -sSLo argo-linux-amd64.gz "https://${MIRROR}github.com/argoproj/argo-workflows/releases/download/${VERSION}/argo-linux-amd64.gz"
                gunzip argo-linux-amd64.gz
                chmod u+x argo-linux-amd64
                mkdir -p ${HOME}/bin
                mv -f argo-linux-amd64 ${HOME}/bin/argo
                rm -f argo-linux-amd64.gz
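A minimal sketch for installing a Gradle binary distribution (the version is only an example; adjust it as needed):

```shell
VERSION=8.7
curl -sSLo gradle-${VERSION}-bin.zip "https://services.gradle.org/distributions/gradle-${VERSION}-bin.zip"
unzip -q gradle-${VERSION}-bin.zip -d /opt/gradle
export PATH=$PATH:/opt/gradle/gradle-${VERSION}/bin
gradle --version
rm -f gradle-${VERSION}-bin.zip
```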
                Apr 7, 2024

                Helm Binary

                ARCH_IN_FILE_NAME=linux-amd64
                FILE_NAME=helm-v3.18.3-${ARCH_IN_FILE_NAME}.tar.gz
                curl -sSLo ${FILE_NAME} "https://files.m.daocloud.io/get.helm.sh/${FILE_NAME}"
                tar zxf ${FILE_NAME}
                mkdir -p ${HOME}/bin
                mv -f ${ARCH_IN_FILE_NAME}/helm ${HOME}/bin
                rm -rf ./${FILE_NAME}
                rm -rf ./${ARCH_IN_FILE_NAME}
                chmod u+x ${HOME}/bin/helm
                Apr 7, 2024

                JQ Binary

JQ_VERSION=1.7.1
JQ_BINARY=jq-linux-amd64
wget https://github.com/jqlang/jq/releases/download/jq-${JQ_VERSION}/${JQ_BINARY} -O /usr/bin/jq && chmod +x /usr/bin/jq
                Apr 7, 2024

                Kind Binary

                MIRROR="files.m.daocloud.io/"
                VERSION=v0.29.0
                [ $(uname -m) = x86_64 ] && curl -sSLo kind "https://${MIRROR}github.com/kubernetes-sigs/kind/releases/download/${VERSION}/kind-linux-amd64"
                [ $(uname -m) = aarch64 ] && curl -sSLo kind "https://${MIRROR}github.com/kubernetes-sigs/kind/releases/download/${VERSION}/kind-linux-arm64"
                chmod u+x kind
                mkdir -p ${HOME}/bin
                mv -f kind ${HOME}/bin
                Apr 7, 2025

                Krew Binary

                cd "$(mktemp -d)" &&
                OS="$(uname | tr '[:upper:]' '[:lower:]')" &&
                ARCH="$(uname -m | sed -e 's/x86_64/amd64/' -e 's/\(arm\)\(64\)\?.*/\1\2/' -e 's/aarch64$/arm64/')" &&
                KREW="krew-${OS}_${ARCH}" &&
                curl -fsSLO "https://github.com/kubernetes-sigs/krew/releases/latest/download/${KREW}.tar.gz" &&
                tar zxvf "${KREW}.tar.gz" &&
                ./"${KREW}" install krew
                Apr 7, 2024

                Kubectl Binary

                MIRROR="files.m.daocloud.io/"
                VERSION=$(curl -L -s https://${MIRROR}dl.k8s.io/release/stable.txt)
                [ $(uname -m) = x86_64 ] && curl -sSLo kubectl "https://${MIRROR}dl.k8s.io/release/${VERSION}/bin/linux/amd64/kubectl"
                [ $(uname -m) = aarch64 ] && curl -sSLo kubectl "https://${MIRROR}dl.k8s.io/release/${VERSION}/bin/linux/arm64/kubectl"
                chmod u+x kubectl
                mkdir -p ${HOME}/bin
                mv -f kubectl ${HOME}/bin
                Apr 7, 2024

                Kustomize Binary

                MIRROR="github.com"
                VERSION="v5.7.1"
                [ $(uname -m) = x86_64 ] && curl -sSLo kustomize "https:///${MIRROR}/kubernetes-sigs/kustomize/releases/download/kustomize/${VERSION}/kustomize_${VERSION}_linux_amd64.tar.gz"
                [ $(uname -m) = aarch64 ] && curl -sSLo kustomize "https:///${MIRROR}/kubernetes-sigs/kustomize/releases/download/kustomize/${VERSION}/kustomize_${VERSION}_linux_arm64.tar.gz"
                chmod u+x kustomize
                mkdir -p ${HOME}/bin
                mv -f kustomize ${HOME}/bin
                Apr 7, 2024

                Maven Binary

                wget https://dlcdn.apache.org/maven/maven-3/3.9.6/binaries/apache-maven-3.9.6-bin.tar.gz
                tar xzf apache-maven-3.9.6-bin.tar.gz -C /usr/local
                ln -sfn /usr/local/apache-maven-3.9.6/bin/mvn /root/bin/mvn  
                export PATH=$PATH:/usr/local/apache-maven-3.9.6/bin
                source ~/.bashrc
                Apr 7, 2024

                Minikube Binary

                MIRROR="files.m.daocloud.io/"
                [ $(uname -m) = x86_64 ] && curl -sSLo minikube "https://${MIRROR}storage.googleapis.com/minikube/releases/latest/minikube-linux-amd64"
                [ $(uname -m) = aarch64 ] && curl -sSLo minikube "https://${MIRROR}storage.googleapis.com/minikube/releases/latest/minikube-linux-arm64"
                chmod u+x minikube
                mkdir -p ${HOME}/bin
                mv -f minikube ${HOME}/bin
                Apr 7, 2024

                Open Java

                mkdir -p /etc/apt/keyrings && \
                wget -qO - https://packages.adoptium.net/artifactory/api/gpg/key/public | gpg --dearmor -o /etc/apt/keyrings/adoptium.gpg && \
                echo "deb [signed-by=/etc/apt/keyrings/adoptium.gpg arch=amd64] https://packages.adoptium.net/artifactory/deb $(awk -F= '/^VERSION_CODENAME/{print$2}' /etc/os-release) main" | tee /etc/apt/sources.list.d/adoptium.list > /dev/null && \
                apt-get update && \
                apt-get install -y temurin-21-jdk && \
                apt-get clean && \
                rm -rf /var/lib/apt/lists/*
                Apr 7, 2025

                YQ Binary

                YQ_VERSION=v4.40.5
                YQ_BINARY=yq_linux_amd64
                wget https://github.com/mikefarah/yq/releases/download/${YQ_VERSION}/${YQ_BINARY}.tar.gz -O - | tar xz && mv ${YQ_BINARY} /usr/bin/yq
                Apr 7, 2024

                CICD

                Articles

FAQ

Q1: difference between docker / podman / buildah

                You can add standard markdown syntax:

                • multiple paragraphs
                • bullet point lists
                • emphasized, bold and even bold emphasized text
                • links
                • etc.
                ...and even source code

                the possibilities are endless (almost - including other shortcodes may or may not work)

                Mar 7, 2025

                Subsections of CICD

                Install Argo CD

                Preliminary

• Kubernetes has been installed; if not, check 🔗link
• Helm binary has been installed; if not, check 🔗link

                1. install argoCD binary

                2. install components

                Install By
                1. Prepare argocd.values.yaml
                crds:
                  install: true
                  keep: false
                global:
                  domain: argo-cd.ay.dev
                  revisionHistoryLimit: 3
                  image:
                    repository: m.daocloud.io/quay.io/argoproj/argocd
                    imagePullPolicy: IfNotPresent
                redis:
                  enabled: true
                  image:
                    repository: m.daocloud.io/docker.io/library/redis
                  exporter:
                    enabled: false
                    image:
                      repository: m.daocloud.io/bitnami/redis-exporter
                  metrics:
                    enabled: false
                redis-ha:
                  enabled: false
                  image:
                    repository: m.daocloud.io/docker.io/library/redis
                  configmapTest:
                    repository: m.daocloud.io/docker.io/koalaman/shellcheck
                  haproxy:
                    enabled: false
                    image:
                      repository: m.daocloud.io/docker.io/library/haproxy
                  exporter:
                    enabled: false
                    image: m.daocloud.io/docker.io/oliver006/redis_exporter
                dex:
                  enabled: true
                  image:
                    repository: m.daocloud.io/ghcr.io/dexidp/dex
                server:
                  ingress:
                    enabled: true
                    ingressClassName: nginx
                    annotations:
                      nginx.ingress.kubernetes.io/ssl-passthrough: "true"
                      cert-manager.io/cluster-issuer: self-signed-ca-issuer
                      nginx.ingress.kubernetes.io/backend-protocol: HTTPS
                    hostname: argo-cd.ay.dev
                    path: /
                    pathType: Prefix
                    tls: true
                
                2. Install argoCD
                helm upgrade --install argo-cd argo-cd \
                  --namespace argocd \
                  --create-namespace \
                  --version 8.3.5 \
                  --repo https://aaronyang0628.github.io/helm-chart-mirror/charts \
                  --values argocd.values.yaml \
                  --atomic
                
                helm install argo-cd argo-cd \
                  --namespace argocd \
                  --create-namespace \
                  --version 8.3.5 \
                  --repo https://argoproj.github.io/argo-helm \
                  --values argocd.values.yaml \
                  --atomic
                

By default, you can install Argo CD from the official manifest:

                kubectl create namespace argocd \
                && kubectl apply -n argocd -f https://raw.githubusercontent.com/argoproj/argo-cd/stable/manifests/install.yaml

Or, you can use your own manifest file link.
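If you only need quick access to the UI at this point, a simple alternative to the NodePort service below is a port-forward (assuming the service name argocd-server created by the official manifest):

kubectl -n argocd port-forward svc/argocd-server 8080:443
open https://localhost:8080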

                4. prepare argocd-server-external.yaml

                Install By
                kubectl -n argocd apply -f - <<EOF
                apiVersion: v1
                kind: Service
                metadata:
                  labels:
                    app.kubernetes.io/component: server
                    app.kubernetes.io/instance: argo-cd
                    app.kubernetes.io/name: argocd-server-external
                    app.kubernetes.io/part-of: argocd
                  name: argocd-server-external
                spec:
                  ports:
                  - name: https
                    port: 443
                    protocol: TCP
                    targetPort: 8080
                    nodePort: 30443
                  selector:
                    app.kubernetes.io/instance: argo-cd
                    app.kubernetes.io/name: argocd-server
                  type: NodePort
                EOF
                kubectl -n argocd apply -f - <<EOF
                apiVersion: v1
                kind: Service
                metadata:
                  labels:
                    app.kubernetes.io/component: server
                    app.kubernetes.io/instance: argo-cd
                    app.kubernetes.io/name: argocd-server-external
                    app.kubernetes.io/part-of: argocd
                    app.kubernetes.io/version: v2.8.4
                  name: argocd-server-external
                spec:
                  ports:
                  - name: https
                    port: 443
                    protocol: TCP
                    targetPort: 8080
                    nodePort: 30443
                  selector:
                    app.kubernetes.io/instance: argo-cd
                    app.kubernetes.io/name: argocd-server
                  type: NodePort
                EOF

                5. create external service

                kubectl -n argocd apply -f argocd-server-external.yaml
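To confirm the external service is exposed as expected (the service name comes from the manifest above):

kubectl -n argocd get svc argocd-server-external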

                6. [Optional] prepare argocd-server-ingress.yaml

Before you create the ingress, cert-manager must be installed and a ClusterIssuer named self-signed-ca-issuer must exist; if not, please check 🔗link

                Install By
                kubectl -n argocd apply -f - <<EOF
                apiVersion: networking.k8s.io/v1
                kind: Ingress
                metadata:
                  annotations:
                    cert-manager.io/cluster-issuer: self-signed-ca-issuer
                    nginx.ingress.kubernetes.io/backend-protocol: HTTPS
                    nginx.ingress.kubernetes.io/ssl-passthrough: "true"
                  name: argo-cd-argocd-server
                  namespace: argocd
                spec:
                  ingressClassName: nginx
                  rules:
                  - host: argo-cd.ay.dev
                    http:
                      paths:
                      - backend:
                          service:
                            name: argo-cd-argocd-server
                            port:
                              number: 443
                        path: /
                        pathType: Prefix
                  tls:
                  - hosts:
                    - argo-cd.ay.dev
                    secretName: argo-cd.ay.dev-tls
                EOF
                apiVersion: networking.k8s.io/v1
                kind: Ingress
                metadata:
                  annotations:
                    cert-manager.io/cluster-issuer: self-signed-ca-issuer
                    nginx.ingress.kubernetes.io/backend-protocol: HTTPS
                    nginx.ingress.kubernetes.io/ssl-passthrough: "true"
                  name: argo-cd-argocd-server
                  namespace: argocd
                spec:
                  ingressClassName: nginx
                  rules:
                  - host: argo-cd.ay.dev
                    http:
                      paths:
                      - backend:
                          service:
                            name: argo-cd-argocd-server
                            port:
                              number: 443
                        path: /
                        pathType: Prefix
                  tls:
                  - hosts:
                    - argo-cd.ay.dev
                    secretName: argo-cd.ay.dev-tls

7. [Optional] create ingress

kubectl -n argocd apply -f argocd-server-ingress.yaml

                8. get argocd initialized password

                kubectl -n argocd get secret argocd-initial-admin-secret -o jsonpath="{.data.password}" | base64 -d

                9. login argocd

                ARGOCD_PASS=$(kubectl -n argocd get secret argocd-initial-admin-secret -o jsonpath="{.data.password}" | base64 -d)
                MASTER_IP=$(kubectl get nodes --selector=node-role.kubernetes.io/control-plane -o jsonpath='{$.items[0].status.addresses[?(@.type=="InternalIP")].address}')
                argocd login --insecure --username admin $MASTER_IP:30443 --password $ARGOCD_PASS

If you deploy Argo CD in minikube, you might need to forward this port:

                ssh -i ~/.minikube/machines/minikube/id_rsa docker@$(minikube ip) -L '*:30443:0.0.0.0:30443' -N -f
                open https://$(minikube ip):30443

If you use the ingress, you might need to trust the self-signed CA or configure your browser to allow the insecure connection:

                kubectl -n basic-components get secret root-secret -o jsonpath='{.data.tls\.crt}' | base64 -d > cert-manager-self-signed-ca-secret.crt
                open https://argo-cd.ay.dev
                Mar 7, 2024

                Install Argo WorkFlow

                Preliminary

• Kubernetes has been installed; if not, check 🔗link
• Argo CD has been installed; if not, check 🔗link
• cert-manager has been installed and a ClusterIssuer named self-signed-ca-issuer exists; if not, check 🔗link
                kubectl get namespace business-workflows > /dev/null 2>&1 || kubectl create namespace business-workflows

                1. prepare argo-workflows.yaml

                content
                apiVersion: argoproj.io/v1alpha1
                kind: Application
                metadata:
                  name: argo-workflows
                spec:
                  syncPolicy:
                    syncOptions:
                    - CreateNamespace=true
                  project: default
                  source:
                    repoURL: https://argoproj.github.io/argo-helm
                    chart: argo-workflows
                    targetRevision: 0.45.27
                    helm:
                      releaseName: argo-workflows
                      values: |
                        crds:
                          install: true
                          keep: false
                        singleNamespace: false
                        controller:
                          image:
                            registry: m.daocloud.io/quay.io
                          workflowNamespaces:
                            - business-workflows
                        executor:
                          image:
                            registry: m.daocloud.io/quay.io
                        workflow:
                          serviceAccount:
                            create: true
                          rbac:
                            create: true
                        server:
                          enabled: true
                          image:
                            registry: m.daocloud.io/quay.io
                          ingress:
                            enabled: true
                            ingressClassName: nginx
                            annotations:
                              cert-manager.io/cluster-issuer: self-signed-ca-issuer
                              nginx.ingress.kubernetes.io/rewrite-target: /$1
                              nginx.ingress.kubernetes.io/use-regex: "true"
                            hosts:
                              - argo-workflows.ay.dev
                            paths:
                              - /?(.*)
                            pathType: ImplementationSpecific
                            tls:
                              - secretName: argo-workflows.ay.dev-tls
                                hosts:
                                  - argo-workflows.ay.dev
                          authModes:
                            - server
                            - client
                          sso:
                            enabled: false
                  destination:
                    server: https://kubernetes.default.svc
                    namespace: workflows
                kubectl -n argocd apply -f - << EOF
                apiVersion: argoproj.io/v1alpha1
                kind: Application
                metadata:
                  name: argo-workflows
                spec:
                  syncPolicy:
                    syncOptions:
                    - CreateNamespace=true
                  project: default
                  source:
                    repoURL: https://argoproj.github.io/argo-helm
                    chart: argo-workflows
                    targetRevision: 0.45.27
                    helm:
                      releaseName: argo-workflows
                      values: |
                        crds:
                          install: true
                          keep: false
                        singleNamespace: false
                        controller:
                          image:
                            registry: m.daocloud.io/quay.io
                          workflowNamespaces:
                            - business-workflows
                        executor:
                          image:
                            registry: m.daocloud.io/quay.io
                        workflow:
                          serviceAccount:
                            create: true
                          rbac:
                            create: true
                        server:
                          enabled: true
                          image:
                            registry: m.daocloud.io/quay.io
                          ingress:
                            enabled: true
                            ingressClassName: nginx
                            annotations:
                              cert-manager.io/cluster-issuer: self-signed-ca-issuer
                              nginx.ingress.kubernetes.io/rewrite-target: /$1
                              nginx.ingress.kubernetes.io/use-regex: "true"
                            hosts:
                              - argo-workflows.ay.dev
                            paths:
                              - /?(.*)
                            pathType: ImplementationSpecific
                            tls:
                              - secretName: argo-workflows.ay.dev-tls
                                hosts:
                                  - argo-workflows.ay.dev
                          authModes:
                            - server
                            - client
                          sso:
                            enabled: false
                  destination:
                    server: https://kubernetes.default.svc
                    namespace: workflows
                EOF

                2. install argo workflow binary
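A minimal sketch for fetching the argo CLI from the GitHub releases page; the version below is an assumption, so pick the release that matches your chart:

ARGO_VERSION=v3.5.5
curl -sLO https://github.com/argoproj/argo-workflows/releases/download/${ARGO_VERSION}/argo-linux-amd64.gz
gunzip argo-linux-amd64.gz
chmod +x argo-linux-amd64
mv argo-linux-amd64 /usr/local/bin/argo
argo version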

                3. [Optional] apply to k8s

                kubectl -n argocd apply -f argo-workflows.yaml

                4. sync by argocd

                argocd app sync argocd/argo-workflows

                5. submit a test workflow

                argo -n business-workflows submit https://raw.githubusercontent.com/argoproj/argo-workflows/master/examples/hello-world.yaml --serviceaccount=argo-workflow

                6. check workflow status

                # list all flows
                argo -n business-workflows list
                # get specific flow status
                argo -n business-workflows get <$flow_name>
                # get specific flow log
                argo -n business-workflows logs <$flow_name>
                # get specific flow log continuously
                argo -n business-workflows logs <$flow_name> --watch
                Mar 7, 2024

                Install Argo Event

                Preliminary

• Kubernetes has been installed; if not, check 🔗link
• Argo CD has been installed; if not, check 🔗link

                1. prepare argo-events.yaml

                apiVersion: argoproj.io/v1alpha1
                kind: Application
                metadata:
                  name: argo-events
                spec:
                  syncPolicy:
                    syncOptions:
                    - CreateNamespace=true
                  project: default
                  source:
                    repoURL: https://argoproj.github.io/argo-helm
                    chart: argo-events
                    targetRevision: 2.4.2
                    helm:
                      releaseName: argo-events
                      values: |
                        openshift: false
                        createAggregateRoles: true
                        crds:
                          install: true
                          keep: true
                        global:
                          image:
                            repository: m.daocloud.io/quay.io/argoproj/argo-events
                        controller:
                          replicas: 1
                          resources: {}
                        webhook:
                          enabled: true
                          replicas: 1
                          port: 12000
                          resources: {}
                        extraObjects:
                          - apiVersion: networking.k8s.io/v1
                            kind: Ingress
                            metadata:
                              annotations:
                                cert-manager.io/cluster-issuer: self-signed-ca-issuer
                                nginx.ingress.kubernetes.io/rewrite-target: /$1
                              labels:
                                app.kubernetes.io/instance: argo-events
                                app.kubernetes.io/managed-by: Helm
                                app.kubernetes.io/name: argo-events-events-webhook
                                app.kubernetes.io/part-of: argo-events
                                argocd.argoproj.io/instance: argo-events
                              name: argo-events-webhook
                            spec:
                              ingressClassName: nginx
                              rules:
                              - host: argo-events.webhook.ay.dev
                                http:
                                  paths:
                                  - backend:
                                      service:
                                        name: events-webhook
                                        port:
                                          number: 12000
                                    path: /?(.*)
                                    pathType: ImplementationSpecific
                              tls:
                              - hosts:
                                - argo-events.webhook.ay.dev
                                secretName: argo-events-webhook-tls
                  destination:
                    server: https://kubernetes.default.svc
                    namespace: argocd

                4. apply to k8s

                kubectl -n argocd apply -f argo-events.yaml

                5. sync by argocd

                argocd app sync argocd/argo-events
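To verify the sync, a rough check that the Argo Events controller pods are running (the destination namespace argocd comes from the manifest above):

kubectl -n argocd get pods | grep argo-events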
                Mar 7, 2024

                Reloader

                Install

                Details
                helm repo add stakater https://stakater.github.io/stakater-charts
                helm repo update
                helm install reloader stakater/reloader
                Using AY Helm Mirror

                for more information, you can check 🔗https://aaronyang0628.github.io/helm-chart-mirror/

                helm repo add ay-helm-mirror https://aaronyang0628.github.io/helm-chart-mirror/charts
                helm repo update
helm -n basic-components install reloader ay-helm-mirror/reloader
                Details
                kubectl apply -f https://raw.githubusercontent.com/stakater/Reloader/master/deployments/kubernetes/reloader.yaml
                Using AY Gitee Mirror
                kubectl apply -f https://gitee.com/aaron2333/aaaa/raw/main/bbbb.yaml
                Using AY ACR Mirror
                docker pull crpi-wixjy6gci86ms14e.cn-hongkong.personal.cr.aliyuncs.com/ay-mirror/xxxx
                Using DaoCloud Mirror
                docker pull m.daocloud.io/docker.io/library/xxxx

                Usage

• For a Deployment called foo that uses a ConfigMap called foo-configmap, add this annotation to the metadata of your Deployment: configmap.reloader.stakater.com/reload: "foo-configmap"

• For a Deployment called foo that uses a Secret called foo-secret, add this annotation to the metadata of your Deployment: secret.reloader.stakater.com/reload: "foo-secret"

• After a successful installation, your pods will be rolled out automatically whenever the data of the referenced ConfigMap or Secret changes (see the sketch below).
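A minimal sketch of an annotated Deployment, assuming a hypothetical app foo and an existing ConfigMap foo-configmap as in the bullets above:

kubectl apply -f - <<EOF
apiVersion: apps/v1
kind: Deployment
metadata:
  name: foo
  annotations:
    # tells Reloader to roll this Deployment when foo-configmap changes
    configmap.reloader.stakater.com/reload: "foo-configmap"
spec:
  replicas: 1
  selector:
    matchLabels:
      app: foo
  template:
    metadata:
      labels:
        app: foo
    spec:
      containers:
        - name: foo
          image: m.daocloud.io/docker.io/library/nginx:1.25
          envFrom:
            - configMapRef:
                name: foo-configmap
EOF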

                Reference

                For more information about reloader, please refer to https://github.com/stakater/Reloader

                Container

                Articles

FAQ

Q1: What is the difference between docker, podman, and buildah?

                You can add standard markdown syntax:

                • multiple paragraphs
                • bullet point lists
                • emphasized, bold and even bold emphasized text
                • links
                • etc.
                ...and even source code

                the possibilities are endless (almost - including other shortcodes may or may not work)

                Mar 7, 2025

                Subsections of Container

                Install Buildah

                Reference

                Prerequisites

                • Kernel Version Requirements To run Buildah on Red Hat Enterprise Linux or CentOS, version 7.4 or higher is required. On other Linux distributions Buildah requires a kernel version that supports the OverlayFS and/or fuse-overlayfs filesystem – you’ll need to consult your distribution’s documentation to determine a minimum version number.

                • runc Requirement Buildah uses runc to run commands when buildah run is used, or when buildah build encounters a RUN instruction, so you’ll also need to build and install a compatible version of runc for Buildah to call for those cases. If Buildah is installed via a package manager such as yum, dnf or apt-get, runc will be installed as part of that process.

                • CNI Requirement When Buildah uses runc to run commands, it defaults to running those commands in the host’s network namespace. If the command is being run in a separate user namespace, though, for example when ID mapping is used, then the command will also be run in a separate network namespace.

                A newly-created network namespace starts with no network interfaces, so commands which are run in that namespace are effectively disconnected from the network unless additional setup is done. Buildah relies on the CNI library and plugins to set up interfaces and routing for network namespaces.

                something wrong with CNI

                If Buildah is installed via a package manager such as yum, dnf or apt-get, a package containing CNI plugins may be available (in Fedora, the package is named containernetworking-cni). If not, they will need to be installed, for example using:

                git clone https://github.com/containernetworking/plugins
                ( cd ./plugins; ./build_linux.sh )
                sudo mkdir -p /opt/cni/bin
                sudo install -v ./plugins/bin/* /opt/cni/bin

                The CNI library needs to be configured so that it will know which plugins to call to set up namespaces. Usually, this configuration takes the form of one or more configuration files in the /etc/cni/net.d directory. A set of example configuration files is included in the docs/cni-examples directory of this source tree.

                Installation

                Caution

If apt update is already failing, please check the following 🔗link first; adding the Docker source won't solve that problem.

                sudo dnf update -y 
                sudo dnf -y install buildah

Once the installation is complete, the buildah images command will list all the images:

                buildah images
sudo yum -y install buildah

Once the installation is complete, Buildah is ready to use. It is daemonless, so there is no service to start; verify the install with:

sudo buildah --version
1. Install Buildah from your distribution's apt repository.
sudo apt-get -y update
sudo apt-get -y install buildah
1. Verify that the installation is successful by listing local images:
sudo buildah images

                Info

• Docker images are saved in /var/lib/docker

                Mirror

                You can modify /etc/docker/daemon.json

                {
                  "registry-mirrors": ["<$mirror_url>"]
                }

                for example:

                • https://docker.mirrors.ustc.edu.cn
                Mar 7, 2025

                Install Docker

                Mar 7, 2025

                Install Podman

                Reference

                Installation

                Caution

If apt update is already failing, please check the following 🔗link first; adding the Docker source won't solve that problem.

                sudo dnf update -y 
                sudo dnf -y install podman
                sudo yum install -y podman
                sudo apt-get update
                sudo apt-get -y install podman

                Run Params

Start a container:

podman run [params]

--rm: automatically remove the container when it exits

-v: mount a volume into the container

                Example

                podman run --rm\
                      -v /root/kserve/iris-input.json:/tmp/iris-input.json \
                      --privileged \
                     -e MODEL_NAME=sklearn-iris \
                     -e INPUT_PATH=/tmp/iris-input.json \
                     -e SERVICE_HOSTNAME=sklearn-iris.kserve-test.example.com \
                      -it m.daocloud.io/docker.io/library/golang:1.22  sh -c "command A; command B; exec bash"
                Mar 7, 2025

                Subsections of Database

                Install Clickhouse

                Installation

                Install By

                Preliminary

1. Kubernetes has been installed; if not, check 🔗link


2. Helm has been installed; if not, check 🔗link


Preliminary

1. Kubernetes has been installed; if not, check 🔗link


2. Argo CD has been installed; if not, check 🔗link


3. cert-manager has been installed and a ClusterIssuer named `self-signed-ca-issuer` exists; if not, check 🔗link


                1.prepare admin credentials secret

                Details
                kubectl get namespaces database > /dev/null 2>&1 || kubectl create namespace database
                kubectl -n database create secret generic clickhouse-admin-credentials \
                    --from-literal=password=$(tr -dc A-Za-z0-9 </dev/urandom | head -c 16)

                2.prepare `deploy-clickhouse.yaml`

                Details
                apiVersion: argoproj.io/v1alpha1
                kind: Application
                metadata:
                  name: clickhouse
                spec:
                  syncPolicy:
                    syncOptions:
                    - CreateNamespace=true
                  project: default
                  source:
                    repoURL: https://charts.bitnami.com/bitnami
                    chart: clickhouse
                    targetRevision: 4.5.1
                    helm:
                      releaseName: clickhouse
                      values: |
                        serviceAccount:
                          name: clickhouse
                        image:
                          registry: m.daocloud.io/docker.io
                          pullPolicy: IfNotPresent
                        volumePermissions:
                          enabled: false
                          image:
                            registry: m.daocloud.io/docker.io
                            pullPolicy: IfNotPresent
                        zookeeper:
                          enabled: true
                          image:
                            registry: m.daocloud.io/docker.io
                            pullPolicy: IfNotPresent
                          replicaCount: 3
                          persistence:
                            enabled: true
                            storageClass: nfs-external
                            size: 8Gi
                          volumePermissions:
                            enabled: false
                            image:
                              registry: m.daocloud.io/docker.io
                              pullPolicy: IfNotPresent
                        shards: 2
                        replicaCount: 3
                        ingress:
                          enabled: true
                          annotations:
                            cert-manager.io/cluster-issuer: self-signed-ca-issuer
                            nginx.ingress.kubernetes.io/rewrite-target: /$1
                          hostname: clickhouse.dev.geekcity.tech
                          ingressClassName: nginx
                          path: /?(.*)
                          tls: true
                        persistence:
                          enabled: false
                        resources:
                          requests:
                            cpu: 2
                            memory: 512Mi
                          limits:
                            cpu: 3
                            memory: 1024Mi
                        auth:
                          username: admin
                          existingSecret: clickhouse-admin-credentials
                          existingSecretKey: password
                        metrics:
                          enabled: true
                          image:
                            registry: m.daocloud.io/docker.io
                            pullPolicy: IfNotPresent
                          serviceMonitor:
                            enabled: true
                            namespace: monitor
                            jobLabel: clickhouse
                            selector:
                              app.kubernetes.io/name: clickhouse
                              app.kubernetes.io/instance: clickhouse
                            labels:
                              release: prometheus-stack
                        extraDeploy:
                          - |
                            apiVersion: apps/v1
                            kind: Deployment
                            metadata:
                              name: clickhouse-tool
                              namespace: database
                              labels:
                                app.kubernetes.io/name: clickhouse-tool
                            spec:
                              replicas: 1
                              selector:
                                matchLabels:
                                  app.kubernetes.io/name: clickhouse-tool
                              template:
                                metadata:
                                  labels:
                                    app.kubernetes.io/name: clickhouse-tool
                                spec:
                                  containers:
                                    - name: clickhouse-tool
                                      image: m.daocloud.io/docker.io/clickhouse/clickhouse-server:23.11.5.29-alpine
                                      imagePullPolicy: IfNotPresent
                                      env:
                                        - name: CLICKHOUSE_USER
                                          value: admin
                                        - name: CLICKHOUSE_PASSWORD
                                          valueFrom:
                                            secretKeyRef:
                                              key: password
                                              name: clickhouse-admin-credentials
- name: CLICKHOUSE_HOST
  value: clickhouse.database
                                        - name: CLICKHOUSE_PORT
                                          value: "9000"
                                        - name: TZ
                                          value: Asia/Shanghai
                                      command:
                                        - tail
                                      args:
                                        - -f
                                        - /etc/hosts
                  destination:
                    server: https://kubernetes.default.svc
                    namespace: database

                3.deploy clickhouse

                Details
                kubectl -n argocd apply -f deploy-clickhouse.yaml

                4.sync by argocd

                Details
                argocd app sync argocd/clickhouse

                5.prepare `clickhouse-interface.yaml`

                Details
                apiVersion: v1
                kind: Service
                metadata:
                  labels:
                    app.kubernetes.io/component: clickhouse
                    app.kubernetes.io/instance: clickhouse
                  name: clickhouse-interface
                spec:
                  ports:
                  - name: http
                    port: 8123
                    protocol: TCP
                    targetPort: http
                    nodePort: 31567
                  - name: tcp
                    port: 9000
                    protocol: TCP
                    targetPort: tcp
                    nodePort: 32005
                  selector:
                    app.kubernetes.io/component: clickhouse
                    app.kubernetes.io/instance: clickhouse
                    app.kubernetes.io/name: clickhouse
                  type: NodePort

                6.apply to k8s

                Details
                kubectl -n database apply -f clickhouse-interface.yaml

                7.extract clickhouse admin credentials

                Details
                kubectl -n database get secret clickhouse-admin-credentials -o jsonpath='{.data.password}' | base64 -d

                8.invoke http api

                Details
                add `$K8S_MASTER_IP clickhouse.dev.geekcity.tech` to **/etc/hosts**
                CK_PASS=$(kubectl -n database get secret clickhouse-admin-credentials -o jsonpath='{.data.password}' | base64 -d)
                echo 'SELECT version()' | curl -k "https://admin:${CK_PASS}@clickhouse.dev.geekcity.tech:32443/" --data-binary @-
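The clickhouse-interface service created earlier also exposes the native TCP port on NodePort 32005; a rough sketch of querying it with the clickhouse-client image used elsewhere in this guide (assuming $K8S_MASTER_IP is reachable from your machine):

CK_PASS=$(kubectl -n database get secret clickhouse-admin-credentials -o jsonpath='{.data.password}' | base64 -d)
podman run --rm \
  --entrypoint clickhouse-client \
  -it m.daocloud.io/docker.io/clickhouse/clickhouse-server:23.11.5.29-alpine \
  --host ${K8S_MASTER_IP} \
  --port 32005 \
  --user admin \
  --password ${CK_PASS} \
  --query "select version()"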

                Preliminary

1. Docker has been installed; if not, check 🔗link


                Using Proxy

you can run an additional DaoCloud mirror to accelerate your image pulling, check Daocloud Proxy

                1.init server

                Details
                mkdir -p clickhouse/{data,logs}
                podman run --rm \
                    --ulimit nofile=262144:262144 \
                    --name clickhouse-server \
                    -p 18123:8123 \
                    -p 19000:9000 \
                    -v $(pwd)/clickhouse/data:/var/lib/clickhouse \
                    -v $(pwd)/clickhouse/logs:/var/log/clickhouse-server \
                    -e CLICKHOUSE_DB=my_database \
                    -e CLICKHOUSE_DEFAULT_ACCESS_MANAGEMENT=1 \
                    -e CLICKHOUSE_USER=ayayay \
                    -e CLICKHOUSE_PASSWORD=123456 \
                    -d m.daocloud.io/docker.io/clickhouse/clickhouse-server:23.11.5.29-alpine

                2.check dashboard

                And then you can visit 🔗http://localhost:18123
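You can also query the HTTP interface directly, using the user and password set in the run command above:

echo 'SELECT version()' | curl 'http://localhost:18123/' --user ayayay:123456 --data-binary @-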

                3.use cli api

The native TCP interface is exposed on localhost:19000; connect to it with the client below:
                Details
                podman run --rm \
                  --entrypoint clickhouse-client \
                  -it m.daocloud.io/docker.io/clickhouse/clickhouse-server:23.11.5.29-alpine \
                  --host host.containers.internal \
                  --port 19000 \
                  --user ayayay \
                  --password 123456 \
                  --query "select version()"

                4.use visual client

                Details
                podman run --rm -p 8080:80 -d m.daocloud.io/docker.io/spoonest/clickhouse-tabix-web-client:stable

                Preliminary

1. Kubernetes has been installed; if not, check 🔗link


2. Argo CD has been installed; if not, check 🔗link


3. Argo Workflow has been installed; if not, check 🔗link


                1.prepare `argocd-login-credentials`

                Details
kubectl get namespace business-workflows > /dev/null 2>&1 || kubectl create namespace business-workflows
ARGOCD_PASS=$(kubectl -n argocd get secret argocd-initial-admin-secret -o jsonpath="{.data.password}" | base64 -d)
kubectl -n business-workflows create secret generic argocd-login-credentials \
    --from-literal=username=admin \
    --from-literal=password=${ARGOCD_PASS}

                2.apply rolebinding to k8s

                Details
                kubectl apply -f - <<EOF
                ---
                apiVersion: rbac.authorization.k8s.io/v1
                kind: ClusterRole
                metadata:
                  name: application-administrator
                rules:
                  - apiGroups:
                      - argoproj.io
                    resources:
                      - applications
                    verbs:
                      - '*'
                  - apiGroups:
                      - apps
                    resources:
                      - deployments
                    verbs:
                      - '*'
                
                ---
                apiVersion: rbac.authorization.k8s.io/v1
                kind: RoleBinding
                metadata:
                  name: application-administration
                  namespace: argocd
                roleRef:
                  apiGroup: rbac.authorization.k8s.io
                  kind: ClusterRole
                  name: application-administrator
                subjects:
                  - kind: ServiceAccount
                    name: argo-workflow
                    namespace: business-workflows
                
                ---
                apiVersion: rbac.authorization.k8s.io/v1
                kind: RoleBinding
                metadata:
                  name: application-administration
                  namespace: application
                roleRef:
                  apiGroup: rbac.authorization.k8s.io
                  kind: ClusterRole
                  name: application-administrator
                subjects:
                  - kind: ServiceAccount
                    name: argo-workflow
                    namespace: business-workflows
                EOF

                4.prepare clickhouse admin credentials secret

                Details
                kubectl get namespace application > /dev/null 2>&1 || kubectl create namespace application
                kubectl -n application create secret generic clickhouse-admin-credentials \
                  --from-literal=password=$(tr -dc A-Za-z0-9 </dev/urandom | head -c 16)

                5.prepare deploy-clickhouse-flow.yaml

                Details
                apiVersion: argoproj.io/v1alpha1
                kind: Workflow
                metadata:
                  generateName: deploy-argocd-app-ck-
                spec:
                  entrypoint: entry
                  artifactRepositoryRef:
                    configmap: artifact-repositories
                    key: default-artifact-repository
                  serviceAccountName: argo-workflow
                  templates:
                  - name: entry
                    inputs:
                      parameters:
                      - name: argocd-server
                        value: argo-cd-argocd-server.argocd:443
                      - name: insecure-option
                        value: --insecure
                    dag:
                      tasks:
                      - name: apply
                        template: apply
                      - name: prepare-argocd-binary
                        template: prepare-argocd-binary
                        dependencies:
                        - apply
                      - name: sync
                        dependencies:
                        - prepare-argocd-binary
                        template: sync
                        arguments:
                          artifacts:
                          - name: argocd-binary
                            from: "{{tasks.prepare-argocd-binary.outputs.artifacts.argocd-binary}}"
                          parameters:
                          - name: argocd-server
                            value: "{{inputs.parameters.argocd-server}}"
                          - name: insecure-option
                            value: "{{inputs.parameters.insecure-option}}"
                      - name: wait
                        dependencies:
                        - sync
                        template: wait
                        arguments:
                          artifacts:
                          - name: argocd-binary
                            from: "{{tasks.prepare-argocd-binary.outputs.artifacts.argocd-binary}}"
                          parameters:
                          - name: argocd-server
                            value: "{{inputs.parameters.argocd-server}}"
                          - name: insecure-option
                            value: "{{inputs.parameters.insecure-option}}"
                  - name: apply
                    resource:
                      action: apply
                      manifest: |
                        apiVersion: argoproj.io/v1alpha1
                        kind: Application
                        metadata:
                          name: app-clickhouse
                          namespace: argocd
                        spec:
                          syncPolicy:
                            syncOptions:
                            - CreateNamespace=true
                          project: default
                          source:
                            repoURL: https://charts.bitnami.com/bitnami
                            chart: clickhouse
                            targetRevision: 4.5.3
                            helm:
                              releaseName: app-clickhouse
                              values: |
                                image:
                                  registry: docker.io
                                  repository: bitnami/clickhouse
                                  tag: 23.12.3-debian-11-r0
                                  pullPolicy: IfNotPresent
                                service:
                                  type: ClusterIP
                                volumePermissions:
                                  enabled: false
                                  image:
                                    registry: m.daocloud.io/docker.io
                                    pullPolicy: IfNotPresent
                                ingress:
                                  enabled: true
                                  ingressClassName: nginx
                                  annotations:
                                    cert-manager.io/cluster-issuer: self-signed-ca-issuer
                                    nginx.ingress.kubernetes.io/rewrite-target: /$1
                                  path: /?(.*)
                                  hostname: clickhouse.dev.geekcity.tech
                                  tls: true
                                shards: 2
                                replicaCount: 3
                                persistence:
                                  enabled: false
                                auth:
                                  username: admin
                                  existingSecret: clickhouse-admin-credentials
                                  existingSecretKey: password
                                zookeeper:
                                  enabled: true
                                  image:
                                    registry: m.daocloud.io/docker.io
                                    repository: bitnami/zookeeper
                                    tag: 3.8.3-debian-11-r8
                                    pullPolicy: IfNotPresent
                                  replicaCount: 3
                                  persistence:
                                    enabled: false
                                  volumePermissions:
                                    enabled: false
                                    image:
                                      registry: m.daocloud.io/docker.io
                                      pullPolicy: IfNotPresent
                          destination:
                            server: https://kubernetes.default.svc
                            namespace: application
                  - name: prepare-argocd-binary
                    inputs:
                      artifacts:
                      - name: argocd-binary
                        path: /tmp/argocd
                        mode: 755
                        http:
                          url: https://files.m.daocloud.io/github.com/argoproj/argo-cd/releases/download/v2.9.3/argocd-linux-amd64
                    outputs:
                      artifacts:
                      - name: argocd-binary
                        path: "{{inputs.artifacts.argocd-binary.path}}"
                    container:
                      image: m.daocloud.io/docker.io/library/fedora:39
                      command:
                      - sh
                      - -c
                      args:
                      - |
                        ls -l {{inputs.artifacts.argocd-binary.path}}
                  - name: sync
                    inputs:
                      artifacts:
                      - name: argocd-binary
                        path: /usr/local/bin/argocd
                      parameters:
                      - name: argocd-server
                      - name: insecure-option
                        value: ""
                    container:
                      image: m.daocloud.io/docker.io/library/fedora:39
                      env:
                      - name: ARGOCD_USERNAME
                        valueFrom:
                          secretKeyRef:
                            name: argocd-login-credentials
                            key: username
                      - name: ARGOCD_PASSWORD
                        valueFrom:
                          secretKeyRef:
                            name: argocd-login-credentials
                            key: password
                      - name: WITH_PRUNE_OPTION
                        value: --prune
                      command:
                      - sh
                      - -c
                      args:
                      - |
                        set -e
                        export ARGOCD_SERVER={{inputs.parameters.argocd-server}}
                        export INSECURE_OPTION={{inputs.parameters.insecure-option}}
                        export ARGOCD_USERNAME=${ARGOCD_USERNAME:-admin}
                        argocd login ${INSECURE_OPTION} --username ${ARGOCD_USERNAME} --password ${ARGOCD_PASSWORD} ${ARGOCD_SERVER}
                        argocd app sync argocd/app-clickhouse ${WITH_PRUNE_OPTION} --timeout 300
                  - name: wait
                    inputs:
                      artifacts:
                      - name: argocd-binary
                        path: /usr/local/bin/argocd
                      parameters:
                      - name: argocd-server
                      - name: insecure-option
                        value: ""
                    container:
                      image: m.daocloud.io/docker.io/library/fedora:39
                      env:
                      - name: ARGOCD_USERNAME
                        valueFrom:
                          secretKeyRef:
                            name: argocd-login-credentials
                            key: username
                      - name: ARGOCD_PASSWORD
                        valueFrom:
                          secretKeyRef:
                            name: argocd-login-credentials
                            key: password
                      command:
                      - sh
                      - -c
                      args:
                      - |
                        set -e
                        export ARGOCD_SERVER={{inputs.parameters.argocd-server}}
                        export INSECURE_OPTION={{inputs.parameters.insecure-option}}
                        export ARGOCD_USERNAME=${ARGOCD_USERNAME:-admin}
                        argocd login ${INSECURE_OPTION} --username ${ARGOCD_USERNAME} --password ${ARGOCD_PASSWORD} ${ARGOCD_SERVER}
                        argocd app wait argocd/app-clickhouse

6.submit to argo workflow client

                Details
                argo -n business-workflows submit deploy-clickhouse-flow.yaml
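To follow the run you just submitted, a quick check with the argo CLI (the @latest shortcut assumes this is the most recent workflow in the namespace):

argo -n business-workflows list
argo -n business-workflows watch @latest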

                7.extract clickhouse admin credentials

                Details
                kubectl -n application get secret clickhouse-admin-credentials -o jsonpath='{.data.password}' | base64 -d

                8.invoke http api

                Details
                add `$K8S_MASTER_IP clickhouse.dev.geekcity.tech` to **/etc/hosts**
                CK_PASSWORD=$(kubectl -n application get secret clickhouse-admin-credentials -o jsonpath='{.data.password}' | base64 -d) && echo 'SELECT version()' | curl -k "https://admin:${CK_PASSWORD}@clickhouse.dev.geekcity.tech/" --data-binary @-

                9.create external interface

                Details
                kubectl -n application apply -f - <<EOF
                apiVersion: v1
                kind: Service
                metadata:
                  labels:
                    app.kubernetes.io/component: clickhouse
                    app.kubernetes.io/instance: app-clickhouse
                    app.kubernetes.io/managed-by: Helm
                    app.kubernetes.io/name: clickhouse
                    app.kubernetes.io/version: 23.12.2
                    argocd.argoproj.io/instance: app-clickhouse
                    helm.sh/chart: clickhouse-4.5.3
                  name: app-clickhouse-service-external
                spec:
                  ports:
                  - name: tcp
                    port: 9000
                    protocol: TCP
                    targetPort: tcp
                    nodePort: 30900
                  selector:
                    app.kubernetes.io/component: clickhouse
                    app.kubernetes.io/instance: app-clickhouse
                    app.kubernetes.io/name: clickhouse
                  type: NodePort
                EOF

                FAQ

                Q1: Show me almost endless possibilities

                You can add standard markdown syntax:

                • multiple paragraphs
                • bullet point lists
                • emphasized, bold and even bold emphasized text
                • links
                • etc.
                ...and even source code

                the possibilities are endless (almost - including other shortcodes may or may not work)

                Q2: Show me almost endless possibilities

                You can add standard markdown syntax:

                • multiple paragraphs
                • bullet point lists
                • emphasized, bold and even bold emphasized text
                • links
                • etc.
                ...and even source code

                the possibilities are endless (almost - including other shortcodes may or may not work)

                Mar 7, 2024

                Install ElasticSearch

                Installation

                Install By

                Preliminary

1. Kubernetes has been installed; if not, check 🔗link


2. Helm has been installed; if not, check 🔗link


                1.get helm repo

                Details
                helm repo add ay-helm-mirror https://aaronyang0628.github.io/helm-chart-mirror/charts
                helm repo update

                2.install chart

                Details
helm install ay-helm-mirror/elasticsearch --generate-name
                Using Proxy

                Preliminary

1. Kubernetes has been installed; if not, check 🔗link


2. Helm has been installed; if not, check 🔗link


3. Argo CD has been installed; if not, check 🔗link


                1.prepare `deploy-elasticsearch.yaml`

                Details
                kubectl apply -f - << EOF
                apiVersion: argoproj.io/v1alpha1
                kind: Application
                metadata:
                  name: elastic-search
                spec:
                  syncPolicy:
                    syncOptions:
                    - CreateNamespace=true
                  project: default
                  source:
                    repoURL: https://charts.bitnami.com/bitnami
                    chart: elasticsearch
                    targetRevision: 19.11.3
                    helm:
                      releaseName: elastic-search
                      values: |
                        global:
                          kibanaEnabled: true
                        clusterName: elastic
                        image:
                          registry: m.zjvis.net/docker.io
                          pullPolicy: IfNotPresent
                        security:
                          enabled: false
                        service:
                          type: ClusterIP
                        ingress:
                          enabled: true
                          annotations:
                            cert-manager.io/cluster-issuer: self-signed-ca-issuer
                            nginx.ingress.kubernetes.io/rewrite-target: /$1
                          hostname: elastic-search.dev.tech
                          ingressClassName: nginx
                          path: /?(.*)
                          tls: true
                        master:
                          masterOnly: false
                          replicaCount: 1
                          persistence:
                            enabled: false
                          resources:
                            requests:
                              cpu: 2
                              memory: 1024Mi
                            limits:
                              cpu: 4
                              memory: 4096Mi
                          heapSize: 2g
                        data:
                          replicaCount: 0
                          persistence:
                            enabled: false
                        coordinating:
                          replicaCount: 0
                        ingest:
                          enabled: true
                          replicaCount: 0
                          service:
                            enabled: false
                            type: ClusterIP
                          ingress:
                            enabled: false
                        metrics:
                          enabled: false
                          image:
                            registry: m.zjvis.net/docker.io
                            pullPolicy: IfNotPresent
                        volumePermissions:
                          enabled: false
                          image:
                            registry: m.zjvis.net/docker.io
                            pullPolicy: IfNotPresent
                        sysctlImage:
                          enabled: true
                          registry: m.zjvis.net/docker.io
                          pullPolicy: IfNotPresent
                        kibana:
                          elasticsearch:
                            hosts:
                              - '{{ include "elasticsearch.service.name" . }}'
                            port: '{{ include "elasticsearch.service.ports.restAPI" . }}'
                        esJavaOpts: "-Xmx2g -Xms2g"        
                  destination:
                    server: https://kubernetes.default.svc
                    namespace: application
                EOF

                3.sync by argocd

                Details
                argocd app sync argocd/elastic-search

                4.extract elasticsearch admin credentials

                Details
security.enabled is set to false in the values above, so no admin credentials are generated and the cluster can be reached without authentication.

                5.invoke http api

                Details
                add `$K8S_MASTER_IP elastic-search.dev.tech` to `/etc/hosts`
                curl -k -H "Content-Type: application/json" \
                    -X POST "https://elastic-search.dev.tech:32443/books/_doc?pretty" \
                    -d '{"name": "Snow Crash", "author": "Neal Stephenson", "release_date": "1992-06-01", "page_count": 470}'

                Preliminary

1. Docker, Podman, or Buildah has been installed; if not, check 🔗link


                Using Mirror

you can run an additional DaoCloud mirror to accelerate your image pulling, check Daocloud Proxy

                1.init server

                Details

                Preliminary

1. Kubernetes has been installed; if not, check 🔗link


2. Helm has been installed; if not, check 🔗link


3. Argo CD has been installed; if not, check 🔗link


4. Argo Workflow has been installed; if not, check 🔗link


                1.prepare `argocd-login-credentials`

                Details
                kubectl get namespaces database > /dev/null 2>&1 || kubectl create namespace database

                2.apply rolebinding to k8s

                Details
                kubectl apply -f - <<EOF
                ---
                apiVersion: rbac.authorization.k8s.io/v1
                kind: ClusterRole
                metadata:
                  name: application-administrator
                rules:
                  - apiGroups:
                      - argoproj.io
                    resources:
                      - applications
                    verbs:
                      - '*'
                  - apiGroups:
                      - apps
                    resources:
                      - deployments
                    verbs:
                      - '*'
                
                ---
                apiVersion: rbac.authorization.k8s.io/v1
                kind: RoleBinding
                metadata:
                  name: application-administration
                  namespace: argocd
                roleRef:
                  apiGroup: rbac.authorization.k8s.io
                  kind: ClusterRole
                  name: application-administrator
                subjects:
                  - kind: ServiceAccount
                    name: argo-workflow
                    namespace: business-workflows
                
                ---
                apiVersion: rbac.authorization.k8s.io/v1
                kind: RoleBinding
                metadata:
                  name: application-administration
                  namespace: application
                roleRef:
                  apiGroup: rbac.authorization.k8s.io
                  kind: ClusterRole
                  name: application-administrator
                subjects:
                  - kind: ServiceAccount
                    name: argo-workflow
                    namespace: business-workflows
                EOF

                4.prepare `deploy-xxxx-flow.yaml`

                Details

                6.submit to argo workflow client

                Details
                argo -n business-workflows submit deploy-xxxx-flow.yaml

                7.decode password

                Details
                kubectl -n application get secret xxxx-credentials -o jsonpath='{.data.xxx-password}' | base64 -d


                Apr 12, 2024

                Install Kafka

                Installation

                Install By

                Preliminary

                1. Kubernetes has been installed; if not, check 🔗link


                2. Helm binary has been installed; if not, check 🔗link


                1.get helm repo

                Details
                helm repo add bitnami https://charts.bitnami.com/bitnami
                helm repo update

                2.install chart

                helm upgrade --create-namespace -n database kafka --install bitnami/kafka \
                  --set global.imageRegistry=m.daocloud.io/docker.io \
                  --set zookeeper.enabled=false \
                  --set controller.replicaCount=1 \
                  --set broker.replicaCount=1 \
                  --set persistence.enabled=false \
                  --version 28.0.3
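
                After the release is installed, a quick sanity check, assuming the release is named `kafka` in the `database` namespace:

                # optional check: all Kafka pods of the release should become Ready
                kubectl -n database get pods -l app.kubernetes.io/instance=kafka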
                
                Details
                kubectl -n database \
                  create secret generic client-properties \
                  --from-literal=client.properties="$(printf "security.protocol=SASL_PLAINTEXT\nsasl.mechanism=SCRAM-SHA-256\nsasl.jaas.config=org.apache.kafka.common.security.scram.ScramLoginModule required username=\"user1\" password=\"$(kubectl get secret kafka-user-passwords --namespace database -o jsonpath='{.data.client-passwords}' | base64 -d | cut -d , -f 1)\";\n")"
                Details
                kubectl -n database apply -f - << EOF
                apiVersion: apps/v1
                kind: Deployment
                metadata:
                  name: kafka-client-tools
                  labels:
                    app: kafka-client-tools
                spec:
                  replicas: 1
                  selector:
                    matchLabels:
                      app: kafka-client-tools
                  template:
                    metadata:
                      labels:
                        app: kafka-client-tools
                    spec:
                      volumes:
                      - name: client-properties
                        secret:
                          secretName: client-properties
                      containers:
                      - name: kafka-client-tools
                        image: m.daocloud.io/docker.io/bitnami/kafka:3.6.2
                        volumeMounts:
                        - name: client-properties
                          mountPath: /bitnami/custom/client.properties
                          subPath: client.properties
                          readOnly: true
                        env:
                        - name: BOOTSTRAP_SERVER
                          value: kafka.database.svc.cluster.local:9092
                        - name: CLIENT_CONFIG_FILE
                          value: /bitnami/custom/client.properties
                        command:
                        - tail
                        - -f
                        - /etc/hosts
                        imagePullPolicy: IfNotPresent
                EOF

                3.validate function

                - list topics
                Details
                kubectl -n database exec -it deployment/kafka-client-tools -- bash -c \
                    'kafka-topics.sh --bootstrap-server $BOOTSTRAP_SERVER --command-config $CLIENT_CONFIG_FILE --list'
                - create topic
                Details
                kubectl -n database exec -it deployment/kafka-client-tools -- bash -c \
                  'kafka-topics.sh --bootstrap-server $BOOTSTRAP_SERVER --command-config $CLIENT_CONFIG_FILE --create --if-not-exists --topic test-topic'
                - describe topic
                Details
                kubectl -n database exec -it deployment/kafka-client-tools -- bash -c \
                  'kafka-topics.sh --bootstrap-server $BOOTSTRAP_SERVER --command-config $CLIENT_CONFIG_FILE --describe --topic test-topic'
                - produce message
                Details
                kubectl -n database exec -it deployment/kafka-client-tools -- bash -c \
                  'for message in $(seq 0 10); do echo $message | kafka-console-producer.sh --bootstrap-server $BOOTSTRAP_SERVER --producer.config $CLIENT_CONFIG_FILE --topic test-topic; done'
                - consume message
                Details
                kubectl -n database exec -it deployment/kafka-client-tools -- bash -c \
                  'kafka-console-consumer.sh --bootstrap-server $BOOTSTRAP_SERVER --consumer.config $CLIENT_CONFIG_FILE --topic test-topic --from-beginning'
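
                When you are done testing, the test topic can be removed with the same client deployment. A minimal cleanup sketch, reusing the `test-topic` name from the steps above:

                # optional cleanup: delete the test topic created above
                kubectl -n database exec -it deployment/kafka-client-tools -- bash -c \
                  'kafka-topics.sh --bootstrap-server $BOOTSTRAP_SERVER --command-config $CLIENT_CONFIG_FILE --delete --topic test-topic'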

                Preliminary

                1. Kubernetes has been installed; if not, check 🔗link


                2. ArgoCD has been installed; if not, check 🔗link


                3. Helm binary has been installed; if not, check 🔗link


                1.prepare `deploy-kafka.yaml`

                kubectl -n argocd apply -f - << EOF
                apiVersion: argoproj.io/v1alpha1
                kind: Application
                metadata:
                  name: kafka
                spec:
                  syncPolicy:
                    syncOptions:
                    - CreateNamespace=true
                  project: default
                  source:
                    repoURL: https://charts.bitnami.com/bitnami
                    chart: kafka
                    targetRevision: 28.0.3
                    helm:
                      releaseName: kafka
                      values: |
                        image:
                          registry: m.daocloud.io/docker.io
                        controller:
                          replicaCount: 1
                          persistence:
                            enabled: false
                          logPersistence:
                            enabled: false
                          extraConfig: |
                            message.max.bytes=5242880
                            default.replication.factor=1
                            offsets.topic.replication.factor=1
                            transaction.state.log.replication.factor=1
                        broker:
                          replicaCount: 1
                          persistence:
                            enabled: false
                          logPersistence:
                            enabled: false
                          extraConfig: |
                            message.max.bytes=5242880
                            default.replication.factor=1
                            offsets.topic.replication.factor=1
                            transaction.state.log.replication.factor=1
                        externalAccess:
                          enabled: false
                          autoDiscovery:
                            enabled: false
                            image:
                              registry: m.daocloud.io/docker.io
                        volumePermissions:
                          enabled: false
                          image:
                            registry: m.daocloud.io/docker.io
                        metrics:
                          kafka:
                            enabled: false
                            image:
                              registry: m.daocloud.io/docker.io
                          jmx:
                            enabled: false
                            image:
                              registry: m.daocloud.io/docker.io
                        provisioning:
                          enabled: false
                        kraft:
                          enabled: true
                        zookeeper:
                          enabled: false
                  destination:
                    server: https://kubernetes.default.svc
                    namespace: database
                EOF
                kubectl -n argocd apply -f - << EOF
                apiVersion: argoproj.io/v1alpha1
                kind: Application
                metadata:
                  name: kafka
                spec:
                  syncPolicy:
                    syncOptions:
                    - CreateNamespace=true
                  project: default
                  source:
                    repoURL: https://charts.bitnami.com/bitnami
                    chart: kafka
                    targetRevision: 28.0.3
                    helm:
                      releaseName: kafka
                      values: |
                        image:
                          registry: m.daocloud.io/docker.io
                        listeners:
                          client:
                            protocol: PLAINTEXT
                          interbroker:
                            protocol: PLAINTEXT
                        controller:
                          replicaCount: 0
                          persistence:
                            enabled: false
                          logPersistence:
                            enabled: false
                          extraConfig: |
                            message.max.bytes=5242880
                            default.replication.factor=1
                            offsets.topic.replication.factor=1
                            transaction.state.log.replication.factor=1
                        broker:
                          replicaCount: 1
                          minId: 0
                          persistence:
                            enabled: false
                          logPersistence:
                            enabled: false
                          extraConfig: |
                            message.max.bytes=5242880
                            default.replication.factor=1
                            offsets.topic.replication.factor=1
                            transaction.state.log.replication.factor=1
                        externalAccess:
                          enabled: false
                          autoDiscovery:
                            enabled: false
                            image:
                              registry: m.daocloud.io/docker.io
                        volumePermissions:
                          enabled: false
                          image:
                            registry: m.daocloud.io/docker.io
                        metrics:
                          kafka:
                            enabled: false
                            image:
                              registry: m.daocloud.io/docker.io
                          jmx:
                            enabled: false
                            image:
                              registry: m.daocloud.io/docker.io
                        provisioning:
                          enabled: false
                        kraft:
                          enabled: false
                        zookeeper:
                          enabled: true
                          image:
                            registry: m.daocloud.io/docker.io
                          replicaCount: 1
                          auth:
                            client:
                              enabled: false
                            quorum:
                              enabled: false
                          persistence:
                            enabled: false
                          volumePermissions:
                            enabled: false
                            image:
                              registry: m.daocloud.io/docker.io
                            metrics:
                              enabled: false
                          tls:
                            client:
                              enabled: false
                            quorum:
                              enabled: false
                  destination:
                    server: https://kubernetes.default.svc
                    namespace: database
                EOF

                2.sync by argocd

                Details
                argocd app sync argocd/kafka

                3.set up client tool

                kubectl -n database \
                    create secret generic client-properties \
                    --from-literal=client.properties="$(printf "security.protocol=SASL_PLAINTEXT\nsasl.mechanism=SCRAM-SHA-256\nsasl.jaas.config=org.apache.kafka.common.security.scram.ScramLoginModule required username=\"user1\" password=\"$(kubectl get secret kafka-user-passwords --namespace database -o jsonpath='{.data.client-passwords}' | base64 -d | cut -d , -f 1)\";\n")"
                kubectl -n database \
                    create secret generic client-properties \
                    --from-literal=client.properties="security.protocol=PLAINTEXT"

                4.prepare `kafka-client-tools.yaml`

                Details
                kubectl -n database apply -f - << EOF
                apiVersion: apps/v1
                kind: Deployment
                metadata:
                  name: kafka-client-tools
                  labels:
                    app: kafka-client-tools
                spec:
                  replicas: 1
                  selector:
                    matchLabels:
                      app: kafka-client-tools
                  template:
                    metadata:
                      labels:
                        app: kafka-client-tools
                    spec:
                      volumes:
                      - name: client-properties
                        secret:
                          secretName: client-properties
                      containers:
                      - name: kafka-client-tools
                        image: m.daocloud.io/docker.io/bitnami/kafka:3.6.2
                        volumeMounts:
                        - name: client-properties
                          mountPath: /bitnami/custom/client.properties
                          subPath: client.properties
                          readOnly: true
                        env:
                        - name: BOOTSTRAP_SERVER
                          value: kafka.database.svc.cluster.local:9092
                        - name: CLIENT_CONFIG_FILE
                          value: /bitnami/custom/client.properties
                        - name: ZOOKEEPER_CONNECT
                          value: kafka-zookeeper.database.svc.cluster.local:2181
                        command:
                        - tail
                        - -f
                        - /etc/hosts
                        imagePullPolicy: IfNotPresent
                EOF

                5.validate function

                - list topics
                Details
                kubectl -n database exec -it deployment/kafka-client-tools -- bash -c \
                    'kafka-topics.sh --bootstrap-server $BOOTSTRAP_SERVER --command-config $CLIENT_CONFIG_FILE --list'
                - create topic
                Details
                kubectl -n database exec -it deployment/kafka-client-tools -- bash -c \
                  'kafka-topics.sh --bootstrap-server $BOOTSTRAP_SERVER --command-config $CLIENT_CONFIG_FILE --create --if-not-exists --topic test-topic'
                - describe topic
                Details
                kubectl -n database exec -it deployment/kafka-client-tools -- bash -c \
                  'kafka-topics.sh --bootstrap-server $BOOTSTRAP_SERVER --command-config $CLIENT_CONFIG_FILE --describe --topic test-topic'
                - produce message
                Details
                kubectl -n database exec -it deployment/kafka-client-tools -- bash -c \
                  'for message in $(seq 0 10); do echo $message | kafka-console-producer.sh --bootstrap-server $BOOTSTRAP_SERVER --producer.config $CLIENT_CONFIG_FILE --topic test-topic; done'
                - consume message
                Details
                kubectl -n database exec -it deployment/kafka-client-tools -- bash -c \
                  'kafka-console-consumer.sh --bootstrap-server $BOOTSTRAP_SERVER --consumer.config $CLIENT_CONFIG_FILE --topic test-topic --from-beginning'

                Preliminary

                1. Docker has been installed; if not, check 🔗link


                Using Proxy

                You can run an additional DaoCloud mirror to accelerate image pulling; check Daocloud Proxy

                1.init server

                Details
                mkdir -p kafka/data
                chmod -R 777 kafka/data
                podman run --rm \
                    --name kafka-server \
                    --hostname kafka-server \
                    -p 9092:9092 \
                    -p 9094:9094 \
                    -v $(pwd)/kafka/data:/bitnami/kafka/data \
                    -e KAFKA_CFG_NODE_ID=0 \
                    -e KAFKA_CFG_PROCESS_ROLES=controller,broker \
                    -e KAFKA_CFG_CONTROLLER_QUORUM_VOTERS=0@kafka-server:9093 \
                    -e KAFKA_CFG_LISTENERS=PLAINTEXT://:9092,CONTROLLER://:9093,EXTERNAL://:9094 \
                    -e KAFKA_CFG_ADVERTISED_LISTENERS=PLAINTEXT://kafka:9092,EXTERNAL://host.containers.internal:9094 \
                    -e KAFKA_CFG_LISTENER_SECURITY_PROTOCOL_MAP=CONTROLLER:PLAINTEXT,EXTERNAL:PLAINTEXT,PLAINTEXT:PLAINTEXT \
                    -e KAFKA_CFG_CONTROLLER_LISTENER_NAMES=CONTROLLER \
                    -d m.daocloud.io/docker.io/bitnami/kafka:3.6.2

                2.list topic

                Details
                BOOTSTRAP_SERVER=host.containers.internal:9094
                podman run --rm \
                    -it m.daocloud.io/docker.io/bitnami/kafka:3.6.2 kafka-topics.sh \
                        --bootstrap-server $BOOTSTRAP_SERVER --list

                3.create topic

                Details
                BOOTSTRAP_SERVER=host.containers.internal:9094
                # BOOTSTRAP_SERVER=10.200.60.64:9094
                TOPIC=test-topic
                podman run --rm \
                    -it m.daocloud.io/docker.io/bitnami/kafka:3.6.2 kafka-topics.sh \
                        --bootstrap-server $BOOTSTRAP_SERVER \
                        --create \
                        --if-not-exists \
                        --topic $TOPIC

                4.consume record

                Details
                BOOTSTRAP_SERVER=host.containers.internal:9094
                # BOOTSTRAP_SERVER=10.200.60.64:9094
                TOPIC=test-topic
                podman run --rm \
                    -it m.daocloud.io/docker.io/bitnami/kafka:3.6.2 kafka-console-consumer.sh \
                        --bootstrap-server $BOOTSTRAP_SERVER \
                        --topic $TOPIC \
                        --from-beginning
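
                The steps above list, create, and consume, but do not produce any records. A minimal produce sketch using the same image; the broker address and topic name follow the previous steps:

                # optional: produce one record so the consumer above has something to read
                BOOTSTRAP_SERVER=host.containers.internal:9094
                TOPIC=test-topic
                echo "hello kafka" | podman run --rm -i \
                    m.daocloud.io/docker.io/bitnami/kafka:3.6.2 kafka-console-producer.sh \
                        --bootstrap-server $BOOTSTRAP_SERVER \
                        --topic $TOPIC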


                Mar 7, 2024

                Install MariaDB

                Installation

                Install By

                Preliminary

                1. Kubernetes has been installed; if not, check 🔗link


                2. Helm has been installed; if not, check 🔗link


                Preliminary

                1. Kubernetes has been installed; if not, check 🔗link


                2. ArgoCD has been installed; if not, check 🔗link


                3. cert-manager has been installed via ArgoCD and a ClusterIssuer named `self-signed-ca-issuer` exists; if not, check 🔗link


                1.prepare mariadb credentials secret

                Details
                kubectl get namespaces database > /dev/null 2>&1 || kubectl create namespace database
                kubectl -n database create secret generic mariadb-credentials \
                    --from-literal=mariadb-root-password=$(tr -dc A-Za-z0-9 </dev/urandom | head -c 16) \
                    --from-literal=mariadb-replication-password=$(tr -dc A-Za-z0-9 </dev/urandom | head -c 16) \
                    --from-literal=mariadb-password=$(tr -dc A-Za-z0-9 </dev/urandom | head -c 16)

                2.prepare `deploy-mariadb.yaml`

                Details
                apiVersion: argoproj.io/v1alpha1
                kind: Application
                metadata:
                  name: mariadb
                spec:
                  syncPolicy:
                    syncOptions:
                    - CreateNamespace=true
                  project: default
                  source:
                    repoURL: https://charts.bitnami.com/bitnami
                    chart: mariadb
                    targetRevision: 16.3.2
                    helm:
                      releaseName: mariadb
                      values: |
                        architecture: standalone
                        auth:
                          database: test-mariadb
                          username: aaron.yang
                          existingSecret: mariadb-credentials
                        primary:
                          extraFlags: "--character-set-server=utf8mb4 --collation-server=utf8mb4_bin"
                          persistence:
                            enabled: false
                        secondary:
                          replicaCount: 1
                          persistence:
                            enabled: false
                        image:
                          registry: m.daocloud.io/docker.io
                          pullPolicy: IfNotPresent
                        volumePermissions:
                          enabled: false
                          image:
                            registry: m.daocloud.io/docker.io
                            pullPolicy: IfNotPresent
                        metrics:
                          enabled: false
                          image:
                            registry: m.daocloud.io/docker.io
                            pullPolicy: IfNotPresent
                  destination:
                    server: https://kubernetes.default.svc
                    namespace: database

                3.deploy mariadb

                Details
                kubectl -n argocd apply -f deploy-mariadb.yaml

                4.sync by argocd

                Details
                argocd app sync argocd/mariadb

                5.check mariadb

                Details
                kubectl -n database get secret mariadb-credentials -o jsonpath='{.data.mariadb-root-password}' | base64 -d
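
                To verify the deployment end to end, you can open a throwaway client pod and run a query. A minimal sketch, reusing the decoded root password and an image tag already used in these notes:

                # optional check: connect from a temporary client pod and print the server version
                MARIADB_ROOT_PASSWORD=$(kubectl -n database get secret mariadb-credentials -o jsonpath='{.data.mariadb-root-password}' | base64 -d)
                kubectl -n database run mariadb-client --rm -it --restart=Never \
                    --image=m.daocloud.io/docker.io/bitnami/mariadb:10.5.12-debian-10-r0 -- \
                    mysql -h mariadb.database.svc.cluster.local -P 3306 -uroot -p"$MARIADB_ROOT_PASSWORD" -e 'SELECT VERSION();'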

                Preliminary

                1. Kubernetes has been installed; if not, check 🔗link


                2. ArgoCD has been installed; if not, check 🔗link


                3. Argo Workflow has been installed; if not, check 🔗link


                1.prepare `argocd-login-credentials`

                Details
                kubectl get namespaces database > /dev/null 2>&1 || kubectl create namespace database
                kubectl -n database create secret generic mariadb-credentials \
                    --from-literal=mariadb-root-password=$(tr -dc A-Za-z0-9 </dev/urandom | head -c 16) \
                    --from-literal=mariadb-replication-password=$(tr -dc A-Za-z0-9 </dev/urandom | head -c 16) \
                    --from-literal=mariadb-password=$(tr -dc A-Za-z0-9 </dev/urandom | head -c 16)

                2.apply rolebinding to k8s

                Details
                kubectl -n argocd apply -f - <<EOF
                ---
                apiVersion: rbac.authorization.k8s.io/v1
                kind: ClusterRole
                metadata:
                  name: application-administrator
                rules:
                  - apiGroups:
                      - argoproj.io
                    resources:
                      - applications
                    verbs:
                      - '*'
                  - apiGroups:
                      - apps
                    resources:
                      - deployments
                    verbs:
                      - '*'
                
                ---
                apiVersion: rbac.authorization.k8s.io/v1
                kind: RoleBinding
                metadata:
                  name: application-administration
                  namespace: argocd
                roleRef:
                  apiGroup: rbac.authorization.k8s.io
                  kind: ClusterRole
                  name: application-administrator
                subjects:
                  - kind: ServiceAccount
                    name: argo-workflow
                    namespace: business-workflows
                
                ---
                apiVersion: rbac.authorization.k8s.io/v1
                kind: RoleBinding
                metadata:
                  name: application-administration
                  namespace: application
                roleRef:
                  apiGroup: rbac.authorization.k8s.io
                  kind: ClusterRole
                  name: application-administrator
                subjects:
                  - kind: ServiceAccount
                    name: argo-workflow
                    namespace: business-workflows
                EOF

                3.prepare mariadb credentials secret

                Details
                kubectl -n application create secret generic mariadb-credentials \
                  --from-literal=mariadb-root-password=$(tr -dc A-Za-z0-9 </dev/urandom | head -c 16) \
                  --from-literal=mariadb-replication-password=$(tr -dc A-Za-z0-9 </dev/urandom | head -c 16) \
                  --from-literal=mariadb-password=$(tr -dc A-Za-z0-9 </dev/urandom | head -c 16)

                4.prepare `deploy-mariadb-flow.yaml`

                Details
                apiVersion: argoproj.io/v1alpha1
                kind: Workflow
                metadata:
                  generateName: deploy-argocd-app-mariadb-
                spec:
                  entrypoint: entry
                  artifactRepositoryRef:
                    configmap: artifact-repositories
                    key: default-artifact-repository
                  serviceAccountName: argo-workflow
                  templates:
                  - name: entry
                    inputs:
                      parameters:
                      - name: argocd-server
                        value: argo-cd-argocd-server.argocd:443
                      - name: insecure-option
                        value: --insecure
                    dag:
                      tasks:
                      - name: apply
                        template: apply
                      - name: prepare-argocd-binary
                        template: prepare-argocd-binary
                        dependencies:
                        - apply
                      - name: sync
                        dependencies:
                        - prepare-argocd-binary
                        template: sync
                        arguments:
                          artifacts:
                          - name: argocd-binary
                            from: "{{tasks.prepare-argocd-binary.outputs.artifacts.argocd-binary}}"
                          parameters:
                          - name: argocd-server
                            value: "{{inputs.parameters.argocd-server}}"
                          - name: insecure-option
                            value: "{{inputs.parameters.insecure-option}}"
                      - name: wait
                        dependencies:
                        - sync
                        template: wait
                        arguments:
                          artifacts:
                          - name: argocd-binary
                            from: "{{tasks.prepare-argocd-binary.outputs.artifacts.argocd-binary}}"
                          parameters:
                          - name: argocd-server
                            value: "{{inputs.parameters.argocd-server}}"
                          - name: insecure-option
                            value: "{{inputs.parameters.insecure-option}}"
                      - name: init-db-tool
                        template: init-db-tool
                        dependencies:
                        - wait
                  - name: apply
                    resource:
                      action: apply
                      manifest: |
                        apiVersion: argoproj.io/v1alpha1
                        kind: Application
                        metadata:
                          name: app-mariadb
                          namespace: argocd
                        spec:
                          syncPolicy:
                            syncOptions:
                            - CreateNamespace=true
                          project: default
                          source:
                            repoURL: https://charts.bitnami.com/bitnami
                            chart: mariadb
                            targetRevision: 16.5.0
                            helm:
                              releaseName: app-mariadb
                              values: |
                                architecture: standalone
                                auth:
                                  database: geekcity
                                  username: aaron.yang
                                  existingSecret: mariadb-credentials
                                primary:
                                  persistence:
                                    enabled: false
                                secondary:
                                  replicaCount: 1
                                  persistence:
                                    enabled: false
                                image:
                                  registry: m.daocloud.io/docker.io
                                  pullPolicy: IfNotPresent
                                volumePermissions:
                                  enabled: false
                                  image:
                                    registry: m.daocloud.io/docker.io
                                    pullPolicy: IfNotPresent
                                metrics:
                                  enabled: false
                                  image:
                                    registry: m.daocloud.io/docker.io
                                    pullPolicy: IfNotPresent
                          destination:
                            server: https://kubernetes.default.svc
                            namespace: application
                  - name: prepare-argocd-binary
                    inputs:
                      artifacts:
                      - name: argocd-binary
                        path: /tmp/argocd
                        mode: 755
                        http:
                          url: https://files.m.daocloud.io/github.com/argoproj/argo-cd/releases/download/v2.9.3/argocd-linux-amd64
                    outputs:
                      artifacts:
                      - name: argocd-binary
                        path: "{{inputs.artifacts.argocd-binary.path}}"
                    container:
                      image: m.daocloud.io/docker.io/library/fedora:39
                      command:
                      - sh
                      - -c
                      args:
                      - |
                        ls -l {{inputs.artifacts.argocd-binary.path}}
                  - name: sync
                    inputs:
                      artifacts:
                      - name: argocd-binary
                        path: /usr/local/bin/argocd
                      parameters:
                      - name: argocd-server
                      - name: insecure-option
                        value: ""
                    container:
                      image: m.daocloud.io/docker.io/library/fedora:39
                      env:
                      - name: ARGOCD_USERNAME
                        valueFrom:
                          secretKeyRef:
                            name: argocd-login-credentials
                            key: username
                      - name: ARGOCD_PASSWORD
                        valueFrom:
                          secretKeyRef:
                            name: argocd-login-credentials
                            key: password
                      - name: WITH_PRUNE_OPTION
                        value: --prune
                      command:
                      - sh
                      - -c
                      args:
                      - |
                        set -e
                        export ARGOCD_SERVER={{inputs.parameters.argocd-server}}
                        export INSECURE_OPTION={{inputs.parameters.insecure-option}}
                        export ARGOCD_USERNAME=${ARGOCD_USERNAME:-admin}
                        argocd login ${INSECURE_OPTION} --username ${ARGOCD_USERNAME} --password ${ARGOCD_PASSWORD} ${ARGOCD_SERVER}
                        argocd app sync argocd/app-mariadb ${WITH_PRUNE_OPTION} --timeout 300
                  - name: wait
                    inputs:
                      artifacts:
                      - name: argocd-binary
                        path: /usr/local/bin/argocd
                      parameters:
                      - name: argocd-server
                      - name: insecure-option
                        value: ""
                    container:
                      image: m.daocloud.io/docker.io/library/fedora:39
                      env:
                      - name: ARGOCD_USERNAME
                        valueFrom:
                          secretKeyRef:
                            name: argocd-login-credentials
                            key: username
                      - name: ARGOCD_PASSWORD
                        valueFrom:
                          secretKeyRef:
                            name: argocd-login-credentials
                            key: password
                      command:
                      - sh
                      - -c
                      args:
                      - |
                        set -e
                        export ARGOCD_SERVER={{inputs.parameters.argocd-server}}
                        export INSECURE_OPTION={{inputs.parameters.insecure-option}}
                        export ARGOCD_USERNAME=${ARGOCD_USERNAME:-admin}
                        argocd login ${INSECURE_OPTION} --username ${ARGOCD_USERNAME} --password ${ARGOCD_PASSWORD} ${ARGOCD_SERVER}
                        argocd app wait argocd/app-mariadb
                  - name: init-db-tool
                    resource:
                      action: apply
                      manifest: |
                        apiVersion: apps/v1
                        kind: Deployment
                        metadata:
                          name: app-mariadb-tool
                          namespace: application
                          labels:
                            app.kubernetes.io/name: mariadb-tool
                        spec:
                          replicas: 1
                          selector:
                            matchLabels:
                              app.kubernetes.io/name: mariadb-tool
                          template:
                            metadata:
                              labels:
                                app.kubernetes.io/name: mariadb-tool
                            spec:
                              containers:
                                - name: mariadb-tool
                                  image:  m.daocloud.io/docker.io/bitnami/mariadb:10.5.12-debian-10-r0
                                  imagePullPolicy: IfNotPresent
                                  env:
                                    - name: MARIADB_ROOT_PASSWORD
                                      valueFrom:
                                        secretKeyRef:
                                          key: mariadb-root-password
                                          name: mariadb-credentials
                                    - name: TZ
                                      value: Asia/Shanghai

                5.submit to argo workflow client

                Details
                argo -n business-workflows submit deploy-mariadb-flow.yaml

                6.decode password

                Details
                kubectl -n application get secret mariadb-credentials -o jsonpath='{.data.mariadb-root-password}' | base64 -d

                Preliminary

                1. Docker has been installed; if not, check 🔗link


                Using Proxy

                You can run an additional DaoCloud mirror to accelerate image pulling; check Daocloud Proxy

                1.init server

                Details
                mkdir -p mariadb/data
                podman run \
                    -p 3306:3306 \
                    -e MARIADB_ROOT_PASSWORD=mysql \
                    -v $(pwd)/mariadb/data:/var/lib/mysql \
                    -d m.daocloud.io/docker.io/library/mariadb:11.2.2-jammy \
                    --log-bin \
                    --binlog-format=ROW

                2.use web console

                And then you can visit 🔗http://localhost:8080

                username: `root`

                password: `mysql`

                Details
                podman run --rm -p 8080:80 \
                    -e PMA_ARBITRARY=1 \
                    -d m.daocloud.io/docker.io/library/phpmyadmin:5.1.1-apache

                3.use internal client

                Details
                podman run --rm \
                    -e MYSQL_PWD=mysql \
                    -it m.daocloud.io/docker.io/library/mariadb:11.2.2-jammy \
                    mariadb \
                    --host host.containers.internal \
                    --port 3306 \
                    --user root \
                    --database mysql \
                    --execute 'select version()'

                Useful SQL

                1. list all bin logs
                SHOW BINARY LOGS;
                2. delete previous bin logs
                PURGE BINARY LOGS TO 'mysqld-bin.0000003'; # delete mysqld-bin.0000001 and mysqld-bin.0000002
                PURGE BINARY LOGS BEFORE 'yyyy-MM-dd HH:mm:ss';
                PURGE BINARY LOGS BEFORE DATE_SUB(NOW(), INTERVAL 3 DAY); # delete bin logs older than three days
                Details

                If you are using master-slave replication, you can replace BINARY with MASTER in the statements above.
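
                Instead of purging by hand, the server can expire old binary logs automatically. A minimal sketch, assuming MariaDB 10.6 or newer where `binlog_expire_logs_seconds` is available:

                # optional: keep binary logs for 3 days (259200 seconds) and let the server purge the rest
                podman run --rm \
                    -e MYSQL_PWD=mysql \
                    -it m.daocloud.io/docker.io/library/mariadb:11.2.2-jammy \
                    mariadb \
                    --host host.containers.internal \
                    --port 3306 \
                    --user root \
                    --execute 'SET GLOBAL binlog_expire_logs_seconds = 259200;'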


                Mar 7, 2024

                Install Milvus

                Preliminary

                • Kubernetes has been installed; if not, check link
                • ArgoCD has been installed; if not, check link
                • cert-manager has been installed via ArgoCD and a ClusterIssuer named self-signed-ca-issuer exists; if not, check link
                • MinIO has been installed; if not, check link

                Steps

                1. copy minio credentials secret

                kubectl get namespaces database > /dev/null 2>&1 || kubectl create namespace database
                kubectl -n storage get secret minio-secret -o json \
                    | jq 'del(.metadata["namespace","creationTimestamp","resourceVersion","selfLink","uid"])' \
                    | kubectl -n database apply -f -
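
                A quick check that the copied secret exposes the keys the chart expects (`root-user` and `root-password`, referenced by `externalS3` below):

                # optional check: print the access key id carried by the copied secret
                kubectl -n database get secret minio-secret -o jsonpath='{.data.root-user}' | base64 -d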

                2. prepare deploy-milvus.yaml

                apiVersion: argoproj.io/v1alpha1
                kind: Application
                metadata:
                  name: milvus
                spec:
                  syncPolicy:
                    syncOptions:
                      - CreateNamespace=true
                  project: default
                  source:
                    repoURL: registry-1.docker.io/bitnamicharts
                    chart: milvus
                    targetRevision: 11.2.4
                    helm:
                      releaseName: milvus
                      values: |
                        global:
                          security:
                            allowInsecureImages: true
                        milvus:
                          image:
                            registry: m.lab.zverse.space/docker.io
                            repository: bitnami/milvus
                            tag: 2.5.7-debian-12-r0
                            pullPolicy: IfNotPresent
                          auth:
                            enabled: false
                        initJob:
                          forceRun: false
                          image:
                            registry: m.lab.zverse.space/docker.io
                            repository: bitnami/pymilvus
                            tag: 2.5.6-debian-12-r0
                            pullPolicy: IfNotPresent
                          resources:
                            requests:
                              cpu: 2
                              memory: 512Mi
                            limits:
                              cpu: 2
                              memory: 2Gi
                        dataCoord:
                          replicaCount: 1
                          resources:
                            requests:
                              cpu: 500m
                              memory: 512Mi
                            limits:
                              cpu: 2
                              memory: 2Gi
                          metrics:
                            enabled: true
                            
                        rootCoord:
                          replicaCount: 1
                          resources:
                            requests:
                              cpu: 500m
                              memory: 1Gi
                            limits:
                              cpu: 2
                              memory: 4Gi
                        queryCoord:
                          replicaCount: 1
                          resources:
                            requests:
                              cpu: 500m
                              memory: 1Gi
                            limits:
                              cpu: 2
                              memory: 4Gi
                        indexCoord:
                          replicaCount: 1
                          resources:
                            requests:
                              cpu: 500m
                              memory: 1Gi
                            limits:
                              cpu: 2
                              memory: 4Gi
                        dataNode:
                          replicaCount: 1
                          resources:
                            requests:
                              cpu: 500m
                              memory: 1Gi
                            limits:
                              cpu: 2
                              memory: 4Gi
                        queryNode:
                          replicaCount: 1
                          resources:
                            requests:
                              cpu: 500m
                              memory: 1Gi
                            limits:
                              cpu: 2
                              memory: 2Gi
                        indexNode:
                          resources:
                            requests:
                              cpu: 500m
                              memory: 1Gi
                            limits:
                              cpu: 2
                              memory: 2Gi
                        proxy:
                          replicaCount: 1
                          service:
                            type: ClusterIP
                          resources:
                            requests:
                              cpu: 500m
                              memory: 1Gi
                            limits:
                              cpu: 2
                              memory: 2Gi
                        attu:
                          image:
                            registry: m.lab.zverse.space/docker.io
                            repository: bitnami/attu
                            tag: 2.5.5-debian-12-r1
                          resources:
                            requests:
                              cpu: 500m
                              memory: 1Gi
                            limits:
                              cpu: 2
                              memory: 4Gi
                          service:
                            type: ClusterIP
                          ingress:
                            enabled: true
                            ingressClassName: "nginx"
                            annotations:
                              cert-manager.io/cluster-issuer: alidns-webhook-zverse-letsencrypt
                            hostname: milvus.dev.tech
                            path: /
                            pathType: ImplementationSpecific
                            tls: true
                        waitContainer:
                          image:
                            registry: m.lab.zverse.space/docker.io
                            repository: bitnami/os-shell
                            tag: 12-debian-12-r40
                            pullPolicy: IfNotPresent
                          resources:
                            requests:
                              cpu: 500m
                              memory: 1Gi
                            limits:
                              cpu: 2
                              memory: 4Gi
                        externalS3:
                          host: "minio.storage"
                          port: 9000
                          existingSecret: "minio-secret"
                          existingSecretAccessKeyIDKey: "root-user"
                          existingSecretKeySecretKey: "root-password"
                          bucket: "milvus"
                          rootPath: "file"
                        etcd:
                          enabled: true
                          image:
                            registry: m.lab.zverse.space/docker.io
                          replicaCount: 1
                          auth:
                            rbac:
                              create: false
                            client:
                              secureTransport: false
                          resources:
                            requests:
                              cpu: 500m
                              memory: 1Gi
                            limits:
                              cpu: 2
                              memory: 2Gi
                          persistence:
                            enabled: true
                            storageClass: ""
                            size: 2Gi
                          preUpgradeJob:
                            enabled: false
                        minio:
                          enabled: false
                        kafka:
                          enabled: true
                          image:
                            registry: m.lab.zverse.space/docker.io
                          controller:
                            replicaCount: 1
                            livenessProbe:
                              failureThreshold: 8
                            resources:
                              requests:
                                cpu: 500m
                                memory: 1Gi
                              limits:
                                cpu: 2
                                memory: 2Gi
                            persistence:
                              enabled: true
                              storageClass: ""
                              size: 2Gi
                          service:
                            ports:
                              client: 9092
                          extraConfig: |-
                            offsets.topic.replication.factor=3
                          listeners:
                            client:
                              protocol: PLAINTEXT
                            interbroker:
                              protocol: PLAINTEXT
                            external:
                              protocol: PLAINTEXT
                          sasl:
                            enabledMechanisms: "PLAIN"
                            client:
                              users:
                                - user
                          broker:
                            replicaCount: 0
                  destination:
                    server: https://kubernetes.default.svc
                    namespace: database

                3. apply to k8s

                kubectl -n argocd apply -f deploy-milvus.yaml

                4. sync by argocd

                argocd app sync argocd/milvus

                5. check Attu WebUI

                milvus address: milvus-proxy:19530

                milvus database: default

                https://milvus.dev.tech:32443/#/
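
                Besides the web UI, you can probe the proxy health endpoint from inside the cluster. A minimal sketch, assuming the proxy service is named `milvus-proxy` and exposes its HTTP/metrics port on 9091; the image mirror prefix is the same one used elsewhere on this page:

                # optional check: probe the Milvus proxy health endpoint
                kubectl -n database run milvus-health --rm -it --restart=Never \
                    --image=m.lab.zverse.space/docker.io/curlimages/curl:8.6.0 -- \
                    curl -s http://milvus-proxy.database.svc.cluster.local:9091/healthz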

                6. [Optional] import data

                import data using a SQL file

                MARIADB_ROOT_PASSWORD=$(kubectl -n database get secret mariadb-credentials -o jsonpath='{.data.mariadb-root-password}' | base64 -d)
                POD_NAME=$(kubectl get pod -n database -l "app.kubernetes.io/name=mariadb-tool" -o jsonpath="{.items[0].metadata.name}") \
                && export SQL_FILENAME="Dump20240301.sql" \
                && kubectl -n database cp ${SQL_FILENAME} ${POD_NAME}:/tmp/${SQL_FILENAME} \
                && kubectl -n database exec -it deployment/app-mariadb-tool -- bash -c \
                    'echo "create database ccds;" | mysql -h mariadb.database -uroot -p$MARIADB_ROOT_PASSWORD' \
                && kubectl -n database exec -it ${POD_NAME} -- bash -c \
                    "mysql -h mariadb.database -uroot -p\${MARIADB_ROOT_PASSWORD} \
                    ccds < /tmp/Dump20240301.sql"

                7. [Optional] decode password

                kubectl -n database get secret mariadb-credentials -o jsonpath='{.data.mariadb-root-password}' | base64 -d

                8. [Optional] execute sql in pod

                kubectl -n database exec -it xxxx bash
                mariadb -h 127.0.0.1 -u root -p$MARIADB_ROOT_PASSWORD

                And then you can check the connection with

                show status like  'Threads%';
                May 26, 2025

                Install Neo4j

                Installation

                Install By

                Preliminary

                1. Kubernetes has been installed; if not, check 🔗link


                2. Helm has been installed; if not, check 🔗link


                1.get helm repo

                Details
                helm repo add ay-helm-mirror https://aaronyang0628.github.io/helm-chart-mirror/charts
                helm repo update

                2.install chart

                Details
                helm install ay-helm-mirror/kube-prometheus-stack --generate-name
                Using Proxy

                Preliminary

                1. Kubernetes has been installed; if not, check 🔗link


                2. Helm has been installed; if not, check 🔗link


                3. ArgoCD has been installed; if not, check 🔗link


                1.prepare `deploy-xxxxx.yaml`

                Details

                2.apply to k8s

                Details
                kubectl -n argocd apply -f xxxx.yaml

                3.sync by argocd

                Details
                argocd app sync argocd/xxxx

                4.prepare yaml-content.yaml

                Details

                5.apply to k8s

                Details
                kubectl apply -f xxxx.yaml

                6.apply xxxx.yaml directly

                Details
                kubectl apply -f - <<EOF
                
                EOF

                Preliminary

                1. Docker|Podman|Buildah has been installed; if not, check 🔗link


                Using Proxy

                You can run an additional DaoCloud mirror to accelerate image pulling; check Daocloud Proxy

                1.init server

                Details
                mkdir -p neo4j/data
                podman run --rm \
                    --name neo4j \
                    -p 7474:7474 \
                    -p 7687:7687 \
                    -e NEO4J_AUTH=neo4j/mysql \
                    -v $(pwd)/neo4j/data:/data \
                    -d docker.io/library/neo4j:5.18.0-community-bullseye
                And then you can visit 🔗http://localhost:7474


                username: `neo4j`
                password: `mysql`
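
                To verify the container works, run a trivial Cypher query through `cypher-shell`, which ships inside the Neo4j image; the credentials follow the `NEO4J_AUTH` value set above:

                # optional check: run a trivial query against the running container
                podman exec -it neo4j cypher-shell -u neo4j -p mysql "RETURN 1 AS ok;"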

                Preliminary

                1. Kubernetes has been installed; if not, check 🔗link


                2. Helm has been installed; if not, check 🔗link


                3. ArgoCD has been installed; if not, check 🔗link


                4. Argo Workflow has been installed; if not, check 🔗link


                1.prepare `argocd-login-credentials`

                Details
kubectl get namespaces database > /dev/null 2>&1 || kubectl create namespace database
ARGOCD_USERNAME=admin
ARGOCD_PASSWORD=$(kubectl -n argocd get secret argocd-initial-admin-secret -o jsonpath="{.data.password}" | base64 -d)
kubectl -n business-workflows create secret generic argocd-login-credentials \
    --from-literal=username=${ARGOCD_USERNAME} \
    --from-literal=password=${ARGOCD_PASSWORD}

                2.apply rolebinding to k8s

                Details
                kubectl apply -f - <<EOF
                ---
                apiVersion: rbac.authorization.k8s.io/v1
                kind: ClusterRole
                metadata:
                  name: application-administrator
                rules:
                  - apiGroups:
                      - argoproj.io
                    resources:
                      - applications
                    verbs:
                      - '*'
                  - apiGroups:
                      - apps
                    resources:
                      - deployments
                    verbs:
                      - '*'
                
                ---
                apiVersion: rbac.authorization.k8s.io/v1
                kind: RoleBinding
                metadata:
                  name: application-administration
                  namespace: argocd
                roleRef:
                  apiGroup: rbac.authorization.k8s.io
                  kind: ClusterRole
                  name: application-administrator
                subjects:
                  - kind: ServiceAccount
                    name: argo-workflow
                    namespace: business-workflows
                
                ---
                apiVersion: rbac.authorization.k8s.io/v1
                kind: RoleBinding
                metadata:
                  name: application-administration
                  namespace: application
                roleRef:
                  apiGroup: rbac.authorization.k8s.io
                  kind: ClusterRole
                  name: application-administrator
                subjects:
                  - kind: ServiceAccount
                    name: argo-workflow
                    namespace: business-workflows
                EOF

                4.prepare `deploy-xxxx-flow.yaml`

                Details

6.submit to argo workflow client

                Details
                argo -n business-workflows submit deploy-xxxx-flow.yaml

                7.decode password

                Details
                kubectl -n application get secret xxxx-credentials -o jsonpath='{.data.xxx-password}' | base64 -d


                Mar 7, 2024

                Install Postgresql

                Installation

                Install By

                Preliminary

                1. Kubernetes has installed, if not check 🔗link


                2. Helm has installed, if not check 🔗link


                1.get helm repo

                Details
                helm repo add bitnami https://charts.bitnami.com/bitnami
                helm repo update

                2.install chart

                Details
                helm install bitnami/postgresql --generate-name --version 18.1.8
                Using Proxy

for more information, you can check 🔗https://artifacthub.io/packages/helm/bitnami/postgresql

                helm repo add ay-helm-mirror https://aaronyang0628.github.io/helm-chart-mirror/charts
                helm repo update
                helm install my-postgresql ay-helm-mirror/postgresql --version 18.1.8

                Preliminary

                1. Kubernetes has installed, if not check 🔗link


                2. Helm has installed, if not check 🔗link


                3. ArgoCD has installed, if not check 🔗link


                1.prepare `postgresql-credentials`

                Details
                kubectl get namespaces database > /dev/null 2>&1 || kubectl create namespace database
                kubectl -n database create secret generic postgresql-credentials \
                    --from-literal=postgres-password=$(tr -dc A-Za-z0-9 </dev/urandom | head -c 16) \
                    --from-literal=password=$(tr -dc A-Za-z0-9 </dev/urandom | head -c 16) \
                    --from-literal=replication-password=$(tr -dc A-Za-z0-9 </dev/urandom | head -c 16)

                2.prepare `deploy-postgresql.yaml`

                Details
kubectl -n argocd apply -f - <<'EOF'
                apiVersion: argoproj.io/v1alpha1
                kind: Application
                metadata:
                  name: postgresql
                spec:
                  syncPolicy:
                    syncOptions:
                    - CreateNamespace=true
                  project: default
                  source:
                    repoURL: https://aaronyang0628.github.io/helm-chart-mirror/charts
                    chart: postgresql
                    targetRevision: 18.1.8
                    helm:
                      releaseName: postgresql
                      values: |
                        global:
                          security:
                            allowInsecureImages: true
                        architecture: standalone
                        auth:
                          database: n8n
                          username: n8n
                          existingSecret: postgresql-credentials
                        primary:
                          persistence:
                            enabled: true
                            storageClass: local-path
                            size: 8Gi
                        readReplicas:
                          replicaCount: 1
                          persistence:
                            enabled: true
                            storageClass: local-path
                            size: 8Gi
                        backup:
                          enabled: false
                        image:
                          registry: m.daocloud.io/registry-1.docker.io
                          pullPolicy: IfNotPresent
                        volumePermissions:
                          enabled: false
                          image:
                            registry: m.daocloud.io/registry-1.docker.io
                            pullPolicy: IfNotPresent
                        metrics:
                          enabled: false
                          image:
                            registry: m.daocloud.io/registry-1.docker.io
                            pullPolicy: IfNotPresent
                    extraDeploy:
                    - apiVersion: traefik.io/v1alpha1
                      kind: IngressRouteTCP
                      metadata:
                        name: postgres-tcp
                        namespace: database
                      spec:
                        entryPoints:
                          - postgres
                        routes:
                        - match: HostSNI(`*`)
                          services:
                          - name: postgresql
                            port: 5432
                    - apiVersion: networking.k8s.io/v1
                      kind: Ingress
                      metadata:
                        name: postgres-tcp-ingress
                        annotations:
                          kubernetes.io/ingress.class: nginx
                      spec:
                        rules:
                        - host: postgres.ay.dev
                          http:
                            paths:
                            - path: /
                              pathType: Prefix
                              backend:
                                service:
                                  name: postgresql
                                  port:
number: 5432
                  destination:
                    server: https://kubernetes.default.svc
                    namespace: database
                EOF

                3.sync by argocd

                Details
                argocd app sync argocd/postgresql
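Once the sync completes, a quick connection test can be run with a throwaway pod (a sketch; the service name `postgresql` follows the release name above):

PGPASSWORD=$(kubectl -n database get secret postgresql-credentials -o jsonpath='{.data.postgres-password}' | base64 -d)
kubectl -n database run psql-test --rm -it \
    --image=m.daocloud.io/docker.io/library/postgres:15.2-alpine3.17 \
    --env="PGPASSWORD=${PGPASSWORD}" \
    --command -- psql --host postgresql.database --port 5432 \
    --username postgres --dbname postgres --command 'select version()'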

                4.prepare yaml-content.yaml

                Details

                5.apply to k8s

                Details
                kubectl apply -f xxxx.yaml

                6.apply xxxx.yaml directly

                Details
                kubectl apply -f - <<EOF
                
                EOF

                Preliminary

                1. Docker|Podman|Buildah has installed, if not check 🔗link


                Using Proxy

you can run an additional DaoCloud proxy image to accelerate image pulling, check Daocloud Proxy

                1.init server

                Details
                mkdir -p $(pwd)/postgresql/data
                podman run --rm \
                    --name postgresql \
                    -p 5432:5432 \
                    -e POSTGRES_PASSWORD=postgresql \
                    -e PGDATA=/var/lib/postgresql/data/pgdata \
                    -v $(pwd)/postgresql/data:/var/lib/postgresql/data \
                    -d docker.io/library/postgres:15.2-alpine3.17

                2.use web console

                Details
                podman run --rm \
                  -p 8080:80 \
                  -e 'PGADMIN_DEFAULT_EMAIL=ben.wangz@foxmail.com' \
                  -e 'PGADMIN_DEFAULT_PASSWORD=123456' \
                  -d docker.io/dpage/pgadmin4:6.15
                And then you can visit 🔗[http://localhost:8080]


                3.use internal client

                Details
                podman run --rm \
                    --env PGPASSWORD=postgresql \
                    --entrypoint psql \
                    -it docker.io/library/postgres:15.2-alpine3.17 \
                    --host host.containers.internal \
                    --port 5432 \
                    --username postgres \
                    --dbname postgres \
                    --command 'select version()'

                Preliminary

                1. Kubernetes has installed, if not check 🔗link


                2. Helm has installed, if not check 🔗link


                3. ArgoCD has installed, if not check 🔗link


                4. Argo Workflow has installed, if not check 🔗link


                5. Minio artifact repository has been configured, if not check 🔗link


                - endpoint: minio.storage:9000

                1.prepare `argocd-login-credentials`

                Details
                kubectl get namespaces database > /dev/null 2>&1 || kubectl create namespace database
                ARGOCD_USERNAME=admin
                ARGOCD_PASSWORD=$(kubectl -n argocd get secret argocd-initial-admin-secret -o jsonpath="{.data.password}" | base64 -d)
                kubectl -n business-workflows create secret generic argocd-login-credentials \
                    --from-literal=username=${ARGOCD_USERNAME} \
                    --from-literal=password=${ARGOCD_PASSWORD}

                2.apply rolebinding to k8s

                Details
                kubectl apply -f - <<EOF
                ---
                apiVersion: rbac.authorization.k8s.io/v1
                kind: ClusterRole
                metadata:
                  name: application-administrator
                rules:
                  - apiGroups:
                      - argoproj.io
                    resources:
                      - applications
                    verbs:
                      - '*'
                  - apiGroups:
                      - apps
                    resources:
                      - deployments
                    verbs:
                      - '*'
                
                ---
                apiVersion: rbac.authorization.k8s.io/v1
                kind: RoleBinding
                metadata:
                  name: application-administration
                  namespace: argocd
                roleRef:
                  apiGroup: rbac.authorization.k8s.io
                  kind: ClusterRole
                  name: application-administrator
                subjects:
                  - kind: ServiceAccount
                    name: argo-workflow
                    namespace: business-workflows
                
                ---
                apiVersion: rbac.authorization.k8s.io/v1
                kind: RoleBinding
                metadata:
                  name: application-administration
                  namespace: application
                roleRef:
                  apiGroup: rbac.authorization.k8s.io
                  kind: ClusterRole
                  name: application-administrator
                subjects:
                  - kind: ServiceAccount
                    name: argo-workflow
                    namespace: business-workflows
                EOF

                3.prepare postgresql admin credentials secret

                Details
                kubectl -n application create secret generic postgresql-credentials \
                  --from-literal=postgres-password=$(tr -dc A-Za-z0-9 </dev/urandom | head -c 16) \
                  --from-literal=password=$(tr -dc A-Za-z0-9 </dev/urandom | head -c 16) \
                  --from-literal=replication-password=$(tr -dc A-Za-z0-9 </dev/urandom | head -c 16)

                4.prepare `deploy-postgresql-flow.yaml`

                Details
                apiVersion: argoproj.io/v1alpha1
                kind: Workflow
                metadata:
                  generateName: deploy-argocd-app-pg-
                spec:
                  entrypoint: entry
                  artifactRepositoryRef:
                    configmap: artifact-repositories
                    key: default-artifact-repository
                  serviceAccountName: argo-workflow
                  templates:
                  - name: entry
                    inputs:
                      parameters:
                      - name: argocd-server
                        value: argo-cd-argocd-server.argocd:443
                      - name: insecure-option
                        value: --insecure
                    dag:
                      tasks:
                      - name: apply
                        template: apply
                      - name: prepare-argocd-binary
                        template: prepare-argocd-binary
                        dependencies:
                        - apply
                      - name: sync
                        dependencies:
                        - prepare-argocd-binary
                        template: sync
                        arguments:
                          artifacts:
                          - name: argocd-binary
                            from: "{{tasks.prepare-argocd-binary.outputs.artifacts.argocd-binary}}"
                          parameters:
                          - name: argocd-server
                            value: "{{inputs.parameters.argocd-server}}"
                          - name: insecure-option
                            value: "{{inputs.parameters.insecure-option}}"
                      - name: wait
                        dependencies:
                        - sync
                        template: wait
                        arguments:
                          artifacts:
                          - name: argocd-binary
                            from: "{{tasks.prepare-argocd-binary.outputs.artifacts.argocd-binary}}"
                          parameters:
                          - name: argocd-server
                            value: "{{inputs.parameters.argocd-server}}"
                          - name: insecure-option
                            value: "{{inputs.parameters.insecure-option}}"
                      - name: init-db-tool
                        template: init-db-tool
                        dependencies:
                        - wait
                  - name: apply
                    resource:
                      action: apply
                      manifest: |
                        apiVersion: argoproj.io/v1alpha1
                        kind: Application
                        metadata:
                          name: app-postgresql
                          namespace: argocd
                        spec:
                          syncPolicy:
                            syncOptions:
                            - CreateNamespace=true
                          project: default
                          source:
                            repoURL: https://charts.bitnami.com/bitnami
                            chart: postgresql
                            targetRevision: 14.2.2
                            helm:
                              releaseName: app-postgresql
                              values: |
                                architecture: standalone
                                auth:
                                  database: geekcity
                                  username: aaron.yang
                                  existingSecret: postgresql-credentials
                                primary:
                                  persistence:
                                    enabled: false
                                readReplicas:
                                  replicaCount: 1
                                  persistence:
                                    enabled: false
                                backup:
                                  enabled: false
                                image:
                                  registry: m.daocloud.io/docker.io
                                  pullPolicy: IfNotPresent
                                volumePermissions:
                                  enabled: false
                                  image:
                                    registry: m.daocloud.io/docker.io
                                    pullPolicy: IfNotPresent
                                metrics:
                                  enabled: false
                                  image:
                                    registry: m.daocloud.io/docker.io
                                    pullPolicy: IfNotPresent
                          destination:
                            server: https://kubernetes.default.svc
                            namespace: application
                  - name: prepare-argocd-binary
                    inputs:
                      artifacts:
                      - name: argocd-binary
                        path: /tmp/argocd
                        mode: 755
                        http:
                          url: https://files.m.daocloud.io/github.com/argoproj/argo-cd/releases/download/v2.9.3/argocd-linux-amd64
                    outputs:
                      artifacts:
                      - name: argocd-binary
                        path: "{{inputs.artifacts.argocd-binary.path}}"
                    container:
                      image: m.daocloud.io/docker.io/library/fedora:39
                      command:
                      - sh
                      - -c
                      args:
                      - |
                        ls -l {{inputs.artifacts.argocd-binary.path}}
                  - name: sync
                    inputs:
                      artifacts:
                      - name: argocd-binary
                        path: /usr/local/bin/argocd
                      parameters:
                      - name: argocd-server
                      - name: insecure-option
                        value: ""
                    container:
                      image: m.daocloud.io/docker.io/library/fedora:39
                      env:
                      - name: ARGOCD_USERNAME
                        valueFrom:
                          secretKeyRef:
                            name: argocd-login-credentials
                            key: username
                      - name: ARGOCD_PASSWORD
                        valueFrom:
                          secretKeyRef:
                            name: argocd-login-credentials
                            key: password
                      - name: WITH_PRUNE_OPTION
                        value: --prune
                      command:
                      - sh
                      - -c
                      args:
                      - |
                        set -e
                        export ARGOCD_SERVER={{inputs.parameters.argocd-server}}
                        export INSECURE_OPTION={{inputs.parameters.insecure-option}}
                        export ARGOCD_USERNAME=${ARGOCD_USERNAME:-admin}
                        argocd login ${INSECURE_OPTION} --username ${ARGOCD_USERNAME} --password ${ARGOCD_PASSWORD} ${ARGOCD_SERVER}
                        argocd app sync argocd/app-postgresql ${WITH_PRUNE_OPTION} --timeout 300
                  - name: wait
                    inputs:
                      artifacts:
                      - name: argocd-binary
                        path: /usr/local/bin/argocd
                      parameters:
                      - name: argocd-server
                      - name: insecure-option
                        value: ""
                    container:
                      image: m.daocloud.io/docker.io/library/fedora:39
                      env:
                      - name: ARGOCD_USERNAME
                        valueFrom:
                          secretKeyRef:
                            name: argocd-login-credentials
                            key: username
                      - name: ARGOCD_PASSWORD
                        valueFrom:
                          secretKeyRef:
                            name: argocd-login-credentials
                            key: password
                      command:
                      - sh
                      - -c
                      args:
                      - |
                        set -e
                        export ARGOCD_SERVER={{inputs.parameters.argocd-server}}
                        export INSECURE_OPTION={{inputs.parameters.insecure-option}}
                        export ARGOCD_USERNAME=${ARGOCD_USERNAME:-admin}
                        argocd login ${INSECURE_OPTION} --username ${ARGOCD_USERNAME} --password ${ARGOCD_PASSWORD} ${ARGOCD_SERVER}
                        argocd app wait argocd/app-postgresql
                  - name: init-db-tool
                    resource:
                      action: apply
                      manifest: |
                        apiVersion: apps/v1
                        kind: Deployment
                        metadata:
                          name: app-postgresql-tool
                          namespace: application
                          labels:
                            app.kubernetes.io/name: postgresql-tool
                        spec:
                          replicas: 1
                          selector:
                            matchLabels:
                              app.kubernetes.io/name: postgresql-tool
                          template:
                            metadata:
                              labels:
                                app.kubernetes.io/name: postgresql-tool
                            spec:
                              containers:
                                - name: postgresql-tool
                                  image: m.daocloud.io/docker.io/bitnami/postgresql:14.4.0-debian-11-r9
                                  imagePullPolicy: IfNotPresent
                                  env:
                                    - name: POSTGRES_PASSWORD
                                      valueFrom:
                                        secretKeyRef:
                                          key: postgres-password
                                          name: postgresql-credentials
                                    - name: TZ
                                      value: Asia/Shanghai
                                  command:
                                    - tail
                                  args:
                                    - -f
                                    - /etc/hosts

6.submit to argo workflow client

                Details
                argo -n business-workflows submit deploy-postgresql-flow.yaml

                7.decode password

                Details
                kubectl -n application get secret postgresql-credentials -o jsonpath='{.data.postgres-password}' | base64 -d

                8.import data

                Details
                POSTGRES_PASSWORD=$(kubectl -n application get secret postgresql-credentials -o jsonpath='{.data.postgres-password}' | base64 -d) \
                POD_NAME=$(kubectl get pod -n application -l "app.kubernetes.io/name=postgresql-tool" -o jsonpath="{.items[0].metadata.name}") \
                && export SQL_FILENAME="init_dfs_table_data.sql" \
                && kubectl -n application cp ${SQL_FILENAME} ${POD_NAME}:/tmp/${SQL_FILENAME} \
                && kubectl -n application exec -it deployment/app-postgresql-tool -- bash -c \
                    'echo "CREATE DATABASE csst;" | PGPASSWORD="$POSTGRES_PASSWORD" \
                    psql --host app-postgresql.application -U postgres -d postgres -p 5432' \
                && kubectl -n application exec -it deployment/app-postgresql-tool -- bash -c \
                    'PGPASSWORD="$POSTGRES_PASSWORD" psql --host app-postgresql.application \
                    -U postgres -d csst -p 5432 < /tmp/init_dfs_table_data.sql'
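After the import you can sanity-check that the tables exist (the table names depend on init_dfs_table_data.sql):

kubectl -n application exec -it deployment/app-postgresql-tool -- bash -c \
    'PGPASSWORD="$POSTGRES_PASSWORD" psql --host app-postgresql.application \
    -U postgres -d csst -p 5432 --command "\dt"'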


                Mar 7, 2024

                Install PgAdmin

                🚀Installation

                Install By

                1.get helm repo

                Details
                helm repo add runix https://helm.runix.net/
                helm repo update

                2.install chart

                Details
                helm install runix/pgadmin4 --generate-name --version 1.23.3
                Using AY Helm Mirror

                1.prepare `pgadmin-credentials.yaml`

                Details
                kubectl -n database create secret generic pgadmin-credentials \
                  --from-literal=password=$(tr -dc A-Za-z0-9 </dev/urandom | head -c 16)

                2.prepare `deploy-pgadmin.yaml`

                Details
kubectl -n argocd apply -f - <<EOF
                apiVersion: argoproj.io/v1alpha1
                kind: Application
                metadata:
                  name: pgadmin
                spec:
                  syncPolicy:
                    syncOptions:
                    - CreateNamespace=true
                  project: default
                  source:
                    repoURL: https://helm.runix.net/
                    chart: pgadmin4
                    targetRevision: 1.23.3
                    helm:
                      releaseName: pgadmin4
                      values: |
                        replicaCount: 1
                        persistentVolume:
                          enabled: false
                        env:
                          email: pgadmin@mail.72602.online
                          variables:
                            - name: PGADMIN_CONFIG_WTF_CSRF_ENABLED
                              value: "False"
                        existingSecret: pgadmin-credentials
                        resources:
                          requests:
                            memory: 512Mi
                            cpu: 500m
                          limits:
                            memory: 1024Mi
                            cpu: 1000m
                        image:
                          registry: m.daocloud.io/docker.io
                          pullPolicy: IfNotPresent
                        ingress:
                          enabled: true
                          ingressClassName: nginx
                          annotations:
                            cert-manager.io/cluster-issuer: letsencrypt
                          hosts:
                            - host: pgadmin.72602.online
                              paths:
                                - path: /
                                  pathType: ImplementationSpecific
                          tls:
                            - secretName: pgadmin.72602.online-tls
                              hosts:
                                - pgadmin.72602.online
                  destination:
                    server: https://kubernetes.default.svc
                    namespace: database
                EOF

                3.sync by argocd

                Details
                argocd app sync argocd/pgadmin
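Once synced, pgAdmin should be reachable at https://pgadmin.72602.online; log in with the email configured above and the password stored in the secret:

kubectl -n database get secret pgadmin-credentials -o jsonpath='{.data.password}' | base64 -d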


                Mar 7, 2025

                Install Redis

                Installation

                Install By

                Preliminary

                1. Kubernetes has installed, if not check 🔗link


                2. Helm has installed, if not check 🔗link


                1.get helm repo

                Details
                helm repo add ay-helm-mirror https://aaronyang0628.github.io/helm-chart-mirror/charts
                helm repo update

                2.install chart

                Details
# assuming the mirror repackages the bitnami redis chart; adjust the chart name if your mirror differs
helm install ay-helm-mirror/redis --generate-name
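A password can also be supplied at install time instead of relying on the chart's generated one (a sketch; `auth.password` follows the bitnami redis chart values, which the mirror is assumed to repackage):

REDIS_PASSWORD=$(tr -dc A-Za-z0-9 </dev/urandom | head -c 16)
helm install my-redis ay-helm-mirror/redis --set auth.password=${REDIS_PASSWORD}
echo ${REDIS_PASSWORD}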
                Using Proxy

                Preliminary

                1. Kubernetes has installed, if not check 🔗link


                2. Helm has installed, if not check 🔗link


                3. ArgoCD has installed, if not check 🔗link


                1.prepare `redis-credentials`

                Details
# the chart is deployed into the storage namespace below, so the existingSecret must live there
kubectl get namespaces storage > /dev/null 2>&1 || kubectl create namespace storage
kubectl -n storage create secret generic redis-credentials \
    --from-literal=redis-password=$(tr -dc A-Za-z0-9 </dev/urandom | head -c 16)

                2.apply `deploy-redis.yaml`

                Details
                kubectl -n argocd apply -f - << 'EOF'
                apiVersion: argoproj.io/v1alpha1
                kind: Application
                metadata:
                  name: redis
                spec:
                  syncPolicy:
                    syncOptions:
                    - CreateNamespace=true
                  project: default
                  source:
                    repoURL: https://charts.bitnami.com/bitnami
                    chart: redis
                    targetRevision: 18.16.0
                    helm:
                      releaseName: redis
                      values: |
                        architecture: replication
                        auth:
                          enabled: true
                          sentinel: false
                          existingSecret: redis-credentials
                        master:
                          count: 1
                          resources:
                            requests:
                              memory: 512Mi
                              cpu: 512m
                            limits:
                              memory: 1024Mi
                              cpu: 1024m
                          disableCommands:
                            - FLUSHDB
                            - FLUSHALL
                          persistence:
                            enabled: true
                            storageClass: "local-path"
                            accessModes:
                            - ReadWriteOnce
                            size: 8Gi
                        replica:
                          replicaCount: 1
                          resources:
                            requests:
                              memory: 512Mi
                              cpu: 512m
                            limits:
                              memory: 1024Mi
                              cpu: 1024m
                          disableCommands:
                            - FLUSHDB
                            - FLUSHALL
                          persistence:
                            enabled: true
                            storageClass: "local-path"
                            accessModes:
                            - ReadWriteOnce
                            size: 8Gi
                        image:
                          registry: m.daocloud.io/docker.io
                          pullPolicy: IfNotPresent
                        sentinel:
                          enabled: false
                        metrics:
                          enabled: false
                        volumePermissions:
                          enabled: false
                        sysctl:
                          enabled: false
                        extraDeploy:
                        - apiVersion: traefik.io/v1alpha1
                          kind: IngressRouteTCP
                          metadata:
                            name: redis-tcp
                            namespace: storage
                          spec:
                            entryPoints:
                              - redis
                            routes:
                            - match: HostSNI(`*`)
                              services:
                              - name: redis-master
                                port: 6379
                  destination:
                    server: https://kubernetes.default.svc
                    namespace: storage
                EOF

                3.sync by argocd

                Details
                argocd app sync argocd/redis

                4.test redis connection

                Details
REDIS_PASSWORD=$(kubectl -n storage get secret redis-credentials -o jsonpath='{.data.redis-password}' | base64 -d)
kubectl -n storage run test --rm -it --image=m.daocloud.io/docker.io/library/redis:7 -- \
    redis-cli -h redis-master -p 6379 -a ${REDIS_PASSWORD} ping

                Preliminary

                1. Docker|Podman|Buildah has installed, if not check 🔗link


                Using Proxy

you can run an additional DaoCloud proxy image to accelerate image pulling, check Daocloud Proxy

                1.init server

                Details
                mkdir -p $(pwd)/redis/data
                podman run --rm \
                    --name redis \
                    -p 6379:6379 \
                    -v $(pwd)/redis/data:/data \
                    -d docker.io/library/redis:7.2.4-alpine

                2.use internal client

                Details
                podman run --rm \
                    -it docker.io/library/redis:7.2.4-alpine \
                    redis-cli \
                    -h host.containers.internal \
                    set mykey somevalue
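You can read the value back the same way:

podman run --rm \
    -it docker.io/library/redis:7.2.4-alpine \
    redis-cli \
    -h host.containers.internal \
    get mykey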

                Preliminary

                1. Kubernetes has installed, if not check 🔗link


                2. Helm has installed, if not check 🔗link


                3. ArgoCD has installed, if not check 🔗link


                4. Argo Workflow has installed, if not check 🔗link


                5. Minio artifact repository has been configured, if not check 🔗link


                - endpoint: minio.storage:9000

                1.prepare `argocd-login-credentials`

                Details
                ARGOCD_USERNAME=admin
                ARGOCD_PASSWORD=$(kubectl -n argocd get secret argocd-initial-admin-secret -o jsonpath="{.data.password}" | base64 -d)
                kubectl -n business-workflows create secret generic argocd-login-credentials \
                    --from-literal=username=${ARGOCD_USERNAME} \
                    --from-literal=password=${ARGOCD_PASSWORD}

                2.apply rolebinding to k8s

                Details
                kubectl apply -f - <<EOF
                ---
                apiVersion: rbac.authorization.k8s.io/v1
                kind: ClusterRole
                metadata:
                  name: application-administrator
                rules:
                  - apiGroups:
                      - argoproj.io
                    resources:
                      - applications
                    verbs:
                      - '*'
                  - apiGroups:
                      - apps
                    resources:
                      - deployments
                    verbs:
                      - '*'
                
                ---
                apiVersion: rbac.authorization.k8s.io/v1
                kind: RoleBinding
                metadata:
                  name: application-administration
                  namespace: argocd
                roleRef:
                  apiGroup: rbac.authorization.k8s.io
                  kind: ClusterRole
                  name: application-administrator
                subjects:
                  - kind: ServiceAccount
                    name: argo-workflow
                    namespace: business-workflows
                
                ---
                apiVersion: rbac.authorization.k8s.io/v1
                kind: RoleBinding
                metadata:
                  name: application-administration
                  namespace: application
                roleRef:
                  apiGroup: rbac.authorization.k8s.io
                  kind: ClusterRole
                  name: application-administrator
                subjects:
                  - kind: ServiceAccount
                    name: argo-workflow
                    namespace: business-workflows
                EOF

                3.prepare redis credentials secret

                Details
                kubectl -n application create secret generic redis-credentials \
                  --from-literal=redis-password=$(tr -dc A-Za-z0-9 </dev/urandom | head -c 16)

                4.prepare `deploy-redis-flow.yaml`

                Details
                apiVersion: argoproj.io/v1alpha1
                kind: Workflow
                metadata:
                  generateName: deploy-argocd-app-redis-
                spec:
                  entrypoint: entry
                  artifactRepositoryRef:
                    configmap: artifact-repositories
                    key: default-artifact-repository
                  serviceAccountName: argo-workflow
                  templates:
                  - name: entry
                    inputs:
                      parameters:
                      - name: argocd-server
                        value: argocd-server.argocd:443
                      - name: insecure-option
                        value: --insecure
                    dag:
                      tasks:
                      - name: apply
                        template: apply
                      - name: prepare-argocd-binary
                        template: prepare-argocd-binary
                        dependencies:
                        - apply
                      - name: sync
                        dependencies:
                        - prepare-argocd-binary
                        template: sync
                        arguments:
                          artifacts:
                          - name: argocd-binary
                            from: "{{tasks.prepare-argocd-binary.outputs.artifacts.argocd-binary}}"
                          parameters:
                          - name: argocd-server
                            value: "{{inputs.parameters.argocd-server}}"
                          - name: insecure-option
                            value: "{{inputs.parameters.insecure-option}}"
                      - name: wait
                        dependencies:
                        - sync
                        template: wait
                        arguments:
                          artifacts:
                          - name: argocd-binary
                            from: "{{tasks.prepare-argocd-binary.outputs.artifacts.argocd-binary}}"
                          parameters:
                          - name: argocd-server
                            value: "{{inputs.parameters.argocd-server}}"
                          - name: insecure-option
                            value: "{{inputs.parameters.insecure-option}}"
                  - name: apply
                    resource:
                      action: apply
                      manifest: |
                        apiVersion: argoproj.io/v1alpha1
                        kind: Application
                        metadata:
                          name: app-redis
                          namespace: argocd
                        spec:
                          syncPolicy:
                            syncOptions:
                            - CreateNamespace=true
                          project: default
                          source:
                            repoURL: https://charts.bitnami.com/bitnami
                            chart: redis
                            targetRevision: 18.16.0
                            helm:
                              releaseName: app-redis
                              values: |
                                architecture: replication
                                auth:
                                  enabled: true
                                  sentinel: true
                                  existingSecret: redis-credentials
                                master:
                                  count: 1
                                  disableCommands:
                                    - FLUSHDB
                                    - FLUSHALL
                                  persistence:
                                    enabled: false
                                replica:
                                  replicaCount: 3
                                  disableCommands:
                                    - FLUSHDB
                                    - FLUSHALL
                                  persistence:
                                    enabled: false
                                image:
                                  registry: m.daocloud.io/docker.io
                                  pullPolicy: IfNotPresent
                                sentinel:
                                  enabled: false
                                  persistence:
                                    enabled: false
                                  image:
                                    registry: m.daocloud.io/docker.io
                                    pullPolicy: IfNotPresent
                                metrics:
                                  enabled: false
                                  image:
                                    registry: m.daocloud.io/docker.io
                                    pullPolicy: IfNotPresent
                                volumePermissions:
                                  enabled: false
                                  image:
                                    registry: m.daocloud.io/docker.io
                                    pullPolicy: IfNotPresent
                                sysctl:
                                  enabled: false
                                  image:
                                    registry: m.daocloud.io/docker.io
                                    pullPolicy: IfNotPresent
                          destination:
                            server: https://kubernetes.default.svc
                            namespace: application
                  - name: prepare-argocd-binary
                    inputs:
                      artifacts:
                      - name: argocd-binary
                        path: /tmp/argocd
                        mode: 755
                        http:
                          url: https://files.m.daocloud.io/github.com/argoproj/argo-cd/releases/download/v2.9.3/argocd-linux-amd64
                    outputs:
                      artifacts:
                      - name: argocd-binary
                        path: "{{inputs.artifacts.argocd-binary.path}}"
                    container:
                      image: m.daocloud.io/docker.io/library/fedora:39
                      command:
                      - sh
                      - -c
                      args:
                      - |
                        ls -l {{inputs.artifacts.argocd-binary.path}}
                  - name: sync
                    inputs:
                      artifacts:
                      - name: argocd-binary
                        path: /usr/local/bin/argocd
                      parameters:
                      - name: argocd-server
                      - name: insecure-option
                        value: ""
                    container:
                      image: m.daocloud.io/docker.io/library/fedora:39
                      env:
                      - name: ARGOCD_USERNAME
                        valueFrom:
                          secretKeyRef:
                            name: argocd-login-credentials
                            key: username
                      - name: ARGOCD_PASSWORD
                        valueFrom:
                          secretKeyRef:
                            name: argocd-login-credentials
                            key: password
                      - name: WITH_PRUNE_OPTION
                        value: --prune
                      command:
                      - sh
                      - -c
                      args:
                      - |
                        set -e
                        export ARGOCD_SERVER={{inputs.parameters.argocd-server}}
                        export INSECURE_OPTION={{inputs.parameters.insecure-option}}
                        export ARGOCD_USERNAME=${ARGOCD_USERNAME:-admin}
                        argocd login ${INSECURE_OPTION} --username ${ARGOCD_USERNAME} --password ${ARGOCD_PASSWORD} ${ARGOCD_SERVER}
                        argocd app sync argocd/app-redis ${WITH_PRUNE_OPTION} --timeout 300
                  - name: wait
                    inputs:
                      artifacts:
                      - name: argocd-binary
                        path: /usr/local/bin/argocd
                      parameters:
                      - name: argocd-server
                      - name: insecure-option
                        value: ""
                    container:
                      image: m.daocloud.io/docker.io/library/fedora:39
                      env:
                      - name: ARGOCD_USERNAME
                        valueFrom:
                          secretKeyRef:
                            name: argocd-login-credentials
                            key: username
                      - name: ARGOCD_PASSWORD
                        valueFrom:
                          secretKeyRef:
                            name: argocd-login-credentials
                            key: password
                      command:
                      - sh
                      - -c
                      args:
                      - |
                        set -e
                        export ARGOCD_SERVER={{inputs.parameters.argocd-server}}
                        export INSECURE_OPTION={{inputs.parameters.insecure-option}}
                        export ARGOCD_USERNAME=${ARGOCD_USERNAME:-admin}
                        argocd login ${INSECURE_OPTION} --username ${ARGOCD_USERNAME} --password ${ARGOCD_PASSWORD} ${ARGOCD_SERVER}
                        argocd app wait argocd/app-redis

6.submit to argo workflow client

                Details
                argo -n business-workflows submit deploy-redis-flow.yaml

                7.decode password

                Details
                kubectl -n application get secret redis-credentials -o jsonpath='{.data.redis-password}' | base64 -d
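With the password decoded, a quick connectivity check can be run in the application namespace (a sketch; the service name `app-redis-master` follows the release name in the manifest above):

REDIS_PASSWORD=$(kubectl -n application get secret redis-credentials -o jsonpath='{.data.redis-password}' | base64 -d)
kubectl -n application run redis-test --rm -it \
    --image=m.daocloud.io/docker.io/library/redis:7 -- \
    redis-cli -h app-redis-master.application -p 6379 -a ${REDIS_PASSWORD} ping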


                Mar 7, 2024

                Subsections of Git

                Install Act Runner

                Installation

                Install By

                Preliminary

                1. Kubernetes has installed, if not check 🔗link


                2. Helm binary has installed, if not check 🔗link


                1.get helm repo

                Details
                helm repo add ay-helm-mirror https://aaronyang0628.github.io/helm-chart-mirror/charts
                helm repo update

                2.prepare `act-runner-secret`

                Details
                kubectl -n application create secret generic act-runner-secret \
                  --from-literal=act-runner-token=4w3Sx0Hwe6VFevl473ZZ4nFVDvFvhKcEUBvpJ09L

                3.prepare values

                Details
cat > act-runner-values.yaml <<'EOF'
replicas: 1
runner:
  instanceURL: http://192.168.100.125:30300
  token:
    fromSecret:
      name: "act-runner-secret"
      key: "act-runner-token"
EOF

                4.install chart

                Details
                helm upgrade  --create-namespace -n application --install -f ./act-runner-values.yaml act-runner ay-helm-mirror/act-runner
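After the release is installed, you can check that the runner pod came up and registered against the Gitea instance (the label selector is an assumption based on the release name; adjust it to the labels the chart actually sets):

kubectl -n application get pods -l app.kubernetes.io/instance=act-runner
kubectl -n application logs -l app.kubernetes.io/instance=act-runner --tail=20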

                Preliminary

                1. Kubernetes has installed, if not check 🔗link


                2. ArgoCD has installed, if not check 🔗link


                3. Helm binary has installed, if not check 🔗link


                1.prepare `act-runner-secret`

                Details
                kubectl -n application create secret generic act-runner-secret \
                  --from-literal=act-runner-token=4w3Sx0Hwe6VFevl473ZZ4nFVDvFvhKcEUBvpJ09L
The act-runner-token can be obtained from here.

                The token is used for authentication and identification, e.g. P2U1U0oB4XaRCi8azcngmPCLbRpUGapalhmddh23. Each token can be used to create multiple runners, until it is replaced with a new token via the reset link. You can obtain different levels of tokens from the following places to create the corresponding level of runners:

                Instance level: the admin settings page, like <your_gitea.com>/-/admin/actions/runners.


                2.prepare act-runner.yaml

                Storage In
                kubectl -n argocd apply -f - <<EOF
                apiVersion: argoproj.io/v1alpha1
                kind: Application
                metadata:
                  name: act-runner
                spec:
                  syncPolicy:
                    syncOptions:
                    - CreateNamespace=true
                  project: default
                  source:
                    repoURL: https://aaronyang0628.github.io/helm-chart-mirror/charts
                    chart: act-runner
                    targetRevision: 0.2.2
                    helm:
                      releaseName: act-runner
                      values: |
                        image:
                          name: vegardit/gitea-act-runner
                          tag: "dind-0.2.13"
                          repository: m.daocloud.io/docker.io
                        runner:
instanceURL: http://192.168.100.125:30300
                          token:
                            fromSecret:
                              name: "act-runner-secret"
                              key: "act-runner-token"
                          config:
                            enabled: true
                            data: |
                              log:
                                level: info
                              runner:
                                labels:
                                  - ubuntu-latest:docker://m.daocloud.io/docker.gitea.com/runner-images:ubuntu-latest
                              container:
                                force_pull: true
                        persistence:
                          enabled: true
                          storageClassName: ""
                          accessModes: ReadWriteOnce
                          size: 10Gi
                        autoscaling:
                          enabled: true
                          minReplicas: 1
                          maxReplicas: 3
                        replicas: 1  
                        securityContext:
                          privileged: true
                          runAsUser: 0
                          runAsGroup: 0
                          fsGroup: 0
                          capabilities:
                            add: ["NET_ADMIN", "SYS_ADMIN"]
                        podSecurityContext:
                          runAsUser: 0
                          runAsGroup: 0
                          fsGroup: 0
                        resources: 
                          requests:
                            cpu: 200m
                            memory: 512Mi
                          limits:
                            cpu: 1000m
                            memory: 2048Mi
                  destination:
                    server: https://kubernetes.default.svc
                    namespace: application
                EOF
                

3.sync by argocd

                Details
                argocd app sync argocd/act-runner
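A quick check after the sync; the pod name filter is an assumption based on the release name act-runner:

argocd app get argocd/act-runner
# pods are assumed to be named after the act-runner release
kubectl -n application get pods | grep act-runner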

4.use Actions

                Details

                Even if Actions is enabled for the Gitea instance, repositories still disable Actions by default.

To enable it, go to the settings page of your repository, e.g. your_gitea.com/<owner>/repo/settings, and check Enable Repository Actions.

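Once Repository Actions is enabled, a minimal workflow is enough to confirm the runner picks up jobs. This is only a sketch: the file path follows the Gitea Actions convention, and the ubuntu-latest label matches the runner config above.

mkdir -p .gitea/workflows
cat <<'EOF' > .gitea/workflows/demo.yaml
name: demo
on: [push]
jobs:
  hello:
    runs-on: ubuntu-latest   # matches the label defined in the runner config above
    steps:
      - run: echo "hello from act_runner"
EOF
git add .gitea/workflows/demo.yaml && git commit -m "add demo workflow" && git push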

                Preliminary

1. Podman is installed, and the `podman` command is available in your PATH.


                1.prepare data and config dir

                Details
                mkdir -p /opt/gitea_act_runner/{data,config} \
                && chown -R 1000:1000 /opt/gitea_act_runner \
                && chmod -R 755 /opt/gitea_act_runner

                2.run container

                Details
                podman run -it \
                  --name gitea_act_runner \
                  --rm \
                  --privileged \
                  --network=host \
                  -v /opt/gitea_act_runner/data:/data \
                  -v /opt/gitea_act_runner/config:/config \
                  -v /var/run/podman/podman.sock:/var/run/docker.sock \
                  -e GITEA_INSTANCE_URL="http://10.200.60.64:30300" \
                  -e GITEA_RUNNER_REGISTRATION_TOKEN="5lgsrOzfKz3RiqeMWxxUb9RmUPEWNnZ6hTTZV0DL" \
                  m.daocloud.io/docker.io/gitea/act_runner:latest-dind-rootless
                Using Mirror

you can use an additional DaoCloud mirror image to accelerate image pulling, check Daocloud Proxy
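To confirm the runner registered against the Gitea instance, tail the container logs from another terminal (the container name matches --name above):

podman logs -f gitea_act_runner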

                Preliminary

1. Docker is installed, and the `docker` command is available in your PATH.

                1.prepare data and config dir

                Details
                mkdir -p /opt/gitea_act_runner/{data,config} \
                && chown -R 1000:1000 /opt/gitea_act_runner \
                && chmod -R 755 /opt/gitea_act_runner

                2.run container

                Details
                docker run -it \
                  --name gitea_act_runner \
                  --rm \
                  --privileged \
                  --network=host \
                  -v /opt/gitea_act_runner/data:/data \
                  -v /opt/gitea_act_runner/config:/config \
                  -e GITEA_INSTANCE_URL="http://192.168.100.125:30300" \
                  -e GITEA_RUNNER_REGISTRATION_TOKEN="5lgsrOzfKz3RiqeMWxxUb9RmUPEWNnZ6hTTZV0DL" \
                  m.daocloud.io/docker.io/gitea/act_runner:latest-dind
                Using Mirror

you can use an additional DaoCloud mirror image to accelerate image pulling, check Daocloud Proxy


                Jun 7, 2025

                Install Gitea

                Installation

                Install By

                Preliminary

1. Kubernetes is installed, if not check 🔗link


2. Helm binary is installed, if not check 🔗link


3. CertManager is installed, if not check 🔗link


4. Ingress is installed, if not check 🔗link


                1.get helm repo

                Details
                helm repo add gitea-charts https://dl.gitea.com/charts/
                helm repo update

                2.install chart

                Details
helm install gitea gitea-charts/gitea
                Using Mirror
                helm repo add ay-helm-mirror https://aaronyang0628.github.io/helm-chart-mirror/charts \
                  && helm install ay-helm-mirror/gitea --generate-name --version 12.1.3

                for more information, you can check 🔗https://aaronyang0628.github.io/helm-chart-mirror/

                Preliminary

1. Kubernetes is installed, if not check 🔗link


2. ArgoCD is installed, if not check 🔗link


3. Helm binary is installed, if not check 🔗link


4. Ingress is installed on ArgoCD, if not check 🔗link


5. Minio is installed, if not check 🔗link


1.prepare `gitea-admin-credentials`

                Storage In
                kubectl get namespaces application > /dev/null 2>&1 || kubectl create namespace application
                kubectl -n application create secret generic gitea-admin-credentials \
                    --from-literal=username=gitea_admin \
                    --from-literal=password=$(tr -dc A-Za-z0-9 </dev/urandom | head -c 16)
                

                2.prepare `gitea.yaml`

                Storage In
                apiVersion: argoproj.io/v1alpha1
                kind: Application
                metadata:
                  name: gitea
                spec:
                  syncPolicy:
                    syncOptions:
                    - CreateNamespace=true
                  project: default
                  source:
                    repoURL: https://dl.gitea.com/charts/
                    chart: gitea
                    targetRevision: 10.1.4
                    helm:
                      releaseName: gitea
                      values: |
                        image:
                          registry: m.daocloud.io/docker.io
                        service:
                          http:
                            type: NodePort
                            port: 3000
                            nodePort: 30300
                          ssh:
                            type: NodePort
                            port: 22
                            nodePort: 32022
                        ingress:
                          enabled: true
                          ingressClassName: nginx
                          annotations:
                            kubernetes.io/ingress.class: nginx
                            nginx.ingress.kubernetes.io/rewrite-target: /$1
                            cert-manager.io/cluster-issuer: self-signed-ca-issuer
                          hosts:
                          - host: gitea.ay.dev
                            paths:
                            - path: /?(.*)
                              pathType: ImplementationSpecific
                          tls:
                          - secretName: gitea.ay.dev-tls
                            hosts:
                            - gitea.ay.dev
                        persistence:
                          enabled: true
                          size: 8Gi
                          storageClass: ""
                        redis-cluster:
                          enabled: false
                        postgresql-ha:
                          enabled: false
                        postgresql:
                          enabled: true
                          architecture: standalone
                          image:
                            registry: m.daocloud.io/docker.io
                          primary:
                            persistence:
                              enabled: false
                              storageClass: ""
                              size: 8Gi
                          readReplicas:
                            replicaCount: 1
                            persistence:
                              enabled: true
                              storageClass: ""
                              size: 8Gi
                          backup:
                            enabled: false
                          volumePermissions:
                            enabled: false
                            image:
                              registry: m.daocloud.io/docker.io
                          metrics:
                            enabled: false
                            image:
                              registry: m.daocloud.io/docker.io
                        gitea:
                          admin:
                            existingSecret: gitea-admin-credentials
                            email: aaron19940628@gmail.com
                          config:
                            database:
                              DB_TYPE: postgres
                            session:
                              PROVIDER: db
                            cache:
                              ADAPTER: memory
                            queue:
                              TYPE: level
                            indexer:
                              ISSUE_INDEXER_TYPE: bleve
                              REPO_INDEXER_ENABLED: true
                            repository:
                              MAX_CREATION_LIMIT: 10
                              DISABLED_REPO_UNITS: "repo.wiki,repo.ext_wiki,repo.projects"
                              DEFAULT_REPO_UNITS: "repo.code,repo.releases,repo.issues,repo.pulls"
                            server:
                              PROTOCOL: http
                              LANDING_PAGE: login
                              DOMAIN: gitea.ay.dev
                              ROOT_URL: https://gitea.ay.dev:32443/
                              SSH_DOMAIN: ssh.gitea.ay.dev
                              SSH_PORT: 32022
                              SSH_AUTHORIZED_PRINCIPALS_ALLOW: email
                            admin:
                              DISABLE_REGULAR_ORG_CREATION: true
                            security:
                              INSTALL_LOCK: true
                            service:
                              REGISTER_EMAIL_CONFIRM: true
                              DISABLE_REGISTRATION: true
                              ENABLE_NOTIFY_MAIL: false
                              DEFAULT_ALLOW_CREATE_ORGANIZATION: false
                              SHOW_MILESTONES_DASHBOARD_PAGE: false
                            migrations:
                              ALLOW_LOCALNETWORKS: true
                            mailer:
                              ENABLED: false
                            i18n:
                              LANGS: "en-US,zh-CN"
                              NAMES: "English,简体中文"
                            oauth2:
                              ENABLE: false
                  destination:
                    server: https://kubernetes.default.svc
                    namespace: application
                
                

                3.apply to k8s

                Details
                kubectl -n argocd apply -f gitea.yaml

                4.sync by argocd

                Details
                argocd app sync argocd/gitea

                5.decode admin password

login 🔗https://gitea.ay.dev:32443/ using user gitea_admin and the password decoded below
                Details
                kubectl -n application get secret gitea-admin-credentials -o jsonpath='{.data.password}' | base64 -d
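As a quick smoke test, you can hit the Gitea API with the admin credentials; gitea.ay.dev must resolve to the ingress node (e.g. via /etc/hosts), and -k is needed because the certificate is self-signed:

GITEA_PASS=$(kubectl -n application get secret gitea-admin-credentials -o jsonpath='{.data.password}' | base64 -d)
curl -ks -u "gitea_admin:${GITEA_PASS}" https://gitea.ay.dev:32443/api/v1/version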


                Jun 7, 2025

                HPC

                  Mar 7, 2024

                  Subsections of Monitor

                  Install Homepage

Official Documentation: https://gethomepage.dev/

                  Installation

                  Install By

                  Preliminary

1. Kubernetes is installed, if not check 🔗link


2. Helm is installed, if not check 🔗link


                  1.install chart directly

                  Details
                  helm install homepage oci://ghcr.io/m0nsterrr/helm-charts/homepage

                  2.you can modify the values.yaml and re-install

                  Related values files
                  Details
                  helm install homepage oci://ghcr.io/m0nsterrr/helm-charts/homepage -f homepage.values.yaml
                  Using Mirror
                  helm repo add ay-helm-mirror https://aaronyang0628.github.io/helm-chart-mirror/charts \
                    && helm install ay-helm-mirror/homepage  --generate-name --version 4.2.0

                  for more information, you can check 🔗https://aaronyang0628.github.io/helm-chart-mirror/

                  Preliminary

1. Kubernetes is installed, if not check 🔗link


2. ArgoCD is installed, if not check 🔗link


3. Helm binary is installed, if not check 🔗link


4. Ingress is installed on ArgoCD, if not check 🔗link


                  1.prepare `homepage.yaml`

                  Details
                  kubectl -n argocd apply -f - << EOF
                    apiVersion: argoproj.io/v1alpha1
                    kind: Application
                    metadata:
                      name: homepage
                    spec:
                      syncPolicy:
                        syncOptions:
                          - CreateNamespace=true
                          - ServerSideApply=true
                      project: default
                      source:
                        repoURL: oci://ghcr.io/m0nsterrr/helm-charts/homepage
                        chart: homepage
                        targetRevision: 4.2.0
                        helm:
                          releaseName: homepage
                          values: |
                            image:
                              registry: m.daocloud.io/ghcr.io
                              repository: gethomepage/homepage
                              pullPolicy: IfNotPresent
                              tag: "v1.5.0"
                            config:
                              allowedHosts: 
                              - "home.72602.online"
                            ingress:
                              enabled: true
                              ingressClassName: "nginx"
                              annotations:
                                kubernetes.io/ingress.class: nginx
                              hosts:
                                - host: home.72602.online
                                  paths:
                                    - path: /
                                      pathType: ImplementationSpecific
                            resources:
                              limits:
                                cpu: 500m
                                memory: 512Mi
                              requests:
                                cpu: 100m
                                memory: 128Mi
                      destination:
                        server: https://kubernetes.default.svc
                        namespace: monitor
                  EOF

2.sync by argocd

                  Details
                  argocd app sync argocd/homepage

3.check the web browser

                  Details
                  K8S_MASTER_IP=$(kubectl get nodes --selector=node-role.kubernetes.io/control-plane -o jsonpath='{$.items[0].status.addresses[?(@.type=="InternalIP")].address}')
                  echo "$K8S_MASTER_IP home.72602.online" >> /etc/hosts

                  Preliminary

1. Kubernetes is installed, if not check 🔗link


2. Docker is installed, if not check 🔗link


                  docker run -d \
                  --name homepage \
                  -e HOMEPAGE_ALLOWED_HOSTS=47.110.67.161:3000 \
                  -e PUID=1000 \
                  -e PGID=1000 \
                  -p 3000:3000 \
                  -v /root/home-site/static/icons:/app/public/icons  \
                  -v /root/home-site/content/Ops/HomePage/config:/app/config \
                  -v /var/run/docker.sock:/var/run/docker.sock:ro \
                  --restart unless-stopped \
                  ghcr.io/gethomepage/homepage:v1.5.0

                  Preliminary

1. Kubernetes is installed, if not check 🔗link


2. Podman is installed, if not check 🔗link


                  podman run -d \
                  --name homepage \
                  -e HOMEPAGE_ALLOWED_HOSTS=127.0.0.1:3000 \
                  -e PUID=1000 \
                  -e PGID=1000 \
                  -p 3000:3000 \
                  -v /root/home-site/static/icons:/app/public/icons \
                  -v /root/home-site/content/Ops/HomePage/config:/app/config \
                  --restart unless-stopped \
                  ghcr.io/gethomepage/homepage:v1.5.0
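The mounted config directory is where homepage reads its dashboard definition from. A minimal services.yaml sketch (format per the homepage docs; the Gitea entry is only an illustrative assumption, adjust or merge with any existing file):

cat <<'EOF' > /root/home-site/content/Ops/HomePage/config/services.yaml
- Dev:
    - Gitea:                      # example entry, replace with your own services
        href: https://gitea.ay.dev:32443/
        description: self-hosted git server
EOF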


                  Oct 7, 2025

Install Prometheus Stack

                  Installation

                  Install By

                  Preliminary

1. Kubernetes is installed, if not check 🔗link


2. Helm is installed, if not check 🔗link


                  1.get helm repo

                  Details
                  helm repo add ay-helm-mirror https://aaronyang0628.github.io/helm-chart-mirror/charts
                  helm repo update

                  2.install chart

                  Details
                  helm install ay-helm-mirror/kube-prometheus-stack --generate-name
                  Using Mirror
                  helm repo add ay-helm-mirror https://aaronyang0628.github.io/helm-chart-mirror/charts \
  && helm install ay-helm-mirror/kube-prometheus-stack --generate-name --version 72.9.1

                  for more information, you can check 🔗https://aaronyang0628.github.io/helm-chart-mirror/

                  Preliminary

1. Kubernetes is installed, if not check 🔗link


2. ArgoCD is installed, if not check 🔗link


3. Helm binary is installed, if not check 🔗link


4. Ingress is installed on ArgoCD, if not check 🔗link


1.prepare `prometheus-stack-credentials`

                  Details
                  kubectl get namespaces monitor > /dev/null 2>&1 || kubectl create namespace monitor
                  kubectl -n monitor create secret generic prometheus-stack-credentials \
                    --from-literal=grafana-username=admin \
                    --from-literal=grafana-password=$(tr -dc A-Za-z0-9 </dev/urandom | head -c 16)

                  2.prepare `prometheus-stack.yaml`

                  Details
                  kubectl -n argocd apply -f - << EOF
                    apiVersion: argoproj.io/v1alpha1
                    kind: Application
                    metadata:
                      name: prometheus-stack
                    spec:
                      syncPolicy:
                        syncOptions:
                          - CreateNamespace=true
                          - ServerSideApply=true
                      project: default
                      source:
                        repoURL: https://aaronyang0628.github.io/helm-chart-mirror/charts
                        chart: kube-prometheus-stack
                        targetRevision: 72.9.1
                        helm:
                          releaseName: prometheus-stack
                          values: |
                            crds:
                              enabled: true
                            global:
                              rbac:
                                create: true
                              imageRegistry: ""
                              imagePullSecrets: []
                            alertmanager:
                              enabled: true
                              ingress:
                                enabled: false
                              serviceMonitor:
                                selfMonitor: true
                                interval: ""
                              alertmanagerSpec:
                                image:
                                  registry: m.daocloud.io/quay.io
                                  repository: prometheus/alertmanager
                                  tag: v0.28.1
                                replicas: 1
                                resources: {}
                                storage:
                                  volumeClaimTemplate:
                                    spec:
                                      storageClassName: ""
                                      accessModes: ["ReadWriteOnce"]
                                      resources:
                                        requests:
                                          storage: 2Gi
                            grafana:
                              enabled: true
                              ingress:
                                enabled: true
                                annotations:
                                  cert-manager.io/clusterissuer: self-signed-issuer
                                  kubernetes.io/ingress.class: nginx
                                hosts:
                                  - grafana.dev.tech
                                path: /
                                pathtype: ImplementationSpecific
                                tls:
                                - secretName: grafana.dev.tech-tls
                                  hosts:
                                  - grafana.dev.tech
                            prometheusOperator:
                              admissionWebhooks:
                                patch:
                                  resources: {}
                                  image:
                                    registry: m.daocloud.io/registry.k8s.io
                                    repository: ingress-nginx/kube-webhook-certgen
                                    tag: v1.5.3  
                              image:
                                registry: m.daocloud.io/quay.io
                                repository: prometheus-operator/prometheus-operator
                              prometheusConfigReloader:
                                image:
                                  registry: m.daocloud.io/quay.io
                                  repository: prometheus-operator/prometheus-config-reloader
                                resources: {}
                              thanosImage:
                                registry: m.daocloud.io/quay.io
                                repository: thanos/thanos
                                tag: v0.38.0
                            prometheus:
                              enabled: true
                              ingress:
                                enabled: true
                                annotations:
                                  cert-manager.io/clusterissuer: self-signed-issuer
                                  kubernetes.io/ingress.class: nginx
                                hosts:
                                  - prometheus.dev.tech
                                path: /
                                pathtype: ImplementationSpecific
                                tls:
                                - secretName: prometheus.dev.tech-tls
                                  hosts:
                                  - prometheus.dev.tech
                              prometheusSpec:
                                image:
                                  registry: m.daocloud.io/quay.io
                                  repository: prometheus/prometheus
                                  tag: v3.4.0
                                replicas: 1
                                shards: 1
                                resources: {}
                                storageSpec: 
                                  volumeClaimTemplate:
                                    spec:
                                      storageClassName: ""
                                      accessModes: ["ReadWriteOnce"]
                                      resources:
                                        requests:
                                          storage: 2Gi
                            thanosRuler:
                              enabled: false
                              ingress:
                                enabled: false
                              thanosRulerSpec:
                                replicas: 1
                                storage: {}
                                resources: {}
                                image:
                                  registry: m.daocloud.io/quay.io
                                  repository: thanos/thanos
                                  tag: v0.38.0
                      destination:
                        server: https://kubernetes.default.svc
                        namespace: monitor
                  EOF

                  3.sync by argocd

                  Details
                  argocd app sync argocd/prometheus-stack
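You can watch the stack come up before moving on (pod names vary by chart version):

kubectl -n monitor get pods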

4.extract grafana admin credentials

                  Details
                    kubectl -n monitor get secret prometheus-stack-credentials -o jsonpath='{.data.grafana-password}' | base64 -d

                  5.check the web browser

                  Details
add `$K8S_MASTER_IP grafana.dev.tech` to /etc/hosts

add `$K8S_MASTER_IP prometheus.dev.tech` to /etc/hosts

prometheus-server: https://prometheus.dev.tech:32443/


                  grafana-console: https://grafana.dev.tech:32443/


                  install based on docker

                  echo  "start from head is important"


                  Jun 7, 2024

                  Subsections of Networking

                  Install Cert Manager

                  Installation

                  Install By

                  Preliminary

1. Kubernetes is installed, if not check 🔗link


2. Helm binary is installed, if not check 🔗link


                  1.get helm repo

                  Details
                  helm repo add cert-manager-repo https://charts.jetstack.io
                  helm repo update

                  2.install chart

                  Details
                  helm install cert-manager-repo/cert-manager --generate-name --version 1.17.2
                  Using Mirror
                  helm repo add ay-helm-mirror https://aaronyang0628.github.io/helm-chart-mirror/charts \
                    && helm install ay-helm-mirror/cert-manager --generate-name --version 1.17.2

                  for more information, you can check 🔗https://aaronyang0628.github.io/helm-chart-mirror/

                  Preliminary

1. Kubernetes is installed, if not check 🔗link


2. ArgoCD is installed, if not check 🔗link


3. Helm binary is installed, if not check 🔗link


                  1.prepare `cert-manager.yaml`

                  Details
                  kubectl -n argocd apply -f - << EOF
                  apiVersion: argoproj.io/v1alpha1
                  kind: Application
                  metadata:
                    name: cert-manager
                  spec:
                    syncPolicy:
                      syncOptions:
                      - CreateNamespace=true
                    project: default
                    source:
                      repoURL: https://aaronyang0628.github.io/helm-chart-mirror/charts
                      chart: cert-manager
                      targetRevision: 1.17.2
                      helm:
                        releaseName: cert-manager
                        values: |
                          installCRDs: true
                          image:
                            repository: m.daocloud.io/quay.io/jetstack/cert-manager-controller
                            tag: v1.17.2
                          webhook:
                            image:
                              repository: m.daocloud.io/quay.io/jetstack/cert-manager-webhook
                              tag: v1.17.2
                          cainjector:
                            image:
                              repository: m.daocloud.io/quay.io/jetstack/cert-manager-cainjector
                              tag: v1.17.2
                          acmesolver:
                            image:
                              repository: m.daocloud.io/quay.io/jetstack/cert-manager-acmesolver
                              tag: v1.17.2
                          startupapicheck:
                            image:
                              repository: m.daocloud.io/quay.io/jetstack/cert-manager-startupapicheck
                              tag: v1.17.2
                    destination:
                      server: https://kubernetes.default.svc
                      namespace: basic-components
                  EOF

                  3.sync by argocd

                  Details
                  argocd app sync argocd/cert-manager

                  Preliminary

1. Docker|Podman|Buildah is installed, if not check 🔗link


                  1.just run

                  Details
                  docker run --name cert-manager -e ALLOW_EMPTY_PASSWORD=yes bitnami/cert-manager:latest
                  Using Proxy

you can use an additional DaoCloud mirror image to accelerate image pulling, check Daocloud Proxy

docker run --name cert-manager \
  -e ALLOW_EMPTY_PASSWORD=yes \
  m.daocloud.io/docker.io/bitnami/cert-manager:latest

                  Preliminary

1. Kubernetes is installed, if not check 🔗link


                  1.just run

                  Details
                  kubectl create -f https://github.com/jetstack/cert-manager/releases/download/v1.17.2/cert-manager.yaml

                  Prepare Certificate Issuer

                  kubectl apply  -f - <<EOF
                  ---
                  apiVersion: cert-manager.io/v1
                  kind: Issuer
                  metadata:
                    namespace: basic-components
                    name: self-signed-issuer
                  spec:
                    selfSigned: {}
                  
                  ---
                  apiVersion: cert-manager.io/v1
                  kind: Certificate
                  metadata:
                    namespace: basic-components
                    name: my-self-signed-ca
                  spec:
                    isCA: true
                    commonName: my-self-signed-ca
                    secretName: root-secret
                    privateKey:
                      algorithm: ECDSA
                      size: 256
                    issuerRef:
                      name: self-signed-issuer
                      kind: Issuer
                      group: cert-manager.io
                  
                  ---
                  apiVersion: cert-manager.io/v1
                  kind: ClusterIssuer
                  metadata:
                    name: self-signed-ca-issuer
                  spec:
                    ca:
                      secretName: root-secret
                  EOF
                  kubectl -n kube-system apply -f - << EOF
                  apiVersion: cert-manager.io/v1
                  kind: ClusterIssuer
                  metadata:
                    name: letsencrypt
                  spec:
                    acme:
                      email: aaron19940628@gmail.com
                      server: https://acme-v02.api.letsencrypt.org/directory
                      privateKeySecretRef:
                        name: letsencrypt-account-key
                      solvers:
                      - http01:
                          ingress:
                            class: nginx
                  EOF
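To check that the self-signed CA issuer works end to end, you can request a throwaway certificate from it; this is a minimal sketch and the names/host are arbitrary examples:

kubectl -n basic-components apply -f - <<EOF
apiVersion: cert-manager.io/v1
kind: Certificate
metadata:
  name: test-cert              # arbitrary example name
spec:
  secretName: test-cert-tls
  commonName: test.ay.dev      # arbitrary example host
  dnsNames:
    - test.ay.dev
  issuerRef:
    name: self-signed-ca-issuer
    kind: ClusterIssuer
EOF
kubectl -n basic-components get certificate test-cert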

                  FAQ

                  Q1: The browser doesn’t trust this self-signed certificate

                  Basically, you need to import the certificate into your browser.

                  kubectl -n basic-components get secret root-secret -o jsonpath='{.data.tls\.crt}' | base64 -d > cert-manager-self-signed-ca-secret.crt

                  And then import it into your browser.


                  Jun 7, 2024

                  Install HAProxy

                  Mar 7, 2024

                  Install Ingress

                  Installation

                  Install By

                  Preliminary

1. Kubernetes is installed, if not check 🔗link


2. Helm is installed, if not check 🔗link


                  1.get helm repo

                  Details
                  helm repo add ingress-nginx https://kubernetes.github.io/ingress-nginx
                  helm repo update

                  2.install chart

                  Details
                  helm install ingress-nginx/ingress-nginx --generate-name
                  Using Mirror
                  helm repo add ay-helm-mirror https://aaronyang0628.github.io/helm-chart-mirror/charts &&
                    helm install ay-helm-mirror/ingress-nginx --generate-name --version 4.11.3

                  for more information, you can check 🔗https://aaronyang0628.github.io/helm-chart-mirror/

                  Preliminary

1. Kubernetes is installed, if not check 🔗link


2. ArgoCD is installed, if not check 🔗link


                  1.prepare `ingress-nginx.yaml`

                  Details
                  kubectl -n argocd apply -f - <<EOF
                  apiVersion: argoproj.io/v1alpha1
                  kind: Application
                  metadata:
                    name: ingress-nginx
                  spec:
                    syncPolicy:
                      syncOptions:
                      - CreateNamespace=true
                    project: default
                    source:
                      repoURL: https://kubernetes.github.io/ingress-nginx
                      chart: ingress-nginx
                      targetRevision: 4.12.3
                      helm:
                        releaseName: ingress-nginx
                        values: |
                          controller:
                            image:
                              registry: m.daocloud.io/registry.k8s.io
                            service:
                              enabled: true
                              type: NodePort
                              nodePorts:
                                http: 32080
                                https: 32443
                                tcp:
                                  8080: 32808
                            resources:
                              requests:
                                cpu: 100m
                                memory: 128Mi
                            admissionWebhooks:
                              enabled: true
                              patch:
                                enabled: true
                                image:
                                  registry: m.daocloud.io/registry.k8s.io
                          metrics:
                            enabled: false
                          defaultBackend:
                            enabled: false
                            image:
                              registry: m.daocloud.io/registry.k8s.io
                    destination:
                      server: https://kubernetes.default.svc
                      namespace: basic-components
                  EOF

                  [Optional] 2.apply to k8s

                  Details
                  kubectl -n argocd apply -f ingress-nginx.yaml

                  3.sync by argocd

                  Details
                  argocd app sync argocd/ingress-nginx
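To verify the controller actually routes traffic, a throwaway deployment behind an Ingress works well. This is only a sketch: the namespace, names and host are arbitrary, and the NodePort 32080 comes from the values above.

K8S_MASTER_IP=$(kubectl get nodes --selector=node-role.kubernetes.io/control-plane -o jsonpath='{$.items[0].status.addresses[?(@.type=="InternalIP")].address}')
kubectl get namespaces ingress-test > /dev/null 2>&1 || kubectl create namespace ingress-test
kubectl -n ingress-test create deployment demo --image=m.daocloud.io/docker.io/library/nginx:1.19.9-alpine
kubectl -n ingress-test expose deployment demo --port=80
kubectl -n ingress-test create ingress demo --class=nginx --rule="demo.ay.dev/*=demo:80"
curl -s -H "Host: demo.ay.dev" http://${K8S_MASTER_IP}:32080/ | head -n 5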

                  FAQ

                  Q1: Using minikube, cannot access to the website
                  ssh -i ~/.minikube/machines/minikube/id_rsa docker@$(minikube ip) -L '*:30443:0.0.0.0:30443' -N -f
                  ssh -i ~/.minikube/machines/minikube/id_rsa docker@$(minikube ip) -L '*:32443:0.0.0.0:32443' -N -f
                  ssh -i ~/.minikube/machines/minikube/id_rsa docker@$(minikube ip) -L '*:32080:0.0.0.0:32080' -N -f


                  Jun 7, 2024

                  Install Istio

                  Installation

                  Install By

                  Preliminary

1. Kubernetes is installed, if not check 🔗link


2. Helm is installed, if not check 🔗link


                  1.get helm repo

Details
helm repo add istio https://istio-release.storage.googleapis.com/charts
helm repo update

2.install charts

Details
helm install istio-base istio/base -n istio-system --create-namespace
helm install istiod istio/istiod -n istio-system --wait
helm install istio-ingressgateway istio/gateway -n istio-system

                  Preliminary

1. Kubernetes is installed, if not check 🔗link


2. Helm is installed, if not check 🔗link


3. ArgoCD is installed, if not check 🔗link


                  1.prepare `deploy-istio-base.yaml`

                  Details
                  kubectl -n argocd apply -f - << EOF
                  apiVersion: argoproj.io/v1alpha1
                  kind: Application
                  metadata:
                    name: istio-base
                  spec:
                    syncPolicy:
                      syncOptions:
                      - CreateNamespace=true
                    project: default
                    source:
                      repoURL: https://istio-release.storage.googleapis.com/charts
                      chart: base
                      targetRevision: 1.23.2
                      helm:
                        releaseName: istio-base
                        values: |
                          defaults:
                            global:
                              istioNamespace: istio-system
                            base:
                              enableCRDTemplates: false
                              enableIstioConfigCRDs: true
                            defaultRevision: "default"
                    destination:
                      server: https://kubernetes.default.svc
                      namespace: istio-system
                  EOF

                  2.sync by argocd

                  Details
                  argocd app sync argocd/istio-base

                  3.prepare `deploy-istiod.yaml`

                  Details
                  kubectl -n argocd apply -f - << EOF
                  apiVersion: argoproj.io/v1alpha1
                  kind: Application
                  metadata:
                    name: istiod
                  spec:
                    syncPolicy:
                      syncOptions:
                      - CreateNamespace=true
                    project: default
                    source:
                      repoURL: https://istio-release.storage.googleapis.com/charts
                      chart: istiod
                      targetRevision: 1.23.2
                      helm:
                        releaseName: istiod
                        values: |
                          defaults:
                            global:
                              istioNamespace: istio-system
                              defaultResources:
                                requests:
                                  cpu: 10m
                                  memory: 128Mi
                                limits:
                                  cpu: 100m
                                  memory: 128Mi
                              hub: m.daocloud.io/docker.io/istio
                              proxy:
                                autoInject: disabled
                                resources:
                                  requests:
                                    cpu: 100m
                                    memory: 128Mi
                                  limits:
                                    cpu: 2000m
                                    memory: 1024Mi
                            pilot:
                              autoscaleEnabled: true
                              resources:
                                requests:
                                  cpu: 500m
                                  memory: 2048Mi
                              cpu:
                                targetAverageUtilization: 80
                              podAnnotations:
                                cluster-autoscaler.kubernetes.io/safe-to-evict: "true"
                    destination:
                      server: https://kubernetes.default.svc
                      namespace: istio-system
                  EOF

                  4.sync by argocd

                  Details
                  argocd app sync argocd/istiod

                  5.prepare `deploy-istio-ingressgateway.yaml`

                  Details
                  kubectl -n argocd apply -f - << EOF
                  apiVersion: argoproj.io/v1alpha1
                  kind: Application
                  metadata:
                    name: istio-ingressgateway
                  spec:
                    syncPolicy:
                      syncOptions:
                      - CreateNamespace=true
                    project: default
                    source:
                      repoURL: https://istio-release.storage.googleapis.com/charts
                      chart: gateway
                      targetRevision: 1.23.2
                      helm:
                        releaseName: istio-ingressgateway
                        values: |
                          defaults:
                            replicaCount: 1
                            podAnnotations:
                              inject.istio.io/templates: "gateway"
                              sidecar.istio.io/inject: "true"
                              cluster-autoscaler.kubernetes.io/safe-to-evict: "true"
                            resources:
                              requests:
                                cpu: 100m
                                memory: 128Mi
                              limits:
                                cpu: 2000m
                                memory: 1024Mi
                            service:
                              type: LoadBalancer
                              ports:
                              - name: status-port
                                port: 15021
                                protocol: TCP
                                targetPort: 15021
                              - name: http2
                                port: 80
                                protocol: TCP
                                targetPort: 80
                              - name: https
                                port: 443
                                protocol: TCP
                                targetPort: 443
                            autoscaling:
                              enabled: true
                              minReplicas: 1
                              maxReplicas: 5
                    destination:
                      server: https://kubernetes.default.svc
                      namespace: istio-system
                  EOF

                  6.sync by argocd

                  Details
                  argocd app sync argocd/istio-ingressgateway
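With the gateway running, traffic is routed through Gateway and VirtualService resources. A minimal sketch (the host and the backend service are placeholders; the selector assumes the default labels set by the gateway chart):

kubectl -n istio-system apply -f - <<EOF
apiVersion: networking.istio.io/v1beta1
kind: Gateway
metadata:
  name: demo-gateway
spec:
  selector:
    istio: ingressgateway            # assumed default label of the gateway pods
  servers:
  - port:
      number: 80
      name: http
      protocol: HTTP
    hosts:
    - "demo.ay.dev"                  # placeholder host
---
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: demo
spec:
  hosts:
  - "demo.ay.dev"
  gateways:
  - demo-gateway
  http:
  - route:
    - destination:
        host: demo.application.svc.cluster.local   # placeholder backend service
        port:
          number: 80
EOF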

                  Preliminary

1. Kubernetes is installed, if not check 🔗link


2. Helm is installed, if not check 🔗link


3. ArgoCD is installed, if not check 🔗link


4. Argo Workflow is installed, if not check 🔗link


                  1.prepare `argocd-login-credentials`

                  Details
                  kubectl get namespaces database > /dev/null 2>&1 || kubectl create namespace database

                  2.apply rolebinding to k8s

                  Details
                  kubectl apply -f - <<EOF
                  ---
                  apiVersion: rbac.authorization.k8s.io/v1
                  kind: ClusterRole
                  metadata:
                    name: application-administrator
                  rules:
                    - apiGroups:
                        - argoproj.io
                      resources:
                        - applications
                      verbs:
                        - '*'
                    - apiGroups:
                        - apps
                      resources:
                        - deployments
                      verbs:
                        - '*'
                  
                  ---
                  apiVersion: rbac.authorization.k8s.io/v1
                  kind: RoleBinding
                  metadata:
                    name: application-administration
                    namespace: argocd
                  roleRef:
                    apiGroup: rbac.authorization.k8s.io
                    kind: ClusterRole
                    name: application-administrator
                  subjects:
                    - kind: ServiceAccount
                      name: argo-workflow
                      namespace: business-workflows
                  
                  ---
                  apiVersion: rbac.authorization.k8s.io/v1
                  kind: RoleBinding
                  metadata:
                    name: application-administration
                    namespace: application
                  roleRef:
                    apiGroup: rbac.authorization.k8s.io
                    kind: ClusterRole
                    name: application-administrator
                  subjects:
                    - kind: ServiceAccount
                      name: argo-workflow
                      namespace: business-workflows
                  EOF

3.prepare `deploy-xxxx-flow.yaml`

                  Details

4.submit to the argo workflow client

                  Details
                  argo -n business-workflows submit deploy-xxxx-flow.yaml

5.decode password

                  Details
                  kubectl -n application get secret xxxx-credentials -o jsonpath='{.data.xxx-password}' | base64 -d


                  Jun 7, 2024

                  Install Nginx

1. prepare default.conf

                  cat << EOF > default.conf
                  server {
                    listen 80;
                    location / {
                        root   /usr/share/nginx/html;
                        autoindex on;
                    }
                  }
                  EOF

                  2. install

                  mkdir $(pwd)/data
                  podman run --rm -p 8080:80 \
                      -v $(pwd)/data:/usr/share/nginx/html:ro \
                      -v $(pwd)/default.conf:/etc/nginx/conf.d/default.conf:ro \
                      -d docker.io/library/nginx:1.19.9-alpine
                  echo 'this is a test' > $(pwd)/data/some-data.txt
                  Tip

you can use an additional DaoCloud mirror image to accelerate image pulling, check Daocloud Proxy

                  visit http://localhost:8080
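or check it from the command line:

curl http://localhost:8080/some-data.txt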

                  Mar 7, 2024

                  Install Traefik

                  Mar 7, 2024

                  Subsections of RPC

                  gRpc

                  This guide gets you started with gRPC in C++ with a simple working example.

                  In the C++ world, there’s no universally accepted standard for managing project dependencies. You need to build and install gRPC before building and running this quick start’s Hello World example.

Build and locally install gRPC and Protocol Buffers. The steps in this section explain how to build and locally install gRPC and Protocol Buffers using cmake. If you’d rather use bazel, see Building from source.

                  1. Setup

                  Choose a directory to hold locally installed packages. This page assumes that the environment variable MY_INSTALL_DIR holds this directory path. For example:

                  export MY_INSTALL_DIR=$HOME/.local

                  Ensure that the directory exists:

                  mkdir -p $MY_INSTALL_DIR

                  Add the local bin folder to your path variable, for example:

                  export PATH="$MY_INSTALL_DIR/bin:$PATH"
                  Important

                  We strongly encourage you to install gRPC locally — using an appropriately set CMAKE_INSTALL_PREFIX — because there is no easy way to uninstall gRPC after you’ve installed it globally.

                  2. Install Essentials

                  2.1 Install Cmake

                  You need version 3.13 or later of cmake. Install it by following these instructions:

                  Install on
                  sudo apt install -y cmake
                  brew install cmake
                  Check the version of cmake
                  cmake --version
                  2.2 Install basic tools required to build gRPC
                  Install on
                  sudo apt install -y build-essential autoconf libtool pkg-config
                  brew install autoconf automake libtool pkg-config
                  2.3 Clone the grpc repo

                  Clone the grpc repo and its submodules:

                  git clone --recurse-submodules -b v1.62.0 --depth 1 --shallow-submodules https://github.com/grpc/grpc
                  2.4 Build and install gRPC and Protocol Buffers

                  While not mandatory, gRPC applications usually leverage Protocol Buffers for service definitions and data serialization, and the example code uses proto3.

                  The following commands build and locally install gRPC and Protocol Buffers:

                  cd grpc
                  mkdir -p cmake/build
                  pushd cmake/build
                  cmake -DgRPC_INSTALL=ON \
                        -DgRPC_BUILD_TESTS=OFF \
                        -DCMAKE_INSTALL_PREFIX=$MY_INSTALL_DIR \
                        ../..
                  make -j 4
                  make install
                  popd

                  3. Run the example

                  The example code is part of the grpc repo source, which you cloned as part of the steps of the previous section.

3.1 change to the example’s directory:
                  cd examples/cpp/helloworld
                  3.2 build the example project by using cmake

make sure `echo $MY_INSTALL_DIR` still returns a valid path

                  mkdir -p cmake/build
                  pushd cmake/build
                  cmake -DCMAKE_PREFIX_PATH=$MY_INSTALL_DIR ../..
                  make -j 4

3.3 Run the server

./greeter_server

3.4 From a different terminal, run the client and check its output:

./greeter_client

The result should look like this:

                  Greeter received: Hello world
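If you modify examples/protos/helloworld.proto, the C++ stubs need to be regenerated. The cmake build above already does this for you; the following is only a sketch of the manual invocation, assuming protoc and grpc_cpp_plugin were installed into $MY_INSTALL_DIR during step 2.4:

# run from examples/cpp/helloworld; regenerates helloworld.pb.* and helloworld.grpc.pb.*
$MY_INSTALL_DIR/bin/protoc -I ../../protos \
    --cpp_out=. \
    --grpc_out=. \
    --plugin=protoc-gen-grpc=$MY_INSTALL_DIR/bin/grpc_cpp_plugin \
    ../../protos/helloworld.proto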
                  Apr 7, 2024

                  Subsections of Storage

Deploy Artifact Repository

                  Preliminary

• Kubernetes has been installed; if not, check link
• MinIO is ready to serve as the artifact repository

                    endpoint: minio.storage:9000

                  Steps

                  1. prepare bucket for s3 artifact repository

# K8S_MASTER_IP could be your master node IP or the loadbalancer external IP
                  K8S_MASTER_IP=172.27.253.27
                  MINIO_ACCESS_SECRET=$(kubectl -n storage get secret minio-secret -o jsonpath='{.data.rootPassword}' | base64 -d)
                  podman run --rm \
                  --entrypoint bash \
                  --add-host=minio-api.dev.geekcity.tech:${K8S_MASTER_IP} \
                  -it docker.io/minio/mc:latest \
                  -c "mc alias set minio http://minio-api.dev.geekcity.tech admin ${MINIO_ACCESS_SECRET} \
                      && mc ls minio \
                      && mc mb --ignore-existing minio/argo-workflows-artifacts"

                  2. prepare secret s3-artifact-repository-credentials

the secret is created in the business-workflows namespace (create the namespace first if it does not exist)

                  MINIO_ACCESS_KEY=$(kubectl -n storage get secret minio-secret -o jsonpath='{.data.rootUser}' | base64 -d)
                  kubectl -n business-workflows create secret generic s3-artifact-repository-credentials \
                      --from-literal=accessKey=${MINIO_ACCESS_KEY} \
                      --from-literal=secretKey=${MINIO_ACCESS_SECRET}

                  3. prepare configMap artifact-repositories.yaml

                  apiVersion: v1
                  kind: ConfigMap
                  metadata:
                    name: artifact-repositories
                    annotations:
                      workflows.argoproj.io/default-artifact-repository: default-artifact-repository
                  data:
                    default-artifact-repository: |
                      s3:
                        endpoint: minio.storage:9000
                        insecure: true
                        accessKeySecret:
                          name: s3-artifact-repository-credentials
                          key: accessKey
                        secretKeySecret:
                          name: s3-artifact-repository-credentials
                          key: secretKey
                        bucket: argo-workflows-artifacts

                  4. apply artifact-repositories.yaml to k8s

                  kubectl -n business-workflows apply -f artifact-repositories.yaml
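To confirm the default artifact repository is wired up end to end, you can submit a tiny workflow that produces an output artifact and then check that a new object appears in the argo-workflows-artifacts bucket. This is only a sketch; the workflow name, image mirror and file path are arbitrary choices:

# write a minimal workflow that uploads /tmp/hello.txt as an output artifact
cat > artifact-smoke-test.yaml << EOF
apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: artifact-smoke-test-
spec:
  entrypoint: produce
  templates:
    - name: produce
      container:
        image: m.daocloud.io/docker.io/library/busybox:latest
        command: [sh, -c]
        args: ["echo hello-from-argo > /tmp/hello.txt"]
      outputs:
        artifacts:
          - name: hello
            path: /tmp/hello.txt
EOF
argo -n business-workflows submit artifact-smoke-test.yaml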
                  Mar 7, 2024

                  Install Chart Museum

                  Installation

                  Install By

                  Preliminary

1. Kubernetes has been installed; if not, check 🔗link

2. Helm binary has been installed; if not, check 🔗link


                  1.get helm repo

                  Details
                  helm repo add ay-helm-mirror https://aaronyang0628.github.io/helm-chart-mirror/charts
                  helm repo update

                  2.install chart

                  Details
                  helm install ay-helm-mirror/kube-prometheus-stack --generate-name
                  Using Mirror
                  helm repo add ay-helm-mirror https://aaronyang0628.github.io/helm-chart-mirror/charts \
                    && helm install ay-helm-mirror/cert-manager --generate-name --version 1.17.2

                  for more information, you can check 🔗https://aaronyang0628.github.io/helm-chart-mirror/

                  Preliminary

1. Kubernetes has been installed; if not, check 🔗link

2. ArgoCD has been installed; if not, check 🔗link

3. Helm binary has been installed; if not, check 🔗link

4. Ingress has been installed for ArgoCD; if not, check 🔗link

5. MinIO has been installed; if not, check 🔗link


                  1.prepare `chart-museum-credentials`

                  Storage In
                  kubectl get namespaces basic-components > /dev/null 2>&1 || kubectl create namespace basic-components
                  kubectl -n basic-components create secret generic chart-museum-credentials \
                      --from-literal=username=admin \
                      --from-literal=password=$(tr -dc A-Za-z0-9 </dev/urandom | head -c 16)
                  
                  kubectl get namespaces basic-components > /dev/null 2>&1 || kubectl create namespace basic-components
                  kubectl -n basic-components create secret generic chart-museum-credentials \
                      --from-literal=username=admin \
                      --from-literal=password=$(tr -dc A-Za-z0-9 </dev/urandom | head -c 16) \
                      --from-literal=aws_access_key_id=$(kubectl -n storage get secret minio-credentials -o jsonpath='{.data.rootUser}' | base64 -d) \
                      --from-literal=aws_secret_access_key=$(kubectl -n storage get secret minio-credentials -o jsonpath='{.data.rootPassword}' | base64 -d)
                  

                  2.prepare `chart-museum.yaml`

                  Storage In
                  kubectl apply -f - << EOF
                  apiVersion: argoproj.io/v1alpha1
                  kind: Application
                  metadata:
                    name: chart-museum
                  spec:
                    syncPolicy:
                      syncOptions:
                        - CreateNamespace=true
                    project: default
                    source:
                      repoURL: https://chartmuseum.github.io/charts
                      chart: chartmuseum
                      targetRevision: 3.10.3
                      helm:
                        releaseName: chart-museum
                        values: |
                          replicaCount: 1
                          image:
                            repository: m.daocloud.io/ghcr.io/helm/chartmuseum
                          env:
                            open:
                              DISABLE_API: false
                              STORAGE: local
                              AUTH_ANONYMOUS_GET: true
                            existingSecret: "chart-museum-credentials"
                            existingSecretMappings:
                              BASIC_AUTH_USER: "username"
                              BASIC_AUTH_PASS: "password"
                          persistence:
                            enabled: false
                            storageClass: ""
                          volumePermissions:
                            image:
                              registry: m.daocloud.io/docker.io
                          ingress:
                            enabled: true
                            ingressClassName: nginx
                            annotations:
                              cert-manager.io/cluster-issuer: self-signed-ca-issuer
                              nginx.ingress.kubernetes.io/rewrite-target: /$1
                            hosts:
                              - name: chartmuseum.ay.dev
                                path: /?(.*)
                                tls: true
                                tlsSecret: chartmuseum.ay.dev-tls
                    destination:
                      server: https://kubernetes.default.svc
                      namespace: basic-components
                  EOF
                  
                  kubectl apply -f - << EOF
                  apiVersion: argoproj.io/v1alpha1
                  kind: Application
                  metadata:
                    name: chart-museum
                  spec:
                    syncPolicy:
                      syncOptions:
                        - CreateNamespace=true
                    project: default
                    source:
                      repoURL: https://chartmuseum.github.io/charts
                      chart: chartmuseum
                      targetRevision: 3.10.3
                      helm:
                        releaseName: chart-museum
                        values: |
                          replicaCount: 1
                          image:
                            repository: m.daocloud.io/ghcr.io/helm/chartmuseum
                          env:
                            open:
                              DISABLE_API: false
                              STORAGE: amazon
                              STORAGE_AMAZON_ENDPOINT: http://minio-api.ay.dev:32080
                              STORAGE_AMAZON_BUCKET: chart-museum
                              STORAGE_AMAZON_PREFIX: charts
                              STORAGE_AMAZON_REGION: us-east-1
                              AUTH_ANONYMOUS_GET: true
                            existingSecret: "chart-museum-credentials"
                            existingSecretMappings:
                              BASIC_AUTH_USER: "username"
                              BASIC_AUTH_PASS: "password"
                              AWS_ACCESS_KEY_ID: "aws_access_key_id"
                              AWS_SECRET_ACCESS_KEY: "aws_secret_access_key"
                          persistence:
                            enabled: false
                            storageClass: ""
                          volumePermissions:
                            image:
                              registry: m.daocloud.io/docker.io
                          ingress:
                            enabled: true
                            ingressClassName: nginx
                            annotations:
                              cert-manager.io/cluster-issuer: self-signed-ca-issuer
                              nginx.ingress.kubernetes.io/rewrite-target: /$1
                            hosts:
                              - name: chartmuseum.ay.dev
                                path: /?(.*)
                                tls: true
                                tlsSecret: chartmuseum.ay.dev-tls
                    destination:
                      server: https://kubernetes.default.svc
                      namespace: basic-components
                  EOF
                  
                  apiVersion: argoproj.io/v1alpha1
                  kind: Application
                  metadata:
                    name: chart-museum
                  spec:
                    syncPolicy:
                      syncOptions:
                        - CreateNamespace=true
                    project: default
                    source:
                      repoURL: https://chartmuseum.github.io/charts
                      chart: chartmuseum
                      targetRevision: 3.10.3
                      helm:
                        releaseName: chart-museum
                        values: |
                          replicaCount: 1
                          image:
                            repository: m.daocloud.io/ghcr.io/helm/chartmuseum
                          env:
                            open:
                              DISABLE_API: false
                              STORAGE: local
                              AUTH_ANONYMOUS_GET: true
                            existingSecret: "chart-museum-credentials"
                            existingSecretMappings:
                              BASIC_AUTH_USER: "username"
                              BASIC_AUTH_PASS: "password"
                          persistence:
                            enabled: false
                            storageClass: ""
                          volumePermissions:
                            image:
                              registry: m.daocloud.io/docker.io
                          ingress:
                            enabled: true
                            ingressClassName: nginx
                            annotations:
                              cert-manager.io/cluster-issuer: self-signed-ca-issuer
                              nginx.ingress.kubernetes.io/rewrite-target: /$1
                            hosts:
                              - name: chartmuseum.ay.dev
                                path: /?(.*)
                                tls: true
                                tlsSecret: chartmuseum.ay.dev-tls
                    destination:
                      server: https://kubernetes.default.svc
                      namespace: basic-components
                  

                  3.sync by argocd

                  Details
                  argocd app sync argocd/chart-museum
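After the sync finishes, a quick sanity check is to confirm the pod is running and the repository index is served. The hostname below comes from the ingress values above and is assumed to resolve to your ingress controller (for example via an /etc/hosts entry):

kubectl -n basic-components get pods
# -k because the certificate is issued by the self-signed cluster issuer
curl -k https://chartmuseum.ay.dev/index.yaml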

                  Uploading a Chart Package

Once ChartMuseum is up and running, the examples below assume it is reachable at http://localhost:8080; adjust the host if you exposed it through the ingress instead.

                  First create mychart-0.1.0.tgz using the Helm CLI:

                  cd mychart/
                  helm package .

                  Upload mychart-0.1.0.tgz:

                  curl --data-binary "@mychart-0.1.0.tgz" http://localhost:8080/api/charts

                  If you’ve signed your package and generated a provenance file, upload it with:

                  curl --data-binary "@mychart-0.1.0.tgz.prov" http://localhost:8080/api/prov

                  Both files can also be uploaded at once (or one at a time) on the /api/charts route using the multipart/form-data format:

                  curl -F "chart=@mychart-0.1.0.tgz" -F "prov=@mychart-0.1.0.tgz.prov" http://localhost:8080/api/charts

                  You can also use the helm-push plugin:

                  helm cm-push mychart/ chartmuseum
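The cm-push subcommand comes from the helm-push plugin; if it is not installed yet, it can usually be added like this (requires access to GitHub):

helm plugin install https://github.com/chartmuseum/helm-push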

                  Installing Charts into Kubernetes

                  Add the URL to your ChartMuseum installation to the local repository list:

                  helm repo add chartmuseum http://localhost:8080

                  Search for charts:

                  helm search repo chartmuseum/

                  Install chart:

                  helm install chartmuseum/mychart --generate-name


                  Jun 7, 2024

                  Install Harbor

                  Mar 7, 2025

                  Install Minio

                  Installation

                  Install By

                  Preliminary

1. Kubernetes has been installed; if not, check 🔗link

2. Helm binary has been installed; if not, check 🔗link


                  Preliminary

1. Kubernetes has been installed; if not, check 🔗link

2. ArgoCD has been installed; if not, check 🔗link

3. Ingress has been installed for ArgoCD; if not, check 🔗link

4. Cert-manager has been installed via ArgoCD and a ClusterIssuer named `self-signed-ca-issuer` exists; if not, check 🔗link


                  1.prepare minio credentials secret

                  Details
                  kubectl get namespaces storage > /dev/null 2>&1 || kubectl create namespace storage
                  kubectl -n storage create secret generic minio-secret \
                      --from-literal=root-user=admin \
                      --from-literal=root-password=$(tr -dc A-Za-z0-9 </dev/urandom | head -c 16)

                  2.prepare `deploy-minio.yaml`

                  Details
                  kubectl -n argocd apply -f - << EOF
                  apiVersion: argoproj.io/v1alpha1
                  kind: Application
                  metadata:
                    name: minio
                  spec:
                    syncPolicy:
                      syncOptions:
                      - CreateNamespace=true
                    project: default
                    source:
                      repoURL: https://aaronyang0628.github.io/helm-chart-mirror/charts
                      chart: minio
                      targetRevision: 16.0.10
                      helm:
                        releaseName: minio
                        values: |
                          global:
                            imageRegistry: "m.daocloud.io/docker.io"
                            imagePullSecrets: []
                            storageClass: ""
                            security:
                              allowInsecureImages: true
                            compatibility:
                              openshift:
                                adaptSecurityContext: auto
                          image:
                            registry: m.daocloud.io/docker.io
                            repository: bitnami/minio
                          clientImage:
                            registry: m.daocloud.io/docker.io
                            repository: bitnami/minio-client
                          mode: standalone
                          defaultBuckets: ""
                          auth:
                            # rootUser: admin
                            # rootPassword: ""
                            existingSecret: "minio-secret"
                          statefulset:
                            updateStrategy:
                              type: RollingUpdate
                            podManagementPolicy: Parallel
                            replicaCount: 1
                            zones: 1
                            drivesPerNode: 1
                          resourcesPreset: "micro"
                          resources: 
                            requests:
                              memory: 512Mi
                              cpu: 250m
                            limits:
                              memory: 512Mi
                              cpu: 250m
                          ingress:
                            enabled: true
                            ingressClassName: "nginx"
                            hostname: minio-console.ay.online
                            path: /?(.*)
                            pathType: ImplementationSpecific
                            annotations:
                              kubernetes.io/ingress.class: nginx
                              nginx.ingress.kubernetes.io/rewrite-target: /$1
                              cert-manager.io/cluster-issuer: self-signed-ca-issuer
                            tls: true
                            selfSigned: true
                            extraHosts: []
                          apiIngress:
                            enabled: true
                            ingressClassName: "nginx"
                            hostname: minio-api.ay.online
                            path: /?(.*)
                            pathType: ImplementationSpecific
                            annotations: 
                              kubernetes.io/ingress.class: nginx
                              nginx.ingress.kubernetes.io/rewrite-target: /$1
                              cert-manager.io/cluster-issuer: self-signed-ca-issuer
                            tls: true
                            selfSigned: true
                            extraHosts: []
                          persistence:
                            enabled: false
                            storageClass: ""
                            mountPath: /bitnami/minio/data
                            accessModes:
                              - ReadWriteOnce
                            size: 8Gi
                            annotations: {}
                            existingClaim: ""
                          metrics:
                            prometheusAuthType: public
                            enabled: false
                            serviceMonitor:
                              enabled: false
                              namespace: ""
                              labels: {}
                              jobLabel: ""
                              paths:
                                - /minio/v2/metrics/cluster
                                - /minio/v2/metrics/node
                              interval: 30s
                              scrapeTimeout: ""
                              honorLabels: false
                            prometheusRule:
                              enabled: false
                              namespace: ""
                              additionalLabels: {}
                              rules: []
                    destination:
                      server: https://kubernetes.default.svc
                      namespace: storage
                  EOF

                  3.sync by argocd

                  Details
                  argocd app sync argocd/minio

                  4.decode minio secret

                  Details
                  kubectl -n storage get secret minio-secret -o jsonpath='{.data.root-password}' | base64 -d

                  5.visit web console

                  Login Credentials

                  add $K8S_MASTER_IP minio-console.ay.online to /etc/hosts

                  address: 🔗http://minio-console.ay.online:32080/login

                  access key: admin

                  secret key: ``

                  6.using mc

                  Details
                  K8S_MASTER_IP=$(kubectl get node -l node-role.kubernetes.io/control-plane -o jsonpath='{.items[0].status.addresses[?(@.type=="InternalIP")].address}')
                  MINIO_ACCESS_SECRET=$(kubectl -n storage get secret minio-secret -o jsonpath='{.data.root-password}' | base64 -d)
                  podman run --rm \
                      --entrypoint bash \
                      --add-host=minio-api.dev.tech:${K8S_MASTER_IP} \
                      -it m.daocloud.io/docker.io/minio/mc:latest \
                      -c "mc alias set minio http://minio-api.dev.tech:32080 admin ${MINIO_ACCESS_SECRET} \
                          && mc ls minio \
                          && mc mb --ignore-existing minio/test \
                          && mc cp /etc/hosts minio/test/etc/hosts \
                          && mc ls --recursive minio"
                  Details
                  K8S_MASTER_IP=$(kubectl get node -l node-role.kubernetes.io/control-plane -o jsonpath='{.items[0].status.addresses[?(@.type=="InternalIP")].address}')
                  MINIO_ACCESS_SECRET=$(kubectl -n storage get secret minio-secret -o jsonpath='{.data.root-password}' | base64 -d)
                  podman run --rm \
                      --entrypoint bash \
                      --add-host=minio-api.dev.tech:${K8S_MASTER_IP} \
                      -it m.daocloud.io/docker.io/minio/mc:latest

                  Preliminary

1. Docker has been installed; if not, check 🔗link


                  Using Proxy

you can run an additional DaoCloud proxy image to accelerate image pulling; check Daocloud Proxy

                  1.init server

                  Details
                  mkdir -p $(pwd)/minio/data
                  podman run --rm \
                      --name minio-server \
                      -p 9000:9000 \
                      -p 9001:9001 \
                      -v $(pwd)/minio/data:/data \
                      -d docker.io/minio/minio:latest server /data --console-address :9001

                  2.use web console

                  And then you can visit 🔗http://localhost:9001

                  username: `minioadmin`

                  password: `minioadmin`

                  3.use internal client

                  Details
                  podman run --rm \
                      --entrypoint bash \
                      -it docker.io/minio/mc:latest \
                      -c "mc alias set minio http://host.docker.internal:9000 minioadmin minioadmin \
                          && mc ls minio \
                          && mc mb --ignore-existing minio/test \
                          && mc cp /etc/hosts minio/test/etc/hosts \
                          && mc ls --recursive minio"


                  Mar 7, 2024

                  Install NFS

                  Installation

                  Install By

                  Preliminary

1. Kubernetes has been installed; if not, check 🔗link

2. Helm has been installed; if not, check 🔗link


                  Preliminary

1. Kubernetes has been installed; if not, check 🔗link

2. ArgoCD has been installed; if not, check 🔗link

3. Ingress has been installed for ArgoCD; if not, check 🔗link


                  1.prepare `nfs-provisioner.yaml`

                  Details
                  apiVersion: argoproj.io/v1alpha1
                  kind: Application
                  metadata:
                    name: nfs-provisioner
                  spec:
                    syncPolicy:
                      syncOptions:
                      - CreateNamespace=true
                    project: default
                    source:
                      repoURL: https://kubernetes-sigs.github.io/nfs-subdir-external-provisioner
                      chart: nfs-subdir-external-provisioner
                      targetRevision: 4.0.18
                      helm:
                        releaseName: nfs-provisioner
                        values: |
                          image:
                            repository: m.daocloud.io/registry.k8s.io/sig-storage/nfs-subdir-external-provisioner
                            pullPolicy: IfNotPresent
                          nfs:
                            server: nfs.services.test
                            path: /
                            mountOptions:
                              - vers=4
                              - minorversion=0
                              - rsize=1048576
                              - wsize=1048576
                              - hard
                              - timeo=600
                              - retrans=2
                              - noresvport
                            volumeName: nfs-subdir-external-provisioner-nas
                            reclaimPolicy: Retain
                          storageClass:
                            create: true
                            defaultClass: true
                            name: nfs-external-nas
                    destination:
                      server: https://kubernetes.default.svc
                      namespace: storage

2.apply `nfs-provisioner.yaml` to k8s

                  Details
                  kubectl -n argocd apply -f nfs-provisioner.yaml

3.sync by argocd

                  Details
                  argocd app sync argocd/nfs-provisioner
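To verify that dynamic provisioning works, create a small PVC against the new storage class and check that it reaches the Bound state. A sketch; the claim name and size are arbitrary:

kubectl apply -f - << EOF
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: nfs-provision-check
  namespace: storage
spec:
  accessModes:
    - ReadWriteMany
  storageClassName: nfs-external-nas
  resources:
    requests:
      storage: 1Gi
EOF
kubectl -n storage get pvc nfs-provision-check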

                  Preliminary

1. Docker has been installed; if not, check 🔗link


                  Using Proxy

you can run an additional DaoCloud proxy image to accelerate image pulling; check Daocloud Proxy

                  1.init server

                  Details
                  echo -e "nfs\nnfsd" > /etc/modules-load.d/nfs4.conf
                  modprobe nfs && modprobe nfsd
                  mkdir -p $(pwd)/data/nfs/data
                  echo '/data *(rw,fsid=0,no_subtree_check,insecure,no_root_squash)' > $(pwd)/data/nfs/exports
                  podman run \
                      --name nfs4 \
                      --rm \
                      --privileged \
                      -p 2049:2049 \
                      -v $(pwd)/data/nfs/data:/data \
                      -v $(pwd)/data/nfs/exports:/etc/exports:ro \
                      -d docker.io/erichough/nfs-server:2.2.1
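Once the container is up, the export can be mounted from the host to confirm it works. A sketch, assuming the NFS client utilities are installed and port 2049 is reachable on localhost:

mkdir -p $(pwd)/mnt/nfs
# the container exports /data with fsid=0, so it is the NFSv4 root and is mounted as "/"
sudo mount -v -t nfs4 -o proto=tcp,port=2049 localhost:/ $(pwd)/mnt/nfs
df -h $(pwd)/mnt/nfs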

                  Preliminary

1. The CentOS yum repo sources have been updated; if not, check 🔗link


                  2.

                  1.install nfs util

                  sudo apt update -y
                  sudo apt-get install nfs-common
                  dnf update -y
dnf install -y nfs-utils rpcbind
                  sudo apt update -y
                  sudo apt-get install nfs-common

                  2. create share folder

                  Details
                  mkdir /data && chmod 755 /data

                  3.edit `/etc/exports`

                  Details
                  /data *(rw,sync,insecure,no_root_squash,no_subtree_check)

                  4.start nfs server

                  Details
                  systemctl enable rpcbind
                  systemctl enable nfs-server
                  systemctl start rpcbind
                  systemctl start nfs-server

5.check the export list on localhost

Details
showmount -e localhost
Expected Output
Export list for localhost:
/data *

6.check the export list from another host

Details
showmount -e 192.168.aa.bb
Expected Output
Export list for 192.168.aa.bb:
/data *

                  7.mount nfs disk

                  Details
                  mkdir -p $(pwd)/mnt/nfs
                  sudo mount -v 192.168.aa.bb:/data $(pwd)/mnt/nfs  -o proto=tcp -o nolock

                  8.set nfs auto mount

                  Details
                  echo "192.168.aa.bb:/data /data nfs rw,auto,nofail,noatime,nolock,intr,tcp,actimeo=1800 0 0" >> /etc/fstab
                  df -h

                  Notes

                  [Optional] create new partition
                  disk size:
                  fdisk /dev/vdb
                  
                  # n
                  # p
                  # w
                  parted
                  
                  #select /dev/vdb 
                  #mklabel gpt 
                  #mkpart primary 0 -1
                  #Cancel
                  #mkpart primary 0% 100%
                  #print
                  [Optional]Format disk
                  mkfs.xfs /dev/vdb1 -f
                  [Optional] mount disk to folder
                  mount /dev/vdb1 /data
                  [Optional] mount when restart
                  #vim `/etc/fstab` 
                  /dev/vdb1     /data  xfs   defaults   0 0



                  Mar 7, 2025

                  Install Rook Ceph

                  Mar 7, 2025

Install Redis

                  Installation

                  Install By

                  Preliminary

1. Kubernetes has been installed; if not, check 🔗link

2. Helm has been installed; if not, check 🔗link


                  1.get helm repo

                  Details
                  helm repo add ay-helm-mirror https://aaronyang0628.github.io/helm-chart-mirror/charts
                  helm repo update

                  2.install chart

                  Details
                  helm install ay-helm-mirror/kube-prometheus-stack --generate-name
                  Using Proxy

                  Preliminary

1. Kubernetes has been installed; if not, check 🔗link

2. Helm has been installed; if not, check 🔗link

3. ArgoCD has been installed; if not, check 🔗link


                  1.prepare redis secret

                  Details
                  kubectl get namespaces storage > /dev/null 2>&1 || kubectl create namespace storage
                  kubectl -n storage create secret generic redis-credentials \
                    --from-literal=redis-password=$(tr -dc A-Za-z0-9 </dev/urandom | head -c 16)

                  2.prepare `deploy-redis.yaml`

                  Details
                  kubectl -n argocd apply -f - << EOF
                  apiVersion: argoproj.io/v1alpha1
                  kind: Application
                  metadata:
                    name: redis
                  spec:
                    syncPolicy:
                      syncOptions:
                      - CreateNamespace=true
                    project: default
                    source:
                      repoURL: https://charts.bitnami.com/bitnami
                      chart: redis
                      targetRevision: 18.16.0
                      helm:
                        releaseName: redis
                        values: |
                          architecture: replication
                          auth:
                            enabled: true
                            sentinel: true
                            existingSecret: redis-credentials
                          master:
                            count: 1
                            disableCommands:
                              - FLUSHDB
                              - FLUSHALL
                            persistence:
                              enabled: true
                              storageClass: nfs-external
                              size: 8Gi
                          replica:
                            replicaCount: 3
                            disableCommands:
                              - FLUSHDB
                              - FLUSHALL
                            persistence:
                              enabled: true
                              storageClass: nfs-external
                              size: 8Gi
                          image:
                            registry: m.daocloud.io/docker.io
                            pullPolicy: IfNotPresent
                          sentinel:
                            enabled: false
                            persistence:
                              enabled: false
                            image:
                              registry: m.daocloud.io/docker.io
                              pullPolicy: IfNotPresent
                          metrics:
                            enabled: false
                            image:
                              registry: m.daocloud.io/docker.io
                              pullPolicy: IfNotPresent
                          volumePermissions:
                            enabled: false
                            image:
                              registry: m.daocloud.io/docker.io
                              pullPolicy: IfNotPresent
                          sysctl:
                            enabled: false
                            image:
                              registry: m.daocloud.io/docker.io
                              pullPolicy: IfNotPresent
                          extraDeploy:
                            - |
                              apiVersion: apps/v1
                              kind: Deployment
                              metadata:
                                name: redis-tool
                                namespace: csst
                                labels:
                                  app.kubernetes.io/name: redis-tool
                              spec:
                                replicas: 1
                                selector:
                                  matchLabels:
                                    app.kubernetes.io/name: redis-tool
                                template:
                                  metadata:
                                    labels:
                                      app.kubernetes.io/name: redis-tool
                                  spec:
                                    containers:
                                    - name: redis-tool
                                      image: m.daocloud.io/docker.io/bitnami/redis:7.2.4-debian-12-r8
                                      imagePullPolicy: IfNotPresent
                                      env:
                                      - name: REDISCLI_AUTH
                                        valueFrom:
                                          secretKeyRef:
                                            key: redis-password
                                            name: redis-credentials
                                      - name: TZ
                                        value: Asia/Shanghai
                                      command:
                                      - tail
                                      - -f
                                      - /etc/hosts
                    destination:
                      server: https://kubernetes.default.svc
                      namespace: storage
                  EOF

                  3.sync by argocd

                  Details
                  argocd app sync argocd/redis

                  4.decode password

                  Details
                  kubectl -n storage get secret redis-credentials -o jsonpath='{.data.redis-password}' | base64 -d

                  Preliminary

1. Docker, Podman, or Buildah has been installed; if not, check 🔗link


                  Using Proxy

you can run an additional DaoCloud proxy image to accelerate image pulling; check Daocloud Proxy

                  1.init server

                  Details
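A minimal sketch of running a standalone Redis server with podman, assuming the Bitnami image from the mirror; the password value is only an example:

podman run --rm \
    --name redis-server \
    -p 6379:6379 \
    -e REDIS_PASSWORD=redis-Passw0rd \
    -d m.daocloud.io/docker.io/bitnami/redis:latest

# quick check with the bundled redis-cli (same password as above)
podman exec -it redis-server redis-cli -a redis-Passw0rd ping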

                  Preliminary

1. Kubernetes has been installed; if not, check 🔗link

2. Helm has been installed; if not, check 🔗link

3. ArgoCD has been installed; if not, check 🔗link

4. Argo Workflow has been installed; if not, check 🔗link


                  1.prepare `argocd-login-credentials`

                  Details
                  kubectl get namespaces database > /dev/null 2>&1 || kubectl create namespace database

                  2.apply rolebinding to k8s

                  Details
                  kubectl apply -f - <<EOF
                  ---
                  apiVersion: rbac.authorization.k8s.io/v1
                  kind: ClusterRole
                  metadata:
                    name: application-administrator
                  rules:
                    - apiGroups:
                        - argoproj.io
                      resources:
                        - applications
                      verbs:
                        - '*'
                    - apiGroups:
                        - apps
                      resources:
                        - deployments
                      verbs:
                        - '*'
                  
                  ---
                  apiVersion: rbac.authorization.k8s.io/v1
                  kind: RoleBinding
                  metadata:
                    name: application-administration
                    namespace: argocd
                  roleRef:
                    apiGroup: rbac.authorization.k8s.io
                    kind: ClusterRole
                    name: application-administrator
                  subjects:
                    - kind: ServiceAccount
                      name: argo-workflow
                      namespace: business-workflows
                  
                  ---
                  apiVersion: rbac.authorization.k8s.io/v1
                  kind: RoleBinding
                  metadata:
                    name: application-administration
                    namespace: application
                  roleRef:
                    apiGroup: rbac.authorization.k8s.io
                    kind: ClusterRole
                    name: application-administrator
                  subjects:
                    - kind: ServiceAccount
                      name: argo-workflow
                      namespace: business-workflows
                  EOF

                  4.prepare `deploy-xxxx-flow.yaml`

                  Details

6.submit to argo workflow client

                  Details
                  argo -n business-workflows submit deploy-xxxx-flow.yaml

                  7.decode password

                  Details
                  kubectl -n application get secret xxxx-credentials -o jsonpath='{.data.xxx-password}' | base64 -d


Tests

                  • kubectl -n storage exec -it deployment/redis-tool -- \
                        redis-cli -c -h redis-master.storage ping
                  • kubectl -n storage exec -it deployment/redis-tool -- \
                        redis-cli -c -h redis-master.storage set mykey somevalue
                  • kubectl -n storage exec -it deployment/redis-tool -- \
                        redis-cli -c -h redis-master.storage get mykey
                  • kubectl -n storage exec -it deployment/redis-tool -- \
                        redis-cli -c -h redis-master.storage del mykey
                  • kubectl -n storage exec -it deployment/redis-tool -- \
                        redis-cli -c -h redis-master.storage get mykey
                  May 7, 2024

                  Subsections of Streaming

                  Install Flink Operator

                  Installation

                  Install By

                  Preliminary

1. Kubernetes has been installed; if not, check 🔗link

2. Helm has been installed; if not, check 🔗link

3. Cert-manager has been installed; if not, check 🔗link


                  1.get helm repo

                  Details
                  helm repo add flink-operator-repo https://downloads.apache.org/flink/flink-kubernetes-operator-1.11.0/
                  helm repo update

                  latest version : 🔗https://flink.apache.org/downloads/#apache-flink-kubernetes-operator

                  2.install chart

                  Details
helm install --create-namespace -n flink flink-kubernetes-operator \
    flink-operator-repo/flink-kubernetes-operator \
    --set image.repository=m.lab.zverse.space/ghcr.io/apache/flink-kubernetes-operator \
    --set image.tag=1.11.0 \
    --set webhook.create=false
                  Reference

                  Preliminary

1. Kubernetes has been installed; if not, check 🔗link

2. ArgoCD has been installed; if not, check 🔗link

3. Cert-manager has been installed via ArgoCD and a ClusterIssuer named self-signed-ca-issuer exists; if not, check 🔗link

4. Ingress has been installed for ArgoCD; if not, check 🔗link


1.prepare `flink-operator.yaml`

                  Details
                  kubectl -n argocd apply -f - << EOF
                  apiVersion: argoproj.io/v1alpha1
                  kind: Application
                  metadata:
                    name: flink-operator
                  spec:
                    syncPolicy:
                      syncOptions:
                      - CreateNamespace=true
                    project: default
                    source:
                      repoURL: https://downloads.apache.org/flink/flink-kubernetes-operator-1.11.0
                      chart: flink-kubernetes-operator
                      targetRevision: 1.11.0
                      helm:
                        releaseName: flink-operator
                        values: |
                          image:
                            repository: m.daocloud.io/ghcr.io/apache/flink-kubernetes-operator
                            pullPolicy: IfNotPresent
                            tag: "1.11.0"
                        version: v3
                    destination:
                      server: https://kubernetes.default.svc
                      namespace: flink
                  EOF

2.sync by argocd

                  Details
                  argocd app sync argocd/flink-operator
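With the operator synced, a minimal FlinkDeployment can be used to check that it reconciles jobs. This is a sketch based on the upstream basic example; the image, registry prefix and resource values are assumptions to adjust, and it relies on the flink service account created by the operator chart:

kubectl -n flink apply -f - << EOF
apiVersion: flink.apache.org/v1beta1
kind: FlinkDeployment
metadata:
  name: basic-example
spec:
  image: m.daocloud.io/docker.io/flink:1.19
  flinkVersion: v1_19
  flinkConfiguration:
    taskmanager.numberOfTaskSlots: "1"
  serviceAccount: flink
  jobManager:
    resource:
      memory: "1024m"
      cpu: 1
  taskManager:
    resource:
      memory: "1024m"
      cpu: 1
  job:
    jarURI: local:///opt/flink/examples/streaming/StateMachineExample.jar
    parallelism: 1
    upgradeMode: stateless
EOF
kubectl -n flink get flinkdeployment basic-example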


                  Jun 7, 2025

                  👨‍💻Schedmd Slurm

                  The Slurm Workload Manager, formerly known as Simple Linux Utility for Resource Management (SLURM), or simply Slurm, is a free and open-source job scheduler for Linux and Unix-like kernels, used by many of the world’s supercomputers and computer clusters.

                  It provides three key functions:

                  • allocating exclusive and/or non-exclusive access to resources (computer nodes) to users for some duration of time so they can perform work,
                  • providing a framework for starting, executing, and monitoring work, typically a parallel job such as Message Passing Interface (MPI) on a set of allocated nodes, and
                  • arbitrating contention for resources by managing a queue of pending jobs.


                  Content

                  Aug 7, 2024

                  Subsections of 👨‍💻Schedmd Slurm

                  Build & Install

                  Aug 7, 2024

                  Subsections of Build & Install

                  Install On Debian

                  Cluster Setting

                  • 1 Manager
                  • 1 Login Node
                  • 2 Compute nodes
hostname                         IP               role      quota
manage01 (slurmctld, slurmdbd)   192.168.56.115   manager   2C4G
login01 (login)                  192.168.56.116   login     2C4G
compute01 (slurmd)               192.168.56.117   compute   2C4G
compute02 (slurmd)               192.168.56.118   compute   2C4G

                  Software Version:

software   version
os         Debian 12 bookworm
slurm      24.05.2

                  Important

                  when you see (All Nodes), you need to run the following command on all nodes

                  when you see (Manager Node), you only need to run the following command on manager node

                  when you see (Login Node), you only need to run the following command on login node

                  Prepare Steps (All Nodes)

1. Modify the /etc/apt/sources.list file to use the tuna mirror
                  cat > /etc/apt/sources.list << EOF
                  deb https://mirrors.tuna.tsinghua.edu.cn/debian/ bookworm main contrib non-free non-free-firmware
                  deb-src https://mirrors.tuna.tsinghua.edu.cn/debian/ bookworm main contrib non-free non-free-firmware
                  
                  deb https://mirrors.tuna.tsinghua.edu.cn/debian/ bookworm-updates main contrib non-free non-free-firmware
                  deb-src https://mirrors.tuna.tsinghua.edu.cn/debian/ bookworm-updates main contrib non-free non-free-firmware
                  
                  deb https://mirrors.tuna.tsinghua.edu.cn/debian/ bookworm-backports main contrib non-free non-free-firmware
                  deb-src https://mirrors.tuna.tsinghua.edu.cn/debian/ bookworm-backports main contrib non-free non-free-firmware
                  
                  deb https://mirrors.tuna.tsinghua.edu.cn/debian-security/ bookworm-security main contrib non-free non-free-firmware
                  deb-src https://mirrors.tuna.tsinghua.edu.cn/debian-security/ bookworm-security main contrib non-free non-free-firmware
                  EOF
If you cannot get an IPv4 address

                  Modify the /etc/network/interfaces

                  allow-hotplug enps08
                  iface enps08 inet dhcp

                  restart the network

                  systemctl restart networking
2. Update apt cache
                  apt clean all && apt update
3. Set hostname on each node
                  Node:
                  hostnamectl set-hostname manage01
                  hostnamectl set-hostname login01
                  hostnamectl set-hostname compute01
                  hostnamectl set-hostname compute02
4. Set hosts file
                  cat >> /etc/hosts << EOF
                  192.168.56.115 manage01
                  192.168.56.116 login01
                  192.168.56.117 compute01
                  192.168.56.118 compute02
                  EOF
5. Disable firewall
                  systemctl stop nftables && systemctl disable nftables
6. Install the ntpdate package
                  apt-get -y install ntpdate
7. Sync server time
                  ln -sf /usr/share/zoneinfo/Asia/Shanghai /etc/localtime
                  echo 'Asia/Shanghai' >/etc/timezone
                  ntpdate time.windows.com
8. Add cron job to sync time
                  crontab -e
                  */5 * * * * /usr/sbin/ntpdate time.windows.com
9. Create ssh key pair on each node
                  ssh-keygen -t rsa -b 4096 -C $HOSTNAME
10. Test passwordless ssh login to the other nodes
                  Node:
                  ssh-copy-id -i ~/.ssh/id_rsa.pub root@login01
                  ssh-copy-id -i ~/.ssh/id_rsa.pub root@compute01
                  ssh-copy-id -i ~/.ssh/id_rsa.pub root@compute02
                  ssh-copy-id -i ~/.ssh/id_rsa.pub root@manage01
                  ssh-copy-id -i ~/.ssh/id_rsa.pub root@compute01
                  ssh-copy-id -i ~/.ssh/id_rsa.pub root@compute02

                  Install Components

                  1. Install NFS server (Manager Node)

                  there are many ways to install NFS server

                  create shared folder

                  mkdir /data
                  chmod 755 /data

modify /etc/exports (e.g. with vim)

                  /data *(rw,sync,insecure,no_subtree_check,no_root_squash)

                  start nfs server

                  systemctl start rpcbind 
                  systemctl start nfs-server 
                  
                  systemctl enable rpcbind 
                  systemctl enable nfs-server

                  check nfs server

                  showmount -e localhost
                  
                  # Output
                  Export list for localhost:
                  /data *
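On the login and compute nodes, the shared directory can then be mounted from the manager. A sketch to run on login01 and the compute nodes, assuming manage01 resolves via the /etc/hosts entries created earlier:

apt-get install -y nfs-common
mkdir -p /data
mount -t nfs manage01:/data /data
# make the mount persistent across reboots
echo "manage01:/data /data nfs rw,auto,nofail,noatime,nolock,intr,tcp,actimeo=1800 0 0" >> /etc/fstab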
2. Install munge service
                  • add user munge (All Nodes)
                  groupadd -g 1108 munge
                  useradd -m -c "Munge Uid 'N' Gid Emporium" -d /var/lib/munge -u 1108 -g munge -s /sbin/nologin munge
                  • Install rng-tools-debian (Manager Nodes)
                  apt-get install -y rng-tools-debian
                  # modify service script
                  vim /usr/lib/systemd/system/rngd.service
                  [Service]
                  ExecStart=/usr/sbin/rngd -f -r /dev/urandom
                  systemctl daemon-reload
                  systemctl start rngd
                  systemctl enable rngd
• install munge packages (All Nodes)
apt-get install -y libmunge-dev libmunge2 munge
                  • generate secret key (Manager Nodes)
                  dd if=/dev/urandom bs=1 count=1024 > /etc/munge/munge.key
• copy munge.key from the manager node to the other nodes (Manager Node)
                  scp -p /etc/munge/munge.key root@login01:/etc/munge/
                  scp -p /etc/munge/munge.key root@compute01:/etc/munge/
                  scp -p /etc/munge/munge.key root@compute02:/etc/munge/
                  • grant privilege on munge.key (All Nodes)
                  chown munge: /etc/munge/munge.key
                  chmod 400 /etc/munge/munge.key
                  
                  systemctl start munge
                  systemctl enable munge

                  Using systemctl status munge to check if the service is running

                  • test munge
                  munge -n | ssh compute01 unmunge
3. Install Mariadb (Manager Nodes)
                  apt-get install -y mariadb-server
                  • create database and user
                  systemctl start mariadb
                  systemctl enable mariadb
                  
                  ROOT_PASS=$(tr -dc A-Za-z0-9 </dev/urandom | head -c 16) 
                  mysql -e "CREATE USER root IDENTIFIED BY '${ROOT_PASS}'"
                  mysql -uroot -p$ROOT_PASS -e 'create database slurm_acct_db'
• create user slurm, and grant all privileges on database slurm_acct_db
                  mysql -uroot -p$ROOT_PASS
                  create user slurm;
                  
                  grant all on slurm_acct_db.* TO 'slurm'@'localhost' identified by '123456' with grant option;
                  
                  flush privileges;
                  • create Slurm user
                  groupadd -g 1109 slurm
                  useradd -m -c "Slurm manager" -d /var/lib/slurm -u 1109 -g slurm -s /bin/bash slurm

                  Install Slurm (All Nodes)

                  • Install basic Debian package build requirements:
                  apt-get install -y build-essential fakeroot devscripts equivs
                  • Unpack the distributed tarball:
                  wget https://download.schedmd.com/slurm/slurm-24.05.2.tar.bz2 -O slurm-24.05.2.tar.bz2 &&
                  tar -xaf slurm*tar.bz2
                  • cd to the directory containing the Slurm source:
                  cd slurm-24.05.2 &&   mkdir -p /etc/slurm && ./configure 
                  • compile slurm
                  make install
                  • modify configuration files (Manager Nodes)

                    cp /root/slurm-24.05.2/etc/slurm.conf.example /etc/slurm/slurm.conf
                    vim /etc/slurm/slurm.conf

                    focus on these options:

                    SlurmctldHost=manage
                    
                    AccountingStorageEnforce=associations,limits,qos
                    AccountingStorageHost=manage
                    AccountingStoragePass=/var/run/munge/munge.socket.2
                    AccountingStoragePort=6819  
                    AccountingStorageType=accounting_storage/slurmdbd  
                    
                    JobCompHost=localhost
                    JobCompLoc=slurm_acct_db
                    JobCompPass=123456
                    JobCompPort=3306
                    JobCompType=jobcomp/mysql
                    JobCompUser=slurm
                    JobContainerType=job_container/none
                    JobAcctGatherType=jobacct_gather/linux
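Besides the accounting options above, slurm.conf also needs node and partition definitions. A minimal sketch for a layout like this one; the hostnames, CPU counts, and memory values are assumptions you should adjust to your own nodes:

SlurmdSpoolDir=/var/spool/slurmd
StateSaveLocation=/var/spool/slurmctld

NodeName=compute[01-02] CPUs=2 RealMemory=3500 State=UNKNOWN
PartitionName=compute Nodes=compute[01-02] Default=YES MaxTime=INFINITE State=UP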
                    cp /root/slurm-24.05.2/etc/slurmdbd.conf.example /etc/slurm/slurmdbd.conf
                    vim /etc/slurm/slurmdbd.conf
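The copied slurmdbd.conf example then needs at least these values to line up with the MariaDB setup above (a sketch; the host name and log/PID paths are assumptions to adjust):

DbdHost=localhost
SlurmUser=slurm
StorageType=accounting_storage/mysql
StorageHost=localhost
StoragePort=3306
StorageUser=slurm
StoragePass=123456
StorageLoc=slurm_acct_db
LogFile=/var/log/slurm/slurmdbd.log
PidFile=/var/run/slurmdbd.pid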
                    • modify /etc/slurm/cgroup.conf
                    cp /root/slurm-24.05.2/etc/cgroup.conf.example /etc/slurm/cgroup.conf
                    • send configuration files to other nodes
                    scp -r /etc/slurm/*.conf  root@login01:/etc/slurm/
                    scp -r /etc/slurm/*.conf  root@compute01:/etc/slurm/
                    scp -r /etc/slurm/*.conf  root@compute02:/etc/slurm/
                  • grant privilege on some directories (All Nodes)

                  mkdir /var/spool/slurmd
                  chown slurm: /var/spool/slurmd
                  mkdir /var/log/slurm
                  chown slurm: /var/log/slurm
                  
                  mkdir /var/spool/slurmctld
                  chown slurm: /var/spool/slurmctld
                  
                  chown slurm: /etc/slurm/slurmdbd.conf
                  chmod 600 /etc/slurm/slurmdbd.conf
                  • start slurm services on each node
                  Node:
                  systemctl start slurmdbd
                  systemctl enable slurmdbd
                  
                  systemctl start slurmctld
                  systemctl enable slurmctld
                  
                  systemctl start slurmd
                  systemctl enable slurmd
                  Using `systemctl status xxxx` to check if the `xxxx` service is running
Example slurmdbd.service
                  ```text
                  # vim /usr/lib/systemd/system/slurmdbd.service
                  
                  
                  [Unit]
                  Description=Slurm DBD accounting daemon
                  After=network-online.target remote-fs.target munge.service mysql.service mysqld.service mariadb.service sssd.service
                  Wants=network-online.target
                  ConditionPathExists=/etc/slurm/slurmdbd.conf
                  
                  [Service]
                  Type=simple
                  EnvironmentFile=-/etc/sysconfig/slurmdbd
                  EnvironmentFile=-/etc/default/slurmdbd
                  User=slurm
                  Group=slurm
                  RuntimeDirectory=slurmdbd
                  RuntimeDirectoryMode=0755
                  ExecStart=/usr/local/sbin/slurmdbd -D -s $SLURMDBD_OPTIONS
                  ExecReload=/bin/kill -HUP $MAINPID
                  LimitNOFILE=65536
                  
                  
                  # Uncomment the following lines to disable logging through journald.
                  # NOTE: It may be preferable to set these through an override file instead.
                  #StandardOutput=null
                  #StandardError=null
                  
                  [Install]
                  WantedBy=multi-user.target
                  ```
                  
Example slurmctld.service
                  ```text
                  # vim /usr/lib/systemd/system/slurmctld.service
                  
                  
                  [Unit]
                  Description=Slurm controller daemon
                  After=network-online.target remote-fs.target munge.service sssd.service
                  Wants=network-online.target
                  ConditionPathExists=/etc/slurm/slurm.conf
                  
                  [Service]
                  Type=notify
                  EnvironmentFile=-/etc/sysconfig/slurmctld
                  EnvironmentFile=-/etc/default/slurmctld
                  User=slurm
                  Group=slurm
                  RuntimeDirectory=slurmctld
                  RuntimeDirectoryMode=0755
                  ExecStart=/usr/local/sbin/slurmctld --systemd $SLURMCTLD_OPTIONS
                  ExecReload=/bin/kill -HUP $MAINPID
                  LimitNOFILE=65536
                  
                  
                  # Uncomment the following lines to disable logging through journald.
                  # NOTE: It may be preferable to set these through an override file instead.
                  #StandardOutput=null
                  #StandardError=null
                  
                  [Install]
                  WantedBy=multi-user.target
                  ```
                  
Example slurmd.service
                  ```text
                  # vim /usr/lib/systemd/system/slurmd.service
                  
                  
                  [Unit]
                  Description=Slurm node daemon
                  After=munge.service network-online.target remote-fs.target sssd.service
                  Wants=network-online.target
                  #ConditionPathExists=/etc/slurm/slurm.conf
                  
                  [Service]
                  Type=notify
                  EnvironmentFile=-/etc/sysconfig/slurmd
                  EnvironmentFile=-/etc/default/slurmd
                  RuntimeDirectory=slurm
                  RuntimeDirectoryMode=0755
                  ExecStart=/usr/local/sbin/slurmd --systemd $SLURMD_OPTIONS
                  ExecReload=/bin/kill -HUP $MAINPID
                  KillMode=process
                  LimitNOFILE=131072
                  LimitMEMLOCK=infinity
                  LimitSTACK=infinity
                  Delegate=yes
                  
                  
                  # Uncomment the following lines to disable logging through journald.
                  # NOTE: It may be preferable to set these through an override file instead.
                  #StandardOutput=null
                  #StandardError=null
                  
                  [Install]
                  WantedBy=multi-user.target
                  ```
                  
On the remaining nodes (login and compute), start only slurmd:
systemctl start slurmd
systemctl enable slurmd
Using `systemctl status slurmd` to check if the `slurmd` service is running

                  Test Your Slurm Cluster (Login Node)

                  • check cluster configuration
                  scontrol show config
                  • check cluster status
                  sinfo
                  scontrol show partition
                  scontrol show node
                  • submit job
                  srun -N2 hostname
                  scontrol show jobs
                  • check job status
                  squeue -a
                  Aug 7, 2024

                  Install On Ubuntu

                  Cluster Setting

                  • 1 Manager
                  • 1 Login Node
                  • 2 Compute nodes
| hostname | IP | role | quota |
| --- | --- | --- | --- |
| manage01 (slurmctld, slurmdbd) | 192.168.56.115 | manager | 2C4G |
| login01 (login) | 192.168.56.116 | login | 2C4G |
| compute01 (slurmd) | 192.168.56.117 | compute | 2C4G |
| compute02 (slurmd) | 192.168.56.118 | compute | 2C4G |

                  Software Version:

| software | version |
| --- | --- |
| os | Ubuntu 22.04 |
| slurm | 24.05.2 |

                  Important

                  when you see (All Nodes), you need to run the following command on all nodes

                  when you see (Manager Node), you only need to run the following command on manager node

                  when you see (Login Node), you only need to run the following command on login node

                  Prepare Steps (All Nodes)

1. Modify the /etc/apt/sources.list file to use the TUNA mirror
                  cat > /etc/apt/sources.list << EOF
                  
                  EOF
If you cannot get an IPv4 address

                  Modify the /etc/network/interfaces

                  allow-hotplug enps08
                  iface enps08 inet dhcp

                  restart the network

                  systemctl restart networking
                  1. Update apt cache
                  apt clean all && apt update
                  1. Set hosts file
                  cat >> /etc/hosts << EOF
                  10.119.2.36 juice-036
                  10.119.2.37 juice-037
                  10.119.2.38 juice-038
                  EOF
                  1. Install packages ntpdate
                  apt-get -y install ntpdate
                  1. Sync server time
                  ln -sf /usr/share/zoneinfo/Asia/Shanghai /etc/localtime
                  echo 'Asia/Shanghai' >/etc/timezone
                  ntpdate ntp.aliyun.com
                  1. Add cron job to sync time
                  crontab -e
                  */5 * * * * /usr/sbin/ntpdate ntp.aliyun.com
                  1. Create ssh key pair on each node
                  ssh-keygen -t rsa -b 4096 -C $HOSTNAME
                  1. Test ssh login other nodes without password
                  Node:
                  ssh-copy-id -i ~/.ssh/id_rsa.pub root@juice-036
                  ssh-copy-id -i ~/.ssh/id_rsa.pub root@juice-037
                  ssh-copy-id -i ~/.ssh/id_rsa.pub root@juice-038

                  Install Components

                  1. Install NFS server (Manager Node)

There are many ways to install an NFS server; the steps below configure one on the manager node.

                  create shared folder

                  mkdir /data
                  chmod 755 /data

edit /etc/exports (vim /etc/exports)

                  /data *(rw,sync,insecure,no_subtree_check,no_root_squash)

                  start nfs server

                  systemctl start rpcbind 
                  systemctl start nfs-server 
                  
                  systemctl enable rpcbind 
                  systemctl enable nfs-server

                  check nfs server

                  showmount -e localhost
                  
                  # Output
                  Export list for localhost:
                  /data *
                  1. Install munge service
• install munge and its build dependencies (All Nodes), then generate the munge key (Manager Node)
                  sudo apt install -y build-essential git wget munge libmunge-dev libmunge2 \
                      mariadb-server libmariadb-dev libssl-dev libpam0g-dev \
                      libhwloc-dev liblua5.3-dev libreadline-dev libncurses-dev \
                      libjson-c-dev libyaml-dev libhttp-parser-dev libjwt-dev libdbus-glib-1-dev libbpf-dev libdbus-1-dev
                  
                  
                  which mungekey
                  
# if mungekey is available, use it to generate the key
                  sudo systemctl stop munge
                  sudo mungekey -c
                  sudo chown munge:munge /etc/munge/munge.key
                  sudo chmod 400 /etc/munge/munge.key
                  sudo systemctl start munge
• copy munge.key from the manager node to the remaining nodes (Manager Node)
                  sudo scp /etc/munge/munge.key juice-036:/tmp/munge.key
                  sudo scp /etc/munge/munge.key juice-037:/tmp/munge.key
                  sudo scp /etc/munge/munge.key juice-038:/tmp/munge.key
                  • grant privilege on munge.key (All Nodes)
                  systemctl stop munge
                  
                  sudo mv /tmp/munge.key /etc/munge/munge.key
                  chown munge: /etc/munge/munge.key
                  chmod 400 /etc/munge/munge.key
                  
                  systemctl start munge
                  systemctl status munge
                  systemctl enable munge

                  Using systemctl status munge to check if the service is running

                  • test munge
                  munge -n | ssh juice-036 unmunge
                  munge -n | ssh juice-037 unmunge
                  munge -n | ssh juice-038 unmunge
                  1. Install Mariadb (Manager Nodes)
                  apt-get install -y mariadb-server
                  • create database and user
                  systemctl start mariadb
                  systemctl enable mariadb
                  
                  ROOT_PASS=$(tr -dc A-Za-z0-9 </dev/urandom | head -c 16) 
                  mysql -e "CREATE USER root IDENTIFIED BY '${ROOT_PASS}'"
                  mysql -uroot -p$ROOT_PASS -e 'create database slurm_acct_db'
• create user slurm, and grant all privileges on database slurm_acct_db
                  mysql -uroot -p$ROOT_PASS
                  create user slurm;
                  
                  grant all on slurm_acct_db.* TO 'slurm'@'localhost' identified by '123456' with grant option;
                  
                  flush privileges;
                  • create Slurm user
                  groupadd -g 1109 slurm
                  useradd -m -c "Slurm manager" -d /var/lib/slurm -u 1109 -g slurm -s /bin/bash slurm

                  Install Slurm (All Nodes)

                  • Install basic Debian package build requirements:
                  apt-get install -y build-essential fakeroot devscripts equivs
                  • Unpack the distributed tarball:
                  wget https://download.schedmd.com/slurm/slurm-25.05.2.tar.bz2 -O slurm-25.05.2.tar.bz2 &&
                  tar -xaf slurm*tar.bz2
                  • cd to the directory containing the Slurm source:
                  cd slurm-25.05.2 &&   mkdir -p /etc/slurm && ./configure --prefix=/usr --sysconfdir=/etc/slurm  --enable-cgroupv2
                  • compile slurm
                  make install
                  • modify configuration files (Manager Nodes)

                    cp /root/slurm-25.05.2/etc/slurm.conf.example /etc/slurm/slurm.conf
                    vim /etc/slurm/slurm.conf

                    focus on these options:

                    SlurmctldHost=manage
                    
                    AccountingStorageEnforce=associations,limits,qos
                    AccountingStorageHost=manage
                    AccountingStoragePass=/var/run/munge/munge.socket.2
                    AccountingStoragePort=6819  
                    AccountingStorageType=accounting_storage/slurmdbd  
                    
                    JobCompHost=localhost
                    JobCompLoc=slurm_acct_db
                    JobCompPass=123456
                    JobCompPort=3306
                    JobCompType=jobcomp/mysql
                    JobCompUser=slurm
                    JobContainerType=job_container/none
                    JobAcctGatherType=jobacct_gather/linux
                    cp /root/slurm-25.05.2/etc/slurmdbd.conf.example /etc/slurm/slurmdbd.conf
                    vim /etc/slurm/slurmdbd.conf
                    • modify /etc/slurm/cgroup.conf
                    cp /root/slurm-25.05.2/etc/cgroup.conf.example /etc/slurm/cgroup.conf
                    • send configuration files to other nodes
                    scp -r /etc/slurm/*.conf  root@juice-037:/etc/slurm/
                    scp -r /etc/slurm/*.conf  root@juice-038:/etc/slurm/
                  • grant privilege on some directories (All Nodes)

                  mkdir /var/spool/slurmd
                  chown slurm: /var/spool/slurmd
                  mkdir /var/log/slurm
                  chown slurm: /var/log/slurm
                  
                  mkdir /var/spool/slurmctld
                  chown slurm: /var/spool/slurmctld
                  
                  chown slurm: /etc/slurm/slurmdbd.conf
                  chmod 600 /etc/slurm/slurmdbd.conf
                  • start slurm services on each node
                  Node:
                  systemctl start slurmdbd
                  systemctl enable slurmdbd
                  
                  systemctl start slurmctld
                  systemctl enable slurmctld
                  
                  systemctl start slurmd
                  systemctl enable slurmd
                  Using `systemctl status xxxx` to check if the `xxxx` service is running
Example slurmdbd.service
                  ```text
                  # vim /usr/lib/systemd/system/slurmdbd.service
                  
                  
                  [Unit]
                  Description=Slurm DBD accounting daemon
                  After=network-online.target remote-fs.target munge.service mysql.service mysqld.service mariadb.service sssd.service
                  Wants=network-online.target
                  ConditionPathExists=/etc/slurm/slurmdbd.conf
                  
                  [Service]
                  Type=simple
                  EnvironmentFile=-/etc/sysconfig/slurmdbd
                  EnvironmentFile=-/etc/default/slurmdbd
                  User=slurm
                  Group=slurm
                  RuntimeDirectory=slurmdbd
                  RuntimeDirectoryMode=0755
                  ExecStart=/usr/sbin/slurmdbd -D -s $SLURMDBD_OPTIONS
                  ExecReload=/bin/kill -HUP $MAINPID
                  LimitNOFILE=65536
                  
                  
                  # Uncomment the following lines to disable logging through journald.
                  # NOTE: It may be preferable to set these through an override file instead.
                  #StandardOutput=null
                  #StandardError=null
                  
                  [Install]
                  WantedBy=multi-user.target
                  ```
                  
Example slurmctld.service
                  ```text
                  # vim /usr/lib/systemd/system/slurmctld.service
                  
                  
                  [Unit]
                  Description=Slurm controller daemon
                  After=network-online.target remote-fs.target munge.service sssd.service
                  Wants=network-online.target
                  ConditionPathExists=/etc/slurm/slurm.conf
                  
                  [Service]
                  Type=notify
                  EnvironmentFile=-/etc/sysconfig/slurmctld
                  EnvironmentFile=-/etc/default/slurmctld
                  User=slurm
                  Group=slurm
                  RuntimeDirectory=slurmctld
                  RuntimeDirectoryMode=0755
                  ExecStart=/usr/sbin/slurmctld --systemd $SLURMCTLD_OPTIONS
                  ExecReload=/bin/kill -HUP $MAINPID
                  LimitNOFILE=65536
                  
                  
                  # Uncomment the following lines to disable logging through journald.
                  # NOTE: It may be preferable to set these through an override file instead.
                  #StandardOutput=null
                  #StandardError=null
                  
                  [Install]
                  WantedBy=multi-user.target
                  ```
                  
Example slurmd.service
                  ```text
                  # vim /usr/lib/systemd/system/slurmd.service
                  
                  
                  [Unit]
                  Description=Slurm node daemon
                  After=munge.service network-online.target remote-fs.target sssd.service
                  Wants=network-online.target
                  #ConditionPathExists=/etc/slurm/slurm.conf
                  
                  [Service]
                  Type=notify
                  EnvironmentFile=-/etc/sysconfig/slurmd
                  EnvironmentFile=-/etc/default/slurmd
                  RuntimeDirectory=slurm
                  RuntimeDirectoryMode=0755
                  ExecStart=/usr/sbin/slurmd --systemd $SLURMD_OPTIONS
                  ExecReload=/bin/kill -HUP $MAINPID
                  KillMode=process
                  LimitNOFILE=131072
                  LimitMEMLOCK=infinity
                  LimitSTACK=infinity
                  Delegate=yes
                  
                  
                  # Uncomment the following lines to disable logging through journald.
                  # NOTE: It may be preferable to set these through an override file instead.
                  #StandardOutput=null
                  #StandardError=null
                  
                  [Install]
                  WantedBy=multi-user.target
                  ```
                  
On the remaining nodes (login and compute), start only slurmd:
systemctl start slurmd
systemctl enable slurmd
Using `systemctl status slurmd` to check if the `slurmd` service is running

                  Test Your Slurm Cluster (Login Node)

                  • check cluster configuration
                  scontrol show config
                  • check cluster status
                  sinfo
                  scontrol show partition
                  scontrol show node
                  • submit job
                  srun -N2 hostname
                  scontrol show jobs
                  • check job status
                  squeue -a
                  Aug 7, 2024

                  Install From Binary

                  Important

                  (All Nodes) means all type nodes should install this component.

                  (Manager Node) means only the manager node should install this component.

                  (Login Node) means only the Auth node should install this component.

                  (Cmp) means only the Compute node should install this component.

Typically, three kinds of nodes are required to run Slurm:

1 Manager (Manager Node), 1 Login Node, and N Compute nodes (Cmp).

But you can also choose to install all services on a single node.

Prerequisites

                  1. change hostname (All Nodes)
                    hostnamectl set-hostname (manager|auth|computeXX)
                  2. modify /etc/hosts (All Nodes)
                    echo "192.aa.bb.cc (manager|auth|computeXX)" >> /etc/hosts
                  3. disable firewall, selinux, dnsmasq, swap (All Nodes). more detail here
4. NFS Server (Manager Node). NFS is used as the default shared file system for the cluster.
5. [NFS Client] (All Nodes). All nodes should mount the NFS share
                    Install NFS Client
                    mount <$nfs_server>:/data /data -o proto=tcp -o nolock
                  6. Munge (All Nodes). The auth/munge plugin will be built if the MUNGE authentication development library is installed. MUNGE is used as the default authentication mechanism.
                    Install Munge

All nodes need to have the munge user and group.

                    groupadd -g 1108 munge
                    useradd -m -c "Munge Uid 'N' Gid Emporium" -d /var/lib/munge -u 1108 -g munge -s /sbin/nologin munge
                    yum install epel-release -y
                    yum install munge munge-libs munge-devel -y

                    Create global secret key

                    /usr/sbin/create-munge-key -r
                    dd if=/dev/urandom bs=1 count=1024 > /etc/munge/munge.key

                    sync secret to the rest of nodes

                    scp -p /etc/munge/munge.key root@<$rest_node>:/etc/munge/
                    ssh root@<$rest_node> "chown munge: /etc/munge/munge.key && chmod 400 /etc/munge/munge.key"
                    ssh root@<$rest_node> "systemctl start munge && systemctl enable munge"

                    test munge if it works

                    munge -n | unmunge
                  7. Database (Manager Node). MySQL support for accounting will be built if the MySQL or MariaDB development library is present. A currently supported version of MySQL or MariaDB should be used.
                    Install MariaDB

                    install mariadb

                    yum -y install mariadb-server
                    systemctl start mariadb && systemctl enable mariadb
                    ROOT_PASS=$(tr -dc A-Za-z0-9 </dev/urandom | head -c 16) 
                    mysql -e "CREATE USER root IDENTIFIED BY '${ROOT_PASS}'"

                    login mysql

                    mysql -u root -p${ROOT_PASS}
                    create database slurm_acct_db;
                    create user slurm;
                    grant all on slurm_acct_db.* TO 'slurm'@'localhost' identified by '123456' with grant option;
                    flush privileges;
                    quit

                  Install Slurm

                  1. create slurm user (All Nodes)
                    groupadd -g 1109 slurm
                    useradd -m -c "slurm manager" -d /var/lib/slurm -u 1109 -g slurm -s /bin/bash slurm
Install Slurm in one of the following ways:

                  Build RPM package

1. install dependencies (Manager Node)

                    yum -y install gcc gcc-c++ readline-devel perl-ExtUtils-MakeMaker pam-devel rpm-build mysql-devel python3
                  2. build rpm package (Manager Node)

                    wget https://download.schedmd.com/slurm/slurm-24.05.2.tar.bz2 -O slurm-24.05.2.tar.bz2
                    rpmbuild -ta --nodeps slurm-24.05.2.tar.bz2

The rpm files will be placed under the $HOME/rpmbuild directory of the user building them.

                  3. send rpm to rest nodes (Manager Node)

                    ssh root@<$rest_node> "mkdir -p /root/rpmbuild/RPMS/"
scp -rp $HOME/rpmbuild/RPMS/x86_64 root@<$rest_node>:/root/rpmbuild/RPMS/x86_64
                  4. install rpm (Manager Node)

                    ssh root@<$rest_node> "yum localinstall /root/rpmbuild/RPMS/x86_64/slurm-*"
                  5. modify configuration file (Manager Node)

                    cp /etc/slurm/cgroup.conf.example /etc/slurm/cgroup.conf
                    cp /etc/slurm/slurm.conf.example /etc/slurm/slurm.conf
                    cp /etc/slurm/slurmdbd.conf.example /etc/slurm/slurmdbd.conf
                    chmod 600 /etc/slurm/slurmdbd.conf
                    chown slurm: /etc/slurm/slurmdbd.conf

cgroup.conf doesn't need to be changed.

                    edit /etc/slurm/slurm.conf, you can use this link as a reference

                    edit /etc/slurm/slurmdbd.conf, you can use this link as a reference

Install from the yum repo directly

                  1. install slurm (All Nodes)

yum -y install slurm-wlm slurmdbd
                  2. modify configuration file (All Nodes)

                    vim /etc/slurm-llnl/slurm.conf
                    vim /etc/slurm-llnl/slurmdbd.conf

cgroup.conf doesn't need to be changed.

                    edit /etc/slurm/slurm.conf, you can use this link as a reference

                    edit /etc/slurm/slurmdbd.conf, you can use this link as a reference

                  1. send configuration (Manager Node)
                     scp -r /etc/slurm/*.conf  root@<$rest_node>:/etc/slurm/
                     ssh rootroot@<$rest_node> "mkdir /var/spool/slurmd && chown slurm: /var/spool/slurmd"
                     ssh rootroot@<$rest_node> "mkdir /var/log/slurm && chown slurm: /var/log/slurm"
                     ssh rootroot@<$rest_node> "mkdir /var/spool/slurmctld && chown slurm: /var/spool/slurmctld"
                  2. start service (Manager Node)
                    ssh rootroot@<$rest_node> "systemctl start slurmdbd && systemctl enable slurmdbd"
                    ssh rootroot@<$rest_node> "systemctl start slurmctld && systemctl enable slurmctld"
                  3. start service (All Nodes)
                    ssh rootroot@<$rest_node> "systemctl start slurmd && systemctl enable slurmd"

                  Test

                  1. show cluster status
                  scontrol show config
                  sinfo
                  scontrol show partition
                  scontrol show node
                  1. submit job
                  srun -N2 hostname
                  scontrol show jobs
                  1. check job status
                  squeue -a

                  Reference:

                  1. https://slurm.schedmd.com/documentation.html
                  2. https://wiki.fysik.dtu.dk/Niflheim_system/Slurm_installation/
                  3. https://github.com/Artlands/Install-Slurm
                  Aug 7, 2024

                  Install From Helm Chart

Compared with the complex binary installation, the Helm chart is an easier way to install Slurm.

Source code can be found at https://github.com/AaronYang0628/slurm-on-k8s

Prerequisites

1. Kubernetes is installed; if not, check 🔗link
2. Helm binary is installed; if not, check 🔗link

                  Installation

                  1. get helm repo and update

helm repo add ay-helm-mirror https://aaronyang0628.github.io/helm-chart-mirror/charts
  helm repo update
                  2. install slurm chart

                    # wget -O slurm.values.yaml https://raw.githubusercontent.com/AaronYang0628/slurm-on-k8s/refs/heads/main/chart/values.yaml
                    helm install slurm ay-helm-mirror/chart -f slurm.values.yaml --version 1.0.10

                    Or you can get template values.yaml from https://raw.githubusercontent.com/AaronYang0628/helm-chart-mirror/refs/heads/main/templates/slurm/slurm.values.yaml

                  3. check chart status

                    helm -n slurm list
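Beyond the Helm release status, you can also watch the pods come up (assuming the chart deploys everything into the slurm namespace, as the command above implies):

  kubectl -n slurm get pods -w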
                  Aug 7, 2024

                  Install From K8s Operator

Compared with the complex binary installation, using a k8s operator is an easier way to install Slurm.

Source code can be found at https://github.com/AaronYang0628/slurm-on-k8s

Prerequisites

1. Kubernetes is installed; if not, check 🔗link
2. Helm binary is installed; if not, check 🔗link

                  Installation

                  1. deploy slurm operator

                    kubectl apply -f https://raw.githubusercontent.com/AaronYang0628/helm-chart-mirror/refs/heads/main/templates/slurm/operator_install.yaml
Expected Output
                    [root@ay-zj-ecs operator]# kubectl apply -f https://raw.githubusercontent.com/AaronYang0628/helm-chart-mirror/refs/heads/main/templates/slurm/operator_install.yaml
                    namespace/slurm created
                    customresourcedefinition.apiextensions.k8s.io/slurmdeployments.slurm.ay.dev created
                    serviceaccount/slurm-operator-controller-manager created
                    role.rbac.authorization.k8s.io/slurm-operator-leader-election-role created
                    clusterrole.rbac.authorization.k8s.io/slurm-operator-manager-role created
                    clusterrole.rbac.authorization.k8s.io/slurm-operator-metrics-auth-role created
                    clusterrole.rbac.authorization.k8s.io/slurm-operator-metrics-reader created
                    clusterrole.rbac.authorization.k8s.io/slurm-operator-slurmdeployment-admin-role created
                    clusterrole.rbac.authorization.k8s.io/slurm-operator-slurmdeployment-editor-role created
                    clusterrole.rbac.authorization.k8s.io/slurm-operator-slurmdeployment-viewer-role created
                    rolebinding.rbac.authorization.k8s.io/slurm-operator-leader-election-rolebinding created
                    clusterrolebinding.rbac.authorization.k8s.io/slurm-operator-manager-rolebinding created
                    clusterrolebinding.rbac.authorization.k8s.io/slurm-operator-metrics-auth-rolebinding created
                    service/slurm-operator-controller-manager-metrics-service created
                    deployment.apps/slurm-operator-controller-manager created
                  2. check operator status

                    kubectl -n slurm get pod
Expected Output
                    [root@ay-zj-ecs operator]# kubectl -n slurm get pod
                    NAME                                READY   STATUS    RESTARTS   AGE
                    slurm-operator-controller-manager   1/1     Running   0          27s
                  3. apply CRD slurmdeployment

                    kubectl apply -f https://raw.githubusercontent.com/AaronYang0628/helm-chart-mirror/refs/heads/main/templates/slurm/slurmdeployment.zj.values.yaml
Expected Output
                    [root@ay-zj-ecs operator]# kubectl apply -f https://raw.githubusercontent.com/AaronYang0628/helm-chart-mirror/refs/heads/main/templates/slurm/slurmdeployment.zj.values.yaml
                    slurmdeployment.slurm.ay.dev/lensing created
                  4. check operator status

                    kubectl get slurmdeployment
                    kubectl -n slurm logs -f deploy/slurm-operator-controller-manager
                    # kubectl get slurmdep
                    # kubectl -n test get pods
Expected Output
                    [root@ay-zj-ecs ~]# kubectl get slurmdep -w
                    NAME      CPU   GPU   LOGIN   CTLD   DBD   DBSVC   JOB COMMAND                     STATUS
                    lensing   0/1   0/0   0/1     0/1    0/1   0/1     sh -c srun -N 2 /bin/hostname   
                    lensing   1/2   0/0   1/1     1/1    1/1   1/1     sh -c srun -N 2 /bin/hostname   
                    lensing   2/2   0/0   1/1     1/1    1/1   1/1     sh -c srun -N 2 /bin/hostname   
                  5. upgrade slurmdep

                    kubectl edit slurmdep lensing
                    # set SlurmCPU.replicas = 3
Expected Output
                    [root@ay-zj-ecs ~]# kubectl edit slurmdep lensing
                    slurmdeployment.slurm.ay.dev/lensing edited
                    
                    [root@ay-zj-ecs ~]# kubectl get slurmdep -w
                    NAME      CPU   GPU   LOGIN   CTLD   DBD   DBSVC   JOB COMMAND                     STATUS
                    lensing   2/2   0/0   1/1     1/1    1/1   1/1     sh -c srun -N 2 /bin/hostname   
                    lensing   2/3   0/0   1/1     1/1    1/1   1/1     sh -c srun -N 2 /bin/hostname   
                    lensing   3/3   0/0   1/1     1/1    1/1   1/1     sh -c srun -N 2 /bin/hostname   
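If you prefer not to open an editor, the same change can likely be made with kubectl patch. The exact field path below is an assumption based on the SlurmCPU.replicas hint above; check the CRD schema first with kubectl explain slurmdeployment.spec:

  kubectl patch slurmdep lensing --type merge -p '{"spec": {"slurmCPU": {"replicas": 3}}}'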
                  Aug 7, 2024

                  Try OpenSCOW

                  What is SCOW?

SCOW is an HPC cluster management system built by PKU (Peking University).

The OpenSCOW tutorial uses four virtual machines to run a Slurm cluster; it is a good way to learn how to use Slurm.

Check https://pkuhpc.github.io/OpenSCOW/docs/hpccluster, it works well.

                  Aug 7, 2024

                  Subsections of CheatSheet

                  Common Environment Variables

| Variable | Description |
| --- | --- |
| $SLURM_JOB_ID | The Job ID. |
| $SLURM_JOBID | Deprecated. Same as $SLURM_JOB_ID. |
| $SLURM_SUBMIT_HOST | The hostname of the node used for job submission. |
| $SLURM_JOB_NODELIST | Contains the definition (list) of the nodes that is assigned to the job. |
| $SLURM_NODELIST | Deprecated. Same as $SLURM_JOB_NODELIST. |
| $SLURM_CPUS_PER_TASK | Number of CPUs per task. |
| $SLURM_CPUS_ON_NODE | Number of CPUs on the allocated node. |
| $SLURM_JOB_CPUS_PER_NODE | Count of processors available to the job on this node. |
| $SLURM_CPUS_PER_GPU | Number of CPUs requested per allocated GPU. |
| $SLURM_MEM_PER_CPU | Memory per CPU. Same as --mem-per-cpu. |
| $SLURM_MEM_PER_GPU | Memory per GPU. |
| $SLURM_MEM_PER_NODE | Memory per node. Same as --mem. |
| $SLURM_GPUS | Number of GPUs requested. |
| $SLURM_NTASKS | Same as -n, --ntasks. The number of tasks. |
| $SLURM_NTASKS_PER_NODE | Number of tasks requested per node. |
| $SLURM_NTASKS_PER_SOCKET | Number of tasks requested per socket. |
| $SLURM_NTASKS_PER_CORE | Number of tasks requested per core. |
| $SLURM_NTASKS_PER_GPU | Number of tasks requested per GPU. |
| $SLURM_NPROCS | Same as -n, --ntasks. See $SLURM_NTASKS. |
| $SLURM_TASKS_PER_NODE | Number of tasks to be initiated on each node. |
| $SLURM_ARRAY_JOB_ID | Job array's master job ID number. |
| $SLURM_ARRAY_TASK_ID | Job array ID (index) number. |
| $SLURM_ARRAY_TASK_COUNT | Total number of tasks in a job array. |
| $SLURM_ARRAY_TASK_MAX | Job array's maximum ID (index) number. |
| $SLURM_ARRAY_TASK_MIN | Job array's minimum ID (index) number. |

                  A full list of environment variables for SLURM can be found by visiting the SLURM page on environment variables.
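As a quick way to see some of these variables in action, a small batch script like the following sketch simply echoes a few of them into the job's output file:

#!/bin/bash
#SBATCH --job-name=env-demo
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=1
#SBATCH --output=env_demo_%j.out

echo "Job ID:        $SLURM_JOB_ID"
echo "Node list:     $SLURM_JOB_NODELIST"
echo "Tasks:         $SLURM_NTASKS"
echo "CPUs per task: $SLURM_CPUS_PER_TASK"

Submit it with sbatch and check the generated .out file.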

                  Aug 7, 2024

                  File Operations

                  File Distribution

                  • sbcast is used to transfer a file from local disk to local disk on the nodes allocated to a job. This can be used to effectively use diskless compute nodes or provide improved performance relative to a shared file system.
                    • Feature
1. distribute files: quickly copy files to all compute nodes assigned to the job, avoiding the hassle of manual distribution. Faster than traditional scp or rsync, especially when distributing to many nodes.
      2. simplify scripts: one command distributes files to all nodes assigned to the job.
      3. improve performance: parallel transfers speed up file distribution, especially for large or multiple files.
                    • Usage
                      1. Alone
                      sbcast <source_file> <destination_path>
                      1. Embedded in a job script
                      #!/bin/bash
                      #SBATCH --job-name=example_job
                      #SBATCH --output=example_job.out
                      #SBATCH --error=example_job.err
                      #SBATCH --partition=compute
                      #SBATCH --nodes=4
                      
                      # Use sbcast to distribute the file to the /tmp directory of each node
                      sbcast data.txt /tmp/data.txt
                      
                      # Run your program using the distributed files
                      srun my_program /tmp/data.txt

                  File Collection

1. File Redirection: when submitting a job, you can use the #SBATCH --output and #SBATCH --error directives to redirect standard output and standard error to specified files.

                     #SBATCH --output=output.txt
                     #SBATCH --error=error.txt

                    Or

                    sbatch -N2 -w "compute[01-02]" -o result/file/path xxx.slurm
2. Send the results manually: use scp or rsync in the job to copy the files from the compute nodes to the submit node (see the sketch after this list)

3. Using NFS: if a shared file system (such as NFS, Lustre, or GPFS) is configured in the computing cluster, the result files can be written directly to the shared directory. In this way, the result files generated by all nodes are automatically stored in the same location.

                  4. Using sbcast
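A minimal sketch of option 2, copying results back to the submit node at the end of the job script; my_program and the paths here are placeholders:

#!/bin/bash
#SBATCH --job-name=collect-results
#SBATCH --output=collect_%j.out

# run the real work; the result lands in a node-local directory
my_program --out /tmp/result_${SLURM_JOB_ID}.txt

# copy the result back to the submit host (requires passwordless ssh)
scp /tmp/result_${SLURM_JOB_ID}.txt ${SLURM_SUBMIT_HOST}:/data/results/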

                  Aug 7, 2024

                  Submit Jobs

3 Types of Jobs

                  • srun is used to submit a job for execution or initiate job steps in real time.

                    • Example
                      1. run shell
srun -N2 /bin/hostname
                      1. run script
                      srun -N1 test.sh
                      1. exec into slurmd node
                      srun -w slurm-lensing-slurm-slurmd-cpu-2 --pty /bin/bash
                  • sbatch is used to submit a job script for later execution. The script will typically contain one or more srun commands to launch parallel tasks.

                    • Example

                      1. submit a batch job
                      sbatch -N2 -w "compute[01-02]" -o job.stdout /data/jobs/batch-job.slurm
                      batch-job.slurm
                      #!/bin/bash
                      
                      #SBATCH -N 1
                      #SBATCH --job-name=cpu-N1-batch
                      #SBATCH --partition=compute
                      #SBATCH --mail-type=end
                      #SBATCH --mail-user=xxx@email.com
                      #SBATCH --output=%j.out
                      #SBATCH --error=%j.err
                      
                      srun -l /bin/hostname #you can still write srun <command> in here
                      srun -l pwd
                      
1. submit a parallel job to process different data partitions
                      sbatch /data/jobs/parallel.slurm
                      parallel.slurm
                      #!/bin/bash
                      #SBATCH -N 2 
                      #SBATCH --job-name=cpu-N2-parallel
                      #SBATCH --partition=compute
                      #SBATCH --time=01:00:00
#SBATCH --array=1-4  # define a job array, assuming 4 data shards
                      #SBATCH --ntasks-per-node=1 # run only one task per node
                      #SBATCH --output=process_data_%A_%a.out
                      #SBATCH --error=process_data_%A_%a.err
                      
                      TASK_ID=${SLURM_ARRAY_TASK_ID}
                      
                      DATA_PART="data_part_${TASK_ID}.txt" #make sure you have that file
                      
                      if [ -f ${DATA_PART} ]; then
                          echo "Processing ${DATA_PART} on node $(hostname)"
                          # python process_data.py --input ${DATA_PART}
                      else
                          echo "File ${DATA_PART} does not exist!"
                      fi
                      
                      how to split file
                      split -l 1000 data.txt data_part_ 
                      && mv data_part_aa data_part_1 
                      && mv data_part_ab data_part_2
                      
                  • salloc is used to allocate resources for a job in real time. Typically this is used to allocate resources and spawn a shell. The shell is then used to execute srun commands to launch parallel tasks.

                    • Example
                      1. allocate resources (more like create an virtual machine)
                      salloc -N2 bash
This command creates a job that allocates 2 nodes and spawns a bash shell, in which you can execute srun commands. After your computing task finishes, remember to shut down the job:
      scancel <$job_id>
      When you exit the shell, the resources are released.
                  Aug 7, 2024

                  Configuration Files

                  Aug 7, 2024

                  Subsections of MPI Libs

                  Test Intel MPI Jobs

Using MPI (Message Passing Interface) for parallel computing on a SLURM cluster usually involves the following steps:

1. Install an MPI library

Make sure an MPI library is installed on your cluster nodes. Common MPI implementations include:

• OpenMPI
• Intel MPI
• MPICH

You can check whether MPI is installed with the following commands:
mpicc --version  # check the MPI compiler
mpirun --version # check the MPI runtime environment

2. Test MPI performance

                  mpirun -n 2 IMB-MPI1 pingpong
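To run the same ping-pong across two nodes under Slurm instead of plain mpirun, a sketch like the following can be used; it assumes Slurm was built with PMI2 support and reuses the I_MPI_PMI_LIBRARY path from the job script in step 4:

export I_MPI_PMI=pmi2
export I_MPI_PMI_LIBRARY=/usr/lib/x86_64-linux-gnu/slurm/mpi_pmi2.so
srun --mpi=pmi2 -N2 -n2 IMB-MPI1 pingpong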

3. Write the MPI program

You can compile MPI programs with mpicc (for C) or mpic++ (for C++). For example:

Below is a simple MPI "Hello, World!" example (assumed file name hello_mpi.c) and a vector dot-product example (file name dot_product.c); pick either one:

                  #include <stdio.h>
                  #include <mpi.h>
                  
                  int main(int argc, char *argv[]) {
                      int rank, size;
                      
// initialize the MPI environment
                      MPI_Init(&argc, &argv);
                  
// get this process's rank and the total number of processes
                      MPI_Comm_rank(MPI_COMM_WORLD, &rank);
                      MPI_Comm_size(MPI_COMM_WORLD, &size);
                  
// print this process's information
                      printf("Hello, World! I am process %d out of %d processes.\n", rank, size);
                  
// finalize the MPI environment
                      MPI_Finalize();
                  
                      return 0;
                  }
                  #include <stdio.h>
                  #include <stdlib.h>
                  #include <mpi.h>
                  
#define N 8  // vector length
                  
// compute the local partial dot product
                  double compute_local_dot_product(double *A, double *B, int start, int end) {
                      double local_dot = 0.0;
                      for (int i = start; i < end; i++) {
                          local_dot += A[i] * B[i];
                      }
                      return local_dot;
                  }
                  
                  void print_vector(double *Vector) {
                      for (int i = 0; i < N; i++) {
                          printf("%f ", Vector[i]);   
                      }
                      printf("\n");
                  }
                  
                  int main(int argc, char *argv[]) {
                      int rank, size;
                  
// initialize the MPI environment
                      MPI_Init(&argc, &argv);
                      MPI_Comm_rank(MPI_COMM_WORLD, &rank);
                      MPI_Comm_size(MPI_COMM_WORLD, &size);
                  
// vectors A and B
                      double A[N], B[N];
                  
// process 0 initializes vectors A and B
                      if (rank == 0) {
                          for (int i = 0; i < N; i++) {
A[i] = i + 1;  // sample data
B[i] = (i + 1) * 2;  // sample data
                          }
                      }
                  
// broadcast vectors A and B to all processes
                      MPI_Bcast(A, N, MPI_DOUBLE, 0, MPI_COMM_WORLD);
                      MPI_Bcast(B, N, MPI_DOUBLE, 0, MPI_COMM_WORLD);
                  
// each process computes its own chunk
int local_n = N / size;  // number of elements per process
                      int start = rank * local_n;
                      int end = (rank + 1) * local_n;
                      
// the last process handles any remaining elements (N % size)
                      if (rank == size - 1) {
                          end = N;
                      }
                  
                      double local_dot_product = compute_local_dot_product(A, B, start, end);
                  
// use MPI_Reduce to sum the local dot products on process 0
                      double global_dot_product = 0.0;
                      MPI_Reduce(&local_dot_product, &global_dot_product, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
                  
// process 0 prints the final result
                      if (rank == 0) {
                          printf("Vector A is\n");
                          print_vector(A);
                          printf("Vector B is\n");
                          print_vector(B);
                          printf("Dot Product of A and B: %f\n", global_dot_product);
                      }
                  
// finalize the MPI environment
                      MPI_Finalize();
                      return 0;
                  }

4. Create a Slurm job script

Create a SLURM job script to run the MPI program. Below is a basic SLURM job script, with assumed file name mpi_test.slurm:

                  #!/bin/bash
                  #SBATCH --job-name=mpi_job       # Job name
                  #SBATCH --nodes=2                # Number of nodes to use
                  #SBATCH --ntasks-per-node=1      # Number of tasks per node
                  #SBATCH --time=00:10:00          # Time limit
                  #SBATCH --output=mpi_test_output_%j.log     # Standard output file
                  #SBATCH --error=mpi_test_output_%j.err     # Standard error file
                  
                  # Manually set Intel OneAPI MPI and Compiler environment
                  export I_MPI_PMI=pmi2
                  export I_MPI_PMI_LIBRARY=/usr/lib/x86_64-linux-gnu/slurm/mpi_pmi2.so
                  export I_MPI_ROOT=/opt/intel/oneapi/mpi/2021.14
                  export INTEL_COMPILER_ROOT=/opt/intel/oneapi/compiler/2025.0
                  export PATH=$I_MPI_ROOT/bin:$INTEL_COMPILER_ROOT/bin:$PATH
                  export LD_LIBRARY_PATH=$I_MPI_ROOT/lib:$INTEL_COMPILER_ROOT/lib:$LD_LIBRARY_PATH
                  export MANPATH=$I_MPI_ROOT/man:$INTEL_COMPILER_ROOT/man:$MANPATH
                  
                  # Compile the MPI program
                  icx-cc -I$I_MPI_ROOT/include  hello_mpi.c -o hello_mpi -L$I_MPI_ROOT/lib -lmpi
                  
                  # Run the MPI job
                  
                  mpirun -np 2 ./hello_mpi
                  #!/bin/bash
                  #SBATCH --job-name=mpi_job       # Job name
                  #SBATCH --nodes=2                # Number of nodes to use
                  #SBATCH --ntasks-per-node=1      # Number of tasks per node
                  #SBATCH --time=00:10:00          # Time limit
                  #SBATCH --output=mpi_test_output_%j.log     # Standard output file
                  #SBATCH --error=mpi_test_output_%j.err     # Standard error file
                  
                  # Manually set Intel OneAPI MPI and Compiler environment
                  export I_MPI_PMI=pmi2
                  export I_MPI_PMI_LIBRARY=/usr/lib/x86_64-linux-gnu/slurm/mpi_pmi2.so
                  export I_MPI_ROOT=/opt/intel/oneapi/mpi/2021.14
                  export INTEL_COMPILER_ROOT=/opt/intel/oneapi/compiler/2025.0
                  export PATH=$I_MPI_ROOT/bin:$INTEL_COMPILER_ROOT/bin:$PATH
                  export LD_LIBRARY_PATH=$I_MPI_ROOT/lib:$INTEL_COMPILER_ROOT/lib:$LD_LIBRARY_PATH
                  export MANPATH=$I_MPI_ROOT/man:$INTEL_COMPILER_ROOT/man:$MANPATH
                  
                  # Compile the MPI program
                  icx-cc -I$I_MPI_ROOT/include  dot_product.c -o dot_product -L$I_MPI_ROOT/lib -lmpi
                  
                  # Run the MPI job
                  
                  mpirun -np 2 ./dot_product

5. Compile the MPI program

Before submitting the job, you need to compile the MPI program. On the cluster, use mpicc. Assuming the source is saved as hello_mpi.c (or dot_product.c), compile with:

                  mpicc -o hello_mpi hello_mpi.c
                  mpicc -o dot_product dot_product.c

6. Submit the Slurm job

Save the job script above (mpi_test.slurm) and submit it with:

                  sbatch mpi_test.slurm

7. Check the job status

You can check the status of the job with:

                  squeue -u <your_username>

8. Check the output

After the job completes, the output is saved in the files specified in the job script (e.g. mpi_test_output_<job_id>.log). You can view it with cat or any text editor:

                  cat mpi_test_output_*.log

Example output: if everything works, the output will look similar to:

                  Hello, World! I am process 0 out of 2 processes.
                  Hello, World! I am process 1 out of 2 processes.
Vector A is
1.000000 2.000000 3.000000 4.000000 5.000000 6.000000 7.000000 8.000000 
Vector B is
2.000000 4.000000 6.000000 8.000000 10.000000 12.000000 14.000000 16.000000 
Dot Product of A and B: 408.000000
                  Aug 7, 2024

                  Test Open MPI Jobs

Using MPI (Message Passing Interface) for parallel computing on a SLURM cluster usually involves the following steps:

1. Install an MPI library

Make sure an MPI library is installed on your cluster nodes. Common MPI implementations include:

• OpenMPI
• Intel MPI
• MPICH

You can check whether MPI is installed with the following commands:
mpicc --version  # check the MPI compiler
mpirun --version # check the MPI runtime environment

2. Write the MPI program

You can compile MPI programs with mpicc (for C) or mpic++ (for C++). For example:

Below is a simple MPI "Hello, World!" example (assumed file name hello_mpi.c) and a vector dot-product example (file name dot_product.c); pick either one:

                  #include <stdio.h>
                  #include <mpi.h>
                  
                  int main(int argc, char *argv[]) {
                      int rank, size;
                      
// initialize the MPI environment
                      MPI_Init(&argc, &argv);
                  
// get this process's rank and the total number of processes
                      MPI_Comm_rank(MPI_COMM_WORLD, &rank);
                      MPI_Comm_size(MPI_COMM_WORLD, &size);
                  
// print this process's information
                      printf("Hello, World! I am process %d out of %d processes.\n", rank, size);
                  
// finalize the MPI environment
                      MPI_Finalize();
                  
                      return 0;
                  }
                  #include <stdio.h>
                  #include <stdlib.h>
                  #include <mpi.h>
                  
#define N 8  // vector length
                  
// compute the local partial dot product
                  double compute_local_dot_product(double *A, double *B, int start, int end) {
                      double local_dot = 0.0;
                      for (int i = start; i < end; i++) {
                          local_dot += A[i] * B[i];
                      }
                      return local_dot;
                  }
                  
                  void print_vector(double *Vector) {
                      for (int i = 0; i < N; i++) {
                          printf("%f ", Vector[i]);   
                      }
                      printf("\n");
                  }
                  
                  int main(int argc, char *argv[]) {
                      int rank, size;
                  
// initialize the MPI environment
                      MPI_Init(&argc, &argv);
                      MPI_Comm_rank(MPI_COMM_WORLD, &rank);
                      MPI_Comm_size(MPI_COMM_WORLD, &size);
                  
// vectors A and B
                      double A[N], B[N];
                  
// process 0 initializes vectors A and B
                      if (rank == 0) {
                          for (int i = 0; i < N; i++) {
A[i] = i + 1;  // sample data
B[i] = (i + 1) * 2;  // sample data
                          }
                      }
                  
// broadcast vectors A and B to all processes
                      MPI_Bcast(A, N, MPI_DOUBLE, 0, MPI_COMM_WORLD);
                      MPI_Bcast(B, N, MPI_DOUBLE, 0, MPI_COMM_WORLD);
                  
                      // 每个进程计算自己负责的部分
                      int local_n = N / size;  // 每个进程处理的元素个数
                      int start = rank * local_n;
                      int end = (rank + 1) * local_n;
                      
                      // 如果是最后一个进程,确保处理所有剩余的元素(处理N % size)
                      if (rank == size - 1) {
                          end = N;
                      }
                  
                      double local_dot_product = compute_local_dot_product(A, B, start, end);
                  
                      // 使用MPI_Reduce将所有进程的局部点积结果汇总到进程0
                      double global_dot_product = 0.0;
                      MPI_Reduce(&local_dot_product, &global_dot_product, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
                  
                      // 进程0输出最终结果
                      if (rank == 0) {
                          printf("Vector A is\n");
                          print_vector(A);
                          printf("Vector B is\n");
                          print_vector(B);
                          printf("Dot Product of A and B: %f\n", global_dot_product);
                      }
                  
                      // 结束MPI环境
                      MPI_Finalize();
                      return 0;
                  }

                  3. 创建Slurm作业脚本

                  创建一个SLURM作业脚本来运行该MPI程序。以下是一个基本的SLURM作业脚本,假设文件名为 mpi_test.slurm:

                  #!/bin/bash
                  #SBATCH --job-name=mpi_test                 # 作业名称
                  #SBATCH --nodes=2                           # 请求节点数
                  #SBATCH --ntasks-per-node=1                 # 每个节点上的任务数
                  #SBATCH --time=00:10:00                     # 最大运行时间
                  #SBATCH --output=mpi_test_output_%j.log     # 输出日志文件
                  
                  # 加载MPI模块(如果使用模块化环境)
                  module load openmpi
                  
                  # 运行MPI程序
                  mpirun --allow-run-as-root -np 2 ./hello_mpi
                  #!/bin/bash
                  #SBATCH --job-name=mpi_test                 # 作业名称
                  #SBATCH --nodes=2                           # 请求节点数
                  #SBATCH --ntasks-per-node=1                 # 每个节点上的任务数
                  #SBATCH --time=00:10:00                     # 最大运行时间
                  #SBATCH --output=mpi_test_output_%j.log     # 输出日志文件
                  
                  # 加载MPI模块(如果使用模块化环境)
                  module load openmpi
                  
                  # 运行MPI程序
                  mpirun --allow-run-as-root -np 2 ./dot_product

                  4. 编译MPI程序

在运行作业之前,你需要编译MPI程序。在集群上使用mpicc进行编译,假设你将程序分别保存在 hello_mpi.c 和 dot_product.c 文件中,使用以下命令进行编译:

                  mpicc -o hello_mpi hello_mpi.c
                  mpicc -o dot_product dot_product.c

                  5. 提交Slurm作业

                  保存上述作业脚本(mpi_test.slurm)并使用以下命令提交作业:

                  sbatch mpi_test.slurm

                  6. 查看作业状态

                  你可以使用以下命令查看作业的状态:

                  squeue -u <your_username>
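
如果需要更详细的作业信息,也可以使用下面的命令(<job_id> 为占位符;sacct 需要集群启用 accounting 功能):

# 查看某个作业的详细信息
scontrol show job <job_id>

# 作业结束后查询历史记录(列出的字段仅为示例)
sacct -j <job_id> --format=JobID,JobName,State,Elapsed,ExitCode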

                  7. 检查输出

                  作业完成后,输出将保存在你作业脚本中指定的文件中(例如 mpi_test_output_<job_id>.log)。你可以使用 cat 或任何文本编辑器查看输出:

                  cat mpi_test_output_*.log

示例输出 如果一切正常,hello_mpi 的输出会类似于(两行的先后顺序可能不同):

Hello, World! I am process 0 out of 2 processes.
Hello, World! I am process 1 out of 2 processes.

dot_product 的输出会类似于:

Vector A is
1.000000 2.000000 3.000000 4.000000 5.000000 6.000000 7.000000 8.000000 
Vector B is
2.000000 4.000000 6.000000 8.000000 10.000000 12.000000 14.000000 16.000000 
Dot Product of A and B: 408.000000
                  Aug 7, 2024

                  🗃️Usage Notes

                  Aug 7, 2024

                  Subsections of 🗃️Usage Notes

                  Subsections of Application

                  有状态or无状态应用

                  对应用“有状态”和“无状态”的清晰界定,直接决定了它在Kubernetes中的部署方式、资源类型和运维复杂度。


                  一、核心定义

                  1. 无状态应用

                  定义:应用实例不负责保存每次请求所需的上下文或数据状态。任何一个请求都可以被任何一个实例处理,且处理结果完全一致。

                  关键特征

                  • 请求自包含:每个请求包含了处理它所需的所有信息(如认证Token、Session ID、操作数据等)。
                  • 实例可替代:任何一个实例都是完全相同、可以随时被创建或销毁的。销毁一个实例不会丢失任何数据。
                  • 无本地持久化:实例的本地磁盘不被用于保存需要持久化的数据。即使有临时数据,实例销毁后也无需关心。
                  • 水平扩展容易:因为实例完全相同,所以直接增加实例数量就能实现扩展,非常简单。

                  典型例子

                  • Web前端服务器:如Nginx, Apache。
                  • API网关:如Kong, Tyk。
                  • JWT令牌验证服务
                  • 无状态计算服务:如图片转换、数据格式转换等。输入和输出都在请求中完成。

                  一个生动的比喻快餐店的收银员。 任何一个收银员都可以为你服务,你点餐(请求),他处理,完成后交易结束。他不需要记住你上次点了什么(状态),你下次来可以去任何一个窗口。

                  2. 有状态应用

                  定义:应用实例需要保存和维护特定的状态数据。后续请求的处理依赖于之前请求保存的状态,或者会改变这个状态。

                  关键特征

                  • 状态依赖性:请求的处理结果依赖于该实例上保存的特定状态(如用户会话、数据库中的记录、缓存数据等)。
                  • 实例唯一性:每个实例都是独特的,有唯一的身份标识(如ID、主机名)。不能随意替换。
                  • 需要持久化存储:实例的状态必须被保存在持久化存储中,并且即使实例重启、迁移或重建,这个存储也必须能被重新挂载和访问。
                  • 水平扩展复杂:扩展时需要谨慎处理数据分片、副本同步、身份识别等问题。

                  典型例子

                  • 数据库:MySQL, PostgreSQL, MongoDB, Redis。
                  • 消息队列:Kafka, RabbitMQ。
                  • 有状态中间件:如Etcd, Zookeeper。
                  • 用户会话服务器:将用户Session保存在本地内存或磁盘的应用。

                  一个生动的比喻银行的客户经理。 你有一个指定的客户经理(特定实例),他了解你的所有财务历史和需求(状态)。如果你换了一个新经理,他需要花时间从头了解你的情况,而且可能无法立即获得你所有的历史文件(数据)。


                  二、在Kubernetes中的关键差异

                  这个界定在K8s中至关重要,因为它决定了你使用哪种工作负载资源。

| 特性 | 无状态应用 | 有状态应用 |
| --- | --- | --- |
| 核心K8s资源 | Deployment | StatefulSet |
| Pod身份 | 完全可互换,无唯一标识。名字是随机的(如 app-7c8b5f6d9-abcde)。 | 有稳定、唯一的标识符,按顺序生成(如 mysql-0, mysql-1, mysql-2)。 |
| 启动/终止顺序 | 并行,无顺序。 | 有序部署(从0到N-1),有序扩缩容(从N-1到0),有序滚动更新。 |
| 网络标识 | 不稳定的Pod IP。通过Service负载均衡访问。 | 稳定的网络标识。每个Pod会有一个稳定的DNS记录:<pod-name>.<svc-name>.<namespace>.svc.cluster.local |
| 存储 | 使用PersistentVolumeClaim模板,所有Pod共享同一个PVC或各自使用独立的、无关联的PVC。 | 使用稳定的、专用的存储。每个Pod根据它的身份标识,挂载一个独立的PVC(如 mysql-0 -> pvc-mysql-0)。 |
| 数据持久性 | Pod被删除,其关联的PVC通常也会被删除(取决于回收策略)。 | Pod即使被调度到其他节点,也能通过稳定标识重新挂载到属于它的那块持久化数据。 |
| 典型场景 | Web服务器、微服务、API | 数据库、消息队列、集群化应用(如Zookeeper) |

                  三、一个常见的误区:“看似无状态,实则有状态”

                  有些应用初看像无状态,但深究起来其实是有状态的。

                  • 误区:一个将用户Session保存在本地内存的Web应用。
                    • 看似:它是一个Web服务,可以通过Deployment部署多个副本。
                    • 实则:如果用户第一次请求被pod-a处理,Session保存在了pod-a的内存中。下次请求如果被负载均衡到pod-bpod-b无法获取到该用户的Session,导致用户需要重新登录。
                    • 解决方案
                      1. 改造为无状态:将Session数据外移到集中式的Redis或数据库中。
                      2. 承认其有状态:使用StatefulSet,并配合Session亲和性,确保同一用户的请求总是被发到同一个Pod实例上。

                  总结

                  如何界定一个应用是有状态还是无状态?

                  问自己这几个问题:

                  1. 这个应用的实例能被随意杀死并立即创建一个新的替代吗? 替代者能无缝接管所有工作吗?
  • 能 -> 无状态
                    • 不能 -> 有状态
                  2. 应用的多个实例是完全相同的吗? 增加一个实例需要复制数据吗?
                    • 是,不需要 -> 无状态
                    • 否,需要 -> 有状态
                  3. 处理请求是否需要依赖实例本地(内存/磁盘)的、非临时性的数据?
  • 不需要 -> 无状态
  • 需要 -> 有状态

                  理解这个界定,是正确设计和部署云原生应用的基石。在K8s中,对于无状态应用,请首选 Deployment;对于有状态应用,请务必使用 StatefulSet
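
下面给出一个极简的对比示意(假设使用 nginx 镜像,且集群中已存在名为 db 的 Headless Service;所有名称和字段仅作演示,并非生产配置):

# 一次性声明一个 Deployment 和一个 StatefulSet,对比两种资源的写法
kubectl apply -f - <<'EOF'
apiVersion: apps/v1
kind: Deployment            # 无状态应用:副本完全等价、可随意替换
metadata:
  name: web
spec:
  replicas: 3
  selector:
    matchLabels: {app: web}
  template:
    metadata:
      labels: {app: web}
    spec:
      containers:
      - name: nginx
        image: nginx:1.25
---
apiVersion: apps/v1
kind: StatefulSet           # 有状态应用:稳定的名字 + 每个 Pod 独立的 PVC
metadata:
  name: db
spec:
  serviceName: db           # 假设集群中已存在名为 db 的 Headless Service
  replicas: 3
  selector:
    matchLabels: {app: db}
  template:
    metadata:
      labels: {app: db}
    spec:
      containers:
      - name: db
        image: nginx:1.25   # 仅作演示,实际应替换为数据库等有状态服务的镜像
        volumeMounts:
        - name: data
          mountPath: /data
  volumeClaimTemplates:
  - metadata:
      name: data
    spec:
      accessModes: ["ReadWriteOnce"]
      resources:
        requests:
          storage: 1Gi
EOF

其中 StatefulSet 的 volumeClaimTemplates 会为 db-0、db-1、db-2 各自创建独立的 PVC,这正是上面表格中“存储”一行所描述的差异。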

                  Mar 7, 2025

                  Subsections of Building Tool

                  Maven

                  1. build from submodule

You don't need to build from the root of the project.

                  ./mvnw clean package -DskipTests  -rf :<$submodule-name>

You can find the <$submodule-name> in the submodule's pom.xml:

                  <project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
                  		xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/maven-v4_0_0.xsd">
                  
                  	<modelVersion>4.0.0</modelVersion>
                  
                  	<parent>
                  		<groupId>org.apache.flink</groupId>
                  		<artifactId>flink-formats</artifactId>
                  		<version>1.20-SNAPSHOT</version>
                  	</parent>
                  
                  	<artifactId>flink-avro</artifactId>
                  	<name>Flink : Formats : Avro</name>

                  Then you can modify the command as

                  ./mvnw clean package -DskipTests  -rf :flink-avro
                  The result will look like this
                  [WARNING] For this reason, future Maven versions might no longer support building such malformed projects.
                  [WARNING] 
                  [INFO] ------------------------------------------------------------------------
                  [INFO] Detecting the operating system and CPU architecture
                  [INFO] ------------------------------------------------------------------------
                  [INFO] os.detected.name: linux
                  [INFO] os.detected.arch: x86_64
                  [INFO] os.detected.bitness: 64
                  [INFO] os.detected.version: 6.7
                  [INFO] os.detected.version.major: 6
                  [INFO] os.detected.version.minor: 7
                  [INFO] os.detected.release: fedora
                  [INFO] os.detected.release.version: 38
                  [INFO] os.detected.release.like.fedora: true
                  [INFO] os.detected.classifier: linux-x86_64
                  [INFO] ------------------------------------------------------------------------
                  [INFO] Reactor Build Order:
                  [INFO] 
                  [INFO] Flink : Formats : Avro                                             [jar]
                  [INFO] Flink : Formats : SQL Avro                                         [jar]
                  [INFO] Flink : Formats : Parquet                                          [jar]
                  [INFO] Flink : Formats : SQL Parquet                                      [jar]
                  [INFO] Flink : Formats : Orc                                              [jar]
                  [INFO] Flink : Formats : SQL Orc                                          [jar]
                  [INFO] Flink : Python                                                     [jar]
                  ...

                  Normally, build Flink will start from module flink-parent

                  2. skip some other test

                  For example, you can skip RAT test by doing this:

                  ./mvnw clean package -DskipTests '-Drat.skip=true'
                  Mar 11, 2024

                  Gradle

                  1. spotless

Keep your code spotless; see more details at https://github.com/diffplug/spotless

see how to configure it

There are several files that need to be configured.

                  1. settings.gradle.kts
                  plugins {
                      id("org.gradle.toolchains.foojay-resolver-convention") version "0.7.0"
                  }
2. build.gradle.kts
                  plugins {
                      id("com.diffplug.spotless") version "6.23.3"
                  }
                  configure<com.diffplug.gradle.spotless.SpotlessExtension> {
                      kotlinGradle {
                          target("**/*.kts")
                          ktlint()
                      }
                      java {
                          target("**/*.java")
                          googleJavaFormat()
                              .reflowLongStrings()
                              .skipJavadocFormatting()
                              .reorderImports(false)
                      }
                      yaml {
                          target("**/*.yaml")
                          jackson()
                              .feature("ORDER_MAP_ENTRIES_BY_KEYS", true)
                      }
                      json {
                          target("**/*.json")
                          targetExclude(".vscode/settings.json")
                          jackson()
                              .feature("ORDER_MAP_ENTRIES_BY_KEYS", true)
                      }
                  }

And then, you can execute the following command to format your code.

                  ./gradlew spotlessApply
                  ./mvnw spotless:apply

                  2. shadowJar

ShadowJar can combine a project’s dependency classes and resources into a single jar; check https://imperceptiblethoughts.com/shadow/

see how to configure it

You need to modify your build.gradle.kts

                  import com.github.jengelman.gradle.plugins.shadow.tasks.ShadowJar
                  
                  plugins {
                      java // Optional 
                      id("com.github.johnrengelman.shadow") version "8.1.1"
                  }
                  
                  tasks.named<ShadowJar>("shadowJar") {
                      archiveBaseName.set("connector-shadow")
                      archiveVersion.set("1.0")
                      archiveClassifier.set("")
                      manifest {
                          attributes(mapOf("Main-Class" to "com.example.xxxxx.Main"))
                      }
                  }
                  ./gradlew shadowJar

                  3. check dependency

List your project’s dependencies in a tree view.

see how to configure it

You need to modify your build.gradle.kts

                  configurations {
                      compileClasspath
                  }
                  ./gradlew dependencies --configuration compileClasspath
                  ./gradlew :<$module_name>:dependencies --configuration compileClasspath
                  Check Potential Result

                  result will look like this

                  compileClasspath - Compile classpath for source set 'main'.
                  +--- org.projectlombok:lombok:1.18.22
                  +--- org.apache.flink:flink-hadoop-fs:1.17.1
                  |    \--- org.apache.flink:flink-core:1.17.1
                  |         +--- org.apache.flink:flink-annotations:1.17.1
                  |         |    \--- com.google.code.findbugs:jsr305:1.3.9 -> 3.0.2
                  |         +--- org.apache.flink:flink-metrics-core:1.17.1
                  |         |    \--- org.apache.flink:flink-annotations:1.17.1 (*)
                  |         +--- org.apache.flink:flink-shaded-asm-9:9.3-16.1
                  |         +--- org.apache.flink:flink-shaded-jackson:2.13.4-16.1
                  |         +--- org.apache.commons:commons-lang3:3.12.0
                  |         +--- org.apache.commons:commons-text:1.10.0
                  |         |    \--- org.apache.commons:commons-lang3:3.12.0
                  |         +--- commons-collections:commons-collections:3.2.2
                  |         +--- org.apache.commons:commons-compress:1.21 -> 1.24.0
                  |         +--- org.apache.flink:flink-shaded-guava:30.1.1-jre-16.1
                  |         \--- com.google.code.findbugs:jsr305:1.3.9 -> 3.0.2
                  ...
                  Mar 7, 2024

                  CICD

                  Articles

FAQ

Q1: difference between docker / podman / buildah

                    You can add standard markdown syntax:

                    • multiple paragraphs
                    • bullet point lists
                    • emphasized, bold and even bold emphasized text
                    • links
                    • etc.
                    ...and even source code

                    the possibilities are endless (almost - including other shortcodes may or may not work)

                    Mar 7, 2025

                    Container

                    Articles

FAQ

Q1: difference between docker / podman / buildah

                    You can add standard markdown syntax:

                    • multiple paragraphs
                    • bullet point lists
                    • emphasized, bold and even bold emphasized text
                    • links
                    • etc.
                    ...and even source code

                    the possibilities are endless (almost - including other shortcodes may or may not work)

                    Mar 7, 2025

                    Subsections of Container

                    Build Smaller Image

                    减小 Dockerfile 生成镜像体积的方法

                    1. 选择更小的基础镜像

                    # ❌ 避免使用完整版本
                    FROM ubuntu:latest
                    
                    # ✅ 使用精简版本
                    FROM alpine:3.18
                    FROM python:3.11-slim
                    FROM node:18-alpine

                    2. 使用多阶段构建 (Multi-stage Build)

                    这是最有效的方法之一:

                    # 构建阶段
                    FROM golang:1.21 AS builder
                    WORKDIR /app
                    COPY . .
                    RUN go build -o myapp
                    
                    # 运行阶段 - 只复制必要文件
                    FROM alpine:3.18
                    WORKDIR /app
                    COPY --from=builder /app/myapp .
                    CMD ["./myapp"]

                    3. 合并 RUN 指令

                    每个 RUN 命令都会创建一个新层:

                    # ❌ 多层,体积大
                    RUN apt-get update
                    RUN apt-get install -y package1
                    RUN apt-get install -y package2
                    
                    # ✅ 单层,并清理缓存
                    RUN apt-get update && \
                        apt-get install -y package1 package2 && \
                        apt-get clean && \
                        rm -rf /var/lib/apt/lists/*

                    4. 清理不必要的文件

                    RUN apt-get update && \
                        apt-get install -y build-essential && \
                        # 构建操作... && \
                        apt-get purge -y build-essential && \
                        apt-get autoremove -y && \
                        rm -rf /var/lib/apt/lists/* /tmp/* /var/tmp/*

                    5. 使用 .dockerignore 文件

                    # .dockerignore
                    node_modules
                    .git
                    *.md
                    .env
                    test/

                    6. 只复制必要的文件

                    # ❌ 复制所有内容
                    COPY . .
                    
                    # ✅ 只复制需要的文件
                    COPY package.json package-lock.json ./
                    RUN npm ci --only=production
                    COPY src/ ./src/

                    7. 移除调试工具和文档

                    RUN apk add --no-cache python3 && \
                        rm -rf /usr/share/doc /usr/share/man

                    8. 压缩和优化层

                    # 在单个 RUN 中完成所有操作
                    RUN set -ex && \
                        apk add --no-cache --virtual .build-deps gcc musl-dev && \
                        pip install --no-cache-dir -r requirements.txt && \
                        apk del .build-deps

                    9. 使用专门的工具

                    • dive: 分析镜像层
                      dive your-image:tag
                    • docker-slim: 自动精简镜像
                      docker-slim build your-image:tag

                    实际案例对比

                    优化前 (1.2GB):

                    FROM ubuntu:20.04
                    RUN apt-get update
                    RUN apt-get install -y python3 python3-pip
                    COPY . /app
                    WORKDIR /app
                    RUN pip3 install -r requirements.txt
                    CMD ["python3", "app.py"]

                    优化后 (50MB):

                    FROM python:3.11-alpine AS builder
                    WORKDIR /app
                    COPY requirements.txt .
                    RUN pip install --no-cache-dir --user -r requirements.txt
                    
                    FROM python:3.11-alpine
                    WORKDIR /app
                    COPY --from=builder /root/.local /root/.local
                    COPY app.py .
                    ENV PATH=/root/.local/bin:$PATH
                    CMD ["python", "app.py"]

                    关键要点总结

                    ✅ 使用 Alpine 或 slim 镜像
                    ✅ 采用多阶段构建
                    ✅ 合并命令并清理缓存
                    ✅ 配置 .dockerignore
                    ✅ 只安装生产环境依赖
                    ✅ 删除构建工具和临时文件

                    通过这些方法,镜像体积通常可以减少 60-90%!
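
优化前后可以用下面的命令对比镜像体积、定位偏大的层(假设镜像标签为 my-app:before 与 my-app:after,仅为示意):

# 对比两个标签的镜像体积
docker images my-app

# 逐层查看镜像大小,找出需要优化的层
docker history my-app:after

# (可选)使用 dive 交互式分析每一层的文件变化
dive my-app:after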

                    Mar 7, 2024

                    Network Mode

                    Docker的网络模式决定了容器如何与宿主机、其他容器以及外部网络进行通信。

                    Docker主要提供了以下五种网络模式,默认创建的是 bridge 模式。


                    1. Bridge 模式

                    这是 默认 的网络模式。当你创建一个容器而不指定网络时,它就会连接到这个默认的 bridge 网络(名为 bridge)。

                    • 工作原理:Docker守护进程会创建一个名为 docker0 的虚拟网桥,它相当于一个虚拟交换机。所有使用该模式的容器都会通过一个虚拟网卡(veth pair)连接到这个网桥上。Docker会为每个容器分配一个IP地址,并配置其网关为 docker0 的地址。
                    • 通信方式
                      • 容器间通信:在同一个自定义桥接网络下的容器,可以通过容器名(Container Name)直接通信(Docker内嵌了DNS)。但在默认的 bridge 网络下,容器只能通过IP地址通信。
                      • 访问外部网络:容器数据包通过 docker0 网桥,再经过宿主机的IPtables进行NAT转换,使用宿主机的IP访问外网。
                      • 从外部访问容器:需要做端口映射,例如 -p 8080:80,将宿主机的8080端口映射到容器的80端口。

                    优劣分析

                    • 优点
                      • 隔离性:容器拥有独立的网络命名空间,与宿主机和其他网络隔离,安全性较好。
                      • 端口管理灵活:通过端口映射,可以灵活地管理哪些宿主机端口暴露给外部。
                      • 通用性:是最常用、最通用的模式,适合大多数应用场景。
                    • 缺点
                      • 性能开销:相比 host 模式,多了一层网络桥接和NAT,性能有轻微损失。
                      • 复杂度:在默认桥接网络中,容器间通信需要使用IP,不如自定义网络方便。

                    使用场景:绝大多数需要网络隔离的独立应用,例如Web后端服务、数据库等。

                    命令示例

                    # 使用默认bridge网络(不推荐用于多容器应用)
                    docker run -d --name my-app -p 8080:80 nginx
                    
                    # 创建自定义bridge网络(推荐)
                    docker network create my-network
                    docker run -d --name app1 --network my-network my-app
                    docker run -d --name app2 --network my-network another-app
                    # 现在 app1 和 app2 可以通过容器名直接互相访问

                    2. Host 模式

                    在这种模式下,容器不会虚拟出自己的网卡,也不会分配独立的IP,而是直接使用宿主机的IP和端口

                    • 工作原理:容器与宿主机共享同一个Network Namespace。

                    优劣分析

                    • 优点
                      • 高性能:由于没有NAT和网桥开销,网络性能最高,几乎与宿主机原生网络一致。
                      • 简单:无需进行复杂的端口映射,容器内使用的端口就是宿主机上的端口。
                    • 缺点
                      • 安全性差:容器没有网络隔离,可以直接操作宿主机的网络。
                      • 端口冲突:容器使用的端口如果与宿主机服务冲突,会导致容器无法启动。
                      • 灵活性差:无法在同一台宿主机上运行多个使用相同端口的容器。

                    使用场景:对网络性能要求极高的场景,例如负载均衡器、高频交易系统等。在生产环境中需谨慎使用

                    命令示例

                    docker run -d --name my-app --network host nginx
                    # 此时,直接访问 http://<宿主机IP>:80 即可访问容器中的Nginx

                    3. None 模式

                    在这种模式下,容器拥有自己独立的网络命名空间,但不进行任何网络配置。容器内部只有回环地址 127.0.0.1

                    • 工作原理:容器完全与世隔绝。

                    优劣分析

                    • 优点
                      • 绝对隔离:安全性最高,容器完全无法进行任何网络通信。
                    • 缺点
                      • 无法联网:容器无法与宿主机、其他容器或外部网络通信。

                    使用场景

                    1. 需要完全离线处理的批处理任务。
                    2. 用户打算使用自定义网络驱动(或手动配置)来完全自定义容器的网络栈。

                    命令示例

                    docker run -d --name my-app --network none alpine
                    # 进入容器后,使用 `ip addr` 查看,只能看到 lo 网卡

                    4. Container 模式

                    这种模式下,新创建的容器不会创建自己的网卡和IP,而是与一个已经存在的容器共享一个Network Namespace。通俗讲,就是两个容器在同一个网络环境下,看到的IP和端口是一样的。

                    • 工作原理:新容器复用指定容器的网络栈。

                    优劣分析

                    • 优点
                      • 高效通信:容器间通信直接通过本地回环地址 127.0.0.1,效率极高。
                      • 共享网络视图:可以方便地为一个主容器(如Web服务器)搭配一个辅助容器(如日志收集器),它们看到的网络环境完全一致。
                    • 缺点
                      • 紧密耦合:两个容器的生命周期和网络配置紧密绑定,缺乏灵活性。
                      • 隔离性差:共享网络命名空间,存在一定的安全风险。

                    使用场景:Kubernetes中的"边车"模式,例如一个Pod内的主容器和日志代理容器。

                    命令示例

                    docker run -d --name main-container nginx
                    docker run -d --name helper-container --network container:main-container busybox
                    # 此时,helper-container 中访问 127.0.0.1:80 就是在访问 main-container 的Nginx服务

                    5. Overlay 模式

                    这是为了实现 跨主机的容器通信 而设计的,是Docker Swarm和Kubernetes等容器编排系统的核心网络方案。

                    • 工作原理:它会在多个Docker宿主机之间创建一个虚拟的分布式网络(Overlay Network),通过VXLAN等隧道技术,让不同宿主机上的容器感觉像是在同一个大的局域网内。

                    优劣分析

                    • 优点
                      • 跨节点通信:解决了集群环境下容器间通信的根本问题。
                      • 安全:支持网络加密。
                    • 缺点
                      • 配置复杂:需要额外的Key-Value存储(如Consul、Etcd)来同步网络状态(Docker Swarm模式内置了此功能)。
                      • 性能开销:数据包需要封装和解封装,有一定性能损耗,但现代硬件上通常可以接受。

                    使用场景:Docker Swarm集群、Kubernetes集群等分布式应用环境。

                    命令示例(在Swarm模式下):

                    # 初始化Swarm
                    docker swarm init
                    
                    # 创建Overlay网络
                    docker network create -d overlay my-overlay-net
                    
                    # 在Overlay网络中创建服务
                    docker service create --name web --network my-overlay-net -p 80:80 nginx

                    总结对比

| 网络模式 | 特点 | 适用场景 |
| --- | --- | --- |
| Bridge(默认) | 隔离性良好 | 通用场景,单机多容器应用 |
| Host | 性能最高,无网络隔离 | 对性能要求极致,不介意端口冲突 |
| None | 隔离性最高,无网络 | 离线任务,完全自定义网络 |
| Container | 与指定容器共享网络栈 | 容器紧密协作(如边车模式) |
| Overlay | 跨主机通信,隔离性良好 | 集群场景,跨主机容器通信 |

                    最佳实践建议

                    1. 单机应用:优先使用 自定义的Bridge网络,它比默认Bridge网络提供了更好的DNS服务发现功能,方便容器间通过名称通信。
                    2. 集群应用:必须使用 Overlay网络
                    3. 性能极致追求:在确认端口安全和无冲突的前提下,可考虑 Host模式
                    4. 安全隔离:对于无需网络的容器,使用 None模式
                    5. 避免在生产环境大量使用默认的bridge网络和container模式,因为它们分别在DNS发现和容器耦合度上存在不足。
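
可以用下面的命令查看宿主机上已有的 Docker 网络,以及某个容器实际使用的网络模式(<container> 为占位符):

# 列出宿主机上的 Docker 网络
docker network ls

# 查看默认 bridge 网络的子网、网关和已接入的容器
docker network inspect bridge

# 查看某个容器使用的网络模式
docker inspect -f '{{.HostConfig.NetworkMode}}' <container>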
                    Mar 7, 2024

                    Container Principle

                    Linux 容器技术的基础原理

                    容器技术是现代云原生架构的基石。让我深入讲解 Linux 容器的核心原理。


                    🎯 容器的本质

                    容器 = 特殊的进程

                    容器不是虚拟机,它本质上是宿主机上的一个受限进程,通过 Linux 内核的三大核心技术实现隔离和资源限制:

                    ┌─────────────────────────────────────────┐
                    │         Linux 容器三大支柱               │
                    ├─────────────────────────────────────────┤
                    │  1. Namespace (命名空间) - 隔离         │
                    │  2. Cgroups (控制组) - 资源限制         │
                    │  3. Union FS (联合文件系统) - 镜像分层   │
                    └─────────────────────────────────────────┘

                    🔒 Namespace (命名空间) - 视图隔离

                    Namespace 是 Linux 内核提供的一种资源隔离机制,让进程只能看到属于自己的资源。

                    七种 Namespace

| Namespace | 隔离内容 | 内核版本 | 示例 |
| --- | --- | --- | --- |
| PID | 进程 ID | 2.6.24 | 容器内 PID 1 = 宿主机 PID 12345 |
| Network | 网络栈 | 2.6.29 | 独立的 IP、端口、路由表 |
| Mount | 文件系统挂载点 | 2.4.19 | 独立的根目录 |
| UTS | 主机名和域名 | 2.6.19 | 容器有自己的 hostname |
| IPC | 进程间通信 | 2.6.19 | 消息队列、信号量、共享内存 |
| User | 用户和组 ID | 3.8 | 容器内 root ≠ 宿主机 root |
| Cgroup | Cgroup 根目录 | 4.6 | 隔离 cgroup 视图 |
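
在宿主机上可以直接观察进程所属的 namespace,例如(lsns 由 util-linux 提供):

# 列出当前系统中的 namespace 及其关联进程
lsns

# 查看某个进程(这里以当前 shell 为例)所属的各类 namespace
ls -l /proc/$$/ns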

                    1️⃣ PID Namespace (进程隔离)

                    原理

                    每个容器有独立的进程树,容器内看不到宿主机或其他容器的进程。

                    演示

                    # 在宿主机上查看进程
                    ps aux | grep nginx
                    # root  12345  nginx: master process
                    
                    # 进入容器
                    docker exec -it my-container bash
                    
                    # 在容器内查看进程
                    ps aux
                    # PID   USER     COMMAND
                    # 1     root     nginx: master process  ← 容器内看到的 PID 是 1
                    # 25    root     nginx: worker process
                    
                    # 实际上宿主机上这个进程的真实 PID 是 12345

                    手动创建 PID Namespace

                    // C 代码示例
#define _GNU_SOURCE
#include <sched.h>
#include <signal.h>   // SIGCHLD 需要此头文件
#include <stdio.h>
#include <unistd.h>
#include <sys/wait.h>
                    
                    int child_func(void* arg) {
                        printf("Child PID: %d\n", getpid());  // 输出: 1
                        sleep(100);
                        return 0;
                    }
                    
                    int main() {
                        printf("Parent PID: %d\n", getpid());  // 输出: 真实 PID
                        
                        // 创建新的 PID namespace
                        char stack[1024*1024];
                        int flags = CLONE_NEWPID;
                        
                        pid_t pid = clone(child_func, stack + sizeof(stack), flags | SIGCHLD, NULL);
                        waitpid(pid, NULL, 0);
                        return 0;
                    }

                    核心特点

                    • 容器内第一个进程 PID = 1 (init 进程)
                    • 父进程(宿主机)可以看到子进程的真实 PID
                    • 子进程(容器)看不到父进程和其他容器的进程

                    2️⃣ Network Namespace (网络隔离)

                    原理

                    每个容器有独立的网络栈:独立的 IP、端口、路由表、防火墙规则。

                    架构图

                    宿主机网络栈
                    ├─ eth0 (物理网卡)
                    ├─ docker0 (网桥)
                    └─ veth pairs (虚拟网卡对)
                        ├─ vethXXX (宿主机端) ←→ eth0 (容器端)
                        └─ vethYYY (宿主机端) ←→ eth0 (容器端)

                    演示

                    # 创建新的 network namespace
                    ip netns add myns
                    
                    # 列出所有 namespace
                    ip netns list
                    
                    # 在新 namespace 中执行命令
                    ip netns exec myns ip addr
                    # 输出: 只有 loopback,没有 eth0
                    
                    # 创建 veth pair (虚拟网卡对)
                    ip link add veth0 type veth peer name veth1
                    
                    # 将 veth1 移到新 namespace
                    ip link set veth1 netns myns
                    
                    # 配置 IP
                    ip addr add 192.168.1.1/24 dev veth0
                    ip netns exec myns ip addr add 192.168.1.2/24 dev veth1
                    
                    # 启动网卡
                    ip link set veth0 up
                    ip netns exec myns ip link set veth1 up
                    ip netns exec myns ip link set lo up
                    
                    # 测试连通性
                    ping 192.168.1.2

                    容器网络模式

                    Bridge 模式(默认)

                    Container A                Container B
                        │                          │
                      [eth0]                    [eth0]
                        │                          │
                     vethA ←─────┬─────────→ vethB
                                 │
                            [docker0 网桥]
                                 │
                             [iptables NAT]
                                 │
                             [宿主机 eth0]
                                 │
                              外部网络

                    Host 模式

                    Container
                        │
                        └─ 直接使用宿主机网络栈 (没有网络隔离)

                    3️⃣ Mount Namespace (文件系统隔离)

                    原理

                    每个容器有独立的挂载点视图,看到不同的文件系统树。

                    演示

                    # 创建隔离的挂载环境
                    unshare --mount /bin/bash
                    
                    # 在新 namespace 中挂载
                    mount -t tmpfs tmpfs /tmp
                    
                    # 查看挂载点
                    mount | grep tmpfs
                    # 这个挂载只在当前 namespace 可见
                    
                    # 退出后,宿主机看不到这个挂载
                    exit
                    mount | grep tmpfs  # 找不到

                    容器的根文件系统

                    # Docker 使用 chroot + pivot_root 切换根目录
                    # 容器内 / 实际是宿主机的某个目录
                    
                    # 查看容器的根文件系统位置
                    docker inspect my-container | grep MergedDir
                    # "MergedDir": "/var/lib/docker/overlay2/xxx/merged"
                    
                    # 在宿主机上访问容器文件系统
                    ls /var/lib/docker/overlay2/xxx/merged
                    # bin  boot  dev  etc  home  lib  ...

                    4️⃣ UTS Namespace (主机名隔离)

                    演示

                    # 在宿主机
                    hostname
                    # host-machine
                    
                    # 创建新 UTS namespace
                    unshare --uts /bin/bash
                    
                    # 修改主机名
                    hostname my-container
                    
                    # 查看主机名
                    hostname
                    # my-container
                    
                    # 退出后,宿主机主机名不变
                    exit
                    hostname
                    # host-machine

                    5️⃣ IPC Namespace (进程间通信隔离)

                    原理

                    隔离 System V IPC 和 POSIX 消息队列。

                    演示

                    # 在宿主机创建消息队列
                    ipcmk -Q
                    # Message queue id: 0
                    
                    # 查看消息队列
                    ipcs -q
                    # ------ Message Queues --------
                    # key        msqid      owner
                    # 0x52020055 0          root
                    
                    # 进入容器
                    docker exec -it my-container bash
                    
                    # 在容器内查看消息队列
                    ipcs -q
                    # ------ Message Queues --------
                    # (空,看不到宿主机的消息队列)

                    6️⃣ User Namespace (用户隔离)

                    原理

                    容器内的 root 用户可以映射到宿主机的普通用户,增强安全性。

                    配置示例

# 启用 User Namespace:--userns-remap 是 dockerd 的守护进程参数,
# 需在 /etc/docker/daemon.json 中配置 "userns-remap": "default" 并重启 Docker,
# 之后正常启动容器即可
docker run -it ubuntu bash
                    
                    # 容器内
                    whoami
                    # root
                    
                    id
                    # uid=0(root) gid=0(root) groups=0(root)
                    
                    # 但在宿主机上,这个进程实际运行在普通用户下
                    ps aux | grep bash
                    # 100000  12345  bash  ← UID 100000,不是 root

                    UID 映射配置

                    # /etc/subuid 和 /etc/subgid
                    cat /etc/subuid
                    # dockremap:100000:65536
                    # 表示将容器内的 UID 0-65535 映射到宿主机的 100000-165535

                    📊 Cgroups (Control Groups) - 资源限制

                    Cgroups 用于限制、记录、隔离进程组的资源使用(CPU、内存、磁盘 I/O 等)。

                    Cgroups 子系统

| 子系统 | 功能 | 示例 |
| --- | --- | --- |
| cpu | 限制 CPU 使用率 | 容器最多用 50% CPU |
| cpuset | 绑定特定 CPU 核心 | 容器只能用 CPU 0-3 |
| memory | 限制内存使用 | 容器最多用 512MB 内存 |
| blkio | 限制块设备 I/O | 容器磁盘读写 100MB/s |
| devices | 控制设备访问 | 容器不能访问 /dev/sda |
| net_cls | 网络流量分类 | 为容器流量打标签 |
| pids | 限制进程数量 | 容器最多创建 100 个进程 |
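
可以用下面的命令确认内核启用了哪些 cgroup 子系统及其挂载位置(输出因发行版而异):

# 查看内核支持并启用的 cgroup 子系统
cat /proc/cgroups

# 查看 cgroup 文件系统的挂载点
mount | grep cgroup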

                    CPU 限制

                    原理

                    使用 CFS (Completely Fair Scheduler) 调度器限制 CPU 时间。

                    关键参数

                    cpu.cfs_period_us  # 周期时间(默认 100ms = 100000us)
                    cpu.cfs_quota_us   # 配额时间
                    
                    # CPU 使用率 = quota / period
                    # 例如: 50000 / 100000 = 50% CPU

                    Docker 示例

                    # 限制容器使用 0.5 个 CPU 核心
                    docker run --cpus=0.5 nginx
                    
                    # 等价于
                    docker run --cpu-period=100000 --cpu-quota=50000 nginx
                    
                    # 查看 cgroup 配置
                    cat /sys/fs/cgroup/cpu/docker/<container-id>/cpu.cfs_quota_us
                    # 50000

                    手动配置 Cgroups

                    # 创建 cgroup
                    mkdir -p /sys/fs/cgroup/cpu/mycontainer
                    
                    # 设置 CPU 限制为 50%
                    echo 50000 > /sys/fs/cgroup/cpu/mycontainer/cpu.cfs_quota_us
                    echo 100000 > /sys/fs/cgroup/cpu/mycontainer/cpu.cfs_period_us
                    
                    # 将进程加入 cgroup
                    echo $$ > /sys/fs/cgroup/cpu/mycontainer/cgroup.procs
                    
                    # 运行 CPU 密集任务
                    yes > /dev/null &
                    
                    # 在另一个终端查看 CPU 使用率
                    top -p $(pgrep yes)
                    # CPU 使用率被限制在 50% 左右

                    内存限制

                    关键参数

                    memory.limit_in_bytes        # 硬限制
                    memory.soft_limit_in_bytes   # 软限制
                    memory.oom_control           # OOM 行为控制
                    memory.usage_in_bytes        # 当前使用量

                    Docker 示例

                    # 限制容器使用最多 512MB 内存
                    docker run -m 512m nginx
                    
                    # 查看内存限制
                    cat /sys/fs/cgroup/memory/docker/<container-id>/memory.limit_in_bytes
                    # 536870912 (512MB)
                    
                    # 查看当前内存使用
                    cat /sys/fs/cgroup/memory/docker/<container-id>/memory.usage_in_bytes

                    OOM (Out of Memory) 行为

                    # 当容器超过内存限制时
                    # 1. 内核触发 OOM Killer
                    # 2. 杀死容器内的进程(通常是内存占用最大的)
                    # 3. 容器退出,状态码 137
                    
                    docker ps -a
                    # CONTAINER ID   STATUS
                    # abc123         Exited (137) 1 minute ago  ← OOM killed
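
可以进一步确认容器是否因 OOM 被杀(<container> 为占位符):

# 查看容器状态中的 OOMKilled 标志
docker inspect --format '{{.State.OOMKilled}}' <container>

# 在宿主机内核日志中查找 OOM 记录
dmesg | grep -i -E 'out of memory|oom'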

                    避免 OOM 的策略

                    # 设置 OOM Score Adjustment
                    docker run --oom-score-adj=-500 nginx
                    # 数值越低,越不容易被 OOM Killer 杀死
                    
                    # 禁用 OOM Killer (不推荐生产环境)
                    docker run --oom-kill-disable nginx

                    磁盘 I/O 限制

                    Docker 示例

                    # 限制读取速度为 10MB/s
                    docker run --device-read-bps /dev/sda:10mb nginx
                    
                    # 限制写入速度为 5MB/s
                    docker run --device-write-bps /dev/sda:5mb nginx
                    
                    # 限制 IOPS
                    docker run --device-read-iops /dev/sda:100 nginx
                    docker run --device-write-iops /dev/sda:50 nginx

                    测试 I/O 限制

                    # 在容器内测试写入速度
                    docker exec -it my-container bash
                    
                    dd if=/dev/zero of=/tmp/test bs=1M count=100
                    # 写入速度会被限制在 5MB/s

                    📦 Union FS (联合文件系统) - 镜像分层

                    Union FS 允许多个文件系统分层叠加,实现镜像的复用和高效存储。

                    核心概念

                    容器可写层 (Read-Write Layer)     ← 容器运行时的修改
                    ─────────────────────────────────
                    镜像层 4 (Image Layer 4)          ← 只读
                    镜像层 3 (Image Layer 3)          ← 只读
                    镜像层 2 (Image Layer 2)          ← 只读
                    镜像层 1 (Base Layer)             ← 只读
                    ─────────────────────────────────
                             统一挂载点
                          (Union Mount Point)

                    常见实现

| 文件系统 | 特点 | 使用情况 |
| --- | --- | --- |
| OverlayFS | 性能好,内核原生支持 | Docker 默认(推荐) |
| AUFS | 成熟稳定,但不在主线内核 | 早期 Docker 默认 |
| Btrfs | 支持快照,写时复制 | 适合大规模存储 |
| ZFS | 企业级功能,但有许可问题 | 高级用户 |
| Device Mapper | 块级存储 | Red Hat 系列 |
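
可以用下面的命令确认当前 Docker 使用的存储驱动:

# 查看当前使用的存储驱动(通常为 overlay2)
docker info --format '{{.Driver}}'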

                    OverlayFS 原理

                    目录结构

                    /var/lib/docker/overlay2/<image-id>/
                    ├── diff/          # 当前层的文件变更
                    ├── link           # 短链接名称
                    ├── lower          # 指向下层的链接
                    ├── merged/        # 最终挂载点(容器看到的)
                    └── work/          # 工作目录(临时文件)

                    实际演示

                    # 查看镜像的层结构
                    docker image inspect nginx:latest | jq '.[0].RootFS.Layers'
                    # [
                    #   "sha256:abc123...",  ← Layer 1
                    #   "sha256:def456...",  ← Layer 2
                    #   "sha256:ghi789..."   ← Layer 3
                    # ]
                    
                    # 启动容器
                    docker run -d --name web nginx
                    
                    # 查看容器的文件系统
                    docker inspect web | grep MergedDir
                    # "MergedDir": "/var/lib/docker/overlay2/xxx/merged"
                    
                    # 查看挂载信息
                    mount | grep overlay
                    # overlay on /var/lib/docker/overlay2/xxx/merged type overlay (rw,lowerdir=...,upperdir=...,workdir=...)

                    文件操作的 Copy-on-Write (写时复制)

                    # 1. 读取文件(从镜像层)
                    docker exec web cat /etc/nginx/nginx.conf
                    # 直接从只读的镜像层读取,无需复制
                    
                    # 2. 修改文件
                    docker exec web bash -c "echo 'test' >> /etc/nginx/nginx.conf"
                    # 触发 Copy-on-Write:
                    # - 从下层复制文件到容器可写层
                    # - 在可写层修改文件
                    # - 下次读取时,从可写层读取(覆盖下层)
                    
                    # 3. 删除文件
                    docker exec web rm /var/log/nginx/access.log
                    # 创建 whiteout 文件,标记删除
                    # 文件在镜像层仍存在,但容器内看不到

                    Whiteout 文件(删除标记)

                    # 在容器可写层
                    ls -la /var/lib/docker/overlay2/xxx/diff/var/log/nginx/
                    # c--------- 1 root root 0, 0 Oct 11 10:00 .wh.access.log
                    # 字符设备文件,主次设备号都是 0,表示删除标记

                    镜像分层的优势

                    1. 共享层,节省空间

                    # 假设有 10 个基于 ubuntu:20.04 的镜像
                    # 不使用分层:10 × 100MB = 1GB
                    # 使用分层:100MB (ubuntu base) + 10 × 10MB (应用层) = 200MB
                    # 节省空间:80%

                    2. 快速构建

                    FROM ubuntu:20.04                    # Layer 1 (缓存)
                    RUN apt-get update                   # Layer 2 (缓存)
                    RUN apt-get install -y nginx         # Layer 3 (缓存)
                    COPY app.conf /etc/nginx/            # Layer 4 (需要重建)
                    COPY app.js /var/www/                # Layer 5 (需要重建)
                    
                    # 如果只修改 app.js,只需要重建 Layer 5
                    # 前面的层都从缓存读取

                    3. 快速分发

                    # 拉取镜像时,只下载本地没有的层
                    docker pull nginx:1.21
                    # Already exists: Layer 1 (ubuntu base)
                    # Downloading:    Layer 2 (nginx files)
                    # Downloading:    Layer 3 (config)

                    🔗 容器技术完整流程

                    Docker 创建容器的完整过程

                    docker run -d --name web \
                      --cpus=0.5 \
                      -m 512m \
                      -p 8080:80 \
                      nginx:latest

                    内部执行流程

                    1. 拉取镜像(如果本地没有)
                       └─ 下载各层,存储到 /var/lib/docker/overlay2/
                    
                    2. 创建 Namespace
                       ├─ PID Namespace (隔离进程)
                       ├─ Network Namespace (隔离网络)
                       ├─ Mount Namespace (隔离文件系统)
                       ├─ UTS Namespace (隔离主机名)
                       ├─ IPC Namespace (隔离进程间通信)
                       └─ User Namespace (隔离用户)
                    
                    3. 配置 Cgroups
                       ├─ cpu.cfs_quota_us = 50000 (50% CPU)
                       └─ memory.limit_in_bytes = 536870912 (512MB)
                    
                    4. 挂载文件系统 (OverlayFS)
                       ├─ lowerdir: 镜像只读层
                       ├─ upperdir: 容器可写层
                       ├─ workdir: 工作目录
                       └─ merged: 统一视图挂载点
                    
                    5. 配置网络
                       ├─ 创建 veth pair
                       ├─ 一端连接到容器的 Network Namespace
                       ├─ 另一端连接到 docker0 网桥
                       ├─ 分配 IP 地址
                       └─ 配置 iptables NAT 规则 (端口映射)
                    
                    6. 切换根目录
                       ├─ chroot 或 pivot_root
                       └─ 容器内看到的 / 是 merged 目录
                    
                    7. 启动容器进程
                       ├─ 在新的 Namespace 中
                       ├─ 受 Cgroups 限制
                       └─ 使用新的根文件系统
                       └─ 执行 ENTRYPOINT/CMD
                    
                    8. 容器运行中
                       └─ containerd-shim 监控进程

                    🛠️ 手动创建容器(无 Docker)

                    完整示例:从零创建容器

                    #!/bin/bash
                    # 手动创建一个简单的容器
                    
# 1. 准备根文件系统
mkdir -p /tmp/mycontainer/rootfs
cd /tmp/mycontainer/rootfs

# 创建必要的目录
mkdir -p bin sbin etc proc sys tmp dev

# 下载 busybox 作为基础系统
wget -O bin/busybox https://busybox.net/downloads/binaries/1.35.0-x86_64-linux-musl/busybox
chmod +x bin/busybox

# 为每个 applet 创建相对路径的软链接(chroot 之后依然可用)
(cd bin && for app in $(./busybox --list); do ln -sf busybox "$app"; done)

# 2. 创建启动脚本(在宿主机上运行:先进入新的 namespace,再切换根目录)
cat > /tmp/mycontainer/start.sh <<'EOF'
#!/bin/bash

# 创建新的 namespace
unshare --pid --net --mount --uts --ipc --fork /bin/bash -c '
    # 设置主机名
    hostname mycontainer

    # 在新的 mount namespace 中为容器挂载 proc
    mount -t proc proc /tmp/mycontainer/rootfs/proc

    # 切换根目录并启动 busybox 的 shell
    exec chroot /tmp/mycontainer/rootfs /bin/sh
'
EOF

chmod +x /tmp/mycontainer/start.sh

# 3. 启动容器
/tmp/mycontainer/start.sh

                    配置 Cgroups 限制

                    # 创建 cgroup
                    mkdir -p /sys/fs/cgroup/memory/mycontainer
                    mkdir -p /sys/fs/cgroup/cpu/mycontainer
                    
                    # 设置内存限制 256MB
                    echo 268435456 > /sys/fs/cgroup/memory/mycontainer/memory.limit_in_bytes
                    
                    # 设置 CPU 限制 50%
                    echo 50000 > /sys/fs/cgroup/cpu/mycontainer/cpu.cfs_quota_us
                    echo 100000 > /sys/fs/cgroup/cpu/mycontainer/cpu.cfs_period_us
                    
                    # 将容器进程加入 cgroup
                    echo $CONTAINER_PID > /sys/fs/cgroup/memory/mycontainer/cgroup.procs
                    echo $CONTAINER_PID > /sys/fs/cgroup/cpu/mycontainer/cgroup.procs

                    🔍 容器 vs 虚拟机

                    架构对比

                    虚拟机架构:
                    ┌─────────────────────────────────────┐
                    │  App A  │  App B  │  App C          │
                    ├─────────┼─────────┼─────────────────┤
                    │ Bins/Libs│ Bins/Libs│ Bins/Libs      │
                    ├─────────┼─────────┼─────────────────┤
                    │ Guest OS│ Guest OS│ Guest OS        │  ← 每个 VM 都有完整 OS
                    ├─────────┴─────────┴─────────────────┤
                    │       Hypervisor (VMware/KVM)       │
                    ├─────────────────────────────────────┤
                    │         Host Operating System       │
                    ├─────────────────────────────────────┤
                    │         Hardware                    │
                    └─────────────────────────────────────┘
                    
                    容器架构:
                    ┌─────────────────────────────────────┐
                    │  App A  │  App B  │  App C          │
                    ├─────────┼─────────┼─────────────────┤
                    │ Bins/Libs│ Bins/Libs│ Bins/Libs      │
                    ├─────────────────────────────────────┤
                    │  Docker Engine / containerd         │
                    ├─────────────────────────────────────┤
                    │    Host Operating System (Linux)    │  ← 共享内核
                    ├─────────────────────────────────────┤
                    │         Hardware                    │
                    └─────────────────────────────────────┘

                    性能对比

| 维度 | 虚拟机 | 容器 |
| --- | --- | --- |
| 启动时间 | 分钟级 | 秒级 |
| 资源占用 | GB 级内存 | MB 级内存 |
| 性能开销 | 5-10% | < 1% |
| 隔离程度 | 完全隔离(硬件级) | 进程隔离(OS 级) |
| 安全性 | 更高(独立内核) | 较低(共享内核) |
| 密度 | 每台物理机 10-50 个 | 每台物理机 100-1000 个 |

                    ⚠️ 容器的安全性考虑

                    1. 共享内核的风险

                    # 容器逃逸:如果内核有漏洞,容器可能逃逸到宿主机
                    
                    # 缓解措施:
                    # - 使用 User Namespace
                    # - 运行容器为非 root 用户
                    # - 使用 Seccomp 限制系统调用
                    # - 使用 AppArmor/SELinux

                    2. 特权容器的危险

                    # 特权容器可以访问宿主机所有设备
                    docker run --privileged ...
                    
                    # ❌ 危险:容器内可以:
                    # - 加载内核模块
                    # - 访问宿主机所有设备
                    # - 修改宿主机网络配置
                    # - 读写宿主机任意文件
                    
                    # ✅ 最佳实践:避免使用特权容器

                    3. Capability 控制

                    # 只授予容器必要的权限
                    docker run --cap-drop ALL --cap-add NET_BIND_SERVICE nginx
                    
                    # 默认 Docker 授予的 Capabilities:
                    # - CHOWN, DAC_OVERRIDE, FOWNER, FSETID
                    # - KILL, SETGID, SETUID, SETPCAP
                    # - NET_BIND_SERVICE, NET_RAW
                    # - SYS_CHROOT, MKNOD, AUDIT_WRITE, SETFCAP

                    💡 关键要点总结

                    容器 = Namespace + Cgroups + Union FS

                    1. Namespace (隔离)

                      • PID: 进程隔离
                      • Network: 网络隔离
                      • Mount: 文件系统隔离
                      • UTS: 主机名隔离
                      • IPC: 进程间通信隔离
                      • User: 用户隔离
                    2. Cgroups (限制)

                      • CPU: 限制处理器使用
                      • Memory: 限制内存使用
                      • Block I/O: 限制磁盘 I/O
                      • Network: 限制网络带宽
                    3. Union FS (分层)

                      • 镜像分层存储
                      • Copy-on-Write
                      • 节省空间和带宽

                    容器不是虚拟机

                    • ✅ 容器是特殊的进程
                    • ✅ 共享宿主机内核
                    • ✅ 启动快、资源占用少
                    • ⚠️ 隔离性不如虚拟机
                    • ⚠️ 需要注意安全配置
                    Mar 7, 2024

                    Subsections of Database

                    Elastic Search DSL

                    Basic Query

                    exist query

                    Returns documents that contain an indexed value for a field.

                    GET /_search
                    {
                      "query": {
                        "exists": {
                          "field": "user"
                        }
                      }
                    }

                    The following search returns documents that are missing an indexed value for the user.id field.

                    GET /_search
                    {
                      "query": {
                        "bool": {
                          "must_not": {
                            "exists": {
                              "field": "user.id"
                            }
                          }
                        }
                      }
                    }
fuzzy query

                    Returns documents that contain terms similar to the search term, as measured by a Levenshtein edit distance.

                    GET /_search
                    {
                      "query": {
                        "fuzzy": {
                          "filed_A": {
                            "value": "ki"
                          }
                        }
                      }
                    }

The same fuzzy query, with its optional parameters spelled out:

                    GET /_search
                    {
                      "query": {
                        "fuzzy": {
                          "filed_A": {
                            "value": "ki",
                            "fuzziness": "AUTO",
                            "max_expansions": 50,
                            "prefix_length": 0,
                            "transpositions": true,
                            "rewrite": "constant_score_blended"
                          }
                        }
                      }
                    }

                    rewrite:

                    • constant_score_boolean
                    • constant_score_filter
                    • top_terms_blended_freqs_N
                    • top_terms_boost_N, top_terms_N
                    • frequent_terms, score_delegating
                    ids query

                    Returns documents based on their IDs. This query uses document IDs stored in the _id field.

                    GET /_search
                    {
                      "query": {
                        "ids" : {
                          "values" : ["2NTC5ZIBNLuBWC6V5_0Y"]
                        }
                      }
                    }
                    prefix query

                    The following search returns documents where the filed_A field contains a term that begins with ki.

                    GET /_search
                    {
                      "query": {
                        "prefix": {
                          "filed_A": {
                            "value": "ki",
                             "rewrite": "constant_score_blended",
                             "case_insensitive": true
                          }
                        }
                      }
                    }

                    You can simplify the prefix query syntax by combining the <field> and value parameters.

                    GET /_search
                    {
                      "query": {
                        "prefix" : { "filed_A" : "ki" }
                      }
                    }
                    range query

                    Returns documents that contain terms within a provided range.

                    GET /_search
                    {
                      "query": {
                        "range": {
                          "filed_number": {
                            "gte": 10,
                            "lte": 20,
                            "boost": 2.0
                          }
                        }
                      }
                    }
                    GET /_search
                    {
                      "query": {
                        "range": {
                          "filed_timestamp": {
                            "time_zone": "+01:00",        
                            "gte": "2020-01-01T00:00:00", 
                            "lte": "now"                  
                          }
                        }
                      }
                    }
                    regex query

                    Returns documents that contain terms matching a regular expression.

                    GET /_search
                    {
                      "query": {
                        "regexp": {
                          "filed_A": {
                            "value": "k.*y",
                            "flags": "ALL",
                            "case_insensitive": true,
                            "max_determinized_states": 10000,
                            "rewrite": "constant_score_blended"
                          }
                        }
                      }
                    }
                    term query

                    Returns documents that contain an exact term in a provided field.

                    You can use the term query to find documents based on a precise value such as a price, a product ID, or a username.

                    GET /_search
                    {
                      "query": {
                        "term": {
                          "filed_A": {
                            "value": "kimchy",
                            "boost": 1.0
                          }
                        }
                      }
                    }
                    wildcard query

                    Returns documents that contain terms matching a wildcard pattern.

                    A wildcard operator is a placeholder that matches one or more characters. For example, the * wildcard operator matches zero or more characters. You can combine wildcard operators with other characters to create a wildcard pattern.

                    GET /_search
                    {
                      "query": {
                        "wildcard": {
                          "filed_A": {
                            "value": "ki*y",
                            "boost": 1.0,
                            "rewrite": "constant_score_blended"
                          }
                        }
                      }
                    }
                    Oct 7, 2024

                    HPC

                      Mar 7, 2024

                      K8s

                      Mar 7, 2024

                      Subsections of K8s

                      K8s的理解

                      一、核心定位:云时代的操作系统

                      我对 K8s 最根本的理解是:它正在成为数据中心/云环境的“操作系统”。

                      • 传统操作系统(如 Windows、Linux):管理的是单台计算机的硬件资源(CPU、内存、硬盘、网络),并为应用程序(进程)提供运行环境。
                      • Kubernetes:管理的是一个集群(由多台计算机组成)的资源,并将这些物理机/虚拟机抽象成一个巨大的“资源池”。它在这个池子上调度和运行的不再是简单的进程,而是容器化了的应用程序

                      所以,你可以把 K8s 看作是一个分布式的、面向云原生应用的操作系统。


                      二、要解决的核心问题:从“动物园”到“牧场”

                      在 K8s 出现之前,微服务和容器化架构带来了新的挑战:

                      1. 编排混乱:我有成百上千个容器,应该在哪台机器上启动?如何知道它们是否健康?挂了怎么办?如何扩容缩容?
                      2. 网络复杂:容器之间如何发现和通信?如何实现负载均衡?
                      3. 存储管理:有状态应用的数据如何持久化?容器漂移后数据怎么跟走?
                      4. 部署麻烦:如何实现蓝绿部署、金丝雀发布?如何回滚?

                      这个时期被称为“集装箱革命”后的“编排战争”时期,各种工具(如 Docker Swarm, Mesos, Nomad)就像是一个混乱的“动物园”。

                      K8s 的诞生(源于 Google 内部系统 Borg 的经验)就是为了系统地解决这些问题,它将混乱的“动物园”管理成了一个井然有序的“牧场”。它的核心能力可以概括为:声明式 API 和控制器模式


                      三、核心架构与工作模型:大脑与肢体

                      K8s 集群主要由控制平面和工作节点组成。

                      • 控制平面:集群的大脑

                        • kube-apiserver:整个系统的唯一入口,所有组件都必须通过它来操作集群状态。它是“前台总机”。
                        • etcd:一个高可用的键值数据库,持久化存储集群的所有状态数据。它是“集群的记忆中心”。
                        • kube-scheduler:负责调度,决定 Pod 应该在哪个节点上运行。它是“人力资源部”。
                        • kube-controller-manager:运行着各种控制器,不断检查当前状态是否与期望状态一致,并努力驱使其一致。例如,节点控制器、副本控制器等。它是“自动化的管理团队”。
                      • 工作节点:干活的肢体

                        • kubelet:节点上的“监工”,负责与控制平面通信,管理本节点上 Pod 的生命周期,确保容器健康运行。
                        • kube-proxy:负责节点上的网络规则,实现 Service 的负载均衡和网络代理。
                        • 容器运行时:如 containerd 或 CRI-O,负责真正拉取镜像和运行容器。
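
                      下面是一个简单的自查草图(假设集群由 kubeadm 部署,控制平面组件以静态 Pod 形式运行在 kube-system 命名空间,kubelet 以 systemd 服务运行在节点上):

                      # 查看控制平面与节点组件对应的 Pod
                      kubectl get pods -n kube-system -o wide | grep -E 'kube-apiserver|etcd|kube-scheduler|kube-controller-manager|kube-proxy'

                      # kubelet 不是 Pod,而是节点上的 systemd 服务
                      systemctl status kubelet --no-pager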

                      工作模型的核心:声明式 API 与控制器模式

                      1. 你向 kube-apiserver 提交一个 YAML/JSON 文件,声明你期望的应用状态(例如:我要运行 3 个 Nginx 实例)。
                      2. etcd 记录下这个期望状态。
                      3. 各种控制器会持续地“观察”当前状态,并与 etcd 中的期望状态进行对比。
                      4. 如果发现不一致(例如,只有一个 Nginx 实例在运行),控制器就会主动采取行动(例如,再创建两个 Pod),直到当前状态与期望状态一致。
                      5. 这个过程是自愈的、自动的
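
                      下面用一个最小的命令行草图演示这个“期望状态 → 当前状态”的自愈闭环(nginx Deployment 仅为示例,假设当前命名空间中没有同名资源):

                      # 声明期望状态:3 个 nginx 副本
                      kubectl create deployment nginx --image=nginx --replicas=3

                      # 人为制造偏差:删除其中一个 Pod
                      kubectl delete pod $(kubectl get pods -l app=nginx -o jsonpath='{.items[0].metadata.name}')

                      # 观察控制器自动补齐副本,使当前状态重新收敛到期望状态
                      kubectl get pods -l app=nginx --watch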

                      四、关键对象与抽象:乐高积木

                      K8s 通过一系列抽象对象来建模应用,这些对象就像乐高积木:

                      1. Pod:最小部署和管理单元。一个 Pod 可以包含一个或多个紧密关联的容器(如主容器和 Sidecar 容器),它们共享网络和存储。这是 K8s 的“原子”。
                      2. Deployment:定义无状态应用。它管理 Pod 的多个副本(Replicas),并提供滚动更新、回滚等强大的部署策略。这是最常用的对象。
                      3. Service:定义一组 Pod 的访问方式。Pod 是短暂的(ephemeral),IP 会变。Service 提供一个稳定的 IP 和 DNS 名称,并作为负载均衡器,将流量分发给后端的健康 Pod。它是“服务的门户”。
                      4. ConfigMap & Secret:将配置信息和敏感数据与容器镜像解耦,实现配置的灵活管理。
                      5. Volume:抽象了各种存储解决方案,为 Pod 提供持久化存储。
                      6. Namespace:在物理集群内部创建多个虚拟集群,实现资源隔离和多租户管理。
                      7. StatefulSet:用于部署有状态应用(如数据库)。它为每个 Pod 提供稳定的标识符、有序的部署和扩缩容,以及稳定的持久化存储。
                      8. Ingress:管理集群外部访问内部服务的入口,通常提供 HTTP/HTTPS 路由、SSL 终止等功能。它是“集群的流量总入口”。

                      五、核心价值与优势

                      1. 自动化运维:自动化了应用的部署、扩缩容、故障恢复(自愈)、滚动更新等,极大降低了运维成本。
                      2. 声明式配置与不可变基础设施:通过 YAML 文件定义一切,基础设施可版本化、可追溯、可重复。这是 DevOps 和 GitOps 的基石。
                      3. 环境一致性 & 可移植性:实现了“一次编写,随处运行”。无论是在本地开发机、测试环境,还是在公有云、混合云上,应用的行为都是一致的。
                      4. 高可用性与弹性伸缩:轻松实现应用的多副本部署,并能根据 CPU、内存等指标或自定义指标进行自动扩缩容,从容应对流量高峰。
                      5. 丰富的生态系统:拥有一个极其庞大和活跃的社区,提供了大量的工具和扩展(Helm, Operator, Istio等),能解决几乎所有你能想到的问题。

                      六、挑战与学习曲线

                      K8s 并非银弹,它也有自己的挑战:

                      • 复杂性高:概念繁多,架构复杂,学习和运维成本非常高。
                      • “配置”沉重:YAML 文件可能非常多,管理起来本身就是一门学问。
                      • 网络与存储:虽然是核心抽象,但其底层实现和理解起来依然有相当的门槛。

                      总结

                      在我看来,Kubernetes 不仅仅是一个容器编排工具,它更是一套云原生应用的管理范式。它通过一系列精妙的抽象,将复杂的分布式系统管理问题标准化、自动化和简单化。虽然入门有门槛,但它已经成为现代应用基础设施的事实标准,是任何从事后端开发、运维、架构设计的人员都必须理解和掌握的核心技术。

                      简单来说,K8s 让你能够像管理一台超级计算机一样,去管理一个由成千上万台机器组成的集群。

                      Mar 7, 2024

                      Cgroup在K8S中起什么作用

                      Kubernetes 深度集成 cgroup 来实现容器资源管理和隔离。以下是 cgroup 与 K8s 结合的详细方式:

                      1. K8s 资源模型与 cgroup 映射

                      1.1 资源请求和限制

                      apiVersion: v1
                      kind: Pod
                      spec:
                        containers:
                        - name: app
                          resources:
                            requests:
                              memory: "64Mi"
                              cpu: "250m"
                            limits:
                              memory: "128Mi"
                              cpu: "500m"
                              ephemeral-storage: "2Gi"

                      对应 cgroup 配置:

                      • cpu.shares = 256 (250m × 1024 / 1000)
                      • cpu.cfs_quota_us = 50000 (500m × 100000 / 1000)
                      • memory.limit_in_bytes = 134217728 (128Mi)

                      2. K8s cgroup 驱动

                      2.1 cgroupfs 驱动

                      # kubelet 配置
                      --cgroup-driver=cgroupfs
                      --cgroup-root=/sys/fs/cgroup

                      2.2 systemd 驱动(推荐)

                      # kubelet 配置
                      --cgroup-driver=systemd
                      --cgroup-root=/sys/fs/cgroup

                      3. K8s cgroup 层级结构

                      3.1 cgroup v1 层级

                      /sys/fs/cgroup/
                      ├── cpu,cpuacct/kubepods/
                      │   ├── burstable/pod-uid-1/
                      │   │   ├── container-1/
                      │   │   └── container-2/
                      │   └── guaranteed/pod-uid-2/
                      │       └── container-1/
                      ├── memory/kubepods/
                      └── pids/kubepods/

                      3.2 cgroup v2 统一层级

                      /sys/fs/cgroup/kubepods/
                      ├── pod-uid-1/
                      │   ├── container-1/
                      │   └── container-2/
                      └── pod-uid-2/
                          └── container-1/

                      4. QoS 等级与 cgroup 配置

                      4.1 Guaranteed (最高优先级)

                      resources:
                        limits:
                          cpu: "500m"
                          memory: "128Mi"
                        requests:
                          cpu: "500m" 
                          memory: "128Mi"

                      cgroup 配置:

                      • cpu.shares = 512
                      • cpu.cfs_quota_us = 50000
                      • oom_score_adj = -998

                      4.2 Burstable (中等优先级)

                      resources:
                        requests:
                          cpu: "250m"
                          memory: "64Mi"
                        # limits 未设置或大于 requests

                      cgroup 配置:

                      • cpu.shares = 256
                      • cpu.cfs_quota_us = -1 (无限制)
                      • oom_score_adj = 2-999

                      4.3 BestEffort (最低优先级)

                      # 未设置 resources

                      cgroup 配置:

                      • cpu.shares = 2
                      • memory.limit_in_bytes = 9223372036854771712 (极大值)
                      • oom_score_adj = 1000
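
                      可以用下面的小草图验证某个 Pod 实际被划分到的 QoS 等级(<pod-name> 为占位符):

                      # 查看单个 Pod 的 QoS 等级(Guaranteed / Burstable / BestEffort)
                      kubectl get pod <pod-name> -o jsonpath='{.status.qosClass}{"\n"}'

                      # 批量查看当前命名空间内所有 Pod 的 QoS 等级
                      kubectl get pods -o custom-columns=NAME:.metadata.name,QOS:.status.qosClass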

                      5. 实际 cgroup 配置示例

                      5.1 查看 Pod 的 cgroup

                      # 找到 Pod 的 cgroup 路径
                      cat /sys/fs/cgroup/cpu/kubepods/pod-uid-1/cgroup.procs
                      
                      # 查看 CPU 配置
                      cat /sys/fs/cgroup/cpu/kubepods/pod-uid-1/cpu.shares
                      cat /sys/fs/cgroup/cpu/kubepods/pod-uid-1/cpu.cfs_quota_us
                      
                      # 查看内存配置
                      cat /sys/fs/cgroup/memory/kubepods/pod-uid-1/memory.limit_in_bytes

                      5.2 使用 cgroup-tools 监控

                      # 安装工具
                      apt-get install cgroup-tools
                      
                      # 查看 cgroup 统计
                      cgget -g cpu:/kubepods/pod-uid-1
                      cgget -g memory:/kubepods/pod-uid-1

                      6. K8s 特性与 cgroup 集成

                      6.1 垂直 Pod 自动缩放 (VPA)

                      apiVersion: autoscaling.k8s.io/v1
                      kind: VerticalPodAutoscaler
                      spec:
                        targetRef:
                          apiVersion: "apps/v1"
                          kind: Deployment
                          name: my-app
                        updatePolicy:
                          updateMode: "Auto"

                      VPA 根据历史使用数据动态调整:

                      • 修改 resources.requests 和 resources.limits
                      • kubelet 更新对应的 cgroup 配置

                      6.2 水平 Pod 自动缩放 (HPA)

                      apiVersion: autoscaling/v2
                      kind: HorizontalPodAutoscaler
                      spec:
                        scaleTargetRef:
                          apiVersion: apps/v1
                          kind: Deployment
                          name: my-app
                        minReplicas: 1
                        maxReplicas: 10
                        metrics:
                        - type: Resource
                          resource:
                            name: cpu
                            target:
                              type: Utilization
                              averageUtilization: 50

                      HPA 依赖 cgroup 的 CPU 使用率统计进行决策。

                      6.3 资源监控

                      # 通过 cgroup 获取容器资源使用
                      cat /sys/fs/cgroup/cpu/kubepods/pod-uid-1/cpuacct.usage
                      cat /sys/fs/cgroup/memory/kubepods/pod-uid-1/memory.usage_in_bytes
                      
                      # 使用 metrics-server 收集
                      kubectl top pods
                      kubectl top nodes

                      7. 节点资源管理

                      7.1 系统预留资源

                      # kubelet 配置
                      apiVersion: kubelet.config.k8s.io/v1beta1
                      kind: KubeletConfiguration
                      systemReserved:
                        cpu: "100m"
                        memory: "256Mi"
                        ephemeral-storage: "1Gi"
                      kubeReserved:
                        cpu: "200m"
                        memory: "512Mi"
                        ephemeral-storage: "2Gi"
                      evictionHard:
                        memory.available: "100Mi"
                        nodefs.available: "10%"

                      7.2 驱逐策略

                      当节点资源不足时,kubelet 根据 cgroup 统计:

                      • 监控 memory.usage_in_bytes
                      • 监控 cpuacct.usage
                      • 触发 Pod 驱逐
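
                      排查驱逐问题时,可以先用下面的草图确认节点的可分配资源与压力状况(<node-name> 为占位符):

                      # 查看节点可分配资源与已分配的请求/限制
                      kubectl describe node <node-name> | grep -A 8 "Allocatable:"
                      kubectl describe node <node-name> | grep -A 10 "Allocated resources:"

                      # 查看节点是否存在 MemoryPressure / DiskPressure 等压力条件
                      kubectl get node <node-name> -o jsonpath='{range .status.conditions[*]}{.type}={.status}{"\n"}{end}'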

                      8. 故障排查和调试

                      8.1 检查 cgroup 配置

                      # 进入节点检查
                      docker exec -it node-shell /bin/bash
                      
                      # 查看 Pod cgroup
                      find /sys/fs/cgroup -name "*pod-uid*" -type d
                      
                      # 检查资源限制
                      cat /sys/fs/cgroup/memory/kubepods/pod-uid-1/memory.limit_in_bytes
                      cat /sys/fs/cgroup/cpu/kubepods/pod-uid-1/cpu.cfs_quota_us

                      8.2 监控 OOM 事件

                      # 查看内核日志
                      dmesg | grep -i "killed process"
                      
                      # 查看 cgroup OOM 事件
                      grep "kubepods" /var/log/kern.log | grep -i oom

                      9. 最佳实践

                      9.1 合理设置资源限制

                      resources:
                        requests:
                          cpu: "100m"
                          memory: "128Mi"
                        limits:
                          cpu: "200m"    # 不要设置过大
                          memory: "256Mi" # 避免内存浪费

                      9.2 使用 LimitRange

                      apiVersion: v1
                      kind: LimitRange
                      metadata:
                        name: mem-limit-range
                      spec:
                        limits:
                        - default:
                            memory: "256Mi"
                          defaultRequest:
                            memory: "128Mi"
                          type: Container

                      cgroup 是 K8s 资源管理的基石,通过精细的 cgroup 配置,K8s 实现了多租户环境下的资源隔离、公平调度和稳定性保障。

                      Mar 7, 2024

                      Headless VS ClusterIP

                      Q: headless service 和 普通的service 有什么区别? 只是有没有clusterIP?

                      “有没有 ClusterIP” 只是表面现象,其背后是根本不同的服务发现模式和适用场景。


                      核心区别:服务发现模式

                      • 普通 Service:提供的是 “负载均衡” 式的服务发现。
                        • 它抽象了一组 Pod,你访问的是这个抽象的、稳定的 VIP(ClusterIP),然后由 kube-proxy 将流量转发到后端的某个 Pod。
                        • 客户端不知道、也不关心具体是哪个 Pod 在处理请求。
                      • Headless Service:提供的是 “直接 Pod IP” 式的服务发现。
                        • 不会给你一个统一的 VIP,而是直接返回后端所有 Pod 的 IP 地址。
                        • 客户端可以直接与任何一个 Pod 通信,并且知道它正在和哪个具体的 Pod 对话。

                      详细对比

                      | 特性 | 普通 Service | Headless Service |
                      | --- | --- | --- |
                      | clusterIP 字段 | 自动分配一个 VIP(如 10.96.123.45) | 必须设置为 None,这是定义 Headless Service 的标志 |
                      | 核心功能 | 负载均衡,作为流量的代理和分发器 | 服务发现,作为 Pod 的 DNS 记录注册器,不负责流量转发 |
                      | DNS 解析结果 | 解析到 Service 的 ClusterIP | 解析到所有与 Selector 匹配的 Pod 的 IP 地址 |
                      | 网络拓扑 | 客户端 -> ClusterIP (VIP) -> (由 kube-proxy 负载均衡) -> 某个 Pod | 客户端 -> Pod IP |
                      | 适用场景 | 标准的微服务、Web 前端/后端 API,任何需要负载均衡的场景 | 有状态应用集群(如 MySQL, MongoDB, Kafka, Redis Cluster),需要直接连接特定 Pod 的场景(如 gRPC 长连接、游戏服务器) |

                      DNS 解析行为的深入理解

                      这是理解两者差异的最直观方式。

                      假设我们有一个名为 my-app 的 Service,它选择了 3 个 Pod。

                      1. 普通 Service 的 DNS 解析

                      • 在集群内,你执行 nslookup my-app(或在 Pod 里用代码查询)。
                      • 返回结果1 条 A 记录,指向 Service 的 ClusterIP。
                        Name:      my-app
                        Address 1: 10.96.123.45
                      • 你的应用:连接到 10.96.123.456:port,剩下的交给 Kubernetes 的网络层。

                      2. Headless Service 的 DNS 解析

                      • 在集群内,你执行 nslookup my-app(注意:Service 的 clusterIP: None)。
                      • 返回结果多条 A 记录,直接指向后端所有 Pod 的 IP。
                        Name:      my-app
                        Address 1: 172.17.0.10
                        Address 2: 172.17.0.11
                        Address 3: 172.17.0.12
                      • 你的应用:会拿到这个 IP 列表,并由客户端自己决定如何连接。比如,它可以:
                        • 随机选一个。
                        • 实现自己的负载均衡逻辑。
                        • 需要连接所有 Pod(比如收集状态)。
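
                      可以在集群里起一个临时 Pod 实际对比两种解析结果(下面的 my-app 为普通 Service、my-app-headless 为 Headless Service,名称仅为示例):

                      # 用一个临时 busybox Pod 做 DNS 查询(命令结束后自动删除)
                      kubectl run dns-test --rm -it --restart=Never --image=busybox:1.36 -- nslookup my-app

                      # 对 Headless Service 的查询会直接返回后端所有 Pod 的 IP
                      kubectl run dns-test --rm -it --restart=Never --image=busybox:1.36 -- nslookup my-app-headless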

                      与 StatefulSet 结合的“杀手级应用”

                      Headless Service 最经典、最强大的用途就是与 StatefulSet 配合,为有状态应用集群提供稳定的网络标识。

                      回顾之前的 MongoDB 例子:

                      • StatefulSet: mongodb (3个副本)
                      • Headless Service: mongodb-service

                      此时,DNS 系统会创建出稳定且可预测的 DNS 记录,而不仅仅是返回 IP 列表:

                      • 每个 Pod 获得一个稳定的 DNS 名称

                        • mongodb-0.mongodb-service.default.svc.cluster.local
                        • mongodb-1.mongodb-service.default.svc.cluster.local
                        • mongodb-2.mongodb-service.default.svc.cluster.local
                      • 查询 Headless Service 本身的 DNS (mongodb-service) 会返回所有 Pod IP。

                      这带来了巨大优势:

                      1. 稳定的成员身份:在初始化 MongoDB 副本集时,你可以直接用这些稳定的 DNS 名称来配置成员列表。即使 Pod 重启、IP 变了,它的 DNS 名称永远不变,配置也就永远不会失效。
                      2. 直接 Pod 间通信:在 Kafka 或 Redis Cluster 这样的系统中,节点之间需要直接通信来同步数据。它们可以使用这些稳定的 DNS 名称直接找到对方,而不需要经过一个不必要的负载均衡器。
                      3. 主从选举与读写分离:客户端应用可以通过固定的 DNS 名称(如 mongodb-0...)直接连接到主节点执行写操作,而通过其他名称连接到从节点进行读操作。

                      总结

                      你可以这样形象地理解:

                      • 普通 Service 像一个公司的“总机号码”

                        • 你打电话给总机(ClusterIP),说“我要找技术支持”,接线员(kube-proxy)会帮你转接到一个空闲的技术支持人员(Pod)那里。你不需要知道具体是谁在为你服务。
                      • Headless Service 像一个公司的“内部通讯录”

                        • 它不提供总机转接服务。它只给你一份所有员工(Pod)的姓名和直拨电话(IP)列表。
                        • 特别是对于 StatefulSet,这份通讯录里的每个员工还有自己固定、专属的座位和分机号(稳定的 DNS 名称),比如“张三座位在 A区-001,分机是 8001”。你知道要找谁时,直接打他的分机就行。

                      所以,“有没有 ClusterIP” 只是一个开关,这个开关背后选择的是两种截然不同的服务发现和流量治理模式。 对于需要直接寻址、有状态、集群化的应用,Headless Service 是必不可少的基石。

                      Mar 7, 2024

                      Creating A Pod

                      描述 Kubernetes 中一个 Pod 的创建过程,可以清晰地展示了 K8s 各个核心组件是如何协同工作的。

                      我们可以将整个过程分为两个主要阶段:控制平面的决策阶段工作节点的执行阶段


                      第一阶段:控制平面决策(大脑决策)

                      1. 用户提交请求

                        • 用户使用 kubectl apply -f pod.yaml 向 kube-apiserver 提交一个 Pod 定义文件。
                        • kubectl 会验证配置并将其转换为 JSON 格式,通过 REST API 调用发送给 kube-apiserver。
                      2. API Server 处理与验证

                        • kube-apiserver 接收到请求后,会进行一系列操作:
                          • 身份认证:验证用户身份。
                          • 授权:检查用户是否有权限创建 Pod。
                          • 准入控制:可能调用一些准入控制器来修改或验证 Pod 对象(例如,注入 Sidecar 容器、设置默认资源限制等)。
                        • 所有验证通过后,kube-apiserver 将 Pod 的元数据对象写入 etcd 数据库。此时,Pod 在 etcd 中的状态被标记为 Pending
                        • 至此,Pod 的创建请求已被记录,但还未被调度到任何节点。
                      3. 调度器决策

                        • kube-scheduler 作为一个控制器,通过 watch 机制持续监听 kube-apiserver,发现有一个新的 Pod 被创建且其 nodeName 为空。
                        • 调度器开始为这个 Pod 选择一个最合适的节点,它执行两阶段操作:
                          • 过滤:根据节点资源(CPU、内存)、污点、节点选择器、存储、镜像拉取等因素过滤掉不合适的节点。
                          • 评分:对剩下的节点进行打分(例如,考虑资源均衡、亲和性等),选择得分最高的节点。
                        • 做出决策后,kube-scheduler 以补丁(patch)的方式更新 kube-apiserver 中该 Pod 的定义,将其 nodeName 字段设置为选定的节点名称。
                        • kube-apiserver 再次将这个更新后的信息写入 etcd

                      第二阶段:工作节点执行(肢体行动)

                      1. kubelet 监听到任务

                        • 目标节点上的 kubelet 同样通过 watch 机制监听 kube-apiserver,发现有一个 Pod 被“分配”到了自己所在的节点(即其 nodeName 与自己的节点名匹配)。
                        • kubelet 会从 kube-apiserver 读取完整的 Pod 定义。
                      2. kubelet 控制容器运行时

                        • kubelet 通过 CRI 接口调用本地的容器运行时(如 containerd、CRI-O)。
                        • 容器运行时负责:
                          • 从指定的镜像仓库拉取容器镜像(如果本地不存在)。
                          • 根据 Pod 定义创建并启动容器。
                      3. 配置容器环境

                        • 在启动容器前后,kubelet 还会通过其他接口完成一系列配置:
                          • CNI:调用网络插件(如 Calico、Flannel)为 Pod 分配 IP 地址并配置网络。
                          • CSI:如果 Pod 使用了持久化存储,会调用存储插件挂载存储卷。
                      4. 状态上报

                        • 当 Pod 中的所有容器都成功启动并运行后,kubelet 会持续监控容器的健康状态。
                        • 它将 Pod 的当前状态(如 Running)和 IP 地址等信息作为状态更新,上报给 kube-apiserver。
                        • kube-apiserver 最终将这些状态信息写入 etcd

                      总结流程图

                      用户 kubectl -> API Server -> (写入) etcd -> Scheduler (绑定节点) -> API Server -> (更新) etcd -> 目标节点 kubelet -> 容器运行时 (拉镜像,启容器) -> CNI/CSI (配网络/存储) -> kubelet -> API Server -> (更新状态) etcd

                      核心要点:

                      • 声明式 API:用户声明“期望状态”,系统驱动“当前状态”向其靠拢。
                      • 监听与协同:所有组件都通过监听 kube-apiserver 来获取任务并协同工作。
                      • etcd 作为唯一信源:整个集群的状态始终以 etcd 中的数据为准。
                      • 组件职责分离:Scheduler 只管调度,kubelet 只管执行,API Server 只管交互和存储。
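
                      想直观看到这条链路,可以在创建 Pod 的同时观察事件流(<pod-name> 为占位符):

                      # 另开一个终端实时观察事件,可以看到 Scheduled、Pulling、Created、Started 等阶段
                      kubectl get events --watch --field-selector involvedObject.kind=Pod

                      # 创建完成后查看单个 Pod 的事件与状态变化
                      kubectl describe pod <pod-name> | grep -A 15 "Events:"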
                      Mar 7, 2024

                      Deleting A Pod

                      删除一个 Pod 的流程与创建过程相对应,但它更侧重于如何优雅地、安全地终止一个运行中的实例。这个过程同样涉及多个组件的协同。

                      下面是一个 Pod 的删除流程,但它的核心是体现 Kubernetes 的优雅终止机制。


                      删除流程的核心阶段

                      阶段一:用户发起删除指令

                      1. 用户执行命令:用户执行 kubectl delete pod <pod-name>
                      2. API Server 接收请求
                        • kubectl 向 kube-apiserver 发送一个 DELETE 请求。
                        • kube-apiserver 会进行认证、授权等验证。
                      3. “标记为删除”:验证通过后,kube-apiserver 不会立即从 etcd 中删除该 Pod 对象,而是会执行一个关键操作:为 Pod 对象设置一个“删除时间戳”(deletionTimestamp)并将其标记为 Terminating 状态。这个状态会更新到 etcd 中。

                      阶段二:控制平面与节点的通知

                      1. 组件感知变化
                        • 所有监听 kube-apiserver 的组件(如 kube-scheduler, 各个节点的 kubelet)都会立刻感知到这个 Pod 的状态已变为 Terminating
                        • Endpoint Controller 会立刻将这个 Pod 的 IP 从关联的 Service 的 Endpoints(或 EndpointSlice)列表中移除。这意味着新的流量不会再被负载均衡到这个 Pod 上

                      阶段三:节点上的优雅终止

                      这是最关键的阶段,发生在 Pod 所在的工作节点上。

                      1. kubelet 监听到状态变化:目标节点上的 kubelet 通过 watch 机制发现它管理的某个 Pod 被标记为 Terminating

                      2. 触发优雅关闭序列

                        • 第1步:执行 PreStop Hook(如果配置了的话) kubelet 会首先执行 Pod 中容器定义的 preStop 钩子。这是一个在发送终止信号之前执行的特定命令或 HTTP 请求。常见用途包括:
                          • 通知上游负载均衡器此实例正在下线。
                          • 让应用完成当前正在处理的请求。
                          • 执行一些清理任务。
                        • 第2步:发送 SIGTERM 信号 kubelet 通过容器运行时向 Pod 中的每个容器的主进程发送 SIGTERM(信号 15)信号。这是一个“优雅关闭”信号,通知应用:“你即将被终止,请保存状态、完成当前工作并自行退出”。
                          • 注意SIGTERMpreStop Hook 是并行执行的。Kubernetes 会等待两者中的一个先完成,再进入下一步。
                      3. 等待终止宽限期

                        • 在发送 SIGTERM 之后,Kubernetes 不会立即杀死容器。它会等待一个称为 terminationGracePeriodSeconds 的时长(默认为 30 秒)。
                        • 理想情况下,容器内的应用程序捕获到 SIGTERM 信号后,会开始优雅关闭流程,并在宽限期内自行退出。

                      阶段四:强制终止与清理

                      1. 宽限期后的处理

                        • 情况A:优雅关闭成功:如果在宽限期内,所有容器都成功停止,kubelet 会通知容器运行时清理容器资源,然后进行下一步。
                        • 情况B:优雅关闭失败:如果宽限期结束后,容器仍未停止,kubelet 会触发强制杀死。它向容器的主进程发送 SIGKILL(信号 9) 信号,该信号无法被捕获或忽略,会立即终止进程。
                      2. 清理资源

                        • 容器被强制或优雅地终止后,kubelet 会通过容器运行时清理容器资源。
                        • 同时,kubelet 会清理 Pod 的网络资源(通过 CNI 插件)和存储资源(卸载 Volume)。
                      3. 上报最终状态

                        • kubelet 向 kube-apiserver 发送最终信息,确认 Pod 已完全停止。
                        • kube-apiserver 随后从 etcd 中正式删除该 Pod 的对象记录。至此,这个 Pod 才真正从系统中消失。

                      总结流程图

                      用户 kubectl delete -> API Server -> (在etcd中标记Pod为 Terminating) -> Endpoint Controller (从Service中移除IP) -> 目标节点 kubelet -> 执行 PreStop Hook -> 发送 SIGTERM 信号 -> (等待 terminationGracePeriodSeconds) -> [成功则清理 / 失败则发送 SIGKILL] -> 清理网络/存储 -> kubelet -> API Server -> (从etcd中删除对象)

                      关键要点

                      1. 优雅终止是核心:Kubernetes 给了应用一个自我清理的机会,这是保证服务无损发布和滚动更新的基石。
                      2. 流量切断先行:Pod 被从 Service 的 Endpoints 中移除是第一步,这确保了在 Pod 开始关闭前,不会有新流量进来。
                      3. 两个关键配置
                        • terminationGracePeriodSeconds:决定了应用有多长时间来自行关闭。
                        • preStop Hook:提供了一个主动执行关闭脚本的机会,比单纯等待 SIGTERM 更可靠。
                      4. 强制终止作为保障:如果应用无法响应优雅关闭信号,Kubernetes 有最后的强制手段来保证资源被释放。

                      理解这个流程对于设计健壮的、能够正确处理关闭信号的微服务至关重要。
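
                      下面是一个最小化的优雅终止配置草图(preStop 中的 sleep 仅作示意,实际应替换为应用自身的排水/清理逻辑):

                      kubectl apply -f - <<'EOF'
                      apiVersion: v1
                      kind: Pod
                      metadata:
                        name: graceful-demo
                      spec:
                        terminationGracePeriodSeconds: 60   # 给应用最多 60 秒自行退出
                        containers:
                        - name: app
                          image: nginx
                          lifecycle:
                            preStop:
                              exec:
                                # 示例:等待 10 秒,模拟“摘除流量 + 处理完存量请求”
                                command: ["sh", "-c", "sleep 10"]
                      EOF

                      # 删除时观察 Pod 在 Terminating 状态停留的时间
                      kubectl delete pod graceful-demo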

                      Mar 7, 2024

                      Deployment VS ReplicaSet

                      下面我会从 架构、工作流、控制循环、数据结构与事件链 等层面详细说明它们是怎么工作的。


                      🧩 一、核心概念层次关系

                      先看一下层级:

                      Deployment → ReplicaSet → Pod
                      | 层级 | 职责 | 控制器类型 |
                      | --- | --- | --- |
                      | Deployment | 负责声明“应用版本”和“滚动更新策略” | 高级控制器(managing controller) |
                      | ReplicaSet | 保证指定数量的 Pod 副本数 | 基础控制器(ensuring controller) |
                      | Pod | 最小可调度单元,运行实际容器 | 工作负载对象 |

                      可以理解为:

                      Deployment 是策略控制器,ReplicaSet 是数量控制器,Pod 是执行单元。


                      ⚙️ 二、Deployment 的工作原理(上层控制器)

                      1️⃣ Deployment 对象定义

                      你在创建一个 Deployment 时,例如:

                      apiVersion: apps/v1
                      kind: Deployment
                      metadata:
                        name: webapp
                      spec:
                        replicas: 3
                        selector:
                          matchLabels:
                            app: webapp
                        template:
                          metadata:
                            labels:
                              app: webapp
                          spec:
                            containers:
                            - name: nginx
                              image: nginx:1.25

                      这会创建一个 Deployment 对象并写入 etcd。


                      2️⃣ Deployment Controller 发现新对象

                      kube-controller-manager 中的 Deployment Controller 通过 Informer + Shared Indexer Cache 订阅(watch)Deployment 资源变化。

                      一旦发现新 Deployment,它会执行以下逻辑:

                      syncDeployment(deployment):
                          rsList := list ReplicaSets matching deployment.selector
                          newRS := findReplicaSetMatchingPodTemplate(deployment.spec.template)
                          if newRS == nil:
                              newRS = createReplicaSet(deployment.spec.template)
                          adjustReplicaCounts(newRS, oldRSList)

                      3️⃣ ReplicaSet 的创建与管理

                      • 如果模板(spec.template)发生变化(例如镜像从 nginx:1.25 改为 nginx:1.26), Deployment Controller 会创建一个新的 ReplicaSet
                      • 旧 ReplicaSet 会被缩容,新的被扩容,形成滚动更新

                      例如:

                      Deployment: webapp
                       ├── ReplicaSet (nginx:1.25) — scale 2
                       └── ReplicaSet (nginx:1.26) — scale 1

                      当更新完成后:

                      Deployment: webapp
                       └── ReplicaSet (nginx:1.26) — scale 3

                      4️⃣ 滚动更新策略(核心逻辑)

                      Deployment 的滚动更新通过 RollingUpdateDeployment 类型实现:

                      strategy:
                        type: RollingUpdate
                        rollingUpdate:
                          maxUnavailable: 1
                          maxSurge: 1

                      控制循环每次检查:

                      • 有多少可用副本?
                      • 是否可以再启动新的 Pod(受 maxSurge 限制)?
                      • 是否可以终止旧 Pod(受 maxUnavailable 限制)?

                      Controller 会在每次 sync 时:

                      1. 创建新 ReplicaSet 的一个 Pod;
                      2. 等待其 Ready
                      3. 再删除旧 ReplicaSet 的一个 Pod;
                      4. 重复直到达到目标状态。
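
                      可以用下面的草图实际观察上述滚动节奏(沿用本文的 webapp Deployment 示例,假设其 Pod 标签为 app: webapp):

                      # 触发一次镜像更新
                      kubectl set image deployment/webapp nginx=nginx:1.26

                      # 观察滚动更新进度,直到全部副本就绪
                      kubectl rollout status deployment/webapp

                      # 更新过程中可以看到新旧两个 ReplicaSet 的副本数此消彼长
                      kubectl get replicasets -l app=webapp --watch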

                      5️⃣ ReplicaSet 与 Deployment 的“拥有关系”

                      Deployment 创建 ReplicaSet 时会设置 OwnerReference:

                      metadata:
                        ownerReferences:
                        - apiVersion: apps/v1
                          kind: Deployment
                          name: webapp
                          uid: <deployment-uid>

                      这样 kube-controller-manager 知道 该 ReplicaSet 属于哪个 Deployment, 而 Kubernetes 的垃圾回收器(GC Controller)会自动删除孤立的 ReplicaSet。


                      🧮 三、ReplicaSet 的内部机制(下层控制器)

                      ReplicaSet Controller 的逻辑相对简单:

                      syncReplicaSet(rs):
                          desired := rs.spec.replicas
                          actual := countPodsMatchingSelector(rs.selector)
                          if desired > actual:
                              createPods(desired - actual)
                          else if desired < actual:
                              deletePods(actual - desired)

                      也就是说,ReplicaSet 只关注Pod 数量是否符合期望。 它并不关心版本,也不关心策略。


                      1️⃣ Pod 的创建

                      ReplicaSet Controller 调用 PodTemplateSpec 生成新的 Pod:

                      pod := newPodFromTemplate(rs.spec.template)
                      pod.OwnerReferences = [rs.UID]

                      然后通过 API Server 创建 Pod 对象。


                      2️⃣ Pod 的调度与状态同步

                      创建后的 Pod 会由 kube-scheduler 调度到某个 Node 上, kubelet 拉起容器后汇报状态,ReplicaSet 通过 Informer 感知到变化并更新 status.availableReplicas


                      🧠 四、两者的控制循环(Control Loop)

                      可以用伪代码总结整个层次的循环:

                      while true:
                        # Deployment Controller Loop
                        for each Deployment:
                          reconcile Deployment → ensure right ReplicaSets exist and scaled properly
                      
                        # ReplicaSet Controller Loop
                        for each ReplicaSet:
                          reconcile ReplicaSet → ensure correct number of Pods exist
                      
                        # Pod Controller Loop (via kubelet)
                        for each Pod:
                          reconcile Pod → ensure container running

                      整个系统靠 Informer + WorkQueue + Reconcile 实现最终一致性。 Kubernetes 的控制循环是幂等的(idempotent)——无论运行多少次,结果都一致。


                      🔁 五、事件链(从创建到Pod运行)

                      | 阶段 | 执行者 | 操作 |
                      | --- | --- | --- |
                      | 用户 | kubectl apply | 提交 Deployment |
                      | apiserver | kube-apiserver | 将 Deployment 写入 etcd |
                      | controller-manager | Deployment Controller | 创建 ReplicaSet |
                      | controller-manager | ReplicaSet Controller | 创建 Pod |
                      | scheduler | kube-scheduler | 绑定 Node |
                      | kubelet | kubelet | 在节点上拉取镜像并运行容器 |
                      | controller-manager | Deployment Controller | 更新 ReplicaSet 状态,完成滚动更新 |

                      🧰 六、垃圾回收与历史版本

                      • Deployment 默认保留最近 10 个历史 ReplicaSet:

                        revisionHistoryLimit: 10
                      • 超过限制的旧 ReplicaSet 会自动被清理;

                      • 你可以通过:

                        kubectl rollout history deployment webapp
                        kubectl rollout undo deployment webapp --to-revision=2

                        实现回滚。


                      🧩 七、关键点总结

                      | 项目 | Deployment | ReplicaSet |
                      | --- | --- | --- |
                      | 职责 | 管理版本与更新策略 | 管理副本数量 |
                      | 是否直接创建 Pod | 否,通过 ReplicaSet | 是 |
                      | 更新策略 | 支持滚动、暂停、回滚 | 不支持 |
                      | 典型控制循环 | 调整 ReplicaSet | 调整 Pod |
                      | 与 Pod 的关系 | 间接控制 | 直接控制 |

                      💡 八、类比理解

                      你可以这样比喻:

                      • Deployment = “项目经理” 管理不同版本的 ReplicaSet,控制滚动更新节奏。
                      • ReplicaSet = “小组长” 保证自己手下(Pods)的人数正确。
                      • Pod = “员工” 实际干活的单位。
                      Mar 7, 2024

                      Endpoint VS EndpointSlice

                      EndpointEndpointSlice 都是 Kubernetes 中用于管理服务后端端点的资源,但 EndpointSlice 是更现代、更高效的解决方案。以下是它们的详细区别:

                      一、基本概念对比

                      Endpoint(传统方式)

                      apiVersion: v1
                      kind: Endpoints
                      metadata:
                        name: my-service
                      subsets:
                        - addresses:
                          - ip: 10.244.1.5
                            targetRef:
                              kind: Pod
                              name: pod-1
                          - ip: 10.244.1.6
                            targetRef:
                              kind: Pod
                              name: pod-2
                          ports:
                          - port: 8080
                            protocol: TCP

                      EndpointSlice(现代方式)

                      apiVersion: discovery.k8s.io/v1
                      kind: EndpointSlice
                      metadata:
                        name: my-service-abc123
                        labels:
                          kubernetes.io/service-name: my-service
                      addressType: IPv4
                      ports:
                        - name: http
                          protocol: TCP
                          port: 8080
                      endpoints:
                        - addresses:
                          - "10.244.1.5"
                          conditions:
                            ready: true
                          targetRef:
                            kind: Pod
                            name: pod-1
                          zone: us-west-2a
                        - addresses:
                          - "10.244.1.6"
                          conditions:
                            ready: true
                          targetRef:
                            kind: Pod
                            name: pod-2
                          zone: us-west-2b

                      二、核心架构差异

                      1. 数据模型设计

                      | 特性 | Endpoint | EndpointSlice |
                      | --- | --- | --- |
                      | 存储结构 | 单个大对象 | 多个分片对象 |
                      | 规模限制 | 所有端点在一个对象中 | 自动分片(默认最多 100 个端点/片) |
                      | 更新粒度 | 全量更新 | 增量更新 |

                      2. 性能影响对比

                      # Endpoint 的问题:单个大对象
                      # 当有 1000 个 Pod 时:
                      kubectl get endpoints my-service -o yaml
                      # 返回一个包含 1000 个地址的庞大 YAML
                      
                      # EndpointSlice 的解决方案:自动分片
                      # 当有 1000 个 Pod 时:
                      kubectl get endpointslices -l kubernetes.io/service-name=my-service
                      # 返回 10 个 EndpointSlice,每个包含 100 个端点

                      三、详细功能区别

                      1. 地址类型支持

                      Endpoint

                      • 仅支持 IP 地址
                      • 有限的元数据

                      EndpointSlice

                      addressType: IPv4  # 支持 IPv4, IPv6, FQDN
                      endpoints:
                        - addresses:
                          - "10.244.1.5"
                          conditions:
                            ready: true
                            serving: true
                            terminating: false
                          hostname: pod-1.subdomain  # 支持主机名
                          nodeName: worker-1
                          zone: us-west-2a
                          hints:
                            forZones:
                            - name: us-west-2a

                      2. 拓扑感知和区域信息

                      EndpointSlice 独有的拓扑功能

                      endpoints:
                        - addresses:
                          - "10.244.1.5"
                          conditions:
                            ready: true
                          # 拓扑信息
                          nodeName: node-1
                          zone: us-west-2a
                          # 拓扑提示,用于优化路由
                          hints:
                            forZones:
                            - name: us-west-2a

                      3. 端口定义方式

                      Endpoint

                      subsets:
                        - ports:
                          - name: http
                            port: 8080
                            protocol: TCP
                          - name: metrics
                            port: 9090
                            protocol: TCP

                      EndpointSlice

                      ports:
                        - name: http
                          protocol: TCP
                          port: 8080
                          appProtocol: http  # 支持应用层协议标识
                        - name: metrics
                          protocol: TCP  
                          port: 9090
                          appProtocol: https

                      四、实际使用场景

                      1. 大规模服务(500+ Pods)

                      Endpoint 的问题

                      # 更新延迟:单个大对象的序列化/反序列化
                      # 网络开销:每次更新传输整个端点列表
                      # 内存压力:客户端需要缓存整个端点列表

                      EndpointSlice 的优势

                      # 增量更新:只更新变化的切片
                      # 并行处理:多个切片可以并行处理
                      # 内存友好:客户端只需关注相关切片

                      2. 多区域部署

                      EndpointSlice 的拓扑感知

                      apiVersion: discovery.k8s.io/v1
                      kind: EndpointSlice
                      metadata:
                        name: multi-zone-service-1
                        labels:
                          kubernetes.io/service-name: multi-zone-service
                      addressType: IPv4
                      ports:
                        - name: http
                          protocol: TCP
                          port: 8080
                      endpoints:
                        - addresses:
                          - "10.244.1.10"
                          conditions:
                            ready: true
                          zone: zone-a
                          nodeName: node-zone-a-1
                      ---
                      apiVersion: discovery.k8s.io/v1
                      kind: EndpointSlice  
                      metadata:
                        name: multi-zone-service-2
                        labels:
                          kubernetes.io/service-name: multi-zone-service
                      addressType: IPv4
                      ports:
                        - name: http
                          protocol: TCP
                          port: 8080
                      endpoints:
                        - addresses:
                          - "10.244.2.10"
                          conditions:
                            ready: true
                          zone: zone-b
                          nodeName: node-zone-b-1

                      3. 金丝雀发布和流量管理

                      EndpointSlice 提供更细粒度的控制

                      # 金丝雀版本的 EndpointSlice
                      apiVersion: discovery.k8s.io/v1
                      kind: EndpointSlice
                      metadata:
                        name: canary-service-version2
                        labels:
                          kubernetes.io/service-name: my-service
                          version: "v2"  # 自定义标签用于选择
                      addressType: IPv4
                      ports:
                        - name: http
                          protocol: TCP
                          port: 8080
                      endpoints:
                        - addresses:
                          - "10.244.3.10"
                          conditions:
                            ready: true

                      五、运维和管理差异

                      1. 监控方式

                      Endpoint 监控

                      # 检查单个 Endpoint 对象
                      kubectl get endpoints my-service
                      kubectl describe endpoints my-service
                      
                      # 监控端点数量
                      kubectl get endpoints my-service -o jsonpath='{.subsets[0].addresses[*].ip}' | wc -w

                      EndpointSlice 监控

                      # 检查所有相关切片
                      kubectl get endpointslices -l kubernetes.io/service-name=my-service
                      
                      # 查看切片详细信息
                      kubectl describe endpointslices my-service-abc123
                      
                      # 统计总端点数量
                      kubectl get endpointslices -l kubernetes.io/service-name=my-service -o jsonpath='{.items[*].endpoints[*].addresses[*]}' | wc -w

                      2. 故障排查

                      Endpoint 排查

                      # 检查端点状态
                      kubectl get endpoints my-service -o yaml | grep -A 5 -B 5 "not-ready"
                      
                      # 检查控制器日志
                      kubectl logs -n kube-system kube-controller-manager-xxx | grep endpoints

                      EndpointSlice 排查

                      # 检查切片状态
                      kubectl get endpointslices --all-namespaces
                      
                      # 检查端点就绪状态
                      kubectl get endpointslices -l kubernetes.io/service-name=my-service -o jsonpath='{range .items[*]}{.endpoints[*].conditions.ready}{end}'
                      
                      # 检查 EndpointSlice 控制器(内置于 kube-controller-manager,没有独立的 Deployment)
                      kubectl logs -n kube-system kube-controller-manager-xxx | grep -i endpointslice

                      六、迁移和兼容性

                      1. 自动迁移

                      Kubernetes 1.21+ 默认同时维护两者:

                      # 在较老版本(1.21 之前)需显式启用 EndpointSlice 特性门控,1.21+ 已默认开启
                      kube-apiserver --feature-gates=EndpointSlice=true
                      kube-controller-manager --feature-gates=EndpointSlice=true
                      kube-proxy --feature-gates=EndpointSlice=true

                      2. 检查集群状态

                      # 检查 EndpointSlice 是否启用
                      kubectl get apiservices | grep discovery.k8s.io
                      
                      # 检查特性门控
                      kube-apiserver -h | grep EndpointSlice
                      
                      # 验证 kube-controller-manager 运行状态(EndpointSlice 控制器内置其中)
                      kubectl get pods -n kube-system -l component=kube-controller-manager

                      七、性能基准对比

                      | 场景 | Endpoint | EndpointSlice | 改进 |
                      | --- | --- | --- | --- |
                      | 1000 个 Pod 更新 | 2-3 秒 | 200-300ms | 约 10 倍 |
                      | 网络带宽使用 | 高(全量传输) | 低(增量传输) | 减少 60-80% |
                      | 内存使用 | 高(大对象缓存) | 低(分片缓存) | 减少 50-70% |
                      | CPU 使用 | 高(序列化成本) | 低(并行处理) | 减少 40-60% |

                      八、最佳实践

                      1. 新集群配置

                      # kube-apiserver 配置
                      apiVersion: v1
                      kind: Pod
                      metadata:
                        name: kube-apiserver
                        namespace: kube-system
                      spec:
                        containers:
                        - command:
                          - kube-apiserver
                          - --feature-gates=EndpointSlice=true
                          # 注意:--endpointslice-updates-batch-period=1s(批量更新周期)是 kube-controller-manager 的参数,应配置在 controller-manager 上

                      2. 应用程序适配

                      // 使用 EndpointSlice 感知的客户端(片段示意,省略错误处理;updateLoadBalancerConfig 为业务自定义函数)
                      import (
                          discoveryv1 "k8s.io/api/discovery/v1"
                          "k8s.io/client-go/informers"
                          "k8s.io/client-go/kubernetes"
                          "k8s.io/client-go/tools/cache"
                          "k8s.io/client-go/tools/clientcmd"
                      )
                      
                      // 构建 clientset 与共享 Informer 工厂(这里假设使用本地 kubeconfig)
                      config, _ := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
                      clientset, _ := kubernetes.NewForConfig(config)
                      factory := informers.NewSharedInformerFactory(clientset, 0)
                      endpointSliceInformer := factory.Discovery().V1().EndpointSlices()
                      
                      // 监听 EndpointSlice 变化
                      endpointSliceInformer.Informer().AddEventHandler(cache.ResourceEventHandlerFuncs{
                          AddFunc: func(obj interface{}) {
                              endpointSlice := obj.(*discoveryv1.EndpointSlice)
                              if endpointSlice.Labels["kubernetes.io/service-name"] == "my-service" {
                                  updateLoadBalancerConfig(endpointSlice)
                              }
                          },
                      })

                      总结

                      | 维度 | Endpoint | EndpointSlice |
                      | --- | --- | --- |
                      | 设计年代 | Kubernetes 早期 | Kubernetes 1.16+ |
                      | 扩展性 | 差(单对象限制) | 优秀(自动分片) |
                      | 性能 | 一般(全量更新) | 优秀(增量更新) |
                      | 功能特性 | 基础功能 | 拓扑感知、多协议等 |
                      | 推荐使用 | 传统集群/兼容性 | 新集群/大规模部署 |
                      | 未来方向 | 逐步淘汰 | 标准解决方案 |

                      简单建议

                      • 新集群:直接使用 EndpointSlice
                      • 现有集群:逐步迁移到 EndpointSlice
                      • 大规模服务:必须使用 EndpointSlice
                      • 小规模测试:两者均可,但建议使用 EndpointSlice 以获得更好的未来兼容性

                      EndpointSlice 代表了 Kubernetes 服务发现架构的现代化方向,特别是在大规模和高动态环境中提供了显著的性能与可扩展性改进。

                      Mar 7, 2024

                      ETCD如何调优

                      好的,Kubernetes 集群的稳定性和性能极大地依赖于其数据存储组件 etcd。对 etcd 进行调优是保障生产环境 K8s 集群高效、稳定运行的关键步骤。

                      下面我将从核心原则、性能调优参数、操作系统调优、Kubernetes 相关配置、监控与维护等多个维度,详细讲解如何对 K8s 上的 etcd 进行调优。

                      一、核心原则与前提

                      1. 硬件是基础:在考虑软件参数调优前,必须确保硬件资源充足且高性能。

                        • CPU:需要足够的计算能力,特别是在高负载下进行压缩、序列化等操作时。
                        • 内存:etcd 的内存消耗与总键值对数量和大小正相关。足够的内存是保证性能的关键。建议至少 8GB,生产环境推荐 16GB 或以上。
                        • 磁盘:这是最重要的因素。必须使用高性能的 SSD(NVMe SSD 最佳)。etcd 的每次写入都需持久化到磁盘,磁盘的写入延迟(Write Latency)直接决定了 etcd 的写入性能。避免使用网络存储(如 NFS)。
                        • 网络:低延迟、高带宽的网络对于 etcd 节点间同步至关重要。如果 etcd 以集群模式运行,所有节点应位于同一个数据中心或低延迟的可用区。
                      2. 备份!备份!备份!:在进行任何调优或配置更改之前,务必对 etcd 数据进行完整备份。误操作可能导致数据损坏或集群不可用。

                      二、etcd 命令行参数调优

                      etcd 主要通过其启动时的命令行参数进行调优。如果你使用 kubeadm 部署,这些参数通常配置在 /etc/kubernetes/manifests/etcd.yaml 静态 Pod 清单中。

                      1. 存储配额与压缩

                      为了防止磁盘耗尽,etcd 设有存储配额。一旦超过配额,它将进入维护模式,只能读不能写,并触发告警。

                      • --quota-backend-bytes:设置 etcd 数据库的后端存储大小上限。默认是 2GB。对于生产环境,建议设置为 8GB 到 16GB(例如 8589934592 表示 8GB)。设置过大会影响备份和恢复时间。
                      • --auto-compaction-mode--auto-compaction-retention:etcd 会累积历史版本,需要定期压缩来回收空间。
                        • --auto-compaction-mode:通常设置为 periodic(按时间周期)。
                        • --auto-compaction-retention:设置保留多长时间的历史数据。例如 "1h" 表示保留 1 小时,"10m" 表示保留 10 分钟。对于变更频繁的集群(例如运行大量 CronJob 的集群),建议设置为较短的周期,如 "10m" 或 "30m"。

                      示例配置片段(在 etcd.yaml 中):

                      spec:
                        containers:
                        - command:
                          - etcd
                          ...
                          - --quota-backend-bytes=8589934592    # 8GB
                          - --auto-compaction-mode=periodic
                          - --auto-compaction-retention=10m     # 每10分钟压缩一次历史版本
                          ...

                      2. 心跳与选举超时

                      这些参数影响集群的领导者选举和节点间的心跳检测,对网络延迟敏感。

                      • --heartbeat-interval:领导者向追随者发送心跳的间隔。建议设置为 100 到 300 毫秒之间。网络环境好可以设小(如 100),不稳定则设大(如 300)。
                      • --election-timeout:追随者等待多久没收到心跳后开始新一轮选举。此值必须是心跳间隔的 5-10 倍。建议设置在 1000 到 3000 毫秒之间。

                      规则:heartbeat-interval * 10 >= election-timeout

                      示例配置:

                          - --heartbeat-interval=200
                          - --election-timeout=2000

                      3. 快照

                      etcd 通过快照来持久化其状态。

                      • --snapshot-count:指定在制作一次快照前,最多提交多少次事务。默认值是 100,000。在内存充足且磁盘 IO 性能极高的环境下,可以适当调低此值(如 50000)以在崩溃后更快恢复,但这会略微增加磁盘 IO 负担。通常使用默认值即可。

                      三、操作系统与运行时调优

                      1. 磁盘 I/O 调度器

                      对于 SSD,将 I/O 调度器设置为 nonenoop 通常能获得更好的性能。

                      # 查看当前调度器
                      cat /sys/block/[你的磁盘,如 sda]/queue/scheduler
                      
                      # 临时修改
                      echo 'noop' > /sys/block/sda/queue/scheduler
                      
                      # 永久修改,在 /etc/default/grub 中添加或修改
                      GRUB_CMDLINE_LINUX_DEFAULT="... elevator=noop"
                      
                      # 然后更新 grub 并重启
                      sudo update-grub

                      2. 文件系统

                      使用 XFS 或 ext4 文件系统,它们对 etcd 的工作负载有很好的支持。

                      在 /etc/fstab 中为 etcd 数据目录所在分区添加 noatime 选项(注意:ssd 挂载选项仅适用于 btrfs,ext4/XFS 并不支持):

                      UUID=... /var/lib/etcd ext4 defaults,noatime 0 0

                      3. 提高文件描述符和进程数限制

                      etcd 可能会处理大量并发连接。

                      # 在 /etc/security/limits.conf 中添加
                      * soft nofile 65536
                      * hard nofile 65536
                      * soft nproc 65536
                      * hard nproc 65536

                      4. 网络参数调优

                      调整内核网络参数,特别是在高负载环境下。

                      /etc/sysctl.conf 中添加:

                      net.core.somaxconn = 1024
                      net.ipv4.tcp_keepalive_time = 600
                      net.ipv4.tcp_keepalive_intvl = 60
                      net.ipv4.tcp_keepalive_probes = 10

                      执行 sysctl -p 使其生效。

                      四、Kubernetes 相关调优

                      1. 资源请求和限制

                      etcd.yaml 中为 etcd 容器设置合适的资源限制,防止其因资源竞争而饿死。

                          resources:
                            requests:
                              memory: "1Gi"
                              cpu: "500m"
                            limits:
                              memory: "8Gi"  # 根据你的 --quota-backend-bytes 设置,确保内存足够
                              cpu: "2"

                      2. API Server 的 --etcd-compaction-interval

                      在 kube-apiserver 的启动参数中,这个参数控制它请求 etcd 进行压缩的周期。建议与 etcd 的 --auto-compaction-retention 保持一致或略大。

                      五、监控与维护

                      1. 监控关键指标

                      使用 Prometheus 等工具监控 etcd,重点关注以下指标:

                      • etcd_disk_wal_fsync_duration_seconds:WAL 日志同步到磁盘的延迟。这是最重要的指标,P99 值应低于 25ms。
                      • etcd_disk_backend_commit_duration_seconds:后端数据库提交的延迟。
                      • etcd_server_leader_changes_seen_total:领导者变更次数。频繁变更表明集群不稳定。
                      • etcd_server_has_leader:当前节点是否认为有领导者(1 为是,0 为否)。
                      • etcd_mvcc_db_total_size_in_bytes:当前数据库大小,用于判断是否接近存储配额。
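
                      这些指标都可以直接从 etcd 的 /metrics 端点抓取。下面是一个快速自检草图(证书路径以 kubeadm 默认部署为例,按实际环境替换):

                      # 抓取 etcd 指标并筛选关键项
                      curl -s --cacert /etc/kubernetes/pki/etcd/ca.crt \
                           --cert /etc/kubernetes/pki/etcd/server.crt \
                           --key /etc/kubernetes/pki/etcd/server.key \
                           https://127.0.0.1:2379/metrics | \
                        grep -E 'etcd_disk_wal_fsync_duration_seconds|etcd_server_has_leader|etcd_mvcc_db_total_size_in_bytes'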

                      2. 定期进行碎片整理

                      即使开启了自动压缩,etcd 的数据库文件内部仍会产生碎片。当 etcd_mvcc_db_total_size_in_bytes 接近 --quota-backend-bytes 时,即使实际数据量没那么多,也需要在线进行碎片整理。

                      # 在任一 etcd 节点上执行
                      ETCDCTL_API=3 etcdctl --endpoints=https://127.0.0.1:2379 \
                        --cacert=/path/to/ca.crt \
                        --cert=/path/to/etcd-client.crt \
                        --key=/path/to/etcd-client.key \
                        defrag

                      注意:执行 defrag 会阻塞所有请求,应在业务低峰期进行,并逐个对集群成员执行。
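
                      碎片整理前后可以用 endpoint status 对比数据库大小,确认空间确实被回收(证书路径为占位符,按实际环境替换):

                      ETCDCTL_API=3 etcdctl --endpoints=https://127.0.0.1:2379 \
                        --cacert=/path/to/ca.crt \
                        --cert=/path/to/etcd-client.crt \
                        --key=/path/to/etcd-client.key \
                        endpoint status --write-out=table   # DB SIZE 一列即当前数据库文件大小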

                      调优总结与检查清单

                      1. 硬件过关:确认使用 SSD,内存充足。
                      2. 设置存储配额和自动压缩--quota-backend-bytes=8G, --auto-compaction-retention=10m
                      3. 调整心跳与选举超时--heartbeat-interval=200, --election-timeout=2000
                      4. 操作系统优化:I/O 调度器、文件系统挂载选项、文件描述符限制。
                      5. 配置合理的资源限制:防止 etcd 容器因资源不足被 Kill。
                      6. 开启并关注监控:特别是磁盘同步延迟和领导者变更。
                      7. 定期维护:根据监控指标,在需要时进行碎片整理。

                      对于大多数场景,调整存储配额与压缩心跳与选举超时以及确保高性能磁盘,就能解决绝大部分性能问题。调优是一个持续的过程,需要结合监控数据不断调整。

                      Mar 7, 2024

                      Flannel VS Calico

                      Calico 和 Flannel 是 Kubernetes 中最著名和最常见的两种网络插件(CNI),但它们的设计哲学、实现方式和能力有显著区别。

                      简单来说:

                      • Flannel 追求的是简单和易用,提供足够的基础网络功能。
                      • Calico 追求的是性能和功能,提供强大的网络策略和高性能网络。

                      下面我们从多个维度进行详细对比。


                      核心对比一览表

                      | 特性 | Flannel | Calico |
                      | --- | --- | --- |
                      | 核心设计哲学 | 简单、最小化 | 高性能、功能丰富 |
                      | 网络模型 | Overlay 网络 | 纯三层路由(可选 Overlay) |
                      | 数据平面 | VXLAN(推荐)、Host-gw、UDP | BGP(推荐)、VXLAN、Windows |
                      | 性能 | 较好(VXLAN 有封装开销) | 极高(BGP 模式下无封装开销) |
                      | 网络策略 | 不支持(需安装 Cilium 等) | 原生支持(强大的网络策略) |
                      | 安全性 | 基础 | 高级(基于标签的微隔离) |
                      | 配置与维护 | 非常简单,几乎无需配置 | 相对复杂,功能多配置项也多 |
                      | 适用场景 | 学习、测试、中小型集群,需求简单 | 生产环境、大型集群、对性能和安全要求高 |

                      深入剖析

                      1. 网络模型与工作原理

                      这是最根本的区别。

                      • Flannel (Overlay Network)

                        • 工作原理:它在底层物理网络之上再构建一个虚拟的“覆盖网络”。当数据包从一个节点的Pod发送到另一个节点的Pod时,Flannel会将它封装在一个新的网络包中(如VXLAN)。
                        • 类比:就像在一封普通信件(Pod的原始数据包)外面套了一个标准快递袋(VXLAN封装),快递系统(底层网络)只关心快递袋上的地址(节点IP),不关心里面的内容。到达目标节点后,再拆开快递袋,取出里面的信。
                        • 优势:对底层网络要求低,只要节点之间IP能通即可,兼容性好。
                        • 劣势:封装和解封装有额外的CPU开销,并且会增加数据包的大小( overhead),导致性能略有下降。
                      • Calico (Pure Layer 3)

                        • 工作原理(BGP模式):它不使用封装,而是使用BGP路由协议。每个K8s节点都像一个路由器,它通过BGP协议向集群中的其他节点宣告:“发往这些Pod IP的流量,请送到我这里来”。
                        • 类比:就像整个数据中心是一个大的邮政系统,每个邮局(节点)都知道去往任何地址(Pod IP)的最短路径,信件(数据包)可以直接投递,无需额外包装。
                        • 优势性能高,无封装开销,延迟低,吞吐量高。
                        • 劣势:要求底层网络必须支持BGP或者支持主机路由(某些云平台或网络设备可能需要特定配置)。

                      注意:Calico也支持VXLAN模式(通常用于网络策略要求BGP但底层网络不支持的场景),但其最佳性能是在BGP模式下实现的。

                      2. 网络策略

                      这是两者功能性的一个巨大分水岭。

                      • Flannel本身不提供任何网络策略能力。它只负责打通网络,让所有Pod默认可以相互通信。如果你需要实现Pod之间的访问控制(微隔离),你必须额外安装一个网络策略控制器,如 CiliumCalico本身(可以只使用其策略部分,与Flannel叠加使用)。

                      • Calico原生支持强大的Kubernetes NetworkPolicy。你可以定义基于Pod标签、命名空间、端口、协议甚至DNS名称的精细规则,来控制Pod的入站和出站流量。这对于实现“零信任”安全模型至关重要。

                      3. 性能

                      • Calico (BGP模式):由于其纯三层的转发机制,无需封装,数据包是原生IP包,其延迟更低,吞吐量更高,CPU消耗也更少。
                      • Flannel (VXLAN模式):由于存在VXLAN的封装头(通常50字节 overhead),最大传输单元会变小,封装/解封装操作也需要CPU参与,性能相比Calico BGP模式要低一些。但其 Host-gw 后端模式性能很好,前提是节点在同一个二层网络。

                      4. 生态系统与高级功能

                      • Calico:功能非常丰富,远不止基础网络。
                        • 网络策略:如上所述,非常强大。
                        • IPAM:灵活的IP地址管理。
                        • 服务网格集成:与Istio有深度集成,可以实施全局的服务到服务策略。
                        • Windows支持:对Windows节点有良好的支持。
                        • 网络诊断工具:提供了 calicoctl 等强大的运维工具。
                      • Flannel:功能相对单一,就是做好网络连通性。它“小而美”,但缺乏高级功能。

                      如何选择?

                      选择 Flannel 的情况:

                      • 新手用户:想要快速搭建一个K8s集群,不想纠结于复杂的网络配置。
                      • 测试或开发环境:需求简单,只需要Pod能通。
                      • 中小型集群:对性能和高级网络策略没有硬性要求。
                      • 底层网络受限:无法配置BGP或主机路由的环境(例如某些公有云基础网络)。

                      选择 Calico 的情况:

                      • 生产环境:对稳定性和性能有高要求。
                      • 大型集群:需要高效的路由和可扩展性。
                      • 安全要求高:需要实现Pod之间的网络隔离(微隔离)。
                      • 对网络性能极度敏感:例如AI/ML训练、高频交易等场景。
                      • 底层网络可控:例如在自建数据中心或云上支持BGP的环境。

                      总结

                      |  | Flannel | Calico |
                      | --- | --- | --- |
                      | 核心价值 | 简单可靠 | 功能强大 |
                      | 好比买车 | 丰田卡罗拉:皮实、省心、够用 | 宝马/奥迪:性能强劲、功能齐全、操控精准 |
                      | 一句话总结 | “让我快速把网络打通” | “我要一个高性能、高安全性的生产级网络” |

                      在现代Kubernetes部署中,尤其是生产环境,Calico因其卓越的性能和原生的安全能力,已经成为更主流和推荐的选择。而Flannel则在那些“只要能通就行”的简单场景中,依然保持着它的价值。
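
                      如果不确定现有集群用的是哪种 CNI,可以用下面的草图快速确认(查看 CNI 配置目录需要节点 shell 权限):

                      # 节点上的 CNI 配置文件,文件名通常能看出插件类型
                      ls /etc/cni/net.d/

                      # 查看 kube-system 下的 CNI 相关 Pod
                      kubectl get pods -n kube-system -o wide | grep -E 'calico|flannel'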

                      Mar 7, 2024

                      Headless Service VS ClusterIP

                      Headless Service vs ClusterIP 详解

                      这是 Kubernetes 中两种常见的 Service 类型,它们在服务发现和负载均衡方面有本质区别。


                      🎯 核心区别总结

                      | 维度 | ClusterIP | Headless Service |
                      | --- | --- | --- |
                      | ClusterIP 值 | 有固定的虚拟 IP | None(无 ClusterIP) |
                      | DNS 解析 | 返回 Service IP | 直接返回 Pod IP 列表 |
                      | 负载均衡 | ✅ kube-proxy 自动负载均衡 | ❌ 客户端自行选择 Pod |
                      | 适用场景 | 无状态服务 | 有状态服务、服务发现 |
                      | 典型用例 | Web 应用、API 服务 | 数据库集群、Kafka、Zookeeper |

                      📋 ClusterIP Service (默认类型)

                      定义

                      ClusterIP 是 Kubernetes 默认的 Service 类型,会分配一个虚拟 IP(Cluster IP),作为访问后端 Pod 的统一入口。

                      YAML 示例

                      apiVersion: v1
                      kind: Service
                      metadata:
                        name: my-web-service
                      spec:
                        type: ClusterIP  # 默认类型,可以省略
                        selector:
                          app: web
                        ports:
                        - protocol: TCP
                          port: 80        # Service 端口
                          targetPort: 8080  # Pod 端口

                      工作原理

                      ┌─────────────────────────────────────────┐
                      │          ClusterIP Service              │
                      │     (虚拟 IP: 10.96.100.50)             │
                      └────────────┬────────────────────────────┘
                                   │ kube-proxy 负载均衡
                                   │
                           ┌───────┴───────┬──────────┐
                           ▼               ▼          ▼
                        Pod-1          Pod-2      Pod-3
                        10.244.1.5     10.244.2.8  10.244.3.12
                        (app=web)      (app=web)   (app=web)

                      DNS 解析行为

                      # 在集群内部查询 DNS
                      nslookup my-web-service.default.svc.cluster.local
                      
                      # 输出:
                      # Name:    my-web-service.default.svc.cluster.local
                      # Address: 10.96.100.50  ← 返回 Service 的虚拟 IP
                      
                      # 客户端访问这个 IP
                      curl http://my-web-service:80
                      
                      # 请求会被 kube-proxy 自动转发到后端 Pod
                      # 默认使用 iptables 或 IPVS 做负载均衡

                      特点

                      • 统一入口:客户端只需知道 Service IP,不关心后端 Pod
                      • 自动负载均衡:kube-proxy 自动在多个 Pod 间分发流量
                      • 服务发现简单:通过 DNS 获取稳定的 Service IP
                      • 屏蔽 Pod 变化:Pod 重启或扩缩容,Service IP 不变
                      • 会话保持:可配置 sessionAffinity: ClientIP

                      负载均衡方式

                      apiVersion: v1
                      kind: Service
                      metadata:
                        name: my-service
                      spec:
                        type: ClusterIP
                        sessionAffinity: ClientIP  # 可选:会话保持(同一客户端固定到同一 Pod)
                        sessionAffinityConfig:
                          clientIP:
                            timeoutSeconds: 10800   # 会话超时时间
                        selector:
                          app: web
                        ports:
                        - port: 80
                          targetPort: 8080

                      🔍 Headless Service (无头服务)

                      定义

                      Headless Service 是不分配 ClusterIP 的特殊 Service,通过设置 clusterIP: None 创建。

                      YAML 示例

                      apiVersion: v1
                      kind: Service
                      metadata:
                        name: my-headless-service
                      spec:
                        clusterIP: None  # 🔑 关键:设置为 None
                        selector:
                          app: database
                        ports:
                        - protocol: TCP
                          port: 3306
                          targetPort: 3306

                      工作原理

                      ┌─────────────────────────────────────────┐
                      │       Headless Service (无 ClusterIP)   │
                      │              DNS 直接返回               │
                      └────────────┬────────────────────────────┘
                                   │ 没有负载均衡
                                   │ DNS 返回所有 Pod IP
                                   │
                           ┌───────┴───────┬──────────┐
                           ▼               ▼          ▼
                        Pod-1          Pod-2      Pod-3
                        10.244.1.5     10.244.2.8  10.244.3.12
                        (app=database) (app=database) (app=database)

                      DNS 解析行为

                      # 在集群内部查询 DNS
                      nslookup my-headless-service.default.svc.cluster.local
                      
                      # 输出:
                      # Name:    my-headless-service.default.svc.cluster.local
                      # Address: 10.244.1.5   ← Pod-1 IP
                      # Address: 10.244.2.8   ← Pod-2 IP
                      # Address: 10.244.3.12  ← Pod-3 IP
                      
                      # 客户端获得所有 Pod IP,自己选择连接哪个

                      特点

                      服务发现:客户端可以获取所有后端 Pod 的 IP
                      自主选择:客户端自己决定连接哪个 Pod(负载均衡逻辑由客户端实现)
                      稳定 DNS:每个 Pod 有独立的 DNS 记录
                      适合有状态服务:数据库主从、集群成员发现
                      无自动负载均衡:需要客户端或应用层实现

                      与 StatefulSet 结合(最常见用法)

                      # StatefulSet + Headless Service
                      apiVersion: v1
                      kind: Service
                      metadata:
                        name: mysql-headless
                      spec:
                        clusterIP: None
                        selector:
                          app: mysql
                        ports:
                        - port: 3306
                          name: mysql
                      ---
                      apiVersion: apps/v1
                      kind: StatefulSet
                      metadata:
                        name: mysql
                      spec:
                        serviceName: mysql-headless  # 🔑 关联 Headless Service
                        replicas: 3
                        selector:
                          matchLabels:
                            app: mysql
                        template:
                          metadata:
                            labels:
                              app: mysql
                          spec:
                            containers:
                            - name: mysql
                              image: mysql:8.0
                              ports:
                              - containerPort: 3306

                      每个 Pod 的独立 DNS 记录

                      # StatefulSet 的 Pod 命名规则:
                      # <statefulset-name>-<ordinal>.<service-name>.<namespace>.svc.cluster.local
                      
                      # 示例:
                      mysql-0.mysql-headless.default.svc.cluster.local → 10.244.1.5
                      mysql-1.mysql-headless.default.svc.cluster.local → 10.244.2.8
                      mysql-2.mysql-headless.default.svc.cluster.local → 10.244.3.12
                      
                      # 可以直接访问特定 Pod
                      mysql -h mysql-0.mysql-headless.default.svc.cluster.local -u root -p
                      
                      # 查询所有 Pod
                      nslookup mysql-headless.default.svc.cluster.local

                      🔄 实际对比演示

                      场景 1:Web 应用(使用 ClusterIP)

                      # ClusterIP Service
                      apiVersion: v1
                      kind: Service
                      metadata:
                        name: web-service
                      spec:
                        type: ClusterIP
                        selector:
                          app: nginx
                        ports:
                        - port: 80
                          targetPort: 80
                      ---
                      # Deployment
                      apiVersion: apps/v1
                      kind: Deployment
                      metadata:
                        name: nginx
                      spec:
                        replicas: 3
                        selector:
                          matchLabels:
                            app: nginx
                        template:
                          metadata:
                            labels:
                              app: nginx
                          spec:
                            containers:
                            - name: nginx
                              image: nginx:latest
                      # 测试访问
                      kubectl run test --rm -it --image=busybox -- /bin/sh
                      
                      # 在 Pod 内执行
                      nslookup web-service
                      # 输出:只有一个 Service IP
                      
                      wget -q -O- http://web-service
                      # 请求会被自动负载均衡到 3 个 nginx Pod

                      场景 2:MySQL 主从(使用 Headless Service)

                      # Headless Service
                      apiVersion: v1
                      kind: Service
                      metadata:
                        name: mysql
                      spec:
                        clusterIP: None
                        selector:
                          app: mysql
                        ports:
                        - port: 3306
                      ---
                      # StatefulSet
                      apiVersion: apps/v1
                      kind: StatefulSet
                      metadata:
                        name: mysql
                      spec:
                        serviceName: mysql
                        replicas: 3
                        selector:
                          matchLabels:
                            app: mysql
                        template:
                          metadata:
                            labels:
                              app: mysql
                          spec:
                            containers:
                            - name: mysql
                              image: mysql:8.0
                              env:
                              - name: MYSQL_ROOT_PASSWORD
                                value: "password"
                      # 测试服务发现
                      kubectl run test --rm -it --image=busybox -- /bin/sh
                      
                      # 在 Pod 内执行
                      nslookup mysql
                      # 输出:返回 3 个 Pod IP
                      
                      # 可以连接到特定的 MySQL 实例(如主节点)
                      mysql -h mysql-0.mysql.default.svc.cluster.local -u root -p
                      
                      # 也可以连接到从节点
                      mysql -h mysql-1.mysql.default.svc.cluster.local -u root -p
                      mysql -h mysql-2.mysql.default.svc.cluster.local -u root -p

                      📊 详细对比

                      1. DNS 解析差异

                      # ClusterIP Service
                      $ nslookup web-service
                      Server:    10.96.0.10
                      Address:   10.96.0.10:53
                      
                      Name:      web-service.default.svc.cluster.local
                      Address:   10.96.100.50  ← Service 虚拟 IP
                      
                      # Headless Service
                      $ nslookup mysql-headless
                      Server:    10.96.0.10
                      Address:   10.96.0.10:53
                      
                      Name:      mysql-headless.default.svc.cluster.local
                      Address:   10.244.1.5  ← Pod-1 IP
                      Address:   10.244.2.8  ← Pod-2 IP
                      Address:   10.244.3.12 ← Pod-3 IP

                      2. 流量路径差异

                      ClusterIP 流量路径:
                      Client → Service IP (10.96.100.50)
                             → kube-proxy (iptables/IPVS)
                             → 随机选择一个 Pod
                      
                      Headless 流量路径:
                      Client → DNS 查询
                             → 获取所有 Pod IP
                             → 客户端自己选择 Pod
                             → 直接连接 Pod IP

                      3. 使用场景对比

| 场景 | ClusterIP | Headless |
|---|---|---|
| 无状态应用 | ✅ 推荐 | ❌ 不需要 |
| 有状态应用 | ❌ 不适合 | ✅ 推荐 |
| 数据库主从 | ❌ 无法区分主从 | ✅ 可以指定连接主节点 |
| 集群成员发现 | ❌ 无法获取成员列表 | ✅ 可以获取所有成员 |
| 需要负载均衡 | ✅ 自动负载均衡 | ❌ 需要客户端实现 |
| 客户端连接池 | ⚠️ 只能连接到 Service IP | ✅ 可以为每个 Pod 建立连接 |

                      🎯 典型应用场景

                      ClusterIP Service 适用场景

                      1. 无状态 Web 应用

                      apiVersion: v1
                      kind: Service
                      metadata:
                        name: frontend
                      spec:
                        type: ClusterIP
                        selector:
                          app: frontend
                        ports:
                        - port: 80
                          targetPort: 3000

                      2. RESTful API 服务

                      apiVersion: v1
                      kind: Service
                      metadata:
                        name: api-service
                      spec:
                        type: ClusterIP
                        selector:
                          app: api
                        ports:
                        - port: 8080

                      3. 微服务之间的调用

                      # Service A 调用 Service B
                      apiVersion: v1
                      kind: Service
                      metadata:
                        name: service-b
                      spec:
                        type: ClusterIP
                        selector:
                          app: service-b
                        ports:
                        - port: 9090

                      Headless Service 适用场景

                      1. MySQL 主从复制

                      apiVersion: v1
                      kind: Service
                      metadata:
                        name: mysql
                      spec:
                        clusterIP: None
                        selector:
                          app: mysql
                        ports:
                        - port: 3306
                      ---
                      # 应用连接时:
                      # 写操作 → mysql-0.mysql (主节点)
                      # 读操作 → mysql-1.mysql, mysql-2.mysql (从节点)

                      2. Kafka 集群

                      apiVersion: v1
                      kind: Service
                      metadata:
                        name: kafka
                      spec:
                        clusterIP: None
                        selector:
                          app: kafka
                        ports:
                        - port: 9092
                      ---
                      # Kafka 客户端可以发现所有 broker:
                      # kafka-0.kafka:9092
                      # kafka-1.kafka:9092
                      # kafka-2.kafka:9092

                      3. Elasticsearch 集群

                      apiVersion: v1
                      kind: Service
                      metadata:
                        name: elasticsearch
                      spec:
                        clusterIP: None
                        selector:
                          app: elasticsearch
                        ports:
                        - port: 9200
                          name: http
                        - port: 9300
                          name: transport
                      ---
                      # 集群内部节点通过 DNS 发现彼此:
                      # elasticsearch-0.elasticsearch
                      # elasticsearch-1.elasticsearch
                      # elasticsearch-2.elasticsearch

                      4. Redis 集群模式

                      apiVersion: v1
                      kind: Service
                      metadata:
                        name: redis-cluster
                      spec:
                        clusterIP: None
                        selector:
                          app: redis
                        ports:
                        - port: 6379
                          name: client
                        - port: 16379
                          name: gossip
                      ---
                      # Redis 客户端获取所有节点进行 cluster slots 查询

                      🔧 混合使用:两种 Service 同时存在

                      对于有状态服务,常见做法是同时创建两个 Service:

                      # 1. Headless Service:用于 StatefulSet 和 Pod 间通信
                      apiVersion: v1
                      kind: Service
                      metadata:
                        name: mysql-headless
                      spec:
                        clusterIP: None
                        selector:
                          app: mysql
                        ports:
                        - port: 3306
                      ---
                      # 2. ClusterIP Service:用于客户端负载均衡访问(只读副本)
                      apiVersion: v1
                      kind: Service
                      metadata:
                        name: mysql-read
                      spec:
                        type: ClusterIP
                        selector:
                          app: mysql
                          role: replica  # 只选择从节点
                        ports:
                        - port: 3306
                      ---
                      # StatefulSet
                      apiVersion: apps/v1
                      kind: StatefulSet
                      metadata:
                        name: mysql
                      spec:
                        serviceName: mysql-headless  # 使用 Headless Service
                        replicas: 3
                        # ...

                      使用方式:

                      # 写操作:直接连接主节点
                      mysql -h mysql-0.mysql-headless -u root -p
                      
                      # 读操作:通过 ClusterIP 自动负载均衡到所有从节点
                      mysql -h mysql-read -u root -p

                      🛠️ 常见问题

                      Q1: 如何选择使用哪种 Service?

                      决策流程:

                      应用是无状态的? 
                        ├─ 是 → 使用 ClusterIP
                        └─ 否 → 继续
                      
                      需要客户端感知所有 Pod?
                        ├─ 是 → 使用 Headless Service
                        └─ 否 → 继续
                      
                      需要区分不同 Pod(如主从)?
                        ├─ 是 → 使用 Headless Service + StatefulSet
                        └─ 否 → 使用 ClusterIP

                      Q2: Headless Service 没有负载均衡怎么办?

                      方案:

                      1. 客户端负载均衡:应用层实现(如 Kafka 客户端)
                      2. DNS 轮询:部分 DNS 客户端会自动轮询
                      3. 混合方案:同时创建 ClusterIP Service 用于负载均衡

                      Q3: 如何测试 Headless Service?

                      # 创建测试 Pod
                      kubectl run -it --rm debug --image=busybox --restart=Never -- sh
                      
                      # 测试 DNS 解析
                      nslookup mysql-headless.default.svc.cluster.local
                      
                      # 测试连接特定 Pod
                      wget -O- http://mysql-0.mysql-headless:3306
                      
                      # 测试所有 Pod
                      for i in 0 1 2; do
                        echo "Testing mysql-$i"
                        wget -O- http://mysql-$i.mysql-headless:3306
                      done

                      Q4: ClusterIP Service 能否用于 StatefulSet?

                      可以,但不推荐:

                      • ✅ 可以提供负载均衡
                      • ❌ 无法通过稳定的 DNS 名访问特定 Pod
                      • ❌ 不适合主从架构(无法区分主节点)

                      最佳实践:

                      • StatefulSet 使用 Headless Service
                      • 如需负载均衡,额外创建 ClusterIP Service

                      💡 关键要点总结

                      ClusterIP Service

                      ✅ 默认类型,有虚拟 IP
                      ✅ 自动负载均衡(kube-proxy)
                      ✅ 适合无状态应用
                      ✅ 客户端无需感知后端 Pod
                      ✅ DNS 解析返回 Service IP

                      Headless Service

                      ✅ 设置 clusterIP: None
                      ✅ DNS 解析返回所有 Pod IP
                      ✅ 适合有状态应用
                      ✅ 支持 Pod 级别的服务发现
                      ✅ 常与 StatefulSet 配合使用

                      选型建议

                      • Web 应用、API 服务 → ClusterIP
                      • 数据库、消息队列、分布式存储 → Headless Service
                      • 有主从/分片的应用 → Headless Service + StatefulSet
                      • 需要同时支持负载均衡和直接访问 → 两种 Service 都创建
                      Mar 7, 2024

                      Helm Principle

                      Helm 是 Kubernetes 的包管理工具,类似于 Linux 的 apt/yum 或 Python 的 pip,它的核心作用是: 👉 用模板化的方式定义、安装和升级 Kubernetes 应用。


                      🧩 一、Helm 的核心概念

                      在理解原理前,先明确 Helm 的几个关键对象:

| 概念 | 说明 |
|---|---|
| Chart | 一个 Helm 包,描述一组 Kubernetes 资源的模板集合(即一个应用的安装包) |
| Values.yaml | Chart 的参数配置文件,用于填充模板变量 |
| Release | Helm 将 Chart 安装到某个命名空间后的实例,每次安装或升级都是一个 release |
| Repository | 存放打包后 chart(.tgz)的仓库,可以是 HTTP/OCI 类型(如 Harbor, Artifactory) |

                      ⚙️ 二、Helm 的工作原理流程

                      从用户角度来看,Helm Client 发出命令(如 helm install),Helm 会通过一系列步骤在集群中生成 Kubernetes 资源。

                      下面是核心流程图概念(文字版):

                             ┌────────────┐
                             │ helm client│
                             └─────┬──────┘
                                   │
                                   ▼
                            1. 解析Chart与Values
                                   │
                                   ▼
                            2. 模板渲染(Helm Template Engine)
                                   │
                                   ▼
                            3. 生成纯YAML清单
                                   │
                                   ▼
                            4. 调用Kubernetes API
                                   │
                                   ▼
                            5. 创建/更新资源(Deployment、Service等)
                                   │
                                   ▼
                            6. 记录Release历史(ConfigMap/Secret)

                      🔍 三、Helm 工作机制分解

                      1️⃣ Chart 渲染阶段

                      Helm 使用 Go 的 text/template 模板引擎 + Sprig 函数库,将模板与 values.yaml 合并生成 Kubernetes YAML 清单。

                      例如:

                      # templates/deployment.yaml
                      apiVersion: apps/v1
                      kind: Deployment
                      metadata:
                        name: {{ .Release.Name }}-app
                      spec:
                        replicas: {{ .Values.replicas }}

                      通过:

                      helm template myapp ./mychart -f myvalues.yaml

                      Helm 会本地生成纯 YAML 文件(不部署到集群)。


                      2️⃣ 部署阶段(Install/Upgrade)

                      执行:

                      helm install myapp ./mychart

                      Helm Client 会将渲染好的 YAML 通过 Kubernetes API 提交到集群(相当于执行 kubectl apply)。

                      Helm 同时在命名空间中创建一个 “Release 记录”,默认存放在:

                      namespace: <your-namespace>
                      kind: Secret
                      name: sh.helm.release.v1.<release-name>.vN

                      其中保存了:

                      • Chart 模板和 values 的快照
                      • 渲染后的 manifest
                      • Release 状态(deployed、failed 等)
                      • 版本号(v1, v2, …)
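
想查看这份快照时,可以直接解码对应的 Secret(下面是一个示意,release 名沿用前文的 myapp,版本号需按实际情况替换;release 字段是两次 base64 编码再 gzip 压缩后的 JSON):

# 按标签列出 myapp 的所有 release Secret(Helm 3 会打上 owner=helm 标签)
kubectl get secret -l owner=helm,name=myapp

# 解码:k8s Secret 的 base64 + Helm 自身的 base64 + gzip
kubectl get secret sh.helm.release.v1.myapp.v1 \
  -o jsonpath='{.data.release}' | base64 -d | base64 -d | gunzip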

                      3️⃣ 升级与回滚机制

                      当执行:

                      helm upgrade myapp ./mychart

                      时,Helm 会:

                      1. 读取旧版本 release secret
                      2. 渲染新模板
                      3. 比较新旧差异(Diff)
                      4. 调用 Kubernetes API 更新对象
                      5. 写入新的 release secret(版本号 +1)

                      回滚时:

                      helm rollback myapp 2

                      Helm 会取出 v2 的记录,再次 kubectl apply
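
回滚前通常先用 helm history 确认要回退到哪个版本:

# 查看 release 的全部历史版本及状态(deployed / superseded / failed)
helm history myapp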


                      4️⃣ 仓库机制(Helm Repository / OCI Registry)

                      Helm 支持两种包分发方式:

                      • HTTP 仓库(传统)

                        • 有一个 index.yaml 索引文件
                        • Chart 以 .tgz 格式存储
                      • OCI 仓库(现代推荐)

                        • Chart 存储在 OCI registry(如 Harbor, GHCR)

                        • 推送方式:

                          helm push mychart/ oci://harbor.example.com/helm
                        • 拉取方式:

                          helm pull oci://harbor.example.com/helm/mychart --version 1.0.0

                      🧠 四、Helm 与 Kubernetes 的关系

                      Helm 本身 不直接管理容器运行,它只是:

                      • 模板引擎 + 应用生命周期管理器;
                      • 所有资源最终仍由 Kubernetes 控制器(如 Deployment controller)调度、运行。

                      Helm 类似于 “上层应用打包器”:

                      Helm = Chart 模板系统 + Kubernetes API 客户端 + Release 历史追踪

                      💡 五、常见命令原理对照

| 命令 | Helm 行为 |
|---|---|
| helm install | 渲染模板 → 提交资源 → 创建 release |
| helm upgrade | 渲染模板 → diff 旧版本 → 更新资源 → 新 release |
| helm rollback | 获取旧版本记录 → 重新提交旧 manifest |
| helm uninstall | 删除 Kubernetes 资源 + 删除 release secret |
| helm template | 本地渲染模板,不与集群交互 |
| helm diff(插件) | 比较新旧渲染结果差异 |

                      🧩 六、Helm 3 与 Helm 2 的区别(核心)

| Helm 2 | Helm 3 |
|---|---|
| 需要 Tiller(集群内控制组件) | 无需 Tiller,完全 client-side |
| 安全模型复杂(基于 RBAC 授权) | 安全性更好,直接使用 kubeconfig 权限 |
| Release 存储在 ConfigMap | 默认存储在 Secret |
| 需要部署 Helm Server | 纯客户端 |
                      Mar 7, 2024

                      HPA

                      HPA(Horizontal Pod Autoscaler)是 Kubernetes 中实现自动水平扩缩容的核心组件。它的实现涉及多个 Kubernetes 组件和复杂的控制逻辑。

                      一、HPA 架构组成

                      1. 核心组件

                      ┌─────────────────┐    ┌──────────────────┐    ┌─────────────────┐
                      │   HPA Controller │ ◄──│   Metrics API    │ ◄──│  Metrics Server │
                      │   (kube-controller)│    │    (聚合层)     │    │   (cAdvisor)    │
                      └─────────────────┘    └──────────────────┘    └─────────────────┘
                               │                       │                       │
                               ▼                       ▼                       ▼
                      ┌─────────────────┐    ┌──────────────────┐    ┌─────────────────┐
                      │ Deployment/     │    │  Custom Metrics  │    │  External       │
                      │ StatefulSet     │    │   Adapter        │    │  Metrics        │
                      └─────────────────┘    └──────────────────┘    └─────────────────┘

                      二、HPA 工作流程

                      1. 完整的控制循环

                      // 简化的 HPA 控制逻辑
                      for {
                          // 1. 获取 HPA 对象
                          hpa := client.AutoscalingV2().HorizontalPodAutoscalers(namespace).Get(name)
                          
                          // 2. 获取缩放目标(Deployment/StatefulSet等)
                          scaleTarget := hpa.Spec.ScaleTargetRef
                          target := client.AppsV1().Deployments(namespace).Get(scaleTarget.Name)
                          
                          // 3. 查询指标
                          metrics := []autoscalingv2.MetricStatus{}
                          for _, metricSpec := range hpa.Spec.Metrics {
                              metricValue := getMetricValue(metricSpec, target)
                              metrics = append(metrics, metricValue)
                          }
                          
                          // 4. 计算期望副本数
                          desiredReplicas := calculateDesiredReplicas(hpa, metrics, currentReplicas)
                          
                          // 5. 执行缩放
                          if desiredReplicas != currentReplicas {
                              scaleTarget.Spec.Replicas = &desiredReplicas
                              client.AppsV1().Deployments(namespace).UpdateScale(scaleTarget.Name, scaleTarget)
                          }
                          
                          time.Sleep(15 * time.Second) // 默认扫描间隔
                      }

                      2. 详细步骤分解

                      步骤 1:指标收集

                      # HPA 通过 Metrics API 获取指标
                      kubectl get --raw "/apis/metrics.k8s.io/v1beta1/namespaces/default/pods" | jq .
                      
                      # 或者通过自定义指标 API
                      kubectl get --raw "/apis/custom.metrics.k8s.io/v1beta1" | jq .

                      步骤 2:指标计算

                      // 计算当前指标值与目标值的比率
                      func calculateMetricRatio(currentValue, targetValue int64) float64 {
                          return float64(currentValue) / float64(targetValue)
                      }
                      
// 示例:CPU 使用率计算
currentCPUUsage := int64(800) // 当前使用 800 milli-cores(即 800m)
targetCPUUsage := int64(500)  // 目标使用 500 milli-cores(即 500m)
ratio := calculateMetricRatio(currentCPUUsage, targetCPUUsage) // = 1.6

                      三、HPA 配置详解

                      1. HPA 资源定义

                      apiVersion: autoscaling/v2
                      kind: HorizontalPodAutoscaler
                      metadata:
                        name: myapp-hpa
                        namespace: default
                      spec:
                        # 缩放目标
                        scaleTargetRef:
                          apiVersion: apps/v1
                          kind: Deployment
                          name: myapp
                        # 副本数范围
                        minReplicas: 2
                        maxReplicas: 10
                        # 指标定义
                        metrics:
                        - type: Resource
                          resource:
                            name: cpu
                            target:
                              type: Utilization
                              averageUtilization: 50
                        - type: Resource
                          resource:
                            name: memory
                            target:
                              type: Utilization
                              averageUtilization: 70
                        - type: Pods
                          pods:
                            metric:
                              name: packets-per-second
                            target:
                              type: AverageValue
                              averageValue: 1k
                        - type: Object
                          object:
                            metric:
                              name: requests-per-second
                            describedObject:
                              apiVersion: networking.k8s.io/v1
                              kind: Ingress
                              name: main-route
                            target:
                              type: Value
                              value: 10k
                        # 行为配置(Kubernetes 1.18+)
                        behavior:
                          scaleDown:
                            stabilizationWindowSeconds: 300
                            policies:
                            - type: Percent
                              value: 50
                              periodSeconds: 60
                            - type: Pods
                              value: 5
                              periodSeconds: 60
                            selectPolicy: Min
                          scaleUp:
                            stabilizationWindowSeconds: 0
                            policies:
                            - type: Percent
                              value: 100
                              periodSeconds: 15
                            - type: Pods
                              value: 4
                              periodSeconds: 15
                            selectPolicy: Max

                      四、指标类型和计算方式

                      1. 资源指标(CPU/Memory)

                      metrics:
                      - type: Resource
                        resource:
                          name: cpu
                          target:
                            type: Utilization    # 利用率模式
                            averageUtilization: 50
                            
                      - type: Resource  
                        resource:
                          name: memory
                          target:
                            type: AverageValue  # 平均值模式
                            averageValue: 512Mi

                      计算逻辑

// CPU 利用率计算(currentUsage 为每个 Pod 的平均利用率,targetUtilization 为目标利用率)
func calculateCPUReplicas(currentUsage, targetUtilization int32, currentReplicas int32) int32 {
    // 当前总利用率 = 平均利用率 * 副本数
    totalUsage := currentUsage * currentReplicas
    // 期望副本数 = ceil(当前总利用率 / 目标利用率),等价于 ceil(currentReplicas * currentUsage / targetUtilization)
    desiredReplicas := int32(math.Ceil(float64(totalUsage) / float64(targetUtilization)))
    return desiredReplicas
}

                      2. 自定义指标(Pods 类型)

                      metrics:
                      - type: Pods
                        pods:
                          metric:
                            name: http_requests_per_second
                          target:
                            type: AverageValue
                            averageValue: 100

                      计算方式

                      期望副本数 = ceil(当前总指标值 / 目标平均值)
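
举个算例(假设当前 3 个 Pod 合计 450 requests/s,目标平均值为 100):期望副本数 = ceil(450 / 100) = 5,可以用 shell 的整数运算粗略验证:

# (450 + 100 - 1) / 100 向上取整得到 5
echo $(( (450 + 100 - 1) / 100 ))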

                      3. 对象指标(Object 类型)

                      metrics:
                      - type: Object
                        object:
                          metric:
                            name: latency
                          describedObject:
                            apiVersion: networking.k8s.io/v1
                            kind: Ingress
                            name: my-ingress
                          target:
                            type: Value
                            value: 100

                      五、HPA 算法详解

                      1. 核心算法

                      // 计算期望副本数
                      func GetDesiredReplicas(
                          currentReplicas int32,
                          metricValues []metrics,
                          hpa *HorizontalPodAutoscaler,
                      ) int32 {
                          ratios := make([]float64, 0)
                          
                          // 1. 计算每个指标的比率
                          for _, metric := range metricValues {
                              ratio := calculateMetricRatio(metric.current, metric.target)
                              ratios = append(ratios, ratio)
                          }
                          
                          // 2. 选择最大的比率(最需要扩容的指标)
                          maxRatio := getMaxRatio(ratios)
                          
                          // 3. 计算期望副本数
                          desiredReplicas := math.Ceil(float64(currentReplicas) * maxRatio)
                          
                          // 4. 应用边界限制
                          desiredReplicas = applyBounds(desiredReplicas, hpa.Spec.MinReplicas, hpa.Spec.MaxReplicas)
                          
                          return int32(desiredReplicas)
                      }

                      2. 平滑算法和冷却机制

                      // 考虑历史记录的缩放决策
                      func withStabilization(desiredReplicas int32, hpa *HorizontalPodAutoscaler) int32 {
                          now := time.Now()
                          
                          if isScaleUp(desiredReplicas, hpa.Status.CurrentReplicas) {
                              // 扩容:通常立即执行
                              stabilizationWindow = hpa.Spec.Behavior.ScaleUp.StabilizationWindowSeconds
                          } else {
                              // 缩容:应用稳定窗口
                              stabilizationWindow = hpa.Spec.Behavior.ScaleDown.StabilizationWindowSeconds
                          }
                          
                          // 过滤稳定窗口内的历史推荐值
                          validRecommendations := filterRecommendationsByTime(
                              hpa.Status.Conditions, 
                              now.Add(-time.Duration(stabilizationWindow)*time.Second)
                          )
                          
                          // 选择策略(Min/Max)
                          finalReplicas := applyPolicy(validRecommendations, hpa.Spec.Behavior)
                          
                          return finalReplicas
                      }

                      六、高级特性实现

                      1. 多指标支持

                      当配置多个指标时,HPA 会为每个指标计算期望副本数,然后选择最大值

                      func calculateFromMultipleMetrics(metrics []Metric, currentReplicas int32) int32 {
                          desiredReplicas := make([]int32, 0)
                          
                          for _, metric := range metrics {
                              replicas := calculateForSingleMetric(metric, currentReplicas)
                              desiredReplicas = append(desiredReplicas, replicas)
                          }
                          
                          // 选择最大的期望副本数
                          return max(desiredReplicas...)
                      }

                      2. 扩缩容行为控制

                      behavior:
                        scaleDown:
                          # 缩容稳定窗口:5分钟
                          stabilizationWindowSeconds: 300
                          policies:
                          - type: Percent   # 每分钟最多缩容 50%
                            value: 50
                            periodSeconds: 60
                          - type: Pods      # 或每分钟最多减少 5 个 Pod
                            value: 5
                            periodSeconds: 60
                          selectPolicy: Min # 选择限制更严格的策略
                          
                        scaleUp:
                          stabilizationWindowSeconds: 0  # 扩容立即执行
                          policies:
                          - type: Percent   # 每分钟最多扩容 100%
                            value: 100
                            periodSeconds: 60
                          - type: Pods      # 或每分钟最多增加 4 个 Pod
                            value: 4
                            periodSeconds: 60
                          selectPolicy: Max # 选择限制更宽松的策略

                      七、监控和调试

                      1. 查看 HPA 状态

                      # 查看 HPA 详情
                      kubectl describe hpa myapp-hpa
                      
                      # 输出示例:
                      # Name: myapp-hpa
                      # Namespace: default
                      # Reference: Deployment/myapp
                      # Metrics: ( current / target )
                      #   resource cpu on pods  (as a percentage of request):  65% (130m) / 50%
                      #   resource memory on pods:                             120Mi / 100Mi
                      # Min replicas: 2
                      # Max replicas: 10
                      # Deployment pods: 3 current / 3 desired

                      2. HPA 相关事件

                      # 查看 HPA 事件
                      kubectl get events --field-selector involvedObject.kind=HorizontalPodAutoscaler
                      
                      # 查看缩放历史
                      kubectl describe deployment myapp | grep -A 10 "Events"

                      3. 指标调试

                      # 检查 Metrics API 是否正常工作
                      kubectl get --raw "/apis/metrics.k8s.io/v1beta1/nodes" | jq .
                      
                      # 检查自定义指标
                      kubectl get --raw "/apis/custom.metrics.k8s.io/v1beta1" | jq .
                      
                      # 直接查询 Pod 指标
                      kubectl top pods
                      kubectl top nodes

                      八、常见问题排查

                      1. HPA 不扩容

                      # 检查指标是否可用
                      kubectl describe hpa myapp-hpa
                      # 查看 Events 部分是否有错误信息
                      
                      # 检查 Metrics Server
                      kubectl get apiservices | grep metrics
                      kubectl logs -n kube-system -l k8s-app=metrics-server
                      
                      # 检查资源请求配置
                      kubectl get deployment myapp -o yaml | grep resources -A 5

                      2. HPA 频繁震荡

                      # 调整行为配置
                      behavior:
                        scaleDown:
                          stabilizationWindowSeconds: 600  # 延长缩容稳定窗口
                          policies:
                          - type: Pods
                            value: 1                       # 每次只缩容 1 个 Pod
                            periodSeconds: 300             # 每 5 分钟最多一次

                      九、性能优化建议

                      1. 大规模集群优化

                      # kube-controller-manager 配置优化
                      apiVersion: v1
                      kind: Pod
                      metadata:
                        name: kube-controller-manager
                        namespace: kube-system
                      spec:
                        containers:
                        - command:
                          - kube-controller-manager
                          - --horizontal-pod-autoscaler-sync-period=30s           # 调整同步周期
                          - --horizontal-pod-autoscaler-downscale-stabilization=5m0s
                          - --horizontal-pod-autoscaler-initial-readiness-delay=30s
                          - --horizontal-pod-autoscaler-cpu-initialization-period=5m0s

                      总结

                      HPA 的实现涉及:

                      1. 指标收集:通过 Metrics API 聚合资源和使用情况
                      2. 算法计算:基于指标值和目标值计算期望副本数
                      3. 决策平滑:应用稳定窗口和策略避免震荡
                      4. 执行缩放:更新目标资源的副本数
                      5. 状态跟踪:记录缩放历史和当前状态

                      HPA 的智能化体现在多指标决策、行为控制、冷却机制等方面,使得自动扩缩容既响应迅速又稳定可靠。

                      Mar 7, 2024

                      More than 1k Nodes

                      在这个量级上,K8s 不再只是“能跑就行”,而是进入可扩展性、稳定性、可观测性和资源效率的工程化挑战。下面我从架构、控制面、节点管理、网络、存储、安全和运维几个方面系统讲解。


                      🧠 一、总体思路:大规模集群的本质挑战

                      当节点规模超过 500~1000 时,Kubernetes 的瓶颈通常出现在:

                      • 控制平面(API Server / etcd)压力过大
                      • 调度器吞吐不足
                      • 资源对象(Pod / Node / Secret / ConfigMap 等)过多,导致 List/Watch 延迟
                      • 网络和 CNI 插件在高并发下性能下降
                      • 监控、日志、事件系统的数据量爆炸
                      • 维护和升级变得极度复杂

                      所以,大规模集群的重点是:

                      控制平面分层、节点池分区、流量隔离、观测与调优。


                      🏗️ 二、控制平面(Control Plane)

                      1. etcd 优化

                      • 独立部署:不要和 kube-apiserver 混布,最好是独立的高性能节点(NVMe SSD、本地盘)。
                      • 使用 etcd v3.5+(性能改进明显),并开启压缩和快照机制。
• 调大 --max-request-bytes 和 --quota-backend-bytes,避免过载。
• 定期 defrag:可用 CronJob 自动化(示例命令见下方)。
                      • 不要存放短生命周期对象(例如频繁更新的 CRD 状态),可以考虑用外部缓存系统(如 Redis 或 SQL)。
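
上面提到的 defrag 与快照备份可以用 etcdctl 完成,下面是一个示意(endpoint 与证书路径按 kubeadm 默认位置假设,需替换为实际环境的值):

export ETCDCTL_API=3
# 假设:本机 etcd endpoint 与 kubeadm 默认证书路径
ETCD="etcdctl --endpoints=https://127.0.0.1:2379 --cacert=/etc/kubernetes/pki/etcd/ca.crt --cert=/etc/kubernetes/pki/etcd/server.crt --key=/etc/kubernetes/pki/etcd/server.key"

# 碎片整理(建议逐个节点执行,避免同时影响整个集群)
$ETCD defrag

# 定期快照备份
$ETCD snapshot save /backup/etcd-$(date +%F).db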

                      2. API Server 扩展与保护

                      • 使用 负载均衡(HAProxy、NGINX、ELB)在多 API Server 之间分流;

                      • 调整:

                        • --max-mutating-requests-inflight
                        • --max-requests-inflight
                        • --target-ram-mb
                      • 合理设置 --request-timeout,防止 watch 卡死;

                      • 限制大量 client watch 行为(Prometheus、controller-manager 等);

• 对 client 侧使用 aggregator 或 read-only proxy 来降低负载。

                      3. Scheduler & Controller Manager

                      • 多调度器实例(leader election)

                      • 启用 调度缓存(SchedulerCache)优化

                      • 调整:

  • --kube-api-qps 和 --kube-api-burst;
                        • 调度算法的 backoff 策略;
                      • 对自定义 Operator 建议使用 workqueue with rate limiters 防止风暴。


                      🧩 三、节点与 Pod 管理

                      1. 节点分区与拓扑

                      • 按功能/位置划分 Node Pool(如 GPU/CPU/IO 密集型);
                      • 使用 Topology Spread Constraints 避免集中调度;
• 考虑用 Cluster Federation (KubeFed) 或 多个集群 + 集中管理(如 ArgoCD 多集群、Karmada、Fleet)

                      2. 节点生命周期

                      • 控制 kubelet 心跳频率 (--node-status-update-frequency);
                      • 通过 Node Problem Detector (NPD) 自动标记异常节点;
                      • 监控 Pod eviction rate,防止节点频繁漂移;
                      • 启用 graceful node shutdown 支持。

                      3. 镜像与容器运行时

                      • 镜像预热(Image pre-pull);
                      • 使用 镜像仓库代理(Harbor / registry-mirror)
                      • 考虑 containerd 代替 Docker;
                      • 定期清理 /var/lib/containerd/io.containerd.snapshotter.v1.overlayfs/snapshots

                      🌐 四、网络(CNI)

                      1. CNI 选择与调优

                      • 大规模下优选:

                        • Calico (BGP 模式)
                        • Cilium (eBPF)
                        • 或使用云原生方案(AWS CNI, Azure CNI)。
                      • 降低 ARP / 路由表压力:

                        • 使用 IPAM 子网分段
                        • 开启 Cilium 的 ClusterMesh 分层;
                      • 调整 conntrack 表大小(net.netfilter.nf_conntrack_max)。
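
conntrack 相关参数可以按下面的方式检查和调整(数值仅为示意,需结合节点内存和连接数评估):

# 查看当前上限与实际使用量
sysctl net.netfilter.nf_conntrack_max
cat /proc/sys/net/netfilter/nf_conntrack_count

# 临时调大;持久化需写入 /etc/sysctl.d/ 下的配置文件
sysctl -w net.netfilter.nf_conntrack_max=1048576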

                      2. Service & DNS

                      • 启用 CoreDNS 缓存

                      • 对大规模 Service 场景,考虑 Headless Service + ExternalName

                      • 优化 kube-proxy:

                        • 使用 IPVS 模式
                        • Cilium service LB
                      • 如果 Service 数量非常多,可拆分 namespace 级 DNS 域。


                      💾 五、存储(CSI)

                      • 使用 分布式存储系统(Ceph、Longhorn、OpenEBS、CSI-HostPath);
                      • 避免高频小 I/O 的 PVC;
                      • 定期清理僵尸 PV/PVC;
                      • 对 CSI driver 开启限流与重试机制。

                      🔒 六、安全与访问控制

                      • 开启 RBAC 严格控制
                      • 限制 namespace 级资源上限(ResourceQuota, LimitRange);
                      • 审计日志(Audit Policy)异步存储;
                      • 对外接口统一走 Ingress Controller;
                      • 如果有 Operator 或 CRD 资源暴涨,记得定期清理过期对象。

                      📈 七、可观测性与维护

                      1. 监控

                      • Prometheus 集群化(Thanos / VictoriaMetrics);
                      • 不直接监控所有 Pod,可抽样或聚合;
                      • kube-state-metrics 与 cAdvisor 数据要限流。

                      2. 日志

                      • 统一日志收集(Loki / Elasticsearch / Vector);
                      • 日志量控制策略(采样、压缩、清理)。

                      3. 升级与测试

                      • 使用 灰度升级 / Node pool rolling
                      • 每次升级前跑 e2e 测试;
                      • 对控制平面单独做快照和备份(etcd snapshot)。

                      ⚙️ 八、性能调优与实践经验

                      • 调整 kubelet QPS 限制:

                        --kube-api-qps=100 --kube-api-burst=200
                      • 合理的 Pod 数量控制:

                        • 单节点不超过 110 Pods;
                        • 单 namespace 建议 < 5000 Pods;
                        • 总体目标:1k 节点 → 5~10 万 Pods 以内。
                      • 使用 CRD Sharding / 缩减 CRD 状态字段

                      • 避免大量短生命周期 Job,可用 CronJob + TTLController 清理。


                      🧭 九、扩展方向

                      当规模继续上升(>3000 节点)时,可以考虑:

                      • 多集群架构(Cluster Federation / Karmada / Rancher Fleet)
                      • 控制平面分层(cell-based control plane)
                      • API Aggregation Layer + Custom Scheduler

                      Mar 7, 2024

                      Network Policy

                      1. Network Policy 的设计原理

                      Kubernetes Network Policy 的设计核心思想是:在默认允许的集群网络中,引入一个“默认拒绝”的、声明式的、基于标签的防火墙

                      让我们来分解这个核心思想:

                      1. 从“默认允许”到“默认拒绝”

                        • 默认行为:在没有任何 Network Policy 的情况下,Kubernetes 集群内的 Pod 之间是可以自由通信的(取决于 CNI 插件),甚至来自外部的流量也可能直接访问到 Pod。这就像在一个没有防火墙的开放网络里。
  • Network Policy 的作用:一旦在某个 Namespace 中创建了一个 Network Policy,它就会像一个“开关”,将这个 Namespace 或特定 Pod 的默认行为变为 “默认拒绝”。之后,只有策略中明确允许的流量才能通过,下文给出了一个最小化的“默认拒绝”策略示例。
                      2. 声明式模型

                        • 和其他的 Kubernetes 资源(如 Deployment、Service)一样,Network Policy 也是声明式的。你只需要告诉 Kubernetes“你期望的网络状态是什么”(例如,“允许来自带有 role=frontend 标签的 Pod 的流量访问带有 role=backend 标签的 Pod 的 6379 端口”),而不需要关心如何通过 iptables 或 eBPF 命令去实现它。Kubernetes 和其下的 CNI 插件会负责实现你的声明。
                      3. 基于标签的选择机制

                        • 这是 Kubernetes 的核心设计模式。Network Policy 不关心 Pod 的 IP 地址,因为 IP 是动态且易变的。它通过 标签 来选择一组 Pod。
                        • podSelector: 选择策略所应用的 Pod(即目标 Pod)。
                        • namespaceSelector: 根据命名空间的标签来选择来源或目标命名空间。
  • namespaceSelector 和 podSelector 可以组合使用,实现非常精细的访问控制。
                      4. 策略是叠加的

                        • 多个 Network Policy 可以同时作用于同一个 Pod。最终的规则是所有相关策略的 并集。如果任何一个策略允许了某条流量,那么该流量就是被允许的。这意味着你可以分模块、分层次地定义策略,而不会相互覆盖。
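
下面给出前文提到的最小化“默认拒绝”策略示例:对某个命名空间(这里以 default 为例)中的所有 Pod 拒绝全部入站流量:

kubectl apply -f - <<EOF
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-ingress
  namespace: default
spec:
  podSelector: {}   # 空选择器:作用于该命名空间内所有 Pod
  policyTypes:
  - Ingress         # 未定义任何 ingress 规则,即拒绝所有入站流量
EOF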

                      2. Network Policy 的实现方式

                      一个非常重要的概念是:Network Policy 本身只是一个 API 对象,它定义了一套规范。它的具体实现依赖于 Container Network Interface 插件。

                      Kubernetes 不会自己实现网络策略,而是由 CNI 插件来负责。这意味着:

                      • 如果你的 CNI 插件不支持 Network Policy,那么你创建的 Policy 将不会产生任何效果。
                      • 不同的 CNI 插件使用不同的底层技术来实现相同的 Network Policy 规范。

                      主流的实现方式和技术包括:

                      1. 基于 iptables

                        • 工作原理:CNI 插件(如 Calico 的部分模式、Weave Net 等)会监听 Kubernetes API,当有 Network Policy 被创建时,它会在节点上生成相应的 iptables 规则。这些规则会对进出 Pod 网络接口(veth pair)的数据包进行过滤。
                        • 优点:成熟、稳定、通用。
                        • 缺点:当策略非常复杂时,iptables 规则链会变得很长,可能对性能有一定影响。
                      2. 基于 eBPF

                        • 工作原理:这是更现代和高效的方式,被 Cilium 等项目广泛采用。eBPF 允许将程序直接注入到 Linux 内核中,在内核层面高效地执行数据包过滤、转发和策略检查。
                        • 优点:高性能、灵活性极强(可以实现 L3/L4/L7 所有层面的策略)、对系统影响小。
                        • 缺点:需要较新的 Linux 内核版本。
                      3. 基于 IPVS 或自有数据平面

                        • 一些 CNI 插件(如 Antrea,它底层使用 OVS)可能有自己独立的数据平面,并在其中实现策略的匹配和执行。

                      常见的支持 Network Policy 的 CNI 插件:

                      • Calico: 功能强大,支持复杂的网络策略,既可以使用 iptables 模式也可以使用 eBPF 模式。
                      • Cilium: 基于 eBPF,原生支持 Network Policy,并扩展到了 L7(HTTP、gRPC 等)网络策略。
                      • Weave Net: 提供了对 Kubernetes Network Policy 的基本支持。
                      • Antrea: 基于 Open vSwitch,也提供了强大的策略支持。

                      3. Network Policy 的用途

                      Network Policy 是实现 Kubernetes “零信任”“微隔离” 安全模型的核心工具。其主要用途包括:

                      1. 实现最小权限原则

                        • 这是最核心的用途。通过精细的策略,确保一个 Pod 只能与它正常工作所 必需 的其他 Pod 或外部服务通信,除此之外的一切连接都被拒绝。这极大地减少了攻击面。
                      2. 隔离多租户环境

                        • 在共享的 Kubernetes 集群中,可以为不同的团队、项目或环境(如 dev, staging)创建不同的命名空间。然后使用 Network Policy 严格限制跨命名空间的访问,确保它们相互隔离,互不干扰。
                      3. 保护关键基础服务

                        • 数据库、缓存(如 Redis)、消息队列等后端服务通常不应该被所有 Pod 访问。可以创建策略,只允许特定的前端或中间件 Pod(通过标签选择)访问这些后端服务的特定端口。
                        # 示例:只允许 role=api 的 Pod 访问 role=db 的 Pod 的 5432 端口
                        apiVersion: networking.k8s.io/v1
                        kind: NetworkPolicy
                        metadata:
                          name: allow-api-to-db
                        spec:
                          podSelector:
                            matchLabels:
                              role: db
                          policyTypes:
                          - Ingress
                          ingress:
                          - from:
                            - podSelector:
                                matchLabels:
                                  role: api
                            ports:
                            - protocol: TCP
                              port: 5432
                      4. 控制外部访问

  • 使用 ipBlock 字段,可以限制只有来自特定 IP 段(例如公司办公室的 IP)的流量才能访问集群内部的服务。这可以用来替代或补充传统的防火墙规则,示例见下文。
                      5. 划分应用层次安全边界

                        • 在一个典型的 Web 应用中,可以创建清晰的层次:
                          • 前端层: 可以接收来自外部的流量(80/443端口),但只能与后端层通信。
                          • 后端层: 只能接收来自前端层的流量,并只能与数据层通信。
                          • 数据层: 只能接收来自后端层的流量,不接受任何其他来源的请求。
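
针对上面第 4 点的 ipBlock,这里补充一个示意策略:只允许来自 203.0.113.0/24(示例网段)的流量访问带 role=frontend 标签的 Pod 的 443 端口:

kubectl apply -f - <<EOF
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-office-to-frontend
spec:
  podSelector:
    matchLabels:
      role: frontend
  policyTypes:
  - Ingress
  ingress:
  - from:
    - ipBlock:
        cidr: 203.0.113.0/24   # 示例:办公网出口网段
    ports:
    - protocol: TCP
      port: 443
EOF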

                      总结

| 特性 | 描述 |
|---|---|
| 设计原理 | 在默认允许的网络中,通过声明式和基于标签的机制,实现“默认拒绝”的精细流量控制。 |
| 实现方式 | 由 CNI 插件负责实现,底层技术包括 iptables、eBPF 等。策略本身是 Kubernetes 的 API 资源。 |
| 主要用途 | 实现微隔离、最小权限原则、多租户隔离、保护关键服务、控制外部访问,是 Kubernetes 网络安全的基石。 |

                      简单来说,Network Policy 就是 Kubernetes 世界的防火墙规则,它让你能够定义“谁在什么条件下可以访问什么”,是生产环境中保障应用安全不可或缺的一部分。

                      Mar 7, 2024

                      Node NotReady

                      当 Kubernetes 中某些 Node 节点状态变为 NotReady 时,这往往意味着 kubelet 无法与控制平面(API Server)正常通信,或该节点上某些关键组件/资源异常。

                      我们可以从以下两个层面来分析: 1️⃣ 导致节点 NotReady 的常见原因 2️⃣ NotReady 状态对整个集群和业务的影响


                      🧩 一、Node NotReady 的常见原因分类

                      kubelet 每 10 秒(默认)向 API Server 报告一次心跳(NodeStatus)。 如果连续 40 秒(默认 --node-monitor-grace-period=40s)没有收到更新,Controller Manager 会将节点标记为 NotReady
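
排查时可以先确认节点 Ready condition 的状态与最近一次心跳时间(<node-name> 替换为实际节点名):

# 查看 Ready condition(包含 lastHeartbeatTime、reason、message)
kubectl get node <node-name> -o jsonpath='{.status.conditions[?(@.type=="Ready")]}'

# 或直接查看 describe 输出中的 Conditions 部分
kubectl describe node <node-name> | grep -A 10 Conditions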

                      下面按类别详细分析👇


                      🖧 1. 网络层异常(最常见)

                      症状:节点能 ping 通外网,但与 control plane 交互超时。 原因包括:

                      • 节点与 kube-apiserver 之间的网络中断(如防火墙、路由异常、VPC 问题);
                      • API Server 负载均衡异常(L4/L7 LB 停止转发流量);
                      • Pod 网络插件(CNI)崩溃,kubelet 无法汇报 Pod 状态;
                      • 节点 DNS 解析异常(影响 kubelet 访问 API Server)。

                      排查方式:

                      # 在节点上检查 API Server 可达性
                      curl -k https://<apiserver-ip>:6443/healthz
                      # 检查 kubelet 日志
                      journalctl -u kubelet | grep -E "error|fail|timeout"

                      ⚙️ 2. kubelet 本身异常

                      症状:节点长时间 NotReady,重启 kubelet 后恢复。

                      原因包括:

                      • kubelet 崩溃 / 死循环;
                      • 磁盘满,导致 kubelet 无法写临时目录(/var/lib/kubelet);
                      • 证书过期(/var/lib/kubelet/pki/kubelet-client-current.pem);
                      • CPU/Mem 资源耗尽,kubelet 被 OOM;
                      • kubelet 配置文件被改动,重启后加载失败。

                      排查方式:

                      systemctl status kubelet
                      journalctl -u kubelet -n 100
                      df -h /var/lib/kubelet

                      💾 3. 节点资源耗尽

症状:Node 状态为 NotReady 或 Unknown,Pod 被驱逐。

                      可能原因:

                      • 磁盘使用率 > 90%,触发 kubelet DiskPressure
                      • 内存 / CPU 长期 100%,触发 MemoryPressure
                      • inode 用尽(df -i);
• 临时目录(如 /var/lib/docker/tmp、/tmp)爆满。

                      排查方式:

                      kubectl describe node <node-name>
                      # 查看 conditions
                      # Conditions:
                      #   Type              Status
                      #   ----              ------
                      #   MemoryPressure    True
                      #   DiskPressure      True

                      🧱 4. 控制面通信问题(API Server / Controller Manager)

                      症状:多个节点同时 NotReady

                      可能原因:

                      • API Server 压力过大,导致心跳包无法及时处理;
                      • etcd 异常(写延迟高);
                      • Controller Manager 无法更新 NodeStatus;
                      • 集群负载均衡器(如 haproxy)挂掉。

                      排查方式:

                      kubectl get componentstatuses
                      # 或直接检查控制平面节点
                      kubectl -n kube-system get pods -l tier=control-plane

                      🔌 5. 容器运行时 (containerd/docker/crio) 异常

                      症状:kubelet 报 “Failed to list pod sandbox”。

                      原因包括:

                      • containerd 服务挂掉;
                      • 版本不兼容(kubelet 与 runtime 版本差异过大);
                      • runtime socket 权限错误;
                      • overlayfs 损坏;
• /var/lib/containerd 或 /run/containerd 文件系统只读。

                      排查方式:

                      systemctl status containerd
                      journalctl -u containerd | tail
                      crictl ps

                      ⏱️ 6. 时间同步错误

                      症状:kubelet 心跳被判定过期,但实际节点正常。

                      原因:

                      • 节点时间漂移(未启用 NTP / chrony);
                      • 控制面和节点时间差 > 5 秒;
                      • TLS 校验失败(证书时间不合法)。
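
时间同步问题可以用下面的命令快速确认(以 systemd + chrony 环境为例,属于假设,ntpd 环境命令不同):

# 查看系统时间与 NTP 同步状态
timedatectl status

# chrony 环境下查看与时钟源的偏移量
chronyc tracking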

                      🧰 7. 节点维护或人为操作

                      包括:

                      • 节点被 cordon/drain;
                      • 网络策略阻断 kubelet;
                      • 人为停掉 kubelet;
                      • 节点被重装后未清理旧状态(Node UID 冲突)。

                      ⚠️ 二、Node NotReady 的后果与影响

| 影响范围 | 描述 |
|---|---|
| 1️⃣ Pod 调度 | Scheduler 会避免调度新 Pod 到该节点。 |
| 2️⃣ Pod 驱逐 | Controller Manager 默认在节点 NotReady 超过 300s(--pod-eviction-timeout)后,会驱逐所有 Pod。 |
| 3️⃣ Service Endpoint 缺失 | 该节点上运行的 Pod 从 Service Endpoint 列表中移除,导致负载均衡流量下降。 |
| 4️⃣ DaemonSet 中断 | DaemonSet Controller 不再在该节点上创建/管理 Pod。 |
| 5️⃣ 数据丢失风险 | 若节点上的 Pod 使用本地卷(emptyDir、hostPath),被驱逐后数据会丢失。 |
| 6️⃣ 集群监控告警 | Prometheus / Alertmanager 触发告警(如 KubeNodeNotReady、KubeletDown)。 |
| 7️⃣ 自动扩缩容失效 | Cluster Autoscaler 无法正确评估资源利用率。 |

                      🧭 三、最佳实践与预防建议

                      1. 启用 Node Problem Detector (NPD) 自动标记系统级异常;

2. 监控 NodeConditions(Ready、MemoryPressure、DiskPressure);

3. 统一节点健康检查策略(如通过 taints 和 tolerations);

                      4. 自动修复机制

                        • 结合 Cluster API 或自研 Controller 实现 Node 自动替换;
                        • 若节点 NotReady 超过 10 分钟,自动重建;
                      5. 定期巡检:

                        • kubelet、containerd 状态;
                        • 系统时间同步;
                        • 磁盘使用率;
                        • API Server QPS 和 etcd 延迟。
                      Mar 7, 2024

                      Pause 容器

                      Kubernetes Pause 容器的用途

                      Pause 容器是 Kubernetes 中一个非常小但极其重要的基础设施容器。很多人会忽略它,但它是 Pod 网络和命名空间共享的核心。


                      🎯 核心作用

                      1. 作为 Pod 的"根容器"(Infrastructure Container)

                      Pause 容器是每个 Pod 中第一个启动的容器,它的生命周期代表整个 Pod 的生命周期。

                      Pod 生命周期:
                      创建 Pod → 启动 Pause 容器 → 启动业务容器 → ... → 业务容器结束 → 删除 Pause 容器 → Pod 销毁

                      2. 持有和共享 Linux 命名空间

                      Pause 容器创建并持有以下命名空间,供 Pod 内其他容器共享:

                      • Network Namespace (网络命名空间) - 最重要!
                      • IPC Namespace (进程间通信)
                      • UTS Namespace (主机名)
                      # 查看 Pod 中的容器
                      docker ps | grep pause
                      
                      # 你会看到类似输出:
                      # k8s_POD_mypod_default_xxx  k8s.gcr.io/pause:3.9
                      # k8s_app_mypod_default_xxx  myapp:latest

                      🌐 网络命名空间共享(最关键的用途)

                      工作原理

                      ┌─────────────────── Pod ───────────────────┐
                      │                                            │
                      │  ┌─────────────┐                          │
                      │  │   Pause     │ ← 创建网络命名空间        │
                      │  │  Container  │ ← 拥有 Pod IP            │
                      │  └──────┬──────┘                          │
                      │         │ (共享网络栈)                     │
                      │  ┌──────┴──────┬──────────┬──────────┐   │
                      │  │ Container A │Container B│Container C│  │
                      │  │  (业务容器)  │  (业务容器)│ (业务容器) │  │
                      │  └─────────────┴──────────┴──────────┘   │
                      │                                            │
                      │  所有容器共享:                              │
                      │  - 同一个 IP 地址 (Pod IP)                 │
                      │  - 同一个网络接口                           │
                      │  - 同一个端口空间                           │
                      │  - 可以通过 localhost 互相访问              │
                      └────────────────────────────────────────────┘

                      实际效果

                      # 示例 Pod
                      apiVersion: v1
                      kind: Pod
                      metadata:
                        name: multi-container-pod
                      spec:
                        containers:
                        - name: nginx
                          image: nginx
                          ports:
                          - containerPort: 80
                        - name: sidecar
                          image: busybox
                          command: ['sh', '-c', 'while true; do wget -O- localhost:80; sleep 5; done']

                      在这个例子中:

                      • Pause 容器创建网络命名空间并获得 Pod IP (如 10.244.1.5)
                      • nginx 容器加入这个网络命名空间,监听 80 端口
                      • sidecar 容器也加入同一网络命名空间
                      • sidecar 可以通过 localhost:80 访问 nginx,因为它们共享网络栈
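
可以用下面的命令实际验证这一点(示意,Pod 名沿用上面的 multi-container-pod,假设镜像内有 hostname 命令):

# 在 sidecar 容器里通过 localhost 访问 nginx
kubectl exec multi-container-pod -c sidecar -- wget -qO- http://localhost:80

# 对比两个容器看到的 Pod IP,输出应完全一致,说明它们共享同一个网络命名空间
kubectl exec multi-container-pod -c nginx -- hostname -i
kubectl exec multi-container-pod -c sidecar -- hostname -i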

                      🔍 为什么需要 Pause 容器?

                      问题场景:如果没有 Pause 容器

                      假设 Pod 中有两个容器 A 和 B:

                      场景 1:容器 A 先启动,创建网络命名空间
                      ├─ 容器 A 持有网络命名空间 → 拥有 Pod IP
                      └─ 容器 B 加入容器 A 的网络命名空间
                      
                      问题:如果容器 A 崩溃重启或被删除,网络命名空间消失
                      → 容器 B 失去网络连接
                      → Pod IP 改变
                      → Service 路由失效 ❌

                      解决方案:引入 Pause 容器

                      Pause 容器(持有命名空间) ← 永远不会主动退出
                      ├─ 容器 A 加入
                      └─ 容器 B 加入
                      
                      优势:
                      ✅ 容器 A 或 B 崩溃不影响网络命名空间
                      ✅ Pod IP 始终保持稳定
                      ✅ 业务容器可以独立重启
                      ✅ 简化容器间的依赖关系

                      📦 Pause 容器的特点

                      1. 极其精简

// pause 容器的代码只有几十行 C 代码
// 核心功能就是:永远 sleep
                      int main() {
                          for (;;) pause();  // 无限暂停,等待信号
                          return 0;
                      }

                      镜像大小:约 700KB (相比普通镜像动辄几百 MB)

                      2. 资源占用极低

                      # 查看 Pause 容器资源占用
                      docker stats <pause-container-id>
                      
                      # 典型输出:
                      # CPU: 0.00%
                      # MEM: 0.5 MiB

                      3. 生命周期管理

                      • Kubelet 创建 Pod 时首先启动 Pause 容器
                      • Pause 容器退出 = Pod 销毁
                      • 业务容器重启不影响 Pause 容器

                      🛠️ 实际用途场景

                      场景 1:Sidecar 模式

                      # 应用 + 日志收集器
                      spec:
                        containers:
                        - name: app
                          image: myapp
                          volumeMounts:
                          - name: logs
                            mountPath: /var/log
                        - name: log-collector
                          image: fluentd
                          volumeMounts:
                          - name: logs
                            mountPath: /var/log
                      • Pause 容器保证两个容器可以通过共享卷和 localhost 通信
                      • 即使 app 重启,log-collector 仍能正常工作

                      场景 2:Service Mesh (如 Istio)

                      # 应用 + Envoy 代理
                      spec:
                        containers:
                        - name: app
                          image: myapp
                          ports:
                          - containerPort: 8080
                        - name: istio-proxy  # Envoy sidecar
                          image: istio/proxyv2
                      • Pause 容器持有网络命名空间
                      • Envoy 代理拦截所有进出流量
                      • 应用无需感知代理存在

                      场景 3:初始化和主容器协作

                      spec:
                        initContainers:
                        - name: init-config
                          image: busybox
                          command: ['sh', '-c', 'echo "config" > /config/app.conf']
                          volumeMounts:
                          - name: config
                            mountPath: /config
                        containers:
                        - name: app
                          image: myapp
                          volumeMounts:
                          - name: config
                            mountPath: /config
                      • Pause 容器在整个过程中保持网络和 IPC 命名空间稳定

                      🔧 查看和调试 Pause 容器

                      查看 Pod 中的 Pause 容器

                      # 方式 1:通过 crictl (推荐)
                      crictl pods
                      crictl ps -a | grep pause
                      
                      # 方式 2:通过 docker (如果使用 Docker 运行时)
                      docker ps -a | grep pause
                      docker inspect <pause-container-id>
                      
# 方式 3:查看 Pod 的业务容器列表(注意:这里不会包含 Pause 容器)
kubectl get pod <pod-name> -o jsonpath='{.status.containerStatuses[*].name}'

                      Pause 镜像版本

# 在节点上查看实际使用的 pause(sandbox)镜像
crictl images | grep pause
# 也可以检查 kubelet 的 --pod-infra-container-image 参数,或 containerd 配置中的 sandbox_image
                      
                      # 常见版本:
                      # registry.k8s.io/pause:3.9 (最新)
                      # registry.k8s.io/pause:3.8
                      # k8s.gcr.io/pause:3.7

                      ⚠️ 常见问题

                      Q1: 为什么看不到 Pause 容器?

                      A: 使用 kubectl get pods 不会显示 Pause 容器,它对用户是透明的。需要使用底层运行时命令(如 crictldocker ps)才能看到。

                      Q2: Pause 容器会影响性能吗?

                      A: 几乎不会。Pause 容器只占用约 0.5MB 内存,CPU 使用为 0,不执行任何业务逻辑。

                      Q3: 可以删除 Pause 容器吗?

                      A: 不能手动删除。删除 Pause 容器会导致整个 Pod 被销毁。

                      Q4: 不同 Pod 的 Pause 容器是否共享?

                      A: 不共享。每个 Pod 都有独立的 Pause 容器,确保 Pod 之间的网络和命名空间隔离。


                      📝 总结

用途 | 说明
命名空间持有者 | 创建并持有 Network、IPC、UTS 命名空间
网络基础 | 使 Pod 内所有容器共享同一 IP 和网络栈
生命周期锚点 | 代表 Pod 的生命周期,容器可独立重启
简化架构 | 解耦容器间依赖,避免级联故障
资源高效 | 极小的镜像和资源占用

                      核心价值: Pause 容器是 Kubernetes Pod 抽象的基石,让多个容器能像在同一主机上一样协作,同时保持各自的独立性和可重启性。


                      Mar 7, 2024

                      Pod在K8S中DNS解析流程和顺序

                      核心概念

                      1. CoreDNS: 从Kubernetes 1.11开始,CoreDNS是默认的DNS服务。它作为一个或多个Pod运行在kube-system命名空间下,并配有一个Kubernetes Service(通常叫kube-dns)。
                      2. resolv.conf 文件: 每个Pod的/etc/resolv.conf文件是DNS解析的蓝图。Kubelet会自动生成这个文件并挂载到Pod中。
                      3. DNS策略: 你可以通过Pod Spec中的dnsPolicy字段来配置DNS策略。

                      Pod 的 /etc/resolv.conf 解析

                      这是一个典型的Pod内的/etc/resolv.conf文件内容:

                      nameserver 10.96.0.10
                      search <namespace>.svc.cluster.local svc.cluster.local cluster.local
                      options ndots:5

                      让我们逐行分析:

                      1. nameserver 10.96.0.10

                      • 这是CoreDNS Service的集群IP地址。所有Pod的DNS查询默认都会发送到这个地址。
                      • 这个IP来自kubelet的--cluster-dns标志,在启动时确定。

                      2. search <namespace>.svc.cluster.local svc.cluster.local cluster.local

                      • 搜索域列表。当你使用不完整的域名(即不是FQDN)时,系统会按照这个列表的顺序,依次将搜索域附加到主机名后面,直到找到匹配的记录。
                      • <namespace>是你的Pod所在的命名空间,例如default
                      • 搜索顺序
                        • <pod-namespace>.svc.cluster.local
                        • svc.cluster.local
                        • cluster.local

                      3. options ndots:5

                      • 这是一个关键的优化/控制选项。
                      • 规则: 如果一个域名中的点(.)数量大于或等于这个值(这里是5),系统会将其视为绝对域名(FQDN),并首先尝试直接解析,不会走搜索域列表。
                      • 反之,如果点数少于5,系统会依次尝试搜索域,如果都失败了,最后再尝试名称本身。

                      DNS 解析流程与顺序(详解)

                      假设你的Pod在default命名空间,并且resolv.conf如上所示。

                      场景1:解析Kubernetes Service(短名称)

                      你想解析同一个命名空间下的Service:my-svc

                      1. 应用程序请求解析 my-svc
                      2. 系统检查名称 my-svc,点数(0) < 5。
                      3. 进入搜索流程
                        • 第一次尝试: my-svc.default.svc.cluster.local -> 成功! 返回ClusterIP。
                        • 解析结束。

                      场景2:解析不同命名空间的Service

                      你想解析另一个命名空间prod下的Service:my-svc.prod

                      1. 应用程序请求解析 my-svc.prod
                      2. 系统检查名称 my-svc.prod,点数(1) < 5。
                      3. 进入搜索流程
                        • 第一次尝试: my-svc.prod.default.svc.cluster.local -> 失败(因为该Service不在default命名空间)。
                        • 第二次尝试: my-svc.prod.svc.cluster.local -> 成功! 返回ClusterIP。
                        • 解析结束。

                      场景3:解析外部域名(例如 www.google.com

                      1. 应用程序请求解析 www.google.com
2. 系统检查名称 www.google.com,点数(2) < 5。
                      3. 进入搜索流程
                        • 第一次尝试: www.google.com.default.svc.cluster.local -> 失败
                        • 第二次尝试: www.google.com.svc.cluster.local -> 失败
                        • 第三次尝试: www.google.com.cluster.local -> 失败
                      4. 所有搜索域都失败了,系统最后尝试名称本身:www.google.com -> 成功! CoreDNS会将其转发给上游DNS服务器(例如宿主机上的DNS或网络中配置的DNS)。

                      场景4:解析被认为是FQDN的域名(点数 >= 5)

                      假设你有一个StatefulSet,Pod的FQDN是web-0.nginx.default.svc.cluster.local

1. 应用程序请求解析 web-0.nginx.default.svc.cluster.local。
2. 系统检查名称,点数(5) >= 5,因此它被视为绝对域名,直接发起查询,命中后立即返回。
3. 对比:如果应用使用的是较短的形式,例如 web-0.nginx.default.svc(3 个点,小于 5),系统仍会走搜索流程:
  • 先依次尝试 web-0.nginx.default.svc.default.svc.cluster.local、web-0.nginx.default.svc.svc.cluster.local,都会失败;
  • 直到拼接 cluster.local 得到 web-0.nginx.default.svc.cluster.local 才成功,中间产生多次无效查询。
  • 为了避免这种低效行为,最佳实践是在应用程序中配置或使用绝对域名(尾部带点)。

                      绝对域名示例: 应用程序请求解析 web-0.nginx.default.svc.cluster.local.(注意最后有一个点)。

                      • 系统识别其为FQDN,直接查询,不经过任何搜索域。这是最有效的方式。

                      DNS 策略

                      Pod的dnsPolicy字段决定了如何生成resolv.conf

                      • ClusterFirst(默认): DNS查询首先被发送到Kubernetes集群的CoreDNS。如果域名不在集群域内(例如cluster.local),查询会被转发到上游DNS。
                      • ClusterFirstWithHostNet: 对于使用hostNetwork: true的Pod,如果你想让它使用集群DNS,就需要设置这个策略。
                      • Default: Pod直接从宿主机继承DNS配置(即使用宿主的/etc/resolv.conf)。这意味着它不会使用CoreDNS。
                      • None: 忽略所有默认的DNS设置。你必须使用dnsConfig字段来提供自定义的DNS配置。
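
例如,一个使用 None 策略并完全自定义 DNS 的 Pod(示意,nameserver 与 ndots 取值均为假设):

apiVersion: v1
kind: Pod
metadata:
  name: custom-dns-pod
spec:
  dnsPolicy: None
  dnsConfig:
    nameservers:
      - 10.96.0.10        # 假设仍指向集群 CoreDNS
    searches:
      - default.svc.cluster.local
      - svc.cluster.local
    options:
      - name: ndots
        value: "2"        # 调小 ndots,减少外部域名的无效搜索
  containers:
  - name: app
    image: busybox
    command: ["sleep", "3600"]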

                      总结与流程图

                      解析顺序可以简化为以下决策流程:

                      flowchart TD
                          A[应用程序发起DNS查询] --> B{查询名称的<br>点数 '.' >= 5?}
                          
                          B -- 是<br>(视为FQDN) --> C[直接查询该名称]
                          C --> D{解析成功?}
                          D -- 是 --> E[返回结果]
                          D -- 否 --> F[解析失败]
                          
                          B -- 否<br>(视为短名称) --> G
                          subgraph G [循环搜索域列表]
                              direction LR
                              H[依次将搜索域附加<br>到名称后并查询] --> I{解析成功?}
                              I -- 是 --> J[返回结果]
                          end
                          
                          I -- 循环结束仍失败 --> K[直接查询原始名称]
                          K --> L{解析成功?}
                          L -- 是 --> E
                          L -- 否 --> F

                      关键要点:

                      1. 默认流向: Pod -> CoreDNS Service -> CoreDNS Pod -> (根据域判断)返回K8s记录或转发到上游DNS。
                      2. 搜索域顺序: 命名空间 -> svc -> cluster.local
                      3. ndots:5的影响: 这是为了在便利性和性能之间取得平衡。对于需要频繁访问的外部域名,为了性能最好在应用程序中配置FQDN(尾部带点)或调整ndots选项。
                      4. 调试技巧: 进入Pod并执行cat /etc/resolv.confnslookupdig命令是诊断DNS问题的第一步。
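
一个最小的排查示例(示意):

# 查看 Pod 内的 resolv.conf
kubectl exec -it <pod-name> -- cat /etc/resolv.conf

# 起一个临时 Pod 做 DNS 测试
kubectl run dns-test --rm -it --image=busybox:1.36 --restart=Never -- \
  nslookup kubernetes.default.svc.cluster.local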
                      Mar 7, 2024

                      当执行kubectl exec 命令时,发生了什么?

                      kubectl exec 的实现原理涉及多个组件协同工作,以下是详细原理分析:

                      1. 整体架构流程

                      用户 -> kubectl -> API Server -> Kubelet -> 容器运行时 -> 目标容器

                      2. 详细执行步骤

                      步骤1:kubectl 客户端处理

                      kubectl exec -it <pod-name> -- /bin/bash
                      • kubectl 解析命令参数
                      • 构造 Exec API 请求
                      • 建立与 API Server 的长连接
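
可以通过提高 kubectl 的日志级别直接观察这次请求(示意):

# -v=8 会打印 kubectl 实际发出的 HTTP 请求,
# 可以看到 POST .../api/v1/namespaces/<ns>/pods/<pod>/exec?... 以及协议升级过程
kubectl exec -v=8 -it <pod-name> -- /bin/sh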

                      步骤2:API Server 处理

                      // API 路径示例
                      POST /api/v1/namespaces/{namespace}/pods/{name}/exec
                      • 认证和授权检查
                      • 验证用户是否有 exec 权限
                      • 查找目标 Pod 所在节点
                      • 将请求代理到对应节点的 Kubelet
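
排查权限问题时,可以先确认当前用户是否拥有 exec 子资源的 create 权限(示意,dev-user 为假设的用户名):

# 检查当前用户能否对 pods/exec 执行 create(即能否 kubectl exec)
kubectl auth can-i create pods --subresource=exec

# 针对指定命名空间,或模拟其他用户检查
kubectl auth can-i create pods --subresource=exec -n default --as=dev-user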

                      步骤3:Kubelet 处理

                      // Kubelet 的 exec 处理逻辑
                      func (h *ExecHandler) serveExec(w http.ResponseWriter, req *http.Request) {
                          // 获取容器信息
                          // 调用容器运行时接口
                          // 建立数据流传输
                      }
                      • 通过 CRI(Container Runtime Interface)调用容器运行时
                      • 创建到容器的连接
                      • 管理标准输入、输出、错误流

                      步骤4:容器运行时执行

                      // CRI 接口定义
                      service RuntimeService {
                          rpc Exec(ExecRequest) returns (ExecResponse) {}
                      }
                      • Docker: 使用 docker exec 底层机制
                      • Containerd: 通过 task 执行命令
                      • CRI-O: 通过 conmon 管理执行会话

                      3. 关键技术机制

                      3.1 流式传输协议

                      // 使用 SPDY 或 WebSocket 协议
                      // 支持多路复用的数据流
                      type StreamProtocol interface {
                          Stream(stdin io.Reader, stdout, stderr io.Writer) error
                      }

                      3.2 终端处理(TTY)

                      // 伪终端配置
                      type ExecOptions struct {
                          Stdin     io.Reader
                          Stdout    io.Writer
                          Stderr    io.Writer
                          TTY       bool
                          ptyMaster *os.File
                      }

                      3.3 会话管理

                      // ExecSession 管理执行会话
                      type ExecSession struct {
                          id        string
                          stdinPipe io.WriteCloser
                          stdoutPipe io.ReadCloser
                          stderrPipe io.ReadCloser
                          done      chan struct{}
                      }

                      4. 网络通信流程

                      客户端 (kubectl)
                          ↓ HTTPS with SPDY/WebSocket
                      API Server
                          ↓ 代理连接
                      Kubelet (节点)
                          ↓ CRI gRPC
                      容器运行时
                          ↓ 容器命名空间
                      目标容器进程

                      5. 安全机制

                      5.1 认证授权

                      # RBAC 配置示例
                      apiVersion: rbac.authorization.k8s.io/v1
                      kind: ClusterRole
                      metadata:
                        name: pod-exec
                      rules:
                      - apiGroups: [""]
                        resources: ["pods/exec"]
                        verbs: ["create"]

                      5.2 安全上下文

                      // 安全配置
                      securityContext := &v1.SecurityContext{
                          RunAsUser:  &uid,
                          RunAsGroup: &gid,
                          Capabilities: &v1.Capabilities{
                              Drop: []v1.Capability{"ALL"},
                          },
                      }

                      6. 实际代码示例

                      kubectl 端实现

                      func (o *ExecOptions) Run() error {
                          // 建立与 API Server 的连接
                          executor, err := remotecommand.NewSPDYExecutor(
                              o.Config, "POST", req.URL())
                          
                          // 执行命令
                          return executor.Stream(remotecommand.StreamOptions{
                              Stdin:  o.In,
                              Stdout: o.Out,
                              Stderr: o.ErrOut,
                              Tty:    o.TTY,
                          })
                      }

                      Kubelet 端处理

                      func (h *ExecHandler) serveExec(w http.ResponseWriter, req *http.Request) {
                          // 获取容器 ID
                          containerID := podContainer.ContainerID
                          
                          // 通过 CRI 执行命令
                          execRequest := &runtimeapi.ExecRequest{
                              ContainerId: containerID.ID,
                              Cmd:         cmd,
                              Tty:         tty,
                              Stdin:       stdin,
                              Stdout:      stdout,
                              Stderr:      stderr,
                          }
                          
                          // 调用容器运行时
                          runtimeService.Exec(execRequest)
                      }

                      7. 容器运行时差异

                      Docker

                      // 使用 Docker Engine API
                      client.ContainerExecCreate()
                      client.ContainerExecAttach()

                      Containerd

                      // 使用 CRI 插件
                      task.Exec()

                      8. 故障排查要点

                      1. 权限问题: 检查 RBAC 配置
                      2. 网络连通性: API Server ↔ Kubelet 网络
                      3. 容器状态: 目标容器必须处于 Running 状态
                      4. 资源限制: 容器资源是否充足
                      5. 安全策略: Pod Security Policies 限制

                      这种设计使得 kubectl exec 能够在分布式环境中安全、可靠地执行容器内命令,同时保持了良好的用户体验。

                      Mar 7, 2024

                      QoS 详解

                      Kubernetes QoS (Quality of Service) 等级详解

                      QoS 等级是 Kubernetes 用来管理 Pod 资源和在资源不足时决定驱逐优先级的机制。


                      🎯 三种 QoS 等级

                      Kubernetes 根据 Pod 的资源配置自动分配 QoS 等级,共有三种:

                      1. Guaranteed (保证型) - 最高优先级

                      2. Burstable (突发型) - 中等优先级

                      3. BestEffort (尽力而为型) - 最低优先级


                      📊 QoS 等级详解

                      1️⃣ Guaranteed (保证型)

                      定义条件(必须同时满足)

                      • Pod 中每个容器(包括 Init 容器)都必须设置 requestslimits
                      • 对于每个容器,CPU 和内存的 requests 必须等于 limits

                      YAML 示例

                      apiVersion: v1
                      kind: Pod
                      metadata:
                        name: guaranteed-pod
                      spec:
                        containers:
                        - name: app
                          image: nginx
                          resources:
                            requests:
                              memory: "200Mi"
                              cpu: "500m"
                            limits:
                              memory: "200Mi"  # 必须等于 requests
                              cpu: "500m"      # 必须等于 requests

                      特点

                      资源保证:Pod 获得请求的全部资源,不会被其他 Pod 抢占
                      最高优先级:资源不足时最后被驱逐
                      性能稳定:资源使用可预测,适合关键业务
                      OOM 保护:不会因为节点内存压力被 Kill(除非超过自己的 limit)

                      适用场景

                      • 数据库(MySQL, PostgreSQL, Redis)
                      • 消息队列(Kafka, RabbitMQ)
                      • 核心业务应用
                      • 有状态服务

                      2️⃣ Burstable (突发型)

                      定义条件(满足以下任一条件)

                      • Pod 中至少有一个容器设置了 requestslimits
                      • requestslimits 不相等
                      • 部分容器设置了资源限制,部分没有

                      YAML 示例

                      场景 1:只设置 requests

                      apiVersion: v1
                      kind: Pod
                      metadata:
                        name: burstable-pod-1
                      spec:
                        containers:
                        - name: app
                          image: nginx
                          resources:
                            requests:
                              memory: "100Mi"
                              cpu: "200m"
                            # 没有设置 limits,可以使用超过 requests 的资源

                      场景 2:requests < limits

                      apiVersion: v1
                      kind: Pod
                      metadata:
                        name: burstable-pod-2
                      spec:
                        containers:
                        - name: app
                          image: nginx
                          resources:
                            requests:
                              memory: "100Mi"
                              cpu: "200m"
                            limits:
                              memory: "500Mi"  # 允许突发到 500Mi
                              cpu: "1000m"     # 允许突发到 1 核

                      场景 3:混合配置

                      apiVersion: v1
                      kind: Pod
                      metadata:
                        name: burstable-pod-3
                      spec:
                        containers:
                        - name: app1
                          image: nginx
                          resources:
                            requests:
                              memory: "100Mi"
                            limits:
                              memory: "200Mi"
                        - name: app2
                          image: busybox
                          resources:
                            requests:
                              cpu: "100m"
                            # 只设置 CPU,没有内存限制

                      特点

                      弹性使用:可以使用超过 requests 的资源(burst)
                      ⚠️ 中等优先级:资源不足时,在 BestEffort 之后被驱逐
                      ⚠️ 可能被限流:超过 limits 会被限制(CPU)或 Kill(内存)
                      成本优化:平衡资源保证和利用率

                      适用场景

                      • Web 应用(流量有波峰波谷)
                      • 定时任务
                      • 批处理作业
                      • 微服务(大部分场景)

                      3️⃣ BestEffort (尽力而为型)

                      定义条件

                      • Pod 中所有容器没有设置 requestslimits

                      YAML 示例

                      apiVersion: v1
                      kind: Pod
                      metadata:
                        name: besteffort-pod
                      spec:
                        containers:
                        - name: app
                          image: nginx
                          # 完全没有 resources 配置
                        - name: sidecar
                          image: busybox
                          # 也没有 resources 配置

                      特点

                      无资源保证:能用多少资源完全看节点剩余
                      最低优先级:资源不足时第一个被驱逐
                      性能不稳定:可能被其他 Pod 挤占资源
                      灵活性高:可以充分利用节点空闲资源

                      适用场景

                      • 开发测试环境
                      • 非关键后台任务
                      • 日志收集(可以容忍中断)
                      • 临时性工作负载

                      🔍 QoS 等级判定流程图

                      开始
                        │
                        ├─→ 所有容器都没设置 requests/limits?
                        │   └─→ 是 → BestEffort
                        │
                        ├─→ 所有容器的 requests == limits (CPU和内存)?
                        │   └─→ 是 → Guaranteed
                        │
                        └─→ 其他情况 → Burstable

                      🚨 资源不足时的驱逐顺序

                      当节点资源不足(如内存压力)时,Kubelet 按以下顺序驱逐 Pod:

                      驱逐优先级(从高到低):
                      
                      1. BestEffort Pod
                         └─→ 超出 requests 最多的先被驱逐
                      
                      2. Burstable Pod
                         └─→ 按内存使用量排序
                         └─→ 超出 requests 越多,越先被驱逐
                      
                      3. Guaranteed Pod (最后才驱逐)
                         └─→ 只有在没有其他选择时才驱逐

                      实际驱逐示例

                      # 节点内存不足场景:
                      节点总内存: 8GB
                      已用内存: 7.8GB (达到驱逐阈值)
                      
                      Pod 列表:
                      - Pod A (BestEffort): 使用 1GB 内存 → 第一个被驱逐 ❌
                      - Pod B (Burstable):  requests=200Mi, 使用 500Mi → 第二个 ❌
                      - Pod C (Burstable):  requests=500Mi, 使用 600Mi → 第三个 ❌
                      - Pod D (Guaranteed): requests=limits=1GB, 使用 1GB → 保留 ✅

                      📝 查看 Pod 的 QoS 等级

                      方法 1:使用 kubectl describe

                      kubectl describe pod <pod-name>
                      
                      # 输出中会显示:
                      # QoS Class:       Burstable

                      方法 2:使用 kubectl get

                      # 查看所有 Pod 的 QoS
                      kubectl get pods -o custom-columns=NAME:.metadata.name,QOS:.status.qosClass
                      
                      # 输出:
                      # NAME              QOS
                      # nginx-guaranteed  Guaranteed
                      # app-burstable     Burstable
                      # test-besteffort   BestEffort

                      方法 3:使用 YAML 输出

                      kubectl get pod <pod-name> -o yaml | grep qosClass
                      
                      # 输出:
                      # qosClass: Burstable

                      🎨 QoS 配置最佳实践

                      生产环境推荐配置

                      关键业务 - Guaranteed

apiVersion: apps/v1
kind: Deployment
metadata:
  name: critical-app
spec:
  replicas: 3
  selector:
    matchLabels:
      app: critical-app
  template:
    metadata:
      labels:
        app: critical-app
    spec:
      containers:
      - name: app
        image: myapp:v1
        resources:
          requests:
            memory: "2Gi"
            cpu: "1000m"
          limits:
            memory: "2Gi"      # requests == limits
            cpu: "1000m"

                      一般业务 - Burstable

apiVersion: apps/v1
kind: Deployment
metadata:
  name: web-app
spec:
  replicas: 5
  selector:
    matchLabels:
      app: web-app
  template:
    metadata:
      labels:
        app: web-app
    spec:
      containers:
      - name: web
        image: nginx:latest
        resources:
          requests:
            memory: "256Mi"    # 保证最低资源
            cpu: "200m"
          limits:
            memory: "512Mi"    # 允许突发到 2 倍
            cpu: "500m"

                      后台任务 - BestEffort 或 Burstable

                      apiVersion: batch/v1
                      kind: CronJob
                      metadata:
                        name: cleanup-job
                      spec:
                        schedule: "0 2 * * *"
                        jobTemplate:
                          spec:
                            template:
        spec:
          restartPolicy: OnFailure   # Job 模板必须显式指定 Never 或 OnFailure
          containers:
          - name: cleanup
            image: cleanup:v1
            resources:
              requests:
                memory: "128Mi"
                cpu: "100m"
              # 不设置 limits,允许使用空闲资源

                      🔧 QoS 与资源限制的关系

                      CPU 限制行为

                      resources:
                        requests:
                          cpu: "500m"    # 保证至少 0.5 核
                        limits:
                          cpu: "1000m"   # 最多使用 1 核
                      • requests:节点调度的依据,保证的资源
                      • limits:硬限制,超过会被限流(throttle),但不会被 Kill
                      • 超过 limits 时,进程会被 CPU throttle,导致性能下降

                      内存限制行为

                      resources:
                        requests:
                          memory: "256Mi"  # 保证至少 256Mi
                        limits:
                          memory: "512Mi"  # 最多使用 512Mi
                      • requests:调度保证,但可以使用更多
                      • limits:硬限制,超过会触发 OOM Kill 💀
                      • Pod 会被标记为 OOMKilled 并重启
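
排查 OOM 时,可以查看容器上一次退出的原因(示意):

# 查看容器上次终止原因,发生 OOM 时为 OOMKilled
kubectl get pod <pod-name> \
  -o jsonpath='{.status.containerStatuses[0].lastState.terminated.reason}'

# 查看重启次数与相关描述
kubectl describe pod <pod-name> | grep -E "Restart Count|OOMKilled"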

                      🛠️ 常见问题

                      Q1: 为什么我的 Pod 总是被驱逐?

                      # 检查 QoS 等级
                      kubectl get pod <pod-name> -o yaml | grep qosClass
                      
                      # 如果是 BestEffort 或 Burstable,建议:
                      # 1. 设置合理的 requests
                      # 2. 考虑升级到 Guaranteed(关键服务)
                      # 3. 增加节点资源

                      Q2: 如何为所有 Pod 设置默认资源限制?

                      # 使用 LimitRange
                      apiVersion: v1
                      kind: LimitRange
                      metadata:
                        name: default-limits
                        namespace: default
                      spec:
                        limits:
                        - default:              # 默认 limits
                            cpu: "500m"
                            memory: "512Mi"
                          defaultRequest:       # 默认 requests
                            cpu: "100m"
                            memory: "128Mi"
                          type: Container

                      Q3: Guaranteed Pod 也会被驱逐吗?

                      会! 但只在以下情况:

                      • 使用超过自己的 limits(OOM Kill)
                      • 节点完全不可用(如节点宕机)
                      • 手动删除 Pod
                      • DaemonSet 或系统级 Pod 需要资源

                      Q4: 如何监控 QoS 相关的问题?

                      # 查看节点资源压力
                      kubectl describe node <node-name> | grep -A 5 "Conditions:"
                      
                      # 查看被驱逐的 Pod
                      kubectl get events --field-selector reason=Evicted
                      
                      # 查看 OOM 事件
                      kubectl get events --field-selector reason=OOMKilling

                      📊 QoS 等级对比表

维度 | Guaranteed | Burstable | BestEffort
配置要求 | requests=limits | requests≠limits 或部分配置 | 无配置
资源保证 | ✅ 完全保证 | ⚠️ 部分保证 | ❌ 无保证
驱逐优先级 | 最低(最后驱逐) | 中等 | 最高(第一个驱逐)
性能稳定性 | ⭐⭐⭐⭐⭐ | ⭐⭐⭐ | ⭐
资源利用率 | 低(固定资源) | 高(可突发) | 最高(充分利用)
成本 | 高 | 中 | 低
适用场景 | 关键业务 | 一般业务 | 测试/临时任务

                      🎯 选型建议

                      使用 Guaranteed 的场景

                      • 🗄️ 数据库(MySQL, MongoDB, Cassandra)
                      • 📨 消息队列(Kafka, RabbitMQ)
                      • 🔐 认证服务
                      • 💰 支付系统
                      • 📊 实时数据处理

                      使用 Burstable 的场景

                      • 🌐 Web 应用(80% 的场景)
                      • 🔄 API 服务
                      • 🎨 前端应用
                      • 📦 微服务
                      • ⚙️ 后台处理

                      使用 BestEffort 的场景

                      • 🧪 开发测试
                      • 📝 日志收集(可容忍中断)
                      • 🔍 数据探索
                      • 🛠️ 一次性脚本

                      💡 关键要点总结

                      1. QoS 是自动分配的,不能手动指定,由资源配置决定
2. Guaranteed ≠ 绝对不会被驱逐,只是被驱逐的优先级最低(最后才会被考虑驱逐)
                      3. 生产环境建议至少使用 Burstable,避免 BestEffort
                      4. requests 影响调度,limits 影响运行时限制
                      5. 内存超限会 OOM,CPU 超限会限流
                      6. 使用 LimitRange 强制资源限制,避免 BestEffort Pod
                      Mar 7, 2024

                      Scheduler

                      Kubernetes 调度器(kube-scheduler) 是整个系统中非常关键的组件,它负责决定 哪个 Pod 应该运行在哪个 Node 上

                      下面我会分层、逐步详细说明 K8s 调度流程(以 v1.28+ 为例),并解释背后机制。


                      🌐 整体架构概览

                      Kubernetes 调度器主要完成以下职责:

                      1. 监听待调度的 Pod(即 spec.nodeName 为空的 Pod)
                      2. 为 Pod 选择最合适的 Node
                      3. 将绑定结果写回到 apiserver

                      🧩 一、调度总体流程

                      Kubernetes 调度流程主要分为三个阶段:

                      [Pending Pod] --> [Scheduling Queue] 
                           ↓
                       [PreFilter] → [Filter] → [PostFilter] → [Score] → [Reserve] → [Permit] → [Bind]

                      1️⃣ 调度入口:监听未绑定的 Pod

                      • Scheduler 通过 informer 监听所有 Pod 资源。
                      • 当发现 Pod 没有 spec.nodeName 时,认为它是待调度的。
                      • Pod 被放入 调度队列(SchedulingQueue) 中。

                      🧮 二、调度核心阶段详解

                      🧩 1. PreFilter 阶段

                      在调度之前,对 Pod 进行一些准备性检查,例如:

                      • 解析 Pod 所需的资源。
                      • 检查 PVC、Affinity、Taint/Toleration 是否合理。
                      • 计算调度所需的 topology spread 信息。

                      🧠 类似于“预处理”,提前准备好过滤阶段要用的数据。


                      🧩 2. Filter 阶段(Predicates)

                      Scheduler 遍历所有可调度的 Node,筛选出满足条件的节点。

                      常见的过滤插件包括:

插件 | 作用
NodeUnschedulable | 过滤掉被标记 unschedulable 的节点
NodeName | 如果 Pod 指定了 nodeName,只匹配该节点
TaintToleration | 检查 taint / toleration 是否匹配
NodeAffinity / PodAffinity | 检查亲和性/反亲和性
NodeResourcesFit | 检查 CPU/Memory 等资源是否够用
VolumeBinding | 检查 Pod 使用的 PVC 是否能在节点挂载

                      🔎 输出结果:

                      得到一个候选节点列表(通常是几十个或几百个)。


                      🧩 3. PostFilter 阶段

                      • 若没有节点符合条件(即调度失败),进入 抢占逻辑(Preemption)
                      • 调度器会尝试在某些节点上“抢占”低优先级的 Pod,以便高优先级 Pod 能调度成功。

                      🧩 4. Score 阶段(优选)

                      对剩余候选节点进行打分。 每个插件给节点打分(0–100),然后汇总加权。

                      常见的评分插件:

插件 | 作用
LeastAllocated | 资源使用最少的节点得分高
BalancedAllocation | CPU 和内存使用更均衡的节点得分高
NodeAffinity | 符合 affinity 的节点加分
ImageLocality | 本地已缓存镜像的节点加分
InterPodAffinity | 满足 Pod 间亲和性需求的节点加分

                      这些分数会经过 NormalizeScore 标准化到统一范围后求和。

                      最终输出:

                      最优节点(Score最高)


                      🧩 5. Reserve 阶段

                      暂时在该节点上 预留资源(在调度器内部缓存中标记),防止并发调度冲突。

                      如果后续失败,会执行 Unreserve 回滚。


                      🧩 6. Permit 阶段

                      某些插件可在绑定前再进行校验或等待,例如:

                      • PodGroup(批调度)
                      • Scheduler Framework 自定义策略

                      可能返回:

                      • Success → 继续绑定
                      • Wait → 等待事件
                      • Reject → 放弃调度

                      🧩 7. Bind 阶段

                      最终由 Bind 插件 调用 API 将 Pod 绑定到 Node:

                      spec:
                        nodeName: node-123

                      绑定成功后,kubelet 监听到该 Pod,会拉起容器。


                      ⚙️ 三、调度框架(Scheduler Framework)

                      K8s 1.19+ 后,调度器是通过 插件化框架 实现的。 每个阶段都有对应的插件点:

阶段 | 插件接口 | 示例插件
PreFilter | PreFilterPlugin | PodTopologySpread
Filter | FilterPlugin | NodeAffinity
Score | ScorePlugin | LeastAllocated
Reserve | ReservePlugin | VolumeBinding
Bind | BindPlugin | DefaultBinder

                      你可以通过写一个自定义调度插件(Go)扩展调度逻辑。
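
下面给出一个 Filter 插件的最小骨架(仅作示意、非权威实现:插件名 NodeLabelGate 与标签 example.io/allow-scheduling 都是假设的示例,接口基于 k8s.io/kubernetes/pkg/scheduler/framework 的 FilterPlugin,构造函数与注册方式随 Kubernetes 版本略有差异):

package nodelabelgate

import (
	"context"

	v1 "k8s.io/api/core/v1"
	"k8s.io/kubernetes/pkg/scheduler/framework"
)

// Name 为插件名(假设值),需要与 KubeSchedulerConfiguration 中启用的名字一致。
const Name = "NodeLabelGate"

// NodeLabelGate 过滤掉没有 example.io/allow-scheduling=true 标签的节点(标签为假设示例)。
type NodeLabelGate struct{}

var _ framework.FilterPlugin = &NodeLabelGate{}

func (pl *NodeLabelGate) Name() string { return Name }

// Filter 在 Filter 阶段被调度器逐节点调用;返回 Unschedulable 表示该节点被过滤掉。
func (pl *NodeLabelGate) Filter(ctx context.Context, state *framework.CycleState,
	pod *v1.Pod, nodeInfo *framework.NodeInfo) *framework.Status {
	node := nodeInfo.Node()
	if node == nil {
		return framework.NewStatus(framework.Error, "node not found")
	}
	if node.Labels["example.io/allow-scheduling"] != "true" {
		return framework.NewStatus(framework.Unschedulable, "node missing allow-scheduling label")
	}
	return nil // nil 等价于 Success,节点通过过滤
}

// 注册方式(示意):在自定义调度器的 main 中通过
//   app.NewSchedulerCommand(app.WithPlugin(Name, <本插件的构造函数>))
// 编译出自定义调度器二进制;构造函数(PluginFactory)的签名在不同版本间有差异,此处从略。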


                      🧭 四、调度失败的情况

                      常见调度失败的原因:

原因 | 表现
所有节点资源不足 | Pod 一直 Pending
亲和性限制太严格 | Pod 无法找到符合要求的节点
PVC 无法绑定 | VolumeBinding 阶段失败
节点被打 taint | 没有对应的 toleration
镜像拉取失败 | Pod 已绑定但容器起不来(kubelet 问题)
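
排查调度失败时,Pod 的 Events 通常会直接给出原因(示意,下面的输出内容仅为示例,不同版本措辞略有差异):

kubectl describe pod <pending-pod> | grep -A 5 Events
# 示例输出:
#   Warning  FailedScheduling  default-scheduler
#   0/3 nodes are available: 1 node(s) had untolerated taint, 2 Insufficient cpu.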

                      🧠 五、总结

阶段 | 目的 | 关键点
SchedulingQueue | 缓冲待调度 Pod | FIFO + 优先级调度
PreFilter | 准备数据 | 校验 Pod 需求
Filter | 过滤节点 | 资源与约束
Score | 打分选优 | 平衡与局部性
Reserve | 预留资源 | 防并发冲突
Bind | 绑定 Node | 调度结果落地

                      Mar 7, 2024

                      服务发现

                      最常见的说法是 “两种核心机制”,但这指的是服务发现的两种基本模式,而不是具体的实现方式。


                      维度一:两种核心模式

                      这是从服务发现的基本原理上划分的。

                      1. 基于客户端服务发现

                        • 工作原理:客户端(服务消费者)通过查询一个中心化的服务注册中心(如 Consul、Eureka、Zookeeper)来获取所有可用服务实例的列表(通常是 IP 和端口),然后自己选择一个实例并直接向其发起请求。
                        • 类比:就像你去餐厅吃饭,先看门口的电子菜单(服务注册中心)了解所有菜品和价格,然后自己决定点什么,再告诉服务员。
                        • 特点:客户端需要内置服务发现逻辑,与服务注册中心耦合。这种方式更灵活,但增加了客户端的复杂性。
                      2. 基于服务端服务发现

                        • 工作原理:客户端不关心具体的服务实例,它只需要向一个固定的访问端点(通常是 Load Balancer 或 Proxy,如 Kubernetes Service)发起请求。这个端点负责去服务注册中心查询可用实例,并进行负载均衡,将请求转发给其中一个。
                        • 类比:就像你去餐厅直接告诉服务员“来份招牌菜”,服务员(负载均衡器)帮你和后厨(服务实例)沟通,最后把菜端给你。
                        • 特点:客户端无需知道服务发现的具体细节,简化了客户端。这是 Kubernetes 默认采用的方式

                      维度二:Kubernetes 中具体的实现方式

                      在 Kubernetes 内部,我们通常讨论以下几种具体的服务发现实现手段,它们共同构成了 Kubernetes 强大的服务发现能力。

                      1. 环境变量

                      当 Pod 被调度到某个节点上时,kubelet 会为当前集群中存在的每个 Service 添加一组环境变量到该 Pod 中。

                      • 格式{SVCNAME}_SERVICE_HOST{SVCNAME}_SERVICE_PORT
                      • 例子:一个名为 redis-master 的 Service 会生成 REDIS_MASTER_SERVICE_HOST=10.0.0.11REDIS_MASTER_SERVICE_PORT=6379 这样的环境变量。
                      • 局限性:环境变量必须在 Pod 创建之前就存在。后创建的 Service 无法将环境变量注入到已运行的 Pod 中。因此,这通常作为辅助手段

                      2. DNS(最核心、最推荐的方式)

                      这是 Kubernetes 最主要和最优雅的服务发现方式。

                      • 工作原理:Kubernetes 集群内置了一个 DNS 服务器(通常是 CoreDNS)。当你创建一个 Service 时,Kubernetes 会自动为这个 Service 注册一个 DNS 记录。
                      • DNS 记录格式
                        • 同一命名空间<service-name>.<namespace>.svc.cluster.local -> 指向 Service 的 Cluster IP。
                          • 在同一个命名空间内,你可以直接使用 <service-name> 来访问服务。例如,前端 Pod 访问后端服务,只需使用 http://backend-service
                        • 不同命名空间:需要使用全限定域名,例如 backend-service.production.svc.cluster.local
                      • 优点:行为符合标准,应用无需修改代码,直接使用域名即可访问其他服务。

                      3. Kubernetes Service

                      Service 资源对象本身就是服务发现的载体。它提供了一个稳定的访问端点(VIP 或 DNS 名称),背后对应一组动态变化的 Pod。

                      • ClusterIP:默认类型,提供一个集群内部的虚拟 IP,只能从集群内部访问。结合 DNS 使用,是服务间通信的基石。
                      • NodePort:在 ClusterIP 基础上,在每个节点上暴露一个静态端口。可以从集群外部通过 <NodeIP>:<NodePort> 访问服务。
                      • LoadBalancer:在 NodePort 基础上,利用云服务商提供的负载均衡器,将一个外部 IP 地址暴露给 Service。是向公网暴露服务的主要方式。
                      • Headless Service:一种特殊的 Service,当你不需要负载均衡和单个 Service IP 时,可以通过设置 clusterIP: None 来创建。DNS 查询会返回该 Service 后端所有 Pod 的 IP 地址列表,而不是一个 VIP。这常用于有状态应用(如 Kafka、MySQL 集群)的自定义负载均衡或状态同步。
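
一个最小的 Headless Service 示意(clusterIP 设为 None,DNS 查询将直接返回后端各 Pod 的 IP):

apiVersion: v1
kind: Service
metadata:
  name: mysql-headless
spec:
  clusterIP: None          # Headless:不分配 VIP
  selector:
    app: mysql
  ports:
  - port: 3306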

                      4. Ingress

                      虽然 Ingress 主要被用作 HTTP/HTTPS 路由规则的七层代理,但它也是一种高级的服务发现形式。

                      • 它通过规则将外部流量路由到集群内部相应的 Service。
                      • 客户端(外部用户)通过访问 Ingress Controller 的地址来发现和访问后端服务。

                      总结与对比

方式 | 原理 | 适用场景 | 特点
环境变量 | 将 Service 信息注入 Pod 环境变量 | 旧式应用,辅助手段 | 简单,但有局限性(需先于 Pod 创建)
DNS | 为 Service 自动注册域名 | 服务间通信的标准方式 | 推荐,符合惯例,无需代码改造
Service | 提供稳定的虚拟 IP 或 DNS 名称 | 服务暴露和负载均衡的核心抽象 | Kubernetes 服务发现的基石
Ingress | 七层 HTTP 路由 | 对外暴露 Web 服务,基于域名和路径路由 | 更高级的 API 网关模式

                      结论:

                      • 从模式上讲,Kubernetes 主要采用基于服务端的服务发现模式。
                      • 从具体实现上讲,Kubernetes 的服务发现是一个以 DNS 为核心、以 Service 为基石的完整体系,并辅以环境变量和 Ingress 等机制。

                      所以,当有人问“K8s服务发现有哪些方式”时,最准确的回答是:主要有基于 DNS 和 环境变量这两种内部发现机制,而它们都依赖于 Service 这个核心抽象。同时,Ingress 提供了外部到内部的服务发现和路由能力。

                      Mar 7, 2024

                      Service VS Endpoint

                      Service 和 Endpoint/EndpointSlice 在 Kubernetes 中有明确的功能分工,它们共同构成了服务发现和负载均衡的基础。以下是详细的区别分析:

                      一、核心功能定位

                      Service - 抽象服务层

                      apiVersion: v1
                      kind: Service
                      metadata:
                        name: web-service
                      spec:
                        selector:
                          app: web-server
                        ports:
                          - protocol: TCP
                            port: 80           # 服务端口
                            targetPort: 8080   # 后端 Pod 端口
                        type: ClusterIP        # 服务类型

                      Service 的核心功能:

                      • 服务抽象:提供稳定的虚拟 IP 和 DNS 名称
                      • 访问入口:定义客户端如何访问服务
                      • 负载均衡策略:指定流量分发方式
                      • 服务类型:ClusterIP、NodePort、LoadBalancer、ExternalName

                      Endpoint/EndpointSlice - 后端实现层

                      apiVersion: v1
                      kind: Endpoints
                      metadata:
                        name: web-service      # 必须与 Service 同名
                      subsets:
                        - addresses:
                          - ip: 10.244.1.5
                            targetRef:
                              kind: Pod
                              name: web-pod-1
                          - ip: 10.244.1.6
                            targetRef:
                              kind: Pod  
                              name: web-pod-2
                          ports:
                          - port: 8080
                            protocol: TCP

                      Endpoints 的核心功能:

                      • 后端发现:记录实际可用的 Pod IP 地址
                      • 健康状态:只包含通过就绪探针检查的 Pod
                      • 动态更新:实时反映后端 Pod 的变化
                      • 端口映射:维护 Service port 到 Pod port 的映射

                      二、详细功能对比

功能特性 | Service | Endpoint/EndpointSlice
抽象级别 | 逻辑抽象层 | 物理实现层
数据内容 | 虚拟 IP、端口、选择器 | 实际 Pod IP 地址、端口
稳定性 | 稳定的 VIP 和 DNS | 动态变化的 IP 列表
创建方式 | 手动定义 | 自动生成(或手动)
更新频率 | 低频变更 | 高频动态更新
DNS 解析 | 返回 Service IP | 不直接参与 DNS
负载均衡 | 定义策略 | 提供后端目标

                      三、实际工作流程

                      1. 服务访问流程

                      客户端请求 → Service VIP → kube-proxy → Endpoints → 实际 Pod
                          ↓           ↓           ↓           ↓           ↓
                        DNS解析     虚拟IP      iptables/   后端IP列表   具体容器
                                   10.96.x.x   IPVS规则    10.244.x.x   应用服务

                      2. 数据流向示例

                      # 客户端访问
                      curl http://web-service.default.svc.cluster.local
                      
                      # DNS 解析返回 Service IP
                      nslookup web-service.default.svc.cluster.local
                      # 返回: 10.96.123.456
                      
                      # kube-proxy 根据 Endpoints 配置转发
                      iptables -t nat -L KUBE-SERVICES | grep 10.96.123.456
                      # 转发到: 10.244.1.5:8080, 10.244.1.6:8080

                      四、使用场景差异

                      Service 的使用场景

                      # 1. 内部服务访问
                      apiVersion: v1
                      kind: Service
                      metadata:
                        name: internal-api
                      spec:
                        type: ClusterIP
                        selector:
                          app: api-server
                        ports:
                          - port: 8080
                      
                      # 2. 外部访问
                      apiVersion: v1
                      kind: Service  
                      metadata:
                        name: external-web
                      spec:
                        type: LoadBalancer
                        selector:
                          app: web-frontend
                        ports:
                          - port: 80
                            nodePort: 30080
                      
                      # 3. 外部服务代理
                      apiVersion: v1
                      kind: Service
                      metadata:
                        name: external-database
                      spec:
                        type: ExternalName
                        externalName: database.example.com

                      Endpoints 的使用场景

                      # 1. 自动后端管理(默认)
                      # Kubernetes 自动维护匹配 Pod 的 Endpoints
                      
                      # 2. 外部服务集成
                      apiVersion: v1
                      kind: Service
                      metadata:
                        name: legacy-system
                      spec:
                        ports:
                          - port: 3306
                      ---
                      apiVersion: v1
                      kind: Endpoints
                      metadata:
                        name: legacy-system
                      subsets:
                        - addresses:
                          - ip: 192.168.1.100  # 外部数据库
                          ports:
                          - port: 3306
                      
                      # 3. 多端口复杂服务
                      apiVersion: v1
                      kind: Service
                      metadata:
                        name: complex-app
                      spec:
                        ports:
                        - name: http
                          port: 80
                        - name: https
                          port: 443
                        - name: metrics
                          port: 9090

                      五、配置和管理差异

                      Service 配置重点

                      apiVersion: v1
                      kind: Service
                      metadata:
                        name: optimized-service
                        annotations:
                          # 负载均衡配置
                          service.beta.kubernetes.io/aws-load-balancer-type: "nlb"
                          # 会话保持
                          service.kubernetes.io/aws-load-balancer-backend-protocol: "http"
                      spec:
                        type: LoadBalancer
                        selector:
                          app: optimized-app
                        sessionAffinity: ClientIP
                        sessionAffinityConfig:
                          clientIP:
                            timeoutSeconds: 10800
                        ports:
                        - name: http
                          port: 80
                          targetPort: 8080
                        # 流量策略(仅对外部流量)
                        externalTrafficPolicy: Local

                      Endpoints 配置重点

                      apiVersion: v1
                      kind: Endpoints
                      metadata:
                        name: custom-endpoints
                        labels:
                          # 用于网络策略选择
                          environment: production
                      subsets:
                      - addresses:
                        - ip: 10.244.1.10
                          nodeName: worker-1
                          targetRef:
                            kind: Pod
                            name: app-pod-1
                            namespace: production
                        - ip: 10.244.1.11
                          nodeName: worker-2  
                          targetRef:
                            kind: Pod
                            name: app-pod-2
                            namespace: production
                        # 多端口定义
                        ports:
                        - name: http
                          port: 8080
                          protocol: TCP
                        - name: metrics
                          port: 9090
                          protocol: TCP
                        - name: health
                          port: 8081
                          protocol: TCP

                      六、监控和调试差异

                      Service 监控重点

                      # 检查 Service 状态
                      kubectl get services
                      kubectl describe service web-service
                      
                      # Service 相关指标
# 注意:kubectl top 目前只支持 nodes / pods,没有 services 子命令
                      kubectl get --raw /api/v1/namespaces/default/services/web-service/proxy/metrics
                      
                      # DNS 解析测试
                      kubectl run test-$RANDOM --image=busybox --rm -it -- nslookup web-service

                      Endpoints 监控重点

                      # 检查后端可用性
                      kubectl get endpoints
                      kubectl describe endpoints web-service
                      
                      # 验证后端 Pod 状态
                      kubectl get pods -l app=web-server -o wide
                      
                      # 检查就绪探针
                      kubectl get pods -l app=web-server -o jsonpath='{.items[*].spec.containers[*].readinessProbe}'
                      
                      # 直接测试后端连通性
kubectl run test-$RANDOM --image=busybox --rm -it -- sh
# 在容器内执行: telnet 10.244.1.5 8080

                      七、性能考虑差异

                      Service 性能优化

                      apiVersion: v1
                      kind: Service
                      metadata:
                        name: high-performance
  # 说明:IPVS 模式与调度算法由 kube-proxy 统一配置
  # (例如 --proxy-mode=ipvs --ipvs-scheduler=wrr),而不是通过 Service 注解按服务设置
                      spec:
                        type: ClusterIP
                        clusterIP: None  # Headless Service,减少一层转发
                        selector:
                          app: high-perf-app

                      Endpoints 性能优化

                      # 使用 EndpointSlice 提高大规模集群性能
                      apiVersion: discovery.k8s.io/v1
                      kind: EndpointSlice
                      metadata:
                        name: web-service-abc123
                        labels:
                          kubernetes.io/service-name: web-service
                      addressType: IPv4
                      ports:
                      - name: http
                        protocol: TCP
                        port: 8080
                      endpoints:
                      - addresses:
                        - "10.244.1.5"
                        conditions:
                          ready: true
                        # 拓扑感知,优化路由
                        zone: us-west-2a
                        hints:
                          forZones:
                          - name: us-west-2a

                      八、总结

维度 | Service | Endpoint/EndpointSlice
角色 | 服务门面 | 后端实现
稳定性 | 高(VIP/DNS 稳定) | 低(IP 动态变化)
关注点 | 如何访问 | 谁能被访问
配置频率 | 低频 | 高频自动更新
网络层级 | L4 负载均衡 | 后端目标发现
扩展性 | 通过类型扩展 | 通过 EndpointSlice 扩展

                      简单比喻:

                      • Service 就像餐厅的接待台和菜单 - 提供统一的入口和访问方式
                      • Endpoints 就像后厨的厨师列表 - 记录实际提供服务的人员和位置

                      两者协同工作,Service 定义"什么服务可用",Endpoints 定义"谁可以提供这个服务",共同实现了 Kubernetes 强大的服务发现和负载均衡能力。

                      Mar 7, 2024

                      StatefulSet

                      StatefulSet 如何具体解决有状态应用的挑战


                      StatefulSet 的四大核心机制

                      StatefulSet 通过一系列精心设计的机制,为有状态应用提供了稳定性和可预测性。

                      1. 稳定的网络标识

                      解决的问题:有状态应用(如数据库节点)需要稳定的主机名来相互发现和通信,不能使用随机名称。

                      StatefulSet 的实现

                      • 固定的 Pod 名称:Pod 名称遵循固定模式:<statefulset-name>-<ordinal-index>
                        • 例如:redis-cluster-0redis-cluster-1redis-cluster-2
                      • 稳定的 DNS 记录:每个 Pod 都会自动获得一个唯一的、稳定的 DNS 记录:
                        • 格式<pod-name>.<svc-name>.<namespace>.svc.cluster.local
                        • 例子redis-cluster-0.redis-service.default.svc.cluster.local

                      应对场景

                      • 在 Redis 集群中,redis-cluster-0 可以告诉 redis-cluster-1:“我的地址是 redis-cluster-0.redis-service",这个地址在 Pod 的一生中都不会改变,即使它被重新调度到其他节点。

                      2. 有序的部署与管理

                      解决的问题:像 Zookeeper、Etcd 这样的集群化应用,节点需要按顺序启动和加入集群,主从数据库也需要先启动主节点。

                      StatefulSet 的实现

                      • 有序部署:当创建 StatefulSet 时,Pod 严格按照索引顺序(0, 1, 2…)依次创建。必须等 Pod-0 完全就绪(Ready)后,才会创建 Pod-1
                      • 有序扩缩容
                        • 扩容:按顺序创建新 Pod(如从 3 个扩展到 5 个,会先创建 pod-3,再 pod-4)。
                        • 缩容:按逆序终止 Pod(从 pod-4 开始,然后是 pod-3)。
                      • 有序滚动更新:同样遵循逆序策略,确保在更新过程中大部分节点保持可用。

                      应对场景

                      • 部署 MySQL 主从集群时,StatefulSet 会确保 mysql-0(主节点)先启动并初始化完成,然后才启动 mysql-1(从节点),从节点在启动时就能正确连接到主节点进行数据同步。

                      3. 稳定的持久化存储

                      这是 StatefulSet 最核心的特性!

                      解决的问题:有状态应用的数据必须持久化,并且当 Pod 发生故障或被调度到新节点时,必须能够重新挂载到它自己的那部分数据

                      StatefulSet 的实现

                      • Volume Claim Template:在 StatefulSet 的 YAML 中,你可以定义一个 volumeClaimTemplate(存储卷申请模板)。
                      • 专属的 PVC:StatefulSet 会为每个 Pod 实例根据这个模板创建一个独立的、专用的 PersistentVolumeClaim (PVC)。
                        • mysql-0 -> pvc-name-mysql-0
                        • mysql-1 -> pvc-name-mysql-1
                        • mysql-2 -> pvc-name-mysql-2

                      工作流程

                      1. 当你创建名为 mysql、副本数为 3 的 StatefulSet 时,K8s 会:
                        • 创建 Pod mysql-0,并同时创建 PVC data-mysql-0,然后将它们绑定。
                        • mysql-0 就绪后,创建 Pod mysql-1 和 PVC data-mysql-1,然后绑定。
                        • 以此类推。
                      2. 如果节点故障导致 mysql-1 被删除,K8s 调度器会在其他健康节点上重新创建一个同名的 Pod mysql-1
                      3. 这个新 Pod mysql-1 会自动挂载到之前为它创建的、存有它专属数据的 PVC data-mysql-1 上。
                      4. 这样,Pod 虽然"漂移"了,但数据依然跟随,应用可以无缝恢复。

                      应对场景

                      • 对于数据库,每个 Pod 都有自己独立的数据目录。mysql-0 的数据永远不会和 mysql-1 的数据混淆。这为数据分片(Sharding)和主从复制提供了基础。

                      4. 稳定的启动顺序与唯一身份

                      解决的问题:应用启动脚本或配置可能需要知道当前实例的索引号(如,判断自己是否是第一个节点,从而决定是否要初始化集群)。

                      StatefulSet 的实现

                      • Pod 的序号(0, 1, 2...)就是其稳定的、唯一的身份标识。
                      • 这个序号可以通过环境变量注入到 Pod 的容器中,供应用程序使用。
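
一种常见做法(示意)是通过 Downward API 把 Pod 名称注入为环境变量,再在启动脚本里截取末尾的序号:

spec:
  containers:
  - name: app
    image: busybox
    env:
    - name: POD_NAME
      valueFrom:
        fieldRef:
          fieldPath: metadata.name     # 形如 web-2
    command:
    - sh
    - -c
    - |
      ORDINAL=${POD_NAME##*-}          # 截取末尾序号,如 2
      if [ "$ORDINAL" = "0" ]; then
        echo "I am the first node, bootstrap the cluster"
      fi
      sleep 3600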

                      总结

                      StatefulSet 通过「稳定的网络标识 + 有序的部署管理 + 稳定的专属存储」这三驾马车,将有状态应用在动态的、易变的容器环境中"锚定”,使其获得了类似物理机或虚拟机的稳定运行特性。

                      它并没有让有状态应用本身变得简单,而是为这些应用的复杂需求(如集群化、数据持久化)提供了一个可以在 K8s 上稳定运行的框架和平台。对于运维人员来说,这意味着可以用声明式的方式像管理无状态应用一样,去管理数据库、消息队列等复杂的有状态服务。

                      Mar 7, 2024

                      StatefulSet 2

                      答案是:

                      StatefulSet 不使用 ReplicaSet 来创建 Pod。 ✅ StatefulSet Controller 直接创建并管理 Pod 对象,而且每个 Pod 都是有序的、具名的、不可替代的。

                      接下来我会详细讲清楚 StatefulSet 与 Deployment 的架构差异、实现机制、核心控制循环,让你能在实现 Operator 或高可靠应用时真正理解它的行为。


                      🧩 一、StatefulSet 与 Deployment 的根本区别

项目 | Deployment | StatefulSet
控制下层对象 | ReplicaSet → Pod | 直接管理 Pod
Pod 命名 | 随机 | 固定命名(有序索引)
Pod 身份 | 可替换 | 唯一、持久身份(Stable Identity)
更新策略 | 滚动更新(无序) | 有序更新(默认从最高序号向 0 逐个)
存储 | 通常无状态 | 绑定 PVC,数据与 Pod 一一对应
常见场景 | Web 服务、API、Job | 数据库、Zookeeper、Kafka、Etcd 等

                      ⚙️ 二、StatefulSet 控制器工作原理

                      StatefulSet Controller 运行在 kube-controller-manager 中。 它同样是一个典型的 Controller + Informer + WorkQueue + Reconcile Loop 架构。

                      但是它的 核心循环逻辑 与 Deployment/ReplicaSet 明显不同:

                      1️⃣ Reconcile 流程概览

                      伪代码如下:

                      syncStatefulSet(set):
                          pods := listPodsMatching(set.selector)
                          sortPodsByOrdinal(pods)
                      
                          for i in 0..set.spec.replicas-1:
                              ensurePodExists(set, ordinal=i)
                              ensurePodPVCExists(set, ordinal=i)
                              if updateRequired(set, pod[i]):
                                  deletePodAndWaitForReplacement(set, ordinal=i)

                      也就是说,StatefulSet controller 自己直接创建 Pod 和 PVC,并且顺序控制整个过程。


                      2️⃣ Pod 命名与序号机制

                      StatefulSet 会给每个 Pod 一个固定的命名模式:

                      <prefix>-<ordinal>

                      例如:

                      web-0
                      web-1
                      web-2

这些名字不像 Deployment 的 Pod 那样带随机后缀,而是由序号确定的。这让 Pod 可以拥有稳定的网络标识:

                      • Pod DNS:<podname>.<headless-service-name>.<namespace>.svc.cluster.local

                      • 例如:

                        web-0.web.default.svc.cluster.local
                        web-1.web.default.svc.cluster.local

                      这对 Zookeeper、MySQL 集群等“节点必须有固定身份”的应用至关重要。


                      3️⃣ 存储(PVC)绑定机制

StatefulSet 可以通过 volumeClaimTemplates 为每个 Pod 声明专属的存储模板:

                      volumeClaimTemplates:
                      - metadata:
                          name: data
                        spec:
                          accessModes: ["ReadWriteOnce"]
                          resources:
                            requests:
                              storage: 10Gi

                      Controller 会为每个 Pod 创建一个独立的 PVC:

                      data-web-0
                      data-web-1
                      data-web-2

                      即使 Pod 被删除,这个 PVC 也不会被清理(除非手动删除), 这就实现了状态持久化


                      4️⃣ Pod 创建顺序与滚动更新顺序

                      StatefulSet 有严格的有序逻辑:

                      创建顺序

                      • 先创建 web-0
                      • web-0 变成 Running + Ready
                      • 再创建 web-1
                      • 依次类推。

                      删除顺序

                      • 先删除最高序号 Pod;
                      • 依次向下。

                      更新顺序(滚动更新)

                      • 先更新最高序号 Pod;

                      • 等该 Pod Ready 后,再更新下一个;

                      • 可配置 podManagementPolicy

                        • OrderedReady(默认,有序)
                        • Parallel(并行)

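一个简化的配置示意(名称、镜像为假设,非完整可用清单),对比两种 podManagementPolicy 的写法:

apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: web
spec:
  serviceName: web
  replicas: 3
  podManagementPolicy: OrderedReady   # 默认值:按序号依次创建/删除,前一个 Ready 后才处理下一个
  # podManagementPolicy: Parallel     # 并行创建/删除所有 Pod,适合彼此没有启动依赖的应用
  selector:
    matchLabels:
      app: web
  template:
    metadata:
      labels:
        app: web
    spec:
      containers:
      - name: nginx
        image: nginx
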
                      5️⃣ StatefulSet 更新控制逻辑

                      当你修改 StatefulSet 的模板(例如更换镜像),控制器会:

                      1. 比较 spec.template hash;
                      2. 找出需要更新的 Pod;
                      3. 按序更新;
                      4. 每个 Pod 更新完毕、Ready 后,再进行下一个。

                      伪代码示例:

                      for i = replicas-1 downTo 0:
                          if pod[i].template != set.template:
                              delete(pod[i])
                              waitUntilNewPodReady(i)

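对应的声明式写法大致如下(仅为 spec 片段示意):updateStrategy 控制滚动更新方式,partition 可用于金丝雀发布,只更新序号大于等于 partition 的 Pod:

spec:
  updateStrategy:
    type: RollingUpdate        # 默认;改为 OnDelete 则只有手动删除 Pod 时才用新模板重建
    rollingUpdate:
      partition: 2             # 只有序号 >= 2 的 Pod 会被更新,0、1 保持旧版本
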
                      📦 三、StatefulSet 与 PVC 的生命周期

| 对象 | 何时创建 | 何时删除 |
|------|----------|----------|
| Pod | 当 StatefulSet 第一次启动或扩容时 | 缩容时(最高序号先删) |
| PVC | Pod 第一次被创建时 | 默认不会自动删除(需手动删除,或在较新版本中通过 persistentVolumeClaimRetentionPolicy 配置) |

                      这意味着:

                      StatefulSet 的 数据持久性是通过 PVC 与序号绑定实现的

                      例如:

                      data-web-0 → 永远属于 web-0
                      data-web-1 → 永远属于 web-1

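补充一点:较新版本的 Kubernetes 允许在 StatefulSet 上声明 PVC 的保留策略(以下为 spec 片段示意,具体可用性取决于集群版本与特性开关):

spec:
  persistentVolumeClaimRetentionPolicy:
    whenDeleted: Retain   # 删除 StatefulSet 时保留 PVC(默认行为)
    whenScaled: Delete    # 缩容时删除对应序号的 PVC(默认是 Retain)
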
                      🔁 四、StatefulSet 的控制链路(事件流)

| 步骤 | 执行者 | 说明 |
|------|--------|------|
| 1 | 用户 | kubectl apply 创建 StatefulSet |
| 2 | API Server | 写入 etcd |
| 3 | StatefulSet Controller | 检测 StatefulSet 对象变化 |
| 4 | Controller | 创建第一个 Pod + PVC(ordinal=0) |
| 5 | Kubelet | 调度并运行容器 |
| 6 | Controller | 检测 web-0 Ready 后,继续创建 web-1 |
| 7 | Controller | 重复上述过程,直到所有 replicas Ready |
| 8 | Controller | 模板变更时,按序滚动更新 Pod |

                      🧠 五、StatefulSet 为什么不复用 ReplicaSet?

                      主要是因为 ReplicaSet 的特性与 StatefulSet 的设计目标相冲突:

| 功能 | ReplicaSet | StatefulSet |
|------|------------|-------------|
| Pod 匿名性 | ✅ 所有 Pod 等价 | ❌ 每个 Pod 唯一 |
| 更新无序性 | ✅ 任意替换 | ❌ 必须有序 |
| PVC 管理 | ❌ 无 | ✅ 一一绑定 |
| 网络身份 | ❌ 无 | ✅ 稳定 DNS 名称 |

                      所以 Kubernetes 设计了专门的 StatefulSet Controller 来实现这一类“有状态副本控制”。


                      💡 六、StatefulSet 小结

| 特性 | 描述 |
|------|------|
| 控制对象 | 直接管理 Pod |
| Pod 命名 | 固定命名(带序号) |
| Pod 管理策略 | 有序(OrderedReady)或并行(Parallel) |
| 更新策略 | RollingUpdate(按序)或 OnDelete |
| 持久化 | 通过 PVC 一对一绑定 |
| 典型场景 | 数据库、分布式存储、消息队列、协调服务等 |
| 优势 | 稳定网络身份、持久存储、严格更新顺序 |
| 缺点 | 扩缩容慢,调度受限,不适合无状态应用 |

                      🧭 七、视觉化理解(结构示意)

                      StatefulSet: kafka
                      │
                      ├── Pod: kafka-0
                      │   ├── PVC: data-kafka-0
                      │   └── Hostname: kafka-0.kafka.default.svc
                      │
                      ├── Pod: kafka-1
                      │   ├── PVC: data-kafka-1
                      │   └── Hostname: kafka-1.kafka.default.svc
                      │
                      └── Pod: kafka-2
                          ├── PVC: data-kafka-2
                          └── Hostname: kafka-2.kafka.default.svc
                      Mar 7, 2024

                      Talk between 2 pods in different nodes

这是一个非常核心的 Kubernetes 网络问题。不同 Node 上的 Pod 之间的通信过程,清晰地展示了 Kubernetes 网络模型的核心思想:所有 Pod 处于同一个扁平的 IP 地址空间中,每个 Pod 拥有独立的 IP,无论它运行在哪个节点上,Pod 之间都可以直接通过这个 IP 进行通信,而无需使用 NAT。

这个过程的实现完全依赖于容器网络接口(CNI)插件,如 Calico、Flannel、Weave Net 等。下面我们以最经典的 Flannel(VXLAN 模式)和 Calico(BGP 模式)为例,来阐述这个通信过程。


                      核心原则

                      1. Pod IP 可达性:Kubernetes 网络模型要求,任何 Pod 的 IP 地址都能被任何其他 Pod 直接访问,无论它们是否在同一个节点上。
                      2. 无 NAT:Pod 到 Pod 的通信不应该经过源地址转换(SNAT)或目的地址转换(DNAT)。Pod 看到的源 IP 和目标 IP 就是真实的 Pod IP。

                      通用通信流程(抽象模型)

                      假设有两个 Pod:

                      • Pod A:在 Node 1 上,IP 为 10.244.1.10
                      • Pod B:在 Node 2 上,IP 为 10.244.2.20

                      Pod A 试图 ping Pod B 的 IP (10.244.2.20) 时,过程如下:

                      1. 出站:从 Pod A 到 Node 1

                      • Pod A 根据其内部路由表,将数据包从自己的网络命名空间内的 eth0 接口发出。
                      • 目标 IP 是 10.244.2.20
                      • Node 1 上,有一个网桥(如 cni0)充当了所有本地 Pod 的虚拟交换机。Pod A 的 eth0 通过一对 veth pair 连接到这个网桥。
                      • 数据包到达网桥 cni0

                      2. 路由决策:在 Node 1 上

                      • Node 1内核路由表 由 CNI 插件配置。它查看数据包的目标 IP 10.244.2.20
                      • 路由表规则大致如下:
                        Destination     Gateway         Interface
                        10.244.1.0/24   ...            cni0      # 本地 Pod 网段,走 cni0 网桥
                        10.244.2.0/24   192.168.1.102  eth0      # 非本地 Pod 网段,通过网关(即 Node 2 的 IP)从物理网卡 eth0 发出
                      • 路由表告诉内核,去往 10.244.2.0/24 网段的数据包,下一跳是 192.168.1.102(即 Node 2 的物理 IP),并通过 Node 1 的物理网络接口 eth0 发出。

                      从这里开始,不同 CNI 插件的工作机制产生了差异。


                      场景一:使用 Flannel (VXLAN 模式)

                      Flannel 通过创建一个覆盖网络 来解决跨节点通信。

                      1. 封装

                        • 数据包(源 10.244.1.10,目标 10.244.2.20)到达 Node 1eth0 之前,会被一个特殊的虚拟网络设备 flannel.1 截获。
                        • flannel.1 是一个 VXLAN 隧道端点
• 封装:flannel.1 会将整个原始数据包(作为 payload)封装在一个新的 UDP 数据包中。
                          • 外层 IP 头:源 IP 是 Node 1 的 IP (192.168.1.101),目标 IP 是 Node 2 的 IP (192.168.1.102)。
                          • 外层 UDP 头:目标端口通常是 8472 (VXLAN)。
                          • VXLAN 头:包含一个 VNI,用于标识不同的虚拟网络。
                          • 内层原始数据包:原封不动。
                      2. 物理网络传输

                        • 这个封装后的 UDP 数据包通过 Node 1 的物理网络 eth0 发送出去。
                        • 它经过底层物理网络(交换机、路由器)顺利到达 Node 2,因为外层 IP 是节点的真实 IP,底层网络是认识的。
                      3. 解封装

                        • 数据包到达 Node 2 的物理网卡 eth0
                        • 内核发现这是一个发往 VXLAN 端口 (8472) 的 UDP 包,于是将其交给 Node 2 上的 flannel.1 设备处理。
                        • flannel.1 设备解封装,剥掉外层 UDP 和 IP 头,露出原始的 IP 数据包(源 10.244.1.10,目标 10.244.2.20)。
                      4. 入站:从 Node 2 到 Pod B

                        • 解封后的原始数据包被送入 Node 2 的网络栈。
                        • Node 2 的路由表查看目标 IP 10.244.2.20,发现它属于本地的 cni0 网桥管理的网段。
                        • 数据包被转发到 cni0 网桥,网桥再通过 veth pair 将数据包送达 Pod Beth0 接口。

                      简单比喻:Flannel 就像在两个节点之间建立了一条邮政专线。你的原始信件(Pod IP 数据包)被塞进一个标准快递信封(外层 UDP 包)里,通过公共邮政系统(物理网络)寄到对方邮局(Node 2),对方邮局再拆开快递信封,把原始信件交给收件人(Pod B)。

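作为参考,Flannel 的 VXLAN 模式通常由类似下面的 ConfigMap 决定(示意;Pod 网段、命名空间等取决于实际部署方式):

apiVersion: v1
kind: ConfigMap
metadata:
  name: kube-flannel-cfg
  namespace: kube-flannel      # 旧版部署也可能位于 kube-system
data:
  net-conf.json: |
    {
      "Network": "10.244.0.0/16",
      "Backend": {
        "Type": "vxlan"
      }
    }
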

                      场景二:使用 Calico (BGP 模式)

Calico 通常不使用隧道,而是利用 BGP 协议实现纯三层路由,效率更高。

                      1. 路由通告

• Node 1 和 Node 2 上都运行着 Calico 的组件:负责写入本机路由与策略的 Felix,以及作为 BGP 客户端、负责与其他节点交换路由的 BIRD(规模较大时还可部署独立的 BGP 路由反射器)。
                        • Node 2 会通过 BGP 协议向网络中的其他节点(包括 Node 1)通告一条路由信息:“目标网段 10.244.2.0/24 的下一跳是我 192.168.1.102”。
                        • Node 1 学习到了这条路由,并写入自己的内核路由表(就是我们之前在步骤2中看到的那条)。
                      2. 直接路由

                        • 数据包(源 10.244.1.10,目标 10.244.2.20)根据路由表,直接通过 Node 1 的物理网卡 eth0 发出。
                        • 没有封装! 数据包保持原样,源 IP 是 10.244.1.10,目标 IP 是 10.244.2.20
                        • 这个数据包被发送到 Node 2 的物理 IP (192.168.1.102)。
                      3. 物理网络传输

                        • 数据包经过底层物理网络。这就要求底层网络必须能够路由 Pod IP 的网段。在云环境中,这通常通过配置 VPC 路由表来实现;在物理机房,需要核心交换机学习到这些 BGP 路由或配置静态路由。
                      4. 入站:从 Node 2 到 Pod B

                        • 数据包到达 Node 2 的物理网卡 eth0
                        • Node 2 的内核查看目标 IP 10.244.2.20,发现这个 IP 属于一个本地虚拟接口(如 caliXXX,这是 Calico 为每个 Pod 创建的),于是直接将数据包转发给该接口,最终送达 Pod B

                      简单比喻:Calico 让每个节点都成为一个智能路由器。它们互相告知“哪个 Pod 网段在我这里”。当 Node 1 要发数据给 Node 2 上的 Pod 时,它就像路由器一样,根据已知的路由表,直接找到 Node 2 的地址并把数据包发过去,中间不拆包。


                      总结对比

| 特性 | Flannel (VXLAN) | Calico (BGP) |
|------|-----------------|--------------|
| 网络模型 | Overlay Network | Pure Layer 3 |
| 原理 | 隧道封装 | 路由通告 |
| 性能 | 有封装/解封装开销,性能稍低 | 无隧道开销,性能更高 |
| 依赖 | 对底层网络无要求,只要节点 IP 互通即可 | 依赖底层网络支持路由(云平台 VPC 或物理网络配置) |
| 数据包 | 外层 Node IP,内层 Pod IP | 始终是 Pod IP |

                      无论采用哪种方式,Kubernetes 和 CNI 插件共同协作,最终实现了一个对应用开发者透明的、扁平的 Pod 网络。开发者只需关心 Pod IP 和 Service,而无需理解底层复杂的跨节点通信机制。

                      如果pod之间访问不通怎么排查?

                      核心排查思路:从 Pod 内部到外部,从简单到复杂

                      整个排查过程可以遵循下图所示的路径,逐步深入:

                      flowchart TD
                          A[Pod 之间访问不通] --> B[确认基础连通性<br>ping & telnet]
                      
                          B --> C{ping 是否通?}
                          C -- 通 --> D[telnet 端口是否通?]
                          C -- 不通 --> E[检查 NetworkPolicy<br>kubectl get networkpolicy]
                      
                          D -- 通 --> F[检查应用日志与配置]
                          D -- 不通 --> G[检查 Service 与 Endpoints<br>kubectl describe svc]
                      
                          E --> H[检查 CNI 插件状态<br>kubectl get pods -n kube-system]
                          
                          subgraph G_ [Service排查路径]
                              G --> G1[Endpoints 是否为空?]
                              G1 -- 是 --> G2[检查 Pod 标签与 Selector]
                              G1 -- 否 --> G3[检查 kube-proxy 与 iptables]
                          end
                      
                          F --> Z[问题解决]
                          H --> Z
                          G2 --> Z
                          G3 --> Z

                      第一阶段:基础信息收集与初步检查

                      1. 获取双方 Pod 信息

                        kubectl get pods -o wide
                        • 确认两个 Pod 都处于 Running 状态。
• 记录下它们的 IP 地址和所在节点。
                        • 确认它们不在同一个节点上(如果是,排查方法会略有不同)。
                      2. 明确访问方式

• 直接通过 Pod IP 访问?(ping <pod-ip> 或 curl <pod-ip>:<port>)
• 通过 Service 名称访问?(ping <service-name> 或 curl <service-name>:<port>)
                        • 这个问题决定了后续的排查方向。

                      第二阶段:按访问路径深入排查

                      场景一:直接通过 Pod IP 访问不通(跨节点)

                      这通常是底层网络插件(CNI) 的问题。

                      1. 检查 Pod 内部网络

                        • 进入源 Pod,检查其网络配置:
                        kubectl exec -it <source-pod> -- sh
                        # 在 Pod 内部执行:
                        ip addr show eth0 # 查看 IP 是否正确
                        ip route # 查看路由表
                        ping <destination-pod-ip> # 测试连通性
                        • 如果 ping 不通,继续下一步。
                      2. 检查目标 Pod 的端口监听

                        • 进入目标 Pod,确认应用在正确端口上监听:
                        kubectl exec -it <destination-pod> -- netstat -tulpn | grep LISTEN
                        # 或者用 ss 命令
                        kubectl exec -it <destination-pod> -- ss -tulpn | grep LISTEN
                        • 如果这里没监听,是应用自身问题,检查应用日志和配置。
                      3. 检查 NetworkPolicy(网络策略)

                        • 这是 Kubernetes 的“防火墙”,很可能阻止了访问。
                        kubectl get networkpolicies -A
                        kubectl describe networkpolicy <policy-name> -n <namespace>
• 查看是否有策略限制了源 Pod 或目标 Pod 的流量,特别注意 ingress 规则(一个放行指定来源的 NetworkPolicy 示例见本小节末尾)。
                      4. 检查 CNI 插件状态

                        • CNI 插件(如 Calico、Flannel)的异常会导致跨节点网络瘫痪。
                        kubectl get pods -n kube-system | grep -e calico -e flannel -e weave
                        • 确认所有 CNI 相关的 Pod 都在运行。如果有 CrashLoopBackOff 等状态,查看其日志。
                      5. 节点层面排查

                        • 如果以上都正常,问题可能出现在节点网络层面。
                        • 登录到源 Pod 所在节点,尝试 ping 目标 Pod IP。
                        • 检查节点路由表
                          # 在节点上执行
                          ip route
                          • 对于 Flannel,你应该能看到到其他节点 Pod 网段的路由。
                          • 对于 Calico,你应该能看到到每个其他节点 Pod 网段的精确路由。
                        • 检查节点防火墙:在某些环境中(如安全组、iptables 规则)可能阻止了 VXLAN(8472端口)或节点间 Pod IP 的通信。
                          # 检查 iptables 规则
                          sudo iptables-save | grep <pod-ip>

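排查时可以对照下面这类 NetworkPolicy(示意,其中 backend/frontend 标签与端口均为假设),确认目标 Pod 的 ingress 规则是否放行了源 Pod:

apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-from-frontend
  namespace: default
spec:
  podSelector:
    matchLabels:
      app: backend            # 策略作用于目标 Pod
  policyTypes:
  - Ingress
  ingress:
  - from:
    - podSelector:
        matchLabels:
          app: frontend       # 只允许带该标签的源 Pod 访问
    ports:
    - protocol: TCP
      port: 8080
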
                      场景二:通过 Service 名称访问不通

                      这通常是 Kubernetes 服务发现kube-proxy 的问题。

                      1. 检查 Service 和 Endpoints

                        kubectl get svc <service-name>
                        kubectl describe svc <service-name> # 查看 Selector 和 Port 映射
                        kubectl get endpoints <service-name> # 这是关键!检查是否有健康的 Endpoints
                        • 如果 ENDPOINTS 列为空:说明 Service 的 Label Selector 没有匹配到任何健康的 Pod。请检查:
• Pod 的 labels 是否与 Service 的 selector 匹配(二者对应关系的示例见本小节末尾)。
                          • Pod 的 readinessProbe 是否通过。
                      2. 检查 DNS 解析

                        • 进入源 Pod,测试是否能解析 Service 名称:
                        kubectl exec -it <source-pod> -- nslookup <service-name>
                        # 或者
                        kubectl exec -it <source-pod> -- cat /etc/resolv.conf
• 如果解析失败,检查 kube-dns 或 coredns Pod 是否正常。
                        kubectl get pods -n kube-system | grep -e coredns -e kube-dns
                      3. 检查 kube-proxy

                        • kube-proxy 负责实现 Service 的负载均衡规则(通常是 iptables 或 ipvs)。
                        kubectl get pods -n kube-system | grep kube-proxy
                        • 确认所有 kube-proxy Pod 都在运行。
                        • 可以登录到节点,检查是否有对应的 iptables 规则:
                          sudo iptables-save | grep <service-name>
                          # 或者查看 ipvs 规则(如果使用 ipvs 模式)
                          sudo ipvsadm -ln

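Endpoints 为空时,多数情况是下面这对 selector 与 labels 没有对上(示意清单,名称、标签与端口均为假设):

apiVersion: v1
kind: Service
metadata:
  name: my-service
spec:
  selector:
    app: my-app               # 必须与 Pod 的 labels 完全匹配,否则 Endpoints 为空
  ports:
  - port: 80
    targetPort: 8080
---
apiVersion: v1
kind: Pod
metadata:
  name: my-app-pod
  labels:
    app: my-app               # 与上面的 selector 对应
spec:
  containers:
  - name: app
    image: nginx
    ports:
    - containerPort: 8080
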
                      第三阶段:高级调试技巧

                      如果上述步骤仍未解决问题,可以尝试以下方法:

                      1. 使用网络调试镜像

                        • 部署一个包含网络工具的临时 Pod(如 nicolaka/netshoot)来进行高级调试。
                        kubectl run -it --rm debug-pod --image=nicolaka/netshoot -- /bin/bash
                        • 在这个 Pod 里,你可以使用 tcpdump, tracepath, dig 等强大工具。
                        • 例如,在目标 Pod 的节点上抓包:
                          # 在节点上执行,监听 Pod 网络对应的接口
                          sudo tcpdump -i any -n host <source-pod-ip> and host <destination-pod-ip>
                      2. 检查节点网络连接

                        • 确认两个节点之间网络是通的(通过节点 IP)。
                        • 确认 CNI 所需的端口(如 Flannel 的 VXLAN 端口 8472)在节点间是开放的。

                      总结与排查命令清单

                      当 Pod 间访问不通时,按顺序执行以下命令:

                      # 1. 基本信息
                      kubectl get pods -o wide
                      kubectl get svc,ep -o wide
                      
                      # 2. 检查 NetworkPolicy
                      kubectl get networkpolicies -A
                      
                      # 3. 检查核心插件
                      kubectl get pods -n kube-system | grep -e coredns -e kube-proxy -e calico -e flannel
                      
                      # 4. 进入 Pod 测试
                      kubectl exec -it <source-pod> -- ping <destination-pod-ip>
                      kubectl exec -it <source-pod> -- nslookup <service-name>
                      
                      # 5. 检查目标 Pod 应用
                      kubectl exec -it <destination-pod> -- netstat -tulpn
                      kubectl logs <destination-pod>
                      
                      # 6. 节点层面检查
                      # 在节点上执行
                      ip route
                      sudo iptables-save | grep <relevant-ip>

记住,大部分 Pod 网络不通的问题源于 NetworkPolicy 配置、Service Endpoints 为空,或 CNI 插件故障。按照这个路径排查,绝大多数问题都能被定位和解决。

                      Mar 7, 2024

                      Talk with API Server

                      Kubernetes 各模块与 API Server 通信详解

                      这是理解 Kubernetes 架构的核心问题。API Server 是整个集群的"大脑",所有组件都通过它进行通信。


                      🎯 Kubernetes 通信架构总览

                      ┌─────────────────────────────────────────────────────────┐
                      │                    API Server (核心)                     │
                      │  - RESTful API (HTTP/HTTPS)                             │
                      │  - 认证、授权、准入控制                                   │
                      │  - etcd 唯一入口                                         │
                      └───────┬─────────────────┬─────────────────┬─────────────┘
                              │                 │                 │
                          ┌───▼───┐         ┌───▼───┐        ┌───▼────┐
                          │Kubelet│         │Scheduler│      │Controller│
                          │(Node) │         │         │      │ Manager  │
                          └───────┘         └─────────┘      └──────────┘
                              │
                          ┌───▼────┐
                          │kube-proxy│
                          └────────┘

                      🔐 通信基础:认证、授权、准入

                      1. 认证 (Authentication)

                      所有组件访问 API Server 必须先通过认证。

                      常见认证方式

| 认证方式 | 使用场景 | 实现方式 |
|----------|----------|----------|
| X.509 证书 | 集群组件(kubelet/scheduler) | 客户端证书 |
| ServiceAccount Token | Pod 内应用 | JWT Token |
| Bootstrap Token | 节点加入集群 | 临时 Token |
| 静态 Token 文件 | 简单测试 | 不推荐生产 |
| OIDC | 用户认证 | 外部身份提供商 |

                      X.509 证书认证示例

                      # 1. API Server 启动参数包含 CA 证书
                      kube-apiserver \
                        --client-ca-file=/etc/kubernetes/pki/ca.crt \
                        --tls-cert-file=/etc/kubernetes/pki/apiserver.crt \
                        --tls-private-key-file=/etc/kubernetes/pki/apiserver.key
                      
                      # 2. Kubelet 使用客户端证书
                      kubelet \
                        --kubeconfig=/etc/kubernetes/kubelet.conf \
                        --client-ca-file=/etc/kubernetes/pki/ca.crt
                      
                      # 3. kubeconfig 文件内容
                      apiVersion: v1
                      kind: Config
                      clusters:
                      - cluster:
                          certificate-authority: /etc/kubernetes/pki/ca.crt  # CA 证书
                          server: https://192.168.1.10:6443                  # API Server 地址
                        name: kubernetes
                      users:
                      - name: system:node:worker-1
                        user:
                          client-certificate: /var/lib/kubelet/pki/kubelet-client.crt  # 客户端证书
                          client-key: /var/lib/kubelet/pki/kubelet-client.key          # 客户端密钥
                      contexts:
                      - context:
                          cluster: kubernetes
                          user: system:node:worker-1
                        name: default
                      current-context: default

                      ServiceAccount Token 认证

                      # Pod 内自动挂载的 Token
                      cat /var/run/secrets/kubernetes.io/serviceaccount/token
                      # eyJhbGciOiJSUzI1NiIsImtpZCI6Ij...
                      
                      # 使用 Token 访问 API Server
                      TOKEN=$(cat /var/run/secrets/kubernetes.io/serviceaccount/token)
                      curl -k -H "Authorization: Bearer $TOKEN" \
                        https://kubernetes.default.svc/api/v1/namespaces/default/pods

                      2. 授权 (Authorization)

                      认证通过后,检查是否有权限执行操作。

                      RBAC (Role-Based Access Control) - 最常用

                      # 1. Role - 定义权限
                      apiVersion: rbac.authorization.k8s.io/v1
                      kind: Role
                      metadata:
                        namespace: default
                        name: pod-reader
                      rules:
                      - apiGroups: [""]
                        resources: ["pods"]
                        verbs: ["get", "list", "watch"]
                      
                      ---
                      # 2. RoleBinding - 绑定用户/ServiceAccount
                      apiVersion: rbac.authorization.k8s.io/v1
                      kind: RoleBinding
                      metadata:
                        name: read-pods
                        namespace: default
                      subjects:
                      - kind: ServiceAccount
                        name: my-app
                        namespace: default
                      roleRef:
                        kind: Role
                        name: pod-reader
                        apiGroup: rbac.authorization.k8s.io

                      授权模式对比

| 模式 | 说明 | 使用场景 |
|------|------|----------|
| RBAC | 基于角色 | 生产环境(推荐) |
| ABAC | 基于属性 | 复杂策略(已过时) |
| Webhook | 外部授权服务 | 自定义授权逻辑 |
| Node | 节点授权 | Kubelet 专用 |
| AlwaysAllow | 允许所有 | 测试环境(危险) |

                      3. 准入控制 (Admission Control)

                      授权通过后,准入控制器可以修改或拒绝请求。

                      常用准入控制器

                      # API Server 启用的准入控制器
                      kube-apiserver \
                        --enable-admission-plugins=NamespaceLifecycle,LimitRanger,ServiceAccount,\
                      DefaultStorageClass,DefaultTolerationSeconds,MutatingAdmissionWebhook,\
                      ValidatingAdmissionWebhook,ResourceQuota,PodSecurityPolicy

| 准入控制器 | 作用 |
|------------|------|
| NamespaceLifecycle | 防止在删除中的 namespace 创建资源 |
| LimitRanger | 强制资源限制 |
| ResourceQuota | 强制命名空间配额 |
| PodSecurityPolicy | 强制 Pod 安全策略 |
| MutatingAdmissionWebhook | 修改资源(如注入 sidecar) |
| ValidatingAdmissionWebhook | 验证资源(自定义校验) |

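一个 ValidatingAdmissionWebhook 的注册对象大致长这样(示意;webhook 名称、Service 与路径均为假设):

apiVersion: admissionregistration.k8s.io/v1
kind: ValidatingWebhookConfiguration
metadata:
  name: pod-policy.example.com
webhooks:
- name: pod-policy.example.com
  admissionReviewVersions: ["v1"]
  sideEffects: None
  failurePolicy: Fail              # webhook 不可用时直接拒绝请求
  rules:
  - apiGroups: [""]
    apiVersions: ["v1"]
    operations: ["CREATE"]
    resources: ["pods"]
  clientConfig:
    service:
      namespace: webhook-system    # 假设的命名空间与 Service
      name: pod-policy-webhook
      path: /validate
    # caBundle: <base64 编码的 CA 证书,此处省略>
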
                      📡 各组件通信详解

                      1. Kubelet → API Server

与其他控制面组件一样,Kubelet 也是主动连接 API Server 的;反过来,API Server 只有在 exec/logs/port-forward 等少数场景下才会主动连接 Kubelet。

                      通信方式

                      Kubelet (每个 Node)
                          │
                          ├─→ List-Watch Pods (监听分配给自己的 Pod)
                          ├─→ Report Node Status (定期上报节点状态)
                          ├─→ Report Pod Status (上报 Pod 状态)
                          └─→ Get Secrets/ConfigMaps (拉取配置)

                      实现细节

                      // Kubelet 启动时创建 Informer 监听资源
                      // 伪代码示例
                      func (kl *Kubelet) syncLoop() {
                          // 1. 创建 Pod Informer
                          podInformer := cache.NewSharedIndexInformer(
                              &cache.ListWatch{
                                  ListFunc: func(options metav1.ListOptions) (runtime.Object, error) {
                                      // 列出分配给当前节点的所有 Pod
                                      options.FieldSelector = fields.OneTermEqualSelector("spec.nodeName", kl.nodeName).String()
                                      return kl.kubeClient.CoreV1().Pods("").List(context.TODO(), options)
                                  },
                                  WatchFunc: func(options metav1.ListOptions) (watch.Interface, error) {
                                      // 持续监听 Pod 变化
                                      options.FieldSelector = fields.OneTermEqualSelector("spec.nodeName", kl.nodeName).String()
                                      return kl.kubeClient.CoreV1().Pods("").Watch(context.TODO(), options)
                                  },
                              },
                              &v1.Pod{},
                              0, // 不缓存
                              cache.Indexers{},
                          )
                          
                          // 2. 注册事件处理器
                          podInformer.AddEventHandler(cache.ResourceEventHandlerFuncs{
                              AddFunc:    kl.handlePodAdditions,
                              UpdateFunc: kl.handlePodUpdates,
                              DeleteFunc: kl.handlePodDeletions,
                          })
                          
                          // 3. 定期上报节点状态
                          go wait.Until(kl.syncNodeStatus, 10*time.Second, stopCh)
                      }
                      
                      // 上报节点状态
                      func (kl *Kubelet) syncNodeStatus() {
                          node := &v1.Node{
                              ObjectMeta: metav1.ObjectMeta{Name: kl.nodeName},
                              Status: v1.NodeStatus{
                                  Conditions: []v1.NodeCondition{
                                      {Type: v1.NodeReady, Status: v1.ConditionTrue},
                                  },
                                  Capacity: kl.getNodeCapacity(),
                                  // ...
                              },
                          }
                          
                          // 调用 API Server 更新节点状态
                          kl.kubeClient.CoreV1().Nodes().UpdateStatus(context.TODO(), node, metav1.UpdateOptions{})
                      }

                      Kubelet 配置示例

                      # /var/lib/kubelet/config.yaml
                      apiVersion: kubelet.config.k8s.io/v1beta1
                      kind: KubeletConfiguration
                      # API Server 连接配置(通过 kubeconfig)
                      authentication:
                        x509:
                          clientCAFile: /etc/kubernetes/pki/ca.crt
                        webhook:
                          enabled: true
                        anonymous:
                          enabled: false
                      authorization:
                        mode: Webhook
                      clusterDomain: cluster.local
                      clusterDNS:
                      - 10.96.0.10
                      # 定期上报间隔
                      nodeStatusUpdateFrequency: 10s
                      nodeStatusReportFrequency: 1m

                      List-Watch 机制详解

                      ┌─────────────────────────────────────────┐
                      │  Kubelet List-Watch 工作流程             │
                      ├─────────────────────────────────────────┤
                      │                                          │
                      │  1. List(初始化)                         │
                      │     GET /api/v1/pods?fieldSelector=...  │
                      │     ← 返回所有当前 Pod                   │
                      │                                          │
                      │  2. Watch(持续监听)                      │
                      │     GET /api/v1/pods?watch=true&...     │
                      │     ← 保持长连接                         │
                      │                                          │
                      │  3. 接收事件                             │
                      │     ← ADDED: Pod nginx-xxx created      │
                      │     ← MODIFIED: Pod nginx-xxx updated   │
                      │     ← DELETED: Pod nginx-xxx deleted    │
                      │                                          │
                      │  4. 本地处理                             │
                      │     - 缓存更新                           │
                      │     - 触发 Pod 生命周期管理              │
                      │                                          │
                      │  5. 断线重连                             │
                      │     - 检测到连接断开                     │
                      │     - 重新 List + Watch                  │
                      │     - ResourceVersion 确保不丢事件       │
                      └─────────────────────────────────────────┘

                      HTTP 长连接(Chunked Transfer)

                      # Kubelet 发起 Watch 请求
                      GET /api/v1/pods?watch=true&resourceVersion=12345&fieldSelector=spec.nodeName=worker-1 HTTP/1.1
                      Host: 192.168.1.10:6443
                      Authorization: Bearer eyJhbGc...
                      Connection: keep-alive
                      
                      # API Server 返回(Chunked 编码)
                      HTTP/1.1 200 OK
                      Content-Type: application/json
                      Transfer-Encoding: chunked
                      
                      {"type":"ADDED","object":{"kind":"Pod","apiVersion":"v1",...}}
                      {"type":"MODIFIED","object":{"kind":"Pod","apiVersion":"v1",...}}
                      {"type":"DELETED","object":{"kind":"Pod","apiVersion":"v1",...}}
                      ...
                      # 连接保持打开,持续推送事件

                      2. Scheduler → API Server

                      Scheduler 也使用 List-Watch 机制。

                      通信流程

                      Scheduler
                          │
                          ├─→ Watch Pods (监听未调度的 Pod)
                          │   └─ spec.nodeName == ""
                          │
                          ├─→ Watch Nodes (监听节点状态)
                          │
                          ├─→ Get PVs, PVCs (获取存储信息)
                          │
                          └─→ Bind Pod (绑定 Pod 到 Node)
                              POST /api/v1/namespaces/{ns}/pods/{name}/binding

                      Scheduler 伪代码

                      // Scheduler 主循环
                      func (sched *Scheduler) scheduleOne() {
                          // 1. 从队列获取待调度的 Pod
                          pod := sched.NextPod()
                          
                          // 2. 执行调度算法(过滤 + 打分)
                          feasibleNodes := sched.findNodesThatFit(pod)
                          if len(feasibleNodes) == 0 {
                              // 无可用节点,标记为不可调度
                              return
                          }
                          
                          priorityList := sched.prioritizeNodes(pod, feasibleNodes)
                          selectedNode := sched.selectHost(priorityList)
                          
                          // 3. 绑定 Pod 到 Node(调用 API Server)
                          binding := &v1.Binding{
                              ObjectMeta: metav1.ObjectMeta{
                                  Name:      pod.Name,
                                  Namespace: pod.Namespace,
                              },
                              Target: v1.ObjectReference{
                                  Kind: "Node",
                                  Name: selectedNode,
                              },
                          }
                          
                          // 发送 Binding 请求到 API Server
                          err := sched.client.CoreV1().Pods(pod.Namespace).Bind(
                              context.TODO(),
                              binding,
                              metav1.CreateOptions{},
                          )
                          
                          // 4. API Server 更新 Pod 的 spec.nodeName
                          // 5. Kubelet 监听到 Pod,开始创建容器
                      }
                      
                      // Watch 未调度的 Pod
                      func (sched *Scheduler) watchUnscheduledPods() {
                          podInformer := cache.NewSharedIndexInformer(
                              &cache.ListWatch{
                                  ListFunc: func(options metav1.ListOptions) (runtime.Object, error) {
                                      // 只监听 spec.nodeName 为空的 Pod
                                      options.FieldSelector = "spec.nodeName="
                                      return sched.client.CoreV1().Pods("").List(context.TODO(), options)
                                  },
                                  WatchFunc: func(options metav1.ListOptions) (watch.Interface, error) {
                                      options.FieldSelector = "spec.nodeName="
                                      return sched.client.CoreV1().Pods("").Watch(context.TODO(), options)
                                  },
                              },
                              &v1.Pod{},
                              0,
                              cache.Indexers{},
                          )
                          
                          podInformer.AddEventHandler(cache.ResourceEventHandlerFuncs{
                              AddFunc: func(obj interface{}) {
                                  pod := obj.(*v1.Pod)
                                  sched.queue.Add(pod)  // 加入调度队列
                              },
                          })
                      }

                      Binding 请求详解

                      # Scheduler 发送的 HTTP 请求
                      POST /api/v1/namespaces/default/pods/nginx-xxx/binding HTTP/1.1
                      Host: 192.168.1.10:6443
                      Authorization: Bearer eyJhbGc...
                      Content-Type: application/json
                      
                      {
                        "apiVersion": "v1",
                        "kind": "Binding",
                        "metadata": {
                          "name": "nginx-xxx",
                          "namespace": "default"
                        },
                        "target": {
                          "kind": "Node",
                          "name": "worker-1"
                        }
                      }
                      
                      # API Server 处理:
                      # 1. 验证 Binding 请求
                      # 2. 更新 Pod 对象的 spec.nodeName = "worker-1"
                      # 3. 返回成功响应
                      # 4. Kubelet 监听到 Pod 更新,开始创建容器

                      3. Controller Manager → API Server

                      Controller Manager 包含多个控制器,每个控制器独立与 API Server 通信。

                      常见控制器

                      Controller Manager
                          │
                          ├─→ Deployment Controller
                          │   └─ Watch Deployments, ReplicaSets
                          │
                          ├─→ ReplicaSet Controller
                          │   └─ Watch ReplicaSets, Pods
                          │
                          ├─→ Node Controller
                          │   └─ Watch Nodes (节点健康检查)
                          │
                          ├─→ Service Controller
                          │   └─ Watch Services (管理 LoadBalancer)
                          │
                          ├─→ Endpoint Controller
                          │   └─ Watch Services, Pods (创建 Endpoints)
                          │
                          └─→ PV Controller
                              └─ Watch PVs, PVCs (卷绑定)

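以 Endpoint Controller 为例,它根据 Service 的 selector 和就绪 Pod 维护形如下面的 Endpoints 对象(示意,IP 与端口为假设):

apiVersion: v1
kind: Endpoints
metadata:
  name: nginx-service          # 必须与 Service 同名
subsets:
- addresses:
  - ip: 10.244.1.5             # 通过 Service selector 匹配到的就绪 Pod IP
  - ip: 10.244.2.8
  ports:
  - port: 8080
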
                      ReplicaSet Controller 示例

                      // ReplicaSet Controller 的核心逻辑
                      func (rsc *ReplicaSetController) syncReplicaSet(key string) error {
                          // 1. 从缓存获取 ReplicaSet
                          rs := rsc.rsLister.Get(namespace, name)
                          
                          // 2. 获取当前 Pod 列表(通过 Selector)
                          allPods := rsc.podLister.List(labels.Everything())
                          filteredPods := rsc.filterActivePods(rs.Spec.Selector, allPods)
                          
                          // 3. 计算差异
                          diff := len(filteredPods) - int(*rs.Spec.Replicas)
                          
                          if diff < 0 {
                              // 需要创建新 Pod
                              diff = -diff
                              for i := 0; i < diff; i++ {
                                  // 调用 API Server 创建 Pod
                                  pod := newPod(rs)
                                  _, err := rsc.kubeClient.CoreV1().Pods(rs.Namespace).Create(
                                      context.TODO(),
                                      pod,
                                      metav1.CreateOptions{},
                                  )
                              }
                          } else if diff > 0 {
                              // 需要删除多余 Pod
                              podsToDelete := getPodsToDelete(filteredPods, diff)
                              for _, pod := range podsToDelete {
                                  // 调用 API Server 删除 Pod
                                  err := rsc.kubeClient.CoreV1().Pods(pod.Namespace).Delete(
                                      context.TODO(),
                                      pod.Name,
                                      metav1.DeleteOptions{},
                                  )
                              }
                          }
                          
                          // 4. 更新 ReplicaSet 状态
                          rs.Status.Replicas = int32(len(filteredPods))
                          _, err := rsc.kubeClient.AppsV1().ReplicaSets(rs.Namespace).UpdateStatus(
                              context.TODO(),
                              rs,
                              metav1.UpdateOptions{},
                          )
                      }

                      Node Controller 心跳检测

                      // Node Controller 监控节点健康
                      func (nc *NodeController) monitorNodeHealth() {
                          for {
                              // 1. 列出所有节点
                              nodes, _ := nc.kubeClient.CoreV1().Nodes().List(context.TODO(), metav1.ListOptions{})
                              
                              for _, node := range nodes.Items {
                                  // 2. 检查节点状态
                                  now := time.Now()
                                  lastHeartbeat := getNodeCondition(&node, v1.NodeReady).LastHeartbeatTime
                                  
                                  if now.Sub(lastHeartbeat.Time) > 40*time.Second {
                                      // 3. 节点超时,标记为 NotReady
                                      setNodeCondition(&node, v1.NodeCondition{
                                          Type:   v1.NodeReady,
                                          Status: v1.ConditionUnknown,
                                          Reason: "NodeStatusUnknown",
                                      })
                                      
                                      // 4. 更新节点状态
                                      nc.kubeClient.CoreV1().Nodes().UpdateStatus(
                                          context.TODO(),
                                          &node,
                                          metav1.UpdateOptions{},
                                      )
                                      
                                      // 5. 如果节点长时间 NotReady,驱逐 Pod
                                      if now.Sub(lastHeartbeat.Time) > 5*time.Minute {
                                          nc.evictPods(node.Name)
                                      }
                                  }
                              }
                              
                              time.Sleep(10 * time.Second)
                          }
                      }

                      4. kube-proxy → API Server

                      kube-proxy 监听 Service 和 Endpoints,配置网络规则。

                      通信流程

                      kube-proxy (每个 Node)
                          │
                          ├─→ Watch Services
                          │   └─ 获取 Service 定义
                          │
                          ├─→ Watch Endpoints
                          │   └─ 获取后端 Pod IP 列表
                          │
                          └─→ 配置本地网络
                              ├─ iptables 模式:更新 iptables 规则
                              ├─ ipvs 模式:更新 IPVS 规则
                              └─ userspace 模式:代理转发(已废弃)

                      iptables 模式示例

                      // kube-proxy 监听 Service 和 Endpoints
                      func (proxier *Proxier) syncProxyRules() {
                          // 1. 获取所有 Service
                          services := proxier.serviceStore.List()
                          
                          // 2. 获取所有 Endpoints
                          endpoints := proxier.endpointsStore.List()
                          
                          // 3. 生成 iptables 规则
                          for _, svc := range services {
                              // Service ClusterIP
                              clusterIP := svc.Spec.ClusterIP
                              
                              // 对应的 Endpoints
                              eps := endpoints[svc.Namespace+"/"+svc.Name]
                              
                              // 生成 DNAT 规则
                              // -A KUBE-SERVICES -d 10.96.100.50/32 -p tcp -m tcp --dport 80 -j KUBE-SVC-XXXX
                              chain := generateServiceChain(svc)
                              
                              for _, ep := range eps.Subsets {
                                  for _, addr := range ep.Addresses {
                                      // -A KUBE-SVC-XXXX -m statistic --mode random --probability 0.33 -j KUBE-SEP-XXXX
                                      // -A KUBE-SEP-XXXX -p tcp -m tcp -j DNAT --to-destination 10.244.1.5:8080
                                      generateEndpointRule(addr.IP, ep.Ports[0].Port)
                                  }
                              }
                          }
                          
                          // 4. 应用 iptables 规则
                          iptables.Restore(rules)
                      }

                      生成的 iptables 规则示例

                      # Service: nginx-service (ClusterIP: 10.96.100.50:80)
                      # Endpoints: 10.244.1.5:8080, 10.244.2.8:8080
                      
                      # 1. KUBE-SERVICES 链(入口)
                      -A KUBE-SERVICES -d 10.96.100.50/32 -p tcp -m tcp --dport 80 -j KUBE-SVC-NGINX
                      
                      # 2. KUBE-SVC-NGINX 链(Service 链)
                      -A KUBE-SVC-NGINX -m statistic --mode random --probability 0.5 -j KUBE-SEP-EP1
                      -A KUBE-SVC-NGINX -j KUBE-SEP-EP2
                      
                      # 3. KUBE-SEP-EP1 链(Endpoint 1)
                      -A KUBE-SEP-EP1 -p tcp -m tcp -j DNAT --to-destination 10.244.1.5:8080
                      
                      # 4. KUBE-SEP-EP2 链(Endpoint 2)
                      -A KUBE-SEP-EP2 -p tcp -m tcp -j DNAT --to-destination 10.244.2.8:8080

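上面这组规则对应的 Service 声明大致如下(示意;实际的 ClusterIP 通常由系统自动分配,这里写死仅为与上例对应,selector 也是假设):

apiVersion: v1
kind: Service
metadata:
  name: nginx-service
spec:
  clusterIP: 10.96.100.50      # 仅为对照上面的 iptables 规则
  selector:
    app: nginx                 # 假设匹配到 10.244.1.5 / 10.244.2.8 这两个 Pod
  ports:
  - port: 80                   # ClusterIP 暴露的端口
    targetPort: 8080           # 转发到 Pod 的端口
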
                      5. kubectl → API Server

                      kubectl 是用户与 API Server 交互的客户端工具。

                      通信流程

                      kubectl get pods
                          │
                          ├─→ 1. 读取 kubeconfig (~/.kube/config)
                          │      - API Server 地址
                          │      - 证书/Token
                          │
                          ├─→ 2. 发送 HTTP 请求
                          │      GET /api/v1/namespaces/default/pods
                          │
                          ├─→ 3. API Server 处理
                          │      - 认证
                          │      - 授权
                          │      - 从 etcd 读取数据
                          │
                          └─→ 4. 返回结果
                                 JSON 格式的 Pod 列表

                      kubectl 底层实现

                      // kubectl get pods 的简化实现
                      func getPods(namespace string) {
                          // 1. 加载 kubeconfig
                          config, _ := clientcmd.BuildConfigFromFlags("", kubeconfig)
                          
                          // 2. 创建 Clientset
                          clientset, _ := kubernetes.NewForConfig(config)
                          
                          // 3. 发起 GET 请求
                          pods, _ := clientset.CoreV1().Pods(namespace).List(
                              context.TODO(),
                              metav1.ListOptions{},
                          )
                          
                          // 4. 输出结果
                          for _, pod := range pods.Items {
                              fmt.Printf("%s\t%s\t%s\n", pod.Name, pod.Status.Phase, pod.Spec.NodeName)
                          }
                      }

                      HTTP 请求详解

                      # kubectl get pods 发送的实际 HTTP 请求
                      GET /api/v1/namespaces/default/pods HTTP/1.1
                      Host: 192.168.1.10:6443
                      Authorization: Bearer eyJhbGciOiJSUzI1NiIsImtpZCI6Ij...
                      Accept: application/json
                      User-Agent: kubectl/v1.28.0
                      
                      # API Server 响应
                      HTTP/1.1 200 OK
                      Content-Type: application/json
                      
                      {
                        "kind": "PodList",
                        "apiVersion": "v1",
                        "metadata": {
                          "resourceVersion": "12345"
                        },
                        "items": [
                          {
                            "metadata": {
                              "name": "nginx-xxx",
                              "namespace": "default"
                            },
                            "spec": {
                              "nodeName": "worker-1",
                              "containers": [...]
                            },
                            "status": {
                              "phase": "Running"
                            }
                          }
                        ]
                      }

                      🔄 核心机制:List-Watch

                      List-Watch 是 Kubernetes 最核心的通信模式。

                      List-Watch 架构

                      ┌───────────────────────────────────────────────┐
                      │              Client (Kubelet/Controller)      │
                      ├───────────────────────────────────────────────┤
                      │                                                │
                      │  1. List(初始同步)                             │
                      │     GET /api/v1/pods                          │
                      │     → 获取所有资源                             │
                      │     → 本地缓存(Informer Cache)                │
                      │                                                │
                      │  2. Watch(增量更新)                            │
                      │     GET /api/v1/pods?watch=true               │
                      │     → 长连接(HTTP Chunked)                    │
                      │     → 实时接收 ADDED/MODIFIED/DELETED 事件    │
                      │                                                │
                      │  3. ResourceVersion(一致性保证)               │
                      │     → 每个资源有版本号                         │
                      │     → Watch 从指定版本开始                     │
                      │     → 断线重连不丢失事件                       │
                      │                                                │
                      │  4. 本地缓存(Indexer)                         │
                      │     → 减少 API Server 压力                    │
                      │     → 快速查询                                 │
                      │     → 自动同步                                 │
                      └───────────────────────────────────────────────┘

                      Informer 机制详解

                      // Informer 是 List-Watch 的高级封装
                      type Informer struct {
                          Indexer   Indexer       // 本地缓存
                          Controller Controller    // List-Watch 控制器
                          Processor  *sharedProcessor  // 事件处理器
                      }
                      
                      // 使用 Informer 监听资源
                      func watchPodsWithInformer() {
                          // 1. 创建 SharedInformerFactory
                          factory := informers.NewSharedInformerFactory(clientset, 30*time.Second)
                          
                          // 2. 获取 Pod Informer
                          podInformer := factory.Core().V1().Pods()
                          
                          // 3. 注册事件处理器
                          podInformer.Informer().AddEventHandler(cache.ResourceEventHandlerFuncs{
                              AddFunc: func(obj interface{}) {
                                  pod := obj.(*v1.Pod)
                                  fmt.Printf("Pod ADDED: %s\n", pod.Name)
                              },
                              UpdateFunc: func(oldObj, newObj interface{}) {
                                  pod := newObj.(*v1.Pod)
                                  fmt.Printf("Pod UPDATED: %s\n", pod.Name)
                              },
                              DeleteFunc: func(obj interface{}) {
                                  pod := obj.(*v1.Pod)
                                  fmt.Printf("Pod DELETED: %s\n", pod.Name)
                              },
                          })
                          
                          // 4. 启动 Informer
                          factory.Start(stopCh)
                          
                          // 5. 等待缓存同步完成
                          factory.WaitForCacheSync(stopCh)
                          
                          // 6. 从本地缓存查询(不访问 API Server)
                          pod, _ := podInformer.Lister().Pods("default").Get("nginx-xxx")
                      }

                      ResourceVersion 机制

                      事件流:
                      ┌────────────────────────────────────────┐
                      │ Pod nginx-xxx created                  │ ResourceVersion: 100
├────────────────────────────────────────┤
                      │ Pod nginx-xxx updated (image changed)  │ ResourceVersion: 101
                      ├────────────────────────────────────────┤
                      │ Pod nginx-xxx updated (status changed) │ ResourceVersion: 102
                      ├────────────────────────────────────────┤
                      │ Pod nginx-xxx deleted                  │ ResourceVersion: 103
                      └────────────────────────────────────────┘
                      
                      Watch 请求:
                      1. 初始 Watch: GET /api/v1/pods?watch=true&resourceVersion=100
                         → 从版本 100 开始接收事件
                      
                      2. 断线重连: GET /api/v1/pods?watch=true&resourceVersion=102
                         → 从版本 102 继续,不会丢失版本 103 的删除事件
                      
                      3. 版本过期: 如果 resourceVersion 太旧(etcd 已压缩)
                         → API Server 返回 410 Gone
                         → Client 重新 List 获取最新状态,然后 Watch

                      🔐 通信安全细节

                      1. TLS 双向认证

                      ┌────────────────────────────────────────┐
                      │        API Server TLS 配置              │
                      ├────────────────────────────────────────┤
                      │                                         │
                      │  Server 端证书:                         │
                      │  - apiserver.crt (服务端证书)          │
                      │  - apiserver.key (服务端私钥)          │
                      │  - ca.crt (CA 证书)                    │
                      │                                         │
                      │  Client CA:                             │
                      │  - 验证客户端证书                       │
                      │  - --client-ca-file=/etc/kubernetes/pki/ca.crt │
                      │                                         │
                      │  启动参数:                              │
                      │  --tls-cert-file=/etc/kubernetes/pki/apiserver.crt │
                      │  --tls-private-key-file=/etc/kubernetes/pki/apiserver.key │
                      │  --client-ca-file=/etc/kubernetes/pki/ca.crt │
                      └────────────────────────────────────────┘
                      
                      ┌────────────────────────────────────────┐
                      │        Kubelet TLS 配置                 │
                      ├────────────────────────────────────────┤
                      │                                         │
                      │  Client 证书:                           │
                      │  - kubelet-client.crt (客户端证书)     │
                      │  - kubelet-client.key (客户端私钥)     │
                      │  - ca.crt (CA 证书,验证 API Server)    │
                      │                                         │
                      │  kubeconfig 配置:                       │
                      │  - certificate-authority: ca.crt       │
                      │  - client-certificate: kubelet-client.crt │
                      │  - client-key: kubelet-client.key      │
                      └────────────────────────────────────────┘

                      2. ServiceAccount Token 详解

                      # 每个 Pod 自动挂载 ServiceAccount
                      apiVersion: v1
                      kind: Pod
                      metadata:
                        name: my-pod
                      spec:
                        serviceAccountName: default  # 使用的 ServiceAccount
                        containers:
                        - name: app
                          image: nginx
                          volumeMounts:
                          - name: kube-api-access-xxxxx
                            mountPath: /var/run/secrets/kubernetes.io/serviceaccount
                            readOnly: true
                        volumes:
                        - name: kube-api-access-xxxxx
                          projected:
                            sources:
                            - serviceAccountToken:
                                path: token                    # JWT Token
                                expirationSeconds: 3607
                            - configMap:
                                name: kube-root-ca.crt
                                items:
                                - key: ca.crt
                                  path: ca.crt                 # CA 证书
                            - downwardAPI:
                                items:
                                - path: namespace
                                  fieldRef:
                                    fieldPath: metadata.namespace  # 命名空间

                      Pod 内访问 API Server

                      # 进入 Pod
                      kubectl exec -it my-pod -- sh
                      
                      # 1. 读取 Token
                      TOKEN=$(cat /var/run/secrets/kubernetes.io/serviceaccount/token)
                      
                      # 2. 读取 CA 证书
                      CACERT=/var/run/secrets/kubernetes.io/serviceaccount/ca.crt
                      
                      # 3. 读取命名空间
                      NAMESPACE=$(cat /var/run/secrets/kubernetes.io/serviceaccount/namespace)
                      
                      # 4. 访问 API Server
                      curl --cacert $CACERT \
                           --header "Authorization: Bearer $TOKEN" \
                           https://kubernetes.default.svc/api/v1/namespaces/$NAMESPACE/pods
                      
                      # 5. 使用 kubectl proxy(简化方式)
                      kubectl proxy --port=8080 &
                      curl http://localhost:8080/api/v1/namespaces/default/pods
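
                       The same in-Pod access can be done with client-go instead of curl. A minimal sketch, assuming the standard client-go libraries are available in the image: rest.InClusterConfig() picks up the projected token and ca.crt mounted at the paths shown above.

                       package main
                       
                       import (
                           "context"
                           "fmt"
                       
                           metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
                           "k8s.io/client-go/kubernetes"
                           "k8s.io/client-go/rest"
                       )
                       
                       func main() {
                           config, err := rest.InClusterConfig() // uses the mounted ServiceAccount token + CA
                           if err != nil {
                               panic(err)
                           }
                           clientset, err := kubernetes.NewForConfig(config)
                           if err != nil {
                               panic(err)
                           }
                           pods, err := clientset.CoreV1().Pods("default").List(context.TODO(), metav1.ListOptions{})
                           if err != nil {
                               panic(err)
                           }
                           for _, p := range pods.Items {
                               fmt.Println(p.Name)
                           }
                       }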

                      ServiceAccount Token 结构

                      # 解码 JWT Token
                      TOKEN=$(cat /var/run/secrets/kubernetes.io/serviceaccount/token)
                      echo $TOKEN | cut -d. -f2 | base64 -d | jq
                      
                      # 输出:
                      {
                        "aud": [
                          "https://kubernetes.default.svc"
                        ],
                        "exp": 1696867200,        # 过期时间
                        "iat": 1696863600,        # 签发时间
                        "iss": "https://kubernetes.default.svc.cluster.local",  # 签发者
                        "kubernetes.io": {
                          "namespace": "default",  # 命名空间
                          "pod": {
                            "name": "my-pod",
                            "uid": "abc-123"
                          },
                          "serviceaccount": {
                            "name": "default",     # ServiceAccount 名称
                            "uid": "def-456"
                          }
                        },
                        "nbf": 1696863600,
                        "sub": "system:serviceaccount:default:default"  # Subject
                      }
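
                       For completeness, roughly the same decoding can be done in Go without jq: split the JWT, base64url-decode the claims segment and pretty-print it. A hedged sketch that reads the projected token path used above.

                       package main
                       
                       import (
                           "bytes"
                           "encoding/base64"
                           "encoding/json"
                           "fmt"
                           "os"
                           "strings"
                       )
                       
                       func main() {
                           raw, err := os.ReadFile("/var/run/secrets/kubernetes.io/serviceaccount/token")
                           if err != nil {
                               panic(err)
                           }
                           parts := strings.Split(strings.TrimSpace(string(raw)), ".")
                           if len(parts) != 3 {
                               panic("not a JWT")
                           }
                           payload, err := base64.RawURLEncoding.DecodeString(parts[1]) // the claims segment
                           if err != nil {
                               panic(err)
                           }
                           var pretty bytes.Buffer
                           if err := json.Indent(&pretty, payload, "", "  "); err != nil {
                               panic(err)
                           }
                           fmt.Println(pretty.String())
                       }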

                       📊 Communication Patterns Summary

                       1. Active push vs. passive pull

                       Component          | Pattern               | Notes
                       Kubelet            | actively connects     | List-Watch against the API Server
                       Scheduler          | actively connects     | List-Watch against the API Server
                       Controller Manager | actively connects     | List-Watch against the API Server
                       kube-proxy         | actively connects     | List-Watch against the API Server
                       kubectl            | actively requests     | RESTful API calls
                       API Server → etcd  | actively reads/writes | gRPC connection to etcd

                       Important: the API Server never initiates connections to the other components; every component connects to the API Server.

                      2. 通信协议

                      ┌─────────────────────────────────────────┐
                      │  API Server 对外暴露的协议               │
                      ├─────────────────────────────────────────┤
                      │                                          │
                      │  1. HTTPS (主要协议)                     │
                      │     - RESTful API                       │
                      │     - 端口: 6443 (默认)                  │
                      │     - 所有组件使用                       │
                      │                                          │
                      │  2. HTTP (不推荐)                        │
                      │     - 仅用于本地测试                     │
                      │     - 端口: 8080 (默认,已废弃)          │
                      │     - 生产环境禁用                       │
                      │                                          │
                      │  3. WebSocket (特殊场景)                │
                      │     - kubectl exec/logs/port-forward    │
                      │     - 基于 HTTPS 升级                    │
                      └─────────────────────────────────────────┘
                      
                      ┌─────────────────────────────────────────┐
                      │  API Server 对 etcd 的协议               │
                      ├─────────────────────────────────────────┤
                      │                                          │
                      │  gRPC (HTTP/2)                          │
                      │  - 端口: 2379                            │
                      │  - mTLS 双向认证                         │
                      │  - 高性能二进制协议                      │
                      └─────────────────────────────────────────┘

                      🛠️ 实战:监控各组件通信

                      1. 查看组件连接状态

                      # 1. 查看 API Server 监听端口
                      netstat -tlnp | grep kube-apiserver
                      # tcp   0   0 :::6443   :::*   LISTEN   12345/kube-apiserver
                      
                      # 2. 查看连接到 API Server 的客户端
                      netstat -anp | grep :6443 | grep ESTABLISHED
                      # tcp   0   0 192.168.1.10:6443   192.168.1.11:45678   ESTABLISHED   (Kubelet)
                      # tcp   0   0 192.168.1.10:6443   192.168.1.10:45679   ESTABLISHED   (Scheduler)
                      # tcp   0   0 192.168.1.10:6443   192.168.1.10:45680   ESTABLISHED   (Controller Manager)
                      
                      # 3. 查看 API Server 日志
                      journalctl -u kube-apiserver -f
                      # I1011 10:00:00.123456   12345 httplog.go:89] "HTTP" verb="GET" URI="/api/v1/pods?watch=true" latency="30.123ms" userAgent="kubelet/v1.28.0" srcIP="192.168.1.11:45678"
                      
                      # 4. 查看 Kubelet 连接
                      journalctl -u kubelet -f | grep "Connecting to API"

                      2. 使用 tcpdump 抓包

                      # 抓取 API Server 通信(6443 端口)
                      tcpdump -i any -n port 6443 -A -s 0
                      
                      # 抓取特定主机的通信
                      tcpdump -i any -n host 192.168.1.11 and port 6443
                      
                      # 保存到文件,用 Wireshark 分析
                      tcpdump -i any -n port 6443 -w api-traffic.pcap

                      3. API Server Audit 日志

                      # API Server 审计配置
                      apiVersion: v1
                      kind: Policy
                      rules:
                      # 记录所有请求元数据
                      - level: Metadata
                        verbs: ["get", "list", "watch"]
                      # 记录创建/更新/删除的完整请求和响应
                      - level: RequestResponse
                        verbs: ["create", "update", "patch", "delete"]
                      # 启用 Audit 日志
                      kube-apiserver \
                        --audit-policy-file=/etc/kubernetes/audit-policy.yaml \
                        --audit-log-path=/var/log/kubernetes/audit.log \
                        --audit-log-maxage=30 \
                        --audit-log-maxbackup=10 \
                        --audit-log-maxsize=100
                      
                      # 查看审计日志
                      tail -f /var/log/kubernetes/audit.log | jq
                      
                      # 示例输出:
                      {
                        "kind": "Event",
                        "apiVersion": "audit.k8s.io/v1",
                        "level": "Metadata",
                        "auditID": "abc-123",
                        "stage": "ResponseComplete",
                        "requestURI": "/api/v1/namespaces/default/pods?watch=true",
                        "verb": "watch",
                        "user": {
                          "username": "system:node:worker-1",
                          "groups": ["system:nodes"]
                        },
                        "sourceIPs": ["192.168.1.11"],
                        "userAgent": "kubelet/v1.28.0",
                        "responseStatus": {
                          "code": 200
                        }
                      }

                      🔍 高级话题

                      1. API Server 聚合层 (API Aggregation)

                       The aggregation layer allows the API Server to be extended with custom APIs: requests for a registered API group are proxied by the main API Server to an extension API server.

                      ┌────────────────────────────────────────┐
                      │       Main API Server (kube-apiserver) │
                      │         /api, /apis                    │
                      └───────────────┬────────────────────────┘
                                      │ 代理请求
                              ┌───────┴────────┐
                              ▼                ▼
                      ┌──────────────┐  ┌─────────────────┐
                      │ Metrics API  │  │ Custom API      │
                      │ /apis/metrics│  │ /apis/my.api/v1 │
                      └──────────────┘  └─────────────────┘

                      注册 APIService

                      apiVersion: apiregistration.k8s.io/v1
                      kind: APIService
                      metadata:
                        name: v1beta1.metrics.k8s.io
                      spec:
                        service:
                          name: metrics-server
                          namespace: kube-system
                          port: 443
                        group: metrics.k8s.io
                        version: v1beta1
                        insecureSkipTLSVerify: true
                        groupPriorityMinimum: 100
                        versionPriority: 100

                      请求路由

                      # 客户端请求
                      kubectl top nodes
                      # 等价于: GET /apis/metrics.k8s.io/v1beta1/nodes
                      
                      # API Server 处理:
                      # 1. 检查路径 /apis/metrics.k8s.io/v1beta1
                      # 2. 查找对应的 APIService
                      # 3. 代理请求到 metrics-server Service
                      # 4. 返回结果给客户端
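
                       A programmatic equivalent, sketched with client-go (it assumes metrics-server is installed and a local kubeconfig at the default path): a raw GET against the aggregated path goes to the main API Server, which proxies it to metrics-server exactly as described above.

                       package main
                       
                       import (
                           "context"
                           "fmt"
                       
                           "k8s.io/client-go/kubernetes"
                           "k8s.io/client-go/tools/clientcmd"
                       )
                       
                       func main() {
                           config, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
                           if err != nil {
                               panic(err)
                           }
                           clientset, err := kubernetes.NewForConfig(config)
                           if err != nil {
                               panic(err)
                           }
                           // Equivalent to: kubectl get --raw /apis/metrics.k8s.io/v1beta1/nodes
                           body, err := clientset.Discovery().RESTClient().
                               Get().
                               AbsPath("/apis/metrics.k8s.io/v1beta1/nodes").
                               DoRaw(context.TODO())
                           if err != nil {
                               panic(err)
                           }
                           fmt.Println(string(body))
                       }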

                      2. API Priority and Fairness (APF)

                       APF classifies incoming requests and enforces per-priority concurrency limits on the API Server, so that critical traffic (Kubelet, leader election) is not starved by bulk client traffic.

                      # FlowSchema - 定义请求匹配规则
                      apiVersion: flowcontrol.apiserver.k8s.io/v1beta3
                      kind: FlowSchema
                      metadata:
                        name: system-nodes
                      spec:
                        priorityLevelConfiguration:
                          name: system  # 关联到优先级配置
                        matchingPrecedence: 900
                        distinguisherMethod:
                          type: ByUser
                        rules:
                        - subjects:
                          - kind: Group
                            group:
                              name: system:nodes  # 匹配 Kubelet 请求
                          resourceRules:
                          - verbs: ["*"]
                            apiGroups: ["*"]
                            resources: ["*"]
                            namespaces: ["*"]
                      
                      ---
                      # PriorityLevelConfiguration - 定义并发限制
                      apiVersion: flowcontrol.apiserver.k8s.io/v1beta3
                      kind: PriorityLevelConfiguration
                      metadata:
                        name: system
                      spec:
                        type: Limited
                         limited:
                           nominalConcurrencyShares: 30  # guaranteed concurrency shares (named assuredConcurrencyShares before v1beta3)
                           limitResponse:
                             type: Queue
                             queuing:
                               queues: 64           # number of queues
                               queueLengthLimit: 50 # length of each queue
                               handSize: 6          # shuffle-sharding hand size

                      APF 工作流程

                      请求进入 API Server
                          │
                          ├─→ 1. 匹配 FlowSchema (按 precedence 排序)
                          │      - 检查 subject (user/group/serviceaccount)
                          │      - 检查 resource (API 路径)
                          │
                          ├─→ 2. 确定 PriorityLevel
                          │      - system (高优先级,Kubelet/Scheduler)
                          │      - leader-election (中优先级,Controller Manager)
                          │      - workload-high (用户请求)
                          │      - catch-all (默认)
                          │
                          ├─→ 3. 检查并发限制
                          │      - 当前并发数 < assuredConcurrencyShares: 立即执行
                          │      - 超过限制: 进入队列等待
                          │
                          └─→ 4. 执行或拒绝
                                 - 队列有空位: 等待执行
                                 - 队列满: 返回 429 Too Many Requests

                      查看 APF 状态

                      # 查看所有 FlowSchema
                      kubectl get flowschemas
                      
                      # 查看 PriorityLevelConfiguration
                      kubectl get prioritylevelconfigurations
                      
                      # 查看实时指标
                      kubectl get --raw /metrics | grep apiserver_flowcontrol
                      
                      # 关键指标:
                      # apiserver_flowcontrol_current_inqueue_requests: 当前排队请求数
                      # apiserver_flowcontrol_rejected_requests_total: 被拒绝的请求数
                      # apiserver_flowcontrol_request_concurrency_limit: 并发限制
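
                       The same apiserver_flowcontrol_* metrics can also be pulled programmatically. A small client-go sketch (the kubeconfig path and the metric-name prefix are the only assumptions): issue a raw GET on /metrics and filter the lines of interest.

                       package main
                       
                       import (
                           "context"
                           "fmt"
                           "strings"
                       
                           "k8s.io/client-go/kubernetes"
                           "k8s.io/client-go/tools/clientcmd"
                       )
                       
                       func main() {
                           config, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
                           if err != nil {
                               panic(err)
                           }
                           clientset, err := kubernetes.NewForConfig(config)
                           if err != nil {
                               panic(err)
                           }
                           body, err := clientset.Discovery().RESTClient().
                               Get().
                               AbsPath("/metrics").
                               DoRaw(context.TODO())
                           if err != nil {
                               panic(err)
                           }
                           for _, line := range strings.Split(string(body), "\n") {
                               if strings.HasPrefix(line, "apiserver_flowcontrol_") {
                                   fmt.Println(line)
                               }
                           }
                       }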

                      3. Watch Bookmark

                       Bookmarks optimize Watch performance by reducing the cost of re-establishing a watch after a disconnect.

                       // Enable Watch Bookmarks. Use a variable name other than `watch` so the
                       // k8s.io/apimachinery/pkg/watch package (watch.Added, watch.Bookmark, ...)
                       // is not shadowed.
                       w, err := clientset.CoreV1().Pods("default").Watch(
                           context.TODO(),
                           metav1.ListOptions{
                               Watch:               true,
                               AllowWatchBookmarks: true, // 🔑 enable Bookmark events
                           },
                       )
                       if err != nil {
                           panic(err)
                       }
                       
                       for event := range w.ResultChan() {
                           switch event.Type {
                           case watch.Added:
                               // handle add events
                           case watch.Modified:
                               // handle update events
                           case watch.Deleted:
                               // handle delete events
                           case watch.Bookmark:
                               // 🔑 Bookmark event (no actual data change):
                               // it only tells the client the current ResourceVersion,
                               // which makes resuming after a disconnect cheap.
                               pod := event.Object.(*v1.Pod)
                               currentRV := pod.ResourceVersion
                               fmt.Printf("Bookmark at ResourceVersion: %s\n", currentRV)
                           }
                       }

                      Bookmark 的作用

                      没有 Bookmark:
                      ┌──────────────────────────────────────┐
                      │ 客户端 Watch 从 ResourceVersion 100  │
                      │ 长时间没有事件(如 1 小时)             │
                      │ 连接断开                              │
                      │ 重连时: Watch from RV 100            │
                      │ API Server 需要回放 100-200 之间的    │
                      │ 所有事件(即使客户端不需要)            │
                      └──────────────────────────────────────┘
                      
                      有 Bookmark:
                      ┌──────────────────────────────────────┐
                      │ 客户端 Watch 从 ResourceVersion 100  │
                      │ 每 10 分钟收到 Bookmark              │
                      │   RV 110 (10 分钟后)                 │
                      │   RV 120 (20 分钟后)                 │
                      │   RV 130 (30 分钟后)                 │
                      │ 连接断开                              │
                      │ 重连时: Watch from RV 130 ✅         │
                      │ 只需回放 130-200 之间的事件           │
                      └──────────────────────────────────────┘
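
                       A sketch of how a client actually uses bookmarks across reconnects: remember the last ResourceVersion seen (Bookmark events update it too) and pass it back in ListOptions when re-establishing the watch. The program below is illustrative, not from the original notes; production code should also re-list when the server answers "410 Gone".

                       package main
                       
                       import (
                           "context"
                           "fmt"
                       
                           v1 "k8s.io/api/core/v1"
                           metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
                           "k8s.io/client-go/kubernetes"
                           "k8s.io/client-go/tools/clientcmd"
                       )
                       
                       // watchPods watches Pods in "default" starting from fromRV and returns the
                       // last ResourceVersion observed when the watch channel closes.
                       func watchPods(clientset *kubernetes.Clientset, fromRV string) (string, error) {
                           w, err := clientset.CoreV1().Pods("default").Watch(context.TODO(), metav1.ListOptions{
                               ResourceVersion:     fromRV, // a bookmarked RV makes the replay on reconnect cheap
                               AllowWatchBookmarks: true,
                           })
                           if err != nil {
                               return fromRV, err
                           }
                           defer w.Stop()
                           lastRV := fromRV
                           for event := range w.ResultChan() {
                               if pod, ok := event.Object.(*v1.Pod); ok {
                                   lastRV = pod.ResourceVersion // updated by Bookmark events as well as real changes
                               }
                           }
                           return lastRV, nil
                       }
                       
                       func main() {
                           config, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
                           if err != nil {
                               panic(err)
                           }
                           clientset, err := kubernetes.NewForConfig(config)
                           if err != nil {
                               panic(err)
                           }
                           rv := ""
                           for { // naive reconnect loop; real code should also re-list on "410 Gone"
                               if rv, err = watchPods(clientset, rv); err != nil {
                                   fmt.Println("watch error:", err)
                               }
                           }
                       }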

                      4. 客户端限流 (Client-side Rate Limiting)

                       Client-side rate limiting prevents a single client from overwhelming the API Server.

                       // client-go's default rate limiting lives on rest.Config
                       config := &rest.Config{
                           Host: "https://192.168.1.10:6443",
                           // QPS limit
                           QPS: 50.0,        // at most 50 requests per second on average
                           // Burst limit
                           Burst: 100,       // short bursts of up to 100 requests
                       }
                       
                       clientset, err := kubernetes.NewForConfig(config) // returns (*Clientset, error)
                       if err != nil {
                           panic(err)
                       }
                       
                       // Custom rate limiter
                       import "golang.org/x/time/rate"
                       
                       rateLimiter := rate.NewLimiter(
                           rate.Limit(50),  // 50 requests per second
                           100,             // burst of 100
                       )
                       
                       // Wait before each request is sent
                       rateLimiter.Wait(context.Background())
                       clientset.CoreV1().Pods("default").List(...)
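
                       As an alternative to the hand-rolled limiter above, rest.Config also accepts a RateLimiter directly; client-go's flowcontrol package provides a token-bucket implementation. A hedged sketch (the kubeconfig path and the 50/100 numbers are illustrative):

                       package main
                       
                       import (
                           "context"
                           "fmt"
                       
                           metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
                           "k8s.io/client-go/kubernetes"
                           "k8s.io/client-go/tools/clientcmd"
                           "k8s.io/client-go/util/flowcontrol"
                       )
                       
                       func main() {
                           config, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
                           if err != nil {
                               panic(err)
                           }
                           // Token bucket: 50 requests/second sustained, bursts of up to 100.
                           // A configured RateLimiter overrides the plain QPS/Burst fields.
                           config.RateLimiter = flowcontrol.NewTokenBucketRateLimiter(50, 100)
                       
                           clientset, err := kubernetes.NewForConfig(config)
                           if err != nil {
                               panic(err)
                           }
                           pods, err := clientset.CoreV1().Pods("default").List(context.TODO(), metav1.ListOptions{})
                           if err != nil {
                               panic(err)
                           }
                           fmt.Println("pods:", len(pods.Items))
                       }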

                      📈 性能优化

                      1. API Server 侧优化

                       # API Server startup flags worth tuning:
                       #   --max-requests-inflight / --max-mutating-requests-inflight : concurrent (read / write) request limits
                       #   --watch-cache-sizes         : per-resource watch cache sizes
                       #   --etcd-servers-overrides    : keep noisy resources (events) in a separate etcd
                       #   --enable-aggregator-routing : route aggregated-API traffic directly to endpoint IPs
                       #   --default-watch-cache-size  : default watch cache size
                       kube-apiserver \
                         --max-requests-inflight=400 \
                         --max-mutating-requests-inflight=200 \
                         --watch-cache-sizes=pods#1000,nodes#100 \
                         --etcd-servers-overrides=/events#https://etcd-1:2379 \
                         --enable-aggregator-routing=true \
                         --default-watch-cache-size=100

                      2. Client 侧优化

                       // 1. Use an Informer (local cache)
                       factory := informers.NewSharedInformerFactory(clientset, 30*time.Second)
                       podInformer := factory.Core().V1().Pods()
                       
                       // Read from the local cache instead of calling the API Server
                       pod, _ := podInformer.Lister().Pods("default").Get("nginx")
                       
                       // 2. Use a Field Selector to cut down the amount of data transferred
                       listOptions := metav1.ListOptions{
                           FieldSelector: "spec.nodeName=worker-1",  // only Pods on a specific node
                       }
                       
                       // 3. Use a Label Selector
                       listOptions := metav1.ListOptions{
                           LabelSelector: "app=nginx",  // only Pods with a specific label
                       }
                       
                       // 4. Limit the number of objects returned (pagination)
                       listOptions := metav1.ListOptions{
                           Limit: 100,  // return at most 100 objects per page
                       }
                       
                       // 5. Batch operations
                       // Not recommended: creating 100 Pods in a loop (100 API calls)
                       for i := 0; i < 100; i++ {
                           clientset.CoreV1().Pods("default").Create(...)
                       }
                       
                       // Recommended: a single Deployment/Job instead (1 API call)
                       // (int32Ptr is the usual small helper returning *int32)
                       deployment := &appsv1.Deployment{
                           Spec: appsv1.DeploymentSpec{
                               Replicas: int32Ptr(100),
                               ...
                           },
                       }
                       clientset.AppsV1().Deployments("default").Create(
                           context.TODO(), deployment, metav1.CreateOptions{})
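
                       A fuller Informer sketch than point 1 above: start the shared factory, wait for the cache to sync, then serve reads from the local cache and react to events without extra API Server round-trips. Names and namespaces are illustrative.

                       package main
                       
                       import (
                           "fmt"
                           "time"
                       
                           v1 "k8s.io/api/core/v1"
                           "k8s.io/apimachinery/pkg/labels"
                           "k8s.io/client-go/informers"
                           "k8s.io/client-go/kubernetes"
                           "k8s.io/client-go/tools/cache"
                           "k8s.io/client-go/tools/clientcmd"
                       )
                       
                       func main() {
                           config, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
                           if err != nil {
                               panic(err)
                           }
                           clientset, err := kubernetes.NewForConfig(config)
                           if err != nil {
                               panic(err)
                           }
                       
                           factory := informers.NewSharedInformerFactory(clientset, 30*time.Second)
                           podInformer := factory.Core().V1().Pods()
                       
                           // React to events delivered by the shared List-Watch.
                           podInformer.Informer().AddEventHandler(cache.ResourceEventHandlerFuncs{
                               AddFunc:    func(obj interface{}) { fmt.Println("added:", obj.(*v1.Pod).Name) },
                               DeleteFunc: func(obj interface{}) { fmt.Println("deleted a pod") },
                           })
                       
                           stopCh := make(chan struct{})
                           defer close(stopCh)
                           factory.Start(stopCh)            // starts the List-Watch against the API Server
                           factory.WaitForCacheSync(stopCh) // block until the local cache is warm
                       
                           // Reads are now served from the local cache, not from the API Server.
                           pods, err := podInformer.Lister().Pods("default").List(labels.Everything())
                           if err != nil {
                               panic(err)
                           }
                           fmt.Println("pods in local cache:", len(pods))
                       
                           time.Sleep(time.Minute) // keep the informer running to receive events
                       }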

                       💡 Key Takeaways

                       Communication patterns

                       1. Every component connects to the API Server (the API Server never pushes proactively)
                       2. List-Watch is the core mechanism (initial List + continuous Watch)
                       3. Long-lived HTTP connections (Chunked Transfer Encoding)
                       4. ResourceVersion guarantees consistency (no events lost across reconnects)

                       Authentication and authorization

                       1. X.509 certificates (cluster components)
                       2. ServiceAccount Tokens (applications inside Pods)
                       3. RBAC authorization (fine-grained permissions)
                       4. Admission control (request validation and mutation)

                       Performance

                       1. Informer local caches (less load on the API Server)
                       2. Field/Label Selectors (less data on the wire)
                       3. APF flow control (protects the API Server from overload)
                       4. Client-side rate limiting (prevents clients from overwhelming the API Server)

                       Best practices

                       1. Use Informers instead of polling
                       2. Set QPS and Burst sensibly
                       3. Avoid frequent List calls
                       4. Filter with Field Selectors
                       5. Enable Watch Bookmarks
                       6. Monitor API Server metrics
                      Mar 7, 2024

                      Monitor

                        Mar 7, 2025

                        Subsections of Networking

                        Ingress

                        Kubernetes Ingress 原理详解

                         Ingress is the Kubernetes API object that manages external access to services inside the cluster, providing HTTP/HTTPS routing.


                        🎯 Ingress 的作用

                        没有 Ingress 的问题

                        问题 1:每个服务需要一个 LoadBalancer
                        ┌────────────────────────────────────┐
                        │  Service A (LoadBalancer)  $$$     │
                        │  Service B (LoadBalancer)  $$$     │
                        │  Service C (LoadBalancer)  $$$     │
                        └────────────────────────────────────┘
                        成本高、管理复杂、IP 地址浪费
                        
                        问题 2:无法基于域名/路径路由
                        客户端 → NodePort:30001 (Service A)
                        客户端 → NodePort:30002 (Service B)
                        需要记住不同的端口,不友好

                        使用 Ingress 的方案

                        单一入口 + 智能路由
                        ┌───────────────────────────────────────┐
                        │         Ingress Controller            │
                        │    (一个 LoadBalancer 或 NodePort)    │
                        └───────────┬───────────────────────────┘
                                    │ 根据域名/路径路由
                            ┌───────┴───────┬──────────┐
                            ▼               ▼          ▼
                        Service A       Service B   Service C
                        (ClusterIP)     (ClusterIP) (ClusterIP)

                        🏗️ Ingress 架构组成

                        核心组件

                        ┌─────────────────────────────────────────────┐
                        │              Ingress 生态系统                │
                        ├─────────────────────────────────────────────┤
                        │  1. Ingress Resource (资源对象)             │
                        │     └─ 定义路由规则(YAML)                   │
                        │                                              │
                        │  2. Ingress Controller (控制器)             │
                        │     └─ 读取 Ingress,配置负载均衡器          │
                        │                                              │
                        │  3. 负载均衡器 (Nginx/Traefik/HAProxy)      │
                        │     └─ 实际处理流量的组件                   │
                        └─────────────────────────────────────────────┘

                        📋 Ingress Resource (资源定义)

                        基础示例

                        apiVersion: networking.k8s.io/v1
                        kind: Ingress
                        metadata:
                          name: example-ingress
                          annotations:
                            nginx.ingress.kubernetes.io/rewrite-target: /
                        spec:
                          # 1. 基于域名路由
                          rules:
                          - host: example.com
                            http:
                              paths:
                              - path: /
                                pathType: Prefix
                                backend:
                                  service:
                                    name: web-service
                                    port:
                                      number: 80
                          
                          # 2. TLS/HTTPS 配置
                          tls:
                          - hosts:
                            - example.com
                            secretName: example-tls

                        完整功能示例

                        apiVersion: networking.k8s.io/v1
                        kind: Ingress
                        metadata:
                          name: advanced-ingress
                          namespace: default
                          annotations:
                            # Nginx 特定配置
                            nginx.ingress.kubernetes.io/rewrite-target: /$2
                            nginx.ingress.kubernetes.io/ssl-redirect: "true"
                            nginx.ingress.kubernetes.io/rate-limit: "100"
                            # 自定义响应头
                            nginx.ingress.kubernetes.io/configuration-snippet: |
                              add_header X-Custom-Header "Hello from Ingress";
                        spec:
                          # IngressClass (指定使用哪个 Ingress Controller)
                          ingressClassName: nginx
                          
                          # TLS 配置
                          tls:
                          - hosts:
                            - app.example.com
                            - api.example.com
                            secretName: example-tls-secret
                          
                          # 路由规则
                          rules:
                          # 规则 1:app.example.com
                          - host: app.example.com
                            http:
                              paths:
                              - path: /
                                pathType: Prefix
                                backend:
                                  service:
                                    name: frontend-service
                                    port:
                                      number: 80
                          
                          # 规则 2:api.example.com
                          - host: api.example.com
                            http:
                              paths:
                              # /v1/* 路由到 api-v1
                              - path: /v1
                                pathType: Prefix
                                backend:
                                  service:
                                    name: api-v1-service
                                    port:
                                      number: 8080
                              
                              # /v2/* 路由到 api-v2
                              - path: /v2
                                pathType: Prefix
                                backend:
                                  service:
                                    name: api-v2-service
                                    port:
                                      number: 8080
                          
                          # 规则 3:默认后端(可选)
                          defaultBackend:
                            service:
                              name: default-backend
                              port:
                                number: 80

                         🎛️ PathType (path matching types)

                         The three matching types

                         PathType               | Matching rule                      | Example
                         Prefix                 | prefix match                       | /foo matches /foo, /foo/, /foo/bar
                         Exact                  | exact match                        | /foo matches only /foo, not /foo/
                         ImplementationSpecific | decided by the Ingress Controller  | depends on the implementation

                        示例对比

                        apiVersion: networking.k8s.io/v1
                        kind: Ingress
                        metadata:
                          name: path-types-demo
                        spec:
                          rules:
                          - host: example.com
                            http:
                              paths:
                              # Prefix 匹配
                              - path: /api
                                pathType: Prefix
                                backend:
                                  service:
                                    name: api-service
                                    port:
                                      number: 8080
                              # 匹配:
                              # ✅ /api
                              # ✅ /api/
                              # ✅ /api/users
                              # ✅ /api/v1/users
                              
                              # Exact 匹配
                              - path: /login
                                pathType: Exact
                                backend:
                                  service:
                                    name: auth-service
                                    port:
                                      number: 80
                              # 匹配:
                              # ✅ /login
                              # ❌ /login/
                              # ❌ /login/oauth

                         🚀 Ingress Controller

                         Common Ingress Controllers

                         Controller    | Characteristics                | Typical use
                         Nginx Ingress | most popular, feature-rich     | general purpose, recommended for production
                         Traefik       | cloud native, dynamic config   | microservices, automatic service discovery
                         HAProxy       | high performance, enterprise   | heavy traffic, high concurrency
                         Kong          | API gateway features           | API management, plugin ecosystem
                         Istio Gateway | service mesh integration       | complex microservice architectures
                         AWS ALB       | cloud native (AWS)             | AWS environments
                         GCE           | cloud native (GCP)             | GCP environments

                        🔧 Ingress Controller 工作原理

                        核心流程

                        ┌─────────────────────────────────────────────┐
                        │  1. 用户创建/更新 Ingress Resource          │
                        │     kubectl apply -f ingress.yaml           │
                        └────────────────┬────────────────────────────┘
                                         │
                                         ▼
                        ┌─────────────────────────────────────────────┐
                        │  2. Ingress Controller 监听 API Server      │
                        │     - Watch Ingress 对象                    │
                        │     - Watch Service 对象                    │
                        │     - Watch Endpoints 对象                  │
                        └────────────────┬────────────────────────────┘
                                         │
                                         ▼
                        ┌─────────────────────────────────────────────┐
                        │  3. 生成配置文件                             │
                        │     Nginx:  /etc/nginx/nginx.conf          │
                        │     Traefik: 动态配置                       │
                        │     HAProxy: /etc/haproxy/haproxy.cfg      │
                        └────────────────┬────────────────────────────┘
                                         │
                                         ▼
                        ┌─────────────────────────────────────────────┐
                        │  4. 重载/更新负载均衡器                      │
                        │     nginx -s reload                         │
                        └────────────────┬────────────────────────────┘
                                         │
                                         ▼
                        ┌─────────────────────────────────────────────┐
                        │  5. 流量路由生效                             │
                        │     客户端请求 → Ingress → Service → Pod    │
                        └─────────────────────────────────────────────┘

                        📦 部署 Nginx Ingress Controller

                        方式 1:使用官方 Helm Chart (推荐)

                        # 添加 Helm 仓库
                        helm repo add ingress-nginx https://kubernetes.github.io/ingress-nginx
                        helm repo update
                        
                        # 安装
                        helm install ingress-nginx ingress-nginx/ingress-nginx \
                          --namespace ingress-nginx \
                          --create-namespace \
                          --set controller.service.type=LoadBalancer
                        
                        # 查看部署状态
                        kubectl get pods -n ingress-nginx
                        kubectl get svc -n ingress-nginx

                        方式 2:使用 YAML 部署

                        # 下载官方 YAML
                        kubectl apply -f https://raw.githubusercontent.com/kubernetes/ingress-nginx/main/deploy/static/provider/cloud/deploy.yaml
                        
                        # 查看部署
                        kubectl get all -n ingress-nginx

                        核心组件

                        # 1. Deployment - Ingress Controller Pod
                        apiVersion: apps/v1
                        kind: Deployment
                        metadata:
                          name: ingress-nginx-controller
                          namespace: ingress-nginx
                        spec:
                          replicas: 2  # 高可用建议 2+
                          selector:
                            matchLabels:
                              app.kubernetes.io/name: ingress-nginx
                          template:
                            metadata:
                              labels:
                                app.kubernetes.io/name: ingress-nginx
                            spec:
                              serviceAccountName: ingress-nginx
                              containers:
                              - name: controller
                                image: registry.k8s.io/ingress-nginx/controller:v1.9.0
                                args:
                                - /nginx-ingress-controller
                                - --election-id=ingress-nginx-leader
                                - --controller-class=k8s.io/ingress-nginx
                                - --configmap=$(POD_NAMESPACE)/ingress-nginx-controller
                                ports:
                                - name: http
                                  containerPort: 80
                                - name: https
                                  containerPort: 443
                                livenessProbe:
                                  httpGet:
                                    path: /healthz
                                    port: 10254
                                readinessProbe:
                                  httpGet:
                                    path: /healthz
                                    port: 10254
                        
                        ---
                        # 2. Service - 暴露 Ingress Controller
                        apiVersion: v1
                        kind: Service
                        metadata:
                          name: ingress-nginx-controller
                          namespace: ingress-nginx
                        spec:
                          type: LoadBalancer  # 或 NodePort
                          ports:
                          - name: http
                            port: 80
                            targetPort: 80
                            protocol: TCP
                          - name: https
                            port: 443
                            targetPort: 443
                            protocol: TCP
                          selector:
                            app.kubernetes.io/name: ingress-nginx
                        
                        ---
                        # 3. ConfigMap - Nginx 全局配置
                        apiVersion: v1
                        kind: ConfigMap
                        metadata:
                          name: ingress-nginx-controller
                          namespace: ingress-nginx
                        data:
                          # 自定义 Nginx 配置
                          proxy-body-size: "100m"
                          proxy-connect-timeout: "15"
                          proxy-read-timeout: "600"
                          proxy-send-timeout: "600"
                          use-forwarded-headers: "true"

                        🌐 完整流量路径

                        请求流程详解

                        客户端
                          │ 1. DNS 解析
                          │    example.com → LoadBalancer IP (1.2.3.4)
                          ▼
                        LoadBalancer / NodePort
                          │ 2. 转发到 Ingress Controller Pod
                          ▼
                        Ingress Controller (Nginx Pod)
                          │ 3. 读取 Ingress 规则
                          │    Host: example.com
                          │    Path: /api/users
                          │ 4. 匹配规则
                          │    rule: host=example.com, path=/api
                          │    backend: api-service:8080
                          ▼
                        Service (api-service)
                          │ 5. Service 选择器匹配 Pod
                          │    selector: app=api
                          │ 6. 查询 Endpoints
                          │    endpoints: 10.244.1.5:8080, 10.244.2.8:8080
                          │ 7. 负载均衡(默认轮询)
                          ▼
                        Pod (api-xxxx)
                          │ 8. 容器处理请求
                          │    Container Port: 8080
                          ▼
                        应用响应
                          │ 9. 原路返回
                          ▼
                        客户端收到响应

                        网络数据包追踪

                        # 客户端发起请求
                        curl -H "Host: example.com" http://1.2.3.4/api/users
                        
                        # 1. DNS 解析
                        example.com → 1.2.3.4 (LoadBalancer External IP)
                        
                        # 2. TCP 连接
                        Client:54321 → LoadBalancer:80
                        
                        # 3. LoadBalancer 转发
                        LoadBalancer:80 → Ingress Controller Pod:80 (10.244.0.5:80)
                        
                        # 4. Ingress Controller 内部处理
                        Nginx 读取配置:
                          location /api {
                            proxy_pass http://api-service.default.svc.cluster.local:8080;
                          }
                        
                        # 5. 查询 Service
                        kube-proxy/iptables 规则:
                          api-service:8080 → Endpoints
                        
                        # 6. 负载均衡到 Pod
                        10.244.0.5 → 10.244.1.5:8080 (Pod IP)
                        
                        # 7. 响应返回
                        Pod → Ingress Controller → LoadBalancer → Client

                        🔒 HTTPS/TLS 配置

                        创建 TLS Secret

                        # 方式 1:使用自签名证书(测试环境)
                        openssl req -x509 -nodes -days 365 -newkey rsa:2048 \
                          -keyout tls.key -out tls.crt \
                          -subj "/CN=example.com"
                        
                        kubectl create secret tls example-tls \
                          --cert=tls.crt \
                          --key=tls.key
                        
                        # 方式 2:使用 Let's Encrypt (生产环境,推荐)
                        # 安装 cert-manager
                        kubectl apply -f https://github.com/cert-manager/cert-manager/releases/download/v1.13.0/cert-manager.yaml
                        
                        # 创建 ClusterIssuer
                        kubectl apply -f - <<EOF
                        apiVersion: cert-manager.io/v1
                        kind: ClusterIssuer
                        metadata:
                          name: letsencrypt-prod
                        spec:
                          acme:
                            server: https://acme-v02.api.letsencrypt.org/directory
                            email: admin@example.com
                            privateKeySecretRef:
                              name: letsencrypt-prod
                            solvers:
                            - http01:
                                ingress:
                                  class: nginx
                        EOF

                        配置 HTTPS Ingress

                        apiVersion: networking.k8s.io/v1
                        kind: Ingress
                        metadata:
                          name: https-ingress
                          annotations:
                            # 自动重定向 HTTP 到 HTTPS
                            nginx.ingress.kubernetes.io/ssl-redirect: "true"
                            # 使用 cert-manager 自动申请证书
                            cert-manager.io/cluster-issuer: "letsencrypt-prod"
                        spec:
                          ingressClassName: nginx
                          tls:
                          - hosts:
                            - example.com
                            - www.example.com
                            secretName: example-tls  # cert-manager 会自动创建这个 Secret
                          rules:
                          - host: example.com
                            http:
                              paths:
                              - path: /
                                pathType: Prefix
                                backend:
                                  service:
                                    name: web-service
                                    port:
                                      number: 80

                        验证 HTTPS

                        # 检查证书
                        curl -v https://example.com
                        
                        # 查看 Secret
                        kubectl get secret example-tls
                        kubectl describe secret example-tls
                        
                        # 测试 HTTP 自动重定向
                        curl -I http://example.com
                        # HTTP/1.1 308 Permanent Redirect
                        # Location: https://example.com/

                        🎨 高级路由场景

                        场景 1:基于路径的路由

                        apiVersion: networking.k8s.io/v1
                        kind: Ingress
                        metadata:
                          name: path-based-routing
                          annotations:
                            nginx.ingress.kubernetes.io/rewrite-target: /$2
                        spec:
                          rules:
                          - host: myapp.com
                            http:
                              paths:
                              # /api/v1/* → api-v1-service
                              - path: /api/v1(/|$)(.*)
                                pathType: Prefix
                                backend:
                                  service:
                                    name: api-v1-service
                                    port:
                                      number: 8080
                              
                              # /api/v2/* → api-v2-service
                              - path: /api/v2(/|$)(.*)
                                pathType: Prefix
                                backend:
                                  service:
                                    name: api-v2-service
                                    port:
                                      number: 8080
                              
                              # /admin/* → admin-service
                              - path: /admin
                                pathType: Prefix
                                backend:
                                  service:
                                    name: admin-service
                                    port:
                                      number: 3000
                              
                              # /* → frontend-service (默认)
                              - path: /
                                pathType: Prefix
                                backend:
                                  service:
                                    name: frontend-service
                                    port:
                                      number: 80

                        场景 2:基于子域名的路由

                        apiVersion: networking.k8s.io/v1
                        kind: Ingress
                        metadata:
                          name: subdomain-routing
                        spec:
                          rules:
                          # www.example.com
                          - host: www.example.com
                            http:
                              paths:
                              - path: /
                                pathType: Prefix
                                backend:
                                  service:
                                    name: website-service
                                    port:
                                      number: 80
                          
                          # api.example.com
                          - host: api.example.com
                            http:
                              paths:
                              - path: /
                                pathType: Prefix
                                backend:
                                  service:
                                    name: api-service
                                    port:
                                      number: 8080
                          
                          # blog.example.com
                          - host: blog.example.com
                            http:
                              paths:
                              - path: /
                                pathType: Prefix
                                backend:
                                  service:
                                    name: blog-service
                                    port:
                                      number: 80
                          
                          # *.dev.example.com (通配符)
                          - host: "*.dev.example.com"
                            http:
                              paths:
                              - path: /
                                pathType: Prefix
                                backend:
                                  service:
                                    name: dev-environment
                                    port:
                                      number: 80

                        场景 3:金丝雀发布 (Canary Deployment)

                        # 主版本 Ingress
                        apiVersion: networking.k8s.io/v1
                        kind: Ingress
                        metadata:
                          name: production
                        spec:
                          rules:
                          - host: myapp.com
                            http:
                              paths:
                              - path: /
                                pathType: Prefix
                                backend:
                                  service:
                                    name: app-v1
                                    port:
                                      number: 80
                        
                        ---
                        # 金丝雀版本 Ingress
                        apiVersion: networking.k8s.io/v1
                        kind: Ingress
                        metadata:
                          name: canary
                          annotations:
                            nginx.ingress.kubernetes.io/canary: "true"
                            # 10% 流量到金丝雀版本
                            nginx.ingress.kubernetes.io/canary-weight: "10"
                            
                            # 或基于请求头
                            # nginx.ingress.kubernetes.io/canary-by-header: "X-Canary"
                            # nginx.ingress.kubernetes.io/canary-by-header-value: "always"
                            
                            # 或基于 Cookie
                            # nginx.ingress.kubernetes.io/canary-by-cookie: "canary"
                        spec:
                          rules:
                          - host: myapp.com
                            http:
                              paths:
                              - path: /
                                pathType: Prefix
                                backend:
                                  service:
                                    name: app-v2-canary
                                    port:
                                      number: 80

                        场景 4:A/B 测试

                        apiVersion: networking.k8s.io/v1
                        kind: Ingress
                        metadata:
                          name: ab-testing
                          annotations:
                            # 基于请求头进行 A/B 测试
                            nginx.ingress.kubernetes.io/canary: "true"
                            nginx.ingress.kubernetes.io/canary-by-header: "X-Version"
                            nginx.ingress.kubernetes.io/canary-by-header-value: "beta"
                        spec:
                          rules:
                          - host: myapp.com
                            http:
                              paths:
                              - path: /
                                pathType: Prefix
                                backend:
                                  service:
                                    name: app-beta
                                    port:
                                      number: 80
                        # 普通用户访问 A 版本
                        curl http://myapp.com
                        
                        # Beta 用户访问 B 版本
                        curl -H "X-Version: beta" http://myapp.com

                        🔧 常用 Annotations (Nginx)

                        基础配置

                        metadata:
                          annotations:
                            # SSL 重定向
                            nginx.ingress.kubernetes.io/ssl-redirect: "true"
                            
                            # 强制 HTTPS
                            nginx.ingress.kubernetes.io/force-ssl-redirect: "true"
                            
                            # 后端协议
                            nginx.ingress.kubernetes.io/backend-protocol: "HTTPS"  # 或 HTTP, GRPC
                            
                            # 路径重写
                            nginx.ingress.kubernetes.io/rewrite-target: /$2
                            
                            # URL 重写
                            nginx.ingress.kubernetes.io/use-regex: "true"

                        高级配置

                        metadata:
                          annotations:
                            # 上传文件大小限制
                            nginx.ingress.kubernetes.io/proxy-body-size: "100m"
                            
                            # 超时配置
                            nginx.ingress.kubernetes.io/proxy-connect-timeout: "600"
                            nginx.ingress.kubernetes.io/proxy-send-timeout: "600"
                            nginx.ingress.kubernetes.io/proxy-read-timeout: "600"
                            
                            # 会话保持 (Sticky Session)
                            nginx.ingress.kubernetes.io/affinity: "cookie"
                            nginx.ingress.kubernetes.io/session-cookie-name: "route"
                            nginx.ingress.kubernetes.io/session-cookie-expires: "172800"
                            nginx.ingress.kubernetes.io/session-cookie-max-age: "172800"
                            
                            # 限流
                            nginx.ingress.kubernetes.io/limit-rps: "100"  # 每秒请求数
                            nginx.ingress.kubernetes.io/limit-connections: "10"  # 并发连接数
                            
                            # CORS 配置
                            nginx.ingress.kubernetes.io/enable-cors: "true"
                            nginx.ingress.kubernetes.io/cors-allow-origin: "*"
                            nginx.ingress.kubernetes.io/cors-allow-methods: "GET, POST, PUT, DELETE, OPTIONS"
                            
                            # 白名单
                            nginx.ingress.kubernetes.io/whitelist-source-range: "10.0.0.0/8,192.168.0.0/16"
                            
                            # 基本认证
                            nginx.ingress.kubernetes.io/auth-type: basic
                            nginx.ingress.kubernetes.io/auth-secret: basic-auth
                            nginx.ingress.kubernetes.io/auth-realm: "Authentication Required"
                            
                            # 自定义 Nginx 配置片段
                            nginx.ingress.kubernetes.io/configuration-snippet: |
                              more_set_headers "X-Custom-Header: MyValue";
                              add_header X-Request-ID $request_id;

                        🛡️ 安全配置

                        1. 基本认证

                        # 创建密码文件
                        htpasswd -c auth admin
                        # 输入密码
                        
                        # 创建 Secret
                        kubectl create secret generic basic-auth --from-file=auth
                        
                        # 应用到 Ingress
                        kubectl apply -f - <<EOF
                        apiVersion: networking.k8s.io/v1
                        kind: Ingress
                        metadata:
                          name: secure-ingress
                          annotations:
                            nginx.ingress.kubernetes.io/auth-type: basic
                            nginx.ingress.kubernetes.io/auth-secret: basic-auth
                            nginx.ingress.kubernetes.io/auth-realm: "Authentication Required - Please enter your credentials"
                        spec:
                          rules:
                          - host: admin.example.com
                            http:
                              paths:
                              - path: /
                                pathType: Prefix
                                backend:
                                  service:
                                    name: admin-service
                                    port:
                                      number: 80
                        EOF

                        2. IP 白名单

                        apiVersion: networking.k8s.io/v1
                        kind: Ingress
                        metadata:
                          name: whitelist-ingress
                          annotations:
                            # 只允许特定 IP 访问
                            nginx.ingress.kubernetes.io/whitelist-source-range: "10.0.0.0/8,192.168.1.100/32"
                        spec:
                          rules:
                          - host: internal.example.com
                            http:
                              paths:
                              - path: /
                                pathType: Prefix
                                backend:
                                  service:
                                    name: internal-service
                                    port:
                                      number: 80

                        3. OAuth2 认证

                        apiVersion: networking.k8s.io/v1
                        kind: Ingress
                        metadata:
                          name: oauth2-ingress
                          annotations:
                            nginx.ingress.kubernetes.io/auth-url: "https://oauth2-proxy.example.com/oauth2/auth"
                            nginx.ingress.kubernetes.io/auth-signin: "https://oauth2-proxy.example.com/oauth2/start?rd=$escaped_request_uri"
                        spec:
                          rules:
                          - host: app.example.com
                            http:
                              paths:
                              - path: /
                                pathType: Prefix
                                backend:
                                  service:
                                    name: protected-service
                                    port:
                                      number: 80

                        📊 监控和调试

                        查看 Ingress 状态

                        # 列出所有 Ingress
                        kubectl get ingress
                        
                        # 详细信息
                        kubectl describe ingress example-ingress
                        
                        # 查看 Ingress Controller 日志
                        kubectl logs -n ingress-nginx -l app.kubernetes.io/name=ingress-nginx -f
                        
                        # 查看生成的 Nginx 配置
                        kubectl exec -n ingress-nginx <ingress-controller-pod> -- cat /etc/nginx/nginx.conf

                        Test Ingress Rules

                        # Test DNS resolution
                        nslookup example.com
                        
                        # Test HTTP
                        curl -H "Host: example.com" http://<ingress-ip>/
                        
                        # Test HTTPS
                        curl -k -H "Host: example.com" https://<ingress-ip>/
                        
                        # Inspect the response headers
                        curl -I -H "Host: example.com" http://<ingress-ip>/
                        
                        # Test a specific path
                        curl -H "Host: example.com" http://<ingress-ip>/api/users

                        Common Troubleshooting Steps

                        # 1. Check whether the Ingress has an Address
                        kubectl get ingress
                        # If the ADDRESS column is empty, the Ingress Controller is not ready yet
                        
                        # 2. Check the Service and its Endpoints
                        kubectl get svc
                        kubectl get endpoints
                        
                        # 3. Check the Ingress Controller Pod
                        kubectl get pods -n ingress-nginx
                        kubectl logs -n ingress-nginx <pod-name>
                        
                        # 4. Check DNS resolution
                        kubectl run -it --rm debug --image=busybox --restart=Never -- nslookup example.com
                        
                        # 5. Check network connectivity
                        kubectl run -it --rm debug --image=curlimages/curl --restart=Never -- \
                          curl -H "Host: example.com" http://web-service.default.svc.cluster.local

                        🎯 Ingress vs Service Type

                        Comparison

                        | Dimension | Ingress | LoadBalancer | NodePort |
                        |---|---|---|---|
                        | Cost | 1 LB total | 1 LB per service | Free |
                        | Host-based routing | ✅ Supported | ❌ Not supported | ❌ Not supported |
                        | Path-based routing | ✅ Supported | ❌ Not supported | ❌ Not supported |
                        | TLS termination | ✅ Supported | ⚠️ Needs extra configuration | ❌ Not supported |
                        | Layer-7 features | ✅ Rich | ❌ Layer 4 only | ❌ Layer 4 only |
                        | Typical use | HTTP/HTTPS services | Services needing a dedicated LB | Dev/testing |

                        💡 Key Takeaways

                        What Ingress Buys You

                        1. Cost savings: many services share a single LoadBalancer
                        2. Smart routing: layer-7 routing by host and path
                        3. TLS management: HTTPS certificates managed in one place
                        4. Advanced features: rate limiting, authentication, rewrites, CORS, and more
                        5. Easier operations: declarative configuration, a single entry point

                        Core Concepts

                        • Ingress Resource: the YAML that declares the routing rules
                        • Ingress Controller: the controller that reads those rules and implements the routing
                        • Load balancer: the component that actually handles the traffic (Nginx/Traefik/HAProxy)

                        Typical Use Cases

                        • ✅ Microservice API gateway
                        • ✅ Multi-tenant applications (isolated by subdomain)
                        • ✅ Blue-green deployments / canary releases
                        • ✅ A unified entry point for web applications
                        • ❌ Non-HTTP protocols (for TCP/UDP, consider the Gateway API)

                        🚀 Advanced Topics

                        1. IngressClass (Multiple Ingress Controllers)

                        Run more than one Ingress Controller in the same cluster:

                        # Define an IngressClass
                        apiVersion: networking.k8s.io/v1
                        kind: IngressClass
                        metadata:
                          name: nginx
                          annotations:
                            ingressclass.kubernetes.io/is-default-class: "true"
                        spec:
                          controller: k8s.io/ingress-nginx
                        
                        ---
                        apiVersion: networking.k8s.io/v1
                        kind: IngressClass
                        metadata:
                          name: traefik
                        spec:
                          controller: traefik.io/ingress-controller
                        
                        ---
                        # Use a specific IngressClass
                        apiVersion: networking.k8s.io/v1
                        kind: Ingress
                        metadata:
                          name: my-ingress
                        spec:
                          ingressClassName: nginx  # 🔑 route this Ingress through the nginx controller
                          rules:
                          - host: example.com
                            http:
                              paths:
                              - path: /
                                pathType: Prefix
                                backend:
                                  service:
                                    name: web-service
                                    port:
                                      number: 80

                        Use cases:

                        • Internal services on Nginx, external services on Traefik
                        • Different teams using different Ingress Controllers
                        • Split by environment (Traefik for dev, Nginx for prod)

                        2. Default Backend

                        Handle requests that match none of the rules:

                        # Create the default backend Service
                        apiVersion: v1
                        kind: Service
                        metadata:
                          name: default-backend
                        spec:
                          selector:
                            app: default-backend
                          ports:
                          - port: 80
                            targetPort: 8080
                        
                        ---
                        apiVersion: apps/v1
                        kind: Deployment
                        metadata:
                          name: default-backend
                        spec:
                          replicas: 1
                          selector:
                            matchLabels:
                              app: default-backend
                          template:
                            metadata:
                              labels:
                                app: default-backend
                            spec:
                              containers:
                              - name: default-backend
                                image: registry.k8s.io/defaultbackend-amd64:1.5
                                ports:
                                - containerPort: 8080
                        
                        ---
                        # Reference the default backend in the Ingress
                        apiVersion: networking.k8s.io/v1
                        kind: Ingress
                        metadata:
                          name: ingress-with-default
                        spec:
                          defaultBackend:
                            service:
                              name: default-backend
                              port:
                                number: 80
                          rules:
                          - host: example.com
                            http:
                              paths:
                              - path: /app
                                pathType: Prefix
                                backend:
                                  service:
                                    name: app-service
                                    port:
                                      number: 80

                        Result:

                        • example.com/app → app-service
                        • example.com/other → default-backend (404 page)
                        • unknown.com → default-backend
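
                        A quick way to confirm this routing from outside the cluster is to vary the Host header, in the same style as the test commands earlier; <ingress-ip> is a placeholder for the controller's external address.

                        # Matches the /app rule → served by app-service
                        curl -H "Host: example.com" http://<ingress-ip>/app
                        
                        # No path rule matches → served by default-backend (expect the 404 page)
                        curl -H "Host: example.com" http://<ingress-ip>/other
                        
                        # Unknown host → also served by default-backend
                        curl -H "Host: unknown.com" http://<ingress-ip>/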

                        3. ExternalName Services with Ingress

                        Route Ingress traffic to a service outside the cluster:

                        # Create an ExternalName Service
                        apiVersion: v1
                        kind: Service
                        metadata:
                          name: external-api
                        spec:
                          type: ExternalName
                          externalName: api.external-service.com  # external domain name
                        
                        ---
                        # Ingress that routes to the external service
                        apiVersion: networking.k8s.io/v1
                        kind: Ingress
                        metadata:
                          name: external-ingress
                          annotations:
                            nginx.ingress.kubernetes.io/backend-protocol: "HTTPS"
                            nginx.ingress.kubernetes.io/upstream-vhost: "api.external-service.com"
                        spec:
                          rules:
                          - host: myapp.com
                            http:
                              paths:
                              - path: /external
                                pathType: Prefix
                                backend:
                                  service:
                                    name: external-api
                                    port:
                                      number: 443

                        Use cases:

                        • Integrating third-party APIs
                        • Hybrid-cloud architectures (some services live outside the cluster)
                        • Gradual migration (moving services from outside into the cluster step by step)

                        4. Cross-Namespace References (via ExternalName)

                        By default an Ingress can only reference Services in its own namespace; crossing namespaces needs a workaround:

                        # Namespace: backend
                        apiVersion: v1
                        kind: Service
                        metadata:
                          name: api-service
                          namespace: backend
                        spec:
                          selector:
                            app: api
                          ports:
                          - port: 8080
                        
                        ---
                        # Namespace: frontend
                        # Create an ExternalName Service that points to the Service in the backend namespace
                        apiVersion: v1
                        kind: Service
                        metadata:
                          name: api-proxy
                          namespace: frontend
                        spec:
                          type: ExternalName
                          externalName: api-service.backend.svc.cluster.local
                          ports:
                          - port: 8080
                        
                        ---
                        # The Ingress lives in the frontend namespace
                        apiVersion: networking.k8s.io/v1
                        kind: Ingress
                        metadata:
                          name: cross-ns-ingress
                          namespace: frontend
                        spec:
                          rules:
                          - host: myapp.com
                            http:
                              paths:
                              - path: /api
                                pathType: Prefix
                                backend:
                                  service:
                                    name: api-proxy  # reference the ExternalName Service in the same namespace
                                    port:
                                      number: 8080

                        5. Exposing TCP/UDP Services

                        Ingress natively supports only HTTP/HTTPS; TCP/UDP needs extra configuration:

                        TCP configuration for the Nginx Ingress Controller

                        # ConfigMap that declares the TCP services
                        apiVersion: v1
                        kind: ConfigMap
                        metadata:
                          name: tcp-services
                          namespace: ingress-nginx
                        data:
                          # Format: "external port": "namespace/service-name:service-port"
                          "3306": "default/mysql:3306"
                          "6379": "default/redis:6379"
                          "27017": "databases/mongodb:27017"
                        
                        ---
                        # Patch the Ingress Controller Service to expose the TCP ports
                        apiVersion: v1
                        kind: Service
                        metadata:
                          name: ingress-nginx-controller
                          namespace: ingress-nginx
                        spec:
                          type: LoadBalancer
                          ports:
                          - name: http
                            port: 80
                            targetPort: 80
                          - name: https
                            port: 443
                            targetPort: 443
                          # Additional TCP ports
                          - name: mysql
                            port: 3306
                            targetPort: 3306
                          - name: redis
                            port: 6379
                            targetPort: 6379
                          - name: mongodb
                            port: 27017
                            targetPort: 27017
                          selector:
                            app.kubernetes.io/name: ingress-nginx
                        
                        ---
                        # Patch the Ingress Controller Deployment to reference the ConfigMap
                        apiVersion: apps/v1
                        kind: Deployment
                        metadata:
                          name: ingress-nginx-controller
                          namespace: ingress-nginx
                        spec:
                          template:
                            spec:
                              containers:
                              - name: controller
                                args:
                                - /nginx-ingress-controller
                                - --tcp-services-configmap=$(POD_NAMESPACE)/tcp-services
                                # ...other arguments

                        How to connect:

                        # Connect to MySQL
                        mysql -h <ingress-lb-ip> -P 3306 -u root -p
                        
                        # Connect to Redis
                        redis-cli -h <ingress-lb-ip> -p 6379

                        6. Canary Release Strategies in Detail

                        Weight-based traffic split

                        # Production version (90% of traffic)
                        apiVersion: networking.k8s.io/v1
                        kind: Ingress
                        metadata:
                          name: production
                        spec:
                          rules:
                          - host: myapp.com
                            http:
                              paths:
                              - path: /
                                pathType: Prefix
                                backend:
                                  service:
                                    name: app-v1
                                    port:
                                      number: 80
                        
                        ---
                        # Canary version (10% of traffic)
                        apiVersion: networking.k8s.io/v1
                        kind: Ingress
                        metadata:
                          name: canary
                          annotations:
                            nginx.ingress.kubernetes.io/canary: "true"
                            nginx.ingress.kubernetes.io/canary-weight: "10"
                        spec:
                          rules:
                          - host: myapp.com
                            http:
                              paths:
                              - path: /
                                pathType: Prefix
                                backend:
                                  service:
                                    name: app-v2
                                    port:
                                      number: 80

                        Header-based canary

                        apiVersion: networking.k8s.io/v1
                        kind: Ingress
                        metadata:
                          name: canary-header
                          annotations:
                            nginx.ingress.kubernetes.io/canary: "true"
                            nginx.ingress.kubernetes.io/canary-by-header: "X-Canary"
                            nginx.ingress.kubernetes.io/canary-by-header-value: "true"
                        spec:
                          rules:
                          - host: myapp.com
                            http:
                              paths:
                              - path: /
                                pathType: Prefix
                                backend:
                                  service:
                                    name: app-v2
                                    port:
                                      number: 80

                        Test:

                        # Regular users hit v1
                        curl http://myapp.com
                        
                        # Users sending the special header hit v2
                        curl -H "X-Canary: true" http://myapp.com

                        Cookie-based canary

                        apiVersion: networking.k8s.io/v1
                        kind: Ingress
                        metadata:
                          name: canary-cookie
                          annotations:
                            nginx.ingress.kubernetes.io/canary: "true"
                            nginx.ingress.kubernetes.io/canary-by-cookie: "canary"
                        spec:
                          rules:
                          - host: myapp.com
                            http:
                              paths:
                              - path: /
                                pathType: Prefix
                                backend:
                                  service:
                                    name: app-v2
                                    port:
                                      number: 80

                        Behavior:

                        • Cookie canary=always → routed to v2
                        • Cookie canary=never → routed to v1
                        • No cookie → routed by weight
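
                        The cookie behavior can be checked by hand in the same curl-based style used above; myapp.com is the example host from the manifests in this section.

                        # Force the canary backend (v2)
                        curl --cookie "canary=always" http://myapp.com
                        
                        # Force the stable backend (v1)
                        curl --cookie "canary=never" http://myapp.com
                        
                        # No cookie → falls back to weight-based routing
                        curl http://myapp.com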

                        7. Performance Tuning

                        Nginx Ingress Controller tuning ConfigMap

                        apiVersion: v1
                        kind: ConfigMap
                        metadata:
                          name: ingress-nginx-controller
                          namespace: ingress-nginx
                        data:
                          # Number of worker processes (usually equal to the CPU core count)
                          worker-processes: "auto"
                          
                          # Connections per worker process
                          max-worker-connections: "65536"
                          
                          # Enable HTTP/2
                          use-http2: "true"
                          
                          # Enable gzip compression
                          use-gzip: "true"
                          gzip-level: "6"
                          gzip-types: "text/plain text/css application/json application/javascript text/xml application/xml"
                          
                          # Client request body buffering
                          client-body-buffer-size: "128k"
                          client-max-body-size: "100m"
                          
                          # Keepalive connections
                          keep-alive: "75"
                          keep-alive-requests: "1000"
                          
                          # Proxy buffering
                          proxy-buffer-size: "16k"
                          proxy-buffers: "4 16k"
                          
                          # Logging (access logs can be disabled entirely in production)
                          disable-access-log: "false"
                          access-log-params: "buffer=16k flush=5s"
                          
                          # SSL tuning
                          ssl-protocols: "TLSv1.2 TLSv1.3"
                          ssl-ciphers: "ECDHE-ECDSA-AES128-GCM-SHA256:ECDHE-RSA-AES128-GCM-SHA256"
                          ssl-prefer-server-ciphers: "true"
                          ssl-session-cache: "true"
                          ssl-session-cache-size: "10m"
                          ssl-session-timeout: "10m"
                          
                          # Upstream connection reuse
                          upstream-keepalive-connections: "100"
                          upstream-keepalive-timeout: "60"
                          
                          # Limit handling
                          limit-req-status-code: "429"
                          limit-conn-status-code: "429"

                        Ingress Controller Pod resources

                        apiVersion: apps/v1
                        kind: Deployment
                        metadata:
                          name: ingress-nginx-controller
                          namespace: ingress-nginx
                        spec:
                          replicas: 3  # 3+ replicas recommended for high availability
                          template:
                            spec:
                              containers:
                              - name: controller
                                image: registry.k8s.io/ingress-nginx/controller:v1.9.0
                                resources:
                                  requests:
                                    cpu: "500m"
                                    memory: "512Mi"
                                  limits:
                                    cpu: "2000m"
                                    memory: "2Gi"
                                # Liveness and readiness probes
                                livenessProbe:
                                  httpGet:
                                    path: /healthz
                                    port: 10254
                                  initialDelaySeconds: 10
                                  periodSeconds: 10
                                readinessProbe:
                                  httpGet:
                                    path: /healthz
                                    port: 10254
                                  periodSeconds: 5

                        8. Monitoring and Observability

                        Prometheus integration

                        # ServiceMonitor for Prometheus Operator
                        apiVersion: monitoring.coreos.com/v1
                        kind: ServiceMonitor
                        metadata:
                          name: ingress-nginx
                          namespace: ingress-nginx
                        spec:
                          selector:
                            matchLabels:
                              app.kubernetes.io/name: ingress-nginx
                          endpoints:
                          - port: metrics
                            interval: 30s

                        Inspect Ingress Controller metrics

                        # Port-forward the metrics endpoint
                        kubectl port-forward -n ingress-nginx svc/ingress-nginx-controller-metrics 10254:10254
                        
                        # Then open it in a browser
                        http://localhost:10254/metrics
                        
                        # Key metrics:
                        # - nginx_ingress_controller_requests: total number of requests
                        # - nginx_ingress_controller_request_duration_seconds: request latency
                        # - nginx_ingress_controller_response_size: response size
                        # - nginx_ingress_controller_ssl_expire_time_seconds: SSL certificate expiry time

                        Grafana dashboards

                        # Import the community Grafana dashboards
                        # Dashboard ID: 9614 (Nginx Ingress Controller)
                        # Dashboard ID: 11875 (Nginx Ingress Controller Request Handling Performance)

                        9. Troubleshooting Checklist

                        Problem 1: the Ingress never gets an Address

                        # Check
                        kubectl get ingress
                        # NAME       CLASS   HOSTS         ADDRESS   PORTS   AGE
                        # my-app     nginx   example.com             80      5m
                        
                        # Possible causes:
                        # 1. The Ingress Controller is not running
                        kubectl get pods -n ingress-nginx
                        
                        # 2. The controller Service type is not LoadBalancer
                        kubectl get svc -n ingress-nginx
                        
                        # 3. The cloud provider has not assigned a LoadBalancer IP yet
                        kubectl describe svc -n ingress-nginx ingress-nginx-controller

                        Problem 2: 502 Bad Gateway

                        # Cause 1: the backend Service does not exist
                        kubectl get svc
                        
                        # Cause 2: the backend Pods are unhealthy
                        kubectl get pods
                        kubectl describe pod <pod-name>
                        
                        # Cause 3: the port configuration is wrong
                        kubectl get svc <service-name> -o yaml | grep -A 5 ports
                        
                        # Cause 4: a NetworkPolicy blocks the traffic
                        kubectl get networkpolicies
                        
                        # Check the controller logs
                        kubectl logs -n ingress-nginx -l app.kubernetes.io/name=ingress-nginx --tail=100

                        Problem 3: 503 Service Unavailable

                        # Cause: no healthy Endpoints
                        kubectl get endpoints <service-name>
                        
                        # If the ENDPOINTS column is empty:
                        # 1. Check that the Service selector matches the Pod labels
                        kubectl get svc <service-name> -o yaml | grep -A 3 selector
                        kubectl get pods --show-labels
                        
                        # 2. Check that the Pods are Ready
                        kubectl get pods
                        
                        # 3. Check that the container port is correct
                        kubectl get pods <pod-name> -o yaml | grep -A 5 ports

                        Problem 4: TLS certificate issues

                        # Check that the Secret exists
                        kubectl get secret <tls-secret-name>
                        
                        # Inspect the certificate
                        kubectl get secret <tls-secret-name> -o jsonpath='{.data.tls\.crt}' | base64 -d | openssl x509 -text -noout
                        
                        # Check the certificate validity dates
                        kubectl get secret <tls-secret-name> -o jsonpath='{.data.tls\.crt}' | base64 -d | openssl x509 -noout -dates
                        
                        # cert-manager issues
                        kubectl get certificate
                        kubectl describe certificate <cert-name>
                        kubectl get certificaterequests

                        Problem 5: routing rules do not take effect

                        # 1. Check the Ingress configuration
                        kubectl describe ingress <ingress-name>
                        
                        # 2. Inspect the generated Nginx configuration
                        kubectl exec -n ingress-nginx <controller-pod> -- cat /etc/nginx/nginx.conf | grep -A 20 "server_name example.com"
                        
                        # 3. Test DNS resolution
                        nslookup example.com
                        
                        # 4. Test with an explicit Host header
                        curl -v -H "Host: example.com" http://<ingress-ip>/path
                        
                        # 5. Check that the annotations are correct
                        kubectl get ingress <ingress-name> -o yaml | grep -A 10 annotations

                        10. Production Best Practices

                        ✅ High availability

                        # 1. Run multiple Ingress Controller replicas
                        spec:
                          replicas: 3
                          
                          # 2. Pod anti-affinity (spread replicas across nodes)
                          affinity:
                            podAntiAffinity:
                              requiredDuringSchedulingIgnoredDuringExecution:
                              - labelSelector:
                                  matchExpressions:
                                  - key: app.kubernetes.io/name
                                    operator: In
                                    values:
                                    - ingress-nginx
                                topologyKey: kubernetes.io/hostname
                        
                          # 3. PodDisruptionBudget (keep at least 2 replicas running)
                        ---
                        apiVersion: policy/v1
                        kind: PodDisruptionBudget
                        metadata:
                          name: ingress-nginx
                          namespace: ingress-nginx
                        spec:
                          minAvailable: 2
                          selector:
                            matchLabels:
                              app.kubernetes.io/name: ingress-nginx

                        ✅ Resource limits

                        resources:
                          requests:
                            cpu: "500m"
                            memory: "512Mi"
                          limits:
                            cpu: "2"
                            memory: "2Gi"
                        
                        # HPA for automatic scaling
                        ---
                        apiVersion: autoscaling/v2
                        kind: HorizontalPodAutoscaler
                        metadata:
                          name: ingress-nginx
                          namespace: ingress-nginx
                        spec:
                          scaleTargetRef:
                            apiVersion: apps/v1
                            kind: Deployment
                            name: ingress-nginx-controller
                          minReplicas: 3
                          maxReplicas: 10
                          metrics:
                          - type: Resource
                            resource:
                              name: cpu
                              target:
                                type: Utilization
                                averageUtilization: 70
                          - type: Resource
                            resource:
                              name: memory
                              target:
                                type: Utilization
                                averageUtilization: 80

                        ✅ Security hardening

                        # 1. Expose only the ports you need
                        # 2. Enforce TLS 1.2+
                        # 3. Set security headers
                        metadata:
                          annotations:
                            nginx.ingress.kubernetes.io/configuration-snippet: |
                              more_set_headers "X-Frame-Options: DENY";
                              more_set_headers "X-Content-Type-Options: nosniff";
                              more_set_headers "X-XSS-Protection: 1; mode=block";
                              more_set_headers "Strict-Transport-Security: max-age=31536000; includeSubDomains";
                        
                        # 4. Enable a WAF (Web Application Firewall)
                        nginx.ingress.kubernetes.io/enable-modsecurity: "true"
                        nginx.ingress.kubernetes.io/enable-owasp-core-rules: "true"
                        
                        # 5. Rate limiting
                        nginx.ingress.kubernetes.io/limit-rps: "100"
                        nginx.ingress.kubernetes.io/limit-burst-multiplier: "5"

                        ✅ Monitoring and alerting

                        # Example Prometheus alerting rules
                        groups:
                        - name: ingress
                          rules:
                          - alert: IngressControllerDown
                            expr: up{job="ingress-nginx-controller-metrics"} == 0
                            for: 5m
                            annotations:
                              summary: "Ingress Controller is down"
                          
                          - alert: HighErrorRate
                            expr: rate(nginx_ingress_controller_requests{status=~"5.."}[5m]) > 0.05
                            for: 5m
                            annotations:
                              summary: "High 5xx error rate"
                          
                          - alert: HighLatency
                            expr: histogram_quantile(0.95, nginx_ingress_controller_request_duration_seconds_bucket) > 1
                            for: 10m
                            annotations:
                              summary: "High request latency (p95 > 1s)"

                        📚 Summary: Ingress vs the Alternatives

                        Ingress vs LoadBalancer Service

                        Scenario: deploying 10 microservices
                        
                        Option A: one LoadBalancer per service
                        - Cost: 10 LoadBalancers × $20/month = $200/month
                        - Management: 10 separate IP addresses
                        - Routing: no smart routing
                        - TLS: configured per service
                        
                        Option B: a single Ingress
                        - Cost: 1 LoadBalancer × $20/month = $20/month ✅
                        - Management: 1 IP address ✅
                        - Routing: smart routing by host/path ✅
                        - TLS: certificates managed centrally ✅

                        Ingress vs API Gateway

                        | Capability | Ingress | API Gateway (Kong/Tyk) |
                        |---|---|---|
                        | Basic routing | ✅ | ✅ |
                        | Authentication / authorization | ⚠️ Basic | ✅ Full-featured |
                        | Rate limiting / circuit breaking | ⚠️ Basic | ✅ Advanced |
                        | Plugin ecosystem | ❌ Limited | ✅ Rich |
                        | Learning curve | ✅ Simple | ⚠️ Steep |
                        | Performance | ✅ High | ⚠️ Moderate |

                        🎓 Suggested Learning Path

                        1. Getting started (1-2 weeks)

                          • Understand the Ingress concept
                          • Deploy the Nginx Ingress Controller
                          • Create basic Ingress rules
                          • Configure HTTP/HTTPS access
                        2. Intermediate (2-4 weeks)

                          • Master the different routing strategies
                          • TLS certificate management (cert-manager)
                          • Canary releases
                          • Performance tuning
                        3. Advanced (1-2 months)

                          • Managing multiple Ingress Controllers
                          • WAF and security hardening
                          • Monitoring and alerting
                          • Troubleshooting
                        4. Expert (ongoing)

                          • Reading the source code
                          • Developing custom plugins
                          • Migrating to the Gateway API


                        Mar 7, 2024

                        Nginx Performance Tuning

                        A walkthrough of Nginx tuning across several dimensions: general optimizations, the operating-system layer, the Nginx configuration layer, and the architecture layer.


                        I. Operating System and Hardware Optimization

                        This is the foundation: give Nginx a high-performance environment to run in.

                        1. Raise the file descriptor limit. Every Nginx connection (especially when serving static files) consumes a file descriptor; under high concurrency the default limit quickly becomes a bottleneck.

                          # Takes effect for the current shell only
                          ulimit -n 65536
                          
                          # Permanent: edit /etc/security/limits.conf
                          * soft nofile 65536
                          * hard nofile 65536
                          
                          # Also make sure nginx.conf sets a matching worker_rlimit_nofile
                          worker_rlimit_nofile 65536;
                        2. Tune the network stack (a consolidated, persistent sysctl sketch follows this list)

                          • Raise net.core.somaxconn: the maximum length of the accept queue waiting for Nginx. If you see accept() queue overflow errors, increase it.
                            sysctl -w net.core.somaxconn=65535
                            and explicitly set the backlog parameter on the Nginx listen directive:
                            listen 80 backlog=65535;
                          • Enable TCP Fast Open: saves latency on the TCP three-way handshake.
                            sysctl -w net.ipv4.tcp_fastopen=3
                          • Widen the ephemeral port range: when Nginx acts as a reverse proxy it needs many local ports to connect to upstream servers.
                            sysctl -w net.ipv4.ip_local_port_range="1024 65535"
                          • Reduce TCP TIME_WAIT pressure: with many short-lived connections, sockets stuck in TIME_WAIT can exhaust the port range.
                            # Allow TIME_WAIT sockets to be reused for new outbound connections
                            sysctl -w net.ipv4.tcp_tw_reuse=1
                            # Leave tcp_tw_recycle off: it breaks clients behind NAT (and was removed in Linux 4.12+)
                            sysctl -w net.ipv4.tcp_tw_recycle=0
                            # Shorten the FIN_WAIT_2 timeout
                            sysctl -w net.ipv4.tcp_fin_timeout=30
                        3. Use fast disks. For static content, SSDs dramatically improve I/O performance.
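
                        To make the kernel settings above survive a reboot, they can be collected in a drop-in file under /etc/sysctl.d/; a minimal sketch (the file name 99-nginx-tuning.conf is just an example):

                          # /etc/sysctl.d/99-nginx-tuning.conf -- persistent versions of the sysctl -w commands above
                          net.core.somaxconn = 65535
                          net.ipv4.tcp_fastopen = 3
                          net.ipv4.ip_local_port_range = 1024 65535
                          net.ipv4.tcp_tw_reuse = 1
                          net.ipv4.tcp_fin_timeout = 30
                          
                          # Apply without rebooting:
                          # sysctl --system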


                        II. Nginx Configuration Optimization

                        This is the core of the work: it directly determines how Nginx behaves.

                        1. Worker processes and connections

                          • worker_processes auto;: let Nginx size the worker pool from the CPU core count, usually one worker per core.
                          • worker_connections: the maximum number of connections each worker can handle. Together with worker_rlimit_nofile it determines total concurrency.
                            events {
                                worker_connections 10240; # e.g. 10240
                                use epoll; # use the high-performance epoll event model on Linux
                            }
                        2. Efficient static file serving

                          • Enable sendfile: transfers file data entirely inside the kernel, bypassing user space; very efficient.
                            sendfile on;
                          • Enable tcp_nopush: used with sendfile on, it waits until packets are full before sending, improving network efficiency.
                            tcp_nopush on;
                          • Enable tcp_nodelay: for keepalive connections, send data immediately to reduce latency. Usually enabled together with tcp_nopush.
                            tcp_nodelay on;
                        3. Connection and request timeouts. Sensible timeouts free idle resources and stop connections from being held open indefinitely.

                          # How long to keep client connections alive
                          keepalive_timeout 30s;
                          # Timeouts for connections to upstream servers
                          proxy_connect_timeout 5s;
                          proxy_send_timeout 60s;
                          proxy_read_timeout 60s;
                          # Timeout for reading the client request headers
                          client_header_timeout 15s;
                          # Timeout for reading the client request body
                          client_body_timeout 15s;
                        4. Buffering and caching

                          • Buffer sizing: give client request headers and bodies adequately sized buffers so Nginx does not spill to temporary files and incur extra I/O.
                            client_header_buffer_size 1k;
                            large_client_header_buffers 4 4k;
                            client_body_buffer_size 128k;
                          • Proxy buffers: when Nginx acts as a reverse proxy, these control buffering of data received from upstreams.
                            proxy_buffering on;
                            proxy_buffer_size 4k;
                            proxy_buffers 8 4k;
                          • Enable caching
                            • Static asset caching: use the expires and add_header directives to give static assets long browser cache lifetimes.
                              location ~* \.(jpg|jpeg|png|gif|ico|css|js)$ {
                                  expires 1y;
                                  add_header Cache-Control "public, immutable";
                              }
                            • Reverse proxy caching: use the proxy_cache module to cache dynamic responses from upstreams and take load off the backend.
                              proxy_cache_path /path/to/cache levels=1:2 keys_zone=my_cache:10m max_size=10g inactive=60m;
                              location / {
                                  proxy_cache my_cache;
                                  proxy_cache_valid 200 302 10m;
                                  proxy_cache_use_stale error timeout updating http_500 http_502 http_503 http_504;
                              }
                        5. Logging

                          • Disable access logs: for extremely high concurrency where access logs are not needed, turn access_log off.
                          • Buffer log writes: the buffer parameter lets Nginx accumulate log lines in memory and flush them to disk in batches.
                            access_log /var/log/nginx/access.log main buffer=64k flush=1m;
                          • Log only what matters: trim the log format down to the essential fields (see the log_format sketch at the end of this list).
                        6. Gzip compression. Compress text responses to reduce the amount of data on the wire.

                          gzip on;
                          gzip_vary on;
                          gzip_min_length 1024; # do not compress responses smaller than this
                          gzip_types text/plain text/css application/json application/javascript text/xml application/xml application/xml+rss text/javascript;
                        7. Upstream keepalive. When proxying to backends, keep a pool of idle connections open to avoid the overhead of constantly opening and closing TCP connections.

                          upstream backend_servers {
                              server 10.0.1.100:8080;
                              keepalive 32; # number of idle connections to keep open
                          }
                          
                          location / {
                              proxy_pass http://backend_servers;
                              proxy_http_version 1.1;
                              proxy_set_header Connection "";
                          }
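
                        As mentioned under logging above, a trimmed-down log format keeps only the fields you actually analyze; a minimal sketch (the format name slim is just an example, not from the original text):

                          # Keep only the fields commonly needed for traffic analysis
                          log_format slim '$remote_addr $status $request_time "$request"';
                          access_log /var/log/nginx/access.log slim buffer=64k flush=1m;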

                        III. Architecture and Deployment

                        1. Load balancing. Use the upstream module to spread traffic across multiple backend servers for horizontal scaling and high availability.

                          upstream app_cluster {
                              least_conn; # use the least-connections algorithm
                              server 10.0.1.101:8080;
                              server 10.0.1.102:8080;
                              server 10.0.1.103:8080;
                          }
                        2. Static/dynamic separation. Split requests for static assets (images, CSS, JS) from dynamic requests: let Nginx serve the static assets directly and proxy the dynamic requests to the application servers (Tomcat, Node.js, etc.). A configuration sketch follows this list.

                        3. Enable HTTP/2. HTTP/2 brings multiplexing, header compression and other features that noticeably speed up page loads.

                          listen 443 ssl http2;
                        4. Use third-party modules. Compile in third-party modules as needed, for example:

                          • OpenResty: built on Nginx and LuaJIT, adds powerful programmability.
                          • ngx_brotli: Brotli compression, which usually achieves better ratios than Gzip.
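
                        As referenced in item 2, a minimal static/dynamic split might look like the following; the paths, upstream name and backend address are placeholders for illustration, not values from the original text:

                          upstream app_backend {
                              server 127.0.0.1:8080;   # hypothetical application server
                              keepalive 32;
                          }
                          
                          server {
                              listen 80;
                          
                              # Static assets served directly by Nginx, with long browser caching
                              location /static/ {
                                  root /var/www/site;
                                  expires 30d;
                                  access_log off;
                              }
                          
                              # Everything else is proxied to the application backend
                              location / {
                                  proxy_pass http://app_backend;
                                  proxy_http_version 1.1;
                                  proxy_set_header Connection "";
                                  proxy_set_header Host $host;
                              }
                          }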

                        IV. Monitoring and Debugging

                        Tuning is not a one-off task; it needs continuous monitoring.

                        1. Enable the status module. Use stub_status_module to expose basic Nginx status information.

                          location /nginx_status {
                              stub_status;
                              allow 127.0.0.1; # allow access from localhost only
                              deny all;
                          }

                          The page shows active connections, total requests handled, and similar counters.

                        2. Analyze the logs. Tools such as goaccess or awstats help you understand traffic patterns and spot bottlenecks in the access logs.

                        3. Profile when needed. In extreme cases, use the debug log or system tools such as perf and strace for deep performance analysis.

                        Summary and Recommendations

                        1. Change things gradually: do not modify every parameter at once. Change one or two settings, run a load test (with wrk, ab or jmeter; see the example after this list), and observe the effect.
                        2. Monitor first: have reliable monitoring data before, during and after the optimization.
                        3. Understand the workload: the right strategy depends heavily on the traffic profile. Many concurrent connections? Large file downloads? Lots of short dynamic requests?
                        4. Be careful with kernel parameters: always validate kernel tuning in a test environment before touching production.
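
                        For reference, a typical wrk invocation looks like this; the thread/connection counts and URL are illustrative, not values from the original text:

                          # 8 threads, 512 open connections, run for 60 seconds against the tuned endpoint
                          wrk -t8 -c512 -d60s --latency http://your-server.example.com/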

                        Applied together, these techniques significantly improve Nginx's performance and stability, making it capable of handling concurrency on the order of a million connections.

                        Oct 7, 2024

                        Traefik VS Nginx

                        Traefik and Nginx Ingress are both top-tier Ingress Controllers in the Kubernetes ecosystem, but they differ noticeably in design philosophy, user experience and focus.

                        In short:

                        • Traefik is more like a dynamic, automation-first API gateway built for cloud-native and microservice environments.
                        • Nginx Ingress is more like a powerful, stable, highly configurable reverse proxy / load balancer built on the battle-tested Nginx.

                        Below is a detailed look at Traefik's main advantages over Nginx Ingress.

                        Traefik's Core Strengths

                        1. Truly dynamic configuration and automation

                        This is Traefik's core selling point.

                        • How it works: Traefik watches the Kubernetes API server and reacts to changes in Services, IngressRoutes, Secrets and other resources in real time. When you create or modify an Ingress resource, Traefik updates its routing configuration within seconds, with no restart or reload.
                        • How Nginx Ingress compares: the nginx-ingress-controller component watches for changes, renders a new nginx.conf, and then sends a reload signal to the Nginx process. The process is fast, but it is fundamentally a "render and reload" model; under very heavy traffic or with complex configuration, reloads can introduce small performance jitters or delays.

                        Bottom line: in a cloud-native environment that values full automation and zero reloads, Traefik's dynamic model is more attractive.

                        2. A simpler configuration model and the "IngressRoute" CRD

                        Traefik fully supports the standard Kubernetes Ingress resource, but it encourages its own Custom Resource Definition (CRD), IngressRoute.

                        • Why that is better: the standard Ingress resource is relatively limited; many advanced features (retries, rate limiting, circuit breaking, request mirroring, etc.) have to be wired up through verbose annotations, which hurts readability and maintainability.
                        • Traefik's IngressRoute: a declarative, structured YAML/JSON configuration where everything (TLS, middlewares, routing rules) lives in one clearly structured CRD. This is closer to Kubernetes-native philosophy and easier to version-control and review.

                        Example comparison: path rewriting with Nginx Ingress annotations:

                        apiVersion: networking.k8s.io/v1
                        kind: Ingress
                        metadata:
                          name: my-ingress
                          annotations:
                            nginx.ingress.kubernetes.io/rewrite-target: /

                        The same with Traefik's IngressRoute and a middleware:

                        apiVersion: traefik.containo.us/v1alpha1
                        kind: IngressRoute
                        metadata:
                          name: my-ingressroute
                        spec:
                          routes:
                          - match: PathPrefix(`/api`)
                            kind: Rule
                            services:
                            - name: my-service
                              port: 80
                            middlewares:
                            - name: strip-prefix # reference a separate, reusable middleware resource
                        ---
                        apiVersion: traefik.containo.us/v1alpha1
                        kind: Middleware
                        metadata:
                          name: strip-prefix
                        spec:
                          stripPrefix:
                            prefixes:
                              - /api

                        As you can see, the Traefik configuration is more modular and easier to read.

                        3. A built-in, full-featured dashboard

                        Traefik ships with an intuitive web UI. Once enabled, it shows all routers, services and middlewares in real time, along with their health and how they relate to each other.

                        • This is a huge help for development and debugging: you can see at a glance how traffic is being routed without digging through configuration files or command-line output.
                        • Nginx Ingress has no official graphical dashboard. You can monitor it with third-party tooling (e.g. Prometheus + Grafana) or query state with kubectl, but it is far less convenient than Traefik's native dashboard.
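
                        For reference, the dashboard is usually switched on through Traefik's static configuration; a minimal sketch using container arguments (a test-cluster setup only, since --api.insecure exposes the dashboard without authentication):

                        # Excerpt from a Traefik Deployment (or the equivalent Helm values)
                        containers:
                        - name: traefik
                          image: traefik:v2.10
                          args:
                          - --providers.kubernetescrd          # watch IngressRoute/Middleware CRDs
                          - --entrypoints.web.address=:80
                          - --api.dashboard=true               # enable the dashboard
                          - --api.insecure=true                # serve it on port 8080 without auth (dev/test only)
                          ports:
                          - containerPort: 8080                # dashboard / API port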

                        4. Native support for multiple providers

                        Traefik is designed to be multi-provider. Besides Kubernetes, it can read configuration from Docker, Consul, Etcd, Rancher or even a plain static file, all at the same time. If your stack is mixed (for example, some services on K8s and some on Docker Compose), Traefik can act as a single unified entry point and simplify the architecture.

                        Nginx Ingress can be extended in other ways, but its core is built for Kubernetes.

                        5. The power and flexibility of the middleware model

                        Traefik's "middleware" concept is very powerful. You define features (authentication, rate limiting, header manipulation, redirects, circuit breaking, and so on) as independent, reusable components, then attach any combination of them to any routing rule by reference.

                        This makes configuration highly reusable and flexible, which is ideal for building complex traffic policies. A small sketch of middleware chaining follows.
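
                        As a rough illustration of that composition, and assuming the strip-prefix middleware from the earlier example, a hypothetical rate-limit middleware can be chained onto the same route (the name api-ratelimit and the limits are made up for this sketch):

                        apiVersion: traefik.containo.us/v1alpha1
                        kind: Middleware
                        metadata:
                          name: api-ratelimit
                        spec:
                          rateLimit:
                            average: 100   # requests per second on average
                            burst: 50      # short bursts allowed above the average
                        ---
                        apiVersion: traefik.containo.us/v1alpha1
                        kind: IngressRoute
                        metadata:
                          name: my-ingressroute
                        spec:
                          routes:
                          - match: PathPrefix(`/api`)
                            kind: Rule
                            services:
                            - name: my-service
                              port: 80
                            middlewares:            # middlewares are applied in order
                            - name: strip-prefix
                            - name: api-ratelimit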

                        Where Nginx Ingress Wins (for balance)

                        To make a fair choice, it helps to know Nginx Ingress's strengths as well:

                        1. Raw performance and stability: built on Nginx, the world's most battle-tested web server, with decades of production hardening; excellent at high-concurrency static content and long-lived connections.
                        2. An enormous feature set: Nginx itself is extremely capable, and the Nginx Ingress Controller exposes much of it through a large set of annotations; in some areas its ceiling is higher than Traefik's.
                        3. A huge community and ecosystem: with Nginx's massive user base, almost any problem you hit has a documented solution somewhere.
                        4. Fine-grained control: Nginx experts can inject custom configuration snippets through ConfigMaps and achieve almost anything, with very tight control.
                        5. Licensing: the ingress-nginx controller is Apache 2.0 licensed (Nginx itself uses a permissive BSD-style license), whereas some Traefik Labs offerings use more restrictive source-available terms, which can raise compliance questions at larger companies. Nginx Ingress has no such concern.

                        Summary and Selection Guide

                        | Aspect | Traefik | Nginx Ingress |
                        |---|---|---|
                        | Configuration model | Dynamic and automatic, no reloads | "Render and reload" model |
                        | Configuration syntax | Declarative CRDs, clearly structured | Mostly annotations, more verbose |
                        | Dashboard | Built in, powerful, works out of the box | No official UI; needs third-party integration |
                        | Design philosophy | Cloud-native first, microservice friendly | Features and performance first, rock solid |
                        | Learning curve | Low; easy to adopt and operate | Medium; requires Nginx knowledge |
                        | Performance | Excellent; enough for the vast majority of workloads | Top tier, especially for static content and very high concurrency |
                        | Extensibility | Middleware-based, highly modular | Lua scripts or custom templates; very high ceiling |
                        | License | Source available in parts (potentially restrictive) | Apache 2.0 (fully open source) |

                        Which should you choose?

                        • Choose Traefik if:

                          • You want the smoothest cloud-native experience, with simple, automated configuration.
                          • Your team prefers Kubernetes-native declarative configuration.
                          • You value the built-in dashboard for day-to-day operations and debugging.
                          • Your architecture is dynamic, with services released and changed frequently.
                          • You do not need to squeeze out every last bit of performance and care more about developer efficiency and operational simplicity.
                        • Choose Nginx Ingress if:

                          • You have extreme performance and stability requirements (e.g. a very large gateway or CDN edge nodes).
                          • You need complex or niche Nginx features and very fine-grained control.
                          • Your team already knows Nginx deeply.
                          • You have strict licensing requirements and need a permissive license such as Apache 2.0.
                          • Your environment is relatively stable and routing rules rarely change.

                        In short, Traefik wins on experience and automation and is an ideal companion for modern microservice and cloud-native environments, while Nginx Ingress wins on performance and feature depth as a thoroughly battle-tested, reliable engine.

                        Mar 7, 2024

                        RPC

                          Mar 7, 2025

                          Subsections of Storage

                          User Based Policy

                          User Based Policy

                          You can change <$bucket> to control the permission scope.

                          App:
                          • ${aws:username} is a built-in variable that resolves to the name of the logged-in user.
                          {
                              "Version": "2012-10-17",
                              "Statement": [
                                  {
                                      "Sid": "AllowUserToSeeBucketListInTheConsole",
                                      "Action": [
                                          "s3:ListAllMyBuckets",
                                          "s3:GetBucketLocation"
                                      ],
                                      "Effect": "Allow",
                                      "Resource": [
                                          "arn:aws:s3:::*"
                                      ]
                                  },
                                  {
                                      "Sid": "AllowRootAndHomeListingOfCompanyBucket",
                                      "Action": [
                                          "s3:ListBucket"
                                      ],
                                      "Effect": "Allow",
                                      "Resource": [
                                          "arn:aws:s3:::<$bucket>"
                                      ],
                                      "Condition": {
                                          "StringEquals": {
                                              "s3:prefix": [
                                                  "",
                                                  "<$path>/",
                                                  "<$path>/${aws:username}"
                                              ],
                                              "s3:delimiter": [
                                                  "/"
                                              ]
                                          }
                                      }
                                  },
                                  {
                                      "Sid": "AllowListingOfUserFolder",
                                      "Action": [
                                          "s3:ListBucket"
                                      ],
                                      "Effect": "Allow",
                                      "Resource": [
                                          "arn:aws:s3:::<$bucket>"
                                      ],
                                      "Condition": {
                                          "StringLike": {
                                              "s3:prefix": [
                                                  "<$path>/${aws:username}/*"
                                              ]
                                          }
                                      }
                                  },
                                  {
                                      "Sid": "AllowAllS3ActionsInUserFolder",
                                      "Effect": "Allow",
                                      "Action": [
                                          "s3:*"
                                      ],
                                      "Resource": [
                                          "arn:aws:s3:::<$bucket>/<$path>/${aws:username}/*"
                                      ]
                                  }
                              ]
                          }
                          • <$uid> is the Aliyun account UID
                          {
                              "Version": "1",
                              "Statement": [{
                                  "Effect": "Allow",
                                  "Action": [
                                      "oss:*"
                                  ],
                                  "Principal": [
                                      "<$uid>"
                                  ],
                                  "Resource": [
                                      "acs:oss:*:<$oss_id>:<$bucket>/<$path>/*"
                                  ]
                              }, {
                                  "Effect": "Allow",
                                  "Action": [
                                      "oss:ListObjects",
                                      "oss:GetObject"
                                  ],
                                  "Principal": [
                                       "<$uid>"
                                  ],
                                  "Resource": [
                                      "acs:oss:*:<$oss_id>:<$bucket>"
                                  ],
                                  "Condition": {
                                      "StringLike": {
                                      "oss:Prefix": [
                                              "<$path>/*"
                                          ]
                                      }
                                  }
                              }]
                          }
                          Example:
                          {
                          	"Version": "1",
                          	"Statement": [{
                          		"Effect": "Allow",
                          		"Action": [
                          			"oss:*"
                          		],
                          		"Principal": [
                          			"203415213249511533"
                          		],
                          		"Resource": [
                          			"acs:oss:*:1007296819402486:conti-csst/test/*"
                          		]
                          	}, {
                          		"Effect": "Allow",
                          		"Action": [
                          			"oss:ListObjects",
                          			"oss:GetObject"
                          		],
                          		"Principal": [
                          			"203415213249511533"
                          		],
                          		"Resource": [
                          			"acs:oss:*:1007296819402486:conti-csst"
                          		],
                          		"Condition": {
                          			"StringLike": {
                          				"oss:Prefix": [
                          					"test/*"
                          				]
                          			}
                          		}
                          	}]
                          }
                          Mar 14, 2024

                          Mirrors

                          Gradle Tencent Mirror

                          https://mirrors.cloud.tencent.com/gradle/gradle-8.0-bin.zip
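
                          If you use the Gradle wrapper, one way to consume this mirror (an assumption about your setup, not something stated above) is to point distributionUrl in gradle/wrapper/gradle-wrapper.properties at it:

                          # gradle/wrapper/gradle-wrapper.properties
                          distributionBase=GRADLE_USER_HOME
                          distributionPath=wrapper/dists
                          distributionUrl=https\://mirrors.cloud.tencent.com/gradle/gradle-8.0-bin.zip
                          zipStoreBase=GRADLE_USER_HOME
                          zipStorePath=wrapper/dists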

                          PIP Tuna Mirror

                          pip install -i https://pypi.tuna.tsinghua.edu.cn/simple some-package
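
                          To make the Tuna index the default instead of passing -i every time, pip's own config command can store it globally (a convenience sketch, not required by the note above):

                          pip config set global.index-url https://pypi.tuna.tsinghua.edu.cn/simple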

                          Maven Mirror

                          <mirror>
                              <id>aliyunmaven</id>
                              <mirrorOf>*</mirrorOf>
                              <name>Aliyun Public Repository</name>
                              <url>https://maven.aliyun.com/repository/public</url>
                          </mirror>
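
                          This <mirror> entry belongs inside the <mirrors> block of your Maven settings file (usually ~/.m2/settings.xml); a minimal skeleton for context:

                          <settings xmlns="http://maven.apache.org/SETTINGS/1.0.0">
                            <mirrors>
                              <mirror>
                                <id>aliyunmaven</id>
                                <mirrorOf>*</mirrorOf>
                                <name>Aliyun Public Repository</name>
                                <url>https://maven.aliyun.com/repository/public</url>
                              </mirror>
                            </mirrors>
                          </settings>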