Aaron's Dev Path

About Me

Dev Path

gitGraph:
  commit id:"Graduate From High School" tag:"Linfen, China"
  commit id:"Got Driver's License" tag:"2013.08"
  branch TYUT
  commit id:"Enrolled at TYUT 🥰"  tag:"Taiyuan, China"
  commit id:"Develop Game App" tag:"“Hello Hell”" type: HIGHLIGHT
  commit id:"Plan:3+1" tag:"2016.09"
  branch Briup.Ltd
  commit id:"First Internship" tag:"Suzhou, China"
  commit id:"CRUD boy" 
  commit id:"Resigned" tag:"2017.01" type:REVERSE
  checkout TYUT
  merge Briup.Ltd id:"Final Presentation" tag:"2017.04"
  checkout Briup.Ltd
  branch Enjoyor.PLC
  commit id:"Second Internship" tag:"Hangzhou,China"
  checkout TYUT
  merge Enjoyor.PLC id:"Got SE Bachelor Degree " tag:"2017.07"
  checkout Enjoyor.PLC
  commit id:"First Full Time Job" tag:"2017.07"
  commit id:"Resigned" tag:"2018.04"
  checkout main
  merge Enjoyor.PLC id:"Plan To Study Abroad"
  commit id:"Get Some Rest" tag:"2018.06"
  branch TOEFL-GRE
  commit id:"Learning At Huahua.Ltd" tag:"Beijing,China"
  commit id:"Got USC Admission" tag:"2018.11" type: HIGHLIGHT
  checkout main
  merge TOEFL-GRE id:"Prepare To Leave" tag:"2018.12"
  branch USC
  commit id:"Pass Pre-School" tag:"Los Angeles,USA"
  checkout main
  merge USC id:"Back Home,Summer Break" tag:"2019.06"
  commit id:"Back to School" tag:"2019.07"
  checkout USC
  merge main id:"Got Straight As"
  commit id:"Learning ML, DL, GPT"
  checkout main
  merge USC id:"Back,Due to COVID-19" tag:"2021.02"
  checkout USC
  commit id:"Got DS Master Degree" tag:"2021.05"
  checkout main
  commit id:"Got An offer" tag:"2021.06"
  branch Zhejianglab
  commit id:"Second Full-Time Job" tag:"Hangzhou,China"
  commit id:"Got Promotion" tag:"2024.01"
  commit id:"For Now"
Mar 7, 2024

Subsections of Aaron's Dev Path

🐙Argo (CI/CD)

Content

Shortcuts

ArgoCD

  • decode password
kubectl -n argocd get secret argocd-initial-admin-secret -o jsonpath="{.data.password}" | base64 -d
  • relogin
ARGOCD_PASS=$(kubectl -n argocd get secret argocd-initial-admin-secret -o jsonpath="{.data.password}" | base64 -d)
MASTER_IP=$(kubectl get nodes --selector=node-role.kubernetes.io/control-plane -o jsonpath='{$.items[0].status.addresses[?(@.type=="InternalIP")].address}')
argocd login --insecure --username admin $MASTER_IP:30443 --password $ARGOCD_PASS
  • force delete
argocd app terminate-op <$>

Argo Workflow

Argo Rollouts

Mar 7, 2024

Subsections of 🐙Argo (CI/CD)

Subsections of Argo CD

Subsections of App Template

Deploy From Git Repo

Option 1: Add an SSH key to ArgoCD (recommended for private repos)

1. Generate an SSH key (if you don't have one yet)

ssh-keygen -t ed25519 -C "argocd@your-cluster" -f ~/.ssh/argocd_key -N ""

2. Add the public key to GitHub

Copy the public key:

cat ~/.ssh/argocd_key.pub

Then add it under the repository's Settings -> Deploy keys on GitHub

3. Add the private key to ArgoCD

argocd repo add git@github.com:AaronYang0628/Euclid-Image-Cutout-Service.git \
  --ssh-private-key-path ~/.ssh/argocd_key

Option 2: Use HTTPS + Token (simpler)

1. Generate a Personal Access Token on GitHub

Settings -> Developer settings -> Personal access tokens -> Generate new token (the 'repo' scope is required)

2. Add the repository using its HTTPS URL

argocd repo add https://github.com/AaronYang0628/Euclid-Image-Cutout-Service.git \
  --username AaronYang0628 \
  --password <your-github-token>

Option 3: Public repository (if possible)

If the repository can be made public, use the HTTPS URL directly; no authentication is needed:

argocd repo add https://github.com/AaronYang0628/Euclid-Image-Cutout-Service.git

Verify the connection

After adding the repository, verify the connection:

argocd repo list

Create the Application

Once the repository connection succeeds, create the ArgoCD Application:

argocd app create euclid-cutout \
  --repo https://github.com/AaronYang0628/Euclid-Image-Cutout-Service.git \
  --path k8s \
  --dest-server https://kubernetes.default.svc \
  --dest-namespace default
Sync

When your k8s resource files are located in a `manifests` folder, you can use the following configuration to deploy your app;
you only need to set `spec.source.path: manifests`.

  • sample-repo
    • content
    • src
    • manifests
      • deploy.yaml
      • svc.yaml
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: hugo-blog
spec:
  project: default
  source:
    repoURL: 'git@github.com:<$github_username>/sample-repo.git'
    targetRevision: main
    path: manifests
  syncPolicy:
    automated:
      prune: true
      selfHeal: true
    syncOptions:
      - CreateNamespace=true
      - ApplyOutOfSyncOnly=true
  destination:
    server: https://kubernetes.default.svc
    namespace: application

Not only do you need the files in the `manifests` folder, but also files in the repository root.

Use this when the application to deploy needs not just k8s resources but also some data or files from the source code.

You have to create an extra file `kustomization.yaml` and set `spec.source.path: .`

  • sample-repo
    • kustomization.yaml
    • content
    • src
    • manifests
      • deploy.yaml
      • svc.yaml
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: hugo-blog
spec:
  project: default
  source:
    repoURL: 'git@github.com:<$github_username>/sample-repo.git'
    targetRevision: main
    path: .
  syncPolicy:
    automated:
      prune: true
      selfHeal: true
    syncOptions:
      - CreateNamespace=true
      - ApplyOutOfSyncOnly=true
  destination:
    server: https://kubernetes.default.svc
    namespace: application
# kustomization.yaml
resources:
  - manifests/pvc.yaml
  - manifests/job.yaml
  - manifests/deployment.yaml
  - ...
Oct 22, 2025

Deploy N Clusters

With its declarative GitOps approach, ArgoCD handles application releases across multiple Kubernetes clusters very elegantly. It lets you manage deployments to many clusters from one central Git repository, keeping state consistent and enabling fast rollbacks.

The diagram below summarizes a typical multi-cluster release workflow with ArgoCD, to give you an overall picture first:

flowchart TD
    A[Git repo] --> B{ArgoCD Server}
    
    B --> C[ApplicationSet<br>Cluster generator]
    B --> D[ApplicationSet<br>Git generator]
    B --> E[Manually created<br>Application resources]
    
    C --> F[Cluster A<br>App1 & App2]
    C --> G[Cluster B<br>App1 & App2]
    
    D --> H[Cluster A<br>App1]
    D --> I[Cluster A<br>App2]
    
    E --> J[Specific cluster<br>specific app]

🔗 Connecting Clusters to ArgoCD

To let ArgoCD manage external clusters, first add the target clusters' access credentials.

  1. Get the target cluster's credentials: make sure you have the target cluster's kubeconfig file.
  2. Add the cluster to ArgoCD using the ArgoCD CLI. This creates a Secret holding the cluster's credentials in the ArgoCD namespace.
    argocd cluster add <context-name> --name <cluster-name> --kubeconfig ~/.kube/config
    • <context-name> is a context name from your kubeconfig.
    • <cluster-name> is the alias you give this cluster inside ArgoCD.
  3. Verify the connection: afterwards, check the cluster list on the ArgoCD UI under "Settings" > "Clusters", or via the CLI:
    argocd cluster list

💡 Choosing a Multi-Cluster Deployment Strategy

After connecting the clusters, the core task is defining deployment rules. ArgoCD describes deployments mainly through the `Application` and `ApplicationSet` resources.

  • Application resource: defines one application's deployment to one specific cluster. When managing many clusters and applications, creating each Application by hand becomes tedious.
  • ApplicationSet resource: the recommended way to do multi-cluster deployment. Driven by generators, it automatically creates Application resources for multiple clusters or applications.

The flowchart above shows the two main ApplicationSet generators plus the manual Application approach.

Comparison of common ApplicationSet generators

| Generator | How it works | Best for |
| --- | --- | --- |
| List Generator | Statically lists clusters and their URLs in YAML. | A fixed set of clusters that rarely changes. |
| Cluster Generator | Dynamically uses the clusters already registered in ArgoCD. | Dynamic fleets where new clusters should be picked up automatically. |
| Git Generator | Generates one application per directory in a Git repository. | Many microservices, each in its own directory. |
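As a sketch of the List Generator row above (the cluster names and URLs are placeholders):

```yaml
apiVersion: argoproj.io/v1alpha1
kind: ApplicationSet
metadata:
  name: my-app-listed-clusters
spec:
  generators:
  - list:
      elements:               # statically enumerated clusters
      - cluster: cluster-a
        url: https://1.2.3.4:6443
      - cluster: cluster-b
        url: https://5.6.7.8:6443
  template:
    metadata:
      name: '{{cluster}}-my-app'
    spec:
      project: default
      source:
        repoURL: 'https://your-git-repo.com/your-app.git'
        targetRevision: HEAD
        path: k8s-manifests
      destination:
        server: '{{url}}'     # filled in from the matching list element
        namespace: my-app-namespace
```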

🛠️ Configuration Example

Here is an ApplicationSet YAML using the Cluster Generator:

apiVersion: argoproj.io/v1alpha1
kind: ApplicationSet
metadata:
  name: my-app-multi-cluster
spec:
  generators:
    - clusters: {} # automatically discover every cluster registered in ArgoCD
  template:
    metadata:
      name: '{{name}}-my-app'
    spec:
      project: default
      source:
        repoURL: 'https://your-git-repo.com/your-app.git'
        targetRevision: HEAD
        path: k8s-manifests
      destination:
        server: '{{server}}' # cluster API address supplied by the generator
        namespace: my-app-namespace
      syncPolicy:
        syncOptions:
        - CreateNamespace=true # create the namespace automatically
        automated:
          prune: true # prune resources removed from Git
          selfHeal: true # automatically revert drift

In this template:

  • clusters: {} under generators makes ArgoCD discover every registered cluster.
  • In the template, {{name}} and {{server}} are variables that the Cluster Generator fills in for each registered cluster.
  • The syncPolicy settings enable automatic sync, automatic namespace creation, and pruning.

⚠️ Key Points for Multi-Cluster Management

  1. Cluster access and networking: make sure the ArgoCD control plane has network connectivity to every target cluster's API server, and the RBAC permissions to create resources in the target namespaces.
  2. Flexible sync strategies:
    • For development environments, enable automated sync so Git changes deploy automatically.
    • For production, consider turning automatic sync off and triggering syncs manually or through a PR approval flow, for tighter control.
  3. High availability and performance: when managing many clusters and applications, consider an HA deployment; you may need to tune the replica counts and resource limits of argocd-repo-server and argocd-application-controller.
  4. Consider Argo CD Agent: for large fleets, explore Argo CD Agent, which runs part of the control plane (such as the application-controller) on the managed clusters to improve scalability. Note that as of October 2025 it is still Tech Preview in OpenShift GitOps.
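For the production advice in point 2, the change is simply to leave out the `automated` block; a minimal sketch:

```yaml
  syncPolicy:
    syncOptions:
    - CreateNamespace=true
    # No `automated` block: the app stays OutOfSync on Git changes until
    # someone (or a PR-gated pipeline) runs: argocd app sync <app-name>
```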

💎 Summary

The key to managing multi-cluster releases with ArgoCD is mastering ApplicationSet and its Generators. With the Cluster Generator or Git Generator, you can flexibly achieve "define once, deploy everywhere".


Mar 14, 2025

ArgoCD Cheatsheets

  • decode password
kubectl -n argocd get secret argocd-initial-admin-secret -o jsonpath="{.data.password}" | base64 -d
  • relogin
ARGOCD_PASS=$(kubectl -n argocd get secret argocd-initial-admin-secret -o jsonpath="{.data.password}" | base64 -d)
MASTER_IP=$(kubectl get nodes --selector=node-role.kubernetes.io/control-plane -o jsonpath='{$.items[0].status.addresses[?(@.type=="InternalIP")].address}')
argocd login --insecure --username admin $MASTER_IP:30443 --password $ARGOCD_PASS
  • force delete
argocd app terminate-op <$>
Mar 14, 2024

Argo CD Agent

Installation

Content

    Mar 7, 2024

    Argo WorkFlow

    What is Argo Workflow?

    Argo Workflows is an open source container-native workflow engine for orchestrating parallel jobs on Kubernetes. Argo Workflows is implemented as a Kubernetes CRD.

    • Define workflows where each step in the workflow is a container.
    • Model multi-step workflows as a sequence of tasks or capture the dependencies between tasks using a graph (DAG).
    • Easily run compute intensive jobs for machine learning or data processing in a fraction of the time using Argo Workflows on Kubernetes.
    • Run CI/CD pipelines natively on Kubernetes without configuring complex software development products.
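A minimal illustration of "each step in the workflow is a container" (a sketch; the names are illustrative):

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: hello-world-
spec:
  entrypoint: main
  templates:
  - name: main
    container:
      image: alpine:3.7
      command: [echo, "hello from a container step"]
```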

    Installation

    Content

    Mar 7, 2024

    Subsections of Argo WorkFlow

    Argo Workflows Cheatsheets

    Mar 14, 2024

    Subsections of Workflow Template

    DAG Template


    apiVersion: argoproj.io/v1alpha1
    kind: Workflow
    metadata:
      generateName: dag-diamond-
    spec:
      entrypoint: entry
      serviceAccountName: argo-workflow
      templates:
      - name: echo
        inputs:
          parameters:
          - name: message
        container:
          image: alpine:3.7
          command: [echo, "{{inputs.parameters.message}}"]
      - name: entry
        dag:
          tasks:
          - name: start
            template: echo
            arguments:
                parameters: [{name: message, value: DAG initialized}]
          - name: diamond
            template: diamond
            dependencies: [start]
      - name: diamond
        dag:
          tasks:
          - name: A
            template: echo
            arguments:
              parameters: [{name: message, value: A}]
          - name: B
            dependencies: [A]
            template: echo
            arguments:
              parameters: [{name: message, value: B}]
          - name: C
            dependencies: [A]
            template: echo
            arguments:
              parameters: [{name: message, value: C}]
          - name: D
            dependencies: [B, C]
            template: echo
            arguments:
              parameters: [{name: message, value: D}]
          - name: end
            dependencies: [D]
            template: echo
            arguments:
              parameters: [{name: message, value: end}]
    kubectl -n business-workflow apply -f - << EOF
    apiVersion: argoproj.io/v1alpha1
    kind: Workflow
    metadata:
      generateName: dag-diamond-
    spec:
      entrypoint: entry
      serviceAccountName: argo-workflow
      templates:
      - name: echo
        inputs:
          parameters:
          - name: message
        container:
          image: alpine:3.7
          command: [echo, "{{inputs.parameters.message}}"]
      - name: entry
        dag:
          tasks:
          - name: start
            template: echo
            arguments:
                parameters: [{name: message, value: DAG initialized}]
          - name: diamond
            template: diamond
            dependencies: [start]
      - name: diamond
        dag:
          tasks:
          - name: A
            template: echo
            arguments:
              parameters: [{name: message, value: A}]
          - name: B
            dependencies: [A]
            template: echo
            arguments:
              parameters: [{name: message, value: B}]
          - name: C
            dependencies: [A]
            template: echo
            arguments:
              parameters: [{name: message, value: C}]
          - name: D
            dependencies: [B, C]
            template: echo
            arguments:
              parameters: [{name: message, value: D}]
          - name: end
            dependencies: [D]
            template: echo
            arguments:
              parameters: [{name: message, value: end}]
    EOF
    Mar 7, 2024

    Subsections of Argo Rollouts

    Blue–Green Deploy

    Argo Rollouts is a Kubernetes CRD controller that extends Kubernetes' native Deployment resource with more advanced deployment strategies. Its core principle: precisely control the replica counts and traffic split across multiple ReplicaSets (each corresponding to a version of the Pods) to drive a controlled, automated release process.


    1. Blue-Green Deployment

    The core idea of blue-green deployment is to keep two fully independent environments (blue and green), with only one of them carrying production traffic at any given time.

    How it works

    1. Initial state

      • Suppose production currently runs the blue version (v1); all traffic points at the blue ReplicaSet.
      • The green environment may exist (e.g. with zero replicas) but receives no traffic.
    2. Releasing a new version

      • To release v2, Argo Rollouts creates a green ReplicaSet fully isolated from the blue one and starts all the required Pod instances.
      • Key point: at this stage, user traffic still goes 100% to the blue v1. While green v2 starts up and warms up, live users are completely unaffected.
    3. Testing and verification

      • Operators or automated scripts can now test the green v2, e.g. with API calls, log checks, or integration tests, all without disturbing production traffic.
    4. Switching traffic

      • Once v2 is confirmed stable, one atomic operation switches all production traffic from blue (v1) to green (v2).
      • The switch is usually done by updating the selector of a Kubernetes Service or Ingress, e.g. changing the app: my-app Service's selector from version: v1 to version: v2.
    5. After the release

      • After the switch, green (v2) is the new production environment.
      • Blue (v1) is not deleted immediately; it is kept around for a while as a fast rollback path.
      • If v2 misbehaves, traffic is simply switched back to blue (v1); the rollback is just as fast and low-impact.

    Diagram

    [Users] --> [Service (selector: version=v1)] --> [Blue ReplicaSet (v1, 100% traffic)]
                                          |
                                          +--> [Green ReplicaSet (v2, 0% traffic, standby)]

    After the switch:

    [Users] --> [Service (selector: version=v2)] --> [Green ReplicaSet (v2, 100% traffic)]
                                          |
                                          +--> [Blue ReplicaSet (v1, 0% traffic, rollback spare)]

    Pros: fast releases and rollbacks, low risk, and the service stays available throughout. Cons: requires double the hardware, and the moment of the switch can briefly disrupt in-flight traffic (e.g. dropped connections).
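The blue-green flow described above maps onto a Rollout spec roughly like this (a sketch; names such as `my-app-active` and `my-app-preview` are placeholders):

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: my-app
spec:
  replicas: 3
  selector:
    matchLabels:
      app: my-app
  template:
    metadata:
      labels:
        app: my-app
    spec:
      containers:
      - name: my-app
        image: my-app:v2
  strategy:
    blueGreen:
      activeService: my-app-active    # Service pointing at the live (blue) version
      previewService: my-app-preview  # Service for testing the new (green) version
      autoPromotionEnabled: false     # wait for manual promotion after verification
```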


    2. Canary Release

    The core idea of a canary release is to shift traffic from the old version to the new one gradually instead of all at once. This lets you validate the new version's stability and performance while only a small share of users is exposed.

    How it works

    1. Initial state

      • As with blue-green, the current stable version's (v1) ReplicaSet carries 100% of the traffic.
    2. Releasing the canary

      • Argo Rollouts creates a ReplicaSet for the new version (v2) but starts only a few Pods (say, a tenth of the total).
      • A traffic-management tool (a service mesh such as Istio or Linkerd, or an ingress controller such as Nginx) then routes a small share of production traffic (e.g. 10%) to the v2 Pods, while the remaining 90% still flows to v1.
    3. Progressive promotion

      • This is a multi-step, automated process: the Argo Rollouts Rollout CRD can declare a detailed list of steps.
      • Example steps:
        • setWeight: 10 - shift 10% of traffic to v2.
        • pause: {duration: 5m} - pause the rollout and watch v2's metrics.
        • setWeight: 40 - if all looks good, raise traffic to 40%.
        • pause: {duration: 10m} - pause and observe again.
        • setWeight: 100 - finally shift all traffic to v2.
    4. Automated analysis and rollback

      • This is one of Argo Rollouts' most powerful features. During every pause it keeps querying a metrics analysis service.
      • The analysis service can be configured with rules (an AnalysisTemplate), for example:
        • Is the HTTP request error rate below 1%?
        • Is the mean response time under 200 ms?
        • Are custom business metrics (e.g. order failure rate) healthy?
      • If any rule fails, Argo Rollouts automatically aborts the release and rolls all traffic back to v1, with no human intervention.
    5. Completion

      • When every step has finished, the v2 ReplicaSet takes over 100% of the traffic and the v1 ReplicaSet is eventually scaled down to zero.
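The automated-analysis gate in step 4 boils down to a predicate over metrics. Here is an illustrative sketch of that logic only (the thresholds come from the example rules above; this is not a real AnalysisTemplate):

```python
def gate_passes(error_rate: float, mean_latency_ms: float) -> bool:
    """Return True when the canary meets the example thresholds:
    HTTP error rate below 1% and mean response time under 200 ms."""
    return error_rate < 0.01 and mean_latency_ms < 200.0

# A failing gate is what makes Argo Rollouts abort and roll back to v1.
```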

    Diagram

    [Users] --> [Istio VirtualService] -- 90% --> [v1 ReplicaSet]
                         |
                         +-- 10% --> [v2 ReplicaSet (canary)]

    (mid-rollout)

    [Users] --> [Istio VirtualService] -- 40% --> [v1 ReplicaSet]
                         |
                         +-- 60% --> [v2 ReplicaSet (canary)]

    (complete)

    [Users] --> [Istio VirtualService] -- 100% --> [v2 ReplicaSet]

    Pros: very low release risk; validation runs automatically against real traffic and metrics, enabling safe "hands-off" releases. Cons: the release flow takes longer and requires integration with more complex traffic-management tooling.
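The step list described earlier is declared directly in the Rollout spec; a minimal sketch of the strategy section:

```yaml
  strategy:
    canary:
      steps:
      - setWeight: 10
      - pause: {duration: 5m}
      - setWeight: 40
      - pause: {duration: 10m}
      - setWeight: 100
```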


    Summary and Core Value

    | Aspect | Blue-Green | Canary |
    | --- | --- | --- |
    | Core idea | Full cutover, isolated environments | Progressive traffic shifting |
    | Traffic control | 100% or 0%, atomic switch | Fine-grained ratios (1%, 5%, 50%...) |
    | Resource cost | High (two full environments) | Low (old and new Pods share the pool) |
    | Release speed | Fast (instant switch) | Slow (multiple stages) |
    | Risk control | Fast rollback | Small exposure plus automated analysis |
    | Automation | Fairly simple; mainly the switch | Highly automated, metric-driven decisions |

    The core value of Argo Rollouts lies in:

    1. Declarative: you declare your release strategy (blue-green or canary steps) in a YAML file, just like a Kubernetes Deployment.
    2. Controller pattern: the Argo Rollouts controller continuously watches Rollout objects and drives the whole system (K8s API, service mesh, metrics server) toward the declared state.
    3. Extensibility: through CRDs and AnalysisTemplates it integrates flexibly with any compatible traffic provider and metrics system.
    4. Automation and safety: it turns human judgment into data-driven automated rules, greatly improving release reliability and efficiency - a key building block for GitOps and continuous delivery.
    Mar 14, 2025

    Argo Rollouts Cheatsheets

    Mar 14, 2024

    Subsections of 🧯BackUp

    Subsections of ElasticSearch

    ES [Local Disk]

    Preliminary

    • ElasticSearch is installed, if not check link

    • The elasticsearch.yml has path.repo configured, which should be set to the same value as settings.location (this is handled by the helm chart, don't worry)

      ES argocd-app yaml
      apiVersion: argoproj.io/v1alpha1
      kind: Application
      metadata:
        name: elastic-search
      spec:
        syncPolicy:
          syncOptions:
          - CreateNamespace=true
        project: default
        source:
          repoURL: https://charts.bitnami.com/bitnami
          chart: elasticsearch
          targetRevision: 19.11.3
          helm:
            releaseName: elastic-search
            values: |
              global:
                kibanaEnabled: true
              clusterName: elastic
              image:
                registry: m.zjvis.net/docker.io
                pullPolicy: IfNotPresent
              security:
                enabled: false
              service:
                type: ClusterIP
              extraConfig:
                path:
                  repo: /tmp
              ingress:
                enabled: true
                annotations:
                  cert-manager.io/cluster-issuer: self-signed-ca-issuer
                  nginx.ingress.kubernetes.io/rewrite-target: /$1
                hostname: elastic-search.dev.tech
                ingressClassName: nginx
                path: /?(.*)
                tls: true
              master:
                masterOnly: false
                replicaCount: 1
                persistence:
                  enabled: false
                resources:
                  requests:
                    cpu: 2
                    memory: 1024Mi
                  limits:
                    cpu: 4
                    memory: 4096Mi
                heapSize: 2g
              data:
                replicaCount: 0
                persistence:
                  enabled: false
              coordinating:
                replicaCount: 0
              ingest:
                enabled: true
                replicaCount: 0
                service:
                  enabled: false
                  type: ClusterIP
                ingress:
                  enabled: false
              metrics:
                enabled: false
                image:
                  registry: m.zjvis.net/docker.io
                  pullPolicy: IfNotPresent
              volumePermissions:
                enabled: false
                image:
                  registry: m.zjvis.net/docker.io
                  pullPolicy: IfNotPresent
              sysctlImage:
                enabled: true
                registry: m.zjvis.net/docker.io
                pullPolicy: IfNotPresent
              kibana:
                elasticsearch:
                  hosts:
                    - '{{ include "elasticsearch.service.name" . }}'
                  port: '{{ include "elasticsearch.service.ports.restAPI" . }}'
              esJavaOpts: "-Xmx2g -Xms2g"        
        destination:
          server: https://kubernetes.default.svc
          namespace: application

      diff from original file:

      extraConfig:
          path:
            repo: /tmp

    Methods

    There are two ways to back up Elasticsearch:

    1. Export the data to text files, e.g. with tools such as elasticdump or esm that dump the documents stored in Elasticsearch to files.
    2. Use the snapshot API, which supports incremental backups.

    The first approach is simple and works well for small data volumes; for large datasets, the snapshot API is the recommended way.

    Steps

    backup

    1. Create a snapshot repository -> my_fs_repository
    curl -k -X PUT "https://elastic-search.dev.tech:32443/_snapshot/my_fs_repository?pretty" -H 'Content-Type: application/json' -d'
    {
      "type": "fs",
      "settings": {
        "location": "/tmp"
      }
    }
    '

    You can also use a storage class to mount a path into the pod and keep the snapshot files on that external volume.

    2. Verify that every node in the cluster can use this snapshot repository
    curl -k -X POST "https://elastic-search.dev.tech:32443/_snapshot/my_fs_repository/_verify?pretty"
    3. List all snapshot repositories
    curl -k -X GET "https://elastic-search.dev.tech:32443/_snapshot/_all?pretty"
    4. Show a repository's settings
    curl -k -X GET "https://elastic-search.dev.tech:32443/_snapshot/my_fs_repository?pretty"
    5. Analyze a snapshot repository
    curl -k -X POST "https://elastic-search.dev.tech:32443/_snapshot/my_fs_repository/_analyze?blob_count=10&max_blob_size=1mb&timeout=120s&pretty"
    6. Take a snapshot manually
    curl -k -X PUT "https://elastic-search.dev.tech:32443/_snapshot/my_fs_repository/ay_snap_02?pretty"
    Taking snapshots automatically with SLM (did not take effect for me)


    7. List the snapshots available in a given repository
    curl -k -X GET "https://elastic-search.dev.tech:32443/_snapshot/my_fs_repository/*?verbose=false&pretty"
    8. Test a restore
    # Delete an index
    curl -k -X DELETE "https://elastic-search.dev.tech:32443/books?pretty"
    
    # restore that index
    curl -k -X POST "https://elastic-search.dev.tech:32443/_snapshot/my_fs_repository/ay_snap_02/_restore?pretty" -H 'Content-Type: application/json' -d'
    {
      "indices": "books"
    }
    '
    
    # query
    curl -k -X GET "https://elastic-search.dev.tech:32443/books/_search?pretty" -H 'Content-Type: application/json' -d'
    {
      "query": {
        "match_all": {}
      }
    }
    '
    Oct 7, 2024

    ES [S3 Compatible]

    Preliminary

    • ElasticSearch is installed, if not check link

      ES argocd-app yaml
      apiVersion: argoproj.io/v1alpha1
      kind: Application
      metadata:
        name: elastic-search
      spec:
        syncPolicy:
          syncOptions:
          - CreateNamespace=true
        project: default
        source:
          repoURL: https://charts.bitnami.com/bitnami
          chart: elasticsearch
          targetRevision: 19.11.3
          helm:
            releaseName: elastic-search
            values: |
              global:
                kibanaEnabled: true
              clusterName: elastic
              image:
                registry: m.zjvis.net/docker.io
                pullPolicy: IfNotPresent
              security:
                enabled: true
              service:
                type: ClusterIP
              extraEnvVars:
              - name: S3_ACCESSKEY
                value: admin
              - name: S3_SECRETKEY
                value: ZrwpsezF1Lt85dxl
              extraConfig:
                s3:
                  client:
                    default:
                      protocol: http
                      endpoint: "http://192.168.31.111:9090"
                      path_style_access: true
              initScripts:
                configure-s3-client.sh: |
                  elasticsearch_set_key_value "s3.client.default.access_key" "${S3_ACCESSKEY}"
                  elasticsearch_set_key_value "s3.client.default.secret_key" "${S3_SECRETKEY}"
              hostAliases:
              - ip: 192.168.31.111
                hostnames:
                - minio-api.dev.tech
              ingress:
                enabled: true
                annotations:
                  cert-manager.io/cluster-issuer: self-signed-ca-issuer
                  nginx.ingress.kubernetes.io/rewrite-target: /$1
                hostname: elastic-search.dev.tech
                ingressClassName: nginx
                path: /?(.*)
                tls: true
              master:
                masterOnly: false
                replicaCount: 1
                persistence:
                  enabled: false
                resources:
                  requests:
                    cpu: 2
                    memory: 1024Mi
                  limits:
                    cpu: 4
                    memory: 4096Mi
                heapSize: 2g
              data:
                replicaCount: 0
                persistence:
                  enabled: false
              coordinating:
                replicaCount: 0
              ingest:
                enabled: true
                replicaCount: 0
                service:
                  enabled: false
                  type: ClusterIP
                ingress:
                  enabled: false
              metrics:
                enabled: false
                image:
                  registry: m.zjvis.net/docker.io
                  pullPolicy: IfNotPresent
              volumePermissions:
                enabled: false
                image:
                  registry: m.zjvis.net/docker.io
                  pullPolicy: IfNotPresent
              sysctlImage:
                enabled: true
                registry: m.zjvis.net/docker.io
                pullPolicy: IfNotPresent
              kibana:
                elasticsearch:
                  hosts:
                    - '{{ include "elasticsearch.service.name" . }}'
                  port: '{{ include "elasticsearch.service.ports.restAPI" . }}'
              esJavaOpts: "-Xmx2g -Xms2g"        
        destination:
          server: https://kubernetes.default.svc
          namespace: application

      diff from original file:

      extraEnvVars:
      - name: S3_ACCESSKEY
        value: admin
      - name: S3_SECRETKEY
        value: ZrwpsezF1Lt85dxl
      extraConfig:
        s3:
          client:
            default:
              protocol: http
              endpoint: "http://192.168.31.111:9090"
              path_style_access: true
      initScripts:
        configure-s3-client.sh: |
          elasticsearch_set_key_value "s3.client.default.access_key" "${S3_ACCESSKEY}"
          elasticsearch_set_key_value "s3.client.default.secret_key" "${S3_SECRETKEY}"
      hostAliases:
      - ip: 192.168.31.111
        hostnames:
        - minio-api.dev.tech

    Methods

    There are two ways to back up Elasticsearch:

    1. Export the data to text files, e.g. with tools such as elasticdump or esm that dump the documents stored in Elasticsearch to files.
    2. Use the snapshot API, which supports incremental backups.

    The first approach is simple and works well for small data volumes; for large datasets, the snapshot API is the recommended way.

    Steps

    backup

    1. Create a snapshot repository -> my_s3_repository
    curl -k -X PUT "https://elastic-search.dev.tech:32443/_snapshot/my_s3_repository?pretty" -H 'Content-Type: application/json' -d'
    {
      "type": "s3",
      "settings": {
        "bucket": "local-test",
        "client": "default",
        "endpoint": "http://192.168.31.111:9000"
      }
    }
    '

    You can also use a storage class to mount a path into the pod and keep the snapshot files on that external volume.

    2. Verify that every node in the cluster can use this snapshot repository
    curl -k -X POST "https://elastic-search.dev.tech:32443/_snapshot/my_s3_repository/_verify?pretty"
    3. List all snapshot repositories
    curl -k -X GET "https://elastic-search.dev.tech:32443/_snapshot/_all?pretty"
    4. Show a repository's settings
    curl -k -X GET "https://elastic-search.dev.tech:32443/_snapshot/my_s3_repository?pretty"
    5. Analyze a snapshot repository
    curl -k -X POST "https://elastic-search.dev.tech:32443/_snapshot/my_s3_repository/_analyze?blob_count=10&max_blob_size=1mb&timeout=120s&pretty"
    6. Take a snapshot manually
    curl -k -X PUT "https://elastic-search.dev.tech:32443/_snapshot/my_s3_repository/ay_s3_snap_02?pretty"
    Taking snapshots automatically with SLM (did not take effect for me)


    7. List the snapshots available in a given repository
    curl -k -X GET "https://elastic-search.dev.tech:32443/_snapshot/my_s3_repository/*?verbose=false&pretty"
    8. Test a restore
    # Delete an index
    curl -k -X DELETE "https://elastic-search.dev.tech:32443/books?pretty"
    
    # restore that index
    curl -k -X POST "https://elastic-search.dev.tech:32443/_snapshot/my_s3_repository/ay_s3_snap_02/_restore?pretty" -H 'Content-Type: application/json' -d'
    {
      "indices": "books"
    }
    '
    
    # query
    curl -k -X GET "https://elastic-search.dev.tech:32443/books/_search?pretty" -H 'Content-Type: application/json' -d'
    {
      "query": {
        "match_all": {}
      }
    }
    '
    Oct 7, 2024

    ES Auto BackUp

    Preliminary

    • ElasticSearch is installed, if not check link

    • We use local disk to save the snapshots; for more details check link

    • And the security is enabled.

      ES argocd-app yaml
      apiVersion: argoproj.io/v1alpha1
      kind: Application
      metadata:
        name: elastic-search
      spec:
        syncPolicy:
          syncOptions:
          - CreateNamespace=true
        project: default
        source:
          repoURL: https://charts.bitnami.com/bitnami
          chart: elasticsearch
          targetRevision: 19.11.3
          helm:
            releaseName: elastic-search
            values: |
              global:
                kibanaEnabled: true
              clusterName: elastic
              image:
                registry: m.zjvis.net/docker.io
                pullPolicy: IfNotPresent
              security:
                enabled: true
                tls:
                  autoGenerated: true
              service:
                type: ClusterIP
              extraConfig:
                path:
                  repo: /tmp
              ingress:
                enabled: true
                annotations:
                  cert-manager.io/cluster-issuer: self-signed-ca-issuer
                  nginx.ingress.kubernetes.io/rewrite-target: /$1
                hostname: elastic-search.dev.tech
                ingressClassName: nginx
                path: /?(.*)
                tls: true
              master:
                masterOnly: false
                replicaCount: 1
                persistence:
                  enabled: false
                resources:
                  requests:
                    cpu: 2
                    memory: 1024Mi
                  limits:
                    cpu: 4
                    memory: 4096Mi
                heapSize: 2g
              data:
                replicaCount: 0
                persistence:
                  enabled: false
              coordinating:
                replicaCount: 0
              ingest:
                enabled: true
                replicaCount: 0
                service:
                  enabled: false
                  type: ClusterIP
                ingress:
                  enabled: false
              metrics:
                enabled: false
                image:
                  registry: m.zjvis.net/docker.io
                  pullPolicy: IfNotPresent
              volumePermissions:
                enabled: false
                image:
                  registry: m.zjvis.net/docker.io
                  pullPolicy: IfNotPresent
              sysctlImage:
                enabled: true
                registry: m.zjvis.net/docker.io
                pullPolicy: IfNotPresent
              kibana:
                elasticsearch:
                  hosts:
                    - '{{ include "elasticsearch.service.name" . }}'
                  port: '{{ include "elasticsearch.service.ports.restAPI" . }}'
              esJavaOpts: "-Xmx2g -Xms2g"        
        destination:
          server: https://kubernetes.default.svc
          namespace: application

      diff from original file:

      security:
        enabled: true
      extraConfig:
          path:
            repo: /tmp

    Methods

    Steps

    auto backup
    1. Create a snapshot repository -> slm_fs_repository
    curl --user elastic:L9shjg6csBmPZgCZ -k -X PUT "https://10.88.0.143:30294/_snapshot/slm_fs_repository?pretty" -H 'Content-Type: application/json' -d'
    {
      "type": "fs",
      "settings": {
        "location": "/tmp"
      }
    }
    '

    You can also use a storage class to mount a path into the pod and keep the snapshot files on that external volume.

    2. Verify that every node in the cluster can use this snapshot repository
    curl --user elastic:L9shjg6csBmPZgCZ  -k -X POST "https://10.88.0.143:30294/_snapshot/slm_fs_repository/_verify?pretty"
    3. List all snapshot repositories
    curl --user elastic:L9shjg6csBmPZgCZ  -k -X GET "https://10.88.0.143:30294/_snapshot/_all?pretty"
    4. Show a repository's settings
    curl --user elastic:L9shjg6csBmPZgCZ  -k -X GET "https://10.88.0.143:30294/_snapshot/slm_fs_repository?pretty"
5. Analyze a snapshot repository
    curl --user elastic:L9shjg6csBmPZgCZ  -k -X POST "https://10.88.0.143:30294/_snapshot/slm_fs_repository/_analyze?blob_count=10&max_blob_size=1mb&timeout=120s&pretty"
6. List the snapshots available in a given repository
    curl --user elastic:L9shjg6csBmPZgCZ  -k -X GET "https://10.88.0.143:30294/_snapshot/slm_fs_repository/*?verbose=false&pretty"
7. Create an SLM admin role
    curl --user elastic:L9shjg6csBmPZgCZ -k -X POST "https://10.88.0.143:30294/_security/role/slm-admin?pretty" -H 'Content-Type: application/json' -d'
    {
      "cluster": [ "manage_slm", "cluster:admin/snapshot/*" ],
      "indices": [
        {
          "names": [ ".slm-history-*" ],
          "privileges": [ "all" ]
        }
      ]
    }
    '
8. Create the automated backup cron job (SLM policy)
    curl --user elastic:L9shjg6csBmPZgCZ -k -X PUT "https://10.88.0.143:30294/_slm/policy/nightly-snapshots?pretty" -H 'Content-Type: application/json' -d'
    {
      "schedule": "0 30 1 * * ?",       
      "name": "<nightly-snap-{now/d}>", 
      "repository": "slm_fs_repository",    
      "config": {
        "indices": "*",                 
        "include_global_state": true    
      },
      "retention": {                    
        "expire_after": "30d",
        "min_count": 5,
        "max_count": 50
      }
    }
    '
9. Trigger the backup policy manually
    curl --user elastic:L9shjg6csBmPZgCZ -k -X POST "https://10.88.0.143:30294/_slm/policy/nightly-snapshots/_execute?pretty"
10. Check the SLM backup history
    curl --user elastic:L9shjg6csBmPZgCZ -k -X GET "https://10.88.0.143:30294/_slm/stats?pretty"
11. Test a restore
    # Delete an index
    curl --user elastic:L9shjg6csBmPZgCZ  -k -X DELETE "https://10.88.0.143:30294/books?pretty"
    
    # restore that index
    curl --user elastic:L9shjg6csBmPZgCZ  -k -X POST "https://10.88.0.143:30294/_snapshot/slm_fs_repository/my_snapshot_2099.05.06/_restore?pretty" -H 'Content-Type: application/json' -d'
    {
      "indices": "books"
    }
    '
    
    # query
    curl --user elastic:L9shjg6csBmPZgCZ  -k -X GET "https://10.88.0.143:30294/books/_search?pretty" -H 'Content-Type: application/json' -d'
    {
      "query": {
        "match_all": {}
      }
    }
    '
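For the restore step it helps to pick the most recent snapshot automatically instead of hard-coding a name. A small sketch using jq — the JSON here is a trimmed, illustrative sample of what `GET _snapshot/<repo>/*` returns; in practice pipe the curl output into the same filter:

```shell
# Sample of a GET _snapshot/<repo>/* response, trimmed to the relevant fields
response='{"snapshots":[
  {"snapshot":"nightly-snap-2024.10.06","start_time_in_millis":1728177000000},
  {"snapshot":"nightly-snap-2024.10.07","start_time_in_millis":1728263400000}
]}'

# Sort by start time and take the last (newest) snapshot name
latest=$(echo "$response" | jq -r '.snapshots | sort_by(.start_time_in_millis) | .[-1].snapshot')
echo "$latest"   # -> nightly-snap-2024.10.07
```

The resulting name can then be substituted into the `_restore` call above.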
    Oct 7, 2024

    Example Shell Script

    Init ES Backup Setting

Create an ES backup repository in S3, and take a snapshot after the creation.

    #!/bin/bash
    ES_HOST="http://192.168.58.2:30910"
    ES_BACKUP_REPO_NAME="s3_fs_repository"
    S3_CLIENT="default"
    ES_BACKUP_BUCKET_IN_S3="es-snapshot"
    ES_SNAPSHOT_TAG="auto"
    
    CHECK_RESPONSE=$(curl -s -k -X POST "$ES_HOST/_snapshot/$ES_BACKUP_REPO_NAME/_verify?pretty" )
    CHECKED_NODES=$(echo "$CHECK_RESPONSE" | jq -r '.nodes')
    
    
if [ "$CHECKED_NODES" == "null" ]; then
  echo "No ES backup setting exists yet..."
  echo "A default backup setting will be generated (using the '$S3_CLIENT' s3 client; all backup files will be saved in bucket '$ES_BACKUP_BUCKET_IN_S3')."
    
      CREATE_RESPONSE=$(curl -s -k -X PUT "$ES_HOST/_snapshot/$ES_BACKUP_REPO_NAME?pretty" -H 'Content-Type: application/json' -d "{\"type\":\"s3\",\"settings\":{\"bucket\":\"$ES_BACKUP_BUCKET_IN_S3\",\"client\":\"$S3_CLIENT\"}}")
      CREATE_ACKNOWLEDGED_FLAG=$(echo "$CREATE_RESPONSE" | jq -r '.acknowledged')
    
  if [ "$CREATE_ACKNOWLEDGED_FLAG" == "true" ]; then
    echo "Backup setting '$ES_BACKUP_REPO_NAME' has been created successfully!"
  else
    echo "Failed to create backup setting '$ES_BACKUP_REPO_NAME', response: $CREATE_RESPONSE"
  fi
else
  echo "An ES backup setting '$ES_BACKUP_REPO_NAME' already exists"
    fi
    
    CHECK_RESPONSE=$(curl -s -k -X POST "$ES_HOST/_snapshot/$ES_BACKUP_REPO_NAME/_verify?pretty" )
    CHECKED_NODES=$(echo "$CHECK_RESPONSE" | jq -r '.nodes')
    
if [ "$CHECKED_NODES" != "null" ]; then
      SNAPSHOT_NAME="meta-data-$ES_SNAPSHOT_TAG-snapshot-$(date +%s)"
      SNAPSHOT_CREATION=$(curl -s -k -X PUT "$ES_HOST/_snapshot/$ES_BACKUP_REPO_NAME/$SNAPSHOT_NAME")
      echo "Snapshot $SNAPSHOT_NAME has been created."
else
  echo "Failed to verify backup setting '$ES_BACKUP_REPO_NAME'; no snapshot was created."
    fi
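To run the script above on a schedule, a crontab entry can be added on a host that reaches the cluster. The script path here is hypothetical — adjust it to wherever the script is saved:

```
# run the init/backup script every night at 01:30 (hypothetical path)
30 1 * * * /opt/scripts/init_es_backup.sh >> /var/log/es_backup.log 2>&1
```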
    Mar 14, 2024

    Subsections of Git

    Minio

      Mar 7, 2024

      N8N

        Mar 7, 2024

        Redis

          Mar 7, 2024

          Subsections of ☁️CSP Related

          Subsections of Aliyun

          OSSutil

          download ossutil

First, you need to download ossutil:

Linux:
curl https://gosspublic.alicdn.com/ossutil/install.sh | sudo bash
Windows:
curl -o ossutil-v1.7.19-windows-386.zip https://gosspublic.alicdn.com/ossutil/1.7.19/ossutil-v1.7.19-windows-386.zip

          config ossutil

          ./ossutil config
Params | Description | Instruction
endpoint | the Endpoint of the region where the Bucket is located |
accessKeyID | OSS AccessKey | get from user info panel
accessKeySecret | OSS AccessKeySecret | get from user info panel
stsToken | token for sts service | could be empty
          Info

          you can also modify /home/<$user>/.ossutilconfig file directly to change the configuration.
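For reference, a `.ossutilconfig` written by `ossutil config` looks roughly like the following sketch — the endpoint is an example for the Hangzhou region, and the keys are placeholders:

```ini
[Credentials]
language=EN
endpoint=oss-cn-hangzhou.aliyuncs.com
accessKeyID=<your-access-key-id>
accessKeySecret=<your-access-key-secret>
```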

          list files

          ossutil ls oss://<$PATH>
For example
          ossutil ls oss://csst-data/CSST-20240312/dfs/

          download file/dir

          you can use cp to download or upload file

ossutil cp -r oss://<$PATH> <$OTHER_PATH>
For example
          ossutil cp -r oss://csst-data/CSST-20240312/dfs/ /data/nfs/data/pvc...

          upload file/dir

          ossutil cp -r <$SOURCE_PATH> oss://<$PATH>
For example
          ossutil cp -r /data/nfs/data/pvc/a.txt  oss://csst-data/CSST-20240312/dfs/b.txt
          Mar 24, 2024

          ECS DNS

          ZJADC (Aliyun Directed Cloud)

          Append content in /etc/resolv.conf

          options timeout:2 attempts:3 rotate
          nameserver 10.255.9.2
          nameserver 10.200.12.5

          And then you probably need to modify yum.repo.d as well, check link


          YQGCY (Aliyun Directed Cloud)

          Append content in /etc/resolv.conf

          nameserver 172.27.205.79

          And then restart kube-system.coredns-xxxx


Public DNS (Google, AliDNS)

          nameserver 8.8.8.8
nameserver 8.8.4.4
          nameserver 223.5.5.5
          nameserver 223.6.6.6

          Restart DNS

CentOS/RHEL (NetworkManager):
vim /etc/NetworkManager/NetworkManager.conf
Ubuntu (systemd-resolved):
sudo systemctl is-active systemd-resolved
sudo resolvectl flush-caches
# or sudo systemd-resolve --flush-caches

add "dns=none" under the "[main]" section

          systemctl restart NetworkManager

          Modify ifcfg-ethX [Optional]

If you cannot get an IPv4 address, you can try modifying ifcfg-ethX:

          vim /etc/sysconfig/network-scripts/ifcfg-ens33

          set ONBOOT=yes
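A minimal DHCP configuration for that file might look like the following sketch (ens33 matches the device in the command above; adjust it to your interface name):

```ini
# /etc/sysconfig/network-scripts/ifcfg-ens33
TYPE=Ethernet
BOOTPROTO=dhcp
NAME=ens33
DEVICE=ens33
ONBOOT=yes
```

Restart NetworkManager afterwards so the change takes effect.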

          Mar 14, 2024

          Tencent

            Mar 7, 2024

            Subsections of Zhejianglab

            👨‍💻Schedmd Slurm

            The Slurm Workload Manager, formerly known as Simple Linux Utility for Resource Management (SLURM), or simply Slurm, is a free and open-source job scheduler for Linux and Unix-like kernels, used by many of the world’s supercomputers and computer clusters.

            It provides three key functions:

            • allocating exclusive and/or non-exclusive access to resources (computer nodes) to users for some duration of time so they can perform work,
            • providing a framework for starting, executing, and monitoring work, typically a parallel job such as Message Passing Interface (MPI) on a set of allocated nodes, and
            • arbitrating contention for resources by managing a queue of pending jobs.


            Content

            Aug 7, 2024

            Subsections of 👨‍💻Schedmd Slurm

            Build & Install

            Aug 7, 2024

            Subsections of Build & Install

            Install On Debian

            Cluster Setting

            • 1 Manager
            • 1 Login Node
            • 2 Compute nodes
hostname | IP | role | quota
manage01 (slurmctld, slurmdbd) | 192.168.56.115 | manager | 2C4G
login01 (login) | 192.168.56.116 | login | 2C4G
compute01 (slurmd) | 192.168.56.117 | compute | 2C4G
compute02 (slurmd) | 192.168.56.118 | compute | 2C4G

            Software Version:

software | version
os | Debian 12 bookworm
slurm | 24.05.2

            Important

            when you see (All Nodes), you need to run the following command on all nodes

            when you see (Manager Node), you only need to run the following command on manager node

            when you see (Login Node), you only need to run the following command on login node

            Prepare Steps (All Nodes)

1. Modify the /etc/apt/sources.list file to use the tuna mirror
            cat > /etc/apt/sources.list << EOF
            deb https://mirrors.tuna.tsinghua.edu.cn/debian/ bookworm main contrib non-free non-free-firmware
            deb-src https://mirrors.tuna.tsinghua.edu.cn/debian/ bookworm main contrib non-free non-free-firmware
            
            deb https://mirrors.tuna.tsinghua.edu.cn/debian/ bookworm-updates main contrib non-free non-free-firmware
            deb-src https://mirrors.tuna.tsinghua.edu.cn/debian/ bookworm-updates main contrib non-free non-free-firmware
            
            deb https://mirrors.tuna.tsinghua.edu.cn/debian/ bookworm-backports main contrib non-free non-free-firmware
            deb-src https://mirrors.tuna.tsinghua.edu.cn/debian/ bookworm-backports main contrib non-free non-free-firmware
            
            deb https://mirrors.tuna.tsinghua.edu.cn/debian-security/ bookworm-security main contrib non-free non-free-firmware
            deb-src https://mirrors.tuna.tsinghua.edu.cn/debian-security/ bookworm-security main contrib non-free non-free-firmware
            EOF
If you cannot get an IPv4 address, modify /etc/network/interfaces:

            allow-hotplug enps08
            iface enps08 inet dhcp

            restart the network

            systemctl restart networking
2. Update apt cache
            apt clean all && apt update
3. Set the hostname on each node (run the matching command on that node):
            hostnamectl set-hostname manage01
            hostnamectl set-hostname login01
            hostnamectl set-hostname compute01
            hostnamectl set-hostname compute02
4. Set hosts file
            cat >> /etc/hosts << EOF
            192.168.56.115 manage01
            192.168.56.116 login01
            192.168.56.117 compute01
            192.168.56.118 compute02
            EOF
5. Disable firewall
            systemctl stop nftables && systemctl disable nftables
6. Install the ntpdate package
            apt-get -y install ntpdate
7. Sync server time
            ln -sf /usr/share/zoneinfo/Asia/Shanghai /etc/localtime
            echo 'Asia/Shanghai' >/etc/timezone
            ntpdate time.windows.com
8. Add cron job to sync time
            crontab -e
            */5 * * * * /usr/sbin/ntpdate time.windows.com
9. Create ssh key pair on each node
            ssh-keygen -t rsa -b 4096 -C $HOSTNAME
10. Copy the public key to the other nodes so that ssh login works without a password (run on each node; e.g. from manage01, then from login01):
            ssh-copy-id -i ~/.ssh/id_rsa.pub root@login01
            ssh-copy-id -i ~/.ssh/id_rsa.pub root@compute01
            ssh-copy-id -i ~/.ssh/id_rsa.pub root@compute02
            ssh-copy-id -i ~/.ssh/id_rsa.pub root@manage01
            ssh-copy-id -i ~/.ssh/id_rsa.pub root@compute01
            ssh-copy-id -i ~/.ssh/id_rsa.pub root@compute02

            Install Components

            1. Install NFS server (Manager Node)

There are many ways to install an NFS server; this is one of them.

            create shared folder

            mkdir /data
            chmod 755 /data

modify /etc/exports:

            /data *(rw,sync,insecure,no_subtree_check,no_root_squash)

            start nfs server

            systemctl start rpcbind 
            systemctl start nfs-server 
            
            systemctl enable rpcbind 
            systemctl enable nfs-server

            check nfs server

            showmount -e localhost
            
            # Output
            Export list for localhost:
            /data *
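The export above is server-side only; the login and compute nodes still need the NFS client installed (`apt-get install -y nfs-common`) and the share mounted, e.g. `mount -t nfs manage01:/data /data`. To make the mount survive reboots, an /etc/fstab line such as this sketch can be added on each non-manager node:

```
manage01:/data  /data  nfs  defaults,_netdev  0 0
```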
2. Install munge service
            • add user munge (All Nodes)
            groupadd -g 1108 munge
            useradd -m -c "Munge Uid 'N' Gid Emporium" -d /var/lib/munge -u 1108 -g munge -s /sbin/nologin munge
            • Install rng-tools-debian (Manager Nodes)
            apt-get install -y rng-tools-debian
            # modify service script
            vim /usr/lib/systemd/system/rngd.service
            [Service]
            ExecStart=/usr/sbin/rngd -f -r /dev/urandom
            systemctl daemon-reload
            systemctl start rngd
            systemctl enable rngd
• install munge packages (All Nodes)
apt-get install -y libmunge-dev libmunge2 munge
            • generate secret key (Manager Nodes)
            dd if=/dev/urandom bs=1 count=1024 > /etc/munge/munge.key
            • copy munge.key from manager node to the rest node (All Nodes)
            scp -p /etc/munge/munge.key root@login01:/etc/munge/
            scp -p /etc/munge/munge.key root@compute01:/etc/munge/
            scp -p /etc/munge/munge.key root@compute02:/etc/munge/
            • grant privilege on munge.key (All Nodes)
            chown munge: /etc/munge/munge.key
            chmod 400 /etc/munge/munge.key
            
            systemctl start munge
            systemctl enable munge

            Using systemctl status munge to check if the service is running

            • test munge
            munge -n | ssh compute01 unmunge
3. Install Mariadb (Manager Node)
            apt-get install -y mariadb-server
            • create database and user
            systemctl start mariadb
            systemctl enable mariadb
            
            ROOT_PASS=$(tr -dc A-Za-z0-9 </dev/urandom | head -c 16) 
            mysql -e "CREATE USER root IDENTIFIED BY '${ROOT_PASS}'"
            mysql -uroot -p$ROOT_PASS -e 'create database slurm_acct_db'
• create user slurm, and grant all privileges on database slurm_acct_db
            mysql -uroot -p$ROOT_PASS
            create user slurm;
            
            grant all on slurm_acct_db.* TO 'slurm'@'localhost' identified by '123456' with grant option;
            
            flush privileges;
            • create Slurm user
            groupadd -g 1109 slurm
            useradd -m -c "Slurm manager" -d /var/lib/slurm -u 1109 -g slurm -s /bin/bash slurm

            Install Slurm (All Nodes)

            • Install basic Debian package build requirements:
            apt-get install -y build-essential fakeroot devscripts equivs
            • Unpack the distributed tarball:
            wget https://download.schedmd.com/slurm/slurm-24.05.2.tar.bz2 -O slurm-24.05.2.tar.bz2 &&
            tar -xaf slurm*tar.bz2
            • cd to the directory containing the Slurm source:
cd slurm-24.05.2 && mkdir -p /etc/slurm && ./configure --sysconfdir=/etc/slurm
• compile and install slurm
            make install
            • modify configuration files (Manager Nodes)

              cp /root/slurm-24.05.2/etc/slurm.conf.example /etc/slurm/slurm.conf
              vim /etc/slurm/slurm.conf

              focus on these options:

SlurmctldHost=manage01
              
              AccountingStorageEnforce=associations,limits,qos
AccountingStorageHost=manage01
              AccountingStoragePass=/var/run/munge/munge.socket.2
              AccountingStoragePort=6819  
              AccountingStorageType=accounting_storage/slurmdbd  
              
              JobCompHost=localhost
              JobCompLoc=slurm_acct_db
              JobCompPass=123456
              JobCompPort=3306
              JobCompType=jobcomp/mysql
              JobCompUser=slurm
              JobContainerType=job_container/none
              JobAcctGatherType=jobacct_gather/linux
              cp /root/slurm-24.05.2/etc/slurmdbd.conf.example /etc/slurm/slurmdbd.conf
              vim /etc/slurm/slurmdbd.conf
              • modify /etc/slurm/cgroup.conf
              cp /root/slurm-24.05.2/etc/cgroup.conf.example /etc/slurm/cgroup.conf
              • send configuration files to other nodes
              scp -r /etc/slurm/*.conf  root@login01:/etc/slurm/
              scp -r /etc/slurm/*.conf  root@compute01:/etc/slurm/
              scp -r /etc/slurm/*.conf  root@compute02:/etc/slurm/
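The slurmdbd.conf edited above is not shown in full; at minimum its storage settings must match the MariaDB user, password, and database created earlier. A minimal sketch — all values come from the steps above except the log/pid paths, which are common defaults:

```ini
# /etc/slurm/slurmdbd.conf (must be owned by slurm, mode 600)
DbdHost=localhost
SlurmUser=slurm
StorageType=accounting_storage/mysql
StorageHost=localhost
StoragePort=3306
StorageUser=slurm
StoragePass=123456
StorageLoc=slurm_acct_db
LogFile=/var/log/slurm/slurmdbd.log
PidFile=/var/run/slurmdbd.pid
```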
            • grant privilege on some directories (All Nodes)

            mkdir /var/spool/slurmd
            chown slurm: /var/spool/slurmd
            mkdir /var/log/slurm
            chown slurm: /var/log/slurm
            
            mkdir /var/spool/slurmctld
            chown slurm: /var/spool/slurmctld
            
            chown slurm: /etc/slurm/slurmdbd.conf
            chmod 600 /etc/slurm/slurmdbd.conf
• start slurm services on each node (slurmdbd and slurmctld on the manager node; slurmd on the login and compute nodes):
            systemctl start slurmdbd
            systemctl enable slurmdbd
            
            systemctl start slurmctld
            systemctl enable slurmctld
            
            systemctl start slurmd
            systemctl enable slurmd
            Using `systemctl status xxxx` to check if the `xxxx` service is running
Example slurmdbd.service
            ```text
            # vim /usr/lib/systemd/system/slurmdbd.service
            
            
            [Unit]
            Description=Slurm DBD accounting daemon
            After=network-online.target remote-fs.target munge.service mysql.service mysqld.service mariadb.service sssd.service
            Wants=network-online.target
            ConditionPathExists=/etc/slurm/slurmdbd.conf
            
            [Service]
            Type=simple
            EnvironmentFile=-/etc/sysconfig/slurmdbd
            EnvironmentFile=-/etc/default/slurmdbd
            User=slurm
            Group=slurm
            RuntimeDirectory=slurmdbd
            RuntimeDirectoryMode=0755
            ExecStart=/usr/local/sbin/slurmdbd -D -s $SLURMDBD_OPTIONS
            ExecReload=/bin/kill -HUP $MAINPID
            LimitNOFILE=65536
            
            
            # Uncomment the following lines to disable logging through journald.
            # NOTE: It may be preferable to set these through an override file instead.
            #StandardOutput=null
            #StandardError=null
            
            [Install]
            WantedBy=multi-user.target
            ```
            
Example slurmctld.service
            ```text
            # vim /usr/lib/systemd/system/slurmctld.service
            
            
            [Unit]
            Description=Slurm controller daemon
            After=network-online.target remote-fs.target munge.service sssd.service
            Wants=network-online.target
            ConditionPathExists=/etc/slurm/slurm.conf
            
            [Service]
            Type=notify
            EnvironmentFile=-/etc/sysconfig/slurmctld
            EnvironmentFile=-/etc/default/slurmctld
            User=slurm
            Group=slurm
            RuntimeDirectory=slurmctld
            RuntimeDirectoryMode=0755
            ExecStart=/usr/local/sbin/slurmctld --systemd $SLURMCTLD_OPTIONS
            ExecReload=/bin/kill -HUP $MAINPID
            LimitNOFILE=65536
            
            
            # Uncomment the following lines to disable logging through journald.
            # NOTE: It may be preferable to set these through an override file instead.
            #StandardOutput=null
            #StandardError=null
            
            [Install]
            WantedBy=multi-user.target
            ```
            
Example slurmd.service
            ```text
            # vim /usr/lib/systemd/system/slurmd.service
            
            
            [Unit]
            Description=Slurm node daemon
            After=munge.service network-online.target remote-fs.target sssd.service
            Wants=network-online.target
            #ConditionPathExists=/etc/slurm/slurm.conf
            
            [Service]
            Type=notify
            EnvironmentFile=-/etc/sysconfig/slurmd
            EnvironmentFile=-/etc/default/slurmd
            RuntimeDirectory=slurm
            RuntimeDirectoryMode=0755
            ExecStart=/usr/local/sbin/slurmd --systemd $SLURMD_OPTIONS
            ExecReload=/bin/kill -HUP $MAINPID
            KillMode=process
            LimitNOFILE=131072
            LimitMEMLOCK=infinity
            LimitSTACK=infinity
            Delegate=yes
            
            
            # Uncomment the following lines to disable logging through journald.
            # NOTE: It may be preferable to set these through an override file instead.
            #StandardOutput=null
            #StandardError=null
            
            [Install]
            WantedBy=multi-user.target
            ```
            
            systemctl start slurmd
            systemctl enable slurmd
            Using `systemctl status slurmd` to check if the `slurmd` service is running

            Test Your Slurm Cluster (Login Node)

            • check cluster configuration
            scontrol show config
            • check cluster status
            sinfo
            scontrol show partition
            scontrol show node
            • submit job
            srun -N2 hostname
            scontrol show jobs
            • check job status
            squeue -a
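Besides srun, jobs are normally submitted as batch scripts with sbatch. A minimal sketch — the job name and output pattern are arbitrary choices:

```shell
#!/bin/bash
#SBATCH --job-name=hello
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=1
#SBATCH --output=hello_%j.out

srun hostname
```

Save it as hello.sh, submit it with `sbatch hello.sh`, and watch it with `squeue -a`.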
            Aug 7, 2024

            Install On Ubuntu

            Cluster Setting

            • 1 Manager
            • 1 Login Node
            • 2 Compute nodes
hostname | IP | role | quota
manage01 (slurmctld, slurmdbd) | 192.168.56.115 | manager | 2C4G
login01 (login) | 192.168.56.116 | login | 2C4G
compute01 (slurmd) | 192.168.56.117 | compute | 2C4G
compute02 (slurmd) | 192.168.56.118 | compute | 2C4G

            Software Version:

software | version
os | Ubuntu 22.04
slurm | 25.05.2

            Important

            when you see (All Nodes), you need to run the following command on all nodes

            when you see (Manager Node), you only need to run the following command on manager node

            when you see (Login Node), you only need to run the following command on login node

            Prepare Steps (All Nodes)

1. Modify the /etc/apt/sources.list file to use the tuna mirror
cat > /etc/apt/sources.list << EOF
deb https://mirrors.tuna.tsinghua.edu.cn/ubuntu/ jammy main restricted universe multiverse
deb https://mirrors.tuna.tsinghua.edu.cn/ubuntu/ jammy-updates main restricted universe multiverse
deb https://mirrors.tuna.tsinghua.edu.cn/ubuntu/ jammy-backports main restricted universe multiverse
deb https://mirrors.tuna.tsinghua.edu.cn/ubuntu/ jammy-security main restricted universe multiverse
EOF
If you cannot get an IPv4 address, modify /etc/network/interfaces:

            allow-hotplug enps08
            iface enps08 inet dhcp

            restart the network

            systemctl restart networking
2. Update apt cache
            apt clean all && apt update
3. Set hosts file
            cat >> /etc/hosts << EOF
            10.119.2.36 juice-036
            10.119.2.37 juice-037
            10.119.2.38 juice-038
            EOF
4. Install the ntpdate package
            apt-get -y install ntpdate
5. Sync server time
            ln -sf /usr/share/zoneinfo/Asia/Shanghai /etc/localtime
            echo 'Asia/Shanghai' >/etc/timezone
            ntpdate ntp.aliyun.com
6. Add cron job to sync time
            crontab -e
            */5 * * * * /usr/sbin/ntpdate ntp.aliyun.com
7. Create ssh key pair on each node
            ssh-keygen -t rsa -b 4096 -C $HOSTNAME
8. Copy the public key to the other nodes so that ssh login works without a password (run on each node):
            ssh-copy-id -i ~/.ssh/id_rsa.pub root@juice-036
            ssh-copy-id -i ~/.ssh/id_rsa.pub root@juice-037
            ssh-copy-id -i ~/.ssh/id_rsa.pub root@juice-038

            Install Components

            1. Install NFS server (Manager Node)

There are many ways to install an NFS server; this is one of them.

            create shared folder

            mkdir /data
            chmod 755 /data

modify /etc/exports:

            /data *(rw,sync,insecure,no_subtree_check,no_root_squash)

            start nfs server

            systemctl start rpcbind 
            systemctl start nfs-server 
            
            systemctl enable rpcbind 
            systemctl enable nfs-server

            check nfs server

            showmount -e localhost
            
            # Output
            Export list for localhost:
            /data *
2. Install munge service
• install build dependencies and munge (All Nodes)
            sudo apt install -y build-essential git wget munge libmunge-dev libmunge2 \
                mariadb-server libmariadb-dev libssl-dev libpam0g-dev \
                libhwloc-dev liblua5.3-dev libreadline-dev libncurses-dev \
                libjson-c-dev libyaml-dev libhttp-parser-dev libjwt-dev libdbus-glib-1-dev libbpf-dev libdbus-1-dev
            
            
            which mungekey
            
# if mungekey is available, use it to generate the key
            sudo systemctl stop munge
            sudo mungekey -c
            sudo chown munge:munge /etc/munge/munge.key
            sudo chmod 400 /etc/munge/munge.key
            sudo systemctl start munge
            • copy munge.key from manager node to the rest node (All Nodes)
            sudo scp /etc/munge/munge.key juice-036:/tmp/munge.key
            sudo scp /etc/munge/munge.key juice-037:/tmp/munge.key
            sudo scp /etc/munge/munge.key juice-038:/tmp/munge.key
            • grant privilege on munge.key (All Nodes)
            systemctl stop munge
            
            sudo mv /tmp/munge.key /etc/munge/munge.key
            chown munge: /etc/munge/munge.key
            chmod 400 /etc/munge/munge.key
            
            systemctl start munge
            systemctl status munge
            systemctl enable munge

            Using systemctl status munge to check if the service is running

            • test munge
            munge -n | ssh juice-036 unmunge
            munge -n | ssh juice-037 unmunge
            munge -n | ssh juice-038 unmunge
3. Install Mariadb (Manager Node)
            apt-get install -y mariadb-server
            • create database and user
            systemctl start mariadb
            systemctl enable mariadb
            
            ROOT_PASS=$(tr -dc A-Za-z0-9 </dev/urandom | head -c 16) 
            mysql -e "CREATE USER root IDENTIFIED BY '${ROOT_PASS}'"
            mysql -uroot -p$ROOT_PASS -e 'create database slurm_acct_db'
• create user slurm, and grant all privileges on database slurm_acct_db
            mysql -uroot -p$ROOT_PASS
            create user slurm;
            
            grant all on slurm_acct_db.* TO 'slurm'@'localhost' identified by '123456' with grant option;
            
            flush privileges;
            • create Slurm user
            groupadd -g 1109 slurm
            useradd -m -c "Slurm manager" -d /var/lib/slurm -u 1109 -g slurm -s /bin/bash slurm

            Install Slurm (All Nodes)

            • Install basic Debian package build requirements:
            apt-get install -y build-essential fakeroot devscripts equivs
            • Unpack the distributed tarball:
            wget https://download.schedmd.com/slurm/slurm-25.05.2.tar.bz2 -O slurm-25.05.2.tar.bz2 &&
            tar -xaf slurm*tar.bz2
            • cd to the directory containing the Slurm source:
            cd slurm-25.05.2 &&   mkdir -p /etc/slurm && ./configure --prefix=/usr --sysconfdir=/etc/slurm  --enable-cgroupv2
            • compile slurm
            make install
            • modify configuration files (Manager Nodes)

              cp /root/slurm-25.05.2/etc/slurm.conf.example /etc/slurm/slurm.conf
              vim /etc/slurm/slurm.conf

              focus on these options:

              SlurmctldHost=manage
              
              AccountingStorageEnforce=associations,limits,qos
              AccountingStorageHost=manage
              AccountingStoragePass=/var/run/munge/munge.socket.2
              AccountingStoragePort=6819  
              AccountingStorageType=accounting_storage/slurmdbd  
              
              JobCompHost=localhost
              JobCompLoc=slurm_acct_db
              JobCompPass=123456
              JobCompPort=3306
              JobCompType=jobcomp/mysql
              JobCompUser=slurm
              JobContainerType=job_container/none
              JobAcctGatherType=jobacct_gather/linux
              cp /root/slurm-25.05.2/etc/slurmdbd.conf.example /etc/slurm/slurmdbd.conf
              vim /etc/slurm/slurmdbd.conf
              • modify /etc/slurm/cgroup.conf
              cp /root/slurm-25.05.2/etc/cgroup.conf.example /etc/slurm/cgroup.conf
              • send configuration files to other nodes
              scp -r /etc/slurm/*.conf  root@juice-037:/etc/slurm/
              scp -r /etc/slurm/*.conf  root@juice-038:/etc/slurm/
            • grant privilege on some directories (All Nodes)

            mkdir /var/spool/slurmd
            chown slurm: /var/spool/slurmd
            mkdir /var/log/slurm
            chown slurm: /var/log/slurm
            
            mkdir /var/spool/slurmctld
            chown slurm: /var/spool/slurmctld
            
            chown slurm: /etc/slurm/slurmdbd.conf
            chmod 600 /etc/slurm/slurmdbd.conf
• start slurm services on each node (slurmdbd and slurmctld on the manager node; slurmd on the compute nodes):
            systemctl start slurmdbd
            systemctl enable slurmdbd
            
            systemctl start slurmctld
            systemctl enable slurmctld
            
            systemctl start slurmd
            systemctl enable slurmd
            Using `systemctl status xxxx` to check if the `xxxx` service is running
Example slurmdbd.service
            ```text
            # vim /usr/lib/systemd/system/slurmdbd.service
            
            
            [Unit]
            Description=Slurm DBD accounting daemon
            After=network-online.target remote-fs.target munge.service mysql.service mysqld.service mariadb.service sssd.service
            Wants=network-online.target
            ConditionPathExists=/etc/slurm/slurmdbd.conf
            
            [Service]
            Type=simple
            EnvironmentFile=-/etc/sysconfig/slurmdbd
            EnvironmentFile=-/etc/default/slurmdbd
            User=slurm
            Group=slurm
            RuntimeDirectory=slurmdbd
            RuntimeDirectoryMode=0755
            ExecStart=/usr/sbin/slurmdbd -D -s $SLURMDBD_OPTIONS
            ExecReload=/bin/kill -HUP $MAINPID
            LimitNOFILE=65536
            
            
            # Uncomment the following lines to disable logging through journald.
            # NOTE: It may be preferable to set these through an override file instead.
            #StandardOutput=null
            #StandardError=null
            
            [Install]
            WantedBy=multi-user.target
            ```
            
Example slurmctld.service
            ```text
            # vim /usr/lib/systemd/system/slurmctld.service
            
            
            [Unit]
            Description=Slurm controller daemon
            After=network-online.target remote-fs.target munge.service sssd.service
            Wants=network-online.target
            ConditionPathExists=/etc/slurm/slurm.conf
            
            [Service]
            Type=notify
            EnvironmentFile=-/etc/sysconfig/slurmctld
            EnvironmentFile=-/etc/default/slurmctld
            User=slurm
            Group=slurm
            RuntimeDirectory=slurmctld
            RuntimeDirectoryMode=0755
            ExecStart=/usr/sbin/slurmctld --systemd $SLURMCTLD_OPTIONS
            ExecReload=/bin/kill -HUP $MAINPID
            LimitNOFILE=65536
            
            
            # Uncomment the following lines to disable logging through journald.
            # NOTE: It may be preferable to set these through an override file instead.
            #StandardOutput=null
            #StandardError=null
            
            [Install]
            WantedBy=multi-user.target
            ```
            
            Example slurmd.service
            ```text
            # vim /usr/lib/systemd/system/slurmd.service
            
            
            [Unit]
            Description=Slurm node daemon
            After=munge.service network-online.target remote-fs.target sssd.service
            Wants=network-online.target
            #ConditionPathExists=/etc/slurm/slurm.conf
            
            [Service]
            Type=notify
            EnvironmentFile=-/etc/sysconfig/slurmd
            EnvironmentFile=-/etc/default/slurmd
            RuntimeDirectory=slurm
            RuntimeDirectoryMode=0755
            ExecStart=/usr/sbin/slurmd --systemd $SLURMD_OPTIONS
            ExecReload=/bin/kill -HUP $MAINPID
            KillMode=process
            LimitNOFILE=131072
            LimitMEMLOCK=infinity
            LimitSTACK=infinity
            Delegate=yes
            
            
            # Uncomment the following lines to disable logging through journald.
            # NOTE: It may be preferable to set these through an override file instead.
            #StandardOutput=null
            #StandardError=null
            
            [Install]
            WantedBy=multi-user.target
            ```
            
            systemctl start slurmd
            systemctl enable slurmd
            Use `systemctl status slurmd` to check whether the `slurmd` service is running

            Test Your Slurm Cluster (Login Node)

            • check cluster configuration
            scontrol show config
            • check cluster status
            sinfo
            scontrol show partition
            scontrol show node
            • submit job
            srun -N2 hostname
            scontrol show jobs
            • check job status
            squeue -a
            Aug 7, 2024

            Install From Binary

            Important

            (All Nodes) means all type nodes should install this component.

            (Manager Node) means only the manager node should install this component.

            (Login Node) means only the Login (auth) node should install this component.

            (Cmp) means only the Compute node should install this component.

            Typically, three types of nodes are required to run Slurm:

            1 Manager Node, 1 Login Node and N Compute Nodes (Cmp).

            But you can also choose to install all services on a single node.

            Prerequisites

            1. change hostname (All Nodes)
              hostnamectl set-hostname (manager|auth|computeXX)
            2. modify /etc/hosts (All Nodes)
              echo "192.aa.bb.cc (manager|auth|computeXX)" >> /etc/hosts
            3. disable firewall, selinux, dnsmasq, swap (All Nodes). more detail here
            4. NFS Server (Manager Node). NFS provides the default shared file system for the cluster.
            5. NFS Client (All Nodes). All nodes should mount the NFS share.
              Install NFS Client
              mount <$nfs_server>:/data /data -o proto=tcp -o nolock
            6. Munge (All Nodes). The auth/munge plugin will be built if the MUNGE authentication development library is installed. MUNGE is used as the default authentication mechanism.
              Install Munge

              All nodes need to have the munge user and group.

              groupadd -g 1108 munge
              useradd -m -c "Munge Uid 'N' Gid Emporium" -d /var/lib/munge -u 1108 -g munge -s /sbin/nologin munge
              yum install epel-release -y
              yum install munge munge-libs munge-devel -y

              Create global secret key

              /usr/sbin/create-munge-key -r
              dd if=/dev/urandom bs=1 count=1024 > /etc/munge/munge.key

              sync the secret to the rest of the nodes

              scp -p /etc/munge/munge.key root@<$rest_node>:/etc/munge/
              ssh root@<$rest_node> "chown munge: /etc/munge/munge.key && chmod 400 /etc/munge/munge.key"
              ssh root@<$rest_node> "systemctl start munge && systemctl enable munge"

              test whether munge works

              munge -n | unmunge
            7. Database (Manager Node). MySQL support for accounting will be built if the MySQL or MariaDB development library is present. A currently supported version of MySQL or MariaDB should be used.
              Install MariaDB

              install mariadb

              yum -y install mariadb-server
              systemctl start mariadb && systemctl enable mariadb
              ROOT_PASS=$(tr -dc A-Za-z0-9 </dev/urandom | head -c 16) 
              mysql -e "ALTER USER 'root'@'localhost' IDENTIFIED BY '${ROOT_PASS}'"

              log in to mysql

              mysql -u root -p${ROOT_PASS}
              create database slurm_acct_db;
              create user slurm;
              grant all on slurm_acct_db.* TO 'slurm'@'localhost' identified by '123456' with grant option;
              flush privileges;
              quit
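
            The interactive session above can also be scripted. A minimal sketch, assuming the same database name, user, and password as the statements above; it writes the statements into a file (slurm_acct.sql, a name chosen here) that can then be fed to mysql non-interactively.

            ```shell
            # Write the accounting-DB setup statements into a file; apply later with:
            #   mysql -u root -p"${ROOT_PASS}" < slurm_acct.sql
            {
              echo "CREATE DATABASE IF NOT EXISTS slurm_acct_db;"
              echo "CREATE USER IF NOT EXISTS 'slurm'@'localhost' IDENTIFIED BY '123456';"
              echo "GRANT ALL ON slurm_acct_db.* TO 'slurm'@'localhost' WITH GRANT OPTION;"
              echo "FLUSH PRIVILEGES;"
            } > slurm_acct.sql
            cat slurm_acct.sql
            ```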

            Install Slurm

            1. create slurm user (All Nodes)
              groupadd -g 1109 slurm
              useradd -m -c "slurm manager" -d /var/lib/slurm -u 1109 -g slurm -s /bin/bash slurm
            Install Slurm in either of the following two ways:

            Build RPM package

            1. install dependencies (Manager Node)

              yum -y install gcc gcc-c++ readline-devel perl-ExtUtils-MakeMaker pam-devel rpm-build mysql-devel python3
            2. build rpm package (Manager Node)

              wget https://download.schedmd.com/slurm/slurm-24.05.2.tar.bz2 -O slurm-24.05.2.tar.bz2
              rpmbuild -ta --nodeps slurm-24.05.2.tar.bz2

              The rpm files will be placed under the $HOME/rpmbuild directory of the user building them.

            3. send rpm to rest nodes (Manager Node)

              ssh root@<$rest_node> "mkdir -p /root/rpmbuild/RPMS/"
              scp -r $HOME/rpmbuild/RPMS/x86_64 root@<$rest_node>:/root/rpmbuild/RPMS/x86_64
            4. install rpm (Manager Node)

              ssh root@<$rest_node> "yum localinstall /root/rpmbuild/RPMS/x86_64/slurm-*"
            5. modify configuration file (Manager Node)

              cp /etc/slurm/cgroup.conf.example /etc/slurm/cgroup.conf
              cp /etc/slurm/slurm.conf.example /etc/slurm/slurm.conf
              cp /etc/slurm/slurmdbd.conf.example /etc/slurm/slurmdbd.conf
              chmod 600 /etc/slurm/slurmdbd.conf
              chown slurm: /etc/slurm/slurmdbd.conf

              cgroup.conf does not need to be changed.

              edit /etc/slurm/slurm.conf, you can use this link as a reference

              edit /etc/slurm/slurmdbd.conf, you can use this link as a reference

            Install from yum repo directly

            1. install slurm (All Nodes)

              yum -y install slurm-wlm slurmdbd
            2. modify configuration file (All Nodes)

              vim /etc/slurm-llnl/slurm.conf
              vim /etc/slurm-llnl/slurmdbd.conf

              cgroup.conf does not need to be changed.

              edit /etc/slurm/slurm.conf, you can use this link as a reference

              edit /etc/slurm/slurmdbd.conf, you can use this link as a reference

            1. send configuration (Manager Node)
               scp -r /etc/slurm/*.conf  root@<$rest_node>:/etc/slurm/
               ssh root@<$rest_node> "mkdir -p /var/spool/slurmd && chown slurm: /var/spool/slurmd"
               ssh root@<$rest_node> "mkdir -p /var/log/slurm && chown slurm: /var/log/slurm"
               ssh root@<$rest_node> "mkdir -p /var/spool/slurmctld && chown slurm: /var/spool/slurmctld"
            2. start service (Manager Node)
              ssh root@<$rest_node> "systemctl start slurmdbd && systemctl enable slurmdbd"
              ssh root@<$rest_node> "systemctl start slurmctld && systemctl enable slurmctld"
            3. start service (All Nodes)
              ssh root@<$rest_node> "systemctl start slurmd && systemctl enable slurmd"

            Test

            1. show cluster status
            scontrol show config
            sinfo
            scontrol show partition
            scontrol show node
            2. submit job
            srun -N2 hostname
            scontrol show jobs
            3. check job status
            squeue -a

            Reference:

            1. https://slurm.schedmd.com/documentation.html
            2. https://wiki.fysik.dtu.dk/Niflheim_system/Slurm_installation/
            3. https://github.com/Artlands/Install-Slurm
            Aug 7, 2024

            Install From Helm Chart

            Compared with the complex binary installation, a Helm chart is a much easier way to install Slurm.

            The source code can be found at https://github.com/AaronYang0628/slurm-on-k8s

            Prerequisites

            1. Kubernetes is installed; if not, check 🔗link
            2. The Helm binary is installed; if not, check 🔗link

            Installation

            1. add helm repo and update

              helm repo add ay-helm-mirror https://aaronyang0628.github.io/helm-chart-mirror/charts
              helm repo update
            2. install slurm chart

              # wget -O slurm.values.yaml https://raw.githubusercontent.com/AaronYang0628/slurm-on-k8s/refs/heads/main/chart/values.yaml
              helm install slurm ay-helm-mirror/chart -f slurm.values.yaml --version 1.0.10

              Or you can get a template values.yaml from https://raw.githubusercontent.com/AaronYang0628/helm-chart-mirror/refs/heads/main/templates/slurm/slurm.values.yaml

            3. check chart status

              helm -n slurm list
            Aug 7, 2024

            Install From K8s Operator

            Compared with the complex binary installation, a k8s operator is an easier way to install Slurm.

            The source code can be found at https://github.com/AaronYang0628/slurm-on-k8s

            Prerequisites

            1. Kubernetes is installed; if not, check 🔗link
            2. The Helm binary is installed; if not, check 🔗link

            Installation

            1. deploy slurm operator

              kubectl apply -f https://raw.githubusercontent.com/AaronYang0628/helm-chart-mirror/refs/heads/main/templates/slurm/operator_install.yaml
              Expected Output
              [root@ay-zj-ecs operator]# kubectl apply -f https://raw.githubusercontent.com/AaronYang0628/helm-chart-mirror/refs/heads/main/templates/slurm/operator_install.yaml
              namespace/slurm created
              customresourcedefinition.apiextensions.k8s.io/slurmdeployments.slurm.ay.dev created
              serviceaccount/slurm-operator-controller-manager created
              role.rbac.authorization.k8s.io/slurm-operator-leader-election-role created
              clusterrole.rbac.authorization.k8s.io/slurm-operator-manager-role created
              clusterrole.rbac.authorization.k8s.io/slurm-operator-metrics-auth-role created
              clusterrole.rbac.authorization.k8s.io/slurm-operator-metrics-reader created
              clusterrole.rbac.authorization.k8s.io/slurm-operator-slurmdeployment-admin-role created
              clusterrole.rbac.authorization.k8s.io/slurm-operator-slurmdeployment-editor-role created
              clusterrole.rbac.authorization.k8s.io/slurm-operator-slurmdeployment-viewer-role created
              rolebinding.rbac.authorization.k8s.io/slurm-operator-leader-election-rolebinding created
              clusterrolebinding.rbac.authorization.k8s.io/slurm-operator-manager-rolebinding created
              clusterrolebinding.rbac.authorization.k8s.io/slurm-operator-metrics-auth-rolebinding created
              service/slurm-operator-controller-manager-metrics-service created
              deployment.apps/slurm-operator-controller-manager created
            2. check operator status

              kubectl -n slurm get pod
              Expected Output
              [root@ay-zj-ecs operator]# kubectl -n slurm get pod
              NAME                                READY   STATUS    RESTARTS   AGE
              slurm-operator-controller-manager   1/1     Running   0          27s
            3. apply CRD slurmdeployment

              kubectl apply -f https://raw.githubusercontent.com/AaronYang0628/helm-chart-mirror/refs/heads/main/templates/slurm/slurmdeployment.zj.values.yaml
              Expected Output
              [root@ay-zj-ecs operator]# kubectl apply -f https://raw.githubusercontent.com/AaronYang0628/helm-chart-mirror/refs/heads/main/templates/slurm/slurmdeployment.zj.values.yaml
              slurmdeployment.slurm.ay.dev/lensing created
            4. check slurmdeployment status

              kubectl get slurmdeployment
              kubectl -n slurm logs -f deploy/slurm-operator-controller-manager
              # kubectl get slurmdep
              # kubectl -n test get pods
              Expected Output
              [root@ay-zj-ecs ~]# kubectl get slurmdep -w
              NAME      CPU   GPU   LOGIN   CTLD   DBD   DBSVC   JOB COMMAND                     STATUS
              lensing   0/1   0/0   0/1     0/1    0/1   0/1     sh -c srun -N 2 /bin/hostname   
              lensing   1/2   0/0   1/1     1/1    1/1   1/1     sh -c srun -N 2 /bin/hostname   
              lensing   2/2   0/0   1/1     1/1    1/1   1/1     sh -c srun -N 2 /bin/hostname   
            5. upgrade slurmdep

              kubectl edit slurmdep lensing
              # set SlurmCPU.replicas = 3
              Expected Output
              [root@ay-zj-ecs ~]# kubectl edit slurmdep lensing
              slurmdeployment.slurm.ay.dev/lensing edited
              
              [root@ay-zj-ecs ~]# kubectl get slurmdep -w
              NAME      CPU   GPU   LOGIN   CTLD   DBD   DBSVC   JOB COMMAND                     STATUS
              lensing   2/2   0/0   1/1     1/1    1/1   1/1     sh -c srun -N 2 /bin/hostname   
              lensing   2/3   0/0   1/1     1/1    1/1   1/1     sh -c srun -N 2 /bin/hostname   
              lensing   3/3   0/0   1/1     1/1    1/1   1/1     sh -c srun -N 2 /bin/hostname   
            Aug 7, 2024

            Try OpenSCOW

            What is SCOW?

            SCOW is an HPC cluster management system built by PKU.

            The SCOW tutorial uses four virtual machines to run a Slurm cluster. It is a good way to learn how to use Slurm.

            You should check https://pkuhpc.github.io/OpenSCOW/docs/hpccluster, it works well.

            Aug 7, 2024

            Subsections of CheatSheet

            Common Environment Variables

            | Variable | Description |
            | --- | --- |
            | $SLURM_JOB_ID | The job ID. |
            | $SLURM_JOBID | Deprecated. Same as $SLURM_JOB_ID. |
            | $SLURM_SUBMIT_HOST | The hostname of the node used for job submission. |
            | $SLURM_JOB_NODELIST | The list of nodes assigned to the job. |
            | $SLURM_NODELIST | Deprecated. Same as $SLURM_JOB_NODELIST. |
            | $SLURM_CPUS_PER_TASK | Number of CPUs per task. |
            | $SLURM_CPUS_ON_NODE | Number of CPUs on the allocated node. |
            | $SLURM_JOB_CPUS_PER_NODE | Count of processors available to the job on this node. |
            | $SLURM_CPUS_PER_GPU | Number of CPUs requested per allocated GPU. |
            | $SLURM_MEM_PER_CPU | Memory per CPU. Same as --mem-per-cpu. |
            | $SLURM_MEM_PER_GPU | Memory per GPU. |
            | $SLURM_MEM_PER_NODE | Memory per node. Same as --mem. |
            | $SLURM_GPUS | Number of GPUs requested. |
            | $SLURM_NTASKS | Same as -n, --ntasks. The number of tasks. |
            | $SLURM_NTASKS_PER_NODE | Number of tasks requested per node. |
            | $SLURM_NTASKS_PER_SOCKET | Number of tasks requested per socket. |
            | $SLURM_NTASKS_PER_CORE | Number of tasks requested per core. |
            | $SLURM_NTASKS_PER_GPU | Number of tasks requested per GPU. |
            | $SLURM_NPROCS | Same as -n, --ntasks. See $SLURM_NTASKS. |
            | $SLURM_TASKS_PER_NODE | Number of tasks to be initiated on each node. |
            | $SLURM_ARRAY_JOB_ID | Job array's master job ID number. |
            | $SLURM_ARRAY_TASK_ID | Job array ID (index) number. |
            | $SLURM_ARRAY_TASK_COUNT | Total number of tasks in a job array. |
            | $SLURM_ARRAY_TASK_MAX | Job array's maximum ID (index) number. |
            | $SLURM_ARRAY_TASK_MIN | Job array's minimum ID (index) number. |

            A full list of environment variables for SLURM can be found by visiting the SLURM page on environment variables.
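
            As a sketch of how these variables are typically consumed in a job script, the snippet below fakes the array-related variables (outside a real job they are unset) and derives a per-task input file name from them:

            ```shell
            # Fake the variables Slurm would export inside an array job,
            # purely to illustrate how a script consumes them.
            export SLURM_ARRAY_JOB_ID=1001
            export SLURM_ARRAY_TASK_ID=3
            export SLURM_ARRAY_TASK_COUNT=4

            # Derive this task's input file from its array index
            DATA_PART="data_part_${SLURM_ARRAY_TASK_ID}.txt"
            echo "job ${SLURM_ARRAY_JOB_ID}: task ${SLURM_ARRAY_TASK_ID}/${SLURM_ARRAY_TASK_COUNT} reads ${DATA_PART}"
            ```

            Inside a real job the export lines are unnecessary; Slurm sets these variables for you.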

            Aug 7, 2024

            File Operations

            File Distribution

            • sbcast is used to transfer a file from local disk to local disk on the nodes allocated to a job. This can be used to effectively use diskless compute nodes or provide improved performance relative to a shared file system.
              • Features
                1. distribute files: quickly copy files to all compute nodes assigned to the job, avoiding the hassle of manual distribution. Faster than traditional scp or rsync, especially when distributing to multiple nodes.
                2. simplify scripts: a single command distributes files to all nodes assigned to the job.
                3. improve performance: parallel transfers speed up file distribution, especially for large or multiple files.
              • Usage
                1. Standalone
                sbcast <source_file> <destination_path>
                2. Embedded in a job script
                #!/bin/bash
                #SBATCH --job-name=example_job
                #SBATCH --output=example_job.out
                #SBATCH --error=example_job.err
                #SBATCH --partition=compute
                #SBATCH --nodes=4
                
                # Use sbcast to distribute the file to the /tmp directory of each node
                sbcast data.txt /tmp/data.txt
                
                # Run your program using the distributed files
                srun my_program /tmp/data.txt

            File Collection

            1. File Redirection When submitting a job, you can use the #SBATCH --output and #SBATCH --error directives to redirect standard output and standard error to the specified files.

               #SBATCH --output=output.txt
               #SBATCH --error=error.txt

              Or

              sbatch -N2 -w "compute[01-02]" -o result/file/path xxx.slurm
            2. Copy manually Use scp or rsync in the job to copy the result files from the compute nodes to the submit node

            3. Using NFS If a shared file system (such as NFS, Lustre, or GPFS) is configured in the computing cluster, the result files can be written directly to the shared directory. That way, the result files generated by all nodes are automatically stored in the same location.

            4. Using sbcast
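
            Slurm expands placeholders such as %j (job ID) and %A/%a (array master job ID and task index) in the --output/--error patterns used above. The helper below is not Slurm itself, just a hypothetical illustration of how such a pattern expands for given IDs:

            ```shell
            # Hypothetical helper mimicking Slurm's filename-pattern expansion:
            # %A -> array master job ID, %a -> array task index, %j -> job ID
            expand_pattern() {
              local pattern=$1 jobid=$2 taskid=$3
              pattern=${pattern//%A/$jobid}
              pattern=${pattern//%a/$taskid}
              pattern=${pattern//%j/$jobid}
              printf '%s\n' "$pattern"
            }

            expand_pattern "process_data_%A_%a.out" 1234 7   # -> process_data_1234_7.out
            ```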

            Aug 7, 2024

            Submit Jobs

            3 Types of Jobs

            • srun is used to submit a job for execution or initiate job steps in real time.

              • Example
                1. run shell
                srun -N2 /bin/hostname
                2. run script
                srun -N1 test.sh
                3. exec into a slurmd node
                srun -w slurm-lensing-slurm-slurmd-cpu-2 --pty /bin/bash
            • sbatch is used to submit a job script for later execution. The script will typically contain one or more srun commands to launch parallel tasks.

              • Example

                1. submit a batch job
                sbatch -N2 -w "compute[01-02]" -o job.stdout /data/jobs/batch-job.slurm
                batch-job.slurm
                #!/bin/bash
                
                #SBATCH -N 1
                #SBATCH --job-name=cpu-N1-batch
                #SBATCH --partition=compute
                #SBATCH --mail-type=end
                #SBATCH --mail-user=xxx@email.com
                #SBATCH --output=%j.out
                #SBATCH --error=%j.err
                
                srun -l /bin/hostname #you can still write srun <command> in here
                srun -l pwd
                
                2. submit a parallel task to process different data partitions
                sbatch /data/jobs/parallel.slurm
                parallel.slurm
                #!/bin/bash
                #SBATCH -N 2 
                #SBATCH --job-name=cpu-N2-parallel
                #SBATCH --partition=compute
                #SBATCH --time=01:00:00
                #SBATCH --array=1-4  # define a job array, assuming 4 data shards
                #SBATCH --ntasks-per-node=1 # run only one task per node
                #SBATCH --output=process_data_%A_%a.out
                #SBATCH --error=process_data_%A_%a.err
                
                TASK_ID=${SLURM_ARRAY_TASK_ID}
                
                DATA_PART="data_part_${TASK_ID}.txt" #make sure you have that file
                
                if [ -f ${DATA_PART} ]; then
                    echo "Processing ${DATA_PART} on node $(hostname)"
                    # python process_data.py --input ${DATA_PART}
                else
                    echo "File ${DATA_PART} does not exist!"
                fi
                
                how to split the input file
                split -l 1000 data.txt data_part_ 
                && mv data_part_aa data_part_1 
                && mv data_part_ab data_part_2
                
            • salloc is used to allocate resources for a job in real time. Typically this is used to allocate resources and spawn a shell. The shell is then used to execute srun commands to launch parallel tasks.

              • Example
                1. allocate resources (similar to creating a virtual machine)
                salloc -N2 bash
                This command creates a job that allocates 2 nodes and spawns a bash shell. You can then execute srun commands in that environment. After your computing task finishes, remember to shut down the job:
                scancel <$job_id>
                When you exit the job, the resources are released.
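
            The split-and-rename pattern mentioned in the sbatch example above can be tried locally. A small sketch with a generated input file (file names are illustrative):

            ```shell
            # Create a small sample input, split it into 2-line parts,
            # then rename the alphabetic suffixes (aa, ab, ...) to numeric ones.
            seq 1 4 > data.txt
            split -l 2 data.txt data_part_

            n=1
            for f in data_part_??; do
              mv "$f" "data_part_${n}"
              n=$((n + 1))
            done

            ls data_part_[0-9]*
            ```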
            Aug 7, 2024

            Configuration Files

            Aug 7, 2024

            Subsections of MPI Libs

            Test Intel MPI Jobs

            Using MPI (Message Passing Interface) for parallel computing on a SLURM cluster typically involves the following steps:

            1. Install an MPI library

            Make sure an MPI library is installed on your cluster nodes. Common MPI implementations include:

            • OpenMPI
            • Intel MPI
            • MPICH

            You can check whether MPI is installed with the following commands:
            mpicc --version  # check the MPI compiler
            mpirun --version # check the MPI runtime environment

            2. Test MPI performance

            mpirun -n 2 IMB-MPI1 pingpong

            3. Compile the MPI program

            You can compile MPI programs with mpicc (for C) or mpic++ (for C++).

            Below are two example programs: a simple MPI "Hello, World!" program, assumed to be saved as hello_mpi.c, and a vector dot-product example saved as dot_product.c. Pick either one:

            #include <stdio.h>
            #include <mpi.h>
            
            int main(int argc, char *argv[]) {
                int rank, size;
                
                // Initialize the MPI environment
                MPI_Init(&argc, &argv);
            
                // Get the rank of the current process and the total number of processes
                MPI_Comm_rank(MPI_COMM_WORLD, &rank);
                MPI_Comm_size(MPI_COMM_WORLD, &size);
            
                // Print this process's information
                printf("Hello, World! I am process %d out of %d processes.\n", rank, size);
            
                // Finalize the MPI environment
                MPI_Finalize();
            
                return 0;
            }
            #include <stdio.h>
            #include <stdlib.h>
            #include <mpi.h>
            
            #define N 8  // vector size
            
            // Compute the local partial dot product of two vectors
            double compute_local_dot_product(double *A, double *B, int start, int end) {
                double local_dot = 0.0;
                for (int i = start; i < end; i++) {
                    local_dot += A[i] * B[i];
                }
                return local_dot;
            }
            
            void print_vector(double *Vector) {
                for (int i = 0; i < N; i++) {
                    printf("%f ", Vector[i]);   
                }
                printf("\n");
            }
            
            int main(int argc, char *argv[]) {
                int rank, size;
            
                // Initialize the MPI environment
                MPI_Init(&argc, &argv);
                MPI_Comm_rank(MPI_COMM_WORLD, &rank);
                MPI_Comm_size(MPI_COMM_WORLD, &size);
            
                // Vectors A and B
                double A[N], B[N];
            
                // Process 0 initializes vectors A and B
                if (rank == 0) {
                    for (int i = 0; i < N; i++) {
                    A[i] = i + 1;  // sample data
                    B[i] = (i + 1) * 2;  // sample data
                    }
                }
            
                // Broadcast vectors A and B to all processes
                MPI_Bcast(A, N, MPI_DOUBLE, 0, MPI_COMM_WORLD);
                MPI_Bcast(B, N, MPI_DOUBLE, 0, MPI_COMM_WORLD);
            
                // Each process computes its own portion
                int local_n = N / size;  // number of elements handled by each process
                int start = rank * local_n;
                int end = (rank + 1) * local_n;
                
                // The last process handles any remaining elements (N % size)
                if (rank == size - 1) {
                    end = N;
                }
            
                double local_dot_product = compute_local_dot_product(A, B, start, end);
            
                // Use MPI_Reduce to sum the local dot products into process 0
                double global_dot_product = 0.0;
                MPI_Reduce(&local_dot_product, &global_dot_product, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
            
                // Process 0 prints the final result
                if (rank == 0) {
                    printf("Vector A is\n");
                    print_vector(A);
                    printf("Vector B is\n");
                    print_vector(B);
                    printf("Dot Product of A and B: %f\n", global_dot_product);
                }
            
                // Finalize the MPI environment
                MPI_Finalize();
                return 0;
            }
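
            As a quick sanity check of what dot_product.c should print: with N = 8, A[i] = i + 1 and B[i] = 2 * (i + 1), the dot product is the sum of 2 * (i + 1)^2 for i = 0..7, i.e. 2 * (1 + 4 + ... + 64) = 408. The same arithmetic in shell:

            ```shell
            # Sum 2*k*k for k = 1..8; this is the value the MPI program should report
            dot=0
            for k in $(seq 1 8); do
              dot=$((dot + 2 * k * k))
            done
            echo "expected dot product: $dot"   # 408
            ```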

            4. Create a Slurm job script

            Create a SLURM job script to run the MPI program. Below is a basic job script, assumed to be saved as mpi_test.slurm:

            #!/bin/bash
            #SBATCH --job-name=mpi_job       # Job name
            #SBATCH --nodes=2                # Number of nodes to use
            #SBATCH --ntasks-per-node=1      # Number of tasks per node
            #SBATCH --time=00:10:00          # Time limit
            #SBATCH --output=mpi_test_output_%j.log     # Standard output file
            #SBATCH --error=mpi_test_output_%j.err     # Standard error file
            
            # Manually set Intel OneAPI MPI and Compiler environment
            export I_MPI_PMI=pmi2
            export I_MPI_PMI_LIBRARY=/usr/lib/x86_64-linux-gnu/slurm/mpi_pmi2.so
            export I_MPI_ROOT=/opt/intel/oneapi/mpi/2021.14
            export INTEL_COMPILER_ROOT=/opt/intel/oneapi/compiler/2025.0
            export PATH=$I_MPI_ROOT/bin:$INTEL_COMPILER_ROOT/bin:$PATH
            export LD_LIBRARY_PATH=$I_MPI_ROOT/lib:$INTEL_COMPILER_ROOT/lib:$LD_LIBRARY_PATH
            export MANPATH=$I_MPI_ROOT/man:$INTEL_COMPILER_ROOT/man:$MANPATH
            
            # Compile the MPI program
            icx-cc -I$I_MPI_ROOT/include  hello_mpi.c -o hello_mpi -L$I_MPI_ROOT/lib -lmpi
            
            # Run the MPI job
            
            mpirun -np 2 ./hello_mpi
            #!/bin/bash
            #SBATCH --job-name=mpi_job       # Job name
            #SBATCH --nodes=2                # Number of nodes to use
            #SBATCH --ntasks-per-node=1      # Number of tasks per node
            #SBATCH --time=00:10:00          # Time limit
            #SBATCH --output=mpi_test_output_%j.log     # Standard output file
            #SBATCH --error=mpi_test_output_%j.err     # Standard error file
            
            # Manually set Intel OneAPI MPI and Compiler environment
            export I_MPI_PMI=pmi2
            export I_MPI_PMI_LIBRARY=/usr/lib/x86_64-linux-gnu/slurm/mpi_pmi2.so
            export I_MPI_ROOT=/opt/intel/oneapi/mpi/2021.14
            export INTEL_COMPILER_ROOT=/opt/intel/oneapi/compiler/2025.0
            export PATH=$I_MPI_ROOT/bin:$INTEL_COMPILER_ROOT/bin:$PATH
            export LD_LIBRARY_PATH=$I_MPI_ROOT/lib:$INTEL_COMPILER_ROOT/lib:$LD_LIBRARY_PATH
            export MANPATH=$I_MPI_ROOT/man:$INTEL_COMPILER_ROOT/man:$MANPATH
            
            # Compile the MPI program
            icx-cc -I$I_MPI_ROOT/include  dot_product.c -o dot_product -L$I_MPI_ROOT/lib -lmpi
            
            # Run the MPI job
            
            mpirun -np 2 ./dot_product

            5. Compile the MPI program

            Before running the job, compile the MPI program with mpicc on the cluster. Assuming the programs are saved as hello_mpi.c and dot_product.c, compile them with:

            mpicc -o hello_mpi hello_mpi.c
            mpicc -o dot_product dot_product.c

            6. Submit the Slurm job

            Save the job script above (mpi_test.slurm) and submit it with:

            sbatch mpi_test.slurm

            7. Check the job status

            You can check the status of the job with:

            squeue -u <your_username>

            8. Check the output

            After the job completes, the output is saved to the files specified in the job script (e.g. mpi_test_output_<job_id>.log). You can view it with cat or any text editor:

            cat mpi_test_output_*.log

            Example output: if everything works, the output will look similar to:

            Hello, World! I am process 0 out of 2 processes.
            Hello, World! I am process 1 out of 2 processes.
            Result Matrix C (A * B):
            14 8 2 -4 
            20 10 0 -10 
            -1189958655 1552515295 21949 -1552471397 
            0 0 0 0 
            Aug 7, 2024

            Test Open MPI Jobs

            Using MPI (Message Passing Interface) for parallel computing on a SLURM cluster typically involves the following steps:

            1. Install an MPI library

            Make sure an MPI library is installed on your cluster nodes. Common MPI implementations include:

            • OpenMPI
            • Intel MPI
            • MPICH

            You can check whether MPI is installed with the following commands:
            mpicc --version  # check the MPI compiler
            mpirun --version # check the MPI runtime environment

            2. Compile the MPI program

            You can compile MPI programs with mpicc (for C) or mpic++ (for C++).

            Below are two example programs: a simple MPI "Hello, World!" program, assumed to be saved as hello_mpi.c, and a vector dot-product example saved as dot_product.c. Pick either one:

            #include <stdio.h>
            #include <mpi.h>
            
            int main(int argc, char *argv[]) {
                int rank, size;
                
                // Initialize the MPI environment
                MPI_Init(&argc, &argv);
            
                // Get the rank of the current process and the total number of processes
                MPI_Comm_rank(MPI_COMM_WORLD, &rank);
                MPI_Comm_size(MPI_COMM_WORLD, &size);
            
                // Print this process's information
                printf("Hello, World! I am process %d out of %d processes.\n", rank, size);
            
                // Finalize the MPI environment
                MPI_Finalize();
            
                return 0;
            }
            #include <stdio.h>
            #include <stdlib.h>
            #include <mpi.h>
            
            #define N 8  // vector size
            
            // Compute the local partial dot product of two vectors
            double compute_local_dot_product(double *A, double *B, int start, int end) {
                double local_dot = 0.0;
                for (int i = start; i < end; i++) {
                    local_dot += A[i] * B[i];
                }
                return local_dot;
            }
            
            void print_vector(double *Vector) {
                for (int i = 0; i < N; i++) {
                    printf("%f ", Vector[i]);   
                }
                printf("\n");
            }
            
            int main(int argc, char *argv[]) {
                int rank, size;
            
                // 初始化MPI环境
                MPI_Init(&argc, &argv);
                MPI_Comm_rank(MPI_COMM_WORLD, &rank);
                MPI_Comm_size(MPI_COMM_WORLD, &size);
            
                // 向量A和B
                double A[N], B[N];
            
                // 进程0初始化向量A和B
                if (rank == 0) {
                    for (int i = 0; i < N; i++) {
                        A[i] = i + 1;  // 示例数据
                        B[i] = (i + 1) * 2;  // 示例数据
                    }
                }
            
                // 广播向量A和B到所有进程
                MPI_Bcast(A, N, MPI_DOUBLE, 0, MPI_COMM_WORLD);
                MPI_Bcast(B, N, MPI_DOUBLE, 0, MPI_COMM_WORLD);
            
                // 每个进程计算自己负责的部分
                int local_n = N / size;  // 每个进程处理的元素个数
                int start = rank * local_n;
                int end = (rank + 1) * local_n;
                
                // 如果是最后一个进程,确保处理所有剩余的元素(处理N % size)
                if (rank == size - 1) {
                    end = N;
                }
            
                double local_dot_product = compute_local_dot_product(A, B, start, end);
            
                // 使用MPI_Reduce将所有进程的局部点积结果汇总到进程0
                double global_dot_product = 0.0;
                MPI_Reduce(&local_dot_product, &global_dot_product, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
            
                // 进程0输出最终结果
                if (rank == 0) {
                    printf("Vector A is\n");
                    print_vector(A);
                    printf("Vector B is\n");
                    print_vector(B);
                    printf("Dot Product of A and B: %f\n", global_dot_product);
                }
            
                // 结束MPI环境
                MPI_Finalize();
                return 0;
            }
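Before submitting anything to SLURM, you can sanity-check the result the dot-product program should print. Here is a quick sketch in plain Python, mirroring the initialization in dot_product.c (A[i] = i + 1, B[i] = (i + 1) * 2, N = 8):

```python
# Mirror the initialization in dot_product.c
N = 8
A = [i + 1 for i in range(N)]
B = [(i + 1) * 2 for i in range(N)]

# The serial dot product that the MPI_Reduce result should match
dot = sum(a * b for a, b in zip(A, B))
print(f"Dot Product of A and B: {dot:.6f}")  # 408.000000
```

Since B = 2 * A, the result is 2 * (1^2 + ... + 8^2) = 2 * 204 = 408.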

3. Create a SLURM job script

Create a SLURM job script to run the MPI program. Below is a basic SLURM job script, assumed to be saved as mpi_test.slurm:

#!/bin/bash
#SBATCH --job-name=mpi_test                 # job name
#SBATCH --nodes=2                           # number of nodes requested
#SBATCH --ntasks-per-node=1                 # tasks per node
#SBATCH --time=00:10:00                     # maximum walltime
#SBATCH --output=mpi_test_output_%j.log     # output log file

# Load the MPI module (if using a module environment)
module load openmpi

# Run the MPI program
mpirun --allow-run-as-root -np 2 ./hello_mpi
#!/bin/bash
#SBATCH --job-name=mpi_test                 # job name
#SBATCH --nodes=2                           # number of nodes requested
#SBATCH --ntasks-per-node=1                 # tasks per node
#SBATCH --time=00:10:00                     # maximum walltime
#SBATCH --output=mpi_test_output_%j.log     # output log file

# Load the MPI module (if using a module environment)
module load openmpi

# Run the MPI program
mpirun --allow-run-as-root -np 2 ./dot_product

4. Compile the MPI program

Before submitting the job, compile the MPI program on the cluster with mpicc. Assuming the sources are saved as hello_mpi.c and dot_product.c, compile them with:

            mpicc -o hello_mpi hello_mpi.c
            mpicc -o dot_product dot_product.c

5. Submit the SLURM job

Save the job script above (mpi_test.slurm) and submit it with:

            sbatch mpi_test.slurm

6. Check the job status

You can check the status of your job with:

            squeue -u <your_username>

7. Check the output

Once the job completes, the output is saved in the file specified in your job script (e.g. mpi_test_output_<job_id>.log). You can view it with cat or any text editor:

cat mpi_test_output_*.log

Example output: if everything works, the output will look similar to:

For hello_mpi:

Hello, World! I am process 0 out of 2 processes.
Hello, World! I am process 1 out of 2 processes.

For dot_product:

Vector A is
1.000000 2.000000 3.000000 4.000000 5.000000 6.000000 7.000000 8.000000 
Vector B is
2.000000 4.000000 6.000000 8.000000 10.000000 12.000000 14.000000 16.000000 
Dot Product of A and B: 408.000000
            Aug 7, 2024

            Data Warehouse

              Mar 7, 2024

              Subsections of 🧪Demo

              Agent

              Aug 7, 2024

              Subsections of Game

              LOL Overlay Assistant

Using deep learning techniques to help you win the game.

              State Machine Event Bus Python 3.6 TensorFlow2 Captain InfoNew Awesome

              ScreenShots

              There are four main funcs in this tool.

1. The first one detects your game client process and recognizes which
  status you are in.
                func1 func1

2. The second one recommends some champions to play.
  Based on the champions banned by the enemy team, this tool will provide you
  three more choices to counter your enemies.
                func2 func2

3. The third func scans the mini-map; when someone is heading toward you,
  a notification window pops up.
                func3 func3

4. The last func provides some gear recommendations based on your
  enemy's item list.
                fun4 fun4

              Framework

              mvc mvc

              Checkout in Bilibili

              Checkout in Youtube

              Repo

              you can get code from github, gitee

              Mar 8, 2021

              Roller Coin Assistant

Using deep learning techniques to help you mine cryptos such as BTC, ETH and DOGE.

              ScreenShots

              There are two main funcs in this tool.

1. Help you crack the games.
• only supports the ‘Coin-Flip’ game for now.

  Right, rollercoin.com has decreased the payout of this game; that's why I made the repo public. update

2. Help you pass the geetest.

              How to use

              1. open a web browser.
2. go to https://rollercoin.com and create an account.
3. keep the language set to ‘English’ (you can click the bottom button to change it).
              4. click the ‘Game’ button.
              5. start the application, and enjoy it.

              Tips

1. only supports 1920*1080, 2560*1440 and higher resolution screens.
2. if you use a 1920*1080 screen, it is strongly recommended to fullscreen your web browser.

              Repo

              you can get code from gitee

              Mar 8, 2023

              Subsections of HPC

              Slurm On K8S

              slurm_on_k8s slurm_on_k8s

Trying to run a Slurm cluster on Kubernetes

              Install

You can directly use Helm to manage this Slurm chart

              1. helm repo add ay-helm-mirror https://aaronyang0628.github.io/helm-chart-mirror/charts
              2. helm install slurm ay-helm-mirror/slurm --version 1.0.4

              And then, you should see something like this func1 func1

              Also, you can modify the values.yaml by yourself, and reinstall the slurm cluster

              helm upgrade --create-namespace -n slurm --install -f ./values.yaml slurm ay-helm-mirror/slurm --version=1.0.4
              Important

And you can even build your own images, especially if you want to use your own libs. For now, the images we use are:

login -> docker.io/aaron666/slurm-login:intel-mpi

slurmd -> docker.io/aaron666/slurm-slurmd:intel-mpi

slurmctld -> docker.io/aaron666/slurm-slurmctld:latest

slurmdbd -> docker.io/aaron666/slurm-slurmdbd:latest

munged -> docker.io/aaron666/slurm-munged:latest

              Aug 7, 2024

              Slurm Operator

If you want to change the Slurm configuration, please check the Slurm configuration generator: click

              • for helm user

                just run for fun!

                1. helm repo add ay-helm-repo https://aaronyang0628.github.io/helm-chart-mirror/charts
                2. helm install slurm ay-helm-repo/slurm --version 1.0.4
• for operator users

                pull an image and apply

                1. docker pull aaron666/slurm-operator:latest
                2. kubectl apply -f https://raw.githubusercontent.com/AaronYang0628/helm-chart-mirror/refs/heads/main/templates/slurm/install.yaml
                3. kubectl apply -f https://raw.githubusercontent.com/AaronYang0628/helm-chart-mirror/refs/heads/main/templates/slurm/slurmdeployment.values.yaml
              Aug 7, 2024

              Subsections of Plugins

              Flink S3 F3 Multiple

Normally, Flink can access only one S3 endpoint at runtime. But we need to process files from multiple MinIO instances simultaneously.

So I modified the original flink-s3-fs-hadoop plugin to enable Flink to do so.

              StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
              env.enableCheckpointing(5000L, CheckpointingMode.EXACTLY_ONCE);
              env.setParallelism(1);
              env.setStateBackend(new HashMapStateBackend());
              env.getCheckpointConfig().setCheckpointStorage("file:///./checkpoints");
              
              final FileSource<String> source =
                  FileSource.forRecordStreamFormat(
                          new TextLineInputFormat(),
                          new Path(
                              "s3u://admin:ZrwpsezF1Lt85dxl@10.11.33.132:9000/user-data/home/conti/2024-02-08--10"))
                      .build();
              
              final FileSource<String> source2 =
                  FileSource.forRecordStreamFormat(
                          new TextLineInputFormat(),
                          new Path(
                              "s3u://minioadmin:minioadmin@10.101.16.72:9000/user-data/home/conti"))
                      .build();
              
              env.fromSource(source, WatermarkStrategy.noWatermarks(), "file-source")
                  .union(env.fromSource(source2, WatermarkStrategy.noWatermarks(), "file-source2"))
                  .print("union-result");
                  
              env.execute();
              original usage example

Using the default flink-s3-fs-hadoop, the configuration values are set into the Hadoop configuration map. Only one set of values is in effect at a time, so there is no way for a user to address different S3 endpoints within a single job context.

              Configuration pluginConfiguration = new Configuration();
              pluginConfiguration.setString("s3a.access-key", "admin");
              pluginConfiguration.setString("s3a.secret-key", "ZrwpsezF1Lt85dxl");
              pluginConfiguration.setString("s3a.connection.maximum", "1000");
              pluginConfiguration.setString("s3a.endpoint", "http://10.11.33.132:9000");
              pluginConfiguration.setBoolean("s3a.path.style.access", Boolean.TRUE);
              FileSystem.initialize(
                  pluginConfiguration, PluginUtils.createPluginManagerFromRootFolder(pluginConfiguration));
              
              StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
              env.enableCheckpointing(5000L, CheckpointingMode.EXACTLY_ONCE);
              env.setParallelism(1);
              env.setStateBackend(new HashMapStateBackend());
              env.getCheckpointConfig().setCheckpointStorage("file:///./checkpoints");
              
              final FileSource<String> source =
                  FileSource.forRecordStreamFormat(
                          new TextLineInputFormat(), new Path("s3a://user-data/home/conti/2024-02-08--10"))
                      .build();
              env.fromSource(source, WatermarkStrategy.noWatermarks(), "file-source").print();
              
              env.execute();

              Usage


              Install From

For now, you can directly download flink-s3-fs-hadoop-$VERSION.jar and load it in your project.
$VERSION is the Flink version you are using.

                implementation(files("flink-s3-fs-hadoop-$flinkVersion.jar"))
<dependency>
    <groupId>org.apache.flink</groupId>
    <artifactId>flink-s3-fs-hadoop</artifactId>
    <version>$flinkVersion</version>
    <scope>system</scope>
    <systemPath>${project.basedir}/flink-s3-fs-hadoop-$flinkVersion.jar</systemPath>
</dependency>
The jar we provide is based on the original flink-s3-fs-hadoop plugin, so you should use the original protocol prefix s3a://.

Or you can wait for the PR; after it is merged into Flink master, you won't need to do anything except update your Flink version,
and you can directly use s3u://.
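The idea behind the s3u:// scheme is that each path carries its own credentials and endpoint, which is what lets two sources in one job talk to different MinIO instances. As a rough illustration (a sketch only, not the plugin's actual parsing code), such a URI can be decomposed like this:

```python
from urllib.parse import urlparse

def split_s3u(uri: str) -> dict:
    """Split an s3u://-style URI into endpoint credentials and object path.
    Illustrative only; the real plugin resolves paths inside Hadoop's FileSystem."""
    parts = urlparse(uri)
    # First path segment is the bucket, the rest is the object key
    bucket, _, key = parts.path.lstrip("/").partition("/")
    return {
        "access_key": parts.username,
        "secret_key": parts.password,
        "endpoint": f"{parts.hostname}:{parts.port}",
        "bucket": bucket,
        "key": key,
    }

info = split_s3u("s3u://minioadmin:minioadmin@10.101.16.72:9000/user-data/home/conti")
print(info["endpoint"], info["bucket"], info["key"])
# 10.101.16.72:9000 user-data home/conti
```

Because each FileSource's path is self-describing, no global Hadoop configuration entry has to be shared between the two sources.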

              Repo

              you can get code from github, gitlab

              Mar 8, 2024

              Subsections of Stream

              Cosmic Antenna

              Design Architecture

              • objects

continuously processing antenna signal records, converting them into 3-dimensional data matrices, and sending them to different astronomical algorithm endpoints. asdsaa asdsaa

              • how data flows

              asdsaa asdsaa
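As a toy illustration of the folding step (hypothetical field names and shapes, not the project's actual schema), flat antenna samples can be folded into a 3-dimensional matrix before being sent to an algorithm endpoint:

```python
def fold_records(records, n_antenna, n_channel, n_time):
    """Fold flat (antenna, channel, time, value) samples into a 3-D nested list."""
    matrix = [[[0.0] * n_time for _ in range(n_channel)] for _ in range(n_antenna)]
    for antenna, channel, t, value in records:
        matrix[antenna][channel][t] = value
    return matrix

# 2 antennas x 2 channels x 3 time slots
records = [(0, 0, 0, 1.5), (0, 1, 2, 2.5), (1, 0, 1, 3.5)]
m = fold_records(records, 2, 2, 3)
print(m[0][1][2])  # 2.5
```

In the real pipeline this folding is done continuously by a Flink job over the antenna stream rather than over an in-memory list.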

              Building From Zero

Following these steps, you can build cosmic-antenna from scratch.

              1. install podman

you can check the article Install Podman

              2. install kind and kubectl

you can check the article install kubectl

              # create a cluster using podman
              curl -o kind.cluster.yaml -L https://gitlab.com/-/snippets/3686427/raw/main/kind-cluster.yaml \
              && export KIND_EXPERIMENTAL_PROVIDER=podman \
              && kind create cluster --name cs-cluster --image m.daocloud.io/docker.io/kindest/node:v1.27.3 --config=./kind.cluster.yaml
              Modify ~/.kube/config

              vim ~/.kube/config

              in line 5, change server: http://::xxxx -> server: http://0.0.0.0:xxxxx

              asdsaa asdsaa

3. [Optional] pre-download slow images

              DOCKER_IMAGE_PATH=/root/docker-images && mkdir -p $DOCKER_IMAGE_PATH
              BASE_URL="https://resource-ops-dev.lab.zjvis.net:32443/docker-images"
              for IMAGE in "quay.io_argoproj_argocd_v2.9.3.dim" \
                  "ghcr.io_dexidp_dex_v2.37.0.dim" \
                  "docker.io_library_redis_7.0.11-alpine.dim" \
                  "docker.io_library_flink_1.17.dim"
              do
                  IMAGE_FILE=$DOCKER_IMAGE_PATH/$IMAGE
                  if [ ! -f $IMAGE_FILE ]; then
                      TMP_FILE=$IMAGE_FILE.tmp \
                      && curl -o "$TMP_FILE" -L "$BASE_URL/$IMAGE" \
                      && mv $TMP_FILE $IMAGE_FILE
                  fi
                  kind -n cs-cluster load image-archive $IMAGE_FILE
              done

              4. install argocd

you can check the article Install ArgoCD

              5. install essential app on argocd

              # install cert manger    
              curl -LO https://gitlab.com/-/snippets/3686424/raw/main/cert-manager.yaml \
              && kubectl -n argocd apply -f cert-manager.yaml \
              && argocd app sync argocd/cert-manager
              
              # install ingress
              curl -LO https://gitlab.com/-/snippets/3686426/raw/main/ingress-nginx.yaml \
              && kubectl -n argocd apply -f ingress-nginx.yaml \
              && argocd app sync argocd/ingress-nginx
              
              # install flink-kubernetes-operator
              curl -LO https://gitlab.com/-/snippets/3686429/raw/main/flink-operator.yaml \
              && kubectl -n argocd apply -f flink-operator.yaml \
              && argocd app sync argocd/flink-operator

              6. install git

              sudo dnf install -y git \
              && rm -rf $HOME/cosmic-antenna-demo \
              && mkdir $HOME/cosmic-antenna-demo \
              && git clone --branch pv_pvc_template https://github.com/AaronYang2333/cosmic-antenna-demo.git $HOME/cosmic-antenna-demo

              7. prepare application image

              # cd into  $HOME/cosmic-antenna-demo
              sudo dnf install -y java-11-openjdk.x86_64 \
              && $HOME/cosmic-antenna-demo/gradlew :s3sync:buildImage \
              && $HOME/cosmic-antenna-demo/gradlew :fpga-mock:buildImage
              # save and load into cluster
              VERSION="1.0.3"
              podman save --quiet -o $DOCKER_IMAGE_PATH/fpga-mock_$VERSION.dim localhost/fpga-mock:$VERSION \
              && kind -n cs-cluster load image-archive $DOCKER_IMAGE_PATH/fpga-mock_$VERSION.dim
              podman save --quiet -o $DOCKER_IMAGE_PATH/s3sync_$VERSION.dim localhost/s3sync:$VERSION \
              && kind -n cs-cluster load image-archive $DOCKER_IMAGE_PATH/s3sync_$VERSION.dim
              Modify role config
              kubectl -n flink edit role/flink -o yaml

              add services and endpoints to the rules.resources

              8. prepare k8s resources [pv, pvc, sts]

              cp -rf $HOME/cosmic-antenna-demo/flink/*.yaml /tmp \
              && podman exec -d cs-cluster-control-plane mkdir -p /mnt/flink-job
              # create persist volume
              kubectl -n flink create -f /tmp/pv.template.yaml
              # create pv claim
              kubectl -n flink create -f /tmp/pvc.template.yaml
              # start up flink application
              kubectl -n flink create -f /tmp/job.template.yaml
              # start up ingress
              kubectl -n flink create -f /tmp/ingress.forward.yaml
              # start up fpga UDP client, sending data 
              cp $HOME/cosmic-antenna-demo/fpga-mock/client.template.yaml /tmp \
              && kubectl -n flink create -f /tmp/client.template.yaml

              9. check dashboard in browser

              http://job-template-example.flink.lab.zjvis.net

              Repo

              you can get code from github


              Reference

              1. https://github.com/ben-wangz/blog/tree/main/docs/content/6.kubernetes/7.installation/ha-cluster
              Mar 7, 2024

              Subsections of Design

              Yaml Crawler

              Steps

1. define which web URL you want to crawl, let's say https://www.xxx.com/aaa.apex
              2. create a page pojo org.example.business.page.MainPage to describe that page

Then you can create a YAML file named root-pages.yaml with the following content:

              - '@class': "org.example.business.page.MainPage"
                url: "https://www.xxx.com/aaa.apex"
3. and then define a process-flow YAML file, describing how to process the web pages the crawler will meet.
              processorChain:
                - '@class': "org.example.crawler.core.processor.decorator.ExceptionRecord"
                  processor:
                    '@class': "org.example.crawler.core.processor.decorator.RetryControl"
                    processor:
                      '@class': "org.example.crawler.core.processor.decorator.SpeedControl"
                      processor:
                        '@class': "org.example.business.hs.code.MainPageProcessor"
                        application: "app-name"
                      time: 100
                      unit: "MILLISECONDS"
                    retryTimes: 1
                - '@class': "org.example.crawler.core.processor.decorator.ExceptionRecord"
                  processor:
                    '@class': "org.example.crawler.core.processor.decorator.RetryControl"
                    processor:
                      '@class': "org.example.crawler.core.processor.decorator.SpeedControl"
                      processor:
                        '@class': "org.example.crawler.core.processor.download.DownloadProcessor"
                        pagePersist:
                          '@class': "org.example.business.persist.DownloadPageDatabasePersist"
                          downloadPageRepositoryBeanName: "downloadPageRepository"
                        downloadPageTransformer:
                          '@class': "org.example.crawler.download.DefaultDownloadPageTransformer"
                        skipExists:
                          '@class': "org.example.crawler.download.SkipExistsById"
                      time: 1
                      unit: "SECONDS"
                    retryTimes: 1
              nThreads: 1
              pollWaitingTime: 30
              pollWaitingTimeUnit: "SECONDS"
              waitFinishedTimeout: 180
              waitFinishedTimeUnit: "SECONDS" 

ExceptionRecord, RetryControl, and SpeedControl are provided by the yaml crawler itself, so don't worry about them. You only need to define how to process your page MainPage; for example, here you defined a MainPageProcessor. Each processor produces a set of other pages or DownloadPages. A DownloadPage is like a ship carrying the information you need, and the framework will process each DownloadPage and download or persist it for you.

4. Voilà, then run your crawler.
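The decorator chain configured above can be sketched as nested wrappers, each delegating to an inner processor. Below is a minimal Python sketch with hypothetical class bodies (the real crawler classes are Java):

```python
import time

class SpeedControl:
    """Throttle: sleep before delegating to the wrapped processor."""
    def __init__(self, processor, seconds):
        self.processor, self.seconds = processor, seconds
    def process(self, page):
        time.sleep(self.seconds)
        return self.processor.process(page)

class RetryControl:
    """Retry the wrapped processor up to retry_times extra attempts."""
    def __init__(self, processor, retry_times):
        self.processor, self.retry_times = processor, retry_times
    def process(self, page):
        for attempt in range(self.retry_times + 1):
            try:
                return self.processor.process(page)
            except Exception:
                if attempt == self.retry_times:
                    raise

class MainPageProcessor:
    """Your own logic: turn a page into follow-up pages or DownloadPages."""
    def process(self, page):
        return [f"download:{page}"]

# Mirrors the YAML nesting: RetryControl wraps SpeedControl wraps your processor
chain = RetryControl(SpeedControl(MainPageProcessor(), 0.01), retry_times=1)
print(chain.process("https://www.xxx.com/aaa.apex"))
# ['download:https://www.xxx.com/aaa.apex']
```

This mirrors how the '@class' nesting in the YAML composes decorators from the outside in.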

              Repo

              you can get code from github, gitlab

              Mar 8, 2024

              MCP

              Aug 7, 2024

              RAG

              Aug 7, 2024

              Utils

Projects

              Mar 7, 2024

              Subsections of Utils

              Cowsay

Since the previous cowsay image was built ten years ago, in newer k8s you will meet an exception like:

              Failed to pull image “docker/whalesay:latest”: [DEPRECATION NOTICE] Docker Image Format v1 and Docker Image manifest version 2, schema 1 support is disabled by default and will be removed in an upcoming release. Suggest the author of docker.io/docker/whalesay:latest to upgrade the image to the OCI Format or Docker Image manifest v2, schema 2. More information at https://docs.docker.com/go/deprecated-image-specs/

So I built a new one. Please try docker.io/aaron666/cowsay:v2

              Build

              docker build -t whalesay:v2 .

              Usage

              docker run -it localhost/whalesay:v2 whalesay  "hello world"
              
              [root@ay-zj-ecs cowsay]# docker run -it localhost/whalesay:v2 whalesay  "hello world"
               _____________
              < hello world >
               -------------
                \
                 \
                  \     
                                    ##        .            
                              ## ## ##       ==            
                           ## ## ## ##      ===            
                       /""""""""""""""""___/ ===        
                  ~~~ {~~ ~~~~ ~~~ ~~~~ ~~ ~ /  ===- ~~~   
                       \______ o          __/            
                        \    \        __/             
                          \____\______/   
              docker run -it localhost/whalesay:v2 cowsay  "hello world"
              
              [root@ay-zj-ecs cowsay]# docker run -it localhost/whalesay:v2 cowsay  "hello world"
               _____________
              < hello world >
               -------------
                      \   ^__^
                       \  (oo)\_______
                          (__)\       )\/\
                              ||----w |
                              ||     ||

              Upload

              registry
              docker tag 5b01b0c3c7ce docker-registry.lab.zverse.space/ay-dev/whalesay:v2
              docker push docker-registry.lab.zverse.space/ay-dev/whalesay:v2
export DOCKER_PAT=XXXX
              echo $DOCKER_PAT | docker login docker.io -u aaron666  --password-stdin
              docker tag 5b01b0c3c7ce docker.io/aaron666/whalesay:v2
              docker push docker.io/aaron666/whalesay:v2
              export GITHUB_PAT=XXXX
              echo $GITHUB_PAT | docker login ghcr.io -u aaronyang0628 --password-stdin
              docker tag 5b01b0c3c7ce ghcr.io/aaronyang0628/whalesay:v2
              docker push ghcr.io/aaronyang0628/whalesay:v2
              Mar 7, 2025

              Subsections of 🐸Git Related

              Cheatsheet

              List config

              git config --list

              Init global config

              git config --global user.name "AaronYang"
              git config --global user.email aaron19940628@gmail.com
              git config --global user.email byang628@alumni.usc.edu
              git config --global pager.branch false
              git config --global pull.ff only
              git --no-pager diff

              change user and email (locally)

              # git config user.name ""
              # git config user.email ""
              git config user.name "AaronYang"
              git config user.email byang628@alumni.usc.edu

              list all remote repo

              git remote -v
              modify remote repo
              git remote set-url origin git@github.com:<$user>/<$repo>.git
              # git remote set-url origin http://xxxxxxxxxxx.git
              add a new remote repo
              git remote add dev https://xxxxxxxxxxx.git

              Clone specific branch

              git clone -b slurm-23.02 --single-branch --depth=1 https://github.com/SchedMD/slurm.git

              Get specific file from remote

              git archive --remote=git@github.com:<$user>/<$repo>.git <$branch>:<$source_file_path> -o <$target_source_path>
              for example
              git archive --remote=git@github.com:AaronYang2333/LOL_Overlay_Assistant_Tool.git master:paper/2003.11755.pdf -o a.pdf

              Update submodule

git submodule add --depth 1 https://github.com/xxx/xxxx a/b/c

              git submodule update --init --recursive

              Save credential

              login first and then execute this

              git config --global credential.helper store

              Delete Branch

              • Deleting a remote branch
                git push origin --delete <branch>  # Git version 1.7.0 or newer
                git push origin -d <branch>        # Shorter version (Git 1.7.0 or newer)
                git push origin :<branch>          # Git versions older than 1.7.0
              • Deleting a local branch
                git branch --delete <branch>
                git branch -d <branch> # Shorter version
                git branch -D <branch> # Force-delete un-merged branches

              Prune remote branches

              git remote prune origin
              Mar 7, 2024

              Subsections of Action

              Customize A Gitea Action

              Introduction

              In this guide, you’ll learn about the basic components needed to create and use a packaged composite action. To focus this guide on the components needed to package the action, the functionality of the action’s code is minimal. The action prints “Hello World” and then “Goodbye”, or if you provide a custom name, it prints “Hello [who-to-greet]” and then “Goodbye”. The action also maps a random number to the random-number output variable, and runs a script named goodbye.sh.

              Once you complete this project, you should understand how to build your own composite action and test it in a workflow.

              Warning

              When creating workflows and actions, you should always consider whether your code might execute untrusted input from possible attackers. Certain contexts should be treated as untrusted input, as an attacker could insert their own malicious content. For more information, see Secure use reference.

              Composite actions and reusable workflows

              Composite actions allow you to collect a series of workflow job steps into a single action which you can then run as a single job step in multiple workflows. Reusable workflows provide another way of avoiding duplication, by allowing you to run a complete workflow from within other workflows. For more information, see Reusing workflow configurations.

              Prerequisites

              Note

              This example explains how to create a composite action within a separate repository. However, it is possible to create a composite action within the same repository. For more information, see Creating a composite action.

              Before you begin, you’ll create a repository on GitHub.

              1. Create a new public repository on GitHub. You can choose any repository name, or use the following hello-world-composite-action example. You can add these files after your project has been pushed to GitHub.

              2. Clone your repository to your computer.

              3. From your terminal, change directories into your new repository.

              cd hello-world-composite-action
4. In the hello-world-composite-action repository, create a new file called goodbye.sh with example code:
              echo "echo Goodbye" > goodbye.sh
5. From your terminal, make goodbye.sh executable.
              chmod +x goodbye.sh
6. From your terminal, check in your goodbye.sh file.
              git add goodbye.sh
              git commit -m "Add goodbye script"
              git push

              Creating an action metadata file

              1. In the hello-world-composite-action repository, create a new file called action.yml and add the following example code. For more information about this syntax, see Metadata syntax reference.
              name: 'Hello World'
              description: 'Greet someone'
              inputs:
                who-to-greet:  # id of input
                  description: 'Who to greet'
                  required: true
                  default: 'World'
              outputs:
                random-number:
                  description: "Random number"
                  value: ${{ steps.random-number-generator.outputs.random-number }}
              runs:
                using: "composite"
                steps:
                  - name: Set Greeting
                    run: echo "Hello $INPUT_WHO_TO_GREET."
                    shell: bash
                    env:
                      INPUT_WHO_TO_GREET: ${{ inputs.who-to-greet }}
              
                  - name: Random Number Generator
                    id: random-number-generator
                    run: echo "random-number=$(echo $RANDOM)" >> $GITHUB_OUTPUT
                    shell: bash
              
                  - name: Set GitHub Path
                    run: echo "$GITHUB_ACTION_PATH" >> $GITHUB_PATH
                    shell: bash
                    env:
                      GITHUB_ACTION_PATH: ${{ github.action_path }}
              
                  - name: Run goodbye.sh
                    run: goodbye.sh
                    shell: bash

              This file defines the who-to-greet input, maps the random generated number to the random-number output variable, adds the action’s path to the runner system path (to locate the goodbye.sh script during execution), and runs the goodbye.sh script.

              For more information about managing outputs, see Metadata syntax reference.

              For more information about how to use github.action_path, see Contexts reference.
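Under the hood, $GITHUB_OUTPUT is just a file of key=value lines that the runner reads back after each step finishes. A minimal sketch of that mechanism (illustrative only, not the actual runner code):

```python
import io

def parse_step_outputs(output_file):
    """Parse simple key=value lines, as written by
    `echo "random-number=$(echo $RANDOM)" >> $GITHUB_OUTPUT`."""
    outputs = {}
    for line in output_file:
        line = line.strip()
        if line and "=" in line:
            key, _, value = line.partition("=")
            outputs[key] = value
    return outputs

# Simulate what the step above appends to $GITHUB_OUTPUT
fake = io.StringIO("random-number=12345\n")
print(parse_step_outputs(fake))  # {'random-number': '12345'}
```

The action.yml outputs section then maps that parsed value to steps.random-number-generator.outputs.random-number.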

              1. From your terminal, check in your action.yml file.
              git add action.yml
              git commit -m "Add action"
              git push
              1. From your terminal, add a tag. This example uses a tag called v1. For more information, see About custom actions.
              git tag -a -m "Description of this release" v1
              git push --follow-tags

              Testing out your action in a workflow

              The following workflow code uses the completed hello world action that you made in Creating a composite action.

              Copy the workflow code into a .github/workflows/main.yml file in another repository, replacing OWNER and SHA with the repository owner and the SHA of the commit you want to use, respectively. You can also replace the who-to-greet input with your name.

              on: [push]
              
              jobs:
                hello_world_job:
                  runs-on: ubuntu-latest
                  name: A job to say hello
                  steps:
                    - uses: actions/checkout@v5
                    - id: foo
                      uses: OWNER/hello-world-composite-action@SHA
                      with:
                        who-to-greet: 'Mona the Octocat'
                    - run: echo random-number "$RANDOM_NUMBER"
                      shell: bash
                      env:
                        RANDOM_NUMBER: ${{ steps.foo.outputs.random-number }}

              From your repository, click the Actions tab, and select the latest workflow run. The output should include: “Hello Mona the Octocat”, the result of the “Goodbye” script, and a random number.

              Creating a composite action within the same repository

              1. Create a new subfolder called hello-world-composite-action, this can be placed in any subfolder within the repository. However, it is recommended that this be placed in the .github/actions subfolder to make organization easier.

              2. In the hello-world-composite-action folder, do the same steps to create the goodbye.sh script

              echo "echo Goodbye" > goodbye.sh
              chmod +x goodbye.sh
              git add goodbye.sh
              git commit -m "Add goodbye script"
              git push
              1. In the hello-world-composite-action folder, create the action.yml file based on the steps in Creating a composite action.

              2. When using the action, use the relative path to the folder where the composite action’s action.yml file is located in the uses key. The below example assumes it is in the .github/actions/hello-world-composite-action folder.

              on: [push]
              
              jobs:
                hello_world_job:
                  runs-on: ubuntu-latest
                  name: A job to say hello
                  steps:
                    - uses: actions/checkout@v5
                    - id: foo
                      uses: ./.github/actions/hello-world-composite-action
                      with:
                        who-to-greet: 'Mona the Octocat'
                    - run: echo random-number "$RANDOM_NUMBER"
                      shell: bash
                      env:
                        RANDOM_NUMBER: ${{ steps.foo.outputs.random-number }}
              Mar 7, 2024

              Customize A Github Action

              Introduction

              In this guide, you’ll learn about the basic components needed to create and use a packaged composite action. To focus this guide on the components needed to package the action, the functionality of the action’s code is minimal. The action prints “Hello World” and then “Goodbye”, or if you provide a custom name, it prints “Hello [who-to-greet]” and then “Goodbye”. The action also maps a random number to the random-number output variable, and runs a script named goodbye.sh.

              Once you complete this project, you should understand how to build your own composite action and test it in a workflow.

              Warning

              When creating workflows and actions, you should always consider whether your code might execute untrusted input from possible attackers. Certain contexts should be treated as untrusted input, as an attacker could insert their own malicious content. For more information, see Secure use reference.

              Composite actions and reusable workflows

              Composite actions allow you to collect a series of workflow job steps into a single action which you can then run as a single job step in multiple workflows. Reusable workflows provide another way of avoiding duplication, by allowing you to run a complete workflow from within other workflows. For more information, see Reusing workflow configurations.

              Prerequisites

              Note

              This example explains how to create a composite action within a separate repository. However, it is possible to create a composite action within the same repository. For more information, see Creating a composite action.

              Before you begin, you’ll create a repository on GitHub.

              1. Create a new public repository on GitHub. You can choose any repository name, or use the following hello-world-composite-action example. You can add these files after your project has been pushed to GitHub.

              2. Clone your repository to your computer.

              3. From your terminal, change directories into your new repository.

              cd hello-world-composite-action
              4. In the hello-world-composite-action repository, create a new file called goodbye.sh with example code:
              echo "echo Goodbye" > goodbye.sh
              5. From your terminal, make goodbye.sh executable.
              chmod +x goodbye.sh
              6. From your terminal, check in your goodbye.sh file.
              git add goodbye.sh
              git commit -m "Add goodbye script"
              git push

              Creating an action metadata file

              1. In the hello-world-composite-action repository, create a new file called action.yml and add the following example code. For more information about this syntax, see Metadata syntax reference.
              name: 'Hello World'
              description: 'Greet someone'
              inputs:
                who-to-greet:  # id of input
                  description: 'Who to greet'
                  required: true
                  default: 'World'
              outputs:
                random-number:
                  description: "Random number"
                  value: ${{ steps.random-number-generator.outputs.random-number }}
              runs:
                using: "composite"
                steps:
                  - name: Set Greeting
                    run: echo "Hello $INPUT_WHO_TO_GREET."
                    shell: bash
                    env:
                      INPUT_WHO_TO_GREET: ${{ inputs.who-to-greet }}
              
                  - name: Random Number Generator
                    id: random-number-generator
                    run: echo "random-number=$(echo $RANDOM)" >> $GITHUB_OUTPUT
                    shell: bash
              
                  - name: Set GitHub Path
                    run: echo "$GITHUB_ACTION_PATH" >> $GITHUB_PATH
                    shell: bash
                    env:
                      GITHUB_ACTION_PATH: ${{ github.action_path }}
              
                  - name: Run goodbye.sh
                    run: goodbye.sh
                    shell: bash

              This file defines the who-to-greet input, maps the randomly generated number to the random-number output variable, adds the action’s path to the runner system path (to locate the goodbye.sh script during execution), and runs the goodbye.sh script.

              For more information about managing outputs, see Metadata syntax reference.

              For more information about how to use github.action_path, see Contexts reference.
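              Under the hood, step outputs work through the file that $GITHUB_OUTPUT points to: the runner provides the path, and every key=value line appended to that file becomes an output of the step. A minimal local sketch of the mechanism, outside of any runner (the mktemp file is a hypothetical stand-in for the runner-provided path):

```shell
# Stand-in for the runner-provided output file (local simulation only)
GITHUB_OUTPUT="$(mktemp)"

# What the "Random Number Generator" step does:
# ${RANDOM:-12345} is bash's $RANDOM, with a fixed fallback for non-bash shells
echo "random-number=${RANDOM:-12345}" >> "$GITHUB_OUTPUT"

# The runner later parses key=value lines from this file;
# print the line we just wrote to confirm the format:
grep -E '^random-number=[0-9]+$' "$GITHUB_OUTPUT"
```

If the grep prints the line, the value would have been exposed as steps.<id>.outputs.random-number in a real run.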

              2. From your terminal, check in your action.yml file.
              git add action.yml
              git commit -m "Add action"
              git push
              3. From your terminal, add a tag. This example uses a tag called v1. For more information, see About custom actions.
              git tag -a -m "Description of this release" v1
              git push --follow-tags
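              Consumers can reference the action either by the v1 tag or, more strictly, by a full commit SHA. To find the SHA to pin, git rev-parse works; the sketch below demonstrates it in a throwaway repository (the empty commit exists only so HEAD resolves — run the same command in your action repository instead):

```shell
# Throwaway repo purely to demonstrate resolving a commit SHA
tmp="$(mktemp -d)" && cd "$tmp" && git init -q .
git -c user.email=you@example.com -c user.name=you \
    commit -q --allow-empty -m "init"

# The full 40-character SHA, usable as OWNER/hello-world-composite-action@<SHA>
sha="$(git rev-parse HEAD)"
echo "$sha"
```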

              Testing out your action in a workflow

              The following workflow code uses the completed hello world action that you made in Creating a composite action.

              Copy the workflow code into a .github/workflows/main.yml file in another repository, replacing OWNER and SHA with the repository owner and the SHA of the commit you want to use, respectively. You can also replace the who-to-greet input with your name.

              on: [push]
              
              jobs:
                hello_world_job:
                  runs-on: ubuntu-latest
                  name: A job to say hello
                  steps:
                    - uses: actions/checkout@v5
                    - id: foo
                      uses: OWNER/hello-world-composite-action@SHA
                      with:
                        who-to-greet: 'Mona the Octocat'
                    - run: echo random-number "$RANDOM_NUMBER"
                      shell: bash
                      env:
                        RANDOM_NUMBER: ${{ steps.foo.outputs.random-number }}

              From your repository, click the Actions tab, and select the latest workflow run. The output should include: “Hello Mona the Octocat”, the result of the “Goodbye” script, and a random number.

              Creating a composite action within the same repository

              1. Create a new subfolder called hello-world-composite-action. This can be placed in any subfolder within the repository; however, it is recommended to place it in the .github/actions subfolder to make organization easier.

              2. In the hello-world-composite-action folder, follow the same steps to create the goodbye.sh script:

              echo "echo Goodbye" > goodbye.sh
              chmod +x goodbye.sh
              git add goodbye.sh
              git commit -m "Add goodbye script"
              git push
              3. In the hello-world-composite-action folder, create the action.yml file based on the steps in Creating a composite action.

              4. When using the action, use the relative path to the folder containing the composite action’s action.yml file in the uses key. The example below assumes it is in the .github/actions/hello-world-composite-action folder.

              on: [push]
              
              jobs:
                hello_world_job:
                  runs-on: ubuntu-latest
                  name: A job to say hello
                  steps:
                    - uses: actions/checkout@v5
                    - id: foo
                      uses: ./.github/actions/hello-world-composite-action
                      with:
                        who-to-greet: 'Mona the Octocat'
                    - run: echo random-number "$RANDOM_NUMBER"
                      shell: bash
                      env:
                        RANDOM_NUMBER: ${{ steps.foo.outputs.random-number }}
              Mar 7, 2024

              Gitea Variables

              Preset Variables

              | Variable | Description / Usage |
              |---|---|
              | gitea.actor | Username of the user who triggered the workflow. (docs.gitea.com) |
              | gitea.event_name | Name of the event, e.g. push, pull_request. (docs.gitea.com) |
              | gitea.ref | The Git ref (branch/tag) that triggered the workflow. (docs.gitea.com) |
              | gitea.repository | Repository identifier, usually owner/name. (docs.gitea.com) |
              | gitea.workspace | Working directory on the runner where the repository is checked out. (docs.gitea.com) |

              Common Variables

              | Variable | Description / Usage |
              |---|---|
              | runner.os | Operating system environment of the runner, e.g. ubuntu-latest. (docs.gitea.com) |
              | job.status | Status of the current job (e.g. success or failure). (docs.gitea.com) |
              | env.xxxx | Custom configuration variables defined at the user/organization/repository level, referenced in uppercase. (docs.gitea.com) |
              | secrets.XXXX | Secrets for sensitive information, also definable at the user/organization/repository level. (docs.gitea.com) |

              Sample

              name: Gitea Actions Demo
              run-name: ${{ gitea.actor }} is testing out Gitea Actions 🚀
              on: [push]
              
              env:
                  author: gitea_admin
              jobs:
                Explore-Gitea-Actions:
                  runs-on: ubuntu-latest
                  steps:
                    - run: echo "🎉 The job was automatically triggered by a ${{ gitea.event_name }} event."
                    - run: echo "🐧 This job is now running on a ${{ runner.os }} server hosted by Gitea!"
                    - run: echo "🔎 The name of your branch is ${{ gitea.ref }} and your repository is ${{ gitea.repository }}."
                    - name: Check out repository code
                      uses: actions/checkout@v4
                    - run: echo "💡 The ${{ gitea.repository }} repository has been cloned to the runner."
                    - run: echo "🖥️ The workflow is now ready to test your code on the runner."
                    - name: List files in the repository
                      run: |
                        ls ${{ gitea.workspace }}
                    - run: echo "🍏 This job's status is ${{ job.status }}."

              Result

              🎉 The job was automatically triggered by a `push` event.
              
              🐧 This job is now running on a `Linux` server hosted by Gitea!
              
              🔎 The name of your branch is `refs/heads/main` and your repository is `gitea_admin/data-warehouse`.
              
              💡 The `gitea_admin/data-warehouse` repository has been cloned to the runner.
              
              🖥️ The workflow is now ready to test your code on the runner.
              
                  Dockerfile  README.md  environments  pom.xml  src  templates
              
              🍏 This job's status is `success`.
              Mar 7, 2024

              Github Variables

              Context Variables

              | Variable | Description / Usage |
              |---|---|
              | github.actor | Username of the user who triggered the workflow. |
              | github.event_name | Name of the event, e.g. push, pull_request. |
              | github.ref | The Git ref (branch/tag) that triggered the workflow. |
              | github.repository | Repository identifier, usually owner/name. |
              | github.workspace | Working directory on the runner where the repository is checked out. |
              | env.xxxx | Variables defined in the workflow, referenced as ${{ env.xxxx }}. |
              | secrets.XXXX | Secrets created via Settings -> Actions -> Secrets and variables. |
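              As a quick illustration, the context variables above can be echoed from a job step (a hypothetical minimal workflow, not tied to any specific repository):

```yaml
on: [push]

jobs:
  show-context:
    runs-on: ubuntu-latest
    steps:
      - run: echo "${{ github.actor }} pushed ${{ github.ref }} in ${{ github.repository }}"
      - run: echo "workspace is ${{ github.workspace }}"
```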
              Mar 7, 2024

              Subsections of Template

              Apply And Sync Argocd APP

              name: apply-and-sync-app
              run-name: ${{ gitea.actor }} is going to sync an sample argocd app 🚀
              on: [push]
              
              jobs:
                sync-argocd-app:
                  runs-on: ubuntu-latest
                  steps:
                    - name: Sync App
                      uses: AaronYang0628/apply-and-sync-argocd@v1.0.6
                      with:
                        argocd-server: '192.168.100.125:30443'
                        argocd-token: ${{ secrets.ARGOCD_TOKEN }}
                        application-yaml-path: "environments/ops/argocd/operator.app.yaml"
              Mar 7, 2025

              Publish Chart 2 Harbor

              name: publish-chart-to-harbor-registry
              run-name: ${{ gitea.actor }} is testing out Gitea Push Chart 🚀
              on: [push]
              
              env:
                REGISTRY: harbor.zhejianglab.com
                USER: byang628@zhejianglab.com
                REPOSITORY_NAMESPACE: ay-dev
                CHART_NAME: data-warehouse
              jobs:
                build-and-push-charts:
                  runs-on: ubuntu-latest
                  permissions:
                    packages: write
                    contents: read
                  strategy:
                    matrix:
                      include:
                        - chart_path: "environments/helm/metadata-environment"
                  steps:
                    - name: Checkout Repository
                      uses: actions/checkout@v4
                      with:
                        fetch-depth: 0
              
                    - name: Log in to Harbor
                      uses: docker/login-action@f4ef78c080cd8ba55a85445d5b36e214a81df20a
                      with:
                        registry: "${{ env.REGISTRY }}"
                        username: "${{ env.USER }}"
                        password: "${{ secrets.ZJ_HARBOR_TOKEN }}"
              
                    - name: Helm Publish Action
                      uses: AaronYang0628/push-helm-chart-to-oci@v0.0.3
                      with:
                        working-dir: ${{ matrix.chart_path }}
                        oci-repository: oci://${{ env.REGISTRY }}/${{ env.REPOSITORY_NAMESPACE }}
                        username: ${{ env.USER }}
                        password: ${{ secrets.ZJ_HARBOR_TOKEN }}
              Mar 7, 2025

              Publish Image 2 Dockerhub

              name: publish-image-to-dockerhub
              run-name: ${{ gitea.actor }} is testing out Gitea Push Image 🚀
              on: [push]
              
              env:
                REGISTRY: docker.io
                USER: aaronyang0628
                REPOSITORY_NAMESPACE: aaronyang0628
              jobs:
                build-and-push-images:
                  strategy:
                    matrix:
                      include:
                        - name_suffix: "aria-ng"
                          container_path: "application/aria2/container/aria-ng"
                          dockerfile_path: "application/aria2/container/aria-ng/Dockerfile"
                        - name_suffix: "aria2"
                          container_path: "application/aria2/container/aria2"
                          dockerfile_path: "application/aria2/container/aria2/Dockerfile"
                  runs-on: ubuntu-latest
                  steps:
                  - name: checkout-repository
                    uses: actions/checkout@v4
                  - name: log in to the container registry
                    uses: docker/login-action@v3
                    with:
                      registry: "${{ env.REGISTRY }}"
                      username: "${{ env.USER }}"
                      password: "${{ secrets.GIT_REGISTRY_PWD }}"
                  - name: build and push container image
                    uses: docker/build-push-action@v6
                    with:
                      context: "${{ matrix.container_path }}"
                      file: "${{ matrix.dockerfile_path }}"
                      push: true
                      tags: |
                        ${{ env.REGISTRY }}/${{ env.REPOSITORY_NAMESPACE }}/${{ github.repository }}-${{ matrix.name_suffix }}:${{ inputs.tag || 'latest' }}
                        ${{ env.REGISTRY }}/${{ env.REPOSITORY_NAMESPACE }}/${{ github.repository }}-${{ matrix.name_suffix }}:${{ github.ref_name }}
                      labels: |
                        org.opencontainers.image.source=${{ github.server_url }}/${{ github.repository }}
              Mar 7, 2025

              Publish Image 2 Ghcr

              name: publish-image-to-ghcr
              run-name: ${{ gitea.actor }} is testing out Gitea Push Image 🚀
              on: [push]
              
              env:
                REGISTRY: ghcr.io
                USER: aaronyang0628
                REPOSITORY_NAMESPACE: aaronyang0628
              jobs:
                build-and-push-images:
                  strategy:
                    matrix:
                      include:
                        - name_suffix: "aria-ng"
                          container_path: "application/aria2/container/aria-ng"
                          dockerfile_path: "application/aria2/container/aria-ng/Dockerfile"
                        - name_suffix: "aria2"
                          container_path: "application/aria2/container/aria2"
                          dockerfile_path: "application/aria2/container/aria2/Dockerfile"
                  runs-on: ubuntu-latest
                  steps:
                  - name: checkout-repository
                    uses: actions/checkout@v4
                  - name: log in to the container registry
                    uses: docker/login-action@v3
                    with:
                      registry: "${{ env.REGISTRY }}"
                      username: "${{ env.USER }}"
                      password: "${{ secrets.GIT_REGISTRY_PWD }}"
                  - name: build and push container image
                    uses: docker/build-push-action@v6
                    with:
                      context: "${{ matrix.container_path }}"
                      file: "${{ matrix.dockerfile_path }}"
                      push: true
                      tags: |
                        ${{ env.REGISTRY }}/${{ env.REPOSITORY_NAMESPACE }}/${{ github.repository }}-${{ matrix.name_suffix }}:${{ inputs.tag || 'latest' }}
                        ${{ env.REGISTRY }}/${{ env.REPOSITORY_NAMESPACE }}/${{ github.repository }}-${{ matrix.name_suffix }}:${{ github.ref_name }}
                      labels: |
                        org.opencontainers.image.source=${{ github.server_url }}/${{ github.repository }}
              Mar 7, 2025

              Publish Image 2 Harbor

              name: publish-image-to-harbor-registry
              run-name: ${{ gitea.actor }} is testing out Gitea Push Image 🚀
              on: [push]
              
              
              env:
                REGISTRY: harbor.zhejianglab.com
                USER: byang628@zhejianglab.com
                REPOSITORY_NAMESPACE: ay-dev
                IMAGE_NAME: metadata-crd-operator
              jobs:
                build-and-push-images:
                  runs-on: ubuntu-latest
                  permissions:
                    packages: write
                    contents: read
                  strategy:
                    matrix:
                      include:
                        - name_suffix: "dev"
                          container_path: "."
                          dockerfile_path: "./Dockerfile"
                  steps:
                    - name: Checkout Repository
                      uses: actions/checkout@v4
              
                    - name: Log in to Harbor
                      uses: docker/login-action@f4ef78c080cd8ba55a85445d5b36e214a81df20a
                      with:
                        registry: "${{ env.REGISTRY }}"
                        username: "${{ env.USER }}"
                        password: "${{ secrets.ZJ_HARBOR_TOKEN }}"
              
                    - name: Extract Current Date
                      id: extract-date
                      run: |
                        CURRENT_DATE=$(date +'%Y%m%d')
                        echo "current-date=${CURRENT_DATE}" >> $GITHUB_OUTPUT
                        echo "will push image: ${{ env.REGISTRY }}/${{ env.REPOSITORY_NAMESPACE }}/${{ env.IMAGE_NAME }}-${{ matrix.name_suffix }}:v${CURRENT_DATE}"
              
                    - name: Build And Push Container Image
                      uses: docker/build-push-action@v6
                      with:
                        context: "${{ matrix.container_path }}"
                        file: "${{ matrix.dockerfile_path }}"
                        push: true
                        tags: |
                          ${{ env.REGISTRY }}/${{ env.REPOSITORY_NAMESPACE }}/${{ env.IMAGE_NAME }}-${{ matrix.name_suffix }}:v${{ steps.extract-date.outputs.current-date }}
                        labels: |
                          org.opencontainers.image.source=${{ github.server_url }}/${{ github.repository }}
              Mar 7, 2025

              Subsections of Notes

              Not Allow Push

              Cannot push to your own branch


              1. Edit the .git/config file under your repo directory.

              2. Find the url= entry under the [remote "origin"] section.

              3. Change it from:

                url=https://gitlab.com/AaronYang2333/ska-src-dm-local-data-preparer.git/

                to:

                url=ssh://git@gitlab.com/AaronYang2333/ska-src-dm-local-data-preparer.git

              4. Try pushing again.
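              Instead of editing .git/config by hand, the same change can be made with git remote set-url. The sketch below demonstrates it in a throwaway repository so nothing real is touched; substitute your own remote URL:

```shell
# Throwaway repo to demonstrate switching origin from HTTPS to SSH
tmp="$(mktemp -d)" && cd "$tmp" && git init -q .
git remote add origin https://gitlab.com/AaronYang2333/ska-src-dm-local-data-preparer.git
git remote set-url origin ssh://git@gitlab.com/AaronYang2333/ska-src-dm-local-data-preparer.git

# Confirm the change
git remote get-url origin
```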

              Mar 12, 2025

              📟GPU Related

              • Nvidia DALI

                https://aaronyang0628.github.io/dali-tutorial/

                Mar 7, 2024

                Subsections of 📟GPU Related

                ☸️Kubernetes

                Mar 7, 2024

                Subsections of ☸️Kubernetes

                Prepare k8s Cluster

                Install Kubectl

                MIRROR="files.m.daocloud.io/"
                VERSION=$(curl -L -s https://${MIRROR}dl.k8s.io/release/stable.txt)
                [ $(uname -m) = x86_64 ] && curl -sSLo kubectl "https://${MIRROR}dl.k8s.io/release/${VERSION}/bin/linux/amd64/kubectl"
                [ $(uname -m) = aarch64 ] && curl -sSLo kubectl "https://${MIRROR}dl.k8s.io/release/${VERSION}/bin/linux/arm64/kubectl"
                chmod u+x kubectl
                mkdir -p ${HOME}/bin
                mv -f kubectl ${HOME}/bin
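                The script above picks the binary by architecture. To see which URL it would fetch without downloading anything, the selection logic can be replayed on its own (VERSION is pinned here as an example; the installer resolves the real value from stable.txt):

```shell
MIRROR="files.m.daocloud.io/"
VERSION="v1.33.1"   # example pin; the installer resolves the real value

# Map uname -m to the Kubernetes release architecture name
case "$(uname -m)" in
  x86_64)  ARCH=amd64 ;;
  aarch64) ARCH=arm64 ;;
  *)       ARCH=unknown ;;
esac

url="https://${MIRROR}dl.k8s.io/release/${VERSION}/bin/linux/${ARCH}/kubectl"
echo "$url"
```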

                Then, to build a K8s cluster, you can choose one of the following methods.

                Mar 7, 2025

                Subsections of Prepare k8s Cluster

                Kind

                Warning

                Although this is the easiest way to build a cluster, I won't recommend it: there are many unsolved issues, see https://kind.sigs.k8s.io/docs/user/known-issues/

                Preliminary

                • Kind binary has installed, if not check 🔗link

                • Hardware Requirements:

                  1. At least 2 GB of RAM per machine (minimum 1 GB)
                  2. 2 CPUs on the master node
                  3. Full network connectivity among all machines (public or private network)
                • Operating System:

                  1. Ubuntu 22.04/14.04, CentOS 7/8, or any other supported Linux distribution.
                • Network Requirements:

                  1. Unique hostname, MAC address, and product_uuid for each node.
                  2. Certain ports need to be open (e.g., 6443, 2379-2380, 10250, 10251, 10252, 10255, etc.)

                Create your cluster

                Creating a Kubernetes cluster is as simple as kind create cluster

                kind create cluster --name test
                Customize your cluster

                If you need to configure your cluster, you can create kind-config.yaml

                # Create the config file
                cat <<EOF > kind-config.yaml
                kind: Cluster
                apiVersion: kind.x-k8s.io/v1alpha4
                nodes:
                - role: control-plane
                - role: worker
                - role: worker
                EOF
                
                # Create the cluster
                kind create cluster --name my-cluster --config kind-config.yaml
                
                # Wait for the containers to start
                sleep 10
                
                # Apply resource limits
                docker update --memory="4g" --cpus="2" my-cluster-control-plane
                docker update --memory="2g" --cpus="1" my-cluster-worker
                docker update --memory="2g" --cpus="1" my-cluster-worker2
                
                echo "✅ Cluster created; resource limits applied"

                Delete your cluster

                kind delete cluster

                Reference

                You can visit https://kind.sigs.k8s.io/docs/user/quick-start/ for more detail.

                Mar 7, 2024

                K3s

                Preliminary

                • Hardware Requirements:

                  1. Servers need at least 2 cores and 2 GB RAM
                  2. Agents need 1 core and 512 MB RAM
                • Operating System:

                  1. K3s is expected to work on most modern Linux systems.
                • Network Requirements:

                  1. The K3s server needs port 6443 to be accessible by all nodes.
                  2. If you wish to utilize the metrics server, all nodes must be accessible to each other on port 10250.

                Init K3s Server [At Server End]

                curl -sfL https://rancher-mirror.rancher.cn/k3s/k3s-install.sh | INSTALL_K3S_MIRROR=cn sh -s - server --cluster-init --flannel-backend=vxlan --node-taint "node-role.kubernetes.io/control-plane=true:NoSchedule" --disable traefik

                Append --disable traefik to skip the bundled Traefik ingress controller (already included in the command above).

                Append --disable servicelb to disable ServiceLB (if you plan to use another load balancer such as MetalLB).

                Append --disable local-storage to disable the local storage provisioner (if you use another storage solution).

                Get K3s Token [At Server End]

                cat /var/lib/rancher/k3s/server/node-token

                Join K3s Worker [At Agent End]

                curl -sfL https://rancher-mirror.rancher.cn/k3s/k3s-install.sh | INSTALL_K3S_MIRROR=cn K3S_URL=https://<master-ip>:6443 K3S_TOKEN=<join-token> sh -

                Copy Kubeconfig [At Server + Agent End]

                mkdir -p $HOME/.kube
                cp /etc/rancher/k3s/k3s.yaml $HOME/.kube/config

                How it works

                [Diagram: k3s server/agent architecture]

                Uninstall K3s cluster [At Server + Agent End]

                # exec on server
                /usr/local/bin/k3s-uninstall.sh
                
                # exec on agent 
                /usr/local/bin/k3s-agent-uninstall.sh
                Mar 7, 2024

                Minikube

                Preliminary

                • Minikube binary has installed, if not check 🔗link

                • Hardware Requirements:

                  1. At least 2 GB of RAM per machine (minimum 1 GB)
                  2. 2 CPUs on the master node
                  3. Full network connectivity among all machines (public or private network)
                • Operating System:

                  1. Ubuntu 20.04/18.04, CentOS 7/8, or any other supported Linux distribution.
                • Network Requirements:

                  1. Unique hostname, MAC address, and product_uuid for each node.
                  2. Certain ports need to be open (e.g., 6443, 2379-2380, 10250, 10251, 10252, 10255, etc.)

                [Optional] Disable aegis service and reboot system for Aliyun

                sudo systemctl disable aegis && sudo reboot

                Customize your cluster

                minikube start --driver=podman  --image-mirror-country=cn --kubernetes-version=v1.33.1 --image-repository=registry.cn-hangzhou.aliyuncs.com/google_containers --cpus=6 --memory=20g --disk-size=50g --force

                Restart minikube

                minikube stop && minikube start

                Add alias

                alias kubectl="minikube kubectl --"

                Stop And Clean

                minikube stop && minikube delete --all --purge

                Forward

                # execute on your local machine
                        Remote                                                  Local⬇️
                   __________________                          ________________________________   
                  ╱                  ╲╲      wire/wifi        ╱ [ Minikube ] 17.100.x.y        ╲╲
                 ╱                   ╱╱   <------------->    ╱                                 ╱╱
                ╱ telnet 192.168.a.b ╱                      ╱  > execute ssh... at 192.168.a.b ╱ 
                ╲___________________╱  IP: 10.45.m.n        ╲_________________________________╱ IP: 192.168.a.b
                ssh -i ~/.minikube/machines/minikube/id_rsa docker@$(minikube ip) -L '*:30443:0.0.0.0:30443' -N -f

                and then you can visit https://minikube.sigs.k8s.io/docs/start/ for more detail.

                FAQ

                Q1: couldn’t get resource list for external.metrics.k8s.io/v1beta1: the server is currently unable to handle…

                This is usually caused by the Metrics Server not being installed correctly, or by the External Metrics API being missing.

                # enable Minikube's metrics-server addon
                minikube addons enable metrics-server
                
                # wait for the deployment to become available (about 1-2 minutes)
                kubectl wait --for=condition=available deployment/metrics-server -n kube-system --timeout=180s
                
                # verify that the Metrics Server is running
                kubectl -n kube-system get pods | grep metrics-server
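The grep check above only eyeballs the pod list; the same check can be scripted, e.g. as a readiness gate in CI. A minimal sketch that parses sample output (since no live cluster is assumed here; swap in the real `kubectl -n kube-system get pods` call):

```shell
# sample `kubectl get pods` output, used as a stand-in for a live cluster
pods='metrics-server-7db4fb59f9-abcde   1/1   Running   0   2m
coredns-5dd5756b68-xyz12   1/1   Running   0   5m'

# extract the STATUS column of the metrics-server pod
status=$(printf '%s\n' "$pods" | awk '/metrics-server/ {print $3}')
if [ "$status" = "Running" ]; then
  echo "metrics-server OK"
else
  echo "metrics-server not ready" >&2
fi
```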


                Q2: Expose the minikube API server on the local network directly
                minikube start --driver=podman  --image-mirror-country=cn --kubernetes-version=v1.33.1 --image-repository=registry.cn-hangzhou.aliyuncs.com/google_containers  --listen-address=0.0.0.0 --cpus=6 --memory=20g --disk-size=100g --force
                Mar 7, 2024

                Subsections of Command

                Kubectl CheatSheet

                Switch Context

                • use different config
                kubectl --kubeconfig /root/.kube/config_ack get pod

                Resource

                • create resource

                  Resource From
                    kubectl create -n <$namespace> -f <$file_url>
                  temp-file.yaml
                  apiVersion: v1
                  kind: Service
                  metadata:
                    labels:
                      app.kubernetes.io/component: server
                      app.kubernetes.io/instance: argo-cd
                      app.kubernetes.io/name: argocd-server-external
                      app.kubernetes.io/part-of: argocd
                      app.kubernetes.io/version: v2.8.4
                    name: argocd-server-external
                  spec:
                    ports:
                    - name: https
                      port: 443
                      protocol: TCP
                      targetPort: 8080
                      nodePort: 30443
                    selector:
                      app.kubernetes.io/instance: argo-cd
                      app.kubernetes.io/name: argocd-server
                    type: NodePort
                  
                    helm install <$resource_id> <$resource_id> \
                        --namespace <$namespace> \
                        --create-namespace \
                        --version <$version> \
                        --repo <$repo_url> \
                        --values resource.values.yaml \
                        --atomic
                  resource.values.yaml
                  crds:
                      install: true
                      keep: false
                  global:
                      revisionHistoryLimit: 3
                      image:
                          repository: m.daocloud.io/quay.io/argoproj/argocd
                          imagePullPolicy: IfNotPresent
                  redis:
                      enabled: true
                      image:
                          repository: m.daocloud.io/docker.io/library/redis
                      exporter:
                          enabled: false
                          image:
                              repository: m.daocloud.io/bitnami/redis-exporter
                      metrics:
                          enabled: false
                  redis-ha:
                      enabled: false
                      image:
                          repository: m.daocloud.io/docker.io/library/redis
                      configmapTest:
                          repository: m.daocloud.io/docker.io/koalaman/shellcheck
                      haproxy:
                          enabled: false
                          image:
                              repository: m.daocloud.io/docker.io/library/haproxy
                      exporter:
                          enabled: false
                          image: m.daocloud.io/docker.io/oliver006/redis_exporter
                  dex:
                      enabled: true
                      image:
                          repository: m.daocloud.io/ghcr.io/dexidp/dex
                  

                • debug resource

                kubectl -n <$namespace> describe <$resource_type> <$resource_id>
                • logging resource
                kubectl -n <$namespace> logs -f <$resource_id>
                • port forwarding resource
                kubectl -n <$namespace> port-forward  <$resource_id> --address 0.0.0.0 8080:80 # local:pod
                • delete all resource under specific namespace
                kubectl delete all --all -n <$namespace>
                if you want to delete everything across all namespaces
                kubectl delete all --all --all-namespaces
                • delete error pods
                kubectl -n <$namespace> delete pods --field-selector status.phase=Failed
                • force delete
                kubectl -n <$namespace> delete pod <$resource_id> --force --grace-period=0
                • opening a Bash Shell inside a Pod
                kubectl -n <$namespace> exec -it <$resource_id> -- bash  
                • copy secret to another namespace
                kubectl -n <$namespaceA> get secret <$secret_name> -o json \
                    | jq 'del(.metadata["namespace","creationTimestamp","resourceVersion","selfLink","uid"])' \
                    | kubectl -n <$namespaceB> apply -f -
                • copy secret to another name
                kubectl -n <$namespace> get secret <$old_secret_name> -o json | \
                jq 'del(.metadata["namespace","creationTimestamp","resourceVersion","selfLink","uid","ownerReferences","annotations","labels"]) | .metadata.name = "<$new_secret_name>"' | \
                kubectl apply -n <$namespace> -f -
                • delete all completed job
                kubectl delete jobs -n <$namespace> --field-selector status.successful=1 
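Note that `status.successful=1` is not supported as a Job field selector on every cluster version; where it fails, you can filter the tabular output instead. A sketch on sample output (the job names are made up; swap in the real `kubectl -n <$namespace> get jobs` call):

```shell
# sample `kubectl get jobs` output, used as a stand-in for a live cluster
jobs='NAME       COMPLETIONS   DURATION   AGE
backup-1   1/1           30s        2d
ingest-7   0/1           5m         1h'

# names of completed jobs (COMPLETIONS == 1/1), skipping the header row;
# these names would be passed to `kubectl delete jobs`
done_jobs=$(printf '%s\n' "$jobs" | awk 'NR>1 && $2=="1/1" {print $1}')
echo "$done_jobs"
```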

                Nodes

                • list taint
                kubectl describe nodes | grep -i taint
                ## kubectl describe node <$node_name> | grep Taints
                • add taint
                kubectl taint nodes <$node_name> <key>[=<value>]:<effect>
                for example
                kubectl taint nodes node1 dedicated:NoSchedule
                • remove taint
                kubectl taint nodes <$node_name> <taint-key>-
                for example
                kubectl taint nodes node1 dedicated:NoSchedule-
                • show info extracted by JSON path
                kubectl get nodes -o jsonpath='{.items[*].spec.podCIDR}'

                Deploy

                • show rollout history
                kubectl -n <$namespace> rollout history deploy/<$deploy_resource_id>

                undo rollout

                kubectl -n <$namespace> rollout undo deploy <$deploy_resource_id>  --to-revision=1

                Patch

                clear finalizers on resources that will no longer be managed by k8s

                kubectl -n metadata patch flinkingest ingest-table-or-fits-from-oss -p '{"metadata":{"finalizers":[]}}' --type=merge
                Mar 8, 2024

                Helm Chart CheatSheet

                Finding Charts

                helm search hub wordpress

                Adding Repositories

                helm repo add ay-helm-mirror https://aaronyang0628.github.io/helm-chart-mirror/charts
                helm repo update

                Showing Chart Values

                helm show values bitnami/wordpress

                Packaging Charts

                helm package --dependency-update --destination /tmp/ /root/metadata-operator/environments/helm/metadata-environment/charts

                Uninstall Chart

                helm uninstall -n warehouse warehouse

                when failed, you can try

                helm uninstall -n warehouse warehouse --no-hooks --cascade=foreground
                Mar 7, 2024

                Resource CheatSheet

                Create Secret From Literal

                kubectl -n application create secret generic xxxx-secrets \
                  --from-literal=xxx_uri='https://in03-891eca6c21bd4e5.serverless.aws-eu-central-1.cloud.zilliz.com' \
                  --from-literal=xxxx_token='<$the uncoded value, do not base64 and paste here>' \
                  --from-literal=tongyi_api_key='sk-xxxxxxxxxxx'
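kubectl stores each `--from-literal` value base64-encoded under the Secret's `.data`; the round trip can be reproduced locally (`sk-xxxxxxxxxxx` is the placeholder value from the example above):

```shell
# the plain value, exactly as passed to --from-literal (placeholder)
token='sk-xxxxxxxxxxx'

# what ends up in the Secret's .data field
encoded=$(printf '%s' "$token" | base64)

# what `base64 -d` gives you back when reading the Secret
decoded=$(printf '%s' "$encoded" | base64 -d)
echo "$decoded"
```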

                Forward external service

                kubectl -n basic-components apply -f - <<EOF
                apiVersion: v1
                kind: Service
                metadata:
                  name: proxy-server-service
                spec:
                  type: ClusterIP
                  ports:
                  - port: 80
                    targetPort: 32080
                    protocol: TCP
                    name: http
                ---
                apiVersion: v1
                kind: Endpoints
                metadata:
                  name: proxy-server-service
                subsets:
                  - addresses:
                    - ip: "47.xxx.xxx.xxx"
                    ports:
                    - port: 32080
                      protocol: TCP
                      name: http
                ---
                apiVersion: networking.k8s.io/v1
                kind: Ingress
                metadata:
                  name: proxy-server-ingress
                  annotations:
                    nginx.ingress.kubernetes.io/proxy-connect-timeout: "300"
                    nginx.ingress.kubernetes.io/proxy-read-timeout: "300"
                    nginx.ingress.kubernetes.io/proxy-send-timeout: "300"
                spec:
                  ingressClassName: nginx
                  rules:
                  - host: server.proxy.72602.online
                    http:
                      paths:
                      - path: /
                        pathType: Prefix
                        backend:
                          service:
                            name: proxy-server-service
                            port:
                              number: 80
                EOF
                Mar 7, 2025

                Subsections of Container

                CheatSheet

                1. remove specific image
                podman rmi <$image_id>
                2. remove all <none> images
                podman rmi `podman images | grep '<none>' | awk '{print $3}'`
                3. remove all stopped containers
                podman container prune
                4. remove all podman images not used
                podman image prune
                5. remove all unused volumes
                sudo podman volume prune
                6. find ip address of a container
                podman inspect --format='{{.NetworkSettings.IPAddress}}' minio-server
                7. exec into a running container
                podman exec -it <$container_id> /bin/bash
                8. run with environment
                podman run -d --replace \
                    -p 18123:8123 -p 19000:9000 \
                    --name clickhouse-server \
                    -e ALLOW_EMPTY_PASSWORD=yes \
                    --ulimit nofile=262144:262144 \
                    quay.m.daocloud.io/kryptonite/clickhouse-docker-rootless:20.9.3.45

                --ulimit nofile=262144:262144: sets the soft and hard limits on the number of open file descriptors available to the container's processes; ClickHouse needs a high limit here.

                ulimit is a shell built-in used to view, set, or restrict resource usage for the current user, such as the number of open file descriptors or the number of processes; raising hard limits requires root privileges.
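The `<none>`-image cleanup above feeds image IDs from `images` output into `rmi`; the extraction step can be checked on sample output before pointing it at a real daemon (the IDs below are made up):

```shell
# sample `podman images` output, used as a stand-in (IDs are made up)
images='REPOSITORY   TAG      IMAGE ID       CREATED   SIZE
docker.io/library/redis   7.2   76fdac66291c   2d   117MB
<none>   <none>   1a2b3c4d5e6f   5d   0B'

# image IDs that would be passed to `podman rmi`
ids=$(printf '%s\n' "$images" | grep '<none>' | awk '{print $3}')
echo "$ids"
```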

                9. login registry
                export ZJLAB_CR_PAT=ghp_xxxxxxxxxxxx
                echo $ZJLAB_CR_PAT | podman login --tls-verify=false cr.registry.res.cloud.zhejianglab.com -u ascm-org-1710208820455 --password-stdin
                
                export GITHUB_CR_PAT=ghp_xxxxxxxxxxxx
                echo $GITHUB_CR_PAT | podman login ghcr.io -u aaronyang0628 --password-stdin
                
                export DOCKER_CR_PAT=dckr_pat_bBN_Xkgz-xxxx
                echo $DOCKER_CR_PAT | podman login docker.io -u aaron666 --password-stdin
                
                export HARBOR_CR_PAT=Aaron
                echo $HARBOR_CR_PAT | podman login --tls-verify=false harbor.zhejianglab.com -u byang628@zhejianglab.org --password-stdin
                10. tag image
                podman tag 76fdac66291c cr.registry.res.cloud.zhejianglab.com/ay-dev/datahub-s3-fits:1.0.0
                11. push image
                podman push cr.registry.res.cloud.zhejianglab.com/ay-dev/datahub-s3-fits:1.0.0
                1. remove specific image
                docker rmi <$image_id>
                2. remove all <none> images
                docker rmi `docker images | grep '<none>' | awk '{print $3}'`
                3. remove all stopped containers
                docker container prune
                4. remove all docker images not used
                docker image prune
                5. find ip address of a container
                docker inspect --format='{{.NetworkSettings.IPAddress}}' minio-server
                6. exec into container
                docker exec -it <$container_id> /bin/bash
                7. run with environment
                docker run -d -p 18123:8123 -p 19000:9000 --name clickhouse-server -e ALLOW_EMPTY_PASSWORD=yes --ulimit nofile=262144:262144 quay.m.daocloud.io/kryptonite/clickhouse-docker-rootless:20.9.3.45 

                --ulimit nofile=262144:262144: sets the soft and hard limits on the number of open file descriptors available to the container's processes.

                8. copy file

                  Copy a local file into container

                  docker cp ./some_file CONTAINER:/work

                  or copy files from container to local path

                  docker cp CONTAINER:/var/logs/ /tmp/app_logs
                9. load a volume

                docker run --rm \
                    --entrypoint bash \
                    -v $PWD/data:/app:ro \
                    -it docker.io/minio/mc:latest \
                    -c "mc --insecure alias set minio https://oss-cn-hangzhou-zjy-d01-a.ops.cloud.zhejianglab.com/ g83B2sji1CbAfjQO 2h8NisFRELiwOn41iXc6sgufED1n1A \
                        && mc --insecure ls minio/csst-prod/ \
                        && mc --insecure mb --ignore-existing minio/csst-prod/crp-test \
                        && mc --insecure cp /app/modify.pdf minio/csst-prod/crp-test/ \
                        && mc --insecure ls --recursive minio/csst-prod/"
                Mar 7, 2024

                Subsections of Template

                Subsections of DevContainer Template

                Java 21 + Go 1.24

                prepare .devcontainer.json

                {
                  "name": "Go & Java DevContainer",
                  "build": {
                    "dockerfile": "Dockerfile"
                  },
                  "mounts": [
                    "source=/root/.kube/config,target=/root/.kube/config,type=bind",
                    "source=/root/.minikube/profiles/minikube/client.crt,target=/root/.minikube/profiles/minikube/client.crt,type=bind",
                    "source=/root/.minikube/profiles/minikube/client.key,target=/root/.minikube/profiles/minikube/client.key,type=bind",
                    "source=/root/.minikube/ca.crt,target=/root/.minikube/ca.crt,type=bind"
                  ],
                  "customizations": {
                    "vscode": {
                      "extensions": [
                        "golang.go",
                        "vscjava.vscode-java-pack",
                        "redhat.java",
                        "vscjava.vscode-maven",
                        "Alibaba-Cloud.tongyi-lingma",
                        "vscjava.vscode-java-debug",
                        "vscjava.vscode-java-dependency",
                        "vscjava.vscode-java-test"
                      ]
                    }
                  },
                  "remoteUser": "root",
                  "postCreateCommand": "go version && java -version && mvn -v"
                }

                prepare Dockerfile

                FROM m.daocloud.io/docker.io/ubuntu:24.04
                
                ENV DEBIAN_FRONTEND=noninteractive
                
                RUN apt-get update && \
                    apt-get install -y --no-install-recommends \
                    ca-certificates \
                    curl \
                    git \
                    wget \
                    gnupg \
                    vim \
                    lsb-release \
                    apt-transport-https \
                    && apt-get clean \
                    && rm -rf /var/lib/apt/lists/*
                
                # install OpenJDK 21 
                RUN mkdir -p /etc/apt/keyrings && \
                    wget -qO - https://packages.adoptium.net/artifactory/api/gpg/key/public | gpg --dearmor -o /etc/apt/keyrings/adoptium.gpg && \
                    echo "deb [signed-by=/etc/apt/keyrings/adoptium.gpg arch=amd64] https://packages.adoptium.net/artifactory/deb $(awk -F= '/^VERSION_CODENAME/{print$2}' /etc/os-release) main" | tee /etc/apt/sources.list.d/adoptium.list > /dev/null && \
                    apt-get update && \
                    apt-get install -y temurin-21-jdk && \
                    apt-get clean && \
                    rm -rf /var/lib/apt/lists/*
                
                # set java env
                ENV JAVA_HOME=/usr/lib/jvm/temurin-21-jdk-amd64
                
                # install maven
                ARG MAVEN_VERSION=3.9.10
                RUN wget https://dlcdn.apache.org/maven/maven-3/${MAVEN_VERSION}/binaries/apache-maven-${MAVEN_VERSION}-bin.tar.gz -O /tmp/maven.tar.gz && \
                    mkdir -p /opt/maven && \
                    tar -C /opt/maven -xzf /tmp/maven.tar.gz --strip-components=1 && \
                    rm /tmp/maven.tar.gz
                
                ENV MAVEN_HOME=/opt/maven
                ENV PATH="${MAVEN_HOME}/bin:${PATH}"
                
                # install go 1.24.4 
                ARG GO_VERSION=1.24.4
                RUN wget https://dl.google.com/go/go${GO_VERSION}.linux-amd64.tar.gz -O /tmp/go.tar.gz && \
                    tar -C /usr/local -xzf /tmp/go.tar.gz && \
                    rm /tmp/go.tar.gz
                
                # set go env
                ENV GOROOT=/usr/local/go
                ENV GOPATH=/go
                ENV PATH="${GOROOT}/bin:${GOPATH}/bin:${PATH}"
                
                # install other binaries
                ARG KUBECTL_VERSION=v1.33.0
                RUN wget https://files.m.daocloud.io/dl.k8s.io/release/${KUBECTL_VERSION}/bin/linux/amd64/kubectl -O /tmp/kubectl && \
                    chmod u+x /tmp/kubectl && \
                    mv -f /tmp/kubectl /usr/local/bin/kubectl 
                
                ARG HELM_VERSION=v3.13.3
                RUN wget https://files.m.daocloud.io/get.helm.sh/helm-${HELM_VERSION}-linux-amd64.tar.gz -O /tmp/helm-${HELM_VERSION}-linux-amd64.tar.gz && \
                    mkdir -p /opt/helm && \
                    tar -C /opt/helm -xzf /tmp/helm-${HELM_VERSION}-linux-amd64.tar.gz && \
                    rm /tmp/helm-${HELM_VERSION}-linux-amd64.tar.gz
                
                ENV HELM_HOME=/opt/helm/linux-amd64
                ENV PATH="${HELM_HOME}:${PATH}"
                
                USER root
                WORKDIR /workspace
                Mar 7, 2024

                Python 3.11 + CUDA 12.0

                prepare .devcontainer.json

                {
                	"name": "DALI Learning Environment",
                	"build": {
                		"dockerfile": "Dockerfile",
                		"context": "..",
                		"args": {
                			"VARIANT": "3.11",
                			"HTTP_PROXY": "",
                			"HTTPS_PROXY": "",
                			"http_proxy": "",
                			"https_proxy": ""
                		}
                	},
                	"forwardPorts": [8000],
                	"portsAttributes": {
                		"8000": {
                			"label": "HTTP Server",
                			"protocol": "http",
                			"onAutoForward": "notify"
                		}
                	},
                	"customizations": {
                		"vscode": {
                			"extensions": [
                				"ms-python.python",
                				"ms-python.vscode-pylance",
                				"ms-python.debugpy"
                			],
                			"settings": {
                				"python.defaultInterpreterPath": "/usr/bin/python",
                				"files.exclude": {
                					"**/__pycache__": true,
                					"**/*.pyc": true
                				}
                			}
                		}
                	},
                	"postCreateCommand": "bash .devcontainer/post-create.sh",
                	"remoteUser": "vscode",
                	"runArgs": [
                		"-p", "0.0.0.0:8000:8000", 
                		"--device=/dev/nvidiactl",
                		"--device=/dev/nvidia0",
                		"--device=/dev/nvidia-uvm",
                		"--device=/dev/nvidia-uvm-tools",
                		"--ipc=host",
                		"--ulimit", "memlock=-1",
                		"--ulimit", "stack=67108864",
                		"--env", "LD_LIBRARY_PATH=/usr/local/cuda/lib64:/usr/lib/x86_64-linux-gnu:/host/usr/lib/x86_64-linux-gnu"
                	],
                	"mounts": [
                		"type=bind,src=${localEnv:HOME}/.ssh,target=/home/vscode/.ssh,readonly",
                		"type=bind,src=/usr/lib/x86_64-linux-gnu,target=/host/usr/lib/x86_64-linux-gnu,readonly",
                		"type=bind,src=/usr/bin/nvidia-smi,target=/usr/bin/nvidia-smi,readonly",
                		"type=bind,src=/usr/bin/nvidia-debugdump,target=/usr/bin/nvidia-debugdump,readonly"
                	],
                	"containerEnv": {
                		"CUDA_VISIBLE_DEVICES": "0",
                		"NVIDIA_VISIBLE_DEVICES": "all",
                		"NVIDIA_DRIVER_CAPABILITIES": "compute,utility",
                		"HTTP_PROXY": "",
                		"HTTPS_PROXY": "",
                		"http_proxy": "",
                		"https_proxy": "",
                		"NO_PROXY": "",
                		"no_proxy": ""
                	},
                	"description": "NVIDIA DALI MCP development environment - lightweight image with GPU support"
                }

                prepare Dockerfile

                # Use runtime image instead of devel to reduce size (4GB vs 10GB)
                FROM m.daocloud.io/docker.io/nvidia/cuda:12.1.0-cudnn8-runtime-ubuntu22.04
                
                ENV DEBIAN_FRONTEND=noninteractive
                ENV TZ=Asia/Shanghai
                
                # Add deadsnakes PPA for Python 3.11 and install base dependencies
                # Clear proxy settings to avoid connection issues during build
                RUN unset HTTP_PROXY HTTPS_PROXY http_proxy https_proxy && \
                    apt-get update && apt-get install -y --no-install-recommends \
                    software-properties-common \
                    curl \
                    && add-apt-repository ppa:deadsnakes/ppa -y && \
                    apt-get update && apt-get install -y --no-install-recommends \
                    python3.11 \
                    python3.11-dev \
                    python3.11-distutils \
                    python3-pip \
                    git \
                    wget \
                    && rm -rf /var/lib/apt/lists/*
                
                # Install Node.js (LTS) for Claude CLI
                RUN unset HTTP_PROXY HTTPS_PROXY http_proxy https_proxy && \
                    curl -fsSL https://deb.nodesource.com/setup_lts.x | bash - && \
                    apt-get install -y --no-install-recommends nodejs && \
                    rm -rf /var/lib/apt/lists/*
                
                # Set NVIDIA library paths
                ENV LD_LIBRARY_PATH=/usr/local/cuda/lib64:${LD_LIBRARY_PATH}
                ENV CUDA_HOME=/usr/local/cuda
                ENV PATH=${CUDA_HOME}/bin:${PATH}
                ENV NVIDIA_VISIBLE_DEVICES=all
                ENV NVIDIA_DRIVER_CAPABILITIES=compute,utility
                
                # Set Python 3.11 as default version
                RUN unset HTTP_PROXY HTTPS_PROXY http_proxy https_proxy && \
                    update-alternatives --install /usr/bin/python python /usr/bin/python3.11 1 && \
                    update-alternatives --install /usr/bin/python3 python3 /usr/bin/python3.11 1
                
                # Upgrade pip
                RUN unset HTTP_PROXY HTTPS_PROXY http_proxy https_proxy && \
                    python -m pip install --no-cache-dir --upgrade pip setuptools wheel
                
                # Install DALI and minimal packages for MCP development
                RUN unset HTTP_PROXY HTTPS_PROXY http_proxy https_proxy && \
                    pip install --no-cache-dir \
                    --extra-index-url https://pypi.nvidia.com \
                    nvidia-dali-cuda120 \
                    numpy \
                    ipython
                
                # Create non-root user
                RUN useradd -m -s /bin/bash vscode && \
                    mkdir -p /workspace && \
                    chown -R vscode:vscode /workspace
                
                WORKDIR /workspace
                USER vscode
                
                CMD ["/bin/bash"]

                post-create.sh

                #!/bin/bash
                
                # Clear proxy settings (should already be cleared by containerEnv, but double-check)
                unset HTTP_PROXY HTTPS_PROXY http_proxy https_proxy
                
                # Install MCP SDK and minimal dependencies
                pip install --no-cache-dir \
                    -i https://pypi.tuna.tsinghua.edu.cn/simple \
                    mcp \
                    anthropic
                
                curl -fsSL https://claude.ai/install.sh | bash
                
                echo 'export PATH="$HOME/.local/bin:$PATH"' >> ~/.bashrc && source ~/.bashrc
                
                # Create working directories
                mkdir -p /workspace/scripts
                
                echo "✅ DALI  environment setup completed!"
                Mar 7, 2024

                Subsections of DEV

                Devpod

                Preliminary

                • Kubernetes has been installed; if not, check 🔗link
                • Devpod has been installed; if not, check 🔗link

                1. Get provider config

                # just copy ~/.kube/config

                for example, the original config

                apiVersion: v1
                clusters:
                - cluster:
                    certificate-authority: <$file_path>
                    extensions:
                    - extension:
                        provider: minikube.sigs.k8s.io
                        version: v1.33.0
                      name: cluster_info
                    server: https://<$minikube_ip>:8443
                  name: minikube
                contexts:
                - context:
                    cluster: minikube
                    extensions:
                    - extension:
                        provider: minikube.sigs.k8s.io
                        version: v1.33.0
                      name: context_info
                    namespace: default
                    user: minikube
                  name: minikube
                current-context: minikube
                kind: Config
                preferences: {}
                users:
                - name: minikube
                  user:
                    client-certificate: <$file_path>
                    client-key: <$file_path>

                you need to rename clusters.cluster.certificate-authority, users.user.client-certificate, and users.user.client-key, and change clusters.cluster.server.

                clusters.cluster.certificate-authority -> clusters.cluster.certificate-authority-data
                clusters.cluster.server -> ip set to `localhost`
                users.user.client-certificate -> users.user.client-certificate-data
                users.user.client-key -> users.user.client-key-data

                the value you paste after each *-data key should be base64-encoded:

                cat <$file_path> | base64
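A sketch of producing a `*-data` value from a certificate file; the path and contents below are dummies so the example is self-contained, use your real minikube client files:

```shell
# write a dummy cert so the sketch is self-contained (stand-in for the real file)
printf 'dummy-cert-bytes' > /tmp/client.crt

# kubeconfig *-data fields expect the file contents base64-encoded on one line
CERT_DATA=$(base64 < /tmp/client.crt | tr -d '\n')
echo "client-certificate-data: $CERT_DATA"
```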

                then the modified config file should look like this:

                apiVersion: v1
                clusters:
                - cluster:
                    certificate-authority-data: xxxxxxxxxxxxxx
                    extensions:
                    - extension:
                        provider: minikube.sigs.k8s.io
                        version: v1.33.0
                      name: cluster_info
                    server: https://127.0.0.1:8443 
                  name: minikube
                contexts:
                - context:
                    cluster: minikube
                    extensions:
                    - extension:
                        provider: minikube.sigs.k8s.io
                        version: v1.33.0
                      name: context_info
                    namespace: default
                    user: minikube
                  name: minikube
                current-context: minikube
                kind: Config
                preferences: {}
                users:
                - name: minikube
                  user:
                    client-certificate-data: xxxxxxxxxxxx
                    client-key-data: xxxxxxxxxxxxxxxx

                then forward the minikube API server port to your own PC

                #where you host minikube
                MACHINE_IP_ADDRESS=10.200.60.102
                USER=ayay
                MINIKUBE_IP_ADDRESS=$(ssh -o 'UserKnownHostsFile /dev/null' $USER@$MACHINE_IP_ADDRESS '$HOME/bin/minikube ip')
                ssh -o 'UserKnownHostsFile /dev/null' $USER@$MACHINE_IP_ADDRESS -L "*:8443:$MINIKUBE_IP_ADDRESS:8443" -N -f

                2. Create workspace

                1. get git repo link
                2. choose appropriate provider
                3. choose ide type and version
                4. and go!

                Useful Command

                Install Kubectl

                for more information, you can check 🔗link to install kubectl

                • How to use it in devpod

                  In most cases everything works out of the box.

                  When you are inside the pod and using kubectl, change clusters.cluster.server in ~/.kube/config to https://<$minikube_ip>:8443

                • exec into devpod

                kubectl -n devpod exec -it <$resource_id> -c devpod -- /bin/bash
                • add DNS item
                10.aaa.bbb.ccc gitee.zhejianglab.com
                • shutdown ssh tunnel
                  # Windows: check if port 8443 is already open
                  netstat -aon | findstr "8443"
                  
                  # find PID
                  ps | grep ssh
                  
                  # kill the process
                  taskkill /PID <$PID> /T /F
                  # Linux/macOS: check if port 8443 is already open
                  netstat -an | grep "8443"
                  
                  # find PID
                  ps aux | grep ssh
                  
                  # kill the process
                  kill -9 <$PID>
                Mar 7, 2024

                Dev Container

                write .devcontainer.json

                Mar 7, 2024

                JumpServer

                          Local             Jumpserver        virtual node (develop/k3s)
                        ________            _______                ________ 
                       ╱        ╲          ╱       ╲╲             ╱        ╲
                      ╱         ╱ ------  ╱        ╱╱  --------  ╱         ╱
                     ╱         ╱         ╱         ╱            ╱         ╱ 
                     ╲________╱          ╲________╱             ╲________╱  
                    IP: 10.A.B.C    IP: jumpserver.ay.dev   IP: 192.168.100.xxx                                

                Modify SSH Config

                Port 30022 serves SSH on the jumpserver.

                cat .ssh/config
                Host jumpserver
                  HostName jumpserver.ay.dev
                  Port 30022
                  User ay
                  IdentityFile ~/.ssh/id_rsa
                
                Host virtual
                  HostName 192.168.100.xxx
                  Port 22
                  User ay
                  ProxyJump jumpserver
                  IdentityFile ~/.ssh/id_rsa

                Then you can connect directly to the virtual node with ssh virtual.

                Forward port in virtual node

                Port 30022 serves SSH on the jumpserver.

                Port 32524 is the service you want to forward.

                ssh -o 'UserKnownHostsFile /dev/null' -o 'ServerAliveInterval=60' -L 32524:192.168.100.xxx:32524 -p 30022 ay@jumpserver.ay.dev
                Mar 7, 2024

                Subsections of Operator SDK

                KubeBuilder

                Basic

                Kubebuilder is an SDK for building Kubernetes APIs with CRDs. It mainly:

                • builds on top of controller-runtime and client-go;
                • provides an extensible API framework that makes it easy to develop CRDs, Controllers, and Admission Webhooks from scratch to extend K8s;
                • also provides scaffolding tools that initialize a CRD project and auto-generate boilerplate template code and configuration.

                Architecture

                mvc

                Main.go

                import (
                	_ "k8s.io/client-go/plugin/pkg/client/auth"
                
                	ctrl "sigs.k8s.io/controller-runtime"
                )
                // nolint:gocyclo
                func main() {
                    ...
                
                    mgr, err := ctrl.NewManager(ctrl.GetConfigOrDie(), ctrl.Options{})
                
                    ...
                    if err = (&controller.GuestbookReconciler{
                        Client: mgr.GetClient(),
                        Scheme: mgr.GetScheme(),
                    }).SetupWithManager(mgr); err != nil {
                        setupLog.Error(err, "unable to create controller", "controller", "Guestbook")
                        os.Exit(1)
                    }
                
                    ...
                    if os.Getenv("ENABLE_WEBHOOKS") != "false" {
                        if err = webhookwebappv1.SetupGuestbookWebhookWithManager(mgr); err != nil {
                            setupLog.Error(err, "unable to create webhook", "webhook", "Guestbook")
                            os.Exit(1)
                        }
                    }

                Manager

                The Manager is the core component: it coordinates multiple controllers and handles the cache, clients, leader election, and more. See https://github.com/kubernetes-sigs/controller-runtime/blob/v0.20.0/pkg/manager/manager.go

                • Client handles the key responsibilities of communicating with the Kubernetes API Server, operating on resource objects, and reading/writing the cache. It comes in two kinds:
                  • Reader: reads from the Cache first to avoid frequent API Server access; results of a Get are cached.
                  • Writer: supports write operations (Create, Update, Delete, Patch) and interacts with the API Server directly.
                  • Informers are core client-go components that watch the API Server for change events (Create/Update/Delete) on specific resource types (such as Pod, Deployment, or a custom CRD).
                    • The Client relies on the Informer mechanism to keep the cache in sync automatically. When a resource changes in the API Server, the Informer updates the local cache so that subsequent reads get the latest data.
                • Cache
                  • The Cache watches API Server resource changes through the built-in client's ListWatcher mechanism.
                  • Events are written to the local cache (e.g. an Indexer), avoiding frequent API Server access.
                  • The Cache's purpose is to reduce direct requests to the API Server while still letting controllers quickly read the latest resource state.
                • Event

                  The Kubernetes API Server pushes resource change events over long-lived HTTP connections, and client-go's Informer listens for these messages.

                  • Event: the message passed from the Kubernetes API Server to the Controller, containing the resource type, resource name, event type (ADDED, MODIFIED, DELETED), and so on; events are converted into requests, check link
                  • API Server → Manager's Informer → Cache → Controller's Watch → Predicate filtering → WorkQueue → Controller's Reconcile() method

                Controller

                It’s a controller’s job to ensure that, for any given object the actual state of the world matches the desired state in the object. Each controller focuses on one root Kind, but may interact with other Kinds.

                func (r *GuestbookReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
                    ...
                }
                func (r *GuestbookReconciler) SetupWithManager(mgr ctrl.Manager) error {
                	return ctrl.NewControllerManagedBy(mgr).
                		For(&webappv1.Guestbook{}).
                		Named("guestbook").
                		Complete(r)
                }

                If you want to build your own controller, please check https://github.com/kubernetes/community/blob/master/contributors/devel/sig-api-machinery/controllers.md

                1. At initialization, each Controller registers with the Manager the resource types it cares about (for example, declaring interest in Pod resources via Owns(&v1.Pod{})).

                2. Based on the Controllers' registrations, the Manager creates the corresponding Informers and Watches for those resources, check link

                3. When a resource change event occurs, the Informer takes the event from the cache, and a Predicate (filter) decides whether reconciliation should be triggered.

                4. If the event passes the filter, the Controller puts it on the work queue (WorkQueue) and finally calls the user-implemented Reconcile() function, check link

                // Abridged from controller-runtime's internal controller.
                func (c *Controller[request]) Start(ctx context.Context) error {
                
                	c.ctx = ctx
                
                	queue := c.NewQueue(c.Name, c.RateLimiter)
                
                	c.Queue = &priorityQueueWrapper[request]{TypedRateLimitingInterface: queue}
                
                	err := func() error {
                
                		// start to sync event sources
                		if err := c.startEventSources(ctx); err != nil {
                			return err
                		}
                
                		// spawn workers that drain the work queue
                		for i := 0; i < c.MaxConcurrentReconciles; i++ {
                			go func() {
                				for c.processNextWorkItem(ctx) {
                				}
                			}()
                		}
                		return nil
                	}()
                	if err != nil {
                		return err
                	}
                
                	// block until the context is cancelled, then report shutdown
                	<-ctx.Done()
                	c.LogConstructor(nil).Info("All workers finished")
                	return nil
                }
                func (c *Controller[request]) processNextWorkItem(ctx context.Context) bool {
                	obj, priority, shutdown := c.Queue.GetWithPriority()
                	if shutdown {
                		// the queue is shutting down; stop this worker
                		return false
                	}
                
                	c.reconcileHandler(ctx, obj, priority)
                	return true
                }

                Webhook

                Webhooks are a mechanism to intercept requests to the Kubernetes API server. They can be used to validate, mutate, or even proxy requests.

                func (d *GuestbookCustomDefaulter) Default(ctx context.Context, obj runtime.Object) error {}
                
                func (v *GuestbookCustomValidator) ValidateCreate(ctx context.Context, obj runtime.Object) (admission.Warnings, error) {}
                
                func (v *GuestbookCustomValidator) ValidateUpdate(ctx context.Context, oldObj, newObj runtime.Object) (admission.Warnings, error) {}
                
                func (v *GuestbookCustomValidator) ValidateDelete(ctx context.Context, obj runtime.Object) (admission.Warnings, error) {}
                
                func SetupGuestbookWebhookWithManager(mgr ctrl.Manager) error {
                	return ctrl.NewWebhookManagedBy(mgr).For(&webappv1.Guestbook{}).
                		WithValidator(&GuestbookCustomValidator{}).
                		WithDefaulter(&GuestbookCustomDefaulter{}).
                		Complete()
                }
                Mar 7, 2024

                Subsections of KubeBuilder

                Quick Start

                Prerequisites

                • go version v1.23.0+
                • docker version 17.03+.
                • kubectl version v1.11.3+.
                • Access to a Kubernetes v1.11.3+ cluster.

                Installation

                # download kubebuilder and install locally.
                curl -L -o kubebuilder "https://go.kubebuilder.io/dl/latest/$(go env GOOS)/$(go env GOARCH)"
                chmod +x kubebuilder && sudo mv kubebuilder /usr/local/bin/

                Create A Project

                mkdir -p ~/projects/guestbook
                cd ~/projects/guestbook
                kubebuilder init --domain my.domain --repo my.domain/guestbook
                Error: unable to scaffold with “base.go.kubebuilder.io/v4”:exit status 1

                Just try again!

                rm -rf ~/projects/guestbook/*
                kubebuilder init --domain my.domain --repo my.domain/guestbook

                Create An API

                kubebuilder create api --group webapp --version v1 --kind Guestbook
                Error: unable to run post-scaffold tasks of “base.go.kubebuilder.io/v4”: exec: “make”: executable file not found in $PATH
                apt-get -y install make
                rm -rf ~/projects/guestbook/*
                kubebuilder init --domain my.domain --repo my.domain/guestbook
                kubebuilder create api --group webapp --version v1 --kind Guestbook

                Prepare a K8s Cluster

                Create a cluster in minikube:
                minikube start --kubernetes-version=v1.27.10 --image-mirror-country=cn --image-repository=registry.cn-hangzhou.aliyuncs.com/google_containers --cpus=4 --memory=4g --disk-size=50g --force


                Modify API [Optional]

                You can modify the file ~/projects/guestbook/api/v1/guestbook_types.go

                type GuestbookSpec struct {
                	// INSERT ADDITIONAL SPEC FIELDS - desired state of cluster
                	// Important: Run "make" to regenerate code after modifying this file
                
                	// Foo is an example field of Guestbook. Edit guestbook_types.go to remove/update
                	Foo string `json:"foo,omitempty"`
                }

                which corresponds to the file ~/projects/guestbook/config/samples/webapp_v1_guestbook.yaml
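The generated sample is just a named CR that sets the spec fields; with the scaffold defaults above it looks roughly like this (shown as an illustration, matching the foo-value seen in the controller log later):

```yaml
apiVersion: webapp.my.domain/v1
kind: Guestbook
metadata:
  name: guestbook-sample
spec:
  foo: foo-value
```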

                If you are editing the API definitions, generate the manifests such as Custom Resources (CRs) or Custom Resource Definitions (CRDs) using

                make manifests
                Modify Controller [Optional]

                You can modify the file ~/projects/guestbook/internal/controller/guestbook_controller.go

                // 	"fmt"
                // "k8s.io/apimachinery/pkg/api/errors"
                // "k8s.io/apimachinery/pkg/types"
                // 	appsv1 "k8s.io/api/apps/v1"
                //	corev1 "k8s.io/api/core/v1"
                //	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
                func (r *GuestbookReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
                	// The context is used to allow cancellation of requests, and potentially things like tracing. 
                	_ = log.FromContext(ctx)
                
                	fmt.Printf("I am a controller ->>>>>>")
                	fmt.Printf("Name: %s, Namespace: %s", req.Name, req.Namespace)
                
                	guestbook := &webappv1.Guestbook{}
                	if err := r.Get(ctx, req.NamespacedName, guestbook); err != nil {
                		return ctrl.Result{}, err
                	}
                
                	fooString := guestbook.Spec.Foo
                	replicas := int32(1)
                	fmt.Printf("Foo String: %s", fooString)
                
                	// labels := map[string]string{
                	// 	"app": req.Name,
                	// }
                
                	// dep := &appsv1.Deployment{
                	// 	ObjectMeta: metav1.ObjectMeta{
                	// 		Name:      fooString + "-deployment",
                	// 		Namespace: req.Namespace,
                	// 		Labels:    labels,
                	// 	},
                	// 	Spec: appsv1.DeploymentSpec{
                	// 		Replicas: &replicas,
                	// 		Selector: &metav1.LabelSelector{
                	// 			MatchLabels: labels,
                	// 		},
                	// 		Template: corev1.PodTemplateSpec{
                	// 			ObjectMeta: metav1.ObjectMeta{
                	// 				Labels: labels,
                	// 			},
                	// 			Spec: corev1.PodSpec{
                	// 				Containers: []corev1.Container{{
                	// 					Name:  fooString,
                	// 					Image: "busybox:latest",
                	// 				}},
                	// 			},
                	// 		},
                	// 	},
                	// }
                
                	// existingDep := &appsv1.Deployment{}
                	// err := r.Get(ctx, types.NamespacedName{Name: dep.Name, Namespace: dep.Namespace}, existingDep)
                	// if err != nil {
                	// 	if errors.IsNotFound(err) {
                	// 		if err := r.Create(ctx, dep); err != nil {
                	// 			return ctrl.Result{}, err
                	// 		}
                	// 	} else {
                	// 		return ctrl.Result{}, err
                	// 	}
                	// }
                
                	return ctrl.Result{}, nil
                }

                And you can use make run to test your controller.

                make run

                and use following command to send a request

                Make sure you have installed the CRDs (make install) before you execute the following commands:

                make install
                kubectl apply -k config/samples/

                Your controller terminal should look like this:

                I am a controller ->>>>>>Name: guestbook-sample, Namespace: defaultFoo String: foo-value

                Install CRDs

                check installed crds in k8s

                kubectl get crds

                install guestbook crd in k8s

                cd ~/projects/guestbook
                make install

                uninstall CRDs

                make uninstall
                
                make undeploy

                Deploy to cluster

                make docker-build IMG=aaron666/guestbook-operator:test
                make docker-build docker-push IMG=<some-registry>/<project-name>:tag
                make deploy IMG=<some-registry>/<project-name>:tag
                Mar 7, 2024

                Operator-SDK

                  Mar 7, 2024

                  Subsections of Proxy

                  Daocloud Binary

                  Usage

                  Add the files.m.daocloud.io prefix to the original URL. For example:

                  # Original Helm download URL
                  wget https://get.helm.sh/helm-v3.9.1-linux-amd64.tar.gz
                  
                  # Accelerated URL
                  wget https://files.m.daocloud.io/get.helm.sh/helm-v3.9.1-linux-amd64.tar.gz

                  This accelerates the download. If the requested file is not cached yet, the request will block while the cache is being filled; subsequent downloads have no bandwidth limit.

                  Best Practices

                  Scenario 1 - Install Helm

                  cd /tmp
                  export HELM_VERSION="v3.9.3"
                  
                  wget "https://files.m.daocloud.io/get.helm.sh/helm-${HELM_VERSION}-linux-amd64.tar.gz"
                  tar -zxvf helm-${HELM_VERSION}-linux-amd64.tar.gz
                  mv linux-amd64/helm /usr/local/bin/helm
                  helm version

                  Scenario 2 - Install KubeSpray

                  Just add the following configuration:

                  files_repo: "https://files.m.daocloud.io"
                  
                  ## Kubernetes components
                  kubeadm_download_url: "{{ files_repo }}/dl.k8s.io/release/{{ kubeadm_version }}/bin/linux/{{ image_arch }}/kubeadm"
                  kubectl_download_url: "{{ files_repo }}/dl.k8s.io/release/{{ kube_version }}/bin/linux/{{ image_arch }}/kubectl"
                  kubelet_download_url: "{{ files_repo }}/dl.k8s.io/release/{{ kube_version }}/bin/linux/{{ image_arch }}/kubelet"
                  
                  ## CNI Plugins
                  cni_download_url: "{{ files_repo }}/github.com/containernetworking/plugins/releases/download/{{ cni_version }}/cni-plugins-linux-{{ image_arch }}-{{ cni_version }}.tgz"
                  
                  ## cri-tools
                  crictl_download_url: "{{ files_repo }}/github.com/kubernetes-sigs/cri-tools/releases/download/{{ crictl_version }}/crictl-{{ crictl_version }}-{{ ansible_system | lower }}-{{ image_arch }}.tar.gz"
                  
                  ## [Optional] etcd: only if you **DON'T** use etcd_deployment=host
                  etcd_download_url: "{{ files_repo }}/github.com/etcd-io/etcd/releases/download/{{ etcd_version }}/etcd-{{ etcd_version }}-linux-{{ image_arch }}.tar.gz"
                  
                  # [Optional] Calico: If using Calico network plugin
                  calicoctl_download_url: "{{ files_repo }}/github.com/projectcalico/calico/releases/download/{{ calico_ctl_version }}/calicoctl-linux-{{ image_arch }}"
                  calicoctl_alternate_download_url: "{{ files_repo }}/github.com/projectcalico/calicoctl/releases/download/{{ calico_ctl_version }}/calicoctl-linux-{{ image_arch }}"
                  # [Optional] Calico with kdd: If using Calico network plugin with kdd datastore
                  calico_crds_download_url: "{{ files_repo }}/github.com/projectcalico/calico/archive/{{ calico_version }}.tar.gz"
                  
                  # [Optional] Flannel: If using Flannel network plugin
                  flannel_cni_download_url: "{{ files_repo }}/kubernetes/flannel/{{ flannel_cni_version }}/flannel-{{ image_arch }}"
                  
                  # [Optional] helm: only if you set helm_enabled: true
                  helm_download_url: "{{ files_repo }}/get.helm.sh/helm-{{ helm_version }}-linux-{{ image_arch }}.tar.gz"
                  
                  # [Optional] crun: only if you set crun_enabled: true
                  crun_download_url: "{{ files_repo }}/github.com/containers/crun/releases/download/{{ crun_version }}/crun-{{ crun_version }}-linux-{{ image_arch }}"
                  
                  # [Optional] kata: only if you set kata_containers_enabled: true
                  kata_containers_download_url: "{{ files_repo }}/github.com/kata-containers/kata-containers/releases/download/{{ kata_containers_version }}/kata-static-{{ kata_containers_version }}-{{ ansible_architecture }}.tar.xz"
                  
                  # [Optional] cri-dockerd: only if you set container_manager: docker
                  cri_dockerd_download_url: "{{ files_repo }}/github.com/Mirantis/cri-dockerd/releases/download/v{{ cri_dockerd_version }}/cri-dockerd-{{ cri_dockerd_version }}.{{ image_arch }}.tgz"
                  
                  # [Optional] runc,containerd: only if you set container_runtime: containerd
                  runc_download_url: "{{ files_repo }}/github.com/opencontainers/runc/releases/download/{{ runc_version }}/runc.{{ image_arch }}"
                  containerd_download_url: "{{ files_repo }}/github.com/containerd/containerd/releases/download/v{{ containerd_version }}/containerd-{{ containerd_version }}-linux-{{ image_arch }}.tar.gz"
                  nerdctl_download_url: "{{ files_repo }}/github.com/containerd/nerdctl/releases/download/v{{ nerdctl_version }}/nerdctl-{{ nerdctl_version }}-{{ ansible_system | lower }}-{{ image_arch }}.tar.gz"

                  Measured download speed can reach Downloaded: 19 files, 603M in 23s (25.9 MB/s), i.e. all files can be downloaded within 23s! For the complete method, see https://gist.github.com/yankay/a863cf2e300bff6f9040ab1c6c58fbae

                  Scenario 3 - Install KIND

                  cd /tmp
                  export KIND_VERSION="v0.22.0"
                  
                  curl -Lo ./kind https://files.m.daocloud.io/github.com/kubernetes-sigs/kind/releases/download/${KIND_VERSION}/kind-linux-amd64
                  chmod +x ./kind
                  mv ./kind /usr/bin/kind
                  kind version

                  Scenario 4 - Install K9S

                  cd /tmp
                  export K9S_VERSION="v0.32.4"
                  
                  wget https://files.m.daocloud.io/github.com/derailed/k9s/releases/download/${K9S_VERSION}/k9s_Linux_x86_64.tar.gz
                  tar -zxvf k9s_Linux_x86_64.tar.gz
                  chmod +x k9s
                  mv k9s /usr/bin/k9s
                  k9s version

                  Scenario 5 - Install istio

                  cd /tmp
                  export ISTIO_VERSION="1.14.3"
                  
                  wget "https://files.m.daocloud.io/github.com/istio/istio/releases/download/${ISTIO_VERSION}/istio-${ISTIO_VERSION}-linux-amd64.tar.gz"
                  tar -zxvf istio-${ISTIO_VERSION}-linux-amd64.tar.gz
                  # Do follow the istio docs to install istio

                  Scenario 6 - Install nerdctl (as a docker replacement)

                  This installs as root; for other installation methods, see the upstream repo: https://github.com/containerd/nerdctl

                  export NERDCTL_VERSION="1.7.6"
                  mkdir -p nerdctl ;cd nerdctl
                  wget https://files.m.daocloud.io/github.com/containerd/nerdctl/releases/download/v${NERDCTL_VERSION}/nerdctl-full-${NERDCTL_VERSION}-linux-amd64.tar.gz
                  tar -zvxf nerdctl-full-${NERDCTL_VERSION}-linux-amd64.tar.gz
                  mkdir -p /opt/cni/bin ;cp -f libexec/cni/* /opt/cni/bin/ ;cp bin/* /usr/local/bin/ ;cp lib/systemd/system/*.service /usr/lib/systemd/system/
                  systemctl enable containerd ;systemctl start containerd --now
                  systemctl enable buildkit;systemctl start buildkit --now

                  More scenarios are welcome as contributions.

                  Suffixes excluded from acceleration

                  Files with the following suffixes get a direct 403 response:

                  • .bmp
                  • .jpg
                  • .jpeg
                  • .png
                  • .gif
                  • .webp
                  • .tiff
                  • .mp4
                  • .webm
                  • .ogg
                  • .avi
                  • .mov
                  • .flv
                  • .mkv
                  • .mp3
                  • .wav
                  • .rar
                  Mar 7, 2024

                  Daocloud Image

                  Quick Start

                  docker run -d -P m.daocloud.io/docker.io/library/nginx

                  Usage

                  Add a prefix (recommended). For example:

                                docker.io/library/busybox
                                   |
                                   V
                  m.daocloud.io/docker.io/library/busybox

                  Alternatively, for supported registries you can use prefix replacement. For example:

                             docker.io/library/busybox
                               |
                               V
                  docker.m.daocloud.io/library/busybox

                  Cache misses

                  If DaoCloud has not cached the image at pull time, a cache-sync task is added to the sync queue.

                  Registries supporting prefix replacement (not recommended)

                  Adding the prefix is the recommended approach.

                  The prefix-replacement rules per registry are configured manually; file an issue if you need another one.

                  Origin → replacement (with notes)

                  • docker.elastic.co → elastic.m.daocloud.io
                  • docker.io → docker.m.daocloud.io
                  • gcr.io → gcr.m.daocloud.io
                  • ghcr.io → ghcr.m.daocloud.io
                  • k8s.gcr.io → k8s-gcr.m.daocloud.io (k8s.gcr.io has been migrated to registry.k8s.io)
                  • registry.k8s.io → k8s.m.daocloud.io
                  • mcr.microsoft.com → mcr.m.daocloud.io
                  • nvcr.io → nvcr.m.daocloud.io
                  • quay.io → quay.m.daocloud.io
                  • registry.ollama.ai → ollama.m.daocloud.io

                  Best Practices

                  Accelerate Kubernetes

                  Accelerate kubeadm installation

                  kubeadm config images pull --image-repository k8s-gcr.m.daocloud.io

                  Accelerate kind installation

                  kind create cluster --name kind --image m.daocloud.io/docker.io/kindest/node:v1.22.1

                  Accelerate Containerd

                  Accelerate Docker

                  Add the following to /etc/docker/daemon.json (then restart the docker daemon to apply):

                  {
                    "registry-mirrors": [
                      "https://docker.m.daocloud.io"
                    ]
                  }

                  Accelerate Ollama & DeepSeek

                  Accelerate Ollama installation

                  CPU:

                  docker run -d -v ollama:/root/.ollama -p 11434:11434 --name ollama docker.m.daocloud.io/ollama/ollama

                  GPU version:

                  1. First install the Nvidia Container Toolkit
                  2. Run the following command to start the Ollama container:
                  docker run -d --gpus=all -v ollama:/root/.ollama -p 11434:11434 --name ollama docker.m.daocloud.io/ollama/ollama

                  For more information, see:

                  Accelerate the Deepseek-R1 model

                  With the ollama container running as in the steps above, you can also use the mirror to accelerate launching DeepSeek model services.

                  Note: the official Ollama registry is already quite fast these days, so you can also use it directly.

                  # Use the mirror
                  docker exec -it ollama ollama run ollama.m.daocloud.io/library/deepseek-r1:1.5b
                  
                  # Or pull the model directly from the official registry
                  # docker exec -it ollama ollama run deepseek-r1:1.5b
                  Mar 7, 2024

                  KubeVPN

                  1. Install krew

                    1. Download and install krew
                    2. Add the $HOME/.krew/bin directory to your PATH environment variable:
                  export PATH="${KREW_ROOT:-$HOME/.krew}/bin:$PATH"
                    3. Run kubectl krew list to check the installation:
                  kubectl krew list

                  2. Install kubevpn from its GitHub krew index

                  kubectl krew index add kubevpn https://gitclone.com/github.com/kubenetworks/kubevpn.git
                  kubectl krew install kubevpn/kubevpn
                  kubectl kubevpn 

                  3. Deploy VPN in some cluster

                  Use a different kubeconfig to access a different cluster and deploy the VPN in that k8s.

                  kubectl kubevpn connect
                  If you want to connect to another k8s …
                  kubectl kubevpn connect --kubeconfig /root/.kube/xxx_config

                  Your terminal should look like this:

                  ➜  ~ kubectl kubevpn connect
                  Password:
                  Starting connect
                  Getting network CIDR from cluster info...
                  Getting network CIDR from CNI...
                  Getting network CIDR from services...
                  Labeling Namespace default
                  Creating ServiceAccount kubevpn-traffic-manager
                  Creating Roles kubevpn-traffic-manager
                  Creating RoleBinding kubevpn-traffic-manager
                  Creating Service kubevpn-traffic-manager
                  Creating MutatingWebhookConfiguration kubevpn-traffic-manager
                  Creating Deployment kubevpn-traffic-manager
                  
                  Pod kubevpn-traffic-manager-66d969fd45-9zlbp is Pending
                  Container     Reason            Message
                  control-plane ContainerCreating
                  vpn           ContainerCreating
                  webhook       ContainerCreating
                  
                  Pod kubevpn-traffic-manager-66d969fd45-9zlbp is Running
                  Container     Reason           Message
                  control-plane ContainerRunning
                  vpn           ContainerRunning
                  webhook       ContainerRunning
                  
                  Forwarding port...
                  Connected tunnel
                  Adding route...
                  Configured DNS service
                  +----------------------------------------------------------+
                  | Now you can access resources in the kubernetes cluster ! |
                  +----------------------------------------------------------+

                  Already connected to the cluster network; use the command kubectl kubevpn status to check the status.

                  ➜  ~ kubectl kubevpn status
                  ID Mode Cluster   Kubeconfig                  Namespace            Status      Netif
                  0  full ops-dev   /root/.kube/zverse_config   data-and-computing   Connected   utun0

                  Use the IP of pod productpage-788df7ff7f-jpkcs, 172.29.2.134:

                  ➜  ~ kubectl get pods -o wide
                  NAME                                       AGE     IP                NODE              NOMINATED NODE  GATES
                  authors-dbb57d856-mbgqk                    7d23h   172.29.2.132      192.168.0.5       <none>         
                  details-7d8b5f6bcf-hcl4t                   61d     172.29.0.77       192.168.104.255   <none>         
                  kubevpn-traffic-manager-66d969fd45-9zlbp   74s     172.29.2.136      192.168.0.5       <none>         
                  productpage-788df7ff7f-jpkcs               61d     172.29.2.134      192.168.0.5       <none>         
                  ratings-77b6cd4499-zvl6c                   61d     172.29.0.86       192.168.104.255   <none>         
                  reviews-85c88894d9-vgkxd                   24d     172.29.2.249      192.168.0.5       <none>         

                  Use ping to test the connection; seems good:

                  ➜  ~ ping 172.29.2.134
                  PING 172.29.2.134 (172.29.2.134): 56 data bytes
                  64 bytes from 172.29.2.134: icmp_seq=0 ttl=63 time=55.727 ms
                  64 bytes from 172.29.2.134: icmp_seq=1 ttl=63 time=56.270 ms
                  64 bytes from 172.29.2.134: icmp_seq=2 ttl=63 time=55.228 ms
                  64 bytes from 172.29.2.134: icmp_seq=3 ttl=63 time=54.293 ms
                  ^C
                  --- 172.29.2.134 ping statistics ---
                  4 packets transmitted, 4 packets received, 0.0% packet loss
                  round-trip min/avg/max/stddev = 54.293/55.380/56.270/0.728 ms

                  Use the IP of service productpage, 172.21.10.49:

                  ➜  ~ kubectl get services -o wide
                  NAME                      TYPE        CLUSTER-IP     PORT(S)              SELECTOR
                  authors                   ClusterIP   172.21.5.160   9080/TCP             app=authors
                  details                   ClusterIP   172.21.6.183   9080/TCP             app=details
                  kubernetes                ClusterIP   172.21.0.1     443/TCP              <none>
                  kubevpn-traffic-manager   ClusterIP   172.21.2.86    84xxxxxx0/TCP        app=kubevpn-traffic-manager
                  productpage               ClusterIP   172.21.10.49   9080/TCP             app=productpage
                  ratings                   ClusterIP   172.21.3.247   9080/TCP             app=ratings
                  reviews                   ClusterIP   172.21.8.24    9080/TCP             app=reviews

                  Use curl to test the service connection:

                  ➜  ~ curl 172.21.10.49:9080
                  <!DOCTYPE html>
                  <html>
                    <head>
                      <title>Simple Bookstore App</title>
                  <meta charset="utf-8">
                  <meta http-equiv="X-UA-Compatible" content="IE=edge">
                  <meta name="viewport" content="width=device-width, initial-scale=1">

                  seems good too~

                  If you want to resolve domains:

                  Domain resolve

                  A Pod/Service named productpage in the default namespace can be resolved by the following names:

                  • productpage
                  • productpage.default
                  • productpage.default.svc.cluster.local
                  ➜  ~ curl productpage.default.svc.cluster.local:9080
                  <!DOCTYPE html>
                  <html>
                    <head>
                      <title>Simple Bookstore App</title>
                  <meta charset="utf-8">
                  <meta http-equiv="X-UA-Compatible" content="IE=edge">
                  <meta name="viewport" content="width=device-width, initial-scale=1">

                  Short domain resolve

To access a service in the cluster, you can use the full service name or a short domain name, such as productpage

                  ➜  ~ curl productpage:9080
                  <!DOCTYPE html>
                  <html>
                    <head>
                      <title>Simple Bookstore App</title>
                  <meta charset="utf-8">
                  <meta http-equiv="X-UA-Compatible" content="IE=edge">
                  ...

                  Disclaimer: This only works on the namespace where kubevpn-traffic-manager is deployed.
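To see which namespace kubevpn-traffic-manager was deployed to (and therefore where short-name resolution works), you can query for it across namespaces. The `app` label below matches the SELECTOR column shown earlier, but verify it against your own deployment:

```shell
# Locate the kubevpn-traffic-manager pod and its namespace
kubectl get pods -A -l app=kubevpn-traffic-manager \
  -o custom-columns=NAMESPACE:.metadata.namespace,NAME:.metadata.name
```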

                  Mar 7, 2024

                  Subsections of Serverless

                  Subsections of Kserve

                  Install Kserve

                  Preliminary

• Kubernetes v1.30+ is installed; if not, check 🔗link
• Helm is installed; if not, check 🔗link

                  Installation

                  Install By

                  Preliminary

1. Kubernetes is installed; if not, check 🔗link


2. The Helm binary is installed; if not, check 🔗link


1. Install from the script directly

                  Details
                  curl -s "https://raw.githubusercontent.com/kserve/kserve/release-0.15/hack/quick_install.sh" | bash
Expected Output

                  Installing Gateway API CRDs …

                  😀 Successfully installed Istio

                  😀 Successfully installed Cert Manager

                  😀 Successfully installed Knative

But you will probably encounter some errors due to network issues, like this:
Error: INSTALLATION FAILED: context deadline exceeded

You need to reinstall some components:

                  export KSERVE_VERSION=v0.15.2
                  export deploymentMode=Serverless
                  helm upgrade --namespace kserve kserve-crd oci://ghcr.io/kserve/charts/kserve-crd --version $KSERVE_VERSION
                  helm upgrade --namespace kserve kserve oci://ghcr.io/kserve/charts/kserve --version $KSERVE_VERSION --set-string kserve.controller.deploymentMode="$deploymentMode"
                  # helm upgrade knative-operator --namespace knative-serving  https://github.com/knative/operator/releases/download/knative-v1.15.7/knative-operator-v1.15.7.tgz

                  Preliminary

                  1. If you have only one node in your cluster, you need at least 6 CPUs, 6 GB of memory, and 30 GB of disk storage.


                  2. If you have multiple nodes in your cluster, for each node you need at least 2 CPUs, 4 GB of memory, and 20 GB of disk storage.


1. Install Knative Serving CRD resources

                  Details
                  kubectl apply -f https://github.com/knative/serving/releases/download/knative-v1.18.0/serving-crds.yaml

2. Install Knative Serving components

                  Details
                  kubectl apply -f https://github.com/knative/serving/releases/download/knative-v1.18.0/serving-core.yaml
                  # kubectl apply -f https://raw.githubusercontent.com/AaronYang0628/assets/refs/heads/main/knative/serving/release/download/knative-v1.18.0/serving-core.yaml

3. Install the Istio network layer

                  Details
                  kubectl apply -l knative.dev/crd-install=true -f https://github.com/knative/net-istio/releases/download/knative-v1.18.0/istio.yaml
                  kubectl apply -f https://github.com/knative/net-istio/releases/download/knative-v1.18.0/istio.yaml
                  kubectl apply -f https://github.com/knative/net-istio/releases/download/knative-v1.18.0/net-istio.yaml
Expected Output

                  Monitor the Knative components until all of the components show a STATUS of Running or Completed.

                  kubectl get pods -n knative-serving
                  
                  #NAME                                      READY   STATUS    RESTARTS   AGE
                  #3scale-kourier-control-54cc54cc58-mmdgq   1/1     Running   0          81s
                  #activator-67656dcbbb-8mftq                1/1     Running   0          97s
                  #autoscaler-df6856b64-5h4lc                1/1     Running   0          97s
                  #controller-788796f49d-4x6pm               1/1     Running   0          97s
                  #domain-mapping-65f58c79dc-9cw6d           1/1     Running   0          97s
                  #domainmapping-webhook-cc646465c-jnwbz     1/1     Running   0          97s
                  #webhook-859796bc7-8n5g2                   1/1     Running   0          96s
                  Check Knative Hello World
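A quick way to check that Knative Serving works end to end is to deploy a minimal hello-world Service. The sample image below is the one used in the Knative docs; mirror it if your cluster cannot reach ghcr.io:

```shell
# Deploy a minimal Knative Service as a smoke test
kubectl apply -f - <<EOF
apiVersion: serving.knative.dev/v1
kind: Service
metadata:
  name: helloworld-go
  namespace: default
spec:
  template:
    spec:
      containers:
        - image: ghcr.io/knative/helloworld-go:latest
          env:
            - name: TARGET
              value: "Knative"
EOF

# The URL column should be populated and READY should become True
kubectl get ksvc helloworld-go
```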

4. Install cert-manager

                  Details
                  kubectl apply -f https://github.com/cert-manager/cert-manager/releases/download/v1.17.2/cert-manager.yaml

5. Install KServe

                  Details
                  kubectl apply --server-side -f https://github.com/kserve/kserve/releases/download/v0.15.0/kserve.yaml
                  kubectl apply --server-side -f https://github.com/kserve/kserve/releases/download/v0.15.0/kserve-cluster-resources.yaml
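After applying the manifests, you can verify that the KServe controller comes up. The deployment name below is what recent KServe releases use, but double-check it with the first command if yours differs:

```shell
# List KServe pods and wait for the controller to become Available
kubectl get pods -n kserve
kubectl -n kserve wait --for=condition=Available \
  deployment/kserve-controller-manager --timeout=300s
```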
                  Reference

                  Preliminary

1. Kubernetes is installed; if not, check 🔗link


2. ArgoCD is installed; if not, check 🔗link


3. The Helm binary is installed; if not, check 🔗link


1. Install Gateway API CRDs

                  Details
                  kubectl apply -f https://github.com/kubernetes-sigs/gateway-api/releases/download/v1.3.0/standard-install.yaml

2. Install cert-manager

                  Reference

Follow the 🔗link to install cert-manager

3. Install the Istio system

                  Reference

Follow the 🔗link to install the three Istio components (istio-base, istiod, istio-ingressgateway)

4. Install the Knative Operator

                  Details
                  kubectl -n argocd apply -f - << EOF
                  apiVersion: argoproj.io/v1alpha1
                  kind: Application
                  metadata:
                    name: knative-operator
                  spec:
                    syncPolicy:
                      syncOptions:
                      - CreateNamespace=true
                    project: default
                    source:
                      repoURL: https://knative.github.io/operator
                      chart: knative-operator
                      targetRevision: v1.18.1
                      helm:
                        releaseName: knative-operator
                        values: |
                          knative_operator:
                            knative_operator:
                              image: m.daocloud.io/gcr.io/knative-releases/knative.dev/operator/cmd/operator
                              tag: v1.18.1
                              resources:
                                requests:
                                  cpu: 100m
                                  memory: 100Mi
                                limits:
                                  cpu: 1000m
                                  memory: 1000Mi
                            operator_webhook:
                              image: m.daocloud.io/gcr.io/knative-releases/knative.dev/operator/cmd/webhook
                              tag: v1.18.1
                              resources:
                                requests:
                                  cpu: 100m
                                  memory: 100Mi
                                limits:
                                  cpu: 500m
                                  memory: 500Mi
                    destination:
                      server: https://kubernetes.default.svc
                      namespace: knative-serving
                  EOF

5. Sync by ArgoCD

                  Details
                  argocd app sync argocd/knative-operator

6. Install the KnativeServing custom resource

Details
kubectl apply -f - <<EOF
apiVersion: operator.knative.dev/v1beta1
kind: KnativeServing
metadata:
  name: knative-serving
  namespace: knative-serving
spec:
  version: 1.18.0 # this is the knative serving version
  config:
    domain:
      example.com: ""
EOF
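Once applied, the operator reconciles the resource; you can watch the Ready condition flip to True when all serving components are up (resource and namespace names match the manifest above):

```shell
# Should eventually print "True"
kubectl -n knative-serving get knativeserving knative-serving \
  -o jsonpath='{.status.conditions[?(@.type=="Ready")].status}'
```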

7. Install KServe CRDs

                  Details
                  kubectl -n argocd apply -f - << EOF
                  apiVersion: argoproj.io/v1alpha1
                  kind: Application
                  metadata:
                    name: kserve-crd
                    annotations:
                      argocd.argoproj.io/sync-options: ServerSideApply=true
                      argocd.argoproj.io/compare-options: IgnoreExtraneous
                  spec:
                    syncPolicy:
                      syncOptions:
                      - CreateNamespace=true
                      - ServerSideApply=true
                    project: default
                    source:
                      repoURL: https://aaronyang0628.github.io/helm-chart-mirror/charts
                      chart: kserve-crd
                      targetRevision: v0.15.2
                      helm:
                        releaseName: kserve-crd 
                    destination:
                      server: https://kubernetes.default.svc
                      namespace: kserve
                  EOF
Expected Output
                  knative-serving    activator-cbf5b6b55-7gw8s                                 Running        116s
                  knative-serving    autoscaler-c5d454c88-nxrms                                Running        115s
                  knative-serving    autoscaler-hpa-6c966695c6-9ld24                           Running        113s
                  knative-serving    cleanup-serving-serving-1.18.0-45nhg                      Completed      113s
                  knative-serving    controller-84f96b7676-jjqfp                               Running        115s
                  knative-serving    net-istio-controller-574679cd5f-2sf4d                     Running        112s
                  knative-serving    net-istio-webhook-85c99487db-mmq7n                        Running        111s
                  knative-serving    storage-version-migration-serving-serving-1.18.0-k28vf    Completed      113s
                  knative-serving    webhook-75d4fb6db5-qqcwz                                  Running        114s

8. Sync by ArgoCD

                  Details
                  argocd app sync argocd/kserve-crd

9. Install the KServe controller

                  Details
                  kubectl -n argocd apply -f - << EOF
                  apiVersion: argoproj.io/v1alpha1
                  kind: Application
                  metadata:
                    name: kserve
                    annotations:
                      argocd.argoproj.io/sync-options: ServerSideApply=true
                      argocd.argoproj.io/compare-options: IgnoreExtraneous
                  spec:
                    syncPolicy:
                      syncOptions:
                      - CreateNamespace=true
                      - ServerSideApply=true
                    project: default
                    source:
                      repoURL: https://aaronyang0628.github.io/helm-chart-mirror/charts
                      chart: kserve
                      targetRevision: v0.15.2
                      helm:
                        releaseName: kserve
                        values: |
                          kserve:
                            agent:
                              image: m.daocloud.io/docker.io/kserve/agent
                            router:
                              image: m.daocloud.io/docker.io/kserve/router
                            storage:
                              image: m.daocloud.io/docker.io/kserve/storage-initializer
                              s3:
                                accessKeyIdName: AWS_ACCESS_KEY_ID
                                secretAccessKeyName: AWS_SECRET_ACCESS_KEY
                                endpoint: ""
                                region: ""
                                verifySSL: ""
                                useVirtualBucket: ""
                                useAnonymousCredential: ""
                            controller:
                              deploymentMode: "Serverless"
                              rbacProxyImage: m.daocloud.io/quay.io/brancz/kube-rbac-proxy:v0.18.0
                              rbacProxy:
                                resources:
                                  limits:
                                    cpu: 100m
                                    memory: 300Mi
                                  requests:
                                    cpu: 100m
                                    memory: 300Mi
                              gateway:
                                domain: example.com
                              image: m.daocloud.io/docker.io/kserve/kserve-controller
                              resources:
                                limits:
                                  cpu: 100m
                                  memory: 300Mi
                                requests:
                                  cpu: 100m
                                  memory: 300Mi
                            servingruntime:
                              tensorflow:
                                image: tensorflow/serving
                                tag: 2.6.2
                              mlserver:
                                image: m.daocloud.io/docker.io/seldonio/mlserver
                                tag: 1.5.0
                              sklearnserver:
                                image: m.daocloud.io/docker.io/kserve/sklearnserver
                              xgbserver:
                                image: m.daocloud.io/docker.io/kserve/xgbserver
                              huggingfaceserver:
                                image: m.daocloud.io/docker.io/kserve/huggingfaceserver
                                devShm:
                                  enabled: false
                                  sizeLimit: ""
                                hostIPC:
                                  enabled: false
                              huggingfaceserver_multinode:
                                shm:
                                  enabled: true
                                  sizeLimit: "3Gi"
                              tritonserver:
                                image: nvcr.io/nvidia/tritonserver
                              pmmlserver:
                                image: m.daocloud.io/docker.io/kserve/pmmlserver
                              paddleserver:
                                image: m.daocloud.io/docker.io/kserve/paddleserver
                              lgbserver:
                                image: m.daocloud.io/docker.io/kserve/lgbserver
                              torchserve:
                                image: pytorch/torchserve-kfs
                                tag: 0.9.0
                              art:
                                image: m.daocloud.io/docker.io/kserve/art-explainer
                            localmodel:
                              enabled: false
                              controller:
                                image: m.daocloud.io/docker.io/kserve/kserve-localmodel-controller
                              jobNamespace: kserve-localmodel-jobs
                              agent:
                                hostPath: /mnt/models
                                image: m.daocloud.io/docker.io/kserve/kserve-localmodelnode-agent
                            inferenceservice:
                              resources:
                                limits:
                                  cpu: "1"
                                  memory: "2Gi"
                                requests:
                                  cpu: "1"
                                  memory: "2Gi"
                    destination:
                      server: https://kubernetes.default.svc
                      namespace: kserve
                  EOF
If you see a 'failed calling webhook …' error:
Internal error occurred: failed calling webhook "clusterservingruntime.kserve-webhook-server.validator": failed to call webhook: Post "https://kserve-webhook-server-service.kserve.svc:443/validate-serving-kserve-io-v1alpha1-clusterservingruntime?timeout=10s": no endpoints available for service "kserve-webhook-server-service"

Just wait for a while, then resync, and it will be fine.
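Instead of retrying blindly, you can wait for the webhook to get endpoints before resyncing. The service name comes from the error message above; the deployment name is the one recent KServe charts use, so verify it in your cluster:

```shell
# Wait until the KServe controller (which backs the webhook) is Available,
# then confirm the webhook service has endpoints
kubectl -n kserve wait --for=condition=Available \
  deployment/kserve-controller-manager --timeout=300s
kubectl -n kserve get endpoints kserve-webhook-server-service
```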

10. Sync by ArgoCD

                  Details
                  argocd app sync argocd/kserve

11. Install Knative Eventing CRDs

                  Details
                  kubectl apply -f https://github.com/knative/eventing/releases/download/knative-v1.18.1/eventing-crds.yaml

12. Install Knative Eventing

                  Details
                  kubectl apply -f https://github.com/knative/eventing/releases/download/knative-v1.18.1/eventing-core.yaml
Expected Output
                  knative-eventing   eventing-controller-cc45869cd-fmhg8        1/1     Running       0          3m33s
                  knative-eventing   eventing-webhook-67fcc6959b-lktxd          1/1     Running       0          3m33s
                  knative-eventing   job-sink-7f5d754db-tbf2z                   1/1     Running       0          3m33s
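To smoke-test eventing, you can create a default Broker. Note that eventing-core alone does not ship a channel implementation or broker controller, so the (non-production) in-memory channel and MT channel broker are installed first; both URLs follow the same release pattern as the manifests above:

```shell
# Broker support needs a channel implementation and broker controller
kubectl apply -f https://github.com/knative/eventing/releases/download/knative-v1.18.1/in-memory-channel.yaml
kubectl apply -f https://github.com/knative/eventing/releases/download/knative-v1.18.1/mt-channel-broker.yaml

# Create a Broker and check that READY becomes True
kubectl apply -f - <<EOF
apiVersion: eventing.knative.dev/v1
kind: Broker
metadata:
  name: default
  namespace: default
EOF
kubectl -n default get broker default
```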


                  Mar 7, 2024

                  Subsections of Serving

                  Subsections of Inference

                  First Pytorch ISVC

                  Mnist Inference

                  More Information about mnist service can be found 🔗link

1. Create a namespace
                  kubectl create namespace kserve-test
2. Deploy a sample TorchServe MNIST service
                  kubectl apply -n kserve-test -f - <<EOF
                  apiVersion: "serving.kserve.io/v1beta1"
                  kind: "InferenceService"
                  metadata:
                    name: "first-torchserve"
                    namespace: kserve-test
                  spec:
                    predictor:
                      model:
                        modelFormat:
                          name: pytorch
                        storageUri: gs://kfserving-examples/models/torchserve/image_classifier/v1
                        resources:
                          limits:
                            memory: 4Gi
                  EOF
3. Check the InferenceService status
                  kubectl -n kserve-test get inferenceservices first-torchserve 
Expected Output
                  kubectl -n kserve-test get pod
                  #NAME                                           READY   STATUS    RESTARTS   AGE
                  #first-torchserve-predictor-00001-deplo...      2/2     Running   0          25s
                  
                  kubectl -n kserve-test get inferenceservices first-torchserve
                  #NAME           URL   READY     PREV   LATEST   PREVROLLEDOUTREVISION   LATESTREADYREVISION   AGE
                  #kserve-test   first-torchserve      http://first-torchserve.kserve-test.example.com   True           100                              first-torchserve-predictor-00001   2m59s

After all pods are ready, you can access the service using one of the following options

                  Access By

                  If the EXTERNAL-IP value is set, your environment has an external load balancer that you can use for the ingress gateway.

                  export INGRESS_HOST=$(kubectl -n istio-system get service istio-ingressgateway -o jsonpath='{.status.loadBalancer.ingress[0].ip}')
                  export INGRESS_PORT=$(kubectl -n istio-system get service istio-ingressgateway -o jsonpath='{.spec.ports[?(@.name=="http2")].nodePort}')

                  If the EXTERNAL-IP value is none (or perpetually pending), your environment does not provide an external load balancer for the ingress gateway. In this case, you can access the gateway using the service’s node port.

NodePort (minikube):

export INGRESS_HOST=$(minikube ip)
export INGRESS_PORT=$(kubectl -n istio-system get service istio-ingressgateway -o jsonpath='{.spec.ports[?(@.name=="http2")].nodePort}')

Port-forward:

export INGRESS_HOST=$(minikube ip)
kubectl port-forward --namespace istio-system svc/istio-ingressgateway 30080:80
export INGRESS_PORT=30080
4. Perform a prediction

First, prepare your inference input request inside a file:
                  wget -O ./mnist-input.json https://raw.githubusercontent.com/kserve/kserve/refs/heads/master/docs/samples/v1beta1/torchserve/v1/imgconv/input.json
                  Remember to forward port if using minikube
                  ssh -i ~/.minikube/machines/minikube/id_rsa docker@$(minikube ip) -L "*:${INGRESS_PORT}:0.0.0.0:${INGRESS_PORT}" -N -f
5. Invoke the service
                  SERVICE_HOSTNAME=$(kubectl -n kserve-test get inferenceservice first-torchserve  -o jsonpath='{.status.url}' | cut -d "/" -f 3)
                  # http://first-torchserve.kserve-test.example.com 
                  curl -v -H "Host: ${SERVICE_HOSTNAME}" -H "Content-Type: application/json" "http://${INGRESS_HOST}:${INGRESS_PORT}/v1/models/mnist:predict" -d @./mnist-input.json
Expected Output
                  *   Trying 192.168.58.2...
                  * TCP_NODELAY set
                  * Connected to 192.168.58.2 (192.168.58.2) port 32132 (#0)
                  > POST /v1/models/mnist:predict HTTP/1.1
                  > Host: my-torchserve.kserve-test.example.com
                  > User-Agent: curl/7.61.1
                  > Accept: */*
                  > Content-Type: application/json
                  > Content-Length: 401
                  > 
                  * upload completely sent off: 401 out of 401 bytes
                  < HTTP/1.1 200 OK
                  < content-length: 19
                  < content-type: application/json
                  < date: Mon, 09 Jun 2025 09:27:27 GMT
                  < server: istio-envoy
                  < x-envoy-upstream-service-time: 1128
                  < 
                  * Connection #0 to host 192.168.58.2 left intact
                  {"predictions":[2]}
                  Mar 7, 2024

                  First Custom Model

                  AlexNet Inference

                  More Information about AlexNet service can be found 🔗link

                  1. Implement Custom Model using KServe API
import argparse
import base64
import io
import time

from fastapi.middleware.cors import CORSMiddleware
from torchvision import models, transforms
from typing import Dict
import torch
from PIL import Image

import kserve
from kserve import Model, ModelServer, logging
from kserve.model_server import app
from kserve.utils.utils import generate_uuid


class AlexNetModel(Model):
    def __init__(self, name: str):
        super().__init__(name, return_response_headers=True)
        self.name = name
        # Start as not ready, then let load() flip the flag on success
        self.ready = False
        self.load()

    def load(self):
        self.model = models.alexnet(pretrained=True)
        self.model.eval()
        # The ready flag is used by the model ready endpoint for readiness probes;
        # set to True when the model is loaded successfully without exceptions.
        self.ready = True

    async def predict(
        self,
        payload: Dict,
        headers: Dict[str, str] = None,
        response_headers: Dict[str, str] = None,
    ) -> Dict:
        start = time.time()
        # Input follows the Tensorflow V1 HTTP API for binary values
        # https://www.tensorflow.org/tfx/serving/api_rest#encoding_binary_values
        img_data = payload["instances"][0]["image"]["b64"]
        raw_img_data = base64.b64decode(img_data)
        input_image = Image.open(io.BytesIO(raw_img_data))
        preprocess = transforms.Compose([
            transforms.Resize(256),
            transforms.CenterCrop(224),
            transforms.ToTensor(),
            transforms.Normalize(mean=[0.485, 0.456, 0.406],
                                 std=[0.229, 0.224, 0.225]),
        ])
        input_tensor = preprocess(input_image).unsqueeze(0)
        output = self.model(input_tensor)
        torch.nn.functional.softmax(output, dim=1)
        values, top_5 = torch.topk(output, 5)
        result = values.flatten().tolist()
        end = time.time()
        response_id = generate_uuid()

        # Custom response headers can be added to the inference response
        if response_headers is not None:
            response_headers.update(
                {"prediction-time-latency": f"{round((end - start) * 1000, 9)}"}
            )

        return {"predictions": result}


parser = argparse.ArgumentParser(parents=[kserve.model_server.parser])
args, _ = parser.parse_known_args()

if __name__ == "__main__":
    # Configure kserve and uvicorn logger
    if args.configure_logging:
        logging.configure_logging(args.log_config_file)
    model = AlexNetModel(args.model_name)
    model.load()
    # Custom middlewares can be added to the model
    app.add_middleware(
        CORSMiddleware,
        allow_origins=["*"],
        allow_credentials=True,
        allow_methods=["*"],
        allow_headers=["*"],
    )
    ModelServer().start([model])
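The predict method above unpacks the TensorFlow V1 HTTP API binary encoding (`instances[0].image.b64`). As a client-side sketch, a request payload for it can be built from raw image bytes like this; `build_payload` is a hypothetical helper, not part of KServe:

```python
import base64
import json

def build_payload(image_bytes: bytes) -> dict:
    """Wrap raw image bytes in the TF V1 HTTP API 'b64' encoding
    that the predict() method above unpacks."""
    b64 = base64.b64encode(image_bytes).decode("utf-8")
    return {"instances": [{"image": {"b64": b64}}]}

# Round-trip check: decoding the payload recovers the original bytes
payload = build_payload(b"\x89PNG fake bytes")
assert base64.b64decode(payload["instances"][0]["image"]["b64"]) == b"\x89PNG fake bytes"
print(json.dumps(payload))
```

In practice you would read a real image file (e.g. with `open(path, "rb").read()`) and POST the JSON to the `:predict` endpoint shown later in this section.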
2. Create requirements.txt
                  kserve
                  torchvision==0.18.0
                  pillow>=10.3.0,<11.0.0
3. Create a Dockerfile
                  FROM m.daocloud.io/docker.io/library/python:3.11-slim
                  
                  WORKDIR /app
                  
                  COPY requirements.txt .
                  RUN pip install --no-cache-dir  -r requirements.txt 
                  
                  COPY model.py .
                  
                  CMD ["python", "model.py", "--model_name=custom-model"]
4. Build and push the custom Docker image
                  docker build -t ay-custom-model .
docker tag ay-custom-model docker-registry.lab.zverse.space/ay/ay-custom-model:latest
                  docker push docker-registry.lab.zverse.space/ay/ay-custom-model:latest
5. Create a namespace
                  kubectl create namespace kserve-test
6. Deploy a sample custom-model service
                  kubectl apply -n kserve-test -f - <<EOF
                  apiVersion: serving.kserve.io/v1beta1
                  kind: InferenceService
                  metadata:
                    name: ay-custom-model
                  spec:
                    predictor:
                      containers:
                        - name: kserve-container
                          image: docker-registry.lab.zverse.space/ay/ay-custom-model:latest
                  EOF
7. Check the InferenceService status
                  kubectl -n kserve-test get inferenceservices ay-custom-model
Expected Output
                  kubectl -n kserve-test get pod
                  #NAME                                           READY   STATUS    RESTARTS   AGE
                  #ay-custom-model-predictor-00003-dcf4rk         2/2     Running   0        167m
                  
                  kubectl -n kserve-test get inferenceservices ay-custom-model
                  #NAME           URL   READY     PREV   LATEST   PREVROLLEDOUTREVISION   LATESTREADYREVISION   AGE
                  #ay-custom-model   http://ay-custom-model.kserve-test.example.com   True           100                              ay-custom-model-predictor-00003   177m

After all pods are ready, you can access the service using one of the following options

                  Access By

                  If the EXTERNAL-IP value is set, your environment has an external load balancer that you can use for the ingress gateway.

                  export INGRESS_HOST=$(kubectl -n istio-system get service istio-ingressgateway -o jsonpath='{.status.loadBalancer.ingress[0].ip}')
                  export INGRESS_PORT=$(kubectl -n istio-system get service istio-ingressgateway -o jsonpath='{.spec.ports[?(@.name=="http2")].nodePort}')

                  If the EXTERNAL-IP value is none (or perpetually pending), your environment does not provide an external load balancer for the ingress gateway. In this case, you can access the gateway using the service’s node port.

export INGRESS_HOST=$(minikube ip)
export INGRESS_PORT=$(kubectl -n istio-system get service istio-ingressgateway -o jsonpath='{.spec.ports[?(@.name=="http2")].nodePort}')

Alternatively, use port-forwarding:

export INGRESS_HOST=$(minikube ip)
kubectl port-forward --namespace istio-system svc/istio-ingressgateway 30080:80
export INGRESS_PORT=30080
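Whichever path you take, it helps to sanity-check the resolved values before calling the service. A minimal sketch; the defaults below are placeholders for a typical minikube setup, your cluster's values will differ:

```shell
# Fall back to placeholder values if the variables are unset (illustrative only).
INGRESS_HOST="${INGRESS_HOST:-192.168.49.2}"
INGRESS_PORT="${INGRESS_PORT:-30080}"

# Fail early instead of sending curl to an empty host:port.
if [ -z "${INGRESS_HOST}" ] || [ -z "${INGRESS_PORT}" ]; then
  echo "ingress host/port not resolved" >&2
  exit 1
fi
echo "ingress endpoint: ${INGRESS_HOST}:${INGRESS_PORT}"
```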
4. Perform a prediction

                  First, prepare your inference input request inside a file:

                  wget -O ./alex-net-input.json https://kserve.github.io/website/0.15/modelserving/v1beta1/custom/custom_model/input.json
                  Remember to forward port if using minikube
                  ssh -i ~/.minikube/machines/minikube/id_rsa docker@$(minikube ip) -L "*:${INGRESS_PORT}:0.0.0.0:${INGRESS_PORT}" -N -f
5. Invoke the service
                  export SERVICE_HOSTNAME=$(kubectl -n kserve-test get inferenceservice ay-custom-model  -o jsonpath='{.status.url}' | cut -d "/" -f 3)
                  # http://ay-custom-model.kserve-test.example.com
curl -v -H "Host: ${SERVICE_HOSTNAME}" -H "Content-Type: application/json" -X POST "http://${INGRESS_HOST}:${INGRESS_PORT}/v1/models/custom-model:predict" -d @./alex-net-input.json
Expected Output
                  *   Trying 192.168.58.2:30704...
                  * Connected to 192.168.58.2 (192.168.58.2) port 30704
                  > POST /v1/models/custom-model:predict HTTP/1.1
                  > Host: ay-custom-model.kserve-test.example.com
                  > User-Agent: curl/8.5.0
                  > Accept: */*
                  > Content-Type: application/json
                  > Content-Length: 105339
                  > 
                  * We are completely uploaded and fine
                  < HTTP/1.1 200 OK
                  < content-length: 110
                  < content-type: application/json
                  < date: Wed, 11 Jun 2025 03:38:30 GMT
                  < prediction-time-latency: 89.966773987
                  < server: istio-envoy
                  < x-envoy-upstream-service-time: 93
                  < 
                  * Connection #0 to host 192.168.58.2 left intact
                  {"predictions":[14.975619316101074,14.0368070602417,13.966034889221191,12.252280235290527,12.086270332336426]}
                  Mar 7, 2024

                  First Model In Minio

                  Inference Model In Minio

More information about deploying an InferenceService with a saved model on S3 can be found at this 🔗link

                  Create Service Account

=== "yaml"

                  apiVersion: v1
                  kind: ServiceAccount
                  metadata:
                    name: sa
                    annotations:
                      eks.amazonaws.com/role-arn: arn:aws:iam::123456789012:role/s3access # replace with your IAM role ARN
                      serving.kserve.io/s3-endpoint: s3.amazonaws.com # replace with your s3 endpoint e.g minio-service.kubeflow:9000
                      serving.kserve.io/s3-usehttps: "1" # by default 1, if testing with minio you can set to 0
                      serving.kserve.io/s3-region: "us-east-2"
                      serving.kserve.io/s3-useanoncredential: "false" # omitting this is the same as false, if true will ignore provided credential and use anonymous credentials

=== "kubectl"

                  kubectl apply -f create-s3-sa.yaml

                  Create S3 Secret and attach to Service Account

Create a secret with your S3 user credentials. KServe reads the secret annotations to inject the S3 environment variables into the storage initializer or model agent, which download the models from S3 storage.

                  Create S3 secret

=== "yaml"

                  apiVersion: v1
                  kind: Secret
                  metadata:
                    name: s3creds
                    annotations:
                       serving.kserve.io/s3-endpoint: s3.amazonaws.com # replace with your s3 endpoint e.g minio-service.kubeflow:9000
                       serving.kserve.io/s3-usehttps: "1" # by default 1, if testing with minio you can set to 0
                       serving.kserve.io/s3-region: "us-east-2"
                       serving.kserve.io/s3-useanoncredential: "false" # omitting this is the same as false, if true will ignore provided credential and use anonymous credentials
                  type: Opaque
                  stringData: # use `stringData` for raw credential string or `data` for base64 encoded string
                    AWS_ACCESS_KEY_ID: XXXX
                    AWS_SECRET_ACCESS_KEY: XXXXXXXX

                  Attach secret to a service account

=== "yaml"

                  apiVersion: v1
                  kind: ServiceAccount
                  metadata:
                    name: sa
                  secrets:
                  - name: s3creds

=== "kubectl"

                  kubectl apply -f create-s3-secret.yaml

                  !!! note If you are running kserve with istio sidecars enabled, there can be a race condition between the istio proxy being ready and the agent pulling models. This will result in a tcp dial connection refused error when the agent tries to download from s3.

                  To resolve it, istio allows the blocking of other containers in a pod until the proxy container is ready.
                  
You can enable this by setting `proxy.holdApplicationUntilProxyStarts: true` in the `istio-sidecar-injector` configmap. The `proxy.holdApplicationUntilProxyStarts` flag was introduced in Istio 1.7 as an experimental feature and is turned off by default.
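Alternatively, recent Istio versions let you opt in per workload instead of cluster-wide via the `proxy.istio.io/config` pod annotation. A hedged sketch placing it on the InferenceService metadata (whether KServe propagates this annotation down to the predictor pod depends on your KServe version, so verify it on the created pod):

```yaml
apiVersion: "serving.kserve.io/v1beta1"
kind: "InferenceService"
metadata:
  name: "mnist-s3"   # illustrative name
  annotations:
    # Istio per-pod proxy config; blocks app containers until the sidecar is ready.
    proxy.istio.io/config: '{ "holdApplicationUntilProxyStarts": true }'
spec:
  predictor:
    serviceAccountName: sa
    model:
      modelFormat:
        name: tensorflow
      storageUri: "s3://kserve-examples/mnist"
```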
                  

                  Deploy the model on S3 with InferenceService

                  Create the InferenceService with the s3 storageUri and the service account with s3 credential attached.

=== "New Schema"

                  ```yaml
                  apiVersion: "serving.kserve.io/v1beta1"
                  kind: "InferenceService"
                  metadata:
                    name: "mnist-s3"
                  spec:
                    predictor:
                      serviceAccountName: sa
                      model:
                        modelFormat:
                          name: tensorflow
                        storageUri: "s3://kserve-examples/mnist"
                  ```
                  

=== "Old Schema"

                  ```yaml
                  apiVersion: "serving.kserve.io/v1beta1"
                  kind: "InferenceService"
                  metadata:
                    name: "mnist-s3"
                  spec:
                    predictor:
                      serviceAccountName: sa
                      tensorflow:
                        storageUri: "s3://kserve-examples/mnist"
                  ```
                  

Apply the mnist-s3.yaml.

=== "kubectl"

                  kubectl apply -f mnist-s3.yaml

                  Run a prediction

                  Now, the ingress can be accessed at ${INGRESS_HOST}:${INGRESS_PORT} or follow this instruction to find out the ingress IP and port.

                  SERVICE_HOSTNAME=$(kubectl get inferenceservice mnist-s3 -o jsonpath='{.status.url}' | cut -d "/" -f 3)
                  
                  MODEL_NAME=mnist-s3
                  INPUT_PATH=@./input.json
                  curl -v -H "Host: ${SERVICE_HOSTNAME}" -H "Content-Type: application/json" http://${INGRESS_HOST}:${INGRESS_PORT}/v1/models/$MODEL_NAME:predict -d $INPUT_PATH

!!! success "Expected Output"

                  ```{ .bash .no-copy }
                  Note: Unnecessary use of -X or --request, POST is already inferred.
                  *   Trying 35.237.217.209...
                  * TCP_NODELAY set
                  * Connected to mnist-s3.default.35.237.217.209.xip.io (35.237.217.209) port 80 (#0)
                  > POST /v1/models/mnist-s3:predict HTTP/1.1
                  > Host: mnist-s3.default.35.237.217.209.xip.io
                  > User-Agent: curl/7.55.1
                  > Accept: */*
                  > Content-Length: 2052
                  > Content-Type: application/x-www-form-urlencoded
                  > Expect: 100-continue
                  >
                  < HTTP/1.1 100 Continue
                  * We are completely uploaded and fine
                  < HTTP/1.1 200 OK
                  < content-length: 251
                  < content-type: application/json
                  < date: Sun, 04 Apr 2021 20:06:27 GMT
                  < x-envoy-upstream-service-time: 5
                  < server: istio-envoy
                  <
                  * Connection #0 to host mnist-s3.default.35.237.217.209.xip.io left intact
                  {
                      "predictions": [
                          {
                              "predictions": [0.327352405, 2.00153053e-07, 0.0113353515, 0.203903764, 3.62863029e-05, 0.416683704, 0.000281196437, 8.36911859e-05, 0.0403052084, 1.82206513e-05],
                              "classes": 5
                          }
                      ]
                  }
                  ```
                  

                  Kafka Sink Transformer

                  AlexNet Inference

More information about custom transformer services can be found at this 🔗link

1. Implement the custom transformer ./model.py using the KServe API
```python
import os
import argparse
import json

from typing import Dict, Union
from kafka import KafkaProducer
from cloudevents.http import CloudEvent
from cloudevents.conversion import to_structured

from kserve import (
    Model,
    ModelServer,
    model_server,
    logging,
    InferRequest,
    InferResponse,
)

from kserve.logging import logger
from kserve.utils.utils import generate_uuid

kafka_producer = KafkaProducer(
    value_serializer=lambda v: json.dumps(v).encode('utf-8'),
    bootstrap_servers=os.environ.get('KAFKA_BOOTSTRAP_SERVERS', 'localhost:9092')
)


class ImageTransformer(Model):
    def __init__(self, name: str):
        super().__init__(name, return_response_headers=True)
        self.ready = True

    def preprocess(
        self, payload: Union[Dict, InferRequest], headers: Dict[str, str] = None
    ) -> Union[Dict, InferRequest]:
        logger.info("Received inputs %s", payload)
        logger.info("Received headers %s", headers)
        self.request_trace_key = os.environ.get('REQUEST_TRACE_KEY', 'algo.trace.requestId')
        if self.request_trace_key not in payload:
            logger.error("Request trace key '%s' not found in payload, you cannot trace the prediction result", self.request_trace_key)
            if "instances" not in payload:
                raise ValueError(
                    f"Request trace key '{self.request_trace_key}' not found in payload and 'instances' key is missing."
                )
        else:
            headers[self.request_trace_key] = payload.get(self.request_trace_key)

        return {"instances": payload["instances"]}

    def postprocess(
        self,
        infer_response: Union[Dict, InferResponse],
        headers: Dict[str, str] = None,
        response_headers: Dict[str, str] = None,
    ) -> Union[Dict, InferResponse]:
        logger.info("postprocess headers: %s", headers)
        logger.info("postprocess response headers: %s", response_headers)
        logger.info("postprocess response: %s", infer_response)

        attributes = {
            "source": "data-and-computing/kafka-sink-transformer",
            "type": "org.zhejianglab.zverse.data-and-computing.kafka-sink-transformer",
            "request-host": headers.get('host', 'unknown'),
            "kserve-isvc-name": headers.get('kserve-isvc-name', 'unknown'),
            "kserve-isvc-namespace": headers.get('kserve-isvc-namespace', 'unknown'),
            self.request_trace_key: headers.get(self.request_trace_key, 'unknown'),
        }

        _, cloudevent = to_structured(CloudEvent(attributes, infer_response))
        try:
            kafka_producer.send(os.environ.get('KAFKA_TOPIC', 'test-topic'), value=cloudevent.decode('utf-8').replace("'", '"'))
            kafka_producer.flush()
        except Exception as e:
            logger.error("Failed to send message to Kafka: %s", e)
        return infer_response


parser = argparse.ArgumentParser(parents=[model_server.parser])
args, _ = parser.parse_known_args()

if __name__ == "__main__":
    if args.configure_logging:
        logging.configure_logging(args.log_config_file)
    logging.logger.info("available model name: %s", args.model_name)
    logging.logger.info("all args: %s", args)
    model = ImageTransformer(args.model_name)
    ModelServer().start([model])
```
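The trace-key handling in `preprocess` can be exercised in isolation. A minimal sketch of the same logic with plain dicts (the helper name `extract_trace` is hypothetical, not part of KServe):

```python
import os
from typing import Dict, Tuple


def extract_trace(payload: Dict, headers: Dict[str, str]) -> Tuple[Dict, Dict[str, str]]:
    """Mirror of the transformer's preprocess trace-key logic (illustrative)."""
    trace_key = os.environ.get('REQUEST_TRACE_KEY', 'algo.trace.requestId')
    if trace_key in payload:
        # Copy the request id into the headers so postprocess can attach it
        # to the CloudEvent that is published to Kafka.
        headers[trace_key] = payload[trace_key]
    elif "instances" not in payload:
        # No trace id and no model input: reject the request.
        raise ValueError(f"'{trace_key}' missing and no 'instances' key")
    return {"instances": payload["instances"]}, headers


body, hdrs = extract_trace(
    {"algo.trace.requestId": "req-42", "instances": [[1, 2, 3]]}, {}
)
print(hdrs["algo.trace.requestId"])  # req-42
```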
2. modify ./pyproject.toml
                  [tool.poetry]
                  name = "custom_transformer"
                  version = "0.15.2"
                  description = "Custom Transformer Examples. Not intended for use outside KServe Frameworks Images."
                  authors = ["Dan Sun <dsun20@bloomberg.net>"]
                  license = "Apache-2.0"
                  packages = [
                      { include = "*.py" }
                  ]
                  
                  [tool.poetry.dependencies]
                  python = ">=3.9,<3.13"
                  kserve = {path = "../kserve", develop = true}
                  pillow = "^10.3.0"
                  kafka-python = "^2.2.15"
                  cloudevents = "^1.11.1"
                  
                  [[tool.poetry.source]]
                  name = "pytorch"
                  url = "https://download.pytorch.org/whl/cpu"
                  priority = "explicit"
                  
                  [tool.poetry.group.test]
                  optional = true
                  
                  [tool.poetry.group.test.dependencies]
                  pytest = "^7.4.4"
                  mypy = "^0.991"
                  
                  [tool.poetry.group.dev]
                  optional = true
                  
                  [tool.poetry.group.dev.dependencies]
                  black = { version = "~24.3.0", extras = ["colorama"] }
                  
                  [tool.poetry-version-plugin]
                  source = "file"
                  file_path = "../VERSION"
                  
                  [build-system]
                  requires = ["poetry-core>=1.0.0"]
                  build-backend = "poetry.core.masonry.api"
3. prepare ../custom_transformer.Dockerfile
                  ARG PYTHON_VERSION=3.11
                  ARG BASE_IMAGE=python:${PYTHON_VERSION}-slim-bookworm
                  ARG VENV_PATH=/prod_venv
                  
                  FROM ${BASE_IMAGE} AS builder
                  
                  # Install Poetry
                  ARG POETRY_HOME=/opt/poetry
                  ARG POETRY_VERSION=1.8.3
                  
                  RUN python3 -m venv ${POETRY_HOME} && ${POETRY_HOME}/bin/pip install poetry==${POETRY_VERSION}
                  ENV PATH="$PATH:${POETRY_HOME}/bin"
                  
                  # Activate virtual env
                  ARG VENV_PATH
                  ENV VIRTUAL_ENV=${VENV_PATH}
                  RUN python3 -m venv $VIRTUAL_ENV
                  ENV PATH="$VIRTUAL_ENV/bin:$PATH"
                  
                  COPY kserve/pyproject.toml kserve/poetry.lock kserve/
                  RUN cd kserve && poetry install --no-root --no-interaction --no-cache
                  COPY kserve kserve
                  RUN cd kserve && poetry install --no-interaction --no-cache
                  
                  COPY custom_transformer/pyproject.toml custom_transformer/poetry.lock custom_transformer/
                  RUN cd custom_transformer && poetry install --no-root --no-interaction --no-cache
                  COPY custom_transformer custom_transformer
                  RUN cd custom_transformer && poetry install --no-interaction --no-cache
                  
                  
                  FROM ${BASE_IMAGE} AS prod
                  
                  COPY third_party third_party
                  
                  # Activate virtual env
                  ARG VENV_PATH
                  ENV VIRTUAL_ENV=${VENV_PATH}
                  ENV PATH="$VIRTUAL_ENV/bin:$PATH"
                  
                  RUN useradd kserve -m -u 1000 -d /home/kserve
                  
                  COPY --from=builder --chown=kserve:kserve $VIRTUAL_ENV $VIRTUAL_ENV
                  COPY --from=builder kserve kserve
                  COPY --from=builder custom_transformer custom_transformer
                  
                  USER 1000
                  ENTRYPOINT ["python", "-m", "custom_transformer.model"]
4. regenerate poetry.lock
                  poetry lock --no-update
5. build and push the custom docker image
                  cd python
                  podman build -t docker-registry.lab.zverse.space/data-and-computing/ay-dev/msg-transformer:dev9 -f custom_transformer.Dockerfile .
                  
                  podman push docker-registry.lab.zverse.space/data-and-computing/ay-dev/msg-transformer:dev9

                  Subsections of Generative

                  First Generative Service

graph TD
  B(KServe InferenceService)
  B --> C[[Knative Serving]] --> D[Autoscaling / canary release]
  B --> E[[Istio]] --> F[Traffic management / security]
  B --> G[[Storage]] --> H[S3/GCS/PVC]

### Deploy an InferenceService with a single YAML
                  ```yaml
                  apiVersion: "serving.kserve.io/v1beta1"
                  kind: "InferenceService"
                  metadata:
                    name: "sklearn-iris"
                    namespace: kserve-test
                  spec:
                    predictor:
                      model:
                        modelFormat:
                          name: sklearn
                        resources: {}
                        storageUri: "gs://kfserving-examples/models/sklearn/1.0/model"
```

                  check CRD

                  kubectl -n kserve-test get inferenceservices sklearn-iris 
                  kubectl -n istio-system get svc istio-ingressgateway 
                  export INGRESS_HOST=$(minikube ip)
                  export INGRESS_PORT=$(kubectl -n istio-system get service istio-ingressgateway -o jsonpath='{.spec.ports[?(@.name=="http2")].nodePort}')
                  SERVICE_HOSTNAME=$(kubectl -n kserve-test get inferenceservice sklearn-iris  -o jsonpath='{.status.url}' | cut -d "/" -f 3)
                  # http://sklearn-iris.kserve-test.example.com 
                  curl -v -H "Host: ${SERVICE_HOSTNAME}" -H "Content-Type: application/json" "http://${INGRESS_HOST}:${INGRESS_PORT}/v1/models/sklearn-iris:predict" -d @./iris-input.json
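The request body `./iris-input.json` is not shown above. The upstream sklearn example uses four iris features per instance; the values below follow the KServe docs' sample input, so verify them against your own model:

```shell
# Write a sample request body for the sklearn-iris predictor.
cat > ./iris-input.json <<'EOF'
{
  "instances": [
    [6.8, 2.8, 4.8, 1.4],
    [6.0, 3.4, 4.5, 1.6]
  ]
}
EOF
```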

                  How to deploy your own ML model

                  apiVersion: serving.kserve.io/v1beta1
                  kind: InferenceService
                  metadata:
                    name: huggingface-llama3
                    namespace: kserve-test
                    annotations:
                      serving.kserve.io/deploymentMode: RawDeployment
                      serving.kserve.io/autoscalerClass: none
                  spec:
                    predictor:
                      model:
                        modelFormat:
                          name: huggingface
                        storageUri: pvc://llama-3-8b-pvc/hf/8b_instruction_tuned
                      workerSpec:
                        pipelineParallelSize: 2
                        tensorParallelSize: 1
                        containers:
      - name: worker-container
        resources:
          requests:
            nvidia.com/gpu: "8"

                  check https://kserve.github.io/website/0.15/modelserving/v1beta1/llm/huggingface/multi-node/#workerspec-and-servingruntime


                  Canary Policy

KServe supports canary rollouts for inference services, allowing a new version of an InferenceService to receive a percentage of the traffic. KServe supports a configurable canary rollout strategy with multiple steps, and the strategy can also roll back to the previous revision if a rollout step fails.

                  KServe automatically tracks the last good revision that was rolled out with 100% traffic. The canaryTrafficPercent field in the component’s spec needs to be set with the percentage of traffic that should be routed to the new revision. KServe will then automatically split the traffic between the last good revision and the revision that is currently being rolled out according to the canaryTrafficPercent value.

When the first revision of an InferenceService is deployed, it receives 100% of the traffic. When multiple revisions are deployed, as in step 2, and the canary rollout strategy is configured to route 10% of the traffic to the new revision, 90% of the traffic goes to the LatestRolledoutRevision. If an unhealthy or bad revision is applied, traffic is not routed to it. In step 3, the rollout strategy promotes the LatestReadyRevision from step 2 to the LatestRolledoutRevision. Since it is now promoted, the LatestRolledoutRevision gets 100% of the traffic and is fully rolled out. If a rollback needs to happen, 100% of the traffic is pinned to the previous healthy revision, the PreviousRolledoutRevision.
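The split itself is simple arithmetic over `canaryTrafficPercent`; a small sketch (the function and key names are illustrative, not KServe API):

```python
def traffic_split(canary_traffic_percent: int) -> dict:
    """Percentage of traffic routed to each revision for a given canary setting."""
    if not 0 <= canary_traffic_percent <= 100:
        raise ValueError("canaryTrafficPercent must be in [0, 100]")
    return {
        # The revision currently being rolled out receives the canary share...
        "latest_ready_revision": canary_traffic_percent,
        # ...and the last fully rolled-out revision receives the remainder.
        "last_rolled_out_revision": 100 - canary_traffic_percent,
    }


print(traffic_split(10))
# {'latest_ready_revision': 10, 'last_rolled_out_revision': 90}
```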

Figure: canary rollout strategy, steps 1-2 and step 3.

                  Reference

                  For more information, see Canary Rollout.


                  Subsections of Canary Policy

                  Rollout Example

                  Create the InferenceService

                  Follow the First Inference Service tutorial. Set up a namespace kserve-test and create an InferenceService.

                  After rolling out the first model, 100% traffic goes to the initial model with service revision 1.

                  kubectl -n kserve-test get isvc sklearn-iris
Expected Output
NAME           URL                                           READY   PREV   LATEST   PREVROLLEDOUTREVISION   PREVROLLEDOUTREVISION   LATESTREADYREVISION            AGE
sklearn-iris   http://sklearn-iris.kserve-test.example.com   True           100                              sklearn-iris-predictor-00001   70s

                  Apply Canary Rollout Strategy

                  • Add the canaryTrafficPercent field to the predictor component
                  • Update the storageUri to use a new/updated model.
                  kubectl apply -n kserve-test -f - <<EOF
                  apiVersion: "serving.kserve.io/v1beta1"
                  kind: "InferenceService"
                  metadata:
                    name: "sklearn-iris"
                    namespace: kserve-test
                  spec:
                    predictor:
                      canaryTrafficPercent: 10
                      model:
                        args: ["--enable_docs_url=True"]
                        modelFormat:
                          name: sklearn
                        resources: {}
                        runtime: kserve-sklearnserver
                        storageUri: "gs://kfserving-examples/models/sklearn/1.0/model-2"
                  EOF

                  After rolling out the canary model, traffic is split between the latest ready revision 2 and the previously rolled out revision 1.

                  kubectl -n kserve-test get isvc sklearn-iris
Expected Output
                  NAME       URL              READY   PREV   LATEST   PREVROLLEDOUTREVISION   LATESTREADYREVISION                AGE
                  sklearn-iris   http://sklearn-iris.kserve-test.example.com   True    90     10       sklearn-iris-predictor-00002   sklearn-iris-predictor-00003   19h

Check the running pods; you should now see two pods running, one for the old model and one for the new model, with 10% of the traffic routed to the new model. Notice that revision 1 contains 00002 in its name, while revision 2 contains 00003.

                  kubectl get pods 
                  
                  NAME                                                        READY   STATUS    RESTARTS   AGE
                  sklearn-iris-predictor-00002-deployment-c7bb6c685-ktk7r     2/2     Running   0          71m
                  sklearn-iris-predictor-00003-deployment-8498d947-fpzcg      2/2     Running   0          20m

                  Run a prediction

                  Follow the next two steps (Determine the ingress IP and ports and Perform inference) in the First Inference Service tutorial.

                  Send more requests to the InferenceService to observe the 10% of traffic that routes to the new revision.

                  Promote the canary model

If the canary model is healthy and passes your tests, you can promote it by removing the canaryTrafficPercent field and re-applying the InferenceService custom resource with the same name sklearn-iris.

                  kubectl apply -n kserve-test -f - <<EOF
                  apiVersion: "serving.kserve.io/v1beta1"
                  kind: "InferenceService"
                  metadata:
                    name: "sklearn-iris"
                    namespace: kserve-test
                  spec:
                    predictor:
                      model:
                        args: ["--enable_docs_url=True"]
                        modelFormat:
                          name: sklearn
                        resources: {}
                        runtime: kserve-sklearnserver
                        storageUri: "gs://kfserving-examples/models/sklearn/1.0/model-2"
                  EOF

Now all traffic goes to revision 2 for the new model.

                  kubectl get isvc sklearn-iris
                  NAME       URL                                   READY   PREV   LATEST   PREVROLLEDOUTREVISION   LATESTREADYREVISION                AGE
                  sklearn-iris   http://sklearn-iris.kserve-test.example.com   True           100                              sklearn-iris-predictor-00002   17m

The pods for revision generation 1 automatically scale down to 0 as they are no longer receiving traffic.

                  kubectl get pods -l serving.kserve.io/inferenceservice=sklearn-iris
                  NAME                                                           READY   STATUS        RESTARTS   AGE
                  sklearn-iris-predictor-00001-deployment-66c5f5b8d5-gmfvj   1/2     Terminating   0          17m
                  sklearn-iris-predictor-00002-deployment-5bd9ff46f8-shtzd   2/2     Running       0          15m

                  Rollback and pin the previous model

                  You can pin the previous model (model v1, for example) by setting the canaryTrafficPercent to 0 for the current model (model v2, for example). This rolls back from model v2 to model v1 and decreases model v2’s traffic to zero.

                  Apply the custom resource to set model v2’s traffic to 0%.

                  kubectl apply -n kserve-test -f - <<EOF
                  apiVersion: "serving.kserve.io/v1beta1"
                  kind: "InferenceService"
                  metadata:
                    name: "sklearn-iris"
                  spec:
                    predictor:
                      canaryTrafficPercent: 0
                      model:
                        modelFormat:
                          name: sklearn
                        storageUri: "gs://kfserving-examples/models/sklearn/1.0/model-2"
                  EOF

Check the traffic split; now 100% of the traffic goes to the previous good model (model v1) for revision generation 1.

                  kubectl get isvc sklearn-iris
                  NAME       URL                                   READY   PREV   LATEST   PREVROLLEDOUTREVISION              LATESTREADYREVISION                AGE
                  sklearn-iris   http://sklearn-iris.kserve-test.example.com   True    100    0        sklearn-iris-predictor-00002   sklearn-iris-predictor-00003   18m

The previous revision (model v1) now receives 100% of the traffic, while the new model (model v2) receives 0%.

                  kubectl get pods -l serving.kserve.io/inferenceservice=sklearn-iris
                  
                  NAME                                                       READY   STATUS        RESTARTS   AGE
                  sklearn-iris-predictor-00002-deployment-66c5f5b8d5-gmfvj   1/2     Running       0          35s
                  sklearn-iris-predictor-00003-deployment-5bd9ff46f8-shtzd   2/2     Running       0          16m

                  Route traffic using a tag

You can enable tag-based routing by adding the annotation serving.kserve.io/enable-tag-routing, so traffic can be explicitly routed to the canary model (model v2) or the old model (model v1) via a tag in the request URL.

                  Apply model v2 with canaryTrafficPercent: 10 and serving.kserve.io/enable-tag-routing: "true".

                  kubectl apply -n kserve-test -f - <<EOF
                  apiVersion: "serving.kserve.io/v1beta1"
                  kind: "InferenceService"
                  metadata:
                    name: "sklearn-iris"
                    annotations:
                      serving.kserve.io/enable-tag-routing: "true"
                  spec:
                    predictor:
                      canaryTrafficPercent: 10
                      model:
                        modelFormat:
                          name: sklearn
                        storageUri: "gs://kfserving-examples/models/sklearn/1.0/model-2"
                  EOF

                  Check the InferenceService status to get the canary and previous model URL.

                  kubectl get isvc sklearn-iris -ojsonpath="{.status.components.predictor}"  | jq

                  The output should look like

Expected Output
                  {
                      "address": {
                      "url": "http://sklearn-iris-predictor-.kserve-test.svc.cluster.local"
                      },
                      "latestCreatedRevision": "sklearn-iris-predictor--00003",
                      "latestReadyRevision": "sklearn-iris-predictor--00003",
                      "latestRolledoutRevision": "sklearn-iris-predictor--00001",
                      "previousRolledoutRevision": "sklearn-iris-predictor--00001",
                      "traffic": [
                      {
                          "latestRevision": true,
                          "percent": 10,
                          "revisionName": "sklearn-iris-predictor--00003",
                          "tag": "latest",
                          "url": "http://latest-sklearn-iris-predictor-.kserve-test.example.com"
                      },
                      {
                          "latestRevision": false,
                          "percent": 90,
                          "revisionName": "sklearn-iris-predictor--00001",
                          "tag": "prev",
                          "url": "http://prev-sklearn-iris-predictor-.kserve-test.example.com"
                      }
                      ],
                      "url": "http://sklearn-iris-predictor-.kserve-test.example.com"
                  }
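As a quick offline illustration (no cluster needed), the tag-routed URLs can be pulled out of a status payload like the one above with plain shell; on a live cluster you would feed this from the `kubectl get isvc ... -ojsonpath` command shown earlier. The inlined `STATUS` string below is a trimmed sample, not real cluster output.

```shell
# Trimmed sample of the predictor status shown above (illustrative only).
STATUS='{"traffic":[{"tag":"latest","url":"http://latest-sklearn-iris-predictor-.kserve-test.example.com"},{"tag":"prev","url":"http://prev-sklearn-iris-predictor-.kserve-test.example.com"}]}'
# Extract the tag-routed URLs for the canary and the previous revision.
LATEST_URL=$(echo "$STATUS" | grep -o 'http://latest-[^"]*')
PREV_URL=$(echo "$STATUS" | grep -o 'http://prev-[^"]*')
echo "$LATEST_URL"
echo "$PREV_URL"
```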

                  Since we updated the annotation on the InferenceService, model v2 now corresponds to sklearn-iris-predictor--00003.

                  You can now send the request explicitly to the new model or the previous model by using the tag in the request URL. Use the curl command from Perform inference and add latest- or prev- to the model name to send a tag based request.

                  For example, set the model name and use the following commands to send traffic to each service based on the latest or prev tag.

                  curl the latest revision

                  MODEL_NAME=sklearn-iris
                  curl -v -H "Host: latest-${MODEL_NAME}-predictor-.kserve-test.example.com" -H "Content-Type: application/json" http://${INGRESS_HOST}:${INGRESS_PORT}/v1/models/$MODEL_NAME:predict -d @./iris-input.json

                  or curl the previous revision

                  curl -v -H "Host: prev-${MODEL_NAME}-predictor-.kserve-test.example.com" -H "Content-Type: application/json" http://${INGRESS_HOST}:${INGRESS_PORT}/v1/models/$MODEL_NAME:predict -d @./iris-input.json
                  Mar 7, 2024

                  Auto Scaling

                  Soft Limit

You can configure an InferenceService with the annotation autoscaling.knative.dev/target for a soft limit. The soft limit is a targeted limit rather than a strictly enforced bound; particularly during a sudden burst of requests, this value can be exceeded.

                  apiVersion: "serving.kserve.io/v1beta1"
                  kind: "InferenceService"
                  metadata:
                    name: "sklearn-iris"
                    namespace: kserve-test
                    annotations:
                      autoscaling.knative.dev/target: "5"
                  spec:
                    predictor:
                      model:
                        args: ["--enable_docs_url=True"]
                        modelFormat:
                          name: sklearn
                        resources: {}
                        runtime: kserve-sklearnserver
                        storageUri: "gs://kfserving-examples/models/sklearn/1.0/model"
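To get a feel for what the soft limit implies: the Knative autoscaler roughly scales the predictor to ceil(in-flight requests / target). A back-of-the-envelope sketch, where the in-flight number is a hypothetical burst rather than anything measured:

```shell
TARGET=5      # autoscaling.knative.dev/target from the manifest above
IN_FLIGHT=30  # hypothetical concurrent requests during a burst
# Integer ceiling division: ceil(IN_FLIGHT / TARGET)
REPLICAS=$(( (IN_FLIGHT + TARGET - 1) / TARGET ))
echo "expected replicas: ${REPLICAS}"
```

In practice the KPA also applies a target-utilization factor, so the real replica count may be slightly higher than this estimate.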

                  Hard Limit

You can also configure an InferenceService with the field containerConcurrency for a hard limit. The hard limit is an enforced upper bound. If concurrency reaches the hard limit, surplus requests are buffered and must wait until enough capacity is free.

                  apiVersion: "serving.kserve.io/v1beta1"
                  kind: "InferenceService"
                  metadata:
                    name: "sklearn-iris"
                    namespace: kserve-test
                  spec:
                    predictor:
                      containerConcurrency: 5
                      model:
                        args: ["--enable_docs_url=True"]
                        modelFormat:
                          name: sklearn
                        resources: {}
                        runtime: kserve-sklearnserver
                        storageUri: "gs://kfserving-examples/models/sklearn/1.0/model"
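The hard limit caps how many requests each pod executes at once; anything beyond containerConcurrency times the number of ready replicas waits in the buffer. A rough sketch with hypothetical numbers:

```shell
CONTAINER_CONCURRENCY=5  # hard limit from the manifest above
REPLICAS=3               # hypothetical number of ready pods
IN_FLIGHT=40             # hypothetical concurrent requests
EXECUTING=$((CONTAINER_CONCURRENCY * REPLICAS))
BUFFERED=$((IN_FLIGHT - EXECUTING))
echo "executing=${EXECUTING} buffered=${BUFFERED}"
```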

                  Scale with QPS

                  apiVersion: "serving.kserve.io/v1beta1"
                  kind: "InferenceService"
                  metadata:
                    name: "sklearn-iris"
                    namespace: kserve-test
                  spec:
                    predictor:
                      scaleTarget: 1
                      scaleMetric: qps
                      model:
                        args: ["--enable_docs_url=True"]
                        modelFormat:
                          name: sklearn
                        resources: {}
                        runtime: kserve-sklearnserver
                        storageUri: "gs://kfserving-examples/models/sklearn/1.0/model"

                  Scale with GPU

                  apiVersion: "serving.kserve.io/v1beta1"
                  kind: "InferenceService"
                  metadata:
                    name: "flowers-sample-gpu"
                    namespace: kserve-test
                  spec:
                    predictor:
                      scaleTarget: 1
                      scaleMetric: concurrency
                      model:
                        modelFormat:
                          name: tensorflow
                        storageUri: "gs://kfserving-examples/models/tensorflow/flowers"
                        runtimeVersion: "2.6.2-gpu"
                        resources:
                          limits:
                            nvidia.com/gpu: 1

                  Enable Scale To Zero

                  apiVersion: "serving.kserve.io/v1beta1"
                  kind: "InferenceService"
                  metadata:
                    name: "sklearn-iris"
                    namespace: kserve-test
                  spec:
                    predictor:
                      minReplicas: 0
                      model:
                        args: ["--enable_docs_url=True"]
                        modelFormat:
                          name: sklearn
                        resources: {}
                        runtime: kserve-sklearnserver
                        storageUri: "gs://kfserving-examples/models/sklearn/1.0/model"
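With minReplicas: 0 the predictor scales down once traffic stops. Assuming Knative's defaults (a 60s stable window plus a 30s scale-to-zero grace period, both configurable in the config-autoscaler ConfigMap), the last pod should disappear roughly 90 seconds after the final request; you can watch it with `kubectl -n kserve-test get pods -w`.

```shell
STABLE_WINDOW=60  # autoscaling stable window (Knative default, seconds)
GRACE_PERIOD=30   # scale-to-zero-grace-period (Knative default, seconds)
TOTAL_WAIT=$((STABLE_WINDOW + GRACE_PERIOD))
echo "expect pod termination ~${TOTAL_WAIT}s after the last request"
```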

                  Prepare Concurrent Requests Container

                  # export INGRESS_PORT=$(kubectl -n istio-system get service istio-ingressgateway -o jsonpath='{.spec.ports[?(@.name=="http2")].nodePort}')
                  podman run --rm \
                        -v /root/kserve/iris-input.json:/tmp/iris-input.json \
                        --privileged \
                        -e INGRESS_HOST=$(minikube ip) \
                        -e INGRESS_PORT=32132 \
                        -e MODEL_NAME=sklearn-iris \
                        -e INPUT_PATH=/tmp/iris-input.json \
                        -e SERVICE_HOSTNAME=sklearn-iris.kserve-test.example.com \
                        -it m.daocloud.io/docker.io/library/golang:1.22  bash -c "go install github.com/rakyll/hey@latest; bash"

                  Fire

Send traffic for 30 seconds, maintaining 100 concurrent in-flight requests.

                  hey -z 30s -c 100 -m POST -host ${SERVICE_HOSTNAME} -D $INPUT_PATH http://${INGRESS_HOST}:${INGRESS_PORT}/v1/models/$MODEL_NAME:predict
Expected Output
                  Summary:
                    Total:        30.1390 secs
                    Slowest:      0.5015 secs
                    Fastest:      0.0252 secs
                    Average:      0.1451 secs
                    Requests/sec: 687.3483
                    
                    Total data:   4371076 bytes
                    Size/request: 211 bytes
                  
                  Response time histogram:
                    0.025 [1]     |
                    0.073 [14]    |
                    0.120 [33]    |
                    0.168 [19363] |■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■
                    0.216 [1171]  |■■
                    0.263 [28]    |
                    0.311 [6]     |
                    0.359 [0]     |
                    0.406 [0]     |
                    0.454 [0]     |
                    0.502 [100]   |
                  
                  
                  Latency distribution:
                    10% in 0.1341 secs
                    25% in 0.1363 secs
                    50% in 0.1388 secs
                    75% in 0.1462 secs
                    90% in 0.1587 secs
                    95% in 0.1754 secs
                    99% in 0.1968 secs
                  
                  Details (average, fastest, slowest):
                    DNS+dialup:   0.0000 secs, 0.0252 secs, 0.5015 secs
                    DNS-lookup:   0.0000 secs, 0.0000 secs, 0.0000 secs
                    req write:    0.0000 secs, 0.0000 secs, 0.0005 secs
                    resp wait:    0.1451 secs, 0.0251 secs, 0.5015 secs
                    resp read:    0.0000 secs, 0.0000 secs, 0.0003 secs
                  
                  Status code distribution:
                    [500] 20716 responses
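Two things are worth checking in a hey summary like the one above. First, the status code distribution: here all 20716 responses were HTTP 500, so the latency and throughput numbers describe failing requests; check the predictor logs before drawing conclusions. Second, Requests/sec should equal total responses divided by total duration, which is easy to verify:

```shell
TOTAL_RESPONSES=20716   # from the status code distribution above
DURATION_SECS=30.1390   # "Total" from the summary above
RPS=$(awk -v n="$TOTAL_RESPONSES" -v t="$DURATION_SECS" 'BEGIN { printf "%.2f", n / t }')
echo "computed Requests/sec: ${RPS}"   # close to hey's reported 687.3483
```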

                  Reference

                  For more information, please refer to the KPA documentation.

                  Mar 7, 2024

                  Subsections of Knative

                  Subsections of Eventing

                  Broker

The Knative Broker is the core component of the Knative Eventing system. Its main role is to act as the hub for event routing and distribution, providing decoupled, reliable event delivery between event producers (event sources) and event consumers (services).

The key roles of the Knative Broker in detail:

Event ingestion hub:

The Broker is the entry point where event streams converge. Event sources of all kinds (Kafka topics, HTTP sources, Cloud Pub/Sub, GitHub webhooks, timers, custom sources, and so on) send their events to the Broker.

Event producers only need to know the Broker's address; they do not need to care which consumers exist or where they run.

Event storage and buffering:

A Broker is usually backed by a persistent messaging system (such as Apache Kafka, Google Cloud Pub/Sub, RabbitMQ, NATS Streaming, or the in-memory InMemoryChannel). This provides:

Persistence: events are not lost before consumers process them (depending on the underlying channel implementation).

Buffering: when consumers are temporarily unavailable or cannot keep up with the event rate, the Broker buffers events, preventing event loss and protecting producers/consumers from overload.

Retries: if a consumer fails to process an event, the Broker can redeliver it (usually in combination with the retry policies of Triggers and Subscriptions).

Decoupling event sources from event consumers:

This is one of the Broker's most important roles. Event sources only send events to the Broker and are completely unaware of which services will consume them.

Event consumers declare which events they are interested in by creating a Trigger against the Broker. Consumers only need to know that the Broker exists; they do not need to know which specific source produced an event.

This decoupling greatly improves the system's flexibility and maintainability:

Independent evolution: event sources and consumers can be added, removed, or modified independently, as long as they honor the Broker's contract.

Dynamic routing: events are routed dynamically to different consumers based on event attributes (such as type and source), without changing producer or consumer code.

Multicast: the same event can be consumed by multiple consumers at the same time (one event -> Broker -> multiple matching Triggers -> multiple services).

Event filtering and routing (via Triggers):

The Broker itself does not implement complex filtering logic; filtering and routing are handled by Trigger resources.

A Trigger is bound to a specific Broker.

A Trigger defines:

Subscriber: the address of the target service (a Knative Service, Kubernetes Service, Channel, etc.).

Filter: a condition on event attributes (mainly type and source, plus other extensible attributes). Only events that satisfy the condition are routed by the Broker through that Trigger to its subscriber.

After the Broker receives an event, it evaluates the filters of all Triggers bound to it. For every matching Trigger, the Broker delivers the event to that Trigger's subscriber.

Standard event interface:

The Broker follows the CloudEvents specification; the events it receives and delivers are all in CloudEvents format. This gives events from different sources, and their consumers, a single uniform format and simplifies integration.

Multi-tenancy and namespace isolation:

Brokers are deployed in specific Kubernetes namespaces, and one namespace can host multiple Brokers.

This allows event streams to be isolated per team, application, or environment (such as dev and staging) within the same cluster. Each team/application manages the Brokers and Triggers in its own namespace.

An analogy:

Think of the Knative Broker as a highly intelligent postal sorting center:

Receiving letters (events): letters (events) from all over the world (different event sources) are mailed to the sorting center (the Broker).

Storing letters: the sorting center has a warehouse (persistence/buffering) that holds letters temporarily, keeping them safe from loss.

Sorting rules (Triggers): the center employs many sorters (Triggers). Each sorter is responsible for letters of a particular kind or from a particular region (filtering on event attributes).

Delivering letters: each sorter (Trigger) picks out the letters (events) that match its rules and delivers them to the right recipient (the subscriber service).

Decoupling: senders (event sources) only need the sorting center's (Broker's) address and never need to know who or where the recipients (consumers) are. Recipients (consumers) just give their address to the sorter handling their kind of mail (by creating a Trigger), without caring who sent the letters. The sorting center (Broker) and its sorters (Triggers) take care of the complex routing in between.

The core value a Broker brings:

Loose coupling: event producers and consumers are fully decoupled.

Flexibility: consumers can be added or removed and routing rules changed dynamically (by modifying/creating/deleting Triggers).

Reliability: event persistence and retry mechanisms (depending on the underlying implementation).

Scalability: the Broker and the consumers can each scale independently.

Standardization: based on CloudEvents.

Simpler development: developers focus on business logic (producing or consuming events) instead of building complex event-bus infrastructure themselves.

                  Mar 7, 2024

                  Subsections of Broker

                  Install Kafka Broker

                  About

broker

• Source: curl, kafkaSource
                  • Broker
                  • Trigger
                  • Sink: ksvc, isvc

                  Install a Channel (messaging) layer

                  kubectl apply -f https://github.com/knative-extensions/eventing-kafka-broker/releases/download/knative-v1.18.0/eventing-kafka-controller.yaml
Expected Output
                  configmap/kafka-broker-config created
                  configmap/kafka-channel-config created
                  customresourcedefinition.apiextensions.k8s.io/kafkachannels.messaging.knative.dev created
                  customresourcedefinition.apiextensions.k8s.io/consumers.internal.kafka.eventing.knative.dev created
                  customresourcedefinition.apiextensions.k8s.io/consumergroups.internal.kafka.eventing.knative.dev created
                  customresourcedefinition.apiextensions.k8s.io/kafkasinks.eventing.knative.dev created
                  customresourcedefinition.apiextensions.k8s.io/kafkasources.sources.knative.dev created
                  clusterrole.rbac.authorization.k8s.io/eventing-kafka-source-observer created
                  configmap/config-kafka-source-defaults created
                  configmap/config-kafka-autoscaler created
                  configmap/config-kafka-features created
                  configmap/config-kafka-leader-election created
                  configmap/kafka-config-logging created
                  configmap/config-namespaced-broker-resources created
                  configmap/config-tracing configured
                  clusterrole.rbac.authorization.k8s.io/knative-kafka-addressable-resolver created
                  clusterrole.rbac.authorization.k8s.io/knative-kafka-channelable-manipulator created
                  clusterrole.rbac.authorization.k8s.io/kafka-controller created
                  serviceaccount/kafka-controller created
                  clusterrolebinding.rbac.authorization.k8s.io/kafka-controller created
                  clusterrolebinding.rbac.authorization.k8s.io/kafka-controller-addressable-resolver created
                  deployment.apps/kafka-controller created
                  clusterrole.rbac.authorization.k8s.io/kafka-webhook-eventing created
                  serviceaccount/kafka-webhook-eventing created
                  clusterrolebinding.rbac.authorization.k8s.io/kafka-webhook-eventing created
                  mutatingwebhookconfiguration.admissionregistration.k8s.io/defaulting.webhook.kafka.eventing.knative.dev created
                  mutatingwebhookconfiguration.admissionregistration.k8s.io/pods.defaulting.webhook.kafka.eventing.knative.dev created
                  secret/kafka-webhook-eventing-certs created
                  validatingwebhookconfiguration.admissionregistration.k8s.io/validation.webhook.kafka.eventing.knative.dev created
                  deployment.apps/kafka-webhook-eventing created
                  service/kafka-webhook-eventing created
                  kubectl apply -f https://github.com/knative-extensions/eventing-kafka-broker/releases/download/knative-v1.18.0/eventing-kafka-channel.yaml
Expected Output
                  configmap/config-kafka-channel-data-plane created
                  clusterrole.rbac.authorization.k8s.io/knative-kafka-channel-data-plane created
                  serviceaccount/knative-kafka-channel-data-plane created
                  clusterrolebinding.rbac.authorization.k8s.io/knative-kafka-channel-data-plane created
                  statefulset.apps/kafka-channel-dispatcher created
                  deployment.apps/kafka-channel-receiver created
                  service/kafka-channel-ingress created

                  Install a Broker layer

                  kubectl apply -f https://github.com/knative-extensions/eventing-kafka-broker/releases/download/knative-v1.18.0/eventing-kafka-broker.yaml
Expected Output
                  configmap/config-kafka-broker-data-plane created
                  clusterrole.rbac.authorization.k8s.io/knative-kafka-broker-data-plane created
                  serviceaccount/knative-kafka-broker-data-plane created
                  clusterrolebinding.rbac.authorization.k8s.io/knative-kafka-broker-data-plane created
                  statefulset.apps/kafka-broker-dispatcher created
                  deployment.apps/kafka-broker-receiver created
                  service/kafka-broker-ingress created
                  Reference
If you cannot find the kafka-channel-dispatcher pod

check the StatefulSets:

                  root@ay-k3s01:~# kubectl -n knative-eventing  get sts
                  NAME                       READY   AGE
                  kafka-broker-dispatcher    1/1     19m
                  kafka-channel-dispatcher   0/0     22m

Here the kafka-channel-dispatcher StatefulSet has 0 replicas; scale it up and investigate why it was not reconciled.

                  [Optional] Install Eventing extensions

                  • kafka sink
                  kubectl apply -f https://github.com/knative-extensions/eventing-kafka-broker/releases/download/knative-v1.18.0/eventing-kafka-sink.yaml
                  Reference

For more information, see 🔗https://knative.dev/docs/eventing/sinks/kafka-sink/

                  • kafka source
                  kubectl apply -f https://github.com/knative-extensions/eventing-kafka-broker/releases/download/knative-v1.18.0/eventing-kafka-source.yaml
                  Reference

For more information, see 🔗https://knative.dev/docs/eventing/sources/kafka-source/

                  Mar 7, 2024

                  Display Broker Message

                  Flow

                  flowchart LR
                      A[Curl] -->|HTTP| B{Broker}
                      B -->|Subscribe| D[Trigger1]
                      B -->|Subscribe| E[Trigger2]
                      B -->|Subscribe| F[Trigger3]
                      E --> G[Display Service]

Steps

                  1. Create Broker Setting

                  kubectl apply -f - <<EOF
                  apiVersion: v1
                  kind: ConfigMap
                  metadata:
                    name: kafka-broker-config
                    namespace: knative-eventing
                  data:
                    default.topic.partitions: "10"
                    default.topic.replication.factor: "1"
                    bootstrap.servers: "kafka.database.svc.cluster.local:9092" #kafka service address
                    default.topic.config.retention.ms: "3600"
                  EOF
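One thing to double-check in this ConfigMap: default.topic.config.retention.ms is in milliseconds, so "3600" keeps messages for only 3.6 seconds. If one hour of retention was intended, the value should be "3600000".

```shell
RETENTION_MS=3600
echo "retention: ${RETENTION_MS} ms = $((RETENTION_MS / 1000)) s"
ONE_HOUR_MS=$((60 * 60 * 1000))
echo "one hour would be: ${ONE_HOUR_MS} ms"
```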

                  2. Create Broker

                  kubectl apply -f - <<EOF
                  apiVersion: eventing.knative.dev/v1
                  kind: Broker
                  metadata:
                    annotations:
                      eventing.knative.dev/broker.class: Kafka
                    name: first-broker
                    namespace: kserve-test
                  spec:
                    config:
                      apiVersion: v1
                      kind: ConfigMap
                      name: kafka-broker-config
                      namespace: knative-eventing
                  EOF
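A Kafka Broker's ingress URL follows the pattern http://kafka-broker-ingress.knative-eventing.svc.cluster.local/&lt;namespace&gt;/&lt;broker-name&gt; (you can confirm the resolved address with `kubectl -n kserve-test get broker first-broker`). Deriving it in shell for the Broker just created:

```shell
NAMESPACE=kserve-test
BROKER=first-broker
BROKER_URL="http://kafka-broker-ingress.knative-eventing.svc.cluster.local/${NAMESPACE}/${BROKER}"
echo "$BROKER_URL"
```

This is exactly the URL the curl test below posts events to.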


                  3. Create Trigger

                  kubectl apply -f - <<EOF
                  apiVersion: eventing.knative.dev/v1
                  kind: Trigger
                  metadata:
                    name: display-service-trigger
                    namespace: kserve-test
                  spec:
                    broker: first-broker
                    subscriber:
                      ref:
                        apiVersion: serving.knative.dev/v1
                        kind: Service
                        name: event-display
                  EOF

                  4. Create Sink Service (Display Message)

                  kubectl apply -f - <<EOF
                  apiVersion: serving.knative.dev/v1
                  kind: Service
                  metadata:
                    name: event-display
                    namespace: kserve-test
                  spec:
                    template:
                      spec:
                        containers:
                          - image: gcr.io/knative-releases/knative.dev/eventing/cmd/event_display
                  EOF

                  5. Test

                  kubectl run curl-test --image=curlimages/curl -it --rm --restart=Never -- \
                    -v "http://kafka-broker-ingress.knative-eventing.svc.cluster.local/kserve-test/first-broker" \
                    -X POST \
                    -H "Ce-Id: $(date +%s)" \
                    -H "Ce-Specversion: 1.0" \
                    -H "Ce-Type: test.type" \
                    -H "Ce-Source: curl-test" \
                    -H "Content-Type: application/json" \
                    -d '{"test": "Broker is working"}'

                  6. Check message

                  kubectl -n kserve-test logs -f deploy/event-display-00001-deployment 
Expected Output
                  2025/07/02 09:01:25 Failed to read tracing config, using the no-op default: empty json tracing config
                  ☁️  cloudevents.Event
                  Context Attributes,
                    specversion: 1.0
                    type: test.type
                    source: curl-test
                    id: 1751446880
                    datacontenttype: application/json
                  Extensions,
                    knativekafkaoffset: 6
                    knativekafkapartition: 6
                  Data,
                    {
                      "test": "Broker is working"
                    }
                  Mar 7, 2024

                  Kafka Broker Invoke ISVC

                  1. Prepare RBAC

                  • create cluster role to access CRD isvc
                  kubectl apply -f - <<EOF
                  apiVersion: rbac.authorization.k8s.io/v1
                  kind: ClusterRole
                  metadata:
                    name: kserve-access-for-knative
                  rules:
                  - apiGroups: ["serving.kserve.io"]
                    resources: ["inferenceservices", "inferenceservices/status"]
                    verbs: ["get", "list", "watch"]
                  EOF
                  • create rolebinding and grant privileges
                  kubectl apply -f - <<EOF
                  apiVersion: rbac.authorization.k8s.io/v1
                  kind: ClusterRoleBinding
                  metadata:
                    name: kafka-controller-kserve-access
                  roleRef:
                    apiGroup: rbac.authorization.k8s.io
                    kind: ClusterRole
                    name: kserve-access-for-knative
                  subjects:
                  - kind: ServiceAccount
                    name: kafka-controller
                    namespace: knative-eventing
                  EOF

                  2. Create Broker Setting

                  kubectl apply -f - <<EOF
                  apiVersion: v1
                  kind: ConfigMap
                  metadata:
                    name: kafka-broker-config
                    namespace: knative-eventing
                  data:
                    default.topic.partitions: "10"
                    default.topic.replication.factor: "1"
                    bootstrap.servers: "kafka.database.svc.cluster.local:9092" #kafka service address
                    default.topic.config.retention.ms: "3600"
                  EOF

                  3. Create Broker

                  kubectl apply -f - <<EOF
                  apiVersion: eventing.knative.dev/v1
                  kind: Broker
                  metadata:
                    annotations:
                      eventing.knative.dev/broker.class: Kafka
                    name: isvc-broker
                    namespace: kserve-test
                  spec:
                    config:
                      apiVersion: v1
                      kind: ConfigMap
                      name: kafka-broker-config
                      namespace: knative-eventing
                    delivery:
                      deadLetterSink:
                        ref:
                          apiVersion: serving.knative.dev/v1
                          kind: Service
                          name: event-display
                  EOF

                  4. Create InferenceService

                  Reference

You can create the first-torchserve InferenceService by following this 🔗link

                  5. Create Trigger

                  kubectl apply -f - << EOF
                  apiVersion: eventing.knative.dev/v1
                  kind: Trigger
                  metadata:
                    name: kserve-trigger
                    namespace: kserve-test
                  spec:
                    broker: isvc-broker
                    filter:
                      attributes:
                        type: prediction-request
                    subscriber:
                      uri: http://first-torchserve.kserve-test.svc.cluster.local/v1/models/mnist:predict
                  EOF

                  6. Test

Normally, we can invoke first-torchserve by executing

                  export MASTER_IP=192.168.100.112
                  export ISTIO_INGRESS_PORT=$(kubectl -n istio-system get service istio-ingressgateway -o jsonpath='{.spec.ports[?(@.name=="http2")].nodePort}')
                  export SERVICE_HOSTNAME=$(kubectl -n kserve-test get inferenceservice first-torchserve  -o jsonpath='{.status.url}' | cut -d "/" -f 3)
                  # http://first-torchserve.kserve-test.example.com 
                  curl -v -H "Host: ${SERVICE_HOSTNAME}" -H "Content-Type: application/json" "http://${MASTER_IP}:${ISTIO_INGRESS_PORT}/v1/models/mnist:predict" -d @./mnist-input.json
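The `cut -d "/" -f 3` above strips the scheme from the status URL, leaving only the hostname. A quick local check of that extraction, using a sample of the status URL from the comment above:

```shell
URL="http://first-torchserve.kserve-test.example.com"  # sample status URL
SERVICE_HOSTNAME=$(echo "$URL" | cut -d "/" -f 3)
echo "$SERVICE_HOSTNAME"
```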

                  Now, you can access model by executing

                  export KAFKA_BROKER_INGRESS_PORT=$(kubectl -n knative-eventing get service kafka-broker-ingress -o jsonpath='{.spec.ports[?(@.name=="http-container")].nodePort}')
                  curl -v "http://${MASTER_IP}:${KAFKA_BROKER_INGRESS_PORT}/kserve-test/isvc-broker" \
                    -X POST \
                    -H "Ce-Id: $(date +%s)" \
                    -H "Ce-Specversion: 1.0" \
                    -H "Ce-Type: prediction-request" \
                    -H "Ce-Source: event-producer" \
                    -H "Content-Type: application/json" \
                    -d @./mnist-input.json 
If you cannot see the prediction result

check Kafka:

                  # list all topics, find suffix is `isvc-broker` -> knative-broker-kserve-test-isvc-broker
                  kubectl -n database exec -it deployment/kafka-client-tools -- bash -c \
                      'kafka-topics.sh --bootstrap-server $BOOTSTRAP_SERVER --command-config $CLIENT_CONFIG_FILE --list'
                  # retrieve msg from that topic
                  kubectl -n database exec -it deployment/kafka-client-tools -- bash -c \
                    'kafka-console-consumer.sh --bootstrap-server $BOOTSTRAP_SERVER --consumer.config $CLIENT_CONFIG_FILE --topic knative-broker-kserve-test-isvc-broker --from-beginning'

                  And then, you could see

                  {
                      "instances": [
                          {
                              "data": "iVBORw0KGgoAAAANSUhEUgAAABwAAAAcCAAAAABXZoBIAAAAw0lEQVR4nGNgGFggVVj4/y8Q2GOR83n+58/fP0DwcSqmpNN7oOTJw6f+/H2pjUU2JCSEk0EWqN0cl828e/FIxvz9/9cCh1zS5z9/G9mwyzl/+PNnKQ45nyNAr9ThMHQ/UG4tDofuB4bQIhz6fIBenMWJQ+7Vn7+zeLCbKXv6z59NOPQVgsIcW4QA9YFi6wNQLrKwsBebW/68DJ388Nun5XFocrqvIFH59+XhBAxThTfeB0r+vP/QHbuDCgr2JmOXoSsAAKK7bU3vISS4AAAAAElFTkSuQmCC"
                          }
                      ]
                  }
                  {
                      "predictions": [
                          2
                      ]
                  }
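The topic name used above is not arbitrary: Kafka Broker topics follow the pattern knative-broker-&lt;namespace&gt;-&lt;broker-name&gt;, which is why the listing step looks for the isvc-broker suffix. Deriving it in shell:

```shell
NAMESPACE=kserve-test
BROKER=isvc-broker
TOPIC="knative-broker-${NAMESPACE}-${BROKER}"
echo "$TOPIC"
```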
                  Mar 7, 2024

                  Subsections of Plugin

                  Subsections of Eventing Kafka Broker

                  Prepare Dev Environment

                  1. update go -> 1.24

2. install ko -> 0.18.0

                  go install github.com/google/ko@latest
                  # wget https://github.com/ko-build/ko/releases/download/v0.18.0/ko_0.18.0_Linux_x86_64.tar.gz
                  # tar -xzf ko_0.18.0_Linux_x86_64.tar.gz  -C /usr/local/bin/ko
                  # cp /usr/local/bin/ko/ko /root/bin
3. protoc
                  PB_REL="https://github.com/protocolbuffers/protobuf/releases"
                  curl -LO $PB_REL/download/v30.2/protoc-30.2-linux-x86_64.zip
                  # mkdir -p ${HOME}/bin/
                  mkdir -p /usr/local/bin/protoc
                  unzip protoc-30.2-linux-x86_64.zip -d /usr/local/bin/protoc
                  cp /usr/local/bin/protoc/bin/protoc /root/bin
                  # export PATH="$PATH:/root/bin"
                  rm -rf protoc-30.2-linux-x86_64.zip
4. protoc-gen-go -> 1.5.4
                  go install google.golang.org/protobuf/cmd/protoc-gen-go@latest
                  export GOPATH=/usr/local/go/bin
5. clone the code
                  mkdir -p ${GOPATH}/src/knative.dev
                  cd ${GOPATH}/src/knative.dev
                  git clone git@github.com:knative/eventing.git # clone eventing repo
                  git clone git@github.com:AaronYang0628/eventing-kafka-broker.git
                  cd eventing-kafka-broker
                  git remote add upstream https://github.com/knative-extensions/eventing-kafka-broker.git
                  git remote set-url --push upstream no_push
                  export KO_DOCKER_REPO=docker-registry.lab.zverse.space/data-and-computing/ay-dev
                  Mar 7, 2024

Build Async Prediction Flow

                  Flow

                  flowchart LR
                      A[User Curl] -->|HTTP| B{ISVC-Broker:Kafka}
                      B -->|Subscribe| D[Trigger1]
    B -->|Subscribe| E[Kserve-Trigger]
                      B -->|Subscribe| F[Trigger3]
                      E --> G[Mnist Service]
                      G --> |Kafka-Sink| B

Steps

                  1. Create Broker Setting

                  kubectl apply -f - <<EOF
                  apiVersion: v1
                  kind: ConfigMap
                  metadata:
                    name: kafka-broker-config
                    namespace: knative-eventing
                  data:
                    default.topic.partitions: "10"
                    default.topic.replication.factor: "1"
                    bootstrap.servers: "kafka.database.svc.cluster.local:9092" #kafka service address
                    default.topic.config.retention.ms: "3600"
                  EOF

                  2. Create Broker

                  kubectl apply -f - <<EOF
                  apiVersion: eventing.knative.dev/v1
                  kind: Broker
                  metadata:
                    annotations:
                      eventing.knative.dev/broker.class: Kafka
                    name: isvc-broker
                    namespace: kserve-test
                  spec:
                    config:
                      apiVersion: v1
                      kind: ConfigMap
                      name: kafka-broker-config
                      namespace: knative-eventing
                  EOF

                  3. Create Trigger

                  kubectl apply -f - << EOF
                  apiVersion: eventing.knative.dev/v1
                  kind: Trigger
                  metadata:
                    name: kserve-trigger
                    namespace: kserve-test
                  spec:
                    broker: isvc-broker
                    filter:
                      attributes:
                        type: prediction-request-udf-attr # you can change this
                    subscriber:
                      uri: http://prediction-and-sink.kserve-test.svc.cluster.local/v1/models/mnist:predict
                  EOF
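The flow diagram above also shows other subscribers (Trigger1, Trigger3) fanning out from the same broker. A minimal sketch of such an additional trigger — the `audit-trigger` name and the `audit-logger` subscriber URI are hypothetical — filtering on the same event type:

```yaml
apiVersion: eventing.knative.dev/v1
kind: Trigger
metadata:
  name: audit-trigger                 # hypothetical name
  namespace: kserve-test
spec:
  broker: isvc-broker
  filter:
    attributes:
      type: prediction-request-udf-attr
  subscriber:
    uri: http://audit-logger.kserve-test.svc.cluster.local/   # hypothetical service
```

Every trigger with a matching filter receives its own copy of each event, so subscribers stay independent of one another.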

                  4. Create InferenceService

kubectl apply -f - <<EOF
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: prediction-and-sink
  namespace: kserve-test
spec:
  predictor:
    model:
      modelFormat:
        name: pytorch
      storageUri: gs://kfserving-examples/models/torchserve/image_classifier/v1
  transformer:
    containers:
      - image: docker-registry.lab.zverse.space/data-and-computing/ay-dev/msg-transformer:dev9
        name: kserve-container
        env:
        - name: KAFKA_BOOTSTRAP_SERVERS
          value: kafka.database.svc.cluster.local
        - name: KAFKA_TOPIC
          value: test-topic # result will be saved in this topic
        - name: REQUEST_TRACE_KEY
          value: test-trace-id # use this key to retrieve the prediction result
        command:
          - "python"
          - "-m"
          - "model"
        args:
          - --model_name
          - mnist
EOF
Expected Output
                  root@ay-k3s01:~# kubectl -n kserve-test get pod
                  NAME                                                              READY   STATUS    RESTARTS   AGE
                  prediction-and-sink-predictor-00001-deployment-f64bb76f-jqv4m     2/2     Running   0          3m46s
                  prediction-and-sink-transformer-00001-deployment-76cccd867lksg9   2/2     Running   0          4m3s

The source code of docker-registry.lab.zverse.space/data-and-computing/ay-dev/msg-transformer:dev9 can be found 🔗here

                  [Optional] 5. Invoke InferenceService Directly

                  • preparation
                  wget -O ./mnist-input.json https://raw.githubusercontent.com/kserve/kserve/refs/heads/master/docs/samples/v1beta1/torchserve/v1/imgconv/input.json
                  SERVICE_NAME=prediction-and-sink
                  MODEL_NAME=mnist
                  INPUT_PATH=@./mnist-input.json
                  PLAIN_SERVICE_HOSTNAME=$(kubectl -n kserve-test get inferenceservice $SERVICE_NAME -o jsonpath='{.status.url}' | cut -d "/" -f 3)
                  • fire!!
                  export INGRESS_HOST=192.168.100.112
                  export INGRESS_PORT=$(kubectl -n istio-system get service istio-ingressgateway -o jsonpath='{.spec.ports[?(@.name=="http2")].nodePort}')
                  curl -v -H "Host: ${PLAIN_SERVICE_HOSTNAME}" -H "Content-Type: application/json" -d $INPUT_PATH http://${INGRESS_HOST}:${INGRESS_PORT}/v1/models/$MODEL_NAME:predict
Expected Output
                  curl -v -H "Host: ${PLAIN_SERVICE_HOSTNAME}" -H "Content-Type: application/json" -d $INPUT_PATH http://${INGRESS_HOST}:${INGRESS_PORT}/v1/models/$MODEL_NAME:predict
                  *   Trying 192.168.100.112:31855...
                  * Connected to 192.168.100.112 (192.168.100.112) port 31855
                  > POST /v1/models/mnist:predict HTTP/1.1
                  > Host: prediction-and-sink.kserve-test.ay.test.dev
                  > User-Agent: curl/8.5.0
                  > Accept: */*
                  > Content-Type: application/json
                  > Content-Length: 401
                  > 
                  < HTTP/1.1 200 OK
                  < content-length: 19
                  < content-type: application/json
                  < date: Wed, 02 Jul 2025 08:55:05 GMT,Wed, 02 Jul 2025 08:55:04 GMT
                  < server: istio-envoy
                  < x-envoy-upstream-service-time: 209
                  < 
                  * Connection #0 to host 192.168.100.112 left intact
                  {"predictions":[2]}

                  6. Invoke Broker

                  • preparation
                  cat > image-with-trace-id.json << EOF
                  {
                      "test-trace-id": "16ec3446-48d6-422e-9926-8224853e84a7",
                      "instances": [
                          {
                              "data": "iVBORw0KGgoAAAANSUhEUgAAABwAAAAcCAAAAABXZoBIAAAAw0lEQVR4nGNgGFggVVj4/y8Q2GOR83n+58/fP0DwcSqmpNN7oOTJw6f+/H2pjUU2JCSEk0EWqN0cl828e/FIxvz9/9cCh1zS5z9/G9mwyzl/+PNnKQ45nyNAr9ThMHQ/UG4tDofuB4bQIhz6fIBenMWJQ+7Vn7+zeLCbKXv6z59NOPQVgsIcW4QA9YFi6wNQLrKwsBebW/68DJ388Nun5XFocrqvIFH59+XhBAxThTfeB0r+vP/QHbuDCgr2JmOXoSsAAKK7bU3vISS4AAAAAElFTkSuQmCC"
                          }
                      ]
                  }
                  EOF
                  • fire!!
                  export MASTER_IP=192.168.100.112
                  export KAFKA_BROKER_INGRESS_PORT=$(kubectl -n knative-eventing get service kafka-broker-ingress -o jsonpath='{.spec.ports[?(@.name=="http-container")].nodePort}')
                  
                  curl -v "http://${MASTER_IP}:${KAFKA_BROKER_INGRESS_PORT}/kserve-test/isvc-broker" \
                    -X POST \
                    -H "Ce-Id: $(date +%s)" \
                    -H "Ce-Specversion: 1.0" \
                    -H "Ce-Type: prediction-request-udf-attr" \
                    -H "Ce-Source: event-producer" \
                    -H "Content-Type: application/json" \
                    -d @./image-with-trace-id.json 
                  • check input data in kafka topic knative-broker-kserve-test-isvc-broker
                  kubectl -n database exec -it deployment/kafka-client-tools -- bash -c \
                    'kafka-console-consumer.sh --bootstrap-server $BOOTSTRAP_SERVER --consumer.config $CLIENT_CONFIG_FILE --topic knative-broker-kserve-test-isvc-broker --from-beginning'
Expected Output
                  {
                      "test-trace-id": "16ec3446-48d6-422e-9926-8224853e84a7",
                      "instances": [
                      {
                          "data": "iVBORw0KGgoAAAANSUhEUgAAABwAAAAcCAAAAABXZoBIAAAAw0lEQVR4nGNgGFggVVj4/y8Q2GOR83n+58/fP0DwcSqmpNN7oOTJw6f+/H2pjUU2JCSEk0EWqN0cl828e/FIxvz9/9cCh1zS5z9/G9mwyzl/+PNnKQ45nyNAr9ThMHQ/UG4tDofuB4bQIhz6fIBenMWJQ+7Vn7+zeLCbKXv6z59NOPQVgsIcW4QA9YFi6wNQLrKwsBebW/68DJ388Nun5XFocrqvIFH59+XhBAxThTfeB0r+vP/QHbuDCgr2JmOXoSsAAKK7bU3vISS4AAAAAElFTkSuQmCC"
                      }]
                  }
                  {
                      "predictions": [2] // result will be saved in this topic as well
                  }
                  • check response result in kafka topic test-topic
                  kubectl -n database exec -it deployment/kafka-client-tools -- bash -c \
                    'kafka-console-consumer.sh --bootstrap-server $BOOTSTRAP_SERVER --consumer.config $CLIENT_CONFIG_FILE --topic test-topic --from-beginning'

{
    "specversion": "1.0",
    "id": "822e3115-0185-4752-9967-f408dda72004",
    "source": "data-and-computing/kafka-sink-transformer",
    "type": "org.zhejianglab.zverse.data-and-computing.kafka-sink-transformer",
    "time": "2025-07-02T08:57:04.133497+00:00",
    "data":
    {
        "predictions": [2]
    },
    "request-host": "prediction-and-sink-transformer.kserve-test.svc.cluster.local",
    "kserve-isvc-name": "prediction-and-sink",
    "kserve-isvc-namespace": "kserve-test",
    "test-trace-id": "16ec3446-48d6-422e-9926-8224853e84a7"
}
Use the test-trace-id field to correlate each result with its original request.

                  Mar 7, 2024

                  Subsections of 🏗️Linux

                  Cheatsheet

                  useradd

                  sudo useradd <$name> -m -r -s /bin/bash -p <$password>
add as sudoer
                  echo '<$name> ALL=(ALL) NOPASSWD: ALL' >> /etc/sudoers

                  telnet

a command line interface for communication with a remote device or server

                  telnet <$ip> <$port>
                  for example
                  telnet 172.27.253.50 9000 #test application connectivity

lsof (list open files)

                  everything is a file

                  lsof <$option:value>
                  for example

                  -a List processes that have open files

                  -c <process_name> List files opened by the specified process

                  -g List GID number process details

                  -d <file_number> List the processes occupying this file number

+d List open files under a directory

+D Recursively list open files under a directory

                  -n List files using NFS

                  -i List eligible processes. (protocol, :port, @ip)

                  -p List files opened by the specified process ID

                  -u List UID number process details

                  lsof -i:30443 # find port 30443 
                  lsof -i -P -n # list all connections

awk (named after Aho, Weinberger, and Kernighan)

                  awk is a scripting language used for manipulating data and generating reports.

                  # awk [params] 'script' 
                  awk <$params> <$string_content>
                  for example

print lines whose first field is greater than 3

                  echo -e "1\n2\n3\n4\n5\n" | awk '$1>3'
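Besides filtering, awk can aggregate across lines; a small sketch summing a column:

```shell
# sum the 2nd column across all input lines
printf 'apples 3\npears 5\nplums 7\n' | awk '{ total += $2 } END { print total }'   # prints 15
```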


                  ss (socket statistics)

                  view detailed information about your system’s network connections, including TCP/IP, UDP, and Unix domain sockets

                  ss [options]
                  for example
Option  Description
-t      Display TCP sockets
-l      Display listening sockets
-n      Show numerical addresses instead of resolving
-a      Display all sockets (listening and non-listening)
                  #show all listening TCP connection
                  ss -tln
                  #show all established TCP connections
                  ss -tan

                  clean files 3 days ago

                  find /aaa/bbb/ccc/*.gz -mtime +3 -exec rm {} \;
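Before pointing `-mtime +3` at real data, the selection can be rehearsed in a scratch directory (this sketch assumes GNU `touch -d`):

```shell
# rehearse the cleanup: only files modified more than 3 days ago should match
tmp=$(mktemp -d)
touch "$tmp/new.gz"                      # modified now
touch -d '10 days ago' "$tmp/old.gz"     # modified 10 days ago
find "$tmp" -name '*.gz' -mtime +3       # lists only old.gz
find "$tmp" -name '*.gz' -mtime +3 -exec rm {} \;
rm -rf "$tmp"
```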

                  ssh without affect $HOME/.ssh/known_hosts

                  ssh -o "UserKnownHostsFile /dev/null" root@aaa.domain.com
                  ssh -o "UserKnownHostsFile /dev/null" -o "StrictHostKeyChecking=no" root@aaa.domain.com

                  sync clock

                  [yum|dnf] install -y chrony \
                      && systemctl enable chronyd \
                      && (systemctl is-active chronyd || systemctl start chronyd) \
                      && chronyc sources \
                      && chronyc tracking \
                      && timedatectl set-timezone 'Asia/Shanghai'

                  set hostname

                  hostnamectl set-hostname develop

                  add remote key to other server

                  ssh -o "UserKnownHostsFile /dev/null" \
                      root@aaa.bbb.ccc \
                      "mkdir -p /root/.ssh && chmod 700 /root/.ssh && echo '$SOME_PUBLIC_KEY' \
                      >> /root/.ssh/authorized_keys && chmod 600 /root/.ssh/authorized_keys"
                  for example
                  ssh -o "UserKnownHostsFile /dev/null" \
                      root@17.27.253.67 \
                      "mkdir -p /root/.ssh && chmod 700 /root/.ssh && echo 'ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAABgQC00JLKF/Cd//rJcdIVGCX3ePo89KAgEccvJe4TEHs5pI5FSxs/7/JfQKZ+by2puC3IT88bo/d7nStw9PR3BXgqFXaBCknNBpSLWBIuvfBF+bcL+jGnQYo2kPjrO+2186C5zKGuPRi9sxLI5AkamGB39L5SGqwe5bbKq2x/8OjUP25AlTd99XsNjEY2uxNVClHysExVad/ZAcl0UVzG5xmllusXCsZVz9HlPExqB6K1sfMYWvLVgSCChx6nUfgg/NZrn/kQG26X0WdtXVM2aXpbAtBioML4rWidsByDb131NqYpJF7f+x3+I5pQ66Qpc72FW1G4mUiWWiGhF9tL8V9o1AY96Rqz0AVaxAQrBEuyCWKrXbA97HeC3Xp57Luvlv9TqUd8CIJYq+QTL0hlIDrzK9rJsg34FRAvf9sh8K2w/T/gC9UnRjRXgkPUgKldq35Y6Z9wP6KY45gCXka1PU4nVqb6wicO+RHcZ5E4sreUwqfTypt5nTOgW2/p8iFhdN8= Administrator@AARON-X1-8TH' \
                      >> /root/.ssh/authorized_keys && chmod 600 /root/.ssh/authorized_keys"

                  set -x

                  This will print each command to the standard error before executing it, which is useful for debugging scripts.

                  set -x

                  set -e

                  Exit immediately if a command exits with a non-zero status.

set -e

                  sed (Stream Editor)

                  sed <$option> <$file_path>
                  for example

                  replace unix -> linux

                  echo "linux is great os. unix is opensource. unix is free os." | sed 's/unix/linux/'

                  or you can check https://www.geeksforgeeks.org/sed-command-in-linux-unix-with-examples/

                  fdisk

                  list all disk

                  fdisk -l

create XFS file system

                  Use mkfs.xfs command to create xfs file system and internal log on the same disk, example is shown below:

                  mkfs.xfs <$path>

                  modprobe

                  program to add and remove modules from the Linux Kernel

                  modprobe nfs && modprobe nfsd

                  disown

                  disown command in Linux is used to remove jobs from the job table.

                  disown [options] jobID1 jobID2 ... jobIDN
                  for example

                  for example, there is a job running in the background

                  ping google.com > /dev/null &

use jobs -l to list all running jobs

                  jobs -l

use disown -a to remove all jobs from the job table

                  disown -a

use disown %2 to remove job #2

                  disown %2

                  generate SSH key

ssh-keygen -t rsa -b 4096 -C "aaron19940628@gmail.com"

link binaries into /usr/local/bin

sudo ln -sf <$install_path>/bin/* /usr/local/bin

                  append dir into $PATH (temporary)

                  export PATH="/root/bin:$PATH"
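append dir into $PATH (persistent) — the temporary export above is lost when the shell exits; appending it to a profile file keeps it across sessions:

```shell
# persist the PATH change for future shells
echo 'export PATH="/root/bin:$PATH"' >> "$HOME/.bashrc"
```

Run `source ~/.bashrc` (or open a new shell) to apply it immediately.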

                  copy public key to ECS

                  ssh-copy-id -i ~/.ssh/id_rsa.pub root@10.200.60.53

set DNS nameservers

echo "nameserver 8.8.8.8" | sudo tee /etc/resolv.conf
echo "nameserver 8.8.4.4" | sudo tee -a /etc/resolv.conf

                  Mar 12, 2024

                  Subsections of Command

                  Echo

In a Windows batch file (using the ECHO command)

ECHO content to write > filename.txt
ECHO content to append >> filename.txt

In a Linux/macOS shell

echo "content to write" > filename.txt
echo "content to append" >> filename.txt

In Python

# write to a file (overwrite)
with open('filename.txt', 'w', encoding='utf-8') as f:
    f.write("content to write\n")

# append content
with open('filename.txt', 'a', encoding='utf-8') as f:
    f.write("content to append\n")

In PowerShell

"content to write" | Out-File -FilePath filename.txt
"content to append" | Out-File -FilePath filename.txt -Append

In JavaScript (Node.js)

const fs = require('fs');

// write to a file (overwrite)
fs.writeFileSync('filename.txt', 'content to write\n');

// append content
fs.appendFileSync('filename.txt', 'content to append\n');
                  Sep 7, 2025

                  Grep

grep is a powerful text-search tool on Linux; its name comes from "Global Regular Expression Print". Common usages:

Basic syntax

grep [options] pattern [file...]

Common options

1. Basic searching

# search a file for lines containing "error"
grep "error" filename.log

# case-insensitive search
grep -i "error" filename.log

# show non-matching lines
grep -v "success" filename.log

# show line numbers of matches
grep -n "pattern" filename.txt

2. Recursive search

# search recursively in the current directory and subdirectories
grep -r "function_name" .

# search recursively and show only file names
grep -r -l "text" /path/to/directory

3. Output control

# show only the names of matching files (not the matching lines)
grep -l "pattern" *.txt

# show context around matching lines
grep -A 3 "error" logfile.txt    # 3 lines after the match
grep -B 2 "error" logfile.txt    # 2 lines before the match
grep -C 2 "error" logfile.txt    # 2 lines before and after

# print only the matching part (not the whole line)
grep -o "pattern" file.txt

4. Regular expressions

# use extended regular expressions
grep -E "pattern1|pattern2" file.txt

# match lines starting with "start"
grep "^start" file.txt

# match lines ending with "end"
grep "end$" file.txt

# match empty lines
grep "^$" file.txt

# use character classes
grep "[0-9]" file.txt           # lines containing digits
grep "[a-zA-Z]" file.txt        # lines containing letters

5. Working with files

# search multiple files
grep "text" file1.txt file2.txt

# use wildcards
grep "pattern" *.log

# read from standard input
cat file.txt | grep "pattern"
echo "some text" | grep "text"

6. Statistics

# count matching lines
grep -c "pattern" file.txt

# count matches (a line may contain several)
grep -o "pattern" file.txt | wc -l
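The difference between -c (matching lines) and -o piped to wc -l (total matches) is easy to check on a small sample:

```shell
# two lines contain "ab", but "ab" occurs three times in total
printf 'ab ab\ncd\nab\n' > /tmp/grep-demo.txt
grep -c "ab" /tmp/grep-demo.txt            # prints 2 (matching lines)
grep -o "ab" /tmp/grep-demo.txt | wc -l    # prints 3 (total occurrences)
rm /tmp/grep-demo.txt
```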

Practical examples

1. Log analysis

# find today's error log entries
grep "ERROR" /var/log/syslog | grep "$(date '+%Y-%m-%d')"

# find lines containing IP addresses
grep -E "[0-9]+\.[0-9]+\.[0-9]+\.[0-9]+" access.log

2. Code search

# find function definitions in a project
grep -r "function_name(" src/

# find TODO or FIXME comments
grep -r -E "TODO|FIXME" ./

# find empty lines and count them
grep -c "^$" source_code.py

3. System monitoring

# look for a specific process
ps aux | grep "nginx"

# check port usage
netstat -tulpn | grep ":80"

4. Inspecting file contents

# show effective settings in a config file (skip comments and blank lines)
grep -v "^#" /etc/ssh/sshd_config | grep -v "^$"

# find lines containing email addresses
grep -E "\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b" file.txt

Advanced tips

1. Using context

# show errors with their surrounding context
grep -C 3 -i "error" application.log

2. Backreferences

# use extended-regex groups
grep -E "(abc).*\1" file.txt  # find a repeated "abc"

3. Searching binary files

# search for text strings inside a binary file
grep -a "text" binaryfile

4. Color highlighting

# enable color highlighting (usually on by default)
grep --color=auto "pattern" file.txt

Common combinations

With other commands

# search and sort
grep "pattern" file.txt | sort

# search and count
grep -o "pattern" file.txt | sort | uniq -c

# search and save the results
grep "error" logfile.txt > errors.txt

These are the most common grep usages; mastering them greatly speeds up text processing on Linux.

                  Sep 7, 2025

                  Sed

sed (Stream Editor) is a powerful stream editor on Linux for filtering and transforming text. Common usages:

Basic syntax

sed [options] 'command' file
sed [options] -e 'command1' -e 'command2' file
sed [options] -f script_file file

Common options

1. Basic options

# edit a file in place, backing up the original
sed -i.bak 's/old/new/g' file.txt

# edit a file in place (no backup)
sed -i 's/old/new/g' file.txt

# print only matching lines
sed -n 'command' file.txt

# use extended regular expressions
sed -E 'command' file.txt

Text substitution

1. Basic substitution

# replace the first match on each line
sed 's/old/new/' file.txt

# replace all matches (global)
sed 's/old/new/g' file.txt

# replace the Nth occurrence on each line
sed 's/old/new/2' file.txt    # replace the second occurrence

# substitute only on matching lines
sed '/pattern/s/old/new/g' file.txt

2. Alternative delimiters

# when the pattern contains slashes, use a different delimiter
sed 's|/usr/local|/opt|g' file.txt
sed 's#old#new#g' file.txt

3. References and escaping

# use & to reference the whole matched text
sed 's/[0-9]*/[&]/g' file.txt

# use group references
sed 's/\([a-z]*\) \([a-z]*\)/\2 \1/' file.txt
sed -E 's/([a-z]*) ([a-z]*)/\2 \1/' file.txt  # extended regex
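The group-swap pattern can be tried directly on a pipe:

```shell
# swap the first two lowercase words on the line
echo "hello world" | sed -E 's/([a-z]*) ([a-z]*)/\2 \1/'   # prints "world hello"
```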

Line operations

1. Line addressing

# by line number
sed '5s/old/new/' file.txt        # substitute on line 5 only
sed '1,5s/old/new/g' file.txt     # lines 1-5
sed '5,$s/old/new/g' file.txt     # line 5 through the last line

# by regex match
sed '/^#/s/old/new/' file.txt     # only lines starting with #
sed '/start/,/end/s/old/new/g' file.txt  # from a "start" line to an "end" line

2. Deleting lines

# delete empty lines
sed '/^$/d' file.txt

# delete comment lines
sed '/^#/d' file.txt

# delete by line number
sed '5d' file.txt                 # delete line 5
sed '1,5d' file.txt               # delete lines 1-5
sed '/pattern/d' file.txt         # delete lines matching a pattern

3. Inserting and appending

# insert before a given line
sed '5i\inserted content' file.txt

# append after a given line
sed '5a\appended content' file.txt

# insert at the beginning of the file
sed '1i\content at the top' file.txt

# append at the end of the file
sed '$a\content at the bottom' file.txt

4. Changing lines

# replace a whole line
sed '5c\new line content' file.txt

# replace lines matching a pattern
sed '/pattern/c\new line content' file.txt

Advanced operations

1. Print control

# print only matching lines (like grep)
sed -n '/pattern/p' file.txt

# print line numbers of matches
sed -n '/pattern/=' file.txt

# print both line number and content
sed -n '/pattern/{=;p}' file.txt

2. Multiple commands

# separate commands with semicolons
sed 's/old/new/g; s/foo/bar/g' file.txt

# use -e options
sed -e 's/old/new/' -e 's/foo/bar/' file.txt

# run several operations on matching lines
sed '/pattern/{s/old/new/; s/foo/bar/}' file.txt

3. File operations

# read a file and insert it after matching lines
sed '/pattern/r otherfile.txt' file.txt

# write matching lines to a file
sed '/pattern/w output.txt' file.txt

4. Hold-space operations

# exchange pattern space and hold space
sed '1!G;h;$!d' file.txt          # reverse the order of lines

# copy to the hold space
sed '/pattern/h' file.txt

# fetch from the hold space
sed '/pattern/g' file.txt

Practical examples

1. Editing config files

# change the SSH port
sed -i 's/^#Port 22/Port 2222/' /etc/ssh/sshd_config

# enable root login
sed -i 's/^#PermitRootLogin yes/PermitRootLogin yes/' /etc/ssh/sshd_config

# comment out a line
sed -i '/pattern/s/^/#/' file.txt

# uncomment a line
sed -i '/pattern/s/^#//' file.txt

2. Log processing

# extract timestamps
sed -n 's/.*\([0-9]\{2\}:[0-9]\{2\}:[0-9]\{2\}\).*/\1/p' logfile

# strip leading and trailing whitespace
sed 's/^[ \t]*//;s/[ \t]*$//' file.txt
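The trim expression is worth testing before dropping it into a pipeline (the \t escape inside brackets is a GNU sed extension):

```shell
# strip leading/trailing spaces and tabs
printf '   padded text \t\n' | sed 's/^[ \t]*//;s/[ \t]*$//'   # prints "padded text"
```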

3. Text formatting

# add a comma at the end of each line
sed 's/$/,/' file.txt

# squeeze consecutive blank lines into one
sed '/^$/{N;/^\n$/D}' file.txt

# prepend line numbers to each line
sed = file.txt | sed 'N;s/\n/\t/'

4. Data conversion

# CSV to TSV
sed 's/,/\t/g' data.csv

# convert date format
sed -E 's/([0-9]{4})-([0-9]{2})-([0-9]{2})/\3\/\2\/\1/g' dates.txt

# URL-encode spaces (simplistic)
echo "hello world" | sed 's/ /%20/g'
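The date rewrite can be checked on a single value before running it over a file:

```shell
# ISO date (YYYY-MM-DD) to DD/MM/YYYY
echo "2025-07-02" | sed -E 's/([0-9]{4})-([0-9]{2})-([0-9]{2})/\3\/\2\/\1/g'   # prints 02/07/2025
```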

5. Using a script file

# create a sed script
cat > script.sed << EOF
s/old/new/g
/^#/d
/^$/d
EOF

# run the script file
sed -f script.sed file.txt

Useful combinations

1. With pipes

# search, then replace
grep "pattern" file.txt | sed 's/old/new/g'

# process command output
ls -l | sed -n '2,$p' | awk '{print $9}'

2. Complex text processing

# extract XML/HTML tag content
sed -n 's/.*<title>\(.*\)<\/title>.*/\1/p' file.html

# extract one section of a config file
sed -n '/^\[database\]/,/^\[/p' config.ini | sed '/^\[/d'

These sed usages cover most day-to-day text-processing needs; mastering them makes batch editing and transformation efficient.

                  Sep 7, 2025

                  Subsections of Components

What Does cgroup Do?

cgroups offer far more than CPU limits; they can control many kinds of system resources:

1. Memory management (memory)

1.1 Memory limits

# set the memory usage limit
echo "100M" > /sys/fs/cgroup/memory/group1/memory.limit_in_bytes

# set the memory+swap limit
echo "200M" > /sys/fs/cgroup/memory/group1/memory.memsw.limit_in_bytes

1.2 Memory statistics and monitoring

# inspect memory usage
cat /sys/fs/cgroup/memory/group1/memory.usage_in_bytes
cat /sys/fs/cgroup/memory/group1/memory.stat

1.3 Memory pressure control

# set memory reclaim pressure
echo 100 > /sys/fs/cgroup/memory/group1/memory.swappiness

2. Block device I/O control (blkio)

2.1 I/O bandwidth limits

# limit read bandwidth to 1MB/s
echo "8:0 1048576" > /sys/fs/cgroup/blkio/group1/blkio.throttle.read_bps_device

# limit write bandwidth to 2MB/s
echo "8:0 2097152" > /sys/fs/cgroup/blkio/group1/blkio.throttle.write_bps_device

2.2 IOPS limits

# limit read operations per second
echo "8:0 100" > /sys/fs/cgroup/blkio/group1/blkio.throttle.read_iops_device

# limit write operations per second
echo "8:0 50" > /sys/fs/cgroup/blkio/group1/blkio.throttle.write_iops_device

2.3 I/O weight allocation

# set the I/O priority weight (100-1000)
echo 500 > /sys/fs/cgroup/blkio/group1/blkio.weight

3. Process control (pids)

3.1 Process count limits

# limit the maximum number of processes
echo 100 > /sys/fs/cgroup/pids/group1/pids.max

# check the current process count
cat /sys/fs/cgroup/pids/group1/pids.current

4. Device access control (devices)

4.1 Device permission management

# allow access to a device
echo "c 1:3 rwm" > /sys/fs/cgroup/devices/group1/devices.allow

# deny access to a device
echo "c 1:5 rwm" > /sys/fs/cgroup/devices/group1/devices.deny

5. Network control (net_cls, net_prio)

5.1 Network traffic classification

# set the traffic class ID
echo 0x100001 > /sys/fs/cgroup/net_cls/group1/net_cls.classid

5.2 Network priority

# set per-interface network priority
echo "eth0 5" > /sys/fs/cgroup/net_prio/group1/net_prio.ifpriomap

6. Mount point control (devices)

6.1 Filesystem access restriction

# restrict mount-namespace operations
echo 1 > /sys/fs/cgroup/group1/devices.deny

7. Unified-hierarchy cgroup v2 features

cgroup v2 provides a more unified management interface:
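Whether a host is on the legacy v1 hierarchies or the unified v2 hierarchy can be detected from the presence of cgroup.controllers, which only exists at the root of the v2 mount:

```shell
# cgroup.controllers exists only on the cgroup v2 unified hierarchy
if [ -f /sys/fs/cgroup/cgroup.controllers ]; then
    echo "cgroup v2"
else
    echo "cgroup v1"
fi
```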

                  7.1 资源保护

                  # 内存低水位线保护
                  echo "min 50M" > /sys/fs/cgroup/group1/memory.low
                  
                  # CPU 权重保护
                  echo 100 > /sys/fs/cgroup/group1/cpu.weight

                  7.2 I/O 控制

                  # I/O 权重
                  echo "default 100" > /sys/fs/cgroup/group1/io.weight
                  
                  # I/O 最大带宽
                  echo "8:0 rbps=1048576 wbps=2097152" > /sys/fs/cgroup/group1/io.max

8. Practical Application Scenarios

8.1 Container Resource Limits

# Resource limits on a Docker container
                  docker run -it \
                    --cpus="0.5" \
                    --memory="100m" \
                    --blkio-weight=500 \
                    --pids-limit=100 \
                    ubuntu:latest

8.2 systemd Service Limits

                  [Service]
                  MemoryMax=100M
                  IOWeight=500
                  TasksMax=100
                  DeviceAllow=/dev/null rw
                  DeviceAllow=/dev/zero rw
                  DeviceAllow=/dev/full rw

8.3 Kubernetes Resource Management

                  apiVersion: v1
                  kind: Pod
                  spec:
                    containers:
                    - name: app
                      resources:
                        limits:
                          cpu: "500m"
                          memory: "128Mi"
                          ephemeral-storage: "1Gi"
                        requests:
                          cpu: "250m" 
                          memory: "64Mi"

9. Monitoring and Statistics

9.1 Resource Usage Statistics

# Inspect cgroup resource usage
                  cat /sys/fs/cgroup/memory/group1/memory.stat
                  cat /sys/fs/cgroup/cpu/group1/cpu.stat
                  cat /sys/fs/cgroup/io/group1/io.stat

9.2 Pressure Stall Information

# Check memory pressure
                  cat /sys/fs/cgroup/memory/group1/memory.pressure

10. Advanced Features

10.1 Resource Delegation (cgroup v2)

# Allow child cgroups to manage specific controllers
                  echo "+memory +io" > /sys/fs/cgroup/group1/cgroup.subtree_control

10.2 Freezing Processes

# Suspend all processes in the cgroup
echo 1 > /sys/fs/cgroup/group1/cgroup.freeze

# Resume execution
                  echo 0 > /sys/fs/cgroup/group1/cgroup.freeze

These capabilities make cgroups the foundation of container technologies such as Docker and Kubernetes: they provide complete resource isolation, limiting, and accounting, and they are a core part of modern Linux resource management.
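As a small, unprivileged way to see cgroups in action (the write operations above all need root), every process can read its own cgroup membership from /proc/self/cgroup:

```shell
# Show which cgroup the current shell belongs to; needs no root.
# On a cgroup v2 host this prints one line of the form 0::/<path>
cat /proc/self/cgroup
```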

                  Mar 7, 2024

                  IPVS

What is IPVS?

IPVS (IP Virtual Server) is a layer-4 (transport layer) load balancer built into the Linux kernel, and the core component of the LVS (Linux Virtual Server) project.

Basic Concepts

• Operating layer: transport layer (TCP/UDP)
• Implementation: in kernel space, hence high performance
• Function: load-balances TCP/UDP requests across multiple real servers

IPVS Core Architecture

Client request
    ↓
Virtual Service - VIP:Port
    ↓
Load-balancing scheduling algorithm
    ↓
Real server pool (Real Servers)

Main Roles of IPVS

1. High-Performance Load Balancing

# IPVS can sustain hundreds of thousands of concurrent connections
# and generally outperforms iptables

2. Multiple Load-Balancing Algorithms

# List the available scheduling algorithms
grep -i ip_vs /lib/modules/$(uname -r)/modules.builtin

# Common algorithms:
rr      # Round Robin
wrr     # Weighted Round Robin
lc      # Least Connection
wlc     # Weighted Least Connection
sh      # Source Hashing
dh      # Destination Hashing
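To make the simplest scheduler concrete, here is a toy shell simulation of round robin (rr): request n goes to server n mod N. The server addresses are made up for illustration; real IPVS does this inside the kernel.

```shell
# Toy round-robin: request n -> server (n mod N)
servers=(10.0.0.1 10.0.0.2 10.0.0.3)
for n in 0 1 2 3 4 5; do
  echo "request $n -> ${servers[$((n % ${#servers[@]}))]}"
done
# request 0 -> 10.0.0.1, request 1 -> 10.0.0.2, ... wrapping around
```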

3. Multiple Forwarding Modes

NAT Mode (Network Address Translation)

# Both requests and responses pass through the load balancer
# Example configuration
ipvsadm -A -t 192.168.1.100:80 -s rr
ipvsadm -a -t 192.168.1.100:80 -r 10.244.1.10:80 -m
ipvsadm -a -t 192.168.1.100:80 -r 10.244.1.11:80 -m

DR Mode (Direct Routing)

# Responses return directly to the client, bypassing the load balancer
# High-performance mode
ipvsadm -A -t 192.168.1.100:80 -s rr
ipvsadm -a -t 192.168.1.100:80 -r 10.244.1.10:80 -g
ipvsadm -a -t 192.168.1.100:80 -r 10.244.1.11:80 -g

TUN Mode (IP Tunneling)

# Forward requests through an IP tunnel
                  ipvsadm -A -t 192.168.1.100:80 -s rr
                  ipvsadm -a -t 192.168.1.100:80 -r 10.244.1.10:80 -i
                  ipvsadm -a -t 192.168.1.100:80 -r 10.244.1.11:80 -i

IPVS in Kubernetes

Advantages of kube-proxy's IPVS Mode

# Performance comparison
iptables: O(n) linear rule traversal; slows down as rules grow
ipvs:     O(1) hash-table lookup; stays fast at scale

IPVS Configuration in Kubernetes

# Check whether kube-proxy runs in IPVS mode
kubectl -n kube-system get pods -l k8s-app=kube-proxy -o yaml | grep mode

# List the IPVS rules
                  ipvsadm -Ln

Core IPVS Features

1. Connection Scheduling

# Typical use cases per algorithm
rr      # General purpose; servers of similar capacity
wrr     # Servers of noticeably different capacity
lc      # Long-lived connections, e.g. databases
sh      # Session-affinity requirements

2. Health Checks

# IPVS does not perform health checks itself;
# pair it with keepalived or another health-check tool

3. Session Persistence

# Use source hashing for session persistence
                  ipvsadm -A -t 192.168.1.100:80 -s sh
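The effect of source hashing can be sketched in plain shell: hash the client address and take it modulo the backend count, so the same source always lands on the same backend. Here cksum stands in for the kernel's hash, and the function name and addresses are invented for the example.

```shell
# Toy source-hash: a given client IP always maps to the same backend index
pick_backend() {
  local n_backends=3 h
  h=$(printf '%s' "$1" | cksum | cut -d' ' -f1)
  echo $((h % n_backends))
}
pick_backend 192.168.0.7   # some index in 0..2
pick_backend 192.168.0.7   # the same index again: the session sticks
```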

IPVS Management Commands

Basic Operations

# Add a virtual service
ipvsadm -A -t|u|f <service-address> [-s scheduler]

# Add a real server
ipvsadm -a -t|u|f <service-address> -r <server-address> [-g|i|m] [-w weight]

# Example
ipvsadm -A -t 192.168.1.100:80 -s wlc
ipvsadm -a -t 192.168.1.100:80 -r 192.168.1.10:8080 -m -w 1
ipvsadm -a -t 192.168.1.100:80 -r 192.168.1.11:8080 -m -w 2

Monitoring and Statistics

# Connection statistics
ipvsadm -Ln --stats
ipvsadm -Ln --rate

# Current connections
ipvsadm -Lnc

# Timeout settings
                  ipvsadm -L --timeout

IPVS Compared with Related Technologies

IPVS vs iptables

| Feature     | IPVS                     | iptables                   |
|-------------|--------------------------|----------------------------|
| Performance | O(1) hash lookup         | O(n) linear rule traversal |
| Scale       | Supports many services   | Degrades as rules grow     |
| Purpose     | Dedicated load balancing | General-purpose firewall   |
| Algorithms  | Multiple schedulers      | Simple round robin only    |

IPVS vs Nginx

| Feature     | IPVS                 | Nginx                                  |
|-------------|----------------------|----------------------------------------|
| Layer       | Layer 4 (transport)  | Layer 7 (application)                  |
| Performance | In-kernel, faster    | User space, feature-rich               |
| Features    | Basic load balancing | Content routing, SSL termination, etc. |

Practical Use Cases

1. Kubernetes Service Proxying

# kube-proxy creates IPVS rules for every Service
ipvsadm -Ln
# Sample output:
TCP  10.96.0.1:443 rr
  -> 192.168.1.10:6443    Masq    1      0          0
TCP  10.96.0.10:53 rr
  -> 10.244.0.5:53        Masq    1      0          0

2. Highly Available Load Balancing

# Pair with keepalived for high availability:
# active/standby load balancers + IPVS

3. Database Read/Write Splitting

# Distribute database connections with IPVS
                  ipvsadm -A -t 192.168.1.100:3306 -s lc
                  ipvsadm -a -t 192.168.1.100:3306 -r 192.168.1.20:3306 -m
                  ipvsadm -a -t 192.168.1.100:3306 -r 192.168.1.21:3306 -m

Summary

Main uses of IPVS:

1. High-performance load balancing - in-kernel implementation with high throughput
2. Multiple scheduling algorithms - fits different workloads
3. Multiple forwarding modes - NAT/DR/TUN cover different network topologies
4. Large-scale clusters - well suited to cloud-native and microservice architectures
5. Kubernetes integration - the kube-proxy backend for efficient Service proxying

In Kubernetes, IPVS mode clearly outperforms iptables mode at large service counts and is the recommended load-balancing choice for production.

                  Mar 7, 2024

                  Subsections of Interface

POSIX Standard

                  Mar 7, 2024

                  Subsections of Mirror

                  Source Repo

                  Fedora

                  • Fedora 40 located in /etc/yum.repos.d/
                    Fedora Mirror
                    [updates]
                    name=Fedora $releasever - $basearch - Updates
                    #baseurl=http://download.example/pub/fedora/linux/updates/$releasever/Everything/$basearch/
                    metalink=https://mirrors.fedoraproject.org/metalink?repo=updates-released-f$releasever&arch=$basearch
                    enabled=1
                    countme=1
                    repo_gpgcheck=0
                    type=rpm
                    gpgcheck=1
                    metadata_expire=6h
                    gpgkey=file:///etc/pki/rpm-gpg/RPM-GPG-KEY-fedora-$releasever-$basearch
                    skip_if_unavailable=False
                    
                    [updates-debuginfo]
                    name=Fedora $releasever - $basearch - Updates - Debug
                    #baseurl=http://download.example/pub/fedora/linux/updates/$releasever/Everything/$basearch/debug/
                    metalink=https://mirrors.fedoraproject.org/metalink?repo=updates-released-debug-f$releasever&arch=$basearch
                    enabled=0
                    repo_gpgcheck=0
                    type=rpm
                    gpgcheck=1
                    metadata_expire=6h
                    gpgkey=file:///etc/pki/rpm-gpg/RPM-GPG-KEY-fedora-$releasever-$basearch
                    skip_if_unavailable=False
                    
                    [updates-source]
                    name=Fedora $releasever - Updates Source
                    #baseurl=http://download.example/pub/fedora/linux/updates/$releasever/Everything/SRPMS/
                    metalink=https://mirrors.fedoraproject.org/metalink?repo=updates-released-source-f$releasever&arch=$basearch
                    enabled=0
                    repo_gpgcheck=0
                    type=rpm
                    gpgcheck=1
                    metadata_expire=6h
                    gpgkey=file:///etc/pki/rpm-gpg/RPM-GPG-KEY-fedora-$releasever-$basearch
                    skip_if_unavailable=False

                  CentOS

                  • CentOS 7 located in /etc/yum.repos.d/

                    CentOS Mirror
                    [base]
                    name=CentOS-$releasever
                    #mirrorlist=http://mirrorlist.centos.org/?release=$releasever&arch=$basearch&repo=os
                    baseurl=http://mirror.centos.org/centos/$releasever/os/$basearch/
                    gpgcheck=1
                    gpgkey=http://mirror.centos.org/centos/RPM-GPG-KEY-CentOS-7
                    
                    [extras]
                    name=CentOS-$releasever
                    #mirrorlist=http://mirrorlist.centos.org/?release=$releasever&arch=$basearch&repo=extras
                    baseurl=http://mirror.centos.org/centos/$releasever/extras/$basearch/
                    gpgcheck=1
                    gpgkey=http://mirror.centos.org/centos/RPM-GPG-KEY-CentOS-7
                    Aliyun Mirror
                    [base]
                    name=CentOS-$releasever - Base - mirrors.aliyun.com
                    failovermethod=priority
                    baseurl=http://mirrors.aliyun.com/centos/$releasever/os/$basearch/
                    gpgcheck=1
                    gpgkey=http://mirrors.aliyun.com/centos/RPM-GPG-KEY-CentOS-7
                    
                    [extras]
                    name=CentOS-$releasever - Extras - mirrors.aliyun.com
                    failovermethod=priority
                    baseurl=http://mirrors.aliyun.com/centos/$releasever/extras/$basearch/
                    gpgcheck=1
                    gpgkey=http://mirrors.aliyun.com/centos/RPM-GPG-KEY-CentOS-7
                    163 Mirror
                    [base]
                    name=CentOS-$releasever - Base - 163.com
                    #mirrorlist=http://mirrorlist.centos.org/?release=$releasever&arch=$basearch&repo=os
                    baseurl=http://mirrors.163.com/centos/$releasever/os/$basearch/
                    gpgcheck=1
                    gpgkey=http://mirrors.163.com/centos/RPM-GPG-KEY-CentOS-7
                    
                    [extras]
                    name=CentOS-$releasever - Extras - 163.com
                    #mirrorlist=http://mirrorlist.centos.org/?release=$releasever&arch=$basearch&repo=extras
                    baseurl=http://mirrors.163.com/centos/$releasever/extras/$basearch/
                    gpgcheck=1
                    gpgkey=http://mirrors.163.com/centos/RPM-GPG-KEY-CentOS-7

                  • CentOS 8 stream located in /etc/yum.repos.d/

                    CentOS Mirror
                    [baseos]
                    name=CentOS Linux - BaseOS
                    #mirrorlist=http://mirrorlist.centos.org/?release=$releasever&arch=$basearch&repo=BaseOS&infra=$infra
                    baseurl=http://mirror.centos.org/centos/8-stream/BaseOS/$basearch/os/
                    gpgcheck=1
                    enabled=1
                    gpgkey=file:///etc/pki/rpm-gpg/RPM-GPG-KEY-centosofficial
                    
                    [extras]
                    name=CentOS Linux - Extras
                    #mirrorlist=http://mirrorlist.centos.org/?release=$releasever&arch=$basearch&repo=extras&infra=$infra
                    baseurl=http://mirror.centos.org/centos/8-stream/extras/$basearch/os/
                    gpgcheck=1
                    enabled=1
                    gpgkey=file:///etc/pki/rpm-gpg/RPM-GPG-KEY-centosofficial
                    
                    [appstream]
                    name=CentOS Linux - AppStream
                    #mirrorlist=http://mirrorlist.centos.org/?release=$releasever&arch=$basearch&repo=AppStream&infra=$infra
                    baseurl=http://mirror.centos.org/centos/8-stream/AppStream/$basearch/os/
                    gpgcheck=1
                    enabled=1
                    gpgkey=file:///etc/pki/rpm-gpg/RPM-GPG-KEY-centosofficial
                    Aliyun Mirror
                    [base]
                    name=CentOS-8.5.2111 - Base - mirrors.aliyun.com
                    baseurl=http://mirrors.aliyun.com/centos-vault/8.5.2111/BaseOS/$basearch/os/
                    gpgcheck=0
                    gpgkey=http://mirrors.aliyun.com/centos/RPM-GPG-KEY-CentOS-Official
                    
                    [extras]
                    name=CentOS-8.5.2111 - Extras - mirrors.aliyun.com
                    baseurl=http://mirrors.aliyun.com/centos-vault/8.5.2111/extras/$basearch/os/
                    gpgcheck=0
                    gpgkey=http://mirrors.aliyun.com/centos/RPM-GPG-KEY-CentOS-Official
                    
                    [AppStream]
                    name=CentOS-8.5.2111 - AppStream - mirrors.aliyun.com
                    baseurl=http://mirrors.aliyun.com/centos-vault/8.5.2111/AppStream/$basearch/os/
                    gpgcheck=0
                    gpgkey=http://mirrors.aliyun.com/centos/RPM-GPG-KEY-CentOS-Official

                  Ubuntu

                  • Ubuntu 18.04 located in /etc/apt/sources.list

                    Ubuntu Mirror
                    deb http://archive.ubuntu.com/ubuntu/ bionic main restricted
                    deb http://archive.ubuntu.com/ubuntu/ bionic-updates main restricted
                    deb http://archive.ubuntu.com/ubuntu/ bionic-backports main restricted universe multiverse
                    deb http://security.ubuntu.com/ubuntu/ bionic-security main restricted

                  • Ubuntu 20.04 located in /etc/apt/sources.list

                    Ubuntu Mirror
                    deb http://archive.ubuntu.com/ubuntu/ focal main restricted universe multiverse
                    deb http://archive.ubuntu.com/ubuntu/ focal-updates main restricted universe multiverse
                    deb http://archive.ubuntu.com/ubuntu/ focal-backports main restricted universe multiverse
                    deb http://security.ubuntu.com/ubuntu/ focal-security main restricted

                  • Ubuntu 22.04 located in /etc/apt/sources.list

                    Ubuntu Mirror
                    deb http://archive.ubuntu.com/ubuntu/ jammy main restricted
                    deb http://archive.ubuntu.com/ubuntu/ jammy-updates main restricted
                    deb http://archive.ubuntu.com/ubuntu/ jammy-backports main restricted universe multiverse
                    deb http://security.ubuntu.com/ubuntu/ jammy-security main restricted

                  Debian

                  • Debian Buster located in /etc/apt/sources.list

                    Debian Mirror
                    deb http://deb.debian.org/debian buster main
                    deb http://security.debian.org/debian-security buster/updates main
                    deb http://deb.debian.org/debian buster-updates main
                    Aliyun Mirror
                    deb http://mirrors.aliyun.com/debian/ buster main non-free contrib
                    deb http://mirrors.aliyun.com/debian-security buster/updates main
                    deb http://mirrors.aliyun.com/debian/ buster-updates main non-free contrib
                    deb http://mirrors.aliyun.com/debian/ buster-backports main non-free contrib
                    Tuna Mirror
                    deb http://mirrors.tuna.tsinghua.edu.cn/debian/ buster main contrib non-free
                    deb http://mirrors.tuna.tsinghua.edu.cn/debian/ buster-updates main contrib non-free
                    deb http://mirrors.tuna.tsinghua.edu.cn/debian/ buster-backports main contrib non-free
                    deb http://security.debian.org/debian-security buster/updates main contrib non-free

                  • Debian Bullseye located in /etc/apt/sources.list

                    Debian Mirror
                    deb http://deb.debian.org/debian bullseye main
                    deb http://security.debian.org/debian-security bullseye-security main
                    deb http://deb.debian.org/debian bullseye-updates main
Aliyun Mirror
                    deb http://mirrors.aliyun.com/debian/ bullseye main non-free contrib
                    deb http://mirrors.aliyun.com/debian-security/ bullseye-security main
                    deb http://mirrors.aliyun.com/debian/ bullseye-updates main non-free contrib
                    deb http://mirrors.aliyun.com/debian/ bullseye-backports main non-free contrib
                    Tuna Mirror
                    deb http://mirrors.tuna.tsinghua.edu.cn/debian/ bullseye main contrib non-free
                    deb http://mirrors.tuna.tsinghua.edu.cn/debian/ bullseye-updates main contrib non-free
                    deb http://mirrors.tuna.tsinghua.edu.cn/debian/ bullseye-backports main contrib non-free
                    deb http://security.debian.org/debian-security bullseye-security main contrib non-free

                  Anolis

                  • Anolis 3 located in /etc/yum.repos.d/

Aliyun Mirror
                    [alinux3-module]
                    name=alinux3-module
                    baseurl=http://mirrors.aliyun.com/alinux/3/module/$basearch/
                    gpgkey=http://mirrors.aliyun.com/alinux/3/RPM-GPG-KEY-ALINUX-3
                    enabled=1
                    gpgcheck=1
                    
                    [alinux3-os]
                    name=alinux3-os
                    baseurl=http://mirrors.aliyun.com/alinux/3/os/$basearch/
                    gpgkey=http://mirrors.aliyun.com/alinux/3/RPM-GPG-KEY-ALINUX-3
                    enabled=1
                    gpgcheck=1
                    
                    [alinux3-plus]
                    name=alinux3-plus
                    baseurl=http://mirrors.aliyun.com/alinux/3/plus/$basearch/
                    gpgkey=http://mirrors.aliyun.com/alinux/3/RPM-GPG-KEY-ALINUX-3
                    enabled=1
                    gpgcheck=1
                    
                    [alinux3-powertools]
                    name=alinux3-powertools
                    baseurl=http://mirrors.aliyun.com/alinux/3/powertools/$basearch/
                    gpgkey=http://mirrors.aliyun.com/alinux/3/RPM-GPG-KEY-ALINUX-3
                    enabled=1
                    gpgcheck=1
                    
                    [alinux3-updates]
                    name=alinux3-updates
                    baseurl=http://mirrors.aliyun.com/alinux/3/updates/$basearch/
                    gpgkey=http://mirrors.aliyun.com/alinux/3/RPM-GPG-KEY-ALINUX-3
                    enabled=1
                    gpgcheck=1
                    
                    [epel]
                    name=Extra Packages for Enterprise Linux 8 - $basearch
                    baseurl=http://mirrors.aliyun.com/epel/8/Everything/$basearch
                    failovermethod=priority
                    enabled=1
                    gpgcheck=1
                    gpgkey=http://mirrors.aliyun.com/epel/RPM-GPG-KEY-EPEL-8
                    
                    [epel-module]
                    name=Extra Packages for Enterprise Linux 8 - $basearch
                    baseurl=http://mirrors.aliyun.com/epel/8/Modular/$basearch
                    failovermethod=priority
                    enabled=0
                    gpgcheck=1
                    gpgkey=http://mirrors.aliyun.com/epel/RPM-GPG-KEY-EPEL-8

                  • Anolis 2 located in /etc/yum.repos.d/

Aliyun Mirror


                  Refresh Repo

# Fedora / RHEL 8+ (dnf)
dnf clean all && dnf makecache
# CentOS 7 (yum)
yum clean all && yum makecache
# Debian / Ubuntu (apt)
apt-get clean
                  Mar 14, 2024

                  Subsections of Scripts

                  Create Systemd Service

1. Create the systemd unit file
vim /etc/systemd/system/your-service-name.service
2. Add the following content to the file

Run a script

[Unit]
Description=Your Service Description
# After=: start this unit after the listed units
After=network.target

[Service]
# Type=simple: run the command directly | forking: the service forks |
# oneshot: run once and exit | notify: run and signal readiness | exec: run a command
# (systemd only honors whole-line comments, so the notes sit on their own lines)
Type=simple
User=root
ExecStart=/bin/bash -c "your-bash-command-here"
Restart=always
RestartSec=5

[Install]
# multi-user.target: the normal multi-user boot target
WantedBy=multi-user.target

Run a program

                  [Unit]
                  Description=Backup Service
                  After=network.target
                  
                  [Service]
                  Type=simple
                  User=root
                  ExecStart=/bin/bash -c "tar -czf /backup/backup-$(date +%Y%m%d).tar.gz /home/user/data"
                  Restart=on-failure
                  
                  [Install]
                  WantedBy=multi-user.target
3. Manage the service
# Reload the systemd configuration
sudo systemctl daemon-reload

# Start the service
sudo systemctl start your-service-name

# Enable start at boot
sudo systemctl enable your-service-name

# Check the service status
sudo systemctl status your-service-name

# Stop the service
sudo systemctl stop your-service-name

# Disable start at boot
sudo systemctl disable your-service-name

# Follow the service logs
sudo journalctl -u your-service-name -f
                  Mar 14, 2025

                  Disable Service

Disable the firewalld, dnsmasq, NetworkManager, SELinux, and swap services

                  systemctl disable --now firewalld 
                  systemctl disable --now dnsmasq
                  systemctl disable --now NetworkManager
                  
                  setenforce 0
                  sed -i 's#SELINUX=permissive#SELINUX=disabled#g' /etc/sysconfig/selinux
                  sed -i 's#SELINUX=permissive#SELINUX=disabled#g' /etc/selinux/config
                  reboot
                  getenforce
                  
                  
                  swapoff -a && sysctl -w vm.swappiness=0
                  sed -ri '/^[^#]*swap/s@^@#@' /etc/fstab
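The SELinux sed substitution above can be rehearsed on a scratch file before touching /etc, to confirm it does what is expected:

```shell
# Dry-run the SELinux sed edit on a temporary file
tmp=$(mktemp)
echo "SELINUX=permissive" > "$tmp"
sed -i 's#SELINUX=permissive#SELINUX=disabled#g' "$tmp"
cat "$tmp"    # SELINUX=disabled
rm -f "$tmp"
```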
                  Mar 14, 2024

                  Free Disk Space

                  Cleanup

1. Find the biggest files
# ncdu provides an interactive disk-usage browser
dnf install ncdu

# Show the 10 largest files/directories under the current directory
du -ah . | sort -rh | head -n 10

# List files larger than 100M under the home directory
find ~ -type f -size +100M -exec ls -lh {} \; | awk '{ print $9 ": " $5 }'
2. Clean caches
                  rm -rf ~/.cache/*
                  sudo rm -rf /tmp/*
                  sudo rm -rf /var/tmp/*
3. Clean container images
# Remove all stopped containers
podman container prune -f

# Remove dangling images (not referenced by any container)
podman image prune

# More aggressive: remove every image not used by an existing container
podman image prune -a

# Clear the build cache
podman builder prune

# Most thorough: remove stopped containers, unused networks,
# dangling images, and the build cache
podman system prune
podman system prune -a # even more thorough: removes all unused images, not just dangling ones
                  Mar 14, 2024

                  Login Without Pwd

Copy the SSH public key to the other nodes

                  yum install sshpass -y
                  mkdir -p /extend/shell
                  
                  cat >>/extend/shell/distribute_pub.sh<< EOF
                  #!/bin/bash
                  ROOT_PASS=root123
                  ssh-keygen -t rsa -f ~/.ssh/id_rsa -P ''
                  for ip in 101 102 103 
                  do
                  sshpass -p\$ROOT_PASS ssh-copy-id -o StrictHostKeyChecking=no 192.168.29.\$ip
                  done
                  EOF
                  
                  cd /extend/shell
                  chmod +x distribute_pub.sh
                  
                  ./distribute_pub.sh
                  Mar 14, 2024

                  Set Http Proxy

                  [Optional] Install Proxy Server

                  go and check http://port.72602.online/ops/hugo/index.html

                  Set Http Proxy

                  export https_proxy=http://47.110.67.161:30890
                  export http_proxy=http://47.110.67.161:30890

                  Use Proxy in Pod

                  Mar 14, 2024

                  Subsections of 🌐Language

                  Subsections of ♨️JAVA

                  Subsections of JVM Related

                  AOT or JIT

JDK 9 introduced a new compilation mode, AOT (Ahead-of-Time Compilation). Unlike JIT, it compiles the program to machine code before it ever runs, i.e. static compilation (the model used by C, C++, Rust, Go, and similar languages). AOT avoids JIT warm-up and its associated overhead, so it speeds up Java program startup and eliminates the long warm-up phase. It can also reduce memory usage and harden a Java program (AOT-compiled code is harder to decompile and modify), which makes it a particularly good fit for cloud-native scenarios.

In short, AOT's main advantages are startup time, memory footprint, and artifact size; JIT's main advantage is a higher peak throughput, which can lower worst-case request latency.

https://cn.dubbo.apache.org/zh-cn/blog/2023/06/28/%e8%b5%b0%e5%90%91-native-%e5%8c%96springdubbo-aot-%e6%8a%80%e6%9c%af%e7%a4%ba%e4%be%8b%e4%b8%8e%e5%8e%9f%e7%90%86%e8%ae%b2%e8%a7%a3/

https://mp.weixin.qq.com/s/4haTyXUmh8m-dBQaEzwDJw

If AOT has so many advantages, why not use it for everything?

As the comparison above shows, JIT and AOT each have strengths; AOT is simply the better fit for today's cloud-native, microservice-oriented scenarios. Beyond that, AOT compilation cannot support some of Java's dynamic features, such as reflection, dynamic proxies, dynamic class loading, and JNI (Java Native Interface). Many frameworks and libraries (e.g. Spring, CGLIB) depend on exactly these features, so with AOT alone they either cannot be used or need dedicated adaptation work. For example, CGLIB's dynamic proxies rely on ASM, which, roughly speaking, generates and loads modified bytecode directly in memory at runtime.

                  Mar 7, 2024

Volatile

volatile is a lightweight synchronization mechanism provided by the Java Virtual Machine, with three key properties:

Guarantees visibility

Does not guarantee atomicity

Prevents instruction reordering

                  Mar 7, 2024

                  🐍Python

                    Mar 7, 2024

                    🐹Go

                      Mar 7, 2024

                      Subsections of Design Pattern

                      Observers

                      Mar 7, 2024

                      Subsections of Web Pattern

                      HTTP Code

1xx - Informational (provisional responses)

The request was received and processing continues. Rarely seen in a browser.

• 100 Continue: the client should go ahead and send the rest of the request. Typically used before POSTing or PUTting a large body, to ask whether the server is willing to accept it.
• 101 Switching Protocols: the client asked to switch protocols (e.g. to WebSocket) and the server agreed.

2xx - Success (request succeeded)

The request was successfully received, understood, and processed by the server.

• 200 OK: the most common success code. The request succeeded and the response body contains the requested data (an HTML page, JSON, etc.).
• 201 Created: a new resource was successfully created on the server, usually after a POST or PUT. The Location response header normally contains the new resource's URL.
• 202 Accepted: the request was accepted but has not been processed yet. Suited to asynchronous tasks, e.g. "the request is queued and being processed".
• 204 No Content: the request succeeded but the response carries no body. Common after a successful DELETE, or for AJAX requests where the frontend only needs to know the operation succeeded.

3xx - Redirection (further action required)

The client must take additional action to complete the request, usually by following a redirect.

• 301 Moved Permanently: the resource has moved permanently to a new URL. Search engines update their links to the new address, and browsers cache the redirect.
• 302 Found: a temporary redirect; the resource is temporarily served from another URL. Search engines keep the old link. This is the most common redirect type; the spec says the method should stay the same, but in practice browsers often switch to GET.
• 304 Not Modified: the resource has not changed; used for cache control. When the client holds a cached copy and asks via a conditional header (such as If-Modified-Since) whether the resource was updated, the server returns 304 if it was not, telling the client to use its cache. This saves bandwidth.
• 307 Temporary Redirect: a stricter 302. Similar, but the client must not change the original request method (a POST must remain a POST). More standards-conformant than 302.

4xx - Client Errors (problem with the request)

The client appears to be at fault, and the server cannot process the request.

• 400 Bad Request: the server cannot understand the request because its syntax is invalid - like a malformed sentence the server cannot parse.
• 401 Unauthorized: the request requires user authentication, typically a login or a token. Despite the name, this really means "unauthenticated", not "unauthorized".
• 403 Forbidden: the server understood the request but refuses to fulfill it. Unlike 401, authenticating will not help (e.g. a regular user opening an admin page).
• 404 Not Found: the best-known error code. The server cannot find the requested resource - the URL may be wrong, or the resource was deleted.
• 405 Method Not Allowed: the method in the request line (GET, POST, etc.) cannot be used on this resource, e.g. sending a POST to a URL that only accepts GET.
• 408 Request Timeout: the server waited too long for the client to send the request.
• 409 Conflict: the request conflicts with the server's current state. Common with PUT requests (e.g. a version conflict when modifying a file).
• 429 Too Many Requests: the client sent too many requests within the given time window (rate limiting).

5xx - Server Errors (server failed while processing)

The server hit an error or an internal fault while handling the request.

• 500 Internal Server Error: the most generic server error. The server encountered an unexpected condition that prevented it from completing the request - usually an uncaught exception in backend code.
• 502 Bad Gateway: acting as a gateway or proxy, the server received an invalid response from the upstream server. Common when the application server behind Nginx (such as PHP-FPM) has crashed or was never started.
• 503 Service Unavailable: the server currently cannot handle the request (overload or maintenance downtime). Usually temporary; the response may carry a Retry-After header telling the client when to retry.
• 504 Gateway Timeout: acting as a gateway or proxy, the server did not receive a timely response from upstream. Common with network latency or a slow upstream service.

Quick Reference

| Code | Category     | Meaning               | Typical scenario                                   |
|------|--------------|-----------------------|----------------------------------------------------|
| 200  | Success      | OK                    | Page or data fetched normally                      |
| 201  | Success      | Created               | New user or article created                        |
| 204  | Success      | No Content            | Delete succeeded, or AJAX needing no response body |
| 301  | Redirection  | Moved Permanently     | Site revamp; old links permanently forwarded       |
| 302  | Redirection  | Found (temporary)     | Redirect back to the home page after login         |
| 304  | Redirection  | Not Modified          | Browser cache reused, saving bandwidth             |
| 400  | Client error | Bad Request           | Malformed request parameters                       |
| 401  | Client error | Unauthorized          | Login required                                     |
| 403  | Client error | Forbidden             | Insufficient permissions                           |
| 404  | Client error | Not Found             | The requested URL does not exist                   |
| 429  | Client error | Too Many Requests     | API rate limit exceeded                            |
| 500  | Server error | Internal Server Error | Backend bug, database connection failure           |
| 502  | Server error | Bad Gateway           | Nginx cannot reach the backend service             |
| 503  | Server error | Service Unavailable   | Server under maintenance or overloaded             |
| 504  | Server error | Gateway Timeout       | Backend responds too slowly                        |
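The five classes above are determined entirely by the first digit of the code, which a small shell helper can illustrate (the function name is made up for this example):

```shell
# Classify an HTTP status code by its leading digit
http_class() {
  case "$1" in
    1??) echo "informational" ;;
    2??) echo "success" ;;
    3??) echo "redirection" ;;
    4??) echo "client error" ;;
    5??) echo "server error" ;;
    *)   echo "unknown" ;;
  esac
}
http_class 204   # success
http_class 502   # server error
```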


                      Mar 7, 2024

                      Subsections of 🥮Middleware

                      Subsections of 🐿️Apache Flink

                      Subsections of On K8s Operator

Job Privileges

                      Template

                      apiVersion: rbac.authorization.k8s.io/v1
                      kind: Role
                      metadata:
                        namespace: flink
                        name: flink-deployment-manager
                      rules:
                      - apiGroups: 
                        - flink.apache.org
                        resources: 
                        - flinkdeployments
                        verbs: 
                        - 'get'
                        - 'list'
                        - 'create'
                        - 'update'
                      ---
                      apiVersion: rbac.authorization.k8s.io/v1
                      kind: RoleBinding
                      metadata:
                        name: flink-deployment-manager-binding
                        namespace: flink
                      subjects:
                      - kind: User
                        name: "277293711358271379"  
                        apiGroup: rbac.authorization.k8s.io
                      roleRef:
                        kind: Role
                        name: flink-deployment-manager
                        apiGroup: rbac.authorization.k8s.io
                      Jul 7, 2024

                      OSS Template

                      Template

                      apiVersion: "flink.apache.org/v1beta1"
                      kind: "FlinkDeployment"
                      metadata:
                        name: "financial-job"
                      spec:
                        image: "cr.registry.res.cloud.wuxi-yqgcy.cn/mirror/financial-topic:1.5-oss"
                        flinkVersion: "v1_17"
                        flinkConfiguration:
                          taskmanager.numberOfTaskSlots: "8"
                          fs.oss.endpoint: http://ay-test.oss-cn-jswx-xuelang-d01-a.ops.cloud.wuxi-yqgcy.cn/
                          fs.oss.accessKeyId: 4gqOVOfQqCsCUwaC
                          fs.oss.accessKeySecret: xxx
                        ingress:
                          template: "flink.k8s.io/{{namespace}}/{{name}}(/|$)(.*)"
                          className: "nginx"
                          annotations:
                            cert-manager.io/cluster-issuer: "self-signed-ca-issuer"
                            nginx.ingress.kubernetes.io/rewrite-target: "/$2"
                        serviceAccount: "flink"
                        podTemplate:
                          apiVersion: "v1"
                          kind: "Pod"
                          metadata:
                            name: "financial-job"
                          spec:
                            containers:
                              - name: "flink-main-container"
                                env:
                                  - name: ENABLE_BUILT_IN_PLUGINS
                                    value: flink-oss-fs-hadoop-1.17.2.jar
                        jobManager:
                          resource:
                            memory: "2048m"
                            cpu: 1
                        taskManager:
                          resource:
                            memory: "2048m"
                            cpu: 1
                        job:
                          jarURI: "local:///app/application.jar"
                          parallelism: 1
                          upgradeMode: "stateless"
                      Apr 7, 2024

                      S3 Template

                      Template

                      apiVersion: "flink.apache.org/v1beta1"
                      kind: "FlinkDeployment"
                      metadata:
                        name: "financial-job"
                      spec:
                        image: "cr.registry.res.cloud.wuxi-yqgcy.cn/mirror/financial-topic:1.5"
                        flinkVersion: "v1_17"
                        flinkConfiguration:
                          taskmanager.numberOfTaskSlots: "8"
                          s3a.endpoint: http://172.27.253.89:9000
                          s3a.access-key: minioadmin
                          s3a.secret-key: minioadmin
                        ingress:
                          template: "flink.k8s.io/{{namespace}}/{{name}}(/|$)(.*)"
                          className: "nginx"
                          annotations:
                            cert-manager.io/cluster-issuer: "self-signed-ca-issuer"
                            nginx.ingress.kubernetes.io/rewrite-target: "/$2"
                        serviceAccount: "flink"
                        podTemplate:
                          apiVersion: "v1"
                          kind: "Pod"
                          metadata:
                            name: "financial-job"
                          spec:
                            containers:
                              - name: "flink-main-container"
                                env:
                                  - name: ENABLE_BUILT_IN_PLUGINS
                                    value: flink-s3-fs-hadoop-1.17.2.jar
                        jobManager:
                          resource:
                            memory: "2048m"
                            cpu: 1
                        taskManager:
                          resource:
                            memory: "2048m"
                            cpu: 1
                        job:
                          jarURI: "local:///app/application.jar"
                          parallelism: 1
                          upgradeMode: "stateless"
                      Apr 7, 2024

                      Subsections of CDC

                      Mysql CDC

                      More often than not, you can get a minimal working example straight from CDC Connectors. But there are still a few unavoidable problems people need to google before using it.

                      preliminary

                      Flink: 1.17 JDK: 11

                      Flink CDC version mapping

                      | Flink CDC Version | Flink Version |
                      |-------------------|---------------|
                      | 1.0.0 | 1.11.* |
                      | 1.1.0 | 1.11.* |
                      | 1.2.0 | 1.12.* |
                      | 1.3.0 | 1.12.* |
                      | 1.4.0 | 1.13.* |
                      | 2.0.* | 1.13.* |
                      | 2.1.* | 1.13.* |
                      | 2.2.* | 1.13.*, 1.14.* |
                      | 2.3.* | 1.13.*, 1.14.*, 1.15.* |
                      | 2.4.* | 1.13.*, 1.14.*, 1.15.* |
                      | 3.0.* | 1.14.*, 1.15.*, 1.16.* |
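The mapping above can be encoded as a small lookup so a build script can fail fast on an unsupported pairing. A sketch (the table data comes from above; the `is_compatible` helper and its key scheme are my own, not part of any Flink API):

```python
# Supported Flink minor versions per Flink CDC release line (from the table above).
CDC_COMPAT = {
    "1.0.0": ["1.11"], "1.1.0": ["1.11"],
    "1.2.0": ["1.12"], "1.3.0": ["1.12"],
    "1.4.0": ["1.13"],
    "2.0": ["1.13"], "2.1": ["1.13"],
    "2.2": ["1.13", "1.14"],
    "2.3": ["1.13", "1.14", "1.15"],
    "2.4": ["1.13", "1.14", "1.15"],
    "3.0": ["1.14", "1.15", "1.16"],
}

def is_compatible(cdc_version: str, flink_version: str) -> bool:
    """Check whether a Flink CDC release officially lists a given Flink minor version."""
    # 1.x CDC releases are keyed by full version, 2.x/3.x by "major.minor" line.
    cdc_line = cdc_version if cdc_version.startswith("1.") else ".".join(cdc_version.split(".")[:2])
    flink_minor = ".".join(flink_version.split(".")[:2])
    return flink_minor in CDC_COMPAT.get(cdc_line, [])

print(is_compatible("2.3.0", "1.14.6"))  # True
print(is_compatible("2.4.0", "1.17.2"))  # False
```

Note that `2.4.*` with Flink 1.17 (the combination used below) is not in the official matrix, which is exactly why the extra dependencies in the next section are needed.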

                      usage for DataStream API

                      Importing com.ververica:flink-connector-mysql-cdc alone is not enough.

                      implementation("com.ververica:flink-connector-mysql-cdc:2.4.0")
                      
                      // you also need the following dependencies
                      implementation("org.apache.flink:flink-shaded-guava:30.1.1-jre-16.1")
                      implementation("org.apache.flink:flink-connector-base:1.17.1")
                      implementation("org.apache.flink:flink-table-planner_2.12:1.17.1")
                      <dependency>
                        <groupId>com.ververica</groupId>
                        <!-- add the dependency matching your database -->
                        <artifactId>flink-connector-mysql-cdc</artifactId>
                        <!-- The dependency is available only for stable releases, SNAPSHOT dependencies need to be built based on master or release- branches by yourself. -->
                        <version>2.4.0</version>
                      </dependency>
                      
                      <!-- https://mvnrepository.com/artifact/org.apache.flink/flink-shaded-guava -->
                      <dependency>
                        <groupId>org.apache.flink</groupId>
                        <artifactId>flink-shaded-guava</artifactId>
                        <version>30.1.1-jre-16.1</version>
                      </dependency>
                      
                      <!-- https://mvnrepository.com/artifact/org.apache.flink/flink-connector-base -->
                      <dependency>
                        <groupId>org.apache.flink</groupId>
                        <artifactId>flink-connector-base</artifactId>
                        <version>1.17.1</version>
                      </dependency>
                      
                      <!-- https://mvnrepository.com/artifact/org.apache.flink/flink-table-planner -->
                      <dependency>
                        <groupId>org.apache.flink</groupId>
                        <artifactId>flink-table-planner_2.12</artifactId>
                        <version>1.17.1</version>
                      </dependency>

                      Example Code

                      MySqlSource<String> mySqlSource =
                          MySqlSource.<String>builder()
                              .hostname("192.168.56.107")
                              .port(3306)
                              .databaseList("test") // set captured database
                              .tableList("test.table_a") // set captured table
                              .username("root")
                              .password("mysql")
                              .deserializer(
                                  new JsonDebeziumDeserializationSchema()) // converts SourceRecord to JSON String
                              .serverTimeZone("UTC")
                              .build();
                      
                      StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
                      
                      // enable checkpoint
                      env.enableCheckpointing(3000);
                      
                      env.fromSource(mySqlSource, WatermarkStrategy.noWatermarks(), "MySQL Source")
                          // set 4 parallel source tasks
                          .setParallelism(4)
                          .print()
                          .setParallelism(1); // use parallelism 1 for sink to keep message ordering
                      
                      env.execute("Print MySQL Snapshot + Binlog");

                      usage for table/SQL API

                      Mar 7, 2024

                      Connector

                      Mar 7, 2024

                      🪀Install Shit

                      Aug 7, 2024

                      Subsections of 🪀Install Shit

                      Subsections of Application

                      Datahub

                      Preliminary

                      • Kubernetes has installed, if not check 🔗link
                      • argoCD has installed, if not check 🔗link
                      • Elasticsearch has installed, if not check 🔗link
                      • MariaDB has installed, if not check 🔗link
                      • Kafka has installed, if not check 🔗link

                      Steps

                      1. prepare datahub credentials secret

                      kubectl -n application \
                          create secret generic datahub-credentials \
                          --from-literal=mysql-root-password="$(kubectl get secret mariadb-credentials --namespace database -o jsonpath='{.data.mariadb-root-password}' | base64 -d)"
                      kubectl -n application \
                          create secret generic datahub-credentials \
                          --from-literal=mysql-root-password="$(kubectl get secret mariadb-credentials --namespace database -o jsonpath='{.data.mariadb-root-password}' | base64 -d)" \
                          --from-literal=security.protocol="SASL_PLAINTEXT" \
                          --from-literal=sasl.mechanism="SCRAM-SHA-256" \
                          --from-literal=sasl.jaas.config="org.apache.kafka.common.security.scram.ScramLoginModule required username=\"user1\" password=\"$(kubectl get secret kafka-user-passwords --namespace database -o jsonpath='{.data.client-passwords}' | base64 -d | cut -d , -f 1)\";"
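The `sasl.jaas.config` value above is just a string assembled from the first Kafka client password in the `kafka-user-passwords` secret. The same assembly sketched in Python, to make the shell pipeline easier to read (the `build_jaas_config` helper and the sample password are illustrative):

```python
import base64

def build_jaas_config(client_passwords_b64: str, username: str = "user1") -> str:
    """Mirror the shell pipeline: base64-decode the secret value, take the
    first comma-separated password, and wrap it in a SCRAM JAAS config line."""
    passwords = base64.b64decode(client_passwords_b64).decode()
    first_password = passwords.split(",")[0]  # same as `cut -d , -f 1`
    return (
        "org.apache.kafka.common.security.scram.ScramLoginModule required "
        f'username="{username}" password="{first_password}";'
    )

# e.g. a secret whose client-passwords field holds "s3cr3t,other" base64-encoded
encoded = base64.b64encode(b"s3cr3t,other").decode()
print(build_jaas_config(encoded))
```

The trailing semicolon matters: without it Kafka rejects the JAAS config at startup.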

                      2. prepare deploy-datahub.yaml

                      apiVersion: argoproj.io/v1alpha1
                      kind: Application
                      metadata:
                        name: datahub
                      spec:
                        syncPolicy:
                          syncOptions:
                          - CreateNamespace=true
                        project: default
                        source:
                          repoURL: https://helm.datahubproject.io
                          chart: datahub
                          targetRevision: 0.4.8
                          helm:
                            releaseName: datahub
                            values: |
                              global:
                                elasticsearch:
                                  host: elastic-search-elasticsearch.application.svc.cluster.local
                                  port: 9200
                                  skipcheck: "false"
                                  insecure: "false"
                                  useSSL: "false"
                                kafka:
                                  bootstrap:
                                    server: kafka.database.svc.cluster.local:9092
                                  zookeeper:
                                    server: kafka-zookeeper.database.svc.cluster.local:2181
                                sql:
                                  datasource:
                                    host: mariadb.database.svc.cluster.local:3306
                                    hostForMysqlClient: mariadb.database.svc.cluster.local
                                    port: 3306
                                    url: jdbc:mysql://mariadb.database.svc.cluster.local:3306/datahub?verifyServerCertificate=false&useSSL=true&useUnicode=yes&characterEncoding=UTF-8&enabledTLSProtocols=TLSv1.2
                                    driver: com.mysql.cj.jdbc.Driver
                                    username: root
                                    password:
                                      secretRef: datahub-credentials
                                      secretKey: mysql-root-password
                              datahub-gms:
                                enabled: true
                                replicaCount: 1
                                image:
                                  repository: m.daocloud.io/docker.io/acryldata/datahub-gms
                                service:
                                  type: ClusterIP
                                ingress:
                                  enabled: false
                              datahub-frontend:
                                enabled: true
                                replicaCount: 1
                                image:
                                  repository: m.daocloud.io/docker.io/acryldata/datahub-frontend-react
                                defaultUserCredentials:
                                  randomAdminPassword: true
                                service:
                                  type: ClusterIP
                                ingress:
                                  enabled: true
                                  className: nginx
                                  annotations:
                                    cert-manager.io/cluster-issuer: self-signed-ca-issuer
                                  hosts:
                                  - host: datahub.dev.geekcity.tech
                                    paths:
                                    - /
                                  tls:
                                  - secretName: "datahub.dev.geekcity.tech-tls"
                                    hosts:
                                    - datahub.dev.geekcity.tech
                              acryl-datahub-actions:
                                enabled: true
                                replicaCount: 1
                                image:
                                  repository: m.daocloud.io/docker.io/acryldata/datahub-actions
                              datahub-mae-consumer:
                                replicaCount: 1
                                image:
                                  repository: m.daocloud.io/docker.io/acryldata/datahub-mae-consumer
                                ingress:
                                  enabled: false
                              datahub-mce-consumer:
                                replicaCount: 1
                                image:
                                  repository: m.daocloud.io/docker.io/acryldata/datahub-mce-consumer
                                ingress:
                                  enabled: false
                              datahub-ingestion-cron:
                                enabled: false
                                image:
                                  repository: m.daocloud.io/docker.io/acryldata/datahub-ingestion
                              elasticsearchSetupJob:
                                enabled: true
                                image:
                                  repository: m.daocloud.io/docker.io/acryldata/datahub-elasticsearch-setup
                              kafkaSetupJob:
                                enabled: true
                                image:
                                  repository: m.daocloud.io/docker.io/acryldata/datahub-kafka-setup
                              mysqlSetupJob:
                                enabled: true
                                image:
                                  repository: m.daocloud.io/docker.io/acryldata/datahub-mysql-setup
                              postgresqlSetupJob:
                                enabled: false
                                image:
                                  repository: m.daocloud.io/docker.io/acryldata/datahub-postgres-setup
                              datahubUpgrade:
                                enabled: true
                                image:
                                  repository: m.daocloud.io/docker.io/acryldata/datahub-upgrade
                              datahubSystemUpdate:
                                image:
                                  repository: m.daocloud.io/docker.io/acryldata/datahub-upgrade
                        destination:
                          server: https://kubernetes.default.svc
                          namespace: application
                      apiVersion: argoproj.io/v1alpha1
                      kind: Application
                      metadata:
                        name: datahub
                      spec:
                        syncPolicy:
                          syncOptions:
                          - CreateNamespace=true
                        project: default
                        source:
                          repoURL: https://helm.datahubproject.io
                          chart: datahub
                          targetRevision: 0.4.8
                          helm:
                            releaseName: datahub
                            values: |
                              global:
                                springKafkaConfigurationOverrides:
                                  security.protocol: SASL_PLAINTEXT
                                  sasl.mechanism: SCRAM-SHA-256
                                credentialsAndCertsSecrets:
                                  name: datahub-credentials
                                  secureEnv:
                                    sasl.jaas.config: sasl.jaas.config
                                elasticsearch:
                                  host: elastic-search-elasticsearch.application.svc.cluster.local
                                  port: 9200
                                  skipcheck: "false"
                                  insecure: "false"
                                  useSSL: "false"
                                kafka:
                                  bootstrap:
                                    server: kafka.database.svc.cluster.local:9092
                                  zookeeper:
                                    server: kafka-zookeeper.database.svc.cluster.local:2181
                                neo4j:
                                  host: neo4j.database.svc.cluster.local:7474
                                  uri: bolt://neo4j.database.svc.cluster.local
                                  username: neo4j
                                  password:
                                    secretRef: datahub-credentials
                                    secretKey: neo4j-password
                                sql:
                                  datasource:
                                    host: mariadb.database.svc.cluster.local:3306
                                    hostForMysqlClient: mariadb.database.svc.cluster.local
                                    port: 3306
                                    url: jdbc:mysql://mariadb.database.svc.cluster.local:3306/datahub?verifyServerCertificate=false&useSSL=true&useUnicode=yes&characterEncoding=UTF-8&enabledTLSProtocols=TLSv1.2
                                    driver: com.mysql.cj.jdbc.Driver
                                    username: root
                                    password:
                                      secretRef: datahub-credentials
                                      secretKey: mysql-root-password
                              datahub-gms:
                                enabled: true
                                replicaCount: 1
                                image:
                                  repository: m.daocloud.io/docker.io/acryldata/datahub-gms
                                service:
                                  type: ClusterIP
                                ingress:
                                  enabled: false
                              datahub-frontend:
                                enabled: true
                                replicaCount: 1
                                image:
                                  repository: m.daocloud.io/docker.io/acryldata/datahub-frontend-react
                                defaultUserCredentials:
                                  randomAdminPassword: true
                                service:
                                  type: ClusterIP
                                ingress:
                                  enabled: true
                                  className: nginx
                                  annotations:
                                    cert-manager.io/cluster-issuer: self-signed-ca-issuer
                                  hosts:
                                  - host: datahub.dev.geekcity.tech
                                    paths:
                                    - /
                                  tls:
                                  - secretName: "datahub.dev.geekcity.tech-tls"
                                    hosts:
                                    - datahub.dev.geekcity.tech
                              acryl-datahub-actions:
                                enabled: true
                                replicaCount: 1
                                image:
                                  repository: m.daocloud.io/docker.io/acryldata/datahub-actions
                              datahub-mae-consumer:
                                replicaCount: 1
                                image:
                                  repository: m.daocloud.io/docker.io/acryldata/datahub-mae-consumer
                                ingress:
                                  enabled: false
                              datahub-mce-consumer:
                                replicaCount: 1
                                image:
                                  repository: m.daocloud.io/docker.io/acryldata/datahub-mce-consumer
                                ingress:
                                  enabled: false
                              datahub-ingestion-cron:
                                enabled: false
                                image:
                                  repository: m.daocloud.io/docker.io/acryldata/datahub-ingestion
                              elasticsearchSetupJob:
                                enabled: true
                                image:
                                  repository: m.daocloud.io/docker.io/acryldata/datahub-elasticsearch-setup
                              kafkaSetupJob:
                                enabled: true
                                image:
                                  repository: m.daocloud.io/docker.io/acryldata/datahub-kafka-setup
                              mysqlSetupJob:
                                enabled: true
                                image:
                                  repository: m.daocloud.io/docker.io/acryldata/datahub-mysql-setup
                              postgresqlSetupJob:
                                enabled: false
                                image:
                                  repository: m.daocloud.io/docker.io/acryldata/datahub-postgres-setup
                              datahubUpgrade:
                                enabled: true
                                image:
                                  repository: m.daocloud.io/docker.io/acryldata/datahub-upgrade
                              datahubSystemUpdate:
                                image:
                                  repository: m.daocloud.io/docker.io/acryldata/datahub-upgrade
                        destination:
                          server: https://kubernetes.default.svc
                          namespace: application
                      If you want to run one more GMS (standalone consumers), add this under `global`:

                        datahub_standalone_consumers_enabled: true

                      3. apply to k8s

                      kubectl -n argocd apply -f deploy-datahub.yaml

                      4. sync by argocd

                      argocd app sync argocd/datahub

                      5. extract credentials

                      kubectl -n application get secret datahub-user-secret -o jsonpath='{.data.user\.props}' | base64 -d

                      [Optional] Visit through browser

                      add $K8S_MASTER_IP datahub.dev.geekcity.tech to /etc/hosts

                      [Optional] Visit through Datahub CLI

                      We recommend Python virtual environments (venvs) to namespace pip modules. Here’s an example setup:

                      python3 -m venv venv             # create the environment
                      source venv/bin/activate         # activate the environment

                      NOTE: If you install datahub in a virtual environment, that same virtual environment must be re-activated each time a shell window or session is created.

                      Once inside the virtual environment, install datahub using the following commands

                      # Requires Python 3.8+
                      python3 -m pip install --upgrade pip wheel setuptools
                      python3 -m pip install --upgrade acryl-datahub
                      # validate that the install was successful
                      datahub version
                      # If you see "command not found", try running this instead: python3 -m datahub version
                      datahub init
                      # authenticate your datahub CLI with your datahub instance
                      Mar 7, 2024

                      N8N

                      🚀Installation

                      Install By

                      Preliminary

                      1. Kubernetes has installed, if not check 🔗link


                      2. Helm has installed, if not check 🔗link


                      3. ArgoCD has installed, if not check 🔗link


                      4. Database postgresql has been installed, if not check 🔗link


                      1. prepare `n8n-middleware-credential` secret

                      Details
                      kubectl get namespaces n8n > /dev/null 2>&1 || kubectl create namespace n8n
                      N8N_PASSWORD=$(kubectl -n database get secret postgresql-credentials -o jsonpath='{.data.password}' | base64 -d)
                      kubectl -n n8n create secret generic n8n-middleware-credential \
                      --from-literal=postgres-password="${N8N_PASSWORD}"

                      2. prepare `deploy-n8n.yaml`

                      Details
                      kubectl -n argocd apply -f - <<EOF
                      apiVersion: argoproj.io/v1alpha1
                      kind: Application
                      metadata:
                        name: n8n
                      spec:
                        project: default
                        source:
                          repoURL: https://community-charts.github.io/helm-charts
                          chart: n8n
                          targetRevision: 1.16.22
                          helm:
                            releaseName: n8n
                            values: |
                              global:
                                security:
                                  allowInsecureImages: true
                              image:
                                repository: n8nio/n8n
                              log:
                                level: info
                              encryptionKey: "ay-dev-n8n"
                              timezone: Asia/Shanghai
                              db:
                                type: postgresdb
                              externalPostgresql:
                                host: postgresql-hl.database.svc.cluster.local
                                port: 5432
                                username: "n8n"
                                database: "n8n"
                                existingSecret: "n8n-middleware-credential"
                              main:
                                count: 1
                                extraEnvVars:
                                  "N8N_BLOCK_ENV_ACCESS_IN_NODE": "false"
                                  "N8N_FILE_SYSTEM_ALLOWED_PATHS": "/data"
                                  "EXECUTIONS_TIMEOUT": "300"
                                  "EXECUTIONS_TIMEOUT_MAX": "600"
                                  "DB_POSTGRESDB_POOL_SIZE": "10"
                                  "CACHE_ENABLED": "true"
                                  "N8N_CONCURRENCY_PRODUCTION_LIMIT": "5"
                                  "NODE_TLS_REJECT_UNAUTHORIZED": "0"
                                  "N8N_SECURE_COOKIE": "false"
                                  "WEBHOOK_URL": "https://webhook.n8n.ay.dev"
                                  "QUEUE_BULL_REDIS_TIMEOUT_THRESHOLD": "60000"
                                  "N8N_COMMUNITY_PACKAGES_ENABLED": "true"
                                  "N8N_GIT_NODE_DISABLE_BARE_REPOS": "true"
                                  "N8N_LICENSE_AUTO_RENEW_ENABLED": "true"
                                  "N8N_LICENSE_RENEW_ON_INIT": "true"
                                persistence:
                                  enabled: true
                                  accessMode: ReadWriteOnce
                                  storageClass: "local-path"
                                  size: 50Gi
                                volumes:
                                  - name: downloads-volume
                                    hostPath:
                                      path: /home/aaron/Downloads
                                      type: DirectoryOrCreate
                                volumeMounts:
                                  - name: downloads-volume
                                    mountPath: /data
                                resources:
                                  requests:
                                    cpu: 1000m
                                    memory: 1024Mi
                                  limits:
                                    cpu: 2000m
                                    memory: 2048Mi
                              worker:
                                mode: queue
                                count: 2
                                waitMainNodeReady:
                                  enabled: false
                                extraEnvVars:
                                  "N8N_FILE_SYSTEM_ALLOWED_PATHS": "/data"
                                  "EXECUTIONS_TIMEOUT": "300"
                                  "EXECUTIONS_TIMEOUT_MAX": "600"
                                  "DB_POSTGRESDB_POOL_SIZE": "5"
                                  "QUEUE_BULL_REDIS_TIMEOUT_THRESHOLD": "60000"
                                  "N8N_COMMUNITY_PACKAGES_ENABLED": "true"
                                  "N8N_GIT_NODE_DISABLE_BARE_REPOS": "true"
                                  "N8N_LICENSE_AUTO_RENEW_ENABLED": "true"
                                  "N8N_LICENSE_RENEW_ON_INIT": "true"
                                persistence:
                                  enabled: true
                                  accessMode: ReadWriteOnce
                                  storageClass: "local-path"
                                  size: 50Gi
                                volumes:
                                  - name: downloads-volume
                                    hostPath:
                                      path: /home/aaron/Downloads
                                      type: DirectoryOrCreate
                                volumeMounts:
                                  - name: downloads-volume
                                    mountPath: /data
                                resources:
                                  requests:
                                    cpu: 500m
                                    memory: 1024Mi
                                  limits:
                                    cpu: 1000m
                                    memory: 2048Mi
                              nodes:
                                builtin:
                                  enabled: true
                                  modules:
                                    - crypto
                                    - fs
                                external:
                                  allowAll: true
                                  packages:
                                    - n8n-nodes-globals
                              npmRegistry:
                                enabled: true
                                url: http://mirrors.cloud.tencent.com/npm/
                              redis:
                                enabled: true
                                image:
                                  registry: m.daocloud.io/docker.io
                                  repository: bitnamilegacy/redis
                                master:
                                  resourcesPreset: "small"
                                  persistence:
                                    enabled: true
                                    accessMode: ReadWriteOnce
                                    storageClass: "local-path"
                                    size: 10Gi
                              ingress:
                                enabled: true
                                className: nginx
                                annotations:
                                  kubernetes.io/ingress.class: nginx
                                  cert-manager.io/cluster-issuer: self-signed-ca-issuer
                                  nginx.ingress.kubernetes.io/proxy-connect-timeout: "300"
                                  nginx.ingress.kubernetes.io/proxy-send-timeout: "300"
                                  nginx.ingress.kubernetes.io/proxy-read-timeout: "300"
                                  nginx.ingress.kubernetes.io/proxy-body-size: "50m"
                                  nginx.ingress.kubernetes.io/upstream-keepalive-connections: "50"
                                  nginx.ingress.kubernetes.io/upstream-keepalive-timeout: "60"
                                hosts:
                                  - host: n8n.ay.dev
                                    paths:
                                      - path: /
                                        pathType: Prefix
                                tls:
                                - hosts:
                                  - n8n.ay.dev
                                  - webhook.n8n.ay.dev
                                  secretName: n8n.ay.dev-tls
                              webhook:
                                mode: queue
                                url: "https://webhook.n8n.ay.dev"
                                autoscaling:
                                  enabled: false
                                waitMainNodeReady:
                                  enabled: true
                                resources:
                                  requests:
                                    cpu: 200m
                                    memory: 256Mi
                                  limits:
                                    cpu: 512m
                                    memory: 512Mi
                          chart: n8n
                        destination:
                          server: https://kubernetes.default.svc
                          namespace: n8n
                        syncPolicy:
                          syncOptions:
                            - CreateNamespace=true
                            - ApplyOutOfSyncOnly=false
                      EOF

                      3.sync by argocd

                      Details
                      argocd app sync argocd/n8n
                      Using AY Helm Mirror

For more information, check 🔗https://github.com/AaronYang0628/helm-chart-mirror

helm repo add ay-helm-mirror https://aaronyang0628.github.io/helm-chart-mirror/charts
helm repo update
helm install ay-helm-mirror/chart-name --generate-name --version a.b.c
                      Using AY ACR Image Mirror
                      Using DaoCloud Mirror

                      Preliminary

1. Kubernetes is installed; if not, check 🔗link


2. Helm is installed; if not, check 🔗link


3. ArgoCD is installed; if not, check 🔗link


4. The PostgreSQL database is installed; if not, check 🔗link


1.prepare `n8n-middleware-credentials.yaml`

                      Details
                      kubectl get namespaces n8n > /dev/null 2>&1 || kubectl create namespace n8n
                      N8N_PASSWORD=$(kubectl -n database get secret postgresql-credentials -o jsonpath='{.data.password}' | base64 -d)
                      kubectl -n n8n create secret generic n8n-middleware-credential \
                      --from-literal=postgres-password="${N8N_PASSWORD}"
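To double-check that the secret landed correctly, you can read it back and decode it (using the secret and key names from the command above):

```shell
# read the password back from the freshly created secret
kubectl -n n8n get secret n8n-middleware-credential \
  -o jsonpath='{.data.postgres-password}' | base64 -d
```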

                      2.prepare `deploy-n8n.yaml`

                      Details
                      kubectl -n argocd apply -f - <<EOF
                      apiVersion: argoproj.io/v1alpha1
                      kind: Application
                      metadata:
                        name: n8n
                      spec:
                        project: default
                        source:
                          repoURL: https://community-charts.github.io/helm-charts
                          targetRevision: 1.16.22
                          helm:
                            releaseName: n8n
                            values: |
                              global:
                                security:
                                  allowInsecureImages: true
                              image:
                                repository: n8nio/n8n
                              log:
                                level: info
                              encryptionKey: "ay-dev-n8n"
                              timezone: Asia/Shanghai
                              db:
                                type: postgresdb
                              externalPostgresql:
                                host: postgresql-hl.database.svc.cluster.local
                                port: 5432
                                username: "n8n"
                                database: "n8n"
                                existingSecret: "n8n-middleware-credential"
                              main:
                                count: 1
                                extraEnvVars:
                                  "N8N_BLOCK_ENV_ACCESS_IN_NODE": "false"
                                  "N8N_FILE_SYSTEM_ALLOWED_PATHS": "/home/node/.n8n-files"
                                  "EXECUTIONS_TIMEOUT": "300"
                                  "EXECUTIONS_TIMEOUT_MAX": "600"
                                  "DB_POSTGRESDB_POOL_SIZE": "10"
                                  "CACHE_ENABLED": "true"
                                  "N8N_CONCURRENCY_PRODUCTION_LIMIT": "5"
                                  "NODE_TLS_REJECT_UNAUTHORIZED": "0"
                                  "N8N_SECURE_COOKIE": "false"
                                  "WEBHOOK_URL": "https://webhook.n8n.ay.dev"
                                  "QUEUE_BULL_REDIS_TIMEOUT_THRESHOLD": "60000"
                                  "N8N_COMMUNITY_PACKAGES_ENABLED": "true"
                                  "N8N_GIT_NODE_DISABLE_BARE_REPOS": "true"
                                  "N8N_LICENSE_AUTO_RENEW_ENABLED": "true"
                                  "N8N_LICENSE_RENEW_ON_INIT": "true"
                                persistence:
                                  enabled: true
                                  accessMode: ReadWriteOnce
                                  storageClass: "local-path"
                                  size: 50Gi
                                volumes:
                                  - name: downloads-volume
                                    hostPath:
                                      path: /home/aaron/Downloads
                                      type: DirectoryOrCreate
                                volumeMounts:
                                  - name: downloads-volume
                                    mountPath: /home/node/.n8n-files
                                resources:
                                  requests:
                                    cpu: 1000m
                                    memory: 1024Mi
                                  limits:
                                    cpu: 2000m
                                    memory: 2048Mi
                              worker:
                                mode: queue
                                count: 2
                                waitMainNodeReady:
                                  enabled: false
                                extraEnvVars:
                                  "N8N_FILE_SYSTEM_ALLOWED_PATHS": "/home/node/.n8n-files"
                                  "EXECUTIONS_TIMEOUT": "300"
                                  "EXECUTIONS_TIMEOUT_MAX": "600"
                                  "DB_POSTGRESDB_POOL_SIZE": "5"
                                  "QUEUE_BULL_REDIS_TIMEOUT_THRESHOLD": "60000"
                                  "N8N_COMMUNITY_PACKAGES_ENABLED": "true"
                                  "N8N_GIT_NODE_DISABLE_BARE_REPOS": "true"
                                  "N8N_LICENSE_AUTO_RENEW_ENABLED": "true"
                                  "N8N_LICENSE_RENEW_ON_INIT": "true"
                                persistence:
                                  enabled: true
                                  accessMode: ReadWriteOnce
                                  storageClass: "local-path"
                                  size: 50Gi
                                volumes:
                                  - name: downloads-volume
                                    hostPath:
                                      path: /home/aaron/Downloads
                                      type: DirectoryOrCreate
                                volumeMounts:
                                  - name: downloads-volume
                                    mountPath: /home/node/.n8n-files
                                resources:
                                  requests:
                                    cpu: 500m
                                    memory: 1024Mi
                                  limits:
                                    cpu: 1000m
                                    memory: 2048Mi
                              nodes:
                                builtin:
                                  enabled: true
                                  modules:
                                    - crypto
                                    - fs
                                external:
                                  allowAll: true
                                  packages:
                                    - n8n-nodes-globals
                              npmRegistry:
                                enabled: true
                                url: http://mirrors.cloud.tencent.com/npm/
                              redis:
                                enabled: true
                                image:
                                  registry: m.daocloud.io/docker.io
                                  repository: bitnamilegacy/redis
                                master:
                                  resourcesPreset: "small"
                                  persistence:
                                    enabled: true
                                    accessMode: ReadWriteOnce
                                    storageClass: "local-path"
                                    size: 10Gi
                              ingress:
                                enabled: true
                                className: nginx
                                annotations:
                                  kubernetes.io/ingress.class: nginx
                                  cert-manager.io/cluster-issuer: self-signed-ca-issuer
                                  nginx.ingress.kubernetes.io/proxy-connect-timeout: "300"
                                  nginx.ingress.kubernetes.io/proxy-send-timeout: "300"
                                  nginx.ingress.kubernetes.io/proxy-read-timeout: "300"
                                  nginx.ingress.kubernetes.io/proxy-body-size: "50m"
                                  nginx.ingress.kubernetes.io/upstream-keepalive-connections: "50"
                                  nginx.ingress.kubernetes.io/upstream-keepalive-timeout: "60"
                                hosts:
                                  - host: n8n.ay.dev
                                    paths:
                                      - path: /
                                        pathType: Prefix
                                tls:
                                - hosts:
                                  - n8n.ay.dev
                                  - webhook.n8n.ay.dev
                                  secretName: n8n.ay.dev-tls
                              webhook:
                                mode: queue
                                url: "https://webhook.n8n.ay.dev"
                                autoscaling:
                                  enabled: false
                                waitMainNodeReady:
                                  enabled: true
                                resources:
                                  requests:
                                    cpu: 200m
                                    memory: 256Mi
                                  limits:
                                    cpu: 512m
                                    memory: 512Mi
                          chart: n8n
                        destination:
                          server: https://kubernetes.default.svc
                          namespace: n8n
                        syncPolicy:
                          syncOptions:
                            - CreateNamespace=true
                            - ApplyOutOfSyncOnly=false
                      EOF

                      3.sync by argocd

                      Details
                      argocd app sync argocd/n8n
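Once the sync is triggered, a quick way to confirm the application converged is to wait on its health and list the pods (app name and namespace as used above):

```shell
# block until the application reports Healthy (up to 5 minutes)
argocd app wait argocd/n8n --health --timeout 300
# main, worker, webhook and redis pods should all be Running
kubectl -n n8n get pods
```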
                      Using AY Helm Mirror

For more information, check 🔗https://github.com/AaronYang0628/helm-chart-mirror

helm repo add ay-helm-mirror https://aaronyang0628.github.io/helm-chart-mirror/charts
helm repo update
helm install ay-helm-mirror/chart-name --generate-name --version a.b.c
                      Using AY ACR Image Mirror
                      Using DaoCloud Mirror

                      🛎️FAQ

                      Q1: Show me almost endless possibilities

                      You can add standard markdown syntax:

                      • multiple paragraphs
                      • bullet point lists
                      • emphasized, bold and even bold emphasized text
                      • links
                      • etc.
                      ...and even source code

                      the possibilities are endless (almost - including other shortcodes may or may not work)


                      Mar 7, 2024

                      Wechat Markdown Editor

                      🚀Installation

                      Install By

                      1.get helm repo

                      Details
                      helm repo add xxxxx https://xxxx
                      helm repo update

                      2.install chart

                      Details
                      helm install xxxxx/chart-name --generate-name --version a.b.c
                      Using AY Helm Mirror

1.prepare `xxxxx-credentials.yaml`

                      Details

                      2.prepare `deploy-xxxxx.yaml`

                      Details
kubectl -n argocd apply -f - <<EOF
                      apiVersion: argoproj.io/v1alpha1
                      kind: Application
                      metadata:
                        name: xxxx
                      spec:
                        project: default
                        source:
                          repoURL: https://xxxxx
                          chart: xxxx
                          targetRevision: a.b.c
                      EOF

                      3.sync by argocd

                      Details
                      argocd app sync argocd/xxxx
                      Using AY Helm Mirror

For more information, check 🔗https://github.com/AaronYang0628/helm-chart-mirror

helm repo add ay-helm-mirror https://aaronyang0628.github.io/helm-chart-mirror/charts
helm repo update
helm install ay-helm-mirror/chart-name --generate-name --version a.b.c
                      Using AY ACR Image Mirror
                      Using DaoCloud Mirror

                      1.init server

                      Details


                      Mar 7, 2024

                      Subsections of Auth

                      Deploy GateKeeper Server

                      Official Website: https://open-policy-agent.github.io/gatekeeper/website/

                      Preliminary

• Kubernetes version must be newer than v1.16

                      Components

Gatekeeper is a Kubernetes admission controller built on Open Policy Agent (OPA). It lets users define and enforce custom policies that control the creation, update, and deletion of resources in a Kubernetes cluster.

• Core components
  • Constraint Templates: define a policy's rule logic, written in the Rego language. A template is an abstract policy that can be reused by multiple constraint instances.
  • Constraint Instances: concrete policy instances created from a constraint template; they specify the parameters and match rules that determine which resources the policy applies to.
  • Admission Controller (no modification required): intercepts requests to the Kubernetes API server and evaluates them against the defined constraints; any request that violates a constraint is rejected.
  Core Pod roles


• gatekeeper-audit
  • Periodic compliance checks: at a preset interval, this component scans every existing resource in the cluster and checks whether it complies with the defined constraints (periodic, batch checking).
  • Audit reports: after each scan, gatekeeper-audit produces a detailed report stating which resources violate which constraints, so administrators can keep track of the cluster's compliance status.
• gatekeeper-controller-manager
  • Real-time admission control: acting as the admission controller, gatekeeper-controller-manager intercepts resource create, update, and delete requests as they are issued and evaluates the resources against the predefined constraint templates and constraints (real-time, event-driven).
  • Decision handling: based on the evaluation, a request that satisfies all constraints is allowed to proceed; a request that violates any rule is rejected, keeping non-compliant resources out of the cluster.

                      Features

1. Constraint management

  • Custom constraint templates: users can write their own constraint templates in the Rego language to implement all kinds of complex policy logic.

    For example, a policy can require every Namespace to set specific labels, or restrict certain namespaces to specific images.

List existing constraint templates and instances
                              ```shell
                              kubectl get constrainttemplates
                              kubectl get constraints
                              ```
                          
    ```shell
    kubectl apply -f - <<EOF
    apiVersion: templates.gatekeeper.sh/v1
    kind: ConstraintTemplate
    metadata:
      name: k8srequiredlabels
    spec:
      crd:
        spec:
          names:
            kind: K8sRequiredLabels
          validation:
            openAPIV3Schema:
              type: object
              properties:
                labels:
                  type: array
                  items:
                    type: string
      targets:
        - target: admission.k8s.gatekeeper.sh
          rego: |
            package k8srequiredlabels

            violation[{"msg": msg, "details": {"missing_labels": missing}}] {
                provided := {label | input.review.object.metadata.labels[label]}
                required := {label | label := input.parameters.labels[_]}
                missing := required - provided
                count(missing) > 0
                msg := sprintf("you must provide labels: %v", [missing])
            }
    EOF
    ```
                          

  • Template reuse: constraint templates can be reused by multiple constraint instances, which improves policy maintainability and reusability.

    For example, you can create one generic label constraint template and then create different constraint instances in different Namespaces, each requiring different labels.

    YAML for a constraint instance
        Requires every Namespace to carry the label "gatekeeper"
                          
    ```yaml
    apiVersion: constraints.gatekeeper.sh/v1beta1
    kind: K8sRequiredLabels
    metadata:
      name: ns-must-have-gk-label
    spec:
      enforcementAction: dryrun
      match:
        kinds:
          - apiGroups: [""]
            kinds: ["Namespace"]
      parameters:
        labels: ["gatekeeper"]
    ```
                          
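Since the instance above uses enforcementAction: dryrun, violations are not blocked but only recorded by gatekeeper-audit; they show up in the constraint's status, which you can query like this:

```shell
# total number of resources currently violating this constraint
kubectl get k8srequiredlabels ns-must-have-gk-label \
  -o jsonpath='{.status.totalViolations}'
# per-resource details live under status.violations
kubectl get k8srequiredlabels ns-must-have-gk-label -o yaml
```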

  • Constraint updates: when a constraint template or constraint changes, Gatekeeper automatically re-evaluates all affected resources so that the updated policy takes effect immediately.

2. Resource control

  • Admission interception: when a resource create or update request arrives, Gatekeeper intercepts it in real time and evaluates it against the policies. Requests that violate a policy are rejected immediately with a detailed error message, which helps users locate the problem quickly.

  • Create/update restrictions: Gatekeeper blocks create and update requests for resources that do not comply with policy.

    For example, if a policy requires every Deployment to set resource limits (requests and limits), an attempt to create or update a Deployment without them is rejected.

    This is controlled through enforcementAction; options: dryrun | deny | warn

    check https://open-policy-agent.github.io/gatekeeper-library/website/validation/containerlimits
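As a sketch of the warn mode (assuming the gatekeeper-library K8sContainerLimits template from the link above is already installed in the cluster), a constraint that surfaces warnings instead of denying requests could look like:

```shell
kubectl apply -f - <<EOF
apiVersion: constraints.gatekeeper.sh/v1beta1
kind: K8sContainerLimits
metadata:
  name: container-must-have-limits
spec:
  enforcementAction: warn   # surface a warning instead of rejecting
  match:
    kinds:
      - apiGroups: [""]
        kinds: ["Pod"]
  parameters:
    cpu: "200m"      # maximum allowed CPU limit
    memory: "1Gi"    # maximum allowed memory limit
EOF
```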

  • Resource type filtering: the match field of a constraint specifies which resource types and namespaces the policy applies to.

    For example, you can apply a policy only to Pods in certain namespaces, or only to resources of a particular API group and version.

    A syncSet (sync configuration) can specify which resources to watch and which to ignore.

    Sync all Namespaces and Pods, ignoring namespaces whose names start with kube
    ```yaml
    apiVersion: config.gatekeeper.sh/v1alpha1
    kind: Config
    metadata:
      name: config
      namespace: "gatekeeper-system"
    spec:
      sync:
        syncOnly:
          - group: ""
            version: "v1"
            kind: "Namespace"
          - group: ""
            version: "v1"
            kind: "Pod"
      match:
        - excludedNamespaces: ["kube-*"]
          processes: ["*"]
    ```
                          

3. Compliance assurance

  • Industry standards and custom rules: Gatekeeper can ensure that resources in the Kubernetes cluster comply with industry standards and with the organization's internal security rules.

    For example, policies can require every container to use the latest security patches, or require all storage volumes to be encrypted.

    The Gatekeeper library already provides close to 50 constraint policies for all kinds of resource restrictions; visit https://open-policy-agent.github.io/gatekeeper-library/website/ to browse and obtain them.

  • Audit and reporting: Gatekeeper records every policy evaluation result, which makes auditing and reporting straightforward. From the audit logs, administrators can see which resources violated which policies.

  • Audit export: audit logs can be exported and consumed by downstream systems.

    See https://open-policy-agent.github.io/gatekeeper/website/docs/pubsub/ for details.

                      Installation

                      install from
                      kubectl apply -f https://raw.githubusercontent.com/open-policy-agent/gatekeeper/v3.18.2/deploy/gatekeeper.yaml
                      helm repo add gatekeeper https://open-policy-agent.github.io/gatekeeper/charts
                      helm install gatekeeper/gatekeeper --name-template=gatekeeper --namespace gatekeeper-system --create-namespace
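Whichever method you use, a successful install leaves the two core workloads described above running; a quick check:

```shell
# expect gatekeeper-audit and gatekeeper-controller-manager to be Available
kubectl -n gatekeeper-system get deployments
```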

                      Make sure that:

                      • You have Docker version 20.10 or later installed.
                      • Your kubectl context is set to the desired installation cluster.
                      • You have a container registry you can write to that is readable by the target cluster.
                      git clone https://github.com/open-policy-agent/gatekeeper.git \
                      && cd gatekeeper 
                      • Build and push Gatekeeper image:
                      export DESTINATION_GATEKEEPER_IMAGE=<add registry like "myregistry.docker.io/gatekeeper">
                      make docker-buildx REPOSITORY=$DESTINATION_GATEKEEPER_IMAGE OUTPUT_TYPE=type=registry
• Then deploy
                      make deploy REPOSITORY=$DESTINATION_GATEKEEPER_IMAGE
                      Mar 12, 2024

                      Subsections of Binary

                      Argo Workflow Binary

                      MIRROR="files.m.daocloud.io/"
                      VERSION=v3.5.4
                      curl -sSLo argo-linux-amd64.gz "https://${MIRROR}github.com/argoproj/argo-workflows/releases/download/${VERSION}/argo-linux-amd64.gz"
                      gunzip argo-linux-amd64.gz
                      chmod u+x argo-linux-amd64
                      mkdir -p ${HOME}/bin
                      mv -f argo-linux-amd64 ${HOME}/bin/argo
                      rm -f argo-linux-amd64.gz
                      Apr 7, 2024

                      ArgoCD Binary

                      MIRROR="files.m.daocloud.io/"
                      VERSION=v3.1.8
                      [ $(uname -m) = x86_64 ] && curl -sSLo argocd "https://${MIRROR}github.com/argoproj/argo-cd/releases/download/${VERSION}/argocd-linux-amd64"
                      [ $(uname -m) = aarch64 ] && curl -sSLo argocd "https://${MIRROR}github.com/argoproj/argo-cd/releases/download/${VERSION}/argocd-linux-arm64"
                      chmod u+x argocd
                      mkdir -p ${HOME}/bin
                      mv -f argocd ${HOME}/bin

                      [Optional] add to PATH

cat >> ~/.bashrc << 'EOF'
export PATH=$PATH:${HOME}/bin
EOF
source ~/.bashrc
                      Apr 7, 2024

                      Golang Binary

# sudo rm -rf /usr/local/go  # remove any previous installation
wget https://go.dev/dl/go1.24.4.linux-amd64.tar.gz
sudo tar -C /usr/local -xzf go1.24.4.linux-amd64.tar.gz
echo 'export PATH=$PATH:/usr/local/go/bin' >> ~/.bashrc
source ~/.bashrc
rm -f ./go1.24.4.linux-amd64.tar.gz
                      Apr 7, 2024

                      Gradle Binary

VERSION=8.7
FILE_NAME=gradle-${VERSION}-bin.zip
curl -sSLo ${FILE_NAME} "https://services.gradle.org/distributions/${FILE_NAME}"
mkdir -p ${HOME}/opt
unzip -q ${FILE_NAME} -d ${HOME}/opt
mkdir -p ${HOME}/bin
ln -sf ${HOME}/opt/gradle-${VERSION}/bin/gradle ${HOME}/bin/gradle
rm -f ${FILE_NAME}
                      Apr 7, 2024

                      Helm Binary

                      ARCH_IN_FILE_NAME=linux-amd64
                      FILE_NAME=helm-v3.18.3-${ARCH_IN_FILE_NAME}.tar.gz
                      curl -sSLo ${FILE_NAME} "https://files.m.daocloud.io/get.helm.sh/${FILE_NAME}"
                      tar zxf ${FILE_NAME}
                      mkdir -p ${HOME}/bin
                      mv -f ${ARCH_IN_FILE_NAME}/helm ${HOME}/bin
                      rm -rf ./${FILE_NAME}
                      rm -rf ./${ARCH_IN_FILE_NAME}
                      chmod u+x ${HOME}/bin/helm
                      Apr 7, 2024

                      JQ Binary

JQ_VERSION=1.7
wget "https://github.com/jqlang/jq/releases/download/jq-${JQ_VERSION}/jq-linux-amd64" -O /usr/bin/jq && chmod +x /usr/bin/jq
                      Apr 7, 2024

                      Kind Binary

                      MIRROR="files.m.daocloud.io/"
                      VERSION=v0.29.0
                      [ $(uname -m) = x86_64 ] && curl -sSLo kind "https://${MIRROR}github.com/kubernetes-sigs/kind/releases/download/${VERSION}/kind-linux-amd64"
                      [ $(uname -m) = aarch64 ] && curl -sSLo kind "https://${MIRROR}github.com/kubernetes-sigs/kind/releases/download/${VERSION}/kind-linux-arm64"
                      chmod u+x kind
                      mkdir -p ${HOME}/bin
                      mv -f kind ${HOME}/bin
                      Apr 7, 2025

                      Krew Binary

                      cd "$(mktemp -d)" &&
                      OS="$(uname | tr '[:upper:]' '[:lower:]')" &&
                      ARCH="$(uname -m | sed -e 's/x86_64/amd64/' -e 's/\(arm\)\(64\)\?.*/\1\2/' -e 's/aarch64$/arm64/')" &&
                      KREW="krew-${OS}_${ARCH}" &&
                      curl -fsSLO "https://github.com/kubernetes-sigs/krew/releases/latest/download/${KREW}.tar.gz" &&
                      tar zxvf "${KREW}.tar.gz" &&
                      ./"${KREW}" install krew
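The `uname`/`sed` pipeline above maps kernel machine names onto Krew's release-asset arch names; a quick sanity check of the transform on hypothetical inputs (assumes GNU sed, as on Linux):

```shell
# Reproduce the arch normalization used in the install snippet
normalize() {
  echo "$1" | sed -e 's/x86_64/amd64/' -e 's/\(arm\)\(64\)\?.*/\1\2/' -e 's/aarch64$/arm64/'
}
normalize x86_64    # -> amd64
normalize aarch64   # -> arm64
normalize armv7l    # -> arm
```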
                      Apr 7, 2024

                      Kubectl Binary

                      MIRROR="files.m.daocloud.io/"
                      VERSION=$(curl -L -s https://${MIRROR}dl.k8s.io/release/stable.txt)
                      [ $(uname -m) = x86_64 ] && curl -sSLo kubectl "https://${MIRROR}dl.k8s.io/release/${VERSION}/bin/linux/amd64/kubectl"
                      [ $(uname -m) = aarch64 ] && curl -sSLo kubectl "https://${MIRROR}dl.k8s.io/release/${VERSION}/bin/linux/arm64/kubectl"
                      chmod u+x kubectl
                      mkdir -p ${HOME}/bin
                      mv -f kubectl ${HOME}/bin
                      Apr 7, 2024

                      Kustomize Binary

MIRROR="github.com"
VERSION="v5.7.1"
[ $(uname -m) = x86_64 ] && curl -sSLo kustomize.tar.gz "https://${MIRROR}/kubernetes-sigs/kustomize/releases/download/kustomize%2F${VERSION}/kustomize_${VERSION}_linux_amd64.tar.gz"
[ $(uname -m) = aarch64 ] && curl -sSLo kustomize.tar.gz "https://${MIRROR}/kubernetes-sigs/kustomize/releases/download/kustomize%2F${VERSION}/kustomize_${VERSION}_linux_arm64.tar.gz"
tar zxf kustomize.tar.gz
chmod u+x kustomize
mkdir -p ${HOME}/bin
mv -f kustomize ${HOME}/bin
rm -f kustomize.tar.gz
                      Apr 7, 2024

                      Maven Binary

wget https://dlcdn.apache.org/maven/maven-3/3.9.6/binaries/apache-maven-3.9.6-bin.tar.gz
tar xzf apache-maven-3.9.6-bin.tar.gz -C /usr/local
ln -sfn /usr/local/apache-maven-3.9.6/bin/mvn ${HOME}/bin/mvn
echo 'export PATH=$PATH:/usr/local/apache-maven-3.9.6/bin' >> ~/.bashrc
source ~/.bashrc
rm -f apache-maven-3.9.6-bin.tar.gz
                      Apr 7, 2024

                      Minikube Binary

                      MIRROR="files.m.daocloud.io/"
                      [ $(uname -m) = x86_64 ] && curl -sSLo minikube "https://${MIRROR}storage.googleapis.com/minikube/releases/latest/minikube-linux-amd64"
                      [ $(uname -m) = aarch64 ] && curl -sSLo minikube "https://${MIRROR}storage.googleapis.com/minikube/releases/latest/minikube-linux-arm64"
                      chmod u+x minikube
                      mkdir -p ${HOME}/bin
                      mv -f minikube ${HOME}/bin
                      Apr 7, 2024

OpenJDK (Temurin)

                      mkdir -p /etc/apt/keyrings && \
                      wget -qO - https://packages.adoptium.net/artifactory/api/gpg/key/public | gpg --dearmor -o /etc/apt/keyrings/adoptium.gpg && \
                      echo "deb [signed-by=/etc/apt/keyrings/adoptium.gpg arch=amd64] https://packages.adoptium.net/artifactory/deb $(awk -F= '/^VERSION_CODENAME/{print$2}' /etc/os-release) main" | tee /etc/apt/sources.list.d/adoptium.list > /dev/null && \
                      apt-get update && \
                      apt-get install -y temurin-21-jdk && \
                      apt-get clean && \
                      rm -rf /var/lib/apt/lists/*
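The apt source line above derives the distribution suite from `VERSION_CODENAME` in `/etc/os-release`; the same `awk` extraction, run on a sample snippet (hypothetical file contents):

```shell
# Same extraction the apt source line uses, on sample input
printf 'NAME=Debian\nVERSION_CODENAME=bookworm\n' \
  | awk -F= '/^VERSION_CODENAME/{print$2}'
# -> bookworm
```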
                      Apr 7, 2025

                      YQ Binary

                      YQ_VERSION=v4.40.5
                      YQ_BINARY=yq_linux_amd64
                      wget https://github.com/mikefarah/yq/releases/download/${YQ_VERSION}/${YQ_BINARY}.tar.gz -O - | tar xz && mv ${YQ_BINARY} /usr/bin/yq
                      Apr 7, 2024

                      CICD

                      Articles

FAQ

Q1: What is the difference between docker / podman / buildah?

• docker: a container engine built around a long-running daemon (dockerd) that builds, runs and manages containers.
• podman: a daemonless, largely docker-CLI-compatible engine; containers run as ordinary child processes and rootless mode is first-class.
• buildah: a focused tool for building OCI images (it is the build backend podman uses); it does not run containers as a service.

                      Mar 7, 2025

                      Subsections of CICD

                      Install Argo CD

                      Preliminary

• Kubernetes has been installed, if not check 🔗link
• Helm binary has been installed, if not check 🔗link

                      1. install argoCD binary

                      MIRROR="files.m.daocloud.io/"
                      VERSION=v3.1.8
                      [ $(uname -m) = x86_64 ] && curl -sSLo argocd "https://${MIRROR}github.com/argoproj/argo-cd/releases/download/${VERSION}/argocd-linux-amd64"
                      [ $(uname -m) = aarch64 ] && curl -sSLo argocd "https://${MIRROR}github.com/argoproj/argo-cd/releases/download/${VERSION}/argocd-linux-arm64"
                      chmod u+x argocd
                      mkdir -p ${HOME}/bin
                      mv -f argocd ${HOME}/bin

                      [Optional] add to PATH

cat >> ~/.bashrc << 'EOF'
export PATH=$PATH:${HOME}/bin
EOF
source ~/.bashrc

                      2. install components

                      Install By
                      1. Prepare argocd.values.yaml
                      cat <<EOF > argocd.zj.values.yaml
                      crds:
                        install: true
                        keep: false
                      global:
                        domain: argo-cd.ay.dev
                        revisionHistoryLimit: 3
                        image:
                          repository: m.daocloud.io/quay.io/argoproj/argocd
                          imagePullPolicy: IfNotPresent
                      redis:
                        enabled: true
                        image:
                          repository: m.daocloud.io/docker.io/library/redis
                        exporter:
                          enabled: false
                          image:
                            repository: m.daocloud.io/bitnami/redis-exporter
                        metrics:
                          enabled: false
                      redis-ha:
                        enabled: false
                        image:
                          repository: m.daocloud.io/docker.io/library/redis
                        configmapTest:
                          repository: m.daocloud.io/docker.io/koalaman/shellcheck
                        haproxy:
                          enabled: false
                          image:
                            repository: m.daocloud.io/docker.io/library/haproxy
                        exporter:
                          enabled: false
                          image: m.daocloud.io/docker.io/oliver006/redis_exporter
                      dex:
                        enabled: true
                        image:
                          repository: m.daocloud.io/ghcr.io/dexidp/dex
                      server:
                        ingress:
                          enabled: true
                          ingressClassName: nginx
                          annotations:
                            nginx.ingress.kubernetes.io/ssl-passthrough: "true"
                            cert-manager.io/cluster-issuer: self-signed-ca-issuer
                            nginx.ingress.kubernetes.io/backend-protocol: HTTPS
                          hostname: argo-cd.ay.dev
                          path: /
                          pathType: Prefix
                          tls: true
                      EOF
                      
2. Install argoCD From Mirror
helm upgrade --install argo-cd argo-cd \
  --namespace argocd \
  --create-namespace \
  --version 8.3.5 \
  --repo https://aaronyang0628.github.io/helm-chart-mirror/charts \
  --values argocd.zj.values.yaml \
  --atomic
                      
3. [Optional] Install argoCD From Original
helm upgrade --install argo-cd argo-cd \
  --namespace argocd \
  --create-namespace \
  --version 8.3.5 \
  --repo https://argoproj.github.io/argo-helm \
  --values argocd.zj.values.yaml \
  --atomic
                      

By default, you can install argocd with the upstream manifest:

                      kubectl create namespace argocd \
                      && kubectl apply -n argocd -f https://raw.githubusercontent.com/argoproj/argo-cd/stable/manifests/install.yaml

Or, you can use your own file link.

3. prepare argocd-server-external.yaml

                      Install By
                      kubectl -n argocd apply -f - <<EOF
                      apiVersion: v1
                      kind: Service
                      metadata:
                        labels:
                          app.kubernetes.io/component: server
                          app.kubernetes.io/instance: argo-cd
                          app.kubernetes.io/name: argocd-server-external
                          app.kubernetes.io/part-of: argocd
                        name: argocd-server-external
                      spec:
                        ports:
                        - name: https
                          port: 443
                          protocol: TCP
                          targetPort: 8080
                          nodePort: 30443
                        selector:
                          app.kubernetes.io/instance: argo-cd
                          app.kubernetes.io/name: argocd-server
                        type: NodePort
                      EOF

4. [Optional] prepare argocd-server-ingress.yaml

You also need the ingress-nginx or traefik components installed; if not, please check 🔗link

                      Install By

                      Before you create ingress, you need to create cert-manager and cert-issuer self-signed-ca-issuer, if not, please check 🔗link

                      kubectl -n argocd apply -f - <<EOF
                      apiVersion: networking.k8s.io/v1
                      kind: Ingress
                      metadata:
                        annotations:
                          cert-manager.io/cluster-issuer: self-signed-ca-issuer
                          nginx.ingress.kubernetes.io/backend-protocol: HTTPS
                        name: argo-cd-argocd-server
                        namespace: argocd
                      spec:
                        ingressClassName: nginx
                        rules:
                        - host: argo-cd.ay.dev
                          http:
                            paths:
                            - backend:
                                service:
                                  name: argo-cd-argocd-server
                                  port:
                                    number: 443
                              path: /
                              pathType: Prefix
                        tls:
                        - hosts:
                          - argo-cd.ay.dev
                          secretName: argo-cd.ay.dev-tls
                      EOF

                      Before you create ingress, you need to create cert-manager and cert-issuer lets-encrypt, if not, please check 🔗link

                      kubectl -n argocd apply -f - <<EOF
                      apiVersion: networking.k8s.io/v1
                      kind: Ingress
                      metadata:
                        annotations:
                          cert-manager.io/cluster-issuer: lets-encrypt
                          nginx.ingress.kubernetes.io/backend-protocol: HTTPS
                          nginx.ingress.kubernetes.io/ssl-passthrough: "true"
                        name: argo-cd-argocd-server
                        namespace: argocd
                      spec:
                        ingressClassName: nginx
                        rules:
                        - host: argo-cd.72602.online
                          http:
                            paths:
                            - backend:
                                service:
                                  name: argo-cd-argocd-server
                                  port:
                                    number: 443
                              path: /
                              pathType: Prefix
                        tls:
                        - hosts:
                          - argo-cd.72602.online
                          secretName: argo-cd.72602.online-tls
                      EOF

5. get the initial argocd admin password

                      kubectl -n argocd get secret argocd-initial-admin-secret -o jsonpath="{.data.password}" | base64 -d
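Kubernetes stores Secret data base64-encoded, which is why the `base64 -d` is needed; a self-contained sketch of the round trip with a made-up password:

```shell
# Hypothetical password, encoded the way a Secret stores it, then decoded
ENCODED=$(printf 'hunter2' | base64)
echo "$ENCODED"                       # -> aHVudGVyMg==
printf '%s' "$ENCODED" | base64 -d    # -> hunter2
```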

6. login to argocd

                      ARGOCD_PASS=$(kubectl -n argocd get secret argocd-initial-admin-secret -o jsonpath="{.data.password}" | base64 -d)
                      MASTER_IP=$(kubectl get nodes --selector=node-role.kubernetes.io/control-plane -o jsonpath='{$.items[0].status.addresses[?(@.type=="InternalIP")].address}')
                      argocd login --insecure --username admin $MASTER_IP:30443 --password $ARGOCD_PASS

If you deployed argocd in minikube, you might need to forward this port:

                      ssh -i ~/.minikube/machines/minikube/id_rsa docker@$(minikube ip) -L '*:30443:0.0.0.0:30443' -N -f
                      open https://$(minikube ip):30443

If you use ingress, you might need to trust the self-signed certificate in your browser:

                      kubectl -n basic-components get secret root-secret -o jsonpath='{.data.tls\.crt}' | base64 -d > cert-manager-self-signed-ca-secret.crt

                      import cert-manager-self-signed-ca-secret.crt into your browser

                      open https://argo-cd.ay.dev
                      Mar 7, 2024

                      Install Argo WorkFlow

                      Preliminary

• Kubernetes has been installed, if not check 🔗link
• Argo CD has been installed, if not check 🔗link
• cert-manager has been installed (via Argo CD) and a ClusterIssuer named self-signed-ca-issuer exists, if not check 🔗link
kubectl get namespace business-workflows > /dev/null 2>&1 || kubectl create namespace business-workflows

                      1. prepare argo-workflows.yaml

                      kubectl -n argocd apply -f - << EOF
                      apiVersion: argoproj.io/v1alpha1
                      kind: Application
                      metadata:
                        name: argo-workflows
                      spec:
                        syncPolicy:
                          syncOptions:
                          - CreateNamespace=true
                        project: default
                        source:
                          repoURL: https://argoproj.github.io/argo-helm
                          chart: argo-workflows
                          targetRevision: 0.45.27
                          helm:
                            releaseName: argo-workflows
                            values: |
                              crds:
                                install: true
                                keep: false
                              singleNamespace: false
                              controller:
                                image:
                                  registry: m.daocloud.io/quay.io
                                workflowNamespaces:
                                  - business-workflows
                              executor:
                                image:
                                  registry: m.daocloud.io/quay.io
                              workflow:
                                serviceAccount:
                                  create: true
                                rbac:
                                  create: true
                              server:
                                enabled: true
                                image:
                                  registry: m.daocloud.io/quay.io
                                ingress:
                                  enabled: true
                                  ingressClassName: nginx
                                  annotations:
                                    cert-manager.io/cluster-issuer: self-signed-ca-issuer
                                    nginx.ingress.kubernetes.io/rewrite-target: /$1
                                    nginx.ingress.kubernetes.io/use-regex: "true"
                                  hosts:
                                    - argo-workflows.ay.dev
                                  paths:
                                    - /?(.*)
                                  pathType: ImplementationSpecific
                                  tls:
                                    - secretName: argo-workflows.ay.dev-tls
                                      hosts:
                                        - argo-workflows.ay.dev
                                authModes:
                                  - server
                                  - client
                                sso:
                                  enabled: false
                        destination:
                          server: https://kubernetes.default.svc
                          namespace: workflows
                      EOF

2. install argo workflow binary, if not check 🔗link

3. [Optional] apply to k8s (if you saved the Application manifest above as argo-workflows.yaml instead of piping it to kubectl)

                      kubectl -n argocd apply -f argo-workflows.yaml

                      4. sync by argocd

                      argocd app sync argocd/argo-workflows

                      5. submit a test workflow

                      argo -n business-workflows submit https://raw.githubusercontent.com/argoproj/argo-workflows/master/examples/hello-world.yaml --serviceaccount=argo-workflow
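The upstream hello-world example submitted above is, roughly, a one-template Workflow like the following (paraphrased sketch; the real upstream file may differ in image and labels):

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: hello-world-
spec:
  serviceAccountName: argo-workflow   # matches --serviceaccount above
  entrypoint: main
  templates:
    - name: main
      container:
        image: busybox
        command: [echo]
        args: ["hello world"]
```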

                      6. check workflow status

                      # list all flows
                      argo -n business-workflows list
                      # get specific flow status
                      argo -n business-workflows get <$flow_name>
                      # get specific flow log
                      argo -n business-workflows logs <$flow_name>
                      # get specific flow log continuously
                      argo -n business-workflows logs <$flow_name> --watch
                      Mar 7, 2024

                      Install Argo Event

                      Preliminary

• Kubernetes has been installed, if not check 🔗link
• Argo CD has been installed, if not check 🔗link

                      1. prepare argo-events.yaml

                      apiVersion: argoproj.io/v1alpha1
                      kind: Application
                      metadata:
                        name: argo-events
                      spec:
                        syncPolicy:
                          syncOptions:
                          - CreateNamespace=true
                        project: default
                        source:
                          repoURL: https://argoproj.github.io/argo-helm
                          chart: argo-events
                          targetRevision: 2.4.2
                          helm:
                            releaseName: argo-events
                            values: |
                              openshift: false
                              createAggregateRoles: true
                              crds:
                                install: true
                                keep: true
                              global:
                                image:
                                  repository: m.daocloud.io/quay.io/argoproj/argo-events
                              controller:
                                replicas: 1
                                resources: {}
                              webhook:
                                enabled: true
                                replicas: 1
                                port: 12000
                                resources: {}
                              extraObjects:
                                - apiVersion: networking.k8s.io/v1
                                  kind: Ingress
                                  metadata:
                                    annotations:
                                      cert-manager.io/cluster-issuer: self-signed-ca-issuer
                                      nginx.ingress.kubernetes.io/rewrite-target: /$1
                                    labels:
                                      app.kubernetes.io/instance: argo-events
                                      app.kubernetes.io/managed-by: Helm
                                      app.kubernetes.io/name: argo-events-events-webhook
                                      app.kubernetes.io/part-of: argo-events
                                      argocd.argoproj.io/instance: argo-events
                                    name: argo-events-webhook
                                  spec:
                                    ingressClassName: nginx
                                    rules:
                                    - host: argo-events.webhook.ay.dev
                                      http:
                                        paths:
                                        - backend:
                                            service:
                                              name: events-webhook
                                              port:
                                                number: 12000
                                          path: /?(.*)
                                          pathType: ImplementationSpecific
                                    tls:
                                    - hosts:
                                      - argo-events.webhook.ay.dev
                                      secretName: argo-events-webhook-tls
                        destination:
                          server: https://kubernetes.default.svc
                          namespace: argocd

                      4. apply to k8s

                      kubectl -n argocd apply -f argo-events.yaml

                      5. sync by argocd

                      argocd app sync argocd/argo-events
                      Mar 7, 2024

                      Reloader

                      Install

                      Details
                      helm repo add stakater https://stakater.github.io/stakater-charts
                      helm repo update
                      helm install reloader stakater/reloader
                      Using AY Helm Mirror

                      for more information, you can check 🔗https://aaronyang0628.github.io/helm-chart-mirror/

                      helm repo add ay-helm-mirror https://aaronyang0628.github.io/helm-chart-mirror/charts
                      helm repo update
helm -n basic-components install reloader ay-helm-mirror/reloader
                      Details
                      kubectl apply -f https://raw.githubusercontent.com/stakater/Reloader/master/deployments/kubernetes/reloader.yaml
                      Using AY Gitee Mirror
                      kubectl apply -f https://gitee.com/aaron2333/aaaa/raw/main/bbbb.yaml
                      Using AY ACR Mirror
                      docker pull crpi-wixjy6gci86ms14e.cn-hongkong.personal.cr.aliyuncs.com/ay-mirror/xxxx
                      Using DaoCloud Mirror
                      docker pull m.daocloud.io/docker.io/library/xxxx

                      Usage

• For a Deployment called foo that uses a ConfigMap called foo-configmap, add this annotation to the metadata of your Deployment: configmap.reloader.stakater.com/reload: "foo-configmap"

• For a Deployment called foo that uses a Secret called foo-secret, add this annotation to the metadata of your Deployment: secret.reloader.stakater.com/reload: "foo-secret"

• After a successful installation, your Pods will get a rolling update whenever the data of a watched ConfigMap or Secret changes.
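
As a sketch, the first bullet applied to a minimal Deployment might look like the following (foo, foo-configmap, and the nginx image are placeholders, not from the original page):

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: foo
  annotations:
    # Reloader rolls this Deployment whenever foo-configmap's data changes
    configmap.reloader.stakater.com/reload: "foo-configmap"
spec:
  replicas: 1
  selector:
    matchLabels:
      app: foo
  template:
    metadata:
      labels:
        app: foo
    spec:
      containers:
        - name: foo
          image: docker.io/library/nginx:1.25
          envFrom:
            - configMapRef:
                name: foo-configmap
```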

                      Reference

                      For more information about reloader, please refer to https://github.com/stakater/Reloader

                      Container

                      Articles

FAQ

Q1: What is the difference between docker / podman / buildah?

• Docker is a full container engine built around a long-running daemon (dockerd); the docker CLI talks to that daemon to build, run, and distribute images.

• Podman is a daemonless container engine with a largely Docker-compatible CLI (an alias docker=podman usually just works); it supports rootless containers and manages pods natively.

• Buildah focuses solely on building OCI images; it has no daemon and does not manage long-running containers, and podman build reuses Buildah's code under the hood.

                      Mar 7, 2025

                      Subsections of Container

                      Install Buildah

                      Reference

                      Prerequisites

• Kernel Version Requirements: To run Buildah on Red Hat Enterprise Linux or CentOS, version 7.4 or higher is required. On other Linux distributions Buildah requires a kernel version that supports the OverlayFS and/or fuse-overlayfs filesystem – you’ll need to consult your distribution’s documentation to determine a minimum version number.

• runc Requirement: Buildah uses runc to run commands when buildah run is used, or when buildah build encounters a RUN instruction, so you’ll also need to build and install a compatible version of runc for Buildah to call in those cases. If Buildah is installed via a package manager such as yum, dnf or apt-get, runc will be installed as part of that process.

• CNI Requirement: When Buildah uses runc to run commands, it defaults to running those commands in the host’s network namespace. If the command is being run in a separate user namespace, though, for example when ID mapping is used, then the command will also be run in a separate network namespace.

                      A newly-created network namespace starts with no network interfaces, so commands which are run in that namespace are effectively disconnected from the network unless additional setup is done. Buildah relies on the CNI library and plugins to set up interfaces and routing for network namespaces.

If something goes wrong with CNI

                      If Buildah is installed via a package manager such as yum, dnf or apt-get, a package containing CNI plugins may be available (in Fedora, the package is named containernetworking-cni). If not, they will need to be installed, for example using:

                      git clone https://github.com/containernetworking/plugins
                      ( cd ./plugins; ./build_linux.sh )
                      sudo mkdir -p /opt/cni/bin
                      sudo install -v ./plugins/bin/* /opt/cni/bin

                      The CNI library needs to be configured so that it will know which plugins to call to set up namespaces. Usually, this configuration takes the form of one or more configuration files in the /etc/cni/net.d directory. A set of example configuration files is included in the docs/cni-examples directory of this source tree.
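
As a sketch, a minimal bridge configuration dropped into /etc/cni/net.d might look like the following (the file name, network name, and subnet are illustrative, not from the docs/cni-examples directory):

```json
{
  "cniVersion": "0.4.0",
  "name": "buildah-net",
  "type": "bridge",
  "bridge": "cni0",
  "isGateway": true,
  "ipMasq": true,
  "ipam": {
    "type": "host-local",
    "subnet": "10.88.0.0/16"
  }
}
```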

                      Installation

                      Caution

If apt update is already failing, please check the following 🔗link; adding the Docker apt source won't solve that problem.

                      sudo dnf update -y 
                      sudo dnf -y install buildah

Once the installation is complete, the buildah images command will list all local images:

buildah images
sudo yum -y install buildah

Once the installation is complete, verify it by printing the installed version (Buildah is daemonless, so there is no service to start):

buildah --version
sudo apt-get -y update
sudo apt-get -y install buildah
1. Verify that the installation was successful by creating a working container from the hello-world image:
sudo buildah from docker.io/library/hello-world

                      Info

• Docker images are saved in /var/lib/docker; Buildah and Podman store their images in /var/lib/containers/storage

                      Mirror

For Docker, you can modify /etc/docker/daemon.json

                      {
                        "registry-mirrors": ["<$mirror_url>"]
                      }

                      for example:

                      • https://docker.mirrors.ustc.edu.cn
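
Note that daemon.json only affects Docker. For Buildah and Podman, the equivalent is a mirror entry in /etc/containers/registries.conf; a sketch assuming the v2 TOML format (the DaoCloud mirror location is reused from the examples above):

```toml
[[registry]]
prefix = "docker.io"
location = "docker.io"

[[registry.mirror]]
location = "m.daocloud.io/docker.io"
```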
                      Mar 7, 2025

                      Install Docker

                      Mar 7, 2025

                      Install Podman

                      Reference

                      Installation

                      Caution

If apt update is already failing, please check the following 🔗link; adding the Docker apt source won't solve that problem.

                      sudo dnf update -y 
                      sudo dnf -y install podman
                      sudo yum install -y podman
                      sudo apt-get update
                      sudo apt-get -y install podman

                      Run Params

start a container

podman run [params]

--rm: automatically remove the container when it exits

-v: bind-mount a volume (host_path:container_path)

                      Example

podman run --rm \
    -v /root/kserve/iris-input.json:/tmp/iris-input.json \
    --privileged \
    -e MODEL_NAME=sklearn-iris \
    -e INPUT_PATH=/tmp/iris-input.json \
    -e SERVICE_HOSTNAME=sklearn-iris.kserve-test.example.com \
    -it m.daocloud.io/docker.io/library/golang:1.22 sh -c "command A; command B; exec bash"
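
The trailing sh -c "command A; command B; exec bash" runs the setup commands in order and then replaces the shell with an interactive bash. The sequencing part can be tried locally (command A and command B are placeholders, echoed here for illustration):

```shell
# run two setup commands in sequence inside one shell invocation,
# mirroring the "command A; command B" part of the podman example
sh -c 'echo "command A ran"; echo "command B ran"'
```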
                      Mar 7, 2025

                      Subsections of Database

                      Install Clickhouse

                      Installation

                      Install By

                      Preliminary

1. Kubernetes has been installed; if not, check 🔗link


2. Helm has been installed; if not, check 🔗link


Preliminary

1. Kubernetes has been installed; if not, check 🔗link


2. argoCD has been installed; if not, check 🔗link


3. cert-manager has been installed on argocd, and a ClusterIssuer named `self-signed-ca-issuer` exists; if not, check 🔗link


                      1.prepare admin credentials secret

                      Details
                      kubectl get namespaces database > /dev/null 2>&1 || kubectl create namespace database
                      kubectl -n database create secret generic clickhouse-admin-credentials \
                          --from-literal=password=$(tr -dc A-Za-z0-9 </dev/urandom | head -c 16)
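
The tr -dc A-Za-z0-9 </dev/urandom | head -c 16 pipeline above generates a random 16-character alphanumeric password from the kernel RNG; it can be checked in isolation:

```shell
# same generator as the secret above: 16 random alphanumeric characters
PASS=$(tr -dc A-Za-z0-9 </dev/urandom | head -c 16)
echo "generated password of length ${#PASS}"
```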

                      2.prepare `deploy-clickhouse.yaml`

                      Details
                      apiVersion: argoproj.io/v1alpha1
                      kind: Application
                      metadata:
                        name: clickhouse
                      spec:
                        syncPolicy:
                          syncOptions:
                          - CreateNamespace=true
                        project: default
                        source:
                          repoURL: https://charts.bitnami.com/bitnami
                          chart: clickhouse
                          targetRevision: 4.5.1
                          helm:
                            releaseName: clickhouse
                            values: |
                              serviceAccount:
                                name: clickhouse
                              image:
                                registry: m.daocloud.io/docker.io
                                pullPolicy: IfNotPresent
                              volumePermissions:
                                enabled: false
                                image:
                                  registry: m.daocloud.io/docker.io
                                  pullPolicy: IfNotPresent
                              zookeeper:
                                enabled: true
                                image:
                                  registry: m.daocloud.io/docker.io
                                  pullPolicy: IfNotPresent
                                replicaCount: 3
                                persistence:
                                  enabled: true
                                  storageClass: nfs-external
                                  size: 8Gi
                                volumePermissions:
                                  enabled: false
                                  image:
                                    registry: m.daocloud.io/docker.io
                                    pullPolicy: IfNotPresent
                              shards: 2
                              replicaCount: 3
                              ingress:
                                enabled: true
                                annotations:
                                  cert-manager.io/cluster-issuer: self-signed-ca-issuer
                                  nginx.ingress.kubernetes.io/rewrite-target: /$1
                                hostname: clickhouse.dev.geekcity.tech
                                ingressClassName: nginx
                                path: /?(.*)
                                tls: true
                              persistence:
                                enabled: false
                              resources:
                                requests:
                                  cpu: 2
                                  memory: 512Mi
                                limits:
                                  cpu: 3
                                  memory: 1024Mi
                              auth:
                                username: admin
                                existingSecret: clickhouse-admin-credentials
                                existingSecretKey: password
                              metrics:
                                enabled: true
                                image:
                                  registry: m.daocloud.io/docker.io
                                  pullPolicy: IfNotPresent
                                serviceMonitor:
                                  enabled: true
                                  namespace: monitor
                                  jobLabel: clickhouse
                                  selector:
                                    app.kubernetes.io/name: clickhouse
                                    app.kubernetes.io/instance: clickhouse
                                  labels:
                                    release: prometheus-stack
                              extraDeploy:
                                - |
                                  apiVersion: apps/v1
                                  kind: Deployment
                                  metadata:
                                    name: clickhouse-tool
                                    namespace: database
                                    labels:
                                      app.kubernetes.io/name: clickhouse-tool
                                  spec:
                                    replicas: 1
                                    selector:
                                      matchLabels:
                                        app.kubernetes.io/name: clickhouse-tool
                                    template:
                                      metadata:
                                        labels:
                                          app.kubernetes.io/name: clickhouse-tool
                                      spec:
                                        containers:
                                          - name: clickhouse-tool
                                            image: m.daocloud.io/docker.io/clickhouse/clickhouse-server:23.11.5.29-alpine
                                            imagePullPolicy: IfNotPresent
                                            env:
                                              - name: CLICKHOUSE_USER
                                                value: admin
                                              - name: CLICKHOUSE_PASSWORD
                                                valueFrom:
                                                  secretKeyRef:
                                                    key: password
                                                    name: clickhouse-admin-credentials
                                              - name: CLICKHOUSE_HOST
                                                value: csst-clickhouse.csst
                                              - name: CLICKHOUSE_PORT
                                                value: "9000"
                                              - name: TZ
                                                value: Asia/Shanghai
                                            command:
                                              - tail
                                            args:
                                              - -f
                                              - /etc/hosts
                        destination:
                          server: https://kubernetes.default.svc
                          namespace: database

                      3.deploy clickhouse

                      Details
                      kubectl -n argocd apply -f deploy-clickhouse.yaml

                      4.sync by argocd

                      Details
                      argocd app sync argocd/clickhouse

                      5.prepare `clickhouse-interface.yaml`

                      Details
                      apiVersion: v1
                      kind: Service
                      metadata:
                        labels:
                          app.kubernetes.io/component: clickhouse
                          app.kubernetes.io/instance: clickhouse
                        name: clickhouse-interface
                      spec:
                        ports:
                        - name: http
                          port: 8123
                          protocol: TCP
                          targetPort: http
                          nodePort: 31567
                        - name: tcp
                          port: 9000
                          protocol: TCP
                          targetPort: tcp
                          nodePort: 32005
                        selector:
                          app.kubernetes.io/component: clickhouse
                          app.kubernetes.io/instance: clickhouse
                          app.kubernetes.io/name: clickhouse
                        type: NodePort

                      6.apply to k8s

                      Details
                      kubectl -n database apply -f clickhouse-interface.yaml

                      7.extract clickhouse admin credentials

                      Details
                      kubectl -n database get secret clickhouse-admin-credentials -o jsonpath='{.data.password}' | base64 -d

                      8.invoke http api

                      Details
                      add `$K8S_MASTER_IP clickhouse.dev.geekcity.tech` to **/etc/hosts**
                      CK_PASS=$(kubectl -n database get secret clickhouse-admin-credentials -o jsonpath='{.data.password}' | base64 -d)
                      echo 'SELECT version()' | curl -k "https://admin:${CK_PASS}@clickhouse.dev.geekcity.tech:32443/" --data-binary @-
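
Kubernetes stores Secret values base64-encoded, which is why the extraction above pipes the jsonpath output through base64 -d. The round trip can be illustrated locally (S3cretPass is a made-up value):

```shell
# encode the way the API server stores it, then decode as in the step above
ENC=$(printf 'S3cretPass' | base64)
printf '%s' "$ENC" | base64 -d
```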

                      Preliminary

1. Docker has been installed; if not, check 🔗link


                      Using Proxy

you can run an additional DaoCloud image to accelerate your pulls; check Daocloud Proxy

                      1.init server

                      Details
                      mkdir -p clickhouse/{data,logs}
                      podman run --rm \
                          --ulimit nofile=262144:262144 \
                          --name clickhouse-server \
                          -p 18123:8123 \
                          -p 19000:9000 \
                          -v $(pwd)/clickhouse/data:/var/lib/clickhouse \
                          -v $(pwd)/clickhouse/logs:/var/log/clickhouse-server \
                          -e CLICKHOUSE_DB=my_database \
                          -e CLICKHOUSE_DEFAULT_ACCESS_MANAGEMENT=1 \
                          -e CLICKHOUSE_USER=ayayay \
                          -e CLICKHOUSE_PASSWORD=123456 \
                          -d m.daocloud.io/docker.io/clickhouse/clickhouse-server:23.11.5.29-alpine

                      2.check dashboard

                      And then you can visit 🔗http://localhost:18123

                      3.use cli api

And then you can connect to the native TCP endpoint at localhost:19000
                      Details
                      podman run --rm \
                        --entrypoint clickhouse-client \
                        -it m.daocloud.io/docker.io/clickhouse/clickhouse-server:23.11.5.29-alpine \
                        --host host.containers.internal \
                        --port 19000 \
                        --user ayayay \
                        --password 123456 \
                        --query "select version()"

                      4.use visual client

                      Details
                      podman run --rm -p 8080:80 -d m.daocloud.io/docker.io/spoonest/clickhouse-tabix-web-client:stable

                      Preliminary

1. Kubernetes has been installed; if not, check 🔗link


2. ArgoCD has been installed; if not, check 🔗link


3. Argo Workflow has been installed; if not, check 🔗link


                      1.prepare `argocd-login-credentials`

                      Details
kubectl get namespaces business-workflows > /dev/null 2>&1 || kubectl create namespace business-workflows
kubectl -n business-workflows create secret generic argocd-login-credentials \
    --from-literal=username=admin \
    --from-literal=password=$(kubectl -n argocd get secret argocd-initial-admin-secret -o jsonpath="{.data.password}" | base64 -d)

                      2.apply rolebinding to k8s

                      Details
                      kubectl apply -f - <<EOF
                      ---
                      apiVersion: rbac.authorization.k8s.io/v1
                      kind: ClusterRole
                      metadata:
                        name: application-administrator
                      rules:
                        - apiGroups:
                            - argoproj.io
                          resources:
                            - applications
                          verbs:
                            - '*'
                        - apiGroups:
                            - apps
                          resources:
                            - deployments
                          verbs:
                            - '*'
                      
                      ---
                      apiVersion: rbac.authorization.k8s.io/v1
                      kind: RoleBinding
                      metadata:
                        name: application-administration
                        namespace: argocd
                      roleRef:
                        apiGroup: rbac.authorization.k8s.io
                        kind: ClusterRole
                        name: application-administrator
                      subjects:
                        - kind: ServiceAccount
                          name: argo-workflow
                          namespace: business-workflows
                      
                      ---
                      apiVersion: rbac.authorization.k8s.io/v1
                      kind: RoleBinding
                      metadata:
                        name: application-administration
                        namespace: application
                      roleRef:
                        apiGroup: rbac.authorization.k8s.io
                        kind: ClusterRole
                        name: application-administrator
                      subjects:
                        - kind: ServiceAccount
                          name: argo-workflow
                          namespace: business-workflows
                      EOF

                      4.prepare clickhouse admin credentials secret

                      Details
                      kubectl get namespace application > /dev/null 2>&1 || kubectl create namespace application
                      kubectl -n application create secret generic clickhouse-admin-credentials \
                        --from-literal=password=$(tr -dc A-Za-z0-9 </dev/urandom | head -c 16)

5.prepare `deploy-clickhouse-flow.yaml`

                      Details
                      apiVersion: argoproj.io/v1alpha1
                      kind: Workflow
                      metadata:
                        generateName: deploy-argocd-app-ck-
                      spec:
                        entrypoint: entry
                        artifactRepositoryRef:
                          configmap: artifact-repositories
                          key: default-artifact-repository
                        serviceAccountName: argo-workflow
                        templates:
                        - name: entry
                          inputs:
                            parameters:
                            - name: argocd-server
                              value: argo-cd-argocd-server.argocd:443
                            - name: insecure-option
                              value: --insecure
                          dag:
                            tasks:
                            - name: apply
                              template: apply
                            - name: prepare-argocd-binary
                              template: prepare-argocd-binary
                              dependencies:
                              - apply
                            - name: sync
                              dependencies:
                              - prepare-argocd-binary
                              template: sync
                              arguments:
                                artifacts:
                                - name: argocd-binary
                                  from: "{{tasks.prepare-argocd-binary.outputs.artifacts.argocd-binary}}"
                                parameters:
                                - name: argocd-server
                                  value: "{{inputs.parameters.argocd-server}}"
                                - name: insecure-option
                                  value: "{{inputs.parameters.insecure-option}}"
                            - name: wait
                              dependencies:
                              - sync
                              template: wait
                              arguments:
                                artifacts:
                                - name: argocd-binary
                                  from: "{{tasks.prepare-argocd-binary.outputs.artifacts.argocd-binary}}"
                                parameters:
                                - name: argocd-server
                                  value: "{{inputs.parameters.argocd-server}}"
                                - name: insecure-option
                                  value: "{{inputs.parameters.insecure-option}}"
                        - name: apply
                          resource:
                            action: apply
                            manifest: |
                              apiVersion: argoproj.io/v1alpha1
                              kind: Application
                              metadata:
                                name: app-clickhouse
                                namespace: argocd
                              spec:
                                syncPolicy:
                                  syncOptions:
                                  - CreateNamespace=true
                                project: default
                                source:
                                  repoURL: https://charts.bitnami.com/bitnami
                                  chart: clickhouse
                                  targetRevision: 4.5.3
                                  helm:
                                    releaseName: app-clickhouse
                                    values: |
                                      image:
                                        registry: docker.io
                                        repository: bitnami/clickhouse
                                        tag: 23.12.3-debian-11-r0
                                        pullPolicy: IfNotPresent
                                      service:
                                        type: ClusterIP
                                      volumePermissions:
                                        enabled: false
                                        image:
                                          registry: m.daocloud.io/docker.io
                                          pullPolicy: IfNotPresent
                                      ingress:
                                        enabled: true
                                        ingressClassName: nginx
                                        annotations:
                                          cert-manager.io/cluster-issuer: self-signed-ca-issuer
                                          nginx.ingress.kubernetes.io/rewrite-target: /$1
                                        path: /?(.*)
                                        hostname: clickhouse.dev.geekcity.tech
                                        tls: true
                                      shards: 2
                                      replicaCount: 3
                                      persistence:
                                        enabled: false
                                      auth:
                                        username: admin
                                        existingSecret: clickhouse-admin-credentials
                                        existingSecretKey: password
                                      zookeeper:
                                        enabled: true
                                        image:
                                          registry: m.daocloud.io/docker.io
                                          repository: bitnami/zookeeper
                                          tag: 3.8.3-debian-11-r8
                                          pullPolicy: IfNotPresent
                                        replicaCount: 3
                                        persistence:
                                          enabled: false
                                        volumePermissions:
                                          enabled: false
                                          image:
                                            registry: m.daocloud.io/docker.io
                                            pullPolicy: IfNotPresent
                                destination:
                                  server: https://kubernetes.default.svc
                                  namespace: application
                        - name: prepare-argocd-binary
                          inputs:
                            artifacts:
                            - name: argocd-binary
                              path: /tmp/argocd
                              mode: 755
                              http:
                                url: https://files.m.daocloud.io/github.com/argoproj/argo-cd/releases/download/v2.9.3/argocd-linux-amd64
                          outputs:
                            artifacts:
                            - name: argocd-binary
                              path: "{{inputs.artifacts.argocd-binary.path}}"
                          container:
                            image: m.daocloud.io/docker.io/library/fedora:39
                            command:
                            - sh
                            - -c
                            args:
                            - |
                              ls -l {{inputs.artifacts.argocd-binary.path}}
                        - name: sync
                          inputs:
                            artifacts:
                            - name: argocd-binary
                              path: /usr/local/bin/argocd
                            parameters:
                            - name: argocd-server
                            - name: insecure-option
                              value: ""
                          container:
                            image: m.daocloud.io/docker.io/library/fedora:39
                            env:
                            - name: ARGOCD_USERNAME
                              valueFrom:
                                secretKeyRef:
                                  name: argocd-login-credentials
                                  key: username
                            - name: ARGOCD_PASSWORD
                              valueFrom:
                                secretKeyRef:
                                  name: argocd-login-credentials
                                  key: password
                            - name: WITH_PRUNE_OPTION
                              value: --prune
                            command:
                            - sh
                            - -c
                            args:
                            - |
                              set -e
                              export ARGOCD_SERVER={{inputs.parameters.argocd-server}}
                              export INSECURE_OPTION={{inputs.parameters.insecure-option}}
                              export ARGOCD_USERNAME=${ARGOCD_USERNAME:-admin}
                              argocd login ${INSECURE_OPTION} --username ${ARGOCD_USERNAME} --password ${ARGOCD_PASSWORD} ${ARGOCD_SERVER}
                              argocd app sync argocd/app-clickhouse ${WITH_PRUNE_OPTION} --timeout 300
                        - name: wait
                          inputs:
                            artifacts:
                            - name: argocd-binary
                              path: /usr/local/bin/argocd
                            parameters:
                            - name: argocd-server
                            - name: insecure-option
                              value: ""
                          container:
                            image: m.daocloud.io/docker.io/library/fedora:39
                            env:
                            - name: ARGOCD_USERNAME
                              valueFrom:
                                secretKeyRef:
                                  name: argocd-login-credentials
                                  key: username
                            - name: ARGOCD_PASSWORD
                              valueFrom:
                                secretKeyRef:
                                  name: argocd-login-credentials
                                  key: password
                            command:
                            - sh
                            - -c
                            args:
                            - |
                              set -e
                              export ARGOCD_SERVER={{inputs.parameters.argocd-server}}
                              export INSECURE_OPTION={{inputs.parameters.insecure-option}}
                              export ARGOCD_USERNAME=${ARGOCD_USERNAME:-admin}
                              argocd login ${INSECURE_OPTION} --username ${ARGOCD_USERNAME} --password ${ARGOCD_PASSWORD} ${ARGOCD_SERVER}
                              argocd app wait argocd/app-clickhouse

                      6.submit to argo workflow client

                      Details
                      argo -n business-workflows submit deploy-clickhouse-flow.yaml

                      7.extract clickhouse admin credentials

                      Details
                      kubectl -n application get secret clickhouse-admin-credentials -o jsonpath='{.data.password}' | base64 -d
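Kubernetes stores Secret values base64-encoded, which is why the pipeline above ends in `base64 -d`. A minimal local sketch of the same round trip (the password shown is a made-up example, no cluster needed):

```shell
# encode a sample password the way Kubernetes stores Secret data
encoded=$(printf 'S3cr3tP@ss' | base64)
# decode it back, mirroring the `| base64 -d` step above
printf '%s' "$encoded" | base64 -d
```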

                      8.invoke http api

                      Details
                      add `$K8S_MASTER_IP clickhouse.dev.geekcity.tech` to **/etc/hosts**
                      CK_PASSWORD=$(kubectl -n application get secret clickhouse-admin-credentials -o jsonpath='{.data.password}' | base64 -d) && echo 'SELECT version()' | curl -k "https://admin:${CK_PASSWORD}@clickhouse.dev.geekcity.tech/" --data-binary @-

                      9.create external interface

                      Details
                      kubectl -n application apply -f - <<EOF
                      apiVersion: v1
                      kind: Service
                      metadata:
                        labels:
                          app.kubernetes.io/component: clickhouse
                          app.kubernetes.io/instance: app-clickhouse
                          app.kubernetes.io/managed-by: Helm
                          app.kubernetes.io/name: clickhouse
                          app.kubernetes.io/version: 23.12.2
                          argocd.argoproj.io/instance: app-clickhouse
                          helm.sh/chart: clickhouse-4.5.3
                        name: app-clickhouse-service-external
                      spec:
                        ports:
                        - name: tcp
                          port: 9000
                          protocol: TCP
                          targetPort: tcp
                          nodePort: 30900
                        selector:
                          app.kubernetes.io/component: clickhouse
                          app.kubernetes.io/instance: app-clickhouse
                          app.kubernetes.io/name: clickhouse
                        type: NodePort
                      EOF
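With the NodePort in place, external clients can reach ClickHouse's native TCP protocol through any cluster node. A sketch of the connection command (assumes `clickhouse-client` is installed locally; `K8S_NODE_IP` and `CK_PASSWORD` are placeholders you must substitute with a real node IP and the decoded admin password):

```shell
# placeholders: substitute a real node IP and the decoded admin password
K8S_NODE_IP=192.168.1.10
CK_PASSWORD=changeit
# nodePort 30900 maps to the service's native TCP port 9000; print the command to run
echo "clickhouse-client --host ${K8S_NODE_IP} --port 30900 --user admin --password ${CK_PASSWORD} --query 'SELECT version()'"
```

Run the printed command once the placeholders hold real values.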


                      Mar 7, 2024

                      Install ElasticSearch

                      Installation

                      Install By

                      Preliminary

                      1. Kubernetes is installed; if not, check 🔗link


                      2. Helm is installed; if not, check 🔗link


                      1.get helm repo

                      Details
                      helm repo add ay-helm-mirror https://aaronyang0628.github.io/helm-chart-mirror/charts
                      helm repo update

                      2.install chart

                      Details
                      helm install ay-helm-mirror/elasticsearch --generate-name
                      Using Proxy

                      Preliminary

                      1. Kubernetes is installed; if not, check 🔗link


                      2. Helm is installed; if not, check 🔗link


                      3. ArgoCD is installed; if not, check 🔗link


                      1.prepare `deploy-elasticsearch.yaml`

                      Details
                      kubectl apply -f - << EOF
                      apiVersion: argoproj.io/v1alpha1
                      kind: Application
                      metadata:
                        name: elastic-search
                      spec:
                        syncPolicy:
                          syncOptions:
                          - CreateNamespace=true
                        project: default
                        source:
                          repoURL: https://charts.bitnami.com/bitnami
                          chart: elasticsearch
                          targetRevision: 19.11.3
                          helm:
                            releaseName: elastic-search
                            values: |
                              global:
                                kibanaEnabled: true
                              clusterName: elastic
                              image:
                                registry: m.zjvis.net/docker.io
                                pullPolicy: IfNotPresent
                              security:
                                enabled: false
                              service:
                                type: ClusterIP
                              ingress:
                                enabled: true
                                annotations:
                                  cert-manager.io/cluster-issuer: self-signed-ca-issuer
                                  nginx.ingress.kubernetes.io/rewrite-target: /$1
                                hostname: elastic-search.dev.tech
                                ingressClassName: nginx
                                path: /?(.*)
                                tls: true
                              master:
                                masterOnly: false
                                replicaCount: 1
                                persistence:
                                  enabled: false
                                resources:
                                  requests:
                                    cpu: 2
                                    memory: 1024Mi
                                  limits:
                                    cpu: 4
                                    memory: 4096Mi
                                heapSize: 2g
                              data:
                                replicaCount: 0
                                persistence:
                                  enabled: false
                              coordinating:
                                replicaCount: 0
                              ingest:
                                enabled: true
                                replicaCount: 0
                                service:
                                  enabled: false
                                  type: ClusterIP
                                ingress:
                                  enabled: false
                              metrics:
                                enabled: false
                                image:
                                  registry: m.zjvis.net/docker.io
                                  pullPolicy: IfNotPresent
                              volumePermissions:
                                enabled: false
                                image:
                                  registry: m.zjvis.net/docker.io
                                  pullPolicy: IfNotPresent
                              sysctlImage:
                                enabled: true
                                registry: m.zjvis.net/docker.io
                                pullPolicy: IfNotPresent
                              kibana:
                                elasticsearch:
                                  hosts:
                                    - '{{ include "elasticsearch.service.name" . }}'
                                  port: '{{ include "elasticsearch.service.ports.restAPI" . }}'
                              esJavaOpts: "-Xmx2g -Xms2g"        
                        destination:
                          server: https://kubernetes.default.svc
                          namespace: application
                      EOF

                      3.sync by argocd

                      Details
                      argocd app sync argocd/elastic-search

                      4.extract elasticsearch admin credentials

                      Details
                      security is disabled in the values above (`security.enabled: false`), so the chart creates no admin credential secret and this step can be skipped

                      5.invoke http api

                      Details
                      add `$K8S_MASTER_IP elastic-search.dev.tech` to `/etc/hosts`
                      curl -k -H "Content-Type: application/json" \
                          -X POST "https://elastic-search.dev.tech:32443/books/_doc?pretty" \
                          -d '{"name": "Snow Crash", "author": "Neal Stephenson", "release_date": "1992-06-01", "page_count": 470}'
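To read the document back, the same endpoint serves `_search`. A sketch that builds the query body locally (the field name matches the document indexed above; the `curl` line in the comment reuses the same host entry):

```shell
# match query against the "name" field of the book document indexed above
QUERY='{"query":{"match":{"name":"Snow Crash"}}}'
printf '%s\n' "$QUERY"
# then POST it against the cluster:
#   curl -k -H "Content-Type: application/json" \
#       -X GET "https://elastic-search.dev.tech:32443/books/_search?pretty" -d "$QUERY"
```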

                      Preliminary

                      1. Docker|Podman|Buildah is installed; if not, check 🔗link


                      Using Mirror

                      you can run an additional DaoCloud mirror to accelerate image pulling, check Daocloud Proxy

                      1.init server

                      Details

                      Preliminary

                      1. Kubernetes is installed; if not, check 🔗link


                      2. Helm is installed; if not, check 🔗link


                      3. ArgoCD is installed; if not, check 🔗link


                      4. Argo Workflow is installed; if not, check 🔗link


                      1.prepare the `database` namespace

                      Details
                      kubectl get namespaces database > /dev/null 2>&1 || kubectl create namespace database

                      2.apply rolebinding to k8s

                      Details
                      kubectl apply -f - <<EOF
                      ---
                      apiVersion: rbac.authorization.k8s.io/v1
                      kind: ClusterRole
                      metadata:
                        name: application-administrator
                      rules:
                        - apiGroups:
                            - argoproj.io
                          resources:
                            - applications
                          verbs:
                            - '*'
                        - apiGroups:
                            - apps
                          resources:
                            - deployments
                          verbs:
                            - '*'
                      
                      ---
                      apiVersion: rbac.authorization.k8s.io/v1
                      kind: RoleBinding
                      metadata:
                        name: application-administration
                        namespace: argocd
                      roleRef:
                        apiGroup: rbac.authorization.k8s.io
                        kind: ClusterRole
                        name: application-administrator
                      subjects:
                        - kind: ServiceAccount
                          name: argo-workflow
                          namespace: business-workflows
                      
                      ---
                      apiVersion: rbac.authorization.k8s.io/v1
                      kind: RoleBinding
                      metadata:
                        name: application-administration
                        namespace: application
                      roleRef:
                        apiGroup: rbac.authorization.k8s.io
                        kind: ClusterRole
                        name: application-administrator
                      subjects:
                        - kind: ServiceAccount
                          name: argo-workflow
                          namespace: business-workflows
                      EOF

                      4.prepare `deploy-xxxx-flow.yaml`

                      Details

                      6.submit to argo workflow client

                      Details
                      argo -n business-workflows submit deploy-xxxx-flow.yaml

                      7.decode password

                      Details
                      kubectl -n application get secret xxxx-credentials -o jsonpath='{.data.xxx-password}' | base64 -d


                      Apr 12, 2024

                      Install Kafka

                      Installation

                      Install By

                      Preliminary

                      1. Kubernetes is installed; if not, check 🔗link


                      2. The Helm binary is installed; if not, check 🔗link


                      1.get helm repo

                      Details
                      helm repo add bitnami https://charts.bitnami.com/bitnami
                      helm repo update

                      2.install chart

                      helm upgrade --create-namespace -n database kafka --install bitnami/kafka \
                        --set global.imageRegistry=m.daocloud.io/docker.io \
                        --set zookeeper.enabled=false \
                        --set controller.replicaCount=1 \
                        --set broker.replicaCount=1 \
                        --set persistence.enabled=false \
                        --version 28.0.3
                      
                      Details
                      kubectl -n database \
                        create secret generic client-properties \
                        --from-literal=client.properties="$(printf "security.protocol=SASL_PLAINTEXT\nsasl.mechanism=SCRAM-SHA-256\nsasl.jaas.config=org.apache.kafka.common.security.scram.ScramLoginModule required username=\"user1\" password=\"$(kubectl get secret kafka-user-passwords --namespace database -o jsonpath='{.data.client-passwords}' | base64 -d | cut -d , -f 1)\";\n")"
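The `printf` above renders a three-line SASL/SCRAM client config into the secret; with a dummy password in place of the value pulled from `kafka-user-passwords`, it expands locally to:

```shell
PASSWORD='dummy-pass'  # stand-in for the value extracted from the kafka-user-passwords secret
printf "security.protocol=SASL_PLAINTEXT\nsasl.mechanism=SCRAM-SHA-256\nsasl.jaas.config=org.apache.kafka.common.security.scram.ScramLoginModule required username=\"user1\" password=\"%s\";\n" "$PASSWORD"
```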
                      Details
                      kubectl -n database apply -f - << EOF
                      apiVersion: apps/v1
                      kind: Deployment
                      metadata:
                        name: kafka-client-tools
                        labels:
                          app: kafka-client-tools
                      spec:
                        replicas: 1
                        selector:
                          matchLabels:
                            app: kafka-client-tools
                        template:
                          metadata:
                            labels:
                              app: kafka-client-tools
                          spec:
                            volumes:
                            - name: client-properties
                              secret:
                                secretName: client-properties
                            containers:
                            - name: kafka-client-tools
                              image: m.daocloud.io/docker.io/bitnami/kafka:3.6.2
                              volumeMounts:
                              - name: client-properties
                                mountPath: /bitnami/custom/client.properties
                                subPath: client.properties
                                readOnly: true
                              env:
                              - name: BOOTSTRAP_SERVER
                                value: kafka.database.svc.cluster.local:9092
                              - name: CLIENT_CONFIG_FILE
                                value: /bitnami/custom/client.properties
                              command:
                              - tail
                              - -f
                              - /etc/hosts
                              imagePullPolicy: IfNotPresent
                      EOF

                      3.validate functionality

                      - list topics
                      Details
                      kubectl -n database exec -it deployment/kafka-client-tools -- bash -c \
                          'kafka-topics.sh --bootstrap-server $BOOTSTRAP_SERVER --command-config $CLIENT_CONFIG_FILE --list'
                      - create topic
                      Details
                      kubectl -n database exec -it deployment/kafka-client-tools -- bash -c \
                        'kafka-topics.sh --bootstrap-server $BOOTSTRAP_SERVER --command-config $CLIENT_CONFIG_FILE --create --if-not-exists --topic test-topic'
                      - describe topic
                      Details
                      kubectl -n database exec -it deployment/kafka-client-tools -- bash -c \
                        'kafka-topics.sh --bootstrap-server $BOOTSTRAP_SERVER --command-config $CLIENT_CONFIG_FILE --describe --topic test-topic'
                      - produce message
                      Details
                      kubectl -n database exec -it deployment/kafka-client-tools -- bash -c \
                        'for message in $(seq 0 10); do echo $message | kafka-console-producer.sh --bootstrap-server $BOOTSTRAP_SERVER --producer.config $CLIENT_CONFIG_FILE --topic test-topic; done'
                      - consume message
                      Details
                      kubectl -n database exec -it deployment/kafka-client-tools -- bash -c \
                        'kafka-console-consumer.sh --bootstrap-server $BOOTSTRAP_SERVER --consumer.config $CLIENT_CONFIG_FILE --topic test-topic --from-beginning'
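The produce step above iterates over `seq 0 10`, piping one number per iteration into the console producer, so the consumer should print eleven messages (0 through 10). The loop expands locally to:

```shell
# each iteration pipes one number into the console producer; run locally it just prints them
for message in $(seq 0 10); do echo "$message"; done
```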

                      Preliminary

                      1. Kubernetes is installed; if not, check 🔗link


                      2. ArgoCD is installed; if not, check 🔗link


                      3. The Helm binary is installed; if not, check 🔗link


                      1.prepare `deploy-kafka.yaml`

                      kubectl -n argocd apply -f - << EOF
                      apiVersion: argoproj.io/v1alpha1
                      kind: Application
                      metadata:
                        name: kafka
                      spec:
                        syncPolicy:
                          syncOptions:
                          - CreateNamespace=true
                        project: default
                        source:
                          repoURL: https://charts.bitnami.com/bitnami
                          chart: kafka
                          targetRevision: 28.0.3
                          helm:
                            releaseName: kafka
                            values: |
                              image:
                                registry: m.daocloud.io/docker.io
                              controller:
                                replicaCount: 1
                                persistence:
                                  enabled: false
                                logPersistence:
                                  enabled: false
                                extraConfig: |
                                  message.max.bytes=5242880
                                  default.replication.factor=1
                                  offsets.topic.replication.factor=1
                                  transaction.state.log.replication.factor=1
                              broker:
                                replicaCount: 1
                                persistence:
                                  enabled: false
                                logPersistence:
                                  enabled: false
                                extraConfig: |
                                  message.max.bytes=5242880
                                  default.replication.factor=1
                                  offsets.topic.replication.factor=1
                                  transaction.state.log.replication.factor=1
                              externalAccess:
                                enabled: false
                                autoDiscovery:
                                  enabled: false
                                  image:
                                    registry: m.daocloud.io/docker.io
                              volumePermissions:
                                enabled: false
                                image:
                                  registry: m.daocloud.io/docker.io
                              metrics:
                                kafka:
                                  enabled: false
                                  image:
                                    registry: m.daocloud.io/docker.io
                                jmx:
                                  enabled: false
                                  image:
                                    registry: m.daocloud.io/docker.io
                              provisioning:
                                enabled: false
                              kraft:
                                enabled: true
                              zookeeper:
                                enabled: false
                        destination:
                          server: https://kubernetes.default.svc
                          namespace: database
                      EOF
                      kubectl -n argocd apply -f - << EOF
                      apiVersion: argoproj.io/v1alpha1
                      kind: Application
                      metadata:
                        name: kafka
                      spec:
                        syncPolicy:
                          syncOptions:
                          - CreateNamespace=true
                        project: default
                        source:
                          repoURL: https://charts.bitnami.com/bitnami
                          chart: kafka
                          targetRevision: 28.0.3
                          helm:
                            releaseName: kafka
                            values: |
                              image:
                                registry: m.daocloud.io/docker.io
                              listeners:
                                client:
                                  protocol: PLAINTEXT
                                interbroker:
                                  protocol: PLAINTEXT
                              controller:
                                replicaCount: 0
                                persistence:
                                  enabled: false
                                logPersistence:
                                  enabled: false
                                extraConfig: |
                                  message.max.bytes=5242880
                                  default.replication.factor=1
                                  offsets.topic.replication.factor=1
                                  transaction.state.log.replication.factor=1
                              broker:
                                replicaCount: 1
                                minId: 0
                                persistence:
                                  enabled: false
                                logPersistence:
                                  enabled: false
                                extraConfig: |
                                  message.max.bytes=5242880
                                  default.replication.factor=1
                                  offsets.topic.replication.factor=1
                                  transaction.state.log.replication.factor=1
                              externalAccess:
                                enabled: false
                                autoDiscovery:
                                  enabled: false
                                  image:
                                    registry: m.daocloud.io/docker.io
                              volumePermissions:
                                enabled: false
                                image:
                                  registry: m.daocloud.io/docker.io
                              metrics:
                                kafka:
                                  enabled: false
                                  image:
                                    registry: m.daocloud.io/docker.io
                                jmx:
                                  enabled: false
                                  image:
                                    registry: m.daocloud.io/docker.io
                              provisioning:
                                enabled: false
                              kraft:
                                enabled: false
                              zookeeper:
                                enabled: true
                                image:
                                  registry: m.daocloud.io/docker.io
                                replicaCount: 1
                                auth:
                                  client:
                                    enabled: false
                                  quorum:
                                    enabled: false
                                persistence:
                                  enabled: false
                                volumePermissions:
                                  enabled: false
                                  image:
                                    registry: m.daocloud.io/docker.io
                                  metrics:
                                    enabled: false
                                tls:
                                  client:
                                    enabled: false
                                  quorum:
                                    enabled: false
                        destination:
                          server: https://kubernetes.default.svc
                          namespace: database
                      EOF

                      2.sync by argocd

                      Details
                      argocd app sync argocd/kafka

                      3.set up client tool

                      # Option A: SASL listeners, build client.properties from the generated user password
                      kubectl -n database \
                          create secret generic client-properties \
                          --from-literal=client.properties="$(printf "security.protocol=SASL_PLAINTEXT\nsasl.mechanism=SCRAM-SHA-256\nsasl.jaas.config=org.apache.kafka.common.security.scram.ScramLoginModule required username=\"user1\" password=\"$(kubectl get secret kafka-user-passwords --namespace database -o jsonpath='{.data.client-passwords}' | base64 -d | cut -d , -f 1)\";\n")"
                      # Option B: no authentication, plain PLAINTEXT protocol
                      # Run only one of the two: both create a secret with the same name
                      kubectl -n database \
                          create secret generic client-properties \
                          --from-literal=client.properties="security.protocol=PLAINTEXT"
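For reference, the SASL variant stores a three-line `client.properties`. A sketch with a hypothetical username and password (the real values come from the `kafka-user-passwords` secret):

```shell
# Hypothetical credentials for illustration; the real password is read
# from the kafka-user-passwords secret as shown above.
CLIENT_PROPS="$(printf 'security.protocol=SASL_PLAINTEXT\nsasl.mechanism=SCRAM-SHA-256\nsasl.jaas.config=org.apache.kafka.common.security.scram.ScramLoginModule required username="user1" password="changeit";\n')"
printf '%s\n' "$CLIENT_PROPS"
```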

                      4.prepare `kafka-client-tools.yaml`

                      Details
                      kubectl -n database apply -f - << EOF
                      apiVersion: apps/v1
                      kind: Deployment
                      metadata:
                        name: kafka-client-tools
                        labels:
                          app: kafka-client-tools
                      spec:
                        replicas: 1
                        selector:
                          matchLabels:
                            app: kafka-client-tools
                        template:
                          metadata:
                            labels:
                              app: kafka-client-tools
                          spec:
                            volumes:
                            - name: client-properties
                              secret:
                                secretName: client-properties
                            containers:
                            - name: kafka-client-tools
                              image: m.daocloud.io/docker.io/bitnami/kafka:3.6.2
                              volumeMounts:
                              - name: client-properties
                                mountPath: /bitnami/custom/client.properties
                                subPath: client.properties
                                readOnly: true
                              env:
                              - name: BOOTSTRAP_SERVER
                                value: kafka.database.svc.cluster.local:9092
                              - name: CLIENT_CONFIG_FILE
                                value: /bitnami/custom/client.properties
                              - name: ZOOKEEPER_CONNECT
                                value: kafka-zookeeper.database.svc.cluster.local:2181
                              command:
                              - tail
                              - -f
                              - /etc/hosts
                              imagePullPolicy: IfNotPresent
                      EOF

                      5.validate function

                      - list topics
                      Details
                      kubectl -n database exec -it deployment/kafka-client-tools -- bash -c \
                          'kafka-topics.sh --bootstrap-server $BOOTSTRAP_SERVER --command-config $CLIENT_CONFIG_FILE --list'
                      - create topic
                      Details
                      kubectl -n database exec -it deployment/kafka-client-tools -- bash -c \
                        'kafka-topics.sh --bootstrap-server $BOOTSTRAP_SERVER --command-config $CLIENT_CONFIG_FILE --create --if-not-exists --topic test-topic'
                      - describe topic
                      Details
                      kubectl -n database exec -it deployment/kafka-client-tools -- bash -c \
                        'kafka-topics.sh --bootstrap-server $BOOTSTRAP_SERVER --command-config $CLIENT_CONFIG_FILE --describe --topic test-topic'
                      - produce message
                      Details
                      kubectl -n database exec -it deployment/kafka-client-tools -- bash -c \
                        'for message in $(seq 0 10); do echo $message | kafka-console-producer.sh --bootstrap-server $BOOTSTRAP_SERVER --producer.config $CLIENT_CONFIG_FILE --topic test-topic; done'
                      - consume message
                      Details
                      kubectl -n database exec -it deployment/kafka-client-tools -- bash -c \
                        'kafka-console-consumer.sh --bootstrap-server $BOOTSTRAP_SERVER --consumer.config $CLIENT_CONFIG_FILE --topic test-topic --from-beginning'
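Note that the producer loop above publishes one record per value of `seq 0 10`, so a fresh consumer should read back 11 messages:

```shell
# seq is inclusive on both ends, so 0..10 yields 11 payloads
seq 0 10 | wc -l   # -> 11
```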

                      Preliminary

                      1. Docker has been installed; if not, check 🔗link


                      Using Proxy

                      You can run an additional DaoCloud proxy image to accelerate image pulls; check Daocloud Proxy

                      1.init server

                      Details
                      mkdir -p kafka/data
                      chmod -R 777 kafka/data
                      podman run --rm \
                          --name kafka-server \
                          --hostname kafka-server \
                          -p 9092:9092 \
                          -p 9094:9094 \
                          -v $(pwd)/kafka/data:/bitnami/kafka/data \
                          -e KAFKA_CFG_NODE_ID=0 \
                          -e KAFKA_CFG_PROCESS_ROLES=controller,broker \
                          -e KAFKA_CFG_CONTROLLER_QUORUM_VOTERS=0@kafka-server:9093 \
                          -e KAFKA_CFG_LISTENERS=PLAINTEXT://:9092,CONTROLLER://:9093,EXTERNAL://:9094 \
                          -e KAFKA_CFG_ADVERTISED_LISTENERS=PLAINTEXT://kafka-server:9092,EXTERNAL://host.containers.internal:9094 \
                          -e KAFKA_CFG_LISTENER_SECURITY_PROTOCOL_MAP=CONTROLLER:PLAINTEXT,EXTERNAL:PLAINTEXT,PLAINTEXT:PLAINTEXT \
                          -e KAFKA_CFG_CONTROLLER_LISTENER_NAMES=CONTROLLER \
                          -d m.daocloud.io/docker.io/bitnami/kafka:3.6.2

                      2.list topic

                      Details
                      BOOTSTRAP_SERVER=host.containers.internal:9094
                      podman run --rm \
                          -it m.daocloud.io/docker.io/bitnami/kafka:3.6.2 kafka-topics.sh \
                              --bootstrap-server $BOOTSTRAP_SERVER --list

                      3.create topic

                      Details
                      BOOTSTRAP_SERVER=host.containers.internal:9094
                      # BOOTSTRAP_SERVER=10.200.60.64:9094
                      TOPIC=test-topic
                      podman run --rm \
                          -it m.daocloud.io/docker.io/bitnami/kafka:3.6.2 kafka-topics.sh \
                              --bootstrap-server $BOOTSTRAP_SERVER \
                              --create \
                              --if-not-exists \
                              --topic $TOPIC

                      4.consume record

                      Details
                      BOOTSTRAP_SERVER=host.containers.internal:9094
                      # BOOTSTRAP_SERVER=10.200.60.64:9094
                      TOPIC=test-topic
                      podman run --rm \
                          -it m.daocloud.io/docker.io/bitnami/kafka:3.6.2 kafka-console-consumer.sh \
                              --bootstrap-server $BOOTSTRAP_SERVER \
                              --topic $TOPIC \
                              --from-beginning


                      Mar 7, 2024

                      Install MariaDB

                      Installation

                      Install By

                      Preliminary

                      1. Kubernetes has been installed; if not, check 🔗link


                      2. Helm has been installed; if not, check 🔗link


                      Preliminary

                      1. Kubernetes has been installed; if not, check 🔗link


                      2. argoCD has been installed; if not, check 🔗link


                      3. cert-manager has been installed on argocd, and a ClusterIssuer named `self-signed-ca-issuer` exists; if not, check 🔗link


                      1.prepare mariadb credentials secret

                      Details
                      kubectl get namespaces database > /dev/null 2>&1 || kubectl create namespace database
                      kubectl -n database create secret generic mariadb-credentials \
                          --from-literal=mariadb-root-password=$(tr -dc A-Za-z0-9 </dev/urandom | head -c 16) \
                          --from-literal=mariadb-replication-password=$(tr -dc A-Za-z0-9 </dev/urandom | head -c 16) \
                          --from-literal=mariadb-password=$(tr -dc A-Za-z0-9 </dev/urandom | head -c 16)
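The `tr`/`head` pipeline generates the random passwords. A close equivalent that reads a fixed amount of `/dev/urandom` first (avoiding an endless read) shows the result is always 16 alphanumeric characters:

```shell
# random bytes -> keep alphanumerics -> first 16 characters
PASS="$(head -c 512 /dev/urandom | tr -dc A-Za-z0-9 | cut -c1-16)"
echo "${#PASS}"   # -> 16
```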

                      2.prepare `deploy-mariadb.yaml`

                      Details
                      apiVersion: argoproj.io/v1alpha1
                      kind: Application
                      metadata:
                        name: mariadb
                      spec:
                        syncPolicy:
                          syncOptions:
                          - CreateNamespace=true
                        project: default
                        source:
                          repoURL: https://charts.bitnami.com/bitnami
                          chart: mariadb
                          targetRevision: 16.3.2
                          helm:
                            releaseName: mariadb
                            values: |
                              architecture: standalone
                              auth:
                                database: test-mariadb
                                username: aaron.yang
                                existingSecret: mariadb-credentials
                              primary:
                                extraFlags: "--character-set-server=utf8mb4 --collation-server=utf8mb4_bin"
                                persistence:
                                  enabled: false
                              secondary:
                                replicaCount: 1
                                persistence:
                                  enabled: false
                              image:
                                registry: m.daocloud.io/docker.io
                                pullPolicy: IfNotPresent
                              volumePermissions:
                                enabled: false
                                image:
                                  registry: m.daocloud.io/docker.io
                                  pullPolicy: IfNotPresent
                              metrics:
                                enabled: false
                                image:
                                  registry: m.daocloud.io/docker.io
                                  pullPolicy: IfNotPresent
                        destination:
                          server: https://kubernetes.default.svc
                          namespace: database

                      3.deploy mariadb

                      Details
                      kubectl -n argocd apply -f deploy-mariadb.yaml

                      4.sync by argocd

                      Details
                      argocd app sync argocd/mariadb

                      5.check mariadb

                      Details
                      kubectl -n database get secret mariadb-credentials -o jsonpath='{.data.mariadb-root-password}' | base64 -d
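The secret value is stored base64-encoded, which is why the command ends with `base64 -d`. A round-trip with a made-up password illustrates the decode step:

```shell
# kubectl stores secret data base64-encoded; base64 -d recovers the plain text
printf 'p4ssw0rd' | base64        # -> cDRzc3cwcmQ=
printf 'cDRzc3cwcmQ=' | base64 -d # -> p4ssw0rd
```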

                      Preliminary

                      1. Kubernetes has been installed; if not, check 🔗link


                      2. ArgoCD has been installed; if not, check 🔗link


                      3. Argo Workflows has been installed; if not, check 🔗link


                      1.prepare `argocd-login-credentials`

                      Details
                      kubectl get namespace business-workflows > /dev/null 2>&1 || kubectl create namespace business-workflows
                      kubectl -n business-workflows create secret generic argocd-login-credentials \
                          --from-literal=username=admin \
                          --from-literal=password=$(kubectl -n argocd get secret argocd-initial-admin-secret -o jsonpath="{.data.password}" | base64 -d)

                      2.apply rolebinding to k8s

                      Details
                      kubectl -n argocd apply -f - <<EOF
                      ---
                      apiVersion: rbac.authorization.k8s.io/v1
                      kind: ClusterRole
                      metadata:
                        name: application-administrator
                      rules:
                        - apiGroups:
                            - argoproj.io
                          resources:
                            - applications
                          verbs:
                            - '*'
                        - apiGroups:
                            - apps
                          resources:
                            - deployments
                          verbs:
                            - '*'
                      
                      ---
                      apiVersion: rbac.authorization.k8s.io/v1
                      kind: RoleBinding
                      metadata:
                        name: application-administration
                        namespace: argocd
                      roleRef:
                        apiGroup: rbac.authorization.k8s.io
                        kind: ClusterRole
                        name: application-administrator
                      subjects:
                        - kind: ServiceAccount
                          name: argo-workflow
                          namespace: business-workflows
                      
                      ---
                      apiVersion: rbac.authorization.k8s.io/v1
                      kind: RoleBinding
                      metadata:
                        name: application-administration
                        namespace: application
                      roleRef:
                        apiGroup: rbac.authorization.k8s.io
                        kind: ClusterRole
                        name: application-administrator
                      subjects:
                        - kind: ServiceAccount
                          name: argo-workflow
                          namespace: business-workflows
                      EOF

                      3.prepare mariadb credentials secret

                      Details
                      kubectl -n application create secret generic mariadb-credentials \
                        --from-literal=mariadb-root-password=$(tr -dc A-Za-z0-9 </dev/urandom | head -c 16) \
                        --from-literal=mariadb-replication-password=$(tr -dc A-Za-z0-9 </dev/urandom | head -c 16) \
                        --from-literal=mariadb-password=$(tr -dc A-Za-z0-9 </dev/urandom | head -c 16)

                      4.prepare `deploy-mariadb-flow.yaml`

                      Details
                      apiVersion: argoproj.io/v1alpha1
                      kind: Workflow
                      metadata:
                        generateName: deploy-argocd-app-mariadb-
                      spec:
                        entrypoint: entry
                        artifactRepositoryRef:
                          configmap: artifact-repositories
                          key: default-artifact-repository
                        serviceAccountName: argo-workflow
                        templates:
                        - name: entry
                          inputs:
                            parameters:
                            - name: argocd-server
                              value: argo-cd-argocd-server.argocd:443
                            - name: insecure-option
                              value: --insecure
                          dag:
                            tasks:
                            - name: apply
                              template: apply
                            - name: prepare-argocd-binary
                              template: prepare-argocd-binary
                              dependencies:
                              - apply
                            - name: sync
                              dependencies:
                              - prepare-argocd-binary
                              template: sync
                              arguments:
                                artifacts:
                                - name: argocd-binary
                                  from: "{{tasks.prepare-argocd-binary.outputs.artifacts.argocd-binary}}"
                                parameters:
                                - name: argocd-server
                                  value: "{{inputs.parameters.argocd-server}}"
                                - name: insecure-option
                                  value: "{{inputs.parameters.insecure-option}}"
                            - name: wait
                              dependencies:
                              - sync
                              template: wait
                              arguments:
                                artifacts:
                                - name: argocd-binary
                                  from: "{{tasks.prepare-argocd-binary.outputs.artifacts.argocd-binary}}"
                                parameters:
                                - name: argocd-server
                                  value: "{{inputs.parameters.argocd-server}}"
                                - name: insecure-option
                                  value: "{{inputs.parameters.insecure-option}}"
                            - name: init-db-tool
                              template: init-db-tool
                              dependencies:
                              - wait
                        - name: apply
                          resource:
                            action: apply
                            manifest: |
                              apiVersion: argoproj.io/v1alpha1
                              kind: Application
                              metadata:
                                name: app-mariadb
                                namespace: argocd
                              spec:
                                syncPolicy:
                                  syncOptions:
                                  - CreateNamespace=true
                                project: default
                                source:
                                  repoURL: https://charts.bitnami.com/bitnami
                                  chart: mariadb
                                  targetRevision: 16.5.0
                                  helm:
                                    releaseName: app-mariadb
                                    values: |
                                      architecture: standalone
                                      auth:
                                        database: geekcity
                                        username: aaron.yang
                                        existingSecret: mariadb-credentials
                                      primary:
                                        persistence:
                                          enabled: false
                                      secondary:
                                        replicaCount: 1
                                        persistence:
                                          enabled: false
                                      image:
                                        registry: m.daocloud.io/docker.io
                                        pullPolicy: IfNotPresent
                                      volumePermissions:
                                        enabled: false
                                        image:
                                          registry: m.daocloud.io/docker.io
                                          pullPolicy: IfNotPresent
                                      metrics:
                                        enabled: false
                                        image:
                                          registry: m.daocloud.io/docker.io
                                          pullPolicy: IfNotPresent
                                destination:
                                  server: https://kubernetes.default.svc
                                  namespace: application
                        - name: prepare-argocd-binary
                          inputs:
                            artifacts:
                            - name: argocd-binary
                              path: /tmp/argocd
                              mode: 755
                              http:
                                url: https://files.m.daocloud.io/github.com/argoproj/argo-cd/releases/download/v2.9.3/argocd-linux-amd64
                          outputs:
                            artifacts:
                            - name: argocd-binary
                              path: "{{inputs.artifacts.argocd-binary.path}}"
                          container:
                            image: m.daocloud.io/docker.io/library/fedora:39
                            command:
                            - sh
                            - -c
                            args:
                            - |
                              ls -l {{inputs.artifacts.argocd-binary.path}}
                        - name: sync
                          inputs:
                            artifacts:
                            - name: argocd-binary
                              path: /usr/local/bin/argocd
                            parameters:
                            - name: argocd-server
                            - name: insecure-option
                              value: ""
                          container:
                            image: m.daocloud.io/docker.io/library/fedora:39
                            env:
                            - name: ARGOCD_USERNAME
                              valueFrom:
                                secretKeyRef:
                                  name: argocd-login-credentials
                                  key: username
                            - name: ARGOCD_PASSWORD
                              valueFrom:
                                secretKeyRef:
                                  name: argocd-login-credentials
                                  key: password
                            - name: WITH_PRUNE_OPTION
                              value: --prune
                            command:
                            - sh
                            - -c
                            args:
                            - |
                              set -e
                              export ARGOCD_SERVER={{inputs.parameters.argocd-server}}
                              export INSECURE_OPTION={{inputs.parameters.insecure-option}}
                              export ARGOCD_USERNAME=${ARGOCD_USERNAME:-admin}
                              argocd login ${INSECURE_OPTION} --username ${ARGOCD_USERNAME} --password ${ARGOCD_PASSWORD} ${ARGOCD_SERVER}
                              argocd app sync argocd/app-mariadb ${WITH_PRUNE_OPTION} --timeout 300
                        - name: wait
                          inputs:
                            artifacts:
                            - name: argocd-binary
                              path: /usr/local/bin/argocd
                            parameters:
                            - name: argocd-server
                            - name: insecure-option
                              value: ""
                          container:
                            image: m.daocloud.io/docker.io/library/fedora:39
                            env:
                            - name: ARGOCD_USERNAME
                              valueFrom:
                                secretKeyRef:
                                  name: argocd-login-credentials
                                  key: username
                            - name: ARGOCD_PASSWORD
                              valueFrom:
                                secretKeyRef:
                                  name: argocd-login-credentials
                                  key: password
                            command:
                            - sh
                            - -c
                            args:
                            - |
                              set -e
                              export ARGOCD_SERVER={{inputs.parameters.argocd-server}}
                              export INSECURE_OPTION={{inputs.parameters.insecure-option}}
                              export ARGOCD_USERNAME=${ARGOCD_USERNAME:-admin}
                              argocd login ${INSECURE_OPTION} --username ${ARGOCD_USERNAME} --password ${ARGOCD_PASSWORD} ${ARGOCD_SERVER}
                              argocd app wait argocd/app-mariadb
                        - name: init-db-tool
                          resource:
                            action: apply
                            manifest: |
                              apiVersion: apps/v1
                              kind: Deployment
                              metadata:
                                name: app-mariadb-tool
                                namespace: application
                                labels:
                                  app.kubernetes.io/name: mariadb-tool
                              spec:
                                replicas: 1
                                selector:
                                  matchLabels:
                                    app.kubernetes.io/name: mariadb-tool
                                template:
                                  metadata:
                                    labels:
                                      app.kubernetes.io/name: mariadb-tool
                                  spec:
                                    containers:
                                      - name: mariadb-tool
                                        image:  m.daocloud.io/docker.io/bitnami/mariadb:10.5.12-debian-10-r0
                                        imagePullPolicy: IfNotPresent
                                        env:
                                          - name: MARIADB_ROOT_PASSWORD
                                            valueFrom:
                                              secretKeyRef:
                                                key: mariadb-root-password
                                                name: mariadb-credentials
                                          - name: TZ
                                            value: Asia/Shanghai
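Note that the sync and wait scripts in the workflow above fall back to `admin` via POSIX default expansion when `ARGOCD_USERNAME` is unset or empty:

```shell
# ${VAR:-default} substitutes "admin" only when the variable is unset or empty
unset ARGOCD_USERNAME
echo "${ARGOCD_USERNAME:-admin}"   # -> admin
ARGOCD_USERNAME=operator
echo "${ARGOCD_USERNAME:-admin}"   # -> operator
```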

                      5.submit to argo workflow client

                      Details
                      argo -n business-workflows submit deploy-mariadb-flow.yaml

                      6.decode password

                      Details
                      kubectl -n application get secret mariadb-credentials -o jsonpath='{.data.mariadb-root-password}' | base64 -d

                      Preliminary

                      1. Docker has been installed; if not, check 🔗link


                      Using Proxy

                      You can run an additional DaoCloud proxy image to accelerate image pulls; check Daocloud Proxy

                      1.init server

                      Details
                      mkdir -p mariadb/data
                      podman run \
                          --name mariadb-server \
                          -p 3306:3306 \
                          -v $(pwd)/mariadb/data:/var/lib/mysql \
                          -e MARIADB_ROOT_PASSWORD=mysql \
                          -d m.daocloud.io/docker.io/library/mariadb:11.2.2-jammy \
                          --log-bin \
                          --binlog-format=ROW

                      2.use web console

                      After starting the phpMyAdmin container below, you can visit 🔗http://localhost:8080

                      username: `root`

                      password: `mysql`

                      Details
                      podman run --rm -p 8080:80 \
                          -e PMA_ARBITRARY=1 \
                          -d m.daocloud.io/docker.io/library/phpmyadmin:5.1.1-apache

                      3.use internal client

                      Details
                      podman run --rm \
                          -e MYSQL_PWD=mysql \
                          -it m.daocloud.io/docker.io/library/mariadb:11.2.2-jammy \
                          mariadb \
                          --host host.containers.internal \
                          --port 3306 \
                          --user root \
                          --database mysql \
                          --execute 'select version()'

                      Useful SQL

                      1. list all bin logs
                      SHOW BINARY LOGS;
                      2. delete previous bin logs
                      PURGE BINARY LOGS TO 'mysqld-bin.000003'; # deletes mysqld-bin.000001 and mysqld-bin.000002
                      PURGE BINARY LOGS BEFORE 'yyyy-MM-dd HH:mm:ss';
                      PURGE BINARY LOGS BEFORE DATE_SUB(NOW(), INTERVAL 3 DAY); # delete bin logs older than three days
                      Details

                      If you are using master-slave replication, you can replace BINARY with MASTER in the statements above
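To purge by age with a concrete timestamp, the cutoff for `PURGE BINARY LOGS BEFORE` can be computed in the shell first (GNU `date` assumed):

```shell
# build a 'yyyy-MM-dd HH:mm:ss' cutoff three days in the past
CUTOFF="$(date -d '3 days ago' '+%Y-%m-%d %H:%M:%S')"
echo "PURGE BINARY LOGS BEFORE '${CUTOFF}';"
```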


                      Mar 7, 2024

                      Install Milvus

                      Preliminary

                      • Kubernetes has been installed, if not check link
                      • argoCD has been installed, if not check link
                      • cert-manager has been installed via argocd and a ClusterIssuer named self-signed-ca-issuer exists, if not check link
                      • minio has been installed, if not check link

                      Steps

                      1. copy minio credentials secret

                      kubectl get namespaces database > /dev/null 2>&1 || kubectl create namespace database
                      kubectl -n storage get secret minio-secret -o json \
                          | jq 'del(.metadata["namespace","creationTimestamp","resourceVersion","selfLink","uid"])' \
                          | kubectl -n database apply -f -
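The jq filter above strips the fields that pin the secret to its source namespace so it can be re-applied elsewhere. A minimal local sketch of the same filter, using a stand-in JSON document instead of a live cluster:

```shell
# Stand-in for what `kubectl -n storage get secret minio-secret -o json` returns
SECRET='{"apiVersion":"v1","kind":"Secret","metadata":{"name":"minio-secret","namespace":"storage","resourceVersion":"123","uid":"abc"},"data":{"root-user":"bWluaW8="}}'

# Same jq filter as above: delete namespace-bound metadata, keep name and data
CLEANED=$(printf '%s' "$SECRET" | jq 'del(.metadata["namespace","creationTimestamp","resourceVersion","selfLink","uid"])')

printf '%s\n' "$CLEANED"
```

The cleaned object still carries its name and data, so piping it to `kubectl -n database apply -f -` creates an identical secret in the target namespace.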

                      2. prepare deploy-milvus.yaml

                      apiVersion: argoproj.io/v1alpha1
                      kind: Application
                      metadata:
                        name: milvus
                      spec:
                        syncPolicy:
                          syncOptions:
                            - CreateNamespace=true
                        project: default
                        source:
                          repoURL: registry-1.docker.io/bitnamicharts
                          chart: milvus
                          targetRevision: 11.2.4
                          helm:
                            releaseName: milvus
                            values: |
                              global:
                                security:
                                  allowInsecureImages: true
                              milvus:
                                image:
                                  registry: m.lab.zverse.space/docker.io
                                  repository: bitnami/milvus
                                  tag: 2.5.7-debian-12-r0
                                  pullPolicy: IfNotPresent
                                auth:
                                  enabled: false
                              initJob:
                                forceRun: false
                                image:
                                  registry: m.lab.zverse.space/docker.io
                                  repository: bitnami/pymilvus
                                  tag: 2.5.6-debian-12-r0
                                  pullPolicy: IfNotPresent
                                resources:
                                  requests:
                                    cpu: 2
                                    memory: 512Mi
                                  limits:
                                    cpu: 2
                                    memory: 2Gi
                              dataCoord:
                                replicaCount: 1
                                resources:
                                  requests:
                                    cpu: 500m
                                    memory: 512Mi
                                  limits:
                                    cpu: 2
                                    memory: 2Gi
                                metrics:
                                  enabled: true
                                  
                              rootCoord:
                                replicaCount: 1
                                resources:
                                  requests:
                                    cpu: 500m
                                    memory: 1Gi
                                  limits:
                                    cpu: 2
                                    memory: 4Gi
                              queryCoord:
                                replicaCount: 1
                                resources:
                                  requests:
                                    cpu: 500m
                                    memory: 1Gi
                                  limits:
                                    cpu: 2
                                    memory: 4Gi
                              indexCoord:
                                replicaCount: 1
                                resources:
                                  requests:
                                    cpu: 500m
                                    memory: 1Gi
                                  limits:
                                    cpu: 2
                                    memory: 4Gi
                              dataNode:
                                replicaCount: 1
                                resources:
                                  requests:
                                    cpu: 500m
                                    memory: 1Gi
                                  limits:
                                    cpu: 2
                                    memory: 4Gi
                              queryNode:
                                replicaCount: 1
                                resources:
                                  requests:
                                    cpu: 500m
                                    memory: 1Gi
                                  limits:
                                    cpu: 2
                                    memory: 2Gi
                              indexNode:
                                resources:
                                  requests:
                                    cpu: 500m
                                    memory: 1Gi
                                  limits:
                                    cpu: 2
                                    memory: 2Gi
                              proxy:
                                replicaCount: 1
                                service:
                                  type: ClusterIP
                                resources:
                                  requests:
                                    cpu: 500m
                                    memory: 1Gi
                                  limits:
                                    cpu: 2
                                    memory: 2Gi
                              attu:
                                image:
                                  registry: m.lab.zverse.space/docker.io
                                  repository: bitnami/attu
                                  tag: 2.5.5-debian-12-r1
                                resources:
                                  requests:
                                    cpu: 500m
                                    memory: 1Gi
                                  limits:
                                    cpu: 2
                                    memory: 4Gi
                                service:
                                  type: ClusterIP
                                ingress:
                                  enabled: true
                                  ingressClassName: "nginx"
                                  annotations:
                                    cert-manager.io/cluster-issuer: alidns-webhook-zverse-letsencrypt
                                  hostname: milvus.dev.tech
                                  path: /
                                  pathType: ImplementationSpecific
                                  tls: true
                              waitContainer:
                                image:
                                  registry: m.lab.zverse.space/docker.io
                                  repository: bitnami/os-shell
                                  tag: 12-debian-12-r40
                                  pullPolicy: IfNotPresent
                                resources:
                                  requests:
                                    cpu: 500m
                                    memory: 1Gi
                                  limits:
                                    cpu: 2
                                    memory: 4Gi
                              externalS3:
                                host: "minio.storage"
                                port: 9000
                                existingSecret: "minio-secret"
                                existingSecretAccessKeyIDKey: "root-user"
                                existingSecretKeySecretKey: "root-password"
                                bucket: "milvus"
                                rootPath: "file"
                              etcd:
                                enabled: true
                                image:
                                  registry: m.lab.zverse.space/docker.io
                                replicaCount: 1
                                auth:
                                  rbac:
                                    create: false
                                  client:
                                    secureTransport: false
                                resources:
                                  requests:
                                    cpu: 500m
                                    memory: 1Gi
                                  limits:
                                    cpu: 2
                                    memory: 2Gi
                                persistence:
                                  enabled: true
                                  storageClass: ""
                                  size: 2Gi
                                preUpgradeJob:
                                  enabled: false
                              minio:
                                enabled: false
                              kafka:
                                enabled: true
                                image:
                                  registry: m.lab.zverse.space/docker.io
                                controller:
                                  replicaCount: 1
                                  livenessProbe:
                                    failureThreshold: 8
                                  resources:
                                    requests:
                                      cpu: 500m
                                      memory: 1Gi
                                    limits:
                                      cpu: 2
                                      memory: 2Gi
                                  persistence:
                                    enabled: true
                                    storageClass: ""
                                    size: 2Gi
                                service:
                                  ports:
                                    client: 9092
                                extraConfig: |-
                                  offsets.topic.replication.factor=3
                                listeners:
                                  client:
                                    protocol: PLAINTEXT
                                  interbroker:
                                    protocol: PLAINTEXT
                                  external:
                                    protocol: PLAINTEXT
                                sasl:
                                  enabledMechanisms: "PLAIN"
                                  client:
                                    users:
                                      - user
                                broker:
                                  replicaCount: 0
                        destination:
                          server: https://kubernetes.default.svc
                          namespace: database

                      3. apply to k8s

                      kubectl -n argocd apply -f deploy-milvus.yaml

                      4. sync by argocd

                      argocd app sync argocd/milvus

                      5. check Attu WebUI

                      milvus address: milvus-proxy:19530

                      milvus database: default

                      https://milvus.dev.tech:32443/#/

                      6. [Optional] import data

                      import data by using sql file

                      MARIADB_ROOT_PASSWORD=$(kubectl -n database get secret mariadb-credentials -o jsonpath='{.data.mariadb-root-password}' | base64 -d)
                      POD_NAME=$(kubectl get pod -n database -l "app.kubernetes.io/name=mariadb-tool" -o jsonpath="{.items[0].metadata.name}") \
                      && export SQL_FILENAME="Dump20240301.sql" \
                      && kubectl -n database cp ${SQL_FILENAME} ${POD_NAME}:/tmp/${SQL_FILENAME} \
                      && kubectl -n database exec -it deployment/app-mariadb-tool -- bash -c \
                          "echo 'create database ccds;' | mysql -h mariadb.database -uroot -p${MARIADB_ROOT_PASSWORD}" \
                      && kubectl -n database exec -it ${POD_NAME} -- bash -c \
                          "mysql -h mariadb.database -uroot -p${MARIADB_ROOT_PASSWORD} \
                          ccds < /tmp/${SQL_FILENAME}"

                      7. [Optional] decode password

                      kubectl -n database get secret mariadb-credentials -o jsonpath='{.data.mariadb-root-password}' | base64 -d

                      8. [Optional] execute sql in pod

                      kubectl -n database exec -it xxxx -- bash
                      mariadb -h 127.0.0.1 -u root -p$MARIADB_ROOT_PASSWORD

                      And then you can check connection by

                      show status like 'Threads%';
                      May 26, 2025

                      Install Neo4j

                      Installation

                      Install By

                      Preliminary

                      1. Kubernetes has been installed, if not check 🔗link


                      2. Helm has been installed, if not check 🔗link


                      1.get helm repo

                      Details
                      helm repo add ay-helm-mirror https://aaronyang0628.github.io/helm-chart-mirror/charts
                      helm repo update

                      2.install chart

                      Details
                      helm install ay-helm-mirror/neo4j --generate-name
                      Using Proxy

                      Preliminary

                      1. Kubernetes has been installed, if not check 🔗link


                      2. Helm has been installed, if not check 🔗link


                      3. ArgoCD has been installed, if not check 🔗link


                      1.prepare `deploy-xxxxx.yaml`

                      Details

                      2.apply to k8s

                      Details
                      kubectl -n argocd apply -f xxxx.yaml

                      3.sync by argocd

                      Details
                      argocd app sync argocd/xxxx

                      4.prepare yaml-content.yaml

                      Details

                      5.apply to k8s

                      Details
                      kubectl apply -f xxxx.yaml

                      6.apply xxxx.yaml directly

                      Details
                      kubectl apply -f - <<EOF
                      
                      EOF
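The `kubectl apply -f - <<EOF` form above reads the manifest from stdin, with the heredoc supplying the document. The mechanism itself can be sketched without a cluster; here `cat` stands in for `kubectl apply -f -`:

```shell
# A heredoc feeds the inline manifest to the command's stdin;
# `-f -` is what tells kubectl to read from stdin in the real invocation.
MANIFEST=$(cat <<EOF
apiVersion: v1
kind: Namespace
metadata:
  name: demo
EOF
)
printf '%s\n' "$MANIFEST"
```

The unquoted `EOF` delimiter also lets shell variables expand inside the manifest before it is applied.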

                      Preliminary

                      1. Docker|Podman|Buildah has been installed, if not check 🔗link


                      Using Proxy

                      you can run an additional daocloud image to accelerate image pulls, check Daocloud Proxy

                      1.init server

                      Details
                      mkdir -p neo4j/data
                      podman run --rm \
                          --name neo4j \
                          -p 7474:7474 \
                          -p 7687:7687 \
                          -e NEO4J_AUTH=neo4j/mysql \
                          -v $(pwd)/neo4j/data:/data \
                          -d docker.io/library/neo4j:5.18.0-community-bullseye
                      and then you can visit 🔗[http://localhost:7474]


                      username: `neo4j`
                      password: `mysql`

                      Preliminary

                      1. Kubernetes has been installed, if not check 🔗link


                      2. Helm has been installed, if not check 🔗link


                      3. ArgoCD has been installed, if not check 🔗link


                      4. Argo Workflow has been installed, if not check 🔗link


                      1.prepare `argocd-login-credentials`

                      Details
                      kubectl get namespaces database > /dev/null 2>&1 || kubectl create namespace database
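The `||` in the command above is a check-or-create idiom: the create only runs when the get fails, so the command is safe to re-run. The same pattern, shown with plain filesystem commands (a hypothetical `database` directory stands in for the namespace) so it runs without a cluster:

```shell
# Check-or-create: the command after || runs only when the check fails,
# mirroring `kubectl get namespaces database ... || kubectl create namespace database`
WORKDIR=$(mktemp -d)
NS_DIR="$WORKDIR/database"

test -d "$NS_DIR" > /dev/null 2>&1 || mkdir "$NS_DIR"   # first run: check fails, create runs
test -d "$NS_DIR" > /dev/null 2>&1 || mkdir "$NS_DIR"   # second run: check passes, no-op
echo "present: $NS_DIR"
```

Because the second invocation is a no-op, the idiom makes the namespace step idempotent.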

                      2.apply rolebinding to k8s

                      Details
                      kubectl apply -f - <<EOF
                      ---
                      apiVersion: rbac.authorization.k8s.io/v1
                      kind: ClusterRole
                      metadata:
                        name: application-administrator
                      rules:
                        - apiGroups:
                            - argoproj.io
                          resources:
                            - applications
                          verbs:
                            - '*'
                        - apiGroups:
                            - apps
                          resources:
                            - deployments
                          verbs:
                            - '*'
                      
                      ---
                      apiVersion: rbac.authorization.k8s.io/v1
                      kind: RoleBinding
                      metadata:
                        name: application-administration
                        namespace: argocd
                      roleRef:
                        apiGroup: rbac.authorization.k8s.io
                        kind: ClusterRole
                        name: application-administrator
                      subjects:
                        - kind: ServiceAccount
                          name: argo-workflow
                          namespace: business-workflows
                      
                      ---
                      apiVersion: rbac.authorization.k8s.io/v1
                      kind: RoleBinding
                      metadata:
                        name: application-administration
                        namespace: application
                      roleRef:
                        apiGroup: rbac.authorization.k8s.io
                        kind: ClusterRole
                        name: application-administrator
                      subjects:
                        - kind: ServiceAccount
                          name: argo-workflow
                          namespace: business-workflows
                      EOF

                      4.prepare `deploy-xxxx-flow.yaml`

                      Details

                      6.submit to argo workflow client

                      Details
                      argo -n business-workflows submit deploy-xxxx-flow.yaml

                      7.decode password

                      Details
                      kubectl -n application get secret xxxx-credentials -o jsonpath='{.data.xxx-password}' | base64 -d
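Secret values come back base64-encoded, which is why the command above pipes through `base64 -d`. The encoding round-trip itself, sketched locally with a hypothetical password value:

```shell
# What the cluster stores under .data is the base64-encoded value
PLAIN='s3cr3t-pa55'                          # hypothetical password
ENCODED=$(printf '%s' "$PLAIN" | base64)

# `| base64 -d` recovers the original plaintext, as in the command above
DECODED=$(printf '%s' "$ENCODED" | base64 -d)
echo "$DECODED"
```

Note that base64 is an encoding, not encryption; anyone who can read the secret can decode it.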

                      FAQ

                      Q1: Show me almost endless possibilities

                      You can add standard markdown syntax:

                      • multiple paragraphs
                      • bullet point lists
                      • emphasized, bold and even bold emphasized text
                      • links
                      • etc.
                      ...and even source code

                      the possibilities are endless (almost - including other shortcodes may or may not work)


                      Mar 7, 2024

                      Install Postgresql

                      Installation

                      Install By

                      Preliminary

                      1. Kubernetes has been installed, if not check 🔗link


                      2. Helm has been installed, if not check 🔗link


                      1.get helm repo

                      Details
                      helm repo add bitnami https://charts.bitnami.com/bitnami
                      helm repo update

                      2.install chart

                      Details
                      helm install bitnami/postgresql --generate-name --version 18.1.8
                      Using Proxy

                      for more information, you can check 🔗https://artifacthub.io/packages/helm/bitnami/postgresql

                      helm repo add ay-helm-mirror https://aaronyang0628.github.io/helm-chart-mirror/charts
                      helm repo update
                      helm install my-postgresql ay-helm-mirror/postgresql --version 18.1.8

                      Preliminary

                      1. Kubernetes has been installed, if not check 🔗link


                      2. Helm has been installed, if not check 🔗link


                      3. ArgoCD has been installed, if not check 🔗link


                      1.prepare `postgresql-credentials`

                      Details
                      kubectl get namespaces database > /dev/null 2>&1 || kubectl create namespace database
                      kubectl -n database create secret generic postgresql-credentials \
                          --from-literal=postgres-password=$(tr -dc A-Za-z0-9 </dev/urandom | head -c 16) \
                          --from-literal=password=$(tr -dc A-Za-z0-9 </dev/urandom | head -c 16) \
                          --from-literal=replication-password=$(tr -dc A-Za-z0-9 </dev/urandom | head -c 16)
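Each `--from-literal` above generates a random 16-character alphanumeric password from `/dev/urandom`. The generator on its own:

```shell
# Filter urandom down to alphanumerics and keep the first 16 characters,
# exactly as each --from-literal value above does
PASSWORD=$(tr -dc A-Za-z0-9 </dev/urandom | head -c 16)
echo "length: ${#PASSWORD}"
```

Restricting to `A-Za-z0-9` avoids shell-quoting surprises when the password is later pasted into connection strings.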

                      2.prepare `deploy-postgresql.yaml`

                      Details
                      kubectl -n argocd apply -f - <<EOF
                      apiVersion: argoproj.io/v1alpha1
                      kind: Application
                      metadata:
                        name: postgresql
                      spec:
                        syncPolicy:
                          syncOptions:
                          - CreateNamespace=true
                        project: default
                        source:
                          repoURL: https://aaronyang0628.github.io/helm-chart-mirror/charts
                          chart: postgresql
                          targetRevision: 18.1.8
                          helm:
                            releaseName: postgresql
                            values: |
                              global:
                                security:
                                  allowInsecureImages: true
                              architecture: standalone
                              auth:
                                database: n8n
                                username: n8n
                                existingSecret: postgresql-credentials
                              primary:
                                resources:
                                  requests:
                                    cpu: 1
                                    memory: 512Mi
                                  limits:
                                    cpu: 2
                                    memory: 1024Mi
                                persistence:
                                  enabled: true
                                  storageClass: local-path
                                  size: 8Gi
                              readReplicas:
                                replicaCount: 1
                                persistence:
                                  enabled: true
                                  storageClass: local-path
                                  size: 8Gi
                              backup:
                                enabled: false
                              image:
                                registry: m.daocloud.io/registry-1.docker.io
                                pullPolicy: IfNotPresent
                              volumePermissions:
                                enabled: false
                                image:
                                  registry: m.daocloud.io/registry-1.docker.io
                                  pullPolicy: IfNotPresent
                              metrics:
                                enabled: false
                                image:
                                  registry: m.daocloud.io/registry-1.docker.io
                                  pullPolicy: IfNotPresent
                              extraDeploy:
                              - apiVersion: networking.k8s.io/v1
                                kind: Ingress
                                metadata:
                                  name: postgres-tcp-ingress
                                  annotations:
                                    kubernetes.io/ingress.class: nginx
                                spec:
                                  rules:
                                  - host: postgres.ay.dev
                                    http:
                                      paths:
                                      - path: /
                                        pathType: Prefix
                                        backend:
                                          service:
                                            name: postgresql
                                            port:
                                              number: 5432
                        destination:
                          server: https://kubernetes.default.svc
                          namespace: database
                      EOF

                      3.sync by argocd

                      Details
                      argocd app sync argocd/postgresql

                      Preliminary

                      1. Kubernetes has been installed, if not check 🔗link


                      2. Helm has been installed, if not check 🔗link


                      3. ArgoCD has been installed, if not check 🔗link


                      1.prepare `postgresql-credentials`

                      Details
                      kubectl get namespaces database > /dev/null 2>&1 || kubectl create namespace database
                      kubectl -n database create secret generic postgresql-credentials \
                          --from-literal=postgres-password=$(tr -dc A-Za-z0-9 </dev/urandom | head -c 16) \
                          --from-literal=password=$(tr -dc A-Za-z0-9 </dev/urandom | head -c 16) \
                          --from-literal=replication-password=$(tr -dc A-Za-z0-9 </dev/urandom | head -c 16)

                      2.prepare `deploy-postgresql.yaml`

                      Details
                      kubectl -n argocd apply -f - <<EOF
                      apiVersion: argoproj.io/v1alpha1
                      kind: Application
                      metadata:
                        name: postgresql
                      spec:
                        syncPolicy:
                          syncOptions:
                          - CreateNamespace=true
                        project: default
                        source:
                          repoURL: https://aaronyang0628.github.io/helm-chart-mirror/charts
                          chart: postgresql
                          targetRevision: 18.1.8
                          helm:
                            releaseName: postgresql
                            values: |
                              global:
                                security:
                                  allowInsecureImages: true
                              architecture: standalone
                              auth:
                                database: n8n
                                username: n8n
                                existingSecret: postgresql-credentials
                              primary:
                                resources:
                                  requests:
                                    cpu: 1
                                    memory: 512Mi
                                  limits:
                                    cpu: 2
                                    memory: 1024Mi
                                persistence:
                                  enabled: true
                                  storageClass: local-path
                                  size: 8Gi
                              readReplicas:
                                replicaCount: 1
                                persistence:
                                  enabled: true
                                  storageClass: local-path
                                  size: 8Gi
                              backup:
                                enabled: false
                              image:
                                registry: m.daocloud.io/registry-1.docker.io
                                pullPolicy: IfNotPresent
                              volumePermissions:
                                enabled: false
                                image:
                                  registry: m.daocloud.io/registry-1.docker.io
                                  pullPolicy: IfNotPresent
                              metrics:
                                enabled: false
                                image:
                                  registry: m.daocloud.io/registry-1.docker.io
                                  pullPolicy: IfNotPresent
                        destination:
                          server: https://kubernetes.default.svc
                          namespace: database
                      EOF

                      3.sync by argocd

                      Details
                      argocd app sync argocd/postgresql

                      Preliminary

                      1. Docker|Podman|Buildah has been installed, if not check 🔗link


                      Using Proxy

                      you can run an additional daocloud image to accelerate image pulls, check Daocloud Proxy

                      1.init server

                      Details
                      mkdir -p $(pwd)/postgresql/data
                      podman run --rm \
                          --name postgresql \
                          -p 5432:5432 \
                          -e POSTGRES_PASSWORD=postgresql \
                          -e PGDATA=/var/lib/postgresql/data/pgdata \
                          -v $(pwd)/postgresql/data:/var/lib/postgresql/data \
                          -d docker.io/library/postgres:15.2-alpine3.17

                      2.use web console

                      Details
                      podman run --rm \
                        -p 8080:80 \
                        -e 'PGADMIN_DEFAULT_EMAIL=ben.wangz@foxmail.com' \
                        -e 'PGADMIN_DEFAULT_PASSWORD=123456' \
                        -d docker.io/dpage/pgadmin4:6.15
                      And then you can visit 🔗[http://localhost:8080]


                      3.use internal client

                      Details
                      podman run --rm \
                          --env PGPASSWORD=postgresql \
                          --entrypoint psql \
                          -it docker.io/library/postgres:15.2-alpine3.17 \
                          --host host.containers.internal \
                          --port 5432 \
                          --username postgres \
                          --dbname postgres \
                          --command 'select version()'

                      Preliminary

1. Kubernetes has been installed; if not, check 🔗link


2. Helm has been installed; if not, check 🔗link


3. ArgoCD has been installed; if not, check 🔗link


4. Argo Workflow has been installed; if not, check 🔗link


5. Minio artifact repository has been configured; if not, check 🔗link


                      - endpoint: minio.storage:9000

                      1.prepare `argocd-login-credentials`

                      Details
kubectl get namespaces application > /dev/null 2>&1 || kubectl create namespace application
                      ARGOCD_USERNAME=admin
                      ARGOCD_PASSWORD=$(kubectl -n argocd get secret argocd-initial-admin-secret -o jsonpath="{.data.password}" | base64 -d)
                      kubectl -n business-workflows create secret generic argocd-login-credentials \
                          --from-literal=username=${ARGOCD_USERNAME} \
                          --from-literal=password=${ARGOCD_PASSWORD}

                      2.apply rolebinding to k8s

                      Details
                      kubectl apply -f - <<EOF
                      ---
                      apiVersion: rbac.authorization.k8s.io/v1
                      kind: ClusterRole
                      metadata:
                        name: application-administrator
                      rules:
                        - apiGroups:
                            - argoproj.io
                          resources:
                            - applications
                          verbs:
                            - '*'
                        - apiGroups:
                            - apps
                          resources:
                            - deployments
                          verbs:
                            - '*'
                      
                      ---
                      apiVersion: rbac.authorization.k8s.io/v1
                      kind: RoleBinding
                      metadata:
                        name: application-administration
                        namespace: argocd
                      roleRef:
                        apiGroup: rbac.authorization.k8s.io
                        kind: ClusterRole
                        name: application-administrator
                      subjects:
                        - kind: ServiceAccount
                          name: argo-workflow
                          namespace: business-workflows
                      
                      ---
                      apiVersion: rbac.authorization.k8s.io/v1
                      kind: RoleBinding
                      metadata:
                        name: application-administration
                        namespace: application
                      roleRef:
                        apiGroup: rbac.authorization.k8s.io
                        kind: ClusterRole
                        name: application-administrator
                      subjects:
                        - kind: ServiceAccount
                          name: argo-workflow
                          namespace: business-workflows
                      EOF

                      3.prepare postgresql admin credentials secret

                      Details
                      kubectl -n application create secret generic postgresql-credentials \
                        --from-literal=postgres-password=$(tr -dc A-Za-z0-9 </dev/urandom | head -c 16) \
                        --from-literal=password=$(tr -dc A-Za-z0-9 </dev/urandom | head -c 16) \
                        --from-literal=replication-password=$(tr -dc A-Za-z0-9 </dev/urandom | head -c 16)
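
To make sure all three keys were generated, you can optionally inspect the secret; this is a read-only check:

```shell
# list the keys stored in the secret
kubectl -n application get secret postgresql-credentials -o jsonpath='{.data}'

# decode one key and confirm it is a 16-character alphanumeric string
kubectl -n application get secret postgresql-credentials \
    -o jsonpath='{.data.postgres-password}' | base64 -d
```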

                      4.prepare `deploy-postgresql-flow.yaml`

                      Details
                      apiVersion: argoproj.io/v1alpha1
                      kind: Workflow
                      metadata:
                        generateName: deploy-argocd-app-pg-
                      spec:
                        entrypoint: entry
                        artifactRepositoryRef:
                          configmap: artifact-repositories
                          key: default-artifact-repository
                        serviceAccountName: argo-workflow
                        templates:
                        - name: entry
                          inputs:
                            parameters:
                            - name: argocd-server
                              value: argo-cd-argocd-server.argocd:443
                            - name: insecure-option
                              value: --insecure
                          dag:
                            tasks:
                            - name: apply
                              template: apply
                            - name: prepare-argocd-binary
                              template: prepare-argocd-binary
                              dependencies:
                              - apply
                            - name: sync
                              dependencies:
                              - prepare-argocd-binary
                              template: sync
                              arguments:
                                artifacts:
                                - name: argocd-binary
                                  from: "{{tasks.prepare-argocd-binary.outputs.artifacts.argocd-binary}}"
                                parameters:
                                - name: argocd-server
                                  value: "{{inputs.parameters.argocd-server}}"
                                - name: insecure-option
                                  value: "{{inputs.parameters.insecure-option}}"
                            - name: wait
                              dependencies:
                              - sync
                              template: wait
                              arguments:
                                artifacts:
                                - name: argocd-binary
                                  from: "{{tasks.prepare-argocd-binary.outputs.artifacts.argocd-binary}}"
                                parameters:
                                - name: argocd-server
                                  value: "{{inputs.parameters.argocd-server}}"
                                - name: insecure-option
                                  value: "{{inputs.parameters.insecure-option}}"
                            - name: init-db-tool
                              template: init-db-tool
                              dependencies:
                              - wait
                        - name: apply
                          resource:
                            action: apply
                            manifest: |
                              apiVersion: argoproj.io/v1alpha1
                              kind: Application
                              metadata:
                                name: app-postgresql
                                namespace: argocd
                              spec:
                                syncPolicy:
                                  syncOptions:
                                  - CreateNamespace=true
                                project: default
                                source:
                                  repoURL: https://charts.bitnami.com/bitnami
                                  chart: postgresql
                                  targetRevision: 14.2.2
                                  helm:
                                    releaseName: app-postgresql
                                    values: |
                                      architecture: standalone
                                      auth:
                                        database: geekcity
                                        username: aaron.yang
                                        existingSecret: postgresql-credentials
                                      primary:
                                        persistence:
                                          enabled: false
                                      readReplicas:
                                        replicaCount: 1
                                        persistence:
                                          enabled: false
                                      backup:
                                        enabled: false
                                      image:
                                        registry: m.daocloud.io/docker.io
                                        pullPolicy: IfNotPresent
                                      volumePermissions:
                                        enabled: false
                                        image:
                                          registry: m.daocloud.io/docker.io
                                          pullPolicy: IfNotPresent
                                      metrics:
                                        enabled: false
                                        image:
                                          registry: m.daocloud.io/docker.io
                                          pullPolicy: IfNotPresent
                                destination:
                                  server: https://kubernetes.default.svc
                                  namespace: application
                        - name: prepare-argocd-binary
                          inputs:
                            artifacts:
                            - name: argocd-binary
                              path: /tmp/argocd
                              mode: 755
                              http:
                                url: https://files.m.daocloud.io/github.com/argoproj/argo-cd/releases/download/v2.9.3/argocd-linux-amd64
                          outputs:
                            artifacts:
                            - name: argocd-binary
                              path: "{{inputs.artifacts.argocd-binary.path}}"
                          container:
                            image: m.daocloud.io/docker.io/library/fedora:39
                            command:
                            - sh
                            - -c
                            args:
                            - |
                              ls -l {{inputs.artifacts.argocd-binary.path}}
                        - name: sync
                          inputs:
                            artifacts:
                            - name: argocd-binary
                              path: /usr/local/bin/argocd
                            parameters:
                            - name: argocd-server
                            - name: insecure-option
                              value: ""
                          container:
                            image: m.daocloud.io/docker.io/library/fedora:39
                            env:
                            - name: ARGOCD_USERNAME
                              valueFrom:
                                secretKeyRef:
                                  name: argocd-login-credentials
                                  key: username
                            - name: ARGOCD_PASSWORD
                              valueFrom:
                                secretKeyRef:
                                  name: argocd-login-credentials
                                  key: password
                            - name: WITH_PRUNE_OPTION
                              value: --prune
                            command:
                            - sh
                            - -c
                            args:
                            - |
                              set -e
                              export ARGOCD_SERVER={{inputs.parameters.argocd-server}}
                              export INSECURE_OPTION={{inputs.parameters.insecure-option}}
                              export ARGOCD_USERNAME=${ARGOCD_USERNAME:-admin}
                              argocd login ${INSECURE_OPTION} --username ${ARGOCD_USERNAME} --password ${ARGOCD_PASSWORD} ${ARGOCD_SERVER}
                              argocd app sync argocd/app-postgresql ${WITH_PRUNE_OPTION} --timeout 300
                        - name: wait
                          inputs:
                            artifacts:
                            - name: argocd-binary
                              path: /usr/local/bin/argocd
                            parameters:
                            - name: argocd-server
                            - name: insecure-option
                              value: ""
                          container:
                            image: m.daocloud.io/docker.io/library/fedora:39
                            env:
                            - name: ARGOCD_USERNAME
                              valueFrom:
                                secretKeyRef:
                                  name: argocd-login-credentials
                                  key: username
                            - name: ARGOCD_PASSWORD
                              valueFrom:
                                secretKeyRef:
                                  name: argocd-login-credentials
                                  key: password
                            command:
                            - sh
                            - -c
                            args:
                            - |
                              set -e
                              export ARGOCD_SERVER={{inputs.parameters.argocd-server}}
                              export INSECURE_OPTION={{inputs.parameters.insecure-option}}
                              export ARGOCD_USERNAME=${ARGOCD_USERNAME:-admin}
                              argocd login ${INSECURE_OPTION} --username ${ARGOCD_USERNAME} --password ${ARGOCD_PASSWORD} ${ARGOCD_SERVER}
                              argocd app wait argocd/app-postgresql
                        - name: init-db-tool
                          resource:
                            action: apply
                            manifest: |
                              apiVersion: apps/v1
                              kind: Deployment
                              metadata:
                                name: app-postgresql-tool
                                namespace: application
                                labels:
                                  app.kubernetes.io/name: postgresql-tool
                              spec:
                                replicas: 1
                                selector:
                                  matchLabels:
                                    app.kubernetes.io/name: postgresql-tool
                                template:
                                  metadata:
                                    labels:
                                      app.kubernetes.io/name: postgresql-tool
                                  spec:
                                    containers:
                                      - name: postgresql-tool
                                        image: m.daocloud.io/docker.io/bitnami/postgresql:14.4.0-debian-11-r9
                                        imagePullPolicy: IfNotPresent
                                        env:
                                          - name: POSTGRES_PASSWORD
                                            valueFrom:
                                              secretKeyRef:
                                                key: postgres-password
                                                name: postgresql-credentials
                                          - name: TZ
                                            value: Asia/Shanghai
                                        command:
                                          - tail
                                        args:
                                          - -f
                                          - /etc/hosts

6.submit to argo workflow client

Details
argo -n business-workflows submit deploy-postgresql-flow.yaml
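
While the workflow runs, you can optionally follow its progress with the argo CLI; `@latest` refers to the most recently submitted workflow in the namespace:

```shell
# list workflows and watch the latest one until it finishes
argo -n business-workflows list
argo -n business-workflows watch @latest

# stream logs from the latest workflow's pods
argo -n business-workflows logs @latest
```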

                      7.decode password

                      Details
                      kubectl -n application get secret postgresql-credentials -o jsonpath='{.data.postgres-password}' | base64 -d

                      8.import data

                      Details
POSTGRES_PASSWORD=$(kubectl -n application get secret postgresql-credentials -o jsonpath='{.data.postgres-password}' | base64 -d)
POD_NAME=$(kubectl get pod -n application -l "app.kubernetes.io/name=postgresql-tool" -o jsonpath="{.items[0].metadata.name}")
SQL_FILENAME="init_dfs_table_data.sql"
kubectl -n application cp ${SQL_FILENAME} ${POD_NAME}:/tmp/${SQL_FILENAME}
kubectl -n application exec -it deployment/app-postgresql-tool -- bash -c \
    'echo "CREATE DATABASE csst;" | PGPASSWORD="$POSTGRES_PASSWORD" \
    psql --host app-postgresql.application -U postgres -d postgres -p 5432'
kubectl -n application exec -it deployment/app-postgresql-tool -- bash -c \
    'PGPASSWORD="$POSTGRES_PASSWORD" psql --host app-postgresql.application \
    -U postgres -d csst -p 5432 < /tmp/init_dfs_table_data.sql'
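
To verify the import succeeded, you can optionally list the tables in the `csst` database from the same tool pod (its deployment already injects `POSTGRES_PASSWORD` into the environment):

```shell
# list the tables created by the import
kubectl -n application exec -it deployment/app-postgresql-tool -- bash -c \
    'PGPASSWORD="$POSTGRES_PASSWORD" psql --host app-postgresql.application \
    -U postgres -d csst -p 5432 --command "\dt"'
```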

                      FAQ

Q1: How to connect to the postgres?
POSTGRES_PASSWORD=$(kubectl -n database get secret postgresql-credentials -o jsonpath='{.data.postgres-password}' | base64 -d)
podman run --rm \
    --env PGPASSWORD=${POSTGRES_PASSWORD} \
    --entrypoint psql \
    -it m.daocloud.io/docker.io/library/postgres:15.2-alpine3.17 \
    --host host.containers.internal \
    --port 32543 \
    --username postgres \
    --dbname postgres \
    --command 'SELECT datname FROM pg_database;'

                      Mar 7, 2024

                      Install PgAdmin

                      🚀Installation

                      Install By

                      1.get helm repo

                      Details
                      helm repo add runix https://helm.runix.net/
                      helm repo update

                      2.install chart

                      Details
                      helm install runix/pgadmin4 --generate-name --version 1.23.3
                      Using AY Helm Mirror

                      for more information, you can check 🔗https://github.com/AaronYang0628/helm-chart-mirror

helm repo add ay-helm-mirror https://aaronyang0628.github.io/helm-chart-mirror/charts
helm repo update
helm install ay-helm-mirror/chart-name --generate-name --version a.b.c

1.prepare `pgadmin-credentials`

                      Details
                      kubectl -n database create secret generic pgadmin-credentials \
                        --from-literal=password=$(tr -dc A-Za-z0-9 </dev/urandom | head -c 16)
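
You will need this generated password to log in to the web console later; it can be decoded at any time:

```shell
# decode the pgadmin login password
kubectl -n database get secret pgadmin-credentials -o jsonpath='{.data.password}' | base64 -d
```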

2.apply `deploy-pgadmin.yaml`

Details
kubectl -n argocd apply -f - << EOF
                      apiVersion: argoproj.io/v1alpha1
                      kind: Application
                      metadata:
                        name: pgadmin
                      spec:
                        syncPolicy:
                          syncOptions:
                          - CreateNamespace=true
                        project: default
                        source:
                          repoURL: https://helm.runix.net/
                          chart: pgadmin4
                          targetRevision: 1.23.3
                          helm:
                            releaseName: pgadmin4
                            values: |
                              replicaCount: 1
                              persistentVolume:
                                enabled: false
                              env:
                                email: pgadmin@mail.72602.online
                                variables:
                                  - name: PGADMIN_CONFIG_WTF_CSRF_ENABLED
                                    value: "False"
                              existingSecret: pgadmin-credentials
                              resources:
                                requests:
                                  memory: 512Mi
                                  cpu: 500m
                                limits:
                                  memory: 1024Mi
                                  cpu: 1000m
                              image:
                                registry: m.daocloud.io/docker.io
                                pullPolicy: IfNotPresent
                              ingress:
                                enabled: true
                                ingressClassName: nginx
                                annotations:
                                  cert-manager.io/cluster-issuer: letsencrypt
                                hosts:
                                  - host: pgadmin.72602.online
                                    paths:
                                      - path: /
                                        pathType: ImplementationSpecific
                                tls:
                                  - secretName: pgadmin.72602.online-tls
                                    hosts:
                                      - pgadmin.72602.online
                        destination:
                          server: https://kubernetes.default.svc
                          namespace: database
                      EOF

                      3.sync by argocd

                      Details
                      argocd app sync argocd/pgadmin
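
If the ingress is not reachable yet, you can optionally port-forward the service instead; the service name `pgadmin4` is an assumption based on the release name, so confirm it with `kubectl -n database get svc` first:

```shell
# expose pgadmin on http://localhost:8080
kubectl -n database port-forward svc/pgadmin4 8080:80
# log in with pgadmin@mail.72602.online and the password
# stored in the pgadmin-credentials secret
```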


                      Mar 7, 2025

                      Install Redis

                      Installation

                      Install By

                      Preliminary

1. Kubernetes has been installed; if not, check 🔗link


2. Helm has been installed; if not, check 🔗link


                      1.get helm repo

                      Details
                      helm repo add ay-helm-mirror https://aaronyang0628.github.io/helm-chart-mirror/charts
                      helm repo update

                      2.install chart

                      Details
helm install ay-helm-mirror/redis --generate-name
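
Since `--generate-name` picks a random release name, you can optionally look it up afterwards and inspect the release; `RELEASE_NAME` below is a placeholder for whatever `helm list` prints:

```shell
# find the generated release name, then check its status
helm list
helm status RELEASE_NAME
```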
                      Using Proxy

                      Preliminary

1. Kubernetes has been installed; if not, check 🔗link


2. Helm has been installed; if not, check 🔗link


3. ArgoCD has been installed; if not, check 🔗link


1.prepare `redis-credentials`

Details
kubectl get namespaces storage > /dev/null 2>&1 || kubectl create namespace storage
kubectl -n storage create secret generic redis-credentials \
    --from-literal=redis-password=$(tr -dc A-Za-z0-9 </dev/urandom | head -c 16)

                      2.apply `deploy-redis.yaml`

                      Details
                      kubectl -n argocd apply -f - << 'EOF'
                      apiVersion: argoproj.io/v1alpha1
                      kind: Application
                      metadata:
                        name: redis
                      spec:
                        syncPolicy:
                          syncOptions:
                          - CreateNamespace=true
                        project: default
                        source:
                          repoURL: https://charts.bitnami.com/bitnami
                          chart: redis
                          targetRevision: 18.16.0
                          helm:
                            releaseName: redis
                            values: |
                              architecture: replication
                              auth:
                                enabled: true
                                sentinel: false
                                existingSecret: redis-credentials
                              master:
                                count: 1
                                resources:
                                  requests:
                                    memory: 512Mi
                                    cpu: 512m
                                  limits:
                                    memory: 1024Mi
                                    cpu: 1024m
                                disableCommands:
                                  - FLUSHDB
                                  - FLUSHALL
                                persistence:
                                  enabled: true
                                  storageClass: "local-path"
                                  accessModes:
                                  - ReadWriteOnce
                                  size: 8Gi
                              replica:
                                replicaCount: 1
                                resources:
                                  requests:
                                    memory: 512Mi
                                    cpu: 512m
                                  limits:
                                    memory: 1024Mi
                                    cpu: 1024m
                                disableCommands:
                                  - FLUSHDB
                                  - FLUSHALL
                                persistence:
                                  enabled: true
                                  storageClass: "local-path"
                                  accessModes:
                                  - ReadWriteOnce
                                  size: 8Gi
                              image:
                                registry: m.daocloud.io/docker.io
                                pullPolicy: IfNotPresent
                              sentinel:
                                enabled: false
                              metrics:
                                enabled: false
                              volumePermissions:
                                enabled: false
                              sysctl:
                                enabled: false
                              extraDeploy:
                              - apiVersion: traefik.io/v1alpha1
                                kind: IngressRouteTCP
                                metadata:
                                  name: redis-tcp
                                  namespace: storage
                                spec:
                                  entryPoints:
                                    - redis
                                  routes:
                                  - match: HostSNI(`*`)
                                    services:
                                    - name: redis-master
                                      port: 6379
                        destination:
                          server: https://kubernetes.default.svc
                          namespace: storage
                      EOF

                      3.sync by argocd

                      Details
                      argocd app sync argocd/redis

                      4.test redis connection

                      Details
REDIS_PASSWORD=$(kubectl -n storage get secret redis-credentials -o jsonpath='{.data.redis-password}' | base64 -d)
kubectl -n storage run test --rm -it --image=m.daocloud.io/docker.io/library/redis:7 -- \
    redis-cli -h redis-master -p 6379 -a ${REDIS_PASSWORD} ping

                      Preliminary

1. Docker|Podman|Buildah has been installed; if not, check 🔗link


                      Using Proxy

you can run an additional DaoCloud proxy image to accelerate image pulling; check Daocloud Proxy

                      1.init server

                      Details
                      mkdir -p $(pwd)/redis/data
                      podman run --rm \
                          --name redis \
                          -p 6379:6379 \
                          -v $(pwd)/redis/data:/data \
                          -d docker.io/library/redis:7.2.4-alpine
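
To confirm the server is up, you can optionally ping it from inside the container (the container name `redis` comes from the run command above):

```shell
podman exec redis redis-cli ping
# expected reply: PONG
```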

                      2.use internal client

                      Details
                      podman run --rm \
                          -it docker.io/library/redis:7.2.4-alpine \
                          redis-cli \
                          -h host.containers.internal \
                          set mykey somevalue
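
To confirm the write succeeded, you can read the key back the same way; this should print `somevalue`:

```shell
podman run --rm \
    -it docker.io/library/redis:7.2.4-alpine \
    redis-cli \
    -h host.containers.internal \
    get mykey
```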

                      Preliminary

1. Kubernetes has been installed; if not, check 🔗link


2. Helm has been installed; if not, check 🔗link


3. ArgoCD has been installed; if not, check 🔗link


4. Argo Workflow has been installed; if not, check 🔗link


5. Minio artifact repository has been configured; if not, check 🔗link


                      - endpoint: minio.storage:9000

                      1.prepare `argocd-login-credentials`

                      Details
                      ARGOCD_USERNAME=admin
                      ARGOCD_PASSWORD=$(kubectl -n argocd get secret argocd-initial-admin-secret -o jsonpath="{.data.password}" | base64 -d)
                      kubectl -n business-workflows create secret generic argocd-login-credentials \
                          --from-literal=username=${ARGOCD_USERNAME} \
                          --from-literal=password=${ARGOCD_PASSWORD}

                      2.apply rolebinding to k8s

                      Details
                      kubectl apply -f - <<EOF
                      ---
                      apiVersion: rbac.authorization.k8s.io/v1
                      kind: ClusterRole
                      metadata:
                        name: application-administrator
                      rules:
                        - apiGroups:
                            - argoproj.io
                          resources:
                            - applications
                          verbs:
                            - '*'
                        - apiGroups:
                            - apps
                          resources:
                            - deployments
                          verbs:
                            - '*'
                      
                      ---
                      apiVersion: rbac.authorization.k8s.io/v1
                      kind: RoleBinding
                      metadata:
                        name: application-administration
                        namespace: argocd
                      roleRef:
                        apiGroup: rbac.authorization.k8s.io
                        kind: ClusterRole
                        name: application-administrator
                      subjects:
                        - kind: ServiceAccount
                          name: argo-workflow
                          namespace: business-workflows
                      
                      ---
                      apiVersion: rbac.authorization.k8s.io/v1
                      kind: RoleBinding
                      metadata:
                        name: application-administration
                        namespace: application
                      roleRef:
                        apiGroup: rbac.authorization.k8s.io
                        kind: ClusterRole
                        name: application-administrator
                      subjects:
                        - kind: ServiceAccount
                          name: argo-workflow
                          namespace: business-workflows
                      EOF

                      3.prepare redis credentials secret

                      Details
                      kubectl -n application create secret generic redis-credentials \
                        --from-literal=redis-password=$(tr -dc A-Za-z0-9 </dev/urandom | head -c 16)
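The inline `tr`/`head` pipeline is what generates the password above; run in isolation it always yields exactly 16 alphanumeric characters, so you can sanity-check it before wiring it into the secret:

```shell
# generate a random 16-character alphanumeric password (same pipeline as above)
PASS=$(tr -dc A-Za-z0-9 </dev/urandom | head -c 16)
echo "length=${#PASS}"    # length=16
case "$PASS" in
  *[!A-Za-z0-9]*) echo "contains invalid chars" ;;
  *) echo "alphanumeric only" ;;
esac
```

`tr -dc A-Za-z0-9` deletes every byte outside the allowed set, and `head -c 16` stops after 16 surviving bytes, so no padding or truncation edge cases arise.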

                      4.prepare `deploy-redis-flow.yaml`

                      Details
                      apiVersion: argoproj.io/v1alpha1
                      kind: Workflow
                      metadata:
                        generateName: deploy-argocd-app-redis-
                      spec:
                        entrypoint: entry
                        artifactRepositoryRef:
                          configmap: artifact-repositories
                          key: default-artifact-repository
                        serviceAccountName: argo-workflow
                        templates:
                        - name: entry
                          inputs:
                            parameters:
                            - name: argocd-server
                              value: argocd-server.argocd:443
                            - name: insecure-option
                              value: --insecure
                          dag:
                            tasks:
                            - name: apply
                              template: apply
                            - name: prepare-argocd-binary
                              template: prepare-argocd-binary
                              dependencies:
                              - apply
                            - name: sync
                              dependencies:
                              - prepare-argocd-binary
                              template: sync
                              arguments:
                                artifacts:
                                - name: argocd-binary
                                  from: "{{tasks.prepare-argocd-binary.outputs.artifacts.argocd-binary}}"
                                parameters:
                                - name: argocd-server
                                  value: "{{inputs.parameters.argocd-server}}"
                                - name: insecure-option
                                  value: "{{inputs.parameters.insecure-option}}"
                            - name: wait
                              dependencies:
                              - sync
                              template: wait
                              arguments:
                                artifacts:
                                - name: argocd-binary
                                  from: "{{tasks.prepare-argocd-binary.outputs.artifacts.argocd-binary}}"
                                parameters:
                                - name: argocd-server
                                  value: "{{inputs.parameters.argocd-server}}"
                                - name: insecure-option
                                  value: "{{inputs.parameters.insecure-option}}"
                        - name: apply
                          resource:
                            action: apply
                            manifest: |
                              apiVersion: argoproj.io/v1alpha1
                              kind: Application
                              metadata:
                                name: app-redis
                                namespace: argocd
                              spec:
                                syncPolicy:
                                  syncOptions:
                                  - CreateNamespace=true
                                project: default
                                source:
                                  repoURL: https://charts.bitnami.com/bitnami
                                  chart: redis
                                  targetRevision: 18.16.0
                                  helm:
                                    releaseName: app-redis
                                    values: |
                                      architecture: replication
                                      auth:
                                        enabled: true
                                        sentinel: true
                                        existingSecret: redis-credentials
                                      master:
                                        count: 1
                                        disableCommands:
                                          - FLUSHDB
                                          - FLUSHALL
                                        persistence:
                                          enabled: false
                                      replica:
                                        replicaCount: 3
                                        disableCommands:
                                          - FLUSHDB
                                          - FLUSHALL
                                        persistence:
                                          enabled: false
                                      image:
                                        registry: m.daocloud.io/docker.io
                                        pullPolicy: IfNotPresent
                                      sentinel:
                                        enabled: false
                                        persistence:
                                          enabled: false
                                        image:
                                          registry: m.daocloud.io/docker.io
                                          pullPolicy: IfNotPresent
                                      metrics:
                                        enabled: false
                                        image:
                                          registry: m.daocloud.io/docker.io
                                          pullPolicy: IfNotPresent
                                      volumePermissions:
                                        enabled: false
                                        image:
                                          registry: m.daocloud.io/docker.io
                                          pullPolicy: IfNotPresent
                                      sysctl:
                                        enabled: false
                                        image:
                                          registry: m.daocloud.io/docker.io
                                          pullPolicy: IfNotPresent
                                destination:
                                  server: https://kubernetes.default.svc
                                  namespace: application
                        - name: prepare-argocd-binary
                          inputs:
                            artifacts:
                            - name: argocd-binary
                              path: /tmp/argocd
                              mode: 755
                              http:
                                url: https://files.m.daocloud.io/github.com/argoproj/argo-cd/releases/download/v2.9.3/argocd-linux-amd64
                          outputs:
                            artifacts:
                            - name: argocd-binary
                              path: "{{inputs.artifacts.argocd-binary.path}}"
                          container:
                            image: m.daocloud.io/docker.io/library/fedora:39
                            command:
                            - sh
                            - -c
                            args:
                            - |
                              ls -l {{inputs.artifacts.argocd-binary.path}}
                        - name: sync
                          inputs:
                            artifacts:
                            - name: argocd-binary
                              path: /usr/local/bin/argocd
                            parameters:
                            - name: argocd-server
                            - name: insecure-option
                              value: ""
                          container:
                            image: m.daocloud.io/docker.io/library/fedora:39
                            env:
                            - name: ARGOCD_USERNAME
                              valueFrom:
                                secretKeyRef:
                                  name: argocd-login-credentials
                                  key: username
                            - name: ARGOCD_PASSWORD
                              valueFrom:
                                secretKeyRef:
                                  name: argocd-login-credentials
                                  key: password
                            - name: WITH_PRUNE_OPTION
                              value: --prune
                            command:
                            - sh
                            - -c
                            args:
                            - |
                              set -e
                              export ARGOCD_SERVER={{inputs.parameters.argocd-server}}
                              export INSECURE_OPTION={{inputs.parameters.insecure-option}}
                              export ARGOCD_USERNAME=${ARGOCD_USERNAME:-admin}
                              argocd login ${INSECURE_OPTION} --username ${ARGOCD_USERNAME} --password ${ARGOCD_PASSWORD} ${ARGOCD_SERVER}
                              argocd app sync argocd/app-redis ${WITH_PRUNE_OPTION} --timeout 300
                        - name: wait
                          inputs:
                            artifacts:
                            - name: argocd-binary
                              path: /usr/local/bin/argocd
                            parameters:
                            - name: argocd-server
                            - name: insecure-option
                              value: ""
                          container:
                            image: m.daocloud.io/docker.io/library/fedora:39
                            env:
                            - name: ARGOCD_USERNAME
                              valueFrom:
                                secretKeyRef:
                                  name: argocd-login-credentials
                                  key: username
                            - name: ARGOCD_PASSWORD
                              valueFrom:
                                secretKeyRef:
                                  name: argocd-login-credentials
                                  key: password
                            command:
                            - sh
                            - -c
                            args:
                            - |
                              set -e
                              export ARGOCD_SERVER={{inputs.parameters.argocd-server}}
                              export INSECURE_OPTION={{inputs.parameters.insecure-option}}
                              export ARGOCD_USERNAME=${ARGOCD_USERNAME:-admin}
                              argocd login ${INSECURE_OPTION} --username ${ARGOCD_USERNAME} --password ${ARGOCD_PASSWORD} ${ARGOCD_SERVER}
                              argocd app wait argocd/app-redis

5.submit to argo workflow client

Details
argo -n business-workflows submit deploy-redis-flow.yaml

6.decode password

                      Details
                      kubectl -n application get secret redis-credentials -o jsonpath='{.data.redis-password}' | base64 -d
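Secret values come back base64-encoded from the jsonpath query, which is why the pipeline ends in `base64 -d`. You can verify that decode step offline; the encoded string below is just the sample password used in the connection test earlier on this page:

```shell
# decode a base64-encoded secret value, exactly as the pipeline above does
echo 'dUl0bVZHcFg1UFNoSGM4ag==' | base64 -d
# → uItmVGpX5PShHc8j
```

Note that `base64 -d` emits no trailing newline, so wrap it in `$(...)` when assigning to a variable.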


                      Mar 7, 2024

                      Subsections of Git

                      Install Act Runner

                      Installation

                      Install By

                      Preliminary

1. Kubernetes has been installed; if not, check 🔗link


2. The Helm binary has been installed; if not, check 🔗link


                      1.get helm repo

                      Details
                      helm repo add ay-helm-mirror https://aaronyang0628.github.io/helm-chart-mirror/charts
                      helm repo update

                      2.prepare `act-runner-secret`

                      Details
                      kubectl -n application create secret generic act-runner-secret \
                        --from-literal=act-runner-token=4w3Sx0Hwe6VFevl473ZZ4nFVDvFvhKcEUBvpJ09L

                      3.prepare values

                      Details
cat > act-runner-values.yaml <<EOF
replicas: 1
runner:
  instanceURL: http://192.168.100.125:30300
  token:
    fromSecret:
      name: act-runner-secret
      key: act-runner-token
EOF

                      4.install chart

                      Details
                      helm upgrade  --create-namespace -n application --install -f ./act-runner-values.yaml act-runner ay-helm-mirror/act-runner

                      Preliminary

1. Kubernetes has been installed; if not, check 🔗link


2. ArgoCD has been installed; if not, check 🔗link


3. The Helm binary has been installed; if not, check 🔗link


                      1.prepare `act-runner-secret`

                      Details
                      kubectl -n application create secret generic act-runner-secret \
                        --from-literal=act-runner-token=4w3Sx0Hwe6VFevl473ZZ4nFVDvFvhKcEUBvpJ09L
The act-runner token can be obtained from here.

A token is used for authentication and identification, e.g. P2U1U0oB4XaRCi8azcngmPCLbRpUGapalhmddh23. Each token can be used to create multiple runners, until it is replaced with a new token using the reset link. You can obtain different levels of tokens from the following places to create the corresponding level of runners:

                      Instance level: The admin settings page, like <your_gitea.com>/-/admin/actions/runners.

(screenshot: act_runner_token)

                      2.prepare act-runner.yaml

Details
                      kubectl -n argocd apply -f - <<EOF
                      apiVersion: argoproj.io/v1alpha1
                      kind: Application
                      metadata:
                        name: act-runner
                      spec:
                        syncPolicy:
                          syncOptions:
                          - CreateNamespace=true
                        project: default
                        source:
                          repoURL: https://aaronyang0628.github.io/helm-chart-mirror/charts
                          chart: act-runner
                          targetRevision: 0.2.2
                          helm:
                            releaseName: act-runner
                            values: |
                              image:
                                name: vegardit/gitea-act-runner
                                tag: "dind-0.2.13"
                                repository: m.daocloud.io/docker.io
                              runner:
        instanceURL: http://192.168.100.125:30300
                                token:
                                  fromSecret:
                                    name: "act-runner-secret"
                                    key: "act-runner-token"
                                config:
                                  enabled: true
                                  data: |
                                    log:
                                      level: info
                                    runner:
                                      labels:
                                        - ubuntu-latest:docker://m.daocloud.io/docker.gitea.com/runner-images:ubuntu-latest
                                    container:
                                      force_pull: true
                              persistence:
                                enabled: true
                                storageClassName: ""
                                accessModes: ReadWriteOnce
                                size: 10Gi
                              autoscaling:
                                enabled: true
                                minReplicas: 1
                                maxReplicas: 3
                              replicas: 1  
                              securityContext:
                                privileged: true
                                runAsUser: 0
                                runAsGroup: 0
                                fsGroup: 0
                                capabilities:
                                  add: ["NET_ADMIN", "SYS_ADMIN"]
                              podSecurityContext:
                                runAsUser: 0
                                runAsGroup: 0
                                fsGroup: 0
                              resources: 
                                requests:
                                  cpu: 200m
                                  memory: 512Mi
                                limits:
                                  cpu: 1000m
                                  memory: 2048Mi
                        destination:
                          server: https://kubernetes.default.svc
                          namespace: application
                      EOF
                      

3.sync by argocd

Details
argocd app sync argocd/act-runner

4.use action

                      Details

                      Even if Actions is enabled for the Gitea instance, repositories still disable Actions by default.

                      To enable it, go to the settings page of your repository like your_gitea.com/<owner>/repo/settings and enable Enable Repository Actions.

(screenshot: Enable Repository Actions setting)

                      Preliminary

1. Podman has been installed, and the `podman` command is available in your PATH.


                      1.prepare data and config dir

                      Details
                      mkdir -p /opt/gitea_act_runner/{data,config} \
                      && chown -R 1000:1000 /opt/gitea_act_runner \
                      && chmod -R 755 /opt/gitea_act_runner
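If you want to rehearse the layout before touching `/opt`, the same structure can be created under a throwaway prefix (paths below are illustrative; the `chown` to 1000:1000 is skipped here since a temp directory is already owned by you):

```shell
# rehearse the act_runner directory layout under a temporary prefix
BASE=$(mktemp -d)/gitea_act_runner
mkdir -p "$BASE/data" "$BASE/config"
chmod -R 755 "$BASE"
ls "$BASE"    # lists: config, data
```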

                      2.run container

                      Details
                      podman run -it \
                        --name gitea_act_runner \
                        --rm \
                        --privileged \
                        --network=host \
                        -v /opt/gitea_act_runner/data:/data \
                        -v /opt/gitea_act_runner/config:/config \
                        -v /var/run/podman/podman.sock:/var/run/docker.sock \
                        -e GITEA_INSTANCE_URL="http://10.200.60.64:30300" \
                        -e GITEA_RUNNER_REGISTRATION_TOKEN="5lgsrOzfKz3RiqeMWxxUb9RmUPEWNnZ6hTTZV0DL" \
                        m.daocloud.io/docker.io/gitea/act_runner:latest-dind-rootless
                      Using Mirror

You can run an additional DaoCloud mirror to accelerate image pulls; check Daocloud Proxy

                      Preliminary

1. Docker has been installed, and the `docker` command is available in your PATH.

                      1.prepare data and config dir

                      Details
                      mkdir -p /opt/gitea_act_runner/{data,config} \
                      && chown -R 1000:1000 /opt/gitea_act_runner \
                      && chmod -R 755 /opt/gitea_act_runner

                      2.run container

                      Details
                      docker run -it \
                        --name gitea_act_runner \
                        --rm \
                        --privileged \
                        --network=host \
                        -v /opt/gitea_act_runner/data:/data \
                        -v /opt/gitea_act_runner/config:/config \
                        -e GITEA_INSTANCE_URL="http://192.168.100.125:30300" \
                        -e GITEA_RUNNER_REGISTRATION_TOKEN="5lgsrOzfKz3RiqeMWxxUb9RmUPEWNnZ6hTTZV0DL" \
                        m.daocloud.io/docker.io/gitea/act_runner:latest-dind
                      Using Mirror

You can run an additional DaoCloud mirror to accelerate image pulls; check Daocloud Proxy


                      Jun 7, 2025

                      Install Gitea

                      Installation

                      Install By

                      Preliminary

                      1. Kubernetes has installed, if not check 🔗link


                      2. Helm binary has installed, if not check 🔗link


                      3. CertManager has installed, if not check 🔗link


                      4. Ingress has installed, if not check 🔗link


                      1.get helm repo

                      Details
                      helm repo add gitea-charts https://dl.gitea.com/charts/
                      helm repo update

                      2.install chart

                      Details
helm install gitea gitea-charts/gitea
                      Using Mirror
                      helm repo add ay-helm-mirror https://aaronyang0628.github.io/helm-chart-mirror/charts \
                        && helm install ay-helm-mirror/gitea --generate-name --version 12.1.3

                      for more information, you can check 🔗https://aaronyang0628.github.io/helm-chart-mirror/

                      Preliminary

1. Kubernetes has been installed; if not, check 🔗link


2. ArgoCD has been installed; if not, check 🔗link


3. The Helm binary has been installed; if not, check 🔗link


4. Ingress has been installed on ArgoCD; if not, check 🔗link


5. Minio has been installed; if not, check 🔗link


1.prepare `gitea-admin-credentials`

Details
                      kubectl get namespaces application > /dev/null 2>&1 || kubectl create namespace application
                      kubectl -n application create secret generic gitea-admin-credentials \
                          --from-literal=username=gitea_admin \
                          --from-literal=password=$(tr -dc A-Za-z0-9 </dev/urandom | head -c 16)
                      

                      2.prepare `gitea.yaml`

Details
                      apiVersion: argoproj.io/v1alpha1
                      kind: Application
                      metadata:
                        name: gitea
                      spec:
                        syncPolicy:
                          syncOptions:
                          - CreateNamespace=true
                        project: default
                        source:
                          repoURL: https://dl.gitea.com/charts/
                          chart: gitea
                          targetRevision: 10.1.4
                          helm:
                            releaseName: gitea
                            values: |
                              image:
                                registry: m.daocloud.io/docker.io
                              service:
                                http:
                                  type: NodePort
                                  port: 3000
                                  nodePort: 30300
                                ssh:
                                  type: NodePort
                                  port: 22
                                  nodePort: 32022
                              ingress:
                                enabled: true
                                ingressClassName: nginx
                                annotations:
                                  kubernetes.io/ingress.class: nginx
                                  nginx.ingress.kubernetes.io/rewrite-target: /$1
                                  cert-manager.io/cluster-issuer: self-signed-ca-issuer
                                hosts:
                                - host: gitea.ay.dev
                                  paths:
                                  - path: /?(.*)
                                    pathType: ImplementationSpecific
                                tls:
                                - secretName: gitea.ay.dev-tls
                                  hosts:
                                  - gitea.ay.dev
                              persistence:
                                enabled: true
                                size: 8Gi
                                storageClass: ""
                              redis-cluster:
                                enabled: false
                              postgresql-ha:
                                enabled: false
                              postgresql:
                                enabled: true
                                architecture: standalone
                                image:
                                  registry: m.daocloud.io/docker.io
                                primary:
                                  persistence:
                                    enabled: false
                                    storageClass: ""
                                    size: 8Gi
                                readReplicas:
                                  replicaCount: 1
                                  persistence:
                                    enabled: true
                                    storageClass: ""
                                    size: 8Gi
                                backup:
                                  enabled: false
                                volumePermissions:
                                  enabled: false
                                  image:
                                    registry: m.daocloud.io/docker.io
                                metrics:
                                  enabled: false
                                  image:
                                    registry: m.daocloud.io/docker.io
                              gitea:
                                admin:
                                  existingSecret: gitea-admin-credentials
                                  email: aaron19940628@gmail.com
                                config:
                                  database:
                                    DB_TYPE: postgres
                                  session:
                                    PROVIDER: db
                                  cache:
                                    ADAPTER: memory
                                  queue:
                                    TYPE: level
                                  indexer:
                                    ISSUE_INDEXER_TYPE: bleve
                                    REPO_INDEXER_ENABLED: true
                                  repository:
                                    MAX_CREATION_LIMIT: 10
                                    DISABLED_REPO_UNITS: "repo.wiki,repo.ext_wiki,repo.projects"
                                    DEFAULT_REPO_UNITS: "repo.code,repo.releases,repo.issues,repo.pulls"
                                  server:
                                    PROTOCOL: http
                                    LANDING_PAGE: login
                                    DOMAIN: gitea.ay.dev
                                    ROOT_URL: https://gitea.ay.dev:32443/
                                    SSH_DOMAIN: ssh.gitea.ay.dev
                                    SSH_PORT: 32022
                                    SSH_AUTHORIZED_PRINCIPALS_ALLOW: email
                                  admin:
                                    DISABLE_REGULAR_ORG_CREATION: true
                                  security:
                                    INSTALL_LOCK: true
                                  service:
                                    REGISTER_EMAIL_CONFIRM: true
                                    DISABLE_REGISTRATION: true
                                    ENABLE_NOTIFY_MAIL: false
                                    DEFAULT_ALLOW_CREATE_ORGANIZATION: false
                                    SHOW_MILESTONES_DASHBOARD_PAGE: false
                                  migrations:
                                    ALLOW_LOCALNETWORKS: true
                                  mailer:
                                    ENABLED: false
                                  i18n:
                                    LANGS: "en-US,zh-CN"
                                    NAMES: "English,简体中文"
                                  oauth2:
                                    ENABLE: false
                        destination:
                          server: https://kubernetes.default.svc
                          namespace: application
                      
                      

                      3.apply to k8s

                      Details
                      kubectl -n argocd apply -f gitea.yaml

                      4.sync by argocd

                      Details
                      argocd app sync argocd/gitea

                      5.decode admin password

                      login 🔗https://gitea.ay.dev:32443/ using the user gitea_admin and the password decoded below
                      Details
                      kubectl -n application get secret gitea-admin-credentials -o jsonpath='{.data.password}' | base64 -d
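Secret values come back base64-encoded from the API, which is why the command above pipes through `base64 -d`. A minimal, cluster-free sketch of that round trip (the password string is a stand-in):

```shell
# Kubernetes stores Secret data base64-encoded; `base64 -d` reverses it.
plain="s3cr3t-pass"                              # stand-in for the real password
encoded=$(printf '%s' "$plain" | base64)         # what the API returns in .data
decoded=$(printf '%s' "$encoded" | base64 -d)    # what the kubectl pipeline prints
echo "$decoded"                                  # → s3cr3t-pass
```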

                      FAQ

                      Q1: Show me almost endless possibilities

                      You can add standard markdown syntax:

                      • multiple paragraphs
                      • bullet point lists
                      • emphasized, bold and even bold emphasized text
                      • links
                      • etc.
                      ...and even source code

                      the possibilities are endless (almost - including other shortcodes may or may not work)


                      Jun 7, 2025

                      HPC

                        Mar 7, 2024

                        Subsections of MCP Related

                        MCP Inspector

                        🚀Installation

                        Install By

                        1.get helm repo

                        Details
                        helm repo add xxxxx https://xxxx
                        helm repo update

                        2.install chart

                        Details
                        helm install xxxxx/chart-name --generate-name --version a.b.c
                        Using AY Helm Mirror

                        for more information, you can check 🔗https://github.com/AaronYang0628/helm-chart-mirror

                        helm repo add ay-helm-mirror https://aaronyang0628.github.io/helm-chart-mirror/charts
                          helm repo update
                          helm install ay-helm-mirror/chart-name --generate-name --version a.b.c

                        1.prepare `xxxxx-credentials.yaml`

                        Details

                        2.prepare `deploy-xxxxx.yaml`

                        Details
                        kubectl -n argocd apply -f -<< EOF
                        apiVersion: argoproj.io/v1alpha1
                        kind: Application
                        metadata:
                          name: xxxx
                        spec:
                          project: default
                          source:
                            repoURL: https://xxxxx
                            chart: xxxx
                            targetRevision: a.b.c
                        EOF

                        3.sync by argocd

                        Details
                        argocd app sync argocd/xxxx
                        Using AY Helm Mirror

                        for more information, you can check 🔗https://github.com/AaronYang0628/helm-chart-mirror

                        helm repo add ay-helm-mirror https://aaronyang0628.github.io/helm-chart-mirror/charts
                          helm repo update
                          helm install ay-helm-mirror/chart-name --generate-name --version a.b.c
                        Using AY ACR Image Mirror
                        Using DaoCloud Mirror

                        1.init server

                        Details
                        Using AY ACR Image Mirror
                        Using DaoCloud Mirror

                        1.prepare `argocd-login-credentials`

                        Details
                        kubectl get namespaces database > /dev/null 2>&1 || kubectl create namespace database

                        2.apply rolebinding to k8s

                        4.prepare `deploy-xxxx-flow.yaml`

                        Details

                        5.submit to argo workflow client

                        Details
                        argo -n business-workflows submit deploy-xxxx-flow.yaml

                        7.decode password

                        Details
                        kubectl -n application get secret xxxx-credentials -o jsonpath='{.data.xxx-password}' | base64 -d



                        Jun 7, 2024

                        Subsections of Monitor

                        Install Homepage

                        Official Documentation: https://gethomepage.dev/

                        Installation

                        Install By

                        Preliminary

                        1. Kubernetes has been installed, if not check 🔗link


                        2. Helm has been installed, if not check 🔗link


                        1.install chart directly

                        Details
                        helm install homepage oci://ghcr.io/m0nsterrr/helm-charts/homepage

                        2.you can modify the values.yaml and re-install

                        Related values files
                        Details
                        helm install homepage oci://ghcr.io/m0nsterrr/helm-charts/homepage -f homepage.values.yaml
                        Using Mirror
                        helm repo add ay-helm-mirror https://aaronyang0628.github.io/helm-chart-mirror/charts \
                          && helm install ay-helm-mirror/homepage  --generate-name --version 4.2.0

                        for more information, you can check 🔗https://aaronyang0628.github.io/helm-chart-mirror/

                        Preliminary

                        1. Kubernetes has been installed, if not check 🔗link


                        2. ArgoCD has been installed, if not check 🔗link


                        3. Helm binary has been installed, if not check 🔗link


                        4. Ingress has been installed on ArgoCD, if not check 🔗link


                        1.prepare `homepage.yaml`

                        Details
                        kubectl -n argocd apply -f - << EOF
                          apiVersion: argoproj.io/v1alpha1
                          kind: Application
                          metadata:
                            name: homepage
                          spec:
                            syncPolicy:
                              syncOptions:
                                - CreateNamespace=true
                                - ServerSideApply=true
                            project: default
                            source:
                              repoURL: oci://ghcr.io/m0nsterrr/helm-charts/homepage
                              chart: homepage
                              targetRevision: 4.2.0
                              helm:
                                releaseName: homepage
                                values: |
                                  image:
                                    registry: m.daocloud.io/ghcr.io
                                    repository: gethomepage/homepage
                                    pullPolicy: IfNotPresent
                                    tag: "v1.5.0"
                                  config:
                                    allowedHosts: 
                                    - "home.72602.online"
                                  ingress:
                                    enabled: true
                                    ingressClassName: "nginx"
                                    annotations:
                                      kubernetes.io/ingress.class: nginx
                                    hosts:
                                      - host: home.72602.online
                                        paths:
                                          - path: /
                                            pathType: ImplementationSpecific
                                  resources:
                                    limits:
                                      cpu: 500m
                                      memory: 512Mi
                                    requests:
                                      cpu: 100m
                                      memory: 128Mi
                            destination:
                              server: https://kubernetes.default.svc
                              namespace: monitor
                        EOF

                        3.sync by argocd

                        Details
                        argocd app sync argocd/homepage

                        5.check the web browser

                        Details
                        K8S_MASTER_IP=$(kubectl get nodes --selector=node-role.kubernetes.io/control-plane -o jsonpath='{$.items[0].status.addresses[?(@.type=="InternalIP")].address}')
                        echo "$K8S_MASTER_IP home.72602.online" >> /etc/hosts
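Note that `>>` appends a duplicate entry every time it runs; a grep guard keeps the edit idempotent. A sketch against a temp file (the IP is a placeholder for `$K8S_MASTER_IP`; point `HOSTS_FILE` at /etc/hosts on a real node):

```shell
# Idempotent hosts-entry helper, demonstrated on a temp file.
HOSTS_FILE=$(mktemp)
ENTRY="192.168.1.10 home.72602.online"      # placeholder IP
add_hosts_entry() {
  grep -qF "$ENTRY" "$HOSTS_FILE" || echo "$ENTRY" >> "$HOSTS_FILE"
}
add_hosts_entry
add_hosts_entry                             # second call is a no-op
grep -c "home.72602.online" "$HOSTS_FILE"   # → 1
rm -f "$HOSTS_FILE"
```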

                        Preliminary

                        1. Kubernetes has been installed, if not check 🔗link


                        2. Docker has been installed, if not check 🔗link


                        docker run -d \
                        --name homepage \
                        -e HOMEPAGE_ALLOWED_HOSTS=47.110.67.161:3000 \
                        -e PUID=1000 \
                        -e PGID=1000 \
                        -p 3000:3000 \
                        -v /root/home-site/static/icons:/app/public/icons  \
                        -v /root/home-site/content/Ops/HomePage/config:/app/config \
                        -v /var/run/docker.sock:/var/run/docker.sock:ro \
                        --restart unless-stopped \
                        ghcr.io/gethomepage/homepage:v1.5.0

                        Preliminary

                        1. Kubernetes has been installed, if not check 🔗link


                        2. Podman has been installed, if not check 🔗link


                        podman run -d \
                        --name homepage \
                        -e HOMEPAGE_ALLOWED_HOSTS=127.0.0.1:3000 \
                        -e PUID=1000 \
                        -e PGID=1000 \
                        -p 3000:3000 \
                        -v /root/home-site/static/icons:/app/public/icons \
                        -v /root/home-site/content/Ops/HomePage/config:/app/config \
                        --restart unless-stopped \
                        ghcr.io/gethomepage/homepage:v1.5.0


                        Oct 7, 2025

                        Install Prometheus Stack

                        Installation

                        Install By

                        Preliminary

                        1. Kubernetes has been installed, if not check 🔗link


                        2. Helm has been installed, if not check 🔗link


                        1.get helm repo

                        Details
                        helm repo add ay-helm-mirror https://aaronyang0628.github.io/helm-chart-mirror/charts
                        helm repo update

                        2.install chart

                        Details
                        helm install ay-helm-mirror/kube-prometheus-stack --generate-name
                        Using Mirror
                        helm repo add ay-helm-mirror https://aaronyang0628.github.io/helm-chart-mirror/charts \
                          && helm install ay-helm-mirror/kube-prometheus-stack --generate-name --version 72.9.1

                        for more information, you can check 🔗https://aaronyang0628.github.io/helm-chart-mirror/

                        Preliminary

                        1. Kubernetes has been installed, if not check 🔗link


                        2. ArgoCD has been installed, if not check 🔗link


                        3. Helm binary has been installed, if not check 🔗link


                        4. Ingress has been installed on ArgoCD, if not check 🔗link


                        1.prepare `chart-museum-credentials`

                        Details
                        kubectl get namespaces monitor > /dev/null 2>&1 || kubectl create namespace monitor
                        kubectl -n monitor create secret generic prometheus-stack-credentials \
                          --from-literal=grafana-username=admin \
                          --from-literal=grafana-password=$(tr -dc A-Za-z0-9 </dev/urandom | head -c 16)
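The grafana password above is generated inline from `/dev/urandom`. In isolation the generator behaves like this:

```shell
# Emit 16 random alphanumeric characters (the grafana-password generator above).
PASS=$(tr -dc A-Za-z0-9 </dev/urandom | head -c 16)
echo "${#PASS}"        # → 16
case "$PASS" in
  *[!A-Za-z0-9]*) echo "unexpected character" ;;
  *)              echo "alphanumeric only" ;;   # → alphanumeric only
esac
```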

                        2.prepare `prometheus-stack.yaml`

                        Details
                        kubectl -n argocd apply -f - << EOF
                          apiVersion: argoproj.io/v1alpha1
                          kind: Application
                          metadata:
                            name: prometheus-stack
                          spec:
                            syncPolicy:
                              syncOptions:
                                - CreateNamespace=true
                                - ServerSideApply=true
                            project: default
                            source:
                              repoURL: https://aaronyang0628.github.io/helm-chart-mirror/charts
                              chart: kube-prometheus-stack
                              targetRevision: 72.9.1
                              helm:
                                releaseName: prometheus-stack
                                values: |
                                  crds:
                                    enabled: true
                                  global:
                                    rbac:
                                      create: true
                                    imageRegistry: ""
                                    imagePullSecrets: []
                                  alertmanager:
                                    enabled: true
                                    ingress:
                                      enabled: false
                                    serviceMonitor:
                                      selfMonitor: true
                                      interval: ""
                                    alertmanagerSpec:
                                      image:
                                        registry: m.daocloud.io/quay.io
                                        repository: prometheus/alertmanager
                                        tag: v0.28.1
                                      replicas: 1
                                      resources: {}
                                      storage:
                                        volumeClaimTemplate:
                                          spec:
                                            storageClassName: ""
                                            accessModes: ["ReadWriteOnce"]
                                            resources:
                                              requests:
                                                storage: 2Gi
                                  grafana:
                                    enabled: true
                                    ingress:
                                      enabled: true
                                      annotations:
                                        cert-manager.io/cluster-issuer: self-signed-ca-issuer
                                        kubernetes.io/ingress.class: nginx
                                      hosts:
                                        - grafana.dev.tech
                                      path: /
                                      pathType: ImplementationSpecific
                                      tls:
                                      - secretName: grafana.dev.tech-tls
                                        hosts:
                                        - grafana.dev.tech
                                  prometheusOperator:
                                    admissionWebhooks:
                                      patch:
                                        resources: {}
                                        image:
                                          registry: m.daocloud.io/registry.k8s.io
                                          repository: ingress-nginx/kube-webhook-certgen
                                          tag: v1.5.3  
                                    image:
                                      registry: m.daocloud.io/quay.io
                                      repository: prometheus-operator/prometheus-operator
                                    prometheusConfigReloader:
                                      image:
                                        registry: m.daocloud.io/quay.io
                                        repository: prometheus-operator/prometheus-config-reloader
                                      resources: {}
                                    thanosImage:
                                      registry: m.daocloud.io/quay.io
                                      repository: thanos/thanos
                                      tag: v0.38.0
                                  prometheus:
                                    enabled: true
                                    ingress:
                                      enabled: true
                                      annotations:
                                        cert-manager.io/cluster-issuer: self-signed-ca-issuer
                                        kubernetes.io/ingress.class: nginx
                                      hosts:
                                        - prometheus.dev.tech
                                      path: /
                                      pathType: ImplementationSpecific
                                      tls:
                                      - secretName: prometheus.dev.tech-tls
                                        hosts:
                                        - prometheus.dev.tech
                                    prometheusSpec:
                                      image:
                                        registry: m.daocloud.io/quay.io
                                        repository: prometheus/prometheus
                                        tag: v3.4.0
                                      replicas: 1
                                      shards: 1
                                      resources: {}
                                      storageSpec: 
                                        volumeClaimTemplate:
                                          spec:
                                            storageClassName: ""
                                            accessModes: ["ReadWriteOnce"]
                                            resources:
                                              requests:
                                                storage: 2Gi
                                  thanosRuler:
                                    enabled: false
                                    ingress:
                                      enabled: false
                                    thanosRulerSpec:
                                      replicas: 1
                                      storage: {}
                                      resources: {}
                                      image:
                                        registry: m.daocloud.io/quay.io
                                        repository: thanos/thanos
                                        tag: v0.38.0
                            destination:
                              server: https://kubernetes.default.svc
                              namespace: monitor
                        EOF

                        3.sync by argocd

                        Details
                        argocd app sync argocd/prometheus-stack

                        4.extract grafana admin credentials

                        Details
                          kubectl -n monitor get secret prometheus-stack-credentials -o jsonpath='{.data.grafana-password}' | base64 -d

                        5.check the web browser

                        Details
                          > add `$K8S_MASTER_IP grafana.dev.tech` to **/etc/hosts**
                        
                          > add `$K8S_MASTER_IP prometheus.dev.tech` to **/etc/hosts**
                        prometheus-server: https://prometheus.dev.tech:32443/


                        grafana-console: https://grafana.dev.tech:32443/


                        install based on docker



                        Jun 7, 2024

                        Subsections of Networking

                        Install Cert Manager

                        Installation

                        Install By

                        Preliminary

                        1. Kubernetes has been installed, if not check 🔗link


                        2. Helm binary has been installed, if not check 🔗link


                        1.get helm repo

                        Details
                        helm repo add cert-manager-repo https://charts.jetstack.io
                        helm repo update

                        2.install chart

                        Details
                        helm install cert-manager-repo/cert-manager --generate-name --version 1.17.2
                        Using Mirror
                        helm repo add ay-helm-mirror https://aaronyang0628.github.io/helm-chart-mirror/charts \
                          && helm install ay-helm-mirror/cert-manager --generate-name --version 1.17.2

                        for more information, you can check 🔗https://aaronyang0628.github.io/helm-chart-mirror/

                        Preliminary

                        1. Kubernetes has been installed, if not check 🔗link


                        2. ArgoCD has been installed, if not check 🔗link


                        3. Helm binary has been installed, if not check 🔗link


                        1.prepare `cert-manager.yaml`

                        Details
                        kubectl -n argocd apply -f - << EOF
                        apiVersion: argoproj.io/v1alpha1
                        kind: Application
                        metadata:
                          name: cert-manager
                        spec:
                          syncPolicy:
                            syncOptions:
                            - CreateNamespace=true
                          project: default
                          source:
                            repoURL: https://aaronyang0628.github.io/helm-chart-mirror/charts
                            chart: cert-manager
                            targetRevision: 1.17.2
                            helm:
                              releaseName: cert-manager
                              values: |
                                installCRDs: true
                                image:
                                  repository: m.daocloud.io/quay.io/jetstack/cert-manager-controller
                                  tag: v1.17.2
                                webhook:
                                  image:
                                    repository: m.daocloud.io/quay.io/jetstack/cert-manager-webhook
                                    tag: v1.17.2
                                cainjector:
                                  image:
                                    repository: m.daocloud.io/quay.io/jetstack/cert-manager-cainjector
                                    tag: v1.17.2
                                acmesolver:
                                  image:
                                    repository: m.daocloud.io/quay.io/jetstack/cert-manager-acmesolver
                                    tag: v1.17.2
                                startupapicheck:
                                  image:
                                    repository: m.daocloud.io/quay.io/jetstack/cert-manager-startupapicheck
                                    tag: v1.17.2
                          destination:
                            server: https://kubernetes.default.svc
                            namespace: basic-components
                        EOF

                        3.sync by argocd

                        Details
                        argocd app sync argocd/cert-manager

                        Preliminary

                        1. Docker|Podman|Buildah has been installed, if not check 🔗link


                        1.just run

                        Details
                        docker run --name cert-manager -e ALLOW_EMPTY_PASSWORD=yes bitnami/cert-manager:latest
                        Using Proxy

                        you can use an additional daocloud image mirror to accelerate your pulling, check Daocloud Proxy

                        docker run --name cert-manager \
                          -e ALLOW_EMPTY_PASSWORD=yes \
                          m.daocloud.io/docker.io/bitnami/cert-manager:latest

                        Preliminary

                        1. Kubernetes has been installed, if not check 🔗link


                        1.just run

                        Details
                        kubectl create -f https://github.com/jetstack/cert-manager/releases/download/v1.17.2/cert-manager.yaml

                        Prepare Certificate Issuer

                        kubectl apply  -f - <<EOF
                        ---
                        apiVersion: cert-manager.io/v1
                        kind: Issuer
                        metadata:
                          namespace: basic-components
                          name: self-signed-issuer
                        spec:
                          selfSigned: {}
                        
                        ---
                        apiVersion: cert-manager.io/v1
                        kind: Certificate
                        metadata:
                          namespace: basic-components
                          name: my-self-signed-ca
                        spec:
                          isCA: true
                          commonName: my-self-signed-ca
                          secretName: root-secret
                          privateKey:
                            algorithm: ECDSA
                            size: 256
                          issuerRef:
                            name: self-signed-issuer
                            kind: Issuer
                            group: cert-manager.io
                        
                        ---
                        apiVersion: cert-manager.io/v1
                        kind: ClusterIssuer
                        metadata:
                          name: self-signed-ca-issuer
                        spec:
                          ca:
                            secretName: root-secret
                        EOF
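Once the `self-signed-ca-issuer` ClusterIssuer is in place, any namespace can request a leaf certificate from it. A minimal sketch, in which the names `my-app-cert`, `my-app-tls` and the hostname are hypothetical placeholders:

```yaml
# hypothetical example: request a leaf certificate signed by the
# self-signed-ca-issuer ClusterIssuer created above
apiVersion: cert-manager.io/v1
kind: Certificate
metadata:
  namespace: basic-components
  name: my-app-cert          # hypothetical name
spec:
  secretName: my-app-tls     # cert-manager writes the signed key pair here
  dnsNames:
    - my-app.ay.dev          # hypothetical hostname
  issuerRef:
    name: self-signed-ca-issuer
    kind: ClusterIssuer
    group: cert-manager.io
```

cert-manager will create the `my-app-tls` Secret, which an Ingress can then reference for TLS.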
                        kubectl -n kube-system apply -f - << EOF
                        apiVersion: cert-manager.io/v1
                        kind: ClusterIssuer
                        metadata:
                          name: letsencrypt
                        spec:
                          acme:
                            email: aaron19940628@gmail.com
                            server: https://acme-v02.api.letsencrypt.org/directory
                            privateKeySecretRef:
                              name: letsencrypt-account-key
                            solvers:
                            - http01:
                                ingress:
                                  class: nginx
                        EOF

                        FAQ

                        Q1: The browser doesn’t trust this self-signed certificate

Basically, you need to export the CA certificate and add it to your browser’s trust store.

                        kubectl -n basic-components get secret root-secret -o jsonpath='{.data.tls\.crt}' | base64 -d > cert-manager-self-signed-ca-secret.crt

                        And then import it into your browser.
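Before importing, you can inspect a certificate’s subject with `openssl`. The snippet below is a self-contained sketch: it generates a throwaway self-signed certificate (so it runs anywhere) and inspects it exactly as you would inspect the exported `cert-manager-self-signed-ca-secret.crt`; the `/tmp` file paths are hypothetical.

```shell
# generate a throwaway self-signed CA (a stand-in for the exported secret)
openssl req -x509 -newkey rsa:2048 -nodes \
  -keyout /tmp/demo-ca.key -out /tmp/demo-ca.crt \
  -days 1 -subj "/CN=my-self-signed-ca"

# inspect the subject, as you would for the exported .crt file
SUBJECT=$(openssl x509 -in /tmp/demo-ca.crt -noout -subject)
echo "$SUBJECT"
```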


                        Jun 7, 2024

                        Install HAProxy

                        Mar 7, 2024

                        Install Ingress

                        Installation

                        Install By

                        Preliminary

1. Kubernetes has been installed, if not check 🔗link


2. Helm has been installed, if not check 🔗link


                        1.get helm repo

                        Details
                        helm repo add ingress-nginx https://kubernetes.github.io/ingress-nginx
                        helm repo update

                        2.install chart

                        Details
                        helm install ingress-nginx/ingress-nginx --generate-name
                        Using Mirror
                        helm repo add ay-helm-mirror https://aaronyang0628.github.io/helm-chart-mirror/charts &&
                          helm install ay-helm-mirror/ingress-nginx --generate-name --version 4.11.3

                        for more information, you can check 🔗https://aaronyang0628.github.io/helm-chart-mirror/

                        Preliminary

1. Kubernetes has been installed, if not check 🔗link


2. argoCD has been installed, if not check 🔗link


                        1.prepare `ingress-nginx.yaml`

                        Details
                        kubectl -n argocd apply -f - <<EOF
                        apiVersion: argoproj.io/v1alpha1
                        kind: Application
                        metadata:
                          name: ingress-nginx
                        spec:
                          syncPolicy:
                            syncOptions:
                            - CreateNamespace=true
                          project: default
                          source:
                            repoURL: https://kubernetes.github.io/ingress-nginx
                            chart: ingress-nginx
                            targetRevision: 4.12.3
                            helm:
                              releaseName: ingress-nginx
                              values: |
                                controller:
                                  image:
                                    registry: m.daocloud.io/registry.k8s.io
                                  service:
                                    enabled: true
                                    type: NodePort
                                    nodePorts:
                                      http: 32080
                                      https: 32443
                                      tcp:
                                        8080: 32808
                                        5324: 33224 #pg
                                  resources:
                                    requests:
                                      cpu: 100m
                                      memory: 128Mi
                                  admissionWebhooks:
                                    enabled: true
                                    patch:
                                      enabled: true
                                      image:
                                        registry: m.daocloud.io/registry.k8s.io
                                metrics:
                                  enabled: false
                                defaultBackend:
                                  enabled: false
                                  image:
                                    registry: m.daocloud.io/registry.k8s.io
                          destination:
                            server: https://kubernetes.default.svc
                            namespace: basic-components
                        EOF

                        [Optional] 2.apply to k8s

                        Details
                        kubectl -n argocd apply -f ingress-nginx.yaml

                        3.sync by argocd

                        Details
                        argocd app sync argocd/ingress-nginx
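With the controller synced, traffic is routed by creating Ingress resources that reference the `nginx` class. A minimal sketch, assuming a hypothetical Service named `my-app` listening on port 80 and a hypothetical hostname; the cert-manager annotation also assumes the `self-signed-ca-issuer` from the cert-manager page:

```yaml
# hypothetical Ingress served by the controller installed above
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: my-app               # hypothetical
  annotations:
    cert-manager.io/cluster-issuer: self-signed-ca-issuer
spec:
  ingressClassName: nginx
  tls:
    - hosts:
        - my-app.ay.dev      # hypothetical hostname
      secretName: my-app-tls
  rules:
    - host: my-app.ay.dev
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: my-app # hypothetical backend Service
                port:
                  number: 80
```

With the NodePort service above, this host would be reachable on ports 32080 (HTTP) and 32443 (HTTPS) of a node.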

                        FAQ

Q1: Using minikube, cannot access the website
                        ssh -i ~/.minikube/machines/minikube/id_rsa docker@$(minikube ip) -L '*:30443:0.0.0.0:30443' -N -f
                        ssh -i ~/.minikube/machines/minikube/id_rsa docker@$(minikube ip) -L '*:32443:0.0.0.0:32443' -N -f
                        ssh -i ~/.minikube/machines/minikube/id_rsa docker@$(minikube ip) -L '*:32080:0.0.0.0:32080' -N -f


                        Jun 7, 2024

                        Install Istio

                        Installation

                        Install By

                        Preliminary

1. Kubernetes has been installed, if not check 🔗link


2. Helm has been installed, if not check 🔗link


                        1.get helm repo

                        Details
helm repo add istio https://istio-release.storage.googleapis.com/charts
helm repo update

2.install chart

Details
helm install istio-base istio/base -n istio-system --create-namespace
helm install istiod istio/istiod -n istio-system --wait
                        Using Proxy

                        Preliminary

1. Kubernetes has been installed, if not check 🔗link


2. Helm has been installed, if not check 🔗link


3. ArgoCD has been installed, if not check 🔗link


                        1.prepare `deploy-istio-base.yaml`

                        Details
                        kubectl -n argocd apply -f - << EOF
                        apiVersion: argoproj.io/v1alpha1
                        kind: Application
                        metadata:
                          name: istio-base
                        spec:
                          syncPolicy:
                            syncOptions:
                            - CreateNamespace=true
                          project: default
                          source:
                            repoURL: https://istio-release.storage.googleapis.com/charts
                            chart: base
                            targetRevision: 1.23.2
                            helm:
                              releaseName: istio-base
                              values: |
                                defaults:
                                  global:
                                    istioNamespace: istio-system
                                  base:
                                    enableCRDTemplates: false
                                    enableIstioConfigCRDs: true
                                  defaultRevision: "default"
                          destination:
                            server: https://kubernetes.default.svc
                            namespace: istio-system
                        EOF

                        2.sync by argocd

                        Details
                        argocd app sync argocd/istio-base

                        3.prepare `deploy-istiod.yaml`

                        Details
                        kubectl -n argocd apply -f - << EOF
                        apiVersion: argoproj.io/v1alpha1
                        kind: Application
                        metadata:
                          name: istiod
                        spec:
                          syncPolicy:
                            syncOptions:
                            - CreateNamespace=true
                          project: default
                          source:
                            repoURL: https://istio-release.storage.googleapis.com/charts
                            chart: istiod
                            targetRevision: 1.23.2
                            helm:
                              releaseName: istiod
                              values: |
                                defaults:
                                  global:
                                    istioNamespace: istio-system
                                    defaultResources:
                                      requests:
                                        cpu: 10m
                                        memory: 128Mi
                                      limits:
                                        cpu: 100m
                                        memory: 128Mi
                                    hub: m.daocloud.io/docker.io/istio
                                    proxy:
                                      autoInject: disabled
                                      resources:
                                        requests:
                                          cpu: 100m
                                          memory: 128Mi
                                        limits:
                                          cpu: 2000m
                                          memory: 1024Mi
                                  pilot:
                                    autoscaleEnabled: true
                                    resources:
                                      requests:
                                        cpu: 500m
                                        memory: 2048Mi
                                    cpu:
                                      targetAverageUtilization: 80
                                    podAnnotations:
                                      cluster-autoscaler.kubernetes.io/safe-to-evict: "true"
                          destination:
                            server: https://kubernetes.default.svc
                            namespace: istio-system
                        EOF

                        4.sync by argocd

                        Details
                        argocd app sync argocd/istiod

                        5.prepare `deploy-istio-ingressgateway.yaml`

                        Details
                        kubectl -n argocd apply -f - << EOF
                        apiVersion: argoproj.io/v1alpha1
                        kind: Application
                        metadata:
                          name: istio-ingressgateway
                        spec:
                          syncPolicy:
                            syncOptions:
                            - CreateNamespace=true
                          project: default
                          source:
                            repoURL: https://istio-release.storage.googleapis.com/charts
                            chart: gateway
                            targetRevision: 1.23.2
                            helm:
                              releaseName: istio-ingressgateway
                              values: |
                                defaults:
                                  replicaCount: 1
                                  podAnnotations:
                                    inject.istio.io/templates: "gateway"
                                    sidecar.istio.io/inject: "true"
                                    cluster-autoscaler.kubernetes.io/safe-to-evict: "true"
                                  resources:
                                    requests:
                                      cpu: 100m
                                      memory: 128Mi
                                    limits:
                                      cpu: 2000m
                                      memory: 1024Mi
                                  service:
                                    type: LoadBalancer
                                    ports:
                                    - name: status-port
                                      port: 15021
                                      protocol: TCP
                                      targetPort: 15021
                                    - name: http2
                                      port: 80
                                      protocol: TCP
                                      targetPort: 80
                                    - name: https
                                      port: 443
                                      protocol: TCP
                                      targetPort: 443
                                  autoscaling:
                                    enabled: true
                                    minReplicas: 1
                                    maxReplicas: 5
                          destination:
                            server: https://kubernetes.default.svc
                            namespace: istio-system
                        EOF

                        6.sync by argocd

                        Details
                        argocd app sync argocd/istio-ingressgateway
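After the three charts are synced, traffic enters through the gateway pods and is routed with Gateway and VirtualService resources. A minimal sketch; the hostname and backend service are hypothetical, and the `istio: ingressgateway` selector assumes the default labels applied by the gateway chart:

```yaml
# hypothetical routing config for the istio-ingressgateway installed above
apiVersion: networking.istio.io/v1beta1
kind: Gateway
metadata:
  name: my-gateway           # hypothetical
  namespace: istio-system
spec:
  selector:
    istio: ingressgateway    # assumed default label on the gateway pods
  servers:
    - port:
        number: 80
        name: http
        protocol: HTTP
      hosts:
        - "my-app.ay.dev"    # hypothetical hostname
---
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: my-app               # hypothetical
  namespace: istio-system
spec:
  hosts:
    - "my-app.ay.dev"
  gateways:
    - my-gateway
  http:
    - route:
        - destination:
            host: my-app.default.svc.cluster.local  # hypothetical backend
            port:
              number: 80
```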

                        Preliminary

1. Kubernetes has been installed, if not check 🔗link


2. Helm has been installed, if not check 🔗link


3. ArgoCD has been installed, if not check 🔗link


4. Argo Workflow has been installed, if not check 🔗link


                        1.prepare `argocd-login-credentials`

                        Details
                        kubectl get namespaces database > /dev/null 2>&1 || kubectl create namespace database

                        2.apply rolebinding to k8s

                        Details
                        kubectl apply -f - <<EOF
                        ---
                        apiVersion: rbac.authorization.k8s.io/v1
                        kind: ClusterRole
                        metadata:
                          name: application-administrator
                        rules:
                          - apiGroups:
                              - argoproj.io
                            resources:
                              - applications
                            verbs:
                              - '*'
                          - apiGroups:
                              - apps
                            resources:
                              - deployments
                            verbs:
                              - '*'
                        
                        ---
                        apiVersion: rbac.authorization.k8s.io/v1
                        kind: RoleBinding
                        metadata:
                          name: application-administration
                          namespace: argocd
                        roleRef:
                          apiGroup: rbac.authorization.k8s.io
                          kind: ClusterRole
                          name: application-administrator
                        subjects:
                          - kind: ServiceAccount
                            name: argo-workflow
                            namespace: business-workflows
                        
                        ---
                        apiVersion: rbac.authorization.k8s.io/v1
                        kind: RoleBinding
                        metadata:
                          name: application-administration
                          namespace: application
                        roleRef:
                          apiGroup: rbac.authorization.k8s.io
                          kind: ClusterRole
                          name: application-administrator
                        subjects:
                          - kind: ServiceAccount
                            name: argo-workflow
                            namespace: business-workflows
                        EOF

                        4.prepare `deploy-xxxx-flow.yaml`

                        Details

6.submit to the argo workflow client

                        Details
                        argo -n business-workflows submit deploy-xxxx-flow.yaml

                        7.decode password

                        Details
                        kubectl -n application get secret xxxx-credentials -o jsonpath='{.data.xxx-password}' | base64 -d
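The `jsonpath` + `base64 -d` pipeline works because Kubernetes stores Secret values base64-encoded. The round trip in isolation, with a hypothetical sample value:

```shell
# Kubernetes stores Secret data base64-encoded; `base64 -d` recovers it
ENCODED=$(printf 's3cr3t-pass' | base64)
DECODED=$(printf '%s' "$ENCODED" | base64 -d)
echo "$DECODED"
```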


                        Jun 7, 2024

                        Install Nginx

                        1. prepare server.conf

                        cat << EOF > default.conf
                        server {
                          listen 80;
                          location / {
                              root   /usr/share/nginx/html;
                              autoindex on;
                          }
                        }
                        EOF

                        2. install

mkdir -p $(pwd)/data
                        podman run --rm -p 8080:80 \
                            -v $(pwd)/data:/usr/share/nginx/html:ro \
                            -v $(pwd)/default.conf:/etc/nginx/conf.d/default.conf:ro \
                            -d docker.io/library/nginx:1.19.9-alpine
                        echo 'this is a test' > $(pwd)/data/some-data.txt
                        Tip

you can pull through the DaoCloud mirror to accelerate image pulling, check Daocloud Proxy

                        visit http://localhost:8080

                        Mar 7, 2024

                        Install Traefik

                        Mar 7, 2024

                        Subsections of RPC

gRPC

                        This guide gets you started with gRPC in C++ with a simple working example.

                        In the C++ world, there’s no universally accepted standard for managing project dependencies. You need to build and install gRPC before building and running this quick start’s Hello World example.

The steps in this section explain how to build and locally install gRPC and Protocol Buffers using cmake. If you’d rather use bazel, see Building from source.

                        1. Setup

                        Choose a directory to hold locally installed packages. This page assumes that the environment variable MY_INSTALL_DIR holds this directory path. For example:

                        export MY_INSTALL_DIR=$HOME/.local

                        Ensure that the directory exists:

                        mkdir -p $MY_INSTALL_DIR

                        Add the local bin folder to your path variable, for example:

                        export PATH="$MY_INSTALL_DIR/bin:$PATH"
                        Important

                        We strongly encourage you to install gRPC locally — using an appropriately set CMAKE_INSTALL_PREFIX — because there is no easy way to uninstall gRPC after you’ve installed it globally.
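You can sanity-check that the local bin folder actually landed on `PATH`; a small self-contained sketch:

```shell
# verify that $MY_INSTALL_DIR/bin appears as a PATH entry
export MY_INSTALL_DIR=$HOME/.local
mkdir -p "$MY_INSTALL_DIR/bin"
export PATH="$MY_INSTALL_DIR/bin:$PATH"
case ":$PATH:" in
  *":$MY_INSTALL_DIR/bin:"*) ON_PATH=yes ;;
  *)                         ON_PATH=no  ;;
esac
echo "$ON_PATH"
```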

                        2. Install Essentials

                        2.1 Install Cmake

                        You need version 3.13 or later of cmake. Install it by following these instructions:

                        Install on
                        sudo apt install -y cmake
                        brew install cmake
                        Check the version of cmake
                        cmake --version
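If you want to script the minimum-version check, `sort -V` can compare version strings. A sketch; `VERSION` is a hypothetical stand-in for the number `cmake --version` prints:

```shell
# compare a cmake version string against the 3.13 minimum
REQUIRED=3.13
VERSION=3.28.1   # substitute the output of `cmake --version`
LOWEST=$(printf '%s\n%s\n' "$REQUIRED" "$VERSION" | sort -V | head -n 1)
if [ "$LOWEST" = "$REQUIRED" ]; then STATUS=ok; else STATUS="too old"; fi
echo "$STATUS"
```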
                        2.2 Install basic tools required to build gRPC
                        Install on
                        sudo apt install -y build-essential autoconf libtool pkg-config
                        brew install autoconf automake libtool pkg-config
                        2.3 Clone the grpc repo

                        Clone the grpc repo and its submodules:

                        git clone --recurse-submodules -b v1.62.0 --depth 1 --shallow-submodules https://github.com/grpc/grpc
                        2.4 Build and install gRPC and Protocol Buffers

                        While not mandatory, gRPC applications usually leverage Protocol Buffers for service definitions and data serialization, and the example code uses proto3.

                        The following commands build and locally install gRPC and Protocol Buffers:

                        cd grpc
                        mkdir -p cmake/build
                        pushd cmake/build
                        cmake -DgRPC_INSTALL=ON \
                              -DgRPC_BUILD_TESTS=OFF \
                              -DCMAKE_INSTALL_PREFIX=$MY_INSTALL_DIR \
                              ../..
                        make -j 4
                        make install
                        popd
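The `make -j 4` above hard-codes four parallel jobs; a common alternative is to size the job count to the machine. A sketch that falls back to 4 when neither probe is available:

```shell
# detect core count: nproc on Linux, sysctl on macOS, else a safe default
JOBS=$(nproc 2>/dev/null || sysctl -n hw.ncpu 2>/dev/null || echo 4)
echo "$JOBS"
# then build with: make -j "$JOBS"
```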

                        3. Run the example

                        The example code is part of the grpc repo source, which you cloned as part of the steps of the previous section.

3.1 change to the example’s directory:
                        cd examples/cpp/helloworld
                        3.2 build the example project by using cmake

make sure `echo $MY_INSTALL_DIR` still returns a valid path

                        mkdir -p cmake/build
                        pushd cmake/build
                        cmake -DCMAKE_PREFIX_PATH=$MY_INSTALL_DIR ../..
                        make -j 4

                        3.3 run the server

                        ./greeter_server

                        3.4 from a different terminal, run the client and see the client output:

                        ./greeter_client

                        and the result should be like this:

                        Greeter received: Hello world
                        Apr 7, 2024

                        Subsections of Storage

Deploy Artifact Repository

                        Preliminary

• Kubernetes has been installed, if not check link
                        • minio is ready for artifact repository

                          endpoint: minio.storage:9000

                        Steps

                        1. prepare bucket for s3 artifact repository

# K8S_MASTER_IP could be your master IP or your load balancer's external IP
                        K8S_MASTER_IP=172.27.253.27
                        MINIO_ACCESS_SECRET=$(kubectl -n storage get secret minio-secret -o jsonpath='{.data.rootPassword}' | base64 -d)
                        podman run --rm \
                        --entrypoint bash \
                        --add-host=minio-api.dev.geekcity.tech:${K8S_MASTER_IP} \
                        -it docker.io/minio/mc:latest \
                        -c "mc alias set minio http://minio-api.dev.geekcity.tech admin ${MINIO_ACCESS_SECRET} \
                            && mc ls minio \
                            && mc mb --ignore-existing minio/argo-workflows-artifacts"

                        2. prepare secret s3-artifact-repository-credentials

this creates the secret in the business-workflows namespace (create the namespace first if it does not exist)

                        MINIO_ACCESS_KEY=$(kubectl -n storage get secret minio-secret -o jsonpath='{.data.rootUser}' | base64 -d)
                        kubectl -n business-workflows create secret generic s3-artifact-repository-credentials \
                            --from-literal=accessKey=${MINIO_ACCESS_KEY} \
                            --from-literal=secretKey=${MINIO_ACCESS_SECRET}

                        3. prepare configMap artifact-repositories.yaml

                        apiVersion: v1
                        kind: ConfigMap
                        metadata:
                          name: artifact-repositories
                          annotations:
                            workflows.argoproj.io/default-artifact-repository: default-artifact-repository
                        data:
                          default-artifact-repository: |
                            s3:
                              endpoint: minio.storage:9000
                              insecure: true
                              accessKeySecret:
                                name: s3-artifact-repository-credentials
                                key: accessKey
                              secretKeySecret:
                                name: s3-artifact-repository-credentials
                                key: secretKey
                              bucket: argo-workflows-artifacts

                        4. apply artifact-repositories.yaml to k8s

                        kubectl -n business-workflows apply -f artifact-repositories.yaml
                        Mar 7, 2024

                        Install Chart Museum

                        Installation

                        Install By

                        Preliminary

1. Kubernetes has been installed, if not check 🔗link


2. Helm binary has been installed, if not check 🔗link


                        1.get helm repo

                        Details
helm repo add chartmuseum https://chartmuseum.github.io/charts
helm repo update

2.install chart

Details
helm install chartmuseum/chartmuseum --generate-name --version 3.10.3
Using Mirror
helm repo add ay-helm-mirror https://aaronyang0628.github.io/helm-chart-mirror/charts \
  && helm install ay-helm-mirror/chartmuseum --generate-name --version 3.10.3

                        for more information, you can check 🔗https://aaronyang0628.github.io/helm-chart-mirror/

                        Preliminary

1. Kubernetes has been installed, if not check 🔗link


2. ArgoCD has been installed, if not check 🔗link


3. Helm binary has been installed, if not check 🔗link


4. Ingress has been installed on argoCD, if not check 🔗link


5. Minio has been installed, if not check 🔗link


                        1.prepare `chart-museum-credentials`

                        Storage In
                        kubectl get namespaces basic-components > /dev/null 2>&1 || kubectl create namespace basic-components
                        kubectl -n basic-components create secret generic chart-museum-credentials \
                            --from-literal=username=admin \
                            --from-literal=password=$(tr -dc A-Za-z0-9 </dev/urandom | head -c 16)
                        
                        kubectl get namespaces basic-components > /dev/null 2>&1 || kubectl create namespace basic-components
                        kubectl -n basic-components create secret generic chart-museum-credentials \
                            --from-literal=username=admin \
                            --from-literal=password=$(tr -dc A-Za-z0-9 </dev/urandom | head -c 16) \
                            --from-literal=aws_access_key_id=$(kubectl -n storage get secret minio-credentials -o jsonpath='{.data.rootUser}' | base64 -d) \
                            --from-literal=aws_secret_access_key=$(kubectl -n storage get secret minio-credentials -o jsonpath='{.data.rootPassword}' | base64 -d)
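The `tr -dc A-Za-z0-9 </dev/urandom | head -c 16` fragment above is what generates the password. In isolation it yields 16 random alphanumeric characters; the variant below reads a fixed chunk of entropy first so the pipeline exits cleanly even under `set -o pipefail`:

```shell
# generate a 16-character alphanumeric password from /dev/urandom
# (512 random bytes are, in practice, always enough alnum characters)
PASS=$(head -c 512 /dev/urandom | tr -dc 'A-Za-z0-9' | cut -c1-16)
LEN=${#PASS}
echo "$LEN"
```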
                        

                        2.prepare `chart-museum.yaml`

                        Storage In
                        kubectl apply -f - << EOF
                        apiVersion: argoproj.io/v1alpha1
                        kind: Application
                        metadata:
                          name: chart-museum
                        spec:
                          syncPolicy:
                            syncOptions:
                              - CreateNamespace=true
                          project: default
                          source:
                            repoURL: https://chartmuseum.github.io/charts
                            chart: chartmuseum
                            targetRevision: 3.10.3
                            helm:
                              releaseName: chart-museum
                              values: |
                                replicaCount: 1
                                image:
                                  repository: m.daocloud.io/ghcr.io/helm/chartmuseum
                                env:
                                  open:
                                    DISABLE_API: false
                                    STORAGE: local
                                    AUTH_ANONYMOUS_GET: true
                                  existingSecret: "chart-museum-credentials"
                                  existingSecretMappings:
                                    BASIC_AUTH_USER: "username"
                                    BASIC_AUTH_PASS: "password"
                                persistence:
                                  enabled: false
                                  storageClass: ""
                                volumePermissions:
                                  image:
                                    registry: m.daocloud.io/docker.io
                                ingress:
                                  enabled: true
                                  ingressClassName: nginx
                                  annotations:
                                    cert-manager.io/cluster-issuer: self-signed-ca-issuer
                                    nginx.ingress.kubernetes.io/rewrite-target: /$1
                                  hosts:
                                    - name: chartmuseum.ay.dev
                                      path: /?(.*)
                                      tls: true
                                      tlsSecret: chartmuseum.ay.dev-tls
                          destination:
                            server: https://kubernetes.default.svc
                            namespace: basic-components
                        EOF
                        
                        kubectl apply -f - << EOF
                        apiVersion: argoproj.io/v1alpha1
                        kind: Application
                        metadata:
                          name: chart-museum
                        spec:
                          syncPolicy:
                            syncOptions:
                              - CreateNamespace=true
                          project: default
                          source:
                            repoURL: https://chartmuseum.github.io/charts
                            chart: chartmuseum
                            targetRevision: 3.10.3
                            helm:
                              releaseName: chart-museum
                              values: |
                                replicaCount: 1
                                image:
                                  repository: m.daocloud.io/ghcr.io/helm/chartmuseum
                                env:
                                  open:
                                    DISABLE_API: false
                                    STORAGE: amazon
                                    STORAGE_AMAZON_ENDPOINT: http://minio-api.ay.dev:32080
                                    STORAGE_AMAZON_BUCKET: chart-museum
                                    STORAGE_AMAZON_PREFIX: charts
                                    STORAGE_AMAZON_REGION: us-east-1
                                    AUTH_ANONYMOUS_GET: true
                                  existingSecret: "chart-museum-credentials"
                                  existingSecretMappings:
                                    BASIC_AUTH_USER: "username"
                                    BASIC_AUTH_PASS: "password"
                                    AWS_ACCESS_KEY_ID: "aws_access_key_id"
                                    AWS_SECRET_ACCESS_KEY: "aws_secret_access_key"
                                persistence:
                                  enabled: false
                                  storageClass: ""
                                volumePermissions:
                                  image:
                                    registry: m.daocloud.io/docker.io
                                ingress:
                                  enabled: true
                                  ingressClassName: nginx
                                  annotations:
                                    cert-manager.io/cluster-issuer: self-signed-ca-issuer
                                    nginx.ingress.kubernetes.io/rewrite-target: /$1
                                  hosts:
                                    - name: chartmuseum.ay.dev
                                      path: /?(.*)
                                      tls: true
                                      tlsSecret: chartmuseum.ay.dev-tls
                          destination:
                            server: https://kubernetes.default.svc
                            namespace: basic-components
                        EOF
                        

                        3.sync by argocd

                        Details
                        argocd app sync argocd/chart-museum

                        Uploading a Chart Package

The examples below assume ChartMuseum is reachable at http://localhost:8080; substitute your Ingress host (e.g. https://chartmuseum.ay.dev) if you deployed it as above.

                        First create mychart-0.1.0.tgz using the Helm CLI:

                        cd mychart/
                        helm package .

                        Upload mychart-0.1.0.tgz:

                        curl --data-binary "@mychart-0.1.0.tgz" http://localhost:8080/api/charts

                        If you’ve signed your package and generated a provenance file, upload it with:

                        curl --data-binary "@mychart-0.1.0.tgz.prov" http://localhost:8080/api/prov

                        Both files can also be uploaded at once (or one at a time) on the /api/charts route using the multipart/form-data format:

                        curl -F "chart=@mychart-0.1.0.tgz" -F "prov=@mychart-0.1.0.tgz.prov" http://localhost:8080/api/charts

                        You can also use the helm-push plugin:

                        helm cm-push mychart/ chartmuseum

                        Installing Charts into Kubernetes

                        Add the URL to your ChartMuseum installation to the local repository list:

                        helm repo add chartmuseum http://localhost:8080

                        Search for charts:

                        helm search repo chartmuseum/

                        Install chart:

                        helm install chartmuseum/mychart --generate-name


                        Jun 7, 2024

                        Install Harbor

                        Mar 7, 2025

                        Install Minio

                        Installation

                        Install By

                        Preliminary

1. Kubernetes has been installed, if not check 🔗link


2. The Helm binary has been installed, if not check 🔗link


Preliminary

1. Kubernetes has been installed, if not check 🔗link


2. ArgoCD has been installed, if not check 🔗link


3. Ingress has been installed for argoCD, if not check 🔗link


4. Cert-manager has been installed and a ClusterIssuer named `self-signed-ca-issuer` exists, if not check 🔗link


                        1.prepare minio credentials secret

                        Details
                        kubectl get namespaces storage > /dev/null 2>&1 || kubectl create namespace storage
                        kubectl -n storage create secret generic minio-secret \
                            --from-literal=root-user=admin \
                            --from-literal=root-password=$(tr -dc A-Za-z0-9 </dev/urandom | head -c 16)
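The `kubectl get namespaces storage > /dev/null 2>&1 || kubectl create namespace storage` line is an idempotency guard: the command after `||` runs only when the probe before it fails. A generic sketch of the same pattern using plain shell commands instead of kubectl (the `ensure_dir` helper name is hypothetical):

```shell
# Run the fallback only when the probe fails (non-zero exit),
# mirroring `kubectl get ns X || kubectl create ns X`.
ensure_dir() {
  test -d "$1" > /dev/null 2>&1 || mkdir -p "$1"
}

dir="$(mktemp -d)/demo"
ensure_dir "$dir"   # first call: probe fails, mkdir runs
ensure_dir "$dir"   # second call: probe succeeds, mkdir is skipped
test -d "$dir" && echo "exists"   # → exists
```

Because of this guard, re-running the whole step is safe even when the namespace already exists.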

                        2.prepare `deploy-minio.yaml`

                        Details
                        kubectl -n argocd apply -f - << EOF
                        apiVersion: argoproj.io/v1alpha1
                        kind: Application
                        metadata:
                          name: minio
                        spec:
                          syncPolicy:
                            syncOptions:
                            - CreateNamespace=true
                          project: default
                          source:
                            repoURL: https://aaronyang0628.github.io/helm-chart-mirror/charts
                            chart: minio
                            targetRevision: 16.0.10
                            helm:
                              releaseName: minio
                              values: |
                                global:
                                  imageRegistry: "m.daocloud.io/docker.io"
                                  imagePullSecrets: []
                                  storageClass: ""
                                  security:
                                    allowInsecureImages: true
                                  compatibility:
                                    openshift:
                                      adaptSecurityContext: auto
                                image:
                                  registry: m.daocloud.io/docker.io
                                  repository: bitnami/minio
                                clientImage:
                                  registry: m.daocloud.io/docker.io
                                  repository: bitnami/minio-client
                                mode: standalone
                                defaultBuckets: ""
                                auth:
                                  # rootUser: admin
                                  # rootPassword: ""
                                  existingSecret: "minio-secret"
                                statefulset:
                                  updateStrategy:
                                    type: RollingUpdate
                                  podManagementPolicy: Parallel
                                  replicaCount: 1
                                  zones: 1
                                  drivesPerNode: 1
                                resourcesPreset: "micro"
                                resources: 
                                  requests:
                                    memory: 512Mi
                                    cpu: 250m
                                  limits:
                                    memory: 512Mi
                                    cpu: 250m
                                ingress:
                                  enabled: true
                                  ingressClassName: "nginx"
                                  hostname: minio-console.ay.online
                                  path: /?(.*)
                                  pathType: ImplementationSpecific
                                  annotations:
                                    kubernetes.io/ingress.class: nginx
                                    nginx.ingress.kubernetes.io/rewrite-target: /$1
                                    cert-manager.io/cluster-issuer: self-signed-ca-issuer
                                  tls: true
                                  selfSigned: true
                                  extraHosts: []
                                apiIngress:
                                  enabled: true
                                  ingressClassName: "nginx"
                                  hostname: minio-api.ay.online
                                  path: /?(.*)
                                  pathType: ImplementationSpecific
                                  annotations: 
                                    kubernetes.io/ingress.class: nginx
                                    nginx.ingress.kubernetes.io/rewrite-target: /$1
                                    cert-manager.io/cluster-issuer: self-signed-ca-issuer
                                  tls: true
                                  selfSigned: true
                                  extraHosts: []
                                persistence:
                                  enabled: false
                                  storageClass: ""
                                  mountPath: /bitnami/minio/data
                                  accessModes:
                                    - ReadWriteOnce
                                  size: 8Gi
                                  annotations: {}
                                  existingClaim: ""
                                metrics:
                                  prometheusAuthType: public
                                  enabled: false
                                  serviceMonitor:
                                    enabled: false
                                    namespace: ""
                                    labels: {}
                                    jobLabel: ""
                                    paths:
                                      - /minio/v2/metrics/cluster
                                      - /minio/v2/metrics/node
                                    interval: 30s
                                    scrapeTimeout: ""
                                    honorLabels: false
                                  prometheusRule:
                                    enabled: false
                                    namespace: ""
                                    additionalLabels: {}
                                    rules: []
                          destination:
                            server: https://kubernetes.default.svc
                            namespace: storage
                        EOF

                        3.sync by argocd

                        Details
                        argocd app sync argocd/minio

                        4.decode minio secret

                        Details
                        kubectl -n storage get secret minio-secret -o jsonpath='{.data.root-password}' | base64 -d
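Kubernetes stores Secret values base64-encoded, which is why the jsonpath output needs `| base64 -d`. A quick local round-trip illustrating the encoding (the `plain` value is an example, not the real password):

```shell
# Encode the way Kubernetes stores Secret data, then decode the way
# the command above does.
plain="sup3r-s3cret"
encoded=$(printf '%s' "$plain" | base64)
decoded=$(printf '%s' "$encoded" | base64 -d)
[ "$decoded" = "$plain" ] && echo "round-trip ok"   # → round-trip ok
```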

                        5.visit web console

                        Login Credentials

                        add $K8S_MASTER_IP minio-console.ay.online to /etc/hosts

                        address: 🔗http://minio-console.ay.online:32080/login

                        access key: admin

secret key: the password decoded in step 4
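The `/etc/hosts` entry above can be scripted. A sketch that writes the mapping to a temp file instead of the real `/etc/hosts` (the IP value is a placeholder for your control-plane address):

```shell
# Append the hostname mapping; a temp copy stands in for /etc/hosts here.
K8S_MASTER_IP="192.168.1.100"   # placeholder, use your control-plane IP
hosts_file="$(mktemp)"
printf '%s %s\n' "$K8S_MASTER_IP" "minio-console.ay.online" >> "$hosts_file"
grep -q 'minio-console.ay.online' "$hosts_file" && echo "entry added"   # → entry added
```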

                        6.using mc

                        Details
                        K8S_MASTER_IP=$(kubectl get node -l node-role.kubernetes.io/control-plane -o jsonpath='{.items[0].status.addresses[?(@.type=="InternalIP")].address}')
                        MINIO_ACCESS_SECRET=$(kubectl -n storage get secret minio-secret -o jsonpath='{.data.root-password}' | base64 -d)
                        podman run --rm \
                            --entrypoint bash \
    --add-host=minio-api.ay.online:${K8S_MASTER_IP} \
    -it m.daocloud.io/docker.io/minio/mc:latest \
    -c "mc alias set minio http://minio-api.ay.online:32080 admin ${MINIO_ACCESS_SECRET} \
                                && mc ls minio \
                                && mc mb --ignore-existing minio/test \
                                && mc cp /etc/hosts minio/test/etc/hosts \
                                && mc ls --recursive minio"
                        Details
                        K8S_MASTER_IP=$(kubectl get node -l node-role.kubernetes.io/control-plane -o jsonpath='{.items[0].status.addresses[?(@.type=="InternalIP")].address}')
                        MINIO_ACCESS_SECRET=$(kubectl -n storage get secret minio-secret -o jsonpath='{.data.root-password}' | base64 -d)
                        podman run --rm \
                            --entrypoint bash \
    --add-host=minio-api.ay.online:${K8S_MASTER_IP} \
                            -it m.daocloud.io/docker.io/minio/mc:latest

                        Preliminary

1. Docker has been installed, if not check 🔗link


                        Using Proxy

you can run an additional DaoCloud proxy image to accelerate image pulls, check Daocloud Proxy

                        1.init server

                        Details
                        mkdir -p $(pwd)/minio/data
                        podman run --rm \
                            --name minio-server \
                            -p 9000:9000 \
                            -p 9001:9001 \
                            -v $(pwd)/minio/data:/data \
                            -d docker.io/minio/minio:latest server /data --console-address :9001

                        2.use web console

                        And then you can visit 🔗http://localhost:9001

                        username: `minioadmin`

                        password: `minioadmin`

                        3.use internal client

                        Details
                        podman run --rm \
                            --entrypoint bash \
                            -it docker.io/minio/mc:latest \
                            -c "mc alias set minio http://host.docker.internal:9000 minioadmin minioadmin \
                                && mc ls minio \
                                && mc mb --ignore-existing minio/test \
                                && mc cp /etc/hosts minio/test/etc/hosts \
                                && mc ls --recursive minio"


                        Mar 7, 2024

                        Install NFS

                        Installation

                        Install By

                        Preliminary

1. Kubernetes has been installed, if not check 🔗link


2. Helm has been installed, if not check 🔗link


Preliminary

1. Kubernetes has been installed, if not check 🔗link


2. ArgoCD has been installed, if not check 🔗link


3. Ingress has been installed for argoCD, if not check 🔗link


                        1.prepare `nfs-provisioner.yaml`

                        Details
                        apiVersion: argoproj.io/v1alpha1
                        kind: Application
                        metadata:
                          name: nfs-provisioner
                        spec:
                          syncPolicy:
                            syncOptions:
                            - CreateNamespace=true
                          project: default
                          source:
                            repoURL: https://kubernetes-sigs.github.io/nfs-subdir-external-provisioner
                            chart: nfs-subdir-external-provisioner
                            targetRevision: 4.0.18
                            helm:
                              releaseName: nfs-provisioner
                              values: |
                                image:
                                  repository: m.daocloud.io/registry.k8s.io/sig-storage/nfs-subdir-external-provisioner
                                  pullPolicy: IfNotPresent
                                nfs:
                                  server: nfs.services.test
                                  path: /
                                  mountOptions:
                                    - vers=4
                                    - minorversion=0
                                    - rsize=1048576
                                    - wsize=1048576
                                    - hard
                                    - timeo=600
                                    - retrans=2
                                    - noresvport
                                  volumeName: nfs-subdir-external-provisioner-nas
                                  reclaimPolicy: Retain
                                storageClass:
                                  create: true
                                  defaultClass: true
                                  name: nfs-external-nas
                          destination:
                            server: https://kubernetes.default.svc
                            namespace: storage

2.deploy nfs-provisioner

                        Details
                        kubectl -n argocd apply -f nfs-provisioner.yaml

3.sync by argocd

                        Details
                        argocd app sync argocd/nfs-provisioner

                        Preliminary

1. Docker has been installed, if not check 🔗link


                        Using Proxy

you can run an additional DaoCloud proxy image to accelerate image pulls, check Daocloud Proxy

                        1.init server

                        Details
                        echo -e "nfs\nnfsd" > /etc/modules-load.d/nfs4.conf
                        modprobe nfs && modprobe nfsd
                        mkdir -p $(pwd)/data/nfs/data
                        echo '/data *(rw,fsid=0,no_subtree_check,insecure,no_root_squash)' > $(pwd)/data/nfs/exports
                        podman run \
                            --name nfs4 \
                            --rm \
                            --privileged \
                            -p 2049:2049 \
                            -v $(pwd)/data/nfs/data:/data \
                            -v $(pwd)/data/nfs/exports:/etc/exports:ro \
                            -d docker.io/erichough/nfs-server:2.2.1

                        Preliminary

1. The CentOS yum repo source has been updated, if not check 🔗link


                        2.

                        1.install nfs util

# Debian / Ubuntu
sudo apt update -y
sudo apt-get install -y nfs-common

# CentOS / RHEL
dnf update -y
dnf install -y nfs-utils rpcbind

                        2. create share folder

                        Details
                        mkdir /data && chmod 755 /data

                        3.edit `/etc/exports`

                        Details
                        /data *(rw,sync,insecure,no_root_squash,no_subtree_check)
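The export line grants every client (`*`) read-write access with root squashing disabled. A sketch that writes the same entry to a temp file (standing in for `/etc/exports`) and extracts the option list for inspection:

```shell
# Write the export entry to a temp file instead of the real /etc/exports.
exports_file="$(mktemp)"
echo '/data *(rw,sync,insecure,no_root_squash,no_subtree_check)' > "$exports_file"

# Pull the comma-separated option list out of the parentheses.
opts=$(sed -n 's/.*(\(.*\)).*/\1/p' "$exports_file")
echo "$opts"   # → rw,sync,insecure,no_root_squash,no_subtree_check
```

After editing the real `/etc/exports`, re-export with `exportfs -ra` or restart nfs-server as in the next step.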

                        4.start nfs server

                        Details
                        systemctl enable rpcbind
                        systemctl enable nfs-server
                        systemctl start rpcbind
                        systemctl start nfs-server

                        5.test load on localhost

                        Details
                        showmount -e localhost
Expected Output
                        Export list for localhost:
                        /data *

                        6.test load on other ip

                        Details
                        showmount -e 192.168.aa.bb
Expected Output
Export list for 192.168.aa.bb:
/data *

                        7.mount nfs disk

                        Details
                        mkdir -p $(pwd)/mnt/nfs
                        sudo mount -v 192.168.aa.bb:/data $(pwd)/mnt/nfs  -o proto=tcp -o nolock

                        8.set nfs auto mount

                        Details
                        echo "192.168.aa.bb:/data /data nfs rw,auto,nofail,noatime,nolock,intr,tcp,actimeo=1800 0 0" >> /etc/fstab
                        df -h
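Each fstab row is six whitespace-separated fields: device, mount point, filesystem type, options, dump flag, and fsck pass. A sketch that builds the same entry in a temp file and splits out the fields (the IP is a placeholder):

```shell
# Build the fstab entry in a temp file standing in for /etc/fstab.
fstab_file="$(mktemp)"
echo "192.168.1.100:/data /data nfs rw,auto,nofail,noatime,nolock,intr,tcp,actimeo=1800 0 0" >> "$fstab_file"

# Field 2 is the mount point, field 3 the filesystem type.
awk '{print $2, $3}' "$fstab_file"   # → /data nfs
```

The trailing `0 0` disables dump backups and boot-time fsck, which is the usual choice for network filesystems.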

                        Notes

[Optional] create a new partition
fdisk /dev/vdb

# n
# p
# w

or use parted:
parted

# select /dev/vdb
# mklabel gpt
# mkpart primary 0% 100%
# print

[Optional] format the disk
mkfs.xfs /dev/vdb1 -f

[Optional] mount the disk to a folder
mount /dev/vdb1 /data

[Optional] mount on restart, add to `/etc/fstab`:
/dev/vdb1     /data  xfs   defaults   0 0



                        Mar 7, 2025

                        Install Rook Ceph

                        Mar 7, 2025

Install Redis

                        Installation

                        Install By

                        Preliminary

1. Kubernetes has been installed, if not check 🔗link


2. Helm has been installed, if not check 🔗link


                        1.get helm repo

                        Details
                        helm repo add ay-helm-mirror https://aaronyang0628.github.io/helm-chart-mirror/charts
                        helm repo update

                        2.install chart

                        Details
helm install ay-helm-mirror/redis --generate-name
                        Using Proxy

                        Preliminary

1. Kubernetes has been installed, if not check 🔗link


2. Helm has been installed, if not check 🔗link


3. ArgoCD has been installed, if not check 🔗link


                        1.prepare redis secret

                        Details
                        kubectl get namespaces storage > /dev/null 2>&1 || kubectl create namespace storage
                        kubectl -n storage create secret generic redis-credentials \
                          --from-literal=redis-password=$(tr -dc A-Za-z0-9 </dev/urandom | head -c 16)

2. Prepare `deploy-redis.yaml`

                        Details
                        kubectl -n argocd apply -f - << EOF
                        apiVersion: argoproj.io/v1alpha1
                        kind: Application
                        metadata:
                          name: redis
                        spec:
                          syncPolicy:
                            syncOptions:
                            - CreateNamespace=true
                          project: default
                          source:
                            repoURL: https://charts.bitnami.com/bitnami
                            chart: redis
                            targetRevision: 18.16.0
                            helm:
                              releaseName: redis
                              values: |
                                architecture: replication
                                auth:
                                  enabled: true
                                  sentinel: true
                                  existingSecret: redis-credentials
                                master:
                                  count: 1
                                  disableCommands:
                                    - FLUSHDB
                                    - FLUSHALL
                                  persistence:
                                    enabled: true
                                    storageClass: nfs-external
                                    size: 8Gi
                                replica:
                                  replicaCount: 3
                                  disableCommands:
                                    - FLUSHDB
                                    - FLUSHALL
                                  persistence:
                                    enabled: true
                                    storageClass: nfs-external
                                    size: 8Gi
                                image:
                                  registry: m.daocloud.io/docker.io
                                  pullPolicy: IfNotPresent
                                sentinel:
                                  enabled: false
                                  persistence:
                                    enabled: false
                                  image:
                                    registry: m.daocloud.io/docker.io
                                    pullPolicy: IfNotPresent
                                metrics:
                                  enabled: false
                                  image:
                                    registry: m.daocloud.io/docker.io
                                    pullPolicy: IfNotPresent
                                volumePermissions:
                                  enabled: false
                                  image:
                                    registry: m.daocloud.io/docker.io
                                    pullPolicy: IfNotPresent
                                sysctl:
                                  enabled: false
                                  image:
                                    registry: m.daocloud.io/docker.io
                                    pullPolicy: IfNotPresent
                                extraDeploy:
                                  - |
                                    apiVersion: apps/v1
                                    kind: Deployment
                                    metadata:
                                      name: redis-tool
                                      namespace: csst
                                      labels:
                                        app.kubernetes.io/name: redis-tool
                                    spec:
                                      replicas: 1
                                      selector:
                                        matchLabels:
                                          app.kubernetes.io/name: redis-tool
                                      template:
                                        metadata:
                                          labels:
                                            app.kubernetes.io/name: redis-tool
                                        spec:
                                          containers:
                                          - name: redis-tool
                                            image: m.daocloud.io/docker.io/bitnami/redis:7.2.4-debian-12-r8
                                            imagePullPolicy: IfNotPresent
                                            env:
                                            - name: REDISCLI_AUTH
                                              valueFrom:
                                                secretKeyRef:
                                                  key: redis-password
                                                  name: redis-credentials
                                            - name: TZ
                                              value: Asia/Shanghai
                                            command:
                                            - tail
                                            - -f
                                            - /etc/hosts
                          destination:
                            server: https://kubernetes.default.svc
                            namespace: storage
                        EOF

3. Sync via argocd

                        Details
                        argocd app sync argocd/redis

4. Decode the password

                        Details
                        kubectl -n storage get secret redis-credentials -o jsonpath='{.data.redis-password}' | base64 -d
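Secret data in Kubernetes is stored base64-encoded, which is why the command above pipes through `base64 -d`. A local round-trip sketch (the password value here is a hypothetical example):

```shell
# Encode a sample value the way Kubernetes stores Secret data, then decode it back.
encoded=$(printf '%s' 's3cr3t-pass' | base64)
decoded=$(printf '%s' "$encoded" | base64 -d)
echo "$decoded"   # prints: s3cr3t-pass
```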

                        Preliminary

1. Docker, Podman, or Buildah has been installed; if not, check the 🔗link


                        Using Proxy

You can run an additional DaoCloud image to accelerate your pulls; check Daocloud Proxy

1. Init the server

                        Details

                        Preliminary

1. Kubernetes has been installed; if not, check the 🔗link


2. Helm has been installed; if not, check the 🔗link


3. ArgoCD has been installed; if not, check the 🔗link


4. Argo Workflow has been installed; if not, check the 🔗link


1. Prepare `argocd-login-credentials`

                        Details
                        kubectl get namespaces database > /dev/null 2>&1 || kubectl create namespace database

2. Apply the RoleBinding to k8s

                        Details
                        kubectl apply -f - <<EOF
                        ---
                        apiVersion: rbac.authorization.k8s.io/v1
                        kind: ClusterRole
                        metadata:
                          name: application-administrator
                        rules:
                          - apiGroups:
                              - argoproj.io
                            resources:
                              - applications
                            verbs:
                              - '*'
                          - apiGroups:
                              - apps
                            resources:
                              - deployments
                            verbs:
                              - '*'
                        
                        ---
                        apiVersion: rbac.authorization.k8s.io/v1
                        kind: RoleBinding
                        metadata:
                          name: application-administration
                          namespace: argocd
                        roleRef:
                          apiGroup: rbac.authorization.k8s.io
                          kind: ClusterRole
                          name: application-administrator
                        subjects:
                          - kind: ServiceAccount
                            name: argo-workflow
                            namespace: business-workflows
                        
                        ---
                        apiVersion: rbac.authorization.k8s.io/v1
                        kind: RoleBinding
                        metadata:
                          name: application-administration
                          namespace: application
                        roleRef:
                          apiGroup: rbac.authorization.k8s.io
                          kind: ClusterRole
                          name: application-administrator
                        subjects:
                          - kind: ServiceAccount
                            name: argo-workflow
                            namespace: business-workflows
                        EOF

4. Prepare `deploy-xxxx-flow.yaml`

                        Details

6. Submit to the argo workflow client

                        Details
                        argo -n business-workflows submit deploy-xxxx-flow.yaml

7. Decode the password

                        Details
                        kubectl -n application get secret xxxx-credentials -o jsonpath='{.data.xxx-password}' | base64 -d


Tests

                        • kubectl -n storage exec -it deployment/redis-tool -- \
                              redis-cli -c -h redis-master.storage ping
                        • kubectl -n storage exec -it deployment/redis-tool -- \
                              redis-cli -c -h redis-master.storage set mykey somevalue
                        • kubectl -n storage exec -it deployment/redis-tool -- \
                              redis-cli -c -h redis-master.storage get mykey
                        • kubectl -n storage exec -it deployment/redis-tool -- \
                              redis-cli -c -h redis-master.storage del mykey
                        • kubectl -n storage exec -it deployment/redis-tool -- \
                              redis-cli -c -h redis-master.storage get mykey
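The repeated `kubectl exec` prefix in these tests can be wrapped in a small helper function (a sketch; it assumes the `redis-tool` Deployment and the `redis-master.storage` service name used above):

```shell
# Run any redis-cli command inside the redis-tool Pod.
rtool() {
  kubectl -n storage exec -i deployment/redis-tool -- \
    redis-cli -c -h redis-master.storage "$@"
}
# Usage against a live cluster:
#   rtool ping
#   rtool set mykey somevalue
#   rtool get mykey
#   rtool del mykey
```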
                        May 7, 2024

                        Subsections of Streaming

                        Install Flink Operator

                        Installation

                        Install By

                        Preliminary

1. Kubernetes has been installed; if not, check the 🔗link


2. Helm has been installed; if not, check the 🔗link


3. Cert-manager has been installed; if not, check the 🔗link


1. Get the helm repo

                        Details
                        helm repo add flink-operator-repo https://downloads.apache.org/flink/flink-kubernetes-operator-1.11.0/
                        helm repo update

Latest version: 🔗https://flink.apache.org/downloads/#apache-flink-kubernetes-operator

2. Install the chart

                        Details
helm install --create-namespace -n flink flink-kubernetes-operator \
  flink-operator-repo/flink-kubernetes-operator \
  --set image.repository=m.lab.zverse.space/ghcr.io/apache/flink-kubernetes-operator \
  --set image.tag=1.11.0 \
  --set webhook.create=false
                        Reference

                        Preliminary

1. Kubernetes has been installed; if not, check the 🔗link


2. ArgoCD has been installed; if not, check the 🔗link


3. Cert-manager has been installed on argocd and a ClusterIssuer named self-signed-ca-issuer exists; if not, check the 🔗link


4. Ingress has been installed on argoCD; if not, check the 🔗link


2. Prepare `flink-operator.yaml`

                        Details
                        kubectl -n argocd apply -f - << EOF
                        apiVersion: argoproj.io/v1alpha1
                        kind: Application
                        metadata:
                          name: flink-operator
                        spec:
                          syncPolicy:
                            syncOptions:
                            - CreateNamespace=true
                          project: default
                          source:
                            repoURL: https://downloads.apache.org/flink/flink-kubernetes-operator-1.11.0
                            chart: flink-kubernetes-operator
                            targetRevision: 1.11.0
                            helm:
                              releaseName: flink-operator
                              values: |
                                image:
                                  repository: m.daocloud.io/ghcr.io/apache/flink-kubernetes-operator
                                  pullPolicy: IfNotPresent
                                  tag: "1.11.0"
                              version: v3
                          destination:
                            server: https://kubernetes.default.svc
                            namespace: flink
                        EOF

3. Sync via argocd

                        Details
                        argocd app sync argocd/flink-operator


                        Jun 7, 2025

                        🗃️Usage Notes

                        Aug 7, 2024

                        Subsections of 🗃️Usage Notes

                        Subsections of Application

Stateful vs. Stateless Applications

Whether an application is "stateful" or "stateless" directly determines how it is deployed in Kubernetes, which resource type it uses, and how complex it is to operate.


I. Core Definitions

1. Stateless applications

Definition: an instance does not keep the context or data needed to serve a request. Any request can be handled by any instance, with exactly the same result.

Key characteristics

• Self-contained requests: each request carries everything needed to process it (auth token, session ID, payload, etc.).
• Interchangeable instances: every instance is identical and can be created or destroyed at any time; destroying one loses no data.
• No local persistence: the instance's local disk is not used for data that must survive; any temporary data can be discarded when the instance dies.
• Easy horizontal scaling: because instances are identical, scaling is simply a matter of adding more of them.

Typical examples

• Web front-end servers: Nginx, Apache.
• API gateways: Kong, Tyk.
• JWT token validation services.
• Stateless compute services, such as image or data-format conversion, where input and output live entirely in the request.

An analogy: a fast-food cashier. Any cashier can serve you; you place an order (the request), they process it, and the transaction ends. They don't need to remember your last order (the state), and next time you can go to any window.

2. Stateful applications

Definition: an instance must store and maintain specific state. Handling a later request depends on state saved by earlier requests, or changes that state.

Key characteristics

• State dependency: the result of a request depends on state held by that particular instance (user sessions, database records, cached data, etc.).
• Unique instances: each instance is distinct, with a unique identity (ID, hostname), and cannot be replaced arbitrarily.
• Persistent storage required: the state must live on persistent storage that can be re-mounted and re-accessed even if the instance restarts, migrates, or is rebuilt.
• Complex horizontal scaling: scaling must carefully handle data sharding, replica synchronization, and identity.

Typical examples

• Databases: MySQL, PostgreSQL, MongoDB, Redis.
• Message queues: Kafka, RabbitMQ.
• Stateful middleware: Etcd, Zookeeper.
• Session servers that keep user sessions in local memory or on local disk.

An analogy: a bank relationship manager. You have a dedicated manager (a specific instance) who knows your entire financial history and needs (the state). A replacement manager needs time to learn your situation from scratch and may not immediately have all your historical files (the data).


II. Key Differences in Kubernetes

This distinction matters in K8s because it determines which workload resource you use.

| Aspect | Stateless application | Stateful application |
|---|---|---|
| Core K8s resource | Deployment | StatefulSet |
| Pod identity | Fully interchangeable, no unique identity; names are random (e.g. app-7c8b5f6d9-abcde). | Stable, unique identifiers assigned in order (e.g. mysql-0, mysql-1, mysql-2). |
| Start/stop order | Parallel, no ordering. | Ordered deployment (0 to N-1), ordered scale-down (N-1 to 0), ordered rolling updates. |
| Network identity | Unstable Pod IPs; access goes through a load-balancing Service. | Stable network identity; each Pod gets a stable DNS record: <pod-name>.<svc-name>.<namespace>.svc.cluster.local |
| Storage | PersistentVolumeClaim templates; all Pods share one PVC or each uses an independent, unrelated PVC. | Stable, dedicated storage; each Pod mounts its own PVC based on its identity (e.g. mysql-0 -> pvc-mysql-0). |
| Data durability | When a Pod is deleted, its associated PVC is usually deleted too (depending on the reclaim policy). | Even when rescheduled to another node, a Pod re-mounts its own persistent data via its stable identity. |
| Typical scenarios | Web servers, microservices, APIs | Databases, message queues, clustered applications (e.g. Zookeeper) |

III. A Common Pitfall: "Looks Stateless, Is Actually Stateful"

Some applications look stateless at first glance but turn out to be stateful on closer inspection.

• The pitfall: a web application that keeps user sessions in local memory.
  • Looks like: a web service that can be deployed with a Deployment and multiple replicas.
  • Actually: if a user's first request is handled by pod-a, the session lives in pod-a's memory. If the next request is load-balanced to pod-b, pod-b cannot see that session, and the user has to log in again.
  • Solutions:
    1. Make it stateless: move session data out to a centralized Redis or database.
    2. Accept that it is stateful: use a StatefulSet together with session affinity, so the same user's requests always reach the same Pod.
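Session affinity can also be expressed at the Service level with client-IP stickiness; a minimal sketch (the name and selector are hypothetical):

```yaml
apiVersion: v1
kind: Service
metadata:
  name: session-web            # hypothetical service name
spec:
  selector:
    app: session-web           # hypothetical pod label
  sessionAffinity: ClientIP    # route a given client IP to the same Pod
  sessionAffinityConfig:
    clientIP:
      timeoutSeconds: 10800    # affinity window (10800s / 3h is the default)
  ports:
    - port: 80
      targetPort: 8080
```

Note that client-IP affinity is best-effort: it breaks down behind NAT or shared proxies, which is one reason externalizing the session store is usually preferred.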

Summary

How do you decide whether an application is stateful or stateless?

Ask yourself these questions:

1. Can an instance be killed at will and immediately replaced by a new one? Can the replacement seamlessly take over all work?
  • Yes -> stateless
  • No -> stateful
2. Are all instances of the application identical? Does adding an instance require copying data?
  • Identical, no copying -> stateless
  • Not identical, copying required -> stateful
3. Does handling a request depend on non-ephemeral data held locally (memory/disk) by the instance?
  • No -> stateless
  • Yes -> stateful

Understanding this distinction is the foundation of correctly designing and deploying cloud-native applications. In K8s, prefer Deployment for stateless applications and always use StatefulSet for stateful ones.
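The storage behavior described in the table can be sketched with a minimal StatefulSet (names are hypothetical); each replica gets its own PVC stamped out from the volumeClaimTemplates entry (data-web-0, data-web-1, ...):

```yaml
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: web                      # hypothetical name
spec:
  serviceName: web               # headless Service that provides the stable DNS records
  replicas: 3
  selector:
    matchLabels:
      app: web
  template:
    metadata:
      labels:
        app: web
    spec:
      containers:
        - name: web
          image: nginx:1.25-alpine
          volumeMounts:
            - name: data
              mountPath: /usr/share/nginx/html
  volumeClaimTemplates:          # one PVC per Pod: data-web-0, data-web-1, data-web-2
    - metadata:
        name: data
      spec:
        accessModes: ["ReadWriteOnce"]
        resources:
          requests:
            storage: 1Gi
```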

                        Mar 7, 2025

                        Subsections of Building Tool

                        Maven

1. Build from a submodule

You don't need to build from the root of the project.

                        ./mvnw clean package -DskipTests  -rf :<$submodule-name>

You can find the <$submodule-name> in the submodule's pom.xml:

                        <project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
                        		xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/maven-v4_0_0.xsd">
                        
                        	<modelVersion>4.0.0</modelVersion>
                        
                        	<parent>
                        		<groupId>org.apache.flink</groupId>
                        		<artifactId>flink-formats</artifactId>
                        		<version>1.20-SNAPSHOT</version>
                        	</parent>
                        
                        	<artifactId>flink-avro</artifactId>
                        	<name>Flink : Formats : Avro</name>

                        Then you can modify the command as

                        ./mvnw clean package -DskipTests  -rf :flink-avro
                        The result will look like this
                        [WARNING] For this reason, future Maven versions might no longer support building such malformed projects.
                        [WARNING] 
                        [INFO] ------------------------------------------------------------------------
                        [INFO] Detecting the operating system and CPU architecture
                        [INFO] ------------------------------------------------------------------------
                        [INFO] os.detected.name: linux
                        [INFO] os.detected.arch: x86_64
                        [INFO] os.detected.bitness: 64
                        [INFO] os.detected.version: 6.7
                        [INFO] os.detected.version.major: 6
                        [INFO] os.detected.version.minor: 7
                        [INFO] os.detected.release: fedora
                        [INFO] os.detected.release.version: 38
                        [INFO] os.detected.release.like.fedora: true
                        [INFO] os.detected.classifier: linux-x86_64
                        [INFO] ------------------------------------------------------------------------
                        [INFO] Reactor Build Order:
                        [INFO] 
                        [INFO] Flink : Formats : Avro                                             [jar]
                        [INFO] Flink : Formats : SQL Avro                                         [jar]
                        [INFO] Flink : Formats : Parquet                                          [jar]
                        [INFO] Flink : Formats : SQL Parquet                                      [jar]
                        [INFO] Flink : Formats : Orc                                              [jar]
                        [INFO] Flink : Formats : SQL Orc                                          [jar]
                        [INFO] Flink : Python                                                     [jar]
                        ...

Normally, building Flink starts from the flink-parent module.
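If you'd rather not open the pom by hand, the module's artifactId can be extracted with sed. A sketch against a sample of the structure shown above (it assumes, as in that snippet, that the parent's `<artifactId>` appears first, so the module's own is the second match):

```shell
# Sample of the pom.xml layout shown above (parent artifactId first, module's second).
cat > /tmp/sample-pom.xml <<'EOF'
<project>
  <parent>
    <artifactId>flink-formats</artifactId>
  </parent>
  <artifactId>flink-avro</artifactId>
</project>
EOF
# Print every <artifactId> value, then keep the second match (the module's own).
module=$(sed -n 's|.*<artifactId>\(.*\)</artifactId>.*|\1|p' /tmp/sample-pom.xml | sed -n '2p')
echo "$module"   # prints: flink-avro
```

Against a real submodule you would point sed at its pom.xml and pass the result to `-rf ":$module"`.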

2. Skip other checks

For example, you can skip the RAT check like this:

                        ./mvnw clean package -DskipTests '-Drat.skip=true'
                        Mar 11, 2024

                        Gradle

                        1. spotless

Keep your code spotless; see more details at https://github.com/diffplug/spotless

See how to configure it

There are several files to configure.

                        1. settings.gradle.kts
                        plugins {
                            id("org.gradle.toolchains.foojay-resolver-convention") version "0.7.0"
                        }
2. build.gradle.kts
                        plugins {
                            id("com.diffplug.spotless") version "6.23.3"
                        }
                        configure<com.diffplug.gradle.spotless.SpotlessExtension> {
                            kotlinGradle {
                                target("**/*.kts")
                                ktlint()
                            }
                            java {
                                target("**/*.java")
                                googleJavaFormat()
                                    .reflowLongStrings()
                                    .skipJavadocFormatting()
                                    .reorderImports(false)
                            }
                            yaml {
                                target("**/*.yaml")
                                jackson()
                                    .feature("ORDER_MAP_ENTRIES_BY_KEYS", true)
                            }
                            json {
                                target("**/*.json")
                                targetExclude(".vscode/settings.json")
                                jackson()
                                    .feature("ORDER_MAP_ENTRIES_BY_KEYS", true)
                            }
                        }

And then, you can execute the following command to format your code.

                        ./gradlew spotlessApply
                        ./mvnw spotless:apply

                        2. shadowJar

ShadowJar combines a project's dependency classes and resources into a single jar; check https://imperceptiblethoughts.com/shadow/

See how to configure it

You need to modify your build.gradle.kts

                        import com.github.jengelman.gradle.plugins.shadow.tasks.ShadowJar
                        
                        plugins {
                            java // Optional 
                            id("com.github.johnrengelman.shadow") version "8.1.1"
                        }
                        
                        tasks.named<ShadowJar>("shadowJar") {
                            archiveBaseName.set("connector-shadow")
                            archiveVersion.set("1.0")
                            archiveClassifier.set("")
                            manifest {
                                attributes(mapOf("Main-Class" to "com.example.xxxxx.Main"))
                            }
                        }
                        ./gradlew shadowJar

3. Check dependencies

List your project's dependencies in a tree view.

See how to configure it

You need to modify your build.gradle.kts

                        configurations {
                            compileClasspath
                        }
                        ./gradlew dependencies --configuration compileClasspath
                        ./gradlew :<$module_name>:dependencies --configuration compileClasspath
                        Check Potential Result

The result will look like this:

                        compileClasspath - Compile classpath for source set 'main'.
                        +--- org.projectlombok:lombok:1.18.22
                        +--- org.apache.flink:flink-hadoop-fs:1.17.1
                        |    \--- org.apache.flink:flink-core:1.17.1
                        |         +--- org.apache.flink:flink-annotations:1.17.1
                        |         |    \--- com.google.code.findbugs:jsr305:1.3.9 -> 3.0.2
                        |         +--- org.apache.flink:flink-metrics-core:1.17.1
                        |         |    \--- org.apache.flink:flink-annotations:1.17.1 (*)
                        |         +--- org.apache.flink:flink-shaded-asm-9:9.3-16.1
                        |         +--- org.apache.flink:flink-shaded-jackson:2.13.4-16.1
                        |         +--- org.apache.commons:commons-lang3:3.12.0
                        |         +--- org.apache.commons:commons-text:1.10.0
                        |         |    \--- org.apache.commons:commons-lang3:3.12.0
                        |         +--- commons-collections:commons-collections:3.2.2
                        |         +--- org.apache.commons:commons-compress:1.21 -> 1.24.0
                        |         +--- org.apache.flink:flink-shaded-guava:30.1.1-jre-16.1
                        |         \--- com.google.code.findbugs:jsr305:1.3.9 -> 3.0.2
                        ...
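In the tree, `->` marks a requested version that Gradle resolved to a different one. You can filter the report down to just those conflicts; a sketch over a fragment of the output above:

```shell
# A fragment of the dependency report (taken from the sample output above).
report='+--- org.projectlombok:lombok:1.18.22
|         +--- com.google.code.findbugs:jsr305:1.3.9 -> 3.0.2
|         +--- org.apache.commons:commons-compress:1.21 -> 1.24.0'
# Keep only the lines where a version was upgraded.
printf '%s\n' "$report" | grep -- '->'
```

In practice you would pipe the real report: `./gradlew dependencies --configuration compileClasspath | grep -- '->'`.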
                        Mar 7, 2024

                        CICD

                        Articles

FAQ

Q1: What is the difference between docker, podman, and buildah?

• docker builds and runs containers through a long-running root daemon (dockerd) and covers the whole container lifecycle: build, run, push, compose.
• podman is a daemonless engine with a docker-compatible CLI that can run rootless; for most commands, alias docker=podman just works.
• buildah focuses solely on building OCI images and gives fine-grained, scriptable control over each layer; podman build uses buildah's code under the hood.

                          Mar 7, 2025

                          Container

                          Articles

                          FQA

                          Q1: difference between docker\podmn\buildah

                          You can add standard markdown syntax:

                          • multiple paragraphs
                          • bullet point lists
                          • emphasized, bold and even bold emphasized text
                          • links
                          • etc.
                          ...and even source code

                          the possibilities are endless (almost - including other shortcodes may or may not work)

                          Mar 7, 2025

                          Subsections of Container

                          Build Smaller Image

Ways to Reduce the Size of Images Built from a Dockerfile

1. Choose a smaller base image

# ❌ Avoid full-size images
FROM ubuntu:latest

# ✅ Use slim variants
FROM alpine:3.18
FROM python:3.11-slim
FROM node:18-alpine

2. Use multi-stage builds

This is one of the most effective methods:

# Build stage
FROM golang:1.21 AS builder
WORKDIR /app
COPY . .
RUN go build -o myapp

# Runtime stage - copy only the necessary files
FROM alpine:3.18
WORKDIR /app
COPY --from=builder /app/myapp .
CMD ["./myapp"]

3. Merge RUN instructions

Each RUN command creates a new layer:

# ❌ Many layers, large image
RUN apt-get update
RUN apt-get install -y package1
RUN apt-get install -y package2

# ✅ One layer, with the cache cleaned up
RUN apt-get update && \
    apt-get install -y package1 package2 && \
    apt-get clean && \
    rm -rf /var/lib/apt/lists/*

4. Clean up unnecessary files

RUN apt-get update && \
    apt-get install -y build-essential && \
    # build steps... && \
    apt-get purge -y build-essential && \
    apt-get autoremove -y && \
    rm -rf /var/lib/apt/lists/* /tmp/* /var/tmp/*

5. Use a .dockerignore file

                          # .dockerignore
                          node_modules
                          .git
                          *.md
                          .env
                          test/

                          6. 只复制必要的文件

                          # ❌ copies everything
                          COPY . .
                          
                          # ✅ copy only what is required
                          COPY package.json package-lock.json ./
                          RUN npm ci --only=production
                          COPY src/ ./src/

                          7. Strip Debug Tools and Documentation

                          RUN apk add --no-cache python3 && \
                              rm -rf /usr/share/doc /usr/share/man

                          8. Squash and Optimize Layers

                          # do everything in a single RUN
                          RUN set -ex && \
                              apk add --no-cache --virtual .build-deps gcc musl-dev && \
                              pip install --no-cache-dir -r requirements.txt && \
                              apk del .build-deps

                          9. Use Dedicated Tools

                          • dive: analyze image layers
                            dive your-image:tag
                          • docker-slim: slim images automatically
                            docker-slim build your-image:tag

                          A Real-world Comparison

                          Before optimization (1.2GB):

                          FROM ubuntu:20.04
                          RUN apt-get update
                          RUN apt-get install -y python3 python3-pip
                          COPY . /app
                          WORKDIR /app
                          RUN pip3 install -r requirements.txt
                          CMD ["python3", "app.py"]

                          After optimization (50MB):

                          FROM python:3.11-alpine AS builder
                          WORKDIR /app
                          COPY requirements.txt .
                          RUN pip install --no-cache-dir --user -r requirements.txt
                          
                          FROM python:3.11-alpine
                          WORKDIR /app
                          COPY --from=builder /root/.local /root/.local
                          COPY app.py .
                          ENV PATH=/root/.local/bin:$PATH
                          CMD ["python", "app.py"]

                          Key Takeaways

                          ✅ Use Alpine or slim base images
                          ✅ Adopt multi-stage builds
                          ✅ Merge commands and clean up caches
                          ✅ Configure a .dockerignore
                          ✅ Install only production dependencies
                          ✅ Remove build tools and temporary files

                          With these techniques, image size can typically shrink by 60-90%!

                          Mar 7, 2024

                          Network Mode

                          Docker's network mode determines how a container communicates with the host, with other containers, and with external networks.

                          Docker provides the following five network modes; bridge is the default.


                          1. Bridge Mode

                          This is the default network mode. A container created without an explicit network joins the default bridge network (named bridge).

                          • How it works: the Docker daemon creates a virtual bridge called docker0, which behaves like a virtual switch. Every container in this mode attaches to the bridge through a virtual NIC (a veth pair). Docker assigns each container an IP address and sets its gateway to the docker0 address.
                          • Communication
                            • Container to container: containers on the same user-defined bridge network can reach each other by container name (Docker embeds a DNS server). On the default bridge network, containers can only talk by IP address.
                            • Outbound: packets pass through the docker0 bridge and are NATed by the host's iptables, so containers reach external networks via the host's IP.
                            • Inbound: requires port mapping, e.g. -p 8080:80 maps host port 8080 to container port 80.

                          Pros and Cons

                          • Pros
                            • Isolation: each container has its own network namespace, isolated from the host and other networks; reasonably secure.
                            • Flexible port management: port mapping gives fine-grained control over which host ports are exposed.
                            • General purpose: the most common mode, fitting the vast majority of workloads.
                          • Cons
                            • Performance overhead: the extra bridge and NAT hop costs slightly more than host mode.
                            • Complexity: on the default bridge network, containers must address each other by IP, which is less convenient than a user-defined network.

                          Use cases: most standalone applications that need network isolation, such as web backends and databases.

                          Example Commands

                          # Default bridge network (not recommended for multi-container apps)
                          docker run -d --name my-app -p 8080:80 nginx
                          
                          # Create a user-defined bridge network (recommended)
                          docker network create my-network
                          docker run -d --name app1 --network my-network my-app
                          docker run -d --name app2 --network my-network another-app
                          # app1 and app2 can now reach each other by container name

                          2. Host Mode

                          In this mode the container gets no virtual NIC and no IP of its own; it uses the host's IP and ports directly.

                          • How it works: the container shares the host's network namespace.

                          Pros and Cons

                          • Pros
                            • High performance: with no NAT or bridging overhead, network performance is the best available, essentially native.
                            • Simplicity: no port mapping needed; the ports used inside the container are the host's ports.
                          • Cons
                            • Weak security: the container has no network isolation and can manipulate the host's network directly.
                            • Port conflicts: if a container port clashes with a host service, the container fails to start.
                            • Inflexibility: you cannot run multiple containers that use the same port on one host.

                          Use cases: scenarios with extreme network-performance demands, such as load balancers or high-frequency trading systems. Use with caution in production.

                          Example Commands

                          docker run -d --name my-app --network host nginx
                          # Now http://<host-IP>:80 reaches the Nginx inside the container

                          3. None Mode

                          In this mode the container has its own network namespace but no network configuration at all. Only the loopback address 127.0.0.1 exists inside.

                          • How it works: the container is completely cut off from any network.

                          Pros and Cons

                          • Pros
                            • Total isolation: the highest security; the container cannot communicate over the network at all.
                          • Cons
                            • No connectivity: the container cannot talk to the host, other containers, or external networks.

                          Use cases

                          1. Batch jobs that must run fully offline.
                          2. Situations where you plan to wire up the container's network stack yourself with a custom driver (or manual configuration).

                          Example Commands

                          docker run -d --name my-app --network none alpine
                          # Inside the container, `ip addr` shows only the lo interface

                          4. Container Mode

                          Here a new container does not get its own NIC or IP; instead it shares the network namespace of an existing container. In plain terms, the two containers live in the same network environment and see the same IPs and ports.

                          • How it works: the new container reuses the specified container's network stack.

                          Pros and Cons

                          • Pros
                            • Efficient communication: the containers talk over the local loopback address 127.0.0.1, which is extremely fast.
                            • Shared network view: convenient for pairing a main container (e.g. a web server) with a helper (e.g. a log collector) that sees exactly the same network environment.
                          • Cons
                            • Tight coupling: the two containers' lifecycles and network configuration are bound together, reducing flexibility.
                            • Weaker isolation: sharing a network namespace carries some security risk.

                          Use cases: the "sidecar" pattern in Kubernetes, e.g. a main container and a log-agent container inside one Pod.

                          Example Commands

                          docker run -d --name main-container nginx
                          docker run -d --name helper-container --network container:main-container busybox
                          # From helper-container, 127.0.0.1:80 reaches the Nginx in main-container

                          5. Overlay Mode

                          This mode exists to enable cross-host container communication and is the core networking scheme of orchestrators such as Docker Swarm and Kubernetes.

                          • How it works: it builds a virtual distributed network (an overlay network) spanning multiple Docker hosts. Tunneling technologies such as VXLAN make containers on different hosts appear to share one large LAN.

                          Pros and Cons

                          • Pros
                            • Cross-node communication: solves the fundamental problem of container-to-container traffic in a cluster.
                            • Security: supports network encryption.
                          • Cons
                            • Setup complexity: needs an external key-value store (e.g. Consul or etcd) to synchronize network state (Docker Swarm mode has this built in).
                            • Performance overhead: packets must be encapsulated and decapsulated; the cost is usually acceptable on modern hardware.

                          Use cases: distributed environments such as Docker Swarm and Kubernetes clusters.

                          Example Commands (in Swarm mode):

                          # Initialize the Swarm
                          docker swarm init
                          
                          # Create an overlay network
                          docker network create -d overlay my-overlay-net
                          
                          # Create a service on the overlay network
                          docker service create --name web --network my-overlay-net -p 80:80 nginx

                          Summary

                          Mode             | Isolation | Performance | Typical use cases
                          -----------------|-----------|-------------|--------------------------------------------------
                          Bridge (default) | Good      | Moderate    | General-purpose, multi-container apps on one host
                          Host             | None      | Highest     | Extreme performance needs, port conflicts acceptable
                          None             | Highest   | -           | Offline jobs, fully custom networking
                          Container        | Shared    | High        | Tightly cooperating containers (sidecar pattern)
                          Overlay          | Good      | Moderate    | Clusters, cross-host container communication

                          Best-practice Recommendations

                          1. Single-host apps: prefer a user-defined bridge network; it offers better DNS-based service discovery than the default bridge, letting containers talk by name.
                          2. Clustered apps: use an overlay network.
                          3. Chasing maximum performance: consider host mode, once ports are confirmed safe and conflict-free.
                          4. Security isolation: use none mode for containers that need no network.
                          5. Avoid heavy production use of the default bridge network and container mode; the former lacks DNS discovery and the latter couples containers too tightly.
                          Mar 7, 2024

                          Container Principle

                          Fundamentals of Linux Container Technology

                          Container technology is the foundation of modern cloud-native architecture. Let's dig into the core principles of Linux containers.


                          🎯 The Essence of a Container

                          A container = a special process

                          A container is not a virtual machine. It is essentially a constrained process on the host, built on three core Linux kernel technologies for isolation and resource limiting:

                          ┌─────────────────────────────────────────┐
                          │  The Three Pillars of Linux Containers  │
                          ├─────────────────────────────────────────┤
                          │  1. Namespaces - isolation              │
                          │  2. Cgroups - resource limits           │
                          │  3. Union FS - image layering           │
                          └─────────────────────────────────────────┘

                          🔒 Namespaces - View Isolation

                          A namespace is a resource-isolation mechanism in the Linux kernel: a process can only see the resources that belong to it.

                          The Seven Namespaces

                          Namespace | Isolates              | Kernel version | Example
                          ----------|-----------------------|----------------|------------------------------------------
                          PID       | Process IDs           | 2.6.24         | PID 1 in the container = host PID 12345
                          Network   | Network stack         | 2.6.29         | Own IPs, ports, routing table
                          Mount     | Filesystem mounts     | 2.4.19         | Own root directory
                          UTS       | Host and domain name  | 2.6.19         | Container has its own hostname
                          IPC       | Inter-process comm.   | 2.6.19         | Message queues, semaphores, shared memory
                          User      | User and group IDs    | 3.8            | root in the container ≠ root on the host
                          Cgroup    | Cgroup root directory | 4.6            | Isolated cgroup view

                          1️⃣ PID Namespace (Process Isolation)

                          How it works

                          Each container has its own process tree; inside it you cannot see host processes or other containers' processes.

                          Demo

                          # On the host
                          ps aux | grep nginx
                          # root  12345  nginx: master process
                          
                          # Enter the container
                          docker exec -it my-container bash
                          
                          # Inside the container
                          ps aux
                          # PID   USER     COMMAND
                          # 1     root     nginx: master process  ← seen as PID 1 inside
                          # 25    root     nginx: worker process
                          
                          # On the host, the real PID of this process is 12345

                          Creating a PID Namespace by Hand

                          // C example
                          #define _GNU_SOURCE
                          #include <sched.h>
                          #include <signal.h>
                          #include <stdio.h>
                          #include <unistd.h>
                          #include <sys/wait.h>
                          
                          int child_func(void* arg) {
                              printf("Child PID: %d\n", getpid());  // prints: 1
                              sleep(100);
                              return 0;
                          }
                          
                          int main() {
                              printf("Parent PID: %d\n", getpid());  // prints: the real PID
                              
                              // Create a new PID namespace
                              char stack[1024*1024];
                              int flags = CLONE_NEWPID;
                              
                              pid_t pid = clone(child_func, stack + sizeof(stack), flags | SIGCHLD, NULL);
                              waitpid(pid, NULL, 0);
                              return 0;
                          }

                          Key Points

                          • The first process in a container has PID 1 (the init process)
                          • The parent (host) can see the child's real PID
                          • The child (container) cannot see the parent or other containers' processes

                          2️⃣ Network Namespace (Network Isolation)

                          How it works

                          Each container has its own network stack: its own IP, ports, routing table, and firewall rules.

                          Architecture

                          Host network stack
                          ├─ eth0 (physical NIC)
                          ├─ docker0 (bridge)
                          └─ veth pairs (virtual NIC pairs)
                              ├─ vethXXX (host end) ←→ eth0 (container end)
                              └─ vethYYY (host end) ←→ eth0 (container end)

                          Demo

                          # Create a new network namespace
                          ip netns add myns
                          
                          # List all namespaces
                          ip netns list
                          
                          # Run a command inside the new namespace
                          ip netns exec myns ip addr
                          # Output: only loopback, no eth0
                          
                          # Create a veth pair (virtual NIC pair)
                          ip link add veth0 type veth peer name veth1
                          
                          # Move veth1 into the new namespace
                          ip link set veth1 netns myns
                          
                          # Assign IPs
                          ip addr add 192.168.1.1/24 dev veth0
                          ip netns exec myns ip addr add 192.168.1.2/24 dev veth1
                          
                          # Bring the interfaces up
                          ip link set veth0 up
                          ip netns exec myns ip link set veth1 up
                          ip netns exec myns ip link set lo up
                          
                          # Test connectivity
                          ping 192.168.1.2

                          Container Network Modes

                          Bridge mode (default)

                          Container A                Container B
                              │                          │
                            [eth0]                    [eth0]
                              │                          │
                           vethA ←─────┬─────────→ vethB
                                       │
                                [docker0 bridge]
                                       │
                                 [iptables NAT]
                                       │
                                  [host eth0]
                                       │
                                external network

                          Host mode

                          Container
                              │
                              └─ uses the host's network stack directly (no network isolation)

                          3️⃣ Mount Namespace (Filesystem Isolation)

                          How it works

                          Each container has its own view of mount points and sees a different filesystem tree.

                          Demo

                          # Create an isolated mount environment
                          unshare --mount /bin/bash
                          
                          # Mount inside the new namespace
                          mount -t tmpfs tmpfs /tmp
                          
                          # Inspect mounts
                          mount | grep tmpfs
                          # This mount is visible only in the current namespace
                          
                          # After exiting, the host cannot see the mount
                          exit
                          mount | grep tmpfs  # nothing found

                          The Container's Root Filesystem

                          # Docker switches the root directory with chroot + pivot_root
                          # The container's / is really a directory on the host
                          
                          # Find the container's root filesystem
                          docker inspect my-container | grep MergedDir
                          # "MergedDir": "/var/lib/docker/overlay2/xxx/merged"
                          
                          # Browse the container filesystem from the host
                          ls /var/lib/docker/overlay2/xxx/merged
                          # bin  boot  dev  etc  home  lib  ...

                          4️⃣ UTS Namespace (Hostname Isolation)

                          Demo

                          # On the host
                          hostname
                          # host-machine
                          
                          # Create a new UTS namespace
                          unshare --uts /bin/bash
                          
                          # Change the hostname
                          hostname my-container
                          
                          # Check it
                          hostname
                          # my-container
                          
                          # After exiting, the host's hostname is unchanged
                          exit
                          hostname
                          # host-machine

                          5️⃣ IPC Namespace (IPC Isolation)

                          How it works

                          Isolates System V IPC and POSIX message queues.

                          Demo

                          # Create a message queue on the host
                          ipcmk -Q
                          # Message queue id: 0
                          
                          # List message queues
                          ipcs -q
                          # ------ Message Queues --------
                          # key        msqid      owner
                          # 0x52020055 0          root
                          
                          # Enter the container
                          docker exec -it my-container bash
                          
                          # List message queues inside the container
                          ipcs -q
                          # ------ Message Queues --------
                          # (empty - the host's queue is invisible)

                          6️⃣ User Namespace (User Isolation)

                          How it works

                          The container's root user can be mapped to an unprivileged user on the host, improving security.

                          Configuration Example

                          # Run a container with user-namespace remapping
                          docker run --userns-remap=default -it ubuntu bash
                          
                          # Inside the container
                          whoami
                          # root
                          
                          id
                          # uid=0(root) gid=0(root) groups=0(root)
                          
                          # On the host, the process actually runs as an unprivileged user
                          ps aux | grep bash
                          # 100000  12345  bash  ← UID 100000, not root
                          
                          UID Mapping Configuration

                          # /etc/subuid and /etc/subgid
                          cat /etc/subuid
                          # dockremap:100000:65536
                          # Maps container UIDs 0-65535 to host UIDs 100000-165535
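The subuid mapping above is nothing more than a linear offset. A minimal sketch of that arithmetic (the helper name is ours, not part of Docker):

```python
def container_uid_to_host(container_uid: int, start: int = 100000, count: int = 65536) -> int:
    """Map a container UID to its host UID for a subuid entry
    like 'dockremap:100000:65536' (a plain linear offset)."""
    if not 0 <= container_uid < count:
        raise ValueError("UID outside the mapped range")
    return start + container_uid

# root (UID 0) inside the container is an unprivileged host user:
print(container_uid_to_host(0))      # 100000
print(container_uid_to_host(65535))  # 165535
```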

                          📊 Cgroups (Control Groups) - Resource Limits

                          Cgroups limit, account for, and isolate the resource usage (CPU, memory, disk I/O, etc.) of groups of processes.

                          Cgroup Subsystems

                          Subsystem | Purpose                  | Example
                          ----------|--------------------------|--------------------------------------------
                          cpu       | Limit CPU usage          | Container capped at 50% CPU
                          cpuset    | Pin to specific cores    | Container may only use CPUs 0-3
                          memory    | Limit memory usage       | Container capped at 512MB of memory
                          blkio     | Limit block-device I/O   | Container disk I/O capped at 100MB/s
                          devices   | Control device access    | Container cannot access /dev/sda
                          net_cls   | Classify network traffic | Tag the container's traffic
                          pids      | Limit process count      | Container may create at most 100 processes

                          CPU Limits

                          How it works

                          The CFS (Completely Fair Scheduler) caps the CPU time a group may consume.

                          Key Parameters

                          cpu.cfs_period_us  # period length (default 100ms = 100000us)
                          cpu.cfs_quota_us   # quota within each period
                          
                          # CPU share = quota / period
                          # e.g. 50000 / 100000 = 50% of one CPU

                          Docker Example

                          # Limit the container to 0.5 CPU cores
                          docker run --cpus=0.5 nginx
                          
                          # Equivalent to
                          docker run --cpu-period=100000 --cpu-quota=50000 nginx
                          
                          # Inspect the cgroup setting
                          cat /sys/fs/cgroup/cpu/docker/<container-id>/cpu.cfs_quota_us
                          # 50000
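Under the hood, --cpus is just the quota/period arithmetic shown above. A tiny sketch (the function name is hypothetical):

```python
def cpus_to_cfs_quota(cpus: float, period_us: int = 100000) -> int:
    """Translate a --cpus value into the cpu.cfs_quota_us it implies."""
    return int(cpus * period_us)

print(cpus_to_cfs_quota(0.5))  # 50000, i.e. --cpu-quota=50000
print(cpus_to_cfs_quota(2.0))  # 200000: two full cores per 100ms period
```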

                          Configuring Cgroups by Hand

                          # Create a cgroup
                          mkdir -p /sys/fs/cgroup/cpu/mycontainer
                          
                          # Cap CPU at 50%
                          echo 50000 > /sys/fs/cgroup/cpu/mycontainer/cpu.cfs_quota_us
                          echo 100000 > /sys/fs/cgroup/cpu/mycontainer/cpu.cfs_period_us
                          
                          # Add the current shell to the cgroup
                          echo $$ > /sys/fs/cgroup/cpu/mycontainer/cgroup.procs
                          
                          # Run a CPU-bound task
                          yes > /dev/null &
                          
                          # Watch CPU usage from another terminal
                          top -p $(pgrep yes)
                          # CPU usage hovers around the 50% cap

                          Memory Limits

                          Key Parameters

                          memory.limit_in_bytes        # hard limit
                          memory.soft_limit_in_bytes   # soft limit
                          memory.oom_control           # OOM behavior
                          memory.usage_in_bytes        # current usage
                          
                          Docker Example

                          # Cap the container at 512MB of memory
                          docker run -m 512m nginx
                          
                          # Inspect the limit
                          cat /sys/fs/cgroup/memory/docker/<container-id>/memory.limit_in_bytes
                          # 536870912 (512MB)
                          
                          # Inspect current usage
                          cat /sys/fs/cgroup/memory/docker/<container-id>/memory.usage_in_bytes
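The 536870912 above is simply 512 × 1024². A rough sketch of the suffix-to-bytes conversion (an approximation for illustration, not Docker's actual parser):

```python
_UNITS = {"b": 1, "k": 1024, "m": 1024**2, "g": 1024**3}

def parse_size(spec: str) -> int:
    """Convert a size string such as '512m' into bytes (binary units)."""
    spec = spec.strip().lower()
    if spec and spec[-1] in _UNITS:
        return int(float(spec[:-1]) * _UNITS[spec[-1]])
    return int(spec)

print(parse_size("512m"))  # 536870912 - the memory.limit_in_bytes above
```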

                          OOM (Out of Memory) Behavior

                          # When a container exceeds its memory limit:
                          # 1. The kernel triggers the OOM Killer
                          # 2. A process inside the container is killed (usually the biggest memory consumer)
                          # 3. The container exits with status code 137
                          
                          docker ps -a
                          # CONTAINER ID   STATUS
                          # abc123         Exited (137) 1 minute ago  ← OOM killed
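The 137 status is not arbitrary: a process killed by a signal exits with 128 + the signal number, and SIGKILL (which the OOM Killer sends) is 9:

```python
import signal

# A process terminated by a signal exits with status 128 + signum.
print(128 + signal.SIGKILL)   # 137: the OOM-killed status shown above
print(128 + signal.SIGTERM)   # 143: e.g. a stop that ended in SIGTERM
```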

                          Strategies to Avoid OOM Kills

                          # Adjust the OOM score
                          docker run --oom-score-adj=-500 nginx
                          # The lower the value, the less likely the OOM Killer picks this process
                          
                          # Disable the OOM Killer (not recommended in production)
                          docker run --oom-kill-disable nginx

                          Disk I/O Limits

                          Docker Example

                          # Cap read speed at 10MB/s
                          docker run --device-read-bps /dev/sda:10mb nginx
                          
                          # Cap write speed at 5MB/s
                          docker run --device-write-bps /dev/sda:5mb nginx
                          
                          # Cap IOPS
                          docker run --device-read-iops /dev/sda:100 nginx
                          docker run --device-write-iops /dev/sda:50 nginx

                          Testing the I/O Limit

                          # Measure write speed inside the container
                          docker exec -it my-container bash
                          
                          dd if=/dev/zero of=/tmp/test bs=1M count=100
                          # The write speed is held to 5MB/s

                          📦 Union FS (Union Filesystem) - Image Layering

                          A union filesystem stacks multiple filesystem layers on top of each other, enabling image reuse and efficient storage.

                          Core Concept

                          Container writable layer (Read-Write)  ← changes made at runtime
                          ─────────────────────────────────
                          Image Layer 4                     ← read-only
                          Image Layer 3                     ← read-only
                          Image Layer 2                     ← read-only
                          Image Layer 1 (Base Layer)        ← read-only
                          ─────────────────────────────────
                                   Unified mount point
                                  (Union Mount Point)

                          Common Implementations

                          Filesystem    | Characteristics                        | Usage
                          --------------|----------------------------------------|------------------------------
                          OverlayFS     | Fast, native kernel support            | Docker default (recommended)
                          AUFS          | Mature and stable, not in mainline     | Early Docker default
                          Btrfs         | Snapshots, copy-on-write               | Large-scale storage
                          ZFS           | Enterprise features, licensing issues  | Advanced users
                          Device Mapper | Block-level storage                    | Red Hat family

                          How OverlayFS Works

                          Directory Layout

                          /var/lib/docker/overlay2/<image-id>/
                          ├── diff/          # this layer's file changes
                          ├── link           # short link name
                          ├── lower          # pointer to the lower layers
                          ├── merged/        # the unified mount point (what the container sees)
                          └── work/          # work directory (temporary files)

                          Walkthrough

                          # Inspect an image's layer structure
                          docker image inspect nginx:latest | jq '.[0].RootFS.Layers'
                          # [
                          #   "sha256:abc123...",  ← Layer 1
                          #   "sha256:def456...",  ← Layer 2
                          #   "sha256:ghi789..."   ← Layer 3
                          # ]
                          
                          # Start a container
                          docker run -d --name web nginx
                          
                          # Find the container's filesystem
                          docker inspect web | grep MergedDir
                          # "MergedDir": "/var/lib/docker/overlay2/xxx/merged"
                          
                          # Check the mount
                          mount | grep overlay
                          # overlay on /var/lib/docker/overlay2/xxx/merged type overlay (rw,lowerdir=...,upperdir=...,workdir=...)

                          Copy-on-Write File Operations

                          # 1. Reading a file (from an image layer)
                          docker exec web cat /etc/nginx/nginx.conf
                          # Read directly from the read-only image layer; no copy needed
                          
                          # 2. Modifying a file
                          docker exec web bash -c "echo 'test' >> /etc/nginx/nginx.conf"
                          # Triggers copy-on-write:
                          # - the file is copied from the lower layer into the writable layer
                          # - the copy in the writable layer is modified
                          # - subsequent reads hit the writable layer (shadowing the lower one)
                          
                          # 3. Deleting a file
                          docker exec web rm /var/log/nginx/access.log
                          # A whiteout file is created to mark the deletion
                          # The file still exists in the image layer but is invisible in the container

                          Whiteout Files (Deletion Markers)

                          # In the container's writable layer
                          ls -la /var/lib/docker/overlay2/xxx/diff/var/log/nginx/
                          # c--------- 1 root root 0, 0 Oct 11 10:00 .wh.access.log
                          # A character device with major/minor numbers 0,0 marks the deletion

                          Why Image Layering Wins

                          1. Shared layers save space

                          # Suppose 10 images are all based on ubuntu:20.04
                          # Without layering: 10 × 100MB = 1GB
                          # With layering: 100MB (ubuntu base) + 10 × 10MB (app layers) = 200MB
                          # Space saved: 80%
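The arithmetic above can be checked directly. The helper below is ours, for illustration only; counting the 10MB app layers on both sides, the savings land near the 80% the comment cites:

```python
def image_footprint(base_mb: int, app_mb: int, n_images: int):
    """Total storage with and without a shared base layer."""
    unshared = n_images * (base_mb + app_mb)  # every image carries its own base
    shared = base_mb + n_images * app_mb      # base layer stored once
    return unshared, shared, 1 - shared / unshared

total, layered, saved = image_footprint(100, 10, 10)
print(total, layered, f"{saved:.0%}")  # 1100 200 82%
```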

                          2. Faster builds

                          FROM ubuntu:20.04                    # Layer 1 (cached)
                          RUN apt-get update                   # Layer 2 (cached)
                          RUN apt-get install -y nginx         # Layer 3 (cached)
                          COPY app.conf /etc/nginx/            # Layer 4 (rebuilt)
                          COPY app.js /var/www/                # Layer 5 (rebuilt)
                          
                          # If only app.js changes, only Layer 5 is rebuilt;
                          # every earlier layer comes from the cache

                          3. Faster distribution

                          # Pulling an image downloads only the layers missing locally
                          docker pull nginx:1.21
                          # Already exists: Layer 1 (ubuntu base)
                          # Downloading:    Layer 2 (nginx files)
                          # Downloading:    Layer 3 (config)

                          🔗 The Full Container Workflow

                          How Docker Creates a Container

                          docker run -d --name web \
                            --cpus=0.5 \
                            -m 512m \
                            -p 8080:80 \
                            nginx:latest

                          What Happens Internally

                          1. Pull the image (if not present locally)
                             └─ Download each layer into /var/lib/docker/overlay2/
                          
                          2. Create namespaces
                             ├─ PID namespace (process isolation)
                             ├─ Network namespace (network isolation)
                             ├─ Mount namespace (filesystem isolation)
                             ├─ UTS namespace (hostname isolation)
                             ├─ IPC namespace (IPC isolation)
                             └─ User namespace (user isolation)
                          
                          3. Configure cgroups
                             ├─ cpu.cfs_quota_us = 50000 (50% CPU)
                             └─ memory.limit_in_bytes = 536870912 (512MB)
                          
                          4. Mount the filesystem (OverlayFS)
                             ├─ lowerdir: read-only image layers
                             ├─ upperdir: container writable layer
                             ├─ workdir: work directory
                             └─ merged: unified-view mount point
                          
                          5. Set up networking
                             ├─ Create a veth pair
                             ├─ Attach one end to the container's network namespace
                             ├─ Attach the other end to the docker0 bridge
                             ├─ Assign an IP address
                             └─ Add iptables NAT rules (port mapping)
                          
                          6. Switch the root directory
                             ├─ chroot or pivot_root
                             └─ The container's / is the merged directory
                          
                          7. Start the container process
                             ├─ inside the new namespaces
                             ├─ constrained by cgroups
                             ├─ on the new root filesystem
                             └─ executing ENTRYPOINT/CMD
                          
                          8. While running
                             └─ containerd-shim supervises the process

                          🛠️ Building a Container by Hand (No Docker)

                          Complete Example: a Container from Scratch

                          #!/bin/bash
                          # Build a minimal container by hand
                          
                          # 1. Prepare a root filesystem
                          mkdir -p /tmp/mycontainer/rootfs
                          cd /tmp/mycontainer/rootfs
                          
                          # Use busybox as the base system
                          wget https://busybox.net/downloads/binaries/1.35.0-x86_64-linux-musl/busybox
                          chmod +x busybox
                          ./busybox --install -s .
                          
                          # Create the required directories
                          mkdir -p bin sbin etc proc sys tmp dev
                          
                          # 2. Create a startup script
                          # (uses busybox's sh, mount, hostname, and unshare applets)
                          cat > /tmp/mycontainer/start.sh <<'EOF'
                          #!/bin/sh
                          
                          # Create new namespaces
                          unshare --pid --net --mount --uts --ipc --fork /bin/sh -c '
                              # Mount proc
                              mount -t proc proc /proc
                              
                              # Set the hostname
                              hostname mycontainer
                              
                              # Start a shell
                              /bin/sh
                          '
                          EOF
                          
                          chmod +x /tmp/mycontainer/start.sh
                          
                          # 3. Launch the container (the script must live inside the chroot)
                          cp /tmp/mycontainer/start.sh /tmp/mycontainer/rootfs/start.sh
                          chroot /tmp/mycontainer/rootfs /start.sh

                          Applying Cgroup Limits

                          # Create cgroups
                          mkdir -p /sys/fs/cgroup/memory/mycontainer
                          mkdir -p /sys/fs/cgroup/cpu/mycontainer
                          
                          # Limit memory to 256MB
                          echo 268435456 > /sys/fs/cgroup/memory/mycontainer/memory.limit_in_bytes
                          
                          # Cap CPU at 50%
                          echo 50000 > /sys/fs/cgroup/cpu/mycontainer/cpu.cfs_quota_us
                          echo 100000 > /sys/fs/cgroup/cpu/mycontainer/cpu.cfs_period_us
                          
                          # Add the container process to the cgroups
                          echo $CONTAINER_PID > /sys/fs/cgroup/memory/mycontainer/cgroup.procs
                          echo $CONTAINER_PID > /sys/fs/cgroup/cpu/mycontainer/cgroup.procs

🔍 Containers vs Virtual Machines

Architecture comparison

VM architecture:
┌─────────────────────────────────────┐
│  App A  │  App B  │  App C          │
├─────────┼─────────┼─────────────────┤
│ Bins/Libs│ Bins/Libs│ Bins/Libs      │
├─────────┼─────────┼─────────────────┤
│ Guest OS│ Guest OS│ Guest OS        │  ← each VM carries a full OS
├─────────┴─────────┴─────────────────┤
│       Hypervisor (VMware/KVM)       │
├─────────────────────────────────────┤
│         Host Operating System       │
├─────────────────────────────────────┤
│         Hardware                    │
└─────────────────────────────────────┘

Container architecture:
┌─────────────────────────────────────┐
│  App A  │  App B  │  App C          │
├─────────┼─────────┼─────────────────┤
│ Bins/Libs│ Bins/Libs│ Bins/Libs      │
├─────────────────────────────────────┤
│  Docker Engine / containerd         │
├─────────────────────────────────────┤
│    Host Operating System (Linux)    │  ← shared kernel
├─────────────────────────────────────┤
│         Hardware                    │
└─────────────────────────────────────┘

Performance comparison

| Dimension     | Virtual machine           | Container                  |
|---------------|---------------------------|----------------------------|
| Startup time  | minutes                   | seconds                    |
| Footprint     | GBs of memory             | MBs of memory              |
| Overhead      | 5-10%                     | < 1%                       |
| Isolation     | full (hardware-level)     | process (OS-level)         |
| Security      | stronger (own kernel)     | weaker (shared kernel)     |
| Density       | 10-50 per physical host   | 100-1000 per physical host |

⚠️ Container Security Considerations

1. Risks of the shared kernel

# Container escape: a kernel vulnerability may let a container break out to the host

# Mitigations:
# - Use user namespaces
# - Run containers as a non-root user
# - Restrict system calls with seccomp
# - Use AppArmor/SELinux

2. Dangers of privileged containers

# A privileged container can access every device on the host
docker run --privileged ...

# ❌ Dangerous: inside such a container you can:
# - load kernel modules
# - access all host devices
# - change the host's network configuration
# - read and write arbitrary host files

# ✅ Best practice: avoid privileged containers

3. Capability control

# Grant the container only the capabilities it needs
docker run --cap-drop ALL --cap-add NET_BIND_SERVICE nginx

# Capabilities Docker grants by default:
# - CHOWN, DAC_OVERRIDE, FOWNER, FSETID
# - KILL, SETGID, SETUID, SETPCAP
# - NET_BIND_SERVICE, NET_RAW
# - SYS_CHROOT, MKNOD, AUDIT_WRITE, SETFCAP

💡 Key Takeaways

A container = Namespaces + Cgroups + Union FS

1. Namespaces (isolation)

  • PID: process isolation
  • Network: network isolation
  • Mount: filesystem isolation
  • UTS: hostname isolation
  • IPC: inter-process communication isolation
  • User: user isolation
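The namespace memberships listed above can be inspected for any running process through /proc, without root. A quick sketch:

```shell
# Each /proc/<pid>/ns entry is a link naming the namespace a process belongs to;
# two processes share a namespace iff their links resolve to the same inode.
for ns in pid net mnt uts ipc user; do
    printf '%-5s -> %s\n' "$ns" "$(readlink "/proc/self/ns/$ns")"
done
```

On a typical Linux host this prints lines like `pid -> pid:[4026531836]`; comparing these inode numbers between a shell on the host and a shell inside a container shows exactly which namespaces the container does not share.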
2. Cgroups (limits)

  • CPU: caps processor usage
  • Memory: caps memory usage
  • Block I/O: caps disk I/O
  • Network: caps network bandwidth
3. Union FS (layering)

  • Layered image storage
  • Copy-on-Write
  • Saves space and bandwidth

Containers are not virtual machines

• ✅ A container is just a special kind of process
• ✅ It shares the host's kernel
• ✅ Fast startup, small footprint
• ⚠️ Weaker isolation than a VM
• ⚠️ Security configuration matters
                          Mar 7, 2024

                          Subsections of Database

                          Elastic Search DSL

                          Basic Query

exists query

                          Returns documents that contain an indexed value for a field.

                          GET /_search
                          {
                            "query": {
                              "exists": {
                                "field": "user"
                              }
                            }
                          }

                          The following search returns documents that are missing an indexed value for the user.id field.

                          GET /_search
                          {
                            "query": {
                              "bool": {
                                "must_not": {
                                  "exists": {
                                    "field": "user.id"
                                  }
                                }
                              }
                            }
                          }
fuzzy query

Returns documents that contain terms similar to the search term, as measured by a Levenshtein edit distance.

GET /_search
{
  "query": {
    "fuzzy": {
      "field_A": {
        "value": "ki"
      }
    }
  }
}

The same query, with all optional parameters spelled out:

                          GET /_search
                          {
                            "query": {
                              "fuzzy": {
"field_A": {
                                  "value": "ki",
                                  "fuzziness": "AUTO",
                                  "max_expansions": 50,
                                  "prefix_length": 0,
                                  "transpositions": true,
                                  "rewrite": "constant_score_blended"
                                }
                              }
                            }
                          }

rewrite (controls how the query is rewritten into a term-level query):

• constant_score_blended, constant_score
• constant_score_boolean, scoring_boolean
• top_terms_blended_freqs_N
• top_terms_boost_N, top_terms_N
                          ids query

                          Returns documents based on their IDs. This query uses document IDs stored in the _id field.

                          GET /_search
                          {
                            "query": {
                              "ids" : {
                                "values" : ["2NTC5ZIBNLuBWC6V5_0Y"]
                              }
                            }
                          }
                          prefix query

The following search returns documents where the field_A field contains a term that begins with ki.

                          GET /_search
                          {
                            "query": {
                              "prefix": {
"field_A": {
                                  "value": "ki",
                                   "rewrite": "constant_score_blended",
                                   "case_insensitive": true
                                }
                              }
                            }
                          }

                          You can simplify the prefix query syntax by combining the <field> and value parameters.

                          GET /_search
                          {
                            "query": {
"prefix" : { "field_A" : "ki" }
                            }
                          }
                          range query

                          Returns documents that contain terms within a provided range.

                          GET /_search
                          {
                            "query": {
                              "range": {
"field_number": {
                                  "gte": 10,
                                  "lte": 20,
                                  "boost": 2.0
                                }
                              }
                            }
                          }
                          GET /_search
                          {
                            "query": {
                              "range": {
"field_timestamp": {
                                  "time_zone": "+01:00",        
                                  "gte": "2020-01-01T00:00:00", 
                                  "lte": "now"                  
                                }
                              }
                            }
                          }
                          regex query

                          Returns documents that contain terms matching a regular expression.

                          GET /_search
                          {
                            "query": {
                              "regexp": {
"field_A": {
                                  "value": "k.*y",
                                  "flags": "ALL",
                                  "case_insensitive": true,
                                  "max_determinized_states": 10000,
                                  "rewrite": "constant_score_blended"
                                }
                              }
                            }
                          }
                          term query

                          Returns documents that contain an exact term in a provided field.

                          You can use the term query to find documents based on a precise value such as a price, a product ID, or a username.

                          GET /_search
                          {
                            "query": {
                              "term": {
"field_A": {
                                  "value": "kimchy",
                                  "boost": 1.0
                                }
                              }
                            }
                          }
                          wildcard query

                          Returns documents that contain terms matching a wildcard pattern.

                          A wildcard operator is a placeholder that matches one or more characters. For example, the * wildcard operator matches zero or more characters. You can combine wildcard operators with other characters to create a wildcard pattern.

                          GET /_search
                          {
                            "query": {
                              "wildcard": {
"field_A": {
                                  "value": "ki*y",
                                  "boost": 1.0,
                                  "rewrite": "constant_score_blended"
                                }
                              }
                            }
                          }
                          Oct 7, 2024

                          HPC

                            Mar 7, 2024

                            K8s

                            Mar 7, 2024

                            Subsections of K8s

Understanding K8s

I. Core positioning: the operating system of the cloud era

My most fundamental understanding of K8s is this: it is becoming the "operating system" of the data center and the cloud.

• A traditional operating system (Windows, Linux): manages the hardware resources of a single machine (CPU, memory, disk, network) and provides a runtime environment for applications (processes).
• Kubernetes: manages the resources of a cluster of many machines, abstracting those physical/virtual machines into one huge resource pool. What it schedules and runs on that pool are no longer plain processes, but containerized applications.

So you can think of K8s as a distributed operating system for cloud-native applications.


II. The core problems it solves: from "zoo" to "ranch"

Before K8s, microservices and containerized architectures brought new challenges:

1. Orchestration chaos: with hundreds or thousands of containers, which machine should each one start on? How do I know whether they are healthy? What happens when one dies? How do I scale out and in?
2. Networking complexity: how do containers discover and talk to each other? How is load balancing done?
3. Storage management: how is a stateful application's data persisted? When a container moves to another node, how does its data follow?
4. Deployment pain: how do I do blue-green deployments or canary releases? How do I roll back?

This period after the "container revolution" is often called the "orchestration wars": the competing tools (Docker Swarm, Mesos, Nomad) formed a chaotic "zoo".

K8s (born from Google's experience with its internal Borg system) was created to solve these problems systematically, turning the chaotic "zoo" into an orderly "ranch". Its core capability can be summed up as: the declarative API plus the controller pattern.


III. Core architecture and working model: brain and limbs

A K8s cluster consists of a control plane and worker nodes.

• Control plane: the brain of the cluster

  • kube-apiserver: the single entry point of the whole system; every component must go through it to read or change cluster state. It is the "front desk".
  • etcd: a highly available key-value store that persists all cluster state. It is the "memory center" of the cluster.
  • kube-scheduler: handles scheduling, deciding which node each Pod should run on. It is the "HR department".
  • kube-controller-manager: runs the various controllers, which continuously check whether current state matches desired state and drive them into agreement (node controller, replica controller, and so on). It is the "automated management team".
• Worker nodes: the limbs that do the work

  • kubelet: the "foreman" on each node; it talks to the control plane, manages the lifecycle of the Pods on its node, and keeps their containers running healthily.
  • kube-proxy: maintains the network rules on the node, implementing Service load balancing and network proxying.
  • Container runtime: containerd or CRI-O, which actually pulls images and runs containers.

The heart of the working model: declarative API + controller pattern

1. You submit a YAML/JSON file to kube-apiserver declaring the desired state of your application (for example: run 3 Nginx instances).
2. etcd records this desired state.
3. The controllers continuously observe the current state and compare it against the desired state in etcd.
4. On any mismatch (say, only one Nginx instance is running), a controller takes action (creates two more Pods) until current state matches desired state.
5. The whole process is self-healing and automatic.
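The declarative loop above can be seen in a minimal sketch of a Deployment manifest (the name and image tag here are illustrative): you declare `replicas: 3`, and the Deployment controller keeps reality in line with that number.

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: nginx-demo         # illustrative name
spec:
  replicas: 3              # desired state: three Pods
  selector:
    matchLabels:
      app: nginx-demo
  template:
    metadata:
      labels:
        app: nginx-demo
    spec:
      containers:
      - name: nginx
        image: nginx:1.25  # assumed image tag
```

Apply it with `kubectl apply -f`, then delete one of the resulting Pods: the controller immediately recreates it, which is the self-healing behavior described in step 5.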

IV. Key objects and abstractions: Lego bricks

K8s models applications with a set of abstract objects that compose like Lego bricks:

1. Pod: the smallest unit of deployment and management. A Pod holds one or more tightly coupled containers (e.g. a main container plus a sidecar) that share network and storage. This is the "atom" of K8s.
2. Deployment: defines a stateless application. It manages multiple Pod replicas and provides rolling updates, rollback, and other powerful deployment strategies. It is the most commonly used object.
3. Service: defines how a group of Pods is accessed. Pods are ephemeral and their IPs change; a Service provides a stable IP and DNS name and load-balances traffic across the healthy Pods behind it. It is the "front door" of a service.
4. ConfigMap & Secret: decouple configuration and sensitive data from container images, so configuration can be managed flexibly.
5. Volume: abstracts over storage backends to give Pods persistent storage.
6. Namespace: carves a physical cluster into multiple virtual clusters for resource isolation and multi-tenancy.
7. StatefulSet: for stateful applications (such as databases). It gives each Pod a stable identity, ordered deployment and scaling, and stable persistent storage.
8. Ingress: manages external access to services inside the cluster, typically providing HTTP/HTTPS routing, TLS termination, and so on. It is the "main traffic entrance" of the cluster.

V. Core value and advantages

1. Automated operations: deployment, scaling, failure recovery (self-healing), and rolling updates are automated, drastically cutting operational cost.
2. Declarative configuration & immutable infrastructure: everything is defined in YAML files, so infrastructure is versionable, traceable, and reproducible. This is the foundation of DevOps and GitOps.
3. Environment consistency & portability: "write once, run anywhere" — the application behaves the same on a developer laptop, in test, or on public and hybrid clouds.
4. High availability & elastic scaling: multi-replica deployments are easy, and workloads can autoscale on CPU, memory, or custom metrics to absorb traffic spikes.
5. Rich ecosystem: a huge, active community provides tools and extensions (Helm, Operators, Istio, etc.) for almost any problem you can think of.

VI. Challenges and the learning curve

K8s is no silver bullet; it has challenges of its own:

• High complexity: many concepts, a complex architecture, and very real learning and operating costs.
• Heavy "configuration": the sheer volume of YAML can become a discipline of its own to manage.
• Networking and storage: core abstractions, yes, but their underlying implementations remain a significant barrier to understand.

Summary

In my view, Kubernetes is not just a container orchestrator; it is a management paradigm for cloud-native applications. Through a set of well-designed abstractions it standardizes, automates, and simplifies the hard problems of managing distributed systems. The entry barrier is real, but it has become the de facto standard for modern application infrastructure — a core technology that anyone working in backend development, operations, or architecture needs to understand.

Put simply: K8s lets you manage a cluster of thousands of machines as if it were a single supercomputer.

                            Mar 7, 2024

What Role Do Cgroups Play in K8s?

Kubernetes integrates deeply with cgroups to implement container resource management and isolation. Here is how cgroups and K8s fit together:

1. The K8s resource model and its cgroup mapping

1.1 Resource requests and limits

                            apiVersion: v1
                            kind: Pod
                            spec:
                              containers:
                              - name: app
                                resources:
                                  requests:
                                    memory: "64Mi"
                                    cpu: "250m"
                                  limits:
                                    memory: "128Mi"
                                    cpu: "500m"
                                    ephemeral-storage: "2Gi"

Corresponding cgroup settings:

• cpu.shares = 256 (250m × 1024 / 1000)
• cpu.cfs_quota_us = 50000 (500m × 100000 / 1000)
• memory.limit_in_bytes = 134217728 (128Mi)
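These conversions are plain arithmetic and can be checked in a shell (cgroup v1 file names; under cgroup v2 the equivalents are cpu.weight, cpu.max, and memory.max):

```shell
# millicores -> cpu.shares: shares = m * 1024 / 1000
echo $(( 250 * 1024 / 1000 ))      # 256

# millicores -> CFS quota (default 100ms period): quota_us = m * 100000 / 1000
echo $(( 500 * 100000 / 1000 ))    # 50000

# Mi -> bytes for memory.limit_in_bytes
echo $(( 128 * 1024 * 1024 ))      # 134217728
```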

2. K8s cgroup drivers

2.1 The cgroupfs driver

# kubelet flags
--cgroup-driver=cgroupfs
--cgroup-root=/sys/fs/cgroup

2.2 The systemd driver (recommended)

# kubelet flags
--cgroup-driver=systemd
--cgroup-root=/sys/fs/cgroup

3. The K8s cgroup hierarchy

3.1 cgroup v1 hierarchy

                            /sys/fs/cgroup/
                            ├── cpu,cpuacct/kubepods/
                            │   ├── burstable/pod-uid-1/
                            │   │   ├── container-1/
                            │   │   └── container-2/
                            │   └── guaranteed/pod-uid-2/
                            │       └── container-1/
                            ├── memory/kubepods/
                            └── pids/kubepods/

3.2 cgroup v2 unified hierarchy

                            /sys/fs/cgroup/kubepods/
                            ├── pod-uid-1/
                            │   ├── container-1/
                            │   └── container-2/
                            └── pod-uid-2/
                                └── container-1/

4. QoS classes and their cgroup settings

4.1 Guaranteed (highest priority)

                            resources:
                              limits:
                                cpu: "500m"
                                memory: "128Mi"
                              requests:
                                cpu: "500m" 
                                memory: "128Mi"

cgroup settings:

• cpu.shares = 512
• cpu.cfs_quota_us = 50000
• oom_score_adj = -998

4.2 Burstable (medium priority)

resources:
  requests:
    cpu: "250m"
    memory: "64Mi"
  # limits unset, or greater than requests

cgroup settings:

• cpu.shares = 256
• cpu.cfs_quota_us = -1 (unlimited)
• oom_score_adj = 2-999

4.3 BestEffort (lowest priority)

# no resources set

cgroup settings:

• cpu.shares = 2
• memory.limit_in_bytes = 9223372036854771712 (effectively unlimited)
• oom_score_adj = 1000

5. Inspecting the actual cgroup configuration

5.1 Finding a Pod's cgroup

# Find the Pod's cgroup path
cat /sys/fs/cgroup/cpu/kubepods/pod-uid-1/cgroup.procs

# Inspect the CPU settings
cat /sys/fs/cgroup/cpu/kubepods/pod-uid-1/cpu.shares
cat /sys/fs/cgroup/cpu/kubepods/pod-uid-1/cpu.cfs_quota_us

# Inspect the memory settings
cat /sys/fs/cgroup/memory/kubepods/pod-uid-1/memory.limit_in_bytes

5.2 Monitoring with cgroup-tools

# Install the tools
apt-get install cgroup-tools

# Dump cgroup statistics
cgget -g cpu:/kubepods/pod-uid-1
cgget -g memory:/kubepods/pod-uid-1

6. K8s features built on cgroups

6.1 Vertical Pod Autoscaler (VPA)

apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
spec:
  targetRef:
    apiVersion: "apps/v1"
    kind: Deployment
    name: my-app
  updatePolicy:
    updateMode: "Auto"

The VPA adjusts resources dynamically based on historical usage data:

• it modifies resources.requests and resources.limits
• kubelet then updates the corresponding cgroup settings

6.2 Horizontal Pod Autoscaler (HPA)

                            apiVersion: autoscaling/v2
                            kind: HorizontalPodAutoscaler
                            spec:
                              scaleTargetRef:
                                apiVersion: apps/v1
                                kind: Deployment
                                name: my-app
                              minReplicas: 1
                              maxReplicas: 10
                              metrics:
                              - type: Resource
                                resource:
                                  name: cpu
                                  target:
                                    type: Utilization
                                    averageUtilization: 50

The HPA bases its decisions on the CPU utilization statistics that cgroups collect.

6.3 Resource monitoring

# Read container resource usage straight from cgroups
cat /sys/fs/cgroup/cpu/kubepods/pod-uid-1/cpuacct.usage
cat /sys/fs/cgroup/memory/kubepods/pod-uid-1/memory.usage_in_bytes

# Or collect it through metrics-server
kubectl top pods
kubectl top nodes

7. Node resource management

7.1 Reserving resources for the system

# kubelet configuration
                            apiVersion: kubelet.config.k8s.io/v1beta1
                            kind: KubeletConfiguration
                            systemReserved:
                              cpu: "100m"
                              memory: "256Mi"
                              ephemeral-storage: "1Gi"
                            kubeReserved:
                              cpu: "200m"
                              memory: "512Mi"
                              ephemeral-storage: "2Gi"
                            evictionHard:
                              memory.available: "100Mi"
                              nodefs.available: "10%"

7.2 Eviction policy

When a node runs short of resources, kubelet decides based on cgroup statistics:

• it watches memory.usage_in_bytes
• it watches cpuacct.usage
• and triggers Pod eviction when thresholds are crossed

8. Troubleshooting and debugging

8.1 Checking the cgroup configuration

# Get a shell on the node
docker exec -it node-shell /bin/bash

# Locate the Pod's cgroups
find /sys/fs/cgroup -name "*pod-uid*" -type d

# Check the resource limits
cat /sys/fs/cgroup/memory/kubepods/pod-uid-1/memory.limit_in_bytes
cat /sys/fs/cgroup/cpu/kubepods/pod-uid-1/cpu.cfs_quota_us

8.2 Watching for OOM events

# Check the kernel log
dmesg | grep -i "killed process"

# Check for cgroup OOM events
grep "kubepods" /var/log/kern.log | grep -i oom

9. Best practices

9.1 Set resource limits sensibly

resources:
  requests:
    cpu: "100m"
    memory: "128Mi"
  limits:
    cpu: "200m"    # do not oversize the CPU limit
    memory: "256Mi" # avoid wasting memory

9.2 Using a LimitRange

                            apiVersion: v1
                            kind: LimitRange
                            metadata:
                              name: mem-limit-range
                            spec:
                              limits:
                              - default:
                                  memory: "256Mi"
                                defaultRequest:
                                  memory: "128Mi"
                                type: Container

Cgroups are the foundation of K8s resource management: through fine-grained cgroup configuration, K8s delivers resource isolation, fair scheduling, and stability guarantees in multi-tenant environments.

                            Mar 7, 2024

Headless VS ClusterIP

Q: What is the difference between a headless Service and a normal Service? Is it just whether there is a ClusterIP?

"Whether there is a ClusterIP" is only the surface; behind it lie two fundamentally different service-discovery models and use cases.


The core difference: the service-discovery model

• Normal Service: provides "load-balanced" service discovery.
  • It abstracts a set of Pods; you access a stable virtual IP (the ClusterIP), and kube-proxy forwards the traffic to one of the backend Pods.
  • The client neither knows nor cares which Pod handles the request.
• Headless Service: provides "direct Pod IP" service discovery.
  • Instead of one VIP, it returns the IP addresses of all backend Pods.
  • The client can talk to any Pod directly, and knows exactly which Pod it is talking to.

A detailed comparison

| Property | Normal Service | Headless Service |
|---|---|---|
| clusterIP field | a VIP is auto-assigned (e.g. 10.96.123.456) | must be set to None — this is what defines a headless Service |
| Core function | load balancing: a proxy and distributor of traffic | service discovery: a DNS registrar for the Pods; it does not forward traffic |
| DNS resolution | resolves to the Service's ClusterIP | resolves to the IP addresses of all Pods matching the selector |
| Network path | client -> ClusterIP (VIP) -> (kube-proxy load-balances) -> some Pod | client -> Pod IP |
| Typical use | standard microservices, web frontend/backend APIs — any scenario needing load balancing | stateful clusters (MySQL, MongoDB, Kafka, Redis Cluster) and anything that must connect to a specific Pod (gRPC long-lived connections, game servers) |

A closer look at DNS resolution behavior

This is the most direct way to see the difference between the two.

Suppose we have a Service named my-app that selects 3 Pods.

1. DNS resolution for a normal Service

• Inside the cluster, run nslookup my-app (or resolve it from code in a Pod).
• Result: a single A record, pointing at the Service's ClusterIP.
  Name:      my-app
  Address 1: 10.96.123.456
• Your application: connects to 10.96.123.456:port and leaves the rest to the Kubernetes network layer.

2. DNS resolution for a headless Service

• Inside the cluster, run nslookup my-app (note: the Service has clusterIP: None).
• Result: multiple A records, pointing directly at every backend Pod's IP.
  Name:      my-app
  Address 1: 172.17.0.10
  Address 2: 172.17.0.11
  Address 3: 172.17.0.12
• Your application: receives this IP list and decides, client-side, how to connect. For example, it can:
  • pick one at random,
  • implement its own load-balancing logic, or
  • connect to every Pod (say, to collect status).
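A minimal sketch of the two Service definitions side by side — the only structural switch is `clusterIP: None` (the names and port here are illustrative):

```yaml
# Normal Service: gets a VIP; kube-proxy load-balances.
apiVersion: v1
kind: Service
metadata:
  name: my-app
spec:
  selector:
    app: my-app
  ports:
  - port: 80
---
# Headless Service: no VIP; DNS returns the Pod IPs directly.
apiVersion: v1
kind: Service
metadata:
  name: my-app-headless
spec:
  clusterIP: None        # the defining line
  selector:
    app: my-app
  ports:
  - port: 80
```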

The "killer app": pairing with a StatefulSet

The classic and most powerful use of a headless Service is alongside a StatefulSet, giving a stateful application cluster stable network identities.

Recall the earlier MongoDB example:

• StatefulSet: mongodb (3 replicas)
• Headless Service: mongodb-service

Here the DNS system creates stable, predictable records — not just a list of IPs:

• Each Pod gets a stable DNS name of its own:

  • mongodb-0.mongodb-service.default.svc.cluster.local
  • mongodb-1.mongodb-service.default.svc.cluster.local
  • mongodb-2.mongodb-service.default.svc.cluster.local
• Querying the headless Service itself (mongodb-service) returns all Pod IPs.

The advantages are substantial:

1. Stable membership: when initializing the MongoDB replica set, you can configure the member list with these stable DNS names. Even if a Pod restarts and its IP changes, its DNS name never does, so the configuration never goes stale.
2. Direct Pod-to-Pod communication: in systems like Kafka or Redis Cluster, nodes need to talk to each other directly to replicate data. They can find each other by these stable names without passing through an unnecessary load balancer.
3. Leader election and read/write splitting: a client can connect to the primary by its fixed DNS name (e.g. mongodb-0...) for writes, and to the other names for reads.

Summary

A vivid way to picture it:

• A normal Service is like a company's switchboard number.

  • You call the switchboard (ClusterIP) and say "technical support, please"; the operator (kube-proxy) connects you to whichever support agent (Pod) is free. You never need to know who served you.
• A headless Service is like the company's internal directory.

  • It offers no switchboard. It simply hands you the list of every employee (Pod) with their direct line (IP).
  • With a StatefulSet in particular, each employee in the directory additionally has a fixed, dedicated desk and extension (a stable DNS name) — "Zhang San sits at A-001, extension 8001". When you know who you need, you dial their extension directly.

So "whether there is a ClusterIP" is just a switch; what it selects between are two entirely different models of service discovery and traffic management. For applications that are stateful, clustered, or need direct addressing, a headless Service is an indispensable building block.

                            Mar 7, 2024

Creating A Pod

Walking through how a Pod is created in Kubernetes is a clear way to show how the core K8s components work together.

The process splits into two main phases: decision-making on the control plane and execution on the worker node.


Phase 1: control-plane decisions (the brain)

1. The user submits a request

  • The user submits a Pod definition file to the kube-apiserver with kubectl apply -f pod.yaml.
  • kubectl validates the configuration, converts it to JSON, and sends it to the kube-apiserver as a REST API call.
2. API Server processing and validation

  • On receiving the request, the kube-apiserver performs a series of steps:
    • Authentication: verify the user's identity.
    • Authorization: check that the user is allowed to create Pods.
    • Admission control: admission controllers may mutate or validate the Pod object (for example, injecting sidecar containers or applying default resource limits).
  • Once all checks pass, the kube-apiserver writes the Pod object into etcd. At this point the Pod's state in etcd is Pending.
  • The creation request is now recorded, but the Pod has not been scheduled to any node yet.
3. Scheduler decision

  • The kube-scheduler, acting as a controller, watches the kube-apiserver and notices a newly created Pod whose nodeName is empty.
  • It selects the most suitable node for the Pod in two stages:
    • Filtering: rule out unsuitable nodes based on resources (CPU, memory), taints, node selectors, storage, image pulls, and so on.
    • Scoring: score the remaining nodes (considering resource balance, affinity, etc.) and pick the highest scorer.
  • Having decided, the kube-scheduler patches the Pod in the kube-apiserver, setting its nodeName field to the chosen node.
  • The kube-apiserver writes the updated object back to etcd.
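The two-stage decision can be sketched in a toy model (illustrative only; the node fields and the "most free CPU wins" scoring policy are invented for the example):

```python
# A minimal model of the scheduler's filter-then-score loop.
def schedule(pod_cpu, nodes):
    # Filtering: drop nodes that cannot fit the Pod's CPU request or are tainted.
    feasible = [n for n in nodes if n["free_cpu"] >= pod_cpu and not n["tainted"]]
    if not feasible:
        return None  # Pod stays Pending
    # Scoring: prefer the node with the most free CPU (a "least allocated" policy).
    best = max(feasible, key=lambda n: n["free_cpu"])
    return best["name"]

nodes = [
    {"name": "node-a", "free_cpu": 2, "tainted": False},
    {"name": "node-b", "free_cpu": 6, "tainted": False},
    {"name": "node-c", "free_cpu": 8, "tainted": True},
]
print(schedule(4, nodes))  # node-b
```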

Phase 2: worker-node execution (the limbs)

1. kubelet picks up the task

  • The kubelet on the target node also watches the kube-apiserver and notices that a Pod has been "assigned" to its node (the Pod's nodeName matches its own node name).
  • The kubelet reads the full Pod definition from the kube-apiserver.
2. kubelet drives the container runtime

  • The kubelet calls the local container runtime (e.g. containerd, CRI-O) through the CRI interface.
  • The container runtime is responsible for:
    • Pulling the container image from the registry (if not already present locally).
    • Creating and starting the containers per the Pod definition.
3. Configuring the container environment

  • Around container startup, the kubelet completes further configuration through other interfaces:
    • CNI: calls the network plugin (e.g. Calico, Flannel) to assign the Pod an IP address and configure networking.
    • CSI: if the Pod uses persistent storage, calls the storage plugin to mount the volumes.
4. Status reporting

  • Once all containers in the Pod are up and running, the kubelet keeps monitoring their health.
  • It reports the Pod's current state (e.g. Running), IP address, and other details back to the kube-apiserver as status updates.
  • The kube-apiserver finally writes this status into etcd.

Summary flow

User kubectl -> API Server -> (write) etcd -> Scheduler (binds node) -> API Server -> (update) etcd -> target node's kubelet -> container runtime (pull image, start containers) -> CNI/CSI (network/storage) -> kubelet -> API Server -> (update status) etcd

Key points:

• Declarative API: the user declares a "desired state" and the system drives the "current state" toward it.
• Watch and cooperate: every component gets its work by watching the kube-apiserver.
• etcd as the single source of truth: the cluster's state is always whatever etcd says it is.
• Separation of concerns: the Scheduler only schedules, the kubelet only executes, and the API Server only handles interaction and storage.
                            Mar 7, 2024

                            Deleting A Pod

Deleting a Pod mirrors the creation flow, but its focus is on terminating a running instance gracefully and safely. Multiple components cooperate here as well.

Below is the deletion flow for a Pod; at its heart is Kubernetes' graceful-termination mechanism.


Core stages of deletion

Stage 1: the user issues the delete command

1. User runs the command: kubectl delete pod <pod-name>.
2. API Server receives the request:
  • kubectl sends a DELETE request to the kube-apiserver.
  • The kube-apiserver authenticates and authorizes the request.
3. "Mark for deletion": after validation, the kube-apiserver does not immediately remove the Pod object from etcd. Instead it performs a key operation: it sets a deletion timestamp (deletionTimestamp) on the Pod and marks it Terminating. This state is persisted to etcd.

Stage 2: control plane and nodes are notified

1. Components notice the change:
  • Everything watching the kube-apiserver (the kube-scheduler, the kubelets on every node) immediately sees the Pod's state turn Terminating.
  • The Endpoint Controller immediately removes the Pod's IP from the associated Service's Endpoints (or EndpointSlice) list. From this moment, no new traffic is load-balanced to this Pod.

Stage 3: graceful termination on the node

This is the most critical stage, and it happens on the Pod's worker node.

1. kubelet sees the state change: the kubelet on the target node, via its watch, notices that one of its Pods has been marked Terminating.

2. The graceful shutdown sequence begins:

  • Step 1: run the PreStop hook (if configured). The kubelet first executes any preStop hook defined on the Pod's containers: a command or HTTP request executed before the termination signal is sent. Common uses:
    • Telling an upstream load balancer that this instance is going offline.
    • Letting the application drain in-flight requests.
    • Running cleanup tasks.
  • Step 2: send SIGTERM. The kubelet, via the container runtime, sends SIGTERM (signal 15) to the main process of every container in the Pod. This is the "graceful shutdown" signal, telling the application: "you are about to be terminated; save state, finish your current work, and exit on your own."
    • Note: the preStop hook runs before SIGTERM is sent, but the terminationGracePeriodSeconds countdown starts as soon as termination begins, so a slow hook eats into the time the application has left to react to SIGTERM.
3. Waiting out the termination grace period:

  • After sending SIGTERM, Kubernetes does not kill the containers immediately. It waits for a period called terminationGracePeriodSeconds (default 30 seconds).
  • Ideally, the application catches SIGTERM, starts its graceful shutdown, and exits on its own within the grace period.
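To make the application side concrete, here is a minimal Python sketch of a process that reacts to SIGTERM the way a well-behaved container should (it simulates the kubelet by signaling itself):

```python
# A minimal sketch of an application that shuts down gracefully on SIGTERM,
# as a containerized service should during Pod termination.
import os
import signal

shutting_down = False

def handle_sigterm(signum, frame):
    # Stop accepting new work, flush state, then let the main loop exit.
    global shutting_down
    shutting_down = True

signal.signal(signal.SIGTERM, handle_sigterm)

# Simulate the kubelet's signal by sending SIGTERM to ourselves.
os.kill(os.getpid(), signal.SIGTERM)
print(shutting_down)  # True
```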

Stage 4: forced termination and cleanup

1. After the grace period:

  • Case A, graceful shutdown succeeded: if all containers stop within the grace period, the kubelet has the container runtime clean up the container resources, then moves on.
  • Case B, graceful shutdown failed: if containers are still running when the grace period expires, the kubelet forces the kill: it sends SIGKILL (signal 9) to each container's main process. SIGKILL cannot be caught or ignored and terminates the process immediately.
2. Resource cleanup:

  • Once the containers have stopped (gracefully or by force), the kubelet cleans up container resources via the runtime.
  • It also tears down the Pod's network resources (via the CNI plugin) and storage resources (unmounting volumes).
3. Final status report:

  • The kubelet sends a final update to the kube-apiserver confirming the Pod has fully stopped.
  • The kube-apiserver then formally deletes the Pod object from etcd. Only now has the Pod truly disappeared from the system.

Summary flow

User kubectl delete -> API Server -> (mark Pod Terminating in etcd) -> Endpoint Controller (remove IP from Service) -> target node's kubelet -> run PreStop hook -> send SIGTERM -> (wait terminationGracePeriodSeconds) -> [clean up on success / send SIGKILL on timeout] -> tear down network/storage -> kubelet -> API Server -> (delete object from etcd)

Key points

1. Graceful termination is the core: Kubernetes gives applications a chance to clean up after themselves, which is the foundation of zero-downtime releases and rolling updates.
2. Traffic is cut first: removing the Pod from the Service's Endpoints happens first, guaranteeing that no new traffic arrives once shutdown begins.
3. Two key knobs:
  • terminationGracePeriodSeconds: how long the application gets to shut itself down.
  • preStop hook: a chance to actively run a shutdown script, more reliable than waiting on SIGTERM alone.
4. Forced termination as the backstop: if the application cannot respond to the graceful signals, Kubernetes has a last resort to reclaim the resources.

Understanding this flow is essential for designing robust microservices that handle shutdown signals correctly.

                            Mar 7, 2024

                            Deployment VS ReplicaSet

Below I explain how they work at the levels of architecture, workflow, control loops, data structures, and the event chain.


🧩 1. How the core concepts relate

The hierarchy first:

Deployment → ReplicaSet → Pod

| Level      | Responsibility                                          | Controller type                  |
|------------|---------------------------------------------------------|----------------------------------|
| Deployment | Declares the "application version" and rollout strategy | High-level (managing) controller |
| ReplicaSet | Keeps the specified number of Pod replicas running      | Basic (ensuring) controller      |
| Pod        | Smallest schedulable unit; runs the actual containers   | Workload object                  |

Think of it as:

The Deployment is the policy controller, the ReplicaSet is the replica-count controller, and the Pod is the execution unit.


⚙️ 2. How a Deployment works (the upper controller)

1️⃣ The Deployment object

When you create a Deployment, for example:

                            apiVersion: apps/v1
                            kind: Deployment
                            metadata:
                              name: webapp
                            spec:
                              replicas: 3
                              selector:
                                matchLabels:
                                  app: webapp
                              template:
                                metadata:
                                  labels:
                                    app: webapp
                                spec:
                                  containers:
                                  - name: nginx
                                    image: nginx:1.25

This creates a Deployment object and writes it to etcd.


2️⃣ The Deployment Controller discovers the new object

The Deployment Controller inside kube-controller-manager subscribes to (watches) Deployment changes through an Informer and shared indexer cache.

When it sees a new Deployment, it runs logic along these lines:

                            syncDeployment(deployment):
                                rsList := list ReplicaSets matching deployment.selector
                                newRS := findReplicaSetMatchingPodTemplate(deployment.spec.template)
                                if newRS == nil:
                                    newRS = createReplicaSet(deployment.spec.template)
                                adjustReplicaCounts(newRS, oldRSList)

3️⃣ Creating and managing ReplicaSets

• If the template (spec.template) changes (for example, the image goes from nginx:1.25 to nginx:1.26), the Deployment Controller creates a new ReplicaSet.
• The old ReplicaSet is scaled down while the new one is scaled up, producing a rolling update.

For example:

                            Deployment: webapp
                             ├── ReplicaSet (nginx:1.25) — scale 2
                             └── ReplicaSet (nginx:1.26) — scale 1

When the update completes:

                            Deployment: webapp
                             └── ReplicaSet (nginx:1.26) — scale 3

4️⃣ The rolling-update strategy (the core logic)

A Deployment's rolling update is implemented by the RollingUpdateDeployment strategy:

                            strategy:
                              type: RollingUpdate
                              rollingUpdate:
                                maxUnavailable: 1
                                maxSurge: 1

On every pass, the control loop checks:

• How many replicas are available?
• May another new Pod be started (bounded by maxSurge)?
• May an old Pod be terminated (bounded by maxUnavailable)?

On each sync the controller will:

1. Create one Pod in the new ReplicaSet;
2. Wait for it to become Ready;
3. Then delete one Pod from the old ReplicaSet;
4. Repeat until the target state is reached.
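The surge/drain rhythm can be simulated in a few lines (an illustrative toy with maxSurge=1, not the real controller; each tuple is (old replicas, new replicas)):

```python
# Simulate a rolling update with maxSurge=1: scale the new ReplicaSet up by
# one, then retire one old-version Pod, until the new set reaches the target.
def rolling_update(desired, max_surge=1):
    old, new = desired, 0
    history = [(old, new)]
    while new < desired:
        # Surge: add a new-version Pod, staying within desired + max_surge total.
        if old + new < desired + max_surge:
            new += 1
        # Once the new Pod is Ready, retire one old-version Pod.
        if old > 0:
            old -= 1
        history.append((old, new))
    return history

print(rolling_update(3))  # [(3, 0), (2, 1), (1, 2), (0, 3)]
```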

5️⃣ The "ownership" between ReplicaSet and Deployment

When a Deployment creates a ReplicaSet, it sets an OwnerReference:

                            metadata:
                              ownerReferences:
                              - apiVersion: apps/v1
                                kind: Deployment
                                name: webapp
                                uid: <deployment-uid>

This lets kube-controller-manager know which Deployment a given ReplicaSet belongs to, and Kubernetes' garbage collector (GC Controller) automatically deletes orphaned ReplicaSets.


🧮 3. Inside the ReplicaSet (the lower controller)

The ReplicaSet Controller's logic is comparatively simple:

                            syncReplicaSet(rs):
                                desired := rs.spec.replicas
                                actual := countPodsMatchingSelector(rs.selector)
                                if desired > actual:
                                    createPods(desired - actual)
                                else if desired < actual:
                                    deletePods(actual - desired)

In other words, the ReplicaSet cares only about whether the Pod count matches the desired count. It knows nothing about versions or rollout strategy.
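The pseudocode above translates almost directly into a runnable sketch (illustrative only):

```python
# The ReplicaSet reconcile step: compare desired vs. actual replica counts
# and return the corrective action the controller would take.
def reconcile_replicaset(desired, actual):
    if desired > actual:
        return ("create", desired - actual)
    if desired < actual:
        return ("delete", actual - desired)
    return ("noop", 0)

print(reconcile_replicaset(3, 1))  # ('create', 2)
print(reconcile_replicaset(3, 5))  # ('delete', 2)
```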


1️⃣ Pod creation

The ReplicaSet Controller stamps out new Pods from the PodTemplateSpec:

                            pod := newPodFromTemplate(rs.spec.template)
                            pod.OwnerReferences = [rs.UID]

It then creates the Pod objects through the API Server.


2️⃣ Pod scheduling and status sync

The newly created Pods are scheduled onto Nodes by the kube-scheduler. Once the kubelet has started the containers, it reports their status; the ReplicaSet learns of the change through its Informer and updates status.availableReplicas.


🧠 4. The two control loops

The whole hierarchy can be summarized in pseudocode:

                            while true:
                              # Deployment Controller Loop
                              for each Deployment:
                                reconcile Deployment → ensure right ReplicaSets exist and scaled properly
                            
                              # ReplicaSet Controller Loop
                              for each ReplicaSet:
                                reconcile ReplicaSet → ensure correct number of Pods exist
                            
                              # Pod Controller Loop (via kubelet)
                              for each Pod:
                                reconcile Pod → ensure container running

The whole system achieves eventual consistency through Informer + WorkQueue + Reconcile. Kubernetes control loops are idempotent: however many times they run, the result is the same.


🔁 5. The event chain (from creation to a running Pod)

| Stage              | Actor                 | Action                                              |
|--------------------|-----------------------|-----------------------------------------------------|
| User               | kubectl apply         | Submits the Deployment                              |
| apiserver          |                       | Writes the Deployment to etcd                       |
| controller-manager | Deployment Controller | Creates the ReplicaSet                              |
| controller-manager | ReplicaSet Controller | Creates the Pods                                    |
| scheduler          | kube-scheduler        | Binds Pods to Nodes                                 |
| kubelet            |                       | Pulls images and runs containers on the node        |
| controller-manager | Deployment Controller | Updates ReplicaSet status, completes rolling update |

🧰 6. Garbage collection and revision history

• By default a Deployment keeps the 10 most recent ReplicaSets:

  revisionHistoryLimit: 10
• Older ReplicaSets beyond the limit are cleaned up automatically;

• You can inspect history and roll back with:

  kubectl rollout history deployment webapp
  kubectl rollout undo deployment webapp --to-revision=2


🧩 7. Key points at a glance

| Aspect                 | Deployment                           | ReplicaSet            |
|------------------------|--------------------------------------|-----------------------|
| Responsibility         | Manages versions and update strategy | Manages replica count |
| Creates Pods directly? | No, goes through a ReplicaSet        | Yes                   |
| Update strategy        | Rolling update, pause, rollback      | Not supported         |
| Typical control loop   | Adjusts ReplicaSets                  | Adjusts Pods          |
| Relation to Pods       | Indirect control                     | Direct control        |

💡 8. An analogy

Think of it this way:

• Deployment = the "project manager": manages the different ReplicaSet versions and paces the rolling update.
• ReplicaSet = the "team lead": makes sure the headcount of their reports (Pods) is right.
• Pod = the "employee": the unit that does the actual work.
                            Mar 7, 2024

                            Endpoint VS EndpointSlice

                            EndpointEndpointSlice 都是 Kubernetes 中用于管理服务后端端点的资源,但 EndpointSlice 是更现代、更高效的解决方案。以下是它们的详细区别:

                            一、基本概念对比

                            Endpoint(传统方式)

                            apiVersion: v1
                            kind: Endpoints
                            metadata:
                              name: my-service
                            subsets:
                              - addresses:
                                - ip: 10.244.1.5
                                  targetRef:
                                    kind: Pod
                                    name: pod-1
                                - ip: 10.244.1.6
                                  targetRef:
                                    kind: Pod
                                    name: pod-2
                                ports:
                                - port: 8080
                                  protocol: TCP

EndpointSlice (the modern way)

                            apiVersion: discovery.k8s.io/v1
                            kind: EndpointSlice
                            metadata:
                              name: my-service-abc123
                              labels:
                                kubernetes.io/service-name: my-service
                            addressType: IPv4
                            ports:
                              - name: http
                                protocol: TCP
                                port: 8080
                            endpoints:
                              - addresses:
                                - "10.244.1.5"
                                conditions:
                                  ready: true
                                targetRef:
                                  kind: Pod
                                  name: pod-1
                                zone: us-west-2a
                              - addresses:
                                - "10.244.1.6"
                                conditions:
                                  ready: true
                                targetRef:
                                  kind: Pod
                                  name: pod-2
                                zone: us-west-2b

II. Core architectural differences

1. Data-model design

| Property           | Endpoints                   | EndpointSlice                                   |
|--------------------|-----------------------------|-------------------------------------------------|
| Storage structure  | One large object            | Multiple sharded objects                        |
| Scale limit        | All endpoints in one object | Auto-sharded (default max 100 endpoints/slice)  |
| Update granularity | Full-object updates         | Incremental updates                             |

2. Performance impact compared

# The Endpoints problem: one large object
# With 1000 Pods:
kubectl get endpoints my-service -o yaml
# returns one enormous YAML containing 1000 addresses

# The EndpointSlice answer: automatic sharding
# With 1000 Pods:
kubectl get endpointslices -l kubernetes.io/service-name=my-service
# returns 10 EndpointSlices of 100 endpoints each
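The sharding itself is simple to picture; here is an illustrative Python sketch of splitting endpoints into slices of at most 100, the default cap (the real controller also rebalances and updates slices incrementally):

```python
# Shard N endpoint addresses into slices of at most `max_per_slice`,
# mirroring the EndpointSlice default of 100 endpoints per slice.
def shard_endpoints(ips, max_per_slice=100):
    return [ips[i:i + max_per_slice] for i in range(0, len(ips), max_per_slice)]

ips = [f"10.244.{i // 250}.{i % 250}" for i in range(1000)]
slices = shard_endpoints(ips)
print(len(slices), len(slices[0]))  # 10 100
```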

III. Detailed feature differences

1. Address-type support

Endpoints

• Supports IP addresses only
• Limited metadata

EndpointSlice

addressType: IPv4  # supports IPv4, IPv6, FQDN
endpoints:
  - addresses:
    - "10.244.1.5"
    conditions:
      ready: true
      serving: true
      terminating: false
    hostname: pod-1.subdomain  # hostnames supported
    nodeName: worker-1
    zone: us-west-2a
    hints:
      forZones:
      - name: us-west-2a

2. Topology awareness and zone information

Topology features unique to EndpointSlice

endpoints:
  - addresses:
    - "10.244.1.5"
    conditions:
      ready: true
    # topology information
    nodeName: node-1
    zone: us-west-2a
    # topology hints, used to optimize routing
    hints:
      forZones:
      - name: us-west-2a
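A hypothetical sketch of how a zone-aware consumer could use such hints (the field names here are the example's own, not the Kubernetes API): prefer endpoints hinted for the client's zone and fall back to all of them.

```python
# Prefer endpoints hinted for the client's zone; fall back to every endpoint
# when nothing is hinted for that zone (so traffic is never dropped).
def pick_endpoints(endpoints, client_zone):
    same_zone = [
        e["address"] for e in endpoints
        if client_zone in e.get("hint_zones", [])
    ]
    return same_zone or [e["address"] for e in endpoints]

eps = [
    {"address": "10.244.1.5", "hint_zones": ["us-west-2a"]},
    {"address": "10.244.2.7", "hint_zones": ["us-west-2b"]},
]
print(pick_endpoints(eps, "us-west-2a"))  # ['10.244.1.5']
print(pick_endpoints(eps, "eu-west-1a"))  # fallback: both addresses
```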

3. Port definitions

Endpoints

                            subsets:
                              - ports:
                                - name: http
                                  port: 8080
                                  protocol: TCP
                                - name: metrics
                                  port: 9090
                                  protocol: TCP

                            EndpointSlice

ports:
  - name: http
    protocol: TCP
    port: 8080
    appProtocol: http  # supports application-layer protocol identifiers
  - name: metrics
    protocol: TCP
    port: 9090
    appProtocol: https

IV. Practical scenarios

1. Large services (500+ Pods)

Problems with Endpoints

# Update latency: serializing/deserializing one large object
# Network overhead: every update ships the entire endpoint list
# Memory pressure: clients must cache the entire endpoint list

What EndpointSlice buys you

# Incremental updates: only changed slices are updated
# Parallelism: multiple slices can be processed in parallel
# Memory-friendly: clients only track the slices they care about

2. Multi-zone deployments

EndpointSlice's topology awareness

                            apiVersion: discovery.k8s.io/v1
                            kind: EndpointSlice
                            metadata:
                              name: multi-zone-service-1
                              labels:
                                kubernetes.io/service-name: multi-zone-service
                            addressType: IPv4
                            ports:
                              - name: http
                                protocol: TCP
                                port: 8080
                            endpoints:
                              - addresses:
                                - "10.244.1.10"
                                conditions:
                                  ready: true
                                zone: zone-a
                                nodeName: node-zone-a-1
                            ---
                            apiVersion: discovery.k8s.io/v1
                            kind: EndpointSlice  
                            metadata:
                              name: multi-zone-service-2
                              labels:
                                kubernetes.io/service-name: multi-zone-service
                            addressType: IPv4
                            ports:
                              - name: http
                                protocol: TCP
                                port: 8080
                            endpoints:
                              - addresses:
                                - "10.244.2.10"
                                conditions:
                                  ready: true
                                zone: zone-b
                                nodeName: node-zone-b-1

3. Canary releases and traffic management

EndpointSlice allows finer-grained control

# An EndpointSlice for the canary version
apiVersion: discovery.k8s.io/v1
kind: EndpointSlice
metadata:
  name: canary-service-version2
  labels:
    kubernetes.io/service-name: my-service
    version: "v2"  # custom label used for selection
addressType: IPv4
ports:
  - name: http
    protocol: TCP
    port: 8080
endpoints:
  - addresses:
    - "10.244.3.10"
    conditions:
      ready: true

V. Operations and management differences

1. Monitoring

Monitoring Endpoints

# Inspect the single Endpoints object
kubectl get endpoints my-service
kubectl describe endpoints my-service

# Count the endpoints
kubectl get endpoints my-service -o jsonpath='{.subsets[0].addresses[*].ip}' | wc -w

Monitoring EndpointSlices

# List all related slices
kubectl get endpointslices -l kubernetes.io/service-name=my-service

# Inspect one slice in detail
kubectl describe endpointslices my-service-abc123

# Count the total endpoints across all slices
kubectl get endpointslices -l kubernetes.io/service-name=my-service -o json | jq '[.items[].endpoints[].addresses[]] | length'

2. Troubleshooting

Troubleshooting Endpoints

# Check endpoint readiness
kubectl get endpoints my-service -o yaml | grep -A 5 -B 5 "not-ready"

# Check the controller logs
kubectl logs -n kube-system kube-controller-manager-xxx | grep endpoints

Troubleshooting EndpointSlices

# Check slice states
kubectl get endpointslices --all-namespaces

# Check endpoint readiness per slice
kubectl get endpointslices -l kubernetes.io/service-name=my-service -o jsonpath='{range .items[*]}{.endpoints[*].conditions.ready}{end}'

# The EndpointSlice controller runs inside kube-controller-manager
kubectl logs -n kube-system kube-controller-manager-xxx | grep -i endpointslice

VI. Migration and compatibility

1. Automatic migration

Kubernetes 1.21+ maintains both resources by default:

# Enable the EndpointSlice feature gates (already on by default in recent versions)
kube-apiserver --feature-gates=EndpointSlice=true
kube-controller-manager --feature-gates=EndpointSlice=true
kube-proxy --feature-gates=EndpointSlice=true

2. Checking cluster state

# Check that the EndpointSlice API group is served
kubectl get apiservices | grep discovery.k8s.io

# Check the feature gates
kube-apiserver -h | grep EndpointSlice

# Verify the controller's host is running (the EndpointSlice controller lives inside kube-controller-manager)
kubectl get pods -n kube-system -l component=kube-controller-manager

VII. Performance benchmarks compared

| Scenario             | Endpoints                   | EndpointSlice              | Improvement  |
|----------------------|-----------------------------|----------------------------|--------------|
| Updating 1000 Pods   | 2-3 s                       | 200-300 ms                 | ~10x         |
| Network bandwidth    | High (full transfers)       | Low (incremental)          | 60-80% less  |
| Memory usage         | High (large cached object)  | Low (sharded caches)       | 50-70% less  |
| CPU usage            | High (serialization cost)   | Low (parallel processing)  | 40-60% less  |

VIII. Best practices

1. New-cluster configuration

# kube-apiserver configuration
apiVersion: v1
kind: Pod
metadata:
  name: kube-apiserver
  namespace: kube-system
spec:
  containers:
  - command:
    - kube-apiserver
    - --feature-gates=EndpointSlice=true
    - --endpointslice-updates-batch-period=1s  # batching period for slice updates

2. Adapting applications

// An EndpointSlice-aware client
import (
    "k8s.io/client-go/kubernetes"
    "k8s.io/client-go/tools/cache"
    listers "k8s.io/client-go/listers/discovery/v1"
)

// Watch for EndpointSlice changes
endpointSliceInformer.Informer().AddEventHandler(cache.ResourceEventHandlerFuncs{
    AddFunc: func(obj interface{}) {
        endpointSlice := obj.(*discoveryv1.EndpointSlice)
        if endpointSlice.Labels["kubernetes.io/service-name"] == "my-service" {
            updateLoadBalancerConfig(endpointSlice)
        }
    },
})

Summary

| Dimension       | Endpoints                     | EndpointSlice                      |
|-----------------|-------------------------------|------------------------------------|
| Era             | Early Kubernetes              | Kubernetes 1.16+                   |
| Scalability     | Poor (single-object limit)    | Excellent (auto-sharding)          |
| Performance     | Fair (full updates)           | Excellent (incremental updates)    |
| Features        | Basics only                   | Topology awareness, multi-protocol |
| Recommended for | Legacy clusters/compatibility | New clusters/large deployments     |
| Trajectory      | Being phased out              | The standard solution              |

Simple advice

• New clusters: use EndpointSlice directly
• Existing clusters: migrate to EndpointSlice gradually
• Large services: EndpointSlice is a must
• Small test setups: either works, but prefer EndpointSlice for better future compatibility

EndpointSlice represents the modernization of Kubernetes service discovery, delivering significant performance gains especially in large, highly dynamic environments.

                            Mar 7, 2024

How To Tune etcd

A Kubernetes cluster's stability and performance depend heavily on its data store, etcd. Tuning etcd is a key step in keeping a production K8s cluster fast and stable.

Below is a walkthrough of tuning etcd for K8s across several dimensions: core principles, performance parameters, OS tuning, Kubernetes-side configuration, and monitoring and maintenance.

I. Core principles and prerequisites

1. Hardware comes first: before touching any software parameter, make sure the hardware is adequate and fast.

  • CPU: enough compute, especially for compaction and serialization under heavy load.
  • Memory: etcd's memory footprint grows with the total number and size of key-value pairs. Sufficient memory is key to performance: at least 8GB, with 16GB or more recommended for production.
  • Disk: this is the most important factor. Use high-performance SSDs (NVMe is best). Every etcd write must be persisted to disk, so disk write latency directly determines write performance. Avoid network storage (e.g. NFS).
  • Network: low latency and high bandwidth are critical for replication between etcd members. If etcd runs as a cluster, all members should sit in the same data center or in low-latency availability zones.
2. Backups, backups, backups: take a full etcd backup before any tuning or configuration change. A slip here can corrupt data or take the cluster down.

II. Tuning via etcd command-line flags

etcd is tuned mainly through its startup flags. With a kubeadm deployment these live in the static Pod manifest /etc/kubernetes/manifests/etcd.yaml.

1. Storage quota and compaction

To avoid exhausting the disk, etcd enforces a storage quota. Once exceeded, it enters maintenance mode (reads only, no writes) and raises an alarm.

• --quota-backend-bytes: the backend database size limit. The default is 2GB. For production, 8GB to 16GB is recommended (e.g. 8589934592 for 8GB). Setting it too high slows backup and restore.
• --auto-compaction-mode and --auto-compaction-retention: etcd accumulates historical revisions and needs periodic compaction to reclaim space.
  • --auto-compaction-mode: usually set to periodic (time-based).
  • --auto-compaction-retention: how much history to retain, e.g. "1h" for one hour, "10m" for ten minutes. For clusters with heavy churn (e.g. running many CronJobs), prefer a short window such as "10m" or "30m".
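A quick arithmetic check of the 8GB quota value quoted above:

```python
# --quota-backend-bytes takes bytes: 8 GiB = 8 * 1024^3 bytes.
print(8 * 1024**3)  # 8589934592
```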

Example snippet (in etcd.yaml):

spec:
  containers:
  - command:
    - etcd
    ...
    - --quota-backend-bytes=8589934592    # 8GB
    - --auto-compaction-mode=periodic
    - --auto-compaction-retention=10m     # compact historical revisions every 10 minutes
    ...

2. Heartbeat and election timeout

These flags govern leader election and inter-member heartbeats, and they are sensitive to network latency.

• --heartbeat-interval: how often the leader heartbeats its followers. 100-300 ms is a good range: lower (e.g. 100) on a good network, higher (e.g. 300) on a flaky one.
• --election-timeout: how long a follower waits without a heartbeat before starting a new election. It must be 5-10 times the heartbeat interval; 1000-3000 ms is a good range.

Rule of thumb: 5 * heartbeat-interval <= election-timeout <= 10 * heartbeat-interval.

Example:

                                - --heartbeat-interval=200
                                - --election-timeout=2000
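A tiny checker for the heartbeat/election rule of thumb stated above (values in milliseconds):

```python
# Validate that the election timeout sits between 5x and 10x the heartbeat.
def timeouts_ok(heartbeat_ms, election_ms):
    return 5 * heartbeat_ms <= election_ms <= 10 * heartbeat_ms

print(timeouts_ok(200, 2000))  # True: the example configuration above
print(timeouts_ok(100, 3000))  # False: election timeout is 30x the heartbeat
```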

3. Snapshots

etcd persists its state via snapshots.

• --snapshot-count: how many committed transactions to accumulate before taking a snapshot. The default is 100,000. With ample memory and very fast disks you can lower it (e.g. 50000) for faster crash recovery, at a slight cost in disk I/O. The default is usually fine.

III. OS and runtime tuning

1. Disk I/O scheduler

For SSDs, setting the I/O scheduler to none or noop usually performs better.

                            # 查看当前调度器
                            cat /sys/block/[你的磁盘,如 sda]/queue/scheduler
                            
                            # 临时修改(较新的多队列内核中该选项叫 none,旧内核为 noop)
                            echo 'none' > /sys/block/sda/queue/scheduler
                            
                            # 永久修改,在 /etc/default/grub 中添加或修改
                            GRUB_CMDLINE_LINUX_DEFAULT="... elevator=noop"
                            
                            # 然后更新 grub 并重启
                            sudo update-grub

                            2. 文件系统

                            使用 XFSext4 文件系统。它们对 etcd 的工作负载有很好的支持。确保使用 ssd 挂载选项。

                            /etc/fstab 中为 etcd 数据目录所在分区添加 ssdnoatime 选项:

                            UUID=... /var/lib/etcd ext4 defaults,ssd,noatime 0 0

                            3. 提高文件描述符和进程数限制

                            etcd 可能会处理大量并发连接。

                            # 在 /etc/security/limits.conf 中添加
                            * soft nofile 65536
                            * hard nofile 65536
                            * soft nproc 65536
                            * hard nproc 65536

                            4. 网络参数调优

                            调整内核网络参数,特别是在高负载环境下。

                            /etc/sysctl.conf 中添加:

                            net.core.somaxconn = 1024
                            net.ipv4.tcp_keepalive_time = 600
                            net.ipv4.tcp_keepalive_intvl = 60
                            net.ipv4.tcp_keepalive_probes = 10

                            执行 sysctl -p 使其生效。

                            四、Kubernetes 相关调优

                            1. 资源请求和限制

                            etcd.yaml 中为 etcd 容器设置合适的资源限制,防止其因资源竞争而饿死。

                                resources:
                                  requests:
                                    memory: "1Gi"
                                    cpu: "500m"
                                  limits:
                                    memory: "8Gi"  # 根据你的 --quota-backend-bytes 设置,确保内存足够
                                    cpu: "2"
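内存 limit 与 --quota-backend-bytes 的关系可以粗略估算(示意计算:headroom 余量系数是本文假设的经验值,并非官方建议,需按实际负载调整):

```python
# 示意:内存 limit 至少覆盖 --quota-backend-bytes 并留出余量
def suggest_memory_limit(quota_bytes: int, headroom: float = 0.25) -> int:
    """headroom 为假设的余量系数,用于覆盖索引、watch 缓存等额外开销。"""
    return int(quota_bytes * (1 + headroom))

quota = 8 * 1024**3                           # 对应 --quota-backend-bytes=8589934592
print(suggest_memory_limit(quota) / 1024**3)  # 10.0 (GiB)
```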

                            2. API Server 的 --etcd-compaction-interval

                            在 kube-apiserver 的启动参数中,这个参数控制它请求 etcd 进行压缩的周期。建议与 etcd 的 --auto-compaction-retention 保持一致或略大。

                            五、监控与维护

                            1. 监控关键指标

                            使用 Prometheus 等工具监控 etcd,重点关注以下指标:

                            • etcd_disk_wal_fsync_duration_seconds:WAL 日志同步到磁盘的延迟。这是最重要的指标,P99 值应低于 10ms。
                            • etcd_disk_backend_commit_duration_seconds:后端数据库提交的延迟,P99 值应低于 25ms。
                            • etcd_server_leader_changes_seen_total:领导者变更次数。频繁变更表明集群不稳定。
                            • etcd_server_has_leader:当前节点是否认为有领导者(1 为是,0 为否)。
                            • etcd_mvcc_db_total_size_in_bytes:当前数据库大小,用于判断是否接近存储配额。

                            2. 定期进行碎片整理

                            即使开启了自动压缩,etcd 的数据库文件内部仍会产生碎片。当 etcd_mvcc_db_total_size_in_bytes 接近 --quota-backend-bytes 时,即使实际数据量没那么多,也需要在线进行碎片整理。

                            # 在任一 etcd 节点上执行
                            ETCDCTL_API=3 etcdctl --endpoints=https://127.0.0.1:2379 \
                              --cacert=/path/to/ca.crt \
                              --cert=/path/to/etcd-client.crt \
                              --key=/path/to/etcd-client.key \
                              defrag

                            注意:执行 defrag 会阻塞所有请求,应在业务低峰期进行,并逐个对集群成员执行。
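是否值得整理,可以结合 etcd_mvcc_db_total_size_in_bytes 与 etcd_mvcc_db_total_size_in_use_in_bytes 两个指标来判断(示意代码,两个阈值为本文假设的经验值):

```python
# 示意:根据数据库总大小、实际使用大小和配额判断是否值得 defrag
def should_defrag(total_bytes: int, in_use_bytes: int, quota_bytes: int,
                  frag_ratio: float = 0.5, quota_ratio: float = 0.8) -> bool:
    """文件体积接近配额、且其中大量空间是碎片(未被实际使用)时返回 True。"""
    fragmented = 1 - in_use_bytes / total_bytes   # 碎片空间占比
    near_quota = total_bytes / quota_bytes        # 接近配额的程度
    return near_quota >= quota_ratio and fragmented >= frag_ratio

# 文件 7GB、实际使用 2GB、配额 8GB:接近配额且碎片多,应整理
print(should_defrag(7 * 1024**3, 2 * 1024**3, 8 * 1024**3))  # True
```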

                            调优总结与检查清单

                            1. 硬件过关:确认使用 SSD,内存充足。
                            2. 设置存储配额和自动压缩--quota-backend-bytes=8G, --auto-compaction-retention=10m
                            3. 调整心跳与选举超时--heartbeat-interval=200, --election-timeout=2000
                            4. 操作系统优化:I/O 调度器、文件系统挂载选项、文件描述符限制。
                            5. 配置合理的资源限制:防止 etcd 容器因资源不足被 Kill。
                            6. 开启并关注监控:特别是磁盘同步延迟和领导者变更。
                            7. 定期维护:根据监控指标,在需要时进行碎片整理。

                            对于大多数场景,调整存储配额与压缩、心跳与选举超时,以及确保高性能磁盘,就能解决绝大部分性能问题。调优是一个持续的过程,需要结合监控数据不断调整。

                            Mar 7, 2024

                            Flannel VS Calico

                            Calico 和 Flannel 是 Kubernetes 中最著名和最常见的两种网络插件(CNI),但它们的设计哲学、实现方式和能力有显著区别。

                            简单来说:

                            • Flannel 追求的是简单和易用,提供足够的基础网络功能。
                            • Calico 追求的是性能和功能,提供强大的网络策略和高性能网络。

                            下面我们从多个维度进行详细对比。


                            核心对比一览表

                            | 特性 | Flannel | Calico |
                            |---|---|---|
                            | 核心设计哲学 | 简单、最小化 | 高性能、功能丰富 |
                            | 网络模型 | Overlay 网络 | 纯三层路由(可选 Overlay) |
                            | 数据平面 | VXLAN(推荐)、Host-gw、UDP | BGP(推荐)、VXLAN、Windows |
                            | 性能 | 较好(VXLAN 有封装开销) | 极高(BGP 模式下无封装开销) |
                            | 网络策略 | 不支持(需安装 Cilium 等) | 原生支持(强大的网络策略) |
                            | 安全性 | 基础 | 高级(基于标签的微隔离) |
                            | 配置与维护 | 非常简单,几乎无需配置 | 相对复杂,功能多配置项也多 |
                            | 适用场景 | 学习、测试、中小型集群,需求简单 | 生产环境、大型集群、对性能和安全要求高 |

                            深入剖析

                            1. 网络模型与工作原理

                            这是最根本的区别。

                            • Flannel (Overlay Network)

                              • 工作原理:它在底层物理网络之上再构建一个虚拟的“覆盖网络”。当数据包从一个节点的Pod发送到另一个节点的Pod时,Flannel会将它封装在一个新的网络包中(如VXLAN)。
                              • 类比:就像在一封普通信件(Pod的原始数据包)外面套了一个标准快递袋(VXLAN封装),快递系统(底层网络)只关心快递袋上的地址(节点IP),不关心里面的内容。到达目标节点后,再拆开快递袋,取出里面的信。
                              • 优势:对底层网络要求低,只要节点之间IP能通即可,兼容性好。
                              • 劣势:封装和解封装有额外的 CPU 开销,并且封装头会增加数据包的大小(overhead),导致性能略有下降。
                            • Calico (Pure Layer 3)

                              • 工作原理(BGP模式):它不使用封装,而是使用BGP路由协议。每个K8s节点都像一个路由器,它通过BGP协议向集群中的其他节点宣告:“发往这些Pod IP的流量,请送到我这里来”。
                              • 类比:就像整个数据中心是一个大的邮政系统,每个邮局(节点)都知道去往任何地址(Pod IP)的最短路径,信件(数据包)可以直接投递,无需额外包装。
                              • 优势性能高,无封装开销,延迟低,吞吐量高。
                              • 劣势:要求底层网络必须支持BGP或者支持主机路由(某些云平台或网络设备可能需要特定配置)。

                            注意:Calico也支持VXLAN模式(通常用于网络策略要求BGP但底层网络不支持的场景),但其最佳性能是在BGP模式下实现的。

                            2. 网络策略

                            这是两者功能性的一个巨大分水岭。

                            • Flannel本身不提供任何网络策略能力。它只负责打通网络,让所有Pod默认可以相互通信。如果你需要实现Pod之间的访问控制(微隔离),你必须额外安装一个网络策略控制器,如 CiliumCalico本身(可以只使用其策略部分,与Flannel叠加使用)。

                            • Calico原生支持强大的Kubernetes NetworkPolicy。你可以定义基于Pod标签、命名空间、端口、协议甚至DNS名称的精细规则,来控制Pod的入站和出站流量。这对于实现“零信任”安全模型至关重要。

                            3. 性能

                            • Calico (BGP模式):由于其纯三层的转发机制,无需封装,数据包是原生IP包,其延迟更低,吞吐量更高,CPU消耗也更少。
                            • Flannel (VXLAN模式):由于存在 VXLAN 封装头(约 50 字节的 overhead),有效 MTU 会变小,封装/解封装操作也需要 CPU 参与,性能相比 Calico BGP 模式要低一些。但其 Host-gw 后端模式性能很好,前提是节点在同一个二层网络。
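VXLAN 的封装开销可以直接换算成有效 MTU 的损失(示意计算;50 字节是 VXLAN 常见的封装头大小):

```python
# 示意:VXLAN 封装对有效 MTU 的影响
VXLAN_OVERHEAD = 50  # 外层 IP + UDP + VXLAN 头,约 50 字节

def effective_mtu(link_mtu: int, overhead: int = VXLAN_OVERHEAD) -> int:
    return link_mtu - overhead

print(effective_mtu(1500))  # 1450,这也是 Flannel VXLAN 默认把 Pod 接口 MTU 设为 1450 的原因
```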

                            4. 生态系统与高级功能

                            • Calico:功能非常丰富,远不止基础网络。
                              • 网络策略:如上所述,非常强大。
                              • IPAM:灵活的IP地址管理。
                              • 服务网格集成:与Istio有深度集成,可以实施全局的服务到服务策略。
                              • Windows支持:对Windows节点有良好的支持。
                              • 网络诊断工具:提供了 calicoctl 等强大的运维工具。
                            • Flannel:功能相对单一,就是做好网络连通性。它“小而美”,但缺乏高级功能。

                            如何选择?

                            选择 Flannel 的情况:

                            • 新手用户:想要快速搭建一个K8s集群,不想纠结于复杂的网络配置。
                            • 测试或开发环境:需求简单,只需要Pod能通。
                            • 中小型集群:对性能和高级网络策略没有硬性要求。
                            • 底层网络受限:无法配置BGP或主机路由的环境(例如某些公有云基础网络)。

                            选择 Calico 的情况:

                            • 生产环境:对稳定性和性能有高要求。
                            • 大型集群:需要高效的路由和可扩展性。
                            • 安全要求高:需要实现Pod之间的网络隔离(微隔离)。
                            • 对网络性能极度敏感:例如AI/ML训练、高频交易等场景。
                            • 底层网络可控:例如在自建数据中心或云上支持BGP的环境。

                            总结

                            | | Flannel | Calico |
                            |---|---|---|
                            | 核心价值 | 简单可靠 | 功能强大 |
                            | 好比买车 | 丰田卡罗拉:皮实、省心、够用 | 宝马/奥迪:性能强劲、功能齐全、操控精准 |
                            | 一句话总结 | "让我快速把网络打通" | "我要一个高性能、高安全性的生产级网络" |

                            在现代Kubernetes部署中,尤其是生产环境,Calico因其卓越的性能和原生的安全能力,已经成为更主流和推荐的选择。而Flannel则在那些“只要能通就行”的简单场景中,依然保持着它的价值。

                            Mar 7, 2024

                            Headless Service VS ClusterIP

                            Headless Service vs ClusterIP 详解

                            这是 Kubernetes 中两种常见的 Service 类型,它们在服务发现和负载均衡方面有本质区别。


                            🎯 核心区别总结

                            | 维度 | ClusterIP | Headless Service |
                            |---|---|---|
                            | ClusterIP 值 | 有固定的虚拟 IP | None(无 ClusterIP) |
                            | DNS 解析 | 返回 Service IP | 直接返回 Pod IP 列表 |
                            | 负载均衡 | ✅ kube-proxy 自动负载均衡 | ❌ 客户端自行选择 Pod |
                            | 适用场景 | 无状态服务 | 有状态服务、服务发现 |
                            | 典型用例 | Web 应用、API 服务 | 数据库集群、Kafka、Zookeeper |

                            📋 ClusterIP Service (默认类型)

                            定义

                            ClusterIP 是 Kubernetes 默认的 Service 类型,会分配一个虚拟 IP(Cluster IP),作为访问后端 Pod 的统一入口。

                            YAML 示例

                            apiVersion: v1
                            kind: Service
                            metadata:
                              name: my-web-service
                            spec:
                              type: ClusterIP  # 默认类型,可以省略
                              selector:
                                app: web
                              ports:
                              - protocol: TCP
                                port: 80        # Service 端口
                                targetPort: 8080  # Pod 端口

                            工作原理

                            ┌─────────────────────────────────────────┐
                            │          ClusterIP Service              │
                            │     (虚拟 IP: 10.96.100.50)             │
                            └────────────┬────────────────────────────┘
                                         │ kube-proxy 负载均衡
                                         │
                                 ┌───────┴───────┬──────────┐
                                 ▼               ▼          ▼
                              Pod-1          Pod-2      Pod-3
                              10.244.1.5     10.244.2.8  10.244.3.12
                              (app=web)      (app=web)   (app=web)

                            DNS 解析行为

                            # 在集群内部查询 DNS
                            nslookup my-web-service.default.svc.cluster.local
                            
                            # 输出:
                            # Name:    my-web-service.default.svc.cluster.local
                            # Address: 10.96.100.50  ← 返回 Service 的虚拟 IP
                            
                            # 客户端访问这个 IP
                            curl http://my-web-service:80
                            
                            # 请求会被 kube-proxy 自动转发到后端 Pod
                            # 默认使用 iptables 或 IPVS 做负载均衡

                            特点

                            • 统一入口:客户端只需知道 Service IP,不关心后端 Pod
                            • 自动负载均衡:kube-proxy 自动在多个 Pod 间分发流量
                            • 服务发现简单:通过 DNS 获取稳定的 Service IP
                            • 屏蔽 Pod 变化:Pod 重启或扩缩容,Service IP 不变
                            • 会话保持:可配置 sessionAffinity: ClientIP

                            负载均衡方式

                            apiVersion: v1
                            kind: Service
                            metadata:
                              name: my-service
                            spec:
                              type: ClusterIP
                              sessionAffinity: ClientIP  # 可选:会话保持(同一客户端固定到同一 Pod)
                              sessionAffinityConfig:
                                clientIP:
                                  timeoutSeconds: 10800   # 会话超时时间
                              selector:
                                app: web
                              ports:
                              - port: 80
                                targetPort: 8080
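sessionAffinity: ClientIP 的语义是"同一客户端 IP 始终落到同一后端"。kube-proxy 实际通过连接跟踪加超时来实现,下面只是用哈希表达这一语义的简化模型(示意代码,并非 kube-proxy 的真实实现):

```python
# 示意:按客户端 IP 哈希选择固定后端(仅演示会话保持的语义)
import hashlib

def pick_backend(client_ip: str, backends: list[str]) -> str:
    h = int(hashlib.md5(client_ip.encode()).hexdigest(), 16)
    return backends[h % len(backends)]

pods = ["10.244.1.5", "10.244.2.8", "10.244.3.12"]
a = pick_backend("192.168.1.100", pods)
b = pick_backend("192.168.1.100", pods)
print(a == b)  # True:同一客户端总是命中同一个 Pod
```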

                            🔍 Headless Service (无头服务)

                            定义

                            Headless Service 是不分配 ClusterIP 的特殊 Service,通过设置 clusterIP: None 创建。

                            YAML 示例

                            apiVersion: v1
                            kind: Service
                            metadata:
                              name: my-headless-service
                            spec:
                              clusterIP: None  # 🔑 关键:设置为 None
                              selector:
                                app: database
                              ports:
                              - protocol: TCP
                                port: 3306
                                targetPort: 3306

                            工作原理

                            ┌─────────────────────────────────────────┐
                            │       Headless Service (无 ClusterIP)   │
                            │              DNS 直接返回               │
                            └────────────┬────────────────────────────┘
                                         │ 没有负载均衡
                                         │ DNS 返回所有 Pod IP
                                         │
                                 ┌───────┴───────┬──────────┐
                                 ▼               ▼          ▼
                              Pod-1          Pod-2      Pod-3
                              10.244.1.5     10.244.2.8  10.244.3.12
                              (app=database) (app=database) (app=database)

                            DNS 解析行为

                            # 在集群内部查询 DNS
                            nslookup my-headless-service.default.svc.cluster.local
                            
                            # 输出:
                            # Name:    my-headless-service.default.svc.cluster.local
                            # Address: 10.244.1.5   ← Pod-1 IP
                            # Address: 10.244.2.8   ← Pod-2 IP
                            # Address: 10.244.3.12  ← Pod-3 IP
                            
                            # 客户端获得所有 Pod IP,自己选择连接哪个

                            特点

                            • 服务发现:客户端可以获取所有后端 Pod 的 IP
                            • 自主选择:客户端自己决定连接哪个 Pod(负载均衡逻辑由客户端实现)
                            • 稳定 DNS:每个 Pod 有独立的 DNS 记录
                            • 适合有状态服务:数据库主从、集群成员发现
                            • 无自动负载均衡:需要客户端或应用层实现
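由于 DNS 直接返回全部 Pod IP,客户端需要自己决定连哪个。下面是客户端轮询的最小草图(示意代码,Pod IP 沿用上文示例):

```python
# 示意:Headless Service 下客户端对 Pod IP 列表做轮询(客户端负载均衡)
import itertools

pod_ips = ["10.244.1.5", "10.244.2.8", "10.244.3.12"]  # DNS 返回的全部 Pod IP
rr = itertools.cycle(pod_ips)

targets = [next(rr) for _ in range(4)]
print(targets)  # 第 4 次请求轮回到第一个 Pod
```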

                            与 StatefulSet 结合(最常见用法)

                            # StatefulSet + Headless Service
                            apiVersion: v1
                            kind: Service
                            metadata:
                              name: mysql-headless
                            spec:
                              clusterIP: None
                              selector:
                                app: mysql
                              ports:
                              - port: 3306
                                name: mysql
                            ---
                            apiVersion: apps/v1
                            kind: StatefulSet
                            metadata:
                              name: mysql
                            spec:
                              serviceName: mysql-headless  # 🔑 关联 Headless Service
                              replicas: 3
                              selector:
                                matchLabels:
                                  app: mysql
                              template:
                                metadata:
                                  labels:
                                    app: mysql
                                spec:
                                  containers:
                                  - name: mysql
                                    image: mysql:8.0
                                    ports:
                                    - containerPort: 3306

                            每个 Pod 的独立 DNS 记录

                            # StatefulSet 的 Pod 命名规则:
                            # <statefulset-name>-<ordinal>.<service-name>.<namespace>.svc.cluster.local
                            
                            # 示例:
                            mysql-0.mysql-headless.default.svc.cluster.local → 10.244.1.5
                            mysql-1.mysql-headless.default.svc.cluster.local → 10.244.2.8
                            mysql-2.mysql-headless.default.svc.cluster.local → 10.244.3.12
                            
                            # 可以直接访问特定 Pod
                            mysql -h mysql-0.mysql-headless.default.svc.cluster.local -u root -p
                            
                            # 查询所有 Pod
                            nslookup mysql-headless.default.svc.cluster.local
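上面的命名规则可以写成一个小函数,方便在应用配置中批量生成各 Pod 的 FQDN(示意代码,pod_fqdns 为本文虚构的函数名):

```python
# 示意:按 <statefulset-name>-<ordinal>.<service-name>.<namespace>.svc.cluster.local 生成 FQDN
def pod_fqdns(sts: str, svc: str, ns: str, replicas: int) -> list[str]:
    return [f"{sts}-{i}.{svc}.{ns}.svc.cluster.local" for i in range(replicas)]

for name in pod_fqdns("mysql", "mysql-headless", "default", 3):
    print(name)
# mysql-0.mysql-headless.default.svc.cluster.local
# mysql-1.mysql-headless.default.svc.cluster.local
# mysql-2.mysql-headless.default.svc.cluster.local
```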

                            🔄 实际对比演示

                            场景 1:Web 应用(使用 ClusterIP)

                            # ClusterIP Service
                            apiVersion: v1
                            kind: Service
                            metadata:
                              name: web-service
                            spec:
                              type: ClusterIP
                              selector:
                                app: nginx
                              ports:
                              - port: 80
                                targetPort: 80
                            ---
                            # Deployment
                            apiVersion: apps/v1
                            kind: Deployment
                            metadata:
                              name: nginx
                            spec:
                              replicas: 3
                              selector:
                                matchLabels:
                                  app: nginx
                              template:
                                metadata:
                                  labels:
                                    app: nginx
                                spec:
                                  containers:
                                  - name: nginx
                                    image: nginx:latest
                            # 测试访问
                            kubectl run test --rm -it --image=busybox -- /bin/sh
                            
                            # 在 Pod 内执行
                            nslookup web-service
                            # 输出:只有一个 Service IP
                            
                            wget -q -O- http://web-service
                            # 请求会被自动负载均衡到 3 个 nginx Pod

                            场景 2:MySQL 主从(使用 Headless Service)

                            # Headless Service
                            apiVersion: v1
                            kind: Service
                            metadata:
                              name: mysql
                            spec:
                              clusterIP: None
                              selector:
                                app: mysql
                              ports:
                              - port: 3306
                            ---
                            # StatefulSet
                            apiVersion: apps/v1
                            kind: StatefulSet
                            metadata:
                              name: mysql
                            spec:
                              serviceName: mysql
                              replicas: 3
                              selector:
                                matchLabels:
                                  app: mysql
                              template:
                                metadata:
                                  labels:
                                    app: mysql
                                spec:
                                  containers:
                                  - name: mysql
                                    image: mysql:8.0
                                    env:
                                    - name: MYSQL_ROOT_PASSWORD
                                      value: "password"
                            # 测试服务发现
                            kubectl run test --rm -it --image=busybox -- /bin/sh
                            
                            # 在 Pod 内执行
                            nslookup mysql
                            # 输出:返回 3 个 Pod IP
                            
                            # 可以连接到特定的 MySQL 实例(如主节点)
                            mysql -h mysql-0.mysql.default.svc.cluster.local -u root -p
                            
                            # 也可以连接到从节点
                            mysql -h mysql-1.mysql.default.svc.cluster.local -u root -p
                            mysql -h mysql-2.mysql.default.svc.cluster.local -u root -p

                            📊 详细对比

                            1. DNS 解析差异

                            # ClusterIP Service
                            $ nslookup web-service
                            Server:    10.96.0.10
                            Address:   10.96.0.10:53
                            
                            Name:      web-service.default.svc.cluster.local
                            Address:   10.96.100.50  ← Service 虚拟 IP
                            
                            # Headless Service
                            $ nslookup mysql-headless
                            Server:    10.96.0.10
                            Address:   10.96.0.10:53
                            
                            Name:      mysql-headless.default.svc.cluster.local
                            Address:   10.244.1.5  ← Pod-1 IP
                            Address:   10.244.2.8  ← Pod-2 IP
                            Address:   10.244.3.12 ← Pod-3 IP

                            2. 流量路径差异

                            ClusterIP 流量路径:
                            Client → Service IP (10.96.100.50)
                                   → kube-proxy (iptables/IPVS)
                                   → 随机选择一个 Pod
                            
                            Headless 流量路径:
                            Client → DNS 查询
                                   → 获取所有 Pod IP
                                   → 客户端自己选择 Pod
                                   → 直接连接 Pod IP

                            3. 使用场景对比

                            | 场景 | ClusterIP | Headless |
                            |---|---|---|
                            | 无状态应用 | ✅ 推荐 | ❌ 不需要 |
                            | 有状态应用 | ❌ 不适合 | ✅ 推荐 |
                            | 数据库主从 | ❌ 无法区分主从 | ✅ 可以指定连接主节点 |
                            | 集群成员发现 | ❌ 无法获取成员列表 | ✅ 可以获取所有成员 |
                            | 需要负载均衡 | ✅ 自动负载均衡 | ❌ 需要客户端实现 |
                            | 客户端连接池 | ⚠️ 只能连接到 Service IP | ✅ 可以为每个 Pod 建立连接 |

                            🎯 典型应用场景

                            ClusterIP Service 适用场景

                            1. 无状态 Web 应用

                            apiVersion: v1
                            kind: Service
                            metadata:
                              name: frontend
                            spec:
                              type: ClusterIP
                              selector:
                                app: frontend
                              ports:
                              - port: 80
                                targetPort: 3000

                            2. RESTful API 服务

                            apiVersion: v1
                            kind: Service
                            metadata:
                              name: api-service
                            spec:
                              type: ClusterIP
                              selector:
                                app: api
                              ports:
                              - port: 8080

                            3. 微服务之间的调用

                            # Service A 调用 Service B
                            apiVersion: v1
                            kind: Service
                            metadata:
                              name: service-b
                            spec:
                              type: ClusterIP
                              selector:
                                app: service-b
                              ports:
                              - port: 9090

                            Headless Service 适用场景

                            1. MySQL 主从复制

                            apiVersion: v1
                            kind: Service
                            metadata:
                              name: mysql
                            spec:
                              clusterIP: None
                              selector:
                                app: mysql
                              ports:
                              - port: 3306
                            ---
                            # 应用连接时:
                            # 写操作 → mysql-0.mysql (主节点)
                            # 读操作 → mysql-1.mysql, mysql-2.mysql (从节点)

                            2. Kafka 集群

                            apiVersion: v1
                            kind: Service
                            metadata:
                              name: kafka
                            spec:
                              clusterIP: None
                              selector:
                                app: kafka
                              ports:
                              - port: 9092
                            ---
                            # Kafka 客户端可以发现所有 broker:
                            # kafka-0.kafka:9092
                            # kafka-1.kafka:9092
                            # kafka-2.kafka:9092

                            3. Elasticsearch 集群

                            apiVersion: v1
                            kind: Service
                            metadata:
                              name: elasticsearch
                            spec:
                              clusterIP: None
                              selector:
                                app: elasticsearch
                              ports:
                              - port: 9200
                                name: http
                              - port: 9300
                                name: transport
                            ---
                            # 集群内部节点通过 DNS 发现彼此:
                            # elasticsearch-0.elasticsearch
                            # elasticsearch-1.elasticsearch
                            # elasticsearch-2.elasticsearch

                            4. Redis 集群模式

                            apiVersion: v1
                            kind: Service
                            metadata:
                              name: redis-cluster
                            spec:
                              clusterIP: None
                              selector:
                                app: redis
                              ports:
                              - port: 6379
                                name: client
                              - port: 16379
                                name: gossip
                            ---
                            # Redis 客户端获取所有节点进行 cluster slots 查询

                            🔧 混合使用:两种 Service 同时存在

                            对于有状态服务,常见做法是同时创建两个 Service:

                            # 1. Headless Service:用于 StatefulSet 和 Pod 间通信
                            apiVersion: v1
                            kind: Service
                            metadata:
                              name: mysql-headless
                            spec:
                              clusterIP: None
                              selector:
                                app: mysql
                              ports:
                              - port: 3306
                            ---
                            # 2. ClusterIP Service:用于客户端负载均衡访问(只读副本)
                            apiVersion: v1
                            kind: Service
                            metadata:
                              name: mysql-read
                            spec:
                              type: ClusterIP
                              selector:
                                app: mysql
                                role: replica  # 只选择从节点
                              ports:
                              - port: 3306
                            ---
                            # StatefulSet
                            apiVersion: apps/v1
                            kind: StatefulSet
                            metadata:
                              name: mysql
                            spec:
                              serviceName: mysql-headless  # 使用 Headless Service
                              replicas: 3
                              # ...

                            使用方式:

                            # 写操作:直接连接主节点
                            mysql -h mysql-0.mysql-headless -u root -p
                            
                            # 读操作:通过 ClusterIP 自动负载均衡到所有从节点
                            mysql -h mysql-read -u root -p

                            🛠️ 常见问题

                            Q1: 如何选择使用哪种 Service?

                            决策流程:

                            应用是无状态的? 
                              ├─ 是 → 使用 ClusterIP
                              └─ 否 → 继续
                            
                            需要客户端感知所有 Pod?
                              ├─ 是 → 使用 Headless Service
                              └─ 否 → 继续
                            
                            需要区分不同 Pod(如主从)?
                              ├─ 是 → 使用 Headless Service + StatefulSet
                              └─ 否 → 使用 ClusterIP

                            Q2: Headless Service 没有负载均衡怎么办?

                            方案:

                            1. 客户端负载均衡:应用层实现(如 Kafka 客户端)
                            2. DNS 轮询:部分 DNS 客户端会自动轮询
                            3. 混合方案:同时创建 ClusterIP Service 用于负载均衡

                            Q3: 如何测试 Headless Service?

                            # 创建测试 Pod
                            kubectl run -it --rm debug --image=busybox --restart=Never -- sh
                            
                            # 测试 DNS 解析
                            nslookup mysql-headless.default.svc.cluster.local
                            
                             # 测试连接特定 Pod(MySQL 不是 HTTP 服务,用 wget 探测没有意义;
                             # 这里改用 nc 探测 TCP 端口连通性,busybox 的 nc 是否支持 -z 取决于编译选项)
                             nc -z -w 2 mysql-0.mysql-headless 3306 && echo "mysql-0 reachable"
                             
                             # 测试所有 Pod
                             for i in 0 1 2; do
                               echo "Testing mysql-$i"
                               nc -z -w 2 mysql-$i.mysql-headless 3306
                             done

Q4: Can a ClusterIP Service be used with a StatefulSet?

Yes, but it is not recommended on its own:

• ✅ It does provide load balancing
• ❌ There is no stable DNS name for reaching a specific Pod
• ❌ It cannot support primary/replica architectures (no way to address the primary)

Best practice:

• Use a Headless Service for the StatefulSet
• If load balancing is also needed, create an additional ClusterIP Service

💡 Key Takeaways

ClusterIP Service

✅ The default type, with a virtual IP
✅ Automatic load balancing (kube-proxy)
✅ Suited to stateless applications
✅ Clients need no awareness of backend Pods
✅ DNS resolves to the Service IP

Headless Service

✅ Set clusterIP: None
✅ DNS resolves to all Pod IPs
✅ Suited to stateful applications
✅ Supports Pod-level service discovery
✅ Commonly paired with StatefulSets

Choosing Between Them

• Web applications, API services → ClusterIP
• Databases, message queues, distributed storage → Headless Service
• Applications with primaries/shards → Headless Service + StatefulSet
• Need both load balancing and direct Pod access → create both kinds of Service
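That last recommendation (create both Services) can be sketched for a MySQL StatefulSet; the names and labels here are illustrative:

```yaml
# Headless Service: gives each Pod a stable DNS name
# (mysql-0.mysql-headless, mysql-1.mysql-headless, ...)
apiVersion: v1
kind: Service
metadata:
  name: mysql-headless
spec:
  clusterIP: None
  selector:
    app: mysql
  ports:
  - port: 3306
---
# Plain ClusterIP Service over the same Pods: a load-balanced
# entry point for clients that do not care which replica answers
apiVersion: v1
kind: Service
metadata:
  name: mysql-read
spec:
  selector:
    app: mysql
  ports:
  - port: 3306
```

The StatefulSet's serviceName would point at mysql-headless; clients that only read can use mysql-read.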
                            Mar 7, 2024

                            Helm Principle

Helm is the package manager for Kubernetes, analogous to apt/yum on Linux or pip for Python. Its core job: 👉 define, install, and upgrade Kubernetes applications in a templated way.


🧩 I. Core Helm Concepts

Before looking at the mechanics, it helps to pin down Helm's key objects:

• Chart: a Helm package, i.e. a collection of Kubernetes resource templates describing one application (its install bundle)
• values.yaml: the chart's configuration file, used to fill in template variables
• Release: an instance of a chart installed into a namespace; every install or upgrade creates a new release revision
• Repository: a store for packaged charts (.tgz), either HTTP or OCI based (e.g. Harbor, Artifactory)

⚙️ II. How Helm Works, End to End

From the user's perspective, the Helm client takes a command (such as helm install) and walks through a series of steps that end with Kubernetes resources in the cluster.

The core flow, as a text diagram:

       ┌────────────┐
       │ helm client│
       └─────┬──────┘
             │
             ▼
      1. Parse the chart and values
             │
             ▼
      2. Render templates (Helm template engine)
             │
             ▼
      3. Produce plain YAML manifests
             │
             ▼
      4. Call the Kubernetes API
             │
             ▼
      5. Create/update resources (Deployment, Service, ...)
             │
             ▼
      6. Record release history (ConfigMap/Secret)

🔍 III. Helm's Mechanics, Step by Step

1️⃣ Chart Rendering

Helm merges templates with values.yaml using Go's text/template engine plus the Sprig function library, producing Kubernetes YAML manifests.

For example:

                            # templates/deployment.yaml
                            apiVersion: apps/v1
                            kind: Deployment
                            metadata:
                              name: {{ .Release.Name }}-app
                            spec:
                              replicas: {{ .Values.replicas }}

Running:

                            helm template myapp ./mychart -f myvalues.yaml

renders plain YAML files locally (nothing is deployed to the cluster).
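Since Helm's engine is Go's own text/template (plus Sprig), the rendering step can be reproduced in a few lines. A minimal sketch, using an inline template and a hand-built values context instead of a real chart:

```go
package main

import (
	"bytes"
	"fmt"
	"text/template"
)

// A stripped-down analogue of templates/deployment.yaml.
const manifest = `apiVersion: apps/v1
kind: Deployment
metadata:
  name: {{ .Release.Name }}-app
spec:
  replicas: {{ .Values.replicas }}
`

// render merges the template with a values context, as Helm does
// (Helm additionally mixes in the Sprig function library).
func render() string {
	ctx := map[string]any{
		"Release": map[string]any{"Name": "myapp"},
		"Values":  map[string]any{"replicas": 3},
	}
	tpl := template.Must(template.New("deployment").Parse(manifest))
	var buf bytes.Buffer
	if err := tpl.Execute(&buf, ctx); err != nil {
		panic(err)
	}
	return buf.String()
}

func main() {
	fmt.Print(render())
}
```

Helm performs this merge for every file under templates/ and concatenates the results into the manifest it submits.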


2️⃣ Deployment (Install/Upgrade)

Running:

helm install myapp ./mychart

makes the Helm client submit the rendered YAML to the cluster through the Kubernetes API (much like kubectl apply).

At the same time, Helm creates a "release record" in the namespace, stored by default as:

namespace: <your-namespace>
kind: Secret
name: sh.helm.release.v1.<release-name>.vN

This record contains:

• a snapshot of the chart templates and values
• the rendered manifest
• the release status (deployed, failed, ...)
• the revision number (v1, v2, ...)
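The release field inside that Secret holds the release as gzip-compressed JSON, base64-encoded (and kubectl shows Secret data base64-encoded once more on top). A self-contained sketch of that encode/decode roundtrip, with a made-up minimal payload standing in for a real release record:

```go
package main

import (
	"bytes"
	"compress/gzip"
	"encoding/base64"
	"fmt"
	"io"
)

// encodeRelease mimics how Helm 3 stores a release: gzip, then base64.
func encodeRelease(payload []byte) string {
	var buf bytes.Buffer
	gz := gzip.NewWriter(&buf)
	gz.Write(payload)
	gz.Close()
	return base64.StdEncoding.EncodeToString(buf.Bytes())
}

// decodeRelease reverses it: base64-decode, then gunzip.
func decodeRelease(blob string) ([]byte, error) {
	raw, err := base64.StdEncoding.DecodeString(blob)
	if err != nil {
		return nil, err
	}
	gz, err := gzip.NewReader(bytes.NewReader(raw))
	if err != nil {
		return nil, err
	}
	defer gz.Close()
	return io.ReadAll(gz)
}

func main() {
	// Hypothetical minimal release payload, not Helm's full schema.
	release := []byte(`{"name":"myapp","version":1,"info":{"status":"deployed"}}`)
	blob := encodeRelease(release)
	back, _ := decodeRelease(blob)
	fmt.Println(string(back))
}
```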

3️⃣ Upgrade and Rollback

When you run:

helm upgrade myapp ./mychart

Helm will:

1. Read the previous release secret
2. Render the new templates
3. Diff old against new
4. Call the Kubernetes API to update the objects
5. Write a new release secret (revision number +1)

To roll back:

helm rollback myapp 2

Helm retrieves the v2 record and re-submits its manifest (effectively another kubectl apply).


4️⃣ Chart Distribution (Helm Repository / OCI Registry)

Helm supports two ways of distributing charts:

• HTTP repository (traditional)

  • an index.yaml index file
  • charts stored as .tgz archives
• OCI registry (the modern recommendation)

  • charts stored in an OCI registry (e.g. Harbor, GHCR)

  • push (note helm push takes a packaged .tgz, not a chart directory):

    helm push mychart-1.0.0.tgz oci://harbor.example.com/helm
  • pull:

    helm pull oci://harbor.example.com/helm/mychart --version 1.0.0

🧠 IV. Helm's Relationship to Kubernetes

Helm itself does not run or manage containers. It is only:

• a template engine plus an application lifecycle manager;
• every resource is still scheduled and run by Kubernetes controllers (such as the Deployment controller).

Helm is best thought of as a packaging layer on top:

Helm = chart templating + Kubernetes API client + release history tracking

💡 V. What Common Commands Actually Do

• helm install: render templates → submit resources → create a release
• helm upgrade: render templates → diff against the previous revision → update resources → new release
• helm rollback: fetch the old revision's record → re-submit its manifest
• helm uninstall: delete the Kubernetes resources and the release secrets
• helm template: render locally; no cluster interaction
• helm diff (plugin): compare old and new rendered output

🧩 VI. Helm 3 vs. Helm 2 (the essentials)

• Helm 2 required Tiller, an in-cluster server component; Helm 3 drops Tiller and is fully client-side
• Helm 2's security model was complex (Tiller's own RBAC); Helm 3 simply uses the permissions of your kubeconfig
• Helm 2 stored releases in ConfigMaps; Helm 3 stores them in Secrets by default
• Helm 2 needed a server deployed; Helm 3 is a pure client
                            Mar 7, 2024

                            HPA

HPA (Horizontal Pod Autoscaler) is the core Kubernetes component for automatic horizontal scaling. Its implementation spans several Kubernetes components and some fairly involved control logic.

I. HPA Architecture

1. Core Components

┌──────────────────┐    ┌─────────────────────┐    ┌─────────────────┐
│  HPA Controller  │◄── │     Metrics API     │◄── │  Metrics Server │
│(kube-controller) │    │ (aggregation layer) │    │   (cAdvisor)    │
└──────────────────┘    └─────────────────────┘    └─────────────────┘
         │                         │                        │
         ▼                         ▼                        ▼
┌──────────────────┐    ┌─────────────────────┐    ┌─────────────────┐
│ Deployment/      │    │   Custom Metrics    │    │  External       │
│ StatefulSet      │    │      Adapter        │    │  Metrics        │
└──────────────────┘    └─────────────────────┘    └─────────────────┘

II. HPA Control Flow

1. The Full Control Loop

// Simplified HPA control logic
for {
    // 1. Fetch the HPA object
    hpa := client.AutoscalingV2().HorizontalPodAutoscalers(namespace).Get(name)
    
    // 2. Fetch the scale target (Deployment/StatefulSet/...)
    scaleTarget := hpa.Spec.ScaleTargetRef
    target := client.AppsV1().Deployments(namespace).Get(scaleTarget.Name)
    
    // 3. Query metrics
    metrics := []autoscalingv2.MetricStatus{}
    for _, metricSpec := range hpa.Spec.Metrics {
        metricValue := getMetricValue(metricSpec, target)
        metrics = append(metrics, metricValue)
    }
    
    // 4. Compute the desired replica count
    desiredReplicas := calculateDesiredReplicas(hpa, metrics, currentReplicas)
    
    // 5. Scale if needed
    if desiredReplicas != currentReplicas {
        scaleTarget.Spec.Replicas = &desiredReplicas
        client.AppsV1().Deployments(namespace).UpdateScale(scaleTarget.Name, scaleTarget)
    }
    
    time.Sleep(15 * time.Second) // default sync interval
}

2. Steps in Detail

Step 1: Metric Collection

# The HPA fetches metrics through the Metrics API
kubectl get --raw "/apis/metrics.k8s.io/v1beta1/namespaces/default/pods" | jq .

# ...or through the custom metrics API
kubectl get --raw "/apis/custom.metrics.k8s.io/v1beta1" | jq .

Step 2: Metric Computation

// Ratio of the current metric value to the target value
func calculateMetricRatio(currentValue, targetValue int64) float64 {
    return float64(currentValue) / float64(targetValue)
}

// Example: CPU usage
currentCPUUsage := int64(800) // currently using 800 milli-cores
targetCPUUsage := int64(500)  // target: 500 milli-cores
ratio := calculateMetricRatio(currentCPUUsage, targetCPUUsage) // = 1.6

III. HPA Configuration in Detail

1. The HPA Resource

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: myapp-hpa
  namespace: default
spec:
  # Scale target
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: myapp
  # Replica bounds
  minReplicas: 2
  maxReplicas: 10
  # Metric definitions
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 50
  - type: Resource
    resource:
      name: memory
      target:
        type: Utilization
        averageUtilization: 70
  - type: Pods
    pods:
      metric:
        name: packets-per-second
      target:
        type: AverageValue
        averageValue: 1k
  - type: Object
    object:
      metric:
        name: requests-per-second
      describedObject:
        apiVersion: networking.k8s.io/v1
        kind: Ingress
        name: main-route
      target:
        type: Value
        value: 10k
  # Scaling behavior (Kubernetes 1.18+)
  behavior:
    scaleDown:
      stabilizationWindowSeconds: 300
      policies:
      - type: Percent
        value: 50
        periodSeconds: 60
      - type: Pods
        value: 5
        periodSeconds: 60
      selectPolicy: Min
    scaleUp:
      stabilizationWindowSeconds: 0
      policies:
      - type: Percent
        value: 100
        periodSeconds: 15
      - type: Pods
        value: 4
        periodSeconds: 15
      selectPolicy: Max

IV. Metric Types and How They Are Computed

1. Resource Metrics (CPU/Memory)

metrics:
- type: Resource
  resource:
    name: cpu
    target:
      type: Utilization    # utilization mode
      averageUtilization: 50
      
- type: Resource  
  resource:
    name: memory
    target:
      type: AverageValue  # average-value mode
      averageValue: 512Mi

Computation

// Replica count derived from CPU utilization
func calculateCPUReplicas(usagePerPod, requestPerPod, targetUtilizationPct, currentReplicas int32) int32 {
    // current total usage across all replicas
    totalUsage := float64(usagePerPod) * float64(currentReplicas)
    // desired = ceil(total usage / (per-pod request * target utilization))
    perPodBudget := float64(requestPerPod) * float64(targetUtilizationPct) / 100.0
    return int32(math.Ceil(totalUsage / perPodBudget))
}

2. Custom Metrics (Pods type)

                            metrics:
                            - type: Pods
                              pods:
                                metric:
                                  name: http_requests_per_second
                                target:
                                  type: AverageValue
                                  averageValue: 100

Computation

desired replicas = ceil(current total metric value / target average value)
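That formula, with the min/max bounds applied, is small enough to sketch directly (function and parameter names are illustrative, not the controller's actual code):

```go
package main

import (
	"fmt"
	"math"
)

// desiredReplicas applies: desired = ceil(sum of current per-pod
// metric values / target average value), clamped to [min, max].
func desiredReplicas(podMetrics []float64, targetAverage float64, minReplicas, maxReplicas int32) int32 {
	var total float64
	for _, v := range podMetrics {
		total += v
	}
	d := int32(math.Ceil(total / targetAverage))
	if d < minReplicas {
		d = minReplicas
	}
	if d > maxReplicas {
		d = maxReplicas
	}
	return d
}

func main() {
	// 3 pods serving ~150 req/s each, target average 100 req/s per pod:
	// total 450, 450/100 = 4.5, ceil -> 5 replicas.
	fmt.Println(desiredReplicas([]float64{150, 140, 160}, 100, 2, 10))
}
```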

3. Object Metrics (Object type)

                            metrics:
                            - type: Object
                              object:
                                metric:
                                  name: latency
                                describedObject:
                                  apiVersion: networking.k8s.io/v1
                                  kind: Ingress
                                  name: my-ingress
                                target:
                                  type: Value
                                  value: 100

V. The HPA Algorithm in Detail

1. Core Algorithm

// Compute the desired replica count
func GetDesiredReplicas(
    currentReplicas int32,
    metricValues []metrics,
    hpa *HorizontalPodAutoscaler,
) int32 {
    ratios := make([]float64, 0)
    
    // 1. Compute the ratio for each metric
    for _, metric := range metricValues {
        ratio := calculateMetricRatio(metric.current, metric.target)
        ratios = append(ratios, ratio)
    }
    
    // 2. Take the largest ratio (the metric most in need of scale-up)
    maxRatio := getMaxRatio(ratios)
    
    // 3. Compute the desired replica count
    desiredReplicas := math.Ceil(float64(currentReplicas) * maxRatio)
    
    // 4. Clamp to the configured bounds
    desiredReplicas = applyBounds(desiredReplicas, hpa.Spec.MinReplicas, hpa.Spec.MaxReplicas)
    
    return int32(desiredReplicas)
}

2. Smoothing and Cooldown

// Scaling decisions that take recent history into account
func withStabilization(desiredReplicas int32, hpa *HorizontalPodAutoscaler) int32 {
    now := time.Now()
    
    var stabilizationWindow int32
    if isScaleUp(desiredReplicas, hpa.Status.CurrentReplicas) {
        // Scale-up: usually acts immediately
        stabilizationWindow = hpa.Spec.Behavior.ScaleUp.StabilizationWindowSeconds
    } else {
        // Scale-down: apply the stabilization window
        stabilizationWindow = hpa.Spec.Behavior.ScaleDown.StabilizationWindowSeconds
    }
    
    // Keep only recommendations made within the stabilization window
    validRecommendations := filterRecommendationsByTime(
        hpa.Status.Conditions,
        now.Add(-time.Duration(stabilizationWindow)*time.Second),
    )
    
    // Apply the selection policy (Min/Max)
    finalReplicas := applyPolicy(validRecommendations, hpa.Spec.Behavior)
    
    return finalReplicas
}

VI. Advanced Features

1. Multiple Metrics

When several metrics are configured, the HPA computes a desired replica count for each and then takes the maximum:

                            func calculateFromMultipleMetrics(metrics []Metric, currentReplicas int32) int32 {
                                desiredReplicas := make([]int32, 0)
                                
                                for _, metric := range metrics {
                                    replicas := calculateForSingleMetric(metric, currentReplicas)
                                    desiredReplicas = append(desiredReplicas, replicas)
                                }
                                
    // Take the largest desired replica count
                                return max(desiredReplicas...)
                            }

2. Controlling Scaling Behavior

behavior:
  scaleDown:
    # Scale-down stabilization window: 5 minutes
    stabilizationWindowSeconds: 300
    policies:
    - type: Percent   # at most 50% of pods removed per minute
      value: 50
      periodSeconds: 60
    - type: Pods      # ...or at most 5 pods removed per minute
      value: 5
      periodSeconds: 60
    selectPolicy: Min # pick the more restrictive policy
    
  scaleUp:
    stabilizationWindowSeconds: 0  # scale up immediately
    policies:
    - type: Percent   # at most 100% growth per minute
      value: 100
      periodSeconds: 60
    - type: Pods      # ...or at most 4 pods added per minute
      value: 4
      periodSeconds: 60
    selectPolicy: Max # pick the more permissive policy

VII. Monitoring and Debugging

1. Inspecting HPA Status

# Show HPA details
kubectl describe hpa myapp-hpa

# Sample output:
# Name: myapp-hpa
# Namespace: default
# Reference: Deployment/myapp
# Metrics: ( current / target )
#   resource cpu on pods  (as a percentage of request):  65% (130m) / 50%
#   resource memory on pods:                             120Mi / 100Mi
# Min replicas: 2
# Max replicas: 10
# Deployment pods: 3 current / 3 desired

2. HPA Events

# List HPA events
kubectl get events --field-selector involvedObject.kind=HorizontalPodAutoscaler

# Check the scaling history
kubectl describe deployment myapp | grep -A 10 "Events"

3. Metric Debugging

# Check that the Metrics API is healthy
kubectl get --raw "/apis/metrics.k8s.io/v1beta1/nodes" | jq .

# Check custom metrics
kubectl get --raw "/apis/custom.metrics.k8s.io/v1beta1" | jq .

# Query pod metrics directly
kubectl top pods
kubectl top nodes

VIII. Troubleshooting Common Problems

1. HPA Does Not Scale Up

# Check whether metrics are available
kubectl describe hpa myapp-hpa
# ...and look for errors in the Events section

# Check the Metrics Server
kubectl get apiservices | grep metrics
kubectl logs -n kube-system -l k8s-app=metrics-server

# Check that resource requests are configured
kubectl get deployment myapp -o yaml | grep resources -A 5

2. HPA Flapping

# Tune the behavior settings
behavior:
  scaleDown:
    stabilizationWindowSeconds: 600  # lengthen the scale-down window
    policies:
    - type: Pods
      value: 1                       # remove only 1 pod at a time
      periodSeconds: 300             # at most once every 5 minutes

IX. Performance Tuning

1. Large-Cluster Optimizations

# kube-controller-manager tuning
apiVersion: v1
kind: Pod
metadata:
  name: kube-controller-manager
  namespace: kube-system
spec:
  containers:
  - command:
    - kube-controller-manager
    - --horizontal-pod-autoscaler-sync-period=30s           # sync interval
    - --horizontal-pod-autoscaler-downscale-stabilization=5m0s
    - --horizontal-pod-autoscaler-initial-readiness-delay=30s
    - --horizontal-pod-autoscaler-cpu-initialization-period=5m0s

Summary

An HPA implementation involves:

1. Metric collection: aggregating resource usage through the Metrics API
2. Calculation: deriving a desired replica count from current and target values
3. Decision smoothing: stabilization windows and policies to avoid flapping
4. Execution: updating the target resource's replica count
5. State tracking: recording scaling history and current status

The HPA's robustness comes from multi-metric decisions, behavior controls, and cooldown mechanics, which together make autoscaling both responsive and stable.

                            Mar 7, 2024

                            More than 1k Nodes

At this scale, Kubernetes is no longer a matter of "it runs, good enough"; it becomes an engineering exercise in scalability, stability, observability, and resource efficiency. The sections below cover architecture, the control plane, node management, networking, storage, security, and operations.

🧠 I. The Big Picture: What Actually Breaks at Scale

Beyond roughly 500-1000 nodes, the bottlenecks usually show up as:

• control plane (API server / etcd) overload
• insufficient scheduler throughput
• too many API objects (Pods, Nodes, Secrets, ConfigMaps, ...), causing List/Watch latency
• network and CNI plugin performance degrading under high concurrency
• an explosion of monitoring, logging, and event data
• maintenance and upgrades becoming painfully complex

So the priorities for large clusters are:

a layered control plane, partitioned node pools, traffic isolation, and observability plus tuning.


🏗️ II. Control Plane

1. etcd

• Run it separately: do not co-locate with kube-apiserver; use dedicated high-performance nodes (NVMe SSD, local disks).
• Use etcd v3.5+ (notable performance gains), and enable compaction and snapshots.
• Raise --max-request-bytes and --quota-backend-bytes to avoid overload.
• Defragment regularly; a CronJob can automate this.
• Keep short-lived objects (e.g. rapidly updated CRD status) out of etcd; consider an external store (such as Redis or SQL).

2. Scaling and Protecting the API Server

• Put a load balancer (HAProxy, NGINX, ELB) in front of multiple API servers;

• Tune:

  • --max-mutating-requests-inflight
  • --max-requests-inflight
  • --target-ram-mb
• Set a sensible --request-timeout so watches cannot hang forever;

• Rein in heavy watch clients (Prometheus, controller-manager, ...);

• Put an aggregator or read-only proxy in front of chatty clients to shed load.

3. Scheduler & Controller Manager

• Run multiple scheduler instances (with leader election)

• Make use of the scheduler cache (SchedulerCache)

• Tune:

  • --kube-api-qps and --kube-api-burst
  • the scheduler's backoff policy;
• For custom Operators, use rate-limited workqueues to prevent request storms.


🧩 III. Node and Pod Management

1. Node Partitioning and Topology

• Split nodes into pools by function/location (GPU / CPU / IO-heavy);
• Use Topology Spread Constraints to avoid scheduling hotspots;
• Consider multiple clusters with central management instead of one huge cluster (Cluster Federation/KubeFed, ArgoCD multi-cluster, Karmada, Fleet).

2. Node Lifecycle

• Tune the kubelet heartbeat (--node-status-update-frequency);
• Use Node Problem Detector (NPD) to mark unhealthy nodes automatically;
• Watch the Pod eviction rate to keep Pods from churning between nodes;
• Enable graceful node shutdown.

3. Images and the Container Runtime

• Pre-pull images (image warm-up);
• Use a registry proxy/mirror (Harbor / registry-mirror);
• Prefer containerd over Docker;
• Periodically clean /var/lib/containerd/io.containerd.snapshotter.v1.overlayfs/snapshots.

🌐 IV. Networking (CNI)

1. CNI Choice and Tuning

• At large scale, prefer:

  • Calico (BGP mode)
  • Cilium (eBPF)
  • or the cloud provider's native CNI (AWS CNI, Azure CNI).
• Reduce ARP / routing-table pressure:

  • segment IPAM subnets
  • use Cilium's ClusterMesh for layering;
• Raise the conntrack table size (net.netfilter.nf_conntrack_max).

2. Service & DNS

• Enable CoreDNS caching

• For very large Service counts, consider Headless Services and ExternalName

• Optimize kube-proxy:

  • use IPVS mode
  • or Cilium's service load balancing
• If Services are extremely numerous, split DNS domains per namespace.


💾 V. Storage (CSI)

• Use a distributed storage system (Ceph, Longhorn, OpenEBS, CSI-HostPath);
• Avoid PVCs with high-frequency small I/O;
• Regularly clean up orphaned PVs/PVCs;
• Enable rate limiting and retries in CSI drivers.

🔒 VI. Security and Access Control

• Enforce strict RBAC
• Cap per-namespace resources (ResourceQuota, LimitRange);
• Ship audit logs (Audit Policy) asynchronously;
• Route external traffic through a single Ingress Controller;
• If Operators or CRD objects proliferate, garbage-collect stale objects regularly.

📈 VII. Observability and Maintenance

1. Monitoring

• Cluster-scale Prometheus (Thanos / VictoriaMetrics);
• Do not scrape every Pod directly; sample or aggregate;
• Rate-limit kube-state-metrics and cAdvisor data.

2. Logging

• Centralized log collection (Loki / Elasticsearch / Vector);
• Control log volume (sampling, compression, retention).

3. Upgrades and Testing

• Use canary upgrades / rolling node pools
• Run e2e tests before every upgrade;
• Snapshot and back up the control plane separately (etcd snapshot).

⚙️ VIII. Tuning Tips from Practice

• Raise the kubelet QPS limits:

  --kube-api-qps=100 --kube-api-burst=200
• Keep Pod counts reasonable:

  • no more than 110 Pods per node;
  • ideally fewer than 5000 Pods per namespace;
  • overall target: 1k nodes → 50k to 100k Pods at most.
• Shard CRDs / trim CRD status fields

• Avoid floods of short-lived Jobs; use CronJobs plus the TTL controller for cleanup.


🧭 IX. Going Further

When the cluster keeps growing (>3000 nodes), consider:

• a multi-cluster architecture (Cluster Federation / Karmada / Rancher Fleet)
• a layered, cell-based control plane
• an API aggregation layer plus a custom scheduler

                            Mar 7, 2024

Network Policy

1. The Design Principles of Network Policy

The core idea behind Kubernetes Network Policy is: inside a cluster network that allows everything by default, introduce a declarative, label-based firewall that flips the default to "deny".

Breaking that idea down:

1. From "allow by default" to "deny by default"

  • Default behavior: with no Network Policy in place, Pods in a Kubernetes cluster can talk to each other freely (subject to the CNI plugin), and even external traffic may reach Pods directly. It is like an open network with no firewall.
  • What a Network Policy does: as soon as a Network Policy is created in a Namespace, it acts as a switch that changes the default behavior of the selected Pods to "deny". From then on, only traffic a policy explicitly allows gets through.
2. A declarative model

  • Like other Kubernetes resources (Deployment, Service, …), Network Policy is declarative. You state the desired network state (e.g. "allow Pods labeled role=frontend to reach port 6379 on Pods labeled role=backend") without caring how it is realized with iptables or eBPF commands. Kubernetes and the underlying CNI plugin implement the declaration.
3. Label-based selection

  • This is the core Kubernetes design pattern. A Network Policy never references Pod IPs, which are dynamic and ephemeral; it selects groups of Pods by label.
  • podSelector: selects the Pods the policy applies to (the target Pods).
  • namespaceSelector: selects source or destination namespaces by their labels.
  • namespaceSelector and podSelector can be combined for very fine-grained access control.
4. Policies are additive

  • Multiple Network Policies can apply to the same Pod. The effective rule set is the union of all matching policies: if any policy allows a flow, that flow is allowed. This lets you define policies per module or per layer without them overriding each other.
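The "switch to default deny" described in point 1 is commonly implemented with an empty-selector policy; a minimal sketch:

```yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-all
spec:
  podSelector: {}        # empty selector: applies to every Pod in the namespace
  policyTypes:
  - Ingress
  - Egress               # no ingress/egress rules listed => all traffic denied
```

Allow-rules from other policies are then added on top, thanks to the additive model above.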

2. How Network Policy Is Implemented

A very important point: a Network Policy is itself just an API object that defines a specification. Its actual enforcement depends on the Container Network Interface (CNI) plugin.

Kubernetes does not enforce network policies itself; the CNI plugin does. This means:

• If your CNI plugin does not support Network Policy, the policies you create have no effect at all.
• Different CNI plugins use different underlying technologies to enforce the same Network Policy specification.

The mainstream implementation techniques:

1. iptables-based

  • How it works: the CNI plugin (e.g. Calico in some modes, Weave Net) watches the Kubernetes API; when a Network Policy is created it generates corresponding iptables rules on each node, which filter packets entering and leaving the Pod's network interface (veth pair).
  • Pros: mature, stable, universal.
  • Cons: with very complex policies the iptables chains grow long, which can hurt performance.
2. eBPF-based

  • How it works: the more modern, efficient approach, used extensively by projects such as Cilium. eBPF injects programs directly into the Linux kernel, where packet filtering, forwarding, and policy checks run efficiently.
  • Pros: high performance, extremely flexible (policies at L3/L4/L7), low system overhead.
  • Cons: requires a relatively recent Linux kernel.
3. IPVS or a custom data plane

  • Some CNI plugins (e.g. Antrea, built on OVS) have their own data plane and implement policy matching and enforcement inside it.

Common CNI plugins that support Network Policy:

• Calico: powerful, supports complex policies, runs in both iptables and eBPF modes.
• Cilium: eBPF-based, supports Network Policy natively and extends it to L7 (HTTP, gRPC, …).
• Weave Net: basic Network Policy support.
• Antrea: built on Open vSwitch, also offers strong policy support.

3. What Network Policy Is For

Network Policy is the core tool for the "zero trust" and "micro-segmentation" security models in Kubernetes. Its main uses:

1. Enforcing least privilege

  • The most important use. With fine-grained policies, a Pod can only talk to the Pods and external services it actually needs to do its job; everything else is denied. This greatly shrinks the attack surface.
2. Isolating multi-tenant environments

  • In a shared Kubernetes cluster, different teams, projects, or environments (dev, staging, …) get their own namespaces, and Network Policy strictly limits cross-namespace access so they stay isolated from each other.
3. Protecting critical backend services

  • Databases, caches (e.g. Redis), and message queues should not be reachable by every Pod. A policy can allow only specific frontend or middleware Pods (selected by label) to reach specific ports on those backends:
  # Example: only Pods labeled role=api may reach port 5432 on Pods labeled role=db
  apiVersion: networking.k8s.io/v1
  kind: NetworkPolicy
  metadata:
    name: allow-api-to-db
  spec:
    podSelector:
      matchLabels:
        role: db
    policyTypes:
    - Ingress
    ingress:
    - from:
      - podSelector:
          matchLabels:
            role: api
      ports:
      - protocol: TCP
        port: 5432
4. Controlling external access

  • The ipBlock field restricts access to in-cluster services to specific IP ranges (for example your office CIDR), replacing or complementing traditional firewall rules.
5. Drawing security boundaries between application tiers

  • A typical web application can be layered cleanly:
    • Frontend tier: accepts external traffic (ports 80/443) but may only talk to the backend tier.
    • Backend tier: accepts traffic only from the frontend tier and may only talk to the data tier.
    • Data tier: accepts traffic only from the backend tier and nothing else.
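Item 4 above can be sketched with ipBlock (the labels and CIDR are illustrative; 203.0.113.0/24 is a documentation range):

```yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-office-only
spec:
  podSelector:
    matchLabels:
      role: frontend
  policyTypes:
  - Ingress
  ingress:
  - from:
    - ipBlock:
        cidr: 203.0.113.0/24   # hypothetical office IP range
```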

Summary

• Design principle — in a default-allow network, a declarative, label-based mechanism enforces fine-grained "default deny" traffic control.
• Implementation — enforced by the CNI plugin using iptables, eBPF, and similar technologies; the policy itself is a Kubernetes API resource.
• Main uses — micro-segmentation, least privilege, multi-tenant isolation, protecting critical services, controlling external access; the foundation of Kubernetes network security.

In short, Network Policy is the firewall rule set of the Kubernetes world: it lets you define who may access what, under which conditions, and is an indispensable part of securing production workloads.

                            Mar 7, 2024

Node NotReady

When some Kubernetes nodes turn NotReady, it usually means the kubelet can no longer talk to the control plane (API server), or that a critical component or resource on the node is unhealthy.

We can analyze this from two angles: 1️⃣ common causes of NotReady, and 2️⃣ the impact on the cluster and its workloads.


🧩 I. Common Causes of Node NotReady, by Category

The kubelet reports a heartbeat (NodeStatus) to the API server every 10 seconds by default. If no update arrives for 40 seconds (default --node-monitor-grace-period=40s), the controller manager marks the node NotReady.

A detailed breakdown by category 👇


🖧 1. Network problems (most common)

Symptom: the node can ping the internet, but requests to the control plane time out. Causes include:

• a broken network path between the node and kube-apiserver (firewall, routing, VPC issues);
• an unhealthy API server load balancer (the L4/L7 LB stops forwarding traffic);
• a crashed CNI plugin, leaving the kubelet unable to report Pod status;
• broken DNS resolution on the node (the kubelet cannot resolve the API server).

How to investigate:

# From the node, check API server reachability
curl -k https://<apiserver-ip>:6443/healthz
# Inspect kubelet logs
journalctl -u kubelet | grep -E "error|fail|timeout"

⚙️ 2. Problems with the kubelet itself

Symptom: the node stays NotReady for a long time and recovers after the kubelet is restarted.

Causes include:

• kubelet crash / busy loop;
• full disk, leaving the kubelet unable to write its working directory (/var/lib/kubelet);
• expired client certificate (/var/lib/kubelet/pki/kubelet-client-current.pem);
• CPU/memory exhaustion, kubelet OOM-killed;
• a modified kubelet config file that fails to load on restart.

How to investigate:

systemctl status kubelet
journalctl -u kubelet -n 100
df -h /var/lib/kubelet
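For the certificate-expiry case above, a small helper (illustrative; relies on openssl and GNU date being installed) reports how many whole days of validity a certificate has left:

```shell
# Print whole days of validity remaining for a PEM certificate
cert_days_left() {
  local cert="$1"
  local end now
  # notAfter from the cert, converted to epoch seconds (GNU date)
  end=$(date -ud "$(openssl x509 -noout -enddate -in "$cert" | cut -d= -f2)" +%s)
  now=$(date -u +%s)
  echo $(( (end - now) / 86400 ))
}

# Usage on a node (path from the article):
# cert_days_left /var/lib/kubelet/pki/kubelet-client-current.pem
```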

💾 3. Node resource exhaustion

Symptom: the node shows NotReady or Unknown, and its Pods get evicted.

Possible causes:

• disk usage above 90%, triggering kubelet DiskPressure;
• CPU/memory pinned at 100%, triggering MemoryPressure;
• inodes exhausted (df -i);
• temp directories such as /var/lib/docker/tmp or /tmp full.

How to investigate:

kubectl describe node <node-name>
# Look at the Conditions section:
# Conditions:
#   Type              Status
#   ----              ------
#   MemoryPressure    True
#   DiskPressure      True

🧱 4. Control-plane problems (API server / controller manager)

Symptom: many nodes go NotReady at the same time.

Possible causes:

• API server overload, so heartbeats are not processed in time;
• etcd trouble (high write latency);
• controller manager unable to update NodeStatus;
• a dead cluster load balancer (e.g. haproxy).

How to investigate:

kubectl get componentstatuses   # deprecated in recent releases, but still informative
# Or check the control-plane Pods directly
kubectl -n kube-system get pods -l tier=control-plane

🔌 5. Container runtime (containerd/docker/cri-o) problems

Symptom: the kubelet logs "Failed to list pod sandbox".

Causes include:

• containerd service down;
• version incompatibility (kubelet and runtime versions too far apart);
• wrong permissions on the runtime socket;
• corrupted overlayfs;
• /var/lib/containerd or /run/containerd on a read-only filesystem.

How to investigate:

systemctl status containerd
journalctl -u containerd | tail
crictl ps

⏱️ 6. Clock skew

Symptom: the kubelet heartbeat is judged stale even though the node is actually fine.

Causes:

• node clock drift (NTP / chrony not enabled);
• control plane and node clocks differing by more than a few seconds;
• TLS validation failures (certificate validity window appears wrong).

🧰 7. Maintenance or human action

Including:

• node cordoned/drained;
• a network policy blocking the kubelet;
• kubelet stopped manually;
• node reinstalled without cleaning old state (Node UID conflict).

⚠️ II. Consequences and Impact of Node NotReady

1️⃣ Pod scheduling — the scheduler avoids placing new Pods on the node.
2️⃣ Pod eviction — by default, once a node has been NotReady for more than 300s (--pod-eviction-timeout), the controller manager evicts all of its Pods.
3️⃣ Missing Service endpoints — Pods on the node are removed from Service endpoint lists, reducing load-balanced traffic.
4️⃣ DaemonSet disruption — the DaemonSet controller stops creating/managing Pods on the node.
5️⃣ Data-loss risk — Pods using local volumes (emptyDir, hostPath) lose their data when evicted.
6️⃣ Monitoring alerts — Prometheus / Alertmanager fire alerts such as KubeNodeNotReady and KubeletDown.
7️⃣ Broken autoscaling — the Cluster Autoscaler can no longer assess resource utilization correctly.

🧭 III. Best Practices and Prevention

1. Enable Node Problem Detector (NPD) to flag system-level faults automatically;

2. Monitor NodeConditions (Ready, MemoryPressure, DiskPressure);

3. Standardize node health handling (e.g. via taints and tolerations);

4. Automated remediation:

  • combine Cluster API or an in-house controller to replace bad nodes automatically;
  • rebuild a node automatically if it stays NotReady for more than 10 minutes;
5. Regular inspection:

  • kubelet and containerd status;
  • clock synchronization;
  • disk usage;
  • API server QPS and etcd latency.
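The "NotReady for more than 10 minutes" check in item 4 boils down to comparing the Ready condition's lastTransitionTime with the current time. A tiny helper an automation script might use (illustrative; GNU date; the kubectl/jq usage below is hypothetical):

```shell
# Seconds elapsed since an RFC3339 timestamp, such as the
# lastTransitionTime field of a node condition (GNU date)
seconds_since() {
  local ts="$1"
  echo $(( $(date -u +%s) - $(date -ud "$ts" +%s) ))
}

# Hypothetical usage:
# t=$(kubectl get node worker-1 -o jsonpath='{.status.conditions[?(@.type=="Ready")].lastTransitionTime}')
# [ "$(seconds_since "$t")" -gt 600 ] && echo "NotReady > 10min, rebuild"
```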
                            Mar 7, 2024

The Pause Container

What the Kubernetes Pause Container Is For

The pause container is a tiny but critically important infrastructure container in Kubernetes. It is easy to overlook, yet it is the core of Pod networking and namespace sharing.


🎯 Core Roles

1. The Pod's "root container" (infrastructure container)

The pause container is the first container started in every Pod, and its lifecycle represents the lifecycle of the whole Pod.

Pod lifecycle:
create Pod → start pause container → start app containers → ... → app containers exit → delete pause container → Pod destroyed

2. Holding and sharing Linux namespaces

The pause container creates and holds the following namespaces, which the Pod's other containers share:

• Network namespace — the most important!
• IPC namespace (inter-process communication)
• UTS namespace (hostname)
# List the Pod's containers
docker ps | grep pause

# Typical output:
# k8s_POD_mypod_default_xxx  k8s.gcr.io/pause:3.9
# k8s_app_mypod_default_xxx  myapp:latest

🌐 Network-Namespace Sharing (the key use)

How it works

┌─────────────────── Pod ───────────────────┐
│                                           │
│  ┌─────────────┐                          │
│  │   Pause     │ ← creates the network ns │
│  │  Container  │ ← owns the Pod IP        │
│  └──────┬──────┘                          │
│         │ (shared network stack)          │
│  ┌──────┴──────┬───────────┬───────────┐  │
│  │ Container A │Container B│Container C│  │
│  │   (app)     │   (app)   │   (app)   │  │
│  └─────────────┴───────────┴───────────┘  │
│                                           │
│  All containers share:                    │
│  - the same IP address (the Pod IP)       │
│  - the same network interface             │
│  - the same port space                    │
│  - localhost access to one another        │
└───────────────────────────────────────────┘

In practice

# Example Pod
apiVersion: v1
kind: Pod
metadata:
  name: multi-container-pod
spec:
  containers:
  - name: nginx
    image: nginx
    ports:
    - containerPort: 80
  - name: sidecar
    image: busybox
    command: ['sh', '-c', 'while true; do wget -O- localhost:80; sleep 5; done']

In this example:

• the pause container creates the network namespace and receives the Pod IP (e.g. 10.244.1.5);
• the nginx container joins that namespace and listens on port 80;
• the sidecar container joins the same network namespace;
• the sidecar reaches nginx at localhost:80 because they share one network stack.

🔍 Why Is a Pause Container Needed?

Problem: imagine there were no pause container

Suppose a Pod has two containers, A and B:

Scenario 1: container A starts first and creates the network namespace
├─ container A holds the network namespace → owns the Pod IP
└─ container B joins A's network namespace

Problem: if A crashes, restarts, or is removed, the namespace disappears
→ B loses network connectivity
→ the Pod IP changes
→ Service routing breaks ❌

Solution: introduce the pause container

pause container (holds the namespaces) ← never exits on its own
├─ container A joins
└─ container B joins

Benefits:
✅ A or B crashing does not affect the network namespace
✅ the Pod IP stays stable
✅ app containers can restart independently
✅ dependencies between containers are simplified
📦 Characteristics of the Pause Container

1. Extremely small

// The pause binary is only a few dozen lines of C.
// Its core job: sleep forever (and, as PID 1 of a shared PID
// namespace, reap orphaned zombie processes via a signal handler).
int main() {
    for (;;) pause();  // suspend indefinitely, waiting for signals
    return 0;
}

Image size: roughly 700 KB (versus hundreds of MB for typical images).

2. Negligible resource usage

# Check the pause container's usage
docker stats <pause-container-id>

# Typical output:
# CPU: 0.00%
# MEM: 0.5 MiB

3. Lifecycle management

• The kubelet starts the pause container first when creating a Pod.
• Pause container exits = Pod is torn down.
• App-container restarts do not affect the pause container.

🛠️ Practical Scenarios

Scenario 1: sidecar pattern

# app + log collector
spec:
  containers:
  - name: app
    image: myapp
    volumeMounts:
    - name: logs
      mountPath: /var/log
  - name: log-collector
    image: fluentd
    volumeMounts:
    - name: logs
      mountPath: /var/log
• The pause container guarantees the two containers can communicate over the shared volume and localhost.
• Even if app restarts, log-collector keeps working.

Scenario 2: service mesh (e.g. Istio)

# app + Envoy proxy
spec:
  containers:
  - name: app
    image: myapp
    ports:
    - containerPort: 8080
  - name: istio-proxy  # Envoy sidecar
    image: istio/proxyv2
• The pause container holds the network namespace.
• The Envoy proxy intercepts all inbound and outbound traffic.
• The application is unaware the proxy exists.

Scenario 3: init containers cooperating with main containers

spec:
  initContainers:
  - name: init-config
    image: busybox
    command: ['sh', '-c', 'echo "config" > /config/app.conf']
    volumeMounts:
    - name: config
      mountPath: /config
  containers:
  - name: app
    image: myapp
    volumeMounts:
    - name: config
      mountPath: /config
• The pause container keeps the network and IPC namespaces stable throughout the whole process.

🔧 Inspecting and Debugging the Pause Container

Finding a Pod's pause container

# Option 1: via crictl (recommended)
crictl pods
crictl ps -a | grep pause

# Option 2: via docker (if using the Docker runtime)
docker ps -a | grep pause
docker inspect <pause-container-id>

# Option 3: list the Pod's app containers
kubectl get pod <pod-name> -o jsonpath='{.status.containerStatuses[*].name}'

Pause image versions

# Find the pause image the kubelet uses
kubectl get pod <pod-name> -o yaml | grep pause

# Common versions:
# registry.k8s.io/pause:3.9 (recent)
# registry.k8s.io/pause:3.8
# k8s.gcr.io/pause:3.7

⚠️ FAQ

Q1: Why can't I see the pause container?

A: kubectl get pods never shows it; it is transparent to users. Use runtime-level commands (crictl, docker ps) to see it.

Q2: Does the pause container hurt performance?

A: Practically no. It uses about 0.5 MB of memory and ~0% CPU, and runs no business logic.

Q3: Can I delete the pause container?

A: Not by hand. Deleting the pause container destroys the whole Pod.

Q4: Do different Pods share one pause container?

A: No. Every Pod has its own pause container, which preserves network and namespace isolation between Pods.


📝 Summary

• Namespace holder — creates and holds the Network, IPC, and UTS namespaces.
• Network foundation — lets all containers in a Pod share one IP and one network stack.
• Lifecycle anchor — represents the Pod's lifetime while app containers restart independently.
• Simpler architecture — decouples containers from each other and avoids cascading failures.
• Resource-efficient — a tiny image with a negligible footprint.

Bottom line: the pause container is the foundation of the Kubernetes Pod abstraction. It lets multiple containers cooperate as if they were on one host, while each stays independently restartable.


                            Mar 7, 2024

Pod DNS Resolution in Kubernetes: Flow and Order

Core Concepts

1. CoreDNS: since Kubernetes 1.11, CoreDNS is the default DNS service. It runs as one or more Pods in the kube-system namespace, behind a Kubernetes Service (usually named kube-dns).
2. resolv.conf: every Pod's /etc/resolv.conf is the blueprint for DNS resolution. The kubelet generates this file automatically and mounts it into the Pod.
3. DNS policy: configurable via the dnsPolicy field of the Pod spec.

A Pod's /etc/resolv.conf, Explained

A typical /etc/resolv.conf inside a Pod:

nameserver 10.96.0.10
search <namespace>.svc.cluster.local svc.cluster.local cluster.local
options ndots:5

Line by line:

1. nameserver 10.96.0.10

• The cluster IP of the CoreDNS Service. All Pod DNS queries go here by default.
• The IP comes from the kubelet's --cluster-dns flag, fixed at startup.

2. search <namespace>.svc.cluster.local svc.cluster.local cluster.local

• The search-domain list. When you use an incomplete name (i.e. not an FQDN), the resolver appends each search domain to the name in this order until a matching record is found.
• <namespace> is the Pod's namespace, e.g. default.
• Search order:
  • <pod-namespace>.svc.cluster.local
  • svc.cluster.local
  • cluster.local

3. options ndots:5

• A key optimization/control knob.
• Rule: if a name contains at least this many dots (here, 5), it is treated as an absolute name (FQDN) and tried directly first, skipping the search list.
• Otherwise (fewer than 5 dots), the resolver tries each search domain in turn, and only if all fail does it finally try the name as-is.
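The search/ndots rule above can be modeled with a small shell sketch (a simplified model of the resolver behavior described in this article, not the full glibc implementation):

```shell
# Print, in order, the names the resolver would try for a given
# query name, ndots value, and search-domain list (simplified model)
dns_query_order() {
  local name="$1" ndots="$2"; shift 2
  local dots="${name//[^.]/}"          # keep only the dots to count them
  if [ "${#dots}" -ge "$ndots" ]; then
    echo "$name"                       # >= ndots: treated as absolute, queried directly
  else
    local d
    for d in "$@"; do echo "$name.$d"; done   # try each search domain first
    echo "$name"                       # finally, the name itself
  fi
}

# dns_query_order my-svc 5 default.svc.cluster.local svc.cluster.local cluster.local
```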

Resolution Flow and Order, in Detail

Assume your Pod is in the default namespace, with the resolv.conf shown above.

Scenario 1: resolving a Kubernetes Service (short name)

You resolve a Service in the same namespace: my-svc.

1. The application asks to resolve my-svc.
2. The resolver counts the dots in my-svc: 0 < 5.
3. Search flow:
  • 1st try: my-svc.default.svc.cluster.local → success! Returns the ClusterIP.
  • Resolution ends.

Scenario 2: resolving a Service in another namespace

You resolve my-svc.prod, a Service in the prod namespace.

1. The application asks to resolve my-svc.prod.
2. Dot count for my-svc.prod: 1 < 5.
3. Search flow:
  • 1st try: my-svc.prod.default.svc.cluster.local → fails (the Service is not in the default namespace).
  • 2nd try: my-svc.prod.svc.cluster.local → success! Returns the ClusterIP.
  • Resolution ends.

Scenario 3: resolving an external name (e.g. www.google.com)

1. The application asks to resolve www.google.com.
2. Dot count for www.google.com: 2 < 5.
3. Search flow:
  • 1st try: www.google.com.default.svc.cluster.local → fails
  • 2nd try: www.google.com.svc.cluster.local → fails
  • 3rd try: www.google.com.cluster.local → fails
4. All search domains exhausted; the resolver finally tries the name itself: www.google.com → success! CoreDNS forwards it to the upstream DNS server (e.g. the host's DNS or the DNS configured on the network).

Scenario 4: names at the ndots boundary

Suppose you resolve a Service's full cluster name, my-svc.default.svc.cluster.local.

1. The application asks to resolve my-svc.default.svc.cluster.local.
2. Dot count: 4, still less than 5! So it still goes through the search flow.
  • The first attempt is my-svc.default.svc.cluster.local.default.svc.cluster.local, which is obviously wrong and wasted work.
  • To avoid this inefficiency, the best practice is to configure the application with an absolute name (trailing dot).

(By contrast, a StatefulSet Pod FQDN such as web-0.nginx.default.svc.cluster.local contains 5 dots, so it meets the ndots threshold and is queried directly.)

Absolute-name example: the application resolves my-svc.default.svc.cluster.local. (note the trailing dot).

• The resolver treats it as an FQDN and queries it directly, skipping all search domains. This is the most efficient form.

DNS Policies

A Pod's dnsPolicy field determines how its resolv.conf is generated:

• ClusterFirst (default): DNS queries go to the cluster's CoreDNS first; names outside the cluster domain (e.g. cluster.local) are forwarded to the upstream DNS.
• ClusterFirstWithHostNet: for Pods with hostNetwork: true that should still use the cluster DNS, set this policy.
• Default: the Pod inherits the DNS configuration of its host (the node's /etc/resolv.conf) and does not use CoreDNS.
• None: ignore all default DNS settings; you must provide custom DNS configuration via the dnsConfig field.

Summary and Flowchart

The resolution order simplifies to this decision flow:

flowchart TD
    A[App issues a DNS query] --> B{Dots '.' in the<br>name >= 5?}
    
    B -- yes<br>(treated as FQDN) --> C[Query the name directly]
    C --> D{Resolved?}
    D -- yes --> E[Return result]
    D -- no --> F[Resolution fails]
    
    B -- no<br>(treated as short name) --> G
    subgraph G [Loop over search domains]
        direction LR
        H[Append each search domain<br>to the name and query] --> I{Resolved?}
        I -- yes --> J[Return result]
    end
    
    I -- loop ends, still failing --> K[Query the original name]
    K --> L{Resolved?}
    L -- yes --> E
    L -- no --> F

Key takeaways:

1. Default path: Pod → CoreDNS Service → CoreDNS Pod → (depending on the domain) a Kubernetes record, or a forward to upstream DNS.
2. Search order: namespace → svc → cluster.local.
3. Impact of ndots:5: a trade-off between convenience and performance. For frequently accessed external names, prefer an FQDN (trailing dot) in the application, or tune the ndots option.
4. Debugging: exec into the Pod and run cat /etc/resolv.conf, nslookup, or dig as the first step of any DNS diagnosis.
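For point 4, a small helper (illustrative) that extracts the ndots value from a resolv.conf-style file, handy when scripting DNS diagnostics:

```shell
# Extract the ndots value from a resolv.conf file (prints nothing if unset)
get_ndots() {
  awk '/^options/ {
    for (i = 2; i <= NF; i++)
      if ($i ~ /^ndots:/) { split($i, a, ":"); print a[2] }
  }' "$1"
}

# Usage inside a Pod:
# get_ndots /etc/resolv.conf
```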
                            Mar 7, 2024

What Happens When You Run kubectl exec?

kubectl exec involves several components working together. A detailed breakdown:

1. Overall flow

user → kubectl → API server → kubelet → container runtime → target container

2. Step by step

Step 1: the kubectl client

kubectl exec -it <pod-name> -- /bin/bash
• kubectl parses the command arguments
• builds the exec API request
• establishes a long-lived connection to the API server

Step 2: the API server

// Example API path
POST /api/v1/namespaces/{namespace}/pods/{name}/exec
• authentication and authorization checks
• verifies the user has exec permission
• looks up the node hosting the target Pod
• proxies the request to that node's kubelet

Step 3: the kubelet

// Sketch of the kubelet's exec handling
func (h *ExecHandler) serveExec(w http.ResponseWriter, req *http.Request) {
    // look up the container
    // call the container runtime interface
    // wire up the data streams
}
• calls the container runtime via CRI (Container Runtime Interface)
• creates the connection into the container
• manages the stdin, stdout, and stderr streams

Step 4: the container runtime

// CRI interface definition
service RuntimeService {
    rpc Exec(ExecRequest) returns (ExecResponse) {}
}
• Docker: uses the docker exec machinery
• containerd: executes the command via a task
• CRI-O: manages the exec session through conmon

3. Key mechanisms

3.1 Streaming protocol

// SPDY or WebSocket
// multiplexed data streams
type StreamProtocol interface {
    Stream(stdin io.Reader, stdout, stderr io.Writer) error
}

3.2 Terminal handling (TTY)

// pseudo-terminal configuration
type ExecOptions struct {
    Stdin     io.Reader
    Stdout    io.Writer
    Stderr    io.Writer
    TTY       bool
    ptyMaster *os.File
}

3.3 Session management

// ExecSession tracks an exec session
type ExecSession struct {
    id        string
    stdinPipe io.WriteCloser
    stdoutPipe io.ReadCloser
    stderrPipe io.ReadCloser
    done      chan struct{}
}

4. Network path

kubectl client
    ↓ HTTPS with SPDY/WebSocket
API server
    ↓ proxied connection
kubelet (node)
    ↓ CRI gRPC
container runtime
    ↓ container namespaces
target container process

5. Security

5.1 Authentication and authorization

# RBAC example
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: pod-exec
rules:
- apiGroups: [""]
  resources: ["pods/exec"]
  verbs: ["create"]
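The ClusterRole above only defines the permission; to take effect it must be bound to a user or group. A sketch (the subject name is hypothetical):

```yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: pod-exec-binding
  namespace: default          # grants exec only in this namespace
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: pod-exec
subjects:
- kind: User
  name: jane                  # hypothetical user
  apiGroup: rbac.authorization.k8s.io
```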

5.2 Security context

// security configuration
securityContext := &v1.SecurityContext{
    RunAsUser:  &uid,
    RunAsGroup: &gid,
    Capabilities: &v1.Capabilities{
        Drop: []v1.Capability{"ALL"},
    },
}

6. Code sketches

kubectl side

func (o *ExecOptions) Run() error {
    // connect to the API server
    executor, err := remotecommand.NewSPDYExecutor(
        o.Config, "POST", req.URL())
    
    // run the command
    return executor.Stream(remotecommand.StreamOptions{
        Stdin:  o.In,
        Stdout: o.Out,
        Stderr: o.ErrOut,
        Tty:    o.TTY,
    })
}

Kubelet side

func (h *ExecHandler) serveExec(w http.ResponseWriter, req *http.Request) {
    // resolve the container ID
    containerID := podContainer.ContainerID
    
    // build the CRI exec request
    execRequest := &runtimeapi.ExecRequest{
        ContainerId: containerID.ID,
        Cmd:         cmd,
        Tty:         tty,
        Stdin:       stdin,
        Stdout:      stdout,
        Stderr:      stderr,
    }
    
    // call the container runtime
    runtimeService.Exec(execRequest)
}

7. Container Runtime Differences

Docker

// Uses the Docker Engine API
client.ContainerExecCreate()
client.ContainerExecAttach()

Containerd

// Uses the CRI plugin
task.Exec()

8. Troubleshooting Checklist

1. Permissions: check the RBAC configuration
2. Network connectivity: the API Server ↔ Kubelet path
3. Container state: the target container must be Running
4. Resource limits: does the container have enough resources
5. Security policies: Pod Security Policy restrictions

This design lets kubectl exec run commands inside containers safely and reliably in a distributed environment, while keeping a good user experience.

                            Mar 7, 2024

QoS Explained

Kubernetes QoS (Quality of Service) Classes

QoS classes are the mechanism Kubernetes uses to manage Pod resources and to decide eviction priority when a node runs short of resources.


🎯 The Three QoS Classes

Kubernetes assigns a QoS class automatically from a Pod's resource configuration. There are three:

1. Guaranteed — highest priority

2. Burstable — medium priority

3. BestEffort — lowest priority


📊 QoS Classes in Detail

1️⃣ Guaranteed

Conditions (all must hold)

• Every container in the Pod (including init containers) sets both requests and limits
• For every container, the CPU and memory requests equal the corresponding limits

YAML example

apiVersion: v1
kind: Pod
metadata:
  name: guaranteed-pod
spec:
  containers:
  - name: app
    image: nginx
    resources:
      requests:
        memory: "200Mi"
        cpu: "500m"
      limits:
        memory: "200Mi"  # must equal requests
        cpu: "500m"      # must equal requests

Characteristics

✅ Resource guarantee: the Pod gets everything it requested and cannot be squeezed out by other Pods
✅ Highest priority: evicted last under resource pressure
✅ Stable performance: resource usage is predictable, well suited to critical workloads
✅ OOM protection: not killed under node memory pressure (unless it exceeds its own limit)

Typical uses

• Databases (MySQL, PostgreSQL, Redis)
• Message queues (Kafka, RabbitMQ)
• Core business applications
• Stateful services

2️⃣ Burstable

Conditions (any one of the following)

• At least one container in the Pod sets requests or limits
• requests and limits are set but not equal
• Some containers set resource constraints and some do not

YAML examples

Scenario 1: only requests set

apiVersion: v1
kind: Pod
metadata:
  name: burstable-pod-1
spec:
  containers:
  - name: app
    image: nginx
    resources:
      requests:
        memory: "100Mi"
        cpu: "200m"
      # no limits set, so the container may use more than its requests

Scenario 2: requests < limits

apiVersion: v1
kind: Pod
metadata:
  name: burstable-pod-2
spec:
  containers:
  - name: app
    image: nginx
    resources:
      requests:
        memory: "100Mi"
        cpu: "200m"
      limits:
        memory: "500Mi"  # may burst up to 500Mi
        cpu: "1000m"     # may burst up to 1 core

Scenario 3: mixed configuration

apiVersion: v1
kind: Pod
metadata:
  name: burstable-pod-3
spec:
  containers:
  - name: app1
    image: nginx
    resources:
      requests:
        memory: "100Mi"
      limits:
        memory: "200Mi"
  - name: app2
    image: busybox
    resources:
      requests:
        cpu: "100m"
      # only CPU is set; no memory constraints

Characteristics

✅ Elastic usage: may consume more than its requests (burst)
⚠️ Medium priority: evicted after BestEffort Pods under pressure
⚠️ May be throttled or killed: exceeding limits means CPU throttling, or an OOM kill for memory
✅ Cost-efficient: balances resource guarantees against utilization

Typical uses

• Web applications (traffic with peaks and troughs)
• Scheduled jobs
• Batch workloads
• Microservices (most scenarios)

3️⃣ BestEffort

Condition

• No container in the Pod sets any requests or limits

YAML example

apiVersion: v1
kind: Pod
metadata:
  name: besteffort-pod
spec:
  containers:
  - name: app
    image: nginx
    # no resources section at all
  - name: sidecar
    image: busybox
    # none here either

Characteristics

❌ No resource guarantee: gets whatever the node has left
❌ Lowest priority: first to be evicted under pressure
❌ Unstable performance: can be squeezed out by other Pods
✅ Very flexible: can soak up idle node capacity

Typical uses

• Development and test environments
• Non-critical background tasks
• Log collection (tolerates interruption)
• Transient workloads

🔍 QoS Class Decision Flow

Start
  │
  ├─→ No container sets requests/limits?
  │   └─→ yes → BestEffort
  │
  ├─→ Every container has requests == limits (CPU and memory)?
  │   └─→ yes → Guaranteed
  │
  └─→ anything else → Burstable
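The decision flow above can be sketched in Go. This is a simplified model (the real logic lives in the kubelet's QoS computation and handles each resource generically); this toy version looks only at CPU and memory, with zero meaning "not set":

```go
package main

import "fmt"

// Container holds one container's requests and limits;
// zero means the field was not set.
type Container struct {
	ReqCPU, LimCPU int64 // millicores
	ReqMem, LimMem int64 // bytes (or any consistent unit)
}

// Classify mirrors the decision flow: BestEffort if nothing is set,
// Guaranteed if every container has requests == limits for both CPU
// and memory, Burstable otherwise.
func Classify(containers []Container) string {
	anySet := false
	guaranteed := true
	for _, c := range containers {
		if c.ReqCPU != 0 || c.LimCPU != 0 || c.ReqMem != 0 || c.LimMem != 0 {
			anySet = true
		}
		if c.ReqCPU == 0 || c.ReqCPU != c.LimCPU || c.ReqMem == 0 || c.ReqMem != c.LimMem {
			guaranteed = false
		}
	}
	switch {
	case !anySet:
		return "BestEffort"
	case guaranteed:
		return "Guaranteed"
	default:
		return "Burstable"
	}
}

func main() {
	fmt.Println(Classify([]Container{{}}))                                                    // BestEffort
	fmt.Println(Classify([]Container{{ReqCPU: 500, LimCPU: 500, ReqMem: 200, LimMem: 200}}))  // Guaranteed
	fmt.Println(Classify([]Container{{ReqCPU: 200, LimCPU: 1000, ReqMem: 100, LimMem: 500}})) // Burstable
}
```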

🚨 Eviction Order Under Resource Pressure

When a node runs short of a resource (e.g. memory pressure), the kubelet evicts Pods in this order:

Eviction priority (high to low):

1. BestEffort Pods
   └─→ the ones furthest over their requests go first

2. Burstable Pods
   └─→ ranked by memory usage
   └─→ the further a Pod is over its requests, the earlier it is evicted

3. Guaranteed Pods (evicted last)
   └─→ only when there is no other choice

Eviction example

# Node memory pressure scenario:
Total node memory: 8GB
In use: 7.8GB (eviction threshold reached)

Pods:
- Pod A (BestEffort): using 1GB → evicted first ❌
- Pod B (Burstable):  requests=200Mi, using 500Mi → second ❌
- Pod C (Burstable):  requests=500Mi, using 600Mi → third ❌
- Pod D (Guaranteed): requests=limits=1GB, using 1GB → kept ✅
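The ordering in the example can be modeled as a sort. This is a simplified sketch (the real kubelet ranking also weighs Pod priority and uses per-resource working-set stats): QoS class first, then how far memory usage exceeds requests:

```go
package main

import (
	"fmt"
	"sort"
)

type Pod struct {
	Name       string
	QoS        string // "BestEffort", "Burstable", "Guaranteed"
	RequestMem int64  // Mi
	UsageMem   int64  // Mi
}

// qosRank: lower rank means evicted earlier.
func qosRank(q string) int {
	switch q {
	case "BestEffort":
		return 0
	case "Burstable":
		return 1
	default: // Guaranteed
		return 2
	}
}

// SortByEviction orders pods so that the first element is the first
// eviction candidate under memory pressure.
func SortByEviction(pods []Pod) {
	sort.SliceStable(pods, func(i, j int) bool {
		if qosRank(pods[i].QoS) != qosRank(pods[j].QoS) {
			return qosRank(pods[i].QoS) < qosRank(pods[j].QoS)
		}
		// Within a class, the pod exceeding its requests by more goes first.
		return pods[i].UsageMem-pods[i].RequestMem > pods[j].UsageMem-pods[j].RequestMem
	})
}

func main() {
	pods := []Pod{
		{"pod-d", "Guaranteed", 1024, 1024},
		{"pod-b", "Burstable", 200, 500},
		{"pod-a", "BestEffort", 0, 1024},
		{"pod-c", "Burstable", 500, 600},
	}
	SortByEviction(pods)
	for _, p := range pods {
		fmt.Println(p.Name)
	}
	// pod-a, pod-b (300Mi over), pod-c (100Mi over), pod-d
}
```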

📝 Checking a Pod's QoS Class

Method 1: kubectl describe

kubectl describe pod <pod-name>

# The output includes:
# QoS Class:       Burstable

Method 2: kubectl get

# Show the QoS of every Pod
kubectl get pods -o custom-columns=NAME:.metadata.name,QOS:.status.qosClass

# Output:
# NAME              QOS
# nginx-guaranteed  Guaranteed
# app-burstable     Burstable
# test-besteffort   BestEffort

Method 3: YAML output

kubectl get pod <pod-name> -o yaml | grep qosClass

# Output:
# qosClass: Burstable

🎨 QoS Best Practices

Recommended production configurations

Critical workloads — Guaranteed

apiVersion: apps/v1
kind: Deployment
metadata:
  name: critical-app
spec:
  replicas: 3
  template:
    spec:
      containers:
      - name: app
        image: myapp:v1
        resources:
          requests:
            memory: "2Gi"
            cpu: "1000m"
          limits:
            memory: "2Gi"      # requests == limits
            cpu: "1000m"

General workloads — Burstable

apiVersion: apps/v1
kind: Deployment
metadata:
  name: web-app
spec:
  replicas: 5
  template:
    spec:
      containers:
      - name: web
        image: nginx:latest
        resources:
          requests:
            memory: "256Mi"    # guaranteed floor
            cpu: "200m"
          limits:
            memory: "512Mi"    # allowed to burst to 2x
            cpu: "500m"

Background jobs — BestEffort or Burstable

apiVersion: batch/v1
kind: CronJob
metadata:
  name: cleanup-job
spec:
  schedule: "0 2 * * *"
  jobTemplate:
    spec:
      template:
        spec:
          containers:
          - name: cleanup
            image: cleanup:v1
            resources:
              requests:
                memory: "128Mi"
                cpu: "100m"
              # no limits, so the job can use idle capacity

🔧 QoS and Resource Limits

CPU limit behavior

resources:
  requests:
    cpu: "500m"    # at least 0.5 core guaranteed
  limits:
    cpu: "1000m"   # at most 1 core
• requests: the basis for scheduling; the guaranteed share
• limits: a hard cap; exceeding it causes throttling, not a kill
• When the limit is hit, the process is CPU-throttled and performance degrades

Memory limit behavior

resources:
  requests:
    memory: "256Mi"  # at least 256Mi guaranteed
  limits:
    memory: "512Mi"  # at most 512Mi
• requests: a scheduling guarantee, but the container may use more
• limits: a hard cap; exceeding it triggers an OOM kill 💀
• The container is marked OOMKilled and restarted
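The two CPU numbers behave differently because they land in different cgroup knobs: the request becomes a relative weight (cpu.shares), the limit a hard quota per CFS scheduling period. A rough Go sketch of the conversion — the formulas mirror the shape of the kubelet's millicore-to-cgroup helpers, assuming the default 100ms CFS period and ignoring minimum-value clamping:

```go
package main

import "fmt"

const cfsPeriodUs = 100000 // default CFS period: 100ms, in microseconds

// sharesFromRequest: cpu.shares encodes the *request* — the container's
// relative weight when the node's CPUs are contended.
func sharesFromRequest(milliCPU int64) int64 {
	return milliCPU * 1024 / 1000
}

// quotaFromLimit: cfs_quota_us encodes the *limit* — the hard runtime
// budget per period; using it up means throttling, not a kill.
func quotaFromLimit(milliCPU int64) int64 {
	return milliCPU * cfsPeriodUs / 1000
}

func main() {
	fmt.Println(sharesFromRequest(500)) // 512    (requests: cpu 500m)
	fmt.Println(quotaFromLimit(1000))   // 100000 (limits: cpu 1000m = one full period)
}
```

Memory has no throttling analogue, which is why exceeding a memory limit ends in an OOM kill rather than a slowdown.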

🛠️ FAQ

Q1: Why does my Pod keep getting evicted?

# Check the QoS class
kubectl get pod <pod-name> -o yaml | grep qosClass

# If it is BestEffort or Burstable, consider:
# 1. setting reasonable requests
# 2. upgrading critical services to Guaranteed
# 3. adding node capacity

Q2: How do I set default resource limits for all Pods?

# Use a LimitRange
apiVersion: v1
kind: LimitRange
metadata:
  name: default-limits
  namespace: default
spec:
  limits:
  - default:              # default limits
      cpu: "500m"
      memory: "512Mi"
    defaultRequest:       # default requests
      cpu: "100m"
      memory: "128Mi"
    type: Container

Q3: Can Guaranteed Pods be evicted too?

Yes! But only when:

• they exceed their own limits (OOM kill)
• the node itself becomes unavailable (e.g. node down)
• the Pod is deleted manually
• a DaemonSet or system-level Pod needs the resources

Q4: How do I monitor QoS-related problems?

# Check node resource pressure
kubectl describe node <node-name> | grep -A 5 "Conditions:"

# List evicted Pods
kubectl get events --field-selector reason=Evicted

# List OOM events
kubectl get events --field-selector reason=OOMKilling

📊 QoS Class Comparison

| Dimension | Guaranteed | Burstable | BestEffort |
|---|---|---|---|
| Configuration | requests = limits | requests ≠ limits, or partially set | none |
| Resource guarantee | ✅ fully guaranteed | ⚠️ partially guaranteed | ❌ none |
| Eviction priority | lowest (evicted last) | medium | highest (evicted first) |
| Performance stability | high | medium | low |
| Resource utilization | low (fixed resources) | high (can burst) | highest (uses all idle capacity) |
| Cost | high | medium | low |
| Typical use | critical workloads | general workloads | test / transient jobs |

🎯 Choosing a Class

When to use Guaranteed

• 🗄️ Databases (MySQL, MongoDB, Cassandra)
• 📨 Message queues (Kafka, RabbitMQ)
• 🔐 Authentication services
• 💰 Payment systems
• 📊 Real-time data processing

When to use Burstable

• 🌐 Web applications (80% of cases)
• 🔄 API services
• 🎨 Frontends
• 📦 Microservices
• ⚙️ Background processing

When to use BestEffort

• 🧪 Development and testing
• 📝 Log collection (tolerates interruption)
• 🔍 Data exploration
• 🛠️ One-off scripts

💡 Key Takeaways

1. QoS is assigned automatically from the resource configuration; it cannot be set directly
2. Guaranteed ≠ never evicted; it is simply last in the eviction order
3. In production use at least Burstable; avoid BestEffort
4. requests drive scheduling; limits constrain the runtime
5. Exceeding the memory limit means an OOM kill; exceeding the CPU limit means throttling
6. Use a LimitRange to enforce resource limits and keep BestEffort Pods out
                            Mar 7, 2024

Scheduler

The Kubernetes scheduler (kube-scheduler) is one of the system's key components: it decides which Node each Pod should run on.

Below is a layered, step-by-step walkthrough of the scheduling flow (as of v1.28+) and the mechanisms behind it.


🌐 Architecture Overview

The scheduler's main responsibilities:

1. Watch for unscheduled Pods (Pods whose spec.nodeName is empty)
2. Pick the most suitable Node for each Pod
3. Write the binding result back to the apiserver

🧩 I. Overall Flow

Broadly, scheduling has three stages — filtering, scoring, and binding — refined into this pipeline:

[Pending Pod] --> [Scheduling Queue]
     ↓
 [PreFilter] → [Filter] → [PostFilter] → [Score] → [Reserve] → [Permit] → [Bind]

1️⃣ Entry point: watching unbound Pods

• The scheduler watches all Pod resources through an informer.
• A Pod with no spec.nodeName is considered unscheduled.
• Such Pods are placed into the SchedulingQueue.

🧮 II. The Core Phases in Detail

🧩 1. PreFilter

Preparatory checks run before node selection, for example:

• Parse the Pod's resource requirements.
• Sanity-check PVCs, affinity, and taints/tolerations.
• Precompute the topology-spread data needed later.

🧠 Think of it as preprocessing: the data the Filter phase needs is prepared up front.


🧩 2. Filter (formerly "Predicates")

The scheduler walks all schedulable Nodes and keeps only those that satisfy the Pod's constraints.

Common filter plugins:

| Plugin | Purpose |
|---|---|
| NodeUnschedulable | drops nodes marked unschedulable |
| NodeName | if the Pod specifies nodeName, only that node matches |
| TaintToleration | checks taints against tolerations |
| NodeAffinity / PodAffinity | checks (anti-)affinity rules |
| NodeResourcesFit | checks whether CPU, memory, etc. suffice |
| VolumeBinding | checks whether the Pod's PVCs can be mounted on the node |

🔎 Output:

a list of candidate nodes (often dozens to hundreds).
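The filtering step can be sketched as a pure function over a node list. This toy version combines just two of the plugins above — NodeUnschedulable and a resource-fit check against the node's allocatable capacity (the real NodeResourcesFit also handles extended resources and scoring strategies):

```go
package main

import "fmt"

type Node struct {
	Name               string
	AllocCPU, AllocMem int64 // allocatable (millicores, Mi)
	ReqCPU, ReqMem     int64 // already requested by scheduled pods
	Unschedulable      bool
}

type PodReq struct{ CPU, Mem int64 }

// Filter keeps nodes that are schedulable and still have room for the
// pod's requests.
func Filter(pod PodReq, nodes []Node) []string {
	var fit []string
	for _, n := range nodes {
		if n.Unschedulable {
			continue // NodeUnschedulable would veto here
		}
		if n.ReqCPU+pod.CPU <= n.AllocCPU && n.ReqMem+pod.Mem <= n.AllocMem {
			fit = append(fit, n.Name)
		}
	}
	return fit
}

func main() {
	nodes := []Node{
		{Name: "node-1", AllocCPU: 4000, AllocMem: 8192, ReqCPU: 3800, ReqMem: 4096},
		{Name: "node-2", AllocCPU: 4000, AllocMem: 8192, ReqCPU: 1000, ReqMem: 2048},
		{Name: "node-3", AllocCPU: 8000, AllocMem: 16384, Unschedulable: true},
	}
	// node-1 lacks CPU headroom, node-3 is cordoned
	fmt.Println(Filter(PodReq{CPU: 500, Mem: 1024}, nodes)) // [node-2]
}
```

Note that filtering works against *requested* resources, not actual usage — which is exactly why setting sensible requests matters for scheduling quality.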


🧩 3. PostFilter

• If no node passed the filters (i.e. scheduling failed), preemption logic kicks in.
• The scheduler tries to preempt lower-priority Pods on some node so that the high-priority Pod can be placed.

🧩 4. Score (ranking)

The remaining candidates are scored. Each plugin scores every node (0–100), and the results are combined with weights.

Common scoring plugins:

| Plugin | Purpose |
|---|---|
| LeastAllocated | nodes with the least resource usage score higher |
| BalancedAllocation | nodes with more balanced CPU/memory usage score higher |
| NodeAffinity | nodes matching affinity preferences gain points |
| ImageLocality | nodes that already cache the image gain points |
| InterPodAffinity | nodes satisfying inter-Pod affinity gain points |

The scores pass through NormalizeScore into a common range, then are summed.

Final output:

the best node (highest total score)
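As an illustration of one such plugin, here is a sketch in the shape of the "least allocated" strategy: score each node by the average free fraction of CPU and memory after placing the pod, on a 0–100 scale, and pick the maximum (the real plugin is weight-configurable and normalized across plugins):

```go
package main

import "fmt"

type Node struct {
	Name               string
	AllocCPU, AllocMem int64 // allocatable
	ReqCPU, ReqMem     int64 // already requested
}

// leastAllocatedScore favors emptier nodes: the average free fraction
// of CPU and memory after placing the pod, scaled to 0-100.
func leastAllocatedScore(n Node, podCPU, podMem int64) int64 {
	cpu := (n.AllocCPU - n.ReqCPU - podCPU) * 100 / n.AllocCPU
	mem := (n.AllocMem - n.ReqMem - podMem) * 100 / n.AllocMem
	return (cpu + mem) / 2
}

// PickBest returns the highest-scoring node.
func PickBest(nodes []Node, podCPU, podMem int64) string {
	best, bestScore := "", int64(-1)
	for _, n := range nodes {
		if s := leastAllocatedScore(n, podCPU, podMem); s > bestScore {
			best, bestScore = n.Name, s
		}
	}
	return best
}

func main() {
	nodes := []Node{
		{"node-1", 4000, 8192, 3000, 6144}, // mostly full
		{"node-2", 4000, 8192, 1000, 2048}, // mostly empty
	}
	fmt.Println(PickBest(nodes, 500, 1024)) // node-2
}
```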


🧩 5. Reserve

Resources are tentatively reserved on the chosen node (marked in the scheduler's internal cache) to prevent conflicts between concurrent scheduling cycles.

If a later phase fails, Unreserve rolls the reservation back.


🧩 6. Permit

Some plugins can validate, or wait, one more time before binding, for example:

• PodGroup (gang/batch scheduling)
• custom Scheduler Framework policies

Possible results:

• Success → proceed to binding
• Wait → wait for an event
• Reject → give up scheduling

🧩 7. Bind

Finally, the Bind plugin calls the API to bind the Pod to the Node:

spec:
  nodeName: node-123

Once bound, the kubelet on that node notices the Pod and starts its containers.


⚙️ III. The Scheduler Framework

Since v1.19 the scheduler has been built as a plugin framework; every phase is an extension point:

| Phase | Plugin interface | Example plugin |
|---|---|---|
| PreFilter | PreFilterPlugin | PodTopologySpread |
| Filter | FilterPlugin | NodeAffinity |
| Score | ScorePlugin | LeastAllocated |
| Reserve | ReservePlugin | VolumeBinding |
| Bind | BindPlugin | DefaultBinder |

You can extend the scheduling logic by writing a custom scheduler plugin in Go.
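To show the shape of the framework idea — extension points that veto and rank — here is a self-contained toy, not the real k8s.io scheduler framework API (whose plugins receive CycleState, NodeInfo, and return Status objects). Filters veto nodes; scores of the survivors are summed and the best node wins:

```go
package main

import "fmt"

// Hypothetical, simplified stand-ins for the framework's two central
// extension points.
type FilterPlugin func(node string) bool
type ScorePlugin func(node string) int

type Framework struct {
	filters []FilterPlugin
	scores  []ScorePlugin
}

// Schedule runs every filter, then sums the scores of the surviving
// nodes and returns the best one.
func (f *Framework) Schedule(nodes []string) (string, bool) {
	best, bestScore, found := "", -1, false
nodeLoop:
	for _, n := range nodes {
		for _, filter := range f.filters {
			if !filter(n) {
				continue nodeLoop // vetoed
			}
		}
		total := 0
		for _, score := range f.scores {
			total += score(n)
		}
		if total > bestScore {
			best, bestScore, found = n, total, true
		}
	}
	return best, found
}

func main() {
	tainted := map[string]bool{"node-3": true}     // stands in for TaintToleration
	imageCached := map[string]bool{"node-2": true} // stands in for ImageLocality

	fw := &Framework{
		filters: []FilterPlugin{func(n string) bool { return !tainted[n] }},
		scores: []ScorePlugin{func(n string) int {
			if imageCached[n] {
				return 10
			}
			return 0
		}},
	}
	node, ok := fw.Schedule([]string{"node-1", "node-2", "node-3"})
	fmt.Println(node, ok) // node-2 true
}
```

The point of the design is composability: adding a policy means registering another plugin at the right phase, not modifying the core loop.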


🧭 IV. When Scheduling Fails

Common causes:

| Cause | Symptom |
|---|---|
| No node has enough resources | the Pod stays Pending |
| Affinity rules too strict | no node satisfies the Pod |
| PVC cannot bind | failure in the VolumeBinding phase |
| Node tainted | no matching toleration |
| Image pull failure | the Pod is bound but its containers never start (a kubelet issue, not a scheduling one) |

🧠 V. Summary

| Phase | Purpose | Key point |
|---|---|---|
| SchedulingQueue | buffer pending Pods | FIFO plus priority ordering |
| PreFilter | prepare data | validate the Pod's requirements |
| Filter | eliminate nodes | resources and constraints |
| Score | rank the rest | balance and locality |
| Reserve | reserve resources | avoid concurrent conflicts |
| Bind | bind to a Node | the decision takes effect |

                            Mar 7, 2024

Service Discovery

The most common answer is "two core mechanisms", but that refers to the two basic patterns of service discovery, not to concrete implementations.


Dimension 1: the two core patterns

This split follows the underlying principle.

1. Client-side service discovery

  • How it works: the client (service consumer) queries a central service registry (Consul, Eureka, Zookeeper, …) for the list of available instances (usually IPs and ports), picks one itself, and calls it directly.
  • Analogy: at a restaurant you first read the menu board (the registry) yourself, decide what to order, then tell the waiter.
  • Characteristics: the client embeds the discovery logic and is coupled to the registry. More flexible, but the client grows more complex.

2. Server-side service discovery

  • How it works: the client does not care about individual instances; it sends requests to one fixed endpoint (typically a load balancer or proxy, such as a Kubernetes Service). That endpoint consults the registry, load-balances, and forwards the request to an instance.
  • Analogy: you just tell the waiter "the house special"; the waiter (the load balancer) deals with the kitchen (the instances) and brings your dish.
  • Characteristics: the client stays simple and knows nothing about discovery. This is what Kubernetes uses by default.

Dimension 2: the concrete mechanisms in Kubernetes

Inside Kubernetes, several concrete mechanisms together form its service discovery capability.

1. Environment variables

When a Pod is scheduled onto a node, the kubelet injects a set of environment variables for every Service that exists in the cluster at that moment.

• Format: {SVCNAME}_SERVICE_HOST and {SVCNAME}_SERVICE_PORT
• Example: a Service named redis-master yields REDIS_MASTER_SERVICE_HOST=10.0.0.11 and REDIS_MASTER_SERVICE_PORT=6379.
• Limitation: the variables must exist before the Pod is created. A Service created later cannot inject variables into already-running Pods. This is therefore only a secondary mechanism.
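The naming rule can be sketched in a few lines of Go — uppercase the Service name and replace dashes with underscores (a simplified model; the kubelet also emits docker-link-style variables like {SVCNAME}_PORT_…):

```go
package main

import (
	"fmt"
	"strings"
)

// serviceEnvVars derives the injected variables from a Service name:
// uppercase, with dashes turned into underscores.
func serviceEnvVars(name, clusterIP string, port int) []string {
	prefix := strings.ToUpper(strings.ReplaceAll(name, "-", "_"))
	return []string{
		fmt.Sprintf("%s_SERVICE_HOST=%s", prefix, clusterIP),
		fmt.Sprintf("%s_SERVICE_PORT=%d", prefix, port),
	}
}

func main() {
	for _, v := range serviceEnvVars("redis-master", "10.0.0.11", 6379) {
		fmt.Println(v)
	}
	// REDIS_MASTER_SERVICE_HOST=10.0.0.11
	// REDIS_MASTER_SERVICE_PORT=6379
}
```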

2. DNS (the core, recommended mechanism)

This is the primary and most elegant form of service discovery in Kubernetes.

• How it works: the cluster runs a built-in DNS server (usually CoreDNS). When you create a Service, Kubernetes automatically registers a DNS record for it.
• Record format:
  • Same namespace: <service-name>.<namespace>.svc.cluster.local → resolves to the Service's Cluster IP.
    • Within one namespace, the bare <service-name> is enough. For example, a frontend Pod reaches the backend simply via http://backend-service.
  • Across namespaces: use the fully qualified name, e.g. backend-service.production.svc.cluster.local.
• Advantage: standard behavior; applications need no code changes, just a hostname.
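The naming rule above reduces to a small helper (a sketch; real resolution also relies on the Pod's resolv.conf search domains, which is what makes the short name work):

```go
package main

import "fmt"

// serviceDNSName returns the name a client should use: within its own
// namespace the short name suffices; across namespaces, the fully
// qualified form is needed.
func serviceDNSName(service, svcNamespace, clientNamespace string) string {
	if svcNamespace == clientNamespace {
		return service
	}
	return fmt.Sprintf("%s.%s.svc.cluster.local", service, svcNamespace)
}

func main() {
	fmt.Println(serviceDNSName("backend-service", "production", "production"))
	// backend-service
	fmt.Println(serviceDNSName("backend-service", "production", "frontend"))
	// backend-service.production.svc.cluster.local
}
```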

3. Kubernetes Service

The Service object itself is the vehicle of service discovery: a stable endpoint (a VIP or DNS name) in front of a changing set of Pods.

• ClusterIP: the default type; a cluster-internal virtual IP, reachable only from inside the cluster. Combined with DNS, it is the foundation of service-to-service traffic.
• NodePort: on top of ClusterIP, exposes a static port on every node; reachable from outside via <NodeIP>:<NodePort>.
• LoadBalancer: on top of NodePort, provisions a cloud provider's load balancer with an external IP for the Service. The main way to expose a service publicly.
• Headless Service: a special Service for when you need neither load balancing nor a single Service IP, created with clusterIP: None. DNS queries then return the IPs of all backing Pods instead of one VIP. Common for stateful systems (Kafka, MySQL clusters) that do their own load balancing or peer coordination.

4. Ingress

Ingress is primarily a layer-7 proxy for HTTP/HTTPS routing rules, but it is also a higher-level form of service discovery.

• Its rules route external traffic to the appropriate Service inside the cluster.
• External clients discover and reach backend services through the Ingress Controller's address.

Summary and comparison

| Mechanism | Principle | Use case | Notes |
|---|---|---|---|
| Environment variables | Service info injected into Pod env | legacy apps; a secondary aid | simple, but must predate the Pod |
| DNS | a domain name auto-registered per Service | the standard for service-to-service traffic | recommended; no code changes |
| Service | a stable virtual IP or DNS name | the core abstraction for exposure and load balancing | the cornerstone of Kubernetes discovery |
| Ingress | layer-7 HTTP routing | exposing web services by host/path | a more API-gateway-like pattern |

Conclusion:

• Pattern-wise, Kubernetes mainly uses server-side service discovery.
• Implementation-wise, it is a complete system with DNS at the core and Service as the foundation, complemented by environment variables and Ingress.

So when someone asks "what are the service discovery mechanisms in Kubernetes?", the most accurate answer is: two internal mechanisms, DNS and environment variables, both built on the core Service abstraction, with Ingress providing external-to-internal discovery and routing.

                            Mar 7, 2024

Service VS Endpoint

Service and Endpoint/EndpointSlice have a clear division of labor in Kubernetes; together they form the foundation of service discovery and load balancing. The differences in detail:

I. Core Roles

Service — the abstraction layer

apiVersion: v1
kind: Service
metadata:
  name: web-service
spec:
  selector:
    app: web-server
  ports:
    - protocol: TCP
      port: 80           # service port
      targetPort: 8080   # backend Pod port
  type: ClusterIP        # service type

What a Service provides:

• Service abstraction: a stable virtual IP and DNS name
• Access point: defines how clients reach the service
• Load-balancing policy: how traffic is distributed
• Service types: ClusterIP, NodePort, LoadBalancer, ExternalName

Endpoint/EndpointSlice — the backend layer

apiVersion: v1
kind: Endpoints
metadata:
  name: web-service      # must have the same name as the Service
subsets:
  - addresses:
    - ip: 10.244.1.5
      targetRef:
        kind: Pod
        name: web-pod-1
    - ip: 10.244.1.6
      targetRef:
        kind: Pod
        name: web-pod-2
    ports:
    - port: 8080
      protocol: TCP

What Endpoints provide:

• Backend discovery: the actual Pod IPs currently available
• Health state: only Pods passing their readiness probes are listed
• Dynamic updates: reflects backend Pod changes in real time
• Port mapping: maintains the Service-port → Pod-port mapping

II. Feature Comparison

| Feature | Service | Endpoint/EndpointSlice |
|---|---|---|
| Abstraction level | logical layer | physical/backend layer |
| Contents | virtual IP, ports, selector | actual Pod IPs and ports |
| Stability | stable VIP and DNS | a constantly changing IP list |
| Creation | defined by hand | generated automatically (or by hand) |
| Update frequency | changes rarely | updates frequently |
| DNS resolution | resolves to the Service IP | not directly involved in DNS |
| Load balancing | defines the policy | supplies the targets |

III. How They Work Together

1. Request path

client request → Service VIP → kube-proxy → Endpoints → actual Pod
    ↓           ↓           ↓           ↓           ↓
  DNS lookup   virtual IP  iptables/   backend IP   the container
               10.96.x.x   IPVS rules  10.244.x.x   serving the app

2. Data-flow example

# Client access
curl http://web-service.default.svc.cluster.local

# DNS resolves to the Service IP
nslookup web-service.default.svc.cluster.local
# Returns: 10.96.123.45

# kube-proxy forwards according to the Endpoints
iptables -t nat -L KUBE-SERVICES | grep 10.96.123.45
# Forwards to: 10.244.1.5:8080, 10.244.1.6:8080
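Conceptually, the last hop is "pick one address from the Endpoints list". The sketch below models that as a round-robin picker — note this is only a mental model: kube-proxy programs iptables/IPVS rules in the kernel (iptables mode actually picks backends randomly) rather than proxying in user space:

```go
package main

import "fmt"

// Endpoints mirrors the part of the object we care about here:
// the ready backend addresses for one Service.
type Endpoints struct {
	addrs []string
	next  int
}

// Pick returns the next backend, round-robin.
func (e *Endpoints) Pick() string {
	a := e.addrs[e.next%len(e.addrs)]
	e.next++
	return a
}

func main() {
	ep := &Endpoints{addrs: []string{"10.244.1.5:8080", "10.244.1.6:8080"}}
	for i := 0; i < 3; i++ {
		fmt.Println(ep.Pick())
	}
	// 10.244.1.5:8080
	// 10.244.1.6:8080
	// 10.244.1.5:8080
}
```

When a Pod fails its readiness probe it simply drops out of `addrs`, and traffic shifts to the remaining backends — no Service change required.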

IV. When to Use Which

Service use cases

# 1. Internal access
apiVersion: v1
kind: Service
metadata:
  name: internal-api
spec:
  type: ClusterIP
  selector:
    app: api-server
  ports:
    - port: 8080

# 2. External access
apiVersion: v1
kind: Service
metadata:
  name: external-web
spec:
  type: LoadBalancer
  selector:
    app: web-frontend
  ports:
    - port: 80
      nodePort: 30080

# 3. Proxying an external service
apiVersion: v1
kind: Service
metadata:
  name: external-database
spec:
  type: ExternalName
  externalName: database.example.com

Endpoints use cases

# 1. Automatic backend management (the default)
# Kubernetes maintains the Endpoints matching the Service's selector

# 2. Integrating an external service
apiVersion: v1
kind: Service
metadata:
  name: legacy-system
spec:
  ports:
    - port: 3306
---
apiVersion: v1
kind: Endpoints
metadata:
  name: legacy-system
subsets:
  - addresses:
    - ip: 192.168.1.100  # external database
    ports:
    - port: 3306

# 3. Complex multi-port services
apiVersion: v1
kind: Service
metadata:
  name: complex-app
spec:
  ports:
  - name: http
    port: 80
  - name: https
    port: 443
  - name: metrics
    port: 9090

                             5. Configuration and Management Differences

                             Service: configuration focus

                            apiVersion: v1
                            kind: Service
                            metadata:
                              name: optimized-service
                               annotations:
                                 # load-balancer type
                                 service.beta.kubernetes.io/aws-load-balancer-type: "nlb"
                                 # backend protocol (session affinity itself is set below)
                                 service.beta.kubernetes.io/aws-load-balancer-backend-protocol: "http"
                            spec:
                              type: LoadBalancer
                              selector:
                                app: optimized-app
                              sessionAffinity: ClientIP
                              sessionAffinityConfig:
                                clientIP:
                                  timeoutSeconds: 10800
                              ports:
                              - name: http
                                port: 80
                                targetPort: 8080
                               # traffic policy (applies to external traffic only)
                              externalTrafficPolicy: Local

                             Endpoints: configuration focus

                            apiVersion: v1
                            kind: Endpoints
                            metadata:
                              name: custom-endpoints
                              labels:
                                 # label used for selection (e.g. by network policies)
                                environment: production
                            subsets:
                            - addresses:
                              - ip: 10.244.1.10
                                nodeName: worker-1
                                targetRef:
                                  kind: Pod
                                  name: app-pod-1
                                  namespace: production
                              - ip: 10.244.1.11
                                nodeName: worker-2  
                                targetRef:
                                  kind: Pod
                                  name: app-pod-2
                                  namespace: production
                               # multi-port definition
                              ports:
                              - name: http
                                port: 8080
                                protocol: TCP
                              - name: metrics
                                port: 9090
                                protocol: TCP
                              - name: health
                                port: 8081
                                protocol: TCP

                             6. Monitoring and Debugging Differences

                             Service: monitoring focus

                             # Check Service status
                            kubectl get services
                            kubectl describe service web-service
                            
                             # Service-related metrics
                             # (note: kubectl top only supports nodes and pods, not services)
                             kubectl get --raw /api/v1/namespaces/default/services/web-service/proxy/metrics
                            
                             # Test DNS resolution
                            kubectl run test-$RANDOM --image=busybox --rm -it -- nslookup web-service

                             Endpoints: monitoring focus

                             # Check backend availability
                            kubectl get endpoints
                            kubectl describe endpoints web-service
                            
                             # Verify backend Pod status
                            kubectl get pods -l app=web-server -o wide
                            
                             # Check readiness probes
                            kubectl get pods -l app=web-server -o jsonpath='{.items[*].spec.containers[*].readinessProbe}'
                            
                             # Test backend connectivity directly
                             kubectl run test-$RANDOM --image=busybox --rm -it -- sh
                             # inside the container: telnet 10.244.1.5 8080

                             7. Performance Considerations

                             Service: performance tuning

                            apiVersion: v1
                            kind: Service
                            metadata:
                              name: high-performance
                               # Note: IPVS mode and its scheduler (e.g. wrr) are kube-proxy
                               # settings (--proxy-mode=ipvs --ipvs-scheduler=wrr), not
                               # per-Service annotations.
                             spec:
                               type: ClusterIP
                               clusterIP: None  # Headless Service: removes one layer of proxying
                              selector:
                                app: high-perf-app

                             Endpoints: performance tuning

                             # Use EndpointSlice for better performance in large clusters
                            apiVersion: discovery.k8s.io/v1
                            kind: EndpointSlice
                            metadata:
                              name: web-service-abc123
                              labels:
                                kubernetes.io/service-name: web-service
                            addressType: IPv4
                            ports:
                            - name: http
                              protocol: TCP
                              port: 8080
                            endpoints:
                            - addresses:
                              - "10.244.1.5"
                              conditions:
                                ready: true
                               # topology-aware hints to optimize routing
                              zone: us-west-2a
                              hints:
                                forZones:
                                - name: us-west-2a

                             8. Summary

                             | Dimension | Service | Endpoints/EndpointSlice |
                             |---|---|---|
                             | Role | service facade | backend implementation |
                             | Stability | high (stable VIP/DNS) | low (IPs change dynamically) |
                             | Concern | how the service is accessed | who can be accessed |
                             | Update frequency | low | high, updated automatically |
                             | Network layer | L4 load balancing | backend target discovery |
                             | Extensibility | via Service types | via EndpointSlice |

                             A simple analogy:

                             • A Service is like a restaurant's front desk and menu: a single, stable entry point and way to order.
                             • Endpoints are like the roster of cooks in the kitchen: the actual people (and their stations) doing the work.

                             The two work together: the Service defines "what service is available", the Endpoints define "who provides it", and together they implement Kubernetes's powerful service discovery and load balancing.

                            Mar 7, 2024

                            StatefulSet

                             How StatefulSet concretely solves the challenges of stateful applications


                             The four core mechanisms of StatefulSet

                             StatefulSet provides stability and predictability for stateful applications through a set of carefully designed mechanisms.

                             1. Stable network identity

                             Problem solved: stateful applications (such as database nodes) need stable hostnames to discover and talk to each other; random names will not do.

                             How StatefulSet implements it

                             • Fixed Pod names: Pod names follow a fixed pattern: <statefulset-name>-<ordinal-index>
                               • For example: redis-cluster-0, redis-cluster-1, redis-cluster-2
                             • Stable DNS records: every Pod automatically gets a unique, stable DNS record:
                               • Format: <pod-name>.<svc-name>.<namespace>.svc.cluster.local
                               • Example: redis-cluster-0.redis-service.default.svc.cluster.local

                             Scenario

                             • In a Redis cluster, redis-cluster-0 can tell redis-cluster-1: "my address is redis-cluster-0.redis-service". That address never changes for the Pod's entire life, even if it is rescheduled onto another node.
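
                             For these per-Pod DNS records to exist, the StatefulSet must be paired with a headless Service referenced by serviceName. A minimal sketch (the names redis-service and redis-cluster are illustrative, not from the original):

                             ```yaml
                             # Headless Service: gives each StatefulSet Pod its own DNS record
                             apiVersion: v1
                             kind: Service
                             metadata:
                               name: redis-service
                             spec:
                               clusterIP: None          # headless: no VIP, DNS returns Pod IPs
                               selector:
                                 app: redis-cluster
                               ports:
                               - port: 6379
                             ---
                             apiVersion: apps/v1
                             kind: StatefulSet
                             metadata:
                               name: redis-cluster
                             spec:
                               serviceName: redis-service   # must point at the headless Service
                               replicas: 3
                               selector:
                                 matchLabels:
                                   app: redis-cluster
                               template:
                                 metadata:
                                   labels:
                                     app: redis-cluster
                                 spec:
                                   containers:
                                   - name: redis
                                     image: redis:7
                                     ports:
                                     - containerPort: 6379
                             ```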

                             2. Ordered deployment and management

                             Problem solved: clustered applications such as Zookeeper and Etcd need their nodes to start and join the cluster in order, and primary/replica databases need the primary node up first.

                             How StatefulSet implements it

                             • Ordered deployment: when a StatefulSet is created, Pods are created strictly in ordinal order (0, 1, 2…). Pod-1 is only created after Pod-0 is fully Ready.
                             • Ordered scaling:
                               • Scale up: new Pods are created in order (scaling from 3 to 5 creates pod-3 first, then pod-4).
                               • Scale down: Pods are terminated in reverse order (pod-4 first, then pod-3).
                             • Ordered rolling updates: these likewise run in reverse ordinal order, keeping most nodes available throughout the update.

                             Scenario

                             • When deploying a MySQL primary/replica cluster, the StatefulSet ensures that mysql-0 (the primary) starts and finishes initializing before mysql-1 (a replica) is started, so the replica can connect to the primary for data synchronization right from boot.

                             3. Stable persistent storage

                             This is StatefulSet's most important feature!

                             Problem solved: a stateful application's data must be persisted, and when a Pod fails or is rescheduled onto a new node, it must be able to remount its own share of the data.

                             How StatefulSet implements it

                             • Volume Claim Template: in the StatefulSet's YAML you can define a volumeClaimTemplate.
                             • A dedicated PVC per Pod: from this template, the StatefulSet creates an independent, dedicated PersistentVolumeClaim (PVC) for each Pod instance.
                               • mysql-0 -> pvc-name-mysql-0
                               • mysql-1 -> pvc-name-mysql-1
                               • mysql-2 -> pvc-name-mysql-2

                             Workflow

                             1. When you create a StatefulSet named mysql with 3 replicas, K8s will:
                               • create Pod mysql-0 together with PVC data-mysql-0, and bind them;
                               • once mysql-0 is Ready, create Pod mysql-1 and PVC data-mysql-1, and bind them;
                               • and so on.
                             2. If a node failure causes mysql-1 to be deleted, the K8s scheduler recreates a Pod with the same name, mysql-1, on another healthy node.
                             3. The new mysql-1 automatically mounts the PVC data-mysql-1 that was created for it earlier and holds its dedicated data.
                             4. So although the Pod has "drifted", its data follows it, and the application recovers seamlessly.

                             Scenario

                             • For a database, every Pod has its own independent data directory. mysql-0's data can never get mixed up with mysql-1's. This is the foundation for sharding and primary/replica replication.

                             4. Stable startup order and unique identity

                             Problem solved: an application's startup script or configuration may need to know the current instance's ordinal (for example, to decide whether it is the first node and should bootstrap the cluster).

                             How StatefulSet implements it

                             • The Pod's ordinal (0, 1, 2...) is its stable, unique identity.
                             • This ordinal can be injected into the Pod's containers as an environment variable for the application to use.
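
                             One common way to do this, sketched here with illustrative names (POD_NAME and the busybox image are assumptions, not from the original): inject the Pod name via the Downward API and derive the ordinal from its numeric suffix in the entrypoint. This is a container-spec fragment, not a complete manifest:

                             ```yaml
                             # Fragment of a StatefulSet Pod template
                             containers:
                             - name: app
                               image: busybox
                               env:
                               - name: POD_NAME            # e.g. "web-0"
                                 valueFrom:
                                   fieldRef:
                                     fieldPath: metadata.name
                               command: ["sh", "-c"]
                               args:
                               - |
                                 ORDINAL="${POD_NAME##*-}"   # strip everything up to the last "-"
                                 if [ "$ORDINAL" = "0" ]; then
                                   echo "I am the first node: bootstrap the cluster"
                                 fi
                                 sleep 3600
                             ```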

                             Summary

                             Through the troika of "stable network identity + ordered deployment and management + stable dedicated storage", StatefulSet anchors stateful applications within a dynamic, ever-changing container environment, giving them the kind of operational stability usually associated with physical or virtual machines.

                             It does not make stateful applications themselves any simpler; rather, it gives their complex requirements (clustering, data persistence) a framework and platform for running reliably on K8s. For operators, this means databases, message queues, and other complex stateful services can be managed declaratively, much like stateless applications.

                            Mar 7, 2024

                            StatefulSet 2

                             The answer:

                             StatefulSet does not use a ReplicaSet to create Pods. ✅ The StatefulSet controller creates and manages Pod objects directly, and each Pod is ordered, named, and non-interchangeable.

                             Below is a detailed look at the architectural differences between StatefulSet and Deployment, the implementation mechanisms, and the core control loop, so that you can truly understand its behavior when building an Operator or running highly reliable applications.


                             🧩 1. The fundamental differences between StatefulSet and Deployment

                             | Aspect | Deployment | StatefulSet |
                             |---|---|---|
                             | Underlying objects | ReplicaSet → Pod | manages Pods directly |
                             | Pod naming | random | fixed names (ordinal index) |
                             | Pod identity | interchangeable | unique, stable identity |
                             | Update strategy | rolling update (unordered) | ordered update (one at a time) |
                             | Storage | usually stateless | PVC bound one-to-one to each Pod |
                             | Typical workloads | web services, APIs, Jobs | databases, Zookeeper, Kafka, Etcd, etc. |

                             ⚙️ 2. How the StatefulSet controller works

                             The StatefulSet controller runs inside kube-controller-manager. It follows the same classic Controller + Informer + WorkQueue + Reconcile Loop architecture.

                             But its core loop logic differs markedly from Deployment/ReplicaSet:

                             1️⃣ Reconcile flow overview

                             Pseudocode:

                            syncStatefulSet(set):
                                pods := listPodsMatching(set.selector)
                                sortPodsByOrdinal(pods)
                            
                                for i in 0..set.spec.replicas-1:
                                    ensurePodExists(set, ordinal=i)
                                    ensurePodPVCExists(set, ordinal=i)
                                    if updateRequired(set, pod[i]):
                                        deletePodAndWaitForReplacement(set, ordinal=i)

                             In other words, the StatefulSet controller creates Pods and PVCs itself, directly, and drives the whole process in order.


                             2️⃣ Pod naming and the ordinal mechanism

                             StatefulSet gives every Pod a fixed naming pattern:

                            <prefix>-<ordinal>

                            例如:

                            web-0
                            web-1
                            web-2

                             These names are not randomly generated UIDs; they are fixed by the ordinal. This gives each Pod a stable network identity:

                             • Pod DNS: <podname>.<headless-service-name>.<namespace>.svc.cluster.local

                             • For example:

                               web-0.web.default.svc.cluster.local
                               web-1.web.default.svc.cluster.local

                             This is crucial for applications such as Zookeeper or MySQL clusters, where every node must have a fixed identity.


                             3️⃣ Storage (PVC) binding

                             Each StatefulSet Pod can declare a volumeClaimTemplate:

                            volumeClaimTemplates:
                            - metadata:
                                name: data
                              spec:
                                accessModes: ["ReadWriteOnce"]
                                resources:
                                  requests:
                                    storage: 10Gi

                             The controller creates an independent PVC for each Pod:

                            data-web-0
                            data-web-1
                            data-web-2

                             Even if a Pod is deleted, its PVC is not cleaned up (unless deleted manually); this is what implements state persistence.


                             4️⃣ Pod creation order and rolling-update order

                             StatefulSet enforces strict ordering:

                             Creation order

                             • web-0 is created first;
                             • once web-0 is Running + Ready,
                             • web-1 is created;
                             • and so on.

                             Deletion order

                             • the highest-ordinal Pod is deleted first;
                             • then downward in order.

                             Update order (rolling update)

                             • the highest-ordinal Pod is updated first;

                             • once that Pod is Ready, the next one is updated;

                             • creation/deletion ordering is configurable via podManagementPolicy:

                               • OrderedReady (the default: ordered)
                               • Parallel (create/delete in parallel; rolling updates are governed separately by spec.updateStrategy)
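
                             A hedged sketch of controlling the rolling update via updateStrategy: the real rollingUpdate.partition field means only Pods with ordinal >= partition receive the new template, a common way to canary a StatefulSet (the name web and the image tag are illustrative):

                             ```yaml
                             apiVersion: apps/v1
                             kind: StatefulSet
                             metadata:
                               name: web
                             spec:
                               serviceName: web
                               replicas: 3
                               podManagementPolicy: OrderedReady   # ordering for create/delete
                               updateStrategy:
                                 type: RollingUpdate
                                 rollingUpdate:
                                   partition: 2    # only web-2 is rolled to the new template
                               selector:
                                 matchLabels:
                                   app: web
                               template:
                                 metadata:
                                   labels:
                                     app: web
                                 spec:
                                   containers:
                                   - name: web
                                     image: nginx:1.25   # bump this, then lower partition step by step
                             ```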

                             5️⃣ StatefulSet update control logic

                             When you modify the StatefulSet's template (for example, change the image), the controller will:

                             1. compare the spec.template hash;
                             2. find the Pods that need updating;
                             3. update them in order;
                             4. wait for each updated Pod to become Ready before moving on to the next.

                             Pseudocode:

                            for i = replicas-1 downTo 0:
                                if pod[i].template != set.template:
                                    delete(pod[i])
                                    waitUntilNewPodReady(i)

                             📦 3. The lifecycle of a StatefulSet and its PVCs

                             | Object | Created | Deleted |
                             |---|---|---|
                             | Pod | when the StatefulSet first starts, or on scale-up | on scale-down (highest ordinal first) |
                             | PVC | when its Pod is first created | not automatically (by default) |

                             This means:

                             a StatefulSet's data persistence is achieved by binding PVCs to ordinals.

                             For example:

                             data-web-0 → always belongs to web-0
                             data-web-1 → always belongs to web-1
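
                             Recent Kubernetes versions add an opt-in way to change this default retention: the persistentVolumeClaimRetentionPolicy field (the StatefulSetAutoDeletePVC feature, alpha in v1.23 and enabled by default in later releases). A hedged fragment, with the rest of the spec omitted:

                             ```yaml
                             apiVersion: apps/v1
                             kind: StatefulSet
                             metadata:
                               name: web
                             spec:
                               # Opt-in PVC cleanup; without this block PVCs are always retained.
                               persistentVolumeClaimRetentionPolicy:
                                 whenDeleted: Delete   # remove the PVCs when the StatefulSet is deleted
                                 whenScaled: Retain    # keep PVCs of Pods removed by scale-down (the default)
                               # serviceName / selector / template omitted for brevity
                             ```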

                             🔁 4. The StatefulSet control chain (event flow)

                             | Step | Actor | Action |
                             |---|---|---|
                             | 1 | user | kubectl apply creates the StatefulSet |
                             | 2 | API Server | writes it to etcd |
                             | 3 | StatefulSet controller | detects the StatefulSet object change |
                             | 4 | controller | creates the first Pod + PVC (ordinal=0) |
                             | 5 | kubelet | starts and runs the containers |
                             | 6 | controller | once web-0 is Ready, creates web-1 |
                             | 7 | … | until all replicas are Ready |
                             | 8 | controller | rolls updates through the Pods in order |

                             🧠 5. Why doesn't StatefulSet reuse ReplicaSet?

                             Mainly because ReplicaSet's properties conflict with StatefulSet's design goals:

                             | Capability | ReplicaSet | StatefulSet |
                             |---|---|---|
                             | Pod anonymity | ✅ all Pods are equivalent | ❌ every Pod is unique |
                             | Unordered updates | ✅ any Pod can be replaced | ❌ must be ordered |
                             | PVC management | ❌ none | ✅ one-to-one binding |
                             | Network identity | ❌ none | ✅ stable DNS names |

                             This is why Kubernetes built a dedicated StatefulSet controller to implement this class of "stateful replica control".


                             💡 6. StatefulSet recap

                             | Property | Description |
                             |---|---|
                             | Controlled objects | manages Pods directly |
                             | Pod naming | fixed names (with ordinals) |
                             | Update strategy | ordered (OrderedReady) or parallel (Parallel) |
                             | Persistence | one-to-one PVC binding |
                             | Typical workloads | databases, distributed storage, message queues, coordination services, etc. |
                             | Strengths | stable network identity, persistent storage, strict update ordering |
                             | Weaknesses | slow to scale, scheduling constraints, a poor fit for stateless applications |

                             🧭 7. Visualizing the structure

                            StatefulSet: kafka
                            │
                            ├── Pod: kafka-0
                            │   ├── PVC: data-kafka-0
                            │   └── Hostname: kafka-0.kafka.default.svc
                            │
                            ├── Pod: kafka-1
                            │   ├── PVC: data-kafka-1
                            │   └── Hostname: kafka-1.kafka.default.svc
                            │
                            └── Pod: kafka-2
                                ├── PVC: data-kafka-2
                                └── Hostname: kafka-2.kafka.default.svc
                            Mar 7, 2024

                            Talk between 2 pods in different nodes

                             This is a core Kubernetes networking question. How Pods on different Nodes communicate clearly demonstrates the central idea of the Kubernetes network model: every Pod has an address in one flat, independent IP space, and Pods can talk to each other directly via these IPs, without NAT, no matter which node they run on.

                             This is implemented entirely by the Container Network Interface (CNI) plugin, such as Calico, Flannel, or Weave Net. Below we walk through the process using the two classic setups: Flannel (VXLAN mode) and Calico (BGP mode).


                             Core principles

                             1. Pod IP reachability: the Kubernetes network model requires that any Pod's IP address be directly reachable from any other Pod, whether or not they are on the same node.
                             2. No NAT: Pod-to-Pod traffic must not go through source (SNAT) or destination (DNAT) address translation. The source and destination IPs a Pod sees are the real Pod IPs.

                             Generic communication flow (abstract model)

                             Suppose there are two Pods:

                             • Pod A: on Node 1, with IP 10.244.1.10
                             • Pod B: on Node 2, with IP 10.244.2.20

                             When Pod A pings Pod B's IP (10.244.2.20), the process is:

                             1. Egress: from Pod A to Node 1

                             • Following its internal routing table, Pod A sends the packet out of the eth0 interface inside its own network namespace.
                             • The destination IP is 10.244.2.20.
                             • On Node 1, a bridge (such as cni0) acts as a virtual switch for all local Pods. Pod A's eth0 connects to this bridge through a veth pair.
                             • The packet arrives at the cni0 bridge.

                             2. Routing decision: on Node 1

                             • Node 1's kernel routing table is configured by the CNI plugin. The kernel looks at the packet's destination IP, 10.244.2.20.
                             • The routes look roughly like this:
                               Destination     Gateway         Interface
                               10.244.1.0/24   ...            cni0      # local Pod subnet: via the cni0 bridge
                               10.244.2.0/24   192.168.1.102  eth0      # remote Pod subnet: next hop is Node 2's IP, out the physical NIC eth0
                             • The routing table tells the kernel that packets for 10.244.2.0/24 have next hop 192.168.1.102 (Node 2's physical IP) and leave via Node 1's physical interface eth0.

                             From here on, different CNI plugins work differently.


                             Scenario 1: Flannel (VXLAN mode)

                             Flannel solves cross-node communication by building an overlay network.

                             1. Encapsulation

                               • Before the packet (src 10.244.1.10, dst 10.244.2.20) reaches Node 1's eth0, it is intercepted by a special virtual network device, flannel.1.
                               • flannel.1 is a VXLAN tunnel endpoint.
                               • Encapsulation: flannel.1 wraps the entire original packet (as payload) inside a new UDP packet.
                                 • Outer IP header: source is Node 1's IP (192.168.1.101), destination is Node 2's IP (192.168.1.102).
                                 • Outer UDP header: the destination port is usually 8472 (VXLAN).
                                 • VXLAN header: carries a VNI identifying the virtual network.
                                 • Inner original packet: left untouched.
                             2. Transport over the physical network

                               • The encapsulated UDP packet is sent out through Node 1's physical NIC eth0.
                               • It crosses the underlying physical network (switches, routers) and reaches Node 2 without trouble, because the outer IPs are real node IPs that the underlay understands.
                             3. Decapsulation

                               • The packet arrives at Node 2's physical NIC eth0.
                               • The kernel sees a UDP packet destined for the VXLAN port (8472) and hands it to Node 2's flannel.1 device.
                               • flannel.1 decapsulates it, stripping the outer UDP and IP headers to expose the original IP packet (src 10.244.1.10, dst 10.244.2.20).
                             4. Ingress: from Node 2 to Pod B

                               • The decapsulated packet enters Node 2's network stack.
                               • Node 2's routing table sees that the destination IP 10.244.2.20 belongs to the local subnet managed by the cni0 bridge.
                               • The packet is forwarded to the cni0 bridge, which delivers it through the veth pair to Pod B's eth0 interface.

                             A simple analogy: Flannel sets up a courier line between the two nodes. Your original letter (the Pod-IP packet) is put inside a standard courier envelope (the outer UDP packet) and shipped through the public postal system (the physical network) to the remote post office (Node 2), which opens the envelope and hands the original letter to the recipient (Pod B).


                             Scenario 2: Calico (BGP mode)

                             Calico usually uses no tunnels at all; instead it uses the BGP protocol for pure layer-3 routing, which is more efficient.

                             1. Route advertisement

                               • Node 1 and Node 2 each run Calico's Felix agent and the BIRD BGP daemon.
                               • Node 2 advertises, via BGP, a route to the other nodes (including Node 1): "the next hop for subnet 10.244.2.0/24 is me, 192.168.1.102".
                               • Node 1 learns this route and installs it into its kernel routing table (the very route we saw in step 2 earlier).
                             2. Direct routing

                               • The packet (src 10.244.1.10, dst 10.244.2.20) is sent straight out of Node 1's physical NIC eth0, per the routing table.
                               • No encapsulation! The packet stays exactly as it is: source IP 10.244.1.10, destination IP 10.244.2.20.
                               • It is sent toward Node 2's physical IP (192.168.1.102).
                             3. Transport over the physical network

                               • The packet crosses the underlying physical network, which therefore must be able to route the Pod subnets. In cloud environments this is usually done with VPC route tables; in a physical data center the core switches must learn these BGP routes or carry static routes.
                             4. Ingress: from Node 2 to Pod B

                               • The packet arrives at Node 2's physical NIC eth0.
                               • Node 2's kernel sees destination IP 10.244.2.20, finds that it belongs to a local virtual interface (such as caliXXX, which Calico creates for each Pod), and forwards the packet to that interface, delivering it to Pod B.

                             A simple analogy: Calico turns every node into a smart router. Nodes tell each other "this Pod subnet lives here". When Node 1 sends data to a Pod on Node 2, it acts like a router: it looks up the known routes, finds Node 2's address, and sends the packet straight over, without repackaging it on the way.


                             Summary comparison

                             | Property | Flannel (VXLAN) | Calico (BGP) |
                             |---|---|---|
                             | Network model | overlay network | pure layer 3 |
                             | Mechanism | tunnel encapsulation | route advertisement |
                             | Performance | encap/decap overhead, slightly lower | no tunnel overhead, higher |
                             | Underlay requirements | none beyond node-IP reachability | underlay must route Pod subnets (cloud VPC or physical-network config) |
                             | Packet on the wire | outer node IPs, inner Pod IPs | always Pod IPs |

                             Whichever approach is used, Kubernetes and the CNI plugin together deliver a flat Pod network that is transparent to application developers: they only care about Pod IPs and Services, never the cross-node plumbing underneath.

                             How do you troubleshoot when Pods cannot reach each other?

                             Core approach: from inside the Pod outward, from simple to complex

                             The whole investigation follows the path in the diagram below, going deeper step by step:

                             flowchart TD
                                 A[Pods cannot reach each other] --> B[Verify basic connectivity<br>ping & telnet]

                                 B --> C{Does ping work?}
                                 C -- yes --> D[Does telnet to the port work?]
                                 C -- no --> E[Check NetworkPolicy<br>kubectl get networkpolicy]

                                 D -- yes --> F[Check application logs and config]
                                 D -- no --> G[Check Service and Endpoints<br>kubectl describe svc]

                                 E --> H[Check CNI plugin status<br>kubectl get pods -n kube-system]

                                 subgraph G_ [Service troubleshooting path]
                                     G --> G1[Endpoints empty?]
                                     G1 -- yes --> G2[Check Pod labels vs. Selector]
                                     G1 -- no --> G3[Check kube-proxy and iptables]
                                 end

                                 F --> Z[Resolved]
                                 H --> Z
                                 G2 --> Z
                                 G3 --> Z

                             Phase 1: gather basic information and run initial checks

                             1. Get both Pods' details

                               kubectl get pods -o wide
                               • Confirm both Pods are in the Running state.
                               • Record their IP addresses and the nodes they run on.
                               • Confirm whether they are on different nodes (if they share a node, the procedure differs slightly).
                             2. Clarify how the access is made

                               • Directly via the Pod IP? (ping <pod-ip>, curl <pod-ip>:<port>)
                               • Via a Service name? (ping <service-name>, curl <service-name>:<port>)
                               • The answer determines the direction of the rest of the investigation.

                             Phase 2: dig in along the access path

                             Scenario 1: direct access via the Pod IP fails (across nodes)

                             This usually points to a problem with the underlying network plugin (CNI).

                             1. Check networking inside the Pod

                               • Enter the source Pod and inspect its network configuration:
                               kubectl exec -it <source-pod> -- sh
                               # inside the Pod, run:
                               ip addr show eth0 # is the IP correct?
                               ip route # inspect the routing table
                               ping <destination-pod-ip> # test connectivity
                               • If ping fails, continue to the next step.
                             2. Check the destination Pod's listening ports

                               • Enter the destination Pod and confirm the application is listening on the right port:
                               kubectl exec -it <destination-pod> -- netstat -tulpn | grep LISTEN
                               # or use ss
                               kubectl exec -it <destination-pod> -- ss -tulpn | grep LISTEN
                               • If nothing is listening here, it is an application problem: check the application's logs and configuration.
                             3. Check NetworkPolicies

                               • These are Kubernetes's "firewall" and may well be blocking the traffic.
                               kubectl get networkpolicies -A
                               kubectl describe networkpolicy <policy-name> -n <namespace>
                               • Look for policies restricting traffic from the source Pod or to the destination Pod; pay particular attention to ingress rules.
                             4. Check the CNI plugin's status

                               • A failing CNI plugin (Calico, Flannel, ...) can take down cross-node networking entirely.
                               kubectl get pods -n kube-system | grep -e calico -e flannel -e weave
                               • Confirm all CNI-related Pods are running. If any are in CrashLoopBackOff or similar states, read their logs.
                             5. Node-level checks

                               • If all of the above looks fine, the problem may be at the node network layer.
                               • Log in to the source Pod's node and try to ping the destination Pod's IP.
                               • Check the node's routing table:
                                 # on the node
                                 ip route
                                 • With Flannel you should see routes to the other nodes' Pod subnets.
                                 • With Calico you should see a precise route to every other node's Pod subnet.
                               • Check node firewalls: some environments (security groups, iptables rules) may block VXLAN (port 8472) or inter-node Pod-IP traffic.
                                 # check the iptables rules
                                 sudo iptables-save | grep <pod-ip>
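
                             To illustrate the NetworkPolicy check in step 3: a policy like the following (all names hypothetical) admits ingress to app=web-server Pods only from Pods labeled role=client, so connections from any other Pod would be silently dropped:

                             ```yaml
                             apiVersion: networking.k8s.io/v1
                             kind: NetworkPolicy
                             metadata:
                               name: allow-clients-only      # hypothetical name
                               namespace: default
                             spec:
                               podSelector:
                                 matchLabels:
                                   app: web-server           # the destination Pods
                               policyTypes:
                               - Ingress
                               ingress:
                               - from:
                                 - podSelector:
                                     matchLabels:
                                       role: client          # only these source Pods may connect
                                 ports:
                                 - protocol: TCP
                                   port: 8080
                             ```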

                             Scenario 2: access via the Service name fails

                             This usually points to Kubernetes service discovery or kube-proxy.

                             1. Check the Service and its Endpoints

                               kubectl get svc <service-name>
                               kubectl describe svc <service-name> # check the Selector and port mapping
                               kubectl get endpoints <service-name> # the key step! are there healthy Endpoints?
                               • If the ENDPOINTS column is empty: the Service's label selector matches no healthy Pod. Check:
                                 • whether the Pod's labels match the Service's selector;
                                 • whether the Pod's readinessProbe is passing.
                             2. Check DNS resolution

                               • From the source Pod, test resolving the Service name:
                               kubectl exec -it <source-pod> -- nslookup <service-name>
                               # or
                               kubectl exec -it <source-pod> -- cat /etc/resolv.conf
                               • If resolution fails, check whether the kube-dns / coredns Pods are healthy.
                               kubectl get pods -n kube-system | grep -e coredns -e kube-dns
                             3. Check kube-proxy

                               • kube-proxy implements the Service load-balancing rules (usually iptables or ipvs).
                               kubectl get pods -n kube-system | grep kube-proxy
                               • Confirm all kube-proxy Pods are running.
                               • You can also log in to a node and look for the corresponding iptables rules:
                                 sudo iptables-save | grep <service-name>
                                 # or inspect the ipvs rules (when running in ipvs mode)
                                 sudo ipvsadm -ln
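
                             The empty-ENDPOINTS case in step 1 almost always comes down to a selector/label mismatch or a failing readiness probe; a minimal matching pair, with illustrative names, looks like this:

                             ```yaml
                             apiVersion: v1
                             kind: Service
                             metadata:
                               name: web-service
                             spec:
                               selector:
                                 app: web-server        # must equal the Pod's labels exactly
                               ports:
                               - port: 80
                                 targetPort: 8080
                             ---
                             apiVersion: v1
                             kind: Pod
                             metadata:
                               name: web-pod
                               labels:
                                 app: web-server        # matches the selector above
                             spec:
                               containers:
                               - name: web
                                 image: nginx
                                 ports:
                                 - containerPort: 8080
                                 readinessProbe:        # the Pod joins the Endpoints only when this passes
                                   tcpSocket:
                                     port: 8080
                             ```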

                             Phase 3: advanced debugging techniques

                             If the steps above still haven't solved it, try the following:

                             1. Use a network-debugging image

                               • Deploy a temporary Pod with networking tools (such as nicolaka/netshoot) for advanced debugging.
                               kubectl run -it --rm debug-pod --image=nicolaka/netshoot -- /bin/bash
                               • Inside this Pod you can use powerful tools such as tcpdump, tracepath, and dig.
                               • For example, capture traffic on the destination Pod's node:
                                 # on the node, listening on all interfaces of the Pod network
                                 sudo tcpdump -i any -n host <source-pod-ip> and host <destination-pod-ip>
                             2. Check node-to-node connectivity

                               • Confirm the two nodes can reach each other (via their node IPs).
                               • Confirm the ports the CNI needs (e.g. Flannel's VXLAN port 8472) are open between the nodes.

                             Summary: a troubleshooting command checklist

                             When Pods cannot reach each other, run the following in order:

                             # 1. Basic information
                            kubectl get pods -o wide
                            kubectl get svc,ep -o wide
                            
                             # 2. Check NetworkPolicy
                            kubectl get networkpolicies -A
                            
                             # 3. Check the core plugins
                            kubectl get pods -n kube-system | grep -e coredns -e kube-proxy -e calico -e flannel
                            
                             # 4. Test from inside the Pod
                            kubectl exec -it <source-pod> -- ping <destination-pod-ip>
                            kubectl exec -it <source-pod> -- nslookup <service-name>
                            
                             # 5. Check the destination Pod's application
                            kubectl exec -it <destination-pod> -- netstat -tulpn
                            kubectl logs <destination-pod>
                            
                             # 6. Node-level checks
                             # on the node, run:
                            ip route
                            sudo iptables-save | grep <relevant-ip>

                             Remember: 90% of Pod connectivity problems come from NetworkPolicy configuration, empty Service Endpoints, or a broken CNI plugin. Follow this path and the vast majority of issues can be located and resolved.

                            Mar 7, 2024

                            Talk with API Server

                             How Kubernetes components communicate with the API Server

                             This is central to understanding the Kubernetes architecture. The API Server is the cluster's "brain": every component communicates through it.


                             🎯 Kubernetes communication architecture overview

                             ┌─────────────────────────────────────────────────────────┐
                             │                    API Server (core)                    │
                             │  - RESTful API (HTTP/HTTPS)                             │
                             │  - authentication, authorization, admission control     │
                             │  - the sole gateway to etcd                             │
                             └───────┬─────────────────┬─────────────────┬─────────────┘
                                     │                 │                 │
                                 ┌───▼───┐        ┌────▼────┐        ┌───▼──────┐
                                 │Kubelet│        │Scheduler│        │Controller│
                                 │(Node) │        │         │        │ Manager  │
                                 └───┬───┘        └─────────┘        └──────────┘
                                     │
                                ┌────▼─────┐
                                │kube-proxy│
                                └──────────┘

                            🔐 通信基础:认证、授权、准入

                            1. 认证 (Authentication)

                            所有组件访问 API Server 必须先通过认证。

                            常见认证方式

                            | 认证方式 | 使用场景 | 实现方式 |
                            |---|---|---|
                            | X.509 证书 | 集群组件(kubelet/scheduler) | 客户端证书 |
                            | ServiceAccount Token | Pod 内应用 | JWT Token |
                            | Bootstrap Token | 节点加入集群 | 临时 Token |
                            | 静态 Token 文件 | 简单测试 | 不推荐生产 |
                            | OIDC | 用户认证 | 外部身份提供商 |

                            X.509 证书认证示例

                            # 1. API Server 启动参数包含 CA 证书
                            kube-apiserver \
                              --client-ca-file=/etc/kubernetes/pki/ca.crt \
                              --tls-cert-file=/etc/kubernetes/pki/apiserver.crt \
                              --tls-private-key-file=/etc/kubernetes/pki/apiserver.key
                            
                            # 2. Kubelet 使用客户端证书
                            kubelet \
                              --kubeconfig=/etc/kubernetes/kubelet.conf \
                              --client-ca-file=/etc/kubernetes/pki/ca.crt
                            
                            # 3. kubeconfig 文件内容
                            apiVersion: v1
                            kind: Config
                            clusters:
                            - cluster:
                                certificate-authority: /etc/kubernetes/pki/ca.crt  # CA 证书
                                server: https://192.168.1.10:6443                  # API Server 地址
                              name: kubernetes
                            users:
                            - name: system:node:worker-1
                              user:
                                client-certificate: /var/lib/kubelet/pki/kubelet-client.crt  # 客户端证书
                                client-key: /var/lib/kubelet/pki/kubelet-client.key          # 客户端密钥
                            contexts:
                            - context:
                                cluster: kubernetes
                                user: system:node:worker-1
                              name: default
                            current-context: default

                            ServiceAccount Token 认证

                            # Pod 内自动挂载的 Token
                            cat /var/run/secrets/kubernetes.io/serviceaccount/token
                            # eyJhbGciOiJSUzI1NiIsImtpZCI6Ij...
                            
                            # 使用 Token 访问 API Server
                            TOKEN=$(cat /var/run/secrets/kubernetes.io/serviceaccount/token)
                            curl -k -H "Authorization: Bearer $TOKEN" \
                              https://kubernetes.default.svc/api/v1/namespaces/default/pods
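ServiceAccount Token 本质上是一个 JWT:三段 base64url 编码的内容用 `.` 连接,第二段(payload)可以直接解码查看,无需校验签名。下面是一个最小的 Go 示意,token 为虚构示例(payload 为 `{"sub":"x"}`),仅演示解码,不做任何安全校验:

```go
package main

import (
	"encoding/base64"
	"fmt"
	"strings"
)

// decodePayload 解码 JWT 的第二段(payload),仅用于查看内容,不校验签名
func decodePayload(token string) (string, error) {
	parts := strings.Split(token, ".")
	if len(parts) != 3 {
		return "", fmt.Errorf("token 应有 3 段,实际 %d 段", len(parts))
	}
	// JWT 的各段使用无填充的 base64url 编码
	data, err := base64.RawURLEncoding.DecodeString(parts[1])
	if err != nil {
		return "", err
	}
	return string(data), nil
}

func main() {
	// 虚构的示例 token,payload 为 {"sub":"x"}
	payload, _ := decodePayload("eyJhbGciOiJSUzI1NiJ9.eyJzdWIiOiJ4In0.sig")
	fmt.Println(payload) // {"sub":"x"}
}
```

真实 Pod 里的 Token,其 payload 会包含 ServiceAccount 的名字、命名空间等声明,解码方式相同。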

                            2. 授权 (Authorization)

                            认证通过后,检查是否有权限执行操作。

                            RBAC (Role-Based Access Control) - 最常用

                            # 1. Role - 定义权限
                            apiVersion: rbac.authorization.k8s.io/v1
                            kind: Role
                            metadata:
                              namespace: default
                              name: pod-reader
                            rules:
                            - apiGroups: [""]
                              resources: ["pods"]
                              verbs: ["get", "list", "watch"]
                            
                            ---
                            # 2. RoleBinding - 绑定用户/ServiceAccount
                            apiVersion: rbac.authorization.k8s.io/v1
                            kind: RoleBinding
                            metadata:
                              name: read-pods
                              namespace: default
                            subjects:
                            - kind: ServiceAccount
                              name: my-app
                              namespace: default
                            roleRef:
                              kind: Role
                              name: pod-reader
                              apiGroup: rbac.authorization.k8s.io
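RBAC 的匹配逻辑本质上是:请求的 (apiGroup, resource, verb) 只要被任意一条规则同时命中就放行,`*` 为通配。下面用 Go 写一个极简示意(假设的简化结构,并非 k8s 真实实现):

```go
package main

import "fmt"

// Rule 对应 RBAC Role 中的一条规则(简化版)
type Rule struct {
	APIGroups []string
	Resources []string
	Verbs     []string
}

// contains 判断 s 是否在列表中,"*" 匹配任意值
func contains(list []string, s string) bool {
	for _, v := range list {
		if v == "*" || v == s {
			return true
		}
	}
	return false
}

// Allowed 判断 (group, resource, verb) 是否被任一规则允许
func Allowed(rules []Rule, group, resource, verb string) bool {
	for _, r := range rules {
		if contains(r.APIGroups, group) && contains(r.Resources, resource) && contains(r.Verbs, verb) {
			return true
		}
	}
	return false
}

func main() {
	// 对应上文 pod-reader Role 的规则
	rules := []Rule{{APIGroups: []string{""}, Resources: []string{"pods"}, Verbs: []string{"get", "list", "watch"}}}
	fmt.Println(Allowed(rules, "", "pods", "list"))   // true
	fmt.Println(Allowed(rules, "", "pods", "delete")) // false
}
```

实际授权还涉及 resourceNames、非资源 URL、ClusterRole 聚合等,这里只保留核心的"任一规则命中即允许"语义。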

                            授权模式对比

                            | 模式 | 说明 | 使用场景 |
                            |---|---|---|
                            | RBAC | 基于角色 | 生产环境(推荐) |
                            | ABAC | 基于属性 | 复杂策略(已过时) |
                            | Webhook | 外部授权服务 | 自定义授权逻辑 |
                            | Node | 节点授权 | Kubelet 专用 |
                            | AlwaysAllow | 允许所有 | 测试环境(危险) |

                            3. 准入控制 (Admission Control)

                            授权通过后,准入控制器可以修改或拒绝请求。

                            常用准入控制器

                            # API Server 启用的准入控制器
                            kube-apiserver \
                              --enable-admission-plugins=NamespaceLifecycle,LimitRanger,ServiceAccount,\
                            DefaultStorageClass,DefaultTolerationSeconds,MutatingAdmissionWebhook,\
                            ValidatingAdmissionWebhook,ResourceQuota,PodSecurityPolicy

                            | 准入控制器 | 作用 |
                            |---|---|
                            | NamespaceLifecycle | 防止在删除中的 namespace 创建资源 |
                            | LimitRanger | 强制资源限制 |
                            | ResourceQuota | 强制命名空间配额 |
                            | PodSecurityPolicy | 强制 Pod 安全策略 |
                            | MutatingAdmissionWebhook | 修改资源(如注入 sidecar) |
                            | ValidatingAdmissionWebhook | 验证资源(自定义校验) |

                            📡 各组件通信详解

                            1. Kubelet → API Server

与 Scheduler、Controller Manager 一样,Kubelet 也是主动连接 API Server;特殊之处在于,Kubelet 还是少数会被 API Server 反向主动访问的组件(如 kubectl exec/logs 时)。

                            通信方式

                            Kubelet (每个 Node)
                                │
                                ├─→ List-Watch Pods (监听分配给自己的 Pod)
                                ├─→ Report Node Status (定期上报节点状态)
                                ├─→ Report Pod Status (上报 Pod 状态)
                                └─→ Get Secrets/ConfigMaps (拉取配置)

                            实现细节

                            // Kubelet 启动时创建 Informer 监听资源
                            // 伪代码示例
                            func (kl *Kubelet) syncLoop() {
                                // 1. 创建 Pod Informer
                                podInformer := cache.NewSharedIndexInformer(
                                    &cache.ListWatch{
                                        ListFunc: func(options metav1.ListOptions) (runtime.Object, error) {
                                            // 列出分配给当前节点的所有 Pod
                                            options.FieldSelector = fields.OneTermEqualSelector("spec.nodeName", kl.nodeName).String()
                                            return kl.kubeClient.CoreV1().Pods("").List(context.TODO(), options)
                                        },
                                        WatchFunc: func(options metav1.ListOptions) (watch.Interface, error) {
                                            // 持续监听 Pod 变化
                                            options.FieldSelector = fields.OneTermEqualSelector("spec.nodeName", kl.nodeName).String()
                                            return kl.kubeClient.CoreV1().Pods("").Watch(context.TODO(), options)
                                        },
                                    },
                                    &v1.Pod{},
                                    0, // resync 周期为 0:不做定期全量重同步(并非不缓存)
                                    cache.Indexers{},
                                )
                                
                                // 2. 注册事件处理器
                                podInformer.AddEventHandler(cache.ResourceEventHandlerFuncs{
                                    AddFunc:    kl.handlePodAdditions,
                                    UpdateFunc: kl.handlePodUpdates,
                                    DeleteFunc: kl.handlePodDeletions,
                                })
                                
                                // 3. 定期上报节点状态
                                go wait.Until(kl.syncNodeStatus, 10*time.Second, stopCh)
                            }
                            
                            // 上报节点状态
                            func (kl *Kubelet) syncNodeStatus() {
                                node := &v1.Node{
                                    ObjectMeta: metav1.ObjectMeta{Name: kl.nodeName},
                                    Status: v1.NodeStatus{
                                        Conditions: []v1.NodeCondition{
                                            {Type: v1.NodeReady, Status: v1.ConditionTrue},
                                        },
                                        Capacity: kl.getNodeCapacity(),
                                        // ...
                                    },
                                }
                                
                                // 调用 API Server 更新节点状态
                                kl.kubeClient.CoreV1().Nodes().UpdateStatus(context.TODO(), node, metav1.UpdateOptions{})
                            }

                            Kubelet 配置示例

                            # /var/lib/kubelet/config.yaml
                            apiVersion: kubelet.config.k8s.io/v1beta1
                            kind: KubeletConfiguration
                            # API Server 连接配置(通过 kubeconfig)
                            authentication:
                              x509:
                                clientCAFile: /etc/kubernetes/pki/ca.crt
                              webhook:
                                enabled: true
                              anonymous:
                                enabled: false
                            authorization:
                              mode: Webhook
                            clusterDomain: cluster.local
                            clusterDNS:
                            - 10.96.0.10
                            # 定期上报间隔
                            nodeStatusUpdateFrequency: 10s
                            nodeStatusReportFrequency: 1m

                            List-Watch 机制详解

                            ┌─────────────────────────────────────────┐
                            │  Kubelet List-Watch 工作流程             │
                            ├─────────────────────────────────────────┤
                            │                                          │
                            │  1. List(初始化)                         │
                            │     GET /api/v1/pods?fieldSelector=...  │
                            │     ← 返回所有当前 Pod                   │
                            │                                          │
                            │  2. Watch(持续监听)                      │
                            │     GET /api/v1/pods?watch=true&...     │
                            │     ← 保持长连接                         │
                            │                                          │
                            │  3. 接收事件                             │
                            │     ← ADDED: Pod nginx-xxx created      │
                            │     ← MODIFIED: Pod nginx-xxx updated   │
                            │     ← DELETED: Pod nginx-xxx deleted    │
                            │                                          │
                            │  4. 本地处理                             │
                            │     - 缓存更新                           │
                            │     - 触发 Pod 生命周期管理              │
                            │                                          │
                            │  5. 断线重连                             │
                            │     - 检测到连接断开                     │
                            │     - 重新 List + Watch                  │
                            │     - ResourceVersion 确保不丢事件       │
                            └─────────────────────────────────────────┘

                            HTTP 长连接(Chunked Transfer)

                            # Kubelet 发起 Watch 请求
                            GET /api/v1/pods?watch=true&resourceVersion=12345&fieldSelector=spec.nodeName=worker-1 HTTP/1.1
                            Host: 192.168.1.10:6443
                            Authorization: Bearer eyJhbGc...
                            Connection: keep-alive
                            
                            # API Server 返回(Chunked 编码)
                            HTTP/1.1 200 OK
                            Content-Type: application/json
                            Transfer-Encoding: chunked
                            
                            {"type":"ADDED","object":{"kind":"Pod","apiVersion":"v1",...}}
                            {"type":"MODIFIED","object":{"kind":"Pod","apiVersion":"v1",...}}
                            {"type":"DELETED","object":{"kind":"Pod","apiVersion":"v1",...}}
                            ...
                            # 连接保持打开,持续推送事件

                            2. Scheduler → API Server

                            Scheduler 也使用 List-Watch 机制。

                            通信流程

                            Scheduler
                                │
                                ├─→ Watch Pods (监听未调度的 Pod)
                                │   └─ spec.nodeName == ""
                                │
                                ├─→ Watch Nodes (监听节点状态)
                                │
                                ├─→ Get PVs, PVCs (获取存储信息)
                                │
                                └─→ Bind Pod (绑定 Pod 到 Node)
                                    POST /api/v1/namespaces/{ns}/pods/{name}/binding

                            Scheduler 伪代码

                            // Scheduler 主循环
                            func (sched *Scheduler) scheduleOne() {
                                // 1. 从队列获取待调度的 Pod
                                pod := sched.NextPod()
                                
                                // 2. 执行调度算法(过滤 + 打分)
                                feasibleNodes := sched.findNodesThatFit(pod)
                                if len(feasibleNodes) == 0 {
                                    // 无可用节点,标记为不可调度
                                    return
                                }
                                
                                priorityList := sched.prioritizeNodes(pod, feasibleNodes)
                                selectedNode := sched.selectHost(priorityList)
                                
                                // 3. 绑定 Pod 到 Node(调用 API Server)
                                binding := &v1.Binding{
                                    ObjectMeta: metav1.ObjectMeta{
                                        Name:      pod.Name,
                                        Namespace: pod.Namespace,
                                    },
                                    Target: v1.ObjectReference{
                                        Kind: "Node",
                                        Name: selectedNode,
                                    },
                                }
                                
                                // 发送 Binding 请求到 API Server
                                err := sched.client.CoreV1().Pods(pod.Namespace).Bind(
                                    context.TODO(),
                                    binding,
                                    metav1.CreateOptions{},
                                )
                                
                                // 4. API Server 更新 Pod 的 spec.nodeName
                                // 5. Kubelet 监听到 Pod,开始创建容器
                            }
                            
                            // Watch 未调度的 Pod
                            func (sched *Scheduler) watchUnscheduledPods() {
                                podInformer := cache.NewSharedIndexInformer(
                                    &cache.ListWatch{
                                        ListFunc: func(options metav1.ListOptions) (runtime.Object, error) {
                                            // 只监听 spec.nodeName 为空的 Pod
                                            options.FieldSelector = "spec.nodeName="
                                            return sched.client.CoreV1().Pods("").List(context.TODO(), options)
                                        },
                                        WatchFunc: func(options metav1.ListOptions) (watch.Interface, error) {
                                            options.FieldSelector = "spec.nodeName="
                                            return sched.client.CoreV1().Pods("").Watch(context.TODO(), options)
                                        },
                                    },
                                    &v1.Pod{},
                                    0,
                                    cache.Indexers{},
                                )
                                
                                podInformer.AddEventHandler(cache.ResourceEventHandlerFuncs{
                                    AddFunc: func(obj interface{}) {
                                        pod := obj.(*v1.Pod)
                                        sched.queue.Add(pod)  // 加入调度队列
                                    },
                                })
                            }

                            Binding 请求详解

                            # Scheduler 发送的 HTTP 请求
                            POST /api/v1/namespaces/default/pods/nginx-xxx/binding HTTP/1.1
                            Host: 192.168.1.10:6443
                            Authorization: Bearer eyJhbGc...
                            Content-Type: application/json
                            
                            {
                              "apiVersion": "v1",
                              "kind": "Binding",
                              "metadata": {
                                "name": "nginx-xxx",
                                "namespace": "default"
                              },
                              "target": {
                                "kind": "Node",
                                "name": "worker-1"
                              }
                            }
                            
                            # API Server 处理:
                            # 1. 验证 Binding 请求
                            # 2. 更新 Pod 对象的 spec.nodeName = "worker-1"
                            # 3. 返回成功响应
                            # 4. Kubelet 监听到 Pod 更新,开始创建容器

                            3. Controller Manager → API Server

                            Controller Manager 包含多个控制器,每个控制器独立与 API Server 通信。

                            常见控制器

                            Controller Manager
                                │
                                ├─→ Deployment Controller
                                │   └─ Watch Deployments, ReplicaSets
                                │
                                ├─→ ReplicaSet Controller
                                │   └─ Watch ReplicaSets, Pods
                                │
                                ├─→ Node Controller
                                │   └─ Watch Nodes (节点健康检查)
                                │
                                ├─→ Service Controller
                                │   └─ Watch Services (管理 LoadBalancer)
                                │
                                ├─→ Endpoint Controller
                                │   └─ Watch Services, Pods (创建 Endpoints)
                                │
                                └─→ PV Controller
                                    └─ Watch PVs, PVCs (卷绑定)

                            ReplicaSet Controller 示例

                            // ReplicaSet Controller 的核心逻辑
                            func (rsc *ReplicaSetController) syncReplicaSet(key string) error {
                                // 1. 从缓存获取 ReplicaSet
                                rs := rsc.rsLister.Get(namespace, name)
                                
                                // 2. 获取当前 Pod 列表(通过 Selector)
                                allPods := rsc.podLister.List(labels.Everything())
                                filteredPods := rsc.filterActivePods(rs.Spec.Selector, allPods)
                                
                                // 3. 计算差异
                                diff := len(filteredPods) - int(*rs.Spec.Replicas)
                                
                                if diff < 0 {
                                    // 需要创建新 Pod
                                    diff = -diff
                                    for i := 0; i < diff; i++ {
                                        // 调用 API Server 创建 Pod
                                        pod := newPod(rs)
                                        _, err := rsc.kubeClient.CoreV1().Pods(rs.Namespace).Create(
                                            context.TODO(),
                                            pod,
                                            metav1.CreateOptions{},
                                        )
                                    }
                                } else if diff > 0 {
                                    // 需要删除多余 Pod
                                    podsToDelete := getPodsToDelete(filteredPods, diff)
                                    for _, pod := range podsToDelete {
                                        // 调用 API Server 删除 Pod
                                        err := rsc.kubeClient.CoreV1().Pods(pod.Namespace).Delete(
                                            context.TODO(),
                                            pod.Name,
                                            metav1.DeleteOptions{},
                                        )
                                    }
                                }
                                
                                // 4. 更新 ReplicaSet 状态
                                rs.Status.Replicas = int32(len(filteredPods))
                                _, err := rsc.kubeClient.AppsV1().ReplicaSets(rs.Namespace).UpdateStatus(
                                    context.TODO(),
                                    rs,
                                    metav1.UpdateOptions{},
                                )
                            }
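上面控制器的核心只是"期望副本数与实际数求差"。把这一步抽成纯函数更容易看清(示意代码,非真实实现):

```go
package main

import "fmt"

// reconcile 根据期望与实际副本数,返回需要创建和删除的 Pod 数量
func reconcile(desired, actual int) (create, del int) {
	if actual < desired {
		return desired - actual, 0
	}
	return 0, actual - desired
}

func main() {
	c, d := reconcile(3, 1)
	fmt.Println(c, d) // 2 0:少 2 个,需要创建
	c, d = reconcile(3, 5)
	fmt.Println(c, d) // 0 2:多 2 个,需要删除
}
```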

                            Node Controller 心跳检测

                            // Node Controller 监控节点健康
                            func (nc *NodeController) monitorNodeHealth() {
                                for {
                                    // 1. 列出所有节点
                                    nodes, _ := nc.kubeClient.CoreV1().Nodes().List(context.TODO(), metav1.ListOptions{})
                                    
                                    for _, node := range nodes.Items {
                                        // 2. 检查节点状态
                                        now := time.Now()
                                        lastHeartbeat := getNodeCondition(&node, v1.NodeReady).LastHeartbeatTime
                                        
                                        if now.Sub(lastHeartbeat.Time) > 40*time.Second {
                                            // 3. 节点超时,标记为 NotReady
                                            setNodeCondition(&node, v1.NodeCondition{
                                                Type:   v1.NodeReady,
                                                Status: v1.ConditionUnknown,
                                                Reason: "NodeStatusUnknown",
                                            })
                                            
                                            // 4. 更新节点状态
                                            nc.kubeClient.CoreV1().Nodes().UpdateStatus(
                                                context.TODO(),
                                                &node,
                                                metav1.UpdateOptions{},
                                            )
                                            
                                            // 5. 如果节点长时间 NotReady,驱逐 Pod
                                            if now.Sub(lastHeartbeat.Time) > 5*time.Minute {
                                                nc.evictPods(node.Name)
                                            }
                                        }
                                    }
                                    
                                    time.Sleep(10 * time.Second)
                                }
                            }

                            4. kube-proxy → API Server

                            kube-proxy 监听 Service 和 Endpoints,配置网络规则。

                            通信流程

                            kube-proxy (每个 Node)
                                │
                                ├─→ Watch Services
                                │   └─ 获取 Service 定义
                                │
                                ├─→ Watch Endpoints
                                │   └─ 获取后端 Pod IP 列表
                                │
                                └─→ 配置本地网络
                                    ├─ iptables 模式:更新 iptables 规则
                                    ├─ ipvs 模式:更新 IPVS 规则
                                    └─ userspace 模式:代理转发(已废弃)

                            iptables 模式示例

                            // kube-proxy 监听 Service 和 Endpoints
                            func (proxier *Proxier) syncProxyRules() {
                                // 1. 获取所有 Service
                                services := proxier.serviceStore.List()
                                
                                // 2. 获取所有 Endpoints
                                endpoints := proxier.endpointsStore.List()
                                
                                // 3. 生成 iptables 规则
                                for _, svc := range services {
                                    // Service ClusterIP
                                    clusterIP := svc.Spec.ClusterIP
                                    
                                    // 对应的 Endpoints
                                    eps := endpoints[svc.Namespace+"/"+svc.Name]
                                    
                                    // 生成 DNAT 规则
                                    // -A KUBE-SERVICES -d 10.96.100.50/32 -p tcp -m tcp --dport 80 -j KUBE-SVC-XXXX
                                    chain := generateServiceChain(svc)
                                    
                                    for _, ep := range eps.Subsets {
                                        for _, addr := range ep.Addresses {
                                            // -A KUBE-SVC-XXXX -m statistic --mode random --probability 0.33 -j KUBE-SEP-XXXX
                                            // -A KUBE-SEP-XXXX -p tcp -m tcp -j DNAT --to-destination 10.244.1.5:8080
                                            generateEndpointRule(addr.IP, ep.Ports[0].Port)
                                        }
                                    }
                                }
                                
                                // 4. 应用 iptables 规则
                                iptables.Restore(rules)
                            }

                            生成的 iptables 规则示例

                            # Service: nginx-service (ClusterIP: 10.96.100.50:80)
                            # Endpoints: 10.244.1.5:8080, 10.244.2.8:8080
                            
                            # 1. KUBE-SERVICES 链(入口)
                            -A KUBE-SERVICES -d 10.96.100.50/32 -p tcp -m tcp --dport 80 -j KUBE-SVC-NGINX
                            
                            # 2. KUBE-SVC-NGINX 链(Service 链)
                            -A KUBE-SVC-NGINX -m statistic --mode random --probability 0.5 -j KUBE-SEP-EP1
                            -A KUBE-SVC-NGINX -j KUBE-SEP-EP2
                            
                            # 3. KUBE-SEP-EP1 链(Endpoint 1)
                            -A KUBE-SEP-EP1 -p tcp -m tcp -j DNAT --to-destination 10.244.1.5:8080
                            
                            # 4. KUBE-SEP-EP2 链(Endpoint 2)
                            -A KUBE-SEP-EP2 -p tcp -m tcp -j DNAT --to-destination 10.244.2.8:8080
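上面 `--probability 0.5` 的取值有规律:对 N 个 Endpoint,第 i 条规则(从 0 计)的概率是 1/(N-i),最后一条无条件匹配,这样每个后端的总命中概率都是 1/N。可以用 Go 验证一下:

```go
package main

import "fmt"

// probabilities 返回 N 个 Endpoint 依次使用的 --probability 取值:
// 第 i 条为 1/(N-i),保证每个后端总命中概率都是 1/N
func probabilities(n int) []float64 {
	ps := make([]float64, n)
	for i := 0; i < n; i++ {
		ps[i] = 1.0 / float64(n-i)
	}
	return ps
}

func main() {
	fmt.Println(probabilities(2)) // [0.5 1]:对应上文 0.5 + 无条件兜底
	for _, p := range probabilities(3) {
		fmt.Printf("%.5f ", p) // 0.33333 0.50000 1.00000
	}
	fmt.Println()
}
```

这也解释了前文 3 个后端时出现 `--probability 0.33` 的原因。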

                            5. kubectl → API Server

                            kubectl 是用户与 API Server 交互的客户端工具。

                            通信流程

                            kubectl get pods
                                │
                                ├─→ 1. 读取 kubeconfig (~/.kube/config)
                                │      - API Server 地址
                                │      - 证书/Token
                                │
                                ├─→ 2. 发送 HTTP 请求
                                │      GET /api/v1/namespaces/default/pods
                                │
                                ├─→ 3. API Server 处理
                                │      - 认证
                                │      - 授权
                                │      - 从 etcd 读取数据
                                │
                                └─→ 4. 返回结果
                                       JSON 格式的 Pod 列表

                            kubectl 底层实现

                            // kubectl get pods 的简化实现
                            func getPods(namespace string) {
                                // 1. 加载 kubeconfig
                                config, _ := clientcmd.BuildConfigFromFlags("", kubeconfig)
                                
                                // 2. 创建 Clientset
                                clientset, _ := kubernetes.NewForConfig(config)
                                
                                // 3. 发起 GET 请求
                                pods, _ := clientset.CoreV1().Pods(namespace).List(
                                    context.TODO(),
                                    metav1.ListOptions{},
                                )
                                
                                // 4. 输出结果
                                for _, pod := range pods.Items {
                                    fmt.Printf("%s\t%s\t%s\n", pod.Name, pod.Status.Phase, pod.Spec.NodeName)
                                }
                            }

                            HTTP 请求详解

                            # kubectl get pods 发送的实际 HTTP 请求
                            GET /api/v1/namespaces/default/pods HTTP/1.1
                            Host: 192.168.1.10:6443
                            Authorization: Bearer eyJhbGciOiJSUzI1NiIsImtpZCI6Ij...
                            Accept: application/json
                            User-Agent: kubectl/v1.28.0
                            
                            # API Server 响应
                            HTTP/1.1 200 OK
                            Content-Type: application/json
                            
                            {
                              "kind": "PodList",
                              "apiVersion": "v1",
                              "metadata": {
                                "resourceVersion": "12345"
                              },
                              "items": [
                                {
                                  "metadata": {
                                    "name": "nginx-xxx",
                                    "namespace": "default"
                                  },
                                  "spec": {
                                    "nodeName": "worker-1",
                                    "containers": [...]
                                  },
                                  "status": {
                                    "phase": "Running"
                                  }
                                }
                              ]
                            }

                            🔄 Core Mechanism: List-Watch
                            
                            List-Watch is the single most important communication pattern in Kubernetes.
                            
                            List-Watch Architecture

                            ┌───────────────────────────────────────────────┐
                            │          Client (Kubelet/Controller)          │
                            ├───────────────────────────────────────────────┤
                            │                                               │
                            │  1. List (initial sync)                       │
                            │     GET /api/v1/pods                          │
                            │     → fetch all resources                     │
                            │     → populate local cache (Informer Cache)   │
                            │                                               │
                            │  2. Watch (incremental updates)               │
                            │     GET /api/v1/pods?watch=true               │
                            │     → long-lived connection (HTTP chunked)    │
                            │     → receive ADDED/MODIFIED/DELETED events   │
                            │                                               │
                            │  3. ResourceVersion (consistency guarantee)   │
                            │     → every resource carries a version number │
                            │     → a Watch resumes from a given version    │
                            │     → reconnects lose no events               │
                            │                                               │
                            │  4. Local cache (Indexer)                     │
                            │     → offloads the API Server                 │
                            │     → fast local queries                      │
                            │     → kept in sync automatically              │
                            └───────────────────────────────────────────────┘

                            The Informer Mechanism in Detail

                            // An Informer is a higher-level wrapper around List-Watch.
                            type Informer struct {
                                Indexer    Indexer           // local cache
                                Controller Controller        // List-Watch controller
                                Processor  *sharedProcessor  // event dispatcher
                            }
                            
                            // Watching resources through an Informer
                            func watchPodsWithInformer() {
                                // 1. Create a SharedInformerFactory (resync every 30s)
                                factory := informers.NewSharedInformerFactory(clientset, 30*time.Second)
                                
                                // 2. Get the Pod Informer
                                podInformer := factory.Core().V1().Pods()
                                
                                // 3. Register event handlers
                                podInformer.Informer().AddEventHandler(cache.ResourceEventHandlerFuncs{
                                    AddFunc: func(obj interface{}) {
                                        pod := obj.(*v1.Pod)
                                        fmt.Printf("Pod ADDED: %s\n", pod.Name)
                                    },
                                    UpdateFunc: func(oldObj, newObj interface{}) {
                                        pod := newObj.(*v1.Pod)
                                        fmt.Printf("Pod UPDATED: %s\n", pod.Name)
                                    },
                                    DeleteFunc: func(obj interface{}) {
                                        pod := obj.(*v1.Pod)
                                        fmt.Printf("Pod DELETED: %s\n", pod.Name)
                                    },
                                })
                                
                                // 4. Start the Informer
                                stopCh := make(chan struct{})
                                factory.Start(stopCh)
                                
                                // 5. Wait until the cache has synced
                                factory.WaitForCacheSync(stopCh)
                                
                                // 6. Query the local cache (no API Server round trip)
                                pod, _ := podInformer.Lister().Pods("default").Get("nginx-xxx")
                                _ = pod
                            }

                            The ResourceVersion Mechanism

                            Event stream:
                            ┌────────────────────────────────────────┐
                            │ Pod nginx-xxx created                  │ ResourceVersion: 100
                            ├────────────────────────────────────────┤
                            │ Pod nginx-xxx updated (image changed)  │ ResourceVersion: 101
                            ├────────────────────────────────────────┤
                            │ Pod nginx-xxx updated (status changed) │ ResourceVersion: 102
                            ├────────────────────────────────────────┤
                            │ Pod nginx-xxx deleted                  │ ResourceVersion: 103
                            └────────────────────────────────────────┘
                            
                            Watch requests:
                            1. Initial watch: GET /api/v1/pods?watch=true&resourceVersion=100
                               → receive events starting at version 100
                            
                            2. Reconnect: GET /api/v1/pods?watch=true&resourceVersion=102
                               → resume from version 102; the delete at version 103 is not lost
                            
                            3. Version expired: if resourceVersion is too old (already compacted in etcd)
                               → the API Server returns 410 Gone
                               → the client re-Lists to get the current state, then Watches again

                            🔐 Communication Security Details
                            
                            1. Mutual TLS Authentication

                            ┌────────────────────────────────────────┐
                            │      API Server TLS Configuration      │
                            ├────────────────────────────────────────┤
                            │                                        │
                            │  Server-side certificates:             │
                            │  - apiserver.crt (server certificate)  │
                            │  - apiserver.key (server private key)  │
                            │  - ca.crt (CA certificate)             │
                            │                                        │
                            │  Client CA:                            │
                            │  - verifies client certificates        │
                            │  - --client-ca-file=/etc/kubernetes/pki/ca.crt │
                            │                                        │
                            │  Startup flags:                        │
                            │  --tls-cert-file=/etc/kubernetes/pki/apiserver.crt │
                            │  --tls-private-key-file=/etc/kubernetes/pki/apiserver.key │
                            │  --client-ca-file=/etc/kubernetes/pki/ca.crt │
                            └────────────────────────────────────────┘
                            
                            ┌────────────────────────────────────────┐
                            │       Kubelet TLS Configuration        │
                            ├────────────────────────────────────────┤
                            │                                        │
                            │  Client certificates:                  │
                            │  - kubelet-client.crt (client certificate) │
                            │  - kubelet-client.key (client private key) │
                            │  - ca.crt (CA cert, verifies the API Server) │
                            │                                        │
                            │  kubeconfig settings:                  │
                            │  - certificate-authority: ca.crt       │
                            │  - client-certificate: kubelet-client.crt │
                            │  - client-key: kubelet-client.key      │
                            └────────────────────────────────────────┘

                            2. ServiceAccount Tokens in Detail
                            
                            # Every Pod automatically mounts a ServiceAccount
                            apiVersion: v1
                            kind: Pod
                            metadata:
                              name: my-pod
                            spec:
                              serviceAccountName: default  # which ServiceAccount to use
                              containers:
                              - name: app
                                image: nginx
                                volumeMounts:
                                - name: kube-api-access-xxxxx
                                  mountPath: /var/run/secrets/kubernetes.io/serviceaccount
                                  readOnly: true
                              volumes:
                              - name: kube-api-access-xxxxx
                                projected:
                                  sources:
                                  - serviceAccountToken:
                                      path: token                    # JWT token
                                      expirationSeconds: 3607
                                  - configMap:
                                      name: kube-root-ca.crt
                                      items:
                                      - key: ca.crt
                                        path: ca.crt                 # CA certificate
                                  - downwardAPI:
                                      items:
                                      - path: namespace
                                        fieldRef:
                                          fieldPath: metadata.namespace  # the Pod's namespace

                            Accessing the API Server from Inside a Pod

                            # Enter the Pod
                            kubectl exec -it my-pod -- sh
                            
                            # 1. Read the token
                            TOKEN=$(cat /var/run/secrets/kubernetes.io/serviceaccount/token)
                            
                            # 2. Locate the CA certificate
                            CACERT=/var/run/secrets/kubernetes.io/serviceaccount/ca.crt
                            
                            # 3. Read the namespace
                            NAMESPACE=$(cat /var/run/secrets/kubernetes.io/serviceaccount/namespace)
                            
                            # 4. Call the API Server
                            curl --cacert $CACERT \
                                 --header "Authorization: Bearer $TOKEN" \
                                 https://kubernetes.default.svc/api/v1/namespaces/$NAMESPACE/pods
                            
                            # 5. Or use kubectl proxy (the simpler route)
                            kubectl proxy --port=8080 &
                            curl http://localhost:8080/api/v1/namespaces/default/pods

                            ServiceAccount Token Structure

                            # Decode the JWT token
                            TOKEN=$(cat /var/run/secrets/kubernetes.io/serviceaccount/token)
                            echo $TOKEN | cut -d. -f2 | base64 -d | jq
                            
                            # Output:
                            {
                              "aud": [
                                "https://kubernetes.default.svc"
                              ],
                              "exp": 1696867200,        # expiry time
                              "iat": 1696863600,        # issued-at time
                              "iss": "https://kubernetes.default.svc.cluster.local",  # issuer
                              "kubernetes.io": {
                                "namespace": "default",  # namespace
                                "pod": {
                                  "name": "my-pod",
                                  "uid": "abc-123"
                                },
                                "serviceaccount": {
                                  "name": "default",     # ServiceAccount name
                                  "uid": "def-456"
                                }
                              },
                              "nbf": 1696863600,
                              "sub": "system:serviceaccount:default:default"  # subject
                            }

                            📊 Communication Patterns Summarized
                            
                            1. Push vs. Pull
                            
                            Component           | Mode               | Protocol
                            --------------------|--------------------|---------------------------
                            Kubelet             | dials API Server   | List-Watch
                            Scheduler           | dials API Server   | List-Watch
                            Controller Manager  | dials API Server   | List-Watch
                            kube-proxy          | dials API Server   | List-Watch
                            kubectl             | dials API Server   | one-off RESTful API calls
                            API Server → etcd   | API Server dials   | gRPC connection to etcd
                            
                            Important: the API Server never initiates connections to the other components; every component dials the API Server.

                            2. Wire Protocols

                            ┌─────────────────────────────────────────┐
                            │  Protocols exposed by the API Server    │
                            ├─────────────────────────────────────────┤
                            │                                         │
                            │  1. HTTPS (primary)                     │
                            │     - RESTful API                       │
                            │     - port: 6443 (default)              │
                            │     - used by all components            │
                            │                                         │
                            │  2. HTTP (discouraged)                  │
                            │     - local testing only                │
                            │     - port: 8080 (default, deprecated)  │
                            │     - disabled in production            │
                            │                                         │
                            │  3. WebSocket (special cases)           │
                            │     - kubectl exec/logs/port-forward    │
                            │     - upgraded from HTTPS               │
                            └─────────────────────────────────────────┘
                            
                            ┌─────────────────────────────────────────┐
                            │  API Server → etcd protocol             │
                            ├─────────────────────────────────────────┤
                            │                                         │
                            │  gRPC (HTTP/2)                          │
                            │  - port: 2379                           │
                            │  - mutual TLS (mTLS)                    │
                            │  - high-performance binary protocol     │
                            └─────────────────────────────────────────┘

                            🛠️ Hands-on: Monitoring Component Traffic
                            
                            1. Inspecting Connection State

                            # 1. Which port is the API Server listening on?
                            netstat -tlnp | grep kube-apiserver
                            # tcp   0   0 :::6443   :::*   LISTEN   12345/kube-apiserver
                            
                            # 2. Which clients are connected to the API Server?
                            netstat -anp | grep :6443 | grep ESTABLISHED
                            # tcp   0   0 192.168.1.10:6443   192.168.1.11:45678   ESTABLISHED   (Kubelet)
                            # tcp   0   0 192.168.1.10:6443   192.168.1.10:45679   ESTABLISHED   (Scheduler)
                            # tcp   0   0 192.168.1.10:6443   192.168.1.10:45680   ESTABLISHED   (Controller Manager)
                            
                            # 3. Tail the API Server log
                            journalctl -u kube-apiserver -f
                            # I1011 10:00:00.123456   12345 httplog.go:89] "HTTP" verb="GET" URI="/api/v1/pods?watch=true" latency="30.123ms" userAgent="kubelet/v1.28.0" srcIP="192.168.1.11:45678"
                            
                            # 4. Check the Kubelet's connection attempts
                            journalctl -u kubelet -f | grep "Connecting to API"

                            2. Capturing Packets with tcpdump

                            # Capture API Server traffic (port 6443)
                            tcpdump -i any -n port 6443 -A -s 0
                            
                            # Capture traffic to/from one specific host
                            tcpdump -i any -n host 192.168.1.11 and port 6443
                            
                            # Save to a file for analysis in Wireshark
                            tcpdump -i any -n port 6443 -w api-traffic.pcap

                            3. API Server Audit Logs

                            # API Server audit policy
                            apiVersion: audit.k8s.io/v1
                            kind: Policy
                            rules:
                            # log request metadata for reads
                            - level: Metadata
                              verbs: ["get", "list", "watch"]
                            # log full request and response bodies for mutations
                            - level: RequestResponse
                              verbs: ["create", "update", "patch", "delete"]
                            
                            # Enable audit logging on the API Server
                            kube-apiserver \
                              --audit-policy-file=/etc/kubernetes/audit-policy.yaml \
                              --audit-log-path=/var/log/kubernetes/audit.log \
                              --audit-log-maxage=30 \
                              --audit-log-maxbackup=10 \
                              --audit-log-maxsize=100
                            
                            # Tail the audit log
                            tail -f /var/log/kubernetes/audit.log | jq
                            
                            # Sample entry:
                            {
                              "kind": "Event",
                              "apiVersion": "audit.k8s.io/v1",
                              "level": "Metadata",
                              "auditID": "abc-123",
                              "stage": "ResponseComplete",
                              "requestURI": "/api/v1/namespaces/default/pods?watch=true",
                              "verb": "watch",
                              "user": {
                                "username": "system:node:worker-1",
                                "groups": ["system:nodes"]
                              },
                              "sourceIPs": ["192.168.1.11"],
                              "userAgent": "kubelet/v1.28.0",
                              "responseStatus": {
                                "code": 200
                              }
                            }

                            🔍 Advanced Topics
                            
                            1. The API Aggregation Layer
                            
                            The aggregation layer lets you extend the API Server with custom APIs served by separate backends.

                            ┌────────────────────────────────────────┐
                            │       Main API Server (kube-apiserver) │
                            │         /api, /apis                    │
                            └───────────────┬────────────────────────┘
                                            │ proxies the request
                                    ┌───────┴────────┐
                                    ▼                ▼
                            ┌──────────────┐  ┌─────────────────┐
                            │ Metrics API  │  │ Custom API      │
                            │ /apis/metrics│  │ /apis/my.api/v1 │
                            └──────────────┘  └─────────────────┘

                            Registering an APIService

                            apiVersion: apiregistration.k8s.io/v1
                            kind: APIService
                            metadata:
                              name: v1beta1.metrics.k8s.io
                            spec:
                              service:
                                name: metrics-server
                                namespace: kube-system
                                port: 443
                              group: metrics.k8s.io
                              version: v1beta1
                              insecureSkipTLSVerify: true  # demo only; prefer caBundle in production
                              groupPriorityMinimum: 100
                              versionPriority: 100

                            Request Routing

                            # Client request
                            kubectl top nodes
                            # equivalent to: GET /apis/metrics.k8s.io/v1beta1/nodes
                            
                            # API Server handling:
                            # 1. Match the path /apis/metrics.k8s.io/v1beta1
                            # 2. Look up the corresponding APIService
                            # 3. Proxy the request to the metrics-server Service
                            # 4. Return the result to the client

                            2. API Priority and Fairness (APF)
                            
                            APF controls request priority and concurrency limits inside the API Server.

                            # FlowSchema - matches requests to a priority level
                            apiVersion: flowcontrol.apiserver.k8s.io/v1beta3
                            kind: FlowSchema
                            metadata:
                              name: system-nodes
                            spec:
                              priorityLevelConfiguration:
                                name: system  # the priority level this schema maps to
                              matchingPrecedence: 900
                              distinguisherMethod:
                                type: ByUser
                              rules:
                              - subjects:
                                - kind: Group
                                  group:
                                    name: system:nodes  # matches Kubelet requests
                                resourceRules:
                                - verbs: ["*"]
                                  apiGroups: ["*"]
                                  resources: ["*"]
                                  namespaces: ["*"]
                            
                            ---
                            # PriorityLevelConfiguration - defines the concurrency limit
                            apiVersion: flowcontrol.apiserver.k8s.io/v1beta3
                            kind: PriorityLevelConfiguration
                            metadata:
                              name: system
                            spec:
                              type: Limited
                              limited:
                                assuredConcurrencyShares: 30  # guaranteed concurrency shares
                                limitResponse:
                                  type: Queue
                                  queuing:
                                    queues: 64           # number of queues
                                    queueLengthLimit: 50 # length of each queue
                                    handSize: 6          # shuffle-sharding parameter

                            How APF Processes a Request

                            Request arrives at the API Server
                                │
                                ├─→ 1. Match a FlowSchema (in precedence order)
                                │      - check the subject (user/group/serviceaccount)
                                │      - check the resource (API path)
                                │
                                ├─→ 2. Determine the PriorityLevel
                                │      - system (high priority: Kubelet/Scheduler)
                                │      - leader-election (medium priority: Controller Manager)
                                │      - workload-high (user requests)
                                │      - catch-all (default)
                                │
                                ├─→ 3. Check the concurrency limit
                                │      - current concurrency < assuredConcurrencyShares: execute now
                                │      - over the limit: wait in a queue
                                │
                                └─→ 4. Execute or reject
                                       - free slot in a queue: wait, then execute
                                       - queue full: return 429 Too Many Requests

                            Inspecting APF State

                            # List the FlowSchemas
                            kubectl get flowschemas
                            
                            # List the PriorityLevelConfigurations
                            kubectl get prioritylevelconfigurations
                            
                            # Live metrics
                            kubectl get --raw /metrics | grep apiserver_flowcontrol
                            
                            # Key metrics:
                            # apiserver_flowcontrol_current_inqueue_requests: requests currently queued
                            # apiserver_flowcontrol_rejected_requests_total: rejected requests
                            # apiserver_flowcontrol_request_concurrency_limit: concurrency limit

                            3. Watch Bookmarks
                            
                            Bookmarks improve Watch performance by making reconnects after a dropped connection much cheaper.

                            // Enable watch bookmarks (w named so it does not shadow the watch package)
                            w, _ := clientset.CoreV1().Pods("default").Watch(
                                context.TODO(),
                                metav1.ListOptions{
                                    Watch:               true,
                                    AllowWatchBookmarks: true,  // 🔑 enable bookmarks
                                },
                            )
                            
                            for event := range w.ResultChan() {
                                switch event.Type {
                                case watch.Added:
                                    // handle add
                                case watch.Modified:
                                    // handle update
                                case watch.Deleted:
                                    // handle delete
                                case watch.Bookmark:
                                    // 🔑 bookmark event (no actual data change);
                                    // it only reports the current ResourceVersion,
                                    // so a reconnect can resume from a recent point
                                    pod := event.Object.(*v1.Pod)
                                    currentRV := pod.ResourceVersion
                                    fmt.Printf("Bookmark at ResourceVersion: %s\n", currentRV)
                                }
                            }

                            What Bookmarks Buy You

                            Without bookmarks:
                            ┌─────────────────────────────────────────┐
                            │ Client watches from ResourceVersion 100 │
                            │ No events for a long time (say 1 hour)  │
                            │ The connection drops                    │
                            │ On reconnect: watch from RV 100         │
                            │ The API Server must replay every event  │
                            │ between 100 and 200 (even ones the      │
                            │ client does not need)                   │
                            └─────────────────────────────────────────┘
                            
                            With bookmarks:
                            ┌─────────────────────────────────────────┐
                            │ Client watches from ResourceVersion 100 │
                            │ Receives a bookmark every 10 minutes:   │
                            │   RV 110 (after 10 min)                 │
                            │   RV 120 (after 20 min)                 │
                            │   RV 130 (after 30 min)                 │
                            │ The connection drops                    │
                            │ On reconnect: watch from RV 130 ✅      │
                            │ Only events 130-200 are replayed        │
                            └─────────────────────────────────────────┘

                            4. Client-side Rate Limiting
                            
                            Client-side rate limiting keeps a single client from overwhelming the API Server.

                            // client-go's default rate-limiting configuration
                            config := &rest.Config{
                                Host: "https://192.168.1.10:6443",
                                // QPS limit
                                QPS: 50.0,        // 50 requests per second
                                // Burst limit
                                Burst: 100,       // up to 100 requests in a burst
                            }
                            
                            clientset, _ := kubernetes.NewForConfig(config)
                            
                            // A custom rate limiter
                            import "golang.org/x/time/rate"
                            
                            rateLimiter := rate.NewLimiter(
                                rate.Limit(50),  // 50 per second
                                100,             // burst of 100
                            )
                            
                            // Wait before sending each request
                            rateLimiter.Wait(context.Background())
                            clientset.CoreV1().Pods("default").List(...)

                            📈 Performance Tuning
                            
                            1. Server-side Tuning

                            # API Server startup flags:
                            #   --max-requests-inflight / --max-mutating-requests-inflight: in-flight request limits
                            #   --watch-cache-sizes / --default-watch-cache-size: watch cache sizing
                            #   --etcd-servers-overrides: store events in a separate etcd
                            #   --enable-aggregator-routing: route aggregated API traffic via endpoints
                            # (comments moved above the command: a comment after a trailing
                            #  backslash would break the line continuation)
                            kube-apiserver \
                              --max-requests-inflight=400 \
                              --max-mutating-requests-inflight=200 \
                              --watch-cache-sizes=pods#1000,nodes#100 \
                              --etcd-servers-overrides=/events#https://etcd-1:2379 \
                              --enable-aggregator-routing=true \
                              --default-watch-cache-size=100

                            2. Client-side Tuning

                            // 1. 使用 Informer (本地缓存)
                            factory := informers.NewSharedInformerFactory(clientset, 30*time.Second)
                            podInformer := factory.Core().V1().Pods()
                            
                            // 从本地缓存读取,不访问 API Server
                            pod, _ := podInformer.Lister().Pods("default").Get("nginx")
                            
                            // 2. 使用 Field Selector 减少数据量
                            listOptions := metav1.ListOptions{
                                FieldSelector: "spec.nodeName=worker-1",  // 只获取特定节点的 Pod
                            }
                            
                            // 3. 使用 Label Selector
                            listOptions := metav1.ListOptions{
                                LabelSelector: "app=nginx",  // 只获取特定标签的 Pod
                            }
                            
                            // 4. 限制返回字段
                            listOptions := metav1.ListOptions{
                                Limit: 100,  // 分页,每次只返回 100 个
                            }
                            
                            // 5. 批量操作
                            // 不推荐: 循环创建 100 个 Pod(100 次 API 调用)
                            for i := 0; i < 100; i++ {
                                clientset.CoreV1().Pods("default").Create(...)
                            }
                            
                            // 推荐: 使用 Job/Deployment(1 次 API 调用)
                            deployment := &appsv1.Deployment{
                                Spec: appsv1.DeploymentSpec{
                                    Replicas: int32Ptr(100),
                                    ...
                                },
                            }
                            clientset.AppsV1().Deployments("default").Create(deployment)
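The `Limit` option above actually paginates: the server returns at most `Limit` items plus a continue token that resumes the listing where the previous page stopped. A minimal sketch of that contract in plain Go (illustrative names, not the real client-go types; the token here is just an offset):

```go
package main

import "fmt"

// listPage returns one page of at most `limit` items and a continue
// token; a zero token means there are no more pages.
func listPage(items []string, limit, cont int) (page []string, next int) {
	end := cont + limit
	if end > len(items) {
		end = len(items)
	}
	page = items[cont:end]
	if end < len(items) {
		next = end // non-zero token: more pages remain
	}
	return page, next
}

func main() {
	pods := []string{"p1", "p2", "p3", "p4", "p5"}
	cont := 0
	for {
		page, next := listPage(pods, 2, cont)
		fmt.Println(page) // pages of at most 2 items
		if next == 0 {
			break
		}
		cont = next
	}
}
```

The client sends the token back on the next List call, so each request stays small no matter how many objects exist.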

💡 Key Takeaways

Communication patterns

1. All components connect to the API Server actively (the API Server never pushes unsolicited)
2. List-Watch is the core mechanism (an initial List plus a continuous Watch)
3. HTTP long-lived connections (chunked transfer encoding)
4. ResourceVersion guarantees consistency (no events are lost across reconnects)
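The resourceVersion guarantee in point 4 can be sketched without a cluster: List hands back a snapshot plus the version it was taken at, and a Watch started from that version replays every later event, so a reconnecting client misses nothing. Types and methods below are illustrative stand-ins, not the client-go API:

```go
package main

import "fmt"

// Event is a change record stamped with a monotonically increasing version.
type Event struct {
	ResourceVersion int
	Action, Object  string
}

// Store is a toy event log standing in for etcd + the API Server watch cache.
type Store struct{ log []Event }

func (s *Store) Append(action, obj string) {
	s.log = append(s.log, Event{len(s.log) + 1, action, obj})
}

// List returns the latest resourceVersion; Watch(rv) replays events newer than rv.
func (s *Store) List() int { return len(s.log) }
func (s *Store) Watch(rv int) []Event {
	var out []Event
	for _, e := range s.log {
		if e.ResourceVersion > rv {
			out = append(out, e)
		}
	}
	return out
}

func main() {
	st := &Store{}
	st.Append("ADDED", "pod-a")
	rv := st.List()             // client Lists, observes rv=1
	st.Append("ADDED", "pod-b") // happens while the client is disconnected
	st.Append("DELETED", "pod-a")
	for _, e := range st.Watch(rv) { // reconnect with the last-seen rv
		fmt.Println(e.ResourceVersion, e.Action, e.Object)
	}
}
```

Reconnecting with the last-seen version yields exactly the two events that happened in between, which is why the client never re-Lists on a clean reconnect.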

Authentication & authorization

1. X.509 certificates (cluster components)
2. ServiceAccount tokens (applications running in Pods)
3. RBAC authorization (fine-grained access control)
4. Admission control (request validation and mutation)

Performance optimization

1. Informer local caches (reduce load on the API Server)
2. Field/Label Selectors (reduce data transferred)
3. APF flow control (keeps the API Server from being overloaded)
4. Client-side rate limiting (keeps clients from overwhelming the API Server)

Best practices

1. Use Informers instead of polling
2. Set QPS and Burst sensibly
3. Avoid frequent List operations
4. Filter data with Field Selectors
5. Enable Watch bookmarks
6. Monitor API Server metrics
Mar 7, 2024

Monitor

Mar 7, 2025

Subsections of Networking

Ingress

Kubernetes Ingress in Depth

Ingress is the Kubernetes API object that manages external access to services inside the cluster, providing HTTP/HTTPS routing.


🎯 What Ingress Is For

The problem without Ingress

Problem 1: every service needs its own LoadBalancer
┌────────────────────────────────────┐
│  Service A (LoadBalancer)  $$$     │
│  Service B (LoadBalancer)  $$$     │
│  Service C (LoadBalancer)  $$$     │
└────────────────────────────────────┘
Expensive, hard to manage, wastes IP addresses

Problem 2: no domain- or path-based routing
Client → NodePort:30001 (Service A)
Client → NodePort:30002 (Service B)
Users have to remember different ports, which is unfriendly

The Ingress approach

A single entry point plus smart routing
┌───────────────────────────────────────┐
│         Ingress Controller            │
│    (one LoadBalancer or NodePort)     │
└───────────┬───────────────────────────┘
            │ routes by host/path
    ┌───────┴───────┬──────────┐
    ▼               ▼          ▼
Service A       Service B   Service C
(ClusterIP)     (ClusterIP) (ClusterIP)

🏗️ Ingress Architecture

Core components

┌─────────────────────────────────────────────┐
│            The Ingress ecosystem            │
├─────────────────────────────────────────────┤
│  1. Ingress Resource                        │
│     └─ declares the routing rules (YAML)    │
│                                             │
│  2. Ingress Controller                      │
│     └─ reads Ingresses, configures the LB   │
│                                             │
│  3. Load balancer (Nginx/Traefik/HAProxy)   │
│     └─ the component that handles traffic   │
└─────────────────────────────────────────────┘

📋 The Ingress Resource

Basic example

apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: example-ingress
  annotations:
    nginx.ingress.kubernetes.io/rewrite-target: /
spec:
  # 1. Host-based routing
  rules:
  - host: example.com
    http:
      paths:
      - path: /
        pathType: Prefix
        backend:
          service:
            name: web-service
            port:
              number: 80

  # 2. TLS/HTTPS configuration
  tls:
  - hosts:
    - example.com
    secretName: example-tls

Full-featured example

apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: advanced-ingress
  namespace: default
  annotations:
    # Nginx-specific settings
    nginx.ingress.kubernetes.io/rewrite-target: /$2
    nginx.ingress.kubernetes.io/ssl-redirect: "true"
    nginx.ingress.kubernetes.io/rate-limit: "100"
    # Custom response header
    nginx.ingress.kubernetes.io/configuration-snippet: |
      add_header X-Custom-Header "Hello from Ingress";
spec:
  # IngressClass (selects which Ingress Controller handles this resource)
  ingressClassName: nginx

  # TLS configuration
  tls:
  - hosts:
    - app.example.com
    - api.example.com
    secretName: example-tls-secret

  # Routing rules
  rules:
  # Rule 1: app.example.com
  - host: app.example.com
    http:
      paths:
      - path: /
        pathType: Prefix
        backend:
          service:
            name: frontend-service
            port:
              number: 80

  # Rule 2: api.example.com
  - host: api.example.com
    http:
      paths:
      # /v1/* goes to api-v1
      - path: /v1
        pathType: Prefix
        backend:
          service:
            name: api-v1-service
            port:
              number: 8080

      # /v2/* goes to api-v2
      - path: /v2
        pathType: Prefix
        backend:
          service:
            name: api-v2-service
            port:
              number: 8080

  # Rule 3: default backend (optional)
  defaultBackend:
    service:
      name: default-backend
      port:
        number: 80

🎛️ PathType (Path Matching)

The three match types

| PathType               | Matching rule                   | Example                              |
|------------------------|---------------------------------|--------------------------------------|
| Prefix                 | Prefix match                    | /foo matches /foo, /foo/, /foo/bar   |
| Exact                  | Exact match                     | /foo matches only /foo, not /foo/    |
| ImplementationSpecific | Left to the Ingress Controller  | Depends on the implementation        |
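The two portable match types can be pinned down with a toy matcher. This is a simplification of the Ingress spec's rules (`matches` is a hypothetical helper, not a Kubernetes API); the key subtlety is that Prefix matches on whole path segments, so /foo does not match /foobar:

```go
package main

import (
	"fmt"
	"strings"
)

// matches reports whether a request path satisfies an Ingress rule path
// under the given pathType, following the spec's segment-wise Prefix rule.
func matches(pathType, rulePath, reqPath string) bool {
	switch pathType {
	case "Exact":
		return reqPath == rulePath
	case "Prefix":
		// exact hit, or the rule path followed by a segment boundary
		return reqPath == rulePath ||
			strings.HasPrefix(reqPath, strings.TrimSuffix(rulePath, "/")+"/")
	}
	return false
}

func main() {
	fmt.Println(matches("Prefix", "/foo", "/foo/bar")) // true
	fmt.Println(matches("Prefix", "/foo", "/foobar"))  // false: not a segment boundary
	fmt.Println(matches("Exact", "/foo", "/foo/"))     // false
}
```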

Example comparison

apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: path-types-demo
spec:
  rules:
  - host: example.com
    http:
      paths:
      # Prefix match
      - path: /api
        pathType: Prefix
        backend:
          service:
            name: api-service
            port:
              number: 8080
      # Matches:
      # ✅ /api
      # ✅ /api/
      # ✅ /api/users
      # ✅ /api/v1/users

      # Exact match
      - path: /login
        pathType: Exact
        backend:
          service:
            name: auth-service
            port:
              number: 80
      # Matches:
      # ✅ /login
      # ❌ /login/
      # ❌ /login/oauth

🚀 Ingress Controllers

Common Ingress Controllers

| Controller    | Characteristics               | Best for                              |
|---------------|-------------------------------|---------------------------------------|
| Nginx Ingress | Most popular, feature-rich    | General use, recommended for prod     |
| Traefik       | Cloud native, dynamic config  | Microservices, auto service discovery |
| HAProxy       | High performance, enterprise  | Heavy traffic, high concurrency       |
| Kong          | API gateway features          | API management, plugin ecosystem      |
| Istio Gateway | Service mesh integration      | Complex microservice architectures    |
| AWS ALB       | Cloud native (AWS)            | AWS environments                      |
| GCE           | Cloud native (GCP)            | GCP environments                      |

🔧 How an Ingress Controller Works

Core workflow

┌─────────────────────────────────────────────┐
│  1. User creates/updates an Ingress         │
│     kubectl apply -f ingress.yaml           │
└────────────────┬────────────────────────────┘
                 │
                 ▼
┌─────────────────────────────────────────────┐
│  2. Ingress Controller watches API Server   │
│     - Watch Ingress objects                 │
│     - Watch Service objects                 │
│     - Watch Endpoints objects               │
└────────────────┬────────────────────────────┘
                 │
                 ▼
┌─────────────────────────────────────────────┐
│  3. Generate the config file                │
│     Nginx:   /etc/nginx/nginx.conf          │
│     Traefik: dynamic configuration          │
│     HAProxy: /etc/haproxy/haproxy.cfg       │
└────────────────┬────────────────────────────┘
                 │
                 ▼
┌─────────────────────────────────────────────┐
│  4. Reload/update the load balancer         │
│     nginx -s reload                         │
└────────────────┬────────────────────────────┘
                 │
                 ▼
┌─────────────────────────────────────────────┐
│  5. Traffic routing takes effect            │
│     client → Ingress → Service → Pod        │
└─────────────────────────────────────────────┘

📦 Deploying the Nginx Ingress Controller

Option 1: official Helm chart (recommended)

# Add the Helm repo
helm repo add ingress-nginx https://kubernetes.github.io/ingress-nginx
helm repo update

# Install
helm install ingress-nginx ingress-nginx/ingress-nginx \
  --namespace ingress-nginx \
  --create-namespace \
  --set controller.service.type=LoadBalancer

# Check the deployment
kubectl get pods -n ingress-nginx
kubectl get svc -n ingress-nginx

Option 2: raw YAML

# Apply the official manifest
kubectl apply -f https://raw.githubusercontent.com/kubernetes/ingress-nginx/main/deploy/static/provider/cloud/deploy.yaml

# Check the deployment
kubectl get all -n ingress-nginx

Core components

# 1. Deployment - the Ingress Controller Pods
apiVersion: apps/v1
kind: Deployment
metadata:
  name: ingress-nginx-controller
  namespace: ingress-nginx
spec:
  replicas: 2  # 2+ recommended for high availability
  selector:
    matchLabels:
      app.kubernetes.io/name: ingress-nginx
  template:
    metadata:
      labels:
        app.kubernetes.io/name: ingress-nginx
    spec:
      serviceAccountName: ingress-nginx
      containers:
      - name: controller
        image: registry.k8s.io/ingress-nginx/controller:v1.9.0
        args:
        - /nginx-ingress-controller
        - --election-id=ingress-nginx-leader
        - --controller-class=k8s.io/ingress-nginx
        - --configmap=$(POD_NAMESPACE)/ingress-nginx-controller
        ports:
        - name: http
          containerPort: 80
        - name: https
          containerPort: 443
        livenessProbe:
          httpGet:
            path: /healthz
            port: 10254
        readinessProbe:
          httpGet:
            path: /healthz
            port: 10254

---
# 2. Service - exposes the Ingress Controller
apiVersion: v1
kind: Service
metadata:
  name: ingress-nginx-controller
  namespace: ingress-nginx
spec:
  type: LoadBalancer  # or NodePort
  ports:
  - name: http
    port: 80
    targetPort: 80
    protocol: TCP
  - name: https
    port: 443
    targetPort: 443
    protocol: TCP
  selector:
    app.kubernetes.io/name: ingress-nginx

---
# 3. ConfigMap - global Nginx settings
apiVersion: v1
kind: ConfigMap
metadata:
  name: ingress-nginx-controller
  namespace: ingress-nginx
data:
  # Custom Nginx configuration
  proxy-body-size: "100m"
  proxy-connect-timeout: "15"
  proxy-read-timeout: "600"
  proxy-send-timeout: "600"
  use-forwarded-headers: "true"

🌐 The Full Traffic Path

Request flow in detail

Client
  │ 1. DNS resolution
  │    example.com → LoadBalancer IP (1.2.3.4)
  ▼
LoadBalancer / NodePort
  │ 2. forwards to an Ingress Controller Pod
  ▼
Ingress Controller (Nginx Pod)
  │ 3. reads the Ingress rules
  │    Host: example.com
  │    Path: /api/users
  │ 4. matches a rule
  │    rule: host=example.com, path=/api
  │    backend: api-service:8080
  ▼
Service (api-service)
  │ 5. Service selector matches Pods
  │    selector: app=api
  │ 6. looks up Endpoints
  │    endpoints: 10.244.1.5:8080, 10.244.2.8:8080
  │ 7. load balances (round robin by default)
  ▼
Pod (api-xxxx)
  │ 8. container handles the request
  │    Container Port: 8080
  ▼
Application response
  │ 9. returns along the same path
  ▼
Client receives the response

Tracing the packets

# Client sends a request
curl -H "Host: example.com" http://1.2.3.4/api/users

# 1. DNS resolution
example.com → 1.2.3.4 (LoadBalancer External IP)

# 2. TCP connection
Client:54321 → LoadBalancer:80

# 3. LoadBalancer forwards
LoadBalancer:80 → Ingress Controller Pod:80 (10.244.0.5:80)

# 4. Inside the Ingress Controller
Nginx config in effect:
  location /api {
    proxy_pass http://api-service.default.svc.cluster.local:8080;
  }

# 5. Service lookup
kube-proxy/iptables rules:
  api-service:8080 → Endpoints

# 6. Load balanced to a Pod
10.244.0.5 → 10.244.1.5:8080 (Pod IP)

# 7. Response returns
Pod → Ingress Controller → LoadBalancer → Client

🔒 HTTPS/TLS Configuration

Creating the TLS Secret

# Option 1: self-signed certificate (test environments)
openssl req -x509 -nodes -days 365 -newkey rsa:2048 \
  -keyout tls.key -out tls.crt \
  -subj "/CN=example.com"

kubectl create secret tls example-tls \
  --cert=tls.crt \
  --key=tls.key

# Option 2: Let's Encrypt (production, recommended)
# Install cert-manager
kubectl apply -f https://github.com/cert-manager/cert-manager/releases/download/v1.13.0/cert-manager.yaml

# Create a ClusterIssuer
kubectl apply -f - <<EOF
apiVersion: cert-manager.io/v1
kind: ClusterIssuer
metadata:
  name: letsencrypt-prod
spec:
  acme:
    server: https://acme-v02.api.letsencrypt.org/directory
    email: admin@example.com
    privateKeySecretRef:
      name: letsencrypt-prod
    solvers:
    - http01:
        ingress:
          class: nginx
EOF

Configuring an HTTPS Ingress

apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: https-ingress
  annotations:
    # Redirect HTTP to HTTPS automatically
    nginx.ingress.kubernetes.io/ssl-redirect: "true"
    # Let cert-manager request the certificate automatically
    cert-manager.io/cluster-issuer: "letsencrypt-prod"
spec:
  ingressClassName: nginx
  tls:
  - hosts:
    - example.com
    - www.example.com
    secretName: example-tls  # cert-manager creates this Secret automatically
  rules:
  - host: example.com
    http:
      paths:
      - path: /
        pathType: Prefix
        backend:
          service:
            name: web-service
            port:
              number: 80

Verifying HTTPS

# Inspect the certificate
curl -v https://example.com

# Inspect the Secret
kubectl get secret example-tls
kubectl describe secret example-tls

# Test the HTTP-to-HTTPS redirect
curl -I http://example.com
# HTTP/1.1 308 Permanent Redirect
# Location: https://example.com/

🎨 Advanced Routing Scenarios

Scenario 1: path-based routing

apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: path-based-routing
  annotations:
    nginx.ingress.kubernetes.io/rewrite-target: /$2
spec:
  rules:
  - host: myapp.com
    http:
      paths:
      # /api/v1/* → api-v1-service
      - path: /api/v1(/|$)(.*)
        pathType: Prefix
        backend:
          service:
            name: api-v1-service
            port:
              number: 8080

      # /api/v2/* → api-v2-service
      - path: /api/v2(/|$)(.*)
        pathType: Prefix
        backend:
          service:
            name: api-v2-service
            port:
              number: 8080

      # /admin/* → admin-service
      - path: /admin
        pathType: Prefix
        backend:
          service:
            name: admin-service
            port:
              number: 3000

      # /* → frontend-service (default)
      - path: /
        pathType: Prefix
        backend:
          service:
            name: frontend-service
            port:
              number: 80
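The `rewrite-target: /$2` annotation refers to the second capture group of the rule path. Assuming ingress-nginx's regex semantics, the effect for the `/api/v1(/|$)(.*)` rule can be reproduced with Go's regexp package (`rewrite` is an illustrative helper, not part of any controller):

```go
package main

import (
	"fmt"
	"regexp"
)

// rulePath mirrors the Ingress path: group 1 is the segment boundary,
// group 2 is the remainder that rewrite-target /$2 forwards upstream.
var rulePath = regexp.MustCompile(`^/api/v1(/|$)(.*)`)

// rewrite returns the path the backend actually receives.
func rewrite(p string) string {
	m := rulePath.FindStringSubmatch(p)
	if m == nil {
		return p // rule does not match; no rewrite
	}
	return "/" + m[2]
}

func main() {
	fmt.Println(rewrite("/api/v1/users")) // /users
	fmt.Println(rewrite("/api/v1"))       // /
}
```

So api-v1-service sees /users, not /api/v1/users, which lets the backend stay ignorant of the URL prefix it is mounted under.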

Scenario 2: subdomain-based routing

apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: subdomain-routing
spec:
  rules:
  # www.example.com
  - host: www.example.com
    http:
      paths:
      - path: /
        pathType: Prefix
        backend:
          service:
            name: website-service
            port:
              number: 80

  # api.example.com
  - host: api.example.com
    http:
      paths:
      - path: /
        pathType: Prefix
        backend:
          service:
            name: api-service
            port:
              number: 8080

  # blog.example.com
  - host: blog.example.com
    http:
      paths:
      - path: /
        pathType: Prefix
        backend:
          service:
            name: blog-service
            port:
              number: 80

  # *.dev.example.com (wildcard)
  - host: "*.dev.example.com"
    http:
      paths:
      - path: /
        pathType: Prefix
        backend:
          service:
            name: dev-environment
            port:
              number: 80

                              场景 3:金丝雀发布 (Canary Deployment)

                              # 主版本 Ingress
                              apiVersion: networking.k8s.io/v1
                              kind: Ingress
                              metadata:
                                name: production
                              spec:
                                rules:
                                - host: myapp.com
                                  http:
                                    paths:
                                    - path: /
                                      pathType: Prefix
                                      backend:
                                        service:
                                          name: app-v1
                                          port:
                                            number: 80
                              
                              ---
                              # 金丝雀版本 Ingress
                              apiVersion: networking.k8s.io/v1
                              kind: Ingress
                              metadata:
                                name: canary
                                 annotations:
                                   nginx.ingress.kubernetes.io/canary: "true"
                                   # Send 10% of traffic to the canary
                                   nginx.ingress.kubernetes.io/canary-weight: "10"

                                   # Or route by request header
                                   # nginx.ingress.kubernetes.io/canary-by-header: "X-Canary"
                                   # nginx.ingress.kubernetes.io/canary-by-header-value: "always"

                                   # Or route by cookie
                                   # nginx.ingress.kubernetes.io/canary-by-cookie: "canary"
                              spec:
                                rules:
                                - host: myapp.com
                                  http:
                                    paths:
                                    - path: /
                                      pathType: Prefix
                                      backend:
                                        service:
                                          name: app-v2-canary
                                          port:
                                            number: 80

                               Scenario 4: A/B Testing

                              apiVersion: networking.k8s.io/v1
                              kind: Ingress
                              metadata:
                                name: ab-testing
                                annotations:
                                   # A/B testing based on a request header
                                  nginx.ingress.kubernetes.io/canary: "true"
                                  nginx.ingress.kubernetes.io/canary-by-header: "X-Version"
                                  nginx.ingress.kubernetes.io/canary-by-header-value: "beta"
                              spec:
                                rules:
                                - host: myapp.com
                                  http:
                                    paths:
                                    - path: /
                                      pathType: Prefix
                                      backend:
                                        service:
                                          name: app-beta
                                          port:
                                            number: 80
                               # Regular users get version A
                              curl http://myapp.com
                              
                               # Beta users get version B
                              curl -H "X-Version: beta" http://myapp.com

                               🔧 Common Annotations (Nginx)

                               Basic Configuration

                               metadata:
                                 annotations:
                                   # SSL redirect
                                   nginx.ingress.kubernetes.io/ssl-redirect: "true"

                                   # Force HTTPS
                                   nginx.ingress.kubernetes.io/force-ssl-redirect: "true"

                                   # Backend protocol
                                   nginx.ingress.kubernetes.io/backend-protocol: "HTTPS"  # or HTTP, GRPC

                                   # Path rewrite
                                   nginx.ingress.kubernetes.io/rewrite-target: /$2

                                   # Enable regex path matching
                                   nginx.ingress.kubernetes.io/use-regex: "true"
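To illustrate how rewrite-target: /$2 interacts with use-regex, here is a small sketch using Python's re module. The path pattern /app(/|$)(.*) is a hypothetical example (not taken from the manifests above); nginx substitutes the second capture group of the matched path into the upstream request.

```python
import re

# Hypothetical Ingress path pattern used together with
# use-regex: "true" and rewrite-target: /$2 -- the second
# capture group becomes the path sent to the backend.
PATH_PATTERN = re.compile(r"/app(/|$)(.*)")

def rewrite(path):
    """Mimic nginx's rewrite-target: /$2 for the pattern above."""
    m = PATH_PATTERN.match(path)
    if m is None:
        return None          # no rule matched; request falls through
    return "/" + m.group(2)  # "$2" -> second capture group

print(rewrite("/app/users/42"))  # -> /users/42
print(rewrite("/app"))           # -> /
print(rewrite("/other"))         # -> None
```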

                               Advanced Configuration

                               metadata:
                                 annotations:
                                   # Upload size limit
                                   nginx.ingress.kubernetes.io/proxy-body-size: "100m"

                                   # Timeouts
                                   nginx.ingress.kubernetes.io/proxy-connect-timeout: "600"
                                   nginx.ingress.kubernetes.io/proxy-send-timeout: "600"
                                   nginx.ingress.kubernetes.io/proxy-read-timeout: "600"

                                   # Session affinity (sticky sessions)
                                   nginx.ingress.kubernetes.io/affinity: "cookie"
                                   nginx.ingress.kubernetes.io/session-cookie-name: "route"
                                   nginx.ingress.kubernetes.io/session-cookie-expires: "172800"
                                   nginx.ingress.kubernetes.io/session-cookie-max-age: "172800"

                                   # Rate limiting
                                   nginx.ingress.kubernetes.io/limit-rps: "100"  # requests per second
                                   nginx.ingress.kubernetes.io/limit-connections: "10"  # concurrent connections

                                   # CORS
                                   nginx.ingress.kubernetes.io/enable-cors: "true"
                                   nginx.ingress.kubernetes.io/cors-allow-origin: "*"
                                   nginx.ingress.kubernetes.io/cors-allow-methods: "GET, POST, PUT, DELETE, OPTIONS"

                                   # IP allowlist
                                   nginx.ingress.kubernetes.io/whitelist-source-range: "10.0.0.0/8,192.168.0.0/16"

                                   # Basic auth
                                   nginx.ingress.kubernetes.io/auth-type: basic
                                   nginx.ingress.kubernetes.io/auth-secret: basic-auth
                                   nginx.ingress.kubernetes.io/auth-realm: "Authentication Required"

                                   # Custom Nginx config snippet
                                   nginx.ingress.kubernetes.io/configuration-snippet: |
                                     more_set_headers "X-Custom-Header: MyValue";
                                     add_header X-Request-ID $request_id;
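The limit-rps annotation maps to nginx's limit_req machinery, which is a leaky-bucket limiter. A minimal sketch of that idea follows; the class and parameter names are mine, not part of the annotation API.

```python
class LeakyBucket:
    """Minimal sketch of the leaky-bucket idea behind limit-rps.

    `rate` is requests/second; `burst` is how many extra requests may
    accumulate before the limiter starts rejecting (nginx answers 429).
    """

    def __init__(self, rate, burst=0):
        self.rate = rate
        self.burst = burst
        self.excess = 0.0   # requests currently "in the bucket"
        self.last = None    # timestamp of the previous request

    def allow(self, now):
        if self.last is not None:
            # The bucket drains at `rate` requests per second.
            self.excess = max(0.0, self.excess - (now - self.last) * self.rate)
        self.last = now
        if self.excess > self.burst:
            return False    # bucket overflow -> reject
        self.excess += 1.0
        return True

bucket = LeakyBucket(rate=2)  # ~2 requests/second, no burst
print(bucket.allow(0.0))   # True  -- bucket starts empty
print(bucket.allow(0.1))   # False -- only 0.2 "drained" since the last request
print(bucket.allow(0.6))   # True  -- enough time has passed
```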

                               🛡️ Security

                               1. Basic Authentication

                               # Create a password file
                               htpasswd -c auth admin
                               # Enter the password when prompted

                               # Create a Secret from the file
                               kubectl create secret generic basic-auth --from-file=auth

                               # Apply it in an Ingress
                              kubectl apply -f - <<EOF
                              apiVersion: networking.k8s.io/v1
                              kind: Ingress
                              metadata:
                                name: secure-ingress
                                annotations:
                                  nginx.ingress.kubernetes.io/auth-type: basic
                                  nginx.ingress.kubernetes.io/auth-secret: basic-auth
                                  nginx.ingress.kubernetes.io/auth-realm: "Authentication Required - Please enter your credentials"
                              spec:
                                rules:
                                - host: admin.example.com
                                  http:
                                    paths:
                                    - path: /
                                      pathType: Prefix
                                      backend:
                                        service:
                                          name: admin-service
                                          port:
                                            number: 80
                              EOF

                               2. IP Allowlist

                              apiVersion: networking.k8s.io/v1
                              kind: Ingress
                              metadata:
                                name: whitelist-ingress
                                annotations:
                                   # Only allow specific IP ranges
                                  nginx.ingress.kubernetes.io/whitelist-source-range: "10.0.0.0/8,192.168.1.100/32"
                              spec:
                                rules:
                                - host: internal.example.com
                                  http:
                                    paths:
                                    - path: /
                                      pathType: Prefix
                                      backend:
                                        service:
                                          name: internal-service
                                          port:
                                            number: 80

                               3. OAuth2 Authentication

                              apiVersion: networking.k8s.io/v1
                              kind: Ingress
                              metadata:
                                name: oauth2-ingress
                                annotations:
                                  nginx.ingress.kubernetes.io/auth-url: "https://oauth2-proxy.example.com/oauth2/auth"
                                  nginx.ingress.kubernetes.io/auth-signin: "https://oauth2-proxy.example.com/oauth2/start?rd=$escaped_request_uri"
                              spec:
                                rules:
                                - host: app.example.com
                                  http:
                                    paths:
                                    - path: /
                                      pathType: Prefix
                                      backend:
                                        service:
                                          name: protected-service
                                          port:
                                            number: 80

                               📊 Monitoring and Debugging

                               Check Ingress Status

                               # List all Ingresses
                               kubectl get ingress

                               # Show details
                               kubectl describe ingress example-ingress

                               # Tail the Ingress Controller logs
                               kubectl logs -n ingress-nginx -l app.kubernetes.io/name=ingress-nginx -f

                               # Inspect the generated Nginx config
                              kubectl exec -n ingress-nginx <ingress-controller-pod> -- cat /etc/nginx/nginx.conf

                               Test Ingress Rules

                               # Test DNS resolution
                               nslookup example.com

                               # Test HTTP
                               curl -H "Host: example.com" http://<ingress-ip>/

                               # Test HTTPS
                               curl -k -H "Host: example.com" https://<ingress-ip>/

                               # Inspect response headers
                               curl -I -H "Host: example.com" http://<ingress-ip>/

                               # Test a specific path
                              curl -H "Host: example.com" http://<ingress-ip>/api/users

                               Troubleshooting Common Issues

                               # 1. Check whether the Ingress has an Address
                               kubectl get ingress
                               # An empty ADDRESS column means the Ingress Controller is not ready

                               # 2. Check the Service and its Endpoints
                               kubectl get svc
                               kubectl get endpoints

                               # 3. Check the Ingress Controller Pod
                               kubectl get pods -n ingress-nginx
                               kubectl logs -n ingress-nginx <pod-name>

                               # 4. Check DNS resolution
                               kubectl run -it --rm debug --image=busybox --restart=Never -- nslookup example.com

                               # 5. Check network connectivity
                              kubectl run -it --rm debug --image=curlimages/curl --restart=Never -- \
                                curl -H "Host: example.com" http://web-service.default.svc.cluster.local

                              🎯 Ingress vs Service Type

                               Comparison Table

                               | Dimension          | Ingress             | LoadBalancer                    | NodePort         |
                               |--------------------|---------------------|---------------------------------|------------------|
                               | Cost               | 1 LB total          | 1 LB per service                | Free             |
                               | Host-based routing | ✅ Supported        | ❌ Not supported                | ❌ Not supported |
                               | Path-based routing | ✅ Supported        | ❌ Not supported                | ❌ Not supported |
                               | TLS termination    | ✅ Supported        | ⚠️ Needs extra setup            | ❌ Not supported |
                               | Layer-7 features   | ✅ Rich             | ❌ Layer 4 only                 | ❌ Layer 4 only  |
                               | Best for           | HTTP/HTTPS services | Services needing a dedicated LB | Dev/testing      |

                               💡 Key Takeaways

                               The Value of Ingress

                               1. Cost savings: many services share a single LoadBalancer
                               2. Smart routing: layer-7 routing by host and path
                               3. TLS management: centralized handling of HTTPS certificates
                               4. Advanced features: rate limiting, auth, rewrites, CORS, and more
                               5. Ease of management: declarative config with a single entry point

                               Core Concepts

                               • Ingress Resource: the YAML that defines the routing rules
                               • Ingress Controller: the controller that reads those rules and implements the routing
                               • Load balancer: the component that actually handles traffic (Nginx/Traefik/HAProxy)

                               Typical Use Cases

                               • ✅ Microservice API gateway
                               • ✅ Multi-tenant apps (isolated by subdomain)
                               • ✅ Blue-green deployments / canary releases
                               • ✅ Unified entry point for web apps
                               • ❌ Non-HTTP protocols (for TCP/UDP, consider the Gateway API)

                               🚀 Advanced Topics

                               1. IngressClass (Multiple Ingress Controllers)

                               Run multiple Ingress Controllers in the same cluster:

                               # Define IngressClasses
                              apiVersion: networking.k8s.io/v1
                              kind: IngressClass
                              metadata:
                                name: nginx
                                annotations:
                                  ingressclass.kubernetes.io/is-default-class: "true"
                              spec:
                                controller: k8s.io/ingress-nginx
                              
                              ---
                              apiVersion: networking.k8s.io/v1
                              kind: IngressClass
                              metadata:
                                name: traefik
                              spec:
                                controller: traefik.io/ingress-controller
                              
                              ---
                               # Use a specific IngressClass
                              apiVersion: networking.k8s.io/v1
                              kind: Ingress
                              metadata:
                                name: my-ingress
                              spec:
                                 ingressClassName: nginx  # 🔑 select the nginx controller
                                rules:
                                - host: example.com
                                  http:
                                    paths:
                                    - path: /
                                      pathType: Prefix
                                      backend:
                                        service:
                                          name: web-service
                                          port:
                                            number: 80

                               Use cases:

                               • Nginx for internal services, Traefik for external ones
                               • Different teams using different Ingress Controllers
                               • Split by environment (Traefik for dev, Nginx for prod)

                               2. Default Backend

                               Handle requests that match no rule:

                               # Default backend Service
                              apiVersion: v1
                              kind: Service
                              metadata:
                                name: default-backend
                              spec:
                                selector:
                                  app: default-backend
                                ports:
                                - port: 80
                                  targetPort: 8080
                              
                              ---
                              apiVersion: apps/v1
                              kind: Deployment
                              metadata:
                                name: default-backend
                              spec:
                                replicas: 1
                                selector:
                                  matchLabels:
                                    app: default-backend
                                template:
                                  metadata:
                                    labels:
                                      app: default-backend
                                  spec:
                                    containers:
                                    - name: default-backend
                                      image: registry.k8s.io/defaultbackend-amd64:1.5
                                      ports:
                                      - containerPort: 8080
                              
                              ---
                               # Reference the default backend in an Ingress
                              apiVersion: networking.k8s.io/v1
                              kind: Ingress
                              metadata:
                                name: ingress-with-default
                              spec:
                                defaultBackend:
                                  service:
                                    name: default-backend
                                    port:
                                      number: 80
                                rules:
                                - host: example.com
                                  http:
                                    paths:
                                    - path: /app
                                      pathType: Prefix
                                      backend:
                                        service:
                                          name: app-service
                                          port:
                                            number: 80

                               Result:

                               • example.com/app → app-service
                               • example.com/other → default-backend (404 page)
                               • unknown.com → default-backend
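The fallback behavior described above can be sketched as a tiny routing function. Service names are taken from the example; the prefix matching here is simplified (real Prefix matching works on whole path segments).

```python
# Toy model of Ingress matching with a default backend: rules are
# (host, path_prefix, service); anything matching no rule lands on
# the default backend.
RULES = [("example.com", "/app", "app-service")]
DEFAULT_BACKEND = "default-backend"

def route(host, path):
    for rule_host, prefix, service in RULES:
        if host == rule_host and path.startswith(prefix):
            return service
    return DEFAULT_BACKEND  # unmatched host or path

print(route("example.com", "/app/login"))  # -> app-service
print(route("example.com", "/other"))      # -> default-backend
print(route("unknown.com", "/app"))        # -> default-backend
```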

                               3. ExternalName Services with Ingress

                               Route Ingress traffic to a service outside the cluster:

                               # Create an ExternalName Service
                              apiVersion: v1
                              kind: Service
                              metadata:
                                name: external-api
                              spec:
                                type: ExternalName
                                 externalName: api.external-service.com  # external domain
                              
                              ---
                               # Route the Ingress to the external service
                              apiVersion: networking.k8s.io/v1
                              kind: Ingress
                              metadata:
                                name: external-ingress
                                annotations:
                                  nginx.ingress.kubernetes.io/backend-protocol: "HTTPS"
                                  nginx.ingress.kubernetes.io/upstream-vhost: "api.external-service.com"
                              spec:
                                rules:
                                - host: myapp.com
                                  http:
                                    paths:
                                    - path: /external
                                      pathType: Prefix
                                      backend:
                                        service:
                                          name: external-api
                                          port:
                                            number: 443

                               Use cases:

                               • Integrating third-party APIs
                               • Hybrid-cloud architectures (some services live outside the cluster)
                               • Gradual migration (moving external services into the cluster step by step)

                               4. Cross-Namespace References (via ExternalName)

                               By default an Ingress can only reference Services in its own namespace; crossing namespaces takes a workaround:

                               # Namespace: backend
                              apiVersion: v1
                              kind: Service
                              metadata:
                                name: api-service
                                namespace: backend
                              spec:
                                selector:
                                  app: api
                                ports:
                                - port: 8080
                              
                              ---
                               # Namespace: frontend
                               # An ExternalName Service that points at the service in the backend namespace
                              apiVersion: v1
                              kind: Service
                              metadata:
                                name: api-proxy
                                namespace: frontend
                              spec:
                                type: ExternalName
                                externalName: api-service.backend.svc.cluster.local
                                ports:
                                - port: 8080
                              
                              ---
                               # The Ingress lives in the frontend namespace
                              apiVersion: networking.k8s.io/v1
                              kind: Ingress
                              metadata:
                                name: cross-ns-ingress
                                namespace: frontend
                              spec:
                                rules:
                                - host: myapp.com
                                  http:
                                    paths:
                                    - path: /api
                                      pathType: Prefix
                                      backend:
                                        service:
                                           name: api-proxy  # references the ExternalName Service in the same namespace
                                          port:
                                            number: 8080

                               5. Exposing TCP/UDP Services

                               Ingress natively supports only HTTP/HTTPS; TCP/UDP takes extra configuration:

                               TCP Configuration for the Nginx Ingress Controller

                               # A ConfigMap that defines the TCP services
                              apiVersion: v1
                              kind: ConfigMap
                              metadata:
                                name: tcp-services
                                namespace: ingress-nginx
                              data:
                                 # Format: "external port": "namespace/service:service port"
                                "3306": "default/mysql:3306"
                                "6379": "default/redis:6379"
                                "27017": "databases/mongodb:27017"
                              
                              ---
                               # Modify the Ingress Controller Service to expose the TCP ports
                              apiVersion: v1
                              kind: Service
                              metadata:
                                name: ingress-nginx-controller
                                namespace: ingress-nginx
                              spec:
                                type: LoadBalancer
                                ports:
                                - name: http
                                  port: 80
                                  targetPort: 80
                                - name: https
                                  port: 443
                                  targetPort: 443
                                 # Add the TCP ports
                                - name: mysql
                                  port: 3306
                                  targetPort: 3306
                                - name: redis
                                  port: 6379
                                  targetPort: 6379
                                - name: mongodb
                                  port: 27017
                                  targetPort: 27017
                                selector:
                                  app.kubernetes.io/name: ingress-nginx
                              
                              ---
                               # Modify the Ingress Controller Deployment to reference the ConfigMap
                              apiVersion: apps/v1
                              kind: Deployment
                              metadata:
                                name: ingress-nginx-controller
                                namespace: ingress-nginx
                              spec:
                                template:
                                  spec:
                                    containers:
                                    - name: controller
                                      args:
                                      - /nginx-ingress-controller
                                      - --tcp-services-configmap=$(POD_NAMESPACE)/tcp-services
                                       # ...other args

                               Access:

                               # Connect to MySQL
                               mysql -h <ingress-lb-ip> -P 3306 -u root -p

                               # Connect to Redis
                              redis-cli -h <ingress-lb-ip> -p 6379
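The mapping format in the tcp-services ConfigMap is mechanical, so it can be parsed with a few string splits. A small sketch using the entries from the example above:

```python
# The tcp-services ConfigMap maps "external port" -> "namespace/service:port".
TCP_SERVICES = {
    "3306": "default/mysql:3306",
    "6379": "default/redis:6379",
    "27017": "databases/mongodb:27017",
}

def parse_entry(external_port, target):
    """Split one ConfigMap entry into (external port, namespace, service, port)."""
    namespace, rest = target.split("/", 1)
    service, port = rest.rsplit(":", 1)
    return int(external_port), namespace, service, int(port)

for ext, target in TCP_SERVICES.items():
    print(parse_entry(ext, target))
# -> (3306, 'default', 'mysql', 3306), etc.
```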

                               6. Canary Release Strategies in Detail

                               Weight-Based Traffic Splitting

                               # Production version (90% of traffic)
                              apiVersion: networking.k8s.io/v1
                              kind: Ingress
                              metadata:
                                name: production
                              spec:
                                rules:
                                - host: myapp.com
                                  http:
                                    paths:
                                    - path: /
                                      pathType: Prefix
                                      backend:
                                        service:
                                          name: app-v1
                                          port:
                                            number: 80
                              
                              ---
                               # Canary version (10% of traffic)
                              apiVersion: networking.k8s.io/v1
                              kind: Ingress
                              metadata:
                                name: canary
                                annotations:
                                  nginx.ingress.kubernetes.io/canary: "true"
                                  nginx.ingress.kubernetes.io/canary-weight: "10"
                              spec:
                                rules:
                                - host: myapp.com
                                  http:
                                    paths:
                                    - path: /
                                      pathType: Prefix
                                      backend:
                                        service:
                                          name: app-v2
                                          port:
                                            number: 80

                               Header-Based Canary

                              apiVersion: networking.k8s.io/v1
                              kind: Ingress
                              metadata:
                                name: canary-header
                                annotations:
                                  nginx.ingress.kubernetes.io/canary: "true"
                                  nginx.ingress.kubernetes.io/canary-by-header: "X-Canary"
                                  nginx.ingress.kubernetes.io/canary-by-header-value: "true"
                              spec:
                                rules:
                                - host: myapp.com
                                  http:
                                    paths:
                                    - path: /
                                      pathType: Prefix
                                      backend:
                                        service:
                                          name: app-v2
                                          port:
                                            number: 80

                               Test:

                               # Regular users hit v1
                               curl http://myapp.com

                               # Users with the special header hit v2
                               curl -H "X-Canary: true" http://myapp.com

                               Cookie-Based Canary

                               apiVersion: networking.k8s.io/v1
                              kind: Ingress
                              metadata:
                                name: canary-cookie
                                annotations:
                                  nginx.ingress.kubernetes.io/canary: "true"
                                  nginx.ingress.kubernetes.io/canary-by-cookie: "canary"
                              spec:
                                rules:
                                - host: myapp.com
                                  http:
                                    paths:
                                    - path: /
                                      pathType: Prefix
                                      backend:
                                        service:
                                          name: app-v2
                                          port:
                                            number: 80

                               Behavior:

                               • Cookie canary=always → routed to v2
                               • Cookie canary=never → routed to v1
                               • No cookie → routed by weight
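The cookie-then-weight decision above can be sketched as a small function. Service names come from the manifests; the rng hook is only there to make the weight branch deterministic for testing.

```python
import random

def choose_backend(cookies, weight=10, rng=random.random):
    """Sketch of the cookie/weight canary decision.

    `weight` is the canary-weight percentage; `rng` returns a float in [0, 1).
    """
    canary = cookies.get("canary")
    if canary == "always":
        return "app-v2"              # forced into the canary
    if canary == "never":
        return "app-v1"              # forced out of the canary
    # No cookie (or any other value): fall back to the weight.
    return "app-v2" if rng() * 100 < weight else "app-v1"

print(choose_backend({"canary": "always"}))  # -> app-v2
print(choose_backend({"canary": "never"}))   # -> app-v1
print(choose_backend({}, weight=100))        # -> app-v2 (100% canary)
print(choose_backend({}, weight=0))          # -> app-v1 (0% canary)
```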

                               7. Performance Tuning

                               Nginx Ingress Controller Tuning

                              apiVersion: v1
                              kind: ConfigMap
                              metadata:
                                name: ingress-nginx-controller
                                namespace: ingress-nginx
                              data:
                                 # Worker processes (ideally one per CPU core)
                                 worker-processes: "auto"

                                 # Connections per worker process
                                 max-worker-connections: "65536"

                                 # Enable HTTP/2
                                 use-http2: "true"

                                 # Enable gzip compression
                                 use-gzip: "true"
                                 gzip-level: "6"
                                 gzip-types: "text/plain text/css application/json application/javascript text/xml application/xml"

                                 # Client request body buffering
                                 client-body-buffer-size: "128k"
                                 client-max-body-size: "100m"

                                 # Keepalive connections
                                 keep-alive: "75"
                                 keep-alive-requests: "1000"

                                 # Proxy buffering
                                 proxy-buffer-size: "16k"
                                 proxy-buffers: "4 16k"

                                 # Logging (access logs can be disabled in production)
                                 disable-access-log: "false"
                                 access-log-params: "buffer=16k flush=5s"

                                 # SSL tuning
                                 ssl-protocols: "TLSv1.2 TLSv1.3"
                                 ssl-ciphers: "ECDHE-ECDSA-AES128-GCM-SHA256:ECDHE-RSA-AES128-GCM-SHA256"
                                 ssl-prefer-server-ciphers: "true"
                                 ssl-session-cache: "true"
                                 ssl-session-cache-size: "10m"
                                 ssl-session-timeout: "10m"

                                 # Upstream connection reuse
                                 upstream-keepalive-connections: "100"
                                 upstream-keepalive-timeout: "60"

                                 # Status codes for rate/connection limits
                                 limit-req-status-code: "429"
                                 limit-conn-status-code: "429"

Ingress Controller Pod resource configuration

                              apiVersion: apps/v1
                              kind: Deployment
                              metadata:
                                name: ingress-nginx-controller
                                namespace: ingress-nginx
                              spec:
  replicas: 3  # 3+ replicas recommended for high availability
                                template:
                                  spec:
                                    containers:
                                    - name: controller
                                      image: registry.k8s.io/ingress-nginx/controller:v1.9.0
                                      resources:
                                        requests:
                                          cpu: "500m"
                                          memory: "512Mi"
                                        limits:
                                          cpu: "2000m"
                                          memory: "2Gi"
      # health probes
                                      livenessProbe:
                                        httpGet:
                                          path: /healthz
                                          port: 10254
                                        initialDelaySeconds: 10
                                        periodSeconds: 10
                                      readinessProbe:
                                        httpGet:
                                          path: /healthz
                                          port: 10254
                                        periodSeconds: 5
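One practical consequence of these probe settings: Kubernetes only restarts the container after failureThreshold consecutive liveness failures (the default threshold is 3 when unset, as here), so the worst-case time to detect a hung controller is roughly initialDelaySeconds plus failureThreshold × periodSeconds. A quick sanity check with the values from the Deployment above:

```shell
# Worst-case seconds before a hung container is restarted,
# using the probe values above and the default failureThreshold of 3.
initial_delay=10
period=10
failure_threshold=3
echo $(( initial_delay + failure_threshold * period ))   # → 40
```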

8. Monitoring and Observability

Prometheus integration

                              # ServiceMonitor for Prometheus Operator
                              apiVersion: monitoring.coreos.com/v1
                              kind: ServiceMonitor
                              metadata:
                                name: ingress-nginx
                                namespace: ingress-nginx
                              spec:
                                selector:
                                  matchLabels:
                                    app.kubernetes.io/name: ingress-nginx
                                endpoints:
                                - port: metrics
                                  interval: 30s

Inspecting Ingress Controller metrics

# Port-forward to the metrics endpoint
kubectl port-forward -n ingress-nginx svc/ingress-nginx-controller-metrics 10254:10254

# Then open in a browser
http://localhost:10254/metrics

# Key metrics:
# - nginx_ingress_controller_requests: total request count
# - nginx_ingress_controller_request_duration_seconds: request latency
# - nginx_ingress_controller_response_size: response sizes
# - nginx_ingress_controller_ssl_expire_time_seconds: SSL certificate expiry time

Grafana dashboards

# Import the official Grafana dashboards
# Dashboard ID: 9614 (Nginx Ingress Controller)
# Dashboard ID: 11875 (Nginx Ingress Controller Request Handling Performance)

9. Troubleshooting Checklist

Problem 1: the Ingress has no Address assigned

# Check
kubectl get ingress
# NAME       CLASS   HOSTS         ADDRESS   PORTS   AGE
# my-app     nginx   example.com             80      5m

# Possible causes:
# 1. The Ingress Controller is not running
kubectl get pods -n ingress-nginx

# 2. The controller Service type is not LoadBalancer
kubectl get svc -n ingress-nginx

# 3. The cloud provider has not assigned a LoadBalancer IP
kubectl describe svc -n ingress-nginx ingress-nginx-controller

Problem 2: 502 Bad Gateway

# Cause 1: the backend Service does not exist
kubectl get svc

# Cause 2: the backend Pods are unhealthy
kubectl get pods
kubectl describe pod <pod-name>

# Cause 3: wrong port configuration
kubectl get svc <service-name> -o yaml | grep -A 5 ports

# Cause 4: a NetworkPolicy is blocking traffic
kubectl get networkpolicies

# Check the controller logs
kubectl logs -n ingress-nginx -l app.kubernetes.io/name=ingress-nginx --tail=100

Problem 3: 503 Service Unavailable

# Cause: no healthy Endpoints
kubectl get endpoints <service-name>

# If the ENDPOINTS column is empty:
# 1. Check that the Service selector matches the Pod labels
kubectl get svc <service-name> -o yaml | grep -A 3 selector
kubectl get pods --show-labels

# 2. Check that the Pods are Ready
kubectl get pods

# 3. Check that the container port is correct
kubectl get pods <pod-name> -o yaml | grep -A 5 ports

Problem 4: TLS certificate issues

# Check that the Secret exists
kubectl get secret <tls-secret-name>

# Inspect the certificate contents
kubectl get secret <tls-secret-name> -o jsonpath='{.data.tls\.crt}' | base64 -d | openssl x509 -text -noout

# Check the certificate validity dates
kubectl get secret <tls-secret-name> -o jsonpath='{.data.tls\.crt}' | base64 -d | openssl x509 -noout -dates

# cert-manager checks
kubectl get certificate
kubectl describe certificate <cert-name>
kubectl get certificaterequests
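If you want to rehearse these openssl checks without a live Secret, you can generate a throwaway self-signed certificate (the /tmp paths and CN are illustrative) and inspect it the same way you would a decoded tls.crt:

```shell
# Create a short-lived self-signed cert for CN=example.com
openssl req -x509 -newkey rsa:2048 -nodes \
  -keyout /tmp/tls.key -out /tmp/tls.crt \
  -days 30 -subj "/CN=example.com" 2>/dev/null

# Same checks as above, minus the Secret-decoding step
openssl x509 -in /tmp/tls.crt -noout -subject   # shows the CN
openssl x509 -in /tmp/tls.crt -noout -dates     # notBefore= / notAfter= lines
```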

Problem 5: routing rules not taking effect

# 1. Inspect the Ingress configuration
kubectl describe ingress <ingress-name>

# 2. View the generated Nginx configuration
kubectl exec -n ingress-nginx <controller-pod> -- cat /etc/nginx/nginx.conf | grep -A 20 "server_name example.com"

# 3. Test DNS resolution
nslookup example.com

# 4. Test with an explicit Host header
curl -v -H "Host: example.com" http://<ingress-ip>/path

# 5. Verify the annotations
kubectl get ingress <ingress-name> -o yaml | grep -A 10 annotations

10. Production Best Practices

✅ High availability

# 1. Run multiple Ingress Controller replicas
spec:
  replicas: 3

  # 2. Pod anti-affinity (spread replicas across nodes)
  affinity:
    podAntiAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
      - labelSelector:
          matchExpressions:
          - key: app.kubernetes.io/name
            operator: In
            values:
            - ingress-nginx
        topologyKey: kubernetes.io/hostname

  # 3. PodDisruptionBudget (keep at least 2 replicas running)
                              ---
                              apiVersion: policy/v1
                              kind: PodDisruptionBudget
                              metadata:
                                name: ingress-nginx
                                namespace: ingress-nginx
                              spec:
                                minAvailable: 2
                                selector:
                                  matchLabels:
                                    app.kubernetes.io/name: ingress-nginx

✅ Resource limits

                              resources:
                                requests:
                                  cpu: "500m"
                                  memory: "512Mi"
                                limits:
                                  cpu: "2"
                                  memory: "2Gi"
                              
# HPA autoscaling
                              ---
                              apiVersion: autoscaling/v2
                              kind: HorizontalPodAutoscaler
                              metadata:
                                name: ingress-nginx
                                namespace: ingress-nginx
                              spec:
                                scaleTargetRef:
                                  apiVersion: apps/v1
                                  kind: Deployment
                                  name: ingress-nginx-controller
                                minReplicas: 3
                                maxReplicas: 10
                                metrics:
                                - type: Resource
                                  resource:
                                    name: cpu
                                    target:
                                      type: Utilization
                                      averageUtilization: 70
                                - type: Resource
                                  resource:
                                    name: memory
                                    target:
                                      type: Utilization
                                      averageUtilization: 80
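The scaling decision behind this HPA follows the standard formula desiredReplicas = ceil(currentReplicas × currentMetric / targetMetric). For example, if the 3 controller replicas average 90% CPU against the 70% target:

```shell
current_replicas=3
current_cpu=90   # observed average utilization, percent
target_cpu=70    # averageUtilization from the HPA spec above
# integer ceiling of (3 * 90) / 70
desired=$(( (current_replicas * current_cpu + target_cpu - 1) / target_cpu ))
echo "$desired"  # → 4: the HPA scales out by one replica
```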

✅ Security hardening

# 1. Expose only the ports you need
# 2. Require TLS 1.2+
# 3. Set security response headers
metadata:
  annotations:
    nginx.ingress.kubernetes.io/configuration-snippet: |
      more_set_headers "X-Frame-Options: DENY";
      more_set_headers "X-Content-Type-Options: nosniff";
      more_set_headers "X-XSS-Protection: 1; mode=block";
      more_set_headers "Strict-Transport-Security: max-age=31536000; includeSubDomains";

# 4. Enable a WAF (Web Application Firewall)
nginx.ingress.kubernetes.io/enable-modsecurity: "true"
nginx.ingress.kubernetes.io/enable-owasp-core-rules: "true"

# 5. Rate limiting
nginx.ingress.kubernetes.io/limit-rps: "100"
nginx.ingress.kubernetes.io/limit-burst-multiplier: "5"

✅ Monitoring and alerting

# Example Prometheus alerting rules
                              groups:
                              - name: ingress
                                rules:
                                - alert: IngressControllerDown
                                  expr: up{job="ingress-nginx-controller-metrics"} == 0
                                  for: 5m
                                  annotations:
                                    summary: "Ingress Controller is down"
                                
                                - alert: HighErrorRate
                                  expr: rate(nginx_ingress_controller_requests{status=~"5.."}[5m]) > 0.05
                                  for: 5m
                                  annotations:
                                    summary: "High 5xx error rate"
                                
                                - alert: HighLatency
                                  expr: histogram_quantile(0.95, nginx_ingress_controller_request_duration_seconds_bucket) > 1
                                  for: 10m
                                  annotations:
                                    summary: "High request latency (p95 > 1s)"

📚 Summary: Ingress vs the Alternatives

Ingress vs LoadBalancer Service

Scenario: deploying 10 microservices

Option A: one LoadBalancer per service
- Cost: 10 LoadBalancers × $20/month = $200/month
- Management: 10 separate IP addresses
- Routing: no smart routing
- TLS: configured per service

Option B: a single Ingress
- Cost: 1 LoadBalancer × $20/month = $20/month ✅
- Management: 1 IP address ✅
- Routing: smart routing by host and path ✅
- TLS: certificates managed centrally ✅
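The cost gap in this scenario is simple arithmetic (the $20/month per LoadBalancer is the illustrative figure from the comparison, not a real cloud price):

```shell
services=10
lb_monthly_cost=20                          # illustrative per-LB monthly cost
option_a=$(( services * lb_monthly_cost ))  # one LoadBalancer per service
option_b=$(( 1 * lb_monthly_cost ))         # one shared Ingress entry point
echo $(( option_a - option_b ))             # → 180 dollars saved per month
```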

Ingress vs API Gateway

| Feature | Ingress | API Gateway (Kong/Tyk) |
|---|---|---|
| Basic routing | ✅ | ✅ |
| Authentication/authorization | ⚠️ Basic | ✅ Full-featured |
| Rate limiting / circuit breaking | ⚠️ Basic | ✅ Advanced |
| Plugin ecosystem | ❌ Limited | ✅ Rich |
| Learning curve | ✅ Simple | ⚠️ Steeper |
| Performance | ✅ High | ⚠️ Moderate |

🎓 Suggested Learning Path

1. Getting started (1-2 weeks)

  • Understand the Ingress concept
  • Deploy the Nginx Ingress Controller
  • Create basic Ingress rules
  • Configure HTTP/HTTPS access
2. Intermediate (2-4 weeks)

  • Master the different routing strategies
  • TLS certificate management (cert-manager)
  • Canary releases
  • Performance tuning
3. Advanced (1-2 months)

  • Managing multiple Ingress Controllers
  • WAF and security hardening
  • Monitoring and alerting
  • Troubleshooting
4. Expert (ongoing)

  • Reading the source code
  • Developing custom plugins
  • Migrating to the Gateway API


                              Mar 7, 2024

Nginx Performance Tuning

A detailed walkthrough of Nginx tuning across several dimensions: general settings, the operating-system layer, the Nginx configuration layer, and the architecture layer.


I. Operating System and Hardware Tuning

This is the foundation of everything else: give Nginx a high-performance environment to run in.

1. Increase the file descriptor limit. Nginx consumes one file descriptor per connection (especially when serving static files). Under high concurrency, the default limit quickly becomes the bottleneck.

  # Takes effect for the current session only
  ulimit -n 65536

  # To make it permanent, edit /etc/security/limits.conf
  * soft nofile 65536
  * hard nofile 65536

  # Also make sure nginx.conf sets a matching worker_rlimit_nofile
  worker_rlimit_nofile 65536;
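Before and after editing limits.conf, you can check the limits actually in effect for your shell (and therefore for processes it launches):

```shell
ulimit -Sn   # soft limit on open file descriptors
ulimit -Hn   # hard limit (the ceiling the soft limit may be raised to)
```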
2. Tune the network stack

  • Adjust net.core.somaxconn: the maximum length of the queue of connections waiting for Nginx to accept(). Increase it if you see accept() queue overflow errors.
    sysctl -w net.core.somaxconn=65535
    Also set the backlog parameter explicitly on the listen directive:
    listen 80 backlog=65535;
  • Enable TCP Fast Open: shaves latency off the TCP three-way handshake.
    sysctl -w net.ipv4.tcp_fastopen=3
  • Widen the ephemeral port range: as a reverse proxy, Nginx needs a large pool of local ports for its upstream connections.
    sysctl -w net.ipv4.ip_local_port_range="1024 65535"
  • Reduce TCP TIME_WAIT pressure: with many short-lived connections, sockets stuck in TIME_WAIT can exhaust the port range.
    # Reuse TIME_WAIT sockets for new outbound connections
    sysctl -w net.ipv4.tcp_tw_reuse=1
    # Leave tcp_tw_recycle off: it is unsafe behind NAT (and was removed in Linux 4.12)
    sysctl -w net.ipv4.tcp_tw_recycle=0
    # Shorten the FIN-WAIT-2 timeout
    sysctl -w net.ipv4.tcp_fin_timeout=30
3. Use fast disks. For static content serving, SSDs dramatically improve I/O performance.


II. Nginx Configuration Tuning

This is the core of the work; it directly determines how Nginx behaves.

1. Worker processes and connections

  • worker_processes auto;: let Nginx automatically size the worker pool to the number of CPU cores.
  • worker_connections: the maximum number of connections each worker can handle. Together with worker_rlimit_nofile it determines Nginx's total concurrency.
    events {
        worker_connections 10240; # for example, 10240
        use epoll; # use the high-performance epoll event model on Linux
    }
2. Efficient static file serving

  • Enable sendfile: transfers file data entirely within the kernel, bypassing user space. Very efficient.
    sendfile on;
  • Enable tcp_nopush: used together with sendfile on, it waits for packets to fill before sending, improving network efficiency.
    tcp_nopush on;
  • Enable tcp_nodelay: for keepalive connections, forces data to be sent immediately, reducing latency. Usually combined with tcp_nopush.
    tcp_nodelay on;
3. Connection and request timeouts. Sensible timeouts release idle resources and keep connections from being held open indefinitely.

  # How long to keep client connections alive
  keepalive_timeout 30s;
  # Timeouts for connections to upstream servers
  proxy_connect_timeout 5s;
  proxy_send_timeout 60s;
  proxy_read_timeout 60s;
  # Timeout for reading the client request headers
  client_header_timeout 15s;
  # Timeout for reading the client request body
  client_body_timeout 15s;
4. Buffering and caching

  • Buffer tuning: size the client header and body buffers so Nginx avoids spilling to temporary files and the extra I/O that entails.
    client_header_buffer_size 1k;
    large_client_header_buffers 4 4k;
    client_body_buffer_size 128k;
  • Proxy buffers: control how Nginx buffers data received from upstream servers when acting as a reverse proxy.
    proxy_buffering on;
    proxy_buffer_size 4k;
    proxy_buffers 8 4k;
  • Enable caching
    • Static asset caching: use the expires and add_header directives to give static assets a long browser-cache lifetime.
      location ~* \.(jpg|jpeg|png|gif|ico|css|js)$ {
          expires 1y;
          add_header Cache-Control "public, immutable";
      }
    • Reverse-proxy caching: use the proxy_cache module to cache dynamic upstream responses and take substantial load off the backend.
      proxy_cache_path /path/to/cache levels=1:2 keys_zone=my_cache:10m max_size=10g inactive=60m;
      location / {
          proxy_cache my_cache;
          proxy_cache_valid 200 302 10m;
          proxy_cache_use_stale error timeout updating http_500 http_502 http_503 http_504;
      }
5. Log tuning

  • Disable the access log: for extremely high concurrency where access logs are not needed, turn access_log off.
  • Buffer log writes: with the buffer parameter, Nginx writes entries to a memory buffer first and flushes to disk when it fills.
    access_log /var/log/nginx/access.log main buffer=64k flush=1m;
  • Log only what matters: trim the log format down to the essential fields.
6. Gzip compression. Compress text responses to cut the amount of data on the wire.

  gzip on;
  gzip_vary on;
  gzip_min_length 1024; # skip responses smaller than this
  gzip_types text/plain text/css application/json application/javascript text/xml application/xml application/xml+rss text/javascript;
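A quick local demonstration of what gzip does to repetitive text (the /tmp file names are just for the demo; the gzip CLI's -6 matches the compression level most configs use):

```shell
# 1000 identical 12-byte lines: 12000 bytes total
yes "hello nginx" | head -n 1000 > /tmp/gzip-demo.txt
gzip -6 -c /tmp/gzip-demo.txt > /tmp/gzip-demo.txt.gz

wc -c < /tmp/gzip-demo.txt      # → 12000
wc -c < /tmp/gzip-demo.txt.gz   # a tiny fraction of the original
```

Real HTML/JSON is less repetitive than this, but 60-80% savings on text payloads is typical.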
7. Upstream keepalive. When proxying to backend services, keep a pool of idle connections open to avoid the overhead of repeatedly opening and closing TCP connections.

  upstream backend_servers {
      server 10.0.1.100:8080;
      keepalive 32; # number of idle connections to keep
  }
                                }
                                
                                location / {
                                    proxy_pass http://backend_servers;
                                    proxy_http_version 1.1;
                                    proxy_set_header Connection "";
                                }

III. Architecture and Deployment

1. Load balancing. Use the upstream module to spread traffic across multiple backend servers for horizontal scaling and high availability.

  upstream app_cluster {
      least_conn; # least-connections algorithm
      server 10.0.1.101:8080;
      server 10.0.1.102:8080;
      server 10.0.1.103:8080;
  }
2. Static/dynamic separation. Serve static assets (images, CSS, JS) directly from Nginx, and proxy only dynamic requests to the backend application servers (Tomcat, Node.js, and so on).

3. Enable HTTP/2. Multiplexing and header compression noticeably speed up page loads.

                                listen 443 ssl http2;
4. Third-party modules. Compile in third-party modules as needed, for example:

  • OpenResty: Nginx plus LuaJIT, for powerful scripting and programmability.
  • ngx_brotli: the Brotli compression algorithm, which usually achieves better ratios than gzip.

IV. Monitoring and Debugging

Tuning is never a one-off task; it needs continuous monitoring.

1. Enable the status module. Use stub_status_module to expose basic Nginx runtime statistics.

  location /nginx_status {
      stub_status;
      allow 127.0.0.1; # local access only
      deny all;
  }

  The page shows active connection counts, total requests, and more.
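Here is a sketch of pulling numbers out of a captured stub_status response with awk. The payload format is what stub_status actually emits, but these particular numbers are invented for the example:

```shell
# Example stub_status payload (fixed format, made-up numbers)
status='Active connections: 291
server accepts handled requests
 16630948 16630948 31070465
Reading: 6 Writing: 179 Waiting: 106'

echo "$status" | awk 'NR==1 {print $3}'    # → 291 active connections
# average requests served per handled connection (keepalive effectiveness)
echo "$status" | awk 'NR==3 {printf "%.2f\n", $3/$2}'
```

A requests-per-connection ratio well above 1 means keepalive is doing its job.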

2. Analyze the logs. Tools such as goaccess or awstats reveal traffic patterns and bottlenecks from the access logs.

3. Profiling. In extreme cases, use the debug log or system tools (perf, strace) for deep performance analysis.

Summary and Recommendations

1. Change things gradually: never modify everything at once. Adjust one or two parameters, load-test (wrk, ab, jmeter), and observe the effect.
2. Monitor first: have reliable monitoring data before, during, and after each optimization.
3. Understand the workload: the right strategy depends heavily on the traffic profile. Many concurrent connections? Large file downloads? Lots of short dynamic requests?
4. Treat kernel parameters with care: validate thoroughly in a test environment before changing them in production.

Combining these techniques can significantly improve Nginx's performance and stability, up to handling concurrency on the order of a million connections.

                              Oct 7, 2024

Traefik vs Nginx

Traefik and Nginx Ingress are both top-tier Ingress Controllers in the Kubernetes ecosystem, but they differ markedly in design philosophy, user experience, and focus.

In short:

• Traefik is a dynamic, automation-first API gateway built for cloud-native microservices.
• Nginx Ingress is a powerful, stable, highly configurable reverse proxy and load balancer built on the battle-tested Nginx.

Below is a detailed look at Traefik's main advantages over Nginx Ingress.

Traefik's Core Strengths

1. Fully dynamic, automated configuration

This is Traefik's headline feature.

• How it works: Traefik watches the Kubernetes API server and reacts in real time to changes in Services, IngressRoutes, Secrets, and so on. When you create or modify an Ingress resource, Traefik updates its routing within seconds, with no restart or reload.
• Compared with Nginx Ingress: the nginx-ingress-controller component watches for changes, renders a new nginx.conf, and then sends the Nginx process a reload signal. The process is fast, but it is fundamentally a "generate and reload" model; under very heavy traffic or complex configuration, a reload can introduce small performance jitters or latency.

Bottom line: in a cloud-native environment that prizes full automation and zero reloads, Traefik's dynamic model is more attractive.

2. A simpler configuration model and the IngressRoute CRD

Traefik fully supports the standard Kubernetes Ingress resource, but it encourages its own Custom Resource Definition (CRD) called IngressRoute.

• Why it is better: the standard Ingress resource is relatively limited, and many advanced features (retries, rate limiting, circuit breaking, request mirroring, and so on) must be wired up through piles of annotations, which hurts readability and maintainability.
• Traefik's IngressRoute: a declarative, structured YAML/JSON format. All of the configuration (TLS, middlewares, routing rules) lives in one clearly structured CRD, which better fits the Kubernetes philosophy and is easier to version-control and review.

Example comparison: path rewriting via an Nginx Ingress annotation:

                              apiVersion: networking.k8s.io/v1
                              kind: Ingress
                              metadata:
                                name: my-ingress
                                annotations:
                                  nginx.ingress.kubernetes.io/rewrite-target: /

And with a Traefik IngressRoute plus middleware:

                              apiVersion: traefik.containo.us/v1alpha1
                              kind: IngressRoute
                              metadata:
                                name: my-ingressroute
                              spec:
                                routes:
                                - match: PathPrefix(`/api`)
                                  kind: Rule
                                  services:
                                  - name: my-service
                                    port: 80
                                  middlewares:
  - name: strip-prefix # reference a separate, reusable Middleware resource
                              ---
                              apiVersion: traefik.containo.us/v1alpha1
                              kind: Middleware
                              metadata:
                                name: strip-prefix
                              spec:
                                stripPrefix:
                                  prefixes:
                                    - /api

As you can see, Traefik's configuration is noticeably more modular and readable.

3. A built-in, full-featured dashboard

Traefik ships with an intuitive web UI. Once enabled, you can watch all routers, services, and middlewares in the browser in real time, along with their health and how they relate to one another.

• This is a huge help for development and debugging: you can see at a glance how traffic is being routed, without digging through configuration files or command-line output.
• Nginx Ingress has no official graphical dashboard. You can monitor it with third-party tooling (Prometheus + Grafana) or query state with kubectl, but it is far less convenient than Traefik's native dashboard.

4. Native support for multiple providers

Traefik is multi-provider by design. Besides Kubernetes, it can read configuration from Docker, Consul, Etcd, Rancher, or even a static file, all at the same time. In a mixed stack (some services on K8s, some on Docker Compose), Traefik can act as a single unified entry point and simplify the architecture.

Nginx Ingress can be extended in other ways, but its core is built for Kubernetes.

5. The power and flexibility of middlewares

Traefik's "middleware" concept is very powerful: features such as authentication, rate limiting, header rewriting, redirects, and circuit breaking are defined as independent, reusable components that any routing rule can reference and compose.

This greatly improves configuration reuse and flexibility, and makes middlewares an ideal building block for complex traffic policies.

Where Nginx Ingress Wins (for balance)

To make a well-rounded choice, it helps to know Nginx Ingress's strengths too:

1. Extreme performance and stability: built on the most mature web server in the world, hardened over decades, especially for very high concurrency, static content, and long-lived connections.
2. An enormous feature set: Nginx itself is hugely capable, and the Ingress Controller exposes much of that through annotations; in some areas its ceiling is higher than Traefik's.
3. A huge community and ecosystem: with Nginx's user base, almost any problem you hit already has a documented solution.
4. Fine-grained control: deep Nginx experts can inject custom configuration snippets via ConfigMap and achieve almost anything they want.
5. Licensing: Nginx Ingress is Apache 2.0. (Note: the open-source Traefik proxy is itself MIT-licensed; more restrictive terms apply mainly to Traefik's commercial offerings, which is what raises compliance questions at some large companies.)

Summary and Selection Guide

| Aspect | Traefik | Nginx Ingress |
|---|---|---|
| Configuration model | Dynamic and automatic, no reloads | "Generate and reload" model |
| Configuration syntax | Declarative CRDs, clearly structured | Mostly annotations, more verbose |
| Dashboard | Built in, capable, works out of the box | No official UI; needs third-party integration |
| Design philosophy | Cloud-native first, microservice-friendly | Features and performance first, rock solid |
| Learning curve | Low; easy to adopt and operate | Moderate; requires Nginx knowledge |
| Performance | Excellent; enough for most workloads | Extreme, especially static content at high concurrency |
| Extensibility | Middlewares; highly modular | Lua scripts / custom templates; very high ceiling |
| License | MIT (open source; commercial add-ons separate) | Apache 2.0 (fully open source) |

Which should you choose?

• Choose Traefik if:

  • You want the best possible cloud-native experience, with simple, automated configuration.
  • Your team prefers Kubernetes-native declarative configuration.
  • You value the built-in dashboard for day-to-day operations and debugging.
  • Your architecture is dynamic, with frequent releases and changes.
  • You don't need to squeeze out every last drop of performance and care more about development velocity and operational simplicity.
• Choose Nginx Ingress if:

  • You have extreme performance and stability requirements (for example, very large gateways or CDN edge nodes).
  • You need complex or niche Nginx features and fine-grained control.
  • Your team already knows Nginx deeply.
  • You have strict open-source licensing requirements and must use a permissive license such as Apache 2.0.
  • Your environment is relatively stable and routing configuration rarely changes.

In short, Traefik wins on experience and automation and is an ideal companion for modern microservices and cloud-native environments, while Nginx Ingress wins on performance and feature depth: a battle-proven, reliable engine.

                              Mar 7, 2024

                              RPC

                                Mar 7, 2025

                                Subsections of Storage

                                User Based Policy


You can replace <$bucket> (and <$path>) in the policies below to control which bucket and prefix the permissions apply to.

AWS:
• ${aws:username} is a built-in policy variable that resolves to the name of the signed-in IAM user.
                                {
                                    "Version": "2012-10-17",
                                    "Statement": [
                                        {
                                            "Sid": "AllowUserToSeeBucketListInTheConsole",
                                            "Action": [
                                                "s3:ListAllMyBuckets",
                                                "s3:GetBucketLocation"
                                            ],
                                            "Effect": "Allow",
                                            "Resource": [
                                                "arn:aws:s3:::*"
                                            ]
                                        },
                                        {
                                            "Sid": "AllowRootAndHomeListingOfCompanyBucket",
                                            "Action": [
                                                "s3:ListBucket"
                                            ],
                                            "Effect": "Allow",
                                            "Resource": [
                                                "arn:aws:s3:::<$bucket>"
                                            ],
                                            "Condition": {
                                                "StringEquals": {
                                                    "s3:prefix": [
                                                        "",
                                                        "<$path>/",
                                                        "<$path>/${aws:username}"
                                                    ],
                                                    "s3:delimiter": [
                                                        "/"
                                                    ]
                                                }
                                            }
                                        },
                                        {
                                            "Sid": "AllowListingOfUserFolder",
                                            "Action": [
                                                "s3:ListBucket"
                                            ],
                                            "Effect": "Allow",
                                            "Resource": [
                                                "arn:aws:s3:::<$bucket>"
                                            ],
                                            "Condition": {
                                                "StringLike": {
                                                    "s3:prefix": [
                                                        "<$path>/${aws:username}/*"
                                                    ]
                                                }
                                            }
                                        },
                                        {
                                            "Sid": "AllowAllS3ActionsInUserFolder",
                                            "Effect": "Allow",
                                            "Action": [
                                                "s3:*"
                                            ],
                                            "Resource": [
                                                "arn:aws:s3:::<$bucket>/<$path>/${aws:username}/*"
                                            ]
                                        }
                                    ]
                                }
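One way to apply the policy above is to attach it as an inline IAM user policy with the AWS CLI. The user name "alice" and the file name policy.json are placeholders for your own values.

```shell
# Attach the JSON above as an inline policy for one IAM user.
aws iam put-user-policy \
  --user-name alice \
  --policy-name s3-home-folder \
  --policy-document file://policy.json

# Verify it took effect.
aws iam get-user-policy --user-name alice --policy-name s3-home-folder
```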
Aliyun OSS (bucket policy):
• <$uid> is the Aliyun account UID of the user being granted access; <$oss_id> is the account ID of the bucket owner.
                                {
                                    "Version": "1",
                                    "Statement": [{
                                        "Effect": "Allow",
                                        "Action": [
                                            "oss:*"
                                        ],
                                        "Principal": [
                                            "<$uid>"
                                        ],
                                        "Resource": [
                                            "acs:oss:*:<$oss_id>:<$bucket>/<$path>/*"
                                        ]
                                    }, {
                                        "Effect": "Allow",
                                        "Action": [
                                            "oss:ListObjects",
                                            "oss:GetObject"
                                        ],
                                        "Principal": [
                                             "<$uid>"
                                        ],
                                        "Resource": [
                                            "acs:oss:*:<$oss_id>:<$bucket>"
                                        ],
                                        "Condition": {
                                            "StringLike": {
                                            "oss:Prefix": [
                                                    "<$path>/*"
                                                ]
                                            }
                                        }
                                    }]
                                }
                                Example:
                                {
                                	"Version": "1",
                                	"Statement": [{
                                		"Effect": "Allow",
                                		"Action": [
                                			"oss:*"
                                		],
                                		"Principal": [
                                			"203415213249511533"
                                		],
                                		"Resource": [
                                			"acs:oss:*:1007296819402486:conti-csst/test/*"
                                		]
                                	}, {
                                		"Effect": "Allow",
                                		"Action": [
                                			"oss:ListObjects",
                                			"oss:GetObject"
                                		],
                                		"Principal": [
                                			"203415213249511533"
                                		],
                                		"Resource": [
                                			"acs:oss:*:1007296819402486:conti-csst"
                                		],
                                		"Condition": {
                                			"StringLike": {
                                				"oss:Prefix": [
                                					"test/*"
                                				]
                                			}
                                		}
                                	}]
                                }
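Assuming the RAM user's credentials are already configured for ossutil, a quick sanity check of the example policy is to list and fetch objects under the allowed prefix; the object name demo.txt is a placeholder.

```shell
# Listing and reading under the granted prefix should succeed.
ossutil ls oss://conti-csst/test/
ossutil cp oss://conti-csst/test/demo.txt .

# Anything outside test/ should be denied for this user.
ossutil ls oss://conti-csst/other/
```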
                                Mar 14, 2024

                                Mirrors

                                Gradle Tencent Mirror

                                https://mirrors.cloud.tencent.com/gradle/gradle-8.0-bin.zip
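To make the Gradle wrapper itself download from the Tencent mirror, point distributionUrl at it in gradle/wrapper/gradle-wrapper.properties (the version shown matches the link above; adjust to your project's Gradle version):

```properties
# gradle/wrapper/gradle-wrapper.properties
distributionBase=GRADLE_USER_HOME
distributionPath=wrapper/dists
distributionUrl=https\://mirrors.cloud.tencent.com/gradle/gradle-8.0-bin.zip
zipStoreBase=GRADLE_USER_HOME
zipStorePath=wrapper/dists
```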

PIP TUNA Mirror (-i https://pypi.tuna.tsinghua.edu.cn/simple)

                                pip install -i https://pypi.tuna.tsinghua.edu.cn/simple some-package
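To avoid passing -i on every install, you can persist the TUNA index as pip's default with the standard `pip config` subcommand:

```shell
# Make the TUNA index the default for every pip install.
pip config set global.index-url https://pypi.tuna.tsinghua.edu.cn/simple

# Equivalent manual config (~/.pip/pip.conf on Linux/macOS):
# [global]
# index-url = https://pypi.tuna.tsinghua.edu.cn/simple
```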

                                Maven Mirror

                                <mirror>
                                    <id>aliyunmaven</id>
                                    <mirrorOf>*</mirrorOf>
    <name>Aliyun public repository</name>
                                    <url>https://maven.aliyun.com/repository/public</url>
                                </mirror>
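This <mirror> block belongs inside the <mirrors> section of your Maven settings file, typically ~/.m2/settings.xml. A minimal settings file wrapping it looks like:

```xml
<!-- ~/.m2/settings.xml -->
<settings xmlns="http://maven.apache.org/SETTINGS/1.0.0">
  <mirrors>
    <mirror>
      <id>aliyunmaven</id>
      <mirrorOf>*</mirrorOf>
      <name>Aliyun public repository</name>
      <url>https://maven.aliyun.com/repository/public</url>
    </mirror>
  </mirrors>
</settings>
```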