Server and software version planning

7 servers plus 1 virtual IP:
172.27.244.151-153 (3 masters, 4C/8G/500GB)
172.27.244.154-156 (3 workers, 16C/32G/500GB)
172.27.244.150 (nfs/harbor, 4C/8G/2TB)
172.27.244.160 (virtual IP)

Versions:
OS: Ubuntu 20.04.6
rke: 1.4.6  docker: 20.10.x (5:20.10.24~3-0~ubuntu-focal)
rancher: 2.7.5  k8s: 1.23-1.26 supported (rancher/hyperkube:v1.26.4-rancher2)
kustomize: 5.1.0
kubeflow: 1.7.0

Install the k8s cluster

This part uses the 6 servers 172.27.244.151-156.

By default every step runs on all 6 servers; steps that should run only on certain machines say so explicitly.

Configure passwordless sudo

echo "$USER   ALL=(ALL:ALL) NOPASSWD:ALL" | sudo tee /etc/sudoers.d/$USER

Disable the firewall

These machines sit on an internal network, so disabling the firewall keeps things simple. Internet-facing servers should keep the firewall enabled and open only the ports each service needs, per that service's official documentation.

sudo systemctl stop ufw
sudo systemctl disable ufw

Install Docker

https://ranchermanager.docs.rancher.com/getting-started/installation-and-upgrade/installation-requirements/install-docker

Rancher provides a Docker installation script at the link above.

Now start the installation.

Docker official documentation: https://docs.docker.com/engine/install/ubuntu/

sudo apt-get update
sudo apt-get install ca-certificates curl gnupg

sudo install -m 0755 -d /etc/apt/keyrings
curl -fsSL https://download.docker.com/linux/ubuntu/gpg | sudo gpg --dearmor -o /etc/apt/keyrings/docker.gpg
sudo chmod a+r /etc/apt/keyrings/docker.gpg

echo \
  "deb [arch="$(dpkg --print-architecture)" signed-by=/etc/apt/keyrings/docker.gpg] https://download.docker.com/linux/ubuntu \
  "$(. /etc/os-release && echo "$VERSION_CODENAME")" stable" | \
  sudo tee /etc/apt/sources.list.d/docker.list > /dev/null

sudo apt-get update
VERSION_STRING=5:20.10.24~3-0~ubuntu-focal
sudo apt-get install docker-ce=$VERSION_STRING docker-ce-cli=$VERSION_STRING containerd.io docker-buildx-plugin docker-compose-plugin
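
Optionally, confirm the installed version and hold the Docker packages so a later apt upgrade does not move past the release RKE expects (a suggestion, not in the original steps):

sudo docker version --format '{{.Server.Version}}'   # should print 20.10.24
sudo apt-mark hold docker-ce docker-ce-cli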

Configure k8s prerequisites

Kubernetes official documentation:

https://kubernetes.io/docs/setup/production-environment/tools/kubeadm/install-kubeadm/

https://kubernetes.io/docs/setup/production-environment/container-runtimes/#install-and-configure-prerequisites

RKE official documentation:

https://rke.docs.rancher.com/os

Disable swap

sudo swapoff -a && sudo sed -i '/swap/s/^/#/' /etc/fstab
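
Verify that swap is now off: swapon should print nothing and free should show 0B of swap.

swapon --show
free -h | grep -i swap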

Allow iptables to see bridged traffic

Check the current state first.

lsmod | grep br_netfilter
lsmod | grep overlay
sysctl net.bridge.bridge-nf-call-iptables net.bridge.bridge-nf-call-ip6tables net.ipv4.ip_forward

cat <<EOF | sudo tee /etc/modules-load.d/k8s.conf
overlay
br_netfilter
EOF

sudo modprobe overlay
sudo modprobe br_netfilter

cat <<EOF | sudo tee /etc/sysctl.d/k8s.conf
net.bridge.bridge-nf-call-iptables  = 1
net.bridge.bridge-nf-call-ip6tables = 1
net.ipv4.ip_forward                 = 1
EOF

sudo sysctl --system

Install RKE

The rke tool only needs to be installed on one machine; here it goes on 172.27.244.151.

RKE official documentation: https://rke.docs.rancher.com/installation

Create the rke user

Create the user on all of 172.27.244.151-156 and add it to the docker group.

sudo useradd -m -s /bin/bash rkeuser
sudo usermod -aG docker rkeuser
sudo chpasswd <<< 'rkeuser:123456'

Generate an SSH key and copy it to the nodes

Run on 172.27.244.151; the key must also be copied to the local rkeuser account.

ssh-keygen -qf ~/.ssh/id_rsa -P ''
ssh-copy-id -f -p 5008 -i ~/.ssh/id_rsa.pub -o StrictHostKeyChecking=no [email protected]
ssh-copy-id -f -p 5008 -i ~/.ssh/id_rsa.pub -o StrictHostKeyChecking=no [email protected]
ssh-copy-id -f -p 5008 -i ~/.ssh/id_rsa.pub -o StrictHostKeyChecking=no [email protected]
ssh-copy-id -f -p 5008 -i ~/.ssh/id_rsa.pub -o StrictHostKeyChecking=no [email protected]
ssh-copy-id -f -p 5008 -i ~/.ssh/id_rsa.pub -o StrictHostKeyChecking=no [email protected]
ssh-copy-id -f -p 5008 -i ~/.ssh/id_rsa.pub -o StrictHostKeyChecking=no [email protected]
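
Before running RKE it is worth confirming, as rkeuser on 172.27.244.151, that passwordless SSH and docker access work on every node (SSH port 5008 as above; a suggested check):

for ip in 172.27.244.15{1..6}; do
    ssh -p 5008 -o BatchMode=yes rkeuser@"$ip" 'docker version --format "{{.Server.Version}}"' && echo "$ip OK"
done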

Create the k8s cluster with RKE

Run on 172.27.244.151.

RKE download: https://github.com/rancher/rke/releases/tag/v1.4.6

cd && wget https://github.com/rancher/rke/releases/download/v1.4.6/rke_linux-amd64

mkdir bin && cp rke_linux-amd64 bin && chmod +x bin/rke_linux-amd64
sudo ln -s $HOME/bin/rke_linux-amd64 /usr/local/bin/rke

Configure RKE

rke config --name cluster.yml
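
rke config asks interactively for the node addresses, SSH user and port, and roles. For reference, the nodes section of the generated cluster.yml for this setup would look roughly like the following (a sketch using the values from the steps above; only one master and one worker shown):

nodes:
  - address: 172.27.244.151
    user: rkeuser
    port: "5008"
    role: [controlplane, etcd]
  - address: 172.27.244.154
    user: rkeuser
    port: "5008"
    role: [worker]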

The default VXLAN port 8472 conflicts with Sangfor HCI, so change it in cluster.yml; see https://rke.docs.rancher.com/config-options/add-ons/network-plugins#flannel

network:
  plugin: flannel
  options:
    flannel_backend_type: vxlan
    flannel_backend_port: "8972"

Some other commonly used settings, for reference:

network:
  plugin: canal
  options:
    canal_flannel_backend_type: vxlan
    canal_flannel_backend_port: "8972"

services:
  kubelet:
    extra_args:
      max-pods: 250

If you use canal instead, you also need to edit the canal-config ConfigMap once the cluster is up:

kubectl edit configmap canal-config -n kube-system

Find the following section and add port 8972:

data:
  net-conf.json: |
    {
      "Network": "10.42.0.0/16",
      "Backend": {
        "Type": "vxlan",
        "Port": 8972
      }
    }

Run RKE

rke up

A commonly used variant: update only, ignoring the Docker version check and similar preflight checks.

rke up --update-only --ignore-docker-version

Install kubectl

Run on 172.27.244.151.

Install kubectl and set up shell completion.

curl -LO "https://dl.k8s.io/release/$(curl -L -s https://dl.k8s.io/release/stable.txt)/bin/linux/amd64/kubectl"
sudo install -o root -g root -m 0755 kubectl /usr/local/bin/kubectl
echo 'source <(kubectl completion bash)' >>~/.bashrc && . ~/.bashrc
mkdir $HOME/.kube
cp kube_config_cluster.yml $HOME/.kube/config
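
kubectl should now be able to reach the cluster; a quick sanity check (expect all 6 nodes to be Ready):

kubectl get nodes -o wide
kubectl get pods -A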

Install Rancher

Run on 172.27.244.151.

Official documentation: https://ranchermanager.docs.rancher.com/zh/pages-for-subheaders/install-upgrade-on-a-kubernetes-cluster

Install Helm.

wget https://get.helm.sh/helm-v3.12.1-linux-amd64.tar.gz
tar zxf helm-v3.12.1-linux-amd64.tar.gz
sudo mv linux-amd64/helm /usr/local/bin/helm

Install Rancher with Helm.

helm repo add rancher-stable https://releases.rancher.com/server-charts/stable

kubectl create namespace cattle-system

# Confirm the v1.12.2 version and replace it if needed
kubectl apply -f https://github.com/cert-manager/cert-manager/releases/download/v1.12.2/cert-manager.crds.yaml

helm repo add jetstack https://charts.jetstack.io

helm repo update

helm install cert-manager jetstack/cert-manager \
  --namespace cert-manager \
  --create-namespace \
  --version v1.12.2


helm install rancher rancher-stable/rancher \
  --namespace cattle-system \
  --set hostname=rancher-ai.example.com \
  --set bootstrapPassword=123456 \
  --set global.cattle.psp.enabled=false \
  --version 2.7.5

After the installation, check the Rancher status.

kubectl -n cattle-system rollout status deploy/rancher
kubectl -n cattle-system get deploy rancher

If needed, the bootstrap password can be retrieved like this.

kubectl get secret --namespace cattle-system bootstrap-secret -o go-template='{{ .data.bootstrapPassword|base64decode}}{{ "\n" }}'

Install Kubeflow

Install a StorageClass

Prerequisite: nfs-server must already be running on 172.27.244.150 (see the NFS server section below). Then install the nfs-client-provisioner storage class; run the following on 172.27.244.151.

nfs-client-provisioner documentation:

https://github.com/kubernetes-sigs/nfs-subdir-external-provisioner

https://artifacthub.io/packages/helm/nfs-subdir-external-provisioner/nfs-subdir-external-provisioner

Mount option documentation:

https://access.redhat.com/documentation/en-us/red_hat_enterprise_linux/7/html/storage_administration_guide/s1-nfs-client-config-options

https://www.man7.org/linux/man-pages/man5/nfs.5.html

helm repo add nfs-subdir-external-provisioner https://kubernetes-sigs.github.io/nfs-subdir-external-provisioner/
helm repo update
helm install nfs-client-provisioner nfs-subdir-external-provisioner/nfs-subdir-external-provisioner \
    --set nfs.server=172.27.244.150 \
    --set nfs.path=/data_nfs \
    --set storageClass.defaultClass=true
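
To confirm dynamic provisioning works before installing Kubeflow, check that the StorageClass is the default and that a test PVC binds (the name test-claim is only an example):

kubectl get storageclass

cat <<EOF | kubectl apply -f -
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: test-claim
spec:
  accessModes: ["ReadWriteMany"]
  resources:
    requests:
      storage: 1Mi
EOF

kubectl get pvc test-claim    # STATUS should become Bound
kubectl delete pvc test-claim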

Another commonly used form, for reference (local chart archive and a private-registry image):

helm install nfs-client-provisioner-yw ./nfs-subdir-external-provisioner-4.0.12.tgz \
    --set nfs.server=172.27.244.150 \
    --set nfs.path=/data/nfs/nfs-client-provisioner \
    --set image.repository=harbor-ai.sunwoda-evb.com/kf1.9/registry.k8s.io/sig-storage/nfs-subdir-external-provisioner \
    --set image.tag=v4.0.2 \
    --set storageClass.name=nfs-client \
    --namespace nfs-client-system

The chart pulls tag v4.0.2 by default; run the following on the 3 worker servers to pull a mirrored copy and re-tag it to the expected name.

sudo docker pull strongxyz/nfs-subdir-external-provisioner:v4.0.2
sudo docker tag strongxyz/nfs-subdir-external-provisioner:v4.0.2 registry.k8s.io/sig-storage/nfs-subdir-external-provisioner:v4.0.2

Install kustomize

Run on 172.27.244.151.

wget https://github.com/kubernetes-sigs/kustomize/releases/download/kustomize/v5.1.0/kustomize_v5.1.0_linux_amd64.tar.gz
tar zxf kustomize_v5.1.0_linux_amd64.tar.gz && sudo cp kustomize /usr/local/bin/

Download Kubeflow and the images

Run on 172.27.244.151.

Download Kubeflow

wget https://github.com/kubeflow/manifests/archive/refs/tags/v1.7.0.tar.gz
tar zxf v1.7.0.tar.gz && cd manifests-1.7.0

List which images are needed. gcr.io is not reachable from mainland China, so relay the images through an overseas server to hub.docker.com (register your own repository there first), then pull them from Docker Hub on the cluster nodes.

kustomize build example > kustomize_build_example.out.txt
awk -F': ' '/image: gcr.io/{print $2}' kustomize_build_example.out.txt | sort -u > pull.image.list.txt

Required images:

gcr.io/arrikto/kubeflow/oidc-authservice:e236439
gcr.io/knative-releases/knative.dev/eventing/cmd/controller@sha256:33d78536e9b38dbb2ec2952207b48ff8e05acb48e7d28c2305bd0a0f7156198f
gcr.io/knative-releases/knative.dev/eventing/cmd/mtping@sha256:282b5265e1ef26309b3343038c9b4f172654e06cbee46f6ddffd23ea9ad9a3be
gcr.io/knative-releases/knative.dev/eventing/cmd/webhook@sha256:d217ab7e3452a87f8cbb3b45df65c98b18b8be39551e3e960cd49ea44bb415ba
gcr.io/knative-releases/knative.dev/net-istio/cmd/controller@sha256:2b484d982ef1a5d6ff93c46d3e45f51c2605c2e3ed766e20247d1727eb5ce918
gcr.io/knative-releases/knative.dev/net-istio/cmd/webhook@sha256:59b6a46d3b55a03507c76a3afe8a4ee5f1a38f1130fd3d65c9fe57fff583fa8d
gcr.io/knative-releases/knative.dev/serving/cmd/activator@sha256:c3bbf3a96920048869dcab8e133e00f59855670b8a0bbca3d72ced2f512eb5e1
gcr.io/knative-releases/knative.dev/serving/cmd/autoscaler@sha256:caae5e34b4cb311ed8551f2778cfca566a77a924a59b775bd516fa8b5e3c1d7f
gcr.io/knative-releases/knative.dev/serving/cmd/controller@sha256:38f9557f4d61ec79cc2cdbe76da8df6c6ae5f978a50a2847c22cc61aa240da95
gcr.io/knative-releases/knative.dev/serving/cmd/domain-mapping@sha256:763d648bf1edee2b4471b0e211dbc53ba2d28f92e4dae28ccd39af7185ef2c96
gcr.io/knative-releases/knative.dev/serving/cmd/domain-mapping-webhook@sha256:a4ba0076df2efaca2eed561339e21b3a4ca9d90167befd31de882bff69639470
gcr.io/knative-releases/knative.dev/serving/cmd/queue@sha256:505179c0c4892ea4a70e78bc52ac21b03cd7f1a763d2ecc78e7bbaa1ae59c86c
gcr.io/knative-releases/knative.dev/serving/cmd/webhook@sha256:bc13765ba4895c0fa318a065392d05d0adc0e20415c739e0aacb3f56140bf9ae
gcr.io/kubebuilder/kube-rbac-proxy:v0.13.1
gcr.io/kubebuilder/kube-rbac-proxy:v0.8.0
gcr.io/ml-pipeline/api-server:2.0.0-alpha.7
gcr.io/ml-pipeline/cache-server:2.0.0-alpha.7
gcr.io/ml-pipeline/frontend
gcr.io/ml-pipeline/frontend:2.0.0-alpha.7
gcr.io/ml-pipeline/metadata-envoy:2.0.0-alpha.7
gcr.io/ml-pipeline/metadata-writer:2.0.0-alpha.7
gcr.io/ml-pipeline/minio:RELEASE.2019-08-14T20-37-41Z-license-compliance
gcr.io/ml-pipeline/mysql:8.0.26
gcr.io/ml-pipeline/persistenceagent:2.0.0-alpha.7
gcr.io/ml-pipeline/scheduledworkflow:2.0.0-alpha.7
gcr.io/ml-pipeline/viewer-crd-controller:2.0.0-alpha.7
gcr.io/ml-pipeline/visualization-server
gcr.io/ml-pipeline/visualization-server:2.0.0-alpha.7
gcr.io/ml-pipeline/workflow-controller:v3.3.8-license-compliance
gcr.io/tfx-oss-public/ml_metadata_store_server:1.5.0

Download and relay the images

Pull the images and push them to a personal hub.docker.com repository (the script does not include docker login; log in before running it, or add the login to the script).

The hub.docker.com repository needs to be registered by yourself.

Run on a server that can reach gcr.io.

a=(
gcr.io/arrikto/kubeflow/oidc-authservice:e236439
gcr.io/kubebuilder/kube-rbac-proxy:v0.13.1
gcr.io/kubebuilder/kube-rbac-proxy:v0.8.0
gcr.io/ml-pipeline/api-server:2.0.0-alpha.7
gcr.io/ml-pipeline/cache-server:2.0.0-alpha.7
gcr.io/ml-pipeline/frontend
gcr.io/ml-pipeline/frontend:2.0.0-alpha.7
gcr.io/ml-pipeline/metadata-envoy:2.0.0-alpha.7
gcr.io/ml-pipeline/metadata-writer:2.0.0-alpha.7
gcr.io/ml-pipeline/minio:RELEASE.2019-08-14T20-37-41Z-license-compliance
gcr.io/ml-pipeline/mysql:8.0.26
gcr.io/ml-pipeline/persistenceagent:2.0.0-alpha.7
gcr.io/ml-pipeline/scheduledworkflow:2.0.0-alpha.7
gcr.io/ml-pipeline/viewer-crd-controller:2.0.0-alpha.7
gcr.io/ml-pipeline/visualization-server
gcr.io/ml-pipeline/visualization-server:2.0.0-alpha.7
gcr.io/ml-pipeline/workflow-controller:v3.3.8-license-compliance
gcr.io/tfx-oss-public/ml_metadata_store_server:1.5.0
)
# Map each gcr.io image to a Docker Hub name: replace everything through the last "/" with strongxyz/
b=(${a[*]//gcr.io*\//strongxyz/})
# Pull from gcr.io, re-tag for Docker Hub, push, then delete both local copies
for ((i=0; i<${#a[*]}; i++)); do
    sudo docker pull "${a[i]}"
    sudo docker tag "${a[i]}" "${b[i]}"
    sudo docker push "${b[i]}"
    sudo docker rmi "${b[i]}"
    sudo docker rmi "${a[i]}"
done

a=(
gcr.io/knative-releases/knative.dev/eventing/cmd/controller@sha256:33d78536e9b38dbb2ec2952207b48ff8e05acb48e7d28c2305bd0a0f7156198f
gcr.io/knative-releases/knative.dev/eventing/cmd/mtping@sha256:282b5265e1ef26309b3343038c9b4f172654e06cbee46f6ddffd23ea9ad9a3be
gcr.io/knative-releases/knative.dev/eventing/cmd/webhook@sha256:d217ab7e3452a87f8cbb3b45df65c98b18b8be39551e3e960cd49ea44bb415ba
gcr.io/knative-releases/knative.dev/net-istio/cmd/controller@sha256:2b484d982ef1a5d6ff93c46d3e45f51c2605c2e3ed766e20247d1727eb5ce918
gcr.io/knative-releases/knative.dev/net-istio/cmd/webhook@sha256:59b6a46d3b55a03507c76a3afe8a4ee5f1a38f1130fd3d65c9fe57fff583fa8d
gcr.io/knative-releases/knative.dev/serving/cmd/activator@sha256:c3bbf3a96920048869dcab8e133e00f59855670b8a0bbca3d72ced2f512eb5e1
gcr.io/knative-releases/knative.dev/serving/cmd/autoscaler@sha256:caae5e34b4cb311ed8551f2778cfca566a77a924a59b775bd516fa8b5e3c1d7f
gcr.io/knative-releases/knative.dev/serving/cmd/controller@sha256:38f9557f4d61ec79cc2cdbe76da8df6c6ae5f978a50a2847c22cc61aa240da95
gcr.io/knative-releases/knative.dev/serving/cmd/domain-mapping@sha256:763d648bf1edee2b4471b0e211dbc53ba2d28f92e4dae28ccd39af7185ef2c96
gcr.io/knative-releases/knative.dev/serving/cmd/domain-mapping-webhook@sha256:a4ba0076df2efaca2eed561339e21b3a4ca9d90167befd31de882bff69639470
gcr.io/knative-releases/knative.dev/serving/cmd/queue@sha256:505179c0c4892ea4a70e78bc52ac21b03cd7f1a763d2ecc78e7bbaa1ae59c86c
gcr.io/knative-releases/knative.dev/serving/cmd/webhook@sha256:bc13765ba4895c0fa318a065392d05d0adc0e20415c739e0aacb3f56140bf9ae
)
# Docker Hub target names; the digest-pinned images are given a literal "sha256" tag instead
b=(
strongxyz/eventing_cmd_controller:sha256
strongxyz/eventing_cmd_mtping:sha256
strongxyz/eventing_cmd_webhook:sha256
strongxyz/net-istio_cmd_controller:sha256
strongxyz/net-istio_cmd_webhook:sha256
strongxyz/serving_cmd_activator:sha256
strongxyz/serving_cmd_autoscaler:sha256
strongxyz/serving_cmd_controller:sha256
strongxyz/serving_cmd_domain-mapping:sha256
strongxyz/serving_cmd_domain-mapping-webhook:sha256
strongxyz/serving_cmd_queue:sha256
strongxyz/serving_cmd_webhook:sha256
)
for ((i=0; i<${#a[*]}; i++)); do
    sudo docker pull "${a[i]}"
    sudo docker tag "${a[i]}" "${b[i]}"
    sudo docker push "${b[i]}"
    sudo docker rmi "${b[i]}"
    sudo docker rmi "${a[i]}"
done

Run the following on the 3 worker machines 172.27.244.154-156 to pull the images and re-tag them back to their original names. If the masters also act as worker nodes, pull the images on the masters as well.

Pulling the images does not require logging in.

a=(
gcr.io/arrikto/kubeflow/oidc-authservice:e236439
gcr.io/kubebuilder/kube-rbac-proxy:v0.13.1
gcr.io/kubebuilder/kube-rbac-proxy:v0.8.0
gcr.io/ml-pipeline/api-server:2.0.0-alpha.7
gcr.io/ml-pipeline/cache-server:2.0.0-alpha.7
gcr.io/ml-pipeline/frontend
gcr.io/ml-pipeline/frontend:2.0.0-alpha.7
gcr.io/ml-pipeline/metadata-envoy:2.0.0-alpha.7
gcr.io/ml-pipeline/metadata-writer:2.0.0-alpha.7
gcr.io/ml-pipeline/minio:RELEASE.2019-08-14T20-37-41Z-license-compliance
gcr.io/ml-pipeline/mysql:8.0.26
gcr.io/ml-pipeline/persistenceagent:2.0.0-alpha.7
gcr.io/ml-pipeline/scheduledworkflow:2.0.0-alpha.7
gcr.io/ml-pipeline/viewer-crd-controller:2.0.0-alpha.7
gcr.io/ml-pipeline/visualization-server
gcr.io/ml-pipeline/visualization-server:2.0.0-alpha.7
gcr.io/ml-pipeline/workflow-controller:v3.3.8-license-compliance
gcr.io/tfx-oss-public/ml_metadata_store_server:1.5.0
)
b=(${a[*]//gcr.io*\//strongxyz/})
# Pull the Docker Hub mirror, re-tag it back to the original gcr.io name, then drop the mirror tag
for ((i=0; i<${#a[*]}; i++)); do
    sudo docker pull "${b[i]}"
    sudo docker tag "${b[i]}" "${a[i]}"
    sudo docker rmi "${b[i]}"
done

a=(
gcr.io/knative-releases/knative.dev/eventing/cmd/controller:sha256
gcr.io/knative-releases/knative.dev/eventing/cmd/mtping:sha256
gcr.io/knative-releases/knative.dev/eventing/cmd/webhook:sha256
gcr.io/knative-releases/knative.dev/net-istio/cmd/controller:sha256
gcr.io/knative-releases/knative.dev/net-istio/cmd/webhook:sha256
gcr.io/knative-releases/knative.dev/serving/cmd/activator:sha256
gcr.io/knative-releases/knative.dev/serving/cmd/autoscaler:sha256
gcr.io/knative-releases/knative.dev/serving/cmd/controller:sha256
gcr.io/knative-releases/knative.dev/serving/cmd/domain-mapping:sha256
gcr.io/knative-releases/knative.dev/serving/cmd/domain-mapping-webhook:sha256
gcr.io/knative-releases/knative.dev/serving/cmd/queue:sha256
gcr.io/knative-releases/knative.dev/serving/cmd/webhook:sha256
)
b=(
strongxyz/eventing_cmd_controller:sha256
strongxyz/eventing_cmd_mtping:sha256
strongxyz/eventing_cmd_webhook:sha256
strongxyz/net-istio_cmd_controller:sha256
strongxyz/net-istio_cmd_webhook:sha256
strongxyz/serving_cmd_activator:sha256
strongxyz/serving_cmd_autoscaler:sha256
strongxyz/serving_cmd_controller:sha256
strongxyz/serving_cmd_domain-mapping:sha256
strongxyz/serving_cmd_domain-mapping-webhook:sha256
strongxyz/serving_cmd_queue:sha256
strongxyz/serving_cmd_webhook:sha256
)
for ((i=0; i<${#a[*]}; i++)); do
    sudo docker pull "${b[i]}"
    sudo docker tag "${b[i]}" "${a[i]}"
    sudo docker rmi "${b[i]}"
done

vi example/kustomization.yaml and append the following at the end.

images:
  - name: gcr.io/knative-releases/knative.dev/eventing/cmd/controller@sha256:33d78536e9b38dbb2ec2952207b48ff8e05acb48e7d28c2305bd0a0f7156198f
    newName: gcr.io/knative-releases/knative.dev/eventing/cmd/controller
    newTag: "sha256"
  - name: gcr.io/knative-releases/knative.dev/eventing/cmd/mtping@sha256:282b5265e1ef26309b3343038c9b4f172654e06cbee46f6ddffd23ea9ad9a3be
    newName: gcr.io/knative-releases/knative.dev/eventing/cmd/mtping
    newTag: "sha256"
  - name: gcr.io/knative-releases/knative.dev/eventing/cmd/webhook@sha256:d217ab7e3452a87f8cbb3b45df65c98b18b8be39551e3e960cd49ea44bb415ba
    newName: gcr.io/knative-releases/knative.dev/eventing/cmd/webhook
    newTag: "sha256"
  - name: gcr.io/knative-releases/knative.dev/net-istio/cmd/controller@sha256:2b484d982ef1a5d6ff93c46d3e45f51c2605c2e3ed766e20247d1727eb5ce918
    newName: gcr.io/knative-releases/knative.dev/net-istio/cmd/controller
    newTag: "sha256"
  - name: gcr.io/knative-releases/knative.dev/net-istio/cmd/webhook@sha256:59b6a46d3b55a03507c76a3afe8a4ee5f1a38f1130fd3d65c9fe57fff583fa8d
    newName: gcr.io/knative-releases/knative.dev/net-istio/cmd/webhook
    newTag: "sha256"
  - name: gcr.io/knative-releases/knative.dev/serving/cmd/activator@sha256:c3bbf3a96920048869dcab8e133e00f59855670b8a0bbca3d72ced2f512eb5e1
    newName: gcr.io/knative-releases/knative.dev/serving/cmd/activator
    newTag: "sha256"
  - name: gcr.io/knative-releases/knative.dev/serving/cmd/autoscaler@sha256:caae5e34b4cb311ed8551f2778cfca566a77a924a59b775bd516fa8b5e3c1d7f
    newName: gcr.io/knative-releases/knative.dev/serving/cmd/autoscaler
    newTag: "sha256"
  - name: gcr.io/knative-releases/knative.dev/serving/cmd/controller@sha256:38f9557f4d61ec79cc2cdbe76da8df6c6ae5f978a50a2847c22cc61aa240da95
    newName: gcr.io/knative-releases/knative.dev/serving/cmd/controller
    newTag: "sha256"
  - name: gcr.io/knative-releases/knative.dev/serving/cmd/domain-mapping@sha256:763d648bf1edee2b4471b0e211dbc53ba2d28f92e4dae28ccd39af7185ef2c96
    newName: gcr.io/knative-releases/knative.dev/serving/cmd/domain-mapping
    newTag: "sha256"
  - name: gcr.io/knative-releases/knative.dev/serving/cmd/domain-mapping-webhook@sha256:a4ba0076df2efaca2eed561339e21b3a4ca9d90167befd31de882bff69639470
    newName: gcr.io/knative-releases/knative.dev/serving/cmd/domain-mapping-webhook
    newTag: "sha256"
  - name: gcr.io/knative-releases/knative.dev/serving/cmd/queue@sha256:505179c0c4892ea4a70e78bc52ac21b03cd7f1a763d2ecc78e7bbaa1ae59c86c
    newName: gcr.io/knative-releases/knative.dev/serving/cmd/queue
    newTag: "sha256"
  - name: gcr.io/knative-releases/knative.dev/serving/cmd/webhook@sha256:bc13765ba4895c0fa318a065392d05d0adc0e20415c739e0aacb3f56140bf9ae
    newName: gcr.io/knative-releases/knative.dev/serving/cmd/webhook
    newTag: "sha256"

Change the default Kubeflow password

Any machine with python3 will do; here it is done on 172.27.244.151.

sudo apt install python3-pip
sudo pip3 install passlib
python3 -c 'from passlib.hash import bcrypt; import getpass; print(bcrypt.using(rounds=12, ident="2y").hash(getpass.getpass()))'

The command above is interactive: type the password and it returns the bcrypt hash. For example, entering 123456 produces a hash like:

$2y$12$cDyyBfNqBDpQ9kkRoJYSI.xWggu2r9iHj1234GuTFddJkaWZu3a33

vi common/dex/base/config-map.yaml and replace the corresponding entry.
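
In that file, the staticPasswords entry is the part to change; it looks roughly like the snippet below, with hash taking the value generated above (the other field values shown here are illustrative):

staticPasswords:
- email: [email protected]
  hash: $2y$12$cDyyBfNqBDpQ9kkRoJYSI.xWggu2r9iHj1234GuTFddJkaWZu3a33
  username: user
  userID: "15841185641784"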

Install Kubeflow with a single command

The exact command differs slightly between versions; see https://github.com/kubeflow/manifests?tab=readme-ov-file#install-with-a-single-command

while ! kustomize build example | awk '!/well-defined/' | kubectl apply -f -; do echo "Retrying to apply resources"; sleep 10; done
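
The apply loop usually needs several passes while CRDs register. Once it finishes, check that the pods in the Kubeflow-related namespaces are all Running (namespace names as in the manifests README):

kubectl get pods -n cert-manager
kubectl get pods -n istio-system
kubectl get pods -n auth
kubectl get pods -n knative-eventing
kubectl get pods -n knative-serving
kubectl get pods -n kubeflow
kubectl get pods -n kubeflow-user-example-com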

After the installation

Issue 1

Pod error:

error: resource mapping not found for name: "webhook" namespace: "knative-serving" from "STDIN": no matches for kind "HorizontalPodAutoscaler" in version "autoscaling/v2beta2"

Cause:

https://kubernetes.io/docs/reference/using-api/deprecation-guide/#horizontalpodautoscaler-v125

https://www.kubeflow.org/docs/releases/kubeflow-1.7/#dependency-versions-manifests

Fix:

Edit common/knative/knative-serving/base/upstream/serving-core.yaml and change the HorizontalPodAutoscaler apiVersion from autoscaling/v2beta2 to autoscaling/v2.
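
One way to apply that change in place (a sketch; review the file afterwards):

sed -i 's#autoscaling/v2beta2#autoscaling/v2#g' common/knative/knative-serving/base/upstream/serving-core.yaml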

Issue 2

The image pull policy needs to be changed to IfNotPresent in the following files:

common/oidc-authservice/base/statefulset.yaml
apps/pipeline/upstream/base/pipeline/ml-pipeline-viewer-crd-deployment.yaml
apps/pipeline/upstream/base/cache/cache-deployment.yaml
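
To locate the imagePullPolicy fields to edit in those files (they should read IfNotPresent afterwards; add the field if it is missing):

grep -n imagePullPolicy \
    common/oidc-authservice/base/statefulset.yaml \
    apps/pipeline/upstream/base/pipeline/ml-pipeline-viewer-crd-deployment.yaml \
    apps/pipeline/upstream/base/cache/cache-deployment.yaml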

Issue 3

Some components run only a single replica; to avoid pulling images again when a pod gets rescheduled onto a different worker, pull the following images on all 3 workers.

sudo docker pull kubeflowkatib/katib-db-manager:v0.15.0
sudo docker pull mysql:8.0.29
sudo docker pull python:3.7

Issue 4

The mysql pod reports: chown: changing ownership of '/var/lib/mysql/': Operation not permitted

Fix:

On the NFS server, change all_squash to no_root_squash:

sudo vi /etc/exports

sudo exportfs -a

Using no_root_squash from the start when installing the NFS server avoids this issue.

Install keepalived

Pick two machines for nginx + keepalived; here they are installed on 172.27.244.152-153.

sudo apt -y install nginx keepalived

nginx configuration:

sudo rm /etc/nginx/sites-enabled/default
sudo cp /etc/nginx/nginx.conf{,.bak}
sudo vi /etc/nginx/nginx.conf

Reference configuration (the stream block goes at the top level of nginx.conf, outside the http block):

stream {
        upstream k8s-80 {
            least_conn;
            server 172.27.244.154:80 max_fails=3 fail_timeout=5s;
            server 172.27.244.155:80 max_fails=3 fail_timeout=5s;
            server 172.27.244.156:80 max_fails=3 fail_timeout=5s;
        }
        upstream k8s-443 {
            least_conn;
            server 172.27.244.154:443 max_fails=3 fail_timeout=5s;
            server 172.27.244.155:443 max_fails=3 fail_timeout=5s;
            server 172.27.244.156:443 max_fails=3 fail_timeout=5s;
        }
        server {
            listen 80;
            proxy_pass k8s-80;
        }
        server {
            listen 443;
            proxy_pass k8s-443;
        }
}

sudo nginx -t
sudo nginx -s reload

keepalived configuration:

https://manpages.debian.org/unstable/keepalived/keepalived.conf.5.en.html

https://manpages.ubuntu.com/manpages/focal/man5/keepalived.conf.5.html

https://keepalived.readthedocs.io/en/latest/configuration_synopsis.html#global-definitions-synopsis

https://github.com/acassen/keepalived/tree/master/doc/samples

/usr/share/doc/keepalived/samples/keepalived.conf.vrrp.localcheck

/usr/share/doc/keepalived/samples/keepalived.conf.vrrp

sudo vi /etc/keepalived/keepalived.conf

Reference configuration. On the second machine, swap the two IPs (unicast_src_ip and unicast_peer), set state BACKUP, and typically use a lower priority.

global_defs {
   script_user root
   enable_script_security
}

vrrp_script chk_nginx {
    script "killall -0 nginx"
    interval 2
    weight 2
}

vrrp_instance VI_1 {
    state MASTER
    interface ens18
    garp_master_delay 10  #default
    #smtp_alert  # whether to send email alerts
    virtual_router_id 55
    priority 100  # instance priority
    advert_int 1
    authentication {
        auth_type PASS
        auth_pass 00002023
    }
    unicast_src_ip 172.27.244.152
    unicast_peer {  # unicast peer
        172.27.244.153
    }
    virtual_ipaddress {
        172.27.244.160/24 dev ens18
    }
    track_script {
       chk_nginx
    }
}

sudo systemctl restart keepalived
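
After the restart, the virtual IP should be attached to the MASTER node's interface (ens18 here):

ip addr show ens18 | grep 172.27.244.160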

DNS records:

172.27.244.160 rancher-ai.example.com

172.27.244.160 kubeflow-ai.example.com
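
If there is no internal DNS server, a hosts-file entry on the client machines is a minimal way to resolve these names (a sketch):

echo '172.27.244.160 rancher-ai.example.com kubeflow-ai.example.com' | sudo tee -a /etc/hosts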

Install the NFS server

Run on 172.27.244.150.

Installation:

sudo apt -y install nfs-kernel-server
sudo mkdir /data_nfs
sudo chown 1000:1000 /data_nfs
echo '/data_nfs 172.27.244.0/24(rw,sync,insecure,no_subtree_check,no_root_squash,no_all_squash,anonuid=1000,anongid=1000)' | sudo tee -a /etc/exports
sudo exportfs -a

Check the NFS status.

systemctl status nfs-server
sudo rpcinfo -p | grep nfs
cat /var/lib/nfs/etab
sudo exportfs
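
From a worker node you can confirm the export is visible; the workers also need nfs-common installed so they can mount the NFS volumes created by the provisioner (a suggested check):

sudo apt -y install nfs-common
showmount -e 172.27.244.150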

Install Harbor

Run on 172.27.244.150.

Install Docker

Use version 5:23.0.6-1~ubuntu.20.04~focal this time (or install the latest by omitting the version); the other Docker steps are the same as before and omitted here.

# List the installable versions
apt-cache madison docker-ce

VERSION_STRING=5:23.0.6-1~ubuntu.20.04~focal
sudo apt-get install docker-ce=$VERSION_STRING docker-ce-cli=$VERSION_STRING containerd.io docker-buildx-plugin docker-compose-plugin

Install docker-compose

Verify the v2.20.1 version number yourself; the current latest version also works.

sudo curl -L https://github.com/docker/compose/releases/download/v2.20.1/docker-compose-linux-x86_64 -o /usr/local/bin/docker-compose
sudo chmod +x /usr/local/bin/docker-compose

Start the Harbor installation

wget https://github.com/goharbor/harbor/releases/download/v2.8.2/harbor-offline-installer-v2.8.2.tgz

Omitted…