Server and Software Version Planning

Seven servers and one virtual IP:
172.27.244.151-153 (3 masters, 4C/8G/500GB)
172.27.244.154-156 (3 workers, 8C/32G/500GB)
172.27.244.150 (nfs/harbor, 4C/8G/2TB)
172.27.244.160 (virtual IP)
Note:
The Harbor image registry can be installed on the NFS server as a standby; the Kubeflow platform does not need an image registry for now.

Versions:
OS: ubuntu 20.04.6
rke: 1.4.6  docker: 20.10.x (5:20.10.24~3-0~ubuntu-focal)
rancher: 2.7.5  k8s: 1.23-1.26 supported (rancher/hyperkube:v1.26.4-rancher2)
kustomize: 5.1.0
kubeflow: 1.7.0

Install the k8s Cluster

The six servers 172.27.244.151-156.

By default, run every step on all six servers; steps that apply to only some machines will say so.

Configure Passwordless sudo

echo "$USER   ALL=(ALL:ALL) NOPASSWD:ALL" | sudo tee /etc/sudoers.d/$USER

Disable the Firewall

My machines sit on an internal network, so disabling the firewall keeps things simple. Services exposed to the internet should keep the firewall on and open only the ports listed in each service's official documentation!

sudo systemctl stop ufw
sudo systemctl disable ufw

Install Docker

https://ranchermanager.docs.rancher.com/getting-started/installation-and-upgrade/installation-requirements/install-docker

Rancher provides an official Docker installation script.

Let's start the installation!

Docker official docs: https://docs.docker.com/engine/install/ubuntu/

sudo apt-get update
sudo apt-get install ca-certificates curl gnupg

sudo install -m 0755 -d /etc/apt/keyrings
curl -fsSL https://download.docker.com/linux/ubuntu/gpg | sudo gpg --dearmor -o /etc/apt/keyrings/docker.gpg
sudo chmod a+r /etc/apt/keyrings/docker.gpg

echo \
  "deb [arch="$(dpkg --print-architecture)" signed-by=/etc/apt/keyrings/docker.gpg] https://download.docker.com/linux/ubuntu \
  "$(. /etc/os-release && echo "$VERSION_CODENAME")" stable" | \
  sudo tee /etc/apt/sources.list.d/docker.list > /dev/null

sudo apt-get update
VERSION_STRING=5:20.10.24~3-0~ubuntu-focal
sudo apt-get install docker-ce=$VERSION_STRING docker-ce-cli=$VERSION_STRING containerd.io docker-buildx-plugin docker-compose-plugin
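
Optionally, verify the pinned version and hold the packages so a routine apt upgrade does not move Docker past what RKE 1.4.6 supports (the hold is a suggestion, not part of the original Rancher procedure):

sudo docker version --format '{{.Server.Version}}'    # should print 20.10.24
sudo apt-mark hold docker-ce docker-ce-cli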

Configure k8s Prerequisites

Kubernetes official docs:

https://kubernetes.io/docs/setup/production-environment/tools/kubeadm/install-kubeadm/

https://kubernetes.io/docs/setup/production-environment/container-runtimes/#install-and-configure-prerequisites

RKE official docs:

https://rke.docs.rancher.com/os

Disable Swap

sudo swapoff -a && sudo sed -i '/swap/s/^/#/' /etc/fstab

Let iptables See Bridged Traffic

Verify the current state.

lsmod | grep br_netfilter
lsmod | grep overlay
sysctl net.bridge.bridge-nf-call-iptables net.bridge.bridge-nf-call-ip6tables net.ipv4.ip_forward

cat <<EOF | sudo tee /etc/modules-load.d/k8s.conf
overlay
br_netfilter
EOF

sudo modprobe overlay
sudo modprobe br_netfilter

cat <<EOF | sudo tee /etc/sysctl.d/k8s.conf
net.bridge.bridge-nf-call-iptables  = 1
net.bridge.bridge-nf-call-ip6tables = 1
net.ipv4.ip_forward                 = 1
EOF

sudo sysctl --system

Install RKE

The rke tool only needs to be installed on one machine; here it goes on 172.27.244.151.

RKE official docs: https://rke.docs.rancher.com/installation

Create the rke User

Create the user on all of 172.27.244.151-156 and add it to the docker group.

sudo useradd -m -s /bin/bash rkeuser
sudo usermod -aG docker rkeuser
sudo chpasswd <<< 'rkeuser:123456'
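
A quick check that the new user can reach the Docker daemon (sudo starts a fresh session, so the group membership applies immediately):

sudo -u rkeuser docker ps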

Generate an SSH Key and Copy It

Run on 172.27.244.151; copy the key to the local rkeuser account as well. Note the custom SSH port 5008; the same port must be set for each node in cluster.yml later.

ssh-keygen -qf ~/.ssh/id_rsa -P ''
ssh-copy-id -f -p 5008 -i ~/.ssh/id_rsa.pub -o StrictHostKeyChecking=no [email protected]
ssh-copy-id -f -p 5008 -i ~/.ssh/id_rsa.pub -o StrictHostKeyChecking=no [email protected]
ssh-copy-id -f -p 5008 -i ~/.ssh/id_rsa.pub -o StrictHostKeyChecking=no [email protected]
ssh-copy-id -f -p 5008 -i ~/.ssh/id_rsa.pub -o StrictHostKeyChecking=no [email protected]
ssh-copy-id -f -p 5008 -i ~/.ssh/id_rsa.pub -o StrictHostKeyChecking=no [email protected]
ssh-copy-id -f -p 5008 -i ~/.ssh/id_rsa.pub -o StrictHostKeyChecking=no [email protected]

Create the k8s Cluster with RKE

Run on 172.27.244.151.

rke download: https://github.com/rancher/rke/releases/tag/v1.4.6

cd && wget https://github.com/rancher/rke/releases/download/v1.4.6/rke_linux-amd64

mkdir bin && cp rke_linux-amd64 bin && chmod +x bin/rke_linux-amd64
sudo ln -s $HOME/bin/rke_linux-amd64 /usr/local/bin/rke

rke config --name cluster.yml

The default VXLAN port 8472 conflicts with Sangfor HCI, so edit cluster.yml; see https://rke.docs.rancher.com/config-options/add-ons/network-plugins#flannel

network:
  plugin: flannel
  options:
    flannel_backend_type: vxlan
    flannel_backend_port: "8972"
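
For orientation, a minimal sketch of the nodes section of cluster.yml under the assumptions used here (user rkeuser, SSH port 5008, masters as controlplane+etcd, the rest as workers); the file generated by rke config contains many more fields:

nodes:
  - address: 172.27.244.151      # same for .152 and .153
    port: "5008"
    user: rkeuser
    role: [controlplane, etcd]
  - address: 172.27.244.154      # same for .155 and .156
    port: "5008"
    user: rkeuser
    role: [worker]
ssh_key_path: ~/.ssh/id_rsa
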
rke up

Install kubectl

Run on 172.27.244.151.

Install kubectl and set up command auto-completion.

curl -LO "https://dl.k8s.io/release/$(curl -L -s https://dl.k8s.io/release/stable.txt)/bin/linux/amd64/kubectl"
sudo install -o root -g root -m 0755 kubectl /usr/local/bin/kubectl
echo 'source <(kubectl completion bash)' >>~/.bashrc && . ~/.bashrc
mkdir -p $HOME/.kube
cp kube_config_cluster.yml $HOME/.kube/config
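
If everything worked, all six nodes should report Ready:

kubectl get nodes -o wide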

Install Rancher

Run on 172.27.244.151.

Official docs: https://ranchermanager.docs.rancher.com/zh/pages-for-subheaders/install-upgrade-on-a-kubernetes-cluster

Install Helm.

wget https://get.helm.sh/helm-v3.12.1-linux-amd64.tar.gz
tar zxf helm-v3.12.1-linux-amd64.tar.gz
sudo mv linux-amd64/helm /usr/local/bin/helm

Install Rancher with Helm.

helm repo add rancher-stable https://releases.rancher.com/server-charts/stable

kubectl create namespace cattle-system

#Confirm the v1.12.2 release; replace it if needed
kubectl apply -f https://github.com/cert-manager/cert-manager/releases/download/v1.12.2/cert-manager.crds.yaml

helm repo add jetstack https://charts.jetstack.io

helm repo update

helm install cert-manager jetstack/cert-manager \
  --namespace cert-manager \
  --create-namespace \
  --version v1.12.2


helm install rancher rancher-stable/rancher \
  --namespace cattle-system \
  --set hostname=rancher.ai.example.com \
  --set bootstrapPassword=123456 \
  --set global.cattle.psp.enabled=false \
  --version 2.7.5

After installation, check Rancher's status.

kubectl -n cattle-system rollout status deploy/rancher
kubectl -n cattle-system get deploy rancher

If needed, retrieve the bootstrap password like this.

kubectl get secret --namespace cattle-system bootstrap-secret -o go-template='{{ .data.bootstrapPassword|base64decode}}{{ "\n" }}'

Install Kubeflow

Install the StorageClass

Prerequisites: install nfs-server on 172.27.244.150 first (see the NFS section below), then install the nfs-client-provisioner storage class. Run on 172.27.244.151.

nfs-client-provisioner docs:

https://github.com/kubernetes-sigs/nfs-subdir-external-provisioner

https://artifacthub.io/packages/helm/nfs-subdir-external-provisioner/nfs-subdir-external-provisioner

Mount option references:

https://access.redhat.com/documentation/en-us/red_hat_enterprise_linux/7/html/storage_administration_guide/s1-nfs-client-config-options

https://www.man7.org/linux/man-pages/man5/nfs.5.html

helm repo add nfs-subdir-external-provisioner https://kubernetes-sigs.github.io/nfs-subdir-external-provisioner/
helm repo update
helm install nfs-client-provisioner nfs-subdir-external-provisioner/nfs-subdir-external-provisioner \
    --set nfs.server=172.27.244.150 \
    --set nfs.path=/data_nfs \
    --set nfs.mountOptions={"nfsvers=4\,minorversion=0\,rsize=1048576\,wsize=1048576\,hard\,timeo=600\,retrans=2\,noresvport"} \
    --set storageClass.defaultClass=true

The chart defaults to pulling v4.0.2 of the provisioner image from registry.k8s.io, which may be unreachable; pull a Docker Hub mirror and retag it on the three worker servers.

sudo docker pull strongxyz/nfs-subdir-external-provisioner:v4.0.2
sudo docker tag strongxyz/nfs-subdir-external-provisioner:v4.0.2 registry.k8s.io/sig-storage/nfs-subdir-external-provisioner:v4.0.2
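
Every node that mounts NFS volumes also needs the NFS client (sudo apt -y install nfs-common on the workers). After that, a throwaway PVC is a quick way to confirm dynamic provisioning works; a sketch, run on 172.27.244.151, the PVC name is arbitrary:

cat <<EOF | kubectl apply -f -
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: nfs-test-pvc
spec:
  accessModes: [ReadWriteMany]
  resources:
    requests:
      storage: 1Mi
EOF
kubectl get pvc nfs-test-pvc    # STATUS should become Bound
kubectl delete pvc nfs-test-pvc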

Install kustomize

Run on 172.27.244.151.

wget https://github.com/kubernetes-sigs/kustomize/releases/download/kustomize/v5.1.0/kustomize_v5.1.0_linux_amd64.tar.gz
tar zxf kustomize_v5.1.0_linux_amd64.tar.gz && sudo cp kustomize /usr/local/bin/
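
Verify the binary:

kustomize version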

Download Kubeflow and the Images

Run on 172.27.244.151.

Download Kubeflow

wget https://github.com/kubeflow/manifests/archive/refs/tags/v1.7.0.tar.gz
tar zxf v1.7.0.tar.gz && cd manifests-1.7.0

Check which images are needed. gcr.io is unreachable from mainland China, so relay the images through a server abroad to hub.docker.com (register your own repositories there), then pull them from Docker Hub!

kustomize build example > kustomize_build_example.out.txt
awk -F': ' '/image: gcr.io/{print $2}' kustomize_build_example.out.txt | sort -u > pull.image.list.txt

Required images:

gcr.io/arrikto/kubeflow/oidc-authservice:e236439
gcr.io/knative-releases/knative.dev/eventing/cmd/controller@sha256:33d78536e9b38dbb2ec2952207b48ff8e05acb48e7d28c2305bd0a0f7156198f
gcr.io/knative-releases/knative.dev/eventing/cmd/mtping@sha256:282b5265e1ef26309b3343038c9b4f172654e06cbee46f6ddffd23ea9ad9a3be
gcr.io/knative-releases/knative.dev/eventing/cmd/webhook@sha256:d217ab7e3452a87f8cbb3b45df65c98b18b8be39551e3e960cd49ea44bb415ba
gcr.io/knative-releases/knative.dev/net-istio/cmd/controller@sha256:2b484d982ef1a5d6ff93c46d3e45f51c2605c2e3ed766e20247d1727eb5ce918
gcr.io/knative-releases/knative.dev/net-istio/cmd/webhook@sha256:59b6a46d3b55a03507c76a3afe8a4ee5f1a38f1130fd3d65c9fe57fff583fa8d
gcr.io/knative-releases/knative.dev/serving/cmd/activator@sha256:c3bbf3a96920048869dcab8e133e00f59855670b8a0bbca3d72ced2f512eb5e1
gcr.io/knative-releases/knative.dev/serving/cmd/autoscaler@sha256:caae5e34b4cb311ed8551f2778cfca566a77a924a59b775bd516fa8b5e3c1d7f
gcr.io/knative-releases/knative.dev/serving/cmd/controller@sha256:38f9557f4d61ec79cc2cdbe76da8df6c6ae5f978a50a2847c22cc61aa240da95
gcr.io/knative-releases/knative.dev/serving/cmd/domain-mapping@sha256:763d648bf1edee2b4471b0e211dbc53ba2d28f92e4dae28ccd39af7185ef2c96
gcr.io/knative-releases/knative.dev/serving/cmd/domain-mapping-webhook@sha256:a4ba0076df2efaca2eed561339e21b3a4ca9d90167befd31de882bff69639470
gcr.io/knative-releases/knative.dev/serving/cmd/queue@sha256:505179c0c4892ea4a70e78bc52ac21b03cd7f1a763d2ecc78e7bbaa1ae59c86c
gcr.io/knative-releases/knative.dev/serving/cmd/webhook@sha256:bc13765ba4895c0fa318a065392d05d0adc0e20415c739e0aacb3f56140bf9ae
gcr.io/kubebuilder/kube-rbac-proxy:v0.13.1
gcr.io/kubebuilder/kube-rbac-proxy:v0.8.0
gcr.io/ml-pipeline/api-server:2.0.0-alpha.7
gcr.io/ml-pipeline/cache-server:2.0.0-alpha.7
gcr.io/ml-pipeline/frontend
gcr.io/ml-pipeline/frontend:2.0.0-alpha.7
gcr.io/ml-pipeline/metadata-envoy:2.0.0-alpha.7
gcr.io/ml-pipeline/metadata-writer:2.0.0-alpha.7
gcr.io/ml-pipeline/minio:RELEASE.2019-08-14T20-37-41Z-license-compliance
gcr.io/ml-pipeline/mysql:8.0.26
gcr.io/ml-pipeline/persistenceagent:2.0.0-alpha.7
gcr.io/ml-pipeline/scheduledworkflow:2.0.0-alpha.7
gcr.io/ml-pipeline/viewer-crd-controller:2.0.0-alpha.7
gcr.io/ml-pipeline/visualization-server
gcr.io/ml-pipeline/visualization-server:2.0.0-alpha.7
gcr.io/ml-pipeline/workflow-controller:v3.3.8-license-compliance
gcr.io/tfx-oss-public/ml_metadata_store_server:1.5.0

Download the Images

Download the images and push them to your personal hub.docker.com repositories (docker login is not in the script; log in first, or add the login to the script).

A hub.docker.com account is required; for example, my personal namespace is https://hub.docker.com/u/strongxyz

Run on a server that can reach gcr.io.

a=(
gcr.io/arrikto/kubeflow/oidc-authservice:e236439
gcr.io/kubebuilder/kube-rbac-proxy:v0.13.1
gcr.io/kubebuilder/kube-rbac-proxy:v0.8.0
gcr.io/ml-pipeline/api-server:2.0.0-alpha.7
gcr.io/ml-pipeline/cache-server:2.0.0-alpha.7
gcr.io/ml-pipeline/frontend
gcr.io/ml-pipeline/frontend:2.0.0-alpha.7
gcr.io/ml-pipeline/metadata-envoy:2.0.0-alpha.7
gcr.io/ml-pipeline/metadata-writer:2.0.0-alpha.7
gcr.io/ml-pipeline/minio:RELEASE.2019-08-14T20-37-41Z-license-compliance
gcr.io/ml-pipeline/mysql:8.0.26
gcr.io/ml-pipeline/persistenceagent:2.0.0-alpha.7
gcr.io/ml-pipeline/scheduledworkflow:2.0.0-alpha.7
gcr.io/ml-pipeline/viewer-crd-controller:2.0.0-alpha.7
gcr.io/ml-pipeline/visualization-server
gcr.io/ml-pipeline/visualization-server:2.0.0-alpha.7
gcr.io/ml-pipeline/workflow-controller:v3.3.8-license-compliance
gcr.io/tfx-oss-public/ml_metadata_store_server:1.5.0
)
b=(${a[*]//gcr.io*\//strongxyz/})
for ((i=0; i<${#a[*]}; i++)); do
    sudo docker pull "${a[i]}"
    sudo docker tag "${a[i]}" "${b[i]}"
    sudo docker push "${b[i]}"
    sudo docker rmi "${b[i]}"
    sudo docker rmi "${a[i]}"
done

a=(
gcr.io/knative-releases/knative.dev/eventing/cmd/controller@sha256:33d78536e9b38dbb2ec2952207b48ff8e05acb48e7d28c2305bd0a0f7156198f
gcr.io/knative-releases/knative.dev/eventing/cmd/mtping@sha256:282b5265e1ef26309b3343038c9b4f172654e06cbee46f6ddffd23ea9ad9a3be
gcr.io/knative-releases/knative.dev/eventing/cmd/webhook@sha256:d217ab7e3452a87f8cbb3b45df65c98b18b8be39551e3e960cd49ea44bb415ba
gcr.io/knative-releases/knative.dev/net-istio/cmd/controller@sha256:2b484d982ef1a5d6ff93c46d3e45f51c2605c2e3ed766e20247d1727eb5ce918
gcr.io/knative-releases/knative.dev/net-istio/cmd/webhook@sha256:59b6a46d3b55a03507c76a3afe8a4ee5f1a38f1130fd3d65c9fe57fff583fa8d
gcr.io/knative-releases/knative.dev/serving/cmd/activator@sha256:c3bbf3a96920048869dcab8e133e00f59855670b8a0bbca3d72ced2f512eb5e1
gcr.io/knative-releases/knative.dev/serving/cmd/autoscaler@sha256:caae5e34b4cb311ed8551f2778cfca566a77a924a59b775bd516fa8b5e3c1d7f
gcr.io/knative-releases/knative.dev/serving/cmd/controller@sha256:38f9557f4d61ec79cc2cdbe76da8df6c6ae5f978a50a2847c22cc61aa240da95
gcr.io/knative-releases/knative.dev/serving/cmd/domain-mapping@sha256:763d648bf1edee2b4471b0e211dbc53ba2d28f92e4dae28ccd39af7185ef2c96
gcr.io/knative-releases/knative.dev/serving/cmd/domain-mapping-webhook@sha256:a4ba0076df2efaca2eed561339e21b3a4ca9d90167befd31de882bff69639470
gcr.io/knative-releases/knative.dev/serving/cmd/queue@sha256:505179c0c4892ea4a70e78bc52ac21b03cd7f1a763d2ecc78e7bbaa1ae59c86c
gcr.io/knative-releases/knative.dev/serving/cmd/webhook@sha256:bc13765ba4895c0fa318a065392d05d0adc0e20415c739e0aacb3f56140bf9ae
)
b=(
strongxyz/eventing_cmd_controller:sha256
strongxyz/eventing_cmd_mtping:sha256
strongxyz/eventing_cmd_webhook:sha256
strongxyz/net-istio_cmd_controller:sha256
strongxyz/net-istio_cmd_webhook:sha256
strongxyz/serving_cmd_activator:sha256
strongxyz/serving_cmd_autoscaler:sha256
strongxyz/serving_cmd_controller:sha256
strongxyz/serving_cmd_domain-mapping:sha256
strongxyz/serving_cmd_domain-mapping-webhook:sha256
strongxyz/serving_cmd_queue:sha256
strongxyz/serving_cmd_webhook:sha256
)
for ((i=0; i<${#a[*]}; i++)); do
    sudo docker pull "${a[i]}"
    sudo docker tag "${a[i]}" "${b[i]}"
    sudo docker push "${b[i]}"
    sudo docker rmi "${b[i]}"
    sudo docker rmi "${a[i]}"
done

Run on the three workers 172.27.244.154-156: pull the mirrored images and retag them back to the original names. If a master also acts as a worker, pull the images on the master too.

Pulling does not require logging in.

a=(
gcr.io/arrikto/kubeflow/oidc-authservice:e236439
gcr.io/kubebuilder/kube-rbac-proxy:v0.13.1
gcr.io/kubebuilder/kube-rbac-proxy:v0.8.0
gcr.io/ml-pipeline/api-server:2.0.0-alpha.7
gcr.io/ml-pipeline/cache-server:2.0.0-alpha.7
gcr.io/ml-pipeline/frontend
gcr.io/ml-pipeline/frontend:2.0.0-alpha.7
gcr.io/ml-pipeline/metadata-envoy:2.0.0-alpha.7
gcr.io/ml-pipeline/metadata-writer:2.0.0-alpha.7
gcr.io/ml-pipeline/minio:RELEASE.2019-08-14T20-37-41Z-license-compliance
gcr.io/ml-pipeline/mysql:8.0.26
gcr.io/ml-pipeline/persistenceagent:2.0.0-alpha.7
gcr.io/ml-pipeline/scheduledworkflow:2.0.0-alpha.7
gcr.io/ml-pipeline/viewer-crd-controller:2.0.0-alpha.7
gcr.io/ml-pipeline/visualization-server
gcr.io/ml-pipeline/visualization-server:2.0.0-alpha.7
gcr.io/ml-pipeline/workflow-controller:v3.3.8-license-compliance
gcr.io/tfx-oss-public/ml_metadata_store_server:1.5.0
)
b=(${a[*]//gcr.io*\//strongxyz/})
for ((i=0; i<${#a[*]}; i++)); do
    sudo docker pull "${b[i]}"
    sudo docker tag "${b[i]}" "${a[i]}"
    sudo docker rmi "${b[i]}"
done

a=(
gcr.io/knative-releases/knative.dev/eventing/cmd/controller:sha256
gcr.io/knative-releases/knative.dev/eventing/cmd/mtping:sha256
gcr.io/knative-releases/knative.dev/eventing/cmd/webhook:sha256
gcr.io/knative-releases/knative.dev/net-istio/cmd/controller:sha256
gcr.io/knative-releases/knative.dev/net-istio/cmd/webhook:sha256
gcr.io/knative-releases/knative.dev/serving/cmd/activator:sha256
gcr.io/knative-releases/knative.dev/serving/cmd/autoscaler:sha256
gcr.io/knative-releases/knative.dev/serving/cmd/controller:sha256
gcr.io/knative-releases/knative.dev/serving/cmd/domain-mapping:sha256
gcr.io/knative-releases/knative.dev/serving/cmd/domain-mapping-webhook:sha256
gcr.io/knative-releases/knative.dev/serving/cmd/queue:sha256
gcr.io/knative-releases/knative.dev/serving/cmd/webhook:sha256
)
b=(
strongxyz/eventing_cmd_controller:sha256
strongxyz/eventing_cmd_mtping:sha256
strongxyz/eventing_cmd_webhook:sha256
strongxyz/net-istio_cmd_controller:sha256
strongxyz/net-istio_cmd_webhook:sha256
strongxyz/serving_cmd_activator:sha256
strongxyz/serving_cmd_autoscaler:sha256
strongxyz/serving_cmd_controller:sha256
strongxyz/serving_cmd_domain-mapping:sha256
strongxyz/serving_cmd_domain-mapping-webhook:sha256
strongxyz/serving_cmd_queue:sha256
strongxyz/serving_cmd_webhook:sha256
)
for ((i=0; i<${#a[*]}; i++)); do
    sudo docker pull "${b[i]}"
    sudo docker tag "${b[i]}" "${a[i]}"
    sudo docker rmi "${b[i]}"
done

vi example/kustomization.yaml and append the following at the end.

images:
  - name: gcr.io/knative-releases/knative.dev/eventing/cmd/controller@sha256:33d78536e9b38dbb2ec2952207b48ff8e05acb48e7d28c2305bd0a0f7156198f
    newName: gcr.io/knative-releases/knative.dev/eventing/cmd/controller
    newTag: "sha256"
  - name: gcr.io/knative-releases/knative.dev/eventing/cmd/mtping@sha256:282b5265e1ef26309b3343038c9b4f172654e06cbee46f6ddffd23ea9ad9a3be
    newName: gcr.io/knative-releases/knative.dev/eventing/cmd/mtping
    newTag: "sha256"
  - name: gcr.io/knative-releases/knative.dev/eventing/cmd/webhook@sha256:d217ab7e3452a87f8cbb3b45df65c98b18b8be39551e3e960cd49ea44bb415ba
    newName: gcr.io/knative-releases/knative.dev/eventing/cmd/webhook
    newTag: "sha256"
  - name: gcr.io/knative-releases/knative.dev/net-istio/cmd/controller@sha256:2b484d982ef1a5d6ff93c46d3e45f51c2605c2e3ed766e20247d1727eb5ce918
    newName: gcr.io/knative-releases/knative.dev/net-istio/cmd/controller
    newTag: "sha256"
  - name: gcr.io/knative-releases/knative.dev/net-istio/cmd/webhook@sha256:59b6a46d3b55a03507c76a3afe8a4ee5f1a38f1130fd3d65c9fe57fff583fa8d
    newName: gcr.io/knative-releases/knative.dev/net-istio/cmd/webhook
    newTag: "sha256"
  - name: gcr.io/knative-releases/knative.dev/serving/cmd/activator@sha256:c3bbf3a96920048869dcab8e133e00f59855670b8a0bbca3d72ced2f512eb5e1
    newName: gcr.io/knative-releases/knative.dev/serving/cmd/activator
    newTag: "sha256"
  - name: gcr.io/knative-releases/knative.dev/serving/cmd/autoscaler@sha256:caae5e34b4cb311ed8551f2778cfca566a77a924a59b775bd516fa8b5e3c1d7f
    newName: gcr.io/knative-releases/knative.dev/serving/cmd/autoscaler
    newTag: "sha256"
  - name: gcr.io/knative-releases/knative.dev/serving/cmd/controller@sha256:38f9557f4d61ec79cc2cdbe76da8df6c6ae5f978a50a2847c22cc61aa240da95
    newName: gcr.io/knative-releases/knative.dev/serving/cmd/controller
    newTag: "sha256"
  - name: gcr.io/knative-releases/knative.dev/serving/cmd/domain-mapping@sha256:763d648bf1edee2b4471b0e211dbc53ba2d28f92e4dae28ccd39af7185ef2c96
    newName: gcr.io/knative-releases/knative.dev/serving/cmd/domain-mapping
    newTag: "sha256"
  - name: gcr.io/knative-releases/knative.dev/serving/cmd/domain-mapping-webhook@sha256:a4ba0076df2efaca2eed561339e21b3a4ca9d90167befd31de882bff69639470
    newName: gcr.io/knative-releases/knative.dev/serving/cmd/domain-mapping-webhook
    newTag: "sha256"
  - name: gcr.io/knative-releases/knative.dev/serving/cmd/queue@sha256:505179c0c4892ea4a70e78bc52ac21b03cd7f1a763d2ecc78e7bbaa1ae59c86c
    newName: gcr.io/knative-releases/knative.dev/serving/cmd/queue
    newTag: "sha256"
  - name: gcr.io/knative-releases/knative.dev/serving/cmd/webhook@sha256:bc13765ba4895c0fa318a065392d05d0adc0e20415c739e0aacb3f56140bf9ae
    newName: gcr.io/knative-releases/knative.dev/serving/cmd/webhook
    newTag: "sha256"

Change the Default Kubeflow Password

Any machine with python3 will do; here it runs on 172.27.244.151.

sudo apt install python3-pip
sudo pip3 install passlib
python3 -c 'from passlib.hash import bcrypt; import getpass; print(bcrypt.using(rounds=12, ident="2y").hash(getpass.getpass()))'

The command above is interactive: type the password (here 123456) and it returns the hash.

123456

$2y$12$cDyyBfNqBDpQ9kkRoJYSI.xWggu2r9iHj1234GuTFddJkaWZu3a33

vi common/dex/base/config-map.yaml and update the corresponding entry.
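
For orientation, the block to change in config-map.yaml looks roughly like this in the 1.7.0 manifests (user@example.com is the default login; replace the hash value with the one generated above):

staticPasswords:
- email: user@example.com
  hash: $2y$12$cDyyBfNqBDpQ9kkRoJYSI.xWggu2r9iHj1234GuTFddJkaWZu3a33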

One-Command Kubeflow Install

while ! kustomize build example | awk '!/well-defined/' | kubectl apply -f -; do echo "Retrying to apply resources"; sleep 10; done
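
Once the loop finishes, watch the pods come up; everything should eventually be Running (the namespace list follows the upstream manifests README):

kubectl get pods -n cert-manager
kubectl get pods -n istio-system
kubectl get pods -n auth
kubectl get pods -n knative-eventing
kubectl get pods -n knative-serving
kubectl get pods -n kubeflow
kubectl get pods -n kubeflow-user-example-com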

After Installation

Issue 1

Pod error:

error: resource mapping not found for name: "webhook" namespace: "knative-serving" from "STDIN": no matches for kind "HorizontalPodAutoscaler" in version "autoscaling/v2beta2"

Cause:

https://kubernetes.io/docs/reference/using-api/deprecation-guide/#horizontalpodautoscaler-v125

https://www.kubeflow.org/docs/releases/kubeflow-1.7/#dependency-versions-manifests

Fix:

In common/knative/knative-serving/base/upstream/serving-core.yaml, change the HorizontalPodAutoscaler apiVersion from autoscaling/v2beta2 to autoscaling/v2.

Issue 2

Some images' pull policy needs to be changed to IfNotPresent, in the following files (see the sed sketch after the list):

common/oidc-authservice/base/statefulset.yaml
apps/pipeline/upstream/base/pipeline/ml-pipeline-viewer-crd-deployment.yaml
apps/pipeline/upstream/base/cache/cache-deployment.yaml
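
A one-liner sketch to flip the policy in those files (inspect the diff afterwards; the exact field value in your copies may differ):

sed -i 's/imagePullPolicy: Always/imagePullPolicy: IfNotPresent/' \
    common/oidc-authservice/base/statefulset.yaml \
    apps/pipeline/upstream/base/pipeline/ml-pipeline-viewer-crd-deployment.yaml \
    apps/pipeline/upstream/base/cache/cache-deployment.yaml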

Issue 3

Some components run only one replica; to avoid pulling images again when the pod is rescheduled onto a different worker, pull the following images on all three workers.

sudo docker pull kubeflowkatib/katib-db-manager:v0.15.0
sudo docker pull mysql:8.0.29
sudo docker pull python:3.7

Issue 4

The mysql pod reports: chown: changing ownership of '/var/lib/mysql/': Operation not permitted

Fix:

On the NFS server, change all_squash to no_root_squash:

sudo vi /etc/exports

sudo exportfs -a

Using no_root_squash when installing the NFS server in the first place avoids this issue.

Install keepalived

Pick two machines for nginx + keepalived; here they are installed on 172.27.244.152-153.

sudo apt -y install nginx keepalived

nginx configuration:

sudo rm /etc/nginx/sites-enabled/default
sudo cp /etc/nginx/nginx.conf{,.bak}
sudo vi /etc/nginx/nginx.conf

Reference configuration (the stream block belongs at the top level of nginx.conf, alongside the http block, not inside it):

stream {
        upstream k8s-80 {
            least_conn;
            server 172.27.244.154:80 max_fails=3 fail_timeout=5s;
            server 172.27.244.155:80 max_fails=3 fail_timeout=5s;
            server 172.27.244.156:80 max_fails=3 fail_timeout=5s;
        }
        upstream k8s-443 {
            least_conn;
            server 172.27.244.154:443 max_fails=3 fail_timeout=5s;
            server 172.27.244.155:443 max_fails=3 fail_timeout=5s;
            server 172.27.244.156:443 max_fails=3 fail_timeout=5s;
        }
        server {
            listen 80;
            proxy_pass k8s-80;
        }
        server {
            listen 443;
            proxy_pass k8s-443;
        }
}

sudo nginx -t
sudo nginx -s reload

keepalived configuration:

https://manpages.debian.org/unstable/keepalived/keepalived.conf.5.en.html

https://manpages.ubuntu.com/manpages/focal/man5/keepalived.conf.5.html

https://keepalived.readthedocs.io/en/latest/configuration_synopsis.html#global-definitions-synopsis

https://github.com/acassen/keepalived/tree/master/doc/samples

/usr/share/doc/keepalived/samples/keepalived.conf.vrrp.localcheck

/usr/share/doc/keepalived/samples/keepalived.conf.vrrp

sudo vi /etc/keepalived/keepalived.conf

Reference configuration; on the other machine, just swap the two IPs (and, typically, set state BACKUP with a lower priority on the standby).

global_defs {
   script_user root
   enable_script_security
}

vrrp_script chk_nginx {
    script "killall -0 nginx"
    interval 2
    weight 2
}

vrrp_instance VI_1 {
    state MASTER
    interface ens18
    garp_master_delay 10  #default
    #smtp_alert  # whether to send email alerts
    virtual_router_id 55
    priority 100  # instance priority
    advert_int 1
    authentication {
        auth_type PASS
        auth_pass 00002023
    }
    unicast_src_ip 172.27.244.152
    unicast_peer {  # unicast
        172.27.244.153
    }
    virtual_ipaddress {
        172.27.244.160
    }
    track_script {
       chk_nginx
    }
}

sudo systemctl restart keepalived
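
To verify, check that the VIP landed on the MASTER, then stop nginx there and watch the address move to the peer (the killall -0 health check fails once nginx is gone):

ip addr show ens18 | grep 172.27.244.160
sudo systemctl stop nginx    # simulate a failure; the VIP should appear on the other node within seconds
sudo systemctl start nginx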

DNS entries:

172.27.244.160 rancher.ai.example.com

172.27.244.160 kubeflow.ai.example.com

Install the NFS Server

Run on 172.27.244.150.

Install.

sudo apt -y install nfs-kernel-server
sudo mkdir /data_nfs
sudo chown 1000:1000 /data_nfs
echo '/data_nfs 172.27.244.0/24(rw,sync,insecure,no_subtree_check,no_root_squash,no_all_squash,anonuid=1000,anongid=1000)' | sudo tee -a /etc/exports
sudo exportfs -a

Check NFS status.

systemctl status nfs-server
sudo rpcinfo -p | grep nfs
cat /var/lib/nfs/etab
sudo exportfs
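
From any worker, a quick mount test confirms the export is usable (needs nfs-common on the client):

sudo apt -y install nfs-common
sudo mount -t nfs 172.27.244.150:/data_nfs /mnt
touch /mnt/.nfs_write_test && rm /mnt/.nfs_write_test && echo 'NFS write OK'
sudo umount /mnt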

Install Harbor

Run on 172.27.244.150.

Install Docker

Switch the version to 5:23.0.6-1~ubuntu.20.04~focal (or drop the version pin to install the latest); the remaining steps are the same as the Docker section above and are omitted.

#list the installable versions
apt-cache madison docker-ce

VERSION_STRING=5:23.0.6-1~ubuntu.20.04~focal
sudo apt-get install docker-ce=$VERSION_STRING docker-ce-cli=$VERSION_STRING containerd.io docker-buildx-plugin docker-compose-plugin

Install docker-compose

Confirm the v2.20.1 version number yourself; the current latest version also works.

sudo curl -L https://github.com/docker/compose/releases/download/v2.20.1/docker-compose-linux-x86_64 -o /usr/local/bin/docker-compose
sudo chmod +x /usr/local/bin/docker-compose

Self-Signed Certificate

openssl genrsa -out ca.key 4096
openssl req -x509 -new -nodes -sha512 -days 3650 -subj "/O=Sunwoda-evb/CN=Sunwoda-evb Certs H1" -key ca.key -out ca.crt
openssl genrsa -out ai.example.com.key 4096
openssl req -sha512 -new -subj "/C=CN/ST=Guangdong/L=Shenzhen/O=sunwoda-evb/OU=IT/CN=ai.example.com" -key ai.example.com.key -out ai.example.com.csr

tee v3.ext <<- EOF
authorityKeyIdentifier=keyid,issuer
basicConstraints=CA:FALSE
keyUsage = digitalSignature, nonRepudiation, keyEncipherment, dataEncipherment
extendedKeyUsage = serverAuth
subjectAltName = @alt_names

[alt_names]
DNS.1=*.ai.example.com
EOF

openssl x509 -req -sha512 -days 3650 -extfile v3.ext -CA ca.crt -CAkey ca.key -CAcreateserial -in ai.example.com.csr -out ai.example.com.crt
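
Sanity-check the signed certificate, in particular the SAN entry:

openssl x509 -in ai.example.com.crt -noout -text | grep -A1 'Subject Alternative Name'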

Start Installing Harbor

wget https://github.com/goharbor/harbor/releases/download/v2.8.2/harbor-offline-installer-v2.8.2.tgz
tar zxf harbor-offline-installer-v2.8.2.tgz

sudo mkdir -p /data/harbor/cert
sudo cp ai.example.com.key ai.example.com.crt /data/harbor/cert
openssl x509 -inform PEM -in ai.example.com.crt -out ai.example.com.cert
sudo mkdir -p /etc/docker/certs.d/harbor.ai.example.com
sudo cp {ca.crt,ai.example.com.key,ai.example.com.cert} /etc/docker/certs.d/harbor.ai.example.com
sudo systemctl restart docker

Edit harbor.yml in the extracted harbor/ directory (copy harbor.yml.tmpl to harbor.yml first); sample configuration:

hostname: harbor.ai.example.com
http:
  port: 8083
https:
  port: 8443
  certificate: /data/harbor/cert/ai.example.com.crt
  private_key: /data/harbor/cert/ai.example.com.key
external_url: https://harbor.ai.example.com
harbor_admin_password: 123456
database:
  password: root123
  max_idle_conns: 100
  max_open_conns: 900
  conn_max_lifetime: 5m
  conn_max_idle_time: 0
data_volume: /data/harbor

Run the installer (from the harbor/ directory).

sudo ./install.sh --with-trivy
sudo docker-compose up -d

After changing harbor.yml, reconfigure and restart.

sudo ./prepare --with-trivy
sudo docker-compose down -v
sudo docker-compose up -d
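
To confirm Harbor is healthy, check the containers and try the health API and a login (assumes the DNS entries at the end of this section are in place):

sudo docker-compose ps
curl -sk https://harbor.ai.example.com/api/v2.0/health
sudo docker login harbor.ai.example.com -u admin -p 123456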

Install nginx

Sample configuration:

server {
    listen                  443 ssl http2;
    server_name             harbor.ai.example.com;

    # SSL
    ssl_certificate         /etc/nginx/ssl/ai.example.com.crt;
    ssl_certificate_key     /etc/nginx/ssl/ai.example.com.key;

    # allow large image layer uploads (the nginx default of 1m breaks docker push)
    client_max_body_size 0;

    location / {
        proxy_pass                         https://127.0.0.1:8443;
        proxy_http_version                 1.1;

        # Proxy headers
        proxy_set_header Host              $host;
        proxy_set_header X-Real-IP         $remote_addr;
        proxy_set_header X-Forwarded-For   $proxy_add_x_forwarded_for;
        proxy_set_header X-Forwarded-Proto $scheme;
        proxy_set_header X-Forwarded-Host  $host;
        proxy_set_header X-Forwarded-Port  $server_port;

        # Proxy timeouts
        proxy_connect_timeout              3600s;
        proxy_send_timeout                 3600s;
        proxy_read_timeout                 3600s;
    }
}

server {
    listen      80;
    listen      [::]:80;
    server_name harbor.ai.example.com;

    location / {
        return 301 https://harbor.ai.example.com$request_uri;
    }
}

Trust the Self-Signed CA Certificate

Run on the worker nodes; every node that pulls images from harbor.ai.example.com must trust the self-signed certificate.

The self-signed CA certificate can be published through nginx for easy downloading; copying files around over the terminal is a hassle.

sudo mkdir -p /etc/docker/certs.d/harbor.ai.example.com
sudo wget http://downloads.ai.example.com/download/ssl/ca.H1.crt -O /etc/docker/certs.d/harbor.ai.example.com/ca.H1.crt
#or, if the domain does not resolve
sudo wget http://172.27.244.150/download/ssl/ca.H1.crt -O /etc/docker/certs.d/harbor.ai.example.com/ca.H1.crt
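
Docker picks up certs.d changes without a daemon restart; a login verifies the chain of trust:

sudo docker login harbor.ai.example.com -u admin -p 123456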

nginx configuration example:

server {
    listen      80;
    listen                  443 ssl http2;
    server_name             downloads.ai.example.com;
    root                    /srv/www/downloads.ai.example.com;

    # SSL
    ssl_certificate         /etc/nginx/ssl/ai.example.com.crt;
    ssl_certificate_key     /etc/nginx/ssl/ai.example.com.key;

    location /download/ {
        autoindex on;
        autoindex_exact_size off;
        autoindex_localtime on;

        location ~* \.(yml|yaml|conf|cnf)$ {
            add_header Content-Type 'text/plain; charset=utf-8';
        }
    }

    location / {
        add_header Content-Type 'text/plain; charset=utf-8';
    }
}

DNS entries:

172.27.244.150 harbor.ai.example.com

172.27.244.150 downloads.ai.example.com