Server and software version planning
7 servers and 1 virtual IP:
172.27.244.151-153 (3 masters, 4C/8G/500GB)
172.27.244.154-156 (3 workers, 8C/32G/500GB)
172.27.244.150 (nfs/harbor, 4C/8G/2TB)
172.27.244.160 (virtual IP)
Notes:
The Harbor image registry can go on the NFS server as a standby; the Kubeflow platform itself does not need a registry for now.
Versions:
OS:ubuntu20.04.6
rke:1.4.6 docker:20.10.X(5:20.10.24~3-0~ubuntu-focal)
rancher:2.7.5 k8s:支持1.23-1.26(rancher/hyperkube:v1.26.4-rancher2)
kustomize:5.1.0
kubeflow:1.7.0
Install the k8s cluster
On the 6 servers 172.27.244.151-156.
By default every step runs on all 6 servers; steps that only run on some machines say so explicitly.
Configure passwordless sudo
echo "$USER ALL=(ALL:ALL) NOPASSWD:ALL" | sudo tee /etc/sudoers.d/$USER
Disable the firewall
These machines sit on an internal network, so disabling the firewall keeps things simple. For internet-facing services, keep the firewall on and open ports according to each service's official documentation!
sudo systemctl stop ufw
sudo systemctl disable ufw
Install docker
https://ranchermanager.docs.rancher.com/getting-started/installation-and-upgrade/installation-requirements/install-docker
Rancher also provides an official docker install script at the link above.
Start the installation!
docker official docs: https://docs.docker.com/engine/install/ubuntu/
sudo apt-get update
sudo apt-get install ca-certificates curl gnupg
sudo install -m 0755 -d /etc/apt/keyrings
curl -fsSL https://download.docker.com/linux/ubuntu/gpg | sudo gpg --dearmor -o /etc/apt/keyrings/docker.gpg
sudo chmod a+r /etc/apt/keyrings/docker.gpg
echo \
  "deb [arch="$(dpkg --print-architecture)" signed-by=/etc/apt/keyrings/docker.gpg] https://download.docker.com/linux/ubuntu \
  "$(. /etc/os-release && echo "$VERSION_CODENAME")" stable" | \
  sudo tee /etc/apt/sources.list.d/docker.list > /dev/null
sudo apt-get update
VERSION_STRING=5:20.10.24~3-0~ubuntu-focal
sudo apt-get install docker-ce=$VERSION_STRING docker-ce-cli=$VERSION_STRING containerd.io docker-buildx-plugin docker-compose-plugin
Configure k8s prerequisites
k8s official docs:
https://kubernetes.io/docs/setup/production-environment/tools/kubeadm/install-kubeadm/
https://kubernetes.io/docs/setup/production-environment/container-runtimes/#install-and-configure-prerequisites
rke official docs:
https://rke.docs.rancher.com/os
Disable swap
sudo swapoff -a && sudo sed -i '/swap/s/^/#/' /etc/fstab
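To verify, swapon --show should print nothing once swap is fully off:
swapon --show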
Allow iptables to see bridged traffic
First check the current state.
lsmod | grep br_netfilter
lsmod | grep overlay
sysctl net.bridge.bridge-nf-call-iptables net.bridge.bridge-nf-call-ip6tables net.ipv4.ip_forward
cat <<EOF | sudo tee /etc/modules-load.d/k8s.conf
overlay
br_netfilter
EOF
sudo modprobe overlay
sudo modprobe br_netfilter
cat <<EOF | sudo tee /etc/sysctl.d/k8s.conf
net.bridge.bridge-nf-call-iptables = 1
net.bridge.bridge-nf-call-ip6tables = 1
net.ipv4.ip_forward = 1
EOF
sudo sysctl --system
Install RKE
Only one machine needs the rke tool; here it is installed on 172.27.244.151.
rke official docs: https://rke.docs.rancher.com/installation
Create the rke user
Create the user on all of 172.27.244.151-156 and add it to the docker group.
sudo useradd -m -s /bin/bash rkeuser
sudo usermod -aG docker rkeuser
sudo chpasswd <<< 'rkeuser:123456'
Create an SSH key and copy it out
Run on 172.27.244.151; the key must also be copied to rkeuser on this machine itself.
ssh-keygen -qf ~/.ssh/id_rsa -P ''
ssh-copy-id -f -p 5008 -i ~/.ssh/id_rsa.pub -o StrictHostKeyChecking=no rkeuser@172.27.244.151
ssh-copy-id -f -p 5008 -i ~/.ssh/id_rsa.pub -o StrictHostKeyChecking=no rkeuser@172.27.244.152
ssh-copy-id -f -p 5008 -i ~/.ssh/id_rsa.pub -o StrictHostKeyChecking=no rkeuser@172.27.244.153
ssh-copy-id -f -p 5008 -i ~/.ssh/id_rsa.pub -o StrictHostKeyChecking=no rkeuser@172.27.244.154
ssh-copy-id -f -p 5008 -i ~/.ssh/id_rsa.pub -o StrictHostKeyChecking=no rkeuser@172.27.244.155
ssh-copy-id -f -p 5008 -i ~/.ssh/id_rsa.pub -o StrictHostKeyChecking=no rkeuser@172.27.244.156
Create the k8s cluster with rke
Run on 172.27.244.151.
rke download: https://github.com/rancher/rke/releases/tag/v1.4.6
cd && wget https://github.com/rancher/rke/releases/download/v1.4.6/rke_linux-amd64
mkdir bin && cp rke_linux-amd64 bin && chmod +x bin/rke_linux-amd64
sudo ln -s $HOME/bin/rke_linux-amd64 /usr/local/bin/rke
Generate the cluster configuration interactively:
rke config --name cluster.yml
The default VXLAN port 8472 conflicts with Sangfor HCI (hyper-converged) environments, so change it in cluster.yml; see https://rke.docs.rancher.com/config-options/add-ons/network-plugins#flannel
network:
  plugin: flannel
  options:
    flannel_backend_type: vxlan
    flannel_backend_port: "8972"
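With cluster.yml adjusted, provision the cluster; rke up reads cluster.yml from the current directory and writes the kubeconfig to kube_config_cluster.yml when it finishes:
rke up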
Install kubectl
Run on 172.27.244.151.
Install kubectl and set up bash completion.
curl -LO "https://dl.k8s.io/release/$(curl -L -s https://dl.k8s.io/release/stable.txt)/bin/linux/amd64/kubectl"
sudo install -o root -g root -m 0755 kubectl /usr/local/bin/kubectl
echo 'source <(kubectl completion bash)' >>~/.bashrc && . ~/.bashrc
mkdir $HOME/.kube
cp kube_config_cluster.yml $HOME/.kube/config
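A quick sanity check that kubectl can reach the new cluster:
kubectl get nodes -o wide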
Install rancher
Run on 172.27.244.151.
Official docs: https://ranchermanager.docs.rancher.com/zh/pages-for-subheaders/install-upgrade-on-a-kubernetes-cluster
Install helm.
wget https://get.helm.sh/helm-v3.12.1-linux-amd64.tar.gz
tar zxf helm-v3.12.1-linux-amd64.tar.gz
sudo mv linux-amd64/helm /usr/local/bin/helm
Install rancher with helm.
helm repo add rancher-stable https://releases.rancher.com/server-charts/stable
kubectl create namespace cattle-system
# confirm the cert-manager v1.12.2 version info and replace it if needed
kubectl apply -f https://github.com/cert-manager/cert-manager/releases/download/v1.12.2/cert-manager.crds.yaml
helm repo add jetstack https://charts.jetstack.io
helm repo update
helm install cert-manager jetstack/cert-manager \
  --namespace cert-manager \
  --create-namespace \
  --version v1.12.2
helm install rancher rancher-stable/rancher \
  --namespace cattle-system \
  --set hostname=rancher.ai.example.com \
  --set bootstrapPassword=123456 \
  --set global.cattle.psp.enabled=false \
  --version 2.7.5
After installation, check rancher's status.
kubectl -n cattle-system rollout status deploy/rancher
kubectl -n cattle-system get deploy rancher
If needed, the bootstrap password can be read back like this.
kubectl get secret --namespace cattle-system bootstrap-secret -o go-template='{{ .data.bootstrapPassword|base64decode}}{{ "\n" }}'
Install kubeflow
Install the storageClass
Prerequisite: first install the nfs-server on 172.27.244.150 (see "Install nfs server" below), then install the nfs-client-provisioner storage class. Run on 172.27.244.151.
nfs-client-provisioner docs:
https://github.com/kubernetes-sigs/nfs-subdir-external-provisioner
https://artifacthub.io/packages/helm/nfs-subdir-external-provisioner/nfs-subdir-external-provisioner
Mount option references:
https://access.redhat.com/documentation/en-us/red_hat_enterprise_linux/7/html/storage_administration_guide/s1-nfs-client-config-options
https://www.man7.org/linux/man-pages/man5/nfs.5.html
helm repo add nfs-subdir-external-provisioner https://kubernetes-sigs.github.io/nfs-subdir-external-provisioner/
helm repo update
helm install nfs-client-provisioner nfs-subdir-external-provisioner/nfs-subdir-external-provisioner \
  --set nfs.server=172.27.244.150 \
  --set nfs.path=/data_nfs \
  --set nfs.mountOptions={"nfsvers=4\,minorversion=0\,rsize=1048576\,wsize=1048576\,hard\,timeo=600\,retrans=2\,noresvport"} \
  --set storageClass.defaultClass=true
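Confirm the class is registered and marked as the default:
kubectl get storageclass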
The chart uses the v4.0.2 image by default; since the upstream registry may be unreachable, pull a mirrored copy and retag it on the 3 worker servers.
sudo docker pull strongxyz/nfs-subdir-external-provisioner:v4.0.2
sudo docker tag strongxyz/nfs-subdir-external-provisioner:v4.0.2 registry.k8s.io/sig-storage/nfs-subdir-external-provisioner:v4.0.2
Install kustomize
Run on 172.27.244.151.
wget https://github.com/kubernetes-sigs/kustomize/releases/download/kustomize/v5.1.0/kustomize_v5.1.0_linux_amd64.tar.gz
tar zxf kustomize_v5.1.0_linux_amd64.tar.gz && sudo cp kustomize /usr/local/bin/
Download kubeflow and the images
Run on 172.27.244.151.
Download kubeflow
wget https://github.com/kubeflow/manifests/archive/refs/tags/v1.7.0.tar.gz
tar zxf v1.7.0.tar.gz && cd manifests-1.7.0
Work out which images are needed. gcr.io is unreachable from inside China, so relay the images through a server abroad to hub.docker.com (you need to register your own repositories there), then pull them from Docker Hub!
kustomize build example > kustomize_build_example.out.txt
awk -F': ' '/image: gcr.io/{print $2}' kustomize_build_example.out.txt | sort -u > pull.image.list.txt
Required image list:
gcr.io/arrikto/kubeflow/oidc-authservice:e236439
gcr.io/knative-releases/knative.dev/eventing/cmd/controller@sha256:33d78536e9b38dbb2ec2952207b48ff8e05acb48e7d28c2305bd0a0f7156198f
gcr.io/knative-releases/knative.dev/eventing/cmd/mtping@sha256:282b5265e1ef26309b3343038c9b4f172654e06cbee46f6ddffd23ea9ad9a3be
gcr.io/knative-releases/knative.dev/eventing/cmd/webhook@sha256:d217ab7e3452a87f8cbb3b45df65c98b18b8be39551e3e960cd49ea44bb415ba
gcr.io/knative-releases/knative.dev/net-istio/cmd/controller@sha256:2b484d982ef1a5d6ff93c46d3e45f51c2605c2e3ed766e20247d1727eb5ce918
gcr.io/knative-releases/knative.dev/net-istio/cmd/webhook@sha256:59b6a46d3b55a03507c76a3afe8a4ee5f1a38f1130fd3d65c9fe57fff583fa8d
gcr.io/knative-releases/knative.dev/serving/cmd/activator@sha256:c3bbf3a96920048869dcab8e133e00f59855670b8a0bbca3d72ced2f512eb5e1
gcr.io/knative-releases/knative.dev/serving/cmd/autoscaler@sha256:caae5e34b4cb311ed8551f2778cfca566a77a924a59b775bd516fa8b5e3c1d7f
gcr.io/knative-releases/knative.dev/serving/cmd/controller@sha256:38f9557f4d61ec79cc2cdbe76da8df6c6ae5f978a50a2847c22cc61aa240da95
gcr.io/knative-releases/knative.dev/serving/cmd/domain-mapping@sha256:763d648bf1edee2b4471b0e211dbc53ba2d28f92e4dae28ccd39af7185ef2c96
gcr.io/knative-releases/knative.dev/serving/cmd/domain-mapping-webhook@sha256:a4ba0076df2efaca2eed561339e21b3a4ca9d90167befd31de882bff69639470
gcr.io/knative-releases/knative.dev/serving/cmd/queue@sha256:505179c0c4892ea4a70e78bc52ac21b03cd7f1a763d2ecc78e7bbaa1ae59c86c
gcr.io/knative-releases/knative.dev/serving/cmd/webhook@sha256:bc13765ba4895c0fa318a065392d05d0adc0e20415c739e0aacb3f56140bf9ae
gcr.io/kubebuilder/kube-rbac-proxy:v0.13.1
gcr.io/kubebuilder/kube-rbac-proxy:v0.8.0
gcr.io/ml-pipeline/api-server:2.0.0-alpha.7
gcr.io/ml-pipeline/cache-server:2.0.0-alpha.7
gcr.io/ml-pipeline/frontend
gcr.io/ml-pipeline/frontend:2.0.0-alpha.7
gcr.io/ml-pipeline/metadata-envoy:2.0.0-alpha.7
gcr.io/ml-pipeline/metadata-writer:2.0.0-alpha.7
gcr.io/ml-pipeline/minio:RELEASE.2019-08-14T20-37-41Z-license-compliance
gcr.io/ml-pipeline/mysql:8.0.26
gcr.io/ml-pipeline/persistenceagent:2.0.0-alpha.7
gcr.io/ml-pipeline/scheduledworkflow:2.0.0-alpha.7
gcr.io/ml-pipeline/viewer-crd-controller:2.0.0-alpha.7
gcr.io/ml-pipeline/visualization-server
gcr.io/ml-pipeline/visualization-server:2.0.0-alpha.7
gcr.io/ml-pipeline/workflow-controller:v3.3.8-license-compliance
gcr.io/tfx-oss-public/ml_metadata_store_server:1.5.0
Download the images
Pull the images and push them to a personal hub.docker.com repository (the script does not include docker login; run it after logging in, or add the login to the script).
A hub.docker.com account is needed; for example, my personal repositories are at https://hub.docker.com/u/strongxyz
Run on a server that can reach gcr.io.
a=(
gcr.io/arrikto/kubeflow/oidc-authservice:e236439
gcr.io/kubebuilder/kube-rbac-proxy:v0.13.1
gcr.io/kubebuilder/kube-rbac-proxy:v0.8.0
gcr.io/ml-pipeline/api-server:2.0.0-alpha.7
gcr.io/ml-pipeline/cache-server:2.0.0-alpha.7
gcr.io/ml-pipeline/frontend
gcr.io/ml-pipeline/frontend:2.0.0-alpha.7
gcr.io/ml-pipeline/metadata-envoy:2.0.0-alpha.7
gcr.io/ml-pipeline/metadata-writer:2.0.0-alpha.7
gcr.io/ml-pipeline/minio:RELEASE.2019-08-14T20-37-41Z-license-compliance
gcr.io/ml-pipeline/mysql:8.0.26
gcr.io/ml-pipeline/persistenceagent:2.0.0-alpha.7
gcr.io/ml-pipeline/scheduledworkflow:2.0.0-alpha.7
gcr.io/ml-pipeline/viewer-crd-controller:2.0.0-alpha.7
gcr.io/ml-pipeline/visualization-server
gcr.io/ml-pipeline/visualization-server:2.0.0-alpha.7
gcr.io/ml-pipeline/workflow-controller:v3.3.8-license-compliance
gcr.io/tfx-oss-public/ml_metadata_store_server:1.5.0
)
# Derive the Docker Hub names: replace everything through the last / with strongxyz/
b=(${a[*]//gcr.io*\//strongxyz/})
for ((i=0; i<${#a[*]}; i++)); do
  sudo docker pull "${a[i]}"
  sudo docker tag "${a[i]}" "${b[i]}"
  sudo docker push "${b[i]}"
  sudo docker rmi "${b[i]}"
  sudo docker rmi "${a[i]}"
done
# Digest-pinned images cannot be pushed under the same digest, so map each one to a named tag on Docker Hub.
a=(
gcr.io/knative-releases/knative.dev/eventing/cmd/controller@sha256:33d78536e9b38dbb2ec2952207b48ff8e05acb48e7d28c2305bd0a0f7156198f
gcr.io/knative-releases/knative.dev/eventing/cmd/mtping@sha256:282b5265e1ef26309b3343038c9b4f172654e06cbee46f6ddffd23ea9ad9a3be
gcr.io/knative-releases/knative.dev/eventing/cmd/webhook@sha256:d217ab7e3452a87f8cbb3b45df65c98b18b8be39551e3e960cd49ea44bb415ba
gcr.io/knative-releases/knative.dev/net-istio/cmd/controller@sha256:2b484d982ef1a5d6ff93c46d3e45f51c2605c2e3ed766e20247d1727eb5ce918
gcr.io/knative-releases/knative.dev/net-istio/cmd/webhook@sha256:59b6a46d3b55a03507c76a3afe8a4ee5f1a38f1130fd3d65c9fe57fff583fa8d
gcr.io/knative-releases/knative.dev/serving/cmd/activator@sha256:c3bbf3a96920048869dcab8e133e00f59855670b8a0bbca3d72ced2f512eb5e1
gcr.io/knative-releases/knative.dev/serving/cmd/autoscaler@sha256:caae5e34b4cb311ed8551f2778cfca566a77a924a59b775bd516fa8b5e3c1d7f
gcr.io/knative-releases/knative.dev/serving/cmd/controller@sha256:38f9557f4d61ec79cc2cdbe76da8df6c6ae5f978a50a2847c22cc61aa240da95
gcr.io/knative-releases/knative.dev/serving/cmd/domain-mapping@sha256:763d648bf1edee2b4471b0e211dbc53ba2d28f92e4dae28ccd39af7185ef2c96
gcr.io/knative-releases/knative.dev/serving/cmd/domain-mapping-webhook@sha256:a4ba0076df2efaca2eed561339e21b3a4ca9d90167befd31de882bff69639470
gcr.io/knative-releases/knative.dev/serving/cmd/queue@sha256:505179c0c4892ea4a70e78bc52ac21b03cd7f1a763d2ecc78e7bbaa1ae59c86c
gcr.io/knative-releases/knative.dev/serving/cmd/webhook@sha256:bc13765ba4895c0fa318a065392d05d0adc0e20415c739e0aacb3f56140bf9ae
)
b=(
strongxyz/eventing_cmd_controller:sha256
strongxyz/eventing_cmd_mtping:sha256
strongxyz/eventing_cmd_webhook:sha256
strongxyz/net-istio_cmd_controller:sha256
strongxyz/net-istio_cmd_webhook:sha256
strongxyz/serving_cmd_activator:sha256
strongxyz/serving_cmd_autoscaler:sha256
strongxyz/serving_cmd_controller:sha256
strongxyz/serving_cmd_domain-mapping:sha256
strongxyz/serving_cmd_domain-mapping-webhook:sha256
strongxyz/serving_cmd_queue:sha256
strongxyz/serving_cmd_webhook:sha256
)
for ((i=0; i<${#a[*]}; i++)); do
  sudo docker pull "${a[i]}"
  sudo docker tag "${a[i]}" "${b[i]}"
  sudo docker push "${b[i]}"
  sudo docker rmi "${b[i]}"
  sudo docker rmi "${a[i]}"
done
On the 3 workers 172.27.244.154-156, pull the mirrored images and retag them back to their original names. If a master also acts as a worker node, pull the images there too.
Pulling does not require a login.
a=(
gcr.io/arrikto/kubeflow/oidc-authservice:e236439
gcr.io/kubebuilder/kube-rbac-proxy:v0.13.1
gcr.io/kubebuilder/kube-rbac-proxy:v0.8.0
gcr.io/ml-pipeline/api-server:2.0.0-alpha.7
gcr.io/ml-pipeline/cache-server:2.0.0-alpha.7
gcr.io/ml-pipeline/frontend
gcr.io/ml-pipeline/frontend:2.0.0-alpha.7
gcr.io/ml-pipeline/metadata-envoy:2.0.0-alpha.7
gcr.io/ml-pipeline/metadata-writer:2.0.0-alpha.7
gcr.io/ml-pipeline/minio:RELEASE.2019-08-14T20-37-41Z-license-compliance
gcr.io/ml-pipeline/mysql:8.0.26
gcr.io/ml-pipeline/persistenceagent:2.0.0-alpha.7
gcr.io/ml-pipeline/scheduledworkflow:2.0.0-alpha.7
gcr.io/ml-pipeline/viewer-crd-controller:2.0.0-alpha.7
gcr.io/ml-pipeline/visualization-server
gcr.io/ml-pipeline/visualization-server:2.0.0-alpha.7
gcr.io/ml-pipeline/workflow-controller:v3.3.8-license-compliance
gcr.io/tfx-oss-public/ml_metadata_store_server:1.5.0
)
# The same name mapping used on the relay server
b=(${a[*]//gcr.io*\//strongxyz/})
for ((i=0; i<${#a[*]}; i++)); do
  sudo docker pull "${b[i]}"
  sudo docker tag "${b[i]}" "${a[i]}"
  sudo docker rmi "${b[i]}"
done
# These gcr.io names carry the literal tag "sha256", matching the kustomization image overrides below.
a=(
gcr.io/knative-releases/knative.dev/eventing/cmd/controller:sha256
gcr.io/knative-releases/knative.dev/eventing/cmd/mtping:sha256
gcr.io/knative-releases/knative.dev/eventing/cmd/webhook:sha256
gcr.io/knative-releases/knative.dev/net-istio/cmd/controller:sha256
gcr.io/knative-releases/knative.dev/net-istio/cmd/webhook:sha256
gcr.io/knative-releases/knative.dev/serving/cmd/activator:sha256
gcr.io/knative-releases/knative.dev/serving/cmd/autoscaler:sha256
gcr.io/knative-releases/knative.dev/serving/cmd/controller:sha256
gcr.io/knative-releases/knative.dev/serving/cmd/domain-mapping:sha256
gcr.io/knative-releases/knative.dev/serving/cmd/domain-mapping-webhook:sha256
gcr.io/knative-releases/knative.dev/serving/cmd/queue:sha256
gcr.io/knative-releases/knative.dev/serving/cmd/webhook:sha256
)
b=(
strongxyz/eventing_cmd_controller:sha256
strongxyz/eventing_cmd_mtping:sha256
strongxyz/eventing_cmd_webhook:sha256
strongxyz/net-istio_cmd_controller:sha256
strongxyz/net-istio_cmd_webhook:sha256
strongxyz/serving_cmd_activator:sha256
strongxyz/serving_cmd_autoscaler:sha256
strongxyz/serving_cmd_controller:sha256
strongxyz/serving_cmd_domain-mapping:sha256
strongxyz/serving_cmd_domain-mapping-webhook:sha256
strongxyz/serving_cmd_queue:sha256
strongxyz/serving_cmd_webhook:sha256
)
for ((i=0; i<${#a[*]}; i++)); do
  sudo docker pull "${b[i]}"
  sudo docker tag "${b[i]}" "${a[i]}"
  sudo docker rmi "${b[i]}"
done
Edit example/kustomization.yaml and append the following at the end.
images:
- name: gcr.io/knative-releases/knative.dev/eventing/cmd/controller@sha256:33d78536e9b38dbb2ec2952207b48ff8e05acb48e7d28c2305bd0a0f7156198f
  newName: gcr.io/knative-releases/knative.dev/eventing/cmd/controller
  newTag: "sha256"
- name: gcr.io/knative-releases/knative.dev/eventing/cmd/mtping@sha256:282b5265e1ef26309b3343038c9b4f172654e06cbee46f6ddffd23ea9ad9a3be
  newName: gcr.io/knative-releases/knative.dev/eventing/cmd/mtping
  newTag: "sha256"
- name: gcr.io/knative-releases/knative.dev/eventing/cmd/webhook@sha256:d217ab7e3452a87f8cbb3b45df65c98b18b8be39551e3e960cd49ea44bb415ba
  newName: gcr.io/knative-releases/knative.dev/eventing/cmd/webhook
  newTag: "sha256"
- name: gcr.io/knative-releases/knative.dev/net-istio/cmd/controller@sha256:2b484d982ef1a5d6ff93c46d3e45f51c2605c2e3ed766e20247d1727eb5ce918
  newName: gcr.io/knative-releases/knative.dev/net-istio/cmd/controller
  newTag: "sha256"
- name: gcr.io/knative-releases/knative.dev/net-istio/cmd/webhook@sha256:59b6a46d3b55a03507c76a3afe8a4ee5f1a38f1130fd3d65c9fe57fff583fa8d
  newName: gcr.io/knative-releases/knative.dev/net-istio/cmd/webhook
  newTag: "sha256"
- name: gcr.io/knative-releases/knative.dev/serving/cmd/activator@sha256:c3bbf3a96920048869dcab8e133e00f59855670b8a0bbca3d72ced2f512eb5e1
  newName: gcr.io/knative-releases/knative.dev/serving/cmd/activator
  newTag: "sha256"
- name: gcr.io/knative-releases/knative.dev/serving/cmd/autoscaler@sha256:caae5e34b4cb311ed8551f2778cfca566a77a924a59b775bd516fa8b5e3c1d7f
  newName: gcr.io/knative-releases/knative.dev/serving/cmd/autoscaler
  newTag: "sha256"
- name: gcr.io/knative-releases/knative.dev/serving/cmd/controller@sha256:38f9557f4d61ec79cc2cdbe76da8df6c6ae5f978a50a2847c22cc61aa240da95
  newName: gcr.io/knative-releases/knative.dev/serving/cmd/controller
  newTag: "sha256"
- name: gcr.io/knative-releases/knative.dev/serving/cmd/domain-mapping@sha256:763d648bf1edee2b4471b0e211dbc53ba2d28f92e4dae28ccd39af7185ef2c96
  newName: gcr.io/knative-releases/knative.dev/serving/cmd/domain-mapping
  newTag: "sha256"
- name: gcr.io/knative-releases/knative.dev/serving/cmd/domain-mapping-webhook@sha256:a4ba0076df2efaca2eed561339e21b3a4ca9d90167befd31de882bff69639470
  newName: gcr.io/knative-releases/knative.dev/serving/cmd/domain-mapping-webhook
  newTag: "sha256"
- name: gcr.io/knative-releases/knative.dev/serving/cmd/queue@sha256:505179c0c4892ea4a70e78bc52ac21b03cd7f1a763d2ecc78e7bbaa1ae59c86c
  newName: gcr.io/knative-releases/knative.dev/serving/cmd/queue
  newTag: "sha256"
- name: gcr.io/knative-releases/knative.dev/serving/cmd/webhook@sha256:bc13765ba4895c0fa318a065392d05d0adc0e20415c739e0aacb3f56140bf9ae
  newName: gcr.io/knative-releases/knative.dev/serving/cmd/webhook
  newTag: "sha256"
Change the kubeflow default password
Any machine with python3 will do; here it runs on 172.27.244.151.
sudo apt install python3-pip
sudo pip3 install passlib
python3 -c 'from passlib.hash import bcrypt; import getpass; print(bcrypt.using(rounds=12, ident="2y").hash(getpass.getpass()))'
The command is interactive: type a password and it prints the bcrypt hash. For example, entering 123456 returns:
$2y$12$cDyyBfNqBDpQ9kkRoJYSI.xWggu2r9iHj1234GuTFddJkaWZu3a33
Edit common/dex/base/config-map.yaml and put the new hash in the corresponding place.
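For reference, the relevant fragment of that ConfigMap in the 1.7.0 manifests looks roughly like this; email and hash are the fields to change:
staticPasswords:
- email: user@example.com
  hash: $2y$12$cDyyBfNqBDpQ9kkRoJYSI.xWggu2r9iHj1234GuTFddJkaWZu3a33
  username: user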
Install kubeflow in one shot
while ! kustomize build example | awk '!/well-defined/' | kubectl apply -f -; do echo "Retrying to apply resources"; sleep 10; done
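Once the loop exits cleanly, watch the pods come up; anything stuck in ImagePullBackOff or CrashLoopBackOff points at one of the issues below:
kubectl get pods -A | grep -v Running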
After installation
Issue 1
Pod error:
error: resource mapping not found for name: "webhook" namespace: "knative-serving" from "STDIN": no matches for kind "HorizontalPodAutoscaler" in version "autoscaling/v2beta2"
Cause:
https://kubernetes.io/docs/reference/using-api/deprecation-guide/#horizontalpodautoscaler-v125
https://www.kubeflow.org/docs/releases/kubeflow-1.7/#dependency-versions-manifests
Fix:
In common/knative/knative-serving/base/upstream/serving-core.yaml, change the HorizontalPodAutoscaler apiVersion from autoscaling/v2beta2 to autoscaling/v2.
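Assuming autoscaling/v2beta2 only appears in the HPA apiVersion fields of that file, a one-liner does it:
sed -i 's#autoscaling/v2beta2#autoscaling/v2#g' common/knative/knative-serving/base/upstream/serving-core.yaml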
Issue 2
Some image pull policies need changing to IfNotPresent, in the following files (see the sketch after the list):
common/oidc-authservice/base/statefulset.yaml
apps/pipeline/upstream/base/pipeline/ml-pipeline-viewer-crd-deployment.yaml
apps/pipeline/upstream/base/cache/cache-deployment.yaml
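In each file the change is one line on the container spec, along these lines:
imagePullPolicy: IfNotPresent   # was Always; use the locally retagged image instead of re-pulling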
Issue 3
Some components run a single replica; to avoid another image pull when the pod lands on a different worker next time, pull these images on all 3 workers.
sudo docker pull kubeflowkatib/katib-db-manager:v0.15.0
sudo docker pull mysql:8.0.29
sudo docker pull python:3.7
Issue 4
The mysql pod errors with: chown: changing ownership of '/var/lib/mysql/': Operation not permitted
Fix:
On the nfs-server, change all_squash to no_root_squash:
sudo vi /etc/exports
sudo exportfs -a
Using no_root_squash when first installing the nfs-server avoids this problem altogether.
Install keepalived
Pick two machines for keepalived; here 172.27.244.152-153.
sudo apt -y install nginx keepalived
nginx configuration:
sudo rm /etc/nginx/sites-enabled/default
sudo cp /etc/nginx/nginx.conf{,.bak}
sudo vi /etc/nginx/nginx.conf
Reference configuration:
stream {
    upstream k8s-80 {
        least_conn;
        server 172.27.244.154:80 max_fails=3 fail_timeout=5s;
        server 172.27.244.155:80 max_fails=3 fail_timeout=5s;
        server 172.27.244.156:80 max_fails=3 fail_timeout=5s;
    }
    upstream k8s-443 {
        least_conn;
        server 172.27.244.154:443 max_fails=3 fail_timeout=5s;
        server 172.27.244.155:443 max_fails=3 fail_timeout=5s;
        server 172.27.244.156:443 max_fails=3 fail_timeout=5s;
    }
    server {
        listen 80;
        proxy_pass k8s-80;
    }
    server {
        listen 443;
        proxy_pass k8s-443;
    }
}
sudo nginx -t
sudo nginx -s reload
keepalived configuration:
https://manpages.debian.org/unstable/keepalived/keepalived.conf.5.en.html
https://manpages.ubuntu.com/manpages/focal/man5/keepalived.conf.5.html
https://keepalived.readthedocs.io/en/latest/configuration_synopsis.html#global-definitions-synopsis
https://github.com/acassen/keepalived/tree/master/doc/samples
/usr/share/doc/keepalived/samples/keepalived.conf.vrrp.localcheck
/usr/share/doc/keepalived/samples/keepalived.conf.vrrp
sudo vi /etc/keepalived/keepalived.conf
Reference configuration; on the second machine, just swap the two IPs.
global_defs {
    script_user root
    enable_script_security
}
vrrp_script chk_nginx {
    script "killall -0 nginx"
    interval 2
    weight 2
}
vrrp_instance VI_1 {
    state MASTER
    interface ens18
    garp_master_delay 10 # default
    #smtp_alert # whether to send email alerts
    virtual_router_id 55
    priority 100 # instance priority
    advert_int 1
    authentication {
        auth_type PASS
        auth_pass 00002023
    }
    unicast_src_ip 172.27.244.152
    unicast_peer { # unicast
        172.27.244.153
    }
    virtual_ipaddress {
        172.27.244.160
    }
    track_script {
        chk_nginx
    }
}
sudo systemctl restart keepalived
DNS records:
172.27.244.160 rancher.ai.example.com
172.27.244.160 kubeflow.ai.example.com
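Without internal DNS, an /etc/hosts entry on each client machine works just as well, for example:
echo '172.27.244.160 rancher.ai.example.com kubeflow.ai.example.com' | sudo tee -a /etc/hosts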
Install nfs server
Run on 172.27.244.150.
Install.
sudo apt -y install nfs-kernel-server
sudo mkdir /data_nfs
sudo chown 1000:1000 /data_nfs
echo '/data_nfs 172.27.244.0/24(rw,sync,insecure,no_subtree_check,no_root_squash,no_all_squash,anonuid=1000,anongid=1000)' | sudo tee -a /etc/exports
sudo exportfs -a
Check nfs status.
systemctl status nfs-server
sudo rpcinfo -p | grep nfs
cat /var/lib/nfs/etab
sudo exportfs
Install harbor
Run on 172.27.244.150.
Install docker
Switch the version to 5:23.0.6-1~ubuntu.20.04~focal (or omit the version pin to install the latest); the other steps are the same as before.
# list the installable versions
apt-cache madison docker-ce
VERSION_STRING=5:23.0.6-1~ubuntu.20.04~focal
sudo apt-get install docker-ce=$VERSION_STRING docker-ce-cli=$VERSION_STRING containerd.io docker-buildx-plugin docker-compose-plugin
Install docker-compose
Check the current version yourself (v2.20.1 here); the latest release also works.
sudo curl -L https://github.com/docker/compose/releases/download/v2.20.1/docker-compose-linux-x86_64 -o /usr/local/bin/docker-compose
sudo chmod +x /usr/local/bin/docker-compose
Self-signed certificates
openssl genrsa -out ca.key 4096
openssl req -x509 -new -nodes -sha512 -days 3650 -subj "/O=Sunwoda-evb/CN=Sunwoda-evb Certs H1" -key ca.key -out ca.crt
openssl genrsa -out ai.example.com.key 4096
openssl req -sha512 -new -subj "/C=CN/ST=Guangdong/L=Shenzhen/O=sunwoda-evb/OU=IT/CN=ai.example.com" -key ai.example.com.key -out ai.example.com.csr
tee v3.ext <<- EOF
authorityKeyIdentifier=keyid,issuer
basicConstraints=CA:FALSE
keyUsage = digitalSignature, nonRepudiation, keyEncipherment, dataEncipherment
extendedKeyUsage = serverAuth
subjectAltName = @alt_names
[alt_names]
DNS.1=*.ai.example.com
EOF
openssl x509 -req -sha512 -days 3650 -extfile v3.ext -CA ca.crt -CAkey ca.key -CAcreateserial -in ai.example.com.csr -out ai.example.com.crt
Install harbor itself
wget https://github.com/goharbor/harbor/releases/download/v2.8.2/harbor-offline-installer-v2.8.2.tgz
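Unpack the installer and start from the bundled configuration template (the offline installer ships harbor.yml.tmpl):
tar zxf harbor-offline-installer-v2.8.2.tgz && cd harbor
cp harbor.yml.tmpl harbor.yml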
sudo mkdir -p /data/harbor/cert
sudo cp ai.example.com.key ai.example.com.crt /data/harbor/cert
openssl x509 -inform PEM -in ai.example.com.crt -out ai.example.com.cert
sudo mkdir -p /etc/docker/certs.d/harbor.ai.example.com
sudo cp {ca.crt,ai.example.com.key,ai.example.com.cert} /etc/docker/certs.d/harbor.ai.example.com
sudo systemctl restart docker
Modify harbor.yml; example configuration:
hostname: harbor.ai.example.com
http:
  port: 8083
https:
  port: 8443
  certificate: /data/harbor/cert/ai.example.com.crt
  private_key: /data/harbor/cert/ai.example.com.key
external_url: https://harbor.ai.example.com
harbor_admin_password: 123456
database:
  password: root123
  max_idle_conns: 100
  max_open_conns: 900
  conn_max_lifetime: 5m
  conn_max_idle_time: 0
data_volume: /data/harbor
Run the installation.
sudo ./install.sh --with-trivy
sudo docker-compose up -d
After changing harbor.yml, reconfigure and restart.
sudo ./prepare --with-trivy
sudo docker-compose down -v
sudo docker-compose up -d
Install nginx
Configuration example (installed with sudo apt -y install nginx, as before):
server {
    listen 443 ssl http2;
    server_name harbor.ai.example.com;
    # SSL
    ssl_certificate /etc/nginx/ssl/ai.example.com.crt;
    ssl_certificate_key /etc/nginx/ssl/ai.example.com.key;
    location / {
        proxy_pass https://127.0.0.1:8443;
        proxy_http_version 1.1;
        # Proxy headers
        proxy_set_header Host $host;
        proxy_set_header X-Real-IP $remote_addr;
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
        proxy_set_header X-Forwarded-Proto $scheme;
        proxy_set_header X-Forwarded-Host $host;
        proxy_set_header X-Forwarded-Port $server_port;
        # Proxy timeouts
        proxy_connect_timeout 3600s;
        proxy_send_timeout 3600s;
        proxy_read_timeout 3600s;
    }
}
server {
    listen 80;
    listen [::]:80;
    server_name harbor.ai.example.com;
    location / {
        return 301 https://harbor.ai.example.com$request_uri;
    }
}
Trust the self-signed CA certificate
Run on the worker nodes; every node that pulls images from harbor.ai.example.com needs to trust the self-signed CA.
Publishing the CA certificate over nginx makes it easy to download; copying files between terminals is tedious.
sudo mkdir -p /etc/docker/certs.d/harbor.ai.example.com
sudo wget http://downloads.ai.example.com/download/ssl/ca.H1.crt -O /etc/docker/certs.d/harbor.ai.example.com/ca.H1.crt
# or, if the domain is not resolvable
sudo wget http://172.27.244.150/download/ssl/ca.H1.crt -O /etc/docker/certs.d/harbor.ai.example.com/ca.H1.crt
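A quick login test confirms docker now trusts the Harbor certificate (credentials are the admin account from harbor.yml):
sudo docker login harbor.ai.example.com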
nginx configuration example:
server {
    listen 80;
    listen 443 ssl http2;
    server_name downloads.ai.example.com;
    root /srv/www/downloads.ai.example.com;
    # SSL
    ssl_certificate /etc/nginx/ssl/ai.example.com.crt;
    ssl_certificate_key /etc/nginx/ssl/ai.example.com.key;
    location /download/ {
        autoindex on;
        autoindex_exact_size off;
        autoindex_localtime on;
        location ~* \.(yml|yaml|conf|cnf)$ {
            add_header Content-Type 'text/plain; charset=utf-8';
        }
    }
    location / {
        add_header Content-Type 'text/plain; charset=utf-8';
    }
}
DNS records:
172.27.244.150 harbor.ai.example.com
172.27.244.150 downloads.ai.example.com