
Running into the infamous "container runtime is down PLEG is not healthy"

September 25, 2019 · 4 min read

After an unexpected power outage, a small kubernetes cluster in our development environment ran into the "PLEG is not healthy" problem. The symptoms: pods in k8s went to Unknown or ContainerCreating, and the k8s nodes went to NotReady:

# kubectl get nodes -o wide
NAME             STATUS     ROLES     AGE       VERSION   EXTERNAL-IP   OS-IMAGE                KERNEL-VERSION               CONTAINER-RUNTIME
k8s-dev-master   Ready      master    1y        v1.10.0   <none>        CentOS Linux 7 (Core)   3.10.0-957.21.3.el7.x86_64   docker://17.3.0
k8s-dev-node1    NotReady   node      1y        v1.10.0   <none>        CentOS Linux 7 (Core)   3.10.0-957.21.3.el7.x86_64   docker://Unknown
k8s-dev-node2    NotReady   node      1y        v1.10.0   <none>        CentOS Linux 7 (Core)   3.10.0-957.21.3.el7.x86_64   docker://Unknown
k8s-dev-node3    NotReady   node      289d      v1.10.0   <none>        CentOS Linux 7 (Core)   3.10.0-957.21.3.el7.x86_64   docker://Unknown
k8s-dev-node4    Ready      node      289d      v1.10.0   <none>        CentOS Linux 7 (Core)   3.10.0-957.21.3.el7.x86_64   docker://17.3.0
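
When several nodes flap like this, it can help to pull out just the affected node names. The following is a minimal sketch, not part of the original troubleshooting; the helper name `not_ready_nodes` is made up for illustration, and it assumes the STATUS value is the second column, as in the output above.

```shell
# Print the names of nodes whose STATUS column is not exactly "Ready".
# Expects `kubectl get nodes --no-headers` style lines on stdin.
# Note: a cordoned node ("Ready,SchedulingDisabled") is also printed.
not_ready_nodes() {
  awk '$2 != "Ready" { print $1 }'
}

# Usage on a machine with cluster access:
#   kubectl get nodes --no-headers | not_ready_nodes
```

For the listing above this would print k8s-dev-node1, k8s-dev-node2 and k8s-dev-node3.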

The kubelet log reports: skipping pod synchronization, container runtime is down PLEG is not healthy:

9月 25 11:05:06 k8s-dev-node1 kubelet[546]: I0925 11:05:06.003645     546 kubelet.go:1794] skipping pod synchronization - [container runtime is down PLEG is not healthy: pleg was last seen active 21m18.877402888s ago; threshold is 3m0s]
9月 25 11:05:11 k8s-dev-node1 kubelet[546]: I0925 11:05:11.004116     546 kubelet.go:1794] skipping pod synchronization - [container runtime is down PLEG is not healthy: pleg was last seen active 21m23.877803484s ago; threshold is 3m0s]
9月 25 11:05:16 k8s-dev-node1 kubelet[546]: I0925 11:05:16.004382     546 kubelet.go:1794] skipping pod synchronization - [container runtime is down PLEG is not healthy: pleg was last seen active 21m28.878169681s ago; threshold is 3m0s]
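
To see how long PLEG has been inactive without reading every line, the two durations can be extracted from messages of this shape. This is only a sketch: the helper name `pleg_last_seen` is hypothetical, the pattern is tied to the exact wording shown above, and the usage line assumes kubelet runs as a systemd unit named `kubelet` and logs to the journal.

```shell
# Read kubelet log lines on stdin and print "<last seen> <threshold>"
# for every "PLEG is not healthy" message; other lines are ignored.
pleg_last_seen() {
  sed -n 's/.*pleg was last seen active \([^ ]*\) ago; threshold is \([^]]*\)].*/\1 \2/p'
}

# Usage on a node (assumes a systemd unit called "kubelet"):
#   journalctl -u kubelet --no-pager | pleg_last_seen | tail -n 1
```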

Restarting docker and kubelet on the node brought it back, but before long it failed again and returned to NotReady. A quick Google search turned up related issues on stackoverflow and github/kubernetes.

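However, #45419 was only fixed in v1.16, and upgrading from 1.10 to 1.16 is too much work. A comment in #61117 said that clearing the /var/lib/kubelet/pods directory on the node can solve the problem. On the first try the directory could not be deleted because mounted volumes were still holding it, so the problem remained. In the end I simply upgraded docker from 17.3.0 to 19.3.2 and cleared all data under /var/lib/kubelet/pods/ and /var/lib/docker on every node, after which the problem was solved.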
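
The failed first attempt came down to volumes still mounted below /var/lib/kubelet/pods. One way to find them, sketched here under assumptions rather than taken from the original procedure, is to filter /proc/mounts for mount points under the directory and list the deepest ones first so they can be unmounted before their parents. The helper name `mounts_under` is made up, and paths containing spaces (escaped as \040 in /proc/mounts) are not handled.

```shell
# Print mount points located below a directory, deepest first.
# Expects /proc/mounts style lines on stdin (mount point in column 2).
mounts_under() {
  dir="${1%/}"
  awk -v d="$dir/" 'index($2, d) == 1 { print $2 }' | LC_ALL=C sort -r
}

# Usage on a node, with docker and kubelet stopped (destructive, so it
# is only shown as a comment here):
#   mounts_under /var/lib/kubelet/pods < /proc/mounts | xargs -r -n1 umount
#   rm -rf /var/lib/kubelet/pods/
```

Rebooting with both services disabled, as in the procedure described in this post, gets to the same state by never creating those mounts in the first place.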

The rough procedure:

# First disable autostart for docker and kubelet, then clear the files after the reboot:
systemctl disable docker && systemctl disable kubelet
reboot
rm -rf /var/lib/kubelet/pods/
rm -rf /var/lib/docker

# Along the way, docker-ce was upgraded from 17.3.0 to 19.3.2

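# After upgrading docker, edit docker.service to keep the storage-driver at overlay, the default in 17.3.0.
# overlay2, devicemapper and vfs were also tried, but kubelet reported errors with each of them;
# it is unclear whether kubernetes v1.10 lacks support for them or the old data was not fully cleared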
vi /etc/systemd/system/docker.service

ExecStart=/usr/bin/dockerd ... --storage-driver=overlay

# Reload the configuration, then start docker
systemctl daemon-reload
systemctl start docker && systemctl enable docker
systemctl status docker

# Since /var/lib/docker was deleted entirely, a node that cannot reach the k8s image registry directly needs its base images imported manually:
docker load -i kubernetes-v10.0-node.tar

# Start kubelet
systemctl start kubelet && systemctl enable kubelet
systemctl status kubelet
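
Since the storage driver caused trouble during the upgrade, it may be worth confirming which driver the running daemon actually picked up before looking at kubelet. The sketch below is an illustration, not a step from the original procedure; `verify_storage_driver` is a hypothetical helper that compares the value of `docker info --format '{{.Driver}}'` with the expected name.

```shell
# Compare the storage driver reported by the running docker daemon
# with the expected one. Returns 0 on match, 1 on mismatch, and 2 if
# docker info itself fails.
verify_storage_driver() {
  expected="$1"
  actual="$(docker info --format '{{.Driver}}')" || return 2
  if [ "$actual" = "$expected" ]; then
    echo "storage driver OK: $actual"
  else
    echo "storage driver mismatch: expected $expected, got $actual" >&2
    return 1
  fi
}

# Usage on a node after `systemctl start docker`:
#   verify_storage_driver overlay
```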

Problem solved:

# kubectl get nodes -o wide
NAME             STATUS    ROLES     AGE       VERSION   EXTERNAL-IP   OS-IMAGE                KERNEL-VERSION               CONTAINER-RUNTIME
k8s-dev-master   Ready     master    1y        v1.10.0   <none>        CentOS Linux 7 (Core)   3.10.0-957.21.3.el7.x86_64   docker://17.3.0
k8s-dev-node1    Ready     node      1y        v1.10.0   <none>        CentOS Linux 7 (Core)   3.10.0-957.21.3.el7.x86_64   docker://19.3.2
k8s-dev-node2    Ready     node      1y        v1.10.0   <none>        CentOS Linux 7 (Core)   3.10.0-957.21.3.el7.x86_64   docker://19.3.2
k8s-dev-node3    Ready     node      289d      v1.10.0   <none>        CentOS Linux 7 (Core)   3.10.0-957.21.3.el7.x86_64   docker://19.3.2
k8s-dev-node4    Ready     node      289d      v1.10.0   <none>        CentOS Linux 7 (Core)   3.10.0-957.21.3.el7.x86_64   docker://19.3.2

Unfortunately, this power outage cost us 3 months of configuration data on the kong gateway :( Back up! Back up! Back up!

Getting real client IP in Docker Swarm

May 22, 2019 · 2 min read

When a Service is deployed to Docker Swarm with Stack Deploy, the Service cannot see the client's IP address by default. There is a GitHub issue tracking this: Unable to retrieve user's IP address in docker swarm mode

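The current fix, or rather workaround, is to publish the ports in host mode, which bypasses the ingress routing mesh where the client address gets rewritten. Taking kong as an example.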

The default port publishing mode:

version: "3.7"
services:
  kong-proxy:
    image: kong:1.0.3-alpine
    deploy:
      mode: global
      labels:
        - "tier=frontend"
      restart_policy:
        condition: any
    ports:
      - "80:8000"
      - "443:8443"
    depends_on:
      - database-postgresql
    environment:
      KONG_ADMIN_LISTEN: 0.0.0.0:8001, 0.0.0.0:8444 ssl
      KONG_DATABASE: postgres
      KONG_PG_DATABASE: kong
      KONG_PG_USER: kong
      KONG_PG_PASSWORD: PaSsW0rd
      KONG_PG_HOST: database-postgresql
      KONG_PG_PORT: "5432"

    volumes:
      - type: "bind"
        source: "/var/log/kong/"
        target: "/usr/local/kong/logs/"
#        read_only: true
    networks:
      - backend
      - frontend
networks:
  frontend:
  backend:

With the ports changed to host mode:

version: "3.7"
services:
  kong-proxy:
    image: kong:1.0.3-alpine
    deploy:
      mode: global
      labels:
        - "tier=frontend"
      restart_policy:
        condition: any
    ports:
      - target: 8000
        published: 80
        mode: host
      - target: 8443
        published: 443
        mode: host
    depends_on:
      - database-postgresql
    environment:
      KONG_ADMIN_LISTEN: 0.0.0.0:8001, 0.0.0.0:8444 ssl
      KONG_DATABASE: postgres
      KONG_PG_DATABASE: kong
      KONG_PG_USER: kong
      KONG_PG_PASSWORD: PaSsW0rd
      KONG_PG_HOST: database-postgresql
      KONG_PG_PORT: "5432"

    volumes:
      - type: "bind"
        source: "/var/log/kong/"
        target: "/usr/local/kong/logs/"
#        read_only: true
    networks:
      - backend
      - frontend
networks:
  frontend:
  backend:
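
After redeploying, the publish mode that Swarm actually applied can be read back from the service definition. This is a hedged sketch rather than part of the original post: `publish_modes` is a made-up helper that pulls the `PublishMode` values out of the JSON printed by `docker service inspect`, and `mystack_kong-proxy` is a placeholder, since the real service name depends on the stack name used at deploy time.

```shell
# Read the JSON port list of a service on stdin and print the
# PublishMode of each published port, one per line.
publish_modes() {
  grep -o '"PublishMode":"[a-z]*"' | cut -d'"' -f4
}

# Usage on a manager node ("mystack_kong-proxy" is a placeholder name):
#   docker service inspect --format '{{json .Spec.EndpointSpec.Ports}}' mystack_kong-proxy | publish_modes
```

Each line should read `host` for the configuration above; `ingress` would mean the port is still published through the routing mesh.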