Docker Swarm Mode: Not all VIPs for a service work. Getting timeouts for several VIPs
I'm having issues with an overlay network using Docker swarm mode (important: swarm mode, not Docker Swarm). I have an overlay network named "internal" and a service named "datacollector" that is scaled to 12 instances. If I `docker exec` into another service running in the same swarm (and on the same overlay network) and run `curl http://datacollector` 12 times, 4 of the requests result in a timeout. If I run `dig tasks.datacollector`, I get a list of 12 IP addresses. Sure enough, 8 of those IP addresses work and 4 time out every time.
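For reference, the repeated curl test above can be scripted. This is only a sketch of the manual check (run it from a shell inside any container attached to the "internal" network; the 2-second timeout is an arbitrary choice, not something from the original test):

```shell
# Hit the service VIP 12 times and tally successes vs. timeouts.
ok=0
timeouts=0
for i in $(seq 1 12); do
  if curl -s -o /dev/null --max-time 2 http://datacollector; then
    ok=$((ok + 1))
  else
    timeouts=$((timeouts + 1))
  fi
done
echo "ok=$ok timeouts=$timeouts"
```

With a healthy service this should report timeouts=0; during the incident described here it consistently reported 4 (and later 9) timeouts.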
I tried scaling the service down to 1 instance and back up to 12, and got the same result.
I used `docker service ps datacollector` to find each running instance of the service, and used `docker kill xxxx` on each node to manually kill the instances and let the swarm recreate them. I checked `dig` again and verified that the list of task IP addresses was no longer the same. After that I ran `curl http://datacollector` 12 more times. Now only 3 of the requests work and the remaining 9 time out!
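The manual kill step can also be scripted. A sketch, to be run on each node in turn; the `--filter name=` pattern is an assumption based on swarm's `<service>.<slot>.<task-id>` container naming, not a command from the question:

```shell
# Kill every running datacollector task container on this node so the
# swarm recreates it.
killed=0
for c in $(docker ps --filter name=datacollector --format '{{.ID}}'); do
  docker kill "$c" && killed=$((killed + 1))
done
echo "killed $killed container(s) on $(hostname)"
```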
This is the second time this has happened in the last 2 weeks or so. The previous time, I had to remove all the services, remove the overlay network, recreate the overlay network, and re-create all of the services in order to resolve the issue. Obviously, that isn't a workable long-term solution :(
Output of `docker service inspect datacollector`:
[
    {
        "ID": "2uevc4ouakk6k3dirhgqxexz9",
        "Version": {
            "Index": 72152
        },
        "CreatedAt": "2016-11-12T20:38:51.137043037Z",
        "UpdatedAt": "2016-11-17T15:22:34.402801678Z",
        "Spec": {
            "Name": "datacollector",
            "TaskTemplate": {
                "ContainerSpec": {
                    "Image": "507452836298.dkr.ecr.us-east-1.amazonaws.com/swarm/api:61d7931f583742cca91b368bc6d9e15314545093",
                    "Args": [
                        "node",
                        ".",
                        "api/datacollector"
                    ],
                    "Env": [
                        "ENVIRONMENT=stage",
                        "MONGODB_URI=mongodb://mongodb:27017/liveearth",
                        "RABBITMQ_URL=amqp://rabbitmq",
                        "ELASTICSEARCH_URL=http://elasticsearch"
                    ]
                },
                "Resources": {
                    "Limits": {},
                    "Reservations": {}
                },
                "RestartPolicy": {
                    "Condition": "any",
                    "MaxAttempts": 0
                },
                "Placement": {
                    "Constraints": [
                        "node.labels.role.api==true",
                        "node.labels.role.api==true",
                        "node.labels.role.api==true",
                        "node.labels.role.api==true",
                        "node.labels.role.api==true"
                    ]
                }
            },
            "Mode": {
                "Replicated": {
                    "Replicas": 12
                }
            },
            "UpdateConfig": {
                "Parallelism": 1,
                "FailureAction": "pause"
            },
            "Networks": [
                {
                    "Target": "88e9fd9715o5v1hqu6dnkg3vp"
                }
            ],
            "EndpointSpec": {
                "Mode": "vip"
            }
        },
        "Endpoint": {
            "Spec": {
                "Mode": "vip"
            },
            "VirtualIPs": [
                {
                    "NetworkID": "88e9fd9715o5v1hqu6dnkg3vp",
                    "Addr": "192.168.1.23/24"
                }
            ]
        },
        "UpdateStatus": {
            "State": "completed",
            "StartedAt": "2016-11-17T15:19:34.471292948Z",
            "CompletedAt": "2016-11-17T15:22:34.402794312Z",
            "Message": "update completed"
        }
    }
]
Output of `docker network inspect internal`:
[ { "name": "internal", "id": "88e9fd9715o5v1hqu6dnkg3vp", "scope": "swarm", "driver": "overlay", "enableipv6": false, "ipam": { "driver": "default", "options": null, "config": [ { "subnet": "192.168.1.0/24", "gateway": "192.168.1.1" } ] }, "internal": false, "containers": { "03ac1e71139ff2140f93c80d9e6b1d69abf442a0c2362610bee3e116e84ef434": { "name": "datacollector.5.cxmvk7p1hwznautresir94m3s", "endpointid": "22445be80ba55b67d7cfcfbc75f2c15586bace5f317be8ba9b59c5f9f338525c", "macaddress": "02:42:c0:a8:01:72", "ipv4address": "192.168.1.114/24", "ipv6address": "" }, "08ae84c7cb6e57583baf12c2a9082c1d17f1e65261cfa93346aaa9bda1244875": { "name": "auth.10.aasw00k7teq4knxibctlrrj7e", "endpointid": "c3506c851f4c9f0d06d684a9f023e7ba529d0149d70fa7834180a87ad733c678", "macaddress": "02:42:c0:a8:01:44", "ipv4address": "192.168.1.68/24", "ipv6address": "" }, "192203a127d6831c3f4a41eabdd8df5282e33c3e92b99c3baaf1f213042f5418": { "name": "parkingcollector.1.8yrm6d831wrfsrkzhal7cf2pm", "endpointid": "34de6e9621ef54f7d963db942a7a7b6e0013ac6db6c9f17b384de689b1f1b187", "macaddress": "02:42:c0:a8:01:9a", "ipv4address": "192.168.1.154/24", "ipv6address": "" }, "24258109e16c1a5b15dcc84a41d99a4a6617bcadecc9b35279c721c0d2855141": { "name": "stream.8.38npsusmpa1pf8fbnmaux57rx", "endpointid": "b675991ffbd5c0d051a4b68790a33307b03b48582fd1b37ba531cf5e964af0ce", "macaddress": "02:42:c0:a8:01:74", "ipv4address": "192.168.1.116/24", "ipv6address": "" }, "33063b988473b73be2cbc51e912e165112de3d01bc00ee2107aa635e30a36335": { "name": "billing.2.ca41k2h44zkn9wfbsif0lfupf", "endpointid": "77c576929d5e82f1075b4cc6fcb4128ce959281d4b9c1c22d9dcd1e42eed8b5e", "macaddress": "02:42:c0:a8:01:87", "ipv4address": "192.168.1.135/24", "ipv6address": "" }, "8b0929e66e6c284206ea713f7c92f1207244667d3ff02815d4bab617c349b220": { "name": "shotspottercollector.2.328408tiyy8aryr0g1ipmm5xm", "endpointid": "f2a0558ec67745f5d1601375c2090f5cd141303bf0d54bec717e3463f26ed74d", "macaddress": "02:42:c0:a8:01:90", "ipv4address": 
"192.168.1.144/24", "ipv6address": "" }, "938fe5f6f9bb893862e8c06becd76c1a7fe5f2d3b791fc55d7d8164e67ee3553": { "name": "inrixproxy.2.ed77crvat0waw41phjknhhm6v", "endpointid": "88f550fecd60f0bdb0dfc9d5bf0c74716a91d009bcc27dc4392b113ab1215038", "macaddress": "02:42:c0:a8:01:96", "ipv4address": "192.168.1.150/24", "ipv6address": "" }, "970f9d4c6ae6cc4de54a1d501408720b7d95114c28a6615d8e4e650b7e69bc40": { "name": "rabbitmq.1.e7j721g6hfhs8r7p3phih4g9v", "endpointid": "c04a4a5650ee6e10b87884004aa2cb1ec6b1c7036af15c31579462b6621436a2", "macaddress": "02:42:c0:a8:01:1e", "ipv4address": "192.168.1.30/24", "ipv6address": "" }, "b1f676e6d38eec026583943dc0abff1163d21e6be9c5901539c46288f8941638": { "name": "logspout.0.51j8juw8aj0rjjccp2am0rib5", "endpointid": "98a93153abd6897c58276340df2eeec5c0ceb77fbe17d1ce8c465febb06776c7", "macaddress": "02:42:c0:a8:01:10", "ipv4address": "192.168.1.16/24", "ipv6address": "" }, "bab4d80be830fa3b3fefe501c66e3640907a2cbb2addc925a0eb6967a771a172": { "name": "auth.2.8fduvrn5ayk024b0lkhyz50of", "endpointid": "7e81d41fa04ec14263a2423d8ef003d6d431a8c3ff319963197f8a8d73b4e361", "macaddress": "02:42:c0:a8:01:3a", "ipv4address": "192.168.1.58/24", "ipv6address": "" }, "bc3c75a7c2d8c078eb7cc1555833ff0d374d82045dd9fb24ccfc37868615bb5e": { "name": "reverseproxy.6.2g20zphn5j1r2feylzcplyorg", "endpointid": "6c2138966ebcd144b47229a94ee603d264f3954a96ccd024d9e96501b7ffd5c0", "macaddress": "02:42:c0:a8:01:6c", "ipv4address": "192.168.1.108/24", "ipv6address": "" }, "cd59d61b16ac0325336121a8558e8215e42aa5300f75054df17a70bf1f3e6c0c": { "name": "usgscollector.1.0h0afyw8va8maoa4tjd5qz588", "endpointid": "952073efc6a567ebd3f80d26811222c675183e8c76005fbf12388725a97b1bee", "macaddress": "02:42:c0:a8:01:48", "ipv4address": "192.168.1.72/24", "ipv6address": "" }, "d40476e56b91762b0609acd637a4f70e42c88d266f8ebb7d9511050a8fc1df17": { "name": "kibana.1.6hxu5b97hfykuqr5yb9i9sn5r", "endpointid": "08c5188076f9b8038d864d570e7084433a8d97d4c8809d27debf71cb5d652cd7", 
"macaddress": "02:42:c0:a8:01:06", "ipv4address": "192.168.1.6/24", "ipv6address": "" }, "e29369ad8ee5b12fb0c6f9bcb899514ab092f7da291a7c05eea758b0c19bfb65": { "name": "weatherbugcollector.1.crpub0hf85cewxm0qt6annsra", "endpointid": "afa1ddbad8ab8fdab69505ddb5342ac89c0d17bc75a11e9ac0ac8829e5885997", "macaddress": "02:42:c0:a8:01:2e", "ipv4address": "192.168.1.46/24", "ipv6address": "" }, "f1bf0a656ecb9d7ef9b837efa94a050d9c98586f7312435e48b9a129c5e92e46": { "name": "socratacollector.1.627icslq6kdb4syaha6tzkb19", "endpointid": "14bea0d9ec3f94b04b32f36b7172c60316ee703651d0d920126a49dd0fa99cf5", "macaddress": "02:42:c0:a8:01:1b", "ipv4address": "192.168.1.27/24", "ipv6address": "" } }, "options": { "com.docker.network.driver.overlay.vxlanid_list": "257" }, "labels": {} } ]
Output of `dig datacollector`:
; <<>> DiG 9.9.5-9+deb8u8-Debian <<>> datacollector
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 38227
;; flags: qr rd ra; QUERY: 1, ANSWER: 1, AUTHORITY: 0, ADDITIONAL: 0

;; QUESTION SECTION:
;datacollector.                 IN      A

;; ANSWER SECTION:
datacollector.          600     IN      A       192.168.1.23

;; Query time: 0 msec
;; SERVER: 127.0.0.11#53(127.0.0.11)
;; WHEN: Thu Nov 17 16:11:57 UTC 2016
;; MSG SIZE  rcvd: 60
Output of `dig tasks.datacollector`:
; <<>> DiG 9.9.5-9+deb8u8-Debian <<>> tasks.datacollector
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 9810
;; flags: qr rd ra; QUERY: 1, ANSWER: 12, AUTHORITY: 0, ADDITIONAL: 0

;; QUESTION SECTION:
;tasks.datacollector.           IN      A

;; ANSWER SECTION:
tasks.datacollector.    600     IN      A       192.168.1.115
tasks.datacollector.    600     IN      A       192.168.1.66
tasks.datacollector.    600     IN      A       192.168.1.22
tasks.datacollector.    600     IN      A       192.168.1.114
tasks.datacollector.    600     IN      A       192.168.1.37
tasks.datacollector.    600     IN      A       192.168.1.139
tasks.datacollector.    600     IN      A       192.168.1.148
tasks.datacollector.    600     IN      A       192.168.1.110
tasks.datacollector.    600     IN      A       192.168.1.112
tasks.datacollector.    600     IN      A       192.168.1.100
tasks.datacollector.    600     IN      A       192.168.1.39
tasks.datacollector.    600     IN      A       192.168.1.106

;; Query time: 0 msec
;; SERVER: 127.0.0.11#53(127.0.0.11)
;; WHEN: Thu Nov 17 16:08:54 UTC 2016
;; MSG SIZE  rcvd: 457
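To separate dead containers from broken VIP plumbing, each task IP from the dig output can be probed directly, bypassing the service VIP entirely. A sketch (run inside a container on the overlay network; the curl probe and 2-second timeout are illustrative):

```shell
# Probe each backend task by IP, skipping the service VIP.
probe() {
  for ip in "$@"; do
    if curl -s -o /dev/null --max-time 2 "http://$ip"; then
      echo "$ip ok"
    else
      echo "$ip unreachable"
    fi
  done
}

# dig +short prints just the A records, one per line.
probe $(dig +short tasks.datacollector)
```

In this case the direct probes showed the same split (8 reachable, 4 timing out every time), which points at the tasks' overlay attachments rather than only the VIP table on the client side.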
Output of `docker version`:
Client:
 Version:      1.12.3
 API version:  1.24
 Go version:   go1.6.3
 Git commit:   6b644ec
 Built:        Wed Oct 26 23:26:11 2016
 OS/Arch:      darwin/amd64

Server:
 Version:      1.12.3
 API version:  1.24
 Go version:   go1.6.3
 Git commit:   6b644ec
 Built:        Wed Oct 26 21:44:32 2016
 OS/Arch:      linux/amd64
Output of `docker info`:
Containers: 58
 Running: 15
 Paused: 0
 Stopped: 43
Images: 123
Server Version: 1.12.3
Storage Driver: aufs
 Root Dir: /var/lib/docker/aufs
 Backing Filesystem: extfs
 Dirs: 430
 Dirperm1 Supported: false
Logging Driver: json-file
Cgroup Driver: cgroupfs
Plugins:
 Volume: local
 Network: host null overlay bridge
Swarm: active
 NodeID: 8uxexr2uz3qpn5x1km9k4le9s
 Is Manager: true
 ClusterID: 2kd4md2qyu67szx4y6q2npnet
 Managers: 3
 Nodes: 8
 Orchestration:
  Task History Retention Limit: 5
 Raft:
  Snapshot Interval: 10000
  Heartbeat Tick: 1
  Election Tick: 3
 Dispatcher:
  Heartbeat Period: 5 seconds
 CA Configuration:
  Expiry Duration: 3 months
 Node Address: 10.10.44.201
Runtimes: runc
Default Runtime: runc
Security Options: apparmor
Kernel Version: 3.13.0-91-generic
Operating System: Ubuntu 14.04.4 LTS
OSType: linux
Architecture: x86_64
CPUs: 2
Total Memory: 3.676 GiB
Name: stage-0
ID: 76Z2:GN43:RQND:BBAJ:AGUU:S3F7:JWBC:CCCK:I4VH:PKYC:UHQT:IR2U
Docker Root Dir: /var/lib/docker
Debug Mode (client): false
Debug Mode (server): false
Username: herbrandson
Registry: https://index.docker.io/v1/
WARNING: No swap limit support
Labels:
 provider=generic
Insecure Registries:
 127.0.0.0/8
Additional environment details: Docker swarm mode (not Swarm). The nodes are running on AWS. The swarm has 8 nodes (3 managers and 5 workers).
UPDATE: Per the comments, here's a snippet of the Docker daemon logs on the swarm master:
time="2016-11-17T15:19:45.890158968Z" level=error msg="container status unavailable" error="context canceled" module=taskmanager task.id=ch6w74b3cu78y8r2ugkmfmu8a
time="2016-11-17T15:19:48.929507277Z" level=error msg="container status unavailable" error="context canceled" module=taskmanager task.id=exb6dfc067nxudzr8uo1eyj4e
time="2016-11-17T15:19:50.104962867Z" level=error msg="container status unavailable" error="context canceled" module=taskmanager task.id=6mbbfkilj9gslfi33w7sursb9
time="2016-11-17T15:19:50.877223204Z" level=error msg="container status unavailable" error="context canceled" module=taskmanager task.id=drd8o0yn1cg5t3k76frxgukaq
time="2016-11-17T15:19:54.680427504Z" level=error msg="container status unavailable" error="context canceled" module=taskmanager task.id=9lwl5v0f2v6p52shg6gixs3j7
time="2016-11-17T15:19:54.949118806Z" level=error msg="container status unavailable" error="context canceled" module=taskmanager task.id=51q1eeilfspsm4cx79nfkl4r0
time="2016-11-17T15:19:56.485909146Z" level=error msg="container status unavailable" error="context canceled" module=taskmanager task.id=3vjzfjjdrjio2gx45q9c3j6qd
time="2016-11-17T15:19:56.934070026Z" level=error msg="error closing logger: invalid argument"
time="2016-11-17T15:20:00.000614497Z" level=error msg="error closing logger: invalid argument"
time="2016-11-17T15:20:00.163458802Z" level=error msg="container status unavailable" error="context canceled" module=taskmanager task.id=4xa2ub5npxyxpyx3vd5n1gsuy
time="2016-11-17T15:20:01.463407652Z" level=error msg="error closing logger: invalid argument"
time="2016-11-17T15:20:01.949087337Z" level=error msg="error closing logger: invalid argument"
time="2016-11-17T15:20:02.942094926Z" level=error msg="failed create real server 192.168.1.150 vip 192.168.1.32 fwmark 947 in sb 938fe5f6f9bb893862e8c06becd76c1a7fe5f2d3b791fc55d7d8164e67ee3553: no such process"
time="2016-11-17T15:20:03.319168359Z" level=error msg="failed delete new service vip 192.168.1.61 fwmark 2133: no such process"
time="2016-11-17T15:20:03.363775880Z" level=error msg="failed add firewall mark rule in sbox /var/run/docker/netns/5de57ee133a5: reexec failed: exit status 5"
time="2016-11-17T15:20:05.772683092Z" level=error msg="error closing logger: invalid argument"
time="2016-11-17T15:20:06.059212643Z" level=error msg="error closing logger: invalid argument"
time="2016-11-17T15:20:07.335686642Z" level=error msg="failed delete new service vip 192.168.1.67 fwmark 2134: no such process"
time="2016-11-17T15:20:07.385135664Z" level=error msg="failed add firewall mark rule in sbox /var/run/docker/netns/6699e7c03bbd: reexec failed: exit status 5"
time="2016-11-17T15:20:07.604064777Z" level=error msg="error closing logger: invalid argument"
time="2016-11-17T15:20:07.673852364Z" level=error msg="failed delete new service vip 192.168.1.75 fwmark 2097: no such process"
time="2016-11-17T15:20:07.766525370Z" level=error msg="failed add firewall mark rule in sbox /var/run/docker/netns/6699e7c03bbd: reexec failed: exit status 5"
time="2016-11-17T15:20:09.080101131Z" level=error msg="failed create real server 192.168.1.155 vip 192.168.1.35 fwmark 904 in sb 192203a127d6831c3f4a41eabdd8df5282e33c3e92b99c3baaf1f213042f5418: no such process"
time="2016-11-17T15:20:11.516338629Z" level=error msg="error closing logger: invalid argument"
time="2016-11-17T15:20:11.729274237Z" level=error msg="failed delete new service vip 192.168.1.83 fwmark 2124: no such process"
time="2016-11-17T15:20:11.887572806Z" level=error msg="failed add firewall mark rule in sbox /var/run/docker/netns/5b810132057e: reexec failed: exit status 5"
time="2016-11-17T15:20:12.281481060Z" level=error msg="failed delete new service vip 192.168.1.73 fwmark 2136: no such process"
time="2016-11-17T15:20:12.395326864Z" level=error msg="failed add firewall mark rule in sbox /var/run/docker/netns/5b810132057e: reexec failed: exit status 5"
time="2016-11-17T15:20:20.263565036Z" level=error msg="failed create real server 192.168.1.72 vip 192.168.1.91 fwmark 2163 in sb cd59d61b16ac0325336121a8558e8215e42aa5300f75054df17a70bf1f3e6c0c: no such process"
time="2016-11-17T15:20:20.410996971Z" level=error msg="failed delete new service vip 192.168.1.95 fwmark 2144: no such process"
time="2016-11-17T15:20:20.456710211Z" level=error msg="failed add firewall mark rule in sbox /var/run/docker/netns/88d38a2bfb77: reexec failed: exit status 5"
time="2016-11-17T15:20:21.389253510Z" level=error msg="failed create real server 192.168.1.46 vip 192.168.1.99 fwmark 2145 in sb cd59d61b16ac0325336121a8558e8215e42aa5300f75054df17a70bf1f3e6c0c: no such process"
time="2016-11-17T15:20:22.208965378Z" level=error msg="failed create real server 192.168.1.46 vip 192.168.1.99 fwmark 2145 in sb e29369ad8ee5b12fb0c6f9bcb899514ab092f7da291a7c05eea758b0c19bfb65: no such process"
time="2016-11-17T15:20:23.334582312Z" level=error msg="failed create new service vip 192.168.1.97 fwmark 2166: file exists"
time="2016-11-17T15:20:23.495873232Z" level=error msg="failed create real server 192.168.1.48 vip 192.168.1.17 fwmark 552 in sb e29369ad8ee5b12fb0c6f9bcb899514ab092f7da291a7c05eea758b0c19bfb65: no such process"
time="2016-11-17T15:20:25.831988014Z" level=error msg="failed create real server 192.168.1.116 vip 192.168.1.41 fwmark 566 in sb 03ac1e71139ff2140f93c80d9e6b1d69abf442a0c2362610bee3e116e84ef434: no such process"
time="2016-11-17T15:20:25.850904011Z" level=error msg="failed create real server 192.168.1.116 vip 192.168.1.41 fwmark 566 in sb 03ac1e71139ff2140f93c80d9e6b1d69abf442a0c2362610bee3e116e84ef434: no such process"
time="2016-11-17T15:20:37.159637665Z" level=error msg="container status unavailable" error="context canceled" module=taskmanager task.id=6yhu3glre4tbz6d08lk2pq9eb
time="2016-11-17T15:20:48.229343512Z" level=error msg="error closing logger: invalid argument"
time="2016-11-17T15:51:16.027686909Z" level=error msg="error getting service internal: service internal not found"
time="2016-11-17T15:51:16.027708795Z" level=error msg="handler /v1.24/services/internal returned error: service internal not found"
time="2016-11-17T16:15:50.946921655Z" level=error msg="container status unavailable" error="context canceled" module=taskmanager task.id=cxmvk7p1hwznautresir94m3s
time="2016-11-17T16:16:01.994494784Z" level=error msg="error closing logger: invalid argument"
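The `fwmark ... no such process` and `failed add firewall mark rule` lines point at the IPVS load balancer that swarm mode uses to implement VIPs inside the per-network namespaces under /var/run/docker/netns. That state can be inspected directly on a node. A hedged sketch (requires root and the ipvsadm tool, which is not installed by default):

```shell
# Dump IPVS state for every Docker network namespace on this node.
# A fwmark entry with no real servers behind it is a VIP that will
# black-hole traffic, matching the timeouts described above.
for ns in /var/run/docker/netns/*; do
  [ -e "$ns" ] || continue   # glob may not match on a non-swarm host
  echo "== $ns =="
  nsenter --net="$ns" ipvsadm -ln || echo "(could not read IPVS state)"
done
```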
UPDATE 2: I tried removing the service and re-creating it, and that did not resolve the issue.
UPDATE 3: I went through and rebooted each node in the cluster one-by-one. Afterward, things appear to be back to normal. However, I still don't know what caused this. More importantly, how do I keep it from happening again in the future?
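For what it's worth, a node-by-node restart like the one described in update 3 can be made less disruptive by draining each node first, so the swarm reschedules its tasks before the reboot. A sketch only; `stage-0` is the node name visible in the `docker info` output above, and the other node names are hypothetical:

```shell
# Drain, reboot out-of-band, then reactivate each node in turn.
for node in stage-0 stage-1 stage-2; do
  docker node update --availability drain "$node"
  # ... reboot the node here (e.g. over ssh) and wait for it to rejoin ...
  docker node update --availability active "$node"
done
```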