Docker Swarm Mode: Not all VIPs for a service work. Getting timeouts for several VIPs


I'm having issues with an overlay network using Docker swarm mode (important: swarm mode, not standalone Swarm). I have an overlay network named "internal" and a service named "datacollector" that is scaled to 12 instances. I docker exec into another service running in the same swarm (and on the same overlay network) and run curl http://datacollector 12 times. However, 4 of the requests result in a timeout. I then run dig tasks.datacollector and get a list of 12 IP addresses. Sure enough, 8 of the IP addresses work, but 4 time out every time.
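
For reference, the test itself is essentially this, run from a shell inside any other container attached to the "internal" network (the 5-second timeout is arbitrary):

    # hit the service VIP 12 times; print "ok" or "timeout" for each request
    for i in $(seq 1 12); do
        curl -s -m 5 -o /dev/null http://datacollector && echo "ok" || echo "timeout"
    done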

I tried scaling the service down to 1 instance and back up to 12, but got the same result.

I used docker service ps datacollector to find each running instance of the service, then used docker kill xxxx on each node to manually kill the instances and let the swarm recreate them. I checked dig again and verified that the list of IP addresses for the tasks was no longer the same. After that I ran curl http://datacollector 12 more times. Now only 3 requests work and the remaining 9 time out!
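
The kill-and-recreate pass was along these lines (container IDs are placeholders):

    # find which node each task is running on
    docker service ps datacollector

    # then, on each node in turn, kill that node's task container
    # and let the swarm reschedule it
    docker kill xxxx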

This is the second time this has happened in the last 2 weeks or so. The previous time, I had to remove all of the services, remove the overlay network, recreate the overlay network, and re-create all of the services in order to resolve the issue. Obviously, that isn't a workable long-term solution :(
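
That earlier recovery was, roughly, the following full teardown and rebuild (the service create flags here are illustrative, not the exact ones I used):

    # tear everything down
    docker service rm datacollector        # repeated for every service on the network
    docker network rm internal

    # rebuild
    docker network create --driver overlay internal
    docker service create --name datacollector --replicas 12 \
        --network internal <image> <args>  # repeated for every service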

Output of `docker service inspect datacollector`:

[
    {
        "ID": "2uevc4ouakk6k3dirhgqxexz9",
        "Version": {
            "Index": 72152
        },
        "CreatedAt": "2016-11-12T20:38:51.137043037Z",
        "UpdatedAt": "2016-11-17T15:22:34.402801678Z",
        "Spec": {
            "Name": "datacollector",
            "TaskTemplate": {
                "ContainerSpec": {
                    "Image": "507452836298.dkr.ecr.us-east-1.amazonaws.com/swarm/api:61d7931f583742cca91b368bc6d9e15314545093",
                    "Args": [
                        "node",
                        ".",
                        "api/datacollector"
                    ],
                    "Env": [
                        "ENVIRONMENT=stage",
                        "MONGODB_URI=mongodb://mongodb:27017/liveearth",
                        "RABBITMQ_URL=amqp://rabbitmq",
                        "ELASTICSEARCH_URL=http://elasticsearch"
                    ]
                },
                "Resources": {
                    "Limits": {},
                    "Reservations": {}
                },
                "RestartPolicy": {
                    "Condition": "any",
                    "MaxAttempts": 0
                },
                "Placement": {
                    "Constraints": [
                        "node.labels.role.api==true",
                        "node.labels.role.api==true",
                        "node.labels.role.api==true",
                        "node.labels.role.api==true",
                        "node.labels.role.api==true"
                    ]
                }
            },
            "Mode": {
                "Replicated": {
                    "Replicas": 12
                }
            },
            "UpdateConfig": {
                "Parallelism": 1,
                "FailureAction": "pause"
            },
            "Networks": [
                {
                    "Target": "88e9fd9715o5v1hqu6dnkg3vp"
                }
            ],
            "EndpointSpec": {
                "Mode": "vip"
            }
        },
        "Endpoint": {
            "Spec": {
                "Mode": "vip"
            },
            "VirtualIPs": [
                {
                    "NetworkID": "88e9fd9715o5v1hqu6dnkg3vp",
                    "Addr": "192.168.1.23/24"
                }
            ]
        },
        "UpdateStatus": {
            "State": "completed",
            "StartedAt": "2016-11-17T15:19:34.471292948Z",
            "CompletedAt": "2016-11-17T15:22:34.402794312Z",
            "Message": "update completed"
        }
    }
]

Output of `docker network inspect internal`:

[
    {
        "Name": "internal",
        "Id": "88e9fd9715o5v1hqu6dnkg3vp",
        "Scope": "swarm",
        "Driver": "overlay",
        "EnableIPv6": false,
        "IPAM": {
            "Driver": "default",
            "Options": null,
            "Config": [
                {
                    "Subnet": "192.168.1.0/24",
                    "Gateway": "192.168.1.1"
                }
            ]
        },
        "Internal": false,
        "Containers": {
            "03ac1e71139ff2140f93c80d9e6b1d69abf442a0c2362610bee3e116e84ef434": {
                "Name": "datacollector.5.cxmvk7p1hwznautresir94m3s",
                "EndpointID": "22445be80ba55b67d7cfcfbc75f2c15586bace5f317be8ba9b59c5f9f338525c",
                "MacAddress": "02:42:c0:a8:01:72",
                "IPv4Address": "192.168.1.114/24",
                "IPv6Address": ""
            },
            "08ae84c7cb6e57583baf12c2a9082c1d17f1e65261cfa93346aaa9bda1244875": {
                "Name": "auth.10.aasw00k7teq4knxibctlrrj7e",
                "EndpointID": "c3506c851f4c9f0d06d684a9f023e7ba529d0149d70fa7834180a87ad733c678",
                "MacAddress": "02:42:c0:a8:01:44",
                "IPv4Address": "192.168.1.68/24",
                "IPv6Address": ""
            },
            "192203a127d6831c3f4a41eabdd8df5282e33c3e92b99c3baaf1f213042f5418": {
                "Name": "parkingcollector.1.8yrm6d831wrfsrkzhal7cf2pm",
                "EndpointID": "34de6e9621ef54f7d963db942a7a7b6e0013ac6db6c9f17b384de689b1f1b187",
                "MacAddress": "02:42:c0:a8:01:9a",
                "IPv4Address": "192.168.1.154/24",
                "IPv6Address": ""
            },
            "24258109e16c1a5b15dcc84a41d99a4a6617bcadecc9b35279c721c0d2855141": {
                "Name": "stream.8.38npsusmpa1pf8fbnmaux57rx",
                "EndpointID": "b675991ffbd5c0d051a4b68790a33307b03b48582fd1b37ba531cf5e964af0ce",
                "MacAddress": "02:42:c0:a8:01:74",
                "IPv4Address": "192.168.1.116/24",
                "IPv6Address": ""
            },
            "33063b988473b73be2cbc51e912e165112de3d01bc00ee2107aa635e30a36335": {
                "Name": "billing.2.ca41k2h44zkn9wfbsif0lfupf",
                "EndpointID": "77c576929d5e82f1075b4cc6fcb4128ce959281d4b9c1c22d9dcd1e42eed8b5e",
                "MacAddress": "02:42:c0:a8:01:87",
                "IPv4Address": "192.168.1.135/24",
                "IPv6Address": ""
            },
            "8b0929e66e6c284206ea713f7c92f1207244667d3ff02815d4bab617c349b220": {
                "Name": "shotspottercollector.2.328408tiyy8aryr0g1ipmm5xm",
                "EndpointID": "f2a0558ec67745f5d1601375c2090f5cd141303bf0d54bec717e3463f26ed74d",
                "MacAddress": "02:42:c0:a8:01:90",
                "IPv4Address": "192.168.1.144/24",
                "IPv6Address": ""
            },
            "938fe5f6f9bb893862e8c06becd76c1a7fe5f2d3b791fc55d7d8164e67ee3553": {
                "Name": "inrixproxy.2.ed77crvat0waw41phjknhhm6v",
                "EndpointID": "88f550fecd60f0bdb0dfc9d5bf0c74716a91d009bcc27dc4392b113ab1215038",
                "MacAddress": "02:42:c0:a8:01:96",
                "IPv4Address": "192.168.1.150/24",
                "IPv6Address": ""
            },
            "970f9d4c6ae6cc4de54a1d501408720b7d95114c28a6615d8e4e650b7e69bc40": {
                "Name": "rabbitmq.1.e7j721g6hfhs8r7p3phih4g9v",
                "EndpointID": "c04a4a5650ee6e10b87884004aa2cb1ec6b1c7036af15c31579462b6621436a2",
                "MacAddress": "02:42:c0:a8:01:1e",
                "IPv4Address": "192.168.1.30/24",
                "IPv6Address": ""
            },
            "b1f676e6d38eec026583943dc0abff1163d21e6be9c5901539c46288f8941638": {
                "Name": "logspout.0.51j8juw8aj0rjjccp2am0rib5",
                "EndpointID": "98a93153abd6897c58276340df2eeec5c0ceb77fbe17d1ce8c465febb06776c7",
                "MacAddress": "02:42:c0:a8:01:10",
                "IPv4Address": "192.168.1.16/24",
                "IPv6Address": ""
            },
            "bab4d80be830fa3b3fefe501c66e3640907a2cbb2addc925a0eb6967a771a172": {
                "Name": "auth.2.8fduvrn5ayk024b0lkhyz50of",
                "EndpointID": "7e81d41fa04ec14263a2423d8ef003d6d431a8c3ff319963197f8a8d73b4e361",
                "MacAddress": "02:42:c0:a8:01:3a",
                "IPv4Address": "192.168.1.58/24",
                "IPv6Address": ""
            },
            "bc3c75a7c2d8c078eb7cc1555833ff0d374d82045dd9fb24ccfc37868615bb5e": {
                "Name": "reverseproxy.6.2g20zphn5j1r2feylzcplyorg",
                "EndpointID": "6c2138966ebcd144b47229a94ee603d264f3954a96ccd024d9e96501b7ffd5c0",
                "MacAddress": "02:42:c0:a8:01:6c",
                "IPv4Address": "192.168.1.108/24",
                "IPv6Address": ""
            },
            "cd59d61b16ac0325336121a8558e8215e42aa5300f75054df17a70bf1f3e6c0c": {
                "Name": "usgscollector.1.0h0afyw8va8maoa4tjd5qz588",
                "EndpointID": "952073efc6a567ebd3f80d26811222c675183e8c76005fbf12388725a97b1bee",
                "MacAddress": "02:42:c0:a8:01:48",
                "IPv4Address": "192.168.1.72/24",
                "IPv6Address": ""
            },
            "d40476e56b91762b0609acd637a4f70e42c88d266f8ebb7d9511050a8fc1df17": {
                "Name": "kibana.1.6hxu5b97hfykuqr5yb9i9sn5r",
                "EndpointID": "08c5188076f9b8038d864d570e7084433a8d97d4c8809d27debf71cb5d652cd7",
                "MacAddress": "02:42:c0:a8:01:06",
                "IPv4Address": "192.168.1.6/24",
                "IPv6Address": ""
            },
            "e29369ad8ee5b12fb0c6f9bcb899514ab092f7da291a7c05eea758b0c19bfb65": {
                "Name": "weatherbugcollector.1.crpub0hf85cewxm0qt6annsra",
                "EndpointID": "afa1ddbad8ab8fdab69505ddb5342ac89c0d17bc75a11e9ac0ac8829e5885997",
                "MacAddress": "02:42:c0:a8:01:2e",
                "IPv4Address": "192.168.1.46/24",
                "IPv6Address": ""
            },
            "f1bf0a656ecb9d7ef9b837efa94a050d9c98586f7312435e48b9a129c5e92e46": {
                "Name": "socratacollector.1.627icslq6kdb4syaha6tzkb19",
                "EndpointID": "14bea0d9ec3f94b04b32f36b7172c60316ee703651d0d920126a49dd0fa99cf5",
                "MacAddress": "02:42:c0:a8:01:1b",
                "IPv4Address": "192.168.1.27/24",
                "IPv6Address": ""
            }
        },
        "Options": {
            "com.docker.network.driver.overlay.vxlanid_list": "257"
        },
        "Labels": {}
    }
]

Output of `dig datacollector`:

; <<>> DiG 9.9.5-9+deb8u8-Debian <<>> datacollector
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 38227
;; flags: qr rd ra; QUERY: 1, ANSWER: 1, AUTHORITY: 0, ADDITIONAL: 0

;; QUESTION SECTION:
;datacollector.            IN    A

;; ANSWER SECTION:
datacollector.        600    IN    A    192.168.1.23

;; Query time: 0 msec
;; SERVER: 127.0.0.11#53(127.0.0.11)
;; WHEN: Thu Nov 17 16:11:57 UTC 2016
;; MSG SIZE  rcvd: 60

Output of `dig tasks.datacollector`:

; <<>> DiG 9.9.5-9+deb8u8-Debian <<>> tasks.datacollector
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 9810
;; flags: qr rd ra; QUERY: 1, ANSWER: 12, AUTHORITY: 0, ADDITIONAL: 0

;; QUESTION SECTION:
;tasks.datacollector.        IN    A

;; ANSWER SECTION:
tasks.datacollector.    600    IN    A    192.168.1.115
tasks.datacollector.    600    IN    A    192.168.1.66
tasks.datacollector.    600    IN    A    192.168.1.22
tasks.datacollector.    600    IN    A    192.168.1.114
tasks.datacollector.    600    IN    A    192.168.1.37
tasks.datacollector.    600    IN    A    192.168.1.139
tasks.datacollector.    600    IN    A    192.168.1.148
tasks.datacollector.    600    IN    A    192.168.1.110
tasks.datacollector.    600    IN    A    192.168.1.112
tasks.datacollector.    600    IN    A    192.168.1.100
tasks.datacollector.    600    IN    A    192.168.1.39
tasks.datacollector.    600    IN    A    192.168.1.106

;; Query time: 0 msec
;; SERVER: 127.0.0.11#53(127.0.0.11)
;; WHEN: Thu Nov 17 16:08:54 UTC 2016
;; MSG SIZE  rcvd: 457
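
The per-address check that produced the "8 work, 4 time out" result is essentially this, run from the same container:

    # curl each task IP directly, bypassing the service VIP
    for ip in $(dig +short tasks.datacollector); do
        if curl -s -m 5 -o /dev/null "http://$ip"; then
            echo "$ip ok"
        else
            echo "$ip TIMEOUT"
        fi
    done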

Output of `docker version`:

Client:
 Version:      1.12.3
 API version:  1.24
 Go version:   go1.6.3
 Git commit:   6b644ec
 Built:        Wed Oct 26 23:26:11 2016
 OS/Arch:      darwin/amd64

Server:
 Version:      1.12.3
 API version:  1.24
 Go version:   go1.6.3
 Git commit:   6b644ec
 Built:        Wed Oct 26 21:44:32 2016
 OS/Arch:      linux/amd64

Output of `docker info`:

Containers: 58
 Running: 15
 Paused: 0
 Stopped: 43
Images: 123
Server Version: 1.12.3
Storage Driver: aufs
 Root Dir: /var/lib/docker/aufs
 Backing Filesystem: extfs
 Dirs: 430
 Dirperm1 Supported: false
Logging Driver: json-file
Cgroup Driver: cgroupfs
Plugins:
 Volume: local
 Network: host null overlay bridge
Swarm: active
 NodeID: 8uxexr2uz3qpn5x1km9k4le9s
 Is Manager: true
 ClusterID: 2kd4md2qyu67szx4y6q2npnet
 Managers: 3
 Nodes: 8
 Orchestration:
  Task History Retention Limit: 5
 Raft:
  Snapshot Interval: 10000
  Heartbeat Tick: 1
  Election Tick: 3
 Dispatcher:
  Heartbeat Period: 5 seconds
 CA Configuration:
  Expiry Duration: 3 months
 Node Address: 10.10.44.201
Runtimes: runc
Default Runtime: runc
Security Options: apparmor
Kernel Version: 3.13.0-91-generic
Operating System: Ubuntu 14.04.4 LTS
OSType: linux
Architecture: x86_64
CPUs: 2
Total Memory: 3.676 GiB
Name: stage-0
ID: 76Z2:GN43:RQND:BBAJ:AGUU:S3F7:JWBC:CCCK:I4VH:PKYC:UHQT:IR2U
Docker Root Dir: /var/lib/docker
Debug Mode (client): false
Debug Mode (server): false
Username: herbrandson
Registry: https://index.docker.io/v1/
WARNING: No swap limit support
Labels:
 provider=generic
Insecure Registries:
 127.0.0.0/8

Additional environment details: Docker swarm mode (not standalone Swarm). The nodes are running on AWS. The swarm has 8 nodes (3 managers and 5 workers).

Update: per the comments, here's a snippet from the Docker daemon logs on the swarm master:

time="2016-11-17t15:19:45.890158968z" level=error msg="container status  unavailable" error="context canceled" module=taskmanager task.id=ch6w74b3cu78y8r2ugkmfmu8a  time="2016-11-17t15:19:48.929507277z" level=error msg="container status unavailable" error="context canceled" module=taskmanager task.id=exb6dfc067nxudzr8uo1eyj4e  time="2016-11-17t15:19:50.104962867z" level=error msg="container status unavailable" error="context canceled" module=taskmanager task.id=6mbbfkilj9gslfi33w7sursb9  time="2016-11-17t15:19:50.877223204z" level=error msg="container status unavailable" error="context canceled" module=taskmanager task.id=drd8o0yn1cg5t3k76frxgukaq  time="2016-11-17t15:19:54.680427504z" level=error msg="container status unavailable" error="context canceled" module=taskmanager task.id=9lwl5v0f2v6p52shg6gixs3j7  time="2016-11-17t15:19:54.949118806z" level=error msg="container status unavailable" error="context canceled" module=taskmanager task.id=51q1eeilfspsm4cx79nfkl4r0  time="2016-11-17t15:19:56.485909146z" level=error msg="container status unavailable" error="context canceled" module=taskmanager task.id=3vjzfjjdrjio2gx45q9c3j6qd  time="2016-11-17t15:19:56.934070026z" level=error msg="error closing logger: invalid argument"  time="2016-11-17t15:20:00.000614497z" level=error msg="error closing logger: invalid argument"  time="2016-11-17t15:20:00.163458802z" level=error msg="container status unavailable" error="context canceled" module=taskmanager task.id=4xa2ub5npxyxpyx3vd5n1gsuy  time="2016-11-17t15:20:01.463407652z" level=error msg="error closing logger: invalid argument"  time="2016-11-17t15:20:01.949087337z" level=error msg="error closing logger: invalid argument"  time="2016-11-17t15:20:02.942094926z" level=error msg="failed create real server 192.168.1.150 vip 192.168.1.32 fwmark 947 in sb 938fe5f6f9bb893862e8c06becd76c1a7fe5f2d3b791fc55d7d8164e67ee3553: no such process"  time="2016-11-17t15:20:03.319168359z" level=error msg="failed delete new service vip 192.168.1.61 fwmark 2133: no such process"  time="2016-11-17t15:20:03.363775880z" level=error msg="failed add firewall mark rule in sbox /var/run/docker/netns/5de57ee133a5: reexec failed: exit status 5"  time="2016-11-17t15:20:05.772683092z" level=error msg="error closing logger: invalid argument"  time="2016-11-17t15:20:06.059212643z" level=error msg="error closing logger: invalid argument"  time="2016-11-17t15:20:07.335686642z" level=error msg="failed delete new service vip 192.168.1.67 fwmark 2134: no such process"  time="2016-11-17t15:20:07.385135664z" level=error msg="failed add firewall mark rule in sbox /var/run/docker/netns/6699e7c03bbd: reexec failed: exit status 5"  time="2016-11-17t15:20:07.604064777z" level=error msg="error closing logger: invalid argument"  time="2016-11-17t15:20:07.673852364z" level=error msg="failed delete new service vip 192.168.1.75 fwmark 2097: no such process"  time="2016-11-17t15:20:07.766525370z" level=error msg="failed add firewall mark rule in sbox /var/run/docker/netns/6699e7c03bbd: reexec failed: exit status 5"  time="2016-11-17t15:20:09.080101131z" level=error msg="failed create real server 192.168.1.155 vip 192.168.1.35 fwmark 904 in sb 192203a127d6831c3f4a41eabdd8df5282e33c3e92b99c3baaf1f213042f5418: no such process"  time="2016-11-17t15:20:11.516338629z" level=error msg="error closing logger: invalid argument"  time="2016-11-17t15:20:11.729274237z" level=error msg="failed delete new service vip 192.168.1.83 fwmark 2124: no such process"  time="2016-11-17t15:20:11.887572806z" level=error 
msg="failed add firewall mark rule in sbox /var/run/docker/netns/5b810132057e: reexec failed: exit status 5"  time="2016-11-17t15:20:12.281481060z" level=error msg="failed delete new service vip 192.168.1.73 fwmark 2136: no such process"  time="2016-11-17t15:20:12.395326864z" level=error msg="failed add firewall mark rule in sbox /var/run/docker/netns/5b810132057e: reexec failed: exit status 5"  time="2016-11-17t15:20:20.263565036z" level=error msg="failed create real server 192.168.1.72 vip 192.168.1.91 fwmark 2163 in sb cd59d61b16ac0325336121a8558e8215e42aa5300f75054df17a70bf1f3e6c0c: no such process"  time="2016-11-17t15:20:20.410996971z" level=error msg="failed delete new service vip 192.168.1.95 fwmark 2144: no such process"  time="2016-11-17t15:20:20.456710211z" level=error msg="failed add firewall mark rule in sbox /var/run/docker/netns/88d38a2bfb77: reexec failed: exit status 5"  time="2016-11-17t15:20:21.389253510z" level=error msg="failed create real server 192.168.1.46 vip 192.168.1.99 fwmark 2145 in sb cd59d61b16ac0325336121a8558e8215e42aa5300f75054df17a70bf1f3e6c0c: no such process"  time="2016-11-17t15:20:22.208965378z" level=error msg="failed create real server 192.168.1.46 vip 192.168.1.99 fwmark 2145 in sb e29369ad8ee5b12fb0c6f9bcb899514ab092f7da291a7c05eea758b0c19bfb65: no such process"  time="2016-11-17t15:20:23.334582312z" level=error msg="failed create new service vip 192.168.1.97 fwmark 2166: file exists"  time="2016-11-17t15:20:23.495873232z" level=error msg="failed create real server 192.168.1.48 vip 192.168.1.17 fwmark 552 in sb e29369ad8ee5b12fb0c6f9bcb899514ab092f7da291a7c05eea758b0c19bfb65: no such process"  time="2016-11-17t15:20:25.831988014z" level=error msg="failed create real server 192.168.1.116 vip 192.168.1.41 fwmark 566 in sb 03ac1e71139ff2140f93c80d9e6b1d69abf442a0c2362610bee3e116e84ef434: no such process"  time="2016-11-17t15:20:25.850904011z" level=error msg="failed create real server 192.168.1.116 vip 192.168.1.41 fwmark 566 in sb 03ac1e71139ff2140f93c80d9e6b1d69abf442a0c2362610bee3e116e84ef434: no such process"  time="2016-11-17t15:20:37.159637665z" level=error msg="container status unavailable" error="context canceled" module=taskmanager task.id=6yhu3glre4tbz6d08lk2pq9eb  time="2016-11-17t15:20:48.229343512z" level=error msg="error closing logger: invalid argument"  time="2016-11-17t15:51:16.027686909z" level=error msg="error getting service internal: service internal not found"  time="2016-11-17t15:51:16.027708795z" level=error msg="handler /v1.24/services/internal returned error: service internal not found"  time="2016-11-17t16:15:50.946921655z" level=error msg="container status unavailable" error="context canceled" module=taskmanager task.id=cxmvk7p1hwznautresir94m3s  time="2016-11-17t16:16:01.994494784z" level=error msg="error closing logger: invalid argument"  

Update 2: I tried removing the service and re-creating it, and that did not resolve the issue.

Update 3: I went through and rebooted each node in the cluster one-by-one. After that, things appear to be back to normal. However, I still don't know what caused this. More importantly, how do I keep it from happening again in the future?
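
For anyone trying the same thing, the reboot pass was roughly the following for each node in turn (stage-0 shown as an example node name):

    # move the node's tasks elsewhere before rebooting it
    docker node update --availability drain stage-0

    # reboot the node, wait for it to rejoin, then put it back in rotation
    docker node update --availability active stage-0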

