In my previous post, I showed how to deploy Pivotal Container Service (PKS) on a simplified flat network. In this post, I will highlight some of the issues you might encounter if you wish to deploy PKS on a more complex network topology. For example, you may have vCenter Server on a vSphere management network alongside the PKS management components (PKS CLI client, Pivotal Ops Manager). You may then want another "intermediate network" for the deployment of the BOSH and PKS VMs. And finally, you may have another network on which the Kubernetes (K8s) VMs (master, workers) are deployed. These components need to communicate with each other across the different networks, e.g. the BOSH agent on the K8s master and worker VMs needs to be able to reach the vSphere infrastructure. What I want to highlight in this post are some of the issues and error messages that you might encounter when rolling out PKS in such a configuration, and what you can do to fix them. Think of this as a set of lessons learnt while trying to do something similar myself.
A picture is worth a thousand words, so a final PKS deployment may look something like this layout:
Let’s now look at what happens when certain components in this deployment cannot communicate/route to other components.
ISSUE #1: This is the error I observed when trying to deploy the PKS VM via Pivotal Ops Manager on a network which could not route to my vSphere network. Note this is the Pivotal Container Service PKS VM (purple above) and not the PKS client with the CLI tools (orange above).
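A quick way to catch this before kicking off a deployment is to check from the Ops Manager VM that vCenter is both resolvable and reachable on port 443. This is a minimal sketch, assuming you can SSH to Ops Manager as the ubuntu user, that vcsa-06.rainpole.com is your vCenter FQDN (a placeholder, substitute your own), and that nslookup and nc are available on the VM:
# check that the vCenter FQDN resolves from the Ops Manager network
nslookup vcsa-06.rainpole.com
# check that the vSphere API port is reachable (tests the route and any firewall in between)
nc -vz -w 5 vcsa-06.rainpole.com 443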
ISSUE #2: This next issue was to do with not being able to resolve fully qualified domain names. If DNS has not been configured correctly when you are setting up the network section of the manifests in Pivotal Ops Manager, then PKS will not be able to resolve ESXi hostnames in your vSphere environment. I'm guessing that the upload-stemcell command which is getting an error here is where it is trying to upload the customized operating system image for the PKS VM to vSphere, but it is unable to resolve the ESXi host's FQDN to an IP address.
===== 2018-04-19 11:23:06 UTC Finished "/usr/local/bin/bosh --no-color --non-interactive --tty --environment=192.168.191.10 upload-stemcell /var/tempest/stemcells/bosh-stemcell-3468.28-vsphere-esxi-ubuntu-trusty-go_agent.tgz"; Duration: 198s; Exit Status: 1
===== 2018-04-19 11:23:06 UTC Running "/usr/local/bin/bosh --no-color --non-interactive --tty --environment=192.168.191.10 upload-stemcell /var/tempest/stemcells/bosh-stemcell-3468.28-vsphere-esxi-ubuntu-trusty-go_agent.tgz"
Using environment ‘192.168.191.10’ as client ‘ops_manager’
0.00% 0.51% 10.62 MB/s 38s
Task 14
Task 14 | 11:23:39 | Update stemcell: Extracting stemcell archive (00:00:04)
Task 14 | 11:23:43 | Update stemcell: Verifying stemcell manifest (00:00:00)
Task 14 | 11:23:45 | Update stemcell: Checking if this stemcell already exists (00:00:00)
Task 14 | 11:23:45 | Update stemcell: Uploading stemcell bosh-vsphere-esxi-ubuntu-trusty-go_agent/3468.28 to the cloud (00:00:28)
L Error: Unknown CPI error ‘Unknown’ with message ‘getaddrinfo: Name or service not known (esxi-dell-g.rainpole.com:443)’ in ‘create_stemcell’ CPI method
Task 14 | 11:24:13 | Error: Unknown CPI error ‘Unknown’ with message ‘getaddrinfo: Name or service not known (esxi-dell-g.rainpole.com:443)’ in ‘create_stemcell’ CPI method
Task 14 Started Thu Apr 19 11:23:39 UTC 2018
Task 14 Finished Thu Apr 19 11:24:13 UTC 2018
Task 14 Duration 00:00:34
Task 14 error
Uploading stemcell file:
Expected task ’14’ to succeed but state is ‘error’
Exit code 1
RESOLUTION #2: Ensure that the DNS server entries in the network sections of the manifests in Ops Manager are correct so that both BOSH and PKS can resolve vCenter and ESXi host FQDNs.
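You can verify the DNS entries before retrying the apply. A minimal sketch, run from the Ops Manager (or BOSH director) VM, assuming nslookup and dig are available; esxi-dell-g.rainpole.com is the ESXi host from the error above, and 192.168.191.1 is a placeholder for whatever DNS server you entered in the network section of the manifest:
# ask the DNS server configured in the manifest to resolve the ESXi host FQDN
nslookup esxi-dell-g.rainpole.com 192.168.191.1
# repeat for vCenter and each ESXi host referenced by the vSphere environment
dig +short esxi-dell-g.rainpole.com @192.168.191.1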
ISSUE #3: The K8s master and worker VMs deploy but never enter a running state; they are left with an unresponsive agent. I used two very useful commands on the PKS client to troubleshoot this. The PKS client (orange above) is not the PKS VM (purple above), but the VM where I have my CLI tools deployed (see my previous post for more info). One command is bosh vms and the other is bosh task. The first, bosh vms, shows the current state of the deployed VMs, including the K8s VMs, while bosh task tracks the K8s cluster deployment task. As you can see, the deployment gives up after 10 minutes (600 seconds).
root@pks-cli:~# bosh vms
Using environment ‘10.27.51.181’ as client ‘ops_manager’
Task 53
Task 54. Done
Task 53 done
Deployment ‘pivotal-container-service-e7febad16f1bf59db116’
Instance Process State AZ IPs VM CID VM Type Active
pivotal-container-service/d4a0fd19-e9ce-47a8-a7df-afa100a612fa running CH-AZ 10.27.51.182 vm-68aadcae-ba47-41e8-843a-fb3764670861 micro false
1 vms
Deployment ‘service-instance_2214fcfa-c02f-498f-b37b-3a1b9cf89b27’
Instance Process State AZ IPs VM CID VM Type Active
master/4ed5f285-5c89-4740-b8bf-32682137cab6 unresponsive agent CH-AZ 192.50.0.140 vm-75109d0d-5581-4cf9-9dcc-d873e9602b9b – false
worker/9d9ad944-dac1-4d3b-9838-1f7c61ffb5b1 unresponsive agent CH-AZ 192.50.0.141 vm-5e882aff-f709-4b7b-ab47-8d6be80cb7dd – false
worker/dd673f29-9bc8-4921-b231-ea35f2cc66b1 unresponsive agent CH-AZ 192.50.0.142 vm-e0b6dca3-cd92-4e0a-8429-9a2fe2a2dc56 – false
worker/e36af3d7-e0cd-4c23-88e6-adde3f554300 unresponsive agent CH-AZ 192.50.0.143 vm-cfd0c81e-9811-49cb-9c87-e23063f83a6b – false
4 vms
Succeeded
root@pks-cli:~#
root@pks-cli:~# bosh task
Using environment ‘10.27.51.181’ as client ‘ops_manager’
Task 48
Task 48 | 15:52:13 | Preparing deployment: Preparing deployment (00:00:05)
Task 48 | 15:52:30 | Preparing package compilation: Finding packages to compile (00:00:00)
Task 48 | 15:52:30 | Creating missing vms: master/4ed5f285-5c89-4740-b8bf-32682137cab6 (0)
Task 48 | 15:52:30 | Creating missing vms: worker/e36af3d7-e0cd-4c23-88e6-adde3f554300 (1)
Task 48 | 15:52:30 | Creating missing vms: worker/9d9ad944-dac1-4d3b-9838-1f7c61ffb5b1 (0)
Task 48 | 15:52:30 | Creating missing vms: worker/dd673f29-9bc8-4921-b231-ea35f2cc66b1 (2)
Task 48 | 16:02:53 | Creating missing vms: worker/9d9ad944-dac1-4d3b-9838-1f7c61ffb5b1 (0) (00:10:23)
L Error: Timed out pinging to 8876d9df-290f-41b9-8455-1c8efe5fc05d after 600 seconds
Task 48 | 16:02:58 | Creating missing vms: worker/dd673f29-9bc8-4921-b231-ea35f2cc66b1 (2) (00:10:28)
L Error: Timed out pinging to 893ccb7a-11d8-4055-b486-f435f922954c after 600 seconds
Task 48 | 16:02:58 | Creating missing vms: master/4ed5f285-5c89-4740-b8bf-32682137cab6 (0) (00:10:28)
L Error: Timed out pinging to 4741eb79-ca75-4352-ba8e-d70474c7beb8 after 600 seconds
Task 48 | 16:03:00 | Creating missing vms: worker/e36af3d7-e0cd-4c23-88e6-adde3f554300 (1) (00:10:30)
L Error: Timed out pinging to 0315851b-fdd3-48b5-9415-75d2bf52c945 after 600 seconds
Task 48 | 16:03:00 | Error: Timed out pinging to 8876d9df-290f-41b9-8455-1c8efe5fc05d after 600 seconds
Task 48 Started Fri Apr 20 15:52:13 UTC 2018
Task 48 Finished Fri Apr 20 16:03:00 UTC 2018
Task 48 Duration 00:10:47
Task 48 error
Capturing task ’48’ output:
Expected task ’48’ to succeed but state is ‘error’
Exit code 1
root@pks-cli:~#
RESOLUTION #3: The BOSH agents in the Kubernetes VMs need to be able to communicate back to the BOSH VM, so there must be a route between the Kubernetes VMs deployed on the "Service Network" (the network that is configured in the BOSH manifest and consumed in the PKS manifest in Ops Manager) and the "intermediate network" on which the BOSH and PKS VMs are deployed. If there is no route between the networks, this is what you will observe.
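A quick sanity check is to test whether there is any route between the two networks at all, for example from the PKS client (or any VM on the intermediate network) towards the addresses that bosh vms reported for the K8s nodes. This is a minimal sketch, reusing the worker IP from the output above (192.50.0.140) and the BOSH director address (10.27.51.181); note that the agent-to-director direction is the one that matters, so if you can get a console login to a K8s node, repeat the test in that direction too:
# from the PKS client / intermediate network, see if the service network is reachable at all
ping -c 3 192.50.0.140
traceroute 192.50.0.140
# from a console session on a K8s node (the agent is down, so bosh ssh won't work),
# check the reverse path to the BOSH director, which the agent needs in order to register
ping -c 3 10.27.51.181
nc -vz -w 5 10.27.51.181 4222   # NATS, port 4222 by default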
ISSUE #4: In this final issue, the K8s cluster does not deploy successfully. The master and worker VMs are running, but the first worker VM never completes its canary update. The canary step is where BOSH updates the first instance of a job with any necessary configuration/components/software before rolling the change out to the remaining instances; if the canary update fails, the deployment stops at that point. In this example, we are looking at the task after the failure, again using bosh task. If you give the task number to bosh task, it will list the task steps, as shown below:
root@pks-cli:~# bosh task 31
Using environment ‘192.168.191.10’ as client ‘ops_manager’
Task 31
Task 31 | 12:00:21 | Preparing deployment: Preparing deployment (00:00:06)
Task 31 | 12:00:40 | Preparing package compilation: Finding packages to compile (00:00:00)
Task 31 | 12:00:40 | Creating missing vms: master/3544e363-4a12-488b-a2ea-8fb76a480575 (0)
Task 31 | 12:00:40 | Creating missing vms: worker/dff1daa3-9bf0-4e6a-90a3-4dde6286d972 (0)
Task 31 | 12:00:40 | Creating missing vms: worker/363b9529-d7f7-4d64-a389-84a9a13fcc91 (2)
Task 31 | 12:00:40 | Creating missing vms: worker/8908ebd2-d28f-4a9c-b184-c5379fa35824 (1) (00:01:10)
Task 31 | 12:02:01 | Creating missing vms: worker/dff1daa3-9bf0-4e6a-90a3-4dde6286d972 (0) (00:01:21)
Task 31 | 12:02:03 | Creating missing vms: worker/363b9529-d7f7-4d64-a389-84a9a13fcc91 (2) (00:01:23)
Task 31 | 12:02:04 | Creating missing vms: master/3544e363-4a12-488b-a2ea-8fb76a480575 (0) (00:01:24)
Task 31 | 12:02:11 | Updating instance master: master/3544e363-4a12-488b-a2ea-8fb76a480575 (0) (canary) (00:02:02)
Task 31 | 12:04:13 | Updating instance worker: worker/dff1daa3-9bf0-4e6a-90a3-4dde6286d972 (0) (canary) (00:02:29)
L Error: Action Failed get_task: Task 7af7fdd9-fa53-4dc7-5b2a-6c9de2e7df3c result: 1 of 4 pre-start scripts failed. Failed Jobs: kubelet. Successful Jobs: bosh-dns-enable, bosh-dns, syslog_forwarder.
Task 31 | 12:06:42 | Error: Action Failed get_task: Task 7af7fdd9-fa53-4dc7-5b2a-6c9de2e7df3c result: 1 of 4 pre-start scripts failed. Failed Jobs: kubelet. Successful Jobs: bosh-dns-enable, bosh-dns, syslog_forwarder.
Task 31 Started Thu Apr 19 12:00:21 UTC 2018
Task 31 Finished Thu Apr 19 12:06:42 UTC 2018
Task 31 Duration 00:06:21
Task 31 error
Capturing task ’31’ output:
Expected task ’31’ to succeed but state is ‘error’
Exit code 1
root@pks-cli:~#
In this case, because the K8s VMs are running, we can actually log onto the K8s VM and see if we can figure out why it failed by looking at the logs. There are 3 steps to this. First we use a new bosh command, bosh deployments.
Step 4.1 – get list of deployments via BOSH CLI and locate the service instance
root@pks-cli:~# bosh deployments
Using environment ‘192.168.191.10’ as client ‘ops_manager’
Name Release(s) Stemcell(s)
Team(s) Cloud Config
pivotal-container-service-9b9223d27659ed342925 bosh-dns/1.3.0 bosh-vsphere-esxi-ubuntu-trusty-go_agent/3468.28
– latest
cf-mysql/36.10.0
docker/30.1.4
kubo/0.13.0
kubo-etcd/8
kubo-service-adapter/1.0.0-build.3
on-demand-service-broker/0.19.0
pks-api/1.0.0-build.3
pks-helpers/19.0.0
pks-nsx-t/0.1.6
syslog-migration/10
uaa/54
service-instance_20474001-494e-43b1-aca4-ab8f788078b6 bosh-dns/1.3.0 bosh-vsphere-esxi-ubuntu-trusty-go_agent/3468.28 pivotal-container-service-9b9223d27659ed342925 latest
docker/30.1.4
kubo/0.13.0
kubo-etcd/8
pks-helpers/19.0.0
pks-nsx-t/0.1.6
syslog-migration/10
2 deployments
Succeeded
Step 4.2 – Open an SSH session to the first worker on your K8s cluster, worker/0
Once the service instance is located, we can specify that deployment in the bosh command, and request SSH access to one of the VMs in the K8s cluster, in this case the first worker which is identified as worker/0.
root@pks-cli:~# bosh -d service-instance_20474001-494e-43b1-aca4-ab8f788078b6 ssh worker/0
Using environment ‘192.168.191.10’ as client ‘ops_manager’
Using deployment ‘service-instance_20474001-494e-43b1-aca4-ab8f788078b6’
Task 130. Done
Unauthorized use is strictly prohibited. All access and activity
is subject to logging and monitoring.
Welcome to Ubuntu 14.04.5 LTS (GNU/Linux 4.4.0-116-generic x86_64)
* Documentation: https://help.ubuntu.com/
The programs included with the Ubuntu system are free software;
the exact distribution terms for each program are described in the
individual files in /usr/share/doc/*/copyright.
Ubuntu comes with ABSOLUTELY NO WARRANTY, to the extent permitted by
applicable law.
Last login: Thu Apr 19 15:03:05 2018 from 192.168.192.131
To run a command as administrator (user “root”), use “sudo <command>”.
See “man sudo_root” for details.
worker/bcfbd60c-667f-45a8-9791-0b5d0a7a9565:~$
Step 4.3 – Examine the log files of the worker
The log files we are interested in are /var/vcap/sys/log/kubelet/*.log, since it is the kubelet job whose pre-start script failed during the canary step above. You will need superuser privileges to view these files, so simply sudo su - to get them. I've truncated the log file here, FYI.
worker/bcfbd60c-667f-45a8-9791-0b5d0a7a9565:/var/vcap/sys/log$ sudo su -
worker/bcfbd60c-667f-45a8-9791-0b5d0a7a9565:~# pwd
/root
worker/bcfbd60c-667f-45a8-9791-0b5d0a7a9565:~# cd /var/vcap/sys/log/kubelet/
worker/bcfbd60c-667f-45a8-9791-0b5d0a7a9565:/var/vcap/sys/log/kubelet# ls -ltr
total 8
-rw-r----- 1 root root 21 Apr 19 12:24 pre-start.stdout.log
-rw-r----- 1 root root 2716 Apr 19 12:25 pre-start.stderr.log
worker/bcfbd60c-667f-45a8-9791-0b5d0a7a9565:/var/vcap/sys/log/kubelet# cat pre-start.stdout.log
rpcbind stop/waiting
worker/bcfbd60c-667f-45a8-9791-0b5d0a7a9565:/var/vcap/sys/log/kubelet# cat pre-start.stderr.log
+ CONF_DIR=/var/vcap/jobs/kubelet/config
+ PKG_DIR=/var/vcap/packages/kubernetes
+ source /var/vcap/packages/kubo-common/utils.sh
+ main
+ detect_cloud_config
<—snip —>
+ export GOVC_DATACENTER=CH-Datacenter
+ GOVC_DATACENTER=CH-Datacenter
++ cat /sys/class/dmi/id/product_serial
++ sed -e 's/^VMware-//' -e 's/-/ /'
++ awk '{ print tolower($1$2$3$4 "-" $5$6 "-" $7$8 "-" $9$10 "-" $11$12$13$14$15$16) }'
+ local vm_uuid=423c6dcf-d47b-53a3-5a1e-2251d6bdc4b7
+ /var/vcap/packages/govc/bin/govc vm.change -e disk.enableUUID=1 -vm.uuid=423c6dcf-d47b-53a3-5a1e-2251d6bdc4b7
/var/vcap/packages/govc/bin/govc: Post https://10.27.51.106:443/sdk: dial tcp 10.27.51.106:443: i/o timeout
worker/bcfbd60c-667f-45a8-9791-0b5d0a7a9565:/var/vcap/sys/log/kubelet#
RESOLUTION #4: In this example, we see the K8s worker node getting an i/o timeout while trying to communicate with my vCenter Server (that is my vCenter IP, which I added to the PKS manifest in Pivotal Operations Manager in the Kubernetes Cloud Provider section). This access is required by the K8s VMs to manage/create/delete persistent volumes as VMDKs for the application containers that will run on K8s. In this case, the K8s cluster was deployed on a network segment that allowed it to communicate with the BOSH/PKS VMs, but not with the vCenter Server/vSphere environment.
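Since the worker VMs are up in this scenario, you can confirm the problem from the failing worker itself, either with a simple port test against vCenter or by repeating the same kind of govc call the pre-start script attempted. A minimal sketch, reusing the vCenter IP from the log above (10.27.51.106); the GOVC_USERNAME/GOVC_PASSWORD values are placeholders and should match your Kubernetes Cloud Provider credentials:
# from the failing worker (bosh ssh, then sudo su -), test reachability of vCenter on port 443
nc -vz -w 5 10.27.51.106 443
# or repeat a call to vCenter using the bundled govc binary
export GOVC_URL=https://10.27.51.106
export GOVC_INSECURE=1
export GOVC_USERNAME='administrator@vsphere.local'   # placeholder
export GOVC_PASSWORD='your-password'                 # placeholder
/var/vcap/packages/govc/bin/govc about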
Other useful things to know – how to log into BOSH and PKS VMs
We have seen how we can access our K8s VMs if we need to troubleshoot, but what about the BOSH and PKS VMs? This is quite straightforward. Log in to Pivotal Operations Manager, click on the tile of the VM that you wish to log in to, select Credentials, and from there you can retrieve a login for a shell on each of the VMs. Log in as user vcap, supply the password retrieved from Ops Manager, and then sudo if you need superuser privileges.
Here is where to get them for BOSH Director:
Here is where to get them for PKS/Pivotal Container Service:
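Once you have the vcap password from the Credentials tab, the login itself is a plain SSH session. A minimal sketch, using the BOSH director address from earlier in this post (192.168.191.10) as an example; the PKS VM works the same way with its own IP and vcap password:
# SSH to the BOSH director as vcap, using the password from Ops Manager > Credentials
ssh vcap@192.168.191.10
# once logged in, become root if superuser privileges are needed
sudo su -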