23. Release Notes¶
23.1. Robin Cloud Native Platform v5.3.11¶
The Robin Cloud Native Platform (CNP) v5.3.11 release has new features, an improvement, a bug fix, and known issues.
Release Date: 10 November 2021
23.1.1. Infrastructure Versions¶
The following software applications are included in this CNP release.
| Software Application | Version |
|---|---|
| Kubernetes | 1.21.5 |
| Docker | 19.03.9 |
| Prometheus | 2.16.0 |
| Node-exporter | 1.1.2 |
| Calico | 3.12.3 |
| HA-Proxy | 1.5.18 |
| PostgreSQL | 9.6.11 |
| Grafana | 6.5.3 |
23.1.2. Upgrade Path¶
The following is the supported upgrade path for Robin CNP v5.3.11:
Robin v5.3.9-286 (GA) to Robin v5.3.11-69 (GA)
23.1.3. New Features¶
23.1.3.1. Support for NVIDIA A100 MIG¶
Robin CNP v5.3.11 supports the Multi-Instance GPU (MIG) mode of operation for the NVIDIA A100 GPU. Robin allows you to use partitioned GPUs in Robin bundles and also supports chargeback functionality for these GPU partitions.
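For reference, the following is a minimal sketch of a Pod that requests a single MIG partition, assuming the NVIDIA device plugin's mixed MIG strategy and its nvidia.com/mig-1g.5gb resource name; the Pod name, image, and MIG profile are placeholders and depend on how the A100 is partitioned on the host.
apiVersion: v1
kind: Pod
metadata:
  name: mig-test                      # placeholder name
spec:
  containers:
  - name: cuda
    image: nvidia/cuda:11.0-base      # placeholder image
    command: ["nvidia-smi", "-L"]
    resources:
      limits:
        nvidia.com/mig-1g.5gb: 1      # one MIG partition; profile name is an assumption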
23.1.3.2. Rocky Linux Support¶
Starting from Robin CNP v5.3.11, Rocky Linux 8.4 is supported. You can install Robin CNP v5.3.11 on this version of Linux.
23.1.3.3. Support for Application Ephemeral Volumes¶
Robin CNP v5.3.11 supports Application Ephemeral Volumes (AEVs). An AEV is temporary storage that Robin bundle applications can use. An AEV exists only while an application is running. When you create an application, the AEV is created for its use, and Robin reclaims its space when the application stops. Other applications can use the reclaimed storage space. Robin provisions the storage space back to the application when it starts again.
You can add AEVs only from the Robin UI when creating an application using a Robin bundle. You can also create templates of the application with AEVs for future use. You can add a maximum of 10 AEVs per application.
23.1.4. Improvement¶
23.1.4.1. Support for @ Symbol in Robin Usernames¶
Starting with Robin CNP v5.3.11, you can use the @ symbol as part of Robin usernames. This enables you to use email addresses as usernames.
23.1.5. Fixed Issue¶
Reference ID |
Description |
PP-24202 |
The security issue with SSL Medium Strength Cipher Suites is fixed by supporting Strong Cipher Suites with keys longer than 128 bits in Robin CNP Services. The following is the list of supported Strong Cipher Suites:
The above-mentioned Strong Cipher Suites are supported in the following Robin CNP Services:
|
23.1.6. Known Issues¶
Reference ID |
Description |
PP-24270 |
Symptom In Robin CNP, Kubelet might go down due to the presence of a stale cpu_manager_state file. Workaround Complete the following steps to fix this issue:
# rm -rf /var/lib/kubelet/cpu_manager_state
# systemctl restart kubelet
|
PP-24248 |
Symptom When you create a new resource pool and assign it to nodes, and later deploy a Pod with storage affinity on a node with the newly assigned resource pool, the Pod deployment fails because the node does not pick up the correct resource pool. Workaround Complete the following steps to fix this issue:
# kubectl edit node <node_name>
|
PP-22853 |
Symptom GPUs might not be detected after CNP installation, upgrade, or addition of a new node. Workaround Run the following host probe rediscover command: # robin host probe <hostname> --rediscover
|
PP-22626 |
Symptom If NVIDIA GPU drivers are already installed on your setup, operator deployments might fail. Workaround Complete the following steps to fix this issue:
|
PP-21832 |
Symptom After you reboot a node, it might be in the Workaround Complete the following steps to fix this issue:
|
PP-22781 |
Symptom GPUs are not detected automatically after removing a taint on a master node. Workaround You need to run the |
23.1.7. Technical Support¶
Contact Robin Technical support for any assistance.
23.2. Robin Cloud Native Platform v5.3.11 HF1¶
The Robin CNP v5.3.11 HF1 release has improvements, bug fixes, and known issues.
Release Date: 08 December 2021
23.2.1. Infrastructure Versions¶
The following software applications are included in this CNP release.
| Software Application | Version |
|---|---|
| Kubernetes | 1.21.5 |
| Docker | 19.03.9 |
| Prometheus | 2.16.0 |
| Node-exporter | 1.1.2 |
| Calico | 3.12.3 |
| HA-Proxy | 1.5.18 |
| PostgreSQL | 9.6.11 |
| Grafana | 6.5.3 |
23.2.2. Upgrade Paths¶
The following are the supported upgrade paths for Robin CNP v5.3.11 HF1:
Robin v5.3.5 (HF3) to Robin v5.3.11 (HF1)
Robin v5.3.5 (HF5) to Robin v5.3.11 (HF1)
Robin v5.3.9 (GA) to Robin v5.3.11 (HF1)
Robin v5.3.11 (GA) to Robin v5.3.11 (HF1)
23.2.3. Improvements¶
23.2.3.1. Network Planning Support for Apps with Pod Affinity and Anti-affinity¶
Robin CNP v5.3.11 HF1 provides the network planning support for apps with Pod affinity and anti-affinity.
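For context, Pod affinity and anti-affinity are expressed through the standard Kubernetes affinity stanza; the following is a minimal sketch with placeholder labels and image, not a Robin-specific format.
apiVersion: v1
kind: Pod
metadata:
  name: web-1
  labels:
    app: web                                   # placeholder label
spec:
  affinity:
    podAntiAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
      - labelSelector:
          matchLabels:
            app: web                           # keep replicas of this app apart
        topologyKey: kubernetes.io/hostname    # at most one matching Pod per node
  containers:
  - name: web
    image: nginx:1.21                          # placeholder image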
23.2.3.2. Application Ephemeral Volumes UI Improvement¶
The APPLICATION EPHEMERAL STORAGE section of the Robin UI in Robin CNP v5.3.11 HF1 is improved to display the following drop-down options for the replication option.
Storage-Compute Affinity
Not Replicated
Replicated (2 copies)
Replicated (3 copies)
Note
These options appear when you create AEVs from Robin UI if the AEVs are not defined in a Robin bundle manifest file.
23.2.4. Fixed Issues¶
Reference ID |
Description |
PP-24650 |
Robin CNP v5.3.11 HF1 fixed the SWEET32 vulnerability. |
PP-24528 |
The issue of the CNP planner assigning Pods without checking the status of network interfaces is fixed in this version. With Robin CNP v5.3.11 HF1, the planner skips network interfaces that are down. |
PP-24428 |
The creation of an Application Ephemeral Volume (AEV) fails in a cloud deployment if the replication factor is not specified in the AEV specifications. This issue is fixed in Robin CNP v5.3.11 HF1 by setting a default replication factor for each AEV if it is not provided. |
PP-22941 |
The issue of a Pod not coming up successfully when no limits and requests are provided in the container resources section but a Robin annotation for network planning is provided is fixed. |
PP-21983 |
When an IP address is not in robin |
PP-24589 |
The issue where Calico’s CIDR value was not correctly set when updating a Calico IP pool in an IPv6 Robin CNP setup is fixed. |
PP-24313 |
The |
23.2.5. Known Issues¶
Reference ID |
Description |
PP-24697 |
Symptom If the Network attachment definitions (net-attach-def) are not cleaned up when a Pod is bounced to create the Pod from the webhook start phase, follow this workaround. Workaround You must delete the net-attach-def that is not cleaned up. To delete, run the following command: # kubectl delete net-attach-def <net-attach-def-name> -n <namespace>
|
PP-24600 |
When you deploy a Pod using a Kubernetes Deployment, StatefulSet, or DaemonSet, in some scenarios the deployment process might take longer than usual due to the exponential back-off delay during scheduling. This is known Kubernetes behavior. |
23.2.6. Technical Support¶
Contact Robin Technical support for any assistance.
23.3. Robin Cloud Native Platform v5.3.11 HF2¶
The Robin CNP v5.3.11 HF2 release has a new feature, improvements, bug fixes, and known issues.
Release Date: 02 February 2022
23.3.1. Infrastructure Versions¶
The following software applications are included in this CNP release.
| Software Application | Version |
|---|---|
| Kubernetes | 1.21.5 |
| Docker | 19.03.9 |
| Prometheus | 2.16.0 |
| Prometheus-adapter | 0.9.1 |
| Node-exporter | 1.1.2 |
| Calico | 3.12.3 |
| HA-Proxy | 1.5.18 |
| PostgreSQL | 9.6.22 |
| Grafana | 6.5.3 |
23.3.2. Upgrade Paths¶
The following are the supported upgrade paths for Robin CNP v5.3.11 HF2:
Robin v5.3.5-232 (HF5) to Robin v5.3.11 HF2
Robin v5.3.7-120 (HF1) to Robin v5.3.11 HF2
Robin v5.3.11-104 (HF1) to Robin v5.3.11 HF2
23.3.3. New Feature¶
23.3.3.1. Intel Cache Allocation Technology Support for vDU Pods¶
Robin CNP v5.3.11 HF2 supports the Intel Cache Allocation Technology (CAT) to deploy 4G vDU (Virtual Distributed Unit) Pods for a single non-uniform memory access (NUMA) node host.
The Intel CAT enables vDUs in a 4G environment to access the CPUs on the host using dedicated cache lines/ways. When you deploy a vDU Pod using Robin CNP, the vDUs get a dedicated number of cache lines/ways configured on the host to access the CPUs.
You can use the following annotation in your Pod YAML file for requesting cache lines/ways from the CPU for vDUs.
Example:
# "robin.runtime.num_cache_ways": "4"
Note
You must use the annotation at the Pod level only.
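For illustration, the following is a minimal sketch showing where the annotation goes (Pod metadata, not the container spec); the Pod name, image, and resource values are placeholders.
apiVersion: v1
kind: Pod
metadata:
  name: vdu-pod                            # placeholder name
  annotations:
    robin.runtime.num_cache_ways: "4"      # request 4 dedicated cache ways for this vDU Pod
spec:
  containers:
  - name: vdu
    image: registry.example.com/vdu:v1     # placeholder image
    resources:
      limits:
        cpu: "4"
        memory: 8Gi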
23.3.4. Improvements¶
23.3.4.1. Default replication factor and fault domain parameters for RWX PVCs¶
Starting with Robin CNP v5.3.11 HF2, for RWX PVCs, the default replication factor is 2 and the default faultdomain is host.
If you want to change the replication factor to 1, you can use the following parameter under annotations in the RWX PVC: robin.io/rwx_force_single_replica: "1".
Note
You must not use the robin.io/replication and robin.io/rwx_force_single_replica annotations together in a PVC.
The following is an example of a PVC file:
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
name: nfs-shared-1
annotations:
robin.io/nfs-server-type: "shared"
robin.io/replication: "2"
robin.io/faultdomain: "host"
spec:
storageClassName: robin
accessModes:
- ReadWriteMany
resources:
requests:
storage: 500Gi
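For comparison, the following is a sketch of an RWX PVC that forces a single replica using the robin.io/rwx_force_single_replica annotation described above; note that robin.io/replication is intentionally omitted, and the PVC name and size are placeholders.
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: nfs-shared-single                      # placeholder name
  annotations:
    robin.io/nfs-server-type: "shared"
    robin.io/rwx_force_single_replica: "1"     # overrides the default replication factor of 2
spec:
  storageClassName: robin
  accessModes:
  - ReadWriteMany
  resources:
    requests:
      storage: 500Gi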
23.3.4.2. IP-Pool Prefix as Environment Variable inside a Pod¶
Starting with Robin CNP v5.3.11 HF2, Robin CNP provides an IP-Pool prefix as an environment variable inside a Pod.
When you deploy a Robin bundle or Helm app and use Calico or OVS IP-Pool, Robin CNP adds an IP-Pool prefix as an environment variable inside the Pod.
Using the IP-Pool prefix environment variable, you can discover the IP prefix of an IP-Pool by using the command line.
Example:
# kubectl exec -it -n t001-u000004 c1-server-01 -- env | grep -i prefix
ROBIN_SAMPLE_IPPOOL_PREFIX=16
In the above example, the command output is ROBIN_SAMPLE_IPPOOL_PREFIX=16.
SAMPLE_IPPOOL is the IP-Pool name; it is prefixed with ROBIN and suffixed with the word PREFIX, and 16 is the IP prefix of the IP-Pool.
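Inside the container, the variable can be read like any other environment variable. For example, reusing the namespace and Pod from the example above:
# kubectl exec -it -n t001-u000004 c1-server-01 -- sh -c 'echo "prefix length: ${ROBIN_SAMPLE_IPPOOL_PREFIX}"'
prefix length: 16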
23.3.4.3. Robin StorageClass with runAsAny parameter¶
Robin CNP v5.3.11 HF2 provides a new parameter, runAsAny, in the StorageClass object to enable any user other than the root user to read or write to an NFS mountpoint of an RWX volume.
You can use this parameter in a scenario with multiple containers and different users, where you want to allow any user accessing the Pod (containers) to read or write to an NFS mountpoint of an RWX volume.
In the StorageClass object file, set the runAsAny parameter to "true".
The following is an example of a StorageClass with the runAsAny parameter:
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
name: run-as-any-imm
labels:
app.kubernetes.io/instance: robin
app.kubernetes.io/managed-by: robin.io
app.kubernetes.io/name: robin
provisioner: robin
reclaimPolicy: Delete
allowVolumeExpansion: true
volumeBindingMode: Immediate
parameters:
replication: '2'
media: HDD
runAsAny: "true"
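As a usage sketch, an RWX PVC can reference this StorageClass by name; the PVC name and size below are placeholders.
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: shared-data                   # placeholder name
spec:
  storageClassName: run-as-any-imm    # StorageClass from the example above
  accessModes:
  - ReadWriteMany
  resources:
    requests:
      storage: 100Gi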
23.3.4.4. Optimization of CPU and memory for Kubernetes Events¶
Robin CNP v5.3.11 HF2 is optimized to reduce the usage of CPU and memory when processing Kubernetes events.
23.3.5. Fixed Issues¶
Reference ID |
Description |
PP-25070 |
Vulnerability CVE-2021-41103 is related to the containerd runtime. The container root directories and some plugins had insufficiently restricted permissions, allowing unprivileged Linux users to traverse directory contents and execute programs. For more information about this vulnerability, see CVE-2021-41103. In this release, Robin.io has upgraded the containerd package to version 1.5.7 to address this vulnerability. |
PP-24947 |
In recent versions of Robin CNP, the source-based route was not configured properly for the first additional IP-Pool. This issue is fixed. |
PP-24938 |
After upgrading to Robin CNP v5.3.11 (HF1), the static IP address was not allocated to a Pod, and the Pod did not come up. This issue is fixed. |
PP-24796 |
The scheduler is unable to create a Pod within 30 seconds because Nmap showed that the requested static IP address was already in use due to an external firewall. This issue is fixed, and Nmap is disabled by default. You can enable Nmap to allow the Robin server to scan the network before IP address allocation by running the following command. # robin config update manager mutating_webhook_staticip_validation_enable true
|
PP-24776 |
The |
PP-24697 |
When a Pod with |
PP-24789 |
An inaccessible device does not get elected by marking the device offline when the node goes down. This issue is fixed now. |
PP-25116 |
When you bounce a Pod or reinstall an app, the deletion event might take time to complete because the earlier event is stuck at registration due to a missing API in the kubectl API resources. As a result, static IP address allocation fails. This issue is fixed. |
PP-25109 |
In a scenario where two MANAGER nodes become SLAVE nodes and wait for the third node to become the MASTER node, the third node can fail to become the MASTER due to internal issues. The other two nodes keep waiting for the third node to become the MASTER without checking whether any node is holding the MASTER lock. As a result, the control plane is down. This issue is fixed. |
PP-24645 |
The existence of a Instead of failing, Robin CNP now allows the node to continue with its transition to become MASTER. |
PP-25221 |
A Pod fails to come up to running state, and the |
23.3.6. Known Issues¶
Reference ID |
Description |
PP-25360 |
Symptom If containers in a Pod are using an RWX PVC and if they are stuck in the Workaround Delete Pods if they are part of a Deployment or StatefulSet. |
PP-24937 |
Symptom When upgrading to Robin CNP v5.3.11 HF2 from any supported versions using GoRobin, an error related to Paramiko might be encountered. Workaround Check the login credential of the Robin node or try restarting the sshd service on the Robin node. |
PP-25370 |
Symptom A Robin node in a cluster might go into the Workaround For workaround steps, see Troubleshooting a Robin Cluster Node with the NotReady State. |
PP-25430 |
Symptom After upgrading to Robin CNP v5.3.11 HF2, you might observe that the automatically restarted Pods containing Workaround Bounce such Pods manually. |
PP-25422 |
Symptom Upgrade to Robin v5.3.11 HF2 might fail because the robinds and node plugin Pods on one of the worker nodes are stuck in the Terminating state. Workaround Perform the following steps to recover all stale NFS mount points:
# kubectl scale --replicas=0 <replica set>
# umount -f -l <stale mount point>
# systemctl restart kubelet
# kubectl scale --replicas=<replica count> <replica set>
Note You must not delete the PVC or application Pod. |
PP-25425 |
Symptom When you create an application using an Application Ephemeral Volume (AEV) with Storage-Compute Affinity, the app creation might fail due to a missing robin.io/hostname tag on the host. Workaround Add the robin.io/hostname tag to the host. Run the following command to add the tag: # robin host add-tags <hostname> <tag>
Example: # robin host add-tags cscale-82-37.robinsystems.com
robin.io/hostname=cscale-82-37.robinsystems.com
|
PP-25296 |
Symptom When a cluster is recovered from a network partition, the Pods deployed on the worker nodes in the minority partition are redeployed in the majority partition. The Robin control plane is unable to access the worker nodes in the minority partition. The Pods and their volume mounts are cleaned up automatically when the network partition is resolved. In some cases, this automatic clean-up fails to remove the stale Pods on the worker nodes. Workaround Reboot the server. When the server restarts, the server will rejoin the cluster without the stale pods and volume mounts. |
PP-21832 |
Symptom After upgrading your cluster to Robin CNP v5.3.11HF2, if you notice that the cluster node is
in the Workaround To resolve this issue, complete the following steps:
/bin/cp /root/bin/robin-reboot.service /etc/systemd/system/robin-reboot.service
kernel_version=$(uname -r)
if [[ $kernel_version == "3.10"* ]]; then
sed -i "/EL8/d;s/#EL7//" /etc/systemd/system/robin-reboot.service
else
sed -i "/EL7/d;s/#EL8//" /etc/systemd/system/robin-reboot.service
fi
# systemctl restart kubelet
# systemctl restart dockershim
# docker restart robin-cri
|
PP-25286 |
Symptom When you try to resize a PDV to a large size, the resize task fails due to insufficient storage, and due to this resize task failure, the subsequent PDV resizing tasks also fail. Workaround For workaround steps, see PDV Resize issue. |
PP-25441 |
Symptom In Robin CNP v5.3.11, provisioning clones with Application Ephemeral Volumes (AEVs) is not supported. |
PP-25412 |
Symptom Storage-Compute affinity is not enforced on Application Ephemeral Volume (AEV) when storage is available on the same host. |
PP-25453 |
Symptom When upgrading to Robin CNP v5.3.11 HF2 from any supported version, CNS fails to execute post-Robin upgrade actions on one of the nodes as the IO Manager might be down and the node displays Apply the following workaround if you notice this issue. Workaround Run the following command on the node where you executed the upgrade command: ./<gorobin-binary> onprem post-upgrade-robin --hosts-json /root/hosts.json
--gorobintar <local location of GoRobin tar file> --robin-admin-user
<Robin Admin username> --robin-admin-passwd <Robin Admin password>
|
PP-25461 |
Symptom A Pod may not get allocated one or more of its static IP addresses under multiple conditions. If you discover that a Pod comes up without a static IP address, apply the following workaround. Workaround Bounce the Pods that are not allocated the required number of static IP addresses. |
PP-25423 |
Symptom After upgrading to Robin CNP v5.3.11HF2 from Robin CNP v5.3.5, application Pods might be in the If you notice this issue, apply the following workaround. Workaround
# kubectl describe pvc <pvc name> -n <ns>
# kubectl get all -n <ns>
# kubectl scale --replicas=0 <deployment/statefulset> -n <ns>
# kubectl scale --replicas=<n> <deployment/statefulset> -n <ns>
|
PP-25381 |
Symptom Robin CNP does not support rack fault domain for Application Ephemeral Volume (AEV). However, Robin CNP UI incorrectly displays the FaultDomain (Rack) option. |
PP-25467 |
Symptom You might observe that Kubelet is slow or unresponsive and periodically goes into the error state, resulting in issues with the Robin storage layer. This issue could be due to an orphaned Pod or Kubelet trying to mount a Pod on old PVCs. Workaround
# systemctl status kubelet -l
"orphaned pod pod_id found, but error not a directory occurred
when trying to remove the volumes dir"
# kubectl get pvc -A
# kubectl get pvc -A | grep <pvc_name>
# rm -rf /var/lib/kubelet/pods/<pod_id>
|
PP-25463 |
Symptom The volume mounts in a Pod fail due to duplicate FS UUIDs. Workaround A duplicate FS UUID is present when the device is already mounted on the same node. When a volume mount fails, the FS UUID is displayed in Perform the following steps to resolve the duplicate FS UUID.
# blkid | grep <FS UUID>
# mount | grep <device>
# umount <device>
After unmounting the device, the duplicate FS UUID conflict will be resolved. |
PP-25466 |
Symptom A Pod fails to come up because the volume is not accessible, and the volume is in the faulted state. Workaround You need to probe the robin host by running the following command: # robin host probe --all
|
PP-25508 |
Symptom When you try to modify an IP Pool, the modification process might fail with an error message due to missing values in the IP Pool. If you notice any error message when modifying the IP Pool, apply the following workaround. Workaround
# robin schedule update K8sResSync k8s_resource_sync 63072000
robin ip-pool info <name>
kubectl edit ripp <name> --validate=false
Example: In the following example, you need to add the missing values in the spec section:
[root@centos-60-205 ~]# kubectl edit ripp ovs-1 --validate=false
...
spec:
available: "15"
dns_search: domain.com
driver: ovs
gateway: fd74:ca9b:3a09:868c::1
ifcount: 1
name: ovs-1
nameserver: fd74:ca9b:3a09:868c:10:9:60:62
netmask: ffff:ffff:ffff:ffff:0000:0000:0000:0000
ranges:
- fd74:ca9b:3a09:868c:0010:0009:0109:0010-0020
- fd74:ca9b:3a09:868c:0010:0009:0109:0040-0050
subnet: fd74:ca9b:3a09:868c:0000:0000:0000:0000
used: "2"
zone: default
prefix: 64
vfdriver: xyz
vlan_number: 100
|
23.3.7. Appendix¶
23.3.7.1. Troubleshooting a Robin Cluster Node with the NotReady State¶
The following content is the workaround for PP-25370.
A Robin node in a cluster might go into the NotReady state when an RWX PVC’s mount path is not responding. This issue could occur due to several known internal Kubernetes issues.
The RWX PVC’s mount path may not respond due to the following issues/symptoms on your cluster. You can troubleshoot these issues and bring the node back to the Ready state. This section provides troubleshooting steps for the following issues:
NFS server’s service IP address entry in the conntrack table might go into the SYN_SENT or TIME_WAIT state
NFS Servers may not be ready
NFS Server Failover Issues
I/O hangs on the volume
With Robin v5.3.11 HF2, you might notice the NotReady state issue when you are upgrading from Robin v5.3.11 HF1 to Robin v5.3.11 HF2.
Troubleshoot NFS Server’s service IP address entry in the conntrack table in SYN_SENT or TIME_WAIT state
The Robin node could be in the NotReady state if the NFS Server’s service IP address entry in the conntrack table is in the SYN_SENT or TIME_WAIT state.
The following steps enable you to troubleshoot this issue and bring the node to the Ready state.
Run the following command to check whether your node is in the NotReady state when you notice any of the above-mentioned symptoms:
# kubectl get node <node name>
Example:
# kubectl get node hypervvm-61-46
NAME STATUS ROLES AGE VERSION
hypervvm-61-46 NotReady <none> 25h v1.21.5
Run the following command and grep the NFS server mount paths:
# mount|grep :/pvc
Copy the mount paths for verification from the command output.
Run the following command to check the status of the mount path:
# ls <nfsmount>
Example:
# ls /var/lib/kubelet/pods/25d256d5-e6cc-4865-a3ee-88640e0d1fc8/volumes/kubernetes.io~csi/pvc-210829ca-96d4-4a12-aab8-5646d087054d/mount
Note
If any mount paths do not respond or hang, you must check the status of conntrack.
You need the service IP of the NFS Server Pod for checking conntrack status.
Run the following command to get the NFS server Pod service IP address:
# mount|grep <pvc name>
Example:
# mount|grep pvc-210829ca-96d4-4a12-aab8-5646d087054d
[fd74:ca9b:3a09:868c:172:18:0:e23e]:/pvc-210829ca-96d4-4a12-aab8-5646d087054d on /var/lib/kubelet/pods/25d256d5-e6cc-4865-a3ee-88640e0d1fc8/volumes/kubernetes.io~csi/pvc-210829ca-96d4-4a12-aab8-5646d087054d/mount type nfs4 (rw,relatime,vers=4.0,rsize=1048576,wsize=1048576,namlen=255,hard,proto=tcp6,timeo=600,retrans=2,sec=sys,clientaddr=fd74:ca9b:3a09:868c:10:9:82:127,local_lock=none,addr=fd74:ca9b:3a09:868c:172:18:0:e23e)
Verify the conntrack entry state using the NFS server Pod IP address by running the following command:
# conntrack -L -d <NFS server Pod IP address>
Note
If you notice the conntrack entry state as SYN_SENT or TIME_WAIT, you need to delete the entry from the conntrack table to allow connections to the NFS service.
Example:
# conntrack -L -d fd74:ca9b:3a09:868c:172:18:0:e23e
tcp 6 110 SYN_SENT src=fd74:ca9b:3a09:868c:10:9:82:127 dst=fd74:ca9b:3a09:868c:172:18:0:e23e sport=980 dport=2049 [UNREPLIED] src=fd74:ca9b:3a09:868c:172:18:0:71d4 dst=fd74:ca9b:3a09:868c:10:9:82:127 sport=2049 dport=614 mark=0 use=1
conntrack v1.4.4 (conntrack-tools): 1 flow entries have been shown.
Run the following command to delete the SYN_SENT or TIME_WAIT entry:
# conntrack -D -d <NFS server Pod IP address>
Example:
# conntrack -D -d fd74:ca9b:3a09:868c:172:18:0:e23e
tcp 6 102 SYN_SENT src=fd74:ca9b:3a09:868c:10:9:82:127 dst=fd74:ca9b:3a09:868c:172:18:0:e23e sport=980 dport=2049 [UNREPLIED] src=fd74:ca9b:3a09:868c:172:18:0:71d4 dst=fd74:ca9b:3a09:868c:10:9:82:127 sport=2049 dport=614 mark=0 use=1
conntrack v1.4.4 (conntrack-tools): 1 flow entries have been deleted.
Note
After deleting the SYN_SENT or TIME_WAIT state from the conntrack, you should be able to access the NFS mount path.
Run the following command to verify mount path status.
# ls /var/lib/kubelet/pods/25d256d5-e6cc-4865-a3ee-88640e0d1fc8/volumes/kubernetes.io~csi/pvc-210829ca-96d4-4a12-aab8-5646d087054d/mount
Additional Troubleshooting Checks
If you have checked the NFS Server’s service IP address entry in the conntrack table for the SYN_SENT or TIME_WAIT state and your node is still in the NotReady state, you need to perform additional checks to troubleshoot the issue.
The following are some additional checks for troubleshooting the issue:
Check NFS Exports Status
Check NFS server failover Status
Check NFS server Pod is provisioned.
Check NFS Exports Status
All NFS exports must be in the READY state.
To check the NFS exports status, run the following command:
# robin nfs export-list
Example:
# robin nfs export-list
+--------------+-----------+------------------------------------------+---------------------+-----------------------------------------------------------------------+
| Export State | Export ID | Volume | NFS Server Pod | Export Clients |
+--------------+-----------+------------------------------------------+---------------------+-----------------------------------------------------------------------+
| READY | 7 | pvc-9b1ef05e-5e4a-4e6a-ab3e-f7c95d1ae920 | robin-nfs-shared-9 | ["hypervvm-61-48.robinsystems.com","hypervvm-61-43.robinsystems.com"] |
+--------------+-----------+------------------------------------------+---------------------+-----------------------------------------------------------------------+
Note
If NFS exports are not in the READY state, make sure the NFS server failover is enabled. Generally, it is enabled by default.
Check NFS server failover Status
NFS Server failover is enabled by default. However, you should check to confirm, and enable it if it is disabled.
To check NFS server failover status, run the following command:
# robin config list nfs|grep failover_enabled
nfs | failover_enabled
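If failover is disabled, it can presumably be re-enabled with the same robin config update syntax used elsewhere in these notes; treat the exact attribute path as an assumption and verify it against your CLI help:
# robin config update nfs failover_enabled true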
Check NFS server Pod is provisioned
To check whether the NFS server Pod is provisioned, run the following command:
# robin job list|grep -i NFSServerPodCreate|tail
Note
If all of these checks pass and the issue persists, it could be a bug in the NFS Server failover. To troubleshoot the NFS Server failover issue, see Troubleshoot NFS Server Failover Issues.
Troubleshoot NFS Server Failover Issues
A node could go into the NotReady state due to NFS Server failover issues as well, apart from other issues mentioned in this section.
Note
You can use the following steps even if your NFS Server has no issues but the PVC mount path is hung.
Before you troubleshoot the NFS Server failover issues, check the Troubleshoot NFS Server’s service IP address entry in the conntrack table in SYN_SENT or TIME_WAIT state and Additional Troubleshooting Checks.
To fix the NFS server failover issues, complete the following steps:
Run the following command to check if any NFS exports are in the ASSIGNED_ERR state and identify corresponding PVCs:
# robin nfs export-list
Run the following command to note the replica count in the deployment or StatefulSet:
# kubectl get all -n <ns>
Example:
# kubectl get all -n <ns>
...
NAME READY UP-TO-DATE AVAILABLE AGE
deployment.apps/app1 2/2 2 2 27h
NAME DESIRED CURRENT READY AGE
replicaset.apps/app1-5cbbc6d9db 2 2 2 27h
Run the following command to scale the application Pods that use those PVCs to 0:
Note
Do not delete the application.
Scaling down the application Pods will make sure that new Pods do not come up and will result in the NFS exports being cleared.
# kubectl scale --replicas=0 <deployment/statefulset> -n <ns>
Run the following command to check all NFS exports are healthy:
# robin nfs export-list
(Optional) Run the following command on the hung paths if you notice some mount paths are still hung:
# umount -f -l <hung nfs mount path>
Run the following command to check the node status:
# kubectl get node <node name>
Note
If you notice the node is still not showing the Ready state, wait for 2 minutes for kubelet to refresh the status.
If the status is still not showing Ready, stop and start kubelet by running the following commands:
# systemctl stop kubelet
# systemctl start kubelet
Check the node status again. If the status is Ready, then go to the last step.
# kubectl get node <node name>
If the node is still not in the Ready state or is flapping between Ready/NotReady, and you do not see any Pods in Kubernetes that are using the RWX PVC, the Pods may have been deleted by force from Kubernetes.
In this case, Kubernetes does not see the Pods, but Docker is still holding on to them. This is a rare case and occurs only when Pods are deleted forcefully.
In this case, run the following commands:
# docker rm <>
# systemctl restart docker
Run the following command to check the node status:
# kubectl get node <node name>
The node should be in the Ready state.
Run the following command to scale up the application Pods back to the original count that you noted earlier:
# kubectl scale --replicas=<n> <deployment/statefulset> -n <ns>
23.3.8. PDV Resize issue¶
When you try to resize a PDV to a larger size, the resize task can fail due to insufficient storage, and because of this failure, subsequent PDV resize tasks also fail.
If you face this issue, complete the following troubleshooting steps:
Run the following command to verify the PersistentDataVolumeResize job status:
# robin job info <Job ID>
Note
In the command output, notice the PersistentDataVolumeResize job failure; similarly, you might notice multiple failed volume expansion jobs. This is because Robin CNP continuously tries to allocate storage for volume expansion.
Example:
# robin job info 935
ID |Type | Desc | State | Start |End | Duration | Dependson| Error | Message
935 | PersistentDataVolumeResize | Resizing PersistentDataVolume 'test-pdv-202201102020041' from 108447924224 to 151G | COMPLETED|FAILED | 27 Jan 12:58:53 | 12:59:06 | 0:00:13 | [] | 1 | Unable to allocate storage for volume pvc-2a9ffb4e-fc25-4536-b700-501c2a7a8d80 of logical size 50.0G. Needed 51.56G of type HDD in physical space but found only 42.28G available. Check available capacity, maximum volume count, physical sector size and maintenance mode for the drives.
Run the following command to get information about the PV and PVC for the impacted PDV volumes:
# robin pdv list <PDV name>
Example:
# robin pdv list test-pdv-202201102020041
Name | Owner/Tenant | Access | Size | Media | PV | PVC
test-pdv-202201102020041 | u1/tenant1 | Private | 100G | HDD | pvc-2a9ffb4e-fc25-4536-b700-501c2a7a8d80 | t038-u000040/pvc-38-40-16420352860585
Save the PVC configuration file.
# kubectl get pvc -n <namespace> <pvc> -o yaml > <file.yaml>
Example:
# kubectl get pvc -n t002-u000006 pvc-38-40-16420352860585 -o yaml > pvc.yaml
Example PVC file:
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
annotations:
pv.kubernetes.io/bind-completed: "yes"
pv.kubernetes.io/bound-by-controller: "yes"
robin.io/faultdomain: host
robin.io/media: HDD
robin.io/replication: "2"
robin.io/rpool: default
volume.beta.kubernetes.io/storage-provisioner: robin
volume.kubernetes.io/storage-provisioner: robin
creationTimestamp: "2022-01-13T00:54:46Z"
finalizers:
- kubernetes.io/pvc-protection
labels:
robin.io/domain: ROBIN_PDV
robin.io/tenant: tenant1
robin.io/tenant_id: "38"
robin.io/user_id: "40"
robin.io/username: u1
name: pvc-38-40-16420352860585
namespace: t038-u000040
resourceVersion: "2378648"
uid: 2a9ffb4e-fc25-4536-b700-501c2a7a8d80
spec:
accessModes:
- ReadWriteMany
resources:
requests:
storage: 200Gi
storageClassName: robin-immediate
volumeMode: Filesystem
volumeName: pvc-2a9ffb4e-fc25-4536-b700-501c2a7a8d80
status:
accessModes:
- ReadWriteMany
capacity:
storage: 100Gi
conditions:
- lastProbeTime: null
lastTransitionTime: "2022-01-27T17:01:41Z"
status: "True"
type: Resizing
phase: Bound
Edit the PVC YAML file and remove the following attributes:
vi <file.yaml from Step 3>
In the metadata annotations, remove the following attributes:
pv.kubernetes.io/bind-completed
pv.kubernetes.io/bound-by-controller
In metadata, remove the following attributes:
creationTimestamp
resourceVersion
uid
Remove the complete status section in the PVC YAML file.
Edited PVC YAML file example:
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
annotations:
robin.io/faultdomain: host
robin.io/media: HDD
robin.io/replication: "2"
robin.io/rpool: default
volume.beta.kubernetes.io/storage-provisioner: robin
volume.kubernetes.io/storage-provisioner: robin
finalizers:
- kubernetes.io/pvc-protection
labels:
robin.io/domain: ROBIN_PDV
robin.io/tenant: tenant1
robin.io/tenant_id: "38"
robin.io/user_id: "40"
robin.io/username: u1
name: pvc-38-40-16420352860585
namespace: t038-u000040
spec:
accessModes:
- ReadWriteMany
resources:
requests:
storage: 200Gi
storageClassName: robin-immediate
volumeMode: Filesystem
volumeName: pvc-2a9ffb4e-fc25-4536-b700-501c2a7a8d80
In the edited YAML file, change the spec.resources.requests.storage size of the underlying Robin volume.
Mark the PersistentVolume (PV) that is bound to the PersistentVolumeClaim (PVC) with the Retain reclaim policy. This will prevent the underlying volume from being deleted when the PVC is deleted.
Example:
# kubectl patch pv pvc-2a9ffb4e-fc25-4536-b700-501c2a7a8d80 -p '{"spec":{"persistentVolumeReclaimPolicy":"Retain"}}'
persistentvolume/pvc-2a9ffb4e-fc25-4536-b700-501c2a7a8d80 patched
Patch the PVC so that Robin admission control does not prevent the PDV’s deletion.
Example:
# kubectl patch pvc -n t038-u000040 pvc-38-40-16420352860585 -p '{"metadata":{"labels": {"robin.io/override_delete_protection":"true"}}}'
persistentvolumeclaim/pvc-38-40-16420352860585 patched
Delete the PVC.
As the PV now has a Retain reclaim policy, you will not lose any data when the PVC is recreated.
Example:
# kubectl delete pvc -n t038-u000040 pvc-38-40-16420352860585
persistentvolumeclaim "pvc-38-40-16420352860585" deleted
Delete the claimRef entry from the PV spec, so the new PVC can bind to it. This should make the PV Available.
Example:
# kubectl patch pv pvc-2a9ffb4e-fc25-4536-b700-501c2a7a8d80 --type json -p '[{"op": "remove", "path": "/spec/claimRef"}]'
persistentvolume/pvc-2a9ffb4e-fc25-4536-b700-501c2a7a8d80 patched
Re-create the PVC with a required size without exceeding the available resources.
Note
The pvc.yaml file that was created above has the name of the existing PV set in the volumeName attribute. This will bind the new PVC to the existing PV.
Example:
# kubectl create -f pvc.yaml
persistentvolumeclaim/pvc-38-40-16420352860585 created
Restore the original reclaim policy of the PV.
Example:
# kubectl patch pv pvc-2a9ffb4e-fc25-4536-b700-501c2a7a8d80 -p '{"spec":{"persistentVolumeReclaimPolicy":"Delete"}}'
persistentvolume/pvc-2a9ffb4e-fc25-4536-b700-501c2a7a8d80 patched
23.3.9. Technical Support¶
Contact Robin Technical support for any assistance.
23.4. Robin Cloud Native Platform v5.3.11 HF3¶
The Robin CNP v5.3.11 HF3 release has a new feature and a known issue.
Release Date: 06 May 2022
23.4.1. Infrastructure Versions¶
The following software applications are included in this CNP release.
| Software Application | Version |
|---|---|
| Kubernetes | 1.21.5 |
| Docker | 19.03.9 |
| Prometheus | 2.16.0 |
| Prometheus-adapter | 0.9.1 |
| Node-exporter | 1.1.2 |
| Calico | 3.12.3 |
| HA-Proxy | 1.5.18 |
| PostgreSQL | 9.6.22 |
| Grafana | 6.5.3 |
23.4.2. Upgrade Path¶
The following is the supported upgrade path for Robin CNP v5.3.11 HF3:
Robin v5.3.11 (HF2) to Robin v5.3.11 (HF3)
23.4.3. New Feature¶
23.4.3.1. Support for Cisco DCNM E1000 Virtual Interface¶
Robin CNP 5.3.11 HF3 supports the Cisco Data Center Network Manager (DCNM) E1000 virtual network interface for KVMs. You can deploy the Cisco DCNM application on the Robin CNP cluster.
Note
The Cisco DCNM E1000 Virtual Interface is supported only on KVMs with an OVS IP Pool. You can configure the interface only using an input.yaml file. For more information, see here.
23.4.4. Known Issue¶
Reference ID |
Description |
PP-27192 |
Symptom In some rare scenarios, creating an app from a snapshot of a KVM app fails with this error: Failed to ping instance. Workaround Run the following command to restart the KVM instance: # robin instance restart <name>
|
23.4.5. Technical Support¶
Contact Robin Technical support for any assistance.
23.5. Robin Cloud Native Platform v5.3.11 HF4¶
The Robin CNP v5.3.11 HF4 release has improvements, a fixed issue, and known issues.
Release Date: 19 June 2022
23.5.1. Infrastructure Versions¶
The following software applications are included in this CNP release.
| Software Application | Version |
|---|---|
| Kubernetes | 1.21.5 |
| Docker | 19.03.9 |
| Prometheus | 2.16.0 |
| Prometheus-adapter | 0.9.1 |
| Node-exporter | 1.1.2 |
| Calico | 3.12.3 |
| HA-Proxy | 1.5.18 |
| PostgreSQL | 9.6.22 |
| Grafana | 6.5.3 |
23.5.2. Upgrade Path¶
The following is the supported upgrade path for Robin CNP v5.3.11 HF4:
Robin v5.3.11 (HF2) to Robin v5.3.11 (HF4)
23.5.3. Improvements¶
23.5.3.1. Rocky Linux Support¶
Robin CNP v5.3.11 HF4 supports Rocky Linux 8.6. You can install Robin CNP v5.3.11 HF4 on Rocky Linux 8.6 servers.
The following are the supported Rocky Linux 8.6 kernel versions:
4.18.0-372.9.1.rt7.166.el8.x86_64 (RT Kernel)
4.18.0-372.9.1.el8.x86_64 (Non-RT Kernel)
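You can check the kernel version on a server before installing, for example:
# uname -r
4.18.0-372.9.1.el8.x86_64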
23.5.3.2. Disable Init Containers and Sidecars in Bundle App¶
Robin CNP v5.3.11 HF4 supports disabling Init Containers and Sidecars in Robin Bundle apps using the input.yaml file when deploying the Bundle apps.
The following is a sample Robin Bundle file:
name: dpdk-intel
version: v1
icon: icon.png
snapshot: enabled
clone: enabled
roles:
- pktgen
pktgen:
name: pktgen
norootfs: true
image:
name: robinsys/dpdk-intel
version: v1
engine: docker
imagePullPolicy: IfNotPresent
entrypoint: entry.sh
compute:
memory: 1G
cpu:
reserve: true
cores: 2
initContainers:
- name: init1
image: 'robinsys/dpdk-intel:v1'
imagePullPolicy: IfNotPresent
resources:
limits:
cpu: 25m
memory: 128Mi
command:
- sleep
- '5'
sidecars:
- name: side1
image: 'robinsys/dpdk-intel:v1'
imagePullPolicy: IfNotPresent
command:
- /bin/bash
- '-c'
- trap 'exit 0' SIGTERM; while true; do sleep 1; done
resources:
limits:
memory: 200Mi
cpu: '1'
- name: side2
image: 'robinsys/dpdk-intel:v1'
imagePullPolicy: IfNotPresent
command:
- /bin/bash
- '-c'
- trap 'exit 0' SIGTERM; while true; do sleep 1; done
resources:
limits:
memory: 200Mi
cpu: '1'
Input YAML file for disabling Init Containers and Sidecars
The earlier Robin Bundle sample file has the side1 and side2 sidecars and the Init Container init1. Using the following sample input.yaml file, you can disable Init Containers and sidecars; in this example, the side1 sidecar and the Init Container init1 are disabled.
The following is the sample input.yaml file:
roles:
- name: pktgen
containers:
- name: side2
disabled: false
- name: side1
disabled: true
- name: init1
disabled: true
You can use the input.yaml file when creating an app using the Robin Bundle.
Syntax
Run the following command when creating an app using the Robin Bundle.
# robin app create from-bundle <appname> <bundleid> <yamlfile> --rpool <rpool> --wait
23.5.4. Fixed Issues¶
Reference ID |
Description |
PP-27304 |
The 503 error message appears due to timeout of the HAProxy. To fix this issue, you need to increase the timeout values of the HAProxy using the |
23.5.5. Known Issues¶
Reference ID |
Description |
PP-27400 |
Symptom The Workaround You can disable all repos by creating a backup folder and manually moving all repo files to it. Run the following commands to manually move all repo files:
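A minimal sketch of such commands, assuming the repos in question are the standard yum repo files under /etc/yum.repos.d (the backup path is a hypothetical example):
# mkdir /etc/yum.repos.d/backup
# mv /etc/yum.repos.d/*.repo /etc/yum.repos.d/backup/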
|
PP-27613 |
When you create an IP pool, the IP pool creation succeeds; however, the app creation fails with the error IP Pool does not exist. This issue occurs because the pool creation is registered on Kubernetes but fails to register in the database. Workaround
|
23.5.6. Technical Support¶
Contact Robin Technical support for any assistance.