23. Release Notes

23.1. Robin Cloud Native Platform v5.3.11

The Robin Cloud Native Platform (CNP) v5.3.11 release has new features, an improvement, a bug fix, and known issues.

Release Date: 10 November 2021

23.1.1. Infrastructure Versions

The following software applications are included in this CNP release.

Software Application    Version
Kubernetes              1.21.5
Docker                  19.03.9
Prometheus              2.16.0
Node-exporter           1.1.2
Calico                  3.12.3
HA-Proxy                1.5.18
PostgreSQL              9.6.11
Grafana                 6.5.3

23.1.2. Upgrade Path

The following is the supported upgrade path for Robin CNP v5.3.11:

  • Robin v5.3.9-286 (GA) to Robin v5.3.11-69 (GA)

23.1.3. New Features

23.1.3.1. Support for NVIDIA A100 MIG

Robin CNP v5.3.11 supports the Multi-Instance GPU (MIG) mode of operation for the NVIDIA A100 GPU. Robin allows you to use partitioned GPUs in Robin bundles and also supports chargeback functionality for these GPU partitions.
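For illustration, the following is a minimal Kubernetes Pod sketch that consumes a single A100 MIG partition. The nvidia.com/mig-1g.5gb resource name assumes the standard NVIDIA device plugin naming for the mixed MIG strategy and may differ in your environment; the Pod name and workload are placeholders.

apiVersion: v1
kind: Pod
metadata:
  name: mig-example                    # hypothetical name
spec:
  restartPolicy: Never
  containers:
  - name: cuda-workload
    image: nvidia/cuda:11.0-base
    command: ["nvidia-smi", "-L"]      # list the GPU/MIG devices visible to the container
    resources:
      limits:
        nvidia.com/mig-1g.5gb: 1       # request one 1g.5gb MIG slice (assumed resource name)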

23.1.3.2. Rocky Linux Support

Starting from Robin CNP v5.3.11, Rocky Linux 8.4 is supported. You can install Robin CNP v5.3.11 on this version of Linux.

23.1.3.3. Support for Application Ephemeral Volumes

Robin CNP v5.3.11 supports Application Ephemeral Volumes (AEVs). An AEV is temporary storage that Robin bundle applications can use. The AEV only exists when an application is running. When you create an application, the AEV is created for usage, and its space will be reclaimed back by Robin when the application stops. Other applications can use the reclaimed storage space. Robin provisions the storage space back to the application when it starts again.

You can add the AEVs only from the Robin UI when creating an application using a Robin bundle. Also, you can create templates of the application with AEV for future use. You can add a maximum of 10 AEVs per application.

23.1.4. Improvement

23.1.4.1. Support for @ Symbol in Robin Usernames

Starting with Robin CNP v5.3.11, you can use the @ symbol as part of Robin usernames. This enables you to use email addresses as usernames.

23.1.5. Fixed Issue

Reference ID

Description

PP-24202

The security issue with SSL Medium Strength Cipher Suites is fixed by configuring Strong Cipher Suites with keys longer than 128 bits in Robin CNP Services.

The following Strong Cipher Suites are supported:

  • TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256

  • TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384

  • TLS_ECDHE_RSA_WITH_CHACHA20_POLY1305

  • TLS_RSA_WITH_AES_128_CBC_SHA

  • TLS_RSA_WITH_AES_256_CBC_SHA

  • TLS_RSA_WITH_AES_128_GCM_SHA256

  • TLS_RSA_WITH_AES_256_GCM_SHA384

  • TLS_ECDHE_RSA_WITH_AES_256_CBC_SHA

The above-listed Strong Cipher Suites are supported in the following Robin CNP Services:

  • K8s API server

  • K8s controller manager

  • K8s scheduler

  • K8s kubelet

  • ROBIN UI https server

  • ROBIN event server

23.1.6. Known Issues

Reference ID

Description

PP-24270

Symptom

In Robin CNP, Kubelet might go down due to the presence of a stale cpu_manager_state file.

Workaround

Complete the following steps to fix this issue:

  1. Remove the stale /var/lib/kubelet/cpu_manager_state file using the following command:

# rm -rf /var/lib/kubelet/cpu_manager_state
  2. Restart the Kubelet by running the following command:

# systemctl restart kubelet
  3. Make sure etcd and apiserver Pods on this node are up and running.

PP-24248

Symptom

When you create a new resource pool, assign it to a node, and later deploy a Pod with storage affinity on that node, the Pod deployment fails because the node does not pick up the newly assigned resource pool.

Workaround

Complete the following steps to fix this issue:

  1. Run the following command to edit the node:

# kubectl edit node <node_name>
  2. Remove the robin.io/robinrpool resource pool.

  3. Add the correct resource pool name.

PP-22853

Symptom

GPUs might not be detected after CNP installation, upgrade, or addition of a new node.

Workaround

Run the following host probe rediscover command:

# robin host probe <hostname> --rediscover

PP-22626

Symptom

If NVIDIA GPU drivers are already installed on your setup, operator deployments might fail.

Workaround

Complete the following steps to fix this issue:

  1. # yum remove nvidia-driver-latest-dkms
    
  2. # yum remove nvidia-container-toolkit
    
  3. Reboot the node.

PP-21832

Symptom

After you reboot a node, it might be in the NotReady state.

Workaround

Complete the following steps to fix this issue:

  1. # systemctl restart kubelet
    
  2. # systemctl restart dockershim
    
  3. # docker restart robin-cri
    

PP-22781

Symptom

After removing a taint on a master node, GPUs are not detected automatically.

Workaround

You need to run the robin host probe --rediscover --all --wait command for the GPUs to be detected on the primary master node.

23.1.7. Technical Support

Contact Robin Technical support for any assistance.

23.2. Robin Cloud Native Platform v5.3.11 HF1

The Robin CNP v5.3.11 HF1 release has improvements, bug fixes, and known issues.

Release Date: 08 December 2021

23.2.1. Infrastructure Versions

The following software applications are included in this CNP release.

Software Application    Version
Kubernetes              1.21.5
Docker                  19.03.9
Prometheus              2.16.0
Node-exporter           1.1.2
Calico                  3.12.3
HA-Proxy                1.5.18
PostgreSQL              9.6.11
Grafana                 6.5.3

23.2.2. Upgrade Paths

The following are the supported upgrade paths for Robin CNP v5.3.11 HF1:

  • Robin v5.3.5 (HF3) to Robin v5.3.11 (HF1)

  • Robin v5.3.5 (HF5) to Robin v5.3.11 (HF1)

  • Robin v5.3.9 (GA) to Robin v5.3.11 (HF1)

  • Robin v5.3.11 (GA) to Robin v5.3.11 (HF1)

23.2.3. Improvements

23.2.3.1. Network Planning Support for Apps with Pod Affinity and Anti-affinity

Robin CNP v5.3.11 HF1 provides network planning support for apps with Pod affinity and anti-affinity.
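For context, the following is a minimal sketch of the kind of placement constraint the planner now accounts for; the Deployment name, label values, and image are hypothetical.

apiVersion: apps/v1
kind: Deployment
metadata:
  name: web
spec:
  replicas: 2
  selector:
    matchLabels:
      app: web
  template:
    metadata:
      labels:
        app: web
    spec:
      affinity:
        podAntiAffinity:
          # Spread the replicas across nodes; Robin network planning is applied to the resulting placement.
          requiredDuringSchedulingIgnoredDuringExecution:
          - labelSelector:
              matchLabels:
                app: web
            topologyKey: kubernetes.io/hostname
      containers:
      - name: web
        image: nginx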

23.2.3.2. Application Ephemeral Volumes UI Improvement

The APPLICATION EPHEMERAL STORAGE section of the Robin UI in Robin CNP v5.3.11 HF1 is improved to display the following drop-down options for the replication setting.

  • Storage-Compute Affinity

  • Not Replicated

  • Replicated (2 copies)

  • Replicated (3 copies)

Note

These options appear when you create AEVs from the Robin UI if the AEVs are not defined in a Robin bundle manifest file.

23.2.4. Fixed Issues

Reference ID

Description

PP-24650

Robin CNP v5.3.11 HF1 fixes the SWEET32 vulnerability.

PP-24528

The issue of the CNP planner assigning Pods without checking the status of network interfaces is fixed in this version. With Robin CNP v5.3.11 HF1, the planner skips network interfaces that are down.

PP-24428

The creation of an Application Ephemeral Volume (AEV) failed in a cloud deployment if the replication factor was not specified in the AEV specifications.

This issue is fixed in Robin CNP v5.3.11 HF1 by setting a default replication factor for each AEV when it is not provided.

PP-22941

A Pod did not come up successfully when no limits and requests were provided in the container resources section but a Robin annotation for network planning was provided. This issue is fixed.

PP-21983

A Pod controlled by a Deployment, StatefulSet, or DaemonSet sometimes failed to be created even when its IP address was not listed in robin ip-pool info --ip-allocations and no other running Pod in the cluster was using that IP address. This issue is fixed.

PP-24589

The issue of Calico’s CIDR value not being set correctly when updating a Calico IP pool in an IPv6 Robin CNP setup is fixed.

PP-24313

The robin bundle add command was incorrectly storing the bundle files in the file-object directory of the log collection service instead of the file-object directory of the file collection service. This issue is fixed.

23.2.5. Known Issues

Reference ID

Description

PP-24697

Symptom

Network attachment definitions (net-attach-def) might not be cleaned up when a Pod is bounced and recreated from the webhook start phase. If this occurs, follow this workaround.

Workaround

You must delete the net-attach-def that is not cleaned up.

To delete, run the following command:

# kubectl delete net-attach-def <net-attach-def-name> -n <namespace>

PP-24600

When you deploy a Pod using Kubernetes Deployment, StatefulSet, or DaemonSet, in some scenarios, the deployment process might take a longer time than usual due to the exponential back-off delay during scheduling. This is a known behavior with Kubernetes.

23.2.6. Technical Support

Contact Robin Technical support for any assistance.

23.3. Robin Cloud Native Platform v5.3.11 HF2

The Robin CNP v5.3.11 HF2 release has a new feature, improvements, bug fixes, and known issues.

Release Date: 02 February 2022

23.3.1. Infrastructure Versions

The following software applications are included in this CNP release.

Software Application    Version
Kubernetes              1.21.5
Docker                  19.03.9
Prometheus              2.16.0
Prometheus-adapter      0.9.1
Node-exporter           1.1.2
Calico                  3.12.3
HA-Proxy                1.5.18
PostgreSQL              9.6.22
Grafana                 6.5.3

23.3.2. Upgrade Paths

The following are the supported upgrade paths for Robin CNP v5.3.11 HF2:

  • Robin v5.3.5-232 (HF5) to Robin v5.3.11 HF2

  • Robin v5.3.7-120 (HF1) to Robin v5.3.11 HF2

  • Robin v5.3.11-104 (HF1) to Robin v5.3.11 HF2

23.3.3. New Feature

23.3.3.1. Intel Cache Allocation Technology Support for vDU Pods

Robin CNP v5.3.11 HF2 supports the Intel Cache Allocation Technology (CAT) to deploy 4G vDU (Virtual Distributed Unit) Pods for a single non-uniform memory access (NUMA) node host.

The Intel CAT enables vDUs in a 4G environment to access the CPUs on the host using dedicated cache lines/ways. When you deploy a vDU Pod using Robin CNP, the vDUs get a dedicated number of cache lines/ways configured on the host to access the CPUs.

You can use the following annotation in your Pod YAML file for requesting cache lines/ways from the CPU for vDUs.

Example:

# "robin.runtime.num_cache_ways": "4"

Note

You must use the annotation at the Pod level only.
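The following is a minimal Pod sketch that requests four cache lines/ways through this annotation; everything other than the robin.runtime.num_cache_ways annotation is a generic placeholder.

apiVersion: v1
kind: Pod
metadata:
  name: vdu-example                        # hypothetical name
  annotations:
    robin.runtime.num_cache_ways: "4"      # request 4 dedicated cache lines/ways (Pod level only)
spec:
  containers:
  - name: vdu
    image: registry.example.com/vdu:latest # hypothetical image
    resources:
      limits:
        cpu: "4"
        memory: 8Gi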

23.3.4. Improvements

23.3.4.1. Default replication factor and fault domain parameters for RWX PVCs

Starting with Robin CNP v5.3.11 HF2, for RWX PVCs, the default replication factor is 2 and the default faultdomain is host.

If you want to change the replication factor to 1, you can use the following parameter under annotations in the RWX PVC: robin.io/rwx_force_single_replica: "1".

Note

You must not use the robin.io/replication and robin.io/rwx_force_single_replica annotations together in a PVC.

The following is an example of a PVC file:

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
   name: nfs-shared-1
   annotations:
      robin.io/nfs-server-type: "shared"
      robin.io/replication: "2"
      robin.io/faultdomain: "host"
spec:
   storageClassName: robin
   accessModes:
      - ReadWriteMany
   resources:
      requests:
         storage: 500Gi
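For comparison, the following is a minimal sketch of an RWX PVC that forces a single replica using the robin.io/rwx_force_single_replica annotation described above. The robin.io/replication annotation is intentionally omitted because the two annotations must not be used together; the PVC name is a placeholder.

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
   name: nfs-shared-single
   annotations:
      robin.io/nfs-server-type: "shared"
      robin.io/rwx_force_single_replica: "1"
spec:
   storageClassName: robin
   accessModes:
      - ReadWriteMany
   resources:
      requests:
         storage: 500Gi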

23.3.4.2. IP-Pool Prefix as Environment Variable inside a Pod

Starting with Robin CNP v5.3.11 HF2, Robin CNP provides an IP-Pool prefix as an environment variable inside a Pod.

When you deploy a Robin bundle or Helm app and use Calico or OVS IP-Pool, Robin CNP adds an IP-Pool prefix as an environment variable inside the Pod.

Using the IP-Pool prefix environment variable, you can discover the IP prefix of an IP-Pool by using the command line.

Example:

# kubectl exec -it -n t001-u000004   c1-server-01 -- env | grep -i prefix
ROBIN_SAMPLE_IPPOOL_PREFIX=16

In the above command output, ROBIN_SAMPLE_IPPOOL_PREFIX=16 reads as follows: SAMPLE_IPPOOL is the IP-Pool name, prefixed with ROBIN and suffixed with the word PREFIX, and 16 is the IP prefix of the IP-Pool.

23.3.4.3. Robin StorageClass with runAsAny parameter

Robin CNP v5.3.11 HF2 provides a new parameter runAsAny in the StorageClass object to enable any user other than the root user to read or write to an NFS mountpoint of an RWX volume.

You can use this parameter in a scenario with multiple containers and different users, and you want to allow any user accessing the Pod (containers) to read or write to an NFS mountpoint of an RWX volume.

In the StorageClass object file, set the runAsAny parameter to "true".

The following is an example of the StorageClass with runAsAny parameter:

apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: run-as-any-imm
  labels:
    app.kubernetes.io/instance: robin
    app.kubernetes.io/managed-by: robin.io
    app.kubernetes.io/name: robin
provisioner: robin
reclaimPolicy: Delete
allowVolumeExpansion: true
volumeBindingMode: Immediate
parameters:
  replication: '2'
  media: HDD
  runAsAny: "true"

23.3.4.4. Optimization of CPU and memory for Kubernetes Events

Robin CNP v5.3.11 HF2 is optimized to reduce the usage of CPU and memory when processing Kubernetes events.

23.3.5. Fixed Issues

Reference ID

Description

PP-25070

Vulnerability CVE-2021-41103 is related to the containerd runtime: container root directories and some plugins had insufficiently restricted permissions, allowing unprivileged Linux users to traverse directory contents and execute programs. For more information about this vulnerability, see CVE-2021-41103.

In this release, Robin.io has upgraded the containerd package to version 1.5.7 to address this vulnerability.

PP-24947

In recent versions of Robin CNP, the source-based route was not configured properly for the first additional IP-Pool. This issue is fixed.

PP-24938

After upgrading to Robin CNP v5.3.11 (HF1), the static IP address was not allocated to a Pod, and the Pod did not come up. This issue is fixed.

PP-24796

The scheduler was unable to create a Pod within 30 seconds because Nmap reported that the requested static IP address was already in use due to an external firewall. This issue is fixed, and Nmap is now disabled by default.

You can enable Nmap to allow the Robin server to scan the network before IP address allocation by running the following command.

# robin config update manager mutating_webhook_staticip_validation_enable true

PP-24776

The robin ip-pool add command with the nictags pci_addr option is failing in Robin CNP v5.3.11 HF1. This issue is fixed.

PP-24697

When a Pod with robin.io/networks annotation is deleted, the Network attachment definitions (net-attach-defs) entries are left behind. You need to manually delete these entries. This issue is fixed.

PP-24789

An inaccessible device was not being excluded from selection by marking it offline when its node went down. This issue is now fixed.

PP-25116

When you bounce a Pod or reinstall an app, the deletion event could take a long time to complete because the earlier event was stuck at registration due to an API missing from the kubectl API resources. As a result, static IP address allocation was failing. This issue is fixed.

PP-25109

In a scenario where two MANAGER nodes became SLAVE nodes and waited for the third node to become the MASTER node, the third node failed to become MASTER due to internal issues.

The other two nodes kept waiting for the third node to become MASTER without checking whether any node was holding the MASTER lock. As a result, the control plane was down. This issue is fixed.

PP-24645

The existence of a recovery.conf file in the PostgreSQL data directory was preventing a Manager node from becoming MASTER during a high availability transition. This issue is fixed.

Instead of failing, Robin CNP now allows the node to continue with its transition to become MASTER.

PP-25221

A Pod failed to come up to the Running state, and the kubectl describe pod <pod name> -n <namespace> command showed an error that the network CNI plugin failed to set up the Pod. This issue is fixed.

23.3.6. Known Issues

Reference ID

Description

PP-25360

Symptom

If containers in a Pod are using an RWX PVC and if they are stuck in the ContainerCreating state for a long time and display a timeout error, apply the following workaround.

Workaround

Delete Pods if they are part of a Deployment or StatefulSet.

PP-24937

Symptom

When upgrading to Robin CNP v5.3.11 HF2 from any supported version using GoRobin, an error related to Paramiko might be encountered.

Workaround

Check the login credentials of the Robin node or try restarting the sshd service on the Robin node.

PP-25370

Symptom

A Robin node in a cluster might go into the NotReady state when an RWX PVC’s mount path is not responding. This issue could occur due to several internal Kubernetes known issues.

Workaround

For workaround steps, see Troubleshooting a Robin Cluster Node with the NotReady State.

PP-25430

Symptom

After upgrading to Robin CNP v5.3.11 HF2, you might observe that automatically restarted Pods containing the robin.io/networks annotation do not have their secondary IP addresses.

Workaround

Bounce such Pods manually.
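For example, for a Pod managed by a Deployment, StatefulSet, or DaemonSet, deleting the Pod causes the controller to recreate it so that it can pick up the secondary IP addresses:

# kubectl delete pod <pod name> -n <namespace>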

PP-25422

Symptom

An upgrade to Robin v5.3.11 HF2 might fail because the robinds and node plugin Pods on one of the worker nodes are stuck in the Terminating state.

Workaround

Perform the following steps to recover all stale NFS mount points:

  1. Make sure that no application Pod uses the PVC.

  2. Run the following command to scale down the replica count to 0.

# kubectl scale --replicas=0 <replica set>
  3. Run the following command to unmount the stale mount point.

# umount -f -l <stale mount point>
  4. When all stale NFS mount points on a node are recovered, run the following command to restart Kubelet.

# systemctl restart kubelet
  5. Scale up the replicas back to the original count.

# kubectl scale --replicas=<replica count> <replica set>

Note

You must not delete the PVC or application Pod.

PP-25425

Symptom

When you create an application using Application Ephemeral Volume (AEV) with Storage-Compute Affinity, the app creation might fail due to the missing robin.io/hostname tag on the host.

Workaround

Add the robin.io/hostname tag to the host.

Run the following command to add the tag:

# robin host add-tags <hostname> <tag>

Example:

# robin host add-tags cscale-82-37.robinsystems.com
robin.io/hostname=cscale-82-37.robinsystems.com

PP-25296

Symptom

When a cluster is recovered from a network partition, the Pods deployed on the worker nodes in the minority partition are redeployed in the majority partition. The Robin control plane is unable to access the worker nodes in the minority partition. The Pods and their volume mounts are cleaned up automatically when the network partition is resolved. In some cases, this automatic clean-up fails to remove the stale Pods on the worker nodes.

Workaround

Reboot the server. When the server restarts, the server will rejoin the cluster without the stale pods and volume mounts.

PP-21832

Symptom

After upgrading your cluster to Robin CNP v5.3.11 HF2, if you notice that a cluster node is in the NotReady state after you reboot the cluster, you must apply the following workaround on all nodes of the cluster.

Workaround

To resolve this issue, complete the following steps:

  1. Run the following commands.

/bin/cp /root/bin/robin-reboot.service /etc/systemd/system/robin-reboot.service
kernel_version=$(uname -r)
if [[ $kernel_version == "3.10"* ]]; then
   sed -i "/EL8/d;s/#EL7//" /etc/systemd/system/robin-reboot.service
else
   sed -i "/EL7/d;s/#EL8//" /etc/systemd/system/robin-reboot.service
fi
  2. Run the following command to restart Kubelet.

# systemctl restart kubelet
  3. Run the following command to restart Dockershim.

# systemctl restart dockershim
  4. Run the following command to restart robin-cri.

# docker restart robin-cri

PP-25286

Symptom

When you try to resize a PDV to a large size, the resize task fails due to insufficient storage, and due to this resize task failure, the subsequent PDV resizing tasks also fail.

Workaround

For workaround steps, see PDV Resize issue.

PP-25441

Symptom

In Robin CNP v5.3.11, provisioning clones with Application Ephemeral Volumes (AEVs) is not supported.

PP-25412

Symptom

Storage-Compute affinity is not enforced on an Application Ephemeral Volume (AEV) even when storage is available on the same host.

PP-25453

Symptom

When upgrading to Robin CNP v5.3.11 HF2 from any supported version, CNS fails to execute post-upgrade actions on one of the nodes because the IO Manager might be down and the node displays the NotReady state.

Apply the following workaround if you notice this issue.

Workaround

Run the following command on the node where you executed the upgrade command:

./<gorobin-binary> onprem post-upgrade-robin --hosts-json /root/hosts.json
--gorobintar <local location of GoRobin tar file> --robin-admin-user
<Robin Admin username> --robin-admin-passwd <Robin Admin password>

PP-25461

Symptom

A Pod may fail to be allocated one or more static IP addresses under several conditions. If you discover that a Pod comes up without a static IP address, apply the following workaround.

Workaround

Bounce the Pods that are not allocated the required number of static IP addresses.

PP-25423

Symptom

After upgrading to Robin CNP v5.3.11 HF2 from Robin CNP v5.3.5, application Pods might be stuck in the ContainerCreating state.

If you notice this issue, apply the following workaround.

Workaround

  1. Using the RWX PVC, find the Deployment or StatefulSet and the Pod details by running the following command:

# kubectl describe pvc <pvc name> -n <namespace>
  2. Note the replica count by running the following command:

# kubectl get all -n <ns>
  3. Scale the replicas to 0 by running the following command.

# kubectl scale --replicas=0 <deployment/statefulset> -n <ns>
  4. Observe that the Pod is terminated and does not exist anymore.

  5. Scale the replicas back to the count that you noted in step 2. Use the following command to scale up.

# kubectl scale --replicas=<n> <deployment/statefulset> -n <ns>

PP-25381

Symptom

Robin CNP does not support rack fault domain for Application Ephemeral Volume (AEV). However, Robin CNP UI incorrectly displays the FaultDomain (Rack) option.

PP-25467

Symptom

You might observe that Kubelet is slow or unresponsive and periodically goes into the error state, resulting in issues with the Robin storage layer. This issue could be due to an orphan Pod or Kubelet trying to mount a Pod on old PVCs.

Workaround

  1. Run the following command to check the status of Kubelet.

# systemctl status kubelet -l
  2. In the command output, find the following message.

"orphaned pod pod_id found, but error not a directory occurred
 when trying to remove the volumes dir"
  3. Run the following command to find the PVC name from the node.

# kubectl get pvc -A
  4. Run the following command to check whether the PVC exists.

# kubectl get pvc -A | grep <pvc_name>
  5. If the PVC does not exist, delete the directory by running the following command.

# rm -rf /var/lib/kubelet/pods/<pod_id>

PP-25463

Symptom

The volume mounts in a Pod fail due to duplicate FS UUIDs.

Workaround

A duplicate FS UUID is present when the device is already mounted on the same node. When a volume mount fails, the FS UUID is displayed in syslog/dmesg.

Perform the following steps to resolve the duplicate FS UUID.

  1. Run the following command to check whether any device has the same FS UUID:

# blkid | grep <FS UUID>
  2. Run the following command to check whether the device is mounted:

# mount | grep <device>
  3. If the device is mounted, run the following command to unmount it:

# umount <device>

After unmounting the device, the duplicate FS UUID conflict will be resolved.

PP-25466

Symptom

A Pod fails to come up because the volume is not accessible, and the volume is in the faulted state.

Workaround

You need to probe the robin host by running the following command:

# robin host probe --all

PP-25508

Symptom

When you try to modify an IP Pool, the modification process might fail with an error message due to missing values in the IP Pool.

If you notice any error message when modifying the IP Pool, apply the following workaround.

Workaround

  1. Run the following command only once on the cluster.

# robin schedule update K8sResSync k8s_resource_sync 63072000
  2. Run the following command and make a note of the IP Pool values.

# robin ip-pool info <name>
  3. Run the following command to update missing values in the IP Pool that you noted in the previous step.

# kubectl edit ripp <name> --validate=false

Example:

In the following example, you need to add the missing values in the spec: section. You do not need to update all values; for example, for a network-based IP-Pool, the prefix field is not required. Similarly, you can ignore any values that are not required.

[root@centos-60-205 ~]# kubectl edit ripp ovs-1 --validate=false
   ...
   spec:
      available: "15"
      dns_search: domain.com
      driver: ovs
      gateway: fd74:ca9b:3a09:868c::1
      ifcount: 1
      name: ovs-1
      nameserver: fd74:ca9b:3a09:868c:10:9:60:62
      netmask: ffff:ffff:ffff:ffff:0000:0000:0000:0000
      ranges:
      - fd74:ca9b:3a09:868c:0010:0009:0109:0010-0020
      - fd74:ca9b:3a09:868c:0010:0009:0109:0040-0050
      subnet: fd74:ca9b:3a09:868c:0000:0000:0000:0000
      used: "2"
      zone: default
      prefix: 64
      vfdriver: xyz
      vlan_number: 100
  4. Rerun the failed IP Pool command to verify.

23.3.7. Appendix

23.3.7.1. Troubleshooting a Robin Cluster Node with the NotReady State

The following content is the workaround for PP-25370.

A Robin node in a cluster might go into the NotReady state when an RWX PVC’s mount path is not responding. This issue could occur due to several internal Kubernetes known issues.

The RWX PVC’s mount path may not respond due to the following issues or symptoms on your cluster. You can troubleshoot these issues and bring the node back to the Ready state. This section provides troubleshooting steps for the following issues:

  • NFS server’s service IP address entry in the conntrack table might go into SYN_SENT or TIME_WAIT state

  • NFS Servers may not be ready

  • NFS Server Failover Issues

  • I/O hangs on the volume

With Robin v5.3.11 HF2, you might notice the NotReady state issue when you are upgrading from Robin v5.3.11 HF1 to Robin v5.3.11 HF2.

  • Troubleshoot NFS Server’s service IP address entry in the conntrack table in SYN_SENT or TIME_WAIT state

The Robin node could be in the NotReady state if the NFS Server’s service IP address entry in the conntrack table is in the SYN_SENT or TIME_WAIT state.

The following steps enable you to troubleshoot this issue and bring the node to the Ready state.

  1. When you notice any of the above-mentioned symptoms, run the following command to check whether your node is in the NotReady state:

# kubectl get node <node name>

Example:

# kubectl  get node hypervvm-61-46
NAME             STATUS     ROLES    AGE   VERSION
hypervvm-61-46   NotReady   <none>   25h   v1.21.5
  2. Run the following command and grep the NFS server mount paths:

# mount|grep :/pvc
  3. Copy the mount paths for verification from the command output.

  4. Run the following command to check the status of the mount path:

# ls <nfsmount>

Example:

# ls /var/lib/kubelet/pods/25d256d5-e6cc-4865-a3ee-88640e0d1fc8/volumes/kubernetes.io~csi/pvc-210829ca-96d4-4a12-aab8-5646d087054d/mount

Note

If any mount paths do not respond or hang, you must check the status of conntrack.

You need the service IP of the NFS Server Pod for checking conntrack status.

  5. Run the following command to get the NFS server Pod service IP address:

# mount|grep <pvc name>

Example:

# mount|grep pvc-210829ca-96d4-4a12-aab8-5646d087054d

[fd74:ca9b:3a09:868c:172:18:0:e23e]:/pvc-210829ca-96d4-4a12-aab8-5646d087054d on /var/lib/kubelet/pods/25d256d5-e6cc-4865-a3ee-88640e0d1fc8/volumes/kubernetes.io~csi/pvc-210829ca-96d4-4a12-aab8-5646d087054d/mount type nfs4 (rw,relatime,vers=4.0,rsize=1048576,wsize=1048576,namlen=255,hard,proto=tcp6,timeo=600,retrans=2,sec=sys,clientaddr=fd74:ca9b:3a09:868c:10:9:82:127,local_lock=none,addr=fd74:ca9b:3a09:868c:172:18:0:e23e)
  6. Verify the conntrack entry state using the NFS server Pod IP address by running the following command:

# conntrack -L -d <NFS server Pod IP address>

Note

If the conntrack entry state is SYN_SENT or TIME_WAIT, you need to delete the entry from the conntrack table to allow connections to the NFS service.

Example:

# conntrack -L -d fd74:ca9b:3a09:868c:172:18:0:e23e
tcp      6 110 SYN_SENT src=fd74:ca9b:3a09:868c:10:9:82:127 dst=fd74:ca9b:3a09:868c:172:18:0:e23e sport=980 dport=2049 [UNREPLIED] src=fd74:ca9b:3a09:868c:172:18:0:71d4 dst=fd74:ca9b:3a09:868c:10:9:82:127 sport=2049 dport=614 mark=0 use=1
conntrack v1.4.4 (conntrack-tools): 1 flow entries have been shown.
  7. Run the following command to delete the SYN_SENT or TIME_WAIT entry:

# conntrack -D -d <NFS server Pod IP address>

Example:

# conntrack -D -d fd74:ca9b:3a09:868c:172:18:0:e23e
tcp      6 102 SYN_SENT src=fd74:ca9b:3a09:868c:10:9:82:127 dst=fd74:ca9b:3a09:868c:172:18:0:e23e sport=980 dport=2049 [UNREPLIED] src=fd74:ca9b:3a09:868c:172:18:0:71d4 dst=fd74:ca9b:3a09:868c:10:9:82:127 sport=2049 dport=614 mark=0 use=1
conntrack v1.4.4 (conntrack-tools): 1 flow entries have been deleted.

Note

After deleting the SYN_SENT or TIME_WAIT entry from the conntrack table, you should be able to access the NFS mount path.

  8. Run the following command to verify the mount path status.

# ls /var/lib/kubelet/pods/25d256d5-e6cc-4865-a3ee-88640e0d1fc8/volumes/kubernetes.io~csi/pvc-210829ca-96d4-4a12-aab8-5646d087054d/mount
  • Additional Troubleshooting Checks

If you have checked the NFS Server’s service IP address entry in the conntrack table for the SYN_SENT or TIME_WAIT state and your node is still in the NotReady state, you need to perform additional checks to troubleshoot the issue.

The following are some additional checks for troubleshooting the issue:

  • Check NFS Exports Status

  • Check NFS server failover Status

  • Check NFS server Pod is provisioned.

  • Check NFS Exports Status

All NFS exports must be in the READY state.

To check the NFS exports status, run the following command:

# robin nfs export-list

Example:

# robin nfs export-list
+--------------+-----------+------------------------------------------+---------------------+-----------------------------------------------------------------------+
| Export State | Export ID |                  Volume                  |    NFS Server Pod   |                             Export Clients                            |
+--------------+-----------+------------------------------------------+---------------------+-----------------------------------------------------------------------+
|    READY     |     7     | pvc-9b1ef05e-5e4a-4e6a-ab3e-f7c95d1ae920 |  robin-nfs-shared-9 | ["hypervvm-61-48.robinsystems.com","hypervvm-61-43.robinsystems.com"] |
+--------------+-----------+------------------------------------------+---------------------+-----------------------------------------------------------------------+

Note

If NFS exports are not in the READY state, make sure the NFS server failover is enabled. Generally, it is enabled by default.

  • Check NFS server failover Status

NFS Server failover is enabled by default. However, you should check to confirm and enable it if it is disabled.

To check NFS server failover status, run the following command:

# robin config list nfs|grep failover_enabled
nfs     | failover_enabled
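If failover turns out to be disabled, the following sketch shows how it could be enabled, assuming the nfs section and failover_enabled attribute shown above accept a boolean value; verify the expected value format before applying:

# robin config update nfs failover_enabled true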
  • Check NFS server Pod is provisioned

To check whether the NFS server Pod is provisioned, run the following command:

# robin job list|grep -i NFSServerPodCreate|tail

Note

If all of these checks are fine, then it could be a bug in the NFS Server Failover. To troubleshoot the NFS Server failover issue, see Troubleshoot NFS Server Failover Issues.

  • Troubleshoot NFS Server Failover Issues

A node could go to the NotReady state due to NFS Server failover issues as well, apart from other issues mentioned in this section.

Note

  1. You can use the following steps even if your NFS Server has no issues but the PVC mount path is hung.

  2. Before you troubleshoot the NFS Server failover issues, check the Troubleshoot NFS Server’s service IP address entry in the conntrack table in SYN_SENT or TIME_WAIT state and Additional Troubleshooting Checks.

To fix the NFS server failover issues, complete the following steps:

  1. Run the following command to check if any NFS exports are in the ASSIGNED_ERR state and identify corresponding PVCs:

# robin nfs export-list
  2. Run the following command to note the replica count in the Deployment or StatefulSet:

# kubectl get all -n <ns>

Example:

# kubectl get all -n <ns>
...
NAME                   READY   UP-TO-DATE   AVAILABLE   AGE
deployment.apps/app1   2/2     2            2           27h

NAME                              DESIRED   CURRENT   READY   AGE
replicaset.apps/app1-5cbbc6d9db   2         2         2       27h
  3. Run the following command to scale the application Pods that use those PVCs to 0:

Note

Do not delete the application.

Scaling down the application Pods will make sure that new Pods do not come up and results in the NFS exports being cleared.

# kubectl scale --replicas=0 <deployment/statefulset> -n <ns>
  4. Run the following command to check that all NFS exports are healthy:

# robin nfs export-list
  5. (Optional) Run the following command on the hung paths if you notice some mount paths are still hung:

# umount -f -l <hung nfs mount path>
  6. Run the following command to check the node status:

# kubectl get node <node name>

Note

If you notice the node is still not showing the Ready state, wait for 2 minutes for kubelet to refresh the status.

If the status is still not showing Ready, stop and start kubelet by running the following commands:

# systemctl stop kubelet

# systemctl start kubelet
  7. Check the node status again. If the status is Ready, then go to the last step.

# kubectl get node <node name>
  8. If the node is still not in the Ready state or is flapping between Ready and NotReady, and you do not see any Pods in Kubernetes that are using the RWX PVC, the Pods may have been deleted by force from Kubernetes.

In this case, Kubernetes does not see the Pods, but Docker is still holding on to them. This is a rare case and occurs only when Pods are deleted forcefully.

If so, run the following commands:

# docker rm <>

# systemctl restart docker
  9. Run the following command to check the node status:

# kubectl get node <node name>

The node should be in the Ready state.

  10. Run the following command to scale up the application Pods back to the original count that you noted earlier:

# kubectl scale --replicas=<n> <deployment/statefulset> -n <ns>

23.3.8. PDV Resize issue

When you try to resize a PDV to a large size, the resize task fails due to insufficient storage, and due to this resize task failure, the subsequent PDV resizing tasks also fail.

If you face this issue, complete the following troubleshooting steps:

  1. Run the following command to verify the PersistentDataVolumeResize job status:

# robin job info <Job ID>

Note

In the command output, notice the PersistentDataVolumeResize job failure; you might also notice multiple failed volume expansion jobs. This is because Robin CNP continuously tries to allocate storage for the volume expansion.

Example:

# robin job info 935
ID  | Type                       | Desc                                                                                | State  | Start           | End      | Duration | Dependson | Error | Message
935 | PersistentDataVolumeResize | Resizing PersistentDataVolume 'test-pdv-202201102020041' from 108447924224 to 151G | FAILED | 27 Jan 12:58:53 | 12:59:06 | 0:00:13  | []        | 1     | Unable to allocate storage for volume pvc-2a9ffb4e-fc25-4536-b700-501c2a7a8d80 of logical size 50.0G. Needed 51.56G of type HDD in physical space but found only 42.28G available. Check available capacity, maximum volume count, physical sector size and maintenance mode for the drives.
  2. Run the following command to get information about the PV and PVC for the impacted PDV volumes:

# robin pdv list <PDV name>

Example:

# robin pdv list test-pdv-202201102020041
Name                     | Owner/Tenant | Access  | Size | Media | PV                                       | PVC
test-pdv-202201102020041 | u1/tenant1   | Private | 100G | HDD   | pvc-2a9ffb4e-fc25-4536-b700-501c2a7a8d80 | t038-u000040/pvc-38-40-16420352860585
  3. Save the PVC configuration file.

# kubectl get pvc -n <namespace> <pvc> -o yaml > <file.yaml>

Example:

# kubectl get pvc -n t002-u000006 pvc-38-40-16420352860585 -o yaml > pvc.yaml

Example PVC file:

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  annotations:
    pv.kubernetes.io/bind-completed: "yes"
    pv.kubernetes.io/bound-by-controller: "yes"
    robin.io/faultdomain: host
    robin.io/media: HDD
    robin.io/replication: "2"
    robin.io/rpool: default
    volume.beta.kubernetes.io/storage-provisioner: robin
    volume.kubernetes.io/storage-provisioner: robin
  creationTimestamp: "2022-01-13T00:54:46Z"
  finalizers:
  - kubernetes.io/pvc-protection
  labels:
    robin.io/domain: ROBIN_PDV
    robin.io/tenant: tenant1
    robin.io/tenant_id: "38"
    robin.io/user_id: "40"
    robin.io/username: u1
  name: pvc-38-40-16420352860585
  namespace: t038-u000040
  resourceVersion: "2378648"
  uid: 2a9ffb4e-fc25-4536-b700-501c2a7a8d80
spec:
  accessModes:
  - ReadWriteMany
  resources:
    requests:
      storage: 200Gi
  storageClassName: robin-immediate
  volumeMode: Filesystem
  volumeName: pvc-2a9ffb4e-fc25-4536-b700-501c2a7a8d80
status:
  accessModes:
  - ReadWriteMany
  capacity:
    storage: 100Gi
  conditions:
  - lastProbeTime: null
    lastTransitionTime: "2022-01-27T17:01:41Z"
    status: "True"
    type: Resizing
  phase: Bound
  4. Edit the PVC YAML file and remove the following attributes:

vi <file.yaml from Step 3>
  • In the metadata annotations, remove the following attributes:

pv.kubernetes.io/bind-completed
pv.kubernetes.io/bound-by-controller
  • In metadata, remove the following attributes:

creationTimestamp
resourceVersion
uid
  • Remove the complete status section in the PVC YAML file.

Edited PVC YAML file example:

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  annotations:
    robin.io/faultdomain: host
    robin.io/media: HDD
    robin.io/replication: "2"
    robin.io/rpool: default
    volume.beta.kubernetes.io/storage-provisioner: robin
    volume.kubernetes.io/storage-provisioner: robin
  finalizers:
  - kubernetes.io/pvc-protection
  labels:
    robin.io/domain: ROBIN_PDV
    robin.io/tenant: tenant1
    robin.io/tenant_id: "38"
    robin.io/user_id: "40"
    robin.io/username: u1
  name: pvc-38-40-16420352860585
  namespace: t038-u000040
spec:
  accessModes:
  - ReadWriteMany
  resources:
    requests:
      storage: 200Gi
  storageClassName: robin-immediate
  volumeMode: Filesystem
  volumeName: pvc-2a9ffb4e-fc25-4536-b700-501c2a7a8d80
  5. In the edited YAML file, change the spec.resources.requests.storage size of the underlying Robin volume.

  6. Mark the PersistentVolume (PV) that is bound to the PersistentVolumeClaim (PVC) with the Retain reclaim policy. This will prevent the underlying volume from being deleted when the PVC is deleted.

Example:

# kubectl patch pv pvc-2a9ffb4e-fc25-4536-b700-501c2a7a8d80 -p '{"spec":{"persistentVolumeReclaimPolicy":"Retain"}}'
persistentvolume/pvc-2a9ffb4e-fc25-4536-b700-501c2a7a8d80 patched
  7. Patch the PVC so that Robin admission control does not prevent the PDV’s deletion.

Example

# kubectl patch pvc -n t038-u000040 pvc-38-40-16420352860585  -p '{"metadata":{"labels": {"robin.io/override_delete_protection":"true"}}}'
persistentvolumeclaim/pvc-38-40-16420352860585 patched
  8. Delete the PVC.

As PV now has a Retain reclaim policy, you will not lose any data when the PVC is recreated.

Example:

# kubectl delete pvc -n t038-u000040 pvc-38-40-16420352860585
persistentvolumeclaim "pvc-38-40-16420352860585" deleted
  9. Delete the claimRef entry from the PV spec so that the new PVC can bind to it. This should make the PV Available.

Example:

# kubectl patch pv pvc-2a9ffb4e-fc25-4536-b700-501c2a7a8d80 --type json -p '[{"op": "remove", "path": "/spec/claimRef"}]'
persistentvolume/pvc-2a9ffb4e-fc25-4536-b700-501c2a7a8d80 patched
  10. Re-create the PVC with the required size without exceeding the available resources.

Note

The pvc.yaml file created above has the name of the existing PV set in the volumeName attribute. This binds the new PVC to the existing PV.

Example:

# kubectl create -f pvc.yaml
persistentvolumeclaim/pvc-38-40-16420352860585 created
  11. Restore the original reclaim policy of the PV.

Example:

# kubectl patch pv pvc-2a9ffb4e-fc25-4536-b700-501c2a7a8d80  -p '{"spec":{"persistentVolumeReclaimPolicy":"Delete"}}'
persistentvolume/pvc-2a9ffb4e-fc25-4536-b700-501c2a7a8d80 patched

23.3.9. Technical Support

Contact Robin Technical support for any assistance.

23.4. Robin Cloud Native Platform v5.3.11 HF3

The Robin CNP v5.3.11 HF3 release has a new feature and a known issue.

Release Date: 06 May 2022

23.4.1. Infrastructure Versions

The following software applications are included in this CNP release.

Software Application    Version
Kubernetes              1.21.5
Docker                  19.03.9
Prometheus              2.16.0
Prometheus-adapter      0.9.1
Node-exporter           1.1.2
Calico                  3.12.3
HA-Proxy                1.5.18
PostgreSQL              9.6.22
Grafana                 6.5.3

23.4.2. Upgrade Path

The following is the supported upgrade path for Robin CNP v5.3.11 HF3:

  • Robin v5.3.11 (HF2) to Robin v5.3.11 (HF3)

23.4.3. New Feature

23.4.3.1. Support for Cisco DCNM E1000 Virtual Interface

Robin CNP 5.3.11 HF3 supports Cisco Data Center Network Manager (DCNM) E1000 virtual network interface for KVMs. You can deploy the Cisco DCNM application on the Robin CNP cluster.

Note

The Cisco DCNM E1000 Virtual Interface is supported only on KVMs with OVS IP Pool. You can configure the interface only using an input.yaml file. For more information, see here.

23.4.4. Known Issue

Reference ID

Description

PP-27192

Symptom

In some rare scenarios, creating an app from a snapshot of a KVM app fails with this error: Failed to ping instance.

Workaround

Run the following command to restart the KVM instance:

# robin instance restart <name>

23.4.5. Technical Support

Contact Robin Technical support for any assistance.

23.5. Robin Cloud Native Platform v5.3.11 HF4

The Robin CNP v5.3.11 HF4 release has improvements, a fixed issue, and known issues.

Release Date: 19 June 2022

23.5.1. Infrastructure Versions

The following software applications are included in this CNP release.

Software Application    Version
Kubernetes              1.21.5
Docker                  19.03.9
Prometheus              2.16.0
Prometheus-adapter      0.9.1
Node-exporter           1.1.2
Calico                  3.12.3
HA-Proxy                1.5.18
PostgreSQL              9.6.22
Grafana                 6.5.3

23.5.2. Upgrade Path

The following is the supported upgrade path for Robin CNP v5.3.11 HF4:

  • Robin v5.3.11 (HF2) to Robin v5.3.11 (HF4)

23.5.3. Improvements

23.5.3.1. Rocky Linux Support

Robin CNP v5.3.11 HF4 supports Rocky Linux 8.6. You can install Robin CNP v5.3.11 HF4 on Rocky Linux 8.6 servers.

The following are the supported Rocky Linux 8.6 kernel versions:

  • 4.18.0-372.9.1.rt7.166.el8.x86_64 (RT Kernel)

  • 4.18.0-372.9.1.el8.x86_64 (Non-RT Kernel)

23.5.3.2. Disable Init Containers and Sidecars in Bundle App

Robin CNP v5.3.11 HF4 supports disabling the Init Containers and Sidecars in Robin Bundle apps using the input.yaml file when deploying the Bundle apps.

The following is a sample Robin Bundle file:

name: dpdk-intel
version: v1
icon: icon.png
snapshot: enabled
clone: enabled
roles:
- pktgen
pktgen:
  name: pktgen
  norootfs: true
  image:
    name: robinsys/dpdk-intel
    version: v1
    engine: docker
    imagePullPolicy: IfNotPresent
    entrypoint: entry.sh
  compute:
    memory: 1G
    cpu:
      reserve: true
      cores: 2
  initContainers:
  - name: init1
    image: 'robinsys/dpdk-intel:v1'
    imagePullPolicy: IfNotPresent
    resources:
      limits:
        cpu: 25m
        memory: 128Mi
    command:
    - sleep
    - '5'
  sidecars:
  - name: side1
    image: 'robinsys/dpdk-intel:v1'
    imagePullPolicy: IfNotPresent
    command:
    - /bin/bash
    - '-c'
    - trap 'exit 0' SIGTERM; while true; do sleep 1; done
    resources:
      limits:
        memory: 200Mi
        cpu: '1'
  - name: side2
    image: 'robinsys/dpdk-intel:v1'
    imagePullPolicy: IfNotPresent
    command:
    - /bin/bash
    - '-c'
    - trap 'exit 0' SIGTERM; while true; do sleep 1; done
    resources:
      limits:
        memory: 200Mi
        cpu: '1'

Input Yaml file for disabling Init Containers and Sidecars

The Robin Bundle sample file above has two sidecars (side1 and side2) and one Init container (init1).

Using the following sample input.yaml file, you can disable Init Containers and sidecars. In this example, the side1 sidecar and the init1 Init container are disabled.

roles:
- name: pktgen
  containers:
    - name: side2
      disabled: false
    - name: side1
      disabled: true
    - name: init1
      disabled: true

You can use the input.yaml file when creating an app using the Robin Bundle.

Syntax

Run the following command to create an app from the Robin Bundle with the input.yaml file:

# robin app create from-bundle <appname> <bundleid> <yamlfile> --rpool <rpool> --wait

23.5.4. Fixed Issues

Reference ID

Description

PP-27304

A 503 error message appeared due to an HAProxy timeout.

To fix this issue, you need to increase the HAProxy timeout by setting the connect_timeout attribute to 60 seconds using the robin config update command.
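The following is a sketch of such a command, assuming the HAProxy settings are exposed under an haproxy section of robin config; verify the actual section and attribute names with robin config list before applying:

# robin config update haproxy connect_timeout 60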

23.5.5. Known Issues

Reference ID

Description

PP-27400

Symptom

The --disablerepo=* option does not work with CentOS 8 and Rocky Linux 8.

Workaround

You can disable all repos by creating a backup folder and manually moving all repo files to it.

Run the following commands to manually move all repo files:

  1. mkdir /etc/yum.repos.d.backup

  2. mv /etc/yum.repos.d/* /etc/yum.repos.d.backup/

PP-27613

When you create an IP pool, the IP pool creation succeeds; however, the app creation fails with the error IP Pool does not exist. This issue occurs because the pool creation is registered in Kubernetes but fails to register in the database.

Workaround

  1. Log in to the Robin server pod from any node using rbash master.

  2. Run the following command to restart the Robin server.

systemctl restart robin-server

  3. Run the following command to verify the IP pool list.

robin ip-pool list

23.5.6. Technical Support

Contact Robin Technical support for any assistance.