23. Release Notes

23.1. Robin Cloud Native Platform v5.3.11

The Robin Cloud Native Platform (CNP) v5.3.11 release has new features, an improvement, a bug fix, and known issues.

Release Date: 10 November 2021

23.1.1. Infrastructure Versions

The following software applications are included in this CNP release.

Software Application    Version
Kubernetes              1.21.5
Docker                  19.03.9
Prometheus              2.16.0
Node-exporter           1.1.2
Calico                  3.12.3
HA-Proxy                1.5.18
PostgreSQL              9.6.11
Grafana                 6.5.3

23.1.2. Upgrade Path

The following is the supported upgrade path for Robin CNP v5.3.11:

  • Robin v5.3.9-286 (GA) to Robin v5.3.11-69 (GA)

23.1.3. New Features

23.1.3.1. Support for NVIDIA A100 MIG

Robin CNP v5.3.11 supports the Multi-Instance GPU (MIG) mode of operation for the NVIDIA A100 GPU. Robin allows you to use partitioned GPUs in Robin bundles and also supports chargeback functionality for these GPU partitions.
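For illustration, the following is a minimal Kubernetes Pod sketch that consumes a single A100 MIG partition. The nvidia.com/mig-1g.5gb resource name assumes the standard NVIDIA device plugin naming for the mixed MIG strategy and may differ in your environment; the Pod name and workload are placeholders.

apiVersion: v1
kind: Pod
metadata:
  name: mig-example                    # hypothetical name
spec:
  restartPolicy: Never
  containers:
  - name: cuda-workload
    image: nvidia/cuda:11.0-base
    command: ["nvidia-smi", "-L"]      # list the GPU/MIG devices visible to the container
    resources:
      limits:
        nvidia.com/mig-1g.5gb: 1       # request one 1g.5gb MIG slice (assumed resource name)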

23.1.3.2. Rocky Linux Support

Starting from Robin CNP v5.3.11, Rocky Linux 8.4 is supported. You can install Robin CNP v5.3.11 on this version of Linux.

23.1.3.3. Support for Application Ephemeral Volumes

Robin CNP v5.3.11 supports Application Ephemeral Volumes (AEVs). An AEV is temporary storage that Robin bundle applications can use. The AEV only exists when an application is running. When you create an application, the AEV is created for usage, and its space will be reclaimed back by Robin when the application stops. Other applications can use the reclaimed storage space. Robin provisions the storage space back to the application when it starts again.

You can add the AEVs only from the Robin UI when creating an application using a Robin bundle. Also, you can create templates of the application with AEV for future use. You can add a maximum of 10 AEVs per application.

23.1.4. Improvement

23.1.4.1. Support for @ Symbol in Robin Usernames

Starting with Robin CNP v5.3.11, you can use the @ symbol as part of Robin usernames. This enables you to use email addresses as usernames.

23.1.5. Fixed Issue

Reference ID

Description

PP-24202

The security issue with SSL Medium Strength Cipher Suites is fixed by configuring Strong Cipher Suites with keys longer than 128 bits in Robin CNP Services.

The following Strong Cipher Suites are supported:

  • TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256

  • TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384

  • TLS_ECDHE_RSA_WITH_CHACHA20_POLY1305

  • TLS_RSA_WITH_AES_128_CBC_SHA

  • TLS_RSA_WITH_AES_256_CBC_SHA

  • TLS_RSA_WITH_AES_128_GCM_SHA256

  • TLS_RSA_WITH_AES_256_GCM_SHA384

  • TLS_ECDHE_RSA_WITH_AES_256_CBC_SHA

The above-listed Strong Cipher Suites are supported in the following Robin CNP Services:

  • K8s API server

  • K8s controller manager

  • K8s scheduler

  • K8s kubelet

  • ROBIN UI https server

  • ROBIN event server

23.1.6. Known Issues

Reference ID

Description

PP-24270

Symptom

In Robin CNP, Kubelet might go down due to the presence of a stale cpu_manager_state file.

Workaround

Complete the following steps to fix this issue:

  1. Remove the stale /var/lib/kubelet/cpu_manager_state file using the following command:

# rm -rf /var/lib/kubelet/cpu_manager_state
  2. Restart the Kubelet by running the following command:

# systemctl restart kubelet
  3. Make sure etcd and apiserver Pods on this node are up and running.

PP-24248

Symptom

When you create a new resource pool, assign it to a node, and later deploy a Pod with storage affinity on that node, the Pod deployment fails because the node does not pick up the newly assigned resource pool.

Workaround

Complete the following steps to fix this issue:

  1. Run the following command to edit the node:

# kubectl edit node <node_name>
  2. Remove the robin.io/robinrpool resource pool.

  3. Add the correct resource pool name.

PP-22853

Symptom

GPUs might not be detected after CNP installation, upgrade, or addition of a new node.

Workaround

Run the following host probe rediscover command:

# robin host probe <hostname> --rediscover

PP-22626

Symptom

If NVIDIA GPU drivers are already installed on your setup, operator deployments might fail.

Workaround

Complete the following steps to fix this issue:

  1. # yum remove nvidia-driver-latest-dkms
    
  2. # yum remove nvidia-container-toolkit
    
  3. Reboot the node.

PP-21832

Symptom

After you reboot a node, it might be in the NotReady state.

Workaround

Complete the following steps to fix this issue:

  1. # systemctl restart kubelet
    
  2. # systemctl restart dockershim
    
  3. # docker restart robin-cri
    

PP-22781

Symptom

After removing a taint on a master node, GPUs are not detected automatically.

Workaround

You need to run the robin host probe --rediscover --all --wait command for the GPUs to be detected on the primary master node.

23.1.7. Technical Support

Contact Robin Technical support for any assistance.

23.2. Robin Cloud Native Platform v5.3.11 HF1

The Robin CNP v5.3.11 HF1 release has improvements, bug fixes, and known issues.

Release Date: 08 December 2021

23.2.1. Infrastructure Versions

The following software applications are included in this CNP release.

Software Application    Version
Kubernetes              1.21.5
Docker                  19.03.9
Prometheus              2.16.0
Node-exporter           1.1.2
Calico                  3.12.3
HA-Proxy                1.5.18
PostgreSQL              9.6.11
Grafana                 6.5.3

23.2.2. Upgrade Paths

The following are the supported upgrade paths for Robin CNP v5.3.11 HF1:

  • Robin v5.3.5 (HF3) to Robin v5.3.11 (HF1)

  • Robin v5.3.5 (HF5) to Robin v5.3.11 (HF1)

  • Robin v5.3.9 (GA) to Robin v5.3.11 (HF1)

  • Robin v5.3.11 (GA) to Robin v5.3.11 (HF1)

23.2.3. Improvements

23.2.3.1. Network Planning Support for Apps with Pod Affinity and Anti-affinity

Robin CNP v5.3.11 HF1 provides network planning support for apps with Pod affinity and anti-affinity.
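For context, the following is a minimal sketch of the kind of placement constraint the planner now accounts for; the Deployment name, label values, and image are hypothetical.

apiVersion: apps/v1
kind: Deployment
metadata:
  name: web
spec:
  replicas: 2
  selector:
    matchLabels:
      app: web
  template:
    metadata:
      labels:
        app: web
    spec:
      affinity:
        podAntiAffinity:
          # Spread the replicas across nodes; Robin network planning is applied to the resulting placement.
          requiredDuringSchedulingIgnoredDuringExecution:
          - labelSelector:
              matchLabels:
                app: web
            topologyKey: kubernetes.io/hostname
      containers:
      - name: web
        image: nginx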

23.2.3.2. Application Ephemeral Volumes UI Improvement

The APPLICATION EPHEMERAL STORAGE section of the Robin UI in Robin CNP v5.3.11 HF1 is improved to display the following drop-down options for the replication setting.

  • Storage-Compute Affinity

  • Not Replicated

  • Replicated (2 copies)

  • Replicated (3 copies)

Note

These options appear when you create AEVs from the Robin UI if the AEVs are not defined in a Robin bundle manifest file.

23.2.4. Fixed Issues

Reference ID

Description

PP-24650

Robin CNP v5.3.11 HF1 fixes the SWEET32 vulnerability.

PP-24528

The issue of the CNP planner assigning Pods without checking the status of network interfaces is fixed in this version. With Robin CNP v5.3.11 HF1, the planner skips network interfaces that are down.

PP-24428

The creation of an Application Ephemeral Volume (AEV) failed in a cloud deployment if the replication factor was not specified in the AEV specifications.

This issue is fixed in Robin CNP v5.3.11 HF1 by setting a default replication factor for each AEV when it is not provided.

PP-22941

A Pod did not come up successfully when no limits and requests were provided in the container resources section but a Robin annotation for network planning was provided. This issue is fixed.

PP-21983

A Pod controlled by a Deployment, StatefulSet, or DaemonSet sometimes failed to be created even when its IP address was not listed in robin ip-pool info --ip-allocations and no other running Pod in the cluster was using that IP address. This issue is fixed.

PP-24589

The issue of Calico’s CIDR value not being set correctly when updating a Calico IP pool in an IPv6 Robin CNP setup is fixed.

PP-24313

The robin bundle add command was incorrectly storing the bundle files in the file-object directory of the log collection service instead of the file-object directory of the file collection service. This issue is fixed.

23.2.5. Known Issues

Reference ID

Description

PP-24697

Symptom

Network attachment definitions (net-attach-def) might not be cleaned up when a Pod is bounced and recreated from the webhook start phase. If this occurs, follow this workaround.

Workaround

You must delete the net-attach-def that is not cleaned up.

To delete, run the following command:

# kubectl delete net-attach-def <net-attach-def-name> -n <namespace>

PP-24600

When you deploy a Pod using Kubernetes Deployment, StatefulSet, or DaemonSet, in some scenarios, the deployment process might take a longer time than usual due to the exponential back-off delay during scheduling. This is a known behavior with Kubernetes.

23.2.6. Technical Support

Contact Robin Technical support for any assistance.

23.3. Robin Cloud Native Platform v5.3.11 HF2

The Robin CNP v5.3.11 HF2 release has a new feature, improvements, bug fixes, and known issues.

Release Date: 02 February 2022

23.3.1. Infrastructure Versions

The following software applications are included in this CNP release.

Software Application    Version
Kubernetes              1.21.5
Docker                  19.03.9
Prometheus              2.16.0
Prometheus-adapter      0.9.1
Node-exporter           1.1.2
Calico                  3.12.3
HA-Proxy                1.5.18
PostgreSQL              9.6.22
Grafana                 6.5.3

23.3.2. Upgrade Paths

The following are the supported upgrade paths for Robin CNP v5.3.11 HF2:

  • Robin v5.3.5-232 (HF5) to Robin v5.3.11 HF2

  • Robin v5.3.7-120 (HF1) to Robin v5.3.11 HF2

  • Robin v5.3.11-104 (HF1) to Robin v5.3.11 HF2

23.3.3. New Feature

23.3.3.1. Intel Cache Allocation Technology Support for vDU Pods

Robin CNP v5.3.11 HF2 supports the Intel Cache Allocation Technology (CAT) to deploy 4G vDU (Virtual Distributed Unit) Pods for a single non-uniform memory access (NUMA) node host.

The Intel CAT enables vDUs in a 4G environment to access the CPUs on the host using dedicated cache lines/ways. When you deploy a vDU Pod using Robin CNP, the vDUs get a dedicated number of cache lines/ways configured on the host to access the CPUs.

You can use the following annotation in your Pod YAML file for requesting cache lines/ways from the CPU for vDUs.

Example:

# "robin.runtime.num_cache_ways": "4"

Note

You must use the annotation at the Pod level only.
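The following is a minimal Pod sketch that requests four cache lines/ways through this annotation; everything other than the robin.runtime.num_cache_ways annotation is a generic placeholder.

apiVersion: v1
kind: Pod
metadata:
  name: vdu-example                        # hypothetical name
  annotations:
    robin.runtime.num_cache_ways: "4"      # request 4 dedicated cache lines/ways (Pod level only)
spec:
  containers:
  - name: vdu
    image: registry.example.com/vdu:latest # hypothetical image
    resources:
      limits:
        cpu: "4"
        memory: 8Gi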

23.3.4. Improvements

23.3.4.1. Default replication factor and fault domain parameters for RWX PVCs

Starting with Robin CNP v5.3.11 HF2, for RWX PVCs, the default replication factor is 2 and the default faultdomain is host.

If you want to change the replication factor to 1, you can use the following parameter under annotations in the RWX PVC: robin.io/rwx_force_single_replica: "1".

Note

You must not use the robin.io/replication and robin.io/rwx_force_single_replica annotations together in a PVC.

The following is an example of a PVC file:

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
   name: nfs-shared-1
   annotations:
      robin.io/nfs-server-type: "shared"
      robin.io/replication: "2"
      robin.io/faultdomain: "host"
spec:
   storageClassName: robin
   accessModes:
      - ReadWriteMany
   resources:
      requests:
         storage: 500Gi
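For comparison, the following is a minimal sketch of an RWX PVC that forces a single replica using the robin.io/rwx_force_single_replica annotation described above. The robin.io/replication annotation is intentionally omitted because the two annotations must not be used together; the PVC name is a placeholder.

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
   name: nfs-shared-single
   annotations:
      robin.io/nfs-server-type: "shared"
      robin.io/rwx_force_single_replica: "1"
spec:
   storageClassName: robin
   accessModes:
      - ReadWriteMany
   resources:
      requests:
         storage: 500Gi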

23.3.4.2. IP-Pool Prefix as Environment Variable inside a Pod

Starting with Robin CNP v5.3.11 HF2, Robin CNP provides an IP-Pool prefix as an environment variable inside a Pod.

When you deploy a Robin bundle or Helm app and use Calico or OVS IP-Pool, Robin CNP adds an IP-Pool prefix as an environment variable inside the Pod.

Using the IP-Pool prefix environment variable, you can discover the IP prefix of an IP-Pool by using the command line.

Example:

# kubectl exec -it -n t001-u000004   c1-server-01 -- env | grep -i prefix
ROBIN_SAMPLE_IPPOOL_PREFIX=16

In the above command output, ROBIN_SAMPLE_IPPOOL_PREFIX=16 reads as follows: SAMPLE_IPPOOL is the IP-Pool name, prefixed with ROBIN and suffixed with the word PREFIX, and 16 is the IP prefix of the IP-Pool.

23.3.4.3. Robin StorageClass with runAsAny parameter

Robin CNP v5.3.11 HF2 provides a new parameter runAsAny in the StorageClass object to enable any user other than the root user to read or write to an NFS mountpoint of an RWX volume.

You can use this parameter in a scenario with multiple containers and different users, and you want to allow any user accessing the Pod (containers) to read or write to an NFS mountpoint of an RWX volume.

In the StorageClass object file, set the runAsAny parameter to "true".

The following is an example of the StorageClass with runAsAny parameter:

apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: run-as-any-imm
  labels:
    app.kubernetes.io/instance: robin
    app.kubernetes.io/managed-by: robin.io
    app.kubernetes.io/name: robin
provisioner: robin
reclaimPolicy: Delete
allowVolumeExpansion: true
volumeBindingMode: Immediate
parameters:
  replication: '2'
  media: HDD
  runAsAny: "true"

23.3.4.4. Optimization of CPU and memory for Kubernetes Events

Robin CNP v5.3.11 HF2 is optimized to reduce the usage of CPU and memory when processing Kubernetes events.

23.3.5. Fixed Issues

Reference ID

Description

PP-25070

Vulnerability CVE-2021-41103 is related to the containerd runtime: container root directories and some plugins had insufficiently restricted permissions, allowing unprivileged Linux users to traverse directory contents and execute programs. For more information about this vulnerability, see CVE-2021-41103.

In this release, Robin.io has upgraded the containerd package to version 1.5.7 to address this vulnerability.

PP-24947

In recent versions of Robin CNP, the source-based route was not configured properly for the first additional IP-Pool. This issue is fixed.

PP-24938

After upgrading to Robin CNP v5.3.11 (HF1), the static IP address was not allocated to a Pod, and the Pod did not come up. This issue is fixed.

PP-24796

The scheduler was unable to create a Pod within 30 seconds because Nmap reported that the requested static IP address was already in use due to an external firewall. This issue is fixed, and Nmap is now disabled by default.

You can enable Nmap to allow the Robin server to scan the network before IP address allocation by running the following command.

# robin config update manager mutating_webhook_staticip_validation_enable true

PP-24776

The robin ip-pool add command with the nictags pci_addr option is failing in Robin CNP v5.3.11 HF1. This issue is fixed.

PP-24697

When a Pod with robin.io/networks annotation is deleted, the Network attachment definitions (net-attach-defs) entries are left behind. You need to manually delete these entries. This issue is fixed.

PP-24789

An inaccessible device was not being excluded from selection by marking it offline when its node went down. This issue is now fixed.

PP-25116

When you bounce a Pod or reinstall an app, the deletion event could take a long time to complete because the earlier event was stuck at registration due to an API missing from the kubectl API resources. As a result, static IP address allocation was failing. This issue is fixed.

PP-25109

In a scenario where two MANAGER nodes became SLAVE nodes and waited for the third node to become the MASTER node, the third node failed to become MASTER due to internal issues.

The other two nodes kept waiting for the third node to become MASTER without checking whether any node was holding the MASTER lock. As a result, the control plane was down. This issue is fixed.

PP-24645

The existence of a recovery.conf file in the PostgreSQL data directory was preventing a Manager node from becoming MASTER during a high availability transition. This issue is fixed.

Instead of failing, Robin CNP now allows the node to continue with its transition to become MASTER.

PP-25221

A Pod failed to come up to the Running state, and the kubectl describe pod <pod name> -n <namespace> command showed an error that the network CNI plugin failed to set up the Pod. This issue is fixed.

23.3.6. Known Issues

Reference ID

Description

PP-25360

Symptom

If containers in a Pod are using an RWX PVC and if they are stuck in the ContainerCreating state for a long time and display a timeout error, apply the following workaround.

Workaround

Delete Pods if they are part of a Deployment or StatefulSet.

PP-24937

Symptom

When upgrading to Robin CNP v5.3.11 HF2 from any supported version using GoRobin, an error related to Paramiko might be encountered.

Workaround

Check the login credentials of the Robin node or try restarting the sshd service on the Robin node.

PP-25370

Symptom

A Robin node in a cluster might go into the NotReady state when an RWX PVC’s mount path is not responding. This issue could occur due to several internal Kubernetes known issues.

Workaround

For workaround steps, see Troubleshooting a Robin Cluster Node with the NotReady State.

PP-25430

Symptom

After upgrading to Robin CNP v5.3.11 HF2, you might observe that automatically restarted Pods containing the robin.io/networks annotation do not have their secondary IP addresses.

Workaround

Bounce such Pods manually.
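For example, for a Pod managed by a Deployment, StatefulSet, or DaemonSet, deleting the Pod causes the controller to recreate it so that it can pick up the secondary IP addresses:

# kubectl delete pod <pod name> -n <namespace>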

PP-25422

Symptom

An upgrade to Robin v5.3.11 HF2 might fail because the robinds and node plugin Pods on one of the worker nodes are stuck in the Terminating state.

Workaround

Perform the following steps to recover all stale NFS mount points:

  1. Make sure that no application Pod uses the PVC.

  2. Run the following command to scale down the replica count to 0.

# kubectl scale --replicas=0 <replica set>
  3. Run the following command to unmount the stale mount point.

# umount -f -l <stale mount point>
  4. When all stale NFS mount points on a node are recovered, run the following command to restart Kubelet.

# systemctl restart kubelet
  5. Scale up the replicas back to the original count.

# kubectl scale --replicas=<replica count> <replica set>

Note

You must not delete the PVC or application Pod.

PP-25425

Symptom

When you create an application using Application Ephemeral Volume (AEV) with Storage-Compute Affinity, the app creation might fail due to the missing robin.io/hostname tag on the host.

Workaround

Add the robin.io/hostname tag to the host.

Run the following command to add the tag:

# robin host add-tags <hostname> <tag>

Example:

# robin host add-tags cscale-82-37.robinsystems.com
robin.io/hostname=cscale-82-37.robinsystems.com

PP-25296

Symptom

When a cluster is recovered from a network partition, the Pods deployed on the worker nodes in the minority partition are redeployed in the majority partition. The Robin control plane is unable to access the worker nodes in the minority partition. The Pods and their volume mounts are cleaned up automatically when the network partition is resolved. In some cases, this automatic clean-up fails to remove the stale Pods on the worker nodes.

Workaround

Reboot the server. When the server restarts, the server will rejoin the cluster without the stale pods and volume mounts.

PP-21832

Symptom

After upgrading your cluster to Robin CNP v5.3.11 HF2, if you notice that a cluster node is in the NotReady state after you reboot the cluster, you must apply the following workaround on all nodes of the cluster.

Workaround

To resolve this issue, complete the following steps:

  1. Run the following commands.

/bin/cp /root/bin/robin-reboot.service /etc/systemd/system/robin-reboot.service
kernel_version=$(uname -r)
if [[ $kernel_version == "3.10"* ]]; then
   sed -i "/EL8/d;s/#EL7//" /etc/systemd/system/robin-reboot.service
else
   sed -i "/EL7/d;s/#EL8//" /etc/systemd/system/robin-reboot.service
fi
  2. Run the following command to restart Kubelet.

# systemctl restart kubelet
  3. Run the following command to restart Dockershim.

# systemctl restart dockershim
  4. Run the following command to restart robin-cri.

# docker restart robin-cri

PP-25286

Symptom

When you try to resize a PDV to a large size, the resize task fails due to insufficient storage, and due to this resize task failure, the subsequent PDV resizing tasks also fail.

Workaround

For workaround steps, see PDV Resize issue.

PP-25441

Symptom

In Robin CNP v5.3.11, provisioning clones with Application Ephemeral Volumes (AEVs) is not supported.

PP-25412

Symptom

Storage-Compute affinity is not enforced on an Application Ephemeral Volume (AEV) even when storage is available on the same host.

PP-25453

Symptom

When upgrading to Robin CNP v5.3.11 HF2 from any supported version, CNS fails to execute post-upgrade actions on one of the nodes because the IO Manager might be down and the node displays the NotReady state.

Apply the following workaround if you notice this issue.

Workaround

Run the following command on the node where you executed the upgrade command:

./<gorobin-binary> onprem post-upgrade-robin --hosts-json /root/hosts.json
--gorobintar <local location of GoRobin tar file> --robin-admin-user
<Robin Admin username> --robin-admin-passwd <Robin Admin password>

PP-25461

Symptom

A Pod may fail to be allocated one or more static IP addresses under several conditions. If you discover that a Pod comes up without a static IP address, apply the following workaround.

Workaround

Bounce the Pods that are not allocated the required number of static IP addresses.

PP-25423

Symptom

After upgrading to Robin CNP v5.3.11 HF2 from Robin CNP v5.3.5, application Pods might be stuck in the ContainerCreating state.

If you notice this issue, apply the following workaround.

Workaround

  1. Using the RWX PVC, find the Deployment or StatefulSet and the Pod details by running the following command:

# kubectl describe pvc <pvc name> -n <namespace>
  2. Note the replica count by running the following command:

# kubectl get all -n <ns>
  3. Scale the replicas to 0 by running the following command.

# kubectl scale --replicas=0 <deployment/statefulset> -n <ns>
  4. Observe that the Pod is terminated and does not exist anymore.

  5. Scale the replicas back to the count that you noted in step 2. Use the following command to scale up.

# kubectl scale --replicas=<n> <deployment/statefulset> -n <ns>

PP-25381

Symptom

Robin CNP does not support rack fault domain for Application Ephemeral Volume (AEV). However, Robin CNP UI incorrectly displays the FaultDomain (Rack) option.

PP-25467

Symptom

You might observe that Kubelet is slow or unresponsive and periodically goes into the error state, resulting in issues with the Robin storage layer. This issue could be due to an orphan Pod or Kubelet trying to mount a Pod on old PVCs.

Workaround

  1. Run the following command to check the status of Kubelet.

# systemctl status kubelet -l
  2. In the command output, find the following message.

"orphaned pod pod_id found, but error not a directory occurred
 when trying to remove the volumes dir"
  3. Run the following command to find the PVC name from the node.

# kubectl get pvc -A
  4. Run the following command to check whether the PVC exists.

# kubectl get pvc -A | grep <pvc_name>
  5. If the PVC does not exist, delete the directory by running the following command.

# rm -rf /var/lib/kubelet/pods/<pod_id>

PP-25463

Symptom

The volume mounts in a Pod fail due to duplicate FS UUIDs.

Workaround

A duplicate FS UUID is present when the device is already mounted on the same node. When a volume mount fails, the FS UUID is displayed in syslog/dmesg.

Perform the following steps to resolve the duplicate FS UUID.

  1. Run the following command to check whether any device has the same FS UUID:

# blkid | grep <FS UUID>
  2. Run the following command to check whether the device is mounted:

# mount | grep <device>
  3. If the device is mounted, run the following command to unmount it:

# umount <device>

After unmounting the device, the duplicate FS UUID conflict will be resolved.

PP-25466

Symptom

A Pod fails to come up because the volume is not accessible, and the volume is in the faulted state.

Workaround

You need to probe the robin host by running the following command:

# robin host probe --all

PP-25508

Symptom

When you try to modify an IP Pool, the modification process might fail with an error message due to missing values in the IP Pool.

If you notice any error message when modifying the IP Pool, apply the following workaround.

Workaround

  1. Run the following command only once on the cluster.

# robin schedule update K8sResSync k8s_resource_sync 63072000
  2. Run the following command and make a note of the IP Pool values.

# robin ip-pool info <name>
  3. Run the following command to update missing values in the IP Pool that you noted in the previous step.

# kubectl edit ripp <name> --validate=false

Example:

In the following example, you need to add the missing values in the spec: section. You do not need to update all values; for example, for a network-based IP-Pool, the prefix field is not required. Similarly, you can ignore any values that are not required.

[root@centos-60-205 ~]# kubectl edit ripp ovs-1 --validate=false
   ...
   spec:
      available: "15"
      dns_search: domain.com
      driver: ovs
      gateway: fd74:ca9b:3a09:868c::1
      ifcount: 1
      name: ovs-1
      nameserver: fd74:ca9b:3a09:868c:10:9:60:62
      netmask: ffff:ffff:ffff:ffff:0000:0000:0000:0000
      ranges:
      - fd74:ca9b:3a09:868c:0010:0009:0109:0010-0020
      - fd74:ca9b:3a09:868c:0010:0009:0109:0040-0050
      subnet: fd74:ca9b:3a09:868c:0000:0000:0000:0000
      used: "2"
      zone: default
      prefix: 64
      vfdriver: xyz
      vlan_number: 100
  4. Rerun the failed IP Pool command to verify.

23.3.7. Appendix

23.3.7.1. Troubleshooting a Robin Cluster Node with the NotReady State

The following content is the workaround for PP-25370.

A Robin node in a cluster might go into the NotReady state when an RWX PVC’s mount path is not responding. This issue could occur due to several internal Kubernetes known issues.

The RWX PVC’s mount path may not respond due to the following issues or symptoms on your cluster. You can troubleshoot these issues and bring the node back to the Ready state. This section provides troubleshooting steps for the following issues:

  • NFS server’s service IP address entry in the conntrack table might go into SYN_SENT or TIME_WAIT state

  • NFS Servers may not be ready

  • NFS Server Failover Issues

  • I/O hangs on the volume

With Robin v5.3.11 HF2, you might notice the NotReady state issue when you are upgrading from Robin v5.3.11 HF1 to Robin v5.3.11 HF2.

  • Troubleshoot NFS Server’s service IP address entry in the conntrack table in SYN_SENT or TIME_WAIT state

The Robin node could be in the NotReady state if the NFS Server’s service IP address entry in the conntrack table is in the SYN_SENT or TIME_WAIT state.

The following steps enable you to troubleshoot this issue and bring the node to the Ready state.

  1. When you notice any of the above-mentioned symptoms, run the following command to check whether your node is in the NotReady state:

# kubectl get node <node name>

Example:

# kubectl  get node hypervvm-61-46
NAME             STATUS     ROLES    AGE   VERSION
hypervvm-61-46   NotReady   <none>   25h   v1.21.5
  2. Run the following command and grep the NFS server mount paths:

# mount|grep :/pvc
  3. Copy the mount paths for verification from the command output.

  4. Run the following command to check the status of the mount path:

# ls <nfsmount>

Example:

# ls /var/lib/kubelet/pods/25d256d5-e6cc-4865-a3ee-88640e0d1fc8/volumes/kubernetes.io~csi/pvc-210829ca-96d4-4a12-aab8-5646d087054d/mount

Note

If any mount paths do not respond or hang, you must check the status of conntrack.

You need the service IP of the NFS Server Pod for checking conntrack status.

  5. Run the following command to get the NFS server Pod service IP address:

# mount|grep <pvc name>

Example:

# mount|grep pvc-210829ca-96d4-4a12-aab8-5646d087054d

[fd74:ca9b:3a09:868c:172:18:0:e23e]:/pvc-210829ca-96d4-4a12-aab8-5646d087054d on /var/lib/kubelet/pods/25d256d5-e6cc-4865-a3ee-88640e0d1fc8/volumes/kubernetes.io~csi/pvc-210829ca-96d4-4a12-aab8-5646d087054d/mount type nfs4 (rw,relatime,vers=4.0,rsize=1048576,wsize=1048576,namlen=255,hard,proto=tcp6,timeo=600,retrans=2,sec=sys,clientaddr=fd74:ca9b:3a09:868c:10:9:82:127,local_lock=none,addr=fd74:ca9b:3a09:868c:172:18:0:e23e)
  6. Verify the conntrack entry state using the NFS server Pod IP address by running the following command:

# conntrack -L -d <NFS server Pod IP address>

Note

If the conntrack entry state is SYN_SENT or TIME_WAIT, you need to delete the entry from the conntrack table to allow connections to the NFS service.

Example:

# conntrack -L -d fd74:ca9b:3a09:868c:172:18:0:e23e
tcp      6 110 SYN_SENT src=fd74:ca9b:3a09:868c:10:9:82:127 dst=fd74:ca9b:3a09:868c:172:18:0:e23e sport=980 dport=2049 [UNREPLIED] src=fd74:ca9b:3a09:868c:172:18:0:71d4 dst=fd74:ca9b:3a09:868c:10:9:82:127 sport=2049 dport=614 mark=0 use=1
conntrack v1.4.4 (conntrack-tools): 1 flow entries have been shown.
  7. Run the following command to delete the SYN_SENT or TIME_WAIT entry:

# conntrack -D -d <NFS server Pod IP address>

Example:

# conntrack -D -d fd74:ca9b:3a09:868c:172:18:0:e23e
tcp      6 102 SYN_SENT src=fd74:ca9b:3a09:868c:10:9:82:127 dst=fd74:ca9b:3a09:868c:172:18:0:e23e sport=980 dport=2049 [UNREPLIED] src=fd74:ca9b:3a09:868c:172:18:0:71d4 dst=fd74:ca9b:3a09:868c:10:9:82:127 sport=2049 dport=614 mark=0 use=1
conntrack v1.4.4 (conntrack-tools): 1 flow entries have been deleted.

Note

After deleting the SYN_SENT or TIME_WAIT entry from the conntrack table, you should be able to access the NFS mount path.

  8. Run the following command to verify the mount path status.

# ls /var/lib/kubelet/pods/25d256d5-e6cc-4865-a3ee-88640e0d1fc8/volumes/kubernetes.io~csi/pvc-210829ca-96d4-4a12-aab8-5646d087054d/mount
  • Additional Troubleshooting Checks

If you have checked the NFS Server’s service IP address entry in the conntrack table for the SYN_SENT or TIME_WAIT state and your node is still in the NotReady state, you need to perform additional checks to troubleshoot the issue.

The following are some additional checks for troubleshooting the issue:

  • Check NFS Exports Status

  • Check NFS server failover Status

  • Check NFS server Pod is provisioned.

  • Check NFS Exports Status

All NFS exports must be in the READY state.

To check the NFS exports status, run the following command:

# robin nfs export-list

Example:

# robin nfs export-list
+--------------+-----------+------------------------------------------+---------------------+-----------------------------------------------------------------------+
| Export State | Export ID |                  Volume                  |    NFS Server Pod   |                             Export Clients                            |
+--------------+-----------+------------------------------------------+---------------------+-----------------------------------------------------------------------+
|    READY     |     7     | pvc-9b1ef05e-5e4a-4e6a-ab3e-f7c95d1ae920 |  robin-nfs-shared-9 | ["hypervvm-61-48.robinsystems.com","hypervvm-61-43.robinsystems.com"] |
+--------------+-----------+------------------------------------------+---------------------+-----------------------------------------------------------------------+

Note

If NFS exports are not in the READY state, make sure the NFS server failover is enabled. Generally, it is enabled by default.

  • Check NFS server failover Status

NFS Server failover is enabled by default. However, you should check to confirm and enable it if it is disabled.

To check NFS server failover status, run the following command:

# robin config list nfs|grep failover_enabled
nfs     | failover_enabled
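If failover turns out to be disabled, the following sketch shows how it could be enabled, assuming the nfs section and failover_enabled attribute shown above accept a boolean value; verify the expected value format before applying:

# robin config update nfs failover_enabled true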
  • Check NFS server Pod is provisioned

To check whether the NFS server Pod is provisioned, run the following command:

# robin job list|grep -i NFSServerPodCreate|tail

Note

If all of these checks are fine, then it could be a bug in the NFS Server Failover. To troubleshoot the NFS Server failover issue, see Troubleshoot NFS Server Failover Issues.

  • Troubleshoot NFS Server Failover Issues

A node could go to the NotReady state due to NFS Server failover issues as well, apart from other issues mentioned in this section.

Note

  1. You can use the following steps even if your NFS Server has no issues but the PVC mount path is hung.

  2. Before you troubleshoot the NFS Server failover issues, check the Troubleshoot NFS Server’s service IP address entry in the conntrack table in SYN_SENT or TIME_WAIT state and Additional Troubleshooting Checks.

To fix the NFS server failover issues, complete the following steps:

  1. Run the following command to check if any NFS exports are in the ASSIGNED_ERR state and identify corresponding PVCs:

# robin nfs export-list
  2. Run the following command to note the replica count in the Deployment or StatefulSet:

# kubectl get all -n <ns>

Example:

# kubectl get all -n <ns>
...
NAME                   READY   UP-TO-DATE   AVAILABLE   AGE
deployment.apps/app1   2/2     2            2           27h

NAME                              DESIRED   CURRENT   READY   AGE
replicaset.apps/app1-5cbbc6d9db   2         2         2       27h
  3. Run the following command to scale the application Pods that use those PVCs to 0:

Note

Do not delete the application.

Scaling down the application Pods will make sure that new Pods do not come up and results in the NFS exports being cleared.

# kubectl scale --replicas=0 <deployment/statefulset> -n <ns>
  4. Run the following command to check that all NFS exports are healthy:

# robin nfs export-list
  5. (Optional) Run the following command on the hung paths if you notice some mount paths are still hung:

# umount -f -l <hung nfs mount path>
  6. Run the following command to check the node status:

# kubectl get node <node name>

Note

If you notice the node is still not showing the Ready state, wait for 2 minutes for kubelet to refresh the status.

If the status is still not showing Ready, stop and start kubelet by running the following commands:

# systemctl stop kubelet

# systemctl start kubelet
  7. Check the node status again. If the status is Ready, then go to the last step.

# kubectl get node <node name>
  8. If the node is still not in the Ready state or is flapping between Ready and NotReady, and you do not see any Pods in Kubernetes that are using the RWX PVC, the Pods may have been deleted by force from Kubernetes.

In this case, Kubernetes does not see the Pods, but Docker is still holding on to them. This is a rare case and occurs only when Pods are deleted forcefully.

If so, run the following commands:

# docker rm <>

# systemctl restart docker
  9. Run the following command to check the node status:

# kubectl get node <node name>

The node should be in the Ready state.

  10. Run the following command to scale up the application Pods back to the original count that you noted earlier:

# kubectl scale --replicas=<n> <deployment/statefulset> -n <ns>

23.3.8. PDV Resize issue

When you try to resize a PDV to a large size, the resize task fails due to insufficient storage, and due to this resize task failure, the subsequent PDV resizing tasks also fail.

If you face this issue, complete the following troubleshooting steps:

  1. Run the following command to verify the PersistentDataVolumeResize job status:

# robin job info <Job ID>

Note

In the command output, notice the PersistentDataVolumeResize job failure; you might also notice multiple failed volume expansion jobs. This is because Robin CNP continuously tries to allocate storage for the volume expansion.

Example:

# robin job info 935
ID  | Type                       | Desc                                                                                | State  | Start           | End      | Duration | Dependson | Error | Message
935 | PersistentDataVolumeResize | Resizing PersistentDataVolume 'test-pdv-202201102020041' from 108447924224 to 151G | FAILED | 27 Jan 12:58:53 | 12:59:06 | 0:00:13  | []        | 1     | Unable to allocate storage for volume pvc-2a9ffb4e-fc25-4536-b700-501c2a7a8d80 of logical size 50.0G. Needed 51.56G of type HDD in physical space but found only 42.28G available. Check available capacity, maximum volume count, physical sector size and maintenance mode for the drives.
  2. Run the following command to get information about the PV and PVC for the impacted PDV volumes:

# robin pdv list <PDV name>

Example:

# robin pdv list test-pdv-202201102020041
Name                     | Owner/Tenant | Access  | Size | Media | PV                                       | PVC
test-pdv-202201102020041 | u1/tenant1   | Private | 100G | HDD   | pvc-2a9ffb4e-fc25-4536-b700-501c2a7a8d80 | t038-u000040/pvc-38-40-16420352860585
  3. Save the PVC configuration file.

# kubectl get pvc -n <namespace> <pvc> -o yaml > <file.yaml>

Example:

# kubectl get pvc -n t002-u000006 pvc-38-40-16420352860585 -o yaml > pvc.yaml

Example PVC file:

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  annotations:
    pv.kubernetes.io/bind-completed: "yes"
    pv.kubernetes.io/bound-by-controller: "yes"
    robin.io/faultdomain: host
    robin.io/media: HDD
    robin.io/replication: "2"
    robin.io/rpool: default
    volume.beta.kubernetes.io/storage-provisioner: robin
    volume.kubernetes.io/storage-provisioner: robin
  creationTimestamp: "2022-01-13T00:54:46Z"
  finalizers:
  - kubernetes.io/pvc-protection
  labels:
    robin.io/domain: ROBIN_PDV
    robin.io/tenant: tenant1
    robin.io/tenant_id: "38"
    robin.io/user_id: "40"
    robin.io/username: u1
  name: pvc-38-40-16420352860585
  namespace: t038-u000040
  resourceVersion: "2378648"
  uid: 2a9ffb4e-fc25-4536-b700-501c2a7a8d80
spec:
  accessModes:
  - ReadWriteMany
  resources:
    requests:
      storage: 200Gi
  storageClassName: robin-immediate
  volumeMode: Filesystem
  volumeName: pvc-2a9ffb4e-fc25-4536-b700-501c2a7a8d80
status:
  accessModes:
  - ReadWriteMany
  capacity:
    storage: 100Gi
  conditions:
  - lastProbeTime: null
    lastTransitionTime: "2022-01-27T17:01:41Z"
    status: "True"
    type: Resizing
  phase: Bound
  4. Edit the PVC YAML file and remove the following attributes:

vi <file.yaml from Step 3>
  • In the metadata annotations, remove the following attributes:

pv.kubernetes.io/bind-completed
pv.kubernetes.io/bound-by-controller
  • In metadata, remove the following attributes:

creationTimestamp
resourceVersion
uid
  • Remove the complete status section in the PVC YAML file.

Edited PVC YAML file example:

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  annotations:
    robin.io/faultdomain: host
    robin.io/media: HDD
    robin.io/replication: "2"
    robin.io/rpool: default
    volume.beta.kubernetes.io/storage-provisioner: robin
    volume.kubernetes.io/storage-provisioner: robin
  finalizers:
  - kubernetes.io/pvc-protection
  labels:
    robin.io/domain: ROBIN_PDV
    robin.io/tenant: tenant1
    robin.io/tenant_id: "38"
    robin.io/user_id: "40"
    robin.io/username: u1
  name: pvc-38-40-16420352860585
  namespace: t038-u000040
spec:
  accessModes:
  - ReadWriteMany
  resources:
    requests:
      storage: 200Gi
  storageClassName: robin-immediate
  volumeMode: Filesystem
  volumeName: pvc-2a9ffb4e-fc25-4536-b700-501c2a7a8d80
  5. In the edited YAML file, change the spec.resources.requests.storage size of the underlying Robin volume.

  6. Mark the PersistentVolume (PV) that is bound to the PersistentVolumeClaim (PVC) with the Retain reclaim policy. This will prevent the underlying volume from being deleted when the PVC is deleted.

Example:

# kubectl patch pv pvc-2a9ffb4e-fc25-4536-b700-501c2a7a8d80 -p '{"spec":{"persistentVolumeReclaimPolicy":"Retain"}}'
persistentvolume/pvc-2a9ffb4e-fc25-4536-b700-501c2a7a8d80 patched
  7. Patch the PVC so that Robin admission control does not prevent the PDV’s deletion.

Example

# kubectl patch pvc -n t038-u000040 pvc-38-40-16420352860585  -p '{"metadata":{"labels": {"robin.io/override_delete_protection":"true"}}}'
persistentvolumeclaim/pvc-38-40-16420352860585 patched
  8. Delete the PVC.

As PV now has a Retain reclaim policy, you will not lose any data when the PVC is recreated.

Example:

# kubectl delete pvc -n t038-u000040 pvc-38-40-16420352860585
persistentvolumeclaim "pvc-38-40-16420352860585" deleted
  9. Delete the claimRef entry from the PV spec so that the new PVC can bind to it. This should make the PV Available.

Example:

# kubectl patch pv pvc-2a9ffb4e-fc25-4536-b700-501c2a7a8d80 --type json -p '[{"op": "remove", "path": "/spec/claimRef"}]'
persistentvolume/pvc-2a9ffb4e-fc25-4536-b700-501c2a7a8d80 patched
  10. Re-create the PVC with the required size without exceeding the available resources.

Note

The pvc.yaml file created above has the name of the existing PV set in the volumeName attribute. This binds the new PVC to the existing PV.

Example:

# kubectl create -f pvc.yaml
persistentvolumeclaim/pvc-38-40-16420352860585 created
  11. Restore the original reclaim policy of the PV.

Example:

# kubectl patch pv pvc-2a9ffb4e-fc25-4536-b700-501c2a7a8d80  -p '{"spec":{"persistentVolumeReclaimPolicy":"Delete"}}'
persistentvolume/pvc-2a9ffb4e-fc25-4536-b700-501c2a7a8d80 patched

23.3.9. Technical Support

Contact Robin Technical support for any assistance.

23.4. Robin Cloud Native Platform v5.3.11 HF3

The Robin CNP v5.3.11 HF3 release has a new feature and a known issue.

Release Date: 06 May 2022

23.4.1. Infrastructure Versions

The following software applications are included in this CNP release.

Software Application    Version
Kubernetes              1.21.5
Docker                  19.03.9
Prometheus              2.16.0
Prometheus-adapter      0.9.1
Node-exporter           1.1.2
Calico                  3.12.3
HA-Proxy                1.5.18
PostgreSQL              9.6.22
Grafana                 6.5.3

23.4.2. Upgrade Path

The following is the supported upgrade path for Robin CNP v5.3.11 HF3:

  • Robin v5.3.11 (HF2) to Robin v5.3.11 (HF3)

23.4.3. New Feature

23.4.3.1. Support for Cisco DCNM E1000 Virtual Interface

Robin CNP 5.3.11 HF3 supports Cisco Data Center Network Manager (DCNM) E1000 virtual network interface for KVMs. You can deploy the Cisco DCNM application on the Robin CNP cluster.

Note

The Cisco DCNM E1000 Virtual Interface is supported only on KVMs with OVS IP Pool. You can configure the interface only using an input.yaml file. For more information, see here.

23.4.4. Known Issue

Reference ID

Description

PP-27192

Symptom

In some rare scenarios, creating an app from a snapshot of a KVM app fails with this error: Failed to ping instance.

Workaround

Run the following command to restart the KVM instance:

# robin instance restart <name>

23.4.5. Technical Support

Contact Robin Technical support for any assistance.

23.5. Robin Cloud Native Platform v5.3.11 HF4

The Robin CNP v5.3.11 HF4 release has improvements, a fixed issue, and known issues.

Release Date: 19 June 2022

23.5.1. Infrastructure Versions

The following software applications are included in this CNP release.

Software Application    Version
Kubernetes              1.21.5
Docker                  19.03.9
Prometheus              2.16.0
Prometheus-adapter      0.9.1
Node-exporter           1.1.2
Calico                  3.12.3
HA-Proxy                1.5.18
PostgreSQL              9.6.22
Grafana                 6.5.3

23.5.2. Upgrade Path

The following is the supported upgrade path for Robin CNP v5.3.11 HF4:

  • Robin v5.3.11 (HF2) to Robin v5.3.11 (HF4)

23.5.3. Improvements

23.5.3.1. Rocky Linux Support

Robin CNP v5.3.11 HF4 supports Rocky Linux 8.6. You can install Robin CNP v5.3.11 HF4 on Rocky Linux 8.6 servers.

The following are the supported Rocky Linux 8.6 kernel versions:

  • 4.18.0-372.9.1.rt7.166.el8.x86_64 (RT Kernel)

  • 4.18.0-372.9.1.el8.x86_64 (Non-RT Kernel)

23.5.3.2. Disable Init Containers and Sidecars in Bundle App

Robin CNP v5.3.11 HF4 supports disabling the Init Containers and Sidecars in Robin Bundle apps using the input.yaml file when deploying the Bundle apps.

The following is a sample Robin Bundle file:

name: dpdk-intel
version: v1
icon: icon.png
snapshot: enabled
clone: enabled
roles:
- pktgen
pktgen:
  name: pktgen
  norootfs: true
  image:
    name: robinsys/dpdk-intel
    version: v1
    engine: docker
    imagePullPolicy: IfNotPresent
    entrypoint: entry.sh
  compute:
    memory: 1G
    cpu:
      reserve: true
      cores: 2
  initContainers:
  - name: init1
    image: 'robinsys/dpdk-intel:v1'
    imagePullPolicy: IfNotPresent
    resources:
      limits:
        cpu: 25m
        memory: 128Mi
    command:
    - sleep
    - '5'
  sidecars:
  - name: side1
    image: 'robinsys/dpdk-intel:v1'
    imagePullPolicy: IfNotPresent
    command:
    - /bin/bash
    - '-c'
    - trap 'exit 0' SIGTERM; while true; do sleep 1; done
    resources:
      limits:
        memory: 200Mi
        cpu: '1'
  - name: side2
    image: 'robinsys/dpdk-intel:v1'
    imagePullPolicy: IfNotPresent
    command:
    - /bin/bash
    - '-c'
    - trap 'exit 0' SIGTERM; while true; do sleep 1; done
    resources:
      limits:
        memory: 200Mi
        cpu: '1'

Input Yaml file for disabling Init Containers and Sidecars

The Robin Bundle sample file above has two sidecars (side1 and side2) and one Init container (init1).

Using the following sample input.yaml file, you can disable Init Containers and sidecars. In this example, the side1 sidecar and the init1 Init container are disabled.

roles:
- name: pktgen
  containers:
    - name: side2
      disabled: false
    - name: side1
      disabled: true
    - name: init1
      disabled: true

You can use the input.yaml file when creating an app using the Robin Bundle.

Syntax

Run the following command to create an app from the Robin Bundle with the input.yaml file:

# robin app create from-bundle <appname> <bundleid> <yamlfile> --rpool <rpool> --wait

23.5.4. Fixed Issues

Reference ID

Description

PP-27304

A 503 error message appeared due to an HAProxy timeout.

To fix this issue, you need to increase the HAProxy timeout by setting the connect_timeout attribute to 60 seconds using the robin config update command.
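The following is a sketch of such a command, assuming the HAProxy settings are exposed under an haproxy section of robin config; verify the actual section and attribute names with robin config list before applying:

# robin config update haproxy connect_timeout 60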

23.5.5. Known Issues

Reference ID

Description

PP-27400

Symptom

The --disablerepo=* option does not work with CentOS 8 and Rocky Linux 8.

Workaround

You can disable all repos by creating a backup folder and manually moving all repo files to it.

Run the following commands to manually move all repo files:

  1. mkdir /etc/yum.repos.d.backup

  2. mv /etc/yum.repos.d/* /etc/yum.repos.d.backup/

PP-27613

When you create an IP pool, the IP pool creation succeeds; however, the app creation fails with the error IP Pool does not exist. This issue occurs because the pool creation is registered in Kubernetes but fails to register in the database.

Workaround

  1. Log in to the Robin server pod from any node using rbash master.

  2. Run the following command to restart the Robin server.

systemctl restart robin-server

  3. Run the following command to verify the IP pool list.

robin ip-pool list

23.5.6. Technical Support

Contact Robin Technical support for any assistance.