25. Release Notes

25.1. Robin Cloud Native Platform v5.4.1

The Robin Cloud Native Platform (CNP) v5.4.1 release has new features, improvements, fixed issues, and known issues.

Release Date: 12 October 2022

25.1.1. Infrastructure Versions

The following software applications are included in this CNP release.

Software Application    Version
--------------------    -------
Kubernetes              1.23.8
Docker                  19.03.9
Prometheus              2.38.0
Prometheus Adapter      0.10.0
Node Exporter           1.3.1
Calico                  3.21.5
HAProxy                 2.4.7
PostgreSQL              14.5
Grafana                 9.1.3

25.1.2. Upgrade Path

The following are the supported upgrade paths for Robin CNP v5.4.1:

  • Robin v5.3.11-217 HF2 to Robin v5.4.1 GA

  • Robin v5.3.11-450 HF4 to Robin v5.4.1 GA

  • Robin v5.3.13-92 GA to Robin v5.4.1 GA

25.1.3. New Features

25.1.3.1. Volume Snapshots

Robin CNP v5.4.1 provides the Volume Snapshot feature to create crash-consistent snapshots containing one or more Robin volumes (Regular or PDVs).

You can back up volume snapshots to any supported external storage repository in the cloud.

The volume snapshots feature enables you to create a new volume using volume snapshots.

Note

You can use the volume snapshot and volume backup features only for volumes that are not part of Robin bundle applications.

25.1.3.2. Support for Volume Backup

A volume backup is a volume snapshot of a single or multiple volumes that you can push to an external cloud storage repository. Robin CNP v5.4.1 enables you to push the existing volume snapshots to the registered external cloud storage repository, or it creates snapshots when you create a new volume backup.

Note

You can use the volume snapshot and volume backup features only for volumes that are not part of Robin bundle applications.

25.1.3.3. Support for Rocky Linux 8.6

Robin supports the Rocky Linux Kernel version 4.18.0-372.9.1.el8.x86_64 and lower.

25.1.3.4. Blacklist an individual IP, a set of IPs, or a range of IP addresses

Robin CNP v5.4.1 allows you to blacklist an individual IP address, a set of IP addresses, or a range of IP addresses in an IP-Pool when you do not want to use these IP addresses. Robin CNP does not assign the blacklisted IP address to any application. You can blacklist both IPv4 and IPv6 addresses.

You can blacklist IP addresses when creating an IP-Pool or when updating an existing IP-Pool.
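The effect of blacklisting can be illustrated with a small, self-contained sketch. It is not Robin CNP code: the `start-end` range notation, the function names, and the sample pool are assumptions made for illustration only. It expands individual addresses and ranges into a set and filters them out of the addresses a pool could hand out, for IPv4 and IPv6 alike.

```python
import ipaddress


def expand_entries(entries):
    """Expand single IPs and 'start-end' ranges (hypothetical notation) into a set."""
    blocked = set()
    for entry in entries:
        if "-" in entry:
            start_text, end_text = entry.split("-", 1)
            start = ipaddress.ip_address(start_text.strip())
            end = ipaddress.ip_address(end_text.strip())
            if start.version != end.version or int(start) > int(end):
                raise ValueError(f"invalid range: {entry}")
            current = start
            while current <= end:
                blocked.add(current)
                current += 1
        else:
            blocked.add(ipaddress.ip_address(entry.strip()))
    return blocked


def assignable(pool_cidr, blacklist):
    """Return the host addresses of the pool that are not blacklisted."""
    network = ipaddress.ip_network(pool_cidr)
    blocked = expand_entries(blacklist)
    return [ip for ip in network.hosts() if ip not in blocked]


# Hypothetical pool: one single address and one range are blacklisted.
remaining = assignable("192.168.10.0/29", ["192.168.10.2", "192.168.10.4-192.168.10.5"])
print([str(ip) for ip in remaining])
```

The same filtering idea applies to reserved addresses, with the difference that a reserved address can still be assigned explicitly as a static IP address.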

25.1.3.5. Reserve a single IP, set of IPs, or a range of IP addresses

Robin CNP v5.4.1 allows you to reserve an individual IP address, a set of IP addresses, or a range of IP addresses in an IP-Pool. You can reserve an IP address to assign it for a specific purpose. Robin CNP does not assign the reserved IP address to any application. If you want to use a reserved IP address, you need to assign it as a static IP address. You can reserve both IPv4 and IPv6 addresses.

You can reserve IP addresses when creating an IP-Pool or when updating an existing IP-Pool.

25.1.3.6. Support for Static MAC Address

Robin CNP v5.4.1 allows you to assign static MAC addresses for both Robin Bundle and non-Robin Bundle applications. By default, when a Pod restarts, its IP address and MAC address change. When you assign a static MAC address to a Pod, the Pod retains that MAC address after it restarts.

When creating an application, you can assign the static MAC address to the Pod using the static_macs parameter in the IP-Pool.

Note

Robin does not support the static MAC address for KVM-based applications.

25.1.3.7. Custom CA certificate and key

Robin CNP v5.4.1 allows you to use a custom CA certificate and key when installing a Robin CNP cluster. You can get the custom CA certificate and key from an external trusted CA or a dedicated internal public key infrastructure service. After getting the custom CA certificate and its key, make sure that they are configured as an intermediate CA certificate. The intermediate CA certificate is the signing certificate that signs the other certificates generated by the cluster.

You need to specify the following key pair values in the config.json file for one of the master nodes:

  • ca-cert-path

  • ca-key-path

25.1.3.8. Kubernetes audit logs

Robin CNP v5.4.1 supports the Kubernetes audit logs feature. The audit logs are a set of records with a chronological list of all requests made to the Kubernetes API server.

You can find the Kubernetes audit logs at /var/log/Kubernetes/audit/audit.log on any master node of your cluster.

Note

By default, the Kubernetes audit logs feature is enabled.

Robin CNP logs the following operations at the metadata level audit policy:

  • Create request

  • Patch request

  • Update request

  • Delete request
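As a rough illustration of how these records can be inspected, the sketch below filters audit events by request verb. It assumes the log holds one JSON event per line with the standard `verb` and `objectRef` fields of a Kubernetes audit event; the function name and the sample records are hypothetical.

```python
import json

# Verbs that correspond to the create, patch, update, and delete requests listed above.
WRITE_VERBS = {"create", "patch", "update", "delete"}


def write_requests(lines):
    """Return (verb, resource, name) for each audit event that modifies an object."""
    events = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        event = json.loads(line)
        if event.get("verb") in WRITE_VERBS:
            ref = event.get("objectRef", {})
            events.append((event["verb"], ref.get("resource"), ref.get("name")))
    return events


# Hypothetical sample records, trimmed to the fields used here.
sample = [
    '{"kind": "Event", "level": "Metadata", "verb": "create", "objectRef": {"resource": "pods", "name": "web-0"}}',
    '{"kind": "Event", "level": "Metadata", "verb": "get", "objectRef": {"resource": "pods", "name": "web-0"}}',
    '{"kind": "Event", "level": "Metadata", "verb": "delete", "objectRef": {"resource": "secrets", "name": "tls"}}',
]
print(write_requests(sample))
```

On a master node, the same function could be fed the lines of the audit log file instead of the sample list.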

25.1.3.9. Support for Best-Effort Quality of Service (QoS) in isolated CPU setups

Robin CNP v5.4.1 supports the Best-Effort Quality of Service (QoS) for isolated CPU setups. In isolated CPU setups, the non-application Pods or the control plane Pods use some isolated CPU cores along with the non-isolated CPU cores.

You must enable the Best-Effort QoS feature to stop the non-application or control plane Pods from using the isolated CPU cores.

Once you enable this feature, the CPU request for these Pods is automatically set to zero.

To enable the Best-Effort QoS feature, you need to specify the following key pair value in the config.json file for one of the master nodes:

  • "best-effort-qos": "True"

Note

You can enable this feature during the Robin CNP installation only.

25.1.3.10. License Expiry Verification

Starting from Robin CNP v5.4.1, Robin CNP verifies the license expiry date and generates events and alerts. Robin CNP generates a warning alert seven days before the license expiry date, a license expired event when the license expires, and other license expiry-related events.

You can set the interval for verifying the license status using the robin schedule update command.

The license expiry verification feature is by default set to True.

Robin CNP provides the following events and alerts:

  • EVENT_LICENSE_NOT_ACTIVATED - This event is generated when the license is not activated after installing Robin CNP.

  • EVENT_LICENSE_EXPIRATION_WARNING - This alert is generated when the license is going to expire.

  • EVENT_LICENSE_EXPIRED - This event is generated after the license is expired.

  • EVENT_LICENSE_EXPIRING_TODAY - This alert is generated if the license expiry date is the same as the current date.
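The relationship between the expiry date and these events can be sketched as a simple date comparison. This is only an illustration of the rules stated above, not Robin CNP's implementation; the function name is hypothetical, and whether the warning is repeated on every check inside the seven-day window is an assumption.

```python
from datetime import date

# From the release note: the warning alert starts seven days before expiry.
WARNING_WINDOW_DAYS = 7


def license_event(today, expiry, activated=True):
    """Return the event name that applies on the given day, or None if none applies."""
    if not activated:
        return "EVENT_LICENSE_NOT_ACTIVATED"
    days_left = (expiry - today).days
    if days_left < 0:
        return "EVENT_LICENSE_EXPIRED"
    if days_left == 0:
        return "EVENT_LICENSE_EXPIRING_TODAY"
    if days_left <= WARNING_WINDOW_DAYS:
        # Assumption: the warning applies on each check within the window.
        return "EVENT_LICENSE_EXPIRATION_WARNING"
    return None


print(license_event(date(2022, 10, 5), date(2022, 10, 12)))
```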

25.1.3.11. Support for Cisco DCNM E1000 Virtual Interface

Robin CNP v5.4.1 supports Cisco Data Center Network Manager (DCNM) E1000 virtual network interface for KVMs. You can deploy the Cisco DCNM application on the Robin CNP cluster.

Note

The Cisco DCNM E1000 Virtual Interface is supported only on KVMs with OVS IP Pool. You can configure the interface only using input.yaml file.

25.1.3.12. Robin Asynchronous Disaster Recovery (Tech Preview)

Starting from Robin CNP v5.4.1, Robin.io provides the snapshot-based Asynchronous Disaster Recovery (DR) feature.

The feature enables you to replicate your Kubernetes-based stateful applications along with their constructs (PVC, StatefulSet, config maps, secrets, services, etc.) onto a remote secondary peer cluster (site), and you can manually fail over to it in the event of a disaster or maintenance activities. You can enable encryption when transmitting data over the wire to a peer cluster.

The Robin Asynchronous Disaster Recovery feature allows you to bring your applications online faster by failing over to the secondary cluster (site) in the event of a disaster with a minimum application downtime and failback later.

25.1.4. Improvements

25.1.4.1. Improved Robin CNP Install and Upgrade

Starting with Robin CNP v5.4.1, only the GoRobin utility tool is available for installing and upgrading Robin CNP v5.4.1. GoRobin uses the new scripts that are provided as part of CNP v5.4.1.

25.1.4.2. Auto-renewing Robin CNP License

Starting from Robin CNP v5.4.1, you can auto-renew the Robin CNP license by activating the license proxy for your Robin CNP clusters. A license proxy can be linked to multiple Robin CNP clusters. You need to set up and activate the license proxy by yourself. Once you activate the license proxy, the license is automatically renewed as per the renewal period mentioned in the license proxy.

25.1.4.3. Install Robin CNP on nodes running on different operating systems

Starting from Robin CNP v5.4.1, Robin supports the installation of Robin CNP on the nodes running on different operating systems supported by Robin.

For example, if you want to install Robin CNP v5.4.1 on a three-node cluster, you can have all three nodes with different operating systems, such as one node with CentOS 7, the second node with RHEL 8, and the third node with Rocky Linux 8.6.

25.1.4.4. UI Support for adding SSH key in Robin Bundle applications

Starting from Robin CNP v5.4.1, a user can also add or delete SSH keys for the Robin Bundle applications. Robin supports only Rivest Shamir Adleman (RSA) SSH key pairs, not Digital Signature Algorithm (DSA) and Elliptic Curve Digital Signature Algorithm (ECDSA) SSH key pairs.

Adding an SSH key for an application is now available in the UI. You can add the SSH key when creating the application using the UI.

25.1.4.5. Chargeback support for non-Robin Bundle applications

Starting from Robin CNP v5.4.1, Robin CNP supports the chargeback utility for non-Robin Bundle applications also. You need to start the metrics to enable the chargeback feature for non-Robin Bundle applications.

Prometheus and the chargeback_track_k8s_resusage config variable are automatically enabled when you start metrics.

The chargeback utility tracks the usage and cost of the non-Robin Bundle application’s resources such as CPU, GPU, MIG, Memory, Storage space (HDD, SSD), HugePages, SR-IOV, FPGA devices, and PCI devices.

Note

When you stop an application, you will still be billed for the storage bound to it.

25.1.4.6. Provide custom name for Robin default Calico IP-Pool

Starting from Robin CNP v5.4.1, you can add the custom name for the Robin default Calico IP-Pool robin-default. To add the custom name for the Robin default Calico IP-Pool, you must specify the name in the following key pair in the config.json file for one of the master nodes:

  • robin-default-ippool-name

If you do not specify the custom name, Robin CNP uses the robin-default name for the Robin default Calico IP-Pool.

Note

You can use this option during the Robin CNP installation only.

25.1.4.7. Support for recreating existing MIG partitions automatically

Starting from Robin CNP v5.4.1, you can automatically recreate the existing MIG partitions when a node is rebooted. You need to enable the create_mig_partitions_on_reboot attribute by setting it to True in the robin config list before rebooting the node.

Note

By default, the create_mig_partitions_on_reboot attribute is disabled.

25.1.4.8. UI support to manage network policies

Starting from Robin CNP v5.4.1, you can manage the network policies for your cluster through UI.

25.1.4.9. View rpool name in the robin drive list command

Starting from Robin CNP v5.4.1, an option to show the rpool name along with the host is added to the output of the robin drive list command.

25.1.4.10. Support to add-routes and remove-routes to an existing IP-Pool

Starting from Robin CNP v5.4.1, support for add-routes and remove-routes to an existing IP-Pool is added.

25.1.4.11. View PersistentVolumeClaim (PVC) information for a volume

Starting from Robin CNP v5.4.1, the following two options are added to the robin volume info command to show the PVC information for a volume:

  • --pvc-name

  • --namespace

25.1.4.12. Add SSH key pair in a separate file for passwordless login to KVM-based VMs

Starting from Robin CNP v5.4.1, you can log in to KVM-based VMs without a password. For passwordless login to KVM-based VMs, you need to add the SSH key pair in a separate YAML file and add the location of the SSH key pair in the manifest file of the KVM-based VMs.

Now, create KVM-based VMs using the YAML file where you added the SSH key pair. After creating KVM-based VMs, you can log in to the KVM-based VMs without a password.

25.1.4.13. Support for NVIDIA HGX hardware for Robin CNP

Starting from Robin CNP v5.4.1, you can also deploy Robin CNP on specialized hardware such as NVIDIA HGX and DGX servers.

25.1.4.14. Collection ID in Robin Bundle Info

The robin bundle info command displays the Collection ID. It enables you to know to which File Collection the bundle belongs.

25.1.4.15. Support for jinja variables in Bundle manifest file for PDV section

Starting with Robin CNP v5.4.1, the Robin Bundle manifest file supports jinja variables in the PDV section. You can set the PDV mount paths in the manifest file, and these mount paths are then auto-populated in the UI. Before this feature, the PDV mount paths were fixed.

The following are supported jinja variables:

  • namespace

  • user

  • tenant

  • resourcepool

Note

Appropriately named PDVs must exist in the namespace.

Example:

Update the bundle manifest file by adding a pdvs section at the same indentation level as storage and compute like so:

pdvs:
   - name: "{{namespace}}-data"
     mount_path: "/data/{{namespace}}-data"
   - name: "{{namespace}}-data-2"
     mount_path: "/data/{{namespace}}-data-2"

This results in the following auto-filled PDV section in the GUI:

_images/autofilled-PDV.png

Note the jinja variable substitution where {{namespace}} has been substituted with the current namespace, namely t001-u000004.

Also, note the auto-filled mount paths as specified by the mount_path variable.
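The substitution described above can be approximated with a few lines of Python. This is a minimal sketch that uses a regular expression rather than the actual Jinja engine, so it only handles plain `{{name}}` placeholders; the function name is hypothetical, and the namespace value is the one from the example above.

```python
import re

# Matches placeholders such as {{namespace}} or {{ namespace }}.
PLACEHOLDER = re.compile(r"\{\{\s*(\w+)\s*\}\}")


def render(template, variables):
    """Replace each {{name}} placeholder with its value; unknown names raise KeyError."""
    def replace(match):
        return str(variables[match.group(1)])
    return PLACEHOLDER.sub(replace, template)


variables = {"namespace": "t001-u000004"}
print(render("{{namespace}}-data", variables))
print(render("/data/{{namespace}}-data-2", variables))
```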

25.1.4.16. Disable Init Containers and Sidecars in Bundle App

Robin CNP v5.4.1 supports disabling the Init Containers and Sidecars in the Robin Bundle apps using the input.yaml file when deploying the Bundle apps.

The following is the sample Robin Bundle file:

name: dpdk-intel
version: v1
icon: icon.png
snapshot: enabled
clone: enabled
roles:
- pktgen
pktgen:
  name: pktgen
  norootfs: true
  image:
    name: robinsys/dpdk-intel
    version: v1
    engine: docker
    imagePullPolicy: IfNotPresent
    entrypoint: entry.sh
  compute:
    memory: 1G
    cpu:
      reserve: true
      cores: 2
  initContainers:
    - name: init1
      image: 'robinsys/dpdk-intel:v1'
      imagePullPolicy: IfNotPresent
      resources:
        limits:
          cpu: 25m
          memory: 128Mi
      command:
        - sleep
        - '5'
  sidecars:
    - name: side1
      image: 'robinsys/dpdk-intel:v1'
      imagePullPolicy: IfNotPresent
      command:
        - /bin/bash
        - '-c'
        - trap 'exit 0' SIGTERM; while true; do sleep 1; done
      resources:
        limits:
          memory: 200Mi
          cpu: '1'
    - name: side2
      image: 'robinsys/dpdk-intel:v1'
      imagePullPolicy: IfNotPresent
      command:
        - /bin/bash
        - '-c'
        - trap 'exit 0' SIGTERM; while true; do sleep 1; done
      resources:
        limits:
          memory: 200Mi
          cpu: '1'

Input Yaml file for disabling Init Containers and Sidecars

The earlier sample Robin Bundle file defines the sidecars side1 and side2 and the Init container init1.

The following sample input.yaml file disables the side1 sidecar and the init1 Init container, and leaves the side2 sidecar enabled:

roles:
- name: pktgen
  containers:
    - name: side2
      disabled: false
    - name: side1
      disabled: true
    - name: init1
      disabled: true

You can use the input.yaml file when creating an app using the Robin Bundle.

Syntax

Run the following command when creating an app using the Robin Bundle.

# robin app create from-bundle <appname> <bundleid> <yamlfile> --rpool <rpool> --wait
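To make the effect of the disabled flag concrete, the following sketch computes which containers of a role remain enabled once the overrides from an input file are applied. It only models the behavior described in this section; the function name and the in-memory data layout are assumptions, not part of Robin CNP.

```python
def enabled_containers(bundle_names, overrides):
    """Return the bundle's Init container and sidecar names that are not disabled."""
    disabled = {item["name"] for item in overrides if item.get("disabled", False)}
    return [name for name in bundle_names if name not in disabled]


# Container names from the sample Bundle file and overrides from the sample input.yaml file.
bundle_names = ["init1", "side1", "side2"]
overrides = [
    {"name": "side2", "disabled": False},
    {"name": "side1", "disabled": True},
    {"name": "init1", "disabled": True},
]
print(enabled_containers(bundle_names, overrides))
```

In this model, a container that is not mentioned in the overrides stays enabled, which is an assumption about the default behavior.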

25.1.4.17. Robin StorageClass with runAsAny parameter

Robin CNP v5.4.1 provides a new parameter runAsAny in the StorageClass object to enable any user other than the root user to read or write to an NFS mount point of an RWX volume.

You can use this parameter in a scenario with multiple containers and different users, and you want to allow any user accessing the Pod (containers) to read or write to an NFS mountpoint of an RWX volume.

In the StorageClass object file, set the runAsAny parameter to True.

The following is an example of the StorageClass with runAsAny parameter:

apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: run-as-any-imm
  labels:
    app.kubernetes.io/instance: robin
    app.kubernetes.io/managed-by: robin.io
    app.kubernetes.io/name: robin
provisioner: robin
reclaimPolicy: Delete
allowVolumeExpansion: true
volumeBindingMode: Immediate
parameters:
  replication: '2'
  media: HDD
  runAsAny: "true"

25.1.4.18. Sherlock Volume Health report with New Details

Starting with Robin CNP v5.4.1, the Sherlock volume health report displays the following details as part of the report output:

  • Potential IO stalls on mounts

  • NFS Exports

  • NFS Server Pods

25.1.4.19. NIC Tags in Robin Bundle Application

Robin CNP supports providing NIC tags for SR-IOV IP-Pools as part of the Robin Bundle application. You can provide NIC tags as part of the Bundle application even if NIC tags are not part of an IP-Pool.

You can use the NIC tags to request specific physical NIC in a Robin Bundle application’s manifest file. Also, you can provide the required NIC tag details using the input.yaml file.

Provide the NIC tags in the key: value format. You can use only the name as the key in the key: value format. The value must be the physical NIC name on your node.

25.1.4.20. Persistent Storage for vDU and vCUs

Robin CNP v5.4.1 supports the persistent storage for virtualized Distributed Units (vDUs) and virtualized Central Units (vCUs).

This feature preserves the data written inside the app when you restart, stop, or start the VM app.

25.1.4.21. Set Limits on non-Bundle Applications for Tenant and Users

Starting with Robin CNP v5.4.1, you can set limits on non-Bundle applications for tenants and tenant users at the resource pool and application levels.

25.1.4.22. Support for PCI Device Resource

Robin CNP v5.4.1 enables you to create PCI Device resources using the PCI details.

A PCI device resource in Robin CNP comprises PCI FPGA device ID, vendor ID, device type, and device driver.

You can use the provided PCI resource name as a resource in the Robin Bundle manifest file and Helm charts.

When you use a PCI device resource name in the Robin Bundle manifest file or Helm chart, you do not need to add PCI device details (device ID, vendor ID, device type, driver, etc.).

Note

For Helm charts, you must provide the PCI Device resource name in the form of an annotation.

You can also create and manage PCI Device resource names from the Robin CNP UI.

25.1.4.23. Support for ephemeral containers

Robin CNP v5.4.1 supports ephemeral containers. You can add ephemeral containers to a Pod at runtime. Use this feature for debugging distroless containers or any container that does not have utilities needed for debugging.

25.1.4.24. Deprecated events

The following events are deprecated in Robin CNP v5.4.1.

  • EVENT_MGT_MASTER_FAILOVER

  • EVENT_MGT_MANAGER_UNREACHABLE

  • EVENT_COLLECTION_ERROR

  • EVENT_COLLECTION_OFFLINE

  • EVENT_COLLECTION_OFFLINE_FAILED

  • EVENT_COLLECTION_ONLINE

  • EVENT_COLLECTION_ONLINE_FAILED

25.1.5. Fixed Issues

Reference ID

Description

PP-27464

Nessus scans discovered a vulnerability of enabled debugging functions like HTTP TRACE and TRACK. This issue is fixed.

PP-28559

When an application is deployed using a custom interface name, the interface name is not appearing inside the Pod. This issue is fixed.

PP-25070

Vulnerability CVE-2021-41103 is related to containerd runtime. The container root directories and some plugins had insufficiently restricted permissions. It allows unprivileged Linux users to traverse directory contents and execute programs. For more information about this vulnerability, see CVE-2021-41103.

PP-24313

The issue of Robin Bundle getting added to a log collection in place of a File collection is fixed.

PP-28267

IP-Pool creation for the OVS driver fails with the following error:

“ValidatingWebhookConfiguration” for ippool “ippoolcr-validating-webhook” was not created.

This issue is fixed now.

PP-27304

The 503 error message appears due to the timeout of the HAProxy.

To fix this issue, use the robin config update command to increase the HAProxy connect_timeout attribute to 60 seconds.

PP-26389

Sherlock commands are not working for Robin CNP v5.3.5-213. This issue is fixed now.

PP-24787

Duplicate entries in the nfs_service_endpoint table caused a job storm that eventually impacted other app deployments and degraded performance. This issue is fixed now.

PP-24725

The drive evacuate command fails for replicated volume. This issue is fixed now.

PP-22516

When installing Kubernetes master nodes, node taints may not apply successfully. This issue is fixed now.

PP-22098

Robin does not allow exploding IP addresses for IPv6 IP-Pools except for the last octets. This issue is fixed now.

PP-28408

Robin CNP failed to allocate a static IP address for a Pod in a custom controller because a stale static IP address remained in the database after the Pod was deleted. This issue is fixed.

25.1.6. Known Issues

Reference ID

Description

PP-25246

Symptom

When you try to delete a KVM application, the deletion process might be stuck as the Virsh commands on the node may not respond.

Workaround:

Reboot the node.

PP-21916

Symptom

A Pod IP is not pingable from any other node in the cluster, apart from the node where it is running.

Workaround:

Delete the Calico Pod running on the node where the issue is seen.

PP-27620

Symptom

Sync with the secondary peer cluster fails due to multiple snapshot restore failures.

Workaround:

Restart the iomgr-server on the affected node.

  1. Log in to the robinds Pod on the affected node (rbash robin)

  2. Run the command systemctl restart iomgr-server

  3. Check the state of the connections using the rdvm conn list command.

PP-27775

Symptom

When upgrading from Robin CNP v5.3.11-HF2 to Robin CNP v5.4.1, one of the hosts is stuck in the NotReady state.

Workaround:

You need to delete the worker Pod running on the node that is in the NotReady state.

Perform the following steps to delete the worker Pod:

  1. Run the following command to know the status of worker Pods:

    # kubectl get pod -n robinio -o wide |grep worker
    
  2. Run the following command to delete the stuck worker Pod:

    # kubectl delete pod -n robinio <pod_name>
    
  3. Reboot the respective node.

PP-27253

Symptom

One node in an HA cluster is in the NotReady state because the worker Pod on the respective node is down with the following error message:

Readiness probe failed: OCI runtime exec failed: exec failed: container_linux.go:380: starting container process caused: process_linux.go:130: executing setns process caused: exit status 1: unknown

Workaround

Contact the Robin support team for workaround steps.

PP-27076

Symptom

In Robin CNP, Kubelet might go down due to the stale cpu_manager_state file.

Workaround

Complete the following steps to fix this issue:

  1. Remove the stale /var/lib/kubelet/cpu_manager_state file using the following command:

    # rm -rf /var/lib/kubelet/cpu_manager_state
    
  2. Restart the Kubelet by running the following command:

    # systemctl restart kubelet
    
  3. Make sure etcd and apiserver Pods on this node are up and running.

PP-28365

Symptom

Nodes are flapping between Ready and NotReady states because the etcd goes out of sync.

Workaround

Contact the Robin CS team for workaround steps.

PP-25360

Symptom

If containers in a Pod are using an RWX PVC and if they are stuck in the ContainerCreating state for a long time and display a timeout error, apply the following workaround.

Workaround

Delete Pods if they are part of a Deployment or StatefulSet.

PP-21469

Symptom

A change in isolcpu is not reflected on the host after a rediscover.

Workaround

  1. Edit /etc/sysconfig/kubelet and update the reserved-cpus parameter to include the CPU IDs.

  2. If the new reserved-cpus is a subset of the existing one, just restart kubelet.

  3. If the new reserved-cpus is not a subset of the existing cpuset, complete steps 4 through 6.

  4. Drain or reboot the K8s node. The aim is to get rid of all Pods that are using CPUs from the new reserved-cpus. You can also delete those specific Pods, but this is for advanced users.

  5. Once all Pods are drained, restart kubelet (if you are rebooting, this is not needed).

  6. Uncordon the K8s node.
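The subset check that decides between a plain kubelet restart and a drain can be sketched as follows. The sketch assumes that reserved-cpus uses the usual Linux CPU list notation (comma-separated IDs and ranges such as 0-3,8); the function names and the sample values are hypothetical.

```python
def parse_cpuset(text):
    """Parse a CPU list such as '0-3,8' into a set of CPU IDs."""
    cpus = set()
    for part in text.split(","):
        part = part.strip()
        if not part:
            continue
        if "-" in part:
            low, high = part.split("-", 1)
            cpus.update(range(int(low), int(high) + 1))
        else:
            cpus.add(int(part))
    return cpus


def next_action(existing, new):
    """Tell which branch of the workaround applies for the new reserved-cpus value."""
    if parse_cpuset(new) <= parse_cpuset(existing):
        return "restart kubelet"
    return "drain or reboot the node, restart kubelet, then uncordon"


print(next_action("0-3", "0,1"))
print(next_action("0-3", "0-5"))
```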

PP-22781

Symptom

After removing a taint on a master node, GPUs are not detected automatically.

Workaround

You need to run the robin host probe --rediscover --all --wait command for the GPUs to be detected on the primary master node.

PP-22853

Symptom

Robin CNP may not detect GPUs in the following scenarios:

  • After Robin CNP installation

  • After upgrading Robin CNP

  • After adding a new node

Workaround

Run the robin host probe <hostname> --rediscover command.

PP-24248

Symptom

When you create a new resource pool and assign it to nodes and later try to deploy a Pod with storage affinity on the node with a newly assigned resource pool, the Pod deployment fails as the node is not taking the correct resource pool.

Workaround

Complete the following steps to fix this issue:

  1. Run the following command to edit the node.

    # kubectl edit node <node_name>
    
  2. Remove the robin.io/robinrpool resource pool.

  3. Add the correct resource pool name.

PP-24736

Symptom

A PVC may not come online after removing an app from the secondary Protection Group on the peer cluster.

Workaround

After you remove the application from the Protection Group and allow the application to start, remove the block_mount label from the PVCs of the application.

PP-27296

Symptom

When installing Robin CNP on Rocky Linux host systems, if the runc package is installed, you must uninstall it before installing Robin CNP.

PP-26942

Symptom

When upgrading your cluster to Robin CNP v5.4.1, there is a DB migration step. During this step, the database briefly goes to read-only mode.

The storage manager service might fail if it tries to write to the database while it is in read-only mode. But the storage manager service will restart when the database resumes the read-write mode. You do not need to take any action.

PP-28494

Symptom

During a non-HA upgrade, the File-server Pod may get stuck in the ContainerCreating state as the volume is unmounted and Kubernetes is not aware of it.

If you notice this issue, apply the following workaround steps.

Workaround

  1. Check which PVC/volume the file-server Pod is using.

  2. Verify that the volume is not mounted by running the robin volume info <volume name> command.

  3. Run the following command to cordon the node where the file-server Pod is mounted.

    # kubectl cordon <node_name>
    
  4. Run the following command to delete the file server Pod.

    # kubectl delete pod -n robinio <file_server_pod_name>
    
  5. Run the following command to uncordon the node you have cordoned in step 3.

    # kubectl uncordon <node_name>
    

PP-28461

Symptom

When you increase the snapshot space limit on the Primary Protection Group, the same is not replicated to the secondary Protection Group.

Workaround

If you need to increase space for snapshots on the secondary protection group, apply the following workaround:

Run the following command on the secondary cluster to update the snapshots space limit:

# robin app snapshot-space-limit

PP-26523

Symptom

Robin Bundle application with PDV or AEV is not supported for disaster recovery.

PP-28460

Symptom

The disaster recovery (DR) initial sync might fail in certain rare scenarios. If the initial sync fails, apply the following workaround.

Workaround

Remove the application from the Protection Group and add it back.

PP-28458

Symptom

If you increase volume size on the primary Protection Group for an application, the change in volume size is not replicated to the secondary Protection group.

Workaround

  1. Remove the application from the Primary Protection Group.

  2. Increase the volume size.

  3. (optional) Delete the application on the secondary Protection Group.

  4. Add the application back to the Primary Protection Group.

PP-28721

Symptom

In the following scenario, the secondary Protection Group is not showing the correct replication state.

Replication is paused on the primary Protection Group and the pause request is updated on the secondary Protection Group.

Meanwhile, the secondary peer cluster goes down, and replication is later resumed on the primary. However, the state on the secondary is not updated after it comes back up.

Workaround

To correct the replication state on the secondary, pause the replication again on the primary and then resume it.

PP-27937

Symptom

You might see an error similar to the following in a DR setup when snapshots are being deleted:

Snapshot default:pvc-75b3a817-6a16-4b5a-a76a-7490f717e590:t001-u000005-rpol-4-1658759728-1658742924 has too many valid descendants

This issue is caused by unreplicated snapshots on the primary cluster. After a failover, the original primary cluster can still hold unreplicated snapshots while the new primary sends new snapshots to the new secondary cluster. The unreplicated snapshots and the new snapshots from the new primary might share common data points, which results in this error.

Workaround

Delete the unreplicated snapshots on the original primary.

PP-28501

Symptom

After upgrading from the existing Robin CNP to Robin CNP v5.4.1 with RWX applications, the NFS server related jobs are stuck.

Workaround

Perform the following steps:

  1. Run the following command to log in to Robin master Pod:

    # rbash master
    
  2. Run the following command to know the Mount State of the storage nodes:

    # stormgr node list
    
  3. Run the following command to unlock the stormgr CLI:

    # stormgr devl unlock
    
  4. Run the following command to unblock the blocked storage node:

    # stormgr node setstatus --block-mount 0 <blocked_storage_node>
    

PP-28077

Symptom

When you try to uninstall Robin CNP without deleting the apps and objects from the cluster, it might get stuck at unmounting /var/lib/kubelet.

Workaround

Power cycle the hosts and rerun the uninstall.

PP-21910

Symptom

Volume APIs get stuck in the RCM server as some mounts get stuck on the worker nodes.

Workaround

Contact the Robin CS team for workaround steps.

PP-27826

Symptom

When you reboot all nodes of a cluster together, RWX Pods are stuck in the CrashLoopBackOff state.

Workaround

Delete the respective Pods.

PP-28125

Symptom

After upgrading from the existing Robin CNP to Robin CNP v5.4.1, RWX PVC Pods are stuck in the ContainerCreating state.

Workaround

Perform the following steps to generate a new FS UUID:

  1. Run the following command to know the Pods that are stuck in ContainerCreating state:

    # kubectl get pods -A | grep ContainerCreating
    
  2. Run the following commands to get the RWX volume used by these Pods:

    # kubectl describe pods <pod_name>
    # kubectl get pvc -A | grep <claim_name>
    
  3. Run the following command to see the respective job output for the RWX volume with NFSAgentAddExport:

    # robin job list | grep <rwx_volume_name> | grep NFSAgentAddExport
    

Example:

# robin job list | grep pvc-b0f33e4d-6d1c-4d17-9ddf-0a67b9f1af51 | grep NFSAgentAddExport
->5840 | NFSAgentAddExport | Adding export for vol pvc-b0f33e4d-6d1c-4d17-9ddf-0a67b9f1af51| COMPLETED | FAILED | 09 Aug 06:39:45 | 06:40:01 | 0:00:16  | [] | 1  | Command '/bin/mount /dev/sdo /var/lib/robin/nfs/robin-nfs-shared-60/ganesha/pvc-b0f33e4d-6d1c-4d17-9ddf-0a67b9f1af51' failed with return code 32: mount: wrong fs type, bad option, bad superblock on /dev/sdo,
              missing codepage or helper program, or other error

              In some cases useful info is found in syslog - try
              dmesg | tail or so.
  4. If you see the above error, run the following command to find the device and host on which this volume is mounted:

    # robin volume list | grep <rwx_volume_name>
    
  5. Run the following command on the host to see the error:

    # cat /var/log/messages | grep <device_name>
    

Example:

# cat /var/log/messages | grep sdo
Aug  9 06:40:00 asa-06 kernel: XFS (sdo): Filesystem has duplicate UUID 47762fc2-1e7c-4863-a551-0fe55b29d0c7 - can't mount
  6. Run the following command to generate a new FS UUID for the respective device to be mounted:

    # xfs_admin -U generate <path to device>
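
The log check in the steps above can be narrowed to the duplicate UUID failures only. The following is a minimal sketch that assumes the kernel log line has the same format as the example above; the helper name xfs_duplicate_uuid_devices is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: reads kernel log lines on stdin and prints
# "<device> <uuid>" for every XFS "duplicate UUID" mount failure.
# Lines that do not match the pattern produce no output.
xfs_duplicate_uuid_devices() {
    sed -n 's/.*XFS (\([^)]*\)): Filesystem has duplicate UUID \([0-9a-fA-F-]*\).*/\1 \2/p'
}

# Usage on the host where the volume is mounted:
#   xfs_duplicate_uuid_devices < /var/log/messages
```

Each device printed by the helper is a candidate for the xfs_admin -U generate command.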
    

PP-26693

Symptom

When running heavy storage workloads on Robin CNP v5.4.1, the Robin DaemonSet Pod on one of the worker nodes is down with the following error:

Readiness probe failed: OCI runtime exec failed: exec failed: container_linux.go:380: starting container process caused: process_linux.go:130: executing setns process caused: exit status 1: unknown

Workaround

To recover, reboot the physical server or contact the Robin CS team to recover without rebooting the server.

PP-27193

Symptom

When upgrading from Robin CNP v5.3.11-HF2 to Robin CNP v5.4.1, RWX Pods may get stuck in the ContainerCreating state as the volume is unmounted and Kubernetes is not aware of it.

If you notice this issue, apply the following workaround steps:

Workaround

  1. Check which PVC/volume the Pod is using.

  2. Verify that the volume is not mounted by running the robin volume info <volume_name> command.

  3. Delete the respective RWX Pod or reboot the respective node.

PP-27678

Symptom

When the node on which the file collection volume is mounted is powered off and you delete a file collection that has a single replica, the file collection delete job fails and leaves the file server Pod in the Terminating state.

Workaround

Run the following command to forcefully delete the file server Pod that is stuck in the Terminating state:

# kubectl delete pod <pod_name> -n <robin_ns> --force

PP-27077

Symptom

When deleting the RWX applications, RWX Pods are stuck in the Terminating state.

Workaround

Perform the following steps for deleting the RWX Pods:

  1. Run the following command to find the NFS server Pod associated with the PVC:

    # robin nfs export-list
    
  2. Delete the NFS server Pod used for the respective PVC.

PP-26768

Symptom

You should not use an IP-Pool associated with dpdk drivers as the default network.

PP-26572

Symptom

Due to inaccuracies in tracking Pod creation, tenant and user limits are not explicitly honored for Helm applications.

PP-26830

Symptom

After deleting the PVCs, Robin CNP cluster is down.

Workaround

Delete the Calico Pod.

PP-26345

Symptom

When you deploy a Pod to use an SR-IOV VF from Ethernet Virtual Function 700 Series 154c, sometimes the Pod gets stuck in the ContainerCreating state with the device busy error message.

Workaround

Delete the Pod that shows the device busy error message.

PP-22881

Symptom

When you try to stop a Windows-based VM, it fails to stop with the following error:

Failed to terminate process 58434 with SIGKILL: Device or resource busy

Workaround

Reboot the respective node and delete the app.

PP-21935

Symptom

Pods are stuck in the ContainerCreating state with the following error:

kubernetes.io/csi: mounter.SetUpAt failed to check for STAGE_UNSTAGE_VOLUME capability

Workaround

Perform the following steps:

  1. Flush connection entries:

    # conntrack -F
    
  2. Delete nodeplugin Pod.

Note

If the nodeplugin Pod has become unusable, future filesystem mounts will fail. This is a symptom of the many retries of NFS mount calls that hang. Bouncing the Pod clears out the hung processes.

PP-22643

Symptom

In some scenarios, a Pod might be stuck in the TopologyAffinityError resulting in VM creation failure.

Workaround

Delete the application and redeploy it after a few minutes.

PP-22626

Symptom

If NVIDIA GPU drivers are already installed on your setup, the Robin GPU operator deployment fails during the Robin CNP install or upgrade process.

Workaround

Run the following steps to fix this issue:

  1. # yum remove nvidia-driver-latest-dkms
    
  2. # yum remove nvidia-container-toolkit
    
  3. Reboot the node.

PP-26581

Symptom

After deleting the PCI resources, the existing Pods that are using the PCI resources are stuck in the ContainerCreating state during the instance relocation.

Workaround

Perform the following steps:

  1. Recreate the PCI resources.

  2. Delete the respective Pod.

PP-28768

Symptom

After upgrading to Robin CNP v5.4.1, you might notice that a previously cordoned node is uncordoned.

Workaround

Put the cordoned nodes in maintenance mode before upgrading.

Alternatively, cordon the nodes again after upgrading to Robin CNP v5.4.1.
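
If you choose to cordon the nodes again after the upgrade, it helps to record which nodes are cordoned before you start. The following is a minimal sketch that relies on the SchedulingDisabled marker that kubectl get nodes shows in the STATUS column of a cordoned node; the helper name cordoned_nodes and the file name cordoned-nodes.txt are hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: reads "kubectl get nodes --no-headers" output on
# stdin and prints the names of nodes whose STATUS column (2nd field)
# contains SchedulingDisabled, that is, the cordoned nodes.
cordoned_nodes() {
    awk '$2 ~ /SchedulingDisabled/ { print $1 }'
}

# Before the upgrade, save the list (the file name is only an example):
#   kubectl get nodes --no-headers | cordoned_nodes > cordoned-nodes.txt
# After the upgrade, cordon each saved node again:
#   while read -r node; do kubectl cordon "$node"; done < cordoned-nodes.txt
```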

PP-28764

Symptom

For Robin Bundles, after you delete a route from an IP Pool, the deleted route might still appear inside the Pod after restarting it.

Workaround

You need to delete the app and add the app again.

PP-28672

Symptom

When removing a node from the Robin CNP cluster using the k8s-script-el8.sh cleanup command, the console displays the following incorrect message:

Kubernetes cluster has Robin CNP installed. Please clean up Robin CNP to continue or use --force.

You can ignore the message and do not need to remove Robin CNP from the cluster.

PP-27276

Symptom

After upgrading to Robin CNP v5.4.1, some Robin Bundle apps might be OFFLINE due to PLAN_FAILED.

Workaround

Manually restart the Robin Bundle apps one by one.

PP-28809

Symptom

After a worker node failure, RWX Pods are stuck in the ContainerCreating state because the VolumeFailoverNFSExport job fails with the following error:

Unable to unmount volume <pvc-name>: vol never mounted at zone/node default/<hostname>

Workaround

Contact the Robin CS team for workaround steps.

PP-28867

Symptom

The robin chargeback report is not displaying the correct SSD drive price in the report. The report is showing 0.0 as the price.

PP-28945

Symptom

The parameters to provide a custom CA certificate are not currently supported with the GoRobin utility, even though they are available in the list of parameters. The following parameters are not supported with GoRobin:

  • --ca-cert-path

  • --ca-key-path

Workaround

You can pass the custom CA certificate parameters as part of the config.json file.

PP-28938

Symptom

When deleting multiple PDVs using the Robin CNP UI, the checkbox for selecting all PDVs (next to Name field) does not work.

Workaround

You must select the corresponding checkbox for each of the PDVs that you want to delete and click Remove.

PP-28946

Symptom

Robin CNP v5.4.1 does not support HashiCorp Vault integration.

PP-28912

Symptom

Support to install or upgrade Robin CNP v5.4.1 as a non-root user is not available.

PP-27530

Symptom

After upgrading to Robin CNP v5.4.1, RWX Pods get stuck in the ContainerCreating state because jobs fail with the error “volume is not accessible, vol state FAULTED”, even though the robin volume list command shows that the volume is in the READY state.

Workaround

Run the following command:

# robin host probe --all --wait

PP-28642

Symptom

When you add or remove a route from an IP-Pool, it does not reflect inside the KVM app.

PP-25677

Symptom

A Pod gets stuck in the terminating state as the node on which this Pod was running is permanently unavailable. The static IP address and static MAC address of the terminating Pod are not released until the Pod is permanently deleted.

Workaround

Permanently delete a Pod by restoring the node or by running the kubectl delete pod command.

PP-28966

Symptom

A Pod deployment fails and the following error message appears in the Pod events: “Error: Vblock with volume_id <> not mounted”.

Workaround

  1. Delete the VolumeAttachment associated with the Pod.

  2. Delete the Pod.
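
Kubernetes VolumeAttachment objects reference a PersistentVolume rather than a Pod, so one way to find the object to delete is to look up the PV bound to the PVC of the Pod and match it in the VolumeAttachment list. The following is a minimal sketch that assumes the default kubectl get volumeattachment column layout (NAME, ATTACHER, PV, NODE, ATTACHED, AGE); the helper name volumeattachments_for_pv is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: reads "kubectl get volumeattachment --no-headers"
# output on stdin and prints the names of the VolumeAttachment objects
# whose PV column (3rd field) equals the PV name passed as the first
# argument.
volumeattachments_for_pv() {
    awk -v pv="$1" '$3 == pv { print $1 }'
}

# Usage:
#   pv=$(kubectl get pvc <claim_name> -n <namespace> -o jsonpath='{.spec.volumeName}')
#   kubectl get volumeattachment --no-headers | volumeattachments_for_pv "$pv"
#   kubectl delete volumeattachment <name_printed_above>
```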

PP-27138

Symptom

Patroni Replicas cannot recover and synchronize with the Leader due to missing WAL files or WAL receiver not running.

Workaround

Contact the Robin support team for workaround steps.

25.1.7. Technical Support

Contact Robin Technical support for any assistance.