25. Release Notes

25.1. Robin Cloud Native Platform v5.5.1

The Robin Cloud Native Platform (CNP) v5.5.1 release notes include pre- and post-upgrade steps, new features, improvements, fixed issues, and known issues.

Release Date: September 30, 2025

25.1.1. Infrastructure Versions

The following software applications are included in this CNP release:

Software Application | Version
---------------------+--------
Kubernetes           | 1.31.6
Docker               | 25.0.2
Prometheus           | 2.39.1
Prometheus Adapter   | 0.10.0
Node Exporter        | 1.4.0
Calico               | 3.28.2
HAProxy              | 2.4.7
PostgreSQL           | 14.12
Grafana              | 9.2.3
CRI Tools            | 1.31.1

25.1.2. Supported Operating System

The following is the supported operating system and kernel version for Robin CNP v5.5.1:

  • CentOS 7.9 (kernel version: 3.10.0-1160.71.1.el7.x86_64)

25.1.3. Upgrade Paths

The following are the supported upgrade paths for Robin CNP v5.5.1:

  • Robin CNP v5.4.3 HF6 to Robin CNP v5.5.1-1950

  • Robin CNP v5.5.0-1857 to Robin CNP v5.5.1-1950

25.1.3.1. Pre-upgrade considerations

  • For a successful upgrade, you must run the possible_job_stuck.py script before and after the upgrade. Contact the Robin Support team for the upgrade procedure using the script.

  • If your cluster already has cert-manager installed, you must uninstall it before upgrading to Robin CNP v5.5.1 (see the example after this list).

  • Robin CNP v5.5.1 does not support the OnDelete strategy for IOMGR Pods during the upgrade process.
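
As noted in the cert-manager consideration above, if your existing cert-manager instance was installed with Helm (an assumption; adjust this to match how cert-manager was originally deployed), a minimal uninstall sketch is:

# helm uninstall cert-manager -n cert-manager
# kubectl delete namespace cert-manager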

25.1.3.2. Post-upgrade considerations

  • After upgrading to Robin CNP v5.5.1, you must run the robin schedule update K8sResSync k8s_resource_sync 60000 command to update the robin schedule K8sResSync.

  • After upgrading to Robin CNP v5.5.1, you must run the robin-server validate-role-bindings command. To run this command, log in to the robin-master Pod (see the example after this list). This command verifies the roles assigned to each user in the cluster and corrects them if necessary.

  • After upgrading to Robin CNP v5.5.1, the k8s_auto_registration config parameter is disabled by default. The config setting is deactivated to prevent all Kubernetes apps from automatically registering and consuming resources. The following are the points you must be aware of with this change:

    • You can manually register Kubernetes apps using the robin app register command and then use Robin CNP for snapshot, clone, and backup operations on them.

    • As this config parameter is disabled, when you run the robin app nfs-list command, the mappings between Kubernetes apps and NFS server Pods are not listed in the command output.

    • If you need the mapping between a Kubernetes app and an NFS server Pod when the k8s_auto_registration config parameter is disabled or the Kubernetes app is not manually registered, get the PVC name from the Pod YAML output (kubectl get pod -n <namespace> -o yaml) and run the robin nfs export list | grep <pvc name> command, as shown in the example after this list.

    • The robin nfs export list command output displays the PVC name and namespace.
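
For reference, the following is a hedged sketch of the commands mentioned in the considerations above; the Pod name, namespace, and the claimName grep pattern are illustrative assumptions, and rbash master is used here in the same way as in the post-upgrade steps to open a shell on the robin-master Pod.

Run the role-binding validation from the robin-master Pod:

# rbash master
# robin-server validate-role-bindings

Look up the NFS export for the PVC of an unregistered Kubernetes app:

# kubectl get pod <pod-name> -n <namespace> -o yaml | grep claimName
# robin nfs export list | grep <pvc-name>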

25.1.3.3. Pre-upgrade steps

Upgrading from Robin CNP v5.4.3 HF6 or Robin CNP v5.5.0-1857 to Robin CNP v5.5.1

Before upgrading from Robin CNP v5.4.3 HF6 or Robin CNP v5.5.0-1857 to Robin CNP v5.5.1, perform the following steps:

  1. Update the value of the suicide_threshold config parameter to 1800:

    # robin config update agent suicide_threshold 1800
    
  2. Disable the NFS Server Monitor schedule:

    # robin schedule disable "NFS Server" Monitor
    
  3. Set the toleration seconds for all NFS server Pods to 86400 seconds. After the upgrade, you must change the toleration seconds as described in the post-upgrade steps.

    # for pod in `kubectl get pod -n robinio -l robin.io/instance=robin-nfs --output=jsonpath={.items..metadata.name}`; do     echo "Updating $pod tolerationseconds to 86400";     kubectl patch pod $pod -n robinio --type='json' -p='[{"op": "replace", "path": "/spec/tolerations/0/tolerationSeconds", "value": 86400}, {"op": "replace", "path": "/spec/tolerations/1/tolerationSeconds", "value": 86400}]'; done
    

Verify and configure pod-max-pids on master nodes

Before you change the maximum Pods per node, verify the pod-max-pids configuration in kubelet on the master nodes and change it if required. Based on the number of vCPUs on the host, set pod-max-pids to at least 4096 on master nodes.

Note

If the current value exceeds 4096 based on application requirements (for example, 10000), you do not need to change it.

Always verify the current value before making changes. You can tune this setting further based on cluster observations.

To verify the current pod-max-pids value and modify it if required, complete the following steps:

  1. Check the current pod-max-pids value:

    # cat /etc/sysconfig/kubelet
    # systemctl status kubelet -l | grep -i pod-max-pids
    
  2. Check whether the value is less than 4096. If the value is 4096 or more, you do not need to update it.

  3. Open the kubelet configuration file and update the pod-max-pids value if it is less than 4096 (see the example after these steps).

    # vi /etc/sysconfig/kubelet
    
  4. Restart the kubelet service:

    # systemctl restart kubelet
    
  5. Verify the updated configuration:

    # systemctl status kubelet -l | grep -i pod-max-pids
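
For illustration only, after the update the KUBELET_EXTRA_ARGS line in /etc/sysconfig/kubelet should carry a pod-max-pids value of at least 4096; the other flags on that line are placeholders and will differ on your nodes:

KUBELET_EXTRA_ARGS="<existing flags> --pod-max-pids=4096"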
    

25.1.3.4. Post-upgrade steps

After upgrading from Robin CNP v5.4.3 HF6 or Robin CNP v5.5.0-1857 to Robin CNP v5.5.1

After upgrading from Robin CNP v5.4.3 HF6 or Robin CNP v5.5.0-1857 to Robin CNP v5.5.1, perform the following steps:

  1. Update the value of the suicide_threshold config parameter to 40:

    # robin config update agent suicide_threshold 40
    
  2. Enable the NFS Server Monitor schedule:

    # robin schedule enable "NFS Server" Monitor
    
  3. Set the check_helm_apps config parameter to False:

    # robin config update cluster check_helm_apps False
    
  4. Set the chargeback_track_k8s_resusage config parameter to False:

    # robin config update server chargeback_track_k8s_resusage False
    
  5. Set the robin_k8s_extension config parameter to True:

    # robin config update manager robin_k8s_extension True
    
  6. Verify whether the following mutating webhooks are present:

    # kubectl get mutatingwebhookconfigurations -A | grep robin
    k8srobin-deployment-mutating-webhook   1          20d
    k8srobin-ds-mutating-webhook           1          20d
    k8srobin-pod-mutating-webhook          1          20d
    k8srobin-sts-mutating-webhook          1          20d
    robin-deployment-mutating-webhook      1          20d
    robin-ds-mutating-webhook              1          20d
    robin-pod-mutating-webhook             1          20d
    robin-sts-mutating-webhook             1          20d
    
  7. If the above k8srobin-* mutating webhooks are not present, bounce the robink8s-serverext Pods:

    # kubectl delete pod -n robinio -l app=robink8s-serverext
    
  8. Verify whether the following validating webhooks are present:

    # kubectl get validatingwebhookconfigurations
    NAME                             WEBHOOKS   AGE
    cert-manager-webhook             1          45h
    controllers-validating-webhook   1          31h
    ippoolcr-validating-webhook      1          31h
    namespaces-validating-webhook    1          31h
    pods-validating-webhook          1          31h
    pvcs-validating-webhook          1          31h
    
  9. If the robin-* mutating webhooks shown in the step 6 output or the validating webhooks shown in the step 8 output are not present on your setup, restart the robin-server-bg service:

    # rbash master
    # supervisorctl restart robin-server-bg
    
  10. Set the toleration seconds for all NFS server Pods to 60 seconds when the node is in the NotReady state and to 0 seconds when the node is in the unreachable state.

    for pod in `kubectl get pod -n robinio -l robin.io/instance=robin-nfs --output=jsonpath={.items..metadata.name}`; do     echo "Updating $pod tolerationseconds";     kubectl patch pod $pod -n robinio --type='json' -p='[{"op": "replace", "path": "/spec/tolerations/0/tolerationSeconds", "value": 60}, {"op": "replace", "path": "/spec/tolerations/1/tolerationSeconds", "value": 0}]'; done 2>/dev/null
    

25.1.4. New Features

25.1.4.1. Support for Clusters with AMD Processors

Robin CNP v5.5.1 supports clusters with AMD processors. The following improvements are available to support clusters with AMD processors:

  • Support for Mellanox VFs using the mlx5_core driver for clusters with AMD processors.

  • Increased number of Pods per node.

  • A parameter to configure max-pods per node using the config.json file.

25.1.4.1.1. Support for Mellanox VFs using the mlx5_core driver for clusters with AMD processors

Starting with Robin CNP v5.5.1, support for the mlx5_core VF (Virtual Function) driver is provided as part of IP-pools for clusters with Intel and AMD processors.

You can now configure IP pools with the --vfdriver mlx5_core option to use Mellanox Virtual Functions on clusters.

Robin CNP continues to support the native iavf VF driver. IP pools configured with iavf can still be used to allocate VFs from Mellanox NICs.

You can use the following command to configure the mlx5_core driver:

robin ip-pool add <pool-name> --driver sriov --vfdriver mlx5_core --subnet <subnet> --gateway <gateway> --device-ids <device-ids> --nic <nic-name> --nodes <node-names>

Example:

# robin ip-pool add mlx-1 --driver sriov --prefix 64 --vfdriver mlx5_core --range 2a00:fbc:1270:1f3b:0:0:0:1-1000 --vlan 3897

25.1.4.1.2. Change the maximum pods per node after installation

You can set the maximum number of Pods per node any time after installing Robin CNP. Decide the maximum number of Pods based on your requirements and resources on your cluster.

You can also set the maximum Pods per node while installing Robin CNP using the max-pods parameter in the config.json file.

Prerequisites

  • You must complete the following steps on all nodes of the cluster.

  • Before you change the maximum Pods per node, verify the pod-max-pids configuration on the master nodes and change it if required, as described in the pre-upgrade steps.

To change the maximum Pods per node after installation, complete the following steps:

  1. Update the kubelet configuration

    1. Edit the kubelet service configuration file:

      # vi /etc/sysconfig/kubelet
      
    2. Add or update the --max-pods parameter in KUBELET_EXTRA_ARGS:

      # cat /etc/sysconfig/kubelet
      # KUBELET_EXTRA_ARGS="--container-runtime-endpoint=unix:///var/run/crirobin.sock --image-service-endpoint=unix:///var/run/crirobin.sock --enable-controller-attach-detach=true --cluster-dns=fd74:ca9b:3a09:868c:0252:0059:0124:800a --cluster-domain=abhinav.mantina.robin --authorization-mode=Webhook --client-ca-file=/etc/kubernetes/pki/ca.crt --feature-gates=RotateKubeletServerCertificate=true,MemoryManager=true --container-log-max-size=260M --read-only-port=0 --event-qps=0 --streaming-connection-idle-timeout=30m --tls-cipher-suites=TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256,TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384,TLS_ECDHE_RSA_WITH_CHACHA20_POLY1305,TLS_RSA_WITH_AES_128_CBC_SHA,TLS_RSA_WITH_AES_256_CBC_SHA,TLS_RSA_WITH_AES_128_GCM_SHA256,TLS_RSA_WITH_AES_256_GCM_SHA384,TLS_ECDHE_RSA_WITH_AES_256_CBC_SHA --reserved-cpus=0 --cpu-manager-policy=static --topology-manager-policy=restricted --topology-manager-scope=pod --pod-max-pids=4096 --max-pods=260"
      
  2. Restart the kubelet service

    1. Restart the kubelet service:

      # systemctl restart kubelet.service
      
  3. Update Robin’s host information

    1. Probe the host to rediscover the configuration changes:

      # robin host probe --rediscover <hostname>
      # robin host probe --rediscover r7515-01
      
  4. Verify the changes

    1. Check that the maximum pods value is updated in Robin:

      # robin host list | egrep "<hostname>|Pod|----"
      # robin host list | egrep "r7515-01|Pod|----"
      
      Id            | Hostname | Version    | Status | RPool   | Avail. Zone | LastOpr | Roles | Cores           | GPUs  | Mem              | HDD(#/Alloc/Total) | SSD(#/Alloc/Total) | Pod Usage  | Joined Time
      --------------+----------+------------+--------+---------+-------------+---------+-------+-----------------+-------+------------------+--------------------+--------------------+------------+----------------------
      1755834757:36 | r7515-01 | 5.5.1-1939 | Ready  | default | N/A         | ONLINE  | S,C   | 316.75/3.25/320 | 0/0/0 | 1636G/629G/2266G | -/-/-              | 8/4414G/47693G     | 232/28/260 | 09 Sep 2025 14:34:16
      
25.1.4.1.3. Change the write unit size for SSD disk drives

Robin CNP supports write unit sizes of 4096 bytes and 512 bytes. You must make sure that all disk drives on the cluster have the same write unit size.

Note

Before changing the write unit size, make sure your disk drive supports the write unit size that you intend to use. If you try to change to an unsupported write unit size, the operation fails.

When adding a new cluster to your environment, if the new cluster has disk drives that by default use a write unit size different from the one on your existing clusters, you must change the write unit size to match your existing Robin CNP clusters. The following steps help you change the write unit size for disk drives in the Robin CNP cluster from 4096 bytes to 512 bytes or vice versa.

Note

You must run the steps on all nodes of the cluster.

To change the write unit size, complete the following steps:

  1. Unregister the SSD drive

    1. List drives to find your target drive:

      # robin drive list | grep <drive_id>
      

      Example:

      # robin drive list | grep 290
      290 | 0xui.3656313058a046000025384300000002 | r7715-04 | default | nvme-eui.3656313058a046000025384300000002 | 5961 | N | SSD | 4624/4624 (100%) | 0/400 | Storage | ONLINE | READY | 4096
      
    2. Unregister the drive:

      # robin drive unregister <drive_wwn> --wait -y
      

      Example:

      # robin drive unregister 0xui.3656313058a046000025384300000002 --wait -y
      Job: 71534 Name: DiskUnregister State: INITIALIZED Error: 0
      Job: 71534 Name: DiskUnregister State: COMPLETED Error: 0
      
    3. Verify the drive is unregistered:

      # robin drive list | grep <drive_wwn>
      

      Example:

      # robin drive list | grep 0xui.3656313058a046000025384300000002
      
  2. Rediscover the drive

    1. Probe the host to rediscover all drives:

      # robin host probe <hostname> --rediscover --all --wait
      

      Example:

      # robin host probe r7715-04 --rediscover --all --wait
      Job: 71535 Name: HostProbe State: VALIDATED Error: 0
      Job: 71535 Name: HostProbe State: COMPLETED Error: 0
      
    2. Verify the drive appears with UNKNOWN status:

      # robin drive list | grep <drive_wwn>
      

      Example:

      # robin drive list | grep 0xui.3656313058a046000025384300000002
      
      - | 0xui.3656313058a046000025384300000002 | r7715-04 | default | nvme-eui.3656313058a046000025384300000002 | 5961 | N | SSD | 4624/4624 (100%) | 0/100000 | Storage | UNKNOWN | INIT | 4096
      
    3. Check the current write unit:

      # robin disk info <drive_wwn> | grep -i write
      

      Example:

      # robin disk info 0xui.3656313058a046000025384300000002 | grep -i write
      
      Write Unit: 4096
      
  3. Update the write unit

    1. Change the write unit to the required size (512 or 4096 bytes):

      # robin disk update --writeunit <size> <drive_wwn> --wait
      

      Example:

      # robin disk update --writeunit 512 0xui.3656313058a046000025384300000002 --wait
      Job: 71539 Name: DiskModify State: INITIALIZED Error: 0
      Job: 71539 Name: DiskModify State: COMPLETED Error: 0
      
  4. Reassign the storage role

    1. Add the storage role back to the node:

      # robin host add-role <hostname> Storage --wait
      

      Example:

      # robin host add-role r7715-04 Storage --wait
      Job: 71540 Name: HostAddRoles State: VALIDATED Error: 0
      Job: 71540 Name: HostAddRoles State: COMPLETED Error: 0
      
    2. Verify the drive is online with the new write unit:

      # robin drive list | grep <drive_wwn>
      

      Example:

      # robin drive list | grep 0xui.3656313058a046000025384300000002
      
      291 | 0xui.3656313058a046000025384300000002 | r7715-04 | default | nvme-eui.3656313058a046000025384300000002 | 5961 | N | SSD | 4624/4624 (100%) | 0/100000 | Storage | ONLINE | READY | 512
      
    3. Confirm the write unit changed to 512:

      # robin disk info <drive_wwn> | grep -i write
      

      Example:

      # robin disk info 0xui.3656313058a046000025384300000002 | grep -i write
      
      Write Unit: 512
      

After completing the earlier steps:

  • The drive status changes from UNKNOWN to ONLINE

  • The drive state changes from INIT to READY

25.1.5. Improvements

25.1.5.1. Remove NUMA restrictions for KVM apps

Starting with Robin CNP v5.5.1, you can remove NUMA restrictions for KVM apps when creating them. This helps in deploying KVM app pods on all worker nodes of a cluster. To remove NUMA restrictions, you must add the following annotation in the input.yaml and create KVM apps using this input.yaml:

  • robin.runtime.skip_cpuset_mems: "ENABLED"
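
To confirm the setting after the KVM app is deployed, you can check for the annotation on the deployed KVM app Pod. This assumes the annotation from the input.yaml is propagated to the Pod metadata; the Pod and namespace names are placeholders:

# kubectl get pod <kvm-app-pod> -n <namespace> -o yaml | grep skip_cpuset_mems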

25.1.5.2. Support to create an application with static IPs and static MACs from a snapshot

Starting with Robin CNP v5.5.1, Robin CNP supports creating an application with static IP and static MAC addresses from its snapshot.

Note

Creating an application with static IP and static MAC addresses is supported only for Robin bundle applications.

To create an application, you must specify the static IP and static MAC addresses in the following parameters:

  • static-ips

  • static-macs

For more information, see Create an application from a snapshot.
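
As a hedged illustration only, a command might look like the following; the command form and option names are assumptions inferred from the parameter names above, so refer to Create an application from a snapshot for the exact syntax:

# robin app create from-snapshot <snapshot-name> <new-app-name> --static-ips <ip1>,<ip2> --static-macs <mac1>,<mac2> --wait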

25.1.5.3. Support to create an application with static IPs and static MACs from a backup

Starting with Robin CNP v5.5.1, Robin CNP supports creating an application with static IP and static MAC addresses from its backup.

Note

Creating an application with static IP and static MAC addresses is supported only for Robin bundle applications.

To create an application, you must specify the static IP and static MAC addresses in the following parameters:

  • static-ips

  • static-macs

For more information, see Create an application from a backup.

25.1.6. Fixed Issues

Reference ID

Description

RSD-8083

The IO hang issue observed on clusters with large disk sizes is fixed.

RSD-9127

The output of the Robin CLI commands (robin app list and robin instance list) displays an incorrect and inflated CPU count for applications. This issue is fixed.

RSD-9981

After upgrading from Robin CNP v5.4.3 HF5 to Robin CNP v5.5.0, auto-deployment of KVM apps failed on certain nodes due to insufficient CPU resources on these nodes. This issue is fixed.

RSD-9911

Kafka Pods restart due to I/O timeouts on volumes during auto-rebalance operations. This issue is fixed.

RSD-5327

The issue of IOMGR restarting slowly is fixed.

RSD-8854

The issue of the IOMGR service crashing on a node when it came back online after a reboot is fixed.

RSD-8104

A delay in creating a large-size volume is observed when the volume size is larger than the individual disks on the cluster. This issue is fixed.

RSD-8083

The issue of the dev slices leader changing tasks, which delays epoch update tasks and results in IO timeouts on the application side, is fixed.

RSD-9478

When a node makes RPC calls to an unhealthy node, these RPC calls are blocked for a long time because TCP keepalive timeout is configured only for client-side RPC sockets. This issue is fixed.

RSD-8083

The default CPU and memory resource limits for the Robin Patroni PostgreSQL database lead to performance issues, particularly for Storage Manager (SM) tasks in larger cluster environments. This issue is fixed.

RSD-8846

The Robin CAT profiles feature does not work as expected on RHEL 8.10. This issue is fixed.

RSD-9127

For a Pod with a sidecar container, Robin CNP is erroneously allocating 0.5 CPU when it actually needs 1.5 CPUs. Due to this issue, Pod deployments are failing, indicating insufficient CPU. This issue is fixed.

RSD-9316

When you try to deploy a large-sized KVM app on a Robin CNP cluster, the deployment fails with the following error message: Failed to download file_object

Example

Failed to download file_object c3cc99163f225f167ae886339eb02fca, not accessible at this point. Ensure the file collection is ONLINE. Error: Connection broken: ConnectionResetError(104, ‘Connection reset by peer’).

This issue is fixed.

RSD-9919

When upgrading from the supported Robin CNP version to v5.5.0-1857, the upgrade failed due to the robin-cert-init error. This issue is fixed.

PP-38268

The Replica State for volumes incorrectly displays a percentage value (Example: 99%) instead of SYNCED, even when all underlying slices are healthy and the volume is not degraded. This occurred because the system was overly sensitive in calculating resync progress, resulting in false negatives. This issue is fixed.

PP-39285

In a rare circumstance, when Patroni instances reboot in a particular order, a lagging Patroni replica erroneously claims the Leader role. This issue is fixed.

PP-38087

In certain cases, the snapshot size allocated to a volume could be less than what is requested. This occurs when the volume is allocated from multiple disks. This issue is fixed.

PP-34457

When you have a Robin CNP cluster with the Metrics feature enabled, the Grafana application is not displaying metrics under certain conditions. This issue is fixed.

PP-38061

In rare scenarios, when upgrading from Robin CNP v5.4.3 HF5 to Robin CNP v5.5.0, the upgrade may get stuck while executing Robin upgrade actions on the primary master node because some of the hosts are not in the Ready state. This issue is fixed.

RSD-9176

Creating an application from a bundle fails due to missing IP pool configuration details in the robin-bootstrap-config ConfigMap. This issue is fixed.

RSD-9146

Prune, Purge, and Archive schedules trigger duplicate jobs in the Robin Cluster. This issue is fixed.

RSD-9088

An error occurs when running the robin ip-pool list or robin ip-pool list --urlinfo command on clusters upgraded to Robin CNP v5.5.0.

ERROR - local variable ‘total_cnt’ referenced before assignment

This issue is fixed.

RSD-9289

The Robin CLI commands (robin job info and robin job list) are not synced with each other when a job is in progress. This issue is fixed.

RSD-9208

The issue where creating a Robin Bundle app failed with the following error after upgrading to Robin CNP v5.5.0 is fixed:

IndexError: list index out of range

RSD-9204

The issue where creating a KVM app failed with the following error after upgrading to Robin CNP v5.5.0 is fixed:

NameError: name ‘net_attr’ is not defined

RSD-9202

The issue of robin-nfs-watchdog Pods being down during the Robin CNP v5.5.0 upgrade process due to an invalid image name is fixed.

RSD-9222

The issue where the robin instance list command was not showing the correct status of Autopilot after upgrading to Robin CNP v5.5.0 is fixed.

RSD-9075

The issue in Robin CNP v5.5.0 where Pod deployments that depend on VLANs configured as ‘ALL’ on the host network interfaces fail with the Pending status is fixed.

RSD-9642

The robin-nfs bundle application is experiencing continuous restarts when Autopilot is enabled. This issue is fixed.

RSD-9455

In Robin CNP v5.5.0, the robin host list command output does not account for the shared CPU of a multi-container Pod in which one container has a guaranteed CPU and the other has a shared CPU. This issue is fixed.

RSD-9346

When upgrading to Robin CNP v5.5.0, the robin-patroni-postgres-operator Pod intermittently failed, resulting in “timed out waiting for the condition” errors. This issue is fixed.

RSD-9290

The issue of the Robin Master Pod (robinrcm) restarting due to exceeding the max-pid-limit is fixed. The max-pid-limit is a configurable parameter. For more information, contact the Robin Customer Support team.

RSD-9273

The issue where restarting a Robin Bundle application failed with the following error after upgrading to Robin CNP v5.5.0 is fixed:

‘int object’ has no attribute ‘split’

25.1.7. Known Issues

Reference ID

Description

PP-39656

Symptom

When you deploy applications with preferredDuringSchedulingIgnoredDuringExecution node soft affinity combined with preferredDuringSchedulingIgnoredDuringExecution pod anti-affinity, pods are occasionally scheduled onto nodes that do not possess the label specified in the node affinity rule.

Workaround

Restart the Pod so that it is placed on a node with the label specified in the soft node affinity rule.
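
For example, deleting the Pod lets its controller reschedule it; the Pod and namespace names are placeholders:

# kubectl delete pod <pod-name> -n <namespace>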

PP-39645

Symptom

Robin CNP v5.5.1 may rarely fail to honor soft Pod anti-affinity, resulting in uneven Pod distribution on labeled nodes.

When you deploy an application with the recommended preferredDuringSchedulingIgnoredDuringExecution soft Pod anti-affinity, Pods may not be uniformly distributed across the available, labeled nodes as expected. Kubernetes passes candidate nodes to Robin CNP for Pod scheduling, and in some situations the request from Kubernetes to Robin CNP may not include the node required to honor the soft affinity.

Workaround

Bounce the Pod that has not honored soft affinity.

PP-39632

Symptom

After upgrading to Robin CNP v5.5.1, the NFS client might hang even though there is no pending IO.

To confirm that there is no pending IO, check the /var/log/robin/nodeplugin/robin-csi.log file for messages similar to the following:

2025-08-28 11:41:04,584 - CsiServer_9 - robin.utils - INFO - Executing command /usr/bin/nc -z -w 6 172.19.149.161 2049 with timeout 60 seconds 2025-08-28 11:41:04,612 - CsiServer_9 - robin.utils - INFO - Command /usr/bin/nc -z -w 6 172.19.149.161 2049 completed with return code 0. 2025-08-28 11:41:04,613 - CsiServer_9 - robin.utils - INFO - Standard out:

You can also find the following message in the dmesg output:

[Thu Aug 28 11:59:07 2025] nfs: server 172.19.131.218 not responding, timed out [Thu Aug 28 11:59:17 2025] nfs: server 172.19.131.218 not responding, timed out [Thu Aug 28 11:59:27 2025] nfs: server 172.19.131.218 not responding, timed out

Workaround

  1. Check the node provisioner logs to identify the PVC whose path check is hung.

  2. For the Deployment or StatefulSet that is using the problematic PVC, scale down the replica count to 0 (see the example after these steps).

  3. Ensure all Pods associated with the application have terminated.

  4. Scale up the replica count back to the original value.
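
The following is a sketch of steps 2 through 4 for a Deployment, assuming an original replica count of 3; the names are placeholders, and for a StatefulSet replace deployment with statefulset:

# kubectl scale deployment <deployment-name> -n <namespace> --replicas=0
# kubectl get pods -n <namespace>
# kubectl scale deployment <deployment-name> -n <namespace> --replicas=3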

PP-39429

Symptom

When you try to create an application from a snapshot or backup, the configuration for static IP addresses and static MAC addresses is supported only for Robin Bundle applications, and it is not supported for Helm-based applications.

PP-38044

Symptom

When attempting to detach a repository from a hydrated Helm application, the operation might fail with the following error:

Can’t detach repo as the application is in IMPORTED state, hydrate it in order to detach the repo from it.

This issue occurs even if the application has already been hydrated. The system incorrectly marks the application in the IMPORTED state, preventing the repository from being detached.

Workaround

To detach the repository, manually rehydrate the application and then retry the detach operation:

  1. Run the following command to rehydrate the application.

    # robin app hydrate --wait
    
  2. Once the hydration is complete, detach the repository.

    # robin app detach-repo --wait -y
    

PP-37652

Symptom

When you deploy a multi-container application using Helm with static IPs assigned from an IP pool, only a subset of the Pods appear on the Robin CNP UI.

Workaround

Run the following CLI command to view all the Pods:

# robin app info <appname> --status

PP-37416

Symptom

In rare scenarios, when upgrading from Robin CNP v5.4.3 HF6 to Robin CNP v5.5.1, the upgrade might fail with the following error during the Kubernetes upgrade process on other master nodes:

Failed to execute kubeadm upgrade command for K8S upgrade. Please make sure you have the correct version of kubeadm rpm binary installed

Steps to identify the issue:

  1. Check the /var/log/robin-install.log file to know why the kubeadm upgrade failed.

    Example:

    [upgrade/staticpods] Moved new manifest to “/etc/kubernetes/manifests/kube-scheduler.yaml” and backed up old manifest to “/etc/kubernetes/tmp/kubeadm-backup-manifests-2025-01-31-01-03-52/kube-scheduler.yaml” [upgrade/staticpods] Waiting for the kubelet to restart the component [upgrade/staticpods]. This might take a minute or longer depending on the component/version gap (timeout 5m0s) static Pod hash for component kube-scheduler on Node sm-compute04 did not change after 5m0s: timed out waiting for the condition

    You can get the above error log for any static manifests of api-server, etcd, scheduler, and controller-manager.

  2. If you notice the above error, run the following command to inspect the Docker containers for the failed component. The containers will likely be in the Exited state.

    # docker ps -a | grep schedule
    

Workaround

If you notice the above error, restart the kubelet:

# systemctl restart kubelet

PP-35015

Symptom

After renewing the expired Robin license successfully, Robin CNP incorrectly displays the License Violation error when you try to add a new user to the cluster. If you notice this issue, apply the following workaround.

Workaround

You need to restart the robin-server-bg service.

# rbash master
# supervisorctl restart robin-server-bg

PP-34492

Symptom

When you run the robin host list command, you might notice that a host is in the NotReady and PROBE_PENDING states. Follow these workaround steps to diagnose and recover the host:

Workaround

  1. Run the following command to check which host is in the NotReady and PROBE_PENDING states:

    # robin host list
    
  2. Run the following command to check the current (Curr) and desired (Desired) states of the host in the Agent Process (AP) report:

    # robin ap report | grep <hostname>
    
  3. Run the following command to probe the host and recover it:

    # robin host probe <hostname> --wait
    

    This command forces a probe of the host and updates its state in the cluster.

  4. Run the following command to verify the host’s state:

    # robin host list
    

The host should now transition to the Ready state.

PP-34414

Symptom

In rare scenarios, the IOMGR service might fail to open devices in exclusive mode at startup because other processes are using these disks. You might observe the following issues:

  • The Robin server is down.

  • All app Pods restart, and some app Pods get stuck in the ContainerCreating state.

To confirm the above issues, complete the following steps:

  1. Check for the EVENT_DISK_FAULTED event type in the disk events:

    # robin event list --type EVENT_DISK_FAULTED
    
  2. If you see the disk faulted event, check the IOMGR logs on the node where the disks are present for a dev_open() error and a Failed to exclusively open message.

    # cat iomgr.log.0 | grep <device> | grep "dev_open"
    
  3. If you see the Device or resource busy error in the log file, use the fuser command with the device path to confirm whether the device is in use:

    # fuser /dev/disk/by-id/scsi-SATA_Micron_M500_MTFD_1401096049D5
    

Workaround

If the device is not in use, restart the IOMGR service on the respective node:

# supervisorctl restart iomgr

PP-34226

Symptom

When a PersistentVolumeClaim (PVC) is created, the CSI provisioner initiates a VolumeCreate job. If this job fails, the CSI provisioner will call a new VolumeCreate job again for the same PVC. However, if the PVC is deleted during this process, the CSI provisioner will continue to call the VolumeCreate job because it does not verify the existence of the PVC before calling the VolumeCreate job.

Workaround

Bounce the CSI provisioner Pod:

# kubectl delete pod <csi-provisioner-robin> -n robinio

PP-38251

Symptom

When evacuating a disk from an offline node, the robin drive evacuate command fails with the following error message:

Json deserialize error: invalid value: integer -‘10’, expected u64 at line 1 column 2440.

Workaround

If you notice the above issue, contact the Robin CS team.

PP-37965

Symptom

In Robin CNP v5.5.1, when you scale up a Robin Bundle app, it is not considering the existing CPU cores and memory already in use by a vnode. As a result, Robin CNP is not able to find a suitable host, even though there are additional resources available.

Workaround

If you notice this issue, apply the following workaround:

  1. Scale up the resources using the following command.

    # robin app computeqos <appname> --role <rolename>
    --cpus <newcnt> --memory <newmem> --wait
    
  2. If the scale-up operation fails, stop the app using the following command.

    # robin app stop <appname> --wait
    
  3. Try to scale up the resources again.

    # robin app computeqos <appname> --role <rolename>
    --cpus <newcnt> --memory <newmem> --wait
    

PP-39619

Symptom

After creating an app from a backup, the app is stuck in the CREATE_FAILED state. When you try to delete this app, you might get the following error message:

App <app-name> couldn’t be deleted. Please detach app from repos before deleting.

Workaround

If you notice the above issue, contact the Robin CS team.

PP-36865

Symptom

After rebooting a node, the node might not come back online after a long time, and the host BMC console displays the following message for RWX PVCs mounted on that node:

Remounting nfs rwx pvc timed out, issuing SIGKILL

Workaround

Power cycle the host machine.

PP-39806

Symptom

When a node hosting KVM applications is shut down due to technical reasons, you might get the following error message during the migration of these Pods to another node:

Target /usr/local/robin/instances/kvm/clone-kvm-ovs2-server.svc.cluster.local is busy, please retry later.

Workaround

Run the following command to restart the Robin instance after five minutes:

# robin instance restart <instance-name>

PP-38471

Symptom

When StatefulSet Pods restart, the Pods might get stuck in the ContainerCreating state with the error CSINode <node_name> does not contain driver robin, due to stale NFS mount points and the csi-nodeplugin-robin Pod failing with the CrashLoopBackOff status.

Workaround

If you notice this issue, restart the csi-nodeplugin Pod.

# kubectl delete pod <csi-nodeplugin> -n robinio

PP-38039

Symptom

During node reboot or power reset scenarios, application volumes may force shutdown due to I/O errors. As a result, application Pods might get stuck in the ContainerCreating state with the following mount failure error:

Context Deadline Exceeded.

On the affected node where the volume is mounted or the application Pod is scheduled, the following error might be observed in the dmesg output:

Log I/O Error Detected. Shutting down filesystem

Workaround

If you notice this issue, contact the Robin Customer Support team for assistance.

PP-37330

Symptom

During or after upgrading from the supported versions to Robin CNP v5.5.1 or following node reboots and failover events, applications relying on ReadWriteMany (RWX) NFS volumes may experience critical failures.

These failures might manifest into the following:

  • Pods stuck in ContainerCreating state

  • NFS server Pods entering an ASSIGNED_ERR state

  • Robin jobs failing with specific mount errors.

The underlying cause of these symptoms could be duplicate filesystem UUIDs.

You might observe one of the following error messages:

  • NFS Server Pod in ASSIGNED_ERR

  • RWX apps and NFSAgentAddExport job might fail with an error message similar to the following:

    wrong fs type, bad option, bad superblock on /dev/sdj, missing codepage or helper program, or other error

    Note

    Check the job info to see related failed jobs.

Example

/bin/mount /dev/sdn /var/lib/robin/nfs/robin-nfs-shared-35/ganesha/pvc-822e76f0-9bb8-4629-8aae-8318fb2d3b41 -o discard failed with return code 32: mount: /var/lib/robin/nfs/robin-nfs-shared-35/ganesha/pvc-822e76f0-9bb8-4629-8aae-8318fb2d3b41: wrong fs type, bad option, bad superblock on /dev/sdn, missing codepage or helper program, or other error.

Workaround

If you notice this issue, contact the Robin Customer Support team for assistance.

PP-38078

Symptom

After a network partition, the robin-agent and iomgr-server may not restart automatically, and stale devices may not be cleaned up. This issue occurs because the consulwatch thread responsible for monitoring Consul and triggering restarts may fail to detect the network partition.

As a result, stale devices may not be cleaned up, potentially leading to resource contention and other issues.

Workaround

Manually restart the robin-agent and iomgr-server using the supervisorctl:

# supervisorctl restart robin-agent iomgr-server

PP-39842

Symptom

In Robin CNP v5.5.1, the robin host list command doesn’t account for long-running sidecar containers in the initContainers section for bundle applications. This causes Robin CNP to incorrectly place pods, which can lead to oversubscription or performance issues.

You can identify this issue by comparing the output of the following commands:

# kubectl describe node
# robin host list
# robin k8s-collect info

The resource usage reported by kubectl describe node can differ from the usage shown in the Robin commands because the Robin commands do not include resources consumed by long-running sidecar containers.

PP-39901

Symptom

After rebooting a worker node that is hosting Pods with Robin RWX volumes, one or more application Pods using these volumes might get stuck in the ContainerCreating state indefinitely.

Workaround

If you notice the above issue, contact the Robin CS team.

PP-38924

Symptom

After you delete multiple Helm applications, one of the Pods might get stuck in the Error state, and one or more ReadWriteMany (RWX) volumes might get stuck in the Terminating state.

Workaround

Restart Docker and the kubelet on the node where the Pod is stuck in the Error state.
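
Assuming Docker and the kubelet run as systemd services on the node (as in the supported CentOS environment), a minimal sketch:

# systemctl restart docker
# systemctl restart kubelet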

PP-39936

Symptom

When relocating a Pod on another node using the robin instance relocate command, if the --to-host option is used, affinity rules will be ignored.

Workaround

Check affinity rules and violations manually when using the --to-host option.

PP-39467

Symptom

When deploying applications with ReadWriteMany (RWX) PersistentVolumeClaims (PVCs), application Pods fail to mount volumes and get stuck in the ContainerCreating state because RPC requests are stuck in IO operations on the volumes, leading to degraded volumes and faulted storage drives.

Workaround

Reboot the host that is in the NotReady state.

25.1.8. Technical Support

Contact the Robin Technical Support team for any assistance.