*************
Release Notes
*************

===================================
Robin Cloud Native Platform v5.5.1
===================================

The Robin Cloud Native Platform (CNP) v5.5.1 release notes include pre- and post-upgrade steps, new features, improvements, fixed issues, and known issues.

**Release Date:** September 30, 2025

Infrastructure Versions
=======================

The following software applications are included in this CNP release:

==================== ==================
Software Application Version
==================== ==================
Kubernetes           1.31.6
Docker               25.0.2
Prometheus           2.39.1
Prometheus Adapter   0.10.0
Node Exporter        1.4.0
Calico               3.28.2
HAProxy              2.4.7
PostgreSQL           14.12
Grafana              9.2.3
CRI Tools            1.31.1
==================== ==================

Supported Operating System
===========================

The following is the supported operating system and kernel version for Robin CNP v5.5.1:

* CentOS 7.9 (kernel version: 3.10.0-1160.71.1.el7.x86_64)

Upgrade Paths
=============

The following are the supported upgrade paths for Robin CNP v5.5.1:

* Robin CNP v5.4.3 HF6 to Robin CNP v5.5.1-1950

* Robin CNP v5.5.0-1857 to Robin CNP v5.5.1-1950

Pre-upgrade considerations
--------------------------

* For a successful upgrade, you must run the ``possible_job_stuck.py`` script before and after the upgrade. Contact the Robin Support team for the upgrade procedure that uses this script.

* When upgrading from supported Robin CNP versions to Robin CNP v5.5.1, if your cluster already has ``cert-manager`` installed, you must uninstall it before the upgrade.

* Robin CNP v5.5.1 does not support the ``OnDelete`` strategy for IOMGR Pods during the upgrade process.

Post-upgrade considerations
---------------------------

* After upgrading to Robin CNP v5.5.1, you must run the ``robin schedule update K8sResSync k8s_resource_sync 60000`` command to update the ``K8sResSync`` schedule.

* After upgrading to Robin CNP v5.5.1, you must run the ``robin-server validate-role-bindings`` command. To run this command, you need to log in to the ``robin-master`` Pod. This command verifies the roles assigned to each user in the cluster and corrects them if necessary.

* After upgrading to Robin CNP v5.5.1, the ``k8s_auto_registration`` config parameter is disabled by default. The config setting is deactivated to prevent all Kubernetes apps from automatically registering and consuming resources. The following are the points you must be aware of with this change:

  - You can manually register Kubernetes apps using the ``robin app register`` command and then use Robin CNP for snapshot, clone, and backup operations of the Kubernetes app.

  - As this config parameter is disabled, when you run the ``robin app nfs-list`` command, the mappings between Kubernetes apps and NFS server Pods are not listed in the command output.

  - If you need the mapping between a Kubernetes app and an NFS server Pod while the ``k8s_auto_registration`` config parameter is disabled or the Kubernetes app is not manually registered, get the PVC name from the Pod YAML (``kubectl get pod <pod name> -n <namespace> -o yaml``) and run the ``robin nfs export list | grep <pvc name>`` command, as shown in the example after this list.

  - The ``robin nfs export list`` command output displays the PVC name and namespace.
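For example, assuming a Pod named ``my-pod`` in the namespace ``my-namespace`` (placeholder names used only for illustration), the first command prints the PVC name (the ``claimName`` field in the Pod spec), and the second command filters the NFS export list for that PVC:

.. code-block:: text

   # kubectl get pod my-pod -n my-namespace -o yaml | grep claimName
   # robin nfs export list | grep <pvc name>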
Pre-upgrade steps
------------------

**Upgrading from Robin CNP v5.4.3 HF6 or Robin CNP v5.5.0-1857 to Robin CNP v5.5.1**

Before upgrading from Robin CNP v5.4.3 HF6 or Robin CNP v5.5.0-1857 to Robin CNP v5.5.1, perform the following steps:

#. Update the value of the ``suicide_threshold`` config parameter to ``1800``:

   .. code-block:: text

      # robin config update agent suicide_threshold 1800

#. Disable the ``NFS Server`` Monitor schedule:

   .. code-block:: text

      # robin schedule disable "NFS Server" Monitor

#. Set the toleration seconds for all NFS server Pods to 86400 seconds. After the upgrade, you must change the toleration seconds back as described in the post-upgrade steps.

   .. code-block:: text

      # for pod in `kubectl get pod -n robinio -l robin.io/instance=robin-nfs --output=jsonpath={.items..metadata.name}`; do echo "Updating $pod tolerationseconds to 86400"; kubectl patch pod $pod -n robinio --type='json' -p='[{"op": "replace", "path": "/spec/tolerations/0/tolerationSeconds", "value": 86400}, {"op": "replace", "path": "/spec/tolerations/1/tolerationSeconds", "value": 86400}]'; done

**Verify and configure pod-max-pids on master nodes**

Before you change the maximum Pods per node, you must verify the ``pod-max-pids`` configuration in kubelet on the master nodes and change it if required. Based on the number of vCPUs on the host, set ``pod-max-pids`` to at least 4096 on master nodes.

.. Note:: If the current value already exceeds 4096 based on application requirements (for example, 10000), you do not need to change it. Always verify the current value before making changes. You can tune this setting further based on cluster observations.

To verify the current ``pod-max-pids`` value and modify it if required, complete the following steps:

1. Check the current ``pod-max-pids`` value:

   .. code-block:: text

      # cat /etc/sysconfig/kubelet
      # systemctl status kubelet -l | grep -i pod-max-pids

2. Check whether the value is less than 4096. If the value is 4096 or higher, you do not need to update it.

3. If the value is less than 4096, open the kubelet configuration file and update it (see the example after these steps):

   .. code-block:: text

      # vi /etc/sysconfig/kubelet

4. Restart the kubelet service:

   .. code-block:: text

      # systemctl restart kubelet

5. Verify the updated configuration:

   .. code-block:: text

      # systemctl status kubelet -l | grep -i pod-max-pids
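The following is an illustrative fragment of ``/etc/sysconfig/kubelet`` after the update. Only the ``--pod-max-pids`` flag is relevant here; the remaining flags (elided as ``...``) vary per cluster, and a complete ``KUBELET_EXTRA_ARGS`` example appears later in these notes:

.. code-block:: text

   # grep pod-max-pids /etc/sysconfig/kubelet
   KUBELET_EXTRA_ARGS="... --pod-max-pids=4096 ..."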
Post-upgrade steps
-------------------

**After upgrading from Robin CNP v5.4.3 HF6 or Robin CNP v5.5.0-1857 to Robin CNP v5.5.1**

After upgrading from Robin CNP v5.4.3 HF6 or Robin CNP v5.5.0-1857 to Robin CNP v5.5.1, perform the following steps:

#. Update the value of the ``suicide_threshold`` config parameter to ``40``:

   .. code-block:: text

      # robin config update agent suicide_threshold 40

#. Enable the ``NFS Server`` Monitor schedule:

   .. code-block:: text

      # robin schedule enable "NFS Server" Monitor

#. Set the ``check_helm_apps`` config parameter to ``False``:

   .. code-block:: text

      # robin config update cluster check_helm_apps False

#. Set the ``chargeback_track_k8s_resusage`` config parameter to ``False``:

   .. code-block:: text

      # robin config update server chargeback_track_k8s_resusage False

#. Set the ``robin_k8s_extension`` config parameter to ``True``:

   .. code-block:: text

      # robin config update manager robin_k8s_extension True

#. Verify whether the following mutating webhooks are present:

   .. code-block:: text

      # kubectl get mutatingwebhookconfigurations -A | grep robin
      k8srobin-deployment-mutating-webhook   1   20d
      k8srobin-ds-mutating-webhook           1   20d
      k8srobin-pod-mutating-webhook          1   20d
      k8srobin-sts-mutating-webhook          1   20d
      robin-deployment-mutating-webhook      1   20d
      robin-ds-mutating-webhook              1   20d
      robin-pod-mutating-webhook             1   20d
      robin-sts-mutating-webhook             1   20d

#. If the ``k8srobin-*`` mutating webhooks listed above are not present, bounce the ``robink8s-serverext`` Pods:

   .. code-block:: text

      # kubectl delete pod -n robinio -l app=robink8s-serverext

#. Verify whether the following validating webhooks are present:

   .. code-block:: text

      # kubectl get validatingwebhookconfigurations
      NAME                             WEBHOOKS   AGE
      cert-manager-webhook             1          45h
      controllers-validating-webhook   1          31h
      ippoolcr-validating-webhook      1          31h
      namespaces-validating-webhook    1          31h
      pods-validating-webhook          1          31h
      pvcs-validating-webhook          1          31h

#. If the ``robin-*`` mutating webhooks displayed in the step 6 output or the validating webhooks displayed in the step 8 output are not present on your setup, restart the ``robin-server-bg`` service:

   .. code-block:: text

      # rbash master
      # supervisorctl restart robin-server-bg

#. Set the toleration seconds for all NFS server Pods to 60 seconds for the node ``notready`` condition and to 0 seconds for the node ``unreachable`` condition:

   .. code-block:: text

      # for pod in `kubectl get pod -n robinio -l robin.io/instance=robin-nfs --output=jsonpath={.items..metadata.name}`; do echo "Updating $pod tolerationseconds"; kubectl patch pod $pod -n robinio --type='json' -p='[{"op": "replace", "path": "/spec/tolerations/0/tolerationSeconds", "value": 60}, {"op": "replace", "path": "/spec/tolerations/1/tolerationSeconds", "value": 0}]'; done 2>/dev/null

New Features
=============

Support for Clusters with AMD Processors
----------------------------------------

Robin CNP v5.5.1 supports clusters with AMD processors. The following improvements are available to support clusters with AMD processors:

* Support for Mellanox VFs using the ``mlx5_core`` driver for clusters with AMD processors.

* Increased number of Pods per node.

* Parameter to configure ``max-pods`` per node using the ``config.json`` file.

Support for Mellanox VFs using the mlx5_core driver for clusters with AMD processors
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Starting with Robin CNP v5.5.1, support for the ``mlx5_core`` VF (Virtual Function) driver is provided as part of IP pools for clusters with Intel and AMD processors. You can now configure IP pools with the ``--vfdriver mlx5_core`` option to use Mellanox Virtual Functions on clusters.

Robin CNP continues to support the native ``iavf`` VF driver. IP pools configured with ``iavf`` can still be used to allocate VFs from Mellanox NICs.

You can use the following command to configure the ``mlx5_core`` driver:

.. code-block:: text

   robin ip-pool add <pool name> --driver sriov --vfdriver mlx5_core --subnet <subnet> --gateway <gateway> --device-ids <device ids> --nic <nic> --nodes <nodes>

**Example:**

.. code-block:: text

   # robin ip-pool add mlx-1 --driver sriov --prefix 64 --vfdriver mlx5_core --range 2a00:fbc:1270:1f3b:0:0:0:1-1000 --vlan 3897

Change the maximum pods per node after installation
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

You can set the maximum number of Pods per node any time after installing Robin CNP. Decide the maximum number of Pods based on your requirements and the resources on your cluster. You can also set the maximum Pods per node while installing Robin CNP using the ``max-pods`` parameter in the ``config.json`` file.

**Prerequisites**

- You must complete the following steps on all nodes of the cluster.

- Before you change the maximum Pods per node, you need to verify and configure ``pod-max-pids`` on master nodes and change it if required.

To change the maximum Pods per node after installation, complete the following steps:
1. Update the kubelet configuration

   a. Edit the kubelet service configuration file:

      .. code-block:: text

         # vi /etc/sysconfig/kubelet

   b. Add or update the ``--max-pods`` parameter in ``KUBELET_EXTRA_ARGS``:

      .. code-block:: text

         # cat /etc/sysconfig/kubelet
         # KUBELET_EXTRA_ARGS="--container-runtime-endpoint=unix:///var/run/crirobin.sock --image-service-endpoint=unix:///var/run/crirobin.sock --enable-controller-attach-detach=true --cluster-dns=fd74:ca9b:3a09:868c:0252:0059:0124:800a --cluster-domain=abhinav.mantina.robin --authorization-mode=Webhook --client-ca-file=/etc/kubernetes/pki/ca.crt --feature-gates=RotateKubeletServerCertificate=true,MemoryManager=true --container-log-max-size=260M --read-only-port=0 --event-qps=0 --streaming-connection-idle-timeout=30m --tls-cipher-suites=TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256,TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384,TLS_ECDHE_RSA_WITH_CHACHA20_POLY1305,TLS_RSA_WITH_AES_128_CBC_SHA,TLS_RSA_WITH_AES_256_CBC_SHA,TLS_RSA_WITH_AES_128_GCM_SHA256,TLS_RSA_WITH_AES_256_GCM_SHA384,TLS_ECDHE_RSA_WITH_AES_256_CBC_SHA --reserved-cpus=0 --cpu-manager-policy=static --topology-manager-policy=restricted --topology-manager-scope=pod --pod-max-pids=4096 --max-pods=260"

2. Restart the kubelet service

   a. Restart the kubelet service:

      .. code-block:: text

         # systemctl restart kubelet.service

3. Update Robin's host information

   a. Probe the host to rediscover the configuration changes:

      .. code-block:: text

         # robin host probe <hostname> --rediscover

      **Example:**

      .. code-block:: text

         # robin host probe --rediscover r7515-01

4. Verify the changes

   a. Check that the maximum pods value is updated in Robin:

      .. code-block:: text

         # robin host list | egrep "<hostname>|Pod|----"

      **Example:**

      .. code-block:: text

         # robin host list | egrep "r7515-01|Pod|----"
         Id            | Hostname | Version    | Status | RPool   | Avail. Zone | LastOpr | Roles | Cores           | GPUs  | Mem              | HDD(#/Alloc/Total) | SSD(#/Alloc/Total) | Pod Usage  | Joined Time
         --------------+----------+------------+--------+---------+-------------+---------+-------+-----------------+-------+------------------+--------------------+--------------------+------------+----------------------
         1755834757:36 | r7515-01 | 5.5.1-1939 | Ready  | default | N/A         | ONLINE  | S,C   | 316.75/3.25/320 | 0/0/0 | 1636G/629G/2266G | -/-/-              | 8/4414G/47693G     | 232/28/260 | 09 Sep 2025 14:34:16
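As an additional, optional cross-check from the Kubernetes side (illustrative only; ``r7515-01`` is the example node used above), you can confirm that the node now reports the new Pod capacity:

.. code-block:: text

   # kubectl get node r7515-01 -o jsonpath='{.status.capacity.pods}'
   260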
Change the write unit size for SSD disk drives
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Robin CNP supports write unit sizes of 4096 bytes and 512 bytes. You must make sure that all disk drives on the cluster have the same write unit size.

.. Note:: Before changing the write unit size, make sure your disk drive supports the write unit size that you intend to change to. If you try to change to an unsupported write unit size, the operation fails.

When adding a new cluster to your environment, if the new cluster has disk drives that by default come with a write unit size that is not present on your existing clusters, you must change the write unit to match your existing Robin CNP clusters.

The following steps help you change the write unit size for disk drives in the Robin CNP cluster from 4096 bytes to 512 bytes or vice versa.

.. Note:: You must run the steps on all nodes of the cluster.

To change the write unit size, complete the following steps:

1. Unregister the SSD drive

   a. List drives to find your target drive:

      .. code-block:: text

         # robin drive list | grep <drive id>

      **Example:**

      .. code-block:: text

         # robin drive list | grep 290
         290 | 0xui.3656313058a046000025384300000002 | r7715-04 | default | nvme-eui.3656313058a046000025384300000002 | 5961 | N | SSD | 4624/4624 (100%) | 0/400 | Storage | ONLINE | READY | 4096

   b. Unregister the drive:

      .. code-block:: text

         # robin drive unregister <drive id> --wait -y

      **Example:**

      .. code-block:: text

         # robin drive unregister 0xui.3656313058a046000025384300000002 --wait -y
         Job: 71534 Name: DiskUnregister State: INITIALIZED Error: 0
         Job: 71534 Name: DiskUnregister State: COMPLETED Error: 0

   c. Verify the drive is unregistered:

      .. code-block:: text

         # robin drive list | grep <drive id>

      **Example:**

      .. code-block:: text

         # robin drive list | grep 0xui.3656313058a046000025384300000002

2. Rediscover the drive

   a. Probe the host to rediscover all drives:

      .. code-block:: text

         # robin host probe <hostname> --rediscover --all --wait

      **Example:**

      .. code-block:: text

         # robin host probe r7715-04 --rediscover --all --wait
         Job: 71535 Name: HostProbe State: VALIDATED Error: 0
         Job: 71535 Name: HostProbe State: COMPLETED Error: 0

   b. Verify the drive appears with UNKNOWN status:

      .. code-block:: text

         # robin drive list | grep <drive id>

      **Example:**

      .. code-block:: text

         # robin drive list | grep 0xui.3656313058a046000025384300000002
         - | 0xui.3656313058a046000025384300000002 | r7715-04 | default | nvme-eui.3656313058a046000025384300000002 | 5961 | N | SSD | 4624/4624 (100%) | 0/100000 | Storage | UNKNOWN | INIT | 4096

   c. Check the current write unit:

      .. code-block:: text

         # robin disk info <drive id> | grep -i write

      **Example:**

      .. code-block:: text

         # robin disk info 0xui.3656313058a046000025384300000002 | grep -i write
         Write Unit: 4096

3. Update the write unit

   a. Change the write unit to the required size (512 or 4096 bytes):

      .. code-block:: text

         # robin disk update --writeunit <write unit size> <drive id> --wait

      **Example:**

      .. code-block:: text

         # robin disk update --writeunit 512 0xui.3656313058a046000025384300000002 --wait
         Job: 71539 Name: DiskModify State: INITIALIZED Error: 0
         Job: 71539 Name: DiskModify State: COMPLETED Error: 0

4. Reassign the storage role

   a. Add the storage role back to the node:

      .. code-block:: text

         # robin host add-role <hostname> Storage --wait

      **Example:**

      .. code-block:: text

         # robin host add-role r7715-04 Storage --wait
         Job: 71540 Name: HostAddRoles State: VALIDATED Error: 0
         Job: 71540 Name: HostAddRoles State: COMPLETED Error: 0

   b. Verify the drive is online with the new write unit:

      .. code-block:: text

         # robin drive list | grep <drive id>

      **Example:**

      .. code-block:: text

         # robin drive list | grep 0xui.3656313058a046000025384300000002
         291 | 0xui.3656313058a046000025384300000002 | r7715-04 | default | nvme-eui.3656313058a046000025384300000002 | 5961 | N | SSD | 4624/4624 (100%) | 0/100000 | Storage | ONLINE | READY | 512

   c. Confirm the write unit changed to 512:

      .. code-block:: text

         # robin disk info <drive id> | grep -i write

      **Example:**

      .. code-block:: text

         # robin disk info 0xui.3656313058a046000025384300000002 | grep -i write
         Write Unit: 512

After completing the above steps:

* The drive status changes from ``UNKNOWN`` to ``ONLINE``

* The drive state changes from ``INIT`` to ``READY``

Improvements
============

.. PP-39431

Remove NUMA restrictions for KVM apps
-------------------------------------

Starting with Robin CNP v5.5.1, you can remove NUMA restrictions for KVM apps when creating them. This helps in deploying KVM app Pods on all worker nodes of a cluster.

To remove NUMA restrictions, you must add the following annotation in the ``input.yaml`` and create KVM apps using this ``input.yaml`` (see the illustrative fragment after this list):

* ``robin.runtime.skip_cpuset_mems: "ENABLED"``
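The following is a minimal, illustrative ``input.yaml`` fragment. Only the annotation key and value come from this release; the surrounding structure (shown here as a top-level ``annotations`` map) is an assumption for illustration and must follow your bundle's actual input file format:

.. code-block:: text

   # Illustrative fragment only; the rest of the input.yaml depends on your bundle
   annotations:
     robin.runtime.skip_cpuset_mems: "ENABLED"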
.. PP-39865

Support to create an application with static IPs and static MACs from a snapshot
----------------------------------------------------------------------------------

Starting with Robin CNP v5.5.1, Robin CNP supports creating an application with static IP and static MAC addresses from its snapshot.

.. Note:: Creating an application with static IP and static MAC addresses is supported only for Robin bundle applications.

To create an application, you must specify the static IP and static MAC addresses in the following parameters:

* ``static-ips``

* ``static-macs``

For more information, see *Create an application from a snapshot* in the Robin CNP documentation.

Support to create an application with static IPs and static MACs from a backup
--------------------------------------------------------------------------------

Starting with Robin CNP v5.5.1, Robin CNP supports creating an application with static IP and static MAC addresses from its backup.

.. Note:: Creating an application with static IP and static MAC addresses is supported only for Robin bundle applications.

To create an application, you must specify the static IP and static MAC addresses in the following parameters:

* ``static-ips``

* ``static-macs``

For more information, see *Create an application from a backup* in the Robin CNP documentation.
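The following invocation is an illustration only: the command form and all placeholder values are assumptions, not confirmed syntax, and the exact commands are described in the guides referenced above. It shows where the ``static-ips`` and ``static-macs`` parameters fit when creating an application from its snapshot; the same two parameters apply when creating an application from a backup:

.. code-block:: text

   # robin app create from-snapshot <new app name> <snapshot id> --static-ips <ip1>,<ip2> --static-macs <mac1>,<mac2>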
Fixed Issues
============

============= ========================================================================================================================
Reference ID  Description
============= ========================================================================================================================
RSD-8083      The IO hang issue observed on clusters with large disk sizes is fixed.
RSD-9127      The output of the Robin CLI commands (``robin app list`` and ``robin instance list``) displayed an incorrect, inflated CPU count for applications. This issue is fixed.
RSD-9981      After upgrading from Robin CNP v5.4.3 HF5 to Robin CNP v5.5.0, auto-deployment of KVM apps failed on certain nodes due to insufficient CPU resources on these nodes. This issue is fixed.
RSD-9911      Kafka Pods restarted due to I/O timeouts on volumes during auto-rebalance operations. This issue is fixed.
RSD-5327      The issue of IOMGR restarting slowly is fixed.
RSD-8854      The issue of the IOMGR service crashing on a node when it came back online after a reboot is fixed.
RSD-8104      A delay in creating a large-size volume was observed when the volume size is larger than the individual disks on the cluster. This issue is fixed.
RSD-8083      The issue of the dev slices leader changing tasks, which delays epoch update tasks and results in IO timeouts on the application side, is fixed.
RSD-9478      When a node makes RPC calls to an unhealthy node, these RPC calls are blocked for a long time because the TCP keepalive timeout is configured only for client-side RPC sockets. This issue is fixed.
RSD-8083      The default CPU and memory resource limits for the Robin Patroni PostgreSQL database led to performance issues, particularly for Storage Manager (SM) tasks in larger cluster environments. This issue is fixed.
RSD-8846      The Robin CAT profiles feature did not work as expected on RHEL 8.10. This issue is fixed.
RSD-9127      For a Pod with a sidecar container, Robin CNP erroneously allocated 0.5 CPU when the Pod actually needs 1.5 CPUs. Due to this issue, Pod deployments failed, indicating insufficient CPU. This issue is fixed.
RSD-9316      When you try to deploy a large-sized KVM app on a Robin CNP cluster, the deployment fails with the error message ``Failed to download file_object``. **Example:** *Failed to download file_object c3cc99163f225f167ae886339eb02fca, not accessible at this point. Ensure the file collection is ONLINE. Error: Connection broken: ConnectionResetError(104, 'Connection reset by peer').* This issue is fixed.
RSD-9919      When upgrading from a supported Robin CNP version to v5.5.0-1857, the upgrade failed due to the ``robin-cert-init`` error. This issue is fixed.
PP-38268      The ``Replica State`` for volumes incorrectly displayed a percentage value (for example, 99%) instead of ``SYNCED``, even when all underlying slices were healthy and the volume was not degraded. This occurred because the system was overly sensitive in calculating resync progress, resulting in false negatives. This issue is fixed.
PP-39285      In a rare circumstance, when Patroni instances reboot in a particular order, a lagging Patroni replica erroneously claims the Leader role. This issue is fixed.
PP-38087      In certain cases, the snapshot size allocated to a volume could be less than what is requested. This occurs when the volume is allocated from multiple disks. This issue is fixed.
PP-34457      When you have a Robin CNP cluster with the Metrics feature enabled, the Grafana application does not display metrics under certain conditions. This issue is fixed.
PP-38061      In rare scenarios, when upgrading from Robin CNP v5.4.3 HF5 to Robin CNP v5.5.0, the upgrade may get stuck while executing Robin upgrade actions on the primary master node because some of the hosts are not in the Ready state. This issue is fixed.
RSD-9176      Creating an application from a bundle fails due to missing IP pool configuration details in the ``robin-bootstrap-config`` ConfigMap. This issue is fixed.
RSD-9146      Prune, Purge, and Archive schedules triggered duplicate jobs in the Robin cluster. This issue is fixed.
RSD-9088      An error occurs when running the ``robin ip-pool list`` or ``robin ip-pool list --urlinfo`` command on clusters upgraded to Robin CNP v5.5.0: *ERROR - local variable 'total_cnt' referenced before assignment*. This issue is fixed.
RSD-9289      The Robin CLI commands (``robin job info`` and ``robin job list``) were not in sync with each other when a job is in progress. This issue is fixed.
RSD-9208      The issue where creating a Robin Bundle app failed with the following error after upgrading to Robin CNP v5.5.0 is fixed: *IndexError: list index out of range*
RSD-9204      The issue where creating a KVM app failed with the following error after upgrading to Robin CNP v5.5.0 is fixed: *NameError: name 'net_attr' is not defined*
RSD-9202      The issue of ``robin-nfs-watchdog`` Pods being down during the Robin CNP v5.5.0 upgrade process due to an invalid image name is fixed.
RSD-9222      The issue where the ``robin instance list`` command was not showing the correct status of Autopilot after upgrading to Robin CNP v5.5.0 is fixed.
RSD-9075      The issue in Robin CNP v5.5.0 where Pod deployments that depend on VLANs configured as 'ALL' on the host network interfaces fail with the Pending status is fixed.
RSD-9642      The robin-nfs bundle application experienced continuous restarts when Autopilot is enabled. This issue is fixed.
RSD-9455      In Robin CNP v5.5.0, the ``robin host list`` command output did not account for the shared CPU of a multi-container Pod where one container has a guaranteed CPU and the other has a shared CPU. This issue is fixed.
RSD-9346      When upgrading to Robin CNP v5.5.0, intermittent failures of the ``robin-patroni-postgres-operator`` Pod resulted in "timed out waiting for the condition" errors. This issue is fixed.
RSD-9290      The issue of the Robin Master Pod (robinrcm) restarting due to exceeding the ``max-pid-limit`` is fixed. The ``max-pid-limit`` is a configurable parameter. For more information, contact the Robin Customer Support team.
RSD-9273      The issue where restarting a Robin Bundle application failed with the following error after upgrading to Robin CNP v5.5.0 is fixed: *'int object' has no attribute 'split'*
============= ========================================================================================================================

Known Issues
============

============= ========================================================================================================================
Reference ID  Description
============= ========================================================================================================================
PP-39656      **Symptom**

              When you deploy applications with ``preferredDuringSchedulingIgnoredDuringExecution`` node soft affinity combined with ``preferredDuringSchedulingIgnoredDuringExecution`` Pod anti-affinity, Pods are occasionally scheduled onto nodes that do not possess the label specified in the node affinity rule.

              **Workaround**

              Restart the Pod to place it on a node with the preferred label.

PP-39645      **Symptom**

              Robin CNP v5.5.1 may rarely fail to honor soft Pod anti-affinity, resulting in uneven Pod distribution across labeled nodes.

              When you deploy an application with the recommended ``preferredDuringSchedulingIgnoredDuringExecution`` soft Pod anti-affinity, Pods may not be uniformly distributed across the available, labeled nodes as expected. Kubernetes passes candidate nodes to Robin CNP for Pod scheduling. In some situations, a request from Kubernetes to Robin CNP may not include the node required to honor soft affinity.

              **Workaround**

              Bounce the Pod that has not honored soft affinity.

PP-39632      **Symptom**

              After upgrading to Robin CNP v5.5.1, the NFS client might hang even though there is no pending I/O.
              To confirm that there is no pending I/O, check ``/var/log/robin/nodeplugin/robin-csi.log`` for messages similar to the following:

              .. code-block:: text

                 2025-08-28 11:41:04,584 - CsiServer_9 - robin.utils - INFO - Executing command /usr/bin/nc -z -w 6 172.19.149.161 2049 with timeout 60 seconds
                 2025-08-28 11:41:04,612 - CsiServer_9 - robin.utils - INFO - Command /usr/bin/nc -z -w 6 172.19.149.161 2049 completed with return code 0.
                 2025-08-28 11:41:04,613 - CsiServer_9 - robin.utils - INFO - Standard out:

              Also, you can find the following messages in the ``dmesg`` output:

              .. code-block:: text

                 [Thu Aug 28 11:59:07 2025] nfs: server 172.19.131.218 not responding, timed out
                 [Thu Aug 28 11:59:17 2025] nfs: server 172.19.131.218 not responding, timed out
                 [Thu Aug 28 11:59:27 2025] nfs: server 172.19.131.218 not responding, timed out

              **Workaround**

              1. Check the node provisioner logs to identify the PVC whose path check is hung.

              2. For the Deployment or StatefulSet that is using the problematic PVC, scale down the replica count to ``0``.

              3. Ensure all Pods associated with the application have terminated.

              4. Scale the replica count back up to the original value.

PP-39429      **Symptom**

              When you try to create an application from a snapshot or backup, the configuration of static IP addresses and static MAC addresses is supported only for Robin Bundle applications; it is not supported for Helm-based applications.

PP-38044      **Symptom**

              When attempting to detach a repository from a hydrated Helm application, the operation might fail with the following error:

              *Can't detach repo as the application is in IMPORTED state, hydrate it in order to detach the repo from it.*

              This issue occurs even if the application has already been hydrated. The system incorrectly marks the application as being in the ``IMPORTED`` state, preventing the repository from being detached.

              **Workaround**

              To detach the repository, manually rehydrate the application and then retry the detach operation:

              1. Run the following command to rehydrate the application:

                 .. code-block:: text

                    # robin app hydrate <app name> --wait

              2. Once the hydration is complete, detach the repository:

                 .. code-block:: text

                    # robin app detach-repo <app name> <repo name> --wait -y

PP-37652      **Symptom**

              When you deploy a multi-container application using Helm with static IPs assigned from an IP pool, only a subset of the Pods appears on the Robin CNP UI.

              **Workaround**

              Run the following CLI command to view all the Pods:

              .. code-block:: text

                 # robin app info <app name> --status

PP-37416      **Symptom**

              In rare scenarios, when upgrading from Robin CNP v5.4.3 HF6 to Robin CNP v5.5.1, the upgrade might fail with the following error during the Kubernetes upgrade process on the other master nodes:

              *Failed to execute kubeadm upgrade command for K8S upgrade. Please make sure you have the correct version of kubeadm rpm binary installed*

              Steps to identify the issue:

              1. Check the ``/var/log/robin-install.log`` file to know why the kubeadm upgrade failed.
                 **Example:**

                 *[upgrade/staticpods] Moved new manifest to "/etc/kubernetes/manifests/kube-scheduler.yaml" and backed up old manifest to "/etc/kubernetes/tmp/kubeadm-backup-manifests-2025-01-31-01-03-52/kube-scheduler.yaml" [upgrade/staticpods] Waiting for the kubelet to restart the component [upgrade/staticpods].*

                 *This might take a minute or longer depending on the component/version gap (timeout 5m0s) static Pod hash for component kube-scheduler on Node sm-compute04 did not change after 5m0s: timed out waiting for the condition*

                 You can get this error for any of the static manifests: api-server, etcd, scheduler, or controller-manager.

              2. If you notice the above error, run the following command to inspect the Docker containers for the failed component. The containers will likely be in the Exited state.

                 .. code-block:: text

                    # docker ps -a | grep schedule

              **Workaround**

              If you notice the above error, restart the kubelet:

              .. code-block:: text

                 # systemctl restart kubelet

PP-35015      **Symptom**

              After renewing the expired Robin license successfully, Robin CNP incorrectly displays the License Violation error when you try to add a new user to the cluster. If you notice this issue, apply the following workaround.

              **Workaround**

              Restart the ``robin-server-bg`` service:

              .. code-block:: text

                 # rbash master
                 # supervisorctl restart robin-server-bg

PP-34492      **Symptom**

              When you run the ``robin host list`` command and notice that a host is in the ``NotReady`` and ``PROBE_PENDING`` states, follow these workaround steps to diagnose and recover the host.

              **Workaround**

              1. Run the following command to check which host is in the ``NotReady`` and ``PROBE_PENDING`` states:

                 .. code-block:: text

                    # robin host list

              2. Run the following command to check the current (Curr) and desired (Desired) states of the host in the Agent Process (AP) report:

                 .. code-block:: text

                    # robin ap report | grep <hostname>

              3. Run the following command to probe the host and recover it:

                 .. code-block:: text

                    # robin host probe <hostname> --wait

                 This command forces a probe of the host and updates its state in the cluster.

              4. Run the following command to verify the host's state:

                 .. code-block:: text

                    # robin host list

                 The host should now transition to the ``Ready`` state.

PP-34414      **Symptom**

              In rare scenarios, the ``IOMGR`` service might fail to open devices in exclusive mode when it starts because other processes are using these disks. You might observe the following issues:

              - The Robin server is down.

              - All app Pods restart, and some app Pods get stuck in the ContainerCreating state.

              To confirm the above issues, complete the following steps:

              1. Check for the ``EVENT_DISK_FAULTED`` event type in the disk events:

                 .. code-block:: text

                    # robin event list --type EVENT_DISK_FAULTED

              2. If you see the disk faulted error, check the ``IOMGR`` logs for the **dev_open()** error and the **Failed to exclusively open** message on the node where the disks are present:

                 .. code-block:: text

                    # cat iomgr.log.0 | grep <disk name> | grep "dev_open"
              3. If you see the **Device or resource busy** error in the log file, use the fuser command with the device path to confirm whether the device is in use:

                 .. code-block:: text

                    # fuser /dev/disk/by-id/scsi-SATA_Micron_M500_MTFD_1401096049D5

              **Workaround**

              If the device is not in use, restart the ``IOMGR`` service on the respective node:

              .. code-block:: text

                 # supervisorctl restart iomgr

PP-34226      **Symptom**

              When a PersistentVolumeClaim (PVC) is created, the CSI provisioner initiates a VolumeCreate job. If this job fails, the CSI provisioner calls a new VolumeCreate job again for the same PVC. However, if the PVC is deleted during this process, the CSI provisioner continues to call the VolumeCreate job because it does not verify the existence of the PVC before calling the job.

              **Workaround**

              Bounce the CSI provisioner Pod:

              .. code-block:: text

                 # kubectl delete pod <csi provisioner pod name> -n robinio

PP-38251      **Symptom**

              When evacuating a disk from an offline node, the ``robin drive evacuate`` command fails with the following error message:

              *Json deserialize error: invalid value: integer -'10', expected u64 at line 1 column 2440.*

              **Workaround**

              If you notice the above issue, contact the Robin CS team.

PP-37965      **Symptom**

              In Robin CNP v5.5.1, when you scale up a Robin Bundle app, the scale-up does not consider the existing CPU cores and memory already in use by a vnode. As a result, Robin CNP is not able to find a suitable host, even though additional resources are available.

              **Workaround**

              If you notice this issue, apply the following workaround:

              1. Scale up the resources using the following command:

                 .. code-block:: text

                    # robin app computeqos <app name> --role <role> --cpus <cpus> --memory <memory> --wait

              2. If the scale-up operation fails, stop the app using the following command:

                 .. code-block:: text

                    # robin app stop <app name> --wait

              3. Try to scale up the resources again:

                 .. code-block:: text

                    # robin app computeqos <app name> --role <role> --cpus <cpus> --memory <memory> --wait

PP-39619      **Symptom**

              After creating an app from a backup, the app is stuck in the ``CREATE_FAILED`` state. When you try to delete this app, you might get the following error message:

              *App couldn't be deleted. Please detach app from repos before deleting.*

              **Workaround**

              If you notice the above issue, contact the Robin CS team.

PP-36865      **Symptom**

              After rebooting a node, the node might not come back online for a long time, and the host BMC console displays the following message for RWX PVCs mounted on that node:

              *Remounting nfs rwx pvc timed out, issuing SIGKILL*

              **Workaround**

              Power cycle the host machine.

PP-39806      **Symptom**

              When a node hosting KVM applications is shut down due to technical reasons, you might get the following error message during the migration process of these Pods to another node:

              *Target /usr/local/robin/instances/kvm/clone-kvm-ovs2-server.svc.cluster.local is busy, please retry later.*

              **Workaround**

              Run the following command to restart the Robin instance after five minutes:
              .. code-block:: text

                 # robin instance restart <instance name>

PP-38471      **Symptom**

              When StatefulSet Pods restart, the Pods might get stuck in the ``ContainerCreating`` state with the following error:

              *CSINode does not contain driver robin due to stale NFS mount points and failure of the csi-nodeplugin-robin Pod due to CrashLoopBackOff state.*

              **Workaround**

              If you notice this issue, restart the ``csi-nodeplugin`` Pod:

              .. code-block:: text

                 # kubectl delete pod <csi-nodeplugin pod name> -n robinio

PP-38039      **Symptom**

              During node reboot or power reset scenarios, application volumes may force shutdown due to I/O errors. As a result, application Pods might get stuck in the ``ContainerCreating`` state with the following mount failure error:

              *Context Deadline Exceeded.*

              On the affected node where the volume is mounted or the application Pod is scheduled, the following error might be observed in the ``dmesg`` output:

              *Log I/O Error Detected. Shutting down filesystem*

              **Workaround**

              If you notice this issue, contact the Robin Customer Support team for assistance.

PP-37330      **Symptom**

              During or after upgrading from the supported versions to Robin CNP v5.5.1, or following node reboots and failover events, applications relying on ReadWriteMany (RWX) NFS volumes may experience critical failures. These failures might manifest as the following:

              * Pods stuck in the ``ContainerCreating`` state

              * NFS server Pods entering an ``ASSIGNED_ERR`` state

              * Robin jobs failing with specific mount errors

              The underlying cause for these symptoms could be duplicate filesystem UUIDs. You might observe one of the following error messages:

              * NFS server Pod in ``ASSIGNED_ERR``

              * RWX apps and the ``NFSAgentAddExport`` job might fail with an error message similar to the following: ``wrong fs type``, ``bad option``, ``bad superblock on /dev/sdj``, ``missing codepage`` or ``helper program``, or other error

              .. Note:: Check the job info to see related failed jobs.

              **Example**

              */bin/mount /dev/sdn /var/lib/robin/nfs/robin-nfs-shared-35/ganesha/pvc-822e76f0-9bb8-4629-8aae-8318fb2d3b41 -o discard failed with return code 32: mount: /var/lib/robin/nfs/robin-nfs-shared-35/ganesha/pvc-822e76f0-9bb8-4629-8aae-8318fb2d3b41: wrong fs type, bad option, bad superblock on /dev/sdn, missing codepage or helper program, or other error.*

              **Workaround**

              If you notice this issue, contact the Robin Customer Support team for assistance.

PP-38078      **Symptom**

              After a network partition, the ``robin-agent`` and ``iomgr-server`` may not restart automatically, and stale devices may not be cleaned up. This issue occurs because the consulwatch thread responsible for monitoring Consul and triggering restarts may fail to detect the network partition. As a result, stale devices may not be cleaned up, potentially leading to resource contention and other issues.

              **Workaround**

              Manually restart the ``robin-agent`` and ``iomgr-server`` using ``supervisorctl``:

              .. code-block:: text

                 # supervisorctl restart robin-agent iomgr-server

PP-39842      **Symptom**

              In Robin CNP v5.5.1, the ``robin host list`` command does not account for long-running sidecar containers in the initContainers section for bundle applications. This causes Robin CNP to incorrectly place Pods, which can lead to oversubscription or performance issues. You can identify this issue by comparing the output of the following commands:
              .. code-block:: text

                 # kubectl describe node <node name>
                 # robin host list
                 # robin k8s-collect info

              The resource usage reported by ``kubectl describe node`` can differ from the usage shown in the Robin commands because the Robin commands do not include resources consumed by long-running sidecar containers.

PP-39901      **Symptom**

              After rebooting a worker node that is hosting Pods with Robin RWX volumes, one or more application Pods using these volumes might get stuck in the ``ContainerCreating`` state indefinitely.

              **Workaround**

              If you notice the above issue, contact the Robin CS team.

PP-38924      **Symptom**

              After you delete multiple Helm applications, one of the Pods might get stuck in the ``Error`` state, and one or more ReadWriteMany (RWX) volumes might get stuck in the ``Terminating`` state.

              **Workaround**

              Restart Docker and kubelet on the node where the Pod is stuck in the ``Error`` state.

PP-39936      **Symptom**

              When relocating a Pod to another node using the ``robin instance relocate`` command, if the ``--to-host`` option is used, affinity rules are ignored.

              **Workaround**

              Check affinity rules and violations manually when using the ``--to-host`` option.

PP-39467      **Symptom**

              When deploying applications with ReadWriteMany (RWX) PersistentVolumeClaims (PVCs), application Pods fail to mount volumes and get stuck in the ``ContainerCreating`` state because RPC requests are stuck in I/O operations on the volumes, leading to degraded volumes and faulted storage drives.

              **Workaround**

              Reboot the host that is in the ``Notready`` state.
============= ========================================================================================================================

Technical Support
=================

Contact Robin Technical Support for any assistance.