26. Release Notes

26.1. Robin Cloud Native Platform v5.7.0

The Robin Cloud Native Platform (CNP) v5.7.0 release notes include pre- and post-upgrade steps, a new feature, improvements, fixed issues, and known issues.

Release Date: January 30, 2026

26.1.1. Infrastructure Versions

The following software applications are included in this CNP release:

Software Application     Version
Kubernetes               1.33.5
Docker                   25.0.2 (RHEL 8.10 or Rocky Linux 8.10)
Podman                   5.4.0 (RHEL 9.6)
Prometheus               2.39.1
Prometheus Adapter       0.10.0
Node Exporter            1.4.0
Calico                   3.28.2
HAProxy                  2.4.7
PostgreSQL               14.12
Grafana                  9.2.3
CRI Tools                1.33.0
cert-manager             1.19.1

26.1.2. Supported Operating Systems

The following are the supported operating systems and kernel versions for Robin CNP v5.7.0:

OS Version                        Kernel Version
Red Hat Enterprise Linux 8.10     4.18.0-553.el8_10.x86_64
Rocky Linux 8.10                  4.18.0-553.el8_10.x86_64
Red Hat Enterprise Linux 9.6      5.14.0-570.24.1.el9_6.x86_64+rt

Note

Robin CNP supports both RT and non-RT kernels on the supported operating systems listed above.

26.1.3. Upgrade Paths

The following are the supported upgrade paths for Robin CNP v5.7.0:

  • Robin CNP v5.4.3 HF5+PP to Robin CNP v5.7.0-296

  • Robin CNP v5.4.3 HF6 to Robin CNP v5.7.0-296

  • Robin CNP v5.4.3 HF7 to Robin CNP v5.7.0-296

  • Robin CNP v5.5.1 to Robin CNP v5.7.0-296

26.1.3.1. Pre-upgrade considerations

  • For a successful upgrade, you must run the possible_job_stuck.py script before and after the upgrade. Contact the Robin Support team for the upgrade procedure using the script.

  • If your cluster already has cert-manager installed, you must uninstall it before upgrading from a supported Robin CNP version to Robin CNP v5.7.0.

  • Before upgrading to Robin CNP v5.7.0, you must stop the robin-certs-check job or CronJob. To stop the robin-certs-check job, run the kubectl delete job robin-certs-check -n robinio command, and to stop the robin-certs-check CronJob, run the robin cert check --stop-cronjob command.

26.1.3.2. Post-upgrade considerations

  • After upgrading to Robin CNP v5.7.0, verify that the k8s_resource_sync config parameter is set to 60000 using the robin schedule list | grep -i K8sResSync command. If it is not, run the robin schedule update K8sResSync k8s_resource_sync 60000 command to update the value.

  • After upgrading to Robin CNP v5.7.0, you must run the robin-server validate-role-bindings command. To run this command, you need to log in to the robin-master Pod. This command verifies the roles assigned to each user in the cluster and corrects them if necessary.

  • After upgrading to Robin CNP v5.7.0, the k8s_auto_registration config parameter is disabled by default. The config setting is deactivated to prevent all Kubernetes apps from automatically registering and consuming resources. The following are the points you must be aware of with this change:

    • You can manually register Kubernetes apps using the robin app register command and use Robin CNP for snapshot, clone, and backup operations on them.

    • As this config parameter is disabled, when you run the robin app nfs-list command, the mappings between Kubernetes apps and NFS server Pods are not listed in the command output.

    • If you need the mapping between a Kubernetes app and an NFS server Pod when the k8s_auto_registration config parameter is disabled or the Kubernetes app is not manually registered, get the PVC name from the Pod YAML file (kubectl get pod -n <name> -o yaml) and run the robin nfs export list | grep <pvc name> command, as shown in the sketch after this list.

    • The robin nfs export list command output displays the PVC name and namespace.

  • After upgrading to Robin CNP v5.7.0, you must start the robin-certs-check CronJob using the robin cert check --start-cronjob command if it was stopped before the upgrade.
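The following sketch combines the commands mentioned above to look up the NFS export for a Pod's PVC when auto-registration is disabled. The Pod and namespace names are placeholders; only the standard kubectl jsonpath query for the PVC name is assumed beyond the commands already listed:

# kubectl get pod <pod-name> -n <namespace> -o jsonpath='{.spec.volumes[*].persistentVolumeClaim.claimName}'
# robin nfs export list | grep <pvc-name-from-previous-command>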

26.1.3.3. Pre-upgrade steps

  • Upgrading from Robin CNP v5.4.3 to Robin CNP v5.7.0-296

    Before upgrading from Robin CNP v5.4.3 (HF5+PP, HF6, or HF7) to Robin CNP v5.7.0, perform the following steps:

    1. Update the value of the suicide_threshold config parameter to 1800:

      # robin config update agent suicide_threshold 1800
      
    2. Verify that the NFS monitor is enabled. The value must be True:

      # robin schedule list | grep -i NFS
      
    3. Set the toleration seconds for all NFS server Pods to 86400 seconds. After the upgrade, you must change the toleration seconds according to the post-upgrade steps.

      for pod in $(kubectl get pod -n robinio -l robin.io/instance=robin-nfs --output=jsonpath={.items..metadata.name}); do
          echo "Updating $pod tolerationSeconds to 86400"
          kubectl patch pod $pod -n robinio --type='json' -p='[{"op": "replace", "path": "/spec/tolerations/0/tolerationSeconds", "value": 86400}, {"op": "replace", "path": "/spec/tolerations/1/tolerationSeconds", "value": 86400}]'
      done
      
    4. Verify that the webhooks are enabled:

      # robin config list | grep -i robin_k8s_extension
      

      The value must be True. If it is disabled, enable it:

      # robin config update manager robin_k8s_extension True
      
    5. Check that the mutating webhook configurations are present on the cluster:

      # kubectl get mutatingwebhookconfiguration -A
      
    6. Create the files required to enable robin_k8s_extension. Note: You must create the robin-schedulextender-policy-template.yaml and preentry-<version>.sh files in the /usr/local/robin directory on all three Kubernetes control-plane hosts.

      • Create a robin-schedulextender-policy-template.yaml file and add the following:

        apiVersion: kubescheduler.config.k8s.io/v1
        kind: KubeSchedulerConfiguration
        clientConnection:
          kubeconfig: /etc/kubernetes/scheduler.conf
        extenders:
          - urlPrefix: "https://{{hostname}}:{{port}}/{{urlsuffix}}"
            filterVerb: predicates
            enableHTTPS: true
            nodeCacheCapable: false
            ignorable: {{ignorable}}
            httpTimeout: {{httptimeout}}
            tlsConfig:
              insecure: true
            managedResources:
            - name: robin.io/robin-required
              ignoredByScheduler: true
        
      • Create a preentry-<version>.sh file and add the following:

        # cp /usr/local/robin/robin-schedulextender-policy-template.yaml /opt/robin/current/etc/robin/k8s/robin-schedulextender-policy-template.yaml
        # cp /usr/local/robin/robin-schedulextender-policy-template.yaml /etc/robin/k8s/robin-schedulextender-policy-template.yaml
        
  • Upgrading from Robin CNP v5.5.1 to Robin CNP v5.7.0-296

    Before upgrading from Robin CNP v5.5.1 to Robin CNP v5.7.0, perform the following steps:

    1. Verify that the NFS monitor is enabled using the robin schedule list | grep -i NFS command. The value must be True.

    2. Update the value of the suicide_threshold config parameter to 1800:

      # robin config update agent suicide_threshold 1800
      
    3. Set the toleration seconds for all NFS server Pods to 86400 seconds. After the upgrade, you must change the toleration seconds according to the post-upgrade steps.

      for pod in $(kubectl get pod -n robinio -l robin.io/instance=robin-nfs --output=jsonpath={.items..metadata.name}); do
          echo "Updating $pod tolerationSeconds to 86400"
          kubectl patch pod $pod -n robinio --type='json' -p='[{"op": "replace", "path": "/spec/tolerations/1/tolerationSeconds", "value": 86400}, {"op": "replace", "path": "/spec/tolerations/2/tolerationSeconds", "value": 86400}]'
      done
      

26.1.3.4. Post-upgrade steps

  • After upgrading from Robin CNP v5.4.3 or Robin CNP v5.5.1 to Robin CNP v5.7.0-296

    After upgrading from Robin CNP v5.4.3 or Robin CNP v5.5.1 to Robin CNP v5.7.0, perform the following steps:

    1. Update the value of the suicide_threshold config parameter to 40:

      # robin config update agent suicide_threshold 40
      
    2. Set the check_helm_apps config parameter to False:

      # robin config update cluster check_helm_apps False
      
    3. Set the chargeback_track_k8s_resusage config parameter to False:

      # robin config update server chargeback_track_k8s_resusage False
      
    4. Verify that the robin_k8s_extension config parameter is set to True. If it is not, set it to True:

      # robin config update manager robin_k8s_extension True
      
    5. Verify whether the following mutating webhooks are present:

      # kubectl get mutatingwebhookconfigurations -A | grep robin
      k8srobin-deployment-mutating-webhook   1          20d
      k8srobin-ds-mutating-webhook           1          20d
      k8srobin-pod-mutating-webhook          1          20d
      k8srobin-sts-mutating-webhook          1          20d
      robin-deployment-mutating-webhook      1          20d
      robin-ds-mutating-webhook              1          20d
      robin-pod-mutating-webhook             1          20d
      robin-sts-mutating-webhook             1          20d
      
    6. If the above k8srobin-* mutating webhooks are not present, bounce the robink8s-serverext Pods:

      # kubectl delete pod -n robinio -l app=robink8s-serverext
      
    7. Verify whether the following validating webhooks are present:

      # kubectl get validatingwebhookconfigurations
      NAME                             WEBHOOKS   AGE
      cert-manager-webhook             1          45h
      controllers-validating-webhook   1          31h
      ippoolcr-validating-webhook      1          31h
      namespaces-validating-webhook    1          31h
      pods-validating-webhook          1          31h
      pvcs-validating-webhook          1          31h
      
    8. If the robin-* mutating webhooks displayed in the step 5 output and the validating webhooks displayed in the step 7 output are not present on your setup, restart the robin-server-bg service:

      # rbash master
      # supervisorctl restart robin-server-bg
      
    9. Set the toleration seconds for all NFS server Pods to 60 seconds for the NotReady node condition and 0 seconds for the Unreachable node condition.

      for pod in $(kubectl get pod -n robinio -l robin.io/instance=robin-nfs --output=jsonpath={.items..metadata.name}); do
          echo "Updating $pod tolerationSeconds"
          kubectl patch pod $pod -n robinio --type='json' -p='[{"op": "replace", "path": "/spec/tolerations/0/tolerationSeconds", "value": 60}, {"op": "replace", "path": "/spec/tolerations/1/tolerationSeconds", "value": 0}]'
      done 2>/dev/null
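      To confirm the change, you can list each NFS server Pod together with its toleration seconds. This is a generic kubectl query that reuses the label selector from the command above; the jsonpath layout is only a suggestion:

      # kubectl get pod -n robinio -l robin.io/instance=robin-nfs -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.spec.tolerations[*].tolerationSeconds}{"\n"}{end}'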
      

26.1.4. New Features

26.1.4.1. Ephemeral storage limits with LimitRange

Starting with Robin CNP v5.7.0, you can use the Kubernetes LimitRange object to manage ephemeral storage limits for containers in pods. This prevents the host’s root filesystem from filling up when containers write excessive data.

This feature is disabled by default and applies at the cluster level.

To configure these limits, modify the parameters in the limitrange section of the Robin configuration. Run the following command to enable the LimitRange object:

# robin config update limitrange enabled True

Note

This feature is supported only on hosts running Red Hat Enterprise Linux 9.6.
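For reference, a standard Kubernetes LimitRange object that caps per-container ephemeral storage looks like the following sketch. This is generic Kubernetes syntax rather than a Robin-specific manifest; the namespace and the limit values are placeholders you would adjust for your environment:

  apiVersion: v1
  kind: LimitRange
  metadata:
    name: ephemeral-storage-limits
    namespace: <namespace>
  spec:
    limits:
    - type: Container
      default:
        ephemeral-storage: 2Gi        # default limit applied to containers without one
      defaultRequest:
        ephemeral-storage: 512Mi      # default request applied to containers without one
      max:
        ephemeral-storage: 4Gi        # maximum any single container may request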

For more information, see Limit Ranges for Ephemeral Storage.

26.1.5. Improvements

26.1.5.1. Support for cert-manager v1.19.1

Starting with Robin CNP v5.7.0, Robin CNP supports cert-manager v1.19.1. When you install or upgrade to Robin CNP v5.7.0 from any supported version, the latest cert-manager v1.19.1 is installed.

26.1.5.2. Support for Red Hat Enterprise Linux 9.6

Starting with Robin CNP v5.7.0, Robin CNP supports Red Hat Enterprise Linux (RHEL) 9.6 OS and kernel version 5.14.0-570.24.1.el9_6.x86_64+rt.
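To confirm that a host is running the qualified RHEL 9.6 kernel, you can check the kernel release on the node; the output should match the version listed above (a non-RT kernel omits the +rt suffix):

# uname -r
5.14.0-570.24.1.el9_6.x86_64+rt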

26.1.5.3. Enhanced NIC tagging for SR-IOV IP Pools

The Robin CNP v5.7.0 release provides enhanced NIC-level tagging for SR-IOV Physical Functions (PFs), giving you more granular control over Virtual Function (VF) placement.

You can now label each PF with custom key-value pairs called NIC tags, such as nictype=active or location=tor-a. By adding these tags to your IP pools, Robin CNP ensures that VFs for an application are allocated only from specific, tagged interfaces. This enhancement enables the deterministic grouping of PFs, leading to predictable and consistent VF placement within pods.

Previously, NICs could only be identified in IP pools by their name or pci_addr, which limited the ability to group multiple SR-IOV interfaces into flexible pools. The new NIC tagging feature solves this issue by allowing you to define custom NIC tags, assign them to different interfaces, and create IP pools that allocate VFs from one or more tagged groups. For example, you can now create a bonded interface pool that draws VFs from interfaces tagged as nictype=active and nictype=standby, giving you precise control over your network topology and resource allocation.

For more information, see NIC Allocation for SR-IOV interfaces.

26.1.5.4. VF allocation based on Bandwidth

Robin CNP v5.7.0 supports Virtual Function (VF) allocation based on the bandwidth of an SR-IOV NIC, allowing for optimized resource utilization and granular control over network resource allocation.

The following VF allocation policies are added in Robin CNP:

  • bandwidth-spread: Utilizes VFs across multiple interfaces to spread bandwidth usage evenly. This ensures better utilization of network interface bandwidth and prevents a single interface from becoming a bottleneck.

  • bandwidth-pack: Packs and utilizes VFs onto fewer interfaces to maximize the utilization of SR-IOV NICs’ bandwidth.

You can now create an IP pool with specific bandwidth weights (in Gbps) using the --bandwidth option. When a VF is being allocated, Robin CNP uses these weights to enforce capacity limits and determine placement based on your selected policy.

Previously, VFs were allocated based on VF counts. This led to over-allocation of a NIC, where the aggregate bandwidth requirements of VFs exceeded the physical capacity of the NIC, resulting in unpredictable latency and performance throttling. Bandwidth-based allocation solves this issue by validating the capacity of a NIC before allocation. Robin CNP now ensures that allocated VFs never exceed the physical capacity of a NIC.

For more information, see NIC Allocation policies.

26.1.5.5. Create application with static IPs and static MACs from snapshot

Starting with Robin CNP v5.7.0, you can create an application with static IP and static MAC addresses from its snapshot.

Note

Creating an application with static IP and static MAC addresses is supported only for Robin bundle applications.

To create an application, you must specify the static IP and static MAC addresses in the following parameters:

  • static-ips

  • static-macs

For more information, see Create an application from a snapshot.

26.1.5.6. Create application with static IPs and static MACs from backup

Starting with Robin CNP v5.7.0, you can create an application with static IP and static MAC addresses from its backup.

Note

Creating an application with static IP and static MAC addresses is supported only for Robin bundle applications.

To create an application, you must specify the static IP and static MAC addresses in the following parameters:

  • static-ips

  • static-macs

For more information, see Create an application from a backup.

26.1.5.7. Robin CNP installation qualification on 2 P-Cores with Best-effort-QoS

Tests validated CNP installation with best-effort QoS enabled on a single-node cluster with one storage disk and 40 P-cores. The test environment allocated 2 P-cores for CNP, the OS, and Kubernetes, and ran approximately 70 application Pods to simulate a heavy workload.

The testing focused on platform behavior and CPU allocation. No platform-side issues regarding CPU allocation or stability occurred.

26.1.5.8. Support for the Docker overlay2 driver

Robin CNP v5.7.0 supports the Docker overlay2 storage driver (docker_storage_driver: overlay2). The storage driver is the mechanism that manages how Docker images and containers are stored and accessed on a host's filesystem. Using overlay2 prevents installation failures caused by slow Docker image load times. You can configure the Docker storage driver per node using the config.json file while installing CNP.

Note

By default, the Robin CNP v5.7.0 installation process uses the robin-graph driver. However, you can use the overlay2 driver that offers faster load times and can prevent timeouts during the installation process.

This configuration applies at the node level. You must apply the storage driver option ("docker_storage_driver": "overlay2") to every node in your cluster using the configuration JSON file (config.json).
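As a rough illustration only, the per-node option might appear in config.json as in the following sketch. The surrounding structure (the hosts list and the hostname key) is hypothetical and shown only for context; refer to the Robin CNP installation documentation for the actual config.json schema. Only the "docker_storage_driver": "overlay2" key-value pair comes from this release note:

  {
    "hosts": [
      {
        "hostname": "node-1.example.com",
        "docker_storage_driver": "overlay2"
      }
    ]
  }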

26.1.5.9. Accelerator abstraction

Robin CNP v5.7.0 supports accelerator abstraction, a feature that simplifies the allocation of hardware accelerator resources (such as FPGA) across mixed clusters. This feature allows you to configure PCI resources using abstract names, decoupling Pod specifications from specific hardware identifiers like device IDs or vendor IDs.

With this update, you can apply a single, common annotation to Pod specifications and Helm charts. This enables seamless deployments on clusters that contain different types of accelerator cards, such as those found on existing servers or new GNR-D servers. For more information, see Configure abstract accelerator resources.

26.1.5.10. Schedule Pods of same application on a node using Pod-level affinity and anti-affinity

Robin CNP v5.7.0 enables you to schedule application Pods of the same type on a single node to maximize resource utilization. This approach helps when you have a limited number of nodes and prefer to separate specific workloads, such as ensuring different application types run on different nodes.

To configure this behavior, define soft affinity and anti-affinity rules in the podAffinity and podAntiAffinity fields. Use the preferredDuringSchedulingIgnoredDuringExecution rule to establish these preferences. This rule attempts to schedule Pods according to your criteria but permits placement on other nodes if resources are limited. For more information, see Schedule Pods of same application on a node using Pod-level affinity and anti-affinity.
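A minimal sketch of such soft rules in a Pod specification follows. This is standard Kubernetes syntax; the label keys, label values, and weights are placeholders you would replace with your application's labels:

  affinity:
    # Prefer co-locating Pods of the same application on one node.
    podAffinity:
      preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 100
        podAffinityTerm:
          labelSelector:
            matchLabels:
              app: <your-app-label>
          topologyKey: kubernetes.io/hostname
    # Prefer keeping Pods of a different application type on other nodes.
    podAntiAffinity:
      preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 100
        podAffinityTerm:
          labelSelector:
            matchLabels:
              app: <other-app-label>
          topologyKey: kubernetes.io/hostname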

26.1.5.11. CRI-O support

Robin CNP v5.7.0 supports CRI-O as a high-level container runtime with RHEL 9.6 OS only. CRI-O is a lightweight container runtime for Kubernetes. It is an optimized implementation of the Kubernetes Container Runtime Interface (CRI) to run pods by using any OCI (Open Container Initiative) compliant runtime. It is open-source and an alternative to Docker for Kubernetes clusters.

For more information, see CRI-O.
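To check which container runtime each node reports, you can query the node status with kubectl; nodes running CRI-O report a containerRuntimeVersion value that starts with cri-o://:

# kubectl get nodes -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.nodeInfo.containerRuntimeVersion}{"\n"}{end}'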

26.1.5.12. crun support

Robin CNP v5.7.0 supports crun as a low-level container runtime with RHEL 9.6 OS only. crun is a fast, lightweight, and low-memory container runtime. It is the default container runtime in RHEL 9.6 OS, which is used to execute containers within a Pod. It is fully compliant with the Open Container Initiative (OCI) specifications.

26.1.5.13. cgroup V2 support

Robin CNP v5.7.0 supports cgroup v2 of the Linux kernel with RHEL 9.6 OS only. cgroup v2 is the latest, more efficient version of the Linux kernel’s control groups feature, which is the underlying mechanism for managing and enforcing resource limits (CPU, memory, I/O) on containers and Pods. It is the default cgroup version in the RHEL 9.6 OS.

For more information, see cgroup v2.
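To verify that a host is running cgroup v2, you can check the filesystem type mounted at /sys/fs/cgroup; a cgroup v2 host reports cgroup2fs:

# stat -fc %T /sys/fs/cgroup/
cgroup2fs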

26.1.6. Fixed Issues

Reference ID

Description

RSD-8083

Dev slice leader change tasks delayed epoch update tasks, resulting in IO timeouts on the application side. This issue is fixed.

RSD-8104

A delay in creating a large-size volume is observed when the volume size is more than the individual disks on the cluster. This issue is fixed.

RSD-7814

The issue of the application creation operation failing with the following error is now fixed.

Failed to mount volume <volume-name>: Node <node-name> has mount_blocked STORMGR_NODE_BLOCK_MOUNT. No new mounts are allowed.

RSD-7499

The issue of storage size mismatch between Robin storage and Kubernetes storage, which caused calculation and quota errors when users requested sizes in GB, is fixed.

RSD-10741

The issue of the robin job purge command removing jobs from the database but failing to delete the corresponding log files from the /var/log/robin/server/archived/ directory is fixed.

RSD-6763

The issue of Helm binary version mismatch between the host and the downloaded Helm client or the robin-master Pod is fixed.

RSD-7183

A device may run out of space, and you might observe disk usage alerts or out-of-space errors when an application is writing data, resulting in failed writes. You might also observe that the physical size of a volume is greater than the logical size when you run the robin volume info command. You might notice this issue on volumes without IO.

This issue could be because the garbage collector (GC) failed to reclaim space. This issue is fixed.

RSD-6162

The issue of the snapshot-controller Pod stuck in the ContainerCreating status and Terminating loop status after node reboot is fixed.

RSD-5771

IPv6 IP pool creation failed when the gateway was the same as the broadcast address for the IP pool subnet. This issue is fixed.

RSD-8150

The race condition issue where the device evacuation operation would fail if a replica was assigned to a drive marked as not ready is fixed.

RSD-4634

When Robin CNP is running on SuperMicro nodes, the IPMI tool incorrectly displayed the BMC IPv6 address as ffff:ffff:ffff:ffff:ffff:ffff:ffff:ffff instead of the actual BMC IPv6 address. This issue is fixed.

PP-37360

The issue of the Robin UI displaying a maximum of 1000 users when adding users from an LDAP server is fixed.

RSD-8690

The issue of a newly added node not appearing in the robin host list command, despite a successful Robin CNP installation, while the kubectl get nodes command displays it, is fixed.

PP-39842

The robin host list command did not account for long-running sidecar containers in the initContainers section for bundle applications. This caused Robin CNP to incorrectly place Pods, which could lead to oversubscription or performance issues. This issue is fixed.

PP-39806

When a node hosting KVM applications is shut down due to technical reasons, you might get the following error message during the migration of these Pods to another node:

Target /usr/local/robin/instances/kvm/clone-kvm-ovs2-server.svc.cluster.local is busy, please retry later.

This issue is fixed.

RSD-10622

Istio installation failed with the following error when calling the k8srobin.kubernetes.service.robin webhook: x509: certificate signed by unknown authority. This issue is fixed.

RSD-10265

The kubelet service failed to reload client certificates after automatic rotation. This caused nodes to report a degraded status with x509: certificate has expired errors, even though the underlying certificate files were updated correctly on the disk. This issue is fixed.

RSD-10366

Expanding a ReadWriteMany (RWX) Persistent Volume Claim (PVC) failed with the following error: Vblock with volume_id <id> not mounted. This issue occurred because the CSI driver failed to correctly identify the block device on the NFS client node during the expansion process. This issue is fixed.

RSD-10471

The affinity rules defined in a bundle manifest.yaml file were ignored during application creation. The robin app create from-bundle command failed to apply these rules, causing Pods to be scheduled on arbitrary nodes instead of the hosts specified by the affinity tags. This issue is fixed.

RSD-10754

Nodes remained in a NotReady state after a cluster power cycle. This issue occurred because the kubelet service failed to start automatically upon reboot. This issue is fixed.

RSD-10243

Application auto-redeployment failed because the scheduler did not correctly validate the SR-IOV configuration on the target node. This resulted in Pods being scheduled on nodes unable to satisfy the resource requirements, causing the application to remain in a Pending state with FailedScheduling and Insufficient intel.com/sriov errors. This issue is fixed.

RSD-10105

Bundle application deployment failed when multiple IP addresses resolved to the same hostname during reverse DNS lookup. This issue occurred because the system prohibited different IPs from mapping to the same hostname. This issue is fixed.

You can now prevent this conflict by setting the vnode_hostname_reverse_dns_lookup cluster configuration to False, which disables the lookup and ensures unique hostnames are generated for each vnode.

RSD-9892

The issue of VMs provisioned using an ISO image on Robin CNP KVM environments entering a reboot loop is fixed.

RSD-9843

The cluster license status became FORCEFULLY EXPIRED immediately after installation. This occurred because the system incorrectly compared the current UTC time against an installation timestamp recorded in the local timezone, leading to a false detection of a system clock rollback. This issue is fixed.

RSD-9767

Deployment of Helm-based applications took longer than expected due to Pod scheduling delays. This issue is fixed. As part of the fix, the following new cluster-wide configurable attribute is provided: k8s_plan_lock_attempts. You can use the new attribute to configure the retry time for Kubernetes Pod scheduling and volume creation requests when the CNP planner is busy processing other requests. By default, CNP makes a retry attempt every two seconds, up to five times. You can configure the k8s_plan_lock_attempts attribute value between 1 and 30 seconds.

RSD-9791

The robin namespace share command failed when specifying a user with TenantViewOnly capabilities. Additionally, running robin namespace share with the --all option incorrectly granted edit permissions to these users instead of the intended view permissions. This issue is fixed.

PP-39285

In rare circumstances, when Patroni instances rebooted in a particular order, a lagging Patroni replica erroneously claimed the Leader role. This issue is fixed.

RSD-9323

When you try to restore an application from a backup that previously had a static IP address, the restore process fails to honor the --ip-pool value provided during deployment. Instead, the restore process attempts to allocate a non-static IP from a historic IP pool, resulting in the following type of error:

Non static IP allocations cannot be done from non-range(network) IP pools -> ‘nc-bss-ov-internal-mgmt-int-v6’.

This issue is fixed.

RSD-9042

The issue where the GoRobin installer did not verify that SELinux was disabled on target hosts is fixed.

The installation pre-check now validates the SELinux status and fails if it detects that SELinux is enabled.

RSD-8630

Application creation from a bundle failed when a namespace resource quota was active. This issue occurred because the hook jobs generated during deployment did not define CPU and memory limits, violating the namespace's quota requirements. This issue is fixed.

RSD-9143, RSD-8589

Relocating a master Pod caused nodes to become UNREACHABLE, triggering unintended host failovers and workload evacuations. This issue occurred because the agent incorrectly detected a communication failure during the master pod relocation. This issue is fixed.

The system now verifies the health of the Patroni service before initiating an agent restart, preventing false positives and unnecessary service disruptions.

PP-39619

After creating an app from a backup, the app is stuck in the CREATE_FAILED state. When you try to delete this app, you might get the following error message:

App <app-name> couldn’t be deleted. Please detach app from repos before deleting.

This issue is fixed.

PP-37652

When you deploy a multi-container application using Helm with static IPs assigned from an IP pool, only a subset of the Pods appear on the Robin CNP UI. This issue is fixed.

PP-34457

When you have a Robin CNP cluster with the Metrics feature enabled, the Grafana application is not displaying metrics under certain conditions. This issue is fixed.

RSD-10972

The issue of different VF driver names for the robin pci-resource add command between Robin CNP v5.4.3 and Robin CNP v5.7.0 is fixed. Now, Robin CNP v5.7.0 supports both vfio-pci and vfiopci names for the VF driver.

RSD-10971

The issue of the robin host pruneimage command failing to remove unused images on RHEL 9.6 OS using Podman or CRI-O is fixed.

RSD-10973

The issue of Pods failing to create in Robin CNP v5.7.0 with the following error when specifying tolerations in Pod YAML is fixed:

Json deserialize error: missing field key at line 1 column 237

RSD-10990

The issue of image upgrades for Robin bundle applications failing on RHEL 9 hosts with the following error is fixed:

Docker daemon not running

RSD-4584

If you have added a range of blacklisted IPs in an unexpanded form, Robin CNP does not allow you to remove a range of blacklisted IPs from the IP Pool. This issue is fixed.

RSD-3885

The robin host remove-vlans command returns an error when attempting to remove VLANs by specifying ALL with the --vlans option. This issue is fixed.

RSD-4065

When creating a superadmin user with AdminViewOnly capabilities or a tenantadmin user with TenantViewOnly capabilities, clusterrolebindings / rolebindings giving the user full access to K8s resources were being created. This issue is fixed. Admin users with ViewOnly capabilities now get view clusterrolebindings / rolebindings.

A utility is provided with Robin CNP v5.4.3 HF5 to fix this issue for existing users (newly created users will have view only clusterrolebindings / rolebindings). You need to run the following command in the robin-master Pod after upgrading to Robin CNP v5.4.3 HF5:

robin-server validate-role-bindings

RSD-3447

The issue of sriov-device-plugin Daemonset scaling down due to specific taints on nodes is fixed.

RSD-10063

The issue of the robin-script.sh cleanup command executing without a confirmation prompt is fixed. This command is enhanced to include an additional confirmation prompt.

RSD-9738, RSD-9753

The issue of the robin-event-server failing to start due to a database connection timeout is fixed. This issue was resolved by increasing the default database connection timeout to 60 seconds.

RSD-9765

The issue of the robin host pruneimage command failing when purging unused images is fixed.

RSD-10720

When upgrading from Robin CNP v5.5.0 to Robin CNP v5.5.1, Pods and containers restarted unexpectedly because the robin-kubelet-watcher incorrectly reported the kube-apiserver as Down with the following error:

FileNotFoundError(No such file or directory)

This issue is fixed.

26.1.7. Known Issues

Reference ID

Description

PP-35015

Symptom

After rebooting a worker node that is hosting Pods with Robin RWX volumes, one or more application Pods using these volumes might get stuck in the ContainerCreating state indefinitely.

Workaround

If you notice the above issue, contact the Robin CS team.

PP-39901

Symptom

A Pod IP is not pingable from any other node in the cluster, apart from the node where it is running.

Workaround

Bounce the Calico Pod running on the node where the issue is seen.

PP-39900

Symptom

After upgrading your Robin cluster from a supported version to Robin CNP v5.5.1, some Helm application Pods might get stuck in the ContainerCreating state.

When you run the kubectl describe pod <pod-name> command, the output shows a persistent FailedAttachVolume warning with the following error message:

Error: Failed to mount volume pvc-6f29f4a5-4009-4a99-b37e-a37f34ca5165: Volume 1:22 active snapshot is already mounted elsewhere

You also observe repeated VolumeMount jobs in the robin job list for the affected PVC on the target host that fail with the same already mounted elsewhere error.

Workaround

Important: Apply this workaround only after confirming both of the following conditions:

  • The affected cluster had mismatched hostnames (FQDN vs. short names) in Kubernetes and Robin before the upgrade

  • You observe the specific error “Volume active snapshot is already mounted elsewhere”

  1. Identify the affected volume and the host where it’s currently reported as mounted (Node X):

    # robin volume info <pvc-uuid>
    

    Look for the Mounts section. For example, pvc-6f29f4a5-4009-4a99-b37e-a37f34ca5165 might be reported as mounted on hypervvm-69-34.

  2. Identify the host where Kubernetes is attempting to mount the volume (Node Y):

    # kubectl describe pod <stuck-pod-name> -n <namespace>
    

    Look for the Node: field.

  3. Unmount the volume from Node X using Robin CLI.

    # robin volume unmount <pvc-uuid> <hostname-of-NodeX>
    

    After the unmount operation completes, Kubernetes attachdetach-controller retries mounting the volume on Node Y. Monitor the pod status and kubectl describe pod events.

Note

For further assistance, contact the Robin Customer Support team.

PP-39645

Symptom

Robin CNP v5.7.0 may rarely fail to honor soft Pod anti-affinity, resulting in uneven Pod distribution on labeled nodes.

When you deploy an application with the recommended preferredDuringSchedulingIgnoredDuringExecution soft Pod anti-affinity rule, Pods may not be uniformly distributed across the available labeled nodes as expected. Kubernetes passes candidate nodes to Robin CNP for Pod scheduling, and in some situations a scheduling request from Kubernetes may not include the node required to honor the soft affinity rule.

Workaround

Bounce the Pod that has not honored soft affinity.

PP-34226

Symptom

When a PersistentVolumeClaim (PVC) is created, the CSI provisioner initiates a VolumeCreate job. If this job fails, the CSI provisioner calls a new VolumeCreate job again for the same PVC. However, if the PVC is deleted during this process, the CSI provisioner will continue to call the VolumeCreate job because it does not verify the existence of the PVC before calling the VolumeCreate job.

Workaround

Bounce the CSI provisioner Pod.

# kubectl delete pod -n robinio <csi-provisioner-robin>

PP-34414

Symptom

In rare scenarios, the IOMGR service might fail to open devices in the exclusive mode when it starts as other processes are using these disks. You might observe the following issue:

  • Some app Pods get stuck in the ContainerCreating state after restarting.

Steps to identify the issue:

  1. Check the following type of faulted error in the EVENT_DISK_FAULTED event type in the robin event list command:

    disk /dev/disk/by-id/scsi-SATA_Micron_M500_MTFD_1401096049D5 on node default:poch06 is faulted

    # robin event list --type EVENT_DISK_FAULTED
    
  2. If you see the disk is faulted error, check the IOMGR logs for dev_open() and Failed to exclusively open error messages on the node where disks are present.

    # cat iomgr.log.0 | grep scsi-SATA_Micron_M500_MTFD_1401096049D5 | grep "dev_open"
    
  3. If you see the Device or resource busy error message in the log file, use the fuser command to confirm whether the device is in use:

    # fuser /dev/disk/by-id/scsi-SATA_Micron_M500_MTFD_1401096049D5
    

Workaround

If the device is not in use, restart the IOMGR service on the respective node:

# supervisorctl restart iomgr

PP-39632

Symptom

After upgrading to Robin CNP v5.5.1, the NFS client might hang with a no pending IO message.

To confirm there is no pending IO, check the /var/log/robin/nodeplugin/robin-csi.log file for the following type of messages:

CsiServer_9 - robin.utils - INFO - Executing command /usr/bin/nc -z -w 6 172.19.149.161 2049 with timeout 60 seconds
CsiServer_9 - robin.utils - INFO - Command /usr/bin/nc -z -w 6 172.19.149.161 2049 completed with return code 0.
CsiServer_9 - robin.utils - INFO - Standard out:

Also, you can find the following message in the dmesg:

nfs: server 172.19.131.218 not responding, timed out
nfs: server 172.19.131.218 not responding, timed out
nfs: server 172.19.131.218 not responding, timed out

Workaround

  1. Check the node provisioner logs to identify the PVC for which the path check is hung.

  2. For the deployment/statefulset that is using the problematic PVC, scale down the replica count to 0.

  3. Ensure all Pods associated with the application have terminated.

  4. Scale up the replica count back to the original value.

PP-34492

Symptom

When you run the robin host list command, you might notice a host in the NotReady and PROBE_PENDING states. Follow the workaround steps below to diagnose and recover the host.

Workaround

  1. Run the following command to check which host is in the NotReady and PROBE_PENDING states:

    # robin host list
    
  2. Run the following command to check the current (Curr) and desired (Desired) states of the host in the Agent Process (AP) report:

    # robin ap report | grep <hostname>
    
  3. Run the following command to probe the host and recover it:

    # robin host probe <hostname> --wait
    

    This command forces a probe of the host and updates its state in the cluster.

  4. Run the following command to verify the host’s state:

    # robin host list
    

    The host should now transition to the Ready state.

PP-35478

Symptom

In rare scenarios, the kube-scheduler may not function as expected when many Pods are deployed in a cluster due to issues with the kube-scheduler lease.

Workaround

Complete the following workaround steps to resolve issues with the kube-scheduler lease:

  1. Run the following command to identify the node where the kube-scheduler Pod is running with the lease:

    # kubectl get lease -n kube-system
    
  2. Log in to the node identified in the previous step.

  3. Check if the kube-scheduler Pod is running using the following command:

    # docker ps | grep kube-scheduler
    
  4. As the kube-scheduler is a static Pod, move its configuration file to temporarily stop the Pod:

    # mv /etc/kubernetes/manifests/kube-scheduler.yaml /root
    
  5. Run the following command to confirm that the kube-scheduler Pod is deleted. This may take a few minutes.

    # docker ps | grep kube-scheduler
    
  6. Verify that the kube-scheduler lease is transferred to a different Pod:

    # kubectl get lease -n kube-system
    
  7. Copy the static Pod configuration file back to its original location to redeploy the kube-scheduler Pod:

    # mv /root/kube-scheduler.yaml /etc/kubernetes/manifests/
    
  8. Confirm that the kube-scheduler container is running:

    # docker ps | grep kube-scheduler
    

PP-36865

Symptom

The robin config update manager robin_k8s_extension True command may fail with the error "Helm support cannot be enabled. Robin License is NOT ACTIVATED" even if the license is valid. This failure can leave the system in an inconsistent state, causing subsequent operations like IP pool creation to fail or not appear in the list.

Workaround

To resolve this, manually restart the robin-server and robin-server-bg services using the following commands:

# rbash master
# supervisorctl restart robin-server
# supervisorctl restart robin-server-bg

PP-37330

Symptom

During or after upgrading to Robin CNP v5.7.0, the NFSAgentAddExport job might fail with an error message similar to the following:

/bin/mount /dev/sdn /var/lib/robin/nfs/robin-nfs-shared-35/ganesha/pvc-822e76f0-9bb8-4629-8aae-8318fb2d3b41 -o discard failed with return code 32: mount: /var/lib/robin/nfs/robin-nfs-shared-35/ganesha/pvc-822e76f0-9bb8-4629-8aae-8318fb2d3b41: wrong fs type, bad option, bad superblock on /dev/sdn, missing codepage or helper program, or other error.

Workaround

If you notice this issue, contact the Robin Customer Support team for assistance.

PP-37416

Symptom

In rare scenarios, when upgrading from Robin CNP v5.4.3 HF5 to Robin CNP v5.7.0, the upgrade might fail with the following error during the Kubernetes upgrade process on other master nodes:

Failed to execute kubeadm upgrade command for K8S upgrade. Please make sure you have the correct version of kubeadm rpm binary installed

Steps to identify the issue:

  1. Check the /var/log/robin-install.log file to know why the upgrade failed.

    Example

    etcd container: {etcd_container_id} and exited status: {is_exited}

    Killing progress PID 4168272

    Failed to execute kubeadm upgrade command for K8S upgrade. Please make sure you have the correct version of kubeadm rpm binary installed

    Install logs can be found at /var/log/robin-install.log

    Caught EXIT signal. exit_code: 1

    Note

    You can get the above error logs for any static manifests of api-server, etcd, scheduler, and controller-manager.

  2. If you notice the above error, run the following command to inspect the Docker containers for the failed component. The containers will likely be in the Exited state.

    # docker ps -a | grep schedule
    

Workaround

If you notice the above error, restart the kubelet:

# systemctl restart kubelet

PP-39467

Symptom

When deploying applications with RWX PVCs, application Pods fail to mount volumes and get stuck in the ContainerCreating state because RPC requests get stuck in IO operations on the volumes, leading to degraded volumes and faulted storage drives.

Workaround

Reboot the host that is in the NotReady state.

PP-39429

Symptom

When you create an application from a snapshot or backup, the configuration of static IP addresses and static MAC addresses is supported only for Robin Bundle applications; it is not supported for Helm-based applications.

PP-38044

Symptom

When attempting to detach a repository from a hydrated Helm application, the operation might fail with the following error:

Can’t detach repo as the application is in IMPORTED state, hydrate it in order to detach the repo from it.

This issue occurs even if the application has already been hydrated. The system incorrectly marks the application in the IMPORTED state, preventing the repository from being detached.

Workaround

To detach the repository, manually rehydrate the application and then retry the detach operation:

  1. Run the following command to rehydrate the application.

    # robin app hydrate --wait
    
  2. Once the hydration is complete, detach the repository.

    # robin app detach-repo --wait -y
    

PP-38251

Symptom

When evacuating a disk from an offline node in a large cluster, the robin drive evacuate command fails with the following error message: Json deserialize error: invalid value: integer -10, expected u64 at line 1 column 2440.

Workaround

If you notice the above issue, contact the Robin CS team.

PP-38471

Symptom

When StatefulSet Pods restart, the Pods might get stuck in the ContainerCreating state with the CSINode <node_name> does not contain driver robin error. This occurs due to stale NFS mount points and the csi-nodeplugin-robin Pod failing with the CrashLoopBackOff state.

Workaround

If you notice this issue, restart the csi-nodeplugin Pod.

# kubectl delete pod <csi-nodeplugin> -n robinio

PP-38087

Symptom

In certain cases, the snapshot size allocated to a volume could be less than what is requested. This occurs when the volume is allocated from multiple disks.

PP-38924

Symptom

After you delete multiple Helm applications, one of the Pods might get stuck in the Error state, and one or more ReadWriteMany (RWX) volumes might get stuck in the Terminating state.

Workaround

On the node where the Pod is stuck in the Error state, restart Docker and kubelet.

PP-34451

Symptom

In rare scenarios, the RWX Pod might be stuck in the ContainerCannotRun state and display the following error in the Pod’s event:

mount.nfs: mount system call failed

Perform the following steps to confirm the issue:

  1. Run the robin volume info command and check for the following details:

    1. Check the status of the volume. It should be in the ONLINE status.

    2. Check whether the respective volume mount path exists.

    3. Check the physical and logical sizes of the volume. If the physical size of the volume is greater than the logical size, then the volume is full.

  2. Run the following command to check whether any of the disks for the volume are running out of space:

# robin disk info
  3. Run the lsblk and blkid commands to check whether the device mount path works fine on the nodes where the volume is mounted.

  4. Run the ls command to check if accessing the respective filesystem mount path gives any input and output errors.

If you notice any input and output errors in step 4, apply the following workaround:

Workaround

  1. Find all the Pods that are using the respective PVC:

# kubectl get pods --all-namespaces -o=jsonpath='{range .items[*]}{.metadata.namespace}/{.metadata.name}{"\t"}{.spec.volumes[*].persistentVolumeClaim.claimName}{"\n"}{end}' | grep <pvc_name>
  2. Bounce all the Pods identified in step 1:

# kubectl delete pod <pod> -n <namespace>

PP-21916

Symptom

A pod IP is not pingable from any other node in the cluster, apart from the node where it is running.

Workaround

Bounce the Calico pod running on the node where the issue is seen.

PP-40819

Symptom

From the Robin CNP UI, when you try to deploy an application by cloning from a snapshot, the operation might fail with an error message similar to the following, indicating an invalid negative CPU value: Invalid value: “-200m”: must be greater than or equal to 0.

You might observe this issue specifically when the application has sidecar containers configured with CPU requests/limits. This is a CNP UI issue. You can use the CNP CLI to perform the same operation successfully.

Workaround

Use the following Robin CLI command to clone the snapshot and create an app:

# robin app create from-snapshot <new_app_name> <snapshot_id> --rpool default --wait

PP-41022

Symptom

The robin host list command might incorrectly display negative values for CPU cores (specifically Free or Allocated CPU) on certain nodes. This occurs even when no user applications are consuming significant CPU, suggesting a miscalculation or misreporting of available resources. The issue impacts the ability to accurately assess node capacity and schedule new workloads.

Workaround

If you notice this issue, restart kubelet on the affected node:

# systemctl restart kubelet

PP-40993

Symptom

During large cluster upgrades, the upgrade might fail during Robin pre-upgrade actions if Robin Auto Pilot creates active jobs. This occurs when multiple Robin Auto Pilot watchers are configured for a single Pod, resulting in lingering jobs (for example, VnodeDeploy) that block the upgrade process.

Workaround

Restart the robin-master-bg service on the master node to clear active Auto Pilot jobs, then retry the upgrade.

26.1.8. Technical Support

Contact the Robin Technical Support team for any assistance.