25. Release Notes

25.1. Robin Cloud Native Platform v5.5.0

The Robin Cloud Native Platform (CNP) v5.5.0 release notes cover pre- and post-upgrade steps, new features, improvements, fixed issues, and known issues.

Release Date: April 17, 2025

25.1.1. Infrastructure Versions

The following software applications are included in this CNP release:

  • Kubernetes: 1.31.6

  • Docker: 25.0.2

  • Prometheus: 2.39.1

  • Prometheus Adapter: 0.10.0

  • Node Exporter: 1.4.0

  • Calico: 3.28.2

  • HAProxy: 2.4.7

  • PostgreSQL: 14.12

  • Grafana: 9.2.3

  • CRI Tools: 1.31.1

25.1.2. Supported Operating System

The following is the supported operating system and kernel version for Robin CNP v5.5.0:

  • CentOS 7.9 (kernel version: 3.10.0-1160.71.1.el7.x86_64)

25.1.3. Upgrade Paths

The following are the supported upgrade paths for Robin CNP v5.5.0:

  • Robin CNP v5.5.0-1841 to Robin CNP v5.5.0-1852

  • Robin CNP v5.4.3 HF5 to Robin CNP v5.5.0-1852

  • Robin CNP v5.4.3 HF3+PP to Robin CNP v5.5.0-1852

25.1.3.1. Pre-upgrade consideration

  • For a successful upgrade, you must run the possible_job_stuck.py script before and after the upgrade. Contact the Robin Support team for the upgrade procedure using the script.

  • When upgrading from a supported Robin CNP version to Robin CNP v5.5.0, if your cluster already has cert-manager installed, you must uninstall it before the upgrade.

25.1.3.2. Post-upgrade considerations

  • After upgrading to Robin CNP v5.5.0, you must run the robin schedule update K8sResSync k8s_resource_sync 60000 command to update the robin schedule K8sResSync.

  • After upgrading to Robin CNP v5.5.0, you must run the robin-server validate-role-bindings command. To run this command, you need to log in to the robin-master Pod. This command verifies the roles assigned to each user in the cluster and corrects them if necessary.

  • After upgrading to Robin CNP v5.5.0, the k8s_auto_registration config parameter is disabled by default. The config setting is deactivated to prevent all Kubernetes apps from automatically registering and consuming resources. The following are the points you must be aware of with this change:

    • You can register the Kubernetes apps using the robin app register command manually and use Robin CNP for snapshots, clones, and backup operations of the Kubernetes app.

    • As this config parameter is disabled, when you run the robin app nfs-list command, the mappings between Kubernetes apps and NFS server Pods are not listed in the command output.

    • If you need the mapping between a Kubernetes app and its NFS server Pod while the k8s_auto_registration config parameter is disabled or the Kubernetes app is not manually registered, get the PVC name from the Pod YAML (kubectl get pod <pod_name> -n <namespace> -o yaml) and run the robin nfs export list | grep <pvc_name> command, as shown in the example after this list.

    • The robin nfs export list command output displays the PVC name and namespace.
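
      For example, the following minimal sketch (the Pod name my-app-0 and the namespace my-namespace are hypothetical placeholders) reads the PVC name from the Pod spec and matches it against the export list:

      # Hypothetical Pod and namespace; replace with your own values.
      PVC=$(kubectl get pod my-app-0 -n my-namespace -o jsonpath='{.spec.volumes[*].persistentVolumeClaim.claimName}')
      robin nfs export list | grep "$PVC"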

25.1.3.3. Pre-upgrade steps

  • Upgrading from Robin CNP v5.4.3 HF5 or Robin CNP v5.5.0-1841 to Robin CNP v5.5.0-1852

    Before upgrading from Robin CNP v5.4.3 HF5 or Robin CNP v5.5.0-1841 to Robin CNP v5.5.0-1852, perform the following steps:

    1. Update the value of the suicide_threshold config parameter to 1800:

      # robin config update agent suicide_threshold 1800
      
    2. Disable the NFS Server Monitor schedule:

      # robin schedule disable "NFS Server" Monitor
      
    3. Set the toleration seconds for all NFS server Pods to 86400 seconds. After the upgrade, you must change the toleration seconds back according to the post-upgrade steps.

      for pod in $(kubectl get pod -n robinio -l robin.io/instance=robin-nfs --output=jsonpath={.items..metadata.name}); do
          echo "Updating $pod tolerationseconds to 86400"
          kubectl patch pod $pod -n robinio --type='json' -p='[{"op": "replace", "path": "/spec/tolerations/0/tolerationSeconds", "value": 86400}, {"op": "replace", "path": "/spec/tolerations/1/tolerationSeconds", "value": 86400}]'
      done
      
  • Upgrading from Robin CNP v5.4.3 HF3+PP to Robin CNP v5.5.0

    Before upgrading from Robin CNP v5.4.3 HF3+PP to Robin CNP v5.5.0, perform the following steps:

    1. Update the value of the suicide_threshold config parameter to 1800:

      # robin config update agent suicide_threshold 1800
      
    2. Set the NFS Server schedule CronJob interval to more than six months (the following example sets it to run once a year):

      # rbash master
      # rsql
      # update schedule set kwargs='{"cron":"1 1 1 1 *"}' where callback='nfs_server_monitor';
      # \q
      # systemctl restart robin-server
      
    3. Set the toleration seconds for all NFS server Pods to 86400 seconds. After the upgrade, you must change the toleration seconds back according to the post-upgrade steps.

      for pod in $(kubectl get pod -n robinio -l robin.io/instance=robin-nfs --output=jsonpath={.items..metadata.name}); do
          echo "Updating $pod tolerationseconds to 86400"
          kubectl patch pod $pod -n robinio --type='json' -p='[{"op": "replace", "path": "/spec/tolerations/0/tolerationSeconds", "value": 86400}, {"op": "replace", "path": "/spec/tolerations/1/tolerationSeconds", "value": 86400}]'
      done
      

25.1.3.4. Post-upgrade steps

  • After upgrading from Robin CNP v5.4.3 HF5 or Robin CNP v5.5.0-1841 to Robin CNP v5.5.0-1852

    After upgrading from Robin CNP v5.4.3 HF5 or Robin CNP v5.5.0-1841 to Robin CNP v5.5.0-1852, perform the following steps:

    1. Update the value of the suicide_threshold config parameter to 40:

      # robin config update agent suicide_threshold 40
      
    2. Enable the NFS Server Monitor schedule:

      # robin schedule enable "NFS Server" Monitor
      
    3. Set the check_helm_apps config parameter to False:

      # robin config update cluster check_helm_apps False
      
    4. Set the chargeback_track_k8s_resusage config parameter to False:

      # robin config update server chargeback_track_k8s_resusage False
      
    5. Set the robin_k8s_extension config parameter to True:

      # robin config update manager robin_k8s_extension True
      
    6. Verify whether the following mutating webhooks are present:

      # kubectl get mutatingwebhookconfigurations -A | grep robin
      k8srobin-deployment-mutating-webhook   1          20d
      k8srobin-ds-mutating-webhook           1          20d
      k8srobin-pod-mutating-webhook          1          20d
      k8srobin-sts-mutating-webhook          1          20d
      robin-deployment-mutating-webhook      1          20d
      robin-ds-mutating-webhook              1          20d
      robin-pod-mutating-webhook             1          20d
      robin-sts-mutating-webhook             1          20d
      
    7. If the above k8srobin-* mutating webhooks are not present, bounce the robink8s-serverext Pods:

      # kubectl delete pod -n robinio -l app=robink8s-serverext
      
    8. Verify whether the following validating webhooks are present:

      # kubectl get validatingwebhookconfigurations
      NAME                             WEBHOOKS   AGE
      cert-manager-webhook             1          45h
      controllers-validating-webhook   1          31h
      ippoolcr-validating-webhook      1          31h
      namespaces-validating-webhook    1          31h
      pods-validating-webhook          1          31h
      pvcs-validating-webhook          1          31h
      
    9. If the robin-* mutating webhooks displayed in the step 6 output and the validating webhooks displayed in the step 8 output are not present on your setup, restart the robin-server-bg service:

      # rbash master
      # supervisorctl restart robin-server-bg
      
    10. Set the toleration seconds for all NFS server Pods to 60 seconds when the node is in the NotReady state and to 0 seconds when the node is unreachable.

      for pod in $(kubectl get pod -n robinio -l robin.io/instance=robin-nfs --output=jsonpath={.items..metadata.name}); do
          echo "Updating $pod tolerationseconds"
          kubectl patch pod $pod -n robinio --type='json' -p='[{"op": "replace", "path": "/spec/tolerations/0/tolerationSeconds", "value": 60}, {"op": "replace", "path": "/spec/tolerations/1/tolerationSeconds", "value": 0}]'
      done 2>/dev/null
      
  • After upgrading from Robin CNP v5.4.3 HF3+PP to Robin CNP v5.5.0

    After upgrading from Robin CNP v5.4.3 HF3+PP to Robin CNP v5.5.0, perform the following steps:

    1. Update the value of the suicide_threshold config parameter to 40:

      # robin config update agent suicide_threshold 40
      
    2. Enable the NFS Server Monitor schedule:

      # robin schedule enable "NFS Server" Monitor
      
    3. Set the check_helm_apps config parameter to False:

      # robin config update cluster check_helm_apps False
      
    4. Set the chargeback_track_k8s_resusage config parameter to False:

      # robin config update server chargeback_track_k8s_resusage False
      
    5. Set the robin_k8s_extension config parameter to True:

      # robin config update manager robin_k8s_extension True
      
    6. Delete the NFS Server schedule CronJob and restart the robin-server and robin-server-bg services:

      # rbash master
      # rsql
      # DELETE from schedule where callback='nfs_server_monitor';
      # \q
      # supervisorctl restart robin-server
      # supervisorctl restart robin-server-bg
      
    7. Verify whether the following mutating webhooks are present:

      # kubectl get mutatingwebhookconfigurations -A | grep robin
      k8srobin-deployment-mutating-webhook   1          20d
      k8srobin-ds-mutating-webhook           1          20d
      k8srobin-pod-mutating-webhook          1          20d
      k8srobin-sts-mutating-webhook          1          20d
      robin-deployment-mutating-webhook      1          20d
      robin-ds-mutating-webhook              1          20d
      robin-pod-mutating-webhook             1          20d
      robin-sts-mutating-webhook             1          20d
      
    8. If the above k8srobin-* mutating webhooks are not present, bounce the robink8s-serverext Pods:

      # kubectl delete pod -n robinio -l app=robink8s-serverext
      
    9. Verify whether the following validating webhooks are present:

      # kubectl get validatingwebhookconfigurations
      NAME                             WEBHOOKS   AGE
      cert-manager-webhook             1          45h
      controllers-validating-webhook   1          31h
      ippoolcr-validating-webhook      1          31h
      namespaces-validating-webhook    1          31h
      pods-validating-webhook          1          31h
      pvcs-validating-webhook          1          31h
      
    10. If the robin-* mutating webhooks displayed in the step 7 output and the validating webhooks displayed in the step 9 output are not present on your setup, restart the robin-server-bg service:

      # rbash master
      # supervisorctl restart robin-server-bg
      
    11. Set the toleration seconds for all NFS server Pods to 60 seconds when the node is in the NotReady state and to 0 seconds when the node is unreachable.

      for pod in $(kubectl get pod -n robinio -l robin.io/instance=robin-nfs --output=jsonpath={.items..metadata.name}); do
          echo "Updating $pod tolerationseconds"
          kubectl patch pod $pod -n robinio --type='json' -p='[{"op": "replace", "path": "/spec/tolerations/0/tolerationSeconds", "value": 60}, {"op": "replace", "path": "/spec/tolerations/1/tolerationSeconds", "value": 0}]'
      done 2>/dev/null
      

25.1.4. New Features

25.1.4.1. Robin Patroni Monitor

The Robin Patroni Monitor feature allows you to monitor the status of the Patroni instances (Pods) in a cluster. The Robin CNP architecture includes a highly available PostgreSQL cluster managed by Patroni, referred to as the Patroni Cluster.

To ensure high availability (HA), Patroni maintains three copies of its database, meaning a maximum of three Patroni instances are present in a cluster at any given time.

A Patroni cluster might become unavailable for a number of reasons. To monitor the status of the Patroni cluster, Robin CNP provides the Robin Patroni Monitor feature, which generates the events as required.

Note

After you upgrade from the previous Robin CNP versions to Robin CNP v5.5.0, the Robin Patroni Monitor feature is automatically enabled.

Also, in the Robin CNP v5.5.0 release, the robin event-type list command displays the following Patroni-related event types, which are generated when the status of the Patroni replicas changes (see the example at the end of this section):

  • EVENT_PATRONI_LEADER_CHANGE

  • EVENT_PATRONI_INSTANCE_NOT_READY

  • EVENT_PATRONI_INSTANCE_FAILED

  • EVENT_PATRONI_INSTANCE_READY

For more information, see Robin Patroni Monitor.
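
For example, to check whether any Patroni status changes have been recorded, you can query events by type; the robin event list --type option is used the same way in the Known Issues section of these notes, and the commands below are shown only as an illustration:

# List the Patroni-related event types and any leader-change events recorded so far.
robin event-type list | grep PATRONI
robin event list --type EVENT_PATRONI_LEADER_CHANGE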

25.1.4.2. Robin Certificate Management

Starting with Robin CNP v5.5.0, you can manage all certificates for your cluster without manual intervention using the Robin certificate management feature. Robin CNP uses cert-manager for this feature. cert-manager is a native Kubernetes certificate management controller that helps issue certificates from various certificate authorities, such as Let’s Encrypt, Entrust, DigiCert, HashiCorp Vault, and Venafi. It can also issue certificates from a local CA (self-signed).

cert-manager adds Certificate and Issuer resources in Kubernetes clusters, which simplifies the process of obtaining, generating, and renewing the certificates for the cluster. For more information, see cert-manager.

The Robin certificate management feature manages certificates only for Robin internal services deployed in the robinio namespace. It ensures that all certificates are valid and up to date, and it automatically renews certificates before they expire.

The Robin certificate management feature has the following certificate issuers:

  • cluster-issuer: Responsible for all certificates used internally by the various control plane services.

  • ident-issuer: Responsible for the Cluster Identity certificate used by all outward-facing services, such as the Kubernetes API server, the Robin client, and the GUI.

Points to consider for the Robin Certificate Management feature

  • When you install or upgrade to Robin CNP v5.5.0, cert-manager is deployed by default, and a new service named robin-cert-monitor is deployed to monitor the state of all certificates required by the various Pods and containers in the Robin CNP cluster, ensuring that all required certificates exist and are valid.

  • During installation or upgrade to Robin CNP v5.5.0, only the cert-manager option is supported. If you want to manage the certificates of your cluster using the local control mode, you can use the robin cert reset-cluster-certs command to enable local control mode.

  • You can have only one cert-manager instance in a cluster.

  • If your cluster is already installed with a Cluster Identity certificate signed by an external CA, you must reconfigure it using the robin cert reset-cluster-identity command after upgrading to Robin CNP v5.5.0.

  • If you want to utilize a Cluster Identity certificate signed by an external CA after installing Robin CNP v5.5.0, you can use the robin cert reset-cluster-identity command to configure it.

  • If you want to install Robin CNP v5.5.0 with both a Cluster Identity certificate signed by an external CA and cert-manager, you must pass the following options in the config.json file for one of the master nodes. For more information, see Installation with Custom Cluster Identity certificate.

    • ident-ca-path

    • ident-cert-path

    • ident-key-path

  • You cannot install your own cert-manager on a Robin CNP cluster. If you want to utilize the functionality of cert-manager, use the cert-manager instance deployed as part of the Robin certificate management feature to create Issuers and Certificates in other namespaces, as shown in the sketch at the end of this section.

For more information, see Robin Certificate Management.
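
As an illustration of the last point above, the following is a minimal sketch of creating an Issuer and a Certificate in another namespace with the cert-manager deployed by Robin CNP. The demo namespace, the resource names, and the self-signed Issuer are hypothetical choices for this example; adjust them for your environment:

kubectl create namespace demo
cat <<'EOF' | kubectl apply -f -
apiVersion: cert-manager.io/v1
kind: Issuer
metadata:
  name: demo-selfsigned           # hypothetical Issuer name
  namespace: demo
spec:
  selfSigned: {}
---
apiVersion: cert-manager.io/v1
kind: Certificate
metadata:
  name: demo-tls
  namespace: demo
spec:
  secretName: demo-tls            # Secret where cert-manager stores the key pair
  duration: 2160h                 # 90-day certificate lifetime
  renewBefore: 360h               # renew 15 days before expiry
  dnsNames:
    - demo.demo.svc.cluster.local
  issuerRef:
    name: demo-selfsigned
    kind: Issuer
EOF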

25.1.4.3. Large cluster support

Starting with Robin CNP v5.5.0, support for large clusters is available. You can now have a Robin CNP cluster with up to 110 nodes.

25.1.5. Improvements

25.1.5.1. Support for SSL certificate-based authentication for Kafka Subscribers

Starting with Robin CNP v5.5.0, Robin CNP supports SSL certificate-based authentication for Kafka subscribers for alerts and events.

Prior to this release, Robin CNP only supported username and password-based authentication.

To set up SSL certificate-based authentication, you must specify the following certificates and key as part of the robin subscriber add command:

  • CA certificate

  • Client certificate

  • Client key

For more information, see Registering a Robin subscriber.

25.1.5.2. Support for MetalLB with BGP peering

Starting with Robin CNP v5.5.0, Robin CNP supports MetalLB layer 3 mode with Border Gateway Protocol (BGP) peering. In this mode, each node in the cluster establishes a BGP peering session with the upstream router and advertises the load balancer IP address assigned to a service over each peering. As a result, the upstream router has multiple routes for each load balancer IP address. When the router receives traffic, it selects one of the nodes that advertised the load balancer IP address and sends the traffic to that node.

You can set up MetalLB either during the Robin CNP installation or post-installation.

Note

Robin CNP supports MetalLB with BGP peering in FRR mode.

For more information, see Load Balancer Support using MetalLB.
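
For reference, BGP mode in upstream MetalLB is configured with IPAddressPool, BGPPeer, and BGPAdvertisement resources. The following is a generic sketch with hypothetical addresses, AS numbers, and names, not a Robin-specific procedure; follow the linked documentation for the supported setup steps:

cat <<'EOF' | kubectl apply -f -
apiVersion: metallb.io/v1beta1
kind: IPAddressPool
metadata:
  name: bgp-pool                  # hypothetical pool name
  namespace: metallb-system       # namespace may differ in a Robin deployment
spec:
  addresses:
    - 192.0.2.10-192.0.2.50       # load balancer IP range to advertise
---
apiVersion: metallb.io/v1beta2
kind: BGPPeer
metadata:
  name: upstream-router
  namespace: metallb-system
spec:
  myASN: 64512                    # AS number used by the cluster nodes
  peerASN: 64513                  # AS number of the upstream router
  peerAddress: 192.0.2.1          # upstream router address
---
apiVersion: metallb.io/v1beta1
kind: BGPAdvertisement
metadata:
  name: bgp-adv
  namespace: metallb-system
spec:
  ipAddressPools:
    - bgp-pool
EOF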

25.1.5.3. New Node Level Events

Robin CNP v5.5.0 provides the following new events to enhance the system’s ability to monitor and detect node readiness issues at both the Kubernetes and service/component levels:

  • EVENT_NODE_K8S_NOTREADY - This event is generated when a node is marked as down due to an issue with a Kubernetes component. It is a warning alert.

  • EVENT_NODE_K8S_READY - This event is generated when a node is up after being marked as down. It is an info alert.

  • EVENT_NODE_NOTREADY - This event is generated when a node is marked as not ready due to an unhealthy service or component. It is a warning alert.

  • EVENT_NODE_READY - This event is generated when a node is ready after being marked as not ready. It is an info alert.

25.1.5.4. PostgreSQL’s Archive mode is Disabled

Starting with Robin CNP 5.5.0, PostgreSQL’s archive mode is disabled. When archive mode is enabled, WAL files are closed and switched at regular intervals, even if minimal or no data is written during that time, resulting in unnecessary resource usage.

25.1.5.5. New Metrics

Robin CNP v5.5.0 provides new metrics in the following categories:

  • Manager services

    • robin_manager_services_robin_server
    • robin_manager_services_consul_server
    • robin_manager_services_robin_event_server
    • robin_manager_services_stormgr_server
    • robin_manager_services_pgsql
    • robin_manager_services_robin_master

  • Agent Services

    • robin_agent_services_robin_agent
    • robin_agent_services_iomgr_service
    • robin_agent_services_monitor_server
    • robin_agent_services_consul_client

  • Node Metrics

    • robin_node_state
    • robin_node_maintenance_mode

  • Disk Metrics

    • robin_disk_state
    • robin_disk_maintenance_mode

  • Volume Metrics

    • robin_vol_storstatus
    • robin_vol_status
    • robin_vol_mount_node_id
    • robin_vol_snapshot_space_used
    • robin_vol_snapshot_space_limit
    • robin_vol_total_snapshot_count
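
If the Metrics feature is enabled, these metrics can be queried from Prometheus like any other metric. A minimal sketch using the standard Prometheus HTTP API, where the endpoint address is a placeholder for your cluster's Prometheus service:

# Query one of the new node metrics; replace <prometheus_host> with your Prometheus address.
curl -s 'http://<prometheus_host>:9090/api/v1/query?query=robin_node_state'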

25.1.5.6. Superadmin with limited capabilities

Starting with Robin CNP v5.5.0, new user capabilities allow you to create a superadmin user with limited capabilities.

You can create a superadmin user with limited capabilities by disabling the following newly added user capabilities:

  • ManageUserCapabilities: When this capability is disabled, the user cannot create, edit, or delete custom user capabilities.

  • ManageAdministratorsTenant: When this capability is disabled, the user cannot manage resources and users in the Administrators tenant.

  • AddSelfToTenants: When this capability is disabled, the user cannot add themselves as a member of other tenants in the cluster.

Note

These user capabilities are enabled by default for the superadmin user. Disable them if you need to create a superadmin user with limited capabilities.

For more information, see Superadmin with limited capabilities.

25.1.6. Fixed Issues

The following issues are fixed in Robin CNP v5.5.0:

  • RSD-8104: The issue of the VolumeCreate job taking longer than expected is fixed.

  • RSD-8150: The issue of the device evacuation operation failing when a replica is allocated to a drive marked for evacuation is fixed.

  • PP-37627: The issue of volume expansion operations failing despite sufficient disk space being available in the pool is fixed.

  • PP-37694: The issue of Pods failing to start with the too many open files error because the system reached the file descriptor limit on CentOS with Dockershim is fixed. In this version, the limit on the number of open files for the Dockershim service is increased to a higher value.

  • RSD-8846: The issue of the Robin CAT profiles feature not working as expected on RHEL 8.10 is fixed.

  • RSD-8083: The IO hang issue observed on clusters with large disk sizes is fixed.

  • RSD-5711: In Robin CNP v5.4.3, the cpuset.mems parameter was incorrectly set to 0-1 for guaranteed QoS Pods, which meant resources could be allocated across NUMA boundaries. With this release, the cpuset.mems parameter is set to the desired NUMA node.

  • RSD-8083: The issue of dev slice leader change tasks delaying epoch update tasks and causing IO timeouts on the application side is fixed.

  • RSD-8854: The issue of the IOMGR service crashing on a node when it came back online after a reboot is fixed.

  • RSD-8083: The issue of insufficient resource (CPU and memory) limits for Patroni causing cluster performance issues is fixed by increasing the Patroni resource limits in Robin CNP v5.5.0. The new values are: cpu_limits=4 cores, memory_limits=8G.

  • PP-37695: The issue where Pods are restricted to accessing memory from a single NUMA node, limiting workloads that require larger memory pools, is fixed. Starting with Robin CNP v5.5.0, you can specify the "robin.runtime.skip_cpuset_mems": "true" annotation to bypass the NUMA restriction and allow Pods to access memory across all NUMA nodes.

  • PP-38088: The issue of volume evacuation not being supported is fixed, and the following error message no longer appears during the evacuation operation: Insufficient space given in dev limits to allocate vol.

  • PP-38008: The issue where, during a disk evacuation operation on a disk hosting volume leader slices, slices whose other replicas are marked as FAULTED might remain in the FAULTED state and fail to sync, leaving the volumes in the DEGRADED state, is fixed.

  • PP-38038: The issue of Robin CNP v5.5.0 not supporting scaling up and scaling down of resources (CPU and memory) for Robin Bundle apps is fixed.

25.1.7. Known Issues

The following are the known issues in Robin CNP v5.5.0, listed by reference ID:

PP-21916

Symptom

A Pod IP is not pingable from any other node in the cluster, apart from the node where it is running.

Workaround

Bounce the Calico Pod running on the node where the issue is seen.
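
A sketch of bouncing the Calico Pod, assuming the default calico-node DaemonSet label and namespace (these may differ on your cluster):

# Find the calico-node Pod on the affected node, then delete it so it is recreated.
kubectl get pod -n kube-system -l k8s-app=calico-node -o wide | grep <node_name>
kubectl delete pod <calico_node_pod> -n kube-system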

PP-30247

Symptom

After upgrading from Robin CNP v5.4.3HF5 to Robin CNP v5.5.0, the RWX apps might report the following error event type:

wrong fs type, bad option, bad superblock on /dev/sdj, missing codepage or helper program, or other error

Workaround

To resolve this issue, contact the Robin Customer Support team.

PP-30398

Symptom

After removing an offline master node from the cluster and power cycling it, the removed master node is automatically added back as a worker node.

Workaround

  1. Run the following command to remove the host:

    # robin host remove <hostname>
    
  2. Run the following command to remove the node:

    # kubectl delete node <node_name>
    
  3. Run k8s-script cleanup and host-script cleanup on the node that is being removed.

PP-34226

Symptom

When a PersistentVolumeClaim (PVC) is created, the CSI provisioner initiates a VolumeCreate job. If this job fails, the CSI provisioner calls a new VolumeCreate job again for the same PVC. However, if the PVC is deleted during this process, the CSI provisioner will continue to call the VolumeCreate job because it does not verify the existence of the PVC before calling the VolumeCreate job.

Workaround

Bounce the CSI provisioner Pod.

# kubectl delete pod <csi_provisioner_pod_name> -n robinio

PP-34414

Symptom

In rare scenarios, the IOMGR service might fail to open devices in the exclusive mode when it starts as other processes are using these disks. You might observe the following issue:

  • Some app Pods get stuck in the ContainerCreating state after restarting.

Steps to identify the issue:

  1. Check the following type of faulted error in the EVENT_DISK_FAULTED event type in the robin event list command:

    disk /dev/disk/by-id/scsi-SATA_Micron_M500_MTFD_1401096049D5 on node default:poch06 is faulted

    # robin event list --type EVENT_DISK_FAULTED
    
  2. If you see the disk is faulted error, check the IOMGR logs for dev_open() and Failed to exclusively open error messages on the node where disks are present.

    # cat iomgr.log.0 | grep scsi-SATA_Micron_M500_MTFD_1401096049D5 | grep "dev_open"
    
  3. If you see the Device or resource busy error message in the log file, use fuser command to confirm whether the device is in use:

    # fuser /dev/disk/by-id/scsi-SATA_Micron_M500_MTFD_1401096049D5
    

Workaround

If the device is not in use, restart the IOMGR service on the respective node:

# supervisorctl restart iomgr

PP-34451

Symptom

In rare scenarios, the RWX Pod might be stuck in the ContainerCannotRun state and display the following error in the Pod’s event:

mount.nfs: mount system call failed

Perform the following steps to confirm the issue:

  1. Run the robin volume info command and check for the following details:

    1. Check the status of the volume. It should be in the ONLINE status.

    2. Check whether the respective volume mount path exists.

    3. Check the physical and logical sizes of the volume. If the physical size of the volume is greater than the logical size, then the volume is full.

  2. Run the following command to check whether any of the disks for the volume are running out of space:

    # robin disk info
    
  3. Run the lsblk and blkid commands to check whether the device mount path works fine on the nodes where the volume is mounted.

  4. Run the ls command to check if accessing the respective filesystem mount path gives any input and output errors.

If you notice any input and output errors in step 4, apply the following workaround:

Workaround

  1. Find all the Pods that are using the respective PVC:

    # kubectl get pods --all-namespaces -o=jsonpath='{range .items[*]}{.metadata.namespace}/{.metadata.name}{"\t"}{.spec.volumes[*].persistentVolumeClaim.claimName}{"\n"}{end}' | grep <pvc_name>
    
  2. Bounce all the Pods identified in step 1:

    # kubectl delete pod <pod_name> -n <namespace>
    

PP-34457

Symptom

If the Metrics feature is enabled on your Robin CNP cluster and you are using Grafana for monitoring, after upgrading the cluster from any supported Robin CNP versions to Robin CNP v5.4.3 HF5, the Grafana metrics will not work.

Note

You need to take a backup of the configmaps of the Prometheus and Grafana apps in the robinio namespace before you stop the Robin Metrics. The configmaps reset when you stop and start the Metrics feature.
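
A minimal sketch of the backup (the exact ConfigMap names vary by deployment, so identify them first):

# List the ConfigMaps for the Prometheus and Grafana apps in the robinio namespace.
kubectl get configmap -n robinio | grep -iE 'prometheus|grafana'
# Save each ConfigMap to a file before stopping the Metrics feature.
kubectl get configmap <configmap_name> -n robinio -o yaml > <configmap_name>-backup.yaml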

Workaround

You need to stop and restart the Metrics feature.

  1. To stop the Metrics feature, run the following command:

    # robin metrics stop
    
  2. To start the Metrics feature, run the following command:

    # robin metrics start
    

PP-34492

Symptom

When you run the robin host list command, you might notice a host in the NotReady and PROBE_PENDING states. Follow these workaround steps to diagnose and recover the host:

Workaround

  1. Run the following command to check which host is in the NotReady and PROBE_PENDING states:

    # robin host list
    
  2. Run the following command to check the current (Curr) and desired (Desired) states of the host in the Agent Process (AP) report:

    # robin ap report | grep <hostname>
    
  3. Run the following command to probe the host and recover it:

    # robin host probe <hostname> --wait
    

    This command forces a probe of the host and updates its state in the cluster.

  4. Run the following command to verify the host’s state:

    # robin host list
    

    The host should now transition to the Ready state.


PP-35478

Symptom

In rare scenarios, the kube-scheduler may not function as expected when many Pods are deployed in a cluster due to issues with the kube-scheduler lease.

Workaround

Complete the following workaround steps to resolve issues with the kube-scheduler lease:

  1. Run the following command to identify the node where the kube-scheduler Pod is running with the lease:

    # kubectl get lease -n kube-system
    
  2. Log in to the node identified in the previous step.

  3. Check if the kube-scheduler Pod is running using the following command:

    # docker ps | grep kube-scheduler
    
  4. As the kube-scheduler is a static Pod, move its configuration file to temporarily stop the Pod:

    # mv /etc/kubernetes/manifests/kube-scheduler.yaml /root
    
  5. Run the following command to confirm that the kube-scheduler Pod is deleted. This may take a few minutes.

    # docker ps | grep kube-scheduler
    
  6. Verify that the kube-scheduler lease is transferred to a different Pod:

    # kubectl get lease -n kube-system
    
  7. Copy the static Pod configuration file back to its original location to redeploy the kube-scheduler Pod:

    # mv /root/kube-scheduler.yaml /etc/kubernetes/manifests/
    
  8. Confirm that the kube-scheduler container is running:

    # docker ps | grep kube-scheduler
    

PP-36865

Symptom

After rebooting a node, the node might not come back online even after a long time, and the host BMC console displays the following message for RWX PVCs mounted on that node:

Remounting nfs rwx pic timed out, issugin SIGKILL

Workaround

Power cycle the host system.

PP-37330

Symptom

During or after upgrading to Robin CNP v5.5.0, the NFSAgentAddExport job might fail with an error message similar to the following:

/bin/mount /dev/sdn /var/lib/robin/nfs/robin-nfs-shared-35/ganesha/pvc-822e76f0-9bb8-4629-8aae-8318fb2d3b41 -o discard failed with return code 32: mount: /var/lib/robin/nfs/robin-nfs-shared-35/ganesha/pvc-822e76f0-9bb8-4629-8aae-8318fb2d3b41: wrong fs type, bad option, bad superblock on /dev/sdn, missing codepage or helper program, or other error.

Workaround

If you notice this issue, contact the Robin Customer Support team for assistance.

PP-37416

Symptom

In rare scenarios, when upgrading from Robin CNP v5.4.3 HF5 to Robin CNP v5.5.0, the upgrade might fail with the following error during the Kubernetes upgrade process on other master nodes:

Failed to execute kubeadm upgrade command for K8S upgrade. Please make sure you have the correct version of kubeadm rpm binary installed

Steps to identify the issue:

  1. Check the /var/log/robin-install.log file to know why the kubeadm upgrade failed.

    Example

    [upgrade/staticpods] Moved new manifest to "/etc/kubernetes/manifests/kube-scheduler.yaml" and backed up old manifest to "/etc/kubernetes/tmp/kubeadm-backup-manifests-2025-01-31-01-03-52/kube-scheduler.yaml"
    [upgrade/staticpods] Waiting for the kubelet to restart the component
    [upgrade/staticpods] This might take a minute or longer depending on the component/version gap (timeout 5m0s)
    static Pod hash for component kube-scheduler on Node sm-compute04 did not change after 5m0s: timed out waiting for the condition

    Note

    You can get the above error log for any static manifests of api-server, etcd, scheduler, and controller-manager.

  2. If you notice the above error, run the following command to inspect the Docker containers for the failed component. The containers will likely be in the Exited state.

    # docker ps -a | grep schedule
    

Workaround

If you notice the above error, restart the kubelet:

# systemctl restart kubelet

PP-37965

Symptom

In Robin CNP v5.5.0, when you scale up a Robin Bundle app, the operation does not consider the CPU cores and memory already in use by a vnode. As a result, Robin CNP cannot find a suitable host even though additional resources are available.

Workaround

If you notice this issue, apply the following workaround:

  1. Scale up the resources using the following command:

    # robin app computeqos <appname> --role <rolename> --cpus <newcnt> --memory <newmem> --wait
    
  2. If the scale-up operation fails, stop the app using the following command:

    # robin app stop <appname> --wait
    
  3. Try to scale up the resources again:

    # robin app computeqos <appname> --role <rolename> --cpus <newcnt> --memory <newmem> --wait

PP-38039

Symptom

During node reboot or power reset scenarios, application volumes may force shutdown due to I/O errors. As a result, application Pods might get stuck in the ContainerCreating state with the following mount failure error:

Context Deadline Exceeded.

On the affected node where the volume is mounted or the application Pod is scheduled, the following error might be observed in the dmesg output:

Log I/O Error Detected. Shutting down filesystem
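
A quick way to check for this message on the affected node (illustrative only):

# Search the kernel ring buffer for the filesystem shutdown message.
dmesg -T | grep -i "Shutting down filesystem"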

Workaround

If you notice this issue, contact the Robin Customer Support team for assistance

PP-38044

Symptom

When attempting to detach a repository from a hydrated Helm application, the operation might fail with the following error:

Can’t detach repo as the application is in IMPORTED state, hydrate it in order to detach the repo from it.

This issue occurs even if the application has already been hydrated. The system incorrectly marks the application in the IMPORTED state, preventing the repository from being detached.

Workaround

To detach the repository, manually rehydrate the application and then retry the detach operation:

  1. Run the following command to rehydrate the application.

    # robin app hydrate --wait
    
  2. Once the hydration is complete, detach the repository.

    # robin app detach-repo --wait -y
    

PP-38061

Symptom

In rare scenarios, when upgrading from Robin CNP v5.4.3 HF5 to Robin CNP v5.5.0, the upgrade may get stuck while executing Robin upgrade actions on the primary master node because some of the hosts are not in the Ready state.

Steps to identify the issue:

  1. Check the following error in the /var/log/robin-install.log file:

    Robin Host is not in READY state. Re-trying host status check in 10 seconds

  2. If you get the above error, run the following command to verify the status of hosts:

    # robin host list
    

Workaround

If any host in the cluster is in the Notready state, apply the following steps:

  1. Log in to the Robin worker Pod running on the host that is in the Notready state:

    # rbash robin
    
  2. Check the following error in the /var/log/robin/robin-worker-bootstrap.log file:

    • MainThread - robin.utils - INFO - Standard err: Error from server (NotFound): configmaps “robin-upgrade-config-5.4.3-564” not found

    • MainThread - robin.rcm.setup.robin_upgrade - INFO - get_host_upgrade_status: Skip Upgrade status check for host hypervvm-61-49.robinsystems.com. Configmap robin-upgrade-config-5.4.3-564 not configured

  3. If you see the above error, stop the robin-bootstrap service:

    # supervisorctl stop robin-bootstrap
    
  4. Create the bootstrap_done file manually, if it does not exist:

    # touch /etc/robin/bootstrap_done
    
  5. Start the robin-bootstrap service again:

    # supervisorctl start robin-bootstrap
    

PP-38071

Symptom

Application creation might fail with the following error:

Failed to mount volume : Node has mount_blocked STORMGR_NODE_BLOCK_MOUNT. No new mounts are allowed.

This issue occurs when a node enters a mount-blocked state (STORMGR_NODE_BLOCK_MOUNT), preventing new volume mounts from being processed.

Workaround

Try to create the application after 15 minutes.

PP-38078

Symptom

After a network partition, the robin-agent and iomgr-server services may not restart automatically, and stale devices may not be cleaned up. This issue occurs because the consulwatch thread responsible for monitoring Consul and triggering restarts may fail to detect the network partition. As a result, stale devices may not be cleaned up, potentially leading to resource contention and other issues.

Workaround

Manually restart the robin-agent and iomgr-server using supervisorctl:

# supervisorctl restart robin-agent iomgr-server

PP-38087

Symptom

In certain cases, the snapshot size allocated to a volume could be less than what is requested. This occurs when the volume is allocated from multiple disks.

PP-38397

Symptom

When upgrading from supported Robin CNP versions to Robin CNP v5.5.0, the Robin upgrade process might fail due to a Docker installation failure caused by missing dependencies. This issue occurs when the cluster is missing the fuse-overlayfs and slirp4netns packages, which are required by the new Docker version. The upgrade process removes the existing Docker version but fails to install the new version, and the Docker service file gets masked, preventing Docker from starting.

Workaround

  1. Check the host installation logs for errors related to the Docker installation and identify the packages listed as missing dependencies; these are the packages that must be installed manually.

  2. Install the missing dependencies manually.

    # yum install -y slirp4netns fuse-overlayfs
    
  3. Retry the upgrade process.

PP-38471

Symptom

When StatefulSet Pods restart, the Pods might get stuck in the ContainerCreating state with the error CSINode <node_name> does not contain driver robin. This is caused by stale NFS mount points and by the csi-nodeplugin-robin Pod failing with a CrashLoopBackOff state.

Workaround

If you notice this issue, restart the csi-nodeplugin Pod.

# kubectl delete pod <csi-nodeplugin> -n robinio

25.1.8. Technical Support

Contact the Robin Technical Support team for any assistance.