25. Release Notes

25.1. Robin Cloud Native Platform v5.4.3

The Robin Cloud Native Platform (CNP) v5.4.3 release has new features, improvements, fixed issues, and known issues.

Release Date: 07 January 2023

25.1.1. Infrastructure Versions

The following software applications are included in this CNP release:

Software Application      Version
--------------------      -------
Kubernetes                1.26.0
Docker                    19.03.9 (CentOS 7) and 20.10.8 (Rocky 8)
Prometheus                2.39.1
Prometheus Adapter        0.10.0
Node Exporter             1.4.0
Calico                    3.24.3
HAProxy                   2.4.7
PostgreSQL                14.6
Grafana                   9.2.3
CRI Tools                 1.25.0

25.1.2. Upgrade Path

The following are the supported upgrade paths for Robin CNP v5.4.3:

  • Robin CNP v5.4.1 (GA) to Robin CNP v5.4.3 (GA)

  • Robin CNP v5.3.13 (GA) to Robin CNP v5.4.3 (GA)

Note

Before upgrading to Robin CNP v5.4.3, you must stop the Metrics feature and restart it after the upgrade.

25.1.3. New Features

25.1.3.1. Single Node HA-Ready

Starting from Robin CNP v5.4.3, you can install CNP using a single hostname or IP address in HA mode in an on-prem environment. Later, you can scale up the cluster by adding more master and worker nodes as needed. Thus, a Single Node HA-Ready cluster is a cluster with a single host and HA enabled.

To install a single-node HA-Ready cluster, use the same install command but provide only a single hostname or IP address.
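
As an illustration, a single-node HA-Ready install might list only one host in the installer's config.json; the key names shown here (hosts, ha) are assumptions for this sketch, not the documented schema:

    {
      "hosts": ["10.9.82.11"],
      "ha": true
    }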

25.1.3.2. Add Master and Worker Nodes using GoRobin utility

Robin CNP v5.4.3 supports adding new master or worker nodes to an existing Robin CNP HA cluster to scale up your cluster using the GoRobin utility.

The option to add a master node is available only if you initially installed your cluster as an HA cluster. However, for a cluster installed as non-HA, you can add more worker nodes.

Note

It is recommended not to add nodes to a cluster that you installed using the --single-node-cluster option, as the behavior is not defined. If you want to add more nodes to a cluster installed with this option, contact the Robin support team.

25.1.3.3. Zero Trust Feature to Block Network Traffic

Robin CNP v5.4.3 provides the Zero Trust feature. You can enable the zero-trust option as part of the config.json file when installing Robin CNP v5.4.3.

When you use this option, all network ports are closed except the Kubernetes ports, Robin control ports, and the SSH port. You can use this option in conjunction with the single-node-cluster option or independently.
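
Illustratively, the option would be a flag in the same config.json; the exact key name (zero-trust) is an assumption for this sketch:

    {
      "zero-trust": true
    }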

25.1.3.4. Support to Create KVM-based VMs using Custom UUIDs

Robin CNP v5.4.3 provides you an option to manually provide a custom Universally Unique Identifier (UUID) for a KVM-based application. You can use this feature for VMs where the application license is linked to the UUID of the VM.

This feature enables you to provide the UUID manually using the input.yaml file when creating VMs on Robin CNP.
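
For illustration only, the UUID might appear as a field in input.yaml; the key names and placement here are assumptions, not the documented schema:

    # Illustrative input.yaml fragment (key names assumed)
    vm:
      name: licensed-vm
      uuid: 9f3a2b1c-4d5e-6f70-8a9b-0c1d2e3f4a5b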

25.1.4. Improvements

25.1.4.1. Add custom Cluster Identity Certificate for all external-facing Kubernetes and Robin CNP services

Starting from Robin CNP v5.4.3, Robin allows you to use the custom Cluster Identity certificate for all external-facing Kubernetes and Robin CNP services. The Cluster Identity certificate is used to validate the requests sent to the external-facing Kubernetes and Robin CNP services from external clients outside the cluster.

By default, Robin creates its own Cluster Identity certificate and uses this certificate to validate the requests. You can use your own Cluster Identity certificate and private key. An external trusted certificate authority (CA) must sign this certificate.
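
Before supplying your own certificate, you can confirm that it chains to the external CA with a generic OpenSSL check (not a Robin-specific command):

    # openssl verify -CAfile <external_ca_bundle>.pem <cluster_identity_cert>.pem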

25.1.4.2. Support for HashiCorp Vault

Starting from Robin CNP v5.4.3, Robin CNP re-enabled the support for HashiCorp Vault integration. You can use the GoRobin utility for integrating HashiCorp Vault when installing Robin CNP v5.4.3.

25.1.4.3. Access Robin CNP cluster installed with Zero trust feature using whitelisted IP address

The zero-trust option limits the ports that are accessible from outside the cluster. By default, these ports are accessible from all nodes. The whitelisted IP addresses option limits access to only the nodes specified with this option.
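
As a quick reachability sketch from a whitelisted host (port 36443 is the Robin cluster access port mentioned under PP-29525 below; adjust for your deployment):

    # nc -vz <master_node_ip> 36443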

25.1.4.4. ISO image for VMs

Robin CNP v5.4.3 supports ISO images for creating VMs on Robin CNP.

25.1.4.5. Added new events for Pods (Tech Preview)

The following new events for Pods are added in the robin event list in Robin CNP v5.4.3:

  • EVENT_POD_STARTED

  • EVENT_POD_DEPLOY_FAILED

  • EVENT_POD_STOPPED

  • EVENT_POD_STOP_FAILED

  • EVENT_POD_RESTARTED

  • EVENT_POD_DELETED

  • EVENT_POD_FAULTED

  • EVENT_POD_PLAN_FAILED

  • EVENT_POD_RELOCATED

  • EVENT_POD_RELOCATE_FAILED

  • EVENT_POD_RESTARTING

  • EVENT_K8SPOD

Note

To raise events for Kubernetes Pods, you need to enable the k8s_event_watcher config attribute by running the robin config update cluster k8s_event_watcher True command. By default, this attribute is disabled.
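
For example, assuming robin event list is the command behind the event list referenced above, you might enable the watcher and then confirm that Pod events are raised:

    # robin config update cluster k8s_event_watcher True
    # robin event list | grep EVENT_POD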

25.1.5. Fixed Issues

Reference ID

Description

PP-28938

When deleting multiple PDVs using the Robin CNP UI, the checkbox for selecting all PDVs (next to the Name field) does not work. This issue is fixed.

PP-28966

When a Pod deployment failed, the following error message appeared in the Pod events: “Error: Vblock with volume_id <> not mounted”. This issue is fixed.

PP-29360

When you add a secondary DPDK-based IP-Pool, Robin CNP (robin-ipam) programs routes erroneously. As a result, Pods do not come up and fail in the ContainerCreating state. This issue is fixed.

PP-29398

The issue with the robin-pqos command when all the isolated cores are running VDU workloads is fixed.

PP-29427

In a scenario where Pods are scheduled with three replicas, three static IP addresses, and an anti-affinity rule, if the deployment fails the first time, Robin CNP does not clear the entries in the database. On retrying the failed deployment, one of the Pods fails to come up because its IP address was not released by the previously failed deployment. This issue is fixed.

PP-29430

The issue of not being able to use a static IP address as a string for a single replica in a Static IP annotation is fixed.

25.1.6. Known Issues

Reference ID

Description

PP-21916

Symptom

A Pod IP is not pingable from any other node in the cluster, apart from the node where it is running.

Workaround

Bounce the Calico Pod running on the node where the issue is seen.

PP-21935

Symptom

Pods are stuck in the ContainerCreating state with the following error:

kubernetes.io/csi: mounter.SetUpAt failed to check for STAGE_UNSTAGE_VOLUME capability

Workaround

Perform the following steps:

  1. Flush connection entries:

    # conntrack -F
    
  2. Bounce the nodeplugin Pod.

Note

If the nodeplugin Pod has become unusable, future filesystem mounts will fail; this is a symptom of many retried NFS mount calls that hang. Bouncing the Pod clears out the hung processes.
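
A minimal sketch of bouncing the nodeplugin Pod, assuming it runs in the robinio namespace (used elsewhere in these notes) and that its name contains "nodeplugin":

    # kubectl get pods -n robinio -o wide | grep nodeplugin
    # kubectl delete pod -n robinio <nodeplugin_pod_name>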

PP-22781

Symptom

After removing a taint on a master node, GPUs are not detected automatically.

Workaround

You need to run the robin host probe --rediscover --all --wait command for the GPUs to be detected on the primary master node.

PP-22853

Symptom

Robin CNP may not detect GPUs in the following scenarios:

  • After Robin CNP installation

  • After upgrading Robin CNP

  • After adding a new node

Workaround

Run the robin host probe <hostname> --rediscover command.

PP-24736

Symptom

A PVC may not come online after removing an app from the secondary Protection Group on the peer cluster.

Workaround

After you remove the application from the Protection Group and allow the application to start, remove the block_mount label from the PVCs of the application.
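
If block_mount is an ordinary Kubernetes label on the PVC (an assumption; it may instead be managed through Robin tooling), it can be removed with kubectl's trailing-dash syntax:

    # kubectl label pvc <pvc_name> -n <namespace> block_mount-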

PP-25246

Symptom

When you try to delete a KVM application, the deletion process might get stuck because the virsh commands on the node may not respond.

Workaround

Reboot the node.

PP-25360

Symptom

Containers in a Pod using an RWX PVC are stuck in the ContainerCreating state for a long time and display a timeout error.

Workaround

Delete Pods if they are part of a Deployment or StatefulSet.

PP-26345

Symptom

When you deploy a Pod to use an SR-IOV VF from Ethernet Virtual Function 700 Series 154c, sometimes the Pod gets stuck in the ContainerCreating state with the device busy error message.

Workaround

Bounce the Pod that shows the device busy error message.

PP-26572

Symptom

Due to inaccuracies in tracking Pod creation, tenant and user limits are not explicitly honored for Helm applications.

PP-26581

Symptom

After deleting the PCI resources, the existing Pods that are using the PCI resources are stuck in the ContainerCreating state during the instance relocation.

Workaround

Perform the following steps:

  1. Recreate the PCI resources.

  2. Bounce the respective Pod.

PP-26768

Symptom

Do not use an IP-Pool associated with DPDK drivers as the default network.

PP-26830

Symptom

After deleting PVCs, the Robin CNP cluster goes down.

Workaround

Bounce the Calico Pod.

PP-27076

Symptom

In Robin CNP, Kubelet might go down due to the stale cpu_manager_state file.

Workaround

Complete the following steps to fix this issue:

  1. Remove the stale /var/lib/kubelet/cpu_manager_state file using the following command:

    # rm -rf /var/lib/kubelet/cpu_manager_state
    
  2. Restart the Kubelet by running the following command:

    # systemctl restart kubelet
    
  3. Make sure etcd and apiserver Pods on this node are up and running.

PP-27077

Symptom

When deleting the RWX applications, RWX Pods are stuck in the Terminating state.

Workaround

Perform the following steps for deleting the RWX Pods:

  1. Run the following command to find the NFS server Pod associated with the PVC:

    # robin nfs export-list
    
  2. Delete the NFS server Pod used for the respective PVC.

PP-27193

Symptom

When upgrading from supported Robin CNP versions to Robin CNP v5.4.3, RWX Pods may get stuck in the ContainerCreating state as the volume is unmounted and Kubernetes is not aware of it.

If you notice this issue, apply the following workaround steps:

Workaround

  1. Check which PVC/volume the Pod is using.

  2. Verify that the volume is not mounted by running the robin volume info <volume_name> command. See the sketch after these steps.

  3. Bounce the respective RWX Pod or reboot the respective node.
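
The following is a minimal sketch for steps 1 and 2, using placeholder names:

    # kubectl describe pod <pod_name> -n <namespace> | grep ClaimName
    # robin volume info <volume_name>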

PP-27276

Symptom

After upgrading to Robin CNP v5.4.3, some Robin Bundle apps might be OFFLINE due to PLAN_FAILED.

Workaround

Manually restart the Robin Bundle apps one by one.

PP-27283

Symptom

In rare scenarios, when you reboot the active master node, two Patroni Pods might have the same role as Replica.

Workaround

Bounce the Calico Pod running on the node where the issue is seen.

PP-27620

Symptom

Sync with secondary peer cluster fails due to multiple snapshots restore failures.

Workaround

Restart the iomgr-server on the affected node.

  1. Log in to the iomgr Pod on the affected node (rbash iomgr).

  2. Run the systemctl restart iomgr-server command.

  3. Check the state of the connections using the rdvm conn list command.

PP-27678

Symptom

If the node where the volume for a file collection is mounted is turned off and you delete a file collection with a single replica, the file collection delete job fails, leaving the file server Pod in the Terminating state.

Workaround

Run the following command to forcefully delete the file server Pod stuck in the Terminating state:

# kubectl delete pod <pod_name> -n <robin_ns> --force

PP-27775

Symptom

When upgrading from supported Robin CNP versions to Robin CNP v5.4.3, one of the hosts gets stuck in the NotReady state.

Workaround

You need to delete the worker Pod running on the node that is in the NotReady status.

Perform the following steps to delete the worker Pod:

  1. Run the following command to know the status of worker Pods:

    # kubectl get pod -n robinio -o wide |grep worker
    
  2. Run the following command to delete the stuck worker Pod:

    # kubectl delete pod -n robinio <pod_name>
    
  3. Reboot the respective node.

PP-27826

Symptom

When you reboot all nodes of a cluster together, RWX Pods are stuck in the CrashLoopBackOff state.

Workaround

Bounce the respective Pods.

PP-28461

Symptom

When you increase the snapshot space limit on the Primary Protection Group, the same is not replicated to the secondary Protection Group.

Workaround

If you need to increase space for snapshots on the secondary protection group, apply the following workaround:

Run the following command on the secondary cluster to update the snapshots space limit:

# robin app snapshot-space-limit

PP-28494

Symptom

During a non-HA upgrade, the File-server Pod may get stuck in the ContainerCreating state as the volume is unmounted and Kubernetes is not aware of it.

If you notice this issue, apply the following workaround steps.

Workaround

  1. Check which PVC/volume the file-server Pod is using.

  2. Verify that the volume is not mounted by running the robin volume info <volume name> command.

  3. Run the following command to cordon the node where the file-server volume is mounted:

    # kubectl cordon <node_name>
    
  4. Run the following command to delete the file-server Pod:

    # kubectl delete pod -n robinio <file_server_pod_name>
    
  5. Run the following command to uncordon the node you cordoned in step 3:

    # kubectl uncordon <node_name>
    

PP-28501

Symptom

After upgrading from an existing Robin CNP version to Robin CNP v5.4.3 with RWX applications, the NFS server-related jobs are stuck.

Workaround

Perform the following steps:

  1. Run the following command to log in to Robin master Pod:

    # rbash master
    
  2. Run the following command to know the Mount State of the storage nodes:

    # stormgr node list
    
  3. Run the following command to unlock the stormgr CLI:

    # stormgr devl unlock
    
  4. Run the following command to unblock the blocked storage node:

    # stormgr node setstatus --block-mount 0 <blocked_storage_node>
    

PP-28768

Symptom

After upgrading to Robin CNP v5.4.3, you might notice that a cordoned node is uncordoned.

Workaround

Put the cordoned nodes in maintenance mode before upgrading.

Alternatively, cordon the nodes again after upgrading to Robin CNP v5.4.3.

PP-28867

Symptom

The robin chargeback report does not display the correct SSD drive price; it shows 0.0 as the price.

PP-28922

Symptom

When you try to restore a namespace snapshot, the job hangs because the PVCs are in the Pending state, and the following message is displayed:

Error: Invalid annotation 'robin.io/fstype' provided

Workaround

To fix this issue, apply the following workaround:

  1. Run the following command and get the list of the PVCs that are stuck in the Pending state for the namespace:

    # kubectl get pvc -n <namespace_name>
    
  2. Run the following command to edit each PVC and remove the robin.io/fstype annotation:

    # kubectl edit pvc -n ns1 rwx-app-8
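
Alternatively, each PVC can be patched non-interactively; this sketch uses placeholder names and the JSON-pointer escape ~1 for the / in the annotation key:

    # kubectl patch pvc <pvc_name> -n <namespace> --type=json \
      -p '[{"op": "remove", "path": "/metadata/annotations/robin.io~1fstype"}]'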
    

PP-28972

Symptom

When you try to deploy a KVM-based app and override the NIC tags in the IP-Pool using the input.yaml file, you will see the following message:

Error: list index out of range

This issue occurs because the bonded interface option is not supported for KVM deployments when the Calico interface is used.

PP-29109

Symptom

Robin CNP v5.4.3 does not support Application Ephemeral Volumes (AEVs). As a result, operations involving AEVs will fail.

PP-29150

Symptom

When creating an SRIOV or OVS IP pool with VLAN, Robin CNP mistakenly allows the creation of the SRIOV or OVS IP pool if only one of them has the VLAN configured for its interface at the host level.

For example, consider a scenario where you have created an SRIOV IP pool with VLAN, and the VLAN is added to the SRIOV interface at the host level. If you then create an OVS IP Pool with the same VLAN but without adding the VLAN to the OVS interface at the host level, the OVS IP pool creation succeeds without any error.

However, in this example, when you try to deploy a Pod using the OVS IP pool, the Pod deployment fails in the ContainerCreating state without any error message.

PP-29340

Symptom

After upgrading from the existing Robin CNP to Robin CNP v5.4.3, RWX PVC Pods are stuck in the ContainerCreating state.

Workaround

Perform the following steps to generate a new FS UUID:

  1. Run the following command to list the Pods that are stuck in the ContainerCreating state:

    # kubectl get pods -A | grep ContainerCreating
    
  2. Run the following commands to get the RWX volume used by these Pods:

    # kubectl describe pods <pod_name>
    
    # kubectl get pvc -A | grep <claim_name>
    
  3. Run the following command to see the respective job output for the RWX volume with NFSAgentAddExport:

    # robin job list | grep <rwx_volume_name> | grep NFSAgentAddExport
    

    Example

    # robin job list | grep pvc-b0f33e4d-6d1c-4d17-9ddf-0a67b9f1af51
      | grep NFSAgentAddExport
    ->5840 | NFSAgentAddExport | Adding export for vol pvc-b0f33e4d-
    6d1c-4d17-9ddf-0a67b9f1af51| COMPLETED | FAILED | 09 Aug 06:39:45
      | 06:40:01 | 0:00:16  | [] | 1  | Command '/bin/mount /dev/sdo
      /var/lib/robin/nfs/robin-nfs-shared-60/ganesha/pvc-b0f33e4d-6d1c-
      4d17-9ddf-0a67b9f1af51' failed with return code 32: mount: wrong
      fs type, bad option, bad superblock on /dev/sdo, missing codepage
      or helper program, or other error
    
      In some cases useful info is found in syslog - try dmesg | tail or so.
    
  4. If you see the above error, run the following command to know the device and host on which this volume is mounted:

    # robin volume list | grep <rwx_volume_name>
    
  5. Run the following command on the host to know the error:

    # cat /var/log/messages | grep <device_name>
    

    Example

    # cat /var/log/messages | grep sdo
     Aug  9 06:40:00 asa-06 kernel: XFS (sdo): Filesystem has duplicate
     UUID 47762fc2-1e7c-4863-a551-0fe55b29d0c7 - can't mount
    
  6. Run the following command to generate a new FS UUID for the respective device to be mounted:

    # xfs_admin -U generate <path to device>
    

PP-29441

Symptom

After adding a master node, a Patroni Pod might be in the Pending state. If you observe this issue, apply the following workaround:

Workaround

  1. Run the following command to get the Patroni PV path:

    # kubectl get pv robin-patroni-pv-<already_present_index>
    -o jsonpath={.spec.local.path}
    

    Example

    # kubectl get pv robin-patroni-pv-0  -o jsonpath={.spec.local.path}
    /home/robinds/var/lib/pgsql/patroni
    
  2. In the following YAML, replace the following:

    • INDEX with the index of the pending Pod

    • PV_PATH with the path from step 1

    • NODE_NAME with the node on which the Patroni Pod is not present

    apiVersion: v1
    kind: PersistentVolume
    metadata:
      labels:
        app: robin-patroni
      name: robin-patroni-pv-<<INDEX>>
    spec:
      accessModes:
      - ReadWriteOnce
      capacity:
        storage: 30Gi
      claimRef:
        apiVersion: v1
        kind: PersistentVolumeClaim
        name: pgdata-robin-patroni-<<INDEX>>
        namespace: robinio
      local:
        path: <<PV_PATH>>
      nodeAffinity:
        required:
          nodeSelectorTerms:
          - matchExpressions:
            - key: kubernetes.io/hostname
              operator: In
              values:
              - <<NODE_NAME>>
      persistentVolumeReclaimPolicy: Delete
      storageClassName: local-storage
      volumeMode: Filesystem
    status:
      phase: Bound
    
  3. After making the changes in step 2, save the file and run the following command:

    # kubectl create -f <above_yaml_file>
    

PP-29505

Symptom

The Dashboard in the Robin CNP UI does not display the metrics data for the CLUSTER CONTAINERS NETWORK DATA TRANSMITTED section of the UI.

PP-29509

Symptom

You must stop the Metrics feature before starting the upgrade and restart it after the upgrade.

PP-29512

Symptom

After upgrading to Robin CNP v5.4.3, you might observe that Robin Bundle applications deployed with one replica are in the NotReady status.

Workaround

If you observe this issue, apply the following workaround:

Run the following command to make the applications healthy:

# robin host probe <hostname> --rediscover --wait

PP-29521

Symptom

After upgrading to Robin CNP v5.4.3, you might observe Pods stuck in the ContainerCreating state, and a warning event displays the following message:

Error: Input/Output error on device /dev/sdm

Workaround

If you observe this issue, apply the following workaround:

  1. Run the following command to unfault the drive:

    # robin drive unfault <drive>
    
  2. Run the following command to restart the IO manager:

    # systemctl restart iomgr-server
    
  3. Bounce the application Pods.

PP-29525

Symptom

After upgrading to Robin CNP v5.4.3 from supported Robin CNP versions, communication with port 36443 might break because the haproxy.cfg is not pushed into the HAProxy docker container.

As a result, you cannot access the Robin cluster using port 36443.

PP-29528

Symptom

In some scenarios, when a Pod with the robin.io/networks annotation is deleted, the net-attach-def entries are left behind.

Workaround

Delete the net-attach-def entries using the following command:

# kubectl delete net-attach-def <net_attach_def_name>
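
To find leftover entries first, you can list the resource across all namespaces (the net-attach-def short name comes from the network-attachment-definitions CRD named in the error above):

    # kubectl get net-attach-def -A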

25.1.7. Technical Support

Contact Robin Technical support for any assistance.

25.2. Robin Cloud Native Platform v5.4.3 HF1

The Robin Cloud Native Platform (CNP) v5.4.3 HF1 release has improvements, fixed issues, and known issues.

Release Date: 19 April 2023

25.2.1. Infrastructure Versions

The following software applications are included in this CNP release:

Software Application      Version
--------------------      -------
Kubernetes                1.26.0
Docker                    19.03.9 (CentOS 7) and 20.10.8 (Rocky 8)
Prometheus                2.39.1
Prometheus Adapter        0.10.0
Node Exporter             1.4.0
Calico                    3.24.3
HAProxy                   2.4.7
PostgreSQL                14.6
Grafana                   9.2.3
CRI Tools                 1.25.0

25.2.2. Upgrade Path

The following are the supported upgrade paths for Robin CNP v5.4.3 HF1:

  • Robin CNP v5.4.3-120 (GA) to Robin CNP v5.4.3 HF1

  • Robin CNP v5.4.3-237 (HF1-RC) to Robin CNP v5.4.3 HF1

  • Robin CNP v5.3.11-217 (HF2) to Robin CNP v5.4.3 HF1

Note

For a successful upgrade, you must run the possible_job_stuck.py script before and after the upgrade. Contact the Robin Support team for the upgrade procedure using the script.

25.2.3. Improvements

25.2.3.1. Enhanced GoRobin

Starting from Robin CNP v5.4.3 HF1, the GoRobin utility tool runs the preinstallation checks in parallel.

25.2.3.2. Support for V2 KV engine for the HashiCorp Vault integration

Starting from Robin CNP v5.4.3 HF1, Robin CNP supports the V2 KV engine for the HashiCorp Vault integration.

25.2.3.3. Set faultdomain to host or rack for all RWX PVCs

Starting from Robin CNP v5.4.3 HF1, for all storage classes (custom storage classes and storage classes created by Robin) except the robin-rwx storageclass, you can set host or rack as the faultdomain for RWX PVCs. If you set disk as the faultdomain, the RWX PVC is not provisioned and the following error is shown:

For Access-Many volumes, replication should be more than 1 and faultdomain should be 'host'.

For the robin-rwx storageclass, the default faultdomain is set to host for RWX PVCs. The faultdomain options disk and rack are not supported.
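
A minimal sketch of a custom storage class that sets the faultdomain; the provisioner name (robin) and parameter spellings are assumptions to verify against the storage classes already present on your cluster:

    apiVersion: storage.k8s.io/v1
    kind: StorageClass
    metadata:
      name: rwx-host-faultdomain
    provisioner: robin
    parameters:
      replication: "2"
      faultdomain: host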

25.2.3.4. Support for deploying NFS server Pods on nodes with custom taints

Starting from Robin CNP v5.4.3 HF1, to deploy NFS server Pods on Kubernetes nodes that have custom taints, you must update the nfs_pod_tolerations config attribute in the nfs section of the robin config to add tolerations for NFS server Pods.

Tolerations added through the config attribute take effect only for newly created NFS server Pods. For existing NFS server Pods, you must add tolerations manually.
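
For illustration, the attribute might be set with a JSON-style toleration list; the exact value format is an assumption, so confirm it with the Robin support team before use:

    # robin config update nfs nfs_pod_tolerations '[{"key": "dedicated", "operator": "Equal", "value": "nfs", "effect": "NoSchedule"}]'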

25.2.3.5. Set default faultdomain to host for creating a Persistent Data Volume (PDV)

Starting from Robin CNP v5.4.3 HF1, the valid faultdomain options for creating a Persistent Data Volume (PDV) are host and rack. The disk option for faultdomain is not supported. The default faultdomain is set to host.

25.2.4. Fixed Issues

Reference ID

Description

PP-28768

After upgrading Robin CNP v5.4.1, you might notice that a cordoned node is uncordoned. This issue is fixed.

PP-29303

The issue of KVMs not considering IP pool configurations (Spoofchk and Trustmode) is fixed.

PP-29743

Token-based integration of HashiCorp Vault is no longer supported.

PP-29090

When you deploy a Helm chart with static IP addresses and annotations in a StatefulSet, the Helm chart fails to assign the defined static IP addresses. However, it assigns different IP addresses from the same IP pool range. This issue is fixed.

PP-29217

The issue of the sriov-device-plugin DaemonSet scaling down due to specific taints on nodes is fixed.

PP-29441

The issue of Patroni Pod being in the Pending state after adding a master node is fixed. The steps to add the master node are updated.

PP-29509

The issue that required stopping metrics before starting the upgrade and restarting them after the upgrade is fixed.

PP-29512

After upgrading to Robin CNP v5.4.3, you might observe that Robin Bundle applications deployed with one replica are in the NotReady status. This issue is fixed.

PP-29553

The issue of Robin cluster nodes not honoring taints applied on nodes after robin-server restart is fixed.

PP-29577

When you delete an SR-IOV Pod and restart the Robin server, you might observe that the SR-IOV deployment Pods are stuck in a create-and-terminate loop. This issue is fixed.

PP-29582

When you stop and start the Robin server during Pod deployment, the Robin network annotations might be ignored. This issue is fixed.

PP-29595

The issue of Robin CNP displaying the following error message when setting up Robin Client is fixed:

Error: You must be logged into the server (the server has asked for the client to provide credentials).

PP-29634

The issue of kubectl top pods command output displaying the following error message is fixed:

Error: Metrics not available for pod.

PP-29644

The issue of the Robin installer trying to access the Internet to download the install images and eventually the installation failing is fixed.

PP-29648

The issue of GoRobin failing to install on a 45-node cluster is fixed.

PP-29779

The issue of discrepancy in CPU core calculation while validating the rpool limit is fixed.

PP-29867

When you use node affinity and IP Pool annotations in a Deployment, the Deployment uses IPs from the IP Pool but fails to follow the affinity rule. This issue is fixed.

PP-29939

The Robin CNP scheduler did not calculate the guaranteed CPU utilization correctly and scheduled Pods on over-utilized nodes. This issue is fixed.

PP-30033

The upgrade process failed when upgrading to Robin CNP v5.3.11 due to a discrepancy in the number of CSI node plugin Pods between the pre- and post-upgrade. This issue is fixed.

PP-30050

The issue of the Robin CLI not working when stormgr is down is fixed.

PP-30160

The issue of node removal failing with the --force command option is fixed.

PP-30141

The issue of VolumeCreate job failing even though the cluster has enough resources is fixed.

PP-30290

The issue of Istio mutating webhook configuration after the upgrade is fixed.

PP-30296

The issue of a move from a three-replica File collection to a single-replica File collection failing is fixed.

PP-30345

The issue of the upgrade process not failing even though the version and tag stamping in the config map failed is fixed.

PP-30381

The issue of not being able to upload bundles to an online File collection is fixed.

PP-30387

The issue of adding a PDV volume failing when the faultdomain option disk is used is fixed. Starting with Robin CNP v5.4.3 HF1, the faultdomain option disk is not supported, and the default value for the faultdomain option is host.

25.2.5. Known Issues

Reference ID

Description

PP-28802

Symptom

The Robin Control Plane fails to auto-recover under the following conditions:

  • Root FS is full

  • Out-of-memory condition

  • High CPU situations

  • Operating system Kernel crash

Workaround

Apply the following workaround steps to recover from this situation:

  1. Clean up the disk to free up the space. You need a minimum of 50GB of disk space.

  2. Reboot the node.

PP-29533

Symptom

After moving all apps from an existing file collection to a new file collection and then powering off one of the nodes, the application becomes inaccessible.

Workaround

Delete the file-server Pod so that the PVC gets mounted again.

PP-29650

Symptom

After an IOMGR Pod fails, the Pod might be in the RunContainerError state.

Workaround

Perform the following steps:

  1. Run the following command to restart Docker:

    # service docker restart
    
  2. Run the following command to remove the container:

    # docker rm -f <iomgr container name>
    

PP-29850

Symptom

After rebooting a node, you might notice applications stuck in the ContainerCreating state with a volumeattachments error.

Workaround

Bounce the Pods.

PP-29866

Symptom

All Pod deployments go through the Pending or Terminating state at least once before the deployment succeeds. The behavior is the same for StatefulSets and Deployments.

PP-29962

Symptom

After upgrading from Robin CNP v5.3.11 to Robin CNP v5.4.3 HF1, the robin nfs export-list output might show a wrong entry.

Workaround

Contact the Robin support team for workaround steps.

PP-30112

Symptom

When you upgrade from Robin CNP v5.3.11 HF2 to v5.4.3 HF1, application Pods might be in the Terminating state.

Run the docker ps -a command to check the Pods. If the output does not display the Pods, apply the following workaround.

Workaround

Forcefully delete the Pods that are in the Terminating state.

Run the following command:

# kubectl delete pod -n <namespace> --force <Pod name>

PP-30119

Symptom

When removing a node from Robin CNP, if the affinity is tied to local storage, instance relocation fails with the following error:

Unable to reallocate instance <instance name> due to an affinity rule tying it to local storage.

PP-30149

Symptom

After deploying Pods, if you restart Robin Server and delete Pods, some of the Pods might not come back online.

Workaround

Delete the Pods that are in the Pending state.

PP-30173

Symptom

When scheduling Pods, there might be a difference of up to 2 CPU cores between the Robin planner and the Kubernetes planner due to resource calculation.

For example, if only 2.2 cores are left on a node, a Pod deployed with a 2.2 CPU request will not get scheduled because Robin considers the 2.2 cores as 2 cores.

PP-30188

Symptom

After upgrading to Robin CNP v5.4.3 HF1 from the supported version, you might notice the RWX app stuck in the ContainerCreating state.

Workaround

  1. Run the following command to find the NFS server Pod that is serving the PVC:

    # robin nfs export-list
    
  2. Run the following command to delete the NFS server Pod:

    # kubectl delete pod <nfs server pod> -n robinio
    

PP-30243

Symptom

After upgrading from the supported Robin CNP v5.3.11 to Robin CNP v5.4.3 HF1, you might notice a continuous vnode deploy job running for an app.

Workaround

Run the following commands to rectify this issue.

  1. Run the following command to log in to the Robin master Pod:

    # rbash master
    
  2. Run the following command to restart the Robin server:

    # service robin-server restart
    

PP-30247

Symptom

After upgrading from Robin CNP v5.3.11 to Robin CNP v5.4.3 HF1, the RWX apps might report the following error event:

wrong fs type, bad option, bad superblock on /dev/sdj, missing codepage or helper program, or other error

Workaround

Contact the Robin support team for workaround steps.

PP-30251

Symptom

Adding a master node using GoRobin for the same removed node fails on IPv4 setups. The Patroni Pod will be in the Pending state. No corresponding PVC is created; however, a corresponding PV is present.

Workaround

  1. Delete the pending Patroni Pod.

  2. Check the robin host list output to ensure all hosts are ready.

  3. Assign the rpool.

  4. Add the storage and compute roles to the newly added master.

PP-30264

Symptom

In Robin CNP v5.4.3, if you have --cri=dockershim, the robin_plugin process might have a memory leak.

Workaround

Use --storage-driver=overlay2 if you are using --cri=dockershim.

PP-30298

Symptom

After upgrading from the supported Robin CNP v5.3.11 to Robin CNP v5.4.3 HF1, if the upgrade fails and nodes are in the NotReady status, check for the following symptoms and apply the workaround:

  • IO hangs on a volume, which could cause application downtime, kubelet going down, and the node showing the NotReady status.

  • The iomgr-server on the affected node is in a defunct state.

  • The ps -ef output shows that tcmuctrl commands are stuck.

Workaround

Reboot the node where the iomgr-server is in a defunct state.

PP-30319

Symptom

When you have a StatefulSet or Deployment with robinrpool, one of the Pods may not get scheduled by Kubernetes and it remains in the Pending state due to a mismatch in Robin planner calculation.

Workaround

Run the following command to delete the Pod that is in the Pending state:

# kubectl delete pod <pod name>

PP-30339

Symptom

After upgrading from Robin CNP v5.3.11 to Robin CNP v5.4.3 HF1, you might observe Helm app Pods in the Error state.

Workaround

You need to stop and start the apps in the Error state.

  1. Run the following command to stop the app:

    # robin app stop <app-name>
    
  2. Run the following command to start the app:

    # robin app start <app-name>
    

PP-30357

Symptom

After you upgrade successfully from supported Robin CNP v5.3.11 to Robin CNP v5.4.3 HF1, you might notice that a node is in the NotReady state.

Workaround

Run the following command to rectify this issue:

# robin host probe <hostname> --wait

PP-30361

Symptom

When you delete Pods with static IP and affinity that are in the Running state, and at the same time restart the Robin Server, the Pods might get stuck in the ContainerCreating state and the Pod events show the following:

"NoNetworkFound": cannot find a network-attachment-definition (robin-required) in namespace (kube-system): network-attachment-definitions.k8s.cni.cncf.io "robin-required" not found

Workaround

Bounce the Pods that are in the ContainerCreating state.

PP-30363

Symptom

After upgrading to Robin CNP v5.4.3 HF1, you might notice that one of the nodes is in the NotReady state and the node description shows the following reason and message:

Reason: KubeletNotReady

Message: PLEG is not healthy: pleg was last seen active 1h25m32.701542224s ago; threshold is 3m0s.

Workaround

Run the following command to restart Docker service:

# service docker restart

PP-30364

Symptom

After you successfully add a new node using GoRobin to an existing cluster that has nodes associated with a custom rpool, the GoRobin tool will automatically assign the new node to the default rpool as the tool does not support custom rpools.

In this scenario, apply the following workaround to associate the newly added node to the custom rpool.

Workaround

  1. Run the following command to remove the storage and compute roles from the newly added node:

    # robin host remove-role <new node name> storage,compute
    
  2. Run the following command to assign the custom rpool to the newly added node:

    # robin host assign-rpool <new node name> <custom rpool>
    
  3. Run the following command to add the storage and compute roles:

    # robin host add-role <new node name> storage,compute
    

PP-30386

Symptom

When upgrading from Robin CNP v5.3.11 HF1 to Robin CNP v5.4.3 HF1, the NFS exports might be stuck in the ASSIGNED_ERR state.

Workaround

Contact the Robin support team for workaround steps.

PP-30389

Symptom

If you have added a range of blacklisted IPs in an unexpanded form, Robin CNP does not allow you to remove the range of blacklisted IPs from the IP Pool. It is recommended to use the expanded form (for example, 192.168.1.10,192.168.1.11,192.168.1.12 rather than 192.168.1.10-192.168.1.12) when adding and removing a range of blacklisted IPs for an IP Pool.

Workaround

If you have added a range of blacklisted IPs in an unexpanded form, you need to remove the range from the database. Contact Robin customer support team to apply the workaround.

PP-30394

Symptom

The Robin CNP UI dashboard does not display the cluster memory usage and available cluster storage details.

Workaround

Complete the following steps to rectify this issue.

  1. Run the following command to edit the Grafana ConfigMap:

    # kubectl edit cm robin-grafana-default-dashboards-config
    -n robinio
    
  2. In the configmap, look for the following query:

    sum (container_memory_working_set_bytes{id="/"}) by (kubernetes_io_hostname) / on (kubernetes_io_hostname) machine_memory_bytes * 100

    Replace the above query with the following query:

    sum (container_memory_working_set_bytes{id="/"}) by (kubernetes_io_hostname) / on (kubernetes_io_hostname) machine_memory_bytes{job="kubernetes-cadvisor"} * 100

  3. Run the following command to delete the grafana-rs Pod:

    # kubectl delete pod robin-grafana-rs-mvjqh -n robinio
    

PP-30398

Symptom

After removing an offline master node from the cluster and power cycling it, the removed master node is automatically added back as a worker node.

Workaround

  1. Run the following command to remove the host:

    # robin host remove <hostname>
    
  2. Run the following command to remove the node:

    # kubectl delete node <nodename>
    
  3. Run k8s-script cleanup and host-script cleanup on the to-be-removed node.

25.2.6. Technical Support

Contact Robin Technical support for any assistance.