25. Release Notes

25.1. Robin Cloud Native Platform v5.4.3

The Robin Cloud Native Platform (CNP) v5.4.3 release has new features, improvements, fixed issues, and known issues.

Release Date: 07 January 2023

25.1.1. Infrastructure Versions

The following software applications are included in this CNP release:

Software Application    Version
Kubernetes              1.26.0
Docker                  19.03.9 (CentOS 7) and 20.10.8 (Rocky 8)
Prometheus              2.39.1
Prometheus Adapter      0.10.0
Node Exporter           1.4.0
Calico                  3.24.3
HAProxy                 2.4.7
PostgreSQL              14.6
Grafana                 9.2.3
CRI Tools               1.25.0

25.1.2. Upgrade Paths

The following are the supported upgrade paths for Robin CNP v5.4.3:

  • Robin CNP v5.4.1 (GA) to Robin CNP v5.4.3 (GA)

  • Robin CNP v5.3.13 (GA) to Robin CNP v5.4.3 (GA)

The upgrade procedure remains the same for all the hotfix versions of Robin CNP v5.4.3. For upgrade information, see Upgrade Robin CNP Platform.

Note

Before upgrading to Robin CNP v5.4.3, you must stop the Metrics feature and restart it after the upgrade.

25.1.3. New Features

25.1.3.1. Single Node HA-Ready

Starting from Robin CNP v5.4.3, you can install CNP in HA mode using a single hostname or IP address in an on-prem environment. Later, you can scale up the cluster by adding more Master and Worker nodes as required. A Single Node HA-Ready cluster is therefore a cluster with a single host and HA enabled.

To install a Single Node HA-Ready cluster, use the same install command and provide only a single hostname or IP address.

25.1.3.2. Add Master and Worker Nodes using GoRobin utility

Robin CNP v5.4.3 supports adding new master or worker nodes to an existing Robin CNP HA cluster to scale up your cluster using the GoRobin utility.

The option to add a master node is available only if you initially installed your cluster as an HA cluster. However, for a cluster that you installed as non-HA, you can add more worker nodes.

Note

It is recommended not to add nodes to a cluster that you installed using the --single-node-cluster option. The behavior is not defined. If you want to add more nodes to a cluster installed with this option, contact the Robin support team.

25.1.3.3. Zero Trust Feature to Block Network Traffic

Robin CNP v5.4.3 provides the Zero trust feature. You can enable the zero-trust option as part of the config.json file when installing Robin CNP v5.4.3.

When you use this option, all network ports are closed except the Kubernetes ports, the Robin control ports, and the SSH port. You can use this option in conjunction with the single-node-cluster option or independently.

25.1.3.4. Support to Create KVM-based VMs using Custom UUIDs

Robin CNP v5.4.3 provides you an option to manually provide a custom Universally Unique Identifier (UUID) for a KVM-based application. You can use this feature for VMs where the application license is linked to the UUID of the VM.

This feature enables you to provide the UUID manually using the input.yaml file when creating VMs on Robin CNP.

25.1.4. Improvements

25.1.4.1. Add custom Cluster Identity Certificate for all external-facing Kubernetes and Robin CNP services

Starting from Robin CNP v5.4.3, Robin allows you to use a custom Cluster Identity certificate for all external-facing Kubernetes and Robin CNP services. The Cluster Identity certificate is used to validate the requests sent to the external-facing Kubernetes and Robin CNP services from clients outside the cluster.

By default, Robin creates its own Cluster Identity certificate and uses this certificate to validate the requests. You can use your own Cluster Identity certificate and private key. An external trusted certificate authority (CA) must sign this certificate.

25.1.4.2. Support for HashiCorp Vault

Starting from Robin CNP v5.4.3, Robin CNP re-enabled the support for HashiCorp Vault integration. You can use the GoRobin utility for integrating HashiCorp Vault when installing Robin CNP v5.4.3.

25.1.4.3. Access Robin CNP cluster installed with Zero trust feature using whitelisted IP address

The zero-trust option limits the ports that are accessible from outside the cluster. By default, these ports are accessible from all nodes. The whitelisted IP addresses option limits access to only the nodes specified with this option.

25.1.4.4. ISO image for VMs

Robin CNP v5.4.3 supports ISO images for creating VMs on Robin CNP.

25.1.4.5. Added new events for Pods (Tech Preview)

The following new events for Pods are added in the robin event list in Robin CNP v5.4.3:

  • EVENT_POD_STARTED

  • EVENT_POD_DEPLOY_FAILED

  • EVENT_POD_STOPPED

  • EVENT_POD_STOP_FAILED

  • EVENT_POD_RESTARTED

  • EVENT_POD_DELETED

  • EVENT_POD_FAULTED

  • EVENT_POD_PLAN_FAILED

  • EVENT_POD_RELOCATED

  • EVENT_POD_RELOCATE_FAILED

  • EVENT_POD_RESTARTING

  • EVENT_K8SPOD

Note

To raise the events for Kubernetes Pods, you need to enable the k8s_event_watcher config attribute by running the robin config update cluster k8s_event_watcher True command. By default, this attribute is disabled.

25.1.5. Fixed Issues

Reference ID

Description

PP-28938

When deleting multiple PDVs using the Robin CNP UI, the checkbox for selecting all PDVs (next to Name field) does not work. This issue is fixed.

PP-28966

A Pod deployment fails and the following error message appears in the Pod events: “Error: Vblock with volume_id <> not mounted”. This issue is fixed.

PP-29360

When you add a secondary DPDK-based IP-Pool, routes are programmed by Robin CNP (robin-ipam) erroneously. As a result, Pods are not coming up and failing at ContainerCreating status. This issue is fixed.

PP-29398

The issue with the robin-pqos command when all the isolated cores are running VDU workloads is fixed.

PP-29427

In a scenario where Pods are scheduled with three replicas, three static IP addresses, and an anti-affinity rule, and if the deployment fails for the first time, Robin CNP is not clearing the entries in the database. During the retry of the failed deployment, one of the Pods failed to come up as the IP address was not released by the previously failed deployment. This issue is fixed.

PP-29430

The issue of not being able to use a static IP address as a string for a single replica in a Static IP annotation is fixed.

25.1.6. Known Issues

Reference ID

Description

PP-21916

Symptom

A Pod IP is not pingable from any other node in the cluster, apart from the node where it is running.

Workaround

Bounce the Calico Pod running on the node where the issue is seen.

PP-21935

Symptom

Pods are stuck in the ContainerCreating state with the following error:

kubernetes.io/csi: mounter.SetUpAt failed to check for STAGE_UNSTAGE_VOLUME capability

Workaround

Perform the following steps:

  1. Flush connection entries:

    # conntrack -F
    
  2. Bounce nodeplugin Pod.

Note

If the nodeplugin Pod has become unusable, future filesystem mounts will fail. This is a symptom of the many retries of NFS mount calls that hang. Bouncing the Pod clears out the hung processes.

PP-22781

Symptom

After removing a taint on a master node, GPUs are not detected automatically.

Workaround

You need to run the robin host probe --rediscover --all --wait command for the GPUs to be detected on the primary master node.

PP-22853

Symptom

Robin CNP may not detect GPUs in the following scenarios:

  • After Robin CNP installation

  • After upgrading Robin CNP

  • After adding a new node

Workaround

Run the robin host probe <hostname> --rediscover command.

PP-24736

Symptom

A PVC may not come online after removing an app from the secondary Protection Group on the peer cluster.

Workaround

After you remove the application from the Protection Group and allow the application to start, remove the block_mount label from the PVCs of the application.

PP-25246

Symptom

When you try to delete a KVM application, the deletion process might be stuck as the Virsh commands on the node may not respond.

Workaround

Reboot the node.

PP-25360

Symptom

Containers in a Pod that use an RWX PVC might be stuck in the ContainerCreating state for a long time and display a timeout error.

Workaround

Delete Pods if they are part of a Deployment or StatefulSet.

PP-26345

Symptom

When you deploy a Pod to use an SR-IOV VF from Ethernet Virtual Function 700 Series 154c, sometimes the Pod gets stuck in the ContainerCreating state with the device busy error message.

Workaround

Bounce the Pod that shows the device busy error message.

PP-26572

Symptom

Due to inaccuracies in tracking the Pod creation, tenants and user limits are not explicitly honored for Helm applications.

PP-26581

Symptom

After deleting the PCI resources, the existing Pods that are using the PCI resources are stuck in the ContainerCreating state during the instance relocation.

Workaround

Perform the following steps:

  1. Recreate the PCI resources.

  2. Bounce the respective Pod.

PP-26768

Symptom

You should not use an IP-Pool associated with dpdk drivers as the default network.

PP-26830

Symptom

After deleting the PVCs, Robin CNP cluster is down.

Workaround

Bounce the Calico Pod.

PP-27076

Symptom

In Robin CNP, Kubelet might go down due to the stale cpu_manager_state file.

Workaround

Complete the following steps to fix this issue:

  1. Remove the stale /var/lib/kubelet/cpu_manager_state file using the following command:

    # rm -rf /var/lib/kubelet/cpu_manager_state
    
  2. Restart the Kubelet by running the following command:

    # systemctl restart kubelet
    
  3. Make sure etcd and apiserver Pods on this node are up and running.

PP-27077

Symptom

When deleting the RWX applications, RWX Pods are stuck in the Terminating state.

Workaround

Perform the following steps for deleting the RWX Pods:

  1. Run the following command to find the NFS server Pod associated with the PVC:

    # robin nfs export-list
    
  2. Delete the NFS server Pod used for the respective PVC.

PP-27193

Symptom

When upgrading from supported Robin CNP versions to Robin CNP v5.4.3, RWX Pods may get stuck in the ContainerCreating state as the volume is unmounted and Kubernetes is not aware of it.

If you notice this issue, apply the following workaround steps:

Workaround

  1. Check which PVC/volume the Pod is using.

  2. Check that the volume is not mounted by running the robin volume info <volume_name> command.

  3. Bounce the respective RWX Pod or reboot the respective node.

PP-27276

Symptom

After upgrading to Robin CNP v5.4.3, some Robin Bundle apps might be OFFLINE due to PLAN_FAILED.

Workaround

Manually restart the Robin Bundle apps one by one.

PP-27283

Symptom

In rare scenarios, when you reboot the active master node, two Patroni Pods might have the same role as Replica.

Workaround

Bounce the Calico Pod running on the node where the issue is seen.

PP-27620

Symptom

Sync with the secondary peer cluster fails due to multiple snapshot restore failures.

Workaround

Restart the iomgr-server on the affected node.

  1. Log in to the iomgr Pod on the affected node (rbash iomgr)

  2. Run the command systemctl restart iomgr-server

  3. Check the state of the connections using the rdvm conn list command.

PP-27678

Symptom

When the node where the volume for file collection is mounted is turned off and you want to delete file collection with a single replica, the file collection delete job will fail putting the file server Pod in the terminating state.

Workaround

Run the following command to forcefully delete the file server Pod that is stuck in the terminating state:

# kubectl delete pod <pod_name> -n <robin_ns> --force

PP-27775

Symptom

When upgrading from supported Robin CNP versions to Robin CNP v5.4.3, one of the hosts is stuck in the NotReady state.

Workaround

You need to delete the worker Pod running on the node that is in the NotReady state.

Perform the following steps to delete the worker Pod:

  1. Run the following command to know the status of worker Pods:

    # kubectl get pod -n robinio -o wide |grep worker
    
  2. Run the following command to delete the stuck worker Pod:

    # kubectl delete pod -n robinio <pod_name>
    
  3. Reboot the respective node.

PP-27826

Symptom

When you reboot all nodes of a cluster together, RWX Pods are stuck in the CrashLoopBackOff state.

Workaround

Bounce the respective Pods.

PP-28461

Symptom

When you increase the snapshot space limit on the Primary Protection Group, the same is not replicated to the secondary Protection Group.

Workaround

If you need to increase space for snapshots on the secondary protection group, apply the following workaround:

Run the following command on the secondary cluster to update the snapshots space limit:

# robin app snapshot-space-limit

PP-28494

Symptom

During a non-HA upgrade, the File-server Pod may get stuck in the ContainerCreating state as the volume is unmounted and Kubernetes is not aware of it.

If you notice this issue, apply the following workaround steps.

Workaround

  1. Check which PVC/volume the file-server Pod is using.

  2. Check that the volume is not mounted by running the robin volume info <volume name> command.

  3. Run the following command to cordon the node where the file server Pod is mounted.

    # kubectl cordon <node_name>
    
  4. Run the following command to delete the file server Pod.

    # kubectl delete pod -n robinio <file_server_pod_name>
    
  5. Run the following command to uncordon the node you have cordoned in step 3.

    # kubectl uncordon <node_name>
    

PP-28501

Symptom

After upgrading from the existing Robin CNP to Robin CNP v5.4.3 with RWX applications, the NFS server related jobs are stuck.

Workaround

Perform the following steps:

  1. Run the following command to log in to Robin master Pod:

    # rbash master
    
  2. Run the following command to know the Mount State of the storage nodes:

    # stormgr node list
    
  3. Run the following command to unlock the stormgr CLI:

    # stormgr devl unlock
    
  4. Run the following command to unblock the blocked storage node:

    # stormgr node setstatus --block-mount 0 <blocked_storage_node>
    

PP-28768

Symptom

After upgrading to Robin CNP v5.4.3, you might notice that a cordoned node is uncordoned.

Workaround

Put the cordoned nodes in maintenance mode before upgrading.

Alternatively, cordon the node again after upgrading to Robin CNP v5.4.3.

PP-28867

Symptom

The robin chargeback report is not displaying the correct SSD drive price in the report. The report is showing 0.0 as the price.

PP-28922

Symptom

When you try to restore a namespace snapshot, the job hangs as the PVCs are in the Pending state and displays this message:

Error: Invalid annotation ‘robin.io/fstype’ provided

Workaround

To fix this issue, apply the following workaround:

  1. Run the following command and get the list of the PVCs that are stuck in the Pending state for the namespace:

    # kubectl get pvc -n <namespace_name>
    
  2. Run the following command to patch each PVC to remove the robin.io/fstype from annotations:

    # kubectl edit pvc -n ns1 rwx-app-8
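
As an alternative to editing each PVC interactively in step 2, kubectl can remove an annotation non-interactively when a hyphen is appended to the annotation key. The following is a minimal sketch, assuming the default column layout of kubectl get pvc (NAME in the first column, STATUS in the second). The function name print_fstype_cleanup is hypothetical and not part of Robin CNP. It only prints the commands so that you can review them before running them.

```shell
# Hypothetical helper: read `kubectl get pvc -n <namespace_name>` output on
# stdin and print one annotate command per PVC whose STATUS is Pending.
# The trailing "-" after the key asks kubectl to remove that annotation.
print_fstype_cleanup() {
  ns="$1"
  # NR > 1 skips the header row printed by kubectl.
  awk -v ns="$ns" 'NR > 1 && $2 == "Pending" {
    printf "kubectl annotate pvc -n %s %s robin.io/fstype-\n", ns, $1
  }'
}
```

A possible invocation is kubectl get pvc -n <namespace_name> | print_fstype_cleanup <namespace_name>, followed by running the printed commands once they look correct.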
    

PP-28972

Symptom

When you try to deploy a KVM-based app and override the NIC tags in the IP-Pool using the input.yaml file, you will see the following message:

Error: list index out of range

You will observe this issue as the bonded interface option is not supported for KVM deployments when the Calico interface is used.

PP-29109

Symptom

Robin CNP v5.4.3 does not support the Application Ephemeral Volume (AEV). Due to AEV nonsupport, operations involved with AEVs will fail.

PP-29150

Symptom

When creating an SRIOV or OVS IP pool with VLAN, Robin CNP mistakenly allows the creation of the IP pool if either of them has the VLAN configured for its interface at the host level.

For example, suppose you have created an SRIOV IP pool with a VLAN, and the VLAN is added to the SRIOV interface at the host level. If you then create an OVS IP pool with the same VLAN but without adding the VLAN to the OVS interface at the host level, the OVS IP pool creation succeeds without any error.

However, in this example, when you try to deploy a Pod using the OVS IP pool, the Pod deployment fails at the ContainerCreating state without any error message.

PP-29340

Symptom

After upgrading from the existing Robin CNP to Robin CNP v5.4.3, RWX PVC Pods are stuck in the ContainerCreating state.

Workaround

Perform the following steps to generate a new FS UUID:

  1. Run the following command to know the Pods that are stuck in the ContainerCreating state:

    # kubectl get pods -A | grep -i containercreating
    
  2. Run the following commands to get the RWX volume used by these Pods:

    # kubectl describe pods <pod_name>
    
    # kubectl get pvc -A | grep <claim_name>
    
  3. Run the following command to see the respective job output for the RWX volume with NFSAgentAddExport:

    # robin job list | grep <rwx_volume_name> | grep NFSAgentAddExport
    

    Example

    # robin job list | grep pvc-b0f33e4d-6d1c-4d17-9ddf-0a67b9f1af51
      | grep NFSAgentAddExport
    ->5840 | NFSAgentAddExport | Adding export for vol pvc-b0f33e4d-
    6d1c-4d17-9ddf-0a67b9f1af51| COMPLETED | FAILED | 09 Aug 06:39:45
      | 06:40:01 | 0:00:16  | [] | 1  | Command '/bin/mount /dev/sdo
      /var/lib/robin/nfs/robin-nfs-shared-60/ganesha/pvc-b0f33e4d-6d1c-
      4d17-9ddf-0a67b9f1af51' failed with return code 32: mount: wrong
      fs type, bad option, bad superblock on /dev/sdo, missing codepage
      or helper program, or other error
    
      In some cases useful info is found in syslog - try dmesg | tail or so.
    
  4. If you see the above error, run the following command to know the device and host on which this volume is mounted:

    # robin volume list | grep <rwx_volume_name>
    
  5. Run the following command on the host to know the error:

    # cat /var/log/messages | grep <device_name>
    

    Example

    # cat /var/log/messages | grep sdo
     Aug  9 06:40:00 asa-06 kernel: XFS (sdo): Filesystem has duplicate
     UUID 47762fc2-1e7c-4863-a551-0fe55b29d0c7 - can't mount
    
  6. Run the following command to generate a new FS UUID for the respective device to be mounted:

    # xfs_admin -U generate <path to device>
    

PP-29441

Symptom

After adding a Master node, a Patroni Pod might be in the Pending state.

Workaround

  1. Run the following command to get the Patroni PV.

    # kubectl get pv robin-patroni-pv-<already_present_index>
    -o jsonpath={.spec.local.path}
    

    Example

    # kubectl get pv robin-patroni-pv-0  -o jsonpath={.spec.local.path}
    /home/robinds/var/lib/pgsql/patroni
    
  2. Edit the following YAML file and replace these placeholders:

    • INDEX with the index of the pending Pod

    • PV_PATH with the path from step 1

    • NODE_NAME with the node on which the Patroni Pod is not present

    apiVersion: v1
    kind: PersistentVolume
    metadata:
      labels:
        app: robin-patroni
      name: robin-patroni-pv-<<INDEX>>
    spec:
      accessModes:
      - ReadWriteOnce
      capacity:
        storage: 30Gi
      claimRef:
        apiVersion: v1
        kind: PersistentVolumeClaim
        name: pgdata-robin-patroni-<<INDEX>>
        namespace: robinio
      local:
        path: <<PV_PATH>>
      nodeAffinity:
        required:
          nodeSelectorTerms:
          - matchExpressions:
            - key: kubernetes.io/hostname
              operator: In
              values:
              - <<NODE_NAME>>
      persistentVolumeReclaimPolicy: Delete
      storageClassName: local-storage
      volumeMode: Filesystem
    status:
      phase: Bound
    
  3. After making the changes in step 2, save the file and run the following command:

    # kubectl create -f <above_yaml_file>
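
The placeholder substitution in step 2 can also be scripted instead of done by hand. The following is a minimal sketch, assuming the manifest from step 2 has been saved to a file with its <<INDEX>>, <<PV_PATH>>, and <<NODE_NAME>> placeholders intact. The function name render_patroni_pv and any file names are hypothetical and not part of Robin CNP.

```shell
# Hypothetical helper: fill the three placeholders in a saved PV template
# and write the rendered manifest to stdout.
# Usage: render_patroni_pv <template_file> <index> <pv_path> <node_name>
render_patroni_pv() {
  template="$1"; index="$2"; pv_path="$3"; node_name="$4"
  # "|" is used as the sed delimiter so that "/" in the PV path needs no
  # escaping. The values themselves must not contain "|" or "&".
  sed -e "s|<<INDEX>>|${index}|g" \
      -e "s|<<PV_PATH>>|${pv_path}|g" \
      -e "s|<<NODE_NAME>>|${node_name}|g" \
      "$template"
}
```

For example, redirecting the output of render_patroni_pv <template_file> 1 /home/robinds/var/lib/pgsql/patroni <node_name> to a new file gives a manifest that can be passed to the kubectl create -f command in step 3.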
    

PP-29505

Symptom

The Dashboard in the Robin CNP UI does not display the metrics data for the CLUSTER CONTAINERS NETWORK DATA TRANSMITTED section of the UI.

PP-29509

Symptom

You must stop metrics before starting the upgrade and restart after the upgrade.

PP-29512

Symptom

After upgrading to Robin CNP v5.4.3, you might observe Robin bundle applications deployed with replica 1 are in the NotReady status.

Workaround

If you observe this issue, apply the following workaround:

Run the following command to make the applications healthy:

# robin host probe <hostname> --rediscover --wait

PP-29521

Symptom

After upgrading to Robin CNP v5.4.3, you might observe Pods stuck in the ContainerCreating state, and a warning event displays this message:

Error: Input/Output error on device /dev/sdm

Workaround

If you observe this issue, apply the following workaround:

  1. Run the following command to unfault drives.

    # robin drive unfault <drive>
    
  2. Run the following command to restart the IO manager.

    # systemctl restart iomgr-server
    
  3. Bounce the application Pods.

PP-29525

Symptom

After upgrading to Robin CNP v5.4.3 from supported Robin CNP versions, communication with port 36443 might break because the haproxy.cfg is not pushed into the HAProxy docker container.

As a result, you cannot access the Robin cluster using port 36443.

PP-29528

Symptom

In some scenarios, when a Pod with the robin.io/networks annotation is deleted, the net-attach-def entries are left behind.

Workaround

Delete the net-attach-def entries using the following command:

# kubectl delete net-attach-def <net-attach-def_name>

25.1.7. Technical Support

Contact Robin Technical support for any assistance.

25.2. Robin Cloud Native Platform v5.4.3 HF1

The Robin Cloud Native Platform (CNP) v5.4.3 HF1 release has improvements, fixed issues, and known issues.

Release Date: 19 April 2023

25.2.1. Infrastructure Versions

The following software applications are included in this CNP release:

Software Application    Version
Kubernetes              1.26.0
Docker                  19.03.9 (CentOS 7) and 20.10.8 (Rocky 8)
Prometheus              2.39.1
Prometheus Adapter      0.10.0
Node Exporter           1.4.0
Calico                  3.24.3
HAProxy                 2.4.7
PostgreSQL              14.6
Grafana                 9.2.3
CRI Tools               1.25.0

25.2.2. Upgrade Paths

The following are the supported upgrade paths for Robin CNP v5.4.3 HF1:

  • Robin CNP v5.4.3-120 (GA) to Robin CNP v5.4.3 HF1

  • Robin CNP v5.4.3-237 (HF1-RC) to Robin CNP v5.4.3 HF1

  • Robin CNP v5.3.11-217 (HF2) to Robin CNP v5.4.3 HF1

The upgrade procedure remains the same for all the hotfix versions of Robin CNP v5.4.3. For upgrade information, see Upgrade Robin CNP Platform.

Note

For a successful upgrade, you must run the possible_job_stuck.py script before and after the upgrade. Contact the Robin Support team for the upgrade procedure using the script.

25.2.3. Improvements

25.2.3.1. Enhanced GoRobin

Starting from Robin CNP v5.4.3 HF1, the GoRobin utility runs the preinstallation checks in parallel.

25.2.3.2. Support for V2 KV engine for the HashiCorp Vault integration

Starting from Robin CNP v5.4.3 HF1, Robin CNP supports the V2 KV engine for the HashiCorp Vault integration.

25.2.3.3. Set faultdomain to host or rack for all RWX PVCs

Starting from Robin CNP v5.4.3 HF1, for all storage classes (custom storage classes and storage classes created by Robin) except the robin-rwx storageclass, you can set host or rack as the faultdomain for RWX PVCs. If you set disk as the faultdomain, the RWX PVC is not provisioned and the following error is shown:

For Access-Many volumes, replication should be more than 1 and faultdomain should be ‘host’.

For the robin-rwx storageclass, the default faultdomain is set to host for RWX PVCs. The options of faultdomain as disk and rack are not supported.
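
The faultdomain rules for RWX PVCs described above can be summarized as a small decision function. This is a sketch of the documented rules only, not Robin's actual validation code, and the function name rwx_faultdomain_allowed is hypothetical.

```shell
# Hypothetical illustration of the documented rules for RWX PVCs:
#   robin-rwx storageclass: only "host" is supported
#   any other storageclass: "host" or "rack" is supported, "disk" is not
# Returns 0 if the combination is supported, 1 otherwise.
rwx_faultdomain_allowed() {
  storageclass="$1"; faultdomain="$2"
  case "$storageclass:$faultdomain" in
    robin-rwx:host) return 0 ;;
    robin-rwx:*)    return 1 ;;
    *:host|*:rack)  return 0 ;;
    *)              return 1 ;;
  esac
}
```

For instance, rwx_faultdomain_allowed robin-rwx rack returns a non-zero status, while the same faultdomain with a custom storageclass name returns zero.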

25.2.3.4. Support for deploying NFS server Pods on nodes with custom taints

Starting from Robin CNP v5.4.3 HF1, to deploy NFS server Pods on Kubernetes nodes that have custom taints, you must update the nfs_pod_tolerations config attribute of the nfs section of the robin config to add toleration for NFS server Pods.

The tolerations added through the config attribute take effect only for the newly created NFS server Pods. For the existing NFS server Pods, you must add tolerations manually.

25.2.3.5. Set default faultdomain to host for creating a Persistent Data Volume (PDV)

Starting from Robin CNP v5.4.3 HF1, the valid faultdomain options for creating a Persistent Data Volume (PDV) are host and rack. The disk option for faultdomain is not supported. The default faultdomain is host.

25.2.3.6. Default MTU value from physical interface

Starting from Robin CNP v5.4.3 HF1, if you do not provide the MTU when creating an IP pool, Robin CNP uses the MTU value of the underlying physical device interface as the default MTU.

25.2.4. Fixed Issues

Reference ID

Description

PP-28768

After upgrading Robin CNP v5.4.1, you might notice that a cordoned node is uncordoned. This issue is fixed.

PP-29303

The issue of KVMs not considering IP pool configurations (Spoofchk and Trustmode) is fixed.

PP-29528

The issue of net-attach-def entries being left behind when a Pod with the robin.io/networks annotation is deleted is fixed. In 5.4.3 HF1, a schedule runs every 30 minutes and cleans up the stale static IPs and net-attach-def that are present in the cluster for more than 30 minutes.

PP-29743

Token-based integration of HashiCorp Vault is no longer supported.

PP-29090

When you deploy a Helm chart with static IP addresses and annotations in a StatefulSet, the Helm chart fails to assign the defined static IP addresses. However, it assigns different IP addresses from the same IP pool range. This issue is fixed.

PP-29217

The issue of the sriov-device-plugin DaemonSet scaling down due to specific taints on nodes is fixed.

PP-29441

The issue of Patroni Pod being in the Pending state after adding a master node is fixed. The steps to add the master node are updated.

PP-29509

The issue that required you to stop metrics before starting the upgrade and restart them after the upgrade is fixed.

PP-29512

After upgrading to Robin CNP v5.4.3, you might observe Robin bundle applications deployed with replica 1 are in the NotReady status. This issue is fixed.

PP-29553

The issue of Robin cluster nodes not honoring taints applied on nodes after robin-server restart is fixed.

PP-29577

When you delete an SR-IOV pod and restart the Robin server, you might observe the SR-IOV deployment pods are in the create and terminate loop states. This issue is fixed.

PP-29582

When you stop and start the robin server during pod deployment, the robin network annotations might be ignored. This issue is fixed.

PP-29595

The issue of Robin CNP displaying the following error message when setting up Robin Client is fixed:

Error: You must be logged into the server (the server has asked for the client to provide credentials).

PP-29634

The issue of kubectl top pods command output displaying the following error message is fixed:

Error: Metrics not available for pod.

PP-29644

The issue of the Robin installer trying to access the Internet to download the install images and eventually the installation failing is fixed.

PP-29648

The issue of GoRobin failing to install on a 45-node cluster is fixed.

PP-29779

The issue of discrepancy in CPU core calculation while validating the rpool limit is fixed.

PP-29867

When you use node affinity and IP Pool annotations in a Deployment, it uses the IPs from the IP Pool. However, it fails to follow the Affinity rule. This issue is fixed.

PP-29939

Robin CNP scheduler is not calculating the guaranteed CPU utilization correctly and scheduling Pods on over-utilized nodes. This issue is fixed.

PP-30033

The upgrade process failed when upgrading to Robin CNP v5.3.11 due to a discrepancy in the number of CSI node plugin Pods between the pre- and post-upgrade. This issue is fixed.

PP-30050

The issue of the Robin CLI not working when stormgr is down is fixed.

PP-30160

The issue of node removal failing with the --force command option is fixed.

PP-30141

The issue of VolumeCreate job failing even though the cluster has enough resources is fixed.

PP-30290

The issue of Istio mutating webhook configuration after the upgrade is fixed.

PP-30296

Moving a File collection from a 3-replica File collection to a single-replica File collection fails. This issue is fixed.

PP-30345

The issue of the upgrade process not failing even though the version and tag stamping in the config map failed is fixed.

PP-30381

The issue of not being able to upload bundles to an online File collection is fixed.

PP-30387

The issue of adding PDV volume failing when the Fault domain option disk is used is fixed. Starting with Robin CNP v5.4.3 HF1, the Fault domain option disk is not supported, and the default value for the Fault domain option is host.

25.2.5. Known Issues

Reference ID

Description

PP-28802

Symptom

Robin Control Plane is failing to auto-recover in the following conditions:

  • Root FS is full

  • Out-of-memory condition

  • High CPU situations

  • Operating system Kernel crash

Workaround

Apply the following workaround steps to recover from this situation:

  1. Clean up the disk to free up the space. You need a minimum of 50GB of disk space.

  2. Reboot node.

PP-29533

Symptom

After you move all apps from an existing file collection to a new file collection and then power off one of the nodes, the application becomes inaccessible.

Workaround

Delete the file-server Pod so that the PVC gets mounted again.

PP-29650

Symptom

After failing an IOMGR Pod, the Pod might be in the RunContainerError state.

Workaround

Perform the following steps:

  1. Run the following command to restart Docker:

    # service docker restart
    
  2. Run the following command to remove the container:

    # docker rm -f <iomgr container name>
    

PP-29850

Symptom

After rebooting a node, you might notice that applications are stuck in the ContainerCreating state with a volumeattachments error.

Workaround

Bounce the Pods.

PP-29866

Symptom

All Pod deployments at least once go through the Pending or Terminating state before the deployment is successful. The behavior is the same for StatefulSet and Deployment.

PP-29962

Symptom

After upgrading from Robin CNP v5.3.11 to Robin CNP v5.4.3 HF1, the robin nfs export-list command output might show the wrong entry.

Workaround

Contact the Robin support team for workaround steps.

PP-30112

Symptom

When you upgrade from Robin CNP v5.3.11 HF2 to v5.4.3 HF1, application Pods might be in the Terminating state.

Run this command to check Pods: docker ps -a. If the command output is not displaying the Pods, apply the following workaround.

Workaround

Forcefully delete the Pods that are in the Terminating state.

Run the following command to delete:

# kubectl delete pod -n <namespace> --force <Pod name>
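
When many Pods are affected, the delete commands can be generated from the Pod listing. The following is a minimal sketch, assuming the default column layout of kubectl get pods -A (NAMESPACE, NAME, READY, STATUS). The function name print_force_delete is hypothetical. It only prints the commands, so the docker ps -a check described in the symptom can still be done before any of them is run.

```shell
# Hypothetical helper: read `kubectl get pods -A` output on stdin and print
# one force-delete command per Pod whose STATUS column is Terminating.
print_force_delete() {
  # NR > 1 skips the header row; $1 is the namespace and $2 the Pod name.
  awk 'NR > 1 && $4 == "Terminating" {
    printf "kubectl delete pod -n %s --force %s\n", $1, $2
  }'
}
```

A possible invocation is kubectl get pods -A | print_force_delete, after which the printed commands can be reviewed and run individually.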

PP-30119

Symptom

When removing a node from Robin CNP, if the affinity is tied to local storage, instance relocate fails with the following error:

Unable to reallocate instance <instance name> due to an affinity rule tying it to local storage.

PP-30149

Symptom

After deploying Pods, if you restart Robin Server and delete Pods, some of the Pods might not come back online.

Workaround

Delete the Pods that are in Pending state.

PP-30173

Symptom

When scheduling Pods, there might be a difference of 2 CPU cores between the Robin planner and Kubernetes planner due to resource calculation.

For example, if only 2.2 cores are left on a node, a Pod deployed with a 2.2 CPU request is not scheduled because Robin considers 2.2 cores as 2 cores.
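
The example can be worked through as arithmetic. The following is a minimal sketch, assuming (based only on the example above) that the available cores are truncated to a whole number before being compared with the request. The function name fits_after_truncation is hypothetical, and this is not Robin's actual planner code.

```shell
# Hypothetical illustration: succeeds (status 0) if the request still fits
# after the available cores are truncated to a whole number, and fails
# (status 1) otherwise.
fits_after_truncation() {
  available="$1"; request="$2"
  # int() drops the fractional part, so 2.2 available cores count as 2.
  awk -v a="$available" -v r="$request" 'BEGIN { exit !(int(a) >= r) }'
}
```

With 2.2 cores available, fits_after_truncation 2.2 2.2 fails because 2 is less than 2.2, whereas fits_after_truncation 2.2 2 succeeds.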

PP-30188

Symptom

After upgrading to Robin CNP v5.4.3 HF1 from the supported version, you might notice the RWX app stuck in the ContainerCreating state.

Workaround

  1. Run the following command to find the NFS server Pod that is serving the PVC:

    # robin nfs export-list
    
  2. Run the following command to delete the NFS server Pod:

    # kubectl delete pod <nfs server pod> -n robinio
    

PP-30243

Symptom

After upgrading from the supported Robin CNP v5.3.11 to Robin CNP v5.4.3 HF1, you might notice that a vnode deploy job runs continuously for an app.

Workaround

Run the following commands to rectify this issue.

  1. Run the following command to rbash into the master node:

    # rbash master
    
  2. Run the following command to restart the Robin server:

    # service robin-server restart
    

PP-30247

Symptom

After upgrading from Robin CNP v5.3.11 to Robin CNP v5.4.3 HF1, the RWX apps might report the following error event:

wrong fs type, bad option, bad superblock on /dev/sdj, missing codepage or helper program, or other error

Workaround

Contact the Robin support team for workaround steps.

PP-30251

Symptom

Adding a master node using GoRobin fails on IPv4 setups when the node is the same one that was previously removed. The Patroni Pod will be in the Pending state. No corresponding PVC is created; however, a corresponding PV is present.

Workaround

  1. Delete the pending Patroni Pod.

  2. Check the robin host list output to ensure all hosts are ready.

  3. Assign rpool.

  4. Add the Storage and Compute roles to the newly added master.

PP-30264

Symptom

In Robin CNP v5.4.3, if you use --cri=dockershim, the robin_plugin process might have a memory leak.

Workaround

Use --storage-driver=overlay2 if you are using --cri=dockershim.

PP-30298

Symptom

After upgrading from the supported Robin CNP v5.3.11 to Robin CNP v5.4.3 HF1, if the upgrade fails and nodes are in the NotReady status, check for the following symptoms and apply the workaround:

  • IO hangs on a volume, which can cause application downtime, bring the kubelet down, and leave the node in the NotReady status.

  • The iomgr-server on the affected node is in the defunct state.

  • The ps -ef command output shows that tcmuctrl commands are stuck.

Workaround

Reboot the node where iomgr-server is in defunct state.

PP-30319

Symptom

When you have a StatefulSet or Deployment with robinrpool, one of the Pods may not get scheduled by Kubernetes and it remains in the Pending state due to a mismatch in Robin planner calculation.

Workaround

Run the following command to delete the Pod that is in the Pending state:

# kubectl delete pod <pod name>

PP-30339

Symptom

After upgrading from Robin CNP v5.3.11 to Robin CNP v5.4.3 HF1, you might observe Helm app Pods in the Error state.

Workaround

You need to stop and start the apps in the Error state.

  1. Run the following command to stop the app:

    # robin app stop <app-name>
    
  2. Run the following command to start the app:

    # robin app start <app-name>
    

PP-30357

Symptom

After you upgrade successfully from the supported Robin CNP v5.3.11 to Robin CNP v5.4.3 HF1, you might notice that a node is in the NotReady state.

Workaround

Run the following command to rectify this issue:

# robin host probe <hostname> --wait

PP-30361

Symptom

When you delete Pods with a static IP and affinity that are in the Running state and restart the Robin server at the same time, the Pods might be stuck in the ContainerCreating state and the Pod events show the following:

“NoNetworkFound” : cannot find a network-attachment-definition (robin-required) in namespace (kube-system): network-attachment-definitions.k8s.cni.cncf.io “robin-required” not found

Workaround

Bounce the Pods that are in the ContainerCreating state.

PP-30363

Symptom

After upgrading to Robin CNP v5.4.3 HF1, you might notice that one of the nodes is in the NotReady state and the node description shows the following reason and message:

Reason: KubeletNotReady

Message: PLEG is not healthy: pleg was last seen active 1h25m32.701542224s ago; threshold is 3m0s.

Workaround

Run the following command to restart the Docker service:

# service docker restart

PP-30364

Symptom

After you successfully add a new node using GoRobin to an existing cluster that has nodes associated with a custom rpool, the GoRobin tool will automatically assign the new node to the default rpool as the tool does not support custom rpools.

In this scenario, apply the following workaround to associate the newly added node to the custom rpool.

Workaround

  1. Run the following command to remove the storage and compute roles from the newly added node:

    # robin host remove-role <new node name> storage,compute
    
  2. Run the following command to assign the custom rpool to the newly added node:

    # robin host assign-rpool <new node name> <custom rpool>
    
  3. Run the following command to add the storage and compute roles:

    # robin host add-role <new node name> storage,compute
    

PP-30386

Symptom

When upgrading from Robin CNP v5.3.11 HF1 to Robin CNP v5.4.3 HF1, the NFS exports might be stuck in the ASSIGNED_ERR state.

Workaround

Contact the Robin support team for workaround steps.

PP-30389

Symptom

If you have added a range of blacklisted IPs in an unexpanded form, Robin CNP does not allow you to remove a range of blacklisted IPs from the IP Pool. It is recommended to use the expanded form when adding and removing a range of blacklisted IPs to an IP Pool.

Workaround

If you have added a range of blacklisted IPs in an unexpanded form, you need to remove the range from the database. Contact the Robin customer support team to apply the workaround.

PP-30394

Symptom

The Robin CNP UI dashboard does not display the cluster memory usage and cluster storage available details.

Workaround

Complete the following steps to rectify this issue.

  1. Run the following command to edit the Grafana ConfigMap:

    # kubectl edit cm robin-grafana-default-dashboards-config -n robinio
    
  2. In the ConfigMap, look for the following query:

    sum (container_memory_working_set_bytes{id="/"}) by(kubernetes_io_hostname) / on(kubernetes_io_hostname) machine_memory_bytes * 100

    Replace the above query with the following query:

    sum (container_memory_working_set_bytes{id="/"}) by(kubernetes_io_hostname) / on(kubernetes_io_hostname) machine_memory_bytes{job="kubernetes-cadvisor"} * 100

  3. Run the following command to delete the grafana-rs Pod:

    # kubectl delete pod robin-grafana-rs-mvjqh -n robinio
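The query change in step 2 can also be expressed as a small text substitution. The following Python sketch is a minimal illustration that works on the bare query string; the function name is hypothetical, and if the dashboard is stored as JSON inside the ConfigMap, the double quotes in the selector would additionally need to be escaped, which this sketch does not handle.

```python
OLD_DENOMINATOR = "machine_memory_bytes * 100"
NEW_DENOMINATOR = 'machine_memory_bytes{job="kubernetes-cadvisor"} * 100'


def patch_memory_query(query: str) -> str:
    """Restrict machine_memory_bytes to the kubernetes-cadvisor job."""
    return query.replace(OLD_DENOMINATOR, NEW_DENOMINATOR)


old_query = (
    'sum (container_memory_working_set_bytes{id="/"}) '
    "by(kubernetes_io_hostname) / on(kubernetes_io_hostname) "
    "machine_memory_bytes * 100"
)
print(patch_memory_query(old_query))
```

Applying the function a second time leaves the query unchanged, because the patched text no longer contains the old denominator.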
    

PP-30398

Symptom

After removing an offline master node from the cluster and power cycling it, the removed master node is automatically added back as a worker node.

Workaround

  1. Run the following command to remove the host:

    # robin host remove <hostname>

  2. Run the following command to remove the node:

    # kubectl delete node <nodename>

  3. Run k8s-script cleanup and host-script cleanup on the to-be-removed node.

25.2.6. Technical Support

Contact Robin Technical support for any assistance.

25.3. Robin Cloud Native Platform v5.4.3 HF2

The Robin Cloud Native Platform (CNP) v5.4.3 HF2 release has a new feature, fixed issues, and known issues.

Release Date: 07 July 2023

25.3.1. Infrastructure Versions

The following software applications are included in this CNP release:

Software Application    Version
Kubernetes              1.26.0
Docker                  19.03.9 (CentOS 7) and 20.10.8 (Rocky 8)
Prometheus              2.39.1
Prometheus Adapter      0.10.0
Node Exporter           1.4.0
Calico                  3.24.3
HAProxy                 2.4.7
PostgreSQL              14.6
Grafana                 9.2.3
CRI Tools               1.25.0

25.3.2. Upgrade Path

The following is the supported upgrade path for Robin CNP v5.4.3 HF2:

  • Robin CNP v5.4.3-281 (HF1) to Robin CNP v5.4.3-302 (HF2)

The upgrade procedure remains the same for all the hotfix versions of Robin CNP v5.4.3. For upgrade information, see Upgrade Robin CNP Platform.

Note

For a successful upgrade, you must run the possible_job_stuck.py script before and after the upgrade. Contact the Robin Support team for the upgrade procedure using the script.

25.3.3. New Feature

25.3.3.1. Support for soft anti-affinity for Robin Bundles

Starting with Robin CNP v5.4.3 HF2, you can enable soft anti-affinity for Robin Bundles.

To enable soft anti-affinity for Robin Bundles, you must use the placeon_different_nodes_on_same_rack parameter in the Robin Bundle YAML file and set it to true.

Example

appname: "centos-1"
ippools: ["robin-default"]
roles:
  - name: server1
    placeon_different_nodes_on_same_rack: true
    ippools:
      - ippool: routes-2
  - name: server2
    placeon_different_nodes_on_same_rack: true
    ippools:
      - ippool: routes-2
  - name: server3
    placeon_different_nodes_on_same_rack: true
    ippools:
      - ippool: routes-1
        static_ips: "fd74:ca9b:3a09:86ba:a:b:c:d"

25.3.4. Fixed Issues

Reference ID

Description

PP-30611

When the monitor server in Robin CNP fails to report to the robin-server, it waits for a long time before attempting to send the next report. As a result, heartbeats are missed for a long time, which triggers host probe jobs. This issue is fixed.

PP-30639

Source-based routing does not take effect when the IP pool --ranges option is used. This issue is fixed.

PP-30864

The issue of the K8s collect watcher taking a long time to complete on a loaded cluster is fixed.

PP-30883

After rebooting the nodes, sometimes, RWX Pods might be stuck in the CrashLoopBackOff state because of the Permission denied error. This issue is fixed.

PP-30895

Importing users with capabilities from an LDAP group using the robin ldap import-group command results in an error. This issue is fixed.

PP-30896

Starting from Robin CNP v5.4.3 HF2, the k8s_skipplan_mutatingwebhook config attribute is enabled by default in the robin config list.

PP-30897

The SRIOV annotations in a few of the Deployments are ignored because the Pods are stuck in the validatingwebhook status while the robin-server is busy. As a result, the Pods do not have the required number of network interfaces. This issue is fixed.

PP-30945

When you use a reused Persistent Volume for VolumeMount, the following error is displayed: ‘NoneType’ object is not subscriptable. This issue is fixed.

PP-30951

The issue of choking the kube-controller and using more CPU because of the logging issues in Kubernetes v1.26.0 is fixed by setting the kube-controller log level from 7 to 4. It applies to both a new installation of Robin CNP v5.4.3 HF2 and an upgrade to Robin CNP v5.4.3 HF2.

PP-30978

The issue of high CPU utilization by robin-util-cli is fixed.

25.3.5. Known Issues

Reference ID

Description

PP-30493

Symptom

When you see Pods in the ContainerCreating status and the kubectl describe pod command output displays the following event, apply the workaround.

“NoNetworkFound” : cannot find a network-attachment-definition (robin-required) in namespace (kube-system): network-attachment-definitions.k8s.cni.cncf.io “robin-required” not found

Workaround

Bounce the Pods that are in the ContainerCreating status.

PP-30980

Symptom

When you deploy multiple Pods at the same time, the Pods might come up slowly as mutation takes more time due to timeouts and multiple retries.

PP-31068

Symptom

After upgrading from Robin CNP v5.4.3 HF1 to Robin CNP v5.4.3 HF2, a few nodes might be in the Partial state because RCM reports the drives (disks) attached to these nodes as OFFLINE, whereas stormgr reports the same drives as READY/ONLINE.

Workaround

Contact the Robin customer support team for workaround steps for this issue.

PP-31070

Symptom

After upgrading from Robin CNP v5.4.3 HF1 to Robin CNP v5.4.3 HF2, some Pods might be stuck in the ContainerCreating status due to the Calico CNI issue, and /var/log/calico/cni/cni.log shows the following error message:

[ERROR][3095966] plugin.go 580: Final result of CNI DEL was an error. error=error getting ClusterInformation: connection is unauthorized: Unauthorized

Workaround

Bounce the calico-node-xxx Pod which is running on the node where those Pods are stuck.

PP-31072

Symptom

After upgrading from Robin CNP v5.4.3 HF1 to Robin CNP v5.4.3 HF2, the csi-nodeplugin Pod might be stuck in the Terminating state due to the IO manager issue.

Workaround

Reboot the node where the csi-nodeplugin Pod is stuck.

25.3.6. Technical Support

Contact Robin Technical support for any assistance.

25.4. Robin Cloud Native Platform v5.4.3 HF3

The Robin Cloud Native Platform (CNP) v5.4.3 HF3 release has a fixed issue and known issues.

Release Date: 02 November 2023

25.4.1. Infrastructure Versions

The following software applications are included in this CNP release:

Software Application    Version
Kubernetes              1.25.7 or 1.26.0 (Default)
Docker                  19.03.9 (CentOS 7) and 20.10.8 (Rocky 8)
Prometheus              2.39.1
Prometheus Adapter      0.10.0
Node Exporter           1.4.0
Calico                  3.24.3
HAProxy                 2.4.7
PostgreSQL              14.6
Grafana                 9.2.3
CRI Tools               1.25.0

25.4.2. Upgrade Paths

The following are the supported upgrade paths for Robin CNP v5.4.3 HF3:

  • Robin CNP v5.3.13-107 (HF3) to Robin CNP v5.4.3-355 (HF3)

  • Robin CNP v5.3.13-159 (HF3) to Robin CNP v5.4.3-355 (HF3)

The upgrade procedure remains the same for all the hotfix versions of Robin CNP v5.4.3. For upgrade information, see Upgrade Robin CNP Platform.

Note

For a successful upgrade, you must run the possible_job_stuck.py script before and after the upgrade. Contact the Robin Support team for the upgrade procedure using the script.

25.4.3. Fixed Issues

Reference ID

Description

PP-30493

The issue of Pods being stuck in the ContainerCreating status while the kubectl describe pod command output displays the following event is fixed:

“NoNetworkFound” : cannot find a network-attachment-definition (robin-required) in namespace (kube-system): network-attachment-definitions.k8s.cni.cncf.io “robin-required” not found

25.4.4. Known Issues

Reference ID

Description

PP-31522

Symptom

After deleting a backup, unregistering a storage repo fails with the following error message:

Storage repo is associated with volume group

Workaround

Complete the following steps:

  1. Attach the repo to the App.

    # robin app attach-repo <app_name> <repo_name>
    
  2. Detach the repo from the App.

    # robin app detach-repo <app_name> <repo_name>
    
  3. Unregister the repo.

    # robin repo unregister <reponame>
    

PP-32259

Symptom

When upgrading from Robin CNP v5.3.13-159 (HF3) to Robin CNP v5.4.3-355 (HF3), some of the jobs might fail with the following error:

FATAL: remaining connection slots are reserved for non-replication superuser connections

Workaround

Contact the Robin Customer Support team for the workaround steps.

PP-32288

Symptom

When all nodes of a cluster are rebooted after installing Robin CNP v5.4.3-355 (HF3), one of them remains in the NotReady state because it is missing from the consul members list.

Workaround

Start the consul-client:

# systemctl start consul-client

PP-32334

Symptom

When upgrading from Robin CNP v5.3.13-159 (HF3) to Robin CNP v5.4.3-355 (HF3), the robin-file-server fails to come up.

Workaround

Run the following command to unmount the file collection volume:

# robin volume unmount <file-collection-volume-name>

PP-32385

Symptom

When upgrading from Robin CNP v5.3.13-159 (HF3) to Robin CNP v5.4.3-355 (HF3), sometimes, the etcd Pods get stuck in the CrashLoopBackOff status and kubectl logs show the following error because of the faulted etcd instance:

wal: max entry size limit exceeded

Workaround

You need to remove the faulted etcd instance from the node.

Complete the following steps to remove the faulted etcd instance from the node:

  1. Log in to the node that has the faulted etcd instance.

  2. Stop the etcd and kube-apiserver Pods running on this node by moving the etcd.yaml and kube-apiserver.yaml files out of the /etc/kubernetes/manifests directory:

    # mv /etc/kubernetes/manifests/kube-apiserver.yaml .
    
    # mv /etc/kubernetes/manifests/etcd.yaml .
    
  3. Move the /var/lib/etcd directory as well.

    # mv /var/lib/etcd /var/lib/etcd_old
    
  4. Check for the node where the etcd leader is running:

    # retcd get-leader
    
  5. Log in to the node where the etcd Leader is running.

  6. Check for the member ID of the faulted etcd instance:

    # retcd get-members
    
  7. Remove the faulted etcd instance from the cluster:

    # retcd remove-member --member-id=<faulted_etcd_member_id>
    
  8. Add the etcd instance back to the cluster using the docker exec command:

    # docker exec <container_id> etcdctl \
      --cacert /etc/kubernetes/pki/etcd/ca.crt \
      --cert /etc/kubernetes/pki/etcd/peer.crt \
      --key /etc/kubernetes/pki/etcd/peer.key \
      --endpoints <primary_master_IP>:2379 \
      member add <faulted_etcd_instance_node> \
      --peer-urls=<faulted_etcd_instance_node_IP>:2380
    
  9. Log in to the node that had the faulted etcd instance.

  10. Make sure that the --initial-cluster option contains the entries for all three master nodes in the node_name=node_IP format and the --initial-cluster-state=existing option is available in the etcd.yaml file. If this option is not available, add it.

  11. Move the etcd.yaml and kube-apiserver.yaml files back to the /etc/kubernetes/manifests directory:

    # mv etcd.yaml /etc/kubernetes/manifests
    
    # mv kube-apiserver.yaml /etc/kubernetes/manifests
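The check in step 10 can be done by eye, but a small script makes it less error-prone. The following Python sketch assumes that the etcd command-line flags have already been extracted from etcd.yaml into a list of strings; the function name, host names, and IP addresses are hypothetical and serve only as an illustration.

```python
def check_etcd_flags(args, master_names):
    """Return a list of problems found in the etcd command-line flags."""
    flags = {}
    for arg in args:
        if arg.startswith("--") and "=" in arg:
            name, value = arg.split("=", 1)
            flags[name] = value

    # --initial-cluster is a comma-separated list of name=peer-URL entries.
    members = {}
    for entry in flags.get("--initial-cluster", "").split(","):
        if "=" in entry:
            node, url = entry.split("=", 1)
            members[node] = url

    problems = []
    for master in master_names:
        if master not in members:
            problems.append(f"--initial-cluster is missing {master}")
    if flags.get("--initial-cluster-state") != "existing":
        problems.append("--initial-cluster-state=existing is not set")
    return problems


# Hypothetical flags taken from an etcd.yaml that lists only two masters.
example_args = [
    "--name=master-3",
    "--initial-cluster=master-1=https://10.0.0.1:2380,master-2=https://10.0.0.2:2380",
]
print(check_etcd_flags(example_args, ["master-1", "master-2", "master-3"]))
```

An empty list means that both conditions from step 10 are met and the manifest files can be moved back.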
    

PP-32463

Symptom

After installing Robin CNP v5.4.3-355 (HF3), file collection creation might fail.

Workaround

  1. Grep for all Robin worker Pods.

    # kubectl get pods -n robinio | grep worker
    
  2. Bounce all of the worker Pods.

    # kubectl delete pods <pod name> -n robinio
    
  3. Try creating a File collection.

    # robin collection create <media_type> <rpool_name> --collection_type <type> --replicas <count> --size <size>
    

PP-32477

Symptom

You might notice that one or more of the Robin Patroni Pods are in the starting state. If so, apply the following workaround.

Workaround

  1. Exec into the unhealthy Robin Patroni instance.

    # kubectl exec -ti <robin-patroni Pod name> -n robinio -- bash
    
  2. Restart the Robin Patroni instance.

    # patronictl restart robin-patroni <robin-patroni Pod name>
    
  3. If the previous command does not work, run the following command to restart the Patroni agent running in the Pod.

    # sv restart patroni
    

PP-32497

Symptom

When a cluster reboots, you might observe that one or more Pods are stuck in the ContainerCreating status because the volume mount fails. You might also notice the following type of error message in the kubectl describe pod command output:

volume 1698488235:1 has GET error for volume attachment csi-204e799ca58418a5f0e1b0d4193fd8b0908dbe290ae52ef810a8c2964a12c202: volumeattachments.storage.k8s.io "csi-204e799ca58418a5f0e1b0d4193fd8b0908dbe290ae52ef810a8c2964a12c202" is forbidden: User "system:node:qct-09.robinsystems.com" cannot get resource "volumeattachments" in API group "storage.k8s.io" at the cluster scope: no relationship found between node 'qct-09.robinsystems.com' and this object

Workaround

Cordon the node and bounce the required Pods.

  1. To cordon the node where the failed Pod is scheduled, run the following command.

    # kubectl cordon <node name>
    
  2. Bounce the Pod by running the following command so that the Pod moves to a different node.

    # kubectl delete pod -n <namespace> <pod name>
    

PP-32515

Symptom

When a cluster reboots, you might notice that the robin-worker Pods are stuck in the CrashLoopBackOff status and the following error appears in the log file of the Robin master Pod:

psycopg2.errors.ReadOnlySqlTransaction: cannot execute CREATE EXTENSION in a read-only transaction

You can find the log file at: /var/log/robin/robin-master-bootstrap.log

Workaround

Check for the error in the log file and bounce the Robin master Pod.

  1. To find the Robin master Pod, run the following command:

    # kubectl get pod -n robinio | grep robin-master
    
  2. Exec into the Robin master Pod and check for the error. If you find the error mentioned above, bounce the Robin master Pod.

  3. To bounce the master Pod, run the following command.

    # kubectl delete pod -n <namespace> <pod name>
    

PP-32517

Symptom

In rare scenarios, the upgrade from Robin CNP v5.3.13-159 (HF3) to Robin CNP v5.4.3-355 (HF3) might fail with this message: Failed to start dockershim CRI.

Workaround

Restart Docker and Dockershim.

  1. To restart Docker, run the following command.

    # systemctl restart docker
    
  2. To restart Dockershim, run the following command.

    # systemctl restart dockershim
    
  3. Retry the upgrade.

PP-32523

Symptom

After upgrading to Robin CNP v5.4.3-355 (HF3), you might notice that some of the Pods are in the ContainerCreating status.

Workaround

Bounce the Pods that are in the ContainerCreating status.

To bounce the Pods, run the following command:

# kubectl delete pod -n <namespace> <pod name>

25.4.5. Technical Support

Contact Robin Technical support for any assistance.

25.5. Robin Cloud Native Platform v5.4.3 HF4

The Robin Cloud Native Platform (CNP) v5.4.3-395 (HF4) release has a new feature, improvements, fixed issues, and known issues.

Release Date: 13 December 2023

25.5.1. Infrastructure Versions

The following software applications are included in this CNP release:

Software Application    Version
Kubernetes              1.26.0
Docker                  19.03.9 (CentOS 7) and 20.10.8 (Rocky 8)
Prometheus              2.39.1
Prometheus Adapter      0.10.0
Node Exporter           1.4.0
Calico                  3.24.3
HAProxy                 2.4.7
PostgreSQL              14.6
Grafana                 9.2.3
CRI Tools               1.25.0

25.5.2. Upgrade Path

The following is the supported upgrade path for Robin CNP v5.4.3-395 (HF4):

  • Robin CNP v5.4.3-302 (HF2) to Robin CNP v5.4.3-395 (HF4)

The upgrade procedure remains the same for all the hotfix versions of Robin CNP v5.4.3. For upgrade information, see Upgrade Robin CNP Platform.

Note

  1. For a successful upgrade, you must run the possible_job_stuck.py script before and after the upgrade. Contact the Robin Support team for the upgrade procedure using the script.

  2. You must configure Calico Typha for clusters with more than 50 nodes. For more information, see Configure Calico Typha.

  3. After upgrading to Robin CNP v5.4.3-395 (HF4), you must run the robin schedule update K8sResSync k8s_resource_sync 60000 command to update the robin schedule K8sResSync.

25.5.3. New Feature

25.5.3.1. Support of soft affinity

Robin CNP v5.4.3-395 (HF4) supports the soft affinity feature with a few limitations.

In Kubernetes, the soft affinity feature refers to a way of guiding the Kubernetes Scheduler to make a decision about where to place Pods based on preferences, rather than strict requirements. This preference helps to increase the likelihood of co-locating certain Pods on the same node, while still allowing the Kubernetes Scheduler to make adjustments based on resource availability and other constraints. For more information, see Affinity and anti-affinity.

Limitations

The following are the limitations of soft affinity and anti-affinity support:

  • These operators are not supported: DoesNotExist, Gt, and Lt.

  • Multiple weight parameters for node and Pod affinity are not supported.

  • Soft anti-affinity doesn’t check or match for the label selector coming from a different Deployment.

  • During a complete cluster restart, if all nodes are not up at the same time, Pods will not be spread across nodes with soft anti-affinity.

  • After a Pod restart, it might not come back on the same node.

  • After scaling down the number of replicas in a Deployment, soft Pod anti-affinity might not delete the Pods in the same order as they were created.

  • As the affinity information is handled in the cache, restarting the robin-server will flush the cache, resulting in scaled-up Pods not being placed as per anti-affinity.

  • Creating, deleting, or recreating Pods multiple times will not honour soft affinity.

  • Pods will be unequally distributed on nodes when all Pods in a deployment are deleted.
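Some of the limitations above can be checked before a workload is deployed. The following Python sketch inspects the affinity stanza of a Pod spec (as a plain dictionary) for the unsupported operators and for more than one weighted term per affinity type. It is only an illustration: the function name is hypothetical, and reading "multiple weight parameters" as "more than one preferred term" is an assumption.

```python
UNSUPPORTED_OPERATORS = {"DoesNotExist", "Gt", "Lt"}
PREFERRED = "preferredDuringSchedulingIgnoredDuringExecution"


def soft_affinity_problems(affinity: dict) -> list:
    """List the soft-affinity settings that fall under the limitations."""
    problems = []
    for kind in ("nodeAffinity", "podAffinity", "podAntiAffinity"):
        terms = affinity.get(kind, {}).get(PREFERRED, [])
        if len(terms) > 1:
            problems.append(f"{kind}: {len(terms)} weighted terms defined")
        for term in terms:
            if kind == "nodeAffinity":
                source = term.get("preference", {})
            else:
                source = term.get("podAffinityTerm", {}).get("labelSelector", {})
            for expr in source.get("matchExpressions", []):
                if expr.get("operator") in UNSUPPORTED_OPERATORS:
                    problems.append(
                        f"{kind}: operator {expr['operator']} is not supported"
                    )
    return problems


# Hypothetical soft anti-affinity stanza that spreads Pods across nodes.
example = {
    "podAntiAffinity": {
        PREFERRED: [
            {
                "weight": 100,
                "podAffinityTerm": {
                    "labelSelector": {
                        "matchExpressions": [
                            {"key": "app", "operator": "In", "values": ["web"]}
                        ]
                    },
                    "topologyKey": "kubernetes.io/hostname",
                },
            }
        ]
    }
}
print(soft_affinity_problems(example))
```

The remaining limitations, such as behavior after a cluster restart or a robin-server restart, depend on runtime state and cannot be detected from the spec alone.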

25.5.4. Improvements

25.5.4.1. Support for Isolated shared CPU on Kubernetes CPU manager

Starting from Robin CNP v5.4.3-395 (HF4), you can configure isolated shared CPUs on the Kubernetes CPU Manager. To use isolated shared CPUs on the Kubernetes CPU Manager, the CPUs must come from the isolated pool of CPUs on a host.

25.5.4.2. Relaxation in NIC bonding policy

Starting from Robin CNP v5.4.3-395 (HF4), Robin considers the NIC bonding interface up if at least one of the two interfaces used to create the bond is up.
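The relaxed rule amounts to an "any member is up" check. The following Python sketch is a minimal illustration, not Robin's implementation; the function name and interface names are hypothetical, and the state strings follow the values that Linux reports in /sys/class/net/<interface>/operstate.

```python
def bond_is_up(member_states: dict) -> bool:
    """A bond counts as up when at least one member interface is up."""
    return any(state == "up" for state in member_states.values())


# One of the two member interfaces is down, the bond is still considered up.
print(bond_is_up({"eth0": "up", "eth1": "down"}))
```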

25.5.5. Fixed Issues

Reference ID

Description

PP-30394

The Robin CNP UI dashboard does not display the cluster memory usage and cluster storage available details. This issue is fixed.

PP-30673

The issue of kernel.core_pattern getting changed to |/bin/false from /var/crash/core.%e.%p.%h.%t after restarting the robin-worker Pod or robin-iomgr Pod is fixed.

PP-30980

When you deploy multiple Pods at the same time, the Pods might come up slowly as mutation takes more time due to timeouts and multiple retries. This issue is fixed.

PP-31294

The issue of Robin CNP considering a guaranteed number of CPUs as shared CPUs in a Pod when there are both shared and guaranteed CPUs is fixed.

PP-31664

The issue of nodes stuck in reboot due to mount errors is fixed.

PP-32334

The issue of the robin-file-server failing to come up when upgrading from Robin CNP v5.3.13 to Robin CNP v5.4.3 HF3 is fixed.

PP-32405

When there are many Pods in the Kubernetes Scheduler and if they are taking time to come up, then the NFS Server Pod might take more than 10 minutes to come up and the job might time out. This issue is fixed.

PP-32461

The issue of StatefulSet Pods using static IP ranges not resuming with the same IPs after restarting Pods is fixed.

PP-32498

The issue of the snapshot-controller Pod stuck in the ContainerCreating status and Terminating loop status after node reboot is fixed.

PP-32525

The issue of 5G NF Pods not being deployed using the Helm chart due to a webhook timeout is fixed.

PP-32620

The issue of mutation timeout failure is fixed.

25.5.6. Known Issues

Reference ID

Description

PP-31790

Symptom

Sometimes, a discrepancy in the CPU core calculation is observed for up to 17 minutes when validating the tenant rpool limit.

PP-32555

Symptom

Assigning a static IP address for KVM based apps from a secondary IP-Pool is not supported through the Robin CNP UI.

Workaround

Assign the static IP address for KVM based apps from a secondary IP-Pool using the CLI.

PP-32647

Symptom

After upgrading to Robin CNP v5.4.3-395 (HF4), if the StatefulSet Pods are deleted, they may not retain the same IPs due to the implementation of the staticip_ordered_assignment parameter.

The staticip_ordered_assignment parameter is a new configuration parameter added as part of Robin CNP v5.4.3-395 (HF4). This config parameter is set to True by default.

When it is set to True, the IPs provided in the network annotations will be assigned serially to the Pods.
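One possible reading of serial assignment is "each new Pod receives the first IP from the annotation that is not in use". The following Python sketch models that reading to show how recreated Pods can end up with different IPs. The allocation rule, function name, Pod names, and addresses are all assumptions made for illustration; the actual Robin CNP logic may differ.

```python
def assign_next_ip(annotated_ips, in_use):
    """Return the first annotated IP that no Pod currently holds."""
    for ip in annotated_ips:
        if ip not in in_use.values():
            return ip
    raise RuntimeError("no free static IP left")


ips = ["10.0.0.5", "10.0.0.6", "10.0.0.7"]
in_use = {}

# Initial rollout: Pods come up in ordinal order.
for pod in ["web-0", "web-1", "web-2"]:
    in_use[pod] = assign_next_ip(ips, in_use)
before = dict(in_use)

# web-0 and web-1 are deleted and come back in a different order.
del in_use["web-0"], in_use["web-1"]
for pod in ["web-1", "web-0"]:
    in_use[pod] = assign_next_ip(ips, in_use)

print(before)
print(in_use)
```

In this model, web-0 and web-1 swap addresses after the restart, which matches the symptom described above.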

PP-32713

Symptom

The Robin log collection operation might fail with the following error message, and the file server Pod might be in the Terminating status:

Creation of storage for file collection failed

Workaround

Rerun the log collection using the following command to recreate the file server Pod:

# robin log collect robin-storage

PP-32770

Symptom

In a rare scenario, after rebooting the nodes, KVM based apps might be stuck in the Error state with the following error:

Unable to satisfy max guaranteed CPU requirements

Workaround

Restart the respective apps manually.

# robin instance start <app_name>

25.5.7. Technical Support

Contact Robin Technical support for any assistance.

25.6. Robin Cloud Native Platform v5.4.3 HF4 Point Patch-1

The Robin Cloud Native Platform (CNP) v5.4.3-422 (HF4 Point Patch-1) release has new features, fixed issues, and known issues.

Release Date: 06 March 2024

25.6.1. New Features

25.6.1.1. Robin User Certificate Management

Robin CNP v5.4.3 HF4 Point Patch-1 enables you to manage the Robin user security certificate. Robin CNP creates a TLS certificate when a user is created. By default, the user certificate is valid for one year from the date the user is added, and it is renewed automatically before it expires, depending on the set configuration and scheduler status. The scheduler runs as per the set configuration. For more information, see User Certificate Management.

Robin CNP provides new CLI commands to manage user certificates. Each of the following commands has its own options:

  • robin user-cert check - Enables checking the status of user certificates and user certificate configuration details.

  • robin user-cert renew - Enables renewing user certificates for all users, setting an offset period for checking validity, and performing a dry run to verify the renewal process.

  • robin user-cert update - Enables configuring the user certificate settings: the life span of the certificate (minimum one day), the time interval between each user’s certificate renewal checks, and the certificate renewal offset time.

  • robin user-cert stop - Enables stopping the certificate validity scheduler checks.

  • robin user-cert start - Enables restarting the stopped validity scheduler checks.

Note

In this release, support for managing the user certificates feature is not available from the Robin CNP UI.
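The relationship between the certificate life span and the renewal offset can be sketched as follows. This Python example is an illustration under stated assumptions: it treats "one year" as 365 days and assumes that a certificate becomes due for renewal once the current time is within the offset period before expiry. The function names are hypothetical and are not part of the Robin CLI.

```python
from datetime import datetime, timedelta

DEFAULT_LIFESPAN = timedelta(days=365)  # "one year", approximated as 365 days


def expiry_date(created: datetime, lifespan: timedelta = DEFAULT_LIFESPAN) -> datetime:
    """Return the point in time at which the certificate expires."""
    return created + lifespan


def needs_renewal(created: datetime, now: datetime, renewal_offset: timedelta,
                  lifespan: timedelta = DEFAULT_LIFESPAN) -> bool:
    """Assumed rule: renew once 'now' is inside the offset window before expiry."""
    return now >= expiry_date(created, lifespan) - renewal_offset


created = datetime(2024, 3, 6)
offset = timedelta(days=30)
print(needs_renewal(created, datetime(2025, 1, 1), offset))
print(needs_renewal(created, datetime(2025, 2, 10), offset))
```

A larger offset widens the window in which a scheduler check would pick up the certificate, at the cost of renewing earlier than strictly necessary.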

25.6.1.2. Support for Millicore CPUs for Robin Bundle Apps

Robin CNP v5.4.3 HF4 Point Patch-1 supports the millicore CPU unit for a container. You can now specify a fractional value of a CPU unit when defining the CPU resource for a container in the Robin Bundle. Using the following files, you can specify the millicore CPUs for Robin Bundle apps:

  • Bundle’s template file - To specify the millicore CPUs in the template file of Robin Bundle app, you need to specify the millicore CPU in the min and max keys of the compute.cpu attribute:

      cpu:
        reserve: true
        min: 1.03
        max: 1.03
    
  • Bundle’s input.yaml file - To specify the millicore CPUs in the input.yaml file of Robin Bundle app, you need to specify the millicore CPU in the min and max keys of the containers.cpu attribute:

      cpu:
        reserve: true
        min: 1.03
        max: 1.03
    
  • Bundle’s manifest.yaml file - To specify the millicore CPUs in the manifest.yaml file of Robin Bundle app, you need to specify the millicore CPU in the core key of the compute.cpu attribute:

      cpu:
        reserve: true
        core: 1.03
    

Note

The min and max keys are applicable only for main containers.

Limitations

  • Robin CNP does not support configuring the millicore CPUs through the CNP UI.

  • Robin CNP does not support the Chargeback feature for millicore CPUs.

  • Millicore CPU notation, such as 500m, is not supported in the template of a Robin Bundle app.

  • When you specify guaranteed CPUs with millicore values, the CPUs are not truly guaranteed; they are guaranteed only in the Quality of Service (QoS) sense. This is Kubelet behavior. For truly guaranteed CPUs, specify only integer values.
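The last two limitations can be captured in a small validation helper. The following Python sketch is only an illustration with hypothetical names: it rejects the 500m style of notation and reports whether a value can map to exclusive cores (integer) or only to a QoS-level guarantee (fractional), following the Kubelet behavior described above.

```python
def classify_cpu(value) -> str:
    """Classify a Bundle CPU value as 'exclusive' or 'qos-only'."""
    if isinstance(value, str) and value.strip().endswith("m"):
        raise ValueError(
            f"{value!r}: millicore notation is not supported in the template; "
            "use a decimal value such as 0.5"
        )
    cores = float(value)
    if cores <= 0:
        raise ValueError(f"{value!r}: CPU value must be positive")
    # Only whole numbers of CPUs can be pinned to exclusive cores.
    return "exclusive" if cores.is_integer() else "qos-only"


print(classify_cpu(2))
print(classify_cpu(1.03))
```

For example, a min and max of 1.03 would be classified as qos-only, whereas a value of 2 would be classified as exclusive.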

25.6.2. Fixed Issues

Reference ID

Description

PP-32758

The issue of the same event ID for two different events, which is not allowing users to add the event ID to the subscription list, is fixed.

PP-33203

In prior CNP releases, the Robin user certificates with a default validity of one year would expire after the validity period, resulting in users being unable to perform app-level tasks. With this release, the user certificates are renewed automatically before the expiration date. In addition, Robin CNP allows you to manage the user certificates using the new CLI options.

PP-33253

The issue of the difference in allocated memory display between Robin CNP UI and CLI is fixed.

PP-33255

The issue of Bundle app creation failing when you provide values for the CPU attributes as shown below in the manifest YAML file, is fixed.

cpu:
 cores: 8
 reserve: true

PP-33446

In Robin CNP v5.4.3 HF2, the cpuset.mems parameter was incorrectly set to 0-1 for guaranteed QoS Pods, which means the resources could be allocated across NUMA boundaries. With this release, the cpuset.mems parameter is set to the desired NUMA.

PP-33516

The issue of inflight resources being held by CNP when a Helm chart or Deployment is deleted during the Pod planning phase is fixed. However, there is another known issue when there is a non-graceful termination of a Pod. For more information, see PP-33628 under the Known Issues section.

25.6.3. Known Issues

Reference ID

Description

PP-33501

Symptom

Robin CNP does not support the Chargeback feature for millicore CPUs.

PP-33596

Symptom

Robin CNP does not support the user certificates feature using the CNP UI.

PP-33628

Symptom

In some cases, after the Helm app uninstall, a non-graceful deletion of a Pod, or a StatefulSet Pod deletion, the inflight resources might be held by Robin CNP. To check this, run the following command:

# robin inflight-resources info

PP-33670

Symptom

In scenarios like cluster failover or reboot, Robin CNP may fail to access the devices and mark them as FAULTED erroneously. However, the device might not have issues.

Workaround

Contact the Robin Customer Support team if you observe this issue.

PP-33679

Symptom

When a Master Pod fails over due to a network partition on a node, the Master Pod might get stuck in the Init state.

Apply the following workaround if you see this issue:

Workaround

Restart the Calico Pod on the node where you are seeing the issue.

PP-33725

Symptom

In the following scenarios, app creation using a snapshot (robin app create from snapshot) fails if you use values other than the values of the parent application.

  • Fractional values are different from the parent app values.

  • Modified CPU value (int or float) of an app snapshot with guaranteed CPU.

25.6.4. Technical Support

Contact Robin Technical support for any assistance.

25.7. Robin Cloud Native Platform v5.4.3 HF5

The Robin Cloud Native Platform (CNP) v5.4.3-564 (HF5) release has new features, improvements, fixed issues, and known issues.

Release Date: July 17, 2024

25.7.1. Infrastructure Versions

The following software applications are included in this CNP release:

Software Application

Version

Kubernetes

1.25.7 or 1.26.0 (Default)

Docker

25.0.2 (CentOS 7 and Rocky 8)

Prometheus

2.39.1

Prometheus Adapter

0.10.0

Node Exporter

1.4.0

Calico

3.24.3

HAProxy

2.4.7

PostgreSQL

14.7

Grafana

9.2.3

CRI Tools

1.25.0

25.7.2. Upgrade Paths

The following are the supported upgrade paths for Robin CNP v5.4.3-564 (HF5):

  • Robin CNP v5.4.3-355 + Point Patch to Robin CNP v5.4.3 HF5

  • Robin CNP v5.4.3-395 + Point Patch-1 to Robin CNP v5.4.3 HF5

  • Robin CNP v5.4.3-302 + Point Patch + Security Patch to Robin CNP v5.4.3 HF5

The upgrade procedure remains the same for all the hotfix versions of Robin CNP v5.4.3. For upgrade information, see Upgrade Robin CNP Platform.

25.7.2.1. Pre-upgrade consideration

For a successful upgrade, you must run the possible_job_stuck.py script before and after the upgrade. Contact the Robin Support team for the upgrade procedure using the script.

25.7.2.2. Post-upgrade considerations

  • After upgrading to Robin CNP v5.4.3 HF5, you must run the robin schedule update K8sResSync k8s_resource_sync 60000 command to update the robin schedule K8sResSync.

  • After upgrading to Robin CNP v5.4.3 HF5, you must run the robin-server validate-role-bindings command. To run this command, you need to log in to the robin-master Pod. This command verifies the roles assigned to each user in the cluster and corrects them if necessary.

  • After upgrading to Robin CNP v5.4.3 HF5, the k8s_auto_registration config parameter is disabled by default. The config setting is deactivated to prevent all Kubernetes apps from automatically registering and consuming resources. Be aware of the following points with this change:

    • You can register the Kubernetes apps using the robin app register command manually and use Robin CNP for snapshots, clones and backup operations of the Kubernetes app.

    • As this config parameter is disabled, when you run the robin app nfs-list command, the mappings between Kubernetes apps and NFS server Pod are not listed in the command output.

    • If you need the mapping between a Kubernetes app and its NFS server Pod when the k8s_auto_registration config parameter is disabled or the Kubernetes app is not manually registered, get the PVC name from the Pod YAML file (kubectl get pod -n <namespace> -o yaml) and run the robin nfs export list | grep <pvc name> command.

    • The robin nfs export list command output displays the PVC name and namespace.
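
The manual lookup described in the last two points can be sketched as a small shell function. This is an illustrative sketch, not a Robin-provided tool: nfs_exports_for_pod is a hypothetical name, and it assumes kubectl and robin are on the PATH and that the PVC name appears verbatim in the robin nfs export list output.

```shell
# Hypothetical helper: list the PVC names a Pod uses, then look each one
# up in the NFS export list. Assumes kubectl and robin are on the PATH.
nfs_exports_for_pod() {
    pod="$1"
    namespace="$2"
    # Collect the claim names from the Pod spec (space-separated).
    pvcs=$(kubectl get pod "$pod" -n "$namespace" \
        -o jsonpath='{.spec.volumes[*].persistentVolumeClaim.claimName}')
    for pvc in $pvcs; do
        # Print only the export lines that mention this PVC.
        robin nfs export list | grep -- "$pvc"
    done
}
```

Call it as nfs_exports_for_pod <pod_name> <namespace>. Pods that use no PVC produce no output.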

25.7.3. New Features

25.7.3.1. cAdvisor as a DaemonSet

Robin CNP v5.4.3 HF5 supports cAdvisor as a standalone DaemonSet. By default, cAdvisor (Container Advisor) is enabled when you enable Robin Metrics on your Robin Cluster.

However, if you do not want to use Robin Metrics but still need cAdvisor for gathering node-level metrics, you can enable it and manage cAdvisor independently.

You can start, stop, and view the status of the cAdvisor services without enabling Robin Metrics. This allows you to view metrics on any observability framework (OBF) of your choice. For more information, see cAdvisor as a DaemonSet.

The following new commands are added to support this feature:

  • # robin metrics start-cadvisor

  • # robin metrics status --cadvisor

  • # robin metrics stop-cadvisor

25.7.3.2. Auto Disk Rebalance

Robin CNP v5.4.3 HF5 supports the Auto Disk Rebalance feature. The Auto Disk Rebalance feature manages the storage space of all disks in the cluster automatically when the disk reaches a certain watermark threshold.

By default, the Auto Disk Rebalance feature is enabled.

When a disk reaches a high watermark, the disk rebalance job automatically starts moving the volume slices from one disk to another disk. For more information, see Auto Disk Rebalance.

25.7.4. Improvements

25.7.4.1. Robin services in Master and Worker Pods

The following Robin services run in the Master and Worker Pods as per 5.4.x architecture:

  • Robin services in Master Pod:

    • consul-server

    • robin-server

    • robin-auth-server

    • robin-event-server

    • sherlock-server

    • gui-cli

    • stormgr-server

    • httpd

    • robin-node-monitor

  • Robin services in the Worker Pod:

    • consul-client

    • robin-agent

    • monitor-server

25.7.4.2. Support to specify description when putting a host in maintenance mode

Starting from Robin CNP v5.4.3 HF5, the --desc option is added to specify a description when putting a host in the Maintenance mode.

25.7.4.3. Support to replace the existing Cluster Identity certificate

Robin CNP v5.4.3 HF5 supports replacing your existing Cluster Identity Certificate created during installation. The Cluster Identity Certificate is a server certificate that is presented to all clients that send requests to all externally facing services of the CNP cluster. It provides assurance to clients that they are connecting to the right server. By default, the Cluster Identity Certificate is signed by the Cluster CA. The Cluster CA is a self-signed certificate that is generally not trusted by clients with strict security configured. You can replace the existing Cluster Identity Certificate with a certificate that is signed by an external trusted CA. For more information, see Replacing an existing Cluster Identity certificate.

You can use the following command to replace the existing Cluster Identity certificate:

# robin cert update-cluster-identity <identity_ca_path> <identity_cert_path> <identity_key_path> --force

Once the Cluster Identity Certificate is replaced, you must restart the following containers and Pods for the new certificate to take effect:

  • Containers

    • haproxy-keepalived-robink8s_monitor container running on all master nodes

  • Pods

    • robin-master

    • robin-grafana-rs

    • robin-prometheus-rs
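
Before running robin cert update-cluster-identity, it can help to confirm that the three files belong together. The following pre-check is a sketch under stated assumptions: check_identity_cert is a hypothetical helper, it relies on the standard openssl command-line tool rather than any Robin command, and it assumes the CA file contains everything needed to verify the certificate (no separate intermediate certificates).

```shell
# Hypothetical pre-check: confirm that the certificate verifies against
# the CA file and that the private key matches the certificate.
check_identity_cert() {
    ca="$1"; cert="$2"; key="$3"
    # The certificate must verify against the supplied CA file.
    openssl verify -CAfile "$ca" "$cert" >/dev/null 2>&1 || {
        echo "certificate does not verify against the given CA" >&2
        return 1
    }
    # The public key derived from the key file must equal the one
    # embedded in the certificate; compare their digests.
    cert_pub=$(openssl x509 -in "$cert" -noout -pubkey | openssl sha256)
    key_pub=$(openssl pkey -in "$key" -pubout | openssl sha256)
    [ "$cert_pub" = "$key_pub" ] || {
        echo "private key does not match the certificate" >&2
        return 1
    }
    echo "ok"
}
```

Pass the same three paths that you intend to pass to the update command: check_identity_cert <identity_ca_path> <identity_cert_path> <identity_key_path>.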

25.7.4.4. Support to update NFS server type for RWX volume

Robin CNP v5.4.3 HF5 allows you to update the NFS server type for an RWX volume after mounting it. You can change the NFS server type from shared to exclusive and the other way around. For more information, see Update NFS server Pod type.

Note

Before updating the type of NFS server for an RWX volume, the respective volume must be unmounted.

To update the NFS server type, the following new CLI is added as part of Robin CNP v5.4.3 HF5:

# robin nfs export-update <volume> --nfs-server-type <shared|exclusive>

25.7.4.5. Support to add static MAC address for KVM-based application

Robin CNP v5.4.3 HF5 supports the static MAC address for KVM-based applications. To specify the static MAC address for the KVM-based application, you must add the static MAC address in the input.yaml file at runtime when creating an application. This feature enables you to use the VM licenses that are tied to the MAC addresses of the VMs. For more information, see Create a VM with static MAC address.

The following key is used to specify the static MAC address for the KVM-based application in the input.yaml file:

  • static_macs

25.7.4.6. Config option to timeout Metrics start job

A new parameter, metrics_timeout, is added as part of Robin CNP v5.4.3 HF5. Using this configurable parameter, you can set a timeout period for the metrics start job. If any of the Pods related to the Metrics feature (cAdvisor, Grafana, or Prometheus) do not come up within the timeout period, the metrics start job fails. The default timeout is 3600 seconds.

Note

It is recommended not to set the timeout period less than 900 seconds.

You can configure the parameter using the following command:

# robin config update server metrics_timeout <value>

Example

# robin config update server metrics_timeout 1000 --wait
The 'server' attribute 'metrics_timeout' has been updated
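
To avoid setting a value below the recommended minimum by accident, the update command can be wrapped in a guard. This is a minimal sketch; set_metrics_timeout is a hypothetical wrapper name, and the 900-second floor is taken from the note above.

```shell
# Hypothetical wrapper around the documented config command: refuse
# values below the recommended minimum of 900 seconds.
set_metrics_timeout() {
    timeout="$1"
    # Accept only non-negative integers.
    case "$timeout" in
        ''|*[!0-9]*)
            echo "timeout must be an integer number of seconds" >&2
            return 1
            ;;
    esac
    if [ "$timeout" -lt 900 ]; then
        echo "a timeout below 900 seconds is not recommended" >&2
        return 1
    fi
    robin config update server metrics_timeout "$timeout" --wait
}
```

For example, set_metrics_timeout 1000 runs the update, while set_metrics_timeout 600 returns an error without calling robin.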

25.7.4.7. New Schedule type PlanCleanup

A new Schedule type, PlanCleanup, is added as part of Robin CNP v5.4.3 HF5 to the Robin Schedules list. This schedule cleans up the stale Plan IDs.

The schedule runs every 30 minutes and it is enabled by default.

25.7.5. Fixed Issues

Reference ID

Description

RSD-4065

When creating a superadmin user with AdminViewOnly capabilities or a tenantadmin user with TenantViewOnly capabilities, clusterrolebindings / rolebindings giving the user full access to K8s resources were being created. This issue has been fixed. Admin users with ViewOnly capabilities now get view clusterrolebindings / rolebindings.

A utility is provided with Robin CNP v5.4.3 HF5 to fix this issue for existing users (newly created users will have view only clusterrolebindings / rolebindings). You need to run the following command in the robin-master Pod after upgrading to Robin CNP v5.4.3 HF5:

# robin-server validate-role-bindings

RSD-4098

The robin-pqos command is not working properly when Robin CNP is installed on a host that has UEFI secure boot enabled. This issue is fixed.

RSD-4998

The issue of a KVM in Robin CNP having an additional bond0 network interface in an SR-IOV IP-pool even though the --ifcount option is not used when creating the IP-pool is fixed.

RSD-5513

The issue of Robin CNP erroneously clearing the network attach definition in the cluster while clearing the stale network attachment, causing the Pods to fail to come up, is fixed.

RSD-5710

The issue of a discrepancy in CPU core calculation when validating the tenant rpool limit is fixed.

RSD-6367

On the Rocky Linux OS, the KVMs running on the Robin Cluster are getting restarted due to the default LIBVIRTD_ARGS set to --timeout 120, which specifies a timeout value of 120 seconds for the daemon. This issue is fixed.

RSD-6525

Security vulnerabilities are fixed.

To fix security vulnerabilities with the existing versions of the following applications, Robin CNP v5.4.3 HF5 has the following upgraded versions:

  • docker-ce 25.0.2

  • docker-cli 25.0.2

  • containerd 1.6.28

  • runc 1.1.12

RSD-6684

The issue of failing to add a new worker node to the Robin Cluster running on Robin CNP v5.4.3 HF4 Point Patch-1 due to an issue with the staticip_ordered_assignment parameter is fixed.

RSD-6763

The issue of the Helm binary version mismatch between the host and the downloaded Helm-client or the robin-master Pod is fixed.

RSD-6781

The issue of Robin processes (robink8s_monitor.py, robin_plugin, and robin-cri) running on the isolated shared cores is fixed. For clusters with Robin pqos feature enabled, the robin-cri process will run on the shared isolated cores.

RSD-6843

The issue of NUMA node allocation not working for Guaranteed Pods without SRIOV interfaces is fixed.

RSD-6897

When creating an RWX PVC without specifying the filesystem type, the Pod that will consume this PVC does not come up due to the error Volume already exists but requested fstype ext4 mismatches with existing volume’s fstype xfs. This issue is fixed.

RSD-6919

After uninstalling a Helm app, and deleting a Pod or a StatefulSet Pod forcefully, Robin CNP might hold the inflight resources. This issue is fixed.

RSD-6921

The issue of ERR_NOSPACE where the garbage collection (GC) failed to execute on the disks that have reached their full capacity, is fixed.

RSD-7036

The file-collection volume is failing to mount on the host because the zone ID is missing in the volume handler present in the PV object. This issue is fixed.

RSD-7053

The issue of ERR_NOSPACE where the garbage collection (GC) failed to execute on the unloaded snapshots of the volume slices, is fixed.

RSD-7112

When an event is generated for Robin Bundle and Helm apps, the tags CNCF, NF, and NS UUIDs are not present in the events seen by the Kafka consumer. This issue is fixed.

RSD-7218

The issue of RES memory for the robin-server process increasing gradually over a period of time, resulting in heartbeat missed alerts on the cluster, is fixed.

RSD-7433

The issue of the iomgr-server service crashing with the error conn 0x7f349c400000: recvhdr failed with ERR_BADOP is fixed.

RSD-7478

The issue of VPP applications not being able to use 100% CPU for Guaranteed Pods is fixed. With Robin CNP v5.4.3 HF5, the parameter cpu.cfs_quota_us is set to -1. This allows the usage of 100% CPU for all the existing Guaranteed Pods.

RSD-7529

When Pod events are generated for Kafka user, the master node’s name does not appear correctly in the master_node_name key. This issue is fixed.

25.7.6. Known Issues

Reference ID

Description

PP-31547

Symptom

A device may run out of space, and you might observe disk usage alerts or out-of-space errors when an application is writing data, resulting in failed writes.

You might also observe that the physical size of a volume is greater than the logical size when you run the robin volume info command. You might notice this on volumes without IO.

This issue could be because the garbage collector (GC) failed to reclaim space.

Workaround

If you notice this issue, contact the Robin Customer Support team.

PP-33702

Symptom

When Pod soft anti-affinity is applied to the SRIOV Pods, the Pods might not be evenly distributed on the nodes.

After applying soft anti-affinity for Pods, you can run the following command and check the distribution of Pods on nodes:

# kubectl get pods -n <namespace> -o wide

The uneven distribution of Pods issue occurs as Kubernetes is recommending a node that is not part of the affinity list.

Workaround

To achieve an even distribution of Pods across nodes, delete Pods from the node with the highest number of pods for the same app name.

For example: If there are two nodes: Node A and Node B, and if Node A has five Pods and Node B has one Pod, you should delete two Pods from Node A to balance the distribution of Pods between the two nodes.

Run the following command to delete the Pods:

# kubectl delete pod <pod_name>
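
The check and the arithmetic in the example above can be sketched as two shell functions. Both names are hypothetical. pods_per_node counts every Pod in the namespace (filter by app label yourself if the namespace holds more than one app), and pods_to_delete assumes two nodes and that each deleted Pod is rescheduled on the other node.

```shell
# Count Pods per node in a namespace (assumes kubectl is on the PATH).
pods_per_node() {
    kubectl get pods -n "$1" \
        -o jsonpath='{range .items[*]}{.spec.nodeName}{"\n"}{end}' |
        sort | uniq -c
}

# Number of Pods to delete from the busiest node so that two nodes end up
# balanced, assuming each deleted Pod is rescheduled on the other node.
pods_to_delete() {
    busiest="$1"
    other="$2"
    echo $(( (busiest - other) / 2 ))
}
```

With five Pods on Node A and one Pod on Node B, pods_to_delete 5 1 prints 2, which matches the example.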

PP-34088

Symptom

After upgrading to Robin CNP v5.4.3 HF5, in rare scenarios, the robin-file-server Pod might be stuck in the ContainerCreating state due to a volume mount failure. The kubectl describe pod command shows an error similar to the following:

MountVolume.WaitForAttach failed for volume “pvc-c3c62dd9-7b95-4254-912d-31ab5ae05150” : volume 1713906186:1 has GET error for volume attachment csi-aa7f3a85079a40fdc962a9f22ba5685f41947173a57b1f2477ae9847eff0a19b: volumeattachments.storage.k8s.io “csi-aa7f3a85079a40fdc962a9f22ba5685f41947173a57b1f2477ae9847eff0a19b” is forbidden: User “system:node:hypervvm-69-36.robinsystems.com” cannot get resource “volumeattachments” in API group “storage.k8s.io” at the cluster scope: no relationship found between node ‘hypervvm-69-36.robinsystems.com’ and this object

If you notice the above issue, apply the following workaround steps:

Workaround

  1. Get the volume attachment name from the error.

  2. Verify whether the volume attachment exists:

    # kubectl get volumeattachments | grep <volume_attachment>
    
  3. If the volume attachment does not exist, bounce the robin-file-server Pod:

    # kubectl delete pod <pod_name> -n <namespace>
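
Step 1 can be scripted by extracting the csi-prefixed name from the error text. The following sketch assumes the attachment name always has the form csi- followed by hexadecimal characters, as in the sample error above; extract_volume_attachment is a hypothetical helper name.

```shell
# Hypothetical helper: pull the first csi-<hex> volume attachment name
# out of a MountVolume.WaitForAttach error message read from stdin.
extract_volume_attachment() {
    grep -o 'csi-[0-9a-f][0-9a-f]*' | head -n 1
}
```

For example, pipe the output of kubectl describe pod <pod_name> -n <namespace> into extract_volume_attachment, then pass the result to the kubectl get volumeattachments check in step 2.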
    

PP-34111

Symptom

The Robin CNP v5.4.3 HF5 UI does not support the cAdvisor start, stop, and status check operations.

Workaround

You can perform these operations using the Robin CNP CLI. For more information, see cAdvisor as a DaemonSet.

PP-34153

Symptom

When you update affinity rules in a StatefulSet after Pods are deployed as per the affinity rules, the new affinity rules are not honored.

Workaround

You need to delete all replica Pods or redeploy the StatefulSet.

PP-34157

Symptom

After upgrading from any supported Robin CNP version to Robin CNP v5.4.3 HF5, some of the hosts might be in the Partial state. You can observe this issue when you run the robin host list command after upgrading. If any hosts are in the Partial state, perform the following checks for confirmation:

  1. Run the following command to check whether any disks that belong to the host in the Partial state are offline:

    # robin disk list
    
  2. Run the following command to confirm the actual status of the offline disks:

    # robin ap report | grep <disk_name>
    

If the disks are in the Ready state in the output of the previous command, apply the following workaround:

Workaround

Run the following command to rediscover the disks and other resources:

# robin host probe --rediscover <hostname>

PP-34158

Symptom

When a Robin CNP cluster has more than 50 nodes, the Robin CNP UI does not display the Metrics UI component.

PP-34197

Symptom

Robin CNP reserves snapshot space even though snapshots are disabled in the bundle manifest file (snapshot: disabled), resulting in app deployment failure due to insufficient storage.

If you notice this issue, apply the following workaround:

Workaround

For each volume, set the snapshot_space_limit to 0 and the snapshot_reserved attribute to 0 for the app in the manifest file.

PP-34226

Symptom

When a PersistentVolumeClaim (PVC) is created, the CSI provisioner initiates a VolumeCreate job. If this job fails, the CSI provisioner will call a new VolumeCreate job again for the same PVC. However, if the PVC is deleted during this process, the CSI provisioner will continue to call the VolumeCreate job because it does not verify the existence of the PVC before calling the VolumeCreate job.

Workaround

Bounce the CSI provisioner Pod:

# kubectl delete pod <csi-provisioner-robin> -n robinio

PP-34339

Symptom

When multiple applications are deployed at a time, the Kubernetes scheduler (K8spodplanner) takes time to pick up the job, and when you run the kubectl get pods command, the Pods remain in the Pending state for a long time.

Workaround

If you notice this issue, deploy Pods serially or wait for Pod deployment completion.

PP-34359

Symptom

When a Helm app or Kubernetes application is deployed on Robin CNP, it allocates more resources to a tenant than its limit. To verify this, after deploying the app, run the following command to check if any of the Pods are overusing the resources.

# robin tenant list <tenantname> --full

In the command output, if you notice any overallocation of resources, apply the following workaround.

Workaround

Run the following command to delete the Pods belonging to the tenant namespace:

# kubectl delete pod <pod_name> -n <namespace>

PP-34415

Symptom

When deploying a Helm app with millicore CPUs, Robin CNP rounds up the requested millicore CPUs to the nearest integer during the planning phase because it does not support millicore CPUs for Helm apps. If the requested CPUs exceed the max_cores_per_app limit, the Helm app deployment fails with an error similar to the following:

Error from server: error when creating “sts-static-6689-novol.yaml”: admission webhook “master.robin-server.service.robin” denied the request: Total cores (3) exceeds the max_cores_per_app limit (2) for tenant/rpool (master/worker)

Workaround

  1. Reduce the number of replicas to 1 in the respective Helm chart.

  2. Deploy the app using the helm install command.

  3. Once the first Pod of the app is up, scale up the replicas as desired:

    # kubectl scale <deployment|statefulset>/<name> -n <namespace> --replicas=<replicas>
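
The rounding described in the symptom can be worked through with a short calculation. The sketch below assumes the rounding is applied per Pod and then multiplied by the replica count; that reading is an assumption, and planned_cores is a hypothetical helper name.

```shell
# Cores counted during planning when each Pod's millicore request is
# rounded up to a whole core (assumption: rounding is applied per Pod).
planned_cores() {
    millicores_per_pod="$1"
    replicas="$2"
    # Integer ceiling of millicores / 1000.
    cores_per_pod=$(( (millicores_per_pod + 999) / 1000 ))
    echo $(( cores_per_pod * replicas ))
}
```

Under this assumption, three replicas that each request 500m are planned as 3 cores (planned_cores 500 3), which exceeds a max_cores_per_app limit of 2 even though the actual request is 1.5 cores. A single replica is planned as 1 core, which is why the workaround starts with one replica.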
    

PP-34434

Symptom

When a StatefulSet or Deployment with multiple replicas is deployed with soft Pod affinity, all Pods except one are placed on the correct node, even though the node has sufficient resources.

Workaround

Delete the Pod that is placed on the wrong node by running the following command:

# kubectl delete pod <pod_name> -n <namespace>

PP-34439

Symptom

In certain scenarios, an SRIOV Pod with soft anti-affinity might cycle between the Pending and Terminating statuses and never move to the Running status.

If you notice the issue, perform the following checks to confirm and apply the workaround.

  1. Run the following command to access the Robin Master Pod:

    # rbash master
    
  2. On the Robin Master Pod, run the following command to check the k8sapp-server.log file for errors that match the following pattern:

    # tail -f /var/log/robin/k8sapp-server.log | grep -i 'Host.*not.*found'
    

If you notice these symptoms, apply the following workaround:

Workaround

You need to delete the kube-scheduler Pod from the node on which it is active.

  1. Run the following command to find out which node is holding the lease of the kube-scheduler Pod:

    # kubectl get lease -n kube-system
    
  2. SSH in to the master node that is holding the lease of the kube-scheduler Pod.

  3. On the master node, delete the kube-scheduler Pod by running the following command:

    # mv /etc/kubernetes/manifests/kube-scheduler.yaml ~/
    
  4. To verify that the lease has moved to a different node, run the following command again:

    # kubectl get lease -n kube-system
    
  5. To bring back the kube-scheduler Pod on the node from which you removed it, run the following command on the same node:

    # mv ~/kube-scheduler.yaml /etc/kubernetes/manifests/kube-scheduler.yaml
    
  6. To make sure the kube-scheduler Pod is back online, run the following command:

    # docker ps | grep kube-scheduler
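
The lease lookup in steps 1 and 4 can be narrowed to the scheduler lease only. This sketch assumes the lease is named kube-scheduler in the kube-system namespace and that its holder identity has the form <node_name>_<id>, which is the usual Kubernetes leader-election format; both function names are hypothetical.

```shell
# Print the raw holder identity of the kube-scheduler lease.
scheduler_lease_holder() {
    kubectl get lease kube-scheduler -n kube-system \
        -o jsonpath='{.spec.holderIdentity}'
}

# Print only the node name (assumption: identity is "<node_name>_<id>").
scheduler_lease_node() {
    holder=$(scheduler_lease_holder)
    # Strip the trailing "_<id>" suffix, if any.
    printf '%s\n' "${holder%_*}"
}
```

Run scheduler_lease_node before step 2 to find the node to log in to, and again after step 3 to confirm that a different node now holds the lease.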
    

PP-34451

Symptom

In rare scenarios, the RWX Pod might be stuck in the ContainerCannotRun state and shows the following error in the Pod’s event:

mount.nfs: mount system call failed

Perform the following steps to confirm the issue:

  1. Run the robin volume info <volume> command and check for the following details:

    1. Check the status of the volume. It should be in the ONLINE status.

    2. Check whether the respective volume mount path exists.

    3. Check the physical and logical sizes of the volume. If the physical size of the volume is greater than the logical size, then the volume is full.

  2. Run the following command to check whether any of the disks for the volume is running out of space:

    # robin disk info <wwn>
    
  3. Run the lsblk and blkid commands to check whether the device mount path works fine on the nodes where the volume is mounted.

  4. Run the ls <mount_path> command to check if accessing the respective filesystem mount path gives any input and output errors.

If you notice any input and output errors in step 4, apply the following workaround:

Workaround

  1. Find all the Pods that are using the respective PVC:

    # kubectl get pods --all-namespaces -o=jsonpath='{range .items[*]}{.metadata.namespace}/{.metadata.name}{"\t"}{.spec.volumes[*].persistentVolumeClaim.claimName}{"\n"}{end}' | grep <pvc_name>
    
  2. Bounce all the Pods identified in step 1:

    # kubectl delete pod <pod> -n <namespace>
    

PP-34457

Symptom

If the Metrics feature is enabled on your Robin CNP cluster and you are using Grafana for monitoring, after upgrading the cluster from any supported Robin CNP version to Robin CNP v5.4.3 HF5, the Grafana metrics will not work.

Note

You need to take a backup of the configmaps of the Prometheus and Grafana apps in the robinio namespace before you stop the Robin Metrics. The configmaps reset when you stop and start the Metrics feature.

Workaround

You need to stop and restart the Metrics feature.

  1. To stop the Metrics feature, run the following command:

    # robin metrics stop
    
  2. To start the Metrics feature, run the following command:

    # robin metrics start
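
The backup mentioned in the note can be taken with a single kubectl export before step 1. Because the individual configmap names are not listed here, this sketch saves every configmap in the robinio namespace to one file; backup_robinio_configmaps is a hypothetical helper name, and the default file name is only an illustration.

```shell
# Hypothetical helper: save all configmaps in the robinio namespace to a
# YAML file and print the file name. Assumes kubectl is on the PATH.
backup_robinio_configmaps() {
    # Use the given path, or a timestamped file in the current directory.
    outfile="${1:-robinio-configmaps-$(date +%Y%m%d%H%M%S).yaml}"
    kubectl get configmaps -n robinio -o yaml > "$outfile" && echo "$outfile"
}
```

Keep the resulting file until you have confirmed that Grafana and Prometheus work as expected after the Metrics feature restarts.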
    

25.7.7. Technical Support

Contact Robin Technical support for any assistance.