*************
Release Notes
*************

===================================
Robin Cloud Native Platform v5.4.3
===================================

The Robin Cloud Native Platform (CNP) v5.4.3 release has new features, improvements, fixed issues, and known issues.

**Release Date:** 07 January 2023

Infrastructure Versions
=======================

The following software applications are included in this CNP release:

==================== ==================
Software Application Version
==================== ==================
Kubernetes           1.26.0
Docker               19.03.9 (CentOS 7) and 20.10.8 (Rocky 8)
Prometheus           2.39.1
Prometheus Adapter   0.10.0
Node Exporter        1.4.0
Calico               3.24.3
HAProxy              2.4.7
PostgreSQL           14.6
Grafana              9.2.3
CRI Tools            1.25.0
==================== ==================

Upgrade Paths
==============

The following are the supported upgrade paths for Robin CNP v5.4.3:

* Robin CNP v5.4.1 (GA) to Robin CNP v5.4.3 (GA)
* Robin CNP v5.3.13 (GA) to Robin CNP v5.4.3 (GA)

The upgrade procedure remains the same for all the hotfix versions of Robin CNP v5.4.3. For upgrade information, see `Upgrade Robin CNP Platform `__.

.. Note:: Before upgrading to Robin CNP v5.4.3, you must stop the Metrics feature and restart it after the upgrade.

New Features
=============

Single Node HA-Ready
--------------------

Starting from Robin CNP v5.4.3, you can install CNP in HA mode in an on-prem environment using a single hostname or IP address. Later, you can scale up the cluster by adding more Master and Worker nodes as per your requirements. Thus, a Single Node HA-Ready cluster is a cluster with a single host and HA enabled.

To install a Single Node HA-Ready cluster, use the same install command but provide only a single hostname or IP address.

Add Master and Worker Nodes using GoRobin utility
-------------------------------------------------

Robin CNP v5.4.3 supports adding new master or worker nodes to an existing Robin CNP HA cluster to scale up your cluster using the GoRobin utility.
The option to add a master node is available only if you initially installed your cluster as an HA cluster. However, for a cluster that you installed as non-HA, you can add more worker nodes.

.. Note:: It is recommended not to add additional nodes to a cluster that you installed using the ``--single-node-cluster`` option. The behavior is not defined. If you want to add more nodes to a cluster installed with this option, contact the Robin support team.

Zero Trust Feature to Block Network Traffic
-------------------------------------------

Robin CNP v5.4.3 provides the Zero trust feature. You can enable the ``zero-trust`` option as part of the ``config.json`` file when installing Robin CNP v5.4.3. When you use this option, all network ports are closed except the Kubernetes ports, Robin control ports, and the SSH port. You can use this option in conjunction with the ``single-node-cluster`` option or independently.

Support to Create KVM-based VMs using Custom UUIDs
--------------------------------------------------

Robin CNP v5.4.3 provides an option to manually set a custom Universally Unique Identifier (UUID) for a KVM-based application. You can use this feature for VMs where the application license is linked to the UUID of the VM.

This feature enables you to provide the UUID manually using the ``input.yaml`` file when creating VMs on Robin CNP.

Improvements
============

Add custom Cluster Identity Certificate for all external-facing Kubernetes and Robin CNP services
----------------------------------------------------------------------------------------------------

Starting from Robin CNP v5.4.3, Robin allows you to use a custom Cluster Identity certificate for all external-facing Kubernetes and Robin CNP services. The Cluster Identity certificate is used to validate the requests sent to the external-facing Kubernetes and Robin CNP services from external clients outside the cluster.
By default, Robin creates its own Cluster Identity certificate and uses this certificate to validate the requests. You can use your own Cluster Identity certificate and private key. An external trusted certificate authority (CA) must sign this certificate.

Support for HashiCorp Vault
---------------------------

Starting from Robin CNP v5.4.3, Robin CNP re-enabled the support for HashiCorp Vault integration. You can use the GoRobin utility for integrating HashiCorp Vault when installing Robin CNP v5.4.3.

Access Robin CNP cluster installed with Zero trust feature using whitelisted IP address
------------------------------------------------------------------------------------------

The ``zero-trust`` option limits the ports that are accessible from outside the cluster. By default, these ports are accessible from all nodes. The whitelisted IP addresses option limits access to only the nodes specified with this option.

ISO image for VMs
-----------------

Robin CNP v5.4.3 supports ISO images for creating VMs on Robin CNP.

Added new events for Pods (Tech Preview)
----------------------------------------

The following new events for Pods are added in the ``robin event list`` in Robin CNP v5.4.3:

* EVENT_POD_STARTED
* EVENT_POD_DEPLOY_FAILED
* EVENT_POD_STOPPED
* EVENT_POD_STOP_FAILED
* EVENT_POD_RESTARTED
* EVENT_POD_DELETED
* EVENT_POD_FAULTED
* EVENT_POD_PLAN_FAILED
* EVENT_POD_RELOCATED
* EVENT_POD_RELOCATE_FAILED
* EVENT_POD_RESTARTING
* EVENT_K8SPOD

.. Note:: To raise the events for Kubernetes Pods, you need to enable the ``k8s_event_watcher`` config attribute by running the ``robin config update cluster k8s_event_watcher True`` command. By default, this event is disabled.
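
When consuming the Pod events listed above, for example in a monitoring script, it can help to separate the failure events from the informational ones. The following is a minimal sketch only: the helper names ``is_failure_event`` and ``failure_events`` are hypothetical and not part of the Robin CLI or API, and treating names that end in ``_FAILED`` or ``_FAULTED`` as failures is an assumption based purely on the event names.

.. code-block:: python

   # Event type names as listed in the Robin CNP v5.4.3 release notes.
   NEW_POD_EVENTS = [
       "EVENT_POD_STARTED",
       "EVENT_POD_DEPLOY_FAILED",
       "EVENT_POD_STOPPED",
       "EVENT_POD_STOP_FAILED",
       "EVENT_POD_RESTARTED",
       "EVENT_POD_DELETED",
       "EVENT_POD_FAULTED",
       "EVENT_POD_PLAN_FAILED",
       "EVENT_POD_RELOCATED",
       "EVENT_POD_RELOCATE_FAILED",
       "EVENT_POD_RESTARTING",
       "EVENT_K8SPOD",
   ]


   def is_failure_event(event_type: str) -> bool:
       """Hypothetical heuristic: names ending in _FAILED or _FAULTED are failures."""
       return event_type.endswith(("_FAILED", "_FAULTED"))


   def failure_events(event_types):
       """Return the failure events, preserving the input order."""
       return [name for name in event_types if is_failure_event(name)]


   if __name__ == "__main__":
       for name in failure_events(NEW_POD_EVENTS):
           print(name)

Keeping the classification in one small function means the heuristic can be replaced later without touching the code that consumes the events.
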
Fixed Issues ============ ============= ============================================================================================================================================================================================================================================================================================================================================================ Reference ID Description ============= ============================================================================================================================================================================================================================================================================================================================================================ PP-28938 When deleting multiple PDVs using the Robin CNP UI, the checkbox for selecting all PDVs (next to **Name** field) does not work. This issue is fixed. PP-28966 If a Pod deployment fails and you notice the following error message in the Pod events: `"Error: Vblock with volume_id <> not mounted"`. This issue is fixed. PP-29360 When you add a secondary DPDK-based IP-Pool, routes are programmed by Robin CNP (robin-ipam) erroneously. As a result, Pods are not coming up and failing at ``ContainerCreating`` status. This issue is fixed. PP-29398 The issue with the ``robin-pqos`` command when all the isolated cores are running VDU workloads is fixed. PP-29427 In a scenario where Pods are scheduled with three replicas, three static IP addresses, and an anti-affinity rule, and if the deployment fails for the first time, Robin CNP is not clearing the entries in the database. During the retry of the failed deployment, one of the Pods failed to come up as the IP address was not released by the previously failed deployment. This issue is fixed. PP-29430 The issue of not being able to use a static IP address as a string for a single replica in a Static IP annotation is fixed. 
============= ============================================================================================================================================================================================================================================================================================================================================================ Known Issues ============= ============= ============================================================================================================================================================================================================================================================================================================================================================================================================ Reference ID Description ============= ============================================================================================================================================================================================================================================================================================================================================================================================================ PP-21916 **Symptom** A Pod IP is not pingable from any other node in the cluster, apart from the node where it is running. **Workaround** Bounce the Calico Pod running on the node where the issue is seen. PP-21935 **Symptom** Pods are stuck in the ``ContainerCreating`` state with the following error: `kubernetes.io/csi: mounter.SetUpAt failed to check for STAGE_UNSTAGE_VOLUME capability` **Workaround** Perform the following steps: 1. Flush connection entries: .. code-block:: text # conntrack -F 2. Bounce nodeplugin Pod. .. Note:: If the nodeplugin Pod has become unusable, future filesystem mounts will fail, this is a symptom of the many retries of NFS mount calls that hang. Bouncing the Pod will clear out the hung processes. 
PP-22781      **Symptom**

              After removing a taint on a master node, GPUs are not detected automatically.

              **Workaround**

              You need to run the ``robin host probe --rediscover --all --wait`` command for the GPUs to be detected on the primary master node.

PP-22853      **Symptom**

              Robin CNP may not detect GPUs in the following scenarios:

              - After Robin CNP installation
              - After upgrading Robin CNP
              - After adding a new node

              **Workaround**

              Run the ``robin host probe --rediscover`` command.

PP-24736      **Symptom**

              A PVC may not come online after removing an app from the secondary Protection Group on the peer cluster.

              **Workaround**

              After you remove the application from the Protection Group and allow the application to start, remove the ``block_mount`` label from the PVCs of the application.

PP-25246      **Symptom**

              When you try to delete a KVM application, the deletion process might be stuck as the Virsh commands on the node may not respond.

              **Workaround**

              Reboot the node.

PP-25360      **Symptom**

              Containers in a Pod that use an RWX PVC might be stuck in the ``ContainerCreating`` state for a long time and display a timeout error.

              **Workaround**

              Delete Pods if they are part of a Deployment or StatefulSet.

PP-26345      **Symptom**

              When you deploy a Pod to use an SR-IOV VF from Ethernet Virtual Function 700 Series 154c, sometimes the Pod gets stuck in the ``ContainerCreating`` state with the device busy error message.

              **Workaround**

              Bounce the Pod that shows the device busy error message.

PP-26572      **Symptom**

              Due to inaccuracies in tracking the Pod creation, tenants and user limits are not explicitly honored for Helm applications.

PP-26581      **Symptom**

              After deleting the PCI resources, the existing Pods that are using the PCI resources are stuck in the ``ContainerCreating`` state during the instance relocation.

              **Workaround**

              Perform the following steps:

              1. Recreate the PCI resources.

              2. Bounce the respective Pod.
PP-26768      **Symptom**

              You should not use an IP-Pool associated with dpdk drivers as the default network.

PP-26830      **Symptom**

              After deleting the PVCs, Robin CNP cluster is down.

              **Workaround**

              Bounce the Calico Pod.

PP-27076      **Symptom**

              In Robin CNP, Kubelet might go down due to the stale ``cpu_manager_state`` file.

              **Workaround**

              Complete the following steps to fix this issue:

              1. Remove the stale ``/var/lib/kubelet/cpu_manager_state`` file using the following command:

                 .. code-block:: text

                    # rm -rf /var/lib/kubelet/cpu_manager_state

              2. Restart the Kubelet by running the following command:

                 .. code-block:: text

                    # systemctl restart kubelet

              3. Make sure etcd and apiserver Pods on this node are up and running.

PP-27077      **Symptom**

              When deleting the RWX applications, RWX Pods are stuck in the ``Terminating`` state.

              **Workaround**

              Perform the following steps for deleting the RWX Pods:

              1. Run the following command to find the NFS server Pod associated with the PVC:

                 .. code-block:: text

                    # robin nfs export-list

              2. Delete the NFS server Pod used for the respective PVC.

PP-27193      **Symptom**

              When upgrading from supported Robin CNP versions to Robin CNP v5.4.3, RWX Pods may get stuck in the ``ContainerCreating`` state as the volume is unmounted and Kubernetes is not aware of it. If you notice this issue, apply the following workaround steps.

              **Workaround**

              1. Check which PVC/volume the Pod is using.

              2. Verify that the volume is not mounted by running the ``robin volume info`` command.

              3. Bounce the respective RWX Pod or reboot the respective node.

PP-27276      **Symptom**

              After upgrading to Robin CNP v5.4.3, some Robin Bundle apps might be ``OFFLINE`` due to ``PLAN_FAILED``.

              **Workaround**

              Manually restart the Robin Bundle apps one by one.

PP-27283      **Symptom**

              In rare scenarios, when you reboot the active master node, two Patroni Pods might have the same role as Replica.

              **Workaround**

              Bounce the Calico Pod running on the node where the issue is seen.
PP-27620      **Symptom**

              Sync with the secondary peer cluster fails due to multiple snapshot restore failures.

              **Workaround**

              Restart the iomgr-server on the affected node.

              1. Log in to the iomgr Pod on the affected node (``rbash iomgr``).

              2. Run the ``systemctl restart iomgr-server`` command.

              3. Check the state of the connections using the ``rdvm conn list`` command.

PP-27678      **Symptom**

              When the node where the volume for a file collection is mounted is turned off and you delete a file collection with a single replica, the file collection delete job fails, leaving the file server Pod in the ``Terminating`` state.

              **Workaround**

              Run the following command to forcefully delete the file server Pod stuck in the ``Terminating`` state:

              .. code-block:: text

                 # kubectl delete -n --force

PP-27775      **Symptom**

              When upgrading from Robin CNP supported versions to Robin CNP v5.4.3, one of the hosts is stuck in the ``NotReady`` state.

              **Workaround**

              You need to delete the worker Pod running on the node that is in the ``NotReady`` status.

              Perform the following steps to delete the worker Pod:

              1. Run the following command to know the status of worker Pods:

                 .. code-block:: text

                    # kubectl get pod -n robinio -o wide |grep worker

              2. Run the following command to delete the stuck worker Pod:

                 .. code-block:: text

                    # kubectl delete pod -n robinio

              3. Reboot the respective node.

PP-27826      **Symptom**

              When you reboot all nodes of a cluster together, RWX Pods are stuck in the ``CrashLoopBackOff`` state.

              **Workaround**

              Bounce the respective Pods.

PP-28461      **Symptom**

              When you increase the snapshot space limit on the Primary Protection Group, the same is not replicated to the secondary Protection Group.

              **Workaround**

              If you need to increase space for snapshots on the secondary Protection Group, run the following command on the secondary cluster to update the snapshot space limit:

              .. code-block:: text

                 # robin app snapshot-space-limit

PP-28494      **Symptom**

              During a non-HA upgrade, the File-server Pod may get stuck in the ``ContainerCreating`` state as the volume is unmounted and Kubernetes is not aware of it. If you notice this issue, apply the following workaround steps.

              **Workaround**

              1. Check which PVC/volume the file-server Pod is using.

              2. Verify that the volume is not mounted by running the ``robin volume info`` command.

              3. Run the following command to cordon the node where the file server Pod is mounted.

                 .. code-block:: text

                    # kubectl cordon

              4. Run the following command to delete the file server Pod.

                 .. code-block:: text

                    # kubectl delete pod -n robinio

              5. Run the following command to uncordon the node you have cordoned in step 3.

                 .. code-block:: text

                    # kubectl uncordon

PP-28501      **Symptom**

              After upgrading from the existing Robin CNP to Robin CNP v5.4.3 with RWX applications, the NFS server related jobs are stuck.

              **Workaround**

              Perform the following steps:

              1. Run the following command to log in to Robin master Pod:

                 .. code-block:: text

                    # rbash master

              2. Run the following command to know the Mount State of the storage nodes:

                 .. code-block:: text

                    # stormgr node list

              3. Run the following command to unlock the ``stormgr`` CLI:

                 .. code-block:: text

                    # stormgr devl unlock

              4. Run the following command to unblock the blocked storage node:

                 .. code-block:: text

                    # stormgr node setstatus --block-mount 0

PP-28768      **Symptom**

              After upgrading to Robin CNP v5.4.3, you might notice that a cordoned node is uncordoned.

              **Workaround**

              Put the cordoned nodes in maintenance mode before upgrading. Alternatively, cordon the node again after upgrading to Robin CNP v5.4.3.

PP-28867      **Symptom**

              The ``robin chargeback report`` is not displaying the correct SSD drive price in the report. The report is showing 0.0 as the price.
PP-28922 **Symptom** When you try to restore a namespace snapshot, the job hangs as the PVCs are in the ``Pending`` state and displays this message: `Error: Invalid annotation 'robin.io/fstype' provided` **Workaround** To fix this issue, apply the following workaround: 1. Run the following command and get the list of the PVCs that are stuck in the ``Pending`` state for the namespace: .. code-block:: text # kubectl get pvc -n 2. Run the following command to patch each PVC to remove the ``robin.io/fstype`` from annotations: .. code-block:: text # kubectl edit pvc -n ns1 rwx-app-8 PP-28972 **Symptom** When you try to deploy a KVM-based app and override the NIC tags in the IP-Pool using the ``input.yaml`` file, you will see the following message: `Error: list index out of range` You will observe this issue as the bonded interface option is not supported for KVM deployments when the Calico interface is used. PP-29109 **Symptom** Robin CNP v5.4.3 does not support the Application Ephemeral Volume (AEV). Due to AEV nonsupport, operations involved with AEVs will fail. PP-29150 **Symptom** When creating an SRIOV or OVS IP pool with VLAN, Robin CNP mistakenly allows the creation of the SRIOV or OVS IP pool if any one of them has configured VLAN for its interface at the host level. For example: in a scenario where you have created an SRIOV IP pool with VLAN and VLAN is added to the SRIOV interface at the host level. At the same time, if you create an OVS IP Pool with the same VLAN but without adding VLAN for the OVS interface at the host level, the OVS IP pool creation succeeds without any error. However, in this example, when you try to deploy the Pod using the OVS IP pool, the Pod deployment fails at the ``Containercreating`` state without any error message. PP-29340 **Symptom** After upgrading from the existing Robin CNP to Robin CNP v5.4.3, RWX PVC Pods are stuck in the ``ContainerCreating`` state. **Workaround** Perform the following steps to generate a new FS UUID: 1. 
Run the following command to know the Pods that are stuck in the ``ContainerCreating`` state: .. code-block:: text # kubectl get pods -A | grep -v containercreating 2. Run the following commands to get the RWX volume used by these Pods: .. code-block:: text # kubectl describe pods # kubectl get pvc -A | grep 3. Run the following command to see the respective job output for the RWX volume with ``NFSAgentAddExport``: .. code-block:: text # robin job list | grep | grep NFSAgentAddExport **Example** .. code-block:: text # robin job list | grep pvc-b0f33e4d-6d1c-4d17-9ddf-0a67b9f1af51 | grep NFSAgentAddExport ->5840 | NFSAgentAddExport | Adding export for vol pvc-b0f33e4d- 6d1c-4d17-9ddf-0a67b9f1af51| COMPLETED | FAILED | 09 Aug 06:39:45 | 06:40:01 | 0:00:16 | [] | 1 | Command '/bin/mount/dev/sdo /var/lib/robin/nfs/robin-nfs-shared-60/ganesha/pvc-b0f33e4d-6d1c- 4d17-9ddf-0a67b9f1af51' failed with return code 32: mount: wrong fs type, bad option, bad superblock on /dev/sdo, missing codepage or helper program, or other error In some cases useful info is found in syslog - try dmesg | tail or so. 4. If you see the above error, run the following command to know the device and host on which this volume is mounted: .. code-block:: text # robin volume list | grep 5. Run the following command on the host to know the error: .. code-block:: text # cat /var/log/messages | grep **Example** .. code-block:: text # cat /var/log/messages | grep sdo Aug 9 06:40:00 asa-06 kernel: XFS (sdo): Filesystem has duplicate UUID 47762fc2-1e7c-4863-a551-0fe55b29d0c7 - can't mount 6. Run the following command to generate a new FS UUID for the respective device to be mounted: .. code-block:: text # xfs_admin -U generate PP-29441 **Symptom** After adding a Master node if a Patroni Pod is in the ``Pending`` state, apply the following workaround: **Workaround** 1. Run the following command to get the Patroni PV. .. 
code-block:: text # kubectl get pv robin-patroni-pv- -o jsonpath={.spec.local.path} **Example** .. code-block:: text # kubectl get pv robin-patroni-pv-0 -o jsonpath={.spec.local.path} /home/robinds/var/lib/pgsql/patroni 2. Edit the following YAML file to replace the following: - INDEX with the index of the pending Pod - PV_PATH with the path from the step 1 - NODE_Name with the node on which the Patroni Pod is not present .. code-block:: YAML apiVersion: v1 kind: PersistentVolume metadata: labels: app: robin-patroni name: robin-patroni-pv-<> spec: accessModes: - ReadWriteOnce capacity: storage: 30Gi claimRef: apiVersion: v1 kind: PersistentVolumeClaim name: pgdata-robin-patroni-<> namespace: robinio local: path: <> nodeAffinity: required: nodeSelectorTerms: - matchExpressions: - key: kubernetes.io/hostname operator: In values: - <> persistentVolumeReclaimPolicy: Delete storageClassName: local-storage volumeMode: Filesystem status: phase: Bound 3. After making the changes in step 2 save the file and run the following command: .. code-block:: text # kubectl create -f PP-29505 **Symptom** The Dashboard in the Robin CNP UI does not display the metrics data for the **CLUSTER CONTAINERS NETWORK DATA TRANSMITTED** section of the UI. PP-29509 **Symptom** You must stop metrics before starting the upgrade and restart after the upgrade. PP-29512 **Symptom** After upgrading to Robin CNP v5.4.3, you might observe Robin bundle applications deployed with replica 1 are in the ``NotReady`` status. **Workaround** If you observe this issue, apply the following workaround: Run the following command to make the applications healthy: .. code-block:: text # robin host probe --rediscover --wait PP-29521 **Symptom** After upgrading to Robin CNP v5.4.3, you might observe Pods stuck in the ``ContainerCreating`` state, and a warning event display this message: `Error: Input/Output error on device /dev/sdm` **Workaround** If you observe this issue, apply the following workaround: 1. 
Run the following command to unfault drives. .. code-block:: text # robin drive unfault 2. Run the following command to restart the IO manager. .. code-block:: text # systemctl restart iomgr-server 3. Bounce the application Pods. PP-29525 **Symptom** After upgrading to Robin CNP v5.4.3 from supported Robin CNP versions, communication with port 36443 might break because the ``haproxy.cfg`` is not pushed into the HAProxy docker container. As a result, you cannot access the Robin cluster using port 36443. PP-29528 **Symptom** In some scenarios, when a Pod with the ``robin.io/networks`` annotation is deleted, the ``net-attach-def`` entries are left behind. **Workaround** Delete the ``net-attach-def`` entries using the following command: .. code-block:: text # kubectl delete net-attach-def net-attach-def_name ============= ============================================================================================================================================================================================================================================================================================================================================================================================================ Technical Support ================= Contact `Robin Technical support `_ for any assistance. ====================================== Robin Cloud Native Platform v5.4.3 HF1 ====================================== The Robin Cloud Native Platform (CNP) v5.4.3 HF1 release has improvements, fixed issues, and known issues. 

**Release Date:** 19 April 2023

Infrastructure Versions
=======================

The following software applications are included in this CNP release:

==================== ==================
Software Application Version
==================== ==================
Kubernetes           1.26.0
Docker               19.03.9 (CentOS 7) and 20.10.8 (Rocky 8)
Prometheus           2.39.1
Prometheus Adapter   0.10.0
Node Exporter        1.4.0
Calico               3.24.3
HAProxy              2.4.7
PostgreSQL           14.6
Grafana              9.2.3
CRI Tools            1.25.0
==================== ==================

Upgrade Paths
==============

The following are the supported upgrade paths for Robin CNP v5.4.3 HF1:

* Robin CNP v5.4.3-120 (GA) to Robin CNP v5.4.3 HF1
* Robin CNP v5.4.3-237 (HF1-RC) to Robin CNP v5.4.3 HF1
* Robin CNP v5.3.11-217 (HF2) to Robin CNP v5.4.3 HF1

The upgrade procedure remains the same for all the hotfix versions of Robin CNP v5.4.3. For upgrade information, see `Upgrade Robin CNP Platform `__.

.. Note:: For a successful upgrade, you must run the ``possible_job_stuck.py`` script before and after the upgrade. Contact the Robin Support team for the upgrade procedure using the script.

Improvements
============

Enhanced GoRobin
----------------

Starting from Robin CNP v5.4.3 HF1, the GoRobin utility tool runs the preinstallation checks in parallel.

Support for V2 KV engine for the HashiCorp Vault integration
------------------------------------------------------------

Starting from Robin CNP v5.4.3 HF1, Robin CNP supports the V2 KV engine for the HashiCorp Vault integration.

Set faultdomain to ``host`` or ``rack`` for all RWX PVCs
--------------------------------------------------------

Starting from Robin CNP v5.4.3 HF1, for all storageclasses (custom storageclasses and storageclasses created by Robin) except the ``robin-rwx`` storageclass, you can set ``host`` or ``rack`` as the faultdomain for RWX PVCs.
If you set ``disk`` as the faultdomain, the RWX PVC is not provisioned and the following error is shown:

`For Access-Many volumes, replication should be more than 1 and faultdomain should be 'host'.`

For the ``robin-rwx`` storageclass, the default faultdomain is set to ``host`` for RWX PVCs. The ``disk`` and ``rack`` faultdomain options are not supported.

Support for deploying NFS server Pods on nodes with custom taints
-----------------------------------------------------------------

Starting from Robin CNP v5.4.3 HF1, to deploy NFS server Pods on Kubernetes nodes that have custom taints, you must update the ``nfs_pod_tolerations`` config attribute of the ``nfs`` section of the ``robin config`` to add tolerations for NFS server Pods.

The tolerations added through the config attribute take effect only for the newly created NFS server Pods. For the existing NFS server Pods, you must add tolerations manually.

Set default faultdomain to ``host`` for creating a Persistent Data Volume (PDV)
--------------------------------------------------------------------------------

Starting from Robin CNP v5.4.3 HF1, for creating a Persistent Data Volume (PDV), the valid options for faultdomain are ``host`` and ``rack``. The ``disk`` option for faultdomain is not supported. The default faultdomain is set to ``host``.

Default MTU value from physical interface
-----------------------------------------

Starting from Robin CNP v5.4.3 HF1, if you do not provide the MTU when creating an IP pool, Robin CNP uses the MTU value of the underlying physical device interface as the default MTU.
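
The faultdomain and MTU defaulting rules described in the improvements above can be summarized in a few lines of code. The following is an illustrative sketch only, not Robin CNP's implementation: the function names ``allowed_rwx_faultdomains``, ``resolve_pdv_faultdomain``, and ``effective_mtu`` are hypothetical, and the logic encodes only what these release notes state.

.. code-block:: python

   from typing import Optional, Set


   def allowed_rwx_faultdomains(storageclass: str) -> Set[str]:
       """Faultdomains accepted for RWX PVCs, per the HF1 release notes."""
       if storageclass == "robin-rwx":
           # For robin-rwx, only "host" is supported ("disk" and "rack" are not).
           return {"host"}
       # For all other storageclasses, "host" or "rack" can be set; "disk" is rejected.
       return {"host", "rack"}


   def resolve_pdv_faultdomain(requested: Optional[str] = None) -> str:
       """Return the PDV faultdomain, defaulting to "host" when none is given."""
       if requested is None:
           return "host"
       if requested not in {"host", "rack"}:
           raise ValueError(f"unsupported faultdomain for a PDV: {requested!r}")
       return requested


   def effective_mtu(pool_mtu: Optional[int], interface_mtu: int) -> int:
       """Use the IP pool's MTU if provided, else the physical interface's MTU."""
       return interface_mtu if pool_mtu is None else pool_mtu


   if __name__ == "__main__":
       print(sorted(allowed_rwx_faultdomains("robin-rwx")))
       print(resolve_pdv_faultdomain())
       print(effective_mtu(None, 9000))

A check of this kind can be run against a request before it is submitted, so that an unsupported ``disk`` faultdomain is caught early instead of surfacing as a provisioning error.
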
Fixed Issues ============ ============= ============================================================================================================================================================================================================================================================================================================================================================ Reference ID Description ============= ============================================================================================================================================================================================================================================================================================================================================================ PP-28768 After upgrading Robin CNP 5.4.1, you might notice that the cordened node is uncordened. This issue is fixed. PP-29303 The issue of KVMs not considering IP pool configurations (Spoofchk and Trustmode) is fixed. PP-29528 The issue of ``net-attach-def`` entries being left behind when a Pod with the ``robin.io/networks`` annotation is deleted is fixed. In 5.4.3 HF1, a schedule runs every 30 minutes and cleans up the stale static IPs and ``net-attach-def`` that are present in the cluster for more than 30 minutes. PP-29743 Token-based integration of HashiCorp Vault is no longer supported. PP-29090 When you deploy a Helm chart with static IP addresses and annotations in a StatefulSet, the Helm chart fails to assign the defined static IP addresses. However, it assigns different IP addresses from the same IP pool range. This issue is fixed. PP-29217 The issue of the sriov-device-plugin DaemonSet scaling down due to specific taints on nodes is fixed. PP-29441 The issue of Patroni Pod being in the Pending state after adding a master node is fixed. The steps to add the master node are updated. PP-29509 The issue of stopping metrics before starting the upgrade and restarting after the upgrade is fixed. 

PP-29512      After upgrading to Robin CNP v5.4.3, you might observe that Robin bundle applications deployed with replica 1 are in the NotReady status. This issue is fixed.

PP-29553      The issue of Robin cluster nodes not honoring taints applied on nodes after robin-server restart is fixed.

PP-29577      When you delete an SR-IOV pod and restart the Robin server, you might observe the SR-IOV deployment pods are in the create and terminate loop states. This issue is fixed.

PP-29582      When you stop and start the robin server during pod deployment, the robin network annotations might be ignored. This issue is fixed.

PP-29595      The issue of Robin CNP displaying the following error message when setting up Robin Client is fixed: `Error: You must be logged into the server (the server has asked for the client to provide credentials).`

PP-29634      The issue of ``kubectl top pods`` command output displaying the following error message is fixed: `Error: Metrics not available for pod.`

PP-29644      The issue of the Robin installer trying to access the Internet to download the install images and eventually the installation failing is fixed.

PP-29648      The issue of GoRobin failing to install on a 45-node cluster is fixed.

PP-29779      The issue of discrepancy in CPU core calculation while validating the rpool limit is fixed.

PP-29867      When you use node affinity and IP Pool annotations in a Deployment, it uses the IPs from the IP Pool. However, it fails to follow the Affinity rule. This issue is fixed.

PP-29939      Robin CNP scheduler is not calculating the guaranteed CPU utilization correctly and scheduling Pods on over-utilized nodes. This issue is fixed.

PP-30033      The upgrade process failed when upgrading to Robin CNP v5.3.11 due to a discrepancy in the number of CSI node plugin Pods between the pre- and post-upgrade. This issue is fixed.

PP-30050      The issue of the Robin CLI not working when stormgr is down is fixed.

PP-30160      The issue of node removal failing with the ``--force`` command option is fixed.
PP-30141 The issue of VolumeCreate job failing even though the cluster has enough resources is fixed. PP-30290 The issue of Istio mutating webhook configuration after the upgrade is fixed. PP-30296 When you try to move a File collection from a 3 Replica File collection to a single replica File collection fails. This issue is fixed. PP-30345 The issue of the upgrade process not failing even though the version and tag stamping in the config map failed is fixed. PP-30381 The issue of not being able to upload bundles to an online File collection is fixed. PP-30387 The issue of adding PDV volume failing when the Fault domain option ``disk`` is used is fixed. Starting with Robin CNP v5.4.3 HF1, the Fault domain option ``disk`` is not supported, and the default value for the Fault domain option is ``host``. ============= ============================================================================================================================================================================================================================================================================================================================================================ Known Issues ============= ============== ============================================================================================================================================================================================================================================================================================================================================================================================================ Reference ID Description ============== 
============================================================================================================================================================================================================================================================================================================================================================================================================ PP-28802 **Symptom** Robin Control Plane is failing to auto-recover in the following conditions: - Root FS is full - Out-of-memory condition - High CPU situations - Operating system Kernel crash **Workaround** Apply the following workaround steps to recover from this situation: 1. Clean up the disk to free up the space. You need a minimum of 50GB of disk space. 2. Reboot node. PP-29533 **Symptom** After moving all apps from an existing file collection to a new file collection and then powering off one of the nodes cause application not accessible. **Workaround** Delete file-server Pod for PVC to get mounted again. PP-29650 **Symptom** After failing an IOMGR Pod, the Pod might be in the ``RunContainerError`` state. **Workaround** Perform the following steps: 1. Run the following command to restart Docker: .. code-block:: text # service docker restart 2. Run the following command to remove the container: .. code-block:: text # docker rm -f PP-29850 **Symptom** After rebooting a node you might notice applications are stuck in ``ContainerCreating`` state with volumeattachments error. **Workaround** Bounce the Pods. PP-29866 **Symptom** All Pod deployments at least once go through the Pending or Terminating state before the deployment is successful. The behavior is the same for StatefulSet and Deployment. PP-29962 **Symptom** After upgrading from Robin CNP v5.3.11 to Robin CNP v5.4.3 HF1, the robin nfs export list might show the wrong entry. **Workaround** Contact the Robin support team for workaround steps. 
PP-30112 **Symptom** When you upgrade from Robin CNP v5.3.11 HF2 to v5.4.3 HF1, application Pods might be in the ``Terminating`` state. Run this command to check Pods: ``docker ps -a``. If the command output is not displaying the Pods, apply the following workaround. **Workaround** Forcefully delete the Pods that are in the ``Terminating`` state. Run the following command to delete: .. code-block:: text # kubectl delete pod -n --force PP-30119 **Symptom** When removing a node from Robin CNP, if the affinity is tied to local storage, instance relocate fails with the following error: `Unable to reallocate instance due to an affinity rule tying it to local storage.` PP-30149 **Symptom** After deploying Pods, if you restart Robin Server and delete Pods, some of the Pods might not come back online. **Workaround** Delete the Pods that are in the ``Pending`` state. PP-30173 **Symptom** When scheduling Pods, there might be a difference of 2 CPU cores between the Robin planner and Kubernetes planner due to resource calculation. For example, if only 2.2 cores are left on a node, a Pod deployed with a 2.2 CPU request will not get scheduled because Robin considers 2.2 cores as 2 cores. PP-30188 **Symptom** After upgrading to Robin CNP v5.4.3 HF1 from the supported version, you might notice the RWX app stuck in the ``ContainerCreating`` state. **Workaround** 1. Run the following command to find the NFS server Pod that is serving the PVC: .. code-block:: text # robin nfs export-list 2. Run the following command to delete the NFS server Pod: .. code-block:: text # kubectl delete pod -n robinio PP-30243 **Symptom** After upgrading from the supported Robin CNP v5.3.11 to Robin CNP v5.4.3 HF1, you might notice that a vnode deploy job runs continuously for an app. **Workaround** Run the following commands to rectify this issue. 1. Run the following command to rbash into the master node. .. code-block:: text # rbash master 2. Run the following command to restart the Robin server. ..
code-block:: text # service robin-server restart PP-30247 **Symptom** After upgrading from Robin CNP v5.3.11 to Robin CNP v5.4.3 HF1, the RWX apps might report the following error event: `wrong fs type, bad option, bad superblock on /dev/sdj, missing codepage or helper program, or other error` **Workaround** Contact the Robin support team for workaround steps. PP-30251 **Symptom** Adding a previously removed node back as a master using GoRobin fails on IPv4 setups. The Patroni Pod will be in the ``Pending`` state. No corresponding PVC is created; however, a corresponding PV is present. **Workaround** 1. Delete the pending Patroni Pod. 2. Check the robin host list output to ensure all hosts are ready. 3. Assign rpool. 4. Add the Storage and Compute roles to the newly added master. PP-30264 **Symptom** In Robin CNP v5.4.3, if you have ``cri=dockershim``, the robin_plugin process might have a memory leak. **Workaround** Use ``--storage-driver=overlay2`` if you are using ``--cri=dockershim``. PP-30298 **Symptom** After upgrading from the supported Robin CNP v5.3.11 to Robin CNP v5.4.3 HF1, if the upgrade fails and nodes are in the NotReady status, check for the following symptoms and apply the workaround: - IO hangs on a volume, which could cause application downtime, kubelet down, and the node showing the ``NotReady`` status. - iomgr-server on the affected node is in the defunct state. - ``ps -ef`` shows that ``tcmuctrl`` commands are stuck. **Workaround** Reboot the node where iomgr-server is in the defunct state. PP-30319 **Symptom** When you have a StatefulSet or Deployment with robinrpool, one of the Pods may not get scheduled by Kubernetes and it remains in the ``Pending`` state due to a mismatch in Robin planner calculation. **Workaround** Run the following command to delete the Pod that is in the ``Pending`` state: .. code-block:: text # kubectl delete pod PP-30339 **Symptom** After upgrading from Robin CNP v5.3.11 to Robin CNP v5.4.3 HF1, you might observe Helm app Pods in the ``Error`` state.
**Workaround** You need to stop and start the apps in the ``Error`` state. 1. Run the following command to stop the app: .. code-block:: text # robin app stop 2. Run the following command to start the app: .. code-block:: text # robin app start PP-30357 **Symptom** After you upgrade successfully from the supported Robin CNP v5.3.11 to Robin CNP v5.4.3 HF1, you might notice that a node is in the ``NotReady`` state. **Workaround** Run the following command to rectify this issue: .. code-block:: text # robin host probe --wait PP-30361 **Symptom** When you delete Pods with static IP and Affinity that are in the ``Running`` state and restart Robin Server at the same time, the Pods might be stuck in the ``ContainerCreating`` state and the Pod events show the following: `"NoNetworkFound" : cannot find a network-attachment-definition (robin-required) in namespace (kube-system): network-attachment-definitions.k8s.cni.cncf.io "robin-required" not found` **Workaround** Bounce the Pods that are in the ``ContainerCreating`` state. PP-30363 **Symptom** After upgrading to Robin CNP v5.4.3 HF1, you might notice that one of the nodes is in the ``NotReady`` state and the node description shows the following reason and message: **Reason:** KubeletNotReady **Message:** PLEG is not healthy: pleg was last seen active 1h25m32.701542224s ago; threshold is 3m0s. **Workaround** Run the following command to restart the Docker service: .. code-block:: text # service docker restart PP-30364 **Symptom** After you successfully add a new node using GoRobin to an existing cluster that has nodes associated with a custom rpool, the GoRobin tool will automatically assign the new node to the default rpool as the tool does not support custom rpools. In this scenario, apply the following workaround to associate the newly added node to the custom rpool. **Workaround** 1. Run the following command to remove the storage and compute roles from the newly added node: ..
code-block:: text # robin host remove-role storage,compute 2. Run the following command to assign the custom rpool to the newly added node: .. code-block:: text # robin host assign-rpool 3. Run the following command to add the storage and compute roles: .. code-block:: text # robin host add-role storage,compute PP-30386 **Symptom** When upgrading from Robin CNP v5.3.11 HF1 to Robin CNP v5.4.3 HF1, the NFS exports might be stuck in the ``ASSIGNED_ERR`` state. **Workaround** Contact the Robin support team for workaround steps. PP-30389 **Symptom** If you have added a range of blacklisted IPs in an unexpanded form, Robin CNP does not allow you to remove a range of blacklisted IPs from the IP Pool. It is recommended to use the expanded form when adding and removing a range of blacklisted IPs to an IP Pool. **Workaround** If you have added a range of blacklisted IPs in an unexpanded form, you need to remove the range from the database. Contact Robin customer support team to apply the workaround. PP-30394 **Symptom** The Robin CNP UI dashboard does not display the cluster memory usage and cluster storage available details. **Workaround** Complete the following steps to rectify this issue. 1. Run the following command to edit the Grafana ConfigMap: .. code-block:: text # kubectl edit cm robin-grafana-default-dashboards-config -n robinio 2. In the configmap, look for the following query: `"sum (container_memory_working_set_bytes{id=\"/\"}) by(kubernetes_io_hostname) / on(kubernetes_io_hostname) machine_memory_bytes * 100"` Replace the above query with the following query: `"sum (container_memory_working_set_bytes{id=\"/\"}) by(kubernetes_io_hostname) / on(kubernetes_io_hostname) machine_memory_bytes{job=\"kubernetes-cadvisor\"} * 100"` 3. Run the following command to delete the grafana-rs Pod: .. 
code-block:: text # kubectl delete pod robin-grafana-rs-mvjqh -n robinio PP-30398 **Symptom** After removing an offline master node from the cluster and power cycling it, the removed master node is automatically added back as a worker node. **Workaround** 1. Run the following command to remove the host: .. code-block:: text # robin host remove 2. Run the following command to remove the node .. code-block:: text # kubectl delete node 3. Run k8s-script cleanup and host-script cleanup on the to-be-removed node. ============== ============================================================================================================================================================================================================================================================================================================================================================================================================ Technical Support ================= Contact `Robin Technical support `_ for any assistance. ====================================== Robin Cloud Native Platform v5.4.3 HF2 ====================================== The Robin Cloud Native Platform (CNP) v5.4.3 HF2 release has a new feature, fixed issues, and known issues. **Release Date:** 07 July 2023 Infrastructure Versions ======================= The following software applications are included in this CNP release: ==================== ================== Software Application Version ==================== ================== Kubernetes 1.26.0 Docker 19.03.9 (CentOS 7) and 20.10.8 (Rocky 8) Prometheus 2.39.1 Prometheus Adapter 0.10.0 Node Exporter 1.4.0 Calico 3.24.3 HAProxy 2.4.7 PostgreSQL 14.6 Grafana 9.2.3 CRI Tools 1.25.0 ==================== ================== Upgrade Path ============ The following is the supported upgrade path for Robin CNP v5.4.3 HF2: * Robin CNP v5.4.3-281 (HF1) to Robin CNP v5.4.3-302 (HF2) The upgrade procedure remains the same for all the hotfix versions of Robin CNP v5.4.3. 
For upgrade information, see `Upgrade Robin CNP Platform `__. .. Note:: For a successful upgrade, you must run the ``possible_job_stuck.py`` script before and after the upgrade. Contact the Robin Support team for the upgrade procedure using the script. New Feature =========== Support for soft anti-affinity for Robin Bundles ------------------------------------------------ Starting with Robin CNP v5.4.3 HF2, you can enable soft anti-affinity for Robin Bundles. To enable soft anti-affinity for Robin Bundles, you must use the ``placeon_different_nodes_on_same_rack`` parameter in the Robin Bundle YAML file and set it to ``true``. **Example** .. code-block:: yaml appname: "centos-1" ippools: ["robin-default"] roles: - name: server1 placeon_different_nodes_on_same_rack: true ippools: - ippool: routes-2 - name: server2 placeon_different_nodes_on_same_rack: true ippools: - ippool: routes-2 - name: server3 placeon_different_nodes_on_same_rack: true #ippools: - ippool: routes-1 static_ips: "fd74:ca9b:3a09:86ba:a:b:c:d" Fixed Issues ============ ============= ============================================================================================================================================================================================================================================================================================================================================================ Reference ID Description ============= ============================================================================================================================================================================================================================================================================================================================================================ PP-30611 When the monitor server in Robin CNP fails to report to the robin-server, it waits for a long time before attempting to send the next report. 
As a result, heartbeats are missed for a long time, which results in host probe jobs. This issue is fixed. PP-30639 Source-based routing does not take effect when the IP pool ``--ranges`` option is used. This issue is fixed. PP-30864 The issue of the K8s collect watcher taking a long time to complete on a loaded cluster is fixed. PP-30883 After rebooting the nodes, sometimes, RWX Pods might be stuck in the ``CrashLoopBackOff`` state because of the ``Permission denied`` error. This issue is fixed. PP-30895 Importing users with capabilities from an LDAP group using the ``robin ldap import-group`` command results in an error. This issue is fixed. PP-30896 Starting from Robin CNP v5.4.3 HF2, the ``k8s_skipplan_mutatingwebhook`` config attribute is enabled by default in the ``robin config list``. PP-30897 The SRIOV annotations in a few of the Deployments are ignored because Pods are stuck in the ``validatingwebhook`` status and the robin-server is busy. As a result, the Pods do not have the required number of network interfaces. This issue is fixed. PP-30945 When you use a reused Persistent Volume for VolumeMount, the following error is displayed: 'NoneType' object is not subscriptable. This issue is fixed. PP-30951 The issue of choking the kube-controller and using more CPU because of the logging issues in Kubernetes v1.26.0 is fixed by setting the kube-controller log level from 7 to 4. It applies to both a new installation of Robin CNP v5.4.3 HF2 and an upgrade to Robin CNP v5.4.3 HF2. PP-30978 The issue of high CPU utilization by ``robin-util-cli`` is fixed.
============= ============================================================================================================================================================================================================================================================================================================================================================ Known Issues ============= ============== ============================================================================================================================================================================================================================================================================================================================================================================================================ Reference ID Description ============== ============================================================================================================================================================================================================================================================================================================================================================================================================ PP-30493 **Symptom** When you see Pods in the ``ContainerCreating`` status and the ``kubectl describe pod`` command output displays the following event, apply the workaround. `"NoNetworkFound" : cannot find a network-attachment-definition (robin-required) in namespace (kube-system): network-attachment-definitions.k8s.cni.cncf.io "robin-required" not found` **Workaround** Bounce the Pods that are in the ``ContainerCreating`` status. PP-30980 **Symptom** When you deploy multiple Pods at the same time, the Pods might come up slowly as mutation takes more time due to timeouts and multiple retries. 
PP-31068 **Symptom** After upgrading from Robin CNP v5.4.3 HF1 to Robin CNP v5.4.3 HF2, a few nodes might be in the ``Partial`` state because the drives (disks) attached to these nodes are in the ``OFFLINE`` state as per RCM but as per the stormgr, these drives are in the ``READY/ONLINE`` state. **Workaround** Contact Robin customer support team for workaround steps for this issue. PP-31070 **Symptom** After upgrading from Robin CNP v5.4.3 HF1 to Robin CNP v5.4.3 HF2, some Pods might be stuck in the ``ContainerCreating`` status due to the Calico CNI issue, and ``/var/log/calico/cni/cni.log`` shows the following error message: `[ERROR][3095966] plugin.go 580: Final result of CNI DEL was an error. error=error getting ClusterInformation: connection is unauthorized: Unauthorized` **Workaround** Bounce the ``calico-node-xxx`` Pod which is running on the node where those Pods are stuck. PP-31072 **Symptom** After upgrading from Robin CNP v5.4.3 HF1 to Robin CNP v5.4.3 HF2, the ``csi-nodeplugin`` Pod might be stuck in the ``Terminating`` state due to the IO manager issue. **Workaround** Reboot the node where the ``csi-nodeplugin`` Pod is stuck. ============== ============================================================================================================================================================================================================================================================================================================================================================================================================ Technical Support ================= Contact `Robin Technical support `_ for any assistance. ====================================== Robin Cloud Native Platform v5.4.3 HF3 ====================================== The Robin Cloud Native Platform (CNP) v5.4.3 HF3 release has a fixed issue and known issues. 
**Release Date:** 02 November 2023 Infrastructure Versions ======================= The following software applications are included in this CNP release: ==================== ================== Software Application Version ==================== ================== Kubernetes 1.25.7 or 1.26.0 (Default) Docker 19.03.9 (CentOS 7) and 20.10.8 (Rocky 8) Prometheus 2.39.1 Prometheus Adapter 0.10.0 Node Exporter 1.4.0 Calico 3.24.3 HAProxy 2.4.7 PostgreSQL 14.6 Grafana 9.2.3 CRI Tools 1.25.0 ==================== ================== Upgrade Paths ============== The following are the supported upgrade paths for Robin CNP v5.4.3 HF3: * Robin CNP v5.3.13-107 (HF3) to Robin CNP v5.4.3-355 (HF3) * Robin CNP v5.3.13-159 (HF3) to Robin CNP v5.4.3-355 (HF3) The upgrade procedure remains the same for all the hotfix versions of Robin CNP v5.4.3. For upgrade information, see `Upgrade Robin CNP Platform `__. .. Note:: For a successful upgrade, you must run the ``possible_job_stuck.py`` script before and after the upgrade. Contact the Robin Support team for the upgrade procedure using the script. 
Fixed Issues ============= ============== ============================================================================================================================================================================================================================================================================================================================================================================================================ Reference ID Description ============== ============================================================================================================================================================================================================================================================================================================================================================================================================ PP-30493 The issue of Pods being in the ``ContainerCreating`` status and the ``kubectl describe pod`` command output displays the following event is fixed. 
“NoNetworkFound” : cannot find a network-attachment-definition (robin-required) in namespace (kube-system): network-attachment-definitions.k8s.cni.cncf.io “robin-required” not found ============== ============================================================================================================================================================================================================================================================================================================================================================================================================ Known Issues ============= ============== ============================================================================================================================================================================================================================================================================================================================================================================================================ Reference ID Description ============== ============================================================================================================================================================================================================================================================================================================================================================================================================ PP-31522 **Symptom** After deleting a backup, unregistering a storage repo fails with the following error message: *Storage repo is associated with volume group* **Workaround** Complete the following steps: 1. Attach the repo to the App. .. code-block:: text # robin app attach-repo 2. Detach the repo from the App. .. code-block:: text # robin app detach-repo 3. Unregister the repo. .. 
code-block:: text # robin repo unregister PP-32259 **Symptom** When upgrading from Robin CNP v5.3.13-159 (HF3) to Robin CNP v5.4.3-355 (HF3), some of the jobs might fail with the following error: *FATAL: remaining connection slots are reserved for non-replication superuser connections* **Workaround** Contact the Robin Customer Support team for the workaround steps. PP-32288 **Symptom** When all nodes of a cluster are rebooted after installing Robin CNP v5.4.3-355 (HF3), one of them remains in the ``NotReady`` state because it is missing from the consul members list. **Workaround** Start the consul-client: .. code-block:: text # systemctl start consul-client PP-32334 **Symptom** When upgrading from Robin CNP v5.3.13-159 (HF3) to Robin CNP v5.4.3-355 (HF3), the ``robin-file-server`` fails to come up. **Workaround** Run the following command to unmount the file collection volume: .. code-block:: text # robin volume unmount PP-32385 **Symptom** When upgrading from Robin CNP v5.3.13-159 (HF3) to Robin CNP v5.4.3-355 (HF3), sometimes, the ``etcd`` Pods get stuck in the ``CrashLoopBackOff`` status and kubectl logs show the following error because of the faulted ``etcd`` instance: *wal: max entry size limit exceeded* **Workaround** You need to remove the faulted ``etcd`` instance from the node. Complete the following steps to remove the faulted ``etcd`` instance from the node: 1. Log in to the node that has the faulted ``etcd`` instance. 2. Stop the ``etcd`` and ``api-server`` Pods running on this node by moving the ``etcd.yaml`` and ``kube-apiserver.yaml`` files: .. code-block:: text # mv /etc/kubernetes/manifests/kube-apiserver.yaml . # mv /etc/kubernetes/manifests/etcd.yaml . 3. Move the ``/var/lib/etcd`` directory as well. .. code-block:: text # mv /var/lib/etcd /var/lib/etcd_old 4. Check for the node where the ``etcd`` leader is running: .. code-block:: text # retcd get-leader 5. Log in to the node where the ``etcd`` leader is running. 6.
Check for the member ID of the faulted ``etcd`` instance: .. code-block:: text # retcd get-members 7. Remove the faulted ``etcd`` instance from the cluster: .. code-block:: text # retcd remove-member --member-id= 8. Add the ``etcd`` instance back to the cluster using the docker exec command: .. code-block:: text # docker exec etcdctl --cacert /etc/kubernetes/pki/etcd/ca.crt --cert /etc/kubernetes/pki/etcd/peer.crt --key /etc/kubernetes/pki/etcd/peer.key --endpoints :2379 member add --peer-urls=:2380 9. Log in to the node that had the faulted ``etcd`` instance. 10. Make sure that the ``--initial-cluster`` option contains the entries for all three master nodes in the ``node_name=node_IP`` format and the ``--initial-cluster-state=existing`` option is available in the ``etcd.yaml`` file. If this option is not available, add it. 11. Move the ``etcd.yaml`` and ``kube-apiserver.yaml`` files back to the ``/etc/kubernetes/manifests`` directory: .. code-block:: text # mv etcd.yaml /etc/kubernetes/manifests # mv kube-apiserver.yaml /etc/kubernetes/manifests PP-32463 **Symptom** After installing Robin CNP v5.4.3-355 (HF3), you might face File Collection creation failures. **Workaround** 1. Grep for all Robin worker Pods. .. code-block:: text # kubectl get pods -n robinio | grep worker 2. Bounce all of the worker Pods. .. code-block:: text # kubectl delete pods 3. Try creating a File collection. .. code-block:: text # robin collection create --collection_type --replicas --size PP-32477 **Symptom** If you notice any of the Robin Patroni Pods in the ``starting`` state, follow the workaround steps below. **Workaround** 1. Exec into the unhealthy Robin Patroni instance. .. code-block:: text # kubectl exec -ti -n robinio -- bash 2. Restart the Robin Patroni instance. .. code-block:: text # patronictl restart robin-patroni 3. If the previous command does not work, run the following command to restart the Patroni agent running in the Pod. ..
code-block:: text # sv restart patroni PP-32497 **Symptom** When a cluster reboots, you might observe that one or more Pods are stuck in the ``ContainerCreating`` status because the volume mount fails. Also, you can notice the following type of error message in the ``kubectl describe pod`` command output: *volume 1698488235:1 has GET error for volume attachment csi-204e799ca58418a5f0e1b0d4193fd8b0908dbe290ae52ef810a8c2964a12c202: volumeattachments.storage.k8s.io "csi-204e799ca58418a5f0e1b0d4193fd8b0908dbe290ae52ef810a8c2964a12c202" is forbidden:User "system:node:qct-09.robinsystems.com" cannot get resource "volumeattachments" in API group "storage.k8s.io" at the cluster scope: no relationship found between node 'qct-09.robinsystems.com'and this object* **Workaround** Cordon the node and bounce the required Pods. 1. To cordon the node where the failed Pod is scheduled, run the following command. .. code-block:: text # kubectl cordon 2. Bounce the Pod by running the following command so that the Pod moves to a different node. .. code-block:: text # kubectl delete pod -n PP-32515 **Symptom** When a cluster reboots, you might notice that robin-worker node Pods are stuck in the ``CrashLoopBackOff`` status and the following error is observed in the log file of the Robin master Pod. *psycopg2.errors.ReadOnlySqlTransaction: cannot execute CREATE EXTENSION in a read-only transaction* You can find the log file at: ``/var/log/robin/robin-master-bootstrap.log`` **Workaround** Check for the error in the log file and bounce the Robin master Pod. 1. To find the Robin master Pod, run the following command: .. code-block:: text # kubectl get pod -n robinio | grep robin-master 2. Exec into the Robin master Pod and check for the error. If you find the error mentioned above, bounce the Robin master Pod. 3. To bounce the master Pod, run the following command. ..
code-block:: text # kubectl delete pod -n PP-32517 **Symptom** In a rare scenario, upgrade from Robin CNP v5.3.13-159 (HF3) to Robin CNP v5.4.3-355 (HF3) might fail with this message: ``Failed to start dockershim CRI``. **Workaround** Restart Docker and Dockershim. 1. To restart docker, run the following command. .. code-block:: text # systemctl restart docker 2. To restart Dockershim, run the following command. .. code-block:: text # systemctl restart dockershim 3. Retry the upgrade. PP-32523 **Symptom** After upgrading to Robin CNP v5.4.3-355 (HF3), you might notice some of the Pods might be in the ``ContainerCreating`` status. **Workaround** Bounce the Pods that are in the ``ContainerCreating`` status. To bounce the Pods, run the following command: .. code-block:: text # kubectl delete pod -n ============== ============================================================================================================================================================================================================================================================================================================================================================================================================ Technical Support ================= Contact `Robin Technical support `_ for any assistance. ====================================== Robin Cloud Native Platform v5.4.3 HF4 ====================================== The Robin Cloud Native Platform (CNP) v5.4.3-395 (HF4) release has a new feature, improvements, fixed issues, and known issues. 
**Release Date:** 13 December 2023 Infrastructure Versions ======================= The following software applications are included in this CNP release: ==================== ================== Software Application Version ==================== ================== Kubernetes 1.26.0 Docker 19.03.9 (CentOS 7) and 20.10.8 (Rocky 8) Prometheus 2.39.1 Prometheus Adapter 0.10.0 Node Exporter 1.4.0 Calico 3.24.3 HAProxy 2.4.7 PostgreSQL 14.6 Grafana 9.2.3 CRI Tools 1.25.0 ==================== ================== Upgrade Path ============= The following is the supported upgrade path for Robin CNP v5.4.3-395 (HF4): * Robin CNP v5.4.3-302 (HF2) to Robin CNP v5.4.3-395 (HF4) The upgrade procedure remains the same for all the hotfix versions of Robin CNP v5.4.3. For upgrade information, see `Upgrade Robin CNP Platform `__. .. Note:: 1. For a successful upgrade, you must run the ``possible_job_stuck.py`` script before and after the upgrade. Contact the Robin Support team for the upgrade procedure using the script. 2. You must configure Calico Typha for clusters with more than 50 nodes. For more information, see `Configure Calico Typha `__. 3. After upgrading to Robin CNP v5.4.3-395 (HF4), you must run the ``robin schedule update K8sResSync k8s_resource_sync 60000`` command to update the **robin schedule K8sResSync**. New Feature ============ Support of soft affinity ------------------------- Robin CNP v5.4.3-422 (HF4) supports the soft affinity feature with a few limitations. In Kubernetes, soft affinity is a way of guiding the Kubernetes Scheduler to place Pods based on preferences rather than strict requirements. This preference increases the likelihood of co-locating certain Pods on the same node, while still allowing the Kubernetes Scheduler to make adjustments based on resource availability and other constraints. For more information, see `Affinity and anti-affinity `__.
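
A minimal sketch of a soft Pod affinity rule in standard Kubernetes syntax is shown here for illustration. The Deployment name, labels, and image are hypothetical, and these notes do not state whether Robin CNP needs any additional annotations for this feature. The rule uses a single ``weight`` and the ``In`` operator, in line with the limitations listed in this section.

.. code-block:: yaml

   apiVersion: apps/v1
   kind: Deployment
   metadata:
     name: web-frontend              # hypothetical name
   spec:
     replicas: 3
     selector:
       matchLabels:
         app: web-frontend
     template:
       metadata:
         labels:
           app: web-frontend
       spec:
         affinity:
           podAffinity:
             # "preferred" makes this a soft rule: a preference, not a requirement
             preferredDuringSchedulingIgnoredDuringExecution:
             - weight: 100            # single weight only
               podAffinityTerm:
                 labelSelector:
                   matchExpressions:
                   - key: app
                     operator: In     # DoesNotExist, Gt, and Lt are not supported
                     values:
                     - cache          # hypothetical label of the Pods to co-locate with
                 topologyKey: kubernetes.io/hostname
         containers:
         - name: web
           image: nginx:1.25          # hypothetical image

Because the rule is listed under ``preferredDuringSchedulingIgnoredDuringExecution``, the scheduler treats it as a preference and can still place the Pod on another node when no matching node has capacity.
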
**Limitations**

The following are the limitations of soft affinity and anti-affinity support:

- These operators are not supported: **DoesNotExist**, **Gt**, and **Lt**.
- Multiple weight parameters for node and Pod affinity are not supported.
- Soft anti-affinity doesn't check or match for the label selector coming from a different Deployment.
- During a complete cluster restart, if all nodes are not up at the same time, Pods will not be spread across nodes with soft anti-affinity.
- After a Pod restart, it might not come back on the same node.
- After downsizing the number of replicas in a Deployment, soft Pod anti-affinity might not delete the Pods in the same order as creation.
- As the affinity information is handled in the cache, restarting the robin-server will flush the cache, resulting in scaled-up Pods not being placed as per anti-affinity.
- Creating, deleting, or recreating Pods multiple times will not honour soft affinity.
- Pods will be unequally distributed on nodes when all Pods in a Deployment are deleted.

Improvements
=============

Support for Isolated shared CPU on Kubernetes CPU manager
----------------------------------------------------------

Starting from Robin CNP v5.4.3-395 (HF4), support for configuring isolated shared CPUs on the Kubernetes CPU Manager is available. To use the isolated shared CPU on the Kubernetes CPU Manager, the CPUs must be from the isolated pool of CPUs on a host.

Relaxation in NIC bonding policy
---------------------------------

Starting from Robin CNP v5.4.3-395 (HF4), Robin considers the NIC bonding interface up if at least one of the two interfaces used to create the bond interface is up.

Fixed Issues ============= ============== ============================================================================================================================================================================================================================================================================================================================================================================================================ Reference ID Description ============== ============================================================================================================================================================================================================================================================================================================================================================================================================ PP-30394 The Robin CNP UI dashboard does not display the cluster memory usage and cluster storage available details. This is fixed. PP-30673 The issue of kernel.core_pattern getting changed to **|/bin/false** from **/var/crash/core.%e.%p.%h.%t** after restarting the robin-worker Pod or robin-iomgr Pod is fixed. PP-30980 When you deploy multiple Pods at the same time, the Pods might come up slowly as mutation takes more time due to timeouts and multiple retries. This issue is fixed. PP-31294 The issue of Robin CNP considering a guaranteed number of CPUs as shared CPUs in a Pod when there are both shared and guaranteed CPUs is fixed. PP-31664 The issue of nodes stuck in reboot due to mount errors is fixed. PP-32334 The issue of the robin-file-server failing to come up when upgrading from Robin CNP v5.3.13 to Robin CNP v5.4.3 HF3 is fixed. PP-32405 When there are many Pods in the Kubernetes Scheduler and if they are taking time to come up, then the NFS Server Pod might take more than 10 minutes to come up and the job might time out. This issue is fixed. 
PP-32461 The issue of StatefulSet Pods using static IP ranges not resuming with the same IPs after restarting Pods is fixed. PP-32498 The issue of the snapshot-controller Pod stuck in the ``ContainerCreating`` status and ``Terminating loop`` status after node reboot is fixed. PP-32525 The issue of 5G NF Pods not being deployed using the Helm chart due to a webhook timeout is fixed. PP-32620 The issue of mutation timeout failure is fixed. ============== ============================================================================================================================================================================================================================================================================================================================================================================================================ Known Issues ============= ============== ============================================================================================================================================================================================================================================================================================================================================================================================================ Reference ID Description ============== ============================================================================================================================================================================================================================================================================================================================================================================================================ PP-31790 **Symptom** Sometimes, discrepancy in CPU core calculation is observed for maximum period of 17 minutes when validating the tenant rpool limit. 
PP-32555 **Symptom** Assigning a static IP address for KVM based apps from a secondary IP-Pool is not supported through the Robin CNP UI. **Workaround** Assign the static IP address for KVM based apps from a secondary IP-Pool using the CLI. PP-32647 **Symptom** After upgrading to Robin CNP v5.3.4-395 (HF4), if the StatefulSet Pods are deleted, they may not retain the same IPs due to the implementation of the **staticip_ordered_assignment** parameter. The **staticip_ordered_assignment** parameter is a new configuration parameter added as part of the Robin CNP v5.4.3-395 (HF4). This config parameter is by default set to **True**. When it is set to **True**, the IPs provided in the network annotations will be assigned serially to the Pods. PP-32713 **Symptom** The Robin log collection operation might fail with the following error message: *Creation of storage for file collection failed, and the File server Pod might be in the Terminating status* **Workaround** Rerun the log collection again using the following command for recreating the file server Pod. .. code-block:: text # robin log collect robin-storage PP-32770 **Symptom** In a rare scenario, after rebooting the nodes, KVM based apps might be stuck in the **Error** state with the following error: *Unable to satisfy max guaranteed CPU requirements* **Workaround** Restart the respective apps manually. .. code-block:: text # robin instance start ============== ============================================================================================================================================================================================================================================================================================================================================================================================================ Technical Support ================= Contact `Robin Technical support `_ for any assistance. 
=======================================================
Robin Cloud Native Platform v5.4.3 HF4 Point Patch-1
=======================================================

The Robin Cloud Native Platform (CNP) v5.4.3-422 (HF4 Point Patch-1) release has new features, fixed issues, and known issues.

**Release Date:** 06 March 2024

New Features
============

Robin User Certificate Management
-----------------------------------

Robin CNP v5.4.3 HF4 Point Patch-1 enables you to manage the Robin user security certificate. Robin CNP creates a TLS certificate when a user is created. By default, the user certificate is valid for one year from the date the user is added, and it is renewed automatically when it is about to expire, depending on the set configuration and scheduler status. The scheduler runs as per the set configuration. For more information, see `User Certificate Management `_.

Robin CNP provides new CLI commands to manage user certificates. The following are the commands, and each command has its own options:

- ``robin user-cert check`` - Enables checking the status of user certificates and user certificate configuration details.
- ``robin user-cert renew`` - Enables renewing user certificates for all users, setting an offset period for checking validity, and performing a dry run to verify the renewal process.
- ``robin user-cert update`` - Enables configuring the user certificate settings: the life span of the certificate (the minimum life span is one day), the time difference between each user’s certificate renewal checks, and the certificate renewal offset time.
- ``robin user-cert stop`` - Enables stopping the certificate validity scheduler checks.
- ``robin user-cert start`` - Enables restarting the stopped validity scheduler checks.

.. Note::

   In this release, managing user certificates is not supported from the Robin CNP UI.

Support for Millicore CPUs for Robin Bundle Apps
-------------------------------------------------

Robin CNP v5.4.3 HF4 Point Patch-1 supports the millicore CPU unit for a container. You can now specify a fractional value of a CPU unit when defining the CPU resource for a container in the Robin Bundle.

Using the following files, you can specify the millicore CPUs for Robin Bundle apps:

- **Bundle’s template file** - To specify the millicore CPUs in the **template** file of a Robin Bundle app, specify the millicore CPU in the ``min`` and ``max`` keys of the ``compute.cpu`` attribute:

  .. code-block:: yaml
     :emphasize-lines: 3, 4

     cpu:
       reserve: true
       min: 1.03
       max: 1.03

- **Bundle’s input.yaml file** - To specify the millicore CPUs in the ``input.yaml`` file of a Robin Bundle app, specify the millicore CPU in the ``min`` and ``max`` keys of the ``containers.cpu`` attribute:

  .. code-block:: yaml
     :emphasize-lines: 3, 4

     cpu:
       reserve: true
       min: 1.03
       max: 1.03

- **Bundle’s manifest.yaml file** - To specify the millicore CPUs in the ``manifest.yaml`` file of a Robin Bundle app, specify the millicore CPU in the ``core`` key of the ``compute.cpu`` attribute:

  .. code-block:: yaml
     :emphasize-lines: 3

     cpu:
       reserve: true
       core: 1.03

.. Note::

   The ``min`` and ``max`` keys are applicable only for main containers.

**Limitations**

- Robin CNP does not support configuring the millicore CPUs through the CNP UI.
- Robin CNP does not support the Chargeback feature for millicore CPUs.
- A millicore CPU unit such as ``500m`` is not supported in the template of a Robin Bundle app.
- When you specify guaranteed CPUs with millicore values, they are not true guaranteed CPUs but Quality of Service (QoS) guaranteed CPUs. This is Kubelet behavior. For true guaranteed CPUs, specify only integer values.

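The last limitation reflects standard Kubernetes behavior, which the following sketch illustrates with two plain Pod specs rather than Robin Bundle files. Both Pods are in the Guaranteed QoS class because their requests equal their limits. Under the kubelet static CPU Manager policy, however, only the integer CPU request is eligible for exclusive CPUs, while the fractional request runs in the shared CPU pool. The Pod names and the image are hypothetical, and the sketch assumes that the static CPU Manager policy is enabled on the node.

.. code-block:: yaml

   # Fractional CPU: Guaranteed QoS class, but no exclusive CPUs.
   apiVersion: v1
   kind: Pod
   metadata:
     name: fractional-cpu-pod       # hypothetical name
   spec:
     containers:
     - name: app
       image: nginx:1.25            # hypothetical image
       resources:
         requests:
           cpu: "1030m"             # same quantity as 1.03 CPUs
           memory: "1Gi"
         limits:
           cpu: "1030m"
           memory: "1Gi"
   ---
   # Integer CPU: Guaranteed QoS class and eligible for exclusive CPUs.
   apiVersion: v1
   kind: Pod
   metadata:
     name: integer-cpu-pod          # hypothetical name
   spec:
     containers:
     - name: app
       image: nginx:1.25            # hypothetical image
       resources:
         requests:
           cpu: "2"
           memory: "1Gi"
         limits:
           cpu: "2"
           memory: "1Gi"
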
Fixed Issues ============= ============== ============================================================================================================================================================================================================================================================================================================================================================================================================ Reference ID Description ============== ============================================================================================================================================================================================================================================================================================================================================================================================================ PP-32758 The issue of the same event ID for two different events, which is not allowing users to add the event ID to the subscription list, is fixed. PP-33203 In prior CNP releases, the Robin user certificates with a default validity of one year would expire after the validity period, resulting in users being unable to perform app-level tasks. With this release, the user certificates will automatically renew before the expiration date. And Robin CNP allows you to manage the user certificates using the new CLI options. PP-33253 The issue of the difference in allocated memory display between Robin CNP UI and CLI is fixed. PP-33255 The issue of Bundle app creation failing when you provide values for the CPU attributes as shown below in the manifest YAML file, is fixed. .. code-block:: text cpu: cores: 8 reserve: true PP-33446 In Robin CNP v5.4.3 HF2, the ``cpuset.mems`` parameter was incorrectly set to ``0-1`` for guaranteed QoS Pods, which means the resources could be allocated across NUMA boundaries. With this release, the ``cpuset.mems`` parameter is set to the desired NUMA. 
PP-33516 The issue of inflight resources being held by CNP when a Helm chart or Deployment is deleted during the Pod planning phase is fixed. However, there is another known issue when there is a non-graceful termination of a Pod. For more information, see **PP-33628** under the **Known Issues** section. ============== ============================================================================================================================================================================================================================================================================================================================================================================================================ Known Issues ============= ============== ============================================================================================================================================================================================================================================================================================================================================================================================================ Reference ID Description ============== ============================================================================================================================================================================================================================================================================================================================================================================================================ PP-33501 **Symptom** Robin CNP does not support the Chargeback feature for millicore CPUs. PP-33596 **Symptom** Robin CNP does not support the user certificates feature using the CNP UI. PP-33628 **Symptom** In some cases, after the Helm app uninstall, a non-graceful deletion of a Pod, or a StatefulSet Pod deletion, the inflight resources might be held by Robin CNP. 
To check this, run the following command: .. code-block:: text # robin inflight-resources info PP-33670 **Symptom** In scenarios like cluster failover or reboot, Robin CNP may fail to access the devices and mark them as FAULTED erroneously. However, the device might not have issues. **Workaround** Contact the Robin Customer Support team if you observe this issue. PP-33679 **Symptom** When a Master Pod fails over due to network partition on a node, the Master Pod might stuck in the ``Init`` state. Apply the following workaround if you see this issue: **Workaround** Restart the Calico Pod on the node where you are seeing the issue. PP-33725 **Symptom** In the following scenarios, app creation using a snapshot (robin app create from snapshot) fails if you use values other than the values of the parent application. - Fractional values are different from the parent app values. - Modified CPU value (int or float) of an app snapshot with guaranteed CPU. ============== ============================================================================================================================================================================================================================================================================================================================================================================================================ Technical Support ================= Contact `Robin Technical support `_ for any assistance.