*********************** Release Notes *********************** =================================== Robin Cloud Native Platform v5.3.11 =================================== The Robin Cloud Native Platform (CNP) v5.3.11 release has new features, an improvement, a bug fix, and known issues. **Release Date:** 10 November 2021 Infrastructure Versions ----------------------- The following software applications are included in this CNP release. ==================== ======== Software Application Version ==================== ======== Kubernetes 1.21.5 Docker 19.03.9 Prometheus 2.16.0 Node-exporter 1.1.2 Calico 3.12.3 HA-Proxy 1.5.18 PostgreSQL 9.6.11 Grafana 6.5.3 ==================== ======== Upgrade Path ------------- The following is the supported upgrade path for Robin CNP v5.3.11: * Robin v5.3.9-286 (GA) to Robin v5.3.11-69 (GA) New Features ------------ --------------------------- Support for NVIDIA A100 MIG --------------------------- Robin CNP v5.3.11 supports the Multi-Instance GPU (MIG) mode of operation for the NVIDIA A100 GPU. Robin allows you to use partitioned GPUs in Robin bundles and also supports chargeback functionality for these GPU partitions. A sketch of a Pod that requests a MIG partition appears at the end of the Improvement section below. ------------------- Rocky Linux Support ------------------- Starting with Robin CNP v5.3.11, Rocky Linux 8.4 is supported. You can install Robin CNP v5.3.11 on this version of Linux. ----------------------------------------- Support for Application Ephemeral Volumes ----------------------------------------- Robin CNP v5.3.11 supports Application Ephemeral Volumes (AEVs). An AEV is temporary storage that Robin bundle applications can use, and it exists only while its application is running. When you create an application, the AEV is created for it to use, and Robin reclaims its space when the application stops. Other applications can use the reclaimed storage space, and Robin provisions the storage space back to the application when it starts again. You can add AEVs only from the Robin UI when creating an application using a Robin bundle. You can also create templates of applications with AEVs for future use. You can add a maximum of 10 AEVs per application. Improvement ----------- --------------------------------------- Support for @ Symbol in Robin Usernames --------------------------------------- Starting with Robin CNP v5.3.11, you can use the @ symbol as part of Robin usernames. This enables you to use email addresses as usernames.
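The following is a minimal sketch of how a workload could request one of the A100 MIG partitions described above. It assumes the cluster advertises the partitions through the NVIDIA device plugin's MIG resource names (for example, ``nvidia.com/mig-1g.5gb``); the Pod name, namespace, and image are illustrative and are not taken from this release.

.. code-block:: yaml

   # Illustrative sketch only: assumes MIG partitions are exposed as
   # nvidia.com/mig-1g.5gb resources by the NVIDIA device plugin.
   apiVersion: v1
   kind: Pod
   metadata:
     name: mig-demo                  # hypothetical name
     namespace: default
   spec:
     restartPolicy: Never
     containers:
     - name: cuda-app
       image: nvidia/cuda:11.0-base  # placeholder image
       command: ["nvidia-smi", "-L"] # list the GPU partition visible to the Pod
       resources:
         limits:
           nvidia.com/mig-1g.5gb: 1  # one 1g.5gb partition of an A100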
Fixed Issue ------------- ============= ========================================================================================================================================================================== Reference ID Description PP-24202 The security issue with SSL Medium Strength Cipher Suites is fixed by supporting Strong Cipher Suites with keys longer than 128 bits in Robin CNP services. The following is the list of supported Strong Cipher Suites: * TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256 * TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384 * TLS_ECDHE_RSA_WITH_CHACHA20_POLY1305 * TLS_RSA_WITH_AES_128_CBC_SHA * TLS_RSA_WITH_AES_256_CBC_SHA * TLS_RSA_WITH_AES_128_GCM_SHA256 * TLS_RSA_WITH_AES_256_GCM_SHA384 * TLS_ECDHE_RSA_WITH_AES_256_CBC_SHA The above-mentioned Strong Cipher Suites are supported in the following Robin CNP services: * K8s API server * K8s controller manager * K8s scheduler * K8s kubelet * ROBIN UI https server * ROBIN event server ============= ========================================================================================================================================================================== Known Issues ------------- ============= ============================================================================================================================================================================================================================================ Reference ID Description PP-24270 **Symptom** In Robin CNP, Kubelet might go down due to the presence of a stale ``cpu_manager_state`` file. **Workaround** Complete the following steps to fix this issue: 1. Remove the stale ``/var/lib/kubelet/cpu_manager_state`` file using the following command: .. code-block:: text # rm -rf /var/lib/kubelet/cpu_manager_state 2. Restart Kubelet by running the following command: .. code-block:: text # systemctl restart kubelet 3. Make sure the ``etcd`` and ``apiserver`` Pods on this node are up and running. PP-24248 **Symptom** When you create a new resource pool, assign it to a node, and later try to deploy a Pod with storage affinity on that node, the Pod deployment fails because the node does not pick up the newly assigned resource pool. **Workaround** Complete the following steps to fix this issue: 1. Run the following command to edit the node. .. code-block:: text # kubectl edit node 2. Remove the ``robin.io/robinrpool`` resource pool. 3. Add the correct resource pool name. PP-22853 **Symptom** GPUs might not be detected after CNP installation, upgrade, or addition of a new node. **Workaround** Run the following host probe rediscover command: .. code-block:: text # robin host probe --rediscover PP-22626 **Symptom** If NVIDIA GPU drivers are already installed on your setup, operator deployments might fail. **Workaround** Complete the following steps to fix this issue: 1. Remove the NVIDIA driver package: .. code-block:: text # yum remove nvidia-driver-latest-dkms 2. Remove the NVIDIA container toolkit: .. code-block:: text # yum remove nvidia-container-toolkit 3. Reboot the node. PP-21832 **Symptom** After you reboot a node, it might be in the ``NotReady`` state. **Workaround** Complete the following steps to fix this issue: 1. Restart Kubelet: .. code-block:: text # systemctl restart kubelet 2. Restart Dockershim: .. code-block:: text # systemctl restart dockershim 3. Restart robin-cri: .. code-block:: text # docker restart robin-cri PP-22781 **Symptom** After removing a taint on a master node, GPUs are not detected automatically. **Workaround** You need to run the ``robin host probe --rediscover --all --wait`` command for the GPUs to be detected on the primary master node. ============= ============================================================================================================================================================================================================================================ Technical Support ----------------- Contact `Robin Technical support `_ for any assistance. ======================================= Robin Cloud Native Platform v5.3.11 HF1 ======================================= The Robin CNP v5.3.11 HF1 release has improvements, bug fixes, and known issues. **Release Date:** 08 December 2021 Infrastructure Versions ----------------------- The following software applications are included in this CNP release. ==================== ======== Software Application Version ==================== ======== Kubernetes 1.21.5 Docker 19.03.9 Prometheus 2.16.0 Node-exporter 1.1.2 Calico 3.12.3 HA-Proxy 1.5.18 PostgreSQL 9.6.11 Grafana 6.5.3 ==================== ======== Upgrade Paths ------------- The following are the supported upgrade paths for Robin CNP v5.3.11 HF1: * Robin v5.3.5 (HF3) **to** Robin v5.3.11 (HF1) * Robin v5.3.5 (HF5) **to** Robin v5.3.11 (HF1) * Robin v5.3.9 (GA) **to** Robin v5.3.11 (HF1) * Robin v5.3.11 (GA) **to** Robin v5.3.11 (HF1) Improvements ------------ --------------------------------------------------------------------- Network Planning Support for Apps with Pod Affinity and Anti-affinity --------------------------------------------------------------------- Robin CNP v5.3.11 HF1 provides network planning support for apps with Pod affinity and anti-affinity (see the sketch at the end of this section). -------------------------------------------- Application Ephemeral Volumes UI Improvement -------------------------------------------- The APPLICATION EPHEMERAL STORAGE section of the Robin UI in Robin CNP v5.3.11 HF1 is improved to display the following drop-down options for replication: * Storage-Compute Affinity * Not Replicated * Replicated (2 copies) * Replicated (3 copies) .. Note:: These options appear when you create AEVs from the Robin UI if the AEVs are not defined in a Robin bundle manifest file.
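The following is a minimal sketch of the kind of app the network planner now handles: a Pod that combines a standard Kubernetes ``podAntiAffinity`` rule with a Robin network annotation. The labels, image, and IP-Pool name are placeholders, and the JSON-style value shown for ``robin.io/networks`` is only an assumed format; refer to the Robin CNP networking documentation for the exact annotation value.

.. code-block:: yaml

   # Illustrative sketch only: labels, image, and IP-Pool name are placeholders.
   apiVersion: v1
   kind: Pod
   metadata:
     name: web-replica                                     # hypothetical name
     labels:
       app: web
     annotations:
       robin.io/networks: '[{"ippool": "sample-ippool"}]'  # assumed value format
   spec:
     affinity:
       podAntiAffinity:
         requiredDuringSchedulingIgnoredDuringExecution:
         - labelSelector:
             matchLabels:
               app: web
           topologyKey: kubernetes.io/hostname             # spread replicas across hosts
     containers:
     - name: web
       image: nginx:1.21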
Fixed Issues ------------ ============= ============================================================================================================================================================================================================================================ Reference ID Description PP-24650 Robin CNP v5.3.11 HF1 fixed the SWEET32 vulnerability issue. PP-24528 The issue of the CNP planner assigning Pods without checking the status of network interfaces is fixed in this version. With Robin CNP v5.3.11 HF1, the planner skips network interfaces that are down. PP-24428 The creation of an Application Ephemeral Volume (AEV) failed in a cloud deployment if the replication factor was not specified in the AEV specification. This issue is fixed in Robin CNP v5.3.11 HF1 by setting a default replication factor for each AEV when one is not provided. PP-22941 The issue of a Pod not coming up successfully when you provide a Robin annotation for network planning but do not provide any limits and requests in the container resources section is fixed. PP-21983 A Pod controlled by a Deployment, StatefulSet, or DaemonSet might not get created even when its IP address is not listed in ``robin ip-pool info --ip-allocations`` and no other running Pod in the cluster is using that IP address. This issue is fixed.
PP-24589 The issue of Calico's CIDR value not being set correctly when updating a Calico IP pool in an IPv6 Robin CNP setup is fixed. PP-24313 The ``robin bundle add`` command was incorrectly storing the bundle files in the file-object directory of the log collection service instead of the file-object directory of the file collection service. This issue is fixed. ============= ============================================================================================================================================================================================================================================ Known Issues ------------- ============= ============================================================================================================================================================================================================================================================== Reference ID Description PP-24697 **Symptom** The Network attachment definitions (net-attach-def) might not be cleaned up when a Pod is bounced to recreate it from the webhook start phase. If this happens, follow this workaround. **Workaround** You must delete the net-attach-def that is not cleaned up. To delete it, run the following command: .. code-block:: text # kubectl delete net-attach-def -n PP-24600 When you deploy a Pod using Kubernetes Deployment, StatefulSet, or DaemonSet, in some scenarios, the deployment process might take longer than usual due to the exponential back-off delay during scheduling. This is a known Kubernetes behavior. ============= ============================================================================================================================================================================================================================================================== Technical Support ----------------- Contact `Robin Technical support `_ for any assistance. ======================================= Robin Cloud Native Platform v5.3.11 HF2 ======================================= The Robin CNP v5.3.11 HF2 release has a new feature, improvements, bug fixes, and known issues. **Release Date:** 02 February 2022 Infrastructure Versions ----------------------- The following software applications are included in this CNP release. ==================== ======== Software Application Version ==================== ======== Kubernetes 1.21.5 Docker 19.03.9 Prometheus 2.16.0 Prometheus-adapter 0.9.1 Node-exporter 1.1.2 Calico 3.12.3 HA-Proxy 1.5.18 PostgreSQL 9.6.22 Grafana 6.5.3 ==================== ======== Upgrade Paths ------------- The following are the supported upgrade paths for Robin CNP v5.3.11 HF2: * Robin v5.3.5-232 (HF5) **to** Robin v5.3.11 HF2 * Robin v5.3.7-120 (HF1) **to** Robin v5.3.11 HF2 * Robin v5.3.11-104 (HF1) **to** Robin v5.3.11 HF2 New Feature ----------- ------------------------------------------------------ Intel Cache Allocation Technology Support for vDU Pods ------------------------------------------------------ Robin CNP v5.3.11 HF2 supports Intel Cache Allocation Technology (CAT) to deploy 4G vDU (Virtual Distributed Unit) Pods on a single non-uniform memory access (NUMA) node host. Intel CAT enables vDUs in a 4G environment to access the CPUs on the host using dedicated cache lines/ways. When you deploy a vDU Pod using Robin CNP, the vDUs get a dedicated number of cache lines/ways configured on the host to access the CPUs. You can use the following annotation in your Pod YAML file to request cache lines/ways from the CPU for vDUs. **Example:** .. code-block:: yaml # "robin.runtime.num_cache_ways": "4" .. Note:: You must use the annotation at the Pod level only.
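To show where this annotation sits in a complete manifest, the following is a minimal sketch of a Pod that requests dedicated cache ways using the ``robin.runtime.num_cache_ways`` annotation from the example above. The Pod name, image, and resource values are illustrative and are not taken from this release.

.. code-block:: yaml

   # Illustrative sketch only: shows the Pod-level placement of the annotation.
   apiVersion: v1
   kind: Pod
   metadata:
     name: vdu-sample                       # hypothetical name
     annotations:
       robin.runtime.num_cache_ways: "4"    # request 4 dedicated cache ways
   spec:
     containers:
     - name: vdu
       image: registry.example.com/vdu:v1   # placeholder image
       resources:
         limits:
           cpu: "4"
           memory: 8Gi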
Improvements ------------ ------------------------------------------------------------------- Default replication factor and fault domain parameters for RWX PVCs ------------------------------------------------------------------- Starting with Robin CNP v5.3.11 HF2, for RWX PVCs, the default ``replication`` factor is ``2`` and the default ``faultdomain`` is ``host``. If you want to change the ``replication`` factor to ``1``, you can use the following parameter under annotations in the RWX PVC: ``robin.io/rwx_force_single_replica: "1"``. A sketch of a PVC that uses this annotation appears at the end of this Improvements section. .. Note:: You must not use the ``robin.io/replication`` and ``robin.io/rwx_force_single_replica`` annotations together in a PVC. The following is an example of a PVC file: .. code-block:: yaml :emphasize-lines: 7 apiVersion: v1 kind: PersistentVolumeClaim metadata: name: nfs-shared-1 annotations: robin.io/nfs-server-type: "shared" robin.io/replication: "2" robin.io/faultdomain: "host" spec: storageClassName: robin accessModes: - ReadWriteMany resources: requests: storage: 500Gi --------------------------------------------------- IP-Pool Prefix as Environment Variable inside a Pod --------------------------------------------------- Starting with Robin CNP v5.3.11 HF2, Robin CNP provides an IP-Pool prefix as an environment variable inside a Pod. When you deploy a Robin bundle or Helm app and use a Calico or OVS IP-Pool, Robin CNP adds an IP-Pool prefix as an environment variable inside the Pod. Using the IP-Pool prefix environment variable, you can discover the IP prefix of an IP-Pool from the command line. **Example:** .. code-block:: text # kubectl exec -it -n t001-u000004 c1-server-01 -- env | grep -i prefix ROBIN_SAMPLE_IPPOOL_PREFIX=16 In the above command output, ``ROBIN_SAMPLE_IPPOOL_PREFIX=16`` shows that **SAMPLE_IPPOOL** is the IP-Pool name, prefixed with **ROBIN** and suffixed with the word **PREFIX**, and **16** is the IP prefix for the IP-Pool. ------------------------------------------ Robin StorageClass with runAsAny parameter ------------------------------------------ Robin CNP v5.3.11 HF2 provides a new parameter ``runAsAny`` in the StorageClass object to enable any user other than the root user to read or write to an NFS mountpoint of an RWX volume. You can use this parameter in a scenario where a Pod has multiple containers running as different users and you want to allow any user accessing the Pod (containers) to read or write to an NFS mountpoint of an RWX volume. In the StorageClass object file, set the ``runAsAny`` parameter to ``"true"``. The following is an example of a StorageClass with the ``runAsAny`` parameter: .. code-block:: yaml :emphasize-lines: 16 apiVersion: storage.k8s.io/v1 kind: StorageClass metadata: name: run-as-any-imm labels: app.kubernetes.io/instance: robin app.kubernetes.io/managed-by: robin.io app.kubernetes.io/name: robin provisioner: robin reclaimPolicy: Delete allowVolumeExpansion: true volumeBindingMode: Immediate parameters: replication: '2' media: HDD runAsAny: "true" ---------------------------------------------------- Optimization of CPU and memory for Kubernetes Events ---------------------------------------------------- Robin CNP v5.3.11 HF2 is optimized to reduce the usage of CPU and memory when processing Kubernetes events.
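To complement the RWX PVC defaults described earlier in this section, the following is a minimal sketch of a PVC that forces a single replica with the ``robin.io/rwx_force_single_replica`` annotation instead of ``robin.io/replication``. The annotation names are taken from the improvement above; the PVC name and requested size are illustrative only.

.. code-block:: yaml

   # Illustrative sketch only: forces a single replica for an RWX volume.
   # Do not combine this annotation with robin.io/replication.
   apiVersion: v1
   kind: PersistentVolumeClaim
   metadata:
     name: nfs-shared-single           # hypothetical name
     annotations:
       robin.io/nfs-server-type: "shared"
       robin.io/rwx_force_single_replica: "1"
   spec:
     storageClassName: robin
     accessModes:
     - ReadWriteMany
     resources:
       requests:
         storage: 100Gi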
Fixed Issues ------------ ============= ============================================================================================================================================================================================================================================ Reference ID Description PP-25070 Vulnerability CVE-2021-41103 is related to containerd runtime. The container root directories and some plugins had insufficiently restricted permissions. It allows unprivileged Linux users to traverse directory contents and execute programs. For more information about this vulnerability, see `CVE-2021-41103 `_. In this release, Robin.io has upgraded the containerd package to containerd version 1.5.7 for handling this vulnerability. PP-24947 In the recent versions of Robin CNP, the source-based route is not configured properly for the first additional IP-Pool. This issue is fixed. PP-24938 After upgrading to Robin CNP v5.3.11 (HF1), the static IP address was not allocated to a Pod, and the Pod did not come up. This issue is fixed. PP-24796 The scheduler is unable to create a Pod within 30 seconds because Nmap showed that the requested static IP address was already in use due to an external firewall. This issue is fixed, and Nmap is disabled by default. You can enable Nmap to allow the Robin server to scan the network before IP address allocation by running the following command. .. code-block:: text # robin config update manager mutating_webhook_staticip_validation_enable true PP-24776 The ``robin ip-pool add`` command with the ``nictags pci_addr`` option is failing in Robin CNP v5.3.11 HF1. This issue is fixed. PP-24697 When a Pod with ``robin.io/networks`` annotation is deleted, the Network attachment definitions (net-attach-defs) entries are left behind. You need to manually delete these entries. This issue is fixed. PP-24789 An inaccessible device does not get elected by marking the device offline when the node goes down. This issue is fixed now. PP-25116 When you bounce a Pod or reinstall an app, the deletion event might take time to complete the process as the earlier event is stuck at registration due to a missing API in the kubectl API resources. As a result, the static IP address allocation is failing. This issue is fixed. PP-25109 In a scenario, when two MANAGER nodes are becoming SLAVE nodes and waiting for the third node to become the MASTER node, it fails to become the MASTER node due to internal issues. And, the other two nodes are waiting for the third node to become the MASTER without checking whether any node is holding the MASTER lock. As a result, the control plane is down. This issue is fixed. PP-24645 The existence of a ``recover.conf`` file in the PostgreSQL data directory was preventing a Manager node from becoming MASTER during a high availability transition. This issue is fixed. Instead of failing, Robin CNP now allows the node to continue with its transition to become MASTER. PP-25221 A Pod fails to come up to running state, and the ``kubectl describe pod -n namespace `` command shows an error that the network CNI plugin failed to set up the Pod. This issue is fixed. 
============= ============================================================================================================================================================================================================================================ Known Issues ------------- ============= ============================================================================================================================================================================== Reference ID Description PP-25360 **Symptom** If containers in a Pod use an RWX PVC and are stuck in the ``ContainerCreating`` state for a long time with a timeout error, apply the following workaround. **Workaround** Delete the Pods if they are part of a Deployment or StatefulSet. PP-24937 **Symptom** When upgrading to Robin CNP v5.3.11 HF2 from any supported version using GoRobin, you might encounter an error related to Paramiko. **Workaround** Check the login credentials of the Robin node or try restarting the sshd service on the Robin node. PP-25370 **Symptom** A Robin node in a cluster might go into the ``NotReady`` state when an RWX PVC's mount path is not responding. This issue could occur due to several internal Kubernetes known issues. **Workaround** For workaround steps, see `Troubleshooting a Robin Cluster Node with the NotReady State `_. PP-25430 **Symptom** After upgrading to Robin CNP v5.3.11 HF2, you might observe that automatically restarted Pods containing the ``robin.io/networks`` annotation do not have their secondary IP addresses. **Workaround** Bounce such Pods manually. PP-25422 **Symptom** An upgrade to Robin v5.3.11 HF2 might fail because the robinds and node plugin Pods of one of the worker nodes are stuck in the ``Terminating`` state. **Workaround** Perform the following steps to recover all stale NFS mount points: 1. Make sure that no application Pod uses the PVC. 2. Run the following command to scale down the replica count to 0. .. code-block:: text # kubectl scale --replicas=0 3. Run the following command to unmount the stale mount point. .. code-block:: text # umount -f -l 4. When all stale NFS mount points on a node are recovered, run the following command to restart Kubelet. .. code-block:: text # systemctl restart kubelet 5. Scale the replicas back up to the original count. .. code-block:: text # kubectl scale --replicas= .. Note:: You must not delete the PVC or the application Pod. PP-25425 **Symptom** When you create an application using an Application Ephemeral Volume (AEV) with Storage-Compute Affinity, the app creation might fail due to the missing ``robin.io/hostname`` tag on the host. **Workaround** Add the ``robin.io/hostname`` tag to the host. Run the following command to add the tag: .. code-block:: text # robin host add-tags **Example:** .. code-block:: text # robin host add-tags cscale-82-37.robinsystems.com robin.io/hostname=cscale-82-37.robinsystems.com PP-25296 **Symptom** When a cluster is recovered from a network partition, the Pods deployed on the worker nodes in the minority partition are redeployed in the majority partition. The Robin control plane is unable to access the worker nodes in the minority partition. The Pods and their volume mounts are cleaned up automatically when the network partition is resolved. In some cases, this automatic clean-up fails to remove the stale Pods on the worker nodes. **Workaround** Reboot the server. When the server restarts, it rejoins the cluster without the stale Pods and volume mounts.
PP-21832 **Symptom** After upgrading your cluster to Robin CNP v5.3.11 HF2, if you notice that a cluster node is in the ``NotReady`` state when you reboot the cluster, you must apply the following workaround on all nodes of the cluster. **Workaround** To resolve this issue, complete the following steps: 1. Run the following commands. .. code-block:: text /bin/cp /root/bin/robin-reboot.service /etc/systemd/system/robin-reboot.service kernel_version=$(uname -r) if [[ $kernel_version == "3.10"* ]]; then sed -i "/EL8/d;s/#EL7//" /etc/systemd/system/robin-reboot.service else sed -i "/EL7/d;s/#EL8//" /etc/systemd/system/robin-reboot.service fi 2. Run the following command to restart Kubelet. .. code-block:: text # systemctl restart kubelet 3. Run the following command to restart Dockershim. .. code-block:: text # systemctl restart dockershim 4. Run the following command to restart robin-cri. .. code-block:: text # docker restart robin-cri PP-25286 **Symptom** When you try to resize a PDV to a large size, the resize task fails due to insufficient storage, and due to this resize task failure, the subsequent PDV resizing tasks also fail. **Workaround** For workaround steps, see `PDV Resize issue `_. PP-25441 **Symptom** In Robin CNP v5.3.11, provisioning clones with Application Ephemeral Volumes (AEVs) is not supported. PP-25412 **Symptom** Storage-Compute affinity is not enforced on an Application Ephemeral Volume (AEV) when storage is available on the same host. PP-25453 **Symptom** When upgrading to Robin CNP v5.3.11 HF2 from any supported version, CNS fails to execute post-Robin upgrade actions on one of the nodes as the IO Manager might be down and the node displays the ``NotReady`` state. Apply the following workaround if you notice this issue. **Workaround** Run the following command on the node where you executed the upgrade command: .. code-block:: text ./ onprem post-upgrade-robin --hosts-json /root/hosts.json --gorobintar --robin-admin-user --robin-admin-passwd PP-25461 **Symptom** A Pod might not get allocated one or more of its static IP addresses under multiple conditions. If you discover that a Pod comes up without a static IP address, apply the following workaround. **Workaround** Bounce the Pods that are not allocated the required number of static IP addresses. PP-25423 **Symptom** After upgrading to Robin CNP v5.3.11 HF2 from Robin CNP v5.3.5, application Pods might be in the ``ContainerCreating`` state. If you notice this issue, apply the following workaround. **Workaround** 1. Using the RWX PVC, find the Deployment or StatefulSet and the Pod details by running the following command: .. code-block:: text # kubectl describe pvc -n ns 2. Note the replica count by running the following command: .. code-block:: text # kubectl get all -n 3. Scale the replicas to 0 by running the following command. .. code-block:: text # kubectl scale --replicas=0 -n 4. Observe that the Pod is terminated and does not exist anymore. 5. Scale the replicas back to the count that you noted in step 2. Use the following command to scale up. .. code-block:: text # kubectl scale --replicas= -n PP-25381 **Symptom** Robin CNP does not support the rack fault domain for Application Ephemeral Volumes (AEVs). However, the Robin CNP UI incorrectly displays the FaultDomain (Rack) option. PP-25467 **Symptom** You might observe that Kubelet is slow or unresponsive and periodically goes into the error state, resulting in issues with the Robin storage layer.
This issue could be due to an orphan Pod or Kubelet trying to mount a Pod on old PVCs. **Workaround** 1. Run the following command to check the status of Kubelet. .. code-block:: text # systemctl status kubelet -l 2. In the command output, find the following message. .. code-block:: text "orphaned pod pod_id found, but error not a directory occurred when trying to remove the volumes dir" 3. Run the following command to list the PVC names from the node. .. code-block:: text # kubectl get pvc -A 4. Run the following command to check whether the PVC exists. .. code-block:: text # kubectl get pvc -A | grep 5. If the PVC does not exist, delete the directory by running the following command. .. code-block:: text # rm -rf /var/lib/kubelet/pods/ PP-25463 **Symptom** The volume mounts in a Pod fail due to duplicate FS UUIDs. **Workaround** A duplicate FS UUID is present when the device is already mounted on the same node. When a volume mount fails, the FS UUID is displayed in ``syslog/dmesg``. Perform the following steps to resolve the duplicate FS UUID conflict. 1. Run the following command to check whether any device has the same FS UUID: .. code-block:: text # blkid | grep 2. Run the following command to check whether the device is mounted: .. code-block:: text # mount | grep 3. If the device is mounted, run the following command to unmount it: .. code-block:: text # umount After unmounting the device, the duplicate FS UUID conflict will be resolved. PP-25466 **Symptom** A Pod fails to come up because the volume is not accessible, and the volume is in the faulted state. **Workaround** You need to probe the Robin hosts by running the following command: .. code-block:: text # robin host probe --all PP-25508 **Symptom** When you try to modify an IP Pool, the modification process might fail with an error message due to missing values in the IP Pool. If you notice any error message when modifying the IP Pool, apply the following workaround. **Workaround** 1. Run the following command only once on the cluster. .. code-block:: text # robin schedule update K8sResSync k8s_resource_sync 63072000 2. Run the following command and make a note of the IP Pool values. .. code-block:: text robin ip-pool info 3. Run the following command to update the missing values in the IP Pool that you noted in the previous step. .. code-block:: text kubectl edit ripp --validate=false **Example:** In the following example, you need to add the missing values in the ``spec:`` section. You do not need to update all values. For example, for a network-based IP-Pool, the ``prefix`` field is not required. Similarly, you can ignore other values that are not required. .. code-block:: text [root@centos-60-205 ~]# kubectl edit ripp ovs-1 --validate=false ... spec: available: "15" dns_search: domain.com driver: ovs gateway: fd74:ca9b:3a09:868c::1 ifcount: 1 name: ovs-1 nameserver: fd74:ca9b:3a09:868c:10:9:60:62 netmask: ffff:ffff:ffff:ffff:0000:0000:0000:0000 ranges: - fd74:ca9b:3a09:868c:0010:0009:0109:0010-0020 - fd74:ca9b:3a09:868c:0010:0009:0109:0040-0050 subnet: fd74:ca9b:3a09:868c:0000:0000:0000:0000 used: "2" zone: default prefix: 64 vfdriver: xyz vlan_number: 100 4. Rerun the failed IP Pool command to verify.
============= ============================================================================================================================================================================== Appendix -------- ---------------------------------------------------------------- Troubleshooting a Robin Cluster Node with the ``NotReady`` State ---------------------------------------------------------------- **The following content is the workaround for PP-25370.** A Robin node in a cluster might go into the ``NotReady`` state when an RWX PVC's mount path is not responding. This issue could occur due to several internal Kubernetes known issues. The RWX PVC's mount path may not respond due to the following issues/symptoms on your cluster. You can troubleshoot these issues and bring back the node to the ``Ready`` state. This document section provides troubleshooting steps for the following issues: * NFS server's service IP address entry in the conntrack table might go into ``SYN_SENT`` or ``TIME_WAIT`` state * NFS Servers may not be ready * NFS Server Failover Issues * I/O hangs on the volume With Robin v5.3.11 HF2, you might notice the ``NotReady`` state issue when you are upgrading from Robin v5.3.11 HF1 to Robin v5.3.11 HF2. * **Troubleshoot NFS Server’s service IP address entry in the conntrack table in SYN_SENT or TIME_WAIT state** The Robin node could be in the ``NotReady`` state if the NFS Server’s service IP address entry in the conntrack table in ``SYN_SENT`` or ``TIME_WAIT``. The following steps enable you to troubleshoot this issue and bring the node to the ``Ready`` state. 1. Run the following command to know if your node is in the ``NotReady`` state when you notice any of the above-mentioned symptoms: .. code-block:: text # kubectl get node **Example:** .. code-block:: text # kubectl get node hypervvm-61-46 NAME STATUS ROLES AGE VERSION hypervvm-61-46 NotReady 25h v1.21.5 2. Run the following command and grep the NFS server mount paths: .. code-block:: text # mount|grep :/pvc 3. Copy the mount paths for verification from the command output. 4. Run the following command to check the status of the mount path: .. code-block:: text # ls **Example:** .. code-block:: text # ls /var/lib/kubelet/pods/25d256d5-e6cc-4865-a3ee-88640e0d1fc8/volumes/kubernetes.io~csi/pvc-210829ca-96d4-4a12-aab8-5646d087054d/mount .. Note:: If any mount paths do not respond or hang, you must check the status of conntrack. You need the service IP of the NFS Server Pod for checking conntrack status. 5. Run the following command to get the NFS server Pod service IP address: .. code-block:: text # mount|grep **Example:** .. code-block:: text # mount|grep pvc-210829ca-96d4-4a12-aab8-5646d087054d [fd74:ca9b:3a09:868c:172:18:0:e23e]:/pvc-210829ca-96d4-4a12-aab8-5646d087054d on /var/lib/kubelet/pods/25d256d5-e6cc-4865-a3ee-88640e0d1fc8/volumes/kubernetes.io~csi/pvc-210829ca-96d4-4a12-aab8-5646d087054d/mount type nfs4 (rw,relatime,vers=4.0,rsize=1048576,wsize=1048576,namlen=255,hard,proto=tcp6,timeo=600,retrans=2,sec=sys,clientaddr=fd74:ca9b:3a09:868c:10:9:82:127,local_lock=none,addr=fd74:ca9b:3a09:868c:172:18:0:e23e) 6. Verify if the conntrack entry state using NFS server Pod IP address by running the following command: .. code-block:: text # conntrack -L -d .. Note:: If you notice the conntrack entry state as ``SYN_SENT`` or ``TIME_WAIT``, you need to delete the entry from conntrack table entries to allow connections to the NFS service. **Example:** .. 
code-block:: text # conntrack -L -d fd74:ca9b:3a09:868c:172:18:0:e23e tcp 6 110 SYN_SENT src=fd74:ca9b:3a09:868c:10:9:82:127 dst=fd74:ca9b:3a09:868c:172:18:0:e23e sport=980 dport=2049 [UNREPLIED] src=fd74:ca9b:3a09:868c:172:18:0:71d4 dst=fd74:ca9b:3a09:868c:10:9:82:127 sport=2049 dport=614 mark=0 use=1 conntrack v1.4.4 (conntrack-tools): 1 flow entries have been shown. 7. Run the following command to delete the ``SYN_SENT`` or ``TIME_WAIT`` entry: .. code-block:: text # conntrack -D -d **Example:** .. code-block:: text # conntrack -D -d fd74:ca9b:3a09:868c:172:18:0:e23e tcp 6 102 SYN_SENT src=fd74:ca9b:3a09:868c:10:9:82:127 dst=fd74:ca9b:3a09:868c:172:18:0:e23e sport=980 dport=2049 [UNREPLIED] src=fd74:ca9b:3a09:868c:172:18:0:71d4 dst=fd74:ca9b:3a09:868c:10:9:82:127 sport=2049 dport=614 mark=0 use=1 conntrack v1.4.4 (conntrack-tools): 1 flow entries have been deleted. .. Note:: After deleting the ``SYN_SENT`` or ``TIME_WAIT`` entry from the conntrack table, you should be able to access the NFS mount path. 8. Run the following command to verify the mount path status. .. code-block:: text # ls /var/lib/kubelet/pods/25d256d5-e6cc-4865-a3ee-88640e0d1fc8/volumes/kubernetes.io~csi/pvc-210829ca-96d4-4a12-aab8-5646d087054d/mount * **Additional Troubleshooting Checks** If you have verified the NFS Server's service IP address entry in the conntrack table for the ``SYN_SENT`` or ``TIME_WAIT`` status and your node is still in the ``NotReady`` state, you need to perform additional checks to troubleshoot the issue. The following are some additional checks for troubleshooting the issue: * Check NFS Exports Status * Check NFS server failover Status * Check NFS server Pod is provisioned * **Check NFS Exports Status** All NFS exports must be in the ``READY`` state. To check the NFS exports status, run the following command: .. code-block:: text # robin nfs export-list **Example:** .. code-block:: text # robin nfs export-list +--------------+-----------+------------------------------------------+---------------------+-----------------------------------------------------------------------+ | Export State | Export ID | Volume | NFS Server Pod | Export Clients | +--------------+-----------+------------------------------------------+---------------------+-----------------------------------------------------------------------+ | READY | 7 | pvc-9b1ef05e-5e4a-4e6a-ab3e-f7c95d1ae920 | robin-nfs-shared-9 | ["hypervvm-61-48.robinsystems.com","hypervvm-61-43.robinsystems.com"] | +--------------+-----------+------------------------------------------+---------------------+-----------------------------------------------------------------------+ .. Note:: If NFS exports are not in the ``READY`` state, make sure that NFS server failover is enabled. Generally, it is enabled by default. * **Check NFS server failover Status** NFS Server failover is enabled by default. However, you should check it for confirmation and enable it if it is disabled. To check the NFS server failover status, run the following command: .. code-block:: text # robin config list nfs|grep failover_enabled nfs | failover_enabled * **Check NFS server Pod is provisioned** To check whether the NFS server Pod is provisioned, run the following command: .. code-block:: text # robin job list|grep -i NFSServerPodCreate|tail .. Note:: If all of these checks are fine, then it could be a bug in the NFS Server Failover. To troubleshoot the NFS Server failover issue, see **Troubleshoot NFS Server Failover Issues**.
* **Troubleshoot NFS Server Failover Issues** A node could go to the ``NotReady`` state due to NFS Server failover issues as well, apart from other issues mentioned in this section. .. Note:: 1. You can use the following steps even if your NFS Server has no issues, however, the PVC mount path is hung. 2. Before you troubleshoot the NFS Server failover issues, check the **Troubleshoot NFS Server’s service IP address entry in the conntrack table in SYN_SENT or TIME_WAIT** state and **Additional Troubleshooting Checks**. To fix the NFS server failover issues, complete the following steps: 1. Run the following command to check if any NFS exports are in the ``ASSIGNED_ERR`` state and identify corresponding PVCs: .. code-block:: text # robin nfs export-list 2. Run the following command to note the replica count in the deployment or StatefulSet: .. code-block:: text # kubectl get all -n **Example:** .. code-block:: text # kubectl get all -n ... NAME READY UP-TO-DATE AVAILABLE AGE deployment.apps/app1 2/2 2 2 27h NAME DESIRED CURRENT READY AGE replicaset.apps/app1-5cbbc6d9db 2 2 2 27h 3. Run the following command to scale the application Pods that use those PVCs to 0: .. Note:: Do not delete the application. Scaling down the application Pods will make sure that new Pods do not come up and results in the NFS exports being cleared. .. code-block:: text # kubectl scale --replicas=0 -n 4. Run the following command to check all NFS exports are healthy: .. code-block:: text # robin nfs export-list 5. (Optional) Run the following command on the hung paths if you notice some mount paths are still hung: .. code-block:: text # umount -f -l 6. Run the following command to check the node status: .. code-block:: text # kubectl get node .. Note:: If you notice the node is still not showing the ``Ready`` state, wait for 2 minutes for kubelet to refresh the status. If the status is still not showing ``Ready``, stop and start kubelet by running following commands: .. code-block:: text # systemctl stop kubelet #systemctl start kubelet 7. Check the node status again. If the status is ``Ready``, then go to the last step. .. code-block:: text # kubectl get node 8. If the node is still not in the ``Ready`` state or flapping between ``Ready/NotReady`` and you do not see any Pods in k8s that are using the RWX PVC, it may be Pods are deleted by force from Kubernetes. In this case, k8s does not see Pods, but Docker is still hanging on to those Pods. THIS IS A RARE CASE and is hit only when Pods are deleted forcefully. In this case, run the following commands: .. code-block:: text # docker rm <> # systemctl restart docker 9. Run the following command to check the node status: .. code-block:: text # kubectl get node The node should be in the ``Ready`` state. 10. Run the following command to scale up the application Pods back to the original count that you noted earlier: .. code-block:: text # kubectl scale --replicas= -n PDV Resize issue ---------------- When you try to resize a PDV to a large size, the resize task fails due to insufficient storage, and due to this resize task failure, the subsequent PDV resizing tasks also fail. If you face this issue, complete the following troubleshooting steps: 1. Run the following command to verify the PersistentDataVolumeResize job status: .. code-block:: text # robin job info .. Note:: In the command output, notice the PersistentDataVolumeResize job failure, and similarly, you might notice multiple failed volume expansion jobs. 
This is because the Robin CNP is continuously trying to allocate storage for volume expansion. **Example:** .. code-block:: text # robin job info 935 ID |Type | Desc | State | Start |End | Duration | Dependson| Error | Message 935 | PersistentDataVolumeResize | Resizing PersistentDataVolume 'test-pdv-202201102020041' from 108447924224 to 151G | COMPLETED|FAILED | 27 Jan 12:58:53 | 12:59:06 | 0:00:13 | [] | 1 | Unable to allocate storage for volume pvc-2a9ffb4e-fc25-4536-b700-501c2a7a8d80 of logical size 50.0G. Needed 51.56G of type HDD in physical space but found only 42.28G available. Check available capacity, maximum volume count, physical sector size and maintenance mode for the drives. 2. Run the following command to get information about the PV and PVC for the impacted PDV volumes: .. code-block:: text # robin pdv list **Example:** .. code-block:: text # robin pdv list test-pdv-202201102020041 Name | Owner/Tenant | Access | Size | Media | PV | PVC test-pdv-202201102020041 | u1/tenant1 | Private | 100G | HDD | pvc-2a9ffb4e-fc25-4536-b700-501c2a7a8d80 | t038-u000040/pvc-38-40-16420352860585 3. Save the PVC configuration file. .. code-block:: text # kubectl get pvc -n -o yaml > **Example:** .. code-block:: text # kubectl get pvc -n t002-u000006 pvc-38-40-16420352860585 -o yaml > pvc.yaml **Example PVC file:** .. code-block:: yaml :emphasize-lines: 21, 22 apiVersion: v1 kind: PersistentVolumeClaim metadata: annotations: pv.kubernetes.io/bind-completed: "yes" pv.kubernetes.io/bound-by-controller: "yes" robin.io/faultdomain: host robin.io/media: HDD robin.io/replication: "2" robin.io/rpool: default volume.beta.kubernetes.io/storage-provisioner: robin volume.kubernetes.io/storage-provisioner: robin creationTimestamp: "2022-01-13T00:54:46Z" finalizers: - kubernetes.io/pvc-protection labels: robin.io/domain: ROBIN_PDV robin.io/tenant: tenant1 robin.io/tenant_id: "38" robin.io/user_id: "40" robin.io/username: u1 name: pvc-38-40-16420352860585 namespace: t038-u000040 resourceVersion: "2378648" uid: 2a9ffb4e-fc25-4536-b700-501c2a7a8d80 spec: accessModes: - ReadWriteMany resources: requests: storage: 200Gi storageClassName: robin-immediate volumeMode: Filesystem volumeName: pvc-2a9ffb4e-fc25-4536-b700-501c2a7a8d80 status: accessModes: - ReadWriteMany capacity: storage: 100Gi conditions: - lastProbeTime: null lastTransitionTime: "2022-01-27T17:01:41Z" status: "True" type: Resizing phase: Bound 4. Edit the PVC YAML file and remove the following attributes: .. code-block:: text vi * In the metadata annotations, remove the following attributes: .. code-block:: text pv.kubernetes.io/bind-completed pv.kubernetes.io/bound-by-controller * In the In metadata, remove the following attributes: .. code-block:: text creationTimestamp resourceVersion Uid * Remove the complete status section in the PVC YAML file. **Edited PVC YAML file example:** .. 
code-block:: yaml :emphasize-lines: 18, 19 apiVersion: v1 kind: PersistentVolumeClaim metadata: annotations: robin.io/faultdomain: host robin.io/media: HDD robin.io/replication: "2" robin.io/rpool: default volume.beta.kubernetes.io/storage-provisioner: robin volume.kubernetes.io/storage-provisioner: robin finalizers: - kubernetes.io/pvc-protection labels: robin.io/domain: ROBIN_PDV robin.io/tenant: tenant1 robin.io/tenant_id: "38" robin.io/user_id: "40" robin.io/username: u1 name: pvc-38-40-16420352860585 namespace: t038-u000040 spec: accessModes: - ReadWriteMany resources: requests: storage: 200Gi storageClassName: robin-immediate volumeMode: Filesystem volumeName: pvc-2a9ffb4e-fc25-4536-b700-501c2a7a8d80 5. In the edited YAML file, change the ``spec.resources.requests.storage`` size of the underlying Robin volume. 6. Mark the PersistentVolume(PV) that is bound to the PersistentVolumeClaim(PVC) with the ``Retain`` reclaim policy. This will prevent the underlying volume from being deleted when the PVC is deleted. **Example:** .. code-block:: text # kubectl patch pv pvc-2a9ffb4e-fc25-4536-b700-501c2a7a8d80 -p '{"spec":{"persistentVolumeReclaimPolicy":"Retain"}}' persistentvolume/pvc-2a9ffb4e-fc25-4536-b700-501c2a7a8d80 patched 7. Patch the PVC to disable Robin admission control from preventing the PDV’s deletion. **Example** .. code-block:: text # kubectl patch pvc -n t038-u000040 pvc-38-40-16420352860585 -p '{"metadata":{"labels": {"robin.io/override_delete_protection":"true"}}}' persistentvolumeclaim/pvc-38-40-16420352860585 patched 8. Delete the PVC. As PV now has a Retain reclaim policy, you will not lose any data when the PVC is recreated. **Example:** .. code-block:: text # kubectl delete pvc -n t038-u000040 pvc-38-40-16420352860585 persistentvolumeclaim "pvc-38-40-16420352860585" deleted 9. Delete the ``claimRef`` entry from PV specs, so the new PVC can bind to it. This should make the PV Available. **Example:** .. code-block:: text # kubectl patch pv pvc-2a9ffb4e-fc25-4536-b700-501c2a7a8d80 --type json -p '[{"op": "remove", "path": "/spec/claimRef"}]' persistentvolume/pvc-2a9ffb4e-fc25-4536-b700-501c2a7a8d80 patched 10. Re-create the PVC with a required size without exceeding the available resources. .. Note:: In the ``pvc.yaml`` file, that is created above, has the name of the existing PV set in the ``volumeName`` attribute. This will bind the new PVC to the existing PV. **Example:** .. code-block:: text # kubectl create -f pvc.yaml persistentvolumeclaim/pvc-38-40-16420352860585 created 11. Restore the original reclaim policy of the PV. **Example:** .. code-block:: text # kubectl patch pv pvc-2a9ffb4e-fc25-4536-b700-501c2a7a8d80 -p '{"spec":{"persistentVolumeReclaimPolicy":"Delete"}}' persistentvolume/pvc-2a9ffb4e-fc25-4536-b700-501c2a7a8d80 patched Technical Support ----------------- Contact `Robin Technical support `_ for any assistance. ======================================= Robin Cloud Native Platform v5.3.11 HF3 ======================================= The Robin CNP v5.3.11 HF3 release has a new feature and a known issue. **Release Date:** 06 May 2022 Infrastructure Versions ----------------------- The following software applications are included in this CNP release. 
==================== ======== Software Application Version ==================== ======== Kubernetes 1.21.5 Docker 19.03.9 Prometheus 2.16.0 Prometheus-adapter 0.9.1 Node-exporter 1.1.2 Calico 3.12.3 HA-Proxy 1.5.18 PostgreSQL 9.6.22 Grafana 6.5.3 ==================== ======== Upgrade Path ------------- The following is the supported upgrade path for Robin CNP v5.3.11 HF3: * Robin v5.3.11 (HF2) **to** Robin v5.3.11 (HF3) New Feature ----------- ------------------------------------------------ Support for Cisco DCNM E1000 Virtual Interface ------------------------------------------------ Robin CNP v5.3.11 HF3 supports the Cisco Data Center Network Manager (DCNM) E1000 virtual network interface for KVMs. You can deploy the Cisco DCNM application on the Robin CNP cluster. .. Note:: The Cisco DCNM E1000 Virtual Interface is supported only on KVMs with an OVS IP Pool. You can configure the interface only using an ``input.yaml`` file. For more information, see `here `_. Known Issue ------------- ============= ============================================================================================================================================================================== Reference ID Description PP-27192 **Symptom** In some rare scenarios, creating an app from a snapshot of a KVM app fails with this error: *Failed to ping instance.* **Workaround** Run the following command to restart the KVM instance: .. code-block:: text # robin instance restart ============= ============================================================================================================================================================================== Technical Support ----------------- Contact `Robin Technical support `_ for any assistance. ======================================= Robin Cloud Native Platform v5.3.11 HF4 ======================================= The Robin CNP v5.3.11 HF4 release has improvements, a fixed issue, and known issues. **Release Date:** 19 June 2022 Infrastructure Versions ----------------------- The following software applications are included in this CNP release. ==================== ======== Software Application Version ==================== ======== Kubernetes 1.21.5 Docker 19.03.9 Prometheus 2.16.0 Prometheus-adapter 0.9.1 Node-exporter 1.1.2 Calico 3.12.3 HA-Proxy 1.5.18 PostgreSQL 9.6.22 Grafana 6.5.3 ==================== ======== Upgrade Path ------------- The following is the supported upgrade path for Robin CNP v5.3.11 HF4: * Robin v5.3.11 (HF2) **to** Robin v5.3.11 (HF4) Improvements ------------ -------------------- Rocky Linux Support -------------------- Robin CNP v5.3.11 HF4 supports Rocky Linux 8.6. You can install Robin CNP v5.3.11 HF4 on Rocky Linux 8.6 servers. The following are the supported Rocky Linux 8.6 kernel versions: - 4.18.0-372.9.1.rt7.166.el8.x86_64 (RT Kernel) - 4.18.0-372.9.1.el8.x86_64 (Non-RT Kernel) --------------------------------------------------- Disable Init Containers and Sidecars in Bundle App --------------------------------------------------- Robin CNP v5.3.11 HF4 supports disabling the Init Containers and Sidecars in Robin Bundle apps by using the ``input.yaml`` file when deploying the Bundle apps. The following is the sample Robin Bundle file: ..
code-block:: yaml :emphasize-lines: 34, 45, 23 name: dpdk-intel version: v1 icon: icon.png snapshot: enabled clone: enabled roles: - pktgen pktgen: name: pktgen norootfs: true image: name: robinsys/dpdk-intel version: v1 engine: docker imagePullPolicy: IfNotPresent entrypoint: entry.sh compute: memory: 1G cpu: reserve: true cores: 2 initContainers: - name: init1 image: 'robinsys/dpdk-intel:v1' imagePullPolicy: IfNotPresent resources: limits: cpu: 25m memory: 128Mi command: - sleep - '5' sidecars: - name: side1 image: 'robinsys/dpdk-intel:v1' imagePullPolicy: IfNotPresent command: - /bin/bash - '-c' - trap 'exit 0' SIGTERM; while true; do sleep 1; done resources: limits: memory: 200Mi cpu: '1' - name: side2 image: 'robinsys/dpdk-intel:v1' imagePullPolicy: IfNotPresent command: - /bin/bash - '-c' - trap 'exit 0' SIGTERM; while true; do sleep 1; done resources: limits: memory: 200Mi cpu: '1' **Input Yaml file for disabling Init Containers and Sidecars** In the earlier Robin Bundle sample file, we have ``side1`` and ``side2`` sidecars and Init container ``init1``. Using the following sample ``Input.yaml`` file you can disable the Init Containers and sidecars. From the above sample Bundle Yaml file example, we are disabling ``side1`` sidecar and Init container ``init1``. The following is the sample input.yaml file for disabling Init Containers and sidecars. .. code-block:: yaml :emphasize-lines: 7, 9 roles: - name: pktgen containers: - name: side2 disabled: false - name: side1 disabled: true - name: init1 disabled: true You can use the input.yaml file when creating an app using the Robin Bundle. **Syntax** Run the following command when creating an app using the Robin Bundle. ``# robin app create from-bundle --rpool --wait`` Fixed Issues ------------ ================== ============================================================================================================================================= Reference ID Description PP-27304 The 503 error message appears due to timeout of the HAProxy. To fix this issue, you need to increase the timeout values of the HAProxy using the ``robin config update`` command to 60 seconds for the ``connect_timeout`` attribute. ================== ============================================================================================================================================= Known Issues ------------- ============= ============================================================================================================================================================================== Reference ID Description PP-27400 **Symptom** The ``--disablerepo=*`` option does not work with CentOS 8 and Rocky Linux 8. **Workaround** You can disable all repos by creating a backup folder and manually moving all repo files to it. Run the following commands to manually move all repo files: 1. ``mkdir /etc/yum.repos.d.backup`` 2. ``mv /etc/yum.repos.d/* /etc/yum.repos.d.backup/`` PP-27613 When you create an IP pool, the IP pool creation succeeds; however, the app creation fails with the error *IP Pool does not exist*. This issue occurs as the pool creation is registered on Kubernetes and fails to register on the database. **Workaround** 1. Log in to the Robin server pod from any node using ``rbash master``. 2. Run the following command to restart Robin cluster. ``systemctl restart robin-server`` 3. Run the following command to verify the IP pool list. 
``robin ip-pool list`` ============= ============================================================================================================================================================================== Technical Support ----------------- Contact `Robin Technical support `_ for any assistance.