25. Release Notes¶
25.1. Robin Cloud Native Platform v5.4.3¶
The Robin Cloud Native Platform (CNP) v5.4.3 release has new features, improvements, fixed issues, and known issues.
25.1.1. Infrastructure Versions¶
The following software applications are included in this CNP release:
| Software Application | Version |
|---|---|
| Kubernetes | 1.26.0 |
| Docker | 19.03.9 (CentOS 7) and 20.10.8 (Rocky 8) |
| Prometheus | 2.39.1 |
| Prometheus Adapter | 0.10.0 |
| Node Exporter | 1.4.0 |
| Calico | 3.24.3 |
| HAProxy | 2.4.7 |
| PostgreSQL | 14.6 |
| Grafana | 9.2.3 |
| CRI Tools | 1.25.0 |
25.1.2. Upgrade Paths¶
The following are the supported upgrade paths for Robin CNP v5.4.3:
Robin CNP v5.4.1 (GA) to Robin CNP v5.4.3 (GA)
Robin CNP v5.3.13 (GA) to Robin CNP v5.4.3 (GA)
The upgrade procedure remains the same for all the hotfix versions of Robin CNP v5.4.3. For upgrade information, see Upgrade Robin CNP Platform.
Note
Before upgrading to Robin CNP v5.4.3, you must stop the Metrics feature and restart it after the upgrade.
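A minimal sketch of that sequence, assuming the robin metrics stop and robin metrics start subcommands referenced later in these notes are the ones that control the feature:
# robin metrics stop
(perform the upgrade)
# robin metrics start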
25.1.3. New Features¶
25.1.3.1. Single Node HA-Ready¶
Starting from Robin CNP v5.4.3, you can install CNP in HA mode using a single hostname or IP address in an on-prem environment. Later, you can scale up the cluster by adding more master and worker nodes as per your requirements. Thus, a Single Node HA-Ready cluster is a cluster with a single host and HA enabled.
You can use the same install command to install a single-node HA-Ready cluster; just provide a single hostname or IP address.
25.1.3.2. Add Master and Worker Nodes using GoRobin utility¶
Robin CNP v5.4.3 supports adding new master or worker nodes to an existing Robin CNP HA cluster to scale up your cluster using the GoRobin utility.
The option to add a master node is available only if you initially installed your cluster as an HA cluster. However, for a cluster that you installed as non-HA, you can still add more worker nodes.
Note
It is recommended not to add additional nodes to a cluster that you installed using the --single-node-cluster option, as the behavior is not defined. If you want to add more nodes to a cluster installed with this option, contact the Robin support team.
25.1.3.3. Zero Trust Feature to Block Network Traffic¶
Robin CNP v5.4.3 provides the Zero Trust feature. You can enable the zero-trust option as part of the config.json file when installing Robin CNP v5.4.3.
When you use this option, all network ports are closed except the Kubernetes ports, Robin control ports, and the SSH port. You can use this option in conjunction with the single-node-cluster option or independently.
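A hedged illustration of the idea; the surrounding structure of config.json is not shown in these notes, so the placement of the key and any other contents of the file are assumptions:
{
  "zero-trust": true
}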
25.1.3.4. Support to Create KVM-based VMs using Custom UUIDs¶
Robin CNP v5.4.3 provides an option to manually specify a custom Universally Unique Identifier (UUID) for a KVM-based application. You can use this feature for VMs where the application license is linked to the UUID of the VM.
This feature enables you to provide the UUID manually using the input.yaml
file when creating VMs on Robin CNP.
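A hedged sketch of the idea; the exact key name and the surrounding input.yaml layout are not given in these notes, so everything below is illustrative only:
vms:
  - name: licensed-vm
    uuid: "123e4567-e89b-12d3-a456-426614174000"    # custom UUID tied to the VM license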
25.1.4. Improvements¶
25.1.4.1. Add custom Cluster Identity Certificate for all external-facing Kubernetes and Robin CNP services¶
Starting from Robin CNP v5.4.3, Robin allows you to use the custom Cluster Identity certificate for all external-facing Kubernetes and Robin CNP services. The Cluster Identity certificate is used to validate the requests sent to the external-facing Kubernetes and Robin CNP services from external clients outside the cluster.
By default, Robin creates its own Cluster Identity certificate and uses this certificate to validate the requests. You can use your own Cluster Identity certificate and private key. An external trusted certificate authority (CA) must sign this certificate.
25.1.4.2. Support for HashiCorp Vault¶
Starting from Robin CNP v5.4.3, Robin CNP re-enables support for HashiCorp Vault integration. You can use the GoRobin utility to integrate HashiCorp Vault when installing Robin CNP v5.4.3.
25.1.4.3. Access a Robin CNP cluster installed with the Zero Trust feature using whitelisted IP addresses¶
The zero-trust option limits the ports that are accessible from outside the cluster. By default, these ports are accessible from all nodes. The whitelisted IP addresses option limits access to only the nodes specified using this option.
25.1.4.4. ISO image for VMs¶
Robin CNP v5.4.3 supports ISO images for creating VMs on Robin CNP.
25.1.4.5. Added new events for Pods (Tech Preview)¶
The following new events for Pods are added in the robin event list
in Robin CNP v5.4.3:
EVENT_POD_STARTED
EVENT_POD_DEPLOY_FAILED
EVENT_POD_STOPPED
EVENT_POD_STOP_FAILED
EVENT_POD_RESTARTED
EVENT_POD_DELETED
EVENT_POD_FAULTED
EVENT_POD_PLAN_FAILED
EVENT_POD_RELOCATED
EVENT_POD_RELOCATE_FAILED
EVENT_POD_RESTARTING
EVENT_K8SPOD
Note
To raise the events for Kubernetes Pods, you need to enable the k8s_event_watcher config attribute by running the robin config update cluster k8s_event_watcher True command. By default, this event is disabled.
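For example, to enable the watcher and then confirm that the new Pod events are raised (assuming robin event list is the command that displays the event list named above):
# robin config update cluster k8s_event_watcher True
# robin event list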
25.1.5. Fixed Issues¶
PP-28938: When deleting multiple PDVs using the Robin CNP UI, the checkbox for selecting all PDVs (next to the Name field) does not work. This issue is fixed.

PP-28966: A Pod deployment could fail with the following error message in the Pod events: “Error: Vblock with volume_id <> not mounted”. This issue is fixed.

PP-29360: When you add a secondary DPDK-based IP-Pool, routes are programmed by Robin CNP (robin-ipam) erroneously. As a result, Pods fail to come up. This issue is fixed.

PP-29398: The issue with the … is fixed.

PP-29427: In a scenario where Pods are scheduled with three replicas, three static IP addresses, and an anti-affinity rule, if the deployment fails the first time, Robin CNP does not clear the entries in the database. On retrying the failed deployment, one of the Pods fails to come up because its IP address was not released by the previously failed deployment. This issue is fixed.

PP-29430: The issue of not being able to use a static IP address as a string for a single replica in a Static IP annotation is fixed.
25.1.6. Known Issues¶
PP-21916
Symptom: A Pod IP is not pingable from any other node in the cluster, apart from the node where it is running.
Workaround: Bounce the Calico Pod running on the node where the issue is seen.

PP-21935
Symptom: Pods are stuck with the error kubernetes.io/csi: mounter.SetUpAt failed to check for STAGE_UNSTAGE_VOLUME capability.
Workaround: Perform the following steps: …
Note: If the nodeplugin Pod has become unusable, future filesystem mounts will fail; this is a symptom of the many retries of NFS mount calls that hang. Bouncing the Pod will clear out the hung processes.

PP-22781
Symptom: After removing a taint on a master node, GPUs are not detected automatically.
Workaround: You need to run the … command.

PP-22853
Symptom: Robin CNP may not detect GPUs in the following scenarios: …
Workaround: Run the … command.

PP-24736
Symptom: A PVC may not come online after removing an app from the secondary Protection Group on the peer cluster.
Workaround: After you remove the application from the Protection Group and allow the application to start, remove the … .

PP-25246
Symptom: When you try to delete a KVM application, the deletion process might get stuck because the virsh commands on the node may not respond.
Workaround: Reboot the node.

PP-25360
Symptom: Containers in a Pod that are using an RWX PVC may get stuck in the … state.
Workaround: Delete the Pods if they are part of a Deployment or StatefulSet.

PP-26345
Symptom: When you deploy a Pod to use an SR-IOV VF from an Ethernet Virtual Function 700 Series (154c) device, sometimes the Pod gets stuck in the … state.
Workaround: Bounce the Pod that shows the device busy error message.

PP-26572
Symptom: Due to inaccuracies in tracking Pod creation, tenant and user limits are not explicitly honored for Helm applications.

PP-26581
Symptom: After deleting PCI resources, the existing Pods that are using the PCI resources are stuck in the … state.
Workaround: Perform the following steps: …

PP-26768
Symptom: You should not use an IP-Pool associated with DPDK drivers as the default network.

PP-26830
Symptom: After deleting the PVCs, the Robin CNP cluster is down.
Workaround: Bounce the Calico Pod.

PP-27076
Symptom: In Robin CNP, Kubelet might go down due to a stale … .
Workaround: Complete the following steps to fix this issue: …

PP-27077
Symptom: When deleting RWX applications, RWX Pods are stuck in the … state.
Workaround: Perform the following steps to delete the RWX Pods: …

PP-27193
Symptom: When upgrading from supported Robin CNP versions to Robin CNP v5.4.3, RWX Pods may get stuck in the … state.
Workaround: If you notice this issue, apply the following workaround steps: …

PP-27276
Symptom: After upgrading to Robin CNP v5.4.3, some Robin Bundle apps might be … .
Workaround: Manually restart the Robin Bundle apps one by one.

PP-27283
Symptom: In rare scenarios, when you reboot the active master node, two Patroni Pods might have the same role as Replica.
Workaround: Bounce the Calico Pod running on the node where the issue is seen.

PP-27620
Symptom: Sync with the secondary peer cluster fails due to multiple snapshot restore failures.
Workaround: Restart the iomgr-server on the affected node.

PP-27678
Symptom: When the node where the volume for a file collection is mounted is turned off and you want to delete a file collection with a single replica, the file collection delete job fails, putting the file server Pod in the … state.
Workaround: Run the following command to forcefully delete the stuck file server Pod:
# kubectl delete pod <pod_name> -n <robin_ns> --force

PP-27775
Symptom: When upgrading from supported Robin CNP versions to Robin CNP v5.4.3, one of the hosts is stuck in the … state.
Workaround: Delete the worker Pod running on the node that is in the … state. Perform the following steps to delete the worker Pod: …

PP-27826
Symptom: When you reboot all nodes of a cluster together, RWX Pods are stuck in the … state.
Workaround: Bounce the respective Pods.

PP-28461
Symptom: When you increase the snapshot space limit on the primary Protection Group, the change is not replicated to the secondary Protection Group.
Workaround: If you need to increase space for snapshots on the secondary Protection Group, run the following command on the secondary cluster to update the snapshot space limit:
# robin app snapshot-space-limit

PP-28494
Symptom: During a non-HA upgrade, the file-server Pod may get stuck in the … state.
Workaround: If you notice this issue, apply the following workaround steps: …

PP-28501
Symptom: After upgrading from the existing Robin CNP to Robin CNP v5.4.3 with RWX applications, the NFS server related jobs are stuck.
Workaround: Perform the following steps: …

PP-28768
Symptom: After upgrading to Robin CNP v5.4.3, you might notice that a cordoned node is uncordoned.
Workaround: Put the cordoned nodes in maintenance mode before upgrading, or cordon the nodes again after upgrading to Robin CNP v5.4.3.

PP-28867
Symptom: The … .

PP-28922
Symptom: When you try to restore a namespace snapshot, the job hangs as the PVCs report: Error: Invalid annotation ‘robin.io/fstype’ provided.
Workaround: To fix this issue, apply the following workaround: …

PP-28972
Symptom: When you try to deploy a KVM-based app and override the NIC tags in the IP-Pool using the …, the deployment fails with: Error: list index out of range. You will observe this issue because the bonded interface option is not supported for KVM deployments when the Calico interface is used.

PP-29109
Symptom: Robin CNP v5.4.3 does not support the Application Ephemeral Volume (AEV). Due to this, operations involving AEVs will fail.

PP-29150
Symptom: When creating an SR-IOV or OVS IP pool with a VLAN, Robin CNP mistakenly allows the creation of the pool even if only one of them has the VLAN configured for its interface at the host level. For example, consider a scenario where you have created an SR-IOV IP pool with a VLAN and the VLAN is added to the SR-IOV interface at the host level. If you then create an OVS IP pool with the same VLAN but without adding the VLAN to the OVS interface at the host level, the OVS IP pool creation succeeds without any error. However, when you try to deploy a Pod using the OVS IP pool, the Pod deployment fails at the … stage.

PP-29340
Symptom: After upgrading from the existing Robin CNP to Robin CNP v5.4.3, RWX PVC Pods are stuck in the … state.
Workaround: Perform the following steps to generate a new FS UUID: …

PP-29441
Symptom: After adding a master node, a Patroni Pod may be in the … state.
Workaround: …

PP-29505
Symptom: The Dashboard in the Robin CNP UI does not display the metrics data for the CLUSTER CONTAINERS NETWORK DATA TRANSMITTED section of the UI.

PP-29509
Symptom: You must stop Metrics before starting the upgrade and restart it after the upgrade.

PP-29512
Symptom: After upgrading to Robin CNP v5.4.3, you might observe that Robin Bundle applications deployed with replica 1 are in the … state.
Workaround: If you observe this issue, run the following command to make the applications healthy:
# robin host probe <hostname> --rediscover --wait

PP-29521
Symptom: After upgrading to Robin CNP v5.4.3, you might observe Pods stuck in the … state with: Error: Input/Output error on device /dev/sdm.
Workaround: If you observe this issue, apply the following workaround: …

PP-29525
Symptom: After upgrading to Robin CNP v5.4.3 from supported Robin CNP versions, communication with port 36443 might break because the … . As a result, you cannot access the Robin cluster using port 36443.

PP-29528
Symptom: In some scenarios, when a Pod with the … .
Workaround: Delete the network-attachment-definition:
# kubectl delete net-attach-def <net-attach-def_name>
25.1.7. Technical Support¶
Contact Robin Technical support for any assistance.
25.2. Robin Cloud Native Platform v5.4.3 HF1¶
The Robin Cloud Native Platform (CNP) v5.4.3 HF1 release has improvements, fixed issues, and known issues.
25.2.1. Infrastructure Versions¶
The following software applications are included in this CNP release:
| Software Application | Version |
|---|---|
| Kubernetes | 1.26.0 |
| Docker | 19.03.9 (CentOS 7) and 20.10.8 (Rocky 8) |
| Prometheus | 2.39.1 |
| Prometheus Adapter | 0.10.0 |
| Node Exporter | 1.4.0 |
| Calico | 3.24.3 |
| HAProxy | 2.4.7 |
| PostgreSQL | 14.6 |
| Grafana | 9.2.3 |
| CRI Tools | 1.25.0 |
25.2.2. Upgrade Paths¶
The following are the supported upgrade paths for Robin CNP v5.4.3 HF1:
Robin CNP v5.4.3-120 (GA) to Robin CNP v5.4.3 HF1
Robin CNP v5.4.3-237 (HF1-RC) to Robin CNP v5.4.3 HF1
Robin CNP v5.3.11-217 (HF2) to Robin CNP v5.4.3 HF1
The upgrade procedure remains the same for all the hotfix versions of Robin CNP v5.4.3. For upgrade information, see Upgrade Robin CNP Platform.
Note
For a successful upgrade, you must run the possible_job_stuck.py
script before and after the upgrade. Contact the Robin Support team for the upgrade procedure using the script.
25.2.3. Improvements¶
25.2.3.1. Enhanced GoRobin¶
Starting from Robin CNP v5.4.3 HF1, the GoRobin utility runs the preinstallation checks in parallel.
25.2.3.2. Support for V2 KV engine for the HashiCorp Vault integration¶
Starting from Robin CNP v5.4.3 HF1, Robin CNP supports the V2 KV engine for the HashiCorp Vault integration.
25.2.3.3. Set faultdomain to host or rack for all RWX PVCs¶
Starting from Robin CNP v5.4.3 HF1, for all storageclasses (custom storageclasses and storageclasses created by Robin) except the robin-rwx storageclass, you can set host or rack as the faultdomain for RWX PVCs. If you set disk as the faultdomain, the RWX PVC is not provisioned and the following error is shown:
For Access-Many volumes, replication should be more than 1 and faultdomain should be ‘host’.
For the robin-rwx storageclass, the default faultdomain is set to host for RWX PVCs. The faultdomain options disk and rack are not supported.
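As a sketch, a custom StorageClass along these lines would request the host faultdomain; the provisioner name and parameter spellings here are assumptions based on the terminology above, not confirmed syntax:
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: rwx-host-faultdomain
provisioner: robin
parameters:
  replication: "2"
  faultdomain: host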
25.2.3.4. Support for deploying NFS server Pods on nodes with custom taints¶
Starting from Robin CNP v5.4.3 HF1, to deploy NFS server Pods on Kubernetes nodes that have custom taints, you must update the nfs_pod_tolerations config attribute in the nfs section of the robin config to add tolerations for NFS server Pods.
The tolerations added through the config attribute take effect only for newly created NFS server Pods. For existing NFS server Pods, you must add tolerations manually.
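A hedged example of the idea; the value format for nfs_pod_tolerations is not shown in these notes, so the JSON toleration syntax below is an assumption:
# robin config update nfs nfs_pod_tolerations '[{"key": "dedicated", "operator": "Equal", "value": "nfs", "effect": "NoSchedule"}]'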
25.2.3.5. Set default faultdomain to host for creating a Persistent Data Volume (PDV)¶
Starting from Robin CNP v5.4.3 HF1, the valid faultdomain options for creating a Persistent Data Volume (PDV) are host and rack. The disk option for faultdomain is not supported. The default faultdomain is set to host.
25.2.3.6. Default MTU value from physical interface¶
Starting from Robin CNP v5.4.3 HF1, if you do not provide the MTU when creating an IP pool, Robin CNP uses the MTU value of the underlying physical device interface as the default MTU.
25.2.4. Fixed Issues¶
PP-28768: After upgrading Robin CNP v5.4.1, you might notice that a cordoned node is uncordoned. This issue is fixed.

PP-29303: The issue of KVMs not honoring IP pool configurations (Spoofchk and Trustmode) is fixed.

PP-29528: The issue of … is fixed.

PP-29743: Token-based integration of HashiCorp Vault is no longer supported.

PP-29090: When you deploy a Helm chart with static IP addresses and annotations in a StatefulSet, the Helm chart fails to assign the defined static IP addresses and instead assigns different IP addresses from the same IP pool range. This issue is fixed.

PP-29217: The issue of the sriov-device-plugin DaemonSet scaling down due to specific taints on nodes is fixed.

PP-29441: The issue of a Patroni Pod being in the Pending state after adding a master node is fixed. The steps to add a master node are updated.

PP-29509: The issue of having to stop Metrics before starting the upgrade and restart it after the upgrade is fixed.

PP-29512: After upgrading to Robin CNP v5.4.3, you might observe Robin Bundle applications deployed with replica 1 in the NotReady status. This issue is fixed.

PP-29553: The issue of Robin cluster nodes not honoring taints applied on nodes after a robin-server restart is fixed.

PP-29577: When you delete an SR-IOV Pod and restart the Robin server, you might observe the SR-IOV deployment Pods in a create-and-terminate loop. This issue is fixed.

PP-29582: When you stop and start the Robin server during Pod deployment, the Robin network annotations might be ignored. This issue is fixed.

PP-29595: The issue of Robin CNP displaying the following error message when setting up the Robin client is fixed: …

PP-29634: The issue of … is fixed.

PP-29644: The issue of the Robin installer trying to access the Internet to download the install images, eventually causing the installation to fail, is fixed.

PP-29648: The issue of GoRobin failing to install on a 45-node cluster is fixed.

PP-29779: The issue of a discrepancy in CPU core calculation while validating the rpool limit is fixed.

PP-29867: When you use node affinity and IP Pool annotations in a Deployment, it uses the IPs from the IP Pool but fails to follow the affinity rule. This issue is fixed.

PP-29939: The Robin CNP scheduler was not calculating guaranteed CPU utilization correctly and was scheduling Pods on over-utilized nodes. This issue is fixed.

PP-30033: The upgrade process failed when upgrading to Robin CNP v5.3.11 due to a discrepancy in the number of CSI node plugin Pods between pre- and post-upgrade. This issue is fixed.

PP-30050: The issue of the Robin CLI not working when stormgr is down is fixed.

PP-30160: The issue of node removal failing with the … error is fixed.

PP-30141: The issue of the VolumeCreate job failing even though the cluster has enough resources is fixed.

PP-30290: The issue with the Istio mutating webhook configuration after the upgrade is fixed.

PP-30296: The issue of moving a file collection from a three-replica file collection to a single-replica file collection failing is fixed.

PP-30345: The issue of the upgrade process not failing even though the version and tag stamping in the config map failed is fixed.

PP-30381: The issue of not being able to upload bundles to an online file collection is fixed.

PP-30387: The issue of adding a PDV volume failing when the fault domain option … is fixed.
25.2.5. Known Issues¶
PP-28802
Symptom: The Robin Control Plane fails to auto-recover in the following conditions: …
Workaround: Apply the following workaround steps to recover from this situation: …

PP-29533
Symptom: After moving all apps from an existing file collection to a new file collection and then powering off one of the nodes, the application becomes inaccessible.
Workaround: Delete the file-server Pod so that the PVC gets mounted again.

PP-29650
Symptom: After an IOMGR Pod failure, the Pod might be in the … state.
Workaround: Perform the following steps: …

PP-29850
Symptom: After rebooting a node, you might notice applications stuck in the … state.
Workaround: Bounce the Pods.

PP-29866
Symptom: All Pod deployments go through the Pending or Terminating state at least once before the deployment is successful. The behavior is the same for StatefulSets and Deployments.

PP-29962
Symptom: After upgrading from Robin CNP v5.3.11 to Robin CNP v5.4.3 HF1, the robin nfs export list might show a wrong entry.
Workaround: Contact the Robin support team for workaround steps.

PP-30112
Symptom: When you upgrade from Robin CNP v5.3.11 HF2 to v5.4.3 HF1, application Pods might be in the … state. Run this command to check the Pods: …
Workaround: Delete the Pods that are in the … state. Run the following command to delete them:
# kubectl delete pod -n <namespace> --force <pod_name>

PP-30119
Symptom: When removing a node from Robin CNP, if the affinity is tied to local storage, instance relocation fails with the following error: Unable to reallocate instance <instance name> due to an affinity rule tying it to local storage.

PP-30149
Symptom: After deploying Pods, if you restart the Robin Server and delete the Pods, some of the Pods might not come back online.
Workaround: Delete the Pods that are in the … state.

PP-30173
Symptom: When scheduling Pods, there might be a difference of 2 CPU cores between the Robin planner and the Kubernetes planner due to resource calculation. For example, if only 2.2 cores are left on a node, a Pod deployed with a 2.2 CPU request will not get scheduled, as 2.2 cores are considered 2 cores by Robin.

PP-30188
Symptom: After upgrading to Robin CNP v5.4.3 HF1 from a supported version, you might notice an RWX app stuck in the … state.
Workaround: …

PP-30243
Symptom: After upgrading from the supported Robin CNP v5.3.11 to Robin CNP v5.4.3 HF1, you might notice a continuous vnode deploy job running for an app.
Workaround: Run the following commands to rectify this issue: …

PP-30247
Symptom: After upgrading from Robin CNP v5.3.11 to Robin CNP v5.4.3 HF1, the RWX apps might report the following error event: wrong fs type, bad option, bad superblock on /dev/sdj, missing codepage or helper program, or other error.
Workaround: Contact the Robin support team for workaround steps.

PP-30251
Symptom: Adding a master node using GoRobin for the same removed node fails on IPv4 setups. The Patroni Pod will be in the … state.
Workaround: …

PP-30264
Symptom: In Robin CNP v5.4.3, if you have … .
Workaround: Use … .

PP-30298
Symptom: After upgrading from the supported Robin CNP v5.3.11 to Robin CNP v5.4.3 HF1, if the upgrade fails and nodes are in the NotReady status, check for the following symptoms and apply the workaround: …
Workaround: Reboot the node where the iomgr-server is in a defunct state.

PP-30319
Symptom: When you have a StatefulSet or Deployment with robinrpool, one of the Pods may not get scheduled by Kubernetes and remains in the … state.
Workaround: Run the following command to delete the Pod that is in the … state:
# kubectl delete pod <pod name>

PP-30339
Symptom: After upgrading from Robin CNP v5.3.11 to Robin CNP v5.4.3 HF1, you might observe Helm app Pods in the … state.
Workaround: Stop and start the apps that are in the … state: …

PP-30357
Symptom: After you upgrade successfully from the supported Robin CNP v5.3.11 to Robin CNP v5.4.3 HF1, you might notice that a node is in the … state.
Workaround: Run the following command to rectify this issue:
# robin host probe <hostname> --wait

PP-30361
Symptom: When you delete Pods with static IPs and affinity that are in the … state, you might see the following error: “NoNetworkFound”: cannot find a network-attachment-definition (robin-required) in namespace (kube-system): network-attachment-definitions.k8s.cni.cncf.io “robin-required” not found.
Workaround: Bounce the Pods that are in the … state.

PP-30363
Symptom: After upgrading to Robin CNP v5.4.3 HF1, you might notice that one of the nodes is in the NotReady state with:
Reason: KubeletNotReady
Message: PLEG is not healthy: pleg was last seen active 1h25m32.701542224s ago; threshold is 3m0s.
Workaround: Run the following command to restart the Docker service:
# service docker restart

PP-30364
Symptom: After you successfully add a new node using GoRobin to an existing cluster that has nodes associated with a custom rpool, GoRobin automatically assigns the new node to the default rpool, as the tool does not support custom rpools.
Workaround: Apply the following workaround to associate the newly added node with the custom rpool: …

PP-30386
Symptom: When upgrading from Robin CNP v5.3.11 HF1 to Robin CNP v5.4.3 HF1, the NFS exports might be stuck in the … state.
Workaround: Contact the Robin support team for workaround steps.

PP-30389
Symptom: If you have added a range of blacklisted IPs in an unexpanded form, Robin CNP does not allow you to remove the range of blacklisted IPs from the IP Pool. It is recommended to use the expanded form when adding and removing a range of blacklisted IPs in an IP Pool.
Workaround: If you have added a range of blacklisted IPs in an unexpanded form, the range must be removed from the database. Contact the Robin customer support team to apply the workaround.

PP-30394
Symptom: The Robin CNP UI dashboard does not display the cluster memory usage and cluster storage available details.
Workaround: Complete the following steps to rectify this issue: …

PP-30398
Symptom: After removing an offline master node from the cluster and power cycling it, the removed master node is automatically added back as a worker node.
Workaround: …
25.2.6. Technical Support¶
Contact Robin Technical support for any assistance.
25.3. Robin Cloud Native Platform v5.4.3 HF2¶
The Robin Cloud Native Platform (CNP) v5.4.3 HF2 release has a new feature, fixed issues, and known issues.
25.3.1. Infrastructure Versions¶
The following software applications are included in this CNP release:
| Software Application | Version |
|---|---|
| Kubernetes | 1.26.0 |
| Docker | 19.03.9 (CentOS 7) and 20.10.8 (Rocky 8) |
| Prometheus | 2.39.1 |
| Prometheus Adapter | 0.10.0 |
| Node Exporter | 1.4.0 |
| Calico | 3.24.3 |
| HAProxy | 2.4.7 |
| PostgreSQL | 14.6 |
| Grafana | 9.2.3 |
| CRI Tools | 1.25.0 |
25.3.2. Upgrade Path¶
The following is the supported upgrade path for Robin CNP v5.4.3 HF2:
Robin CNP v5.4.3-281 (HF1) to Robin CNP v5.4.3-302 (HF2)
The upgrade procedure remains the same for all the hotfix versions of Robin CNP v5.4.3. For upgrade information, see Upgrade Robin CNP Platform.
Note
For a successful upgrade, you must run the possible_job_stuck.py
script before and after the upgrade. Contact the Robin Support team for the upgrade procedure using the script.
25.3.3. New Feature¶
25.3.3.1. Support for soft anti-affinity for Robin Bundles¶
Starting with Robin CNP v5.4.3 HF2, you can enable soft anti-affinity for Robin Bundles.
To enable soft anti-affinity for Robin Bundles, you must set the placeon_different_nodes_on_same_rack parameter to true in the Robin Bundle YAML file.
Example
appname: "centos-1"
ippools: ["robin-default"]
roles:
  - name: server1
    placeon_different_nodes_on_same_rack: true
    ippools:
      - ippool: routes-2
  - name: server2
    placeon_different_nodes_on_same_rack: true
    ippools:
      - ippool: routes-2
  - name: server3
    placeon_different_nodes_on_same_rack: true
    ippools:
      - ippool: routes-1
        static_ips: "fd74:ca9b:3a09:86ba:a:b:c:d"
25.3.4. Fixed Issues¶
PP-30611: When the monitor server in Robin CNP fails to report to the robin-server, it waits for a long time before attempting to send the next report. As a result, heartbeats are missed for a long time, which results in host probe jobs. This issue is fixed.

PP-30639: The issue of source-based routing not kicking in when the IP pool … is fixed.

PP-30864: The issue of the K8s collect watcher taking a long time to complete on a loaded cluster is fixed.

PP-30883: The issue of RWX Pods sometimes being stuck in the … state after rebooting the nodes is fixed.

PP-30895: The issue of importing users with capabilities from an LDAP group using the … is fixed.

PP-30896: Starting from Robin CNP v5.4.3 HF2, the … .

PP-30897: The issue of SR-IOV annotations in a few Deployments being ignored, with Pods stuck in the … state, is fixed.

PP-30945: When you use a reused Persistent Volume for a VolumeMount, the following error is displayed: ‘NoneType’ object is not subscriptable. This issue is fixed.

PP-30951: The issue of the kube-controller choking and using more CPU because of logging issues in Kubernetes v1.26.0 is fixed by reducing the kube-controller log level from 7 to 4. This applies to both a new installation of Robin CNP v5.4.3 HF2 and an upgrade to Robin CNP v5.4.3 HF2.

PP-30978: The issue of high CPU utilization by … is fixed.
25.3.5. Known Issues¶
PP-30493
Symptom: You might see Pods in the … state with the following error: “NoNetworkFound”: cannot find a network-attachment-definition (robin-required) in namespace (kube-system): network-attachment-definitions.k8s.cni.cncf.io “robin-required” not found.
Workaround: Bounce the Pods that are in the … state.

PP-30980
Symptom: When you deploy multiple Pods at the same time, the Pods might come up slowly, as mutation takes more time due to timeouts and multiple retries.

PP-31068
Symptom: After upgrading from Robin CNP v5.4.3 HF1 to Robin CNP v5.4.3 HF2, a few nodes might be in the … state.
Workaround: Contact the Robin customer support team for workaround steps for this issue.

PP-31070
Symptom: After upgrading from Robin CNP v5.4.3 HF1 to Robin CNP v5.4.3 HF2, some Pods might be stuck in the … state with the following error: [ERROR][3095966] plugin.go 580: Final result of CNI DEL was an error. error=error getting ClusterInformation: connection is unauthorized: Unauthorized.
Workaround: Bounce the … Pod.

PP-31072
Symptom: After upgrading from Robin CNP v5.4.3 HF1 to Robin CNP v5.4.3 HF2, the … .
Workaround: Reboot the node where the … .
25.3.6. Technical Support¶
Contact Robin Technical support for any assistance.
25.4. Robin Cloud Native Platform v5.4.3 HF3¶
The Robin Cloud Native Platform (CNP) v5.4.3 HF3 release has a fixed issue and known issues.
25.4.1. Infrastructure Versions¶
The following software applications are included in this CNP release:
| Software Application | Version |
|---|---|
| Kubernetes | 1.25.7 or 1.26.0 (Default) |
| Docker | 19.03.9 (CentOS 7) and 20.10.8 (Rocky 8) |
| Prometheus | 2.39.1 |
| Prometheus Adapter | 0.10.0 |
| Node Exporter | 1.4.0 |
| Calico | 3.24.3 |
| HAProxy | 2.4.7 |
| PostgreSQL | 14.6 |
| Grafana | 9.2.3 |
| CRI Tools | 1.25.0 |
25.4.2. Upgrade Paths¶
The following are the supported upgrade paths for Robin CNP v5.4.3 HF3:
Robin CNP v5.3.13-107 (HF3) to Robin CNP v5.4.3-355 (HF3)
Robin CNP v5.3.13-159 (HF3) to Robin CNP v5.4.3-355 (HF3)
The upgrade procedure remains the same for all the hotfix versions of Robin CNP v5.4.3. For upgrade information, see Upgrade Robin CNP Platform.
Note
For a successful upgrade, you must run the possible_job_stuck.py
script before and after the upgrade. Contact the Robin Support team for the upgrade procedure using the script.
25.4.3. Fixed Issues¶
PP-30493: The issue of Pods being in the … state with the following error is fixed: “NoNetworkFound”: cannot find a network-attachment-definition (robin-required) in namespace (kube-system): network-attachment-definitions.k8s.cni.cncf.io “robin-required” not found.
25.4.4. Known Issues¶
PP-31522
Symptom: After deleting a backup, unregistering a storage repo fails with the following error message: Storage repo is associated with volume group.
Workaround: Complete the following steps: …

PP-32259
Symptom: When upgrading from Robin CNP v5.3.13-159 (HF3) to Robin CNP v5.4.3-355 (HF3), some of the jobs might fail with the following error: FATAL: remaining connection slots are reserved for non-replication superuser connections.
Workaround: Contact the Robin Customer Support team for the workaround steps.

PP-32288
Symptom: When all nodes of a cluster are rebooted after installing Robin CNP v5.4.3-355 (HF3), one of them remains in the … state.
Workaround: Start the consul-client:
# systemctl start consul-client

PP-32334
Symptom: When upgrading from Robin CNP v5.3.13-159 (HF3) to Robin CNP v5.4.3-355 (HF3), the … .
Workaround: Run the following command to unmount the file collection volume:
# robin volume unmount <file-collection-volume-name>

PP-32385
Symptom: When upgrading from Robin CNP v5.3.13-159 (HF3) to Robin CNP v5.4.3-355 (HF3), sometimes the … fails with: wal: max entry size limit exceeded.
Workaround: You need to remove the faulted … . Complete the following steps to remove the faulted … : …

PP-32463
Symptom: After installing Robin CNP v5.4.3-355 (HF3), you might face file collection creation failures.
Workaround: …

PP-32477
Symptom: You might notice one of the Robin Patroni Pods in the … state.
Workaround: …

PP-32497
Symptom: When a cluster reboots, you might observe one or more Pods stuck in the … state with an error such as: volume 1698488235:1 has GET error for volume attachment csi-204e799ca58418a5f0e1b0d4193fd8b0908dbe290ae52ef810a8c2964a12c202: volumeattachments.storage.k8s.io “csi-204e799ca58418a5f0e1b0d4193fd8b0908dbe290ae52ef810a8c2964a12c202” is forbidden: User “system:node:qct-09.robinsystems.com” cannot get resource “volumeattachments” in API group “storage.k8s.io” at the cluster scope: no relationship found between node ‘qct-09.robinsystems.com’ and this object.
Workaround: Cordon the node and bounce the required Pods.

PP-32515
Symptom: When a cluster reboots, you might notice robin-worker node Pods stuck in the … state with the error: psycopg2.errors.ReadOnlySqlTransaction: cannot execute CREATE EXTENSION in a read-only transaction. You can find the log file at: … .
Workaround: Check for the error in the log file and bounce the Robin master Pod.

PP-32517
Symptom: In a rare scenario, an upgrade from Robin CNP v5.3.13-159 (HF3) to Robin CNP v5.4.3-355 (HF3) might fail with this message: …
Workaround: Restart Docker and Dockershim.

PP-32523
Symptom: After upgrading to Robin CNP v5.4.3-355 (HF3), you might notice some of the Pods in the … state.
Workaround: Bounce the Pods that are in the … state. To bounce the Pods, run the following command:
# kubectl delete pod -n <namespace> <pod name>
25.4.5. Technical Support¶
Contact Robin Technical support for any assistance.
25.5. Robin Cloud Native Platform v5.4.3 HF4¶
The Robin Cloud Native Platform (CNP) v5.4.3-395 (HF4) release has a new feature, improvements, fixed issues, and known issues.
25.5.1. Infrastructure Versions¶
The following software applications are included in this CNP release:
| Software Application | Version |
|---|---|
| Kubernetes | 1.26.0 |
| Docker | 19.03.9 (CentOS 7) and 20.10.8 (Rocky 8) |
| Prometheus | 2.39.1 |
| Prometheus Adapter | 0.10.0 |
| Node Exporter | 1.4.0 |
| Calico | 3.24.3 |
| HAProxy | 2.4.7 |
| PostgreSQL | 14.6 |
| Grafana | 9.2.3 |
| CRI Tools | 1.25.0 |
25.5.2. Upgrade Path¶
The following is the supported upgrade path for Robin CNP v5.4.3-395 (HF4):
Robin CNP v5.4.3-302 (HF2) to Robin CNP v5.4.3-395 (HF4)
The upgrade procedure remains the same for all the hotfix versions of Robin CNP v5.4.3. For upgrade information, see Upgrade Robin CNP Platform.
Note
For a successful upgrade, you must run the possible_job_stuck.py script before and after the upgrade. Contact the Robin Support team for the upgrade procedure using the script.
You must configure Calico Typha for a cluster with more than 50 nodes. For more information, see Configure Calico Typha.
After upgrading to Robin CNP v5.4.3-395 (HF4), you must run the robin schedule update K8sResSync k8s_resource_sync 60000 command to update the K8sResSync schedule.
25.5.3. New Feature¶
25.5.3.1. Support of soft affinity¶
Robin CNP v5.4.3-422 (HF4) supports the soft affinity feature with a few limitations.
In Kubernetes, the soft affinity feature refers to a way of guiding the Kubernetes Scheduler to make a decision about where to place Pods based on preferences, rather than strict requirements. This preference helps to increase the likelihood of co-locating certain Pods on the same node, while still allowing the Kubernetes Scheduler to make adjustments based on resource availability and other constraints. For more information, see Affinity and anti-affinity.
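For reference, a minimal Pod spec using soft (preferred) Pod affinity; this is standard Kubernetes syntax rather than anything Robin-specific, and the names are illustrative:
apiVersion: v1
kind: Pod
metadata:
  name: web
  labels:
    app: web
spec:
  affinity:
    podAffinity:
      preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 100                      # a preference, not a hard requirement
        podAffinityTerm:
          labelSelector:
            matchLabels:
              app: cache                 # prefer nodes already running "cache" Pods
          topologyKey: kubernetes.io/hostname
  containers:
  - name: web
    image: nginx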
Limitations
The following are the limitations of soft affinity and anti-affinity support:
These operators are not supported: DoesNotExist, Gt, and Lt.
Multiple weight parameters for node and Pod affinity are not supported.
Soft anti-affinity doesn’t check or match for the label selector coming from a different Deployment.
During a complete cluster restart, if all nodes are not up at the same time, Pods will not be spread across nodes with soft anti-affinity.
After a Pod restart, it might not come back on the same node.
Post downsizing the number of replicas in a Deployment, soft Pod anti-affinity might not delete the Pods in the same order as creation.
As the affinity information is handled in the cache, restarting the robin-server will flush the cache, resulting in scaled-up Pods not being placed as per anti-affinity.
Creating, deleting, or recreating Pods multiple times will not honor soft affinity.
Pods will be unequally distributed on nodes when all Pods in a deployment are deleted.
25.5.4. Improvements¶
25.5.4.2. Relaxation in NIC bonding policy¶
Starting from Robin CNP v5.4.3-395 (HF4), Robin considers the NIC bonding interface up if at least one of the two interfaces used to create the bond interface is up.
25.5.5. Fixed Issues¶
PP-30394: The Robin CNP UI dashboard does not display the cluster memory usage and cluster storage available details. This issue is fixed.

PP-30673: The issue of kernel.core_pattern getting changed to |/bin/false from /var/crash/core.%e.%p.%h.%t after restarting the robin-worker Pod or robin-iomgr Pod is fixed.

PP-30980: When you deploy multiple Pods at the same time, the Pods might come up slowly, as mutation takes more time due to timeouts and multiple retries. This issue is fixed.

PP-31294: The issue of Robin CNP considering a guaranteed number of CPUs as shared CPUs in a Pod when there are both shared and guaranteed CPUs is fixed.

PP-31664: The issue of nodes stuck in reboot due to mount errors is fixed.

PP-32334: The issue of the robin-file-server failing to come up when upgrading from Robin CNP v5.3.13 to Robin CNP v5.4.3 HF3 is fixed.

PP-32405: When there are many Pods in the Kubernetes Scheduler queue and they take time to come up, the NFS server Pod might take more than 10 minutes to come up and the job might time out. This issue is fixed.

PP-32461: The issue of StatefulSet Pods using static IP ranges not resuming with the same IPs after restarting Pods is fixed.

PP-32498: The issue of the snapshot-controller Pod stuck in the … state is fixed.

PP-32525: The issue of 5G NF Pods not being deployed using the Helm chart due to a webhook timeout is fixed.

PP-32620: The issue of mutation timeout failure is fixed.
25.5.6. Known Issues¶
PP-31790
Symptom: Sometimes, a discrepancy in CPU core calculation is observed for a maximum period of 17 minutes when validating the tenant rpool limit.

PP-32555
Symptom: Assigning a static IP address for KVM-based apps from a secondary IP-Pool is not supported through the Robin CNP UI.
Workaround: Assign the static IP address for KVM-based apps from a secondary IP-Pool using the CLI.

PP-32647
Symptom: After upgrading to Robin CNP v5.4.3-395 (HF4), if the StatefulSet Pods are deleted, they may not retain the same IPs due to the implementation of the staticip_ordered_assignment parameter. The staticip_ordered_assignment parameter is a new configuration parameter added as part of Robin CNP v5.4.3-395 (HF4). This config parameter is set to True by default. When it is set to True, the IPs provided in the network annotations are assigned serially to the Pods.

PP-32713
Symptom: The Robin log collection operation might fail with the following error message: Creation of storage for file collection failed, and the file server Pod might be in the Terminating status.
Workaround: Rerun the log collection using the following command to recreate the file server Pod:
# robin log collect robin-storage

PP-32770
Symptom: In a rare scenario, after rebooting the nodes, KVM-based apps might be stuck in the Error state with the following error: Unable to satisfy max guaranteed CPU requirements.
Workaround: Restart the respective apps manually:
# robin instance start <app_name>
25.5.7. Technical Support¶
Contact Robin Technical support for any assistance.
25.6. Robin Cloud Native Platform v5.4.3 HF4 Point Patch-1¶
The Robin Cloud Native Platform (CNP) v5.4.3-422 (HF4 Point Patch-1) release has new features, fixed issues, and known issues.
25.6.1. New Features¶
25.6.1.1. Robin User Certificate Management¶
Robin CNP v5.4.3 HF4 Point Patch-1 enables you to manage the Robin user security certificate. Robin CNP creates a TLS certificate when a user is created. The user certificate is valid for one year from the date of user addition by default, and it automatically renews if the certificate is about to expire, depending on the set configuration and scheduler status. The scheduler runs as per the set configuration. For more information, see User Certificate Management.
Robin CNP provides new CLI commands to manage user certificates:

robin user-cert check - Checks the status of user certificates and the user certificate configuration details.
robin user-cert renew - Renews user certificates for all users, sets an offset period for checking validity, and performs a dry run to verify the renewal process.
robin user-cert update - Configures the user certificate settings: the life span of the certificate (minimum one day), the time between each user's certificate renewal checks, and the certificate renewal offset time.
robin user-cert stop - Stops the certificate validity scheduler checks.
robin user-cert start - Restarts the stopped validity scheduler checks.
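For example, a typical check-and-renew sequence using the subcommands listed above (their individual options are not shown in these notes and are therefore omitted):
# robin user-cert check
# robin user-cert renew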
Note
In this release, support for managing the user certificates feature is not available from the Robin CNP UI.
25.6.1.2. Support for Millicore CPUs for Robin Bundle Apps¶
Robin CNP v5.4.3 HF4 Point Patch-1 supports the millicore CPU unit for a container. Now, you can specify a fractional value of a CPU unit when defining the CPU resource for a container in the Robin Bundle. You can specify millicore CPUs for Robin Bundle apps using the following files:

Bundle’s template file - To specify millicore CPUs in the template file of a Robin Bundle app, set the millicore CPU value in the min and max keys of the compute.cpu attribute:

cpu:
  reserve: true
  min: 1.03
  max: 1.03

Bundle’s input.yaml file - To specify millicore CPUs in the input.yaml file of a Robin Bundle app, set the millicore CPU value in the min and max keys of the containers.cpu attribute:

cpu:
  reserve: true
  min: 1.03
  max: 1.03

Bundle’s manifest.yaml file - To specify millicore CPUs in the manifest.yaml file of a Robin Bundle app, set the millicore CPU value in the core key of the compute.cpu attribute:

cpu:
  reserve: true
  core: 1.03

Note
The min and max keys are applicable only for main containers.
Limitations
Robin CNP does not support configuring the millicore CPUs through the CNP UI.
Robin CNP does not support the Chargeback feature for millicore CPUs.
A millicore CPU unit such as 500m is not supported in the template of a Robin Bundle app.
When you specify guaranteed CPUs with millicore values, they are not truly guaranteed CPUs; they are Quality of Service (QoS) guaranteed CPUs. This is Kubelet behavior. For truly guaranteed CPUs, only integer values must be specified.
25.6.2. Fixed Issues¶
PP-32758: The issue of the same event ID being used for two different events, which prevented users from adding the event ID to the subscription list, is fixed.

PP-33203: In prior CNP releases, the Robin user certificates with a default validity of one year would expire after the validity period, resulting in users being unable to perform app-level tasks. With this release, the user certificates automatically renew before the expiration date, and Robin CNP allows you to manage the user certificates using the new CLI options.

PP-33253: The issue of the difference in allocated memory displayed between the Robin CNP UI and CLI is fixed.

PP-33255: The issue of Bundle app creation failing when you provide values for the CPU attributes as shown below in the manifest YAML file is fixed:
cpu:
  cores: 8
  reserve: true

PP-33446: In Robin CNP v5.4.3 HF2, the … .

PP-33516: The issue of inflight resources being held by CNP when a Helm chart or Deployment is deleted during the Pod planning phase is fixed. However, there is another known issue when there is a non-graceful termination of a Pod. For more information, see PP-33628 under the Known Issues section.
25.6.3. Known Issues¶
PP-33501
Symptom: Robin CNP does not support the Chargeback feature for millicore CPUs.

PP-33596
Symptom: Robin CNP does not support managing user certificates from the CNP UI.

PP-33628
Symptom: In some cases, after a Helm app uninstall, a non-graceful deletion of a Pod, or a StatefulSet Pod deletion, the inflight resources might be held by Robin CNP. To check this, run the following command:
# robin inflight-resources info

PP-33670
Symptom: In scenarios like cluster failover or reboot, Robin CNP may fail to access the devices and erroneously mark them as FAULTED, even though the devices might not have issues.
Workaround: Contact the Robin Customer Support team if you observe this issue.

PP-33679
Symptom: When a Master Pod fails over due to a network partition on a node, the Master Pod might get stuck in the … state.
Workaround: Restart the Calico Pod on the node where you are seeing the issue.

PP-33725
Symptom: In the following scenarios, app creation using a snapshot (robin app create from snapshot) fails if you use values other than the values of the parent application: …
25.6.4. Technical Support¶
Contact Robin Technical support for any assistance.
25.7. Robin Cloud Native Platform v5.4.3 HF5¶
The Robin Cloud Native Platform (CNP) v5.4.3-564 (HF5) release has a new feature, improvements, fixed issues, and known issues.
25.7.1. Infrastructure Versions¶
The following software applications are included in this CNP release:
| Software Application | Version |
|---|---|
| Kubernetes | 1.25.7 or 1.26.0 (Default) |
| Docker | 25.0.2 (CentOS 7, Rocky 8, and RHEL 8.10) |
| Prometheus | 2.39.1 |
| Prometheus Adapter | 0.10.0 |
| Node Exporter | 1.4.0 |
| Calico | 3.24.3 |
| HAProxy | 2.4.7 |
| PostgreSQL | 14.7 |
| Grafana | 9.2.3 |
| CRI Tools | 1.25.0 |
25.7.2. Upgrade Paths¶
The following are the supported upgrade paths for Robin CNP v5.4.3-564 (HF5):
Robin CNP v5.4.3-355 + Point Patch to Robin CNP v5.4.3 HF5
Robin CNP v5.4.3-395 + Point Patch-1 to Robin CNP v5.4.3 HF5
Robin CNP v5.4.3-302 + Point Patch + Security Patch to Robin CNP v5.4.3 HF5
The upgrade procedure remains the same for all the hotfix versions of Robin CNP v5.4.3. For upgrade information, see Upgrade Robin CNP Platform.
25.7.2.1. Pre-upgrade consideration¶
For a successful upgrade, you must run the possible_job_stuck.py
script before and after the upgrade. Contact the Robin Support team for the upgrade procedure using the script.
25.7.2.2. Post-upgrade considerations¶
After upgrading to Robin CNP v5.4.3 HF5, you must run the robin schedule update K8sResSync k8s_resource_sync 60000 command to update the K8sResSync schedule.
After upgrading to Robin CNP v5.4.3 HF5, you must run the robin-server validate-role-bindings command. To run this command, you need to log in to the robin-master Pod. This command verifies the roles assigned to each user in the cluster and corrects them if necessary.
After upgrading to Robin CNP v5.4.3 HF5, the k8s_auto_registration config parameter is disabled by default. This setting is deactivated to prevent all Kubernetes apps from automatically registering and consuming resources. Be aware of the following points with this change:
You can register Kubernetes apps manually using the robin app register command and use Robin CNP for snapshot, clone, and backup operations of the Kubernetes apps.
Because this config parameter is disabled, when you run the robin app nfs-list command, the mappings between Kubernetes apps and NFS server Pods are not listed in the command output.
If you need the mapping between a Kubernetes app and an NFS server Pod while the k8s_auto_registration config parameter is disabled or the Kubernetes app is not manually registered, get the PVC name from the Pod YAML file (kubectl get pod -n <namespace> <pod> -o yaml) and run the robin nfs export list | grep <pvc name> command. The robin nfs export list command output displays the PVC name and namespace.
25.7.3. New Feature¶
25.7.3.1. cAdvisor as a DaemonSet¶
Robin CNP v5.4.3 HF5 supports cAdvisor as a standalone DaemonSet. By default, cAdvisor (Container Advisor) is enabled when you enable Robin Metrics on your Robin cluster.
However, if you do not want to use Robin Metrics but still need cAdvisor for gathering node-level metrics, you can enable it and manage cAdvisor independently.
You can start, stop, and view the status of the cAdvisor services without enabling Robin Metrics. This allows you to view metrics on any observability framework (OBF) of your choice. For more information, see cAdvisor as a DaemonSet.
The following new commands are added to support this feature:
# robin metrics start-cadvisor
# robin metrics status --cadvisor
# robin metrics stop-cadvisor
25.7.3.2. Auto Disk Rebalance¶
Robin CNP v5.4.3 HF5 supports the Auto Disk Rebalance feature. The Auto Disk Rebalance feature manages the storage space of all disks in the cluster automatically when the disk reaches a certain watermark threshold.
By default, the Auto Disk Rebalance feature is enabled.
When a disk reaches a high watermark, the disk rebalance job automatically starts moving the volume slices from one disk to another disk. For more information, see Auto Disk Rebalance.
25.7.4. Improvements¶
25.7.4.1. Robin services in Master and Worker Pods¶
The following Robin services run in the Master and Worker Pods as per 5.4.x architecture:
Robin services in Master Pod:
consul-server
robin-server
robin-auth-server
robin-event-server
sherlock-server
gui-cli
stormgr-server
httpd
robin-node-monitor
Robin services in the Worker Pod:
consul-client
robin-agent
monitor-server
25.7.4.2. Support to specify description when putting a host in maintenance mode¶
Starting from Robin CNP v5.4.3 HF5, the --desc
option is added to specify a description when putting a host in the Maintenance mode.
25.7.4.3. Support to replace the existing Cluster Identity certificate¶
Robin CNP v5.4.3 HF5 supports replacing your existing Cluster Identity Certificate created during installation. The Cluster Identity Certificate is a server certificate that is presented to all clients that send requests to all externally facing services of the CNP cluster. It provides assurance to clients that they are connecting to the right server. By default, the Cluster Identity Certificate is signed by the Cluster CA. The Cluster CA is a self-signed certificate that is generally not trusted by clients with strict security configured. You can replace the existing Cluster Identity Certificate with a certificate that is signed by an external trusted CA. For more information, see Replacing an existing Cluster Identity certificate.
You can use the following command to replace the existing Cluster Identity certificate:
# robin cert update-cluster-identity <identity_ca_path> <identity_cert_path> <identity_key_path> --force
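For example, with illustrative file paths:
# robin cert update-cluster-identity /root/certs/ca.pem /root/certs/identity.crt /root/certs/identity.key --force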
Once the Cluster Identity Certificate is replaced, you must restart the following containers and Pods for the new certificate to take effect:
Containers
haproxy-keepalived-robink8s_monitor
container running on all master nodes
Pods
robin-master
robin-grafana-rs
robin-prometheus-rs
25.7.4.4. Support to update NFS server type for RWX volume¶
Robin CNP v5.4.3 HF5 allows you to update the NFS server type for an RWX volume after mounting it.
You can change the NFS server type from shared
to exclusive
and the other way around. For more information, see Update NFS server Pod type.
Note
Before updating the type of NFS server for an RWX volume, the respective volume must be unmounted.
To update the NFS server type, the following new CLI is added as part of Robin CNP v5.4.3 HF5:
# robin nfs export-update <volume> --nfs-server-type <shared|exclusive>
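For example, to switch an unmounted volume named pvc-data-1 (an illustrative name) to an exclusive NFS server Pod:
# robin nfs export-update pvc-data-1 --nfs-server-type exclusive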
25.7.4.5. Support to add static MAC address for KVM-based application¶
Robin CNP v5.4.3 HF5 supports the static MAC address for KVM-based applications. To specify the static MAC address for the KVM-based application, you must add the static MAC address in the input.yaml
file at runtime when creating an application.
This feature enables you to use the VM licenses that are tied to the MAC addresses of the VMs. For more information, see Create a VM with static MAC address.
The following key is used to specify the static MAC address for the KVM-based application in the input.yaml
file:
static_macs
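A hedged sketch of how the key might appear; only the static_macs key name is taken from these notes, and the surrounding input.yaml structure is illustrative:
vms:
  - name: licensed-vm
    static_macs:
      - "52:54:00:12:34:56"    # MAC address tied to the VM license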
25.7.4.6. Config option to timeout Metrics start job¶
A new parameter, metrics_timeout, is added as part of Robin CNP v5.4.3 HF5. Using this configurable parameter, you can set the timeout period for the metrics start job. If any of the Metrics feature related Pods (cAdvisor, Grafana, or Prometheus) do not come up within the timeout period, the metrics start job fails. The default timeout is 3600 seconds.
Note
It is recommended not to set the timeout period to less than 900 seconds.
You can configure the parameter using the following command:
# robin config update server metrics_timeout <value>
Example
# robin config update server metrics_timeout 1000 --wait
The 'server' attribute 'metrics_timeout' has been updated
25.7.4.7. New Schedule type PlanCleanup¶
A new schedule type, PlanCleanup, is added to the Robin Schedules list as part of Robin CNP v5.4.3 HF5. This schedule cleans up stale Plan IDs.
The schedule runs every 30 minutes and is enabled by default.
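To confirm the schedule is present and enabled, a check along these lines should work, assuming a robin schedule list subcommand consistent with the robin schedule update command shown earlier in these notes:
# robin schedule list | grep PlanCleanup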
25.7.5. Fixed Issues¶
Reference ID |
Description |
---|---|
RSD-4065 |
When creating a superadmin user with AdminViewOnly capabilities or a tenantadmin user with TenantViewOnly capabilities, clusterrolebindings / rolebindings giving the user full access to K8s resources were being created. This issue has been fixed. Admin users with ViewOnly capabilities now get view clusterrolebindings / rolebindings. A utility is provided with Robin CNP v5.4.3 HF5 to fix this issue for existing users (newly created users will have view only clusterrolebindings / rolebindings). You need to run the following command in the # robin-server validate-role-bindings
|
RSD-4098 |
The |
RSD-4998 |
The issue of a KVM in Robin CNP having an additional bond0 network interface in an SR-IOV IP-pool even though the |
RSD-5513 |
The issue of Robin CNP erroneously clearing the network attach definition in the cluster while clearing the stale network attachment, causing the Pods to fail to come up, is fixed. |
RSD-5710 |
The issue of a discrepancy in CPU core calculation when validating the tenant rpool limit is fixed. |
RSD-6367 |
On the Rocky Linux OS, the KVMs running on the Robin Cluster are getting restarted due to the default LIBVIRTD_ARGS set to |
RSD-6525 |
Security vulnerabilities are fixed. To fix security vulnerabilities with the existing versions of the following applications, Robin CNP v5.4.3 HF5 has the following upgraded versions:
|
RSD-6684 |
The issue of failing to add a new worker node to the Robin Cluster running on Robin CNP v5.4.3 HF4 Point Patch-1 due to an issue with the |
RSD-6753 |
In Robin CNP v5.4.3 HF4, during app creation, some of the vnode deploy jobs randomly fail with image pull errors. The following is a sample error message: Error: Command '/usr/bin/crictl -i unix:///var/run/crirobin.sock -r unix:///var/run/crirobin.sock pull jfrogbng2.altiostar.com/virtual-5g-docker/centos/nrduf1cmgr:0.0.1-4524' failed with return code 1: time="2024-02-28T18:59:35+09:00" level=fatal msg="unable to determine image API version: rpc error: code = Unknown desc = Exception calling application: ErrorUnknown:StatusCode.UNKNOWN:lstat /var/lib/docker/image/robin/robingraph/distribution/diffid-by-digest/sha256/.tmp-2d473b07cdd5f0912cd6f1a703352c82b512407db6b05b43f2553732b55df3bc497070871: no such file or directory". This issue is fixed. |
RSD-6763 |
The issue of the Helm binary version mismatch between the host and the downloaded Helm-client or the |
RSD-6781 |
The issue of Robin processes ( |
RSD-6843 |
The issue of NUMA node allocation not working for Guaranteed Pods without SRIOV interfaces is fixed. |
RSD-6897 |
When creating an RWX PVC without specifying the filesystem type, the Pod that will consume this PVC does not come up due to the error Volume already exists but requested fstype ext4 mismatches with existing volume’s fstype xfs. This issue is fixed. |
RSD-6919 |
After uninstalling a Helm app and forcefully deleting a Pod or a StatefulSet Pod, Robin CNP might hold inflight resources. This issue is fixed. |
RSD-6921 |
The issue of ERR_NOSPACE, where garbage collection (GC) failed to execute on disks that have reached full capacity, is fixed. |
RSD-7036 |
The file-collection volume fails to mount on the host because the zone ID is missing in the volume handle present in the PV object. This issue is fixed. |
RSD-7053 |
The issue of ERR_NOSPACE, where garbage collection (GC) failed to execute on the unloaded snapshots of the volume slices, is fixed. |
RSD-7112 |
When an event is generated for Robin Bundle and Helm apps, the tags CNCF, NF, and NS UUIDs are not present in the events seen by the Kafka consumer. This issue is fixed. |
RSD-7188 |
In certain scenarios, some of the Robin processes such as |
RSD-7218 |
The issue of RES memory for the robin-server process increasing gradually over a period of time, resulting in heartbeat missed alerts on the cluster, is fixed. |
RSD-7433 |
The issue of the iomgr-server service crashing with the error conn 0x7f349c400000: recvhdr failed with ERR_BADOP is fixed. |
RSD-7469 |
Earlier, the Robin predestroy Pod took one dedicated CPU. With Robin CNP v5.4.3 HF5, the Kubernetes Pods that are created from predestroy hooks use the BestEffort CPU instead of taking one dedicated CPU. |
RSD-7478 |
The issue of VPP applications not being able to use 100% CPU for Guaranteed Pods is fixed. With Robin CNP v5.4.3 HF5, the parameter |
RSD-7529 |
When Pod events are generated for a Kafka user, the master node's name does not appear correctly in the |
PP-34504 |
When the IOMGR and the control plane of a replica node are down at the same time, the resync tasks fail, bringing down the complete control plane. This issue is fixed. Due to this critical issue, the build is refreshed. |
PP-28972 |
When you try to deploy a KVM-based app and override the NIC tags in the IP pool using the |
25.7.6. Known Issues¶
Reference ID |
Description |
---|---|
PP-31547 |
Symptom A device may run out of space, and you might observe disk usage alerts or out-of-space errors when an application is writing data, resulting in failed writes. You might also observe that the physical size of a volume is greater than the logical size when you run the This issue could be because the garbage collector (GC) failed to reclaim space. Workaround If you notice this issue, contact the Robin Customer Support team. |
PP-33702 |
Symptom When Pod soft anti-affinity is applied to the SRIOV Pods, the Pods might not be evenly distributed on the nodes. After applying soft anti-affinity for Pods, you can run the following command and check the distribution of Pods on nodes: # kubectl get pods -n <namespace> -o wide
The uneven distribution of Pods occurs because Kubernetes recommends a node that is not part of the affinity list. Workaround To achieve an even distribution of Pods across nodes, delete Pods from the node with the highest number of Pods for the same app name. For example: if there are two nodes, Node A and Node B, and Node A has five Pods while Node B has one Pod, delete two Pods from Node A to balance the distribution of Pods between the two nodes. Run the following command to delete the Pods: # kubectl delete pod <pod_name>
|
PP-34088 |
Symptom After upgrading to Robin CNP v5.4.3 HF5, in rare scenarios, the following error might appear: MountVolume.WaitForAttach failed for volume "pvc-c3c62dd9-7b95-4254-912d-31ab5ae05150" : volume 1713906186:1 has GET error for volume attachment csi-aa7f3a85079a40fdc962a9f22ba5685f41947173a57b1f2477ae9847eff0a19b: volumeattachments.storage.k8s.io "csi-aa7f3a85079a40fdc962a9f22ba5685f41947173a57b1f2477ae9847eff0a19b" is forbidden: User "system:node:hypervvm-69-36.robinsystems.com" cannot get resource "volumeattachments" in API group "storage.k8s.io" at the cluster scope: no relationship found between node 'hypervvm-69-36.robinsystems.com' and this object. If you notice the above issue, apply the following workaround steps: Workaround
|
PP-34111 |
Symptom The Robin CNP v5.4.3 HF5 UI does not support the cAdvisor start, stop, and status check operations. Workaround You can perform these operations using the Robin CNP CLI. For more information, see cAdvisor as a DaemonSet. |
PP-34153 |
Symptom When you update affinity rules in a StatefulSet after Pods are deployed as per the affinity rules, the new affinity rules are not honored. Workaround You need to delete all replica Pods or redeploy the StatefulSet. |
PP-34157 |
Symptom After upgrading from any supported Robin CNP versions to Robin CNP v5.4.3 HF5, some of the hosts will be in the
If the status of the disks is in the Workaround Run the following command to rediscover the disks and other resources: # robin host probe --rediscover <hostname>
|
PP-34158 |
Symptom When a Robin CNP cluster has more than 50 nodes, the Robin CNP UI does not display the Metrics UI component. |
PP-34197 |
Symptom Robin CNP reserves snapshot space even though it is disabled in the bundle manifest file (snapshot: disabled), resulting in app deployment failure due to insufficient storage. If you notice this issue, apply the following workaround: Workaround For each volume, set the |
PP-34226 |
Symptom When a PersistentVolumeClaim (PVC) is created, the CSI provisioner initiates a VolumeCreate job. If this job fails, the CSI provisioner calls a new VolumeCreate job again for the same PVC. However, if the PVC is deleted during this process, the CSI provisioner continues to call the VolumeCreate job because it does not verify the existence of the PVC first. Workaround Bounce the CSI provisioner Pod: # kubectl delete pod <csi-provisioner-robin> -n robinio
|
PP-34339 |
Symptom When multiple applications are deployed at a time, the Kubernetes scheduler (K8spodplanner) takes time to pick up the job, and when you run the Workaround If you notice this issue, deploy Pods serially or wait for Pod deployment completion. |
PP-34359 |
Symptom When a Helm app or Kubernetes application is deployed on Robin CNP, Robin CNP allocates more resources to a tenant than its limit. To verify this, after deploying the app, run the following command to check whether any of the Pods are overusing resources: # robin tenant list <tenantname> --full
In the command output, if you notice any overallocation of resources, apply the following workaround. Workaround Run the following command to delete the Pods belonging to the tenant namespace: # kubectl delete pod <pod_name> -n <namespace>
|
PP-34415 |
Symptom When deploying a Helm app with millicore CPUs, Robin CNP rounds up the requested millicore CPUs to the nearest integer during the planning phase because it does not support millicore CPUs for Helm apps. If the requested CPUs exceed the max_cores_per_app limit, the Helm app deployment fails with an error similar to the following: Error from server: error when creating "sts-static-6689-novol.yaml": admission webhook "master.robin-server.service.robin" denied the request: Total cores (3) exceeds the max_cores_per_app limit (2) for tenant/rpool (master/worker) Workaround
|
PP-34434 |
Symptom When a StatefulSet or Deployment with multiple replicas is deployed with soft Pod affinity, all Pods are placed on the correct node except one, in spite of sufficient resources. Workaround Delete the Pod that is placed on the wrong node by running the following command: # kubectl delete pod <pod_name>
|
PP-34439 |
Symptom In certain scenarios, an SRIOV Pod with soft anti-affinity might get stuck in the If you notice the issue, perform the following checks to confirm and apply the workaround.
If you notice these symptoms, apply the following workaround: Workaround You need to delete the
|
PP-34451 |
Symptom In rare scenarios, the RWX Pod might be stuck in the
Perform the following steps to confirm the issue:
If you notice any input and output errors in step 4, apply the following workaround: Workaround
|
PP-34457 |
Symptom If the Metrics feature is enabled on your Robin CNP cluster and you are using Grafana for monitoring, after upgrading the cluster from any supported Robin CNP versions to Robin CNP v5.4.3 HF5, the Grafana metrics will not work. Note You need to take a backup of the configmaps of the Prometheus and Grafana apps in the Workaround You need to stop and restart the Metrics feature.
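Assuming the Metrics CLI commands implied by the metrics start job described earlier in these notes, the stop-and-restart sequence would be along these lines:
# robin metrics stop
# robin metrics start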
|
25.7.7. Technical Support¶
Contact Robin Technical Support for any assistance.