Release Notes
#############

Robin Cloud Native Storage v5.4.10-59
**************************************

The Robin Cloud Native Storage (CNS) v5.4.10-59 Release Notes document provides information about upgrade paths, new features, improvements, fixed issues, and known issues.

**Release Date:** September 24, 2024

Upgrade paths
==============

The following are the supported upgrade paths for Robin CNS v5.4.10-59:

- Robin CNS v5.4.4-182 to Robin CNS v5.4.10-59
- Robin CNS v5.4.8-280 to Robin CNS v5.4.10-59
- Robin CNS v5.4.8-313 to Robin CNS v5.4.10-59

.. Note::

   - If you are upgrading from Robin CNS v5.4.4-182 to Robin CNS v5.4.10-59, you must stop any snapshot creation and deletion operations. However, in unavoidable situations, you can run the ``robin volume-snapshot upgrade --wait`` command post-upgrade.

   - After upgrading to Robin CNS v5.4.10-59, if you are using the Robin Client outside the ``robincli`` Pod, you must upgrade to the latest version of the Robin Client.

   - If you have installed Robin CNS with the ``skip_postgres_operator`` parameter to use the Zalando PostgreSQL operator, you must first upgrade the Zalando PostgreSQL operator to v1.11.0 or later before upgrading to Robin CNS v5.4.10-59.

New Features
============

Dual-Stack Network support
---------------------------

Robin CNS v5.4.10-59 can now be deployed on Google Anthos clusters with IPv4/IPv6 dual-stack networking. For more information, see `IPv4/IPv6 dual-stack networking `__.

Auto Volume Cleanup after a node removal
-----------------------------------------

When you remove all nodes from a Kubernetes cluster that serve replicas of a volume, the Auto Volume Cleanup feature automatically deletes the volumes and cleans up the data and stale devices associated with them. If you want to retain your data, you must first manually evacuate all volumes from the drives of that node and then remove the node from the cluster. For more information, see `Removing Nodes `__.
Robin recommends evacuating the volumes before removing a node because removing a node entails data loss.

.. Note:: By default, the Auto Volume Cleanup feature is disabled. For more information, see `Auto Volume Cleanup `__.

Robin Patroni Monitor
----------------------

The Robin Patroni Monitor feature allows you to monitor the status of the Patroni instances (Pods) in a cluster. The Robin CNS architecture includes a highly available PostgreSQL cluster managed by Patroni, referred to as the Patroni cluster. To ensure high availability (HA), Patroni maintains three copies of its database, meaning a maximum of three Patroni instances are present in a cluster at any given time.

A Patroni cluster might become unavailable for a number of reasons. To monitor the status of the Patroni cluster, Robin CNS provides the Robin Patroni Monitor feature, which generates events as required.

.. Note:: After you upgrade from a previous Robin CNS version to Robin CNS v5.4.10-59, the Robin Patroni Monitor feature is automatically enabled.

Starting from Robin CNS v5.4.10, the ``robin event-type list`` command displays the following event types related to Patroni if there is any change in the status of the Patroni replicas:

- EVENT_PATRONI_LEADER_CHANGE
- EVENT_PATRONI_INSTANCE_NOT_READY
- EVENT_PATRONI_INSTANCE_FAILED
- EVENT_PATRONI_INSTANCE_READY

For more information, see `Robin Patroni Monitor `__.

Improvements
=============

Support for Patroni with Permanent Replication Slots
-----------------------------------------------------

Robin CNS v5.4.10 supports Patroni version 3.2.2. In Robin CNS v5.4.10, all Patroni replicas are configured with permanent replication slots. In previous Robin CNS versions, only the Patroni Leader maintained the replication slots. Replication slots in Patroni track the amount of data sent to other Patroni replicas. Permanent replication slots ensure that the Leader retains the necessary WAL files until they are safely received by all standby replicas.
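As an illustration of what permanent slots look like at the PostgreSQL level, the following hedged sketch lists the replication slots from inside a Patroni Pod. The ``<patroni-leader-pod>`` name is a placeholder (find the current Leader with ``patronictl list``); the ``robinio`` namespace and the ``postgres`` container name are taken from examples elsewhere in this document, and the ``postgres`` user is an assumption about the default Patroni setup.

```shell
# Hedged sketch: show replication slots on the Patroni Leader.
# Permanent slots report "f" (false) in the "temporary" column of
# the standard pg_replication_slots view.
# <patroni-leader-pod> is a placeholder, not a real Pod name.
kubectl exec -n robinio <patroni-leader-pod> -c postgres -- \
  psql -U postgres -Atc \
  "SELECT slot_name, slot_type, temporary, active FROM pg_replication_slots;"
```

This is a read-only check; it does not modify the slot configuration that Patroni manages.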
When you install or upgrade to Robin CNS v5.4.10, Patroni version 3.2.2 is also installed or upgraded accordingly, and all Patroni replicas start maintaining the permanent replication slots.

.. Note:: If you have installed Robin CNS with the ``skip_postgres_operator`` parameter to use the Zalando PostgreSQL operator, you must first upgrade the Zalando PostgreSQL operator to v1.11.0 or later before upgrading to Robin CNS v5.4.10-59.

Fixed Issues
=============

============= ============================================================
Reference ID  Description
============= ============================================================
RSD-7768      The issue of the Robin CNS cluster showing the status as
              **Ready** even though the nodes of the cluster are not in
              the **Ready** status is fixed. With this release, the
              cluster status displays **Ready** only if all the nodes in
              the cluster are **Ready** and **ONLINE**.
RSD-7599      The issue of the ``robin_node_state`` metric name mismatch
              between the output of the metrics command and the Robin
              documentation for node metrics is fixed. Both now display
              the same metric name (``robin_node_state``) for node
              metrics.
============= ============================================================

Known Issues
=============

**Reference ID:** PP-35412

**Symptom**

If a node where one of the Patroni Replicas is deployed goes down or is isolated in a network partition for an extended period of time, the Replica instance may not be able to catch up to (synchronize with) the Leader when the node comes back online. This can happen when one of the Write Ahead Log (WAL) files needed for recovery no longer exists on the Leader.
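One way to spot a lagging member mechanically is to filter the ``patronictl list`` table for members that are neither ``running`` nor ``streaming``. The sketch below is a self-contained illustration that replays this issue's example output from a temporary file; against a live cluster you would pipe ``patronictl list`` into the same ``awk`` filter instead. The ``/tmp/patroni_list.txt`` path is illustrative only.

```shell
# Hedged sketch: find Patroni members in an unhealthy state.
# The sample table below is the example output from this known issue.
cat <<'EOF' > /tmp/patroni_list.txt
+-----------------+-------------+--------------+-----------+----+-----------+
| Member          | Host        | Role         | State     | TL | Lag in MB |
+-----------------+-------------+--------------+-----------+----+-----------+
| robin-patroni-0 | 192.0.2.214 | Sync Standby | streaming |  7 |         0 |
| robin-patroni-1 | 192.0.2.194 | Replica      | starting  |    |      5986 |
| robin-patroni-2 | 192.0.2.247 | Leader       | running   |  7 |           |
+-----------------+-------------+--------------+-----------+----+-----------+
EOF
# Split on "|": column 2 is Member, column 5 is State. Skip border
# lines (no "|" fields) and the header row, then print any member
# whose State is neither "running" nor "streaming".
awk -F'|' 'NF > 6 {
  gsub(/ /, "", $2); gsub(/ /, "", $5);
  if ($2 != "" && $2 != "Member" && $5 != "running" && $5 != "streaming")
    print $2
}' /tmp/patroni_list.txt
```

For the sample data above, the filter prints ``robin-patroni-1``, the replica that the workaround below reinitializes.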
In the Robin Master Pod, when you run the ``patronictl list`` command, it shows the following state for the Patroni Replica instance:

**Example output**

.. code-block:: text

   # patronictl list
   Defaulted container "postgres" out of: postgres, dummy (init)
   Cluster: robin-patroni (7395181849828630602)
   +-----------------+-------------+--------------+-----------+----+-----------+
   | Member          | Host        | Role         | State     | TL | Lag in MB |
   +-----------------+-------------+--------------+-----------+----+-----------+
   | robin-patroni-0 | 192.0.2.214 | Sync Standby | streaming |  7 |         0 |
   | robin-patroni-1 | 192.0.2.194 | Replica      | starting  |    |      5986 |
   | robin-patroni-2 | 192.0.2.247 | Leader       | running   |  7 |           |
   +-----------------+-------------+--------------+-----------+----+-----------+

You can find the following error message about the missing WAL file in the latest PostgreSQL CSV file (for example, ``postgresql-7.csv``) at the path ``/home/postgres/pgdata/pgroot/pg_log/`` inside the robin-patroni Pod that is stuck in the ``starting`` state (from the above output, it is the ``robin-patroni-1`` Pod).

.. code-block:: text

   2024-09-06 14:32:03.553 UTC,"replicator","",20442,
   "192.190.2.241:39040",66db1263.4fda,4,
   "START_REPLICATION",2024-09-06 14:32:03 UTC,10/0,0,ERROR,
   58P01,"requested WAL segment 0000002B00000010000000F8 has already been removed"

To overcome this, you need to reinitialize the robin-patroni replica that is stuck in the ``starting`` state.

**Workaround**

Complete the following steps to reinitialize the Patroni Replica instance:

1. Run the following command to get the status of the Patroni replicas:

   .. code-block:: text

      # patronictl list

   **Note**: Make sure the ``Leader`` and ``Sync Standby`` Patroni replicas are running and healthy.

2. Exec into the Patroni Replica instance that is failing to start:

   .. code-block:: text

      # kubectl exec -it -n robinio <patroni-replica-pod> -- bash

   From the above ``patronictl list`` command output example, ``robin-patroni-1`` is lagging.

   **Example:**

   .. code-block:: text

      # kubectl exec -it -n robinio robin-patroni-1 -- bash

3. Reinitialize the Patroni Replica instance that is failing to start:

   .. code-block:: text

      # patronictl reinit robin-patroni <member-name> --force

   From the above ``patronictl list`` command output example, ``robin-patroni-1`` is lagging.

   **Example:**

   .. code-block:: text

      # patronictl reinit robin-patroni robin-patroni-1 --force

4. Verify whether the instance is started and running:

   .. code-block:: text

      # patronictl list

**Reference ID:** PP-34695

**Symptom**

In rare scenarios, the ``stormgr-server`` service might not recover after network restarts and might become stuck in the ``FATAL`` state. You might observe the following symptoms:

- Workloads might be stuck in the ``Terminating`` state for a long time. If you observe this, check whether the ``stormgr-server`` service is in a faulted state by running the following command:

  .. code-block:: text

     # kubectl exec -n robinio <robin-master-pod> -- supervisorctl status stormgr-server

- The ``robin volume list`` command fails with the following error: ``Cannot contact SM at endpoint /vols``

- The ``robin-service-status`` command on the ``robin-master`` Pod shows the following status for the stormgr service: ``stormgr-server FATAL Exited too quickly (process log may have details)``

If you observe these symptoms, you need to restart the ``stormgr-server`` service in the Robin Master Pod.

**Workaround**

Complete the following steps to restart the ``stormgr-server`` service in the Robin Master Pod:

1. Restart the ``stormgr-server`` service by running the following command:

   .. code-block:: text

      # kubectl exec -n robinio <robin-master-pod> -- supervisorctl restart stormgr-server

2. Run the following command to check the status of the ``stormgr-server`` service:

   .. code-block:: text

      # kubectl exec -n robinio <robin-master-pod> -- supervisorctl status stormgr-server

Technical Support
=================

Contact `Robin Technical support `_ for any assistance.