20. Release Notes

20.1. Robin Cloud Native Storage v5.4.10-59

The Robin Cloud Native Storage (CNS) v5.4.10-59 Release Notes document provides information about upgrade paths, new features, improvements, fixed issues, and known issues.

Release Date: September 24, 2024

20.1.1. Upgrade paths

The following are the supported upgrade paths for Robin CNS v5.4.10-59:

  • Robin CNS v5.4.4-182 to Robin CNS v5.4.10-59

  • Robin CNS v5.4.8-280 to Robin CNS v5.4.10-59

  • Robin CNS v5.4.8-313 to Robin CNS v5.4.10-59

Note

  • If you are upgrading from Robin CNS v5.4.4-182 to Robin CNS v5.4.10-59, you must stop all snapshot creation and deletion operations before the upgrade. If this is unavoidable, run the robin volume-snapshot upgrade --wait command after the upgrade (see the example after this note).

  • After upgrading to Robin CNS v5.4.10-59, if you are using the Robin Client outside the robincli Pod, you must upgrade to the latest version of the Robin Client.

  • If you have installed Robin CNS with the skip_postgres_operator parameter to use the Zalando PostgreSQL operator, then you must first upgrade the Zalando PostgreSQL operator to v1.11.0 or later before upgrading to Robin CNS v5.4.10-59.
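
The following is a minimal sketch of the post-upgrade command mentioned in the first note above. It assumes the robin CLI is invoked from the robincli Pod in the robinio namespace; the Pod name placeholder is illustrative:

# kubectl exec -it -n robinio <robincli-pod> -- robin volume-snapshot upgrade --wait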

20.1.2. New Features

20.1.2.1. Dual-Stack Network support

Robin CNS v5.4.10-59 can now be deployed on Google Anthos clusters with IPv4/IPv6 dual-stack networking. For more information, see IPv4/IPv6 dual-stack networking.
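
A quick, generic Kubernetes check (not specific to Robin CNS or Anthos) that can help confirm dual-stack configuration on clusters that allocate Pod CIDRs per node is to verify that each node reports both an IPv4 and an IPv6 Pod CIDR:

# kubectl get nodes -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.spec.podCIDRs}{"\n"}{end}'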

20.1.2.2. Auto Volume Cleanup after a node removal

When you remove all nodes that serve replicas of a volume from the Kubernetes cluster, the Auto Volume Cleanup feature automatically deletes the volumes and cleans up the data and stale devices associated with them.

If you want to retain your data, you must first manually evacuate all volumes from the drives of that node and then remove the node from the cluster. For more information, see Removing Nodes. Robin recommends evacuating the volumes before removing a node because removing a node without evacuation results in data loss.

Note

By default, the Auto Volume Cleanup feature is disabled.

For more information, see Auto Volume Cleanup.
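
With the feature enabled, a minimal post-removal check is to confirm that the affected volumes and their backing PersistentVolumes are no longer listed. This is a sketch, assuming the robin CLI is reachable (for example, from the robincli Pod):

# robin volume list
# kubectl get pv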

20.1.2.3. Robin Patroni Monitor

The Robin Patroni Monitor feature allows you to monitor the status of the Patroni instances (Pods) in a cluster. The Robin CNS architecture includes a highly available PostgreSQL cluster managed by Patroni, referred to as the Patroni Cluster.

To ensure high availability (HA), Patroni maintains three copies of its database, meaning a maximum of three Patroni instances are present in a cluster at any given time.

A Patroni cluster might become unavailable for a number of reasons. To monitor the status of the Patroni cluster, Robin CNS provides the Robin Patroni Monitor feature, which generates events as required.

Note

After you upgrade from a previous Robin CNS version to Robin CNS v5.4.10-59, the Robin Patroni Monitor feature is automatically enabled.

Starting from Robin CNS v5.4.10, the robin event-type list command displays the following event types related to Patroni, which are generated when there is a change in the status of the Patroni replicas:

  • EVENT_PATRONI_LEADER_CHANGE

  • EVENT_PATRONI_INSTANCE_NOT_READY

  • EVENT_PATRONI_INSTANCE_FAILED

  • EVENT_PATRONI_INSTANCE_READY
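
For example, you can list only the Patroni-related event types and compare them with the current state of the Patroni cluster. This is a hedged sketch; the grep filter and the Pod name placeholder are illustrative:

# robin event-type list | grep PATRONI
# kubectl exec -it -n robinio <robin-patroni-pod> -c postgres -- patronictl list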

For more information, see Robin Patroni Monitor.

20.1.3. Improvements

20.1.3.1. Support for Patroni with Permanent Replication Slots

Robin CNS v5.4.10 supports Patroni version 3.2.2. In Robin CNS v5.4.10, all Patroni replicas are configured with permanent replication slots. In previous Robin CNS versions, only the Patroni Leader maintained the replication slots.

Replication slots in Patroni track the amount of data sent to other Patroni replicas. Permanent replication slots ensure that the Leader retains the necessary WAL files until they are safely received by all standby replicas.

When you install or upgrade to Robin CNS v5.4.10, Patroni Version 3.2.2 is also installed or upgraded accordingly, and all Patroni replicas start maintaining the permanent replication slots.
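
To inspect the replication slots on a Patroni member (for example, the current Leader), you can query pg_replication_slots. The following is a minimal verification sketch, assuming psql access as the postgres user inside the Patroni Pod; the Pod name placeholder is illustrative:

# kubectl exec -it -n robinio <patroni-leader-pod> -c postgres -- psql -U postgres -c "SELECT slot_name, slot_type, active FROM pg_replication_slots;"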

Note

If you have installed Robin CNS with the skip_postgres_operator parameter to use the Zalando PostgreSQL operator, then you must first upgrade the Zalando PostgreSQL operator to v1.11.0 or later before upgrading to Robin CNS v5.4.10-59.

20.1.4. Fixed Issues

Reference ID: RSD-7768

Description: The issue where the Robin CNS cluster showed the status as Ready even though the nodes of the cluster were not in the Ready state is fixed. With this release, the cluster status displays Ready only if all nodes in the cluster are Ready and ONLINE (see the cross-check example after these fixed issues).

Reference ID: RSD-7599

Description: The issue of a node metric name mismatch between the output of the metrics command and the Robin documentation is fixed. Both the metrics command output and the documentation now use the same metric name, robin_node_state, for node metrics.
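
As a quick cross-check of the RSD-7768 fix, you can compare the Kubernetes view of node readiness with Robin's view of the cluster. This is a hedged sketch, not an official procedure; it assumes the robin CLI is reachable (for example, from the robincli Pod) and that the robin host list command is available in your build:

# kubectl get nodes
# robin host list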

20.1.5. Known Issues

Reference ID: PP-35412

Symptom

If a node where one of the Patroni Replicas is deployed goes down or is isolated in a network partition for an extended period of time, the Replica instance may not be able to catch up to (synchronize with) the Leader when the node comes back online. This can happen when one of the Write Ahead Log (WAL) files needed for recovery no longer exists on the Leader.

In the Robin Master Pod, when you run the patronictl list command, it shows the following state for the Patroni Replica instance:

Example output

# patronictl list
Defaulted container "postgres" out of: postgres, dummy (init)
Cluster: robin-patroni (7395181849828630602)
+-----------------+-------------+--------------+-----------+----+-----------+
| Member          | Host        | Role         | State     | TL | Lag in MB |
+-----------------+-------------+--------------+-----------+----+-----------+
| robin-patroni-0 | 192.0.2.214 | Sync Standby | streaming |  7 |         0 |
| robin-patroni-1 | 192.0.2.194 | Replica      | starting  |    |      5986 |
| robin-patroni-2 | 192.0.2.247 | Leader       | running   |  7 |           |
+-----------------+-------------+--------------+-----------+----+-----------+

You can find the following error message about the missing WAL file in the latest PostgreSQL CSV log file (for example, postgresql-7.csv) under /home/postgres/pgdata/pgroot/pg_log/ inside the robin-patroni Pod that is stuck in the starting state (in the output above, this is the robin-patroni-1 Pod):

2024-09-06 14:32:03.553 UTC,"replicator","",20442,
"192.190.2.241:39040",66db1263.4fda,4,
"START_REPLICATION",2024-09-06 14:32:03 UTC,10/0,0,ERROR,
58P01,"requested WAL segment 0000002B00000010000000F8
has already been removed"

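A hedged way to search for this error message from outside the Pod, assuming the example file name postgresql-7.csv shown above (the actual CSV file name varies):

# kubectl exec -it -n robinio <robin-patroni pod> -c postgres -- grep "has already been removed" /home/postgres/pgdata/pgroot/pg_log/postgresql-7.csv
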
To overcome this, you need to reinitialize the robin-patroni replica that is stuck in the starting state.

Workaround

Complete the following steps to reinitialize the Patroni Replica instance:

  1. Run the following command to list the status of the Patroni replicas:

# patronictl list

Note: Make sure the Leader and Sync Standby Patroni Replicas are running and healthy.

  2. Exec into the Patroni Replica instance that is failing to start:

    # kubectl exec -it -n robinio <robin-patroni pod> -- bash
    

    From the above patronictl list command output example, robin-patroni-1 is the lagging instance.

    Example:

    # kubectl exec -it -n robinio robin-patroni-1 -- bash
    
  3. Reinitialize the Patroni Replica instance that is failing to start:

    # patronictl reinit robin-patroni <robin-patroni pod> --force
    

    From the above patronictl list command output example, robin-patroni-1 is the lagging instance.

    Example:

    # patronictl reinit robin-patroni robin-patroni-1 --force
    
  4. Verify whether the instance is started and running:

    # patronictl list
    

Reference ID: PP-34695

Symptom

In rare scenarios, the stormgr-server service might not recover after a network restart and might remain stuck in the FATAL state. You might observe the following symptoms:

  • Workloads might be stuck in the Terminating state for a long time. If you observe this, check whether the stormgr-server service is in a faulted state by running the following command:

    # kubectl exec -n robinio <robin-master-pod>
    -- supervisorctl status stormgr-server
    
  • The robin volume list command fails with the following error:

    Cannot contact SM at endpoint /vols

  • The output of robin-service-status on the robin-master Pod shows the following status for the stormgr-server service:

    stormgr-server FATAL Exited too quickly (process log may have details)

If you observe these symptoms, you need to restart the stormgr-server service in the Robin Master Pod.

Workaround

Complete the following steps to restart the stormgr-server service in the Robin Master Pod:

  1. Restart the stormgr-server service by running the following command:

    # kubectl exec -n robinio <robin-master-pod>
    -- supervisorctl restart stormgr-server
    
  2. Run the following command to check the status of the stormgr-server:

    # kubectl exec -n robinio <robin-master-pod>
    -- supervisorctl status stormgr-server
    
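    If the restart succeeds, the status output should report the service as RUNNING, for example (illustrative output only; the pid and uptime values will differ):

    stormgr-server                   RUNNING   pid <pid>, uptime <uptime>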

20.1.6. Technical Support

Contact Robin Technical support for any assistance.