17. Troubleshooting
Robin CNS provides a number of native tools and commands that administrators can use to troubleshoot a Robin cluster and report issues. These tools vary in their use case, but each provides insight into why the cluster is not functioning as intended or why an unexpected failure occurred. As a result, they should be the go-to utilities when debugging potential issues, and their output should accompany any bug report filed with Robin. Each tool is described in its own section below.
Alongside the aforementioned tools for administrators, Robin CNS also provides more granular commands for individual users to track the progress of their executed operations and determine the reasons for any failure. These operations are referred to as jobs, and each is identified by a unique ID.
17.1. Listing all jobs
Robin stores all jobs that have occurred during a cluster’s lifespan. To view these jobs along with details such as their start time and state, issue the following command:
# robin job list --verbose
--ignoredeps
--noarchived
--nopurged
--states <states>
--failed
--nocolor
--page_size <size>
--page_num <num>
--total
--all
--app <app_name>
--k8sapp <k8sapp_name>
--vnode <vnode_name>
--node <node_name>
--disk <disk_wwn>
--json
--verbose              | Show complete job information instead of truncating it for display purposes
--ignoredeps           | Do not show child jobs
--noarchived           | Do not show archived jobs
--nopurged             | Do not show purged jobs
--states <states>      | Filter jobs based on states. Choose one or more from: active, failed, succeeded, archived, purged
--failed               | Show only jobs which have failed
--nocolor              | Show uncolored output
--page_size <size>     | Number of jobs that should be displayed for each page
--page_num <num>       | Page number to start displaying jobs from (starting index 1)
--total                | Return the total number of qualified root jobs
--all                  | Display all jobs associated with a specific application. Note this option must be used in conjunction with the
--app <app_name>       | Filter jobs based on specified application
--k8sapp <k8sapp_name> | Filter jobs based on specified K8s/Helm registered application name
--vnode <vnode_name>   | Filter jobs based on specified Vnode name
--node <node_name>     | Filter jobs based on specified physical node name
--disk <disk_wwn>      | Filter jobs based on specified disk WWN
--json                 | Display output in JSON format
Example:
# robin job list
ID | Type | Description | State | Start | End | User | Message
--------------+-------------------+--------------------------------------------------------------------------------------------------------------------------------------------+------------------+-----------------+---------+--------+------------------------------------------
1013 | ApplicationStart | Starting application 'wp-10' | COMPLETED | 13 Aug 23:28:29 | 0:00:54 | system |
|->1015 | RoleStart | Starting instances for role 'mysql' | COMPLETED | 13 Aug 23:28:30 | 0:00:38 | system |
| |->1017 | VnodeDeploy | Deploying vnode 'wp-10.mysql.01'. Origin: REST (cscale-82-140.robinsystems.com) | COMPLETED | 13 Aug 23:28:30 | 0:00:38 | system |
| | |->1018 | VnodeStop | Stopping vnode wp-10.mysql.01 on cscale-82-140.robinsystems.com | COMPLETED | 13 Aug 23:28:30 | 0:00:15 | system |
|->1016 | RoleStart | Starting instances for role 'wordpress' | COMPLETED | 13 Aug 23:29:08 | 0:00:14 | system |
| |->1024 | VnodeDeploy | Deploying vnode 'wp-10.wordpress.01'. Origin: REST (cscale-82-140.robinsystems.com) | COMPLETED | 13 Aug 23:29:08 | 0:00:14 | system |
| | |->1025 | VnodeStop | Stopping vnode wp-10.wordpress.01 on cscale-82-140.robinsystems.com | COMPLETED | 13 Aug 23:29:08 | 0:00:07 | system |
1014 | ApplicationStart | ApplicationStart | COMPLETED|FAILED | 13 Aug 23:28:29 | 0:00:00 | system | Another job is running on application 'w
1019 | ApplicationStart | Starting application 'wp-20' | COMPLETED | 13 Aug 23:28:31 | 0:00:51 | system |
|->1020 | RoleStart | Starting instances for role 'mysql' | COMPLETED | 13 Aug 23:28:32 | 0:00:36 | system |
| |->1022 | VnodeDeploy | Deploying vnode 'wp-20.mysql.01'. Origin: REST (cscale-82-140.robinsystems.com) | COMPLETED | 13 Aug 23:28:32 | 0:00:36 | system |
| | |->1023 | VnodeStop | Stopping vnode wp-20.mysql.01 on cscale-82-140.robinsystems.com | COMPLETED | 13 Aug 23:28:32 | 0:00:13 | system |
|->1021 | RoleStart | Starting instances for role 'wordpress' | COMPLETED | 13 Aug 23:29:08 | 0:00:14 | system |
| |->1026 | VnodeDeploy | Deploying vnode 'wp-20.wordpress.01'. Origin: REST (cscale-82-140.robinsystems.com) | COMPLETED | 13 Aug 23:29:08 | 0:00:14 | system |
| | |->1027 | VnodeStop | Stopping vnode wp-20.wordpress.01 on cscale-82-140.robinsystems.com | COMPLETED | 13 Aug 23:29:08 | 0:00:05 | system |
1028 | JobArchive | Archiving job/s on all hosts | COMPLETED | 14 Aug 00:00:00 | 0:00:02 | system |
|->1029 | AgentJobArchive | Archiving job/s on host cscale-82-140.robinsystems.com | COMPLETED | 14 Aug 00:00:01 | 0:00:00 | system |
1030 | HostProbe | Probed cscale-82-140.robinsystems.com from ONLINE/Ready ==> ONLINE/Ready. Origin: StartingHostWatch. | COMPLETED | 14 Aug 07:54:37 | 0:00:01 | system |
1031 | HostProbe | Probed cscale-82-139.robinsystems.com from UNREACHABLE/Notready ==> UNREACHABLE/Notready. Origin: StartingHostWatch. | COMPLETED | 14 Aug 07:54:37 | 0:00:51 | system |
1032 | HostProbe | Probed cscale-82-139.robinsystems.com from UNREACHABLE/Notready ==> UNREACHABLE/Notready. Origin: StartingHostWatch. | COMPLETED | 14 Aug 08:11:11 | 0:00:50 | system |
1033 | HostProbe | Probed cscale-82-140.robinsystems.com from ONLINE/Ready ==> ONLINE/Ready. Origin: StartingHostWatch. | COMPLETED | 14 Aug 08:11:11 | 0:00:01 | system |
1034 | HostProbe | Probed cscale-82-139.robinsystems.com from PROBE_PENDING/Notready ==> UNREACHABLE/Notready. Origin: NodeUp. | COMPLETED | 14 Aug 09:24:17 | 0:00:50 | system |
1035 | HostFailoverPods | Failing over pods on host cscale-82-139.robinsystems.com | COMPLETED|FAILED | 14 Aug 09:25:07 | 0:01:40 | system | Pods do not need to be failed over as Ku
1036 | HostProbe | Probed cscale-82-139.robinsystems.com from UNREACHABLE/Notready ==> ONLINE/Ready. Origin: StateChange. | COMPLETED | 14 Aug 09:25:17 | 0:00:01 | system |
1037 | ApplicationDelete | Deleting application 'wp-10' | COMPLETED | 14 Aug 09:41:10 | 0:00:12 | robin |
|->1038 | VnodeDelete | Deleting vnode 'wp-10.wordpress.01' from cscale-82-140.robinsystems.com | COMPLETED | 14 Aug 09:41:10 | 0:00:06 | robin |
|->1039 | VnodeDelete | Deleting vnode 'wp-10.mysql.01' from cscale-82-140.robinsystems.com | COMPLETED | 14 Aug 09:41:10 | 0:00:08 | robin |
1040 | ApplicationDelete | Deleting application 'wp-20' | COMPLETED | 14 Aug 09:41:16 | 0:00:13 | robin |
|->1041 | VnodeDelete | Deleting vnode 'wp-20.wordpress.01' from cscale-82-140.robinsystems.com | COMPLETED | 14 Aug 09:41:16 | 0:00:10 | robin |
|->1042 | VnodeDelete | Deleting vnode 'wp-20.mysql.01' from cscale-82-140.robinsystems.com | COMPLETED | 14 Aug 09:41:16 | 0:00:09 | robin |
1043 | ApplicationDelete | Deleting application 'wp-30' | COMPLETED | 14 Aug 09:41:20 | 0:00:19 | robin |
|->1044 | VnodeDelete | Deleting vnode 'wp-30.wordpress.01' from cscale-82-140.robinsystems.com | COMPLETED | 14 Aug 09:41:20 | 0:00:06 | robin |
|->1045 | VnodeDelete | Deleting vnode 'wp-30.mysql.01' from cscale-82-140.robinsystems.com | COMPLETED | 14 Aug 09:41:20 | 0:00:15 | robin |
1046 | ApplicationCreate | Adding application 'wp-1' | COMPLETED | 14 Aug 09:42:58 | 0:00:58 | robin |
|->1047 | RoleCreate | Provisioning containers for role 'mysql' | COMPLETED | 14 Aug 09:43:00 | 0:00:42 | robin |
| |->1049 | VnodeAdd | Adding vnode 'wp-1.mysql.01' on cscale-82-140.robinsystems.com | COMPLETED | 14 Aug 09:43:00 | 0:00:42 | robin |
|->1048 | RoleCreate | Provisioning containers for role 'wordpress' | COMPLETED | 14 Aug 09:43:42 | 0:00:14 | robin |
| |->1053 | VnodeAdd | Adding vnode 'wp-1.wordpress.01' on cscale-82-140.robinsystems.com | COMPLETED | 14 Aug 09:43:42 | 0:00:14 | robin |
1050 | ApplicationCreate | Adding application 'wp-2' | COMPLETED | 14 Aug 09:43:39 | 0:00:46 | robin |
|->1051 | RoleCreate | Provisioning containers for role 'mysql' | COMPLETED | 14 Aug 09:43:42 | 0:00:34 | robin |
| |->1054 | VnodeAdd | Adding vnode 'wp-2.mysql.01' on cscale-82-140.robinsystems.com | COMPLETED | 14 Aug 09:43:42 | 0:00:34 | robin |
|->1052 | RoleCreate | Provisioning containers for role 'wordpress' | COMPLETED | 14 Aug 09:44:16 | 0:00:09 | robin |
| |->1055 | VnodeAdd | Adding vnode 'wp-2.wordpress.01' on cscale-82-140.robinsystems.com | COMPLETED | 14 Aug 09:44:16 | 0:00:09 | robin |
1056 | ApplicationCreate | Adding application 'wp-3' | COMPLETED | 14 Aug 09:44:18 | 0:00:57 | robin |
|->1057 | RoleCreate | Provisioning containers for role 'mysql' | COMPLETED | 14 Aug 09:44:20 | 0:00:41 | robin |
| |->1059 | VnodeAdd | Adding vnode 'wp-3.mysql.01' on cscale-82-140.robinsystems.com | COMPLETED | 14 Aug 09:44:20 | 0:00:41 | robin |
|->1058 | RoleCreate | Provisioning containers for role 'wordpress' | COMPLETED | 14 Aug 09:45:01 | 0:00:13 | robin |
| |->1067 | VnodeAdd | Adding vnode 'wp-3.wordpress.01' on cscale-82-140.robinsystems.com | COMPLETED | 14 Aug 09:45:02 | 0:00:12 | robin |
1060 | ApplicationDelete | Deleting application 'wp-1' | COMPLETED | 14 Aug 09:44:53 | 0:00:17 | robin |
|->1061 | VnodeDelete | Deleting vnode 'wp-1.wordpress.01' from cscale-82-140.robinsystems.com | COMPLETED | 14 Aug 09:44:53 | 0:00:05 | robin |
|->1062 | VnodeDelete | Deleting vnode 'wp-1.mysql.01' from cscale-82-140.robinsystems.com | COMPLETED | 14 Aug 09:44:53 | 0:00:13 | robin |
1063 | ApplicationDelete | Deleting application 'wp-2' | COMPLETED | 14 Aug 09:44:57 | 0:00:21 | robin |
|->1064 | VnodeDelete | Deleting vnode 'wp-2.wordpress.01' from cscale-82-140.robinsystems.com | COMPLETED | 14 Aug 09:44:57 | 0:00:09 | robin |
|->1065 | VnodeDelete | Deleting vnode 'wp-2.mysql.01' from cscale-82-140.robinsystems.com | COMPLETED | 14 Aug 09:44:57 | 0:00:18 | robin |
1066 | ApplicationDelete | ApplicationDelete | COMPLETED|FAILED | 14 Aug 09:45:01 | 0:00:00 | robin | Another job is running on application 'w
1068 | ApplicationProbe | Probing application 'wp-3' | COMPLETED | 14 Aug 09:45:12 | 0:00:00 | robin |
1069 | ApplicationDelete | Deleting application 'wp-3' | COMPLETED | 14 Aug 09:45:16 | 0:00:12 | robin |
|->1070 | VnodeDelete | Deleting vnode 'wp-3.wordpress.01' from cscale-82-140.robinsystems.com | COMPLETED | 14 Aug 09:45:16 | 0:00:05 | robin |
|->1071 | VnodeDelete | Deleting vnode 'wp-3.mysql.01' from cscale-82-140.robinsystems.com | COMPLETED | 14 Aug 09:45:16 | 0:00:09 | robin |
1072 | ApplicationCreate | Adding application 'wp-1' | COMPLETED | 14 Aug 09:47:03 | 0:00:45 | robin |
|->1074 | RoleCreate | Provisioning containers for role 'wordpress' | COMPLETED | 14 Aug 09:47:39 | 0:00:08 | robin |
| |->1076 | VnodeAdd | Adding vnode 'wp-1.wordpress.01' on cscale-82-139.robinsystems.com | COMPLETED | 14 Aug 09:47:39 | 0:00:08 | robin |
|->1073 | RoleCreate | Provisioning containers for role 'mysql' | COMPLETED | 14 Aug 09:47:05 | 0:00:34 | robin |
| |->1075 | VnodeAdd | Adding vnode 'wp-1.mysql.01' on cscale-82-139.robinsystems.com | COMPLETED | 14 Aug 09:47:05 | 0:00:34 | robin |
1077 | ApplicationCreate | Adding application 'wp-2' | COMPLETED | 14 Aug 09:47:43 | 0:00:44 | robin |
|->1079 | RoleCreate | Provisioning containers for role 'wordpress' | COMPLETED | 14 Aug 09:48:18 | 0:00:09 | robin |
| |->1081 | VnodeAdd | Adding vnode 'wp-2.wordpress.01' on cscale-82-139.robinsystems.com | COMPLETED | 14 Aug 09:48:18 | 0:00:09 | robin |
|->1078 | RoleCreate | Provisioning containers for role 'mysql' | COMPLETED | 14 Aug 09:47:45 | 0:00:33 | robin |
| |->1080 | VnodeAdd | Adding vnode 'wp-2.mysql.01' on cscale-82-139.robinsystems.com | COMPLETED | 14 Aug 09:47:45 | 0:00:33 | robin |
1082 | ApplicationCreate | Adding application 'wp-3' | COMPLETED | 14 Aug 09:49:14 | 0:03:12 | robin |
|->1083 | RoleCreate | Provisioning containers for role 'mysql' | COMPLETED | 14 Aug 09:49:16 | 0:02:49 | robin |
| |->1085 | VnodeAdd | Adding vnode 'wp-3.mysql.01' on cscale-82-139.robinsystems.com | COMPLETED | 14 Aug 09:49:16 | 0:02:49 | robin |
|->1084 | RoleCreate | Provisioning containers for role 'wordpress' | COMPLETED | 14 Aug 09:52:05 | 0:00:20 | robin |
| |->1086 | VnodeAdd | Adding vnode 'wp-3.wordpress.01' on cscale-82-139.robinsystems.com | COMPLETED | 14 Aug 09:52:05 | 0:00:20 | robin |
1087 | HostProbe | Probed cscale-82-139.robinsystems.com from PROBE_PENDING/Notready ==> UNREACHABLE/Notready. Origin: NodeDown. | COMPLETED | 14 Aug 09:53:43 | 0:00:52 | system |
1088 | HostFailoverPods | Failing over pods on host cscale-82-139.robinsystems.com | COMPLETED | 14 Aug 09:54:35 | 0:00:01 | system |
1089 | ApplicationStart | Starting application 'wp-3' | COMPLETED | 14 Aug 09:54:38 | 0:03:41 | system |
|->1092 | RoleStart | Starting instances for role 'mysql' | COMPLETED | 14 Aug 09:54:38 | 0:01:53 | system |
| |->1094 | VnodeDeploy | Deploying vnode 'wp-3.mysql.01'. Origin: REST (cscale-82-140.robinsystems.com) | COMPLETED | 14 Aug 09:54:38 | 0:01:53 | system |
|->1093 | RoleStart | Starting instances for role 'wordpress' | COMPLETED | 14 Aug 09:56:31 | 0:01:48 | system |
| |->1102 | VnodeDeploy | Deploying vnode 'wp-3.wordpress.01'. Origin: REST (cscale-82-140.robinsystems.com) | COMPLETED | 14 Aug 09:56:31 | 0:01:48 | system |
1090 | ApplicationStart | Starting application 'wp-1' | COMPLETED | 14 Aug 09:54:38 | 0:03:44 | system |
|->1098 | RoleStart | Starting instances for role 'mysql' | COMPLETED | 14 Aug 09:54:39 | 0:01:51 | system |
| |->1100 | VnodeDeploy | Deploying vnode 'wp-1.mysql.01'. Origin: REST (cscale-82-140.robinsystems.com) | COMPLETED | 14 Aug 09:54:39 | 0:01:51 | system |
|->1099 | RoleStart | Starting instances for role 'wordpress' | COMPLETED | 14 Aug 09:56:30 | 0:01:52 | system |
| |->1101 | VnodeDeploy | Deploying vnode 'wp-1.wordpress.01'. Origin: REST (cscale-82-140.robinsystems.com) | COMPLETED | 14 Aug 09:56:30 | 0:01:52 | system |
1091 | ApplicationStart | Starting application 'wp-2' | COMPLETED | 14 Aug 09:54:38 | 0:03:44 | system |
|->1095 | RoleStart | Starting instances for role 'mysql' | COMPLETED | 14 Aug 09:54:39 | 0:01:52 | system |
| |->1097 | VnodeDeploy | Deploying vnode 'wp-2.mysql.01'. Origin: REST (cscale-82-140.robinsystems.com) | COMPLETED | 14 Aug 09:54:39 | 0:01:52 | system |
|->1096 | RoleStart | Starting instances for role 'wordpress' | COMPLETED | 14 Aug 09:56:31 | 0:01:51 | system |
| |->1103 | VnodeDeploy | Deploying vnode 'wp-2.wordpress.01'. Origin: REST (cscale-82-140.robinsystems.com) | COMPLETED | 14 Aug 09:56:32 | 0:01:50 | system |
1104 | ApplicationDelete | Deleting application 'wp-1' | COMPLETED | 14 Aug 10:18:34 | 0:00:15 | robin |
|->1105 | VnodeDelete | Deleting vnode 'wp-1.wordpress.01' from cscale-82-140.robinsystems.com | COMPLETED | 14 Aug 10:18:34 | 0:00:06 | robin |
|->1106 | VnodeDelete | Deleting vnode 'wp-1.mysql.01' from cscale-82-140.robinsystems.com | COMPLETED | 14 Aug 10:18:34 | 0:00:11 | robin |
1107 | ApplicationDelete | Deleting application 'wp-2' | COMPLETED | 14 Aug 10:18:38 | 0:00:14 | robin |
|->1108 | VnodeDelete | Deleting vnode 'wp-2.wordpress.01' from cscale-82-140.robinsystems.com | COMPLETED | 14 Aug 10:18:38 | 0:00:06 | robin |
|->1109 | VnodeDelete | Deleting vnode 'wp-2.mysql.01' from cscale-82-140.robinsystems.com | COMPLETED | 14 Aug 10:18:38 | 0:00:08 | robin |
1110 | ApplicationDelete | Deleting application 'wp-3' | COMPLETED | 14 Aug 10:18:43 | 0:00:15 | robin |
|->1111 | VnodeDelete | Deleting vnode 'wp-3.wordpress.01' from cscale-82-140.robinsystems.com | COMPLETED | 14 Aug 10:18:43 | 0:00:12 | robin |
|->1112 | VnodeDelete | Deleting vnode 'wp-3.mysql.01' from cscale-82-140.robinsystems.com | COMPLETED | 14 Aug 10:18:43 | 0:00:13 | robin |
1113 | HostProbe | Probed cscale-82-139.robinsystems.com from PROBE_PENDING/Notready ==> UNREACHABLE/Notready. Origin: NodeUp. | COMPLETED | 14 Aug 10:20:02 | 0:00:50 | system |
1114 | HostFailoverPods | Failing over pods on host cscale-82-139.robinsystems.com | COMPLETED|FAILED | 14 Aug 10:20:52 | 0:01:40 | system | Pods do not need to be failed over as Ku
1115 | HostProbe | Probed cscale-82-139.robinsystems.com from UNREACHABLE/Notready ==> ONLINE/Notready. Origin: StateChange.. Services Down: {'iomgr-server'} | COMPLETED | 14 Aug 10:22:17 | 0:00:00 | system |
1116 | HostProbe | Probed cscale-82-139.robinsystems.com from ONLINE/Notready ==> ONLINE/Notready. Origin: StateChange.. Services Down: {'iomgr-server'} | COMPLETED | 14 Aug 10:22:47 | 0:00:00 | system |
1117 | HostProbe | Probed cscale-82-139.robinsystems.com from ONLINE/Notready ==> ONLINE/Ready. Origin: StateChange. | COMPLETED | 14 Aug 10:22:59 | 0:00:00 | system |
1118 | ApplicationCreate | Adding application 'wp-1' | COMPLETED | 14 Aug 10:40:21 | 0:01:05 | robin |
|->1119 | RoleCreate | Provisioning containers for role 'mysql' | COMPLETED | 14 Aug 10:40:24 | 0:00:41 | robin |
| |->1121 | VnodeAdd | Adding vnode 'wp-1.mysql.01' on cscale-82-139.robinsystems.com | COMPLETED | 14 Aug 10:40:24 | 0:00:41 | robin |
|->1120 | RoleCreate | Provisioning containers for role 'wordpress' | COMPLETED | 14 Aug 10:41:05 | 0:00:21 | robin |
| |->1122 | VnodeAdd | Adding vnode 'wp-1.wordpress.01' on cscale-82-139.robinsystems.com | COMPLETED | 14 Aug 10:41:05 | 0:00:21 | robin |
1123 | ApplicationCreate | Adding application 'wp-2-no-aff' | COMPLETED | 14 Aug 10:45:45 | 0:00:57 | robin |
|->1124 | RoleCreate | Provisioning containers for role 'mysql' | COMPLETED | 14 Aug 10:45:48 | 0:00:41 | robin |
| |->1126 | VnodeAdd | Adding vnode 'wp-2-no-aff.mysql.01' on cscale-82-139.robinsystems.com | COMPLETED | 14 Aug 10:45:48 | 0:00:41 | robin |
|->1125 | RoleCreate | Provisioning containers for role 'wordpress' | COMPLETED | 14 Aug 10:46:29 | 0:00:13 | robin |
| |->1127 | VnodeAdd | Adding vnode 'wp-2-no-aff.wordpress.01' on cscale-82-139.robinsystems.com | COMPLETED | 14 Aug 10:46:29 | 0:00:13 | robin |
1128 | ApplicationCreate | Adding application 'wp-3-no-aff' | COMPLETED | 14 Aug 10:46:33 | 0:00:39 | robin |
|->1129 | RoleCreate | Provisioning containers for role 'mysql' | COMPLETED | 14 Aug 10:46:35 | 0:00:28 | robin |
| |->1131 | VnodeAdd | Adding vnode 'wp-3-no-aff.mysql.01' on cscale-82-139.robinsystems.com | COMPLETED | 14 Aug 10:46:35 | 0:00:28 | robin |
|->1130 | RoleCreate | Provisioning containers for role 'wordpress' | COMPLETED | 14 Aug 10:47:03 | 0:00:09 | robin |
| |->1132 | VnodeAdd | Adding vnode 'wp-3-no-aff.wordpress.01' on cscale-82-139.robinsystems.com | COMPLETED | 14 Aug 10:47:03 | 0:00:09 | robin |
1133 | HostProbe | Probed cscale-82-139.robinsystems.com from PROBE_PENDING/Notready ==> UNREACHABLE/Notready. Origin: NodeDown. | COMPLETED | 14 Aug 10:49:36 | 0:00:52 | system |
1134 | HostFailoverPods | Failing over pods on host cscale-82-139.robinsystems.com | COMPLETED | 14 Aug 10:50:28 | 0:00:01 | system |
1135 | ApplicationStart | Starting application 'wp-1' | COMPLETED | 14 Aug 10:50:29 | 0:03:22 | system |
|->1141 | RoleStart | Starting instances for role 'wordpress' | COMPLETED | 14 Aug 10:52:16 | 0:01:35 | system |
| |->1143 | VnodeDeploy | Deploying vnode 'wp-1.wordpress.01'. Origin: REST (cscale-82-140.robinsystems.com) | COMPLETED | 14 Aug 10:52:16 | 0:01:35 | system |
|->1140 | RoleStart | Starting instances for role 'mysql' | COMPLETED | 14 Aug 10:50:30 | 0:01:46 | system |
| |->1142 | VnodeDeploy | Deploying vnode 'wp-1.mysql.01'. Origin: REST (cscale-82-140.robinsystems.com) | COMPLETED | 14 Aug 10:50:30 | 0:01:46 | system |
1136 | VnodeDeploy | Deploying vnode 'wp-3-no-aff.wordpress.01'. Origin: Event (cscale-82-140.robinsystems.com) | COMPLETED | 14 Aug 10:50:29 | 0:01:48 | robin |
1137 | VnodeDeploy | Deploying vnode 'wp-3-no-aff.mysql.01'. Origin: Event (cscale-82-140.robinsystems.com) | COMPLETED | 14 Aug 10:50:29 | 0:02:04 | robin |
1138 | VnodeDeploy | Deploying vnode 'wp-2-no-aff.mysql.01'. Origin: Event (cscale-82-140.robinsystems.com) | COMPLETED | 14 Aug 10:50:29 | 0:02:07 | robin |
1139 | VnodeDeploy | Deploying vnode 'wp-2-no-aff.wordpress.01'. Origin: Event (cscale-82-140.robinsystems.com) | COMPLETED | 14 Aug 10:50:29 | 0:01:44 | robin |
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Returns all jobs that have occurred during a cluster’s lifespan.
End Point: /api/v5/robin_server/jobs
Method: GET
URL Parameters:
sort=[id|-id]
: Sorts the returned jobs by their ID.
noarchived=true
: Excludes archived jobs from the results.
nopurged=true
: Excludes purged jobs from the results.
failed=true
: Returns only failed jobs.
parent=true
: Returns only parent jobs.
page_size=<size>
: Returns <size> jobs.
page_num=<index>
: Returns jobs starting from <index>.
objtype=[APPLICATION|K8S_APPLICATION|INSTANCE|DISK|NODE]
: Returns only jobs for the specified object type.
objname=<obj_name>
: Returns only jobs for objects with the specified name.
all=true
: Returns all jobs. Note that this option is only valid when an application name is specified.
Data Parameters: None
Port: RCM Port (default value is 29442)
Headers:
Authorization: <auth_token>
: Authorization token to identify which user is sending the request. The token can be acquired from the login API.
Success Response Code: 200
Error Response Code: 500 (Internal Server Error)
Example Response:
{
"page_size":10,
"items":{
"users":[
{
"email":null,
"tenantid":1,
"firstname":"Robin",
"username":"robin",
"id":3,
"lastname":"Systems"
}
],
"jobs":[
{
"jobid":1888,
"tenant_id":1,
"enabled":true,
"child_job_ids":"[1889]",
"endtime":1597456503,
"children":[
{
"jobid":1889,
"tenant_id":1,
"enabled":true,
"child_job_ids":"[]",
"endtime":1597456498,
"parent_jobid":1888,
"error":0,
"message":"",
"taskrunner":1,
"starttime":1597456497,
"dependson_job_ids":"[]",
"level":"child",
"user_id":1,
"jtype":"CollectionOffline",
"timeout":86400,
"state":10,
"desc":"Taking collection 'file-collection-1597122699552' offline (Force False)"
}
],
"parent_jobid":0,
"error":0,
"message":"",
"taskrunner":1,
"starttime":1597456496,
"dependson_job_ids":"[]",
"level":"parent",
"user_id":1,
"jtype":"CollectionOnline",
"timeout":86400,
"state":10,
"desc":"Bringing collection 'file-collection-1597122699552' online"
},
{
"jobid":1887,
"tenant_id":1,
"enabled":true,
"child_job_ids":"[1890]",
"endtime":1597456504,
"children":[
{
"jobid":1890,
"tenant_id":1,
"enabled":true,
"child_job_ids":"[]",
"endtime":1597456499,
"parent_jobid":1887,
"error":0,
"message":"",
"taskrunner":1,
"starttime":1597456497,
"dependson_job_ids":"[]",
"level":"child",
"user_id":3,
"jtype":"VnodeStop",
"timeout":86400,
"state":10,
"desc":"Stopping vnode test-ds-1.server.01 on cscale-82-140.robinsystems.com"
}
],
"parent_jobid":0,
"error":0,
"message":"",
"taskrunner":1,
"starttime":1597456496,
"dependson_job_ids":"[]",
"level":"parent",
"user_id":3,
"jtype":"VnodeDeploy",
"timeout":86400,
"state":10,
"desc":"Deploying vnode 'test-ds-1.server.01'. Origin: Event (cscale-82-140.robinsystems.com)"
},
{
"jobid":1886,
"tenant_id":1,
"enabled":true,
"child_job_ids":"[]",
"endtime":1597456488,
"parent_jobid":0,
"error":0,
"message":"",
"taskrunner":0,
"starttime":1597456487,
"dependson_job_ids":"[]",
"level":"parent",
"user_id":1,
"jtype":"HostProbe",
"timeout":86400,
"state":10,
"desc":"Probed cscale-82-140.robinsystems.com from ONLINE\/Notready ==> ONLINE\/Ready. Origin: StateChange."
},
{
"jobid":1885,
"tenant_id":1,
"enabled":true,
"child_job_ids":"[]",
"endtime":1597456476,
"parent_jobid":0,
"error":0,
"message":"",
"taskrunner":0,
"starttime":1597456475,
"dependson_job_ids":"[]",
"level":"parent",
"user_id":1,
"jtype":"HostProbe",
"timeout":86400,
"state":10,
"desc":"Probed cscale-82-140.robinsystems.com from ONLINE\/Notready ==> ONLINE\/Notready. Origin: StateChange.. Services Down: {'iomgr-server'}"
},
{
"jobid":1884,
"tenant_id":1,
"enabled":true,
"child_job_ids":"[]",
"endtime":1597456470,
"parent_jobid":0,
"error":0,
"message":"",
"taskrunner":0,
"starttime":1597456470,
"dependson_job_ids":"[]",
"level":"parent",
"user_id":1,
"jtype":"HostProbe",
"timeout":86400,
"state":10,
"desc":"Probed cscale-82-140.robinsystems.com from ONLINE\/WaitingForMonitor ==> ONLINE\/Notready. Origin: StartingHostWatch.. Services Down: {'iomgr-server'}"
},
{
"jobid":1883,
"tenant_id":1,
"enabled":true,
"child_job_ids":"[]",
"endtime":1597456520,
"parent_jobid":0,
"error":0,
"message":"",
"taskrunner":0,
"starttime":1597456469,
"dependson_job_ids":"[]",
"level":"parent",
"user_id":1,
"jtype":"HostProbe",
"timeout":86400,
"state":10,
"desc":"Probed cscale-82-139.robinsystems.com from UNREACHABLE\/Notready ==> UNREACHABLE\/Notready. Origin: StartingHostWatch."
},
{
"jobid":1882,
"tenant_id":1,
"enabled":true,
"child_job_ids":"[]",
"endtime":1597456467,
"parent_jobid":0,
"error":0,
"message":"",
"taskrunner":0,
"starttime":1597456467,
"dependson_job_ids":"[]",
"level":"parent",
"user_id":1,
"jtype":"DiskNotify",
"timeout":86400,
"state":10,
"desc":"Event on disk '0x60022480940ed076551cfaf75612e24e'"
},
{
"jobid":1881,
"tenant_id":1,
"enabled":true,
"child_job_ids":"[]",
"endtime":1597456467,
"parent_jobid":0,
"error":0,
"message":"",
"taskrunner":0,
"starttime":1597456467,
"dependson_job_ids":"[]",
"level":"parent",
"user_id":1,
"jtype":"DiskNotify",
"timeout":86400,
"state":10,
"desc":"Event on disk '0x60022480ffcf3deb224fb37d78fe7767'"
},
{
"jobid":1880,
"tenant_id":1,
"enabled":true,
"child_job_ids":"[]",
"endtime":1597456467,
"parent_jobid":0,
"error":0,
"message":"",
"taskrunner":0,
"starttime":1597456467,
"dependson_job_ids":"[]",
"level":"parent",
"user_id":1,
"jtype":"DiskNotify",
"timeout":86400,
"state":10,
"desc":"Event on disk '0x600224804c48fd7e16c608dea0919064'"
},
{
"jobid":1879,
"tenant_id":1,
"enabled":true,
"child_job_ids":"[]",
"endtime":1597456467,
"parent_jobid":0,
"error":0,
"message":"",
"taskrunner":0,
"starttime":1597456467,
"dependson_job_ids":"[]",
"level":"parent",
"user_id":1,
"jtype":"DiskNotify",
"timeout":86400,
"state":10,
"desc":"Event on disk '0x600224803bcdafde95b1f5cd27ceb5fb'"
}
]
},
"total":1542,
"num_items":10,
"page_num":1
}
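The request described above can also be assembled programmatically. The following Python sketch builds the query URL from the documented URL parameters and estimates how many pages a listing spans from the `total` and `page_size` fields of the response. The `https` scheme, the example host name, and the helper names `build_jobs_url` and `page_count` are assumptions made for illustration; they are not part of the Robin API.

```python
import math
from urllib.parse import urlencode

RCM_PORT = 29442  # default RCM port stated in the endpoint description


def build_jobs_url(host, port=RCM_PORT, **params):
    """Build the job-listing URL; params are the documented URL parameters."""
    # Assumption: the server is reached over HTTPS; adjust the scheme if not.
    base = f"https://{host}:{port}/api/v5/robin_server/jobs"
    # Booleans are rendered as lowercase 'true'/'false', as in the parameter list.
    query = urlencode({
        key: str(value).lower() if isinstance(value, bool) else value
        for key, value in params.items()
    })
    return f"{base}?{query}" if query else base


def page_count(response):
    """Pages needed to list every job, assuming 'total' counts all matches."""
    return math.ceil(response["total"] / response["page_size"])


# Hypothetical host name, used only to show the resulting URL.
print(build_jobs_url("robin-master.example.com", failed=True, page_size=10, page_num=1))
print(page_count({"total": 1542, "page_size": 10}))
```

To send the request, pass the URL together with the `Authorization` header described above to an HTTP client, for example `urllib.request.Request(url, headers={"Authorization": token})`.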
17.2. Show information about a specific job
To get more detailed information about a specific job and its child jobs, including state, duration, and any related errors, issue the following command:
# robin job info <id>
--json
<id>   | Job ID
--json | Display output in JSON format
Example:
# robin job info 1123
ID | Type | Desc | State | Start | End | Duration | Dependson | Error | Message
-----------+-------------------+---------------------------------------------------------------------------+-----------+-----------------+----------+----------+-----------+-------+---------
1123 | ApplicationCreate | Adding application 'wp-2-no-aff' | COMPLETED | 14 Aug 10:45:45 | 10:46:42 | 0:00:57 | [] | 0 |
|->1124 | RoleCreate | Provisioning containers for role 'mysql' | COMPLETED | 14 Aug 10:45:48 | 10:46:29 | 0:00:41 | [] | 0 |
| |->1126 | VnodeAdd | Adding vnode 'wp-2-no-aff.mysql.01' on cscale-82-139.robinsystems.com | COMPLETED | 14 Aug 10:45:48 | 10:46:29 | 0:00:41 | [] | 0 |
|->1125 | RoleCreate | Provisioning containers for role 'wordpress' | COMPLETED | 14 Aug 10:46:29 | 10:46:42 | 0:00:13 | [1124] | 0 |
| |->1127 | VnodeAdd | Adding vnode 'wp-2-no-aff.wordpress.01' on cscale-82-139.robinsystems.com | COMPLETED | 14 Aug 10:46:29 | 10:46:42 | 0:00:13 | [] | 0 |
Returns details about a specific job and any of its respective child jobs.
End Point: /api/v3/robin_server/jobs/<job_id>
Method: GET
URL Parameters: None
Data Parameters: None
Port: RCM Port (default value is 29442)
Headers:
Authorization: <auth_token>
: Authorization token to identify which user is sending the request. The token can be acquired from the login API.
Success Response Code: 200
Error Response Code: 500 (Internal Server Error), 404 (Not Found Error), 401 (Authorization Error)
Example Response:
{
"tenant_name":"Administrators",
"jobid":1888,
"tenant_id":1,
"enabled":true,
"json":{
"collection_id":1597122699552,
"state":"SuspectedOffline",
"set_failed":true,
"origin":2,
"hostname":"cscale-82-140.robinsystems.com"
},
"user_name":"system",
"endtime":1597456503,
"parent_jobid":0,
"error":0,
"message":"",
"taskrunner":1,
"starttime":1597456496,
"child_job_ids":"[1889]",
"cjobs":[
{
"tenant_name":"Administrators",
"jobid":1889,
"tenant_id":1,
"enabled":true,
"json":{
"collection_id":1597122699552
},
"user_name":"system",
"endtime":1597456498,
"parent_jobid":1888,
"error":0,
"message":"",
"taskrunner":1,
"starttime":1597456497,
"child_job_ids":"[]",
"cjobs":[
],
"dependson_job_ids":"[]",
"user_id":1,
"jtype":"CollectionOffline",
"timeout":86400,
"state":10,
"desc":"Taking collection 'file-collection-1597122699552' offline (Force False)",
"priority":300
}
],
"dependson_job_ids":"[]",
"user_id":1,
"jtype":"CollectionOnline",
"timeout":86400,
"state":10,
"desc":"Bringing collection 'file-collection-1597122699552' online",
"priority":300
}
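Because child jobs are nested under the `cjobs` key, a response like the one above can be walked recursively. The Python sketch below flattens the tree and derives each job's duration from `starttime` and `endtime`. It assumes both fields are Unix timestamps in seconds, which the example values suggest but the text does not state; the function name `walk_jobs` is hypothetical.

```python
def walk_jobs(job, depth=0):
    """Yield (depth, jobid, jtype, duration_seconds) for a job and its children."""
    # Assumption: starttime and endtime are Unix timestamps in seconds.
    yield depth, job["jobid"], job["jtype"], job["endtime"] - job["starttime"]
    for child in job.get("cjobs", []):
        yield from walk_jobs(child, depth + 1)


# Trimmed copy of the example response, keeping only the fields used here.
response = {
    "jobid": 1888, "jtype": "CollectionOnline",
    "starttime": 1597456496, "endtime": 1597456503,
    "cjobs": [
        {"jobid": 1889, "jtype": "CollectionOffline",
         "starttime": 1597456497, "endtime": 1597456498, "cjobs": []},
    ],
}

# Print one indented line per job, mirroring the tree in the CLI output.
for depth, jobid, jtype, seconds in walk_jobs(response):
    print(f"{'  ' * depth}{jobid} {jtype} {seconds}s")
```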
17.3. Log Collection
During a cluster-wide failure or any unexpected scenario that affects multiple services, Robin needs logs from all system components in order to debug the issue properly. Depending on the scope of the issue, sometimes only a subset of the logs needs to be collected. This granularity is available, but it is highly recommended to always send the complete set of logs when filing a bug report with Robin. Age-based filtering is available to help reduce the storage footprint. Robin supports uploading logs to the following destinations:
Robin storage   | Used to store collected logs in Robin backed storage
NFS             | Used to store collected logs in NFS
Amazon S3       | Used to store collected logs in Amazon S3
Remote location | Used to store collected logs in a given remote location
GCS             | Used to store collected logs in a GCS bucket (only for GCS repos)
17.3.1. Storing logs using Robin Storage
Logs collected by Robin can be stored on a volume created on the local cluster, with the following command:
# robin log collect robin-storage <rpool>
--nodes <nodes>
--dest-path <dest_path>
--size <size>
--media <media>
--age <age>
<rpool>                 | Name of the resource pool to use
--nodes <nodes>         | Comma-separated list of nodes from which to collect logs. The default is to collect from all nodes
--dest-path <dest_path> | Destination path where log files will be copied
--size <size>           | Size of the storage volume for the log collection. The default is 250GB
--media <media>         | Type of drive to allocate storage from. Choices include: ‘HDD’, ‘SSD’. The default media type is ‘HDD’
--age <age>             | Collects logs based on age. Valid units are s (seconds), m (minutes), h (hours), d (days), Mo (months), and y (years). For example, 10m represents 10 minutes
Example:
# robin log collect robin-storage default --wait
Job: 123 Name: LogCollect State: PROCESSED Error: 0
Job: 123 Name: LogCollect State: WAITING Error: 0
Job: 123 Name: LogCollect State: COMPLETED Error: 0
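When the `--age` value is generated by a script, it can be checked before the command is run. The Python sketch below validates a string against the units listed in the option description (s, m, h, d, Mo, y). That the value is a whole number followed directly by the unit is an assumption drawn from the `10m` example, and `parse_age` is a hypothetical helper, not part of the Robin CLI.

```python
import re

# Units taken from the --age description: s, m, h, d, Mo, y.
# Assumption: the amount is a whole number immediately followed by the unit.
AGE_PATTERN = re.compile(r"(\d+)(s|m|h|d|Mo|y)")


def parse_age(value):
    """Split an age string such as '10m' into (amount, unit) or raise ValueError."""
    match = AGE_PATTERN.fullmatch(value)
    if match is None:
        raise ValueError(f"invalid age: {value!r}")
    return int(match.group(1)), match.group(2)


print(parse_age("10m"))  # ten minutes
print(parse_age("2Mo"))  # two months; 'Mo' is distinct from 'm'
```

Matching is case-sensitive so that `Mo` (months) and `m` (minutes) are not confused.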
17.3.2. Storing logs using NFS
Logs collected by Robin can be stored on an NFS share with the following command:
# robin log collect nfs <nfs_share>
--nodes <nodes>
--age <age>
<nfs_share>     | The hostname or IP address, export path, and destination path of the NFS share, in the form ‘<hostname|IP>:<export_path>:<dest_path>’
--nodes <nodes> | Comma-separated list of nodes from which to collect logs. The default is to collect from all nodes
--age <age>     | Collects logs based on age. Valid units are s (seconds), m (minutes), h (hours), d (days), Mo (months), and y (years). For example, 10m represents 10 minutes
Example:
# robin log collect nfs 10.9.82.162:/tmp:/demo_log_collect
Job: 126 Name: LogCollect State: PROCESSED Error: 0
Job: 126 Name: LogCollect State: WAITING Error: 0
Job: 126 Name: LogCollect State: COMPLETED Error: 0
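Because the NFS share is passed as a single colon-separated argument, a malformed value is easy to miss. The sketch below checks the `<hostname|IP>:<export_path>:<dest_path>` shape before the command is run. The function name `is_valid_nfs_share` is hypothetical, and the requirement that both paths start with `/` is an assumption taken from the example above, not a documented rule.

```shell
# Hypothetical helper, not part of the Robin CLI.
# Succeeds only if the argument has exactly three colon-separated fields:
# a non-empty host, then two paths that each start with "/".
# Assumption: both paths are absolute, as in the documented example.
is_valid_nfs_share() {
    printf '%s\n' "$1" | awk -F: '
        NF == 3 && $1 != "" && $2 ~ /^\// && $3 ~ /^\// { ok = 1 }
        END { exit !ok }
    '
}

# Example use:
#   is_valid_nfs_share "$share" && robin log collect nfs "$share"
```

A literal IPv6 address contains colons and would be rejected by this simple field count.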
17.3.3. Storing logs using AWS S3¶
Logs collected by Robin can be stored in an Amazon S3 bucket, with the following command:
# robin log collect s3 <url> <aws_config>
--nodes <nodes>
--access_key <access_key>
--secret_key <secret_key>
--age <age>
| <url> | S3 URL in the format https://s3-<region-name>.amazonaws.com/<bucket-name>/<directory> |
| <aws_config> | JSON file containing the Access Key, Secret Key and Region. Example format: {"aws_access_key_id": <key>, "aws_secret_access_key": <key>, "region": <region_name>} |
| --nodes <nodes> | Comma-separated list of nodes from which to collect. The default is to collect from all nodes |
| --access_key <access_key> | Access Key for the respective user with access to the specified S3 bucket |
| --secret_key <secret_key> | Secret Key for the respective user with access to the specified S3 bucket |
| --age <age> | Collects logs based on age. Valid units are s (sec), m (min), h (hrs), d (days), Mo (month) and y (years). For example, 10m represents 10 minutes |
Example:
# robin log collect s3 https://s3-us-west-2.amazonaws.com/log-collect/demo_log_collect /root/aws.json --wait
Job: 132 Name: LogCollect State: PROCESSED Error: 0
Job: 132 Name: LogCollect State: WAITING Error: 0
Job: 132 Name: LogCollect State: COMPLETED Error: 0
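The AWS configuration file passed in the example above can be generated rather than written by hand. The following sketch writes a file in the documented format. The function name `write_aws_config` is hypothetical, and writing each value as a quoted JSON string is an assumption, since the documented format only shows placeholders.

```shell
# Hypothetical helper, not part of the Robin CLI.
# $1 = output file, $2 = access key, $3 = secret key, $4 = region
# Assumption: all three values are written as quoted JSON strings.
write_aws_config() {
    (
        umask 077   # keep the credentials file readable by the owner only
        printf '{"aws_access_key_id": "%s", "aws_secret_access_key": "%s", "region": "%s"}\n' \
            "$2" "$3" "$4" > "$1"
    )
}

# Example use (key values are placeholders):
#   write_aws_config /root/aws.json "$ACCESS_KEY" "$SECRET_KEY" us-west-2
```

The sketch performs no JSON escaping, so it is only suitable for values that contain no double quotes or backslashes.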
17.3.4. Storing logs in a remote location¶
Logs collected by Robin can be copied to a remote location over SSH, with the following command:
# robin log collect ssh <dest>
--nodes <nodes>
--password <password>
--age <age>
| <dest> | Destination path where the log files will be copied to. The path should be in the form of ‘<user>@<hostname|IP>:<path>’ |
| --nodes <nodes> | Comma-separated list of nodes from which to collect. The default is to collect from all nodes |
| --password <password> | Provide a password on the command line instead of via a prompt |
| --age <age> | Collects logs based on age. Valid units are s (sec), m (min), h (hrs), d (days), Mo (month) and y (years). For example, 10m represents 10 minutes |
Example:
# robin log collect ssh root@10.9.82.163:/demo_log_collect --password robin123
Job: 129 Name: LogCollect State: PROCESSED Error: 0
Job: 129 Name: LogCollect State: WAITING Error: 0
Job: 129 Name: LogCollect State: COMPLETED Error: 0
17.3.5. Storing logs on Google Cloud Storage¶
You can back up Robin logs to a Google Cloud Storage (GCS) repo. To do so, you must first register a GCS repo on Robin CNS using ‘robin repo register’. Once the repo is registered, collect the logs with the following command:
# robin log collect repo
--nodes <NODES>
--repo_name <REPO_NAME>
--dir_path <DIR_PATH>
--wait
--template <TEMPLATE>
--age <AGE>
--collect-storage-libs
| --nodes <NODES> | Comma-separated list of nodes from which to collect. The default is to collect from all nodes |
| --repo_name <REPO_NAME> | Name of the GCS repo, as shown by ‘robin repo list’. The repo must first be registered using ‘robin repo register’ |
| --dir_path <DIR_PATH> | Directory name inside the bucket where log files will be copied. The directory does not need to be created in the bucket beforehand |
| --wait | Wait for the command to finish |
| --template <TEMPLATE> | Template name or template path |
| --age <AGE> | Collects logs based on age. Valid options are s (sec), m (min), h (hrs), d (days), Mo (month), y (years) and all. For example, use 10m for 10 minutes. The default is to collect logs from the last 3 days |
| --collect-storage-libs | Enables collection of files under lib, lib64 and bin for storage |
Example:
[robinmaster@master ~]# robin log collect repo --repo_name robin-repo --wait
Job: 235 Name: LogCollect State: PROCESSED Error: 0
Job: 235 Name: LogCollect State: WAITING Error: 0
Job: 235 Name: LogCollect State: COMPLETED Error: 0
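Every log collect example in this section prints a series of `Job: ... State: ...` lines. When a collection is scripted, the final state can be extracted from that output. The sketch below is a hypothetical helper, not part of the Robin CLI, and it assumes the output keeps the line layout shown in the examples.

```shell
# Hypothetical helper, not part of the Robin CLI.
# Reads command output on stdin and prints the State value of the
# last line that starts with "Job:".
# Assumption: lines look like the documented examples, for instance
#   Job: 235 Name: LogCollect State: COMPLETED Error: 0
last_job_state() {
    awk '
        $1 == "Job:" {
            for (i = 1; i < NF; i++)
                if ($i == "State:") state = $(i + 1)
        }
        END { print state }
    '
}

# Example use:
#   robin log collect repo --repo_name robin-repo --wait | last_job_state
```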
17.4. Retrieving Job Logs¶
Robin provides a utility that collects all the appropriate logs from the necessary nodes for a particular job and its child jobs. It stores these logs within a single tarball that can be provided to Robin alongside a bug report. In addition, this is useful for an administrator who needs to debug why a job failed unexpectedly. This functionality is extremely convenient because it spares the user from logging into every affected node and collecting or inspecting the relevant log files by hand. Issue the following command to retrieve logs for a specific job:
# robin job get <id>
| <id> | ID of the job to collect the logs for |
Example:
# robin job get 1
Retrieving log files...
Log files for Job ids: [1] are retrieved successfully at 1582189081.tar.gz
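The name of the tarball is only reported in the command output, so a script that wants to inspect or attach the archive has to parse it. The following sketch is a hypothetical helper, not part of the Robin CLI, and it assumes the success message keeps the wording shown in the example above.

```shell
# Hypothetical helper, not part of the Robin CLI.
# Reads the output of "robin job get <id>" on stdin and prints the
# tarball name taken from the success message.
# Assumption: the message has the form
#   Log files for Job ids: [1] are retrieved successfully at <file>
job_log_tarball() {
    sed -n 's/^Log files for Job ids: .* are retrieved successfully at \(.*\)$/\1/p'
}

# Example use (assumes the file name is relative to the current directory):
#   tarball=$(robin job get 1 | job_log_tarball)
#   tar -tzf "$tarball"    # list the collected log files
```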
17.5. Tool for checking data integrity on PVCs¶
Robin.io provides the devck_job_runner.sh tool for checking the data integrity of PVCs on your cluster.
This tool performs a block-level checksum of the data on the disks for the specified volumes.
It performs checks on all volume replicas in the cluster. The tool writes its output to a log file.
The log file provides details about both healthy volumes (checksum match) and volumes with checksum mismatches.
You can run the tool as per your requirements.
Note
The tool is not provided as part of the Robin CNS binaries. It is provided by the Robin Customer Support team; contact them when you need the tool.
The tool consumes disk bandwidth while it runs, which impacts the performance of the workload. If you observe any checksum issues on your PVC, you must contact the Robin Customer Support team.
17.5.1. Running the devck_job_runner.sh tool¶
You need to log in to your Robin cluster to run this tool. Each run checks one PVC; to check several PVCs, you can run multiple instances of the tool on different PVCs.
Prerequisite
You need your image_pull_secret and the path of the repository from which you pull the images to run the script. Run the kubectl get secret -n robinio command to find the secret.
To run this tool, complete the following steps:
Run the following command to execute the script from your cluster.
# sh devck_job_runner.sh -n <pvc_namespace> -p <pvc_name> -s <image_pull_secret> -i <devck_image>
Note
The devck_image command option is the path of the repository from which you pull the image. Format: registry/repository/img_name:tag.
Example
[root@hypervvm-72-51 ~]# sh devck_job_runner.sh -n test-namespace -p r-2 -s test-cred
job.batch/job-devck-r-2-test-namespace created
Waiting for job to complete.
Check jobs status using this command 'kubectl logs job/job-devck-r-2-test-namespace -n robinio'
Wait job is not completed yet....time = 10
Wait job is not completed yet....time = 20
Job job-devck-r-2-test-namespace has completed...
Persistent Volume associated with PVC r-2: pvc-20084a84-0c61-476e-966c-925442c223d1
Volume ID for PV pvc-20084a84-0c61-476e-966c-925442c223d1: 1697050482:2
configmap/devck-pods-counter created
Volume is present on these nodes: hypervvm-72-48.robinsystems.com hypervvm-72-49.robinsystems.com hypervvm-72-50.robinsystems.com
creating deamonset
daemonset.apps/devck-pvc-20084a84-0c61-476e-966c-925442c223d1 created
DaemonSet applied successfully.
Wait... Logs are being collected. Logs collected on 0 nodes. Total ready nodes: 3
Wait... Logs are being collected. Logs collected on 0 nodes. Total ready nodes: 3
Wait... Logs are being collected. Logs collected on 0 nodes. Total ready nodes: 3
Wait... Logs are being collected. Logs collected on 0 nodes. Total ready nodes: 3
Wait... Logs are being collected. Logs collected on 1 nodes. Total ready nodes: 3
Wait... Logs are being collected. Logs collected on 1 nodes. Total ready nodes: 3
Wait... Logs are being collected. Logs collected on 1 nodes. Total ready nodes: 3
Wait... Logs are being collected. Logs collected on 1 nodes. Total ready nodes: 3
Done... Logs collected on 3 nodes. Total ready nodes: 3
Logs collected from node : hypervvm-72-48.robinsystems.com
Logs collected from node : hypervvm-72-50.robinsystems.com
Logs collected from node : hypervvm-72-49.robinsystems.com
Logs file is present at hypervvm-72-49.robinsystems.com:/var/log/robin/devck_logs-test-namespace-r-2-1697031021.log
job.batch "job-devck-r-2-test-namespace" deleted
17.5.2. Accessing the log file¶
After running the script, the command output provides the path where the log file is saved. Access the log file using the log file path from the command output.
Example
Logs file is present at hypervvm-72-49.robinsystems.com:/var/log/robin/devck_logs-test-namespace-r-2-1697031021.log
17.5.2.1. Sample Log files¶
Sample log file with no checksum mismatch
In the following sample, the log file shows that the data is safe on every replica of the PVC.
[root@hypervvm-72-49 ~]# cat /var/log/robin/devck_logs-test-namespace-r-2-1697031021.log
Time: 2023-10-11 13:30:21
*** Data Safe at pvc: test-namespace/r-2 volumeid: 2, dev: /dev/disk/by-id/scsi-36002248039b9acdce500204c5225cb30, host: hypervvm-72-48.robinsystems.com ***
Time: 2023-10-11 13:30:22
*** Data Safe at pvc: test-namespace/r-2 volumeid: 2, dev: /dev/disk/by-id/scsi-3600224801e3d29af5112c34369e26fcd, host: hypervvm-72-50.robinsystems.com ***
Time: 2023-10-11 13:30:22
*** Data Safe at pvc: test-namespace/r-2 volumeid: 2, dev: /dev/disk/by-id/scsi-360022480a78a4a226db75cc0f4ea4ff9, host: hypervvm-72-49.robinsystems.com ***
Sample log file displaying checksum mismatch
In the following samples, the log file reports checksum mismatches.
Note
If you observe checksum mismatches (data corruption) in the log file, you must contact the Robin Customer Support team.
Sample displaying checksum issue in a single block
*** Data Corruption at pvc: default/app-1 volumeid: 4, dev: /dev/disk/by-id/scsi-3600224809adf205c27fde93c46413856, host: hypervvm-72-42.robinsystems.com ***
{
"segments": [
{
"segmapidx": 2440,
"seqnum": 1049,
"state": "FINAL",
"slicemapidx": 1,
"appid": 1,
"volumeid": 4,
"snapshotid": 1,
"loffset": 1073741824,
"slice_genkey": 1697435671166909,
"seg_genkey": 1697191660280,
"pages": [
{
"pagenum": 1,
"cksum": "0x0",
"blocks": [
{
"rel_blockid": 69186,
"poff": 0,
"psize": 4008,
"padding": 0,
"valid": 1,
"ondisk_cksum": "0x933f9934",
"calc_cksum": "0xa00029b1"
},
{
"rel_blockid": 145860,
"poff": 4008,
"psize": 4008,
"padding": 4008,
"valid": 1,
"ondisk_cksum": "0x693028b2",
"calc_cksum": "0xd55f990e"
}
]
}
]
}
]
}
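In the corruption sample above, each block entry carries an `ondisk_cksum` and a `calc_cksum`. The affected blocks can be listed by comparing the two values. The sketch below is a hypothetical helper, not part of the tool. It assumes the log keeps one key per line as in the sample, and it parses the text line by line rather than using a JSON parser.

```shell
# Hypothetical helper, not part of devck_job_runner.sh.
# Prints "rel_blockid ondisk_cksum calc_cksum" for every block whose two
# checksums differ. Reads the named log file, or stdin if none is given.
# Assumption: each key sits on its own line, as in the sample log.
list_mismatched_blocks() {
    awk -F'"' '
        $2 == "rel_blockid"  { id = $3; gsub(/[^0-9]/, "", id) }
        $2 == "ondisk_cksum" { ondisk = $4 }
        $2 == "calc_cksum"   { if ($4 != ondisk) print id, ondisk, $4 }
    ' "$@"
}

# Example use:
#   list_mismatched_blocks /var/log/robin/devck_logs-test-namespace-r-2-1697031021.log
```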
Sample showing checksum issues on more than one block
*** Data Corruption at pvc: default/app-1 volumeid: 4, dev: /dev/disk/by-id/scsi-3600224806f9621eff4e3a0182ec3375d, host: hypervvm-72-43.robinsystems.com ***
{
"segments": [
{
"segmapidx": 231,
"seqnum": 1049,
"state": "FINAL",
"slicemapidx": 0,
"appid": 1,
"volumeid": 4,
"snapshotid": 1,
"loffset": 0,
"slice_genkey": 1697435671166909,
"seg_genkey": 1697191657661,
"pages": [
{
"pagenum": 1,
"cksum": "0x0",
"blocks": [
{
"rel_blockid": 134538,
"poff": 0,
"psize": 3992,
"padding": 0,
"valid": 1,
"ondisk_cksum": "0x3c8a61e2",
"calc_cksum": "0x657dcec0"
},
{
"rel_blockid": 55370,
"poff": 4096,
"psize": 3984,
"padding": 4096,
"valid": 1,
"ondisk_cksum": "0x66e64a6f",
"calc_cksum": "0x87a69826"
},
{
"rel_blockid": 62216,
"poff": 8080,
"psize": 4008,
"padding": 8080,
"valid": 1,
"ondisk_cksum": "0x7466271b",
"calc_cksum": "0x9cdc9efd"
},
{
"rel_blockid": 130757,
"poff": 12088,
"psize": 4008,
"padding": 12088,
"valid": 1,
"ondisk_cksum": "0x55047ae3",
"calc_cksum": "0x5c9afc57"
},
{
"rel_blockid": 20923,
"poff": 16384,
"psize": 4000,
"padding": 16384,
"valid": 1,
"ondisk_cksum": "0x4ed6fccd",
"calc_cksum": "0x108731cc"
},
{
"rel_blockid": 251357,
"poff": 20384,
"psize": 3992,
"padding": 20384,
"valid": 1,
"ondisk_cksum": "0x3e67778f",
"calc_cksum": "0x674fdbf"
},
{
"rel_blockid": 255060,
"poff": 24376,
"psize": 3984,
"padding": 24376,
"valid": 1,
"ondisk_cksum": "0x3980c9d3",
"calc_cksum": "0x7d180819"
},
{
"rel_blockid": 100529,
"poff": 28360,
"psize": 4008,
"padding": 28360,
"valid": 1,
"ondisk_cksum": "0x6fd2942",
"calc_cksum": "0x29b1473c"
},
{
"rel_blockid": 133206,
"poff": 32368,
"psize": 4008,
"padding": 32368,
"valid": 1,
"ondisk_cksum": "0xf782af20",
"calc_cksum": "0xbcdea0de"
},
{
"rel_blockid": 202530,
"poff": 36864,
"psize": 4008,
"padding": 36864,
"valid": 1,
"ondisk_cksum": "0x4701c866",
"calc_cksum": "0xa677cfdb"
}
]
}
]
}
]
}
*** Data Safe at pvc: default/app-1 volumeid: 4, dev: /dev/disk/by-id/scsi-360022480c3025c597a2fc0d7fc407ed4, host: hypervvm-72-46.robinsystems.com ***
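For long log files, a quick summary of the per-replica result lines is often enough to decide whether to contact support. The following sketch counts the `Data Safe` and `Data Corruption` marker lines shown in the samples above. The function name `devck_summary` is hypothetical, and the sketch assumes the marker lines always begin with `*** Data Safe at pvc:` or `*** Data Corruption at pvc:`.

```shell
# Hypothetical helper, not part of devck_job_runner.sh.
# $1 = path to a devck log file.
# Prints the number of replicas reported safe and corrupt, and returns a
# non-zero status if at least one replica is reported corrupt.
# Assumption: result lines use the markers shown in the sample logs.
devck_summary() {
    awk '
        /^\*\*\* Data Safe at pvc:/       { safe++ }
        /^\*\*\* Data Corruption at pvc:/ { bad++ }
        END {
            printf "safe=%d corrupt=%d\n", safe, bad
            exit (bad > 0)
        }
    ' "$1"
}

# Example use:
#   devck_summary /var/log/robin/devck_logs-test-namespace-r-2-1697031021.log
```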
17.6. Correcting the read-only database connection issue¶
In some cases, following network partition recovery, the Robin CNS control plane might remain down due to a read-only database connection.
This happens when the Patroni leader switches during a network partition or recovery, leaving stale conntrack entries for the robin-patroni service pointing to the prior Patroni leader.
The stale conntrack entries persist with the robin-patroni service because, in some cases, CNI fails to flush them whenever the service endpoint changes.
To recover from this situation, Robin CNS provides a configurable timeout. When the services detect a read-only database connection and the configured timeout is reached, the Robin services restart automatically to create a new database connection.
To address this issue, you can add the following config variables to the options section of the robin.yaml file when installing Robin CNS 5.4.8-313, or you can correct the issue after an upgrade by completing the steps provided in this section.
session_read_only_timeout: "60"
exit_on_read_only_exception: "1"
Note
The default value for session_read_only_timeout is 0, which means the services never restart in this scenario. The recommended value for the session_read_only_timeout variable is “60” seconds. Do not set it to less than 30 seconds, because a lower value might result in services going down even when the database becomes read-only only briefly due to a small network glitch.
The default value for exit_on_read_only_exception is 1 (true). This variable controls whether or not the process exits when the timeout is reached.
After upgrading to Robin CNS 5.4.8, you can correct the read-only database connection issue by adding these config variables and updating the Robin configuration as described in the following steps.
Complete the following steps to correct the issue:
Edit the robin-bootstrap-config ConfigMap by running the following command.
# kubectl edit cm robin-bootstrap-config -n robinio
In the bootstrap_details section of the file, add the following tunable in addition to the others that are already present.
session_read_only_timeout=60
Ensure the parameter is set as shown in the sample output by running the following command.
# kubectl get cm -n robinio robin-bootstrap-config -oyaml | head -n5
apiVersion: v1
data:
  bootstrap_details: |
    session_read_only_timeout=60
    mode=agent
Edit the robincluster custom resource by running the following command.
# kubectl edit robincluster robin
In the robincluster custom resource, add the following two parameters in the options section of the file.
options:
  session_read_only_timeout: "60"
  exit_on_read_only_exception: "1"
Run the following commands to update the Robin configuration with these values:
session_read_only_timeout: "60"
exit_on_read_only_exception: "1"
# robin config update agent session_read_only_timeout 60
# robin config update server exit_on_read_only_exception true
Restart the robin-master Pod by running the following command.
# kubectl delete pod -n robinio <robin-master-xxxxxx>
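After completing the steps above, the configured timeout can be read back from the ConfigMap and checked against the guidance in this section (0 disables the restart, values below 30 seconds are discouraged). The following is a minimal sketch: both function names are hypothetical, and the parser assumes the tunable appears on its own line as `session_read_only_timeout=<seconds>`, as in the sample output.

```shell
# Hypothetical helpers, not part of the Robin CLI.

# Reads ConfigMap YAML on stdin and prints the first configured
# session_read_only_timeout value, if any.
read_only_timeout_from() {
    sed -n 's/^[[:space:]]*session_read_only_timeout=\([0-9][0-9]*\)[[:space:]]*$/\1/p' | head -n 1
}

# $1 = timeout in seconds. Prints a short verdict based on the
# guidance in this section.
check_read_only_timeout() {
    case "$1" in
        ''|*[!0-9]*) echo "invalid"; return 2 ;;
    esac
    if [ "$1" -eq 0 ]; then
        echo "disabled"     # default value: services never restart
        return 1
    elif [ "$1" -lt 30 ]; then
        echo "too-low"      # below the documented 30 second minimum
        return 1
    fi
    echo "ok"
}

# Example use:
#   t=$(kubectl get cm -n robinio robin-bootstrap-config -oyaml | read_only_timeout_from)
#   check_read_only_timeout "$t"
```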