21. Troubleshooting Tools
Robin Platform provides a number of native tools and commands that an administrator can use to troubleshoot a Robin cluster and report issues. These tools vary in their use case, but each provides enough information to explain why the cluster is not functioning as intended or why an unexpected failure occurred. They should therefore be the go-to utilities when debugging potential issues, and their output should accompany any bug report filed with Robin. Each tool is described in its respective section below.
Alongside the aforementioned tools for administrators, Robin Platform also provides more granular commands, detailed in the sections below, that individual users can use to track the progress of the operations they execute and determine why an operation failed. These operations are referred to as jobs, and each is identified by a unique ID. Each job has a set of attributes such as the job ID, job type, description, and so on. Robin stores a record of each job, including its metadata, in the database, and stores the corresponding job logs on the relevant nodes. An administrator can view the job logs and use them to troubleshoot issues within the cluster. It is recommended that the complete job logs be provided when reporting issues to Robin for debugging purposes.
The Robin job logs are stored in the following directories within the Robin container:
Server-side job logs are stored within /var/log/robin/server. Note that this directory is only present on the Robin master nodes.
Worker/agent-side job logs are stored within /var/log/robin/agent. This directory is present on all Robin nodes.
To access the job logs on the host instead of within the container, use /home/robinds/var/log/robin/server and /home/robinds/var/log/robin/agent respectively.
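As a small illustration of the path mapping described above, the following Python sketch translates a container-side log directory into its host-side equivalent. The helper name `host_log_path` is hypothetical and not part of Robin; the only facts taken from this section are the two container directories and the /home/robinds prefix.

```python
from pathlib import PurePosixPath

# Prefix under which the container's log directories appear on the host,
# as stated in the section above.
HOST_PREFIX = PurePosixPath("/home/robinds")

# Container-side job log directories documented above.
SERVER_LOG_DIR = PurePosixPath("/var/log/robin/server")  # master nodes only
AGENT_LOG_DIR = PurePosixPath("/var/log/robin/agent")    # all nodes


def host_log_path(container_path: str) -> str:
    """Map a container log path to the matching host path (hypothetical helper)."""
    path = PurePosixPath(container_path)
    if not path.is_absolute():
        raise ValueError("expected an absolute container path")
    # Drop the leading "/" so the path can be appended to the host prefix.
    return str(HOST_PREFIX / path.relative_to("/"))


if __name__ == "__main__":
    for directory in (SERVER_LOG_DIR, AGENT_LOG_DIR):
        print(directory, "->", host_log_path(str(directory)))
```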
21.1. Listing all jobs
Robin stores all jobs that have occurred during a cluster’s lifespan. To view these jobs alongside details such as their start time and state, issue the following command:
# robin job list --verbose
                 --ignoredeps
                 --noarchived
                 --nopurged
                 --states <states>
                 --failed
                 --nocolor
                 --page_size <size>
                 --page_num <num>
                 --total
                 --all
                 --app <app_name>
                 --k8sapp <k8sapp_name>
                 --vnode <vnode_name>
                 --node <node_name>
                 --disk <disk_wwn>
Option | Description |
---|---|
--verbose | Show complete job information instead of truncating it for display purposes |
--ignoredeps | Do not show child jobs |
--noarchived | Do not show archived jobs |
--nopurged | Do not show purged jobs |
--states <states> | Filter jobs based on states. Choose one or more from: active, failed, succeeded, archived, purged |
--failed | Show only jobs which have failed |
--nocolor | Show uncolored output |
--page_size <size> | Number of jobs that should be displayed for each page |
--page_num <num> | Page number to start displaying jobs from (starting index 1) |
--total | Return the total number of qualified root jobs |
--all | Display all jobs associated with a specific application. Note that this option is only valid when an application name is specified |
--app <app_name> | Filter jobs based on specified application |
--k8sapp <k8sapp_name> | Filter jobs based on specified K8s/Helm registered application name |
--vnode <vnode_name> | Filter jobs based on specified Vnode name |
--node <node_name> | Filter jobs based on specified physical node name |
--disk <disk_wwn> | Filter jobs based on specified disk WWN |
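When the same filters are applied repeatedly, for example from a monitoring script, it can help to assemble the command line programmatically. The sketch below is only an illustration: the function `job_list_command` and its keyword names are hypothetical, and it uses nothing beyond flags listed above. It builds and prints the command rather than running it.

```python
import shlex
from typing import List, Optional


def job_list_command(
    failed: bool = False,
    no_archived: bool = False,
    no_purged: bool = False,
    app: Optional[str] = None,
    page_size: Optional[int] = None,
    page_num: Optional[int] = None,
) -> List[str]:
    """Build an argument list for `robin job list` (hypothetical helper)."""
    argv = ["robin", "job", "list"]
    if failed:
        argv.append("--failed")
    if no_archived:
        argv.append("--noarchived")
    if no_purged:
        argv.append("--nopurged")
    if app is not None:
        argv += ["--app", app]
    if page_size is not None:
        argv += ["--page_size", str(page_size)]
    if page_num is not None:
        argv += ["--page_num", str(page_num)]
    return argv


if __name__ == "__main__":
    # Print the command so it can be reviewed before being run on a Robin node.
    print(shlex.join(job_list_command(failed=True, app="wp-1", page_size=20)))
```

Returning a list rather than a single string keeps the arguments safe to hand to `subprocess.run` without shell quoting concerns.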
Example:
# robin job list
ID | Type | Description | State | Start | End | User | Message
--------------+-------------------+--------------------------------------------------------------------------------------------------------------------------------------------+------------------+-----------------+---------+--------+------------------------------------------
1013 | ApplicationStart | Starting application 'wp-10' | COMPLETED | 13 Aug 23:28:29 | 0:00:54 | system |
|->1015 | RoleStart | Starting instances for role 'mysql' | COMPLETED | 13 Aug 23:28:30 | 0:00:38 | system |
| |->1017 | VnodeDeploy | Deploying vnode 'wp-10.mysql.01'. Origin: REST (cscale-82-140.robinsystems.com) | COMPLETED | 13 Aug 23:28:30 | 0:00:38 | system |
| | |->1018 | VnodeStop | Stopping vnode wp-10.mysql.01 on cscale-82-140.robinsystems.com | COMPLETED | 13 Aug 23:28:30 | 0:00:15 | system |
|->1016 | RoleStart | Starting instances for role 'wordpress' | COMPLETED | 13 Aug 23:29:08 | 0:00:14 | system |
| |->1024 | VnodeDeploy | Deploying vnode 'wp-10.wordpress.01'. Origin: REST (cscale-82-140.robinsystems.com) | COMPLETED | 13 Aug 23:29:08 | 0:00:14 | system |
| | |->1025 | VnodeStop | Stopping vnode wp-10.wordpress.01 on cscale-82-140.robinsystems.com | COMPLETED | 13 Aug 23:29:08 | 0:00:07 | system |
1014 | ApplicationStart | ApplicationStart | COMPLETED|FAILED | 13 Aug 23:28:29 | 0:00:00 | system | Another job is running on application 'w
1019 | ApplicationStart | Starting application 'wp-20' | COMPLETED | 13 Aug 23:28:31 | 0:00:51 | system |
|->1020 | RoleStart | Starting instances for role 'mysql' | COMPLETED | 13 Aug 23:28:32 | 0:00:36 | system |
| |->1022 | VnodeDeploy | Deploying vnode 'wp-20.mysql.01'. Origin: REST (cscale-82-140.robinsystems.com) | COMPLETED | 13 Aug 23:28:32 | 0:00:36 | system |
| | |->1023 | VnodeStop | Stopping vnode wp-20.mysql.01 on cscale-82-140.robinsystems.com | COMPLETED | 13 Aug 23:28:32 | 0:00:13 | system |
|->1021 | RoleStart | Starting instances for role 'wordpress' | COMPLETED | 13 Aug 23:29:08 | 0:00:14 | system |
| |->1026 | VnodeDeploy | Deploying vnode 'wp-20.wordpress.01'. Origin: REST (cscale-82-140.robinsystems.com) | COMPLETED | 13 Aug 23:29:08 | 0:00:14 | system |
| | |->1027 | VnodeStop | Stopping vnode wp-20.wordpress.01 on cscale-82-140.robinsystems.com | COMPLETED | 13 Aug 23:29:08 | 0:00:05 | system |
1028 | JobArchive | Archiving job/s on all hosts | COMPLETED | 14 Aug 00:00:00 | 0:00:02 | system |
|->1029 | AgentJobArchive | Archiving job/s on host cscale-82-140.robinsystems.com | COMPLETED | 14 Aug 00:00:01 | 0:00:00 | system |
1030 | HostProbe | Probed cscale-82-140.robinsystems.com from ONLINE/Ready ==> ONLINE/Ready. Origin: StartingHostWatch. | COMPLETED | 14 Aug 07:54:37 | 0:00:01 | system |
1031 | HostProbe | Probed cscale-82-139.robinsystems.com from UNREACHABLE/Notready ==> UNREACHABLE/Notready. Origin: StartingHostWatch. | COMPLETED | 14 Aug 07:54:37 | 0:00:51 | system |
1032 | HostProbe | Probed cscale-82-139.robinsystems.com from UNREACHABLE/Notready ==> UNREACHABLE/Notready. Origin: StartingHostWatch. | COMPLETED | 14 Aug 08:11:11 | 0:00:50 | system |
1033 | HostProbe | Probed cscale-82-140.robinsystems.com from ONLINE/Ready ==> ONLINE/Ready. Origin: StartingHostWatch. | COMPLETED | 14 Aug 08:11:11 | 0:00:01 | system |
1034 | HostProbe | Probed cscale-82-139.robinsystems.com from PROBE_PENDING/Notready ==> UNREACHABLE/Notready. Origin: NodeUp. | COMPLETED | 14 Aug 09:24:17 | 0:00:50 | system |
1035 | HostFailoverPods | Failing over pods on host cscale-82-139.robinsystems.com | COMPLETED|FAILED | 14 Aug 09:25:07 | 0:01:40 | system | Pods do not need to be failed over as Ku
1036 | HostProbe | Probed cscale-82-139.robinsystems.com from UNREACHABLE/Notready ==> ONLINE/Ready. Origin: StateChange. | COMPLETED | 14 Aug 09:25:17 | 0:00:01 | system |
1037 | ApplicationDelete | Deleting application 'wp-10' | COMPLETED | 14 Aug 09:41:10 | 0:00:12 | robin |
|->1038 | VnodeDelete | Deleting vnode 'wp-10.wordpress.01' from cscale-82-140.robinsystems.com | COMPLETED | 14 Aug 09:41:10 | 0:00:06 | robin |
|->1039 | VnodeDelete | Deleting vnode 'wp-10.mysql.01' from cscale-82-140.robinsystems.com | COMPLETED | 14 Aug 09:41:10 | 0:00:08 | robin |
1040 | ApplicationDelete | Deleting application 'wp-20' | COMPLETED | 14 Aug 09:41:16 | 0:00:13 | robin |
|->1041 | VnodeDelete | Deleting vnode 'wp-20.wordpress.01' from cscale-82-140.robinsystems.com | COMPLETED | 14 Aug 09:41:16 | 0:00:10 | robin |
|->1042 | VnodeDelete | Deleting vnode 'wp-20.mysql.01' from cscale-82-140.robinsystems.com | COMPLETED | 14 Aug 09:41:16 | 0:00:09 | robin |
1043 | ApplicationDelete | Deleting application 'wp-30' | COMPLETED | 14 Aug 09:41:20 | 0:00:19 | robin |
|->1044 | VnodeDelete | Deleting vnode 'wp-30.wordpress.01' from cscale-82-140.robinsystems.com | COMPLETED | 14 Aug 09:41:20 | 0:00:06 | robin |
|->1045 | VnodeDelete | Deleting vnode 'wp-30.mysql.01' from cscale-82-140.robinsystems.com | COMPLETED | 14 Aug 09:41:20 | 0:00:15 | robin |
1046 | ApplicationCreate | Adding application 'wp-1' | COMPLETED | 14 Aug 09:42:58 | 0:00:58 | robin |
|->1047 | RoleCreate | Provisioning containers for role 'mysql' | COMPLETED | 14 Aug 09:43:00 | 0:00:42 | robin |
| |->1049 | VnodeAdd | Adding vnode 'wp-1.mysql.01' on cscale-82-140.robinsystems.com | COMPLETED | 14 Aug 09:43:00 | 0:00:42 | robin |
|->1048 | RoleCreate | Provisioning containers for role 'wordpress' | COMPLETED | 14 Aug 09:43:42 | 0:00:14 | robin |
| |->1053 | VnodeAdd | Adding vnode 'wp-1.wordpress.01' on cscale-82-140.robinsystems.com | COMPLETED | 14 Aug 09:43:42 | 0:00:14 | robin |
1050 | ApplicationCreate | Adding application 'wp-2' | COMPLETED | 14 Aug 09:43:39 | 0:00:46 | robin |
|->1051 | RoleCreate | Provisioning containers for role 'mysql' | COMPLETED | 14 Aug 09:43:42 | 0:00:34 | robin |
| |->1054 | VnodeAdd | Adding vnode 'wp-2.mysql.01' on cscale-82-140.robinsystems.com | COMPLETED | 14 Aug 09:43:42 | 0:00:34 | robin |
|->1052 | RoleCreate | Provisioning containers for role 'wordpress' | COMPLETED | 14 Aug 09:44:16 | 0:00:09 | robin |
| |->1055 | VnodeAdd | Adding vnode 'wp-2.wordpress.01' on cscale-82-140.robinsystems.com | COMPLETED | 14 Aug 09:44:16 | 0:00:09 | robin |
1056 | ApplicationCreate | Adding application 'wp-3' | COMPLETED | 14 Aug 09:44:18 | 0:00:57 | robin |
|->1057 | RoleCreate | Provisioning containers for role 'mysql' | COMPLETED | 14 Aug 09:44:20 | 0:00:41 | robin |
| |->1059 | VnodeAdd | Adding vnode 'wp-3.mysql.01' on cscale-82-140.robinsystems.com | COMPLETED | 14 Aug 09:44:20 | 0:00:41 | robin |
|->1058 | RoleCreate | Provisioning containers for role 'wordpress' | COMPLETED | 14 Aug 09:45:01 | 0:00:13 | robin |
| |->1067 | VnodeAdd | Adding vnode 'wp-3.wordpress.01' on cscale-82-140.robinsystems.com | COMPLETED | 14 Aug 09:45:02 | 0:00:12 | robin |
1060 | ApplicationDelete | Deleting application 'wp-1' | COMPLETED | 14 Aug 09:44:53 | 0:00:17 | robin |
|->1061 | VnodeDelete | Deleting vnode 'wp-1.wordpress.01' from cscale-82-140.robinsystems.com | COMPLETED | 14 Aug 09:44:53 | 0:00:05 | robin |
|->1062 | VnodeDelete | Deleting vnode 'wp-1.mysql.01' from cscale-82-140.robinsystems.com | COMPLETED | 14 Aug 09:44:53 | 0:00:13 | robin |
1063 | ApplicationDelete | Deleting application 'wp-2' | COMPLETED | 14 Aug 09:44:57 | 0:00:21 | robin |
|->1064 | VnodeDelete | Deleting vnode 'wp-2.wordpress.01' from cscale-82-140.robinsystems.com | COMPLETED | 14 Aug 09:44:57 | 0:00:09 | robin |
|->1065 | VnodeDelete | Deleting vnode 'wp-2.mysql.01' from cscale-82-140.robinsystems.com | COMPLETED | 14 Aug 09:44:57 | 0:00:18 | robin |
1066 | ApplicationDelete | ApplicationDelete | COMPLETED|FAILED | 14 Aug 09:45:01 | 0:00:00 | robin | Another job is running on application 'w
1068 | ApplicationProbe | Probing application 'wp-3' | COMPLETED | 14 Aug 09:45:12 | 0:00:00 | robin |
1069 | ApplicationDelete | Deleting application 'wp-3' | COMPLETED | 14 Aug 09:45:16 | 0:00:12 | robin |
|->1070 | VnodeDelete | Deleting vnode 'wp-3.wordpress.01' from cscale-82-140.robinsystems.com | COMPLETED | 14 Aug 09:45:16 | 0:00:05 | robin |
|->1071 | VnodeDelete | Deleting vnode 'wp-3.mysql.01' from cscale-82-140.robinsystems.com | COMPLETED | 14 Aug 09:45:16 | 0:00:09 | robin |
1072 | ApplicationCreate | Adding application 'wp-1' | COMPLETED | 14 Aug 09:47:03 | 0:00:45 | robin |
|->1074 | RoleCreate | Provisioning containers for role 'wordpress' | COMPLETED | 14 Aug 09:47:39 | 0:00:08 | robin |
| |->1076 | VnodeAdd | Adding vnode 'wp-1.wordpress.01' on cscale-82-139.robinsystems.com | COMPLETED | 14 Aug 09:47:39 | 0:00:08 | robin |
|->1073 | RoleCreate | Provisioning containers for role 'mysql' | COMPLETED | 14 Aug 09:47:05 | 0:00:34 | robin |
| |->1075 | VnodeAdd | Adding vnode 'wp-1.mysql.01' on cscale-82-139.robinsystems.com | COMPLETED | 14 Aug 09:47:05 | 0:00:34 | robin |
1077 | ApplicationCreate | Adding application 'wp-2' | COMPLETED | 14 Aug 09:47:43 | 0:00:44 | robin |
|->1079 | RoleCreate | Provisioning containers for role 'wordpress' | COMPLETED | 14 Aug 09:48:18 | 0:00:09 | robin |
| |->1081 | VnodeAdd | Adding vnode 'wp-2.wordpress.01' on cscale-82-139.robinsystems.com | COMPLETED | 14 Aug 09:48:18 | 0:00:09 | robin |
|->1078 | RoleCreate | Provisioning containers for role 'mysql' | COMPLETED | 14 Aug 09:47:45 | 0:00:33 | robin |
| |->1080 | VnodeAdd | Adding vnode 'wp-2.mysql.01' on cscale-82-139.robinsystems.com | COMPLETED | 14 Aug 09:47:45 | 0:00:33 | robin |
1082 | ApplicationCreate | Adding application 'wp-3' | COMPLETED | 14 Aug 09:49:14 | 0:03:12 | robin |
|->1083 | RoleCreate | Provisioning containers for role 'mysql' | COMPLETED | 14 Aug 09:49:16 | 0:02:49 | robin |
| |->1085 | VnodeAdd | Adding vnode 'wp-3.mysql.01' on cscale-82-139.robinsystems.com | COMPLETED | 14 Aug 09:49:16 | 0:02:49 | robin |
|->1084 | RoleCreate | Provisioning containers for role 'wordpress' | COMPLETED | 14 Aug 09:52:05 | 0:00:20 | robin |
| |->1086 | VnodeAdd | Adding vnode 'wp-3.wordpress.01' on cscale-82-139.robinsystems.com | COMPLETED | 14 Aug 09:52:05 | 0:00:20 | robin |
1087 | HostProbe | Probed cscale-82-139.robinsystems.com from PROBE_PENDING/Notready ==> UNREACHABLE/Notready. Origin: NodeDown. | COMPLETED | 14 Aug 09:53:43 | 0:00:52 | system |
1088 | HostFailoverPods | Failing over pods on host cscale-82-139.robinsystems.com | COMPLETED | 14 Aug 09:54:35 | 0:00:01 | system |
1089 | ApplicationStart | Starting application 'wp-3' | COMPLETED | 14 Aug 09:54:38 | 0:03:41 | system |
|->1092 | RoleStart | Starting instances for role 'mysql' | COMPLETED | 14 Aug 09:54:38 | 0:01:53 | system |
| |->1094 | VnodeDeploy | Deploying vnode 'wp-3.mysql.01'. Origin: REST (cscale-82-140.robinsystems.com) | COMPLETED | 14 Aug 09:54:38 | 0:01:53 | system |
|->1093 | RoleStart | Starting instances for role 'wordpress' | COMPLETED | 14 Aug 09:56:31 | 0:01:48 | system |
| |->1102 | VnodeDeploy | Deploying vnode 'wp-3.wordpress.01'. Origin: REST (cscale-82-140.robinsystems.com) | COMPLETED | 14 Aug 09:56:31 | 0:01:48 | system |
1090 | ApplicationStart | Starting application 'wp-1' | COMPLETED | 14 Aug 09:54:38 | 0:03:44 | system |
|->1098 | RoleStart | Starting instances for role 'mysql' | COMPLETED | 14 Aug 09:54:39 | 0:01:51 | system |
| |->1100 | VnodeDeploy | Deploying vnode 'wp-1.mysql.01'. Origin: REST (cscale-82-140.robinsystems.com) | COMPLETED | 14 Aug 09:54:39 | 0:01:51 | system |
|->1099 | RoleStart | Starting instances for role 'wordpress' | COMPLETED | 14 Aug 09:56:30 | 0:01:52 | system |
| |->1101 | VnodeDeploy | Deploying vnode 'wp-1.wordpress.01'. Origin: REST (cscale-82-140.robinsystems.com) | COMPLETED | 14 Aug 09:56:30 | 0:01:52 | system |
1091 | ApplicationStart | Starting application 'wp-2' | COMPLETED | 14 Aug 09:54:38 | 0:03:44 | system |
|->1095 | RoleStart | Starting instances for role 'mysql' | COMPLETED | 14 Aug 09:54:39 | 0:01:52 | system |
| |->1097 | VnodeDeploy | Deploying vnode 'wp-2.mysql.01'. Origin: REST (cscale-82-140.robinsystems.com) | COMPLETED | 14 Aug 09:54:39 | 0:01:52 | system |
|->1096 | RoleStart | Starting instances for role 'wordpress' | COMPLETED | 14 Aug 09:56:31 | 0:01:51 | system |
| |->1103 | VnodeDeploy | Deploying vnode 'wp-2.wordpress.01'. Origin: REST (cscale-82-140.robinsystems.com) | COMPLETED | 14 Aug 09:56:32 | 0:01:50 | system |
1104 | ApplicationDelete | Deleting application 'wp-1' | COMPLETED | 14 Aug 10:18:34 | 0:00:15 | robin |
|->1105 | VnodeDelete | Deleting vnode 'wp-1.wordpress.01' from cscale-82-140.robinsystems.com | COMPLETED | 14 Aug 10:18:34 | 0:00:06 | robin |
|->1106 | VnodeDelete | Deleting vnode 'wp-1.mysql.01' from cscale-82-140.robinsystems.com | COMPLETED | 14 Aug 10:18:34 | 0:00:11 | robin |
1107 | ApplicationDelete | Deleting application 'wp-2' | COMPLETED | 14 Aug 10:18:38 | 0:00:14 | robin |
|->1108 | VnodeDelete | Deleting vnode 'wp-2.wordpress.01' from cscale-82-140.robinsystems.com | COMPLETED | 14 Aug 10:18:38 | 0:00:06 | robin |
|->1109 | VnodeDelete | Deleting vnode 'wp-2.mysql.01' from cscale-82-140.robinsystems.com | COMPLETED | 14 Aug 10:18:38 | 0:00:08 | robin |
1110 | ApplicationDelete | Deleting application 'wp-3' | COMPLETED | 14 Aug 10:18:43 | 0:00:15 | robin |
|->1111 | VnodeDelete | Deleting vnode 'wp-3.wordpress.01' from cscale-82-140.robinsystems.com | COMPLETED | 14 Aug 10:18:43 | 0:00:12 | robin |
|->1112 | VnodeDelete | Deleting vnode 'wp-3.mysql.01' from cscale-82-140.robinsystems.com | COMPLETED | 14 Aug 10:18:43 | 0:00:13 | robin |
1113 | HostProbe | Probed cscale-82-139.robinsystems.com from PROBE_PENDING/Notready ==> UNREACHABLE/Notready. Origin: NodeUp. | COMPLETED | 14 Aug 10:20:02 | 0:00:50 | system |
1114 | HostFailoverPods | Failing over pods on host cscale-82-139.robinsystems.com | COMPLETED|FAILED | 14 Aug 10:20:52 | 0:01:40 | system | Pods do not need to be failed over as Ku
1115 | HostProbe | Probed cscale-82-139.robinsystems.com from UNREACHABLE/Notready ==> ONLINE/Notready. Origin: StateChange.. Services Down: {'iomgr-server'} | COMPLETED | 14 Aug 10:22:17 | 0:00:00 | system |
1116 | HostProbe | Probed cscale-82-139.robinsystems.com from ONLINE/Notready ==> ONLINE/Notready. Origin: StateChange.. Services Down: {'iomgr-server'} | COMPLETED | 14 Aug 10:22:47 | 0:00:00 | system |
1117 | HostProbe | Probed cscale-82-139.robinsystems.com from ONLINE/Notready ==> ONLINE/Ready. Origin: StateChange. | COMPLETED | 14 Aug 10:22:59 | 0:00:00 | system |
1118 | ApplicationCreate | Adding application 'wp-1' | COMPLETED | 14 Aug 10:40:21 | 0:01:05 | robin |
|->1119 | RoleCreate | Provisioning containers for role 'mysql' | COMPLETED | 14 Aug 10:40:24 | 0:00:41 | robin |
| |->1121 | VnodeAdd | Adding vnode 'wp-1.mysql.01' on cscale-82-139.robinsystems.com | COMPLETED | 14 Aug 10:40:24 | 0:00:41 | robin |
|->1120 | RoleCreate | Provisioning containers for role 'wordpress' | COMPLETED | 14 Aug 10:41:05 | 0:00:21 | robin |
| |->1122 | VnodeAdd | Adding vnode 'wp-1.wordpress.01' on cscale-82-139.robinsystems.com | COMPLETED | 14 Aug 10:41:05 | 0:00:21 | robin |
1123 | ApplicationCreate | Adding application 'wp-2-no-aff' | COMPLETED | 14 Aug 10:45:45 | 0:00:57 | robin |
|->1124 | RoleCreate | Provisioning containers for role 'mysql' | COMPLETED | 14 Aug 10:45:48 | 0:00:41 | robin |
| |->1126 | VnodeAdd | Adding vnode 'wp-2-no-aff.mysql.01' on cscale-82-139.robinsystems.com | COMPLETED | 14 Aug 10:45:48 | 0:00:41 | robin |
|->1125 | RoleCreate | Provisioning containers for role 'wordpress' | COMPLETED | 14 Aug 10:46:29 | 0:00:13 | robin |
| |->1127 | VnodeAdd | Adding vnode 'wp-2-no-aff.wordpress.01' on cscale-82-139.robinsystems.com | COMPLETED | 14 Aug 10:46:29 | 0:00:13 | robin |
1128 | ApplicationCreate | Adding application 'wp-3-no-aff' | COMPLETED | 14 Aug 10:46:33 | 0:00:39 | robin |
|->1129 | RoleCreate | Provisioning containers for role 'mysql' | COMPLETED | 14 Aug 10:46:35 | 0:00:28 | robin |
| |->1131 | VnodeAdd | Adding vnode 'wp-3-no-aff.mysql.01' on cscale-82-139.robinsystems.com | COMPLETED | 14 Aug 10:46:35 | 0:00:28 | robin |
|->1130 | RoleCreate | Provisioning containers for role 'wordpress' | COMPLETED | 14 Aug 10:47:03 | 0:00:09 | robin |
| |->1132 | VnodeAdd | Adding vnode 'wp-3-no-aff.wordpress.01' on cscale-82-139.robinsystems.com | COMPLETED | 14 Aug 10:47:03 | 0:00:09 | robin |
1133 | HostProbe | Probed cscale-82-139.robinsystems.com from PROBE_PENDING/Notready ==> UNREACHABLE/Notready. Origin: NodeDown. | COMPLETED | 14 Aug 10:49:36 | 0:00:52 | system |
1134 | HostFailoverPods | Failing over pods on host cscale-82-139.robinsystems.com | COMPLETED | 14 Aug 10:50:28 | 0:00:01 | system |
1135 | ApplicationStart | Starting application 'wp-1' | COMPLETED | 14 Aug 10:50:29 | 0:03:22 | system |
|->1141 | RoleStart | Starting instances for role 'wordpress' | COMPLETED | 14 Aug 10:52:16 | 0:01:35 | system |
| |->1143 | VnodeDeploy | Deploying vnode 'wp-1.wordpress.01'. Origin: REST (cscale-82-140.robinsystems.com) | COMPLETED | 14 Aug 10:52:16 | 0:01:35 | system |
|->1140 | RoleStart | Starting instances for role 'mysql' | COMPLETED | 14 Aug 10:50:30 | 0:01:46 | system |
| |->1142 | VnodeDeploy | Deploying vnode 'wp-1.mysql.01'. Origin: REST (cscale-82-140.robinsystems.com) | COMPLETED | 14 Aug 10:50:30 | 0:01:46 | system |
1136 | VnodeDeploy | Deploying vnode 'wp-3-no-aff.wordpress.01'. Origin: Event (cscale-82-140.robinsystems.com) | COMPLETED | 14 Aug 10:50:29 | 0:01:48 | robin |
1137 | VnodeDeploy | Deploying vnode 'wp-3-no-aff.mysql.01'. Origin: Event (cscale-82-140.robinsystems.com) | COMPLETED | 14 Aug 10:50:29 | 0:02:04 | robin |
1138 | VnodeDeploy | Deploying vnode 'wp-2-no-aff.mysql.01'. Origin: Event (cscale-82-140.robinsystems.com) | COMPLETED | 14 Aug 10:50:29 | 0:02:07 | robin |
1139 | VnodeDeploy | Deploying vnode 'wp-2-no-aff.wordpress.01'. Origin: Event (cscale-82-140.robinsystems.com) | COMPLETED | 14 Aug 10:50:29 | 0:01:44 | robin |
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Returns all jobs that have occurred during a cluster’s lifespan.
End Point: /api/v5/robin_server/jobs
Method: GET
URL Parameters:
sort=[id|-id]: Sort the returned list of jobs by their ID.
noarchived=true: Exclude archived jobs from the results.
nopurged=true: Exclude purged jobs from the results.
failed=true: Return only failed jobs.
parent=true: Return only parent jobs.
page_size=<size>: Return <size> jobs per page.
page_num=<index>: Return jobs starting from page <index>.
objtype=[APPLICATION|K8S_APPLICATION|INSTANCE|DISK|NODE]: Return only jobs for the specified object type.
objname=<obj_name>: Return only jobs for objects with the specified name.
all=true: Return all jobs. Note that this option is only valid when an application name is specified.
Data Parameters: None
Port: RCM Port (default value is 29442)
Headers:
Authorization: <auth_token>: Authorization token identifying which user is sending the request. The token can be acquired from the login API.
Success Response Code: 200
Error Response Code: 500 (Internal Server Error)
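The endpoint, port, URL parameters, and header described above can be combined into a request as in the following Python sketch. It is an illustration only: the function name, the host name in the usage line, and the use of the https scheme are assumptions, and the request is constructed but not sent.

```python
from typing import Dict, Optional
from urllib.parse import urlencode
from urllib.request import Request

RCM_PORT = 29442  # default RCM port stated above
JOBS_ENDPOINT = "/api/v5/robin_server/jobs"


def build_job_list_request(
    host: str,
    auth_token: str,
    *,
    failed: bool = False,
    no_archived: bool = False,
    page_size: Optional[int] = None,
    page_num: Optional[int] = None,
) -> Request:
    """Build (but do not send) a GET request for the job list (hypothetical helper)."""
    params: Dict[str, str] = {}
    if failed:
        params["failed"] = "true"
    if no_archived:
        params["noarchived"] = "true"
    if page_size is not None:
        params["page_size"] = str(page_size)
    if page_num is not None:
        params["page_num"] = str(page_num)

    # Assumption: the RCM port is reached over HTTPS; adjust the scheme if not.
    url = f"https://{host}:{RCM_PORT}{JOBS_ENDPOINT}"
    if params:
        url += "?" + urlencode(params)
    return Request(url, headers={"Authorization": auth_token}, method="GET")


if __name__ == "__main__":
    # "robin-master.example.com" and the token value are placeholders.
    request = build_job_list_request("robin-master.example.com", "<auth_token>", failed=True)
    print(request.get_method(), request.full_url)
```

The resulting object can be passed to `urllib.request.urlopen` once a real host and token are available.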
Example Response:
{
"page_size":10,
"items":{
"users":[
{
"email":null,
"tenantid":1,
"firstname":"Robin",
"username":"robin",
"id":3,
"lastname":"Systems"
}
],
"jobs":[
{
"jobid":1888,
"tenant_id":1,
"enabled":true,
"child_job_ids":"[1889]",
"endtime":1597456503,
"children":[
{
"jobid":1889,
"tenant_id":1,
"enabled":true,
"child_job_ids":"[]",
"endtime":1597456498,
"parent_jobid":1888,
"error":0,
"message":"",
"taskrunner":1,
"starttime":1597456497,
"dependson_job_ids":"[]",
"level":"child",
"user_id":1,
"jtype":"CollectionOffline",
"timeout":86400,
"state":10,
"desc":"Taking collection 'file-collection-1597122699552' offline (Force False)"
}
],
"parent_jobid":0,
"error":0,
"message":"",
"taskrunner":1,
"starttime":1597456496,
"dependson_job_ids":"[]",
"level":"parent",
"user_id":1,
"jtype":"CollectionOnline",
"timeout":86400,
"state":10,
"desc":"Bringing collection 'file-collection-1597122699552' online"
},
{
"jobid":1887,
"tenant_id":1,
"enabled":true,
"child_job_ids":"[1890]",
"endtime":1597456504,
"children":[
{
"jobid":1890,
"tenant_id":1,
"enabled":true,
"child_job_ids":"[]",
"endtime":1597456499,
"parent_jobid":1887,
"error":0,
"message":"",
"taskrunner":1,
"starttime":1597456497,
"dependson_job_ids":"[]",
"level":"child",
"user_id":3,
"jtype":"VnodeStop",
"timeout":86400,
"state":10,
"desc":"Stopping vnode test-ds-1.server.01 on cscale-82-140.robinsystems.com"
}
],
"parent_jobid":0,
"error":0,
"message":"",
"taskrunner":1,
"starttime":1597456496,
"dependson_job_ids":"[]",
"level":"parent",
"user_id":3,
"jtype":"VnodeDeploy",
"timeout":86400,
"state":10,
"desc":"Deploying vnode 'test-ds-1.server.01'. Origin: Event (cscale-82-140.robinsystems.com)"
},
{
"jobid":1886,
"tenant_id":1,
"enabled":true,
"child_job_ids":"[]",
"endtime":1597456488,
"parent_jobid":0,
"error":0,
"message":"",
"taskrunner":0,
"starttime":1597456487,
"dependson_job_ids":"[]",
"level":"parent",
"user_id":1,
"jtype":"HostProbe",
"timeout":86400,
"state":10,
"desc":"Probed cscale-82-140.robinsystems.com from ONLINE\/Notready ==> ONLINE\/Ready. Origin: StateChange."
},
{
"jobid":1885,
"tenant_id":1,
"enabled":true,
"child_job_ids":"[]",
"endtime":1597456476,
"parent_jobid":0,
"error":0,
"message":"",
"taskrunner":0,
"starttime":1597456475,
"dependson_job_ids":"[]",
"level":"parent",
"user_id":1,
"jtype":"HostProbe",
"timeout":86400,
"state":10,
"desc":"Probed cscale-82-140.robinsystems.com from ONLINE\/Notready ==> ONLINE\/Notready. Origin: StateChange.. Services Down: {'iomgr-server'}"
},
{
"jobid":1884,
"tenant_id":1,
"enabled":true,
"child_job_ids":"[]",
"endtime":1597456470,
"parent_jobid":0,
"error":0,
"message":"",
"taskrunner":0,
"starttime":1597456470,
"dependson_job_ids":"[]",
"level":"parent",
"user_id":1,
"jtype":"HostProbe",
"timeout":86400,
"state":10,
"desc":"Probed cscale-82-140.robinsystems.com from ONLINE\/WaitingForMonitor ==> ONLINE\/Notready. Origin: StartingHostWatch.. Services Down: {'iomgr-server'}"
},
{
"jobid":1883,
"tenant_id":1,
"enabled":true,
"child_job_ids":"[]",
"endtime":1597456520,
"parent_jobid":0,
"error":0,
"message":"",
"taskrunner":0,
"starttime":1597456469,
"dependson_job_ids":"[]",
"level":"parent",
"user_id":1,
"jtype":"HostProbe",
"timeout":86400,
"state":10,
"desc":"Probed cscale-82-139.robinsystems.com from UNREACHABLE\/Notready ==> UNREACHABLE\/Notready. Origin: StartingHostWatch."
},
{
"jobid":1882,
"tenant_id":1,
"enabled":true,
"child_job_ids":"[]",
"endtime":1597456467,
"parent_jobid":0,
"error":0,
"message":"",
"taskrunner":0,
"starttime":1597456467,
"dependson_job_ids":"[]",
"level":"parent",
"user_id":1,
"jtype":"DiskNotify",
"timeout":86400,
"state":10,
"desc":"Event on disk '0x60022480940ed076551cfaf75612e24e'"
},
{
"jobid":1881,
"tenant_id":1,
"enabled":true,
"child_job_ids":"[]",
"endtime":1597456467,
"parent_jobid":0,
"error":0,
"message":"",
"taskrunner":0,
"starttime":1597456467,
"dependson_job_ids":"[]",
"level":"parent",
"user_id":1,
"jtype":"DiskNotify",
"timeout":86400,
"state":10,
"desc":"Event on disk '0x60022480ffcf3deb224fb37d78fe7767'"
},
{
"jobid":1880,
"tenant_id":1,
"enabled":true,
"child_job_ids":"[]",
"endtime":1597456467,
"parent_jobid":0,
"error":0,
"message":"",
"taskrunner":0,
"starttime":1597456467,
"dependson_job_ids":"[]",
"level":"parent",
"user_id":1,
"jtype":"DiskNotify",
"timeout":86400,
"state":10,
"desc":"Event on disk '0x600224804c48fd7e16c608dea0919064'"
},
{
"jobid":1879,
"tenant_id":1,
"enabled":true,
"child_job_ids":"[]",
"endtime":1597456467,
"parent_jobid":0,
"error":0,
"message":"",
"taskrunner":0,
"starttime":1597456467,
"dependson_job_ids":"[]",
"level":"parent",
"user_id":1,
"jtype":"DiskNotify",
"timeout":86400,
"state":10,
"desc":"Event on disk '0x600224803bcdafde95b1f5cd27ceb5fb'"
}
]
},
"total":1542,
"num_items":10,
"page_num":1
}
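A response shaped like the example above can be post-processed with a few lines of Python. The sketch below walks the nested "children" lists and estimates how many pages are available. Two points are assumptions rather than documented behaviour: that a non-zero "error" value marks a failed job, and that "total" counts all jobs matching the query. The helper names are hypothetical.

```python
import math
from typing import Any, Dict, Iterator, List


def iter_jobs(jobs: List[Dict[str, Any]]) -> Iterator[Dict[str, Any]]:
    """Yield every job in the list, each followed by its nested children."""
    for job in jobs:
        yield job
        # Child jobs, when present, are nested under the "children" key.
        yield from iter_jobs(job.get("children", []))


def failed_job_ids(response: Dict[str, Any]) -> List[int]:
    """Return IDs of jobs with a non-zero "error" field (assumed to mean failure)."""
    return [
        job["jobid"]
        for job in iter_jobs(response["items"]["jobs"])
        if job.get("error", 0) != 0
    ]


def page_count(response: Dict[str, Any]) -> int:
    """Number of pages needed to list "total" jobs at the current page size."""
    return math.ceil(response["total"] / response["page_size"])
```

With the values from the example response (a total of 1542 and a page size of 10), `page_count` evaluates to 155.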
21.2. Show information about a specific job
To get more detailed information about a specific job, including its state, duration, and any errors related to it or its child jobs, issue the following command:
# robin job info <id>
Argument | Description |
---|---|
<id> | Job ID |
Example:
# robin job info 1123
ID | Type | Desc | State | Start | End | Duration | Dependson | Error | Message
-----------+-------------------+---------------------------------------------------------------------------+-----------+-----------------+----------+----------+-----------+-------+---------
1123 | ApplicationCreate | Adding application 'wp-2-no-aff' | COMPLETED | 14 Aug 10:45:45 | 10:46:42 | 0:00:57 | [] | 0 |
|->1124 | RoleCreate | Provisioning containers for role 'mysql' | COMPLETED | 14 Aug 10:45:48 | 10:46:29 | 0:00:41 | [] | 0 |
| |->1126 | VnodeAdd | Adding vnode 'wp-2-no-aff.mysql.01' on cscale-82-139.robinsystems.com | COMPLETED | 14 Aug 10:45:48 | 10:46:29 | 0:00:41 | [] | 0 |
|->1125 | RoleCreate | Provisioning containers for role 'wordpress' | COMPLETED | 14 Aug 10:46:29 | 10:46:42 | 0:00:13 | [1124] | 0 |
| |->1127 | VnodeAdd | Adding vnode 'wp-2-no-aff.wordpress.01' on cscale-82-139.robinsystems.com | COMPLETED | 14 Aug 10:46:29 | 10:46:42 | 0:00:13 | [] | 0 |
Returns details about a specific job and any of its respective child jobs.
End Point: /api/v3/robin_server/jobs/<job_id>
Method: GET
URL Parameters: None
Data Parameters: None
Port: RCM Port (default value is 29442)
Headers:
Authorization: <auth_token>: Authorization token identifying which user is sending the request. The token can be acquired from the login API.
Success Response Code: 200
Error Response Code: 500 (Internal Server Error), 404 (Not Found Error), 401 (Authorization Error)
Example Response:
{
"tenant_name":"Administrators",
"jobid":1888,
"tenant_id":1,
"enabled":true,
"json":{
"collection_id":1597122699552,
"state":"SuspectedOffline",
"set_failed":true,
"origin":2,
"hostname":"cscale-82-140.robinsystems.com"
},
"user_name":"system",
"endtime":1597456503,
"parent_jobid":0,
"error":0,
"message":"",
"taskrunner":1,
"starttime":1597456496,
"child_job_ids":"[1889]",
"cjobs":[
{
"tenant_name":"Administrators",
"jobid":1889,
"tenant_id":1,
"enabled":true,
"json":{
"collection_id":1597122699552
},
"user_name":"system",
"endtime":1597456498,
"parent_jobid":1888,
"error":0,
"message":"",
"taskrunner":1,
"starttime":1597456497,
"child_job_ids":"[]",
"cjobs":[
],
"dependson_job_ids":"[]",
"user_id":1,
"jtype":"CollectionOffline",
"timeout":86400,
"state":10,
"desc":"Taking collection 'file-collection-1597122699552' offline (Force False)",
"priority":300
}
],
"dependson_job_ids":"[]",
"user_id":1,
"jtype":"CollectionOnline",
"timeout":86400,
"state":10,
"desc":"Bringing collection 'file-collection-1597122699552' online",
"priority":300
}
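The nested "cjobs" lists in a response like the one above form a tree that can be walked recursively. The following Python sketch computes the duration of each job in that tree. It assumes that "starttime" and "endtime" are Unix timestamps in seconds, which the example values suggest but this section does not state, and it does not handle jobs that are still running. The function name is hypothetical.

```python
from typing import Any, Dict, Iterator, Tuple


def job_durations(
    job: Dict[str, Any], depth: int = 0
) -> Iterator[Tuple[int, int, str, int]]:
    """Yield (depth, jobid, jtype, duration_seconds) for a job and its child jobs."""
    # Assumption: both timestamps are Unix epoch seconds.
    yield depth, job["jobid"], job["jtype"], job["endtime"] - job["starttime"]
    for child in job.get("cjobs", []):
        yield from job_durations(child, depth + 1)


def print_job_tree(job: Dict[str, Any]) -> None:
    """Print one indented line per job, similar in spirit to the CLI tree view."""
    for depth, jobid, jtype, seconds in job_durations(job):
        print(f"{'  ' * depth}{jobid} {jtype} {seconds}s")
```

For the example response, the parent job 1888 spans 7 seconds and its child job 1889 spans 1 second.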
21.3. Retrieving Job Logs
Robin provides a utility that collects all the appropriate logs from the necessary nodes for a particular job and its hierarchy of child jobs. It stores these logs in a single tarball that can be provided to Robin alongside a bug report. In addition, this is useful for an administrator debugging why a job failed unexpectedly. This functionality is convenient because it removes the need to log into every affected node to collect and inspect the relevant log files. Issue the following command to retrieve logs for a specific job:
# robin job get <id>
Argument | Description |
---|---|
<id> | ID of job to collect the logs for |
Example:
# robin job get 1
Retrieving log files...
Log files for Job ids: [1] are retrieved successfully at 1582189081.tar.gz
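Before attaching the tarball to a bug report, it can be useful to check what it contains. The Python sketch below lists the regular files inside a gzip-compressed tarball using the standard library. The function name is hypothetical, and no assumption is made about the directory layout inside the archive.

```python
import tarfile
from typing import List


def list_job_log_files(tarball_path: str) -> List[str]:
    """Return the names of the regular files stored in a .tar.gz archive."""
    with tarfile.open(tarball_path, "r:gz") as archive:
        return [member.name for member in archive.getmembers() if member.isfile()]


if __name__ == "__main__":
    # File name taken from the example output above.
    for name in list_job_log_files("1582189081.tar.gz"):
        print(name)
```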
21.4. Archiving Job Logs
Robin job logs can be archived in order to prevent data loss, improve security, and free up space in the parent directory. The archival process moves all completed job logs to the archived sub-directory of the parent directory. The archived sub-directories are contained within the /var/log/robin/server and /var/log/robin/agent directories. There are two methods by which this can be achieved. The first is an automatic schedule, whose parameters can be configured, and the second is the robin job archive command detailed below. By default, the Robin job archive schedule automatically archives the logs for jobs that completed successfully and are older than 24 hours.
Note
The logs for failed jobs remain in the parent directories for analysis purposes.
21.4.1. Archive a Job on demand¶
In order to archive Robin jobs and their respective logs on demand, run the following command:
# robin job archive --age <age>
--include-failed
--age <age> | Minimum age (in minutes) of the job(s) whose logs should be archived |
--include-failed | Archive the logs for failed jobs as well |
Example:
# robin job archive --age 600 --wait
Job: 255170 Name: JobArchive State: PROCESSED Error: 0
Job: 255170 Name: JobArchive State: PREPARED Error: 0
Job: 255170 Name: JobArchive State: WAITING Error: 0
Job: 255170 Name: JobArchive State: COMPLETED Error: 0
21.4.2. Configure Job archive schedule attributes¶
Listed below are all the attributes a user can configure with regards to the scheduled job archival task.
Attribute | Default value | Description |
---|---|---|
job_archive_age | 86400 | The age (in seconds) of the completed job(s) whose logs should be automatically archived. |
 | 0 0 * * * | The time at which the job archival schedule is run. The value for this attribute must be a valid CRON format, details of which can be found here. |
In order to update any of the aforementioned configurable attributes, run the following command:
# robin config update server <attribute> <value>
Example:
# robin config update server job_archive_age 81000
The 'server' attribute 'job_archive_age' has been updated
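Because the age attribute is expressed in seconds, it can help to derive the value from a human-readable duration instead of typing it by hand. The sketch below is illustrative only: the helper name is hypothetical, and it only builds the argument list for the command shown above without running it.

```python
from datetime import timedelta


def config_update_args(attribute, age):
    """Build the argument list for `robin config update server <attribute> <seconds>`."""
    seconds = int(age.total_seconds())
    if seconds <= 0:
        raise ValueError("age must be a positive duration")
    return ["robin", "config", "update", "server", attribute, str(seconds)]


# 22 hours 30 minutes is the 81000 seconds used in the example above.
print(" ".join(config_update_args("job_archive_age", timedelta(hours=22, minutes=30))))
```

The resulting list can be passed to `subprocess.run` on a node where the Robin client is available.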
21.5. Purging Job logs¶
Robin enables users to purge the logs for jobs that are no longer needed in order to free up space. There are two methods by which this can be achieved. The first is an automatic schedule, whose parameters can be configured, and the second is the robin job purge command detailed in the section below. By default, the Robin job purge schedule removes jobs (and their respective logs) that fall into any of the following categories:
Successful jobs older than two weeks.
Failed jobs older than four weeks.
Robin maintenance jobs older than one week.
Both methods for purging a job remove the record(s) of the job and its respective child jobs from the Robin database, and also delete the following associated log files if present:
The server job log directory at /var/log/robin/server/<job-id> on the Robin master node.
The archived server job log directory at /var/log/robin/server/archived/<job-id>.tar.gz on the Robin master node.
The agent job log directory at /var/log/robin/agent/<job-id> on all nodes.
The archived agent job log directory at /var/log/robin/agent/archived/<job-id>.tar.gz on all nodes.
21.5.1. Purge a Job on demand¶
In order to purge Robin jobs and their respective logs on demand, run the following command:
# robin job purge --age <age>
--failed-job-age <failed_age>
--maintenance-job-age <maintenance_age>
--maintenance-job-types <maintenance_types>
--before-id <id>
--age <age> | Purge successful jobs that completed before the specified date and time in ‘%Y-%m-%dT%H:%M:%S’ format. The default is two weeks earlier than the current date. |
--failed-job-age <failed_age> | Purge failed jobs that completed before the specified date and time in ‘%Y-%m-%dT%H:%M:%S’ format. The default is four weeks earlier than the current date. |
--maintenance-job-age <maintenance_age> | Purge maintenance jobs that completed before the specified date and time in ‘%Y-%m-%dT%H:%M:%S’ format. The default is four weeks earlier than the current date. |
--maintenance-job-types <maintenance_types> | Comma-separated list of job types to be considered maintenance jobs. The default types include: JobArchive and JobPurge. |
--before-id <id> | Jobs whose IDs are lower than the specified ID will be purged. Note if |
Example:
# robin job purge --age 2021-04-06T18:14:00 --failed-job-age 2021-04-06T18:14:00 --maintenance-job-age 2021-04-06T18:14:00 --wait
Job: 309 Name: JobPurge State: VALIDATED Error: 0
Job: 309 Name: JobPurge State: COMPLETED Error: 0
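The purge options take absolute timestamps rather than relative ages, so a cutoff such as "two weeks ago" has to be converted first. The following sketch shows one way to do that in Python. The function names are hypothetical, and whether the CLI interprets the timestamp as local time or UTC is not stated in this guide, so the caller is expected to pass a `now` value in whichever time zone applies.

```python
from datetime import datetime, timedelta

# Timestamp format documented for the purge options.
PURGE_TIME_FORMAT = "%Y-%m-%dT%H:%M:%S"


def purge_cutoff(now, older_than):
    """Return the timestamp that lies `older_than` before `now`."""
    return (now - older_than).strftime(PURGE_TIME_FORMAT)


def purge_args(now, success_age, failed_age, maintenance_age):
    """Build the argument list for `robin job purge` from relative ages."""
    return [
        "robin", "job", "purge",
        "--age", purge_cutoff(now, success_age),
        "--failed-job-age", purge_cutoff(now, failed_age),
        "--maintenance-job-age", purge_cutoff(now, maintenance_age),
    ]


now = datetime(2021, 4, 20, 18, 14, 0)
print(purge_cutoff(now, timedelta(weeks=2)))  # 2021-04-06T18:14:00
```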
21.5.2. Configure Job purge schedule attributes¶
Listed below are all the attributes a user can configure with regards to the scheduled job purge task.
Attribute | Default value | Description |
---|---|---|
job_purge_age | 1209600 | The age (in seconds) of the completed job(s) which should be automatically purged. |
 | 30 0 * * * | The time at which the job purge schedule is run. The value for this attribute must be a valid CRON format, details of which can be found here. Robin recommends that the schedule run daily. |
 | 2419200 | The age (in seconds) of the failed job(s) which should be automatically purged. |
 | 604800 | The age (in seconds) of the maintenance job(s) which should be automatically purged. |
 | JobArchive,JobPurge | The types of maintenance jobs to be purged. |
 | 100000 | The maximum number of jobs that can be purged at a time. |
In order to update any of the aforementioned configurable attributes, run the following command:
# robin config update server <attribute> <value>
Example:
# robin config update server job_purge_age 13396198
The 'server' attribute 'job_purge_age' has been updated
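The default ages in the table above are given in seconds, which makes them hard to compare with the two-week, four-week, and one-week defaults described earlier. As a quick sanity check, the sketch below converts the values to days. The dictionary keys are descriptive labels chosen for this example and are not attribute names.

```python
from datetime import timedelta

# Default ages (in seconds) from the table above, keyed by a descriptive label.
PURGE_DEFAULTS = {
    "successful jobs": 1209600,
    "failed jobs": 2419200,
    "maintenance jobs": 604800,
}


def as_days(seconds):
    """Convert a number of seconds into (possibly fractional) days."""
    return timedelta(seconds=seconds) / timedelta(days=1)


for label, seconds in PURGE_DEFAULTS.items():
    print(f"{label}: {seconds} s = {as_days(seconds):g} days")
```

The same helper shows that the value 13396198 used in the example corresponds to roughly 155 days.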
21.6. Cleaning up stale Job logs¶
In certain cases, logs for jobs can remain within their respective job directories or within the archived job log directory even though the record for the job has been deleted from the database. These job logs are deemed to be stale, as the Robin database is considered to be the most reliable source of the jobs run on the cluster. Robin provides two methods by which these stale job logs can be removed. The first is an automatic schedule, whose parameters can be configured, and the second is the robin job cleanup command detailed in the section below. By default, the Robin job cleanup schedule runs on the first day of every month and removes the logs for jobs whose records are no longer stored within the database.
Note
It is recommended that the reconciliation between the job records stored and the logs present happen at least once a month to free up space and avoid retaining the logs for jobs which are no longer relevant.
21.6.1. Cleanup stale Job logs on demand¶
In order to clean up stale job logs present on the cluster, run the following command:
# robin job cleanup
Example:
# robin job cleanup --wait
Job: 358447 Name: JobCleanupStaleLogs State: WAITING Error: 0
Job: 358447 Name: JobCleanupStaleLogs State: COMPLETED Error: 0
21.6.2. Configure Job cleanup schedule attributes¶
Listed below are all the attributes a user can configure with regards to the scheduled job cleanup task.
Attribute | Default value | Description |
---|---|---|
job_cleanup_cron | 0 1 1 * * | The time at which the job cleanup schedule is run. The value for this attribute must be a valid CRON format, details of which can be found here. Robin recommends that the schedule run monthly. |
In order to update any of the aforementioned configurable attributes, run the following command:
# robin config update server <attribute> <value>
Example:
# robin config update server job_cleanup_cron "0 1 2 * *"
The 'server' attribute 'job_cleanup_cron' has been updated
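To make the schedule values easier to read, the sketch below splits a cron expression into named fields. It assumes the standard five-field cron layout (minute, hour, day of month, month, day of week), which matches the shape of the defaults shown in this chapter. The helper name is hypothetical, and the sketch only checks the number of fields, not the validity of each value.

```python
CRON_FIELDS = ("minute", "hour", "day_of_month", "month", "day_of_week")


def split_cron(expression):
    """Split a five-field cron expression into a dict of named fields."""
    parts = expression.split()
    if len(parts) != len(CRON_FIELDS):
        raise ValueError(f"expected {len(CRON_FIELDS)} fields, got {len(parts)}")
    return dict(zip(CRON_FIELDS, parts))


# The example above moves the cleanup from the 1st to the 2nd day of the month.
print(split_cron("0 1 2 * *"))
```

Read this way, the default `0 1 1 * *` runs at 01:00 on the first day of every month.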
21.7. Log Collection¶
During a cluster-wide failure or any unexpected scenario that affects multiple services, Robin needs logs from all the system components in order to debug the issue properly. Depending on the scope of the issue, only a subset of the logs may need to be collected. This granularity is available, but it is highly recommended to always send the complete set of logs when filing a bug report with Robin. Age-based filtering is available to reduce the storage footprint. Robin supports uploading logs to the following destinations:
robin-storage | Used to store collected logs in Robin backed storage |
nfs | Used to store collected logs in NFS |
s3 | Used to store collected logs in Amazon S3 |
ssh | Used to store collected logs in a given remote location |
21.7.1. Storing logs using Robin Storage¶
Logs collected by Robin can be stored on a volume created on the local cluster, with the following command:
Note
If you do not use the --age option, Robin CNP collects the logs for the last 3 days by default.
# robin log collect robin-storage <rpool>
--nodes <nodes>
--dest-path <dest_path>
--size <size>
--media <media>
--age <age>
<rpool> | Name of the resource pool to use |
--nodes <nodes> | Comma-separated list of nodes from which to collect logs. The default is to collect from all nodes |
--dest-path <dest_path> | Destination path where log files will be copied |
--size <size> | Size of the storage volume for the log collection. The default is 250GB |
--media <media> | Type of drive to allocate storage from. Choices are ‘HDD’ and ‘SSD’. The default media type is ‘HDD’ |
--age <age> | Collects logs based on age. Valid units are s (seconds), m (minutes), h (hours), d (days), Mo (months), and y (years); the value all is also accepted. Example: use 10m for 10 minutes. The default is to collect the logs of the last 3 days. |
Example:
# robin log collect robin-storage default --wait
Job: 123 Name: LogCollect State: PROCESSED Error: 0
Job: 123 Name: LogCollect State: WAITING Error: 0
Job: 123 Name: LogCollect State: COMPLETED Error: 0
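The value passed to --age combines a number with one of the unit suffixes listed above. The following sketch validates such a value before it is handed to the CLI. It is an assumption-laden illustration: the function name is hypothetical, and the exact grammar accepted by Robin (for example whether spaces or lowercase "mo" are allowed) is not specified here, so the pattern only accepts the forms shown in the option description.

```python
import re

# Unit suffixes as listed for the --age option; "Mo" is case-sensitive here.
_AGE_PATTERN = re.compile(r"(\d+)(s|m|h|d|Mo|y)")


def parse_age(value):
    """Return (amount, unit) for values such as '10m', or None for 'all'."""
    if value == "all":
        return None
    match = _AGE_PATTERN.fullmatch(value)
    if match is None:
        raise ValueError(f"unsupported age value: {value!r}")
    return int(match.group(1)), match.group(2)


print(parse_age("10m"))  # (10, 'm')
print(parse_age("2Mo"))  # (2, 'Mo')
```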
21.7.2. Storing logs using NFS¶
Logs collected by Robin can be stored on an NFS share with the following command:
Note
If you do not use the --age option, Robin CNP collects the logs for the last 3 days by default.
# robin log collect nfs <nfs_share>
--nodes <nodes>
--age <age>
<nfs_share> | The ‘hostname’ or ‘IP’, ‘export_path’ and ‘dest_path’ for an NFS share in the form of ‘<hostname|IP>:<export_path>:<dest_path>’ |
--nodes <nodes> | Comma-separated list of nodes from which to collect logs. The default is to collect from all nodes |
--age <age> | Collects logs based on age. Valid units are s (seconds), m (minutes), h (hours), d (days), Mo (months), and y (years); the value all is also accepted. Example: use 10m for 10 minutes. The default is to collect the logs of the last 3 days. |
Example:
# robin log collect nfs 10.9.82.162:/tmp:/demo_log_collect
Job: 126 Name: LogCollect State: PROCESSED Error: 0
Job: 126 Name: LogCollect State: WAITING Error: 0
Job: 126 Name: LogCollect State: COMPLETED Error: 0
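The NFS share argument packs three values into one colon-separated string. As a small illustration, the sketch below splits such a string back into its parts and rejects malformed input. The function name is hypothetical, and the sketch assumes a hostname or IPv4 address; an IPv6 literal contains colons and would need different handling.

```python
def split_nfs_share(share):
    """Split '<hostname|IP>:<export_path>:<dest_path>' into its three parts."""
    parts = share.split(":")
    if len(parts) != 3 or not all(parts):
        raise ValueError(f"expected <host>:<export_path>:<dest_path>, got {share!r}")
    host, export_path, dest_path = parts
    return host, export_path, dest_path


# The share used in the example above.
print(split_nfs_share("10.9.82.162:/tmp:/demo_log_collect"))
```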
21.7.3. Storing logs using AWS S3¶
Logs collected by Robin can be stored in AWS S3 with the following command:
Note
If you do not use the --age option, Robin CNP collects the logs for the last 3 days by default.
# robin log collect s3 <url> <aws_config>
--nodes <nodes>
--access_key <access_key>
--secret_key <secret_key>
--age <age>
<url> | S3 URL in the format https://s3-<region-name>.amazonaws.com/<bucket-name>/<directory> |
<aws_config> | JSON file containing the access key, secret key, and region. Example format: {"aws_access_key_id": <key>, "aws_secret_access_key": <key>, "region": <region_name>} |
--nodes <nodes> | Comma-separated list of nodes from which to collect logs. The default is to collect from all nodes |
--access_key <access_key> | Access key for the respective user with access to the specified S3 bucket. |
--secret_key <secret_key> | Secret key for the respective user with access to the specified S3 bucket. |
--age <age> | Collects logs based on age. Valid units are s (seconds), m (minutes), h (hours), d (days), Mo (months), and y (years); the value all is also accepted. Example: use 10m for 10 minutes. The default is to collect the logs of the last 3 days. |
Example:
# robin log collect s3 https://s3-us-west-2.amazonaws.com/log-collect/demo_log_collect /root/aws.json --wait
Job: 132 Name: LogCollect State: PROCESSED Error: 0
Job: 132 Name: LogCollect State: WAITING Error: 0
Job: 132 Name: LogCollect State: COMPLETED Error: 0
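Both positional arguments of the S3 variant follow a fixed shape, so they can be generated rather than written by hand. The sketch below builds the URL and the contents of the JSON configuration file in the formats described above. The function names and the placeholder credentials are hypothetical; real keys should come from a secure source and the resulting file should be readable only by its owner.

```python
import json


def s3_log_url(region, bucket, directory):
    """Build the S3 URL in the https://s3-<region>.amazonaws.com/<bucket>/<directory> form."""
    return f"https://s3-{region}.amazonaws.com/{bucket}/{directory.strip('/')}"


def aws_config_json(access_key, secret_key, region):
    """Return the JSON text expected in the <aws_config> file."""
    config = {
        "aws_access_key_id": access_key,
        "aws_secret_access_key": secret_key,
        "region": region,
    }
    return json.dumps(config, indent=2)


print(s3_log_url("us-west-2", "log-collect", "demo_log_collect"))
print(aws_config_json("EXAMPLE_ACCESS_KEY", "EXAMPLE_SECRET_KEY", "us-west-2"))
```

The first call reproduces the URL used in the example above.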
21.7.4. Storing logs in a remote location¶
Logs collected by Robin can be stored in a remote location, with the following command:
Note
If you do not use the --age option, Robin CNP collects the logs for the last 3 days by default.
# robin log collect ssh <dest>
--nodes <nodes>
--password <password>
--age <age>
<dest> | Destination path where the log files will be copied to. The path should be in the form of ‘<user>@<hostname|IP>:<path>’ |
--nodes <nodes> | Comma-separated list of nodes from which to collect logs. The default is to collect from all nodes |
--password <password> | Provide a password on the command line instead of via a prompt |
--age <age> | Collects logs based on age. Valid units are s (seconds), m (minutes), h (hours), d (days), Mo (months), and y (years); the value all is also accepted. Example: use 10m for 10 minutes. The default is to collect the logs of the last 3 days. |
Example:
# robin log collect ssh root@10.9.82.163:/demo_log_collect --password robin123
Job: 129 Name: LogCollect State: PROCESSED Error: 0
Job: 129 Name: LogCollect State: WAITING Error: 0
Job: 129 Name: LogCollect State: COMPLETED Error: 0
21.8. Cluster Auditing¶
Every operation that is performed by a user on an identifiable object within a Robin cluster is logged for auditing purposes. This allows admins to track the exact series of operations performed by a user as well as to monitor the general activity on the cluster concerned. It not only enables more accurate backtracking for troubleshooting purposes but also improves the thoroughness of security audits. Detailed below are the methods by which a user can retrieve the audit log.
21.8.1. Retrieving audit logs from the Robin Database¶
In order to access the audit log, which contains information such as which user executed an operation, the tenant and node from which they executed it, the type of object and operation involved, and the result of the operation, issue the following command:
# robin user-audit list --exec-user <exec_user>
--exec-tenant <exec_tenant>
--owner-user <owner_user>
--owner-tenant <owner_tenant>
--id <record_id>
--object-type <object_type>
--page_size <size>
--page_num <num>
--operation <operation>
--result <result>
--full
--exec-user <exec_user> | Filter by username for the user who initiated the operation. Note this option cannot be used in conjunction with --owner-user |
--exec-tenant <exec_tenant> | Filter by tenant name for the user who initiated the operation. Note this option cannot be used in conjunction with --owner-tenant |
--owner-user <owner_user> | Filter by username for the user who initiated the operation. Note this option cannot be used in conjunction with --exec-user |
--owner-tenant <owner_tenant> | Filter by tenant name for the user who initiated the operation. Note this option cannot be used in conjunction with --exec-tenant |
--id <record_id> | Filter for a specific record ID |
--object-type <object_type> | Filter by object type |
--operation <operation> | Filter by operation |
--page_size <size> | Number of audit records that should be displayed for each page |
--page_num <num> | Page number to start displaying audit records from (starting index 1) |
--result <result> | Filter by operation result |
--full | Display additional information about the audit records |
Example 1 (List first page of audit records):
# robin user-audit list
Id | Timestamp | IP Addr | Exec User | Exec Tenant | Owner User | Owner Tenant | Object Type | Operation | Result
----+--------------------------+-------------+-----------+----------------+------------+--------------+-----------------+-----------+---------
643 | August 10, 2021 14:17:47 | 172.20.0.1 | robin | Administrators | | | USER | login | success
642 | July 13, 2021 11:24:13 | 10.9.121.40 | robin | Administrators | | | USER | login | success
641 | July 13, 2021 11:24:12 | 172.20.0.1 | robin | Administrators | | | METRICS | enable | success
640 | July 13, 2021 11:24:10 | 172.20.0.1 | robin | Administrators | | | CONFIG | update | success
639 | July 13, 2021 11:24:06 | 172.20.0.1 | robin | Administrators | | | NODE | probe | success
638 | July 13, 2021 11:24:04 | 172.20.0.1 | robin | Administrators | | | NODE | probe | success
637 | July 13, 2021 11:24:04 | 172.20.0.1 | robin | Administrators | | | NODE | probe | success
636 | July 13, 2021 11:23:58 | 172.20.0.1 | robin | Administrators | | | USER | login | success
635 | July 13, 2021 11:23:57 | 172.20.0.1 | robin | Administrators | | | USER | login | success
634 | July 13, 2021 11:23:49 | 172.20.0.1 | robin | Administrators | | | FILE_COLLECTION | online | success
633 | July 13, 2021 11:23:44 | 172.20.0.1 | robin | Administrators | | | USER | login | success
632 | July 13, 2021 11:20:07 | 172.20.0.1 | robin | Administrators | | | NODE | config | success
631 | July 13, 2021 11:20:07 | 172.20.0.1 | robin | Administrators | | | NODE | config | success
630 | July 13, 2021 11:20:07 | 172.20.0.1 | robin | Administrators | | | NODE | config | success
629 | July 13, 2021 11:20:01 | 172.20.0.1 | robin | Administrators | | | NODE | config | success
628 | July 13, 2021 11:20:01 | 172.20.0.1 | robin | Administrators | | | NODE | config | success
627 | July 13, 2021 11:20:01 | 172.20.0.1 | robin | Administrators | | | NODE | config | success
626 | July 13, 2021 11:19:59 | 172.20.0.1 | robin | Administrators | | | USER | login | success
625 | July 13, 2021 11:19:01 | 172.20.0.1 | robin | Administrators | | | USER | login | success
624 | July 13, 2021 11:18:57 | 172.20.0.1 | robin | Administrators | | | USER | login | success
--------------------------------------------
537 items, page 1 of 27.
--------------------------------------------
Example 2 (List audit records filtered by object type):
# robin user-audit list --object-type APPLICATION
Id | Timestamp | IP Addr | Exec User | Exec Tenant | Owner User | Owner Tenant | Object Type | Operation | Result
---+---------------------------+------------+-----------+----------------+------------+----------------+-------------+-----------+---------
46 | October 26, 2020 12:51:46 | 172.17.0.1 | robin | Administrators | robin | Administrators | APPLICATION | create | success
45 | October 26, 2020 12:51:25 | 172.17.0.1 | robin | Administrators | robin | Administrators | APPLICATION | create | success
44 | October 26, 2020 12:51:18 | 172.17.0.1 | robin | Administrators | robin | Administrators | APPLICATION | create | success
43 | October 26, 2020 12:51:06 | 172.17.0.1 | robin | Administrators | robin | Administrators | APPLICATION | create | success
42 | October 26, 2020 12:50:59 | 172.17.0.1 | robin | Administrators | robin | Administrators | APPLICATION | create | success
41 | October 26, 2020 12:49:44 | 172.17.0.1 | robin | Administrators | robin | Administrators | APPLICATION | create | success
40 | October 26, 2020 12:49:26 | 172.17.0.1 | robin | Administrators | robin | Administrators | APPLICATION | create | success
39 | October 26, 2020 12:49:17 | 172.17.0.1 | robin | Administrators | robin | Administrators | APPLICATION | create | success
38 | October 26, 2020 12:49:03 | 172.17.0.1 | robin | Administrators | robin | Administrators | APPLICATION | create | success
37 | October 26, 2020 12:46:17 | 172.17.0.1 | robin | Administrators | robin | Administrators | APPLICATION | create | success
36 | October 26, 2020 12:45:35 | 172.17.0.1 | robin | Administrators | robin | Administrators | APPLICATION | create | success
--------------------------------------------
11 items, page 1 of 1.
--------------------------------------------
Example 3 (Show details for a single audit record):
# robin user-audit list --id 46 --full
Id | Timestamp | IP Addr | Exec User | Exec Tenant | Owner User | Owner Tenant | Object Type | Operation | Result
---+---------------------------+------------+-----------+----------------+------------+----------------+-------------+-----------+---------
46 | October 26, 2020 12:51:46 | 172.17.0.1 | robin | Administrators | robin | Administrators | APPLICATION | create | success
object_attributes: {'tenant_id': 1, 'object_id': 11, 'jobid': 74, 'object_name': 'app-11', 'user_id': 3}
details:
--------------------------------------------
1 items, page 1 of 1.
--------------------------------------------
Returns audit records containing details such as the user who performed the action, the tenant and node it was performed from, details on the operation itself and the object it was performed on.
End Point: /api/v3/robin_server/user-audit
Method: GET
URL Parameters:
exec_user=<exec_user>: Utilizing this parameter results in only audit records detailing operations initiated by the specified user being returned. Note this option cannot be used in conjunction with the owner_user parameter.
exec_tenant=<exec_tenant>: Utilizing this parameter results in only audit records detailing operations initiated in the specified tenant being returned. Note this option cannot be used in conjunction with the owner_tenant parameter.
owner_user=<owner_user>: Utilizing this parameter results in only audit records detailing operations initiated by the specified user being returned. Note this option cannot be used in conjunction with the exec_user parameter.
owner_tenant=<owner_tenant>: Utilizing this parameter results in only audit records detailing operations initiated in the specified tenant being returned. Note this option cannot be used in conjunction with the exec_tenant parameter.
id=<record_id>: Utilizing this parameter results in only the audit record with the specified ID being returned.
object_type=<object_type>: Utilizing this parameter results in only audit records associated with the specified object type being returned.
operation=<operation>: Utilizing this parameter results in only audit records associated with the specified operation being returned.
page_size=<size>: Utilizing this parameter results in <size> number of audit records being returned.
page_num=<index>: Utilizing this parameter results in audit records starting from <index> being returned.
result=<result>: Utilizing this parameter results in only audit records matching the specified result being returned.
Data Parameters: None
Port: RCM Port (default value is 29442)
Headers:
Authorization: <auth_token>
: Authorization token to identify which user is sending the request. The token can be acquired from the login API.
Success Response Code: 200
Error Response Code: 500 (Internal Server Error)
Example Response:
Output
{
"object_type":"UserAuditRecord",
"start":1,
"count":20,
"total":538,
"page_size":20,
"page_num":1,
"items":[
{
"id":644,
"timestamp":"August 11, 2021 03:12:55",
"ip_addr":"172.20.0.1",
"exec_user_id":3,
"exec_username":"robin",
"exec_tenant_id":1,
"exec_tenant":"Administrators",
"owner_user_id":null,
"owner_username":null,
"owner_tenant_id":null,
"owner_tenant":null,
"object_type":"USER",
"operation":"login",
"result":"success",
"object_attributes":{
},
"details":{
}
},
{
"id":643,
"timestamp":"August 10, 2021 14:17:47",
"ip_addr":"172.20.0.1",
"exec_user_id":3,
"exec_username":"robin",
"exec_tenant_id":1,
"exec_tenant":"Administrators",
"owner_user_id":null,
"owner_username":null,
"owner_tenant_id":null,
"owner_tenant":null,
"object_type":"USER",
"operation":"login",
"result":"success",
"object_attributes":{
},
"details":{
}
},
{
"id":642,
"timestamp":"July 13, 2021 11:24:13",
"ip_addr":"10.9.121.40",
"exec_user_id":3,
"exec_username":"robin",
"exec_tenant_id":1,
"exec_tenant":"Administrators",
"owner_user_id":null,
"owner_username":null,
"owner_tenant_id":null,
"owner_tenant":null,
"object_type":"USER",
"operation":"login",
"result":"success",
"object_attributes":{
},
"details":{
}
},
{
"id":641,
"timestamp":"July 13, 2021 11:24:12",
"ip_addr":"172.20.0.1",
"exec_user_id":3,
"exec_username":"robin",
"exec_tenant_id":1,
"exec_tenant":"Administrators",
"owner_user_id":null,
"owner_username":null,
"owner_tenant_id":null,
"owner_tenant":null,
"object_type":"METRICS",
"operation":"enable",
"result":"success",
"object_attributes":{
},
"details":{
}
},
{
"id":640,
"timestamp":"July 13, 2021 11:24:10",
"ip_addr":"172.20.0.1",
"exec_user_id":3,
"exec_username":"robin",
"exec_tenant_id":1,
"exec_tenant":"Administrators",
"owner_user_id":null,
"owner_username":null,
"owner_tenant_id":null,
"owner_tenant":null,
"object_type":"CONFIG",
"operation":"update",
"result":"success",
"object_attributes":{
"section":"cluster",
"attribute":"ignored_phases"
},
"details":{
"msg":"The 'cluster' attribute 'ignored_phases' has been updated"
}
},
{
"id":639,
"timestamp":"July 13, 2021 11:24:06",
"ip_addr":"172.20.0.1",
"exec_user_id":3,
"exec_username":"robin",
"exec_tenant_id":1,
"exec_tenant":"Administrators",
"owner_user_id":null,
"owner_username":null,
"owner_tenant_id":null,
"owner_tenant":null,
"object_type":"NODE",
"operation":"probe",
"result":"success",
"object_attributes":{
"object_name":"systestvm-40.robinsystems.com",
"object_id":1,
"jobid":1539
},
"details":{
}
},
{
"id":638,
"timestamp":"July 13, 2021 11:24:04",
"ip_addr":"172.20.0.1",
"exec_user_id":3,
"exec_username":"robin",
"exec_tenant_id":1,
"exec_tenant":"Administrators",
"owner_user_id":null,
"owner_username":null,
"owner_tenant_id":null,
"owner_tenant":null,
"object_type":"NODE",
"operation":"probe",
"result":"success",
"object_attributes":{
"object_name":"systestvm-39.robinsystems.com",
"object_id":3,
"jobid":1538
},
"details":{
}
},
{
"id":637,
"timestamp":"July 13, 2021 11:24:04",
"ip_addr":"172.20.0.1",
"exec_user_id":3,
"exec_username":"robin",
"exec_tenant_id":1,
"exec_tenant":"Administrators",
"owner_user_id":null,
"owner_username":null,
"owner_tenant_id":null,
"owner_tenant":null,
"object_type":"NODE",
"operation":"probe",
"result":"success",
"object_attributes":{
"object_name":"systestvm-41.robinsystems.com",
"object_id":2,
"jobid":1537
},
"details":{
}
},
{
"id":636,
"timestamp":"July 13, 2021 11:23:58",
"ip_addr":"172.20.0.1",
"exec_user_id":3,
"exec_username":"robin",
"exec_tenant_id":1,
"exec_tenant":"Administrators",
"owner_user_id":null,
"owner_username":null,
"owner_tenant_id":null,
"owner_tenant":null,
"object_type":"USER",
"operation":"login",
"result":"success",
"object_attributes":{
},
"details":{
}
},
{
"id":635,
"timestamp":"July 13, 2021 11:23:57",
"ip_addr":"172.20.0.1",
"exec_user_id":3,
"exec_username":"robin",
"exec_tenant_id":1,
"exec_tenant":"Administrators",
"owner_user_id":null,
"owner_username":null,
"owner_tenant_id":null,
"owner_tenant":null,
"object_type":"USER",
"operation":"login",
"result":"success",
"object_attributes":{
},
"details":{
}
},
{
"id":634,
"timestamp":"July 13, 2021 11:23:49",
"ip_addr":"172.20.0.1",
"exec_user_id":3,
"exec_username":"robin",
"exec_tenant_id":1,
"exec_tenant":"Administrators",
"owner_user_id":null,
"owner_username":null,
"owner_tenant_id":null,
"owner_tenant":null,
"object_type":"FILE_COLLECTION",
"operation":"online",
"result":"success",
"object_attributes":{
"object_id":1603741429864,
"object_name":"file-collection-1603741429864",
"collection_pathname":"\/usr\/local\/robin\/collections\/file-collection-1603741429864",
"hostname":"systestvm-40.robinsystems.com",
"jobid":1533
},
"details":{
}
},
{
"id":633,
"timestamp":"July 13, 2021 11:23:44",
"ip_addr":"172.20.0.1",
"exec_user_id":3,
"exec_username":"robin",
"exec_tenant_id":1,
"exec_tenant":"Administrators",
"owner_user_id":null,
"owner_username":null,
"owner_tenant_id":null,
"owner_tenant":null,
"object_type":"USER",
"operation":"login",
"result":"success",
"object_attributes":{
},
"details":{
}
},
{
"id":632,
"timestamp":"July 13, 2021 11:20:07",
"ip_addr":"172.20.0.1",
"exec_user_id":3,
"exec_username":"robin",
"exec_tenant_id":1,
"exec_tenant":"Administrators",
"owner_user_id":null,
"owner_username":null,
"owner_tenant_id":null,
"owner_tenant":null,
"object_type":"NODE",
"operation":"config",
"result":"success",
"object_attributes":{
"object_id":2,
"object_name":"systestvm-41.robinsystems.com",
"jobid":1478
},
"details":{
}
},
{
"id":631,
"timestamp":"July 13, 2021 11:20:07",
"ip_addr":"172.20.0.1",
"exec_user_id":3,
"exec_username":"robin",
"exec_tenant_id":1,
"exec_tenant":"Administrators",
"owner_user_id":null,
"owner_username":null,
"owner_tenant_id":null,
"owner_tenant":null,
"object_type":"NODE",
"operation":"config",
"result":"success",
"object_attributes":{
"object_id":3,
"object_name":"systestvm-39.robinsystems.com",
"jobid":1479
},
"details":{
}
},
{
"id":630,
"timestamp":"July 13, 2021 11:20:07",
"ip_addr":"172.20.0.1",
"exec_user_id":3,
"exec_username":"robin",
"exec_tenant_id":1,
"exec_tenant":"Administrators",
"owner_user_id":null,
"owner_username":null,
"owner_tenant_id":null,
"owner_tenant":null,
"object_type":"NODE",
"operation":"config",
"result":"success",
"object_attributes":{
"object_id":1,
"object_name":"systestvm-40.robinsystems.com",
"jobid":1480
},
"details":{
}
},
{
"id":629,
"timestamp":"July 13, 2021 11:20:01",
"ip_addr":"172.20.0.1",
"exec_user_id":3,
"exec_username":"robin",
"exec_tenant_id":1,
"exec_tenant":"Administrators",
"owner_user_id":null,
"owner_username":null,
"owner_tenant_id":null,
"owner_tenant":null,
"object_type":"NODE",
"operation":"config",
"result":"success",
"object_attributes":{
"object_id":1,
"object_name":"systestvm-40.robinsystems.com",
"jobid":1468
},
"details":{
}
},
{
"id":628,
"timestamp":"July 13, 2021 11:20:01",
"ip_addr":"172.20.0.1",
"exec_user_id":3,
"exec_username":"robin",
"exec_tenant_id":1,
"exec_tenant":"Administrators",
"owner_user_id":null,
"owner_username":null,
"owner_tenant_id":null,
"owner_tenant":null,
"object_type":"NODE",
"operation":"config",
"result":"success",
"object_attributes":{
"object_id":2,
"object_name":"systestvm-41.robinsystems.com",
"jobid":1466
},
"details":{
}
},
{
"id":627,
"timestamp":"July 13, 2021 11:20:01",
"ip_addr":"172.20.0.1",
"exec_user_id":3,
"exec_username":"robin",
"exec_tenant_id":1,
"exec_tenant":"Administrators",
"owner_user_id":null,
"owner_username":null,
"owner_tenant_id":null,
"owner_tenant":null,
"object_type":"NODE",
"operation":"config",
"result":"success",
"object_attributes":{
"object_id":3,
"object_name":"systestvm-39.robinsystems.com",
"jobid":1467
},
"details":{
}
},
{
"id":626,
"timestamp":"July 13, 2021 11:19:59",
"ip_addr":"172.20.0.1",
"exec_user_id":3,
"exec_username":"robin",
"exec_tenant_id":1,
"exec_tenant":"Administrators",
"owner_user_id":null,
"owner_username":null,
"owner_tenant_id":null,
"owner_tenant":null,
"object_type":"USER",
"operation":"login",
"result":"success",
"object_attributes":{
},
"details":{
}
},
{
"id":625,
"timestamp":"July 13, 2021 11:19:01",
"ip_addr":"172.20.0.1",
"exec_user_id":3,
"exec_username":"robin",
"exec_tenant_id":1,
"exec_tenant":"Administrators",
"owner_user_id":null,
"owner_username":null,
"owner_tenant_id":null,
"owner_tenant":null,
"object_type":"USER",
"operation":"login",
"result":"success",
"object_attributes":{
},
"details":{
}
}
],
"state":"Succeed",
"message":"NA"
}
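The endpoint, port, header, and response shape described above can be combined into a small client. The following Python sketch is one possible way to do so and is not an official Robin client. The host name, the https scheme, and all function names are assumptions made for illustration; only the endpoint path, the default port, the Authorization header, the mutually exclusive parameters, and the "items" response field come from the description above.

```python
import json
import urllib.request
from collections import Counter
from urllib.parse import urlencode

AUDIT_ENDPOINT = "/api/v3/robin_server/user-audit"
# Parameter pairs documented as not usable together.
EXCLUSIVE_PAIRS = (("exec_user", "owner_user"), ("exec_tenant", "owner_tenant"))


def audit_url(host, port=29442, scheme="https", **filters):
    """Build the request URL; the scheme is an assumption, the port is the documented default."""
    for first, second in EXCLUSIVE_PAIRS:
        if first in filters and second in filters:
            raise ValueError(f"{first} cannot be combined with {second}")
    query = urlencode(sorted(filters.items()))
    base = f"{scheme}://{host}:{port}{AUDIT_ENDPOINT}"
    return f"{base}?{query}" if query else base


def fetch_audit_page(url, auth_token):
    """Send the GET request with the Authorization header and decode the JSON body."""
    request = urllib.request.Request(url, headers={"Authorization": auth_token})
    with urllib.request.urlopen(request) as response:
        return json.load(response)


def operations_by_object_type(response):
    """Count audit records per (object_type, operation) pair in one response page."""
    return Counter((item["object_type"], item["operation"]) for item in response["items"])


# Hypothetical host name; the filters mirror Example 2 of the CLI section.
print(audit_url("robin-master.example.com", object_type="APPLICATION", page_size=20))
```

Applied to the example response above, `operations_by_object_type` would report, among others, nine `("NODE", "config")` or `("NODE", "probe")` records and a number of `("USER", "login")` records. Clusters that use a self-signed certificate would additionally need an appropriate SSL context passed to `urlopen`.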
21.8.2. Retrieving audit logs from a file¶
Robin supports recording all audit records within an audit log file. The information stored within the file is equivalent to that saved in the Robin database, but the file provides what is essentially a hard copy of the audit trail. The audit log file is named robin-user-audit.log and is located in the /home/robinds/var/log/robin directory within the Robin daemonset container on the primary master node. By default, this feature is disabled.
Some points to consider with regards to the file based logging feature:
The log file will only be generated on the active master.
The log file is automatically updated by the Robin control plane processes whenever an event occurs.
The logs are automatically rotated to ensure that these logs do not consume the whole log partition.
21.8.2.1. Enable file based logging¶
By default, Robin does not log audit records to a file. In order to enable this feature, perform the following steps:
Run the following command to indicate the feature should be enabled:
# robin config update user_audit log_enable True
Run the following command to restart the robin-server service and thus allow the above changes to take effect:
# service robin-server restart
Example
# robin config update user_audit log_enable True
The 'user_audit' attribute 'log_enable' has been updated
# service robin-server restart
Redirecting to /bin/systemctl restart robin-server.service
After you enable the feature, all audit records are saved within the aforementioned file in real time and in a user-configurable format. As an administrator, you can view the audit logs using any text editor. Additionally, these logs can be captured with any log forwarding tool for further processing.
21.8.2.2. Disable file based logging¶
To disable the file based logging of audit records, perform the following steps:
Run the following command to indicate the feature should be disabled:
# robin config update user_audit log_enable False
Run the following command to restart the robin-server service and thus allow the above changes to take effect:
# service robin-server restart
Example
# robin config update user_audit log_enable False
The 'user_audit' attribute 'log_enable' has been updated
# service robin-server restart
Redirecting to /bin/systemctl restart robin-server.service
Disabling this feature will result in the robin-user-audit.log file no longer being updated with new audit records.
21.8.2.3. Configure file based logging attributes¶
Listed below are all the attributes a user can configure with regards to the file based logging feature.
Attribute | Default value | Valid value |
---|---|---|
enabled | True | |
log_enable | False | |
log_file_size | 10 | The maximum size in megabytes of the audit log file |
log_format | JSON | The output format of each audit record. The following are valid values: |
log_level | INFO | Indicates the level of audit records to be captured. The following are valid values: |
log_retention | 4 | The maximum number of audit log files to retain. Any additional log files are rolled over. |
In order to update any of the aforementioned configurable attributes, run the following command:
# robin config update user_audit <attribute> <valid value>
Example
# robin config update user_audit log_format TEXT
The 'user_audit' attribute 'log_format' has been updated
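When these updates are scripted, it can help to check the attribute name before invoking the CLI. The following is a minimal Python sketch and not part of Robin itself: the helper name build_update_command is hypothetical, and the set of attribute names is simply copied from the table above.

```python
from typing import List

# Attribute names copied from the table in this section; the validation
# itself is illustrative and is not something the robin CLI provides.
KNOWN_ATTRIBUTES = {
    "enabled",
    "log_enable",
    "log_file_size",
    "log_format",
    "log_level",
    "log_retention",
}


def build_update_command(attribute: str, value: object) -> List[str]:
    """Return the argv list for `robin config update user_audit <attribute> <value>`."""
    if attribute not in KNOWN_ATTRIBUTES:
        raise ValueError(f"unknown user_audit attribute: {attribute!r}")
    return ["robin", "config", "update", "user_audit", attribute, str(value)]


# Prints the same command line used in the example above.
print(" ".join(build_update_command("log_format", "TEXT")))
```

The returned list can be passed to subprocess.run on a node where the robin CLI is installed.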
21.8.2.4. View records captured in audit file¶
To view all audit records captured in the aforementioned file, run the following command:
# cat /var/log/robin/robin-user-audit.log
Example 1 (Viewing TEXT based audit records):
# cat /var/log/robin/robin-user-audit.log
1623 | 2021-08-12T15:26:06.581513+7:00 | 192.0.2.1 | robin | Administrators | -- | -- | USER | login | success | -- | --
1624 | 2021-08-12T15:26:12.655515+7:00 | 192.0.2.1 | robin | Administrators | -- | -- | USER | login | success | -- | --
1625 | 2021-08-12T15:26:12.783629+7:00 | 192.0.2.1 | robin | Administrators | -- | -- | USER | login | success | -- | --
1626 | 2021-08-12T15:26:13.118734+7:00 | 192.0.2.1 | robin | Administrators | -- | -- | USER | login | success | -- | --
1627 | 2021-08-12T15:26:18.584252+7:00 | 192.0.2.2 | robin | Administrators | -- | -- | USER | login | success | -- | --
1628 | 2021-08-12T15:26:21.752403+7:00 | 192.0.2.1 | robin | Administrators | -- | -- | USER | login | success | -- | --
1629 | 2021-08-12T15:26:28.934639+7:00 | 192.0.2.1 | robin | Administrators | -- | -- | USER | login | success | -- | --
1630 | 2021-08-12T15:26:36.089382+7:00 | 192.0.2.1 | robin | Administrators | -- | -- | USER | login | success | -- | --
1631 | 2021-08-12T15:26:43.233911+7:00 | 192.0.2.1 | robin | Administrators | -- | -- | USER | login | success | -- | --
1632 | 2021-08-12T15:26:50.370029+7:00 | 192.0.2.1 | robin | Administrators | -- | -- | USER | login | success | -- | --
1633 | 2021-08-12T15:26:57.528168+7:00 | 192.0.2.1 | robin | Administrators | -- | -- | USER | login | success | -- | --
1634 | 2021-08-12T15:27:04.749161+7:00 | 192.0.2.1 | robin | Administrators | -- | -- | USER | login | success | -- | --
1635 | 2021-08-12T15:27:11.934771+7:00 | 192.0.2.1 | robin | Administrators | -- | -- | USER | login | success | -- | --
1636 | 2021-08-12T15:27:19.127729+7:00 | 192.0.2.1 | robin | Administrators | -- | -- | USER | login | success | -- | --
1637 | 2021-08-12T15:27:26.291575+7:00 | 192.0.2.1 | robin | Administrators | -- | -- | USER | login | success | -- | --
1638 | 2021-08-12T15:27:33.702357+7:00 | 192.0.2.1 | robin | Administrators | -- | -- | USER | login | success | -- | --
1639 | 2021-08-12T15:27:41.017244+7:00 | 192.0.2.1 | robin | Administrators | -- | -- | USER | login | success | -- | --
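For further processing, each TEXT record can be split on the pipe character. The sketch below is written in Python under one loud assumption: the TEXT format does not label its columns, so the column names used here are inferred from the keys of the JSON records shown in this section and may not match Robin's own naming.

```python
from typing import Dict

# ASSUMPTION: column names inferred from the JSON record keys; the TEXT
# output itself carries no header row.
TEXT_COLUMNS = [
    "id", "timestamp", "ip_addr", "exec_username", "exec_tenant",
    "owner_username", "owner_tenant", "object_type", "operation",
    "result", "object_attributes", "details",
]


def parse_text_record(line: str) -> Dict[str, str]:
    """Split one pipe-delimited audit record into a dict keyed by TEXT_COLUMNS."""
    # maxsplit keeps any extra "|" characters inside the final field.
    fields = [field.strip() for field in line.split("|", len(TEXT_COLUMNS) - 1)]
    if len(fields) != len(TEXT_COLUMNS):
        raise ValueError(f"expected {len(TEXT_COLUMNS)} fields, got {len(fields)}")
    return dict(zip(TEXT_COLUMNS, fields))


sample = (
    "1623 | 2021-08-12T15:26:06.581513+7:00 | 192.0.2.1 | robin | "
    "Administrators | -- | -- | USER | login | success | -- | --"
)
record = parse_text_record(sample)
print(record["exec_username"], record["operation"], record["result"])
```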
Example 2 (Viewing JSON based audit records):
# cat /var/log/robin/robin-user-audit.log
{
"id": 197,
"timestamp": "2021-08-12T13:56:17.230515+7:00",
"ip_addr": "192.0.2.2",
"exec_user_id": 3,
"exec_username": "robin",
"exec_tenant_id": 1,
"exec_tenant": "Administrators",
"owner_user_id": null,
"owner_username": null,
"owner_tenant_id": null,
"owner_tenant": null,
"object_type": "NAMESPACE",
"operation": "create",
"result": "success",
"object_attributes": {
"object_name": "oc8687pk4i",
"username": "robin",
"tenant": "Administrators",
"import_namespace": false
},
"details": {}
}
{
"id": 198,
"timestamp": "2021-08-12T13:56:17.748933+7:00",
"ip_addr": "192.0.2.1",
"exec_user_id": 3,
"exec_username": "robin",
"exec_tenant_id": 1,
"exec_tenant": "Administrators",
"owner_user_id": null,
"owner_username": null,
"owner_tenant_id": null,
"owner_tenant": null,
"object_type": "USER",
"operation": "login",
"result": "success",
"object_attributes": {},
"details": {}
}
{
"id": 199,
"timestamp": "2021-08-12T13:56:33.766674+7:00",
"ip_addr": "192.0.2.2",
"exec_user_id": 3,
"exec_username": "robin",
"exec_tenant_id": 1,
"exec_tenant": "Administrators",
"owner_user_id": null,
"owner_username": null,
"owner_tenant_id": null,
"owner_tenant": null,
"object_type": "NAMESPACE",
"operation": "delete",
"result": "success",
"object_attributes": {},
"details": {}
}
{
"id": 200,
"timestamp": "2021-08-12T13:56:34.290960+7:00",
"ip_addr": "192.0.2.1",
"exec_user_id": 3,
"exec_username": "robin",
"exec_tenant_id": 1,
"exec_tenant": "Administrators",
"owner_user_id": null,
"owner_username": null,
"owner_tenant_id": null,
"owner_tenant": null,
"object_type": "USER",
"operation": "login",
"result": "success",
"object_attributes": {},
"details": {}
}
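The JSON records above are written back to back rather than as one enclosing array, so a plain json.loads call on the whole file would fail. As a hedged sketch, the standard library's JSONDecoder.raw_decode can walk such a stream one object at a time; the function name iter_json_records is hypothetical, and the sketch makes no assumption about whether each record occupies one line or several.

```python
import json
from typing import Any, Dict, Iterator


def iter_json_records(text: str) -> Iterator[Dict[str, Any]]:
    """Yield each JSON object from a string of concatenated JSON objects."""
    decoder = json.JSONDecoder()
    pos = 0
    while pos < len(text):
        # raw_decode does not skip leading whitespace, so do it here.
        while pos < len(text) and text[pos].isspace():
            pos += 1
        if pos >= len(text):
            break
        record, pos = decoder.raw_decode(text, pos)
        yield record


sample = '{"id": 197, "operation": "create"}\n{\n"id": 198,\n"operation": "login"\n}\n'
for rec in iter_json_records(sample):
    print(rec["id"], rec["operation"])
```

On a real node, the text argument would be the content of /var/log/robin/robin-user-audit.log; records can then be filtered on keys such as "object_type", "operation", or "result".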
To view the last audit record that was captured, run the following command:
# tail -n 1 /var/log/robin/robin-user-audit.log
Example 1 (Viewing last TEXT based audit record):
# tail -n 1 /var/log/robin/robin-user-audit.log
1645 | 2021-08-12T15:28:19.298469+7:00 | 192.0.2.1 | robin | Administrators | -- | -- | USER | login | success | -- | --
Example 2 (Viewing last JSON based audit record):
# tail -n 1 /var/log/robin/robin-user-audit.log
{
"id": 1646,
"timestamp": "2021-08-12T15:31:44.069446+7:00",
"ip_addr": "192.0.2.2",
"exec_user_id": 3,
"exec_username": "robin",
"exec_tenant_id": 1,
"exec_tenant": "Administrators",
"owner_user_id": null,
"owner_username": null,
"owner_tenant_id": null,
"owner_tenant": null,
"object_type": "CONFIG",
"operation": "update",
"result": "success",
"object_attributes": {
"section": "user_audit",
"attribute": "log_format"
},
"details": {
"msg": "The 'user_audit' attribute 'log_format' has been updated"
}
}
21.8.3. Kubernetes audit logs¶
Kubernetes audit logs are a chronological set of records of all requests (API calls) made to the Kubernetes API server. For more information about Kubernetes audit logs, see Kubernetes auditing.
Note
By default, the Kubernetes audit logs feature is enabled for Robin CNP clusters.
21.8.3.1. Points to consider for Kubernetes audit logs¶
The maximum size for storing Kubernetes audit logs in a cluster is 1 GB, which is non-configurable.
A log file can have a maximum size of 100 MB and a maximum of 10 log files can be stored.
21.8.3.2. View Kubernetes audit logs¶
Kubernetes audit logs help you troubleshoot issues in your cluster. You can find the Kubernetes audit logs at /var/log/Kubernetes/audit/audit.log on any master node of your cluster.
Robin CNP logs the following operations under a metadata-level audit policy:
Create request
Patch request
Update request
Delete request
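To get a quick summary of these requests, the audit file can be scanned line by line. The Python sketch below assumes the file uses the Kubernetes log backend's JSON format (one audit.k8s.io Event per line, with "verb" and "objectRef" fields); the function name count_mutating_requests is hypothetical.

```python
import json
from collections import Counter
from typing import Iterable

# The four request types listed above, as they appear in the "verb" field.
MUTATING_VERBS = {"create", "patch", "update", "delete"}


def count_mutating_requests(lines: Iterable[str]) -> Counter:
    """Count audit events per (verb, resource) for create/patch/update/delete."""
    counts: Counter = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            event = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip lines that are truncated or still being written
        verb = event.get("verb")
        if verb in MUTATING_VERBS:
            # objectRef can be absent, for example on non-resource requests.
            resource = (event.get("objectRef") or {}).get("resource", "unknown")
            counts[(verb, resource)] += 1
    return counts
```

A typical call would pass an open file handle for the audit log path given above, for example count_mutating_requests(open(path)), and then print counts.most_common().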
21.9. Sherlock¶
Sherlock is a troubleshooting and self-diagnostic command-line (CLI) tool in Robin. It is designed to help Robin administrators identify and analyze problems with Robin clusters. Using Sherlock, an administrator can diagnose cluster-wide problems, view a general cluster health report, or gather information regarding specific applications, nodes, containers, volumes, devices, and so on. It provides an in-depth view of these problems and the objects affected by querying a range of Robin APIs and making direct database calls. Moreover, the information gathered is mapped in both a top-down and bottom-up manner through the resource hierarchy in order to present important information on a wide range of objects in a consumable manner. Some examples of the highlighted resource connections are described below:
Applications are linked to the Pods that they are comprised of. Thus details on the health of the node providing the compute resources for the Pod(s) and the status of the attached Volumes are also presented.
Volumes are implicitly linked to the Node they are created on and as a result the status of the Node, the status and source of any replicas present (including the resync progress), and number of snapshots are also displayed.
Similarly, Disks are explicitly attached to hosts, so details of the Node are displayed with relevant information such as the overall disk capacity and the current utilization of the disk.
The status of critical Robin services is displayed in addition to the impacted objects including Applications, Volumes, and Disks.
Note
Given the breadth of information gathered and displayed by Sherlock, the tool is only accessible on the active master node and should only be used by administrators. In addition, it needs the RCM and Storage Manager services to be running.
21.9.1. Use Cases¶
Given the wealth of information that Sherlock displays, it can be used in practically any scenario. Whether it is used as the primary debugging tool for cluster-wide issues or simply to gain insight into usage statistics, relevant information can always be obtained with the tool. Highlighted below are two example use cases where Sherlock could be particularly useful.
Diagnosing application health issues
Given that Sherlock primarily aims to trace problems throughout the resource hierarchy, it allows for the detection of the level from which a problem stems. For example, an application that can no longer write data due to a disk failure within a cluster might report itself as unhealthy. Using Sherlock, the primary issue of the disk being in a bad state can be deduced because, whilst investigating an application, the volumes attached to its respective Pods are also displayed. As a result, the unhealthy volumes will be reported alongside the device from which they are allocated, and so the common point of failure, the failed disk, can be identified.
The above example highlights the usefulness of the explicit mappings showcased by the tool and how they can be used to efficiently detect objects which are malfunctioning.
Planning maintenance activities
Since Sherlock highlights the links between several abstract objects within a cluster, it can be used to determine the impact of an object being offline for a period of time. For example, this is particularly useful if a node needs to be cordoned off for maintenance, as Sherlock can show the impacted objects (Pods, volumes, applications, and users). As a result any of the parties that are affected by the maintenance activity can be informed ahead of time with little to no guesswork involved.
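The two use cases above correspond to walking the resource hierarchy in opposite directions. The following Python sketch illustrates the idea only; it is not how Sherlock is implemented, and the inventory, object names, and function names are all hypothetical.

```python
from typing import Dict, List, Set

# HYPOTHETICAL inventory: each object maps to the objects it directly relies on.
DEPENDS_ON: Dict[str, List[str]] = {
    "app:wordpress": ["pod:wp-01", "pod:db-01"],
    "pod:wp-01": ["vol:wp-data", "node:node-a"],
    "pod:db-01": ["vol:db-data", "node:node-b"],
    "vol:wp-data": ["dev:node-a:/dev/sdb"],
    "vol:db-data": ["dev:node-b:/dev/sdc"],
    "dev:node-a:/dev/sdb": ["node:node-a"],
    "dev:node-b:/dev/sdc": ["node:node-b"],
}


def dependencies(obj: str) -> Set[str]:
    """Top-down: every object that `obj` relies on, directly or indirectly."""
    seen: Set[str] = set()
    stack = [obj]
    while stack:
        for dep in DEPENDS_ON.get(stack.pop(), []):
            if dep not in seen:
                seen.add(dep)
                stack.append(dep)
    return seen


def impacted_by(obj: str) -> Set[str]:
    """Bottom-up: every object that would be affected if `obj` went offline."""
    return {parent for parent in DEPENDS_ON if obj in dependencies(parent)}


# Diagnosing: which devices sit underneath an unhealthy application?
print(sorted(d for d in dependencies("app:wordpress") if d.startswith("dev:")))
# Planning maintenance: what is affected if node-a is taken offline?
print(sorted(impacted_by("node:node-a")))
```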
21.9.2. Sherlock Report¶
The report generated by Sherlock, shown in the examples below, is meant to provide a quick overview of the state of the Robin cluster and by default only highlights unhealthy objects, as they are the greatest cause for concern. It is split into the following key sections:
Applications - This section of the report displays unhealthy applications alongside linked resources such as the affected Pods, volumes and devices on which the application data is saved.
Pods - This section of the report displays unhealthy pods alongside details of the attached volume(s) and any Kubernetes errors associated with them. This section is highlighted as it includes general Kubernetes pods and Helm based applications.
Volumes - This section of the report displays unhealthy volumes alongside details of the devices they are hosted on, logical mounts with potential IO stalls, NFS Exports, NFS Server pods, the respective nodes from which they are allocated, and statistics about any snapshots they may have.
Nodes - This section of the report displays unhealthy nodes alongside the status of the Robin and Kubernetes services running on each node and warnings for high resource usage, and indicates a lack of available space if appropriate.
Devices - This section of the report displays unhealthy devices alongside details of the node each is mounted on, the utilization of each disk and the volumes affected.
File Collections - This section of the report displays unhealthy file collections and highlights any errors that may have caused them to be in an unhealthy state.
Bundles - This section of the report displays unhealthy bundles and highlights any inherited errors from other objects in the hierarchy that may cause them to be unavailable.
Example 1 (Healthy cluster Report)
# sherlock
SHOWING APPLICATIONS THAT NEED ATTENTION:
All apps are healthy
SHOWING PODS THAT NEED ATTENTION:
All pods are healthy
SHOWING UNHEALTHY VOLUMES THAT NEED ATTENTION
All volumes are healthy
SHOWING UNHEALTHY NODES THAT NEED ATTENTION:
All nodes are healthy
SHOWING UNHEALTHY DEVICES THAT NEED ATTENTION:
All devices are healthy
SHOWING UNAVAILABLE FILE COLLECTIONS THAT NEED ATTENTION:
All file collection are available
SHOWING UNAVAILABLE BUNDLES THAT NEED ATTENTION:
All bundles are available
Only unhealthy objects are shown. To see everything re-run with -H|--healthy option
To see more details rerun with -V|--verbose option
sherlock produced results in 155 milliseconds (Sat Sep 18 06:14:59 PM 2021).
|-- 3 nodes, 26 disks, 3 vols, 3 snapshots, 0 apps, 0 pods, 1 file-collections,
2 bundles, 1 users and 1 tenants were analyzed
Example 2 (Report with unhealthy applications)
# sherlock
SHOWING APPLICATIONS THAT NEED ATTENTION:
|-- robinte STATE: PLANNED Robin Systems 2/2 pods unhealthy KIND: ROBIN
SHOWING USERS WHO ARE AFFECTED:
|-- Robin Systems (Firstname: Robin LastName: Systems Email: None)
| |-- APPS 1: robinte
SHOWING PODS THAT NEED ATTENTION:
o-- POD/VNODE ID 121: robinte.R1.01 INSTALLING/ONLINE 1 CPU, 50 MB MEM NODE: UP, RIO: UP
|-- POD/VNODE ID 122: robinte.R2.01 INSTALLING/ONLINE 1 CPU, 50 MB MEM NODE: UP, RIO: UP
SHOWING UNHEALTHY VOLUMES THAT NEED ATTENTION
All volumes are healthy
SHOWING UNHEALTHY NODES THAT NEED ATTENTION:
All nodes are healthy
SHOWING UNHEALTHY DEVICES THAT NEED ATTENTION:
All devices are healthy
SHOWING UNAVAILABLE FILE COLLECTIONS THAT NEED ATTENTION:
All file collection are available
SHOWING UNAVAILABLE BUNDLES THAT NEED ATTENTION:
All bundles are available
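For automated health checks, the plain-text report can be reduced to the sections that actually list problems. The Python sketch below relies purely on the wording visible in the two examples above (section headers starting with "SHOWING" and healthy sections containing a line starting with "All"); that wording is an assumption that may differ between Robin versions, and the function name is hypothetical.

```python
from typing import Dict, List, Optional


def sections_needing_attention(report: str) -> Dict[str, List[str]]:
    """Return report sections whose body does not contain an 'All ...' line."""
    sections: Dict[str, List[str]] = {}
    current: Optional[str] = None
    for raw in report.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line.startswith("SHOWING "):
            # Some headers end with a colon and some do not; normalize them.
            current = line.rstrip(":")
            sections[current] = []
        elif current is not None:
            sections[current].append(line)
    return {
        header: body
        for header, body in sections.items()
        if not any(entry.startswith("All ") for entry in body)
    }
```

Applied to Example 2, this would return the applications, affected users, and pods sections, while the volume, node, device, file collection, and bundle sections would be dropped.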
21.9.3. Command Line Options¶
Detailed below are the different options that can be utilized whilst using the Sherlock tool to attain the desired information.
Resource Inspection Options
These options require the names of resources (objects) to be specified in order to show detailed information about those objects. If multiple objects need to be viewed at the same time, a comma-separated list of names can be specified.
|
Displays information about the given application(s) |
|
Displays information about the given node(s) |
|
Displays information about the given Pod(s) |
|
Displays information about the given volume(s) |
|
Displays information about the given device(s) |
Note
For the --dev option, several values are supported in addition to a list of WWNs: ‘all’ displays information on all devices, ‘full’ displays details on devices that are nearly full, a list of nodenames shows the devices on the given nodes, and a combination of the nodename and devpath in the format <nodename>:<devpath> uniquely identifies a single device.
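The value forms accepted by --dev can be told apart with a few string checks. This Python sketch is illustrative only and does not reflect Sherlock's own parsing: the function name is hypothetical, and it assumes that WWNs are written with a leading 0x, as in the sample outputs in this section, and that nodenames never start with 0x.

```python
from typing import List


def classify_dev_selector(value: str) -> List[str]:
    """Label each comma-separated item of a --dev value by its apparent form."""
    labels = []
    for item in value.split(","):
        item = item.strip()
        if item in ("all", "full"):
            labels.append(item)
        elif ":" in item:
            labels.append("nodename:devpath")
        elif item.lower().startswith("0x"):
            labels.append("wwn")  # ASSUMPTION: WWNs carry a 0x prefix
        else:
            labels.append("nodename")
    return labels


print(classify_dev_selector("0x5000c5008c0fe71f,eqx01-flash16:/dev/sdi"))
```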
Advisory Rebalancing Options
The rebalancing command options can be used to discover disks that are over- or under-utilized. The given advice can be used to adjust the load management for a given device or volume.
|
Provides advice on device rebalancing |
|
Provides advice on volume rebalancing |
|
Displays information about devices that need rebalancing |
|
Displays information about volumes that need rebalancing |
Behavior Controlling Options
The following options allow the generated report to be adjusted to include details that might not be present by default.
|
Monitor the resource metrics for the given interval. Use this option alongside options such as |
|
Start scanning jobs starting at this date/time. The default time is 72 hours before the current time. This option is only valid when |
|
End scanning jobs at this date/time. The default time is the current time. This option is only valid when |
|
Run in server mode on the given port so Sherlock can be viewed from a web browser |
|
Mark resources that are not fully online as unhealthy. With this option, partially healthy resources (objects) are displayed as unhealthy |
|
Build and use a cache to speed up queries. Resources are cached once and the same cache is used for subsequent queries. Use this option when you run Sherlock repeatedly with different options |
|
Also show healthy resources. Displays healthy resources (objects) along with unhealthy objects |
|
Displays detailed report |
|
Print output in HTML format. Use this along with |
|
Redirect output to a file at the specified file path |
|
Don’t skip unimportant resources to minimize output |
|
Scan job logs for errors |
|
Run in server mode to serve metrics in Prometheus format |
21.9.4. Web Server Access¶
The Sherlock tool can be accessed via a web browser for a more interactive viewing experience. To use server mode, specify an available port number between 1 and 65535 alongside the --server option. An example is given below.
Example
# sherlock --server 45536
running the read_config now
Running in server mode. Point your web browser to the following address:
https://eqx01-flash15:45536
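Before choosing a port for server mode, it can be useful to confirm that nothing else is already bound to it. The following Python sketch uses only the standard socket module; the function name is_port_free is hypothetical, and the check is a best-effort probe, since another process could still claim the port between the check and starting Sherlock.

```python
import socket


def is_port_free(port: int, host: str = "0.0.0.0") -> bool:
    """Return True if a TCP socket can currently be bound to host:port."""
    if not 1 <= port <= 65535:
        raise ValueError("port must be between 1 and 65535")
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        try:
            sock.bind((host, port))
        except OSError:
            # Typically "address already in use" or a permissions error.
            return False
    return True
```

For instance, is_port_free(45536) could be called on the active master node before running the command shown in the example above.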
21.9.5. Examples¶
21.9.5.1. View health of all objects¶
To view a report containing the status of healthy objects alongside the unhealthy ones, use the --healthy parameter as shown in the examples below.
Example 1 (Display health of all objects)
# sherlock --healthy
No matching apps found
No matching pods found
SHOWING HEALTH OF 3/3 VOLUMES IN THE CLUSTER
|-- VOLID 1: file-collection-1631971248912.0798c2d5-332f-4c6f-96e6-8283a431851b, usage: 448 MB / 20 GB, 1 snapshots, resync progress: SYNCED, using 1 devices
|-- VOLID 132: pvc-94229d46-e381-4e3c-99a1-ddfe389d7839 , usage: 352 MB / 5 GB, 1 snapshots, resync progress: SYNCED, using 1 devices
|-- VOLID 131: pvc-a140c841-0e2a-4d91-be7c-c7c75b5756b1 , usage: 576 MB / 11 GB, 1 snapshots, resync progress: SYNCED, using 1 devices
All volumes are healthy
SHOWING HEALTH OF 3/3 NODES RUNNING IN THE CLUSTER
|-- eqx01-flash16 ONLINE 0 errors, 0 warnings NODE: UP, AGENT: UP, IOMGR: UP, K8S Service(s): READY
|-- eqx04-flash05 ONLINE 0 errors, 0 warnings NODE: UP, AGENT: UP, IOMGR: UP, K8S Service(s): READY
|-- eqx01-flash15 ONLINE 0 errors, 0 warnings NODE: UP, AGENT: UP, IOMGR: UP, K8S Service(s): READY
SHOWING HEALTH OF 26/26 DEVICES IN THE CLUSTER
|-- /dev/sdi@eqx01-flash16 | 1/100 vols | 11/119194 slices | 1.8 TB free of 1.8 TB, NODE: ONLINE, RDVM: UP, DEV: READY
| (WWN: 0x5000c5008c0fe71f PATH: /dev/disk/by-id/ata-ST2000NX0253_S46059W1)
|
|-- /dev/sde@eqx01-flash16 | 0/100 vols | 0/119194 slices | 1.8 TB free of 1.8 TB, NODE: ONLINE, RDVM: UP, DEV: READY
| (WWN: 0x5000c5008c0feea2 PATH: /dev/disk/by-id/ata-ST2000NX0253_S46059R8)
|
|-- /dev/sde@eqx01-flash15 | 1/100 vols | 5/119194 slices | 1.8 TB free of 1.8 TB, NODE: ONLINE, RDVM: UP, DEV: READY
| (WWN: 0x5000c5008c0db2c7 PATH: /dev/disk/by-id/ata-ST2000NX0253_S4605EN0)
|
|-- /dev/sdf@eqx01-flash15 | 0/100 vols | 0/119194 slices | 1.8 TB free of 1.8 TB, NODE: ONLINE, RDVM: UP, DEV: READY
| (WWN: 0x5000c5008c0db9be PATH: /dev/disk/by-id/ata-ST2000NX0253_S4605E90)
|
|-- /dev/sdi@eqx01-flash15 | 0/100 vols | 0/119194 slices | 1.8 TB free of 1.8 TB, NODE: ONLINE, RDVM: UP, DEV: READY
| (WWN: 0x5000c5008c0dbae3 PATH: /dev/disk/by-id/ata-ST2000NX0253_S4605DY2)
|
|-- /dev/sdg@eqx01-flash15 | 0/100 vols | 0/119194 slices | 1.8 TB free of 1.8 TB, NODE: ONLINE, RDVM: UP, DEV: READY
| (WWN: 0x5000c5008c0ddd62 PATH: /dev/disk/by-id/ata-ST2000NX0253_S4605D33)
|
|-- /dev/sdh@eqx01-flash15 | 0/100 vols | 0/119194 slices | 1.8 TB free of 1.8 TB, NODE: ONLINE, RDVM: UP, DEV: READY
| (WWN: 0x5000c5008c0df3ba PATH: /dev/disk/by-id/ata-ST2000NX0253_S4605CZ7)
|
|-- /dev/sdd@eqx01-flash15 | 0/100 vols | 0/119194 slices | 1.8 TB free of 1.8 TB, NODE: ONLINE, RDVM: UP, DEV: READY
| (WWN: 0x5000c5008c101de8 PATH: /dev/disk/by-id/ata-ST2000NX0253_S4605TKT)
|
|-- /dev/sdb@eqx01-flash15 | 1/100 vols | 20/57194 slices | 893.8 GB free of 894.3 GB, NODE: ONLINE, RDVM: UP, DEV: READY
| (WWN: 0x500a075109604998 PATH: /dev/disk/by-id/ata-Micron_M500_MTFDDAK960MAV_140109604998)
|
|-- /dev/sdb@eqx04-flash05 | 0/100 vols | 0/57194 slices | 894.3 GB free of 894.3 GB, NODE: ONLINE, RDVM: UP, DEV: READY
| (WWN: 0x500a07510ec79d1f PATH: /dev/disk/by-id/ata-Crucial_CT960M500SSD1_14330EC79D1F)
|
|-- /dev/sdh@eqx01-flash16 | 0/100 vols | 0/119194 slices | 1.8 TB free of 1.8 TB, NODE: ONLINE, RDVM: UP, DEV: READY
| (WWN: 0x5000c5008c0d9e30 PATH: /dev/disk/by-id/ata-ST2000NX0253_S4605EZW)
|
|-- /dev/sdf@eqx01-flash16 | 0/100 vols | 0/119194 slices | 1.8 TB free of 1.8 TB, NODE: ONLINE, RDVM: UP, DEV: READY
| (WWN: 0x5000c5008c0dc21f PATH: /dev/disk/by-id/ata-ST2000NX0253_S4605DSS)
|
|-- /dev/sdb@eqx01-flash16 | 0/100 vols | 0/119194 slices | 1.8 TB free of 1.8 TB, NODE: ONLINE, RDVM: UP, DEV: READY
| (WWN: 0x5000c5008c0dd039 PATH: /dev/disk/by-id/ata-ST2000NX0253_S4605DQ3)
|
|-- /dev/sdg@eqx01-flash16 | 0/100 vols | 0/119194 slices | 1.8 TB free of 1.8 TB, NODE: ONLINE, RDVM: UP, DEV: READY
| (WWN: 0x5000c5008c0dee42 PATH: /dev/disk/by-id/ata-ST2000NX0253_S4605DCS)
|
|-- /dev/sdd@eqx01-flash16 | 0/100 vols | 0/119194 slices | 1.8 TB free of 1.8 TB, NODE: ONLINE, RDVM: UP, DEV: READY
| (WWN: 0x5000c5008c0df26c PATH: /dev/disk/by-id/ata-ST2000NX0253_S4605DB2)
|
|-- /dev/sdc@eqx04-flash05 | 0/100 vols | 0/57194 slices | 894.3 GB free of 894.3 GB, NODE: ONLINE, RDVM: UP, DEV: READY
| (WWN: 0x500a07510ee9a052 PATH: /dev/disk/by-id/ata-Crucial_CT960M500SSD1_14280EE9A052)
|
SHOWING 1 FILE COLLECTIONS IN THE CLUSTER
|-- file-collection-1631971248912 Online 0 errors 0 warnings
SHOWING 1 BUNDLES IN THE CLUSTER
|-- wordpress ONLINE 0 errors 0 warnings
To see more details rerun with -V|--verbose option
sherlock produced results in 200 milliseconds (Sat Sep 18 11:51:21 PM 2021).
|-- 3 nodes, 26 disks, 3 vols, 3 snapshots, 0 apps, 0 pods, 1 file-collections,
1 bundles, 3 users and 3 tenants were analyzed
Example 2 (Verbose report for all objects)
# sherlock --healthy --verbose
No matching apps found
No matching pods found
SHOWING HEALTH OF 3/3 VOLUMES IN THE CLUSTER
|-- VOLID 1: file-collection-1631971248912.0798c2d5-332f-4c6f-96e6-8283a431851b, usage: 448 MB / 20 GB, 1 snapshots, resync progress: SYNCED, using 1 devices
| |-- DEVID 1: /dev/sdb on eqx01-flash15 using 448 MB/894.3 GB capacity, 14/20 slices, 14 segs, segspernap=1 RDVM: UP, DEV: READY
| | (WWN: 0x500a075109604998 PATH: /dev/disk/by-id/ata-Micron_M500_MTFDDAK960MAV_140109604998)
| |
| |-- SNAPSHOTS: 1 CREATED DEV OWN CLONES STATE SIZE
| | |-- SNAPID 1: 1969/12/31 16:00:00 14 14 0 READY 448 MB
| | |
|
|-- VOLID 132: pvc-94229d46-e381-4e3c-99a1-ddfe389d7839 , usage: 352 MB / 5 GB, 1 snapshots, resync progress: SYNCED, using 1 devices
| |-- DEVID 2: /dev/sde on eqx01-flash15 using 352 MB/1.8 TB capacity, 11/5 slices, 11 segs, segspernap=3 RDVM: UP, DEV: READY
| | (WWN: 0x5000c5008c0db2c7 PATH: /dev/disk/by-id/ata-ST2000NX0253_S4605EN0)
| |
| |-- SNAPSHOTS: 1 CREATED DEV OWN CLONES STATE SIZE
| | |-- SNAPID 1: 1969/12/31 16:00:00 11 11 0 READY 352 MB
| | |
|
|-- VOLID 131: pvc-a140c841-0e2a-4d91-be7c-c7c75b5756b1 , usage: 576 MB / 11 GB, 1 snapshots, resync progress: SYNCED, using 1 devices
| |-- DEVID 11: /dev/sdi on eqx01-flash16 using 576 MB/1.8 TB capacity, 18/11 slices, 18 segs, segspernap=2 RDVM: UP, DEV: READY
| | (WWN: 0x5000c5008c0fe71f PATH: /dev/disk/by-id/ata-ST2000NX0253_S46059W1)
| |
| |-- SNAPSHOTS: 1 CREATED DEV OWN CLONES STATE SIZE
| | |-- SNAPID 1: 1969/12/31 16:00:00 18 18 0 READY 576 MB
| | |
|
All volumes are healthy
|-- eqx01-flash16 ONLINE 0 errors, 0 warnings NODE: UP, AGENT: UP, IOMGR: UP, K8S Service(s): READY
=============================================================================================================================================================================================
0 PODS ARE RUNNING ON THIS NODE
9 DEVICES ARE ATTACHED TO THIS NODE
|-- DEVID 11: /dev/sdi READY 1.8 TB free=1.8 TB (100%) 1/100 vols, 11/119194 slices, 18 segs
| |-- VOLID 131: pvc-a140c841-0e2a-4d91-be7c-c7c75b5756b1 576 MB nslices=11 nsnaps=1 nsegs=18 nsegs_per_snap=2
|
|-- DEVID 12: /dev/sde READY 1.8 TB free=1.8 TB (100%) 0/100 vols, 0/119194 slices, 0 segs
|-- DEVID 0: /dev/sdc INIT 14.9 GB free=14.9 GB (100%) 0/100 vols, 0/0 slices, 0 segs
|-- DEVID 9: /dev/sdh READY 1.8 TB free=1.8 TB (100%) 0/100 vols, 0/119194 slices, 0 segs
|-- DEVID 8: /dev/sdf READY 1.8 TB free=1.8 TB (100%) 0/100 vols, 0/119194 slices, 0 segs
|-- DEVID 10: /dev/sdb READY 1.8 TB free=1.8 TB (100%) 0/100 vols, 0/119194 slices, 0 segs
|-- DEVID 14: /dev/sdg READY 1.8 TB free=1.8 TB (100%) 0/100 vols, 0/119194 slices, 0 segs
|-- DEVID 13: /dev/sdd READY 1.8 TB free=1.8 TB (100%) 0/100 vols, 0/119194 slices, 0 segs
|-- DEVID 0: /dev/sda INIT 1.8 TB free=1.8 TB (100%) 0/100 vols, 0/0 slices, 0 segs
|-- eqx04-flash05 ONLINE 0 errors, 0 warnings NODE: UP, AGENT: UP, IOMGR: UP, K8S Service(s): READY
=============================================================================================================================================================================================
0 PODS ARE RUNNING ON THIS NODE
8 DEVICES ARE ATTACHED TO THIS NODE
|-- DEVID 16: /dev/sdb READY 894.3 GB free=894.3 GB (100%) 0/100 vols, 0/57194 slices, 0 segs
|-- DEVID 0: /dev/sda INIT 894.3 GB free=894.3 GB (100%) 0/100 vols, 0/0 slices, 0 segs
|-- DEVID 0: /dev/sdd INIT 894.3 GB free=894.3 GB (100%) 0/100 vols, 0/0 slices, 0 segs
|-- DEVID 15: /dev/sdc READY 894.3 GB free=894.3 GB (100%) 0/100 vols, 0/57194 slices, 0 segs
|-- DEVID 0: /dev/sde INIT 59.6 GB free=59.6 GB (100%) 0/100 vols, 0/0 slices, 0 segs
|-- DEVID 0: /dev/dm-1 INIT 17.4 GB free=17.4 GB (100%) 0/100 vols, 0/0 slices, 0 segs
|-- DEVID 0: /dev/dm-2 INIT 35.7 GB free=35.7 GB (100%) 0/100 vols, 0/0 slices, 0 segs
|-- DEVID 0: /dev/dm-0 INIT 6.0 GB free=6.0 GB (100%) 0/100 vols, 0/0 slices, 0 segs
|-- eqx01-flash15 ONLINE 0 errors, 0 warnings NODE: UP, AGENT: UP, IOMGR: UP, K8S Service(s): READY
=============================================================================================================================================================================================
0 PODS ARE RUNNING ON THIS NODE
9 DEVICES ARE ATTACHED TO THIS NODE
|-- DEVID 0: /dev/sda INIT 894.3 GB free=894.3 GB (100%) 0/100 vols, 0/0 slices, 0 segs
|-- DEVID 0: /dev/sdc INIT 14.9 GB free=14.9 GB (100%) 0/100 vols, 0/0 slices, 0 segs
|-- DEVID 2: /dev/sde READY 1.8 TB free=1.8 TB (100%) 1/100 vols, 5/119194 slices, 11 segs
| |-- VOLID 132: pvc-94229d46-e381-4e3c-99a1-ddfe389d7839 352 MB nslices=5 nsnaps=1 nsegs=11 nsegs_per_snap=3
|
|-- DEVID 3: /dev/sdf READY 1.8 TB free=1.8 TB (100%) 0/100 vols, 0/119194 slices, 0 segs
|-- DEVID 7: /dev/sdi READY 1.8 TB free=1.8 TB (100%) 0/100 vols, 0/119194 slices, 0 segs
|-- DEVID 5: /dev/sdg READY 1.8 TB free=1.8 TB (100%) 0/100 vols, 0/119194 slices, 0 segs
|-- DEVID 6: /dev/sdh READY 1.8 TB free=1.8 TB (100%) 0/100 vols, 0/119194 slices, 0 segs
|-- DEVID 4: /dev/sdd READY 1.8 TB free=1.8 TB (100%) 0/100 vols, 0/119194 slices, 0 segs
|-- DEVID 1: /dev/sdb READY 894.3 GB free=893.8 GB (100%) 1/100 vols, 20/57194 slices, 14 segs
| |-- VOLID 1: file-collection-1631971248912.0798c2d5-332f-4c6f-96e6-8283a431851b 448 MB nslices=20 nsnaps=1 nsegs=14 nsegs_per_snap=1
|
DEVICE /dev/sdi on eqx01-flash16 1/100 vols | 11/119194 slices | 1.8 TB free of 1.8 TB NODE: ONLINE, RDVM: UP, DEV: READY
=============================================================================================================================================================================================
|==> WWN: 0x5000c5008c0fe71f | PATH: /dev/disk/by-id/ata-ST2000NX0253_S46059W1
|-- VOL: 131 pvc-a140c841-0e2a-4d91-be7c-c7c75b5756b1 576 MB nslices=11 nsegs=18 (2 ) nsnaps=1
DEVICE /dev/sde on eqx01-flash16 0/100 vols | 0/119194 slices | 1.8 TB free of 1.8 TB NODE: ONLINE, RDVM: UP, DEV: READY
=============================================================================================================================================================================================
|==> WWN: 0x5000c5008c0feea2 | PATH: /dev/disk/by-id/ata-ST2000NX0253_S46059R8
DEVICE /dev/sde on eqx01-flash15 1/100 vols | 5/119194 slices | 1.8 TB free of 1.8 TB NODE: ONLINE, RDVM: UP, DEV: READY
=============================================================================================================================================================================================
|==> WWN: 0x5000c5008c0db2c7 | PATH: /dev/disk/by-id/ata-ST2000NX0253_S4605EN0
|-- VOL: 132 pvc-94229d46-e381-4e3c-99a1-ddfe389d7839 352 MB nslices=5 nsegs=11 (3 ) nsnaps=1
DEVICE /dev/sdf on eqx01-flash15 0/100 vols | 0/119194 slices | 1.8 TB free of 1.8 TB NODE: ONLINE, RDVM: UP, DEV: READY
=============================================================================================================================================================================================
|==> WWN: 0x5000c5008c0db9be | PATH: /dev/disk/by-id/ata-ST2000NX0253_S4605E90
DEVICE /dev/sdi on eqx01-flash15 0/100 vols | 0/119194 slices | 1.8 TB free of 1.8 TB NODE: ONLINE, RDVM: UP, DEV: READY
=============================================================================================================================================================================================
|==> WWN: 0x5000c5008c0dbae3 | PATH: /dev/disk/by-id/ata-ST2000NX0253_S4605DY2
DEVICE /dev/sdg on eqx01-flash15 0/100 vols | 0/119194 slices | 1.8 TB free of 1.8 TB NODE: ONLINE, RDVM: UP, DEV: READY
=============================================================================================================================================================================================
|==> WWN: 0x5000c5008c0ddd62 | PATH: /dev/disk/by-id/ata-ST2000NX0253_S4605D33
DEVICE /dev/sdh on eqx01-flash15 0/100 vols | 0/119194 slices | 1.8 TB free of 1.8 TB NODE: ONLINE, RDVM: UP, DEV: READY
=============================================================================================================================================================================================
|==> WWN: 0x5000c5008c0df3ba | PATH: /dev/disk/by-id/ata-ST2000NX0253_S4605CZ7
DEVICE /dev/sdd on eqx01-flash15 0/100 vols | 0/119194 slices | 1.8 TB free of 1.8 TB NODE: ONLINE, RDVM: UP, DEV: READY
=============================================================================================================================================================================================
|==> WWN: 0x5000c5008c101de8 | PATH: /dev/disk/by-id/ata-ST2000NX0253_S4605TKT
DEVICE /dev/sdb on eqx01-flash15 1/100 vols | 20/57194 slices | 893.8 GB free of 894.3 GB NODE: ONLINE, RDVM: UP, DEV: READY
=============================================================================================================================================================================================
|==> WWN: 0x500a075109604998 | PATH: /dev/disk/by-id/ata-Micron_M500_MTFDDAK960MAV_140109604998
|-- VOL: 1 file-collection-1631971248912.0798c2d5-332f-4c6f-96e6-8283a431851b 448 MB nslices=20 nsegs=14 (1 ) nsnaps=1
DEVICE /dev/sdb on eqx04-flash05 0/100 vols | 0/57194 slices | 894.3 GB free of 894.3 GB NODE: ONLINE, RDVM: UP, DEV: READY
=============================================================================================================================================================================================
|==> WWN: 0x500a07510ec79d1f | PATH: /dev/disk/by-id/ata-Crucial_CT960M500SSD1_14330EC79D1F
DEVICE /dev/sdh on eqx01-flash16 0/100 vols | 0/119194 slices | 1.8 TB free of 1.8 TB NODE: ONLINE, RDVM: UP, DEV: READY
=============================================================================================================================================================================================
|==> WWN: 0x5000c5008c0d9e30 | PATH: /dev/disk/by-id/ata-ST2000NX0253_S4605EZW
DEVICE /dev/sdf on eqx01-flash16 0/100 vols | 0/119194 slices | 1.8 TB free of 1.8 TB NODE: ONLINE, RDVM: UP, DEV: READY
=============================================================================================================================================================================================
|==> WWN: 0x5000c5008c0dc21f | PATH: /dev/disk/by-id/ata-ST2000NX0253_S4605DSS
DEVICE /dev/sdb on eqx01-flash16 0/100 vols | 0/119194 slices | 1.8 TB free of 1.8 TB NODE: ONLINE, RDVM: UP, DEV: READY
=============================================================================================================================================================================================
|==> WWN: 0x5000c5008c0dd039 | PATH: /dev/disk/by-id/ata-ST2000NX0253_S4605DQ3
DEVICE /dev/sdg on eqx01-flash16 0/100 vols | 0/119194 slices | 1.8 TB free of 1.8 TB NODE: ONLINE, RDVM: UP, DEV: READY
=============================================================================================================================================================================================
|==> WWN: 0x5000c5008c0dee42 | PATH: /dev/disk/by-id/ata-ST2000NX0253_S4605DCS
DEVICE /dev/sdd on eqx01-flash16 0/100 vols | 0/119194 slices | 1.8 TB free of 1.8 TB NODE: ONLINE, RDVM: UP, DEV: READY
=============================================================================================================================================================================================
|==> WWN: 0x5000c5008c0df26c | PATH: /dev/disk/by-id/ata-ST2000NX0253_S4605DB2
DEVICE /dev/sdc on eqx04-flash05 0/100 vols | 0/57194 slices | 894.3 GB free of 894.3 GB NODE: ONLINE, RDVM: UP, DEV: READY
=============================================================================================================================================================================================
|==> WWN: 0x500a07510ee9a052 | PATH: /dev/disk/by-id/ata-Crucial_CT960M500SSD1_14280EE9A052
SHOWING 1 FILE COLLECTIONS IN THE CLUSTER
SHOWING 1 BUNDLES IN THE CLUSTER
sherlock produced results in 141 milliseconds (Sat Sep 18 11:45:50 PM 2021).
|-- 3 nodes, 26 disks, 3 vols, 3 snapshots, 0 apps, 0 pods, 1 file-collections,
1 bundles, 3 users and 3 tenants were analyzed
21.9.5.2. Check Application Health¶
In order to view a report containing the status of an application use the --app
parameter along with an application name as shown in the example below. The following details about an application will be displayed:
Volumes or devices on which the application data is stored
Pods associated with the application
Node(s) on which the application Pods reside
Failed jobs related to the specified application
Example
# sherlock --app mysql-test -V
APPNAME: mysql-test STATE: ONLINE Robin Systems 1/1 vnodes healthy
==============================================================================================================================================================================================================================================
APP HAS 1 VNODES:
VNODEID 2: mysql-test.mysql.01 on centos-60-181 INST: ONLINE/INST: STARTED, NODE: ONLINE, RIO: UP
|-- VOLID 4: mysql-test.mysql.01.data.1.3daab239-4327-4f00-873d-ffda3c9575f2 1 GB
| |-- DEVID : /dev/sdd segs=centos-60-181 slices=14 rawspace=1 448 MB
|-- VOLID 5: mysql-test.mysql.01.root_fs.1.58523556-a318-483c-9f8c-d2cd98ad6a32 10 GB
| |-- DEVID : /dev/sdd segs=centos-60-181 slices=10 rawspace=10 320 MB
|
APP IS RUNNING ON THE FOLLOWING 1 NODES:
|-- centos-60-181 RIO: UP
| |-- mysql-test.mysql.01 ONLINE/STARTED
|
APP IS STORING DATA ON THE FOLLOWING 1 DEVICES:
|-- DEVID 6: /dev/sdd on centos-60-181 2 vols
| |-- VOLID 4: mysql-test.mysql.01.data.1.3daab239-4327-4f00-873d-ffda3c9575f2 448 MB nslices=1 nsegs=14 nsnaps=3 segspersnap=5
| |-- VOLID 5: mysql-test.mysql.01.root_fs.1.58523556-a318-483c-9f8c-d2cd98ad6a32 320 MB nslices=10 nsegs=10 nsnaps=3 segspersnap=1
|
THERE ARE 23 FAILED JOBS TO INSPECT BETWEEN Fri May 3 01:22:23 AM 2019 - Fri May 10 01:22:23 AM 2019
|-- mysql-test.mysql.01
| |-- VnodeDelete jobid=98 state=10 error=1 start=Thu May 9 00:23:31 2019 end=Thu May 9 00:23:32 2019
| | predestroy vnodehook cmd bash /var/lib/robin/.file_object_cache/64f1ef8529796f8199a63eaf2e65365f/scripts/vnode_sample <REDACTED ARGS> failed to execute
| |-- VnodeDelete jobid=88 state=10 error=1 start=Thu May 9 00:20:37 2019 end=Thu May 9 00:20:43 2019
| | postdestroy vnodehook cmd bash /var/lib/robin/.file_object_cache/64f1ef8529796f8199a63eaf2e65365f/scripts/vnode_sample <REDACTED ARGS> failed to execute
|
|-- mysql-test
| |-- ApplicationDelete jobid=97 state=10 error=1 start=Thu May 9 00:23:31 2019 end=Thu May 9 00:23:32 2019
| | Job failed. One or more child jobs reported errors. Error: 'predestroy vnodehook cmd bash /var/lib/robin/.file_object_cache/64f1ef8529796f8199a63eaf2e65365f/scripts/vnode_sample <REDACTED ARGS> failed to
| | execute'
| |-- ApplicationDelete jobid=87 state=10 error=1 start=Thu May 9 00:20:37 2019 end=Thu May 9 00:20:43 2019
| | Job failed. One or more child jobs reported errors. Error: 'postdestroy vnodehook cmd bash /var/lib/robin/.file_object_cache/64f1ef8529796f8199a63eaf2e65365f/scripts/vnode_sample <REDACTED ARGS> failed to
| | execute'
| |-- ApplicationDelete jobid=92 state=10 error=1 start=Thu May 9 00:22:17 2019 end=Thu May 9 00:22:18 2019
| | Job failed. One or more child jobs reported errors. Error: 'predestroy vnodehook cmd bash /var/lib/robin/.file_object_cache/64f1ef8529796f8199a63eaf2e65365f/scripts/vnode_sample <REDACTED ARGS> failed to
| | execute'
| |-- ApplicationDelete jobid=95 state=10 error=1 start=Thu May 9 00:22:59 2019 end=Thu May 9 00:23:00 2019
| | Job failed. One or more child jobs reported errors. Error: 'predestroy vnodehook cmd bash /var/lib/robin/.file_object_cache/64f1ef8529796f8199a63eaf2e65365f/scripts/vnode_sample <REDACTED ARGS> failed to
| | execute'
|
|-- mysql-test1
| |-- ApplicationCreate jobid=129 state=10 error=1 start=Thu May 9 03:50:54 2019 end=Thu May 9 03:50:54 2019
| | Invalid Zone Id and/or Bundle Id: 1/2
| |-- ApplicationCreate jobid=128 state=10 error=1 start=Thu May 9 03:50:08 2019 end=Thu May 9 03:50:08 2019
| | Invalid Zone Id and/or Bundle Id: 1/2
|
sherlock produced results in 90 milliseconds (Fri May 10 01:22:23 AM 2019).
|-- 3 nodes, 12 disks, 3 vols, 7 snapshots, 1 apps, 1 vnodes, 2 users and 1 tenants were analyzed
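When an application report lists many failed jobs, it can help to pull out just the job types and job IDs before looking up the corresponding job logs. The following is a minimal sketch, assuming the report text has been captured to a string and that job lines look like the ones in the sample output above. The function name `failed_jobs` and the regular expression are illustrative and are not part of sherlock.

```python
import re

# Matches job lines as they appear in the sample report, for example:
# "| |-- VnodeDelete jobid=98 state=10 error=1 start=... end=..."
# Assumption: this layout is inferred from the sample output only.
JOB_LINE = re.compile(
    r"\|--\s+(?P<type>\w+)\s+jobid=(?P<jobid>\d+)\s+state=(?P<state>\d+)\s+error=(?P<error>\d+)"
)


def failed_jobs(report_text):
    """Return (job_type, job_id) tuples for every job line found in the report."""
    jobs = []
    for line in report_text.splitlines():
        match = JOB_LINE.search(line)
        if match:
            jobs.append((match.group("type"), int(match.group("jobid"))))
    return jobs


SAMPLE = """\
|-- mysql-test.mysql.01
| |-- VnodeDelete jobid=98 state=10 error=1 start=Thu May 9 00:23:31 2019 end=Thu May 9 00:23:32 2019
|-- mysql-test1
| |-- ApplicationCreate jobid=129 state=10 error=1 start=Thu May 9 03:50:54 2019 end=Thu May 9 03:50:54 2019
| | Invalid Zone Id and/or Bundle Id: 1/2
"""

for job_type, job_id in failed_jobs(SAMPLE):
    print(job_type, job_id)
```

The extracted job IDs can then be used to locate the matching job logs.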
21.9.5.3. Check Node Health¶
In order to view a report containing the status of a node use the --node
parameter along with the primary hostname of the node as shown in the example below. Information on the objects associated with the node, such as applications, pods, volumes, devices, file collections and bundles, will be displayed alongside details of the node’s health with regard to the services running on it.
Example
# sherlock --node eqx01-flash16 -V
|-- eqx01-flash16 ONLINE 0 errors, 0 warnings NODE: UP, AGENT: UP, IOMGR: UP, K8S Service(s): READY
=============================================================================================================================================================================================
0 PODS ARE RUNNING ON THIS NODE
9 DEVICES ARE ATTACHED TO THIS NODE
|-- DEVID 0: /dev/sdc INIT 14.9 GB free=14.9 GB (100%) 0/100 vols, 0/0 slices, 0 segs
|-- DEVID 9: /dev/sdh READY 1.8 TB free=1.8 TB (100%) 1/100 vols, 11/119194 slices, 18 segs
| |-- VOLID 162: pvc-eb63979d-720e-41c9-808f-145306dc1259 576 MB nslices=11 nsnaps=1 nsegs=18 nsegs_per_snap=2
|
|-- DEVID 13: /dev/sdf READY 1.8 TB free=1.8 TB (100%) 1/100 vols, 5/119194 slices, 11 segs
| |-- VOLID 163: pvc-66646581-0210-46e2-b945-9ea880be38d7 352 MB nslices=5 nsnaps=1 nsegs=11 nsegs_per_snap=3
|
|-- DEVID 12: /dev/sdb READY 1.8 TB free=1.8 TB (100%) 0/100 vols, 0/119194 slices, 0 segs
|-- DEVID 14: /dev/sdg READY 1.8 TB free=1.8 TB (100%) 0/100 vols, 0/119194 slices, 0 segs
|-- DEVID 8: /dev/sdd READY 1.8 TB free=1.8 TB (100%) 0/100 vols, 0/119194 slices, 0 segs
|-- DEVID 0: /dev/sda INIT 1.8 TB free=1.8 TB (100%) 0/100 vols, 0/0 slices, 0 segs
|-- DEVID 10: /dev/sdi READY 1.8 TB free=1.8 TB (100%) 0/100 vols, 0/119194 slices, 0 segs
|-- DEVID 11: /dev/sde READY 1.8 TB free=1.8 TB (100%) 0/100 vols, 0/119194 slices, 0 segs
THERE ARE 1 FAILED JOBS TO INSPECT BETWEEN Sun Sep 12 05:58:15 PM 2021 - Sun Sep 19 05:58:15 PM 2021
|-- eqx01-flash16.robinsystems.com
| |-- HostAddResourcePool jobid=30 state=10 error=1 start=Sun Sep 19 03:23:54 2021 end=Wed Dec 31 16:00:00 1969
| | Host 'eqx01-flash16.robinsystems.com' already has a resource pool 'default'
|
sherlock produced results in 166 milliseconds (Sun Sep 19 05:58:15 PM 2021).
|-- 3 nodes, 26 disks, 3 vols, 3 snapshots, 0 apps, 0 pods, 1 file-collections,
0 bundles, 5 users and 5 tenants were analyzed
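The device lines in a node report carry a state column (`INIT` and `READY` in the sample above). As a rough sketch, assuming the report text is available as a string and that device lines follow the layout shown in the sample, the devices can be grouped by that state. The helper name `devices_by_state` is hypothetical, and the pattern is inferred from the sample output rather than from a documented format.

```python
import re
from collections import defaultdict

# Matches node-report device lines such as:
# "|-- DEVID 9: /dev/sdh READY 1.8 TB free=1.8 TB (100%) 1/100 vols, ..."
# Assumption: the state is a single upper-case word right after the device path.
DEV_LINE = re.compile(
    r"\|--\s+DEVID\s+(?P<devid>\d+):\s+(?P<path>\S+)\s+(?P<state>[A-Z]+)\s"
)


def devices_by_state(report_text):
    """Group device paths by the state printed next to them."""
    groups = defaultdict(list)
    for line in report_text.splitlines():
        match = DEV_LINE.search(line)
        if match:
            groups[match.group("state")].append(match.group("path"))
    return dict(groups)


SAMPLE = """\
|-- DEVID 0: /dev/sdc INIT 14.9 GB free=14.9 GB (100%) 0/100 vols, 0/0 slices, 0 segs
|-- DEVID 9: /dev/sdh READY 1.8 TB free=1.8 TB (100%) 1/100 vols, 11/119194 slices, 18 segs
| |-- VOLID 162: pvc-eb63979d-720e-41c9-808f-145306dc1259 576 MB nslices=11 nsnaps=1 nsegs=18 nsegs_per_snap=2
"""

print(devices_by_state(SAMPLE))
```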
21.9.5.4. Check Pod Health¶
In order to view a report containing the status of a Pod use the --pod
parameter along with the Pod name as shown in the example below. Information on the objects associated with the Pod such as the application it is a part of, the volumes attached to it and the respective devices will also be displayed.
Example
# sherlock --pod centos1.server.01 -V
SHOWING HEALTH OF 1 PODS IN THE CLUSTER:
o-- POD/VNODE ID 187: centos1.server.01 STARTED/ONLINE 1 CPU, 200 MB MEM NODE: UP, RIO: UP
| |-- VOLID 238: centos1.server.01.data.1.3a588402-0288-4921-a611-8c8b27e94313 64 MB/1 GB nsnaps=1
| | |-- DEVID 15: /dev/sdb on eqx04-flash05 nsegs=2 nslices=1 64 MB
| |-- VOLID 237: centos1.server.01.block.1.0dd5e060-0e28-499c-a3f8-198e33b10851 0/1 GB nsnaps=1
| | |-- DEVID 16: /dev/sdc on eqx04-flash05 nsegs=0 nslices=1 0
|
sherlock produced results in 169 milliseconds (Sun Sep 19 10:42:37 PM 2021).
|-- 3 nodes, 26 disks, 9 vols, 9 snapshots, 3 apps, 3 pods, 1 file-collections,
1 bundles, 15 users and 16 tenants were analyzed
21.9.5.5. Check Volume Health¶
In order to view a report containing the status of a volume use the --vol
parameter along with the name of the volume as shown in the example below. Information on the objects associated with the volume, such as the device on which it is mounted, potential IO stalls on mounts, NFS Exports, NFS Server pods, and its existing snapshots, will also be displayed alongside details of the volume’s usage.
Note
It is recommended that this report be reviewed often, especially for volumes that are used frequently.
Example
# sherlock --vol pvc-c4f1bc87-c9f5-433a-bb60-f9ee46e7a9e1 -V
|-- VOLID 50: pvc-c4f1bc87-c9f5-433a-bb60-f9ee46e7a9e1, used by pvc(s): personal/nfs-exclusive-repl-2, usage: 512 MB / 4 GB, 2 snapshots, resync progress: SYNCED, using 2 devices
| |-- Potential IO Stalls:
| | |-- 37 pending IOs on hypervvm-62-45.robinsystems.com:/dev/sde
| |
| |-- NFS_EXPORTS:
| | |-- NFS_EXPORTID 7: EXPORTS: READY
| | (CLIENTS:["hypervvm-62-46.robinsystems.com","hypervvm-62-45.robinsystems.com"])
| |
| |-- NFS_SERVER_POD:
| | |-- PODID 10: robin-nfs-excl-v50-10 , HOSTNAME: hypervvm-62-47, STATUS: ONLINE
| |
| |-- DEVID 1: /dev/sdb on hypervvm-62-45 using 256 MB/100 GB capacity, 8/76 slices, 8 segs, segspernap=1, RDVM: UP, DEV: READY
| | (WWN: 0x600224801148c13acf11110ea26830ff PATH: /dev/disk/by-id/scsi-3600224801148c13acf11110ea26830ff)
| |
| |-- SNAPSHOTS: 2 CREATED DEV OWN CLONES STATE SIZE
| | |-- SNAPID 1: 2022/08/11 08:54:51 12 12 0 READY 384 MB
| | |-- SNAPID 2: 1969/12/31 16:00:00 4 4 0 READY 128 MB
| | |
|
| |-- DEVID 4: /dev/sdc on hypervvm-62-47 using 256 MB/100 GB capacity, 8/34 slices, 8 segs, segspernap=1, RDVM: UP, DEV: READY
| | (WWN: 0x60022480d00e683e8687aae16482dcd0 PATH: /dev/disk/by-id/scsi-360022480d00e683e8687aae16482dcd0)
| |
| |-- SNAPSHOTS: 2 CREATED DEV OWN CLONES STATE SIZE
| | |-- SNAPID 1: 2022/08/11 08:54:51 12 12 0 READY 384 MB
| | |-- SNAPID 2: 1969/12/31 16:00:00 4 4 0 READY 128 MB
| | |
All volumes are healthy
sherlock produced results in 243 milliseconds (Tue Sep 27 11:03:57 PM 2022).
|-- 1 nodes, 6 disks, 5 vols, 5 snapshots, 2 apps, 0 protection groups, 2 pods, 1 file-collections,
2 bundles, 3 users and 2 tenants were analyzed
Note
In certain cases a volume might be marked as needing attention when its state is DEGRADED
. However, this simply indicates that a replica of the volume is offline. It does not indicate that the volume is unhealthy or faulted, as the volume is still capable of serving I/Os.
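The `Potential IO Stalls` entries in the sample volume report above name the host and device with pending IOs. The sketch below shows one way to extract them from captured report text. It assumes the line wording seen in the sample (`<count> pending IOs on <host>:<device>`); the function name `pending_io_stalls` is made up for illustration.

```python
import re

# Matches stall lines such as:
# "| | |-- 37 pending IOs on hypervvm-62-45.robinsystems.com:/dev/sde"
# Assumption: wording and layout are taken from the sample report only.
STALL_LINE = re.compile(
    r"(?P<count>\d+)\s+pending IOs on\s+(?P<host>[^:\s]+):(?P<dev>\S+)"
)


def pending_io_stalls(report_text):
    """Return (host, device, pending_count) tuples found in a volume report."""
    return [
        (m.group("host"), m.group("dev"), int(m.group("count")))
        for m in STALL_LINE.finditer(report_text)
    ]


SAMPLE = """\
| |-- Potential IO Stalls:
| | |-- 37 pending IOs on hypervvm-62-45.robinsystems.com:/dev/sde
| |
| |-- NFS_EXPORTS:
"""

for host, dev, count in pending_io_stalls(SAMPLE):
    print(f"{host} {dev}: {count} pending IOs")
```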
21.9.5.6. Check Device Health¶
In order to view a report containing the status of a device use the --dev
parameter along with the WWN of the device as shown in the example below. Information on the objects associated with the device, such as the volumes allocated from it, will also be displayed alongside details of the device’s usage.
Note
In addition to accepting the WWN, the --dev
parameter can be utilized with the following values: ‘all’ can be given to display information on all devices, ‘full’ can be given to display details on devices that are nearly full, a list of nodenames can be specified to show devices on the given nodes, and lastly, to uniquely identify a device, a combination of the nodename and devpath can be given in the format <nodename>:<devpath>.
Example
# sherlock --dev 0x500a075109604998 -V
DEVICE /dev/sdb on eqx01-flash15 1/100 vols | 20/57194 slices | 893.8 GB free of 894.3 GB NODE: ONLINE, RDVM: UP, DEV: READY
=============================================================================================================================================================================================
|==> WWN: 0x500a075109604998 | PATH: /dev/disk/by-id/ata-Micron_M500_MTFDDAK960MAV_140109604998
|-- VOL: 1 file-collection-1632045271349.5ff1f19f-937f-4ec1-a595-9d9df9d11d44 448 MB nslices=20 nsegs=14 (1 ) nsnaps=1
sherlock produced results in 130 milliseconds (Sun Sep 19 03:34:51 AM 2021).
|-- 2 nodes, 18 disks, 1 vols, 1 snapshots, 0 apps, 0 pods, 1 file-collections,
0 bundles, 1 users and 1 tenants were analyzed
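For scripted use, the `--dev` values described in the note above can be assembled programmatically. The following is a small sketch that only builds the argument list and does not run anything. The helper names `dev_selector` and `sherlock_dev_args` are hypothetical, and only the `--dev` and `-V` options shown in this section are used.

```python
def dev_selector(nodename, devpath):
    """Format the <nodename>:<devpath> form that uniquely identifies a device."""
    if ":" in nodename:
        # A colon in the node name would make the combined value ambiguous.
        raise ValueError("nodename must not contain ':'")
    return f"{nodename}:{devpath}"


def sherlock_dev_args(selector, verbose=True):
    """Build the argument list for a device report.

    `selector` may be a WWN, 'all', 'full', or a value from dev_selector().
    """
    args = ["sherlock", "--dev", selector]
    if verbose:
        args.append("-V")
    return args


print(sherlock_dev_args("0x500a075109604998"))
print(sherlock_dev_args("full"))
print(sherlock_dev_args(dev_selector("eqx01-flash15", "/dev/sdb")))
```

The resulting list can be passed to a process launcher such as Python's `subprocess.run` on a host where sherlock is available.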
21.9.5.7. Check Devices Nearing Maximum Capacity¶
In order to view a report containing device(s) nearing their maximum capacity the --dev
parameter along with the keyword ‘full’ can be used as shown in the example below. Space usage statistics for the concerned device(s) will be displayed alongside the allocations utilizing the space on the device. If any device(s) are nearing their maximum capacity, Robin recommends adding more devices to the respective nodes in order to boost the performance of the cluster as well as ensure it can host more applications.
Example
# sherlock --dev full -V
DEVICE /dev/sdc on telxvm-53-159 9/10 vols | 31/6390 slices | 2 KB free of 100 GB NODE: ONLINE, RDVM: UP, DEV: READY
=======================================================================================================================================================================
|==> WWN: 0x600224802a495d29715780d6f9be9eb5 | PATH: /dev/disk/by-id/scsi-3600224802a495d29715780d6f9be9eb5
|-- VOL: 8 jm1.R1.01.data.1.12725312-2294-4eea-8eac-704938facd69 576 MB nslices=10 nsegs=18 (1 ) nsnaps=2
|-- VOL: 9 jm1.R2.01.data.1.e505bc9c-3bbc-4ec7-81c4-63e845aef949 576 MB nslices=10 nsegs=18 (1 ) nsnaps=2
|-- VOL: 1 file-collection-1637286348291.a8492d9c-591d-404f-926d-a4d647adcffc 160 MB nslices=5 nsegs=5 (1 ) nsnaps=1
|-- VOL: 4 test.server.01.data.1.11e20b4b-4d40-419c-959d-fb0583321c11 64 MB nslices=1 nsegs=2 (2 ) nsnaps=1
|-- VOL: 10 pvc-aa6a4996-3212-4b41-8c2a-44375be6834c 64 MB nslices=1 nsegs=2 (2 ) nsnaps=1
|-- VOL: 11 pvc-c2d086b0-fee8-4704-ab87-f937a09fb40e 64 MB nslices=1 nsegs=2 (2 ) nsnaps=1
|-- VOL: 24 clone1.R2.01.data.1.ab94abef-96ba-4073-816d-f2e4e0614a42 32 MB nslices=1 nsegs=1 (1 ) nsnaps=1
|-- VOL: 25 clone1.R1.01.data.1.bca7107a-3dd9-4d45-af93-69819e88504a 32 MB nslices=1 nsegs=1 (1 ) nsnaps=1
|-- VOL: 5 test.server.01.block.1.ec6ac7c1-19c1-45af-a578-0c7c4c1dca0c 0 nslices=1 nsegs=0 (0 ) nsnaps=1
DEVICE /dev/sdb on telxvm-53-159 9/10 vols | 9/6390 slices | 2 KB free of 100 GB NODE: ONLINE, RDVM: UP, DEV: READY
=======================================================================================================================================================================
|==> WWN: 0x60022480a2e824923e91646995e0da4b | PATH: /dev/disk/by-id/scsi-360022480a2e824923e91646995e0da4b
|-- VOL: 19 test2.server.01.data.1.f04ef40a-6218-477c-8f02-e5cf4b2899a6 64 MB nslices=1 nsegs=2 (2 ) nsnaps=1
|-- VOL: 17 test2.server.02.data.1.6011eb22-bf3e-4e16-ba9c-7de8d2fb97dc 64 MB nslices=1 nsegs=2 (2 ) nsnaps=1
|-- VOL: 18 test2.server.03.data.1.05139336-6657-49cb-8663-11da2a4a5d0f 64 MB nslices=1 nsegs=2 (2 ) nsnaps=1
|-- VOL: 21 pvc-cf88884a-80c2-4544-80d3-82334331e529 64 MB nslices=1 nsegs=2 (2 ) nsnaps=1
|-- VOL: 22 pvc-b8a55274-0e96-48a3-94d9-d1fe2aa846e6 64 MB nslices=1 nsegs=2 (2 ) nsnaps=1
|-- VOL: 23 pvc-ba4e627b-1795-46c2-bca9-65dfdfbbab7e 64 MB nslices=1 nsegs=2 (2 ) nsnaps=1
|-- VOL: 20 test2.server.03.block.1.4dfa240b-126b-448d-843e-9262ee8cdca9 0 nslices=1 nsegs=0 (0 ) nsnaps=1
|-- VOL: 16 test2.server.01.block.1.0e3a6e89-bb00-413b-a25d-5a416d523bae 0 nslices=1 nsegs=0 (0 ) nsnaps=1
|-- VOL: 15 test2.server.02.block.1.000c723b-11c5-4fc3-8eec-032141da223f 0 nslices=1 nsegs=0 (0 ) nsnaps=1
sherlock produced results in 129 milliseconds (Mon Nov 29 05:11:03 AM 2021).
|-- 1 nodes, 6 disks, 18 vols, 20 snapshots, 4 apps, 8 pods, 1 file-collections,
2 bundles, 2 users and 2 tenants were analyzed
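The `DEVICE` header lines in the report above contain the volume count, slice count and free space for each device. As an illustration, the sketch below parses one such header and computes the fraction of space still free. It assumes the header layout shown in the sample and treats the size units as binary multiples (an assumption that does not affect the ratio as long as both values use the same convention). The name `device_usage` is hypothetical.

```python
import re

# Assumption: units are binary multiples; only the ratio is used below.
UNITS = {"KB": 1024, "MB": 1024**2, "GB": 1024**3, "TB": 1024**4}

# Matches headers such as:
# "DEVICE /dev/sdc on telxvm-53-159 9/10 vols | 31/6390 slices | 2 KB free of 100 GB ..."
HEADER = re.compile(
    r"DEVICE\s+(?P<path>\S+)\s+on\s+(?P<node>\S+)\s+"
    r"(?P<vols>\d+)/(?P<maxvols>\d+)\s+vols\s+\|\s+"
    r"(?P<slices>\d+)/(?P<maxslices>\d+)\s+slices\s+\|\s+"
    r"(?P<free>[\d.]+)\s+(?P<free_unit>[KMGT]B)\s+free\s+of\s+"
    r"(?P<total>[\d.]+)\s+(?P<total_unit>[KMGT]B)"
)


def device_usage(line):
    """Parse one DEVICE header line; return None if the line does not match."""
    m = HEADER.search(line)
    if m is None:
        return None
    free = float(m.group("free")) * UNITS[m.group("free_unit")]
    total = float(m.group("total")) * UNITS[m.group("total_unit")]
    return {
        "node": m.group("node"),
        "path": m.group("path"),
        "vols": (int(m.group("vols")), int(m.group("maxvols"))),
        "slices": (int(m.group("slices")), int(m.group("maxslices"))),
        "free_fraction": free / total,
    }


SAMPLE = (
    "DEVICE /dev/sdc on telxvm-53-159 9/10 vols | 31/6390 slices | "
    "2 KB free of 100 GB NODE: ONLINE, RDVM: UP, DEV: READY"
)

usage = device_usage(SAMPLE)
print(usage["node"], usage["path"], f"{usage['free_fraction']:.2%} free")
```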
21.9.5.8. Find Devices With Rebalance Need¶
In order to view a report containing device(s) that might need to be rebalanced use the --devs-needing-rebalance
parameter as shown in the example below.
Example
# sherlock --devs-needing-rebalance
SHOWING APPLICATIONS THAT NEED ATTENTION:
All apps are healthy
SHOWING PODS THAT NEED ATTENTION:
All pods are healthy
SHOWING UNHEALTHY VOLUMES THAT NEED ATTENTION
All volumes are healthy
SHOWING UNHEALTHY NODES THAT NEED ATTENTION:
All nodes are healthy
SHOWING UNHEALTHY DEVICES THAT NEED ATTENTION:
All devices are healthy
SHOWING UNAVAILABLE FILE COLLECTIONS THAT NEED ATTENTION:
All file collection are available
SHOWING UNAVAILABLE BUNDLES THAT NEED ATTENTION:
All bundles are available
Moving 4 vols, 20 slices and 256 segments:
eqx04-flash05 /dev/sdb 894.3 GB/894.3 GB (free=100.0 %) slices= 0/57194 segs= 0/57194 vols= 0/100 [ 1.26 ]
eqx04-flash05 /dev/sdc 894.3 GB/894.3 GB (free=100.0 %) slices= 0/57194 segs= 0/57194 vols= 0/100 [ 1.26 ]
eqx04-flash05 /dev/sda 894.3 GB/894.3 GB (free=100.0 %) slices= 0/0 segs= 0/0 vols= 0/100 [ -1.00 ]
eqx04-flash05 /dev/sdd 894.3 GB/894.3 GB (free=100.0 %) slices= 0/0 segs= 0/0 vols= 0/100 [ -1.00 ]
eqx04-flash05 /dev/sde 59.6 GB/59.6 GB (free=100.0 %) slices= 0/0 segs= 0/0 vols= 0/100 [ -1.00 ]
eqx04-flash05 /dev/dm-1 17.4 GB/17.4 GB (free=100.0 %) slices= 0/0 segs= 0/0 vols= 0/100 [ -1.00 ]
eqx04-flash05 /dev/dm-2 35.7 GB/35.7 GB (free=100.0 %) slices= 0/0 segs= 0/0 vols= 0/100 [ -1.00 ]
eqx04-flash05 /dev/dm-0 6.0 GB/6.0 GB (free=100.0 %) slices= 0/0 segs= 0/0 vols= 0/100 [ -1.00 ]
------------------------------------------------------------------------------------------------------------------------
eqx01-flash16 /dev/sdb 1.8 TB/1.8 TB (free=100.0 %) slices= 0/119194 segs= 0/119194 vols= 0/100 [ 1.26 ]
eqx01-flash16 /dev/sdg 1.8 TB/1.8 TB (free=100.0 %) slices= 0/119194 segs= 0/119194 vols= 0/100 [ 1.26 ]
eqx01-flash16 /dev/sdd 1.8 TB/1.8 TB (free=100.0 %) slices= 0/119194 segs= 0/119194 vols= 0/100 [ 1.26 ]
eqx01-flash16 /dev/sdi 1.8 TB/1.8 TB (free=100.0 %) slices= 0/119194 segs= 0/119194 vols= 0/100 [ 1.26 ]
eqx01-flash16 /dev/sde 1.8 TB/1.8 TB (free=100.0 %) slices= 0/119194 segs= 0/119194 vols= 0/100 [ 1.26 ]
eqx01-flash16 /dev/sdf 1.8 TB/1.8 TB (free=100.0 %) slices= 5/119194 segs= 11/119194 vols= 1/100 [ 1.25 ]
eqx01-flash16 /dev/sdh 1.8 TB/1.8 TB (free=100.0 %) slices= 11/119194 segs= 18/119194 vols= 1/100 [ 1.25 ]
eqx01-flash16 /dev/sdc 14.9 GB/14.9 GB (free=100.0 %) slices= 0/0 segs= 0/0 vols= 0/100 [ -1.00 ]
eqx01-flash16 /dev/sda 1.8 TB/1.8 TB (free=100.0 %) slices= 0/0 segs= 0/0 vols= 0/100 [ -1.00 ]
------------------------------------------------------------------------------------------------------------------------
eqx01-flash15 /dev/sde 1.8 TB/1.8 TB (free=100.0 %) slices= 0/119194 segs= 0/119194 vols= 0/100 [ 1.26 ]
eqx01-flash15 /dev/sdf 1.8 TB/1.8 TB (free=100.0 %) slices= 0/119194 segs= 0/119194 vols= 0/100 [ 1.26 ]
eqx01-flash15 /dev/sdi 1.8 TB/1.8 TB (free=100.0 %) slices= 0/119194 segs= 0/119194 vols= 0/100 [ 1.26 ]
eqx01-flash15 /dev/sdg 1.8 TB/1.8 TB (free=100.0 %) slices= 0/119194 segs= 0/119194 vols= 0/100 [ 1.26 ]
eqx01-flash15 /dev/sdh 1.8 TB/1.8 TB (free=100.0 %) slices= 0/119194 segs= 0/119194 vols= 0/100 [ 1.26 ]
eqx01-flash15 /dev/sdd 1.8 TB/1.8 TB (free=100.0 %) slices= 0/119194 segs= 0/119194 vols= 0/100 [ 1.26 ]
eqx01-flash15 /dev/sdb 893.8 GB/894.3 GB (free=100.0 %) slices= 20/57194 segs= 14/57194 vols= 1/100 [ 1.24 ]
eqx01-flash15 /dev/sda 894.3 GB/894.3 GB (free=100.0 %) slices= 0/0 segs= 0/0 vols= 0/100 [ -1.00 ]
eqx01-flash15 /dev/sdc 14.9 GB/14.9 GB (free=100.0 %) slices= 0/0 segs= 0/0 vols= 0/100 [ -1.00 ]
Only unhealthy objects are shown. To see everything re-run with -H|--healthy option
To see more details rerun with -V|--verbose option
sherlock produced results in 131 milliseconds (Sun Sep 19 05:19:44 PM 2021).
|-- 3 nodes, 26 disks, 3 vols, 3 snapshots, 0 apps, 0 pods, 1 file-collections,
1 bundles, 2 users and 2 tenants were analyzed
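The per-device rows in the output above can be read into structured records for sorting or filtering. The sketch below is one possible parser, assuming rows follow the layout shown in the sample. The bracketed number at the end of each row is carried through as `value` without interpretation, since its meaning is not described here. The name `parse_rebalance_rows` is hypothetical.

```python
import re

# Matches rows such as:
# "eqx01-flash16 /dev/sdf 1.8 TB/1.8 TB (free=100.0 %) slices= 5/119194 segs= 11/119194 vols= 1/100 [ 1.25 ]"
# Assumption: the row layout is inferred from the sample output only.
ROW = re.compile(
    r"^(?P<node>\S+)\s+(?P<dev>/dev/\S+)\s+.*?"
    r"slices=\s*(?P<used>\d+)/(?P<total>\d+)\s+"
    r"segs=\s*(?P<segs>\d+)/\d+\s+"
    r"vols=\s*(?P<vols>\d+)/\d+\s+"
    r"\[\s*(?P<value>-?[\d.]+)\s*\]"
)


def parse_rebalance_rows(text):
    """Return one dict per device row; separator and summary lines are skipped."""
    rows = []
    for line in text.splitlines():
        m = ROW.match(line.strip())
        if m:
            rows.append({
                "node": m.group("node"),
                "dev": m.group("dev"),
                "slices_used": int(m.group("used")),
                "slices_total": int(m.group("total")),
                "vols": int(m.group("vols")),
                "value": float(m.group("value")),
            })
    return rows


SAMPLE = """\
eqx01-flash16 /dev/sdf 1.8 TB/1.8 TB (free=100.0 %) slices= 5/119194 segs= 11/119194 vols= 1/100 [ 1.25 ]
eqx01-flash16 /dev/sdc 14.9 GB/14.9 GB (free=100.0 %) slices= 0/0 segs= 0/0 vols= 0/100 [ -1.00 ]
------------------------------------------------------------------------
"""

# List only devices that currently hold slices.
for row in parse_rebalance_rows(SAMPLE):
    if row["slices_used"] > 0:
        print(row["node"], row["dev"], row["slices_used"])
```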