******************************
Troubleshooting Tools
******************************

Robin Platform provides a number of native tools and commands that an administrator can use to troubleshoot a Robin cluster and report issues. These tools vary in their use case, but each provides insight into why the cluster is not functioning as intended or why an unexpected failure occurred. As a result, they should be the go-to utilities when debugging potential issues, and their outputs should be sent alongside any bug reports filed with Robin. Each tool is described in its respective section below.

Alongside the aforementioned tools for administrators, Robin Platform also provides more granular commands, detailed in the sections below, for individual users to track the progress of the operations they execute and determine the reasons for any failures. These operations are referred to as ``jobs`` and are identified by a unique ID. Each ``job`` has a set of attributes such as the job ID, job type, description, and so on. Robin stores a record of each ``job``, including its metadata, within the database, alongside the respective job logs on the relevant nodes. An administrator can view the job logs and use them to troubleshoot issues within the cluster. It is recommended that the complete job logs are provided when reporting issues to Robin for debugging purposes.

The Robin job logs are stored in the following directories within the Robin container:

* Server side job logs are stored within ``/var/log/robin/server``. Note that this directory is only present on the Robin master nodes.
* Worker/agent side job logs are stored within ``/var/log/robin/agent``. This directory is present on all Robin nodes.

To access the job logs on the host instead of within the container, use the ``/home/robinds/var/log/robin/server`` and ``/home/robinds/var/log/robin/agent`` directories respectively.

Listing all jobs
=================

.. tabs::

   .. tab:: CLI

      Robin stores all jobs that have occurred during a cluster's lifespan. To view these jobs alongside details such as their start time and state, issue the following command:

      .. code-block:: text

         # robin job list --verbose
                          --ignoredeps
                          --noarchived
                          --nopurged
                          --states
                          --failed
                          --nocolor
                          --page_size
                          --page_num
                          --total
                          --all
                          --app
                          --k8sapp
                          --vnode
                          --node
                          --disk

      ================== ==================================================================================
      ``--verbose``      Show complete job information instead of truncating it for display purposes
      ``--ignoredeps``   Do not show child jobs
      ``--noarchived``   Do not show archived jobs
      ``--nopurged``     Do not show purged jobs
      ``--states``       Filter jobs based on states. Choose one or more from: active, failed, succeeded, archived, purged
      ``--failed``       Show only jobs which have failed
      ``--nocolor``      Show uncolored output
      ``--page_size``    Number of jobs that should be displayed for each page
      ``--page_num``     Page number to start displaying jobs from (starting index 1)
      ``--total``        Return the total number of qualified root jobs
      ``--all``          Display all jobs associated with a specific application. Note that this option must be used in conjunction with the ``--app`` option
      ``--app``          Filter jobs based on specified application
      ``--k8sapp``       Filter jobs based on specified K8s/Helm registered application name
      ``--vnode``        Filter jobs based on specified Vnode name
      ``--node``         Filter jobs based on specified physical node name
      ``--disk``         Filter jobs based on specified disk WWN
      ================== ==================================================================================

      **Example:**

      .. raw:: html
Output .. code-block:: text # robin job list ID | Type | Description | State | Start | End | User | Message --------------+-------------------+--------------------------------------------------------------------------------------------------------------------------------------------+------------------+-----------------+---------+--------+------------------------------------------ 1013 | ApplicationStart | Starting application 'wp-10' | COMPLETED | 13 Aug 23:28:29 | 0:00:54 | system | |->1015 | RoleStart | Starting instances for role 'mysql' | COMPLETED | 13 Aug 23:28:30 | 0:00:38 | system | | |->1017 | VnodeDeploy | Deploying vnode 'wp-10.mysql.01'. Origin: REST (cscale-82-140.robinsystems.com) | COMPLETED | 13 Aug 23:28:30 | 0:00:38 | system | | | |->1018 | VnodeStop | Stopping vnode wp-10.mysql.01 on cscale-82-140.robinsystems.com | COMPLETED | 13 Aug 23:28:30 | 0:00:15 | system | |->1016 | RoleStart | Starting instances for role 'wordpress' | COMPLETED | 13 Aug 23:29:08 | 0:00:14 | system | | |->1024 | VnodeDeploy | Deploying vnode 'wp-10.wordpress.01'. Origin: REST (cscale-82-140.robinsystems.com) | COMPLETED | 13 Aug 23:29:08 | 0:00:14 | system | | | |->1025 | VnodeStop | Stopping vnode wp-10.wordpress.01 on cscale-82-140.robinsystems.com | COMPLETED | 13 Aug 23:29:08 | 0:00:07 | system | 1014 | ApplicationStart | ApplicationStart | COMPLETED|FAILED | 13 Aug 23:28:29 | 0:00:00 | system | Another job is running on application 'w 1019 | ApplicationStart | Starting application 'wp-20' | COMPLETED | 13 Aug 23:28:31 | 0:00:51 | system | |->1020 | RoleStart | Starting instances for role 'mysql' | COMPLETED | 13 Aug 23:28:32 | 0:00:36 | system | | |->1022 | VnodeDeploy | Deploying vnode 'wp-20.mysql.01'. 
Origin: REST (cscale-82-140.robinsystems.com) | COMPLETED | 13 Aug 23:28:32 | 0:00:36 | system | | | |->1023 | VnodeStop | Stopping vnode wp-20.mysql.01 on cscale-82-140.robinsystems.com | COMPLETED | 13 Aug 23:28:32 | 0:00:13 | system | |->1021 | RoleStart | Starting instances for role 'wordpress' | COMPLETED | 13 Aug 23:29:08 | 0:00:14 | system | | |->1026 | VnodeDeploy | Deploying vnode 'wp-20.wordpress.01'. Origin: REST (cscale-82-140.robinsystems.com) | COMPLETED | 13 Aug 23:29:08 | 0:00:14 | system | | | |->1027 | VnodeStop | Stopping vnode wp-20.wordpress.01 on cscale-82-140.robinsystems.com | COMPLETED | 13 Aug 23:29:08 | 0:00:05 | system | 1028 | JobArchive | Archiving job/s on all hosts | COMPLETED | 14 Aug 00:00:00 | 0:00:02 | system | |->1029 | AgentJobArchive | Archiving job/s on host cscale-82-140.robinsystems.com | COMPLETED | 14 Aug 00:00:01 | 0:00:00 | system | 1030 | HostProbe | Probed cscale-82-140.robinsystems.com from ONLINE/Ready ==> ONLINE/Ready. Origin: StartingHostWatch. | COMPLETED | 14 Aug 07:54:37 | 0:00:01 | system | 1031 | HostProbe | Probed cscale-82-139.robinsystems.com from UNREACHABLE/Notready ==> UNREACHABLE/Notready. Origin: StartingHostWatch. | COMPLETED | 14 Aug 07:54:37 | 0:00:51 | system | 1032 | HostProbe | Probed cscale-82-139.robinsystems.com from UNREACHABLE/Notready ==> UNREACHABLE/Notready. Origin: StartingHostWatch. | COMPLETED | 14 Aug 08:11:11 | 0:00:50 | system | 1033 | HostProbe | Probed cscale-82-140.robinsystems.com from ONLINE/Ready ==> ONLINE/Ready. Origin: StartingHostWatch. | COMPLETED | 14 Aug 08:11:11 | 0:00:01 | system | 1034 | HostProbe | Probed cscale-82-139.robinsystems.com from PROBE_PENDING/Notready ==> UNREACHABLE/Notready. Origin: NodeUp. 
| COMPLETED | 14 Aug 09:24:17 | 0:00:50 | system | 1035 | HostFailoverPods | Failing over pods on host cscale-82-139.robinsystems.com | COMPLETED|FAILED | 14 Aug 09:25:07 | 0:01:40 | system | Pods do not need to be failed over as Ku 1036 | HostProbe | Probed cscale-82-139.robinsystems.com from UNREACHABLE/Notready ==> ONLINE/Ready. Origin: StateChange. | COMPLETED | 14 Aug 09:25:17 | 0:00:01 | system | 1037 | ApplicationDelete | Deleting application 'wp-10' | COMPLETED | 14 Aug 09:41:10 | 0:00:12 | robin | |->1038 | VnodeDelete | Deleting vnode 'wp-10.wordpress.01' from cscale-82-140.robinsystems.com | COMPLETED | 14 Aug 09:41:10 | 0:00:06 | robin | |->1039 | VnodeDelete | Deleting vnode 'wp-10.mysql.01' from cscale-82-140.robinsystems.com | COMPLETED | 14 Aug 09:41:10 | 0:00:08 | robin | 1040 | ApplicationDelete | Deleting application 'wp-20' | COMPLETED | 14 Aug 09:41:16 | 0:00:13 | robin | |->1041 | VnodeDelete | Deleting vnode 'wp-20.wordpress.01' from cscale-82-140.robinsystems.com | COMPLETED | 14 Aug 09:41:16 | 0:00:10 | robin | |->1042 | VnodeDelete | Deleting vnode 'wp-20.mysql.01' from cscale-82-140.robinsystems.com | COMPLETED | 14 Aug 09:41:16 | 0:00:09 | robin | 1043 | ApplicationDelete | Deleting application 'wp-30' | COMPLETED | 14 Aug 09:41:20 | 0:00:19 | robin | |->1044 | VnodeDelete | Deleting vnode 'wp-30.wordpress.01' from cscale-82-140.robinsystems.com | COMPLETED | 14 Aug 09:41:20 | 0:00:06 | robin | |->1045 | VnodeDelete | Deleting vnode 'wp-30.mysql.01' from cscale-82-140.robinsystems.com | COMPLETED | 14 Aug 09:41:20 | 0:00:15 | robin | 1046 | ApplicationCreate | Adding application 'wp-1' | COMPLETED | 14 Aug 09:42:58 | 0:00:58 | robin | |->1047 | RoleCreate | Provisioning containers for role 'mysql' | COMPLETED | 14 Aug 09:43:00 | 0:00:42 | robin | | |->1049 | VnodeAdd | Adding vnode 'wp-1.mysql.01' on cscale-82-140.robinsystems.com | COMPLETED | 14 Aug 09:43:00 | 0:00:42 | robin | |->1048 | RoleCreate | Provisioning containers for role 
'wordpress' | COMPLETED | 14 Aug 09:43:42 | 0:00:14 | robin | | |->1053 | VnodeAdd | Adding vnode 'wp-1.wordpress.01' on cscale-82-140.robinsystems.com | COMPLETED | 14 Aug 09:43:42 | 0:00:14 | robin | 1050 | ApplicationCreate | Adding application 'wp-2' | COMPLETED | 14 Aug 09:43:39 | 0:00:46 | robin | |->1051 | RoleCreate | Provisioning containers for role 'mysql' | COMPLETED | 14 Aug 09:43:42 | 0:00:34 | robin | | |->1054 | VnodeAdd | Adding vnode 'wp-2.mysql.01' on cscale-82-140.robinsystems.com | COMPLETED | 14 Aug 09:43:42 | 0:00:34 | robin | |->1052 | RoleCreate | Provisioning containers for role 'wordpress' | COMPLETED | 14 Aug 09:44:16 | 0:00:09 | robin | | |->1055 | VnodeAdd | Adding vnode 'wp-2.wordpress.01' on cscale-82-140.robinsystems.com | COMPLETED | 14 Aug 09:44:16 | 0:00:09 | robin | 1056 | ApplicationCreate | Adding application 'wp-3' | COMPLETED | 14 Aug 09:44:18 | 0:00:57 | robin | |->1057 | RoleCreate | Provisioning containers for role 'mysql' | COMPLETED | 14 Aug 09:44:20 | 0:00:41 | robin | | |->1059 | VnodeAdd | Adding vnode 'wp-3.mysql.01' on cscale-82-140.robinsystems.com | COMPLETED | 14 Aug 09:44:20 | 0:00:41 | robin | |->1058 | RoleCreate | Provisioning containers for role 'wordpress' | COMPLETED | 14 Aug 09:45:01 | 0:00:13 | robin | | |->1067 | VnodeAdd | Adding vnode 'wp-3.wordpress.01' on cscale-82-140.robinsystems.com | COMPLETED | 14 Aug 09:45:02 | 0:00:12 | robin | 1060 | ApplicationDelete | Deleting application 'wp-1' | COMPLETED | 14 Aug 09:44:53 | 0:00:17 | robin | |->1061 | VnodeDelete | Deleting vnode 'wp-1.wordpress.01' from cscale-82-140.robinsystems.com | COMPLETED | 14 Aug 09:44:53 | 0:00:05 | robin | |->1062 | VnodeDelete | Deleting vnode 'wp-1.mysql.01' from cscale-82-140.robinsystems.com | COMPLETED | 14 Aug 09:44:53 | 0:00:13 | robin | 1063 | ApplicationDelete | Deleting application 'wp-2' | COMPLETED | 14 Aug 09:44:57 | 0:00:21 | robin | |->1064 | VnodeDelete | Deleting vnode 'wp-2.wordpress.01' from 
cscale-82-140.robinsystems.com | COMPLETED | 14 Aug 09:44:57 | 0:00:09 | robin | |->1065 | VnodeDelete | Deleting vnode 'wp-2.mysql.01' from cscale-82-140.robinsystems.com | COMPLETED | 14 Aug 09:44:57 | 0:00:18 | robin | 1066 | ApplicationDelete | ApplicationDelete | COMPLETED|FAILED | 14 Aug 09:45:01 | 0:00:00 | robin | Another job is running on application 'w 1068 | ApplicationProbe | Probing application 'wp-3' | COMPLETED | 14 Aug 09:45:12 | 0:00:00 | robin | 1069 | ApplicationDelete | Deleting application 'wp-3' | COMPLETED | 14 Aug 09:45:16 | 0:00:12 | robin | |->1070 | VnodeDelete | Deleting vnode 'wp-3.wordpress.01' from cscale-82-140.robinsystems.com | COMPLETED | 14 Aug 09:45:16 | 0:00:05 | robin | |->1071 | VnodeDelete | Deleting vnode 'wp-3.mysql.01' from cscale-82-140.robinsystems.com | COMPLETED | 14 Aug 09:45:16 | 0:00:09 | robin | 1072 | ApplicationCreate | Adding application 'wp-1' | COMPLETED | 14 Aug 09:47:03 | 0:00:45 | robin | |->1074 | RoleCreate | Provisioning containers for role 'wordpress' | COMPLETED | 14 Aug 09:47:39 | 0:00:08 | robin | | |->1076 | VnodeAdd | Adding vnode 'wp-1.wordpress.01' on cscale-82-139.robinsystems.com | COMPLETED | 14 Aug 09:47:39 | 0:00:08 | robin | |->1073 | RoleCreate | Provisioning containers for role 'mysql' | COMPLETED | 14 Aug 09:47:05 | 0:00:34 | robin | | |->1075 | VnodeAdd | Adding vnode 'wp-1.mysql.01' on cscale-82-139.robinsystems.com | COMPLETED | 14 Aug 09:47:05 | 0:00:34 | robin | 1077 | ApplicationCreate | Adding application 'wp-2' | COMPLETED | 14 Aug 09:47:43 | 0:00:44 | robin | |->1079 | RoleCreate | Provisioning containers for role 'wordpress' | COMPLETED | 14 Aug 09:48:18 | 0:00:09 | robin | | |->1081 | VnodeAdd | Adding vnode 'wp-2.wordpress.01' on cscale-82-139.robinsystems.com | COMPLETED | 14 Aug 09:48:18 | 0:00:09 | robin | |->1078 | RoleCreate | Provisioning containers for role 'mysql' | COMPLETED | 14 Aug 09:47:45 | 0:00:33 | robin | | |->1080 | VnodeAdd | Adding vnode 'wp-2.mysql.01' on 
cscale-82-139.robinsystems.com | COMPLETED | 14 Aug 09:47:45 | 0:00:33 | robin | 1082 | ApplicationCreate | Adding application 'wp-3' | COMPLETED | 14 Aug 09:49:14 | 0:03:12 | robin | |->1083 | RoleCreate | Provisioning containers for role 'mysql' | COMPLETED | 14 Aug 09:49:16 | 0:02:49 | robin | | |->1085 | VnodeAdd | Adding vnode 'wp-3.mysql.01' on cscale-82-139.robinsystems.com | COMPLETED | 14 Aug 09:49:16 | 0:02:49 | robin | |->1084 | RoleCreate | Provisioning containers for role 'wordpress' | COMPLETED | 14 Aug 09:52:05 | 0:00:20 | robin | | |->1086 | VnodeAdd | Adding vnode 'wp-3.wordpress.01' on cscale-82-139.robinsystems.com | COMPLETED | 14 Aug 09:52:05 | 0:00:20 | robin | 1087 | HostProbe | Probed cscale-82-139.robinsystems.com from PROBE_PENDING/Notready ==> UNREACHABLE/Notready. Origin: NodeDown. | COMPLETED | 14 Aug 09:53:43 | 0:00:52 | system | 1088 | HostFailoverPods | Failing over pods on host cscale-82-139.robinsystems.com | COMPLETED | 14 Aug 09:54:35 | 0:00:01 | system | 1089 | ApplicationStart | Starting application 'wp-3' | COMPLETED | 14 Aug 09:54:38 | 0:03:41 | system | |->1092 | RoleStart | Starting instances for role 'mysql' | COMPLETED | 14 Aug 09:54:38 | 0:01:53 | system | | |->1094 | VnodeDeploy | Deploying vnode 'wp-3.mysql.01'. Origin: REST (cscale-82-140.robinsystems.com) | COMPLETED | 14 Aug 09:54:38 | 0:01:53 | system | |->1093 | RoleStart | Starting instances for role 'wordpress' | COMPLETED | 14 Aug 09:56:31 | 0:01:48 | system | | |->1102 | VnodeDeploy | Deploying vnode 'wp-3.wordpress.01'. Origin: REST (cscale-82-140.robinsystems.com) | COMPLETED | 14 Aug 09:56:31 | 0:01:48 | system | 1090 | ApplicationStart | Starting application 'wp-1' | COMPLETED | 14 Aug 09:54:38 | 0:03:44 | system | |->1098 | RoleStart | Starting instances for role 'mysql' | COMPLETED | 14 Aug 09:54:39 | 0:01:51 | system | | |->1100 | VnodeDeploy | Deploying vnode 'wp-1.mysql.01'. 
Origin: REST (cscale-82-140.robinsystems.com) | COMPLETED | 14 Aug 09:54:39 | 0:01:51 | system | |->1099 | RoleStart | Starting instances for role 'wordpress' | COMPLETED | 14 Aug 09:56:30 | 0:01:52 | system | | |->1101 | VnodeDeploy | Deploying vnode 'wp-1.wordpress.01'. Origin: REST (cscale-82-140.robinsystems.com) | COMPLETED | 14 Aug 09:56:30 | 0:01:52 | system | 1091 | ApplicationStart | Starting application 'wp-2' | COMPLETED | 14 Aug 09:54:38 | 0:03:44 | system | |->1095 | RoleStart | Starting instances for role 'mysql' | COMPLETED | 14 Aug 09:54:39 | 0:01:52 | system | | |->1097 | VnodeDeploy | Deploying vnode 'wp-2.mysql.01'. Origin: REST (cscale-82-140.robinsystems.com) | COMPLETED | 14 Aug 09:54:39 | 0:01:52 | system | |->1096 | RoleStart | Starting instances for role 'wordpress' | COMPLETED | 14 Aug 09:56:31 | 0:01:51 | system | | |->1103 | VnodeDeploy | Deploying vnode 'wp-2.wordpress.01'. Origin: REST (cscale-82-140.robinsystems.com) | COMPLETED | 14 Aug 09:56:32 | 0:01:50 | system | 1104 | ApplicationDelete | Deleting application 'wp-1' | COMPLETED | 14 Aug 10:18:34 | 0:00:15 | robin | |->1105 | VnodeDelete | Deleting vnode 'wp-1.wordpress.01' from cscale-82-140.robinsystems.com | COMPLETED | 14 Aug 10:18:34 | 0:00:06 | robin | |->1106 | VnodeDelete | Deleting vnode 'wp-1.mysql.01' from cscale-82-140.robinsystems.com | COMPLETED | 14 Aug 10:18:34 | 0:00:11 | robin | 1107 | ApplicationDelete | Deleting application 'wp-2' | COMPLETED | 14 Aug 10:18:38 | 0:00:14 | robin | |->1108 | VnodeDelete | Deleting vnode 'wp-2.wordpress.01' from cscale-82-140.robinsystems.com | COMPLETED | 14 Aug 10:18:38 | 0:00:06 | robin | |->1109 | VnodeDelete | Deleting vnode 'wp-2.mysql.01' from cscale-82-140.robinsystems.com | COMPLETED | 14 Aug 10:18:38 | 0:00:08 | robin | 1110 | ApplicationDelete | Deleting application 'wp-3' | COMPLETED | 14 Aug 10:18:43 | 0:00:15 | robin | |->1111 | VnodeDelete | Deleting vnode 'wp-3.wordpress.01' from cscale-82-140.robinsystems.com | 
COMPLETED | 14 Aug 10:18:43 | 0:00:12 | robin | |->1112 | VnodeDelete | Deleting vnode 'wp-3.mysql.01' from cscale-82-140.robinsystems.com | COMPLETED | 14 Aug 10:18:43 | 0:00:13 | robin | 1113 | HostProbe | Probed cscale-82-139.robinsystems.com from PROBE_PENDING/Notready ==> UNREACHABLE/Notready. Origin: NodeUp. | COMPLETED | 14 Aug 10:20:02 | 0:00:50 | system | 1114 | HostFailoverPods | Failing over pods on host cscale-82-139.robinsystems.com | COMPLETED|FAILED | 14 Aug 10:20:52 | 0:01:40 | system | Pods do not need to be failed over as Ku 1115 | HostProbe | Probed cscale-82-139.robinsystems.com from UNREACHABLE/Notready ==> ONLINE/Notready. Origin: StateChange.. Services Down: {'iomgr-server'} | COMPLETED | 14 Aug 10:22:17 | 0:00:00 | system | 1116 | HostProbe | Probed cscale-82-139.robinsystems.com from ONLINE/Notready ==> ONLINE/Notready. Origin: StateChange.. Services Down: {'iomgr-server'} | COMPLETED | 14 Aug 10:22:47 | 0:00:00 | system | 1117 | HostProbe | Probed cscale-82-139.robinsystems.com from ONLINE/Notready ==> ONLINE/Ready. Origin: StateChange. 
| COMPLETED | 14 Aug 10:22:59 | 0:00:00 | system | 1118 | ApplicationCreate | Adding application 'wp-1' | COMPLETED | 14 Aug 10:40:21 | 0:01:05 | robin | |->1119 | RoleCreate | Provisioning containers for role 'mysql' | COMPLETED | 14 Aug 10:40:24 | 0:00:41 | robin | | |->1121 | VnodeAdd | Adding vnode 'wp-1.mysql.01' on cscale-82-139.robinsystems.com | COMPLETED | 14 Aug 10:40:24 | 0:00:41 | robin | |->1120 | RoleCreate | Provisioning containers for role 'wordpress' | COMPLETED | 14 Aug 10:41:05 | 0:00:21 | robin | | |->1122 | VnodeAdd | Adding vnode 'wp-1.wordpress.01' on cscale-82-139.robinsystems.com | COMPLETED | 14 Aug 10:41:05 | 0:00:21 | robin | 1123 | ApplicationCreate | Adding application 'wp-2-no-aff' | COMPLETED | 14 Aug 10:45:45 | 0:00:57 | robin | |->1124 | RoleCreate | Provisioning containers for role 'mysql' | COMPLETED | 14 Aug 10:45:48 | 0:00:41 | robin | | |->1126 | VnodeAdd | Adding vnode 'wp-2-no-aff.mysql.01' on cscale-82-139.robinsystems.com | COMPLETED | 14 Aug 10:45:48 | 0:00:41 | robin | |->1125 | RoleCreate | Provisioning containers for role 'wordpress' | COMPLETED | 14 Aug 10:46:29 | 0:00:13 | robin | | |->1127 | VnodeAdd | Adding vnode 'wp-2-no-aff.wordpress.01' on cscale-82-139.robinsystems.com | COMPLETED | 14 Aug 10:46:29 | 0:00:13 | robin | 1128 | ApplicationCreate | Adding application 'wp-3-no-aff' | COMPLETED | 14 Aug 10:46:33 | 0:00:39 | robin | |->1129 | RoleCreate | Provisioning containers for role 'mysql' | COMPLETED | 14 Aug 10:46:35 | 0:00:28 | robin | | |->1131 | VnodeAdd | Adding vnode 'wp-3-no-aff.mysql.01' on cscale-82-139.robinsystems.com | COMPLETED | 14 Aug 10:46:35 | 0:00:28 | robin | |->1130 | RoleCreate | Provisioning containers for role 'wordpress' | COMPLETED | 14 Aug 10:47:03 | 0:00:09 | robin | | |->1132 | VnodeAdd | Adding vnode 'wp-3-no-aff.wordpress.01' on cscale-82-139.robinsystems.com | COMPLETED | 14 Aug 10:47:03 | 0:00:09 | robin | 1133 | HostProbe | Probed cscale-82-139.robinsystems.com from 
PROBE_PENDING/Notready ==> UNREACHABLE/Notready. Origin: NodeDown. | COMPLETED | 14 Aug 10:49:36 | 0:00:52 | system | 1134 | HostFailoverPods | Failing over pods on host cscale-82-139.robinsystems.com | COMPLETED | 14 Aug 10:50:28 | 0:00:01 | system | 1135 | ApplicationStart | Starting application 'wp-1' | COMPLETED | 14 Aug 10:50:29 | 0:03:22 | system | |->1141 | RoleStart | Starting instances for role 'wordpress' | COMPLETED | 14 Aug 10:52:16 | 0:01:35 | system | | |->1143 | VnodeDeploy | Deploying vnode 'wp-1.wordpress.01'. Origin: REST (cscale-82-140.robinsystems.com) | COMPLETED | 14 Aug 10:52:16 | 0:01:35 | system | |->1140 | RoleStart | Starting instances for role 'mysql' | COMPLETED | 14 Aug 10:50:30 | 0:01:46 | system | | |->1142 | VnodeDeploy | Deploying vnode 'wp-1.mysql.01'. Origin: REST (cscale-82-140.robinsystems.com) | COMPLETED | 14 Aug 10:50:30 | 0:01:46 | system | 1136 | VnodeDeploy | Deploying vnode 'wp-3-no-aff.wordpress.01'. Origin: Event (cscale-82-140.robinsystems.com) | COMPLETED | 14 Aug 10:50:29 | 0:01:48 | robin | 1137 | VnodeDeploy | Deploying vnode 'wp-3-no-aff.mysql.01'. Origin: Event (cscale-82-140.robinsystems.com) | COMPLETED | 14 Aug 10:50:29 | 0:02:04 | robin | 1138 | VnodeDeploy | Deploying vnode 'wp-2-no-aff.mysql.01'. Origin: Event (cscale-82-140.robinsystems.com) | COMPLETED | 14 Aug 10:50:29 | 0:02:07 | robin | 1139 | VnodeDeploy | Deploying vnode 'wp-2-no-aff.wordpress.01'. Origin: Event (cscale-82-140.robinsystems.com) | COMPLETED | 14 Aug 10:50:29 | 0:01:44 | robin | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- .. raw:: html
   .. tab:: API

      Returns all jobs that have occurred during a cluster's lifespan.

      **End Point:** /api/v5/robin_server/jobs

      **Method:** GET

      **URL Parameters:**

      - ``sort=[id|-id]`` : Sorts the returned list of jobs by their ID.
      - ``noarchived=true`` : Excludes archived jobs from the results.
      - ``nopurged=true`` : Excludes purged jobs from the results.
      - ``failed=true`` : Returns only failed jobs.
      - ``parent=true`` : Returns only parent jobs.
      - ``page_size=`` : Sets the number of jobs returned per page.
      - ``page_num=`` : Returns jobs starting from the specified page number.
      - ``objtype=[APPLICATION|K8S_APPLICATION|INSTANCE|DISK|NODE]`` : Returns only jobs for the specified object type.
      - ``objname=`` : Returns only jobs for objects with the specified name.
      - ``all=true`` : Returns all jobs. Note that this option is only valid when an application name is specified.

      **Data Parameters:** None

      **Port:** RCM Port (default value is 29442)

      **Headers:**

      - ``Authorization`` : Authorization token that identifies which user is sending the request. The token can be acquired from the login API.

      **Success Response Code:** 200

      **Error Response Code:** 500 (Internal Server Error)

      **Example Response:**

      .. raw:: html
Output .. code-block:: text { "page_size":10, "items":{ "users":[ { "email":null, "tenantid":1, "firstname":"Robin", "username":"robin", "id":3, "lastname":"Systems" } ], "jobs":[ { "jobid":1888, "tenant_id":1, "enabled":true, "child_job_ids":"[1889]", "endtime":1597456503, "children":[ { "jobid":1889, "tenant_id":1, "enabled":true, "child_job_ids":"[]", "endtime":1597456498, "parent_jobid":1888, "error":0, "message":"", "taskrunner":1, "starttime":1597456497, "dependson_job_ids":"[]", "level":"child", "user_id":1, "jtype":"CollectionOffline", "timeout":86400, "state":10, "desc":"Taking collection 'file-collection-1597122699552' offline (Force False)" } ], "parent_jobid":0, "error":0, "message":"", "taskrunner":1, "starttime":1597456496, "dependson_job_ids":"[]", "level":"parent", "user_id":1, "jtype":"CollectionOnline", "timeout":86400, "state":10, "desc":"Bringing collection 'file-collection-1597122699552' online" }, { "jobid":1887, "tenant_id":1, "enabled":true, "child_job_ids":"[1890]", "endtime":1597456504, "children":[ { "jobid":1890, "tenant_id":1, "enabled":true, "child_job_ids":"[]", "endtime":1597456499, "parent_jobid":1887, "error":0, "message":"", "taskrunner":1, "starttime":1597456497, "dependson_job_ids":"[]", "level":"child", "user_id":3, "jtype":"VnodeStop", "timeout":86400, "state":10, "desc":"Stopping vnode test-ds-1.server.01 on cscale-82-140.robinsystems.com" } ], "parent_jobid":0, "error":0, "message":"", "taskrunner":1, "starttime":1597456496, "dependson_job_ids":"[]", "level":"parent", "user_id":3, "jtype":"VnodeDeploy", "timeout":86400, "state":10, "desc":"Deploying vnode 'test-ds-1.server.01'. 
Origin: Event (cscale-82-140.robinsystems.com)" }, { "jobid":1886, "tenant_id":1, "enabled":true, "child_job_ids":"[]", "endtime":1597456488, "parent_jobid":0, "error":0, "message":"", "taskrunner":0, "starttime":1597456487, "dependson_job_ids":"[]", "level":"parent", "user_id":1, "jtype":"HostProbe", "timeout":86400, "state":10, "desc":"Probed cscale-82-140.robinsystems.com from ONLINE\/Notready ==> ONLINE\/Ready. Origin: StateChange." }, { "jobid":1885, "tenant_id":1, "enabled":true, "child_job_ids":"[]", "endtime":1597456476, "parent_jobid":0, "error":0, "message":"", "taskrunner":0, "starttime":1597456475, "dependson_job_ids":"[]", "level":"parent", "user_id":1, "jtype":"HostProbe", "timeout":86400, "state":10, "desc":"Probed cscale-82-140.robinsystems.com from ONLINE\/Notready ==> ONLINE\/Notready. Origin: StateChange.. Services Down: {'iomgr-server'}" }, { "jobid":1884, "tenant_id":1, "enabled":true, "child_job_ids":"[]", "endtime":1597456470, "parent_jobid":0, "error":0, "message":"", "taskrunner":0, "starttime":1597456470, "dependson_job_ids":"[]", "level":"parent", "user_id":1, "jtype":"HostProbe", "timeout":86400, "state":10, "desc":"Probed cscale-82-140.robinsystems.com from ONLINE\/WaitingForMonitor ==> ONLINE\/Notready. Origin: StartingHostWatch.. Services Down: {'iomgr-server'}" }, { "jobid":1883, "tenant_id":1, "enabled":true, "child_job_ids":"[]", "endtime":1597456520, "parent_jobid":0, "error":0, "message":"", "taskrunner":0, "starttime":1597456469, "dependson_job_ids":"[]", "level":"parent", "user_id":1, "jtype":"HostProbe", "timeout":86400, "state":10, "desc":"Probed cscale-82-139.robinsystems.com from UNREACHABLE\/Notready ==> UNREACHABLE\/Notready. Origin: StartingHostWatch." 
}, { "jobid":1882, "tenant_id":1, "enabled":true, "child_job_ids":"[]", "endtime":1597456467, "parent_jobid":0, "error":0, "message":"", "taskrunner":0, "starttime":1597456467, "dependson_job_ids":"[]", "level":"parent", "user_id":1, "jtype":"DiskNotify", "timeout":86400, "state":10, "desc":"Event on disk '0x60022480940ed076551cfaf75612e24e'" }, { "jobid":1881, "tenant_id":1, "enabled":true, "child_job_ids":"[]", "endtime":1597456467, "parent_jobid":0, "error":0, "message":"", "taskrunner":0, "starttime":1597456467, "dependson_job_ids":"[]", "level":"parent", "user_id":1, "jtype":"DiskNotify", "timeout":86400, "state":10, "desc":"Event on disk '0x60022480ffcf3deb224fb37d78fe7767'" }, { "jobid":1880, "tenant_id":1, "enabled":true, "child_job_ids":"[]", "endtime":1597456467, "parent_jobid":0, "error":0, "message":"", "taskrunner":0, "starttime":1597456467, "dependson_job_ids":"[]", "level":"parent", "user_id":1, "jtype":"DiskNotify", "timeout":86400, "state":10, "desc":"Event on disk '0x600224804c48fd7e16c608dea0919064'" }, { "jobid":1879, "tenant_id":1, "enabled":true, "child_job_ids":"[]", "endtime":1597456467, "parent_jobid":0, "error":0, "message":"", "taskrunner":0, "starttime":1597456467, "dependson_job_ids":"[]", "level":"parent", "user_id":1, "jtype":"DiskNotify", "timeout":86400, "state":10, "desc":"Event on disk '0x600224803bcdafde95b1f5cd27ceb5fb'" } ] }, "total":1542, "num_items":10, "page_num":1 } .. raw:: html
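The request described above can also be scripted. The following is a minimal Python sketch rather than an official client. It assumes that the RCM port is served over HTTPS, that the token is sent as the raw value of the ``Authorization`` header, and that the response has the shape shown in the example response. The host name ``robin-master.example.com`` and the helper names ``build_jobs_url``, ``flatten_jobs`` and ``fetch_jobs`` are hypothetical.

.. code-block:: python

   import json
   import urllib.parse
   import urllib.request

   RCM_PORT = 29442  # default RCM port documented above


   def build_jobs_url(host, port=RCM_PORT, **params):
       """Build the job listing URL; keyword arguments become URL parameters."""
       # Booleans are lowercased to match the documented ``failed=true`` style.
       query = {k: str(v).lower() if isinstance(v, bool) else v for k, v in params.items()}
       url = f"https://{host}:{port}/api/v5/robin_server/jobs"  # HTTPS is an assumption
       return f"{url}?{urllib.parse.urlencode(query)}" if query else url


   def flatten_jobs(payload):
       """Yield (depth, job) pairs for every job and nested child in a response."""
       def walk(job, depth):
           yield depth, job
           for child in job.get("children", []):
               yield from walk(child, depth + 1)

       for job in payload["items"]["jobs"]:
           yield from walk(job, 0)


   def fetch_jobs(host, token, **params):
       """Send the GET request; the raw-token header format is an assumption."""
       request = urllib.request.Request(
           build_jobs_url(host, **params), headers={"Authorization": token}
       )
       with urllib.request.urlopen(request) as response:
           return json.load(response)


   # Trimmed copy of the example response, so the sketch can run without a cluster.
   SAMPLE = {
       "page_size": 10,
       "items": {
           "jobs": [
               {"jobid": 1888, "jtype": "CollectionOnline", "error": 0,
                "children": [{"jobid": 1889, "jtype": "CollectionOffline", "error": 0}]},
               {"jobid": 1886, "jtype": "HostProbe", "error": 0},
           ]
       },
       "total": 1542,
       "num_items": 10,
       "page_num": 1,
   }

   print(build_jobs_url("robin-master.example.com", failed=True, page_size=10))
   for depth, job in flatten_jobs(SAMPLE):
       print("  " * depth + f"{job['jobid']} {job['jtype']} error={job['error']}")

Against a live cluster, ``fetch_jobs("robin-master.example.com", token, failed=True)`` would return the parsed response, which can be passed to ``flatten_jobs`` in the same way as ``SAMPLE``.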
Show information about a specific job
======================================

.. tabs::

   .. tab:: CLI

      To get more detailed information about a specific job, including its state, duration, and any errors related to it and its child jobs, issue the following command:

      .. code-block:: text

         # robin job info

      ====================== ===========================================================================================
      ``id``                 Job ID
      ====================== ===========================================================================================

      **Example:**

      .. code-block:: text

         # robin job info 1123
         ID         | Type              | Desc                                                                      | State     | Start           | End      | Duration | Dependson | Error | Message
         -----------+-------------------+---------------------------------------------------------------------------+-----------+-----------------+----------+----------+-----------+-------+---------
         1123       | ApplicationCreate | Adding application 'wp-2-no-aff'                                          | COMPLETED | 14 Aug 10:45:45 | 10:46:42 | 0:00:57  | []        | 0     |
         |->1124    | RoleCreate        | Provisioning containers for role 'mysql'                                  | COMPLETED | 14 Aug 10:45:48 | 10:46:29 | 0:00:41  | []        | 0     |
         | |->1126  | VnodeAdd          | Adding vnode 'wp-2-no-aff.mysql.01' on cscale-82-139.robinsystems.com     | COMPLETED | 14 Aug 10:45:48 | 10:46:29 | 0:00:41  | []        | 0     |
         |->1125    | RoleCreate        | Provisioning containers for role 'wordpress'                              | COMPLETED | 14 Aug 10:46:29 | 10:46:42 | 0:00:13  | [1124]    | 0     |
         | |->1127  | VnodeAdd          | Adding vnode 'wp-2-no-aff.wordpress.01' on cscale-82-139.robinsystems.com | COMPLETED | 14 Aug 10:46:29 | 10:46:42 | 0:00:13  | []        | 0     |

   .. tab:: API

      Returns details about a specific job and any of its respective child jobs.

      **End Point:** /api/v3/robin_server/jobs/

      **Method:** GET

      **URL Parameters:** None

      **Data Parameters:** None

      **Port:** RCM Port (default value is 29442)

      **Headers:**

      - ``Authorization`` : Authorization token that identifies which user is sending the request. The token can be acquired from the login API.

      **Success Response Code:** 200

      **Error Response Code:** 500 (Internal Server Error), 404 (Not Found Error), 401 (Authorization Error)

      **Example Response:**

      .. raw:: html
      Output

      .. code-block:: text

         {
            "tenant_name":"Administrators",
            "jobid":1888,
            "tenant_id":1,
            "enabled":true,
            "json":{
               "collection_id":1597122699552,
               "state":"SuspectedOffline",
               "set_failed":true,
               "origin":2,
               "hostname":"cscale-82-140.robinsystems.com"
            },
            "user_name":"system",
            "endtime":1597456503,
            "parent_jobid":0,
            "error":0,
            "message":"",
            "taskrunner":1,
            "starttime":1597456496,
            "child_job_ids":"[1889]",
            "cjobs":[
               {
                  "tenant_name":"Administrators",
                  "jobid":1889,
                  "tenant_id":1,
                  "enabled":true,
                  "json":{
                     "collection_id":1597122699552
                  },
                  "user_name":"system",
                  "endtime":1597456498,
                  "parent_jobid":1888,
                  "error":0,
                  "message":"",
                  "taskrunner":1,
                  "starttime":1597456497,
                  "child_job_ids":"[]",
                  "cjobs":[

                  ],
                  "dependson_job_ids":"[]",
                  "user_id":1,
                  "jtype":"CollectionOffline",
                  "timeout":86400,
                  "state":10,
                  "desc":"Taking collection 'file-collection-1597122699552' offline (Force False)",
                  "priority":300
               }
            ],
            "dependson_job_ids":"[]",
            "user_id":1,
            "jtype":"CollectionOnline",
            "timeout":86400,
            "state":10,
            "desc":"Bringing collection 'file-collection-1597122699552' online",
            "priority":300
         }

      .. raw:: html
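The nested ``cjobs`` lists in the example response above can be walked to reproduce a hierarchy similar to the CLI output. The following Python sketch is illustrative only. It assumes that ``starttime`` and ``endtime`` are Unix timestamps in seconds and that a non-zero ``error`` value marks a failed job; neither is stated explicitly above. The helper names are hypothetical.

.. code-block:: python

   def iter_job_tree(job, depth=0):
       """Yield (depth, job) for a job and every descendant listed under ``cjobs``."""
       yield depth, job
       for child in job.get("cjobs", []):
           yield from iter_job_tree(child, depth + 1)


   def format_job_tree(job):
       """Return one summary line per job, indented like the CLI hierarchy."""
       lines = []
       for depth, node in iter_job_tree(job):
           prefix = "| " * max(depth - 1, 0) + ("|->" if depth else "")
           # Assumption: both timestamps are epoch seconds, so the difference is a duration.
           duration = node["endtime"] - node["starttime"]
           lines.append(f"{prefix}{node['jobid']} {node['jtype']} (error={node['error']}, {duration}s)")
       return lines


   def jobs_with_errors(job):
       """Return the IDs of jobs in the tree whose ``error`` field is non-zero (assumed failure)."""
       return [node["jobid"] for _, node in iter_job_tree(job) if node["error"] != 0]


   # Trimmed copy of the example response, keeping only the fields used here.
   SAMPLE_JOB = {
       "jobid": 1888, "jtype": "CollectionOnline", "error": 0,
       "starttime": 1597456496, "endtime": 1597456503,
       "cjobs": [
           {"jobid": 1889, "jtype": "CollectionOffline", "error": 0,
            "starttime": 1597456497, "endtime": 1597456498, "cjobs": []},
       ],
   }

   for line in format_job_tree(SAMPLE_JOB):
       print(line)
   print("Jobs with errors:", jobs_with_errors(SAMPLE_JOB))

A recursive generator keeps the traversal independent of how deep the job hierarchy goes, so the same helpers apply to multi-level jobs such as ``ApplicationCreate``.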
Retrieving Job Logs ==================== .. tabs:: .. tab:: CLI Robin provides a utility which collects all the appropriate logs from the necessary nodes for a particular job and its consequent hierarchy. It stores these logs within a single tarball that can be provided to Robin alongside a bug report. In addition this useful for an Administrator to debug as to why a job failed unexpectedly. This functionality is extremely convienent as it automates the process of the user logging into every affected node and collecting/inspecting the relevant log files. Issue the following command to retrieve logs for a specific job: .. code-block:: text # robin job get ====================== =========================================================================================== ``id`` ID of job to collect the logs for ====================== =========================================================================================== **Example:** .. code-block:: text # robin job get 1 Retrieving log files... Log files for Job ids: [1] are retrieved successfully at 1582189081.tar.gz Archiving Job Logs ================== Robin Job logs can be archived in order to prevent the data loss, to improve the security, and to increase free space in the parent directory. The archival process involves moving all completed job logs to the archived sub-directory of the parent directory. The archived sub-directories are contained within the ``/var/log/robin/server`` and ``var/log/robin/agent`` directories. There are two methods via which this can be achieved. The first is via an automatic schedule, whose parameters can be configured, and the second is the ``robin job archive`` command detailed below. By default, the Robin job archive schedule automatically archives the logs for jobs that were completed successfully and are older than 24 hours. .. Note:: The logs for failed jobs remain in the parent directories for analysis purposes. Archive a Job on demand ^^^^^^^^^^^^^^^^^^^^^^^ .. tabs:: .. 
tab:: CLI In order to archive Robin jobs and their respective logs on demand, run the following command: .. code-block:: text # robin job archive --age --include-failed ========================== ===================================================================== ``--age `` Minimum age (in minutes) of the job(s) whose logs should be archived ``--include-failed`` Archive the logs for failed jobs as well ========================== ===================================================================== **Example:** .. code-block:: text # robin job archive --age 600 --wait Job: 255170 Name: JobArchive State: PROCESSED Error: 0 Job: 255170 Name: JobArchive State: PREPARED Error: 0 Job: 255170 Name: JobArchive State: WAITING Error: 0 Job: 255170 Name: JobArchive State: COMPLETED Error: 0 Configure Job archive schedule attributes ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ Listed below are all the attributes a user can configure with regards to the scheduled job archival task. ===================== ============== ============================================================================================================================================================================================================================================ Attribute Default value Description ===================== ============== ============================================================================================================================================================================================================================================ ``job_archive_age`` 86400 The age (in seconds) of the completed job(s) whose logs should be automatically archived. ``job_archive_cron`` 0 0 * * * The time at which the job archival schedule is run. The value for this attribute must be a valid CRON format, details of which can be found `here `_. 
===================== ============== ============================================================================================================================================================================================================================================ In order to update any of the aforementioned configurable attributes, run the following command: .. code-block:: text # robin config update server **Example:** .. code-block:: text # robin config update server job_archive_age 81000 The 'server' attribute 'job_archive_age' has been updated Purging Job logs ==================== Robin enables users to purge the logs for jobs when they are no longer needed and space needs to be freed up. There are two methods via which this can be achieved. The first is via an automatic schedule, whose parameters can be configured, and the second is the ``robin job purge`` command detailed in the section below. By default, the Robin job purge schedule removes jobs (and their respective logs) that fall into any of the following categories: * Successful jobs older than two weeks. * Failed jobs older than four weeks. * Robin maintenance jobs older than one week. Both methods of purging a job remove the record(s) of the job and its child jobs from the Robin database and delete the following associated log files, if present: - The server job log directory at ``/var/log/robin/server/`` on the Robin master node. - The archived server job log directory at ``/var/log/robin/server/archived/.tar.gz`` on the Robin master node. - The agent job log directory at ``/var/log/robin/agent/`` on all nodes. - The archived agent job log directory at ``/var/log/robin/agent/archived/.tar.gz`` on all nodes. Purge a Job on demand ^^^^^^^^^^^^^^^^^^^^^^^ .. tabs:: .. tab:: CLI In order to purge Robin jobs and their respective logs on demand, run the following command: ..
code-block:: text # robin job purge --age --failed-job-age --maintenance-job-age --maintenance-job-types --before-id ================================================ ================================================================================================================================================================= ``--age `` Purge successful jobs that have completed before specified date and time in '%Y-%m-%dT%H:%M:%S' format. The default is two weeks earlier than the current date. ``--failed-job-age `` Purge failed jobs that have completed before specified date and time in '%Y-%m-%dT%H:%M:%S' format. The default is four weeks earlier than the current date. ``--maintenance-job-age `` Purge maintenance jobs that have completed before specified date and time in '%Y-%m-%dT%H:%M:%S' format. The default is one week earlier than the current date. ``--maintenance-job-types `` Comma separated list of job types to be considered maintenance jobs. The default types include: JobArchive and JobPurge. ``--before-id `` Jobs whose IDs are lower than the specified ID will be purged. Note if ``--age`` is specified, it will take precedence. ================================================ ================================================================================================================================================================= **Example:** .. code-block:: text # robin job purge --age 2021-04-06T18:14:00 --failed-job-age 2021-04-06T18:14:00 --maintenance-job-age 2021-04-06T18:14:00 --wait Job: 309 Name: JobPurge State: VALIDATED Error: 0 Job: 309 Name: JobPurge State: COMPLETED Error: 0 Configure Job purge schedule attributes ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ Listed below are all the attributes a user can configure with regards to the scheduled job purge task.

================================ =================== ==========================================================================================
Attribute                        Default value       Description
================================ =================== ==========================================================================================
``job_purge_age``                1209600             The age (in seconds) of the completed job(s) which should be automatically purged.
``job_purge_cron``               30 0 * * *          The time at which the job purge schedule is run. The value for this attribute must be a valid CRON format, details of which can be found `here `_. Robin recommends that the schedule run daily.
``job_purge_failed_age``         2419200             The age (in seconds) of the failed job(s) which should be automatically purged.
``job_purge_maintenance_age``    604800              The age (in seconds) of the maintenance job(s) which should be automatically purged.
``job_purge_maintenance_jtypes`` JobArchive,JobPurge The types of maintenance jobs to be purged.
``job_purge_max_count``          100000              The maximum number of jobs that can be purged at a time.
================================ =================== ==========================================================================================

In order to update any of the aforementioned configurable attributes, run the following command:

.. code-block:: text

   # robin config update server

**Example:** ..
code-block:: text # robin config update server job_purge_age 13396198 The 'server' attribute 'job_purge_age' has been updated Cleaning up stale Job logs ========================== In certain cases, logs for jobs can remain within their respective job directories or within the archived job log directory even though the record for the job has been deleted from the database. These job logs are deemed to be stale, as the Robin database is considered to be the authoritative record of the jobs run on the cluster. Robin provides two methods by which these stale job logs can be removed. The first is via an automatic schedule, whose parameters can be configured, and the second is the ``robin job cleanup`` command detailed in the section below. By default, on the first day of every month, the Robin job cleanup schedule removes the logs for jobs whose records are no longer stored within the database. .. Note:: It is recommended that the reconciliation between the job records stored and the logs present happen at least once a month to free up space and avoid retaining the logs for jobs which are no longer relevant. Cleanup stale Job logs on demand ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ .. tabs:: .. tab:: CLI In order to clean up stale job logs present on the cluster, run the following command: .. code-block:: text # robin job cleanup **Example:** .. code-block:: text # robin job cleanup --wait Job: 358447 Name: JobCleanupStaleLogs State: WAITING Error: 0 Job: 358447 Name: JobCleanupStaleLogs State: COMPLETED Error: 0 Configure Job cleanup schedule attributes ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ Listed below are all the attributes a user can configure with regards to the scheduled job cleanup task.

===================== ============== ==========================================================================================
Attribute             Default value  Description
===================== ============== ==========================================================================================
``job_cleanup_cron``  0 1 1 * *      The time at which the job cleanup schedule is run. The value for this attribute must be a valid CRON format, details of which can be found `here `_. Robin recommends that the schedule run monthly.
===================== ============== ==========================================================================================

In order to update any of the aforementioned configurable attributes, run the following command:

.. code-block:: text

   # robin config update server

**Example:**

.. code-block:: text

   # robin config update server job_cleanup_cron "0 1 2 * *"
   The 'server' attribute 'job_cleanup_cron' has been updated

Log Collection
================

During a cluster-wide failure or any unexpected scenario that affects multiple services, Robin needs logs from all the system components in order to debug the issue properly. Sometimes, however, given the scope of the issue, only a subset of the logs needs to be collected.
This granularity is available, but it is highly recommended to always send the complete set of logs when filing a bug report with Robin. Age-based filtering is also available and helps reduce the storage footprint of the collected logs. Robin supports uploading logs to the following destinations: ========================= ========================================================= ``robin-storage`` Used to store collected logs in Robin backed storage ``nfs`` Used to store collected logs in NFS. ``s3`` Used to store collected logs in Amazon S3 ``ssh`` Used to store collected logs in a given remote location ========================= ========================================================= Storing logs using Robin Storage ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ .. tabs:: .. tab:: CLI Logs collected by Robin can be stored on a volume created on the local cluster with the following command: .. Note:: If you do not use the ``--age`` option, by default, Robin CNP collects the logs for the last 3 days. .. code-block:: text # robin log collect robin-storage --nodes --dest-path --size --media --age ============================= ========================================================= ``rpool`` Name of the resource pool to use. ``--nodes `` Comma separated list of nodes from which to collect. The default is to collect from all nodes. ``--dest-path `` Destination path where log files will be copied ``--size `` Size of the storage volume for the log collect. The default is 250GB ``--media `` Specify which type of drives to allocate storage from. Choices include: 'HDD', 'SSD'. Default media type is 'HDD' ``--age `` Collects logs based on age. Valid options are s(sec), m(min), h(hrs), d(days), Mo(month), y(years) and all. Example: Use 10m for 10 minutes. The default is to collect logs from the last 3 days. ============================= ========================================================= **Example:** ..
code-block:: text # robin log collect robin-storage default --wait Job: 123 Name: LogCollect State: PROCESSED Error: 0 Job: 123 Name: LogCollect State: WAITING Error: 0 Job: 123 Name: LogCollect State: COMPLETED Error: 0 Storing logs using NFS ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ .. tabs:: .. tab:: CLI Logs collected by Robin can be stored on an NFS share with the following command: .. Note:: If you do not use the ``--age`` option, by default, Robin CNP collects the logs for the last 3 days. .. code-block:: text # robin log collect nfs --nodes --age ============================= ========================================================= ``nfs_share`` The 'hostname' or 'IP', 'export_path' and 'dest_path' for an NFS share in the form of ::' ``--nodes `` Comma separated list of nodes from which to collect. The default is to collect from all nodes. ``--age `` Collects logs based on age. Valid options are s(sec), m(min), h(hrs), d(days), Mo(month), y(years) and all. Example: Use 10m for 10 minutes. The default is to collect logs from the last 3 days. ============================= ========================================================= **Example:** .. code-block:: text # robin log collect nfs 10.9.82.162:/tmp:/demo_log_collect Job: 126 Name: LogCollect State: PROCESSED Error: 0 Job: 126 Name: LogCollect State: WAITING Error: 0 Job: 126 Name: LogCollect State: COMPLETED Error: 0 Storing logs using AWS S3 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ .. tabs:: .. tab:: CLI Logs collected by Robin can be stored in AWS S3 with the following command: .. Note:: If you do not use the ``--age`` option, by default, Robin CNP collects the logs for the last 3 days. .. code-block:: text # robin log collect s3 --nodes --access_key --secret_key --age ================================= ========================================================= ``url`` S3 URL in the format https://s3-.amazonaws.com// ``aws_config`` JSON file containing Access key, Secret Key and Region.
Example format {"aws_access_key_id": , "aws_secret_access_key": , "region": } ``--nodes `` Comma separated list of nodes from which to collect. The default is to collect from all nodes. ``--access_key `` Access Key for the respective user with access to the specified S3 bucket. ``--secret_key `` Secret Key for the respective user with access to the specified S3 bucket. ``--age `` Collects logs based on age. Valid options are s(sec), m(min), h(hrs), d(days), Mo(month), y(years) and all. Example: Use 10m for 10 minutes. The default is to collect logs from the last 3 days. ================================= ========================================================= **Example:** .. code-block:: text # robin log collect s3 https://s3-us-west-2.amazonaws.com/log-collect/demo_log_collect /root/aws.json --wait Job: 132 Name: LogCollect State: PROCESSED Error: 0 Job: 132 Name: LogCollect State: WAITING Error: 0 Job: 132 Name: LogCollect State: COMPLETED Error: 0 Storing logs in a remote location ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ .. tabs:: .. tab:: CLI Logs collected by Robin can be stored in a remote location with the following command: .. Note:: If you do not use the ``--age`` option, by default, Robin CNP collects the logs for the last 3 days. .. code-block:: text # robin log collect ssh --nodes --password --age ================================= ========================================================= ``dest`` Destination path where the log files will be copied to. The path should be in the form of '@:' ``--nodes `` Comma separated list of nodes from which to collect. The default is to collect from all nodes. ``--password `` Provide a password on the command line instead of via a prompt ``--age `` Collects logs based on age. Valid options are s(sec), m(min), h(hrs), d(days), Mo(month), y(years) and all. Example: Use 10m for 10 minutes. The default is to collect logs from the last 3 days. ================================= ========================================================= **Example:** ..
code-block:: text # robin log collect ssh root@10.9.82.163:/demo_log_collect --password robin123 Job: 129 Name: LogCollect State: PROCESSED Error: 0 Job: 129 Name: LogCollect State: WAITING Error: 0 Job: 129 Name: LogCollect State: COMPLETED Error: 0 Cluster Auditing ================ Every operation that is performed by a user on an identifiable object within a Robin cluster is logged for auditing purposes. This allows admins to track the exact series of operations performed by a user as well as to monitor the general activity on the cluster. This not only enables more accurate backtracking for troubleshooting purposes but also improves the thoroughness of security audits. Detailed below are the methods by which a user can retrieve the audit log. Retrieving audit logs from the Robin Database ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ .. tabs:: .. tab:: CLI In order to access the audit log containing information such as which user executed an operation, the tenant and node from which they executed it, the type of object and operation involved, and the result of the operation, issue the following command: .. code-block:: text # robin user-audit list --exec-user --exec-tenant --owner-user --owner-tenant --id --object-type --page_size --page_num --operation --result --full ========================================= ===================================================================================================================================================== ``--exec-user `` Filter by username for the user who initiated the operation. Note this option cannot be used in conjunction with ``--owner-user`` parameter ``--exec-tenant `` Filter by tenant name for the user who initiated the operation. Note this option cannot be used in conjunction with ``--owner-tenant`` ``--owner-user `` Filter by username for the user who initiated the operation.
Note this option cannot be used in conjunction with ``--exec-user`` ``--owner-tenant `` Filter by tenant name for the user who initiated the operation. Note this option cannot be used in conjunction with ``--exec-tenant`` ``--id `` Filter for a specific record Id ``--object-type `` Filter by object type ``--operation `` Filter by operation ``--page_size `` Number of audit records that should be displayed for each page ``--page_num `` Page number to start displaying audit records from (starting index 1) ``--result `` Filter by operation result ``--full`` Display additional information about the audit records ========================================= ===================================================================================================================================================== **Example 1 (List first page of audit records):** .. code-block:: text # robin user-audit list Id | Timestamp | IP Addr | Exec User | Exec Tenant | Owner User | Owner Tenant | Object Type | Operation | Result ----+--------------------------+-------------+-----------+----------------+------------+--------------+-----------------+-----------+--------- 643 | August 10, 2021 14:17:47 | 172.20.0.1 | robin | Administrators | | | USER | login | success 642 | July 13, 2021 11:24:13 | 10.9.121.40 | robin | Administrators | | | USER | login | success 641 | July 13, 2021 11:24:12 | 172.20.0.1 | robin | Administrators | | | METRICS | enable | success 640 | July 13, 2021 11:24:10 | 172.20.0.1 | robin | Administrators | | | CONFIG | update | success 639 | July 13, 2021 11:24:06 | 172.20.0.1 | robin | Administrators | | | NODE | probe | success 638 | July 13, 2021 11:24:04 | 172.20.0.1 | robin | Administrators | | | NODE | probe | success 637 | July 13, 2021 11:24:04 | 172.20.0.1 | robin | Administrators | | | NODE | probe | success 636 | July 13, 2021 11:23:58 | 172.20.0.1 | robin | Administrators | | | USER | login | success 635 | July 13, 2021 11:23:57 | 172.20.0.1 | robin | 
Administrators | | | USER | login | success 634 | July 13, 2021 11:23:49 | 172.20.0.1 | robin | Administrators | | | FILE_COLLECTION | online | success 633 | July 13, 2021 11:23:44 | 172.20.0.1 | robin | Administrators | | | USER | login | success 632 | July 13, 2021 11:20:07 | 172.20.0.1 | robin | Administrators | | | NODE | config | success 631 | July 13, 2021 11:20:07 | 172.20.0.1 | robin | Administrators | | | NODE | config | success 630 | July 13, 2021 11:20:07 | 172.20.0.1 | robin | Administrators | | | NODE | config | success 629 | July 13, 2021 11:20:01 | 172.20.0.1 | robin | Administrators | | | NODE | config | success 628 | July 13, 2021 11:20:01 | 172.20.0.1 | robin | Administrators | | | NODE | config | success 627 | July 13, 2021 11:20:01 | 172.20.0.1 | robin | Administrators | | | NODE | config | success 626 | July 13, 2021 11:19:59 | 172.20.0.1 | robin | Administrators | | | USER | login | success 625 | July 13, 2021 11:19:01 | 172.20.0.1 | robin | Administrators | | | USER | login | success 624 | July 13, 2021 11:18:57 | 172.20.0.1 | robin | Administrators | | | USER | login | success -------------------------------------------- 537 items, page 1 of 27. -------------------------------------------- **Example 2 (List audit records filtered by object type):** .. 
code-block:: text # robin user-audit list --object-type APPLICATION Id | Timestamp | IP Addr | Exec User | Exec Tenant | Owner User | Owner Tenant | Object Type | Operation | Result ---+---------------------------+------------+-----------+----------------+------------+----------------+-------------+-----------+--------- 46 | October 26, 2020 12:51:46 | 172.17.0.1 | robin | Administrators | robin | Administrators | APPLICATION | create | success 45 | October 26, 2020 12:51:25 | 172.17.0.1 | robin | Administrators | robin | Administrators | APPLICATION | create | success 44 | October 26, 2020 12:51:18 | 172.17.0.1 | robin | Administrators | robin | Administrators | APPLICATION | create | success 43 | October 26, 2020 12:51:06 | 172.17.0.1 | robin | Administrators | robin | Administrators | APPLICATION | create | success 42 | October 26, 2020 12:50:59 | 172.17.0.1 | robin | Administrators | robin | Administrators | APPLICATION | create | success 41 | October 26, 2020 12:49:44 | 172.17.0.1 | robin | Administrators | robin | Administrators | APPLICATION | create | success 40 | October 26, 2020 12:49:26 | 172.17.0.1 | robin | Administrators | robin | Administrators | APPLICATION | create | success 39 | October 26, 2020 12:49:17 | 172.17.0.1 | robin | Administrators | robin | Administrators | APPLICATION | create | success 38 | October 26, 2020 12:49:03 | 172.17.0.1 | robin | Administrators | robin | Administrators | APPLICATION | create | success 37 | October 26, 2020 12:46:17 | 172.17.0.1 | robin | Administrators | robin | Administrators | APPLICATION | create | success 36 | October 26, 2020 12:45:35 | 172.17.0.1 | robin | Administrators | robin | Administrators | APPLICATION | create | success -------------------------------------------- 11 items, page 1 of 1. -------------------------------------------- **Example 3 (Show details for a single audit record):** .. 
code-block:: text # robin user-audit list --id 46 --full Id | Timestamp | IP Addr | Exec User | Exec Tenant | Owner User | Owner Tenant | Object Type | Operation | Result ---+---------------------------+------------+-----------+----------------+------------+----------------+-------------+-----------+--------- 46 | October 26, 2020 12:51:46 | 172.17.0.1 | robin | Administrators | robin | Administrators | APPLICATION | create | success object_attributes: {'tenant_id': 1, 'object_id': 11, 'jobid': 74, 'object_name': 'app-11', 'user_id': 3} details: -------------------------------------------- 1 items, page 1 of 1. -------------------------------------------- .. tab:: API Returns audit records containing details such as the user who performed the action, the tenant and node it was performed from, details on the operation itself and the object it was performed on. **End Point:** /api/v3/robin_server/user-audit **Method:** GET **URL Parameters:** - ``exec_user=`` : Utilizing this parameter results in only audit records detailing operations initiated by the specified user being returned. Note this option cannot be used in conjunction with the ``owner_user`` parameter. - ``exec_tenant=`` : Utilizing this parameter results in only audit records detailing operations initiated in the specified tenant being returned. Note this option cannot be used in conjunction with the ``owner_tenant`` parameter. - ``owner_user=`` : Utilizing this parameter results in only audit records detailing operations initiated by the specified user being returned. Note this option cannot be used in conjunction with the ``exec_user`` parameter. - ``owner_tenant=`` : Utilizing this parameter results in only audit records detailing operations initiated in the specified tenant being returned. Note this option cannot be used in conjunction with the ``exec_tenant`` parameter. - ``id=`` : Utilizing this parameter results in only the audit record with the specified ID being returned. 
- ``object_type=`` : Utilizing this parameter results in only audit records associated with the specified object type being returned. - ``operation=`` : Utilizing this parameter results in only audit records associated with the specified operation being returned. - ``page_size=`` : Utilizing this parameter results in number of audit records being returned. - ``page_num=`` : Utilizing this parameter results in audit records starting from being returned. - ``result=`` : Utilizing this parameter results in only audit records matching the specified result being returned. **Data Parameters:** None **Port:** RCM Port (default value is 29442) **Headers:** - ``Authorization: `` : Authorization token to identify which user is sending the request. The token can be acquired from the login API. **Success Response Code:** 200 **Error Response Code:** 500 (Internal Server Error) **Example Response:** .. raw:: html
Output .. code-block:: text { "object_type":"UserAuditRecord", "start":1, "count":20, "total":538, "page_size":20, "page_num":1, "items":[ { "id":644, "timestamp":"August 11, 2021 03:12:55", "ip_addr":"172.20.0.1", "exec_user_id":3, "exec_username":"robin", "exec_tenant_id":1, "exec_tenant":"Administrators", "owner_user_id":null, "owner_username":null, "owner_tenant_id":null, "owner_tenant":null, "object_type":"USER", "operation":"login", "result":"success", "object_attributes":{ }, "details":{ } }, { "id":643, "timestamp":"August 10, 2021 14:17:47", "ip_addr":"172.20.0.1", "exec_user_id":3, "exec_username":"robin", "exec_tenant_id":1, "exec_tenant":"Administrators", "owner_user_id":null, "owner_username":null, "owner_tenant_id":null, "owner_tenant":null, "object_type":"USER", "operation":"login", "result":"success", "object_attributes":{ }, "details":{ } }, { "id":642, "timestamp":"July 13, 2021 11:24:13", "ip_addr":"10.9.121.40", "exec_user_id":3, "exec_username":"robin", "exec_tenant_id":1, "exec_tenant":"Administrators", "owner_user_id":null, "owner_username":null, "owner_tenant_id":null, "owner_tenant":null, "object_type":"USER", "operation":"login", "result":"success", "object_attributes":{ }, "details":{ } }, { "id":641, "timestamp":"July 13, 2021 11:24:12", "ip_addr":"172.20.0.1", "exec_user_id":3, "exec_username":"robin", "exec_tenant_id":1, "exec_tenant":"Administrators", "owner_user_id":null, "owner_username":null, "owner_tenant_id":null, "owner_tenant":null, "object_type":"METRICS", "operation":"enable", "result":"success", "object_attributes":{ }, "details":{ } }, { "id":640, "timestamp":"July 13, 2021 11:24:10", "ip_addr":"172.20.0.1", "exec_user_id":3, "exec_username":"robin", "exec_tenant_id":1, "exec_tenant":"Administrators", "owner_user_id":null, "owner_username":null, "owner_tenant_id":null, "owner_tenant":null, "object_type":"CONFIG", "operation":"update", "result":"success", "object_attributes":{ "section":"cluster", 
"attribute":"ignored_phases" }, "details":{ "msg":"The 'cluster' attribute 'ignored_phases' has been updated" } }, { "id":639, "timestamp":"July 13, 2021 11:24:06", "ip_addr":"172.20.0.1", "exec_user_id":3, "exec_username":"robin", "exec_tenant_id":1, "exec_tenant":"Administrators", "owner_user_id":null, "owner_username":null, "owner_tenant_id":null, "owner_tenant":null, "object_type":"NODE", "operation":"probe", "result":"success", "object_attributes":{ "object_name":"systestvm-40.robinsystems.com", "object_id":1, "jobid":1539 }, "details":{ } }, { "id":638, "timestamp":"July 13, 2021 11:24:04", "ip_addr":"172.20.0.1", "exec_user_id":3, "exec_username":"robin", "exec_tenant_id":1, "exec_tenant":"Administrators", "owner_user_id":null, "owner_username":null, "owner_tenant_id":null, "owner_tenant":null, "object_type":"NODE", "operation":"probe", "result":"success", "object_attributes":{ "object_name":"systestvm-39.robinsystems.com", "object_id":3, "jobid":1538 }, "details":{ } }, { "id":637, "timestamp":"July 13, 2021 11:24:04", "ip_addr":"172.20.0.1", "exec_user_id":3, "exec_username":"robin", "exec_tenant_id":1, "exec_tenant":"Administrators", "owner_user_id":null, "owner_username":null, "owner_tenant_id":null, "owner_tenant":null, "object_type":"NODE", "operation":"probe", "result":"success", "object_attributes":{ "object_name":"systestvm-41.robinsystems.com", "object_id":2, "jobid":1537 }, "details":{ } }, { "id":636, "timestamp":"July 13, 2021 11:23:58", "ip_addr":"172.20.0.1", "exec_user_id":3, "exec_username":"robin", "exec_tenant_id":1, "exec_tenant":"Administrators", "owner_user_id":null, "owner_username":null, "owner_tenant_id":null, "owner_tenant":null, "object_type":"USER", "operation":"login", "result":"success", "object_attributes":{ }, "details":{ } }, { "id":635, "timestamp":"July 13, 2021 11:23:57", "ip_addr":"172.20.0.1", "exec_user_id":3, "exec_username":"robin", "exec_tenant_id":1, "exec_tenant":"Administrators", "owner_user_id":null, 
"owner_username":null, "owner_tenant_id":null, "owner_tenant":null, "object_type":"USER", "operation":"login", "result":"success", "object_attributes":{ }, "details":{ } }, { "id":634, "timestamp":"July 13, 2021 11:23:49", "ip_addr":"172.20.0.1", "exec_user_id":3, "exec_username":"robin", "exec_tenant_id":1, "exec_tenant":"Administrators", "owner_user_id":null, "owner_username":null, "owner_tenant_id":null, "owner_tenant":null, "object_type":"FILE_COLLECTION", "operation":"online", "result":"success", "object_attributes":{ "object_id":1603741429864, "object_name":"file-collection-1603741429864", "collection_pathname":"\/usr\/local\/robin\/collections\/file-collection-1603741429864", "hostname":"systestvm-40.robinsystems.com", "jobid":1533 }, "details":{ } }, { "id":633, "timestamp":"July 13, 2021 11:23:44", "ip_addr":"172.20.0.1", "exec_user_id":3, "exec_username":"robin", "exec_tenant_id":1, "exec_tenant":"Administrators", "owner_user_id":null, "owner_username":null, "owner_tenant_id":null, "owner_tenant":null, "object_type":"USER", "operation":"login", "result":"success", "object_attributes":{ }, "details":{ } }, { "id":632, "timestamp":"July 13, 2021 11:20:07", "ip_addr":"172.20.0.1", "exec_user_id":3, "exec_username":"robin", "exec_tenant_id":1, "exec_tenant":"Administrators", "owner_user_id":null, "owner_username":null, "owner_tenant_id":null, "owner_tenant":null, "object_type":"NODE", "operation":"config", "result":"success", "object_attributes":{ "object_id":2, "object_name":"systestvm-41.robinsystems.com", "jobid":1478 }, "details":{ } }, { "id":631, "timestamp":"July 13, 2021 11:20:07", "ip_addr":"172.20.0.1", "exec_user_id":3, "exec_username":"robin", "exec_tenant_id":1, "exec_tenant":"Administrators", "owner_user_id":null, "owner_username":null, "owner_tenant_id":null, "owner_tenant":null, "object_type":"NODE", "operation":"config", "result":"success", "object_attributes":{ "object_id":3, "object_name":"systestvm-39.robinsystems.com", "jobid":1479 }, 
"details":{ } }, { "id":630, "timestamp":"July 13, 2021 11:20:07", "ip_addr":"172.20.0.1", "exec_user_id":3, "exec_username":"robin", "exec_tenant_id":1, "exec_tenant":"Administrators", "owner_user_id":null, "owner_username":null, "owner_tenant_id":null, "owner_tenant":null, "object_type":"NODE", "operation":"config", "result":"success", "object_attributes":{ "object_id":1, "object_name":"systestvm-40.robinsystems.com", "jobid":1480 }, "details":{ } }, { "id":629, "timestamp":"July 13, 2021 11:20:01", "ip_addr":"172.20.0.1", "exec_user_id":3, "exec_username":"robin", "exec_tenant_id":1, "exec_tenant":"Administrators", "owner_user_id":null, "owner_username":null, "owner_tenant_id":null, "owner_tenant":null, "object_type":"NODE", "operation":"config", "result":"success", "object_attributes":{ "object_id":1, "object_name":"systestvm-40.robinsystems.com", "jobid":1468 }, "details":{ } }, { "id":628, "timestamp":"July 13, 2021 11:20:01", "ip_addr":"172.20.0.1", "exec_user_id":3, "exec_username":"robin", "exec_tenant_id":1, "exec_tenant":"Administrators", "owner_user_id":null, "owner_username":null, "owner_tenant_id":null, "owner_tenant":null, "object_type":"NODE", "operation":"config", "result":"success", "object_attributes":{ "object_id":2, "object_name":"systestvm-41.robinsystems.com", "jobid":1466 }, "details":{ } }, { "id":627, "timestamp":"July 13, 2021 11:20:01", "ip_addr":"172.20.0.1", "exec_user_id":3, "exec_username":"robin", "exec_tenant_id":1, "exec_tenant":"Administrators", "owner_user_id":null, "owner_username":null, "owner_tenant_id":null, "owner_tenant":null, "object_type":"NODE", "operation":"config", "result":"success", "object_attributes":{ "object_id":3, "object_name":"systestvm-39.robinsystems.com", "jobid":1467 }, "details":{ } }, { "id":626, "timestamp":"July 13, 2021 11:19:59", "ip_addr":"172.20.0.1", "exec_user_id":3, "exec_username":"robin", "exec_tenant_id":1, "exec_tenant":"Administrators", "owner_user_id":null, "owner_username":null, 
"owner_tenant_id":null, "owner_tenant":null, "object_type":"USER", "operation":"login", "result":"success", "object_attributes":{ }, "details":{ } }, { "id":625, "timestamp":"July 13, 2021 11:19:01", "ip_addr":"172.20.0.1", "exec_user_id":3, "exec_username":"robin", "exec_tenant_id":1, "exec_tenant":"Administrators", "owner_user_id":null, "owner_username":null, "owner_tenant_id":null, "owner_tenant":null, "object_type":"USER", "operation":"login", "result":"success", "object_attributes":{ }, "details":{ } } ], "state":"Succeed", "message":"NA" } .. raw:: html
Retrieving audit logs from a file ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ Robin supports recording all audit records within an audit log file. The information stored within the file is equivalent to that saved in the Robin database but provides what is essentially a hard copy of the audit trail. The audit log file is named ``robin-user-audit.log`` and is located in the ``/home/robinds/var/log/robin`` directory within the Robin daemonset container on the primary master node. By default, this feature is disabled. Some points to consider with regard to the file based logging feature: - The log file will only be generated on the active master. - The log file is automatically updated by the Robin control plane processes whenever an event occurs. - The logs are automatically rotated to ensure that these logs do not consume the whole log partition. Enable file based logging -------------------------- By default, Robin does not log audit records to a file. In order to enable this feature, perform the following steps: 1. Run the following command to indicate the feature should be enabled: .. code-block:: text # robin config update user_audit log_enable True 2. Run the following command to restart the ``robin-server`` service and thus allow the above changes to take effect: .. code-block:: text # service robin-server restart **Example** .. code-block:: text # robin config update user_audit log_enable True The 'user_audit' attribute 'log_enable' has been updated # service robin-server restart Redirecting to /bin/systemctl restart robin-server.service After you enable the feature, all audit records are saved within the aforementioned file in real time and in a user-configurable format. As an administrator, you can view the audit logs using any text editor. Additionally, these logs can be captured with any log forwarding tool for further processing.
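
For further processing, the records can also be summarized with a short script. The following is a minimal sketch, not a Robin utility: it assumes that the ``log_format`` attribute is set to ``JSON`` and that the file holds one JSON record per line. The helper names ``parse_audit_lines`` and ``summarize`` are hypothetical; the field names (``object_type``, ``operation``, ``result``) are those shown in the audit records in this section.

.. code-block:: python

   import json
   from collections import Counter


   def parse_audit_lines(lines):
       """Yield one dict per non-empty line, skipping lines that are not JSON objects."""
       for line in lines:
           line = line.strip()
           if not line:
               continue
           try:
               record = json.loads(line)
           except json.JSONDecodeError:
               # Assumption: malformed or partially written lines are ignored.
               continue
           if isinstance(record, dict):
               yield record


   def summarize(records):
       """Count records by (object_type, operation, result)."""
       return Counter(
           (r.get("object_type"), r.get("operation"), r.get("result"))
           for r in records
       )


   # Sample lines standing in for the contents of robin-user-audit.log.
   sample = [
       '{"id": 198, "object_type": "USER", "operation": "login", "result": "success"}',
       '{"id": 199, "object_type": "NAMESPACE", "operation": "delete", "result": "success"}',
       'not json',
       '{"id": 200, "object_type": "USER", "operation": "login", "result": "success"}',
   ]
   summary = summarize(parse_audit_lines(sample))
   for key, count in sorted(summary.items()):
       print(key, count)

To run the sketch against the real file, pass an open file handle for the audit log instead of ``sample``. If the records span several lines in your environment, the one-record-per-line assumption does not hold and the parsing must be adapted.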
Disable file based logging --------------------------- To disable the file based logging of audit records, perform the following steps: 1. Run the following command to indicate the feature should be disabled: .. code-block:: text # robin config update user_audit log_enable False 2. Run the following command to restart the ``robin-server`` service and thus allow the above changes to take effect: .. code-block:: text # service robin-server restart **Example** .. code-block:: text # robin config update user_audit log_enable False The 'user_audit' attribute 'log_enable' has been updated # service robin-server restart Redirecting to /bin/systemctl restart robin-server.service Disabling this feature will result in the ``robin-user-audit.log`` file not being updated with any new audit records. Configure file based logging attributes ---------------------------------------- Listed below are all the attributes a user can configure with regard to the file based logging feature. .. list-table:: :widths: 15 15 80 :header-rows: 1 * - Attribute - Default value - Valid value * - enabled - True - ``True`` - to enable the user audit feature ``False`` - to disable the user audit feature * - log_enable - False - ``True`` – to enable the audit log feature ``False`` – to disable the audit log feature * - log_file_size - 10 - The maximum size in megabytes of the audit log file * - log_format - JSON - The output format of each audit record. The following are valid values: ``JSON`` – display records in ``JSON`` format ``TEXT`` – display records in ``TEXT`` format * - log_level - INFO - Indicates the level of audit records to be captured. The following are valid values: ``INFO`` – for informational messages ``DEBUG`` – for debug-level messages that contain information for debugging a program ``WARNING`` – for warning messages ``ERROR`` – for error messages ``CRITICAL`` - for critical messages * - log_retention - 4 - The maximum number of audit log files to retain.
Any additional log files are rolled over. In order to update any of the aforementioned configurable attributes, run the following command: .. code-block:: text # robin config update user_audit **Example** .. code-block:: text # robin config update user_audit log_format TEXT The 'user_audit' attribute 'log_format' has been updated View records captured in audit file ------------------------------------- To view all audit records captured in the aforementioned file, run the following command: .. code-block:: text # cat /var/log/robin/robin-user-audit.log **Example 1 (Viewing TEXT based audit records):** .. code-block:: text # cat /var/log/robin/robin-user-audit.log 1623 | 2021-08-12T15:26:06.581513+7:00 | 192.0.2.1 | robin | Administrators | -- | -- | USER | login | success | -- | -- 1624 | 2021-08-12T15:26:12.655515+7:00 | 192.0.2.1 | robin | Administrators | -- | -- | USER | login | success | -- | -- 1625 | 2021-08-12T15:26:12.783629+7:00 | 192.0.2.1 | robin | Administrators | -- | -- | USER | login | success | -- | -- 1626 | 2021-08-12T15:26:13.118734+7:00 | 192.0.2.1 | robin | Administrators | -- | -- | USER | login | success | -- | -- 1627 | 2021-08-12T15:26:18.584252+7:00 | 192.0.2.2 | robin | Administrators | -- | -- | USER | login | success | -- | -- 1628 | 2021-08-12T15:26:21.752403+7:00 | 192.0.2.1 | robin | Administrators | -- | -- | USER | login | success | -- | -- 1629 | 2021-08-12T15:26:28.934639+7:00 | 192.0.2.1 | robin | Administrators | -- | -- | USER | login | success | -- | -- 1630 | 2021-08-12T15:26:36.089382+7:00 | 192.0.2.1 | robin | Administrators | -- | -- | USER | login | success | -- | -- 1631 | 2021-08-12T15:26:43.233911+7:00 | 192.0.2.1 | robin | Administrators | -- | -- | USER | login | success | -- | -- 1632 | 2021-08-12T15:26:50.370029+7:00 | 192.0.2.1 | robin | Administrators | -- | -- | USER | login | success | -- | -- 1633 | 2021-08-12T15:26:57.528168+7:00 | 192.0.2.1 | robin | Administrators | -- | -- | USER | login | success | -- | 
-- 1634 | 2021-08-12T15:27:04.749161+7:00 | 192.0.2.1 | robin | Administrators | -- | -- | USER | login | success | -- | -- 1635 | 2021-08-12T15:27:11.934771+7:00 | 192.0.2.1 | robin | Administrators | -- | -- | USER | login | success | -- | -- 1636 | 2021-08-12T15:27:19.127729+7:00 | 192.0.2.1 | robin | Administrators | -- | -- | USER | login | success | -- | -- 1637 | 2021-08-12T15:27:26.291575+7:00 | 192.0.2.1 | robin | Administrators | -- | -- | USER | login | success | -- | -- 1638 | 2021-08-12T15:27:33.702357+7:00 | 192.0.2.1 | robin | Administrators | -- | -- | USER | login | success | -- | -- 1639 | 2021-08-12T15:27:41.017244+7:00 | 192.0.2.1 | robin | Administrators | -- | -- | USER | login | success | -- | -- **Example 2 (Viewing JSON based audit records):** .. code-block:: text # cat /var/log/robin/robin-user-audit.log { "id": 197, "timestamp": "2021-08-12T13:56:17.230515+7:00", "ip_addr": "192.0.2.2", "exec_user_id": 3, "exec_username": "robin", "exec_tenant_id": 1, "exec_tenant": "Administrators", "owner_user_id": null, "owner_username": null, "owner_tenant_id": null, "owner_tenant": null, "object_type": "NAMESPACE", "operation": "create", "result": "success", "object_attributes": { "object_name": "oc8687pk4i", "username": "robin", "tenant": "Administrators", "import_namespace": false }, "details": {} } { "id": 198, "timestamp": "2021-08-12T13:56:17.748933+7:00", "ip_addr": "192.0.2.1", "exec_user_id": 3, "exec_username": "robin", "exec_tenant_id": 1, "exec_tenant": "Administrators", "owner_user_id": null, "owner_username": null, "owner_tenant_id": null, "owner_tenant": null, "object_type": "USER", "operation": "login", "result": "success", "object_attributes": {}, "details": {} } { "id": 199, "timestamp": "2021-08-12T13:56:33.766674+7:00", "ip_addr": "192.0.2.2", "exec_user_id": 3, "exec_username": "robin", "exec_tenant_id": 1, "exec_tenant": "Administrators", "owner_user_id": null, "owner_username": null, "owner_tenant_id": null, "owner_tenant": 
null, "object_type": "NAMESPACE", "operation": "delete", "result": "success", "object_attributes": {}, "details": {} } { "id": 200, "timestamp": "2021-08-12T13:56:34.290960+7:00", "ip_addr": "192.0.2.1", "exec_user_id": 3, "exec_username": "robin", "exec_tenant_id": 1, "exec_tenant": "Administrators", "owner_user_id": null, "owner_username": null, "owner_tenant_id": null, "owner_tenant": null, "object_type": "USER", "operation": "login", "result": "success", "object_attributes": {}, "details": {} } To view the last audit record that was captured, run the following command: .. code-block:: text # tail -n 1 /var/log/robin/robin-user-audit.log **Example 1 (Viewing last TEXT based audit record):** .. code-block:: text # tail -n 1 /var/log/robin/robin-user-audit.log 1645 | 2021-08-12T15:28:19.298469+7:00 | 192.0.2.1 | robin | Administrators | -- | -- | USER | login | success | -- | -- **Example 2 (Viewing last JSON based audit record):** .. code-block:: text # tail -n 1 /var/log/robin/robin-user-audit.log { "id": 1646, "timestamp": "2021-08-12T15:31:44.069446+7:00", "ip_addr": "192.0.2.2", "exec_user_id": 3, "exec_username": "robin", "exec_tenant_id": 1, "exec_tenant": "Administrators", "owner_user_id": null, "owner_username": null, "owner_tenant_id": null, "owner_tenant": null, "object_type": "CONFIG", "operation": "update", "result": "success", "object_attributes": { "section": "user_audit", "attribute": "log_format" }, "details": { "msg": "The 'user_audit' attribute 'log_format' has been updated" } } Kubernetes audit logs ^^^^^^^^^^^^^^^^^^^^^ Kubernetes audit logs are a chronological set of records covering every request made to the Kubernetes API server. For more information about Kubernetes audit logs, see `Kubernetes auditing `_. .. Note:: By default, the Kubernetes audit logs feature is enabled for Robin CNP clusters.
Points to consider for Kubernetes audit logs -------------------------------------------- * The maximum size for storing Kubernetes audit logs in a cluster is 1 GB, which is non-configurable. * A log file can have a maximum size of 100 MB and a maximum of 10 log files can be stored. View Kubernetes audit logs -------------------------- Kubernetes audit logs help you troubleshoot issues in your cluster. You can find the Kubernetes audit logs at ``/var/log/Kubernetes/audit/audit.log`` on any master node of your cluster. Robin CNP logs the following operations at the metadata audit level: * Create request * Patch request * Update request * Delete request Sherlock ========= Sherlock is a troubleshooting and self-diagnostic command-line interface (CLI) tool in Robin. It is designed to help Robin administrators identify and analyze any problems with Robin clusters. Using Sherlock, an administrator can diagnose cluster-wide problems, view a general cluster health report, or gather information regarding specific applications, nodes, containers, volumes, devices, and so on. It provides an in-depth view of these problems and the objects affected by querying a range of Robin APIs and making direct database calls. Moreover, the information gathered is mapped in both a top-down and bottom-up manner through the resource hierarchy in order to showcase important information on a wide range of objects in a consumable manner. Some examples of the highlighted resource connections are described below: - Applications are linked to the Pods that they are comprised of. Thus details on the health of the node providing the compute resources for the Pod(s) and the status of the attached Volumes are also presented. - Volumes are implicitly linked to the Node they are created on and as a result the status of the Node, the status and source of any replicas present (including the resync progress), and number of snapshots are also displayed.
- Similarly, Disks are explicitly attached to hosts and so details of the Node are displayed with relevant information such as the overall disk capacity and the current utilization of the disk. - The status of critical Robin services is displayed in addition to the impacted objects including Applications, Volumes, and Disks. .. Note:: Given the breadth of information displayed and gathered by Sherlock, the tool is only accessible on the active master node and should only be utilized by administrators. In addition, it needs the RCM and Storage Manager services to be running. Use Cases ^^^^^^^^^ Given the wealth of information that Sherlock displays, it can be used in practically any scenario. Whether it is used as the primary debugging tool for cluster wide issues or simply to gain insight into the usage statistics, relevant information can always be attained with the tool. Highlighted below are two example use cases where Sherlock could be particularly useful. **Diagnosing application health issues** Given that Sherlock primarily aims to trace problems throughout the resource hierarchy, it allows for the detection of the level from which a problem stems. For example, an application that can no longer write data due to a disk failure within a cluster might report itself as unhealthy. Using Sherlock, the primary issue of the disk being in a bad state can be deduced because whilst investigating an application the volumes attached to its respective Pods are also displayed. As a result, the unhealthy volumes will be reported alongside the device from which they are allocated and so the common point of failure, the failed disk, can be identified. The above example highlights the usefulness of the explicit mappings showcased by the tool and how they can be used to efficiently detect objects which are malfunctioning.
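
The reasoning in this use case can be illustrated with a toy model. The sketch below is not how Sherlock is implemented and does not call any Robin API; it only assumes a small in-memory inventory (all application, Pod, volume, and disk names are hypothetical) and walks it from the application down to the devices to find the disk shared by the unhealthy volumes.

.. code-block:: python

   from collections import Counter

   # Hypothetical inventory: the names and layout are illustrative only.
   APP_PODS = {"mysql-app": ["mysql-app.R1.01", "mysql-app.R1.02"]}
   POD_VOLUMES = {
       "mysql-app.R1.01": ["vol-data-1", "vol-log-1"],
       "mysql-app.R1.02": ["vol-data-2"],
   }
   VOLUME_DEVICES = {
       "vol-data-1": "disk-A",
       "vol-log-1": "disk-B",
       "vol-data-2": "disk-A",
   }
   UNHEALTHY_VOLUMES = {"vol-data-1", "vol-data-2"}


   def suspect_devices(app):
       """Walk app -> Pods -> volumes -> devices, counting unhealthy volumes per device."""
       counts = Counter()
       for pod in APP_PODS.get(app, []):
           for volume in POD_VOLUMES.get(pod, []):
               if volume in UNHEALTHY_VOLUMES:
                   counts[VOLUME_DEVICES[volume]] += 1
       # Devices backing the most unhealthy volumes come first.
       return counts.most_common()


   print(suspect_devices("mysql-app"))

In this toy inventory both unhealthy volumes are allocated from ``disk-A``, so that disk is the first candidate to inspect. This is the same kind of common point of failure that the Sherlock report surfaces.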
**Planning maintenance activities** Since Sherlock highlights the links between several abstract objects within a cluster, it can be used to determine the impact of an object being offline for a period of time. For example, this is particularly useful if a node needs to be cordoned off for maintenance, as Sherlock can show the impacted objects (Pods, volumes, applications, and users). As a result, any of the parties that are affected by the maintenance activity can be informed ahead of time with little to no guesswork involved. Sherlock Report ^^^^^^^^^^^^^^^ The report generated by Sherlock, shown in the examples below, is meant to provide a quick overview of the state of the Robin cluster and by default only highlights unhealthy objects, as they are the greatest cause for concern. It is split into the following key sections: * Applications - This section of the report displays unhealthy applications alongside linked resources such as the affected Pods, volumes and devices on which the application data is saved. * Pods - This section of the report displays unhealthy pods alongside details of the attached volume(s) and any Kubernetes errors associated with it. This section is highlighted as it includes general Kubernetes pods and Helm based applications. * Volumes - This section of the report displays unhealthy volumes alongside details of the device it is hosted on, logical mounts with potential IO stalls, NFS Exports, NFS Server pods, the respective node from which it is allocated, and statistics about any snapshots it may have. * Nodes - This section of the report displays unhealthy nodes alongside the status of the Robin and Kubernetes services running on each node, warnings for high resource usage and indicates a lack of space availability if appropriate. * Devices - This section of the report displays unhealthy devices alongside details of the node each is mounted on, the utilization of each disk and the volumes affected.
* File Collections - This section of the report displays unhealthy file collections and highlights any errors that may have caused it to be in an unhealthy state. * Bundles - This section of the report displays unhealthy bundles and highlights any inherited errors from other objects in the hierarchy that may cause it to be unavailable. **Example 1 (Healthy cluster Report)** .. code-block:: text # sherlock SHOWING APPLICATIONS THAT NEED ATTENTION: All apps are healthy SHOWING PODS THAT NEED ATTENTION: All pods are healthy SHOWING UNHEALTHY VOLUMES THAT NEED ATTENTION All volumes are healthy SHOWING UNHEALTHY NODES THAT NEED ATTENTION: All nodes are healthy SHOWING UNHEALTHY DEVICES THAT NEED ATTENTION: All devices are healthy SHOWING UNAVAILABLE FILE COLLECTIONS THAT NEED ATTENTION: All file collection are available SHOWING UNAVAILABLE BUNDLES THAT NEED ATTENTION: All bundles are available Only unhealthy objects are shown. To see everything re-run with -H|--healthy option To see more details rerun with -V|--verbose option sherlock produced results in 155 milliseconds (Sat Sep 18 06:14:59 PM 2021). |-- 3 nodes, 26 disks, 3 vols, 3 snapshots, 0 apps, 0 pods, 1 file-collections, 2 bundles, 1 users and 1 tenants were analyzed **Example 2 (Report with unhealthy applications)** ..
code-block:: text # sherlock SHOWING APPLICATIONS THAT NEED ATTENTION: |-- robinte STATE: PLANNED Robin Systems 2/2 pods unhealthy KIND: ROBIN SHOWING USERS WHO ARE AFFECTED: |-- Robin Systems (Firstname: Robin LastName: Systems Email: None) | |-- APPS 1: robinte SHOWING PODS THAT NEED ATTENTION: o-- POD/VNODE ID 121: robinte.R1.01 INSTALLING/ONLINE 1 CPU, 50 MB MEM NODE: UP, RIO: UP |-- POD/VNODE ID 122: robinte.R2.01 INSTALLING/ONLINE 1 CPU, 50 MB MEM NODE: UP, RIO: UP SHOWING UNHEALTHY VOLUMES THAT NEED ATTENTION All volumes are healthy SHOWING UNHEALTHY NODES THAT NEED ATTENTION: All nodes are healthy SHOWING UNHEALTHY DEVICES THAT NEED ATTENTION: All devices are healthy SHOWING UNAVAILABLE FILE COLLECTIONS THAT NEED ATTENTION: All file collection are available SHOWING UNAVAILABLE BUNDLES THAT NEED ATTENTION: All bundles are available Command Line Options ^^^^^^^^^^^^^^^^^^^^ Detailed below are the different options that can be used with the Sherlock tool to obtain the desired information. **Resource Inspection Options** These options require the names of resources (objects) to be specified in order to show detailed information about those objects. If multiple objects need to be viewed at the same time, a comma-separated list of names can be specified. ==================== ===================================================== ``--app `` Displays information about the given application(s) ``--node `` Displays information about the given node(s) ``--pod `` Displays information about the given Pod(s) ``--vol `` Displays information about the given volume(s) ``--dev `` Displays information about the given device(s) ==================== ===================================================== ..
Note:: For the ``--dev`` option, alongside specifying a list of WWNs to match several other values are supported: 'all' can be given to display information on all devices, 'full' can be given to display details on devices that are nearly full, a list of nodenames can be specified to show devices on the given nodes, and lastly to uniquely identify a device a combination of the nodename and devpath can be given in the format :`` Provides advice on device rebalancing ``--vol-rebalance-advice `` Provides advice on volume rebalancing ``--devs-needing-rebalance`` Displays information about devices that need rebalancing ``--vols-needing-rebalance`` Displays information about volumes that need rebalancing ======================================= =============================================================== **Behavior Controlling Options** The following options allows for the manipulation of the generated report to include details that might not be present by default. ======================= ====================================================================================================================================================== ``--mon `` Monitor the resource metrics for the given interval. Use this option alongside options such as ``--app``, ``--pod`` or ``--vol`` ``--start