Monitoring system health
To monitor our system’s health, Arm uses Prometheus, an open-source toolkit for system monitoring and alerting.
Prometheus can collect metrics from any service that exposes them in OpenMetrics, a standard text-based format that Secure Factory services use.
With the Secure Factory service metrics detailed in this document, you can set up alerts and monitor system behavior using:
- Prometheus Alertmanager, a rule-based engine that can trigger notifications about system behaviors.
- Grafana to visualize the metrics.
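To make the text-based format concrete, the following is a minimal sketch of reading sample lines from a metrics payload. It is a simplified illustration, not a complete OpenMetrics parser: it does not unescape label values and ignores timestamps and exemplars. The payload in EXAMPLE is invented for illustration, and the id label values are assumptions.

```python
import re

# One sample line: metric name, optional {labels}, then the value.
SAMPLE_RE = re.compile(r'^([a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(.*)\})?\s+(\S+)')
# One label pair inside the braces: key="value" (escapes are kept as-is).
LABEL_RE = re.compile(r'([a-zA-Z_][a-zA-Z0-9_]*)="((?:[^"\\]|\\.)*)"')


def parse_metrics(text):
    """Return a list of (name, labels, value) tuples from exposition text."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment lines such as HELP, TYPE, EOF
        match = SAMPLE_RE.match(line)
        if match is None:
            continue  # ignore anything this simplified parser does not understand
        name, raw_labels, raw_value = match.groups()
        labels = dict(LABEL_RE.findall(raw_labels or ""))
        samples.append((name, labels, float(raw_value)))
    return samples


# Hypothetical payload; the values and id labels are invented for illustration.
EXAMPLE = """\
# TYPE batch_key_minutes_to_expiration gauge
batch_key_minutes_to_expiration 10079.0
# TYPE jvm_memory_used_bytes gauge
jvm_memory_used_bytes{area="heap",id="G1 Eden Space"} 2.5165824e+07
jvm_memory_used_bytes{area="nonheap",id="Metaspace"} 6.1e+07
"""

samples = parse_metrics(EXAMPLE)
# Sum the heap samples by filtering on the area label.
heap_bytes = sum(value for name, labels, value in samples
                 if name == "jvm_memory_used_bytes" and labels.get("area") == "heap")
```

In practice Prometheus does this parsing itself when it scrapes a service; a sketch like this is only useful for quick manual checks of a metrics endpoint.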
Using Prometheus in the factory
Arm recommends monitoring the system at various levels:
- Node monitoring using Prometheus node exporter to validate:
  - Node availability, uptime, and number of available nodes.
  - Node health, including memory and CPU consumption and disk space.
  - Network traffic at node level.
- Docker container monitoring using Google cAdvisor to validate:
  - Docker container health in each of the nodes.
  - Network traffic.
- Secure Factory Service with Arm-provided Docker images.
  For each of the running containers, measure specific application metrics, such as:
  - Factory manufacturing information.
  - Batch key retrieval events and batch key expiration information.
  - Internal communication errors between servers.
- Database health using MongoDB exporter, which Arm provides with the Secure Factory deployment.
- HSM (Hardware Security Module) health with Arm-provided Docker images.
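As one possible way to check availability across these levels, the sketch below reads the result of an instant query for the up metric, which Prometheus records for every scrape target (1 when the last scrape succeeded, 0 when it failed). The server address, job names, and instance names are hypothetical; the response shape follows the Prometheus HTTP API instant-query format.

```python
import json
from urllib.parse import urlencode

# Assumption: address of your Prometheus server; replace with your own.
PROMETHEUS_URL = "http://prometheus.example:9090"


def instant_query_url(promql, base_url=PROMETHEUS_URL):
    """Build the URL for an instant query against the Prometheus HTTP API."""
    return f"{base_url}/api/v1/query?{urlencode({'query': promql})}"


def down_targets(response_body):
    """Return (job, instance) pairs whose `up` sample is 0 in a query response."""
    payload = json.loads(response_body)
    if payload.get("status") != "success":
        raise ValueError(f"query failed: {payload.get('error', 'unknown error')}")
    down = []
    for series in payload["data"]["result"]:
        _timestamp, value = series["value"]  # the sample value arrives as a string
        if float(value) == 0:
            labels = series["metric"]
            down.append((labels.get("job"), labels.get("instance")))
    return down


# Hypothetical response body for the query `up`; job and instance names are invented.
SAMPLE_RESPONSE = """
{"status": "success",
 "data": {"resultType": "vector", "result": [
   {"metric": {"__name__": "up", "job": "node", "instance": "node1:9100"},
    "value": [1700000000.0, "1"]},
   {"metric": {"__name__": "up", "job": "cadvisor", "instance": "node2:8080"},
    "value": [1700000000.0, "0"]}
 ]}}
"""

print(instant_query_url("up"))
print(down_targets(SAMPLE_RESPONSE))
```

The same check is usually expressed as an alerting rule on up == 0 rather than in a script; the script form is shown only to make the data shape visible.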
Using the metrics
Secure Factory Service metrics
Secure Factory Service metrics URL: https://
Secure Factory Admin service metrics URL: https://
Available metrics:
- http_server_requests_seconds_count
  The count of server HTTP input requests.
  exception, method, outcome, status, and uri are useful labels for monitoring. For example, use these labels to retrieve:
  - GET requests to /v1/device_response that return a 5xx or 4xx response to monitor the total number of failed provisioning requests.
  - GET requests that return a 5xx or 4xx response to monitor the total number of failed workstation registration requests.
- http_client_requests_seconds_count
  The count of client HTTP output requests.
  clientName, instance, method, status, and uri are useful labels for grouping similar metrics. For example, use these queries to monitor requests to Device Management or the HSM service that result in an error (the =~ operator matches the status label against a regular expression):
  http_client_requests_seconds_count{status=~"5.."}
  or
  http_client_requests_seconds_count{status=~"4.."}
- batch_key_minutes_to_expiration
  Time, in minutes, left before the batch key expires. A negative value indicates an expired batch key.
  For example, a value of 10079 indicates that the batch key expires in just under seven days.
- batch_key_update_count_total
  The number of batch key updates after the initial setup.
- uploaded_reports_count_total
  Total number of manufacturing reports Secure Factory Service uploads to Device Management.
- prepared_reports_count_total
  Total number of manufacturing reports Secure Factory Service generates. These are the reports the Manufacturing Statistics API returns.
- process_uptime_seconds
  The uptime of the Java virtual machine.
  You can use this metric to monitor how long the Secure Factory Service Docker container has been up.
- system_cpu_usage
  The recent CPU usage for the entire system.
- jvm_memory_used_bytes
  The amount of used memory.
  Use the area label to monitor heap or non-heap memory.
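The following sketch shows two ways these metrics might be used: building a PromQL selector for failed provisioning requests, and interpreting a batch_key_minutes_to_expiration value. The helper names and the seven-day warning threshold are assumptions chosen for illustration. Note that PromQL needs the =~ matcher to treat a value such as 4..|5.. as a regular expression; a plain = compares the literal string.

```python
def selector(metric, exact=None, regex=None):
    """Build a PromQL selector from exact (=) and regex (=~) label matchers."""
    matchers = [f'{key}="{value}"' for key, value in (exact or {}).items()]
    matchers += [f'{key}=~"{value}"' for key, value in (regex or {}).items()]
    return metric + "{" + ",".join(matchers) + "}"


def describe_batch_key(minutes_to_expiration, warn_minutes=7 * 24 * 60):
    """Classify a batch_key_minutes_to_expiration value.

    Returns (state, message). The warning threshold is an assumption;
    choose one that matches your own key rotation schedule.
    """
    if minutes_to_expiration < 0:
        return ("expired", f"expired {int(-minutes_to_expiration)} minutes ago")
    days, remainder = divmod(int(minutes_to_expiration), 24 * 60)
    hours, minutes = divmod(remainder, 60)
    state = "warning" if minutes_to_expiration < warn_minutes else "ok"
    return (state, f"expires in {days}d {hours}h {minutes}m")


# Failed provisioning requests: GET /v1/device_response with a 4xx or 5xx status.
failed_provisioning = selector(
    "http_server_requests_seconds_count",
    exact={"method": "GET", "uri": "/v1/device_response"},
    regex={"status": "4..|5.."},
)
print(failed_provisioning)
print(describe_batch_key(10079))
```

Wrapping the selector in sum(...) in Prometheus would give a single total across all label combinations.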
MongoDB metrics
MongoDB service metrics URL: https://
Available metrics:
- mongodb_instance_uptime_seconds
  The number of seconds the mongos or mongod process has been active.
  For example, the value 3.275797e+06 indicates that MongoDB started 37.9 days ago.
- mongodb_mongod_replset_number_of_members
  The number of replica set members. This metric can indicate whether one or more MongoDB members disconnected from the replica set.
  For example, because the number of connected members is vital, you can set an alert for when the number of available replica set members is less than two.
- mongodb_mongod_replset_member_health
  Indicates whether the member is up (1) or down (0).
  You can set an alert for when a specific replica set member is no longer available.
- mongodb_memory
  Holds information about the target system architecture of mongod and its current memory usage, in megabytes.
  Use the type label to create alerts when virtual or resident memory exceeds its limit.
- mongodb_extra_info_heap_usage_bytes
  The total size of heap space the database process uses.
- mongodb_network_bytes_total
  The amount of network traffic, in bytes, that MongoDB sends and receives.
  Use the state label to monitor in_bytes or out_bytes data.
- mongodb_network_metrics_num_requests_total
  The total number of distinct requests that the server received.
  Use this value to provide context for the in_bytes or out_bytes values and ensure that MongoDB’s network utilization is consistent with expectations and application use.
- mongodb_mongod_metrics_document_total
  Reflects document access and modification patterns and data use.
  Use the state label to monitor updated, inserted, and deleted documents.
- mongodb_op_counters_total
  Provides an overview of database operations by type, which makes it possible to analyze the load on the database at a finer granularity.
  These numbers grow over time and in response to database use. Analyze these values over time to track database usage.
  Use the type label to monitor delete, insert, query, or update operations.
- mongodb_connections_metrics_created_total
  A count of all incoming connections to the server, including closed connections.
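The sketch below works through three of the MongoDB checks described above: converting the uptime value to days, flagging replica set problems, and turning two readings of an ever-growing counter into a per-second rate. The function names, member names, and the minimum of two members as a default are illustrative assumptions; the rate calculation is a simplified stand-in for what the Prometheus rate() function does.

```python
SECONDS_PER_DAY = 86400


def uptime_days(uptime_seconds):
    """Convert mongodb_instance_uptime_seconds to days."""
    return uptime_seconds / SECONDS_PER_DAY


def replica_set_alerts(number_of_members, member_health, min_members=2):
    """Return alert messages for a replica set.

    number_of_members: value of mongodb_mongod_replset_number_of_members.
    member_health: mapping of member name to
        mongodb_mongod_replset_member_health (1 = up, 0 = down).
    """
    alerts = []
    if number_of_members < min_members:
        alerts.append(f"only {number_of_members} replica set member(s) available")
    for member, health in sorted(member_health.items()):
        if health == 0:
            alerts.append(f"member {member} is down")
    return alerts


def counter_rate(previous, current, interval_seconds):
    """Per-second rate between two counter readings.

    If the counter went backwards, assume the process restarted and
    the counter began again from zero.
    """
    increase = current - previous if current >= previous else current
    return increase / interval_seconds


print(round(uptime_days(3.275797e+06), 1))  # 37.9
# Member names below are hypothetical.
print(replica_set_alerts(1, {"mongo1:27017": 1, "mongo2:27017": 0}))
print(counter_rate(1000, 1600, 60))  # 10.0
```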
HSM service metrics
HSM service metrics URL: https://
Available metrics:
- hsm_machines
  The number of HSM machines connected to the HSM service.
  When you work with two HSM machines, it is vital to know whether one of them is down and therefore unavailable, because in that case the HSM service no longer replicates keys and certificates.
- http_server_requests_seconds_count
  The count of server HTTP input requests.
  exception, method, handler, and status are useful labels for monitoring. For example, use these labels to retrieve GET requests that return 4xx and 5xx failure responses to monitor the HSM's total number of failed Diffie-Hellman key derivations.
- performance_monitor_response_time_seconds
  A histogram of method completion times.
  Use this metric to monitor the number and latency of HSM service operations.
  For example:
  - Use this query to monitor the total number of HSM service requests to the HSM to get a stored key:
    performance_monitor_response_time_seconds_count{method="getKey",resource="hsm",status="success"}
  - Use this query to monitor the total number of such requests (HSM service requests to the HSM to get a stored key) that complete in one second or less:
    performance_monitor_response_time_seconds_bucket{class="HsmServiceService",method="getKey",resource="hsm",status="success",le="1.0"}
- http_requests_error_500_total
  The total number of HTTP requests with 5xx error statuses.
  Use this metric to trigger an alert when an HSM service request to access the physical HSM or get input from Secure Factory Service results in an error.
- process_start_time_seconds
  The start time of the process, in seconds since the Unix epoch. Indicates when the HSM service started.
- jvm_memory_bytes_used
  The used bytes of a given JVM memory area.
  Use the area label to monitor heap or non-heap memory.
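As an illustration of how the histogram and hsm_machines values might be combined into checks, the sketch below computes the share of getKey requests that complete within the one-second bucket and flags a missing HSM machine. The function names and the sample numbers are hypothetical, and the default of two expected machines follows the two-machine setup described above.

```python
def fraction_within_bound(bucket_count, total_count):
    """Share of observations at or below a histogram bucket's le bound.

    bucket_count: value of performance_monitor_response_time_seconds_bucket
        for the chosen le bound (for example le="1.0").
    total_count: value of performance_monitor_response_time_seconds_count
        for the same method, resource, and status labels.
    Returns None when there are no observations yet.
    """
    if total_count <= 0:
        return None
    return bucket_count / total_count


def hsm_replication_at_risk(hsm_machines, expected_machines=2):
    """True when fewer HSM machines are connected than expected."""
    return hsm_machines < expected_machines


# Invented readings: 950 of 1000 getKey requests finished within one second.
print(fraction_within_bound(950, 1000))  # 0.95
print(hsm_replication_at_risk(1))        # True
```

Because histogram buckets are cumulative counters, dividing the raw values gives the share since the service started; dividing their rates over a time window in Prometheus gives the share for recent traffic instead.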