Solr Service Alerting - SearchStax
SearchStax® from Measured Search® provides two kinds of real-time email alerts:
- Heartbeat alerts: Notify a list of email recipients when a server starts or stops operating.
- Threshold alerts: Notify a list of email recipients when a server exceeds a performance threshold.
Both types of alerts create an "incident" report that you can inspect in the SearchStax dashboard.
Both Zookeeper and Solr send reports of system metrics to SearchStax at least once per minute. You can set up a "heartbeat" alarm to notify you if these reports should be delayed or interrupted. The system also notifies you when the updates resume.
You can configure the heartbeat alert to trigger when a set number of reports have been missed within a set amount of time. Both values are configurable, which lets you guard against false alarms caused by transient network delays.
Set up a Heartbeat Alert
To set up a heartbeat alert, open the SearchStax dashboard and navigate to a specific deployment.
Scroll the left-side menu down until you see the Alerting node. Expand this node and select Heartbeat. Click the New Heartbeat button.
- The host control offers a list of the servers in this deployment. Select one of them to monitor.
- Give the alert a name that you will recognize when you see it in email.
- The email will be triggered when SearchStax sees X Failures within an Interval of Y minutes. For instance, two failures within two minutes.
- Set Max notifications to the maximum number of emails you wish to receive about this alert. They are typically issued every two minutes.
- Send alerts to the email recipients in this list. Use the Add link to extend the list.
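The "X Failures within an Interval of Y minutes" rule can be pictured as a sliding time window over missed reports. The following Python sketch is only an illustration of that idea, not SearchStax's actual implementation; the function names and the use of timestamps in seconds are assumptions made for the example.

```python
from collections import deque


def make_heartbeat_checker(max_failures, interval_minutes):
    """Return a function that records a missed report and says whether to alert.

    Hypothetical helper: illustrates "X failures within Y minutes" only.
    """
    window_seconds = interval_minutes * 60
    missed = deque()  # timestamps (in seconds) of missed reports

    def record_missed(timestamp):
        missed.append(timestamp)
        # Forget misses that are older than the configured interval.
        while missed and timestamp - missed[0] > window_seconds:
            missed.popleft()
        # Alert once enough misses fall inside the window.
        return len(missed) >= max_failures

    return record_missed


# Two failures within two minutes, as in the example setting above.
check = make_heartbeat_checker(max_failures=2, interval_minutes=2)
first = check(0)    # one missed report: below the limit
second = check(60)  # a second miss one minute later: limit reached
```

With these settings a single missed report does not trigger the alert, which is how the window protects against brief network delays.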
Receive a Heartbeat Alert
A heartbeat email notification resembles this one:
Subject: Host 220.127.116.11 DOWN SearchStax notification
From: email@example.com
Date: 11/1/2016 4:46 PM
To: firstname.lastname@example.org
Hi there,
This is a notification sent by SearchStax.
Host 18.104.22.168 is DOWN
Log in to your account at https://searchstax.measuredsearch.com/admin/deployment/xxx/threshold/incident/update/528 to see further details and take the necessary actions.
Best regards,
Measured Search Team
You will receive a similar "UP" notification when the heartbeat is again detected.
View the Heartbeat Incident Report
Click the URL in the email to view the incident report. (Or use the SearchStax dashboard menu to visit Alerting > Incidents. Choose the current incident from the list.)
You'll see a brief description of the incident followed by a timeline of events. Read the timeline from the bottom up.
You may Close or Open each incident as many times as needed.
A "threshold" alert watches a specific system metric and sends you email when the metric meets or exceeds some specific value.
SearchStax allows you to monitor the following system metrics:
- System Load Average (os.SystemLoadAverage)
- Used Physical Memory (os.UsedPhysicalMemorySize)
- Used Swap Space (os.UsedSwapSpaceSize)
- JVM Heap Memory Usage (jvm.heapMemoryUsage.used)
- JVM Heap Memory Committed (jvm.heapMemoryUsage.committed)
- JVM Non-Heap Memory Used (jvm.nonHeapMemoryUsage.used)
- JVM Non-Heap Memory Committed (jvm.nonHeapMemoryUsage.committed)
- Number of JVM threads (jvm.ThreadCount)
SearchStax can also issue alerts on the following Solr metrics:
- Solr Indexing Errors (solr.index.errors)
- Solr Indexing Timeouts (solr.index.timeouts)
- Solr Indexing 5min rate (solr.index.5minRateReqsPerSecond)
- Search Average Requests/s (solr.search.avgRequestsPerSecond)
- Solr Search Errors (solr.search.errors)
- Solr Search Timeouts (solr.search.timeouts)
- Solr Search 5min rate (solr.search.5minRateReqsPerSecond)
- documentCache Evictions (solr.documentCache.evictions)
- fieldValueCache Evictions (solr.fieldValueCache.evictions)
- queryResultCache Evictions (solr.queryResultCache.evictions)
- filtercache Evictions (solr.filtercache.evictions)
- queryResultCache warmupTime (solr.queryResultCache.warmupTime)
- queryResultCache hitratio (solr.queryResultCache.hitratio)
- filtercache warmupTime (solr.filtercache.warmupTime)
- filtercache hitratio (solr.filtercache.hitratio)
- documentCache warmupTime (solr.documentCache.warmupTime)
- documentCache hitratio (solr.documentCache.hitratio)
- fieldValueCache warmupTime (solr.fieldValueCache.warmupTime)
- fieldValueCache hitratio (solr.fieldValueCache.hitratio)
Set up a Threshold Alert
To set up a threshold alert, open the SearchStax dashboard and navigate to a specific deployment.
Scroll the left-side menu down until you see the Alerting node. Expand this node and select Threshold. Click the Create New Check button.
- The Host Machine control offers a list of the servers in this deployment. Select one of them to monitor.
- Select a metric to monitor in the Metric Name list. Select an operator (=, >, <, >=, <=). Enter a numeric value. Indicate whether the value is simply an integer (3 SolrIndexingErrors) or an amount of memory (in kilobytes, megabytes, or gigabytes). The semantics will differ depending on the selected metric.
- If appropriate, indicate the Solr Collection you want to monitor on this server.
- Provide an Alert Name that you will recognize when you see it in email.
- The email will be triggered when the threshold condition has held for longer than the number of minutes you enter in Delay of at-least.
- Set Max Alerts to the maximum number of emails you wish to receive about this alert. Set Repeat Every to the number of minutes between those emails.
- Send Alerts To the email recipients in this list. Use the Add link to extend the list.
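The interaction between the operator, the value, and the delay can be sketched in a few lines of Python. This is an illustration under stated assumptions, not SearchStax's own logic: it assumes one metric sample per minute, and it treats the delay as the number of consecutive samples that must satisfy the condition. The function names are hypothetical.

```python
import operator

# The five comparison operators offered in the threshold form.
OPERATORS = {
    "=": operator.eq,
    ">": operator.gt,
    "<": operator.lt,
    ">=": operator.ge,
    "<=": operator.le,
}


def breached_minutes(samples, op, threshold):
    """Count how many of the most recent samples in a row satisfy the condition."""
    compare = OPERATORS[op]
    count = 0
    for value in reversed(samples):
        if not compare(value, threshold):
            break
        count += 1
    return count


def should_alert(samples, op, threshold, delay_minutes):
    """samples holds one reading per minute, oldest first (an assumption)."""
    return breached_minutes(samples, op, threshold) >= delay_minutes


# Load average stayed above 0.5 for the last two readings.
steady = should_alert([0.3, 0.6, 0.7], ">", 0.5, delay_minutes=2)
# A dip in the middle resets the count, so no alert yet.
spiky = should_alert([0.6, 0.3, 0.7], ">", 0.5, delay_minutes=2)
```

The delay plays the same role here that the failure window plays for heartbeat alerts: a brief spike that clears within the delay period does not produce an email.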
Receive a Threshold Alert
A threshold email notification resembles this one:
Subject: OPENED SearchStax incident #523 for System Load Average
From: email@example.com
Date: 10/31/2016 2:58 PM
To: firstname.lastname@example.org
Hi there,
This is a notification sent by SearchStax.
Incident 523, System Load Average, has been opened.
Log in to your account at https://searchstax.measuredsearch.com/admin/deployment/xxx/threshold/incident/update/523 to see further details and take the necessary actions.
Best regards,
Measured Search Team
View the Threshold Incident Report
Threshold incidents appear in the same incident list as the Heartbeat incidents.
Click the URL in the email to view the incident report.
Alerting Tips and Tricks
Here are a few notes about setting up specific types of alerts.
CPU Utilization / System Load
Set up a threshold alert monitoring the System Load Average. For example, you can configure an alert that triggers when the System Load Average metric is greater than 0.5 for more than one minute and that sends five emails at two-minute intervals.
Free Memory
There is no direct metric of free memory, but you can monitor Used Physical Memory along with more specific usage metrics such as JVM heap and non-heap memory.
Average Search Latency
There is no direct metric for search latency. You can monitor Solr Search 5-minute Rate, setting alerts for both high and low rates, to alert you when search behavior becomes atypical.
Commits per Minute
There is no metric that reports commits per time unit. The information is present in your solr.log file.
$ grep "start commit" solr.log
A glance at the time stamps will answer your question.
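If you would rather have a count than read the time stamps by eye, a short script can group the matching lines by minute. The sketch below assumes each log line begins with a timestamp in the form YYYY-MM-DD HH:MM:SS.mmm; check your own solr.log, because the layout depends on your logging configuration. The sample lines are simplified and made up for illustration.

```python
import re
from collections import Counter

# Assumed timestamp layout at the start of each line: "YYYY-MM-DD HH:MM:SS.mmm".
# Only the part up to the minute is captured.
MINUTE = re.compile(r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2})")


def commits_per_minute(lines):
    """Count lines containing "start commit", grouped by minute."""
    counts = Counter()
    for line in lines:
        if "start commit" not in line:
            continue
        match = MINUTE.match(line)
        if match:
            counts[match.group(1)] += 1
    return counts


# Made-up, simplified sample lines (not real Solr output).
sample = [
    "2016-11-01 16:46:05.120 INFO start commit{}",
    "2016-11-01 16:46:41.007 INFO start commit{}",
    "2016-11-01 16:47:02.555 INFO end_commit_flush",
    "2016-11-01 16:47:30.001 INFO start commit{}",
]
counts = commits_per_minute(sample)
for minute, total in sorted(counts.items()):
    print(minute, total)
```

To run it against a real log, pass an open file object instead of the sample list, for example the result of open("solr.log").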
Cache Warm Up Time
There are four cache warmup metrics.
- queryResultCache warmupTime
- filtercache warmupTime
- documentCache warmupTime
- fieldValueCache warmupTime
Five-Minute Requests per Second
Monitor the Solr Search 5-minute Rate metric. Excessively high (or low) rates may mean that you need to add (or remove) servers.
Search Errors per Minute
Monitor Solr Search Errors and set the threshold and delay values to create an appropriate rate-per-minute.
Indexing Errors per Minute
Monitor Solr Indexing Errors and set the threshold and delay values to create an appropriate rate-per-minute.