Anomaly Reporting powered by Artificial Intelligence

Site24x7's AI-powered Zia framework for anomaly detection uses the Robust Principal Component Analysis (RPCA) and matrix sketching algorithms to detect unusual spikes or aberrations in your monitor's critical performance attributes, such as response time, CPU used percent, and memory utilization. It then notifies you about such spikes in a detailed tabular or graphical dashboard inside the web client and via alert emails. All your KPIs are compared against seasonal benchmark values. The Anomaly Report helps you fine-tune your resource performance and safeguard your infrastructure from unforeseen issues. You can share anomalies with your team by generating a CSV or PDF, or via email.

Overview

The idea behind anomaly detection for a monitored metric is to identify any unusual spikes or aberrations in a given series. Any monitoring measure for which anomaly detection is enabled is treated as a time series, that is, it is polled at uniform time intervals. Relying on static mathematical inequalities, such as fixed thresholds, does not give contextually consistent results in the long run. Artificial Intelligence (AI) addresses this with an approach that aims to detect an anomaly as soon as it occurs.

An AI-based approach has distinct characteristics, such as:

  • Smoothing of Trends: Trend handling captures the overall direction of the pattern (rise or fall).
  • Handling Seasonality: Seasonality is the pattern structure that keeps recurring more or less in each time frame.
  • Robustness: The approach is immune to insignificant performance spikes.

Predict Trends with Anomaly Engine

The anomaly engine's cycle consists of several stages: processing incoming data from the data collectors against the AI training data, generating a confirmed anomaly, and sending a notification about the anomaly. The Anomaly Engine uses both a quantitative and a qualitative comparison model for anomaly detection. Prediction with the Anomaly Engine involves two stages:

  1. Anomaly Event Generation
  2. Domain Scoring to Determine Anomaly Severity

Anomaly Event Generation

The main purpose of this stage is to perform the heavyweight processing and generate “Events”. The anomaly detection engine collects metrics every 15 minutes from the Site24x7 data collector agents. For Univariate Anomaly Detection, this data is compared against the training data for the machine learning model, which consists of the hourly 95th percentile values for the same day of the week over the last two weeks. For example, if Friday's data is sent for anomaly detection, the values from the Fridays of the last two weeks are used as the training data. This captures seasonality in the data. The 95th percentile is used for training in order to remove extreme values: the top 5% of values are discarded, which also removes any unusual spikes from the training data.
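
The selection of training data described above can be sketched as follows. This is a minimal illustration, not Site24x7's implementation: the function names `percentile` and `training_baseline`, the `(timestamp, value)` sample format, and the choice of linear interpolation for the percentile are all assumptions made for the example.

```python
from collections import defaultdict
from datetime import date, datetime, timedelta
from typing import Dict, Iterable, Tuple


def percentile(values, pct):
    """Return the pct-th percentile of values using linear interpolation."""
    ordered = sorted(values)
    if not ordered:
        raise ValueError("percentile() needs at least one value")
    rank = (len(ordered) - 1) * pct / 100
    lower = int(rank)
    upper = min(lower + 1, len(ordered) - 1)
    return ordered[lower] + (ordered[upper] - ordered[lower]) * (rank - lower)


def training_baseline(
    samples: Iterable[Tuple[datetime, float]],
    target_day: date,
    weeks: int = 2,
) -> Dict[Tuple[date, int], float]:
    """Hourly 95th percentiles for the same weekday in each of the previous weeks.

    Hypothetical helper: keeps only samples that fall on the same day of the
    week as target_day, one and two weeks earlier, and reduces each hour of
    each of those days to its 95th percentile.
    """
    training_days = {target_day - timedelta(weeks=w) for w in range(1, weeks + 1)}
    buckets = defaultdict(list)
    for timestamp, value in samples:
        if timestamp.date() in training_days:
            buckets[(timestamp.date(), timestamp.hour)].append(value)
    return {key: percentile(vals, 95) for key, vals in sorted(buckets.items())}
```

Calling `training_baseline(samples, date(2024, 1, 19))` for a Friday would use only the samples from the two preceding Fridays and ignore every other day.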

For Multivariate Anomaly Detection, Site24x7's data collection agents likewise push data to the anomaly detection platform every 15 minutes. The last two weeks' hourly 95th percentile values for correlated attributes are used to train the algorithm. If a combination of attributes is detected as an anomaly, the attributes that contribute to the anomaly are then determined.

Based on the comparison against the training data, the events are then generated and defined as L1, L2, and L3 values, with the L3 values having the highest chance of being an anomaly.
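
To make the idea of graded events concrete, here is a small sketch that maps the deviation of an observed value from its expected value to an event level. The percentage cutoffs and the function name `classify_event` are purely hypothetical; this article does not document how the engine actually derives L1, L2, and L3.

```python
def classify_event(observed, expected, cutoffs=(25.0, 50.0, 100.0)):
    """Return "L1", "L2", "L3", or None for a single data point.

    The cutoffs are illustrative assumptions (percentage deviation from the
    expected value), not values used by the product.
    """
    if expected <= 0:
        raise ValueError("expected must be positive")
    deviation_pct = abs(observed - expected) / expected * 100
    level = None
    for name, cutoff in zip(("L1", "L2", "L3"), cutoffs):
        if deviation_pct >= cutoff:
            level = name  # keep the highest level whose cutoff is reached
    return level
```

With these assumed cutoffs, a point that deviates only slightly from its baseline produces no event, while larger deviations move from L1 toward L3.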

Domain Scoring to Determine Anomaly Severity

This stage adds a qualitative model to anomaly generation by also considering anomalies seen in dependent monitors. Events are summed up to give a score, based on which the “Severity of Anomaly” is decided. When an anomaly scoring task is scheduled upon the occurrence of an anomaly, the anomaly engine checks whether any dependent monitor had an anomaly during the last 30 minutes. Scores are given to individual monitors based on the attributes that cause the monitor to be anomalous and the percentage by which these attributes deviate from the expected values.

The following factors are usually considered, in the order listed, for the final score determination:

  • Another attribute of the same monitor is detected as anomalous
  • Dependent monitors are detected with an anomaly
  • Parent/child monitors are anomalous
  • Monitors grouped under the same Monitor Group are detected as anomalous
  • Other monitors with the same Tags (user-defined tags) have an anomaly
  • Monitors with the same server name or the same Fully Qualified Domain Name (FQDN) have an anomaly
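
The scoring idea above can be sketched roughly as follows. The weights, the relation names, and the `domain_score` function are hypothetical and chosen only to mirror the order of the list; applying the 30-minute window to every relation type, not just dependent monitors, is also an assumption of this sketch.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta

WINDOW = timedelta(minutes=30)

# Hypothetical weights, ordered like the list above. The real values are not
# documented in this article.
RELATION_WEIGHTS = {
    "same_monitor_attribute": 6,
    "dependent_monitor": 5,
    "parent_child": 4,
    "same_monitor_group": 3,
    "same_tag": 2,
    "same_server_or_fqdn": 1,
}


@dataclass(frozen=True)
class RelatedAnomaly:
    relation: str          # one of the keys in RELATION_WEIGHTS
    detected_at: datetime  # when the related anomaly was seen


def domain_score(scored_at, related):
    """Sum the weights of related anomalies seen in the 30 minutes before scoring."""
    score = 0
    for item in related:
        age = scored_at - item.detected_at
        if timedelta(0) <= age <= WINDOW:
            score += RELATION_WEIGHTS[item.relation]
    return score
```

A higher score would then point toward a higher severity; how the score is translated into a severity is not covered by this sketch.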

You can read our Kbase article to know about the various cases used for domain scoring and severity benchmarking.  

Finally, based on factors such as domain scores, dependencies, and the increasing gravity of the detected anomaly, the severity of an anomaly is classified into three levels:

  • Confirmed Anomaly: It highlights a negative trend that occurs persistently. When a confirmed anomaly repeats over a longer period, it points to an imminent outage. Hence, a repetitive confirmed anomaly needs your utmost attention.
  • Likely Anomaly: You must keep a close watch on such a trend, as it might lead to an outage in the long term.
  • Info: This is just a notification for the user; the trend should still be monitored closely in order to avoid future issues.

AI-Based Threshold Profiles

An AI-based threshold profile uses anomaly detection to determine the status of a monitor. It is a dynamic threshold approach, unlike static thresholds. In a static threshold profile, you have to set hard-coded thresholds to determine the status of a monitor, and you are notified of an issue only when those thresholds are breached.

For AI-based thresholds, you cannot set any hard-coded threshold. Instead, dynamic thresholds are updated according to the monitor's behavior. So, in case of an issue, you are notified as soon as the issue starts, rather than only after a static threshold is breached. Besides being dynamic, this approach eliminates the need to set poll strategies. With static thresholds, poll strategies are essential to avoid alerting on intermittent spikes. With anomaly detection, spike bursting is done to avoid reporting intermittent spikes as anomalies, so hard-coded poll strategies are not needed.
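
One simple way to keep intermittent spikes from being reported is to require that a deviation persists across consecutive polls. The sketch below illustrates that general idea only; it is not a description of how the product implements spike bursting, and the function name and the `min_consecutive` parameter are assumptions.

```python
def sustained_anomalies(flags, min_consecutive=2):
    """Return the poll indexes at which an anomaly has persisted long enough.

    flags is a sequence of booleans, one per poll, where True means the poll
    deviated from its expected value. Isolated True values are ignored.
    """
    confirmed = []
    run = 0
    for index, flagged in enumerate(flags):
        run = run + 1 if flagged else 0
        if run >= min_consecutive:
            confirmed.append(index)
    return confirmed
```

In this sketch a single deviating poll surrounded by normal polls is never reported, which is the effect a poll strategy achieves for static thresholds.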

How It Works

You can select a Static Profile or an AI-Based Profile from the existing Threshold Profile form. If you choose "AI-Based profile", the severity selection option is shown for the attributes for which anomaly detection has been enabled. For attributes that don't have anomaly detection enabled, static threshold settings are shown irrespective of the profile type selected. A combination of static and AI-based profiles cannot be selected. AI-based settings are available only for those attributes for which the anomaly option has been enabled.

In an AI threshold profile:

  • Each attribute has two severity options, Likely and Confirmed, which represent anomaly severities. If the Likely severity is set to Trouble, it means: "if there is a Likely anomaly in the attribute, then make the monitor status Trouble". The same applies to Confirmed. However, both severities cannot be mapped to the same status change.
  • Each attribute also has an Automation option, which can be mapped to the required action if there is a Likely or Confirmed anomaly.
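
The rules above can be modeled as a small data structure. This is only an illustrative sketch: the class name `AttributeSeverityRule`, its field names, and the use of "Critical" as a second status are assumptions, since this article names only the Trouble status.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class AttributeSeverityRule:
    """Hypothetical model of one attribute's settings in an AI threshold profile."""

    attribute: str
    likely_status: str                 # status applied on a Likely anomaly
    confirmed_status: str              # status applied on a Confirmed anomaly
    automation: Optional[str] = None   # optional action mapped to the attribute

    def __post_init__(self):
        # Both severities cannot map to the same status change.
        if self.likely_status == self.confirmed_status:
            raise ValueError("Likely and Confirmed must map to different statuses")

    def status_for(self, severity):
        """Return the monitor status for a severity, or None if it has no mapping."""
        mapping = {"Likely": self.likely_status, "Confirmed": self.confirmed_status}
        return mapping.get(severity)


rule = AttributeSeverityRule("Response Time", "Trouble", "Critical")
```

Here `rule.status_for("Likely")` yields the status configured for a Likely anomaly, and constructing a rule with identical statuses for both severities is rejected.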

Interpret Anomaly Dashboard

The Anomaly Dashboard lets you easily decode any negative trends in your IT infrastructure beforehand. You can find and filter the Anomalies based on monitor or monitor group selection.

Follow the steps below to view and interpret the Anomaly Dashboard:

  1. Log in to your Site24x7 account.
  2. Navigate to Home > Anomaly Dashboard.
  3. Use the Time Period picker to select a time span ranging from the last 1, 6, 12, or 24 hours up to a year back. You can find anomalies by looking up the Monitor/Group names in the Search Bar.
  4. Additionally, you can categorize anomalies based on the Severity Levels: Confirmed, Likely, and Info.

     Once the dashboard is generated, you can click the Share This button in the top right corner to share the report via email or to generate a CSV or PDF to share with your teammates. Email can be sent only to verified users who have agreed to receive emails from Site24x7.

The dashboard offers a Split View in which all your monitors and monitor groups are listed on the left side. On the right side, you can view the Anomaly Summary graph for the requested time period and the specific reason for each detected anomaly (listed under Anomaly History). You can find anomalies by looking up the Monitor/Group name in the search field or by filtering on Severity levels. The Anomaly Summary graph displays the anomaly count of monitors/monitor groups for each day during the selected time period, using a stacked bar graph. Individual anomalies are listed under the Anomaly History section, each with a detailed message and its relevant severity flag. This Anomaly Description lets you gather in-depth details about the anomaly trend. To gather further insights into the root cause of performance issues, click the hyperlink provided along with the Anomaly Description.

The Anomaly Summary graph displays the data for the selected time period. However, if the anomaly count crosses 100, data is shown in the graph only for those days, and the data for the remaining days is ignored.
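
The per-day, per-monitor counts behind a stacked bar graph like the Anomaly Summary can be sketched as a simple aggregation. The `(monitor_name, detected_at)` input format and the function name `daily_counts` are assumptions for this example; the sketch does not model the 100-anomaly limit.

```python
from collections import Counter, defaultdict
from datetime import date, datetime
from typing import Dict, Iterable, Tuple


def daily_counts(
    anomalies: Iterable[Tuple[str, datetime]],
) -> Dict[date, Dict[str, int]]:
    """Count anomalies per day and per monitor.

    Each day maps to one stacked bar, and each monitor within that day
    maps to one segment of the bar.
    """
    counts = defaultdict(Counter)
    for monitor, detected_at in anomalies:
        counts[detected_at.date()][monitor] += 1
    return {day: dict(per_monitor) for day, per_monitor in sorted(counts.items())}
```

Dropping a monitor's key from each day's dictionary corresponds to hiding that monitor through its legend entry.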

The legends shown in the graph for each individual monitor also act as filters. You can use them to remove or re-insert specific monitors in the bar chart.

Root Cause Report

Once you click the Know the root cause link for a specific anomaly message, a modal pop-up window opens with a line graph of the metric values. Hover over the line graph to view the actual metric value for a specific date and time. The default metric may vary from monitor to monitor: every monitor has one or more default attributes for which anomaly detection is enabled. In addition, you can use the drop-down above the line graph to view other performance attributes of the selected monitor during the same time range.

Enabling/Disabling Anomaly Alerts

By default, anomaly alerts are disabled. You can enable anomaly alerts by navigating to Admin > User & Alert Management > User Alerts > Edit/Add User > Alert Settings > Enable email for anomaly. Get to know more about users and alerts.

Anomaly Detection: List of Enabled Monitors and Corresponding Performance Attributes 

For most monitors, anomaly detection is enabled for certain metrics by default. Here is a list of these monitors and the performance attributes for which anomaly detection is enabled.

 

Monitor Type                    Performance Attribute
Website                         Response Time
DNS Server                      Response Time
FTP Transfer                    Response Time
Web Page Speed (Browser)        Response Time
Ping                            Response Time
FTP Server                      Response Time
Port (Custom Protocol)          Response Time
POP Server                      Response Time
SMTP Server                     Response Time
Web Transaction (Browser)       Response Time
Web Transaction                 Response Time
Mail Delivery Monitor           Response Time
REST API Monitor                Response Time
SOAP Web Service Monitor        Response Time

Microsoft Hyper-V Server

Health Critical VMs,
Logical Processors,
Virtual Processors,
VM Bus Interrupts Received,
VM Bus Throttle Events,
VM Bus Interrupts Received per sec,
VM Bus Interrupts Sent per sec,
Logical Processor Guest Runtime,
Logical Processor Hypervisor Runtime,
Logical Processor Total Runtime,
Root Virtual Processor, Pagefault Intercepts,
Virtual Processor Emulated Instructions,
Virtual Processor MSR Accesses,
Virtual Switch Bytes Per Second,
Virtual Switch Packets Per Second,
Virtual Switch Bytes Sent Per Second,
Memory demand 

Microsoft Failover Cluster 

Outstanding Messages,
Resource Host Subsystem Process Restarts,
Resource Host Subsystem Process,
Used Space,
Bytes received,
Bytes Sent,
Messages received,
Messages sent,
Resources Offline,
Normal Messages Queue Length,
Urgent Message Queue Length,
Reconnect Count,
Used (MB),
Used Space,
Resource Failures,
Resource Failures - Access Violation,
Resource Failures - DeadLock

Microsoft Office 365 

Group Created,
Group Deleted,
Inactive Mailbox,
Over Warning Size,
Used Less than 25%,
Inbound,
Outbound,
Active Lync Users,
Web Conferences,
Telephony Conferences,
IM Conferences,
AV Conferences,
Application Sharing Conferences,
Audio Sessions,
File Transfer Sessions,
IM Sessions,
Video Sessions,
Application Sharing Sessions,
Unique Share Point Users,
Licenses Assigned,
Licenses Acquired,
Active Deployment,
Inactive Deployment,
Used Size 

Plugins

All Attributes

APM Insight - Application

Response Time,
Error Count,
Fatal Count,

Response time, request count and failed count for individual components

Exception count of individual exceptions

APM Insight Instance

Response Time,
Error Count,
Fatal Count,

Response time, request count and failed count for individual components

Exception count of individual exceptions

RUM Monitor

Application Throughput,
Location Throughput,
Browser Throughput,
Browser Front-end time,
Browser Error Percentage,
Location Network time,
Application Backend time

Classic Load Balancer

Latency,
Request Count

Application Load Balancer

Latency,
Request Count

Network Load Balancer

Processed Bytes,
Consumed LCU Sum

Simple Notification Service

Number of messages published,
Publish Size,
SMS Success Rate

Simple Storage Service (S3)

Bucket Size,
Number of objects,
All Requests

AWS Lambda

Invocations (Sum),
Errors (Sum),
Duration (Sum),
Throttles (Sum)

Elastic MapReduce

Jobs Failed,
Apps Failed,
S3 Bytes Read,
S3 Bytes Written,
HDFS Bytes Read,
HDFS Bytes Written,
Steps Failed

Web Application Firewall (WAF)

Allowed requests,
Blocked requests,
Counter requests,
Passed requests

Neptune Instance

CPU Utilization,
Volume Bytes Used,
Freeable Memory

Neptune Cluster

CPU Utilization,
Volume Bytes Used,
Freeable Memory

Lightsail Instance

CPU Utilization,
Network In,
Network Out

Amazon GuardDuty

Finding per day,
High

EC2 Server Instance Monitor

CPU Usage,
Network in (Number of Bytes received),
Network out (Number of Bytes sent)

RDS Instance Monitor

CPU Usage,
Free Storage,
Database connection(s)

Microsoft IIS Server 

Queued Request,
Application Restarts,
Bytes received,
Bytes Sent,
Network Statistics,
Cache Total Entries,
Cache API Hit Ratio,
Cache API Turnover Rate,
Cache % Machine Memory Limit Used,
Session SQL Server connections total,
Processor Time(%),
IO Data Operations/sec,
Thread count,
Private/virtual memory (MB),
No of Physical Threads 

Microsoft Exchange Server

DB Cache Size,
Page request fulfilled by Cache,
DB Cache Page Fault stall per sec,
I/O DB reads avg latency,
I/O DB writes avg latency,
IO Log Write per sec,
RPC Response time,
RPC operations,
RPC no. of slow packets,
Slow QP Threads,
Active User Count,
RPC Requests,
Active Client Count,
Hub RPC request sent per sec

Microsoft SQL Server 

Connection,
Logins,
Batch Requests,
SQL Compilations,
Replication Merge Conflicts,
Target Server Memory,
Total Server Memory,
SQL Cache Memory,
Optimizer Memory,
Granted Workspace Memory,
Memory Grants Pending,
Checkpoint Pages,
Lazy Writes,
Page Reads,
Page Writes,
Page Splits,
Full Scans,
Probe Scans,
Range Scans,
Number of Errors,
Plan Cache Hit Ratio,
Cache Pages,
Cache Objects,
Queued Jobs,
Failed Jobs,
Lock Requests,
Lock Timeouts,
Deadlocks 

Server Monitor 

CPU Usage,
Memory Usage,
Used Physical Memory,
Used Swap Memory,
Memory Pages In,
Memory Page Faults,
15 min Avg,
Disk Reads,
Disk Writes,
Context Switches,
Processor Interrupts

Microsoft SharePoint Server

Active requests,
Active Sessions,
Average time to process request,
Get Data Requests Per Sec,
Insert Requests Per Sec,
Update Requests Per Sec,
Delete Requests Per Sec,
Failed Insert Requests Per Sec,
Average Data Fetch Duration,
Average Insert Duration,
Average Update Duration,
Average time of form session,
Transaction Completed Rate,
Request Processing Time,
VISIO Requests Per Sec,
Errors Per Sec,
Requests in Queue,
Rejected Requests,
ASP Requests Per Sec,
Current Sessions,
Transactions Started Per Sec,
Pending Transactions,
site collection warning count in contentdb,
Queries Failed,
Queries Succeeded
