class: center, middle, inverse, title-slide

# oddstream and stray: Anomaly Detection in Streaming Temporal Data with R
## Priyanga Dilini Talagala, with Rob J. Hyndman, Kate Smith-Miles
### Monash University, Australia
11.07.2018

---
background-image: url(https://raw.githubusercontent.com/pridiltal/pritalks/master/USER2018/USER2018%20talk/fig/2_application.png?token=ATXvCrnMUdIGRrbTsO5fn2Y6zG4-0Uvtks5bTMsLwA%3D%3D)
background-position: 50% 50%
background-size: 100%
class: right, top

## Motivation

---
class: middle, center

<img src="fig/13_logo2.png" width="100%" style="display: block; margin: auto;" />

---

### Feature Based Representation of Time Series

.pull-left[
- Mean
- Variance
- Changing variance in remainder
- Level shift using rolling window
- Variance change
- Strength of linearity
- Strength of curvature
]

.pull-right[
- Strength of spikiness
- Burstiness of time series (Fano Factor)
- Minimum
- Maximum
- The ratio between 50% trimmed mean and the arithmetic mean
- Moment
- Ratio of means of data that is below and above the global mean
]

---
class: top

### Feature Based Representation of Time Series

`devtools::install_github("pridiltal/oddstream")`

`tsfeatures <- oddstream::extract_tsfeatures(train_data)`

.pull-left[
<img src="fig/3_batch.png" width="100%" style="display: block; margin: auto;" />

<img src="fig/oddstream1.png" width="30%" style="display: block; margin: auto;" />

<span style="color:blue">O</span>utlier <span style="color:blue">D</span>etection in <span style="color:blue">D</span>ata <span style="color:blue">STREAM</span>s
]

.pull-right[
<img src="fig/tsfeatures.png" width="100%" style="display: block; margin: auto;" />
]

---
class: top

### Feature Based Representation of Time Series

`devtools::install_github("pridiltal/oddstream")`

`tsfeatures <- oddstream::extract_tsfeatures(train_data)`

.pull-left[
<img src="fig/3_batch.png" width="100%" style="display: block; margin: auto;" />

<img src="fig/oddstream1.png" width="30%" style="display: block; margin: auto;" />

<span style="color:blue">O</span>utlier <span style="color:blue">D</span>etection in <span style="color:blue">D</span>ata <span style="color:blue">STREAM</span>s
]

.pull-right[
<img src="fig/5_high_typical.gif" width="100%" style="display: block; margin: auto;" />
]

---

### Main Contributions

- Propose a framework that provides early detection of anomalies within a large collection of streaming time series data

--

- Propose an algorithm that adapts to non-stationarity (concept drift)

--

### Main Assumptions

- We define an anomaly as an observation that is very unlikely given the recent distribution of a given system

--

- A representative data set of the system's typical behaviour is available to define the model for the typical behaviour of the system

--

### Proposed Algorithm

- Off-line Phase: Building a model of a system's typical behaviour (similar to Clifton, Hugueny & Tarassenko, 2011)

--

- On-line Phase: Testing newly arrived data using the boundary of the typical behaviour

---
class: top

### Dimension Reduction for Time Series

.pull-left[
`load(train_data)`

<img src="fig/4_typical.png" width="90%" style="display: block; margin: auto;" />
]

--

.pull-right[
`tsfeatures <- oddstream::extract_tsfeatures(train_data)`

<img src="fig/5_high_typical.gif" width="60%" style="display: block; margin: auto;" />
]

--

`pc <- oddstream::get_pc_space(tsfeatures)` <br/> `oddstream::plotpc(pc$pcnorm)`

<img src="fig/6_typicalfeature.png" width="25%" style="display: block; margin: auto;" />

First two PCs explain 85% of variation

---

### Anomalous threshold calculation

- Estimate the probability density function of the 2D PC space `\(\longrightarrow\)` Kernel density estimation

--

- Draw a large number N of extremes `\(\left(\arg\min_{x \in X}[f_{2}(x)]\right)\)` from the estimated probability density function

--

- Define a `\(\Psi\)`-transform space, using the `\(\Psi\)`-transformation defined by Clifton et al. (2011)

<img src="fig/10_psitrans.png" width="50%" style="display: block; margin: auto;" />

- The `\(\Psi\)`-transform maps the density values into a space in which a Gumbel distribution can be fitted.
--

- Anomalous threshold calculation `\(\longrightarrow\)` extreme value theory

---
class: center, top

`oddstream::find_odd_streams(train_data, test_stream)`

<img src="fig/18_oddstream_mvtsplot.gif" width="50%" style="display: block; margin: auto;" />

.pull-left[
<img src="fig/16_oddstream_out_loc.gif" width="100%" style="display: block; margin: auto;" />
]

.pull-right[
<img src="fig/17_oddstream_pcplot.gif" width="100%" style="display: block; margin: auto;" />
]

---
class: center, middle, inverse

# Anomaly Detection with <br/> <span style="color:#ff08ac"> Non-stationarity </span>

---

#### Anomaly detection with non-stationarity

<img src="fig/19_nonstationaritytypes.png" width="70%" style="display: block; margin: auto;" />

---

### Anomaly detection with non-stationarity

<img src="fig/20_suddenplot2.png" width="100%" style="display: block; margin: auto;" />

<img src="fig/21_noCD1.png" width="35%" style="display: block; margin: auto;" />

---

### Anomaly detection with non-stationarity

<img src="fig/20_suddenplot3.png" width="100%" style="display: block; margin: auto;" />

<img src="fig/21_noCD2.png" width="35%" style="display: block; margin: auto;" />

---

### Anomaly detection with non-stationarity

<img src="fig/20_suddenplot4.png" width="100%" style="display: block; margin: auto;" />

<img src="fig/21_noCD3.png" width="35%" style="display: block; margin: auto;" />

---

### Anomaly detection with non-stationarity

<img src="fig/20_suddenplot2.png" width="100%" style="display: block; margin: auto;" />

<img src="fig/22_conceptdrift_pval.png" width="100%" style="display: block; margin: auto;" />

- `\(H_{0} : f_{t_{0}} = f_{t_{t}}\)`
- Squared discrepancy measure `\(T = \int [f_{t_{0}}(x) - f_{t_{t}}(x)]^{2}\,dx\)` (Anderson et al., 1994)

---

### Anomaly detection with non-stationarity

`oddstream::find_odd_streams(train_data, test_stream, concept_drift = TRUE)`

<img src="fig/23_sudden_out.png" width="90%" style="display: block; margin: auto;" />

---
class: middle, center

<img src="fig/P2_plot20.png" width="100%" style="display: block; margin: auto;" />

---

### Main Contributions

- Propose a framework to detect anomalies in high dimensional data. Our proposed algorithm addresses the limitations of the HDoutliers algorithm (Wilkinson, 2018).

--

- Propose an algorithm to detect anomalies in streaming temporal data

--

### Main Assumptions

- We define an anomaly as an observation that deviates markedly from the majority with a large distance gap.

---

<img src="fig/P2_plot5.png" width="60%" style="display: block; margin: auto;" />

- Normalize the columns of the data (median and IQR).
- This prevents variables with large variances from having a disproportionate influence on Euclidean distances.

---

<img src="fig/P2_plot6.png" width="60%" style="display: block; margin: auto;" />

- Leader Algorithm (Hartigan, 1975)
- `\(r = \frac{1}{2}(1/n)^{1/d}\)`: expected distance between data points in a d-dimensional space, where n is the sample size (Kantardzic, 2011)

---

<img src="fig/P2_plot7.png" width="60%" style="display: block; margin: auto;" />

---

<img src="fig/P2_plot9.png" width="60%" style="display: block; margin: auto;" />

- Select the k-nearest-neighbour distance with the maximum gap

--

- Sort the resulting k-nearest-neighbour distances

--

- Define an anomalous threshold using Extreme Value Theory (bottom-up searching algorithm proposed by Schwarz, 2008)

---

`devtools::install_github("pridiltal/stray")` <br/> `outliers <- stray::find_HDoutliers(data)` <br/> `stray::display_HDoutliers(data, outliers)`

<img src="fig/P2_plot10.png" width="60%" style="display: block; margin: auto;" />

---

### Identify anomalous series within a large collection of time series

- Use a moving window to deal with streaming data
- Extract time series features from each window
- Apply the stray algorithm to identify anomalous series

.pull-left[
<img src="fig/P2_plot22.png" width="100%" style="display: block; margin: auto;" />
]

.pull-right[
<img src="fig/stray.gif" width="70%" style="display: block; margin: auto;" />
]

`tsfeatures <- oddstream::extract_tsfeatures(ts_data)` <br/> `outliers <- stray::find_HDoutliers(tsfeatures)` <br/> `stray::display_HDoutliers(tsfeatures, outliers)`

---
class: center

.pull-left[
.Large[<span style="color:blue">`stray`</span>]

<img src="fig/P2_plot21a.png" width="80%" style="display: block; margin: auto;" />

- Definition: distance
- No training set needed
]

.pull-right[
.Large[<span style="color:blue">`oddstream` </span>]

<img src="fig/P2_plot21b.png" width="80%" style="display: block; margin: auto;" />

- Definition: density
- Needs a training set
]

---
class: center, middle

# Thank You

.pull-left[
<img src="fig/oddstream1.png" width="45%" style="display: block; margin: auto;" />

`devtools::install_github` <br/> `("pridiltal/oddstream")`

Full paper available at: [https://robjhyndman.com/papers/oddstream.pdf](https://robjhyndman.com/papers/oddstream.pdf) <br/><br/>
]

.pull-right[
<img src="fig/stray-logo.png" width="45%" style="display: block; margin: auto;" />

`devtools::install_github` <br/>`("pridiltal/stray")`
]
dilini.talagala@monash.edu
pridiltal
@pridiltal
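
---

### Appendix: the distance-gap idea in base R

The stray slides describe three steps: normalize the columns with median and IQR, compute nearest-neighbour distances, and look for a large gap in the sorted distances. The sketch below illustrates that idea on toy data using base R only. It is not the stray implementation: the helper name `normalise_robust`, the toy data, the choice of k = 1, and the "largest gap" cut-off are all assumptions made for illustration, and the cut-off stands in for the extreme value theory threshold.

```r
# Illustrative sketch only (base R); not the stray package's code.
set.seed(1)

# Toy data: 50 typical points plus one distant point in row 51.
x <- rbind(matrix(rnorm(100), ncol = 2), c(10, 10))

# Step 1: robust column scaling with median and IQR
# (hypothetical helper name).
normalise_robust <- function(m) {
  apply(m, 2, function(col) (col - median(col)) / IQR(col))
}
z <- normalise_robust(x)

# Step 2: nearest-neighbour (k = 1) Euclidean distance for every point.
d <- as.matrix(dist(z))
diag(d) <- Inf                      # ignore each point's distance to itself
nn_dist <- unname(apply(d, 1, min))

# Step 3: sort the distances and cut at the largest gap.
# Assumption: a plain "largest gap" rule replaces the EVT-based threshold.
ord <- order(nn_dist)
gaps <- diff(nn_dist[ord])
cut <- which.max(gaps)
candidates <- ord[(cut + 1):length(ord)]

print(candidates)                   # row index of the flagged point(s)
```

On this toy data the only flagged row is the distant point in row 51. On real data a single largest-gap rule can be too crude, which is one reason to prefer a threshold derived from extreme value theory, as in the slides.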