class: center, middle, inverse, title-slide

# A feature-based framework for detecting technical outliers in water-quality data

###
Priyanga Dilini Talagala
with
Rob J. Hyndman
Catherine Leigh
Kerrie Mengersen
Kate Smith-Miles ###
18.06.2019

---
class: clear

This work is based on a collaborative research project carried out with the **Queensland University of Technology** and the **Queensland Department of Environment and Science**, Great Barrier Reef Catchment Loads Monitoring Program, Australia.

<img src="fig/sensor.png" width="100%" style="display: block; margin: auto;" />

---
# Motivation

- Water-quality sensors are exposed to changing environments and extreme weather conditions

--

- Two types of anomalies:

--

  1. Water-quality breaches associated with real events

--

  2. Technical issues in the sensor equipment (low battery power, biofouling of the probes, calibration errors, rust, sensor maintenance activities, etc.)

<img src="fig/sensor_issues.png" width="100%" style="display: block; margin: auto;" />

---
# Motivation

- Water-quality sensors are exposed to changing environments and extreme weather conditions
- Two types of anomalies:
  1. Water-quality breaches associated with real events <br>
  2. <span style="color:red">Technical issues in the sensor equipment</span> (low battery power, biofouling of the probes, calibration errors, rust, sensor maintenance activities, etc.)
<img src="fig/sensor_issues.png" width="100%" style="display: block; margin: auto;" />

---
# What is an anomaly?

- Water-quality observations that were affected by <span style="color:red">technical errors</span> in the sensor equipment

<img src="fig/water_original.png" width="100%" style="display: block; margin: auto;" />

---
# What is an anomaly?

- Water-quality observations that were affected by <span style="color:red">technical errors</span> in the sensor equipment

<img src="fig/water_out.png" width="100%" style="display: block; margin: auto;" />

---
# Materials and Methods

- **Study region**: two study sites in tropical northeast Australia, on rivers that flow into the Great Barrier Reef lagoon (Mackay region: Pioneer River and Sandy Creek)

<img src="fig/map.png" width="45%" style="display: block; margin: auto;" />

--

- **Data**: turbidity, conductivity and river level

--

- We compare two approaches to this problem:
  1. using forecasting models
  2. using features with extreme value theory

---
class: center, middle, inverse

# Using forecasting models

---
# Using forecasting models

- Forecasting models are used to generate a prediction, with an associated measure of uncertainty, at the next time point

--

- A `\(100(1-\alpha)\%\)` prediction interval is constructed for the one-step-ahead prediction

--

- If the one-step-ahead observation does not fall within the prediction interval, it is classified as an anomaly

--

- For this comparison study we considered two strategies:
  1. anomaly detection (AD)

--

  2. anomaly detection and mitigation (ADAM): replaces anomalous measurements with their forecasts before forecasting further

---
class: clear, middle

<img src="fig/STOTEN.png" width="20%" style="display: block; margin: auto;" />

Catherine Leigh, Omar Alsibai, Rob J Hyndman, Sevvandi Kandanaarachchi, Olivia C King, James M McGree, Catherine Neelamraju, Jennifer Strauss, Priyanga Dilini Talagala, Ryan S Turner, Kerrie Mengersen, Erin E Peterson (2019) <a href="https://www.sciencedirect.com/science/article/pii/S0048969719305662">A framework for automated anomaly detection in high frequency water-quality data from in situ sensors.</a> <span style="color:blue">Science of the Total Environment, 664, 885-898.</span>

---
# Limitations

- Semisupervised approach: requires a representative sample of the typical behaviour

--

- Strongly influenced by the training data used to build the models (nonstationarity, concept drift)

--

- Requires additional time for model training and for the optimization that estimates the model parameters

--

- Complex relationships between water-quality variables

--

- Irregular time series with many missing values (the frequency of measurements is increased during high-flow events to capture water-quality dynamics at greater resolution)

---
class: center, middle, inverse

# Using features with extreme value theory

---
# Main Contributions

- Proposed an unsupervised framework that provides early detection of technical outliers in water-quality data from *in situ* sensors.

<img src="fig/framework.png" width="100%" style="display: block; margin: auto;" />

--

- Provided a comparative analysis of the efficacy and reliability of both density- and nearest neighbor distance-based outlier scoring techniques.
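---
# Sketch: a nearest neighbor distance score

A minimal sketch of one nearest neighbor distance-based outlier score, in base R. The helper `knn_distance_score()`, the choice `k = 5` and the simulated data are assumptions made for illustration; they are not part of `oddwater` and may differ from the scoring techniques compared in the paper.

```r
# Hypothetical helper (not an oddwater function): score each row by the
# Euclidean distance to its k-th nearest neighbor.
knn_distance_score <- function(x, k = 5) {
  x <- as.matrix(x)
  stopifnot(k >= 1, k < nrow(x))
  d <- as.matrix(dist(x))   # pairwise Euclidean distances
  diag(d) <- Inf            # a point is not its own neighbor
  apply(d, 1, function(row) sort(row)[k])
}

set.seed(1)
typical <- matrix(rnorm(200), ncol = 2)   # 100 simulated typical points
isolated <- matrix(c(10, 10), ncol = 2)   # one point far from the rest
scores <- knn_distance_score(rbind(typical, isolated), k = 5)

which.max(scores)   # row with the largest score
```

Rows with larger scores lie further from their neighbors and are the candidates for outliers.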
---
# oddwater R package

- Introduced an R package, `oddwater` (<span style="color:red">O</span>utlier <span style="color:red">D</span>etection in <span style="color:red">D</span>ata from <span style="color:red">WATER</span>-quality sensors), that implements the proposed framework and related functions.

<img src="fig/oddwater_logo.png" width="25%" style="display: block; margin: auto;" />

`devtools::install_github("pridiltal/oddwater")`

--

- The `oddwater` package also provides a Shiny app to explore data.

`oddwater::explore_data()`

---
class: clear

**Step 1: Identify the data features that differentiate outlying instances from typical behaviours**

<img src="fig/water_out.png" width="100%" />

---
class: clear

**Step 1: Identify the data features that differentiate outlying instances from typical behaviours**

<img src="fig/water_hd1.png" width="100%" />

---
class: clear

**Step 2: Apply statistical transformations to make the outlying instances stand out in transformed data space**

<img src="fig/trans.png" width="100%" />

---
class: clear

**Step 2: Apply statistical transformations to make the outlying instances stand out in transformed data space**

<img src="fig/water_hd2.png" width="100%" />

---
class: clear

**Step 3: Calculate unsupervised outlier scores for the observations in the transformed data space**

<img src="fig/scores.png" width="80%" style="display: block; margin: auto;" />

--

- An anomaly is an observation that deviates markedly from the majority, by a large distance or a low density, in the transformed (high-dimensional) data space

--

- We considered eight unsupervised outlier scoring techniques for high-dimensional data, involving nearest neighbor distances or densities

---
class: clear

**Step 4: Calculate the anomaly threshold**

- Use extreme value theory (EVT) to calculate a separate outlier threshold for each set of outlier scores produced by a given unsupervised outlier scoring technique.
--

- Let **n** be the size of the dataset

--

- Sort the resulting **n** outlier scores

--

- Consider the half of the outlier scores with the smallest values as typical

--

- Search the upper tail for any significantly large gap (bottom-up searching algorithm proposed by Schwarz, 2008)

---
# Spacing Theorem (Weissman, 1978)

Let `\(X_{1}, X_{2}, ..., X_{n}\)` be a sample from a distribution function `\(F\)`. <br>
Let `\(X_{1:n} \geq X_{2:n} \geq ... \geq X_{n:n}\)` be the order statistics. <br>
The available data are `\(X_{1:n}, X_{2:n}, ..., X_{k:n}\)` for some fixed `\(k\)`. <br>
Let `\(D_{i,n} = X_{i:n} - X_{i+1:n},\)` `\((i = 1,2,..., k)\)` be the spacing between successive order statistics. <br>
If `\(F\)` is in the maximum domain of attraction of the Gumbel distribution, then the spacings `\(D_{i,n}\)` are asymptotically independent and exponentially distributed with mean proportional to `\(i^{-1}\)`.

<img src="fig/P2_plot17.png" width="55%" style="display: block; margin: auto;" />

---
class: clear

<img src="fig/one_sided_derivative_TCL_sandy.png" width="70%" style="display: block; margin: auto;" />

---
# Advantages of the proposed framework

- Can take the correlation structure of the water-quality variables into account when detecting outliers

--

- Applicable to both univariate and multivariate problems

--

- Outlier scoring techniques are unsupervised

--

- Outlier thresholds have a probabilistic interpretation

--

- Can easily be extended to streaming data to provide near-real-time support

--

- Can deal with irregular (unevenly spaced) time series

---
# Thank You

<p><font size=5> 1.
Catherine Leigh, Omar Alsibai, Rob J Hyndman, Sevvandi Kandanaarachchi, Olivia C King, James M McGree, Catherine Neelamraju, Jennifer Strauss, Priyanga Dilini Talagala, Ryan S Turner, Kerrie Mengersen, Erin E Peterson (2019) <a href="https://www.sciencedirect.com/science/article/pii/S0048969719305662">A framework for automated anomaly detection in high frequency water-quality data from in situ sensors.</a> <span style="color:blue">Science of the Total Environment, 664, 885-898.</span> <br><br> 2. Priyanga Dilini Talagala, Rob J Hyndman, Catherine Leigh, Kerrie Mengersen, Kate Smith-Miles (2019) <a href="https://arxiv.org/abs/1902.06351">A feature-based framework for detecting technical outliers in water-quality data from in situ sensors.</a> arXiv preprint arXiv:1902.06351. .pull-left[
dilini.talagala@monash.edu
pridiltal
https://prital.netlify.com/ </br> (Slides available) ] .pull-right[ <img src="fig/oddwater_logo.png" width="30%" style="display: block; margin: auto;" /> <p><font size="4"> devtools::install_github("pridiltal/oddwater") </p> ] </p>
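---
# Appendix: sketch of a gap-based threshold

A simplified sketch of the threshold step (Step 4), in base R. It uses the spacing theorem directly: the spacing at rank `r` from the top, multiplied by `r`, is treated as roughly exponential with a common scale. The helper `gap_threshold()`, the defaults `alpha = 0.05` and `m = 20`, the moving-average scale estimate and the toy scores are all assumptions for illustration; this is not the bottom-up algorithm of Schwarz (2008) verbatim, nor the `oddwater` implementation.

```r
# Hypothetical helper (not an oddwater function).
gap_threshold <- function(scores, alpha = 0.05, m = 20) {
  s <- sort(scores)
  n <- length(s)
  gaps <- c(NA, diff(s))               # gaps[i] = s[i] - s[i - 1]
  rank_from_top <- n - seq_len(n) + 1  # gaps[i] is the spacing at this rank
  norm_gaps <- rank_from_top * gaps    # assumed roughly iid exponential
  start <- floor(n / 2) + 1            # lower half is taken as typical
  stopifnot(m >= 1, start - m >= 2)
  for (i in start:n) {                 # search bottom up through the upper half
    scale_hat <- mean(norm_gaps[(i - m):(i - 1)])  # scale from the m gaps below
    if (norm_gaps[i] > log(1 / alpha) * scale_hat) {
      return(s[i - 1])                 # largest score still treated as typical
    }
  }
  Inf                                  # no significant gap: nothing is flagged
}

# Toy scores: 100 evenly spaced typical values and two clearly separated ones.
scores <- c(seq_len(100) / 100, 5, 6)
threshold <- gap_threshold(scores)
which(scores > threshold)              # observations flagged as outliers
```

Because an exponential variable exceeds `log(1 / alpha)` times its scale with probability `alpha`, each gap comparison in this sketch has a direct probabilistic reading.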