Deakin University
File(s) under permanent embargo

Metric selection and anomaly detection for cloud operations using log and metric correlation analysis

journal contribution
posted on 2018-03-01, 00:00 authored by M Farshchi, Jean-Guy Schneider, I Weber, John Grundy
Cloud computing systems provide the facilities to make application services resilient against failures of individual computing resources. However, resiliency is typically limited by a cloud consumer's use and operation of cloud resources. In particular, system operations have been reported as one of the leading causes of system-wide outages. This applies specifically to DevOps operations, such as backup, redeployment, upgrade, customized scaling, and migration, which are executed at much higher frequencies now than a decade ago. We address this problem by proposing a novel approach to detect errors in the execution of these kinds of operations, in particular for rolling upgrade operations. Our regression-based approach leverages the correlation between operations' activity logs and the effect of operation activities on cloud resources. First, we present a metric selection approach based on regression analysis. Second, the output of a regression model of selected metrics is used to derive assertion specifications, which can be used for runtime verification of running operations. We have conducted a set of experiments with different configurations of an upgrade operation on Amazon Web Services, with and without randomly injected faults, to demonstrate the utility of our new approach.
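
The two steps named in the abstract (regression-based metric selection, then assertions derived from the fitted model) can be illustrated with a minimal sketch. It is not the authors' implementation: the single-predictor least-squares fit, the function names (`fit_line`, `select_metrics`, `check`), the R² cut-off, the tolerance factor `k`, and the sample numbers are all assumptions made for illustration.

```python
from statistics import mean, pstdev


def fit_line(x, y):
    """Ordinary least squares for y = a + b*x.

    Returns (a, b, r_squared, residual_std).
    """
    mx, my = mean(x), mean(y)
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    b = sxy / sxx
    a = my - b * mx
    residuals = [yi - (a + b * xi) for xi, yi in zip(x, y)]
    ss_res = sum(r ** 2 for r in residuals)
    ss_tot = sum((yi - my) ** 2 for yi in y)
    r2 = 1.0 - ss_res / ss_tot if ss_tot else 0.0
    return a, b, r2, pstdev(residuals)


def select_metrics(log_counts, metrics, min_r2=0.8):
    """Keep only metrics that the log-event counts explain well (hypothetical cut-off)."""
    models = {}
    for name, values in metrics.items():
        a, b, r2, sd = fit_line(log_counts, values)
        if r2 >= min_r2:
            models[name] = (a, b, sd)
    return models


def check(models, name, log_count, observed, k=3.0):
    """Assertion: the observed value lies within k residual std devs of the prediction."""
    a, b, sd = models[name]
    return abs(observed - (a + b * log_count)) <= k * sd


# Made-up training data: cumulative "instance terminated" log events per time window,
# and two monitored metrics sampled in the same windows.
log_counts = [0, 1, 2, 3, 4, 5]
metrics = {
    "terminated_instances": [0.0, 1.1, 1.9, 3.0, 4.1, 4.9],  # tracks the log events
    "cpu_idle": [50, 12, 77, 31, 64, 40],                    # unrelated to the log events
}

models = select_metrics(log_counts, metrics)
ok = check(models, "terminated_instances", log_count=3, observed=3.0)
anomaly = not check(models, "terminated_instances", log_count=3, observed=1.0)
```

In this sketch only `terminated_instances` survives selection, and an observed value far from the regression prediction at runtime is flagged as a violated assertion. A multi-predictor model and statistically derived thresholds would be natural refinements.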

History

Journal

Journal of Systems and Software

Pagination

531 - 549

Publisher

Elsevier

Location

New York, N.Y.

ISSN

0164-1212

Language

eng

Publication classification

C Journal article; C1 Refereed article in a scholarly journal

Copyright notice

2017, Elsevier