Assess and Summarize: Improve Outage Understanding with Large Language Models

Jin, P; Zhang, S; Ma, M; Li, H; Kang, Y; Li, L; Liu, Y; ... Zhang, D (2023) Assess and Summarize: Improve Outage Understanding with Large Language Models. In: ESEC/FSE 2023 - Proceedings of the 31st ACM Joint Meeting European Software Engineering Conference and Symposium on the Foundations of Software Engineering. (pp. 1657-1668). ACM (In press). Green open access

Abstract

Cloud systems have become increasingly popular in recent years due to their flexibility and scalability. Each time applications and services hosted on the cloud are affected by a cloud outage, users can experience slow response times, connection issues or total service disruption, resulting in a significant negative business impact. Outages usually comprise several concurring events/source causes, and therefore understanding the context of outages is a very challenging yet crucial first step toward mitigating and resolving them. In current practice, on-call engineers with in-depth domain knowledge have to manually assess and summarize outages when they happen, which is time-consuming and labor-intensive. In this paper, we first present a large-scale empirical study investigating the way on-call engineers currently deal with cloud outages at Microsoft, and then present and empirically validate a novel approach (dubbed Oasis) to help the engineers in this task. Oasis is able to automatically assess the impact scope of outages as well as to produce a human-readable summary. Specifically, Oasis first assesses the impact scope of an outage by aggregating relevant incidents via multiple techniques. Then, it generates a human-readable summary by leveraging fine-tuned large language models like GPT-3.x. The impact assessment component of Oasis was introduced in Microsoft over three years ago and is now widely adopted, while the outage summarization component has been introduced recently. In this article we present the results of an empirical evaluation we carried out on 18 real-world cloud systems, as well as a human-based evaluation with outage owners. The results show that Oasis can effectively and efficiently summarize outages, and they led Microsoft to deploy its first prototype, which is currently under experimental adoption by some of the incident teams.

Type: Proceedings paper
Title: Assess and Summarize: Improve Outage Understanding with Large Language Models
Event: ESEC/FSE 2023: 31st ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering
ISBN-13: 9798400703270
Open access status: An open access version is available from UCL Discovery
DOI: 10.1145/3611643.3613891
Publisher version: http://doi.org/10.1145/3611643.3613891
Language: English
Additional information: This version is the author accepted manuscript. For information on re-use, please refer to the publisher’s terms and conditions.
Keywords: Outage Understanding, Large Language Model, Cloud Systems
UCL classification: UCL
UCL > Provost and Vice Provost Offices > UCL BEAMS
UCL > Provost and Vice Provost Offices > UCL BEAMS > Faculty of Engineering Science
UCL > Provost and Vice Provost Offices > UCL BEAMS > Faculty of Engineering Science > Dept of Computer Science
URI: https://discovery-pp.ucl.ac.uk/id/eprint/10186655