February 27, 2024

The First Step: Sharpening your Focus on Triage and Prioritization

Michael St.Onge

Principal Security Architect


To focus a team on what is most critical and urgent, the cloud remediation process frequently begins with a triage and prioritization step. Triage and prioritization provide a systematic way to optimize how your team handles the flood of alerts that cloud security tools generate.

 


 

What does triage and prioritization involve?

 

Triage and prioritization aim to ensure your team works on the most critical risks first. This prevents spending time fixing low-impact problems while major threats go unaddressed.

An effective triage and prioritization process involves:

  • Gathering context to understand the scope and impact of each alert within the business context
  • Assigning ownership of the issue to the right individual or team
  • Ranking assets based on criticality to the business in order to focus on safeguarding your “crown jewels”
  • Performing robust alert validation to audit for false positives, behaviors by design, and accepted risk
  • Estimating risk levels associated with each issue
  • Streamlining and improving incident response flows
  • Identifying patterns in alerts to address systemic weaknesses
  • Leveraging automation and human expertise to enhance analysis

Establishing a rigorous triage and prioritization workflow is essential for working through your cloud environment’s security challenges efficiently and effectively. Let’s explore the key steps involved.

 


 

Key steps for effective triage and prioritization

 

1. Add context to alerts

Typically, CNAPP, CSPM, and other tools provide alerts with basic context. However, understanding the scope and impact of each notification, as well as the business relevance of the underlying asset, is crucial for assignment and prioritization. For example, an open security group is bad – but an open security group sitting in front of an EC2 instance with an application vulnerable to RCE/SSRF is catastrophic. Enriching raw alerts by gathering details like affected resources, configurations, compliance, and potential business impact will help remediation teams to either understand the broader impact of any changes they implement, or conduct further analysis that may alter their remediation plan.

For example, assume you are using a CNAPP to monitor for cloud vulnerabilities. You receive a cluster of related alerts around improper bucket permissions allowing public access. The CNAPP provides the affected resources – in this example, S3 buckets – but not the importance of those resources to the business or the data sensitivity. Your team would need to dig deeper to determine the scope of the exposure and potential business impact.

This context is difficult to automate entirely. While tools can provide some supplementary information, human oversight is invaluable for accurately assessing the potential impact of a fix. Augmenting alerts with additional context sets the foundation for effective responses.
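As a rough illustration of the enrichment step described above, the sketch below joins raw alerts against an asset inventory and flags anything without known business context for human review. The inventory structure, field names, and bucket names are all hypothetical and are not taken from any particular CNAPP or CSPM product.

```python
from dataclasses import dataclass, field

# Hypothetical asset inventory keyed by resource ID. In practice this context
# might come from a CMDB, resource tags, or interviews with the owning team.
ASSET_INVENTORY = {
    "s3://billing-exports": {"owner": "finance-eng", "data_class": "financial"},
    "s3://marketing-assets": {"owner": "web-team", "data_class": "public"},
}


@dataclass
class EnrichedAlert:
    alert_id: str
    resource_id: str
    finding: str
    context: dict = field(default_factory=dict)
    needs_human_review: bool = False


def enrich(alert: dict, inventory: dict) -> EnrichedAlert:
    """Attach business context to a raw alert; flag it when context is missing."""
    context = inventory.get(alert["resource_id"])
    return EnrichedAlert(
        alert_id=alert["id"],
        resource_id=alert["resource_id"],
        finding=alert["finding"],
        context=dict(context) if context else {},
        # No inventory entry means a person has to establish the impact.
        needs_human_review=context is None,
    )


raw_alerts = [
    {"id": "a-1", "resource_id": "s3://billing-exports", "finding": "public-read ACL"},
    {"id": "a-2", "resource_id": "s3://unknown-bucket", "finding": "public-read ACL"},
]
enriched = [enrich(a, ASSET_INVENTORY) for a in raw_alerts]
for e in enriched:
    print(e.alert_id, e.context.get("data_class", "unknown"), e.needs_human_review)
```

The design choice worth noting is that missing context is surfaced explicitly rather than silently defaulted, which keeps the human in the loop for exactly the cases automation cannot answer.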

 

2. Assign ownership of remediation

With thousands of alerts each day, determining exactly who handles each one is essential for streamlining the remediation process. Common sources companies use to identify ownership include:

  • Logs
  • Tags
  • Continuous integration and continuous deployment (CI/CD) records, i.e., who originally deployed the infrastructure
  • Organizational charts
  • Dedicated identity providers

Broadly, companies attempt to use either a manual or automated process to assign ownership of remediation tasks. Each has pros and cons:

Approach: Manually assign alerts to relevant devs
  • Pros: Ensures issues are being appropriately assigned based on domain expertise and sensitivity
  • Cons: Time consuming and impossible at scale in complex environments

Approach: Automate assignment of alerts based on predefined rules or heuristics
  • Pros: Saves time and provides useful audit trails showing who has taken ownership of each incident
  • Cons: May incorrectly assign alerts due to business logic that cannot be defined by policies

The most effective process tends to blend the two. Machine learning and automation can help achieve process efficiencies at scale, while human oversight ensures appropriate routing, especially for alerts with complex business logic associated with them.
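One way to picture this blend is a resolver that tries automated ownership sources in order and falls back to a manual queue when none applies. The sketch below assumes an `owner` tag and a `deployed_by` field on each resource, plus a hypothetical deployer-to-team mapping; none of these names come from a specific tool.

```python
from typing import Optional

# Hypothetical mapping from a CI/CD deployer identity to the owning team.
DEPLOYER_TO_TEAM = {"ci-bot-payments": "payments-team"}

# Hypothetical name of the queue that human triagers work from.
MANUAL_QUEUE = "manual-triage-queue"


def resolve_owner(resource: dict) -> Optional[str]:
    """Try automated ownership sources in order; return None if none apply."""
    tags = resource.get("tags", {})
    if tags.get("owner"):
        return tags["owner"]  # an explicit tag is treated as most reliable
    deployer = resource.get("deployed_by")
    if deployer in DEPLOYER_TO_TEAM:
        return DEPLOYER_TO_TEAM[deployer]  # fall back to who deployed it
    return None


def route(resource: dict) -> str:
    """Route to the resolved owner, or to a human when automation has no answer."""
    owner = resolve_owner(resource)
    return owner if owner is not None else MANUAL_QUEUE


print(route({"tags": {"owner": "data-platform"}}))
print(route({"deployed_by": "ci-bot-payments"}))
print(route({"tags": {}}))
```

The order of the sources is itself an assumption; an organization that trusts its deployment records more than its tags would simply swap the two checks.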

 

3. Prioritize your assets

Not all cloud resources and data are created equal. To focus triage efforts, you need to identify and prioritize your most sensitive assets – the “crown jewels” requiring heightened protection.

Examples include databases containing PII, healthcare information, financial data, and intellectual property. You can also prioritize based on resource exposure, encryption status, and other vulnerability factors.

For example, a healthcare company would prioritize systems containing patient health records, clinical trial data, and other highly sensitive information. Similarly, a financial services firm would focus on securing financial transaction data, account information, and intellectual property.

While tools can help classify assets, intricate knowledge of your business, architecture, and data flows is needed to accurately determine sensitivities and priorities. This process is difficult to automate completely and benefits greatly from human input.

 

4. Categorize your assets

In addition to ranking individual assets, higher-level categorization provides further guidance for triage. You can segment your cloud resources and data by factors like importance, sensitivity, and compliance requirements.

For instance, you may categorize assets as mission critical, business critical, and non-critical. Similarly, labeling them by data classifications (PII, financial, public) allows appropriate security handling.

Tools can suggest categories, but only organizational stakeholders truly understand complex business needs. The most effective methods combine machine learning with human judgment.
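A minimal sketch of that combination might pair an automated default, derived from data classification, with overrides supplied by stakeholders. The tier names follow the example above; the default mapping, the asset IDs, and the override table are illustrative assumptions.

```python
from enum import Enum


class Tier(Enum):
    MISSION_CRITICAL = 3
    BUSINESS_CRITICAL = 2
    NON_CRITICAL = 1


# Illustrative defaults only; every organization would choose its own mapping.
DEFAULT_TIER_BY_DATA_CLASS = {
    "pii": Tier.MISSION_CRITICAL,
    "financial": Tier.MISSION_CRITICAL,
    "public": Tier.NON_CRITICAL,
}


def categorize(asset: dict, overrides: dict) -> Tier:
    """Stakeholder overrides win; otherwise use the data-class default."""
    if asset["id"] in overrides:
        return overrides[asset["id"]]
    # Unknown classifications land in the middle tier (an assumption) so they
    # are neither ignored nor treated as crown jewels without review.
    return DEFAULT_TIER_BY_DATA_CLASS.get(asset.get("data_class"), Tier.BUSINESS_CRITICAL)


# Hypothetical override: stakeholders know this "public" bucket serves the
# checkout page, so an outage or defacement would be mission critical.
human_overrides = {"bucket-checkout-static": Tier.MISSION_CRITICAL}

assets = [
    {"id": "db-customers", "data_class": "pii"},
    {"id": "bucket-checkout-static", "data_class": "public"},
    {"id": "bucket-blog-images", "data_class": "public"},
]
tiers = {a["id"]: categorize(a, human_overrides) for a in assets}
for asset_id, tier in tiers.items():
    print(asset_id, tier.name)
```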

 

5. Evaluate asset risk

Too often, we conflate asset risk with alert risk. Alerts do their best to assess risk using the context available, but the asset risk may differ from the alert risk.

For example, an alert on an asset in the corporate network may be assigned a lower risk without consideration that the asset is used by a developer to commit code to the company’s flagship product. A robust remediation process must prioritize the asset risk, which may be informed by the alert risk.

The disastrous Capital One data breach from 2019 exemplifies the importance of prioritization that evaluates the risk posed to assets based on their context within the environment. In the Capital One case, S3 buckets containing sensitive data resided in the same account as DMZ security appliances running on vulnerable, internet-facing EC2 instances whose instance profiles allowed S3 access.

With your prioritized and categorized assets mapped, you can assess the risk associated with each to guide the order of response. Resources containing highly sensitive data will warrant rapid triage of related alerts.

For example, an alert related to improper access controls on databases storing customer credit card data would be assigned a high risk score demanding prompt investigation. On the other hand, an alert about a public-facing content bucket without sensitive data may carry lower risk.

Certain categories like unencrypted PII may require immediate investigation, while public-facing content with exposure risks might permit a slower approach. Repeated failures to address risks in a timely manner would escalate the triage priority.

Here again, automated risk scoring provides a useful starting point but human experts add crucial nuance based on business impact. Use the best of both worlds.
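To make the distinction between alert risk and asset risk concrete, here is a toy scoring sketch. The weights, the exposure multiplier, and the weekly escalation factor are arbitrary assumptions chosen for illustration, not a recommended model; a human reviewer would still adjust the result.

```python
# Illustrative weights only; real scoring models are tuned per organization.
SEVERITY_SCORE = {"low": 1, "medium": 2, "high": 3, "critical": 4}
TIER_WEIGHT = {"non_critical": 1.0, "business_critical": 2.0, "mission_critical": 3.0}


def asset_risk(alert_severity: str, asset_tier: str,
               internet_facing: bool, days_open: int = 0) -> float:
    """Combine the tool-reported severity with what is known about the asset."""
    score = SEVERITY_SCORE[alert_severity] * TIER_WEIGHT[asset_tier]
    if internet_facing:
        score *= 1.5  # exposure raises the risk of the same finding
    # Unresolved findings escalate: +10% per full week open, capped at +50%.
    score *= 1 + min(days_open // 7, 5) * 0.1
    return round(score, 2)


# A medium alert on a crown-jewel database versus a high alert on a public
# content bucket holding no sensitive data.
card_db = asset_risk("medium", "mission_critical", internet_facing=True)
content_bucket = asset_risk("high", "non_critical", internet_facing=True)
print(card_db, content_bucket)
```

Under these assumed weights the medium-severity alert on the mission-critical asset outranks the high-severity alert on the non-critical one, which is the reordering that asset-aware prioritization is meant to produce.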

 

6. Streamline incident response

Triage also involves eliminating duplicate alerts and refining categorization to streamline incident response. Deduplicating notifications, classifying them by severity and type, and resolving false positives improves efficiency.

A CNAPP with multiple compliance frameworks configured may generate an alert for each framework on the same misconfiguration, on the same resource. For example, your CNAPP may generate multiple alerts around the same public bucket access issue. Your security team would need to analyze and deduplicate these related alerts, rather than waste time investigating each one separately.

Tools can deduplicate basic alerts, but humans identify more complex duplicate incidents spanning multiple systems. For categorization, the technical expertise of engineers is invaluable — automation alone cannot reliably classify intricate cloud security notifications.
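The simple case, where each compliance framework raises its own alert for the same misconfiguration on the same resource, can be sketched as grouping by a (resource, rule) key. The alert fields and framework labels below are hypothetical; the more complex cross-system duplicates mentioned above would not be caught by a key this simple.

```python
def deduplicate(alerts: list[dict]) -> list[dict]:
    """Collapse alerts that share a resource and rule into a single incident."""
    grouped: dict[tuple[str, str], dict] = {}
    for alert in alerts:
        key = (alert["resource_id"], alert["rule"])
        if key not in grouped:
            grouped[key] = {
                "resource_id": alert["resource_id"],
                "rule": alert["rule"],
                "frameworks": [],
                "alert_ids": [],
            }
        # Keep the provenance so nothing is lost for audit purposes.
        grouped[key]["frameworks"].append(alert["framework"])
        grouped[key]["alert_ids"].append(alert["id"])
    return list(grouped.values())


alerts = [
    {"id": "a-1", "resource_id": "bucket-logs", "rule": "public-access", "framework": "framework-A"},
    {"id": "a-2", "resource_id": "bucket-logs", "rule": "public-access", "framework": "framework-B"},
    {"id": "a-3", "resource_id": "bucket-logs", "rule": "public-access", "framework": "framework-C"},
    {"id": "a-4", "resource_id": "bucket-logs", "rule": "no-encryption", "framework": "framework-A"},
]
incidents = deduplicate(alerts)
for incident in incidents:
    print(incident["rule"], incident["alert_ids"])
```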

 

7. Identify patterns

Effective triage looks beyond individual alerts to identify broader patterns and systemic weaknesses. Analyzing alerts collectively reveals issues like recurring misconfigurations requiring updated configurations, security guardrails, or user training.

For instance, frequent alerts around improperly configured bucket permissions may point to a need for better cloud security training on access controls. Improper SSH key rotation alerts may indicate a gap in administrator practices that should be addressed through better training on key lifecycle management. Without examining the bigger picture, you cannot address the root causes behind volumes of alerts.

While machine learning has become quite adept at finding patterns, human security experts interpret these signals best in terms of potential root causes and solutions. This high-level insight is very difficult for machines to match. A robust triage and prioritization process leverages both: machine learning classifies alerts and surfaces trends, while a skilled human applies knowledge and experience to operationalize the machine learning outputs.
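Even without machine learning, a plain frequency count can surface candidates for that human interpretation. The sketch below counts alerts per (rule, team) pair and reports those at or above a threshold; the field names, team names, and threshold value are assumptions made for the example.

```python
from collections import Counter


def recurring_findings(alerts: list[dict], threshold: int = 3) -> list[tuple[str, str, int]]:
    """Return (rule, team, count) for pairs seen at least `threshold` times."""
    counts = Counter((a["rule"], a["team"]) for a in alerts)
    return [
        (rule, team, n)
        for (rule, team), n in counts.most_common()
        if n >= threshold
    ]


history = (
    [{"rule": "public-bucket", "team": "web-team"}] * 4
    + [{"rule": "stale-ssh-key", "team": "platform-team"}] * 3
    + [{"rule": "public-bucket", "team": "data-team"}]
)
patterns = recurring_findings(history)
for rule, team, n in patterns:
    print(f"{team}: {rule} x{n}")
```

The output only says where alerts cluster. Deciding whether the fix is training, a guardrail, or an updated configuration remains a judgment call for the people who know the teams involved.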

 


 

The Key Role of Human Experts

 

As the examples above demonstrate, although automation provides a solid starting point, human expertise is indispensable for effective security triage and prioritization. The contextual understanding, architectural intuition, and risk insights of experienced cloud security professionals remain impossible for tools to replicate independently.

Approach: Expert
Pros:
  • Allows nuanced human judgment to assess and prioritize complex issues
  • Custom prioritization based on intimate knowledge of architecture and data flows
  • Effective communication of insights across teams
Cons:
  • Dependent on individual specialized expertise – human behavior generates inconsistencies, even between two great security engineers
  • Very time consuming and labor intensive, cannot scale
  • Inconsistent processes prone to human error and oversight
  • Fails to account for constant cloud evolution, missing new resources/services

Approach: Automation
Pros:
  • Provides rapid, consistent prioritization at scale
  • Automates repetitive tasks to save time
  • Aggregates huge data sets to reveal insights faster
  • Automatically suggests assignments and priorities
Cons:
  • Lacks nuanced oversight, potentially leading to inaccurate prioritization
  • Standard tools fall short of tailored solutions
  • May incorrectly assign/prioritize complex issues

 


 

A Confident First Step

 

The alerts from your cloud-native application protection platform (CNAPP), cloud security posture management (CSPM), and other tools report issues ranging from misconfigurations to vulnerabilities. Security tools do their best to express risk to help inform prioritization efforts, but they lack the business context to be useful without human triage. Security teams commonly cannot deal with every alert and its corresponding misconfiguration immediately, and when you’re dealing with thousands of alerts daily, fatigue can set in as well.

By combining smart software with human oversight, you can optimize focus on your cloud environment’s most critical risks and streamline response efforts. The future lies in this potent blend of machine and human capabilities – amplifying the strengths of each.

For Practitioners

For security analysts and engineers, optimized triage and prioritization improve workflow efficiency and team communication and clarify response steps. Automating repetitive enrichment and assignment tasks enables practitioners to focus expertise on complex judgment calls. Well-defined processes also create smoother hand-offs between different teams. With mature triage fundamentals in place, practitioners gain better leverage over the daily flood of security alerts.

For Management

For leadership, effective triage and prioritization deliver measurable improvements in risk reduction over time. Metrics around response times, security ticket backlogs, and overall security posture will steadily improve. Automated enrichment provides management more insight into response workflows. Leadership can scale cloud security with confidence knowing that the highest priority threats are addressed first.
