
April 30, 2026

How Agentic Cloud Remediation Actually Works in 5 Stages

Marina Segal

CEO, Tamnoon


Cloud security has a last-mile problem.

The industry has spent billions making detection faster and smarter. CNAPPs, CSPMs, and vulnerability scanners surface thousands of findings every day. But visibility isn’t the issue anymore. In cloud security remediation, what matters is what happens after issues are identified.

Tamnoon’s own research shows most security teams investigate about 4% of the alerts their tools generate, with less than 1% resulting in a confirmed fix. For the ones that do get addressed, it’s not fast. Tamnoon’s State of Cloud Remediation 2025 revealed that critical cloud misconfigurations sit unresolved for an average of 128 days before remediation. That’s over four months of exposure per finding.

Why? Because every investigation starts with simple questions like:

  • Is this resource in production? 
  • Who owns it? 
  • What breaks if we change it? 

Answering those questions takes time most teams don’t have, so the backlog grows by nearly 40% year over year.

This reflects a remediation gap rooted in how issues are resolved, which is exactly the problem automated remediation aims to solve. But successful AI-powered remediation depends on how work moves from alert to resolution.

This article walks through each stage of the CNAPP remediation workflow: how findings are prioritized and investigated, where automation applies, and where human judgment is required to protect production systems.

How Manual Remediation Typically Works

Before getting into what automated cloud remediation looks like, it’s worth laying out what most teams are actually dealing with today.

  • An alert fires: Maybe it’s an S3 bucket without HTTPS enforcement, or an EC2 instance running outdated metadata service settings. An analyst picks it up, sometimes within hours, sometimes days later, depending on what else is competing for their attention.
  • Someone investigates it: They log into the cloud console. Check whether the resource is in production or a test environment. Try to figure out who owns it by digging through resource tags, Slack messages, or internal wikis. If the resource has dependencies, they need to map those too. All of this is manual, and for a single finding it can take hours.
  • A ticket is created: If they determine the issue is worth fixing, they write a ticket and hand it off to the resource owner, usually someone on the engineering or DevOps team. 
  • Someone fixes it (eventually): That person gets to it when they can, often without the full picture of why it was flagged or what’s at stake. The fix might take five minutes. The handoff and wait time might take five days or months.

Where This Breaks Down

There are a few specific friction points that make this cloud security remediation process unsustainable at scale.

Context gathering is the biggest time sink, and it’s almost entirely manual. There’s no systematic way to assess blast radius, so teams either move cautiously and leave risk open, or move fast and risk breaking production. 

Every alert gets roughly the same treatment regardless of whether it’s a production database or an idle test bucket. Cloud security alert fatigue sets in, and the handoff between security and engineering creates friction that compounds over time, with security losing credibility each time they escalate something that turns out to be low priority.

None of this is a failure of the people involved; it's a failure of the process. The tools that find problems weren't designed to fix them, and the space between those two steps is where risk accumulates.

Here’s how the two approaches compare at each stage of the workflow:

| Workflow Stage | Manual Remediation | Automated Remediation |
| --- | --- | --- |
| Alert intake | Analyst reviews alerts one at a time. Duplicates across tools are common and often treated as separate issues. | AI deduplicates across CNAPPs and groups related alerts into prioritized initiatives. ~145 alerts per initiative on average. |
| Context gathering | Hours per finding. Analyst checks cloud consoles, resource tags, Slack threads, and internal wikis manually. | Tami runs automated queries against cloud APIs in minutes. All read-only. No production impact. |
| Safety assessment | Based on severity rating and analyst judgment. No systematic blast radius analysis. | RCI score updated progressively by targeted investigative automations. Each step narrows uncertainty. |
| Execution | Ticket handoff to resource owner, who re-researches the issue with minimal context. Days to weeks of wait time. | Parameterized scripts from battle-tested playbooks. Human approval gates for RISKY findings. UNSAFE findings routed with full context. |
| Validation | Check the CNAPP dashboard the next day to see if the alert disappeared. | Post-remediation scans, configuration drift checks, and complete audit trail with full attribution. |
| Prevention | Rarely reached. Same misconfigurations resurface. Alert fatigue cycle restarts. | Guardrails implemented via SCPs, Azure Blueprints, and policy-as-code. Same issue can't be reintroduced. |

 

How AI-Powered Remediation Changes the Workflow

The workflow still follows the same logical progression; no steps are skipped. Instead, AI and human expertise are applied at different stages to eliminate the bottlenecks described above.

In Tamnoon’s case, the AI agent driving this workflow is called Tami, trained on millions of real-world cloud security alerts and remediation outcomes. Tami handles the investigative and operational work. Human experts (called CloudPros) handle judgment calls and high-risk approvals.

One thing worth noting upfront: most of this work happens outside the production environment. Investigation, enrichment, safety analysis, and remediation planning all occur without touching live infrastructure. The only moment production is involved is the final, validated execution step. That distinction matters when the fear of breaking something is the main reason remediation stalls.

Here’s what each stage looks like when automated cloud remediation is done right:

Stage 1: Ingestion and Aggregation

Every automated cloud remediation workflow starts with the same problem: too many alerts from too many tools, with no unified view of what actually needs attention.

This stage pulls findings from whatever CNAPPs, CSPMs, or detection tools the organization is running. Findings are normalized into a common format, deduplicated across tools, and grouped into unified initiatives based on shared root cause or affected resource.

All of this is handled by AI. No human involvement at this stage. The system recognizes when two different tools flag the same S3 bucket for the same issue and groups them into a single initiative rather than creating duplicate work.

In a manual workflow, this step either doesn’t exist or happens in an analyst’s head. They might see the same finding from two solutions and treat them as separate issues, doubling their workload. Or they miss that six alerts all trace back to the same underlying cloud misconfiguration. There’s no systematic deduplication, just tribal knowledge and spreadsheets.

With AI, the analyst’s queue goes from thousands of individual alerts to a manageable set of prioritized initiatives before a human ever looks at it. That alone changes the starting point for everything that follows.
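As a rough sketch of the deduplication described above, consider grouping normalized findings by affected resource and rule so duplicates from different tools collapse into one initiative. The field names and tool names here are illustrative assumptions, not Tamnoon's actual data model:

```python
from collections import defaultdict
from dataclasses import dataclass

@dataclass(frozen=True)
class Finding:
    source_tool: str    # e.g. "wiz", "orca" (illustrative)
    resource_arn: str   # the affected cloud resource
    rule_id: str        # normalized policy rule, e.g. "s3-https-only"

def group_into_initiatives(findings):
    """Deduplicate findings across tools by grouping them on
    (resource, rule), so one initiative covers every duplicate alert."""
    initiatives = defaultdict(set)
    for f in findings:
        initiatives[(f.resource_arn, f.rule_id)].add(f)
    return initiatives

findings = [
    Finding("wiz", "arn:aws:s3:::app-logs", "s3-https-only"),
    Finding("orca", "arn:aws:s3:::app-logs", "s3-https-only"),  # same issue, second tool
    Finding("wiz", "arn:aws:s3:::app-logs", "s3-encryption"),
]
grouped = group_into_initiatives(findings)
print(len(grouped))  # 2 initiatives instead of 3 raw alerts
```

The real grouping logic would also consider shared root cause, but even this key-based collapse illustrates why the queue shrinks before a human ever looks at it.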

Stage 2: Contextual Enrichment

A raw alert doesn’t contain enough information to act on safely. It tells you what’s wrong, but not whether fixing it will break something. This stage builds the full picture around each initiative.

The AI runs automated queries against live cloud APIs to pull context that would take an analyst hours to gather manually. Depending on the finding type, this includes:

  • Checking resource usage patterns over the past 90 days (is this asset actually active, or has it been sitting idle?)
  • Pulling access logs from CloudTrail to understand who and what is interacting with the resource
  • Mapping workload dependencies and associations (is this container image running in production? Which EC2 instances rely on it?)
  • Identifying environment type: production, staging, development, or test
  • Checking encryption status and public exposure levels
  • Looking up ownership through resource tags, IAM mappings, and account structure

There’s no human involvement yet. And critically, all of this is read-only. The system is querying metadata and logs, not modifying anything in the environment. Nothing in production is being touched.

In a manual workflow, this is the step that kills velocity. An analyst doing this work logs into cloud consoles, runs CLI queries, cross-references internal documentation, and pieces together a picture that might take hours per finding. In many cases, they skip enrichment entirely because they don’t have time, which means they’re making remediation decisions without understanding what they’re actually touching.

The output of this stage is a fully enriched initiative with enough context to answer the real question: is this safe to fix?
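The kind of read-only enrichment query described above can be sketched with boto3 against S3 tagging metadata. The tag keys and classification rules below are illustrative assumptions about common tagging conventions, not a real product implementation; the only AWS call used is a read-only metadata fetch:

```python
def get_resource_tags(bucket_name: str) -> dict:
    """Read-only metadata query; nothing in the environment is modified.
    Assumes AWS credentials with read-only S3 permissions."""
    import boto3  # deferred import so the pure helpers below work without it
    s3 = boto3.client("s3")
    resp = s3.get_bucket_tagging(Bucket=bucket_name)
    return {t["Key"]: t["Value"] for t in resp["TagSet"]}

def classify_environment(tags: dict) -> str:
    """Infer environment type from common tagging conventions.
    Key names ("Environment", "env") are assumptions; real estates vary."""
    env = tags.get("Environment", tags.get("env", "")).lower()
    if env in ("prod", "production"):
        return "production"
    if env in ("stage", "staging"):
        return "staging"
    if env in ("dev", "development", "test"):
        return "development"
    return "unknown"

print(classify_environment({"Environment": "Prod"}))  # production
print(classify_environment({}))                       # unknown
```

An "unknown" result is itself useful signal: a resource with no ownership or environment tags warrants more caution, not less.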

Stage 3: Investigation And Safety Assessment

Enrichment builds the case file. Investigation answers the question: what happens if we fix this right now?

This is where automated cloud remediation diverges most sharply from the manual approach. Instead of relying on an analyst’s judgment call based on a severity rating, the AI runs targeted investigative automations tailored to the specific finding type. Each one answers a single safety question, and the results feed into a dynamic remediation confidence score called the Remediation Confidence Indicator (RCI).

Depending on the finding, the AI might:

  • Analyze CloudTrail logs to determine whether actual HTTP traffic exists to an S3 bucket before enforcing HTTPS (would the fix break active workloads?)
  • Check whether a flagged container image is deployed and running in production, or sitting unused in a registry
  • Query VPC Flow Logs and CloudWatch metrics to detect calls to the EC2 instance metadata service before enforcing IMDSv2
  • Pull a full software bill of materials to understand what’s inside a vulnerable image and whether a patch path exists
  • Identify whether a resource is behind CloudFront, a load balancer, or accessed directly

Each automation updates the RCI, which rates every finding as SAFE (proceed with confidence), RISKY (remediation path exists but requires human review), or UNSAFE (needs team coordination before anyone touches it). The score isn't static; it evolves with each new piece of evidence, narrowing uncertainty step by step.

What This Looks Like In Practice

Say several S3 buckets get flagged for the same policy violation: no HTTPS enforcement. Same alert, same severity. After investigation, the system determines:

  • One bucket is completely empty and unused. SAFE to delete or enforce policy.
  • Two buckets show 100% HTTPS traffic with no HTTP calls detected. SAFE to enforce.
  • Another bucket has 45 active HTTP GetObject calls in the last 90 days. Enforcing HTTPS would break a production workload. UNSAFE, routed for coordination with the application team.

Same finding. Three different outcomes. That distinction doesn’t exist in a manual workflow, where the analyst checks the severity, maybe confirms whether the resource is in production, and either escalates or attempts a fix. There’s no progressive confidence scoring, no systematic blast radius assessment, and no clear framework for distinguishing “safe to proceed” from “needs more context.”

All of the investigation work described above is still read-only. No changes to production. The RCI is built entirely from observed behavior and metadata, not from test executions against live infrastructure.
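The S3 example above can be sketched as a classification function over the evidence the investigation collected. The thresholds and the three-way split are modeled on the outcomes described in this section, but the logic is a toy illustration, not Tamnoon's actual RCI scoring:

```python
from dataclasses import dataclass

@dataclass
class Evidence:
    object_count: int      # objects in the bucket
    https_calls_90d: int   # TLS requests observed in access logs
    http_calls_90d: int    # plaintext requests observed

def rci_for_https_enforcement(e: Evidence) -> str:
    """Toy confidence rating for 'enforce HTTPS on an S3 bucket'.
    Built only from observed behavior; no test executions against
    live infrastructure."""
    if e.object_count == 0 and e.https_calls_90d + e.http_calls_90d == 0:
        return "SAFE"    # empty and unused
    if e.http_calls_90d == 0:
        return "SAFE"    # all observed traffic already uses TLS
    return "UNSAFE"      # active plaintext callers would break

print(rci_for_https_enforcement(Evidence(0, 0, 0)))        # SAFE (empty bucket)
print(rci_for_https_enforcement(Evidence(1200, 9000, 0)))  # SAFE (100% HTTPS)
print(rci_for_https_enforcement(Evidence(300, 500, 45)))   # UNSAFE (45 HTTP calls)
```

Other finding types would have their own evidence fields and could return RISKY where a fix exists but needs human review.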


Stage 4: Remediation Execution

This is the one stage where production is actually touched, and only for findings that have passed through investigation and safety assessment. Nothing reaches this point without an RCI attached.

The remediation path depends on that score:

  • SAFE findings: get parameterized remediation scripts generated from battle-tested playbooks. These are production-ready artifacts: IaC template updates, CLI commands, or policy-as-code changes tailored to the specific finding and environment. For organizations running in supervised autopilot mode, human experts (CloudPros) execute these fixes on the organization’s behalf. In co-pilot mode, the organization’s own team executes using the AI-generated plan.
  • RISKY findings: get full remediation plans with human approval gates. The fix exists, and the path is clear, but the action requires human review before execution. A container image rebuild based on an updated Dockerfile, for example, or an AMI upgrade for an instance with active metadata service usage. The AI has done the investigation. The human makes the final call.
  • UNSAFE findings: are never auto-remediated. They’re routed to the appropriate team with the complete investigation context: what was found, what was checked, why it’s unsafe to proceed without coordination, and who owns the resource. The developer or DevOps engineer receiving this isn’t getting a raw alert. They’re getting an investigated, enriched finding with a clear explanation of what needs to happen next.

In a manual workflow, none of this sorting exists. The developer gets a ticket that says “fix this S3 bucket policy” with minimal context. They re-research the issue, figure out the right fix, test it, and deploy it. The investigation that already happened on the security side doesn’t transfer. With agentic remediation, that work carries through to execution. Nothing gets repeated.
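A parameterized remediation for the HTTPS finding might look like the sketch below: generate the standard aws:SecureTransport deny policy for the bucket, and gate execution on the confidence rating. The function names and the `approved` flag are illustrative; the gating rules mirror the SAFE/RISKY/UNSAFE paths described above:

```python
import json

def https_only_policy(bucket_name: str) -> str:
    """Generate a bucket policy denying any non-TLS request
    (the standard aws:SecureTransport deny pattern)."""
    return json.dumps({
        "Version": "2012-10-17",
        "Statement": [{
            "Sid": "DenyInsecureTransport",
            "Effect": "Deny",
            "Principal": "*",
            "Action": "s3:*",
            "Resource": [
                f"arn:aws:s3:::{bucket_name}",
                f"arn:aws:s3:::{bucket_name}/*",
            ],
            "Condition": {"Bool": {"aws:SecureTransport": "false"}},
        }],
    })

def remediate(bucket_name: str, rci: str, approved: bool = False) -> str:
    """SAFE executes; RISKY requires an explicit approval gate;
    UNSAFE is never executed here, only routed with context."""
    if rci == "UNSAFE" or (rci == "RISKY" and not approved):
        return "routed-for-review"
    policy = https_only_policy(bucket_name)
    # The only write to production, applied after all gates pass:
    # boto3.client("s3").put_bucket_policy(Bucket=bucket_name, Policy=policy)
    return "executed"

print(remediate("app-logs", "SAFE"))     # executed
print(remediate("app-logs", "RISKY"))    # routed-for-review
print(remediate("app-logs", "UNSAFE"))   # routed-for-review
```

Note that the production write is the last line of the happy path, after every safety gate, matching the workflow's "only touch production at the final, validated step" principle.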

Stage 5: Validation And Prevention

The fix is deployed, but the workflow isn’t done.

This stage confirms that the remediation actually worked: that it resolved the root cause and didn’t introduce new problems. The AI runs post-remediation scans to verify the finding is closed, checks for configuration drift or regressions, and documents everything in a complete audit trail.
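Continuing the HTTPS example, a post-remediation check might re-read the bucket policy (another read-only query) and confirm the deny statement is now in place. This is a minimal sketch, assuming the policy document has already been fetched:

```python
import json

def https_enforced(policy_json: str) -> bool:
    """Post-remediation verification: does the bucket policy now contain
    a Deny on aws:SecureTransport=false? In practice the document would
    be re-fetched with a read-only get_bucket_policy call."""
    policy = json.loads(policy_json)
    for stmt in policy.get("Statement", []):
        cond = stmt.get("Condition", {}).get("Bool", {})
        if stmt.get("Effect") == "Deny" and cond.get("aws:SecureTransport") == "false":
            return True
    return False

remediated_policy = json.dumps({
    "Version": "2012-10-17",
    "Statement": [{
        "Effect": "Deny",
        "Principal": "*",
        "Action": "s3:*",
        "Resource": "arn:aws:s3:::app-logs/*",
        "Condition": {"Bool": {"aws:SecureTransport": "false"}},
    }],
})
print(https_enforced(remediated_policy))     # True: finding closed
print(https_enforced('{"Statement": []}'))   # False: drift or failed fix
```

A False result here would flag drift or a failed fix rather than silently closing the finding.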

Closing The Loop

That audit trail captures the full timeline: what was found, what context was gathered, what the RCI was at each stage, what action was taken, who approved it, and what the verification results showed. 

Every step is attributed, whether it was performed by the AI agent or a human expert. For organizations operating under SOC2, HIPAA, or similar frameworks, this is the compliance record that proves the issue was handled properly.

Preventing Recurrence

Then comes the step most manual workflows never reach: prevention. 

The system implements guardrails like Service Control Policies, Azure Blueprints, and policy-as-code rules to ensure the same misconfiguration can’t be reintroduced. This is what turns a single fix into a permanent posture improvement.
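As one concrete example of such a guardrail, an AWS Service Control Policy can deny launching EC2 instances unless IMDSv2 is enforced, so the metadata-service misconfiguration discussed earlier can't be reintroduced. The policy below uses the documented `ec2:MetadataHttpTokens` condition key; generating it in Python is just for illustration:

```python
import json

# Guardrail sketch: an SCP that blocks ec2:RunInstances unless IMDSv2
# (token-required metadata access) is enforced on the new instance.
imdsv2_guardrail = {
    "Version": "2012-10-17",
    "Statement": [{
        "Sid": "RequireImdsV2",
        "Effect": "Deny",
        "Action": "ec2:RunInstances",
        "Resource": "arn:aws:ec2:*:*:instance/*",
        "Condition": {
            "StringNotEquals": {"ec2:MetadataHttpTokens": "required"}
        },
    }],
}
print(json.dumps(imdsv2_guardrail, indent=2))
```

Attached at the organization or OU level, a policy like this converts a one-time fix into a standing constraint: the alert can't fire again because the misconfiguration can't be created.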

In a manual workflow, validation usually means checking the CNAPP dashboard the next day to see if the alert disappeared. Root cause confirmation is rare. Prevention is rarer. The same cloud misconfiguration resurfaces weeks later, the same alert fires again, and the cycle restarts.

That recurrence is a major driver of cloud security alert fatigue, and it’s entirely avoidable when validation and prevention are built into the workflow.

What Changes When Remediation Actually Scales

The difference isn’t just speed, though that matters. Our own data shows organizations using AI-powered cloud security remediation see up to a 95% reduction in MTTR, a 25x increase in investigated alerts, and up to a 95% reduction in ticket volume.

But the bigger shift is operational:

  • Security teams stop gathering context and start making decisions. 
  • Developers stop receiving raw alerts and start receiving production-safe remediation plans with clear next steps. 
  • The backlog shrinks instead of growing, and findings that get fixed stay fixed, because prevention is built into the workflow.

Automated remediation solves the last mile of cloud security across all stages. Ingestion turns thousands of alerts into manageable initiatives, enrichment builds context, investigation determines what’s safe to act on, execution follows the evidence, and validation and prevention close the loop permanently.

At every stage, AI handles the volume and investigative heavy lifting, while humans handle judgment calls and high-risk approvals. Neither replaces the other. Together, they close the last mile of cloud security.

Organizations using Tamnoon reduce open cloud exposures by up to 82% within 90 days. Book a demo to see how agent-led, expert-supervised remediation can work with your CNAPP.

Book a Demo

FAQs

What is automated cloud remediation?

Automated cloud remediation is the process of using AI and predefined playbooks to detect, prioritize, and fix cloud security issues like misconfigurations and vulnerabilities without relying on fully manual workflows. It covers the full lifecycle from alert ingestion through investigation, execution, and validation.

How is automated remediation different from manual remediation?

Manual remediation requires analysts to research each finding individually, gather context from cloud consoles, assess risk, write tickets, and hand off fixes to engineering. Automated remediation handles context gathering, deduplication, investigation, and remediation planning through AI, with humans stepping in only for high-risk approvals and edge cases.

Is automated cloud remediation safe for production environments?

When done correctly, yes. The key is that investigation, enrichment, and safety assessment all happen outside production. The system only touches live infrastructure during the final execution step, and only for findings that have been fully investigated and scored through the Remediation Confidence Indicator (RCI). Findings rated SAFE are executed with confidence. Findings rated RISKY require human approval before any action is taken. And findings rated UNSAFE are never auto-remediated. They’re routed to the appropriate team with full investigation context. Production is only touched when the evidence supports it.

What is a remediation confidence indicator?

A remediation confidence indicator is a dynamic safety rating that reflects how much context the system has gathered about a specific finding and whether it can be fixed without negative impact. In Tamnoon’s workflow, this is called the Remediation Confidence Indicator, and it rates findings as SAFE, RISKY, or UNSAFE. The score evolves as each investigation step adds new evidence.

How does automated remediation reduce MTTR?

Manual cloud security MTTR is high because most of the time is spent on investigation and handoffs, not the actual fix. Automated remediation compresses investigation from hours or days to minutes by running enrichment and safety checks through AI. This can reduce MTTR by up to 95%, because the bottleneck shifts from gathering context to making decisions.

What role do humans play in automated remediation?

Humans aren’t removed from the process. They’re repositioned. AI handles alert aggregation, contextual enrichment, and investigative analysis. Humans review and approve remediations for findings marked as risky, own escalations for findings marked as unsafe, and validate that execution matches the plan. The model is often described as agent-led, expert-supervised.

What are common use cases for automated cloud remediation?

Common use cases include cloud misconfiguration remediation (S3 bucket policies, security group rules, encryption settings), IAM permission rightsizing, container image vulnerabilities, metadata service enforcement (IMDSv1 to IMDSv2), infrastructure drift, and orphaned or unused resources. The scope depends on the maturity of the platform and the organization’s trust level in automation.

Does automated remediation work with existing CNAPP tools?

Yes. Automated remediation platforms like Tamnoon are designed to plug into your existing detection stack, not replace it. The system ingests findings from CNAPPs like Wiz, CrowdStrike Falcon Cloud Security, Orca, Cortex Cloud, and others via API. CNAPP remediation works by letting the detection tool do what it does best (finding risk) while the remediation platform handles what comes after (fixing risk).

What’s the difference between automated and autonomous remediation?

Automated remediation uses predefined playbooks and rules to execute fixes, typically with human approval at key steps. Autonomous remediation goes further, using AI agents that can investigate, reason about context, and take action independently based on what they find. The distinction matters because most platforms that say “automated” are running static if-then logic, while truly autonomous systems (sometimes called agentic remediation) adapt their approach based on live environment data. Tamnoon’s model is agent-led and expert-supervised, meaning the AI operates autonomously through investigation and planning, but human experts validate high-risk actions before execution.

Does automated remediation replace security teams?

No. It changes what they spend their time on. Without automation, most of a security team’s bandwidth goes to manual context gathering, alert triage, and ticket management. With automated remediation, the AI handles that operational workload while the team focuses on decision-making, exception handling, and strategic risk reduction. The human role shifts from doing the investigation to reviewing the investigation and approving the action. Teams don’t shrink. They get more done with the same headcount.

How do you evaluate an automated remediation platform?

Start with integration depth. The platform should ingest findings from your existing CNAPPs and detection tools without requiring you to rip and replace anything. From there, look at how it handles investigation. Does it just re-prioritize alerts, or does it actually enrich findings with live cloud data and assess blast radius before recommending a fix? Check whether it has a clear safety framework (like a confidence score with defined thresholds) and whether it supports human approval workflows for high-risk actions. Finally, look at what happens after the fix. Validation, audit trails, and recurrence prevention are what separate remediation platforms from glorified ticketing systems.

What’s the ROI of automated cloud remediation?

ROI shows up in three places. First, MTTR compression. Reducing mean time to remediate by 95% means the window of exposure shrinks from months to weeks or days. Second, team efficiency. When AI handles investigation and triage, the same team can cover significantly more ground without adding headcount. Organizations using Tamnoon have seen up to a 95% reduction in ticket volume. Third, risk reduction. Fewer open exposures means lower breach likelihood, which translates directly to avoided incident response costs, regulatory penalties, and reputational damage.

Discover the Latest From Tamnoon

There’s always more to learn. Visit our resource center.


Get Insights Delivered Weekly

Join 10,000+ cloud security leaders and get expert remediation tips and best practices to test in your own CNAPP.