Incidents Are Complex. Your Response Shouldn't Be.

Transparent AI agents that investigate, diagnose, and resolve — every step visible, editable, and reproducible.

Watch an Incident Unfold

A production pod is crash-looping. Here's what DagKnows does — step by step.

Alert

CrashLoopBackOff on production pod

PagerDuty fires — user-facing service is down

Hercules

AI builds a causal DAG — not a linear checklist

Three hypotheses tested in parallel

  • Resource Exhaustion
    • Memory exceeds limits → within bounds
    • CPU throttled → normal usage
  • Application Crash
    • Exit code 1 → runtime error
      • Missing environment variable
        • ConfigMap updated 23 min ago → root cause
      • Upstream dependency timeout → all reachable
    • Liveness probe failing → probe OK
  • Infrastructure Issue
    • Node pressure or eviction → node healthy
    • Image pull failure → image exists
Each check auto-generates kubectl commands, log queries, and scripts — executed via proxy
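The parallel fan-out over the three top-level hypotheses can be pictured in plain Python. This is a minimal sketch: the check functions and pod fields below are hypothetical stand-ins for the generated kubectl commands and log queries, not DagKnows' actual internals.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical check functions -- in a real investigation each would run
# generated kubectl commands or API queries through the proxy.
def check_resource_exhaustion(pod):
    return "within bounds" if pod["memory_mb"] <= pod["memory_limit_mb"] else "exceeded"

def check_application_crash(pod):
    # A non-zero exit code points at a runtime error worth drilling into.
    return "runtime error" if pod["exit_code"] == 1 else "clean exit"

def check_infrastructure(pod):
    return "node healthy" if not pod["node_pressure"] else "node pressure"

def run_hypotheses(pod):
    """Fan out from the DAG root: test every top-level hypothesis in parallel."""
    checks = {
        "Resource Exhaustion": check_resource_exhaustion,
        "Application Crash": check_application_crash,
        "Infrastructure Issue": check_infrastructure,
    }
    with ThreadPoolExecutor() as pool:
        futures = {name: pool.submit(fn, pod) for name, fn in checks.items()}
        return {name: f.result() for name, f in futures.items()}

pod = {"memory_mb": 410, "memory_limit_mb": 512, "exit_code": 1, "node_pressure": False}
print(run_hypotheses(pod))
```

Because each hypothesis is an independent node, verdicts come back concurrently; only the branch that confirms ("runtime error" here) gets expanded further.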
RCA

Structured RCA posted to Slack

Evidence chain + rollback recommendation

Every command, every decision — fully visible and replayable.

Core Philosophy

Transparent & Reproducible, Always

See the exact code, commands, and reasoning the AI uses. Edit any step. Same inputs, same workflow, every time. No black boxes.

CrashLoopBackOff Investigation: Completed
  • Resource Exhaustion
    • Check memory limits → within bounds
      metrics = k8s.get_pod_metrics("payment-svc", ns)
      exceeded = metrics.memory_mb > limits.memory_mb
    • Check CPU throttling → normal
      stats = k8s.top_pod("payment-svc", ns)
      throttled = stats.cpu_percent > 90
  • Application Error
    • Container exit code
      pod = k8s.read_namespaced_pod("payment-svc", ns)
      state = pod.status.container_statuses[0].last_state
      exit_code = state.terminated.exit_code  # exit_code = 1
      • Verify env variables → root cause
        cm = k8s.read_config_map("payment-svc-config", ns)
        diff = compare_versions(cm, previous_version)
        # KEY_NAME removed in commit a3f8c1d (23 min ago)
    • Liveness probe config → probe OK
      probe = k8s.get_liveness_probe("payment-svc", ns)
      failures = k8s.get_event_count(pod, "Unhealthy")
  • Infrastructure Issue
    • Node health → healthy
      node = k8s.read_node(node_name)
      pressure = any(
          c.type == "MemoryPressure" for c in node.status.conditions
      )
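The exit-code step above uses an illustrative `k8s` helper. With the official Kubernetes Python client the same lookup can be sketched as follows; the pure helper is testable without a cluster, and the commented lines show assumed live usage.

```python
from types import SimpleNamespace as NS

def last_exit_code(pod):
    """Return the last-terminated exit code of the first container, else None."""
    statuses = pod.status.container_statuses or []
    if not statuses or statuses[0].last_state.terminated is None:
        return None
    return statuses[0].last_state.terminated.exit_code

# Against a live cluster (official client; pod/namespace names are examples):
#   from kubernetes import client, config
#   config.load_kube_config()
#   pod = client.CoreV1Api().read_namespaced_pod("payment-svc", "prod")
#   print(last_exit_code(pod))

# Stubbed pod object mirroring the shape the client returns:
demo = NS(status=NS(container_statuses=[NS(last_state=NS(terminated=NS(exit_code=1)))]))
print(last_exit_code(demo))  # 1
```

Keeping the decision logic as a pure function over the pod object is what makes a step like this replayable: the same input always yields the same verdict.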

And every investigation makes the next one faster. DagKnows remembers.

Knowledge Graph

Memory That Gets Smarter Over Time

Every incident builds your operational memory. Successful investigations auto-promote to reusable playbooks. Known failure patterns match to proven resolutions. When people leave, the knowledge stays.

Knowledge graph and investigation history
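One minimal way to picture how a known failure pattern matches a proven resolution: fingerprint the incident and look it up. The fingerprint scheme and playbook names below are invented for illustration, not DagKnows' actual representation.

```python
def fingerprint(incident):
    """Reduce an incident to a coarse signature for playbook lookup (illustrative)."""
    return (incident["alert"], incident["root_cause_kind"])

# Playbooks promoted from earlier successful investigations (hypothetical names).
playbooks = {
    ("CrashLoopBackOff", "missing-env-var"): "rollback-configmap",
    ("CrashLoopBackOff", "oom-kill"): "raise-memory-limit",
}

def match_playbook(incident):
    """Return a proven resolution if the pattern is known, else escalate."""
    return playbooks.get(fingerprint(incident), "escalate-to-human")

print(match_playbook({"alert": "CrashLoopBackOff", "root_cause_kind": "missing-env-var"}))
# rollback-configmap
```

The table grows with every resolved incident, so the second occurrence of a pattern skips straight to the fix.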

100+ Built-In AI Agents

Best practices for AWS, Kubernetes, Grafana, Terraform, and more — injected into every generated tool. Create your own AI agents effortlessly.

AWS
Azure
GCP
Alibaba
Slack
Teams
Discord
Grafana
Prometheus
Datadog
New Relic
Dynatrace
Elastic
Splunk
Loki
Mimir
PagerDuty
Opsgenie
VictorOps
ServiceNow
Zendesk
Kubernetes
Docker
Terraform
Ansible
Vault
Consul
Jenkins
GitLab
GitHub
ArgoCD

Not just for troubleshooting. Use DagKnows for routine tasks, cost analysis, and understanding your systems.

Beyond Troubleshooting

The same AI-driven approach works for cost optimization, compliance checks, capacity planning, and everyday operational queries.

Cloud Cost Spike (AWS)

Prompt: "Analyze my AWS costs and give me an RCA on how to save costs in various categories."

Hercules: Builds an investigation plan across 8 cost categories: Bedrock, EC2, ECS, VPC, S3, CloudWatch, Lambda, and data transfer
Analyze: Executes each branch in the knowledge graph, querying AWS Cost Explorer and usage APIs
Found: Root cause: costs up 396% ($15.23 → $75.58). Claude Sonnet 4 Bedrock API usage in us-east-2 is the primary driver at $28.97 (38.3%)
RCA: Generates a full root cause analysis with executive summary, per-category breakdown, and savings recommendations
AWS cost analysis and savings RCA with root cause found
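The Cost Explorer branch can be approximated with boto3's real `get_cost_and_usage` call plus a small aggregation step. The live query is commented out because it needs AWS credentials, and the sample response here is fabricated for illustration.

```python
def top_cost_drivers(response, n=3):
    """Rank services by spend from a Cost Explorer get_cost_and_usage response."""
    totals = {}
    for period in response["ResultsByTime"]:
        for group in period["Groups"]:
            service = group["Keys"][0]
            amount = float(group["Metrics"]["UnblendedCost"]["Amount"])
            totals[service] = totals.get(service, 0.0) + amount
    return sorted(totals.items(), key=lambda kv: -kv[1])[:n]

# Live query (requires AWS credentials; dates are examples):
#   import boto3
#   resp = boto3.client("ce").get_cost_and_usage(
#       TimePeriod={"Start": "2025-01-01", "End": "2025-02-01"},
#       Granularity="MONTHLY",
#       Metrics=["UnblendedCost"],
#       GroupBy=[{"Type": "DIMENSION", "Key": "SERVICE"}],
#   )

sample = {"ResultsByTime": [{"Groups": [
    {"Keys": ["Amazon Bedrock"], "Metrics": {"UnblendedCost": {"Amount": "28.97"}}},
    {"Keys": ["Amazon EC2"], "Metrics": {"UnblendedCost": {"Amount": "12.40"}}},
]}]}
print(top_cost_drivers(sample, 2))
```

Ranking per-service totals like this is how a single line item (Bedrock at $28.97) surfaces as the primary driver of a spike.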

Ready for your environment? DagKnows deploys wherever your data lives.

Architecture

Deploy Your Way

SaaS or on-prem. Same full-featured platform either way.

SaaS: an SRE or operator works in the DagKnows UI (webapp) in DagKnows Cloud, which calls an LLM API (OpenAI, Amazon Bedrock, Azure AI, or Anthropic). A secure websocket links the cloud to a Proxy in your infrastructure, which handles code and CLI command execution against customer infra (on-prem or cloud) over ssh, winrm, and APIs, with credentials held in HashiCorp Vault.
  • Zero infrastructure to manage
  • Always up-to-date
  • Auto-scaling & HA built-in
  • Fastest time to value
On-prem: everything runs in your infrastructure, including the DagKnows UI (webapp), Proxy, and HashiCorp Vault, backed by a local LLM or a cloud LLM API. A secure websocket links the UI to the Proxy, which handles code and CLI command execution against customer infra (on-prem or cloud) over ssh, winrm, and APIs.
  • Zero external connections
  • Air-gap ready
  • Local LLM support
  • Full data sovereignty

Enterprise Security & Governance

Compliance-ready from day one.

RBAC & Workspaces

Role-based access with workspace isolation and per-task permissions.

SSO Integration

Google, Okta, GitHub, LDAP/AD. Accounts auto-created on first login.

Approval Gates

State-changing actions require explicit human sign-off.

Full Audit Trails

Every action logged. Who, what, when, why. Exportable.

API Access Tokens

Scoped JWT tokens for programmatic access and CI/CD.

Credential Vault

Secure credential storage. Never exposed in logs.
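Scoped access tokens of the kind described above can be sketched with stdlib HMAC signing. This is an illustrative toy, not DagKnows' token format (which is JWT-based); the secret and scope names are invented.

```python
import base64
import hashlib
import hmac
import json

SECRET = b"demo-secret"  # placeholder; real deployments keep signing keys in a vault

def mint_token(subject, scopes):
    """Mint a minimal HMAC-signed token carrying explicit scopes (illustrative)."""
    payload = json.dumps({"sub": subject, "scopes": scopes}, sort_keys=True).encode()
    sig = hmac.new(SECRET, payload, hashlib.sha256).hexdigest()
    return base64.urlsafe_b64encode(payload).decode() + "." + sig

def verify(token, required_scope):
    """Check the signature, then check the scope -- reject on either failure."""
    body, sig = token.rsplit(".", 1)
    payload = base64.urlsafe_b64decode(body)
    expected = hmac.new(SECRET, payload, hashlib.sha256).hexdigest()
    if not hmac.compare_digest(sig, expected):
        return False
    return required_scope in json.loads(payload)["scopes"]

tok = mint_token("ci-pipeline", ["tasks:read"])
print(verify(tok, "tasks:read"), verify(tok, "tasks:write"))  # True False
```

Scoping tokens per capability is what lets a CI/CD pipeline read task results without being able to trigger state-changing actions.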

Getting Started Is Simple

1

Connect

Point alerts to DagKnows, deploy the proxy. No firewall changes.

2

Build

Import runbooks or let AI generate them from incidents.

3

Respond

Start deterministic. Graduate to AI at your own pace.

See It In Action

Get a personalized walkthrough with your team's real-world scenarios.

Request a Demo