Monitoring and Maintaining AlertD
This page describes how to monitor a running AlertD deployment, diagnose fault conditions, and perform routine and emergency maintenance: watching application logs, rotating secrets, applying software updates, managing AWS service quotas, backing up data, and recovering from failures.
AlertD is deployed via CloudFormation and runs on managed AWS services (ECS Fargate, Aurora Serverless v2, ALB, Secrets Manager). Most monitoring and maintenance is handled through standard AWS tooling — this page maps each task to the right console and command.
Monitoring Application Health
The most direct signal of AlertD’s health is the application container’s log stream in CloudWatch Logs. The container writes structured log lines on startup, on every incoming request, and whenever the request pipeline raises an exception. Watching this stream for errors is how you detect problems before users do.
Locating the Application Log Group
CloudFormation creates a CloudWatch Logs log group for the AlertD application task. Its name is derived from your stack name and follows the pattern /ecs/<stack-name>/app.
- In the AWS console, open CloudWatch → Log groups.
- Filter by your CloudFormation stack name; the application log group is the one ending in /app.
- Click the log group to see the per-task log streams. Each restart or rolling deployment creates a new stream.
Watching the Live Stream
To tail the application logs in real time from your terminal:
    aws logs tail /ecs/<stack-name>/app --follow
Add --since 1h to backfill the last hour, or --filter-pattern ERROR to surface only error-level lines:
    aws logs tail /ecs/<stack-name>/app --follow --filter-pattern ERROR
What to Look For
A healthy AlertD task logs steady request lines and periodic background-job lines. Treat these as fault signals:
| Pattern | What It Indicates | Where to Look Next |
|---|---|---|
| ERROR, FATAL, or unhandled stack traces | Application exception (request failed or background job crashed) | The full stack trace in the same log stream |
| Repeated connection refused / connection reset to the Aurora endpoint | Database is unreachable from the task | RDS console → cluster status; security group rules |
| AccessDenied from STS or AssumeRole failures | Trust policy on the monitoring role is wrong, or the role was deleted | IAM → roles → trust relationship |
| 401/403 from api.openai.com or api.anthropic.com | LLM API key is invalid or revoked | Secrets Manager LLM key secret; rotate per Secrets Management |
| Task exits and restarts within seconds (crash loop) | Misconfiguration at startup | The very first lines of each new log stream |
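The line-level patterns in the table can be applied mechanically while tailing the stream. The following is a minimal sketch, not part of AlertD: the category labels and the `has` helper are hypothetical names, and the substring matches assume the log lines contain the literal text shown in the table, since AlertD's exact log format is not specified on this page. The crash-loop row is not covered because it cannot be detected from a single line.

```shell
# Return success when string $1 contains substring $2.
has() { case "$1" in *"$2"*) return 0 ;; *) return 1 ;; esac; }

# Map one log line to a fault category from the table (labels are hypothetical).
classify_log_line() {
  line=$1
  if has "$line" "AccessDenied" || has "$line" "AssumeRole"; then
    echo "iam-trust"
  elif { has "$line" "api.openai.com" || has "$line" "api.anthropic.com"; } &&
       { has "$line" "401" || has "$line" "403"; }; then
    echo "llm-key"
  elif has "$line" "connection refused" || has "$line" "connection reset"; then
    echo "database"
  elif has "$line" "ERROR" || has "$line" "FATAL"; then
    echo "app-exception"
  else
    echo "ok"
  fi
}

# Usage (needs AWS credentials; prints only lines that match a fault category):
#   aws logs tail "/ecs/<stack-name>/app" --follow --format short |
#     while IFS= read -r line; do
#       kind=$(classify_log_line "$line")
#       [ "$kind" = "ok" ] || printf '%s\t%s\n' "$kind" "$line"
#     done
```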
ALB Target Health
The ALB performs an HTTP health check against the AlertD task. In EC2 → Load Balancers → <AlertD ALB> → Target groups, the application target group should show all targets as healthy. An unhealthy target usually points back to a problem visible in the application log stream (failed startup, dependency unreachable, etc.).
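The same target-health view is available from the CLI. This is a hedged sketch: `target_group_status` is a hypothetical helper, and the target group ARN is a value you look up yourself (for example in the stack's Resources tab); it is not given on this page.

```shell
# Print "healthy" if every registered target reports healthy, else "degraded".
# An empty target list is treated as degraded.
target_group_status() { # target group ARN
  states=$(aws elbv2 describe-target-health \
    --target-group-arn "$1" \
    --query 'TargetHealthDescriptions[].TargetHealth.State' \
    --output text) || return 2
  [ -n "$states" ] || { echo "degraded"; return 1; }
  for s in $states; do
    [ "$s" = "healthy" ] || { echo "degraded"; return 1; }
  done
  echo "healthy"
}

# Usage:
#   target_group_status "<target-group-arn>"
```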
Handling Fault Conditions
This section maps user-visible symptoms to the diagnostic steps and the section that contains the fix. Start by tailing the application log stream (Monitoring Application Health) — it is the fastest way to localize most issues.
The AlertD URL Returns 502 or 503
The ALB cannot reach a healthy application target.
- In EC2 → Load Balancers → Target groups, confirm whether the application target is unhealthy.
- In ECS → Clusters → <AlertD cluster> → Services → <app service>, check whether desired/running counts match.
- Open the application log group and read the latest log stream from the top — startup failures (bad LLM key, unreachable Aurora) appear in the first few lines.
- Apply the matching fix below, then force a new deployment (see Recovering the Software → ECS Task Failure).
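The desired/running comparison in the steps above can also be scripted. A minimal sketch, assuming you substitute your own cluster and service names; `service_counts_match` is a hypothetical helper name.

```shell
# Succeed (exit 0) only when the service's running count equals its desired count.
service_counts_match() { # cluster name, service name
  counts=$(aws ecs describe-services \
    --cluster "$1" --services "$2" \
    --query 'services[0].[desiredCount,runningCount]' \
    --output text) || return 2
  # Text output is "desired<TAB>running"; split it into positional parameters.
  set -- $counts
  [ "$#" -eq 2 ] && [ "$1" = "$2" ]
}

# Usage:
#   if service_counts_match "<AlertD-cluster-name>" "<AlertD-app-service-name>"; then
#     echo "service is at its desired count"
#   else
#     echo "service is below its desired count; read the newest log stream"
#   fi
```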
Repeated 401/403 from the LLM Provider
The OpenAI or Anthropic API is rejecting AlertD’s requests.
- Confirm the symptom in the application log group (api.openai.com or api.anthropic.com 4xx lines).
- Rotate the LLM API key per Rotating the LLM API Key.
AccessDenied When Querying AWS
AlertD cannot assume the customer’s monitoring role.
- In the application log group, look for STS AssumeRole errors with the target role ARN.
- In the target AWS account, open IAM → Roles → AlertD-Role → Trust relationships and confirm the Principal matches the AlertD ECS task role ARN.
- Confirm the ReadOnlyAccess managed policy is still attached.
- See Advanced Setup → Troubleshooting for the full diagnostic checklist.
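The trust-relationship and policy checks above can be run from the CLI with credentials for the target account. This is a sketch under assumptions: both helper names are hypothetical, and the task role ARN is a value you supply from your own deployment.

```shell
# Succeed if the role's trust policy lists the given principal ARN.
role_trusts_principal() { # role name, expected principal ARN
  principals=$(aws iam get-role --role-name "$1" \
    --query 'Role.AssumeRolePolicyDocument.Statement[].Principal.AWS' \
    --output text) || return 2
  for p in $principals; do
    [ "$p" = "$2" ] && return 0
  done
  return 1
}

# Succeed if the ReadOnlyAccess managed policy is attached to the role.
role_has_readonly() { # role name
  aws iam list-attached-role-policies --role-name "$1" \
    --query 'AttachedPolicies[].PolicyName' --output text |
    tr '\t' '\n' | grep -qx 'ReadOnlyAccess'
}

# Usage:
#   role_trusts_principal AlertD-Role "<AlertD-ECS-task-role-ARN>" || echo "trust policy mismatch"
#   role_has_readonly AlertD-Role || echo "ReadOnlyAccess is not attached"
```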
Database Connection Errors
The application cannot reach Aurora.
- In the application log group, look for connection refused, connection reset, or PostgreSQL driver errors.
- In RDS → Databases → <AlertD cluster>, confirm the cluster status is Available and the writer endpoint resolves.
- Confirm the application security group has egress to the Aurora security group on port 5432.
- If the cluster is in Backing-up or Modifying state, wait; Aurora Serverless v2 fails over automatically and the application reconnects on its own.
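The cluster-status check above can be scripted as follows. A minimal sketch: the helper names and the action strings are hypothetical, and note that the RDS API reports status values in lowercase (available, backing-up, modifying) where the console capitalizes them.

```shell
# Print the cluster's current status as reported by the RDS API.
cluster_status() { # cluster identifier
  aws rds describe-db-clusters --db-cluster-identifier "$1" \
    --query 'DBClusters[0].Status' --output text
}

# Translate a status into the next step from the checklist above.
next_action() { # status string
  case "$1" in
    available)            echo "check security groups and endpoint DNS" ;;
    backing-up|modifying) echo "wait and re-check" ;;
    *)                    echo "investigate in the RDS console" ;;
  esac
}

# Usage:
#   next_action "$(cluster_status "<AlertD-cluster-identifier>")"
```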
Stuck or Slow Queries
A user-issued query never completes.
- The application enforces a 2-minute timeout on queries (see FAQ → What’s the query timeout?). A query that runs longer than that will be cancelled and surfaced as an error.
- If many queries time out, check Aurora capacity (RDS → <AlertD cluster> → Monitoring → ACU utilization). If ACUs are saturated, raise the maximum capacity on the cluster.
- Suggest the user narrow scope (region filter, more specific resource type) per the FAQ performance tips.
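Raising the maximum capacity can be done from the CLI as well as the console. The sketch below is illustrative only: the 90 percent threshold and both helper names are assumptions, the utilization figure is one you read from the Monitoring tab, and the command passes both capacity bounds, so supply the current minimum unchanged if only the maximum should move.

```shell
# Succeed when a utilization percentage is at or above a threshold.
is_saturated() { # utilization percent, threshold percent
  awk -v v="$1" -v t="$2" 'BEGIN { exit !((v + 0) >= (t + 0)) }'
}

# Set the cluster's Serverless v2 capacity range (values are in ACUs).
raise_max_acu() { # cluster identifier, min ACU, new max ACU
  aws rds modify-db-cluster \
    --db-cluster-identifier "$1" \
    --serverless-v2-scaling-configuration "MinCapacity=$2,MaxCapacity=$3" \
    --apply-immediately
}

# Usage (example numbers, not recommendations):
#   if is_saturated 97.5 90; then
#     raise_max_acu "<AlertD-cluster-identifier>" 0.5 16
#   fi
```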
Authentication Failures at the AlertD Login Screen
Users cannot complete Google or GitHub login.
- Confirm outbound TCP 443 is open from end-user browsers to auth.demo.alertd.ai (see Prerequisites → Network Access).
- See Simple Setup → Troubleshooting → Authentication Issues for known browser/popup workarounds.
If none of the above match, capture the latest application log lines and contact support.
Secrets Management
AlertD stores two categories of secrets in AWS Secrets Manager:
- Aurora database credentials: created automatically by CloudFormation when the stack is deployed.
- LLM API key: created only if you selected openai or anthropic as the model profile (see Deployment Step 5).
Rotating the LLM API Key
If your OpenAI or Anthropic API key is leaked, deactivated, or simply being rotated on a schedule:
- Generate a new key in the OpenAI or Anthropic console.
- In the AWS console, open Secrets Manager and locate the AlertD LLM key secret (its name is prefixed with your CloudFormation stack name).
- Choose Retrieve secret value → Edit, paste the new key, and save.
- Force a new ECS deployment so the AlertD task picks up the new value:
      aws ecs update-service \
        --cluster <AlertD-cluster-name> \
        --service <AlertD-app-service-name> \
        --force-new-deployment
- The AlertD task restarts with the new key. End users experience no downtime because ECS performs a rolling deployment.
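The console steps above can be combined into one CLI sequence. This is a sketch under a loud assumption: it treats the secret as holding the bare key as a plain string. If your secret is a JSON document, write the full document with the new key in place instead. `rotate_llm_key` is a hypothetical helper, and the key is read from a file so that it does not appear in shell history.

```shell
# Store a new key, redeploy the service, and wait for the rollout to settle.
rotate_llm_key() { # secret id, path to file holding the new key, cluster, service
  aws secretsmanager put-secret-value \
    --secret-id "$1" --secret-string "file://$2" >/dev/null &&
  aws ecs update-service \
    --cluster "$3" --service "$4" --force-new-deployment >/dev/null &&
  aws ecs wait services-stable --cluster "$3" --services "$4"
}

# Usage:
#   rotate_llm_key "<llm-key-secret-name>" ./new-key.txt \
#     "<AlertD-cluster-name>" "<AlertD-app-service-name>"
```

The final wait returns once ECS reports the service as stable, which is a convenient point to re-check the application log stream for 401/403 lines.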
Rotating the Aurora Database Credentials
Aurora Serverless v2 credentials live in Secrets Manager and can be rotated using AWS’s built-in rotation:
- Open Secrets Manager and select the AlertD database secret.
- Choose Edit rotation and enable Automatic rotation with the AWS-managed Lambda rotation function for RDS.
- Set the rotation schedule (e.g., every 30 days).
- Trigger an immediate rotation with Rotate secret immediately to verify the workflow.
- After rotation, force a new ECS deployment (same command as above) so the application reconnects with the new credentials.
For a manual one-time rotation, follow the same procedure with Rotate secret immediately.
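To confirm from the CLI that rotation is enabled, and to trigger a rotation on demand, something like the following may help. It is a sketch: the helper names are hypothetical, and `rotate_now` assumes rotation has already been configured on the secret as described above.

```shell
# Succeed if Secrets Manager reports rotation as enabled for the secret.
rotation_enabled() { # secret id
  [ "$(aws secretsmanager describe-secret --secret-id "$1" \
        --query 'RotationEnabled' --output text)" = "True" ]
}

# Start a rotation immediately, using the secret's existing rotation setup.
rotate_now() { # secret id
  aws secretsmanager rotate-secret --secret-id "$1"
}

# Usage:
#   rotation_enabled "<database-secret-name>" && rotate_now "<database-secret-name>"
```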
Cryptographic Keys
AlertD does not require customers to create or manage any KMS keys. Encryption at rest uses the AWS-managed default keys for Aurora, EBS, Secrets Manager, and CloudWatch Logs; rotation of those keys is handled automatically by AWS. See the Security Model for details.
Software Patches and Upgrades
AlertD is delivered as a CloudFormation template that references pinned container images for the application and Pulsar tasks. Aurora is patched by AWS on its standard maintenance schedule.
Upgrading AlertD
- Open the CloudFormation console and select your AlertD stack.
- Choose Update → Replace current template.
- Provide the latest AlertD template URL (the canonical URL stays the same; new versions are published in place):
      https://alertd-publicassets.s3.us-west-1.amazonaws.com/cloudformation/AlertD_RDS.yaml
- Step through the wizard; leave parameter values unchanged unless release notes indicate otherwise.
- Acknowledge the IAM capabilities checkbox and submit.
- CloudFormation performs a rolling update of the ECS services. The ALB drains old tasks and shifts traffic to new ones, so there is no downtime.
Subscribe to AlertD release announcements via your design-partner contact to be notified when a new template is published.
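The console upgrade above has a CLI equivalent. The sketch below is one way to do it, not an official AlertD procedure: it reuses every existing parameter value, which fails if a new template version adds or removes parameters, and it passes both IAM capability flags because this page does not say which one the template needs. The helper names are hypothetical.

```shell
TEMPLATE_URL="https://alertd-publicassets.s3.us-west-1.amazonaws.com/cloudformation/AlertD_RDS.yaml"

# Turn parameter keys into "keep the previous value" arguments for update-stack.
previous_value_params() { # parameter keys as separate arguments
  for key in "$@"; do
    printf 'ParameterKey=%s,UsePreviousValue=true ' "$key"
  done
}

# Update the stack to the published template, keeping current parameter values.
upgrade_stack() { # stack name
  stack=$1
  keys=$(aws cloudformation describe-stacks --stack-name "$stack" \
    --query 'Stacks[0].Parameters[].ParameterKey' --output text) || return 2
  [ "$keys" != "None" ] || keys=""
  set -- --stack-name "$stack" --template-url "$TEMPLATE_URL" \
         --capabilities CAPABILITY_IAM CAPABILITY_NAMED_IAM
  if [ -n "$keys" ]; then
    # Word splitting is intentional: one argument per parameter.
    set -- "$@" --parameters $(previous_value_params $keys)
  fi
  aws cloudformation update-stack "$@" &&
    aws cloudformation wait stack-update-complete --stack-name "$stack"
}

# Usage:
#   upgrade_stack "<stack-name>"
```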
Aurora Patching
Aurora Serverless v2 receives minor-version patches automatically during the cluster’s maintenance window. Major-version upgrades are opt-in and can be performed from the RDS → Modify workflow on the AlertD database cluster. We recommend taking a manual snapshot (see Backup and Recovery below) before any major-version change.
Container Image Patching
The AlertD application and Pulsar container images are published with security patches applied. Picking up the latest images is done by running the stack update above; there is no separate image-pull step.
Managing AWS Service Limits
AlertD is constrained by the standard AWS account quotas listed below. Most accounts are well within the default limits, but consolidated or large-scale environments may need to request increases through the Service Quotas console.
| Service | Quota | Default | AlertD’s Usage |
|---|---|---|---|
| VPC | VPCs per region | 5 | 1 (only if AlertD creates a new VPC) |
| VPC | Elastic IPs per region | 5 | 1 (for the NAT Gateway) |
| VPC | NAT Gateways per AZ | 5 | 1 per AZ |
| EC2 | Fargate On-Demand vCPU | Account-level | ~2–4 vCPU continuously |
| ELB | Application Load Balancers per region | 50 | 1 |
| ECS | Services per cluster | 5,000 | 2 (app + Pulsar) |
| RDS | Aurora Serverless v2 ACUs per cluster | 256 | Scales with workload |
| Secrets Manager | Secrets per region | 500,000 | 1–2 |
To request an increase: open Service Quotas in the AWS console, find the quota, and choose Request quota increase. Increases are usually approved within a few hours for common services.
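Quota lookups and increase requests can also be made from the CLI. A minimal sketch with hypothetical helper names; the quota code is whatever `find_quota` returns for your account, and no specific code is assumed here.

```shell
# List quota code, name, and current value for quotas whose name contains a fragment.
find_quota() { # service code (for example vpc), fragment of the quota name
  aws service-quotas list-service-quotas --service-code "$1" \
    --query "Quotas[?contains(QuotaName, '$2')].[QuotaCode,QuotaName,Value]" \
    --output text
}

# Submit an increase request for one quota.
request_increase() { # service code, quota code, desired value
  aws service-quotas request-service-quota-increase \
    --service-code "$1" --quota-code "$2" --desired-value "$3"
}

# Usage:
#   find_quota vpc "VPCs per Region"
#   request_increase vpc "<quota-code-from-the-listing>" 10
```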
Backup and Recovery
Data Stores in Scope
AlertD stores all customer state in Aurora Serverless v2 (queries, session history, query results, execution plans, workspace metadata). The application and Pulsar tasks are stateless — their disks are ephemeral and require no backup.
Aurora Automated Backups
Aurora Serverless v2 takes continuous backups and retains them for 7 days by default (configurable up to 35 days), with point-in-time recovery (PITR) to any second within the retention window.
To verify or change backup settings:
- Open RDS → Databases → <AlertD cluster>.
- Under Maintenance & backups, confirm the backup retention period and the preferred backup window.
- To extend retention, choose Modify and set a longer retention period.
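The same settings can be read and changed from the CLI. This sketch uses hypothetical helper names and validates the requested retention against the 35-day ceiling mentioned above before calling AWS.

```shell
# Print the retention period (days) and the preferred backup window.
show_backup_settings() { # cluster identifier
  aws rds describe-db-clusters --db-cluster-identifier "$1" \
    --query 'DBClusters[0].[BackupRetentionPeriod,PreferredBackupWindow]' \
    --output text
}

# Change the retention period after checking it is a whole number from 1 to 35.
set_backup_retention() { # cluster identifier, days
  case "$2" in
    ''|*[!0-9]*) echo "days must be a whole number" >&2; return 1 ;;
  esac
  if [ "$2" -lt 1 ] || [ "$2" -gt 35 ]; then
    echo "days must be between 1 and 35" >&2
    return 1
  fi
  aws rds modify-db-cluster --db-cluster-identifier "$1" \
    --backup-retention-period "$2" --apply-immediately
}

# Usage:
#   show_backup_settings "<AlertD-cluster-identifier>"
#   set_backup_retention "<AlertD-cluster-identifier>" 14
```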
Taking a Manual Snapshot
Manual snapshots are retained until you delete them and are not subject to the retention window.
- Open RDS → Databases → <AlertD cluster>.
- Choose Actions → Take snapshot, give it a descriptive name, and submit.
We recommend taking a manual snapshot before any AlertD stack upgrade or Aurora major-version change.
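A pre-upgrade snapshot is easy to script so that it is never skipped. A minimal sketch: the naming scheme (cluster, label, UTC timestamp) and both helper names are assumptions made for this example.

```shell
# Build a snapshot name such as <cluster>-<label>-YYYYMMDD-HHMMSS (UTC).
snapshot_name() { # cluster identifier, label
  printf '%s-%s-%s\n' "$1" "$2" "$(date -u +%Y%m%d-%H%M%S)"
}

# Create a manual cluster snapshot, wait until it is available, print its name.
take_snapshot() { # cluster identifier, label
  name=$(snapshot_name "$1" "$2")
  aws rds create-db-cluster-snapshot \
    --db-cluster-identifier "$1" \
    --db-cluster-snapshot-identifier "$name" >/dev/null &&
  aws rds wait db-cluster-snapshot-available \
    --db-cluster-snapshot-identifier "$name" &&
  echo "$name"
}

# Usage:
#   take_snapshot "<AlertD-cluster-identifier>" pre-upgrade
```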
Restoring from a Backup
The AlertD stack owns the Aurora cluster, so restoring from a backup requires rebuilding the stack around the restored cluster rather than swapping the endpoint inside the running stack:
- Open RDS → Databases → <AlertD cluster> → Actions → Restore to point in time (or Restore snapshot) and let RDS provision a new cluster from the backup.
- Delete the existing AlertD CloudFormation stack (keep the restored Aurora cluster — do not delete it).
- Redeploy the AlertD CloudFormation stack and, during stack creation, import the restored Aurora cluster and its Secrets Manager secret as existing resources rather than letting CloudFormation create new ones. See AWS: Bringing existing resources into CloudFormation management.
- Once the stack reaches CREATE_COMPLETE, the AlertD application reconnects to the restored data.
Configuration
The entire deployment configuration is captured in the CloudFormation template. To restore configuration from scratch, redeploy the stack with the same parameters and import the restored Aurora cluster as described above.
Recovering the Software
ECS Task Failure
ECS automatically detects task failures and replaces them. If a task is crash-looping:
- Open ECS → Clusters → <AlertD cluster> → Tasks to see the stopped task and its exit reason.
- Open CloudWatch Logs for the corresponding log group to inspect the failure.
- Common causes: invalid LLM API key, unreachable Aurora endpoint, or Secrets Manager permission issues. Fix the underlying configuration and force a new deployment:
      aws ecs update-service \
        --cluster <AlertD-cluster-name> \
        --service <AlertD-app-service-name> \
        --force-new-deployment
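The stopped-task inspection in the first step can be done from the CLI. This is a sketch with a hypothetical helper name; it prints the stop time, stop reason, and first container's exit code for each recently stopped task that ECS still lists.

```shell
# Print stop time, stop reason, and exit code for stopped tasks in a cluster.
stopped_task_reasons() { # cluster name
  arns=$(aws ecs list-tasks --cluster "$1" --desired-status STOPPED \
    --query 'taskArns[]' --output text) || return 2
  if [ -z "$arns" ] || [ "$arns" = "None" ]; then
    echo "no stopped tasks"
    return 0
  fi
  # Word splitting is intentional: one argument per task ARN.
  aws ecs describe-tasks --cluster "$1" --tasks $arns \
    --query 'tasks[].[stoppedAt,stoppedReason,containers[0].exitCode]' \
    --output text
}

# Usage:
#   stopped_task_reasons "<AlertD-cluster-name>"
```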
Aurora Failure
For transient Aurora issues, Aurora Serverless v2 automatically fails over. For persistent corruption or accidental data loss, restore from a backup using the procedure in Backup and Recovery.
Failed Stack Update
If a CloudFormation update fails:
- CloudFormation will attempt an automatic rollback to the previous working state.
- If the rollback succeeds, investigate the cause (Events tab) before retrying.
- If the rollback also fails, choose Continue rollback and skip any non-recoverable resources, or open a support ticket.
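The Events-tab investigation and the Continue rollback action both have CLI equivalents. A hedged sketch with hypothetical helper names; the logical resource IDs to skip are ones you identify from the failed events in your own stack.

```shell
# List events whose status ends in FAILED, with the reason CloudFormation recorded.
failed_events() { # stack name
  aws cloudformation describe-stack-events --stack-name "$1" \
    --query "StackEvents[?ends_with(ResourceStatus, 'FAILED')].[Timestamp,LogicalResourceId,ResourceStatusReason]" \
    --output text
}

# Resume a failed rollback, optionally skipping resources that cannot recover.
continue_rollback() { # stack name, then zero or more logical resource IDs to skip
  stack=$1
  shift
  if [ "$#" -gt 0 ]; then
    aws cloudformation continue-update-rollback \
      --stack-name "$stack" --resources-to-skip "$@"
  else
    aws cloudformation continue-update-rollback --stack-name "$stack"
  fi
}

# Usage:
#   failed_events "<stack-name>"
#   continue_rollback "<stack-name>" "<logical-resource-id>"
```

Skipping a resource tells CloudFormation to mark it as rolled back without verifying it, so reserve that for resources you have confirmed are unrecoverable.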
Complete Rebuild
If the stack cannot be recovered (e.g., accidental deletion of critical resources):
- Take a manual snapshot of the Aurora cluster (if it is still present).
- Delete the AlertD CloudFormation stack.
- Redeploy the stack following the Deployment Guide. If you took a snapshot, first restore it to a new cluster, then import that restored Aurora cluster during stack creation (see Restoring from a Backup).
Support
If you need help with any of the procedures above, contact AlertD support at support@alertd.ai. See the FAQ → Getting Help section for response-time expectations.