Monitoring and Maintaining AlertD

This page describes how to monitor a running AlertD deployment, diagnose fault conditions, and perform routine and emergency maintenance: watching application logs, rotating secrets, applying software updates, managing AWS service quotas, backing up data, and recovering from failures.

AlertD is deployed via CloudFormation and runs on managed AWS services (ECS Fargate, Aurora Serverless v2, ALB, Secrets Manager). Most monitoring and maintenance is handled through standard AWS tooling — this page maps each task to the right console and command.

Monitoring Application Health

The most direct signal of AlertD’s health is the application container’s log stream in CloudWatch Logs. The container writes structured log lines on startup, on every incoming request, and whenever the request pipeline raises an exception. Watching this stream for errors is how you detect problems before users do.

Locating the Application Log Group

CloudFormation creates a CloudWatch Logs log group for the AlertD application task. Its name is derived from your stack name and follows the pattern /ecs/<stack-name>/app.

In the AWS console, open CloudWatch → Log groups.
Filter by your CloudFormation stack name; the application log group is the one ending in /app.
Click the log group to see the per-task log streams. Each restart or rolling deployment creates a new stream.

Watching the Live Stream

To tail the application logs in real time from your terminal:


aws logs tail /ecs/<stack-name>/app --follow

Add --since 1h to backfill the last hour, or --filter-pattern ERROR to surface only error-level lines:


aws logs tail /ecs/<stack-name>/app --follow --filter-pattern ERROR

What to Look For

A healthy AlertD task logs steady request lines and periodic background-job lines. Treat these as fault signals:

Pattern	What It Indicates	Where to Look Next
`ERROR`, `FATAL`, or unhandled stack traces	Application exception (request failed or background job crashed)	The full stack trace in the same log stream
Repeated `connection refused` / `connection reset` to the Aurora endpoint	Database is unreachable from the task	RDS console → cluster status; security group rules
`AccessDenied` from STS or `AssumeRole` failures	Trust policy on the monitoring role is wrong, or the role was deleted	IAM → roles → trust relationship
`401`/`403` from `api.openai.com` or `api.anthropic.com`	LLM API key is invalid or revoked	Secrets Manager LLM key secret; rotate per Secrets Management
Task exits and restarts within seconds (crash loop)	Misconfiguration at startup	The very first lines of each new log stream
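
The table above can be turned into a rough triage filter. The sketch below is illustrative only: the function names and category labels are hypothetical, and it assumes the patterns appear verbatim in the log line, which depends on AlertD's actual log format.

```shell
# Hypothetical helper: succeed if string $1 contains substring $2.
contains() {
  case "$1" in *"$2"*) return 0 ;; *) return 1 ;; esac
}

# Map one log line to a fault category from the table above.
# The category labels are made up for this sketch.
classify_log_line() {
  line=$1
  if contains "$line" "connection refused" || contains "$line" "connection reset"; then
    echo "database-unreachable"
  elif contains "$line" "AccessDenied"; then
    echo "monitoring-role"
  elif { contains "$line" "api.openai.com" || contains "$line" "api.anthropic.com"; } &&
       { contains "$line" "401" || contains "$line" "403"; }; then
    echo "llm-key-rejected"
  elif contains "$line" "ERROR" || contains "$line" "FATAL"; then
    echo "application-exception"
  else
    echo "ok"
  fi
}
```

To use it, pipe the output of aws logs tail /ecs/<stack-name>/app --since 1h into a while IFS= read -r line loop and call classify_log_line "$line" for each line.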

ALB Target Health

The ALB performs an HTTP health check against the AlertD task. In EC2 → Load Balancers → <AlertD ALB> → Target groups, the application target group should show all targets as healthy. An unhealthy target usually points back to a problem visible in the application log stream (failed startup, dependency unreachable, etc.).
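
The same check can be run from a terminal. This is a minimal sketch: the function name is hypothetical, and the target group ARN is something you would copy from the console or the stack's resources.

```shell
# List targets that are NOT healthy in a target group.
# $1: target group ARN (assumed to be known to the operator).
unhealthy_targets() {
  aws elbv2 describe-target-health \
    --target-group-arn "$1" \
    --query "TargetHealthDescriptions[?TargetHealth.State!='healthy'].[Target.Id,TargetHealth.State,TargetHealth.Reason]" \
    --output text
}
```

Empty output means every registered target is currently reported as healthy.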

Handling Fault Conditions

This section maps user-visible symptoms to the diagnostic steps and the section that contains the fix. Start by tailing the application log stream (Monitoring Application Health) — it is the fastest way to localize most issues.

The AlertD URL Returns 502 or 503

The ALB cannot reach a healthy application target.

In EC2 → Load Balancers → Target groups, confirm whether the application target is unhealthy.
In ECS → Clusters → <AlertD cluster> → Services → <app service>, check whether desired/running counts match.
Open the application log group and read the latest log stream from the top — startup failures (bad LLM key, unreachable Aurora) appear in the first few lines.
Apply the matching fix below, then force a new deployment (see Recovering the Software → ECS Task Failure).
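
The desired/running comparison in the second step can also be scripted. The following is a sketch under the assumption that you know the cluster and service names; the function name is hypothetical.

```shell
# Print desired and running task counts for a service.
# Exit status: 0 if they match, 1 if they differ, 2 if the AWS call failed.
# $1: ECS cluster name, $2: ECS service name.
ecs_counts_match() {
  counts=$(aws ecs describe-services \
    --cluster "$1" --services "$2" \
    --query 'services[0].[desiredCount,runningCount]' \
    --output text) || return 2
  # Text output is whitespace-separated: "<desired> <running>".
  set -- $counts
  echo "desired=$1 running=$2"
  [ "$1" = "$2" ]
}
```

For example, ecs_counts_match <AlertD-cluster-name> <AlertD-app-service-name> || echo "service is degraded".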

Repeated 401/403 from the LLM Provider

The OpenAI or Anthropic API is rejecting AlertD’s requests.

Confirm the symptom in the application log group (api.openai.com or api.anthropic.com 4xx lines).
Rotate the LLM API key per Rotating the LLM API Key.

`AccessDenied` When Querying AWS

AlertD cannot assume the customer’s monitoring role.

In the application log group, look for STS AssumeRole errors with the target role ARN.
In the target AWS account, open IAM → Roles → AlertD-Role → Trust relationships and confirm the Principal matches the AlertD ECS task role ARN.
Confirm the ReadOnlyAccess managed policy is still attached.
See Advanced Setup → Troubleshooting for the full diagnostic checklist.
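
A quick way to inspect both the trust relationship and the attached policies from the CLI is sketched below. It assumes the role is named AlertD-Role as in the steps above and that your credentials point at the target (monitored) account; the function name is hypothetical.

```shell
# Show which principals may assume the monitoring role, then list the
# managed policies attached to it (ReadOnlyAccess should be among them).
# $1: role name (defaults to AlertD-Role, the name used in this page).
show_monitoring_role() {
  role=${1:-AlertD-Role}
  aws iam get-role --role-name "$role" \
    --query 'Role.AssumeRolePolicyDocument.Statement[].Principal' \
    --output json
  aws iam list-attached-role-policies --role-name "$role" \
    --query 'AttachedPolicies[].PolicyName' \
    --output text
}
```

Compare the printed Principal against the AlertD ECS task role ARN by eye; the sketch does not attempt to validate it.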

Database Connection Errors

The application cannot reach Aurora.

In the application log group, look for connection refused, connection reset, or PostgreSQL driver errors.
In RDS → Databases → <AlertD cluster>, confirm the cluster status is Available and the writer endpoint resolves.
Confirm the application security group has egress to the Aurora security group on port 5432.
If the cluster status is Backing-up or Modifying, wait for it to return to Available. Aurora Serverless v2 fails over automatically, and the application reconnects on its own.
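
The cluster status check can be scripted as follows. This is a sketch: the function name is hypothetical, and note that the RDS API reports the status in lowercase (available) even though the console capitalizes it.

```shell
# Print the Aurora cluster status and succeed only if it is "available".
# $1: DB cluster identifier.
aurora_is_available() {
  status=$(aws rds describe-db-clusters \
    --db-cluster-identifier "$1" \
    --query 'DBClusters[0].Status' \
    --output text) || return 2
  echo "status=$status"
  [ "$status" = "available" ]
}
```

Replacing the query with 'DBClusters[0].[Endpoint,Port]' prints the writer endpoint and port for the security group check.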

Stuck or Slow Queries

A user-issued query never completes.

The application enforces a 2-minute timeout on queries (see FAQ → What’s the query timeout?). A query that runs longer than that will be cancelled and surfaced as an error.
If many queries time out, check Aurora capacity (RDS → <AlertD cluster> → Monitoring → ACU utilization). If ACUs are saturated, raise the maximum capacity on the cluster.
Suggest the user narrow scope (region filter, more specific resource type) per the FAQ performance tips.
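
If ACUs are saturated, the maximum capacity can be raised from the CLI. The sketch below uses a hypothetical function name and example values; because CloudFormation owns the cluster, a change made this way may show up as stack drift, so check whether your template exposes a capacity parameter first.

```shell
# Set the Serverless v2 capacity range on the cluster.
# $1: DB cluster identifier, $2: minimum ACUs, $3: maximum ACUs.
set_acu_range() {
  aws rds modify-db-cluster \
    --db-cluster-identifier "$1" \
    --serverless-v2-scaling-configuration "MinCapacity=$2,MaxCapacity=$3" \
    --apply-immediately
}
```

For example, set_acu_range <AlertD cluster> 0.5 16 keeps the floor at 0.5 ACU and raises the ceiling to 16.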

Login Failures

Users cannot complete Google or GitHub login.

Confirm outbound TCP 443 is open from end-user browsers to auth.demo.alertd.ai (see Prerequisites → Network Access).
See Simple Setup → Troubleshooting → Authentication Issues for known browser/popup workarounds.

If none of the above match, capture the latest application log lines and contact support.

Secrets Management

AlertD stores two categories of secrets in AWS Secrets Manager:

Aurora database credentials — created automatically by CloudFormation when the stack is deployed.
LLM API key — created only if you selected openai or anthropic as the model profile (see Deployment Step 5).

Rotating the LLM API Key

If your OpenAI or Anthropic API key is leaked, deactivated, or simply being rotated on a schedule:

Generate a new key in the OpenAI or Anthropic console.
In the AWS console, open Secrets Manager and locate the AlertD LLM key secret (its name is prefixed with your CloudFormation stack name).
Choose Retrieve secret value → Edit, paste the new key, and save.
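
The console steps above have a CLI equivalent. The sketch below assumes the secret holds the key as a plain string; if your stack stores it as a JSON document, the file must contain the full JSON instead. The function name is hypothetical, and reading the key from a file keeps it out of your shell history.

```shell
# Store a new LLM API key in Secrets Manager.
# $1: secret name or ARN, $2: path to a file containing the new value.
update_llm_key() {
  if [ ! -r "$2" ]; then
    echo "cannot read key file: $2" >&2
    return 1
  fi
  aws secretsmanager put-secret-value \
    --secret-id "$1" \
    --secret-string "file://$2"
}
```

Delete the key file once the call succeeds.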

Force a new ECS deployment so the AlertD task picks up the new value:


aws ecs update-service \
  --cluster <AlertD-cluster-name> \
  --service <AlertD-app-service-name> \
  --force-new-deployment

The AlertD task restarts with the new key. End users experience no downtime because ECS performs a rolling deployment.
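
To confirm that the rollout actually finished, the forced deployment can be combined with the ECS waiter. This is a minimal sketch with a hypothetical function name.

```shell
# Force a new deployment and block until the service reports steady state.
# $1: ECS cluster name, $2: ECS service name.
redeploy_and_wait() {
  aws ecs update-service \
    --cluster "$1" --service "$2" \
    --force-new-deployment >/dev/null &&
  aws ecs wait services-stable --cluster "$1" --services "$2" &&
  echo "service $2 is stable"
}
```

A non-zero exit status means either the update call failed or the waiter gave up before the service stabilized; in both cases, check the application log stream.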

Rotating the Aurora Database Credentials

Aurora Serverless v2 credentials live in Secrets Manager and can be rotated using AWS’s built-in rotation:

Open Secrets Manager and select the AlertD database secret.
Choose Edit rotation and enable Automatic rotation with the AWS-managed Lambda rotation function for RDS.
Set the rotation schedule (e.g., every 30 days).
Trigger an immediate rotation with Rotate secret immediately to verify the workflow.
After rotation, force a new ECS deployment (same command as above) so the application reconnects with the new credentials.

For a one-time manual rotation, open the same secret, choose Rotate secret immediately, and then force a new ECS deployment.
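
Once rotation has been configured on the secret, an immediate rotation can also be triggered and checked from the CLI. The function names below are hypothetical.

```shell
# Trigger an immediate rotation (rotation must already be configured).
# $1: secret name or ARN of the AlertD database secret.
rotate_db_secret_now() {
  aws secretsmanager rotate-secret --secret-id "$1"
}

# Print whether rotation is enabled and when the secret was last rotated.
rotation_status() {
  aws secretsmanager describe-secret --secret-id "$1" \
    --query '[RotationEnabled,LastRotatedDate]' \
    --output text
}
```

Run rotation_status before and after rotate_db_secret_now to confirm that the last-rotated date moved.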

Cryptographic Keys

AlertD does not require customers to create or manage any KMS keys. Encryption at rest uses the AWS-managed default keys for Aurora, EBS, Secrets Manager, and CloudWatch Logs; rotation of those keys is handled automatically by AWS. See the Security Model for details.

Software Patches and Upgrades

AlertD is delivered as a CloudFormation template that references pinned container images for the application and Pulsar tasks. Aurora is patched by AWS on its standard maintenance schedule.

Upgrading AlertD

Open the CloudFormation console and select your AlertD stack.
Choose Update → Replace current template.
Provide the latest AlertD template URL (the canonical URL stays the same; new versions are published in place):
```
https://alertd-publicassets.s3.us-west-1.amazonaws.com/cloudformation/AlertD_RDS.yaml
```
Step through the wizard; leave parameter values unchanged unless release notes indicate otherwise.
Acknowledge the IAM capabilities checkbox and submit.
CloudFormation performs a rolling update of the ECS services. The ALB drains old tasks and shifts traffic to new ones, so there is no downtime.
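
The console steps above can be approximated from the CLI. The sketch below is an assumption-laden example: the function name is hypothetical, it passes both IAM capability flags because this page does not say which one the template needs, and it reuses every existing parameter value, which fails if a new template version adds a required parameter or removes an old one.

```shell
# Template URL taken from the upgrade steps in this page.
ALERTD_TEMPLATE_URL="https://alertd-publicassets.s3.us-west-1.amazonaws.com/cloudformation/AlertD_RDS.yaml"

# Update the stack to the latest template, keeping current parameter values.
# $1: CloudFormation stack name.
upgrade_stack() {
  stack=$1
  keys=$(aws cloudformation describe-stacks --stack-name "$stack" \
    --query 'Stacks[0].Parameters[].ParameterKey' --output text) || return 1
  if [ "$keys" = "None" ]; then keys=""; fi
  # Build "ParameterKey=<name>,UsePreviousValue=true" for each parameter.
  set --
  for key in $keys; do
    set -- "$@" "ParameterKey=$key,UsePreviousValue=true"
  done
  if [ "$#" -gt 0 ]; then set -- --parameters "$@"; fi
  aws cloudformation update-stack \
    --stack-name "$stack" \
    --template-url "$ALERTD_TEMPLATE_URL" \
    --capabilities CAPABILITY_IAM CAPABILITY_NAMED_IAM \
    "$@" &&
  aws cloudformation wait stack-update-complete --stack-name "$stack"
}
```

If the release notes call for a changed parameter, use the console wizard or adapt the parameter list by hand rather than relying on this sketch.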

Subscribe to AlertD release announcements via your design-partner contact to be notified when a new template is published.

Aurora Patching

Aurora Serverless v2 receives minor-version patches automatically during the cluster’s maintenance window. Major-version upgrades are opt-in and can be performed from the RDS → Modify workflow on the AlertD database cluster. We recommend taking a manual snapshot (see Backup and Recovery below) before any major-version change.

Container Image Patching

The AlertD application and Pulsar container images are published with security patches applied. Picking up the latest images is done by running the stack update above; there is no separate image-pull step.

Managing AWS Service Limits

AlertD is constrained by the standard AWS account quotas listed below. Most accounts are well within the default limits, but consolidated or large-scale environments may need to request increases through the Service Quotas console.

Service	Quota	Default	AlertD’s Usage
VPC	VPCs per region	5	1 (only if AlertD creates a new VPC)
VPC	Elastic IPs per region	5	1 (for the NAT Gateway)
VPC	NAT Gateways per AZ	5	1 per AZ
Fargate	Fargate On-Demand vCPU	Account-level	~2–4 vCPU continuously
ELB	Application Load Balancers per region	50	1
ECS	Services per cluster	5,000	2 (app + Pulsar)
RDS	Aurora Serverless v2 ACUs per cluster	256	Scales with workload
Secrets Manager	Secrets per region	500,000	1–2

To request an increase: open Service Quotas in the AWS console, find the quota, and choose Request quota increase. Increases are usually approved within a few hours for common services.
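
Quotas can also be listed and raised from the CLI. In this sketch the function names are hypothetical, and the service and quota codes are values you look up yourself (aws service-quotas list-services prints the service codes; this page does not list them).

```shell
# List quota code, name, and current value for one service.
# $1: service code, for example "vpc".
list_quotas() {
  aws service-quotas list-service-quotas --service-code "$1" \
    --query 'Quotas[].[QuotaCode,QuotaName,Value]' \
    --output text
}

# Request a higher value for one quota.
# $1: service code, $2: quota code from list_quotas, $3: desired value.
request_quota_increase() {
  aws service-quotas request-service-quota-increase \
    --service-code "$1" --quota-code "$2" --desired-value "$3"
}
```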

Backup and Recovery

Data Stores in Scope

AlertD stores all customer state in Aurora Serverless v2 (queries, session history, query results, execution plans, workspace metadata). The application and Pulsar tasks are stateless — their disks are ephemeral and require no backup.

Aurora Automated Backups

Aurora Serverless v2 takes continuous backups and retains them for 7 days by default (configurable up to 35 days), with point-in-time recovery (PITR) to any second within the retention window.

To verify or change backup settings:

Open RDS → Databases → <AlertD cluster>.
Under Maintenance & backups, confirm the backup retention period and the preferred backup window.
To extend retention, choose Modify and set a longer retention period.
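
Extending retention from the CLI might look like the sketch below. The function name is hypothetical; the range check mirrors the 35-day upper limit mentioned above and assumes a 1-day minimum.

```shell
# Set the automated backup retention period on the cluster.
# $1: DB cluster identifier, $2: retention in days (1 to 35).
set_backup_retention() {
  case "$2" in
    ''|*[!0-9]*)
      echo "retention must be a whole number of days" >&2
      return 1 ;;
  esac
  if [ "$2" -lt 1 ] || [ "$2" -gt 35 ]; then
    echo "retention must be between 1 and 35 days" >&2
    return 1
  fi
  aws rds modify-db-cluster \
    --db-cluster-identifier "$1" \
    --backup-retention-period "$2" \
    --apply-immediately
}
```

As with any out-of-band change to a stack-owned resource, CloudFormation may report this as drift.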

Taking a Manual Snapshot

Manual snapshots are retained until you delete them and are not subject to the retention window.

Open RDS → Databases → <AlertD cluster>.
Choose Actions → Take snapshot, give it a descriptive name, and submit.

We recommend taking a manual snapshot before any AlertD stack upgrade or Aurora major-version change.
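
A pre-upgrade snapshot can be scripted so that it is not skipped. This is a minimal sketch; the function name and the default naming scheme are hypothetical.

```shell
# Take a manual cluster snapshot and wait until it is available.
# $1: DB cluster identifier.
# $2: optional snapshot name (defaults to <cluster>-manual-<UTC timestamp>).
# Prints the snapshot name on success.
snapshot_cluster() {
  name=${2:-"$1-manual-$(date -u +%Y%m%d-%H%M%S)"}
  aws rds create-db-cluster-snapshot \
    --db-cluster-identifier "$1" \
    --db-cluster-snapshot-identifier "$name" >/dev/null &&
  aws rds wait db-cluster-snapshot-available \
    --db-cluster-snapshot-identifier "$name" &&
  echo "$name"
}
```

Running snapshot_cluster <AlertD cluster> pre-upgrade before a stack update gives you a named restore point.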

Restoring from a Backup

The AlertD stack owns the Aurora cluster, so restoring from a backup requires rebuilding the stack around the restored cluster rather than swapping the endpoint inside the running stack:

Open RDS → Databases → <AlertD cluster> → Actions → Restore to point in time (or Restore snapshot) and let RDS provision a new cluster from the backup.
Delete the existing AlertD CloudFormation stack (keep the restored Aurora cluster — do not delete it).
Redeploy the AlertD CloudFormation stack and, during stack creation, import the restored Aurora cluster and its Secrets Manager secret as existing resources rather than letting CloudFormation create new ones. See AWS — Bringing existing resources into CloudFormation management.
Once the stack reaches CREATE_COMPLETE, the AlertD application reconnects to the restored data.
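
The first step (provisioning a new cluster from a backup) can be sketched with the CLI as follows. Several things here are assumptions: the function names are hypothetical, the engine is assumed to be aurora-postgresql because this page mentions port 5432 and PostgreSQL driver errors, and a real restore may also need subnet group, security group, and scaling flags that are omitted here. Unlike the console, a CLI restore creates only the cluster, so a DB instance has to be added separately.

```shell
# Restore a new cluster from the source cluster's continuous backup.
# $1: source cluster id, $2: new cluster id,
# $3: UTC restore time (e.g. 2024-01-01T00:00:00Z) or the word "latest".
restore_cluster_pitr() {
  src=$1
  new=$2
  if [ "$3" = "latest" ]; then
    set -- --use-latest-restorable-time
  else
    set -- --restore-to-time "$3"
  fi
  aws rds restore-db-cluster-to-point-in-time \
    --source-db-cluster-identifier "$src" \
    --db-cluster-identifier "$new" \
    "$@"
}

# Add a Serverless v2 instance to the restored cluster.
# $1: restored cluster id. The engine value is an assumption (see above).
add_serverless_instance() {
  aws rds create-db-instance \
    --db-cluster-identifier "$1" \
    --db-instance-identifier "$1-instance-1" \
    --db-instance-class db.serverless \
    --engine aurora-postgresql
}
```

The remaining steps (deleting the old stack and importing the restored cluster) are left to the console procedure above.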

Configuration

The entire deployment configuration is captured in the CloudFormation template. To restore configuration from scratch, redeploy the stack with the same parameters and import the restored Aurora cluster as described above.

Recovering the Software

ECS Task Failure

ECS automatically detects failed tasks and replaces them. If a task is crash-looping:

Open ECS → Clusters → <AlertD cluster> → Tasks to see the stopped task and its exit reason.
Open CloudWatch Logs for the corresponding log group to inspect the failure.
Common causes: invalid LLM API key, unreachable Aurora endpoint, or Secrets Manager permission issues. Fix the underlying configuration and force a new deployment:
```
aws ecs update-service \
  --cluster <AlertD-cluster-name> \
  --service <AlertD-app-service-name> \
  --force-new-deployment
```
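
The stopped-task inspection in the first step can also be done from the CLI. The sketch below uses a hypothetical function name and assumes the first container in each task is the one of interest.

```shell
# Print ARN, stop reason, and first container exit code for stopped tasks.
# $1: ECS cluster name.
stopped_task_reasons() {
  tasks=$(aws ecs list-tasks --cluster "$1" --desired-status STOPPED \
    --query 'taskArns' --output text) || return 1
  if [ -z "$tasks" ] || [ "$tasks" = "None" ]; then
    echo "no stopped tasks"
    return 0
  fi
  # $tasks is intentionally unquoted so each ARN becomes its own argument.
  aws ecs describe-tasks --cluster "$1" --tasks $tasks \
    --query 'tasks[].[taskArn,stoppedReason,containers[0].exitCode]' \
    --output text
}
```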

Aurora Failure

For transient Aurora issues, Aurora Serverless v2 automatically fails over. For persistent corruption or accidental data loss, restore from a backup using the procedure in Backup and Recovery.

Failed Stack Update

If a CloudFormation update fails:

CloudFormation will attempt an automatic rollback to the previous working state.
If the rollback succeeds, investigate the cause (Events tab) before retrying.
If the rollback also fails, choose Continue rollback and skip any non-recoverable resources, or open a support ticket.
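
Both the investigation and the Continue rollback action have CLI equivalents. This is a sketch with hypothetical function names; the logical resource IDs to skip come from the failed events and should be chosen with care, since skipped resources are left in whatever state they are in.

```shell
# Print timestamp, logical ID, and reason for every FAILED stack event.
# $1: CloudFormation stack name.
failed_stack_events() {
  aws cloudformation describe-stack-events --stack-name "$1" \
    --query "StackEvents[?contains(ResourceStatus, 'FAILED')].[Timestamp,LogicalResourceId,ResourceStatusReason]" \
    --output text
}

# Continue a failed rollback, optionally skipping resources.
# $1: stack name; remaining arguments: logical resource IDs to skip.
continue_rollback() {
  stack=$1
  shift
  if [ "$#" -gt 0 ]; then
    aws cloudformation continue-update-rollback \
      --stack-name "$stack" --resources-to-skip "$@"
  else
    aws cloudformation continue-update-rollback --stack-name "$stack"
  fi
}
```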

Complete Rebuild

If the stack cannot be recovered (e.g., accidental deletion of critical resources):

Take a manual snapshot of the Aurora cluster (if it is still present).
Delete the AlertD CloudFormation stack.
Restore the snapshot to a new Aurora cluster, then redeploy the stack following the Deployment Guide and import the restored cluster during stack creation (see Restoring from a Backup).

Support

If you need help with any of the procedures above, contact AlertD support at support@alertd.ai. See the FAQ → Getting Help section for response-time expectations.