This guide covers the operational aspects of running the Grantex auth service in production. For initial setup and deployment options, see the Self-Hosting guide.

Health Check Endpoint

The auth service exposes a health check at GET /health that probes both PostgreSQL and Redis:
curl https://your-auth-service/health

Healthy response (200)

{
  "status": "ok"
}

Degraded response (503)

When one or more dependencies are unreachable, the endpoint returns 503 with the failing components:
{
  "status": "degraded",
  "failing": ["db", "redis"]
}
The failing array can contain "db" (PostgreSQL unreachable), "redis" (Redis unreachable), or both.
Wire this endpoint into your load balancer’s health check. Use a 5-second interval and 3-consecutive-failure threshold for removing unhealthy instances.
The health endpoint is unauthenticated (skipAuth: true) and does not count against rate limits.
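The removal policy above (5-second interval, 3 consecutive failures) can be sketched as a small tracker. This is an illustrative helper, not part of the service; the class name and threshold parameter are invented for the example:

```typescript
// Sketch of the load-balancer policy: an instance leaves rotation after
// 3 consecutive non-200 responses from GET /health, and a single 200
// resets the failure count.
class HealthTracker {
  private failures = 0;

  constructor(private readonly threshold = 3) {}

  // Record one probe result; returns true while the instance should
  // stay in rotation.
  record(statusCode: number): boolean {
    this.failures = statusCode === 200 ? 0 : this.failures + 1;
    return this.failures < this.threshold;
  }
}
```

Most load balancers implement this logic for you; the sketch just makes the recommended thresholds concrete.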

Required Configuration

The auth service validates required environment variables at startup and exits immediately if any are missing. This fail-fast behavior prevents partial startup with a broken configuration.
Variable          Required  Validated at  Failure mode
----------------  --------  ------------  --------------------------------------------
DATABASE_URL      Yes       Startup       Process exits with error
REDIS_URL         Yes       Startup       Process exits with error
RSA_PRIVATE_KEY   Yes*      Startup       Process exits unless AUTO_GENERATE_KEYS=true
JWT_ISSUER        No        Startup       Defaults to https://grantex.dev
* In development, set AUTO_GENERATE_KEYS=true to auto-generate an ephemeral RSA keypair. Never use this in production.
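The fail-fast check can be sketched as a pure function over the environment. The variable names come from the table above; the function name is illustrative, not the service's actual code:

```typescript
// Return the names of required variables that are missing, honoring the
// AUTO_GENERATE_KEYS escape hatch for RSA_PRIVATE_KEY. At startup the
// service would call this and exit(1) if the result is non-empty.
function missingRequiredVars(env: Record<string, string | undefined>): string[] {
  const missing = ["DATABASE_URL", "REDIS_URL"].filter((key) => !env[key]);
  if (!env.RSA_PRIVATE_KEY && env.AUTO_GENERATE_KEYS !== "true") {
    missing.push("RSA_PRIVATE_KEY");
  }
  return missing;
}
```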

Full environment variable reference

See the Self-Hosting guide for the complete list of environment variables including optional Stripe, FIDO, email, and policy engine settings.

Startup sequence

The auth service boots in this order:
  1. OpenTelemetry tracing — initialized first to hook module loading (only if OTEL_EXPORTER_OTLP_ENDPOINT is set)
  2. RSA key initialization — loads or generates the RSA keypair for JWT signing
  3. Ed25519 key initialization — optional, for DID/VC support
  4. Database connection — connects to PostgreSQL
  5. Migrations — runs all *.sql migration files idempotently
  6. Redis connection — connects to Redis with lazy connect
  7. Seed data — creates dev accounts if SEED_API_KEY or SEED_SANDBOX_KEY are set
  8. HTTP server — starts listening on PORT (default 3001)
  9. Background workers — webhook delivery worker and anomaly detection worker start polling
If any step fails, the process exits with code 1 and logs the error to stdout.
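The fail-fast boot loop can be sketched as follows. The `boot` helper and `BootStep` type are illustrative; only the ordering and exit-on-first-failure behavior come from the sequence above:

```typescript
// Run each boot step in order; on the first failure, log the error and
// exit with code 1, matching the startup behavior described above.
type BootStep = { name: string; run: () => Promise<void> };

async function boot(steps: BootStep[]): Promise<void> {
  for (const step of steps) {
    try {
      await step.run();
    } catch (err) {
      console.error(`boot failed at ${step.name}:`, err);
      process.exit(1);
    }
  }
}
```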

Graceful Shutdown

The auth service handles SIGTERM signals for graceful shutdown. When running in Kubernetes or Docker, the container runtime sends SIGTERM before force-killing the process. When OpenTelemetry tracing is enabled, the SIGTERM handler flushes pending trace spans before the process exits:
process.on('SIGTERM', () => {
  // Flush pending spans, then exit once the flush settles. Without the
  // explicit exit, the handler would keep the process alive until the
  // runtime's force-kill.
  sdk.shutdown()
    .catch(console.error)
    .finally(() => process.exit(0));
});

Kubernetes configuration

Set a terminationGracePeriodSeconds that gives the service enough time to finish in-flight requests:
spec:
  terminationGracePeriodSeconds: 30
  containers:
    - name: auth-service
      livenessProbe:
        httpGet:
          path: /health
          port: 3001
        initialDelaySeconds: 10
        periodSeconds: 5
      readinessProbe:
        httpGet:
          path: /health
          port: 3001
        initialDelaySeconds: 5
        periodSeconds: 5

Database Connection

The auth service uses postgres.js for PostgreSQL connections. The connection is lazily initialized on first use and reused for the lifetime of the process.

Connection string format

postgresql://user:password@host:5432/grantex?sslmode=require
Always use sslmode=require (or verify-full for stricter validation) in production.

Connection pool behavior

postgres.js manages an internal connection pool. Default settings:
Setting          Default         Description
---------------  --------------  ---------------------------------------------
Max connections  10              Maximum concurrent connections
Idle timeout     0 (no timeout)  Connections stay open until the process exits
Connect timeout  30s             Time to wait for a new connection
For high-throughput deployments, tune the pool by passing options to the postgres constructor in db/client.ts.
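A hedged sketch of that tuning in db/client.ts: the option names (max, idle_timeout, connect_timeout) are real postgres.js settings, and the values shown mirror the defaults in the table above; adjust them for your workload.

```typescript
import postgres from "postgres";

// Pool options passed to the postgres.js constructor. Values shown are
// the defaults; raise `max` for high-throughput deployments.
const sql = postgres(process.env.DATABASE_URL!, {
  max: 10,             // maximum concurrent connections
  idle_timeout: 0,     // 0 = idle connections stay open
  connect_timeout: 30, // seconds to wait for a new connection
});

export default sql;
```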

Monitoring connections

Check active connections via PostgreSQL:
SELECT count(*) FROM pg_stat_activity
WHERE datname = 'grantex' AND state = 'active';

Redis Connection

The auth service uses ioredis with lazy connect. Redis stores:
  • Rate limit counters — sliding-window request counts per IP
  • Ephemeral token metadata — in-flight authorization request state

Reconnection behavior

ioredis automatically reconnects when the Redis connection drops. Default behavior:
  • Retries with exponential backoff (starting at 50ms, capped at 2 seconds)
  • No maximum retry count — reconnects indefinitely
  • Queues commands during disconnection and replays them on reconnect
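The backoff schedule above can be made explicit with a custom retryStrategy. The function below is a sketch that reproduces the described behavior (exponential from 50 ms, capped at 2 seconds); pass it to the ioredis constructor as `{ retryStrategy }` if you want the policy in your own code rather than relying on defaults:

```typescript
// times is the reconnection attempt number, starting at 1. Returning a
// number tells ioredis to wait that many milliseconds before retrying;
// never returning null/undefined means it reconnects indefinitely.
function retryStrategy(times: number): number {
  return Math.min(50 * 2 ** (times - 1), 2000);
}
```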

Data durability

Redis is not the source of truth. If Redis data is lost:
  • Rate limits reset — clients get fresh windows (no security impact beyond a brief burst)
  • In-flight auth requests fail — users must restart the consent flow
  • No permanent data is lost — PostgreSQL is the durable store
For high-availability deployments, use Redis Sentinel or Redis Cluster. ioredis supports both modes natively.

Webhook Delivery

The auth service delivers webhooks with automatic retry and exponential backoff. A background worker polls the webhook_deliveries table every 30 seconds for pending deliveries.

Retry policy

Attempt    Delay        Cumulative wait
---------  -----------  ---------------
1st retry  30 seconds   30s
2nd retry  60 seconds   1.5 min
3rd retry  120 seconds  3.5 min
4th retry  240 seconds  7.5 min
5th retry  480 seconds  15.5 min
The backoff formula is 30 * 2^attempt seconds, with attempts numbered from 0, so the delay doubles on each retry. After 5 failed attempts (configurable via max_attempts in the deliveries table), the delivery is marked as failed.
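The schedule can be expressed directly from the formula. The helper names below are illustrative:

```typescript
// Delay before retry number `attempt` (numbered from 0), per the
// 30 * 2^attempt backoff formula.
function retryDelaySeconds(attempt: number): number {
  return 30 * 2 ** attempt;
}

// Total wait across all retries before a delivery is marked failed.
function cumulativeWaitSeconds(maxAttempts: number): number {
  let total = 0;
  for (let a = 0; a < maxAttempts; a++) total += retryDelaySeconds(a);
  return total;
}
```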

Delivery mechanics

  • Timeout: Each delivery attempt has a 10-second timeout
  • Success: Any 2xx response marks the delivery as delivered
  • Failure: Non-2xx responses or network errors trigger a retry
  • Signature: Every payload includes an X-Grantex-Signature header for HMAC verification
  • User-Agent: Requests are sent with Grantex-Webhooks/0.1
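Receivers should verify the X-Grantex-Signature header before trusting a payload. The sketch below assumes the signature is a hex-encoded HMAC-SHA256 of the raw request body; check the webhook documentation for the exact scheme your version uses before relying on this:

```typescript
import { createHmac, timingSafeEqual } from "node:crypto";

// Constant-time comparison of the received signature against the HMAC
// we compute from the raw body and the shared webhook secret.
function verifySignature(rawBody: string, signature: string, secret: string): boolean {
  const expected = createHmac("sha256", secret).update(rawBody).digest("hex");
  if (signature.length !== expected.length) return false;
  return timingSafeEqual(Buffer.from(signature), Buffer.from(expected));
}
```

Always verify against the raw body bytes as received; re-serializing parsed JSON can change the byte sequence and break the comparison.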

Monitoring deliveries

Query delivery status for a webhook endpoint:
curl https://api.grantex.dev/v1/webhooks/<webhook-id>/deliveries \
  -H "Authorization: Bearer <api-key>"
This returns delivery history with status (pending, delivered, failed), attempt count, and error details.

Background Workers

Two background workers run after the HTTP server starts:
Worker             Interval      Purpose
-----------------  ------------  --------------------------------------------------------------------------------
Webhook delivery   30 seconds    Processes pending webhook deliveries with exponential backoff
Anomaly detection  Configurable  Scans for unusual patterns (rate spikes, off-hours activity, high failure rates)
Both workers are started in the main process. They run on setInterval timers and execute one initial run immediately on startup.
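The pattern (one immediate run, then a setInterval loop that never crashes the process) can be sketched with an illustrative helper; the name and signature are invented for the example:

```typescript
// Run the worker once immediately, then on every interval tick. Errors
// are logged with the worker's prefix and swallowed, so a failed
// iteration just waits for the next tick.
function startWorker(
  name: string,
  run: () => Promise<void>,
  intervalMs: number,
): NodeJS.Timeout {
  const tick = () => run().catch((err) => console.error(`[${name}]`, err));
  tick(); // initial run on startup
  return setInterval(tick, intervalMs);
}
```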

Worker health

Workers log errors to stdout but do not crash the process. If a worker iteration fails, it retries on the next interval. Monitor worker health by checking for [webhook-delivery] and [anomaly-detection] prefixed log messages.

Logging

All logs are emitted as structured JSON to stdout, compatible with:
  • Datadog — auto-parsed by the Datadog agent
  • Grafana Loki — ingestible via Promtail
  • AWS CloudWatch Logs — auto-parsed in JSON format
  • Google Cloud Logging — structured log entries
Each log entry includes:
{
  "level": "info",
  "time": 1712345678901,
  "reqId": "a1b2c3d4-e5f6-...",
  "msg": "request completed",
  "responseTime": 12
}

Log levels

Level  When
-----  ----------------------------------------------------------
info   Normal request lifecycle (start, complete)
warn   Deprecated API usage, approaching rate limits
error  Failed requests, worker errors, database connection issues
fatal  Startup failures (missing config, migration errors)

Database Migrations

Migrations run automatically on every startup. The auth service reads all *.sql files from the migrations/ directory and executes each one using idempotent DDL (CREATE TABLE IF NOT EXISTS, etc.). To upgrade, restart the service. New migration files are applied automatically. There is no separate migration command or rollback mechanism — migrations are designed to be forward-only and non-destructive.
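A forward-only runner of this shape can be sketched as follows. The helper name and its file-map signature are illustrative; only the sorted-filename ordering and the reliance on idempotent SQL come from the description above:

```typescript
// Apply every *.sql file in sorted filename order. The SQL itself must
// be idempotent (CREATE TABLE IF NOT EXISTS, ...) since every file runs
// again on every startup. Returns the filenames that were applied.
async function runMigrations(
  files: Record<string, string>,           // filename -> SQL text
  execute: (sql: string) => Promise<void>, // e.g. a postgres.js query call
): Promise<string[]> {
  const applied = Object.keys(files)
    .filter((name) => name.endsWith(".sql"))
    .sort();
  for (const name of applied) {
    await execute(files[name]);
  }
  return applied;
}
```

Sorting by filename is what makes numeric prefixes (001_, 002_, ...) the de facto ordering mechanism.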

Operational Checklist

  • Health check is wired to your load balancer
  • All required environment variables are set
  • Database connection uses SSL (sslmode=require)
  • Redis is on a private network with authentication
  • Log forwarding is configured (Datadog, Loki, CloudWatch, etc.)
  • Prometheus metrics are scraped from /metrics
  • CPU and memory limits are set in your container orchestrator
  • Automated database backups are configured
  • Webhook endpoints are monitored for delivery failures
  • Alerting rules are set for error rate spikes and health check failures