# Metrics & Observability

> Monitor your Grantex deployment with Prometheus metrics, Grafana dashboards, alerting rules, and structured logging.

Monitoring lets you confirm that your Grantex deployment is healthy, performant, and secure, and catch regressions before users do. This guide covers the native Prometheus metrics endpoint, Grafana dashboard templates, alert thresholds, and logging best practices.

## Prometheus Metrics Endpoint

The auth service exposes a `GET /metrics` endpoint that returns metrics in the Prometheus text exposition format:

```bash
curl https://your-auth-service/metrics
```

This endpoint is **unauthenticated** (no API key required) and rate-limited to 10 requests/minute per IP. Keep the scrape interval of any single scraper at 6 seconds or longer to stay under that limit.
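If you want to read these metrics outside Prometheus (for example in a smoke test), the response body can be parsed line by line. The sketch below is a simplified parser, not a full implementation of the exposition format: `parseMetrics` and `Sample` are illustrative names, and it ignores timestamps and escaped characters inside label values.

```typescript
// One sample line from the Prometheus text exposition format.
interface Sample {
  name: string;
  labels: Record<string, string>;
  value: number;
}

// Map the special values used by the format onto JavaScript numbers.
function parseValue(text: string): number {
  if (text === "+Inf") return Infinity;
  if (text === "-Inf") return -Infinity;
  return Number(text);
}

// Parse exposition text into samples.
// Simplified: skips "# HELP" / "# TYPE" comments, ignores optional timestamps,
// and does not handle escaped quotes inside label values.
function parseMetrics(text: string): Sample[] {
  const samples: Sample[] = [];
  const linePattern = /^([a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(.*)\})?\s+(\S+)/;
  for (const rawLine of text.split("\n")) {
    const line = rawLine.trim();
    if (line === "" || line.startsWith("#")) continue;
    const match = linePattern.exec(line);
    if (match === null) continue;
    const [, name = "", labelText = "", valueText = ""] = match;
    const labels: Record<string, string> = {};
    // A fresh regex per line so lastIndex always starts at 0.
    const labelPattern = /([a-zA-Z_][a-zA-Z0-9_]*)="([^"]*)"/g;
    let pair: RegExpExecArray | null;
    while ((pair = labelPattern.exec(labelText)) !== null) {
      const [, key = "", value = ""] = pair;
      labels[key] = value;
    }
    samples.push({ name, labels, value: parseValue(valueText) });
  }
  return samples;
}
```

Pass it the body returned by `GET /metrics`, then filter the result by `name` to pick out a single metric.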

### Counters

| Metric                             | Labels             | Description                        |
| ---------------------------------- | ------------------ | ---------------------------------- |
| `grantex_token_exchange_total`     | `status`           | Token exchange attempts            |
| `grantex_authorize_total`          | `status`           | Authorization requests             |
| `grantex_grants_revoked_total`     | —                  | Grants revoked (including cascade) |
| `grantex_webhook_deliveries_total` | `status`           | Webhook delivery outcomes          |
| `grantex_anomalies_detected_total` | `type`, `severity` | Anomalies detected                 |

### Histograms

| Metric                                    | Labels                           | Description                        |
| ----------------------------------------- | -------------------------------- | ---------------------------------- |
| `grantex_authorize_duration_seconds`      | —                                | Authorization request duration     |
| `grantex_token_exchange_duration_seconds` | —                                | Token exchange duration            |
| `grantex_http_request_duration_seconds`   | `method`, `route`, `status_code` | HTTP request duration (all routes) |
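Prometheus histograms are exported as cumulative `_bucket` series, one per upper bound (`le`), and percentiles such as p99 are estimated from those buckets by linear interpolation. The sketch below illustrates that estimation in the style of PromQL's `histogram_quantile`; the `Bucket` type and the function name are illustrative, and the bucket boundaries Grantex actually uses are not assumed here.

```typescript
// A cumulative histogram bucket: `count` observations were <= `le`.
// Use Infinity for the "+Inf" bucket.
interface Bucket {
  le: number;
  count: number;
}

// Estimate the q-quantile (0 <= q <= 1) from cumulative buckets,
// interpolating linearly inside the bucket that contains the target rank.
function histogramQuantile(q: number, buckets: Bucket[]): number {
  const sorted = [...buckets].sort((a, b) => a.le - b.le);
  const last = sorted[sorted.length - 1];
  if (last === undefined || last.count === 0) return NaN;

  const rank = q * last.count;
  let lowerBound = 0; // the lowest bucket is assumed to start at 0
  let lowerCount = 0;
  for (const bucket of sorted) {
    if (bucket.count >= rank) {
      // The quantile falls in the +Inf bucket: report the highest finite bound.
      if (bucket.le === Infinity) return lowerBound;
      // Avoid 0/0 when the bucket holds no observations of its own.
      if (bucket.count === lowerCount) return lowerBound;
      const fraction = (rank - lowerCount) / (bucket.count - lowerCount);
      return lowerBound + (bucket.le - lowerBound) * fraction;
    }
    lowerBound = bucket.le;
    lowerCount = bucket.count;
  }
  return lowerBound;
}
```

Because the result is interpolated, its accuracy depends on how closely the bucket boundaries bracket the latencies you care about.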

### Gauges

| Metric                             | Description                 |
| ---------------------------------- | --------------------------- |
| `grantex_active_grants`            | Current active grants count |
| `grantex_anomalies_unacknowledged` | Unacknowledged anomalies    |

### Environment Variables

| Variable          | Default | Description                                  |
| ----------------- | ------- | -------------------------------------------- |
| `METRICS_ENABLED` | `true`  | Set to `false` to disable metrics collection |

## Grafana Dashboards

Pre-built Grafana dashboards are available at [`deploy/grafana/`](https://github.com/mishrasanjeev/grantex/tree/main/deploy/grafana):

| Dashboard                  | Description                                                                                                                             |
| -------------------------- | --------------------------------------------------------------------------------------------------------------------------------------- |
| `overview-dashboard.json`  | Token exchange rate, success rate gauge, latency p50/p99, grants revoked, active grants, webhook deliveries, anomalies, HTTP error rate |
| `per-agent-dashboard.json` | Per-agent drill-down with a `$agent_id` template variable                                                                               |

### Import Instructions

1. In Grafana, go to **Dashboards > Import**
2. Upload the JSON file or paste its contents
3. Select your Prometheus data source when prompted (`${DS_PROMETHEUS}`)
4. Click **Import**

## Health Check Endpoint

The auth service exposes a `GET /health` endpoint that returns the service status:

```bash
curl https://your-auth-service/health
```

```json
{ "status": "ok" }
```

Use this endpoint for:

* **Load balancer health checks** — poll `/health` every 10–30 seconds
* **Uptime monitoring** — UptimeRobot, Pingdom, Cloud Monitoring
* **Kubernetes liveness probes** — `livenessProbe.httpGet.path: /health`

## Alerting Thresholds

Recommended thresholds for production alerting:

| Metric                      | Warning       | Critical      | Action                                      |
| --------------------------- | ------------- | ------------- | ------------------------------------------- |
| Token exchange failure rate | > 5%          | > 15%         | Check auth service logs                     |
| Token refresh failure rate  | > 5%          | > 15%         | Check for refresh token reuse or clock skew |
| Anomalies detected          | > 5/hour      | > 10/hour     | Review anomaly details                      |
| Webhook delivery success    | \< 98%        | \< 95%        | Verify endpoint availability                |
| HTTP 429 response rate      | > 50/min      | > 200/min     | Check for client misconfiguration or abuse  |
| Auth request latency (p99)  | > 500ms       | > 2s          | Check database or Redis performance         |
| Health check failures       | 1 consecutive | 3 consecutive | Restart the service                         |
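If you evaluate these thresholds in your own tooling rather than in Prometheus, note that some metrics breach when they rise (failure rate, latency) and others when they fall (delivery success). The sketch below encodes three rows of the table; the type and constant names are illustrative.

```typescript
type Severity = "ok" | "warning" | "critical";

interface Threshold {
  warning: number;
  critical: number;
  // "above": higher is worse (failure rate, latency).
  // "below": lower is worse (delivery success).
  direction: "above" | "below";
}

// Values taken from the table above, expressed as ratios and seconds.
const tokenExchangeFailureRate: Threshold = { warning: 0.05, critical: 0.15, direction: "above" };
const webhookDeliverySuccess: Threshold = { warning: 0.98, critical: 0.95, direction: "below" };
const authLatencyP99Seconds: Threshold = { warning: 0.5, critical: 2, direction: "above" };

// Check the critical limit first so the most severe level wins.
function classify(value: number, threshold: Threshold): Severity {
  const breached = (limit: number): boolean =>
    threshold.direction === "above" ? value > limit : value < limit;
  if (breached(threshold.critical)) return "critical";
  if (breached(threshold.warning)) return "warning";
  return "ok";
}
```

For example, `classify(0.96, webhookDeliverySuccess)` yields `"warning"` because 96% is below the 98% warning limit but not below the 95% critical limit.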

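### Alertmanager Rules

Prometheus evaluates the alerting rules below and forwards firing alerts to Alertmanager for routing and notification. Load them through `rule_files` in your Prometheus configuration: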

```yaml
groups:
  - name: grantex
    rules:
      - alert: HighTokenExchangeFailureRate
        expr: |
          sum(rate(grantex_token_exchange_total{status!="success"}[5m]))
          / sum(rate(grantex_token_exchange_total[5m])) > 0.05
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Token exchange failure rate > 5%"

      - alert: HighAuthLatency
        expr: |
          histogram_quantile(0.99, rate(grantex_authorize_duration_seconds_bucket[5m])) > 2
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Authorization p99 latency > 2s"

      - alert: WebhookDeliveryFailure
        expr: |
          sum(rate(grantex_webhook_deliveries_total{status="failed"}[5m]))
          / sum(rate(grantex_webhook_deliveries_total[5m])) > 0.05
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Webhook delivery failure rate > 5%"
```
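To make the ratio expressions concrete, the sketch below computes roughly what `HighTokenExchangeFailureRate` evaluates: the increase in non-success counts divided by the increase in all counts between two scrapes. It is an approximation of PromQL's `rate()` semantics, and `CounterSnapshot` and `failureRatio` are illustrative names.

```typescript
// Cumulative counter values keyed by the `status` label value.
type CounterSnapshot = Record<string, number>;

// Ratio of non-success increases to total increases between two snapshots.
function failureRatio(earlier: CounterSnapshot, later: CounterSnapshot): number {
  let total = 0;
  let failed = 0;
  for (const [status, count] of Object.entries(later)) {
    const previous = earlier[status] ?? 0;
    // A counter that went down was reset (for example by a restart),
    // so count from zero, similar to how rate() handles resets.
    const increase = count >= previous ? count - previous : count;
    total += increase;
    // Mirrors the status!="success" matcher in the rule above.
    if (status !== "success") failed += increase;
  }
  // No traffic gives 0/0. In PromQL that is NaN, and NaN > 0.05 is false,
  // so the alert does not fire on an idle service.
  return total === 0 ? NaN : failed / total;
}
```

The `for: 5m` clause then requires the condition to hold continuously for five minutes before the alert fires, which filters out short spikes.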

## Logging

### Structured Logging

The auth service uses [Pino](https://getpino.io/) for JSON-structured logging:

```json
{
  "level": "info",
  "msg": "grant.created",
  "timestamp": "2026-03-01T12:00:00.000Z",
  "grantId": "grnt_abc123",
  "agentId": "ag_def456",
  "principalId": "user_789",
  "scopes": ["calendar:read", "email:send"],
  "latencyMs": 45
}
```

### What to Log

| Event                     | Log Level | Key Fields                                    |
| ------------------------- | --------- | --------------------------------------------- |
| Grant created             | `info`    | `grantId`, `agentId`, `principalId`, `scopes` |
| Grant revoked             | `info`    | `grantId`, `revokedBy`, `cascadeCount`        |
| Token exchanged           | `info`    | `grantId`, `agentId`                          |
| Token refreshed           | `info`    | `grantId`, `agentId`                          |
| Token verification failed | `warn`    | `reason`, `tokenId`                           |
| Auth request denied       | `warn`    | `agentId`, `principalId`, `reason`            |
| Rate limit hit            | `warn`    | `ip`, `endpoint`, `retryAfter`                |
| Anomaly detected          | `warn`    | `type`, `severity`, `agentId`                 |
| Webhook delivery failed   | `error`   | `webhookId`, `url`, `statusCode`, `attempt`   |
| Database connection error | `error`   | `error`, `pool`                               |
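Because each log line is a standalone JSON object, filtering by level needs only a JSON parser. The sketch below assumes the record shape shown in the example above (a string `level` and a `msg` field); `filterLogs` and `LogRecord` are illustrative names, and the level ordering follows Pino's default level names.

```typescript
interface LogRecord {
  level: string;
  msg: string;
  [field: string]: unknown;
}

// Pino's default level names, from least to most severe.
const LEVEL_ORDER = ["trace", "debug", "info", "warn", "error", "fatal"];

// Return the records at or above `minLevel` from newline-delimited JSON logs.
// Lines that are not valid JSON, or that lack string `level`/`msg` fields, are skipped.
function filterLogs(lines: string, minLevel: string): LogRecord[] {
  const minIndex = LEVEL_ORDER.indexOf(minLevel);
  const records: LogRecord[] = [];
  for (const line of lines.split("\n")) {
    if (line.trim() === "") continue;
    let parsed: unknown;
    try {
      parsed = JSON.parse(line);
    } catch {
      continue; // not JSON, e.g. a stack trace fragment
    }
    if (typeof parsed !== "object" || parsed === null) continue;
    const record = parsed as Record<string, unknown>;
    const level = record["level"];
    const msg = record["msg"];
    if (typeof level !== "string" || typeof msg !== "string") continue;
    if (LEVEL_ORDER.indexOf(level) >= minIndex) {
      records.push({ ...record, level, msg });
    }
  }
  return records;
}
```

Calling `filterLogs(output, "warn")` keeps the `warn` and `error` events from the table above and drops the `info` ones.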

## Webhook-Based Monitoring

Subscribe to webhook events for real-time alerting without polling:

```typescript
import { Grantex } from '@grantex/sdk';

const grantex = new Grantex({ apiKey: process.env.GRANTEX_API_KEY! });

await grantex.webhooks.create({
  url: 'https://your-app.com/webhooks/grantex-alerts',
  events: ['grant.revoked', 'token.issued'],
  secret: process.env.WEBHOOK_SECRET!,
});
```
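On the receiving side, one way to turn these events into an alert is to count them in a sliding time window and flag bursts. The sketch below is one possible approach, not part of the Grantex SDK: the `WebhookEvent` shape (a `type` field carrying the event name) is an assumption about the payload, and the window size and limit are placeholders you would tune.

```typescript
// Assumed payload shape: only the event name is used here.
interface WebhookEvent {
  type: string;
}

// Counts events whose timestamps fall inside a trailing time window.
class SlidingWindowCounter {
  private readonly windowMs: number;
  private timestamps: number[] = [];

  constructor(windowMs: number) {
    this.windowMs = windowMs;
  }

  // Record an event at `nowMs` and return the number of events
  // in the window (nowMs - windowMs, nowMs].
  record(nowMs: number): number {
    this.timestamps.push(nowMs);
    const cutoff = nowMs - this.windowMs;
    while (true) {
      const oldest = this.timestamps[0];
      if (oldest === undefined || oldest > cutoff) break;
      this.timestamps.shift();
    }
    return this.timestamps.length;
  }
}

// Returns true when `grant.revoked` events exceed `maxPerWindow` inside the window.
// Other event types are ignored and never trigger an alert.
function shouldAlert(
  event: WebhookEvent,
  counter: SlidingWindowCounter,
  nowMs: number,
  maxPerWindow: number,
): boolean {
  if (event.type !== "grant.revoked") return false;
  return counter.record(nowMs) > maxPerWindow;
}
```

In a real handler you would call `shouldAlert(event, counter, Date.now(), limit)` for each delivery, after validating the request against your webhook secret.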
