Source Repository
This documentation is from amiable-dev/litellm-langfuse-railway.
Last synced: 2026-01-03 | Commit: 5a45454
LiteLLM + Langfuse Production Stack¶
The only production-ready LLM observability stack on Railway.
What Makes This Production-Ready?¶
| Feature | Starter Templates | This Template |
|---|---|---|
| Automated Backups | ❌ | ✅ PostgreSQL + ClickHouse to MinIO |
| Health Monitoring | ❌ | ✅ All services with alerting |
| Redis Persistence | ❌ Basic | ✅ AOF enabled |
| Alert Integration | ❌ | ✅ Slack, Discord, PagerDuty |
| Prometheus Metrics | ❌ | ✅ /metrics endpoint |
| Recovery Runbook | ❌ | ✅ Documented procedures |
| Restart Policies | Basic | ✅ Enhanced with retries |
Architecture¶
┌─────────────────────────────────────────────────────────────────────┐
│ Your Applications │
│ (OpenAI SDK Compatible) │
└─────────────────────────────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────────────┐
│ LiteLLM Gateway │
│ Unified API • Rate Limiting • Cost Tracking │
│ Virtual Keys • Load Balancing │
└─────────────────────────────────────────────────────────────────────┘
│ │ │
▼ ▼ ▼
┌──────────────┐ ┌──────────────┐ ┌──────────────────┐
│ OpenAI │ │ Anthropic │ │ 100+ Providers │
└──────────────┘ └──────────────┘ └──────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────────────┐
│ Langfuse Platform │
│ Tracing • Prompt Management • Evaluation • Analytics │
├─────────────────────────────────────────────────────────────────────┤
│ langfuse-web (UI/API) │ langfuse-worker (async) │
└─────────────────────────────────────────────────────────────────────┘
│ │ │
▼ ▼ ▼
┌──────────────────┐ ┌──────────────────┐ ┌──────────────────┐
│ PostgreSQL │ │ ClickHouse │ │ Redis │
│ (Transactional)│ │ (Analytics) │ │ (Cache/Queue) │
│ │ │ │ │ AOF Enabled │
└──────────────────┘ └──────────────────┘ └──────────────────┘
│ │
└───────────┬───────────┘
▼
┌─────────────────────────────────────────────────────────────────────┐
│ Production Services │
├─────────────────────────────────────────────────────────────────────┤
│ backup-service │ health-monitor │
│ • Daily PostgreSQL dumps │ • 60s health checks │
│ • ClickHouse exports │ • Slack/Discord/PagerDuty │
│ • 7-day retention │ • Prometheus metrics │
│ • Stored in MinIO │ • Recovery alerts │
└─────────────────────────────────────────────────────────────────────┘
│
▼
┌──────────────────┐
│ MinIO │
│ (S3 Storage) │
│ Backups + Logs │
└──────────────────┘
Quick Start¶
1. Deploy¶
Click the deploy button above, or manually deploy from GitHub.
2. Configure Alerts (Recommended)¶
Set up alerting by adding webhook URLs:
Slack:
1. Create a Slack App → Incoming Webhooks
2. Copy webhook URL
3. Add to ALERT_WEBHOOK_URL in both backup-service and health-monitor
Discord:
1. Server Settings → Integrations → Webhooks
2. Copy webhook URL
3. Add to ALERT_WEBHOOK_URL
PagerDuty:
1. Services → Events API v2 → Create Integration
2. Copy routing key
3. Add to PAGERDUTY_ROUTING_KEY in health-monitor
3. Connect Langfuse to LiteLLM¶
After deployment:
- Open Langfuse UI (langfuse-web public URL)
- Create account / Sign in
- Go to Settings → API Keys → Create New
- Copy
Public KeyandSecret Key - Update LiteLLM service environment variables:
LANGFUSE_PUBLIC_KEYLANGFUSE_SECRET_KEY- Redeploy LiteLLM
4. Add Your LLM Providers¶
In LiteLLM UI or via API:
import requests
# Add OpenAI
requests.post(
"https://your-litellm-url/model/new",
headers={"Authorization": "Bearer YOUR_MASTER_KEY"},
json={
"model_name": "gpt-4o",
"litellm_params": {
"model": "gpt-4o",
"api_key": "sk-..."
}
}
)
# Add Anthropic
requests.post(
"https://your-litellm-url/model/new",
headers={"Authorization": "Bearer YOUR_MASTER_KEY"},
json={
"model_name": "claude-3-5-sonnet",
"litellm_params": {
"model": "anthropic/claude-3-5-sonnet-20241022",
"api_key": "sk-ant-..."
}
}
)
Services¶
| Service | Port | Purpose | Public? |
|---|---|---|---|
| litellm | 4000 | LLM Gateway + UI | ✅ Yes |
| langfuse-web | 3000 | Observability UI | ✅ Yes |
| langfuse-worker | 3030 | Async processing | ❌ No |
| postgres | 5432 | Transactional DB | ❌ No |
| clickhouse | 8123/9000 | Analytics DB | ❌ No |
| redis | 6379 | Cache + Queues | ❌ No |
| minio | 9000/9001 | Object Storage | ❌ No |
| backup-service | 8080 | Automated Backups | ❌ No |
| health-monitor | 8080 | Health Checks | Optional |
Backup Configuration¶
| Variable | Default | Description |
|---|---|---|
BACKUP_SCHEDULE |
daily | hourly, daily, or weekly |
BACKUP_HOUR |
3 | Hour (UTC) for daily/weekly backups |
BACKUP_RETENTION_DAYS |
7 | Days to keep old backups |
BACKUP_ON_STARTUP |
true | Run backup when service starts |
ALERT_WEBHOOK_URL |
- | Webhook for backup notifications |
Manual Backup¶
Trigger an immediate backup:
Restore from Backup¶
See RUNBOOK.md for detailed restore procedures.
Health Monitoring¶
Endpoints¶
| Endpoint | Description |
|---|---|
GET /health |
JSON health status of all services |
GET /metrics |
Prometheus-compatible metrics |
GET /check |
Trigger immediate health check |
Example Health Response¶
{
"status": "healthy",
"timestamp": "2025-01-03T10:30:00Z",
"services": {
"litellm": {
"status": "healthy",
"response_time_ms": 45.2,
"consecutive_failures": 0
},
"langfuse-web": {
"status": "healthy",
"response_time_ms": 120.5,
"consecutive_failures": 0
},
"postgres": {
"status": "healthy",
"response_time_ms": 12.1,
"consecutive_failures": 0
}
}
}
Alert Configuration¶
| Variable | Default | Description |
|---|---|---|
CHECK_INTERVAL |
60 | Seconds between checks |
FAILURE_THRESHOLD |
3 | Failures before alerting |
ALERT_COOLDOWN |
15 | Minutes between repeat alerts |
ALERT_WEBHOOK_URL |
- | Slack/Discord webhook |
PAGERDUTY_ROUTING_KEY |
- | PagerDuty integration key |
Cost Estimate¶
| Service | Est. Monthly Cost |
|---|---|
| litellm | $5-15 |
| langfuse-web | $5-10 |
| langfuse-worker | $3-8 |
| postgres | $5-10 |
| clickhouse | $5-15 |
| redis | $3-5 |
| minio | $3-5 |
| backup-service | $2-5 |
| health-monitor | $2-5 |
| Total | $33-78/mo |
Costs depend on usage. Light usage (~$35/mo), heavy usage (~$75/mo).
Upgrading¶
To Railway HA PostgreSQL¶
When you need higher availability:
- Deploy Railway HA PostgreSQL cluster
- Use migration template:
railway.app/template/VgqHWg - Update
DATABASE_URLreferences - See UPGRADE.md for detailed steps
Scaling Workers¶
For high trace volume:
Troubleshooting¶
Common Issues¶
LiteLLM can't connect to Langfuse:
- Verify LANGFUSE_PUBLIC_KEY and LANGFUSE_SECRET_KEY are set
- Check Langfuse URL is correct (https, not http)
- Redeploy LiteLLM after changing keys
Backups failing:
- Check backup-service logs
- Verify MinIO is healthy
- Ensure sufficient disk space
Health monitor showing unhealthy: - Check individual service logs - Verify private network connectivity - Look for OOM or restart loops
Getting Help¶
- Check RUNBOOK.md for operational procedures
- Review service logs in Railway dashboard
- Open issue on GitHub repository
License¶
MIT License - see LICENSE
Built for production. Unlike other templates that leave resilience as an exercise for the reader, this stack includes everything you need to run LLM observability reliably.