Source Repository

This documentation is from amiable-dev/litellm-langfuse-railway. Last synced: 2026-01-03 | Commit: 5a45454

Operations Runbook¶

This runbook covers common operational procedures for the LiteLLM + Langfuse Production Stack.

Table of Contents¶

Daily Operations
Backup & Restore
Incident Response
Scaling
Maintenance

Daily Operations¶

Health Check¶

# Check all services
curl https://your-health-monitor-url/health | jq .

# Check specific service
curl https://your-litellm-url/health
curl https://your-langfuse-url/api/public/health
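The individual checks above can be combined into one pass. This is a minimal sketch, assuming each endpoint answers with a 2xx status when healthy; `classify_status` and `check_endpoint` are illustrative names, not part of the stack, and the URLs are the same placeholders used above.

```shell
# Map an HTTP status code to a coarse label (hypothetical helper).
classify_status() {
  case "$1" in
    2??) echo "healthy" ;;
    000) echo "unreachable" ;;   # curl reports 000 when no HTTP response arrived
    *)   echo "unhealthy" ;;
  esac
}

# Probe one URL and print "<name> <code> <label>".
check_endpoint() {
  local name="$1" url="$2" code
  code=$(curl -s -o /dev/null -w '%{http_code}' --max-time 10 "$url") || true
  printf '%s %s %s\n' "$name" "$code" "$(classify_status "$code")"
}

# Usage:
# check_endpoint litellm  https://your-litellm-url/health
# check_endpoint langfuse https://your-langfuse-url/api/public/health
```

Keeping the classification separate from the network call makes it easy to adjust what counts as healthy without touching the probe.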

View Recent Backups¶

# In MinIO console or via mc client
mc ls myminio/backups/postgres/
mc ls myminio/backups/clickhouse/
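To pick out the newest backup from a listing, a small sketch along these lines may help. It assumes every backup follows the `<engine>_backup_YYYYMMDD_HHMMSS.<ext>` naming seen in this runbook and that `mc ls` prints the object name in the last column; `backup_stamp` and `latest_backup` are hypothetical helper names.

```shell
# Extract the YYYYMMDD_HHMMSS stamp from a backup file name.
backup_stamp() {
  local name="${1##*/}"               # drop any directory prefix
  name="${name%%.*}"                  # drop extensions such as .sql.gz or .tar.gz
  printf '%s\n' "${name#*_backup_}"   # drop the "<engine>_backup_" prefix
}

# Read file names on stdin and print the newest one.
# With a shared prefix and fixed-width stamp, lexical order is chronological order.
latest_backup() {
  sort | tail -n 1
}

# Usage:
# mc ls myminio/backups/postgres/ | awk '{print $NF}' | latest_backup
```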

Check Logs¶

In Railway Dashboard:

1. Select project
2. Click on service
3. View "Deployments" → "Logs"

Backup & Restore¶

Manual Backup Trigger¶

curl -X GET https://your-backup-service-url/backup

Restore PostgreSQL¶

1. Download backup from MinIO¶

# Using mc client
mc cp myminio/backups/postgres/postgres_backup_20250103_030000.sql.gz ./

# Or via MinIO console UI

2. Stop dependent services¶

In Railway Dashboard, pause:

- litellm
- langfuse-web
- langfuse-worker

3. Restore to PostgreSQL¶

# Decompress
gunzip postgres_backup_20250103_030000.sql.gz

# Connect to Railway PostgreSQL and restore
# Get DATABASE_URL from Railway service variables
psql "$DATABASE_URL" < postgres_backup_20250103_030000.sql
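A more defensive variant of the restore step is sketched below. It assumes the backup is a plain SQL dump that can be replayed inside a single transaction, which may not hold for every dump; `restore_postgres` is a hypothetical wrapper. It checks the archive before touching the database and makes psql stop at the first error instead of continuing with a partial restore.

```shell
# Verify the archive, then stream it into PostgreSQL.
# Expects DATABASE_URL to be set (from Railway service variables).
restore_postgres() {
  local archive="$1"
  # gzip -t tests integrity without writing any output
  gzip -t "$archive" || { echo "corrupt or missing archive: $archive" >&2; return 1; }
  # ON_ERROR_STOP aborts on the first failing statement;
  # --single-transaction rolls everything back if any statement fails
  gunzip -c "$archive" |
    psql "$DATABASE_URL" -v ON_ERROR_STOP=1 --single-transaction -f -
}

# Usage:
# restore_postgres postgres_backup_20250103_030000.sql.gz
```

Reading the dump with `-f -` rather than a plain stdin redirect matters here, because psql only honors `--single-transaction` together with `-c` or `-f`.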

4. Restart services¶

Unpause services in Railway Dashboard in order:

1. postgres (if restarted)
2. langfuse-worker
3. langfuse-web
4. litellm

Restore ClickHouse¶

1. Download backup¶

mc cp myminio/backups/clickhouse/clickhouse_backup_20250103_030000.tar.gz ./
tar -xzf clickhouse_backup_20250103_030000.tar.gz

2. Restore tables¶

# For each table in the backup (replace tablename, and adjust the
# directory to match whatever tar extracted)
clickhouse-client --host "$CLICKHOUSE_HOST" \
  --user "$CLICKHOUSE_USER" \
  --password "$CLICKHOUSE_PASSWORD" \
  --query "INSERT INTO tablename FORMAT TabSeparatedWithNames" \
  < clickhouse_backup_20250103/tablename.tsv
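The "for each table" step can be scripted as a loop. The sketch below assumes the extracted directory holds one `<table>.tsv` file per table, that table names contain no spaces, and that the target tables already exist; `list_backup_tables` and `restore_clickhouse_dir` are hypothetical names.

```shell
# Print one table name per .tsv file found in the extracted backup directory.
list_backup_tables() {
  local dir="$1" file
  for file in "$dir"/*.tsv; do
    [ -e "$file" ] || continue   # the glob matched nothing
    basename "$file" .tsv
  done
}

# Load every table in turn; stops at the first failure.
restore_clickhouse_dir() {
  local dir="$1" table
  for table in $(list_backup_tables "$dir"); do
    echo "restoring $table" >&2
    clickhouse-client --host "$CLICKHOUSE_HOST" \
      --user "$CLICKHOUSE_USER" \
      --password "$CLICKHOUSE_PASSWORD" \
      --query "INSERT INTO $table FORMAT TabSeparatedWithNames" \
      < "$dir/$table.tsv" || return 1
  done
}

# Usage (directory name as extracted by tar):
# restore_clickhouse_dir clickhouse_backup_20250103
```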

Incident Response¶

Service Unhealthy¶

Symptoms¶

Health monitor alerts
503 errors from service
Slow response times

Diagnosis¶

1. Check health endpoint:

curl https://your-health-monitor-url/health | jq '.services'

2. Check service logs: Railway Dashboard → Service → Logs
3. Check resource usage: Railway Dashboard → Service → Metrics

Resolution¶

Symptom	Likely Cause	Resolution
OOM restart	Insufficient memory	Increase RAM allocation
Connection refused	Service crashed	Check logs, restart
Timeout	Downstream dependency	Check database connections
5xx errors	Application error	Check application logs
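For the Timeout row, one way to check database connections is to compare the number of open PostgreSQL sessions with the server limit. This is a rough sketch: the 80 percent threshold is an arbitrary assumption and `connections_near_limit` is a hypothetical helper.

```shell
# Succeeds (exit 0) when used connections reach the given percentage of max.
connections_near_limit() {
  local used="$1" max="$2" pct="${3:-80}"
  [ $(( used * 100 )) -ge $(( max * pct )) ]
}

# Usage (-At prints the bare value without headers or alignment):
# used=$(psql "$DATABASE_URL" -At -c "SELECT count(*) FROM pg_stat_activity")
# max=$(psql "$DATABASE_URL" -At -c "SHOW max_connections")
# connections_near_limit "$used" "$max" 80 && echo "connection limit nearly reached"
```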

Database Connection Issues¶

PostgreSQL¶

# Test connection
psql "$DATABASE_URL" -c "SELECT 1"

# Check connection count
psql "$DATABASE_URL" -c "SELECT count(*) FROM pg_stat_activity"

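# Kill connections that have been idle for over an hour, if needed
# (state_change is when the session went idle; query_start is only when its last query began)
psql "$DATABASE_URL" -c "
  SELECT pg_terminate_backend(pid)
  FROM pg_stat_activity
  WHERE state = 'idle'
  AND state_change < now() - interval '1 hour'
  AND pid <> pg_backend_pid()
"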

ClickHouse¶

# Test connection
curl "http://clickhouse:8123/ping"

# Check running queries
curl "http://clickhouse:8123" --data "SELECT * FROM system.processes"

Redis¶

# Test connection
redis-cli -h "$REDIS_HOST" -p "$REDIS_PORT" -a "$REDIS_PASSWORD" ping

# Check memory usage
redis-cli -h "$REDIS_HOST" -p "$REDIS_PORT" -a "$REDIS_PASSWORD" info memory
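The `info memory` output is a list of `key:value` lines with CRLF line endings, so a single field can be pulled out for scripting. A minimal sketch, where `redis_info_field` is a hypothetical helper:

```shell
# Read INFO output on stdin and print the value of one field.
redis_info_field() {
  # strip the carriage returns Redis sends, then match the key exactly
  tr -d '\r' | awk -F: -v key="$1" '$1 == key { print $2 }'
}

# Usage:
# redis-cli -h "$REDIS_HOST" -p "$REDIS_PORT" -a "$REDIS_PASSWORD" info memory \
#   | redis_info_field used_memory_human
```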

Complete Service Outage¶

1. Triage:
   - Which services are affected?
   - When did it start?
   - Any recent deployments?
2. Communicate:
   - Update status page if applicable
   - Notify stakeholders
3. Investigate:

# Check all health
curl https://your-health-monitor-url/health | jq .

4. Recover:
   - Restart failed services
   - Scale horizontally if needed
   - Restore from backup if data is corrupted
5. Post-mortem:
   - Document timeline
   - Identify root cause
   - Create action items
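Documenting the timeline is easier when events are recorded as they happen. The sketch below appends UTC-stamped notes to a file during the incident; `log_event` and the file name are illustrative assumptions, not part of the stack.

```shell
# Append a UTC-timestamped note to an incident timeline file.
log_event() {
  local file="$1"
  shift
  printf '%s %s\n' "$(date -u +%Y-%m-%dT%H:%M:%SZ)" "$*" >> "$file"
}

# Usage:
# log_event incident-timeline.log "health monitor reports litellm down"
# log_event incident-timeline.log "restarted litellm"
```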

Scaling¶

Horizontal Scaling¶

LiteLLM¶

For high request volume:

# Railway CLI
railway service litellm scale --replicas 3

Or in railway.toml:

[services.litellm.deploy]
numReplicas = 3

Langfuse Worker¶

For high trace ingestion:

railway service langfuse-worker scale --replicas 3

Vertical Scaling¶

In Railway Dashboard:

1. Service → Settings → Resources
2. Adjust memory/CPU limits

Recommended minimums for production:

Service	Memory	vCPU
litellm	512MB	0.5
langfuse-web	512MB	0.5
langfuse-worker	256MB	0.25
postgres	1GB	1.0
clickhouse	1GB	1.0
redis	256MB	0.25

Database Scaling¶

PostgreSQL → HA Cluster¶

1. Deploy HA PostgreSQL template: railway.app/template/high-availability-postgresql
2. Run migration service: railway.app/template/VgqHWg
3. Update all DATABASE_URL references
4. Verify connections
5. Remove old standalone instance

ClickHouse Optimization¶

For large trace volumes, consider:

- Increasing memory allocation
- Adding materialized views for common queries
- Implementing data retention policies

Maintenance¶

Regular Tasks¶

Task	Frequency	Procedure
Review backup logs	Daily	Check backup-service logs
Check disk usage	Weekly	Review volume metrics
Update dependencies	Monthly	Deploy new image versions
Review alerts	Weekly	Tune thresholds if needed
Test restore	Monthly	Restore to test environment

Updating Services¶

LiteLLM¶

# Update image tag in railway.toml
image = "ghcr.io/berriai/litellm-database:main-stable"

# Deploy
railway up

Langfuse¶

# Update image tags
image = "langfuse/langfuse:3"
image = "langfuse/langfuse-worker:3"

# Deploy
railway up
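After a deploy it can be useful to wait until the service reports healthy before moving on. The following is a generic retry sketch; `wait_until` is a hypothetical helper, and the attempt count and delay in the usage line are arbitrary.

```shell
# Run a command up to <attempts> times, sleeping <delay> seconds between tries.
# Returns 0 as soon as the command succeeds, 1 if every attempt fails.
wait_until() {
  local attempts="$1" delay="$2" i=1
  shift 2
  while :; do
    "$@" && return 0
    [ "$i" -ge "$attempts" ] && return 1
    i=$((i + 1))
    sleep "$delay"
  done
}

# Usage: poll the LiteLLM health endpoint for up to about 5 minutes
# wait_until 30 10 curl -fsS -o /dev/null https://your-litellm-url/health
```

Passing the probe as a command keeps the helper reusable for the Langfuse endpoints or a database check.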

Data Retention¶

ClickHouse (Traces)¶

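Set a retention policy in Langfuse settings, or delete old rows manually. In ClickHouse, ALTER TABLE ... DELETE is an asynchronous mutation that rewrites the affected data parts, so prefer running it during low traffic:

-- Keep last 90 days of traces
ALTER TABLE traces DELETE WHERE created_at < now() - INTERVAL 90 DAY;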
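Because ClickHouse executes ALTER TABLE ... DELETE as a background mutation, the statement returns before the rows are actually gone. Progress can be followed through the `system.mutations` table; the sketch below builds such a query, with `pending_mutations_query` as a hypothetical helper and the HTTP endpoint taken from the ClickHouse checks earlier in this runbook.

```shell
# Build a query listing unfinished mutations for one table.
pending_mutations_query() {
  printf "SELECT mutation_id, command, create_time, latest_fail_reason FROM system.mutations WHERE table = '%s' AND is_done = 0" "$1"
}

# Usage: an empty result means all mutations on the table have completed
# curl "http://clickhouse:8123" --data "$(pending_mutations_query traces)"
```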

PostgreSQL¶

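Consider archiving old data. Run both statements in one transaction so that they see the same now() and a failure midway leaves nothing half-done:

-- Archive keys older than 1 year
-- (assumes archived_keys already exists with the same columns)
BEGIN;

INSERT INTO archived_keys SELECT * FROM litellm_keys
WHERE created_at < now() - INTERVAL '1 year';

DELETE FROM litellm_keys
WHERE created_at < now() - INTERVAL '1 year';

COMMIT;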

MinIO (Backups)¶

Backup service automatically cleans up based on BACKUP_RETENTION_DAYS.

To manually clean (mc requires --force for recursive removal, and the deletion is irreversible):

mc rm --recursive --force --older-than 30d myminio/backups/
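To keep a manual cleanup aligned with the automatic one, the age can be derived from BACKUP_RETENTION_DAYS rather than hard-coded. A small sketch, assuming the default of 7 days listed in the appendix; `retention_age` is a hypothetical helper.

```shell
# Turn BACKUP_RETENTION_DAYS (default 7) into the "<n>d" form that mc expects.
retention_age() {
  printf '%sd\n' "${BACKUP_RETENTION_DAYS:-7}"
}

# Usage: list what would match first, then delete
# mc find myminio/backups/ --older-than "$(retention_age)"
# mc rm --recursive --force --older-than "$(retention_age)" myminio/backups/
```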

Emergency Contacts¶

Role	Contact	When to Escalate
On-call Engineer	(your contact)	Any P1 incident
Database Admin	(your contact)	Data corruption, restore needed
Platform Team	Railway Support	Infrastructure issues

Appendix: Environment Variables Reference¶

backup-service¶

Variable	Required	Default	Description
DATABASE_URL	Yes	-	PostgreSQL connection
CLICKHOUSE_HOST	Yes	-	ClickHouse hostname
CLICKHOUSE_PASSWORD	Yes	-	ClickHouse password
MINIO_ENDPOINT	Yes	-	MinIO endpoint
BACKUP_SCHEDULE	No	daily	hourly/daily/weekly
BACKUP_RETENTION_DAYS	No	7	Retention period
ALERT_WEBHOOK_URL	No	-	Slack/Discord webhook

health-monitor¶

Variable	Required	Default	Description
DATABASE_URL	Yes	-	PostgreSQL connection
REDIS_HOST	Yes	-	Redis hostname
CHECK_INTERVAL	No	60	Seconds between checks
FAILURE_THRESHOLD	No	3	Failures before alert
ALERT_WEBHOOK_URL	No	-	Slack/Discord webhook
PAGERDUTY_ROUTING_KEY	No	-	PagerDuty integration
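The variables marked as required in the tables above can be checked before a service starts. This is a minimal preflight sketch; `require_vars` is a hypothetical helper and is not something the services are known to ship with.

```shell
# Report every named variable that is unset or empty.
# Returns 0 when all are set, 1 otherwise.
require_vars() {
  local name value missing=0
  for name in "$@"; do
    eval "value=\${$name:-}"   # indirect lookup of the variable called $name
    if [ -z "$value" ]; then
      echo "missing required variable: $name" >&2
      missing=1
    fi
  done
  return "$missing"
}

# Usage:
# backup-service
# require_vars DATABASE_URL CLICKHOUSE_HOST CLICKHOUSE_PASSWORD MINIO_ENDPOINT || exit 1
# health-monitor
# require_vars DATABASE_URL REDIS_HOST || exit 1
```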