Source Repository
This documentation is from amiable-dev/litellm-langfuse-railway.
Last synced: 2026-01-03 | Commit: 5a45454
Upgrade Guide
This guide covers upgrading components of the LiteLLM + Langfuse Production Stack.
Upgrading to HA PostgreSQL
Railway offers a High Availability PostgreSQL cluster with automatic failover. This upgrade is recommended when:
- You need 99.9%+ uptime for your database
- You're processing >10,000 LLM requests/day
- You have compliance requirements for data redundancy
Prerequisites
- Current stack deployed and healthy
- Recent backup (verify in MinIO)
- A 30-60 minute maintenance window
Step 1: Deploy HA PostgreSQL Cluster
- Open Railway Dashboard
- Click "New" → "Template"
- Search for "PostgreSQL HA" or use: railway.app/template/high-availability-postgresql
- Deploy into your existing project

The HA cluster includes:
- 3 PostgreSQL nodes (pg-0, pg-1, pg-2)
- PgPool for connection pooling and failover
- Automatic replication
Step 2: Run Migration
- Deploy the migration service:
  - Template: railway.app/template/VgqHWg
  - Or repository: github.com/railwayapp-templates/pg-migrate-ha
- Configure the migration service's variables
- Deploy and monitor logs for completion
Step 3: Verify Migration
# Connect to HA cluster
psql "$HA_DATABASE_URL" -c "SELECT count(*) FROM litellm_keys"
psql "$HA_DATABASE_URL" -c "SELECT count(*) FROM langfuse_projects"
Compare the counts with the same queries run against the original database.
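The comparison can be scripted rather than done by eye. The following is a minimal sketch, not part of the template: `OLD_DATABASE_URL` and the helper names `row_count` and `compare_counts` are assumptions introduced for illustration, while `HA_DATABASE_URL` and the table names come from the commands above.

```shell
#!/usr/bin/env bash
# Sketch: compare per-table row counts between two PostgreSQL databases.
# OLD_DATABASE_URL is a hypothetical variable name for the standalone database.

# Print the row count of one table.
# -A (unaligned) and -t (tuples only) make psql print just the number.
row_count() {
  local url="$1" table="$2"
  psql "$url" -At -c "SELECT count(*) FROM ${table}"
}

# Return 0 if every listed table has the same count in both databases,
# 1 on any mismatch, 2 if a query fails.
compare_counts() {
  local old_url="$1" new_url="$2" status=0 table old new
  shift 2
  for table in "$@"; do
    old="$(row_count "$old_url" "$table")" || return 2
    new="$(row_count "$new_url" "$table")" || return 2
    if [ "$old" = "$new" ]; then
      echo "OK       ${table}: ${new}"
    else
      echo "MISMATCH ${table}: old=${old} new=${new}"
      status=1
    fi
  done
  return "$status"
}

# Usage (uncomment to run against real databases):
# compare_counts "$OLD_DATABASE_URL" "$HA_DATABASE_URL" litellm_keys langfuse_projects
```

Matching counts are a necessary check, not a proof of a complete migration; spot-checking a few recent rows is a sensible complement.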
Step 4: Update Service References
Update all services to point to PgPool:
- In Railway Dashboard, update each service's DATABASE_URL:
  - litellm
  - langfuse-web
  - langfuse-worker
  - backup-service
  - health-monitor
- Use PgPool's DATABASE_URL, not the URLs of individual nodes
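If you prefer the command line to the dashboard, the same change can be sketched with the Railway CLI. This is an illustration under assumptions: it relies on `railway variables --set` with `--service`, the helper name `repoint_services` and the `DRY_RUN` switch are hypothetical, and the PgPool service name inside the reference value must match whatever your project actually shows. It defaults to printing the commands so they can be reviewed first.

```shell
#!/usr/bin/env bash
# Sketch: point every service at the PgPool DATABASE_URL.

# Services listed in this guide that read DATABASE_URL.
SERVICES="litellm langfuse-web langfuse-worker backup-service health-monitor"

# $1 is the value to assign, for example a Railway reference variable
# such as '${{pgpool.DATABASE_URL}}' (the service name "pgpool" is an assumption).
# With DRY_RUN=1 (the default) the commands are only printed.
repoint_services() {
  local value="$1" svc
  for svc in $SERVICES; do
    if [ "${DRY_RUN:-1}" = "1" ]; then
      echo "railway variables --service ${svc} --set DATABASE_URL=${value}"
    else
      railway variables --service "$svc" --set "DATABASE_URL=${value}"
    fi
  done
}

# Usage (single quotes keep the shell from expanding the reference):
# repoint_services '${{pgpool.DATABASE_URL}}'
# DRY_RUN=0 repoint_services '${{pgpool.DATABASE_URL}}'
```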
Step 5: Verify Connectivity
# Check each service can connect
curl https://your-litellm-url/health
curl https://your-langfuse-url/api/public/health
curl https://your-health-monitor-url/health
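Services may take a short while to come back after their variables change, so a single request can fail spuriously. Below is a small retry wrapper around the checks above; the function name `wait_healthy` and its default attempt count and delay are assumptions chosen for illustration.

```shell
#!/usr/bin/env bash
# Sketch: poll a health endpoint until it answers with a 2xx status.

# $1 = URL, $2 = max attempts (default 5), $3 = delay in seconds (default 10).
# Returns 0 on the first successful response, 1 if all attempts fail.
wait_healthy() {
  local url="$1" attempts="${2:-5}" delay="${3:-10}" i=1
  while [ "$i" -le "$attempts" ]; do
    # -f makes curl exit non-zero on HTTP errors (4xx/5xx).
    if curl -fsS --max-time 10 -o /dev/null "$url"; then
      echo "healthy: ${url}"
      return 0
    fi
    echo "attempt ${i}/${attempts} failed: ${url}" >&2
    i=$((i + 1))
    if [ "$i" -le "$attempts" ]; then
      sleep "$delay"
    fi
  done
  return 1
}

# Usage with the placeholder URLs from this guide:
# wait_healthy https://your-litellm-url/health
# wait_healthy https://your-langfuse-url/api/public/health
# wait_healthy https://your-health-monitor-url/health
```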
Step 6: Remove Old Database
Once the migration is verified (waiting 24-48 hours is recommended):
- Take a final backup of the old database
- Delete the standalone postgres service
- Update backup-service to back up the HA cluster
Rollback Plan
If issues arise:
- Update all DATABASE_URL references back to the standalone database
- Restart all services
- Investigate HA cluster issues
Upgrading LiteLLM
Minor Version Updates
LiteLLM releases frequently. For minor updates:
- Check changelog: https://github.com/BerriAI/litellm/releases
- Update image tag in railway.toml
- Deploy
- Verify
Major Version Updates
For major versions (e.g., v1 → v2):
- Review breaking changes in changelog
- Test in staging environment first
- Schedule maintenance window
- Update and deploy
- Run smoke tests
- Monitor for errors
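A smoke test for the proxy can be as small as confirming that it still lists models through its OpenAI-compatible API. The sketch below assumes the `/v1/models` route and Bearer authentication with a proxy key; the function name `smoke_test_models` is hypothetical, and `LITELLM_MASTER_KEY` in the usage line stands for whichever key your deployment uses.

```shell
#!/usr/bin/env bash
# Sketch: smoke-test a LiteLLM proxy by listing its models.

# $1 = base URL of the proxy, $2 = API key.
# Returns 0 if the response contains at least one model entry.
smoke_test_models() {
  local base_url="$1" api_key="$2" body
  body="$(curl -fsS --max-time 15 \
    -H "Authorization: Bearer ${api_key}" \
    "${base_url}/v1/models")" || return 1
  # A non-empty model list contains at least one "id" field.
  printf '%s' "$body" | grep -q '"id"'
}

# Usage:
# smoke_test_models https://your-litellm-url "$LITELLM_MASTER_KEY" && echo "models OK"
```

A fuller smoke test would also send one cheap completion request through each configured provider.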
Rollback
If issues occur:
- Update image tag to previous version
- Deploy immediately
- Investigate issue in logs
Upgrading Langfuse
Version Updates
Langfuse follows semantic versioning. Check releases at: https://github.com/langfuse/langfuse/releases
Langfuse 2.x → 3.x Migration
Langfuse 3.x introduced ClickHouse as its analytics database. This template already uses v3.
For v2 → v3:
1. Deploy ClickHouse service
2. Update Langfuse images to v3
3. Configure CLICKHOUSE_* environment variables
4. Historical data will be migrated automatically
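Before deploying the v3 images, it can help to confirm that the ClickHouse settings are actually present. The sketch below checks four variable names that, to my knowledge, Langfuse v3 reads for self-hosted deployments; verify the list against the Langfuse documentation for your version. The function name `check_clickhouse_env` and the example values are assumptions.

```shell
#!/usr/bin/env bash
# Sketch: fail early if a CLICKHOUSE_* variable is missing or empty.

# Returns 0 when all variables are set, 1 otherwise (missing names go to stderr).
check_clickhouse_env() {
  local missing=0 name value
  for name in CLICKHOUSE_URL CLICKHOUSE_MIGRATION_URL CLICKHOUSE_USER CLICKHOUSE_PASSWORD; do
    # Indirect lookup of the variable called $name (names are fixed above, so eval is safe).
    eval "value=\${${name}:-}"
    if [ -z "$value" ]; then
      echo "missing: ${name}" >&2
      missing=1
    fi
  done
  return "$missing"
}

# Example values (hostnames and credentials are placeholders):
# export CLICKHOUSE_URL="http://clickhouse:8123"
# export CLICKHOUSE_MIGRATION_URL="clickhouse://clickhouse:9000"
# export CLICKHOUSE_USER="default"
# export CLICKHOUSE_PASSWORD="change-me"
# check_clickhouse_env && echo "ClickHouse settings present"
```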
Update Procedure
- Update both images (langfuse-web and langfuse-worker) together
- Deploy
- Verify
- Check migrations:
  - View langfuse-web logs for migration status
  - Confirm UI loads correctly
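One way to confirm that the new image is the one serving traffic is to read the version from the health endpoint used earlier in this guide. This sketch assumes the response is JSON containing a `version` field; if your Langfuse release does not report one, the function prints nothing. The name `langfuse_version` is hypothetical.

```shell
#!/usr/bin/env bash
# Sketch: print the version reported by Langfuse's public health endpoint.

# $1 = base URL of langfuse-web. Returns 1 if the request fails.
langfuse_version() {
  local base_url="$1" body
  body="$(curl -fsS --max-time 10 "${base_url}/api/public/health")" || return 1
  # Extract the value of the "version" field without requiring jq.
  printf '%s\n' "$body" | sed -n 's/.*"version":[[:space:]]*"\([^"]*\)".*/\1/p'
}

# Usage:
# langfuse_version https://your-langfuse-url
```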
Upgrading Redis
Minor Version Updates
- Update image
- Deploy
- Redis will restart with data preserved (AOF enabled)
- Verify
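Verification after a Redis restart can cover both reachability and persistence. The sketch below sends `PING` and reads `aof_enabled` from `INFO persistence` using `redis-cli -u`; the function name `redis_verify` and the `REDIS_CLI` override are assumptions, and it presumes you have a Redis URL reachable from where the script runs.

```shell
#!/usr/bin/env bash
# Sketch: check that Redis answers and that AOF persistence is on.

# $1 = Redis URL (redis://user:password@host:port).
# Set REDIS_CLI to use a different client binary (hypothetical override).
redis_verify() {
  local url="$1" cli="${REDIS_CLI:-redis-cli}" pong aof
  pong="$("$cli" -u "$url" PING)" || return 1
  if [ "$pong" != "PONG" ]; then
    echo "unexpected PING reply: ${pong}" >&2
    return 1
  fi
  # INFO lines end in CRLF, so strip the carriage returns before matching.
  aof="$("$cli" -u "$url" INFO persistence | tr -d '\r' | grep '^aof_enabled:')" || aof=""
  echo "$aof"
  [ "$aof" = "aof_enabled:1" ]
}

# Usage:
# redis_verify "$REDIS_URL" && echo "Redis healthy with AOF enabled"
```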
Major Version Updates (6.x → 7.x)
- Create a backup
- Update image and deploy
- If issues occur, Redis will recover from the AOF on restart
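One possible way to take the backup is to pull an RDB snapshot with `redis-cli --rdb`, which transfers a dump from the server to a local file. This is a sketch under assumptions, not the template's own backup mechanism: the function name `redis_backup`, the `REDIS_CLI` override, and the output path are all hypothetical.

```shell
#!/usr/bin/env bash
# Sketch: download an RDB snapshot before a major Redis upgrade.

# $1 = Redis URL, $2 = local output file.
redis_backup() {
  local url="$1" outfile="$2" cli="${REDIS_CLI:-redis-cli}"
  "$cli" -u "$url" --rdb "$outfile" || return 1
  # -s is true only when the file exists and is non-empty.
  if [ ! -s "$outfile" ]; then
    echo "backup file is empty: ${outfile}" >&2
    return 1
  fi
  echo "backup written: ${outfile}"
}

# Usage:
# redis_backup "$REDIS_URL" "redis-pre-upgrade.rdb"
```

Keeping the snapshot outside the Redis volume means a failed upgrade does not depend on the AOF alone.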
Upgrading to Redis Cluster
For high-availability Redis (rarely required):
- Deploy Redis Sentinel or Redis Cluster template
- Update all REDIS_* environment variables
- Test connectivity from all services
- Remove standalone Redis
General Upgrade Best Practices
Before Any Upgrade
- ✅ Review changelog for breaking changes
- ✅ Verify recent backup exists
- ✅ Test in staging if possible
- ✅ Schedule maintenance window
- ✅ Notify stakeholders
During Upgrade
- Monitor deployment logs
- Check health endpoints immediately
- Run smoke tests
- Watch for error spikes
After Upgrade
- Verify all services healthy
- Check key functionality
- Monitor for 24 hours
- Document any issues
Rollback Triggers
Roll back immediately if:
- Health checks fail after 5 minutes
- Error rate >5% for 10 minutes
- Core functionality broken
- Data integrity issues
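The two numeric triggers can be encoded so that a monitoring script applies them consistently. The sketch below is illustrative only: the function name `should_rollback` and its integer inputs are assumptions, and how you measure minutes of failing health checks and error rate is left to your monitoring. The other two triggers (broken core functionality, data integrity issues) still need human judgment.

```shell
#!/usr/bin/env bash
# Sketch: decide on rollback from the numeric thresholds in this guide.

# $1 = minutes health checks have been failing since the deploy
# $2 = current error rate as a whole-number percentage
# $3 = minutes the error rate has stayed at that level
# Returns 0 (and prints the reason) when a rollback is warranted, 1 otherwise.
should_rollback() {
  local unhealthy_min="$1" error_pct="$2" error_min="$3"
  if [ "$unhealthy_min" -ge 5 ]; then
    echo "rollback: health checks failing for ${unhealthy_min} min"
    return 0
  fi
  if [ "$error_pct" -gt 5 ] && [ "$error_min" -ge 10 ]; then
    echo "rollback: error rate ${error_pct}% for ${error_min} min"
    return 0
  fi
  echo "ok: thresholds not reached"
  return 1
}

# Usage:
# should_rollback 0 7 12 && echo "start rollback"
```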
Version Compatibility Matrix
| Component | Minimum | Recommended | Maximum |
|---|---|---|---|
| LiteLLM | 1.30+ | latest stable | - |
| Langfuse | 3.0+ | latest v3 | - |
| PostgreSQL | 14 | 16 | 16 |
| ClickHouse | 23+ | 24 | - |
| Redis | 7.0 | 7.2 | 7.4 |
| MinIO | 2023+ | latest | - |
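A version can be checked against the minimums in this table with a small comparison helper. The sketch below uses `sort -V` (version sort, available in GNU coreutils) and a hypothetical function name `version_ge`; how you obtain the running version of each component is up to you.

```shell
#!/usr/bin/env bash
# Sketch: compare a component version against a minimum from the matrix.

# True (exit 0) when version $1 is greater than or equal to version $2.
version_ge() {
  # Under version ordering the smaller of the two sorts first,
  # so $1 >= $2 exactly when $2 is the first line.
  [ "$(printf '%s\n%s\n' "$2" "$1" | sort -V | head -n 1)" = "$2" ]
}

# Usage with minimums from the table above:
# version_ge "7.2" "7.0" && echo "Redis version OK"
# version_ge "16" "14" && echo "PostgreSQL version OK"
```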
Support
If you encounter issues during upgrades:
- Check service logs in Railway Dashboard
- Review this guide's troubleshooting sections
- Open issue on GitHub repository
- Contact Railway support for infrastructure issues