Production Operations
Logging
PSPF supports structured JSON logging for production environments. Configure this via the LOG_FORMAT environment variable.
- Development: LOG_FORMAT=text (default) - Human-readable logs.
- Production: LOG_FORMAT=json - NDJSON format with timestamp, level, logger, message, and context.
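To make the NDJSON shape concrete, here is a minimal sketch of a formatter that emits one JSON object per line with the five fields listed above. It is not PSPF's own implementation: the names NDJSONFormatter and build_handler, the logger name pspf.example, and the convention of passing context through extra={"context": ...} are all assumptions made for illustration.

```python
import json
import logging
import os
import sys
from datetime import datetime, timezone


class NDJSONFormatter(logging.Formatter):
    """Render each record as a single-line JSON object (NDJSON)."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "timestamp": datetime.fromtimestamp(
                record.created, tz=timezone.utc
            ).isoformat(),
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
            # Assumption: callers attach context via extra={"context": {...}}.
            "context": getattr(record, "context", {}),
        }
        return json.dumps(payload)


def build_handler(log_format: str) -> logging.Handler:
    """Pick a formatter based on the LOG_FORMAT value ("json" or "text")."""
    handler = logging.StreamHandler(sys.stderr)
    if log_format == "json":
        handler.setFormatter(NDJSONFormatter())
    else:
        handler.setFormatter(
            logging.Formatter("%(asctime)s %(levelname)s %(name)s: %(message)s")
        )
    return handler


if __name__ == "__main__":
    logger = logging.getLogger("pspf.example")  # hypothetical logger name
    logger.setLevel(logging.INFO)
    logger.addHandler(build_handler(os.environ.get("LOG_FORMAT", "text")))
    logger.info("batch processed", extra={"context": {"stream": "orders"}})
```

Running the script with LOG_FORMAT=json prints one JSON object per log call, which is the form most log shippers expect; without the variable it falls back to plain text.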
Alerting Examples (Prometheus)
Here are recommended Prometheus alert rules for monitoring PSPF applications.
groups:
  - name: pspf-alerts
    rules:
      # 1. High Consumer Lag
      # Warning: If lag > 1000 for more than 5 minutes
      - alert: HighConsumerLag
        expr: stream_lag > 1000
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High lag on {{ $labels.stream }} (Group: {{ $labels.group }})"
          description: "Consumer group is falling behind by {{ $value }} messages."
      # 2. Worker Down
      # Critical: If worker claims to be running but status is 0
      - alert: WorkerStopped
        expr: pspf_worker_status == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Worker instance is stopped"
      # 3. High Error Rate
      # Warning: If > 5% of processed messages result in errors
      # sum() aggregates across label sets; without it, the division matches
      # only series with identical labels (error against error).
      - alert: HighErrorRate
        expr: sum(rate(stream_messages_processed_total{status="error"}[5m])) / sum(rate(stream_messages_processed_total[5m])) > 0.05
        for: 2m
        labels:
          severity: warning
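As a worked illustration of the 5% error-rate threshold, the sketch below computes the same ratio from two samples of each counter taken at the start and end of one window. It is a simplification: Prometheus's rate() also handles counter resets and extrapolation, which this ignores. The function names are hypothetical.

```python
def error_ratio(errors_start: float, errors_end: float,
                total_start: float, total_end: float) -> float:
    """Fraction of messages processed in the window that ended in error.

    Both rates cover the same window, so the window length cancels and the
    ratio of rates equals the ratio of counter increases.
    """
    errors = errors_end - errors_start
    total = total_end - total_start
    if total <= 0:
        return 0.0  # nothing processed in the window, nothing to alert on
    return errors / total


def should_alert(ratio: float, threshold: float = 0.05) -> bool:
    """Mirror the '> 0.05' comparison used by the HighErrorRate rule."""
    return ratio > threshold


if __name__ == "__main__":
    # 30 errors out of 500 processed messages is 6%, above the 5% threshold.
    print(should_alert(error_ratio(100, 130, 1000, 1500)))
```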
High Availability & Clustering
PSPF supports distributed coordination via a ClusterCoordinator (backed by Valkey).
Node Registration
Each worker registers itself in Valkey with a TTL. You can view the cluster status via the Admin API:
curl http://localhost:8000/cluster/status
Partition Rebalancing
PSPF implements Automatic Rebalancing for zero-downtime scaling:
1. Fair Share Calculation: Nodes periodically scan the cluster state to determine the total number of partitions and active nodes.
2. Voluntary Relinquishment: If a node holds significantly more partitions than its fair share (e.g., after many new workers join), it will voluntarily release some of its partition locks.
3. Seamless Handover: Idle workers will then claim these released partitions and start processing, ensuring an even load distribution.
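The fair-share and relinquishment steps can be sketched as follows. This is an illustration under stated assumptions, not PSPF's actual algorithm: the function names, the ceiling-based definition of fair share, the tolerance parameter (standing in for "significantly more"), and the choice to release the highest-numbered partitions are all hypothetical.

```python
import math
from typing import List


def fair_share(total_partitions: int, active_nodes: int) -> int:
    """Largest number of partitions one node holds under an even split."""
    if active_nodes <= 0:
        raise ValueError("active_nodes must be positive")
    return math.ceil(total_partitions / active_nodes)


def partitions_to_release(held: List[int], total_partitions: int,
                          active_nodes: int, tolerance: int = 0) -> List[int]:
    """Return the partitions a node would give up to reach its fair share.

    `tolerance` is how many partitions above fair share a node may keep
    before it starts releasing anything.
    """
    limit = fair_share(total_partitions, active_nodes) + tolerance
    excess = len(held) - limit
    if excess <= 0:
        return []
    # Which partitions to drop is arbitrary here: take the highest-numbered.
    return sorted(held)[-excess:]


if __name__ == "__main__":
    # One node held all 12 partitions, then two more workers joined.
    print(partitions_to_release(list(range(12)), total_partitions=12,
                                active_nodes=3))
```

In this example the original node keeps four partitions and releases eight, which the two new workers can then claim.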
Failover
If a worker crashes, its heartbeat TTL in Valkey will expire (default 10s). Other workers will then use try_acquire_leadership() to take over the orphaned partitions, ensuring continuous processing.
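The TTL-based takeover can be illustrated with an in-memory stand-in for the key-value store. This sketch does not use Valkey and does not reproduce try_acquire_leadership(), whose signature is not shown here; the TTLLockStore class, its method names, and the partition:0 key format are hypothetical. The clock is injectable so that expiry can be demonstrated without sleeping.

```python
import time
from typing import Callable, Dict, Optional, Tuple


class TTLLockStore:
    """In-memory stand-in for a store whose keys expire after a TTL."""

    def __init__(self, clock: Callable[[], float] = time.monotonic) -> None:
        self._clock = clock
        # key -> (owner node id, absolute expiry time)
        self._locks: Dict[str, Tuple[str, float]] = {}

    def owner(self, key: str) -> Optional[str]:
        """Return the current owner, or None if the lock is free or expired."""
        entry = self._locks.get(key)
        if entry is None or entry[1] <= self._clock():
            return None
        return entry[0]

    def acquire_or_refresh(self, key: str, node_id: str, ttl: float) -> bool:
        """Take a free or expired lock, or extend one this node already holds."""
        current = self.owner(key)
        if current is not None and current != node_id:
            return False  # a live node still owns this partition
        self._locks[key] = (node_id, self._clock() + ttl)
        return True


if __name__ == "__main__":
    now = [0.0]  # fake clock, advanced manually
    store = TTLLockStore(clock=lambda: now[0])
    store.acquire_or_refresh("partition:0", "worker-a", ttl=10.0)
    # worker-a crashes and stops refreshing; 11 seconds later the TTL is gone.
    now[0] = 11.0
    print(store.acquire_or_refresh("partition:0", "worker-b", ttl=10.0))
```

While worker-a keeps refreshing within the TTL, other workers' attempts fail; once it stops, the first attempt after expiry succeeds and processing resumes on the new owner.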