Performance Testing
Stress test
This stress test evaluates the performance and scalability of the MSR (Message State Replay) API under varying data loads. The testing focuses on the replay/state endpoint's ability to handle increasing volumes of unique entities while maintaining acceptable response times and resource utilization.
- Script to bump data: bump_cdc_events.sql
Test Methodology
- Data is incrementally loaded using a script that adds 11k unique entities per execution
- The msr-database service is stopped between API calls to prevent caching effects
- Each test scenario is executed 3 times for consistency
- Performance metrics include API response time, payload size, refresh job duration, and direct database query execution time
Summary results:
- The SQL statement behind replay/state performs well when executed directly against the database.
- refresh_snapshot functions well, within acceptable limits.
- The replay/state API has a performance issue related to HTTP response construction.
Resource recommendations
Minimum Production Resources
resources:
  requests:
    cpu: "50m"
    memory: "128Mi"
  limits:
    cpu: "100m"
    memory: "1024Mi"
Recommended for 100k+ entities
resources:
  requests:
    cpu: "100m"
    memory: "256Mi"
  limits:
    cpu: "500m"
    memory: "2048Mi"
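These values can also be applied without editing manifests; a minimal sketch using kubectl (the deployment name msr-app and namespace placeholder are assumptions):
# Apply the "100k+ entities" profile to the running deployment
kubectl -n <namespace> set resources deployment/msr-app \
  --requests=cpu=100m,memory=256Mi \
  --limits=cpu=500m,memory=2048Mi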
Test details
Local Environment (Dev container)
- macOS device with Apple M4 chip
- Podman (8 cores, 12GB RAM)
- Timescaledb version: 2.22.0
- Stop msr-database between each API call to avoid caching.
| Entities | API Response | Payload Size | DB Query | Refresh Job |
|---|---|---|---|---|
| 11k | < 1s | ~12MB | < 500ms | < 300ms |
| 22k | < 2s | ~24MB | < 1s | < 500ms |
| 60k | < 3.5s | ~65MB | 1s | ~1.5s |
| 100k | < 5s | ~108MB | < 3s | ~4s |
| 200k | < 9s | ~173MB | 5s | ~6s |
- On the local machine, the replay/state API works fine; response times are acceptable
- Detailed results for the test cases are below
- Also tested: SET LOCAL timescaledb.skip_scan_run_cost_multiplier = 0 does not affect performance on the local machine
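One way to reproduce that check (SET LOCAL only takes effect inside a transaction; the query shown is the comparison query used elsewhere in this guide):
psql -h $DB_HOST -p $DB_PORT -U <username> -d <database> <<'SQL'
\timing on
BEGIN;
SET LOCAL timescaledb.skip_scan_run_cost_multiplier = 0;
SELECT * FROM msr.entity_last_states WHERE snapshot_time <= NOW();
COMMIT;
SQL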
Dev Environment
Resources configuration:
resources:
  requests:
    cpu: "50m"
    memory: "128Mi"
  limits:
    cpu: "100m"
    memory: "1024Mi"
- Timescaledb version: 2.22.0
- Each scenario below was run 3 times
| Entities | API Response | Payload Size | DB Query | Refresh Job |
|---|---|---|---|---|
| 11k | ~6s | ~12MB | < 500ms | ~300ms |
| 22k | ~10-11s | ~25MB | < 1s | ~500ms |
| 60k | 22-26s | ~68MB | 1s | ~1.5s |
| 100k | ~30s | ~90MB | < 3s | ~4s |
| 200k* | ~60s | ~173MB | 5s | ~6s |
*Requires 1GB memory limit
- Resource metrics (screenshots omitted):
  - MSR_APP
  - Timescaledb
QA Environment
Resources configuration:
resources:
  requests:
    cpu: "50m"
    memory: "128Mi"
  limits:
    cpu: "100m"
    memory: "512Mi"
- Timescaledb version: 2.22.0
- Each scenario below was run 3 times
| Entities | API Response | Payload Size | DB Query | Refresh Job | Status |
|---|---|---|---|---|---|
| 11k | ~8s | ~12MB | ~1s | ~200ms | ✓ |
| 22k | ~18s | ~25MB | ~2s | ~300ms | ✓ |
| 60k | - | - | - | - | ✗ Crashed |
- Resource metrics (screenshots omitted):
  - MSR_APP
  - Timescaledb
Stress testing procedures
Prepare test data
# Edit bump_cdc_events.sql to generate larger batches:
#   set p_total_days, p_runs_per_day, p_unique_entities_total
#   (example: p_total_days := 1, p_runs_per_day := 1, p_unique_entities_total := 100000)
# Run the script
psql -h $DB_HOST -p $DB_PORT -U <username> -d <database> -f ./docs/bump_cdc_events.sql
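Optionally, a quick sanity check after loading; a sketch that uses the msr.cdc_event columns referenced elsewhere in this guide:
# Confirm row count, unique entities, and time span of the generated data
psql -h $DB_HOST -p $DB_PORT -U <username> -d <database> -c \
  "SELECT COUNT(*) AS total_events,
          COUNT(DISTINCT entity_id) AS unique_entities,
          MIN(event_timestamp) AS first_event,
          MAX(event_timestamp) AS last_event
   FROM msr.cdc_event;"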
Test Scenario 1: Baseline Performance (10k entities)
Objective: Establish baseline performance with minimal data
Steps:
1. Prepare test data.
2. Refresh snapshot (if using snapshot system):
   SELECT msr.refresh_earliest_snapshot();
3. Stop and restart database if needed (to clear caches):
   # Local environment
   podman stop msr-database
   podman start msr-database
   # Kubernetes
   kubectl delete pod -n <namespace> <msr-database-pod>
4. Execute API test:
   # Time the API call
   time curl -w "\nHTTP Status: %{http_code}\nTotal Time: %{time_total}s\nSize: %{size_download} bytes\n" \
     -H "Authorization: Bearer $TOKEN" \
     "$API_URL/replay/state?timestamp=$(date -u +%Y-%m-%dT%H:%M:%SZ)"
5. Execute direct database query (for comparison):
   \timing on
   SELECT * FROM msr.entity_last_states
   WHERE snapshot_time <= NOW();
6. Record results:
   - API response time: _______
   - API payload size: _______
   - Database query time: _______
   - Refresh job duration: _______
   - Memory usage (peak): _______
   - CPU usage (peak): _______
7. Repeat 3 times for consistency (a scripted sketch of steps 3-5 follows this list).
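For repeat runs, a minimal sketch that chains steps 3-5 in the local environment; the variables ($TOKEN, $API_URL, $DB_HOST, $DB_PORT), output path, and the 10-second wait are assumptions to adapt to your setup:
# Restart the database container to clear caches (local environment)
podman stop msr-database && podman start msr-database
sleep 10   # assumed wait for the database to accept connections again

# Step 4: time the API call and save the payload for size inspection
time curl -s -o /tmp/replay_state.json \
  -w "HTTP %{http_code}, %{time_total}s, %{size_download} bytes\n" \
  -H "Authorization: Bearer $TOKEN" \
  "$API_URL/replay/state?timestamp=$(date -u +%Y-%m-%dT%H:%M:%SZ)"

# Step 5: rough end-to-end timing of the direct query (includes client and transfer time)
time psql -h $DB_HOST -p $DB_PORT -U <username> -d <database> \
  -c "SELECT * FROM msr.entity_last_states WHERE snapshot_time <= NOW();" > /dev/null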
Test Scenario 2: Medium Load (22k-60k entities)
Objective: Test performance under moderate data volumes
Steps:
1. Add more test data (see the loop sketch after these steps).
2. Verify entity count:
   SELECT COUNT(DISTINCT entity_id) FROM msr.cdc_event;
3. Repeat Test Scenario 1 steps 2-7.
4. Continue adding data in 11k increments until reaching 60k entities.
5. Record results at each increment (22k, 33k, 44k, 55k, 60k).
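One possible way to script the incremental loading (each run of bump_cdc_events.sql adds ~11k entities; the iteration count and connection variables are placeholders):
# Load data in 11k increments and report the entity count after each run
for i in 1 2 3 4 5; do
  psql -h $DB_HOST -p $DB_PORT -U <username> -d <database> -f ./docs/bump_cdc_events.sql
  psql -h $DB_HOST -p $DB_PORT -U <username> -d <database> -tAc \
    "SELECT COUNT(DISTINCT entity_id) FROM msr.cdc_event;"
done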
Test Scenario 3: High Load (100k entities)
Objective: Test performance at expected production scale
Steps:
1. Add test data to reach 100k entities.
2. Monitor resource usage during data generation:
   # In separate terminal, monitor continuously
   watch -n 2 'kubectl top pods -n <namespace> | grep msr'
3. Repeat Test Scenario 1 steps 2-7.
4. Check for error conditions (a log-scan sketch follows this list):
   - Out of memory errors
   - Connection timeouts
   - Database connection pool exhaustion
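A quick way to scan recent application logs for these conditions; the grep patterns are assumptions about the log wording and should be adjusted to the actual messages:
kubectl logs -n <namespace> <msr-app-pod> --since=30m \
  | grep -Ei 'out of memory|oom|timeout|too many (connections|clients)'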
Test Scenario 4: Stress Test (200k+ entities)
Objective: Identify breaking points and resource limits
Warning: This test may cause service degradation or failure. Only run in non-production environments.
Steps:
1. Increase resource limits (if possible):
   resources:
     limits:
       memory: "2048Mi"
       cpu: "500m"
2. Generate large dataset:
   -- Edit bump_cdc_events.sql
   -- Set p_unique_entities_total := 200000
   \i docs/bump_cdc_events.sql
3. Repeat Test Scenario 1 steps 2-7.
4. Monitor for failures:
   # Watch logs in real-time
   kubectl logs -f -n <namespace> <msr-app-pod>
   # Watch for pod restarts
   kubectl get events -n <namespace> --watch
5. Document failure modes:
   - At what entity count did failures occur?
   - What was the failure type? (OOM, timeout, crash)
   - What were the resource metrics at failure?
Test Scenario 5: Concurrent User Load
Objective: Test multi-session performance
Example: Create concurrent API requests:
# Simple concurrent test with curl
for i in {1..5}; do
  curl -H "Authorization: Bearer $TOKEN" \
    "$API_URL/replay/state?timestamp=<timestamp_in_RFC3339_format>" \
    > /tmp/response_$i.json &
done
wait
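To check that all concurrent responses came back complete, a small follow-up sketch (assumes jq is installed):
# Report the size of each saved response and verify it parses as JSON
for f in /tmp/response_*.json; do
  printf '%s: %s bytes, ' "$f" "$(wc -c < "$f")"
  jq empty "$f" >/dev/null 2>&1 && echo "valid JSON" || echo "INVALID"
done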
Cleanup Test Data
After testing, clean up test data:
-- Delete test data (be careful!)
DELETE FROM msr.cdc_event
WHERE entity_id LIKE '%test%'
OR event_timestamp > '<test-start-time>';
-- Vacuum to reclaim space
VACUUM FULL msr.cdc_event;
-- Reset snapshot
SELECT msr.refresh_earliest_snapshot();
Troubleshooting stress tests
Issue: API Response Much Slower Than Direct DB Query
Symptoms: Database query returns in < 10s, but API takes > 30s
Diagnosis:
-- Enable query logging
SET log_min_duration_statement = 0;
-- Check for serialization overhead
EXPLAIN ANALYZE SELECT * FROM msr.entity_last_states;
Solutions:
- Investigate JSON serialization performance
- Check for memory allocation issues in application code
- Review HTTP response construction logic
- See Performance Troubleshooting Guide for details
Issue: Out of Memory Errors
Symptoms: Application crashes or pod restarts during large requests
Diagnosis:
# Check OOM kills
kubectl describe pod <msr-app-pod> | grep -i oom
# Check memory usage trends
kubectl top pod <msr-app-pod>
Solutions:
- Increase memory limits
- Implement response streaming
- Add pagination to API endpoints
- Optimize data structures in application code
Issue: Database Connection Pool Exhaustion
Symptoms: "Too many connections" errors during concurrent tests
Diagnosis:
-- Check active connections
SELECT count(*) FROM pg_stat_activity
WHERE datname = current_database();
-- Check connection limits
SHOW max_connections;
Solutions:
- Tune connection pool settings
- Implement connection retry logic
- Add connection monitoring/alerts
- Scale database horizontally if needed
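Before tuning the pool, it helps to see where connections come from; a sketch using pg_stat_activity (connection variables are placeholders):
# Group connections by state and client application to spot pool leaks
psql -h $DB_HOST -p $DB_PORT -U <username> -d <database> -c \
  "SELECT state, application_name, count(*)
   FROM pg_stat_activity
   WHERE datname = current_database()
   GROUP BY state, application_name
   ORDER BY count(*) DESC;"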
Infrastructure Metrics
1. API Response Time
Metric: msr_api_response_duration_seconds
What it measures: Time taken to process and respond to API requests
Target:
- P50: < 5s (60k entities)
- P95: < 10s (60k entities)
- P99: < 15s (60k entities)
Alert Thresholds:
- Warning: P95 > 15s for 5 minutes
- Critical: P95 > 30s for 5 minutes
- Critical: P50 > 20s for 5 minutes
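Where the metric is not yet exported, a rough manual sample can approximate these percentiles; a sketch (N, $TOKEN, and $API_URL are placeholders, and sequential sampling only gives a rough estimate):
# Issue N sequential requests and derive rough P50/P95 from curl's total time
N=20
for i in $(seq 1 "$N"); do
  curl -s -o /dev/null -w "%{time_total}\n" \
    -H "Authorization: Bearer $TOKEN" \
    "$API_URL/replay/state?timestamp=$(date -u +%Y-%m-%dT%H:%M:%SZ)"
done | sort -n | awk -v n="$N" \
  'BEGIN { p50 = int(n * 0.5); p95 = int(n * 0.95) }
   { v[NR] = $1 }
   END { printf "P50=%ss P95=%ss\n", v[p50], v[p95] }'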
2. Pod Memory Usage
Metric: Container memory consumption
What it measures: Memory used by MSR pods
Target: < 80% of memory limit
Collection Method:
kubectl top pod <msr-app-pod>
3. Pod CPU Usage
Metric: Container CPU consumption
What it measures: CPU used by MSR pods
Target: < 80% of CPU limit
Collection Method:
kubectl top pod <msr-app-pod>
4. Pod Restart Count
Metric: Number of pod restarts
What it measures: Pod stability
Target: Zero unexpected restarts
Collection Method:
kubectl get pods -o json | jq '.items[].status.containerStatuses[].restartCount'
5. Disk I/O Wait
Metric: I/O wait percentage
What it measures: Time spent waiting for disk I/O
Target: < 20%
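One way to sample I/O wait, assuming vmstat is available inside the database pod (the "wa" column is the I/O wait percentage):
# Sample every second for 5 seconds; watch the "wa" column
kubectl exec -n <namespace> <msr-database-pod> -- vmstat 1 5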
Performance Checklist
When investigating performance issues, work through this checklist:
Initial Assessment
- Identify affected environment (local, dev, qa, prod)
- Determine entity count / data volume
- Measure baseline performance
- Check recent changes or deployments
- Review resource limits and usage
Database Layer
- Check query execution time (EXPLAIN ANALYZE)
- Verify indexes exist and are being used
- Check table statistics are up-to-date
- Monitor active queries and connections
- Check for table bloat
- Verify TimescaleDB policies are running
Application Layer
- Compare API time vs DB query time
- Check application logs for errors
- Monitor memory usage trends
- Check connection pool health
- Profile CPU and memory (pprof)
- Check for goroutine leaks
Infrastructure
- Verify CPU and memory limits are adequate
- Check for resource contention
- Monitor disk I/O
- Check for OOMKilled events
- Review pod scheduling and node health
Load test
Overview
This document summarizes the load testing results for the MSR replay/state API endpoint.
Test Configurations
The following test scenarios were executed to evaluate system performance under different load conditions:
| Test Scenario | Threads (Users) | Loops | Total Requests |
|---|---|---|---|
| 1_thread_1_loop | 1 | 1 | 1 |
| 5_thread_1_loop | 5 | 1 | 5 |
| 5_thread_5_loop | 5 | 5 | 25 |
| 5_thread_10_loop | 5 | 10 | 50 |
| 25_thread_1_loop | 25 | 1 | 25 |
| 50_thread_1_loop | 50 | 1 | 50 |
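Each scenario can be reproduced from the command line in JMeter non-GUI mode; the test plan file name and the -Jthreads/-Jloops property names below are assumptions and require the .jmx to read them (for example via ${__P(threads)}):
# Example: the 5_thread_1_loop scenario
jmeter -n -t msr_replay_state.jmx \
  -Jthreads=5 -Jloops=1 \
  -l results_5_thread_1_loop.jtl \
  -e -o report_5_thread_1_loop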
Key Performance Metrics
- Data source:
  - Total records: 3.6M
  - Total unique entities: 100k
  - Data duration: 30 days
1. Single Thread Baseline (1 thread x 1 loop)
Summary:
- Total Requests: 1
- Success Rate: 100%
- Response Time: 2,005 ms
- Latency: 878 ms
- Connection Time: 154 ms
- Response Size: ~5.86 MB
Analysis: Baseline performance with a single request shows acceptable response time of approximately 2 seconds for the large response payload.
2. Low Concurrency Test (5 threads x 1 loop)
Summary:
- Total Requests: 5
- Success Rate: 100%
- Average Response Time: ~5,790 ms
- Response Time Range: 3,741 ms - 7,291 ms
- Average Latency: ~2,761 ms
- Connection Time Range: 152 ms - 196 ms
Analysis: With 5 concurrent users, response times increased significantly to approximately 3-7 seconds. The system handled the load without errors, but showed performance degradation compared to the baseline.
3. Medium Sustained Load (5 threads x 5 loops)
Summary:
- Total Requests: 25
- Success Rate: 100%
- Response Time Range: 2,490 ms - 12,171 ms
- Average Response Time: ~4,500 ms
- Connection Time Range: 144 ms - 1,156 ms
Key Observations:
- First iteration showed higher response times (3,896 ms - 7,355 ms)
- Subsequent iterations improved with response times ranging from 2,238 ms - 5,861 ms
- Some requests showed prolonged connection times over 1 second
- System maintained stability across all iterations
Analysis: The system demonstrated good resilience under sustained load with 5 concurrent threads. Performance improved after initial requests, suggesting effective caching or connection pooling mechanisms.
4. Extended Medium Load (5 threads x 10 loops)
Summary:
- Total Requests: 50
- Success Rate: 100%
- Response Time Range: 2,125 ms - 29,154 ms
- Majority of Requests: 2,000 ms - 6,000 ms
- Peak Response Time: 29,154 ms (outlier)
Key Observations:
- Most requests completed within 2-6 seconds
- Several outlier requests exceeded 10+ seconds
- Connection establishment remained stable (138 ms - 704 ms)
- No request failures despite extended duration
Analysis: Under prolonged load, the system remained stable but showed occasional performance spikes. The outliers suggest potential resource contention or garbage collection events.
5. High Concurrency Test (25 threads x 1 loop)
Summary:
- Total Requests: 25
- Success Rate: 100%
- Response Time Range: 8,036 ms - 30,820 ms
- Average Response Time: ~20,000 ms
- Latency Range: 6,860 ms - 21,108 ms
- Connection Time Range: 140 ms - 196 ms
Performance Distribution:
- 8-15 seconds: 8 requests (32%)
- 15-25 seconds: 12 requests (48%)
- 25-31 seconds: 5 requests (20%)
Key Observations:
- All 25 concurrent requests completed successfully with no failures
- Median response time: ~22 seconds
- Fastest request: 8.0 seconds, slowest: 30.8 seconds
- Connection establishment remained very stable (140-196 ms)
Analysis: At 25 concurrent users, the system maintained 100% reliability while response times increased to 8-31 seconds range. While significantly slower than lower concurrency tests, the system handled the load without any failures, demonstrating good stability under high concurrent load.
6. High Concurrency Test (50 threads x 1 loop)
Summary:
- Total Requests: 50
- Success Rate: 76% (38/50 successful)
- Failed Requests: 12 (500 Internal Server Error)
- Response Time Range (Successful): 9,006 ms - 56,562 ms
- Average Response Time (Successful): ~42,000 ms
- Failed Request Times: 33,533 ms - 38,203 ms
- Latency Range: 7,909 ms - 44,521 ms
- Connection Time Range: 139 ms - 229 ms
Performance Distribution (Successful Requests):
- 9-20 seconds: 6 requests (16%)
- 20-30 seconds: 3 requests (8%)
- 30-40 seconds: 9 requests (24%)
- 40-50 seconds: 12 requests (31%)
- 50-57 seconds: 8 requests (21%)
Failed Requests:
- Thread 1-8: 37,043 ms - Internal Server Error
- Thread 1-11: 38,203 ms - Internal Server Error
- Thread 1-20: 34,091 ms - Internal Server Error
- Thread 1-22: 37,543 ms - Internal Server Error
- Thread 1-25: 37,454 ms - Internal Server Error
- Thread 1-26: 37,512 ms - Internal Server Error
- Thread 1-27: 37,873 ms - Internal Server Error
- Thread 1-32: 37,717 ms - Internal Server Error
- Thread 1-34: 37,287 ms - Internal Server Error
- Thread 1-39: 37,188 ms - Internal Server Error
- Thread 1-46: 36,267 ms - Internal Server Error
- Thread 1-48: 33,533 ms - Internal Server Error
Key Observations:
- 24% failure rate (12 out of 50 requests failed)
- Failures occurred between 33-38 seconds, suggesting a timeout threshold
- Most failures happened around the 37-second mark (10 out of 12 failures)
- Successful requests took significantly longer (up to 57 seconds)
- Connection establishment remained stable (139-229 ms)
Analysis: At maximum tested load of 50 concurrent threads, the system reached its breaking point with a 24% failure rate. All failures were 500 Internal Server Errors occurring between 33-38 seconds, strongly suggesting a server-side timeout or resource exhaustion threshold. This represents a critical degradation compared to the 25-thread test which maintained 100% success. The 50-thread mark clearly exceeds the system's current capacity limits.
Performance Trends
Response Time vs. Concurrency
| Threads | Avg Response Time | Success Rate | Performance Ratio |
|---|---|---|---|
| 1 | 2.0s | 100% | 1x (baseline) |
| 5 | 5.8s | 100% | 2.9x slower |
| 25 | 20.0s | 100% | 10x slower |
| 50 | 42.0s | 76% | 21x slower |
Key Findings
1. Non-Linear Performance Degradation: Response times increased non-linearly as concurrent users increased. At 5 threads, performance degraded by ~3x; at 25 threads, by 10x; and at 50 threads, by 21x compared to baseline. This indicates severe bottlenecks under high concurrency.
2. Capacity Threshold Identified: The system maintained a 100% success rate up to 25 concurrent threads but experienced a 24% failure rate at 50 threads, placing the critical capacity threshold between 25 and 50 concurrent users.
3. Timeout Pattern at 35-38 Seconds: All 12 failures at 50 threads occurred between 33 and 38 seconds (most at ~37 seconds), strongly indicating a server-side timeout configuration or resource exhaustion threshold around 35-40 seconds.
4. Response Size Impact: The large response payload (~5.86 MB per request) significantly impacts performance under load. At 50 threads, the system attempts to process ~293 MB of response data simultaneously, contributing to bandwidth saturation, memory pressure, and extended response times.
5. Connection Layer Stability: Connection establishment times remained consistently stable across all test scenarios (139 ms - 229 ms), indicating the bottleneck is not in the network connection layer but in application processing, database queries, or bandwidth constraints.
6. System Breaking Point: The system's breaking point lies between 25 and 50 concurrent threads. At 25 threads, the system is stressed but functional (100% success). At 50 threads, it is overwhelmed (76% success), making this the maximum capacity limit without optimization.
Recommendations
Performance Improvements
1. Connection Pooling: Ensure proper database and HTTP connection pooling to handle concurrent requests efficiently.
2. Resource Scaling: Current performance suggests the system may need vertical or horizontal scaling to support more than 5-10 concurrent users effectively.
3. Timeout Configuration: Review and adjust timeout settings, particularly for requests approaching or exceeding 30 seconds.
4. Load Balancing: Consider implementing load balancing, if not already in place, to distribute traffic across multiple instances.
Monitoring
1. Set SLA Targets: Based on these results, establish clear Service Level Agreements (e.g., 95% of requests complete within 5 seconds for up to 5 concurrent users).
2. Alerting: Implement monitoring and alerting for:
   - Response times exceeding 10 seconds
   - Error rates above 5%
   - Server resource utilization (CPU, memory, network)
3. Continuous Testing: Establish a regular load testing schedule to track performance trends over time.
Resource metrics during load test
- MSR_APP (screenshots omitted):
  - CPU usage
  - Memory usage
- Timescaledb (screenshot omitted):
  - During the bump of 3.6M records, MSR spiked to more than 1.5 CPU and 3GB of memory usage.
- Resources required:
  - msr_app:
    - request: cpu: 0.25, memory: 512MB
    - limit: cpu: 0.5, memory: 1GB
  - timescaledb:
    - request: cpu: 0.5, memory: 1GB
    - limit: cpu: 1, memory: 2GB
- Resource recommendation:
  - msr_app:
    - request: cpu: 0.5, memory: 1GB
    - limit: cpu: 1, memory: 2GB
  - timescaledb:
    - request: cpu: 1, memory: 2GB
    - limit: cpu: 2, memory: 4GB
Conclusion
The MSR application API demonstrates stable performance under low concurrency (1-5 users) with 100% success rates and response times under 6 seconds. However, performance degrades significantly at higher concurrency levels:
- Optimal Range: 1-5 concurrent users
- Degraded Performance: 5-25 concurrent users (performance slows but remains stable)
- Critical Threshold: 50 concurrent users (unstable with 24% failure rate)
The system can handle high load but requires optimization to maintain acceptable response times. The large response payload size is a significant contributing factor to performance degradation under load.
Test Date: 2025-11-04
Application: MSR (Mission State Replay)
Endpoint: https://msr.dev.agilopshub.com/replay/state
Test Tool: Apache JMeter