Heartbeat Cold Start + Persistence + Missed Beat Recovery Mechanism

Problem Description

The current implementation of the heartbeatService contains three interrelated defects, which result in the heartbeat failing to trigger as expected during normal use (especially in scenarios where Alma is frequently restarted during the development phase), and when it does trigger, it indiscriminately terminates all running agent tasks.

Root Cause Analysis

Based on API behavior and the return value of `/api/heartbeat/status`, it is inferred that `start()` uses `setInterval` rather than executing immediately, and `lastHeartbeatTime` is not persisted. The following pseudocode describes the inferred current behavior:

```js
// Inferred current behavior (pseudocode)
class HeartbeatService {
  lastHeartbeatTime = 0; // In-memory variable, reset to zero on restart

  start() {
    // No initial tick
    this.timer = setInterval(
      () => this.tick(),
      60 * this.config.intervalMinutes * 1000
    );
  }
}
```

Issue 1: No Initial Tick on Cold Start

setInterval does not execute the callback immediately upon invocation; instead, it waits for a full interval cycle to elapse before triggering the first callback. Assuming the interval is set to 60 minutes, Alma must wait a full 60 minutes after startup before the first heartbeat is executed.

If the user restarts Alma within those 60 minutes (which is common during development), the timer resets, and another 60 minutes must pass. As a result, the heartbeat may never be triggered.
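The starvation effect described above can be checked with a small model. This is a hypothetical sketch, not Alma code: `countHeartbeats` and its parameters are invented for illustration, and it assumes each restart resets the timer exactly as the inferred `start()` does.

```javascript
// Hypothetical model (not Alma code): counts heartbeats over a series of
// sessions, assuming every restart resets the interval timer.
function countHeartbeats(sessionMinutes, intervalMinutes, tickOnStart) {
  let total = 0;
  for (const length of sessionMinutes) {
    // setInterval fires at interval, 2 * interval, ... while the session lasts.
    total += Math.floor(length / intervalMinutes);
    // With an immediate first tick, every start also fires once.
    if (tickOnStart) total += 1;
  }
  return total;
}

// Four development sessions, each shorter than the 60-minute interval.
const sessions = [45, 30, 50, 20];
console.log(countHeartbeats(sessions, 60, false)); // 0
console.log(countHeartbeats(sessions, 60, true));  // 4
```

With the interval-only timer, no session lasts long enough to reach the first callback, so the count stays at zero no matter how many sessions there are.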

Issue 2: `lastHeartbeatTime` is in-memory only and resets on restart

`lastHeartbeatTime` is not persisted to SQLite or the settings JSON. Every time Alma restarts, this value resets to 0. This means:

  • It is impossible to determine "how long it has been since the last heartbeat" after a restart

  • It is impossible to decide whether an immediate catch-up heartbeat is needed

  • Heartbeat history cannot be traced in the logs

Issue 3: No catch-up logic

The current implementation has no logic for sending a catch-up heartbeat, even when multiple intervals have elapsed since the last one. For example, if Alma is turned off for 3 hours and then turned back on, with an interval of 60 minutes, it has missed 3 heartbeats, but after restarting it simply waits silently for the next full interval.
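The arithmetic in this example can be written as a small helper. This is a sketch only: `missedBeats` is a hypothetical name, and it assumes a persisted timestamp in milliseconds is available, which the current implementation does not provide.

```javascript
// Hypothetical helper: number of whole intervals that have elapsed since the
// last recorded heartbeat. Timestamps are milliseconds since the epoch.
function missedBeats(lastHeartbeatTime, now, intervalMinutes) {
  const intervalMs = intervalMinutes * 60 * 1000;
  // No record, or a clock that moved backwards: nothing can be computed.
  if (lastHeartbeatTime <= 0 || now <= lastHeartbeatTime) return 0;
  return Math.floor((now - lastHeartbeatTime) / intervalMs);
}

const HOUR_MS = 60 * 60 * 1000;
const last = Date.UTC(2024, 0, 1, 9, 0, 0);        // arbitrary example timestamp
console.log(missedBeats(last, last + 3 * HOUR_MS, 60)); // 3
```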

Derived Issue: Indiscriminate termination of agents when a heartbeat is triggered

This issue is particularly severe when combined with the three issues described above. In testing, it was observed that when a heartbeat is triggered, it indiscriminately terminates currently executing agent tasks (multiple concurrent agents are interrupted simultaneously), resulting in incomplete task execution. If the timing of heartbeat triggers is unpredictable (due to Issues 1–3), users cannot reasonably avoid this interruption.

Because of Issue 2 (lack of persistence), after a restart the agent cannot detect that it was previously interrupted and cannot resume from the point of interruption.

Recommended Fix

Fix 1: Execute the first tick immediately upon startup (one line of code)

```js
start() {
  this.tick(); // ← New: Execute immediately once
  this.timer = setInterval(
    () => this.tick(),
    60 * this.config.intervalMinutes * 1000
  );
}
```

Fix 2: Persist `lastHeartbeatTime` (approx. 5 lines)

Write `lastHeartbeatTime` to the `app_settings` table or a separate JSON file. Update the persisted value after each `tick()` execution, and read it from persistent storage at startup.

```js
start() {
  this.lastHeartbeatTime = this.loadFromStorage() || 0;
  // ...
}

tick() {
  // ... Execute heartbeat logic
  this.lastHeartbeatTime = Date.now();
  this.saveToStorage(this.lastHeartbeatTime);
}
```
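For the "separate JSON file" option, the two storage helpers might look like the following. This is a minimal sketch under assumptions: the file name and location are hypothetical (a real implementation would use Alma's own data directory or the `app_settings` table), and the helpers are written as free functions rather than methods so they can be tried in isolation.

```javascript
const fs = require("node:fs");
const os = require("node:os");
const path = require("node:path");

// Hypothetical location, chosen only for this example.
const STATE_FILE = path.join(os.tmpdir(), "heartbeat-state.json");

// Returns the persisted timestamp, or 0 if there is no usable record.
function loadFromStorage(file = STATE_FILE) {
  try {
    const parsed = JSON.parse(fs.readFileSync(file, "utf8"));
    return Number.isFinite(parsed.lastHeartbeatTime) ? parsed.lastHeartbeatTime : 0;
  } catch {
    return 0; // Missing or corrupt file: treat as "never ran"
  }
}

// Writes to a temporary file and renames it over the target, so an
// interrupted write is unlikely to leave a truncated state file behind.
function saveToStorage(lastHeartbeatTime, file = STATE_FILE) {
  const tmp = `${file}.tmp`;
  fs.writeFileSync(tmp, JSON.stringify({ lastHeartbeatTime }));
  fs.renameSync(tmp, file);
}
```

Returning 0 for a missing or unreadable file keeps the `|| 0` fallback in `start()` meaningful: a first launch and a damaged state file are both treated as "no previous heartbeat".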

Fix 3: Check for missed heartbeats at startup (approx. 10 lines)

```js
start() {
  const last = this.loadFromStorage() || 0;
  const elapsed = Date.now() - last;
  const intervalMs = 60 * this.config.intervalMinutes * 1000;

  if (elapsed >= intervalMs) {
    this.tick(); // Catch up
  } else {
    // Wait only for the remaining time, not the full interval.
    // Keep the timeout handle so the pending first tick can be cancelled.
    this.timer = setTimeout(() => {
      this.tick();
      this.timer = setInterval(() => this.tick(), intervalMs);
    }, intervalMs - elapsed);
    return;
  }

  this.timer = setInterval(() => this.tick(), intervalMs);
}
```
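The branch in Fix 3 reduces to a single question: how long should the service wait before its first tick? Extracting that into a pure function makes it easy to test without real timers. The sketch below is an assumption-laden illustration: `firstTickDelayMs` is a hypothetical name, and the guard for a clock that moved backwards is an addition not present in the proposal above.

```javascript
// Hypothetical pure helper mirroring the Fix 3 branch.
// All arguments are in milliseconds.
function firstTickDelayMs(lastHeartbeatTime, now, intervalMs) {
  const elapsed = now - lastHeartbeatTime;
  if (elapsed >= intervalMs) return 0;   // Overdue: catch up immediately
  if (elapsed < 0) return intervalMs;    // Clock moved backwards: cap the wait at one interval
  return intervalMs - elapsed;           // Wait only for the remainder
}

const MIN = 60 * 1000;
const now = Date.UTC(2024, 0, 1, 12, 0, 0); // arbitrary example timestamp
console.log(firstTickDelayMs(now - 20 * MIN, now, 60 * MIN) / MIN);  // 40
console.log(firstTickDelayMs(now - 180 * MIN, now, 60 * MIN) / MIN); // 0
console.log(firstTickDelayMs(0, now, 60 * MIN) / MIN);               // 0
```

A stored value of 0 (no record) falls into the overdue case, which matches the `|| 0` fallback in `start()`.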

Fix 4 (Recommended but not required): Graceful shutdown upon heartbeat trigger

Currently, all agents are terminated immediately when a heartbeat is triggered. It is recommended to change this to:

  1. Check if any agents are currently executing tasks

  2. If so, wait for the current task to complete (or set a timeout limit, such as 30 seconds)

  3. Execute the heartbeat logic only after the task completes or the timeout expires

  4. If waiting is not possible, at least have the agent write a checkpoint so that it can resume from the point of interruption after recovery

This change is quite extensive and can be addressed in a future optimization.
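As a starting point for that later work, steps 1 to 3 can be sketched with `Promise.race`. Everything here is hypothetical: `tickGracefully`, `getRunningTasks`, and `runHeartbeat` are stand-in names, and the sketch assumes running agent tasks can be represented as promises, which may not match how Alma tracks them.

```javascript
// Hypothetical sketch: wait for running tasks, up to a timeout, before
// running the heartbeat. `getRunningTasks` returns an array of promises,
// one per active agent task; `runHeartbeat` performs the heartbeat logic.
async function tickGracefully({ getRunningTasks, runHeartbeat, timeoutMs = 30000 }) {
  const running = getRunningTasks();
  let outcome = "idle";

  if (running.length > 0) {
    let timer;
    const timeout = new Promise((resolve) => {
      timer = setTimeout(() => resolve("timeout"), timeoutMs);
    });
    // allSettled never rejects, so a failing task does not abort the heartbeat.
    outcome = await Promise.race([
      Promise.allSettled(running).then(() => "idle"),
      timeout,
    ]);
    clearTimeout(timer);
  }

  // `timedOut` lets the heartbeat logic decide whether agents should
  // checkpoint first (step 4) instead of being terminated outright.
  await runHeartbeat({ timedOut: outcome === "timeout" });
  return outcome;
}
```

The function returns `"idle"` when all tasks finished in time and `"timeout"` when the limit was reached, so the caller can log which path was taken.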

Priority Recommendations

Fix Item | Scope of Changes | Priority | Reason
--- | --- | --- | ---
Fix 1 (Initial Tick) | 1 line | P0 | If not fixed, the heartbeat will almost never trigger in scenarios with frequent restarts
Fix 2 (Persistence) | ~5 lines | P1 | Only meaningful when used in conjunction with Fix 3
Fix 3 (Missed Beat Recovery) | ~10 lines | P1 | Resolves the gap upon restart after a prolonged shutdown
Fix 4 (Graceful Handling) | ~30 lines | P2 | Product experience issue; the current workaround is to avoid the heartbeat window

Environment Information

  • Alma version: v0.0.738

  • Operating System: Windows 11

  • Use Case: Running long-running tasks with multiple agents via the Discord bridge

  • Reproducibility: Abnormal heartbeat behavior occurs every time Alma is restarted; 100% reproducible

  • Status: In Review

  • Board: Feature Request

  • Date: About 2 months ago

  • Author: karlamo
