The current implementation of `heartbeatService` contains three interrelated defects. Together they cause the heartbeat to fail to fire during normal use (especially when Alma is restarted frequently, as happens during development), and when it does fire, it indiscriminately terminates all running agent tasks.
Based on observed API behavior and the return value of `/api/heartbeat/status`, it is inferred that `start()` uses `setInterval` without an immediate first tick, and that `lastHeartbeatTime` is not persisted. The following pseudocode describes the inferred current behavior:
```ts
// Inferred current behavior (pseudocode)
class HeartbeatService {
  lastHeartbeatTime = 0; // In-memory variable, reset to zero on restart

  start() {
    this.timer = setInterval( // No initial tick
      () => this.tick(),
      60 * this.config.intervalMinutes * 1000
    );
  }
}
```

Defect 1: `start()` never fires an initial heartbeat

`setInterval` does not execute its callback immediately upon invocation; the first call happens only after one full interval has elapsed. With the interval set to 60 minutes, Alma must therefore run for a full 60 minutes after startup before the first heartbeat executes.
If the user restarts Alma within those 60 minutes (which is common during development), the timer resets, and another 60 minutes must pass. As a result, the heartbeat may never be triggered.
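As a minimal sketch of this failure mode (the 60-minute interval value is assumed for illustration):

```ts
// setInterval schedules the FIRST callback one full interval in the future;
// nothing runs at t = 0.
const intervalMs = 60 * 60 * 1000; // assumed 60-minute interval

setInterval(() => console.log("heartbeat"), intervalMs);
// First "heartbeat" logs at t ≈ 60 min. If the process restarts before then,
// the timer starts over and no heartbeat is ever observed.
```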
Defect 2: `lastHeartbeatTime` is in-memory only and resets on restart

`lastHeartbeatTime` is not persisted to SQLite or the settings JSON. Every time Alma restarts, the value resets to 0. This means:
- After a restart, it is impossible to determine how long it has been since the last heartbeat
- It is impossible to decide whether an immediate catch-up heartbeat is needed
- Heartbeat history cannot be traced in the logs
Defect 3: missed heartbeats are never caught up

Even when multiple intervals have elapsed since the last heartbeat, the current implementation does not send a catch-up heartbeat. For example, if Alma is switched off for 3 hours and then restarted with a 60-minute interval, it has theoretically missed 3 heartbeats, yet it simply waits silently for the next full interval.
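The missed count in that example follows directly from the elapsed time; a sketch of the arithmetic, assuming a persisted timestamp were available (it currently is not):

```ts
// Hypothetical calculation of heartbeats missed while offline.
const intervalMs = 60 * 60 * 1000;            // 60-minute interval
const last = Date.now() - 3 * 60 * 60 * 1000; // last heartbeat 3 hours ago
const missed = Math.floor((Date.now() - last) / intervalMs); // → 3
```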
The impact of these three defects is compounded by how the heartbeat treats running agents. In testing, it was observed that when a heartbeat does trigger, it indiscriminately terminates the agent tasks currently executing (multiple concurrent agents are interrupted simultaneously), resulting in incomplete task execution. Because the timing of heartbeat triggers is unpredictable (due to Defects 1–3), users cannot reasonably avoid this interruption.
Due to Defect 2 (lack of persistence), after a restart, the agent cannot detect that it was previously interrupted and cannot resume from the point of interruption.
Recommended Fix
Execute an immediate tick in `start()`

```ts
start() {
  this.tick(); // ← New: execute once immediately
  this.timer = setInterval(
    () => this.tick(),
    60 * this.config.intervalMinutes * 1000
  );
}
```

Persist `lastHeartbeatTime` (approx. 5 lines)

Write `lastHeartbeatTime` to the `app_settings` table or a separate JSON file. Update the persisted value after each `tick()` execution, and read it back from persistent storage at startup.
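A minimal sketch of JSON-file-backed `loadFromStorage()` / `saveToStorage()` helpers, as used in the snippets below (the file path and helper names are assumptions, not Alma's actual storage API):

```ts
import * as fs from "fs";
import * as path from "path";

// Assumed location; Alma's real settings path may differ.
const STATE_FILE = path.join(process.cwd(), "heartbeat-state.json");

function loadFromStorage(): number {
  try {
    const raw = fs.readFileSync(STATE_FILE, "utf8");
    return JSON.parse(raw).lastHeartbeatTime ?? 0;
  } catch {
    return 0; // First run, or file missing/corrupt
  }
}

function saveToStorage(lastHeartbeatTime: number): void {
  fs.writeFileSync(STATE_FILE, JSON.stringify({ lastHeartbeatTime }));
}
```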
```ts
start() {
  this.lastHeartbeatTime = this.loadFromStorage() || 0;
  // ...
}

tick() {
  // ... execute heartbeat logic
  this.lastHeartbeatTime = Date.now();
  this.saveToStorage(this.lastHeartbeatTime);
}
```

Catch up on missed heartbeats at startup

```ts
start() {
  const last = this.loadFromStorage() || 0;
  const elapsed = Date.now() - last;
  const intervalMs = 60 * this.config.intervalMinutes * 1000;

  if (elapsed >= intervalMs) {
    this.tick(); // Catch up immediately
  } else {
    // Wait only for the remaining time, not a full interval
    setTimeout(() => {
      this.tick();
      this.timer = setInterval(() => this.tick(), intervalMs);
    }, intervalMs - elapsed);
    return;
  }
  this.timer = setInterval(() => this.tick(), intervalMs);
}
```

Stop terminating running agents indiscriminately

Currently, all agents are terminated immediately when a heartbeat is triggered. It is recommended to change this as follows (a sketch follows the list):
1. Check whether any agents are currently executing tasks.
2. If so, wait for the current task to complete (or enforce a timeout, such as 30 seconds).
3. Execute the heartbeat logic only after the task completes or the timeout expires.
4. If waiting is not possible, at least have the agent write a checkpoint so it can resume from the point of interruption after recovery.
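A hedged sketch of that wait-with-timeout flow; `AgentManager`, `runningTasks`, `checkpointAll`, and the 30-second limit are illustrative assumptions, not Alma's actual interfaces:

```ts
// Illustrative interface; Alma's real agent API is unknown.
interface AgentManager {
  runningTasks(): Promise<void>[]; // Promises for in-flight agent tasks
  checkpointAll(): Promise<void>;  // Ask agents to persist their progress
}

const DRAIN_TIMEOUT_MS = 30_000; // Suggested 30-second upper bound
const TIMED_OUT = Symbol("timed-out");

async function tickGracefully(
  agents: AgentManager,
  runHeartbeat: () => void // The existing heartbeat logic
): Promise<void> {
  const running = agents.runningTasks();

  if (running.length > 0) {
    // Wait for in-flight tasks to settle, but never longer than the timeout.
    const result = await Promise.race([
      Promise.allSettled(running),
      new Promise<typeof TIMED_OUT>((resolve) =>
        setTimeout(() => resolve(TIMED_OUT), DRAIN_TIMEOUT_MS)
      ),
    ]);

    if (result === TIMED_OUT) {
      // Tasks are about to be interrupted; checkpoint so they can resume.
      await agents.checkpointAll();
    }
  }

  runHeartbeat(); // Heartbeat runs only after the drain (or timeout)
}
```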
This change is quite extensive and can be addressed in a future optimization.
- Alma version: v0.0.738
- Operating System: Windows 11
- Use Case: running long-running tasks with multiple agents via the Discord bridge
- Reproducibility: abnormal heartbeat behavior occurs on every Alma restart; 100% reproducible