Heartbeat Cold Start + Persistence + Missed Beat Recovery Mechanism

Problem Description

The current `heartbeatService` implementation has three interrelated defects. Together they prevent the heartbeat from triggering as expected in normal use, especially during development, when Alma is restarted frequently. When the heartbeat does trigger, it indiscriminately terminates all running agent tasks.

Root Cause Analysis

Based on API behavior and the return value of `/api/heartbeat/status`, it is inferred that `start()` only schedules ticks with `setInterval`, without running one immediately, and that `lastHeartbeatTime` is not persisted. The following pseudocode describes the inferred current behavior:

    // Inferred current behavior (pseudocode)
    class HeartbeatService {
      lastHeartbeatTime = 0; // In-memory variable, reset to zero on restart

      start() {
        this.timer = setInterval( // No initial tick
          () => this.tick(),
          60 * this.config.intervalMinutes * 1000
        );
      }
    }

Issue 1: No Initial Tick on Cold Start

`setInterval` does not run the callback when it is called; the first callback fires only after one full interval has elapsed. With a 60-minute interval, Alma must therefore run for a full 60 minutes after startup before the first heartbeat executes.

If the user restarts Alma within those 60 minutes (which is common during development), the timer resets, and another 60 minutes must pass. As a result, the heartbeat may never be triggered.
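The starvation effect can be illustrated with a small simulation. This is a sketch, not Alma's code: `countHeartbeats` and its parameters are hypothetical names, and the model assumes the inferred behavior described above, where each start schedules ticks at whole intervals after that start and a restart discards the pending timer.

```javascript
// Hypothetical model of the inferred behavior: a timer started at `start`
// fires at start + interval, start + 2 * interval, ... until the session ends.
// All times are in minutes.
function countHeartbeats(startTimesMin, endMin, intervalMin) {
  let fired = 0;
  for (let i = 0; i < startTimesMin.length; i++) {
    const start = startTimesMin[i];
    // A session ends at the next restart, or at the end of the observed window.
    const stop = i + 1 < startTimesMin.length ? startTimesMin[i + 1] : endMin;
    // Count the whole intervals that fit into this session.
    fired += Math.floor((stop - start) / intervalMin);
  }
  return fired;
}

// One uninterrupted 8-hour session with a 60-minute interval.
const uninterrupted = countHeartbeats([0], 480, 60);

// The same 8 hours, but with a restart every 45 minutes.
const restartTimes = [];
for (let t = 0; t < 480; t += 45) restartTimes.push(t);
const withRestarts = countHeartbeats(restartTimes, 480, 60);

console.log({ uninterrupted, withRestarts });
```

Under this model the uninterrupted session fires 8 heartbeats, while the session with 45-minute restarts fires none.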

Issue 2: `lastHeartbeatTime` is in-memory only and resets on restart

`lastHeartbeatTime` is not persisted to SQLite or the settings JSON. Every time Alma restarts, this value resets to 0. This means:

It is impossible to determine "how long it has been since the last heartbeat" after a restart
It is impossible to decide whether an immediate catch-up heartbeat is needed
Heartbeat history cannot be traced in the logs

Issue 3: No catch-up logic

The current implementation has no logic to make up for heartbeats missed while Alma was not running. For example, if Alma is shut down for 3 hours and then started again with a 60-minute interval, it has missed 3 heartbeats, yet after restarting it simply waits silently for the next full interval.
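The arithmetic behind "missed heartbeats" can be sketched as follows. `missedHeartbeats` is a hypothetical helper, not an existing Alma function, and it assumes a persisted timestamp in milliseconds is available, which is not the case today (see Issue 2).

```javascript
// Hypothetical helper: number of whole intervals that have elapsed since the
// last recorded heartbeat. Timestamps are in milliseconds since the epoch.
function missedHeartbeats(lastHeartbeatTime, now, intervalMinutes) {
  const intervalMs = intervalMinutes * 60 * 1000;
  // Guard against a timestamp in the future (for example after a clock change).
  const elapsed = Math.max(0, now - lastHeartbeatTime);
  return Math.floor(elapsed / intervalMs);
}

const HOUR_MS = 60 * 60 * 1000;
const now = Date.UTC(2024, 0, 1, 12, 0, 0);

// Alma was off for 3 hours with a 60-minute interval.
const missed = missedHeartbeats(now - 3 * HOUR_MS, now, 60);
console.log(missed);
```

For the 3-hour example this yields 3. A stored value of 0 (no record) produces a very large count, so a caller would normally treat any non-zero result as "one catch-up tick is due" rather than replaying every missed beat.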

Derived Issue: Indiscriminate termination of agents when a heartbeat is triggered

This issue is particularly severe when combined with the three issues above. In testing, a triggered heartbeat indiscriminately terminated the agent tasks that were executing at that moment (multiple concurrent agents were interrupted simultaneously), leaving the tasks incomplete. Because the timing of heartbeat triggers is unpredictable (due to Issues 1 to 3), users cannot reasonably avoid this interruption.

Due to Issue 2 (lack of persistence), after a restart the agent cannot detect that it was previously interrupted and cannot resume from the point of interruption.

Recommended Fix

Fix 1: Execute the first tick immediately upon startup (one line of code)

    start() {
      this.tick(); // ← New: Execute immediately once
      this.timer = setInterval(
        () => this.tick(),
        60 * this.config.intervalMinutes * 1000
      );
    }

Fix 2: Persist `lastHeartbeatTime` (approx. 5 lines)

Write `lastHeartbeatTime` to the `app_settings` table or a separate JSON file. Update the persisted value after each `tick()` execution, and read it from persistent storage at startup.

    start() {
      this.lastHeartbeatTime = this.loadFromStorage() || 0;
      // ...
    }

    tick() {
      // ... Execute heartbeat logic
      this.lastHeartbeatTime = Date.now();
      this.saveToStorage(this.lastHeartbeatTime);
    }
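For the separate JSON file option, the two storage helpers might look like the sketch below. The `HeartbeatStore` class, the file name, and the JSON shape are all assumptions made for illustration; Alma's actual storage layer is not known here. Only Node.js built-in modules are used.

```javascript
const fs = require("node:fs");
const os = require("node:os");
const path = require("node:path");

// Hypothetical JSON-file store for the last heartbeat timestamp.
class HeartbeatStore {
  constructor(filePath) {
    this.filePath = filePath;
  }

  // Returns the stored timestamp, or 0 if the file is missing or unreadable.
  load() {
    try {
      const data = JSON.parse(fs.readFileSync(this.filePath, "utf8"));
      return Number.isFinite(data.lastHeartbeatTime) ? data.lastHeartbeatTime : 0;
    } catch {
      return 0;
    }
  }

  // Writes to a temporary file first and then renames it over the target,
  // so a crash in the middle of a write does not leave a truncated file.
  save(timestamp) {
    const tmp = `${this.filePath}.tmp`;
    fs.writeFileSync(tmp, JSON.stringify({ lastHeartbeatTime: timestamp }));
    fs.renameSync(tmp, this.filePath);
  }
}

// Usage: the value survives a simulated restart (a fresh store instance).
const dir = fs.mkdtempSync(path.join(os.tmpdir(), "heartbeat-"));
const file = path.join(dir, "heartbeat.json");
new HeartbeatStore(file).save(1700000000000);
const restored = new HeartbeatStore(file).load();
console.log(restored);
```

Falling back to 0 on any read error matches the `|| 0` default in the snippet above: a missing or corrupt file behaves like a first start.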

Fix 3: Check for missed heartbeats at startup (approx. 10 lines)

    start() {
      const last = this.loadFromStorage() || 0;
      const elapsed = Date.now() - last;
      const intervalMs = 60 * this.config.intervalMinutes * 1000;
      const startInterval = () => {
        this.timer = setInterval(() => this.tick(), intervalMs);
      };

      if (elapsed >= intervalMs) {
        this.tick(); // Catch up
        startInterval();
      } else {
        // Wait only for the remaining time, not the full interval.
        // The handle is kept so the pending first tick can be cancelled on shutdown.
        this.startupTimer = setTimeout(() => {
          this.tick();
          startInterval();
        }, intervalMs - elapsed);
      }
    }
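The scheduling decision can also be isolated into a pure function, which makes it easy to test without real timers. `firstTickDelay` is a hypothetical helper. It additionally treats a stored timestamp that lies in the future as unusable, an edge case where subtracting a negative elapsed time would otherwise produce a wait longer than one interval.

```javascript
// Hypothetical helper: milliseconds to wait before the first tick after
// startup. Returns 0 when a heartbeat is already due.
function firstTickDelay(lastHeartbeatTime, now, intervalMs) {
  const elapsed = now - lastHeartbeatTime;
  // A negative elapsed time means the stored timestamp is in the future
  // (for example after a system clock change); tick immediately in that case.
  if (elapsed < 0 || elapsed >= intervalMs) return 0;
  return intervalMs - elapsed;
}

const INTERVAL_MS = 60 * 60 * 1000;
const now = 1700000000000;

const noRecord = firstTickDelay(0, now, INTERVAL_MS);
const offForThreeHours = firstTickDelay(now - 3 * INTERVAL_MS, now, INTERVAL_MS);
const quickRestart = firstTickDelay(now - 20 * 60 * 1000, now, INTERVAL_MS);

console.log({ noRecord, offForThreeHours, quickRestart });
```

The first two cases return 0 (tick immediately); the restart after 20 minutes returns 2400000 ms, the remaining 40 minutes of the interval.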

Fix 4 (Recommended but not required): Graceful shutdown upon heartbeat trigger

Currently, all agents are terminated immediately when a heartbeat is triggered. It is recommended to change this to:

Check if any agents are currently executing tasks
If so, wait for the current task to complete (or set a timeout limit, such as 30 seconds)
Execute the heartbeat logic only after the task completes or the timeout expires
If waiting is not possible, at least have the agent write a checkpoint so that it can resume from the point of interruption after recovery

This change is quite extensive and can be addressed in a future optimization.
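As a rough illustration of the wait-with-timeout idea, the sketch below races an "agents are idle" promise against a timer. The names `waitForIdle`, `gracefulTick`, `agents.whenIdle()`, and `agents.checkpointAll()` are hypothetical interfaces invented for this example; they are not known Alma APIs.

```javascript
// Hypothetical helper: resolves to "idle" once `whenIdle` resolves, or to
// "timeout" after `timeoutMs`, whichever happens first.
function waitForIdle(whenIdle, timeoutMs) {
  let timer;
  const timeout = new Promise((resolve) => {
    timer = setTimeout(() => resolve("timeout"), timeoutMs);
  });
  return Promise.race([whenIdle.then(() => "idle"), timeout])
    .finally(() => clearTimeout(timer)); // do not leave the timer pending
}

// Hypothetical tick wrapper: wait for running agents, checkpoint them if the
// wait times out, then run the heartbeat logic.
async function gracefulTick(agents, runHeartbeat, timeoutMs = 30000) {
  const outcome = await waitForIdle(agents.whenIdle(), timeoutMs);
  if (outcome === "timeout") {
    await agents.checkpointAll(); // save progress before interrupting
  }
  await runHeartbeat();
  return outcome;
}

// Usage with a stand-in agent pool that becomes idle after 10 ms.
const fakeAgents = {
  checkpoints: 0,
  whenIdle: () => new Promise((resolve) => setTimeout(resolve, 10)),
  checkpointAll: async function () { this.checkpoints += 1; },
};
gracefulTick(fakeAgents, async () => {}, 1000).then((outcome) => console.log(outcome));
```

The 30-second default mirrors the timeout suggested in the list above. Whether a checkpoint is enough to resume a task depends on how agents store their state, which this sketch does not address.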

Priority Recommendations

| Fix Item | Scope of Changes | Priority | Reason |
| --- | --- | --- | --- |
| Fix 1 (Initial Tick) | 1 line | P0 | If not fixed, the heartbeat will almost never trigger in scenarios with frequent restarts |
| Fix 2 (Persistence) | ~5 lines | P1 | Only meaningful when used in conjunction with Fix 3 |
| Fix 3 (Missed Beat Recovery) | ~10 lines | P1 | Resolves the gap upon restart after a prolonged shutdown |
| Fix 4 (Graceful Handling) | ~30 lines | P2 | Product experience issue; the current workaround is to avoid the heartbeat window |

Environment Information

Alma version: v0.0.738
Operating System: Windows 11
Use Case: Running long-running tasks with multiple agents via the Discord bridge
Reproducibility: Abnormal heartbeat behavior occurs every time Alma is restarted; 100% reproducible
