Celery Worker Observed Events

Observed on 2026-03-12 in the local plywatch lab stack.

Scenarios Forced

  • restart of the worker service

  • scale up from 1 worker to 2 workers

  • scale down back to 1 worker

  • explicit stop of the extra worker container

Confirmed Worker Event Types

  • worker-online

  • worker-heartbeat

  • worker-offline

Important Findings

  • A normal worker restart produced the expected sequence:

    • worker-heartbeat

    • worker-offline

    • worker-online

    • worker-heartbeat

  • Scaling from 1 to 2 workers produced:

    • worker-online for the new hostname

    • interleaved worker-heartbeat events for both hostnames

  • Scaling down or explicitly stopping the extra worker did not reliably leave a visible worker-offline event in the retained raw buffer.

  • Because of that, WorkerSnapshot must not rely only on worker-offline.

  • We need a derived worker state based on heartbeat timeout:

    • online

    • stale

    • offline

Model Implications

  • Use hostname as the stable worker id.

  • Persist last_seen_at from every heartbeat.

  • Persist first online_at when worker-online appears.

  • Track runtime stats from heartbeat payload:

    • pid

    • freq

    • active

    • processed

    • loadavg

    • sw_ident

    • sw_ver

    • sw_sys

  • Mark a worker stale when heartbeat timeout is exceeded.

  • Treat worker-offline as a strong signal when present, but not as a guaranteed signal.

Representative Samples

worker-offline

{
  "capturedAt": "2026-03-12T16:15:08.271263+00:00",
  "eventType": "worker-offline",
  "payload": {
    "hostname": "celery@1b584e48cbea",
    "utcoffset": 0,
    "pid": 1,
    "clock": 8008,
    "freq": 2.0,
    "active": 0,
    "processed": 2,
    "loadavg": [1.14, 0.7, 0.71],
    "sw_ident": "py-celery",
    "sw_ver": "5.6.2",
    "sw_sys": "Linux",
    "timestamp": 1773332108.2702785,
    "type": "worker-offline",
    "local_received": 1773332108.27125
  },
  "hostname": "celery@1b584e48cbea"
}

worker-online

{
  "capturedAt": "2026-03-12T16:15:10.667447+00:00",
  "eventType": "worker-online",
  "payload": {
    "hostname": "celery@1b584e48cbea",
    "utcoffset": 0,
    "pid": 1,
    "clock": 1,
    "freq": 2.0,
    "active": 0,
    "processed": 0,
    "loadavg": [1.14, 0.7, 0.71],
    "sw_ident": "py-celery",
    "sw_ver": "5.6.2",
    "sw_sys": "Linux",
    "timestamp": 1773332110.665815,
    "type": "worker-online",
    "local_received": 1773332110.6674337
  },
  "hostname": "celery@1b584e48cbea"
}

worker-online for second replica

{
  "capturedAt": "2026-03-12T16:15:31.866365+00:00",
  "eventType": "worker-online",
  "payload": {
    "hostname": "celery@dd95b6468af4",
    "utcoffset": 0,
    "pid": 1,
    "clock": 55,
    "freq": 2.0,
    "active": 0,
    "processed": 0,
    "loadavg": [0.84, 0.66, 0.69],
    "sw_ident": "py-celery",
    "sw_ver": "5.6.2",
    "sw_sys": "Linux",
    "timestamp": 1773332131.8650248,
    "type": "worker-online",
    "local_received": 1773332131.8663585
  },
  "hostname": "celery@dd95b6468af4"
}

Parallel worker-heartbeat from two workers

[
  {
    "capturedAt": "2026-03-12T16:15:35.737366+00:00",
    "eventType": "worker-heartbeat",
    "payload": {
      "hostname": "celery@1b584e48cbea",
      "pid": 1,
      "clock": 69,
      "freq": 5,
      "sw_ident": "py-celery",
      "sw_ver": "5.6.2",
      "sw_sys": "Linux",
      "type": "worker-heartbeat"
    },
    "hostname": "celery@1b584e48cbea"
  },
  {
    "capturedAt": "2026-03-12T16:15:35.737213+00:00",
    "eventType": "worker-heartbeat",
    "payload": {
      "hostname": "celery@dd95b6468af4",
      "pid": 1,
      "clock": 69,
      "freq": 5,
      "sw_ident": "py-celery",
      "sw_ver": "5.6.2",
      "sw_sys": "Linux",
      "type": "worker-heartbeat"
    },
    "hostname": "celery@dd95b6468af4"
  }
]