HTL-08 – FAILURE MODE & RECOVERY MATRIX (Outline v0.1)



1. Purpose


1.1 Document Objective

HTL-08 aims to:

  • Identify every possible failure mode
  • Define detection methods
  • Define systemic impact
  • Lock down containment boundaries
  • Define automatic and manual recovery
  • Establish escalation rules

This document ensures the system has no undefined failure behavior.


1.2 Authority

HTL-08 is binding on:

  • Firmware team
  • Backend team
  • Electrical team
  • QA team
  • Field technicians

Every failure scenario must have a clearly assigned owner.


1.3 Usage

HTL-08 is used for:

  • Design validation
  • QA stress testing
  • Field troubleshooting checklists
  • Risk assessment reviews
  • Incident post-mortem analysis

This document serves as the reference during real incidents.


2. Scope


2.1 In-Scope

Failure domains analyzed:

✔ Layer Node

  • Firmware crash
  • Sensor anomaly
  • Actuator failure
  • Relay-aware topology issue

✔ Relay Chain

  • Parent-child disruption
  • Routing inconsistency

✔ Gateway

  • ESP reboot
  • MQTT failure
  • Buffer overflow

✔ Server (Pi)

  • Broker crash
  • DB corruption
  • Disk full

✔ Electrical

  • Brownout
  • Surge
  • Overload
  • Grounding issue

✔ Network

  • LAN failure
  • Packet loss
  • IP conflict

✔ Security

  • Unauthorized publish
  • Replay attack
  • Credential misuse

✔ OTA

  • Interrupted upgrade
  • Hash mismatch

2.2 Out-of-Scope

Not covered:

  • Natural disasters that destroy the entire site
  • Theft of the entire panel and infrastructure

Partial theft (a missing Node), however, remains in-scope.


3. Definitions


3.1 Detection

The method or mechanism used to identify a failure.


3.2 Impact

The consequence for the system or its actuators.


3.3 Recovery

The recovery steps that return the system to an operational state.


3.4 Containment

The domain boundary that keeps a failure from spreading to other layers.


3.5 Escalation

The level of responsibility invoked when recovery does not succeed.


3.6 Safe Mode

An operating mode with minimal functionality and actuators held in a safe state.


3.7 Partial Degradation

Part of the system functions while the rest does not.


3.8 Total Degradation

The site remains electrically powered, but all digital control is unavailable.


4. Assumptions


4.1 Site Is Semi-Autonomous

  • Nodes can still run local control
  • The Gateway or Server can go offline without halting basic control

4.2 Internet Is Not Critical

  • No cloud dependency
  • The LAN is the only network domain

4.3 Nodes May Restart at Any Time

  • Brownout
  • Watchdog reset
  • Firmware crash

The system must tolerate restarts.


4.4 Power Is Unstable

  • Brownouts may occur
  • Surges may occur
  • EMI may occur

4.5 Relay-Aware Topology Is Limited

  • Max 15 nodes
  • Limited hop count
  • No complex dynamic mesh

5. System Failure Domain Overview

The goal of this section is to define failure containment boundaries so that:

  • A Node failure does not take down the entire site
  • A Gateway failure does not stop local control
  • A Server failure does not shut off actuators
  • An electrical failure does not damage other equipment

5.1 Failure Domain Diagram



✔ Domain 1 – Electrical Layer

Components:

  • AC input
  • SMPS
  • Fuse
  • Relay/contact
  • Panel wiring

Failures in this domain can:

  • Power down a Node
  • Trip a motor
  • Cause a brownout

Containment:

  • Fuse segmentation
  • Physical interlock
  • Thermal overload

✔ Domain 2 – Node Layer

Components:

  • ESP32 firmware
  • Sensor
  • Actuator driver
  • Local control engine

A Node failure:

  • Must not take down other nodes
  • Must not drive an actuator into an unsafe state

Containment:

  • Safe state default OFF
  • Physical interlock
  • Watchdog reset
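As a minimal sketch of these containment rules (all names and values are illustrative assumptions, not project code), the boot path can force the safe state before anything else runs:

```python
# Sketch of Node boot containment: actuators default OFF, and a failed
# firmware CRC forces safe mode (N-05). Names here are assumptions.

SAFE_STATE = {"relay": "OFF", "pump": "OFF"}

def boot(firmware_crc_ok: bool, actuators: dict) -> str:
    """Return the mode the node boots into, forcing actuators safe first."""
    # Containment rule: outputs go to the safe state before anything else.
    actuators.update(SAFE_STATE)
    # A failed firmware CRC must never reach normal operation.
    return "NORMAL" if firmware_crc_ok else "SAFE_MODE"
```

The key property is ordering: the safe state is applied unconditionally, so even a corrupt image cannot leave an actuator ON.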

✔ Domain 3 – Relay Chain Layer

Components:

  • Parent-child routing
  • Hop logic
  • Sequence validation

Failure:

  • Parent down
  • Routing loop

Containment:

  • Hop limit enforcement
  • Re-route logic
  • Local autonomy
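Hop limit enforcement of the kind described above might look like this sketch (the limit of 4 hops and all names are assumptions for illustration):

```python
# Sketch of hop-limit enforcement in the relay chain (Domain 3).
# MAX_HOPS is an assumed constant; the real limit is a design decision.

MAX_HOPS = 4

def should_forward(packet: dict) -> bool:
    """Drop packets whose hop count reached the limit, breaking routing loops."""
    if packet["hops"] >= MAX_HOPS:
        return False          # containment: the loop stops here
    packet["hops"] += 1       # count this traversal before forwarding
    return True
```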

✔ Domain 4 – Gateway Layer

Components:

  • ESP-NOW coordinator
  • MQTT bridge
  • Buffer

Failure:

  • Gateway reboot
  • MQTT disconnect

Containment:

  • Store-and-forward
  • Node autonomy
  • Automatic reconnect
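The store-and-forward containment can be sketched as a bounded queue that drops the oldest entries on overflow (capacity and names are assumptions; the real gateway runs on an ESP, so this Python is illustrative only):

```python
from collections import deque

# Sketch of the Gateway store-and-forward buffer with a drop-oldest
# policy (see G-04). Capacity is an assumed value.

class StoreAndForward:
    def __init__(self, capacity: int = 3):
        # deque with maxlen silently discards the oldest item on overflow
        self.queue: deque = deque(maxlen=capacity)

    def store(self, msg: str) -> None:
        self.queue.append(msg)

    def flush(self) -> list:
        """Drain buffered telemetry once the broker is reachable again."""
        out = list(self.queue)
        self.queue.clear()
        return out
```

Drop-oldest favors fresh telemetry over stale backlog, which matches the matrix's choice for buffer overflow.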

✔ Domain 5 – Server Layer (Pi)

Components:

  • MQTT broker
  • DB
  • Command manager
  • HMI

Failure:

  • Broker crash
  • Disk full

Containment:

  • Node autonomous mode
  • Gateway buffering
  • Restart service

✔ Domain 6 – User Layer

Components:

  • Operator
  • Engineer
  • Admin

Failure:

  • Wrong command
  • Unauthorized access

Containment:

  • RBAC
  • TTL
  • Audit log

6. Failure Classification

All failures are classified to simplify QA and risk analysis.


6.1 Hardware Failure

Examples:

  • Relay welding
  • Damaged sensor
  • SMPS failure
  • ESP flash corruption
  • Damaged SD card

Characteristics:

  • Physical
  • Requires manual intervention

6.2 Network Failure

Examples:

  • LAN down
  • Unstable WiFi
  • MQTT disconnect
  • Packet loss bursts

Characteristics:

  • Usually transient
  • Must auto-recover

6.3 Software Failure

Examples:

  • Firmware crash
  • Service deadlock
  • Memory leak
  • Routing logic error

Characteristics:

  • Watchdog recovery
  • Service restart

6.4 Data Failure

Examples:

  • DB corruption
  • Flash corruption
  • Duplicate commands
  • Sequence mismatch

Characteristics:

  • Requires integrity detection
  • Recovery via restore or re-sync

6.5 Security Failure

Examples:

  • Unauthorized MQTT publish
  • Replay attack
  • Credential misuse
  • Firmware tampering

Characteristics:

  • Must be rejected and logged
  • Must never trigger an actuator

6.6 Environmental Failure

Examples:

  • Panel overheating
  • High humidity
  • EMI bursts
  • Voltage fluctuation

Characteristics:

  • Must degrade safely
  • Electrical protection is the primary defense

7. Failure Matrix


7.1 Node-Level Failures

| ID   | Failure Scenario          | Detection                      | Impact                 | Containment        | Recovery            | Owner      |
|------|---------------------------|--------------------------------|------------------------|--------------------|---------------------|------------|
| N-01 | Node reboots unexpectedly | Restart counter increment      | Telemetry gap          | Default relay OFF  | Auto reconnect      | Firmware   |
| N-02 | Brownout reset            | Brownout log flag              | Actuator OFF           | Brownout threshold | Stabilize power     | Electrical |
| N-03 | Sensor drift              | Persistent out-of-range values | Wrong control decision | Plausibility check | Recalibrate/replace | Field      |
| N-04 | Actuator stuck ON         | Output OFF but current present | Unsafe state           | Hardware interlock | Replace relay       | Electrical |
| N-05 | Flash corruption          | CRC fail at boot               | Node safe mode         | Safe mode entry    | Reflash firmware    | Engineer   |
| N-06 | OTA interrupted           | Incomplete image detected      | Node fails to boot     | Dual partition     | Retry OTA           | Engineer   |
| N-07 | Parent node down          | Missed heartbeat               | Child unreachable      | Re-route attempt   | Re-pair routing     | Firmware   |
| N-08 | Node message flooding     | Message rate anomaly           | Gateway overload       | Rate limit         | Isolate node        | Gateway    |
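The N-08 containment (rate-limit, then isolate a flooding node) can be sketched as a per-node sliding-window counter; the limit and window values are illustrative assumptions:

```python
# Sketch of per-node message-rate detection for N-08. A node that
# exceeds `limit` messages within `window_s` seconds is flagged for
# isolation. Limit and window are assumed values.

class RateMonitor:
    def __init__(self, limit: int, window_s: float):
        self.limit, self.window_s = limit, window_s
        self.events: dict = {}   # node_id -> timestamps within the window

    def allow(self, node_id: str, now: float) -> bool:
        # Keep only timestamps still inside the sliding window.
        ts = [t for t in self.events.get(node_id, []) if now - t < self.window_s]
        ts.append(now)
        self.events[node_id] = ts
        return len(ts) <= self.limit   # over the limit -> isolate the node
```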

7.2 Relay Chain Failures

| ID   | Failure Scenario      | Detection               | Impact          | Containment           | Recovery         | Owner    |
|------|-----------------------|-------------------------|-----------------|-----------------------|------------------|----------|
| R-01 | Parent offline        | No route ACK            | Child isolated  | Hop limit             | Local autonomy   | Firmware |
| R-02 | Child unreachable     | Missing telemetry       | Data gap        | Routing table aging   | Re-register      | Firmware |
| R-03 | Hop loop detected     | Duplicate seq spike     | Congestion      | Hop limit enforcement | Reset routing    | Gateway  |
| R-04 | Routing inconsistency | Conflicting parent      | Unstable path   | Gateway validation    | Rebuild topology | Firmware |
| R-05 | Duplicate storm       | Sliding window overflow | Buffer pressure | Dedup engine          | Drop excess      | Gateway  |
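The dedup engine referenced in R-05 can be sketched as a sliding window of recently seen (node, sequence) pairs; the window size and all names are assumptions:

```python
from collections import OrderedDict

# Sketch of the R-05 dedup engine: duplicates inside the window are
# dropped, and the window ages out its oldest entry when full.

class DedupWindow:
    def __init__(self, size: int = 64):
        self.size = size
        self.seen: OrderedDict = OrderedDict()

    def accept(self, node_id: str, seq: int) -> bool:
        key = (node_id, seq)
        if key in self.seen:
            return False                   # duplicate: drop the excess copy
        self.seen[key] = True
        if len(self.seen) > self.size:
            self.seen.popitem(last=False)  # age out the oldest entry
        return True
```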

7.3 Gateway Failures

| ID   | Failure Scenario   | Detection                 | Impact         | Containment        | Recovery               | Owner    |
|------|--------------------|---------------------------|----------------|--------------------|------------------------|----------|
| G-01 | Gateway reboot     | Restart log               | Telemetry gap  | Node autonomy      | Auto reconnect         | Firmware |
| G-02 | MQTT disconnect    | Broker state change       | Data buffering | Store-and-forward  | Reconnect with backoff | Gateway  |
| G-03 | WiFi down          | No LAN link               | Telemetry halt | Local buffering    | Reconnect              | Network  |
| G-04 | Buffer overflow    | Queue depth exceeded      | Data loss risk | Drop-oldest policy | Increase buffer        | Architect |
| G-05 | ESP-NOW congestion | High retry count          | Packet loss    | Rate limit         | Adjust interval        | Firmware |
| G-06 | Time sync failure  | Drift threshold exceeded  | Timestamp skew | Fallback time mode | Resync                 | Gateway  |

7.4 Server (Pi) Failures

| ID   | Failure Scenario | Detection        | Impact          | Containment        | Recovery         | Owner      |
|------|------------------|------------------|-----------------|--------------------|------------------|------------|
| S-01 | Broker crash     | Service down     | No commands     | Node autonomy      | Restart service  | Backend    |
| S-02 | DB corruption    | Write error      | Data loss risk  | Halt ingestion     | Restore backup   | Backend    |
| S-03 | Disk full        | >90% usage       | Write failures  | Alert threshold    | Cleanup/archive  | Operator   |
| S-04 | CPU overload     | Load > threshold | Slow dashboard  | No actuator impact | Optimize service | Backend    |
| S-05 | Service deadlock | No response      | Dashboard freeze | Service restart   | Debug            | Backend    |
| S-06 | Power loss       | Ping fail        | Server offline  | Node autonomy      | Boot restart     | Electrical |
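The S-03 detection rule can be sketched with the standard-library `shutil.disk_usage`; the 90% threshold comes from the matrix, while the function names are illustrative:

```python
import shutil

# Sketch of S-03 detection: compute disk usage and compare it against
# the 90% alert threshold from the failure matrix.

def disk_usage_pct(path: str = "/") -> float:
    total, used, _free = shutil.disk_usage(path)
    return 100.0 * used / total

def disk_alert(usage_pct: float, threshold: float = 90.0) -> bool:
    """True when the Operator should run cleanup/archive."""
    return usage_pct > threshold
```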

7.5 Electrical Failures

| ID   | Failure Scenario | Detection        | Impact             | Containment       | Recovery            | Owner      |
|------|------------------|------------------|--------------------|-------------------|---------------------|------------|
| E-01 | Short circuit    | Fuse blows       | Actuator offline   | Fuse isolation    | Replace fuse        | Electrical |
| E-02 | Pump overload    | Overload trip    | Pump OFF           | Thermal relay     | Reset relay         | Field      |
| E-03 | Relay welding    | Actuator stuck   | Unsafe state       | Interlock         | Replace relay       | Electrical |
| E-04 | Surge event      | SPD indicator    | Device damage risk | SPD isolation     | Replace SPD         | Electrical |
| E-05 | Ground failure   | Noise/resets     | Instability        | Ground correction | Fix wiring          | Electrical |
| E-06 | Panel overheat   | Temp > threshold | Hardware risk      | Thermal shutdown  | Improve ventilation | Electrical |

7.6 Network Failures

| ID    | Failure Scenario  | Detection   | Impact            | Containment      | Recovery          | Owner   |
|-------|-------------------|-------------|-------------------|------------------|-------------------|---------|
| NW-01 | LAN down          | No ping     | Dashboard offline | Node autonomy    | Fix switch/router | Network |
| NW-02 | IP conflict       | ARP anomaly | Device unstable   | Manual isolation | Assign static IP  | Network |
| NW-03 | Packet loss burst | Retry spike | Telemetry gap     | QoS 1            | Stabilize LAN     | Network |
| NW-04 | High latency      | RTT spike   | Slow commands     | TTL control      | Optimize network  | Network |

7.7 Security Failures

| ID     | Failure Scenario          | Detection           | Impact              | Containment         | Recovery          | Owner    |
|--------|---------------------------|---------------------|---------------------|---------------------|-------------------|----------|
| SEC-01 | Unauthorized MQTT publish | ACL violation       | Command risk        | Reject connection   | Rotate credentials | Admin   |
| SEC-02 | Replay attack             | Duplicate seq       | Duplicate command   | Sliding window      | Log anomaly       | Firmware |
| SEC-03 | Invalid firmware          | Hash mismatch       | OTA blocked         | Safe mode           | Reflash           | Engineer |
| SEC-04 | Credential leak           | Suspicious login    | Unauthorized access | Account lock        | Reset credentials | Admin    |
| SEC-05 | Excess login attempts     | Threshold exceeded  | Brute-force risk    | Lock account        | Investigate       | Admin    |
| SEC-06 | Device cloning attempt    | Duplicate device_id | Identity collision  | Reject registration | Revoke device     | Admin    |
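The SEC-05 containment (lock the account once the attempt threshold is exceeded) can be sketched as a per-user failure counter; the threshold of 5 is an assumed value, not one decided in this document:

```python
# Sketch of SEC-05 account lockout: too many failed logins lock the
# account until an Admin investigates. MAX_ATTEMPTS is assumed.

MAX_ATTEMPTS = 5

class LoginGuard:
    def __init__(self):
        self.failures: dict = {}   # user -> failed attempt count

    def record_failure(self, user: str) -> bool:
        """Record a failed attempt; return True if the account is now locked."""
        self.failures[user] = self.failures.get(user, 0) + 1
        return self.failures[user] >= MAX_ATTEMPTS

    def is_locked(self, user: str) -> bool:
        return self.failures.get(user, 0) >= MAX_ATTEMPTS
```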

8. Recovery Strategy Model

Recovery falls into three main categories:

  • Automatic Recovery
  • Manual Recovery
  • Escalation Rule

8.1 Automatic Recovery

Performed without human intervention.

✔ Mandatory Mechanisms

  • Watchdog reset (Node & Gateway)
  • MQTT reconnect with exponential backoff
  • Bounded ESP-NOW retries
  • Store-and-forward flush when the broker returns
  • Safe mode entry when the firmware is invalid
  • Service auto-restart via systemd (Pi)

Automatic recovery must never put an actuator in an unsafe state.
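The exponential-backoff reconnect listed above can be sketched as a delay schedule; the base delay, cap, and optional jitter are assumptions to be tuned for the real gateway:

```python
import random

# Sketch of MQTT reconnect timing with exponential backoff: the delay
# doubles per attempt up to a cap, with optional jitter to spread
# simultaneous retries. Base and cap values are assumed.

def backoff_delay(attempt: int, base: float = 1.0, cap: float = 60.0,
                  jitter: bool = False) -> float:
    """Delay in seconds before reconnect attempt `attempt` (0-based)."""
    delay = min(cap, base * (2 ** attempt))   # 1, 2, 4, 8, ... capped
    if jitter:
        delay = random.uniform(0, delay)      # spread simultaneous retries
    return delay
```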


8.2 Manual Recovery

Performed by an operator or engineer.

Examples:

  • Power cycle Node
  • Replace relay
  • Replace sensor
  • Restore DB backup
  • Reflash firmware
  • Re-provision device
  • Replace SMPS

All manual recovery actions that affect the digital system must be recorded in the log.


8.3 Escalation Rule

Level 1 – Operator

  • Reset device
  • Check alarm
  • Replace fuse

Level 2 – Engineer

  • Reflash firmware
  • Replace hardware
  • Restore backup
  • Investigate routing

Level 3 – Architect

  • Root cause systemic
  • Update firmware logic
  • Revise design constraint
  • Update HTL document

If a failure recurs beyond a set threshold, escalation is automatic.
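The automatic escalation rule can be sketched as a mapping from recurrence count to responsibility level; the per-level threshold of 3 is an assumed placeholder for the open threshold decision (see Open Issues):

```python
# Sketch of automatic escalation: a recurring failure climbs the
# ladder once it crosses a per-level recurrence threshold. The
# threshold value is assumed, not decided in this document.

LEVELS = ["Operator", "Engineer", "Architect"]
RECURRENCE_THRESHOLD = 3   # assumed: escalate after 3 repeats at a level

def escalation_level(recurrences: int) -> str:
    """Map a failure's recurrence count to the responsible level."""
    idx = min(recurrences // RECURRENCE_THRESHOLD, len(LEVELS) - 1)
    return LEVELS[idx]
```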


9. Degradation Modes

The system must have clearly defined operating modes.


9.1 Normal Mode

All components active:

Node ↔ Gateway ↔ Server ↔ HMI

Full telemetry & control.


9.2 Partial Mode (Gateway Down)

Conditions:

  • Gateway reboot
  • MQTT unreachable

Impact:

  • Telemetry does not reach the server
  • Commands from the HMI are delayed

Local control keeps running.


9.3 Autonomous Mode (Server Down)

Conditions:

  • Broker crash
  • Pi power loss

Impact:

  • No dashboard
  • No new commands

Nodes continue to run:

  • Threshold control
  • Schedule fallback

9.4 Emergency Safe Mode

Trigger:

  • Firmware corruption
  • Critical sensor invalid
  • Brownout oscillation
  • Security violation

Behavior:

  • Actuator default OFF
  • Telemetry minimal
  • No command execution

9.5 Degradation Mode Transition Diagram



Typical transitions:

Normal
  ↓ (Gateway fail)
Partial
  ↓ (Server fail)
Autonomous
  ↓ (Critical failure)
Emergency Safe

Recovery returns the system to Normal once conditions stabilize.
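The transitions above can be written as an explicit state table, which also guarantees there is no undefined failure behavior; the event names are assumptions mirroring the diagram labels:

```python
# Sketch of the 9.5 degradation modes as an explicit transition table.
# Unknown (mode, event) pairs leave the mode unchanged, so behavior is
# always defined. Event names are illustrative assumptions.

TRANSITIONS = {
    ("NORMAL", "gateway_fail"): "PARTIAL",
    ("PARTIAL", "server_fail"): "AUTONOMOUS",
    ("AUTONOMOUS", "critical_failure"): "EMERGENCY_SAFE",
    # Recovery paths: any stable recovery returns toward NORMAL.
    ("PARTIAL", "recovered"): "NORMAL",
    ("AUTONOMOUS", "recovered"): "NORMAL",
    ("EMERGENCY_SAFE", "recovered"): "NORMAL",
}

def next_mode(mode: str, event: str) -> str:
    """Look up the next mode; unknown events keep the current mode."""
    return TRANSITIONS.get((mode, event), mode)
```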


10. Open Issues

To be decided before the production freeze:

  1. Maximum tolerable downtime per layer?
  2. Is a UPS mandatory for the Pi?
  3. Is a redundant gateway required?
  4. Self-healing limit (how many retries before isolating a node)?
  5. Field spare policy (how many spare relays)?
  6. MTTR target per subsystem?
  7. Internal SLA per site?

Without target numbers, QA cannot define acceptance criteria.


11. Revision History

| Version | Date       | Author    | Description              |
|---------|------------|-----------|--------------------------|
| v0.1    | 2026-02-24 | Architect | Initial structured draft |

Editorial Note

This article was prepared as educational material and a general reference, drawing on various literature sources, field practice, and writing tools. Readers are advised to verify the content further and adapt it to the conditions and needs of their own systems.