PLC troubleshooting — a practical fault-find process
A practical PLC troubleshooting tutorial. Symptom-to-root-cause flow, online diagnostics, force values, and worked fault scenarios in a free PLC simulator.
PLC troubleshooting is the work that pays the apprentice's wage and tests the senior tech's patience. The line stops, somebody radios the control room, and the next forty minutes decide whether you find the actual fault or you keep replacing parts until the fault hides somewhere you cannot find it again. This tutorial walks the practical fault-find process — symptom to root cause — that experienced panel technicians use when a PLC-controlled machine drops, alarms, or quietly produces scrap. The order matters. The order is what separates a fifteen-minute fault-find from a four-hour parts-swap exercise that ends with a confused tech and a still-broken machine.
One opinion stated outright. Most fault-find time is spent in the field, not in the program. Anyone telling you otherwise has not actually fixed enough panels. The PLC program is the diagnostic instrument; the wiring, the sensors, the auxiliary power supplies, and the mechanical alignment are where the fault almost always lives. The program tells you which channel went bad. The torque wrench, the multimeter, and the alignment gauge tell you why.
The fault-find ladder — symptom to cause
Faults sort into five categories, and the order you check them in is the difference between thirty minutes and three hours.
Process fault. The recipe is wrong, the temperature setpoint was changed, the operator loaded the previous shift's batch parameters. The machine is doing exactly what the program told it to do; the program was told the wrong thing. Check this first because it costs nothing — read the active recipe, read the setpoint screen, read the alarm log for parameter-change events.
Control fault. The program is doing something the engineer did not intend. Bug in a recently-deployed change, edge case the original developer did not test, race condition that only triggers when two timers expire on the same scan (sketched in code just below this list). Check second; rare on mature programs, common on programs modified in the last week.
Electrical fault. A sensor wire chafed, a fuse blew, a 24V auxiliary rail dropped under load, a relay coil burnt out. The largest single category by frequency. Check third because it requires hands on the panel and a meter — slower than reading screens but faster than swapping modules.
Network fault. PROFINET timeout, EtherNet/IP connection drop, Modbus serial frame error. Check fourth; the diagnostic buffer or controller properties tab tells you immediately if a node has gone offline.
Hardware fault. A PLC module has actually failed. Last thing on the list. On industrial PLCs, modules fail an order of magnitude less often than wiring does. The temptation to swap a module first is strong because module-swap is satisfying — the new module goes in, the lights come on, the line runs. The fault returns three days later because the underlying short-circuit on a chafed cable is still there, killing the new module the same way it killed the old one.
Don't skip steps. The order works because each step rules out a category cheaply, leaving a smaller suspect list for the next step. Skipping electrical and going straight to module replacement is how panels accumulate a graveyard of half-good spare modules and unsolved intermittent faults.
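The timer race mentioned in the control-fault item is easier to see in code than in prose. A minimal sketch in SCL-style structured text — the step sequencer and every tag name are invented for illustration:

```
// Hypothetical step sequencer. If TimerA and TimerB are both already
// expired when Step = 1, the first IF sets Step to 2 and the second IF
// then sees Step = 2 in the SAME scan — the sequence jumps 1 -> 3 in
// one cycle and step 2's actions never run.
IF #Step = 1 AND #TimerA.Q THEN
    #Step := 2;
END_IF;

IF #Step = 2 AND #TimerB.Q THEN
    #Step := 3;
END_IF;
```

A CASE statement around #Step evaluates it once per scan, so only one transition can fire per cycle — which is why mature step logic is usually written that way.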
Step 1 — read the operator
The operator saw the fault happen. The operator knows what was running, what changed, what alarmed first. The alarm log will repeat that information back to you, but the operator's narrative includes context the log does not — the smell of hot insulation, the click of a contactor that should not have clicked, the moment a heater LED dimmed. Five minutes listening to the operator is worth thirty minutes scrolling through alarms.
Ask three questions. What was the machine doing when it stopped? What is different about today — new batch, new operator, new product, recent maintenance? What alarmed first? The third question is the most useful and the most often answered wrong. Operators remember the last alarm because the last alarm is on the screen. The first alarm is the one that triggered the cascade. The alarm log shows the chronological order; cross-check the operator's memory against the log timestamps.
Then read the alarm log itself. Both matter — operator narrative for context, log for sequence. On Siemens platforms the alarm log is in WinCC's alarm view or the diagnostic buffer; on Rockwell it is the FactoryTalk View alarm history; on smaller HMIs it is whatever screen the system integrator configured. Filter by timestamp, look at the first three alarms in chronological order, and write them down. Those three lines are your starting suspect list.
Step 2 — module status LEDs
Every chassis has them. Every module has them. Green for healthy, red for faulted, amber for warning or unconfigured, off for unpowered. A scan of the chassis takes ten seconds and rules out half the failure modes before you even open the engineering software.
What to look for. The CPU LED — solid green is healthy, blinking is not. The bus LED — solid green means the backplane communication is up. Per-module status LEDs — every digital input card and digital output card has a row of point-status LEDs that shows the current state of each channel. A digital input LED that is off when the field sensor is supposed to be on tells you instantly that the problem is upstream of the PLC — the sensor is dead, the wire is broken, or the auxiliary power supply for that input group has dropped. A blown fuse on the input common shows up here before it shows up in the program, because the program is reading the input image and the input image is reading whatever the input card sees, which is nothing once the fuse blows.
Take a phone photo of the chassis with all the LEDs visible. The photo becomes a reference point for the rest of the fault-find — and, if the fault is intermittent, the photo lets you compare healthy-state LEDs to faulted-state LEDs the next time it trips.
Step 3 — go online with the engineering software
Now the screens. TIA Portal calls it "Online Access" — connect via the PG/MPI cable or the Ethernet port, switch the project online, and the rungs and SCL blocks colour up to show the running program. Studio 5000 calls it "Go Online" — same idea, the laptop attaches to the running ControlLogix and you can monitor tag values live. Sysmac Studio's "Online Connection" does the equivalent on Omron NJ/NX. CODESYS, GX Works, Connected Components Workbench — every engineering tool has the same primitive. Connect to the running program, watch values change in real time.
Find the rung with the fault output. If the alarm log says "Conveyor 3 motor overload" and the operator confirms conveyor 3 is the affected machine, the program has a bit somewhere called Conveyor3_Overload or Cnv3_OL or M_Cnv3.Overload depending on the naming convention. Cross-reference the tag, jump to the rung that energises it, and watch which contacts in that rung are closed and which are open. The rung is the diagnostic — its inputs tell you which condition tripped the alarm.
A subtlety on the diagnostic buffer. On Siemens S7 platforms, the Diagnostic Buffer window in TIA Portal logs every controller-level event — module faults, network drops, hardware diagnostics, system warnings — with timestamps to the millisecond. It is the single most useful screen in TIA Portal during a fault-find. Open it. Filter by today's date. Read top-to-bottom. The first hardware-related entry is usually the actual root cause; everything below it is the cascade of consequential alarms.
Step 4 — trace the input chain
Once you have the rung, trace it. Which contact is wrong? The rung says the motor should run when start-permission is true and run-request is true and overload-reset has been pressed. One of those is FALSE when it should be TRUE. Find which one. The engineering software shows live values; the offending contact is the one whose colour says "open" when the operator says "this should be closed."
Then ask why the contact is open. Each contact maps to a tag; each tag has a source. The source is either a physical input (digital input card, channel address %I0.0 or Local:1:I.Data.0), an internal program bit, or a derived calculation. If it is a physical input, the next move is to check the physical input — see step 5. If it is a program bit, follow it up to wherever it gets set; another rung will be the source.
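As a concrete picture of that trace, here is the run rung from the paragraph above as a minimal SCL-style sketch — all tag names hypothetical. Online, the engineering software shows the live value of each operand; the comment on each one is the next move if that operand is the FALSE one:

```
"Cnv3_Run" := "Cnv3_StartPermission"   // internal bit — trace to the rung that sets it
          AND "Cnv3_RunRequest"        // HMI / sequence bit — follow it upstream in the program
          AND "Cnv3_OverloadResetOK";  // physical input (e.g. %I0.0) — check it at the card, step 5
```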
Force values — with discipline. Every modern PLC platform supports forcing — overriding an input or output value to confirm a hypothesis. Force the input ON for a moment to confirm that the rest of the rung behaves correctly when that input is true. The technique is invaluable; the discipline around it is non-negotiable.
The discipline. Every force you apply gets written down on a fault sheet — what was forced, to what value, at what time, why. Every force you apply gets removed before you leave the panel — there is no exception. Forces persist across power cycles on most platforms; a force left in place is a permanent override that the next person to work on the panel will not see and cannot understand. Audit the force list before disconnecting from the program. Read every active force aloud, justify each one, remove every one. Forcing is a diagnostic instrument, not a fix. Fixing the underlying condition is what removes the force; the force itself is just a question, not an answer.
A second piece of force discipline. Forcing an output is a higher-risk operation than forcing an input — an input force changes what the program thinks is happening, an output force changes what the field actually does. Forcing a valve open, a contactor energised, a hydraulic ram extended — those are physical movements with the laptop and the operator's hands as the only safety devices. Verify the safety state before forcing any output. Lock-out the affected zone if there is any doubt. The five seconds of LOTO discipline saves the apprentice's hand.
Step 5 — physical inspection
Most fault-find time is spent here. The field is where the actual problem usually lives. Anyone who has chased intermittent panel faults for a year confirms this; the panel-tech literature confirms this; the diagnostic-buffer cascade points here.
Wiring first. Every screwed terminal — input commons, output commons, sensor power supplies, PE bonds — gets a torque check. A loose ferrule causes intermittent dropouts that look exactly like a flaky module. Re-torque to the spec listed on the terminal block (typically 0.5 to 0.8 Nm for small screw terminals, 0.6 to 1.2 Nm for screw terminals in the 6 mm² range; spring-clamp and push-in terminals have no torque spec and get a firm pull-test instead). The 24V auxiliary supply, in particular — its terminal block feeds every input card on the chassis, and a single loose connection there knocks out half the inputs intermittently.
Then sensor wiring at the field end. Cable runs rust in the cable ladder, sit in pooled liquid, pass through panel modifications that broke the IP rating. Coastal sites — Cape Town, Durban, the Saldanha and Richards Bay industrial zones — see corroded ferrules at the field end faster than inland sites do. The tell is a sensor that works fine when the panel is dry and drops out the morning after a wet shift. Pull the field plug, inspect the pins, clean the corrosion, refit. The fault is fixed for the next eighteen months, until the next salt cycle reaches the same junction.
Sensor alignment and sensor power. A photoelectric sensor that has drifted three millimetres past its target reads as intermittent — works at noon when the line is at thermal equilibrium, drops out at six in the morning when the frame is cold. An inductive sensor that is supposed to detect a moving part at 5 mm and is now sitting at 9 mm reads as no-detect. Check the alignment with a feeler gauge. Check the sensor power LED — most modern sensors have a green power LED and a yellow target LED, and both should be on when the target is present.
Common gotchas worth memorising. PNP sensor wired into NPN input — the input card sees a high impedance instead of a current sink, the channel never registers, the program thinks the sensor never sees its target. Diagnose with a meter at the input terminal; a PNP sensor sources 24V on its signal pin, an NPN sensor sinks to common, and the input card expects whichever convention it was configured for. IP65 rating breached after a panel modification — somebody added a cable through the gland plate and did not seal the new entry, water gets in, condensation and tracking follow, and intermittent earth faults appear. Fix the gland; the panel goes back to working. Corroded ferrule on a 24V common — high resistance on the return path drops the apparent input voltage on every channel that shares that common, and half the inputs fail intermittently. Fix one ferrule, fix the whole rail.
Step 6 — module replacement as last resort
Only after every other step. Modules fail an order of magnitude less often than wiring does. Replacing a module before you have ruled out wiring is replacing the symptom and leaving the cause. The new module dies the same way the old one died, and now you are out two modules and still have the original fault.
The procedure when you do replace a module. Power down the chassis (not just the module — the bus voltage on a hot-swap chassis is enough to fault a CPU if the new module mis-seats). Pull the old module, label it with the date and the symptom, fit the new module, restore the configuration if the platform requires it (Rockwell ControlLogix re-configures from the controller automatically; Siemens S7-1500 re-configures on power-up if the firmware matches; older S7-300 sometimes needs a manual download of the hardware config). Power up. Test. Watch the LEDs. Run the affected machine through one or two normal cycles before signing off.
The graveyard of half-good spares. Every panel shop has one — a shelf of pulled modules, half of which are good, half of which are dead, all of which got pulled during a fault-find that did not properly identify the wiring fault first. Do not contribute to that shelf. If the module truly is dead, label it as dead and dispose of it. If it might be good, plug it into a test rig and confirm before putting it on the spares shelf. An untested spare is not a spare.
Reading vendor fault codes
Every platform has a fault-code system. Learn the one for the platform you work on; cross-reference the others when you take on a new client.
Siemens — the Diagnostic Buffer in TIA Portal is the canonical source for S7-300, S7-400, S7-1200, and S7-1500 controllers. Open it from the project tree under the CPU's "Online & diagnostics" node. Each entry has a timestamp, an event code, and a description. The Siemens support pages have searchable databases of every event code; start at support.industry.siemens.com and search by code or by symptom.
Allen-Bradley / Rockwell — the GSV (Get System Value) instruction reads controller fault status into program tags. The MAJORFAULTRECORD and MINORFAULTRECORD attributes hold the most recent major and minor faults respectively. The "Controller Properties" dialog in Studio 5000 shows the same information in a GUI; the GSV instruction is for when you want fault status visible on the HMI or logged automatically. Rockwell's fault-code database lives at rockwellautomation.com — search by fault code, controller family, and firmware revision.
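A minimal Structured Text sketch of that GSV pattern — the destination tag is a hypothetical DINT array; the record layout (timestamp, fault type, fault code) is documented in Rockwell's Logix fault-handling manuals:

```
// Read the most recent major fault record into a user tag so it can be
// displayed on the HMI or logged. major_fault is a hypothetical DINT[11]
// controller tag sized per the Logix documentation.
GSV(Program, THIS, MAJORFAULTRECORD, major_fault[0]);
```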
Mitsubishi — error codes live in the special data registers in the D8000 range on FX-series and in the equivalent SD registers on Q-series and iQ-R. The GX Works manual lists every code. Mitsubishi error codes are notoriously terse; the manual is the only practical reference.
Omron — Sysmac Studio's "Troubleshooting" view consolidates controller errors, fault history, and access to the manuals. Codes are documented in the NJ/NX-series troubleshooting manuals.
The general rule. Note the fault code, screenshot the diagnostic screen, search the vendor's support site. Vendors document their own codes well — there is no shortcut around reading the manual, but the search tools mean you rarely have to read more than a paragraph at a time.
A worked scenario — conveyor stops, no alarm
Consider the most awkward category of fault — the machine stops, but no alarm fires. The operator radios the control room; the control room sees no alarm; the engineer arrives, finds the conveyor still and the program apparently happy. The HMI shows green tiles where the operator says red. The diagnostic buffer is empty for the last four hours. Somewhere between the program's view of the world and the actual conveyor, reality has diverged from the program's model. This is the fault-find that tests the process.
Step 1, read the operator. "Conveyor 3 stopped about ten minutes ago. No alarm. The HMI says it's running. The motor is silent." Step 2, module status LEDs. CPU green, bus green. The output card driving conveyor 3's contactor coil shows the channel LED on — the program is commanding the contactor. The input card watching the motor's run-feedback contact shows the corresponding channel LED off — the contactor is not actually closed, despite being commanded.
So the program thinks the conveyor is running, the output card is energising the coil, but the contactor is not pulling in. Step 3, go online. The rung that drives Conveyor3_Run is solid green — start permission true, run request true, overload reset clean. The output is being commanded. The feedback rung is open at the run-feedback input — the program knows the feedback is missing, but the alarm logic was written with a 30-second delay and the operator radioed within 10 seconds.
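The alarm logic described there is an ordinary feedback-supervision pattern. A minimal SCL sketch, with hypothetical tag names and the 30-second preset from the scenario — it shows exactly why the log stayed empty at the ten-second mark:

```
// #FbTimer is a TON instance. The alarm fires only after the run command
// and the run feedback have disagreed for a full 30 s; a contactor that
// dropped out 10 s ago raises nothing yet.
#FbTimer(IN := "Cnv3_Run" AND NOT "Cnv3_RunFeedback",
         PT := T#30s);
"Cnv3_FeedbackAlarm" := #FbTimer.Q;
```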
Step 4, trace the input chain on the feedback side. The feedback input is %I2.4, on digital input card 2, channel 4. The card LED shows the channel as off. The contactor's auxiliary contact is supposed to close when the contactor pulls in. So either the contactor is not pulling in, or the auxiliary contact is dirty and not closing, or the wiring from the auxiliary contact to the input card is broken, or the input card's channel itself is dead.
Step 5, physical inspection. Multimeter to the contactor's coil terminals at the moment the program is commanding it on. The reading is 21.8V instead of the expected 24V. Check the 24V auxiliary supply feeding that distribution rail — output 21.8V, well below spec. Check the load on that rail — fourteen contactor coils, three solenoid valves, and the input cards' field side. The auxiliary supply unit is rated at 5A; the fourteen contactor coils alone draw close to 4A at inrush. The supply has been quietly degrading. Today, with the rail under full load, the voltage finally dropped to where the contactor's guaranteed pull-in window (IEC contactors must close reliably down to 85 percent of nominal, about 20.4V on a 24V coil) is only an inrush sag away — each time the coil is commanded, the rail dips, and sometimes the contactor pulls in, sometimes it does not. The contactor was on the edge for an hour and lost the battle at the moment the operator noticed.
Step 6 not needed. The fault is electrical — a degraded auxiliary power supply — invisible from the alarm log because the input card never registered an input fault (it sees no signal, but no signal is exactly what the wiring would provide if the conveyor were genuinely off; the input card cannot tell the difference between an open contact and a coil that did not close). The fix is replacing the 5A supply with a 10A unit and re-balancing the rail load. The diagnosis snapped into place at the meter reading on the contactor terminal — 21.8V, barely above the point where pull-in stops being guaranteed. Hardware, not software, and invisible from the alarm log.
This pattern — silent fault, no alarm, hardware on the edge — is among the most common categories of awkward fault on industrial panels. The fix is always physical. The diagnosis is always at the meter, not at the keyboard.
Practice fault-finding in the simulator
The reflex layer for fault-find is built by doing it. The first ten faults teach you the process; the next fifty teach you the patterns; somewhere around fault one hundred and fifty you start diagnosing things at the operator-narrative stage and confirming with the meter. The way to get there is repetitions.
Our simulator's wiring track ships with injectable faults — open conductors, shorted commons, dropped auxiliary rails, dirty contactor auxiliaries, drifted sensors. Every scenario is a recreated panel-shop fault, instrumented so the diagnostic buffer, the module LEDs, and the live program respond exactly as they would on a real panel. You work the fault on the laptop; the simulator tells you whether you reached the right cause by the right path. The cert pack at the end of the wiring track is a sequence of timed fault-finds with progressively less obvious symptoms. Run through it twice and the reflex layer settles in.
The Siemens fault-code reference for S7 platforms is at support.industry.siemens.com; the Rockwell fault-code database for ControlLogix and CompactLogix is at rockwellautomation.com. Bookmark both. Every working panel tech has both open in a browser tab on every fault-find.
The next tutorial covers SCADA fundamentals — the supervisory layer that sits above the PLC, where the alarm log lives and the operator's view of the world is built. Most fault-finds start at the SCADA screen and end at the panel; learning the SCADA side of the chain makes the fault-find process complete.
Try the simulator →