
Voicebots in the workshop: when speech beats screens

8 min · 2026-05-20

There’s a reason field engineers don’t want another tablet. Hands are oily, the noise floor is high, and the schematic is on the machine, not the screen. Voice is the right interface for the work. But the gap between “voice that works in a demo” and “voice that works in a workshop in November in Esbjerg” is the gap that decides whether technicians open the agent a second time. Here is what we learned shipping voice into workshops.

Why voice is not optional in this work

Field engineers don’t have hands free for a tablet. Workshop ambient noise — pneumatic tools, line motors, conversation overlap — is loud enough that consumer voice assistants fall back to silence or to guesses. The schematic is taped to the side of the machine; the manual is in a cabinet across the floor. The technician’s question lands while they are torquing a fitting or holding a probe to a sensor.

Voice is not a nice-to-have channel; it is the only channel where the engineer can keep working while the agent is answering. Voice and chat also answer from the same corpus and enforce the same ACLs — a technician on the phone gets the same answer a service manager gets in the portal. No separate knowledge silo.

Get this right and the agent earns the second use. Get it wrong — lag, mis-recognition, wrong answer delivered with confidence — and it becomes the tablet that nobody opens.

Latency is the gate

Three seconds is the threshold. A voice agent that answers in 1.5 seconds reads as responsive; one that answers in 4 seconds reads as broken, even if the answer is correct. This is not a UX preference; it is the difference between an interface the technician returns to and one they abandon after the first shift.

The latency budget: speech-to-text around 300 ms, retrieval from a well-tagged corpus another 200 ms, LLM inference the dominant cost at 1–2 seconds, text-to-speech a final 300 ms. That sums to roughly 1.8 seconds when inference runs at the fast end of its range and 2.8 seconds at the slow end. Get all four right and you land under two seconds. Let any one drift and you are past three.
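The arithmetic behind that budget is easy to check. The sketch below uses only the stage figures quoted above; the names (BUDGET_MS, total_range, headroom) are ours, chosen for illustration, and are not part of any product API.

```python
# Stage budgets in milliseconds, taken from the figures in the text.
# LLM inference is quoted as a range, so every stage is stored as (low, high).
BUDGET_MS = {
    "speech_to_text": (300, 300),
    "retrieval": (200, 200),
    "llm_inference": (1000, 2000),
    "text_to_speech": (300, 300),
}
THRESHOLD_MS = 3000  # the three-second gate


def total_range(budget):
    """Return (best case, worst case) round-trip time in milliseconds."""
    low = sum(lo for lo, _ in budget.values())
    high = sum(hi for _, hi in budget.values())
    return low, high


def headroom(budget, threshold=THRESHOLD_MS):
    """Milliseconds left under the threshold in the worst case."""
    _, high = total_range(budget)
    return threshold - high


low, high = total_range(BUDGET_MS)
print(f"best case {low} ms, worst case {high} ms, headroom {headroom(BUDGET_MS)} ms")
# best case 1800 ms, worst case 2800 ms, headroom 200 ms
```

With inference at the slow end, only 200 ms of slack remains, which is why a single stage drifting is enough to cross the threshold.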

The decisions that move the budget are corpus tagging, model selection, and grounding strategy — not microphone hardware or network bandwidth, which are table stakes. Get those right and the agent feels like a colleague. Get them wrong and it feels like a help-desk queue.

Multilingual code-switching is the real test

A Polish technician working a German press will start a sentence in Polish, drop in a German technical phrase mid-clause, “der Werkzeugträger ist locker” (“the tool carrier is loose”), and finish in Polish. Consumer voice assistants break here: they detect a language boundary at the sentence level, which is too late, and either mis-transcribe the term or switch the response language entirely.

The agent that handles this earns trust the first time. The mechanics are not magic: the speech-to-text layer has to detect language at the phoneme level, not the utterance level. The retrieval layer has to treat the technical term as a known noun anchored to the German parts catalogue, not as a transliteration problem. The response layer generates back in the technician’s working language — Polish, in this case — with a citation pointing at the authoritative-language source document.
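The retrieval and response steps can be sketched in a few lines. This is a minimal illustration, not the production pipeline: it assumes the speech-to-text layer already emits word-level language tags (a stand-in for the finer-grained detection described above), and the glossary, the part ID "DE-CAT-0001", and the Polish filler words in the sample utterance are all invented for the example.

```python
from dataclasses import dataclass
from typing import List, Tuple

# Hypothetical glossary: German technical terms mapped to catalogue part IDs.
GERMAN_CATALOGUE = {
    "werkzeugträger": "DE-CAT-0001",
}


@dataclass
class ResolvedQuery:
    working_language: str  # language the spoken answer is generated in
    source_language: str   # language of the authoritative source document
    part_ids: List[str]    # catalogue entries the utterance refers to


def resolve_terms(tokens: List[Tuple[str, str]], working_language: str = "pl") -> ResolvedQuery:
    """Anchor German-tagged words to the catalogue instead of transliterating them.

    tokens: (word, language_tag) pairs, assumed to come from the STT layer.
    """
    part_ids = []
    for word, lang in tokens:
        key = word.lower().strip(".,!?")
        if lang == "de" and key in GERMAN_CATALOGUE:
            part_ids.append(GERMAN_CATALOGUE[key])
    # Cite the German source if a catalogue term was found; answer language is unchanged.
    source = "de" if part_ids else working_language
    return ResolvedQuery(working_language, source, part_ids)


tokens = [
    ("Chyba", "pl"), ("der", "de"), ("Werkzeugträger", "de"),
    ("ist", "de"), ("locker", "de"), ("sprawdzam", "pl"),
]
result = resolve_terms(tokens)
print(result.part_ids, result.working_language, result.source_language)
```

The design point is that the embedded German term never changes the answer language: it only selects which catalogue entry and which source document the citation points at.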

The same logic works in any multilingual dealer network. It is also where you find out fast whether the corpus is tagged by language and product line, or whether it is a single undifferentiated pile of PDFs.

The failure mode that loses trust

Voice in the workshop fails most often not on recognition or latency, but on confidence. The scenario runs like this: the agent has recognised the machine model correctly, retrieved a plausible service bulletin, and generated an answer that sounds authoritative. Except the bulletin was issued for the previous model year. The technician torques to the wrong spec. The agent has lost trust — not by being wrong loudly, but by being confidently wrong on a specific.

This failure mode is upstream of voice. The corpus has to be tagged at ingest with effective date, supersession chain, applicable model range, and serial-number cutoff. A retrieval layer that cannot scope by model year will surface the wrong bulletin with full confidence, and voice makes the confidence audible in a way that text does not. Voice exposes the corpus-tagging discipline; it does not substitute for it.
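To make the tagging concrete, here is a minimal sketch of a bulletin record carrying those four fields and a retrieval filter that scopes by them. The field names, the bulletin IDs, and the rule that a serial cutoff means "up to and including" are assumptions made for the example, not a description of any particular schema.

```python
from dataclasses import dataclass
from datetime import date
from typing import List, Optional


@dataclass
class Bulletin:
    doc_id: str
    effective_date: date
    model_years: range                   # applicable model range
    serial_cutoff: Optional[int] = None  # assumed: applies up to and including this serial
    superseded_by: Optional[str] = None  # next link in the supersession chain


def applicable(bulletins: List[Bulletin], model_year: int, serial: int, on_date: date) -> List[Bulletin]:
    """Return bulletins valid for this machine on this date, newest first."""
    in_scope = [
        b for b in bulletins
        if b.effective_date <= on_date
        and model_year in b.model_years
        and (b.serial_cutoff is None or serial <= b.serial_cutoff)
    ]
    # Drop a bulletin only if the document that supersedes it is itself in scope.
    in_scope_ids = {b.doc_id for b in in_scope}
    current = [b for b in in_scope if b.superseded_by not in in_scope_ids]
    return sorted(current, key=lambda b: b.effective_date, reverse=True)


corpus = [
    Bulletin("SB-2023-07", date(2023, 3, 1), range(2023, 2024)),
    Bulletin("SB-2024-02", date(2024, 2, 1), range(2024, 2025)),
]
hits = applicable(corpus, model_year=2024, serial=5120, on_date=date(2024, 11, 5))
print([b.doc_id for b in hits])  # ['SB-2024-02']
```

Without the model-year check, both bulletins would match a query about torque spec, and the previous-year document would be read out with the same confidence as the correct one.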

If the corpus is not tagged this way before the voice rollout, the first weeks in production will teach you which documents are mis-tagged — useful feedback, but expensive feedback.

Where to start next week

Pick one workshop, one product line, one language pair. Get a microphone array that handles the noise floor — this is real hardware money, not something you solve in software. Tag the corpus for that product line at ingest: effective date, supersession chain, model and serial range. Time the round-trip from the engineer’s question to the spoken answer, and tune until it is under three seconds.

Then watch the second-use rate. The first use is curiosity; the second use is whether you have shipped a colleague or a help desk. Full product framing is at Voice / Call Agent. The field-engineer persona and the numbers that service leadership reads weekly are at Field engineers. A live deployment at this scale — 60% drop in L2 escalation, 4.1× faster onboarding — is documented at Nize Equipment.
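The second-use rate can be computed from a plain session log. The sketch below assumes one reasonable definition, the share of technicians who have opened the agent at least twice; the function name and the sample technician IDs are made up for illustration.

```python
from collections import Counter


def second_use_rate(sessions):
    """Share of technicians with two or more sessions.

    sessions: iterable of technician IDs, one entry per session.
    """
    counts = Counter(sessions)
    if not counts:
        return 0.0
    returned = sum(1 for n in counts.values() if n >= 2)
    return returned / len(counts)


log = ["t1", "t2", "t1", "t3", "t2", "t2", "t4"]
print(second_use_rate(log))  # 0.5
```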