CueSpeak Benchmark

CueSpeak is an internal benchmark that evaluates whether interaction models can produce simultaneous speech responses: speaking at the same time as the user, with an appropriate, semantically correct response to a specific conversational cue. The benchmark presents scenarios such as: "every time I code-switch and use another language, give me the correct word in the original language." A full score requires both that the model's response has the correct semantic content (the right word in the right language) and that it is delivered at the same time as the user's code-switch, rather than after the user has finished their sentence.

The core capability being tested is cue-triggered simultaneous speech: the model must recognize that a specific event has occurred in the user's speech and produce a response at the same moment, not after. This is distinct from TimeSpeak, which tests time-triggered initiation, and from standard verbal interjections, which test context-triggered corrections. CueSpeak specifically tests the intersection of semantic recognition and temporal concurrency: can the model detect a meaningful event in the ongoing speech stream and respond while that speech is still in progress?

The evaluation uses an LLM judge that assesses the semantic correctness of the response and a separate timing analysis that determines whether the response overlapped with the relevant segment of user speech. Responses that are semantically correct but delivered after the user finishes speaking receive partial or no credit. This dual criterion means that a model cannot compensate for slow timing with higher accuracy: both conditions must hold for the same response.
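The dual criterion can be illustrated with a minimal sketch. It is not the benchmark's actual scoring code. The names `Interval`, `overlap_seconds`, `score_response`, and the `late_credit` parameter are hypothetical. The LLM judge's verdict is assumed to arrive as a boolean, and "overlap" is assumed to mean any positive intersection between the response and the cue segment.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Interval:
    """A span of audio in seconds (hypothetical representation)."""
    start: float
    end: float


def overlap_seconds(a: Interval, b: Interval) -> float:
    """Length of the intersection of two intervals, or 0.0 if disjoint."""
    return max(0.0, min(a.end, b.end) - max(a.start, b.start))


def score_response(
    semantically_correct: bool,  # assumed output of the LLM judge
    cue: Interval,               # user speech segment containing the cue
    response: Interval,          # model speech segment
    late_credit: float = 0.0,    # assumed credit for correct but late replies
) -> float:
    """Full credit only when the response is correct AND overlaps the cue."""
    if not semantically_correct:
        return 0.0
    if overlap_seconds(cue, response) > 0.0:
        return 1.0
    return late_credit


# Usage: the user code-switches between 2.0 s and 2.6 s.
cue = Interval(2.0, 2.6)
on_time = score_response(True, cue, Interval(2.3, 3.0))
too_late = score_response(True, cue, Interval(4.0, 4.5))
```

Under these assumptions `on_time` receives full credit and `too_late` receives only `late_credit`. A real timing analysis would likely need a tolerance window or a minimum overlap duration, which the source does not specify.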

Current frontier models score near zero on CueSpeak, consistent with the finding that simultaneous speech capability is a genuine frontier. The benchmark was designed by Thinking Machines Lab to quantify capabilities that existing benchmarks do not cover, specifically the real-time collaborative correction, translation, and co-reading that characterize expert human collaboration. Practical applications include real-time language tutoring, accessibility tools for users who code-switch, and collaborative writing or coding assistance in which the model provides concurrent feedback.