TimeSpeak Benchmark

TimeSpeak is an internal benchmark developed to evaluate whether interaction models can initiate speech at user-specified times while producing the correct content. The benchmark tests time-based speech generation: models are given instructions like "remind me to breathe in and out every 4 seconds until I ask you to stop" or "say 'turn left' when we reach the intersection, which is about 30 seconds from now." A correct response requires both semantic correctness (the right content was spoken) and temporal correctness (it was spoken at the right time), and the evaluation awards no credit if either criterion fails.
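The all-or-nothing scoring rule described above can be sketched in a few lines. This is a minimal illustration, not the benchmark's actual implementation: the names `Verdict`, `score`, and `accuracy`, and the choice to represent the two judgments as booleans, are assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class Verdict:
    """Hypothetical per-example judgment; the field names are assumptions."""
    semantic_ok: bool  # the right content was spoken
    temporal_ok: bool  # it was spoken at the right time


def score(verdict: Verdict) -> float:
    """Award full credit only when both criteria pass, otherwise none."""
    return 1.0 if (verdict.semantic_ok and verdict.temporal_ok) else 0.0


def accuracy(verdicts: Sequence[Verdict]) -> float:
    """Mean score over a set of examples (0.0 for an empty set)."""
    if not verdicts:
        return 0.0
    return sum(score(v) for v in verdicts) / len(verdicts)
```

Under this rule, a response with the right words at the wrong moment scores the same as one with the wrong words at the right moment: both receive zero.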

The benchmark matters because time-based speech initiation is a hard problem that purely turn-based models cannot solve. A turn-based model receives a complete text or audio input, processes it, and generates a response; it produces output only in reply to an input, so it has no mechanism for initiating speech at a later moment of its own accord. Without a continuous representation of elapsed time, a model cannot act at a future moment unless something external re-invokes it at that moment. TimeSpeak therefore isolates a capability that is specific to continuous-interaction architectures.

The evaluation uses an LLM judge that assesses both the semantic content of the model's spoken output and its timing. Because the ground-truth timing is well defined (a specific number of seconds from a known start point), the judge can determine precisely whether the model's output was delivered within the acceptable window. Current frontier models score near zero on TimeSpeak, indicating that even models that are highly capable in other respects have not acquired this capability through standard training.
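The timing half of the check can be made concrete with a small sketch. The source does not state the size of the acceptable window or how the judge receives timestamps, so the tolerance parameter, the function names, and the one-utterance-per-target matching below are all assumptions made for illustration.

```python
from typing import List, Sequence


def expected_times(interval_s: float, duration_s: float) -> List[float]:
    """Target times, in seconds from the known start point, for a periodic
    instruction such as 'every 4 seconds' observed for duration_s seconds."""
    count = int(duration_s // interval_s)
    return [interval_s * k for k in range(1, count + 1)]


def within_window(actual_s: float, target_s: float, tolerance_s: float) -> bool:
    """True if an utterance started within +/- tolerance_s of its target."""
    return abs(actual_s - target_s) <= tolerance_s


def timing_correct(
    actual_s: Sequence[float],
    targets_s: Sequence[float],
    tolerance_s: float,
) -> bool:
    """True only if there is exactly one utterance per target and each one,
    taken in chronological order, falls inside its window.
    Assumes targets_s is already sorted in ascending order."""
    if len(actual_s) != len(targets_s):
        return False
    return all(
        within_window(a, t, tolerance_s)
        for a, t in zip(sorted(actual_s), targets_s)
    )
```

For the breathing example, `expected_times(4.0, 12.0)` gives targets at 4, 8, and 12 seconds, and a model that spoke at 4.2, 7.9, and 12.3 seconds would pass with a hypothetical tolerance of 0.5 seconds. A one-shot instruction such as the 30-second "turn left" cue is the same check with a single target.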

The practical applications of time-aware speech initiation are extensive: breathing or meditation reminders, navigation instructions timed to physical location, language-learning prompts timed to gaps in conversation, fitness coaching cues, and any other situation where the model acts as a real-time assistant rather than a reactive respondent. TimeSpeak was designed by Thinking Machines Lab to characterize the proactive audio capabilities of its interaction model family.