
A non-profit AI research lab that maintains the LM Evaluation Harness, a standard benchmark suite for LLMs.
AI security company known for 'Gandalf', a game/tool for prompt injection testing.

United States · Startup
Provides data infrastructure for AI, including RLHF (Reinforcement Learning from Human Feedback) and comprehensive model evaluation services.
An ML observability platform that helps teams detect issues, troubleshoot, and improve model performance in production.
A model monitoring and observability platform that includes specific tools for evaluating LLM accuracy and hallucination.
Provides Model Performance Management (MPM) to monitor, explain, and analyze AI models in production.
United States · Startup
Provides a platform for LLM evaluation and observability, focusing on data quality and hallucination detection.
France · Startup
Open-source testing and evaluation platform for AI models to ensure quality, security, and compliance.
The global hub for open-source AI models and datasets. Founded by French entrepreneurs with a major office in Paris.
United States · Company
A developer-first MLOps platform used for tracking experiments, versioning models, and visualizing evaluation metrics.
Scalable oversight and evaluation systems provide comprehensive monitoring and assessment of AI systems, combining automated testing (red-teaming), behavioral benchmarks, and human review to continuously measure capabilities, identify risks, and detect regressions. These systems form the infrastructure for safety governance, enabling ongoing oversight of rapidly evolving AI systems that would be impossible to monitor manually.
This innovation addresses the challenge of ensuring AI safety as systems become more capable and complex, where manual oversight becomes impractical. By automating evaluation and providing continuous monitoring, these systems can detect problems early, track capability growth, and ensure that safety measures remain effective as systems evolve. The technology is essential for deploying AI systems safely, especially frontier models and autonomous agents that operate with significant autonomy.
The technology is becoming critical infrastructure for AI safety, as manual oversight cannot scale to monitor the vast number of AI systems being deployed. As AI capabilities advance and systems become more autonomous, having robust oversight and evaluation systems becomes essential for identifying risks, ensuring safety, and maintaining control. However, developing comprehensive evaluation systems that can detect all relevant risks and capabilities remains challenging, and the field is actively developing better methods for assessing AI systems.