Portuguese-Language Foundation Models (Sabiá)

Maritaca AI, a Campinas-based startup spun out of Unicamp's computer science department, developed the Sabiá family of large language models, trained specifically on Portuguese text. Sabiá-3, released in 2024, achieves accuracy comparable to GPT-4o across 64 Brazilian exams, including the OAB (bar exam), ENEM (university entrance exam), and ENADE (national assessment of undergraduate students), while costing 3-4x less per token than frontier US models.

The strategic significance is linguistic sovereignty in the AI era. Portuguese is spoken by more than 260 million people across Brazil, Portugal, Angola, Mozambique, and other Lusophone nations. Models trained primarily on English text underperform on Portuguese legal, medical, and cultural tasks, and the gap widens for domain-specific applications such as legal analysis, where the Juru model specializes in Brazilian law. Maritaca AI builds on high-quality Portuguese training datasets curated from Common Crawl with industrial-grade filtering.

Brazil's AI Plan 2024-2028 explicitly targets sovereign AI capability, including foundation models. Maritaca AI represents the private-sector complement: a commercially viable LLM that keeps Portuguese language processing under Brazilian control rather than depending entirely on OpenAI, Google, or Chinese alternatives. The model is being adopted by Brazilian enterprises for customer service, document analysis, and compliance workflows where Portuguese-language accuracy is mission-critical.

