Research Project Releases Multilingual Open Source LLM
The OpenGPT-X research initiative has announced the release of its “Teuken-7B” large language model, now available for download on Hugging Face. The seven-billion-parameter LLM was trained from scratch in all 24 official languages of the European Union (EU).
With this release, researchers and companies gain a commercially usable open-source model for their AI applications. Supported by funding from the German Federal Ministry of Economic Affairs and Climate Action (BMWK), the OpenGPT-X consortium developed the model from a deliberately European perspective. The project is led by the Fraunhofer Institutes for Intelligent Analysis and Information Systems (IAIS) and for Integrated Circuits (IIS).
Optimized for chat through instruction tuning, Teuken-7B is trained to understand and follow user instructions accurately, an essential capability for practical applications such as chat interfaces. As Prof. Stefan Wrobel, Director of Fraunhofer IAIS, notes, “‘Teuken-7B’ is freely available, providing a public, research-based alternative for use in academia and industry.” The model is designed to be adapted and further developed, answering the growing demand in academia and industry for transparent and customizable generative AI.
One of Teuken-7B’s distinguishing features is its multilingual foundation: roughly 50% of its pre-training data consists of non-English text. Trained on all official EU languages, the model delivers stable performance across them, making it valuable for international companies and organizations that need multilingual communication capabilities.
OpenGPT-X has also tackled research questions around the efficient training and operation of multilingual AI models. The consortium developed a multilingual tokenizer suited to the complex word structures common in European languages such as German, Finnish, and Hungarian. Compared with the tokenizers used by models such as Llama 3 or Mistral, it reduces training costs and improves energy efficiency.
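A common way to compare tokenizer efficiency across languages is the “fertility” metric: the average number of subword tokens produced per word. The sketch below illustrates the metric with a deliberately naive fixed-chunk splitter standing in for a real subword tokenizer; it is not the actual Teuken-7B tokenizer, but it shows why long compound words drive token counts (and thus cost) up.

```python
# Toy illustration of tokenizer "fertility" (average subword tokens per word).
# split_into_subwords is a naive stand-in, NOT the real Teuken-7B tokenizer:
# it simply chops each word into fixed-size character chunks.

def split_into_subwords(word: str, chunk: int = 4) -> list[str]:
    """Naive subword splitter: break a word into chunks of at most `chunk` chars."""
    return [word[i:i + chunk] for i in range(0, len(word), chunk)]

def fertility(text: str, chunk: int = 4) -> float:
    """Average number of subword tokens produced per whitespace-separated word."""
    words = text.split()
    if not words:
        return 0.0
    tokens = sum(len(split_into_subwords(w, chunk)) for w in words)
    return tokens / len(words)

# Short English words fit in one chunk each, so fertility stays at 1.0:
print(fertility("the cat sat"))  # prints 1.0
# Long compounds, common in German, Finnish, and Hungarian, split into many
# tokens, raising per-word cost in training and inference:
print(fertility("Donaudampfschifffahrtsgesellschaft"))
```

A lower fertility on non-English text means fewer tokens per sentence, which translates directly into lower compute and energy cost for the same corpus.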
The project is part of the BMWK’s “Innovative and practical applications and data spaces in the Gaia-X digital ecosystem” program, with the Teuken-7B LLM becoming accessible via Gaia-X infrastructure. Gaia-X provides a federated ecosystem that ensures secure data ownership and sharing, which is crucial for sensitive corporate data management. According to Dr. Franziska Brantner, Parliamentary State Secretary at BMWK, such innovations are vital for enhancing Europe’s digital sovereignty and competitiveness.
Additionally, Professor Bernhard Grill, Director of Fraunhofer IIS, highlights the safety-critical applications enabled by the language model, emphasizing the advantage of greater control over technology without relying on opaque third-party components. Fields such as automotive, robotics, medicine, and finance could greatly benefit from using such tailored AI models.
The collaborative efforts within the OpenGPT-X consortium, including partners such as TU Dresden, the German Research Center for Artificial Intelligence (DFKI), IONOS, and Aleph Alpha, have culminated in this model. Teuken-7B was trained on the JUWELS supercomputer, drawing on Europe’s high-performance computing (HPC) infrastructure.
The OpenGPT-X project is an exemplary case of how public funding and collaborative efforts can produce viable foundational technologies, from infrastructure to model development and application. As Daniel Abbou, Managing Director of the German AI Association, emphasizes, the OpenGPT-X model sets the stage for further advancements in technology and data sovereignty.
Teuken-7B is available in two versions: one intended purely for research, and one under the Apache 2.0 license, suitable for both research and commercial use. The two versions perform comparably, but some datasets used in the instruction tuning of the research version are not cleared for commercial use and were therefore excluded from the Apache 2.0 version.
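In practice, choosing between the two versions comes down to whether the intended use is commercial. The sketch below shows one way to encode that decision; the Hugging Face repository ids are assumptions for illustration, so check the openGPT-X organization on Hugging Face for the exact names before use.

```python
# Sketch: pick the Teuken-7B variant whose license covers the intended use.
# The repository ids below are assumed for illustration; verify them against
# the openGPT-X organization on Hugging Face.

RESEARCH_REPO = "openGPT-X/Teuken-7B-instruct-research-v0.4"      # research-only
COMMERCIAL_REPO = "openGPT-X/Teuken-7B-instruct-commercial-v0.4"  # Apache 2.0

def choose_repo(commercial_use: bool) -> str:
    """Return the repo id for the version whose license matches the use case."""
    return COMMERCIAL_REPO if commercial_use else RESEARCH_REPO

print(choose_repo(commercial_use=True))

# Loading the chosen model would then look roughly like this (requires the
# `transformers` library and substantial GPU memory, so shown as comments):
# from transformers import AutoModelForCausalLM, AutoTokenizer
# repo = choose_repo(commercial_use=True)
# tok = AutoTokenizer.from_pretrained(repo, trust_remote_code=True)
# model = AutoModelForCausalLM.from_pretrained(repo, trust_remote_code=True)
```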
Launched in early 2022, the research project is approaching the end of its current phase, with optimizations and evaluations continuing through March 2025. Detailed background information and benchmarks are available on the OpenGPT-X project website.