The OpenEuroLLM project is set to transform the landscape of artificial intelligence in Europe by developing a series of foundation models built for transparency. Jan Hajič, a computational linguist from Charles University in Prague, and Peter Sarlin, CEO and co-founder of Silo AI, lead this ambitious venture. With a budget of approximately €37.4 million, the project unites 20 organizations, including academic institutions and corporations. Working with EuroHPC supercomputer centers in Spain, Italy, Finland, and the Netherlands, OpenEuroLLM aims to create open-source models that companies can use to build AI applications throughout Europe.
The project’s goal is to foster digital sovereignty within Europe by keeping mission-critical infrastructure and tools close to home. OpenEuroLLM’s scope covers the 24 official EU languages as well as the languages of countries negotiating EU membership. Data from Common Crawl, an open repository of web-crawled data, will support this initiative alongside additional datasets. The models will be trained on 4.5 petabytes of web crawls and more than 20 billion documents using high-performance computing resources, with transparency and explainability as stated aims.
Collaboration for Innovation and Linguistic Diversity
OpenEuroLLM builds upon the foundation laid by the HPLT project, which has already developed datasets and models for European languages. The project’s budget primarily covers personnel costs, with compute expenses expected to be managed through partnerships with EuroHPC centers. Rather than trying to outpace Big Tech or billion-dollar AI startups, OpenEuroLLM seeks to offer a genuinely open-source alternative for Europe. That openness is meant to extend to the training data, especially the data sourced from Common Crawl.
“We hope that most of the data [will be open], especially the data coming from the Common Crawl,” – Jan Hajič
Jan Hajič emphasizes the importance of transparency and compliance with AI regulations in Europe. While the project’s ultimate goal is to keep everything open, certain constraints may necessitate adjustments.
“We would like to have it all completely open, but we will see. In any case, we will have to comply with AI regulations.” – Jan Hajič
The focus remains on quality over quantity, so that any models released are high-quality and well-developed. Because substantial public funds from the European Commission are involved, the stakes for delivering robust outcomes are high.
“We want to have it as small but as high-quality as possible. We don’t want to release something which is half-baked, because from the European point-of-view this is high-stakes, with lots of money coming from the European Commission — public money.” – Jan Hajič
OpenEuroLLM also addresses linguistic diversity challenges within Europe. The project aims to establish true benchmarks for languages with scarce digital resources, ensuring cultural representation within AI models.
“That is the goal, but how successful we can be with languages with scarce digital resources is the question,” – Jan Hajič
“But that’s also why we want to have true benchmarks for these languages, and not to be swayed toward benchmarks which are perhaps not representative of the languages and the culture behind them.” – Jan Hajič
Collaborative Approach and Long-Term Goals
The OpenEuroLLM project draws inspiration from recent successes in AI within Europe, notably by small focused teams like Mistral AI and LightOn.
“Europe’s recent successes in AI shine through small focused teams like Mistral AI and LightOn — companies that truly own what they’re building,” – Stasenko
Stasenko highlights how these companies maintain accountability in financial decisions, market positioning, and reputation.
“They carry immediate responsibility for their choices, whether in finances, market positioning, or reputation.” – Stasenko
Collaboration remains a cornerstone of the OpenEuroLLM approach. Andre Martins advocates for open collaboration among diverse communities to prevent redundancy and maximize expertise sharing.
“I hope the different communities collaborate openly, share their expertise, and don’t decide to reinvent the wheel every time a new project gets funded,” – Andre Martins
Jan Hajič echoes this sentiment, emphasizing the advantages of collaborative efforts over those driven by single entities.
“I’ve been involved in many collaborative projects, and I believe it has its advantages versus a single company,” – Jan Hajič
Combining academic expertise with corporate focus could result in innovative outcomes that rival achievements by major players such as OpenAI and Mistral.
“Of course they’ve done great things at the likes of OpenAI to Mistral, but I hope that the combination of academic expertise and the companies’ focus could bring something new.” – Jan Hajič
Ultimately, success for OpenEuroLLM does not hinge on becoming the leading model; rather, it lies in delivering a robust solution built entirely within Europe.
“I hope this won’t be the case, but if, in the end, we are not the number one model, and we have a ‘good’ model, then we will still have a model with all the components based in Europe,” – Jan Hajič
What The Author Thinks
The OpenEuroLLM initiative represents an essential step in ensuring Europe’s autonomy in the AI space. While the challenges, particularly in linguistic diversity and ensuring high-quality models, are considerable, the project’s emphasis on collaboration, transparency, and regulatory compliance shows a forward-thinking approach. Success will be measured not by surpassing tech giants but by fostering sustainable, open-source AI that serves Europe’s unique needs.