Luca Soldaini

DALL•E 2 generation (April 2022)

Hello, visitor! 👋

I am a lead research scientist at the Allen Institute for AI, on the OLMo team. Prior to joining Ai2, I was a senior applied scientist at Amazon Alexa. I completed my Ph.D. in computer science at Georgetown University in 2018, working with Nazli Goharian in the Information Retrieval Lab.

When not in front of a screen, I enjoy brewing espresso, going on runs, dreaming about utopian mass transit systems, and curating my ever-growing laptop sticker collection. Raccoons are the best.

Research Interests

These days, my research focuses on maximizing transparency in all aspects of how large language models (LLMs) are created, trained, and evaluated.

🏎️ I co-lead the data team for OLMo, Ai2’s language model. OLMo is a state-of-the-art, fully-open model designed to accelerate the science of LLMs. In 2024, we released dense and mixture-of-experts variants, alongside the data, code, recipes, and checkpoints we used to create them. The OLMo project has been recognized with two best paper awards at ACL 2024. We recently published OLMo 2 7B, 13B, and 32B: the best fully-open models yet.
⚙️ With my colleagues at Ai2, I develop recipes for adapting language models. In 2024, we launched Tülu 3, a state-of-the-art pipeline to post-train language models up to 405B parameters. We also launched Molmo, a family of open state-of-the-art multimodal AI models.
🧬 I collaborated on several projects to analyze and improve pipelines for language models. AboutMe, WIMBD, and WebOrganizer are tools to analyze large pretraining corpora. olmOCR is a high-performance toolkit for PDF text extraction. We also developed predictive techniques and benchmarks to characterize the behavior of language models during pretraining.

Besides core language modeling research, I am interested in adapting language models to information retrieval and document understanding tasks.

🔎 I have been investigating how to improve the interface between language models and retrieval systems. With Orion Weller, we studied when generative models can be used to augment queries and documents in IR systems, and proposed FollowIR, a technique to adapt neural IR models to work with instructions. FollowIR was later extended to multilingual systems.
📚 Adapting LLMs to literature-grounded scientific tasks remains challenging, from document parsing to instruction following and interface design. In late 2024, I collaborated on OpenSciLLM, an end-to-end demo showing how language models can be used for literature synthesis.

Hop over to the publications page for a complete list of my work.

Contacts

I ❤ collaborating and connecting with other researchers! Do get in touch if you are working in any of the areas above, or if you have ideas that you think I might be interested in.