Hello, visitor! 👋 My name is Luca Soldaini.
I am a research scientist at the Allen Institute for AI in the OLMo team. Prior to joining Ai2, I was a senior applied scientist at Amazon Alexa. I completed my Ph.D. in computer science at Georgetown University in 2018 in the Information Retrieval Lab working with Nazli Goharian.
When not in front of a screen, I enjoy brewing espresso, going on runs, dreaming about utopian mass transit systems, and curating my ever-growing laptop stickers collection. Raccoons are the best.
Research Interests
These days, my research focuses on best practices for curating and exploring large corpora, mostly in the context of language model development. I also work on techniques for efficient domain adaptation in Information Retrieval (IR). A sample of recent topics I have been working on:
- 🍇 I co-lead the data team for OLMo, Ai2’s language model. OLMo is a state-of-the-art, fully open model designed to accelerate the science of LMs. In 2024, we released dense, mixture-of-experts, and multimodal models, alongside their data, code, and checkpoints. The OLMo project has been recognized with two best paper awards at ACL 2024.
- 📊 Two data projects I have been excited about: Lucy Li’s AboutMe, where we use self-description web pages to study how curation practices for LLMs exclude certain groups, and Yanai Elazar’s WIMBD, an efficient toolkit to investigate the content of large corpora.
- 🔎 On the IR side, there’s much to do on improving the interface between language models and retrieval systems! With Orion Weller, we investigated when generative models can be used to augment queries and documents in IR systems, and studied how to adapt neural IR models to work with instructions. With Sean MacAvaney, we found evidence that language models are effective labelers for IR tasks. I am an organizer for NeuCLIR, a shared task at TREC focused on cross-language information retrieval in Chinese, Farsi, and Russian.
- 📚 Processing scientific PDFs remains challenging. We created PaperMage, a toolkit to make multimodal PDF processing easier. This work won a best demo award at EMNLP 2023!
Hop over to the publications page for a complete list of my work.
Contacts
I ❤ collaborating and connecting with other researchers! Do get in touch if you are interested in any of the areas above, or if you have ideas that you think I might be interested in.