Luca Soldaini
A raccoon wearing a top hat and holding a pizza slice. Luca sometimes uses this image as their online profile picture.

DALL·E 2 generation (April 2022)

Hello, visitor! 👋 My name is Luca Soldaini and I use they/them pronouns.

I am a senior research scientist at the Allen Institute for AI on the Semantic Scholar and OLMo teams, and an organizer at Queer in AI. Prior to joining AI2, I was a senior applied scientist at Amazon Alexa. I completed my Ph.D. in computer science at Georgetown University in 2018, working in the Information Retrieval Lab with Nazli Goharian.

When not in front of a screen, I enjoy brewing espresso, going on runs, dreaming about utopian mass transit systems, and curating my ever-growing laptop sticker collection. Raccoons are the best.

Research Interests

These days, my research focuses on best practices for the curation and exploration of large corpora, mostly in the context of (Large) Language Models. I am also interested in techniques for efficient domain adaptation in Information Retrieval (IR). A sample of recent topics I have been working on:

  • πŸ‡ I co-lead the data curation team for OLMo, AI2’s language model. OLMo is a state of the art model designed to accelerate the science of LMs. In 2023, we released the first version of Dolma, an open dataset of 3 trillion tokens for language model pretraining. The latest version of Dolma (1.7), improves OLMo performance significantly thanks to better sources and improved deduplication.
  • πŸ”Ž I am co-organizing NeuCLIR, a shared task at TREC focused on cross-language information retrieval in Chinese, Farsi, and Russian.
  • πŸ“Š Two data projects I have been excited about: Lucy Li’s AboutMe, where we use self-description web pages to study curation practices for LLMs exclude certain groups, and Yanai Elazar’s WIMBD, an efficient toolkit to investigate the content of large corpora.
  • πŸ”„ On the IR side, there’s much to do on making LLMs interface with retrieval systems well! With Orion Weller, we investigated when generative models can be used to augment queries and documents in IR systems, and studied how to adapt neural IR models to work with instructions. With Sean MacAvaney, we found evidence that LLMs are effective labelers for IR tasks.
  • πŸ“š Processing scientific PDFs remains challenging. We created PaperMage, a toolkit to make multimodal PDF processing easier. This work won a best demo award at EMNLP 2023!
  • πŸ³οΈβ€πŸŒˆ Participatory AI is fun! With Queer In AI, we documented what has lead to the creation of our organization, how we apply community-lead decentralized governance, and our initiatives. This work won a best paper award at FAccT 2023!

Hop over to the publications page for a complete list of my work.


I ❤ collaborating and connecting with other scholars and practitioners! Do get in touch if you are interested in any of the areas above, or if you have a research idea that you think I might be interested in.

This website is licensed under CC BY 4.0