Publications
This page contains a list of manuscripts that I have authored. For bibliographic information, please refer to my Semantic Scholar or Google Scholar profiles.
* indicates equal contribution; † indicates core contributors.
2025
- Team OLMo: Pete Walsh†, Luca Soldaini†, Dirk Groeneveld†, Kyle Lo†, Shane Arora†, Akshita Bhagia†, Yuling Gu†, Shengyi Huang†, Matt Jordan†, Nathan Lambert†, Dustin Schwenk†, Oyvind Tafjord†, …, Noah A. Smith†, and Hannaneh Hajishirzi†. “2 OLMo 2 Furious”. ArXiv 2501.00656. preprint
- Orion Weller, Benjamin Chang, Eugene Yang, Mahsa Yarmohammadi, Sam Barham, Sean MacAvaney, Arman Cohan, Luca Soldaini, Benjamin Van Durme, and Dawn Lawrie. “mFollowIR: a Multilingual Benchmark for Instruction Following in Information Retrieval”. ECIR 2025. to appear
- Hyunji Lee, Luca Soldaini, Arman Cohan, Minjoon Seo, and Kyle Lo. “RouterRetriever: Exploring the Benefits of Routing over Multiple Expert Embedding Models”. AAAI 2025.
2024
- Akshita Bhagia*, Jiacheng Liu*, Alexander Wettig, David Heineman, Oyvind Tafjord, Ananya Harsh Jha, Luca Soldaini, Noah A. Smith, Dirk Groeneveld, Pang Wei Koh, Jesse Dodge, and Hannaneh Hajishirzi. “Establishing Task Scaling Laws via Compute-Efficient Model Ladders”. ArXiv 2412.04403. preprint
- Shayne Longpre*, Stella Biderman*, …, Yacine Jernite*, and Luca Soldaini*. “The Responsible Foundation Model Development Cheatsheet: A Review of Tools & Resources”. TMLR 12/2024.
- Nathan Lambert, Jacob Morrison, Valentina Pyatkin, Shengyi Huang, Hamish Ivison, Faeze Brahman, Lester James V. Miranda, …, Yizhong Wang, Pradeep Dasigi, and Hannaneh Hajishirzi. “Tülu 3: Pushing Frontiers in Open Language Model Post-Training”. ArXiv 2411.15124. preprint
- Akari Asai, Jacqueline He, Rulin Shao, Weijia Shi, Amanpreet Singh, Joseph Chee Chang, Kyle Lo, Luca Soldaini, Sergey Feldman, Mike D’Arcy, David Wadden, …, Pang Wei Koh, and Hannaneh Hajishirzi. “OpenScholar: Synthesizing Scientific Literature with Retrieval-augmented LMs”. ArXiv 2411.14199. preprint
- Matt Deitke, Christopher Clark, Sangho Lee, Rohun Tripathi, Yue Yang, Jae Sung Park, Mohammadreza Salehi, Niklas Muennighoff, Kyle Lo, Luca Soldaini, Jiasen Lu, …, Noah A. Smith, Hannaneh Hajishirzi, Ross Girshick, Ali Farhadi, and Aniruddha Kembhavi. “Molmo and PixMo: Open Weights and Open Data for State-of-the-Art Multimodal Models”. ArXiv 2409.17146. preprint
- Li Lucy, Tal August, Rose E. Wang, Luca Soldaini, Courtney Allison, and Kyle Lo. “MathFish 🐟: Evaluating Language Model Math Reasoning via Grounding in Educational Curricula”. Findings of EMNLP 2024.
- Jeffrey Li†, Alex Fang†, Georgios Smyrnis†, Maor Ivgi†, …, Achal Dave†, Ludwig Schmidt†, and Vaishaal Shankar†. “DataComp-LM: In search of the next generation of training sets for language models”. Datasets and Benchmarks track, NeurIPS 2024.
- Ian Magnusson, Akshita Bhagia, Valentin Hofmann, Luca Soldaini, …, and Jesse Dodge. “Paloma: A Benchmark for Evaluating Language Model Fit”. Datasets and Benchmarks track, NeurIPS 2024.
- Niklas Muennighoff, Luca Soldaini, Dirk Groeneveld, …, Pang Wei Koh, Amanpreet Singh, and Hannaneh Hajishirzi. “OLMoE: Open Mixture-of-Experts Language Models”. ArXiv 2409.02060. preprint
- Raymond Fok, Luca Soldaini, Cassidy Trier, …, Andrew Head, and Daniel S. Weld. “Accelerating Scientific Paper Skimming with Augmented Intelligence Through Customizable Faceted Highlights”. ACM Transactions on Interactive Intelligent Systems 2024.
- Nathan Lambert, Hailey Schoelkopf, Aaron Gokaslan, Luca Soldaini, Valentina Pyatkin, and Louis Castricato. “Self-Directed Synthetic Dialogues and Revisions Technical Report”. ArXiv 2407.18421. technical report
- David Wadden*, Kejian Shi*, Jacob Morrison, Aakanksha Naik, Shruti Singh, Nitzan Barzilay, Kyle Lo, Tom Hope, Luca Soldaini, Shannon Zejiang Shen, Doug Downey, Hannaneh Hajishirzi, and Arman Cohan. “SciRIFF: A Resource to Enhance Language Model Instruction-Following over Scientific Literature”. ArXiv 2406.07835.
- James Mayfield, Eugene Yang, Dawn Lawrie, Sean MacAvaney, Paul McNamee, Douglas W. Oard, Luca Soldaini, Ian Soboroff, Orion Weller, Efsun Kayi, Kate Sanders, Marc Mason, and Noah Hibbler. “On the Evaluation of Machine-Generated Reports”. SIGIR 2024. best paper nomination
- Fangyuan Xu, Kyle Lo, Luca Soldaini, Bailey Kuehl, Eunsol Choi, and David Wadden. “KIWI: A Dataset of Knowledge-Intensive Writing Instructions for Answering Research Questions”. Findings of ACL 2024.
- Luca Soldaini†, Rodney Kinney†, Akshita Bhagia†, Dustin Schwenk†, …, and Kyle Lo†. “Dolma: an Open Corpus of Three Trillion Tokens for Language Model Pretraining Research”. ACL 2024. best paper award
- Dirk Groeneveld, Iz Beltagy, …, Noah A. Smith, and Hannaneh Hajishirzi. “OLMo: Accelerating the Science of Language Models”. ACL 2024. best paper award
- Li Lucy, Suchin Gururangan, Luca Soldaini, Emma Strubell, David Bamman, Lauren Klein, and Jesse Dodge. “AboutMe: Using Self-Descriptions in Webpages to Document the Effects of English Pretraining Data Filters”. ACL 2024.
- Samir Yitzhak Gadre, Georgios Smyrnis, Vaishaal Shankar, …, Yair Carmon*, Achal Dave*, Reinhard Heckel*, Niklas Muennighoff*, and Ludwig Schmidt*. “Language models scale reliably with over-training and on downstream tasks”. ArXiv 2403.08540.
- James Mayfield, Eugene Yang, Dawn Lawrie, Sean MacAvaney, Paul McNamee, Douglas W. Oard, Luca Soldaini, Ian Soboroff, Orion Weller, Efsun Kayi, Kate Sanders, Marc Mason, and Noah Hibbler. “On the Evaluation of Machine-Generated Reports”. Perspective paper, SIGIR 2024.
- Orion Weller, Benjamin Chang, Sean MacAvaney, Kyle Lo, Arman Cohan, Benjamin Van Durme, Dawn Lawrie, and Luca Soldaini. “FollowIR: Evaluating and Teaching Information Retrieval Models to Follow Instructions”. ArXiv 2403.15246.
- Orion Weller, Kyle Lo, David Wadden, Dawn Lawrie, Benjamin Van Durme, Arman Cohan, and Luca Soldaini. “When do Generative Query and Document Expansions Fail? A Comprehensive Study Across Methods, Retrievers, and Datasets”. Findings of EACL 2024.
- Yanai Elazar, Akshita Bhagia, Ian Magnusson, Abhilasha Ravichander, Dustin Schwenk, Alane Suhr, Pete Walsh, Dirk Groeneveld, Luca Soldaini, Sameer Singh, Hanna Hajishirzi, Noah A. Smith, and Jesse Dodge. “What’s In My Big Data?” ICLR 2024. spotlight
2023
- Hyunji Lee, Luca Soldaini, Arman Cohan, Minjoon Seo, and Kyle Lo. “Back to Basics: A Simple Recipe for Improving Out-of-Domain Retrieval in Dense Encoders”. ArXiv 2311.09765.
- Kyle Lo†, Zejiang Shen†, Benjamin Newman†, Joseph Chee Chang†, …, and Luca Soldaini†. “PaperMage: A Unified Toolkit for Processing, Representing, and Manipulating Visually-Rich Scientific Documents”. System Demonstration, EMNLP 2023. best paper award
- Benjamin Newman, Luca Soldaini, Raymond Fok, Arman Cohan, and Kyle Lo. “A Question Answering Framework for Decontextualizing User-facing Snippets from Scientific Documents”. EMNLP 2023.
- John Giorgi, Luca Soldaini, Bo Wang, Gary Bader, Kyle Lo, Lucy Lu Wang, and Arman Cohan. “Open Domain Multi-document Summarization: A Comprehensive Study of Model Brittleness under Retrieval”. Findings of EMNLP 2023.
- Dawn Lawrie, Sean MacAvaney, James Mayfield, Paul McNamee, Douglas W. Oard, Luca Soldaini, and Eugene Yang. “Overview of the TREC 2023 NeuCLIR Track”. TREC 2023.
- Pratyusha Ria Kalluri*, William Agnew*, Myra Cheng*, Kentrell Owens*, Luca Soldaini*, and Abeba Birhane*. “The Surveillance AI Pipeline”. ArXiv 2309.15084.
- Raymond Fok, Hita Kambhamettu, Luca Soldaini, Jonathan Bragg, Kyle Lo, Andrew Head, Marti A. Hearst, and Daniel S. Weld. “SCIM: Intelligent Skimming Support for Scientific Papers”. IUI 2023.
- Organizers of Queer in AI, Nathan Dennler, Anaelia Ovalle, Ashwin Singh, Luca Soldaini, Arjun Subramonian, Huy Tu, William Agnew, Avijit Ghosh, Kyra Yee, Irene Font Peradejordi, Zeerak Talat, Mayra Russo, and Jess de Jesus de Pinho Pinhal. “Bound to the Bounty: Collaboratively Shaping Evaluation Processes for Queer AI Harms”. AIES 2023.
- Organizers of Queer in AI. “Queer In AI: A Case Study in Community-Led Participatory AI”. FAccT 2023. best paper award
- Sean MacAvaney* and Luca Soldaini*. “One-Shot Labeling for Automatic Relevance Estimation”. Short paper, SIGIR 2023.
- Kyle Lo, Joseph Chee Chang, …, Marti A. Hearst, and Daniel S. Weld. “The Semantic Reader Project: Augmenting Scholarly Documents through AI-Powered Interactive Reading Interfaces”. ArXiv 2303.14334.
- Rodney Michael Kinney, …, and Daniel S. Weld. “The Semantic Scholar Open Data Platform”. ArXiv 2302.11266.
- Jon Saad-Falcon, Amanpreet Singh, Luca Soldaini, Mike D’Arcy, Arman Cohan, and Doug Downey. “Embedding Recycling for Language Models”. Findings of EACL 2023.
2022
- Matteo Gabburo, Rik Koncel-Kedziorski, Siddhant Garg, Luca Soldaini, and Alessandro Moschitti. “Knowledge Transfer from Answer Ranking to Answer Generation”. EMNLP 2022.
- Luca Di Liello, Siddhant Garg, Luca Soldaini, and Alessandro Moschitti. “Pre-training Transformer Models with Sentence-Level Objectives for Answer Sentence Selection”. Short paper, EMNLP 2022.
- Yoshitomo Matsubara, Luca Soldaini, Eric Lind, and Alessandro Moschitti. “Ensemble Transformer for Efficient and Accurate Ranking Tasks: an Application to Question Answering Systems”. Findings of EMNLP 2022.
- Dawn Lawrie, Sean MacAvaney, James Mayfield, Paul McNamee, Douglas W. Oard, Luca Soldaini, and Eugene Yang. “Overview of the TREC 2022 NeuCLIR Track”. TREC 2022.
- Benjamin Muller, Luca Soldaini, Rik Koncel-Kedziorski, Eric Lind, and Alessandro Moschitti. “Cross-Lingual GenQA: A Language-Agnostic Generative Question Answering Approach for Open-Domain Question Answering”. AACL 2022.
- Luca Di Liello, Siddhant Garg, Luca Soldaini, and Alessandro Moschitti. “Paragraph-based Transformer Pre-training for Multi-Sentence Inference”. Short paper, NAACL-HLT 2022.
2021
- Chao-Chun Hsu, Eric Lind, Luca Soldaini, and Alessandro Moschitti. “Answer Generation for Retrieval-based Question Answering Systems”. Findings of ACL 2021.
- Rujun Han, Luca Soldaini, and Alessandro Moschitti. “Modeling Context in Answer Sentence Selection Systems on a Latency Budget”. EACL 2021.
2020
- Mingda Li, Xinyue Liu, Weitong Ruan, Luca Soldaini, Wael Hamza, and Chengwei Su. “Multi-task Learning of Spoken Language Understanding by Integrating N-Best Hypotheses with Hierarchical Attention”. COLING 2020.
- Luca Soldaini and Alessandro Moschitti. “The Cascade Transformer: Efficient Answer Sentence Selection”. ACL 2020.
- Subendhu Rongali, Luca Soldaini, Emilio Monti, and Wael Hamza. “Don’t Parse, Generate! A Sequence to Sequence Architecture for Task-Oriented Semantic Parsing”. Short paper, WWW 2020.
- Sean MacAvaney, Luca Soldaini, and Nazli Goharian. “Teaching a New Dog Old Tricks: Resurrecting Multilingual Retrieval Using Zero-shot Learning”. Short paper, ECIR 2020.
2018
- Sean MacAvaney, Andrew Yates, Arman Cohan, Luca Soldaini, Kai Hui, Nazli Goharian, and Ophir Frieder. “Overcoming Low-Utility Facets for Complex Answer Retrieval”. Information Retrieval Journal, 2018.
- Ziling Fan, Luca Soldaini, Arman Cohan, and Nazli Goharian. “Relation Extraction for Protein-Protein Interactions Affected by Mutation”. Short paper, ACM-BCB 2018.
- Arman Cohan*, Bart Desmet*, Andrew Yates*, Luca Soldaini, Sean MacAvaney, and Nazli Goharian. “SMHD: a Large-Scale Resource for Exploring Online Language Usage for Multiple Mental Health Conditions”. COLING 2018.
- Luca Soldaini, Timothy Walsh, Arman Cohan, Julien Han, and Nazli Goharian. “Helping or Hurting? Predicting Changes in Users’ Risk of Self-Harm Through Online Community Interactions”. CLPsych Workshop, NAACL-HLT 2018.
- Luca Soldaini. “The Knowledge and Language Gap in Medical Information Seeking”. PhD Thesis, Georgetown University. 2018.
- Sean MacAvaney, Bart Desmet, Arman Cohan, Luca Soldaini, Andrew Yates, Ayah Zirikly, and Nazli Goharian. “RSDD-Time: Temporal Annotation of Self-Reported Mental Health Diagnoses”. CLPsych Workshop, NAACL-HLT 2018.
- Sean MacAvaney, Andrew Yates, Arman Cohan, Luca Soldaini, Kai Hui, Nazli Goharian, and Ophir Frieder. “Characterizing Question Facets for Complex Answer Retrieval”. SIGIR 2018.
- Sean MacAvaney, Luca Soldaini, Arman Cohan, and Nazli Goharian. “Tree-LSTMs for Scientific Relation Classification”. SemEval Workshop, NAACL-HLT 2018.
2017
- Luca Soldaini, Andrew Yates, and Nazli Goharian. “Denoising Clinical Notes for Medical Literature Retrieval with Convolutional Neural Model”. Short paper, CIKM 2017.
- Luca Soldaini, Andrew Yates, and Nazli Goharian. “Learning to Reformulate Long Queries for Clinical Decision Support”. JASIST 2017.
- Luca Soldaini and Elad Yom-Tov. “Inferring Individual Attributes from Search Engine Queries and Auxiliary Information”. WWW 2017.
- Luca Soldaini and Nazli Goharian. “Learning to Rank for Consumer Health Search: a Semantic Approach”. Short paper, ECIR 2017.
2016
- Luca Soldaini and Nazli Goharian. “QuickUMLS: a Fast, Unsupervised Approach for Medical Concept Extraction”. MedIR workshop, SIGIR 2016.
- Arman Cohan, Luca Soldaini, and Nazli Goharian. “Identifying Significance of Discrepancies in Radiology Reports”. DMMH Workshop, SDM 2016.
- Luca Soldaini, Andrew Yates, Elad Yom-Tov, Ophir Frieder, and Nazli Goharian. “Enhancing Web Search in the Medical Domain via Query Clarification”. Information Retrieval Journal, 2016.
2015
- Arman Cohan, Luca Soldaini, and Nazli Goharian. “Matching Citation Text and Cited Spans in Biomedical Literature: a Search–Oriented Approach”. Short paper, NAACL-HLT 2015.
- Luca Soldaini, Arman Cohan, Andrew Yates, Nazli Goharian, and Ophir Frieder. “Retrieving Medical Literature for Clinical Decision Support”. ECIR 2015.
2014
- Arman Cohan, Luca Soldaini, Andrew Yates, Nazli Goharian, and Ophir Frieder. “On Clinical Decision Support”. Short paper, ACM-BCB 2014.