Cracks in the Foundation: Seemingly Minor Architectural Choices Impact Long Context Extension
Here is a list of papers that I have authored; they are also available on my Semantic Scholar and Google Scholar profiles. α indicates equal contribution; ω indicates a core contributor.
2026
2025
The Common Pile v0.1: An 8TB Dataset of Public Domain and Openly Licensed Text
Tülu 3: Pushing Frontiers in Open Language Model Post-Training
Establishing Task Scaling Laws via Compute-Efficient Model Ladders
Teaching Models to Understand (but not Generate) High-risk Data
OLMoTrace: Tracing Language Model Outputs Back to Trillions of Training Tokens
Organize the Web: Constructing Domains Enhances Pre-Training Data Curation
DataDecide: How to Predict Best Pretraining Data with Small Experiments
olmOCR: Unlocking Trillions of Tokens in PDFs with Vision Language Models
FollowIR: Evaluating and Teaching Information Retrieval Models to Follow Instructions
Language models scale reliably with over-training and on downstream tasks
Molmo and PixMo: Open Weights and Open Data for State-of-the-Art Multimodal Models
mFollowIR: A Multilingual Benchmark for Instruction Following in Information Retrieval
RouterRetriever: Exploring the Benefits of Routing over Multiple Expert Embedding Models
2024
The Responsible Foundation Model Development Cheatsheet: A Review of Tools & Resources
MathFish 🐟: Evaluating Language Model Math Reasoning via Grounding in Educational Curricula
DataComp-LM: In search of the next generation of training sets for language models
Self-Directed Synthetic Dialogues and Revisions Technical Report
SciRIFF: A Resource to Enhance Language Model Instruction-Following over Scientific Literature
KIWI: A Dataset of Knowledge-Intensive Writing Instructions for Answering Research Questions
Dolma: An Open Corpus of Three Trillion Tokens for Language Model Pretraining Research
2023
Back to Basics: A Simple Recipe for Improving Out-of-Domain Retrieval in Dense Encoders
A Question Answering Framework for Decontextualizing User-facing Snippets from Scientific Documents
Open Domain Multi-document Summarization: A Comprehensive Study of Model Brittleness under Retrieval
Bound to the Bounty: Collaboratively Shaping Evaluation Processes for Queer AI Harms
2022
2021
2020
2018
2017
Denoising Clinical Notes for Medical Literature Retrieval with a Convolutional Neural Model
Learning to Reformulate Long Queries for Clinical Decision Support
Inferring Individual Attributes from Search Engine Queries and Auxiliary Information
Learning to Rank for Consumer Health Search: A Semantic Approach