Here is a list of current and past research threads I have been pursuing. For a list of publications, hop over to this page.

Efficient Information Systems

Information systems are only as good as they are fast. How do we build systems that can quickly retrieve, process, and present succinct information to users?

I have looked at this problem from the perspective of question answering systems that operate at web scale, focusing on both model efficiency (Soldaini and Moschitti 2020; Matsubara et al 2022) and data efficiency (Han et al 2021). I have also collaborated on Embedding Recycling, a promising technique to reduce model computation across many tasks (Saad-Falcon et al 2022).

Currently, I am interested in efficient information systems for scientific text:

  • How can NLP support efficient skimming of scientific literature?
  • What does search over scientific literature look like? What NLP models work best for it? How can we efficiently train them?

Cross-Language NLP and Information Processing

While there are hundreds of languages spoken in the world, most of the content on the web is concentrated in a few languages.

In the past, I have looked at how to build cross-language information retrieval systems (MacAvaney et al 2020) and explored the use of generative models to combine information in different languages (Muller et al 2021).

In 2022, I am co-organizing NeuCLIR, a shared task at TREC focused on cross-language information retrieval in Chinese, Farsi, and Russian.

Going forward, I am interested in exploring which other domains and tasks can benefit from a cross-lingual approach, with a particular eye towards data-efficient approaches.

Generative Models for Better Content Presentation

Generative NLP models can be used to process the output of any information system to better suit the needs of the user.

I have worked on projects to improve presentation in question answering in both English (Hsu et al 2021) and other languages (Muller et al 2021). Before that, I also looked at using generative models for structured parsing of user input (Rongali et al 2020).

I am open to collaboration on other tasks that can use generation to refine the output of information systems, particularly in scientific settings.

Open-Source for NLP

I enjoy building open-source tools and data for NLP and ML practitioners, such as:

  • smashed is a library of composable text and tensor processing functions. Compatible with TorchData and HuggingFace Datasets!
  • springs is a simple library to create type-safe configurations and the command-line apps that use them.
  • trouting is a type-based function routing library for Python. Like typing.overload, but for runtime.
  • QuickUMLS is a tool for fast, unsupervised biomedical concept extraction from medical text.