Bowen Shi

Research Scientist
Meta Superintelligence Labs
bshi _at_ meta _dot_ com


I am a research scientist at Meta Superintelligence Labs, working on speech and audio. I am a core contributor to Meta's foundational audio generation models, including SAM Audio, Movie Gen Audio, Audiobox, Voicebox, and MMS. I obtained my Ph.D. from TTIC, where I worked on automatic sign language understanding, advised by Prof. Karen Livescu.

Recent Highlights

December 2025 — Launched SAM Audio, a foundation model that extends Segment Anything to audio, enabling general-purpose audio separation via multimodal prompts.

March 2025 — Our team released AudioBox-aesthetics, a unified automatic quality assessment framework for any audio.

Selected Publications

(See my Google Scholar for a full list)

SAM Audio: Segment Anything in Audio
Bowen Shi, Andros Tjandra, John Hoffman, Helin Wang, Yi-Chiao Wu, Luya Gao, Julius Richter, Matt Le, Apoorv Vyas, Sanyuan Chen, Christoph Feichtenhofer, Piotr Dollár, Wei-Ning Hsu, Ann Lee
(project lead, †: core contributors)
Technical Report, 2025 [paper] [blog] [demo] [code]

Movie Gen: A Cast of Media Foundation Models
Core Contributor, The Movie Gen Team.
Technical Report, 2024 [paper] [blog]

Audiobox: Unified Audio Generation with Natural Language Prompts
Apoorv Vyas, Bowen Shi, Matthew Le, Andros Tjandra, Yi-Chiao Wu, Baishan Guo, Jiemin Zhang, Xinyue Zhang, Robert Adkins, William Ngan, Jeff Wang, Ivan Cruz, Bapi Akula, Akinniyi Akinyemi, Brian Ellis, Rashel Moritz, Yael Yungster, Alice Rakotoarison, Liang Tan, Chris Summers, Carleigh Wood, Joshua Lane, Mary Williamson, Wei-Ning Hsu
(†: equal contribution)
Technical Report, 2023 [paper] [blog] [demo]

Voicebox: Text-Guided Multilingual Universal Speech Generation at Scale
Matthew Le, Apoorv Vyas, Bowen Shi, Brian Karrer, Leda Sari, Rashel Moritz, Mary Williamson, Vimal Manohar, Yossi Adi, Jay Mahadeokar, Wei-Ning Hsu
(†: equal contribution)
NeurIPS 2023 [paper] [blog]

Scaling Speech Technology to 1,000+ Languages
Vineel Pratap, Andros Tjandra, Bowen Shi, Paden Tomasello, Arun Babu, Sayani Kundu, Ali Elkahky, Zhaoheng Ni, Apoorv Vyas, Maryam Fazel-Zarandi, Alexei Baevski, Yossi Adi, Xiaohui Zhang, Wei-Ning Hsu, Alexis Conneau, Michael Auli
(†: core contributors)
JMLR 2023 [paper] [blog] [code]

Scaling Autoregressive Multi-Modal Models: Pretraining and Instruction Tuning
Lili Yu, Bowen Shi, Ramakanth Pasunuru, Benjamin Muller, Olga Golovneva, Tianlu Wang, Arun Babu, Binh Tang, Brian Karrer, Shelly Sheynin, Candace Ross, Adam Polyak, Russell Howes, Vasu Sharma, Puxin Xu, Hovhannes Tamoyan, Oron Ashual, Uriel Singer, Shang-Wen Li, Susan Zhang, Richard James, Gargi Ghosh, Yaniv Taigman, Maryam Fazel-Zarandi, Asli Celikyilmaz, Luke Zettlemoyer, Armen Aghajanyan
(†: equal contribution)
Technical Report, 2023 [paper] [blog]

Learning Audio-Visual Speech Representation by Masked Multimodal Cluster Prediction
Bowen Shi, Wei-Ning Hsu, Kushal Lakhotia, Abdelrahman Mohamed
ICLR, 2022 [paper] [blog] [code]