Sanyuan Chen (陈三元)

Ph.D. student @ Harbin Institute of Technology
Research Intern @ Microsoft Research Asia
[Resume]

GitHub
Google Scholar
LinkedIn
Zhihu
Email

# About Me

Hi there! I am a joint Ph.D. student of Harbin Institute of Technology and Microsoft Research Asia, supervised by Prof. Xiangzhan Yu and Dr. Ming Zhou. I obtained my B.S. degree from Harbin Institute of Technology, where I was supervised by Prof. Wanxiang Che. I am currently a research intern in the Natural Language Computing Group at MSRA, mentored by Dr. Yu Wu and Dr. Shujie Liu.

My current research focuses on language model pre-training for speech and audio processing. I have developed several pre-trained models, including VALL-E, WavLM, and BEATs, which have advanced the state of the art on a range of public benchmarks.

I expect to graduate in June 2024 and am seeking job opportunities worldwide. Please feel free to contact me if you are interested!

# Education & Experience

  • Microsoft Research Asia May 2020 - Present
    Research Intern in Natural Language Computing Group

  • Harbin Institute of Technology Aug 2019 - Present
    Ph.D. Student in Research Center for Social Computing and Information Retrieval

  • Harvard University Aug 2018 - May 2019
    Research Intern in Data Systems Laboratory

  • National Chiao Tung University Aug 2017 - Jan 2018
    Exchange Student in Computer Science

  • Harbin Institute of Technology Aug 2015 - Jun 2019
    B.S. in Computer Science and Technology

# Selected Publications

→ Full list (1000+ citations) (* joint first author)

BEATs: Audio Pre-Training with Acoustic Tokenizers

Sanyuan Chen, Yu Wu, Chengyi Wang, Shujie Liu, Daniel Tompkins, Zhuo Chen, Furu Wei

Ranks 1st on the AudioSet leaderboard.
Ranks 1st on the Balanced AudioSet leaderboard.
Ranks 1st on the ESC-50 leaderboard.
Powers the winning system in the DCASE 2023 Automated Audio Captioning Challenge.
Powers all of the top 5 systems in the DCASE 2023 Sound Event Detection Challenge.

[Accepted in ICML 2023 (oral)] [code]

Neural Codec Language Models are Zero-Shot Text to Speech Synthesizers

Chengyi Wang*, Sanyuan Chen*, Yu Wu*, Ziqiang Zhang, Long Zhou, Shujie Liu, Zhuo Chen, Yanqing Liu, Huaming Wang, Jinyu Li, Lei He, Sheng Zhao, Furu Wei

VALL-E can synthesize speech in anyone's voice from just 3 seconds of audio.
VALL-E X can help anyone speak a foreign language without an accent.
State-of-the-art zero-shot TTS system with in-context learning capabilities.
Wins the UNESCO Netexplo Innovation Award 2023 (top 10 out of over 3,000).

[VALL-E paper] [VALL-E demo] [VALL-E X paper] [VALL-E X demo]
[Reddit Discussion] [Fox News] [Yahoo! News] [Ars Technica News]
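
For intuition, here is a toy sketch of the neural codec language modeling idea behind VALL-E: a causally masked transformer autoregressively predicts discrete audio-codec tokens conditioned on phoneme tokens and a ~3-second acoustic prompt. This is an illustrative schematic rather than the released system; all sizes and vocabularies below are invented, and VALL-E itself pairs an autoregressive stage with a non-autoregressive stage over EnCodec codes.

```python
# Toy schematic of neural codec language modeling (NOT the released VALL-E):
# a decoder-only transformer predicts the next audio-codec token given
# phoneme tokens plus a short acoustic prompt. Sizes are made up.
import torch
import torch.nn as nn

PHONE_VOCAB, CODEC_VOCAB, DIM = 100, 1024, 256  # illustrative sizes

class ToyCodecLM(nn.Module):
    def __init__(self):
        super().__init__()
        self.phone_emb = nn.Embedding(PHONE_VOCAB, DIM)
        self.codec_emb = nn.Embedding(CODEC_VOCAB, DIM)
        layer = nn.TransformerEncoderLayer(DIM, nhead=4, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, num_layers=2)
        self.head = nn.Linear(DIM, CODEC_VOCAB)

    def forward(self, phones, codecs):
        # Concatenate text (phoneme) and acoustic (codec) embeddings into one
        # sequence and run a causally masked transformer over it.
        x = torch.cat([self.phone_emb(phones), self.codec_emb(codecs)], dim=1)
        mask = nn.Transformer.generate_square_subsequent_mask(x.size(1))
        return self.head(self.backbone(x, mask=mask))  # next-token logits

model = ToyCodecLM().eval()
phones = torch.randint(0, PHONE_VOCAB, (1, 20))   # target text as phoneme ids
prompt = torch.randint(0, CODEC_VOCAB, (1, 150))  # ~3 s prompt as codec ids
with torch.no_grad():
    for _ in range(50):  # greedily extend the codec-token sequence
        logits = model(phones, prompt)
        next_tok = logits[:, -1].argmax(-1, keepdim=True)
        prompt = torch.cat([prompt, next_tok], dim=1)
# `prompt` now holds the acoustic prompt plus 50 generated codec tokens,
# which a codec decoder would turn back into a waveform.
```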

SpeechLM: Enhanced Speech Pre-Training with Unpaired Textual Data

Ziqiang Zhang*, Sanyuan Chen*, Long Zhou*, Yu Wu, Shuo Ren, Shujie Liu, Zhuoyuan Yao, Xun Gong, Lirong Dai, Jinyu Li, Furu Wei

12% relative WER reduction over data2vec with only 400K sentences of unpaired text.

[Paper] [code] [blog]

WavLM: Large-Scale Self-Supervised Pre-Training for Full Stack Speech Processing

Sanyuan Chen, Chengyi Wang, Zhengyang Chen, Yu Wu, Shujie Liu, Zhuo Chen, Jinyu Li, Naoyuki Kanda, Takuya Yoshioka, Xiong Xiao, Jian Wu, Long Zhou, Shuo Ren, Yanmin Qian, Yao Qian, Jian Wu, Michael Zeng, Xiangzhan Yu, Furu Wei

Ranks 1st on the SUPERB leaderboard and in the SLT 2022 SUPERB Challenge.
Ranks 1st on the VoxSRC 2021 speaker verification permanent leaderboard.
Powers all of the top 3 systems in the VoxSRC 2022 speaker verification challenge.

[Accepted in J-STSP] [code] [demo] [blog] [HuggingFace API] [TorchAudio API]
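
Since this entry links the HuggingFace and TorchAudio APIs, here is a minimal usage sketch for extracting WavLM features with the Hugging Face `transformers` library. The checkpoint name and the 16 kHz mono input are assumptions taken from the public model cards, not from this page.

```python
import torch
from transformers import AutoFeatureExtractor, WavLMModel

# Load a public WavLM checkpoint (name assumed from the Hugging Face hub).
extractor = AutoFeatureExtractor.from_pretrained("microsoft/wavlm-base-plus")
model = WavLMModel.from_pretrained("microsoft/wavlm-base-plus").eval()

waveform = torch.randn(16000)  # dummy 1-second, 16 kHz mono clip
inputs = extractor(waveform.numpy(), sampling_rate=16000, return_tensors="pt")
with torch.no_grad():
    features = model(**inputs).last_hidden_state  # (1, n_frames, 768)
```

The resulting frame-level representations are the kind of frozen features that SUPERB-style downstream tasks build on.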

UniSpeech-SAT: Universal Speech Representation Learning with Speaker Aware Pre-Training

Sanyuan Chen, Yu Wu, Chengyi Wang, Zhengyang Chen, Zhuo Chen, Shujie Liu, Jian Wu, Yao Qian, Furu Wei, Jinyu Li, Xiangzhan Yu

[Accepted in ICASSP 2022] [code] [demo] [slides] [poster]

Why does Self-Supervised Learning for Speech Recognition Benefit Speaker Recognition?

Sanyuan Chen, Yu Wu, Chengyi Wang, Shujie Liu, Zhuo Chen, Peidong Wang, Gang Liu, Jinyu Li, Jian Wu, Xiangzhan Yu, Furu Wei

[Accepted in INTERSPEECH 2022] [code]

Continuous Speech Separation with Conformer

Sanyuan Chen, Yu Wu, Zhuo Chen, Jian Wu, Jinyu Li, Takuya Yoshioka, Chengyi Wang, Shujie Liu, Ming Zhou

Ranks 1st in the VoxCeleb Speaker Recognition Challenge 2020.
Shipped in the Microsoft Conversation Transcription Service.

[Accepted in ICASSP 2021] [code] [demo] [slides] [poster]

Ultra Fast Speech Separation Model with Teacher Student Learning

Sanyuan Chen, Yu Wu, Zhuo Chen, Jian Wu, Takuya Yoshioka, Shujie Liu, Jinyu Li, Xiangzhan Yu

Shipped in the Microsoft Conversation Transcription Service.

[Accepted in INTERSPEECH 2021] [code]

Don’t shoot butterfly with rifles: Multi-channel Continuous Speech Separation with Early Exit Transformer

Sanyuan Chen, Yu Wu, Zhuo Chen, Takuya Yoshioka, Shujie Liu, Jinyu Li, Xiangzhan Yu

[Accepted in ICASSP 2021] [code] [slides] [poster]

C2C-GenDA: Cluster-to-Cluster Generation for Data Augmentation of Slot Filling

Yutai Hou*, Sanyuan Chen*, Wanxiang Che, Cheng Chen, Ting Liu

[Accepted in AAAI 2021] [code] [video] [blog]

Recall and Learn: Fine-tuning Deep Pretrained Language Models with Less Forgetting

Sanyuan Chen, Yutai Hou, Yiming Cui, Wanxiang Che, Ting Liu, Xiangzhan Yu

[Accepted in EMNLP 2020] [code] [video] [blog]