ARTI-6: Towards Six-dimensional Articulatory Speech Encoding

Submitted to ICASSP 2026

Paper: Click here

Code: Click here


Authors

Jihwan Lee1, Sean Foley1,2, Thanathai Lertpetchpun1, Kevin Huang1, Yoonjeong Lee1, Tiantian Feng1, Louis Goldstein2, Dani Byrd2, Shrikanth Narayanan1,2

1Signal Analysis and Interpretation Laboratory, University of Southern California, USA
2Department of Linguistics, University of Southern California, USA


Abstract

We propose ARTI-6, a compact six-dimensional articulatory speech encoding framework derived from real-time MRI data that captures crucial vocal tract regions including the velum, tongue root, and larynx. ARTI-6 consists of three components: (1) a six-dimensional articulatory feature set representing key regions of the vocal tract; (2) an articulatory inversion model, which predicts articulatory features from speech acoustics leveraging speech foundation models, achieving a prediction correlation of 0.87; and (3) an articulatory synthesis model, which reconstructs intelligible speech directly from articulatory features, showing that even a low-dimensional representation can generate natural-sounding speech. Together, ARTI-6 provides an interpretable, computationally efficient, and physiologically grounded framework for advancing articulatory inversion, synthesis, and broader speech technology applications. The source code and speech samples are publicly available.
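The abstract reports a prediction correlation of 0.87 for the inversion model but this page does not say how that figure is computed. As a minimal sketch, assuming a common convention (Pearson correlation computed per articulatory dimension over time, then averaged across the six dimensions), the metric could look like the following. The function name `mean_pearson_correlation` and the synthetic arrays are illustrative assumptions, not part of the released code.

```python
import numpy as np


def mean_pearson_correlation(predicted, reference):
    """Average per-dimension Pearson correlation of two (frames, dims) arrays.

    Assumption: this is one common way to score articulatory inversion;
    the ARTI-6 page does not specify its exact evaluation protocol.
    """
    predicted = np.asarray(predicted, dtype=float)
    reference = np.asarray(reference, dtype=float)
    if predicted.shape != reference.shape:
        raise ValueError("predicted and reference must have the same shape")

    # Center each articulatory dimension over time.
    p = predicted - predicted.mean(axis=0)
    r = reference - reference.mean(axis=0)

    # Pearson r per dimension: inner product divided by the product of norms.
    denom = np.linalg.norm(p, axis=0) * np.linalg.norm(r, axis=0)
    per_dim = (p * r).sum(axis=0) / denom
    return per_dim.mean(), per_dim


# Synthetic stand-in data: 200 frames of a six-dimensional trajectory
# and a noisy "prediction" of it (hypothetical values, not real rtMRI features).
rng = np.random.default_rng(0)
reference = rng.standard_normal((200, 6))
predicted = reference + 0.5 * rng.standard_normal((200, 6))

mean_r, per_dim_r = mean_pearson_correlation(predicted, reference)
print("per-dimension r:", np.round(per_dim_r, 3))
print("mean r:", round(float(mean_r), 3))
```

Reporting the per-dimension values alongside the mean is useful here because a single average can hide one poorly predicted articulatory region.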

Overview of ARTI-6

Speech Samples

Speaker ID 237 (Female)

Utterances:
(1) The shaggy coat of the prairie, which they lifted to make him a bed, has vanished forever.
(2) The wind was flapping her big hat and teasing a curl of her chestnut colored hair.
(3) His yellow canvas leggings and khaki trousers were splashed to the knees.

Intermediate Feature Type    Dimension Size
Ground-truth                 -
Mel-spectrogram [1]          80
EMA+pitch+loudness [2]       14
ARTI-6 (Ours)                6


Speaker ID 1580 (Female)

Utterances:
(1) "I am afraid there are no signs here," said he.
(2) I'll take the armchair in the middle.
(3) Well, then, I must make some suggestions to you.

Intermediate Feature Type    Dimension Size
Ground-truth                 -
Mel-spectrogram [1]          80
EMA+pitch+loudness [2]       14
ARTI-6 (Ours)                6


Speaker ID 260 (Male)

Utterances:
(1) The danger is approaching.
(2) During his watch I slept.
(3) Hans had spoken truly.

Intermediate Feature Type    Dimension Size
Ground-truth                 -
Mel-spectrogram [1]          80
EMA+pitch+loudness [2]       14
ARTI-6 (Ours)                6


Speaker ID 2830 (Male)

Utterances:
(1) The most they could claim is that they were sent by others.
(2) It must be watched.
(3) There may be something to that.

Intermediate Feature Type    Dimension Size
Ground-truth                 -
Mel-spectrogram [1]          80
EMA+pitch+loudness [2]       14
ARTI-6 (Ours)                6


References

[1] Jungil Kong, Jaehyeon Kim, and Jaekyoung Bae, “HiFi-GAN: Generative adversarial networks for efficient and high fidelity speech synthesis,” Advances in Neural Information Processing Systems, 2020.

[2] Cheol J Cho, Peter Wu, Tejas S Prabhune, Dhruv Agarwal, and Gopala K Anumanchipalli, “Coding speech through vocal tract kinematics,” IEEE Journal of Selected Topics in Signal Processing, 2024.