How would you describe them?
An end-to-end framework for multi-speaker transcription that jointly models who spoke, when, and what.
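
To make "who spoke, when, and what" concrete, here is a minimal sketch of what the output of such a framework could look like: one record per utterance holding a speaker label, a time span, and the transcribed words. The `Segment` class, the `render` helper, the `spk1`/`spk2` labels, and the sample utterances are all hypothetical and for illustration only; they are not the framework's actual data model or API.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    """One speaker-attributed utterance (hypothetical structure)."""
    speaker: str  # who spoke (illustrative label, not a real speaker ID scheme)
    start: float  # when the utterance begins, in seconds
    end: float    # when the utterance ends, in seconds
    text: str     # what was said


def render(segments):
    """Return the segments as time-ordered, speaker-tagged transcript lines."""
    ordered = sorted(segments, key=lambda s: s.start)
    return [f"[{s.start:.1f}-{s.end:.1f}] {s.speaker}: {s.text}" for s in ordered]


# Made-up sample data, deliberately out of order to show the sort.
segments = [
    Segment("spk2", 2.4, 4.0, "sure, go ahead"),
    Segment("spk1", 0.0, 2.1, "can I ask a question"),
]
lines = render(segments)
for line in lines:
    print(line)
```

A joint model would produce all three fields together, rather than running separate diarization and recognition stages and merging their outputs afterwards.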