· 13 min read
Timofey Abramski
Eloy de Jong

Self-distillation training involves a student model learning from a teacher model that is maintained as an exponential moving average (EMA) of the student's weights. When scaling this approach across multiple GPUs, the challenge lies in efficiently distributing both networks while respecting their different update mechanisms—the student trains via backpropagation, while the teacher updates through EMA. We examine three distributed training strategies: (1) replicating both models with DDP, which is simple but memory-intensive; (2) sharding only the student with FSDP; and (3) identically sharding both student and teacher with FSDP, making the teacher EMA update purely local with no communication overhead. The key insight is that effective distributed training must align with the algorithm's structure. In this case, identical sharding naturally respects the EMA dependency between networks.
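To make the key insight concrete, here is a minimal sketch (ours, not the post's code) of the teacher update under strategy (3). Because both networks are wrapped in FSDP with the same sharding policy, each rank holds matching shards of every parameter, and the EMA step reduces to a local elementwise update with no inter-rank communication:

```python
import torch

@torch.no_grad()
def ema_update(student: torch.nn.Module, teacher: torch.nn.Module, decay: float = 0.999) -> None:
    # With identical FSDP wrapping, parameters() on each rank yields
    # matching shards, so this loop touches only local memory.
    for s, t in zip(student.parameters(), teacher.parameters()):
        t.mul_(decay).add_(s, alpha=1.0 - decay)
```

The function name and decay value are illustrative. Under strategies (1) and (2), by contrast, the full student parameters would have to be present (or gathered) on every rank before this update could run.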

· 13 min read
Aaron Ng

When talking to a voice AI, few things are as frustrating as getting interrupted just because you paused for a moment to think. As humans, we naturally pick up on a wide range of cues to know when someone is done speaking — body language, tone, context, even the subtle shifts in breathing.

But for a machine, this is a much tougher problem. Simply waiting for a fixed 500ms of silence isn't enough, and it's a quick way to drive users away.

This post is about making turn detection smarter. I'll walk you through why Semantic Turn Detection is a big upgrade over just listening for silence, how instruction-tuned Small Language Models (SLMs) fit the job perfectly, and share a practical code example using an open-source SLM to help you get started.
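As a flavour of the approach before the full walkthrough, the sketch below shows the core trick: ask a small causal LM how much probability it places on the turn ending after the transcript so far. The model name is a placeholder, and a real system would use the model's chat template and its specific end-of-turn token, but the shape of the computation is the same:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "HuggingFaceTB/SmolLM2-360M-Instruct"  # placeholder; any small instruct model
tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL).eval()

@torch.no_grad()
def end_of_turn_probability(transcript: str) -> float:
    """How much probability mass the SLM puts on the turn ending here."""
    inputs = tokenizer(transcript, return_tensors="pt")
    next_token_logits = model(**inputs).logits[0, -1]
    probs = torch.softmax(next_token_logits, dim=-1)
    return probs[tokenizer.eos_token_id].item()

# An unfinished clause should score lower than a complete one:
print(end_of_turn_probability("I'd like to book a table for"))
print(end_of_turn_probability("I'd like to book a table for two, please."))
```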

· 17 min read
Will Knottenbelt


Sesame AI has recently stirred up a huge amount of hype with their ultra-realistic, open-source Conversational Speech Model (CSM). While the model is impressive, they didn't release the training code, and there is strong demand for customization. This blog walks you through exactly how to fine-tune the CSM for any language or voice you desire!

· 10 min read
Seoirse Murray

Complete this sentence: "I love lamb..." What comes next? "It pairs so well with mint", or "it's perfect in a Sunday roast"? What if the conversation has been about poetry? In that case, this probably refers to Charles Lamb, and the phrase could continue "his poetry is so moving". And if the preceding conversation had been about the movie Anchorman, it may have been "I love lamp".

The important difference between these scenarios is the context.
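A toy sketch makes the point mechanically: feed the same phrase to a causal language model under two different contexts and the greedy continuation changes. The model choice and prompts here are arbitrary placeholders:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "gpt2"  # any causal LM illustrates the effect
tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL)

def continuation(context: str, phrase: str = ' "I love lamb') -> str:
    inputs = tokenizer(context + phrase, return_tensors="pt")
    out = model.generate(**inputs, max_new_tokens=8, do_sample=False)
    return tokenizer.decode(out[0][inputs["input_ids"].shape[1]:])

# Same phrase, different preceding context, different continuation:
print(continuation("We spent the evening discussing Romantic-era poets."))
print(continuation("We spent the evening planning a Sunday roast."))
```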

· 31 min read
David MacLeod

The All-Reduce collective is ubiquitous in distributed training, but is currently not supported for sparse CUDA tensors in PyTorch. In the first part of this blog we contrast the existing alternatives available in the Gloo/NCCL backends. In the second part we implement our own efficient sparse All-Reduce collective using PyTorch and CUDA.
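As a reference point for what any workaround must achieve, here is a naive sketch (ours, far slower than the collective the blog builds): gather every rank's indices and values, then coalesce duplicates into a summed sparse tensor:

```python
import torch
import torch.distributed as dist

def naive_sparse_all_reduce(t: torch.Tensor) -> torch.Tensor:
    """Sum a sparse COO tensor across ranks via object gathering."""
    t = t.coalesce()
    world_size = dist.get_world_size()
    gathered = [None] * world_size
    # all_gather_object pickles payloads through the CPU, which is
    # exactly the kind of overhead a real sparse All-Reduce avoids.
    dist.all_gather_object(gathered, (t.indices().cpu(), t.values().cpu()))
    indices = torch.cat([i for i, _ in gathered], dim=1)
    values = torch.cat([v for _, v in gathered])
    # coalesce() sums values that share an index across ranks.
    return torch.sparse_coo_tensor(indices, values, t.shape).coalesce().to(t.device)
```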

· 7 min read
Aaron Ng

Imagine being able to understand and interpret spoken language not only retrospectively, but as it happens. This isn't just a pipe dream — it's a reality we're crafting at Speechmatics.

Our mission is to deliver Speech Intelligence for the AI era, leveraging foundational speech technology and cutting-edge AI.

In 2023, we launched a series of Capabilities that look to do more with the spoken word. Moving beyond transcription, we're now offering powerful functionality that interprets, analyses and makes the spoken word more useful and valuable than ever before. So far, we've released Translation, Summaries, Sentiment, Chapters and Topics, but our journey has only just begun.

· 21 min read
Andrew Innes

Not everyone is able to write funky fused operators to make ML models run faster on GPUs using clever quantisation tricks. However, lots of developers work with algorithms that feel like they should be able to leverage the thousands of cores in a GPU to run faster than they do on the dozens of cores of a server CPU. To see what is possible and what is involved, I revisited the first problem I ever considered trying to accelerate with a GPU. What is unusual about my chosen problem is that it is officially pointless, so you ought not to be able to find any library that will accelerate this algorithm, because it isn't worth writing one! That makes it an interesting proxy for algorithms which aren't catered for by high-performance libraries written by experts, but can be structured to run thousands of threads in parallel.

· 8 min read
Theo Clark
Ellena Reid

As machine learning engineers increasingly adopt the Bitter Lesson and models grow in size, the cost of training them is also on the rise. A significant portion of the overall compute budget is frequently spent on hyperparameter tuning before launching a final training run. MuP offers the capability to transfer hyperparameters from a much smaller 'toy' model, leading to a substantial reduction in overall training cost.
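The mechanics are easy to caricature: under muP, as width grows, per-layer optimisation hyperparameters are rescaled so the optimum found on a narrow model stays (approximately) optimal. The snippet below is a deliberately simplified illustration of one such rule for Adam, with hidden-weight learning rates shrinking like 1/width; the full recipe, in the muP paper and library, also rescales initialisation and the output layer:

```python
import torch

def width_scaled_param_groups(model: torch.nn.Module, base_width: int, width: int, lr: float = 1e-3):
    """Toy muP-style rule: matrix-like (hidden) weights get lr * base_width / width,
    so a learning rate tuned at base_width transfers to a wider model."""
    groups = []
    for p in model.parameters():
        scale = base_width / width if p.ndim >= 2 else 1.0  # vectors/biases untouched
        groups.append({"params": [p], "lr": lr * scale})
    return groups

# optimiser = torch.optim.Adam(width_scaled_param_groups(model, base_width=64, width=2048))
```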

· 9 min read
Anartz Nuin

At Speechmatics, we wanted to present our real-time translation product in a straightforward yet impactful manner, demonstrating its exceptional capabilities. You can experience this firsthand on our website. Beyond showcasing real-time transcription and translation, our live demo addresses diverse user needs. For those who have hearing impairments, or find themselves in environments where audio isn't a viable option, our streaming server provides a text-based alternative, ensuring that no one is left out. Moreover, it bridges language barriers in situations where immediate translation is crucial.

· 11 min read
Andre Mansikkaniemi

Speaker diarization often complements automatic speech recognition (ASR) by determining "Who spoke when?". One intriguing advancement in the field is the adoption of Self-Supervised Learning (SSL). By harnessing vast amounts of unlabelled audio data, SSL manages to improve multiple downstream tasks, including ASR and diarization, using the same pre-trained model. As we explore in this blog, the synergy between SSL and traditional methods not only boosts ASR accuracy but also aids in improving speaker diarization results.
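To ground the idea, here is a toy sketch (not Speechmatics' pipeline) of SSL-based diarization: take frame embeddings from a pretrained self-supervised encoder, pool them into short segments, and cluster the segments into speakers. The checkpoint is a placeholder, and real systems add voice activity detection, overlap handling, and better segmentation:

```python
import torch
from sklearn.cluster import AgglomerativeClustering
from transformers import Wav2Vec2FeatureExtractor, Wav2Vec2Model

MODEL = "facebook/wav2vec2-base"  # placeholder SSL encoder
extractor = Wav2Vec2FeatureExtractor.from_pretrained(MODEL)
encoder = Wav2Vec2Model.from_pretrained(MODEL).eval()

@torch.no_grad()
def toy_diarize(waveform, sample_rate: int = 16000, num_speakers: int = 2):
    """Return one speaker label per ~1s segment of audio."""
    inputs = extractor(waveform, sampling_rate=sample_rate, return_tensors="pt")
    frames = encoder(**inputs).last_hidden_state[0]   # (frames, hidden), ~20ms per frame
    segments = frames.unfold(0, 50, 50).mean(dim=-1)  # pool into ~1s chunks
    return AgglomerativeClustering(n_clusters=num_speakers).fit_predict(segments.numpy())
```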