Recently we wanted to set up an audio transcription endpoint for our workflow with Bookscribe.ai. I’ve been using Huggingface in production for their accessible, low-cost inference servers, but there isn’t comprehensive documentation on how to customize your endpoint, so I thought I’d document my procedure here.
Preliminaries
The current go-to for English-language transcription is Whisper, with Large and Turbo models that perform quite well overall. They can be a bit compute-intensive though, and aren’t particularly fast. They also generate hallucinations if there is too much silence in the audio. The better alternative, which builds on the same high-quality Whisper training, is faster-whisper: it is much faster, particularly if you use a distilled model with batching, and it incorporates voice activity detection (VAD), which eliminates silence from consideration by the model.
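Before committing to an endpoint, it’s worth confirming the model behaves the way you expect on a local machine. Here’s a minimal sketch of running faster-whisper’s batched pipeline locally; the model name, file name, and batch size are illustrative, and it assumes faster-whisper is installed and a CUDA GPU is available.
# Local sanity check of faster-whisper's batched pipeline (illustrative values throughout)
from faster_whisper import WhisperModel, BatchedInferencePipeline
model = WhisperModel("distil-large-v3", device="cuda", compute_type="float16")
batched_model = BatchedInferencePipeline(model=model)
# VAD is enabled by default for the batched pipeline; batch_size trades memory for speed
segments, info = batched_model.transcribe("audio.mp3", batch_size=16)
print(" ".join(segment.text for segment in segments))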
The models, including faster-whisper, are all available on Huggingface. Their model library has an extensive set of models trained for various purposes, most with permissive licenses that support commercial uses. The base Whisper models are easy to get working out of the box, but more specialized models require some degree of customization.
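For a sense of what “out of the box” looks like, a stock Whisper deployment typically accepts the raw audio bytes directly in the request body. Something along these lines usually works (the URL, model, and token here are placeholders):
import requests
# Posting raw audio bytes to a stock Whisper deployment (placeholders throughout)
API_URL = "https://api-inference.huggingface.co/models/openai/whisper-large-v3"
headers = {"Authorization": "Bearer hf_token"}
with open("audio.mp3", "rb") as f:
    audio_bytes = f.read()
response = requests.post(API_URL, headers=headers, data=audio_bytes)
print(response.json()) # e.g. {"text": "..."}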
Customization requirements
To get faster-whisper working, we need two things:
1. a custom handler.py script hosted on the inference server
2. a specific method for sending data to the endpoint
The custom handler is referenced here and essentially tells the inference server how to handle the input it receives and pass it to the model. This means you should know how to use the model, what arguments it can take, etc., all of which can be worked out through testing on a local workstation (or on a Huggingface space or similar, though I don’t have particular experience with that).
The method for sending data to the endpoint involves HTTP requests, but the exact data structure it requires depends on your handler setup. Additionally, for our purposes we want to send audio, so we need to encode that as bytes.
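Concretely, the JSON payload we will end up sending looks roughly like this; the field names are the convention our handler below expects, not anything imposed by Huggingface.
# Rough shape of the payload our custom handler expects (our own field names)
payload = {
    "inputs": "<base64-encoded audio>",   # the audio, encoded as a base64 string
    "batched": True,                      # whether to use the batched pipeline
    "parameters": {"language": "en"},     # extra arguments forwarded to the model
}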
How to write a handler script
The following code can be used in a handler.py script that you would upload to your model base directory on Huggingface. The first code block imports faster-whisper, typing, base64 (to decode the encoded audio we receive), and the ffmpeg_read utility, which uses ffmpeg to convert any input into the audio array the transcription model will read.
# handler.py (for handling asr with faster_whisper)
import base64 # to decode the base64-encoded audio sent to the endpoint
from faster_whisper import WhisperModel, BatchedInferencePipeline
from typing import Any, Dict, List
from transformers.pipelines.audio_utils import ffmpeg_read

The next code block is where we create the class for handling the input. Here we define some variables/arguments that faster_whisper uses under the hood: model_size refers to the particular model being used, and arguments include whether we’re using a gpu, which we are (cuda), and the level of precision used (float16). In initialization we actually create two different versions of the model: a base model and a “batched” model. The faster_whisper batched model is much faster and uses VAD by default, with various parameters that can impact its performance. The base model is used as a fallback in case the batched model isn’t working so well.
The actual call reads the input sent to the endpoint, decodes the encoded audio, picks up any additional parameters being passed to the model, and returns the result as a complete string.
class EndpointHandler:
    def __init__(self, path=""):
        self.model_size = "distil-large-v3" # the distilled whisper v3 model
        self.model = WhisperModel(self.model_size, device="cuda", compute_type="float16") # the base faster_whisper model
        self.batched_model = BatchedInferencePipeline(model=self.model) # the batched faster_whisper model

    def __call__(self, data: Dict[str, Any]) -> str:
        """
        Args:
            data (:obj:`dict`):
                includes the base64 encoded audio file as 'inputs',
                whether to use batching as the 'batched' argument,
                and any additional arguments as a 'parameters' dict
        Return:
            segments of transcribed text, joined
        """
        # process input
        inputs = data.pop("inputs", data)
        audio_bytes = base64.b64decode(inputs) # decode the base64 string back into raw audio bytes
        audio_nparray = ffmpeg_read(audio_bytes, 16000) # read the audio and convert at 16k
        # Retrieve custom arguments
        batched = data.pop("batched", True) # default is True if not specified
        params = data.pop("parameters", {}) # all parameters for the model
        if batched:
            segments, info = self.batched_model.transcribe(audio_nparray, **params)
        else:
            segments, info = self.model.transcribe(audio_nparray, beam_size=5)
        segments = [segment.text for segment in segments]
        return " ".join(segments)
Sending data
The faster_whisper model has the option of implementing Voice Activity Detection (VAD). The benefit of VAD is that only segments of audio determined to contain vocal activity will be considered by the Whisper model, greatly reducing the potential for hallucinations in transcription. This is enabled for the batched model automatically, and has some default parameters. In the handler we allow for parameters to be passed to the batched model, but not to the base model. This can easily be modified for your particular use-case, but for our purposes we are only interested in modifying the VAD parameters of the batched model.
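If you did want the non-batched fallback to filter silence as well, faster_whisper exposes the same machinery there through vad_filter and vad_parameters. A standalone sketch of that variation, with the model and file names purely illustrative:
# Optional variation: enabling VAD on the plain (non-batched) model
# (in the handler this would amount to adjusting the else branch)
from faster_whisper import WhisperModel
model = WhisperModel("distil-large-v3", device="cuda", compute_type="float16")
segments, info = model.transcribe(
    "audio.mp3",
    beam_size=5,
    vad_filter=True, # off by default for the non-batched model
    vad_parameters={"min_silence_duration_ms": 500}, # other fields like threshold or speech_pad_ms work the same way
)
print(" ".join(segment.text for segment in segments))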
The following Python code is a possible way to send and receive data via the endpoint configured using our custom handler. First we have some imports: a library to make the requests, another to encode the data, and another to store it in json objects. Next we instantiate our main variables: the endpoint url, our Huggingface token, and the audio we want to send. Here we’re using an MP3, but this should work with any other audio format (since our server handler is using ffmpeg to decode it).
import requests, base64, json
ENDPOINT_URL = "endpoints.huggingface.cloud" # 🌐 replace with your URL endpoint
HF_TOKEN = "hf_token" # 🔑 replace with your HF token
AUDIO_FILE = "audio.mp3" # 🔊 path to your local audio file

The next section of code defines the VAD parameters that we want to pass to the model, as well as the headers for the inference call. For VAD parameters we just use a default parameter for the minimum allowed silence in an audio segment (check out other possible parameters here). We bundle this into another params dict that includes additional parameters we want to pass to the model.
vad_params = {
    "min_silence_duration_ms": 500,
}
params = { # dict of parameters for faster_whisper transcription
    "parameters": {"language": "en", "vad_parameters": vad_params},
    # whether or not to use batched mode (defaults to True)
    "batched": True,
}
headers = {"Authorization": f"Bearer {HF_TOKEN}"}

Our next segment of code defines the function that we will use to send information to the endpoint and get the transcription back. This takes an audio file and the header information, as well as a dict containing additional parameters, with the audio and parameters being sent in json format.
def trans_fast(audiofile, headers, params):
    """
    audiofile: path to audio file
    headers: dict containing the authorization header for the endpoint
    params: dict containing
        - 'parameters' dict to pass to transcription model
        - 'batched' argument (optional), defaults to True to use faster_whisper batched inference
    """
    with open(audiofile, "rb") as f:
        data = f.read() # read the file
    encoded_audio = base64.b64encode(data).decode('utf-8') # encode in b64 to send for transcription
    # Send audio bytes and params
    payload = {
        "inputs": encoded_audio,
        **params
    }
    response = requests.post(ENDPOINT_URL, headers=headers, json=payload)
    return response.json()

Finally, we can call the function with the audio file, the headers, and the parameters dict to get a transcript back.
# Example usage
transcript = trans_fast(AUDIO_FILE, headers, params)
print(transcript)

Conclusion
There are various ways to configure an endpoint, and this is a simple handler for a particular use-case. It may be useful for you if you are doing transcription as part of your workflow and would like to use a custom Whisper model on Huggingface. Other kinds of customization would depend on the particular model you’re using and for what purpose. I’ll be revisiting this if/when our needs and configuration requirements change.