Recently we wanted to set up an audio transcription endpoint for our workflow with Bookscribe.ai. I’ve been using Huggingface in production for their accessible, low-cost inference servers, but there isn’t comprehensive documentation on how to customize your endpoint, so I thought I’d document my procedure here.
Preliminaries
The current go-to for English-language transcription is Whisper, with Large and Turbo models that perform quite well overall. They can be a bit compute-intensive though, and aren’t particularly fast. They also generate hallucinations if there is too much silence in the audio. The better alternative, which builds on the same high-quality Whisper training, is faster-whisper: it is much faster, particularly if you use a distilled model with batching, and it incorporates voice activity detection (VAD), which eliminates silence from consideration by the model.
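Before committing to an endpoint, it’s worth confirming the model behaves the way you expect on a local machine. Here’s a minimal sketch of running faster-whisper’s batched pipeline locally; the model name, file name, and batch size are illustrative, and it assumes faster-whisper is installed and a CUDA GPU is available.
# Local sanity check of faster-whisper's batched pipeline (illustrative values throughout)
from faster_whisper import WhisperModel, BatchedInferencePipeline
model = WhisperModel("distil-large-v3", device="cuda", compute_type="float16")
batched_model = BatchedInferencePipeline(model=model)
# VAD is enabled by default for the batched pipeline; batch_size trades memory for speed
segments, info = batched_model.transcribe("audio.mp3", batch_size=16)
print(" ".join(segment.text for segment in segments))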
The models, including faster-whisper, are all available on Huggingface. Their model library has an extensive set of models trained for various purposes, most with permissive licenses that support commercial uses. The base Whisper models are easy to get working out of the box, but more specialized models require some degree of customization.
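For a sense of what “out of the box” looks like, a stock Whisper deployment typically accepts the raw audio bytes directly in the request body. Something along these lines usually works (the URL, model, and token here are placeholders):
import requests
# Posting raw audio bytes to a stock Whisper deployment (placeholders throughout)
API_URL = "https://api-inference.huggingface.co/models/openai/whisper-large-v3"
headers = {"Authorization": "Bearer hf_token"}
with open("audio.mp3", "rb") as f:
    audio_bytes = f.read()
response = requests.post(API_URL, headers=headers, data=audio_bytes)
print(response.json()) # e.g. {"text": "..."}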
Customization requirements
To get faster-whisper working, we need two things:
1. a custom handler.py script hosted on the inference server
2. a specific method for sending data to the endpoint
The custom handler is referenced here and essentially tells the inference server how to handle the input it receives and pass it to the model. This means you should know how to use the model, what arguments it can take, etc., all of which can be worked out through testing on a local workstation (or on a Huggingface space or similar, though I don’t have particular experience with that).
The method for sending data to the endpoint involves HTTP requests, but the exact data structure it requires depends on your handler setup. Additionally, for our purposes we want to send audio, so we need to encode that as bytes.
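Concretely, the JSON payload we will end up sending looks roughly like this; the field names are the convention our handler below expects, not anything imposed by Huggingface.
# Rough shape of the payload our custom handler expects (our own field names)
payload = {
    "inputs": "<base64-encoded audio>",   # the audio, encoded as a base64 string
    "batched": True,                      # whether to use the batched pipeline
    "parameters": {"language": "en"},     # extra arguments forwarded to the model
}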
How to write a handler script
The following code can be used in a handler.py script that you would upload to your model base directory on Huggingface. The first code block imports faster-whisper, typing, base64 (to decode the encoded audio we receive), and the ffmpeg_read utility, which uses ffmpeg to convert any input into the audio array the transcription model will read.
# handler.py (for handling asr with faster_whisper)
import base64 # to decode the base64-encoded audio sent to the endpoint
from faster_whisper import WhisperModel, BatchedInferencePipeline
from typing import Any, Dict, List
from transformers.pipelines.audio_utils import ffmpeg_read

The next code block is where we create the class for handling the input. Here we define some variables/arguments that faster_whisper uses under the hood: model_size refers to the particular model being used, and arguments include whether we’re using a gpu, which we are (cuda), and the level of precision used (float16). In initialization we actually create two different versions of the model: a base model and a “batched” model. The faster_whisper batched model is much faster and uses VAD by default, with various parameters that can impact its performance. The base model is used as a fallback in case the batched model isn’t working so well.
The actual call reads the input sent to the endpoint, decodes the encoded audio, picks up any additional parameters being passed to the model, and returns the result as a complete string.
class EndpointHandler:
    def __init__(self, path=""):
        self.model_size = "distil-large-v3" # the distilled whisper v3 model
        self.model = WhisperModel(self.model_size, device="cuda", compute_type="float16") # the base faster_whisper model
        self.batched_model = BatchedInferencePipeline(model=self.model) # the batched faster_whisper model

    def __call__(self, data: Dict[str, Any]) -> str:
        """
        Args:
            data (:obj:`dict`):
                includes the base64 encoded audio file as 'inputs',
                whether to use batching as the 'batched' argument,
                and any additional arguments as a 'parameters' dict
        Return:
            segments of transcribed text, joined
        """
        # process input
        inputs = data.pop("inputs", data)
        audio_bytes = base64.b64decode(inputs) # decode the base64 string back into raw audio bytes
        audio_nparray = ffmpeg_read(audio_bytes, 16000) # read the audio and convert at 16k
        # Retrieve custom arguments
        batched = data.pop("batched", True) # default is True if not specified
        params = data.pop("parameters", {}) # all parameters for the model
        if batched:
            segments, info = self.batched_model.transcribe(audio_nparray, **params)
        else:
            segments, info = self.model.transcribe(audio_nparray, beam_size=5)
        segments = [segment.text for segment in segments]
        return " ".join(segments)
Sending data
The faster_whisper model has the option of implementing Voice Activity Detection (VAD). The benefit of VAD is that only segments of audio determined to contain vocal activity will be considered by the Whisper model, greatly reducing the potential for hallucinations in transcription. This is enabled for the batched model automatically, and has some default parameters. In the handler we allow for parameters to be passed to the batched model, but not to the base model. This can easily be modified for your particular use-case, but for our purposes we are only interested in modifying the VAD parameters of the batched model.
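If you did want the non-batched fallback to filter silence as well, faster_whisper exposes the same machinery there through vad_filter and vad_parameters. A standalone sketch of that variation, with the model and file names purely illustrative:
# Optional variation: enabling VAD on the plain (non-batched) model
# (in the handler this would amount to adjusting the else branch)
from faster_whisper import WhisperModel
model = WhisperModel("distil-large-v3", device="cuda", compute_type="float16")
segments, info = model.transcribe(
    "audio.mp3",
    beam_size=5,
    vad_filter=True, # off by default for the non-batched model
    vad_parameters={"min_silence_duration_ms": 500}, # other fields like threshold or speech_pad_ms work the same way
)
print(" ".join(segment.text for segment in segments))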
The following Python code is a possible way to send and receive data via the endpoint configured using our custom handler. First we have some imports: a library to make the requests, another to encode the data, and another to store it in json objects. Next we instantiate our main variables: the endpoint url, our Huggingface token, and the audio we want to send. Here we’re using an MP3, but this should work with any other audio format (since our server handler is using ffmpeg to decode it).
import requests, base64, json
ENDPOINT_URL = "endpoints.huggingface.cloud" # 🌐 replace with your URL endpoint
HF_TOKEN = "hf_token" # 🔑 replace with your HF token
AUDIO_FILE = "audio.mp3" # 🔊 path to your local audio file

The next section of code defines the VAD parameters that we want to pass to the model, as well as the headers for the inference call. For VAD parameters we just use a default parameter for the minimum allowed silence in an audio segment (check out other possible parameters here). We bundle this into another params dict that includes additional parameters we want to pass to the model.
vad_params = {
    "min_silence_duration_ms": 500,
}
params = { # dict of parameters for faster_whisper transcription
    "parameters": {"language": "en", "vad_parameters": vad_params},
    # whether or not to use batched mode (defaults to True)
    "batched": True,
}
headers = {"Authorization": f"Bearer {HF_TOKEN}"}

Our next segment of code defines the function that we will use to send information to the endpoint and get the transcription back. This takes an audio file and the header information, as well as a dict containing additional parameters, with the audio and parameters being sent in json format.
def trans_fast(audiofile, headers, params):
    """
    audiofile: path to audio file
    headers: dict containing the authorization header for the endpoint
    params: dict containing
        - 'parameters' dict to pass to transcription model
        - 'batched' argument (optional), defaults to True to use faster_whisper batched inference
    """
    with open(audiofile, "rb") as f:
        data = f.read() # read the file
    encoded_audio = base64.b64encode(data).decode('utf-8') # encode in b64 to send for transcription
    # Send audio bytes and params
    payload = {
        "inputs": encoded_audio,
        **params
    }
    response = requests.post(ENDPOINT_URL, headers=headers, json=payload)
    return response.json()

Finally, we can call the function with the audio file, the headers, and the parameters dict to get a transcript back.
# Example usage
transcript = trans_fast(AUDIO_FILE, headers, params)
print(transcript)

Conclusion
There are various ways to configure an endpoint, and this is a simple handler for a particular use-case. It may be useful for you if you are doing transcription as part of your workflow and would like to use a custom Whisper model on Huggingface. Other kinds of customization would depend on the particular model you’re using and for what purpose. I’ll be revisiting this if/when our needs and configuration requirements change.