Use Whisper speech recognition library to convert mp3 audio to text

Once you generate a list of mp3 audio files, you can use the Whisper library to convert these audio files to text.

In other words, Whisper performs the speech recognition and produces an autogenerated transcript.

The generated transcripts are saved under a directory called results. For each mp3, the script produces two files: a JSON file containing the complete response from the transcribe method, and a .txt file containing only the transcript.

The text file is simply the ‘text’ field of the JSON response from the transcribe method.

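import os
import whisper
import json
import time

folder_name = '<your_mp3_folder>'
audio_dir = f'audio/{folder_name}'
results_dir = f'results/{folder_name}'
# Create the output directory up front so that open() does not fail
os.makedirs(results_dir, exist_ok=True)

model = whisper.load_model("small.en")
# fp16=False avoids the half-precision warning when running on a CPU
decode_options = dict(fp16=False)
start_time = time.time()
for filename in os.listdir(audio_dir):
    if filename.endswith('.mp3'):
        audio_path = os.path.join(audio_dir, filename)
        if os.path.isfile(audio_path):
            result = model.transcribe(audio_path, **decode_options)
            result_text = result['text']
            # Complete response from the transcribe method
            with open(f'{results_dir}/{filename}.json', 'w') as json_file:
                json.dump(result, json_file, indent=2)
            # Transcript only
            with open(f'{results_dir}/{filename}.txt', 'w') as txt_file:
                txt_file.write(result_text)
            print(f'Time elapsed = {time.time() - start_time}')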
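
Since the .txt file is just the ‘text’ field of the JSON response, it can be rebuilt from a saved JSON file without running the transcription again. The following is a minimal sketch of that idea. The helper names transcript_from_json and write_txt_from_json are my own (hypothetical), and the sketch strips surrounding whitespace from the text, which the script above does not do.

```python
import json
from pathlib import Path


def transcript_from_json(json_path):
    """Return the transcript stored under the 'text' key of a saved result."""
    with open(json_path, encoding="utf-8") as fh:
        result = json.load(fh)
    # Assumption: the text may carry leading or trailing whitespace, so strip it.
    return result["text"].strip()


def write_txt_from_json(json_path):
    """Write <name>.txt next to <name>.json and return the new path."""
    txt_path = Path(json_path).with_suffix(".txt")
    txt_path.write_text(transcript_from_json(json_path), encoding="utf-8")
    return txt_path
```

For a file saved as results/<your_mp3_folder>/clip.mp3.json, write_txt_from_json would write clip.mp3.txt in the same directory, which matches the naming used by the script above.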

A couple of things to note here:

  • when you run this script for the first time, it will download the “small.en” model, which is fairly large, so expect this step to take a few minutes
  • the transcription itself is quite slow: it can take a sizable fraction of the duration of the audio file. In other words, the longer your original video, the more time the transcription will take.

So do not worry if it looks like the script is not doing anything.
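
If the silence is unsettling, one option is to print a line before and after each file so you can see that the script is still working. The sketch below is only an illustration: transcribe_all is a hypothetical helper of mine, and it takes the transcription function as an argument so that it does not depend on how you loaded the model.

```python
import os
import time


def transcribe_all(paths, transcribe_fn):
    """Call transcribe_fn on each path, printing progress and per-file timing."""
    results = {}
    total = len(paths)
    for index, path in enumerate(paths, start=1):
        name = os.path.basename(path)
        print(f"[{index}/{total}] transcribing {name} ...", flush=True)
        started = time.time()
        results[path] = transcribe_fn(path)
        print(f"[{index}/{total}] done in {time.time() - started:.1f}s", flush=True)
    return results
```

With the model from the script above, you could call it as transcribe_all(mp3_paths, lambda p: model.transcribe(p, fp16=False)), where mp3_paths is your own list of file paths.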

I recommend first running it on a very small mp3 file to verify that everything works as expected, and only then using this script for larger audio files.
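
To pick a test file, you could look for the smallest mp3 in your audio folder. Here is a rough sketch, assuming that file size is a good enough stand-in for audio length; smallest_mp3 is a hypothetical helper name, not part of Whisper.

```python
import os


def smallest_mp3(audio_dir):
    """Return the path of the smallest .mp3 file in audio_dir, or None."""
    candidates = [
        os.path.join(audio_dir, name)
        for name in os.listdir(audio_dir)
        if name.lower().endswith(".mp3")
    ]
    # Skip anything that is not a regular file (for example a directory).
    candidates = [path for path in candidates if os.path.isfile(path)]
    if not candidates:
        return None
    return min(candidates, key=os.path.getsize)
```

You can then pass the returned path to model.transcribe and check the output before running the full loop.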

About this website

BotFlo [1] was created by Aravind Mohanoor as a website which provided training and tools for non-programmers who were [2] building Dialogflow chatbots.

This website has now expanded into other topics in Natural Language Processing, including the recent Large Language Models (GPT etc.), with a special focus on helping non-programmers identify and use the right tool for their specific NLP task.

For example, when not to use GPT

[1] BotFlo was previously called MiningBusinessData. That is why you see that name in many videos.

[2] And still are building Dialogflow chatbots. Dialogflow ES first evolved into Dialogflow CX, and Dialogflow CX itself evolved to add Generative AI features in mid-2023.