Clean up the autogenerated transcript using string find-and-replace
When you generate the transcript using the Whisper library, you will notice that it makes mistakes in recognizing some words, especially technical jargon.
However, it usually makes the same mistake for that term every time (because the video creator is probably pronouncing it in the same way each time!).
In practice, this means you can fix most of these errors quite easily by doing a global find-and-replace on frequently misrecognized words.
While you are at it, you can also fix the capitalization of domain-specific terms. Fixing capitalization alone will make your transcript a lot more readable.
On top of that, it also creates better input for the next step.
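To see the idea on a single string first, here is a minimal sketch. The sample sentence and the helper name `apply_replacements` are made up for illustration and are not part of the original script. It also shows why the order of the pairs matters: the longer phrase has to be replaced before the shorter one it contains.

```python
def apply_replacements(text, replacements):
    """Apply each (source, target) pair in order using plain str.replace."""
    for source, target in replacements:
        text = text.replace(source, target)
    return text


# Hypothetical example of what a raw transcript line might look like.
sample = "I used whisper to transcribe a video about Palm II and Palm."

replacements = [
    ("whisper", "Whisper"),
    ("Palm II", "PaLM-2"),  # longer phrase first ...
    ("Palm", "PaLM"),       # ... otherwise "Palm II" would become "PaLM II"
]

cleaned = apply_replacements(sample, replacements)
print(cleaned)
# I used Whisper to transcribe a video about PaLM-2 and PaLM.
```

Note that `str.replace` is case-sensitive and matches substrings, so a pair like `("whisper", "Whisper")` would also change a word such as "whispered". For a short list of technical terms this is usually acceptable, but it is worth skimming the output once.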
import os

# (misrecognized text, corrected text) pairs, applied in order.
# Put longer phrases first: "Palm II" must be replaced before "Palm".
gpt_replacements = [
    ("whisper", "Whisper"),
    ("Palm II", "PaLM-2"),
    ("Palm", "PaLM"),
]
replacements = gpt_replacements

folder_name = 'gpt_for_online_courses'
original_dir = f'results/{folder_name}'
output_dir = f'gpt_replaced/{folder_name}'
# Create the output folder if it does not exist yet, so open() does not fail.
os.makedirs(output_dir, exist_ok=True)

for filename in os.listdir(original_dir):
    if filename.endswith('.txt'):
        print(f'Started processing {filename}')
        full_name = f'{original_dir}/{filename}'
        with open(full_name, 'r', encoding='utf-8') as f:
            transcript = f.read()
        # Plain, case-sensitive substring replacement.
        for source, target in replacements:
            transcript = transcript.replace(source, target)
        print(transcript)
        filename_to_save = f'{output_dir}/{filename}'
        with open(filename_to_save, 'w', encoding='utf-8') as f:
            f.write(transcript)
You can see that the code is very simple, but make a note of the folder where the output is saved – the “gpt_replaced” folder will be used in the next step.
(If you change it to a custom folder name, remember to use that custom folder name in the next step.)
Also, I use two different variables, gpt_replacements and replacements, so that I can use topic-specific word replacements without having to rewrite much code.
Also, keep a backup of the folder before you run this script. Otherwise you would have to repeat the previous step, which can be the most time-consuming part of this workflow.
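One possible way to organize topic-specific lists is sketched below. The names `common_replacements`, `topic_replacements` and `replacements_for`, as well as the second topic and its word pair, are hypothetical and only meant to show the shape of the idea.

```python
# Replacements that apply to every topic.
common_replacements = [
    ("whisper", "Whisper"),
]

# One list per topic folder. The second topic is a made-up example.
topic_replacements = {
    "gpt_for_online_courses": [
        ("Palm II", "PaLM-2"),
        ("Palm", "PaLM"),
    ],
    "dialogflow_course": [
        ("dialogue flow", "Dialogflow"),
    ],
}


def replacements_for(folder_name):
    """Return the common pairs followed by the pairs for the given topic."""
    return common_replacements + topic_replacements.get(folder_name, [])


# Only this line changes when you switch to a different topic.
replacements = replacements_for("gpt_for_online_courses")
print(replacements)
```

With this layout the processing loop stays the same and only reads `replacements`.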
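A backup can be as simple as copying the folder by hand, but here is a minimal sketch that does it from Python. The function name `backup_folder` and the `backups` destination are assumptions for illustration; adjust the paths to your own layout.

```python
import shutil
from datetime import datetime
from pathlib import Path


def backup_folder(source_dir, backup_root="backups"):
    """Copy source_dir into backup_root under a timestamped name."""
    source = Path(source_dir)
    stamp = datetime.now().strftime("%Y%m%d_%H%M%S")
    destination = Path(backup_root) / f"{source.name}_{stamp}"
    # copytree creates the destination (and missing parents) and
    # raises an error if the destination already exists.
    shutil.copytree(source, destination)
    return destination


if __name__ == "__main__":
    transcripts = Path("results/gpt_for_online_courses")
    if transcripts.is_dir():
        print(f"Backed up to {backup_folder(transcripts)}")
```

The timestamp in the folder name means that repeated runs do not overwrite an earlier backup.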