Automatic Transcription
It's not perfect, but if you've got a lot of material, it'll get you well on your way.
Using Mozilla DeepSpeech to Transcribe Audio
As a historian, you may have a number of audio files (such as recordings from an oral history project) that you need to transcribe so that you can better search or study the materials.
There is no substitute for listening carefully and transcribing manually (to capture inflections, silences, and so on), but there may be times when you need to give yourself a head start. There are a variety of speech-to-text technologies one might use, but in this walkthrough we're going to use a pre-trained neural network model trained on an enormous corpus of transcribed audio. This particular neural network approach handles 'noisy' data (Hannun et al., 2014) and is the basis for Mozilla's DeepSpeech model.
To use it, we’ll need a folder of audio files. These get run through the neural network model file, and evaluated against the ‘scorer’ file, to transform audio signals into English text.
The code that Mozilla provides can be tweaked, optimized, and re-trained for a different alphabet and so on; all of that is beyond the purview of this document.
Here, I’m assuming you have English audio to be transcribed into English text.
Set up a new ‘environment’
I'm assuming you have Anaconda or Miniconda on your machine. We want to make a new 'environment' in order to install all the bits and pieces DeepSpeech needs, without interfering with other packages you might have installed.
Open your terminal or anaconda command prompt and create a new environment:
conda create --name transcribe python=3.8
Accept all the defaults. You're creating a folder on your machine into which everything you'll need will be installed. Then we'll tell your machine to use the Python in that folder to run the commands you give it. When we're done, you'll tell your machine to stop using that folder and revert to your default Python. Thus:
conda activate transcribe
when we want to start using DeepSpeech, and
conda deactivate
when we're finished (no environment name needed to deactivate). Incidentally, you can get a list of all the environments on your machine with conda env list.
Install DeepSpeech Code
Mozilla makes a Python package for interacting with its models. We can use pip3 to get what we need:
pip3 install deepspeech
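To check that the install worked, you can ask the deepspeech client for its version; if this prints a version number, you're set:

deepspeech --version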
Then, we can get the model file and the scorer with curl or wget:
curl -LO https://github.com/mozilla/DeepSpeech/releases/download/v0.9.3/deepspeech-0.9.3-models.pbmm
curl -LO https://github.com/mozilla/DeepSpeech/releases/download/v0.9.3/deepspeech-0.9.3-models.scorer
or, with wget: wget https://github.com/mozilla/DeepSpeech/releases/download/v0.9.3/deepspeech-0.9.3-models.scorer
These are very large files, by the way, and depending on your connection, might take a bit of time.
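If you want to confirm the downloads completed properly, listing the files with their sizes is a quick sanity check (both files run to hundreds of megabytes or more):

ls -lh deepspeech-0.9.3-models.pbmm deepspeech-0.9.3-models.scorer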
Prepare your files
Now, if your files are already in *.wav format, you're good to go. But if they're not, ffmpeg is your friend. Windows users, follow these steps. Mac users can use Homebrew: brew install ffmpeg.
Let's assume you, like me, have a folder of mp4 files. Open a command prompt/terminal prompt in that folder. The following one-liner looks for every file in the folder that has the mp4 file extension. It then calls on ffmpeg to transform each mp4 into a wav file, one at a time, preserving the file name so you know which .wav derives from which .mp4. Having done all that, the iteration is closed with done.
for f in *.mp4; do ffmpeg -i "$f" -vn "${f%.mp4}.wav"; done
Select all the wav files and move them into their own folder. I would suggest having things arranged like this:
-deepspeech-work
|
|-original-mp4-files
|-converted-wav-files
|-deepspeech-0.9.3-models.pbmm
|-deepspeech-0.9.3-models.scorer
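If you're starting fresh, you can create that layout at the command line. This sketch assumes your mp4s and the two downloaded model files are sitting in your current directory:

mkdir -p deepspeech-work/original-mp4-files deepspeech-work/converted-wav-files
mv *.mp4 deepspeech-work/original-mp4-files/
mv deepspeech-0.9.3-models.pbmm deepspeech-0.9.3-models.scorer deepspeech-work/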
NB: Technically, DeepSpeech expects a wav file with a mono channel and 16 kHz sampling. If, when you run the commands in the 'Transcribe' sections below, you get some kind of error about your wav file, that might be the issue. You can convert your file appropriately with ffmpeg like this:
ffmpeg -i file-to-be-fixed.wav -acodec pcm_s16le -ac 1 -ar 16000 fixed-version.wav
That command tells ffmpeg to take file-to-be-fixed.wav, mix the audio down to a mono channel, resample the result at 16 kHz, and write it to a new file, fixed-version.wav.
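If several of your files need fixing, you can wrap that command in the same sort of loop we used earlier. This sketch writes the fixed copies into a new fixed-wav-files folder, so your originals are left untouched:

mkdir -p fixed-wav-files
for f in converted-wav-files/*.wav; do ffmpeg -i "$f" -acodec pcm_s16le -ac 1 -ar 16000 "fixed-wav-files/$(basename "$f")"; done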
Transcribe One File!
First, I’ll show you how to use DeepSpeech on one file; you’ll do this to get a sense of how it works, and how long things will take.
In what follows, I’m assuming you have your files/folders arranged as in my example above.
Make sure, in your terminal or command prompt, that you are in your deepspeech-work folder. At the prompt, we use the deepspeech command (that we installed with pip3 earlier). We tell it which model to use, which scorer to use, and then which audio file to drop through the whole process:
deepspeech --model deepspeech-0.9.3-models.pbmm --scorer deepspeech-0.9.3-models.scorer --audio converted-wav-files/example1.wav
After a few moments, you’ll see text being printed into your terminal/command prompt window. Hooray!
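If you'd like a rough sense of how long your whole folder will take, prefix the command with time and multiply the result by your number of files (timings will of course vary a lot from machine to machine):

time deepspeech --model deepspeech-0.9.3-models.pbmm --scorer deepspeech-0.9.3-models.scorer --audio converted-wav-files/example1.wav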
Transcribe Many Files!
Now we want to run the command and have it iterate over every file in the folder. Not only that, we want it to write the text to a file, rather than to the terminal/command prompt window. AND we want it to append each audio file's name after its transcription, so you know which text came from which recording. We're going to combine that 'for f in *.mp4' business from before with the deepspeech command.
Assuming you're in the deepspeech-work folder, we do this:
for f in converted-wav-files/*.wav; do deepspeech --model deepspeech-0.9.3-models.pbmm --scorer deepspeech-0.9.3-models.scorer --audio "$f" >> result.txt; echo "$f" >> result.txt; done
The first part tells your machine which files you want to work on and sets up a loop over each file in converted-wav-files; the deepspeech command drops each wav file in turn through the model and the scorer; the >> result.txt appends the transcribed text to a result.txt file; and then the echo appends the original source file name below its transcription. Then the next file gets transcribed.
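If the one-liner is hard to parse, here is the same loop spread over several lines, with comments; the behaviour is identical:

for f in converted-wav-files/*.wav; do
  # run one wav file through the model and scorer,
  # appending the transcribed text to result.txt
  deepspeech --model deepspeech-0.9.3-models.pbmm \
             --scorer deepspeech-0.9.3-models.scorer \
             --audio "$f" >> result.txt
  # append the source file name below its transcription
  echo "$f" >> result.txt
done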
The resulting file result.txt has your transcriptions for the whole folder!
Your Next Step
Perhaps you want to bring all of the transcriptions into R or into Excel. In a text editor, you can use a regex find-and-replace to swap the pattern \nconverted-wav-files/ for the | character. That is, the \n at the start of the search pattern tells the machine to find converted-wav-files/ at the start of a 'new line' (for more about 'regex', regular expressions, see the tutorial on regex). You then replace that pattern with the pipe character |.
eg: you go from this:
you messaged me right yesterday there was a
converted-wav-files/1.wav
that's all folks!
converted-wav-files/2.wav
to this:
you messaged me right yesterday there was a |1.wav
that's all folks!|2.wav
And then you’d save a copy of the file with the .csv file extension.
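Alternatively, if you'd rather do the whole replace-and-save at the command line (assuming perl is available, as it is on most Macs and Linux machines), a one-liner can make the same substitution; the -0777 flag slurps the whole file so the pattern can match across line breaks, and the output goes straight into a .csv file:

perl -0777 -pe 's/\nconverted-wav-files\//|/g' result.txt > result.csv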
Then, when you import the file into a different program, you just tell the machine that fields are delimited with the | character.
For instance, if you were then loading the file into R, you might use something like this:
transcriptions <- read.csv("result.csv", sep = "|", header = FALSE)
Good luck!