Taking Mozilla’s DeepSpeech for a spin

December 1st, 2017. Tagged: ffmpeg, misc hackery

Speech-to-text, eh? I wanted to convert episodes of my favorite podcast to text so their invaluable content becomes searchable. I'm only moderately excited about the results, but I'd like to document the effort nonetheless.

DeepSpeech

First thought: what open-source packages exist out there? Checking Wikipedia, I see a brand-new one from Mozilla: DeepSpeech. Intriguing.

Install

It wasn't painless, so let me drop my notes here... though the process will probably get much smoother soon enough.

(There's an NPM package too, which I missed, but...) I saw there's a Python installer thing called pip, which I have installed on my laptop. Don't remember doing it, but it's there. So as the docs say:

$ pip install deepspeech

Didn't work. No such package. Turns out I have an old pip. So update it first:

$ sudo pip install --upgrade pip

Done!

pip install deepspeech still didn't work, so sudo it is:

$ sudo pip install deepspeech

Audio

I grabbed the podcast MP3 (Episode 1), but DeepSpeech requires a specific kind of WAV (16-bit, mono, 16 kHz, yadda-yadda), so ffmpeg to the rescue:

$ ffmpeg -i UBK_HFH_Ep_001_f.mp3 -acodec pcm_s16le -ar 16000 UBK_HFH_Ep_001_f.wav
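
One thing to watch: the ffmpeg command above sets the sample format and the sample rate but does not force mono (that would be the -ac 1 option), so a stereo MP3 would stay stereo. Here is a rough sketch of a sanity check that reads the channel count, sample rate and bit depth straight out of the WAV header with od. The wav_info name is made up for illustration, and the sketch assumes the plain layout where the "fmt " chunk comes right after the RIFF header, plus a little-endian machine.

```shell
# wav_info FILE
# Prints channel count, sample rate and bit depth read from a WAV header.
# Assumption: the "fmt " chunk directly follows the 12-byte RIFF header, so
# the fields sit at byte offsets 22, 24 and 34. od reads integers in host
# byte order, so this also assumes a little-endian machine.
wav_info() {
    channels=$(od -An -tu2 -j22 -N2 "$1" | tr -d '[:space:]')
    rate=$(od -An -tu4 -j24 -N4 "$1" | tr -d '[:space:]')
    bits=$(od -An -tu2 -j34 -N2 "$1" | tr -d '[:space:]')
    echo "channels=$channels rate=$rate bits=$bits"
}
```

Running wav_info UBK_HFH_Ep_001_f.wav should report channels=1 rate=16000 bits=16 if the file matches what DeepSpeech expects. Anything else means the conversion needs another look.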

Run

Docs say run:

$ deepspeech output_model.pb my_audio_file.wav alphabet.txt

They also say:

$ deepspeech output_model.pb my_audio_file.wav alphabet.txt lm.binary trie

Neither of those works, because output_model.pb, alphabet.txt and the rest are nowhere to be found on my system. Thanks to this discussion, there is a solution:

  1. Download the "models" zip from GitHub (warning: 1.3 GB)
  2. Unzip it anywhere
  3. Navigate to the models/ folder
  4. Replace output_model.pb from the instructions with output_graph.pb found in the release package
  5. ...and succeed!

$ deepspeech output_graph.pb ../UBK_HFH_Ep_001_f.wav alphabet.txt lm.binary trie
Loading model from file output_graph.pb
Loaded model in 1.336s.
Loading language model from files lm.binary trie
Loaded language model in 3.863s.
Running inference.
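
Since the whole problem was files not being where the docs implied, a quick pre-flight check can save a confusing error. This is only a sketch: check_model_files is a hypothetical helper name, and the four file names are simply the ones used in the command above.

```shell
# check_model_files [DIR]
# Hypothetical helper: verifies that the four files named on the deepspeech
# command line above exist in DIR (default: current directory).
# Returns 0 if all are present, 1 otherwise, listing what is missing on stderr.
check_model_files() {
    dir=${1:-.}
    missing=0
    for f in output_graph.pb alphabet.txt lm.binary trie; do
        if [ ! -f "$dir/$f" ]; then
            echo "missing: $dir/$f" >&2
            missing=1
        fi
    done
    return "$missing"
}
```

From inside the models/ folder, something like check_model_files . && deepspeech output_graph.pb ../UBK_HFH_Ep_001_f.wav alphabet.txt lm.binary trie would only start the run once everything is in place.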

Results!

To be fair, there's music and two hosts, and they goof off and there's music production jargon... so the results are far from perfect. But good enough for searching the content, don't you think? Here goes:

i tertoeworoocneatiiyouhadhaateponeormeoversrimoekayameeyourironhmanomyoumeyoninohaveyouevermixedstraightdupediami can easily say yes but i want to make a difficulty so what is ediimanymoreanythingwistoletrounthetinstandswawllthatrighttherewastaexabpleofwhat'sgalled ouseum guess noiid'tyuknowidon'tknowififdubstep or any of the breakbeatvaritiscountasethemigainstsoyupeopledancedtothatit'jitit'sinterestingandhaveyou'resenifvideosoputbedanceandubsstepb'ecausthesdoesbrothers lay it nouns you know its really fastenatingtispuckmahadsorhasbenunotlesteningyouumthewaenerlyavisablecourse of action ye er all those horelessmiglingrlyhothamata electronic dance music all the is it goes significantly beyond the pootstheres also batwatbowhooheboallthemassiv respect to all ourleconadansemuse brother out there loyo'detansatoenfact for many many years spunditwelve hundreds myself at a couple of crates of vinaliwasintominimalackhouseactually not the musinnaybebut what is a is challeninmustinisedyantract pay in the butmoneloy look heres some things i love the dance music but you aln'tedtelistent at im about to day if you make alectronacdancemeslicno many of you we'l have this already to some degree many of you will have a youre better than i am in every in every possible respect your better him being you look better you smell better but for those who dont smell quite a nice as in a less a music try to remember if you can what music thats plate thats played on instruments by people in a room notassarilyrecordingofertibisemainabandinarom they respond to one another and things haftentogether so what doesnt happen is the drumrdoesn'tstart playing a beat and then suddenly both thersitarand then eat bars laterbootthere'sasethat'sthat's not the way unfolds now theres theres some good reasons for doing that and i i trynobancemi'sexpecisiveyou'remakingfinaltat specifically and i mabialemdating myself now if youre making it 
lethegrdyomanginhistaaai'givyoutafoofrthetroloofeiaacatataaaatualyidigoresesorasnawithleteresalingaprupbucatococashadighoveresation friend about out of a friend i okay with someone i know a age thats better a thats believable about a contretmusiccontracts and how some of them still have a word phonographs really a a just thinking itll be fine i sat side for our anstrutyaaandfunnytouptekdosewith i will provide you if they phonograph m p three as you rherdioofyormakesyowazoathersreadyanackcataasyesinyoursoyeasoeyouegatathonityou'r mixingafar i understand like a sevaspinthetwelvehundredsiunderstandthatthere's a reason to have sounds come in on not metronomically like that but what im saying to you is you need in need more and i m not talking about more in a sense of eangofpaceormorestuffhappengallthtimeontalkingabouttransitionsohyeahagatit because the music is so programmed and because its so it generally is extraordinarily repetitive and then gether thats what its about as good a lot is like this bulifitis and whether it is a isintothe matter but the whinsitsitsgeneraly built a layer at a time generally by the same persons and thats as a couple of clodratorsbutgenerly onpersondoingthisat'sallcomingoutofthe same mind is coming out i in yearly this part and thisbrugsbrtimnotaraneyitflimerilylikethatgetdall your parts out but then move them around play with things first of forest and then once you got it this we here this is a i'monotdoisamixer so if you can get into this space as a composer your life will be so much easier or the delay the guy is mixing it ortorgawlnohevermaxs ing whence you got al your parts arranged and you got the general flow of the song then to pay attention to what is the focus of this a bar sex whats really carry in the groofitmihtnotbewhat you think i said just pushing pool levels around to make something really loud and see how that one than make some elsehorthetoutandturnotofthingriedowngetgetisenealli'ssectionsflow and then create 
transitions create things that avepeenthatsicnal that it change is coming as this is my first demondropthe's free o signal the change is coming do something to build and dianorndeverybodyin the world he is new its cheesy in a work are is doing as a tritedmore clever but if you listen to the pot and stuff on the on the radio theres always hovers and and sweepswitgooandandthethinghadstissreasontheyput that stuff in it cause its effective it lets you know somethings happening what is the really on happen boonsomehingjust happen in thentellesomesort of explosions at enerjecnesessarysiprosandoosmurassomethinghads that moment new sounds come in old sounds go away generally and something rings out so creattyure transitions like that give me as a mixer something to sink my teeth into as these parts come and go and then as your laringstuffugp when you arrange things have two things come in at once and always have time things coming and not just one to two things happen and makechomcandafdifferentandcontrasting or whatever and the instant you've got like five things hppening and that includes like drums are one thing and bases and other soon three things happen in otherwise if its tied to bring more things in other things got to go away is tis make them go away and it at'saveryselectmomentsinthessongyoucanreallykindopowl things up but othwisestuffkindagoawaywoenitsufcomesan ace keep the errandsockyotobefairetheseare things that every body as wepaintettenshuput bands peopeopl make a musicantharbedroomandbutitspindepoporwhatersiti'salaseverystlaftdlelrameaasolofthe someone on one of those online forums where people talk about things and and typing form responded actually with the very by things like well a lot of young kids these days our learning music on their lattochslariydmusicontheriyehads learning music in other ways and this simplestandfasteswaiyto get that in in some ways the cheat this way to do that as is start off to electronic based musical youve got a 
usalulysicanyourheadyeudon'thavetagotar round if youre a kid and you love music youre just going to find whatever tent whatever it takes to get that on recorded or down get the idea down somehow i they think edyimandhippopaviouslybackgonth a and still now those styles of music more people are getting on to that i think just because its a little bit more acessiya so you can jusasely get into it and ive noticed the same thing with hippoahalottoftimeslalgeothesehhippoptracks where ah if its a to track and vocal or if its everything'strackedoutamit'sjistthe same stuff through the verse and chorsand i had up getting hired to not only makes it but people'vefordmypreviousworkand iermebcaus they know i gonna troppiteuph to make it sound more interestingmammmdothegoodyaotoyoutlookhetogoodyotoanenlookapthemagrefac world i you at by al means support them endlessly thats the red calledbaktratfidsthebestswayfyoulisacydhaveyouherd about wickyharserebiwickyworsenobutyuvinatactypeinusanayowhithsoudoisyutwichyouorcecaza eople and then you got a referee and to any number of witnesses or whatnot but you have two pele and they start off on the same wikopediapige and wormhegindbelikeigkiecobeinnapenistayistho istory whatever the starting pages and then they say okay and howlheberygoboomand you have to work your way from starting you know with the founding fazsof the constitution to holly berry only through wickapedalinxan so you just clicknclicknclickanyoutratofagure which was on a get you closer to the mark rahdyoundtaewayandthese two people go head to head at this and who ever gets their first winds an thats what they call peckyorseat'ssveryentertaining to see that the past that people traveled to get from pointed a point by a is there a turning u with money i isthrisntthere probably will be i dont go to a who i am i was i i go i speaking of ladygarobretinspiartshe'suhadliningvegas now first shes guys saw that in the side of the buses like britningutacaseilactcoblesser 
i'carestosailonatlaskeserliim 't think she'llasttyercontrananoidiesii'mcurioustosyealong at last i have no thoughts on how long it will last im just curious to say and like because i can see it plan out where she just burned out like in three shows and i know a us overbuttedwithcokinerolling out of her yearballsumandticataalso see her riding up hefulcontrastandgetinworenowlandand ending up not retiring until shes like sixty seven years old i thats sad i i i see by your option b okay you disisgallnrecordherenonoh he articieho'vereotdanestateis the interwibsfok its not on a side of a bus i dolt know about it thats how my worldworkseisasaniowehhhna
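
To make the "good enough for searching" idea concrete, here is a small sketch that counts case-insensitive hits for a term in a saved transcript. It assumes the text above has already been saved to a file; transcript.txt and count_hits are made-up names. Because so many words come out glued together, it matches substrings rather than whole words.

```shell
# count_hits TERM FILE
# Counts case-insensitive substring occurrences of TERM in FILE.
# grep -o prints every match on its own line, so wc -l counts matches
# rather than matching lines (the transcript is only a few very long lines).
# -F treats TERM as a fixed string, not a regular expression.
count_hits() {
    grep -o -i -F -- "$1" "$2" | wc -l | tr -d '[:space:]'
}
```

For example, count_hits "dance music" transcript.txt gives a rough idea of how often a topic comes up in an episode, even when the neighboring words are mush.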

Perf

And yes, the process does take a while:

Inference took 753.507s for 648.908s audio file.
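
In other words, slower than real time. A back-of-the-envelope calculation with awk, using the two numbers from the log line above (the variable names are made up for illustration):

```shell
# Numbers copied from the "Inference took ..." line above.
inference_s=753.507
audio_s=648.908

# Real-time factor: seconds of processing per second of audio.
rtf=$(awk -v i="$inference_s" -v a="$audio_s" 'BEGIN { printf "%.2f", i / a }')
echo "real-time factor: $rtf"   # prints: real-time factor: 1.16
```

So roughly 1.16 seconds of processing for every second of audio on this run.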
