Need Text to Speech and Speech Recognition Tools for Linux

Need text to speech and speech recognition tools for Linux

For speech recognition there are the various Sphinxes. The variants have different pros and cons; see Comparison of Sphinx versions. Sphinx 4 is Java, but the others are C, I believe.

Text-to-speech (voice generation) and speech-to-text (voice recognition) APIs?

I'll rehash and update an answer from Speech recognition in C or Java or PHP?. This is by no means comprehensive, but it might be a start for you.


From watching these questions for a few months, I've seen most developer choices break down like this:

Windows folks - use the System.Speech features of .NET or Microsoft.Speech and install the free recognizers Microsoft provides. Windows 7 includes a full speech engine. Others are downloadable for free. There is a C++ API to the same engines known as SAPI. See http://msdn.microsoft.com/en-us/magazine/cc163663.aspx or http://msdn.microsoft.com/en-us/library/ms723627(v=vs.85).aspx. For more background on the Microsoft engines for Windows, see What is the difference between System.Speech.Recognition and Microsoft.Speech.Recognition?

Linux folks - Sphinx seems to have a good following. See http://cmusphinx.sourceforge.net/ and http://cmusphinx.sourceforge.net/wiki/

Commercial products - Nuance, Loquendo, AT&T, IBM, others. Each provides its own SDKs and libraries for various languages.

Online service - Nuance, Yapme, ispeech.org, vlingo, others. Nuance has improved their developer program and will now give you free access to their services for development. Yap (I believe) was recently purchased by Amazon, so we may see some changes there.

Of course this may also be helpful - http://en.wikipedia.org/wiki/List_of_speech_recognition_software

There is a Java speech API. See javax.speech.recognition in the Java Speech API http://java.sun.com/products/java-media/speech/forDevelopers/jsapi-guide/Recognition.html. I believe you still have to find a speech engine that supports this API. I don't think Sphinx fully supports it - http://cmusphinx.sourceforge.net/sphinx4/doc/Sphinx4-faq.html#support_jsapi

There are lots of other SO questions:
Need text to speech and speech recognition tools for Linux
and pyspeech (python) - Transcribe mp3 files? which talks about http://code.google.com/p/pyspeech/. You may also want to look at http://code.google.com/p/dragonfly/

Text To Speech And Speech To Text Recognition -- self - Recognition is occurring

Don't initialize everything in viewDidLoad. When you tap the button to convert text to speech, set the speech-to-text object to nil at that point, and set its delegate to nil as well. Do the same in reverse when you switch to speech-to-text.

Sound file to text file-speech recognition for ubuntu, specifically pocketsphinx usage


I want a short list of one dictionary file, language model file, acoustic file listed please, that are compatible with each other.

Install the pocketsphinx-en-us package from the universe/sound section.
(It's available in Ubuntu 18.04 Bionic Beaver and later.
Prior to that, I believe it was called pocketsphinx-hmm-en-hub4wsj.)
This will put the model in /usr/share/pocketsphinx/model/en-us/.

After that, you can run commands like this (there's no need to use sudo):

pocketsphinx_continuous -infile myfile.wav 2>&1 > myspeech.txt | tee out.log | less

Or if you want to specify the folders manually:

pocketsphinx_continuous \
-hmm /usr/share/pocketsphinx/model/en-us/en-us \
-dict /usr/share/pocketsphinx/model/en-us/cmudict-en-us.dict \
-lm /usr/share/pocketsphinx/model/en-us/en-us.lm.bin \
-infile myfile.wav > myspeech.txt

Make sure you have a 16-bit, 16 kHz mono wav file, or convert if necessary:

ffmpeg -i myfile.mp3 -ar 16000 -ac 1 -sample_fmt s16 myfile.wav
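If you are not sure whether a file already meets those requirements, you can check it before converting. The following is a minimal sketch using Python's standard wave module; the function name check_wav is just an illustration, and it only handles uncompressed PCM WAV files:

```python
import wave


def check_wav(path):
    """Return a list of reasons why `path` is not 16-bit, 16 kHz mono.

    An empty list means the file matches the format described above.
    Only uncompressed PCM WAV files can be opened by the wave module.
    """
    problems = []
    with wave.open(path, "rb") as wav:
        if wav.getnchannels() != 1:
            problems.append("expected 1 channel, found %d" % wav.getnchannels())
        if wav.getsampwidth() != 2:  # sample width is reported in bytes
            problems.append("expected 16-bit samples, found %d-bit" % (wav.getsampwidth() * 8))
        if wav.getframerate() != 16000:
            problems.append("expected 16000 Hz, found %d Hz" % wav.getframerate())
    return problems
```

If check_wav("myfile.wav") returns a non-empty list, run the ffmpeg command above on the original file first.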

You might not have the best accuracy from the generic model.
Here's set #1 of the Harvard Sentences:

One: The birch canoe slid on the smooth planks.
Two: Glue the sheet to the dark blue background.
Three: It's easy to tell the depth of a well.
Four: These days a chicken leg is a rare dish.
Five: Rice is often served in round bowls.
Six: The juice of lemons makes fine punch.
Seven: The box was thrown beside the parked truck.
Eight: The hogs were fed chopped corn and garbage.
Nine: Four hours of steady work faced us.
Ten: Large size in stockings is hard to sell.

and here's the output I got from my recording:

if one half the brcko nude lid on this good length
to conclude ishii to the dark blue background
three it's easy to tell the devil wow
for these days eat chicken leg is a rare dish
five race is often served in round polls
six the juice of the lemons makes flying conch
seven the box was thrown beside the parked truck
eight the hogs griffin chopped coroner and garbage
not in four hours of steady work the stocks
ten large son is in stockings his heart is the good
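To put a number on the accuracy, you can compute a word error rate (WER) per sentence: the word-level edit distance between the reference and the recognizer output, divided by the number of reference words. This is a rough sketch, not part of pocketsphinx; the function names and the normalization (lowercasing, dropping punctuation) are my own assumptions:

```python
import re


def words(text):
    """Lowercase the text and keep only words (apostrophes allowed)."""
    return re.findall(r"[a-z']+", text.lower())


def word_error_rate(reference, hypothesis):
    """Word-level Levenshtein distance divided by the reference length."""
    ref, hyp = words(reference), words(hypothesis)
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] = edit distance between the reference words seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cost = 0 if r == h else 1
            cur.append(min(prev[j] + 1,          # word missing from output
                           cur[j - 1] + 1,       # extra word in output
                           prev[j - 1] + cost))  # match or substitution
        prev = cur
    return prev[-1] / len(ref)


# Sentence five above: "rice" -> "race" and "bowls" -> "polls" are
# 2 substitutions in 8 reference words.
print(word_error_rate("Five: Rice is often served in round bowls.",
                      "five race is often served in round polls"))  # 0.25
```

Sentence seven, which was recognized perfectly, gives 0.0 with the same function.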

Related:

  • How to give an input wav file to pocket sphinx
  • Error when running pocketsphinx_continuous: Acoustic model definition is not specified
  • How can we convert .wav file to text by using pocketsphinx?
  • https://askubuntu.com/questions/837408/convert-speech-mp3-audio-files-to-text
  • https://askubuntu.com/questions/161515/speech-recognition-app-to-convert-mp3-to-text
  • https://old.reddit.com/r/linuxquestions/comments/bj54y0/speech_to_text_program/em6b7j9/
  • https://unicom.crosenthal.com/blog/entry/686

Voice Recognition Software For Developers


It's out there, and it works...

There are quite a few speech recognition programs out there, of which Dragon NaturallySpeaking is, I think, one of the most widely used ones. I've used it myself, and have been impressed with its quality. That being a couple of years ago, I guess things have improved even further by now.

...but it ain't easy...

Even though it works amazingly well, I won't say it's an easy solution. It takes time to train the program, and even then, it'll make mistakes. It's painstakingly slow compared to typing, so I had to keep saying to myself "Don't grab the keyboard, don't grab the keyboard, ..." (after which I'd grab the keyboard anyway). I myself tend to mumble a bit, which didn't make things much better, either ;-). Especially the first weeks can be frustrating. You can even get voice-related problems if you strain your voice too much.

...especially for programmers!

All in all, it's certainly a workable solution for people writing normal text/prose. As a programmer, you're in a completely different realm, for which there are no real solutions. Things might have changed by now, but I'd be surprised if they have.

What's the problem? Most SR software is built to recognize normal language. Programmers write very cryptic stuff, and it's hard, if not impossible, to find software that does the conversion between normal language and code. For example, how would you dictate:

if (somevar == 'a')
{
print('You pressed a!');
}

Using the commands in your average SR program, this is a huge pain: "if space left bracket equal sign equal sign apostrophe spell a apostrophe ...". And I'm not even talking about navigating your code. Ever noticed how much you're using the keyboard while programming, and how different that usage is from how a 'normal' user uses the keyboard?
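To make the overhead concrete, here is a toy sketch of the kind of phrase-to-symbol mapping involved. It is purely illustrative: the phrase names, the "spell" prefix handling, and the function are hypothetical and do not reflect how any particular SR product works.

```python
# Hypothetical spoken phrases and the characters they would produce.
SPOKEN_SYMBOLS = {
    "space": " ",
    "left bracket": "(",
    "right bracket": ")",
    "equal sign": "=",
    "apostrophe": "'",
}


def spoken_to_code(phrases):
    """Join already-segmented spoken phrases into a line of code."""
    out = []
    for phrase in phrases:
        if phrase.startswith("spell "):
            # "spell a" means: type the letter(s) literally.
            out.append(phrase[len("spell "):])
        else:
            # Unknown phrases (identifiers, keywords) pass through unchanged.
            out.append(SPOKEN_SYMBOLS.get(phrase, phrase))
    return "".join(out)


dictated = ["if", "space", "left bracket", "somevar", "space",
            "equal sign", "equal sign", "space",
            "apostrophe", "spell a", "apostrophe", "right bracket"]
print(spoken_to_code(dictated))  # if (somevar == 'a')
```

That is twelve spoken phrases for a nineteen-character line, before any corrections or navigation.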

How to make the best of it

Thus far, I've only worked with Dragon NaturallySpeaking (DNS), so I can only speak for that product. There are some interesting add-ons and websites aimed at programmers:

  • Vocola is an unofficial plugin that allows you to easily add your own commands to DNS. I found it essential, basically. You'll also be able to find command sets written by other programmers, e.g. for navigating code. It's based on a software package written in Python, so there are also some more advanced and fancy packages around. Also check out Vocola's Resources page. (Warning: when I used it, there were some problems with installing Vocola; check out the newsgroup below for info!)
  • SpeechComputing.com is a forum/newsgroup with lots of interesting discussions. A good place to start.

Closing remarks

It seems that the best solution to this problem is, really:

  • Find ways around actual coding.
  • Try to recover. I'm somewhat reluctant to recommend this book, but it seems to work amazingly well for people with RSI/carpal tunnel and other chronic pain issues: J.E. Sarno, Mindbody prescription. I'm working with it right now, and I think it's definitely worth reading.

what text to speech and speech recognition libraries are available for Clojure?

I think this is a pretty much unexplored territory as far as existing Clojure libraries go.

Your best bet is probably to look at the many available Java speech recognition libraries and use them from Clojure - they are going to be much more mature and capable at this point.

You may want to look at:

  • http://cmusphinx.sourceforge.net/sphinx4/

Using Java libraries from Clojure is extremely easy - it is generally as simple as importing the right classes and doing (.someMethod someObject arg1 arg2).

If you do create a Clojure wrapper for a speech recogniser, please do contribute it back to the community! I know quite a few people (myself included) would be interested in doing some speech-related work in Clojure.

Speech to text conversion in Linux

Well, this is quite an undertaking. Since you haven't said what technology you want to use, here are some general links:

  • Speech Recognition on Wikipedia
  • Java Speech API
  • W3C Speech Recognition Grammar Specification
  • Sphinx - An open source recognition engine written in Java

Good luck. With more detail, we may be able to provide better answers. For example, there's a big difference between "yes/no" call center-style recognition vs. even partial natural language understanding.

Java voice recognition

Mostly Java: http://cmusphinx.sourceforge.net/html/cmusphinx.php


