How to Handle Homophones in Speech Recognition

How to handle homophones in speech recognition?

This is a common NLP dilemma, and I'm not sure what your desired output is in this application. However, you may want to bypass this problem in your design/architecture process if you can. Otherwise, this problem turns into a challenge.


That being said, if you wish to really get into it, I like this idea of yours:

string against a function

This might be more efficient and performance-friendly.

One way I'd like to solve this problem would be through regex processing, instead of using endless loops and arrays. You could prototype with loops and arrays to begin with and see how it works; then you might want to switch to regular expressions to gain performance.

You could, for instance, define fixed alternation groups in regular expressions and quickly check them against your string (word by word, maybe using back-references), and you can add as many boundaries to your expressions as you wish for string processing.
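
For illustration, here is a minimal Python sketch of that approach; the homophone sets, the pattern list, and the sample sentence are all made up for the example:

    import re

    # Hypothetical fixed "arrays" of homophones, expressed as regex alternations.
    # The \b word boundaries keep a match from firing inside longer words.
    HOMOPHONE_PATTERNS = [
        re.compile(r"\b(to|too|two)\b", re.IGNORECASE),
        re.compile(r"\b(there|their|they're)\b", re.IGNORECASE),
        re.compile(r"\b(I|eye)\b"),
    ]

    def flag_homophones(text):
        """Return (word, position) pairs for every homophone found in text."""
        hits = []
        for pattern in HOMOPHONE_PATTERNS:
            for match in pattern.finditer(text):
                hits.append((match.group(), match.start()))
        return sorted(hits, key=lambda hit: hit[1])

    print(flag_homophones("I want to go there with two friends"))
    # [('I', 0), ('to', 7), ('there', 13), ('two', 24)]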


Your fixed arrays can also be designed around the probability of certain words occurring in certain parts of a string. For instance,

^I 

vs

^eye
  • The probability of I being the first word is much higher than that of eye.
  • The probability of I appearing anywhere in a string is also higher than that of eye.

You might want to weight words based on that.
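
As a rough sketch of that weighting idea, here is a toy Python example; the probabilities are invented for illustration, and real values would come from counting occurrences in a corpus:

    # Made-up weights: P(word | sentence-initial) vs. P(word | elsewhere).
    POSITION_WEIGHTS = {
        "i":   {"initial": 0.90, "elsewhere": 0.80},
        "eye": {"initial": 0.10, "elsewhere": 0.20},
    }

    def pick_homophone(candidates, word_index):
        """Pick the candidate with the highest weight for this position."""
        slot = "initial" if word_index == 0 else "elsewhere"
        return max(candidates, key=lambda w: POSITION_WEIGHTS[w][slot])

    print(pick_homophone(["i", "eye"], word_index=0))  # 'i'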

I'd say the key is to narrow down your desired outputs as much as possible and increase accuracy (maybe even to as few as 100 words, if possible) if you wish to have a good, working application.

Good project, though. I hope you enjoy the challenge.

How do speech recognition algorithms recognize homophones?

Do they use contextual clues?

Yes, ASR systems use cross-word context. For example, if the previous word is "going", the next word is likely to be "to", not "two". ASR systems account for these probabilities and select the most probable decoding variant.
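
As a toy Python illustration of that idea, with invented bigram probabilities standing in for what a real system would estimate from large text corpora:

    # Invented bigram probabilities for illustration only.
    BIGRAM_PROB = {
        ("going", "to"):  0.30,
        ("going", "too"): 0.01,
        ("going", "two"): 0.001,
    }

    def best_next_word(previous, candidates):
        """Choose the homophone that is most probable after the previous word."""
        return max(candidates, key=lambda w: BIGRAM_PROB.get((previous, w), 0.0))

    print(best_next_word("going", ["to", "two", "too"]))  # 'to'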

Sentence structure?

Yes, ASR systems use more advanced language models as well to predict probable words given the context.
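
To show what "selecting the most probable decoding variant" might look like, here is a toy sketch that ranks whole candidate sentences by summed bigram log-probabilities (again, the numbers are invented):

    import math

    # Invented bigram probabilities for illustration only.
    BIGRAM_PROB = {
        ("<s>", "i"): 0.2, ("i", "am"): 0.3, ("am", "going"): 0.2,
        ("going", "to"): 0.3, ("going", "two"): 0.001,
        ("to", "school"): 0.1, ("two", "school"): 0.0001,
    }

    def sentence_logprob(words, floor=1e-6):
        """Sum bigram log-probabilities over a hypothesis; the small floor
        keeps unseen pairs from zeroing out the whole sentence."""
        score = 0.0
        for prev, word in zip(["<s>"] + words, words):
            score += math.log(BIGRAM_PROB.get((prev, word), floor))
        return score

    hypotheses = [
        ["i", "am", "going", "to", "school"],
        ["i", "am", "going", "two", "school"],
        ["eye", "am", "going", "to", "school"],
    ]
    print(max(hypotheses, key=sentence_logprob))
    # ['i', 'am', 'going', 'to', 'school']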

Perhaps there are slight differences in the way each word is usually pronounced (for example, I usually hold the o sound longer in two than in to).

That too. Actually, "too" and "to" are pronounced quite differently: "to" is often reduced to a schwa.
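
You can see this in the CMU Pronouncing Dictionary, for example via NLTK (this assumes you have run nltk.download('cmudict') first; the exact set of variants depends on the dictionary version):

    from nltk.corpus import cmudict  # data comes from nltk.download('cmudict')

    pronunciations = cmudict.dict()
    for word in ("to", "too", "two"):
        print(word, pronunciations[word])
    # In the CMUdict copy shipped with NLTK, "to" typically lists reduced
    # variants such as T AH0 (the schwa form) alongside T UW1, while
    # "too" and "two" have only the full T UW1.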

If you are interested in speech recognition algorithms, it may make sense to read an ASR book or take an online course. See this thread for details:

https://sourceforge.net/p/cmusphinx/discussion/speech-recognition/thread/3ea89abf/

How to handle homophones from Speech Service in LUIS?

The short answer is no, you shouldn't bin them together. Really, that is just a band-aid and will more than likely lead to other issues down the road.

Instead, there are several steps you can take:

  1. Remind users to speak clearly and not too fast. This is generally good practice with any speech service.
  2. For Speech Services, consider adding Custom Speech. Essentially, you provide additional audio to further train the service on what to recognize. This can include providing a Pronunciation Model to help the service distinguish between similar-sounding words (see the sketch after this list).
  3. For LUIS, look at how many example utterances you have trained. Results tend to improve with more utterances; the recommendation is a minimum of five.
  4. For LUIS, you can consider adding a Phrase List. I think your mileage will vary more here than with the other suggestions, as the issue seems more speech-driven than LUIS-driven, but it could help and requires little setup.

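
For step 2, here is a hedged sketch of preparing pronunciation data. As far as I recall from the Custom Speech docs, a pronunciation file is one tab-separated pair of display form and spoken form per line; the entries and file name below are purely illustrative:

    # Hypothetical pronunciation entries: "display form<TAB>spoken form",
    # one pair per line, which is the layout Custom Speech pronunciation
    # data files use (double-check the current docs before uploading).
    entries = [
        ("3CPO", "three c p o"),
        ("CNTK", "c n t k"),
        ("IEEE", "i triple e"),
    ]
    with open("pronunciations.txt", "w", encoding="utf-8") as f:
        for display, spoken in entries:
            f.write(f"{display}\t{spoken}\n")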
Hope this helps!

Get alternative suggestions during speech recognition

DeepSpeech has updated releases. For better inference results, you need to follow their instructions and suggestions: your input audio file should be 16000 Hz, mono channel, and 16-bit. Audio resampling may affect the quality of inference, so keep this in mind. I personally use SoX for resampling, but there are other options, such as samplerate. There are also many good suggestions on their forum.

There is also a Python library called SpeechRecognition. It offers some offline models and online API services for speech-to-text.
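
For example, here is a sketch using SpeechRecognition's Google Web Speech backend, where show_all=True returns the raw response including the list of alternatives ("speech.wav" is a placeholder file, and the call needs network access):

    import speech_recognition as sr

    recognizer = sr.Recognizer()
    with sr.AudioFile("speech.wav") as source:  # placeholder 16 kHz mono WAV
        audio = recognizer.record(source)

    # show_all=True returns the raw API response; its "alternative" list
    # holds the candidate transcriptions instead of just the top one.
    result = recognizer.recognize_google(audio, show_all=True)
    if isinstance(result, dict):
        for alternative in result.get("alternative", []):
            print(alternative.get("transcript"), alternative.get("confidence"))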

Grapheme-to-phoneme heuristic for OOV words

Here's an attempt https://godbolt.org/z/z6G4o6c4K

It pronounces school as s-chul (in ARPAbet, S CH UW L). But that's good enough, I suppose.

It first tries to parse the first two characters; if they match a common digram (e.g. ch, sh, th), it returns the corresponding phoneme and recurses on the remainder. Otherwise it converts the first letter to the closest phoneme.
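
Here is a Python re-sketch of that heuristic (independent of whatever language the linked attempt uses). The digram and letter tables are abbreviated and illustrative; I added oo so the school example comes out as described above:

    # Greedy left-to-right grapheme-to-phoneme conversion: digram first,
    # then single letter. Tables are abbreviated, not a complete mapping.
    DIGRAMS = {"ch": "CH", "sh": "SH", "th": "TH", "oo": "UW"}
    LETTERS = {
        "a": "AE", "b": "B", "c": "K", "d": "D", "e": "EH", "f": "F",
        "g": "G", "h": "HH", "i": "IH", "k": "K", "l": "L", "m": "M",
        "n": "N", "o": "AO", "p": "P", "r": "R", "s": "S", "t": "T",
        "u": "UW", "w": "W",
    }

    def naive_g2p(word):
        """Convert a word to phonemes, preferring digrams over single letters."""
        phones = []
        i = 0
        while i < len(word):
            if word[i:i + 2] in DIGRAMS:
                phones.append(DIGRAMS[word[i:i + 2]])
                i += 2
            elif word[i] in LETTERS:
                phones.append(LETTERS[word[i]])
                i += 1
            else:
                i += 1  # skip characters we have no mapping for
        return phones

    print(naive_g2p("school"))  # ['S', 'CH', 'UW', 'L'] -- the s-chul output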


