Simplest Algorithm for Measuring How Similar Two Short Audio Clips Are

How to detect how similar a speech recording is to another speech recording?

A lot of people seem to be suggesting some sort of edit distance, which IMO is totally the wrong approach for determining the similarity of two speech patterns, especially for patterns as short as OP is implying. The specific algorithms used in speech recognition are in fact nearly the opposite of what you would like to use here. The problem in speech recognition is resolving many similar pronunciations to the same representation. The problem here is to take a number of slightly different pronunciations and get some kind of meaningful distance between them.

I've done quite a bit of this stuff for large scale data science, and while I can't comment on exactly how proprietary programs do it, I can comment on how it's done in academia and provide a solution that is straightforward and will give you the power and flexibility that you want for this approach.

Firstly: I'm assuming that what you have is some chunk of audio without any filtering done on it, just as it would be acquired from a microphone. The first step is to eliminate background noise. There are a number of different methods for this, but I'm going to assume that what you want is something that will work well without being incredibly difficult to implement.

  • Filter the audio using scipy's filtering module here. There are a lot of frequencies that microphones pick up that are simply not useful for categorizing speech. I would suggest either a Bessel or a Butterworth filter to ensure that your waveform is preserved through filtering. The frequencies that carry most of the information in everyday speech generally fall between about 300 and 3000 Hz (reference), so a reasonable pass band would be something like 300 to 4000 Hz, just to make sure you don't lose anything. There's a sketch of this step, and the next two, right after this list.
  • Look for the least active portion of the recording and assume that it is a reasonable representation of the background noise. At this point you're going to want to run a series of Fourier transforms along your data (or generate a spectrogram) and find the part of your speech recording that has the lowest average frequency response. Once you have that snapshot, you should subtract it from all other points in your audio sample.
  • At this point you should have an audio file that is mostly just your user's speech, and it should be ready to be compared to another file that has gone through this process. Now we want to actually clip the sound and compare this clip to some master clip.
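
To make those steps concrete, here is a minimal sketch of the preprocessing, assuming a mono WAV file read with scipy; the cutoff frequencies, the filter order, and the name preprocess are illustrative choices, not requirements.

import numpy as np
from scipy.io import wavfile
from scipy.signal import butter, sosfiltfilt, spectrogram

def preprocess(filename, low_hz=300.0, high_hz=4000.0):
    """Band-pass filter a mono recording, then subtract an
    estimated noise floor from its spectrogram."""
    rate, audio = wavfile.read(filename)  # assumes mono audio
    audio = audio.astype(np.float64)

    # A Butterworth band-pass has a maximally flat passband, and
    # filtering forwards and backwards (sosfiltfilt) avoids phase
    # distortion, so the waveform is preserved.
    sos = butter(4, [low_hz, high_hz], btype="bandpass", fs=rate, output="sos")
    filtered = sosfiltfilt(sos, audio)

    # Short-time Fourier transform over the whole recording.
    freqs, times, spec = spectrogram(filtered, fs=rate)

    # Treat the quietest time slice as the background-noise
    # estimate and subtract it from every other slice.
    quietest = np.argmin(spec.mean(axis=0))
    noise_floor = spec[:, [quietest]]
    return freqs, times, np.clip(spec - noise_floor, 0.0, None)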

Secondly: You're going to want to come up with a distance metric between two speech patterns. There are a number of ways to do this, but I'm going to assume we have the output of part one and some master file that has been through similar processing.

  • Generate a spectrogram of the audio file in question (example); the preprocessing sketch above already produces one. The output from this is ultimately an image that can be represented as a 2-d array of frequency response values. A spectrogram is essentially a Fourier transform over time, where the colour corresponds to intensity.

  • Use OpenCV (it has Python bindings, example) to run blob detection on your spectrogram. Effectively, this is going to look for the big colorful blob in the middle of your spectrogram and give you some limits on it. It should return a significantly sparser version of the original 2-d array that solely represents the speech in question (on the assumption that your audio file will have some quiet trailing stuff at the front and back ends of the recording). See the first sketch after this list.

  • Normalize the two blobs to account for differences in speaking speed. Everyone talks at a different speed, and as such your blobs will probably have different sizes along the x-axis (time). Left uncorrected, this introduces differences between the blobs that have nothing to do with what was said. This step isn't needed if you also want to make sure that the user speaks at the same speed as the master copy, but I would suggest it. Basically, you want to stretch out the shorter version by multiplying its time axis by some constant that's just the ratio of the lengths of your two blobs (see the second sketch after this list).

  • You should also normalize the two blobs based on maximum and minimum intensity to account for people who talk at different volumes. Again, this is up to your discretion, but to fix it you should find the ratio between the intensity spans of the two recordings (and between their maximum intensities) and rescale one 2-d array so that these values match up between the two.
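
Here is the first sketch promised above, a rough take on blob extraction with OpenCV. Rather than running a full blob detector, it simply thresholds the spectrogram and crops it to the bounding box of its largest bright region; the name extract_speech_blob is mine, and the code assumes the spectrogram contains at least one high-energy region.

import cv2
import numpy as np

def extract_speech_blob(spec):
    """Crop a spectrogram (2-d array) to the bounding box of its
    dominant high-energy region, dropping the quiet padding at
    the edges of the recording."""
    # Scale to 0-255 so OpenCV can treat the array as an image.
    img = cv2.normalize(spec, None, 0, 255, cv2.NORM_MINMAX).astype(np.uint8)

    # Otsu's method picks a threshold separating the energetic
    # speech region from the quiet background.
    _, mask = cv2.threshold(img, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)

    contours, _ = cv2.findContours(mask, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
    x, y, w, h = cv2.boundingRect(max(contours, key=cv2.contourArea))
    return spec[y:y + h, x:x + w]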
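
And the second sketch: normalizing two blobs for speaking speed and volume, assuming both came from spectrograms with the same frequency axis; normalize_blobs is a hypothetical helper name.

import cv2
import numpy as np

def normalize_blobs(blob_a, blob_b):
    """Stretch both blobs to the same length along the time axis,
    then rescale each onto a common intensity range."""
    # Resize along the time (x) axis so both blobs are as long as
    # the longer one; this removes differences in speaking speed.
    width = max(blob_a.shape[1], blob_b.shape[1])
    blob_a = cv2.resize(blob_a, (width, blob_a.shape[0]))
    blob_b = cv2.resize(blob_b, (width, blob_b.shape[0]))

    # Map each blob onto [0, 1] so differences in speaking volume
    # don't dominate the comparison.
    def rescale(blob):
        span = blob.max() - blob.min()
        return (blob - blob.min()) / span if span else blob

    return rescale(blob_a), rescale(blob_b)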

Third: Now that you have 2-d arrays representing your two speech events, which should in theory contain all of their useful information, it's time to compare them directly. Luckily, comparing two matrices is a well-solved problem, and there are a number of ways to move forward.

  • Personally I would suggest using a metric like Cosine Similarity to determine the difference between your two blobs. That's not the only solution, and while it'll give you a quick validation, you can do better (see the combined sketch after this list).

  • You could try subtracting one matrix from the other to get an evaluation of how much difference there is between them, which would probably be more accurate than simple cosine distance.

  • It might be overkill, but you could assume that certain regions of speech matter more or less when evaluating the difference between blobs (it might not matter if someone uses a long i instead of a short i, but a g instead of a k could be a different word entirely). For something like that, you'd want to develop a mask for the difference array from the previous step and multiply all your values by it.

  • Whichever method you choose, you can now simply set some difference threshold and check that the difference between the two blobs is below it. If it is, the captured speech is similar enough to be correct; otherwise, have them try again.
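
Pulling those options together, here is a combined sketch of the cosine, subtraction, and threshold approaches; the weights mask, the 0.15 threshold, and the function names are all arbitrary examples to be tuned for your data.

import numpy as np

def cosine_similarity(blob_a, blob_b):
    """Cosine similarity between two same-shaped blobs treated as
    flat vectors; 1.0 means the two blobs are perfectly aligned."""
    a, b = blob_a.ravel(), blob_b.ravel()
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def difference_score(blob_a, blob_b, weights=None):
    """Mean absolute difference between two same-shaped blobs,
    optionally weighted by a mask that emphasizes the regions of
    speech that matter most."""
    diff = np.abs(blob_a - blob_b)
    if weights is not None:
        diff = diff * weights
    return float(diff.mean())

def is_match(blob_a, blob_b, threshold=0.15):
    """Accept the captured speech when the difference score falls
    below the (entirely tunable) threshold."""
    return difference_score(blob_a, blob_b) < threshold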

I hope that's helpful. Again, I can't assure you that this is the exact algorithm a given company uses, since that information is proprietary and not public, but I can assure you that methods like these are used in the best academic papers and that they will give you a great balance of accuracy and ease of implementation. Let me know if you have any questions, and good luck with your future data science exploits!

How do we check the similarity between the hash values of two audio files in Python?

Cryptographic hashes like SHA-256 cannot be used to compare the distance between two audio files. Cryptographic hashes are deliberately designed to be unpredictable and to ideally reveal no information about the input that was hashed.
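
To see why, note that flipping a single byte of input gives a completely unrelated digest; this little example uses made-up byte strings standing in for raw audio:

import hashlib

clip_a = b"\x00\x01\x02\x03"  # imagine these are raw audio bytes
clip_b = b"\x00\x01\x02\x04"  # the same audio with one byte changed

print(hashlib.sha256(clip_a).hexdigest())
print(hashlib.sha256(clip_b).hexdigest())
# The two digests share no structure at all, so no distance
# measure on them can reflect how similar the audio is.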

However, there are many suitable acoustic fingerprinting algorithms that accept a segment of audio and return a fingerprint vector. Then, you can measure the similarity of two audio clips by seeing how close together their corresponding fingerprint vectors are.

Picking an acoustic fingerprinting algorithm

Chromaprint is a popular open source acoustic fingerprinting algorithm with bindings and reimplementations in many popular languages. Chromaprint is used by the AcoustID project, which is building an open source database to collect fingerprints and metadata for popular music.

The researcher Joren Six has also written and open-sourced the acoustic fingerprinting libraries Panako and Olaf. However, both are currently licensed under the AGPLv3 and may infringe upon still-active US patents.

Several companies, such as Pex, sell APIs for checking whether arbitrary audio files contain copyrighted material. If you sign up for Pex, they will give you their closed-source SDK for generating acoustic fingerprints with their algorithm.

Generating and comparing fingerprints

Here, I will assume that you chose Chromaprint and that you want to compare fingerprints using Python, although the general principle applies to other fingerprinting libraries. You will have to install libchromaprint and an FFT library.

  1. Install libchromaprint or the fpcalc command line tool.
  2. Install the pyacoustid Python library from PyPI. It will look for your existing installation of libchromaprint or fpcalc.
  3. Normalize your audio files to remove differences that could confuse Chromaprint, such as silence at the beginning of an audio file. Also keep in mind that Chromaprint only fingerprints a limited prefix of each file by default (fpcalc looks at roughly the first two minutes), so very long files aren't compared in full.
  4. While I typically measure the distance between vectors using NumPy (there is a sketch of that after the example below), many Chromaprint users compare two audio files by XOR-ing corresponding fingerprint integers and counting the number of 1 bits, i.e. the Hamming distance.

Here is some quick-and-dirty Python code for comparing the distance between two fingerprints, although if I were building a production service, I'd implement the comparison in C++ or Rust.

from operator import xor
from typing import List

# These imports should be in your Python module path
# after installing the `pyacoustid` package from PyPI.
import acoustid
import chromaprint

def get_fingerprint(filename: str) -> List[int]:
    """
    Reads an audio file from the filesystem and returns a
    fingerprint.

    Args:
        filename: The filename of an audio file on the local
            filesystem to read.

    Returns:
        Returns a list of 32-bit integers. Two fingerprints can
        be roughly compared by counting the number of
        corresponding bits that are different from each other.
    """
    _, encoded = acoustid.fingerprint_file(filename)
    fingerprint, _ = chromaprint.decode_fingerprint(
        encoded
    )
    return fingerprint

def fingerprint_distance(
    f1: List[int],
    f2: List[int],
    fingerprint_len: int,
) -> float:
    """
    Returns a normalized distance between two fingerprints.

    Args:
        f1: The first fingerprint.

        f2: The second fingerprint.

        fingerprint_len: Only compare the first `fingerprint_len`
            integers in each fingerprint. This is useful
            when comparing audio samples of a different length.

    Returns:
        Returns a number between 0.0 and 1.0 representing
        the distance between two fingerprints, where 0.0 means
        identical and 1.0 means every bit differs.
    """
    max_hamming_weight = 32 * fingerprint_len
    # XOR each pair of integers and count the differing bits.
    # The mask keeps the result within 32 bits, since
    # decode_fingerprint can return negative signed integers.
    hamming_weight = sum(
        sum(
            c == "1"
            for c in bin(xor(f1[i], f2[i]) & 0xFFFFFFFF)
        )
        for i in range(fingerprint_len)
    )
    return hamming_weight / max_hamming_weight

The above functions would let you compare two fingerprints as follows:

>>> f1 = get_fingerprint("1.mp3")
>>> f2 = get_fingerprint("2.mp3")
>>> f_len = min(len(f1), len(f2))
>>> fingerprint_distance(f1, f2, f_len)
0.35 # for example
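
Since step 4 mentioned measuring the distance with NumPy instead, here is a vectorized equivalent of fingerprint_distance as a sketch; the name fingerprint_distance_np is mine.

import numpy as np

def fingerprint_distance_np(f1, f2, fingerprint_len):
    """Vectorized version of fingerprint_distance: XOR the two
    fingerprints and count the differing bits."""
    # Mask to unsigned 32-bit values, since decoded fingerprints
    # can contain negative signed integers.
    a = np.array(f1[:fingerprint_len], dtype=np.int64) & 0xFFFFFFFF
    b = np.array(f2[:fingerprint_len], dtype=np.int64) & 0xFFFFFFFF
    xored = (a ^ b).astype(np.uint32)
    # View each 32-bit value as 4 bytes and count the set bits.
    differing_bits = int(np.unpackbits(xored.view(np.uint8)).sum())
    return differing_bits / (32 * fingerprint_len)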

You can read more about how to use Chromaprint to compute the distance between different audio files. This mailing list thread describes the theory of how to compare Chromaprint fingerprints. This GitHub Gist offers another implementation.

How to calculate a distance/similarity measure between two given strings?

What you are looking for is called edit distance or Levenshtein distance. The Wikipedia article explains how it is calculated, and it has a nice piece of pseudocode at the bottom that will help you implement this algorithm in C# very easily.

Here's an implementation from the first site linked below:

private static int CalcLevenshteinDistance(string a, string b)
{
    if (String.IsNullOrEmpty(a) && String.IsNullOrEmpty(b)) {
        return 0;
    }
    if (String.IsNullOrEmpty(a)) {
        return b.Length;
    }
    if (String.IsNullOrEmpty(b)) {
        return a.Length;
    }
    int lengthA = a.Length;
    int lengthB = b.Length;
    var distances = new int[lengthA + 1, lengthB + 1];

    // Transforming an empty prefix into a prefix of length i
    // costs i insertions (and likewise for deletions).
    for (int i = 0; i <= lengthA; distances[i, 0] = i++);
    for (int j = 0; j <= lengthB; distances[0, j] = j++);

    for (int i = 1; i <= lengthA; i++)
        for (int j = 1; j <= lengthB; j++)
        {
            // Cost is 0 when the characters match, 1 for a substitution.
            int cost = b[j - 1] == a[i - 1] ? 0 : 1;
            distances[i, j] = Math.Min(
                Math.Min(distances[i - 1, j] + 1, distances[i, j - 1] + 1),
                distances[i - 1, j - 1] + cost
            );
        }
    return distances[lengthA, lengthB];
}

Similarity measure for Strings in Python

You could just use difflib. This function, which I got from an answer some time ago, has served me well:

from difflib import SequenceMatcher

def similar(a, b):
    # ratio() returns a float in [0, 1]: twice the number of matching
    # characters divided by the total number of characters in both strings.
    return SequenceMatcher(None, a, b).ratio()

print(similar('tackoverflow', 'stackoverflow'))
print(similar('h0t', 'hot'))

0.96
0.666666666667

You could easily extend the function, or wrap it in another function, to account for different degrees of similarity, like so, passing a third argument:

from difflib import SequenceMatcher

def similar(a, b, c):
    sim = SequenceMatcher(None, a, b).ratio()
    if sim > c:
        return sim

print(similar('tackoverflow', 'stackoverflow', 0.9))
print(similar('h0t', 'hot', 0.9))

0.96
None

