development-programming-algorithm-fuzzy-search.html
* created: 2025-05-26T16:27
* modified: 2025-06-24T18:28
title
Fuzzy Search
description
You can use a fuzzy search algorithm for approximate string matching which is useful when you have to relay on inherently unreliable users.
related notes
Approximate matching on strings
These are used if dealing with imprecise user inputs. The goal is to find entries that are approximately similar but not exactly the same.
Personal use case
I want to implement a search algorithm that helps users navigate my post note application quicker. I already distinguish between some important information like title, description, and tags from each individual note and can easily put them into a map.json
which ought to be used as a mini database. As a small side note, there are a couple of reasons why I decided to use a simple json file instead of a full-fledged database in this project:
- Project scope: This is a fairly small project, which should be deployed using a simple static file server, anything but a json file would exceed the scope of this project.
- Browser cache: Even if the scope of the website grows, and by proxy, the size of the json file, user still only need to load this file ones, because browser are pretty clever about caching such static files.
- It's fun: The only reason why I do this in the first place is because I enjoy building stuff like this.
Implementing a fuzzy search algorithm
I think I've talked enough about the why, know we can take a look at how we would implement such an algorithm, but we first need to decide which one we want to implement. There are a couple of decent options, each with their respect up- and downsides, I will take a look at two of them:
-
Levenshtein Distance
- Pros:
- Simple to understand and implement
- Accurate for typo detection
- Cons:
- Slower on large data
- Doesn’t handle sound-based similarity
-
Soundex / Metaphone
- Pros:
- Very fast and efficient
- Simple phonetic matching
- Cons:
- Not typo-tolerant
- Limited to English pronunciation
For this example I will go with the Levenshtein Distance algorithm.
Levenshtein Distance
In general terms the Levenshtein distance is a number that indicates how different two strings are. The smaller the number, the closer these two strings match. If you wanted to be more precise you could say, it's the number of symbols which have to be replaced, swapped or deleted until these two strings match exactly.
The steps needed to calculate the distance can be expressed as a recursive function which looks like this:
Let a = a_1 a_2 ... a_m and b = b_1 b_2 ... b_n be two strings. Define distance matrix D of size (m+1)\times(n+1), where:
D(i, j) =
\begin{cases}
i & \text{if } j = 0 \\
j & \text{if } i = 0 \\
\min \left\{
\begin{array}{l}
D(i-1, j) + 1 \\
D(i, j-1) + 1 \\
D(i-1, j-1) + \delta(a_i, b_j)
\end{array}
\right. & \text{otherwise}
\end{cases}
where the cost function \delta(a_i, b_j) is defined as:
\delta(a_i, b_j) =
\begin{cases}
0 & \text{if } a_i = b_j \\
1 & \text{if } a_i \ne b_j
\end{cases}
Before we start calculating the difference we should check if the two strings we want to compare already match, if that's the case we can skip the calculation and assert that the distance is 1, since both Strings are already identical.
If that's not the case, we need to calculate the distance manually. One way to calculate the distance would be to transform each string into a vector, such that “Ford Prefect” would become ["f", "o", "r", "d", " ", "p", "r", "e", "f", "e", "c", "t"]
.
After that we can start calculating our matrix. To illustrate how this is done, we compare "Ford Prefect" with "Ford Perfect", since the first part already matches, we only need to compare the last part, which would look something like this:
|
"" |
p |
e |
r |
f |
e |
c |
t |
"" |
0 |
1 |
2 |
3 |
4 |
5 |
6 |
7 |
p |
1 |
0 |
1 |
2 |
3 |
4 |
5 |
6 |
r |
2 |
1 |
1 |
1 |
2 |
3 |
4 |
5 |
e |
3 |
2 |
1 |
2 |
2 |
2 |
3 |
4 |
f |
4 |
3 |
2 |
2 |
2 |
3 |
3 |
4 |
e |
5 |
4 |
3 |
3 |
3 |
2 |
3 |
4 |
c |
6 |
5 |
4 |
4 |
4 |
3 |
2 |
3 |
t |
7 |
6 |
5 |
5 |
5 |
4 |
3 |
2 |
The resulting distance is 2 (the number in the bottom right corner).
Particle Example
I got sidetrack earlier talking about my Post Notes Project, which is currently my main use for this algorithm.
I've implemented the search algorithm their so you can quickly search through all of my notes. The somewhat finished approach I've gone with looks like this:
class NormalizedDistance {
inner;
constructor(distance, lengthA, lengthB) {
const maxLength = Math.max(lengthA, lengthB);
this.inner = maxLength === 0 ? 1 : 1 - distance / maxLength;
}
}
class LevenshteinDistance {
lengthA;
lengthB;
inner;
constructor(a, b) {
this.lengthA = a.length;
this.lengthB = b.length;
const matrix = Array.from({ length: a.length + 1 }, () => []);
for (let i = 0; i <= a.length; i++) matrix[i][0] = i;
for (let j = 0; j <= b.length; j++) matrix[0][j] = j;
for (let i = 1; i <= a.length; i++) {
for (let j = 1; j <= b.length; j++) {
const cost = a[i - 1] === b[j - 1] ? 0 : 1;
matrix[i][j] = Math.min(
matrix[i - 1][j] + 1,
matrix[i][j - 1] + 1,
matrix[i - 1][j - 1] + cost
);
}
}
this.inner = matrix.pop().pop();
}
normalize() {
return new NormalizedDistance(this.inner, this.lengthA, this.lengthB);
}
}
You may have noticed that I normalize the result so that I get a reliable result even if the length of both strings differ widely. The current implementation is by far not perfect, but I'm currently short on time, and it's better to be done then to strive for perfection and never finish the project.
I still want to revisit this issue and refine the search a bit. Maybe try out a more functional approach to save space, or look into different concepts like tokenization.