Fallback with Maximal Common Substring if no similarity found

2026-04-14 03:59:56 +01:00
parent b19ffff91b
commit 88774e7cc2
2 changed files with 83 additions and 29 deletions

@@ -1,11 +1,12 @@
# fuzzy-match
A fast command-line tool for fuzzy string matching using the Damerau-Levenshtein distance algorithm.
A fast command-line tool for fuzzy string matching using the Damerau-Levenshtein distance algorithm, with a longest-common-substring fallback when no strong match is found.
## Features
- **Damerau-Levenshtein Distance**: Measures similarity between strings accounting for insertions, deletions, substitutions, and transpositions
- **Normalized Scoring**: Calculates similarity score as `distance / MAX(queryLength, lineLength)` for fair comparison regardless of string lengths
- **Normalized Scoring**: Calculates similarity score as `1 - distance / MAX(queryLength, lineLength)` so higher scores are better
- **Fallback Matching**: If the best Damerau-Levenshtein similarity is below `0.5`, recalculates every score as `longestCommonSubstringLength / MAX(queryLength, lineLength)`
- **Sorted Output**: Results are sorted by similarity score (best matches first)
- **Efficient Processing**: Handles large input streams with dynamic memory allocation
@@ -41,14 +42,14 @@ echo -e "apple\napple pie\norange\nbanana\nappl" | fuzzy-match "apple"
### Output Format
Each line is printed with its similarity score (lower is more similar):
Each line is printed with its similarity score (higher is more similar):
```
0.0000 apple
0.2000 appl
0.5000 apple pie
0.6667 banana
1.0000 orange
1.0000 apple
0.8000 appl
0.5556 apple pie
0.1667 banana
0.1667 orange
```
## Examples
@@ -56,28 +57,36 @@ Each line is printed with its similarity score (lower is more similar):
### Basic matching
```bash
$ echo -e "cat\ncar\ndog\nhat" | fuzzy-match "cat"
0.0000 cat
0.3333 car
1.0000 cat
0.6667 car
0.6667 hat
1.0000 dog
0.0000 dog
```
### Matching with typos
```bash
$ echo -e "programming\nprograming\nprogram\nprogamming" | fuzzy-match "programming"
0.0000 programming
0.0909 programing
0.1818 progamming
0.3333 program
1.0000 programming
0.9091 programing
0.9091 progamming
0.6364 program
```
### Fallback to maximal common substring
If no Damerau-Levenshtein similarity reaches `0.5`, every score is recalculated using the longest common substring length instead.
## Algorithm
The program implements the **Damerau-Levenshtein distance** algorithm, which measures the minimum number of single-character edits (insertions, deletions, substitutions, and transpositions) needed to transform one string into another.
The program first computes a **Damerau-Levenshtein similarity**, based on the minimum number of single-character edits (insertions, deletions, substitutions, and transpositions) needed to transform one string into another.
The similarity score is normalized to account for string length differences:
The primary similarity score is normalized to account for string length differences:
```
similarity_score = damerau_levenshtein_distance / MAX(query_length, line_length)
similarity_score = 1 - damerau_levenshtein_distance / MAX(query_length, line_length)
```
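The primary score can be sketched in Python as follows. This is an illustrative sketch, not the tool's actual source: the function names `osa_distance` and `similarity` are hypothetical, the distance uses the restricted (optimal string alignment) variant of Damerau-Levenshtein, which may differ from the variant the tool implements, and returning `1.0` for two empty strings is an assumption.

```python
def osa_distance(a: str, b: str) -> int:
    """Restricted Damerau-Levenshtein (optimal string alignment) distance.

    Assumption: the tool may use a different Damerau-Levenshtein variant.
    """
    rows, cols = len(a) + 1, len(b) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i  # delete all i characters of a
    for j in range(cols):
        d[0][j] = j  # insert all j characters of b
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # substitution or match
            )
            # Transposition of two adjacent characters
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[len(a)][len(b)]


def similarity(query: str, line: str) -> float:
    """1 - distance / MAX(query_length, line_length); higher is better."""
    longest = max(len(query), len(line))
    if longest == 0:
        return 1.0  # assumption: two empty strings count as identical
    return 1.0 - osa_distance(query, line) / longest


print(f"{similarity('apple', 'appl'):.4f}")  # 0.8000
```

For the query `apple` and the line `appl`, the distance is 1 and the longer string has 5 characters, so the score is `1 - 1/5 = 0.8000`, which agrees with the sample output above.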
If the highest primary similarity is below `0.5`, the program recalculates every score using the length of the longest (maximal) common substring instead, normalized by the same denominator:
```
similarity_score = longest_common_substring_length / MAX(query_length, line_length)
```
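The fallback rule can be sketched in Python as follows. This is a sketch under stated assumptions rather than the tool's implementation: `longest_common_substring_length`, `rescore_with_fallback`, and `THRESHOLD` are hypothetical names, the primary scores are passed in precomputed, and scoring two empty strings as `0.0` is an assumption.

```python
THRESHOLD = 0.5  # fallback threshold stated in the README


def longest_common_substring_length(a: str, b: str) -> int:
    """Length of the longest contiguous run of characters shared by a and b."""
    best = 0
    prev = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        cur = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            if a[i - 1] == b[j - 1]:
                # Extend the common run that ended at the previous characters
                cur[j] = prev[j - 1] + 1
                best = max(best, cur[j])
        prev = cur
    return best


def rescore_with_fallback(query, lines, primary_scores):
    """Keep the primary scores unless the best one is below THRESHOLD."""
    if not primary_scores or max(primary_scores) >= THRESHOLD:
        return list(primary_scores)
    rescored = []
    for line in lines:
        longest = max(len(query), len(line))
        if longest == 0:
            rescored.append(0.0)  # assumption: empty inputs score zero
        else:
            rescored.append(longest_common_substring_length(query, line) / longest)
    return rescored


# Hypothetical primary score below 0.5, so the fallback applies:
# "match" is the common substring (5 characters) and the line has 16.
print(rescore_with_fallback("match", ["fuzzy-match-tool"], [0.3]))  # [0.3125]
```

Because the decision depends on the best score across all lines, one strong match keeps every line on the primary Damerau-Levenshtein score. The fallback only takes effect when no line reaches the threshold.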
## Installation