Fallback with Maximal Common Substring if no similarity found

2026-04-14 03:59:56 +01:00
parent b19ffff91b
commit 88774e7cc2
2 changed files with 83 additions and 29 deletions
--- a/README.md
+++ b/README.md
@@ -1,11 +1,12 @@
 # fuzzy-match

-A fast command-line tool for fuzzy string matching using the Damerau-Levenshtein distance algorithm.
+A fast command-line tool for fuzzy string matching using the Damerau-Levenshtein distance algorithm, with a longest-common-substring fallback when no line reaches a similarity of `0.5`.

 ## Features

 - **Damerau-Levenshtein Distance**: Measures similarity between strings accounting for insertions, deletions, substitutions, and transpositions
- **Normalized Scoring**: Calculates similarity score as `distance / MAX(queryLength, lineLength)` for fair comparison regardless of string lengths
+- **Normalized Scoring**: Calculates similarity score as `1 - distance / MAX(queryLength, lineLength)` so higher scores are better
+- **Fallback Matching**: If the best Damerau-Levenshtein similarity is below `0.5`, recalculates every score from the length of the longest (maximal) common substring
 - **Sorted Output**: Results are sorted by similarity score (best matches first)
 - **Efficient Processing**: Handles large input streams with dynamic memory allocation

@@ -41,14 +42,14 @@ echo -e "apple\napple pie\norange\nbanana\nappl" | fuzzy-match "apple"

 ### Output Format

-Each line is printed with its similarity score (lower is more similar):
+Each line is printed with its similarity score (higher is more similar):

 ```
-0.0000	apple
-0.2000	appl
-0.5000	apple pie
-0.6667	banana
-1.0000	orange
+1.0000	apple
+0.8000	appl
+0.5556	apple pie
+0.1667	banana
+0.1667	orange
 ```

 ## Examples
@@ -56,28 +57,36 @@ Each line is printed with its similarity score (lower is more similar):
 ### Basic matching
 ```bash
 $ echo -e "cat\ncar\ndog\nhat" | fuzzy-match "cat"
-0.0000	cat
-0.3333	car
+1.0000	cat
+0.6667	car
 0.6667	hat
-1.0000	dog
+0.0000	dog
 ```

 ### Matching with typos
 ```bash
 $ echo -e "programming\nprograming\nprogram\nprogamming" | fuzzy-match "programming"
-0.0000	programming
-0.0909	programing
-0.1818	progamming
-0.3333	program
+1.0000	programming
+0.9091	programing
+0.9091	progamming
+0.6364	program
 ```

+### Fallback to maximal common substring
+If no Damerau-Levenshtein similarity reaches `0.5`, every score is recalculated as the longest common substring length divided by `MAX(queryLength, lineLength)`.
+
 ## Algorithm

-The program implements the **Damerau-Levenshtein distance** algorithm, which measures the minimum number of single-character edits (insertions, deletions, substitutions, and transpositions) needed to transform one string into another.
+The program first computes a **Damerau-Levenshtein similarity**, based on the minimum number of single-character edits (insertions, deletions, substitutions, and transpositions) needed to transform one string into another.

-The similarity score is normalized to account for string length differences:
+The primary similarity score is normalized to account for string length differences:
 ```
-similarity_score = damerau_levenshtein_distance / MAX(query_length, line_length)
+similarity_score = 1 - damerau_levenshtein_distance / MAX(query_length, line_length)
+```
+
+If the highest primary similarity is below `0.5`, the program recalculates every score using the longest common substring length instead:
+```
+similarity_score = longest_common_substring_length / MAX(query_length, line_length)
 ```

 ## Installation
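
The two-stage scoring rule described in the Algorithm hunk above can be sketched in Python as follows. This is a minimal sketch, not the code from this commit: the function names (`osa_distance`, `longest_common_substring_length`, `score_lines`) are hypothetical, it assumes the restricted (optimal string alignment) variant of Damerau-Levenshtein, and its handling of empty strings and ties is a guess, since the README specifies none of these.

```python
def osa_distance(a: str, b: str) -> int:
    """Restricted Damerau-Levenshtein (optimal string alignment) distance.

    Assumption: the README does not say which Damerau-Levenshtein variant
    the tool uses; this is the common restricted form.
    """
    rows, cols = len(a) + 1, len(b) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i  # delete all of a[:i]
    for j in range(cols):
        d[0][j] = j  # insert all of b[:j]
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # match or substitution
            )
            # Transposition of two adjacent characters.
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[-1][-1]


def longest_common_substring_length(a: str, b: str) -> int:
    """Length of the longest contiguous run shared by a and b."""
    best = 0
    prev = [0] * (len(b) + 1)
    for ch_a in a:
        curr = [0] * (len(b) + 1)
        for j, ch_b in enumerate(b, start=1):
            if ch_a == ch_b:
                curr[j] = prev[j - 1] + 1  # extend the run ending just before
                best = max(best, curr[j])
        prev = curr
    return best


def _ratio(value: int, query: str, line: str) -> float:
    """value / MAX(query_length, line_length); 0.0 if both are empty (assumption)."""
    longest = max(len(query), len(line))
    return value / longest if longest else 0.0


def score_lines(query: str, lines: list[str], threshold: float = 0.5):
    """Return (score, line) pairs, best first, using the fallback rule."""
    scores = [1 - _ratio(osa_distance(query, line), query, line) for line in lines]
    # Fallback: if even the best primary score is below the threshold,
    # recalculate every score from the longest common substring length.
    if scores and max(scores) < threshold:
        scores = [
            _ratio(longest_common_substring_length(query, line), query, line)
            for line in lines
        ]
    # sorted() is stable, so ties keep their input order (assumption about ties).
    return sorted(zip(scores, lines), key=lambda pair: -pair[0])


if __name__ == "__main__":
    for score, line in score_lines("cat", ["cat", "car", "dog", "hat"]):
        print(f"{score:.4f}\t{line}")
```

Run as a script, this prints the same four rows as the "Basic matching" example (`1.0000 cat`, `0.6667 car`, `0.6667 hat`, `0.0000 dog`), because the best primary score is `1.0` and the fallback never triggers. For a query such as `abcdef` against the single line `xxxabcd`, the primary score is `1 - 5/7`, which is below `0.5`, so the sketch switches to the substring score `4/7`.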