Files

Seán Healy 88774e7cc2 Fallback with Maximal Common Substring if no similarity found

2026-04-14 03:59:56 +01:00

2.7 KiB

Raw Blame History

fuzzy-match

A fast command-line tool for fuzzy string matching using the Damerau-Levenshtein distance algorithm, with a longest-common-substring fallback when no strong match is found.

Features

Damerau-Levenshtein Distance: Measures similarity between strings accounting for insertions, deletions, substitutions, and transpositions
Normalized Scoring: Calculates similarity score as 1 - distance / MAX(queryLength, lineLength) so higher scores are better
Fallback Matching: If the best Damerau-Levenshtein similarity is below 0.5, recalculates every score using the maximal common substring length
Sorted Output: Results are sorted by similarity score (best matches first)
Efficient Processing: Handles large input streams with dynamic memory allocation

Building

Prerequisites

GCC compiler
Make (optional)

Compilation

Using Make:

make

Or directly with GCC:

gcc -o fuzzy-match fuzzy-match.c -lm

Usage

fuzzy-match <query> < input.txt

or pipe input from another command:

echo -e "apple\napple pie\norange\nbanana\nappl" | fuzzy-match "apple"

Output Format

Each line is printed with its similarity score (higher is more similar):

1.0000	apple
0.8000	appl
0.5556	apple pie
0.1667	banana
0.1667	orange

Examples

Basic matching

$ echo -e "cat\ncar\ndog\nhat" | fuzzy-match "cat"
1.0000	cat
0.6667	car
0.6667	hat
0.0000	dog

Matching with typos

$ echo -e "programming\nprograming\nprogram\nprogamming" | fuzzy-match "programming"
1.0000	programming
0.9091	programing
0.9091	progamming
0.6364	program

Fallback to maximal common substring

If no Damerau-Levenshtein similarity reaches 0.5, every score is recalculated using the longest common substring length instead.

Algorithm

The program first computes a Damerau-Levenshtein similarity, based on the minimum number of single-character edits (insertions, deletions, substitutions, and transpositions) needed to transform one string into another.

The primary similarity score is normalized to account for string length differences:

similarity_score = 1 - damerau_levenshtein_distance / MAX(query_length, line_length)

If the highest primary similarity is below 0.5, the program recalculates every score using the maximal common substring length instead:

similarity_score = longest_common_substring_length / MAX(query_length, line_length)

Installation

Install to system PATH:

make install

Uninstall:

make uninstall

Cleaning

Remove compiled binaries:

make clean

License

This project is licensed under the GNU General Public License, version 3. See LICENSE for the full text.

2.7 KiB Raw Blame History