118 lines
2.7 KiB
Markdown
118 lines
2.7 KiB
Markdown
# fuzzy-match
|
|
|
|

|
|
|
|
A fast command-line tool for fuzzy string matching using the Damerau-Levenshtein distance algorithm, with a longest-common-substring fallback when no strong match is found.
|
|
|
|
## Features
|
|
|
|
- **Damerau-Levenshtein Distance**: Measures similarity between strings accounting for insertions, deletions, substitutions, and transpositions
|
|
- **Normalized Scoring**: Calculates similarity score as `1 - distance / MAX(queryLength, lineLength)` so higher scores are better
|
|
- **Fallback Matching**: If the best Damerau-Levenshtein similarity is below `0.5`, recalculates every score using the maximal common substring length
|
|
- **Sorted Output**: Results are sorted by similarity score (best matches first)
|
|
- **Efficient Processing**: Handles large input streams with dynamic memory allocation
|
|
|
|
## Building
|
|
|
|
### Prerequisites
|
|
- GCC compiler
|
|
- Make (optional)
|
|
|
|
### Compilation
|
|
|
|
Using Make:
|
|
```bash
|
|
make
|
|
```
|
|
|
|
Or directly with GCC:
|
|
```bash
|
|
gcc -o fuzzy-match fuzzy-match.c -lm
|
|
```
|
|
|
|
## Usage
|
|
|
|
```bash
|
|
fuzzy-match <query> < input.txt
|
|
```
|
|
|
|
or pipe input from another command:
|
|
|
|
```bash
|
|
echo -e "apple\napple pie\norange\nbanana\nappl" | fuzzy-match "apple"
|
|
```
|
|
|
|
### Output Format
|
|
|
|
Each line is printed with its similarity score (higher is more similar):
|
|
|
|
```
|
|
1.0000 apple
|
|
0.8000 appl
|
|
0.5556 apple pie
|
|
0.1667 banana
|
|
0.1667 orange
|
|
```
|
|
|
|
## Examples
|
|
|
|
### Basic matching
|
|
```bash
|
|
$ echo -e "cat\ncar\ndog\nhat" | fuzzy-match "cat"
|
|
1.0000 cat
|
|
0.6667 car
|
|
0.6667 hat
|
|
0.0000 dog
|
|
```
|
|
|
|
### Matching with typos
|
|
```bash
|
|
$ echo -e "programming\nprograming\nprogram\nprogamming" | fuzzy-match "programming"
|
|
1.0000 programming
|
|
0.9091 programing
|
|
0.9091 progamming
|
|
0.6364 program
|
|
```
|
|
|
|
### Fallback to maximal common substring
|
|
If no Damerau-Levenshtein similarity reaches `0.5`, every score is recalculated using the longest common substring length instead.
|
|
|
|
## Algorithm
|
|
|
|
The program first computes a **Damerau-Levenshtein similarity**, based on the minimum number of single-character edits (insertions, deletions, substitutions, and transpositions) needed to transform one string into another.
|
|
|
|
The primary similarity score is normalized to account for string length differences:
|
|
```
|
|
similarity_score = 1 - damerau_levenshtein_distance / MAX(query_length, line_length)
|
|
```
|
|
|
|
If the highest primary similarity is below `0.5`, the program recalculates every score using the maximal common substring length instead:
|
|
```
|
|
similarity_score = longest_common_substring_length / MAX(query_length, line_length)
|
|
```
|
|
|
|
## Installation
|
|
|
|
Install to system PATH:
|
|
```bash
|
|
make install
|
|
```
|
|
|
|
Uninstall:
|
|
```bash
|
|
make uninstall
|
|
```
|
|
|
|
## Cleaning
|
|
|
|
Remove compiled binaries:
|
|
```bash
|
|
make clean
|
|
```
|
|
|
|
## License
|
|
|
|
This project is licensed under the GNU General Public License, version 3.
|
|
See `LICENSE` for the full text.
|
|
|