Support Levenshtein substring matching #12

Open
dsamarin opened this issue Oct 22, 2019 · 3 comments

Comments

@dsamarin

It may be very useful to some to provide approximate substring matching, which reports the smallest edit distance between the needle and all possible substrings of the haystack. Here are some examples:

tests := []struct {
	needle, haystack string
	want             int
}{
	{"Machu Picchu", "Machu Picchu", 0},
	{"Machu Picchu", "Where is Machu Picchu?", 0},
	{"Machu Picchu", "Where is Machu Pichu?", 1},
	{"Machu Picchu", "Where is Macchu Picchu?", 1},
	{"Stonehenge", "How do I get to Stone Henge?", 2},
	{"", "", 0},
	{"", "Machu Picchu", 0},
	{"u", "Machu Picchu", 0},
	{"z", "Machu Picchu", 1},
	{"Uluru", "", 5},
}

All that is necessary is to initialize the row with all zeroes, run the main algorithm, and then return the smallest value in the row. I have developed some code and was wondering what your thoughts are on including it (or some more optimized variant) in this package:

// Requires "unicode/utf8"; min is the package's existing two-argument helper.
func ComputeSubstringDistance(needle, haystack string) int {
	if len(needle) == 0 {
		return 0
	}

	if len(haystack) == 0 {
		return utf8.RuneCountInString(needle)
	}

	if needle == haystack {
		return 0
	}

	// We need to convert to []rune if the strings are non-ASCII.
	// This could be avoided by using utf8.RuneCountInString
	// and then doing some juggling with rune indices.
	// The primary challenge is keeping track of the previous rune.
	// With a range loop, it's not that easy, and with a for loop
	// we would need to track the inter-rune width using utf8.DecodeRuneInString.
	s1 := []rune(haystack)
	s2 := []rune(needle)

	lenS1 := len(s1)
	lenS2 := len(s2)

	// Init the row. make() zero-fills it, which is exactly the
	// initialization the substring variant needs: a match may start
	// at any position in the haystack at no cost.
	x := make([]int, lenS1+1)

	// A dummy bounds check to eliminate the two bounds checks below.
	// The one inside the loop is particularly costly.
	_ = x[lenS1]
	// Run the standard DP row by row.
	for i := 1; i <= lenS2; i++ {
		prev := i
		var current int
		for j := 1; j <= lenS1; j++ {
			if s2[i-1] == s1[j-1] {
				current = x[j-1] // match
			} else {
				current = min(min(x[j-1]+1, prev+1), x[j]+1)
			}
			x[j-1] = prev
			prev = current
		}
		x[lenS1] = prev
	}

	// The answer is the smallest value in the last row: the match may
	// end at any position in the haystack.
	best := x[0]
	for i := 1; i <= lenS1; i++ {
		if x[i] < best {
			best = x[i]
		}
	}
	return best
}
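
For reference, the table at the top plugs straight into a table-driven test. A sketch (the test name is mine; this would live in a _test.go file in the same package, with "testing" imported):

func TestComputeSubstringDistance(t *testing.T) {
	tests := []struct {
		needle, haystack string
		want             int
	}{
		{"Machu Picchu", "Where is Machu Pichu?", 1},
		{"Stonehenge", "How do I get to Stone Henge?", 2},
		{"Uluru", "", 5},
		// ...plus the remaining cases from the table above.
	}
	for _, tt := range tests {
		if got := ComputeSubstringDistance(tt.needle, tt.haystack); got != tt.want {
			t.Errorf("ComputeSubstringDistance(%q, %q) = %d, want %d",
				tt.needle, tt.haystack, got, tt.want)
		}
	}
}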
@agnivade
Owner

The test cases seem to be from here https://gist.github.com/tomstuart/9e4fd5cd96527debf7a685d0b5399635 :)

From what I am seeing, there are other, more sophisticated algorithms for fuzzy substring matching.
I am curious to know whether you encountered this use case in a real-world situation?

@dsamarin
Author

dsamarin commented Oct 23, 2019

Yes, I stumbled upon the same test cases; however, I didn't bother to read the Ruby implementation. I figured it would be better to implement it by slightly modifying your implementation of the algorithm.

Thanks for the article; it looks like I'm using the method described there:

Computing E(m, j) is very similar to computing the edit distance between two strings. In fact, we can use the Levenshtein distance computing algorithm for E(m, j), the only difference being that we must initialize the first row with zeros, and save the path of computation, that is, whether we used E(i − 1,j), E(i,j − 1) or E(i − 1,j − 1) in computing E(i, j).

In the array containing the E(x, y) values, we then choose the minimal value in the last row, let it be E(x2, y2), and follow the path of computation backwards, back to the row number 0. If the field we arrived at was E(0, y1), then T[y1 + 1] ... T[y2] is a substring of T with the minimal edit distance to the pattern P.
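
For what it's worth, here is a minimal sketch of that traceback (my own code, not part of the proposed change). It keeps the full (m+1) x (n+1) matrix, trading the O(n) memory of the single-row version for the ability to report where the best match lies; offsets are rune indices into the haystack, and min is the same two-argument helper the snippet above uses:

func substringMatch(needle, haystack string) (dist, start, end int) {
	p := []rune(needle)
	t := []rune(haystack)
	m, n := len(p), len(t)

	// E[i][j] = min edit distance between p[:i] and some substring of t ending at j.
	E := make([][]int, m+1)
	for i := range E {
		E[i] = make([]int, n+1)
		E[i][0] = i // first column costs i insertions; row 0 stays all zeros
	}
	for i := 1; i <= m; i++ {
		for j := 1; j <= n; j++ {
			cost := 1
			if p[i-1] == t[j-1] {
				cost = 0
			}
			E[i][j] = min(min(E[i-1][j-1]+cost, E[i][j-1]+1), E[i-1][j]+1)
		}
	}

	// The best match ends at the column holding the smallest value in the last row.
	end = 0
	for j := 1; j <= n; j++ {
		if E[m][j] < E[m][end] {
			end = j
		}
	}
	dist = E[m][end]

	// Follow the path of computation back to row 0; the column we land in
	// is where the matching substring starts.
	i, j := m, end
	for i > 0 {
		cost := 1
		if j > 0 && p[i-1] == t[j-1] {
			cost = 0
		}
		switch {
		case j > 0 && E[i][j] == E[i-1][j-1]+cost:
			i, j = i-1, j-1 // match or substitution
		case E[i][j] == E[i-1][j]+1:
			i-- // needle rune unmatched
		default:
			j-- // extra haystack rune inside the match
		}
	}
	start = j
	return dist, start, end
}

string(t[start:end]) is then the closest substring; for example, ("Stonehenge", "How do I get to Stone Henge?") should yield distance 2 over the span "Stone Henge".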

This is actually for a real-world use case: I am attempting to match large human-typed lists of products against a smaller list of brand names. I figured I would contribute this variant back to the project instead of forking.

As for other, more sophisticated algorithms, I haven't yet looked into anything like "filter-verification, hashing, Locality-sensitive hashing (LSH), Tries or other greedy and approximation algorithms". So far, your slightly modified library seems to work great for my use case.

@agnivade
Owner

Thanks for the explanation. Feel free to send a change with tests. I think you would need to refactor the core logic out into a helper function, along the lines of the sketch below. Please also run the benchmarks and check whether there is any regression from the function call overhead.
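
One possible shape for that refactor (hypothetical; the names are mine and the actual package layout may differ): share the inner DP loop, and let each entry point pick the row initialization and the reduction over the final row.

// computeRows runs the shared DP loop over a pre-initialized row x
// (0..len(s1) for the classic distance, all zeros for the substring
// variant) and leaves the final row in x.
func computeRows(s1, s2 []rune, x []int) {
	lenS1 := len(s1)
	_ = x[lenS1] // dummy bounds check, as before
	for i := 1; i <= len(s2); i++ {
		prev := i
		var current int
		for j := 1; j <= lenS1; j++ {
			if s2[i-1] == s1[j-1] {
				current = x[j-1] // match
			} else {
				current = min(min(x[j-1]+1, prev+1), x[j]+1)
			}
			x[j-1] = prev
			prev = current
		}
		x[lenS1] = prev
	}
}

func ComputeSubstringDistance(needle, haystack string) int {
	if len(needle) == 0 || needle == haystack {
		return 0
	}
	if len(haystack) == 0 {
		return utf8.RuneCountInString(needle)
	}
	s1, s2 := []rune(haystack), []rune(needle)
	x := make([]int, len(s1)+1) // all zeros: a match may start anywhere
	computeRows(s1, s2, x)
	best := x[0]
	for _, v := range x[1:] {
		if v < best {
			best = v
		}
	}
	return best
}

ComputeDistance would then initialize x[i] = i and return x[len(s1)]. Whether the extra call costs anything is exactly what the benchmark run should tell us.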
