Text: add fun isSubsequenceOf #369

Anton-Latukha · 2021-09-25T08:48:53Z

Closes report: #368

Function to detect ordered subsequences in the content.

Bodigrim

@Anton-Latukha nice! But could we possibly squeeze out more performance by operating on underlying buffers? I imagine fuzzy search needs to be as fast as possible.

src/Data/Text.hs

Anton-Latukha · 2021-09-27T09:56:58Z

I currently do not have enough knowledge of low-level and high performance Haskell. But would learn and try to improve.

Currently, next I would cover the supplementary quality required: tests, benchmarks and things that good library piece requires. Having that would look into how further optimization.

Thank you for your work and reviews.

Lysxia · 2021-09-28T13:49:10Z

src/Data/Text.hs

@@ -1877,6 +1879,29 @@ isInfixOf needle haystack
    | otherwise          = not . L.null . indices needle $ haystack
 {-# INLINE [1] isInfixOf #-}

+-- | The 'isSubsequenceOf' function takes two 'Text's and returns
+-- 'True' iff the second is a subsequence of the first.


This looks backwards. first `isSubsequenceOf` second should be testing whether first is a subsequence of second.

The order of arguments does not impact performance at all here. The order of argument's I would expect is that first `isSubsequenceOf` second matches the English specification "first is a subsequence of second". Furhtermore this would be consistent with the order of arguments of isInfixOf.

Maybe you're thinking of a future implementation where you might precompute a state machine given the subsequence, but 1) this is not it, 2) then it could be called isSupersequenceOf instead.

I feel that we digress here. The original ticket has a pretty solid motivation: there is Data.List.isSubsequenceOf, it is desirable to have a counterpart working over Text. However, swapping arguments of Data.Text.isSubsequence to an opposite of Data.List.isSubsequenceOf directly contradicts this goal.

Currently all isFooBarOf functions in core packages take a needle first and a haystack second. If there is an interest to change this convention, let's discuss it separately (and probably not even in text issue tracker). FWIW I think that even from a theoretical viewpoint it is better to keep a needle first: fuzzy search compares a fixed needle against thousands of file names.

Further, if there is a desire to add isSupersequenceOf or any other function to the public API, let's discuss it in an issue first.

In the meantime I suggest we stick to the convention and make isSubsequenceOf to take a needle first.

Lysxia · 2021-09-28T13:51:20Z

src/Data/Text.hs

+-- 'True' iff the second is a subsequence of the first.
+-- (characters of the second argument appear in same sequential order in
+-- the first, to say if second argument can be derived by deleting some
+-- or no elements from the first).


It seems worth mentioning in the doc that this is useful for fuzzy search.

Boarders · 2021-09-28T21:29:17Z

I didn't test this (and so I wouldn't trust it to actually be correct!), but just to give you a rough idea of how a more efficient implementation might look, something like this might be a good starting point:

isSubsequenceOf :: Text -> Text -> Bool
isSubsequenceOf (Text needle o1 l1) (Text target o2 l2) = go o1 o2
  where
    go :: Int -> Int -> Bool
    go i _ | i >= l1 = True
    go _ j | j >= l2 = False
    go !i !j =
      let
        !(Iter c d) =  iterArray needle i
        !(Iter c' d') = iterArray target j
      in
        if c == c'
          then go (i + d) (j + d')
          else go i (j + d')

There is plenty of good material to steal from in the rest of the library too if you want to see how to work with the underlying array. Getting tests and benchmarks going before figuring this out would be great though!

Function to detect ordered subsequences in the content.

Previous argument order was alike `lookup` argument order mistake, when key cached (compiler saturates function) across DBs. Efficient use of the function is to all it once (per pattern) on the whole contents. So the contents should be cached first, and the search patterns are switchable.

As suggested, indeed `uncons` replaces 3 funciton invocations with 1.

`on` now ensures that `uncons` can be specialized/consumed once. It would be interesting to look into Core at some point of work.

This way may have more stack reuse.

ghost

Happy to see the function in the library, but there seems to be some issues with this implementation.

If we had efficient implementations of findCharIndex :: Char -> Text -> Maybe Int then we could make this function quite fast in C on the underlying utf-8 representation.

ghost · 2022-06-22T00:33:22Z

src/Data/Text.hs

+  subseqOf :: Text -> Text -> Bool
+  subseqOf t s =
+    on f uncons t s
+   where
+    f :: Maybe (Char, Text) -> Maybe (Char, Text) -> Bool
+    f _ Nothing = True
+    f Nothing _ = False
+    f (Just (tc,ts)) (Just (sc,ss)) =
+      subseqOf ts $
+        if tc == sc
+          then s
+          else ss


Is there a good reason not to just use

isSubsequenceOf needles haystack = List.isSubsequenceOf (T.unpack needles) (T.unpack haystack)

This code is doing exactly the same work, since T.unpack (as far as I understand) is just repeated calls to uncons. I can't see how there's be any performance here (though I could see an argument to using an optimised implementation in C if there's a more efficient algorithm that takes advantage of the UTF-8 encoding).

ghost · 2022-06-22T00:33:25Z

src/Data/Text.hs

+--
+-- Examples:
+--
+-- >>> isSubsequenceOf "1234567" "1356"


Aren't the arguments her backwards? Generally isInfixOf etc. are used infix:

>>> "ello" `isInfixOf` "Hello, world!" True

but this would be

"1234567" `isSubsequenceOf` "1356"

which should be False, if you read it in English. It should be isSubsequenceOf needles haystack - "1234567" is not a subsequence of "1356".

ghost · 2022-06-22T00:45:17Z

tests/Tests/Properties/Substrings.hs

+genOrdSubseq txt =
+  T.pack . transform <$> genTransformMap


I can't understand what this implementation is actually doing, but is this not all we need?

Suggested change

genOrdSubseq txt =

T.pack . transform <$> genTransformMap

genOrdSubseq txt =

T.pack . map fst . filter snd . zip (T.unpack txt) <$> vectorOf (T.length txt) arbitrary :: Gen [Bool]

We're just trying to select a random sample of elements from the given string in order right? Are there any other properties we need from this? Trying to ensure that the result is non-empty isn't too hard, you just generate two lists of Bools:

Suggested change

genOrdSubseq txt =

T.pack . transform <$> genTransformMap

genOrdSubseq txt = do

firstLen <- choose (1,T.length txt - 2)

prefix <- vectorOf firstLen arbitrary

suffix <- vectorOf (T.length txt - firstLen - 1) arbitrary

pure $ T.pack . map fst . filter snd . zip (T.unpack txt) $ prefix ++ True : suffix

(not sure if I have the maths right, but I'm trying to make a list of length txt which has a True somewhere in the middle)

ghost · 2022-06-22T01:00:43Z

uh, sorry, I didn't see others' reviews when I was writing mine, and it looks like most of my comments have been mentioned already. Sorry for the duplicate review, but I hope I have something to add nonetheless.

Lysxia · 2023-02-07T15:14:11Z

I would accept the implementation mentioned in an earlier comment:

isSubsequenceOf needles haystack = List.isSubsequenceOf (T.unpack needles) (T.unpack haystack)

If you want to stick to the current implementation,

the order of argument should be the same as Data.List.isSubsequenceOf
the auxiliary subseqOf should avoid doing redundant uncons when the needle doesn't change. This can be done by specializing it to a non-empty needle (represented as a Char and Text argument

Compared to using List.isSubsequenceOf, a well optimized implementation would have the advantage of not allocating intermediate data structures (I don't think you need any unsafe primitives, GHC's worker-wrapper + inlining should be sufficient).

Bodigrim reviewed Sep 25, 2021

View reviewed changes

src/Data/Text.hs Outdated Show resolved Hide resolved

Lysxia reviewed Sep 28, 2021

View reviewed changes

Lysxia added the API Addition (PVP minor) label Sep 28, 2021

Anton-Latukha mentioned this pull request Oct 5, 2021

Tests.QuickCheckUtils: refactors #374

Draft

Anton-Latukha force-pushed the 2021-09-25-add-isSubsequenceOf branch from 505ffd3 to 6f06b72 Compare March 2, 2022 17:34

Anton-Latukha added 7 commits March 2, 2022 19:35

Text: add fun isSubsequenceOf

813989c

Function to detect ordered subsequences in the content.

WIP: Text: isSubsequenceOf: ({null, unsafeHead, unsafeTail} -> uncons)

173c1ae

As suggested, indeed `uncons` replaces 3 funciton invocations with 1.

WIP: Text: isSubsequenceOf: lambda to imperative

f477ba6

`on` now ensures that `uncons` can be specialized/consumed once. It would be interesting to look into Core at some point of work.

WIP: Text: isSubsequenceOf: flip local fun

2a8e3d9

This way may have more stack reuse.

WIP: Text: isSubsequenceOf: upd doc (add examples of code & use)

865ada4

WIP: Tests.Properties.Substrings: add genOrdSubseq

073d676

Anton-Latukha force-pushed the 2021-09-25-add-isSubsequenceOf branch from 6f06b72 to 073d676 Compare March 2, 2022 18:15

ghost reviewed Jun 22, 2022

View reviewed changes

ghost mentioned this pull request Jun 22, 2022

Add efficient findCharIndex #445

Open

axman6 mentioned this pull request Jul 1, 2022

Fast codepointOffset #451

Draft

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Text: add fun isSubsequenceOf #369

Text: add fun isSubsequenceOf #369

Anton-Latukha commented Sep 25, 2021

Bodigrim left a comment

Anton-Latukha commented Sep 27, 2021

Lysxia Sep 28, 2021

This comment was marked as outdated.

Lysxia Sep 28, 2021 •

edited

This comment was marked as outdated.

Bodigrim Sep 29, 2021 •

edited

This comment was marked as outdated.

Lysxia Sep 28, 2021

Boarders commented Sep 28, 2021 •

edited

ghost left a comment

ghost Jun 22, 2022

ghost Jun 22, 2022

ghost Jun 22, 2022 •

edited by ghost

ghost commented Jun 22, 2022

Lysxia commented Feb 7, 2023

-genOrdSubseq txt =
-  T.pack . transform <$> genTransformMap
+genOrdSubseq txt = do
+    firstLen <- choose (1,T.length txt - 2)
+    prefix <-  vectorOf firstLen arbitrary
+    suffix <- vectorOf (T.length txt - firstLen - 1) arbitrary
+    pure $ T.pack . map fst . filter snd . zip (T.unpack txt) $ prefix ++ True : suffix

Text: add fun isSubsequenceOf #369

Are you sure you want to change the base?

Text: add fun isSubsequenceOf #369

Conversation

Anton-Latukha commented Sep 25, 2021

Bodigrim left a comment

Choose a reason for hiding this comment

Anton-Latukha commented Sep 27, 2021

Lysxia Sep 28, 2021

Choose a reason for hiding this comment

This comment was marked as outdated.

Lysxia Sep 28, 2021 • edited

Choose a reason for hiding this comment

This comment was marked as outdated.

Bodigrim Sep 29, 2021 • edited

Choose a reason for hiding this comment

This comment was marked as outdated.

Lysxia Sep 28, 2021

Choose a reason for hiding this comment

Boarders commented Sep 28, 2021 • edited

ghost left a comment

Choose a reason for hiding this comment

ghost Jun 22, 2022

Choose a reason for hiding this comment

ghost Jun 22, 2022

Choose a reason for hiding this comment

ghost Jun 22, 2022 • edited by ghost

Choose a reason for hiding this comment

ghost commented Jun 22, 2022

Lysxia commented Feb 7, 2023

Lysxia Sep 28, 2021 •

edited

Bodigrim Sep 29, 2021 •

edited

Boarders commented Sep 28, 2021 •

edited

ghost Jun 22, 2022 •

edited by ghost