Motoko Regex is live on MOPS!

Hello Motoko devs!

I am happy to announce the first iteration of the motoko regex engine for my grant has been published to mops. Find it here Mops • Motoko Package Manager

Install it using

mops add regex

import it using

import Regex "mo:regex";

Use it (this is a regex pattern to detect a SSN) the flags are set to null. Flags are a set of optional boolean values. There are only 2 flag currently, caseSensitive and multiline. It is case sensitive by default. flags are immutable so once you set them in instantiation they may not be changed.

let regex = Regex.Regex("\d{3}-\d{2}-\d{4}", null ) 

Once you have instantiated your regex object you may use the following methods to pattern match. If the compilation of the regex fails you will receive a not compiled error and a Debug error will be printed to the output log. Question, would the community rather have the engine trap in this scenario with explicit errors?

match(sometext: Text) : Result<Match, RegexError>

regex.match("launch icp 629-07-2021");

This will result in #NoMatch. Why? The matcher requires the pattern to exactly match and traverse the NFA to an accept state. It stops as soon as it encounters a character that doesn’t fit the pattern. In this case, the ‘l’ in “launch” does not match the pattern, so matching terminates immediately. This goes for any state in the nfa. If you had regex.match(“629-07-202a”); it would result in a no match because the input cannot reach the final accept state which is 11. With regex.match(“629-07-2021a”); you will have a match. This is because it terminates as soon as it finds a match, the engine does not support partial matches.

There are 3 takeaways

  • For the function to succeed, the input must traverse the states of the NFA and reach a final accept state.
  • If at any point the matcher cannot identify a valid path from the current character to a next state, it immediately terminates and returns #NoMatch.
  • The matcher stops as soon as it finds a valid match that satisfies the pattern and reaches the accept state.

Match Record Structure

When a match is successful, the result is returned as a Match record.
› testMatch(“\d{3}-\d{2}-\d{4}”, “629-07-2021”)

variant { ok = record {
  status = variant { FullMatch };
  lastIndex = 11; //last index of the match (the match end)
  string = "629-07-2021"; // Original string
  value = "629-07-2021"; //value of the match
  spans = vec { record { 0; 11 } }; // This tells the length of the match could be reduced to a single number
  position = record { 0; 11 }; //Position in the given string 
  capturedGroups = opt vec {}; //capture groups if present
}}

search(sometext: Text): Result<Match, RegexError>

regex.search("launch icp 629-07-2021");

The search function behaves similarly to match, with one key difference: it continues searching even after failing to find a path to the accept state. Instead of stopping, it progressively reduces the input text from left to right, looking for a match.

For example, given the input "launch icp 629-07-2021", the function starts with the entire string. When it doesn’t find a match, it removes the first character and tries again with "aunch icp 629-07-2021", then "unch icp 629-07-2021", and so on, until it reaches "629-07-2021", where it successfully finds a match. Potential for optimization by looking ahead before making the substring.

Once a match is found, the function terminates early and ignores the remaining text.

› testSearch(“\d{3}-\d{2}-\d{4}”, “launch icp 629-07-2021”)

variant { ok = record {
  status = variant { FullMatch };
  lastIndex = 22; 
  string = "launch icp 629-07-2021";
  value = "629-07-2021";
  spans = vec { record { 0; 11 } }; 
  position = record { 11; 22 };
  capturedGroups = opt vec {}; 
}}

findall(someText: Text) Result<[Match], RegexError>

The findAll function differs from match and search by locating all occurrences of a pattern in the input text. Instead of stopping at the first match, it scans the entire text from left to right, collecting every match along the way. Returns the array of matches

regex.findall("launch icp 629-07-2021 and 123-45-6789");

› testFindAll(“\d{3}-\d{2}-\d{4}”, “launch icp 629-07-2021 and 123-45-6789”)

variant {
  ok = vec {
    record {
      status = variant { FullMatch };
      lastIndex = 22;
      string = "launch icp 629-07-2021 and 123-45-6789";
      value = "629-07-2021";
      spans = vec { record { 0; 11 } };
      position = record { 11; 22 };
      capturedGroups = opt vec {};
    };
    record {
      status = variant { FullMatch };
      lastIndex = 38;
      string = "launch icp 629-07-2021 and 123-45-6789";
      value = "123-45-6789";
      spans = vec { record { 0; 11 } };
      position = record { 27; 38 };
      capturedGroups = opt vec {};
    };
  }
}

findIter(someText: Text) : Result<Iter, Regex Error>

As the name suggest it returns an Iterator over matches found in the text.

enableDebug(mode:Bool) : ()

Enables or disables a debug feature, when debug mode is enabled you will be able to see the matching operations in the output log.
eg:

regex.enableDebug(true);
regex.match("296-28-1928");

I have removed the time from the output log and did some manual formatting for readability

[Canister] Starting match with start state: 0
[Canister] Current character: 2
[Canister] Checking range transition: (0, #Range(48, 57), 1)
[Canister] Matched range! Moving to state 1
[Canister] Current character: 9
[Canister] Checking range transition: (1, #Range(48, 57), 2)
[Canister] Matched range! Moving to state 2
[Canister] Current character: 6
[Canister] Checking range transition: (2, #Range(48, 57), 3)
[Canister] Matched range! Moving to state 3
[Canister] Current character: -
[Canister] Comparing with transition: (3, #Char('-'), 4)
[Canister] Matched! Moving to state 4
[Canister] Current character: 2
[Canister] Checking range transition: (4, #Range(48, 57), 5)
[Canister] Matched range! Moving to state 5
[Canister] Current character: 8
[Canister] Checking range transition: (5, #Range(48, 57), 6)
[Canister] Matched range! Moving to state 6
[Canister] Current character: -
[Canister] Comparing with transition: (6, #Char('-'), 7)
[Canister] Matched! Moving to state 7
[Canister] Current character: 1
[Canister] Checking range transition: (7, #Range(48, 57), 8)
[Canister] Matched range! Moving to state 8
[Canister] Current character: 9
[Canister] Checking range transition: (8, #Range(48, 57), 9)
[Canister] Matched range! Moving to state 9
[Canister] Current character: 2
[Canister] Checking range transition: (9, #Range(48, 57), 10)
[Canister] Matched range! Moving to state 10
[Canister] Current character: 8
[Canister] Checking range transition: (10, #Range(48, 57), 11)
[Canister] Matched range! Moving to state 11
[Canister] Reached accept state 11

I encourage you to try out the engine and test its capabilities with your own patterns. If you encounter unexpected behavior, please document it and report it as an issue on GitHub. I excited to receive community feedback! What methods would you like to see added?

I am working on the documentation and it will be published here Introduction

17 Likes
3 Likes

Wow, this is exactly what I’ve been looking for! I’ll give this a try to address the search problem in my dApp. So far, I’ve been attempting to solve it using stemming, but this approach has resulted in a large key-value map, which is becoming unwieldy.

1 Like

Awesome, be sure to report any issues as feedback is greatly appreciated!

PS: if you share what you are trying to extract from the text, I could take a look and try to create an optimized regex pattern to extract all the matches using find all.

1 Like

This is great! Having looked at this in the past I know what an uphill climb it must have been to build this! Solid work!

Have you done any performance metrics? I’d be interested to see how many cycles are used for various operations in different sizes of texts.

In the motoko meeting this week we started talking about optimizing wasm output from the compiler and I’d imagine that regex would be a great use case for identifying inefficient compiler output and possible low hanging fruit efficiency gains.

It was definitely a big task! I have done some performance testing, but so far, my focus has been on expanding the NFA’s number of states to see large regex patterns will hit instruction limit. I haven’t yet exceeded the limit in these tests.

I did run into an issue with the lexer where certain mapping operations were causing it to hit the instruction limit. I’ve noticed a trend where operations heavily reliant on array mapping seem to approach the instruction limit. I haven’t verified this fully yet, but it’s something I plan to investigate further.

I’ll start testing using the countInstructions function. I’ll share results here once I have them!