Hello Motoko devs!
I am happy to announce the first iteration of the motoko regex engine for my grant has been published to mops. Find it here Mops • Motoko Package Manager
Install it using
mops add regex
import it using
import Regex "mo:regex";
Use it (this is a regex pattern to detect a SSN) the flags are set to null. Flags are a set of optional boolean values. There are only 2 flag currently, caseSensitive
and multiline
. It is case sensitive by default. flags are immutable so once you set them in instantiation they may not be changed.
let regex = Regex.Regex("\d{3}-\d{2}-\d{4}", null )
Once you have instantiated your regex object you may use the following methods to pattern match. If the compilation of the regex fails you will receive a not compiled error and a Debug error will be printed to the output log. Question, would the community rather have the engine trap in this scenario with explicit errors?
match(sometext: Text) : Result<Match, RegexError>
regex.match("launch icp 629-07-2021");
This will result in #NoMatch
. Why? The matcher requires the pattern to exactly match and traverse the NFA to an accept state. It stops as soon as it encounters a character that doesn’t fit the pattern. In this case, the ‘l’ in “launch” does not match the pattern, so matching terminates immediately. This goes for any state in the nfa. If you had regex.match(“629-07-202a”); it would result in a no match because the input cannot reach the final accept state which is 11. With regex.match(“629-07-2021a”); you will have a match. This is because it terminates as soon as it finds a match, the engine does not support partial matches.
There are 3 takeaways
- For the function to succeed, the input must traverse the states of the NFA and reach a final accept state.
- If at any point the matcher cannot identify a valid path from the current character to a next state, it immediately terminates and returns
#NoMatch
. - The matcher stops as soon as it finds a valid match that satisfies the pattern and reaches the accept state.
Match Record Structure
When a match is successful, the result is returned as a Match
record.
› testMatch(“\d{3}-\d{2}-\d{4}”, “629-07-2021”)
variant { ok = record {
status = variant { FullMatch };
lastIndex = 11; //last index of the match (the match end)
string = "629-07-2021"; // Original string
value = "629-07-2021"; //value of the match
spans = vec { record { 0; 11 } }; // This tells the length of the match could be reduced to a single number
position = record { 0; 11 }; //Position in the given string
capturedGroups = opt vec {}; //capture groups if present
}}
search(sometext: Text): Result<Match, RegexError>
regex.search("launch icp 629-07-2021");
The search
function behaves similarly to match
, with one key difference: it continues searching even after failing to find a path to the accept state. Instead of stopping, it progressively reduces the input text from left to right, looking for a match.
For example, given the input "launch icp 629-07-2021"
, the function starts with the entire string. When it doesn’t find a match, it removes the first character and tries again with "aunch icp 629-07-2021"
, then "unch icp 629-07-2021"
, and so on, until it reaches "629-07-2021"
, where it successfully finds a match. Potential for optimization by looking ahead before making the substring.
Once a match is found, the function terminates early and ignores the remaining text.
› testSearch(“\d{3}-\d{2}-\d{4}”, “launch icp 629-07-2021”)
variant { ok = record {
status = variant { FullMatch };
lastIndex = 22;
string = "launch icp 629-07-2021";
value = "629-07-2021";
spans = vec { record { 0; 11 } };
position = record { 11; 22 };
capturedGroups = opt vec {};
}}
findall(someText: Text) Result<[Match], RegexError>
The findAll
function differs from match
and search
by locating all occurrences of a pattern in the input text. Instead of stopping at the first match, it scans the entire text from left to right, collecting every match along the way. Returns the array of matches
regex.findall("launch icp 629-07-2021 and 123-45-6789");
› testFindAll(“\d{3}-\d{2}-\d{4}”, “launch icp 629-07-2021 and 123-45-6789”)
variant {
ok = vec {
record {
status = variant { FullMatch };
lastIndex = 22;
string = "launch icp 629-07-2021 and 123-45-6789";
value = "629-07-2021";
spans = vec { record { 0; 11 } };
position = record { 11; 22 };
capturedGroups = opt vec {};
};
record {
status = variant { FullMatch };
lastIndex = 38;
string = "launch icp 629-07-2021 and 123-45-6789";
value = "123-45-6789";
spans = vec { record { 0; 11 } };
position = record { 27; 38 };
capturedGroups = opt vec {};
};
}
}
findIter(someText: Text) : Result<Iter, Regex Error>
As the name suggest it returns an Iterator over matches found in the text.
enableDebug(mode:Bool) : ()
Enables or disables a debug feature, when debug mode is enabled you will be able to see the matching operations in the output log.
eg:
regex.enableDebug(true);
regex.match("296-28-1928");
I have removed the time from the output log and did some manual formatting for readability
[Canister] Starting match with start state: 0
[Canister] Current character: 2
[Canister] Checking range transition: (0, #Range(48, 57), 1)
[Canister] Matched range! Moving to state 1
[Canister] Current character: 9
[Canister] Checking range transition: (1, #Range(48, 57), 2)
[Canister] Matched range! Moving to state 2
[Canister] Current character: 6
[Canister] Checking range transition: (2, #Range(48, 57), 3)
[Canister] Matched range! Moving to state 3
[Canister] Current character: -
[Canister] Comparing with transition: (3, #Char('-'), 4)
[Canister] Matched! Moving to state 4
[Canister] Current character: 2
[Canister] Checking range transition: (4, #Range(48, 57), 5)
[Canister] Matched range! Moving to state 5
[Canister] Current character: 8
[Canister] Checking range transition: (5, #Range(48, 57), 6)
[Canister] Matched range! Moving to state 6
[Canister] Current character: -
[Canister] Comparing with transition: (6, #Char('-'), 7)
[Canister] Matched! Moving to state 7
[Canister] Current character: 1
[Canister] Checking range transition: (7, #Range(48, 57), 8)
[Canister] Matched range! Moving to state 8
[Canister] Current character: 9
[Canister] Checking range transition: (8, #Range(48, 57), 9)
[Canister] Matched range! Moving to state 9
[Canister] Current character: 2
[Canister] Checking range transition: (9, #Range(48, 57), 10)
[Canister] Matched range! Moving to state 10
[Canister] Current character: 8
[Canister] Checking range transition: (10, #Range(48, 57), 11)
[Canister] Matched range! Moving to state 11
[Canister] Reached accept state 11
I encourage you to try out the engine and test its capabilities with your own patterns. If you encounter unexpected behavior, please document it and report it as an issue on GitHub. I excited to receive community feedback! What methods would you like to see added?
I am working on the documentation and it will be published here Introduction