Under Review: ICDevs.org Bounty #35 - RegEx Motoko Analysis - $500

RegEx Motoko Analysis - #35

Current Status: Discussion

  • Discussion (01/09/2023)
  • Ratification: (01/09/2023)
  • Open for application: (01/09/2023)
  • Assigned
  • In Review
  • Closed

Official Link

Bounty Details

  • Bounty Amount: $500 USD of ICP at award date - $500 USD of ICP Match Available
  • ICDevs.org Bounty Acceleration: $500 For each 1 ICP sent to 8b9cbe90c23b9d0a36006e6623205e89c72938d064c911bdbf4fde9e17280b20, ICDevs.org will add $40 USD of ICP at award date to the bounty, up to the first 12.5 ICP donated, After 12.5 ICP, donations to the above address will add .25 ICP to this issue and .75 ICP to fund other ICDevs.org initiatives.
  • Project Type: Individual
  • Opened: 01/09/2023
  • Time Commitment: Days
  • Project Type: Analysis
  • Experience Type: Intermediate - Motoko;

Description

Motoko needs a more robust text search library. There are many options to complete this task. The options are:

  1. Create a motoko library written in motoko that can be included in the base libraries for RegEx based text searches.
  2. Integrate a native regex library into the compiler as an optional compiler flag(see timer switches)
  3. Add generic rust crate integration and imports into the motoko system so that existing rust crates can be included and called from motoko programs.

This is an analysis bounty that asks the developer to review the options and to make a recommendation about the pros, cons, and estimates of doing each three mentioned strategies.

You should work with the motoko team at DFINITY to understand the complications with the different scenarios and understand their priorities as it comes to the future of motoko. We would like to develop significant motoko compiler experience outside of the Foundation, but we don’t want to duplicate work or undertake work that is currently on the road map in the immediate future.

This bounty gives the opportunity to

  • learn about the motoko compiler
  • work with the motoko team to understand their priorities
  • learn regex

To apply for this bounty you should:

  • Include links to previous work writing tutorials and any other open-source contributions(ie. your github).
  • Include a brief overview of how you will complete the task. This can include things like which dependencies you will use, how you will make it self-contained, the sacrifices you would have to make to achieve that, or how you will make it simple. Anything that can convince us you are taking a thoughtful and expert approach to this design.
  • Give an estimated timeline on completing the task.
  • Post your application text to the Bounty Thread

Selection Process

The ICDevs.org developer’s advisors will propose a vote to award the bounty and the Developer Advisors will vote.

Bounty Completion

Please keep your ongoing code in a public repository(fork or branch is ok). Please provide regular (at least weekly) updates. Code commits count as updates if you link to your branch/fork from the bounty thread. We just need to be able to see that you are making progress.

The balance of the bounty will be paid out at completion.

Once you have finished, please alert the dev forum thread that you have completed work and where we can find that work. We will review and award the bounty reward if the terms have been met. If there is any coordination work(like a pull request) or additional documentation needed we will inform you of what is needed before we can award the reward.

Bounty Abandonment and Re-awarding

If you cease work on the bounty for a prolonged(at the Developer Advisory Board’s discretion) or if the quality of work degrades to the point that we think someone else should be working on the bounty we may re-award it. We will be transparent about this and try to work with you to push through and complete the project, but sometimes, it may be necessary to move on or to augment your contribution with another resource which would result in a split bounty.

Funding

The bounty was generously funded by the DFINITY Foundation. If you would like to turbocharge this bounty you can seed additional donations of ICP to 8b9cbe90c23b9d0a36006e6623205e89c72938d064c911bdbf4fde9e17280b20. ICDevs will match the bounty $40:1 ICP for the first 12.5 ICP out of the DFINITY grant and then 0.25:1. All donations will be tax deductible for US Citizens and Corporations. If you send a donation and need a donation receipt, please email the hash of your donation transaction, physical address, and name to donations@icdevs.org. More information about how you can contribute can be found at our donations page.

FYI: General Bounty Process

Discussion

The draft bounty is posted to the DFINITY developer’s forum for discussion

Ratification: (01/09/2023)

The developer advisor’s board will propose a bounty be ratified and a vote will take place to ratify the bounty. Until a bounty is ratified by the Dev it hasn’t been officially adopted. Please take this into consideration if you are considering starting early.

Open for application

Developers can submit applications to the Dev Forum post. The council will consider these as they come in and propose a vote to award the bounty to one of the applicants. If you would like to apply anonymously you can send an email to austin at icdevs dot org or sending a PM on the dev forum.

Assigned

A developer is currently working on this bounty, you are free to contribute, but any splitting of the award will need to be discussed with the currently assigned developer.

In Review

The Dev Council is reviewing the submission

Awarded

The award has been given and the bounty is closed.

Other ICDevs.org Bounties

Regular expression engine implementation is trivial and can be achieved in as little as 2k lines of code (as for Golang’s standard “regexp” package). Thus, the obvious solution to the problem – the lack of a robust text search library – is developing one in motoko. The other solutions you propose, in particular the 3rd one, seem to stray a bit too far from the issue, being fundamental compiler design decisions rather than straightforward solutions to the text search library problem.
Regarding option number 2, we don’t see why integrating a native (Wasm, we presume) library would need a compiler flag to be set (which likely is relevant to the implementation of your language’s compiler, which we are not familiar with). We’re unsure what “timer switches” are and would appreciate a pointer to relevant documentation

We should be able to flesh out a working engine for Unix’s “extended regular expression” syntax (as used in egrep, or all Plan 9 utilities) in one day, and later add support for PCRE (Perl Compatible RE) if desired by you.

The only obstacle we see with acomplishing this task is that we were not able to aquire a complete specification of the Motoko language – we are unable to access either internetcomputer [dot] org or dfinity [dot] org, which appear to be the only repositories for Motoko documentation. We will give details of this issue in the next message.

We hope that the results of our work will be telling enough of our experience, so as to avoid the need to present our prior work here. As said above, we should be able to produce significant results in as little as one day, given proper documentation for your language and development platform (which we are not familiar with at this moment).

The sites internetcomputer [dot] org and dfinity [dot] org are inaccessible to us, despite having tried accessing them with both chromium and firefox, at each request returning a page containing only the following message:

Failed to fetch response: Error: Server returned an error:
Code: 400 ()
Body: Specified ingress_expiry not within expected range:
Minimum allowed expiry: 2023-02-22 16:30:17.689468621 UTC
Maximum allowed expiry: 2023-02-22 16:35:47.689468621 UTC
Provided expiry: 2023-02-22 16:38:15.912 UTC
Local replica time: 2023-02-22 16:30:17.689469934 UTC

We would appreciate any support on this, or a reference to complete documentation and syntax/library reference for Motoko elsewhere.

This means that your system’s time and IC time are too much out of sync.

The full Motoko docs that are hosted in internetcomputer.org are also available here in the original repo

I was away for a few days but managed to get a IC node and some Motoko examples running
(after some trouble with shared libs… as usual).

I feel that there is more to mention regarding the three options you put up
(than we did by proposing our “obvious solution”).

Firstly, before I begin, it is good to think of regular expressions as simple declarative programs
(which they are in practice).
To ‘search for’ a regular expression in text means to run the program with the text as input.
After the execution of such program, its interpreter (instance of the regexp engine) can be queried for
pieces of saved state such as whether it was successful or where in the text certain subexpressions matched.

That said, a regular expression is merely the source code of a program,
and must first be compiled, either to an AST that can be interpreted by
walking (such as in simple scripting langs, awk for example),
or converted to bytecode that can be executed by some VM (as with Python, Java).

With that in mind, you should know that it is incorrect to use a constant regexp in a source program
(assuming that an interpreting engine is used).
For example, usage such as

let text = /* read from stdin */
let matched = regexp.match("ab*c", text)

will cause, at the point of executing the assignment to matched,
a call to a parser, and later interpreter.
Since the pattern is constant (rather than read in at runtime),
clearly it must be meaningful to the core logic of the implemented program,
and is hence best implemented in the host language.
That is, it is preferable to package the simple parsing done
by the pattern into an explicit procedure, so as to communicate the intent clearly.
Additionally, such a naive implementation hides the semantics of the program
from the host language compiler – the compiler can’t reason about
the logic denoted by a string it cannot itself parse.
(Pike also notes this in slide 13 of go.dev/talks/2011/lex.slide)

Various approaches have been taken to incorporating regexps
directly into a language environment, thus solving the above issue.

  • ECMAscript defines regular expression syntax as part of its grammar,
    and thus compiles them along with all other syntax.
  • for linguistic applications, Lex “a lexical scanner generator” translates
    regular expressions written in a special file into C code
  • Extensible compilers allow the grammar of an existing language
    to be extended with arbitrary productions, and transformations
    to host language constructs to be defined for them.
    A regular expression, in let re = /ab*c/ for example, is ‘desugared’
    into the equivalent host language code implementing it, let re = func(){ match('a'); star('b'); match('c) }, say.

Clearly, extending Motoko’s grammar with such a superfluous and limited construct
(for a general purpose programming language) is unreasonable.
Lex requires the regular expressions to be specified in a seperate file,
and is hence not generally applicable. Today, it has been deprecated
in favor of hand-written implementations of patterns, or more elaborate
complete CFG parser generators.
An extensible compiler solves the general problem of extending a language
with domain-specific notation (with the side effect of losing idiomacy and regularity)
but appear to be pure research at this time.

Another important aspect to consider is the complexity of the pseudo derivatives
of regexps, such as PCRE and POSIX regexes. (See [Cox’s articles] [rsc] now.)
Regular expressions were introduced as a formal notation for expressing Kleene’s regular events.
Early software implementations of them stayed true to the theory
and implemented strictly the operators and semantics defined by Kleene.
Theoretical regular expressions have the desirable properties of being
evaluable in time linear with respect to the searched text (O(n)).
Later, due to the demand for greater expressiveness, and desire for ‘features’,
extensions were introduced into the regular expression notation,
namely in the Perl language (PCRE - Perl-compatible regexes) which
went beyond the definition of a regular expression,
and the class of regular languages which they denote
(such ‘regexes’ are not strictly “regular expressions”).
In effect, the linear time guarantees were lost and some ‘pathological’
patterns in PCRE can take years to execute, paving way for DOS attacks.
You should refer to Cox’s article for more information on that.

Thinking about the ‘cycles’ mechanism of your platform,
if we understand it correctly, we would advise that our solution
be taken, as incorporating an existing implementation might impose a significant cost
– Rust’s ‘regex’ crate strives for compatibility with PCRE’s syntax
(without the pathological features) and is hence much larger than would be needed
for traditional Unix regexps.

We argue that the additional constructs of PCRE are completely superfluous for interactive applications
(since the very reason for interactive use is conciseness and intuitiveness of the notation,
which is lost when such features as advanced character classes are introduced),
and readily replacable by host language constructs in static programs
(with immediate gain of clarity: Motoko’s Char.isUppercase() vs PCRE \p{Lu}).
Ierusalimschy, creator of the Lua programming language, also argues that PCRE-style regexps are overly
complex, and in fact proposes a solution utilizing a more sound basis for pattern recognition than regular expression,
namely that of PEGs, which was designed around backtracking and naturally handle the Context-Free language class.
(See [lpeg] at the bottom)

Furthermore, the non-linear-time behavior of PCRE-style engines would pose
a threat of DOS attacks, particularly dangerous in the context of a distributed
network such as yours.

Hence, the implementation we propose would use the classic
regular expression syntax, mirroring Plan 9’s libregexp.
Such an implementation would provide a dynamic, user-oriented
text search engine, in clear and concise Motoko code.
An obvious benefit is that the library would automatically act as a
rich example of the usage of Motoko for solving real-world, theoretically grounded issues.

The syntax would mainly consist of the following expressions

  • ‘a’, matches a literal ‘a’
  • ‘(eee)’, where each e is an expression, is treated as one expression
  • ‘*’, matches zero or more of the previous expression
  • ‘+’, matches one or more of the previous expression
  • ‘.’, matches any character except newline
  • ‘@’, matches any character (could be omitted in … of a flag for ‘.’ not matching newline)
  • ‘[abc]’, matches either a, b, or c
  • ‘[^abc]’, matches any char other than a, b, or c
  • ‘[a-z]’, matches any character lexicographically between a and z.

(plus a tiny few others)

The package should provide an interface roughly as the following:

compile(expr : Text) : regexp
match(pat : Text, t : Text) : bool
find(pat : Text, t : Text) : Text
findAll(pat : Text, t : Text) : [Text]
findIndex(pat : Text, t : Text) : (int, int)

We would like to know about the eligibility of our analysis
in this and the previous message, along with possible future
conversation with you, and the potential implementation of the library
described above, for acceptance and reward.

I eagerly await your response, and any further questions.

[lpeg]
www [dot]cs [dot] tufts [dot] edu/~nr/cs257/archive/roberto-ierusalimschy/lpeg-spe [dot] pdf

[rsc]
swtch [dot] com /~rsc/regexp

This is really great. I’d love to get @claudio take and maybe @rossberg as well.

I’d love to see it implemented in motoko as l, I think you rightly point out, would be full of great examples.

A regexp library implemented in Motoko should be a very good start. A base implementation is not that hard and can take us a long way. I also agree that popular regexp libs tend to be overloaded with many rarely needed features and performance pitfalls.

It might be hard to reach top performance for such a core mechanism with a Motoko implementation, so in the long run, I would still hope that there would be ways to import a high-performance regexp engine implemented in some low-level language. But that’s not easily possible right now, so going with Motoko sounds like a good idea.

A couple of more thoughts:

  1. A high performance regexp implementation still is a lot of work. For example, V8’s regexp implementation, which is one (or still the) fastest around, took many man years to reach the point it is at now. One notable complication for modern regexp engines also is correct handling of unicode, especially if you don’t want to waste a lot of space.

  2. Regexps are one of these features that are often and easily abused – they’re a hammer that makes every problem look like a nail. In particular, it depresses me how often they are used as hacks instead of implementing a proper parser. The results tend to be abysmal, both in terms of performance and correctness. In general, regexps are not something I would recommend using much on the IC, where cost matters.

  3. I have to disagree that lex is obsolete or replaced by parser generators. Instead, it is often used along with parser generators. It is much better-suited for lexical analysis. Almost all compilers I have ever touched used a combination of lex + yacc-style tools. That includes Motoko, btw. In all cases where a regexp is statically fixed and not entirely trivial, something like lex would be the right tool – especially on the IC, because compiling the regexp then happens offline.

I’ve written some thoughts down based on your response (in fact, I’ve done nothing but that today. If I don’t respond tomorrow, I’m. sleeping.)
but it doesn’t feel quite complete. Before I submit a complete analysis I’d like to know your thoughts on some aspects.

  1. Do you think dynamic (user defined) regexps are ever needed on your platform?

Any data already loaded by the client (browser) can have regexp search enabled by client-side scripts (javascript). Additionally, regexp search for server (canister in our case) resident data is very project specific (the only example of regexp queries to a server I could recall is Google Code Search, the name of which suggests that I’m right in stating that it is not generally applicable).
I have compiled a list of possible schemes for compile-time regexp literals in Motoko and feel like that is really what you meant by this bounty.
The problem we solve, then, is that of enabling no-cost problem-specific notation, in this case, related to predicates on text. That is, I propose a syntactic macro system (which as opposed to some other ones you might know does not rely on fixed lexical syntax, as in Scheme and Rust). I feel that it would go along with your philosophy of a small language with a supplementory libraries nicely, in this case the libraries being compile-time. (There’s a good paper about compile-time libraries that I would love to link here but urls don’t work on this forum.) An implementation of compile time regexps is then merely a plugin for the macro system. I’ve written up a good handful of interesting implications of this for Motoko down which I hope to get out to you soon. (Maybe tomorrow., if I’m not sleeping)
That’s briefly why I spoke out against lex. (Also, could you give a pointer to Motoko’s lex files?) Not flexible enough. Also, lex(1) in Plan 9 Programmer’s Manual, BUGS section:

The asteroid to kill this dinasaur is still in orbit.

And Inferno did away with it completely (asteroid hit). In yacc(1) of Inferno’s manual a lexer consisting of a single pattern matching statement is shown, which also shows how superfluous lex can be. ANTLR uses the same file and algorithms for lexing and parsing, with the benefit that productions can use regexp operators (I think).

In summary: we’re best of with the compile time approach. If we’ll do dynamic,
a Motoko based implementation is fine (more on that later), although I’m doubting its applicability (it would be good to look for any). Rust integration will be good whenever it comes.

Spent a good non-stop 8 hours on this today in hopes of getting everything out to you. Clearly failed.
Clearly a bad approach. I’ll go do some stretching now. Thanks.

Please take a look at MELT’s AbleC in the meanwhile which is pretty much what I imagine for Motoko.
(melt [dot] cs [dot] umn [dot] edu [slash] ableC)

Well, proper macro or staged programming systems aren’t simple features at all, and they have many implications and drawbacks. Supporting that machinery would be way more complex than an offline lex tool or a dynamic regexp library. So I wouldn’t hold my breath.

FWIW, I didn’t mean to imply that we shouldn’t have a regexp lib or that lex is the right tool for all cases – just saying that it has valid uses, too. For the light one-off use cases, like they occur in many applications, a regular library like you suggested is the simplest option, so I’m fully on board with that.

Also, could you give a pointer to Motoko’s lex files?

Sure, it is here, written in ocamllex.

The compile-time library scheme is feasable. What I had in mind is an independent program, equipped with a Motoko parser, that runs before the compiler proper (perhaps I didn’t make that clear earlier). That is, it is independent of the base Motoko compiler. It obviously incurs some overhead, due to the redundant parsing, although I doubt it would be enough to invalidate such an approach for this use case, and in fact it is the most common approach taken with such systems for C-like languages (Xoc, among others).
In the most general case, it should also include some way to interpret Motoko code, or whatever language the libraries are to be written in. (This makes it sound somewhat worse.) Calls to pseudo-functions are converted to calls to specialized functions which reside in a newly generated package (and hence namespace), avoiding the issue of name capture. Example:

let pat = Regexp.regexp("(foo)*bar+")

becomes

let pat = Regexp._GfooSbarP()

Regarding lex, remember that its usual interface is strictly oriented at full-blown parsers, with the primary entrypoint being yylex(), which executes all rules in parallel (perhaps ocaml’s lex is different). In one-off cases, where patterns are used sporadically and not strictly related, a more general solution may be desirable, in which patterns can be individually named and, as we now come to realize, perhaps combined.

Thinking instead about a practical lex-like system, I am reminded of Golang’s code generation conventions, where comment directives, most notably go:generate specify aspects of compilation that step beyond the semantics of the host language. The “C” pseudo-package, which extracts C declarations from a comment adjacent to its import statement, is a primary example. By that, I imagine usage of a more general lex-like utility for Motoko as the following:

/*lex: 
  h1 = "#" .* "\n"
*/
import lex

let matched = lex.h1.match("# abcdefgh\n")
D.print(matched) // => true
D.print(lex.h1.find("a#bcd\ne")) // => "#bcd\n"

(minor bug: can’t nest comments…)
or, better yet (although full-on token stream search might be needed)

/*lex: wroom = /wroo+/ */
let m1 = lex.wroom.match("wroooooom")

/*lex: boom = /boo+m/ */
let m2 = lex.boom.match("booooom")

Where lex is a package that is only generated after the special comments are evaluated.
Golang’s entire metaprogramming capabilities consist of special directive comments and //go:generate. I should familiarize myself with your build system to see how something like this could fit in. It has the obvious benefit of keeping regexps close to their use. And, either way, exposing a proper package interface.
This is exactly the ‘pseudo-package’ I had in mind with compile-time libraries.

Golang’s nex package is also a reinterpretation of lex that comes to mind, based somewhat on Pike’s ‘structural expressions’. Could be worth taking a look at.
I’ve also recently read a paper on staged parser combinators based on regular expressions extended with a recursion operator, which seem to tackle similar problems and handle composing quick and cheap one-offs well. I’ll take a look at it again.

Also, any details on the status of Rust integration would be welcome. Perhaps there is some place of active discussion. A brief overview of the issues involved could suffice. In particular, I would like to know what aspects keep such integration from being feasable at this time, as it would be most convenient to get an existing regexp package adapted as soon as possible, so as to have a place to start measurement and drive further consideration of solutions from (as you know well).

There isn’t much of a status to report re Rust integration. What would be needed for this is some kind of FFI convention or ABI that different Wasm compilers agree upon and support. But that would require an effort in the wider Wasm eco system, not something we can achieve on our own.

What comes closest is the official proposal for a language-independent component model for Wasm. That’s still work in progress, but once finished, it would bring us about 80% of the way to support at least some form of multi-language linking on the IC.

1 Like

@piens - If you can summarize all of your findings into a markdown file and submit it in a pull request to icdevs/regex.mo · GitHub I’ll get it up for review. Thanks for taking a look at this! I think you’ve highlighted the needs and that a simple RegEx motoko package is warranted. I’ll suggest it in the next round of Bounties and give you first right of refusal on the job if you are interested.

Hey, I’ve done some more research this past week (including a peek at ocamllex and the motoko compiler)
which I wanted to report back on but haven’t been able to due to some ugly hardware failure. I’m back online now and writing it down in a markdown doc sounds perfect.

It seems that I can neither clone nor PR an empty repo. You should commit a dummy README or anything to make it non-empty.

Added a readme…more body for char check

It’s up! Let me know what you think. I put it together one by one so a more general outlook and revision might be needed.

I didn’t ping you so you might not have seen this.
(Also, take a peek at the png bounty if you can!)