I am in need of a Motoko library for efficiently indexing large content with minimal index storage for fast and flexible full-text search. I haven’t found such a project and I know better than to tackle this challenge from scratch. I have neither the time nor the expertise in this specialized discipline.
If there is no reusable library or service available for the IC, I am wondering if it’s possible to compile an existing library like hOOt (with some modifications) to wasm and reference it from Motoko.
Can anyone confirm or dispel this idea?
As a last resort, I think porting the code to Motoko would be the best approach, though a much longer task.
Unfortunately you cannot reference Wasm modules from motoko yet. Even if you could, functions like indexing that run over long periods of time would be unlikely to work in the context of the Internet computer. You would need to rewrite them using chunking until time slicing is available.
Motoko needs a ton of libraries like this. I’d be really great to have regex. In fact you might have to have that before you could do full text. Some of these more significant libraries may require significant investment to get up and running.
If you know of others that need it and would be able to fund it I’d be happy to help write up a bounty for it. We have a few outstanding already and not of them are for base motoko libraries that need to be built: https://ICDevs.org/bounties.html
Good point about regex support. Regarding the long running process, I wonder if that could be handled by a separate indexing canister pulling content from the app canister periodically (via the heartbeat function) and then building the index in the background.
Another thought is to offset the indexing to the browser, when the user saves the content (and where regex is available). It wouldn’t take long to index a single document, and then save the content along with the index. The individual indexes would have to merge once in the canister. However, this comes with a huge security risk that someone could save a fake index to manipulate traffic to their content. Grasping at straws here.
I couldn’t fund the bounty for such a solution, but would be happy to donate what I can. Hopefully some early ICP investors with deep pockets will read this and see the value in funding it.
Can you elaborate on your use case a bit more instead of just the outcome that you are needing? Might help inspire a project or two for people reading this post.
Here’s my guesses on what you’re asking for when you say this. Please let me know if I’m wrong, or if your use case is in any way different.
“efficiently indexing large content” → I’m imagining you have a group of larger text documents that you’d like to index (let me know if I’m correct). How would you ideally like to use this index, and how would that index be structured for your use case?
“minimal index storage” → I’m imagining some type of word to document(s) where that word appears mapping
“full text search” → I’m imagining being able to find a document where a word or group of words appear. Nothing too fancy beyond that like regex search.
@icme Thanks for asking.
My use case is for publishing articles of varying sizes and providing flexible search of all articles within the dapp from a simple search bar in the UI. This has two parts: Index and Search.
When an article is published, it needs to be indexed and the index data needs to be stored on-chain within IC canisters. The index structure doesn’t matter, but it should be as small as possible. A single canister will not hold all indexes, so the service would need to scale linearly across an array of dynamic canisters.
There also needs to be a query service to search the index and return relevant results. Queries should be able to match single words and phrases based on root words, not exact matches (plurality, tense, etc.). The results should indicate when an exact match is found. The ability to search with wildcards (*, ?) would be icing on the cake, but far less important than searching by phrase. The search algorithm should only use the index. Scanning all the articles in real-time would not be feasible.
Support for multiple languages would be best, but difficult based on differing grammar rules.
Additional use cases
- Content management
- Web publishing
- Crowd funding
- Any dapp that needs to store and query long text content
What we really need is a Google for the IC that allows dapps to integrate by sending content for indexing and queries for search results (scoped to their dapp’s domain/canister). I hope that someone with a deep understanding of search algorithms is already working on this. We can’t have a new version of the Internet without search functionality. Whoever does this well will become a pillar of Web3.
I have a limited full-text search solution in Motoko that I built myself a month or two ago.
- indexing with one method, querying with another
- tokenizes by word (i.e. whitespace), with no support for stop words, wildcards, regex
- prefix search using a prefix tree (i.e. a real trie, not the Motoko
- ranks exact matches higher than prefix/partial matches
- users can specify attributes in order of priority, where matches on higher-priority attributes are ranked higher than matches on lower-priority attributes
- custom “popularity” function that users can specify to further rank documents
- ranking is done purely at query time using a bucket sort
- paging search results
It’s pretty specific to my use case and not really ready for open source, but I’m happy to send it to you if you want to use it. Would be nice to have multiple people looking and/or working on it.
@jzxchiang Wow! That’s amazing! Yes, I would love to get a copy of your solution, and thank you for offering. You can email me at [email protected]. Much appreciated!
I don’t know how useful this is, but I worked at a startup that made an in-memory reverse indexed search engine built in C++. It used a trie as the underlying data structure and as a result was very fast (someone said the fastest at one point but you know that’s context dependent); and as a result also made in-fix searching possible (flexible with typos). It also included capability for Mandarin (simplified iirc), and geofencing/tagging (sharding was also supposed to be implemented, but I don’t believe it was ever actually got to that point). It also had a couple of database connectors (such as SQLite), and someone even ported it over to Android (although not natively which was not recommended since an NDK version would be more robust so as a result the Android app no longer works as Android apps can no longer spawn processes with IPC).
Anyways, the startup went under (apparently selling search in ~2014 was not that easy for a variety of reasons you can imagine), but the source code was released in case it would be useful to anyone here:
Thanks @inviscidpixels. That looks like an excellent learning resource. I have tons of respect for anyone who can write code like this.