I'm attempting to perform a text search on a large dataset stored in a map structure.
First, I filter the map by projectId to reduce the data volume, then I apply an additional filter using Text.contains to search for a specified string within the record's issue property. This approach works for small datasets, but with larger ones, the canister instruction limit for a single message is exceeded.
What strategies can I use to mitigate this issue?
public query func getIssuesByIssue(issue : Text, projectId : Nat16) : async [(Nat, Record)] {
  Debug.print(debug_show(Map.size(map)));
  let mappedFilteredMap = Map.mapFilter<Nat, Record, Record>(
    map,
    nhash,
    func(key, value) = if (value.projectId == projectId) ?value else null
  );
  let results = Map.mapFilter<Nat, Record, Record>(
    mappedFilteredMap,
    nhash,
    func(key, value) = if (Text.contains(value.issue, #text(issue))) ?value else null
  );
  Map.toArray(results);
};
You could use Azle or Rust and embed SQLite with an indexed text field.
If you want it in Motoko, then something like this:
Break the text into individual tokens (words or word stems). For each token, add an entry to the index associating it with the document ID and offset.
Search for tokens; if their offsets are sequential, then you also match a phrase.
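A minimal sketch of such an index, assuming the same mo:map package the question already uses (thash is that package's hash utility for Text keys; indexDocument and the whitespace tokenizer are illustrative simplifications, not a full implementation):

import Map "mo:map/Map";
import { thash } "mo:map/Map";
import Array "mo:base/Array";
import Text "mo:base/Text";

// token -> postings list of (documentId, token offset within that document)
let index = Map.new<Text, [(Nat, Nat)]>();

func indexDocument(docId : Nat, body : Text) {
  var offset : Nat = 0;
  for (token in Text.tokens(body, #char ' ')) {
    let word = Text.toLowercase(token);
    let postings = switch (Map.get(index, thash, word)) {
      case (?p) p;
      case null [];
    };
    // Array.append is O(n); a Buffer per token would scale better, but this keeps the sketch short.
    Map.set(index, thash, word, Array.append(postings, [(docId, offset)]));
    offset += 1;
  };
};

A phrase "foo bar" then matches document d when "foo" has a posting (d, i) and "bar" has a posting (d, i + 1).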
Can you explain what the Nat is here, the key in your Map?
Also, it seems like your projectIds are consecutive numbers. Why don't you have one Map per project? I mean, for example, a vector of Maps where the vector index is the projectId. Then you can skip the first mapFilter.
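For illustration, a sketch of that layout, assuming the mo:vector package and consecutive projectIds starting at 0 (the Record type and helper names here are stand-ins, not from your code):

import Map "mo:map/Map";
import Vector "mo:vector";
import Nat16 "mo:base/Nat16";

type Record = { id : Nat; projectId : Nat16; title : Text; issue : Text };

// One issue map per project; the vector index doubles as the projectId.
let projects = Vector.new<Map.Map<Nat, Record>>();

// Register a new project and return its id.
func addProject() : Nat16 {
  Vector.add(projects, Map.new<Nat, Record>());
  Nat16.fromNat(Vector.size(projects) - 1);
};

// Direct lookup replaces the first mapFilter entirely.
func mapForProject(projectId : Nat16) : Map.Map<Nat, Record> {
  Vector.get(projects, Nat16.toNat(projectId));
};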
mapFilter creates a new map and is therefore expensive. You don't care about a map as a result because you want an array in the end. Hence, I would eliminate the second mapFilter, too. You can iterate over the entries yourself with vals or entries, keep only the ones that contain the search text, and put the results in a Buffer or Vector. Then convert from there to an array at the end.
If all that still doesn't help, then you can still build your own word index. But even then, you should do the above steps regardless, so I would start with those.
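Put together, a minimal sketch of that single-pass version (assuming mo:base/Buffer and mo:base/Text are imported, and map and Record are as in your function):

public query func getIssuesByIssue(issue : Text, projectId : Nat16) : async [(Nat, Record)] {
  let buffer = Buffer.Buffer<(Nat, Record)>(0);
  // Single pass: no intermediate maps, filter straight into the buffer.
  for ((key, value) in Map.entries(map)) {
    if (value.projectId == projectId and Text.contains(value.issue, #text(issue))) {
      buffer.add((key, value));
    };
  };
  Buffer.toArray(buffer);
};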
The Nat as key represents the issueId in this example.
I have modified the function to use a buffer, but I'm still hitting the limit when using Text.contains for comparisons. Without the Text.contains check, the function runs very quickly. The dataset contains 22,000 records.
public query func getIssuesByIssue(issue : Text, projectId : Nat16) : async [(Nat, Record)] {
  let buffer = Buffer.Buffer<(Nat, Record)>(0);
  Debug.print(debug_show(Map.size(map)));
  for (value in Map.vals(map)) {
    // The commented-out Text.contains check below is what pushes the call over the instruction limit:
    //if (value.projectId == projectId and Text.contains(value.issue, #text(issue)))
    if (value.projectId == projectId) {
      Debug.print(debug_show(value.title));
      buffer.add((value.id, value));
    };
  };
  Debug.print(debug_show(buffer.size()));
  return Buffer.toArray(buffer);
};
There are a number of things in Motoko, like text operations, binary access and update by index, etc., that I think could use some optimization. I know that @luc-blaeser has been hyper-focused on OEP for a while and that the team has been super busy. It might be worth indexing all of these small performance issues and getting them slated on the priority list for the Motoko team. I'm not holding out hope for a hyper-optimized native regex engine in Motoko anytime soon, but I think there are likely a number of places that would offer substantial gains.
Searching in the middle of a string has been a rough problem to deal with since SQL Server 1.0 (and likely before!).
Your best bet is going to be some kind of preprocessing like @infu suggests.
Can you tell me the smallest dataset that exceeds 5B instructions? I mean the number of records and the total length of the Text of all records combined.
issue_text_length: 5_696_451 (sum of Text.size() over all records processed before it stops)
records: 14_681 (number of records processed before it stops)
It seems that the issue isn’t the text size of any single record, but rather the combined size of all text values. This suggests that when using Text.contains for checks, memory isn’t being released after each operation. Could it be that the Text.contains function is retaining memory unnecessarily?
You can check this demo I made when trying out SQL in Azle. It has full-text search:
https://yv5hx-rqaaa-aaaam-abj6a-cai.icp0.io/
The code won’t work, but with the newer Azle it should be easier.
GitHub: infu/internetbase-sql-demo
If you want to do more of these kinds of things (text searching, ordering, intersections, queries on your db, adding queries on demand without planning for them), it's better to try Azle + SQL. But then you need to be good with SQL to keep things optimized.
Just for information:
I have imported an additional 90_000 messages which belong to the existing issues, and everything is working as expected. I can now query a single issue with all its comments.
The local canister memory size is currently 135_572_009 bytes, which (divided by 1_048_576 bytes per MB) is approximately 129 MB.
I've developed a prototype for a German-language stemmer in Motoko. While I'm not certain that all Porter stemming rules are fully implemented, the core stemming functionality is working. The parser also excludes a predefined list of stop words and removes all HTML tags.
However, the initial indexing process is currently very slow. Additionally, many unusable words are being indexed, and some words that appear frequently across issues cause their per-word postings arrays to grow excessively large.
In theory, both the indexing and search functionalities are working, but I’m unsure how to scale this approach effectively with my dataset of 22,000 issues and 90,000 messages and more.
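For reference, a minimal sketch of that kind of preprocessing pipeline; the stop-word list, the single suffix rule standing in for the full Porter logic, and all helper names here are hypothetical, not taken from the actual prototype:

import Text "mo:base/Text";
import Char "mo:base/Char";
import Buffer "mo:base/Buffer";

// Illustrative stop-word list; a real German list is far longer.
let stopWords : [Text] = ["der", "die", "das", "und", "oder", "ein", "eine"];

func isStopWord(w : Text) : Bool {
  for (s in stopWords.vals()) { if (s == w) return true };
  false;
};

// Stand-in for the Porter rules: strip one common German suffix.
func stem(w : Text) : Text {
  switch (Text.stripEnd(w, #text "en")) {
    case (?stripped) stripped;
    case null w;
  };
};

// Lowercase, split on non-letters, drop stop words, stem the rest.
func tokensForIndex(body : Text) : [Text] {
  let out = Buffer.Buffer<Text>(16);
  for (raw in Text.tokens(body, #predicate(func(c : Char) : Bool = not Char.isAlphabetic(c)))) {
    let w = Text.toLowercase(raw);
    if (not isStopWord(w)) { out.add(stem(w)) };
  };
  Buffer.toArray(out);
};

Splitting on non-letters and dropping stop words before indexing keeps many unusable words out of the index; capping or chunking the postings arrays of very frequent words is a separate problem.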
Hey, what version of Azle did you use? I am assuming an old version. Azle no longer has those compiling issues if you’re talking about anything over a few seconds.