SubText (Substring) Function in Motoko

Hello. I’m looking for the most efficient way to get a string segment from a larger string, from starting to ending index. I don’t think Text can be treated like an array of chars, so I don’t know how to jump directly to the start index (without looping). Does anyone have a more efficient approach than this? (I will be working with large Text and this function will be called very often.) Thanks! :slight_smile:

  // mimics JavaScript substring function
  public func subText(value : Text, indexStart: Nat, indexEnd : Nat) : Text {
    if (indexStart == 0 and indexEnd >= value.size()) {
        return value;
    };
    if (indexStart >= value.size()) {
        return "";
    };

    var result : Text = "";
    var i : Nat = 0;
    label l for (c in value.chars()) {
        if (i >= indexStart and i < indexEnd) {
            result := result # Char.toText(c);
        };
        if (i == indexEnd) {
            break l;
        };
        i += 1;
    };
    
    result;
  };

JavaScript substring test:

Motoko Playground test:

1 Like

Thanks @paulk for sharing a better approach using Iter.range.

  public query func subText(value : Text, indexStart : Nat, indexEnd : Nat) : async Text {
    if (indexStart == 0 and indexEnd >= value.size()) {
        return value;
    }
    else if (indexStart >= value.size()) {
        return "";
    };
    
    var indexEndValid = indexEnd;
    if (indexEnd > value.size()) {
        indexEndValid := value.size();
    };

    var result : Text = "";
    var iter = Iter.toArray<Char>(Text.toIter(value));
    for (index in Iter.range(indexStart, indexEndValid - 1)) {
        result := result # Char.toText(iter[index]);
    };

    result;
  };
3 Likes

I’m curious to know if there are performance improvements in paulk's version over your first solution? I ask because I think Iter.toArray() also loops through the text to get all the values into an array. It is more intuitive and easier to read but does it make the code run faster or use less memory?

1 Like

You make an excellent point. It’s tough to benchmark performance of a function within a canister because timestamps (Time.now()) are actually block times. A test with very large Text and a large number of iterations could give us an idea though.

@tomijaga Wow! I did some tests and the performance difference is very surprising and not obvious from looking at the code.

Although, I could not get benchmark times, I was able to find the number of test iterations where one of the functions would exceed the query execution time limit, which I believe is 2,000 ms.

I tested a 10,000 character string and called subText or subText2 1,000 times to get the last 1,000 characters. The first function succeeded and the second timed-out.

Then I played with the test iterations to find the threshold where each function would time-out.

The first function was able to run 1,735 times, while the second could only run 320 times. So the first function runs over 5X faster!!

To be clear, the test functions call the getLongValue function to build a 10,000 character string which contributes to some of the execution time, but that is a constant for both tests.

So, your analysis was correct and it made a much bigger difference that I expected. Thank you for your valuable input! :pray:

Here is the test if you want to try it yourself:
Motoko Playgound Test

1,000 test iterations

Max test iterations per function before time-out.

1 Like

@tomijaga I may be wrong about the instruction limit referring to a time-out. Now that I think about it, it may be a call-stack error.

If that’s the meaning, then the tests conclude that first function uses a smaller call-stack and can therefore process more operations. However, it may not indicate that one function performs better than the other.

1 Like

Yet another variant of a similar function is implemented, but not exposed, in Text.mo. It may make few redundant tests.

A function like subText is probably a good candidate for direct, efficient implementation in the runtime system.

2 Likes

Thank you for sharing this @claudio.

I added the extract implementation to my tests here:

Here are the maximum invocations per function before exceeding the instruction limit:

extract: 650
subText: 1735
subText2: 320

At this point, it appears that the subText implementation (only by chance) is still the most efficient. I’m very interested in understanding why.