Content and user discovery is one of the most important aspects of the creator-consumer economy. At @RuBaRu, we are trying to build on-chain full-text search, and we are wondering whether there is an existing library on the IC, or any sample code, that we can refer to for this. The library should efficiently index content with minimal index storage while still enabling fast and flexible full-text search. Looking forward to a lively discussion exploring tech ideas and uncovering the limits around this topic.
After extensive R&D into the best possible solution, we’ve crafted a custom-tailored algorithm for RuBaRu’s use-case-specific full-text search. Here’s an article describing the algorithm’s design.
An Efficient Full-Text Search Algorithm with Trie Data Structures
In the realm of information retrieval and database management, optimizing search and querying processes is a perennial challenge. This article delves into an innovative algorithm designed to address this challenge, utilizing a combination of Trie data structures and efficient tokenization techniques.
Introduction
The algorithm outlined here is devised to efficiently manage and query a database of words. Its primary goal is to streamline the full-text search process, enabling fast and accurate retrieval of relevant data.
Breaking Down Words
The foundation of this algorithm lies in its approach to tokenization. The process begins with the segmentation of words into smaller units, referred to as tokens. These tokens are essentially substrings extracted from the original words. The algorithm generates these tokens in linear time (O(n)), where ‘n’ represents the length of the word.
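To make the tokenization step concrete, here is a minimal sketch in TypeScript (the function name `tokenize` and the prefix-only reading of the scheme are illustrative assumptions, not the exact production code). Each prefix of the lower-cased word becomes a token, so a word of length n yields n tokens; the length-based filtering described in the next section is applied at storage time.

```typescript
// Sketch: break a word into prefix tokens ("c", "cr", "cre", ...).
// A word of length n yields n tokens, one per prefix.
function tokenize(word: string): string[] {
  const w = word.toLowerCase();
  const tokens: string[] = [];
  for (let end = 1; end <= w.length; end++) {
    tokens.push(w.slice(0, end)); // prefix ending at `end`
  }
  return tokens;
}

console.log(tokenize("creator"));
// -> [ "c", "cr", "cre", "crea", "creat", "creato", "creator" ]
```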
Storing Tokens and Words
The next step involves the efficient storage of these tokens. However, not all tokens are created equal. To optimize space utilization, tokens of length 1 and 2, as well as tokens originating from middle indices of words, are excluded from storage. The chosen data structure for storing these tokens is a Trie.
In this Trie, each key represents a token, and the corresponding value is a buffer containing words associated with that particular token. This Trie acts as a compact index, allowing for efficient access to words based on their token components.
Database Updates and Insertions
Whenever a new word is introduced into the database, the algorithm dynamically generates tokens for that word and stores them in the Trie. Subsequently, the word is placed in the buffer of each token to which it corresponds. This process ensures that the database is continuously updated and indexed.
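To make the storage layout from the previous section and this insertion flow concrete, here is a hedged sketch (again TypeScript with illustrative names; a deployed canister would likely back this with stable structures): a character-level trie whose node for a given token carries a buffer of words, plus an `insertWord` routine that tokenizes the word, drops tokens shorter than three characters, and appends the word to each remaining token’s buffer.

```typescript
// Sketch of the token index: a character-level trie where the path spelling a
// token leads to a node whose `words` buffer lists the words indexed under it.
interface TrieNode {
  children: Map<string, TrieNode>; // one edge per character
  words: string[];                 // buffer of words associated with this token
}

const newNode = (): TrieNode => ({ children: new Map(), words: [] });

// Walk (creating nodes as needed) the path spelling `token` and return its node.
function nodeFor(root: TrieNode, token: string): TrieNode {
  let node = root;
  for (const ch of token) {
    if (!node.children.has(ch)) node.children.set(ch, newNode());
    node = node.children.get(ch)!;
  }
  return node;
}

// Prefix tokenizer from the earlier sketch.
function tokenize(word: string): string[] {
  const w = word.toLowerCase();
  return Array.from({ length: w.length }, (_, i) => w.slice(0, i + 1));
}

// Insertion: generate tokens, skip tokens of length 1 and 2,
// and append the word to each remaining token's buffer.
function insertWord(root: TrieNode, word: string): void {
  for (const token of tokenize(word)) {
    if (token.length < 3) continue; // excluded from storage
    nodeFor(root, token).words.push(word);
  }
}

const root = newNode();
["creator", "creative", "consumer"].forEach((w) => insertWord(root, w));
console.log(nodeFor(root, "cre").words); // -> [ "creator", "creative" ]
```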
Querying: Search with and without Data
The algorithm supports two querying endpoints: search without data and search with data.
- Search without Data
In the ‘search without data’ endpoint, the algorithm employs a Trie.find operation to quickly locate tokens relevant to the query and returns the matching words without any accompanying data (see the sketch after this list). The full-text search process is efficient, with a logarithmic time complexity (O(log n)), where ‘n’ represents the number of letters in the searched word.
- Search with Data
In the ‘search with data’ endpoint, the algorithm leverages the token buffer to extract words associated with the queried token. It then retrieves additional information about these words, such as ranking and count, from a separate data structure. This process exhibits a complexity of O(log n) + page_size × O(log m), where ‘n’ is the number of letters in the searched word, ‘m’ is the total number of words, and ‘page_size’ is the number of results requested per page in the pagination query.
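The sketch below illustrates both endpoints on the same illustrative TypeScript index (the metadata map with its `rank` and `count` fields, and the pagination shape, are assumptions made for the example). `searchTokens` walks the trie to the queried token and returns only the words in its buffer, while `searchWithData` pages through that buffer and attaches per-word metadata looked up in a separate map.

```typescript
// Sketch of the two query endpoints over the token trie.
interface TrieNode {
  children: Map<string, TrieNode>;
  words: string[];
}
interface WordMeta {
  rank: number;  // illustrative ranking score
  count: number; // illustrative occurrence count
}

const newNode = (): TrieNode => ({ children: new Map(), words: [] });

// Non-creating lookup: walk the trie along the token, or return undefined.
function findNode(root: TrieNode, token: string): TrieNode | undefined {
  let node: TrieNode | undefined = root;
  for (const ch of token.toLowerCase()) {
    node = node?.children.get(ch);
    if (!node) return undefined;
  }
  return node;
}

// "Search without data": return only the words stored under the token.
function searchTokens(root: TrieNode, query: string): string[] {
  return findNode(root, query)?.words ?? [];
}

// "Search with data": page through the token's word buffer and attach
// per-word metadata (rank, count) from a separate map.
function searchWithData(
  root: TrieNode,
  meta: Map<string, WordMeta>,
  query: string,
  page: number,
  pageSize: number,
): Array<{ word: string; meta: WordMeta | undefined }> {
  const words = findNode(root, query)?.words ?? [];
  const start = page * pageSize;
  return words
    .slice(start, start + pageSize)
    .map((word) => ({ word, meta: meta.get(word) }));
}

// Tiny demo index, populated directly (insertWord from the earlier sketch
// would normally do this).
function addToken(root: TrieNode, token: string, word: string): void {
  let node = root;
  for (const ch of token) {
    if (!node.children.has(ch)) node.children.set(ch, newNode());
    node = node.children.get(ch)!;
  }
  node.words.push(word);
}

const root = newNode();
addToken(root, "cre", "creator");
addToken(root, "cre", "creative");
const meta = new Map<string, WordMeta>([
  ["creator", { rank: 1, count: 42 }],
  ["creative", { rank: 2, count: 17 }],
]);

console.log(searchTokens(root, "cre"));               // -> [ "creator", "creative" ]
console.log(searchWithData(root, meta, "cre", 0, 1)); // -> [ { word: "creator", meta: { rank: 1, count: 42 } } ]
```

Note that in this character-level form a lookup walks one node per character of the query token; if the on-chain implementation instead uses a hash trie keyed by whole tokens (for example Motoko’s Trie module, which exposes a Trie.find operation), a lookup costs roughly logarithmically in the number of stored tokens.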
Comparison with Existing Search Engines/Tools
When comparing this algorithm to existing full-text search tools, it stands out for its efficiency in handling tokenized data. Traditional full-text search engines, such as Elasticsearch and Apache Solr, rely on inverted indices and complex ranking algorithms. In contrast, our algorithm’s simplicity in tokenization and Trie-based data structures offers advantages in terms of speed and resource efficiency for certain use cases.
Conclusion
This algorithm, blending Trie data structures with innovative tokenization techniques, presents an efficient solution for managing and querying word databases. Its ability to rapidly retrieve data with high precision makes it a valuable tool in applications where efficient full-text search and querying are paramount. The algorithm’s reliance on optimized data structures and smart tokenization exemplifies its technical prowess in the realm of information retrieval.