KonaSearch Scale: Searching Billions of Records Using Parallelized Queries

An enterprise search query engine is generally considered “performing” when at least 98 percent of its queries complete in under 200 milliseconds and no query takes more than 2 seconds. To understand how this is possible when searching billions of records, one must first understand how a search engine works.

Search Engine Architecture

All search engines (e.g. Google™ search, Salesforce™ global search, KonaSearch™) at their most basic comprise two asynchronous processes:

  • Indexing data – uploading the searchable data into a search index
  • Querying data – sending search queries to the index and returning results

The two processes work independently of each other, but they share a common construct for managing scale and a single point of intersection: the search index. For those familiar with relational databases and similar constructs (e.g. Oracle®, Salesforce), a search index has structure and schema, and it understands the notion of instantiating objects within the structure as records (“documents”, in search parlance). However, the internal structure of the index is very different: essentially a “bag of words”, or name-value pairs, with pointers to where the words are located in the various records.

This name-value pair “hyper-denormalized” schema allows the index to return query results from extremely large data sets in a very short period of time. Consider how Google is able to return results from so many websites. The English-language index, for example, contains only about 150,000 terms, because there are only about 150,000 words in common English usage; each term, however, carries a fantastically large array of offset pointers to all the places where that word appears in web pages. Google’s actual models are far more complex, but the basic premise still holds.
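
To make the “bag of words” concrete, here is a minimal sketch in Python (the documents are hypothetical) of an inverted index: each word maps to a postings list of (record, offset) pointers, so a term lookup costs the same no matter how many records were indexed.

    from collections import defaultdict

    # Minimal inverted index: each term maps to a postings list of
    # (document id, word offset) pairs: the "bag of words" with pointers.
    index = defaultdict(list)

    documents = {
        1: "the quick brown fox",
        2: "the lazy brown dog",
    }

    # Indexing: record every position at which each word occurs.
    for doc_id, text in documents.items():
        for offset, word in enumerate(text.split()):
            index[word].append((doc_id, offset))

    # Querying: a term lookup is a single dictionary access, regardless
    # of how many documents were indexed.
    print(index["brown"])   # [(1, 2), (2, 2)]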

Web vs. Enterprise Search

For public websites like Google, and for large ecommerce sites like Amazon.com, the main concern is maintaining performance under high query volume. Here, query volume is measured in thousands of queries per second because, as public websites, their audiences are massive.

Enterprise search engines have different concerns. Query traffic volume is not normally a factor in query performance because these engines long ago adopted the same technologies used by the public search engines, and even enterprises with hundreds of thousands of employees do not approach the same query volumes. Query performance is, however, affected by data volume, because enterprise queries are generally more complex, the relevancy models more rigorous, and the data structures more varied.

Scaling the Architecture

Provisioning a search index on a typical multi-CPU enterprise server will support acceptable search performance up to around a million records. Searching several tens of millions of records, however, requires distributing, or “sharding”, the search index across multiple index cores (multiple servers) to provide more power and memory for parallelizing computation. But this distributed logical index, or collection, has its limits as well.
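
As an illustration of how records are distributed, a common approach is to route each record to a shard by hashing its key; a minimal sketch in Python, assuming simple hash-modulo routing over a fixed shard count:

    import hashlib

    NUM_SHARDS = 8

    def shard_for(doc_id: str) -> int:
        # Route a record to a shard by hashing its key. A stable hash
        # (not Python's salted built-in hash()) keeps routing consistent
        # across processes and restarts.
        digest = hashlib.md5(doc_id.encode("utf-8")).digest()
        return int.from_bytes(digest[:4], "big") % NUM_SHARDS

    print(shard_for("account-00042"))  # always lands on the same shard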

Querying indexes with billions of records requires another level of parallelization: distributing the index across multiple collections, each with multiple shards. Since each shard contains a subset of the records in the index, the query is sent to all the shards at the same time and the results are JOINed together (“JOIN” here refers to an operation similar to the relational database JOIN). This special multi-shard JOIN is implemented in the core of the index engine to ensure optimal performance.
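
A minimal sketch of that scatter-gather pattern in Python (the shard names and results are simulated stand-ins): the query fans out to every shard in parallel and the partial results are merged into one ranked list.

    import random
    from concurrent.futures import ThreadPoolExecutor

    SHARDS = ["shard0", "shard1", "shard2"]  # hypothetical shard endpoints

    def query_shard(shard: str, query: str) -> list:
        # Stand-in for an HTTP call to one shard; each shard scores only
        # the subset of records it holds.
        return [{"id": f"{shard}-doc{i}", "score": random.random()}
                for i in range(3)]

    def scatter_gather(query: str, top_k: int = 5) -> list:
        # Send the query to all shards at the same time...
        with ThreadPoolExecutor(max_workers=len(SHARDS)) as pool:
            partials = list(pool.map(lambda s: query_shard(s, query), SHARDS))
        # ...then merge the per-shard results into a single ranked list.
        merged = [hit for hits in partials for hit in hits]
        merged.sort(key=lambda h: h["score"], reverse=True)
        return merged[:top_k]

    print(scatter_gather("revenue report"))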

Unfortunately, while the JOIN command is supported in most query languages, it is not supported across collections and is not implemented within the search engine core. The solution requires two query parsers. The first parser queries a remote collection to obtain a set of JOIN keys to use as a filter against the local collection, retrieving the keys from the remote collection with a streaming expression. The second parser takes a field name and a range and matches records whose hash for that field falls within the range. If the local collection uses the JOIN key field as its routing prefix with the compositeId router, applying this filter lets the first parser’s query on each shard request only the JOIN keys that could match that shard’s documents.
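
This two-parser design matches the crossCollection join and hash_range query parsers found in Apache Solr; as a hedged illustration, assuming a Solr-style engine with hypothetical collection and field names, such a query might be issued like this:

    import requests

    # Hypothetical endpoint, collections, and fields. The fq syntax follows
    # Apache Solr's crossCollection join parser, which streams JOIN keys
    # from the remote collection and filters the local one; with compositeId
    # routing on account_id, each shard fetches only keys it could match.
    params = {
        "q": "status:open",
        "fq": ('{!join method="crossCollection" fromIndex="contacts" '
               'from="account_id" to="account_id" v="region:EMEA"}'),
    }
    resp = requests.get("http://localhost:8983/solr/accounts/select",
                        params=params)
    print(resp.json()["response"]["numFound"])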

Data Structure Optimization

Some search engines (e.g. KonaSearch) further optimize performance by separating the data by structure type, each in its own sub-index:

  • Structured data – Database records and similar constructs (e.g. Salesforce objects). These are fielded objects or tables primarily populated with small amounts of data of varying types (e.g. strings, numbers, dates).
  • Unstructured data – What we generally call files or documents (e.g. PDF, Microsoft® Office 365™, webpages, images; Salesforce Files, Attachments, Articles). Files are defined by a small number of fields and one large field containing the document’s text or binary/image data.
  • Permissioning data – Meta information that resolves what the originator of a query (e.g. a user) is permitted to see in the results (e.g. Salesforce users, groups, roles, profiles, permission sets, role hierarchies, etc.).

The search engine graph-traverses the permissioning data to resolve the ACLs and then runs a JOIN across all three indexes to produce the final set of results. Note that this means permissioning is resolved on every query, rather than by the more common but limiting approach of passing the search results to a separate process outside the search engine.
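
A minimal sketch of the permissioning step, with a hypothetical membership graph: the user’s groups are expanded transitively, and the resulting ACL set filters the candidate hits, conceptually the JOIN against the permissioning sub-index.

    from collections import deque

    # Hypothetical permissioning data: principal -> parent groups/roles.
    MEMBERSHIPS = {
        "alice": ["sales_team"],
        "sales_team": ["sales_dept"],
        "sales_dept": ["all_employees"],
    }

    def resolve_acls(principal: str) -> set:
        # Breadth-first traversal collecting every group/role the
        # principal belongs to, directly or transitively.
        acls, queue = {principal}, deque([principal])
        while queue:
            for parent in MEMBERSHIPS.get(queue.popleft(), []):
                if parent not in acls:
                    acls.add(parent)
                    queue.append(parent)
        return acls

    # A record is visible when its ACL intersects the resolved set.
    hits = [{"id": 1, "acl": {"sales_dept"}}, {"id": 2, "acl": {"hr_dept"}}]
    allowed = [h for h in hits if h["acl"] & resolve_acls("alice")]
    print(allowed)  # only record 1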

Incremental Indexing and Caching

Often, the rate at which a data source can export data to an external system is many orders of magnitude slower than the rate at which the search engine can index it. For this and other reasons, the search engine ensures that the first time an org is indexed is the only time it has to index the entire set of data. All subsequent indexing actions are constrained to incremental changes (inserts, updates, deletes) and are run as background processes, either triggered by the changes or run at regularly scheduled intervals.
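
A sketch of the incremental model, assuming a change feed of insert/update/delete events (the events and the in-memory index below are hypothetical stand-ins):

    # Hypothetical change events exported from the source system.
    changes = [
        {"op": "insert", "id": "001", "body": {"name": "Acme"}},
        {"op": "update", "id": "002", "body": {"name": "Globex"}},
        {"op": "delete", "id": "003"},
    ]

    index = {}  # stand-in for the search index

    def apply_change(event: dict) -> None:
        # Inserts and updates both upsert the record; deletes remove it.
        if event["op"] in ("insert", "update"):
            index[event["id"]] = event["body"]
        else:
            index.pop(event["id"], None)

    # Run as a background process, triggered by changes or on a schedule.
    for event in changes:
        apply_change(event)
    print(index)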

The initial upload itself needs consideration. For example, it can take weeks to export a large corpus of data out of a cloud application for the first time. For extreme sizes, therefore, the normal incremental indexing process is replaced with a faster batch export model. Further, to avoid having to repeat this process should anything corrupt the index, the data can be written to a high-performance persistent cache (e.g. Hadoop) and streamed into the search index (e.g. via Spark) in near-real time. This allows the system to recreate the index from scratch by streaming directly from the Hadoop store.
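
A sketch of that recovery path, assuming the corpus has been persisted to the cache as Parquet files and is replayed into a fresh index with PySpark (the path and the indexing call are hypothetical):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("reindex-from-cache").getOrCreate()

    # Read the persisted corpus from the cache (hypothetical path).
    corpus = spark.read.parquet("hdfs:///search-cache/records")

    def index_partition(rows):
        # Hypothetical: batch-post each partition of records to the search
        # index, so the rebuild streams in parallel from the cache instead
        # of re-exporting from the (much slower) source application.
        for row in rows:
            ...

    corpus.foreachPartition(index_partition)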

Redundancy and Failover

Search indexes are often critical services with high availability and uptime requirements; in effect, they can never really be offline, even for maintenance. As such, indexes (shards and all) are typically replicated so that there are always multiple copies in hot standby. Both data uploads and queries are sent to all copies in parallel via a dispatcher and load balancer, allowing the engine to continue running if a server fails, locks up, or is delayed by a long-running process.
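
A minimal sketch of the dispatcher idea, with simulated replicas: the query is sent to every copy in parallel and the first successful response wins, so a failed or slow replica never stalls the engine.

    import concurrent.futures as cf
    import random
    import time

    REPLICAS = ["replica-a", "replica-b", "replica-c"]  # hypothetical copies

    def query_replica(name: str, query: str) -> str:
        # Stand-in for a real network call; any replica may be slow or hung.
        time.sleep(random.uniform(0.01, 0.5))
        return f"results for {query!r} from {name}"

    def dispatch(query: str) -> str:
        # Fan the query out to all replicas and return the first answer.
        with cf.ThreadPoolExecutor(max_workers=len(REPLICAS)) as pool:
            futures = [pool.submit(query_replica, r, query) for r in REPLICAS]
            done, _ = cf.wait(futures, return_when=cf.FIRST_COMPLETED)
            return next(iter(done)).result()

    print(dispatch("quarterly forecast"))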