SOQL and Search: Can They Be Friends?

Let’s say I’m a private equity firm and a prospect asks me the following:

“I’d like to invest in companies on the New York Stock Exchange in the IT and Automotive sectors that have a market cap of at least $10 billion, a P/E ratio of at least 10, a CEO who has been with the company for at least 15 years, and… are committed to diversity in the workplace.”

Or I’m a recruiter trying to fill a job order:

“Find me a candidate that has at least 5 years experience as a head chef, with a degree from the CIA (Culinary Institute of America, not the other one), worked at a Michelin star restaurant… and is an expert in vegetarian cuisine.”

These 2 examples are interesting queries because they share the same challenge: they have both quantitative and qualitative components.

Quantitative queries deal with numbers and short strings of text like a name or title. They are filters that match exactly or not at all; a binary answer.
Qualitative queries explore text and are fuzzy, as with web search engines such as Google. Answers aren’t necessarily “correct”, but some are more relevant than others.

The Problem

Quantitative queries grew up with relational databases and their lingua franca, SQL. This is the world of structured data. Structured data refers to data that is defined by fielded values. Relational databases (Oracle, SQL Server, etc.) are typical examples. So are apps built on databases, like Salesforce and SAP.

Qualitative queries were born out of the search industry where websites and files are searched using “natural language”. This is the world of unstructured data (for those of you who are fastidious about terminology, it’s content, not data, but we’ll use data anyway). Unstructured data, or files and documents, refers to data that is predominantly “blobs of text”, as in a Word file, pdf, or media.

Structured and unstructured data technologies were developed in parallel for years without ever truly working together. Structured technologies deployed simple keyword search, technology the search industry replaced some time ago with semantic search and vector queries. Unstructured technologies searched databases by “flattening” them, that is, removing their all-important relationships.

Fortunately, this is changing. Since everyone now has access to pretty much all of the data out there, the real competitive differentiator is how accurately and quickly we find what we’re looking for, which begins with making sure we’re actually searching for everything we need.

Put simply, the best search engines today are searching the structured and unstructured data together to produce a combined relevancy score for more accurate results. Going back to our two examples at the beginning of this post: our first query would search the stock exchange database with the companies’ annual reports; our second query, candidate records with resumes and cover letters.

Let’s look at how this is done.

What is Structure and Unstructured Anyway?

The differences between structured and unstructured data are really distinctions without merit. All data has a structure in the form of metadata. Consider the “unstructured” document in a content management system like SharePoint. It has a title, created date, owner, and “body”, the value of which is all the text in the document itself. So do web pages with their HTML tags.

Conversely, apps like Salesforce, modeled on “structured” database engines, routinely have large text fields, for example, “Description”. Files are attached to object records in Salesforce all the time as attachments, notes, Chatter, and imported email records. Consider the annual report and resume in our earlier query examples. It’s fair to say that every text-based field, no matter its size, is itself a piece of unstructured data, even a contact’s name or address.

Actually, data sources reside somewhere on a sliding scale between the two:

So all data has some structure, just some more than others.

Today’s search engines capitalize on this structure by searching the name-value pairs as both fuzzy-text searchable content like Google search, and as exact-match filters like SOQL. The engine is able to improve the accuracy of its results by adding bias to these fields. For example, it may be more important that a query term is found in a “Title” field rather than a “Misc. Notes” field. In addition, the search engine adds additional structure by applying text analytics to fields that contain enough text to extract structure from them like noun entities, relationships, parts of speech, sentiment, etc.

Don’t Forget the Relationship!

OK, we have the data structure figured out, but what about the relationships between objects? The ‘R’ of RDBMS is first in the acronym for a reason. Just like the ubiquity of structure, there is also the ubiquity of relationships. An RDBMS explicitly declares its relationships but relationships exist implicitly everywhere.

Searching through a relationship between two objects is recognizing that the data from one object is relevant to the other object and should be searched as if it was part of the other object. Now I’m into the weeds but this concept is fundamental to improving search accuracy, so I’m going to spend a little time describing what this means (you can skip to the last paragraph in this section if this goes too far in the weeds for you).

Let’s dive in…

Consider our earlier examples from the private equity firm or the recruiter. Notice the clause “CEO who has been with the company for at least 15 years” in Example 1. Typically, this would not actually be part of the Company object that we’re searching for; it would be part of a related “Employee” object containing the company’s employees. The Employee object is considered a child of the Company object because there are many employees per company, but only one company per employee.

How does this resolve our CEO clause? Answer: the search engine searches through the relationship to the Employee object to find the CEO record and then the date the CEO joined the company. It sounds like a simple concept, and it is, for SOQL. Querying through relationships is routine for SOSL because it uses a feature called a JOIN.

This JOIN feature is now part of the search. And remember, search also adds natural language processing, relevancy, text analytics, etc. — all the friendliness that people love about search.

Now consider the qualitative part of the clause, “committed to diversity in the workplace”. Same approach. Here the JOIN is with an attached annual report (anything a CEO announces in an annual report is generally considered to be a commitment). The clause takes advantage of the fuzzy part of the search by finding within the report any mention of “diversity” within a few words of “employee”, synonyms applied in both cases.

This also works for our second example. Consider the clauses “worked at a Michelin-star restaurant” and “expert in vegetarian cuisine”. The restaurant’s star rating would be found in a child object listing employment history, the vegetarian expertise in a cover letter, or perhaps a comment note written by the recruiter.

By the way, there is no requirement that the annual report and cover letter in our examples has to be part of the same system as the parent object. For example, the parent object could be in Salesforce, and the child document in SharePoint. The relationship could be a field on the parent that contains a link (i.e. URL) to the document in the other system.

In fact, the relationship itself may not be declared at all but discovered through analysis. Let’s say the documents searched for the qualitative part of our example queries have no explicit relationship at all with the parent objects. How would this work? Answer: AI.

Yes, I said it: AI.

But a very specific type of AI. This AI can compare the content of the documents and the parent records and create the relationships where they align best. There is a lot of math-y detail here (vector queries, etc.) that we will leave for a later blog post where we introduce the industry’s latest shiny object, ChatGPT (hint: it’s pretty impressive!).

End of deep dive…

It really doesn’t make any difference if a related object is structured or unstructured, or if it resides in the same data source or a different one. Only the relationship matters. Without the relationship, you simply get separate sets of results. With the relationship, you can search your entire data ecosystem.

Conclusion

Some of you may remember when TVs were only for broadcast entertainment and computers only for data and documents. Even though the hardware was basically the same, they couldn’t talk to each other because they evolved on their own in parallel. We finally put them together.

We’re now doing the same for SOSL and Search. So, yes they can be friends. In fact, they should be married: one, integrated access layer providing a broader, more powerful discovery experience for everyone.

What do you think? I’m interested in your experiences. How does search fit in your environment? Have you thought about bringing together your structured and unstructured sources? We’ll be talking more about this and similar topics over the next few posts. We’ll be exploring search as the basis of a new generation of capabilities, such as matching, context curation, and automation.

Start thinking about search, not just as something you use to find information, but as a way to curate and filter information intelligently as it moves from one step to the next in your workflows.