The Fundamental Problem of Search

[Photo: path through the forest]

There is a fundamental problem with search relevance. Users are unaware of their own internal processing as they search, modern search interfaces glean only sparse information from users, and ultimately it is impossible to definitively know what a user really needs. Nevertheless, search applications are expected to act upon this sparse information and provide users with results that best match their intent.

In this post, you will learn about these challenges through several examples. By understanding the blind side of search, you can account for these challenges and provide your users with a better search experience.

The search problem space

A couple of years back I co-authored “Relevant Search,” in which I described the mechanics of information retrieval and how to build search applications that match users with the information they seek. But even as I wrote the book, something at the back of my mind was weighing me down. Only now has the problem taken shape clearly enough for me to describe it. I call it the fundamental problem of search. Consider the following:

  • Modern search interfaces are minimalistic, and users don’t have much opportunity to tell you what they want – usually just a text box.
  • Users have lots of criteria in mind when making a decision. If they want to find an event, then they are considering the type of event, location, time, date, overall quality, and probably many other things as well.
  • Different users have different criteria in mind when making decisions, and they weight those criteria differently.
  • Users often don’t know they have all of these criteria in mind. They believe you can “simply” find a set of matching documents and return them in some “simple” specified order.
  • Users believe that deciding whether or not documents match their search criteria is a binary decision, and users believe that the order of the results can be exact. In truth, both the match and the ordering are naturally “fuzzy.”
  • Despite uncertain user intent and the fuzzy nature of matching and ordering, the relevance engineer has to make both matching and ordering concrete and absolute.

The fundamental problem of search, then, is the fact that relevance engineers are required to perform a fool’s errand. With missing information, with ambiguous information, and with high user expectations, we have to coerce a search engine to somehow return documents that match the user’s intent. Further, the documents are expected to be ordered by hopelessly ill-defined notions of quality and relevance.

In the sections below I’ll delve into several examples of the fundamental problem of search as well as some ideas for rising above the problem. Here’s a hint though: it won’t be easy.

Matching and sorting by relevance

In the simplest possible scenario, the user enters text into a search box, and we match documents and sort the results based solely on relevance. That is, we find all the documents that contain at least one of the user’s keywords, and we score the documents using term frequency-inverse document frequency (TF*IDF) scoring. This process seems pretty straightforward, right? But fuzziness is already starting to creep in.

First, let’s consider how to determine which set of documents match. If your user searches for “green mile”, then we as humans recognize that the user is probably talking about the movie called The Green Mile. The search engine, however, is just going to return all documents that match the term green or the term mile, and documents that match only one of these terms probably aren’t going to be very relevant. One option is to require both terms to match. But this strategy is ill-advised because there are plenty of cases where an either/or match might be relevant. If the user searches for “comedy romance”, then they might prefer a comedy romance, but toward the end of the list, a comedy or a romance film might be just fine.
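To make the trade-off concrete, here is a minimal sketch of how it is usually expressed in Elasticsearch, assuming a hypothetical movies index with a title field. The minimum_should_match parameter controls how many of the user’s terms are required to match:

```python
# A minimal sketch, assuming a hypothetical "movies" index with a "title" field.
# minimum_should_match controls how strict matching is:
#   "1"    -> either term is enough (green OR mile)
#   "100%" -> every term is required (green AND mile)
either_or_query = {
    "query": {
        "match": {
            "title": {"query": "green mile", "minimum_should_match": "1"}
        }
    }
}

strict_query = {
    "query": {
        "match": {
            "title": {"query": "green mile", "minimum_should_match": "100%"}
        }
    }
}
```

Neither setting is right in general: requiring every term helps for “green mile” but hurts for “comedy romance”, which is exactly the dilemma described above.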

In principle, another option would be to return every document with a score above some cutoff value X. In practice, this isn’t possible because TF*IDF scoring is not absolute; you don’t score 0 for perfectly irrelevant documents and 1 for perfectly relevant documents. Consider two queries, one for “it” (as in the movie It) and another for “shawshank” (as in The Shawshank Redemption). The term “it” is very common, and so the best matching document – the movie It – will likely get a relatively low TF*IDF score. In the case of “shawshank,” let’s say that we don’t actually have a document for The Shawshank Redemption, and the only match comes from a footnote in a description stating “from the director of The Shawshank Redemption.” Though this is a poor match, the score will be quite high because the word “shawshank” is so rare. In this example, we have a low-scoring document that is a great match and a high-scoring document that is a terrible match. It is simply not possible to delineate between matching and non-matching documents based upon score alone.
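A toy calculation of the IDF component makes this concrete. The corpus statistics below are made up purely for illustration, but the shape of the result is what matters: a very common term contributes almost nothing to the score even for a perfect match, while a very rare term contributes a large score even for a footnote mention.

```python
import math

# Made-up corpus statistics, purely for illustration.
N = 1_000_000                  # total documents in the index
doc_freq = {"it": 400_000,     # "it" appears in a huge fraction of documents
            "shawshank": 3}    # "shawshank" is extremely rare

def idf(term):
    # Classic IDF; real engines (e.g. Lucene's BM25) use a smoothed variant,
    # but the intuition is identical: rarity dominates the score.
    return math.log(N / doc_freq[term])

print(round(idf("it"), 2))         # ~0.92  -> even the best match for "it" scores low
print(round(idf("shawshank"), 2))  # ~12.72 -> even a footnote mention of "shawshank" scores high
```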

We see that even in the most basic text search scenario we already begin running into situations where we can’t know the user’s intent and where the hope of perfect matching and perfect sorting break down. But wait, it gets worse!

Matching and sorting by relevance and quality

“I want to find a documentary about Abraham Lincoln.” Seems simple enough. So, we retrieve all documents that match either abraham or lincoln, and we sort by the default scoring algorithm so that documents matching both terms appear at the top. However, there’s a problem: the user told you they want Abraham Lincoln documentaries, but implicit in the request is that they really only want the high-quality ones.

If you’re used to databases and have just started working with search, then the answer seems obvious – just sort by quality (or popularity, or whatever field serves as a proxy for quality). If you do this, you’ll immediately find yourself with a new issue: when sorted by quality, the top results will contain documents that are high quality but match only one of the terms and aren’t very relevant at all. If you had a really high-quality documentary on the life and times of the Biblical Abraham, for example, it would jump above the documents that are actually about Lincoln.

So again, for someone new to search, the next “answer” is clear: just set the minimum_should_match parameter to 100% to ensure that we only return documents that contain all of the user’s query terms. But this doesn’t really fix the problem. Consider a high-quality documentary about Ulysses S. Grant that merely mentions Abraham Lincoln – a high-quality result, but nevertheless irrelevant to the user. What’s more, minimum_should_match=100% can get you in trouble when the user searches by dumping lots of words into the search box and hoping that some of them match. For the query “civil war abraham lincoln”, a documentary entitled “President Lincoln and the Civil War” would be entirely relevant, yet it would not match because it never mentions “abraham”!

The best thing to do here is to boost by quality rather than use absolute sorting. By default, the score of the document is based solely upon how well the text matches according to TF*IDF. You can incorporate quality into the score by adding in some multiplier times the quality: total_score = text_score + k * quality. With this approach, you can in principle adjust k so that the total score is the appropriate balance between text score sorting (k = 0) and absolute quality sorting (k = inf).
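In Elasticsearch, one common way to express total_score = text_score + k * quality is a function_score query with boost_mode set to sum. Here is a minimal sketch, assuming a hypothetical movies index with a numeric quality field; the value of k is just a starting point you would tune:

```python
# A minimal sketch, assuming a hypothetical "movies" index with a numeric "quality" field.
k = 0.5  # tuning knob: k = 0 is pure text relevance; a very large k is effectively a quality sort

boosted_query = {
    "query": {
        "function_score": {
            "query": {"match": {"title": "abraham lincoln"}},
            "functions": [
                # contributes k * quality for each document
                {"field_value_factor": {"field": "quality", "factor": k, "missing": 0}}
            ],
            "boost_mode": "sum"  # total_score = text_score + (k * quality)
        }
    }
}
```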

Though linearly adding in quality is a very commonly used approach and is often effective, it can come with some nasty problems of its own. Ideally, you would be able to find some k that works best in all cases. In practice, you cannot. Refer back to the example of the “it” search and the “shawshank” search. In the “it” search, the best matching document will have a much lower text score than is typical, while in the “shawshank” search even mediocre matches can have very high scores. If we calculate total_score as text_score + k * quality, then in the “it” search the quality component will have a much greater effect on sorting than it will in the “shawshank” search. It would be nice if we could somehow automatically scale k so that it was proportional to the typical text scores for a given search. More on this in a future post!

Sidebar: multiple objective optimization

A big part of the underlying theme here is that search is a multiple-objective optimization problem. That is, we are trying to tune the scoring function so that multiple objectives are optimized simultaneously. However, we do not know – and cannot know – how important the objectives are relative to one another.

The issue is perhaps most evident in applications like Yelp where the different objectives are called out in the application: You’re looking for a restaurant – how would you like to organize the results? Distance? Price? The number of stars? If you’ve selected a food category or typed in a search, then how important should that be? From Yelp’s perspective, the answer cannot entirely be known. The best we can do is to find some balance between the various dimensions that empirically tends to maximize conversion rates. In modern implementations of search, Learning-to-Rank is a machine learning approach that does precisely this.

Matching and sorting by relevance, quality, and date

Things get even worse when we involve precise quantities like date or price. Often users want to sort their results by date, and this sounds like a perfectly reasonable thing to do. However, you will encounter some terrible side effects when sorting exactly by something like date. Here’s why: the Pareto principle drives the world, and your inventory is no different. If you are in e-commerce, then 20% of your inventory generates 80% of your sales, and the other 80% of your inventory generates the remaining 20%.

Let’s say our users are searching for “beer events,” but they want to sort the results by date. Rather than showing the most relevant events such as beer festivals, brewing classes, and beer tastings, we’re going to show them irrelevant, date-ordered events such as business dinners or speed dating events that merely mention beer in their descriptions. Effectively, we are scooping way down into the 80% of less desirable events simply because they happen sooner than the more relevant events that we should be returning.
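This is what happens when an explicit sort replaces relevance entirely. Below is a sketch of the kind of query that produces the problem, assuming a hypothetical events index with a start_date field:

```python
# A minimal sketch, assuming a hypothetical "events" index with a "start_date" field.
# With an explicit sort clause, the text score no longer influences ordering at all:
# any event that merely mentions "beer" can land at the top just because it happens soonest.
date_sorted_query = {
    "query": {"match": {"description": "beer events"}},
    "sort": [
        {"start_date": {"order": "asc"}}  # soonest first; relevance is ignored
    ]
}
```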

Solving this is quite a challenge. Consider some alternatives:

  • Boost by date: As presented in the last section, you can boost by date and make sure that the best documents are right at the top of the search results, roughly sorted by date. But when users choose to sort by a precise quantity like date, they will see any deviation from date order as evidence that search is broken and not to be trusted.
  • Re-sort the most relevant documents by date: You can use the Elasticsearch rescore feature to find the N most relevant documents and then re-sort them by date (see the sketch after this list). But how do you find a good value for N? If N is too low, then users may page past all N results, and you’ll have to either tell them there are no more results OR “show omitted results” and start over by date. On the other hand, if N is too high, then the returned set will dip past the most relevant documents and pull up some of the 80% of less desirable results. Sorting this group by date means that some of these irrelevant or low-quality documents end up at the top of the search results.
  • Sort by date then sort by relevance: If you think this is a good idea, then you haven’t put your thinking cap on yet today. Nevertheless, I hear this tossed around as an option quite a bit. The problem is that if the date includes a timestamp, then it is effectively a continuous value. If your documents have timestamps with granularity down to the second, then sorting by date followed by relevance is no different from just sorting by date, because two documents almost never share the exact same timestamp.
  • Bucket by date and sort by relevance within each bucket: As a variant on the previous idea, you do have the option of discretizing the date, chunking documents into buckets of a day or a week, and sorting by relevance within each bucket. This might be a great solution. If the user doesn’t expect exact date/time ordering, then they will be more forgiving when the documents don’t appear in exact date order down to the second within the buckets. However, there are still problems – within each bucket there are fewer documents to draw from, yet the search engine will faithfully provide the best documents it has for each bucket. This means that as your bucket size gets smaller, the chances of a bucket getting filled with irrelevant documents become higher. It would be better not to return such a bucket at all, but per the Matching and sorting by relevance section above, scoring is not absolute, so it is difficult to decide which buckets to omit from the search results.
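Here is a minimal sketch of the rescore option from the list above. It assumes a hypothetical events index where start_ts is indexed as a plain numeric field holding epoch milliseconds; N is the hard-to-choose window size discussed above.

```python
# A minimal sketch of "re-sort the N most relevant documents by date".
# Assumes a hypothetical "events" index with a numeric epoch-millis "start_ts" field.
N = 200  # the hard-to-choose window size

rescored_query = {
    "query": {"match": {"description": "beer events"}},
    "rescore": {
        "window_size": N,                # only the top N relevant hits get re-scored
        "query": {
            "rescore_query": {
                "function_score": {
                    "functions": [
                        # reciprocal of the timestamp: earlier events score higher
                        {"field_value_factor": {"field": "start_ts",
                                                "modifier": "reciprocal",
                                                "missing": 1}}  # fallback for docs without the field
                    ],
                    "boost_mode": "replace"
                }
            },
            "query_weight": 0,           # drop the original text score inside the window
            "rescore_query_weight": 1    # order the window purely by (inverted) date
        }
    }
}
```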

No hard and fast solutions

As you can see in the sections above, I’m doing an excellent job of outlining a huge problem, but I’m not providing any easy solutions. That’s because there aren’t any!

By their very nature, search and recommendation are, and forever will be, filled with nasty corner cases. Human language is dirty and imprecise, and your users’ information needs will be uncertain and highly varied. However, don’t lose hope. Despite the many corner cases, search technology is still an excellent tool for helping users satisfy their information needs in the vast majority of use cases.

What’s more, search is getting better. Learning to Rank is a machine learning technique for scoring documents that can automatically find the best balance between features like text relevance, static quality, and innumerable other signals. Similarly, there has been a lot of conversation in the search community about bringing embedding vectors into search so that traditional inverted-index search can be used in conjunction with recent developments in machine learning (check it out!).

Finally, I expect the user experience to continue to develop and improve. The dominant search experience for the past 15 years has been a text box at the top of the screen. However, we see this giving way to more conversational search experiences, like when you ask Siri to look up a phone number or ask Alexa to play a particular song. The visual experiences are changing too. Take a look at Google’s image search: the left-nav faceted search has been replaced with a very intuitive tag-based slice-and-dice experience that allows you to quickly narrow down to the small set of results that fits your information needs. I expect we will continue to get better and better experiences.

You can learn more about building a relevant search by checking out my previous post on understanding the ideas of precision and recall as they relate to search.

Have you run into the fundamental problem of search in your own search applications? What have you done to overcome it? I’d like to hear from you! Ping me on Twitter @JnBrymn or add a response at the bottom of the page. If you’d like, we can jump onto a hangout and share some war stories.

Photo by Andrew Neel on Unsplash

Cowboys and Consultants Don’t Need Unit Tests

As a developer, my understanding of and respect for software testing have been slow in coming, because in my previous work I was an engineer and a consultant, and in those roles it wasn’t yet obvious how important testing really is. But over the past year I have finally gained an appropriate respect and appreciation for testing, and it’s even improving the way I write code. In this post I will explain where I’ve come from and how far I’ve traveled in my testing practices. I’ll then list some of the more important principles I’ve picked up along the way.

Engineers are cowboys … and cowboys don’t need no stinkin’ tests.
I got my start as an aerospace engineer. And as an engineer, if you do any programming at all, testing is probably not part of it. Why? Because engineers are cowboy coders. As engineering students, we are taught just enough to implement whatever algorithm we have in mind and make some pretty graphs, and then we graduate.

It wasn’t much better at my first job. I had shown an interest in software development, and so, on one particular project, I was given the task of reworking and improving the project codebase. We were developing autonomous aircraft control algorithms, and it soon became apparent that after months of work, no one had thought to run the simulation using different starting conditions. When we finally tried different starting conditions, we found that our control system was generally better at crashing the plane than flying it. This should have been the biggest hint in my early career that testing might be important. But it would still be quite a while before I learned that lesson.

Continue reading “Cowboys and Consultants Don’t Need Unit Tests”

Search Precision and Recall By Example

I work on Event Discovery and Recommendation at Eventbrite, a feature central to how our users interact with the site. A critical aspect of returning the best results is understanding the tradeoff between precision and recall. If we focus on precise event recommendations, the results will be highly relevant, but we may overlook some events that a user would enjoy and find relevant. Alternatively, if we focus on recall, then we will return tons of events that the user is interested in, but we’ll also flood them with less relevant events. Unfortunately, it is rare for recommendations to have both high precision and high recall.

In the following post, I explain the notions of precision and recall by example and discuss the unavoidable tradeoffs required to build the best search and recommendation systems possible.

The Overview

Precision and recall are two fundamental measures of search relevance. Given a particular query and the set of documents returned by the search engine (the result set), these measures are defined as follows:

  • Precision is the percentage of documents in the result set that are relevant.
  • Recall is the percentage of relevant documents that are returned in the result set.

Admittedly, these definitions are a little hard to follow at first. They may even sound like the same thing. In the discussion that follows, we provide a thorough example that will help you understand these definitions and their differences. You’ll also begin to see why it’s so important to keep these concepts in mind when designing any application of search.

Additionally, we demonstrate how precision and recall are often at odds with one another. Generally, the more you improve recall, the worse your precision becomes, and the more you improve precision, the worse your recall becomes. This implies a limit on the best you can achieve in search relevance. Fortunately, you can get around this limit. We explore the details in the discussion that follows.
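As a quick numerical sketch of the two definitions (the event IDs and the relevance judgments below are made up for illustration):

```python
# Made-up sets of event IDs, purely to illustrate the two definitions.
relevant = {"e1", "e2", "e3", "e4", "e5"}   # every event this user would actually find relevant
returned = {"e1", "e2", "e9", "e10"}        # the result set the engine returned

true_positives = relevant & returned        # relevant events that were actually returned

precision = len(true_positives) / len(returned)  # 2/4 = 0.50: half of the result set is relevant
recall = len(true_positives) / len(relevant)     # 2/5 = 0.40: we found 40% of the relevant events

print(f"precision={precision:.2f} recall={recall:.2f}")
```

Returning more documents tends to push recall up and precision down, which is exactly the tradeoff described above.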

Continue reading “Search Precision and Recall By Example”

Building a Marketplace — Search and Recommendation at Eventbrite

These are exciting times for Eventbrite! Since our beginning, Eventbrite has served as a platform for event organizers. We allow organizers to market their events, message their attendees, and most importantly we take care of the hassles associated with selling tickets. But recently we have begun to take a more active role in connecting potential attendees to the events that they will enjoy, and in doing so, we are becoming the marketplace of events!

Personalized Discovery – the Center of the Marketplace

And at the heart of this transition to a marketplace of events is the notion of discovery – the ability for users of our site to serendipitously discover events that they will enjoy. At Eventbrite, we think of discovery as occurring in two predominant modes: search and recommendation. In search, the user is actively telling us what type of event they are looking for, and it is our job to return results that are most relevant to their search. In recommendation, Eventbrite provides passive users with suggestions for events that they might be interested in. Personalization is an important aspect of discovery: we hope to provide users not only with recommendations of popular events or search results that match their query, but also with recommendations and search results tuned to their individual preferences, based upon their previous interactions with Eventbrite and the things they have explicitly told us about themselves.

Continue reading “Building a Marketplace — Search and Recommendation at Eventbrite”