Varnish and A/B Testing: How to Play Nice


Here at Eventbrite, we love building sites that are fast, delightful, and reliable. Caching HTML responses using edge caches, such as Varnish, ensures a lighter load on your servers and a performant experience for the end user. However, doing so can often cause A/B testing frameworks to fail in a sneaky fashion.

Read on to learn some key things to know if you find yourself running an A/B test on a page served via Varnish.

First, a Quick Overview

What is A/B testing? The Wikipedia page covers the topic well, but here’s a quick TL;DR: A/B testing allows us to expose our users to two slightly different experiences, a control and a variant, where the variant differs from the control in a single, controlled way. We then track each experience against predetermined performance metrics, such as conversion rate to purchase, to decide whether the variant provides a real lift over the control.

A/B testing is one of the most useful tools developers and product managers can use to determine what engages their audience best. Often these tests need to live on pages that must be both reliable and performant. That’s where Varnish comes in.

Varnish is an open-source caching HTTP reverse proxy: essentially, a super-fast cache that sits in front of any server that understands HTTP. It receives requests from the client and attempts to serve an HTTP response from the cache. If it cannot, it forwards the request to the backend server, stores the server’s response, and passes it along to the client.

Varnish sounds great! Why is it troublesome with A/B Testing?

Varnish caches an entire HTML response, so some requests from the client never hit any server-side application code. If the A/B testing framework assigns variants on the server or relies on any server-side logic, then a person enrolled in variant A may be served a cached response of variant B (and vice versa). This is bad. The experiment data becomes corrupt, and any potential insights are useless. If the A/B test is entirely separate from any backend logic, there may not be any problem at all!
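To make the failure mode concrete, here is a minimal sketch in Python. It is not Varnish itself: NaivePageCache and render_home are hypothetical stand-ins for a full-page cache keyed only by URL and for a server-side view that picks a variant from the guest_id.

```python
from typing import Callable, Dict


def render_home(guest_id: int) -> str:
    """Hypothetical server-side view: odd guest_ids get the variant."""
    variant = "show_cats" if guest_id % 2 == 1 else "control"
    return f"<html><body>homepage:{variant}</body></html>"


class NaivePageCache:
    """Caches whole responses keyed only by URL, like a full-page edge cache."""

    def __init__(self, backend: Callable[[int], str]) -> None:
        self.backend = backend
        self.store: Dict[str, str] = {}

    def get(self, url: str, guest_id: int) -> str:
        if url not in self.store:
            # Miss: ask the backend and remember whatever it rendered.
            self.store[url] = self.backend(guest_id)
        # Hit: the backend never sees this request or its guest_id.
        return self.store[url]


cache = NaivePageCache(render_home)
first = cache.get("/home/", guest_id=1)   # odd guest: variant rendered and cached
second = cache.get("/home/", guest_id=2)  # even guest: should see the control
print(second)  # <html><body>homepage:show_cats</body></html>
```

The second guest should have seen the control, but the cache key never included the variant, so the cached variant page is served instead.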

What is the Solution?

Utilizing Edge-Side Includes (ESI) with our Varnish layer!

ESI is a small markup language that allows for the dynamic assembly of web content. It gives an edge server (like our Varnish cache) the ability to mix and match content (or fragments) from multiple cached URLs into a single response.

Let’s look at a simple example with a global header we want included via ESI on multiple pages:

<!-- my_global_header.html -->
<nav>Awesome</nav>

<!-- HTML file with ESI include -->
<html>
    <body>
        <esi:include src="/my_global_header.html" />
        <div>Lots of other content</div>
    </body>
</html>

What is happening here?

The Varnish server understands how to parse the <esi:include> tag and checks whether it has the path given in the src attribute cached.

On a hit (the requested item is in the cache): Varnish inserts that cached fragment into the response returned to the client. The server does not have to do any additional work to create our global header again; Varnish simply inserts the cached global header directly into the response.

On a miss (the requested item is not in the cache): Varnish goes back to the server and asks for the content at the provided path. It then stores that response in the cache using the src value as the key, inserts the fragment into the response, and passes it along to the client.
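The hit and miss behavior described above can be sketched in a few lines of Python. This is only an illustration of the idea, not how Varnish is implemented: EsiEdge, its regular expression, and the backend function are all hypothetical, and real ESI processing handles far more than a single self-closing tag.

```python
import re
from typing import Callable, Dict

# Matches a self-closing include tag and captures its src value.
ESI_INCLUDE = re.compile(r'<esi:include\s+src="([^"]+)"\s*/>')


class EsiEdge:
    """Toy edge cache that resolves <esi:include> tags from a fragment cache."""

    def __init__(self, fetch_from_backend: Callable[[str], str]) -> None:
        self.fetch = fetch_from_backend
        self.fragments: Dict[str, str] = {}
        self.backend_calls = 0

    def _fragment(self, src: str) -> str:
        if src not in self.fragments:
            # Miss: fetch from the backend and cache under the src value.
            self.backend_calls += 1
            self.fragments[src] = self.fetch(src)
        # Hit (or freshly filled miss): return the cached fragment.
        return self.fragments[src]

    def assemble(self, html: str) -> str:
        """Replace every include tag with its fragment."""
        return ESI_INCLUDE.sub(lambda m: self._fragment(m.group(1)), html)


def backend(path: str) -> str:
    return {"/my_global_header.html": "<nav>Awesome</nav>"}[path]


edge = EsiEdge(backend)
page = (
    '<html><body><esi:include src="/my_global_header.html" />'
    "<div>Lots of other content</div></body></html>"
)
print(edge.assemble(page))  # first request: miss, backend is called once
print(edge.assemble(page))  # second request: hit, served from the fragment cache
print(edge.backend_calls)   # 1
```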

Why not Varnish the whole page?

This way we can reuse the global header component on any number of templates, including those that contain user-specific information which we should not serve via Varnish. It lets us be surgical about which content we cache and which we do not.

Applying ESI to our use case

We can utilize ESI to include an entire view, rather than just a fragment of a view, in such a way that we don’t impact performance negatively. Let’s run through an example.

Say we have a complicated homepage at www.mywebsite.com/home. Our server resolves incoming requests for /home to our view handler HomePageView which returns an HTML response. HomePageView does massive amounts of logic and heavy lifting to provide a great experience to our users. It receives heavy traffic, regularly, so we naturally serve it with Varnish to avoid such heavy lifting for every request.

However, our team has been asked to run an experiment on the homepage which would display a picture of a cool cat to users with an odd-numbered guest_id. Here, guest_id is a semi-permanent identifier stored in a cookie for a logged-out user.

We then can do the following:

  1. Remove any “standard” Varnish configuration that may have been implemented on the homepage to ensure that every single request hits the server. When a request comes from the client for www.mywebsite.com/home/, every single one should resolve to the HomePageView.

  2. Move all of the heavy logic that HomePageView was previously doing to a new view titled HomePageViewESI. We’ll come back to this in step 5.

  3. Now instead of the normal heavy logic our HomePageView previously did, we only parse the guest_id from the request. For purposes of the example, let’s say the guest_id is odd. The view then creates an ESI specific path that represents a homepage covered in cats:
    esi_path = my_esi/home/?my_experiment_variant=show_cats

    Aside: The esi_path here acts as our unique cache key.

  4. Then the response which HomePageView returns from our application server is just the following:
    <esi:include src="my_esi/home/?my_experiment_variant=show_cats" />
    

    That’s it. We don’t include anything else in the server response. Our Varnish server understands how to parse the <esi:include> tag and, on a hit, inserts the cached cat-covered homepage specified by the provided esi_path. No application logic was necessary beyond parsing the guest_id to serve the correct content to the end user.

  5. However, what if the esi_path is a miss? Varnish will go back to our server and request the content represented by the provided esi_path, which looks like:
    my_esi/home/?my_experiment_variant=show_cats
    

    This means that the server needs to resolve incoming requests for /my_esi/home/ in addition to /home.

    This is where we use HomePageViewESI. We configure the server to resolve incoming requests for /my_esi/home/ with HomePageViewESI.

    HomePageViewESI understands how to parse experiment variants encoded into the path, does the heavy lifting, and returns a full, complex, HTML response.

    Varnish consumes this rich HTML content, inserts it in place of the <esi:include> tag that HomePageView initially returned, and stores it in the cache under the key:

    my_esi/home/?my_experiment_variant=show_cats
    

    This process guarantees that even cache hits serve the expected variant to a given user. The variant is encoded into the esi_path, guaranteeing a unique cache key for each version of the content to be served.
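The five steps above can be sketched end to end in Python. Treat this as a simplified model under stated assumptions rather than a real implementation: the two view functions are framework-free stand-ins for HomePageView and HomePageViewESI, edge_handle_home plays the role of Varnish, the "control" variant name is invented for the example, and the esi_path is written with a leading slash so it resolves from the site root.

```python
import re
from typing import Dict
from urllib.parse import parse_qs, urlencode, urlsplit

ESI_INCLUDE = re.compile(r'<esi:include\s+src="([^"]+)"\s*/>')
heavy_renders = 0                    # counts how often the heavy view runs
fragment_cache: Dict[str, str] = {}  # stands in for Varnish's cache


def home_page_view(guest_id: int) -> str:
    """Light outer view: runs on every request and only picks the variant."""
    variant = "show_cats" if guest_id % 2 == 1 else "control"
    esi_path = "/my_esi/home/?" + urlencode({"my_experiment_variant": variant})
    return f'<esi:include src="{esi_path}" />'


def home_page_view_esi(esi_path: str) -> str:
    """Heavy inner view: only runs when the fragment is not cached."""
    global heavy_renders
    heavy_renders += 1
    query = parse_qs(urlsplit(esi_path).query)
    variant = query["my_experiment_variant"][0]
    cats = "<img alt='cool cat'>" if variant == "show_cats" else ""
    return f"<html><body>{cats}<div>Homepage</div></body></html>"


def edge_handle_home(guest_id: int) -> str:
    """Toy edge: call the light view, then resolve the include it returns."""
    outer = home_page_view(guest_id)

    def resolve(match: "re.Match[str]") -> str:
        src = match.group(1)
        if src not in fragment_cache:  # miss: build and store under esi_path
            fragment_cache[src] = home_page_view_esi(src)
        return fragment_cache[src]     # hit: reuse the stored variant

    return ESI_INCLUDE.sub(resolve, outer)


responses = [edge_handle_home(guest_id) for guest_id in (1, 2, 3, 4)]
print(heavy_renders)  # 2: one heavy render per variant, not per request
```

Four requests produce only two heavy renders, and each guest still receives the page that matches their variant because the variant is part of the cache key.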

Gotchas

This approach allows for A/B testing of heavily trafficked pages while keeping them performant. Listed below are some “gotchas” to avoid!

Keep any logic done before returning the initial <esi:include> tag very light.

This logic runs for every request. To hold onto the benefits that our cache provides us, be sure not to bloat this with extraneous logic.

The URL path in the browser does not match the path of the request itself.

On a cache miss, the server now receives a URL prefixed with an ESI-specific identifier (my_esi in our example). This means it doesn’t match the URL shown in the browser.

For example, the browser’s URL may read:

www.myfunwebsite.com/path/to/specificpage

However, the URL path that the server is receiving is:

www.myfunwebsite.com/my_esi/path/to/specificpage

This can quickly cause downstream issues. Many error loggers and other forms of reporting rely on the request path server side, but that will no longer be an accurate representation of the request put forward by the user. Instead, it will be the constructed ESI URL. Additionally, if the frontend stack relies on the request path or query params, it will no longer be in sync with what is in the browser for these same reasons.

Solutions? There are many. The core of each comes down to two things:

  1. Communication
  2. Abstraction

Which seem pretty counter to each other, huh?

The communication is inward.

It is easy for issues to arise when implementing complex caching solutions, so it is necessary to use verbose logging on any page whose response is assembled via ESI. Doing so makes it much easier to track down bugs that could otherwise be incredibly cryptic.

Always be sure to include the full path, with query params, in the backend logs for pages served via ESI. The query params provide necessary information as to exactly what response we served to the client.
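As a rough sketch of what that logging could look like in Python, the helper below records the full ESI path together with the variant parsed from its query params. The function name, the logger name "esi", and the message format are all assumptions made for this example.

```python
import logging
from urllib.parse import parse_qs, urlsplit

logger = logging.getLogger("esi")  # hypothetical logger name


def log_esi_request(full_path: str) -> str:
    """Log the full ESI path, query params included, and return the message."""
    params = parse_qs(urlsplit(full_path).query)
    # Fall back to "unknown" when the variant param is missing from the path.
    variant = params.get("my_experiment_variant", ["unknown"])[0]
    message = f"esi_request full_path={full_path} variant={variant}"
    logger.info(message)
    return message


line = log_esi_request("/my_esi/home/?my_experiment_variant=show_cats")
print(line)
# esi_request full_path=/my_esi/home/?my_experiment_variant=show_cats variant=show_cats
```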

The abstraction is outward.

It should never become apparent to the user that the request path differs from what the browser shows, as that would negatively impact their trust in the application.

How do we solve for this? If possible, avoid passing the request path to your client and rely on window.location instead. However, if your application is tightly coupled to hydrating the request path and query params, another option is to abstract the request on the server in an ESI-aware way, so that the critical elements represent the original request and not the ESI path.
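One possible shape for that server-side abstraction, sketched in Python: a helper that turns the ESI path back into the path the browser actually requested. The "/my_esi" prefix follows the example in this post, while the function name and the set of ESI-only params are assumptions for illustration.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit

ESI_PREFIX = "/my_esi"                       # prefix used in this post's example
ESI_ONLY_PARAMS = {"my_experiment_variant"}  # params added only for the cache key


def original_request_path(esi_path: str) -> str:
    """Recover the browser-facing path and query from an ESI request path."""
    parts = urlsplit(esi_path)
    path = parts.path
    if path.startswith(ESI_PREFIX + "/"):
        path = path[len(ESI_PREFIX):]        # drop the ESI-specific prefix
    # Keep only the query params the user's original request carried.
    kept = [(k, v) for k, v in parse_qsl(parts.query) if k not in ESI_ONLY_PARAMS]
    query = urlencode(kept)
    return path + ("?" + query if query else "")


print(original_request_path(
    "/my_esi/path/to/specificpage?my_experiment_variant=show_cats&page=2"
))  # /path/to/specificpage?page=2
```

Anything handed to the frontend or to user-facing reporting could then use this recovered path, while backend logs keep the full ESI path.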

On a cache miss: Do not enroll a user when building the full view.

Often it is necessary to enroll users based on a specific set of conditions. Those conditions, however, must be met outside of the ESI layer. Attempting to enroll users from within the built view of an ESI layer causes your data to quickly become unreliable, as there is no guarantee that the server will be hit for anything encapsulated within that view.

The solution is to perform any user enrollment in the outermost layer, which runs on every request, before returning the <esi:include src={} /> response, and to encode the enrolled variant into the path provided to src. That is the only way to ensure that the data is correct.
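A minimal Python sketch of that ordering follows. The enrollments list is a hypothetical stand-in for whatever experiment-tracking store you use, and the function names and the "control" variant are invented for the example.

```python
from typing import List, Tuple
from urllib.parse import urlencode

# Hypothetical stand-in for an experiment-tracking store.
enrollments: List[Tuple[int, str, str]] = []


def enroll(guest_id: int, experiment: str) -> str:
    """Assign a variant and record the enrollment."""
    variant = "show_cats" if guest_id % 2 == 1 else "control"
    enrollments.append((guest_id, experiment, variant))
    return variant


def outer_view(guest_id: int) -> str:
    """Outermost layer: enroll first, then encode the variant into src."""
    variant = enroll(guest_id, "my_experiment")
    src = "/my_esi/home/?" + urlencode({"my_experiment_variant": variant})
    return f'<esi:include src="{src}" />'


tags = [outer_view(guest_id) for guest_id in (1, 2, 3)]
print(len(enrollments))  # 3: every request is enrolled, cached or not
print(tags[0])  # <esi:include src="/my_esi/home/?my_experiment_variant=show_cats" />
```

Because enrollment happens before the include tag is returned, it is recorded on every request regardless of whether the fragment behind the tag is later served from the cache.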

All in all, implementing an ESI layer to A/B test Varnish-cached pages can be difficult and cause confusion; however, it is often the only way to test critical flows in a given application.

Have you ever had issues A/B testing with a cache? Let us know below! You can also ping me on Twitter @VincentBudrovic.

Photo by Christopher Burns on Unsplash