A/B testing journey

When I joined Lonely Planet in March of 2014, it had been a long time since I was responsible for any public-facing web properties of a company. I had spent a good 15 years building OS or web applications that were used by a limited audience with no expectation of blazing-fast rendering.

I'd also never done A/B testing before, so I was completely ignorant of the options available. Now that I've implemented an A/B testing framework for Lonely Planet, I've discovered that there aren't a lot of mature options, which surprised me.

Hardware driven testing

In the first two weeks I was here, it was suggested that I investigate some options for doing A/B testing on our web site that didn't require the use of expensive, time-consuming AWS instances. Up to that point, all tests required our ops team to stand up two clusters, each hosting a separate branch of the repository, and a custom load balancer was used to drive users to one cluster or the other.

I'm not sure how results were reported, but since Omniture is a tool that we've used, I'm assuming that someone pulled traffic from Omniture and graphed the results for analysis. Not important, really. What is important is to streamline this process to make it easier and faster to get tests running and to make decisions from the metrics.

Server driven testing

The next step was to see if we could implement a solution in our Rails app that would serve up n alternatives to users, track views and conversions, and give us easy access to a report on the progress of each experiment. This would eliminate the cost of added AWS instances, and the time needed by ops to launch and maintain the stacks and the routing.
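To make "track views and conversions" concrete, here is a minimal sketch of the bookkeeping behind such a report. It is written in JavaScript purely for illustration and is not how Vanity or Split store their data; the `ExperimentTally` class and its method names are hypothetical, and a real setup would persist the counts rather than keep them in memory.

```javascript
// Hypothetical in-memory tally of views and conversions per alternative.
// A real framework would persist these counts in a datastore.
class ExperimentTally {
  constructor(alternatives) {
    this.counts = new Map(
      alternatives.map((name) => [name, { views: 0, conversions: 0 }])
    );
  }

  // Look up the counters for an alternative, rejecting unknown names.
  entry(name) {
    const found = this.counts.get(name);
    if (!found) throw new Error(`unknown alternative: ${name}`);
    return found;
  }

  recordView(name) {
    this.entry(name).views += 1;
  }

  recordConversion(name) {
    this.entry(name).conversions += 1;
  }

  // One row per alternative; the rate is conversions divided by views.
  report() {
    return [...this.counts].map(([name, { views, conversions }]) => ({
      name,
      views,
      conversions,
      rate: views === 0 ? 0 : conversions / views,
    }));
  }
}

// Usage: two alternatives of a hypothetical button experiment.
const buttonTally = new ExperimentTally(["red-button", "blue-button"]);
buttonTally.recordView("red-button");
buttonTally.recordView("blue-button");
buttonTally.recordConversion("blue-button");
console.log(buttonTally.report());
```

The report is the part a product owner cares about: each alternative with its views, conversions, and conversion rate side by side.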

My initial evaluation covered two options that were floating around Lonely Planet: Split and Vanity.

I ended up using Vanity because it offered the quickest setup and was more portable across all our different repositories. I set up the first experiment in a matter of about 20 minutes and was able to give my product owner his very first A/B experiment that he could track at will through the Vanity Dashboard.

Advantages

Cheaper because it requires no additional hardware.
Faster to put up, and tear down.
Allows us to run both element-level A/B tests (red button or blue button) and full-page A/B tests (page design A or page design B).
Can integrate with our internal user-tracking mechanism.

Disadvantages

Quickly, though, we ran into some troubling issues:

Since Lonely Planet's web site is driven by several repositories instead of a monolithic one, it requires us to implement Vanity in every repository, set up environment variables in all the different environments, and have separate reporting dashboards for each application. Not optimal.
We use Fastly to cache our highest-traffic properties, which has the unintended side effect of delivering cached versions of an experiment's alternative to customers, even when a new alternative should have been generated.
Vanity does not allow us to weight alternatives without some ugly hacks.

With those issues in mind, we're currently investigating the use of a client-based split testing framework so that the use of the Fastly cache will be irrelevant, and hopefully help with the other issues.

Client driven testing

Our first test implementation is with Abba. At first blush, it fills in all the blanks that we have with Vanity. It runs as a separate process, so any of our applications can integrate with it without the need for installing gems, setting environment variables, and having multiple reporting dashboards. Also, since all alternatives are defined client-side, it doesn't matter whether the page is served from the Fastly cache or not. An added benefit is that all experiments, and the logic for defining and completing them, are now in JavaScript, which any front-end developer can write.
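Here is a rough sketch of why client-side assignment sidesteps the cache: every visitor receives the same cached HTML, and the browser picks the alternative afterwards. This is not Abba's API. The `fnv1a` and `chooseAlternative` functions, the experiment name, and the visitor id are all hypothetical, and the example also shows one way that weighted alternatives could work.

```javascript
// Hypothetical sketch; none of these names come from Abba or Vanity.

// FNV-1a style 32-bit hash over UTF-16 code units: small and deterministic.
function fnv1a(str) {
  let hash = 0x811c9dc5;
  for (let i = 0; i < str.length; i++) {
    hash ^= str.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193);
  }
  return hash >>> 0; // force an unsigned 32-bit result
}

// Pick an alternative for a visitor. The same experiment name and visitor id
// always map to the same alternative, and the weights set the traffic split.
function chooseAlternative(experiment, visitorId, alternatives) {
  const total = alternatives.reduce((sum, alt) => sum + alt.weight, 0);
  if (total <= 0) throw new Error("weights must sum to a positive number");

  // Scale the hash onto the range [0, total).
  const point = (fnv1a(experiment + ":" + visitorId) / 0x100000000) * total;

  let upper = 0;
  for (const alt of alternatives) {
    upper += alt.weight;
    if (point < upper) return alt.name;
  }
  // Guard against floating-point rounding at the top of the range.
  return alternatives[alternatives.length - 1].name;
}

// Usage: send roughly 90% of visitors to the existing design.
const ctaAlternatives = [
  { name: "red-button", weight: 90 },
  { name: "blue-button", weight: 10 },
];
const ctaChoice = chooseAlternative("cta-colour", "visitor-123", ctaAlternatives);
console.log(ctaChoice);
```

Because the choice depends only on the experiment name and the visitor id, a returning visitor keeps seeing the same alternative without the server rendering anything different per user.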

We just completed a test implementation, developed by Hailey Mahan, that will be rigorously tested before we attempt to roll it out into production, but I'm optimistic about its possibilities.

Advantages

Since Abba runs as a separate service, all apps can use one solution, which consolidates how reports are generated.
This also ensures that there are not unexpected differences in implementation across our different applications.

Disadvantages

Users who block JavaScript won't see any experiments, but this is only a minor issue because, if they do that, no Lonely Planet pages will render anyway.
For large content variations in alternatives, it might present performance issues. We'll need to test to get some real numbers. However, the majority of our tests are small changes in styles, copy or element rendering. I don't anticipate problems.
Does not integrate with our internal user-tracking system.

Results so far

With just a handful of experiments in our pocket, the majority of them have yielded little value, as expected. However, a couple showed immediate positive results from customers: they make the customer experience better, make our partners happier, and make us a tad more money.

Steve Brownlee