What Product Analysts Should Know About A/B Testing

A product analyst primer from planning to evaluation

Before becoming a product data analyst, most of my data roles supported marketing. I had experience with marketing A/B tests but quickly realized that not all A/B tests are equal, especially product ones. Through trial and error, I found best practices that I wish I had known from the beginning. Today I’ll share what I learned to give you a head start if you’re asked to work on product A/B tests.

The lifecycle of a product A/B test can be divided into three stages: planning, launch, and evaluation.

Planning

Experiment design and KPI(s) to measure success – In the planning stage, the product manager and I reviewed the experiment design and agreed on the KPI to measure success. A test can have multiple KPIs, but agree with the product manager on which one is the primary KPI and which are secondary for deciding on the winning variant. I supported mobile app subscriptions, which used the same KPIs for almost every test, but yours will vary depending on the group you support and the goal of the test.
Review when users are randomized – Users were usually randomized into groups at app login, but there were tests where this didn’t make sense. For example, an A/B test intended to increase the number of successful signups needs to randomize people at the point they click on the first screen of the signup flow instead of at login, because we’re testing whether the change helps users complete the signup process. Discuss with the product manager where users will be randomized to ensure they’re divided properly into each variant.
Confirm the target population for the test – New features or changes may not be tested across all users and platforms at the same time. In my company, we tested changes in one country or platform and applied the learnings from the test before releasing them to our remaining users. Check with the product manager on the intended test population because this may affect the baseline KPI used to calculate the sample size. For example, if a new feature is only going to be tested for English users on iOS, the conversion rate may be different from the rate for all users on iOS. This also affects the number of users expected to enter the test, because more users log into iOS overall than English users on iOS alone.
Calculate sample size and expected test duration – This step is important because you may find it’ll take months to reach the sample size needed to determine statistical significance. Often this is because too few users are expected to reach the point where they will be randomized. In those cases, you’ll need to go back to the product manager to decide on another KPI to measure success or change the experiment design.
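
To make the last step concrete, here is a minimal sketch of a sample size and duration estimate. It assumes the primary KPI is a conversion rate compared with a two-sided two-proportion z-test, which may not match your KPI or your company's tooling. The function names, the 5% baseline, the 10% relative lift, and the 2,000 daily users are all made up for illustration.

```python
import math
from statistics import NormalDist


def sample_size_per_variant(baseline_rate, relative_lift, alpha=0.05, power=0.80):
    """Approximate users needed per variant to detect a relative lift
    in a conversion rate (normal approximation, two-sided test)."""
    p1 = baseline_rate
    p2 = baseline_rate * (1 + relative_lift)
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # critical value for significance
    z_beta = NormalDist().inv_cdf(power)           # critical value for power
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    n = (z_alpha + z_beta) ** 2 * variance / (p2 - p1) ** 2
    return math.ceil(n)


def expected_duration_days(n_per_variant, num_variants, daily_randomized_users):
    """Days needed if this many users reach the randomization point each day."""
    total_needed = n_per_variant * num_variants
    return math.ceil(total_needed / daily_randomized_users)


# Hypothetical inputs: 5% baseline conversion for the target population,
# a 10% relative lift we want to detect, control plus one variant,
# and 2,000 users per day reaching the randomization point.
n = sample_size_per_variant(baseline_rate=0.05, relative_lift=0.10)
days = expected_duration_days(n, num_variants=2, daily_randomized_users=2000)
print(f"{n} users per variant, about {days} days")
```

With these made-up inputs the estimate is roughly 31,000 users per variant, which is about a month of traffic. Narrowing the population to a smaller segment lowers the daily user count and changes the baseline rate, which is why the target population has to be confirmed before running this calculation.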

Launch

Once the test was launched, we checked a couple of items to ensure nothing had gone wrong with the implementation. Our product managers did this, but you may have to do it as a product analyst in your company.

Confirm variant proportions match the allocated percentages – For example, if we expect 4 equal groups in the test including control, then each variant should contain 25% of the total users that entered the test. Imbalanced proportions were the most common issue we saw with tests. Sometimes this was caught early on, engineering fixed the issue, and we restarted the test. In other cases, we caught it at the end of the test and the results were suspect.
Check KPIs are not harmed – Not all tests are successful and sometimes variants perform worse than control. Sometimes the relative difference versus control was large enough to reach significance early and the product manager turned off the test to avoid doing more harm. Other times, we turned off only the variant that was performing worse than control because other variants were doing better. Product should run tests and learn from them, but we need to ensure business KPIs are not materially affected.
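
One way to check the proportions is a chi-square goodness-of-fit test on the observed user counts, often called a sample ratio mismatch check. This is a sketch of that technique, not the exact check we ran. The helper names and the counts are hypothetical, and the strict 0.001 threshold is a common convention I'm assuming here rather than a rule.

```python
import math


def chi_square_sf(x, df):
    """P(X >= x) for a chi-square variable with integer degrees of freedom,
    using the closed-form series (no third-party libraries needed)."""
    if x <= 0:
        return 1.0
    half = x / 2
    if df % 2 == 0:
        terms = (half ** j / math.factorial(j) for j in range(df // 2))
        return math.exp(-half) * sum(terms)
    terms = (half ** (j - 0.5) / math.gamma(j + 0.5)
             for j in range(1, (df - 1) // 2 + 1))
    return math.erfc(math.sqrt(half)) + math.exp(-half) * sum(terms)


def sample_ratio_mismatch(observed_counts, expected_shares, alpha=0.001):
    """Return (chi-square statistic, p-value, flagged) for the observed split."""
    total = sum(observed_counts)
    chi_sq = sum(
        (observed - total * share) ** 2 / (total * share)
        for observed, share in zip(observed_counts, expected_shares)
    )
    p_value = chi_square_sf(chi_sq, df=len(observed_counts) - 1)
    return chi_sq, p_value, p_value < alpha


# Hypothetical counts for control plus three variants, each allocated 25%.
counts = [25100, 24950, 25020, 24930]
chi_sq, p_value, flagged = sample_ratio_mismatch(counts, [0.25] * 4)
print(f"chi-square={chi_sq:.3f}, p-value={p_value:.3f}, mismatch={flagged}")
```

Small deviations from 25% like the ones above are expected from random assignment. A flagged result doesn't say what went wrong, only that the split is unlikely to be chance, so it's a prompt to talk to engineering.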

Evaluation

In some companies, you may not be involved in the planning or launch stages and you’re only asked to evaluate the A/B test at the end. If that’s the case, ask to be involved early because this helps ensure the product manager gets usable results. Otherwise, they run the risk of making the wrong decision based on bad test results or having no results because the test wasn’t set up properly.

When the test was ready to be evaluated, I ran through this checklist before I started calculating the results.

Has enough time passed to evaluate the test? – Even if a test reaches the sample size needed to determine significance, it may have to run longer depending on the KPI to measure success. For example, when we launched a new feature we wanted to test if it increased the percentage of new users that returned to the app 30 days after signup. This meant the test needed to run an additional 30 days to ensure new users in the control group didn’t get exposed to the new feature within the 30-day engagement window we wanted to measure. Sometimes the test had already ended by the time I was asked to evaluate it. This reduced the number of users I could use to calculate results, and the sample size wasn’t enough to determine significance.
Are variants proportioned correctly? – If a test has 4 variants including control and is expected to be evenly split, does each variant contain 25% of total users? In some tests, I didn’t see equal proportions but was told by engineering that the users were randomized within each variant. Use your best judgment given the circumstances. The reality is many things can go wrong with an A/B test and it’s the job of the analyst to evaluate if the results are still usable.
Did users get exposed to multiple variants and how many? – Each user should only be in one variant, but sometimes a user ends up in multiple ones, meaning they’re in control and also see the change we’re testing. Most of the time, a small percentage of users were in multiple variants and that was expected. A/B tests where a large percentage of users fall into multiple variants may call into question the test implementation and the validity of the test results. If this happens in your A/B test, discuss it with your manager or colleagues because they may have dealt with a similar situation in the past.
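
Two of these checks are easy to script. The sketch below assumes you can pull an exposure log of (user_id, variant) pairs and a signup date per user. The data shapes, function names, and sample values are hypothetical and will differ from your own tables.

```python
from collections import defaultdict
from datetime import date, timedelta


def multi_variant_share(exposures):
    """Return the users seen in more than one variant and their share
    of all users in the test. `exposures` is an iterable of
    (user_id, variant) pairs; repeated rows for the same pair are fine."""
    variants_by_user = defaultdict(set)
    for user_id, variant in exposures:
        variants_by_user[user_id].add(variant)
    contaminated = {u for u, seen in variants_by_user.items() if len(seen) > 1}
    share = len(contaminated) / len(variants_by_user) if variants_by_user else 0.0
    return contaminated, share


def matured_users(signup_dates, evaluation_date, window_days=30):
    """Users whose full engagement window has elapsed by the evaluation date.
    `signup_dates` maps user_id to a signup date."""
    cutoff = evaluation_date - timedelta(days=window_days)
    return {u for u, signed_up in signup_dates.items() if signed_up <= cutoff}


# Made-up exposure log: u1 appears in both control and variant_a.
exposures = [
    ("u1", "control"), ("u1", "variant_a"),
    ("u2", "control"),
    ("u3", "variant_a"), ("u3", "variant_a"),
    ("u4", "variant_b"),
]
contaminated, share = multi_variant_share(exposures)
print(f"{len(contaminated)} user(s) in multiple variants ({share:.0%})")

# Made-up signup dates: only u1 has a complete 30-day window on Feb 5.
signups = {"u1": date(2024, 1, 1), "u2": date(2024, 1, 20)}
print(matured_users(signups, evaluation_date=date(2024, 2, 5)))
```

What counts as a "small" share of multi-variant users is a judgment call for your team. The point of the script is only to put a number on it before you start calculating results.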

Even when A/B test results don’t show a clear winner, there are business considerations you should take into account when reviewing results and making recommendations.

User experience – Testing is important, but sometimes management may decide to roll out a new feature before the sample size needed for significance is reached because they believe it will improve user engagement. In those cases, we used the test results as directional guidance and couldn’t calculate the true improvement versus control.
Platform parity – Sometimes the same test had positive results versus control on iOS but negative results on Android. Occasionally, we rolled out the changes anyway across both platforms to maintain parity because we felt it would benefit our users over time even if the test results didn’t show that now.
Seasonality – We couldn’t always wait for an optimal time of year to run A/B tests to avoid seasonality because then we would be limited in the number of tests we could run. Results might differ if we ran the test in other months of the year, but we rolled out changes where we were confident the positive direction of the results would hold.
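
When results are only directional, a confidence interval on the difference versus control shows how uncertain the improvement is. Below is a minimal sketch using a normal-approximation interval for the difference between two conversion rates. It assumes the KPI is a conversion rate, and the function name and counts are invented for illustration.

```python
import math
from statistics import NormalDist


def lift_confidence_interval(control_conversions, control_users,
                             variant_conversions, variant_users,
                             confidence=0.95):
    """Return (difference, lower, upper) for variant rate minus control rate,
    using a normal-approximation (Wald) interval."""
    p_control = control_conversions / control_users
    p_variant = variant_conversions / variant_users
    diff = p_variant - p_control
    std_error = math.sqrt(
        p_control * (1 - p_control) / control_users
        + p_variant * (1 - p_variant) / variant_users
    )
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    return diff, diff - z * std_error, diff + z * std_error


# Hypothetical test stopped before reaching the planned sample size.
diff, lower, upper = lift_confidence_interval(500, 10_000, 540, 10_000)
print(f"difference={diff:+.4f}, 95% CI=({lower:+.4f}, {upper:+.4f})")
```

If the interval includes zero, as it does for these made-up counts, the observed lift is positive but the test can't rule out no improvement. Running the same calculation separately per platform can also show whether an iOS versus Android disagreement is larger than the noise.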

Whether you’re new to product or thinking about becoming a product analyst, I hope this gives you a head start on your A/B testing experience. Thanks for reading!