http://stevehanov.ca/blog/index.php?id=132
That is a very convincing article on why AB testing sucks. It argues that with a few extra lines you can have your algorithm select the best option on its own, so you never have to go back and update your code (yeah right.)
I then thought, how many tests does this thing have to run to truly figure out which is best?
I set up 4 variants with success probabilities of 1/2, 1/4, 1/5 and 1/6 and found that for the epsilon-greedy algorithm to settle on the best success rate (1/2), it took 91 tests on average, with a max of 876.
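For reference, here is a minimal sketch of the epsilon-greedy selection rule, not the exact code behind the numbers above. The function name `epsilon_greedy_run`, the 10% exploration rate and the fixed trial count are all assumptions for illustration. The rule used to decide that a run had "settled" is not spelled out here, so the sketch simply runs a fixed number of trials and returns the tallies.

```python
import random


def epsilon_greedy_run(success_probs, n_trials, epsilon=0.1, rng=None):
    """Simulate n_trials of epsilon-greedy over variants with the given success rates.

    epsilon=0.1 is an assumed exploration rate, not a value taken from the post.
    Returns (shows, wins): how often each variant was shown and how often it succeeded.
    """
    if rng is None:
        rng = random.Random()
    n_arms = len(success_probs)
    shows = [0] * n_arms
    wins = [0] * n_arms

    def observed_rate(i):
        # Untried variants score infinity, so each one is sampled at least once.
        return wins[i] / shows[i] if shows[i] else float("inf")

    for _ in range(n_trials):
        if rng.random() < epsilon:
            arm = rng.randrange(n_arms)                 # explore: pick at random
        else:
            arm = max(range(n_arms), key=observed_rate)  # exploit: best so far
        shows[arm] += 1
        if rng.random() < success_probs[arm]:            # simulated click
            wins[arm] += 1
    return shows, wins


shows, wins = epsilon_greedy_run([1/2, 1/4, 1/5, 1/6], n_trials=1000)
print("shown:", shows, "succeeded:", wins)
```

Over a long run most of the traffic should end up on whichever variant currently looks best, which is the whole appeal of the method.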
I ran the same experiment using a standard AB algorithm: always pick whichever variant has been tested the least and run that one. It took 32 tests on average to figure out which performed the best, with a maximum of 363. On average that is roughly 3 times faster than the epsilon-greedy method.
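The "least tested" rule can be sketched in a few lines. This is an illustration under assumptions rather than the code that produced the figures above: `least_tested_run` is a made-up name, ties go to the lowest index, and the winner is simply the variant with the highest observed rate after a fixed number of trials.

```python
import random


def least_tested_run(success_probs, n_trials, rng=None):
    """Simulate n_trials where the least-shown variant is always shown next.

    Returns (shows, wins, best), with best being the index of the variant
    that has the highest observed success rate at the end.
    """
    if rng is None:
        rng = random.Random()
    n_arms = len(success_probs)
    shows = [0] * n_arms
    wins = [0] * n_arms
    for _ in range(n_trials):
        # min() returns the first minimum, so ties go to the lowest index.
        arm = min(range(n_arms), key=lambda i: shows[i])
        shows[arm] += 1
        if rng.random() < success_probs[arm]:  # simulated click
            wins[arm] += 1
    best = max(range(n_arms),
               key=lambda i: wins[i] / shows[i] if shows[i] else 0.0)
    return shows, wins, best


shows, wins, best = least_tested_run([1/2, 1/4, 1/5, 1/6], n_trials=40)
print("shown:", shows, "succeeded:", wins, "best index:", best)
```

Every variant gets the same share of traffic here, good or bad, which is the cost the epsilon-greedy approach is trying to avoid.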
I then tried tweaking my success ratios to something a little less dramatic: 1/10, 1/11, 1/12, 1/13. That just made everything take a LOT longer.
The only problem is that in reality you don't know what the best solution is, so you can never know if you have gotten to the "actual" solution. The epsilon-greedy method will eventually get there (although you will never know when). And if you are using the standard AB method you will never know if you have arrived at the best option either, especially when we are talking about the difference between a click rate of 1/20 and one of 1/21.
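To put a rough number on how hard 1/20 versus 1/21 is to tell apart, here is a sketch using the textbook normal-approximation sample size formula for comparing two proportions. The 5% significance level, 80% power and the function name `trials_per_variant` are assumptions picked for illustration, not anything from the tests above.

```python
from math import ceil
from statistics import NormalDist


def trials_per_variant(p1, p2, alpha=0.05, power=0.8):
    """Approximate trials needed per variant to tell rates p1 and p2 apart.

    Uses the standard normal approximation for a two-sided test of two
    proportions; alpha and power are assumed conventional defaults.
    """
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # critical value for the test
    z_beta = NormalDist().inv_cdf(power)           # quantile for the desired power
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return ceil((z_alpha + z_beta) ** 2 * variance / (p1 - p2) ** 2)


print(trials_per_variant(1/20, 1/21))
print(trials_per_variant(1/2, 1/4))
```

Under those assumptions the 1/20 versus 1/21 case comes out at more than 100,000 trials per variant, while 1/2 versus 1/4 needs well under a hundred.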
Moral of the story -- AB testing is probably a waste of time.
Here is a link to all the tests I ran (python3): https://github.com/crobertsbmw/EpsilonGreedy