We have not been able to create a synthetic setup that triggers the bug, but we have managed to automate detection, alerting and logging of it in our production environment.
If some of the images do not match, those images are marked in red and the client posts as much information as possible about the problem to a server-side script that logs to Splunk.
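As a rough illustration of the reporting step, here is a minimal Python sketch. It assumes the check is done by comparing a checksum of each received image against a digest known up front; the post does not say how the comparison is actually made, and the function names and JSON field names (`find_mismatches`, `build_report`, `useragent`, `mismatches`) are hypothetical.

```python
import hashlib
import json


def find_mismatches(expected, received):
    """Return one entry per image whose received bytes do not match.

    expected: dict of image URL -> expected SHA-256 hex digest (assumed known up front)
    received: dict of image URL -> raw bytes the client actually got
    """
    mismatches = []
    for url, digest in expected.items():
        body = received.get(url)
        # A missing image is reported with actual=None.
        actual = hashlib.sha256(body).hexdigest() if body is not None else None
        if actual != digest:
            mismatches.append({"url": url, "expected": digest, "actual": actual})
    return mismatches


def build_report(mismatches, useragent):
    """Serialize one log event; the field names are illustrative only."""
    return json.dumps({"useragent": useragent, "mismatches": mismatches}, sort_keys=True)
```

Logging both the expected and the actual digest makes it possible to see afterwards which other image was delivered in place of the right one.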
Using Splunk, we have tried to figure out which types of clients trigger the bug. This is what we have found so far:
- It seems to be a problem on all browsers that have pipelining enabled
- Opera Mini does funky stuff on images by design so it’s a false positive
- iOS 5 is overrepresented
- Opera on Android (and Symbian) has all kinds of issues
- Native Android browser has issues, but at a much lower rate than iOS and Opera
Here is a Splunk query that looks at the user agent of all browsers that triggered the bug in the last 24 hours:
sourcetype="imagebugs" NOT "Opera Mini"| rex field=useragent "(?<agent>Opera|Android|Symbian|iPhone|Windows Phone)" | top agent
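For readers without Splunk at hand, the query above can be approximated in a few lines of Python. This is only a sketch: it assumes the logged user-agent strings are already available as a list, it applies the "Opera Mini" exclusion to the user agent alone rather than to the whole event, and `top_agents` is a hypothetical helper name. Note that Python spells the named group `(?P<agent>...)`.

```python
import re
from collections import Counter

# Same alternation as in the rex expression, in Python's named-group syntax.
AGENT_RE = re.compile(r"(?P<agent>Opera|Android|Symbian|iPhone|Windows Phone)")


def top_agents(useragents):
    """Count the first matching agent keyword per user agent, most common first."""
    counts = Counter()
    for ua in useragents:
        if "Opera Mini" in ua:
            # Excluded as a known false positive, like NOT "Opera Mini" in the query.
            continue
        match = AGENT_RE.search(ua)
        if match:
            counts[match.group("agent")] += 1
    return counts.most_common()
```

Because the regex takes the leftmost keyword, an Opera user agent that also mentions Android is counted as Opera, which is how the rex extraction in the query behaves as well.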
At this point we set up a test environment to test all variants:
- Hardware: MacBook Air
- Operating system: OS X Lion
- Chromium latest daily snapshot with pipelining turned on using chrome://flags
- Firefox 9.0.1 with pipelining turned on using about:config
- Opera 10.60 (pipelining enabled by default)
- iOS Simulator 5.0 from the iPhone SDK
- Android Emulator from the Android SDK
- Network Link Conditioner (from Lion Xcode) to emulate different types of networks
- Wireshark listening on port 80
- http://touch.vg.no/index2.php – this page uses a single host for all images to maximise occurrence of the problem (the parameter ?time=hammer reloads the page until it fails)
Since the two other major newspapers in Norway have reported the same problem and they don’t use Varnish, we suspected that load balancing as such could be the triggering factor. To narrow down the problem, we put a Varnish instance directly on the internet with a public IP and hammered it with all the different browsers in our test environment.
The only browser in which we consistently managed to trigger the error was the iPhone simulator running iOS 5.0.1. It took anywhere from 15 to 1000 reloads to trigger, averaging around 170 reloads.
For anyone interested in diving into why this happens, here is an instance of the bug that was triggered fairly early on a wired network without any traffic shaping.
Screenshot when the bug triggers:
Screenshot of the pictures that failed:
- PCAP file – taken client side (all 37 attempts) using Wireshark in the test environment
The first is the correct picture (of the soccer guy cheering):
The next image is supposed to be a picture of a guy who bought lots of planes for the Norwegian Airline, but it is replaced with the image above:
In Wireshark, apply the filter tcp.stream eq 60 to see where things go wrong. In this case it looks like the client actually requests the image twice before the reply arrives, but that does not always seem to be the case.
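To see why a duplicated request could matter, recall that HTTP/1.1 pipelined responses carry no request identifier: the client pairs them with requests purely by their order on the connection. The Python sketch below models only that pairing rule. It is a hypothetical illustration of how an extra response could shift every following image by one slot, not a diagnosis of what the capture shows, and the URLs and payloads are made up.

```python
from collections import deque


def match_responses(requested_urls, response_bodies):
    """Pair pipelined responses with requests strictly in FIFO order."""
    pending = deque(requested_urls)
    matched = {}
    for body in response_bodies:
        if not pending:
            break  # more responses than the client believes it asked for
        matched[pending.popleft()] = body
    return matched


# The client believes it asked for two images, in this order.
requests = ["/soccer.jpg", "/planes.jpg"]

# Hypothetical wire traffic: the first request went out twice, so the
# server answers it twice before answering the second one.
responses = [b"soccer-bytes", b"soccer-bytes", b"planes-bytes"]

result = match_responses(requests, responses)
```

In this made-up scenario `result["/planes.jpg"]` holds the soccer image, which resembles the mix-up in the screenshots. Whether this is what actually happens in the browser is exactly what the capture needs to confirm.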