Visual Search applied – the “Fashion-Cam”

Artificial intelligence features are slowly becoming industry standards in modern products and services in every niche. Now, they are being integrated into the willhaben Fashion-Cam. The visual search project, originating from the Styria Innovation Incubator, provides a cutting-edge system that adds machine learning methods to existing search agent features. This lets users search for classified ads with a photo of a product that interests them, either one they took themselves or one from an existing ad.

Machine (Deep) Learning

The scientific techniques behind these features are so-called “machine learning” or, more specifically, “deep learning” methods. Software built this way is not programmed explicitly, one instruction at a time; instead, it is written so that it can learn from data. This allows “softer” and more thorough descriptions of problems and solutions, since no manual programming can cover all possible variants and tweaks in data.

Modern systems, with their vast amounts of computing power (GPUs), have enabled “deep” learning, so named for the many consecutive layers of simple calculation units arranged in a cascade. These “neurons” and the weights between them resemble simplified structures of biological nervous systems, which is why we call them “neural” networks. They are capable of describing very complex functions and problems. For the computer vision system, we use specific neuron types in an arrangement called a Convolutional Neural Network. Similar structures can be found in the visual cortices of animals and humans.

The input layers of the network consist of neurons that specialize in detecting lines, edges, dashes, colors, or pattern transitions. Subsequent layers combine the outputs of the lower layers and learn to recognize more complex features and patterns. The deeper we go into the network, the more complex, and even abstract, the learned features become.
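To make the idea of an edge-detecting neuron concrete, here is a minimal NumPy sketch of a single convolution filter applied to a toy image. The 6x6 image and the hand-written Sobel kernel are illustrative assumptions; in a trained network the filter weights are learned from data, and this is not the production model.

```python
import numpy as np

def conv2d(image, kernel):
    """Valid-mode 2D cross-correlation, the operation CNN layers apply."""
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            # Weighted sum of the image patch under the kernel.
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

# Toy 6x6 image (assumption): dark left half, bright right half,
# which gives exactly one vertical edge.
image = np.zeros((6, 6))
image[:, 3:] = 1.0

# Hand-written vertical-edge filter (Sobel). A trained network would
# learn weights like these instead of having them written by hand.
sobel_x = np.array([[-1, 0, 1],
                    [-2, 0, 2],
                    [-1, 0, 1]], dtype=float)

# ReLU keeps only positive responses, as in a typical CNN layer.
response = np.maximum(conv2d(image, sobel_x), 0.0)
print(response)
```

The response is non-zero only in the columns where the filter window straddles the dark-to-bright transition. A deeper layer would take many such response maps as its input and combine them into detectors for more complex patterns.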


Figure 1: Typical architecture of a Convolutional Neural Network. Source: Aphex34 (Own work) CC BY-SA 4.0, from Wikimedia Commons

Visual Searching

The goal was to enable our users to search willhaben visually; that is, to use images as the search input and to retrieve visually similar results. We had to find a way to represent the visual features of each image that is both rich enough (to hold information about color, texture, shape, and semantics) and fast enough to compare at very high speed. A convolutional neural network, in its original form, is essentially a classifier: its goal is to recognize what is in the image. The Data Science team at Styria Digital Media Group trained such classifiers specifically for our needs, and then dissected the network to use only the parts that specialize in detecting and recognizing different features and concepts. With a few mathematical tricks, it was possible to pack the information into smaller vectors, which then served as image descriptors ready for comparison.
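The general pattern of descriptor-based search can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions: the random `features` array stands in for activations taken from an inner layer of a trained network, PCA via SVD stands in for the unspecified "mathematical tricks", and cosine similarity is an assumed comparison measure. The actual method is patent-pending and is not described in this article.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for activations from an inner layer of a trained CNN
# (hypothetical sizes: 1000 ad images, 512 values each).
features = rng.normal(size=(1000, 512))

# Assumed reduction step: PCA via SVD, keeping 64 components.
mean = features.mean(axis=0)
_, _, vt = np.linalg.svd(features - mean, full_matrices=False)
projection = vt[:64].T  # shape (512, 64)

def describe(x):
    """Project raw features to 64 dimensions and L2-normalize them."""
    d = (x - mean) @ projection
    return d / np.linalg.norm(d, axis=-1, keepdims=True)

# Descriptors for all indexed images, shape (1000, 64).
index = describe(features)

def search(query_features, k=5):
    """Return indices of the k most similar images by cosine similarity."""
    # Descriptors are unit length, so a dot product is the cosine similarity.
    scores = index @ describe(query_features)
    return np.argsort(-scores)[:k]

# Querying with the features of an indexed image returns that image first.
top = search(features[42])
print(top)
```

Because every descriptor is short and normalized, the whole comparison reduces to a single matrix-vector product, which is what makes this kind of lookup fast.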

Standard approaches only give semantic matches (e.g., shirt to shirt, trousers to trousers, etc.). Our solution also allows us to tweak the visual matches (e.g., short sleeve blue shirt with stripes to short sleeve blue shirt with stripes).
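The difference between the two kinds of match can be illustrated with a toy ranking. Everything here is hypothetical: the four-item catalogue, the three-value descriptors, the `match` function, and the `visual_weight` blend are invented for illustration and do not describe the actual matching method.

```python
import numpy as np

# Hypothetical catalogue: a category label and a tiny visual descriptor per ad.
labels = np.array(["shirt", "shirt", "trousers", "shirt"])
descriptors = np.array([
    [0.9, 0.1, 0.0],   # 0: blue striped shirt
    [0.1, 0.9, 0.0],   # 1: plain red shirt
    [0.8, 0.2, 0.1],   # 2: blue striped trousers
    [0.8, 0.1, 0.1],   # 3: another blue striped shirt
])
descriptors /= np.linalg.norm(descriptors, axis=1, keepdims=True)

def match(query_label, query_descriptor, visual_weight):
    """Rank ads by a blend of category agreement and visual similarity."""
    semantic = (labels == query_label).astype(float)
    visual = descriptors @ (query_descriptor / np.linalg.norm(query_descriptor))
    score = (1 - visual_weight) * semantic + visual_weight * visual
    # Stable sort keeps catalogue order among equally scored ads.
    return np.argsort(-score, kind="stable")

# Query: a photo of a blue striped shirt.
query = np.array([0.85, 0.1, 0.05])
semantic_only = match("shirt", query, visual_weight=0.0)
blended = match("shirt", query, visual_weight=0.5)
print(semantic_only, blended)
```

With `visual_weight=0.0`, all shirts tie and the red shirt ranks as high as the striped ones. With a blend, the two blue striped shirts move to the top while the category still keeps the trousers last.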


Figure 2: Semantically matched results (1st line), visually matched results (2nd line)

The image preprocessing is based on the Python Imaging Library and Scikit, but the image recognition capability had to be specifically developed, because OpenCV does not provide descriptor-based content comparisons. We used TensorFlow for model training and GPU deployment, and applied NumPy for data structures. The method behind the system is patent-pending. Since we believe in open science and knowledge exchange, we plan to publish scientific papers on the topic as well.

Real-time Image Searching

Since November 14, 2016, after 1.5 years from idea to production, the willhaben Fashion-Cam has been providing an image search function to its users. Training with 5 million images took two weeks on four TITAN X GPUs, resulting in a model of 3.2 million neurons with 6.1 million weights. The models are served by a cloud infrastructure that achieves a signal propagation time (forward pass) of 0.01 s per image and is ready for further cluster scaling. This allows image search results to be returned to the user within 200 ms, which is comparable to text-based search. New ad images become searchable within 1.5 minutes, so willhaben’s buyers and sellers can quickly find their matches. The Fashion-Cam is available on mobile devices for iOS and Android – give it a try!

Originally published on our willhaben Tech Blog by Marko Velic (Styria Data Science) and Reinhard Hofmann (willhaben)
