Abstract
We have previously shown that observers can recognize high-level material categories (e.g., paper, fabric, plastic) in complex, real-world images even at 40-millisecond exposures (Sharan et al., VSS 2009). This rapid perception of materials is distinct from object or texture recognition and is fairly robust to low-level image degradations such as blurring or contrast inversion. We now turn to computational models and ask whether machines can mimic this human performance. Recent work has shown that simple image features based on luminance statistics (Sharan et al., 2008) or on 5x5 pixel patches (Varma and Zisserman, 2009) are sufficient for some texture and material recognition tasks. We tested state-of-the-art models based on these features on the stimuli that our observers viewed, and the performance was poor (categorization rate: Varma-Zisserman = 20%, observers = 90%, chance = 11%). Our stimuli, a diverse collection of photographs derived from Flickr.com, are undoubtedly more challenging than state-of-the-art benchmarks (Dana et al., 1999). We have developed a model that combines low- and mid-level image features, based on color, texture, micro-geometry, outline shape, and reflectance properties, in a Bayesian framework. This model achieves a significant improvement over the state of the art on our stimuli (categorization rate: 41%), though it lags human performance by a large margin. Individual features such as color (28%), texture (37%), and outline shape (28%) are also useful. Interestingly, when we ask human observers to categorize materials based on these features alone (e.g., by converting our stimuli to line drawings that convey shape information, or by scrambling them to emphasize textures), observer performance is similar to that of the model (20-35%). Taken together, our findings suggest that isolated cues (e.g., color or texture), or simple image features based on these cues, are not sufficient for real-world material recognition.
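The abstract does not give implementation details for the Bayesian combination of cues, so the following is a minimal illustrative sketch, assuming a naive-Bayes (cue-independence) combination of per-cue likelihoods over material categories. The categorize function, the cue names, and all numeric values are hypothetical placeholders, not the authors' model or data.

```python
import numpy as np

# Illustrative sketch (not the authors' implementation): each cue
# (color, texture, outline shape, ...) contributes a log-likelihood
# over material categories; under a naive-Bayes independence
# assumption, the log-likelihoods are summed and normalized into a
# posterior, and the image is assigned to the highest-posterior
# category. Categories and values below are toy placeholders.

CATEGORIES = ["paper", "fabric", "plastic"]

def categorize(cue_log_likelihoods, log_prior=None):
    """Combine per-cue log-likelihoods (each an array of length
    n_categories) into a posterior over material categories."""
    n = len(CATEGORIES)
    log_post = np.zeros(n) if log_prior is None else np.asarray(log_prior, float)
    for ll in cue_log_likelihoods.values():
        log_post = log_post + np.asarray(ll, float)  # independence assumption
    log_post -= log_post.max()                       # stabilize before exp
    post = np.exp(log_post)
    post /= post.sum()
    return dict(zip(CATEGORIES, post))

# Toy usage: each cue weakly favors a different category.
cues = {
    "color":   np.log([0.2, 0.5, 0.3]),
    "texture": np.log([0.1, 0.6, 0.3]),
    "shape":   np.log([0.3, 0.4, 0.3]),
}
posterior = categorize(cues)
print(max(posterior, key=posterior.get), posterior)
```

In this reading, weak individual cues (each only 20-35% accurate on their own, per the abstract) can still combine into a stronger joint decision, which is consistent with the reported gain of the combined model (41%) over any single feature.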
Disney, Microsoft, NTT Japan, NSF.