Universität Augsburg
Multimodal Ranking for Image Search on Community Databases
Fabian Richter, Stefan Romberg, Eva Hörster, Rainer Lienhart
Report, March 2011
Institut für Informatik, Augsburg

Copyright © Fabian Richter, Stefan Romberg, Eva Hörster, Rainer Lienhart
Institut für Informatik, Universität Augsburg, Augsburg, Germany. All rights reserved.

Multimodal Ranking for Image Search on Community Databases
Fabian Richter, Stefan Romberg, Eva Hörster, Rainer Lienhart
Multimedia Computing Lab, University of Augsburg, Augsburg, Germany
{richter, romberg, hoerster,

ABSTRACT
Searching for relevant images given a query term is an important task for today's large-scale community databases. The image ranking approach presented in this work represents an image collection as a graph that is built using a multimodal similarity measure based on visual features and user tags. We perform a random walk on this graph to find the most common images. Further, we discuss several scalability issues of the proposed approach and show how queries can be answered quickly within this framework. Experimental results validate the effectiveness of the presented algorithm.

Categories and Subject Descriptors
H.3.3 [Information Storage and Retrieval]: Information Search and Retrieval; I.4.10 [Image Processing and Computer Vision]: Image Representation—multidimensional

General Terms
Algorithms

Keywords
Image Ranking, Image Retrieval, PageRank, Graph

1. INTRODUCTION
With the emergence and spread of digital cameras in everyday use, the number of images in personal and online collections grows daily. For example, the Flickr™ photo repository now consists of more than four billion images. Such huge image databases require efficient techniques for navigating, labeling, and searching. In this work we focus on the goal of selecting relevant images given a query term, i.e. finding images showing content that most people associate with the query term. More specifically, we aim to solve this image search problem on a large-scale community database such as Flickr, where images are often associated with different types of user-generated metadata, e.g. tags, date & time, and location.

Our proposed image ranking approach has been inspired by [8], where the PageRank method [18] has been adapted to the visual domain. The PageRank approach is a method to rank webpages according to their importance. It builds a graph representing the link structure of the web. The importance of a webpage is assumed to be proportional to the number of hyperlinks pointing towards this page, i.e. the number of pages linking to it. Transferring this approach to our image search task, we assume that the relevance or importance of an image is proportional to the number of images showing similar content.
As we consider community databases, i.e. databases with images from many different authors/photographers, this assumption is justified by the following: if an image has many close neighbors, all showing the same content and being associated with similar metadata, then the respective images' authors agree that this is an important shot of the respective content.

The main difficulty in such an approach is to reasonably define the similarity between two images, i.e. to determine whether two images show the same content. The authors in [8] calculate the images' distance based on the number of matching local features between two images. This approach works well for landmarks or product images, as in such cases typically many images exist showing the exact same object. However, when searching for object categories or scenes we cannot expect to reliably match the local image descriptors. Thus we use a more sophisticated image description based on automatic content analysis. Moreover, we do not rely solely on the automatically extracted visual content description for the similarity definition, but also exploit an image description based on the available metadata. More specifically, we also use a representation based on the authors' tags.

However, establishing links between all pairs in a huge image collection does not scale well, as this results in a complete graph, and computing similarities between images is costly. Thus we also consider scalability issues when designing the link structure of our graph. The original PageRank method introduces only a limited number of links in the graph due to the hyperlink structure of the web. In contrast to the hyperlink structure in the web context, there exists no similar link structure between images which we could exploit in our image graph. Therefore we rely on a nearest neighbor approach and compute similarities only between images in small subsets.

Finally, computing a separate relevance score for each image and each query term is computationally inefficient. Thus, based on our scalable nearest neighbor approach, we show how to compute the relevance of an image in a query-independent fashion. We evaluate our proposed image retrieval method extensively on a real-world large-scale database in user studies.

1.1 Related Work
There exist many works addressing the task of searching and ranking photos in (community) databases. For instance, there are approaches that aim to find representative images of landmarks [25, 10, 2, 12]. Another work aims to find iconic object images [1] using cluster centroids and identifying images with clear foreground objects.

There are image search approaches that rely on the user tags associated with the images. The main challenge here is posed by the ambiguity and subjectivity of user annotations in such community databases, which makes the direct application of text search approaches difficult. Therefore many approaches analyze and exploit the visual image content to improve the noisy labeling. Li et al. [11] and Liu et al. [15] both propose methods that use a visual content description to learn a tag's relevance to an image. In [11] the authors determine a tag's relevance by nearest neighbor voting. The latter work builds a graph using the tags associated with an image and performs a random walk to determine tag relevance. The re-ranked tag lists are then used to retrieve images. We will use this approach as the baseline for the user studies conducted in Section 4.
On the other hand, there are approaches that directly analyze images and rely only on a visual content description, e.g. [4, 8], where the former work uses a classifier. However, a classifier needs to be trained on (carefully) labeled data, which is not available in most scenarios.

Recently there have been some works exploiting multiple modalities for image search applications. Raguram and Lazebnik [21] perform joint clustering in a space built from visual and tag descriptors to find iconic summaries of abstract concepts. Wang and Forsyth [24] retrieve object images from webpages by analyzing the surrounding text and the image itself. In [23], Schroff et al. use the surrounding text of web images for re-ranking purposes before training an SVM classifier based on visual features.

Our multimodal ranking approach has been inspired by the graph-based approach presented by Jing and Baluja [8]. Here the authors construct an image graph where vertices represent images and edge weights are proportional to the visual similarity between two images. An importance score is computed for each image and query term by performing a random walk on this graph. However, in contrast to their product image search scenario, our goal is to perform retrieval of object categories and scenes from community databases with very diverse images depicting objects in their natural context. Also, a query-dependent graph is used in [8] to compute the importance score. In this work we propose to compute a global ranking independent of any predefined query. We use multiple modalities to build our image graph, more precisely visual features and user tags. We show that our multimodal ranking method improves performance over the unimodal case.

The work by Winston et al. [7] is similar to our method. However, they employ a context graph in a different domain, as they rank videos based on multimodal story-level similarities. Text transcriptions and visual similarity based on near-duplicate detection are used to build a graph which is in turn refined by a random walk.

We also address scalability issues in our approach. In order to let our approach scale with the steadily growing size of image repositories, we exploit a nearest neighbor approach for graph construction. This idea has been motivated by [5], where the authors propose a framework for structural analysis of image databases using spectral clustering and spectral dimensionality reduction. The experimental results presented in [7] as well as the discussion in [19] support the soundness of nearest neighbor approaches in a random walk context.

1.2 Contribution
The main contributions of this paper are:

1. We present a query-by-term image ranking approach that relies on a graph where the images are linked by their similarities. We show that using similarities based on both user annotations and visual content improves results.

2. Based on this image ranking approach we propose a self-contained image retrieval method which is designed to scale well with the increasing size of image repositories. More specifically, we show how to appropriately modify the link structure of the graph and how to efficiently compute the image similarities needed to build the graph.

3. In our experiments on a large-scale real-world database we demonstrate the effectiveness of our approach. Our system yields highly satisfactory retrieval results for different kinds of query terms such as object and scene categories.
In addition, we evaluate an extension such that the ranking score is computed independently of the query term, resulting in a very effective, scalable image search.

1.3 Organization
This paper is organized as follows: Section 2 describes our proposed image ranking approach. In Section 3 we discuss the implementation of the presented method in more detail and address scalability issues. We evaluate the approach experimentally on a large-scale image database in Section 4. Section 5 summarizes our results.

2. APPROACH
Given a large-scale collection consisting of images and their respective metadata, our goal is to find images relevant to a given query term. We define a relevant image as one showing content that most people associate with the query term. For now we focus on the case where only one query term is given; however, our method can be extended to multi-term queries.

In order to perform image search given a query term, we start with a broad set of images that satisfy the query. In our implementation this set simply includes all images that have been tagged with the query term by their authors. Since this image set is derived automatically based on this simple constraint, it contains a significant number of noisy images not necessarily showing the desired image content due to the subjectivity and ambiguity of tags.

[Figure 1: Images are compared by using the similarities in two different domains, i.e. by using visual and textual features (tags). The figure shows two example images with their tag lists ("jacksonvillezoo, zoo, jacksonville, jaguar, spots, cat, bigcat, whiskers" and "trickortreat, halloween, jaguar, cat, zoo") and their visual (φ_visual) and tag-based (φ_tag) similarities.]

Besides these images that are somehow related to the query term, we have neither additional data nor information about which images are preferred. Thus, we need to determine a score for each image indicating its relevance or importance to the current query. Assuming that the importance of an image is proportional to the number of images showing similar content, we build a graph representing the relationships of all images in the database. Its vertices represent our images and its edges their multimodal similarity. These multimodal similarities are based on visual and textual features. We then perform a random walk on this graph to determine a score for each image indicating its importance. This importance score reflects the likelihood of arriving at a certain vertex after a random walk over the given graph. We can then automatically rank the images in the aforementioned subset according to their importance score. In the following subsections we describe in detail how we build our similarity graph and review the random walk.
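The flow just described (collect a noisy tag-based candidate set, score every image by a random walk on the multimodal graph, then rank the candidates) can be summarized in a few lines. The following Python sketch is purely illustrative: `build_multimodal_graph` and `random_walk` are hypothetical stand-ins for the components detailed in Sections 2.1 and 2.2, and the record layout of `images` is an assumption, not the authors' implementation.

```python
def rank_images_for_query(images, query_term, build_multimodal_graph, random_walk):
    """Toy sketch of the query-by-term ranking flow described above.

    `images` is assumed to be a list of records with a `tags` attribute;
    the two callables are hypothetical stand-ins for Sections 2.1 and 2.2.
    """
    # 1) broad, noisy candidate set: all images tagged with the query term
    candidates = [idx for idx, img in enumerate(images) if query_term in img.tags]

    # 2) multimodal k-NN graph over the collection and random-walk importance scores
    graph = build_multimodal_graph(images)
    importance = random_walk(graph)

    # 3) rank only the candidate subset by the precomputed importance scores
    return sorted(candidates, key=lambda idx: importance[idx], reverse=True)
```

Because the importance scores do not depend on the query term, they can be computed once offline and reused for every query, which is the query-independent ranking mentioned in the contributions.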
2.1 Multimodal Similarity Graph
The link structure as well as the weights associated with the edges are fundamental for the performance of our graph-based approach. Hence, the first step in building a similarity graph is to define an appropriate distance measure for comparing images. Previous work [8] on image ranking uses only the visual content to compare images. On the other hand, it has been shown in the context of query-by-example retrieval that image descriptions based on user annotations outperform a representation based on visual features. Moreover, it has been shown in recent work [22, 13] that using multiple cues to find similar images boosts performance over using a single modality, either tags or visual features. Therefore we propose to use a distance measure for image comparison that combines the two modalities, user annotations and image content (see Figure 1).

We start by computing two descriptions for an image, one for each modality. In our implementation we represent our images by automatically extracted topic distributions. Details of our image representations are discussed in Section 3. It should be noted that our ranking system is not constrained to specific representations or modalities. However, we use a low-dimensional vector description of an image in its respective modality, as this allows us to easily compute distances between two such vector representations. Assuming we have a vector representation of the image content for each modality ω separately, one based on text features, i.e. tags, and the other based on visual features, we define an image relevance score as

    \varphi_\omega(i, j) = \exp\left( -\frac{\| I_\omega(i) - I_\omega(j) \|_1}{\sigma_\omega} \right)    (1)

where \| I_\omega(i) - I_\omega(j) \|_1 denotes in our case the L1 distance between the representations I_\omega(i) and I_\omega(j) of images i and j, and \sigma_\omega is a normalization constant set equal to the median of the pairwise L1 distances of all images.

With the equation above we obtain a relevance score between each pair of images for each modality. We then fuse both scores linearly into a combined image similarity measure s_{ij}:

    s_{ij} = \beta \, \varphi_{visual}(i, j) + (1 - \beta) \, \varphi_{tag}(i, j)    (2)

where \beta \in [0, 1] determines the weight between visual- and tag-based features. In Section 4 we evaluate the optimal setting for \beta experimentally by user studies.

Using the above defined similarity measure we are able to build our image graph. However, this results in a complete graph (i.e. a graph having links between all images). Establishing, storing, and processing a fully connected graph becomes difficult when considering very large image databases, as the number of links grows quadratically. Therefore we link each image only to its k nearest neighbors; thus the number of edges grows only linearly in the number of images. Given an image, we consider other images to be among its k nearest neighbors if the distance between the corresponding vector representations is among the smallest k distances. This is equivalent to choosing as neighbors the k images with the largest similarity according to Equation 2.

It should be noted that, due to the nearest neighbor approach, the link structure of our graph differs considerably from a complete graph. Besides the resulting sparsity, links established in the graph are not bidirectional in general: although the similarity defined in Equation 2 is symmetric, the neighborhood of an image is not, as it is determined by the k nearest neighbors rather than by the absolute distance values. Figure 2 visualizes the local structure of such an image graph.

As we target search in community databases where users upload their images, it is necessary to take special care to avoid artifacts introduced by users. For instance, a single user may contribute multiple images to our image graph. Each link from one image to another represents a vote for the other image's relevance. In order to limit a user's influence, we apply the following two restrictions:

- A user may not vote for any of his own images. That is, no links between images of the same user are allowed, as this would make the ranking vulnerable to manipulation by a single user.
- If an image has an incoming link from more than one image of a particular owner, the respective link weights are normalized by the number of incoming links originating from that owner's images. Alternatively, we could keep only the in-link from the best matching image.
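To make Equations 1 and 2 and the two linking restrictions concrete, here is a minimal NumPy sketch, assuming each image is already represented by two topic-distribution vectors (one visual, one tag-based) and an owner id. Function names, the default β, and the handling of ties are illustrative choices, not taken from the paper.

```python
import numpy as np

def modality_similarity(X):
    """phi_omega from Eq. (1): exp(-L1 distance / sigma) for one modality.
    X is an (n, d) array of per-image topic-distribution vectors."""
    d1 = np.abs(X[:, None, :] - X[None, :, :]).sum(axis=2)       # pairwise L1 distances
    sigma = np.median(d1[np.triu_indices_from(d1, k=1)])          # median pairwise distance
    return np.exp(-d1 / sigma)

def multimodal_similarity(X_visual, X_tag, beta=0.5):
    """Linear fusion s_ij from Eq. (2)."""
    return beta * modality_similarity(X_visual) + (1.0 - beta) * modality_similarity(X_tag)

def knn_weight_matrix(S, owners, k):
    """Weighted k-NN link matrix with the two per-user restrictions above:
    no links between images of the same owner, and in-link weights coming
    from one owner's images normalized by their count."""
    n = S.shape[0]
    M = np.zeros_like(S)
    for i in range(n):
        # candidate neighbors: images of other users only (restriction 1)
        allowed = [j for j in range(n) if j != i and owners[j] != owners[i]]
        neighbors = sorted(allowed, key=lambda j: S[i, j], reverse=True)[:k]
        for j in neighbors:
            M[i, j] = S[i, j]
    # restriction 2: normalize multiple in-links to j from the same owner
    for j in range(n):
        sources_by_owner = {}
        for i in np.nonzero(M[:, j])[0]:
            sources_by_owner.setdefault(owners[i], []).append(i)
        for sources in sources_by_owner.values():
            for i in sources:
                M[i, j] /= len(sources)
    return M
```

The per-owner normalization loops over columns because the restriction applies to incoming links; a realistic implementation would use sparse matrices and an approximate nearest neighbor index instead of dense n × n arrays.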
[Figure 2: Example of the link structure established in the image graph according to multimodal similarities. Note that there are both unidirectional (gray) and bidirectional (orange) links due to our nearest neighbor approach.]

The latter restriction is especially important since users tend to upload series of images which often show similar visual content and carry the same or similar textual annotations. Thus, it is very likely that these near-duplicates share their nearest neighbors. Hence, an image voted for by such a group, or the group itself, would be influenced overly strongly by a single user.

2.2 Random Walk
Having determined the graph representing the relationships between the images in our database, we now perform a random walk on this graph. The stationary distribution of the random walk process gives us a value for each image, which we use as its relevance score.

Let G denote the image graph. Each vertex of G corresponds to a certain image. The edges of the graph are established as discussed in the previous subsection. The random walk then traverses the graph according to its link structure, i.e. the probability of following an edge is given by its associated weight. The final random walk score for a vertex corresponds to the likelihood of arriving at this vertex after a random walk over G.

Thus, in order to apply the iterative random walk algorithm to G, we construct the corresponding transition matrix P = [p_{ij}]_{n \times n}, a row-stochastic matrix describing the transition probabilities used in the random walk process, i.e. p_{ij} = p(j \mid i) is the probability of arriving at vertex j in one step given the current vertex i. We first compute a multimodal similarity matrix M = [m_{ij}]_{n \times n} with

    m_{ij} = \begin{cases} s_{ij} & \text{if } j \in N_k(i) \\ 0 & \text{otherwise} \end{cases}    (3)

Each row i of M contains exactly k entries that represent weighted links to the k nearest neighbors N_k(i) of image i. P is then derived by normalization:

    p_{ij} = \frac{m_{ij}}{\sum_{j'} m_{ij'}}    (4)

Now let [x_t(i)]_{n \times 1} denote the so-called state distribution containing the probabilities of arriving at image (or vertex) i at time instance t during the random walk. These probabilities can be computed for all future time instances by iteratively applying the transition matrix P:

    x_{t+1}(j) = \alpha \sum_{i=1}^{n} x_t(i) \, p_{ij} + (1 - \alpha) \, v(j)    (5)

where v denotes an n × 1 bias vector and \alpha \in [0, 1] is a linear fusion variable. While the bias vector is in most cases used to ensure irreducibility and therefore convergence of the random walk, it also allows assigning some nodes a higher importance prior to the actual ranking procedure. Thus we examine two different bias settings in our experiments (see Section 4.3):

- A uniform bias, where we assign the same bias value to all nodes.
- A non-uniform bias, where the bias v
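For completeness, the random walk of Equations 3–5 can be sketched as a simple power iteration over the weight matrix M built above. The damping value α = 0.85, the uniform bias, and the convergence threshold below are illustrative defaults, not the settings reported in the paper.

```python
import numpy as np

def random_walk_scores(M, alpha=0.85, v=None, max_iters=100, tol=1e-9):
    """Iterate Eq. (5) on the row-normalized transition matrix of Eq. (4)."""
    n = M.shape[0]
    row_sums = M.sum(axis=1, keepdims=True)
    row_sums[row_sums == 0] = 1.0                    # guard against dangling vertices
    P = M / row_sums                                 # row-stochastic transition matrix P
    if v is None:
        v = np.full(n, 1.0 / n)                      # uniform bias vector
    x = np.full(n, 1.0 / n)                          # initial state distribution
    for _ in range(max_iters):
        x_next = alpha * (x @ P) + (1.0 - alpha) * v # Eq. (5) in vector form
        if np.abs(x_next - x).sum() < tol:           # L1 change as convergence test
            break
        x = x_next
    return x                                         # importance score per image
```

The resulting per-image scores can then be used directly to sort the tag-based candidate set for any query term.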