
Word Hashing for Efficient Search In Document Image Collections


Anand Kumar

A large number of document image collections are now being scanned and made available over the Internet or in digital libraries. Effective access to these sources is limited by the lack of efficient retrieval schemes. Text search methods require efficient and robust optical character recognizers (OCRs), which are presently unavailable for Indian languages. Word spotting, i.e. word image matching, may instead be used to retrieve word images in response to a word image query. The approaches used for word spotting so far, dynamic time warping and nearest neighbor search, tend to be slow for large collections of books: direct matching of images is expensive because of the complexity of matching, and thus impractical for large databases. In general, indexing and retrieval methods for document images cluster similar words and build indexes over representatives of the clusters. Building such a clustering-based index takes very long, since the clustering step requires complex image matching procedures. This problem is solved by directly hashing word image representations.
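The dynamic time warping comparison mentioned above can be sketched as follows. This is an illustrative toy, not the thesis's implementation: it assumes each word image has already been reduced to a 1-D sequence of per-column ink-profile values, and uses absolute difference as the local cost.

```python
# Illustrative DTW matching of two word images via per-column profile
# features (feature extraction is assumed; a and b are 1-D sequences).

def dtw_distance(a, b):
    """Dynamic time warping distance between two 1-D feature sequences."""
    n, m = len(a), len(b)
    INF = float("inf")
    D = [[INF] * (m + 1) for _ in range(n + 1)]
    D[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(a[i - 1] - b[j - 1])
            D[i][j] = cost + min(D[i - 1][j],      # skip a column of a
                                 D[i][j - 1],      # skip a column of b
                                 D[i - 1][j - 1])  # match columns
    return D[n][m]
```

Each comparison costs O(nm) in the sequence lengths, and a linear scan pays that cost against every word in the collection; this is exactly why the direct-matching approaches above do not scale to large book collections.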

An efficient mechanism for indexing and retrieval in large document image collections is presented in this thesis. First, document images are segmented into words. Features are then computed at the word level and indexed. Word retrieval is done very efficiently with content-sensitive hashing (CSH), which builds on an approximate nearest neighbor search technique called locality sensitive hashing (LSH). Word images are hashed into hash tables using word-level features, with content-sensitive hash functions chosen so that similar words fall into the same hash bucket with high probability. The sub-linear time CSH scheme makes the search very fast without degrading accuracy. Experiments on a Telugu collection of books by Kalidasa, the classical Indian poet of antiquity, demonstrate that word images can be retrieved in a few milliseconds, making search in document image collections practical.
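A minimal sketch of the hashing idea, under the assumption of random-hyperplane LSH over word-image feature vectors (the thesis's actual hash functions and features may differ): each table hashes a vector to a bit-key, and similar vectors collide in at least one table with high probability.

```python
# Sketch of locality-sensitive hashing over word-image feature vectors.
# Feature extraction is assumed; vectors are plain lists of floats.
import random

class LSHIndex:
    def __init__(self, dim, n_tables=4, n_bits=8, seed=0):
        rng = random.Random(seed)
        # One set of random hyperplanes per hash table.
        self.planes = [[[rng.gauss(0, 1) for _ in range(dim)]
                        for _ in range(n_bits)] for _ in range(n_tables)]
        self.tables = [{} for _ in range(n_tables)]

    def _key(self, planes, v):
        # The sign of the dot product with each hyperplane gives one bit.
        return tuple(int(sum(p_i * v_i for p_i, v_i in zip(p, v)) >= 0)
                     for p in planes)

    def add(self, v, label):
        for planes, table in zip(self.planes, self.tables):
            table.setdefault(self._key(planes, v), []).append(label)

    def query(self, v):
        # Union of candidates from all tables; similar vectors collide often,
        # so only a small bucket is examined instead of the whole collection.
        out = set()
        for planes, table in zip(self.planes, self.tables):
            out.update(table.get(self._key(planes, v), []))
        return out
```

Query time depends on the bucket sizes rather than the collection size, which is the source of the sub-linear search behaviour described above.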

 

Year of completion:  2008
 Advisors: C. V. Jawahar & R. Manmatha

Related Publications

  • Anand Kumar, C.V. Jawahar and R. Manmatha - Efficient Search in Document Image Collections, Proceedings of the 8th Asian Conference on Computer Vision (ACCV'07), Part I, LNCS 4843, pp. 586-595, Tokyo, Japan, 18-22 November 2007. [PDF]

  • C.V. Jawahar and Anand Kumar - Content-level Annotation of Large Collection of Printed Document Images, Proceedings of the 9th International Conference on Document Analysis and Recognition (ICDAR), Brazil, 23-26 September 2007. [PDF]

  • Anand Kumar, A. Balasubramanian, Anoop M. Namboodiri and C.V. Jawahar - Model-Based Annotation of Online Handwritten Datasets, International Workshop on Frontiers in Handwriting Recognition (IWFHR'06), 23-26 October 2006, Centre de Congrès Atlantia, La Baule, France. [PDF]

 



Proxy Based Compression of Depth Movies


Pooja Verlani (homepage)

Sensors for 3D data are common today; these include multi-camera systems, laser range scanners, etc. Some of them are suitable for real-time capture of the shape and appearance of dynamic events. The 2-1/2D model of an aligned depth map and image, called a Depth Image, has been popular in Image Based Modeling and Rendering (IBMR). Capturing the 2-1/2D geometric structure and photometric appearance of dynamic scenes is possible today. Time-varying depth and image sequences, called Depth Movies, can extend IBMR to dynamic events. The captured event contains aligned sequences of depth maps and textures, which are often streamed to a distant location for immersive viewing. Applications of such systems include virtual-space tele-conferencing, remote 3D immersion, 3D entertainment, etc. We study a client-server model for tele-immersion where captured or stored depth movies are sent from a server to multiple remote clients on demand. Depth movies consist of dynamic depth maps and texture maps. Multiview image compression and video compression have been studied earlier, but there has been no study of dynamic depth map compression. This thesis contributes towards dynamic depth map compression for efficient transmission in a server-client 3D tele-immersive environment. Dynamic depth map data is bulky and needs efficient compression schemes. Immersive applications require time-varying sequences of depth images from multiple cameras to be encoded and transmitted; at the remote site, the 3D scene is regenerated by rendering the whole scene. Depth movies of a generic 3D scene from multiple cameras are thus very heavy to send over a network, given the available bandwidth.
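The 2-1/2D Depth Image representation above can be made concrete with a small sketch: back-projecting a per-pixel depth map into 3D points through a pinhole camera model. The intrinsics (fx, fy, cx, cy) and the "zero means no sample" convention are illustrative assumptions, not the thesis's exact format.

```python
# Minimal sketch of the 2-1/2D "Depth Image" idea: a depth map plus camera
# intrinsics determines a 3D point per pixel (texture gives its colour).

def backproject(depth, fx, fy, cx, cy):
    """depth: 2-D list of z values; returns a list of (X, Y, Z) points."""
    points = []
    for v, row in enumerate(depth):
        for u, z in enumerate(row):
            if z <= 0:          # 0 or negative marks "no depth sample" here
                continue
            X = (u - cx) * z / fx   # pinhole back-projection
            Y = (v - cy) * z / fy
            points.append((X, Y, z))
    return points
```

Rendering at the client amounts to projecting such points into a novel view, which is why transmitting depth faithfully (the compression problem studied here) directly affects the rendered quality.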


This thesis presents a scheme to compress depth movies of human actors using a parametric proxy model for the underlying action. We use a generic articulated human model as the proxy to represent the human in action, with the joint angles of the model parametrizing the proxy at each time instant. The proxy represents a common prediction of the scene structure. The difference between the captured depth and the depth of the proxy, called the residue, is used to represent the scene, exploiting spatial coherence. A few variations of this algorithm are presented in this thesis. We experimented with bit-wise compression of the residues and analyzed the quality of the generated 3D scene. Differences in residues across time are used to exploit temporal coherence. Intra-frame coded frames and difference-coded frames together provide random access and high compression. We show results on several synthetic and real actions to demonstrate the compression ratio and the resulting quality using depth-based rendering of the decoded scene. We present the articulation fitting tool, the compression module with different algorithms, and the server-client system with several variants for the user. The thesis first explains the concepts of 3D reconstruction by image based modeling and rendering, compression of such 3D representations, and teleconferencing; it then proceeds to depth images and movies, followed by the main algorithms, examples, experiments and results.
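The residue idea above can be sketched in a few lines. This is a toy, not the thesis's codec: depth maps are flattened to lists, quantization is a single uniform step, and the intra/difference frame layout is an assumption. The key point it shows is that server and client share the proxy's predicted depth, so only small residues (or their temporal differences) cross the network.

```python
# Toy proxy-residue codec: residue = captured - proxy; "I" frames code the
# residue directly, "D" frames code its change since the previous frame.

def encode_frame(captured, proxy, prev_residue=None, step=1.0):
    """Return (mode, payload) for one depth frame."""
    residue = [c - p for c, p in zip(captured, proxy)]
    if prev_residue is None:
        return "I", [round(r / step) for r in residue]        # intra frame
    diff = [r - pr for r, pr in zip(residue, prev_residue)]   # temporal coherence
    return "D", [round(d / step) for d in diff]

def decode_frame(mode, payload, proxy, prev_residue=None, step=1.0):
    """Return (reconstructed depth, residue) so the next frame can chain."""
    vals = [q * step for q in payload]
    residue = vals if mode == "I" else [v + pr for v, pr in zip(vals, prev_residue)]
    return [p + r for p, r in zip(proxy, residue)], residue
```

"I" frames give the random access mentioned above, while runs of "D" frames give the high compression, mirroring the intra-coded / difference-coded split in the text.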

 

Year of completion:  2008
 Advisor: P. J. Narayanan

Related Publications

  • Pooja Verlani, P. J. Narayanan - Proxy-Based Compression of 2-1/2D Structure of Dynamic Events for Tele-immersive Systems, Proceedings of 3D Data Processing, Visualization and Transmission (3DPVT), June 18-20, 2008, Georgia Institute of Technology, Atlanta, GA, USA. [PDF]

  • Pooja Verlani, P. J. Narayanan - Parametric Proxy-Based Compression of Multiple Depth Movies of Humans, Proceedings of the Data Compression Conference (DCC) 2008, March 25-27, 2008, Salt Lake City, Utah. [PDF]

  • Pooja Verlani, Aditi Goswami, P.J. Narayanan, Shekhar Dwivedi and Sashi Kumar Penta - Depth Image: Representations and Real-time Rendering, Third International Symposium on 3D Data Processing, Visualization and Transmission (3DPVT), Chapel Hill, North Carolina, June 14-16, 2006. [PDF]

 



 

Projected Texture for 3D Object Recognition


Avinash Sharma (homepage)

Three dimensional objects are characterized by their shape, which can be thought of as the variation in depth over the object from a particular viewpoint. These variations could be deterministic, as in the case of rigid objects, or stochastic, for surfaces containing a 3D texture. The depth variations are lost during imaging, and what remains is the intensity variation induced by the shape and lighting, as well as focus variations. Algorithms that utilize 3D shape for classification try to recover the lost 3D information from the intensity or focus variations, or from additional cues such as multiple images or structured lighting. This process is computationally intensive and error prone. Once the depth information is estimated, one needs to characterize the object using shape descriptors for classification. Image-based classification algorithms instead try to characterize the intensity variations of the image for recognition. As noted, the intensity variations are affected by the illumination and pose of the object, so such algorithms attempt to derive descriptors that are invariant to changes in lighting and pose. Although image-based classification algorithms are more efficient and robust, their classification power is limited because the 3D information is lost during imaging. Our problem is to find an image-based recognition method that utilizes the shape of the object without explicitly recovering it, implicitly avoiding the high computational cost of shape recovery while achieving high accuracy. The method should be robust to view variation and occlusion, invariant to the scale and position of the object, and able to handle partially specular and texture-less object surfaces. We propose the use of structured lighting patterns, which we refer to as projected texture, for the purpose of object recognition.
The depth variations of the object induce deformations in the projected texture, and these deformations encode the shape information. The primary idea is to view the deformation pattern as a characteristic property of the object and use it directly for classification, instead of trying to recover the shape explicitly. To achieve this we need an appropriate projection pattern and features that sufficiently characterize the deformations. The patterns required can be quite different depending on the nature of the object shape and its variation across objects. Specifically, we look at three different recognition problems and propose appropriate projection patterns, deformation characterizations, and recognition algorithms for each. The first category of objects has fixed shape and pose, where minor differences in shape discriminate between classes; 3D hand geometry recognition is taken as the example. The second problem is category recognition of rigid objects from arbitrary viewpoints, for which we propose a classification algorithm based on the popular bag-of-words paradigm. The third problem is 3D texture classification, where the depth variation of the surface is stochastic in nature; we propose a set of simple texture features that capture the deformations of projected lines on 3D textured surfaces. These approaches have been implemented, verified, tested, and compared on various datasets, both collected by us and available on the Internet. The analysis and comparative results demonstrate significant improvement over existing approaches in terms of accuracy and robustness.
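The "use the deformation directly as a feature" idea can be illustrated with a toy: for one projected stripe, summarize how far the observed stripe bends away from the straight line it would form on a flat surface. The two statistics below (bending energy and roughness) are illustrative stand-ins, not the thesis's exact features.

```python
# Toy deformation features for one projected stripe: ys[i] is the observed
# row position of the stripe in image column i. On a perfectly flat surface
# all values would be equal; depth variation bends the stripe.

def stripe_deformation_features(ys):
    """Return (bending energy, roughness) of one observed stripe."""
    n = len(ys)
    mean = sum(ys) / n
    dev = [y - mean for y in ys]
    energy = sum(d * d for d in dev) / n                       # overall bending
    roughness = sum(abs(ys[i + 1] - ys[i]) for i in range(n - 1)) / (n - 1)
    return (energy, roughness)
```

A flat surface gives (0, 0); a rough 3D texture gives high roughness; such per-stripe features can then feed an ordinary classifier, with no depth map ever reconstructed.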

 

Year of completion:  2008
 Advisor : Anoop M. Namboodiri

Related Publications

  • Avinash Sharma and Anoop M. Namboodiri - Object Category Recognition with Projected Texture, IEEE Sixth Indian Conference on Computer Vision, Graphics & Image Processing (ICVGIP 2008), pp. 374-381, 16-19 Dec 2008, Bhubaneswar, India. [PDF]

  • Visesh Chari, Avinash Sharma, Anoop M Namboodiri and C.V. Jawahar - Frequency Domain Visual Servoing using Planar Contours, IEEE Sixth Indian Conference on Computer Vision, Graphics & Image Processing (ICVGIP 2008), pp. 87-94, 16-19 Dec 2008, Bhubaneswar, India. [PDF]

  • Avinash Sharma and Anoop M. Namboodiri - Projected Texture for Object Classification, Proceedings of the 10th European Conference on Computer Vision (ECCV 2008), 12-18 Oct 2008, France. [PDF]

  • Avinash Sharma, Nishant Shobhit and Anoop M. Namboodiri - Projected Texture for Hand Geometry based Authentication, Proceedings of the CVPR Workshop on Biometrics, 28 June 2008, Anchorage, Alaska, USA. IEEE Computer Society, 2008. [PDF]

 



 

 

Real Time Rendering of Implicit Surfaces on the GPU


Jag Mohan Singh

Generating visually realistic models is one of the core problems of Computer Graphics. Rasterization, i.e. scan converting primitives such as triangles, is one method of rendering; it suffers from an inexact representation, since the triangles are themselves an approximation of the underlying geometry. Ray tracing the primitives is another method of rendering; it represents the underlying geometry exactly and looks visually realistic. We therefore ray trace implicit surfaces rather than polygonizing them. Programmable graphics processing units (GPUs) have high computational capability but relatively limited bandwidth for data access. Consequently, a compact representation of geometry using a suitable procedural or mathematical model, combined with ray tracing as the mode of rendering, fits the GPU well. An implicit surface can be represented as S(x,y,z) = 0, which along a ray becomes the equation F_f(t) = 0; ray tracing S(x,y,z) = 0 amounts to computing the roots of F_f(t) = 0 for every pixel on the screen. Analytical methods can be used for surfaces up to order 4. We compute the interval extension of a function exactly by evaluating it at its points of maxima and minima and at the end points; since we can compute roots of functions up to order 4, we can compute the points of maxima and minima of functions up to order 5. We therefore use interval arithmetic, via Mitchell's algorithm, for surfaces up to order 5; interval methods provide a robust way to isolate roots. The marching points algorithm marches along the ray in equal step sizes until a root is found, detected by a sign change in the function; it wastes computation by evaluating the function at many points. The adaptive marching points algorithm instead marches adaptively to find the root.
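The marching points idea can be sketched on the CPU (the actual method runs per-pixel on the GPU, and the adaptive variant varies the step size; the fixed step and bisection refinement below are simplifications):

```python
# Sketch of marching points along a ray: step in fixed increments,
# evaluate F(t) = S(o + t*d), and report the first sign change as a
# root-bracketing interval; bisection then pins down the hit point.

def march_ray(F, t0, t1, step):
    """Return (ta, tb) bracketing the first root of F in [t0, t1], or None."""
    t, f_prev = t0, F(t0)
    while t < t1:
        t_next = min(t + step, t1)
        f = F(t_next)
        if f_prev == 0.0 or f_prev * f < 0.0:   # sign change: root isolated
            return (t, t_next)
        t, f_prev = t_next, f
    return None

def bisect(F, ta, tb, iters=40):
    """Refine a bracketed root by bisection."""
    for _ in range(iters):
        tm = 0.5 * (ta + tb)
        if F(ta) * F(tm) <= 0.0:
            tb = tm
        else:
            ta = tm
    return 0.5 * (ta + tb)
```

The waste the text describes is visible here: every step costs one evaluation of F whether or not a root is near, which is what the adaptive step-size strategy addresses.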
Though only fourth or lower order surfaces can be rendered using analytical roots, our adaptive marching points algorithm can ray trace arbitrary implicit surfaces exactly, by sampling the ray at selected points until a root is found. Adapting the sampling step size based on a proximity measure and a horizon measure delivers high speed; the horizon measure adapts the sampling near silhouettes and yields good quality silhouettes. We also provide a Taylor test, which has flavours of interval arithmetic and helps render surfaces robustly with the adaptive marching points algorithm. While evaluating S(x,y,z) along the ray, we never form the ray-dependent polynomial F_f(t) from its coefficients in t: there are O(d^3) such coefficients, where d is the degree of the surface, and they are expensive to compute, whereas we only need the value of S(x,y,z) at a point. The derivative F'_f(t) can also be calculated efficiently from the gradient of S as grad(S(x,y,z)) dot D_f. The Barth decic, for example, can be evaluated as S(x,y,z) using about 30 terms, but computing all 11 coefficients of the tenth order polynomial F_f(t) requires evaluating 1373 terms. We also render dynamic implicit surfaces that vary with time. Overall, a simple algorithm that fits the SIMD architecture of the GPU results in high performance. We ray trace algebraic surfaces up to order 18 and non-algebraic surfaces, including a Blinn blobby with 30 spheres, at better than interactive frame rates. Adaptive marching points is an ideal match for the SIMD model of the GPU due to the low computational cost per operation. Using analytical methods for surfaces up to order 4, we achieve 3750 fps on a cubic surface and 1400 fps on a quartic surface. Using the robust Mitchell method on surfaces up to order 5, we achieve up to 400 fps on a torus (a quartic) and 85 fps on a quintic surface.
Our adaptive marching points method renders high order implicit surfaces at interactive frame rates; we render a surface of order 18 at 158 fps. These experiments used an NVIDIA 8800 GTX at a resolution of 512x512. Our GPU Objects system renders the Bunny with 35,947 spheres at 57 fps and with 99,130 spheres at 30 fps, and a hyperboloid with reflection and refraction at 300 fps; an NVIDIA 6600 GTX with a 512x512 viewport was used for the GPU Objects experiments.
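The "evaluate S directly, never expand F_f(t)" point can be demonstrated with a small sketch. A unit sphere stands in here for higher-order surfaces like the Barth decic; the Newton refinement is one possible root polish, not the thesis's exact scheme.

```python
# Evaluate F(t) = S(o + t*d) at ray points directly, and get the derivative
# via the chain rule F'(t) = grad(S) . d, without ever computing the O(d^3)
# polynomial coefficients of F in t.

def make_F(S, gradS, o, d):
    def F(t):
        p = [o[i] + t * d[i] for i in range(3)]
        return S(p)
    def Fp(t):
        p = [o[i] + t * d[i] for i in range(3)]
        g = gradS(p)
        return sum(g[i] * d[i] for i in range(3))   # chain rule: grad(S) . d
    return F, Fp

# Unit sphere S(x,y,z) = x^2 + y^2 + z^2 - 1 and its gradient.
S = lambda p: p[0]**2 + p[1]**2 + p[2]**2 - 1.0
gradS = lambda p: (2 * p[0], 2 * p[1], 2 * p[2])

# Ray from (0,0,-2) along +z; the near intersection is at t = 1.
F, Fp = make_F(S, gradS, o=(0.0, 0.0, -2.0), d=(0.0, 0.0, 1.0))
t = 0.5
for _ in range(20):          # Newton refinement from t = 0.5 converges to 1
    t -= F(t) / Fp(t)
```

For a degree-d surface the saving is exactly the one quoted above: one evaluation of S per sample versus expanding all O(d^3) coefficients of F_f(t) per ray.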


 

Year of completion:  December 2008
 Advisor: P. J. Narayanan

Related Publications

  • Jag Mohan Singh and P. J. Narayanan - Real-Time Ray Tracing of Implicit Surfaces on the GPU, IEEE Transactions on Visualization and Computer Graphics, Vol. 16(2), pp. 261-272, 2010. [PDF]

  • Visesh Chari, Jag Mohan Singh and P. J. Narayanan - Augmented Reality using Over-Segmentation, Proceedings of the National Conference on Computer Vision Pattern Recognition Image Processing and Graphics (NCVPRIPG'08), Jan 11-13, 2008, DA-IICT, Gandhinagar, India. [PDF]

  • Kedarnath Thangudu, Lakshmi Gade, Jag Mohan Singh and P. J. Narayanan - Point Based Representations for Hierarchical Environments, In International Conference on Computing: Theory and Applications (ICCTA), Kolkata, 2007. [PDF]

  • Jag Mohan Singh and P.J. Narayanan - Progressive Decomposition of Point Clouds Without Local Planes, 5th Indian Conference on Computer Vision, Graphics and Image Processing (ICVGIP), Madurai, India, LNCS 4338, pp. 364-375, 2006. [PDF]

  • Sunil Mohan Ranta, Jag Mohan Singh and P.J. Narayanan - GPU Objects, 5th Indian Conference on Computer Vision, Graphics and Image Processing (ICVGIP), Madurai, India, LNCS 4338, pp. 352-363, 2006. [PDF]



 

 

Multiple View Geometry Applications to Robotic and Visual Servoing


Visesh Chari (homepage)

Computer Vision may be described as the process of scene understanding through the analysis of images captured by a camera. Understanding a scene has several aspects, which makes computer vision a vibrant and active field of research. For example, one section of computer vision research concentrates on understanding the inherent characteristics of an object (making identifications like "this is a face", "this is a car", etc.). Another branch focuses on answering questions like "find the given face in this image" or "find where cars occur in this image". Another, more primitive branch concerns itself with estimating the geometry of the scene, answering questions like "what is the shape of this face" or "how would this car look from that viewpoint". This branch and its various derivatives go under the name Multiple View Geometry. Technically, Multiple View Geometry (MVG) concerns itself with the geometric interaction of the 3D world with images captured by a camera, and with the interpretation and manipulation of this information for various tasks. MVG research is two decades old and borrows heavily from the related field of photogrammetry. Over these years, many algorithms have been proposed for estimating geometric quantities such as the transformation between cameras viewing a scene, or the 3D structure of an object viewed by multiple cameras. The field has matured recently, with focus shifting towards producing globally optimal estimates of geometric quantities like transformations and structure, and towards the analysis of cases where geometric inference or manipulation is NP-complete. Even before this maturity, many MVG algorithms found applications: the simple mosaicing feature available in digital cameras these days owes its origin to one such algorithm.
Applications have also appeared in areas like animation for films, robot motion in automated surgery and industrial environments, security systems that employ hundreds of cameras, etc.


 

This thesis focuses on the application front of Multiple View Geometry, which has started gaining popularity. To this end, we leverage concepts from MVG to develop new frameworks and algorithms for a variety of problems.

For this reason, we explore the use of MVG in various robotics and computer vision tasks in this thesis. We first propose a tracking framework that utilizes cues like texture and edges to track 2D and 3D objects in various views of a scene. Tracking refers to the task of estimating the location and orientation of an object with respect to a pre-defined world coordinate system. Traditionally, filters like the Kalman filter and its variants have been used for tracking. Problems like illumination change and occlusion affect many of these algorithms, which make assumptions such as uniform intensity of objects across views. We show that by embedding MVG into tracking algorithms, we can achieve efficient tracking of objects that is robust to large changes in perspective, illumination and occlusion. A by-product is the estimation of the pose of the camera, which is itself useful for tasks like localization in a mobile environment.
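For background, the traditional filtering baseline mentioned above can be sketched as a 1-D constant-velocity Kalman filter tracking an object's position. This is a textbook illustration of the baseline, not the thesis's MVG-based tracker; the noise levels q and r are illustrative assumptions.

```python
# 1-D constant-velocity Kalman filter: predict with x' = x + v*dt,
# then correct with a position measurement z.

class Kalman1D:
    def __init__(self, x0, v0, q=1e-3, r=0.1):
        self.x = [x0, v0]                  # state: position, velocity
        self.P = [[1.0, 0.0], [0.0, 1.0]]  # state covariance
        self.q, self.r = q, r              # process / measurement noise

    def step(self, z, dt=1.0):
        # Predict with the constant-velocity motion model.
        x, v = self.x
        x_pred, v_pred = x + v * dt, v
        P = self.P
        P00 = P[0][0] + dt * (P[1][0] + P[0][1]) + dt * dt * P[1][1] + self.q
        P01 = P[0][1] + dt * P[1][1]
        P10 = P[1][0] + dt * P[1][1]
        P11 = P[1][1] + self.q
        # Correct with the position measurement z.
        S = P00 + self.r
        K0, K1 = P00 / S, P10 / S          # Kalman gain
        y = z - x_pred                      # innovation
        self.x = [x_pred + K0 * y, v_pred + K1 * y]
        self.P = [[(1 - K0) * P00, (1 - K0) * P01],
                  [P10 - K1 * P00, P11 - K1 * P01]]
        return self.x[0]
```

Its weakness is the one the text identifies: the measurement model assumes the object's appearance (and hence the measured position) is reliable across views, which breaks under occlusion and illumination change.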

 

 

Then we show an application of frequency-domain MVG to the task of robot positioning. Positioning (or visual servoing) is the task of making a robot assume a desired pose with respect to an object of interest, with the help of a camera. This object might be a heart, as in surgery, or an automobile part, as in industrial settings. We show that by using frequency domain techniques in MVG, we can obtain algorithms that require only rough correspondence between images, unlike earlier algorithms that needed specific point-to-point correspondences. This is further developed into a general servoing framework capable of straight Cartesian paths and path following, which are recent problems in servoing.
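The flavour of frequency-domain alignment without point correspondences can be illustrated with phase correlation, which recovers a translation between two images from their spectra alone. This is only the simplest such technique (pure circular translation); the servoing work above handles richer planar motion.

```python
# Phase correlation: the normalized cross-power spectrum of two images
# is a pure phase ramp whose inverse FFT peaks at their relative shift.
import numpy as np

def phase_correlation(a, b):
    """Return (dy, dx) such that b == np.roll(a, (dy, dx), axis=(0, 1))."""
    Fa, Fb = np.fft.fft2(a), np.fft.fft2(b)
    cross = np.conj(Fa) * Fb
    cross /= np.abs(cross) + 1e-12          # keep phase, drop magnitude
    corr = np.fft.ifft2(cross).real
    peak = np.unravel_index(np.argmax(corr), corr.shape)
    # Wrap peaks past the midpoint to negative shifts.
    return tuple(int(p) if p <= s // 2 else int(p) - s
                 for p, s in zip(peak, corr.shape))
```

Note that no feature in a is ever matched to a feature in b; the whole-image spectra carry the alignment, which is the sense in which only "rough correspondence" between the images is needed.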

 

 

 

Within computer vision, we explore the use of MVG for various image and video editing tasks. Tasks like removing a scene object from a video in a consistent manner fall in this category (predicting how the video would look without the object). In this area, we propose an algorithm for video inpainting, where specific objects are removed from a video and the resulting space-time holes are filled in a consistent manner. The algorithm is fully automatic, unlike traditional image and video inpainting algorithms, and takes as input two functions: one defines the object to be removed, and the other defines the background model used for hole-filling.
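The two inputs described above can be illustrated per-frame with a toy: a mask selecting the object to remove, and a background model supplying the fill. A real video inpainting pipeline must additionally enforce consistency across space-time, which this sketch ignores.

```python
# Toy per-frame hole filling: where mask is True (the removed object),
# take the pixel from the background model; elsewhere keep the frame.

def fill_holes(frame, mask, background):
    """frame, mask, background: 2-D lists of equal shape."""
    return [[background[y][x] if mask[y][x] else frame[y][x]
             for x in range(len(frame[0]))] for y in range(len(frame))]
```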

 

 

 

 


We then extend this algorithm to image extrapolation, which is concerned with predicting unseen content of a scene from what is available. This differs from inference in that no data is available to confirm our predictions, so several alternatives remain equally viable. In this direction, we propose an inpainting-based framework for Image Based Rendering (IBR). IBR concerns itself with algorithms for an image-based representation of the 3D information of a scene, from which novel views can then be rendered. We extend IBR to cases where the 3D information about a scene is incomplete, by incorporating information about the type of scene being viewed (e.g. the face of a person). We then devise algorithms to transfer specific semantic characteristics to the current scene from similar scenes available to us.

 

 

 

Year of completion:  December 2008
 Advisors: C. V. Jawahar & P. J. Narayanan

Related Publications

  • Visesh Chari, Avinash Sharma, Anoop M Namboodiri and C.V. Jawahar - Frequency Domain Visual Servoing using Planar Contours, IEEE Sixth Indian Conference on Computer Vision, Graphics & Image Processing (ICVGIP 2008), pp. 87-94, 16-19 Dec 2008, Bhubaneswar, India. [PDF]

  • Visesh Chari, C. V. Jawahar, P. J. Narayanan - Video Completion as Noise Removal, Proceedings of the National Conference on Communications (NCC'08), Feb 1-3, 2008, IIT Mumbai, India. [PDF]

  • Visesh Chari, Jag Mohan Singh and P. J. Narayanan - Augmented Reality using Over-Segmentation, Proceedings of the National Conference on Computer Vision Pattern Recognition Image Processing and Graphics (NCVPRIPG'08), Jan 11-13, 2008, DA-IICT, Gandhinagar, India. [PDF]

  • A.H. Abdul Hafez, Visesh Chari and C.V. Jawahar - Combining Texture and Edge Planar Tracker based on a Local Quality Metric, Proceedings of the IEEE International Conference on Robotics and Automation (ICRA'07), Roma, Italy, 2007. [PDF]

 


