
BoundaryNet: An Attentive Deep Network with Fast Marching Distance Maps for Semi-automatic Layout Annotation


Abstract

In this work, we propose BoundaryNet, a novel resizing-free approach for high-precision semi-automatic layout annotation. The variable-sized user selected region of interest is first processed by an attention-guided skip network. The network optimization is guided via Fast Marching distance maps to obtain a good quality initial boundary estimate and an associated feature representation. These outputs are processed by a Residual Graph Convolution Network optimized using Hausdorff loss to obtain the final region boundary. Results on a challenging image manuscript dataset demonstrate that BoundaryNet outperforms strong baselines and produces high-quality semantic region boundaries. Qualitatively, our approach generalizes across multiple document image datasets containing different script systems and layouts, all without additional fine-tuning.
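As a rough illustration of the Fast Marching supervision idea (a sketch, not the paper's implementation): on a uniform pixel grid with a constant speed function, the Fast Marching distance from a region boundary is closely approximated by a Euclidean distance transform, which can then be turned into per-pixel weights that concentrate the mask loss near the boundary. All names and the weighting scheme below are illustrative assumptions.

```python
import numpy as np
from scipy.ndimage import distance_transform_edt

def boundary_distance_map(mask):
    """Approximate distance (in pixels) from every pixel to the region
    boundary. For constant speed, the Fast Marching solution on a grid
    is well approximated by the Euclidean distance transform."""
    mask = mask.astype(bool)
    interior = distance_transform_edt(mask)   # distance to nearest background
    exterior = distance_transform_edt(~mask)  # distance to nearest foreground
    # Unsigned distance to the boundary, measured from both sides.
    return np.where(mask, interior, exterior)

# Toy example: a 7x7 square region inside a 15x15 image.
mask = np.zeros((15, 15), dtype=bool)
mask[4:11, 4:11] = True
dmap = boundary_distance_map(mask)
# Illustrative loss weights: largest near the boundary, decaying inward/outward.
weights = np.exp(-dmap / 5.0)
```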

Architecture

The architecture of BoundaryNet (top) and its sub-components (bottom). The variable-sized H×W input image is processed by Mask-CNN (MCNN), which predicts a region mask estimate and an associated region class. The mask's boundary is determined using a contourization procedure (light brown) applied to the estimate from MCNN. M boundary points are sampled on the boundary. A graph is constructed with the points as nodes and edge connectivity defined by the k-hop neighborhoods of each point. The spatial coordinates of a boundary point location p = (x, y) and corresponding backbone skip attention features from MCNN f^r are used as node features for the boundary point. The feature-augmented contour graph G = (F, A) is iteratively processed by Anchor GCN to obtain the final output contour points defining the region boundary.
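The contour-to-graph step above can be sketched as follows. This is a minimal illustration under stated assumptions: M nodes are sampled uniformly along a closed contour and connected in a ring by their k-hop neighbors; the node features here are only the (x, y) coordinates, whereas BoundaryNet also appends MCNN skip-attention features.

```python
import numpy as np

def contour_graph(contour_xy, M=40, k=3):
    """Sample M nodes along a closed contour and build a k-hop ring
    adjacency. contour_xy: (N, 2) array of ordered boundary points.
    Returns node features F and adjacency matrix A."""
    N = len(contour_xy)
    idx = np.linspace(0, N, M, endpoint=False).astype(int)
    F = contour_xy[idx]                       # (M, 2) node features
    A = np.zeros((M, M), dtype=np.float32)
    for i in range(M):
        for h in range(1, k + 1):             # connect along the closed ring
            A[i, (i + h) % M] = 1.0
            A[i, (i - h) % M] = 1.0
    return F, A

# Toy contour: a unit circle discretized at 200 points.
t = np.linspace(0, 2 * np.pi, 200, endpoint=False)
circle = np.stack([np.cos(t), np.sin(t)], axis=1)
F, A = contour_graph(circle, M=40, k=3)
```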

Instance-level Results from Bounding Box Supervision


Page-level Results on Historical Documents


Interaction with Human Annotators

A small-scale experiment was conducted with real human annotators in the loop to determine BoundaryNet's utility in a practical setting. The annotations for a set of images were sourced using the HINDOLA document annotation system in three distinct modes: Manual Mode (hand-drawn contour generation and region labelling), Fully Automatic Mode (using an existing instance segmentation approach, Indiscapes, with post-correction using the annotation system), and Semi-Automatic Mode (manual input of region bounding boxes which are subsequently sent to BoundaryNet, followed by post-correction). For each mode, we recorded the end-to-end annotation time at the per-document level, including manual correction time (violin plots shown in the figure). BoundaryNet outperforms the other approaches by generating superior-quality contours which minimize the post-inference manual correction burden.

Citation

    @inproceedings{trivedi2021boundarynet,
        title     = {BoundaryNet: An Attentive Deep Network with Fast Marching Distance Maps for Semi-automatic Layout Annotation},
        author    = {Trivedi, Abhishek and Sarvadevabhatla, Ravi Kiran},
        booktitle = {International Conference on Document Analysis and Recognition},
        year      = {2021}
    }

Contact

If you have any questions, please contact Dr. Ravi Kiran Sarvadevabhatla.


Wisdom of (Binned) Crowds: A Bayesian Stratification Paradigm for Crowd Counting


Abstract

The idea behind our work is to tackle the high variance of error that is ignored by de facto statistical performance measures such as MSE and MAE in the crowd counting domain. Our recipe involves finding strata that are optimal in a Bayesian sense and then systematically modifying the standard crowd counting pipeline to incorporate variance reduction at each step.
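The stratified evaluation idea can be sketched as follows: partition the test samples into count strata and report a per-stratum MAE, so that a few very crowded images do not dominate a single pooled score. The bin edges below are illustrative placeholders, not the Bayesian-optimal strata the paper computes.

```python
import numpy as np

def stratified_mae(gt_counts, pred_counts, bin_edges):
    """Per-stratum MAE: partition samples by ground-truth count and
    evaluate each bin separately instead of one pooled MAE."""
    gt = np.asarray(gt_counts, dtype=float)
    pred = np.asarray(pred_counts, dtype=float)
    bins = np.digitize(gt, bin_edges)         # stratum index per sample
    per_bin = {}
    for b in np.unique(bins):
        sel = bins == b
        per_bin[int(b)] = float(np.mean(np.abs(gt[sel] - pred[sel])))
    return per_bin

# Toy example with illustrative bin edges at 50 and 500 people.
gt   = [12, 30, 45, 300, 500, 800]
pred = [10, 28, 50, 330, 450, 900]
per_bin = stratified_mae(gt, pred, bin_edges=[50, 500])
print(per_bin)
```

Pooled MAE would hide that the error grows sharply in the densest stratum; the per-bin breakdown exposes it.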
 
If you want your work to be listed as a network for comparison, please send a pull request to us here. The instructions for pull requests are mentioned here.

 


Please cite our paper if you end up using it for your own research.

Bibtex

 
    @inproceedings{10.1145/3474085.3475522,
        author    = {Shivapuja, Sravya Vardhani and Khamkar, Mansi Pradeep and Bajaj, Divij and Ramakrishnan, Ganesh and Sarvadevabhatla, Ravi Kiran},
        title     = {Wisdom of (Binned) Crowds: A Bayesian Stratification Paradigm for Crowd Counting},
        booktitle = {Proceedings of the 29th ACM International Conference on Multimedia},
        year      = {2021},
        location  = {Virtual Event, China},
        publisher = {ACM}
    }

Handwritten Text Retrieval from Unseen Collections


Demo Video

The demo video is available here: Demo Video

Efficient and Generic Interactive Segmentation Framework to Correct Mispredictions during Clinical Evaluation of Medical Images


Bhavani Sambaturu*, Ashutosh Gupta, C.V. Jawahar and Chetan Arora

IIIT Hyderabad       IIT Delhi

[Code]   [Paper]   [Supplementary]   [Demo Video]   [Test Sets]

 MICCAI, 2021

first diag

We propose a novel approach to generate annotations for medical images of several modalities in a semi-automated manner. In contrast to existing methods, ours can be implemented using any semantic segmentation method for medical images, allows correction of multiple labels at the same time, and supports addition of missing labels.

Abstract

Semantic segmentation of medical images is an essential first step in computer-aided diagnosis systems for many applications. However, modern deep neural networks (DNNs) have generally shown inconsistent performance for clinical use. This has led researchers to propose interactive image segmentation techniques where the output of a DNN can be interactively corrected by a medical expert to the desired accuracy. However, these techniques often need separate training data with the associated human interactions, and do not generalize across diseases and types of medical images. In this paper, we suggest a novel conditional inference technique for deep neural networks which takes the intervention by a medical expert as test-time constraints and performs inference conditioned upon these constraints. Our technique is generic and can be used for medical images from any modality. Unlike other methods, our approach can correct multiple structures at the same time and add structures missed at initial segmentation. We report an improvement of 13.3, 12.5, 17.8, 10.2, and 12.4 times in terms of user annotation time compared to full human annotation for the nucleus, multiple cell, liver and tumor, organ, and brain segmentation respectively. In comparison to other interactive segmentation techniques, we report a time saving of 2.8, 3.0, 1.9, 4.4, and 8.6 fold. Our method can be useful to clinicians for diagnosis and post-surgical follow-up with minimal intervention from the medical expert.
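A toy illustration of the test-time constraint idea, under loud assumptions: the paper conditions inference of a full network on expert interventions, whereas this sketch only refines the output logits at expert-clicked pixels by gradient descent on a binary cross-entropy term, leaving all other pixels untouched. The function and parameter names are hypothetical.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def refine_with_clicks(logits, clicks, steps=200, lr=0.5):
    """Test-time refinement: minimize binary cross-entropy at the
    expert-clicked pixels by gradient descent on the output logits.
    clicks: dict {(row, col): desired_label in {0, 1}}."""
    z = logits.copy()
    for _ in range(steps):
        for (r, c), y in clicks.items():
            p = sigmoid(z[r, c])
            z[r, c] -= lr * (p - y)   # d(BCE)/d(logit) = p - y
    return sigmoid(z)

# A misprediction: the network is confidently background (logit -3)
# at pixel (2, 2), but the expert clicks it as foreground.
logits = -3.0 * np.ones((5, 5))
probs = refine_with_clicks(logits, {(2, 2): 1})
```

In the actual method the constraint propagates through the network so whole structures (not single pixels) are corrected; this sketch only shows the constrained-optimization flavor of the inference.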


Paper

  • Paper
    Semi-Automatic Medical Image Annotation

    Bhavani Sambaturu*, Ashutosh Gupta, C.V. Jawahar* and Chetan Arora
    Semi-Automatic Medical Image Annotation, MICCAI, 2021.
    [ PDF ] | [Supplementary] | [BibTeX]

    Updated Soon

Additional Details

Some additional details have been provided which we were unable to put in the paper due to space constraints.

 

Qualitative Results

Multiple Label Segmentation

Our approach has the capability to interactively correct the segmentation of multiple labels at the same time.

multiple label2

Missing Label Segmentation

Our method has the capability to add labels missed at the initial segmentation.

missing label

Unseen Organ Segmentation

We can perform interactive segmentation of organs that the pre-trained model was not trained on.

unseen organs

Network Details

The details of the networks used in our paper are given below.

Detection-aided liver lesion segmentation using deep learning

The network is based on the DRIU architecture. It is a cascaded architecture where the liver is segmented first followed by the lesion.

Hover-Net

A multiple-branch network has been proposed which performs nuclear instance segmentation and classification at the same time. The horizontal and vertical distances of nuclear pixels from their instance centers of mass are leveraged to separate clustered cells.
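The horizontal/vertical distance maps described above can be sketched as follows: for every nucleus pixel, record its (x, y) offset from its instance's center of mass. This is a simplified illustration; Hover-Net additionally normalizes the offsets per instance, which is omitted here.

```python
import numpy as np

def hv_maps(instance_map):
    """Horizontal/vertical maps: for each nucleus pixel, the offset of
    its column/row from its instance's center of mass; background is 0."""
    h_map = np.zeros(instance_map.shape, dtype=float)
    v_map = np.zeros(instance_map.shape, dtype=float)
    for inst_id in np.unique(instance_map):
        if inst_id == 0:                      # 0 = background
            continue
        rows, cols = np.nonzero(instance_map == inst_id)
        cy, cx = rows.mean(), cols.mean()     # center of mass
        h_map[rows, cols] = cols - cx
        v_map[rows, cols] = rows - cy
    return h_map, v_map

# Two touching 3x3 nuclei: the sign of the horizontal map flips at
# their shared border, which is what separates the clustered cells.
inst = np.zeros((3, 6), dtype=int)
inst[:, 0:3] = 1
inst[:, 3:6] = 2
h, v = hv_maps(inst)
```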

Autofocus layer for Semantic Segmentation

An autofocus layer for semantic segmentation has been proposed here. The autofocus layer changes the size of the receptive fields, which is used to obtain features at various scales. The convolutional layers are parallelized with an attention mechanism.
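The autofocus idea can be sketched as parallel dilated convolutions (one per dilation rate, sharing a kernel) merged by attention weights. This is a simplified illustration: the actual layer predicts a spatial attention map from the features, whereas here a single scalar weight per rate is assumed.

```python
import numpy as np
from scipy.ndimage import convolve

def dilate_kernel(k, rate):
    """Insert (rate - 1) zeros between the taps of a square kernel to
    enlarge its receptive field without extra parameters."""
    if rate == 1:
        return k
    out = np.zeros(((k.shape[0] - 1) * rate + 1,) * 2)
    out[::rate, ::rate] = k
    return out

def autofocus_block(x, kernel, rates, attn):
    """Parallel dilated convolutions sharing one kernel, merged by
    normalized attention weights (scalar per rate in this sketch)."""
    attn = np.asarray(attn, dtype=float)
    attn = attn / attn.sum()
    branches = [convolve(x, dilate_kernel(kernel, r)) for r in rates]
    return sum(w * b for w, b in zip(attn, branches))

x = np.random.default_rng(0).normal(size=(16, 16))
k = np.ones((3, 3)) / 9.0
y = autofocus_block(x, k, rates=[1, 2, 3], attn=[0.5, 0.3, 0.2])
```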


Contact

  1. Bhavani Sambaturu
  2. Ashutosh Gupta

Towards Speech to Sign Language Generation


Parul Kapoor, Rudrabha Mukhopadhyay, Sindhu B Hegde, Vinay Namboodiri and C.V. Jawahar

IIT Kanpur       IIIT Hyderabad       Univ. of Bath

[ Code ]   | [ Demo Video ]   | [ Dataset ]

banner

Previous approaches have only attempted to generate sign language from text; we focus on directly converting speech segments into sign language. Our work opens up several assistive technology applications and can help people suffering from hearing loss communicate effectively.

Abstract

We aim to solve the highly challenging task of generating continuous sign language videos solely from speech segments for the first time. Recent efforts in this space have focused on generating such videos from human-annotated text transcripts without considering other modalities. However, replacing speech with sign language is a practical solution when communicating with people suffering from hearing loss. Therefore, we eliminate the need for text as input and design techniques that work for more natural, continuous, freely uttered speech covering an extensive vocabulary. Since the current datasets are inadequate for generating sign language directly from speech, we collect and release the first Indian sign language dataset comprising speech-level annotations, text transcripts, and the corresponding sign-language videos. Next, we propose a multi-tasking transformer network trained to generate a signer's poses from speech segments. With speech-to-text as an auxiliary task and an additional cross-modal discriminator, our model learns to generate continuous sign pose sequences in an end-to-end manner. Extensive experiments and comparisons with other baselines demonstrate the effectiveness of our approach. We also conduct additional ablation studies to analyze the effect of different modules of our network. A demo video containing several results is attached to the supplementary material.


Paper

  • Paper
    Towards Speech to Sign Language Generation

    Parul Kapoor, Rudrabha Mukhopadhyay, Sindhu Hegde, Vinay Namboodiri and C.V. Jawahar
    Towards Speech to Sign Language Generation, Interspeech, 2021.
    [PDF ] | [BibTeX]

    Updated Soon

Demo

--- COMING SOON ---


Dataset

--- COMING SOON ---


Contact

  1. Parul Kapoor
  2. Rudrabha Mukhopadhyay
  3. Sindhu Hegde
