
BoundaryNet: An Attentive Deep Network with Fast Marching Distance Maps for Semi-automatic Layout Annotation


Abstract

In this work, we propose BoundaryNet, a novel resizing-free approach for high-precision semi-automatic layout annotation. The variable-sized user selected region of interest is first processed by an attention-guided skip network. The network optimization is guided via Fast Marching distance maps to obtain a good quality initial boundary estimate and an associated feature representation. These outputs are processed by a Residual Graph Convolution Network optimized using Hausdorff loss to obtain the final region boundary. Results on a challenging image manuscript dataset demonstrate that BoundaryNet outperforms strong baselines and produces high-quality semantic region boundaries. Qualitatively, our approach generalizes across multiple document image datasets containing different script systems and layouts, all without additional fine-tuning.
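As a rough illustration of the Fast Marching supervision idea (a sketch, not the paper's implementation): on a uniform pixel grid with a constant speed function, the Fast Marching distance from a region boundary is closely approximated by a Euclidean distance transform, which can then be turned into per-pixel weights that concentrate the mask loss near the boundary. All names and the weighting scheme below are illustrative assumptions.

```python
import numpy as np
from scipy.ndimage import distance_transform_edt

def boundary_distance_map(mask):
    """Approximate distance (in pixels) from every pixel to the region
    boundary. For constant speed, the Fast Marching solution on a grid
    is well approximated by the Euclidean distance transform."""
    mask = mask.astype(bool)
    interior = distance_transform_edt(mask)   # distance to nearest background
    exterior = distance_transform_edt(~mask)  # distance to nearest foreground
    # Unsigned distance to the boundary, measured from both sides.
    return np.where(mask, interior, exterior)

# Toy example: a 7x7 square region inside a 15x15 image.
mask = np.zeros((15, 15), dtype=bool)
mask[4:11, 4:11] = True
dmap = boundary_distance_map(mask)
# Illustrative loss weights: largest near the boundary, decaying inward/outward.
weights = np.exp(-dmap / 5.0)
```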

Architecture

The architecture of BoundaryNet (top) and its sub-components (bottom). The variable-sized H×W input image is processed by Mask-CNN (MCNN), which predicts a region mask estimate and an associated region class. The mask's boundary is determined using a contourization procedure (light brown) applied to the estimate from MCNN. M boundary points are sampled on the boundary. A graph is constructed with the points as nodes and edge connectivity defined by the k-hop neighborhoods of each point. The spatial coordinates of a boundary point location p = (x, y) and corresponding backbone skip attention features from MCNN f^r are used as node features for the boundary point. The feature-augmented contour graph G = (F, A) is iteratively processed by Anchor GCN to obtain the final output contour points defining the region boundary.
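The contour-to-graph step above can be sketched as follows. This is a minimal illustration under stated assumptions: M nodes are sampled uniformly along a closed contour and connected in a ring by their k-hop neighbors; the node features here are only the (x, y) coordinates, whereas BoundaryNet also appends MCNN skip-attention features.

```python
import numpy as np

def contour_graph(contour_xy, M=40, k=3):
    """Sample M nodes along a closed contour and build a k-hop ring
    adjacency. contour_xy: (N, 2) array of ordered boundary points.
    Returns node features F and adjacency matrix A."""
    N = len(contour_xy)
    idx = np.linspace(0, N, M, endpoint=False).astype(int)
    F = contour_xy[idx]                       # (M, 2) node features
    A = np.zeros((M, M), dtype=np.float32)
    for i in range(M):
        for h in range(1, k + 1):             # connect along the closed ring
            A[i, (i + h) % M] = 1.0
            A[i, (i - h) % M] = 1.0
    return F, A

# Toy contour: a unit circle discretized at 200 points.
t = np.linspace(0, 2 * np.pi, 200, endpoint=False)
circle = np.stack([np.cos(t), np.sin(t)], axis=1)
F, A = contour_graph(circle, M=40, k=3)
```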

Instance-level Results from Bounding Box Supervision


Page-level Results on Historical Documents


Interaction with Human Annotators

A small-scale experiment was conducted with real human annotators in the loop to determine BoundaryNet's utility in a practical setting. The annotations for a set of images were sourced using the HINDOLA document annotation system in three distinct modes: Manual Mode (hand-drawn contour generation and region labelling), Fully Automatic Mode (using an existing instance segmentation approach, Indiscapes, with post-correction using the annotation system), and Semi-Automatic Mode (manual input of region bounding boxes which are subsequently sent to BoundaryNet, followed by post-correction). For each mode, we recorded the end-to-end annotation time at the per-document level, including manual correction time (violin plots shown in the figure). BoundaryNet outperforms the other approaches by generating superior-quality contours which minimize the post-inference manual correction burden.

Citation

    @inproceedings{trivedi2021boundarynet,
        title     = {BoundaryNet: An Attentive Deep Network with Fast Marching Distance Maps for Semi-automatic Layout Annotation},
        author    = {Trivedi, Abhishek and Sarvadevabhatla, Ravi Kiran},
        booktitle = {International Conference on Document Analysis and Recognition},
        year      = {2021}
    }

Contact

If you have any questions, please contact Dr. Ravi Kiran Sarvadevabhatla.


Wisdom of (Binned) Crowds: A Bayesian Stratification Paradigm for Crowd Counting


Abstract

The idea behind our work is to tackle the high variance of error that is ignored by de facto statistical performance measures such as MSE and MAE in the crowd counting domain. Our recipe involves finding strata that are optimal in a Bayesian sense and then systematically modifying the standard crowd counting pipeline to incorporate variance reduction at each step.
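The stratified evaluation idea can be sketched as follows: partition the test samples into count strata and report a per-stratum MAE, so that a few very crowded images do not dominate a single pooled score. The bin edges below are illustrative placeholders, not the Bayesian-optimal strata the paper computes.

```python
import numpy as np

def stratified_mae(gt_counts, pred_counts, bin_edges):
    """Per-stratum MAE: partition samples by ground-truth count and
    evaluate each bin separately instead of one pooled MAE."""
    gt = np.asarray(gt_counts, dtype=float)
    pred = np.asarray(pred_counts, dtype=float)
    bins = np.digitize(gt, bin_edges)         # stratum index per sample
    per_bin = {}
    for b in np.unique(bins):
        sel = bins == b
        per_bin[int(b)] = float(np.mean(np.abs(gt[sel] - pred[sel])))
    return per_bin

# Toy example with illustrative bin edges at 50 and 500 people.
gt   = [12, 30, 45, 300, 500, 800]
pred = [10, 28, 50, 330, 450, 900]
per_bin = stratified_mae(gt, pred, bin_edges=[50, 500])
print(per_bin)
```

Pooled MAE would hide that the error grows sharply in the densest stratum; the per-bin breakdown exposes it.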
 
If you want your work to be listed as a network for comparison, please send a pull request to us here. The instructions for pull requests are mentioned here.

 


Please cite our paper if you end up using it for your own research.

Bibtex

 
    @inproceedings{10.1145/3474085.3475522,
        author    = {Shivapuja, Sravya Vardhani and Khamkar, Mansi Pradeep and Bajaj, Divij and Ramakrishnan, Ganesh and Sarvadevabhatla, Ravi Kiran},
        title     = {Wisdom of (Binned) Crowds: A Bayesian Stratification Paradigm for Crowd Counting},
        booktitle = {Proceedings of the 29th ACM International Conference on Multimedia},
        year      = {2021},
        location  = {Virtual Event, China},
        publisher = {ACM}
    }

Handwritten Text Retrieval from Unseen Collections


Demo Video

The demo video is available here: Demo Video

Efficient and Generic Interactive Segmentation Framework to Correct Mispredictions during Clinical Evaluation of Medical Images


Bhavani Sambaturu*, Ashutosh Gupta, C.V. Jawahar and Chetan Arora

IIIT Hyderabad       IIT Delhi

[Code]   [Paper]   [Supplementary]   [Demo Video]   [Test Sets]

 MICCAI, 2021

first diag

We propose a novel approach to generate annotations for medical images of several modalities in a semi-automated manner. In contrast to existing methods, ours can be implemented using any semantic segmentation method for medical images, allows correction of multiple labels at the same time, and supports addition of missing labels.

Abstract

Semantic segmentation of medical images is an essential first step in computer-aided diagnosis systems for many applications. However, modern deep neural networks (DNNs) have generally shown inconsistent performance for clinical use. This has led researchers to propose interactive image segmentation techniques where the output of a DNN can be interactively corrected by a medical expert to the desired accuracy. However, these techniques often need separate training data with the associated human interactions, and do not generalize across diseases and types of medical images. In this paper, we suggest a novel conditional inference technique for deep neural networks which takes the intervention by a medical expert as test-time constraints and performs inference conditioned upon these constraints. Our technique is generic and can be used for medical images from any modality. Unlike other methods, our approach can correct multiple structures at the same time and add structures missed at initial segmentation. We report an improvement of 13.3, 12.5, 17.8, 10.2, and 12.4 times in terms of user annotation time compared to full human annotation for the nucleus, multiple cell, liver and tumor, organ, and brain segmentation respectively. In comparison to other interactive segmentation techniques, we report a time saving of 2.8, 3.0, 1.9, 4.4, and 8.6 fold. Our method can be useful to clinicians for diagnosis and post-surgical follow-up with minimal intervention from the medical expert.
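A toy illustration of the test-time constraint idea, under loud assumptions: the paper conditions inference of a full network on expert interventions, whereas this sketch only refines the output logits at expert-clicked pixels by gradient descent on a binary cross-entropy term, leaving all other pixels untouched. The function and parameter names are hypothetical.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def refine_with_clicks(logits, clicks, steps=200, lr=0.5):
    """Test-time refinement: minimize binary cross-entropy at the
    expert-clicked pixels by gradient descent on the output logits.
    clicks: dict {(row, col): desired_label in {0, 1}}."""
    z = logits.copy()
    for _ in range(steps):
        for (r, c), y in clicks.items():
            p = sigmoid(z[r, c])
            z[r, c] -= lr * (p - y)   # d(BCE)/d(logit) = p - y
    return sigmoid(z)

# A misprediction: the network is confidently background (logit -3)
# at pixel (2, 2), but the expert clicks it as foreground.
logits = -3.0 * np.ones((5, 5))
probs = refine_with_clicks(logits, {(2, 2): 1})
```

In the actual method the constraint propagates through the network so whole structures (not single pixels) are corrected; this sketch only shows the constrained-optimization flavor of the inference.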


Paper

  • Paper
    Semi-Automatic Medical Image Annotation

    Bhavani Sambaturu*, Ashutosh Gupta, C.V. Jawahar* and Chetan Arora
    Semi-Automatic Medical Image Annotation, MICCAI, 2021.
    [ PDF ] | [Supplementary] | [BibTeX]

    Updated Soon

Additional Details

Some additional details have been provided which we were unable to put in the paper due to space constraints.

 

Qualitative Results

Multiple Label Segmentation

Our approach has the capability to interactively correct the segmentation of multiple labels at the same time.

multiple label2

Missing Label Segmentation

Our method has the capability to add labels missed at the initial segmentation.

missing label

Unseen Organ Segmentation

We can perform interactive segmentation of organs that the pre-trained model was not trained on.

unseen organs

Network Details

The details of the networks used in our paper are given below.

Detection-aided liver lesion segmentation using deep learning

The network is based on the DRIU architecture. It is a cascaded architecture where the liver is segmented first followed by the lesion.

Hover-Net

A multiple-branch network has been proposed which performs nuclear instance segmentation and classification at the same time. The horizontal and vertical distances of nuclear pixels from their instance centers of mass are leveraged to separate clustered cells.
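The horizontal/vertical distance maps described above can be sketched as follows: for every nucleus pixel, record its (x, y) offset from its instance's center of mass. This is a simplified illustration; Hover-Net additionally normalizes the offsets per instance, which is omitted here.

```python
import numpy as np

def hv_maps(instance_map):
    """Horizontal/vertical maps: for each nucleus pixel, the offset of
    its column/row from its instance's center of mass; background is 0."""
    h_map = np.zeros(instance_map.shape, dtype=float)
    v_map = np.zeros(instance_map.shape, dtype=float)
    for inst_id in np.unique(instance_map):
        if inst_id == 0:                      # 0 = background
            continue
        rows, cols = np.nonzero(instance_map == inst_id)
        cy, cx = rows.mean(), cols.mean()     # center of mass
        h_map[rows, cols] = cols - cx
        v_map[rows, cols] = rows - cy
    return h_map, v_map

# Two touching 3x3 nuclei: the sign of the horizontal map flips at
# their shared border, which is what separates the clustered cells.
inst = np.zeros((3, 6), dtype=int)
inst[:, 0:3] = 1
inst[:, 3:6] = 2
h, v = hv_maps(inst)
```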

Autofocus layer for Semantic Segmentation

An autofocus layer for semantic segmentation has been proposed here. The autofocus layer changes the size of the receptive fields, which is used to obtain features at various scales. The convolutional layers are parallelized with an attention mechanism.
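The autofocus idea can be sketched as parallel dilated convolutions (one per dilation rate, sharing a kernel) merged by attention weights. This is a simplified illustration: the actual layer predicts a spatial attention map from the features, whereas here a single scalar weight per rate is assumed.

```python
import numpy as np
from scipy.ndimage import convolve

def dilate_kernel(k, rate):
    """Insert (rate - 1) zeros between the taps of a square kernel to
    enlarge its receptive field without extra parameters."""
    if rate == 1:
        return k
    out = np.zeros(((k.shape[0] - 1) * rate + 1,) * 2)
    out[::rate, ::rate] = k
    return out

def autofocus_block(x, kernel, rates, attn):
    """Parallel dilated convolutions sharing one kernel, merged by
    normalized attention weights (scalar per rate in this sketch)."""
    attn = np.asarray(attn, dtype=float)
    attn = attn / attn.sum()
    branches = [convolve(x, dilate_kernel(kernel, r)) for r in rates]
    return sum(w * b for w, b in zip(attn, branches))

x = np.random.default_rng(0).normal(size=(16, 16))
k = np.ones((3, 3)) / 9.0
y = autofocus_block(x, k, rates=[1, 2, 3], attn=[0.5, 0.3, 0.2])
```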


Contact

  1. Bhavani Sambaturu
  2. Ashutosh Gupta

Towards Speech to Sign Language Generation


Parul Kapoor, Rudrabha Mukhopadhyay, Sindhu B Hegde, Vinay Namboodiri and C.V. Jawahar

IIT Kanpur       IIIT Hyderabad       Univ. of Bath

[ Code ]   | [ Demo Video ]   | [ Dataset ]

banner

Previous approaches have only attempted to generate sign language from text; we focus on directly converting speech segments into sign language. Our work opens up several assistive technology applications and can help people suffering from hearing loss communicate effectively.

Abstract

We aim to solve the highly challenging task of generating continuous sign language videos solely from speech segments for the first time. Recent efforts in this space have focused on generating such videos from human-annotated text transcripts without considering other modalities. However, replacing speech with sign language is a practical solution when communicating with people suffering from hearing loss. Therefore, we eliminate the need for text as input and design techniques that work for more natural, continuous, freely uttered speech covering an extensive vocabulary. Since the current datasets are inadequate for generating sign language directly from speech, we collect and release the first Indian sign language dataset comprising speech-level annotations, text transcripts, and the corresponding sign-language videos. Next, we propose a multi-tasking transformer network trained to generate a signer's poses from speech segments. With speech-to-text as an auxiliary task and an additional cross-modal discriminator, our model learns to generate continuous sign pose sequences in an end-to-end manner. Extensive experiments and comparisons with other baselines demonstrate the effectiveness of our approach. We also conduct additional ablation studies to analyze the effect of different modules of our network. A demo video containing several results is attached to the supplementary material.


Paper

  • Paper
    Towards Speech to Sign Language Generation

    Parul Kapoor, Rudrabha Mukhopadhyay, Sindhu Hegde, Vinay Namboodiri and C.V. Jawahar
    Towards Speech to Sign Language Generation, Interspeech, 2021.
    [PDF ] | [BibTeX]

    Updated Soon

Demo

--- COMING SOON ---


Dataset

--- COMING SOON ---


Contact

  1. Parul Kapoor
  2. Rudrabha Mukhopadhyay
  3. Sindhu Hegde
