A Context-aware Visual Attention-based training pipeline for Object Detection from a Webpage screenshot!

Last update: Jan 01, 2023

Overview

CoVA: Context-aware Visual Attention for Webpage Information Extraction

Abstract

Webpage information extraction (WIE) is an important step to create knowledge bases. For this, classical WIE methods leverage the Document Object Model (DOM) tree of a website. However, use of the DOM tree poses significant challenges as context and appearance are encoded in an abstract manner. To address this challenge, we propose to reformulate WIE as a context-aware Webpage Object Detection task. Specifically, we develop a Context-aware Visual Attention-based (CoVA) detection pipeline which combines appearance features with syntactical structure from the DOM tree. To study the approach, we collect a new large-scale dataset of e-commerce websites for which we manually annotate every web element with four labels: product price, product title, product image and background. On this dataset, we show that the proposed CoVA approach is a new challenging baseline which improves upon prior state-of-the-art methods.

CoVA Dataset

We labeled 7,740 webpages spanning 408 domains (Amazon, Walmart, Target, etc.). Each of these webpages contains exactly one labeled price, title, and image. All other web elements are labeled as background. On average, there are 90 web elements in a webpage.

Webpage screenshots and bounding boxes can be obtained here

Train-Val-Test split

We create a cross-domain split which ensures that the train, val and test sets each contain webpages from different domains, so no domain appears in more than one set. Specifically, we construct a 3 : 1 : 1 split based on the number of distinct domains. We observed that the top-5 domains (based on number of samples) were Amazon, eBay, Walmart, Etsy, and Target. So, we created 5 different splits for 5-fold cross validation such that each of the major domains is present in the test data of exactly one of the 5 splits. These splits can be accessed here
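The splitting idea above can be sketched in a few lines of Python. This is a minimal illustration, not the script used to produce the released splits: the function name `cross_domain_folds`, the `page_domains` mapping, and the rotation scheme (fold k tests on chunk k and validates on the next chunk) are all assumptions made for the example.

```python
import random
from collections import defaultdict


def cross_domain_folds(page_domains, major_domains, n_folds=5, seed=0):
    """Build n_folds (train, val, test) splits with disjoint domains.

    page_domains:  dict mapping a page id to its domain (hypothetical format).
    major_domains: the largest domains; each one seeds a different test chunk.
    """
    rng = random.Random(seed)
    pages_by_domain = defaultdict(list)
    for page_id, domain in page_domains.items():
        pages_by_domain[domain].append(page_id)

    # One chunk per fold, each starting with a different major domain.
    majors = list(major_domains)[:n_folds]
    chunks = [[d] for d in majors] + [[] for _ in range(n_folds - len(majors))]

    # Deal the remaining domains round-robin so chunks hold similar domain counts.
    rest = sorted(d for d in pages_by_domain if d not in majors)
    rng.shuffle(rest)
    for i, domain in enumerate(rest):
        chunks[i % n_folds].append(domain)

    def pages(domain_chunks):
        return [p for chunk in domain_chunks
                for d in chunk
                for p in pages_by_domain.get(d, [])]

    folds = []
    for k in range(n_folds):
        val_idx = (k + 1) % n_folds
        train = [chunks[j] for j in range(n_folds) if j not in (k, val_idx)]
        # With 5 chunks this gives a 3 : 1 : 1 ratio over distinct domains.
        folds.append((pages(train), pages([chunks[val_idx]]), pages([chunks[k]])))
    return folds
```

Because whole domains are assigned to chunks, a page can never leak between train, val and test within a fold, which is the property the cross-domain evaluation relies on.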

CoVA End-to-end Training Pipeline

Our Context-Aware Visual Attention-based end-to-end pipeline for Webpage Object Detection (CoVA) aims to learn a function f that predicts labels y = [y₁, y₂, ..., y_N] for a webpage containing N elements. The input to CoVA consists of:

a screenshot of a webpage,
a list of bounding boxes [x, y, w, h] of the web elements, and
neighborhood information for each element obtained from the DOM tree.
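As a rough illustration of these three inputs, one webpage sample could be held in a small container like the one below. The class `WebpageSample`, its field names, and the helper `to_corners` are hypothetical and not taken from the CoVA code; the sketch also assumes that (x, y) is the top-left corner of a box in pixels.

```python
from dataclasses import dataclass
from typing import List, Tuple

BBox = Tuple[float, float, float, float]  # (x, y, w, h), assumed top-left origin


@dataclass
class WebpageSample:
    screenshot_path: str        # path to the webpage screenshot image
    bboxes: List[BBox]          # one box per web element, N in total
    neighbors: List[List[int]]  # neighbors[i] holds the indices in N_i

    def validate(self) -> None:
        """Check that the boxes and neighbor lists describe the same N elements."""
        n = len(self.bboxes)
        if len(self.neighbors) != n:
            raise ValueError("need one neighbor list per web element")
        for (_, _, w, h) in self.bboxes:
            if w <= 0 or h <= 0:
                raise ValueError("bounding boxes must have positive size")
        for nbrs in self.neighbors:
            if any(k < 0 or k >= n for k in nbrs):
                raise ValueError("neighbor index out of range")


def to_corners(bbox: BBox) -> BBox:
    """Convert (x, y, w, h) to (x1, y1, x2, y2) corner format."""
    x, y, w, h = bbox
    return (x, y, x + w, y + h)
```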

This information is processed in four stages:

the graph representation extraction for the webpage,
the Representation Network (RN),
the Graph Attention Network (GAT), and
a fully connected (FC) layer.

The graph representation extraction computes, for every web element i, its set of K neighboring web elements N_i. The RN consists of a Convolutional Neural Net (CNN) and a positional encoder that together learn a visual representation v_i for each web element i ∈ {1, ..., N}. The GAT combines the visual representation v_i of the web element i to be classified with those of its neighbors, i.e., v_k ∀k ∈ N_i, to compute the contextual representation c_i for web element i. Finally, the visual and contextual representations of the web element are concatenated and passed through the FC layer to obtain the classification output.
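The attention step can be made concrete with a small pure-Python sketch. It follows the usual single-head graph attention scoring (LeakyReLU of a learned vector applied to the concatenated pair, then a softmax over neighbors), but it is a simplification and not the CoVA implementation: the learned linear projection is omitted, and the names `contextual_representation`, `classifier_input` and the attention vector `a` are assumptions for the example.

```python
import math


def leaky_relu(x, slope=0.2):
    return x if x > 0 else slope * x


def softmax(scores):
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def contextual_representation(v_i, neighbor_vs, a):
    """Attention-weighted sum of neighbor representations.

    v_i:         visual representation of element i (list of d floats)
    neighbor_vs: representations v_k for k in N_i (K lists of d floats)
    a:           attention vector of length 2 * d (learned in a real model)
    Returns (c_i, attention weights over the K neighbors).
    """
    # Score each neighbor from the concatenation [v_i, v_k].
    scores = [leaky_relu(sum(w * x for w, x in zip(a, v_i + v_k)))
              for v_k in neighbor_vs]
    alphas = softmax(scores)
    d = len(v_i)
    c_i = [sum(alpha * v_k[j] for alpha, v_k in zip(alphas, neighbor_vs))
           for j in range(d)]
    return c_i, alphas


def classifier_input(v_i, c_i):
    """Concatenate visual and contextual representations for the FC layer."""
    return v_i + c_i
```

With an all-zero attention vector every neighbor gets the same weight, so c_i reduces to the mean of the neighbor representations; a trained vector shifts weight toward the neighbors that are most informative for the element being classified.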

Experimental Results

Cross Domain Accuracy (mean ± standard deviation) for 5-fold cross validation.

NOTE: Cross Domain means we train the model on some web domains and test it on completely different domains to evaluate the generalizability of the models to unseen web templates.
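For reference, a "mean ± standard deviation" entry can be computed from per-fold accuracies as sketched below. The helper `summarize_folds` is hypothetical, the sample standard deviation (`statistics.stdev`) is an assumption since the text does not say which estimator was used, and the numbers in the usage line are placeholders, not results from the paper.

```python
import statistics


def summarize_folds(fold_accuracies):
    """Return (mean, sample standard deviation) over per-fold accuracies."""
    mean = statistics.mean(fold_accuracies)
    std = statistics.stdev(fold_accuracies)  # assumption: sample (n - 1) estimator
    return mean, std


# Placeholder accuracies for five folds, for illustration only.
mean, std = summarize_folds([90.0, 92.0, 94.0, 96.0, 98.0])
print(f"{mean:.2f} ± {std:.2f}")
```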

Attention Visualizations!

In the attention visualizations, the red border denotes the web element being classified, and its context elements are shaded green, with the intensity of the shade denoting the attention score. In (a), the price receives a much higher score than the other contexts. In (b), the title and image are scored higher than the other contexts for the price.

Cite

If you find this useful in your research, please cite our ArXiv pre-print:

Coming soon!

Owner

Keval Morabia
