GIF captcha dataset from a university's course-selection system + baseline model + upstream and downstream tools

Overview

elective-dataset-2021spring

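A GIF captcha dataset from a university's Spring 2021 course-selection system (29,338 images), a baseline model with 98.4% accuracy, and related upstream and downstream tools.

The dataset is licensed under the Creative Commons Attribution-NonCommercial 4.0 International License.

The baseline model and the upstream and downstream tools are licensed under the MIT License.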

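Dataset

The dataset/ directory contains all collected labeled captchas, 29,338 in total.

  • dataset/manual: manually labeled captcha GIFs. Every label was verified against elective, so all of them are correct. 5,471 images.
  • dataset/auto-corr and dataset/auto-fail-tagged: captcha GIFs labeled automatically by the model. auto-corr holds the captchas that were recognized correctly (they passed elective verification); auto-fail-tagged holds the captchas that were recognized incorrectly and then relabeled by hand (correctness is not guaranteed for this part). 22,931 (correct) + 936 (incorrect) images.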
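After downloading and extracting everything, the per-directory counts above can be checked with a short script. This is a minimal sketch: it assumes the archives extract to dataset/manual, dataset/auto-corr, and dataset/auto-fail-tagged and that the images carry a .gif extension, neither of which is confirmed here. The function names are illustrative.

```python
from pathlib import Path

# Counts as stated in this README: 5471 + 22931 + 936 = 29338.
EXPECTED = {"manual": 5471, "auto-corr": 22931, "auto-fail-tagged": 936}


def count_gifs(dataset_root):
    """Return {subdirectory name: number of .gif files found under it}."""
    root = Path(dataset_root)
    counts = {}
    for name in EXPECTED:
        subdir = root / name
        # A missing subdirectory (e.g. an archive not yet extracted) counts as 0.
        counts[name] = sum(1 for _ in subdir.rglob("*.gif")) if subdir.is_dir() else 0
    return counts


def mismatches(counts):
    """Return the subdirectory names whose count differs from the README figure."""
    return [name for name, n in EXPECTED.items() if counts.get(name) != n]
```

Calling mismatches(count_gifs("dataset")) returns an empty list when all three parts are complete.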

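Note that, because of GitHub's limits:

  • auto-fail-tagged is stored in the repository as a single 7-zip archive;
  • manual is stored in the repository as 7 split 7-zip volumes of at most 48 MB each;
  • auto-corr is not stored in the repository; it is compressed into 14 split 7-zip volumes of at most 95 MB each and published on the Releases page.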

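Baseline Model

The baseline/ directory contains a simple captcha recognition model.

The model recognizes a captcha in several stages: key-frame extraction, OpenCV-based image enhancement, and a CNN-based classifier.

Place the training and test images in set-train and set-test respectively, then run train.py to train. Training takes a few minutes on a single TITAN RTX.

checkpoints/model_29.pth, trained on roughly ten thousand images, reaches 98.4% overall accuracy.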
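The overall figure counts a captcha as correct only when all four of its characters are right. To get a feel for what that implies per character, the sketch below converts between the two numbers. It assumes character errors are independent, which is our simplification for illustration and not something measured in this project.

```python
def per_char_accuracy(overall, n_chars=4):
    """Per-character accuracy implied by an overall accuracy.

    Assumes each of the n_chars characters is recognized independently,
    so overall = per_char ** n_chars.
    """
    return overall ** (1.0 / n_chars)


def overall_accuracy(per_char, n_chars=4):
    """Overall (all characters correct) accuracy under the same assumption."""
    return per_char ** n_chars


if __name__ == "__main__":
    print(per_char_accuracy(0.984))
```

Under that assumption, 98.4% overall corresponds to about 99.6% per character.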

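predict_bootstrap.py tests the current model against the live elective system. Images whose predicted label is verified as correct are saved, with their label, to the bootstrap_img_succ directory; incorrectly recognized images go to the bootstrap_img_fail directory.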
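The sorting step described above can be sketched as follows. This is not the code of predict_bootstrap.py: predict and verify are hypothetical callables standing in for the model and the elective check, and the file-naming scheme is an assumption made for the example. Only the two directory names come from this README.

```python
from pathlib import Path


def bootstrap_once(gif_bytes, predict, verify, out_root="."):
    """Label one captcha with the model and file it by verification result.

    predict(gif_bytes) -> str and verify(gif_bytes, label) -> bool are
    hypothetical stand-ins for the model and the elective verification.
    Returns (ok, path of the saved file).
    """
    label = predict(gif_bytes)
    ok = verify(gif_bytes, label)
    target_dir = Path(out_root) / ("bootstrap_img_succ" if ok else "bootstrap_img_fail")
    target_dir.mkdir(parents=True, exist_ok=True)
    # Assumed naming: predicted label plus a running index to avoid collisions.
    index = sum(1 for _ in target_dir.glob("*.gif"))
    path = target_dir / f"{label}_{index}.gif"
    path.write_bytes(gif_bytes)
    return ok, path
```

Passing the model and the checker in as parameters keeps the sketch runnable offline with stub functions.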

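Upstream and Downstream Tools

  • crawl/: a crowdsourced captcha labeling platform. It crawls captchas from elective, lets multiple users label concurrently, verifies each label, and puts the correct ones in the img_correct directory. Captchas that fail verification are discarded. This was an early design mistake: it biases the dataset's distribution away from the true distribution.
  • retag/: a tool for manually relabeling captchas the model got wrong. It reads misrecognized images from bootstrap_img_fail and, after a person enters the correct label, moves them to bootstrap_img_fail_tagged.
  • serve/: an HTTP RPC server that provides online captcha recognition. Send POST /fire with the base64-encoded captcha GIF to have it recognized.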
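A client call to the recognition server might look like the sketch below. Only the POST /fire endpoint and the base64 encoding come from this README. The base URL, the choice to send the base64 text as the raw request body, and the Content-Type header are all assumptions; check serve/ for the format the server actually expects.

```python
import base64
import urllib.request


def build_fire_request(gif_bytes, base_url="http://localhost:8000"):
    """Build (but do not send) a POST /fire request for one captcha GIF.

    base_url is a placeholder. Sending the base64 text as the raw body
    is an assumed wire format, not one documented in this README.
    """
    payload = base64.b64encode(gif_bytes)
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/fire",
        data=payload,
        method="POST",
        headers={"Content-Type": "text/plain"},
    )
```

The request can then be sent with urllib.request.urlopen(build_fire_request(data)). The response format is not described here, so the sketch leaves parsing to the caller.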

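Data Processing

First, we set up the crowdsourced labeling platform, where volunteers labeled more than five thousand captchas in total.

With that data, we used OpenCV for simple image enhancement, binarization, character segmentation, and cropping, then threw together a simple CNN to do the recognition. After some casual hyperparameter tuning, the model's overall (all four characters) accuracy approached 95%.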
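To make the binarization, segmentation, and cropping steps concrete, here is a plain-Python stand-in. It is not the project's OpenCV code: the fixed threshold and the split on empty columns are simple techniques substituted for illustration, and they only work when characters do not touch.

```python
def binarize(gray, threshold=128):
    """Turn a grayscale image (rows of 0-255 ints) into 1 for ink, 0 for background.

    Assumes dark ink on a light background; the threshold is arbitrary.
    """
    return [[1 if px < threshold else 0 for px in row] for row in gray]


def split_columns(binary):
    """Return (start, end) column spans, end exclusive, for each run of inked columns."""
    width = len(binary[0]) if binary else 0
    has_ink = [any(row[x] for row in binary) for x in range(width)]
    spans, start = [], None
    for x, ink in enumerate(has_ink):
        if ink and start is None:
            start = x  # a character begins
        elif not ink and start is not None:
            spans.append((start, x))  # a character ends at the first empty column
            start = None
    if start is not None:
        spans.append((start, width))
    return spans


def crop(binary, span):
    """Cut one character out of the binary image by its column span."""
    start, end = span
    return [row[start:end] for row in binary]
```

Each cropped character would then be resized and passed to the classifier separately.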

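Next, we used this model to bootstrap the dataset: crawl a captcha, have the model recognize it, verify the result, and label the misrecognized ones by hand. This makes it easy to grow the dataset and thereby improve the model.

After more casual tuning, the model's overall accuracy reached 98.4%. We stopped optimizing there because further gains would matter little. Because PyTorch is troublesome to install, which makes the model hard to deploy on users' devices, we implemented an HTTP API for cloud-side recognition.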

Related Work

by Elector Quartet (in reverse lexicographic order: @xmcp, @SpiritedAwayCN, @Rabbit, @gzz)
