An Unsupervised Detection Framework for Chinese Jargons in the Darknet

This repo is the Python 3 implementation of the paper "An Unsupervised Detection Framework for Chinese Jargons in the Darknet", published in the Proceedings of the Fifteenth ACM International Conference on Web Search and Data Mining (WSDM '22).

Introduction

This project proposes a Chinese jargon detection framework based on unsupervised learning.

Requirements

pip install -r requirements.txt

Data

Due to the sensitivity of darknet information, we do not distribute the dataset directly. Some samples of the dataset are available in /dataset/sample.csv, and readers can request the raw corpus using the contact information below.

Please contact Liang Ke ([email protected]) for the Darknet corpus dataset.
The Modern Chinese Dictionary (the 7th edition) that we used for cross-corpus comparison is from here.

Code

Preprocess the raw corpus using preprocess.py to get the clean corpus.
Find out-of-vocabulary words using newWordsDiscovey.py, and add them to the tokenizer dictionary.
Pretrain the word-based DC-BERT model on the clean corpus using pretrain.py.
Generate word embeddings with the pretrained DC-BERT using genEmbedding.py.
Construct seed criminal keywords with findSeedKeywords.py. We provide an example list of seed criminal keywords for reference; you can delete or add words depending on your task.
Find jargon candidates (words that are related to the relevant cybercrimes and are very likely to be jargons) with findCandidate.py.
Finally, you can obtain real darknet Chinese jargons detected by our framework using findJargon.py.
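
To make the candidate search step more concrete, here is a minimal sketch of how words could be ranked against seed keywords using their embeddings. This is an illustration only and is not taken from findCandidate.py: the use of cosine similarity, the function names cosine and find_candidates, the threshold value, and the toy vectors are all assumptions.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def find_candidates(embeddings, seeds, threshold=0.8):
    """Hypothetical helper: return (word, score) pairs for non-seed words whose
    best similarity to any seed keyword is at least `threshold`, best first."""
    seed_vecs = [embeddings[s] for s in seeds if s in embeddings]
    scored = []
    for word, vec in embeddings.items():
        if word in seeds:
            continue  # seeds are known keywords, not candidates
        best = max((cosine(vec, sv) for sv in seed_vecs), default=0.0)
        if best >= threshold:
            scored.append((word, best))
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Toy 2-dimensional embeddings standing in for real DC-BERT word vectors.
toy_embeddings = {
    "seed_a": [1.0, 0.0],
    "word_x": [0.9, 0.1],  # points in nearly the same direction as seed_a
    "word_y": [0.0, 1.0],  # orthogonal to seed_a
}
candidates = find_candidates(toy_embeddings, {"seed_a"})
```

In practice the embeddings would come from genEmbedding.py and the seeds from findSeedKeywords.py; the threshold here is arbitrary and would need tuning on your own corpus.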

Citation

Waiting for the camera-ready version.
