Convert a collection of features to a fixed-dimensional matrix using the hashing trick.

Overview

FeatureHasher

Convert a collection of features to a fixed-dimensional matrix using the hashing trick.

Note, this requires Jina>=2.2.4.

Example

Here I use FeatureHasher to hash each sentence of Pride and Prejudice into a 128-dim vector, and then use .match to find top-K similar sentences.

from jina import Document, DocumentArray, Flow

# load 
   
d = Document(uri='https://www.gutenberg.org/files/1342/1342-0.txt').convert_uri_to_text()

# cut into non-empty sentences store in a DA
da = DocumentArray(Document(text=s.strip()) for s in d.text.split('\n') if s.strip())

# use FeatureHasher in a Flow
f = Flow().add(uses='jinahub://FeatureHasher')

embed_da = DocumentArray()
with f:
    f.post('/', da, on_done=lambda req: embed_da.extend(req.docs), show_progress=True)

print('self-matching...')
embed_da.match(embed_da, exclude_self=True, limit=5, normalization=(1, 0))
print('total sentences: ', len(embed_da))
for d in embed_da:
    print(d.text)
    for m in d.matches:
        print(m.scores['cosine'], m.text)
    input()
           [email protected][I]:🎉 Flow is ready to use!
	🔗 Protocol: 		GRPC
	🏠 Local access:	0.0.0.0:52628
	🔒 Private network:	192.168.178.31:52628
	🌐 Public address:	217.70.138.123:52628
⠹       DONE ━╸━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 0:00:01 100% ETA: 0 seconds 40 steps done in 1 second
total sentences:  12153
The Project Gutenberg eBook of Pride and Prejudice, by Jane Austen

   
     *** END OF THE PROJECT GUTENBERG EBOOK PRIDE AND PREJUDICE ***

    
      *** START OF THE PROJECT GUTENBERG EBOOK PRIDE AND PREJUDICE ***

     
       production, promotion and distribution of Project Gutenberg-tm

      
        Pride and Prejudice

       
         By Jane Austen This eBook is for the use of anyone anywhere in the United States and 
        
          This eBook is for the use of anyone anywhere in the United States and 
         
           by the awkwardness of the application, and at length wholly 
          
            Elizabeth passed the chief of the night in her sister’s room, and 
           
             the happiest memories in the world. Nothing of the past was 
            
              charities and charitable donations in all 50 states of the United 
            
           
          
         
        
       
      
     
    
   

In practice, you can implement matching and storing via an indexer inside Flow. This example is only for demo purpose so any non-feature hashing related ops are implemented outside the Flow to avoid distraction.

Owner
Jina AI
A Neural Search Company. We provide the cloud-native neural search solution powered by state-of-the-art AI technology.
Jina AI
Log4jScanner is a Log4j Related CVEs Scanner, Designed to Help Penetration Testers to Perform Black Box Testing on given subdomains.

Log4jScanner Log4jScanner is a Log4j Related CVEs Scanner, Designed to Help Penetration Testers to Perform Black Box Testing on given subdomains. Disc

Pushpender Singh 35 Dec 12, 2022
NexScanner is a tool which allows you to scan a website and find the admin login panel and sub-domains

NexScanner NexScanner is a tool which helps you scan a website for sub-domains and also to find login pages in the website like the admin login panel

8 Sep 03, 2022
Simples brute forcer de diretorios para web pentest.

🦑 dirbruter Simples brute forcer de diretorios para web pentest. ❕ Atenção Não ataque sites privados. Isto é illegal. 🖥️ Pré-requisitos Ultima versã

Dio brando 6 Jan 22, 2022
A great and handy python obfuscator for protecting code.

Python Code Obfuscator A handy and necessary tool that can protect your code anytime! Mostly Command Line tool that will obfuscate your code. Features

Karim 5 Nov 18, 2022
宝塔面板Windows版提权方法

宝塔面板Windows提权方法 本项目整理一些宝塔特性,可以在无漏洞的情况下利用这些特性来增加提权的机会。

298 Dec 14, 2022
Discord-keylogger - Discord keylogger With Python

Discord-keylogger Usage python dlogger.py -t [Time interval in sec] if not speci

Satwik Sinha 1 Jan 30, 2022
Tool to decrypt iOS apps using r2frida

r2flutch Yet another tool to decrypt iOS apps using r2frida. Requirements It requires to install Frida on the Jailbroken iOS device: Jailbroken device

Murphy 146 Jan 03, 2023
自动化爆破子域名,并遍历所有端口寻找http服务,并使用crawlergo、dirsearch、xray等工具扫描并集成报告;支持动态添加扫描到的域名至任务;

AutoScanner AutoScanner是什么 AutoScanner是一款自动化扫描器,其功能主要是遍历所有子域名、及遍历主机所有端口寻找出所有http服务,并使用集成的工具进行扫描,最后集成扫描报告; 工具目前有:oneforall、masscan、nmap、crawlergo、dirse

633 Dec 30, 2022
LdapRelayScan - Check for LDAP protections regarding the relay of NTLM authentication

LDAP Relay Scan A tool to check Domain Controllers for LDAP server protections r

315 Dec 18, 2022
A Proof-Of-Concept for the recently found CVE-2021-44228 vulnerability

log4j-shell-poc A Proof-Of-Concept for the recently found CVE-2021-44228 vulnerability. Recently there was a new vulnerability in log4j, a java loggin

koz 1.5k Jan 04, 2023
Simulating Log4j Remote Code Execution (RCE) vulnerability in a flask web server using python's logging library with custom formatter that simulates lookup substitution by executing remote exploit code.

py4jshell Simulating Log4j Remote Code Execution (RCE) CVE-2021-44228 vulnerability in a flask web server using python's logging library with custom f

Narasimha Prasanna HN 86 Aug 21, 2022
Apk Framework Detector

🚀🚀🚀Program helps you to detect the major framework or technology used in writing any android app. Just provide the apk 😇😇

Daniel Agyapong 10 Dec 07, 2022
Rouge Spammers with a mission to disrupt the peace of the valley ? Fear not we will STOMP the Spammers

Rouge Spammers with a mission to disrupt the peace of the valley ? Fear not we will STOMP the Spammers New Update : adding 'on-review' tag on an issue

A N U S H 13 Sep 19, 2021
A semi-automatic osint/recon framework.

Smog Framework A semi-automatic osint/recon framework. Requirements git Python = 3.8 How to use it

toast 22 Oct 17, 2022
🍉一款基于Python-Django的多功能Web安全渗透测试工具,包含漏洞扫描,端口扫描,指纹识别,目录扫描,旁站扫描,域名扫描等功能。

Sec-Tools 项目介绍 系统简介 本项目命名为Sec-Tools,是一款基于 Python-Django 的在线多功能 Web 应用渗透测试系统,包含漏洞检测、目录识别、端口扫描、指纹识别、域名探测、旁站探测、信息泄露检测等功能。本系统通过旁站探测和域名探测功能对待检测网站进行资产收集,通过端

简简 300 Jan 07, 2023
Tool To generate Stable Undetected Payload

windowsPayload Tool To generate Stable Undetected Payload Don t Upload to Virus Total :) Follow on Social Media Platforms ScreenShots How to install +

youhacker55 117 Dec 30, 2022
Yet another web fuzzer

yafuzz Yet another web fuzzer Usage This script can run in two modes of operation. Supplying a wordlist -W argument will initiate a multithreaded fuzz

FooBallZ 5 Feb 02, 2022
Proof of concept for CVE-2021-31166, a remote HTTP.sys use-after-free triggered remotely.

CVE-2021-31166: HTTP Protocol Stack Remote Code Execution Vulnerability This is a proof of concept for CVE-2021-31166 ("HTTP Protocol Stack Remote Cod

Axel Souchet 820 Dec 18, 2022
Phishing Campaign Toolkit

King Phisher Phishing Campaign Toolkit Installation For instructions on how to install, please see the INSTALL.md file. After installing, for instruct

RSM US LLP 1.9k Jan 01, 2023
It's a simple tool for test vulnerability Apache Path Traversal

SimplesApachePathTraversal Simples Apache Path Traversal It's a simple tool for test vulnerability Apache Path Traversal https://blog.mrcl0wn.com/2021

Mr. Cl0wn - H4ck1ng C0d3r 56 Dec 27, 2022