Convert a collection of features to a fixed-dimensional matrix using the hashing trick.

Overview

FeatureHasher

Convert a collection of features to a fixed-dimensional matrix using the hashing trick.

Note, this requires Jina>=2.2.4.

Example

Here I use FeatureHasher to hash each sentence of Pride and Prejudice into a 128-dim vector, and then use .match to find top-K similar sentences.

from jina import Document, DocumentArray, Flow

# load 
   
d = Document(uri='https://www.gutenberg.org/files/1342/1342-0.txt').convert_uri_to_text()

# cut into non-empty sentences store in a DA
da = DocumentArray(Document(text=s.strip()) for s in d.text.split('\n') if s.strip())

# use FeatureHasher in a Flow
f = Flow().add(uses='jinahub://FeatureHasher')

embed_da = DocumentArray()
with f:
    f.post('/', da, on_done=lambda req: embed_da.extend(req.docs), show_progress=True)

print('self-matching...')
embed_da.match(embed_da, exclude_self=True, limit=5, normalization=(1, 0))
print('total sentences: ', len(embed_da))
for d in embed_da:
    print(d.text)
    for m in d.matches:
        print(m.scores['cosine'], m.text)
    input()
           [email protected][I]:πŸŽ‰ Flow is ready to use!
	πŸ”— Protocol: 		GRPC
	🏠 Local access:	0.0.0.0:52628
	πŸ”’ Private network:	192.168.178.31:52628
	🌐 Public address:	217.70.138.123:52628
β Ή       DONE ━╸━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 0:00:01 100% ETA: 0 seconds 40 steps done in 1 second
total sentences:  12153
ο»ΏThe Project Gutenberg eBook of Pride and Prejudice, by Jane Austen

   
     *** END OF THE PROJECT GUTENBERG EBOOK PRIDE AND PREJUDICE ***

    
      *** START OF THE PROJECT GUTENBERG EBOOK PRIDE AND PREJUDICE ***

     
       production, promotion and distribution of Project Gutenberg-tm

      
        Pride and Prejudice

       
         By Jane Austen This eBook is for the use of anyone anywhere in the United States and 
        
          This eBook is for the use of anyone anywhere in the United States and 
         
           by the awkwardness of the application, and at length wholly 
          
            Elizabeth passed the chief of the night in her sister’s room, and 
           
             the happiest memories in the world. Nothing of the past was 
            
              charities and charitable donations in all 50 states of the United 
            
           
          
         
        
       
      
     
    
   

In practice, you can implement matching and storing via an indexer inside Flow. This example is only for demo purpose so any non-feature hashing related ops are implemented outside the Flow to avoid distraction.

Owner
Jina AI
A Neural Search Company. We provide the cloud-native neural search solution powered by state-of-the-art AI technology.
Jina AI
A simple subdomain scanner in python

Subdomain-Scanner A simple subdomain scanner in python ✨ Features scans subdomains of a domain thats it! πŸ’β€β™€οΈ How to use first download the scanner.p

Portgas D Ace 2 Jan 07, 2022
Whois-Python - Get Whois Domain with Python GUI

Whois-Python-GUI Get Whois Domain with Python - GUI :) WARNING Dont Copy ! - W

MR.D3F417 3 Feb 21, 2022
Operational information regarding the vulnerability in the Log4j logging library.

Log4j Vulnerability (CVE-2021-44228) This repo contains operational information regarding the vulnerability in the Log4j logging library (CVE-2021-442

Nationaal Cyber Security Centrum (NCSC-NL) 1.9k Dec 26, 2022
Windows Stack Based Auto Buffer Overflow Exploiter

Autoflow - Windows Stack Based Auto Buffer Overflow Exploiter Autoflow is a tool that exploits windows stack based buffer overflow automatically.

Himanshu Shukla 19 Dec 22, 2022
Exploit tool for Adminer 1.0 up to 4.6.2 Arbitrary File Read vulnerability

AdminerRead Exploit tool for Adminer 1.0 up to 4.6.2 Arbitrary File Read vulnerability Installation git clone https://github.com/p0dalirius/AdminerRea

Podalirius 58 Dec 05, 2022
Huskee: Malware made in Python for Educational purposes

π‡π”π’πŠπ„π„ Caracteristicas: Discord Token Grabber Wifi Passwords Grabber Googl

chew 4 Aug 17, 2022
Log4j command generator: Generate commands for CVE-2021-44228

Log4j command generator Generate commands for CVE-2021-44228. Description The vulnerability exists due to the Log4j processor's handling of log messag

1 Jan 03, 2022
Log4j2 CVE-2021-44228 revshell

Log4j2-CVE-2021-44228-revshell Usage For reverse shell: $~ python3 Log4j2-revshell.py -M rev -u http://www.victimLog4j.xyz:8080 -l [AttackerIP] -p [At

FaisalFs 16 Mar 24, 2022
Fast Fb Cracking Tool

fb-brute Fast Fb Cracking Tool πŸ†

Aryan 8 Jun 29, 2022
PoC for CVE-2020-6207 (Missing Authentication Check in SAP Solution Manager)

PoC for CVE-2020-6207 (Missing Authentication Check in SAP Solution Manager) This script allows to check and exploit missing authentication checks in

chipik 82 Nov 09, 2022
Python lib to automate basic QFT calculations like Wick-contractions.

QFTools Python lib to automate basic QFT calculations like Wick-contractions. Features Wick contractions for real scalar fields Wick contractions for

2 Aug 21, 2022
Scan Site - Tools For Scanning Any Site and Get Site Information

Site Scanner Tools For Scanning Any Site and Get Site Information Example Require - pip install colorama - pip install requests How To Use Download Th

NumeX 5 Mar 19, 2022
Generate your own NFTs and their metadata based on your desired probabilities.

Generate your own NFTs and their metadata based on your desired probabilities. Use your own art assets too! Perfect for use with Candy Machine.

hex 7 Sep 16, 2022
FOSSLight Scanner performs open source analysis after downloading the source by passing a link that can be cloned by wget or git.

FOSSLight Scanner Analyze at once for Open Source Compliance. FOSSLight Scanner performs open source analysis after downloading the source by passing

FOSSLight 8 Nov 03, 2022
Python script to tamper with pages to test for Log4J Shell vulnerability.

log4jShell Scanner This shell script scans a vulnerable web application that is using a version of apache-log4j 2.15.0. This application is a static

GoVanguard 8 Oct 20, 2022
This repo created for bypassing Widevine L3 DRM and obtaining keys.

First run: Copy headers (with cookies) of POST license request from browser to headers.py like dictionary. pip install -r requirements.txt # if doesn'

Mikhail 263 Jan 07, 2023
ORector - A Fast Python tool designed to detect open redirects vulnerabilities on websites

ORector is a Fast Python tool designed to detect open redirects vulnerabilities

11 Apr 02, 2022
Web-eyes - OSINT tools for website research

WEB-EYES V1.0 web-eyes: OSINT tools for website research, 14 research methods ar

8 Nov 10, 2022
Static Token And Credential Scanner

Static Token And Credential Scanner What is it? STACS is a YARA powered static credential scanner which suports binary file formats, analysis of neste

STACS 81 Dec 27, 2022
Um script simples de Port Scan + DNS by Hostname

πŸ–₯ PortScan-DNS Esta Γ© uma ferramenta simples de Port Scan + DNS by Hostname... πŸ’» | DNS Resolver / by Hostname: HOST IP EXTERNO IP INTERNO πŸ’» | Port

AlbΓ’niaSecurity-RT 7 Dec 08, 2022