Convert a collection of features to a fixed-dimensional matrix using the hashing trick.

Overview

FeatureHasher

Convert a collection of features to a fixed-dimensional matrix using the hashing trick.

Note, this requires Jina>=2.2.4.

Example

Here I use FeatureHasher to hash each sentence of Pride and Prejudice into a 128-dim vector, and then use .match to find top-K similar sentences.

from jina import Document, DocumentArray, Flow

# load 
   
d = Document(uri='https://www.gutenberg.org/files/1342/1342-0.txt').convert_uri_to_text()

# cut into non-empty sentences store in a DA
da = DocumentArray(Document(text=s.strip()) for s in d.text.split('\n') if s.strip())

# use FeatureHasher in a Flow
f = Flow().add(uses='jinahub://FeatureHasher')

embed_da = DocumentArray()
with f:
    f.post('/', da, on_done=lambda req: embed_da.extend(req.docs), show_progress=True)

print('self-matching...')
embed_da.match(embed_da, exclude_self=True, limit=5, normalization=(1, 0))
print('total sentences: ', len(embed_da))
for d in embed_da:
    print(d.text)
    for m in d.matches:
        print(m.scores['cosine'], m.text)
    input()
           [email protected][I]:🎉 Flow is ready to use!
	🔗 Protocol: 		GRPC
	🏠 Local access:	0.0.0.0:52628
	🔒 Private network:	192.168.178.31:52628
	🌐 Public address:	217.70.138.123:52628
⠹       DONE ━╸━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 0:00:01 100% ETA: 0 seconds 40 steps done in 1 second
total sentences:  12153
The Project Gutenberg eBook of Pride and Prejudice, by Jane Austen

   
     *** END OF THE PROJECT GUTENBERG EBOOK PRIDE AND PREJUDICE ***

    
      *** START OF THE PROJECT GUTENBERG EBOOK PRIDE AND PREJUDICE ***

     
       production, promotion and distribution of Project Gutenberg-tm

      
        Pride and Prejudice

       
         By Jane Austen This eBook is for the use of anyone anywhere in the United States and 
        
          This eBook is for the use of anyone anywhere in the United States and 
         
           by the awkwardness of the application, and at length wholly 
          
            Elizabeth passed the chief of the night in her sister’s room, and 
           
             the happiest memories in the world. Nothing of the past was 
            
              charities and charitable donations in all 50 states of the United 
            
           
          
         
        
       
      
     
    
   

In practice, you can implement matching and storing via an indexer inside Flow. This example is only for demo purpose so any non-feature hashing related ops are implemented outside the Flow to avoid distraction.

Owner
Jina AI
A Neural Search Company. We provide the cloud-native neural search solution powered by state-of-the-art AI technology.
Jina AI
A DOM-based G-Suite password sprayer and user enumerator

A DOM-based G-Suite password sprayer and user enumerator

Mayk 1 Apr 07, 2022
The Devils Eye is an OSINT tool that searches the Darkweb for onion links and descriptions that match with the users query without requiring the use for Tor.

The Devil's Eye searches the darkweb for information relating to the user's query and returns the results including .onion links and their description

Richard Mwewa 135 Dec 31, 2022
A simple python script for hosting a Snowflake Proxy in your python program or with it's standalone cli

snowflake-cli Snowflake is a system to defeat internet censorship, made by Tor Project. The system works by volunteers who run the snowflake extension

Guilherme Paixão 6 Jul 14, 2022
A simple way to store your passwords without requiring third party applications

SimplePasswordManager A simple way to store your passwords without requiring third party applications Simple To Use. Store Your Passwords For Each Web

Leone Odinga 1 Dec 23, 2021
Midas ELF64 Injector is a tool that will help you inject a C program from source code into an ELF64 binary.

Midas ELF64 Injector Description Midas ELF64 Injector is a tool that will help you inject a C program from source code into an ELF64 binary. All you n

midas 20 Dec 24, 2022
An automated, reliable scanner for the Log4Shell (CVE-2021-44228) vulnerability.

Log4JHunt An automated, reliable scanner for the Log4Shell CVE-2021-44228 vulnerability. Video demo: Usage Here the help usage: $ python3 log4jhunt.py

RedHunt Labs 39 Nov 21, 2022
This is a Crypto asset tracker that I built to aid my personal journey in cryptocurrencies.

Wallet Tracker This is a Crypto asset tracker that I built to aid my personal journey in cryptocurrencies. build docker build -t wallet-tracker . run

2 Mar 21, 2022
This is a simple Port Flooder written in Python 3.

This is a simple Port Flooder written in Python 3. Use this tool to quickly stress test your network devices and measure your router's or server's load.

Júlio Carneiro 4 Feb 20, 2022
It's a simple tool for test vulnerability shellshock

Shellshock, also known as Bashdoor, is a family of security bugs in the Unix Bash shell, the first of which was disclosed on 24 September 2014. Shellshock could enable an attacker to cause Bash to ex

Mr. Cl0wn - H4ck1ng C0d3r 88 Dec 23, 2022
Agile Threat Modeling Toolkit

Threagile is an open-source toolkit for agile threat modeling:

Threagile 425 Jan 07, 2023
Python script to tamper with pages to test for Log4J Shell vulnerability.

log4jShell Scanner This shell script scans a vulnerable web application that is using a version of apache-log4j 2.15.0. This application is a static

GoVanguard 8 Oct 20, 2022
Orthrus is a macOS agent that uses Apple's MDM to backdoor a device using a malicious profile.

Orthrus is a macOS agent that uses Apple's MDM to backdoor a device using a malicious profile. It effectively runs its own MDM server and allows the operator to interface with it using Mythic.

Mythic Agents 37 Dec 06, 2022
Finite Volume simulation of the Raleigh-Taylor Instability

finitevolume2-python Finite Volume simulation of the Raleigh-Taylor Instability Create Your Own Finite Volume Fluid Simulation (With Python): Part 2 B

Philip Mocz 12 Sep 01, 2022
exchange-ssrf-rce

Usage python3 .\exchange-exp.py -------------------------------------------------------------------------------- |

Jen 76 Nov 09, 2022
Glass是一款针对资产列表的快速指纹识别工具,通过调用Fofa/ZoomEye/Shodan/360等api接口

Glass是一款针对资产列表的快速指纹识别工具,通过调用Fofa/ZoomEye/Shodan/360等api接口快速查询资产信息并识别重点资产的指纹,也可针对IP/IP段或资产列表进行快速的指纹识别。

s7ck Team 764 Jan 05, 2023
利用NTLM Hash读取Exchange邮件

GetMail 利用NTLM Hash读取Exchange邮件:在进行内网渗透时候,我们经常拿到的是账号的Hash凭据而不是明文口令。在这种情况下采用邮件客户端或者WEBMAIL的方式读取邮件就很麻烦,需要进行破解,NTLM的破解主要依靠字典强度,破解概率并不是很大。

<a href=[email protected]"> 388 Dec 27, 2022
spring-cloud-gateway-rce CVE-2022-22947

Spring Cloud Gateway Actuator API SpEL表达式注入命令执行(CVE-2022-22947) 1.installation pip3 install -r requirements.txt 2.Usage $ python3 spring-cloud-gateway

k3rwin 10 Sep 28, 2022
ProxyLogon Pre-Auth SSRF To Arbitrary File Write

ProxyLogon Pre-Auth SSRF To Arbitrary File Write For Education and Research Usage: C:\python proxylogon.py mail.evil.corp lulz 117 Nov 28, 2022

proxyshell payload generate

Py Permutative Encoding https://docs.microsoft.com/en-us/openspecs/office_file_formats/ms-pst/5faf4800-645d-49d1-9457-2ac40eb467bd Generate proxyshell

Evi1cg 63 Nov 15, 2022
Pgen is the best brute force password generator and it is improved from the cupp.py

pgen Pgen is the best brute force password generator and it is improved from the cupp.py The pgen tool is dedicated to Leonardo da Vinci -Time stays l

heyheykids 2 Jan 31, 2022