Kroomsa: A search engine for the curious

Overview

A search engine for the curious. It is a search algorithm designed to engage users by exposing them to relevant yet interesting content during their session.

Description

The search algorithm implemented on your website greatly influences visitor engagement. A decent implementation can significantly reduce visitors' dependency on standard search engines like Google for every query, thus increasing engagement. Traditional methods look at the terms or phrases in your query to find relevant content based on syntactic matching. Kroomsa instead uses semantic matching to find content relevant to your query. There is a blog post expanding upon Kroomsa's motivation and its technical aspects.
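To make the contrast with syntactic matching concrete, here is a minimal sketch of semantic ranking by cosine similarity between embedding vectors. It is not Kroomsa's implementation: the three-dimensional vectors are hand-made stand-ins for real sentence embeddings, and the names cosine_similarity, rank_by_similarity, toy_docs and toy_query are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_by_similarity(query_vec, doc_vecs):
    """Return (doc_id, score) pairs sorted from most to least similar."""
    scored = [(doc_id, cosine_similarity(query_vec, vec))
              for doc_id, vec in doc_vecs.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Assumption: toy vectors standing in for sentence-encoder output.
toy_docs = {
    "how do planes stay in the air": [0.9, 0.1, 0.0],
    "why is the sky blue": [0.1, 0.9, 0.1],
}
# Stands in for the query "what keeps aircraft aloft", which shares
# almost no words with the best match but points in a similar direction.
toy_query = [0.8, 0.2, 0.0]

ranking = rank_by_similarity(toy_query, toy_docs)
for doc_id, score in ranking:
    print(f"{score:.3f}  {doc_id}")
```

A term-based matcher would find little overlap between the query and the first document, whereas the vector comparison ranks it first because the two vectors are close.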

Getting Started

Prerequisites

  • Python 3.6.5
  • Run the project directory setup from the root directory: python3 ./setup.py
  • TensorFlow's Universal Sentence Encoder 4
    • The model is available at this link. Download the model and extract the zip file in the /vectorizer directory.
  • MongoDB is used as the database to collate Reddit submissions. It can be installed by following this link.
  • PRAW is used to fetch the comments of the Reddit submissions. Scraping requires credentials that authorize the script; these are obtained by creating an app associated with a Reddit account by following this link. For reference, you can follow this tutorial written by Shantnu Tiwari.
    • Register multiple instances and retrieve their credentials, then add them to /config under the bot_codes parameter as a comma-separated list of elements, each in the following format: "client_id client_secret user_agent".
  • Docker-compose (For dockerized deployment only): Install the latest version following this link.
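The bot_codes format described above can be illustrated with a small parsing sketch. The helper parse_bot_codes and the sample credentials are hypothetical, and how Kroomsa itself reads the setting is not shown in this README. The sketch also assumes that anything after the second space belongs to the user agent. The resulting keys match the client_id, client_secret and user_agent keyword arguments that praw.Reddit accepts.

```python
from typing import Dict, List


def parse_bot_codes(bot_codes: List[str]) -> List[Dict[str, str]]:
    """Split each "client_id client_secret user_agent" entry into named fields."""
    parsed = []
    for entry in bot_codes:
        # Assumption: the user agent may itself contain spaces, so split
        # only on the first two runs of whitespace.
        parts = entry.split(maxsplit=2)
        if len(parts) != 3:
            raise ValueError(f"expected 3 space-separated fields, got: {entry!r}")
        client_id, client_secret, user_agent = parts
        parsed.append({
            "client_id": client_id,
            "client_secret": client_secret,
            "user_agent": user_agent,
        })
    return parsed


# Placeholder values, not real credentials.
example_bot_codes = [
    "id_one secret_one kroomsa-scraper-1",
    "id_two secret_two kroomsa scraper 2",
]
credentials = parse_bot_codes(example_bot_codes)
print(credentials[1]["user_agent"])
```

Each dictionary could then be unpacked into a client constructor, one per registered instance.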

Installing

  • Create a python environment and install the required packages for preprocessing using: python3 -m pip install -r ./preprocess_requirements.txt
  • Collating a dataset of Reddit submissions
    • Scraping posts
      • Pushshift's API is used to fetch Reddit submissions. In the root directory, run the following command: python3 ./pre_processing/scraping/questions/scrape_questions.py. It launches a script that scrapes each subreddit sequentially, working back to its inception, and stores the submissions as JSON objects in /pre_processing/scraping/questions/scraped_questions. It then partitions the scraped submissions into as many equal parts as there are registered bot instances.
    • Scraping comments
      • After populating the configuration with bot_codes, you can scrape the comments using the partitioned submission files created while scraping submissions. Run the following command: python3 ./pre_processing/scraping/comments/scrape_comments.py. It spawns multiple processes that fetch comment streams simultaneously.
    • Insertion
      • To insert the submissions and associated comments, use the following command: python3 ./pre_processing/db_insertion/insertion.py. It inserts the posts and associated comments into MongoDB.
      • To clean the comments and tag the posts that are no longer public for any reason, run python3 ./post_processing/post_processing.py. Apart from cleaning, it also adds emojis to each submission object (this behavior is configurable).
  • Creating a FAISS Index
    • To create a FAISS index, run the following command: python3 ./index/build_index.py. By default, it creates an exhaustive IDMap, Flat index, but this is configurable through /config.
  • Database dump (For dockerized deployment)
    • For dockerized deployment, a database dump is required in /mongo_dump. Use the following command in the root directory to create one: mongodump --db database_name --collection collection_name -o ./mongo_dump (the default database_name is red and the default collection_name is questions).
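The scraping step above splits the scraped submissions into one part per registered bot instance. A minimal sketch of such a split follows, assuming "equal parts" means sizes that differ by at most one when the count does not divide evenly. The function partition is hypothetical and is not taken from scrape_questions.py.

```python
from typing import List, Sequence


def partition(items: Sequence, num_parts: int) -> List[list]:
    """Split items into num_parts contiguous chunks whose sizes differ by at most one."""
    if num_parts < 1:
        raise ValueError("num_parts must be at least 1")
    base, extra = divmod(len(items), num_parts)
    parts = []
    start = 0
    for i in range(num_parts):
        # The first `extra` chunks each take one additional item.
        size = base + (1 if i < extra else 0)
        parts.append(list(items[start:start + size]))
        start += size
    return parts


# Seven placeholder submission ids shared among three bot instances.
submission_ids = list(range(7))
chunks = partition(submission_ids, 3)
print(chunks)
```

Each chunk could then be handed to a separate process so that the comment streams are fetched in parallel, one set of credentials per process.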

Execution

  • Local deployment (Using Gunicorn)
    • Create a python environment and install the required packages using the following command: python3 -m pip install -r ./inference_requirements.txt
    • A local instance of Kroomsa can be deployed using the following command: gunicorn -c ./gunicorn_config.py server:app
  • Dockerized demo
    • Set demo_mode to True in /config.
    • Build images: docker-compose build
    • Deploy: docker-compose up

Authors

License

This project is licensed under the Apache License, Version 2.0.
