A distributed block-based data storage and compute engine

Overview

Nebula

Extremely-fast Interactive Big Data Analytics

Nebula is an extremely fast, end-to-end, interactive big data analytics solution. It is designed as a high-performance columnar data storage and tabular OLAP engine.

Beyond that, it can also serve as:

  • An extremely fast data analytics platform.
  • A storage system with column-level access control.
  • A distributed cache tier for tabular data.

Documents on design, internals, and stories will be shared at the project site.

Introduction

With Nebula, you can easily:

  • Generate beautiful charts from terabytes of data in less than 1 second.

Generate a bar chart from 700M rows in 600ms

  • Write an instant JavaScript function in a real-time query.

Transform a column and aggregate by it with filters

Pivot data on the client for visualization

To learn more about how to do this, please take a look at the project site.

I provide free talks/presentations and short-term onboarding consulting.

If you are considering adopting Nebula, feel free to shoot me an email.

Get Started

Run It!

Prebuilt binaries

  • clone the repo: git clone https://github.com/varchar-io/nebula.git
  • run run.sh in source root: cd nebula && ./run.sh
  • explore nebula UI in browser: http://localhost:8088

Kubernetes

Deploy a single-node k8s cluster on your local box. Assuming your current kubectl points to that cluster, just run:

  • apply: kubectl apply -f deploy/k8s/nebula.yaml.
  • forward: kubectl port-forward nebula/server 8088:8088
  • explore: http://localhost:8088

Build Source

Please refer to the Developer Guide for building Nebula from source code. You are welcome to become a contributor.

Use Cases

Static Data Analytics

Configure your data source from permanent storage (a file system) and run analytics on it. AWS S3 and Azure Blob Storage are commonly used storage systems, with support for file formats like CSV, Parquet, and ORC. These formats and storage systems are widely used in modern big data ecosystems.

For example, this simple config will let you analyze S3 data on Nebula:

" data: s3 loader: Swap source: s3://nebula/seattle_calls.10k.tsv backup: s3://nebula/n202/ format: csv csv: hasHeader: true delimiter: "," time: type: column column: queue_time pattern: "%m/%d/%Y %H:%M:%S"">
seattle.calls:
  retention:
    max-mb: 40000
    max-hr: 0
  schema: "ROW
   
    "
   
  data: s3
  loader: Swap
  source: s3://nebula/seattle_calls.10k.tsv
  backup: s3://nebula/n202/
  format: csv
  csv:
    hasHeader: true
    delimiter: ","
  time:
    type: column
    column: queue_time
    pattern: "%m/%d/%Y %H:%M:%S"

Realtime Data Analytics

Connect Nebula to a real-time data source such as Kafka, with data formats in Thrift or JSON, and do real-time data analytics.

For example, this config section asks Nebula to connect to one Kafka topic for real-time code profiling:

" data: kafka loader: Streaming source: backup: s3://nebula/n116/ format: json kafka: topic: columns: service: dict: true host: dict: true tag: dict: true lang: dict: true time: # kafka will inject a time column when specified provided type: provided settings: batch: 500">
  k.pinterest-code:
    retention:
      max-mb: 200000
      max-hr: 48
    schema: "ROW
     
      "
     
    data: kafka
    loader: Streaming
    source: 
     
    backup: s3://nebula/n116/
    format: json
    kafka:
      topic: 
     
    columns:
      service:
        dict: true
      host:
        dict: true
      tag:
        dict: true
      lang:
        dict: true
    time:
      # kafka will inject a time column when the type is specified as "provided"
      type: provided
    settings:
      batch: 500

Ephemeral Data Analytics

Define a template in Nebula and load data through the Nebula API, allowing the data to live for a specific period. Nebula serves analytics queries during this ephemeral data's lifetime. A hypothetical sketch of such a load call is shown below.
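As a purely illustrative sketch (the endpoint path, payload fields, and ttl semantics below are assumptions for illustration, not Nebula's documented API), loading ephemeral rows against a predefined template could look roughly like this in JavaScript:

    // hypothetical example only: the endpoint and payload shape are assumptions
    const loadEphemeral = async (rows) => {
      const res = await fetch("http://localhost:8088/api/load", {
        method: "POST",
        headers: { "Content-Type": "application/json" },
        body: JSON.stringify({
          table: "my.ephemeral.template", // a table template predefined in Nebula
          ttl: 3600,                      // keep the data queryable for one hour
          rows                            // the rows to load
        })
      });
      return res.json();
    };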

Sparse Storage

Nebula breaks input data down into many small data cubes living on Nebula nodes, so a simple predicate (filter) will usually prune away most of the data to scan, giving super low latency for your analytics.

For example, the config below sets up an internal partition that leverages sparse storage for super-fast pruning of queries targeting a specific dimension. (It also demonstrates how to set up column-level access control: an access group and an access action for specific columns.)

  nebula.test:
    retention:
      # max 10G RAM assignment
      max-mb: 10000
      # max 10 days assignment
      max-hr: 240
    schema: "ROW
   
    , flag:bool, value:tinyint>
    "
   
    data: custom
    loader: NebulaTest
    source: ""
    backup: s3://nebula/n100/
    format: none
    # NOTE: reference only, column properties defined here will not take effect
    # because they are overwritten/decided by definition of TestTable.h
    columns:
      id:
        bloom_filter: true
      event:
        access:
          read:
            groups: ["nebula-users"]
            action: mask
      tag:
        partition:
          values: ["a", "b", "c"]
          chunk: 1
    time:
      type: static
      # get it from linux by "date +%s"
      value: 1565994194

Nebula Is Programmable

Through the great project QuickJS, Nebula supports full ES6 programming through its simple UI code editor. Below is a code snippet that generates a pie chart from your SQL-like query code in JS.

The demo video at the top of the page shows how the Nebula client SDK is used and how tables and charts are generated in milliseconds!

    // define a customized column
    const colx = () => nebula.column("value") % 20;
    nebula.apply("colx", nebula.Type.INT, colx);

    // get a data set from data stored in HTTPS or S3
    nebula
        .source("nebula.test")
        .time("2020-08-16", "2020-08-26")
        .select("colx", count("id"))
        .where(and(gt("id", 5), eq("flag", true)))
        .sortby(nebula.Sort.DESC)
        .limit(10)
        .run();

Open source

Open source is wonderful - it is the reason we can build software and innovate on top of others' work. Without these great open source projects, Nebula wouldn't be possible.

Many other projects are used by Nebula as well:

  • common tools (glog/gflags/gtest/yaml-cpp/fmt/leveldb)
  • serde (msgpack/rapidjson/rdkafka)
  • algos (xxhash, roaring bitmap, zstd, lz4)
  • ...

Adoptions

Pinterest

Comments
  • prototyping Nebula Ingestion DDL

    Nebula Ingestion DDL

    YAML is a powerful way to express configurations; it's easy for people to understand and change. At the same time, remembering all the different configurations and concepts can pose a high tax once we start supporting functions and preprocessing, indexing, or consistent hashing (a possible concept for expanding/shrinking the storage cluster). This may lead to inventing a new set of configurations and concepts that only experts can remember.

    Moreover, an OLAP system works as part of a big data ecosystem; being able to transform and pre-process data at ingestion time will provide an edge over other OLAP engines for users considering adoption.

    Here is an inspiring example that is not yet supported by Nebula.

    A user has a Hive table and a Kafka stream ingesting into Nebula. The Hive table has hourly partitions keeping the last 60 days of the moving average of business spend per account; the Kafka stream contains each account's business transactions in foreign currency. The user wants to investigate account spending status in near real time in the home currency (e.g. USD).

    The complexity of this use case is threefold:

    • the Hive table may be read and eventually sharded on a per-account basis
    • the Kafka stream may need an RPC call to convert currency into USD
    • the Kafka stream may need a stream/table join with the Hive table on a per-account basis before landing results to run slice and dice

    If the user wrote an RDBMS query, it would look like:

    OPTION 1 Materialized View with schema as part of config

    create view nebula.transaction_analytic as (
      select accountid, avg(spend), transactionid, TO_USD(transaction_amount)
      from hive right join kafka on hive.account = kafka.acount
      where <all configs on hive, kafka>
    )

    Alternatively, we could support a two-statement flow like:

    OPTION 2 Full Table with schema inference

    DDL:

        -- mapping of the hive table synced to a nebula table
        create table hive.account (
          accountid bigint PRIMARY KEY,
          spend double,
          dt varchar(20) UNIQUE NOT NULL
        ) with ();

        create table kafka.transaction (
          transactionid bigint PRIMARY KEY,
          accountid bigint not null,
          transaction_amount double,
          _time timestamp
        ) with ();

        create table transaction_analytic (
          accountid bigint PRIMARY KEY,
          avg_transaction double,
          transaction_amount_in_usd double,
          _time timestamp
        ) with ();

    DML:

        insert into transaction_analytic
        select accountid, avg(spend), transactionid, TO_USD(transaction_amount)
        from hive right join transaction on hive.account = transaction.acount;

    opened by chenqin 6
  • Custom time query

    https://github.com/varchar-io/nebula/issues/180

    The git history and code are messy. I ran tests for creating timeline queries by day and by week, since those can be done with constant bucket sizes. An issue that arises is that the table below the timeline will show a negative value for the initial time, because the start of a day/week may be before beginTime.

    To implement month/year (which needs to account for leap years and months of 29, 30, or 31 days), we likely need to write a UDF. I'm looking into it but am still a bit confused about how to do this. Any formatting/code-structure feedback is appreciated.

    opened by caoash 5
  • Support Google Cloud Storage

    Currently Nebula defines a file system interface in its storage component. Under this interface, there are a few implementations, such as:

    1. local file system
    2. S3 file system

    These two are the most used so far; now we're asking for new support to allow Nebula to read and write data with GCS. This will be needed for Nebula deployments on Google Cloud Platform.

    enhancement p1 engine 
    opened by shawncao 4
  • [Feature] Support bounded time slice in timeline query

    (Context: this is how the current timeline works.) The timeline query has 3 basic inputs: start time, end time, and window size; it simply slices the whole time range into pieces according to the window size.

    For a general timeline this works great as long as the window size is flexible, but there are real business scenarios where users want the timeline to fit the boundaries of a given unit, such as weekly, monthly, quarterly, or yearly.

    In these cases, data needs to fit into the boundary of the time unit and should not cross multiple different units; see the sketch below.

    The change should be simple in the query building/planning phase, but it needs to introduce a new interface and enough tests.

    (Suggestion: a good starting task for a new committer.)
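    To make the boundary idea concrete, here is a minimal sketch in plain JavaScript (illustrative only, not Nebula's query planner; helper names are made up) that produces day/week slices which never cross a unit boundary:

        // minimal sketch: bounded time slices for "day" / "week" units (UTC)
        const DAY_MS = 24 * 3600 * 1000;

        // floor a UTC timestamp (ms) to the start of its unit
        const floorToUnit = (ms, unit) => {
          const d = new Date(ms);
          d.setUTCHours(0, 0, 0, 0);
          if (unit === "week") {
            // align to Monday
            const dow = (d.getUTCDay() + 6) % 7;
            d.setUTCDate(d.getUTCDate() - dow);
          }
          return d.getTime();
        };

        // produce [begin, end) slices bounded by unit boundaries
        const boundedSlices = (beginMs, endMs, unit) => {
          const slices = [];
          let cur = beginMs;
          while (cur < endMs) {
            const span = unit === "week" ? 7 * DAY_MS : DAY_MS;
            const boundary = floorToUnit(cur, unit) + span;
            const next = Math.min(boundary, endMs);
            slices.push([cur, next]);
            cur = next;
          }
          return slices;
        };

    With this shape, the first and last slices may be shorter than a full unit, but no slice ever crosses a day or week boundary.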

    enhancement 
    opened by shawncao 3
  • We should capture all file paths in a spec's ID

    Replace paths_ with a new object, such as FileSplit which should capture {file path, offset, length} and these should be part of the spec identifier. @ritvik-statsig

    engine 
    opened by shawncao 3
  • aggregation on INT128 typed column crash nebula server

    I0214 06:39:03.635987 18 Dsl.cpp:226] Query output schema: STRUCT<flag:BOOLEAN,stamp.MAX:INT128> w/ agg columns: 1
    I0214 06:39:03.636165 18 Dsl.cpp:354] Nodes to execute the query: 1
    *** Aborted at 1613284743 (Unix time, try 'date -d @1613284743') ***
    *** Signal 11 (SIGSEGV) (0x0) received by PID 1 (pthread TID 0x7fe177d5d700) (linux TID 16) (code: 128), stack trace: ***
    (error retrieving stack trace)

    bug 
    opened by shawncao 3
  • Decouple time macro in source path from time spec

    Currently we support rolling specs with time MACROs (date, hour, min, second) by specifying a MACRO pattern in the time spec. I think we should decouple this to get better flexibility.

    For example, the spec below should be a legit spec:

    test-table:
      ...
      source: s3://xxx/{date}/{hour}/
      time:
          type: column
          column: col2
          pattern: UNIXTIME_MS
    

    This spec is basically asking us to scan an S3 file path with supported macros in it, while the time actually comes from an existing column. My understanding is that we don't support MACRO parsing if we don't specify it in the time spec.

    cc @chenqin

    enhancement 
    opened by shawncao 3
  • Decouple MACRO support from Time Spec

    While updating this piece of logic, I did some refactoring to move things around, mostly trying to relocate logic into common as much as possible, especially if it is shared.

    The change will support macros on the source only and try to materialize them with the supported (time-related) macros. The time spec will no longer bind to MACRO, but can use its watermark as its time value (at file level).

    enhancement api 
    opened by shawncao 2
  • Histogram values seem incorrect

    steps:

    • clone latest code from master and build it locally
    • run local stack by invoking ./run.sh
    • go to http://localhost:8088

    In the test set, run hist(value) only; the value distribution doesn't meet the expectation. I think the value column should be evenly distributed as it's randomly generated in the range [-128, 128]: https://github.com/varchar-io/nebula/blob/c8cdd6b9d6b74a76f1d6a70e7deb65a423eb4013/src/surface/MockSurface.cpp#L44

    So I believe every bucket should have a similar number of values in the histogram chart.

    Also, this query (value > 90) should produce non-zero buckets for values greater than 90; there's some bug out there! :)

    bug p1 
    opened by shawncao 2
  • refresh dev instruction to make a successful build

    I have tried this change on a fresh Ubuntu 18 machine twice, and it works fine. Could you follow dev.md from the beginning up to the step "build nebula"? Hopefully this time it works alright.

    enhancement 
    opened by shawncao 2
  • Icicle/Flame view doesn't capture the right canvas height on some machine

    I found the same view has an incorrect height on my laptop, while it shows okay on a Dell monitor.

    Also, the main color for icicle/flame is too dark (dark red), which is not good for this visual; we should make some adjustments to it.

    bug 
    opened by shawncao 2
  • [UDTF] need to support UDTF

    UDTF support is needed; the first good use case to test is the Word Cloud query.

    In the Word Cloud graph, users specify a "literal/sentence" column to rank all the words in it; this query can be represented as:

        select (explode(split(text)), sum(value))
    

    The initial design will be:

    1. Introduce the UDTF function and make sure the client can pass it in.
    2. Note that explode can only be applied to a LIST-typed field. In the above case, "text" is not, but "split(text)" is.
    3. Our current interface that defines query keys needs to be updated to match (nebula.proto)

    Implementation: ComputedRow is the interface that bridges internal data blocks to the computing buffer; internally it will invoke the UDTF to obtain a list value through the ValueEval interface. We should cache it on first use, and introduce a hasMore interface to let BlockExecutor loop over all of its combined values.

    Every UDTF should maintain a cursor; once the last combined value is served, hasMore shall return false. A conceptual sketch of this cursor behavior is shown below.
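    To illustrate the cursor semantics only (a conceptual JavaScript sketch, not the engine's C++ interfaces; all names here are made up), an explode-style UDTF can serve one combined value per call and signal exhaustion via hasMore:

        // conceptual sketch of a cursor-based explode UDTF
        const makeExplode = (listValue) => {
          let cursor = 0;
          return {
            hasMore: () => cursor < listValue.length,
            next: () => listValue[cursor++]
          };
        };

        // usage: expand one row's split(text) into multiple output values
        const udtf = makeExplode("hello nebula hello".split(" "));
        while (udtf.hasMore()) {
          console.log(udtf.next()); // hello, nebula, hello
        }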

    Bonus: while developing this new feature, we will also battle-test support for the LIST type, which we haven't tested thoroughly yet. Clearly, LIST/ARRAY could become a top-level supported column type afterwards to embrace a lot more use cases.

    Help wanted: this is super interesting work, but it will also require some time to flesh out all the details. It would be great if anyone is willing to contribute; I'd love to collaborate on this and offer as much help as I can.

    enhancement help wanted 
    opened by shawncao 0
  • [Enhancement] enrich histogram for string typed data

    Currently the histogram carries little information for a client. A few enhancements we should have for more intelligent clients:

    1. Integer columns should report distinct values beyond the current [min, max, sum, count], which is already good.
    2. String columns have only a poor count of values for now; we definitely need distinct values for the client to decide whether a column can be used as a suitable dimension. If possible, average string length would be good info too.

    In fact, having distinct values is great for the backend as well, since it can automatically switch to dictionary encoding to save massively on expensive memory consumption. So this will be very meaningful and useful work to add to the Nebula engine: basically, we should auto-switch to dictionary encoding when sealing each data block for all INT and STRING columns.

    enhancement engine 
    opened by shawncao 1
  • Introduce a flag to fail the whole spec if any file path failed to load

    In the multi-path scenario, we currently skip any bad file included in a spec and continue ingestion. But sometimes users may want to fail the whole spec if any file is bad, rather than continuing.

    We need to introduce a flag to allow users to specify that behavior in table settings. @ritvik-statsig

    engine 
    opened by shawncao 0
  • Support DateTime column

    Internally, let's use UNIX milliseconds (internal type size_t/int64) to store its value. But at the metadata layer, we could mark a column as DateTime so that we could apply different UDFs on this type of column, such as:

    • date - date of the month
    • day - day of the week
    • month - month value of the date-time
    • quarter - quarter value of the date-time

    As an alternative, we could use JS functions to implement these UDFs (a rough sketch is shown below); however, quickjs isn't accurate in test results unless we import a more advanced date lib into quickjs to support Date-related functions. Performance won't be as good as a native implementation either.
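    For reference, a rough JavaScript sketch of these UDFs over a UNIX-millisecond value (assuming UTC; illustrative only, not the engine's native implementation):

        // date/day/month/quarter from a UNIX-millisecond timestamp (UTC)
        const dateOfMonth = (ms) => new Date(ms).getUTCDate();                      // 1..31
        const dayOfWeek   = (ms) => new Date(ms).getUTCDay();                       // 0 (Sun) .. 6 (Sat)
        const monthOf     = (ms) => new Date(ms).getUTCMonth() + 1;                 // 1..12
        const quarterOf   = (ms) => Math.floor(new Date(ms).getUTCMonth() / 3) + 1; // 1..4

        // example: a timestamp falling on 2019-08-16 UTC
        console.log(dateOfMonth(1565994194000), monthOf(1565994194000), quarterOf(1565994194000)); // 16 8 3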

    p1 engine 
    opened by shawncao 0
  • test the file reported to cause OOM

    Hey @ritvik-statsig,

    I just tried to repro with the file you sent me; unfortunately, I cannot repro it. Please check the config I used, compare it with yours, and see if there is any possible guess.

    This is for investigating issue #163

    Steps:

    1. copy the file you sent to /tmp/oom.snappy.parquet
    2. add below config to use latest code to launch single-node nebula cluster
    3. watch:

    Data loaded correctly without issue as seen in Node log:

    I0302 22:41:05.525804 1830744064 TaskExecutor.cpp:128] processed task: macro.test@/tmp/[email protected]_I
    I0302 22:41:05.527215 1830744064 IngestSpec.cpp:466] Ingesting from /tmp/nebula.lNqmqH
    I0302 22:41:05.539216 1830744064 IngestSpec.cpp:547] Push a block: [raw: 817096, size: 1634020, allocation: 1634020, rows: 884, bess: 0]
    

    So the sample file only costs about 800KB of memory for 884 rows.

    Then I started the web UI, and the query works alright.

    opened by shawncao 0
Releases(0.1.24)
  • 0.1.24(Sep 19, 2022)

    In this release, we mainly changed "Time" from size_t to int64_t, so now we should be able to support UNIX times before the year 1970.

  • 0.1.22(Nov 24, 2020)

  • 0.1.6(Aug 10, 2020)

    nebula-lib is an npm package that you can use to build a web client connecting to the Nebula server. (Nebula server releases are managed via Docker images; please check Docker Hub for now.)

Owner
Columns AI
Columns AI makes data storytelling simple.