ANTLR (ANother Tool for Language Recognition) is a powerful parser generator for reading, processing, executing, or translating structured text or binary files.

Overview

ANTLR v4

Java 7+ License

ANTLR (ANother Tool for Language Recognition) is a powerful parser generator for reading, processing, executing, or translating structured text or binary files. It's widely used to build languages, tools, and frameworks. From a grammar, ANTLR generates a parser that can build parse trees and also generates a listener interface (or visitor) that makes it easy to respond to the recognition of phrases of interest.
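
As a rough illustration of the listener idea, the following plain-Python sketch walks a hand-built tree and fires enter/exit callbacks. It does not use the ANTLR runtime: `Node`, `Listener`, and `walk` are hypothetical names chosen for this example, whereas a generated ANTLR listener has one enter/exit pair per grammar rule.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """Hypothetical parse-tree node: a rule name plus child nodes."""
    rule: str
    children: list = field(default_factory=list)


class Listener:
    """Hypothetical listener that records the phrases it is notified about."""

    def __init__(self):
        self.events = []

    def enter(self, node):
        self.events.append("enter " + node.rule)

    def exit(self, node):
        self.events.append("exit " + node.rule)


def walk(listener, node):
    """Depth-first walk that calls the listener before and after each subtree."""
    listener.enter(node)
    for child in node.children:
        walk(listener, child)
    listener.exit(node)


tree = Node("expr", [Node("term"), Node("term")])
listener = Listener()
walk(listener, tree)
print(listener.events)
```

The application code lives entirely in the listener, so the same tree walk can drive different behaviors by swapping listeners.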

Useful information

You might also find the following pages useful, particularly if you want to mess around with the various target languages.

The Definitive ANTLR 4 Reference

Programmers run into parsing problems all the time. Whether it’s a data format like JSON, a network protocol like SMTP, a server configuration file for Apache, a PostScript/PDF file, or a simple spreadsheet macro language—ANTLR v4 and this book will demystify the process. ANTLR v4 has been rewritten from scratch to make it easier than ever to build parsers and the language applications built on top. This completely rewritten new edition of the bestselling Definitive ANTLR Reference shows you how to take advantage of these new features.

You can buy the book The Definitive ANTLR 4 Reference at Amazon or an electronic version at the publisher's site.

You will find the Book source code useful.

Additional grammars

This repository is a collection of grammars without actions, where the root directory name is the all-lowercase name of the language parsed by the grammar: for example, java, cpp, csharp, c, etc.

Comments
  • New extended Unicode escape \u{10ABCD} to support Unicode literals > U+FFFF

    Fixes #276 .

    This used to be a WIP PR, but it's now ready for review.

    This PR introduces a new extended Unicode escape \u{10ABCD} in ANTLR4 grammars to support Unicode literal values > U+FFFF.

    The serialized ATN represents any atom or range with a Unicode value > U+FFFF as a set. Any such set is serialized in the ATN with 32-bit arguments.

    I bumped the UUID, since this changes the serialized ATN format.

    I included lots of tests and made sure everything is passing on Linux, Mac, and Windows.

    type:feature unicode 
    opened by bhamiltoncx 115
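
To make the escape form in the PR above concrete, here is a minimal Python sketch that decodes `\u{...}` escapes and reports whether the result contains code points above U+FFFF. It is an illustration only and says nothing about how the ANTLR tool itself parses grammars; `decode_extended_escapes` and `needs_32_bit` are hypothetical helper names.

```python
import re

# Matches the extended escape form \u{...} with 1 to 6 hex digits.
_EXTENDED_ESCAPE = re.compile(r"\\u\{([0-9A-Fa-f]{1,6})\}")


def decode_extended_escapes(literal):
    """Replace each \\u{XXXXXX} escape with the code point it names."""
    def replace(match):
        code_point = int(match.group(1), 16)
        if code_point > 0x10FFFF:
            raise ValueError("code point out of Unicode range: " + match.group(0))
        return chr(code_point)
    return _EXTENDED_ESCAPE.sub(replace, literal)


def needs_32_bit(text):
    """True if any code point lies above U+FFFF."""
    return any(ord(ch) > 0xFFFF for ch in text)


decoded = decode_extended_escapes(r"a\u{1F600}b")
print([hex(ord(ch)) for ch in decoded])
print(needs_32_bit(decoded))
```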
  • Terrible Golang performance

    Stackoverflow: https://stackoverflow.com/questions/72266899/golang-performance-issues

    Google group: https://groups.google.com/g/antlr-discussion/c/OdhAIsy2GfI

    Example code: https://github.com/movelazar/perf-repro

    A simple rule such as:

    1 EQ 2 OR
    1 EQ 2 OR
    1 EQ 2 OR
    1 EQ 2 OR
    1 EQ 2
    

    takes exponentially longer to parse the more 1 EQ 2 OR clauses there are. This does not happen in Python (by my testing) or in CSharp, Dart, or Java (per a Stack Overflow comment).

    On my machine, # of lines vs parse time:

    11: 0.5s
    12: 1.2s
    13: 3.2s
    14: 8.1s
    15: 21.9s
    16: 57.5s
    

    Given that Python doesn't face this issue I can't imagine I'm doing something terrible in my grammar.

    Issue goes away if I put parens on things but that's not a real solution.

    On 4.10.1, first noticed with 4.9.1.

    Any help is greatly appreciated. Surprised I can't find others with this issue.

    type:bug target:go comp:performance 
    opened by movelazar 101
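
A quick way to check the "exponential" claim in the report above is to divide each reported timing by the previous one. The sketch below uses only the numbers quoted in the report; a roughly constant ratio per added line is the signature of exponential growth.

```python
# Timings (seconds) copied from the report above, keyed by number of input lines.
timings = {11: 0.5, 12: 1.2, 13: 3.2, 14: 8.1, 15: 21.9, 16: 57.5}

lines = sorted(timings)
# Ratio of each timing to the previous one.
ratios = [timings[b] / timings[a] for a, b in zip(lines, lines[1:])]

for n, ratio in zip(lines[1:], ratios):
    print(f"{n} lines: {ratio:.2f}x slower than {n - 1} lines")
```

Every ratio falls between about 2.4 and 2.7, so each extra clause multiplies the parse time by a similar factor.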
  • splitting version numbers for targets

    Hiya: @pboyer, @mike-lischke, @janyou, @ewanmellor, @hanjoes, @ericvergnaud, @lingyv-li, @marcospassos

    Eric has raised the point that it would be nice to be able to make quick patches to the various runtimes; e.g., there is a stopping bug now in the JavaScript target. He proposes something along these lines:

    • any change in the tool or the runtime algorithm bumps the middle version #: 4.9 -> 4.10 -> 4.11
    • any bug fix in a runtime we bump the last digit of that runtime only: 4.9 -> 4.9.1 -> 4.9.2
    • if bumping the java runtime for bug fix we also bump the tool since it contains the runtime

    This is not optimal, as people have criticized me in the past for bumping, say, 4.6 to 4.7 for some minor changes. It also has the problem that 4.9.x may not mean the same thing in two different targets, as each target will now have its own version number.

    Rather than break up all of the targets into separate repositories or similar, can you guys think of a better solution? Any suggestions? The goal here is to allow more rapid target releases, independent of me having to do a major release of the tool.

    type:question 
    opened by parrt 94
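
The proposed numbering rules can be sketched as a small function. This is only an illustration of the scheme described in the issue above; `bump` and its `change` values are hypothetical names, not part of any ANTLR release tooling.

```python
def bump(version, change):
    """Return the next version string under the proposed scheme.

    change == "tool":    tool or runtime-algorithm change, bump the middle number.
    change == "runtime": bug fix in a single runtime, bump the last number.
    """
    parts = [int(p) for p in version.split(".")]
    while len(parts) < 3:
        parts.append(0)  # treat "4.9" as "4.9.0"
    major, minor, patch = parts
    if change == "tool":
        return f"{major}.{minor + 1}"
    if change == "runtime":
        return f"{major}.{minor}.{patch + 1}"
    raise ValueError("unknown change kind: " + change)


print(bump("4.9", "tool"))       # tool change
print(bump("4.9", "runtime"))    # first runtime bug fix
print(bump("4.9.1", "runtime"))  # second runtime bug fix
```

Under this scheme two targets can both be at 4.9.2 while containing different fixes, which is the drawback raised in the issue.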
  • Improve memory usage and perf of CodePointCharStream: Use 8-bit, 16-bit, or 32-bit buffer

    This greatly improves the memory usage and performance of CodePointCharStream by ensuring the internal storage uses either an 8-bit buffer (for Unicode code points <= U+00FF), 16-bit buffer (for Unicode code points <= U+FFFF), or a 32-bit buffer (Unicode code points > U+FFFF).

    I split out the internal storage into a class CodePointBuffer which has a CodePointBuffer.Builder class which has the logic to upgrade from 8-bit to 16-bit to 32-bit storage.

    I found the perf hotspot in CodePointCharStream on master was the virtual method calls from CharStream.LA(offset) into IntBuffer.

    Refactoring it into CodePointBuffer didn't help (in fact, it added another virtual method call).

    To fix the perf, I made CodePointCharStream an abstract class and made three concrete subclasses: CodePoint8BitCharStream, CodePoint16BitCharStream, and CodePoint32BitCharStream which directly access the array of underlying code points in the CodePointBuffer without virtual method calls.

    lexers target:java comp:performance 
    opened by bhamiltoncx 85
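
The storage choice described in the PR above can be sketched in a few lines of Python. This is not the Java `CodePointBuffer` implementation: it scans the whole input first instead of upgrading incrementally as the Builder in the PR does, and `build_code_point_buffer` is a hypothetical name.

```python
from array import array


def _typecode_for(item_size):
    """Find an unsigned array typecode with the requested item size in bytes."""
    for code in "BHIL":
        if array(code).itemsize == item_size:
            return code
    raise RuntimeError(f"no {item_size}-byte unsigned array type on this platform")


def build_code_point_buffer(text):
    """Store the code points of text in the narrowest of 8-, 16-, or 32-bit storage."""
    largest = max((ord(ch) for ch in text), default=0)
    if largest <= 0xFF:
        item_size = 1
    elif largest <= 0xFFFF:
        item_size = 2
    else:
        item_size = 4
    return array(_typecode_for(item_size), (ord(ch) for ch in text))


for sample in ("abc", "a\u03b1", "a\U0001F600"):
    buf = build_code_point_buffer(sample)
    print(len(buf), "code points,", buf.itemsize * 8, "bits each")
```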
  • initial discussion to start integration of new targets

    As promised, I am now ready to integrate the new ANTLR target languages you folks have been working on. This issue is meant to get everybody in sync, check status, and discuss the proper order of integration and resolve issues etc.

    There are two administrative details to get out of the way first:

    1. Please let me know if there is another github user that should be added to one of the categories. Or, of course, if you would like your user ID removed from this discussion.
    2. Nothing can be merged into antlr/antlr4 unless every single committer has added themselves to the contributors.txt file. It's onerous, particularly for simple commits, but it is a requirement for anything merged into master. Eclipse foundation lawyers tell me that we have one of the cleanest licenses out there and it contributes to ANTLR's widespread use because companies are not afraid to use the software. See the genesis of such heinous requirements in SCO v IBM. This means lead target authors have to go back through their committers list quickly and ask them to sign the contributors file with a new commit. Or, they can remove that commit and enter their own version of the functionality, being careful not to violate copyright on the previous version.

    As we proceed, please keep in mind that I have a difficult role, balancing the needs of multiple targets and keeping discussions in the civil and practical zone. Decisions I make come from the perspective of over 25 years managing and leading this project. I look forward to incorporating your hard work into the main antlr repo.

    C++ current location

    • @mike-lischke
    • @DanMcLaughlin
    • @nburles
    • @davesisson

    Go current location, previous discussion

    • @pboyer

    Swift current location: unclear, previous discussion

    • @jeffreyguenther
    • @hanjoes
    • @janyou
    • @ewanmellor

    Likely interested/supporting humans (scraped from github issues):

    • @RYDB3RG
    • @wjkohnen
    • @willfaught
    • @parrt
    • @sharwell
    • @ericvergnaud
    type:improvement target:swift target:cpp target:go 
    opened by parrt 84
  • Add a new CharStream that converts the symbols to upper or lower case.

    This is useful for many of the case insensitive grammars found at https://github.com/antlr/grammars-v4/ which assume the input would be all upper or lower case. Related discussion can be found at https://github.com/antlr/antlr4test-maven-plugin/issues/1

    It would be used like so:

    input, _ := antlr.NewFileStream("filename")
    
    in := antlr.NewCaseChangingStream(input, true) // true forces upper case symbols, false forces lower case.
    
    lexer := parser.NewAbnfLexer(in)
    

    While writing this, I found other people have written their own similar implementations (go, java). It makes sense to place this in the core, so everyone can use it.

    I would love for the grammar to have an option that says the lexer should upper/lower case all input, and then this code could be moved into the generated Lexer, and no user would need to explicitly use a CaseChangingStream (similar to what's discussed in #1002).

    lexers comp:runtime target:java target:javascript target:go 
    opened by bramp 69
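
The idea behind a case-changing stream can be sketched without the ANTLR runtime. In the hypothetical `CaseChangingText` class below, lookahead returns case-folded symbols while the original text stays available. That split is an assumption of this sketch rather than something the PR text above states, and only forward lookahead is modeled.

```python
class CaseChangingText:
    """Hypothetical stand-in for a char stream that folds case on lookahead only."""

    EOF = -1

    def __init__(self, text, upper):
        self.text = text
        self.upper = upper
        self.index = 0

    def la(self, offset):
        """Return the case-folded code point `offset` symbols ahead (1-based), or EOF."""
        position = self.index + offset - 1
        if position < 0 or position >= len(self.text):
            return self.EOF
        ch = self.text[position]
        folded = ch.upper() if self.upper else ch.lower()
        # Keep the original symbol if folding changes the length (e.g. 'ß' -> 'SS').
        if len(folded) != 1:
            folded = ch
        return ord(folded)

    def consume(self):
        """Advance past the current symbol."""
        self.index += 1

    def get_text(self, start, stop):
        """Return the original, unmodified text for positions start..stop inclusive."""
        return self.text[start:stop + 1]


stream = CaseChangingText("Select", upper=True)
print(chr(stream.la(1)), chr(stream.la(2)), stream.get_text(0, 5))
```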
  • Swift Target

    I did a quick search and I didn't see anything written about this yet. What's the likelihood of a Swift target for ANTLR?

    There are C#, JavaScript, and Python targets at the moment.

    What does it take to implement a target? Given that Swift is more Java-like, it seems like it should be possible. Maybe start with a code translator, if there is one (Java to Swift), and iterate towards a more idiomatic implementation.

    type:question 
    opened by jeffreyguenther 69
  • Clean up ATN serialization: rm UUID and shifting by value of 2

    • I think we don't need the UUID in the serialization, since it has not changed in a decade. We can bump the version number and remove the UUID.
    • I did some tests and there seems to be no reason to shift the values in the serialized ATN by 2 for the purposes of improving the UTF-8 encoding for the Java target.

    If you guys agree, we can make this small change for cleanup purposes. I'm happy to do it if you guys don't want to. The second fix will require changes to each target but it's trivial to fix.

    atn-analysis type:cleanup 
    opened by parrt 68
  • Preparing for 4.9.3 release

    It's that time of year again! @pboyer, @mike-lischke, @janyou, @ewanmellor, @hanjoes, @ericvergnaud, @lingyv-li, @marcospassos Shall we do a 4.9.3 release?

    I went through and marked all of the merged PRs and related issues with 4.9.3 and tried to tag them according to their target. Would you guys like to go through the PRs to see if there's something that should be merged quickly?

    opened by parrt 68
  • [CSharp] #2021 fixes nuget packaging options to avoid missing dll exceptions

    @ericvergnaud Hi, I modified csproj options a bit, now I can get a working nuget package locally without the issue we described in #2021. I added .net 3.5 as a target to "main" csproj along with netstandard, since it's easier to keep track of requirements for both sets of api's when editing code and, ideally, both targets can be packed into a nuget package with a single command. Right now it's possible only on Windows via msbuild /t:pack or Visual Studio; unfortunately, due to https://github.com/Microsoft/msbuild/issues/1333, right now dotnet build pack does not work for .net 3.5 target the way it should, so I adjusted the existing script to create packages from .nuspec and different solutions for different targets.

    comp:build target:csharp 
    opened by listerenko 68
  • A few updates to the Unicode documentation.

    It should be made clear that the recommended use of CharStreams.fromPath() is a Java-only solution. The other targets just have their ANTLRInputStream class extended to support full Unicode.

    comp:doc 
    opened by mike-lischke 61
  • Assorted problems in calling the wrong wrapper for reachesIntoOuterContext

    This is a serious bug in the CSharp runtime, ParserATNSimulator.cs.

    reachesIntoOuterContext is an integer that tracks the depth of how far we dip into the outer context. This field must be interpreted with the bit map mask SUPPRESS_PRECEDENCE_FILTER, and careful attention placed on whether to access the field raw or with the bit map mask.

    In Java, this code accesses the field raw.

    In CSharp, the field is accessed through a method that applies the bit map mask. This is a serious problem in that it causes ATN parser trace divergence.

    I did a cursory check on the other targets, and I am not confident the field access is done correctly across targets.

    The fix is to change ParserATNSimulator.cs:

    diff --git a/runtime/CSharp/src/Atn/ParserATNSimulator.cs b/runtime/CSharp/src/Atn/ParserATNSimulator.cs
    index 4a7a6a3d5..8463bfcc3 100644
    --- a/runtime/CSharp/src/Atn/ParserATNSimulator.cs
    +++ b/runtime/CSharp/src/Atn/ParserATNSimulator.cs
    @@ -1579,7 +1579,7 @@ namespace Antlr4.Runtime.Atn
     						// This assignment also propagates the
     						// isPrecedenceFilterSuppressed() value to the new
     						// configuration.
    -						c.reachesIntoOuterContext = config.OuterContextDepth;
    +                                                c.reachesIntoOuterContext = config.reachesIntoOuterContext;
     						ClosureCheckingStopState(c, configSet, closureBusy, collectPredicates,
     												 fullCtx, depth - 1, treatEofAsEpsilon);
     					}
    

    (Note, the source code should really not be using tabs. Tab expansion is editor specific.)

    atn-analysis type:bug target:csharp 
    opened by kaby76 3
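
The difference between raw and masked access described in the report above can be shown with a reduced model. The `Config` class is a hypothetical stand-in, and the flag value is an assumption mirroring the constant of the same name in the Java runtime. The point is that copying through the masked accessor silently drops the flag bit.

```python
# Assumed flag bit, mirroring the constant of the same name in the Java runtime.
SUPPRESS_PRECEDENCE_FILTER = 0x40000000


class Config:
    """Hypothetical, reduced stand-in for an ATN configuration."""

    def __init__(self, reaches_into_outer_context=0):
        # Raw field: depth in the low bits, plus the flag bit.
        self.reaches_into_outer_context = reaches_into_outer_context

    @property
    def outer_context_depth(self):
        """Depth only, with the flag bit masked off."""
        return self.reaches_into_outer_context & ~SUPPRESS_PRECEDENCE_FILTER

    @property
    def precedence_filter_suppressed(self):
        """True if the flag bit is set in the raw field."""
        return (self.reaches_into_outer_context & SUPPRESS_PRECEDENCE_FILTER) != 0


source = Config(2 | SUPPRESS_PRECEDENCE_FILTER)

masked_copy = Config(source.outer_context_depth)        # copies depth, loses the flag
raw_copy = Config(source.reaches_into_outer_context)    # copies depth and flag

print(masked_copy.precedence_filter_suppressed, raw_copy.precedence_filter_suppressed)
```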
  • Python runtime test failure with 4.10 and later

    After upgrading the antlr4 python runtime past 4.9.1 (tested 4.10.1 and 4.11.1) in nixpkgs, we've been seeing its test suite fail with

    Traceback (most recent call last):
      File "/build/source/runtime/Python3/tests/ctest.py", line 10, in <module>
        from parser.cparser import CParser
      File "/build/source/runtime/Python3/tests/parser/cparser.py", line 631, in <module>
        class CParser ( Parser ):
      File "/build/source/runtime/Python3/tests/parser/cparser.py", line 635, in CParser
        atn = ATNDeserializer().deserialize(serializedATN())
      File "/nix/store/ch8i929c63av55h9nxkinifh61mazf1h-python3.10-antlr4-python3-runtime-4.11.1/lib/python3.10/site-packages/antlr4/atn/ATNDeserializer.py", line 28, in deserialize
        self.checkVersion()
      File "/nix/store/ch8i929c63av55h9nxkinifh61mazf1h-python3.10-antlr4-python3-runtime-4.11.1/lib/python3.10/site-packages/antlr4/atn/ATNDeserializer.py", line 50, in checkVersion
        raise Exception("Could not deserialize ATN with version " + str(version) + " (expected " + str(SERIALIZED_VERSION) + ").")
    Exception: Could not deserialize ATN with version  (expected 4).
    

    The version returned seems to be \x03, while the test suite expects it to be 4.

    While similar to #3997 and #3895 we are running the test suite, not our own code.

    We're running python ctest.py from the tests directory.

    opened by mweinelt 4
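
The apparently empty version in the error message above is consistent with the version being the unprintable character `\x03`, as the report notes. The sketch below reproduces that effect with a simplified check. `check_version` is a hypothetical reduction of the deserializer logic, and the two sample payloads are assumptions about what an older and a newer generated parser might carry.

```python
SERIALIZED_VERSION = 4  # expected version, as shown in the error message


def check_version(serialized):
    """Raise if the first element of the serialized ATN is not the expected version."""
    version = serialized[0]
    if version != SERIALIZED_VERSION:
        raise Exception("Could not deserialize ATN with version " + str(version)
                        + " (expected " + str(SERIALIZED_VERSION) + ").")


# Assumption: a parser generated by an older tool carries its ATN as a string
# whose first character is '\x03'; a newer one carries a list of integers.
old_style = "\x03rest-of-data"
new_style = [4, 0, 1]

check_version(new_style)  # passes silently
try:
    check_version(old_style)
except Exception as error:
    message = str(error)
    print(repr(message))
```

Printing the message with `repr` makes the hidden `\x03` visible, which suggests checking whether the parser files under test were generated by a tool version older than the runtime.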
  • Suggestion for the Test Rig GUI tool

    This is not a defect or reporting a problem. It is a suggestion that the GUI tool have a new button added between "OK" and "Export as PNG". That can be named "Reload".

    The idea being that I can edit/save a test source file and just click "reload" rather than doing what I do which is close the tool and go back to my command window and rerun the batch file that starts the GUI.

    Thoughts?

    opened by Korporal 0
  •  invalid syntax `self.from = None` in python3 generated parsers

    Please include the following information

    python3 antlr4 version is 4.9.3

    • smallest possible grammar and code that reproduces the behavior
    import sys
    import antlr4
    from PrestoSqlLexer import PrestoSqlLexer
    from PrestoSqlParser import PrestoSqlParser
    
    
    def main():
        input_stream = antlr4.InputStream('select 1')
        lexer = PrestoSqlLexer(input_stream)
        tokens = antlr4.CommonTokenStream(lexer)
        tokens.fill()
        print([token.text for token in tokens.tokens][:-1])
    
    
    if __name__ == '__main__':
        main()
    
    • description of the expected behavior and actual behavior. Pointers to suspicious code regions are also very welcome.

    error:

    Traceback (most recent call last):
      File "/Users/clients/autocomplete/parseQuery.py", line 4, in <module>
        from PrestoSqlParser import PrestoSqlParser
      File "/Users/clients/autocomplete/PrestoSqlParser.py", line 1837
        self.from = None # QualifiedNameContext
             ^^^^
    SyntaxError: invalid syntax
    
    opened by chengchengpei 2
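
The syntax error above arises because `from` is a reserved word in Python, so it cannot be used as an attribute name in generated code. As a sketch of one possible mitigation, and not a description of how ANTLR handles this, a code generator could check names against Python's keyword list. `safe_attribute_name` is a hypothetical helper.

```python
import keyword


def safe_attribute_name(label):
    """Return label unchanged unless it is a Python keyword, then append '_'."""
    if keyword.iskeyword(label):
        return label + "_"
    return label


for label in ("from", "select", "class", "where"):
    print(label, "->", safe_attribute_name(label))
```

On the grammar side, renaming the offending label to something that is not a Python keyword avoids the clash.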
Releases(4.11.1)
Owner
Antlr Project
The Project organization for the ANTLR parser generator.