Go compiler source code: syntax analysis
2022-04-21 08:06:00 【BioIT】
Reading this article quickly requires a little knowledge of compiler principles and some experience using Go.
My own level is limited; if you find errors, please contact me so I can fix them.
The previous article covered reading the lexical analysis source code.
1. Syntax analysis
Syntax analysis involves a lot of background knowledge; this part reads more like a checklist that readers can use to look things up and study further.
- The input of parsing is the Token sequence: several Tokens are combined into a Node, and Nodes are combined in tree form into an abstract syntax tree (AST). After lexical analysis and syntax analysis, each file yields one AST.
- Rule representation:
  - Grammar: see Wikipedia; if the concept is unfamiliar, read up on it carefully first. What follows assumes basic knowledge of grammars.
  - We know that lexical rules (the production rules for Token strings) can be expressed with regular grammars, but parsing rules are clearly more complex, so context-free grammars are used to express them. Context-free grammars subsume regular grammars; their expressive power is strictly broader.
  - Context-free means that an identical string of Tokens is analyzed with the same derivation rules no matter where it appears; anything that depends on context has to be handled later, in semantic analysis.
  - Notations for writing context-free grammars: Backus-Naur Form (BNF) and Extended BNF (EBNF); EBNF additionally allows regular-expression operators. See the related Wikipedia articles.
- Algorithms: with the rules expressed, we still have to write code that follows them to analyze the Token sequence into an AST. Parsing algorithms fall into two families: top-down and bottom-up.
  - Top-down: representative algorithms are LL(1) and LL(k). LL stands for Left-to-right, Leftmost derivation, meaning the parser always expands the leftmost nonterminal, over and over. k is how many Tokens of lookahead are used to avoid traversing and backtracking through useless branches (a nonterminal may have many derivation branches), speeding up parsing.
    - If every branch begins with a terminal Token, things are easy: peeking one Token ahead is enough to decide which branch to take. But if a branch begins with a nonterminal, we must compute the First sets of these branches, and we enter a branch only when the lookahead Token is in its First set. A temporary convention here: uppercase letters are nonterminals, lowercase letters are terminals.
    - First sets, for A -> Branch1 | Branch2 | Branch3:
      - If a branch starts with a Token, the branch's First set is that Token.
      - If a branch starts with a nonterminal, the branch's First set is that nonterminal's First set.
      - Handling ε: if the branch's first nonterminal can derive ε, look further right and include the next element's First set as well. If every element of a branch can derive ε, the First set contains ε, and the Follow set has to be introduced.
      - First(A) is the union of the First sets of its branches.
    - Follow sets: for a nonterminal A, Follow(A) is the set of Tokens that may appear right after A. The computation rules:
      - For the start symbol S, Follow(S) contains $ (end of input).
      - For B -> aAC, Follow(A) includes First(C) with ε removed. Naturally, if A is directly followed by a terminal, that terminal joins Follow(A) directly.
      - For B -> aA, or for B -> aAC where C can derive ε, Follow(A) must include Follow(B).
- Example (a runnable First-set computation follows the sets below):
// For the following productions, compute both kinds of sets
// A is the start symbol
A -> BC | aB
B -> -C | ε
C -> 12 | c | ε
// First sets ("- ε" means with ε removed)
First(A) = (First(B) - ε) ∪ (First(C) - ε) ∪ {ε} ∪ {a} = {-, 12, c, ε, a}
First(B) = {-} ∪ {ε} = {-, ε}
First(C) = {12, c, ε}
// Follow sets
Follow(A) = {$}
Follow(B) = (First(C) - ε) ∪ Follow(A) = {12, c, $}
Follow(C) = Follow(A) ∪ Follow(B) = {12, c, $}
// Our example is simple and has no cycles. If there are cycles, use the
// fixed-point method: keep traversing and updating until the sets stabilize.
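To make the fixed-point remark concrete, here is a minimal runnable Go sketch that computes the First sets for this very grammar. The representation (the grammar map, the EPS marker, the firstSets function) is entirely made up for illustration; real parser generators are far more careful, and Follow sets fall out of a similar loop.
package main

import "fmt"

const EPS = "ε"

// grammar maps each nonterminal to its branches; every branch is a sequence
// of symbols. Anything that is not a key of this map is a terminal.
var grammar = map[string][][]string{
    "A": {{"B", "C"}, {"a", "B"}},
    "B": {{"-", "C"}, {EPS}},
    "C": {{"12"}, {"c"}, {EPS}},
}

func isNonterminal(sym string) bool { _, ok := grammar[sym]; return ok }

// firstSets applies the First-set rules repeatedly (fixed-point iteration)
// until no set changes anymore.
func firstSets() map[string]map[string]bool {
    first := make(map[string]map[string]bool)
    for nt := range grammar {
        first[nt] = make(map[string]bool)
    }
    for changed := true; changed; {
        changed = false
        add := func(nt, sym string) {
            if !first[nt][sym] {
                first[nt][sym] = true
                changed = true
            }
        }
        for nt, branches := range grammar {
            for _, branch := range branches {
                allEps := true // can every symbol scanned so far derive ε?
                for _, sym := range branch {
                    if !isNonterminal(sym) {
                        if sym != EPS {
                            add(nt, sym) // branch reaches a terminal: it joins First
                            allEps = false
                        }
                        break // a terminal or an explicit ε ends the scan
                    }
                    for t := range first[sym] { // copy First(sym) minus ε
                        if t != EPS {
                            add(nt, t)
                        }
                    }
                    if !first[sym][EPS] {
                        allEps = false
                        break // sym cannot vanish, stop looking further right
                    }
                }
                if allEps {
                    add(nt, EPS) // the whole branch can derive ε
                }
            }
        }
    }
    return first
}

func main() {
    first := firstSets()
    for _, nt := range []string{"A", "B", "C"} {
        fmt.Println("First("+nt+") =", first[nt])
    }
}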
Takeaway: if we want to use lookahead together with First and Follow sets to accelerate top-down parsing, then when designing grammar rules we should avoid common prefixes across branches. If branches do share a common prefix, it is best to factor it out into a new production, as sketched below; this avoids needing a very large k to decide which branch to take.
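For instance, an illustrative grammar fragment (not taken from the Go spec): the two branches below share the arbitrarily long prefix "if" Expr Block, so no fixed lookahead k can pick between them; factoring the shared prefix into a new production fixes this.
// Before: both branches share the prefix "if" Expr Block
IfStmt   -> "if" Expr Block | "if" Expr Block "else" Block
// After factoring out the common prefix:
IfStmt   -> "if" Expr Block ElsePart
ElsePart -> "else" Block | ε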
Bottom-up: representative algorithms are LR(k), LALR(k), etc. They are harder to implement than top-down algorithms but have more advantages; in particular, they naturally avoid left recursion (see below). Because Go uses a top-down algorithm, bottom-up parsing is not elaborated here; interested readers can study it separately.
- Implementing top-down parsing, in general terms: the implementation is simple, which is one of its great advantages.
  - The code defines one function per left-hand-side node of the grammar (per nonterminal). It reads like a chain of recursive function calls: "and" logic becomes successive calls, "or" logic becomes if-else or switch-case blocks.
  - Of course, we can use First/Follow sets with lookahead to reduce backtracking and speed up parsing.
  - For example, implement the following grammar rule, where the branches B C and C are alternatives (an "or" relationship), and inside the first branch B and C are sequenced (an "and" relationship):
// A -> B C | C
// Rough code logic (pseudocode: ANode, BOrNil, C are stand-ins)
func A() (aNode ANode) {
    aNode.B = BOrNil() // try the optional B; returns nil if the next Tokens cannot start a B
    aNode.C = C()      // both branches end with C, so C is always parsed
    return
}
- Problems the algorithm needs to solve (a combined runnable sketch follows these items):
  - Precedence: use a hierarchy of productions. A higher-level production has lower precedence than the one below it; the lower-level production is parsed into a child node and therefore evaluated first, i.e. it binds tighter:
add -> mul | add + mul
mul -> pri | mul * pri
- Associativity: this problem is not unique to top-down algorithms. The solution: for left associativity write the recursive term on the left, for right associativity write it on the right. Above, add has its recursive term on the left. If you instead write add -> mul | mul + add, then 1 + 2 + 3 is parsed as 1 + (2 + 3).
- Left recursion: because most operators are left-associative, writing the recursive term on the left causes infinite recursion in a top-down parser. The solution is to rewrite the grammar and then optimize the code implementation:
// If add -> mul | add + mul is implemented naively, it left-recurses:
func add() (addNode AddNode) {
    addNode.X = mul()
    // not a pure mul branch, so try the add + mul branch
    if curToken == "+" {
        // backtrack and call add(); at this point we discover the infinite recursion
        rewind()
        addNode.X = add()
        // remaining code ...
    }
    return
}
// Rewrite the grammar: add -> mul (+ mul)*
// Now we can keep consuming "+ mul". Note that the previous addNode must become
// the left child of the new addNode, otherwise associativity breaks.
// The approximate code:
for curToken == "+" {
    nextToken() // consume +
    newAddNode := AddNode{}
    newAddNode.X = addNode
    newAddNode.Y = mul()
    addNode = newAddNode
}
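Putting the three fixes together, here is a minimal runnable sketch: a recursive-descent parser for add -> mul ("+" mul)* and mul -> pri ("*" pri)*. The names (parser, Num, BinExpr) are invented for illustration, and the "lexer" is just a pre-split token slice; it is a sketch of the technique, not production code. Note how precedence falls out of the call structure (add calls mul, so * groups before +) and how the loop keeps the old node as the left child, giving left associativity.
package main

import "fmt"

// Minimal AST: a number literal or a binary operation.
type Expr interface{ String() string }

type Num struct{ Val string }
type BinExpr struct {
    Op   string
    X, Y Expr
}

func (n Num) String() string { return n.Val }
func (b BinExpr) String() string {
    return "(" + b.X.String() + " " + b.Op + " " + b.Y.String() + ")"
}

type parser struct {
    toks []string
    pos  int
}

func (p *parser) cur() string {
    if p.pos < len(p.toks) {
        return p.toks[p.pos]
    }
    return "" // EOF
}
func (p *parser) next() { p.pos++ }

// add -> mul ("+" mul)*   (left recursion rewritten as a loop)
func (p *parser) add() Expr {
    x := p.mul()
    for p.cur() == "+" {
        p.next() // consume "+"
        // the old node becomes the LEFT child: this keeps left associativity
        x = BinExpr{Op: "+", X: x, Y: p.mul()}
    }
    return x
}

// mul -> pri ("*" pri)*   (deeper in the grammar means higher precedence)
func (p *parser) mul() Expr {
    x := p.pri()
    for p.cur() == "*" {
        p.next() // consume "*"
        x = BinExpr{Op: "*", X: x, Y: p.pri()}
    }
    return x
}

// pri -> number
func (p *parser) pri() Expr {
    n := Num{Val: p.cur()}
    p.next()
    return n
}

func main() {
    p := &parser{toks: []string{"1", "+", "2", "*", "3", "+", "4"}}
    fmt.Println(p.add()) // prints ((1 + (2 * 3)) + 4)
}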
- Most of the problems above appear when parsing a language's expression nodes. Writing this by hand is quite inconvenient, but tools can help us.
  - Tool support: as with lexing, parsing algorithms are fairly fixed, so many tools let us write only the grammar rules with no code implementation. antlr was already mentioned in the lexical analysis article.
    - Go's context-free grammar written as an antlr-format file is linked at the end.
    - For precedence, antlr only requires putting the higher-precedence branches first, instead of splitting them into separate productions, e.g. addition and multiplication:
expr : primaryExpr
     | expr '*' expr
     | expr '+' expr
     ;
- For associativity, antlr needs nothing for left-associative operators; right-associative operators are marked explicitly, as sketched below.
- antlr obviously also takes care of the left recursion problem for us, with no special treatment needed on our side.
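For example, a sketch in ANTLR4 notation (the '^' power operator here is invented for illustration): alternatives listed earlier bind tighter, and the <assoc=right> option flips the default left associativity for one alternative.
expr : <assoc=right> expr '^' expr
     | expr '*' expr
     | expr '+' expr
     | primaryExpr
     ;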
2. Go syntax analysis: source code walkthrough
The Go parser lives in the src/cmd/compile/internal/syntax package.
When reading the source, we can skip error handling, Pos handling (the position structure carrying line/column information), debug and trace logic, and the processing of special comments (//go:, //line), to improve reading speed.
When reading the parser source, be sure to combine it with your own experience of writing Go. It even accepts some grammar you may never have written yourself.
- Part of Go's grammar rules and the AST node structures:
  - Only some grammar rules are shown here. In the parser.go file, the concrete rule appears above each function, so readers can consult all production rules there. You can also refer to the antlr-format Go grammar file mentioned above; they are essentially the same. A minimal example file follows the rules:
// The AST root node
SourceFile = PackageClause ";" { ImportDecl ";" } { TopLevelDecl ";" } .
// Package clause
PackageClause = "package" identifier .
// Import rules
ImportSpec = [ "." | PackageName ] ImportPath .
ImportPath = string_lit .
// Top-level declaration rules: constant, variable, type, method and function
// definitions. You can verify these against the code you normally write.
TopLevelDecl = Declaration | FunctionDecl | MethodDecl .
Declaration = ConstDecl | TypeDecl | VarDecl .
ConstSpec = IdentifierList [ [ Type ] "=" ExpressionList ] .
FunctionDecl = "func" FunctionName [ TypeParams ] ( Function | Signature ) .
MethodDecl = "func" Receiver MethodName ( Function | Signature ) .
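As a quick cross-check of these productions, a minimal source file containing one PackageClause, one ImportDecl and two TopLevelDecls looks like this (the ";" terminators are inserted automatically by the scanner at line ends):
package main // PackageClause

import "fmt" // ImportDecl with a single ImportSpec

const answer = 42 // TopLevelDecl -> Declaration -> ConstDecl

func main() { // TopLevelDecl -> FunctionDecl
    fmt.Println(answer)
}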
- When syntax parsing completes, each source file has been parsed into the following structure. This is an AST, or more precisely the root node of the tree (a small walking sketch follows the struct):
type File struct {
    Pragma   Pragma
    PkgName  *Name  // package name
    DeclList []Decl // list of child nodes: the various declarations
    EOF      Pos
    node
}
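A minimal sketch of consuming this structure (FuncDecl and Name are real node types from nodes.go, but the helper itself is invented for illustration, and it compiles only where the internal syntax package is importable):
// listFuncs prints the name of every top-level function or method declaration.
func listFuncs(f *syntax.File) {
    for _, decl := range f.DeclList {
        if fd, ok := decl.(*syntax.FuncDecl); ok {
            fmt.Println("func:", fd.Name.Value)
        }
    }
}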
- statement in Go: here I directly copy someone else's antlr-format statement rules, summarized from Go's grammar. The bodies of constructs such as functions and for loops are stmtLists composed of these (one example of each simpleStmt kind follows the block):
statement:
    declaration        // statements also include constant and variable declarations
    | labeledStmt      // labeled stmt; goto and similar statements jump to the matching label
    | simpleStmt
    | goStmt
    | returnStmt
    | breakStmt
    | continueStmt
    | gotoStmt
    | fallthroughStmt
    | block
    | ifStmt
    | switchStmt
    | selectStmt
    | forStmt
    | deferStmt;
simpleStmt:
    sendStmt
    | incDecStmt       // ++/-- statements; these actually share one kind of AST node
                       // with assignment and shortVarDecl, distinguished by attributes
    | assignment
    | expressionStmt   // an expression statement not subdivided further
    | shortVarDecl;
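For reference, one ordinary-Go example of each simpleStmt alternative (an illustrative snippet; the surrounding function exists only so the statements have something in scope):
func demo(ch chan int, v int, f func()) {
    var i, x int
    ch <- v // sendStmt
    i++     // incDecStmt
    x = 1   // assignment
    f()     // expressionStmt
    y := 2  // shortVarDecl
    _, _, _ = i, x, y // silence "declared and not used"
}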
- The parser struct: note that it embeds the lexical analyzer (scanner) and calls the scanner's next function to obtain the current token:
type parser struct {
    file *PosBase
    // Error handling function, called frequently during parsing.
    // If errh == nil: the first error is returned and the syntax tree is nil.
    // If errh != nil: the first error is recorded, and on every error the parser
    // tries to keep going, i.e. it parses on a best-effort basis.
    errh    ErrorHandler
    mode    Mode          // switches such as generics support
    pragh   PragmaHandler // directive handler, i.e. for //go: comments
    scanner               // embedded lexical analyzer

    base   *PosBase // current position base
    first  error    // first error encountered
    errcnt int      // number of errors encountered
    pragma Pragma   // pragmas: the result after pragh processing

    fnest int // function nesting level (for error handling)
    // Expression nesting level; incremented each time an expression-parsing
    // function is called. Careful readers will find places where xnest is set
    // to -1 (when parsing the header expression of if/switch: in `if a > 0 {}`,
    // a > 0 is the header expression). When it is below 0, some complex pexprs,
    // such as composite literals (struct initialization), are not allowed.
    xnest  int    // expression nesting level (for complit ambiguity resolution)
    indent []byte // tracing support
}
- Important files for Go parsing (a usage sketch follows this list):
  - syntax.go: the parsing entry point. It provides two functions that take an opened file handle or a file name and turn the file's source into an AST, plus the parser initialization logic. From it we can see that the starting function of parsing is fileOrNil(). syntax.go also defines the Mode type, an alias for uint, whose main job is to control switches such as generics.
  - pos.go: the position structure, used for error handling and AST node location information.
  - nodes.go: the structured representation of every kind of AST node, i.e. the data structures of the node types.
  - parser.go: the core parsing file. It implements the main logic of the Go parser. Go's parsing is a top-down algorithm that uses Token lookahead to speed up analysis; the starting function is fileOrNil().
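A hedged usage sketch of the two entry points in syntax.go (the signatures below match the package at the time of writing; since cmd/compile/internal/syntax is an internal package, ordinary modules cannot import it, so this only compiles inside the Go source tree or against a vendored copy):
package main

import (
    "fmt"

    "cmd/compile/internal/syntax" // only importable inside the Go source tree
)

func main() {
    // ParseFile opens and parses the named file; the handler receives every
    // error encountered, and parsing continues on a best-effort basis.
    errh := func(err error) { fmt.Println("parse error:", err) }
    f, err := syntax.ParseFile("main.go", errh, nil, syntax.CheckBranches)
    if err != nil {
        return
    }
    fmt.Println("package:", f.PkgName.Value, "decls:", len(f.DeclList))
}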
- A few frequently used functions and concepts worth understanding first:
  - got(tok token) bool: checks whether the current token is the expected token. This is exactly the lookahead we talked about:
func (p *parser) got(tok token) bool {
    if p.tok == tok {
        p.next()
        return true
    }
    return false
}
- want(tok token): reports an error if the current token is not the expected one (a combined usage sketch follows the code):
func (p *parser) want(tok token) {
    if !p.got(tok) {
        p.syntaxError("expecting " + tokstring(tok))
        p.advance()
    }
}
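A sketch of how got and want combine in a typical parsing routine (modeled on the parser's style; parenExpr itself is invented for illustration):
// Parse "(" expr ")": want reports a syntax error and re-synchronizes if the
// expected token is missing, while got consumes a token only if it is present.
func (p *parser) parenExpr() Expr {
    p.want(_Lparen)
    x := p.expr()
    p.want(_Rparen)
    return x
}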
- list(sep, close token, f func() bool) Pos: parses a list whose elements are separated by sep and ended by close, calling the element-parsing function f for each element. This method had already changed before this article was written, and the source updates often; minor changes don't matter much, so there is no need to chase the latest code while reading.
- Functions whose names end in OrNil return nil when parsing fails.
- group: multiple elements declared with one keyword belong to one group; for example, the imports below form one group:
import (
    "fmt"
    "io"
    "strconv"
    "strings"
)
- A brief look at the important parsing functions: functions written in the top-down style really are easy to read; we just read them against the grammar rules so that each verifies the other. The parser is over 2000 lines, so listing everything is impossible. Reading can focus on: fileOrNil, type parsing (typeOrNil), expression parsing (expr), and statement parsing (stmtOrNil).
  - fileOrNil: parses one source file into an AST (an error-triggering example follows the function):
func (p *parser) fileOrNil() *File {
    if trace {
        defer p.trace("file")()
    }

    f := new(File) // create the AST root
    f.pos = p.pos()

    // 1. PackageClause
    if !p.got(_Package) {
        p.syntaxError("package statement must be first")
        return nil
    }
    f.Pragma = p.takePragma()
    f.PkgName = p.name() // package packageName
    p.want(_Semi) // "package packageName" must be followed by a newline or a line-spanning comment
    // Summary of when a token is followed by _Semi:
    // (1) the source itself contains a ";"
    // (2) the previous token requires a ";" and is followed by a newline, or by a
    //     multi-line comment that spans lines or has nothing after it

    // don't bother continuing if package clause has errors
    // (if the package part failed to parse, there is no point going on)
    if p.first != nil {
        return nil
    }

    // 2. { ImportDecl ";" }. If something goes wrong, d.Path.Bad records it
    for p.got(_Import) {
        f.DeclList = p.appendGroup(f.DeclList, p.importDecl)
        p.want(_Semi) // note the trailing ";": a single import ends after its path; a group ends with ")"
    }

    // 3. { TopLevelDecl ";" }: the remaining four kinds of declarations
    for p.tok != _EOF {
        switch p.tok {
        case _Const: // constants
            p.next()
            f.DeclList = p.appendGroup(f.DeclList, p.constDecl)
        case _Type: // types
            p.next()
            f.DeclList = p.appendGroup(f.DeclList, p.typeDecl)
        case _Var: // variables
            p.next()
            f.DeclList = p.appendGroup(f.DeclList, p.varDecl)
        case _Func: // functions or methods
            p.next()
            if d := p.funcDeclOrNil(); d != nil {
                f.DeclList = append(f.DeclList, d)
            }
        default:
            if p.tok == _Lbrace && len(f.DeclList) > 0 && isEmptyFuncDecl(f.DeclList[len(f.DeclList)-1]) {
                // opening { of function declaration on next line
                p.syntaxError("unexpected semicolon or newline before {")
            } else {
                p.syntaxError("non-declaration statement outside function body")
            }
            p.advance(_Const, _Type, _Var, _Func)
            continue
        }
        p.clearPragma()
        if p.tok != _EOF && !p.got(_Semi) {
            p.syntaxError("after top level declaration")
            p.advance(_Const, _Type, _Var, _Func)
        }
    }
    // p.tok == _EOF
    p.clearPragma()
    f.EOF = p.pos()
    return f
}
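As a concrete illustration of the default branch above: the automatic semicolon after ")" turns the following (invalid) source into an empty function declaration followed by a stray "{" at the top level, which triggers the "unexpected semicolon or newline before {" error:
func f()   // the scanner inserts ";" at the end of this line
{          // this "{" now sits at the top level and hits the default branch
}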
- Now look at the expr-related functions. Pay attention to how Go handles the associativity problem, the left recursion problem, and the precedence problem (a worked trace follows the code):
func (p *parser) expr() Expr {
    if trace {
        defer p.trace("expr")()
    }
    // Note the 0 passed here: it lets us consume one complete expression
    return p.binaryExpr(nil, 0)
}
// Expression = UnaryExpr | Expression binary_op Expression .
// An expression is either a unary expression, or two expressions joined by a binary operator
func (p *parser) binaryExpr(x Expr, prec int) Expr {
    // x != nil means x is the Expr to the left of a binary operator;
    // when x is nil, take the unary-expression branch first
    if x == nil {
        x = p.unaryExpr()
    }
    // If an operator follows, take the "Expression binary_op Expression" branch.
    // Note that the UnaryExpr just parsed simply becomes the Expression to the
    // left of the binary operator; there is no backtracking and re-matching,
    // which is how left recursion is avoided.
    // A binary expression is formed only while the next token is an operator
    // whose precedence is greater than that of the current expression; this
    // takes care of the precedence problem.
    for (p.tok == _Operator || p.tok == _Star) && p.prec > prec {
        t := new(Operation)
        t.pos = p.pos()
        t.Op = p.op
        tprec := p.prec
        p.next() // consume the operator
        t.X = x  // associativity is handled here: for same-precedence operators,
                 // the previous node becomes the left child of the new node
        t.Y = p.binaryExpr(nil, tprec) // recursive call; note the precedence logic
        x = t
    }
    return x
}
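To see precedence and associativity at work, here is a hedged trace of 1 + 2 * 3 through binaryExpr (precAdd and precMul are the package's precedence constants; only their order matters here):
// expr() calls binaryExpr(nil, 0)
//   parses 1; sees "+" with precAdd > 0, so it starts t = (1 + ?)
//   recursion: binaryExpr(nil, precAdd)
//     parses 2; sees "*" with precMul > precAdd, so it starts (2 * ?)
//       recursion: binaryExpr(nil, precMul) parses 3; nothing binds tighter, returns 3
//     returns (2 * 3), which becomes t.Y
//   result: 1 + (2 * 3)
// For 1 + 2 + 3 the inner call sees "+", but precAdd > precAdd is false, so it
// returns immediately and the outer loop builds ((1 + 2) + 3): left associativity.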
- Readers can explore the rest of the code on their own; a link to my annotated copy of the source is given below.
Links:
- Annotated syntax analysis source code
- The Go Programming Language Specification, where grammar-related rules can be looked up
- The antlr-format Go grammar rule file mentioned in the text
Copyright notice: this article was written by [BioIT]. Please include the original link when reposting: https://yzsam.com/2022/04/202204210713024382.html