In this episode, we’ll go deeper into the definition of the ECMAScript language and its syntax. If you’re not familiar with context-free grammars, now is a good time to check out the basics, since the spec uses context-free grammars to define the language.
ECMAScript grammars
The ECMAScript spec defines four grammars:
- The lexical grammar describes how Unicode code points are translated into a sequence of input elements (tokens, line terminators, comments, white space).
- The syntactic grammar defines how syntactically correct programs are composed of tokens.
- The RegExp grammar describes how Unicode code points are translated into regular expressions.
- The numeric string grammar describes how Strings are translated into numeric values.
Each grammar is defined as a context-free grammar, consisting of a set of productions.
The grammars use slightly different notation: the syntactic grammar uses LeftHandSideSymbol : whereas the lexical grammar and the RegExp grammar use LeftHandSideSymbol :: and the numeric string grammar uses LeftHandSideSymbol :::.
Next we’ll look into the lexical grammar and the syntactic grammar in more detail.
Lexical grammar
The spec defines ECMAScript source text as a sequence of Unicode code points. For example, variable names are not limited to ASCII characters but can also include other Unicode characters. The spec doesn’t talk about the actual encoding (for example, UTF-8 or UTF-16). It assumes that the source code has already been converted into a sequence of Unicode code points according to the encoding it was in.
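To make the difference between code points and an encoding concrete, here is a small sketch. The sample string is only an illustration chosen for this example. It relies on the fact that a JavaScript string's length counts UTF-16 code units, while iterating over a string walks whole code points.

```javascript
// One ASCII letter followed by one character outside the Basic
// Multilingual Plane (U+1D4B3, MATHEMATICAL SCRIPT CAPITAL X).
const sample = "a\u{1D4B3}";

// length counts UTF-16 code units, so the second character counts twice.
const codeUnits = sample.length;

// The string iterator yields whole code points instead.
const codePoints = Array.from(sample, (ch) => ch.codePointAt(0));

console.log(codeUnits); // 3
console.log(codePoints.map((cp) => "U+" + cp.toString(16).toUpperCase()));
```

The grammars in the spec operate on the second view: two code points, regardless of how many bytes or code units the source file needed to store them.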
It’s not possible to tokenize ECMAScript source code in advance, which makes defining the lexical grammar slightly more complicated.
For example, we cannot determine whether / is the division operator or the start of a RegExp without looking at the larger context it occurs in:
const x = 10 / 5;
Here / is a DivPunctuator.
const r = /foo/;
Here the first / is the start of a RegularExpressionLiteral.
Templates introduce a similar ambiguity, since the interpretation of }` depends on the context it occurs in:
const what1 = 'temp';
const what2 = 'late';
const t = `I am a ${ what1 + what2 }`;
Here `I am a ${ is a TemplateHead and }` is a TemplateTail.
if (0 == 1) {
}`not very useful`;
Here } is a RightBracePunctuator and ` is the start of a NoSubstitutionTemplate.
Even though the interpretation of / and }` depends on their “context” (their position in the syntactic structure of the code), the grammars we’ll describe next are still context-free.
The lexical grammar uses several goal symbols to distinguish between the contexts where some input elements are permitted and some are not. For example, the goal symbol InputElementDiv is used in contexts where / is a division and /= is a division-assignment. The InputElementDiv productions list the possible tokens which can be produced in this context:
InputElementDiv ::
  WhiteSpace
  LineTerminator
  Comment
  CommonToken
  DivPunctuator
  RightBracePunctuator
In this context, encountering / produces the DivPunctuator input element. Producing a RegularExpressionLiteral is not an option here.
On the other hand, InputElementRegExp is the goal symbol for the contexts where / is the beginning of a RegExp:
InputElementRegExp ::
  WhiteSpace
  LineTerminator
  Comment
  CommonToken
  RightBracePunctuator
  RegularExpressionLiteral
As we see from the productions, this goal symbol can produce the RegularExpressionLiteral input element, but producing DivPunctuator is not possible.
Similarly, there is another goal symbol, InputElementRegExpOrTemplateTail, for contexts where TemplateMiddle and TemplateTail are permitted, in addition to RegularExpressionLiteral. And finally, InputElementTemplateTail is the goal symbol for contexts where only TemplateMiddle and TemplateTail are permitted but RegularExpressionLiteral is not permitted.
In implementations, the syntactic grammar analyzer (“parser”) may call the lexical grammar analyzer (“tokenizer” or “lexer”), passing the goal symbol as a parameter and asking for the next input element suitable for that goal symbol.
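The interaction between parser and tokenizer can be sketched roughly as follows. This is a heavily simplified illustration, not how any real engine is written: the function name nextInputElement, the returned object shape, and the string goal names are assumptions made for this example, and only two of the goal symbols are handled.

```javascript
// Hypothetical, heavily simplified tokenizer. It only distinguishes "/",
// regular expression literals, and runs of other non-space characters.
function nextInputElement(source, position, goal) {
  // Skip white space before the next input element.
  while (position < source.length && /\s/.test(source[position])) position++;
  if (position >= source.length) return null;

  if (source[position] === "/") {
    if (goal === "InputElementRegExp") {
      // Naive scan to the next "/". A real lexer also has to handle
      // escapes, character classes, and flags.
      const close = source.indexOf("/", position + 1);
      if (close === -1) throw new SyntaxError("Unterminated regular expression");
      return {
        type: "RegularExpressionLiteral",
        text: source.slice(position, close + 1),
        end: close + 1,
      };
    }
    // Any other goal is treated like InputElementDiv in this sketch.
    const text = source[position + 1] === "=" ? "/=" : "/";
    return { type: "DivPunctuator", text, end: position + text.length };
  }

  // Everything else is lumped into one placeholder token type.
  let end = position;
  while (end < source.length && !/[\s\/]/.test(source[end])) end++;
  return { type: "CommonToken", text: source.slice(position, end), end };
}

// The same characters produce different input elements under different goals.
const asDivision = nextInputElement("/foo/", 0, "InputElementDiv");
const asRegExp = nextInputElement("/foo/", 0, "InputElementRegExp");
console.log(asDivision.type, asDivision.text); // DivPunctuator /
console.log(asRegExp.type, asRegExp.text); // RegularExpressionLiteral /foo/
```

The point of the sketch is the third parameter: the tokenizer does not guess what / means, it is told by the caller which goal symbol applies at the current position.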
Syntactic grammar
We looked into the lexical grammar, which defines how we construct tokens from Unicode code points. The syntactic grammar builds on it: it defines how syntactically correct programs are composed of tokens.
Example: Allowing legacy identifiers
Introducing a new keyword to the grammar is a possibly breaking change — what if existing code already uses the keyword as an identifier?
For example, before await was a keyword, someone might have written the following code:
function old() {
  var await;
}
The ECMAScript grammar carefully added the await keyword in such a way that this code continues to work. Inside async functions, await is a keyword, so this doesn’t work:
async function modern() {
  var await; // Syntax error
}
Allowing yield as an identifier in non-generators and disallowing it in generators works similarly.
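These claims can be checked in a JavaScript engine without running the functions. The sketch below uses the Function constructor only to compile source text in non-strict, non-module mode; the helper name parses is an assumption made for this example.

```javascript
// Returns true if the source parses as a function body, false if the
// engine reports a SyntaxError. The declared functions are never called.
function parses(source) {
  try {
    new Function(source);
    return true;
  } catch (error) {
    if (error instanceof SyntaxError) return false;
    throw error;
  }
}

const results = {
  awaitInPlainFunction: parses("function old() { var await; }"),
  awaitInAsyncFunction: parses("async function modern() { var await; }"),
  yieldInPlainFunction: parses("function old() { var yield; }"),
  yieldInGenerator: parses("function* gen() { var yield; }"),
};
console.log(results);
// { awaitInPlainFunction: true, awaitInAsyncFunction: false,
//   yieldInPlainFunction: true, yieldInGenerator: false }
```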
Understanding how await is allowed as an identifier requires understanding ECMAScript-specific syntactic grammar notation. Let’s dive right in!
Productions and shorthands
Let’s look at how the productions for VariableStatement are defined. At first glance, the grammar can look a bit scary:
VariableStatement[Yield, Await] :
  var VariableDeclarationList[+In, ?Yield, ?Await] ;
What do the subscripts ([Yield, Await]) and prefixes (+ in +In and ? in ?Await) mean?
The notation is explained in the section Grammar Notation.
The subscripts are a shorthand for expressing a set of productions, for a set of left-hand side symbols, all at once. The left-hand side symbol has two parameters, which expand into four "real" left-hand side symbols: VariableStatement, VariableStatement_Yield, VariableStatement_Await, and VariableStatement_Yield_Await.
Note that here the plain VariableStatement means “VariableStatement without _Await and _Yield”. It should not be confused with VariableStatement[Yield, Await].
On the right-hand side of the production, we see the shorthand +In, meaning "use the version with _In", and ?Await, meaning “use the version with _Await if and only if the left-hand side symbol has _Await” (similarly with ?Yield).
The third shorthand, ~Foo, meaning “use the version without _Foo”, is not used in this production.
With this information, we can expand the productions like this:
VariableStatement :
  var VariableDeclarationList_In ;
VariableStatement_Yield :
  var VariableDeclarationList_In_Yield ;
VariableStatement_Await :
  var VariableDeclarationList_In_Await ;
VariableStatement_Yield_Await :
  var VariableDeclarationList_In_Yield_Await ;
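The expansion is mechanical, so it can be sketched in a few lines of code. The representation below (strings for terminals, objects with name and args for parameterized symbols) and the function names subsets and expand are assumptions made for this illustration; the spec itself describes the expansion only in prose.

```javascript
// Enumerate every subset of the parameter list, keeping declaration order.
function subsets(params) {
  return params.reduce(
    (acc, param) => [...acc, ...acc.map((subset) => [...subset, param])],
    [[]]
  );
}

// Turn ["Yield", "Await"] into "_Yield_Await".
const suffix = (names) => names.map((name) => "_" + name).join("");

// Expand one parameterized production into its plain productions.
function expand(lhs, params, rhs) {
  return subsets(params).map((active) => {
    const right = rhs.map((symbol) => {
      if (typeof symbol === "string") return symbol; // terminal such as var
      const kept = symbol.args.flatMap((arg) => {
        const prefix = arg[0];
        const name = arg.slice(1);
        if (prefix === "+") return [name]; // always use the version with it
        if (prefix === "~") return []; // always use the version without it
        if (prefix === "?") return active.includes(name) ? [name] : [];
        throw new Error("Unknown prefix in " + arg);
      });
      return symbol.name + suffix(kept);
    });
    return lhs + suffix(active) + " : " + right.join(" ");
  });
}

const expanded = expand("VariableStatement", ["Yield", "Await"], [
  "var",
  { name: "VariableDeclarationList", args: ["+In", "?Yield", "?Await"] },
  ";",
]);
console.log(expanded.join("\n"));
```

Running the sketch prints the same four productions as the manual expansion, one per line.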
Ultimately, we need to find out two things:
- Where is it decided whether we’re in the case with _Await or without _Await?
- Where does it make a difference, that is, where do the productions for Something_Await and Something (without _Await) diverge?
_Await or no _Await?
Let’s tackle question 1 first. It’s somewhat easy to guess that non-async functions and async functions differ in whether we pick the parameter _Await for the function body or not. Reading the productions for async function declarations, we find this:
AsyncFunctionBody :
  FunctionBody[~Yield, +Await]
Note that AsyncFunctionBody has no parameters: they get added to the FunctionBody on the right-hand side.
If we expand this production, we get:
AsyncFunctionBody :
  FunctionBody_Await
In other words, async functions have FunctionBody_Await, meaning a function body where await is treated as a keyword.
On the other hand, if we’re inside a non-async function, the relevant production is:
FunctionDeclaration[Yield, Await, Default] :
  function BindingIdentifier[?Yield, ?Await] ( FormalParameters[~Yield, ~Await] ) { FunctionBody[~Yield, ~Await] }
(FunctionDeclaration has another production, but it’s not relevant for our code example.)
To avoid combinatorial expansion, let’s ignore the Default parameter, which is not used in this particular production.
The expanded form of the production is:
FunctionDeclaration :
  function BindingIdentifier ( FormalParameters ) { FunctionBody }
FunctionDeclaration_Yield :
  function BindingIdentifier_Yield ( FormalParameters ) { FunctionBody }
FunctionDeclaration_Await :
  function BindingIdentifier_Await ( FormalParameters ) { FunctionBody }
FunctionDeclaration_Yield_Await :
  function BindingIdentifier_Yield_Await ( FormalParameters ) { FunctionBody }
In this production we always get FunctionBody and FormalParameters (without _Yield and without _Await), since they are parameterized with [~Yield, ~Await] in the non-expanded production.
The function name is treated differently: it gets the parameters _Await and _Yield if the left-hand side symbol has them.
To summarize: async functions have a FunctionBody_Await and non-async functions have a FunctionBody (without _Await). Since we’re talking about non-generator functions, both our async example function and our non-async example function are parameterized without _Yield.
Maybe it’s hard to remember which one is FunctionBody and which FunctionBody_Await. Is FunctionBody_Await for a function where await is an identifier, or for a function where await is a keyword?
You can think of the _Await parameter as meaning "await is a keyword". This approach is also future-proof. Imagine a new keyword, blob, being added, but only inside "blobby" functions. Non-blobby non-async non-generators would still have FunctionBody (without _Await, _Yield or _Blob), exactly like they have now. Blobby functions would have a FunctionBody_Blob, async blobby functions would have FunctionBody_Await_Blob and so on. We’d still need to add the Blob subscript to the productions, but the expanded forms of FunctionBody for already existing functions stay the same.
Disallowing await as an identifier
Next, we need to find out how await is disallowed as an identifier if we're inside a FunctionBody_Await.
We can follow the productions further to see that the _Await parameter gets carried unchanged from FunctionBody all the way to the VariableStatement production we were previously looking at.
Thus, inside an async function, we’ll have a VariableStatement_Await and inside a non-async function, we’ll have a VariableStatement.
We can follow the productions further and keep track of the parameters. We already saw the productions for VariableStatement:
VariableStatement[Yield, Await] :
  var VariableDeclarationList[+In, ?Yield, ?Await] ;
All productions for VariableDeclarationList just carry the parameters on as is:
VariableDeclarationList[In, Yield, Await] :
  VariableDeclaration[?In, ?Yield, ?Await]
(Here we show only the production relevant to our example.)
VariableDeclaration[In, Yield, Await] :
  BindingIdentifier[?Yield, ?Await] Initializer[?In, ?Yield, ?Await] opt
The opt shorthand means that the right-hand side symbol is optional; there are in fact two productions, one with the optional symbol, and one without.
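The effect of the opt shorthand can be sketched as code too. The representation of symbols as objects with an optional flag and the function name expandOptional are assumptions made for this example; grammar parameters are left out to keep the focus on opt.

```javascript
// Expand a right-hand side in which some symbols are marked optional.
// Every optional symbol doubles the number of alternatives: one with
// the symbol and one without it.
function expandOptional(rhs) {
  return rhs.reduce(
    (alternatives, symbol) =>
      symbol.optional
        ? alternatives.flatMap((alt) => [[...alt, symbol.name], alt])
        : alternatives.map((alt) => [...alt, symbol.name]),
    [[]]
  );
}

const alternatives = expandOptional([
  { name: "BindingIdentifier" },
  { name: "Initializer", optional: true },
]);

for (const alt of alternatives) {
  console.log("VariableDeclaration : " + alt.join(" "));
}
// VariableDeclaration : BindingIdentifier Initializer
// VariableDeclaration : BindingIdentifier
```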
In the simple case relevant to our example, VariableStatement consists of the keyword var, followed by a single BindingIdentifier without an initializer, and ending with a semicolon.
To disallow or allow await as a BindingIdentifier, we hope to end up with something like this:
BindingIdentifier_Await :
  Identifier
  yield
BindingIdentifier :
  Identifier
  yield
  await
This would disallow await as an identifier inside async functions and allow it as an identifier inside non-async functions.
But the spec doesn’t define it like this. Instead, we find this production:
BindingIdentifier[Yield, Await] :
  Identifier
  yield
  await
Expanded, this means the following productions:
BindingIdentifier_Await :
  Identifier
  yield
  await
BindingIdentifier :
  Identifier
  yield
  await
(We’re omitting the productions for BindingIdentifier_Yield and BindingIdentifier_Yield_Await, which are not needed in our example.)
This looks like await and yield would always be allowed as identifiers. What’s up with that? Is the whole blog post for nothing?
Static semantics to the rescue
It turns out that static semantics are needed for forbidding await as an identifier inside async functions.
Static semantics describe static rules, that is, rules that are checked before the program runs.
In this case, the static semantics for BindingIdentifier define the following syntax-directed rule:
BindingIdentifier[Yield, Await] : await
  It is a Syntax Error if this production has an [Await] parameter.
Effectively, this forbids the BindingIdentifier_Await : await production.
The spec explains that this production exists in the grammar, and is then made a Syntax Error by the static semantics, because leaving it out of the grammar would interfere with automatic semicolon insertion (ASI).
Remember that ASI kicks in when we’re unable to parse a line of code according to the grammar productions. ASI tries to add semicolons to satisfy the requirement that statements and declarations must end with a semicolon. (We’ll describe ASI in more detail in a later episode.)
Consider the following code (example from the spec):
async function too_few_semicolons() {
  let
  await 0;
}
If the grammar disallowed await as an identifier, ASI would kick in and transform the code into the following grammatically correct code, which also uses let as an identifier:
async function too_few_semicolons() {
  let;
  await 0;
}
This kind of interference with ASI was deemed too confusing, so static semantics were used for disallowing await as an identifier.
Disallowed StringValues of identifiers
There’s also another related rule:
BindingIdentifier : Identifier
  It is a Syntax Error if this production has an [Await] parameter and the StringValue of Identifier is "await".
This might be confusing at first. Identifier is defined like this:
Identifier :
  IdentifierName but not ReservedWord
await is a ReservedWord, so how can an Identifier ever be await?
As it turns out, Identifier cannot be await, but it can be something else whose StringValue is "await": a different representation of the character sequence await.
Static semantics for identifier names define how the StringValue of an identifier name is computed. For example, the Unicode escape sequence for a is \u0061, so \u0061wait has the StringValue "await". \u0061wait won’t be recognized as a keyword by the lexical grammar; instead it will be an Identifier. The static semantics forbid using it as a variable name inside async functions.
So this works:
function old() {
  var \u0061wait;
}
And this doesn’t:
async function modern() {
  var \u0061wait; // Syntax error
}
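Both snippets can be checked without running them. The sketch below compiles each one with the Function constructor; the helper name parses is an assumption made for this example, and String.raw is used so that the backslash reaches the parser instead of being resolved by the string literal.

```javascript
// Returns true if the source parses as a function body, false if the
// engine reports a SyntaxError. The declared functions are never called.
function parses(source) {
  try {
    new Function(source);
    return true;
  } catch (error) {
    if (error instanceof SyntaxError) return false;
    throw error;
  }
}

// String.raw keeps the escape sequence as written: \u0061wait, not await.
const escapedAwait = String.raw`\u0061wait`;

const inPlainFunction = parses(`function old() { var ${escapedAwait}; }`);
const inAsyncFunction = parses(`async function modern() { var ${escapedAwait}; }`);

console.log(inPlainFunction); // true
console.log(inAsyncFunction); // false
```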
Summary
In this episode, we familiarized ourselves with the lexical grammar, the syntactic grammar, and the shorthands used for defining the syntactic grammar. As an example, we looked into forbidding using await as an identifier inside async functions but allowing it inside non-async functions.
Other interesting parts of the syntactic grammar, such as automatic semicolon insertion and cover grammars, will be covered in a later episode. Stay tuned!