SwiftFormat (Part 2 of 3)

In part 1 I talked about recursive descent parsers, and how they can be used to process complex structured text such as a programming language. Let’s now take a look at how SwiftFormat’s parser is implemented.

Note: This is the second part of a three-part series about SwiftFormat, a formatting tool for Swift source code. The first part can be found here.

A Token Gesture

The lexing phase of SwiftFormat is implemented in the file Tokenizer.swift. Tokens are defined as a struct, consisting of a type and a string value:

The type property is an instance of the TokenType enum, defined as follows:
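A minimal sketch of what those two definitions might look like. The case names below are assumptions chosen for illustration and may not match Tokenizer.swift exactly:

```swift
// Hypothetical token types: a deliberately short list
enum TokenType {
    case number
    case linebreak
    case startOfScope
    case endOfScope
    case op            // any operator or punctuation symbol
    case stringBody
    case identifier    // also covers keywords
    case whitespace
    case commentBody
}

// A token pairs a type with the source text it was parsed from
struct Token: Equatable {
    let type: TokenType
    let string: String
}

let token = Token(type: .identifier, string: "foo")
```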

The set of possible token types is deliberately short, and I’ve avoided creating new token types where the distinction can be trivially made by inspecting the string value instead. This makes it easy to handle groups of tokens collectively without needing switch statements everywhere, e.g., if I want to apply a rule to every operator, I can just say

instead of
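As a rough illustration of that contrast (the type and case names here are assumptions, not necessarily SwiftFormat’s own), the first function below checks a single coarse-grained type, while the second shows the kind of switch that a finer-grained enum would force on every rule:

```swift
// Coarse-grained types: every operator shares one case
enum TokenType { case identifier, op, whitespace }
struct Token { let type: TokenType; let string: String }

func isOperator(_ token: Token) -> Bool {
    return token.type == .op
}

// Hypothetical fine-grained alternative: one case per operator
enum FineTokenType { case plus, minus, times, divide, identifier, whitespace }

func isOperator(_ type: FineTokenType) -> Bool {
    switch type {
    case .plus, .minus, .times, .divide:
        return true
    case .identifier, .whitespace:
        return false
    }
}
```

With the coarse-grained version, telling + apart from - is still possible where it matters, by comparing the token’s string value.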

A compiler might need more granular token types, such as a separate keyword type to verify that reserved words aren’t being used in the wrong context. But for SwiftFormat that would involve keeping track of all known Swift keywords, which is made difficult by the facts that:

  • Swift is still under active development and new keywords are being added all the time
  • Many keywords are contextual. For example, the words below are considered keywords in some contexts, but are free for use as function, type or variable names in others: associativity, convenience, dynamic, didSet, final, get, infix, indirect, lazy, left, mutating, none, nonmutating, optional, override, postfix, precedence, prefix, Protocol, required, right, set, unowned, weak, willSet

In any case, the distinction between keywords and other identifiers is not that useful for formatting purposes – they are often treated the same in terms of spacing, and when they aren’t, the treatment depends on the specific keyword, not the fact that it is a keyword.

Tokens are generated from the source code by calling the parseToken() function. This is implemented as a mutating extension method on String.CharacterView:

String.CharacterView is the type given to the characters property of String. This is essentially an array of Unicode characters – a more convenient interface for the lexer to work with than the String object itself.

We need the mutating prefix to make it possible for parseToken() to modify the character array – specifically to replace it with a suffix of the original array that excludes the characters that have already been consumed. This means I can call characters.parseToken() repeatedly until it returns nil, without needing the character offset as an additional parameter and/or return value.

All the parseToken() function does internally is try calling token-specific parser functions in turn until it gets a match:
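The shape of this can be sketched as follows. This is an approximation rather than the original code: it extends Substring instead of String.CharacterView so that it compiles with recent Swift toolchains, it handles only three token types, and the parseRun() helper is a hypothetical name introduced for brevity:

```swift
enum TokenType { case whitespace, number, identifier }
struct Token: Equatable { let type: TokenType; let string: String }

extension Substring {
    // Hypothetical helper: consume the longest leading run of matching characters
    mutating func parseRun(_ type: TokenType, where matches: (Character) -> Bool) -> Token? {
        let run = prefix(while: matches)
        guard !run.isEmpty else { return nil }
        removeFirst(run.count) // crop the consumed characters off the front
        return Token(type: type, string: String(run))
    }

    mutating func parseWhitespace() -> Token? {
        return parseRun(.whitespace) { $0 == " " || $0 == "\t" }
    }

    mutating func parseNumber() -> Token? {
        return parseRun(.number) { $0.isASCII && $0.isNumber }
    }

    mutating func parseIdentifier() -> Token? {
        return parseRun(.identifier) { ($0.isASCII && $0.isLetter) || $0 == "_" }
    }

    // Try each token-specific parser in turn until one matches
    mutating func parseToken() -> Token? {
        if let token = parseWhitespace() { return token }
        if let token = parseNumber() { return token }
        if let token = parseIdentifier() { return token }
        return nil
    }
}

// Because parseToken() is mutating, the caller needs no offset bookkeeping
var characters: Substring = "foo  42"
var tokens: [Token] = []
while let token = characters.parseToken() {
    tokens.append(token)
}
```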

This is not the most efficient possible way to do it, because if multiple tokens begin with the same characters, a wrong guess means backtracking to try a different function.

Merging these parsing functions (so the code checks for all possible tokens concurrently) would avoid backtracking, but separating them like this is much easier to read, write and maintain, and the cost to efficiency is probably not worth worrying about in this case.

A common solution to the problem of maintainability-vs-performance with handwritten parsers is to use a parser-generator. This is a program that accepts a Context-Free Grammar (CFG) as input, and converts it into highly-efficient (but not necessarily human-readable) parsing code. One such program is Lex, which generates lexers in C, and can be used with YACC (Yet Another Compiler Compiler) to produce a fully-fledged parser. There are currently no well-established parser generators for Swift, however.

Another option is to use regular expressions. We mentioned earlier that these are not well-suited to parsing programming language grammars, but they are a reasonable tool for parsing individual tokens, which are typically non-recursive.

To tokenize a string using regular expressions:

  1. Create a separate anchored expression for each token type
  2. Merge them into a single expression, using capture groups to distinguish them
  3. Test the combined expression against the start of the source code.
  4. Crop off the matched prefix of the source string
  5. Go to step 3 to capture the next token. Repeat until all the source code is consumed.
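The steps above can be sketched with NSRegularExpression. This is only an illustration of the technique (SwiftFormat does not work this way); the pattern covers just three ASCII-only token types, and the names regexTokenize, pattern and groupTypes are made up for the example:

```swift
import Foundation

enum TokenType { case whitespace, number, identifier }
struct Token: Equatable { let type: TokenType; let string: String }

// Steps 1 and 2: one alternative per token type, each in its own capture group,
// with the whole expression anchored to the start of the input
let pattern = "^(?:([ \\t]+)|([0-9]+)|([A-Za-z_][A-Za-z0-9_]*))"
let groupTypes: [TokenType] = [.whitespace, .number, .identifier]

func regexTokenize(_ source: String) -> [Token] {
    guard let regex = try? NSRegularExpression(pattern: pattern, options: []) else {
        return []
    }
    var remaining = source
    var tokens: [Token] = []
    while !remaining.isEmpty {
        // Step 3: test the expression against the start of the remaining source
        let fullRange = NSRange(remaining.startIndex..., in: remaining)
        guard let match = regex.firstMatch(in: remaining, options: [], range: fullRange) else {
            break // unrecognized input; a real lexer would report an error here
        }
        for (group, type) in groupTypes.enumerated() {
            // Convert the UTF-16 based NSRange back into String indices
            guard let range = Range(match.range(at: group + 1), in: remaining) else {
                continue // this capture group did not take part in the match
            }
            tokens.append(Token(type: type, string: String(remaining[range])))
            // Step 4: crop off the matched prefix
            remaining = String(remaining[range.upperBound...])
            break
        }
        // Step 5: loop round to capture the next token
    }
    return tokens
}

let regexTokens = regexTokenize("foo 42")
```

Note how much of the loop is spent converting between NSRange and String indices.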

I didn’t use regular expressions for SwiftFormat, for the following reasons:

  • Swift does not have very good support for them. Apple’s NSRegularExpression class returns all its matches as NSRanges, which do not map directly to Swift’s String character indices, requiring error-prone conversion between the two schemes (NSRange counts UTF-16 code units, whereas Swift’s String indices address whole characters, i.e. extended grapheme clusters).
  • Swift’s rules for which Unicode character ranges are permitted in which tokens are very specific, and regular expression syntax is not optimized for matching non-ASCII characters.
  • In my experience, NSRegularExpression is considerably slower than a like-for-like handwritten parser, negating any benefit from the reduction in backtracking.

Instead, I wrote some simple utility functions to match individual characters and character sequences, and then built up the token matching functions from those.

Here is a (simplified) implementation of the parseWhitespace() function:

The function consumes characters until it hits one that isn’t a space or tab, then returns the resultant string (wrapped in a Token struct). Note the self.removeFirst() which removes the first character from the character array each time it is matched as part of the whitespace sequence.
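A sketch of a function with that behavior, again written against Substring rather than String.CharacterView (an assumption made so that the example compiles with recent Swift versions), and with Token reduced to the bare minimum:

```swift
enum TokenType { case whitespace }
struct Token: Equatable { let type: TokenType; let string: String }

extension Substring {
    mutating func parseWhitespace() -> Token? {
        var string = ""
        // Keep consuming while the next character is a space or tab
        while let c = first, c == " " || c == "\t" {
            string.append(c)
            self.removeFirst()
        }
        // No whitespace at the current position means no token
        return string.isEmpty ? nil : Token(type: .whitespace, string: string)
    }
}

var characters: Substring = "  \tfoo"
let whitespace = characters.parseWhitespace()
// characters now starts at "foo"; calling parseWhitespace() again returns nil
```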

A more complex example is the parseIdentifier() function. Identifiers in most programming languages are composed of a head character followed by zero or more tail characters. The head character is typically a letter or underscore, and the permitted tail characters are usually a superset of the permitted head characters, with the addition of numerals.

Swift identifiers follow this pattern, but also support non-alphanumeric Unicode characters such as scientific symbols, foreign alphabets, and emoji, which we’ll ignore in this simplified implementation of parseIdentifier():

Note the isHead() and isTail() functions nested within parseIdentifier(). This nice feature of Swift lets you declare functions within functions, allowing complex code to be broken up into manageable chunks without polluting the global namespace.
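An ASCII-only sketch of such a function, using nested isHead() and isTail() helpers. As with the other examples, it assumes Substring in place of String.CharacterView and a pared-down Token type:

```swift
enum TokenType { case identifier }
struct Token: Equatable { let type: TokenType; let string: String }

extension Substring {
    mutating func parseIdentifier() -> Token? {
        // Nested helpers are visible only inside parseIdentifier()
        func isHead(_ c: Character) -> Bool {
            return (c.isASCII && c.isLetter) || c == "_"
        }
        func isTail(_ c: Character) -> Bool {
            return isHead(c) || (c.isASCII && c.isNumber)
        }

        // An identifier must begin with a head character
        guard let head = first, isHead(head) else { return nil }
        var string = String(removeFirst())
        // ...followed by any number of tail characters
        while let c = first, isTail(c) {
            string.append(removeFirst())
        }
        return Token(type: .identifier, string: string)
    }
}

var source: Substring = "foo_1 bar"
let identifier = source.parseIdentifier()
```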

Scope Creep

The tokens mentioned so far are fairly simple; they are recognized by just inspecting each character in turn and deciding if it’s part of the token or not. Things get more complicated when parsing something like a comment, or a string literal.

Strings and comments are both delimited by unique start and end characters. For a string, that’s the double-quote character ("), and for comments, the character pairs /* and */. But parsing a string or comment is not simply a case of scanning characters until you find the terminating character or character-pair. Like C, Swift strings support escape sequences, where double-quotes (and other special characters) can be escaped by prefixing them with a backslash (\), and unlike C, Swift comments can be nested.

Most lexers treat a string as a single token, and encapsulate the logic for detecting the terminating characters or escape sequences inside the token-parsing function. But Swift supports a feature called string interpolation, where arbitrary Swift expressions can be embedded inside the string and escaped using \().

Parsing strings as a single token would mean treating these nested expressions as part of the string body, not a collection of individual tokens, making it impossible to apply complex formatting rules to them.

Similarly, if a nested comment was treated as a single token, it would prevent the formatter from applying indenting or whitespace rules inside a multiline comment.

A better solution is to extend the lexer to have a concept of scope.

Instead of treating comments and strings as single monolithic tokens, SwiftFormat’s parser treats each opening scope symbol (" or /*) as a token in its own right, so that each complete string or comment is made up of several separate tokens. The challenge with this approach is that the body of a comment or string does not follow the Swift language grammar, and requires its own lexer function.

To solve this, the parser adds a copy of the opening scope token to a separate array called the scope stack. The last (topmost) element in the scope stack is the current scope, and influences how subsequent source text is processed.

The complete tokenize() function looks something like this:

The scope stack allows the tokenizer to treat input differently, according to context. A sequence of digits encountered outside any scope (i.e. when the scope stack is empty) is treated as a number literal (a token in its own right), but when inside a string or comment (i.e. when the last token in the scope array is a " or /*) it’s treated as part of the string or comment body.
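A heavily cut-down sketch of a scope-aware tokenizer along these lines. It is an assumption-laden illustration, not the real tokenize(): it recognizes only numbers, strings and (nestable) comments, lumps everything else into a catch-all type, and treats \( like any other escape instead of opening an interpolation scope:

```swift
enum TokenType { case number, startOfScope, endOfScope, stringBody, commentBody, other }
struct Token: Equatable { let type: TokenType; let string: String }

func tokenize(_ source: String) -> [Token] {
    var characters = Substring(source)
    var tokens: [Token] = []
    var scopeStack: [Token] = [] // the last element is the current scope

    // Extend the previous token if it has the same type, else start a new one
    func append(_ type: TokenType, _ string: String) {
        if let last = tokens.last, last.type == type {
            tokens[tokens.count - 1] = Token(type: type, string: last.string + string)
        } else {
            tokens.append(Token(type: type, string: string))
        }
    }
    func openScope(_ string: String) {
        let token = Token(type: .startOfScope, string: string)
        tokens.append(token)
        scopeStack.append(token)
        characters.removeFirst(string.count)
    }
    func closeScope(_ string: String) {
        tokens.append(Token(type: .endOfScope, string: string))
        scopeStack.removeLast()
        characters.removeFirst(string.count)
    }

    while let c = characters.first {
        let scope = scopeStack.last?.string
        if scope == "\"" {
            // Inside a string: only an unescaped quote ends the scope
            if c == "\"" {
                closeScope("\"")
            } else {
                var body = String(characters.removeFirst())
                if c == "\\", let escaped = characters.first {
                    body.append(escaped) // keep the escaped character with its backslash
                    characters.removeFirst()
                }
                append(.stringBody, body)
            }
        } else if characters.hasPrefix("/*") {
            openScope("/*") // comments nest, so this applies inside a comment too
        } else if scope == "/*" {
            if characters.hasPrefix("*/") {
                closeScope("*/")
            } else {
                append(.commentBody, String(characters.removeFirst()))
            }
        } else if c == "\"" {
            openScope("\"")
        } else if c.isASCII && c.isNumber {
            append(.number, String(characters.removeFirst()))
        } else {
            append(.other, String(characters.removeFirst()))
        }
    }
    return tokens
}

let scoped = tokenize("12 \"34\" /* 5 */")
```

In the usage line, 12 comes out as a number token, whereas 34 and 5 end up inside string-body and comment-body tokens respectively.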

In this way, some of the logic that would normally be considered semantic analysis is implemented directly inside the lexical analysis phase. It’s still just outputting tokens rather than creating AST nodes, but it outputs a different type of token for the same input, depending on the current scope.

This makes the tokenizer more complex, but means that later, when applying the formatting rules, the formatter already knows the correct token type without having to work out the context.

This is especially useful for resolving the less-than-sign-vs-generics-parameter-list ambiguity discussed earlier. This disambiguation is done right inside the tokenize() function, so that by the time the token list is generated, every < has already been correctly flagged as either an operator or the start of a generic parameter scope, without needing to be identified again. It also avoids backtracking, since the types of the tokens in the array can be updated retrospectively rather than throwing away a partially constructed AST subtree and re-parsing all the tokens.
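To illustrate only the idea of updating token types retrospectively, here is a deliberately naive sketch. The heuristic (a < is reclassified when a matching > follows with nothing but identifiers, commas and whitespace in between) and the resolveGenerics name are assumptions for the example, not SwiftFormat’s actual logic, and real Swift code needs more careful rules:

```swift
enum TokenType { case identifier, op, startOfScope, endOfScope, whitespace }
struct Token: Equatable { var type: TokenType; let string: String }

func resolveGenerics(_ input: [Token]) -> [Token] {
    var tokens = input
    var candidates: [Int] = [] // indices of "<" tokens that might open a generic scope
    for i in tokens.indices {
        let token = tokens[i]
        switch (token.type, token.string) {
        case (.op, "<"):
            candidates.append(i)
        case (.op, ">"):
            if let open = candidates.popLast() {
                // Go back and retype the earlier "<" now that its partner has turned up
                tokens[open].type = .startOfScope
                tokens[i].type = .endOfScope
            }
        case (.identifier, _), (.whitespace, _), (.op, ","):
            break // permitted inside a generic parameter list
        default:
            candidates.removeAll() // anything else means the "<" was an operator
        }
    }
    return tokens
}

// Foo<Bar>
let generic = resolveGenerics([
    Token(type: .identifier, string: "Foo"),
    Token(type: .op, string: "<"),
    Token(type: .identifier, string: "Bar"),
    Token(type: .op, string: ">"),
])
```

No tokens are discarded or re-parsed here; the earlier token is simply given a new type in place.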

Arguing Semantics

Semantic analysis is the most complex phase of parsing. After implementing a couple of rules using only the output from the tokenizer, I had the wacky idea to try to avoid it altogether.

Unlike a compiler or static analyzer, a formatter doesn’t need to ensure that programs are valid, or that keywords and other symbols are being used correctly, because the code will all be run through the Swift compiler eventually anyway.

SwiftFormat is only concerned with how the code looks – and code tends to follow fairly uniform low-level conventions, regardless of the high-level structure. Look at some common Swift structures:
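For instance, consider a class, a function and a conditional (hypothetical examples, chosen only to show the shared shape):

```swift
class Foo {
    var bar = 5
}

func baz(quux: Int) -> Int {
    return quux + 1
}

if baz(quux: 1) > 1 {
    print("hello")
}
```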

These structures all have wildly different meanings in Swift, but their formatting is mostly the same. Consider the kind of rules we might define here:

  • Identifiers are always separated by a single space
  • Opening braces are separated from the previous token by a space
  • Lines inside a pair of braces should be indented one level (four spaces)

None of these rules depend on specific keywords or high-level structures, and can all be described in terms of generic tokens such as “opening brace”, “identifier”, “space”, and “line-break”.

Rules of Engagement

With a way to convert source code to an array of tokens, it’s time to think about formatting rules. Here is the first rule we suggested earlier:

  • Identifiers are always separated by a single space

The compiler should verify that code is correct, so let’s assume the code I’m formatting is already valid (or if it isn’t, it’s not my problem to fix it).

It’s impossible to write valid Swift code where identifiers are not separated by a space – two identifiers without a space would be interpreted as a single identifier by the Swift compiler, just as they are by SwiftFormat’s tokenizer. So to implement the single-space rule I only need to handle situations where they are separated by more than one space.

Furthermore, since Swift is mostly whitespace-agnostic, multiple sequential spaces never have a distinct semantic meaning. The only time it’s conventional to use multiple spaces deliberately is to indent a line. The rule can therefore be generalized as:

  • Sequences of multiple spaces should be replaced by a single space, unless they are at the beginning of a line.

Implementing this rule requires identifying when spaces fall at the beginning of a line. The code to do that looks like this:
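One way to sketch it, assuming a minimal Token type and a hypothetical function name (collapseSpaces); note that this version also turns a lone tab between tokens into a single space:

```swift
enum TokenType { case whitespace, linebreak, other }
struct Token: Equatable { let type: TokenType; let string: String }

func collapseSpaces(_ tokens: [Token]) -> [Token] {
    var output: [Token] = []
    for token in tokens {
        // Whitespace counts as indent if it is the first token or follows a line-break
        let atStartOfLine = output.last.map { $0.type == .linebreak } ?? true
        if token.type == .whitespace, !atStartOfLine {
            output.append(Token(type: .whitespace, string: " "))
        } else {
            output.append(token)
        }
    }
    return output
}

// "a   b\n    c": the run between a and b collapses, the indent before c survives
let collapsed = collapseSpaces([
    Token(type: .other, string: "a"),
    Token(type: .whitespace, string: "   "),
    Token(type: .other, string: "b"),
    Token(type: .linebreak, string: "\n"),
    Token(type: .whitespace, string: "    "),
    Token(type: .other, string: "c"),
])
```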

This is a relatively simple rule, requiring no knowledge of scope, and contextual knowledge limited to just the previous token. What about the second rule?

  • Opening braces are separated from the previous token by a space.

This is more complex to implement, because new tokens are sometimes needed (adding space where there was previously none), which means modifying the token array as I go through it.

Swift’s for loops do not permit manipulation of the array or index during enumeration, but while loops do. Here is a loop that implements the rule:
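A minimal sketch of such a loop, assuming a pared-down Token type and a hypothetical function name (spaceBeforeBraces):

```swift
enum TokenType { case whitespace, other }
struct Token: Equatable { let type: TokenType; let string: String }

func spaceBeforeBraces(_ input: [Token]) -> [Token] {
    var tokens = input
    var i = 0
    while i < tokens.count {
        if tokens[i].string == "{", i > 0, tokens[i - 1].type != .whitespace {
            tokens.insert(Token(type: .whitespace, string: " "), at: i)
            i += 1 // the brace has moved along by one, so move the index with it
        }
        i += 1
    }
    return tokens
}

// "foo{" becomes "foo {"
let spaced = spaceBeforeBraces([
    Token(type: .other, string: "foo"),
    Token(type: .other, string: "{"),
])
```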

This simplified implementation doesn’t take into account what happens if the brace is preceded by a line-break, for example, but demonstrates the principle of adjusting the index when inserting or removing tokens inside a loop.

Manually updating the index after insertion/deletion is error-prone, so SwiftFormat actually performs all token array manipulations via a wrapper class that keeps track of insertions/deletions and automatically adjusts the loop index, but to keep things simple I’ve omitted that from these examples.

This rule was more complex than the last one, but still doesn’t require much contextual knowledge beyond looking at neighboring tokens. So, let’s look at the implementation for a much more complex rule, such as indenting, where it’s necessary to keep track of scope.

I’ll define the indent rule as:

  • Lines inside pairs of braces should be indented by four spaces relative to the opening brace.

How might we implement indenting using these techniques?

I’ll start by doing a pass where I insert an empty whitespace token after each line-break, since whitespace has to be inserted and removed at the start of each line. This avoids the complexity of keeping track of changing token indices later:
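That pass might be sketched like this (the function name is hypothetical, and the sketch assumes an existing whitespace token after a line-break should be kept rather than duplicated):

```swift
enum TokenType { case whitespace, linebreak, other }
struct Token: Equatable { let type: TokenType; let string: String }

func insertIndentPlaceholders(_ input: [Token]) -> [Token] {
    var tokens = input
    var i = 0
    while i < tokens.count {
        if tokens[i].type == .linebreak {
            let next = i + 1
            if next == tokens.count || tokens[next].type != .whitespace {
                tokens.insert(Token(type: .whitespace, string: ""), at: next)
            }
            i += 1 // skip the whitespace token that now follows the line-break
        }
        i += 1
    }
    return tokens
}

let prepared = insertIndentPlaceholders([
    Token(type: .other, string: "a"),
    Token(type: .linebreak, string: "\n"),
    Token(type: .other, string: "b"),
    Token(type: .linebreak, string: "\n"),
    Token(type: .whitespace, string: "  "),
    Token(type: .other, string: "c"),
])
```

After this pass, every line-break is guaranteed to be followed by a whitespace token, so later passes can overwrite indents without changing the length of the array.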

As before, I’ve used a while loop so I can manually adjust the index after inserting tokens. But now that’s done, we can switch to a regular for…in loop, which is faster, simpler and less error-prone. Here is the main loop:
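A simplified sketch of such a loop. It assumes the input already has a whitespace token after every line-break, uses a hypothetical lastLinebreak variable to remember where the current line started, and makes no attempt to ignore braces inside strings or comments:

```swift
enum TokenType { case whitespace, linebreak, other }
struct Token: Equatable { let type: TokenType; let string: String }

func indent(_ input: [Token]) -> [Token] {
    var tokens = input
    var indentStack = [""]       // the root indent level
    var lastLinebreak: Int? = nil // index of the most recent line-break token

    // Overwrite the whitespace token that follows the given line-break
    func setIndent(afterLinebreakAt index: Int) {
        let next = index + 1
        if next < tokens.count, tokens[next].type == .whitespace {
            tokens[next] = Token(type: .whitespace, string: indentStack.last ?? "")
        }
    }

    for (i, token) in input.enumerated() {
        if token.type == .linebreak {
            lastLinebreak = i
            setIndent(afterLinebreakAt: i)
        } else if token.string == "{" {
            indentStack.append((indentStack.last ?? "") + "    ")
        } else if token.string == "}" {
            if indentStack.count > 1 { indentStack.removeLast() }
            // Jump back and fix the indent of the line containing the brace
            if let index = lastLinebreak {
                setIndent(afterLinebreakAt: index)
            }
        }
    }
    return tokens
}

// "a {\nb\n}" with an empty whitespace token after each line-break
let indented = indent([
    Token(type: .other, string: "a"),
    Token(type: .whitespace, string: " "),
    Token(type: .other, string: "{"),
    Token(type: .linebreak, string: "\n"),
    Token(type: .whitespace, string: ""),
    Token(type: .other, string: "b"),
    Token(type: .linebreak, string: "\n"),
    Token(type: .whitespace, string: ""),
    Token(type: .other, string: "}"),
])
```

Because the loop only ever replaces tokens in place, the array length stays fixed and the indices remain valid throughout.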

Note the indentStack variable: an array of strings initialized with an empty string representing the root indent level. It is used as follows:

  • When the code encounters an opening brace it adds another four spaces to the current indent, and puts it on the stack.
  • When it encounters a closing brace the current indent value is popped off the stack.
  • When it encounters a line-break it updates the whitespace token immediately after it to match the current indent level.

The tricky part is that the code can only determine the indent level of the line containing the closing brace after it has already moved past the start of the line. That’s why the index of the last encountered line-break token is stored, allowing the function to jump back to the beginning of the line to adjust the indent once it finds the closing brace.

There are a lot of simplifications and assumptions here, such as assuming only one type of scope, and that braces will be balanced. The real SwiftFormat indent() function uses a separate scope stack to deal with different types of scope.

But even this simplified example is already pretty complex – how can I possibly add extra complexity and still remain confident that it works in all cases?

Well, we’ll cover that in part 3.
