Creating a custom syntax file
Creating a syntax file is rather straightforward once you know how such files are structured. The purpose of this small example is to cover that topic as extensively as possible.
Core concepts
There are two fundamental notions in QCE syntax files: contexts and "regular" matches.
The concept of a context is very powerful while remaining very simple. The syntax engine enters a context when a given start token is matched and leaves it when a stop token is matched. A context can contain any number of nested contexts and "regular" matches. Contexts are typically used to match comments, strings or other special blocks of code.
"Regular" matches are what one would expect them to be: simple tokens. They can be matched from either regular expressions or plain strings. The start and stop tokens of a context are "regular" matches too, except that they also trigger the context enter/leave event.
Syntax file structure
Now, on to the analysis of the structure of a syntax file.
The root element of the document is a <QNFA> tag. Its attributes provide the following information:
- language: this attribute specifies the name of the language supported by the syntax file.
- extensions: this attribute specifies the file extensions matched by this language.
- defaultLineMark: this (optional) attribute specifies the name of the default line mark to be set on a line when clicking on the line mark panel (if any). If none is provided, "bookmark" is used.
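As an illustration, the root element of a syntax file for an imaginary language could look like the sketch below. The language name "Foo" and its extensions are invented for the example, and the semicolon used to separate several extensions is an assumption that should be checked against an existing file in qxs/.

```xml
<!-- Hypothetical root element: "Foo" and its extensions are invented for this sketch. -->
<!-- Assumption: several extensions are separated by semicolons. -->
<QNFA language="Foo" extensions="foo;fooh" defaultLineMark="bookmark">
	<!-- contexts and "regular" matches go here -->
</QNFA>
```

Since defaultLineMark is optional and "bookmark" is the documented fallback, leaving the attribute out should behave the same way here.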
The QNFA tag represents the root context of the language. It can contain any number of the following tags:
- context: defines a context. To be valid, it requires child tags of type start and stop.
- sequence: defines a "regular" match. The value of this element is always assumed to be a regexp (no internal optimization is attempted for plain strings).
- word: defines a "regular" match. The value of this element is checked to determine whether it can be matched as a plain string (internal optimization). Additionally, this element will ONLY be matched at word boundaries. For instance, if the value of a word element is for, it will not be matched in "foreach", whereas it would be if declared using a sequence tag.
- list: this is a "meta-element" used to group regular matches and give them the same attributes, which are propagated from this element to its children. Subgrouping (nesting list elements) is NOT supported.
- embed: this is a "meta-element" which allows embedding other languages or contexts, to help reduce duplication of content and make syntax files easier to maintain. The embedding target is specified through the target attribute.
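The sketch below shows how these tags might fit together in the root context of an imaginary language. The format names ("comment", "numbers", "keyword") are assumptions; they must match whatever formats the application actually defines. The embed tag is left out because the exact form of its target value is not described here.

```xml
<QNFA language="Foo" extensions="foo">
	<!-- context: entered on "/*", left on "*/" (the star is escaped because it is a regexp operator). -->
	<context id="comment" format="comment">
		<start>/\*</start>
		<stop>\*/</stop>
	</context>

	<!-- sequence: always treated as a regexp and matched anywhere. -->
	<sequence format="numbers">[0-9]+</sequence>

	<!-- word: only matched at word boundaries, so "for" is not found inside "foreach". -->
	<word format="keyword">for</word>

	<!-- list: its attributes (here, format) are propagated to its children. -->
	<list format="keyword">
		<word>if</word>
		<word>else</word>
		<word>while</word>
	</list>
</QNFA>
```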
Additionally, the following tags are valid inside a context block (and, again, their number isn't limited). Also note that, while the ordering of all the tags above within a context DOES matter, the ordering of the tags below DOES NOT.
- start: defines a context start token as a "regular" match (remarks made about the word tag apply to this one as well).
- stop: defines a context stop token as a "regular" match (remarks made about the word tag apply to this one as well).
- escape: defines a special token as a "regular" match (remarks made about the sequence tag apply to this one as well). This element is used for the very common case of escape sequences which may prevent a stop token from being one. In most cases, however, it is recommended to favor an explicit escape match through a sequence. For instance, in C-like strings the following construct is used:
<sequence id="escape" format="escapeseq" >\\[nrtvf\\"'\n\n]</sequence>
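To put that construct in context, here is a minimal sketch of a C-like string context built around it. The context ids and the "text" format name are assumptions made for the example. The second context shows what the escape tag variant might look like; its value (a single escaped backslash) is also an assumption.

```xml
<QNFA language="Foo" extensions="foo">
	<!-- Recommended form: an explicit escape match through a sequence. -->
	<context id="string" format="text">
		<start>"</start>
		<stop>"</stop>
		<!-- \" is matched as an escape sequence, so it does not end the string. -->
		<sequence id="escape" format="escapeseq">\\[nrtvf\\"'\n\n]</sequence>
	</context>

	<!-- Alternative form (assumed value): the escape tag marks the backslash as special. -->
	<context id="char" format="text">
		<start>'</start>
		<stop>'</stop>
		<escape>\\</escape>
	</context>
</QNFA>
```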
All these tags, except embed, support the following attributes:
- format: specifies the format to be applied to the matches (highlighting). This property is propagated.
- id: assigns an identifier to the element. At the moment this is only used for contexts, and its only "external" use occurs in the embed tag.
Additionally, all tags except context and list support the following extra attributes:
- exclusive: specifies whether the token may be matched multiple times. For instance, some contexts share the same end token (a newline in many cases), and the innermost context must not prevent its parent from matching the newline and exiting. This attribute is reserved for the start and stop tags of a context. Valid values are "true", "false", "1" or "0".
- parenthesis: specifies that the element is a parenthesis. The concept of parenthesis actually extends way beyond simple parentheses. Parentheses are tokens that may be matched (brace matching), delimit foldable blocks or trigger indentation.
The value of this attribute is a string formatted as follows: "$id:$type[@nomatch]", where $id is the identifier of the parenthesis and $type is its type, which can be either "open", "close" or "boundary". Finally, "@nomatch", if present, indicates that the parenthesis should not be taken into account for brace matching. The square brackets only indicate that this part is optional; they must not appear in a syntax file.
While the "open" and "close" types of parenthesis are quite easy to understand, the "boundary" type requires more explanation. It indicates a parenthesis that acts as both "open" and "close". Typical uses of such parentheses occur in C++ for visibility keywords (public, protected, private) or in LaTeX for chapter tags, section tags and so on. There are of course many more cases where this type of parenthesis is the right choice, but there is no point in listing them all.
- fold: the element will delimit foldable block(s). Valid values are "true", "false", "1" or "0".
- indent: the element will trigger indentation. Valid values are "true", "false", "1" or "0".
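The following sketch illustrates these attributes on an imaginary language. The parenthesis identifiers ("curly", "visibility", "comment"), the context ids and the format names are invented for the example. The use of exclusive="false" on the inner stop token is an assumption about which value allows the token to be matched again by the parent.

```xml
<QNFA language="Foo" extensions="foo">
	<!-- Curly braces: matched as a pair, delimit a foldable block, trigger indentation. -->
	<sequence parenthesis="curly:open" fold="1" indent="1">{</sequence>
	<sequence parenthesis="curly:close" fold="1" indent="1">}</sequence>

	<!-- "boundary": each visibility keyword acts as both "close" and "open". -->
	<list format="keyword">
		<word parenthesis="visibility:boundary" fold="true" indent="true">public</word>
		<word parenthesis="visibility:boundary" fold="true" indent="true">private</word>
	</list>

	<!-- "@nomatch": the comment can be folded but is ignored by brace matching. -->
	<context id="comment" format="comment">
		<start parenthesis="comment:open@nomatch" fold="1">/\*</start>
		<stop parenthesis="comment:close@nomatch" fold="1">\*/</stop>
	</context>

	<!-- Both contexts end on a newline. Assumption: exclusive="false" on the inner
	     stop lets the outer context match the same newline and exit as well. -->
	<context id="directive" format="preprocessor">
		<start>#</start>
		<stop>\n</stop>
		<context id="linecomment" format="comment">
			<start>//</start>
			<stop exclusive="false">\n</stop>
		</context>
	</context>
</QNFA>
```

Note that the square brackets of the "$id:$type[@nomatch]" notation never appear in the attribute values: the suffix is either written as "@nomatch" or omitted.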
The context tag, however, supports the following extra attributes:
- transparency: specifies whether the contexts and matches declared before this context should also be matched inside that context (with no need to declare them again). Valid values are "true", "false", "1" or "0".
- stayOnLine: this attribute is provided to handle special cases of context nesting where the innermost context may span several lines whereas the outer contexts cannot, e.g. in C, comments are accepted inside preprocessor directives. This attribute indicates that, no matter what happens to subcontexts, this context will not span beyond the line it started in.
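A possible use of both attributes is sketched below. The ids and format names are invented, and the comments restate the behaviour described above rather than results observed from the engine.

```xml
<QNFA language="Foo" extensions="foo">
	<list format="keyword">
		<word>if</word>
		<word>else</word>
	</list>

	<!-- Declared after the keywords: with transparency enabled, they are also
	     matched inside the block without being declared again. -->
	<context id="block" transparency="true">
		<start parenthesis="curly:open" fold="1" indent="1">{</start>
		<stop parenthesis="curly:close" fold="1" indent="1">}</stop>
	</context>

	<!-- The directive never spans beyond the line it started in, even though
	     the nested comment could span several lines on its own. -->
	<context id="directive" format="preprocessor" stayOnLine="true">
		<start>#</start>
		<stop>\n</stop>
		<context id="comment" format="comment">
			<start>/\*</start>
			<stop>\*/</stop>
		</context>
	</context>
</QNFA>
```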
Regexp format
The regexp format used by QCE is close to that used by QRegExp, with some slight variations.
First of all, here is a list of QRegExp features that are not supported in syntax files:
- grouping and alternation (ban parentheses and | from your mind)
- assertions (such as word boundaries)
- lookahead operators
- '.' to match any character
Then, character classes (word, space, digit and their negations) use the same letters as in QRegExp (w, s and d respectively, uppercase for the negations) but a different prefix character ($ instead of \), e.g. $d instead of \d.
C-style escaping is used. Simple C escapes (for newlines and tabs) are converted properly, and C-style escaping is used to escape control characters.
Sets and negated sets are supported, using the same syntax as QRegExp.
The usual regexp operators '?', '*' and '+' are supported.
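The rules above can be illustrated with a few patterns. This is only a sketch: the format names are invented, and the patterns have been written from the rules listed in this section, not checked against the engine.

```xml
<QNFA language="Foo" extensions="foo">
	<!-- $d is the digit class (instead of \d): one or more digits. -->
	<sequence format="numbers">$d+</sequence>

	<!-- A set followed by the word class $w: an identifier such as "_tmp1". -->
	<sequence format="identifier">[a-zA-Z_]$w*</sequence>

	<!-- Negated set with C escapes: "@" followed by anything up to the next whitespace. -->
	<sequence format="annotation">@[^ \t\n]+</sequence>

	<!-- No alternation: what would be "true|false" is split into two elements. -->
	<list format="keyword">
		<word>true</word>
		<word>false</word>
	</list>

	<!-- Regexp operators meant literally are escaped C-style: this matches "++". -->
	<sequence format="operator">\+\+</sequence>
</QNFA>
```

Grouping alternatives in a list, as done for the two keywords, keeps shared attributes in one place while working around the lack of alternation.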
A revision of the syntax format may bring support for grouping and alternation (and possibly other niceties) in a future version. However, since this would break backward compatibility (due to escaping issues, among other things) and require a rewrite of the syntax engine, a new (but very similar) syntax file format would be used.
Getting your hands dirty (coming soon)
Now that the fundamentals have been covered, let's use them to create a small syntax file for an imaginary language.
More examples are available in the qxs/ directory, where all syntax files reside.