This tutorial demonstrates how to call the F# language tokenizer. Given F# source code, the tokenizer generates a list of source code lines that contain information about tokens on each line. For each token, you can get the type of the token, exact location as well as color kind of the token (keyword, identifier, number, operator, etc.).
NOTE: The FSharp.Compiler.Service API is subject to change when later versions of the nuget package are published
To use the tokenizer, reference FSharp.Compiler.Service.dll
and open the
FSharp.Compiler.Tokenization
namespace:
#r "FSharp.Compiler.Service.dll"
open FSharp.Compiler.Tokenization
Now you can create an instance of FSharpSourceTokenizer
. The class takes two
arguments - the first is the list of defined symbols and the second is the
file name of the source code. The defined symbols are required because the
tokenizer handles #if
directives. The file name is required only to specify
locations of the source code (and it does not have to exist):
let sourceTok = FSharpSourceTokenizer([], Some "C:\\test.fsx", Some "PREVIEW", None)
Using the sourceTok
object, we can now (repeatedly) tokenize lines of
F# source code.
The tokenizer operates on individual lines rather than on the entire source
file. After getting a token, the tokenizer also returns new state (as int64
value).
This can be used to tokenize F# code more efficiently. When source code changes,
you do not need to re-tokenize the entire file - only the parts that have changed.
To tokenize a single line, we create a FSharpLineTokenizer
by calling CreateLineTokenizer
on the FSharpSourceTokenizer
object that we created earlier:
let tokenizer = sourceTok.CreateLineTokenizer("let answer=42")
Now, we can write a simple recursive function that calls ScanToken
on the tokenizer
until it returns None
(indicating the end of line). When the function succeeds, it
returns an FSharpTokenInfo
object with all the interesting details:
/// Tokenize a single line of F# code
let rec tokenizeLine (tokenizer:FSharpLineTokenizer) state =
match tokenizer.ScanToken(state) with
| Some tok, state ->
// Print token name
printf "%s " tok.TokenName
// Tokenize the rest, in the new state
tokenizeLine tokenizer state
| None, state -> state
The function returns the new state, which is needed if you need to tokenize multiple lines
and an earlier line ends with a multi-line comment. As an initial state, we can use 0L
:
tokenizeLine tokenizer FSharpTokenizerLexState.Initial
The result is a sequence of tokens with names LET, WHITESPACE, IDENT, EQUALS and INT32.
There is a number of interesting properties on FSharpTokenInfo
including:
CharClass
and ColorClass
return information about the token category that
can be used for colorizing F# code.
LeftColumn
and RightColumn
return the location of the token inside the line.TokenName
is the name of the token (as defined in the F# lexer)Note that the tokenizer is stateful - if you want to tokenize single line multiple times,
you need to call CreateLineTokenizer
again.
To run the tokenizer on a longer sample code or an entire file, you need to read the
sample input as a collection of string
values:
let lines = """
// Hello world
let hello() =
printfn "Hello world!" """.Split('\r','\n')
To tokenize multi-line input, we again need a recursive function that keeps the current
state. The following function takes the lines as a list of strings (together with line number
and the current state). We create a new tokenizer for each line and call tokenizeLine
using the state from the end of the previous line:
/// Print token names for multiple lines of code
let rec tokenizeLines state count lines =
match lines with
| line::lines ->
// Create tokenizer & tokenize single line
printfn "\nLine %d" count
let tokenizer = sourceTok.CreateLineTokenizer(line)
let state = tokenizeLine tokenizer state
// Tokenize the rest using new state
tokenizeLines state (count+1) lines
| [] -> ()
The function simply calls tokenizeLine
(defined earlier) to print the names of all
the tokens on each line. We can call it on the previous input with 0L
as the initial
state and 1
as the number of the first line:
lines
|> List.ofSeq
|> tokenizeLines FSharpTokenizerLexState.Initial 1
Ignoring some unimportant details (like whitespace at the beginning of each line and the first line which is just whitespace), the code generates the following output:
|
It is worth noting that the tokenizer yields multiple LINE_COMMENT
tokens and multiple
STRING_TEXT
tokens for each single comment or string (roughly, one for each word), so
if you want to get the entire text of a comment/string, you need to concatenate the
tokens.