F# Compiler Services


Compiler Services: Using the F# tokenizer

This tutorial demonstrates how to call the F# language tokenizer. Given F# source code, the tokenizer produces a sequence of tokens for each source line. For each token, you can get its type, its exact location within the line, and its color kind (keyword, identifier, number, operator, etc.).

NOTE: The FSharp.Compiler.Service API is subject to change when later versions of the NuGet package are published.

Creating the tokenizer

To use the tokenizer, reference FSharp.Compiler.Service.dll and open the SourceCodeServices namespace:

#r "FSharp.Compiler.Service.dll"
open Microsoft.FSharp.Compiler.SourceCodeServices

Now you can create an instance of FSharpSourceTokenizer. The constructor takes two arguments: the first is the list of defined symbols and the second is the file name of the source code. The defined symbols are required because the tokenizer handles #if directives. The file name is only used to report source locations, so the file does not have to exist:

let sourceTok = FSharpSourceTokenizer([], "C:\\test.fsx")

Using the sourceTok object, we can now (repeatedly) tokenize lines of F# source code.

Tokenizing F# code

The tokenizer operates on individual lines rather than on the entire source file. Along with each token, the tokenizer returns a new state (an int64 value). Keeping the state reached at the end of each line lets you tokenize F# code more efficiently: when the source code changes, you do not need to re-tokenize the entire file, only the parts that have changed.
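
The idea of re-tokenizing only what changed can be sketched without the compiler library at all. The block below is a minimal illustration, not part of FSharp.Compiler.Service: LineResult, scanAll, rescanFrom and toyScan are hypothetical names, and scanLine stands in for whatever wrapper you write around the real tokenizer. It also assumes an edit replaces the text of one line without adding or removing lines.

```fsharp
/// Cached result for one line: its tokens and the state at the end of the line
type LineResult<'Token, 'State> = { Tokens : 'Token list; EndState : 'State }

/// Scan every line once, threading the state from line to line
let scanAll (scanLine : string -> 'State -> LineResult<'Token, 'State>) initial (lines : string[]) =
  let step state line =
    let result = scanLine line state
    result, result.EndState
  lines |> Array.mapFold step initial |> fst

/// Re-scan from the edited line and stop once a line's end state matches the
/// cached one, since the lines after it cannot be affected.
/// Returns the updated cache and the number of lines that were re-scanned.
let rescanFrom (scanLine : string -> 'State -> LineResult<'Token, 'State>)
               initial (lines : string[]) (cache : LineResult<'Token, 'State>[]) edited =
  let updated = Array.copy cache
  let rec loop i state =
    if i >= lines.Length then i
    else
      let result = scanLine lines.[i] state
      updated.[i] <- result
      if result.EndState = cache.[i].EndState then i + 1
      else loop (i + 1) result.EndState
  let startState = if edited = 0 then initial else cache.[edited - 1].EndState
  let stoppedAt = loop edited startState
  updated, stoppedAt - edited

// Toy stand-in for a real line scanner (hypothetical): the state is true when
// the line ends inside an unclosed block comment
let toyScan (line : string) (inComment : bool) =
  let endState =
    if line.Contains("*)") then false
    elif line.Contains("(*") then true
    else inComment
  { Tokens = [ (if inComment then "COMMENT" else "CODE") ]; EndState = endState }

let cache = scanAll toyScan false [| "let x = 1"; "let y = 2"; "let z = 3" |]

// Editing line 0 without changing its end state re-scans that line only
let _, rescannedSame =
  rescanFrom toyScan false [| "let x = 10"; "let y = 2"; "let z = 3" |] cache 0

// Opening a comment on line 0 changes the state, so every later line is re-scanned
let _, rescannedAll =
  rescanFrom toyScan false [| "(* let x = 1"; "let y = 2"; "let z = 3" |] cache 0
```

With the real tokenizer, the cached end state would be the int64 value returned at the end of each line, and the comparison would work the same way.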

Tokenizing a single line

To tokenize a single line, we create an FSharpLineTokenizer by calling CreateLineTokenizer on the FSharpSourceTokenizer object that we created earlier:

let tokenizer = sourceTok.CreateLineTokenizer("let answer=42")

Now, we can write a simple recursive function that calls ScanToken on the tokenizer until it returns None (indicating the end of the line). When the scan succeeds, it returns an FSharpTokenInfo object with all the interesting details:

/// Tokenize a single line of F# code
let rec tokenizeLine (tokenizer:FSharpLineTokenizer) state =
  match tokenizer.ScanToken(state) with
  | Some tok, state ->
      // Print token name
      printf "%s " tok.TokenName
      // Tokenize the rest, in the new state
      tokenizeLine tokenizer state
  | None, state -> state

The function returns the new state, which is needed when you tokenize multiple lines and an earlier line ends inside a multi-line comment. As an initial state, we can use 0L:

tokenizeLine tokenizer 0L

This prints the token names LET, WHITESPACE, IDENT, EQUALS and INT32. FSharpTokenInfo has a number of interesting properties, including:

  • CharClass and ColorClass return information about the token category that can be used for colorizing F# code.
  • LeftColumn and RightColumn return the location of the token inside the line.
  • TokenName is the name of the token (as defined in the F# lexer).

Note that the tokenizer is stateful: if you want to tokenize the same line multiple times, you need to call CreateLineTokenizer again.
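
As a rough sketch of how these properties might be read, the function below collects the name, columns and color class of every token on a line. It assumes the #r and open lines and the sourceTok value from earlier in this tutorial are already in scope; describeLine is an illustrative name, not part of the API.

```fsharp
/// Collect (name, left column, right column, color class) for each token on a line
let describeLine (tokenizer : FSharpLineTokenizer) state =
  let rec loop state acc =
    match tokenizer.ScanToken(state) with
    | Some tok, state ->
        loop state ((tok.TokenName, tok.LeftColumn, tok.RightColumn, tok.ColorClass) :: acc)
    | None, state -> List.rev acc, state
  loop state []

// The earlier tokenizer has already been consumed, so create a fresh one
// (sourceTok is assumed to be defined as shown above)
let tokens, _ = describeLine (sourceTok.CreateLineTokenizer("let answer=42")) 0L

for (name, left, right, color) in tokens do
  printfn "%-12s columns %d-%d  %A" name left right color
```

Returning the tokens as a list, rather than printing inside the loop, keeps the scanning logic reusable for colorizing or other processing.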

Tokenizing sample code

To run the tokenizer on a longer code sample or an entire file, you need to read the sample input as a collection of string values:

let lines = """
  // Hello world
  let hello() =
     printfn "Hello world!" """.Split('\r','\n')

To tokenize multi-line input, we again need a recursive function that keeps the current state. The following function takes the lines as a list of strings (together with the line number and the current state). We create a new tokenizer for each line and call tokenizeLine using the state from the end of the previous line:

/// Print token names for multiple lines of code
let rec tokenizeLines state count lines = 
  match lines with
  | line::lines ->
      // Create tokenizer & tokenize single line
      printfn "\nLine %d" count
      let tokenizer = sourceTok.CreateLineTokenizer(line)
      let state = tokenizeLine tokenizer state
      // Tokenize the rest using new state
      tokenizeLines state (count+1) lines
  | [] -> ()

The function simply calls tokenizeLine (defined earlier) to print the names of all the tokens on each line. We can call it on the previous input with 0L as the initial state and 1 as the number of the first line:

lines
|> List.ofSeq
|> tokenizeLines 0L 1

Ignoring some unimportant details (like whitespace at the beginning of each line and the first line which is just whitespace), the code generates the following output:

Line 1
  LINE_COMMENT LINE_COMMENT (...) LINE_COMMENT 
Line 2
  LET WHITESPACE IDENT LPAREN RPAREN WHITESPACE EQUALS 
Line 3
  IDENT WHITESPACE STRING_TEXT (...) STRING_TEXT STRING 

It is worth noting that the tokenizer yields multiple LINE_COMMENT tokens and multiple STRING_TEXT tokens for each single comment or string (roughly, one for each word), so if you want to get the entire text of a comment/string, you need to concatenate the tokens.
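
One way to do this concatenation is sketched below. It works on plain (name, left column, right column) triples, which are assumed to have been read from TokenName, LeftColumn and RightColumn, and it assumes the right column is inclusive. The helper names mergeRuns and spanTexts are illustrative, and the sample columns are hand-written rather than actual tokenizer output.

```fsharp
/// Merge runs of adjacent tokens that share a name into single spans.
/// Each item is (token name, left column, right column), with the right
/// column assumed to be inclusive.
let mergeRuns (tokens : (string * int * int) list) =
  let step acc (name, left, right) =
    match acc with
    | (prevName, prevLeft, _) :: rest when prevName = name ->
        (name, prevLeft, right) :: rest
    | _ -> (name, left, right) :: acc
  tokens |> List.fold step [] |> List.rev

/// Slice the text of each merged span out of the original line
let spanTexts (line : string) tokens =
  mergeRuns tokens
  |> List.map (fun (name, left, right) -> name, line.Substring(left, right - left + 1))

// Hand-written columns for illustration only
let commentLine = "// Hello world"
let sample = [ ("LINE_COMMENT", 0, 1); ("LINE_COMMENT", 2, 7); ("LINE_COMMENT", 8, 13) ]
let texts = spanTexts commentLine sample
```

Because runs are merged by token name, a string literal would come out as one STRING_TEXT span followed by a separate STRING span for the token that closes it.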
