Compiler Services: Using the F# tokenizer

This tutorial demonstrates how to call the F# language tokenizer. Given F# source code, the tokenizer produces, for each line of source, information about the tokens on that line. For each token, you can get its type, its exact location, and its color kind (keyword, identifier, number, operator, etc.).

NOTE: The FSharp.Compiler.Service API is subject to change when later versions of the NuGet package are published.

Creating the tokenizer

To use the tokenizer, reference FSharp.Compiler.Service.dll and open the FSharp.Compiler.Tokenization namespace:

#r "FSharp.Compiler.Service.dll"
open FSharp.Compiler.Tokenization

Now you can create an instance of FSharpSourceTokenizer. The constructor takes four arguments: the list of defined symbols, an optional file name of the source code, an optional language version, and an optional strict-indentation setting. The defined symbols are required because the tokenizer handles #if directives. The file name is used only to report locations in the source code (the file does not have to exist):

let sourceTok = FSharpSourceTokenizer([], Some "C:\\test.fsx", Some "PREVIEW", None)
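To illustrate why the defined symbols matter, here is a minimal sketch that tokenizes the same three lines twice, once with DEBUG defined and once without. The helper tokenNames and the sample input are hypothetical, and the script assumes the package is referenced with `#r "nuget: ..."` rather than the DLL reference shown above. It relies on CreateLineTokenizer and ScanToken, which are described in the sections that follow.

```fsharp
#r "nuget: FSharp.Compiler.Service"
open FSharp.Compiler.Tokenization

// Hypothetical helper (not part of the API): returns the token names of
// each line, threading the lexer state from one line to the next.
let tokenNames (defines: string list) (lines: string list) =
    let tok = FSharpSourceTokenizer(defines, Some "C:\\test.fsx", Some "PREVIEW", None)
    // Scan one line to its end, collecting token names
    let rec scanLine (lt: FSharpLineTokenizer) state acc =
        match lt.ScanToken(state) with
        | Some t, state -> scanLine lt state (t.TokenName :: acc)
        | None, state -> List.rev acc, state
    // Tokenize a line in the state left by the previous line
    let folder (state, results) (line: string) =
        let names, state = scanLine (tok.CreateLineTokenizer(line)) state []
        state, names :: results
    lines
    |> List.fold folder (FSharpTokenizerLexState.Initial, [])
    |> snd
    |> List.rev

let sample = [ "#if DEBUG"; "let x = 1"; "#endif" ]
let withDebug = tokenNames [ "DEBUG" ] sample
let withoutDebug = tokenNames [] sample

// Compare the tokens reported for the middle line in both configurations
printfn "DEBUG defined:   %A" withDebug.[1]
printfn "DEBUG undefined: %A" withoutDebug.[1]
```

When the symbol is not defined, the middle line should be reported as inactive code rather than as ordinary tokens, which is what an editor needs in order to gray it out.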

Using the sourceTok object, we can now (repeatedly) tokenize lines of F# source code.

Tokenizing F# code

The tokenizer operates on individual lines rather than on the entire source file. Along with each token, the tokenizer returns a new state (an FSharpTokenizerLexState value). This state can be used to tokenize F# code more efficiently: when the source code changes, you do not need to re-tokenize the entire file, only the parts that have changed.
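One way to take advantage of this is to cache the state at the start of every line, so that an edit only requires re-tokenizing from the edited line onward. The following is a minimal sketch of that idea, not a prescribed design. The helpers endState and startStates and the sample lines are hypothetical, and the script assumes the package is referenced with `#r "nuget: ..."`. It uses CreateLineTokenizer and ScanToken, which are introduced in the next section.

```fsharp
#r "nuget: FSharp.Compiler.Service"
open FSharp.Compiler.Tokenization

let sourceTok = FSharpSourceTokenizer([], Some "C:\\test.fsx", Some "PREVIEW", None)

/// Hypothetical helper: scan one line to its end and return the state after it
let endState (line: string) (startState: FSharpTokenizerLexState) =
    let tokenizer = sourceTok.CreateLineTokenizer(line)
    let rec loop state =
        match tokenizer.ScanToken(state) with
        | Some _, state -> loop state
        | None, state -> state
    loop startState

/// Hypothetical helper: element i is the state to use when tokenizing line i
/// (the final element is the state after the last line)
let startStates (lines: string[]) =
    lines
    |> Array.scan (fun state line -> endState line state) FSharpTokenizerLexState.Initial

let sampleLines =
    [| "let a = 1"
       "(* a comment that"
       "   ends here *)"
       "let b = 2" |]

let cached = startStates sampleLines

// If only line index 3 is edited, tokenizing can resume from cached.[3]
// instead of starting again at the top of the file.
let resumed = endState "let b = 3" cached.[3]
```

A fuller implementation would keep re-tokenizing subsequent lines until the newly computed state matches the cached one, since an edit that opens or closes a comment changes the state of the lines after it.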

Tokenizing a single line

To tokenize a single line, we create a FSharpLineTokenizer by calling CreateLineTokenizer on the FSharpSourceTokenizer object that we created earlier:

let tokenizer = sourceTok.CreateLineTokenizer("let answer=42")

Now, we can write a simple recursive function that calls ScanToken on the tokenizer until it returns None (indicating the end of the line). When ScanToken succeeds, it returns an FSharpTokenInfo object with all the interesting details:

/// Tokenize a single line of F# code
let rec tokenizeLine (tokenizer:FSharpLineTokenizer) state =
  match tokenizer.ScanToken(state) with
  | Some tok, state ->
      // Print token name
      printf "%s " tok.TokenName
      // Tokenize the rest, in the new state
      tokenizeLine tokenizer state
  | None, state -> state

The function returns the new state, which is needed if you want to tokenize multiple lines and an earlier line ends inside a multi-line comment. As the initial state, we use FSharpTokenizerLexState.Initial:

tokenizeLine tokenizer FSharpTokenizerLexState.Initial

The result is a sequence of tokens with names LET, WHITESPACE, IDENT, EQUALS and INT32. There are a number of interesting properties on FSharpTokenInfo, including:

CharClass and ColorClass return information about the token category that can be used for colorizing F# code.
LeftColumn and RightColumn return the location of the token inside the line.
TokenName is the name of the token (as defined in the F# lexer)
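The properties listed above can be inspected with a small variation of tokenizeLine that collects the tokens instead of printing them. This is a sketch only: tokensOf is a hypothetical helper, the script assumes the package is referenced with `#r "nuget: ..."`, and the substring calculation assumes that RightColumn is the inclusive index of the token's last character.

```fsharp
#r "nuget: FSharp.Compiler.Service"
open FSharp.Compiler.Tokenization

let sourceTok = FSharpSourceTokenizer([], Some "C:\\test.fsx", Some "PREVIEW", None)

/// Hypothetical helper: collect every token on a single line
let tokensOf (line: string) =
    let tokenizer = sourceTok.CreateLineTokenizer(line)
    let rec loop state acc =
        match tokenizer.ScanToken(state) with
        | Some tok, state -> loop state (tok :: acc)
        | None, _ -> List.rev acc
    loop FSharpTokenizerLexState.Initial []

let line = "let answer=42"

for tok in tokensOf line do
    // Assumption: RightColumn is inclusive; the length is clamped so that
    // Substring stays inside the line even if that assumption is wrong.
    let length = min (tok.RightColumn - tok.LeftColumn + 1) (line.Length - tok.LeftColumn)
    let text = line.Substring(tok.LeftColumn, length)
    printfn "%s [%d..%d] color=%A char=%A text=%s"
        tok.TokenName tok.LeftColumn tok.RightColumn tok.ColorClass tok.CharClass text
```

Collecting the tokens into a list keeps the scanning logic separate from whatever is done with the results, such as colorizing or printing.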

Note that the tokenizer is stateful: if you want to tokenize a single line multiple times, you need to call CreateLineTokenizer again.

Tokenizing sample code

To run the tokenizer on a longer sample or an entire file, you need to read the input as a collection of string values:

let lines = """
  // Hello world
  let hello() =
     printfn "Hello world!" """.Split('\r','\n')

To tokenize multi-line input, we again need a recursive function that keeps the current state. The following function takes the lines as a list of strings (together with line number and the current state). We create a new tokenizer for each line and call tokenizeLine using the state from the end of the previous line:

/// Print token names for multiple lines of code
let rec tokenizeLines state count lines = 
  match lines with
  | line::lines ->
      // Create tokenizer & tokenize single line
      printfn "\nLine %d" count
      let tokenizer = sourceTok.CreateLineTokenizer(line)
      let state = tokenizeLine tokenizer state
      // Tokenize the rest using new state
      tokenizeLines state (count+1) lines
  | [] -> ()

The function simply calls tokenizeLine (defined earlier) to print the names of all the tokens on each line. We can call it on the previous input with FSharpTokenizerLexState.Initial as the initial state and 1 as the number of the first line:

lines
|> List.ofSeq
|> tokenizeLines FSharpTokenizerLexState.Initial 1

Ignoring some unimportant details (like whitespace at the beginning of each line and the first line which is just whitespace), the code generates the following output:

Line 1
  LINE_COMMENT LINE_COMMENT (...) LINE_COMMENT 
Line 2
  LET WHITESPACE IDENT LPAREN RPAREN WHITESPACE EQUALS 
Line 3
  IDENT WHITESPACE STRING_TEXT (...) STRING_TEXT STRING

It is worth noting that the tokenizer yields multiple LINE_COMMENT tokens and multiple STRING_TEXT tokens for each single comment or string (roughly, one for each word), so if you want to get the entire text of a comment/string, you need to concatenate the tokens.
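A minimal sketch of that concatenation is shown below. Rather than joining token texts one by one, it takes the span from the first matching token to the last and cuts that range out of the line. The helper textOfTokens is hypothetical, the script assumes the package is referenced with `#r "nuget: ..."`, and the calculation assumes that RightColumn is the inclusive index of a token's last character.

```fsharp
#r "nuget: FSharp.Compiler.Service"
open FSharp.Compiler.Tokenization

let sourceTok = FSharpSourceTokenizer([], Some "C:\\test.fsx", Some "PREVIEW", None)

/// Hypothetical helper: the text spanned by all tokens with the given
/// name on one line, or None if the line has no such token
let textOfTokens (tokenName: string) (line: string) =
    let tokenizer = sourceTok.CreateLineTokenizer(line)
    let rec loop state acc =
        match tokenizer.ScanToken(state) with
        | Some tok, state -> loop state (tok :: acc)
        | None, _ -> List.rev acc
    let matching =
        loop FSharpTokenizerLexState.Initial []
        |> List.filter (fun tok -> tok.TokenName = tokenName)
    match matching with
    | [] -> None
    | toks ->
        let left = toks |> List.map (fun t -> t.LeftColumn) |> List.min
        // Assumption: RightColumn is inclusive; clamp to stay inside the line
        let right =
            toks
            |> List.map (fun t -> t.RightColumn)
            |> List.max
            |> min (line.Length - 1)
        Some(line.Substring(left, right - left + 1))

printfn "%A" (textOfTokens "LINE_COMMENT" "// Hello world")
```

This sketch treats all matching tokens on a line as one run, so a line containing two separate strings would need the tokens grouped into consecutive runs first.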
