F# Data: HTML Type Provider

This article demonstrates how to use the HTML type provider to read HTML tables in a statically typed way.

The HTML Type Provider takes a sample HTML document as input and generates a type based on the data present in the columns of that sample. The column names are obtained from the first (header) row.

Introducing the provider

The type provider is located in the FSharp.Data.dll assembly. Assuming the assembly is located in the ../../../bin directory, we can load it in F# Interactive as follows:

#r "../../../bin/FSharp.Data.dll"
open FSharp.Data
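
In more recent F# Interactive sessions, the package can probably also be referenced straight from NuGet rather than from a local path. This is an alternative that the original text does not use, and it assumes `dotnet fsi` (F# 5 or later) with network access; the rest of the article should work the same either way.

```fsharp
// Alternative to the file-path reference above: resolve FSharp.Data from NuGet.
// Assumption: running under dotnet fsi (F# 5+) with access to nuget.org.
#r "nuget: FSharp.Data"
open FSharp.Data
```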

Parsing Power Market Data

The Elexon - BM Reports website provides market data about the UK's current power system. For simplicity, a sample of this data is shown below in CSV format (the raw HTML document it was extracted from is in data/MarketDepth.htm):

Settlement Day,Period,IMBALNGC,Offer Volume,Bid Volume,Accepted Offer Vol,Accepted Bid Vol,UAOV,UABV,PAOV,PABV
2014-01-14,1,877.000,52378.500,-53779.500,348.200,-654.374,0.000,0.000,348.200,-654.374 
2014-01-14,2,196.000,52598.000,-53559.500,349.601,-310.862,0.000,0.000,316.701,-310.862 
2014-01-14,3,-190.000,52575.000,-53283.500,186.183,-2.426,0.000,0.000,162.767,-1.917 
2014-01-14,4,-61.000,52576.000,-53454.500,18.000,-24.158,0.000,0.000,18.000,-24.158 

In HTML files, header cells are usually marked up with the <th> tag. This file does not do so, so the provider assumes that the first row contains the headers. (This behaviour is likely to get smarter in later releases.) It also highlights a general problem with HTML: its lack of strictness.

type MarketDepth = HtmlProvider<"../data/MarketDepth.htm">

The generated type exposes the tables that the provider has managed to parse out of the given HTML document. Each table type's name is derived from the id, title, name, summary or caption attributes/tags, where provided. If none of these exist, the table is simply named Tablexx, where xx is the table's position in the HTML document if all of the tables were flattened out into a list.

The Load method reads the data from a file or web resource. We could also have used a web URL instead of a local file as the sample parameter of the type provider. The following sample calls the Load method with a URL that points to a live market depth servlet on the BM Reports website.

let bmr = 
  "http://www.bmreports.com/servlet/" +
    "com.logica.neta.bwp_MarketDepthServlet"

// Download the latest market depth information
let mrktDepth = 
  MarketDepth.Load(bmr).Tables.Table1

// Look at the most recent row. Note the 'Settlement Day' property
// is of type 'DateTime' and 'Accepted Bid Vol' has type 'float'
let firstRow = mrktDepth.Rows |> Seq.head
let settlementDate = firstRow.``Settlement Day``
let acceptedBid = firstRow.``Accepted Bid Vol``
let acceptedOffer = firstRow.``Accepted Offer Vol``

// Print the bid / offer volumes for each row
for row in mrktDepth.Rows do
  printfn "Bid/Offer: (%A, %A, %A)" 
    row.``Settlement Day`` row.``Bid Volume`` row.``Offer Volume``

The generated type has a Rows property that returns the data from the HTML table as a collection of rows. We iterate over the rows using a for loop. The generated row type has properties such as Settlement Day, Bid Volume and Offer Volume that correspond to the columns in the selected HTML table.

The type provider also infers the types of individual columns. The Settlement Day property is inferred to be a DateTime (because the values in the sample file can all be parsed as dates), while other columns are inferred as decimal or float.
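
To make the idea of sample-based inference more concrete, here is a toy sketch in plain F#. It is not the library's actual algorithm; `guessColumnType` is a hypothetical helper that only illustrates the principle that a column gets a given type when every sample value parses as that type. It assumes the invariant culture and knows just three outcomes.

```fsharp
open System
open System.Globalization

// Toy column-type guess (hypothetical helper, NOT FSharp.Data's real inference):
// a column is given a type only if every sample value parses as that type.
let guessColumnType (values: string list) =
  let inv = CultureInfo.InvariantCulture
  let isDecimal (s: string) =
    let ok, _ = Decimal.TryParse(s, NumberStyles.Number, inv)
    ok
  let isDate (s: string) =
    let ok, _ = DateTime.TryParse(s, inv, DateTimeStyles.None)
    ok
  if List.isEmpty values then "string"              // nothing to infer from
  elif List.forall isDecimal values then "decimal"  // numbers are checked first
  elif List.forall isDate values then "DateTime"
  else "string"

// Values taken from the sample rows shown earlier
printfn "%s" (guessColumnType [ "2014-01-14"; "2014-01-14" ])
printfn "%s" (guessColumnType [ "877.000"; "196.000"; "-190.000" ])
```

Numbers are tested before dates because date parsing tends to be lenient. The real provider has richer rules, for example choosing between decimal and float and handling missing values.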

Parsing NuGet package stats

This small sample shows how the HTML Type Provider can be used to scrape data from a website. In this example we analyze the download counts of the FSharp.Data package on NuGet. Note that we're using the live URL as the sample, so we can just use the default constructor as the runtime data will be the same as the compile time data.

// Configure the type provider
type NugetStats = 
  HtmlProvider<"https://www.nuget.org/packages/FSharp.Data">

// load the live package stats for FSharp.Data
let rawStats = NugetStats().Tables.``Version History``

// helper function to extract the major.minor part of a NuGet version
// (the dot is escaped so it matches only a literal '.')
let getMinorVersion (v:string) =  
  System.Text.RegularExpressions.Regex(@"\d+\.\d+").Match(v).Value

// group by minor version and calculate download count
let stats = 
  rawStats.Rows
  |> Seq.groupBy (fun r -> 
      getMinorVersion r.Version)
  |> Seq.map (fun (k, xs) -> 
      k, xs |> Seq.sumBy (fun x -> x.Downloads))

// Load the FSharp.Charting library
#load "../../../packages/FSharp.Charting/FSharp.Charting.fsx"
open FSharp.Charting

// Visualize the package stats
Chart.Bar stats
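
The grouping and summing step can be tried offline, without the type provider or a network connection. The sketch below is an illustration under stated assumptions: `sampleRows` is a hypothetical stand-in for `rawStats.Rows` with made-up numbers, and `majorMinor` uses the pattern `\d+\.\d+` (one or more digits, a literal dot, one or more digits).

```fsharp
open System.Text.RegularExpressions

// Hypothetical stand-in for rawStats.Rows: (version, downloads) pairs.
// The figures are invented for illustration only.
let sampleRows =
  [ ("2.0.5", 100m)
    ("2.0.4", 50m)
    ("1.1.10", 30m)
    ("10.2.1", 7m) ]

// Extract the leading "major.minor" part of a version string
let majorMinor (v: string) =
  Regex.Match(v, @"\d+\.\d+").Value

// Group by major.minor and sum the downloads within each group
let totals =
  sampleRows
  |> Seq.groupBy (fun (version, _) -> majorMinor version)
  |> Seq.map (fun (key, rows) -> key, rows |> Seq.sumBy snd)
  |> Seq.toList

printfn "%A" totals
```

With this pattern a version such as 10.2.1 is grouped under 10.2, because each numeric component may have more than one digit.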

Getting statistics on Doctor Who

let [<Literal>] DrWho = 
  "http://en.wikipedia.org/wiki/List_of_Doctor_Who_serials"

let doctorWho = new HtmlProvider<DrWho>()

// Get the average number of finale viewers for each Doctor
let viewersByDoctor = 
  doctorWho.Tables.``Series overview``.Rows 
  |> Seq.groupBy (fun season -> season.``Doctor(s)``)
  |> Seq.map (fun (doctor, seasons) -> 
      let averaged = 
        seasons 
        |> Seq.averageBy (fun season -> 
            season.``Viewers (millions) - Finale``)
      doctor, averaged)
  |> Seq.toArray

// Visualize it
Chart.Column(viewersByDoctor)
  .WithYAxis(Title = "Millions")
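
The same group-then-average shape can be sketched on in-memory data. Everything here is an assumption for illustration: the `Season` record is a hypothetical stand-in for the generated row type, and the names and figures are placeholders rather than real viewing data.

```fsharp
// Hypothetical stand-in for the rows of the "Series overview" table.
// Names and figures are placeholders, not real viewing data.
type Season = { Doctor: string; FinaleViewers: float }

let seasons =
  [ { Doctor = "Doctor A"; FinaleViewers = 8.0 }
    { Doctor = "Doctor A"; FinaleViewers = 10.0 }
    { Doctor = "Doctor B"; FinaleViewers = 6.0 } ]

// Group the seasons by doctor, then average the finale viewers per group
let averageByDoctor =
  seasons
  |> Seq.groupBy (fun s -> s.Doctor)
  |> Seq.map (fun (doctor, rows) ->
      doctor, rows |> Seq.averageBy (fun s -> s.FinaleViewers))
  |> Seq.toArray

printfn "%A" averageByDoctor
```

Seq.averageBy needs a type that supports division by an integer, such as float or decimal, so an int column would have to be converted first.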
