F# Data: HTML Type Provider
This article demonstrates how to use the HTML type provider to read HTML tables in a statically typed way.
The HTML Type Provider takes a sample HTML document as input and generates a type based on the data present in the columns of that sample. The column names are obtained from the first (header) row.
Introducing the provider
The type provider is located in the FSharp.Data.dll assembly. Assuming the assembly is located in the ../../../bin directory, we can load it in F# Interactive as follows:
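A minimal sketch of the F# Interactive directives this describes, using the assembly name and directory stated above. Adjust the path if your copy of the assembly lives elsewhere.

```fsharp
// Reference the assembly from the directory named in the text.
#r "../../../bin/FSharp.Data.dll"

// Bring the type providers (including HtmlProvider) into scope.
open FSharp.Data
```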
Parsing Power Market Data
The Elexon - BM Reports website provides market data about the UK's current power system. For simplicity, an example of this data is shown below in CSV format,
(you can see an example of the raw HTML document this data was extracted from in
In HTML files, headers are usually marked with the <th> tag.
The generated type provides a type space of tables that it has managed to parse out of the given HTML Document.
Each type's name is derived from either the id, title, name, summary or caption attributes/tags provided. If none of these entities exist then the table will simply be named Tablexx, where xx is the position in the HTML document if all of the tables were flattened out into a list.
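The naming rule can be sketched with a small inline sample. Everything in this example is an assumption made for illustration: the sample document, the type name MarketSample, the table id "Market", and the use of a NuGet package reference (which needs a recent F# Interactive) instead of a local assembly.

```fsharp
#r "nuget: FSharp.Data" // assumption: package reference instead of a local DLL
open FSharp.Data

// Hypothetical inline sample with a single table that carries an id attribute.
type MarketSample = HtmlProvider<"""<html><body><table id="Market">
<tr><th>Settlement Day</th><th>Bid Volume</th><th>Offer Volume</th></tr>
<tr><td>2014-01-14</td><td>-53779.5</td><td>52378.5</td></tr>
<tr><td>2014-01-15</td><td>-51000.25</td><td>50100.75</td></tr>
</table></body></html>""">

// The id attribute ("Market") is expected to become the property name under Tables.
let market = MarketSample.GetSample().Tables.Market

printfn "Rows in the Market table: %d" (Seq.length market.Rows)
```

Here GetSample reads the same document that was used as the compile-time sample, so no file or network access is involved.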
The Load method allows reading the data from a file or web resource. We could also have used a web URL instead of a local file in the sample parameter of the type provider.
The following sample calls the Load method with a URL that points to a live market depth servlet on the BM Reports website.
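As a network-free sketch of the same idea, the example below calls Load on a local file rather than the live servlet. The inline sample, the type name MarketSample, the table id "Market", the temporary file name and the NuGet package reference are all assumptions made for illustration. The point is only that the runtime document may hold different data from the sample, as long as the table has the same shape.

```fsharp
#r "nuget: FSharp.Data" // assumption: package reference instead of a local DLL
open System.IO
open FSharp.Data

// Hypothetical compile-time sample that fixes the table name and column types.
type MarketSample = HtmlProvider<"""<html><body><table id="Market">
<tr><th>Settlement Day</th><th>Bid Volume</th><th>Offer Volume</th></tr>
<tr><td>2014-01-14</td><td>-53779.5</td><td>52378.5</td></tr>
<tr><td>2014-01-15</td><td>-51000.25</td><td>50100.75</td></tr>
</table></body></html>""">

// Runtime document: same table shape, different data, written to a temporary file.
let runtimeHtml = """<html><body><table id="Market">
<tr><th>Settlement Day</th><th>Bid Volume</th><th>Offer Volume</th></tr>
<tr><td>2014-02-01</td><td>-40000.5</td><td>41000.5</td></tr>
<tr><td>2014-02-02</td><td>-39000.25</td><td>40500.75</td></tr>
</table></body></html>"""

let path = Path.Combine(Path.GetTempPath(), "market-depth-runtime.html")
File.WriteAllText(path, runtimeHtml)

// Load takes a file path or a URL and parses that document at run time.
let latest = MarketSample.Load(path).Tables.Market

// Look at the first row of the freshly loaded table.
let firstRow = latest.Rows |> Seq.head
printfn "First settlement day: %A" firstRow.``Settlement Day``
```

Passing a URL string to Load instead of the file path would fetch the document from the web in the same way.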
The generated type has a property Rows that returns the data from the HTML file as a collection of rows. We iterate over the rows using a for loop. As you can see, the (generated) type for rows has properties such as Bid Volume and Offer Volume that correspond to the columns in the selected HTML table.
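Iterating over Rows can be sketched as follows. The inline sample, the type name MarketSample, the table id "Market", the Settlement Day column and the NuGet package reference are assumptions made for illustration; only the Bid Volume and Offer Volume column names come from the text above.

```fsharp
#r "nuget: FSharp.Data" // assumption: package reference instead of a local DLL
open FSharp.Data

// Hypothetical inline sample whose header cells become row properties.
type MarketSample = HtmlProvider<"""<html><body><table id="Market">
<tr><th>Settlement Day</th><th>Bid Volume</th><th>Offer Volume</th></tr>
<tr><td>2014-01-14</td><td>-53779.5</td><td>52378.5</td></tr>
<tr><td>2014-01-15</td><td>-51000.25</td><td>50100.75</td></tr>
</table></body></html>""">

let market = MarketSample.GetSample().Tables.Market

// Column names that contain spaces are accessed with double-backtick identifiers.
for row in market.Rows do
    // Both sample values parse as dates, so this column should be typed as DateTime.
    let day: System.DateTime = row.``Settlement Day``
    // The two volume columns share a numeric type, so they can be added directly.
    let net = row.``Bid Volume`` + row.``Offer Volume``
    printfn "Day %d: bid %A, offer %A, net %A" day.Day row.``Bid Volume`` row.``Offer Volume`` net

// Collect the day of month from each row, in document order.
let days = [ for row in market.Rows -> row.``Settlement Day``.Day ]
```

Because the properties are generated at compile time, a misspelled column name is reported by the compiler rather than failing at run time.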
As you can see, the type provider also infers the types of individual columns. A column is inferred to be a DateTime when the values in the sample file can all be parsed as dates, while the types of other columns are inferred from their sample values in the same way.
Parsing NuGet package stats
This small sample shows how the HTML Type Provider can be used to scrape data from a website. In this example we analyze the download counts of the FSharp.Data package on NuGet. Note that we're using the live URL as the sample, so we can just use the default constructor as the runtime data will be the same as the compile time data.
Getting statistics on Doctor Who
- F# Data: HTML Parser - provides more information about working with HTML documents dynamically.