F# Data: HTML Parser
This article demonstrates how to use the HTML Parser to parse HTML files.
The HTML parser takes any fragment of HTML, a URI, or a stream and tries to parse it into a DOM. The parser is based on the HTML Living Standard. Once a document or fragment has been parsed, a set of extension methods over the HTML DOM elements allows you to extract information from a web page independently of the actual HTML type provider.
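As a minimal sketch of two of these inputs, the script below parses the same markup once from a string and once from a stream. It assumes an F# script (.fsx) run with F# Interactive, with the package referenced through `#r "nuget: ..."`; the markup and the `headingsOf` helper are made up for illustration.

```fsharp
#r "nuget: FSharp.Data"
open System.IO
open System.Text
open FSharp.Data

// Made-up markup, used only so the sketch runs without a network connection.
let html = "<html><body><h1>Title</h1><p>Body</p></body></html>"

// Input 1: an HTML string.
let fromString = HtmlDocument.Parse(html)

// Input 2: a stream (here an in-memory one wrapping the same markup).
let fromStream =
    use stream = new MemoryStream(Encoding.UTF8.GetBytes(html))
    HtmlDocument.Load(stream)

// Hypothetical helper: collect the text of every h1 element in a document.
let headingsOf (doc: HtmlDocument) =
    doc.Descendants ["h1"]
    |> Seq.map (fun node -> node.InnerText())
    |> Seq.toList
```

Both documents expose the same extension methods, so `headingsOf fromString` and `headingsOf fromStream` can be compared directly.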
The following example searches Google for FSharp.Data, then parses the first set of
search results from the page, extracting the URL and title of each link.
To achieve this, we must first parse the web page into our DOM. We can do this using the
HtmlDocument.Load method. This method takes a URL and makes a synchronous web call
to extract the data from the page. Note: an asynchronous variant,
HtmlDocument.AsyncLoad, is also available.
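A sketch of both loading styles might look like this. The search URL is an assumption chosen for illustration, and `loadResults` and `loadResultsAsync` are hypothetical wrapper names; the calls are wrapped in functions so that loading the script does not itself trigger a web request.

```fsharp
#r "nuget: FSharp.Data"
open FSharp.Data

// Assumed search URL, chosen for illustration only.
let searchUrl = "https://www.google.com/search?q=FSharp.Data"

// Synchronous: blocks until the page has been downloaded and parsed.
let loadResults () : HtmlDocument =
    HtmlDocument.Load(searchUrl)

// Asynchronous: returns an Async<HtmlDocument> for the caller to run.
let loadResultsAsync () : Async<HtmlDocument> =
    HtmlDocument.AsyncLoad(searchUrl)

// Example use (performs a real web request, so it is left commented out):
// let results = loadResults ()
// let results = loadResultsAsync () |> Async.RunSynchronously
```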
Now that we have a loaded HTML document, we can begin to extract data from it.
First, we extract all of the anchor tags (a) from the document, then
inspect each link to see whether it has an href attribute. If it does, we extract the value,
which in this case is the URL that the search result points to, along with the
InnerText of the anchor tag, which gives the name of the web page for that search result.
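The extraction step can be sketched as follows. To keep the sketch runnable offline, `results` is parsed from a small made-up page rather than loaded from Google; the markup is an assumption, not the real search page.

```fsharp
#r "nuget: FSharp.Data"
open FSharp.Data

// Stand-in for the downloaded page (made-up markup, so the sketch runs offline).
let results =
    HtmlDocument.Parse(
        """<html><body>
             <a href="/url?q=http://fsharp.github.io/FSharp.Data/">FSharp.Data</a>
             <a name="top">Top</a>
           </body></html>""")

// Keep only anchors that carry an href, pairing the link text with its target.
let links =
    results.Descendants ["a"]
    |> Seq.choose (fun anchor ->
        anchor.TryGetAttribute("href")
        |> Option.map (fun href -> anchor.InnerText(), href.Value()))
    |> Seq.toList
```

Seq.choose drops the anchors for which TryGetAttribute returns None, so the second anchor, which has no href, does not appear in `links`.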
Now that we have extracted our search results, you will notice that there are lots of
other links to various Google services and cached/similar results. Ideally we would
like to filter these out, as we are probably not interested in them.
At this point we simply have a sequence of tuples, so F# makes this trivial using
Seq.filter and Seq.map.
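A minimal sketch of such a filter is shown below. The sample tuples, the "Cached" and "Similar" link names, and the "/url?q=" prefix are assumptions about what the page contains, not a description of Google's actual markup.

```fsharp
// Hypothetical (name, url) pairs standing in for the extracted links.
let links =
    [ "FSharp.Data", "/url?q=http://fsharp.github.io/FSharp.Data/"
      "Cached", "/url?q=http://webcache.example/copy"
      "Similar", "/search?q=related:fsharp.github.io"
      "Images", "/imghp" ]

// Assumed prefix that marks a genuine search result link.
let prefix = "/url?q="

let searchResults =
    links
    // Drop the cached/similar links and anything that is not a result link.
    |> Seq.filter (fun (name, url) ->
        name <> "Cached" && name <> "Similar" && url.StartsWith(prefix))
    // Strip the redirect prefix so only the target URL remains.
    |> Seq.map (fun (name, url) -> name, url.Substring(prefix.Length))
    |> Seq.toArray
```

Filtering first and mapping second keeps the URL clean-up simple, because every URL that reaches Seq.map is known to start with the prefix.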
Putting this all together yields the following:
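One way to combine the steps is a single function from a parsed page to an array of results, sketched here under the same assumptions as before: `extractSearchResults` is a hypothetical name, the "/url?q=" prefix and link names are assumed, and the sample page is made up so the script runs offline.

```fsharp
#r "nuget: FSharp.Data"
open FSharp.Data

// Assumed prefix that marks a genuine search result link.
let prefix = "/url?q="

// Hypothetical helper: anchors -> (title, url) pairs -> filtered, cleaned results.
let extractSearchResults (page: HtmlDocument) =
    page.Descendants ["a"]
    |> Seq.choose (fun anchor ->
        anchor.TryGetAttribute("href")
        |> Option.map (fun href -> anchor.InnerText(), href.Value()))
    |> Seq.filter (fun (name, url) ->
        name <> "Cached" && name <> "Similar" && url.StartsWith(prefix))
    |> Seq.map (fun (name, url) -> name, url.Substring(prefix.Length))
    |> Seq.toArray

// Made-up page standing in for the live search results.
let samplePage =
    HtmlDocument.Parse(
        """<html><body>
             <a href="/url?q=http://fsharp.github.io/FSharp.Data/">FSharp.Data</a>
             <a href="/url?q=http://webcache.example/copy">Cached</a>
             <a href="/imghp">Images</a>
             <a name="top">Top</a>
           </body></html>""")

let searchResults = extractSearchResults samplePage

// Print each surviving result as "title -> url".
for (title, url) in searchResults do
    printfn "%s -> %s" title url
```

Against a live page, the same function could be applied to the document returned by HtmlDocument.Load.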