F# Data: HTML Parser

This article demonstrates how to use the HTML Parser to parse HTML files.

The HTML parser takes any fragment of HTML, a URI, or a stream and tries to parse it into a DOM. The parser is based on the HTML Living Standard. Once a document or fragment has been parsed, a set of extension methods over the HTML DOM elements allows you to extract information from a web page independently of the actual HTML Type Provider.

#r "../../../bin/FSharp.Data.dll"
open FSharp.Data
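The parser does not need a network call to be tried out. The following is a minimal sketch that parses the same markup from a string and from a text reader. It assumes a recent F# Interactive that supports `#r "nuget: ..."` (otherwise reference the DLL as shown above); the markup and the names `html`, `fromText`, `fromReader` and `paragraphText` are made up for illustration.

```fsharp
#r "nuget: FSharp.Data" // assumption: F# Interactive with NuGet support
open System.IO
open FSharp.Data

// Made-up markup, used only for illustration.
let html = "<html><head><title>Sample</title></head><body><p>Hello</p></body></html>"

// Parse directly from a string.
let fromText = HtmlDocument.Parse(html)

// Parse the same markup through a TextReader.
let fromReader = HtmlDocument.Load(new StringReader(html))

// Hypothetical helper: collect the text of every <p> element in a document.
let paragraphText (doc: HtmlDocument) =
    doc.Descendants ["p"]
    |> Seq.map (fun p -> p.InnerText())
    |> Seq.toList

printfn "%A" (paragraphText fromText)
printfn "%A" (paragraphText fromReader)
```

Both calls should produce equivalent documents, so which one to use depends only on where the HTML comes from.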

The following example uses Google to search for FSharp.Data then parses the first set of search results from the page, extracting the URL and Title of the link.

To achieve this we must first parse the web page into our DOM. We can do this using the HtmlDocument.Load method, which takes a URL and makes a synchronous web call to retrieve the page. Note: an asynchronous variant, HtmlDocument.AsyncLoad, is also available.

let results = HtmlDocument.Load("http://www.google.co.uk/search?q=FSharp.Data")
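For comparison, the asynchronous variant can be used inside an `async` workflow. This is only a sketch: `countAnchorsAsync` is a hypothetical helper, not part of the library, and the `#r "nuget: ..."` reference assumes a recent F# Interactive. The usage lines are commented out because they perform real web requests.

```fsharp
#r "nuget: FSharp.Data" // assumption: F# Interactive with NuGet support
open FSharp.Data

// Hypothetical helper: load a page asynchronously and count its anchor tags.
let countAnchorsAsync (url: string) : Async<string * int> =
    async {
        let! doc = HtmlDocument.AsyncLoad(url)
        let count = doc.Descendants ["a"] |> Seq.length
        return url, count
    }

// Usage (makes real web requests, so it is left commented out):
// [ "http://fsharp.github.io/FSharp.Data/" ]
// |> List.map countAnchorsAsync
// |> Async.Parallel
// |> Async.RunSynchronously
// |> Array.iter (fun (url, n) -> printfn "%s: %d anchors" url n)
```

Returning an `Async` value lets several pages be fetched with `Async.Parallel` instead of blocking on each request in turn.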

Now that we have a loaded HTML document, we can begin to extract data from it. First we extract all of the anchor tags (a) from the document, then inspect each one to see whether it has an href attribute. If it does, we extract its value, which in this case is the URL that the search result points to, together with the InnerText of the anchor tag, which gives the name of the web page for that search result.

let links = 
    results.Descendants ["a"]
    |> Seq.choose (fun x -> 
           x.TryGetAttribute("href")
           |> Option.map (fun a -> x.InnerText(), a.Value())
    )
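Because live search results change over time, it can help to try the same extraction offline. The sketch below applies the pattern to a made-up fragment parsed with HtmlDocument.Parse; the names `sample` and `sampleLinks` are hypothetical, and the `#r "nuget: ..."` reference assumes a recent F# Interactive.

```fsharp
#r "nuget: FSharp.Data" // assumption: F# Interactive with NuGet support
open FSharp.Data

// A made-up fragment: two anchors with an href and one without.
let sample =
    HtmlDocument.Parse("""<html><body>
        <a href="/first">First</a>
        <a>No link</a>
        <a href="/second">Second</a>
    </body></html>""")

// Keep only anchors that have an href, pairing the link text with its target.
let sampleLinks =
    sample.Descendants ["a"]
    |> Seq.choose (fun a ->
        a.TryGetAttribute("href")
        |> Option.map (fun href -> a.InnerText(), href.Value()))
    |> Seq.toList

printfn "%A" sampleLinks
```

The anchor without an href is dropped because TryGetAttribute returns None for it and Seq.choose discards None values.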

Now that we have extracted the links, you will notice that there are many other links to various Google services and to cached/similar results. Ideally we would filter these out, as we are probably not interested in them. At this point we simply have a sequence of tuples, so F# makes this straightforward using Seq.filter and Seq.map.

let searchResults =
    links
    |> Seq.filter (fun (name, url) -> 
                    name <> "Cached" && name <> "Similar" && url.StartsWith("/url?"))
    |> Seq.map (fun (name, url) -> name, url.Substring(0, url.IndexOf("&sa=")).Replace("/url?q=", ""))
    |> Seq.toArray
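Note that url.IndexOf("&sa=") returns -1 when the marker is missing, in which case Substring throws. A more defensive variant of the URL clean-up might look like the sketch below. `tryCleanUrl` is a hypothetical helper and the sample hrefs are made up; unlike the filter above, it matches on the full "/url?q=" prefix.

```fsharp
open System

// Hypothetical helper: extract the target from a "/url?q=<target>&sa=..." href.
// Returns None when the href does not start with the expected prefix.
let tryCleanUrl (href: string) : string option =
    let prefix = "/url?q="
    if href.StartsWith(prefix, StringComparison.Ordinal) then
        let rest = href.Substring(prefix.Length)
        // IndexOf returns -1 when "&sa=" is absent; keep the whole remainder then.
        match rest.IndexOf("&sa=", StringComparison.Ordinal) with
        | -1 -> Some rest
        | i -> Some (rest.Substring(0, i))
    else
        None

// Made-up hrefs: one with the marker, one without, one that is not a result link.
let cleaned =
    [ "/url?q=http://fsharp.github.io/FSharp.Data/&sa=U&ved=abc"
      "/url?q=http://fsharp.org/"
      "/search?q=other" ]
    |> List.choose tryCleanUrl

printfn "%A" cleaned
```

With a helper of this shape, the Seq.filter and Seq.map steps could be combined into a single Seq.choose over the (name, url) tuples.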

Putting this all together yields the following:

[|("F# Data: Library for Data Access", "http://fsharp.github.io/FSharp.Data/");
  ("CSV Type Provider",
   "http://fsharp.github.io/FSharp.Data/library/CsvProvider.html");
  ("JSON Type Provider",
   "http://fsharp.github.io/FSharp.Data/library/JsonProvider.html");
  ("HTML Type Provider",
   "http://fsharp.github.io/FSharp.Data/library/HtmlProvider.html");
  ("Documentation", "http://fsharp.github.io/FSharp.Data/library/Http.html");
  ("XML Type Provider",
   "http://fsharp.github.io/FSharp.Data/library/XmlProvider.html");
  ("F# Data", "http://fsharp.github.io/FSharp.Data/reference/");
  ("GitHub - fsharp/FSharp.Data: F# Data: Library for Data Access",
   "https://github.com/fsharp/FSharp.Data");
  ("NuGet Gallery | F# Data 2.3.2", "https://www.nuget.org/packages/FSharp.Data");
  ("Guide - Data Access | The F# Software Foundation - FSharp.org",
   "http://fsharp.org/guides/data-access/");
  ("F# Data: New type provider library - Tomas Petricek",
   "http://tomasp.net/blog/fsharp-data.aspx/");
  ("Microsoft.FSharp.Data.TypeProviders Namespace (F#) - MSDN",
   "https://msdn.microsoft.com/en-us/visualfsharpdocs/conceptual/microsoft.fsharp.data.typeproviders-namespace-%255Bfsharp%255D");
  ("FsLab - Data science and machine learning with F#", "https://fslab.org/")|]