F# Data: HTML CSS selectors

This article demonstrates how to use HTML CSS selectors to browse the DOM of parsed HTML files.

CSS selectors are a natural way to query HTML for anyone with a web development background. The selectors supported here are modeled on jQuery selectors. To use them, reference the F# Data library and open the FSharp.Data namespace, which automatically exposes the extension methods that implement CSS selection.

#r "../../../bin/FSharp.Data.dll"
open FSharp.Data

Practice 1: Search for F# Data on Google

We will extract the links from a Google search for FSharp.Data, as in the HTML Parser article.

let googleUrl = "http://www.google.co.uk/search?q=FSharp.Data"
let doc = HtmlDocument.Load(googleUrl)

To make sure we extract search results only, we restrict the query to links inside the <div> with id search. From there we can, for example, use the child selector (>) to select the <div> with id ires directly beneath it. The CSS selector for this is div#search > div#ires:

let links = 
  doc.CssSelect("div#search > div#ires div.g > div.s div.kv cite")
  |> List.map (fun n -> 
      match n.InnerText() with
      // Keep results that already carry a scheme; otherwise prepend one.
      | t when t.StartsWith("https://") || t.StartsWith("http://") -> t
      | t -> "http://" + t)

The rest of the selector (written as div.g > div.s div.kv cite) skips the first 4 sub-results targeting GitHub pages, so we only extract proper links.
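
The match expression in the listing above prepends a scheme to results that are displayed without one. As a minimal sketch, the same logic can be pulled into a small helper so it can be checked in isolation. The name ensureScheme is hypothetical and not part of F# Data, and the sample inputs are made up for illustration.

```fsharp
open System

/// Prepend "http://" unless the text already starts with an HTTP(S) scheme.
/// `ensureScheme` is a hypothetical helper name, not an F# Data function.
let ensureScheme (text: string) =
    let hasScheme =
        text.StartsWith("http://", StringComparison.Ordinal)
        || text.StartsWith("https://", StringComparison.Ordinal)
    if hasScheme then text else "http://" + text

// Made-up sample inputs, mimicking the text of <cite> elements.
let normalized =
    [ "https://example.org/docs"; "example.org/docs" ]
    |> List.map ensureScheme
```

With such a helper, the mapping step could be written as List.map (fun n -> ensureScheme (n.InnerText())).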

Now we might want the page titles associated with their URLs. To do this, we can use the List.zip function:

let searchResults = 
  doc.CssSelect("div#search > div#ires div.g > h3")
  |> List.map (fun n -> n.InnerText())
  // Pair each link with its title, giving (url, title) tuples.
  |> List.zip links
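
List.zip raises an ArgumentException when its two lists differ in length, which could happen here if a result has a title but no matching cite element. The following is a minimal sketch of a more forgiving pairing, using made-up sample data rather than live search results. The helper name zipTruncating is hypothetical; it relies on Seq.zip, which stops at the shorter input instead of throwing.

```fsharp
// Made-up sample data standing in for the scraped links and titles.
let sampleLinks = [ "http://a.example"; "http://b.example"; "http://c.example" ]
let sampleTitles = [ "Title A"; "Title B" ]

/// Pair links with titles, stopping at the shorter list instead of throwing.
/// `zipTruncating` is a hypothetical helper name.
let zipTruncating (links: string list) (titles: string list) =
    Seq.zip links titles |> List.ofSeq

let pairs = zipTruncating sampleLinks sampleTitles

// List.zip on the same inputs raises ArgumentException.
let listZipThrows =
    try
        List.zip sampleLinks sampleTitles |> ignore
        false
    with :? System.ArgumentException -> true
```

Truncating can still pair a title with the wrong link when an element is missing in the middle of the page, so comparing the two lengths before zipping remains a useful sanity check.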

Practice 2: Search F# books on Youscribe

We will extract links from the Youscribe web site, searching for F#. After downloading the document, we only need to match the relevant links using their CSS classes and their position in the DOM hierarchy. For Youscribe, we look for a <div> with class document-infos and then for all <a> elements with CSS class doc-explore-title:

let fsys = "http://en.youscribe.com/o-reilly-media/?quick_search=f%23"
let doc2 = HtmlDocument.Load(fsys)

let books = 
  doc2.CssSelect("div.document-infos a.doc-explore-title")
  |> List.map(fun a -> a.InnerText().Trim(), a.AttributeValue("href"))
  |> List.filter(fun (title, href) -> title.Contains("F#"))

jQuery selectors

This section provides a quick overview of the supported CSS selectors. If you are familiar with CSS selectors in jQuery, you will see that most of the features are the same. You can also refer to the table below for a complete list of supported selectors.

Attribute Contains Prefix Selector

Finds all links with an English hreflang attribute, that is, a value that is either exactly "en" or starts with "en-".

let englishDoc = HtmlDocument.Parse("""
  <!doctype html>
  <html lang="en">
  <body>
    <a href="example.html" hreflang="en">Some text</a>
    <a href="example.html" hreflang="en-UK">Some other text</a>
    <a href="example.html" hreflang="english">will not be outlined</a>
  </body>
  </html>""")

let englishLinks = 
  englishDoc.CssSelect("a[hreflang|=en]")

Attribute Contains Selector

Finds all inputs with a name containing "man". This includes results where "man" is a substring:

let manDoc = HtmlDocument.Parse("""
  <!doctype html>
  <html lang="en">
  <body>
    <input name="man-news">
    <input name="milkman">
    <input name="milk man">
    <input name="letterman2">
    <input name="newmilk">
    <input name="man">
    <input name="newsletter">
  </body>
  </html>""")

let manElems = 
  manDoc.CssSelect("input[name*='man']")

Attribute Contains Word Selector

Finds all inputs with a name containing the word "man". The word must be delimited by whitespace or be the whole value:

let manWordElems = 
  manDoc.CssSelect("input[name~='man']")

Attribute Ends With Selector

Finds all inputs with a name ending with "man".

let manEndElems = 
  manDoc.CssSelect("input[name$='man']")

Attribute Equals Selector

Finds all inputs with a name equal to "man".

let manEqElems = 
  manDoc.CssSelect("input[name='man']")

Attribute Not Equal Selector

Finds all inputs with a name different from "man".

let notManElems =
  manDoc.CssSelect("input[name!='man']")

Attribute Starts With Selector

Finds all inputs with a name starting with "man".

let manStartElems =
  manDoc.CssSelect("input[name^='man']")
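
To compare the attribute operators side by side, the sketch below models each one as a plain string predicate and applies it to the name values used in manDoc. It follows the jQuery definitions of the operators and does not call F# Data, so edge cases in the library's own implementation may differ. The function name matchesAttr is hypothetical.

```fsharp
open System

/// Model of the jQuery attribute operators as string predicates.
/// `matchesAttr` is a hypothetical name; this does not call F# Data.
let matchesAttr (op: string) (expected: string) (actual: string) =
    match op with
    | "=" -> actual = expected
    | "!=" -> actual <> expected
    | "^=" -> actual.StartsWith(expected, StringComparison.Ordinal)
    | "$=" -> actual.EndsWith(expected, StringComparison.Ordinal)
    | "*=" -> actual.Contains(expected)
    | "~=" ->
        // Whitespace-delimited word match.
        actual.Split([| ' '; '\t'; '\n' |], StringSplitOptions.RemoveEmptyEntries)
        |> Array.contains expected
    | "|=" ->
        // Exact match, or the value followed by a hyphen.
        actual = expected || actual.StartsWith(expected + "-", StringComparison.Ordinal)
    | _ -> invalidArg "op" ("Unsupported operator: " + op)

// The name attributes of the inputs in manDoc.
let names =
    [ "man-news"; "milkman"; "milk man"; "letterman2"; "newmilk"; "man"; "newsletter" ]

/// Names matched by the given operator when the expected value is "man".
let matching op = names |> List.filter (matchesAttr op "man")

for op in [ "*="; "~="; "$="; "="; "!="; "^="; "|=" ] do
    printfn "%s -> %A" op (matching op)
```

Under this model, ~= keeps only "milk man" and "man", while *= also keeps "man-news", "milkman" and "letterman2", which shows the difference between the word and substring variants.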

Forms helpers

Some shorthand selectors make it easy to find form controls.

let htmlForm = HtmlDocument.Parse("""
  <!doctype html>
  <html>
  <body>
  <form>
    <fieldset>
      <input type="button" value="Input Button">
      <input type="checkbox" id="check1">
      <input type="hidden" id="hidden1">
      <input type="password" id="pass1">
      <input name="email" disabled="disabled">
      <input type="radio" id="radio1">
      <input type="checkbox" id="check2" checked="checked">
      <input type="file" id="uploader1">
      <input type="reset">
      <input type="submit">
      <input type="text">
      <select><option>Option</option></select>
      <textarea class="comment box1">Type a comment here</textarea>
      <button>Go !</button>
    </fieldset>
  </form>
  </body>
  </html>""")

You can use the :prop syntax to find HTML elements with a given value of the type attribute or with a given form control property. This lets you easily select all buttons, checkboxes and radio buttons, as well as hidden or disabled form elements:

// Find all buttons.
let buttons = htmlForm.CssSelect(":button")

// Find all checkboxes.
let checkboxes = htmlForm.CssSelect(":checkbox")

// Find all checked checkboxes or radio buttons.
let checkd = htmlForm.CssSelect(":checked")

// Find all disabled controls.
let disabled = htmlForm.CssSelect(":disabled")

// Find all inputs with type hidden.
let hidden = htmlForm.CssSelect(":hidden")

// Find all inputs with type radio.
let radio = htmlForm.CssSelect(":radio")

// Find all inputs with type password.
let password = htmlForm.CssSelect(":password")

// Find all file upload inputs.
let file = htmlForm.CssSelect(":file")

Implemented and missing features

Basic CSS selectors are implemented, but some jQuery selectors are missing.

This table lists all jQuery selectors and their status:

| Selector name | Status |
|---|---|
| All Selector | TODO |
| :animated Selector | not possible |
| Attribute Contains Prefix Selector | implemented |
| Attribute Contains Selector | implemented |
| Attribute Contains Word Selector | implemented |
| Attribute Ends With Selector | implemented |
| Attribute Equals Selector | implemented |
| Attribute Not Equal Selector | implemented |
| Attribute Starts With Selector | implemented |
| :button Selector | implemented |
| :checkbox Selector | implemented |
| :checked Selector | implemented |
| Child Selector (“parent > child”) | implemented |
| Class Selector (“.class”) | implemented |
| :contains() Selector | TODO |
| Descendant Selector (“ancestor descendant”) | implemented |
| :disabled Selector | implemented |
| Element Selector (“element”) | implemented |
| :empty Selector | implemented |
| :enabled Selector | implemented |
| :eq() Selector | TODO |
| :even Selector | implemented |
| :file Selector | implemented |
| :first-child Selector | TODO |
| :first-of-type Selector | TODO |
| :first Selector | TODO |
| :focus Selector | not possible |
| :gt() Selector | TODO |
| Has Attribute Selector [name] | implemented |
| :has() Selector | TODO |
| :header Selector | TODO |
| :hidden Selector | implemented |
| ID Selector (“#id”) | implemented |
| :image Selector | implemented |
| :input Selector | implemented |
| :lang() Selector | TODO |
| :last-child Selector | TODO |
| :last-of-type Selector | TODO |
| :last Selector | TODO |
| :lt() Selector | TODO |
| Multiple Attribute Selector [name="value"][name2="value2"] | implemented |
| Multiple Selector (“selector1, selector2, selectorN”) | TODO |
| Next Adjacent Selector (“prev + next”) | TODO |
| Next Siblings Selector (“prev ~ siblings”) | TODO |
| :not() Selector | TODO |
| :nth-child() Selector | TODO |
| :nth-last-child() Selector | TODO |
| :nth-last-of-type() Selector | TODO |
| :nth-of-type() Selector | TODO |
| :odd Selector | implemented |
| :only-child Selector | TODO |
| :only-of-type Selector | TODO |
| :parent Selector | TODO |
| :password Selector | implemented |
| :radio Selector | implemented |
| :reset Selector | not possible |
| :root Selector | useless [1] |
| :selected Selector | implemented |
| :submit Selector | implemented |
| :target Selector | not possible |
| :text Selector | implemented |
| :visible Selector | not possible |

[1] The :root selector seems to be useless in our case because, with the HTML parser, the root is always the html node.
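
The Multiple Selector (“selector1, selector2, selectorN”) is listed as TODO in the table. One possible workaround, sketched here under the assumption that running each selector separately is acceptable, is to concatenate the individual results. The helper name selectAny is hypothetical, and the selection function is passed in as a parameter so that the sketch runs without F# Data; the stub below returns made-up data.

```fsharp
/// Run each selector through `select` and concatenate the results.
/// `selectAny` is a hypothetical helper, not part of F# Data.
/// Results are grouped by selector rather than in document order, and a
/// node matched by several selectors appears once per match.
let selectAny (select: string -> 'a list) (selectors: string list) : 'a list =
    selectors |> List.collect select

// Stub standing in for a real document, with made-up results.
let stubSelect (selector: string) =
    match selector with
    | ":button" -> [ "input[type=button]"; "button" ]
    | ":submit" -> [ "input[type=submit]" ]
    | _ -> []

let combined = selectAny stubSelect [ ":button"; ":submit" ]
```

With a parsed document, the selection function could be a lambda such as fun s -> htmlForm.CssSelect(s), which gives selectAny the same shape as a single multi-selector query.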
