This tip shows how to parse an HTML document using HTMLDocumentClass and regular expressions.
Problem:
We need to parse an HTML document and extract only the style element from the whole document text.
Let's say our HTML document is as below:
<html>
<head>
<title>My Page</title>
<style>.style1{height:30px;width:100px}</style>
</head>
<body>This is the home page</body>
</html>
Solution:
Using HTMLDocumentClass:
The HTML document object represents everything inside the html tags. Each tag, such as "body", "form", or "style", is represented by an individual element object (IHTMLElement in mshtml). We can reach every element through the document's "all" property, which returns the collection of all element objects present in the HTML document.
In this example, the string named "htmlToParse" contains the whole HTML document text to be parsed.
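As a minimal sketch of that starting point, the sample page could be placed in htmlToParse as shown below. The markup is an assumed reconstruction of the sample document, and the module name is hypothetical; in practice the text could just as well come from a file or a web response.

```vb
Imports System

Module HtmlToParseSketch
    Sub Main()
        ' Assumed sample markup: a page with a single style element.
        Dim htmlToParse As String = _
            "<html><head><title>My Page</title>" & _
            "<style>.style1{height:30px;width:100px}</style>" & _
            "</head><body>This is the home page</body></html>"

        ' Print the text that the two approaches would be asked to parse.
        Console.WriteLine(htmlToParse)
    End Sub
End Module
```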
First, we need to add a reference to Microsoft.mshtml.
Required Namespace: Imports mshtml
Dim newString As String = String.Empty
' Create an HTML document object and load the text to parse
Dim htmlDocument As IHTMLDocument2 = New HTMLDocumentClass()
htmlDocument.write(htmlToParse)
htmlDocument.close()
' Get all elements present in the document
Dim allElements As IHTMLElementCollection = htmlDocument.all
' Get the collection of style elements
Dim styleSheetCollection As IHTMLElementCollection = CType(allElements.tags("style"), IHTMLElementCollection)
' Take the first style element and read its markup
Dim styleSheetElement As IHTMLElement = CType(styleSheetCollection.item(0), IHTMLElement)
newString = styleSheetElement.outerHTML
Here newString will contain the style element, that is, the rule .style1{height:30px;width:100px} together with its enclosing style tags.
Using Regular Expressions:
Required Namespace: Imports System.Text.RegularExpressions
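Dim newString As String = String.Empty
' Match a style element: opening tag with optional attributes, content, closing tag.
' IgnoreCase accepts <STYLE>; Singleline lets "." span line breaks inside the element.
Dim regularExp As New Regex("<style[^>]*>(.*?)</style>", RegexOptions.IgnoreCase Or RegexOptions.Singleline)
' Search for the first result that matches the specified regular expression pattern
Dim expMatchResult As Match = regularExp.Match(htmlToParse)
' Get the value of the matched result
newString = expMatchResult.Value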
newString will again contain the whole style element: the rule .style1{height:30px;width:100px} together with its enclosing style tags.
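If only the CSS text is wanted, or the page has more than one style element, the same idea can be extended with Regex.Matches and the first capture group. The following is a rough sketch rather than a definitive implementation: the pattern is an assumed form of the one used in this tip, and the module name and the second style element are made up for illustration.

```vb
Imports System
Imports System.Text.RegularExpressions

Module StyleExtractionSketch
    Sub Main()
        ' Hypothetical input: the sample page plus a second style element.
        Dim htmlToParse As String = _
            "<html><head><title>My Page</title>" & _
            "<style>.style1{height:30px;width:100px}</style>" & _
            "<style>.style2{color:red}</style>" & _
            "</head><body>This is the home page</body></html>"

        ' Assumed pattern: opening tag, lazily matched content, closing tag.
        Dim regularExp As New Regex("<style[^>]*>(.*?)</style>", _
                                    RegexOptions.IgnoreCase Or RegexOptions.Singleline)

        ' Matches returns every non-overlapping match, not just the first one.
        For Each styleMatch As Match In regularExp.Matches(htmlToParse)
            ' Value is the whole element, including its tags.
            Console.WriteLine("Element: " & styleMatch.Value)
            ' Groups(1) is the text captured by (.*?), the CSS rules only.
            Console.WriteLine("CSS:     " & styleMatch.Groups(1).Value)
        Next
    End Sub
End Module
```

The lazy quantifier (.*?) matters here: a greedy (.*) would run from the first opening style tag to the last closing one and merge both elements into a single match. For markup that is malformed or deeply nested, the HTMLDocumentClass approach is likely the more robust choice.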
In both cases, newString contains only the style element.