Ways to parse an HTML document text

April 13, 2021

This tip shows how to parse an HTML document using HTMLDocumentClass and a regular expression.

Problem:

We need to parse an HTML document and get only the style element from the whole document text.
Let's say our HTML document is as below:

<html>
    <head>
        <title>My Page</title>
        <style>
            .style1{height:30px;width:100px}
        </style>
    </head>
    <body>This is the home page</body>
</html>

Solution:

Using HTMLDocumentClass:

HtmlDocument represents the entire document contained within the HTML tags. Each tag, such as "body", "form" or "style", is represented by an individual HtmlElement object. We can reach every element through the "all" property, which returns the collection of all HtmlElement objects present in the document.

In this example, the string named "htmlToParse" contains the whole HTML document text to be parsed.

First, we need to add a reference to Microsoft.mshtml.

Required namespace: Imports mshtml

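Dim newString As String = String.Empty

' Create an HTML document object
Dim htmlDocument As IHTMLDocument2 = New HTMLDocumentClass()

' Load the markup into the document, then close the write stream
htmlDocument.write(htmlToParse)
htmlDocument.close()

' Get all elements present in the document
Dim allElements As IHTMLElementCollection = htmlDocument.all

' Get the collection of style elements
Dim styleSheetCollection As IHTMLElementCollection = CType(allElements.tags("style"), IHTMLElementCollection)

' Take the first style element from that collection
Dim styleSheetElement As IHTMLElement = CType(styleSheetCollection.item(0), IHTMLElement)

' outerHTML returns the element together with its opening and closing tags
newString = styleSheetElement.outerHTML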

Here newString holds the style element together with its tags, along the lines of: <style>.style1{height:30px;width:100px}</style>

Using a regular expression:

Required namespace: Imports System.Text.RegularExpressions

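Dim newString As String = String.Empty

' Match a style element; Singleline lets "." span line breaks inside the element,
' and IgnoreCase also accepts <STYLE>
Dim regularExp As New Regex("<style[^>]*>(.*?)</style>", RegexOptions.IgnoreCase Or RegexOptions.Singleline)

' Search for the first result that matches the specified pattern
Dim expMatchResult As Match = regularExp.Match(htmlToParse)

' Get the value of the match (the whole style element, tags included)
newString = expMatchResult.Value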

newString will contain the matched style element, along the lines of: <style>.style1{height:30px;width:100px}</style>
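The regular-expression approach can be taken one step further when only the CSS text is wanted, or when the document has more than one style element. The following is a minimal sketch, not a definitive implementation. The module name StyleRegexSketch, the function ExtractStyleCss and the pattern itself are assumptions made for illustration, and the pattern expects well-formed, non-nested style tags.

```vbnet
Imports System
Imports System.Collections.Generic
Imports System.Text.RegularExpressions
Imports Microsoft.VisualBasic

Module StyleRegexSketch

    ' Hypothetical helper: returns the CSS text inside every <style> element in html.
    Function ExtractStyleCss(ByVal html As String) As List(Of String)
        ' Singleline lets "." match line breaks, so multi-line style blocks are found;
        ' IgnoreCase also accepts <STYLE> and <Style>.
        Dim styleRegex As New Regex("<style[^>]*>(.*?)</style>",
                                    RegexOptions.IgnoreCase Or RegexOptions.Singleline)
        Dim results As New List(Of String)()

        For Each m As Match In styleRegex.Matches(html)
            ' Groups(0) is the whole element; Groups(1) is only the text between the tags.
            results.Add(m.Groups(1).Value.Trim())
        Next

        Return results
    End Function

    Sub Main()
        Dim htmlToParse As String =
            "<html><head><title>My Page</title>" & vbCrLf &
            "<style>" & vbCrLf &
            ".style1{height:30px;width:100px}" & vbCrLf &
            "</style></head>" & vbCrLf &
            "<body>This is the home page</body></html>"

        ' Prints the CSS text of each style element on its own line
        For Each css As String In ExtractStyleCss(htmlToParse)
            Console.WriteLine(css)
        Next
    End Sub

End Module
```

Reading Groups(1) instead of Value drops the surrounding tags, and Matches visits every style element rather than only the first. A regular expression does not understand HTML structure, so this suits simple, predictable markup. For irregular documents the HTMLDocumentClass approach is the safer choice.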

In both cases, newString ends up containing only the style element.

