Using regex in Classic ASP to get content of specific elements

I am loading some remote content and need to use regex to isolate the content of some tags.

 Set xmlhttp = CreateObject("MSXML2.ServerXMLHTTP")
 xmlhttp.open "GET", url, False   ' synchronous request
 xmlhttp.setRequestHeader "Content-Type", "application/x-www-form-urlencoded"
 xmlhttp.setRequestHeader "Accept-Language", "en-us"
 xmlhttp.send "x=hello"
 status = xmlhttp.status
 If Err.Number <> 0 Or status <> 200 Then
     If status = 404 Then
         Response.Write "[EFERROR]Page does not exist (404)."
     ElseIf status = 401 Then
         Response.Write "[EFERROR]Access denied (401)."
     ElseIf status >= 500 And status < 600 Then
         Response.Write "[EFERROR]500 Internal Server Error on remote site."
     Else
         Response.Write "[EFERROR]Server is down or does not exist."
     End If
 End If
 data = xmlhttp.responseText

I basically need to get the content of the <title>Here is the title</title> tag, plus the meta description, the meta keywords and some selected Open Graph meta data.

Finally, I need to get the content of the first <h1>Heading</h1> and the first <p>Paragraph</p>.

How can I parse the HTML data to get these things? Should I use regex?

Could I just specify the returned content as XML and then use node selection? Could you elaborate on how that might work? Thanks @DanielA.White


3 Answers

You may be able to use the .responseXML property to retrieve the content you want without using regex. Since you are looking for data inside <title>, <h1> and <p> tags, the document returned is probably HTML. If that HTML is also well-formed XML (XHTML, for example), MSXML may already have parsed it for you, and the result is accessible as soon as you get the response. For ordinary HTML that is not well-formed, .responseXML will most likely be empty.

So you could try this:

Dim objData
Set objData = xmlhttp.responseXML.selectSingleNode("//*[local-name() = 'title']")

If objData Is Nothing Then
    Response.Write "# no result #<br />"
Else
    Response.Write "title: " & objData.Text & "<br />"
End If

Note though, that this XPath expression may not be the most efficient way to query an XML document (in case you want to process large amounts of data).
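If .responseXML turns out to be empty, one option is to load the response text into a DOM document yourself and check whether it parses. The following is only a sketch: it assumes MSXML 6.0 is installed, and the sample variable stands in for xmlhttp.responseText.

```vbscript
Dim doc, node, sample

' Stand-in for xmlhttp.responseText (assumption: a small well-formed document)
sample = "<html><head><title>Here is the title</title></head>" & _
         "<body><h1>Heading</h1></body></html>"

Set doc = CreateObject("MSXML2.DOMDocument.6.0")
doc.async = False   ' parse synchronously

If doc.loadXML(sample) Then
    ' Parsed as XML: node selection works as in the snippet above
    Set node = doc.selectSingleNode("//*[local-name() = 'title']")
    If Not node Is Nothing Then
        Response.Write "title: " & Server.HTMLEncode(node.Text) & "<br />"
    End If
Else
    ' Not well-formed XML: report why, then fall back to string parsing
    Response.Write "not XML: " & Server.HTMLEncode(doc.parseError.reason) & "<br />"
End If
```

Most real-world HTML will end up in the Else branch, so treat this as a check rather than a guaranteed route.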

Use the Mid function combined with the InStr function. I built a function that uses InStr to find the position of the opening and closing tags, and Mid to extract the text between them:

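 Function GetInnerData(Data, TagOpen, TagClose)
   Dim OpenPos, ClosePos, StartPos
   GetInnerData = ""
   OpenPos = InStr(1, Data, TagOpen, 1)            ' 1 = case-insensitive
   If OpenPos = 0 Then Exit Function
   StartPos = OpenPos + Len(TagOpen)
   ClosePos = InStr(StartPos, Data, TagClose, 1)   ' look only after the opening tag
   If ClosePos > 0 Then GetInnerData = Trim(Mid(Data, StartPos, ClosePos - StartPos))
 End Function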

When you run the function like this, it returns My Title:

<%=GetInnerData("any text <title>My Title</title> any text","<title>","</title>")%>

In your case, you would call it like this:

 TitleData = GetInnerData(data,"<title>","</title>")

This will get the content of your <title> tag. Or:

 H1Data = GetInnerData(data,"<h1>","</h1>")

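This will get the content of your <h1> tag.

The InStr function returns the position of the first occurrence found in the data, so this function returns the content of the first matching tag, which is what you need. Note that the opening tag must match exactly, so a tag with attributes (for example <h1 class="x">) is not found by "<h1>".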
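If the tags may carry attributes, the same InStr/Mid idea can be stretched a little. The function below is a hypothetical variant, not part of the answer above: it looks for "<" plus the tag name, skips to the next ">", and then reads up to the closing tag.

```vbscript
' Hypothetical helper: content of the first <TagName ...> element, or "" if none.
' Caveat: the prefix search is naive, so "p" would also match "<pre" or "<param".
Function GetTagContent(Html, TagName)
    Dim openPos, startPos, closePos
    GetTagContent = ""

    ' Find the start of the opening tag, ignoring case
    openPos = InStr(1, Html, "<" & TagName, vbTextCompare)
    If openPos = 0 Then Exit Function

    ' Skip any attributes by jumping to the end of the opening tag
    startPos = InStr(openPos, Html, ">")
    If startPos = 0 Then Exit Function
    startPos = startPos + 1

    ' Read up to the matching closing tag
    closePos = InStr(startPos, Html, "</" & TagName & ">", vbTextCompare)
    If closePos > 0 Then
        GetTagContent = Trim(Mid(Html, startPos, closePos - startPos))
    End If
End Function

' Usage: returns "Heading"
Response.Write GetTagContent("<h1 class=""big"">Heading</h1>", "h1")
```

Any nested markup inside the element is returned as-is, so this works best for simple tags such as <title> and <h1>.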

I actually used this solution in the end, as it also solves the problem of tags having class names (or other attributes) in the markup.

Function GetFirstMatch(PatternToMatch, StringToSearch)
    Dim regEx, CurrentMatch, CurrentMatches

    Set regEx = New RegExp
    regEx.Pattern = PatternToMatch
    regEx.IgnoreCase = True
    regEx.Global = True
    regEx.MultiLine = True
    Set CurrentMatches = regEx.Execute(StringToSearch)

    GetFirstMatch = ""
    If CurrentMatches.Count >= 1 Then
        Set CurrentMatch = CurrentMatches(0)
        If CurrentMatch.SubMatches.Count >= 1 Then
            GetFirstMatch = CurrentMatch.SubMatches(0)
        End If
    End If
    Set regEx = Nothing
End Function

    title = clean_str(GetFirstMatch("<title[^>]*>([^<]+)</title>",data))
    firstpara = clean_str(GetFirstMatch("<p[^>]*>([^<]+)</p>",data))
    firsth1 = clean_str(GetFirstMatch("<h1[^>]*>([^<]+)</h1>",data))
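The question also asks for the meta description, keywords and Open Graph values, which the calls above do not cover. A similar regex approach might look like the sketch below. GetMetaContent is a hypothetical helper, not something from the original code, and it assumes that the name/property attribute appears before content inside the tag and that the content value contains no quote characters.

```vbscript
' Hypothetical helper: content attribute of the first <meta> tag whose
' AttrName attribute equals AttrValue, or "" if no such tag is found.
' AttrValue is inserted into the pattern as-is, so it must not contain
' regex metacharacters.
Function GetMetaContent(AttrName, AttrValue, Html)
    Dim re, matches
    Set re = New RegExp
    re.IgnoreCase = True
    re.Global = False   ' only the first match is needed
    re.Pattern = "<meta[^>]*\b" & AttrName & "\s*=\s*[""']" & AttrValue & _
                 "[""'][^>]*\bcontent\s*=\s*[""']([^""']*)[""']"

    GetMetaContent = ""
    Set matches = re.Execute(Html)
    If matches.Count > 0 Then GetMetaContent = matches(0).SubMatches(0)
    Set re = Nothing
End Function

' Sample input standing in for the data variable (assumption)
Dim sampleHtml
sampleHtml = "<meta name=""description"" content=""A short summary"">" & _
             "<meta property=""og:title"" content=""Shared title"">"

Response.Write GetMetaContent("name", "description", sampleHtml) & "<br />"
Response.Write GetMetaContent("property", "og:title", sampleHtml) & "<br />"
```

Pages that put content before name or property would need a second pattern with the attributes swapped.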

