Using regex in Classic ASP to get content of specific elements

I am loading some remote content and need to use regex to isolate the content of some tags.

' Assumes On Error Resume Next is in effect; otherwise a failed request aborts the script before the Err.Number check.
Set xmlhttp = CreateObject("MSXML2.ServerXMLHTTP")
xmlhttp.open "GET", url, False
xmlhttp.setRequestHeader "Content-Type", "application/x-www-form-urlencoded"
xmlhttp.setRequestHeader "Accept-Language", "en-us"
xmlhttp.send "x=hello"
status = xmlhttp.status
If Err.Number <> 0 Or status <> 200 Then
    If status = 404 Then
        Response.Write "[EFERROR]Page does not exist (404)."
    ElseIf status = 401 Then
        Response.Write "[EFERROR]Access denied (401)."
    ElseIf status >= 500 And status <= 600 Then
        Response.Write "[EFERROR]500 Internal Server Error on remote site."
    Else
        Response.Write "[EFERROR]Server is down or does not exist."
    End If
Else
    data = xmlhttp.responseText
    ' ... parse data here ...
End If

I basically need to get the content of the <title>Here is the title</title> tag, as well as the meta description, the meta keywords and some selected Open Graph meta data.

Finally, I need to get the content of the first <h1>Heading</h1> and the first <p>Paragraph</p>.

How can I parse the html data to get these things? Should I use regex?

regex asp-classic serverxmlhttp
asked May 28 '12 at 13:53 by Chris Dowdeswell

Comments:

Have you considered using an XML parser instead? (Daniel A. White, May 28 '12 at 13:55)

Could I just specify the returned content as XML and then use node selection? Could you elaborate on how that might work? Thanks, @DanielA.White. (Chris Dowdeswell, May 28 '12 at 14:02)

3 Answers

You may be able to use the .responseXML property to retrieve the content you want without using regex. Because you are looking for data inside <title>, <h1> and <p> tags, the document returned is probably HTML. If that HTML is also well-formed XML (for example XHTML), it may already have been parsed by the time you get the response, so you can query it directly.

So you could try this:

Dim objData
Set objData = xmlhttp.responseXML.selectSingleNode("//*[local-name() = 'title']")

If objData Is Nothing Then
    Response.Write "# no result #<br />"
Else
    Response.Write "title: " & objData.Text & "<br />"
End If

Note, though, that this XPath expression may not be the most efficient way to query an XML document if you need to process large amounts of data, because // visits every node in the document.
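If .responseXML comes back empty, one thing that might be worth trying is loading the response text into a DOM document yourself and checking whether it parses. The following is only a sketch under assumptions: it uses a hard-coded sampleXhtml string (hypothetical, standing in for xmlhttp.responseText) that is well-formed and has no DOCTYPE. Real pages with a DOCTYPE or HTML-only entities such as &nbsp; may fail to load as XML, in which case the Else branch reports why.

```vbscript
Dim doc, node, sampleXhtml

' Hypothetical sample input; in the real page this would be xmlhttp.responseText.
sampleXhtml = "<html xmlns=""http://www.w3.org/1999/xhtml"">" & _
              "<head><title>Here is the title</title></head>" & _
              "<body><h1>Heading</h1><p>Paragraph</p></body></html>"

Set doc = CreateObject("MSXML2.DOMDocument.6.0")
doc.async = False   ' parse synchronously so the result is available right away

If doc.loadXML(sampleXhtml) Then
    ' local-name() ignores the XHTML namespace, as in the answer above
    Set node = doc.selectSingleNode("//*[local-name() = 'title']")
    If Not node Is Nothing Then Response.Write "title: " & node.text & "<br />"
Else
    ' loadXML returns False when the text is not well-formed XML
    Response.Write "Not well-formed XML: " & doc.parseError.reason & "<br />"
End If
```

The same selectSingleNode call with 'h1' or 'p' in place of 'title' should return the first heading or paragraph, since selectSingleNode stops at the first match.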


answered Dec 12 '12 at 10:22 by Sander_P


Use the Mid function combined with the InStr function. I built a function that finds the position of each tag with InStr and then uses Mid to extract the text between them:

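 Function GetInnerData(Data, TagOpen, TagClose)
   Dim OpenPos, StartPos, ClosePos
   GetInnerData = ""                                  ' returned when either tag is missing
   OpenPos = InStr(1, Data, TagOpen, 1)               ' last argument 1 = case-insensitive compare
   If OpenPos = 0 Then Exit Function
   StartPos = OpenPos + Len(TagOpen)                  ' first character after the opening tag
   ClosePos = InStr(StartPos, Data, TagClose, 1)      ' only look for the closing tag after the opening tag
   If ClosePos > 0 Then GetInnerData = Trim(Mid(Data, StartPos, ClosePos - StartPos))
 End Function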

Called like this, it returns My Title:

<%=GetInnerData("any text <title>My Title</title> any text","<title>","</title>")%>

In your case, you would call it like this:

 TitleData = GetInnerData(data,"<title>","</title>")

This will get the content of your <title> tag. Or:

 H1Data = GetInnerData(data,"<h1>","</h1>")

This will get the content in your <h1> tag.

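The InStr function returns the position of the first occurrence in the data, so this function returns the content of the first matching tag. The opening tag has to match exactly, though: a tag with attributes, such as <h1 class="big">, is not found when you search for "<h1>".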
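A variation on the same Mid/InStr idea, in case the opening tag carries attributes such as class names or ids. This is only a sketch: GetTagContent is a hypothetical name that does not appear in the answer above, and it assumes that no ">" character occurs inside an attribute value.

```vbscript
' Hypothetical helper: returns the text inside the first <TagName ...> element,
' whether or not the opening tag has attributes. Returns "" if nothing is found.
Function GetTagContent(Data, TagName)
    Dim OpenPos, SearchFrom, NextChar, StartPos, ClosePos
    GetTagContent = ""
    SearchFrom = 1

    ' Find "<TagName" followed by ">" or whitespace, so "<p" does not match "<pre".
    Do
        OpenPos = InStr(SearchFrom, Data, "<" & TagName, vbTextCompare)
        If OpenPos = 0 Then Exit Function
        NextChar = Mid(Data, OpenPos + Len(TagName) + 1, 1)
        SearchFrom = OpenPos + 1
    Loop Until NextChar = ">" Or NextChar = " " Or NextChar = vbTab _
            Or NextChar = vbCr Or NextChar = vbLf

    ' The content starts after the ">" that ends the opening tag.
    StartPos = InStr(OpenPos, Data, ">")
    If StartPos = 0 Then Exit Function
    StartPos = StartPos + 1

    ClosePos = InStr(StartPos, Data, "</" & TagName & ">", vbTextCompare)
    If ClosePos > 0 Then GetTagContent = Trim(Mid(Data, StartPos, ClosePos - StartPos))
End Function

' Usage with made-up sample markup:
Response.Write GetTagContent("<h1 class=""big"">Heading</h1>", "h1") & "<br />"       ' writes Heading
Response.Write GetTagContent("<pre>x</pre><p id=""a"">Para</p>", "p") & "<br />"      ' writes Para
```

Like the original function, this returns whatever sits between the tags, so nested markup inside the element comes back as-is.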


edited May 28 '12 at 18:36, answered May 28 '12 at 18:03 by Control Freak


I actually used this solution in the end, as it also solves the problem of tags that have class names in them.

Function GetFirstMatch(PatternToMatch, StringToSearch)
    Dim regEx, CurrentMatch, CurrentMatches

    Set regEx = New RegExp
    regEx.Pattern = PatternToMatch
    regEx.IgnoreCase = True
    regEx.Global = True
    regEx.MultiLine = True
    Set CurrentMatches = regEx.Execute(StringToSearch)

    GetFirstMatch = ""
    If CurrentMatches.Count >= 1 Then
        Set CurrentMatch = CurrentMatches(0)
        If CurrentMatch.SubMatches.Count >= 1 Then
            GetFirstMatch = CurrentMatch.SubMatches(0)
        End If
    End If
    Set regEx = Nothing
End Function

' clean_str is a helper of my own that is not shown here
title = clean_str(GetFirstMatch("<title[^>]*>([^<]+)</title>", data))
firstpara = clean_str(GetFirstMatch("<p[^>]*>([^<]+)</p>", data))
firsth1 = clean_str(GetFirstMatch("<h1[^>]*>([^<]+)</h1>", data))
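The question also asks about the meta description, keywords and Open Graph tags, which the patterns above do not cover. One possible way to extend the same RegExp approach is sketched below. GetMetaContent is a hypothetical helper, not part of the solution above, and it makes two assumptions: the name= or property= attribute appears before content= inside the tag, and the attribute value passed in contains no regex metacharacters.

```vbscript
' Hypothetical helper: returns the content="..." value of the first <meta> tag
' whose name= or property= attribute equals AttrValue, or "" if none is found.
Function GetMetaContent(AttrValue, Html)
    Dim re, matches
    GetMetaContent = ""
    Set re = New RegExp
    re.IgnoreCase = True
    re.Global = False    ' only the first match is needed
    ' ([""']) captures the opening quote and \1 requires the same closing quote,
    ' so an apostrophe inside a double-quoted value does not end the match early.
    re.Pattern = "<meta[^>]+(?:name|property)\s*=\s*[""']" & AttrValue & _
                 "[""'][^>]*content\s*=\s*([""'])(.*?)\1"
    Set matches = re.Execute(Html)
    ' SubMatches(0) is the quote character, SubMatches(1) is the value
    If matches.Count > 0 Then GetMetaContent = matches(0).SubMatches(1)
    Set re = Nothing
End Function

' Usage with made-up sample markup:
Dim sampleHtml
sampleHtml = "<meta name=""description"" content=""A short summary"">" & _
             "<meta property=""og:title"" content=""My Page"" />"
Response.Write GetMetaContent("description", sampleHtml) & "<br />"   ' writes A short summary
Response.Write GetMetaContent("og:title", sampleHtml) & "<br />"      ' writes My Page
```

With the remote page, the calls would take data instead of sampleHtml, for example GetMetaContent("keywords", data). Pages that put content= before name= would need a second pattern with the two attributes swapped.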

answered Jun 6 '12 at 19:32 by Chris Dowdeswell


