Parsing web page content with Perl's HTML::TreeBuilder::XPath
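Original article: http://www.php-oa.com/2009/09/24/perl-html-tree-builder-xpath.html
Reposted here to study at leisure.
Perl has a huge number of powerful modules, so we no longer need to reinvent the wheel. HTML::TreeBuilder::XPath is one of them: it lets you query a web page much as you would query an XML document. I won't go through its usage in detail; see the example below, which fetches my site's ranking from alexa.com. It prints the following result: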
#perl test.pl
Your site's rank is: 199,954
An HTML::TreeBuilder::XPath example
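#!/usr/bin/perl
use strict;
use warnings;
use LWP::Simple;
use HTML::TreeBuilder::XPath;

my $url  = "http://www.alexa.com/siteinfo/www.php-oa.com";
my $html = get( $url );
die "Could not fetch $url\n" unless defined $html;

# Build the element tree from the downloaded HTML
my $tree = HTML::TreeBuilder::XPath->new;
$tree->parse( $html );
$tree->eof;
#$tree->dump;

# Select every div under body whose class attribute matches "data down"
my $items = $tree->findnodes( '/html/body/descendant::div[@class[.=~/data down/]]' );
for my $item ( $items->get_nodelist() ) {
    # The rank is taken from the second child of the div; skip the node if it is missing
    my $srt = eval { $item->content->[1] };
    next unless defined $srt;
    print "Your site's rank is: " . $srt . "\n";
}

# Free the tree explicitly once we are done with it
$tree->delete;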
The most troublesome part of using the module is the XPath syntax itself. A brief introduction follows.
A brief introduction to XPath syntax
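XPath essentially describes a path within an XML document in a way that resembles a directory tree. For example, "/" is the separator between parent and child levels. The leading "/" denotes the root node of the document (note that this means the document itself, not the outermost tag node). For an HTML file, for instance, the outermost node is "/html".
Likewise, ".." and "." denote the parent node and the current node, respectively.
An XPath expression does not necessarily return a single node; it returns every node that matches. For example, "/html/head/script" in an HTML document selects all script nodes inside head.
To narrow the match, you often need to add a filter condition, which is written between "[" and "]". For example, "/html/body/div[@id='main']" in an HTML document selects the div node inside body whose id is main.
Here @id refers to the id attribute; @name, @value, @href, @src, @class and so on can be used in the same way.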
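As a minimal sketch of the expressions described above, the following script runs them against a small made-up HTML snippet. The page content, the file names a.js and b.js, and the variable names are assumptions for illustration only and do not come from the original article. It relies only on findnodes and findvalue from HTML::TreeBuilder::XPath, plus attr and as_text from HTML::Element.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use HTML::TreeBuilder::XPath;

# Hypothetical sample page, invented for this illustration
my $html = <<'HTML';
<html>
  <head>
    <script src="a.js"></script>
    <script src="b.js"></script>
  </head>
  <body>
    <div id="main"><a href="/about">About</a></div>
    <div id="side">Sidebar</div>
  </body>
</html>
HTML

my $tree = HTML::TreeBuilder::XPath->new;
$tree->parse($html);
$tree->eof;

# A path can match several nodes: every script element inside head
my @scripts = $tree->findnodes('/html/head/script');
my @srcs    = map { $_->attr('src') } @scripts;

# A filter in [ ] narrows the match to the div whose id is "main"
my ($main)    = $tree->findnodes(q{/html/body/div[@id='main']});
my $main_text = $main->as_text;

# An attribute can be the last step of a path, here the href of the link
my $href = $tree->findvalue(q{/html/body/div[@id='main']/a/@href});

print "script src: $_\n" for @srcs;
print "main div text: $main_text\n";
print "link href: $href\n";

# Free the tree once the values have been copied out
$tree->delete;
```

In list context findnodes returns the matching nodes directly, which is why no get_nodelist call is needed here; in scalar context it returns a node set object, as in the alexa.com example.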
The function text() returns the text that a node contains. For example, in <div>hello<p>world</p></div>, using "div[