Using Perl's HTML::TreeBuilder::XPath to parse web page content
Original article: http://www.php-oa.com/2009/09/24/perl-html-tree-builder-xpath.html
Reposted here to study at leisure.
Perl has a huge number of powerful modules that save us from reinventing the wheel. HTML::TreeBuilder::XPath is one of them: it lets you query a web page the way you would query XML. I won't go into detail on how to use it; see the example below, which fetches my site's ranking from alexa.com. It prints the following result:
#perl test.pl
Your site's rank is: 199,954
An HTML::TreeBuilder::XPath example
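#!/usr/bin/perl
use strict;
use warnings;
use LWP::Simple;
use HTML::TreeBuilder::XPath;

# Fetch the Alexa page for the site.
my $url  = "http://www.alexa.com/siteinfo/www.php-oa.com";
my $html = get($url);
die "Could not fetch $url\n" unless defined $html;

# Build the tree from the downloaded HTML.
my $tree = HTML::TreeBuilder::XPath->new;
$tree->parse($html);
$tree->eof;
#$tree->dump;

# Select every div under body whose class attribute matches /data down/.
my $items = $tree->findnodes('/html/body/descendant::div[@class[.=~/data down/]]');
for my $item ( $items->get_nodelist() ) {
    # The rank is the second child of the div; skip nodes that have none.
    my $srt = eval { $item->content->[1] };
    next unless defined $srt;
    print "Your site's rank is:" . $srt . "\n";
}
$tree->delete;    # release the tree's memory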
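The most troublesome part of using it is the XPath syntax. A brief introduction follows.
A brief introduction to XPath syntax
XPath describes a path within an XML document in a way that resembles a directory tree. For example, "/" separates one level from the next. The first "/" stands for the document's root node (note: not the outermost tag node, but the document itself). For an HTML file, the outermost node is therefore "/html".
Likewise, ".." and "." denote the parent node and the current node respectively.
An XPath expression does not necessarily return a single node; it returns every node that matches. For example, in an HTML document "/html/head/script" selects all script nodes inside head.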
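The path rules above can be tried offline with a small sketch. The sample HTML and the variable names are made up for illustration; only the module calls (new, parse_content, findnodes, attr, tag) come from HTML::TreeBuilder::XPath and HTML::Element.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use HTML::TreeBuilder::XPath;

# Hypothetical sample page, parsed from a string so no network access is needed.
my $html = <<'HTML';
<html>
  <head>
    <title>demo</title>
    <script src="a.js"></script>
    <script src="b.js"></script>
  </head>
  <body><p>hello</p></body>
</html>
HTML

my $tree = HTML::TreeBuilder::XPath->new;
$tree->parse_content($html);

# An absolute path returns every matching node, here both script elements.
my @scripts = $tree->findnodes('/html/head/script');
my @srcs    = map { $_->attr('src') } @scripts;
print scalar(@scripts), " script nodes: @srcs\n";

# ".." steps from a node to its parent: the parent of title is head.
my ($parent)   = $tree->findnodes('/html/head/title/..');
my $parent_tag = $parent->tag;
print "parent of title: $parent_tag\n";

$tree->delete;    # release the tree's memory
```

In list context findnodes returns a plain list of nodes, which is why the result can be assigned straight to an array here instead of going through get_nodelist.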
To narrow the match you usually need filter conditions, which are written between "[" and "]". For example, in an HTML document "/html/body/div[@id='main']" selects the div node inside body whose id is main.
Here @id refers to the id attribute; @name, @value, @href, @src, @class and so on can be used in the same way.
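A minimal sketch of attribute filters, again on an invented HTML snippet (the ids, class name and links are assumptions made for the example, not taken from any real page):

```perl
#!/usr/bin/perl
use strict;
use warnings;
use HTML::TreeBuilder::XPath;

# Hypothetical sample page with two divs that differ only by id.
my $html = <<'HTML';
<html><body>
  <div id="nav"><a href="/home">Home</a></div>
  <div id="main">
    <a href="/a.html" class="ext">A</a>
    <a href="/b.html">B</a>
  </div>
</body></html>
HTML

my $tree = HTML::TreeBuilder::XPath->new;
$tree->parse_content($html);

# Filter on the id attribute: only links inside the div with id "main" match.
my @links = $tree->findnodes(q{/html/body/div[@id='main']/a});
my @hrefs = map { $_->attr('href') } @links;
print "links in main: @hrefs\n";

# A filter can test any attribute, and the path can end on an attribute
# to select its value; findvalue returns that value as a string.
my $ext = $tree->findvalue(q{//a[@class='ext']/@href});
print "link with class ext: $ext\n";

$tree->delete;    # release the tree's memory
```

The q{...} quoting is only there so the single quotes inside the XPath expression do not need escaping.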
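The text() function returns the text that a node contains. For example, in <div>hello<p>world</p></div>, using "div[
Related documents:
A few days ago I was working on a project that needed a WinForm control for editing and displaying HTML. .NET itself does not provide one. I searched Google and Baidu in Chinese and found no suitable .NET control, so I reluctantly searched Google in English and did find one, named .NET Win HTML Editor Control 3.2. I downloaded it, set up the environment and tried it out. The free version offers a full-featured trial; the only drawback is that the editing area has a registration ......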
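Double quote ("): &quot; or &#34;
Single quote ('): &apos; or &#39; (the entity name does not work in IE)
Ampersand (&): &amp; or &#38;
Less than (<): &lt; or &#60;
Greater than (>): &gt; or &#62;
Space (non-breaking): &nbsp; or &#160;
Pound sterling (£): &pound; or &#163;
Yuan (¥): &yen; or &#165;
Broken bar (¦): &brvbar; or & ......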
Dim objReg,objMatches,objMatch
Set objReg=new RegExp
objReg.Global=True
objReg.IgnoreCase=True
objReg.Pattern="<('[^']*'|""[^""]*""|[^'"">])*?>"
Set objMatches=objReg.Execute(strInput) ' strInput is a placeholder for the string to search
For Each objMatch In objMatches
' HTML tag found: objMatch.Value
Next
Set objMatches=Nothing
Set objRe ......
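Common escape characters in XML and HTML
Both XML and HTML have some special characters that cannot be used directly. If you must use them, use the corresponding escape sequences instead.
If you use a character such as "<" inside an XML document, the parser will raise an error, because it treats the character as the start of a new element.
So you should not write code like the following:
<message&g ......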
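Original article: http://bbs.chinaunix.net/viewthread.php?tid=1316204
Two days ago, while studying how to analyze web pages with the HTML::TreeBuilder module, I came across an article and translated it along the way; I am posting it here to share. My writing is not great and my English is limited, so please bear with it.
Original article: http://www.perl.com/pub/a/2006/01/19/analyzing_html.html?page=1
The background of the article is that the author was teaching web page ......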