Parsing Web Page Content with Perl's HTML::TreeBuilder::XPath
Original article: http://www.php-oa.com/2009/09/24/perl-html-tree-builder-xpath.html
Reposted here to study at leisure.
Perl has a huge number of powerful modules, so we rarely need to reinvent the wheel. HTML::TreeBuilder::XPath is one of them: it lets you query a web page with XPath, much as you would an XML document. I won't go through its usage in detail; the example below fetches my site's ranking from alexa.com. Running it prints the following:
#perl test.pl
Your site rank is: 199,954
An HTML::TreeBuilder::XPath example
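#!/usr/bin/perl
use strict;
use warnings;
use LWP::Simple;
use HTML::TreeBuilder::XPath;

my $url  = "http://www.alexa.com/siteinfo/www.php-oa.com";
my $html = get($url);
die "Could not fetch $url\n" unless defined $html;

# Build the parse tree from the downloaded page.
my $tree = HTML::TreeBuilder::XPath->new;
$tree->parse($html);
$tree->eof;
#$tree->dump;    # uncomment to inspect the parsed tree

# Select every div whose class attribute matches the regex /data down/.
# The =~ operator is a regex extension of the XPath engine, not standard XPath.
my $items = $tree->findnodes('/html/body/descendant::div[@class[.=~/data down/]]');
for my $item ( $items->get_nodelist() ) {
    # The rank is the second child of the div; skip nodes that lack it.
    my $srt = eval { $item->content->[1] };
    next unless defined $srt;
    print "Your site rank is: " . $srt . "\n";
}
$tree->delete;    # free the memory held by the tree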
The trickiest part of using the module is the XPath syntax itself. A brief introduction follows.
A brief introduction to XPath syntax
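XPath describes a path within an XML document in much the same way as a path in a directory tree, using "/" to separate one level from the next. The leading "/" denotes the document's root node (note: this is the document itself, not its outermost tag). In an HTML file, for example, the outermost element is "/html".
Likewise, ".." and "." denote the parent node and the current node respectively.
An XPath expression does not necessarily return a single node; it returns every node that matches. For example, "/html/head/script" on an HTML document returns all the script nodes inside head.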
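The path rules above can be tried on a small inline page. This is only a minimal sketch: the sample HTML and the file names a.js and b.js are made up for illustration, and it assumes HTML::TreeBuilder::XPath is installed from CPAN.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use HTML::TreeBuilder::XPath;

# Hypothetical sample page, used only for illustration.
my $html = <<'HTML';
<html>
<head>
<title>demo</title>
<script src="a.js"></script>
<script src="b.js"></script>
</head>
<body><p>hi</p></body>
</html>
HTML

my $tree = HTML::TreeBuilder::XPath->new;
$tree->parse_content($html);

# An absolute path returns every matching node, not just the first one.
my @scripts = $tree->findnodes('/html/head/script');
my @srcs    = map { $_->attr('src') } @scripts;
print scalar(@scripts), " script nodes: @srcs\n";

# ".." steps up to the parent; here it leads from <title> back to <head>.
my ($parent)   = $tree->findnodes('/html/head/title/..');
my $parent_tag = $parent->tag;
print "parent of title: $parent_tag\n";

$tree->delete;    # free the memory held by the tree
```

In list context findnodes returns the matching elements directly, so the usual HTML::Element methods such as attr and tag can be called on each result.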
To narrow the match, you usually add a filter condition, written between "[" and "]". For example, "/html/body/div[@id='main']" on an HTML document selects the div inside body whose id is main.
Here @id refers to the id attribute; @name, @value, @href, @src, @class and so on can be used in the same way.
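As a rough illustration of filtering on an attribute, the sketch below selects the div with id main and then reads attributes from it and from its links. The sample markup, ids, and URLs are invented for the example; only the XPath expression comes from the text above.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use HTML::TreeBuilder::XPath;

# Hypothetical sample page, used only for illustration.
my $html = <<'HTML';
<html><body>
<div id="sidebar"><a href="/about">About</a></div>
<div id="main" class="content"><a href="/a">A</a><a href="/b">B</a></div>
</body></html>
HTML

my $tree = HTML::TreeBuilder::XPath->new;
$tree->parse_content($html);

# The [@id='main'] predicate keeps only the div whose id attribute is "main".
my ($main) = $tree->findnodes(q{/html/body/div[@id='main']});
my $class  = $main->attr('class');
print "class of the main div: $class\n";

# Continue the path below the filtered div and read each link's href.
my @links = $tree->findnodes(q{/html/body/div[@id='main']/a});
my @hrefs = map { $_->attr('href') } @links;
print "links in main: @hrefs\n";

$tree->delete;    # free the memory held by the tree
```

The link in the sidebar div is not returned, because the predicate is applied before the path descends to the a elements.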
The text() function returns the text that a node contains. For example, in <div>hello<p>world</p></div>, using "div[
Related documents:
HTML Character Entities
Some characters have a special meaning in HTML. The less-than sign <, for example, marks the start of an HTML tag, so it is not displayed in the page we finally see. What if we want the page to show a less-than sign?
This is where HTML character entities come in.
A character entity consists of three parts: the first part ......
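In Perl, one way to produce and read such entities is the HTML::Entities module from CPAN. The excerpt above does not mention it, so treat this as an assumed illustration rather than the method the article goes on to describe.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use HTML::Entities qw(encode_entities decode_entities);

# Encoding replaces characters that are special in HTML with entities,
# so the text can be shown literally on a page.
my $encoded = encode_entities('1 < 2 & 3 > 2');
print "$encoded\n";    # 1 &lt; 2 &amp; 3 &gt; 2

# Decoding accepts named, decimal, and hexadecimal forms of the same character.
my $decoded = decode_entities('&lt; &#60; &#x3C;');
print "$decoded\n";    # < < <
```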
Saving a Web Page as HTML and MHT Files with the TWebBrowser Component
1. Saving as an HTML file
uses ActiveX;
...
procedure WB_SaveAs_HTML(WB : TWebBrowser; const FileName : string) ;
var
PersistStream: IPersistStreamInit;
Stream: IStream;
FileStream: TFileStream;
begin
if not Assigned(WB. ......
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<meta content="text/html; charset=gb2312" http-equiv="Content-Type" />
<title>Simple test page</title> ......
Ridge border (width 10, red)
Group box   Code
<fieldset style="border:10px ridge #FF0000; padding:2px; width:500">
<legend>Group box</legend>
</fieldset>
Groove border
Group box   Code
<fieldset style="border:10px groove #FF0000; padding:2px; width:500">
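<legend>Group box</legen ......
Original article: http://bbs.chinaunix.net/viewthread.php?tid=1316204
The other day, while looking into analyzing web pages with the HTML::TreeBuilder module, I came across an article and translated it along the way; I am posting it here to share. My writing is not great and my English is limited, so please bear with it.
Original article: http://www.perl.com/pub/a/2006/01/19/analyzing_html.html?page=1
The background to the article is that the author, while teaching web ......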