Using Perl's HTML::TreeBuilder::XPath to Parse Web Page Content
Original article: http://www.php-oa.com/2009/09/24/perl-html-tree-builder-xpath.html
Reposted here to study at leisure.
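Perl has a huge number of powerful modules, so we no longer need to reinvent the wheel. HTML::TreeBuilder::XPath is one of them: it lets you parse a web page much as you would an XML document. I won't go through its usage in detail; see the example below, which fetches my site's ranking from alexa.com. It prints the following result: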
# perl test.pl
Your site's rank is: 199,954
An HTML::TreeBuilder::XPath Example
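#!/usr/bin/perl
use strict;
use warnings;
use LWP::Simple;
use HTML::TreeBuilder::XPath;

my $url  = "http://www.alexa.com/siteinfo/www.php-oa.com";
my $html = get($url);
die "Could not fetch $url\n" unless defined $html;

# Build the parse tree from the downloaded page.
my $tree = HTML::TreeBuilder::XPath->new;
$tree->parse($html);
$tree->eof;
#$tree->dump;    # uncomment to inspect the parsed tree

# Select every div under body whose class attribute matches /data down/.
# The =~ regex match is not part of standard XPath 1.0.
my $items = $tree->findnodes('/html/body/descendant::div[@class[.=~/data down/]]');
for my $item ( $items->get_nodelist() ) {
    # The rank is the second child of the div; skip divs that have none.
    my $rank = eval { $item->content->[1] };
    next unless defined $rank;
    print "Your site's rank is: " . $rank . "\n";
}
$tree->delete;    # release the memory held by the tree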
The trickiest part of using it is the XPath syntax. A brief introduction follows.
A Brief Introduction to XPath Syntax
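XPath describes paths in an XML document in a way that resembles a directory tree. For example, "/" separates one level from the next. The first "/" denotes the root node of the document (note: this is not the outermost tag node, but the document itself). For an HTML file, for instance, the outermost node would be "/html".
Likewise, ".." and "." denote the parent node and the current node, respectively.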
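As a minimal sketch of these path rules, the snippet below parses a small inline page and walks it with absolute and relative paths. The HTML string and the variable names are made up for illustration; it assumes HTML::TreeBuilder::XPath is installed.

```perl
use strict;
use warnings;
use HTML::TreeBuilder::XPath;

# Hypothetical page used only for this illustration.
my $html = '<html><head><title>Demo</title></head>'
         . '<body><div><p>first</p></div></body></html>';

my $tree = HTML::TreeBuilder::XPath->new;
$tree->parse($html);
$tree->eof;

# The leading "/" is the document itself; "/html" is its outermost element.
my ($outer) = $tree->findnodes('/html');

# Each further "/" steps down one level in the tree.
my ($p) = $tree->findnodes('/html/body/div/p');

# Relative to the <p> element: ".." is its parent, "." is the element itself.
my ($parent) = $p->findnodes('..');
my ($self)   = $p->findnodes('.');

# Copy the tag names into plain scalars before the tree is deleted.
my $outer_tag  = $outer->tag;
my $parent_tag = $parent->tag;
my $self_tag   = $self->tag;
print "$outer_tag $parent_tag $self_tag\n";

$tree->delete;
```

Calling findnodes on an element rather than on the tree makes that element the context node, which is what gives "." and ".." their meaning here.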
An XPath expression does not necessarily return a single node; it returns every node that matches. For example, using "/html/head/script" on an HTML document retrieves all the script nodes inside head.
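To make the "all matching nodes" behaviour concrete, here is a small sketch that runs the same path against a page with two script elements. The page content and the file names a.js and b.js are invented for the example.

```perl
use strict;
use warnings;
use HTML::TreeBuilder::XPath;

# Hypothetical page with two script elements in head.
my $html = '<html><head>'
         . '<script src="a.js"></script>'
         . '<script src="b.js"></script>'
         . '</head><body><p>text</p></body></html>';

my $tree = HTML::TreeBuilder::XPath->new;
$tree->parse($html);
$tree->eof;

# In list context findnodes returns every matching node, in document order.
my @scripts = $tree->findnodes('/html/head/script');
my @sources = map { $_->attr('src') } @scripts;

print scalar(@scripts), " script nodes: @sources\n";

$tree->delete;
```

If only the first match is wanted, assigning to a one-element list such as `my ($first) = $tree->findnodes(...)` is a common idiom.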
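To narrow the selection, you often need to add filter conditions. A filter is written by enclosing the condition in "[" and "]". For example, using "/html/body/div[@id='main']" on an HTML document retrieves the div node inside body whose id is main.
Here @id denotes the id attribute; @name, @value, @href, @src, @class and so on can be used in the same way.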
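The following sketch applies such attribute filters to a small inline page. The markup, the attribute values and the variable names are assumptions chosen for the example, not part of the original article.

```perl
use strict;
use warnings;
use HTML::TreeBuilder::XPath;

# Hypothetical page with two divs and a few links.
my $html = '<html><body>'
         . '<div id="header"><a href="/home" class="nav">Home</a></div>'
         . '<div id="main"><a name="top">Top</a>'
         . '<a href="/about" class="nav">About</a></div>'
         . '</body></html>';

my $tree = HTML::TreeBuilder::XPath->new;
$tree->parse($html);
$tree->eof;

# Keep only the div whose id attribute equals "main".
my ($main)  = $tree->findnodes(q{/html/body/div[@id='main']});
my $main_id = $main->attr('id');

# [@href] with no comparison keeps elements that have an href attribute at all.
my @links = $tree->findnodes('//a[@href]');
my @hrefs = map { $_->attr('href') } @links;

# Any attribute can be compared the same way, for example @class.
my @nav = $tree->findnodes(q{//a[@class='nav']});

print "main div id: $main_id\n";
print "links with href: @hrefs\n";
print "nav links: ", scalar(@nav), "\n";

$tree->delete;
```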
The text() function returns the text contained in a node. For example, in <div>hello<p>world</p></div>, using "div[
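A minimal sketch of text() on the same fragment is shown below. The path expressions are illustrative choices, not a reconstruction of the truncated example above, and the comparison with as_text is added only for contrast.

```perl
use strict;
use warnings;
use HTML::TreeBuilder::XPath;

# The fragment from the text; the parser adds html/head/body around it.
my $html = '<div>hello<p>world</p></div>';

my $tree = HTML::TreeBuilder::XPath->new;
$tree->parse($html);
$tree->eof;

# text() selects only the text nodes that are direct children of the
# preceding step. Appending '' forces the result object into a plain string.
my $div_text = $tree->findvalue('//div/text()') . '';
my $p_text   = $tree->findvalue('//div/p/text()') . '';

# as_text, by contrast, collects the text of all descendants of the element.
my ($div)    = $tree->findnodes('//div');
my $all_text = $div->as_text;

print "div text(): $div_text\n";
print "p text():   $p_text\n";
print "as_text:    $all_text\n";

$tree->delete;
```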