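Python Libraries in Detail: Networking (2)
Yesterday I tried using the HTMLParser class to parse web pages, but the results were not ideal. In any case, I'll write down the process first, in the hope that later readers can build on it and solve the problems I ran into.
I wrote two solutions, though each one only works for a specific website. Here I mainly describe parsing the BBC home page, www.bbc.co.uk, and the NetEase home page, www.163.com.
For the BBC:
This one is much simpler, probably because the page's encoding is fairly standard.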
import html.parser
import urllib.request
class parseHtml(html.parser.HTMLParser):
    def handle_starttag(self, tag, attrs):
        print("Encountered a {} start tag".format(tag))
    def handle_endtag(self, tag):
        print("Encountered a {} end tag".format(tag))
    def handle_charref(self,name):
        print("charref")
    def handle_entityref(self,name):
        print("entityref")
    def handle_data(self,data):
        print("data")
    def handle_comment(self,data):
        print("comment")
    def handle_decl(self,decl):
        print("decl")
    def handle_pi(self,decl):
        print("pi")
# Start reading here. The subclass above is simple: it just overrides the parent class's handler methods.
# Save the BBC page in binary write mode. This was covered in the previous post (http://blog.csdn.net/xiadasong007/archive/2009/09/03/4516683.aspx), so I won't repeat it here.
file=open("bbc.html",'wb') #it's 'wb',not 'w'
url=urllib.request.urlopen("http://www.bbc.co.uk/")
while(1):
    line=url.readline()
    if len(line)==0:
        break
    file.write(line)
file.close() # flush to disk before reopening the file for reading
# Create a parser object
pht=parseHtml()
# For this site I open the file as 'utf-8', otherwise an error occurs; other sites may not need this. UTF-8 is a Unicode encoding.
file=open("bbc.html",encoding='utf-8',mode='r')
# Process the page with feed
while(1):
    line=
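The listing breaks off at the feed loop. Below is a minimal sketch of how such a loop might be completed, assuming each line of the saved file is simply passed to HTMLParser.feed. The names TagCollector, feed_file and demo are hypothetical and not from the original code, and a small temporary file stands in for bbc.html so the sketch runs without network access.

```python
import html.parser
import os
import tempfile


class TagCollector(html.parser.HTMLParser):
    """Hypothetical parser that records start tags instead of printing them."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)


def feed_file(parser, path, encoding="utf-8"):
    """Feed a saved page to the parser one line at a time."""
    with open(path, encoding=encoding, mode="r") as f:
        for line in f:
            parser.feed(line)
    parser.close()  # process any markup still buffered by the parser


def demo():
    # Write a tiny stand-in page (an assumption, in place of bbc.html).
    with tempfile.NamedTemporaryFile(
        "w", suffix=".html", encoding="utf-8", delete=False
    ) as tmp:
        tmp.write("<html><body>\n<p>hello</p>\n</body></html>\n")
        path = tmp.name
    try:
        collector = TagCollector()
        feed_file(collector, path)
        return collector.tags
    finally:
        os.remove(path)


if __name__ == "__main__":
    print(demo())
```

Iterating over the file object replaces the while(1)/readline pattern, and the with block closes the file even if feed raises. The same feed_file call should work with the parseHtml class above, since it only relies on feed and close.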