A Detailed Walkthrough of a Python Crawler Program
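#!/usr/bin/python
The shebang ("magic characters") line tells the system to run this script with python.

from sys import argv
The sys module exposes parameters tied to the Python interpreter and its environment; argv is the list of command-line arguments passed to the script.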
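As a small illustration of how argv is usually consumed, the sketch below picks out the first argument after the script name. The function name start_url_from, and the idea that this argument is a start URL, are assumptions made for the example; they do not come from the original script.

```python
def start_url_from(argv, default=None):
    """Return the first argument after the script name, or `default` if absent."""
    # argv[0] is the script name; the real arguments start at index 1.
    return argv[1] if len(argv) > 1 else default
```

In a script this would typically be called as start_url_from(sys.argv) after importing sys.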
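from os import makedirs,unlink,sep
The os module mainly provides the functions needed to work with system paths, rename files, and delete files.
makedirs creates directories recursively. Suppose we want to create a new directory, /python/HTML/crawl, but none of these three folders exists yet. With the mkdir command we would need three calls to finish the job, whereas a single call to os.makedirs creates the whole directory tree.
os.makedirs(os.path.join(os.environ["HOME"],"python","HTML","crawl"))
os.unlink(path) deletes the file at path; it is the same as remove().
sep (os.sep) is the character the system uses to separate the components of a path name.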
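The three names imported from os can be tried out safely in a temporary directory. This is only a sketch: the temporary root stands in for $HOME, and the file name index.html is an assumption chosen for the example.

```python
import os
import tempfile

# Work inside a throwaway directory so the example leaves nothing behind.
with tempfile.TemporaryDirectory() as root:
    target = os.path.join(root, "python", "HTML", "crawl")

    # os.mkdir() would fail here because "python" and "HTML" do not exist yet;
    # os.makedirs() creates every missing level in one call.
    os.makedirs(target)
    created = os.path.isdir(target)

    # Create a file, then delete it again; os.unlink() behaves like os.remove().
    page = os.path.join(target, "index.html")
    with open(page, "w") as handle:
        handle.write("<html></html>")
    os.unlink(page)
    removed = not os.path.exists(page)

    # os.sep is the separator that os.path.join() placed between components.
    relative = target[len(root):]
    parts = relative.strip(os.sep).split(os.sep)
```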
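from os.path import dirname,exists,isdir,splitext
These os.path functions work on path names: dirname extracts the directory part of a path, exists and isdir are file tests (exists checks whether a path exists, isdir whether it is a directory), and splitext separates the file name from its extension, giving two parts: the path with the file name, and the suffix.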
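A short sketch of the four os.path functions follows. The sample path is an assumption for illustration, and join is imported only to build that path portably.

```python
import tempfile
from os.path import dirname, exists, isdir, join, splitext

path = join("python", "HTML", "crawl", "index.html")

folder = dirname(path)        # everything before the last separator
root, ext = splitext(path)    # ext keeps its leading dot

with tempfile.TemporaryDirectory() as tmp:
    tmp_exists = exists(tmp)                           # the directory is there
    tmp_is_dir = isdir(tmp)                            # and it is a directory
    missing = exists(join(tmp, "no-such-file.html"))   # never created
```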
from string import replace,find,lower
Imports from the string module the functions used to replace text within a string, search a string, and convert a string to lower case.
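The import above works only in Python 2. In Python 3 these functions are no longer part of the string module, and the same three operations are available as methods on str objects. A minimal sketch, using a made-up URL:

```python
url = "HTTP://WWW.Example.COM/Docs/Index.HTML"

lowered = url.lower()                   # lower-case copy of the string
position = lowered.find("://")          # index of the first match
not_found = lowered.find("ftp")         # -1 when the substring is absent
flattened = lowered.replace("/", "_")   # every "/" replaced by "_"
```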
from htmllib import HTMLParser
from urllib import urlretrieve
The urlretrieve() function downloads an entire HTML file to your local hard disk.
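To try urlretrieve() without network access, the sketch below fetches a local file through a file:// URL. It uses the Python 3 location of the function (urllib.request); the file names source.html and copy.html are assumptions for the example, and a real crawler would pass an http:// URL instead.

```python
import os
import tempfile
from pathlib import Path
from urllib.request import urlretrieve  # Python 2: from urllib import urlretrieve

with tempfile.TemporaryDirectory() as tmp:
    # Stand-in for a remote page: a local file addressed through a file:// URL.
    source = Path(tmp) / "source.html"
    source.write_text("<html><body>hello</body></html>", encoding="utf-8")

    # urlretrieve(url, filename) copies the resource to `filename` and
    # returns a (local_filename, headers) tuple.
    destination = os.path.join(tmp, "copy.html")
    saved_as, headers = urlretrieve(source.as_uri(), destination)

    with open(saved_as, encoding="utf-8") as handle:
        downloaded = handle.read()
```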
from urlparse import urlparse,urljoin
urlparse splits a URL into six components (scheme, network location, path, parameters, query, and fragment), while urljoin combines a base URL with a new URL.
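The sketch below shows both functions on a made-up example.com URL. It imports them from urllib.parse, which is where they live in Python 3; the variable names are chosen for the example only.

```python
from urllib.parse import urljoin, urlparse  # Python 2: from urlparse import ...

base = "http://www.example.com/docs/index.html;v=2?lang=en#top"
parts = urlparse(base)

# The six components, in order.
six = (parts.scheme, parts.netloc, parts.path,
       parts.params, parts.query, parts.fragment)

# urljoin() resolves a link found in a page against that page's own URL.
sibling = urljoin("http://www.example.com/docs/index.html", "faq.html")
rooted = urljoin("http://www.example.com/docs/index.html", "/about.html")
```

A relative link replaces only the last path segment of the base, while a link starting with "/" replaces the whole path.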
from formatter import DumbWriter,AbstractFormatter
The formatter module is mainly used for formatting text.
from cStringIO import StringIO
The cStringIO module provides StringIO, which lets the program work with a file held in memory.
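An in-memory file behaves like an ordinary open file, as the sketch below suggests. It uses io.StringIO, the Python 3 counterpart of cStringIO.StringIO; the text written to the buffer is invented for the example.

```python
from io import StringIO  # Python 2: from cStringIO import StringIO

buffer = StringIO()

# Write to the buffer exactly as if it were an open text file.
buffer.write("first line\n")
buffer.write("second line\n")

contents = buffer.getvalue()   # everything written so far, as one string

buffer.seek(0)                 # rewind before reading back
lines = buffer.readlines()
buffer.close()
```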
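
class Retriever:
The Retriever class is responsible for downloading pages from the web and analysing the links inside each document; a link that satisfies the download policy is added to the "to be processed" queue. Every page downloaded from the web has its own Retriever instance. Retriever has several methods that help it do its work: the constructor (__init__()), filename(), download(), and parseAndGetLinks().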
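The link-analysis step can be sketched roughly as follows. This is not the original parseAndGetLinks(): the script relies on htmllib.HTMLParser together with the formatter module, both of which exist only in Python 2, so the sketch swaps in html.parser from the Python 3 standard library. The names LinkCollector and collect_links and the sample page are hypothetical, and neither the download policy check nor the queue is shown.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collect the href targets of <a> tags, resolved against a base URL."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        # Only anchor tags carry the links we are interested in.
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                # Turn relative links into absolute URLs.
                self.links.append(urljoin(self.base_url, value))


def collect_links(html_text, base_url):
    """Parse html_text and return the absolute URLs of all links in it."""
    parser = LinkCollector(base_url)
    parser.feed(html_text)
    parser.close()
    return parser.links


page = ('<p><a href="faq.html">FAQ</a> '
        '<a href="/about.html">About</a> '
        '<a name="x">no link</a></p>')
found = collect_links(page, "http://www.example.com/docs/index.html")
```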
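    def __init__(self,url):
Defines the constructor. self is a reference to the current instance of the class, that is, the newly created object; the other parameter is url. The constructor instantiates a Retriever object and stores both the URL string and the matching file name returned by filename() as local attributes.
        self.url=url
Assigns the value of url to self.url.
        self.file=self.filename(url)
Calls filename() to obtain the local file name that corresponds to url and stores it in self.file.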
    def filename(self,url,deffile="index