Talking Python (4): Welcome, Little Sparrow
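Xiaobai is a Microsoft fan whose idol is Uncle Bill, for reasons everyone on Earth already knows. In his second year of university his “dropout plan” briefly came true, because he had failed too many courses. Of course, when the new term of his third year began and he faced public doubt, Xiaobai stood on a chair, looking exactly like Ximen Chuixue in the film 《大内密探零零发》 (Forbidden City Cop): “The world's richest man is not necessarily all that outstanding. That is just wishful thinking on the part of you ordinary folk.”
Every day Xiaobai wanders around the dormitory looking very “lonely”, muttering to himself: “The world is undergoing earth-shaking changes, and yet we go on living like [illegible in the source].” He always finishes with: “I am going to found the second Google!”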
In this lesson we will learn how a search engine is put together and write a small web crawler.
What are the parts of a search engine?
Ê×ÏȸÐлÕâÕÅͼµÄÔ×÷Õߣ¬Ö÷Òª»¹ÊÇÒª¸ÐлCountry¡£Í¨¹ýÕâÕÅͼ£¬ÎÒÃÇ¿ÉÒÔ¿´µ½£ºÊ×ÏÈ£¬ÍøÂçÖ©ÖëץȡÍøÒ³£¬½«ÍøÒ³ÄÚÈݼ°Á´½Ó´æµ½Êý¾Ý¿âÖС£È»ºóÓÉË÷ÒýÄ£¿é½¨Á¢¹Ø¼ü´Êµ½ÍøÖ·µÄË÷Òý£¬¹©¼ìË÷Ä£¿é²éѯ¡£¼ìË÷Ä£¿éÊǸù¾ÝÄãÊäÈëµÄÄÚÈÝ´ÓË÷ÒýÊý¾Ý¿âÌáÈ¡Êý¾Ý¡£Ö÷Ҫģ¿é½éÉÜÈçÏ£º
Íøҳץȡģ¿é£º°üÀ¨CrawlerºÍCrawler control£¬ÆäÖÐCrawler¸ºÔðץȡ²¢·ÖÎöÍøÒ³Á´½Ó£¬·µ»ØpageºÍurl£»Crawler control¸ºÔð¿ØÖÆ¡¢µ÷¶ÈCrawler¡£
ÍøÒ³´æ´¢Ä£¿é:Page cache£¬ÓÃÓÚ´æ´¢Crawlerץȡµ½µÄÍøÒ³ÄÚÈÝ¡£
Ë÷ÒýÄ£¿é:½¨Á¢¹Ø¼ü´Êµ½Á´½ÓºÍÍøÒ³µÄË÷Òý¡£
¼ìË÷Ä£¿é£º½«Òª²éѯµÄÄÚÈÝ·Ö½âΪÊʺϲéѯµÄ´Ê¡£
Óû§½Ó¿Ú£º½ÓÊÜÓû§ÊäÈ룬´«µÝµ½¼ìË÷Ä£¿é¡£
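To make the indexing and retrieval modules more concrete, here is a minimal sketch of the idea, not Sparrow's actual code. The names tokenize, build_index and search, the example.com URLs and the in-memory page_cache dictionary are all assumptions made for illustration. The tokenizer is deliberately naive and only handles ASCII words.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Split text into lowercase word tokens (a deliberately naive tokenizer)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(page_cache):
    """Indexing module: map each keyword to the set of URLs whose page contains it."""
    index = defaultdict(set)
    for url, page in page_cache.items():
        for word in tokenize(page):
            index[word].add(url)
    return index


def search(index, query):
    """Retrieval module: return the URLs that contain every term of the query."""
    terms = tokenize(query)
    if not terms:
        return set()
    results = set(index.get(terms[0], set()))
    for term in terms[1:]:
        results &= index.get(term, set())
    return results


# Hypothetical page cache, standing in for what the crawler would store.
page_cache = {
    "http://example.com/a": "Python makes writing a crawler easy",
    "http://example.com/b": "A search engine needs a crawler and an index",
}
index = build_index(page_cache)
print(sorted(search(index, "crawler index")))
```

Mapping each keyword to a set of URLs turns a multi-word query into a set intersection, which is the simplest form of the keyword-to-URL index described above.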
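Over the coming lessons we will use the Python we have learned to build a small search engine. It is called Sparrow, after the saying “a sparrow may be small, but it has all the vital organs”. Our sparrow will fly higher and higher as our knowledge grows, and it might even turn into a phoenix. For now, of course, it has not yet taken off.
Let's begin our search engine journey!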
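The first module we will study is the page fetching module (Crawler), also known as a web spider (Spider).
This module is implemented by the Crawler class. On initialization the class accepts the url passed in by the Crawler control module, and when it finishes it returns the page content, page, and the url links found in that page, link. The source code is as follows: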
import urllib.request  # fetches page content
import urllib.parse  # parses URLs
import re  # regular expressions
import queue  # queue operations
class Crawler(object):  # web spider
    ...  # the class body is cut off in the source
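The listing breaks off after the class line, so the following is not the author's code. It is a minimal sketch of a class matching the description above: it takes a url and returns the page text together with the links found in it. The method names fetch, extract_links and crawl, the regular expression and the example.com URLs are assumptions made for illustration.

```python
import re
import urllib.parse
import urllib.request


class Crawler(object):
    """Fetch one page and collect the links found in it."""

    # Naive pattern for href="..." or href='...'; a real crawler would use an HTML parser.
    LINK_PATTERN = re.compile(r'href\s*=\s*["\']([^"\']+)["\']', re.IGNORECASE)

    def __init__(self, url):
        self.url = url  # URL handed over by the crawler control module

    def fetch(self):
        """Download the page and return its text (performs a network request)."""
        with urllib.request.urlopen(self.url, timeout=10) as response:
            charset = response.headers.get_content_charset() or "utf-8"
            return response.read().decode(charset, errors="replace")

    def extract_links(self, page):
        """Return the absolute http(s) URLs linked from the page, without duplicates."""
        links = []
        for href in self.LINK_PATTERN.findall(page):
            # Resolve relative links against the page URL and drop any #fragment.
            link = urllib.parse.urljoin(self.url, href.strip())
            link = urllib.parse.urldefrag(link)[0]
            scheme = urllib.parse.urlparse(link).scheme
            if scheme in ("http", "https") and link not in links:
                links.append(link)
        return links

    def crawl(self):
        """Return (page, links) for the URL given at initialization."""
        page = self.fetch()
        return page, self.extract_links(page)


# Offline demonstration on a literal HTML snippet, so no network access is needed.
sample = (
    '<a href="/about.html">About</a> '
    '<a href="http://example.org/">External</a> '
    '<a href="mailto:x@example.com">Mail</a>'
)
crawler = Crawler("http://example.com/index.html")
print(crawler.extract_links(sample))
```

Calling crawler.crawl() would fetch the page over the network and return both values at once. Keeping link extraction in its own method lets it be tried on a plain string first.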