HTML Unicode encoding conversion method
Strings such as "&#24038;&#36793;" that begin with &# are HTML numeric character references (Unicode code points). They can be decoded as follows:
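import re

# Numeric character references: hexadecimal (&#x5de6;) or decimal (&#24038;).
_ref = re.compile(r'&#(?:[xX]([0-9a-fA-F]+)|([0-9]+));')

def to_str(s):
    # Replace each reference with the character at that code point.
    return _ref.sub(lambda m: chr(int(m.group(1), 16) if m.group(1) else int(m.group(2))), s)

s = "&#24038;&#36793;"  # references for "左边" ("left side")
print(to_str(s))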
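For comparison, the same decoding can be done with the Python 3 standard library instead of a hand-written regular expression. This is a minimal sketch, assuming Python 3.4 or later; the variable names are illustrative only.

```python
import html

s = "&#24038;&#36793;"
decoded = html.unescape(s)  # numeric references become the characters at those code points
print(decoded)

# Hexadecimal and named references are handled by the same call.
mixed = html.unescape("&#x5de6;&lt;")
print(mixed)
```

Using html.unescape also covers named references such as &lt;, which the regular expression approach does not handle.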
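Related documents:
HTML is the common language of the Web. Its simple tags, enclosed in angle brackets, make up the Web as we know it today. In 1991, Tim Berners-Lee
wrote a document called "HTML Tags" that contained about 20 HTML tags for marking up web pages. He borrowed the markup format directly from SGML,
and that format is the HTML markup we still see today. This article gives a brief history of HTML as the Web's markup language.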
......
HTML Character Entities
Some characters have a special meaning in HTML. For example, the less-than sign < marks the start of an HTML tag, so it is not displayed in the page we finally see. What should we do if we want a less-than sign to appear in the page?
This is where HTML character entities come in.
A character entity has three parts: the first part ......
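As an illustration of the problem described above, the following minimal sketch uses Python's standard html module to turn a less-than sign into its character entity so that it is displayed rather than parsed as the start of a tag. The sample text and variable names are hypothetical.

```python
import html

text = "if a < b: print('smaller')"
escaped = html.escape(text, quote=False)  # quote=False leaves ' and " untouched
print(escaped)

# Unescaping the result gives back the original text.
assert html.unescape(escaped) == text
```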
System.Net.WebClient wc = new System.Net.WebClient();
Byte[] pageData = wc.DownloadData("httP://www");
string s = System.Text.Encoding.Default.GetString(pageData); ......
Suppose a string has the form str = "&bbbLAA";
To extract "L" from it, you can do the following:
//sDataStr = "&bbbLAA";
//sLeftQuote = "&bbb";
//sRightQuote = "&AA";
Calling this method with these arguments returns the "L" field.
function abCutString( sDataStr, sLeftQuote, sRightQuote)
{
var sReturnVal = '';
var nStart ......
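The extraction described above can be sketched in Python. This is not the truncated abCutString function itself: the name cut_between and the choice to return an empty string when a delimiter is missing are assumptions, and the right delimiter is taken to be "AA" so that it actually occurs in the sample string.

```python
def cut_between(data, left, right):
    """Return the text between the first `left` and the next `right`, or '' if either is missing."""
    start = data.find(left)
    if start == -1:
        return ''
    start += len(left)             # skip past the left delimiter
    end = data.find(right, start)  # search only after the left delimiter
    if end == -1:
        return ''
    return data[start:end]

print(cut_between("&bbbLAA", "&bbb", "AA"))
```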
Dim objReg,objMatches,objMatch
Set objReg=new RegExp
objReg.Global=True
objReg.IgnoreCase=True
objReg.Pattern="<('[^']*'|""[^""]*""|[^'"">])*?>"
Set objMatches=objReg.Execute(strInput) ' strInput is a placeholder for the string to search
For Each objMatch In objMatches
' HTML tag found: objMatch.Value
Next
Set objMatches=Nothing
Set objRe ......
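The same tag-matching pattern can be tried out in Python. This is a minimal sketch under stated assumptions: TAG_RE and find_tags are hypothetical names, the group is made non-capturing so that each match is the whole tag, and re.IGNORECASE mirrors IgnoreCase=True. A regular expression like this is only an approximation of HTML parsing.

```python
import re

# Quoted attribute values may contain '>', so they are consumed as whole units.
TAG_RE = re.compile(r"""<(?:'[^']*'|"[^"]*"|[^'">])*?>""", re.IGNORECASE)

def find_tags(text):
    """Return every substring of `text` that looks like an HTML tag."""
    return [m.group(0) for m in TAG_RE.finditer(text)]

tags = find_tags('<a href="x>y">text</a>')
for tag in tags:
    print("HTML tag found:", tag)
```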