ÿØÿà JFIF    ÿÛ „  ( %"1!%)+...383,7(-.+  -+++--++++---+-+-----+---------------+---+-++7-----ÿÀ  ß â" ÿÄ     ÿÄ H    !1AQaq"‘¡2B±ÁÑð#R“Ò Tbr‚²á3csƒ’ÂñDS¢³$CÿÄ   ÿÄ %  !1AQa"23‘ÿÚ   ? ôÿ ¨pŸªáÿ —åYõõ\?àÒü©ŠÄï¨pŸªáÿ —åYõõ\?àÓü©ŠÄá 0Ÿªáÿ Ÿå[úƒ ú®ði~TÁbqÐ8OÕpÿ ƒOò¤Oè`–RÂáœá™êi€ßÉ< FtŸI“öÌ8úDf´°å}“¾œ6  öFá°y¥jñÇh†ˆ¢ã/ÃÐ:ªcÈ "Y¡ðÑl>ÿ ”ÏËte:qž\oäŠe÷󲍷˜HT4&ÿ ÓÐü6ö®¿øþßèô Ÿ•7Ñi’•j|“ñì>b…þS?*Óôÿ ÓÐü*h¥£ír¶ü UãS炟[AÐaè[ûª•õ&õj?†Éö+EzP—WeÒírJFt ‘BŒ†Ï‡%#tE Øz ¥OÛ«!1›üä±Í™%ºÍãö]°î(–:@<‹ŒÊö×òÆt¦ãº+‡¦%ÌÁ²h´OƒJŒtMÜ>ÀÜÊw3Y´•牋4ǍýʏTì>œú=Íwhyë,¾Ôò×õ¿ßÊa»«þˆѪQ|%6ž™A õ%:øj<>É—ÿ Å_ˆCbõ¥š±ý¯Ýƒï…¶|RëócÍf溪“t.СøTÿ *Ä¿-{†çàczůŽ_–^XþŒ±miB[X±d 1,é”zEù»& î9gœf™9Ð'.;—™i}!ôšåîqêÛ٤ёý£½ÆA–àôe"A$˝Úsäÿ ÷Û #°xŸëí(l »ý3—¥5m! rt`†0~'j2(]S¦¦kv,ÚÇ l¦øJA£Šƒ J3E8ÙiŽ:cÉžúeZ°€¯\®kÖ(79«Ž:¯X”¾³Š&¡* ….‰Ž(ÜíŸ2¥ª‡×Hi²TF¤ò[¨íÈRëÉ䢍mgÑ.Ÿ<öäS0í„ǹÁU´f#Vß;Õ–…P@3ío<ä-±»Ž.L|kªÀê›fÂ6@»eu‚|ÓaÞÆŸ…¨ááå>åŠ?cKü6ùTÍÆ”†sĤÚ;H2RÚ†õ\Ö·Ÿn'¾ ñ#ºI¤Å´%çÁ­‚â7›‹qT3Iï¨ÖÚ5I7Ë!ÅOóŸ¶øÝñØôת¦$Tcö‘[«Ö³šÒ';Aþ ¸èíg A2Z"i¸vdÄ÷.iõ®§)¿]¤À†–‡É&ä{V¶iŽ”.Ó×Õÿ û?h¬Mt–íª[ÿ Ñÿ ÌV(í}=ibÔ¡›¥¢±b Lô¥‡piη_Z<‡z§èŒ)iÖwiÇ 2hÙ3·=’d÷8éŽ1¦¸c¤µ€7›7Ø ð\á)} ¹fËí›pAÃL%âc2 í§æQz¿;T8sæ°qø)QFMð‰XŒÂ±N¢aF¨…8¯!U  Z©RÊ ÖPVÄÀÍin™Ì-GˆªÅËŠ›•zË}º±ŽÍFò¹}Uw×#ä5B¤{î}Ð<ÙD é©¤&‡ïDbàÁôMÁ." ¤‡ú*õ'VŽ|¼´Úgllº¼klz[Æüï÷Aób‡Eÿ dÑ»Xx9ÃÜ£ÁT/`¼¸vI±Ýµ·Ë‚“G³þ*Ÿû´r|*}<¨îºœ @¦mÄ’M¹”.œ«Y–|6ÏU¤jç¥ÕÞqO ˜kDÆÁ¨5ÿ š;ÐЦ¦€GÙk \ –Þ=â¼=SͧµªS°ÚÍpÜãQűÀõ¬?ÃÁ1Ñ•õZà?hóœ€ L¦l{Y*K˜Ù›zc˜–ˆâ ø+¾ ­-Ök¥%ùEÜA'}ˆ><ÊIè“bpÍ/qÞâvoX€w,\úªò6Z[XdÒæ­@Ö—€$òJí#é>'°Ú ôª˜<)4ryÙ£|óAÅn5žêŸyÒäMÝ2{"}‰–¤l÷ûWX\l¾Á¸góÉOÔ /óñB¤f¸çñ[.P˜ZsÊË*ßT܈§QN¢’¡¨§V¼(Üù*eÕ“”5T¨‹Âê¥FŒã½Dü[8'Ò¥a…Ú¶k7a *•›¼'Ò·\8¨ª\@\õ¢¦íq+DÙrmÎ…_ªæ»ŠÓœ¡¯’Ré9MÅ×D™lælffc+ŒÑ,ý™ÿ ¯þǤ=Å’Á7µ÷ÚÛ/“Ü€ñýã¼àí¾ÕÑ+ƒ,uµMâÀÄbm:ÒÎPæ{˜Gz[ƒ¯«® KHà`ߨŠéí¯P8Aq.C‰ à€kòpj´kN¶qô€…Õ,ÜNŠª-­{Zö’æû44‰sŽè‰îVíRœÕm" 6?³D9¡ÇTíÅꋇ`4«¸ÝÁô ï’ýorqКÇZ«x4Žâéþuïf¹µö[P ,Q£éaX±`PÉÍZ ¸äYúg üAx ’6Lê‚xÝÓ*äQ  Ï’¨hÍ =²,6ï#rÃ<¯–£»ƒ‹,–ê•€ aÛsñ'%Æ"®ÛüìBᝠHÚ3ß°©$“XnœÖ’î2ËTeûìxîß ¦å¿çÉ ðK§þ{‘t‚Ϋ¬jéîZ[ ”š7L¥4VÚCE×]m¤Øy”ä4-dz£œ§¸x.*ãÊÊ b÷•h:©‡¦s`BTÁRû¾g⻩‹jø sF¢àJøFl‘È•Xᓁà~*j¯ +(ÚÕ6-£¯÷GŠØy‚<Ç’.F‹Hœw(+)ÜÜâÈzÄäT§FߘãÏ;DmVœ3Àu@mÚüXÝü•3B¨òÌÁÛ<·ÃÜ z,Ì@õÅ·d2]ü8s÷IôÞ¯^Ç9¢u„~ëAŸï4«M? K]­ÅàPl@s_ p:°¬ZR”´›JC[CS.h‹ƒïËœ«Æ]–÷ó‚wR×k7X‰k›‘´ù¦=¡«‰¨¨Â')—71ó’c‡Ðúµ `é.{§p¹ój\Ž{1h{o±Ý=áUÊïGÖŒõ–-BÄm+AZX¶¡ ïHðæ¥JmÙ;…䡟ˆ¦ ° äšiÉg«$üMk5¤L“’çÊvïâï ,=f“"íἊ5ô¬x6{ɏžID0e¸vçmi'︧ºð9$ò¹÷*£’9ÿ ²TÔ…×>JV¥}Œ}$p[bÔ®*[jzS*8 ”·T›Í–ñUîƒwo$áè=LT™ç—~ô·¤ÈÚ$榍q‰„+´kFm)ž‹©i–ËqÞŠ‰à¶ü( ‚•§ •°ò·‡#5ª•µÊ﯅¡X¨šÁ*F#TXJÊ ušJVÍ&=iÄs1‚3•'fý§5Ñ<=[íÞ­ PÚ;ѱÌ_~Ä££8rÞ ²w;’hDT°>ÈG¬8Á²ÚzŽ®ò®qZcqJêäÞ-ö[ܘbň±çb“ж31²n×iƒðÕ;1¶þÉ ªX‰,ßqÏ$>•î íZ¥Z 1{ç൵+ƒÕµ¥°T$§K]á»Ûï*·¤tMI’ÂZbŽÕiÒ˜}bÓ0£ª5›¨ [5Ž^ÝœWøÂÝh° ¢OWun£¤5 a2Z.G2³YL]jåtì”ä ÁÓ‘%"©<Ôúʰsº UZvä‡ÄiÆÒM .÷V·™ø#kèýiíÌ–ª)µT[)BˆõÑ xB¾B€ÖT¨.¥~ð@VĶr#¸ü*åZNDŽH;âi ],©£öØpù(šºãö¼T.uCê•4@ÿ GÕÛ)Cx›®0ø#:ÏðFÒbR\(€€Ä®fã4Þ‰Fä¯HXƒÅ,†öEÑÔÜ]Öv²?tLÃvBY£ú6Êu5ÅAQ³1‘’¬x–HŒÐ‡ ^ ¸KwJôÖŽ5×CÚ¨vÜ«/B0$×k°=ðbÇ(Ï)w±A†Á† 11Í=èQšµ626ŒÜ/`G«µ<}—-Ö7KEHÈÉðóȤmݱû±·ø«Snmá=“䫚mݱŸ¡¶~ó·“äUóJæúòB|E LêŽy´jDÔ$G¢þÐñ7óR8ýÒ…Ç› WVe#·Ÿ p·Fx~•ݤF÷0Èÿ K¯æS<6’¡WШ; ´ÿ ¥Êø\Òuî†åÝ–VNœkÒ7oòX¨Á­Ø÷FÎÑä±g÷ÿ M~Çî=p,X´ ÝÌÚÅ‹’ÃjÖ.ØöÏñ qïQ¤ÓZE†° =6·]܈ s¸>v•Ž^Ý\wq9r‰Î\¸¡kURÒ$­*‹Nq?Þª*!sŠÆ:TU_u±T+øX¡ ®¹¡,ÄâÃBTsÜ$Ø›4m椴zÜK]’’›Pƒ @€#â˜`é¹=I‡fiV•Ôî“nRm+µFPOhÍ0B£ €+¬5c v•:P'ÒyÎ ‰V~‚Ó†ÖuókDoh$å\*ö%Ю=£«…aȼ½÷Û.-½VŒŠ¼'lyî±1¬3ó#ÞE¿ÔS¤gV£m›=§\û"—WU¤ÚǼÿ ÂnÁGŒÃ ‚õN D³õNÚíŒÕ;HôyÄÈ©P¹Ä{:?R‘Ô¨âF÷ø£bÅó® JS|‚R÷ivýáâ€Æé¡è³´IئÑT!§˜•ت‚¬â@q€wnïCWÄ@JU€ê¯m6]Ï:£âx'+ÒðXvÓ¦Úm=–´7œ $ì“B£~p%ÕŸUþ« N@¼üï~w˜ñø5®—'Ôe»¤5ã//€ž~‰Tþ›Å7•#¤× Íö pÄ$ùeåì*«ÓŠEØWEÈsßg ¦ûvžSsLpºÊW–âµEWöˬH; ™!CYõZ ÃÄf æ#1W. \uWâ\,\Çf j’<qTbên›Î[vxx£ë 'ö¨1›˜ÀM¼Pÿ H)ƒêêŒA7s,|F“ 꺸k³9Ìö*ç®;Ö!Ö$Eiž•¹ÒÚ†ýóéÝû¾ÕS®ó$’NÝäŸz¤5r¦ãÄÃD÷Üø!°ø‡Ô&@m™Ì^Ãä­d q5Lnÿ N;.6½·N|#ä"1Nƒx“ã<3('&ñßt  ~ªu”1Tb㫨9ê–›–bìd$ߣ=#ÕãÒmU¯eí$EFù5ýYô櫨æì™Ç—±ssM]·á¿0ÕåJRÓªîiƒ+O58ÖñªŠÒx" \µâá¨i’¤i —Ö ” M+M¤ë9‚‰A¦°Qõ¾ßøK~¼Ã‘g…Ö´~÷Ï[3GUœÒ½#…kàÔ®Ò”‰³·dWV‰IP‰Ú8u¹”E ÖqLj¾êÕCBš{A^Âß;–¨`¯¬ìö ˼ ×tìø.tƐm*n¨y4o&Àx¥n¦×î‡aupáÛj8¿m›è¶ã!o½;ß0y^ý×^EÑ¿ÒjzŒ­)vÚÑnÄL …^ªô× ‡—‚3k Îý­hï]içå–îÏ*÷ñþ»Ô CÒjøjÍznˆ´ ¹#b'Fô‹ ‰v¥'’à'T´ƒHýÍ%M‰ ƒ&ÆÇŒï1 ‘ –Þ ‰i¬s žR-Ÿ kЬá¬7:þ 0ŒÅÒÕ/aÙ¬ÃÝ#Úøœ ©aiVc‰. ¹¦ãµ” ›Yg¦›ÆÎýº°f³7ƒhá·¸­}&D9¡ÂsÉÙÞèŠõØàC™¨ñbFC|´Ü(ŸƒÚÒ-%»'a Ì¿)ËÇn¿úÿ ÞŽX…4ÊÅH^ôΑí@ù¹Eh¶“L8Çjù ¼ÎåVªóR©Ï5uà V4lZß®=€xÖŸ–ÑÈ ÷”¨°¾__yM1tÉ?uÆþIkÄgæ@þ[¢†°XÃJ£j·:nkÅ¢u ‘}âGzö­/IµèЬ¼48q¦F°ŽR¼=ûì{´¯RýicS ÕÛ íNtÍÙï£,w4rêì®»~x(©Uñ§#Ñ&œÕ¤>ÎåÍÓ9’Ö{9eV­[Öjâ²ãu]˜å2›qÑšÕJç0€sÄ|Êëè0튔bÁ>“{×_F`Ø©ºê:µä,v¤ðfc1±"«ÔÍän1#=· Âøv~H½ÐßA¾¿Ü€Óš]Õ; I¾÷ç‚Qi†î¹9ywÔKG˜áñ zQY—§ÃÕZ07§X‚ Áh;ÁM)iÌCH-¯T‘ë|A0{Ò½LÚ–TâÖkÜ’dÀ“rmm»”جPF³ÖcbE§T€ÒxKºû’Ó®7±²(\4ŽÃ¸Uu@j™yĵ;³µ!Á¢b.W¤=mõ´êµK k ¸K^ÜÛ#p*Ü14qkZç5ïë †°5Ï%ÍÛ<Õ¤×Ô¥ê†C Õ´¼ú$ƒÖ“”]Ù¬qÞÚ[4©ý!ûÏ—Áb쳐XµA¬â~`›Çr¸8ìùÝ䫦<>ä÷«?xs´ÇÑ /á;¹øüÊÈÙà{"@Žïzâ¬[âß‚ U_<ÇŸ½4èN˜ú61®qŠu ¦þF£»äJ_ˆÙÎ~ ÞAã–݄ϗrŠD;xTž‘ô`É«…suãO`?³à™ô Lý#Íc5öoæØ‚y´´÷«ZR§<&JÇ+éâô´€i!Àˆ0æAoàðLèÖ-2ŸõW.’t^–(KÁmHµV@xÜÇy®Ñø­â^:Ú3w· 7½¹°ñ¸â¹®:',«Mœ—n­Á+Ãbš LÈ‘ÄnRÓÅœ%¦²‰¨ùQ:¤f‚ "PÕtô¸…cæl…&˜Ú˜Ôkv‹ž+vŠ,=¢v­6—Xy*¥t£«<™:“aîϲ=¦6rO]XI¿Œ÷¤zÚ­›¶ 6÷”w\d ü~v®ˆÌk«^m<ÿ ¢‰Õ\)ùºŽ;… lîÙÅEŠ®cѾ@vnMÏ,¼“ñ•ŽBxðÃzãÇç%3ˆ"}Ù•Åî> BÉú;Ò]V+P˜F_´ßé> Øše|ï‡ÄOmFæÇ ãqÞ$/xÐx­z`ï9"œÜij‚!7.\Td…9M‡•iŽ‹¾‘50ÞŽn¥ß4ÉôO ¹*í^QêËÜÇÌ8=ާs‰'ÂëÙ«á%Pú[O †ÅP¯Vsް.‰,kc¶ ¬A9n˜XÎ-ÞšN["¹QÕ‰ƒMýÁߺXJæÍaLj¾×Ãmã¾ãÚ uñÒþåQô¦¥ /ÄUx:‚ÍÜ’ Đ©ØÝ3V¨‰ÕnÐ6ó*óúK­«…c ¯U òhsý­jóÔj#,ímŒRµ«lbïUTŒÑ8†Ä0œÏr`ð¡¬É Ї ë"À² ™ 6¥ f¶ ¢ÚoܱԷ-<Àî)†a¶ž'Ú»¨TXqØæ¶÷YÄHy˜9ÈIW­YÀuMFë ºÏ’AqÌ4·/Ú †ô'i$øä­=Ä Ý|öK×40è|È6p‘0§)o¥ctî§H+CA-“ xØ|ÐXАç l8íºð3Ø:³¤¬KX¯UÿÙ"""A parser for HTML and XHTML.""" # This file is based on sgmllib.py, but the API is slightly different. # XXX There should be a way to distinguish between PCDATA (parsed # character data -- the normal case), RCDATA (replaceable character # data -- only char and entity references and end tags are special) # and CDATA (character data -- only end tags are special). import markupbase import re # Regular expressions used for parsing interesting_normal = re.compile('[&<]') incomplete = re.compile('&[a-zA-Z#]') entityref = re.compile('&([a-zA-Z][-.a-zA-Z0-9]*)[^a-zA-Z0-9]') charref = re.compile('&#(?:[0-9]+|[xX][0-9a-fA-F]+)[^0-9a-fA-F]') starttagopen = re.compile('<[a-zA-Z]') piclose = re.compile('>') commentclose = re.compile(r'--\s*>') # see http://www.w3.org/TR/html5/tokenization.html#tag-open-state # and http://www.w3.org/TR/html5/tokenization.html#tag-name-state # note: if you change tagfind/attrfind remember to update locatestarttagend too tagfind = re.compile('([a-zA-Z][^\t\n\r\f />\x00]*)(?:\s|/(?!>))*') # this regex is currently unused, but left for backward compatibility tagfind_tolerant = re.compile('[a-zA-Z][^\t\n\r\f />\x00]*') attrfind = re.compile( r'((?<=[\'"\s/])[^\s/>][^\s/=>]*)(\s*=+\s*' r'(\'[^\']*\'|"[^"]*"|(?![\'"])[^>\s]*))?(?:\s|/(?!>))*') locatestarttagend = re.compile(r""" <[a-zA-Z][^\t\n\r\f />\x00]* # tag name (?:[\s/]* # optional whitespace before attribute name (?:(?<=['"\s/])[^\s/>][^\s/=>]* # attribute name (?:\s*=+\s* # value indicator (?:'[^']*' # LITA-enclosed value |"[^"]*" # LIT-enclosed value |(?!['"])[^>\s]* # bare value ) )?(?:\s|/(?!>))* )* )? \s* # trailing whitespace """, re.VERBOSE) endendtag = re.compile('>') # the HTML 5 spec, section 8.1.2.2, doesn't allow spaces between # ') class HTMLParseError(Exception): """Exception raised for all parse errors.""" def __init__(self, msg, position=(None, None)): assert msg self.msg = msg self.lineno = position[0] self.offset = position[1] def __str__(self): result = self.msg if self.lineno is not None: result = result + ", at line %d" % self.lineno if self.offset is not None: result = result + ", column %d" % (self.offset + 1) return result class HTMLParser(markupbase.ParserBase): """Find tags and other markup and call handler functions. Usage: p = HTMLParser() p.feed(data) ... p.close() Start tags are handled by calling self.handle_starttag() or self.handle_startendtag(); end tags by self.handle_endtag(). The data between tags is passed from the parser to the derived class by calling self.handle_data() with the data as argument (the data may be split up in arbitrary chunks). Entity references are passed by calling self.handle_entityref() with the entity reference as the argument. Numeric character references are passed to self.handle_charref() with the string containing the reference as the argument. """ CDATA_CONTENT_ELEMENTS = ("script", "style") def __init__(self): """Initialize and reset this instance.""" self.reset() def reset(self): """Reset this instance. Loses all unprocessed data.""" self.rawdata = '' self.lasttag = '???' self.interesting = interesting_normal self.cdata_elem = None markupbase.ParserBase.reset(self) def feed(self, data): r"""Feed data to the parser. Call this as often as you want, with as little or as much text as you want (may include '\n'). """ self.rawdata = self.rawdata + data self.goahead(0) def close(self): """Handle any buffered data.""" self.goahead(1) def error(self, message): raise HTMLParseError(message, self.getpos()) __starttag_text = None def get_starttag_text(self): """Return full source of start tag: '<...>'.""" return self.__starttag_text def set_cdata_mode(self, elem): self.cdata_elem = elem.lower() self.interesting = re.compile(r'' % self.cdata_elem, re.I) def clear_cdata_mode(self): self.interesting = interesting_normal self.cdata_elem = None # Internal -- handle data as far as reasonable. May leave state # and data to be processed by a subsequent call. If 'end' is # true, force handling all data as if followed by EOF marker. def goahead(self, end): rawdata = self.rawdata i = 0 n = len(rawdata) while i < n: match = self.interesting.search(rawdata, i) # < or & if match: j = match.start() else: if self.cdata_elem: break j = n if i < j: self.handle_data(rawdata[i:j]) i = self.updatepos(i, j) if i == n: break startswith = rawdata.startswith if startswith('<', i): if starttagopen.match(rawdata, i): # < + letter k = self.parse_starttag(i) elif startswith("', i + 1) if k < 0: k = rawdata.find('<', i + 1) if k < 0: k = i + 1 else: k += 1 self.handle_data(rawdata[i:k]) i = self.updatepos(i, k) elif startswith("&#", i): match = charref.match(rawdata, i) if match: name = match.group()[2:-1] self.handle_charref(name) k = match.end() if not startswith(';', k-1): k = k - 1 i = self.updatepos(i, k) continue else: if ";" in rawdata[i:]: # bail by consuming '&#' self.handle_data(rawdata[i:i+2]) i = self.updatepos(i, i+2) break elif startswith('&', i): match = entityref.match(rawdata, i) if match: name = match.group(1) self.handle_entityref(name) k = match.end() if not startswith(';', k-1): k = k - 1 i = self.updatepos(i, k) continue match = incomplete.match(rawdata, i) if match: # match.group() will contain at least 2 chars if end and match.group() == rawdata[i:]: self.error("EOF in middle of entity or char ref") # incomplete break elif (i + 1) < n: # not the end of the buffer, and can't be confused # with some other construct self.handle_data("&") i = self.updatepos(i, i + 1) else: break else: assert 0, "interesting.search() lied" # end while if end and i < n and not self.cdata_elem: self.handle_data(rawdata[i:n]) i = self.updatepos(i, n) self.rawdata = rawdata[i:] # Internal -- parse html declarations, return length or -1 if not terminated # See w3.org/TR/html5/tokenization.html#markup-declaration-open-state # See also parse_declaration in _markupbase def parse_html_declaration(self, i): rawdata = self.rawdata if rawdata[i:i+2] != ' gtpos = rawdata.find('>', i+9) if gtpos == -1: return -1 self.handle_decl(rawdata[i+2:gtpos]) return gtpos+1 else: return self.parse_bogus_comment(i) # Internal -- parse bogus comment, return length or -1 if not terminated # see http://www.w3.org/TR/html5/tokenization.html#bogus-comment-state def parse_bogus_comment(self, i, report=1): rawdata = self.rawdata if rawdata[i:i+2] not in ('', i+2) if pos == -1: return -1 if report: self.handle_comment(rawdata[i+2:pos]) return pos + 1 # Internal -- parse processing instr, return end or -1 if not terminated def parse_pi(self, i): rawdata = self.rawdata assert rawdata[i:i+2] == ' if not match: return -1 j = match.start() self.handle_pi(rawdata[i+2: j]) j = match.end() return j # Internal -- handle starttag, return end or -1 if not terminated def parse_starttag(self, i): self.__starttag_text = None endpos = self.check_for_whole_start_tag(i) if endpos < 0: return endpos rawdata = self.rawdata self.__starttag_text = rawdata[i:endpos] # Now parse the data between i+1 and j into a tag and attrs attrs = [] match = tagfind.match(rawdata, i+1) assert match, 'unexpected call to parse_starttag()' k = match.end() self.lasttag = tag = match.group(1).lower() while k < endpos: m = attrfind.match(rawdata, k) if not m: break attrname, rest, attrvalue = m.group(1, 2, 3) if not rest: attrvalue = None elif attrvalue[:1] == '\'' == attrvalue[-1:] or \ attrvalue[:1] == '"' == attrvalue[-1:]: attrvalue = attrvalue[1:-1] if attrvalue: attrvalue = self.unescape(attrvalue) attrs.append((attrname.lower(), attrvalue)) k = m.end() end = rawdata[k:endpos].strip() if end not in (">", "/>"): lineno, offset = self.getpos() if "\n" in self.__starttag_text: lineno = lineno + self.__starttag_text.count("\n") offset = len(self.__starttag_text) \ - self.__starttag_text.rfind("\n") else: offset = offset + len(self.__starttag_text) self.handle_data(rawdata[i:endpos]) return endpos if end.endswith('/>'): # XHTML-style empty tag: self.handle_startendtag(tag, attrs) else: self.handle_starttag(tag, attrs) if tag in self.CDATA_CONTENT_ELEMENTS: self.set_cdata_mode(tag) return endpos # Internal -- check to see if we have a complete starttag; return end # or -1 if incomplete. def check_for_whole_start_tag(self, i): rawdata = self.rawdata m = locatestarttagend.match(rawdata, i) if m: j = m.end() next = rawdata[j:j+1] if next == ">": return j + 1 if next == "/": if rawdata.startswith("/>", j): return j + 2 if rawdata.startswith("/", j): # buffer boundary return -1 # else bogus input self.updatepos(i, j + 1) self.error("malformed empty start tag") if next == "": # end of input return -1 if next in ("abcdefghijklmnopqrstuvwxyz=/" "ABCDEFGHIJKLMNOPQRSTUVWXYZ"): # end of input in or before attribute value, or we have the # '/' from a '/>' ending return -1 if j > i: return j else: return i + 1 raise AssertionError("we should not get here!") # Internal -- parse endtag, return end or -1 if incomplete def parse_endtag(self, i): rawdata = self.rawdata assert rawdata[i:i+2] == " if not match: return -1 gtpos = match.end() match = endtagfind.match(rawdata, i) # if not match: if self.cdata_elem is not None: self.handle_data(rawdata[i:gtpos]) return gtpos # find the name: w3.org/TR/html5/tokenization.html#tag-name-state namematch = tagfind.match(rawdata, i+2) if not namematch: # w3.org/TR/html5/tokenization.html#end-tag-open-state if rawdata[i:i+3] == '': return i+3 else: return self.parse_bogus_comment(i) tagname = namematch.group(1).lower() # consume and ignore other stuff between the name and the > # Note: this is not 100% correct, since we might have things like # , but looking for > after tha name should cover # most of the cases and is much simpler gtpos = rawdata.find('>', namematch.end()) self.handle_endtag(tagname) return gtpos+1 elem = match.group(1).lower() # script or style if self.cdata_elem is not None: if elem != self.cdata_elem: self.handle_data(rawdata[i:gtpos]) return gtpos self.handle_endtag(elem) self.clear_cdata_mode() return gtpos # Overridable -- finish processing of start+end tag: def handle_startendtag(self, tag, attrs): self.handle_starttag(tag, attrs) self.handle_endtag(tag) # Overridable -- handle start tag def handle_starttag(self, tag, attrs): pass # Overridable -- handle end tag def handle_endtag(self, tag): pass # Overridable -- handle character reference def handle_charref(self, name): pass # Overridable -- handle entity reference def handle_entityref(self, name): pass # Overridable -- handle data def handle_data(self, data): pass # Overridable -- handle comment def handle_comment(self, data): pass # Overridable -- handle declaration def handle_decl(self, decl): pass # Overridable -- handle processing instruction def handle_pi(self, data): pass def unknown_decl(self, data): pass # Internal -- helper to remove special character quoting entitydefs = None def unescape(self, s): if '&' not in s: return s def replaceEntities(s): s = s.groups()[0] try: if s[0] == "#": s = s[1:] if s[0] in ['x','X']: c = int(s[1:], 16) else: c = int(s) return unichr(c) except ValueError: return '&#'+s+';' else: # Cannot use name2codepoint directly, because HTMLParser supports apos, # which is not part of HTML 4 if HTMLParser.entitydefs is None: import htmlentitydefs entitydefs = {'apos':u"'"} for k, v in htmlentitydefs.name2codepoint.iteritems(): entitydefs[k] = unichr(v) HTMLParser.entitydefs = entitydefs try: return self.entitydefs[s] except KeyError: return '&'+s+';' return re.sub(r"&(#?[xX]?(?:[0-9a-fA-F]+|\w{1,8}));", replaceEntities, s)