Indexing HTML Documents with Lucene
深未来技术
1. Most web documents are in HTML format.
2. This example uses the following HTML document:
<html>
<head>
<title>
Laptop power supplies are available in First class only
</title>
</head>
<body>
<h1>code,write,fly</h1>
</body>
</html>
3. Using JTidy
JTidy is the Java version of Tidy, written by Andy Quick.
import java.io.InputStream;
import org.apache.lucene.document.Field;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.w3c.tidy.Tidy;

// DocumentHandler and DocumentHandlerException are the article's own types; their definitions are not shown here.
public class JTidyHTMLHandler implements DocumentHandler {
    public org.apache.lucene.document.Document getDocument(InputStream is)
            throws DocumentHandlerException { // takes an InputStream that represents the HTML document
        Tidy tidy = new Tidy();
        tidy.setQuiet(true);
        tidy.setShowWarnings(false);
        // parse the InputStream that represents the HTML document
        org.w3c.dom.Document root = tidy.parseDOM(is, null);
        Element rawDoc = root.getDocumentElement();
        org.apache.lucene.document.Document doc = new org.apache.lucene.document.Document();
        String title = getTitle(rawDoc); // get the title
        String body = getBody(rawDoc); // get all elements between <body> and </body>
        if ((title != null) && (!title.equals(""))) {
            doc.add(Field.Text("title", title));
        }
        if ((body != null) && (!body.equals(""))) {
            doc.add(Field.Text("body", body));
        }
        return doc;
    }
    protected String getTitle(Element rawDoc) {
        if (rawDoc == null) {
            return null;
        }
        String title = "";
        NodeList children = rawDoc.getElementsB
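The listing breaks off inside getTitle, so the rest of the handler is not available here. As a rough, separate illustration of the kind of DOM traversal such a handler relies on, the sketch below pulls the title and body text out of a parsed tree. It is not the article's code: the class and method names (DomTextSketch, parse, textOfFirst, collectText) are hypothetical, and it uses the JDK's built-in XML parser as a stand-in for JTidy, which only works because the sample page happens to be well-formed XML.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;

public class DomTextSketch {

    // Parse well-formed markup with the JDK XML parser (assumption: a stand-in for JTidy).
    public static Element parse(String markup) {
        try {
            Document dom = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new ByteArrayInputStream(markup.getBytes(StandardCharsets.UTF_8)));
            return dom.getDocumentElement();
        } catch (Exception e) {
            throw new IllegalArgumentException("markup is not well-formed XML", e);
        }
    }

    // Return the trimmed text under the first element with the given tag name, or "" if none exists.
    public static String textOfFirst(Element root, String tag) {
        NodeList matches = root.getElementsByTagName(tag);
        if (matches.getLength() == 0) {
            return "";
        }
        StringBuilder sb = new StringBuilder();
        collectText(matches.item(0), sb);
        return sb.toString().trim();
    }

    // Depth-first walk that appends every text node found below the given node.
    public static void collectText(Node node, StringBuilder sb) {
        NodeList children = node.getChildNodes();
        for (int i = 0; i < children.getLength(); i++) {
            Node child = children.item(i);
            if (child.getNodeType() == Node.TEXT_NODE) {
                sb.append(child.getNodeValue());
            } else {
                collectText(child, sb);
            }
        }
    }

    public static void main(String[] args) {
        // A condensed copy of the sample page from step 2.
        String html = "<html><head><title>Laptop power supplies are available in First class only</title></head>"
                + "<body><h1>code,write,fly</h1></body></html>";
        Element root = parse(html);
        System.out.println("title: " + textOfFirst(root, "title"));
        System.out.println("body:  " + textOfFirst(root, "body"));
    }
}
```

With real-world HTML, which is rarely well-formed, a tolerant parser such as JTidy would be needed to produce the DOM before a walk like this can run.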
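Related documents:
How to refer to a parent directory
../ refers to the parent of the directory that contains the source file, ../../ refers to the directory two levels up, and so on.
Suppose the path of info.html is: c:\Inetpub\wwwroot\sites\blabla\info.html
Suppose the path of index.html is: c:\Inetpub\wwwroot\sites\index.html
The code for adding a hyperlink to index.html in info.html should be written like this: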
<a href ......
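The ../ rule described above can be checked mechanically. The sketch below is only an illustration: it resolves relative references against a hypothetical URL for info.html using java.net.URI. The host name localhost, the URL layout, and the class name RelativeLinkSketch are assumptions made for the example, not taken from the original.

```java
import java.net.URI;

public class RelativeLinkSketch {

    // Resolve a relative href against the URL of the page that contains it.
    public static String resolve(String pageUrl, String href) {
        return URI.create(pageUrl).resolve(href).toString();
    }

    public static void main(String[] args) {
        // Hypothetical URL for info.html; the host name is an assumption.
        String page = "http://localhost/sites/blabla/info.html";
        // One level up from the directory that contains info.html.
        System.out.println(resolve(page, "../index.html"));    // prints http://localhost/sites/index.html
        // Two levels up.
        System.out.println(resolve(page, "../../index.html")); // prints http://localhost/index.html
    }
}
```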
<html>
<head>
<script>
function locking(){
document.all.ly.style.display="block";
document.all.ly.style.width=document.body.clientWidth;
document.all.ly.style.height ......
Yan Lin, Software Engineer, IBM
December 10, 2009
HTML 5 introduces a new page element: <canvas>. A canvas is a blank drawing area in which web developers can draw freely in 2D using JavaScript. Canvas can be used to render rich, polished web interface designs. This article uses a detailed example to show how to build an image viewer with Canvas. The final ......
The Document object
The Document object represents the entire HTML document and can be used to access every element in the page.
The Document object is part of the Window object and can be accessed through the window.document property.
A detailed description of the Document object.
IE: Internet Explorer, F: Firefox, O: Opera, W3C: World Wide Web Consortium (Internet standard).
Document object collections ......
public static string filterStr(string html)
{
System.Text.RegularExpressions.Regex regex1 = new System.Text.RegularExpressions.Regex(@"<s ......