<?xml version="1.0" encoding="UTF-8" ?>
<rss version="2.0">
  <channel>
    <title>jason823</title>
    <description></description>
    <link>http://jason823.javaeye.com</link>
    <language>UTF-8</language>
    <copyright>Copyright 2003-2008, JavaEye.com</copyright>
    <docs>http://blogs.law.harvard.edu/tech/rss</docs>
    <generator>JavaEye - The best software development community</generator>
      <item>
        <title>Using embedded-jboss for unit testing</title>
        <author>jason823</author>
        <description>
          <![CDATA[
          <br/>
          Author: <a href="http://jason823.javaeye.com">jason823</a>&nbsp;
          Link: <a href="http://jason823.javaeye.com/blog/186798" style="color:red;">http://jason823.javaeye.com/blog/186798</a>&nbsp;
          Published: April 25, 2008
          <br/><br/>
          Notice: This is an original blog post published on JavaEye. No website may reproduce it without the author's written permission; violators will be held legally liable.
          <br/><br/>
          <div>You can read about and download embedded-jboss here: <a href="http://wiki.jboss.org/wiki/EmbeddedJBoss">http://wiki.jboss.org/wiki/EmbeddedJBoss</a>.<br />The benefit of embedded-jboss for unit testing is that it provides the core functionality of JBoss with a very short deployment time (4~5 s).</div>
<div>Building on that, we can obtain storage resources such as Hibernate sessions in the same way as in the web application. We can even reuse the web application's persistence configuration files, such as *.cfg.xml and *-ds.xml, without any modification.<br />&nbsp;<br /><strong>To make it work:<br />1. Add all jar files in "%Project%\embedded-jboss\lib" to your classpath.<br />2. Add everything under the directory "%Project%\embedded-jboss\bootstrap\" to your classpath too. I suggest marking the bootstrap folder as a source root in your IDE.<br />3. Add the resource files to your classpath too, such as *.cfg.xml, *-ds.xml, etc.<br />4. Create a file named persistence.xml under "META-INF\"; this folder should also be on your classpath.<br />5. Write a unit test class.<br />All done! You can now use these sessions in your tests.<br /></strong>&nbsp;<br />PS: <br />1. I use the EntityManager from Hibernate's persistence implementation in unit tests. It is defined in META-INF\persistence.xml like this:<br />
<pre name="code" class="xml">&lt;persistence xmlns="http://java.sun.com/xml/ns/persistence"
             xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
             xsi:schemaLocation="http://java.sun.com/xml/ns/persistence http://java.sun.com/xml/ns/persistence/persistence_1_0.xsd"
             version="1.0"&gt;

&lt;persistence-unit name="EntityManagerFactory1"&gt;
    &lt;provider&gt;org.hibernate.ejb.HibernatePersistence&lt;/provider&gt;
    &lt;jta-data-source&gt;java:/DS1&lt;/jta-data-source&gt;
    &lt;properties&gt;
        &lt;property name="jboss.entity.manager.factory.jndi.name"
                  value="java:/EntityManagerFactories/EntityManagerFactory1"/&gt;
        &lt;property name="hibernate.ejb.cfgfile" value="foo1.cfg.xml"/&gt;
    &lt;/properties&gt;
&lt;/persistence-unit&gt;

&lt;persistence-unit name="EntityManagerFactory2"&gt;
    &lt;provider&gt;org.hibernate.ejb.HibernatePersistence&lt;/provider&gt;
    &lt;jta-data-source&gt;java:/DS2&lt;/jta-data-source&gt;
    &lt;properties&gt;
        &lt;property name="jboss.entity.manager.factory.jndi.name"
                  value="java:/EntityManagerFactories/EntityManagerFactory2"/&gt;
        &lt;property name="hibernate.ejb.cfgfile" value="foo2.cfg.xml"/&gt;
    &lt;/properties&gt;
&lt;/persistence-unit&gt;

...

&lt;/persistence&gt;
</pre>
&nbsp;&nbsp;<br />&nbsp;<br />2. Starting embedded-jboss is very simple:<br />
<pre name="code" class="java">Bootstrap.getInstance().bootstrap();
Bootstrap.getInstance().deployResourceBase("foo-ds.xml");</pre>
Your datasources (java:/DS1 and java:/DS2) should be defined in the file foo-ds.xml, following the JBoss datasource template.</div>
<div>&nbsp;</div>
<div>Shutting down embedded-jboss:&nbsp;<br />
<pre name="code" class="java">Bootstrap.getInstance().shutdown();</pre>
&nbsp;&nbsp;<br />After deploying embedded-jboss, you can get the EntityManagerFactory instances like this:<br />
<pre name="code" class="java">HibernateEntityManagerFactory emf1 = (HibernateEntityManagerFactory) Persistence.createEntityManagerFactory("EntityManagerFactory1");
HibernateEntityManagerFactory emf2 = (HibernateEntityManagerFactory) Persistence.createEntityManagerFactory("EntityManagerFactory2");
    ...
</pre>
&nbsp;&nbsp;<br />Then you can get Hibernate sessions from these EntityManagerFactory instances.<br />&nbsp; </div>
<div>3. TestNG is a little different from JUnit, but it is easier to use, especially when you run a group of tests (called a TestSuite in JUnit). Here is an example:</div>
<pre name="code" class="java">import org.hibernate.ejb.HibernateEntityManagerFactory;
import org.jboss.embedded.Bootstrap;
import org.jboss.deployers.spi.DeploymentException;

import javax.persistence.Persistence;

public class TestDeployer {
    static boolean isRunning = false;

    static HibernateEntityManagerFactory entityManagerFactory1;
    static HibernateEntityManagerFactory entityManagerFactory2;

    static void deploy() throws DeploymentException {
        Bootstrap.getInstance().bootstrap();
        Bootstrap.getInstance().deployResourceBase("foo-ds.xml");
        entityManagerFactory1= (HibernateEntityManagerFactory) Persistence.createEntityManagerFactory("EntityManagerFactory1");
        entityManagerFactory2= (HibernateEntityManagerFactory) Persistence.createEntityManagerFactory("EntityManagerFactory2");
    }

    static void unDeploy() throws DeploymentException {
        try {
            entityManagerFactory1.close();
            entityManagerFactory2.close();
        } catch (Exception e) {
            System.out.println(e.getMessage());
        }
        Bootstrap.getInstance().shutdown();
    }

}
</pre>
<div>&nbsp;</div>
<div>
<pre name="code" class="java">import org.hibernate.Session;
import org.jboss.deployers.spi.DeploymentException;
import org.testng.annotations.AfterClass;
import org.testng.annotations.AfterGroups;
import org.testng.annotations.BeforeGroups;

/**
 * A simple test base class using TestNG.
 * Unit tests are grouped by name, such as "beans" or "entities",
 * and all tests that belong to the same group can be run together.
 * Before the first test of a group runs, a preparation method is invoked once ("setUp" here).
 * After all tests of a group have run, a cleanup method is invoked once ("destroy" here).
 * After each class in the group has finished all of its test methods, a per-class method is invoked ("closeSessions" here).
 */
public class TestBase {
    private Session session1;
    private Session session2;

    @BeforeGroups(groups = {"beans", "entities"})
    public void setUp() throws DeploymentException {
        TestDeployer.deploy();
    }

    @AfterGroups(groups = {"beans", "entities"})
    public void destroy() throws DeploymentException {
        TestDeployer.unDeploy();
    }

    @AfterClass(dependsOnGroups = {"beans", "entities"})
    public void closeSessions() {
        if (this.session1!= null)
            this.session1.close();
        if (this.session2!= null)
            this.session2.close();
    }

    protected Session getSession1() {
        if (this.session1== null || !this.session1.isOpen()) {
            this.session1= TestDeployer.entityManagerFactory1.getSessionFactory().openSession();
        }
        return this.session1;
    }

    protected Session getSession2() {
        if (this.session2== null || !this.session2.isOpen()) {
            this.session2= TestDeployer.entityManagerFactory2.getSessionFactory().openSession();
        }
        return this.session2;
    }
}
</pre>
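<div>The isRunning flag in TestDeployer above is declared but never read. If deploy() or unDeploy() might be invoked more than once in a run (for example from several configuration methods), one option is to guard them so that repeated calls become no-ops. The following is only a sketch under that assumption: LifecycleGuard, startOnce and stopOnce are hypothetical names that are not part of embedded-jboss or TestNG, and the Runnable arguments stand in for the Bootstrap calls.</div>

```java
/**
 * Hypothetical helper (not part of embedded-jboss or TestNG):
 * makes a start/stop pair idempotent with a single boolean flag.
 */
final class LifecycleGuard {
    private boolean running = false;

    /** Runs the start action only if nothing is running; returns true if it ran. */
    synchronized boolean startOnce(Runnable start) {
        if (running) {
            return false;
        }
        start.run();
        // Only mark as running once the start action completed without throwing.
        running = true;
        return true;
    }

    /** Runs the stop action only if something is running; returns true if it ran. */
    synchronized boolean stopOnce(Runnable stop) {
        if (!running) {
            return false;
        }
        try {
            stop.run();
        } finally {
            // Reset the flag even if the stop action fails, so a later start is possible.
            running = false;
        }
        return true;
    }

    synchronized boolean isRunning() {
        return running;
    }
}
```

<div>With such a guard, deploy() could pass its body to startOnce and unDeploy() its body to stopOnce. Because Runnable cannot throw checked exceptions, a DeploymentException would have to be wrapped in an unchecked exception inside the action.</div>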
&nbsp;<br />&nbsp;<br />&nbsp;<br />4. Some notes on the changes I made to embedded-jboss:</div>
<div>&nbsp;&nbsp; 1). The embedded-jboss supports these features:<br />&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;
<pre name="code" class="html">JNDI (remoteable) 
JCA 
-ds.xml files (connection pooling) 
JBoss Security 
(removed)EJB 3.0 (remoteable) 
(removed)JBoss Messaging 
(removed)JMX mbeans (-service.xml files) 
(removed)MC beans (-beans.xml files) 
(removed)JBoss TS</pre>
</div>
<div>&nbsp;<br />&nbsp;&nbsp; 2). I removed the features marked "(removed)" above for two reasons. First, it speeds up the deployment of embedded-jboss. Second, there are some errors when all features are enabled; they might be related to Seam, and the EJB feature is too complex for me to track these errors down. It is still worth sorting out, though, because then I could unit test Seam features, which I can't do now.</div>
<div><br />&nbsp;&nbsp; 3). My changes to embedded-jboss (current version is beta3):&nbsp;&nbsp;&nbsp;<br />
<pre name="code" class="html">a. Removed all files under "bootstrap\deploy" except "jboss-local-jdbc.rar" 
b. Removed all files under "bootstrap\deployers" except "jca-deployers-beans.xml" and "security-deployer-beans.xml" 
c. Added the file "standardjbosscmp-jdbc.xml" (taken from jboss-4.2.0.GA) to "bootstrap" 
d. Modified the file "bootstrap\conf\jboss-service.xml", appending this line at the end:
&lt;mbean code="org.jboss.ejb.plugins.cmp.jdbc.metadata.MetaDataLibrary" name="jboss.jdbc:service=metadata"/&gt;
The -ds.xml file takes effect only after steps c and d.</pre>
</div>
<div>&nbsp;<br />5. At present, I use only these features from embedded-jboss: JNDI, JCA, -ds.xml file support, and Security. They behave the same as in the web application.&nbsp; <br />&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; <br />6. What I have recorded here are just some tips; for details, consult the official documentation.&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; <br />&nbsp;&nbsp;&nbsp; <br />7. Our company wiki only accepts posts in English, so I pasted this over as is; thanks for your understanding.</div>
          <br/>
          ]]>
        </description>
        <pubDate>Fri, 25 Apr 2008 15:13:05 +0800</pubDate>
        <link>http://jason823.javaeye.com/blog/186798</link>
        <guid>http://jason823.javaeye.com/blog/186798</guid>
      </item>
      <item>
        <title>Configuring a JBoss + Seam (EJB3|JSF) development environment in IDEA 7.0</title>
        <author>jason823</author>
        <description>
          <![CDATA[
          <br/>
          Author: <a href="http://jason823.javaeye.com">jason823</a>&nbsp;
          Link: <a href="http://jason823.javaeye.com/blog/186789" style="color:red;">http://jason823.javaeye.com/blog/186789</a>&nbsp;
          Published: April 25, 2008
          <br/><br/>
          <br/><br/>
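          <p>1. Facet-EJB<br />&nbsp; ① In Deployment Descriptors, specify ejb-jar.xml (/META-INF), and add the required resources seam.properties (/) and persistence.xml (/META-INF).<br />&nbsp; ② Because the application is ultimately packaged as an ear file, some jar files must be placed in the root of that file (the application determines which ones; they should be described in application.xml). First add these jar files to the dependency libraries (classpath) of the module that owns this facet; they then appear under Modules and Libraries to Package. For each jar file, set the Packaging Method option to "Link via manifest and copy files to".<br />&nbsp; ③ Under Source roots for EJB classes, check the module's src folder.&nbsp; <br />&nbsp; ④ In Java EE Build Settings, check Create EJB Module Jar file (choose the jar file name yourself); leave Create EJB Module exploded directory unchecked for now.</p>
<p>&nbsp;</p>
<p>2. Facet-WEB<br />&nbsp; ① In Deployment Descriptors, specify web.xml (/WEB-INF).<br />&nbsp; ② Under Modules and Libraries to Package, configure the jar files that must be deployed to /WEB-INF/lib. For each jar file, set the Packaging Method option to "copy files to", then enter /WEB-INF/lib as the relative path.<br />&nbsp; ③ Under Web resource directories, add the project's web resource directory.<br />&nbsp; ④ Under Source roots for EJB classes, check the module's src folder.&nbsp; <br />&nbsp; ⑤ In Java EE Build Settings, check Create web Module War file (choose the war file name yourself); leave Create Web Module exploded directory unchecked for now.<br />&nbsp; ⑥ Once the steps above are done, add a JSF facet under this Web facet, specifying the directory that holds faces-config.xml.</p>
<p>&nbsp;</p>
<p>3. Facet-JavaEEApplication<br />&nbsp; ① In Deployment Descriptors, specify application.xml (/META-INF) and jboss-app.xml (/META-INF), and add any other resources the application needs.<br />&nbsp; ② Under Modules and Libraries to Package, the EJB facet and Web facet configured earlier appear in the list. Set the Packaging Method of both to "Include Facet in Build", enter the corresponding archive file name for each, and set the Web facet's Context Root.<br />&nbsp; ③ In Java EE Build Settings, check Create application archive (ear) file (choose the ear file name yourself); leave Create application exploded directory unchecked for now.&nbsp;</p>
<p>&nbsp;</p>
<p>4. For all the facets in steps 1 to 3 above, the Create&nbsp;XXXXX exploded directory option in Java EE Build&nbsp;Settings must in the end be checked, and the directory name you specify must end with the same corresponding .jar|.war|.ear name. Otherwise IDEA cannot recognize the configuration correctly: it reports an XXXXX extension error, and web resources cannot be hot-deployed (packaging file) when the application is deployed. Exclude from module content must also be checked.</p>
<p>&nbsp;</p>
<p>5. Add a JBoss Local configuration: first set Application Server to JBoss, then set the Server Instance. Deployment then lists the facets of one or more modules (depending on your application). Check only the module entry under Facet-JavaEEApplication for deployment, and select the exploded directory configured earlier as the Deployment Source.</p>
<p><br />That is the configuration of a JBoss development environment in IDEA. It is somewhat complex, but once you understand how JBoss and Seam are configured and which files they require, it is no longer a problem.</p>
<p>I recommend also configuring Ant for routine deployment and startup. That may also make it possible to use IDEA's remote debugging mode; if anyone has good experience with this, please share.</p>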
          <br/>
          <br/><br/><br/>
          ]]>
        </description>
        <pubDate>Fri, 25 Apr 2008 15:03:40 +0800</pubDate>
        <link>http://jason823.javaeye.com/blog/186789</link>
        <guid>http://jason823.javaeye.com/blog/186789</guid>
      </item>
      <item>
        <title>A preliminary summary of using Heritrix</title>
        <author>jason823</author>
        <description>
          <![CDATA[
          <br/>
          Author: <a href="http://jason823.javaeye.com">jason823</a>&nbsp;
          Link: <a href="http://jason823.javaeye.com/blog/84206" style="color:red;">http://jason823.javaeye.com/blog/84206</a>&nbsp;
          Published: May 29, 2007
          <br/><br/>
          <br/><br/>
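          <div><strong><font size="3">I. Framework overview</font></strong></div>
<div>&nbsp;</div>
<div>A recent project at my company needs full-text search over the page content of a number of websites, which requires a web crawler.</div>
<div>&nbsp;</div>
<div>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; There are currently two main candidates: Heritrix and Nutch. Both are open-source Java frameworks: Heritrix is an open-source project on SourceForge, and Nutch is an Apache subproject. Both are web crawlers/spiders (<font color="#0000ff"> Web Crawler</font>) and work on essentially the same principle: they traverse a site's resources in depth and fetch them to local storage. To do so, they analyze every valid URI on the site and submit HTTP requests to obtain the responses, producing local files and the corresponding logs.</div>
<div>&nbsp;</div>
<div>Below is an introduction to the two, taken from the web:</div>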
<blockquote dir="ltr">
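<div><font color="#000080"><font size="2">Heritrix is an &quot;archival crawler&quot; -- it is used to obtain a complete, accurate, deep copy of a site's content, including images and other non-text content. </font><font size="2">It <strong>fetches and stores</strong> the relevant content. It accepts whatever content it finds and does not modify pages. Re-crawling the same URL does not replace the earlier copy. The crawler is launched, monitored, and adjusted through a </font><font size="2">web user interface, and allows flexible definition of the URLs to fetch.</font></font></div>
</blockquote><blockquote dir="ltr">
<div><font size="2" color="#000080">Differences between the two:</font></div>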
<div>
<ul>
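    <li><font size="2" color="#000080">Nutch fetches and saves only indexable content. Heritrix takes everything, aiming to preserve pages as they are. </font></li>
    <li><font size="2" color="#000080">Nutch can trim content or convert its format. </font></li>
    <li><font size="2" color="#000080">Nutch saves content in a database-optimized format for later indexing, and replaces old content on refresh. Heritrix adds (appends) new content. </font></li>
    <li><font size="2" color="#000080">Nutch is run and controlled from the command line. Heritrix has a web management interface. </font></li>
    <li><font size="2" color="#000080">Nutch's customizability is limited, although it has improved somewhat. Heritrix exposes more controllable parameters.</font></li>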
</ul>
</div>
</blockquote>
<div><font color="#99cc00">&nbsp;</font></div>
<div><font size="3" color="#000000"><strong>II. A preliminary summary of using Heritrix</strong></font></div>
<div><strong><font size="3" color="#000000"></font></strong>&nbsp;</div>
<div><font color="#000000"><font size="2">I have done some preliminary evaluation testing of <strong>Heritrix</strong> and have a few conclusions:</font></font></div>
<div><font size="2" color="#000000"></font>&nbsp;</div>
<div><font size="2" color="#000000"><strong>1. Installation:</strong></font></div>
<div><font size="2" color="#000000"></font>&nbsp;</div>
<div><font size="2" color="#000000">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; The current version is 1.12.1, and the official site is&nbsp;<a href="http://crawler.archive.org/"><u><font color="#810081">http://crawler.archive.org/</font></u></a>. Installation is conventional: unpack the archive to a directory, then set the system environment variable &quot;HERITRIX_HOME&quot; to that directory (this assumes the Java environment is already configured).</font></div>
<div><font size="2" color="#000000"></font>&nbsp;</div>
<div><font size="2" color="#000000"></font>&nbsp;</div>
<div><font size="2" color="#000000"><strong>2. Post-installation work:</strong></font></div>
<div><font size="2" color="#000000"></font>&nbsp;</div>
<div><font size="2" color="#000000">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; Unpack %HERITRIX_HOME%\heritrix-1.12.1.jar into a temporary directory and copy its profiles directory into %HERITRIX_HOME%\conf\. This works around a bug in Heritrix's default profile configuration.</font></div>
<div><font size="2" color="#000000"></font>&nbsp;</div>
<div><font size="2" color="#000000"></font>&nbsp;</div>
<div><font size="2" color="#000000"><strong>3. Configuring the admin account:</strong></font></div>
<div><font size="2" color="#000000"></font>&nbsp;</div>
<div><font size="2" color="#000000">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; Copy %HERITRIX_HOME%\conf\jmxremote.password.template to %HERITRIX_HOME%\ and rename it to &quot;jmxremote.password&quot;. Then edit the password entries in that file:</font></div>
<div><font size="2" color="#000000">monitorRole&nbsp; @PASSWORD@&nbsp;&nbsp;==&gt;&nbsp; monitorRole&nbsp; admin<br />
controlRole&nbsp; @PASSWORD@&nbsp;&nbsp;==&gt;&nbsp; controlRole&nbsp; admin</font></div>
<div><font size="2" color="#000000">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; After making the changes, save the file and set its attribute to &quot;read-only&quot;. Then comes a very important step: in the properties window of jmxremote.password, open the &quot;Security&quot; tab and confirm under its first item, &quot;Group or user names&quot;, that the file is owned only by your current system user and not by a user group (such as Administrators). This is probably a bug in Heritrix's security mechanism. Otherwise, running Heritrix reports a permission error telling you to make jmxremote.password &quot;read-only&quot;, even though you have already done so.</font></div>
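<div>The copy, edit, and read-only steps above can be sketched in code. This is only an illustration under stated assumptions: JmxPasswordFile, fill and install are hypothetical names, the paths are supplied by the caller, and the sketch does not touch file ownership, which still has to be checked by hand as described above.</div>

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

/** Hypothetical helper for preparing jmxremote.password from its template. */
final class JmxPasswordFile {
    static final String PLACEHOLDER = "@PASSWORD@";

    /** Pure text step: substitute every placeholder with the chosen password. */
    static String fill(String template, String password) {
        return template.replace(PLACEHOLDER, password);
    }

    /** Reads the template, writes the filled-in copy, then marks the copy read-only. */
    static void install(Path template, Path target, String password) throws IOException {
        String text = new String(Files.readAllBytes(template), StandardCharsets.UTF_8);
        Files.write(target, fill(text, password).getBytes(StandardCharsets.UTF_8));
        // Same effect as ticking "read-only" in the file properties dialog.
        if (!target.toFile().setReadOnly()) {
            throw new IOException("could not mark " + target + " read-only");
        }
    }
}
```

<div>For the example in this post, template would be %HERITRIX_HOME%\conf\jmxremote.password.template, target would be %HERITRIX_HOME%\jmxremote.password, and password would be admin.</div>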
<div><font size="2" color="#000000"></font>&nbsp;</div>
<div>&nbsp;</div>
<div><font size="2" color="#000000"><strong>4. Running Heritrix:</strong></font></div>
<div><font size="2" color="#000000"></font>&nbsp;</div>
<div><font size="2" color="#000000">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; In CMD, change to %HERITRIX_HOME%\bin and run &quot;heritrix --admin=admin:admin&quot; to start Heritrix. Note that Heritrix uses port 8080 by default, so make sure that port is not already in use. You can then open <a href="http://127.0.0.1:8080/">http://127.0.0.1:8080</a>&nbsp;to use the WUI that Heritrix provides, i.e. its web admin console, and log in with &quot;admin/admin&quot;.</font></div>
<div><font size="2" color="#000000">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; The admin console exposes all the configuration features that Heritrix provides by default, and lets you create a job and run it to crawl a website.</font></div>
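<div>One way to check for a conflict on port 8080 before starting Heritrix is to try binding the port briefly. The following is a minimal sketch, not part of Heritrix: PortCheck and isPortFree are hypothetical names, and the probe binds only the loopback address, so it is a quick local check rather than a guarantee.</div>

```java
import java.io.IOException;
import java.net.InetAddress;
import java.net.ServerSocket;

/** Hypothetical helper: reports whether a local TCP port can currently be bound. */
final class PortCheck {
    static boolean isPortFree(int port) {
        // Binding succeeds only if no other listener holds the port; the probe
        // socket is closed again immediately by try-with-resources.
        try (ServerSocket probe = new ServerSocket(port, 1, InetAddress.getLoopbackAddress())) {
            return true;
        } catch (IOException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println("Port 8080 free: " + isPortFree(8080));
    }
}
```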
<div><font size="2" color="#000000"></font>&nbsp;</div>
<div><font size="2" color="#000000"></font>&nbsp;</div>
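<div><font size="2" color="#000000"><strong>5. A simple job:</strong></font></div>
<div><font size="2" color="#000000">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;Heritrix's configuration options are very rich but also complex, and at first it is hard to create and run a job that crawls a site correctly. After reading most of the Heritrix user documentation and many attempts, I put together a simple use case for creating and running a job: <font color="#000000"><strong>crawl the pages under </strong></font><a href="http://www.baidu.com/"><font color="#000000"><strong>www.baidu.com</strong></font></a><strong>, but not its subdomains (such as news.baidu.com)</strong>. The steps are as follows, for reference:</font></div>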
<blockquote dir="ltr">
<div>(1) In the navigation bar at the top of the WUI, select &quot;Jobs&quot;. The first item shown is &quot;Create New Job&quot;; choose its fourth sub-item, &quot;With defaults&quot;. Of the input fields, the first two, Name and Description, can be anything. Seeds is very important: <a href="http://www.baidu.com/">http://www.baidu.com/</a>&nbsp;(note that the trailing slash is required).</div>
<div>&nbsp;</div>
<div>(2) Select &quot;Modules&quot; below to open the module configuration page (Heritrix's extension features are all implemented as modules, and you can write your own modules for whatever you need). For the first item, &quot;<strong>Select Crawl Scope</strong>&quot;, keep the default &quot;org.archive.crawler.deciderules.DecidingScope&quot;. For the third item from the bottom, &quot;<strong>Select Writers</strong>&quot;, remove the default &quot;org.archive.crawler.writer.ARCWriterProcessor&quot; and then add &quot;org.archive.crawler.writer.MirrorWriterProcessor&quot;. Crawled pages are then stored as a mirror in a local directory structure instead of being written to ARC archive files.</div>
<div>&nbsp;</div>
<div>(3) Select &quot;Submodules&quot; to the right of &quot;Modules&quot;. In the first section, &quot;<strong>crawl-order</strong> -&gt;<strong>scope</strong>-&gt;<strong>decide-rules</strong>-&gt;<strong>rules</strong>&quot;, delete the scope rule &quot;acceptIfTranscluded&quot; (<em><font size="2">org.archive.crawler.deciderules.TransclusionDecideRule</font></em>). Otherwise, when an HTTP request returns 301 or 302, Heritrix will crawl pages in other domains.&nbsp;</div>
<div>&nbsp;</div>
<div>(4) In the second navigation row of the WUI, select &quot;Settings&quot; to open the job's settings page. Two items mainly need changing: user-agent and from under <strong>http-headers</strong>. Replace &quot;PROJECT_URL_HERE&quot; and &quot;CONTACT_EMAIL_ADDRESS_HERE&quot; in them with your own values (&quot;PROJECT_URL_HERE&quot; must start with &quot;http://&quot;).</div>
<div>&nbsp;</div>
<div>(5) In the second navigation row of the WUI, select &quot;Submit job&quot; at the far right.</div>
<div>&nbsp;</div>
<div>(6) In the first navigation row of the WUI, select the first item, &quot;Console&quot;, and click &quot;Start&quot;. The crawl then begins; how long it takes depends on network conditions and the depth of the site being crawled.</div>
</blockquote>
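<div>The scoping goal of this job (accept www.baidu.com, reject subdomains such as news.baidu.com and any other domain) comes down to an exact host comparison. The sketch below only illustrates that idea; HostScope and inScope are hypothetical names, and this is not Heritrix's DecideRule API or the way DecidingScope is implemented.</div>

```java
import java.net.URI;

/** Hypothetical illustration of "same host only" scoping. */
final class HostScope {
    private final String seedHost;

    HostScope(String seedUrl) {
        String host = URI.create(seedUrl).getHost();
        if (host == null) {
            throw new IllegalArgumentException("seed has no host: " + seedUrl);
        }
        this.seedHost = host;
    }

    /** True only when the candidate URL has exactly the seed's host (case-insensitive). */
    boolean inScope(String candidateUrl) {
        String host;
        try {
            host = URI.create(candidateUrl).getHost();
        } catch (IllegalArgumentException e) {
            return false; // not a parseable URI, so never in scope
        }
        return host != null && host.equalsIgnoreCase(seedHost);
    }
}
```

<div>With the seed http://www.baidu.com/, this check accepts http://www.baidu.com/more/ and rejects http://news.baidu.com/.</div>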
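<div dir="ltr"><font size="2"><font color="#000000">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;Following the steps above should run a site crawl correctly. The crawled pages are stored in the mirror folder under your working directory. The various settings involved in creating and running a job are all described in detail in the user manual.</font></font></div>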
          <br/>
          <br/><br/><br/>
          ]]>
        </description>
        <pubDate>Tue, 29 May 2007 14:01:33 +0800</pubDate>
        <link>http://jason823.javaeye.com/blog/84206</link>
        <guid>http://jason823.javaeye.com/blog/84206</guid>
      </item>
  </channel>
</rss>