Some Hints for Nutch

I haven't followed Nutch for a while, but reading the mailing list taught me a few small Nutch tricks.

How do you index sites with dynamic URLs?

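Adjust the regex-urlfilter.txt or crawl-urlfilter.txt file. See the rule that follows the comment line "# skip URLs containing certain characters as probable queries,"; that rule is what keeps URLs with query strings out of the crawl.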
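
To make it clearer why that one rule blocks dynamic URLs, here is a minimal Python sketch of how I understand the filter file to work: each rule is a "+" (accept) or "-" (reject) followed by a regular expression, and the first rule whose pattern is found in the URL decides. This is not Nutch code. The function name url_filter, the two rule lists, and the example URLs are assumptions for illustration, and the "-[?*!@=]" rule is what I remember the stock file containing, so check your own copy.

```python
import re

# Assumed rule set in the style of regex-urlfilter.txt:
# first character is the sign, the rest is the regular expression.
DEFAULT_RULES = [
    "-[?*!@=]",  # skip URLs containing certain characters as probable queries
    "+.",        # accept anything else
]

# Hypothetical relaxed variant: "?" and "=" are no longer rejected,
# so URLs with query strings can pass.
RELAXED_RULES = [
    "-[*!@]",
    "+.",
]


def url_filter(url, rules):
    """Return True if the first matching rule accepts the URL.

    A URL that matches no rule at all is treated as rejected.
    """
    for rule in rules:
        sign, pattern = rule[0], rule[1:]
        if re.search(pattern, url):
            return sign == "+"
    return False


if __name__ == "__main__":
    dynamic = "http://example.com/page.php?id=1"  # made-up example URL
    for name, rules in (("default", DEFAULT_RULES), ("relaxed", RELAXED_RULES)):
        mark = "+" if url_filter(dynamic, rules) else "-"
        print(name, mark + dynamic)
```

Because the order of rules matters, loosening the character class in the reject rule (or commenting the rule out) is enough; the catch-all "+." at the end then accepts the dynamic URL.
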
Building Nutch requires Ant 1.6 or later.

Checking that regex-urlfilter works correctly (by Michael Nebel):

If you want to know whether your regex-urlfilter works as expected, you can
check it with the command:
cat FILE-WITH-URLS | nutch net/nutch/net/RegexURLFilter
or by calling "nutch net/nutch/net/RegexURLFilter" and entering the URLs
by hand.
Every line beginning with a "+" is accepted; a line beginning with a "-" is
rejected. For example:
$ echo "http://www.nutch.org" | nutch net/nutch/net/RegexURLFilter
run with heapsize 256
-Xmx256m
050202 173520 loading file:/home/nutch/nutch-0.7/conf/nutch-default.xml
050202 173520 loading file:/home/nutch/nutch-0.7/conf/nutch-site.xml
050202 173520 found resource regex-urlfilter.txt at file:/home/nutch/nutch-0.7/conf/regex-urlfilter.txt
