Project repositories
https://github.com/zhupingqi/RuiJi.Net
https://gitee.com/zhupingqi/RuiJi.Net
Documentation
http://www.ruijihg.com/archives/ruijinet/getting-started
RuiJi.Net crawler framework discussion group: 545931923
RuiJi.Net
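RuiJi.Net is a distributed crawling framework written in C#.
RuiJi.Net supports self-hosting and provides distributed crawling, extraction, and automatic cookie management.
RuiJi.Net supports server-side IP rotation for requests and access through proxy servers (not yet complete).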
Notice
The project is still under development.
Features
Crawler
Feature | Support |
---|---|
webheader | custom |
method | get/post |
auto redirection | support |
cookie | managed/custom |
service point ip | auto/custom Bind |
encoding | auto-detect/user-specified |
response | raw/string |
proxy | future additions |
Extractor
Feature | Support |
---|---|
selector | css/xpath/regex/json/text range/exclude text/clear |
extract structure | block/tile/meta |
jsonconvert | extractblock |
About the extraction structure
Examples
Using RuiJi.Net.Core directly
// Fetch the page with IPCrawler and read the response body as a string.
var crawler = new IPCrawler();
var request = new Request("http://www.ruijihg.com/%e5%bc%80%e5%8f%91/");
var response = crawler.Request(request);
var content = response.Data.ToString();

// The block narrows extraction to the .entry-content element.
var block = new ExtractBlock();
block.Selectors = new List<ISelector>
{
    new CssSelector(".entry-content", CssTypeEnum.InnerHtml)
};

// Each .pt-cv-content-item inside the block is a tile (a repeated region).
block.TileSelector = new ExtractTile
{
    Selectors = new List<ISelector>
    {
        new CssSelector(".pt-cv-content-item", CssTypeEnum.InnerHtml)
    }
};

// Metas extracted from every tile: the title and the href of the "read more" link.
block.TileSelector.Metas.AddMeta("title", new List<ISelector> {
    new CssSelector(".pt-cv-title")
});
block.TileSelector.Metas.AddMeta("url", new List<ISelector> {
    new CssSelector(".pt-cv-readmore", "href")
});

// Run the extraction against the downloaded content.
var ext = new RuiJiExtracter();
var r = ext.Extract(content, block);
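The URL passed to Request above is the percent-encoded form of http://www.ruijihg.com/开发/. As a small aside, the sketch below shows how that path segment could be encoded or decoded with the standard .NET `Uri` helpers. The `UrlSegments` class and its method names are hypothetical and are not part of RuiJi.Net.

```csharp
using System;

// Hypothetical helper for illustration only; not part of RuiJi.Net.
public static class UrlSegments
{
    // Percent-encode a single path segment as UTF-8.
    // Uri.EscapeDataString emits uppercase hex digits.
    public static string Encode(string segment)
    {
        return Uri.EscapeDataString(segment);
    }

    // Decode a percent-encoded segment back to text.
    // The case of the hex digits does not matter when decoding.
    public static string Decode(string encoded)
    {
        return Uri.UnescapeDataString(encoded);
    }
}
```

`UrlSegments.Encode("开发")` returns `%E5%BC%80%E5%8F%91`, and `UrlSegments.Decode("%e5%bc%80%e5%8f%91")` returns `开发`, so the lowercase form used in the example refers to the same page.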
Using the cluster
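- Download ZooKeeper: http://mirrors.hust.edu.cn/apache/zookeeper/zookeeper-3.4.12/
- In the conf folder, make a copy of zoo_sample.cfg and rename it to zoo.cfg. Change dataDir to your own path.
- Make sure a Java runtime environment is installed.
- Run bin/zkServer.cmd
- Run RuiJi.cmd.exe as administrator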
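For reference, a minimal standalone zoo.cfg for the steps above might look like the sketch below. The values mirror the defaults shipped in ZooKeeper's zoo_sample.cfg. The dataDir path is a placeholder assumption and should be replaced with a directory on your own machine.

```properties
# Basic time unit in milliseconds, used for heartbeats
tickTime=2000
# Number of ticks the initial synchronization phase may take
initLimit=10
# Number of ticks allowed between sending a request and receiving an acknowledgement
syncLimit=5
# Directory where snapshots are stored (placeholder path, change to your own)
dataDir=C:/zookeeper/data
# Port on which clients connect
clientPort=2181
```

The clientPort value of 2181 matches the port shown in the startup log of this README.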
Once startup completes, you will see the following output:
Server Start At http://x.x.x.x:x
proxy x.x.x.x:x ready to startup!
try connect to zookeeper server : x.x.x.x:2181
zookeeper server connected!
Then run the following code:
using RuiJi.Net.NodeVisitor;
....
// Request the page with the NodeVisitor Crawler and stop on a non-OK response.
var response = new Crawler().Request("http://www.ruijihg.com/%e5%bc%80%e5%8f%91/");
if (response.StatusCode != System.Net.HttpStatusCode.OK)
    return;
var content = response.Data.ToString();

// The block narrows extraction to the .entry-content element.
var block = new ExtractBlock();
block.Selectors = new List<ISelector>
{
    new CssSelector(".entry-content", CssTypeEnum.InnerHtml)
};

// Each .pt-cv-content-item inside the block is a tile (a repeated region).
block.TileSelector = new ExtractTile
{
    Selectors = new List<ISelector>
    {
        new CssSelector(".pt-cv-content-item", CssTypeEnum.InnerHtml)
    }
};

// Metas extracted from every tile: the title and the href of the "read more" link.
block.TileSelector.Metas.AddMeta("title", new List<ISelector> {
    new CssSelector(".pt-cv-title")
});
block.TileSelector.Metas.AddMeta("url", new List<ISelector> {
    new CssSelector(".pt-cv-readmore", "href")
});

// Submit the page content and the block as an extraction request.
var r = Extracter.Extract(new ExtractRequest {
    Blocks = new List<ExtractFeatureBlock> {
        new ExtractFeatureBlock
        {
            Block = block
        }
    },
    Content = content
});
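RuiJi expressions
RuiJi expressions are a way to add page extraction rules quickly without hard-coding them (soft coding). They are designed to be as simple and readable as possible.
Selectors are the selectors.
Tiles are the regions that need to be extracted repeatedly.
Metas are the metadata that need to be extracted.
Blocks are the sub-Blocks that need to be extracted inside a Block.
To extract content from http://www.ruijihg.com/开发 , first look at the structure of the page.
You can press F12 in the browser to inspect the page structure.
First, make sure the Block selector matches exactly one result.
The Block can be defined as follows: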
#content
css .pt-cv-view:ohtml
Next, add a tile:
[tile]
\t#tiles
\tcss .pt-cv-content-item:ohtml
\t[meta]
\t#title
\tcss .pt-cv-title:text
\t#content
\tcss .pt-cv-content:html
\tex 阅读更多... -e
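You may have noticed the \t. Both a block and a tile can contain metas, so the tile's selectors and the tile's metas are prefixed with \t to mark them as belonging to the current tile.
The complete structure of a Block description is as follows: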
[Block]
#blockname
selector
[blocks]
@subblockname1
@subblockname2
[tile]
\t#tilename
\ttile selector
\t[meta]
\t#meta1
\tselector
\t#meta2
\tselector
[meta]
#blockmeta1
selector
#blockmeta2
selector