WebMagic, Part 1: Integrating WebMagic with Spring Boot to crawl multiple sites on a schedule

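Recently my manager asked me to maintain an old crawler. One look at the code and I lost all desire to touch it, so I decided to build a new crawler from scratch. Java has relatively few crawler frameworks, at least compared with Python, and in the end I chose WebMagic as the development framework.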

I'm writing this post because I ran into quite a few headaches during actual development, and I want to record them here.

WebMagic Chinese documentation: Introduction · WebMagic Documents

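The best way to learn something from scratch is to read the official documentation, or to read posts where experienced developers walk through it.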

I'm using Spring Boot here, with the dependencies added through Maven:

<dependency>
    <groupId>us.codecraft</groupId>
    <artifactId>webmagic-core</artifactId>
    <version>0.7.3</version>
</dependency>
<dependency>
    <groupId>us.codecraft</groupId>
    <artifactId>webmagic-extension</artifactId>
    <version>0.7.3</version>
</dependency>

The Java code copied from the official site is as follows:

package com.example.testspringboot.processor;

import us.codecraft.webmagic.Page;
import us.codecraft.webmagic.Site;
import us.codecraft.webmagic.Spider;
import us.codecraft.webmagic.processor.PageProcessor;

public class GithubRepoPageProcessor implements PageProcessor {

    private Site site = Site.me().setRetryTimes(3).setSleepTime(100);

    @Override
    public void process(Page page) {
        page.addTargetRequests(page.getHtml().links().regex("(https://github\\.com/\\w+/\\w+)").all());
        page.putField("author", page.getUrl().regex("https://github\\.com/(\\w+)/.*").toString());
        page.putField("name", page.getHtml().xpath("//h1[@class='entry-title public']/strong/a/text()").toString());
        if (page.getResultItems().get("name")==null){
            //skip this page
            page.setSkip(true);
        }
        page.putField("readme", page.getHtml().xpath("//div[@id='readme']/tidyText()"));
    }

    @Override
    public Site getSite() {
        return site;
    }

    public static void main(String[] args) {
        Spider.create(new GithubRepoPageProcessor()).addUrl("https://github.com/code4craft").thread(5).run();
    }

}

Click main to run it: go!!!

09:22:53.972 [pool-1-thread-1] WARN us.codecraft.webmagic.downloader.HttpClientDownloader - download page https://github.com/code4craft error
javax.net.ssl.SSLHandshakeException: No appropriate protocol (protocol is disabled or cipher suites are inappropriate)
	at sun.security.ssl.HandshakeContext.<init>(HandshakeContext.java:171)
	at sun.security.ssl.ClientHandshakeContext.<init>(ClientHandshakeContext.java:101)
	at sun.security.ssl.TransportContext.kickstart(TransportContext.java:238)
	at sun.security.ssl.SSLSocketImpl.startHandshake(SSLSocketImpl.java:394)
	at sun.security.ssl.SSLSocketImpl.startHandshake(SSLSocketImpl.java:373)
	at org.apache.http.conn.ssl.SSLConnectionSocketFactory.createLayeredSocket(SSLConnectionSocketFactory.java:436)
	at org.apache.http.conn.ssl.SSLConnectionSocketFactory.connectSocket(SSLConnectionSocketFactory.java:384)
	at org.apache.http.impl.conn.DefaultHttpClientConnectionOperator.connect(DefaultHttpClientConnectionOperator.java:142)
	at org.apache.http.impl.conn.PoolingHttpClientConnectionManager.connect(PoolingHttpClientConnectionManager.java:376)
	at org.apache.http.impl.execchain.MainClientExec.establishRoute(MainClientExec.java:393)
	at org.apache.http.impl.execchain.MainClientExec.execute(MainClientExec.java:236)
	at org.apache.http.impl.execchain.ProtocolExec.execute(ProtocolExec.java:186)
	at org.apache.http.impl.execchain.RetryExec.execute(RetryExec.java:89)
	at org.apache.http.impl.execchain.RedirectExec.execute(RedirectExec.java:110)
	at org.apache.http.impl.client.InternalHttpClient.doExecute(InternalHttpClient.java:185)
	at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:83)
	at us.codecraft.webmagic.downloader.HttpClientDownloader.download(HttpClientDownloader.java:85)
	at us.codecraft.webmagic.Spider.processRequest(Spider.java:404)
	at us.codecraft.webmagic.Spider.access$000(Spider.java:61)
	at us.codecraft.webmagic.Spider$1.run(Spider.java:320)
	at us.codecraft.webmagic.thread.CountableThreadPool$1.run(CountableThreadPool.java:74)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:748)
09:22:54.081 [main] INFO us.codecraft.webmagic.Spider - Spider github.com closed! 1 pages downloaded.

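An error at the very first step did not make me happy. After some searching, I found that the problem is caused by the TLS security settings in newer JDK builds, which disable the older protocol versions.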

My current JDK version is 1.8.0_291:

C:\Users\MI>java -version
java version "1.8.0_291"
Java(TM) SE Runtime Environment (build 1.8.0_291-b10)
Java HotSpot(TM) 64-Bit Server VM (build 25.291-b10, mixed mode)

C:\Users\MI>
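Before touching any configuration, it can help to see what the running JVM actually allows. The following is a minimal diagnostic sketch using only standard JSSE calls; the class name `TlsDiagnostics` is made up for illustration, and the output will differ from one JDK build to another.

```java
import java.security.NoSuchAlgorithmException;
import java.security.Security;
import java.util.Arrays;
import java.util.List;
import javax.net.ssl.SSLContext;

// Hypothetical helper class, not part of WebMagic.
class TlsDiagnostics {

    /** Protocols the default SSLContext enables for new connections. */
    static List<String> enabledProtocols() {
        return Arrays.asList(defaultContext().getDefaultSSLParameters().getProtocols());
    }

    /** Protocols the JVM implements, including ones that are not enabled by default. */
    static List<String> supportedProtocols() {
        return Arrays.asList(defaultContext().getSupportedSSLParameters().getProtocols());
    }

    private static SSLContext defaultContext() {
        try {
            return SSLContext.getDefault();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("No default SSLContext available", e);
        }
    }

    public static void main(String[] args) {
        System.out.println("java.version = " + System.getProperty("java.version"));
        // The same property that is edited in java.security in the next step.
        System.out.println("jdk.tls.disabledAlgorithms = "
                + Security.getProperty("jdk.tls.disabledAlgorithms"));
        System.out.println("enabled   = " + enabledProtocols());
        System.out.println("supported = " + supportedProtocols());
    }
}
```

If TLSv1 shows up under "supported" but not under "enabled", a client that insists on TLSv1 has nothing left to offer, which is consistent with the "No appropriate protocol" message above.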

Find the JDK installation directory and edit the java.security configuration file.

In the configuration file, find the following:

jdk.tls.disabledAlgorithms=SSLv3, TLSv1, TLSv1.1, RC4, DES, MD5withRSA, \
    DH keySize < 1024, EC keySize < 224, 3DES_EDE_CBC, anon, NULL, \
    include jdk.disabled.namedCurves

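Remove SSLv3, TLSv1, and TLSv1.1, so that it reads as follows:

jdk.tls.disabledAlgorithms=RC4, DES, MD5withRSA, \
    DH keySize < 1024, EC keySize < 224, 3DES_EDE_CBC, anon, NULL, \
    include jdk.disabled.namedCurves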
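The edit itself is just dropping three entries from a comma-separated list. Here is a small sketch of that transformation as plain string handling; `DisabledAlgorithmsEditor` and `without` are hypothetical names used only for illustration, and the input is the property value from above with the line continuations joined.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical helper, shown only to illustrate the manual edit.
class DisabledAlgorithmsEditor {

    /** Returns the comma-separated value with the exactly matching entries dropped. */
    static String without(String value, String... namesToDrop) {
        Set<String> drop = new HashSet<>(Arrays.asList(namesToDrop));
        List<String> kept = new ArrayList<>();
        for (String entry : value.split(",")) {
            String trimmed = entry.trim();
            // Exact match only, so dropping "TLSv1" does not touch "TLSv1.1".
            if (!trimmed.isEmpty() && !drop.contains(trimmed)) {
                kept.add(trimmed);
            }
        }
        return String.join(", ", kept);
    }

    public static void main(String[] args) {
        String original = "SSLv3, TLSv1, TLSv1.1, RC4, DES, MD5withRSA, "
                + "DH keySize < 1024, EC keySize < 224, 3DES_EDE_CBC, anon, NULL, "
                + "include jdk.disabled.namedCurves";
        System.out.println(without(original, "SSLv3", "TLSv1", "TLSv1.1"));
    }
}
```

In principle the resulting value could also be applied at startup with `java.security.Security.setProperty("jdk.tls.disabledAlgorithms", ...)` before the first TLS connection is made, instead of editing the file. That route is an untested assumption here. Either way, re-enabling these old protocols weakens the security of every program that uses this JDK.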

What are these protocols for? This article explains it clearly: 一文讲清SSL协议_sslv3-CSDN博客

Running it again still failed.

The fix first: upgrade webmagic to 0.10.0:

        <dependency>
            <groupId>us.codecraft</groupId>
            <artifactId>webmagic-core</artifactId>
            <version>0.10.0</version>
        </dependency>
        <dependency>
            <groupId>us.codecraft</groupId>
            <artifactId>webmagic-extension</artifactId>
            <version>0.10.0</version>
        </dependency>

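The reason for upgrading: in 0.7.3 the default HttpClient only makes requests with TLSv1, so sites that only support TLS 1.2 produce an error. See "Https下无法抓取只支持TLS1.2的站点" (cannot crawl sites that only support TLS 1.2 over HTTPS) · Issue #701 · code4craft/webmagic · GitHub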
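To see why a TLSv1-only client cannot talk to a TLS 1.2-only site, here is a deliberately simplified model of version negotiation: both sides need at least one protocol version in common, and the newest shared one is chosen. This is not the real handshake, and the protocol lists in `main` are assumptions picked for illustration, not measured values for any particular site.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Optional;

// Simplified model only; real TLS negotiation involves much more than this.
class TlsVersionModel {

    // Known protocol names, ordered from oldest to newest.
    private static final List<String> ORDER =
            Arrays.asList("SSLv3", "TLSv1", "TLSv1.1", "TLSv1.2", "TLSv1.3");

    /** Newest protocol offered by both sides, or empty when they share none. */
    static Optional<String> negotiate(List<String> client, List<String> server) {
        for (int i = ORDER.size() - 1; i >= 0; i--) {
            String protocol = ORDER.get(i);
            if (client.contains(protocol) && server.contains(protocol)) {
                return Optional.of(protocol);
            }
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        // Assumed server configuration: only modern versions accepted.
        List<String> server = Arrays.asList("TLSv1.2", "TLSv1.3");

        // Client pinned to TLSv1, as described for webmagic 0.7.3.
        System.out.println(negotiate(Arrays.asList("TLSv1"), server));

        // Client that also offers TLSv1.2.
        System.out.println(negotiate(Arrays.asList("TLSv1", "TLSv1.1", "TLSv1.2"), server));
    }
}
```

In this model, re-enabling TLSv1 in java.security only lets the client send its TLSv1 offer. The server still has no version in common with it, which is why the upgrade is the real fix.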

WebMagic releases: Releases · code4craft/webmagic · GitHub

After running it again, everything works.

The next post starts the actual crawling work.
