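Recently my manager asked me to maintain an old crawler. One look at the code and I lost all desire to work on it, so I decided to write a new crawler from scratch. It turns out there aren't many crawler frameworks for Java, at least compared with Python, and in the end I chose WebMagic as the framework.
I'm writing this post to record the many headaches I ran into during actual development.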
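WebMagic documentation (Chinese): Introduction · WebMagic Documents
The best way to learn something from scratch is to read the documentation on the official site, or to read a post where someone more experienced walks through it.
I'm using Spring Boot here, with the dependencies pulled in through Maven: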
<dependency>
    <groupId>us.codecraft</groupId>
    <artifactId>webmagic-core</artifactId>
    <version>0.7.3</version>
</dependency>
<dependency>
    <groupId>us.codecraft</groupId>
    <artifactId>webmagic-extension</artifactId>
    <version>0.7.3</version>
</dependency>
The Java code copied from the official site is as follows:
package com.example.testspringboot.processor;

import us.codecraft.webmagic.Page;
import us.codecraft.webmagic.Site;
import us.codecraft.webmagic.Spider;
import us.codecraft.webmagic.processor.PageProcessor;

public class GithubRepoPageProcessor implements PageProcessor {

    // Crawl settings: retry count 3, sleep time 100 ms
    private Site site = Site.me().setRetryTimes(3).setSleepTime(100);

    @Override
    public void process(Page page) {
        // Queue every link on the page that looks like https://github.com/<owner>/<repo>
        page.addTargetRequests(page.getHtml().links().regex("(https://github\\.com/\\w+/\\w+)").all());
        // Extract the fields to store
        page.putField("author", page.getUrl().regex("https://github\\.com/(\\w+)/.*").toString());
        page.putField("name", page.getHtml().xpath("//h1[@class='entry-title public']/strong/a/text()").toString());
        if (page.getResultItems().get("name") == null) {
            // skip this page
            page.setSkip(true);
        }
        page.putField("readme", page.getHtml().xpath("//div[@id='readme']/tidyText()"));
    }

    @Override
    public Site getSite() {
        return site;
    }

    public static void main(String[] args) {
        // Start from one seed URL and crawl with 5 threads
        Spider.create(new GithubRepoPageProcessor()).addUrl("https://github.com/code4craft").thread(5).run();
    }
}
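To make the two URL patterns in `process` more concrete, here is a small sketch that runs the same regular expressions with plain `java.util.regex`, without WebMagic. The class and method names (`GithubUrlPatterns`, `repoLinks`, `author`) are made up for this illustration, and WebMagic's own regex selector may differ in details; the point is only to show which strings the patterns accept.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: plain-JDK illustration of the patterns used in process().
final class GithubUrlPatterns {
    // Same pattern as in addTargetRequests: https://github.com/<owner>/<repo>
    static final Pattern REPO_LINK = Pattern.compile("(https://github\\.com/\\w+/\\w+)");
    // Same pattern as for the "author" field: captures the first path segment
    static final Pattern AUTHOR = Pattern.compile("https://github\\.com/(\\w+)/.*");

    // Collects every substring of the text that matches the repository pattern.
    static List<String> repoLinks(String text) {
        List<String> out = new ArrayList<>();
        Matcher m = REPO_LINK.matcher(text);
        while (m.find()) {
            out.add(m.group(1));
        }
        return out;
    }

    // Returns the captured owner name, or null when the URL has no second path segment.
    static String author(String url) {
        Matcher m = AUTHOR.matcher(url);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        String html = "<a href=\"https://github.com/code4craft/webmagic\">webmagic</a>"
                + "<a href=\"https://github.com/code4craft\">profile</a>";
        System.out.println(repoLinks(html)); // [https://github.com/code4craft/webmagic]
        System.out.println(author("https://github.com/code4craft/webmagic")); // code4craft
        System.out.println(author("https://github.com/code4craft")); // null
    }
}
```

Note that `\w` does not match a hyphen, so with these patterns a repository name such as `my-repo` would be cut off at `my`.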
Run main: go!!!
09:22:53.972 [pool-1-thread-1] WARN us.codecraft.webmagic.downloader.HttpClientDownloader - download page https://github.com/code4craft error
javax.net.ssl.SSLHandshakeException: No appropriate protocol (protocol is disabled or cipher suites are inappropriate)
at sun.security.ssl.HandshakeContext.<init>(HandshakeContext.java:171)
at sun.security.ssl.ClientHandshakeContext.<init>(ClientHandshakeContext.java:101)
at sun.security.ssl.TransportContext.kickstart(TransportContext.java:238)
at sun.security.ssl.SSLSocketImpl.startHandshake(SSLSocketImpl.java:394)
at sun.security.ssl.SSLSocketImpl.startHandshake(SSLSocketImpl.java:373)
at org.apache.http.conn.ssl.SSLConnectionSocketFactory.createLayeredSocket(SSLConnectionSocketFactory.java:436)
at org.apache.http.conn.ssl.SSLConnectionSocketFactory.connectSocket(SSLConnectionSocketFactory.java:384)
at org.apache.http.impl.conn.DefaultHttpClientConnectionOperator.connect(DefaultHttpClientConnectionOperator.java:142)
at org.apache.http.impl.conn.PoolingHttpClientConnectionManager.connect(PoolingHttpClientConnectionManager.java:376)
at org.apache.http.impl.execchain.MainClientExec.establishRoute(MainClientExec.java:393)
at org.apache.http.impl.execchain.MainClientExec.execute(MainClientExec.java:236)
at org.apache.http.impl.execchain.ProtocolExec.execute(ProtocolExec.java:186)
at org.apache.http.impl.execchain.RetryExec.execute(RetryExec.java:89)
at org.apache.http.impl.execchain.RedirectExec.execute(RedirectExec.java:110)
at org.apache.http.impl.client.InternalHttpClient.doExecute(InternalHttpClient.java:185)
at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:83)
at us.codecraft.webmagic.downloader.HttpClientDownloader.download(HttpClientDownloader.java:85)
at us.codecraft.webmagic.Spider.processRequest(Spider.java:404)
at us.codecraft.webmagic.Spider.access$000(Spider.java:61)
at us.codecraft.webmagic.Spider$1.run(Spider.java:320)
at us.codecraft.webmagic.thread.CountableThreadPool$1.run(CountableThreadPool.java:74)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
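09:22:54.081 [main] INFO us.codecraft.webmagic.Spider - Spider github.com closed! 1 pages downloaded.
Failing at the very first step did not make me happy. After some searching, it turned out the problem comes from the security protocol settings in newer JDK builds, which disable the older TLS protocol versions by default.
My current JDK version is 1.8.0_291: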
C:\Users\MI>java -version
java version "1.8.0_291"
Java(TM) SE Runtime Environment (build 1.8.0_291-b10)
Java HotSpot(TM) 64-Bit Server VM (build 25.291-b10, mixed mode)
C:\Users\MI>
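Find the JDK installation directory and edit the java.security configuration file (on JDK 8 it is typically under jre\lib\security).
In that file, find the following entry: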
jdk.tls.disabledAlgorithms=SSLv3, TLSv1, TLSv1.1, RC4, DES, MD5withRSA, \
DH keySize < 1024, EC keySize < 224, 3DES_EDE_CBC, anon, NULL, \
include jdk.disabled.namedCurves
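Remove SSLv3, TLSv1, and TLSv1.1 from that list, leaving the rest of the entry unchanged.
What are these for? This article explains it clearly: 一文讲清SSL协议_sslv3-CSDN博客 (a CSDN blog post explaining the SSL protocol)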
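One way to check whether the edit to java.security is actually picked up by the JVM that runs the crawler is to print the property and the protocols the default `SSLContext` enables. This is only a minimal sketch using standard JDK calls; the class name `TlsSettingsCheck` is made up for this post.

```java
import java.security.NoSuchAlgorithmException;
import java.security.Security;
import java.util.Arrays;
import java.util.List;
import javax.net.ssl.SSLContext;

// Hypothetical helper: shows the TLS settings as seen by the running JVM.
final class TlsSettingsCheck {
    // Raw value of jdk.tls.disabledAlgorithms (may be null if the property is not set).
    static String disabledAlgorithms() {
        return Security.getProperty("jdk.tls.disabledAlgorithms");
    }

    // Protocols the default SSLContext enables for new connections.
    static List<String> enabledProtocols() {
        try {
            return Arrays.asList(SSLContext.getDefault().getDefaultSSLParameters().getProtocols());
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("No default SSLContext available", e);
        }
    }

    public static void main(String[] args) {
        System.out.println("jdk.tls.disabledAlgorithms = " + disabledAlgorithms());
        System.out.println("enabled protocols = " + enabledProtocols());
    }
}
```

If TLSv1 and TLSv1.1 still appear in the first line after the edit, the JVM is probably reading a different java.security file than the one that was changed, for example because several JDKs are installed.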
Run it again.
Straight to the fix: upgrade WebMagic to version 0.10.0:
<dependency>
    <groupId>us.codecraft</groupId>
    <artifactId>webmagic-core</artifactId>
    <version>0.10.0</version>
</dependency>
<dependency>
    <groupId>us.codecraft</groupId>
    <artifactId>webmagic-extension</artifactId>
    <version>0.10.0</version>
</dependency>
The reason for upgrading: in 0.7.3 the default HttpClient only uses TLSv1 for its requests, so it fails on sites that only support TLS 1.2. See: Https下无法抓取只支持TLS1.2的站点 · Issue #701 · code4craft/webmagic · GitHub
WebMagic releases: Releases · code4craft/webmagic · GitHub
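To illustrate what "only uses TLSv1" means on the JDK side, the sketch below pins a client `SSLEngine` to a given protocol list and tries to start a handshake, with no network connection involved. It is not WebMagic or HttpClient code, and the class name `PinnedProtocolCheck` is made up; it only assumes, as the issue describes, that the 0.7.3 client ends up restricted to TLSv1.

```java
import java.security.GeneralSecurityException;
import javax.net.ssl.SSLContext;
import javax.net.ssl.SSLEngine;
import javax.net.ssl.SSLException;

// Hypothetical helper: can a client limited to these protocols even begin a handshake?
final class PinnedProtocolCheck {
    static boolean canStartHandshake(String... protocols) {
        try {
            SSLContext ctx = SSLContext.getInstance("TLS");
            ctx.init(null, null, null); // default key managers, trust managers, randomness
            SSLEngine engine = ctx.createSSLEngine();
            engine.setUseClientMode(true);
            engine.setEnabledProtocols(protocols); // pin the client to this list
            engine.beginHandshake(); // no network I/O happens here
            return true;
        } catch (SSLException | IllegalArgumentException e) {
            // Protocol disabled by the JDK security settings, or not supported at all
            return false;
        } catch (GeneralSecurityException e) {
            throw new IllegalStateException("Could not create a TLS context", e);
        }
    }

    public static void main(String[] args) {
        System.out.println("TLSv1 only:   " + canStartHandshake("TLSv1"));
        System.out.println("TLSv1.2 only: " + canStartHandshake("TLSv1.2"));
    }
}
```

On a JDK whose java.security disables TLSv1, the first call is expected to return false, which corresponds to the "No appropriate protocol" error in the log above. A client that offers TLSv1.2 does not depend on the old protocols being re-enabled.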
Run it again and it works.
In the next post, the actual crawling work begins.