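Spring Boot Crawler
Preface
This article exists only to learn about HTTP requests, jsoup, and Spring Boot integration. It is not meant to harvest data deliberately; it simply records the learning process.
What is a crawler
Crawler overview
A web crawler (also called a web spider or web robot, and in the FOAF community more often a web chaser) is a program or script that automatically fetches information from the World Wide Web according to a set of rules.
Put simply, you write a script that imitates a browser, sends requests, and collects the data that comes back.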
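Crawler types
General purpose web crawler: crawls the source of entire pages. This is the fetching system (the crawler proper).
Focused web crawler: crawls only part of the data on a page (data parsing).
Incremental web crawler: monitors a site for updates and crawls only the newly published data.
Deep web crawler: by how they exist, web pages can be divided into surface web pages and deep web pages (also called invisible web pages or the hidden web). Surface pages are those that traditional search engines can index, mostly static pages reachable through hyperlinks. Deep web pages are those whose content mostly cannot be reached through static links: it sits behind search forms and only appears after a user submits keywords.
Anti-crawling mechanisms and counter-strategies
Crawler: any technical means of collecting a website's information in bulk.
Anti-crawler: any technical means of stopping others from collecting your own website's information in bulk.
Anti-crawling methods:
The robots.txt protocol
UA restrictions (the User-Agent is the browser identifier sent when a user visits a site). Counter-strategy: send randomized request headers.
IP restrictions (limiting how often and how many times one IP may visit). Counter-strategy: build your own IP proxy pool and pick a proxy at random for each request.
Ajax dynamic loading. Counter-strategy: use the browser's inspect tool to find the link behind the request. Search the response of the page URL for the content you need; if it is not there, click any request in the left panel and use Ctrl+F to search globally until you find the matching request (recommended packet capture tool: Fiddler).
CAPTCHA challenges, or pages that require a simulated login
Cookie restrictions
Crawler case study
Requirements
The previous sections listed several kinds of crawler. Here we use a focused web crawler to fetch car review data from Autohome (汽车之家): https://www.autohome.com.cn/bestauto/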
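We need to fetch all of the car review data on Autohome.
Looking at the page, we need to capture the following parts:
The ranking is generated dynamically, so we do not crawl it here; it can be handled separately later.
Each review has 5 images. The page shows thumbnails, so we have to follow the hyperlink to get the URL of the full-size image and then download the image separately.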
Environment setup
Technologies used
JDK 1.8+
Spring Boot 2.x
MyBatis-Plus
Spring MVC
HttpClient
Jsoup
Quartz
Creating the project
Dependencies
    <parent>
        <groupId>org.springframework.boot</groupId>
        <artifactId>spring-boot-starter-parent</artifactId>
        <version>2.1.3.RELEASE</version>
    </parent>
    <properties>
        <project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
        <project.reporting.outputEncoding>UTF-8</project.reporting.outputEncoding>
        <java.version>1.8</java.version>
        <mybatisplus.version>3.3.2</mybatisplus.version>
        <alibaba.boot.druid>1.1.22</alibaba.boot.druid>
    </properties>

    <dependencies>
        <!-- Spring MVC -->
        <dependency>
            <groupId>org.springframework.boot</groupId>
            <artifactId>spring-boot-starter-web</artifactId>
        </dependency>
        <!-- Testing -->
        <dependency>
            <groupId>org.springframework.boot</groupId>
            <artifactId>spring-boot-starter-test</artifactId>
            <scope>test</scope>
        </dependency>
        <!-- MyBatis-Plus -->
        <dependency>
            <groupId>com.baomidou</groupId>
            <artifactId>mybatis-plus-boot-starter</artifactId>
            <version>${mybatisplus.version}</version>
            <exclusions>
                <exclusion>
                    <groupId>com.baomidou</groupId>
                    <artifactId>mybatis-plus-generator</artifactId>
                </exclusion>
            </exclusions>
        </dependency>
        <!-- MySQL driver -->
        <dependency>
            <groupId>mysql</groupId>
            <artifactId>mysql-connector-java</artifactId>
            <scope>runtime</scope>
        </dependency>
        <!-- Druid connection pool -->
        <dependency>
            <groupId>com.alibaba</groupId>
            <artifactId>druid-spring-boot-starter</artifactId>
            <version>${alibaba.boot.druid}</version>
        </dependency>
        <!-- Utilities -->
        <dependency>
            <groupId>org.apache.commons</groupId>
            <artifactId>commons-lang3</artifactId>
            <version>3.3.2</version>
        </dependency>
        <dependency>
            <groupId>commons-io</groupId>
            <artifactId>commons-io</artifactId>
            <version>2.6</version>
        </dependency>
        <!-- Quartz scheduled jobs -->
        <dependency>
            <groupId>org.springframework.boot</groupId>
            <artifactId>spring-boot-starter-quartz</artifactId>
        </dependency>
        <!-- HttpClient for fetching pages -->
        <dependency>
            <groupId>org.apache.httpcomponents</groupId>
            <artifactId>httpclient</artifactId>
        </dependency>
        <!-- Jsoup for parsing HTML -->
        <dependency>
            <groupId>org.jsoup</groupId>
            <artifactId>jsoup</artifactId>
            <version>1.10.3</version>
        </dependency>
        <!-- Lombok -->
        <dependency>
            <groupId>org.projectlombok</groupId>
            <artifactId>lombok</artifactId>
            <version>1.18.6</version>
            <optional>true</optional>
        </dependency>
    </dependencies>
Configuration
Configuration path: src/main/resources
Configure the overall environment first, so that switching between environments is easy:
    spring:
      profiles:
        active: dev
      application:
        name: spider-autohome
Create the test environment configuration, application-dev.yml:
    server:
      port: 8080
      tomcat:
        max-swallow-size: 100MB

    spring:
      datasource:
        druid:
          type: com.alibaba.druid.pool.DruidDataSource
          driverClassName: com.mysql.cj.jdbc.Driver
          url: jdbc:mysql://192.168.56.120:3306/spider-autohome?serverTimezone=Asia/Shanghai&characterEncoding=utf8&useSSL=false
          username: root
          password: 123456
          # Connection pool sizing
          initial-size: 5
          min-idle: 5
          max-active: 30
          max-wait: 60000
          time-between-eviction-runs-millis: 60000
          min-evictable-idle-time-millis: 300000
          validation-query: select '1' from dual
          test-while-idle: true
          test-on-borrow: false
          test-on-return: false
          pool-prepared-statements: true
          max-open-prepared-statements: 50
          max-pool-prepared-statement-per-connection-size: 20
          filters: stat
          # Druid monitoring console
          stat-view-servlet:
            url-pattern: /druid/*
            reset-enable: false
            login-username: admin
            login-password: 123456
          web-stat-filter:
            url-pattern: /*
            exclusions: "*.js,*.gif,*.jpg,*.bmp,*.png,*.css,*.ico,/druid/*"
      servlet:
        multipart:
          max-file-size: 50MB
          max-request-size: 50MB

    mybatis-plus:
      configuration:
        # Print SQL to the console
        log-impl: org.apache.ibatis.logging.stdout.StdOutImpl
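Getting started with Spring Boot
We start with a small introductory program to get familiar with Spring Boot.
Requirement: open a URL in the browser and get the current time from the database.
Create the bootstrap class
    import org.springframework.boot.SpringApplication;
    import org.springframework.boot.autoconfigure.SpringBootApplication;

    @SpringBootApplication
    public class SpiderAutoHomeApplication {
        public static void main(String[] args) {
            SpringApplication.run(SpiderAutoHomeApplication.class, args);
        }
    }
Write the test DAO
    import org.apache.ibatis.annotations.Mapper;
    import org.apache.ibatis.annotations.Select;

    @Mapper
    public interface TestDao {
        // Ask the database for its current date and time
        @Select("SELECT NOW()")
        String queryNowDate();
    }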
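Write the test service
Create a service folder and add the TestService file:
    public interface TestService {
        // Returns the current database time
        String queryNowDate();
    }
Write the test service implementation
Create an impl folder under the service folder and add the TestServiceImpl file:
    import org.springframework.beans.factory.annotation.Autowired;
    import org.springframework.stereotype.Service;

    @Service
    public class TestServiceImpl implements TestService {

        @Autowired
        private TestDao testDao;

        @Override
        public String queryNowDate() {
            return testDao.queryNowDate();
        }
    }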
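Write the request controller
Create a controller folder and add the TestController file:
    import org.springframework.beans.factory.annotation.Autowired;
    import org.springframework.web.bind.annotation.GetMapping;
    import org.springframework.web.bind.annotation.RequestMapping;
    import org.springframework.web.bind.annotation.RestController;

    @RestController
    @RequestMapping("/test")
    public class TestController {

        @Autowired
        private TestService testService;

        // GET /test/queryNowDate returns the current database time
        @GetMapping(value = "/queryNowDate")
        public String queryNowDate() {
            return testService.queryNowDate();
        }
    }
Start and test
Run the application bootstrap class, then open the test URL in a browser: http://localhost:8080/test/queryNowDate
Check the response: 2021-05-09 09:31:42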
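Development analysis
Flow analysis
Inspection shows that the URL of a review page is:
https://www.autohome.com.cn/bestauto/1
The last path segment is the page number, so we only need to start at page 1 and fetch every page in order.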
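The page-number scheme described above can be sketched in a few lines of plain Java. This is only an illustration: the class name ReviewPageUrls and the method between are made up for this example, and the only thing taken from the article is the base URL.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper: builds the review-page URLs to crawl, one per page number. */
final class ReviewPageUrls {
    // Base URL from the article; the page number is appended as the last path segment.
    static final String BASE_URL = "https://www.autohome.com.cn/bestauto/";

    /** Returns the URLs for pages firstPage..lastPage, both inclusive. */
    static List<String> between(int firstPage, int lastPage) {
        if (firstPage < 1 || lastPage < firstPage) {
            throw new IllegalArgumentException("expected 1 <= firstPage <= lastPage");
        }
        List<String> urls = new ArrayList<>();
        for (int page = firstPage; page <= lastPage; page++) {
            urls.add(BASE_URL + page);
        }
        return urls;
    }

    public static void main(String[] args) {
        // Print the URLs of the first three pages.
        between(1, 3).forEach(System.out::println);
    }
}
```

Generating the list up front keeps the "which pages" decision separate from the fetching code, which makes it easy to crawl only a few pages while testing.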
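The page-fetching flow is as follows.
Steps for fetching review data
1. Fetch the HTML page for the URL.
2. Parse the HTML page and extract all review records on it.
3. Iterate over the review records.
4. Check whether the current record has already been saved. If it has, move on to the next record; if not, continue with the next step.
5. Save the review record to the database.
Database table design
Based on the requirements above, design the database table. The SQL is as follows: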
    CREATE TABLE `car_test` (
      `id` bigint(10) NOT NULL AUTO_INCREMENT COMMENT 'Primary key id',
      `title` varchar(100) NOT NULL COMMENT 'Name of the reviewed car',
      `test_speed` int(150) DEFAULT NULL COMMENT 'Test item: acceleration (0-100 km/h), in milliseconds',
      `test_brake` int(150) DEFAULT NULL COMMENT 'Test item: braking (100-0 km/h), in millimetres',
      `test_oil` int(150) DEFAULT NULL COMMENT 'Test item: measured fuel consumption (L/100 km), in millilitres',
      `editor_name1` varchar(10) DEFAULT NULL COMMENT 'Review editor 1',
      `editor_remark1` varchar(1000) DEFAULT NULL COMMENT 'Comment 1',
      `editor_name2` varchar(10) DEFAULT NULL COMMENT 'Review editor 2',
      `editor_remark2` varchar(1000) DEFAULT NULL COMMENT 'Comment 2',
      `editor_name3` varchar(10) DEFAULT NULL COMMENT 'Review editor 3',
      `editor_remark3` varchar(1000) DEFAULT NULL COMMENT 'Comment 3',
      `image` varchar(1000) DEFAULT NULL COMMENT 'Review images: 5 image file names separated by commas',
      `created` datetime DEFAULT NULL COMMENT 'Creation time',
      `updated` datetime DEFAULT NULL COMMENT 'Update time',
      PRIMARY KEY (`id`)
    ) ENGINE=InnoDB AUTO_INCREMENT=7 DEFAULT CHARSET=utf8 COMMENT='Autohome review table';
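Development preparation
Write the entity
Create a module folder (the name is a matter of personal preference) and add CarTest.
The entity object maps to the database table:
    import java.util.Date;

    import com.baomidou.mybatisplus.annotation.IdType;
    import com.baomidou.mybatisplus.annotation.TableId;
    import com.baomidou.mybatisplus.annotation.TableName;
    import lombok.Data;

    @Data
    @TableName(value = "car_test")
    public class CarTest {
        // Auto-increment primary key
        @TableId(type = IdType.AUTO)
        private Long id;
        private String title;
        // Measurements are stored as integers in thousandths of the displayed unit
        private int testSpeed;
        private int testBrake;
        private int testOil;
        private String editorName1;
        private String editorRemark1;
        private String editorName2;
        private String editorRemark2;
        private String editorName3;
        private String editorRemark3;
        // Comma-separated image file names
        private String image;
        private Date created;
        private Date updated;
    }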
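Write the DAO
    import com.baomidou.mybatisplus.core.mapper.BaseMapper;
    import org.apache.ibatis.annotations.Mapper;

    @Mapper
    public interface CarTestDao extends BaseMapper<CarTest> {
    }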
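Write the service
Create CarTestService under the service folder:
    import com.baomidou.mybatisplus.extension.plugins.pagination.Page;
    import com.baomidou.mybatisplus.extension.service.IService;

    public interface CarTestService extends IService<CarTest> {
        // Pages through the saved records, returning only the title column
        Page<CarTest> queryTitleByPage(long page, long pageSize);
    }
Write the service implementation
Create CarTestServiceImpl in the impl folder under the service folder:
    import com.baomidou.mybatisplus.core.conditions.query.QueryWrapper;
    import com.baomidou.mybatisplus.extension.plugins.pagination.Page;
    import com.baomidou.mybatisplus.extension.service.impl.ServiceImpl;
    import org.springframework.stereotype.Service;

    @Service
    public class CarTestServiceImpl extends ServiceImpl<CarTestDao, CarTest> implements CarTestService {

        @Override
        public Page<CarTest> queryTitleByPage(long page, long pageSize) {
            Page<CarTest> queryPage = new Page<>(page, pageSize);
            QueryWrapper<CarTest> queryWrapper = new QueryWrapper<>();
            // Only the title is needed to initialise the de-duplication filter
            queryWrapper.select("title");
            // Note: paging only takes effect when the MyBatis-Plus pagination interceptor is registered
            return baseMapper.selectPage(queryPage, queryWrapper);
        }
    }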
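Crawling the data
HTTP connection pool manager
We crawl the data with HTTP requests, so we need something to manage the HTTP connections. We define an HTTP connection pool manager and hand it over to Spring.
Two annotations are used:
@Configuration declares a configuration class.
@Bean declares how the instance is created.
Create a config folder and add HttpClientManagerCfg:
    import org.apache.http.impl.conn.PoolingHttpClientConnectionManager;
    import org.springframework.context.annotation.Bean;
    import org.springframework.context.annotation.Configuration;

    @Configuration
    public class HttpClientManagerCfg {

        @Bean
        public PoolingHttpClientConnectionManager poolingHttpClientConnectionManager() {
            PoolingHttpClientConnectionManager httpClientConnectionManager = new PoolingHttpClientConnectionManager();
            // Maximum number of connections in the whole pool
            httpClientConnectionManager.setMaxTotal(50);
            // Maximum number of concurrent connections per route (host)
            httpClientConnectionManager.setDefaultMaxPerRoute(20);
            return httpClientConnectionManager;
        }
    }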
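Closing expired connections on a schedule
A Quartz scheduled job closes expired connections periodically.
Create a job folder, add the CloseHttpConnectionJob file, and write the job:
    import lombok.extern.slf4j.Slf4j;
    import org.apache.http.impl.conn.PoolingHttpClientConnectionManager;
    import org.quartz.DisallowConcurrentExecution;
    import org.quartz.JobExecutionContext;
    import org.quartz.JobExecutionException;
    import org.springframework.context.ApplicationContext;
    import org.springframework.scheduling.quartz.QuartzJobBean;

    @Slf4j
    @DisallowConcurrentExecution // never run two instances of this job at the same time
    public class CloseHttpConnectionJob extends QuartzJobBean {

        @Override
        protected void executeInternal(JobExecutionContext context) throws JobExecutionException {
            // The Spring context is stored in the job data map under the key "context"
            ApplicationContext applicationContext =
                    (ApplicationContext) context.getJobDetail().getJobDataMap().get("context");
            PoolingHttpClientConnectionManager httpClientPool =
                    applicationContext.getBean(PoolingHttpClientConnectionManager.class);
            // Close the connections in the pool that have expired
            httpClientPool.closeExpiredConnections();
            log.info(">>>>>>>>>>>>>>>>>>>>>>>> closeExpiredConnections");
        }
    }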
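Scheduled job configuration
Create the QuartzConfig file in the config directory:
    import org.quartz.CronTrigger;
    import org.springframework.beans.factory.annotation.Qualifier;
    import org.springframework.context.annotation.Bean;
    import org.springframework.context.annotation.Configuration;
    import org.springframework.scheduling.quartz.CronTriggerFactoryBean;
    import org.springframework.scheduling.quartz.JobDetailFactoryBean;
    import org.springframework.scheduling.quartz.SchedulerFactoryBean;

    @Configuration
    public class QuartzConfig {

        // Job definition for closing expired HTTP connections
        @Bean("closeHttpConnectionJob")
        public JobDetailFactoryBean closeHttpConnectionJob() {
            JobDetailFactoryBean jobDetailFactoryBean = new JobDetailFactoryBean();
            // Expose the Spring context to the job under the key "context"
            jobDetailFactoryBean.setApplicationContextJobDataKey("context");
            jobDetailFactoryBean.setJobClass(CloseHttpConnectionJob.class);
            // Keep the job definition even when no trigger points at it
            jobDetailFactoryBean.setDurability(true);
            return jobDetailFactoryBean;
        }

        // Trigger for the job above
        @Bean("closeHttpConnectionJobTrigger")
        public CronTriggerFactoryBean closeHttpConnectionJobTrigger(
                @Qualifier(value = "closeHttpConnectionJob") JobDetailFactoryBean itemJobBean) {
            CronTriggerFactoryBean trigger = new CronTriggerFactoryBean();
            trigger.setJobDetail(itemJobBean.getObject());
            // Fire every 5 seconds
            trigger.setCronExpression("0/5 * * * * ? ");
            return trigger;
        }

        // Scheduler factory; Spring injects every CronTrigger bean into the array
        @Bean
        public SchedulerFactoryBean schedulerFactory(CronTrigger[] cronTriggerImpl) {
            SchedulerFactoryBean bean = new SchedulerFactoryBean();
            bean.setTriggers(cronTriggerImpl);
            return bean;
        }
    }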
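Write the API service interface
Two kinds of download are needed:
Fetch page data (GET)
Download an image (GET)
Create an api.service directory and add the AutoHomeApiService file:
    public interface AutoHomeApiService {

        // Fetches the page at the given URL and returns its HTML, or null on failure
        String getHtml(String url);

        // Downloads the image at the given URL and returns the saved file name, or null on failure
        String getImage(String url);
    }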
Write the API service implementation
Create an impl folder under api.service and add the AutoHomeApiServiceImpl file:
    import java.io.File;
    import java.io.FileOutputStream;
    import java.io.IOException;
    import java.io.OutputStream;
    import java.nio.charset.StandardCharsets;
    import java.util.UUID;

    import lombok.extern.slf4j.Slf4j;
    import org.apache.http.Header;
    import org.apache.http.client.methods.CloseableHttpResponse;
    import org.apache.http.client.methods.HttpGet;
    import org.apache.http.impl.client.CloseableHttpClient;
    import org.apache.http.impl.client.HttpClients;
    import org.apache.http.impl.conn.PoolingHttpClientConnectionManager;
    import org.apache.http.util.EntityUtils;
    import org.springframework.beans.factory.annotation.Autowired;
    import org.springframework.stereotype.Service;

    @Service
    @Slf4j
    public class AutoHomeApiServiceImpl implements AutoHomeApiService {

        @Autowired
        private PoolingHttpClientConnectionManager connectionManager;

        @Override
        public String getHtml(String url) {
            // The client uses the shared connection pool, so the client itself is not closed here
            CloseableHttpClient httpClient = HttpClients.custom().setConnectionManager(connectionManager).build();
            HttpGet httpGet = new HttpGet(url);
            // try-with-resources closes the response and releases its connection
            try (CloseableHttpResponse httpResponse = httpClient.execute(httpGet)) {
                if (httpResponse.getStatusLine().getStatusCode() == 200 && httpResponse.getEntity() != null) {
                    return EntityUtils.toString(httpResponse.getEntity(), StandardCharsets.UTF_8);
                }
            } catch (IOException e) {
                log.error("Failed to fetch Autohome page: {}", url, e);
            }
            return null;
        }

        @Override
        public String getImage(String url) {
            CloseableHttpClient httpClient = HttpClients.custom().setConnectionManager(connectionManager).build();
            HttpGet httpGet = new HttpGet(url);
            try (CloseableHttpResponse httpResponse = httpClient.execute(httpGet)) {
                if (httpResponse.getStatusLine().getStatusCode() != 200 || httpResponse.getEntity() == null) {
                    return null;
                }
                // Only save responses that declare an image content type
                Header contentType = httpResponse.getFirstHeader("Content-Type");
                if (contentType == null || !contentType.getValue().contains("image/")) {
                    return null;
                }
                // Use the subtype as the file extension, e.g. "image/jpeg" gives "jpeg"
                String extName = contentType.getValue().split("/")[1];
                // A random UUID avoids file name collisions
                String fileName = UUID.randomUUID().toString().replace("-", "") + "." + extName;
                // The target directory must already exist; the stream is closed after writing
                try (OutputStream os = new FileOutputStream(new File("D:/test/autohome-image/" + fileName))) {
                    httpResponse.getEntity().writeTo(os);
                }
                return fileName;
            } catch (IOException e) {
                log.error("Failed to download Autohome review image: {}", url, e);
            }
            return null;
        }
    }
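getImage derives the file extension by splitting the Content-Type header on "/". A header that carries parameters (for example "image/jpeg; charset=binary") or is missing altogether would trip that up. The following is a minimal sketch of a more defensive parser, in plain Java with no HttpClient types; the class name ImageContentTypes and the method extensionOf are hypothetical names invented for this example, not part of the article's code.

```java
import java.util.Locale;
import java.util.Optional;

/** Hypothetical helper: derives a file extension from a Content-Type header value. */
final class ImageContentTypes {

    /** Returns the extension for image media types, or empty for anything else. */
    static Optional<String> extensionOf(String contentType) {
        if (contentType == null) {
            return Optional.empty();
        }
        // Drop parameters such as "; charset=binary" and normalise the case.
        String mediaType = contentType.split(";", 2)[0].trim().toLowerCase(Locale.ROOT);
        if (!mediaType.startsWith("image/")) {
            return Optional.empty();
        }
        String subtype = mediaType.substring("image/".length());
        // Structured subtypes such as "svg+xml" keep only the part before '+'.
        int plus = subtype.indexOf('+');
        if (plus >= 0) {
            subtype = subtype.substring(0, plus);
        }
        return subtype.isEmpty() ? Optional.empty() : Optional.of(subtype);
    }

    public static void main(String[] args) {
        System.out.println(extensionOf("image/jpeg"));
        System.out.println(extensionOf("text/html; charset=UTF-8"));
    }
}
```

Returning Optional makes the "not an image" case explicit, so the caller can skip the download instead of writing a file with a wrong extension.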
Test the API service implementation
We use Spring Boot's test support here, which needs two annotations:
@RunWith(value = SpringJUnit4ClassRunner.class) runs the test inside the Spring environment, so test code can be written the same way as application code, for example by injecting beans with @Autowired.
@SpringBootTest(classes = SpiderAutoHomeApplication.class) marks the class as a Spring Boot test and names the application class to bootstrap. The test code is as follows:
    import org.junit.Test;
    import org.junit.runner.RunWith;
    import org.springframework.beans.factory.annotation.Autowired;
    import org.springframework.boot.test.context.SpringBootTest;
    import org.springframework.test.context.junit4.SpringJUnit4ClassRunner;

    @RunWith(SpringJUnit4ClassRunner.class)
    @SpringBootTest(classes = SpiderAutoHomeApplication.class)
    public class AutoHomeApiServiceTest {

        @Autowired
        private AutoHomeApiService autoHomeApiService;

        // Fetch the review list page and print its HTML
        @Test
        public void getHtml() {
            String html = autoHomeApiService.getHtml("https://www.autohome.com.cn/bestauto/");
            System.out.println("html = " + html);
        }

        // Download one image and print the saved file name
        @Test
        public void getImage() {
            String image = autoHomeApiService.getImage("https://car2.autoimg.cn/cardfs/product/g24/M09/AE/EB/800x0_1_q87_autohomecar__wKgHIVpxGh6AFSN1AAY8kcz3Aww921.jpg");
            System.out.println("image = " + image);
        }
    }
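De-duplication filter
De-duplication is unavoidable when running a web crawler. Here the crawled content has to be filtered by car model name, so that the same record is not saved to the database over and over.
The traditional approach uses a Map, a Set, or a hash table. With small data volumes that works fine. Once the crawl grows large, however, it becomes a serious problem: the collection takes up a great deal of memory and system resources and can bring the crawler down. For that reason we use a Bloom filter here.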
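For comparison, the traditional Set-based approach mentioned above can be sketched as follows. The class name SetDeduplicator and the sample titles are made up for illustration. It is exact (no false positives), but every title stays in memory for as long as the crawler runs.

```java
import java.util.HashSet;
import java.util.Set;

/** Hypothetical illustration of exact de-duplication with a HashSet. */
final class SetDeduplicator {
    // Every distinct title seen so far is kept in memory.
    private final Set<String> seen = new HashSet<>();

    /** Returns true the first time a title is offered and false for every repeat. */
    boolean firstTimeSeen(String title) {
        // Set.add returns false when the element was already present.
        return seen.add(title);
    }

    /** Number of distinct titles remembered so far. */
    int size() {
        return seen.size();
    }

    public static void main(String[] args) {
        SetDeduplicator dedup = new SetDeduplicator();
        // Sample titles are invented for this example.
        for (String title : new String[] {"Car A", "Car B", "Car A"}) {
            System.out.println(title + " new? " + dedup.firstTimeSeen(title));
        }
    }
}
```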
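Introduction to the Bloom filter
A Bloom filter is mainly used to decide whether an element belongs to a set. It represents the set compactly with a bit array. Its space efficiency and query time are far better than those of ordinary approaches, but it has a certain probability of false positives, so it suits scenarios that can tolerate them. If the Bloom filter says an element is in the set, the element is very likely in the set; if it says the element is not in the set, the element is definitely not in the set. It is often used for de-duplicating large data sets.
Algorithm idea
The main idea of the Bloom filter is to compute k different hash values with k hash functions, map each value to an index of the bit array, and set the bit at each of those indexes to 1. To test whether an element is in the set, compute the same k hash values and check whether every corresponding bit is 1. If even one bit is not 1, the element is not in the set. The opposite mistake is possible, though: the filter may report an element as present when it is not, because all of its bits happened to be set by other elements. This is the source of the false positive rate. The idea of the Bloom filter is illustrated in the figure below: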
Bloom filter implementation
    import java.util.BitSet;

    public class TitleFilter {

        // Size of the bit array: 2 << 24 = 33,554,432 bits (a power of two, so hashing can use a bit mask)
        private static final int DEFAULT_SIZE = 2 << 24;
        // One seed per hash function; 7 seeds give 7 hash functions
        private static final int[] seeds = new int[]{5, 7, 11, 13, 31, 37, 61};
        private BitSet bits = new BitSet(DEFAULT_SIZE);
        private SimpleHash[] func = new SimpleHash[seeds.length];

        public TitleFilter() {
            for (int i = 0; i < seeds.length; i++) {
                func[i] = new SimpleHash(DEFAULT_SIZE, seeds[i]);
            }
        }

        // Marks a value as present by setting one bit per hash function
        public void add(String value) {
            if (value == null) {
                return;
            }
            for (SimpleHash f : func) {
                bits.set(f.hash(value), true);
            }
        }

        // Returns false if the value was definitely never added, true if it probably was
        public boolean contains(String value) {
            if (value == null) {
                return false;
            }
            for (SimpleHash f : func) {
                if (!bits.get(f.hash(value))) {
                    return false;
                }
            }
            return true;
        }

        // Simple seeded polynomial hash over the characters of the string
        public static class SimpleHash {
            private int cap;
            private int seed;

            public SimpleHash(int cap, int seed) {
                this.cap = cap;
                this.seed = seed;
            }

            public int hash(String value) {
                int result = 0;
                int len = value.length();
                for (int i = 0; i < len; i++) {
                    result = seed * result + value.charAt(i);
                }
                // cap is a power of two, so the mask keeps the index in [0, cap)
                return (cap - 1) & result;
            }
        }
    }
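To get a feel for how often TitleFilter might misjudge, the standard approximation for the false positive rate of a Bloom filter with m bits, k hash functions and n inserted elements is p ≈ (1 − e^(−kn/m))^k. The sketch below plugs in the filter's parameters (m = 2 << 24, k = 7). Note the assumption: the formula presumes independent, uniformly distributed hash functions, which the simple polynomial hashes in TitleFilter only approximate, so treat the numbers as a rough guide. The class name BloomFilterMath is made up for this example.

```java
/** Hypothetical helper: estimates the false positive rate of a Bloom filter. */
final class BloomFilterMath {

    /**
     * Standard approximation p = (1 - e^(-k*n/m))^k for a filter with m bits,
     * k hash functions and n inserted elements. Assumes ideal hash functions.
     */
    static double falsePositiveRate(long m, int k, long n) {
        double fractionOfBitsSet = 1.0 - Math.exp(-(double) k * n / m);
        return Math.pow(fractionOfBitsSet, k);
    }

    public static void main(String[] args) {
        long m = 2L << 24; // bit array size used by TitleFilter
        int k = 7;         // number of seeds, i.e. hash functions, in TitleFilter
        for (long n : new long[] {100_000L, 1_000_000L, 10_000_000L}) {
            System.out.printf("n=%d -> estimated false positive rate %.3e%n",
                    n, falsePositiveRate(m, k, n));
        }
    }
}
```

The estimate grows quickly as n approaches the size of the bit array, which is why the array size should be chosen with the expected number of records in mind.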
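Initialising the de-duplication filter
The de-duplication filter should be created as soon as the project starts.
The following code initialises the filter.
Add a paging query method to CarTestService
    // Pages through the saved records, returning only the title column
    Page<CarTest> queryTitleByPage(long page, long pageSize);
Implement the paging query method of CarTestService
    @Override
    public Page<CarTest> queryTitleByPage(long page, long pageSize) {
        Page<CarTest> queryPage = new Page<>(page, pageSize);
        QueryWrapper<CarTest> queryWrapper = new QueryWrapper<>();
        // Only the title is needed to initialise the filter
        queryWrapper.select("title");
        return baseMapper.selectPage(queryPage, queryWrapper);
    }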
Implement the filter initialisation
    import com.baomidou.mybatisplus.extension.plugins.pagination.Page;
    import org.springframework.beans.factory.annotation.Autowired;
    import org.springframework.context.annotation.Bean;
    import org.springframework.context.annotation.Configuration;

    @Configuration
    public class TitleFilterConfig {

        @Autowired
        private CarTestService carTestService;

        @Bean
        public TitleFilter titleFilter() {
            TitleFilter titleFilter = new TitleFilter();
            long page = 1;
            long pageSize = 5000;
            boolean hasMorePages = true;
            // Load the saved titles page by page and add each one to the filter
            do {
                Page<CarTest> carTestPage = carTestService.queryTitleByPage(page, pageSize);
                if (!carTestPage.hasNext()) {
                    hasMorePages = false;
                } else {
                    page += 1;
                }
                for (CarTest record : carTestPage.getRecords()) {
                    titleFilter.add(record.getTitle());
                }
            } while (hasMorePages);
            return titleFilter;
        }
    }
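The paging loop in TitleFilterConfig can be tried out without a database. The sketch below is a stand-in, not the article's code: PagedLoader, loadAll and the fetchPage function are hypothetical, and it stops when a page comes back shorter than pageSize rather than calling Page.hasNext() as the configuration class does.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.BiFunction;

/** Hypothetical helper: drains a paged data source into one list. */
final class PagedLoader {

    /**
     * Calls fetchPage(pageNumber, pageSize) for pages 1, 2, ... and stops after
     * the first page that holds fewer than pageSize records.
     */
    static <T> List<T> loadAll(BiFunction<Integer, Integer, List<T>> fetchPage, int pageSize) {
        if (pageSize <= 0) {
            throw new IllegalArgumentException("pageSize must be positive");
        }
        List<T> all = new ArrayList<>();
        int page = 1;
        while (true) {
            List<T> records = fetchPage.apply(page, pageSize);
            all.addAll(records);
            if (records.size() < pageSize) {
                return all; // a short (or empty) page means there is nothing left
            }
            page++;
        }
    }

    public static void main(String[] args) {
        // In-memory stand-in for the database table; the titles are invented.
        List<String> table = Arrays.asList("Car A", "Car B", "Car C", "Car D", "Car E");
        List<String> loaded = loadAll((page, size) -> {
            int from = Math.min((page - 1) * size, table.size());
            int to = Math.min(from + size, table.size());
            return table.subList(from, to);
        }, 2);
        System.out.println(loaded);
    }
}
```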
Implementing the crawl
We implement the crawling logic first, inside a test method.
Crawl test method
    import java.util.ArrayList;
    import java.util.Date;
    import java.util.List;

    import org.apache.commons.lang3.StringUtils;
    import org.jsoup.Jsoup;
    import org.jsoup.nodes.Document;
    import org.jsoup.nodes.Element;
    import org.jsoup.select.Elements;
    import org.junit.Test;
    import org.junit.runner.RunWith;
    import org.springframework.beans.factory.annotation.Autowired;
    import org.springframework.boot.test.context.SpringBootTest;
    import org.springframework.test.context.junit4.SpringJUnit4ClassRunner;
    import org.springframework.util.CollectionUtils;

    @RunWith(SpringJUnit4ClassRunner.class)
    @SpringBootTest(classes = SpiderAutoHomeApplication.class)
    public class AutoHomeApiServiceTest {

        private static final String BASE_URL = "https://www.autohome.com.cn/bestauto/";

        @Autowired
        private AutoHomeApiService autoHomeApiService;
        @Autowired
        private TitleFilter titleFilter;
        @Autowired
        private CarTestService carTestService;

        @Test
        public void getHtml() {
            String html = autoHomeApiService.getHtml(BASE_URL);
            System.out.println("html = " + html);
        }

        @Test
        public void getImage() {
            String image = autoHomeApiService.getImage("https://car2.autoimg.cn/cardfs/product/g24/M09/AE/EB/800x0_1_q87_autohomecar__wKgHIVpxGh6AFSN1AAY8kcz3Aww921.jpg");
            System.out.println("image = " + image);
        }

        // Crawl the review data
        @Test
        public void testGetEvaluatingResult() {
            // Page numbers start at 1; this test only crawls the first three pages
            for (int i = 1; i <= 3; i++) {
                String html = autoHomeApiService.getHtml(BASE_URL + i);
                if (html == null) {
                    continue; // the page could not be fetched
                }
                Document document = Jsoup.parse(html);
                // Each review sits in an element with class "uibox"
                Elements carElements = document.getElementsByClass("uibox");
                // Collect this page's records only, so earlier pages are not saved again
                List<CarTest> saveList = new ArrayList<>();
                for (Element carElement : carElements) {
                    String carTitle = carElement.getElementsByClass("uibox-title uibox-title-border").text();
                    // Skip reviews whose title has (very likely) been saved already
                    if (titleFilter.contains(carTitle)) {
                        continue;
                    }
                    CarTest carTest = marshalCarElement(carElement);
                    carTest.setImage(marshalImageNames(carElement));
                    saveList.add(carTest);
                    titleFilter.add(carTitle);
                }
                if (!CollectionUtils.isEmpty(saveList)) {
                    carTestService.saveBatch(saveList);
                }
            }
        }

        // Downloads the review images and returns their file names joined by commas
        private String marshalImageNames(Element carElement) {
            List<String> imageNameList = new ArrayList<>();
            Elements imageElements = carElement.select(".piclist-box.fn-clear ul.piclist02 a");
            for (Element imageElement : imageElements) {
                // The src attribute is protocol-relative, so prepend the scheme
                String imageUrl = "https:" + imageElement.getElementsByTag("img").attr("src");
                String imageName = autoHomeApiService.getImage(imageUrl);
                if (imageName != null) {
                    imageNameList.add(imageName);
                }
            }
            return CollectionUtils.isEmpty(imageNameList) ? null : StringUtils.join(imageNameList, ",");
        }

        // Builds a CarTest entity from one review element
        private CarTest marshalCarElement(Element carElement) {
            CarTest carTest = new CarTest();
            carTest.setTitle(carElement.getElementsByClass("uibox-title uibox-title-border").text());
            // Test items: acceleration, braking, fuel consumption
            carTest.setTestSpeed(covertStrToNum(textOf(carElement, ".tabbox1 dd:nth-child(2) > div.dd-div2")));
            carTest.setTestBrake(covertStrToNum(textOf(carElement, ".tabbox1 dd:nth-child(3) > div.dd-div2")));
            carTest.setTestOil(covertStrToNum(textOf(carElement, ".tabbox1 dd:nth-child(4) > div.dd-div2")));
            // Editors and their comments
            carTest.setEditorName1(textOf(carElement, ".tabbox2.tabbox-score dd:nth-child(2) > div.dd-div1"));
            carTest.setEditorRemark1(textOf(carElement, ".tabbox2.tabbox-score dd:nth-child(2) > div.dd-div3"));
            carTest.setEditorName2(textOf(carElement, ".tabbox2.tabbox-score dd:nth-child(3) > div.dd-div1"));
            carTest.setEditorRemark2(textOf(carElement, ".tabbox2.tabbox-score dd:nth-child(3) > div.dd-div3"));
            carTest.setEditorName3(textOf(carElement, ".tabbox2.tabbox-score dd:nth-child(4) > div.dd-div1"));
            carTest.setEditorRemark3(textOf(carElement, ".tabbox2.tabbox-score dd:nth-child(4) > div.dd-div3"));
            Date currentDate = new Date();
            carTest.setCreated(currentDate);
            carTest.setUpdated(currentDate);
            return carTest;
        }

        // Text of the first element matching the selector, or "" when nothing matches
        private String textOf(Element root, String cssQuery) {
            Element found = root.select(cssQuery).first();
            return found == null ? "" : found.text();
        }

        // Converts a measurement string to an integer in thousandths; "--" and unparsable text give 0
        private int covertStrToNum(String str) {
            if (StringUtils.isBlank(str) || "--".equals(str)) {
                return 0;
            }
            try {
                // Drop the trailing unit character, then scale by 1000 (for example seconds to milliseconds)
                String number = StringUtils.substring(str, 0, str.length() - 1);
                return Math.round(Float.parseFloat(number) * 1000);
            } catch (NumberFormatException e) {
                System.out.println("Cannot convert to a number: " + str);
                return 0;
            }
        }
    }
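covertStrToNum strips the last character of the measurement text (the unit) and multiplies the number by 1000, so that the value can be stored as an integer. The sketch below shows the same conversion with BigDecimal, which avoids the rounding surprises of float arithmetic. The class name MeasurementParser, the method toThousandths and the sample inputs are hypothetical; the exact text on the real page may look different.

```java
import java.math.BigDecimal;
import java.math.RoundingMode;

/** Hypothetical helper: converts measurement text to an integer in thousandths. */
final class MeasurementParser {

    /**
     * Drops one trailing unit character and scales the number by 1000.
     * Returns 0 for null, blank, "--" or unparsable input.
     */
    static int toThousandths(String text) {
        if (text == null) {
            return 0;
        }
        String trimmed = text.trim();
        if (trimmed.isEmpty() || "--".equals(trimmed)) {
            return 0;
        }
        // Like the article's code, assume the last character is the unit.
        String number = trimmed.substring(0, trimmed.length() - 1);
        try {
            return new BigDecimal(number)
                    .movePointRight(3)                   // multiply by 1000 exactly
                    .setScale(0, RoundingMode.HALF_UP)   // round any remaining fraction
                    .intValueExact();
        } catch (NumberFormatException | ArithmeticException e) {
            return 0; // not a number, or too large for an int
        }
    }

    public static void main(String[] args) {
        // Sample inputs are invented for this example.
        System.out.println(toThousandths("8.47s"));
        System.out.println(toThousandths("--"));
    }
}
```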
Integrating the job
Turn the crawling code from the test method into a job and let a Quartz scheduled task run it. The car review data is then fetched on a schedule, so the newest data is always picked up.
Converting to a job
    import java.util.ArrayList;
    import java.util.Date;
    import java.util.List;

    import lombok.extern.slf4j.Slf4j;
    import org.apache.commons.lang3.StringUtils;
    import org.jsoup.Jsoup;
    import org.jsoup.nodes.Document;
    import org.jsoup.nodes.Element;
    import org.jsoup.select.Elements;
    import org.quartz.DisallowConcurrentExecution;
    import org.quartz.JobExecutionContext;
    import org.quartz.JobExecutionException;
    import org.springframework.context.ApplicationContext;
    import org.springframework.scheduling.quartz.QuartzJobBean;
    import org.springframework.util.CollectionUtils;

    @Slf4j
    @DisallowConcurrentExecution
    public class CrawlerAutoHomeJob extends QuartzJobBean {

        private AutoHomeApiService autoHomeApiService;
        private TitleFilter titleFilter;
        private CarTestService carTestService;

        @Override
        protected void executeInternal(JobExecutionContext context) throws JobExecutionException {
            log.info(">>>>>>>>>>>>>>>>>>>>>>>>>>>> start crawlerAutoHomeJob");
            // Look up the Spring beans through the context stored in the job data map
            ApplicationContext applicationContext =
                    (ApplicationContext) context.getJobDetail().getJobDataMap().get("context");
            this.autoHomeApiService = applicationContext.getBean(AutoHomeApiService.class);
            this.carTestService = applicationContext.getBean(CarTestService.class);
            this.titleFilter = applicationContext.getBean(TitleFilter.class);

            // Crawl review pages 1 to 187
            for (int i = 1; i < 188; i++) {
                String html = autoHomeApiService.getHtml("https://www.autohome.com.cn/bestauto/" + i);
                if (html == null) {
                    continue; // the page could not be fetched
                }
                Document document = Jsoup.parse(html);
                Elements carElements = document.getElementsByClass("uibox");
                // Collect this page's records only, so earlier pages are not saved again
                List<CarTest> saveList = new ArrayList<>();
                for (Element carElement : carElements) {
                    String carTitle = carElement.getElementsByClass("uibox-title uibox-title-border").text();
                    // Skip reviews whose title has (very likely) been saved already
                    if (titleFilter.contains(carTitle)) {
                        continue;
                    }
                    CarTest carTest = marshalCarElement(carElement);
                    carTest.setImage(marshalImageNames(carElement));
                    saveList.add(carTest);
                    titleFilter.add(carTitle);
                }
                if (!CollectionUtils.isEmpty(saveList)) {
                    carTestService.saveBatch(saveList);
                }
            }
            log.info(">>>>>>>>>>>>>>>>>>>>>>>>>>>> end crawlerAutoHomeJob");
        }

        // Downloads the review images and returns their file names joined by commas
        private String marshalImageNames(Element carElement) {
            List<String> imageNameList = new ArrayList<>();
            Elements imageElements = carElement.select(".piclist-box.fn-clear ul.piclist02 a");
            for (Element imageElement : imageElements) {
                // The src attribute is protocol-relative, so prepend the scheme
                String imageUrl = "https:" + imageElement.getElementsByTag("img").attr("src");
                String imageName = autoHomeApiService.getImage(imageUrl);
                if (imageName != null) {
                    imageNameList.add(imageName);
                }
            }
            return CollectionUtils.isEmpty(imageNameList) ? null : StringUtils.join(imageNameList, ",");
        }

        // Builds a CarTest entity from one review element
        private CarTest marshalCarElement(Element carElement) {
            CarTest carTest = new CarTest();
            carTest.setTitle(carElement.getElementsByClass("uibox-title uibox-title-border").text());
            // Test items: acceleration, braking, fuel consumption
            carTest.setTestSpeed(covertStrToNum(textOf(carElement, ".tabbox1 dd:nth-child(2) > div.dd-div2")));
            carTest.setTestBrake(covertStrToNum(textOf(carElement, ".tabbox1 dd:nth-child(3) > div.dd-div2")));
            carTest.setTestOil(covertStrToNum(textOf(carElement, ".tabbox1 dd:nth-child(4) > div.dd-div2")));
            // Editors and their comments
            carTest.setEditorName1(textOf(carElement, ".tabbox2.tabbox-score dd:nth-child(2) > div.dd-div1"));
            carTest.setEditorRemark1(textOf(carElement, ".tabbox2.tabbox-score dd:nth-child(2) > div.dd-div3"));
            carTest.setEditorName2(textOf(carElement, ".tabbox2.tabbox-score dd:nth-child(3) > div.dd-div1"));
            carTest.setEditorRemark2(textOf(carElement, ".tabbox2.tabbox-score dd:nth-child(3) > div.dd-div3"));
            carTest.setEditorName3(textOf(carElement, ".tabbox2.tabbox-score dd:nth-child(4) > div.dd-div1"));
            carTest.setEditorRemark3(textOf(carElement, ".tabbox2.tabbox-score dd:nth-child(4) > div.dd-div3"));
            Date currentDate = new Date();
            carTest.setCreated(currentDate);
            carTest.setUpdated(currentDate);
            return carTest;
        }

        // Text of the first element matching the selector, or "" when nothing matches
        private String textOf(Element root, String cssQuery) {
            Element found = root.select(cssQuery).first();
            return found == null ? "" : found.text();
        }

        // Converts a measurement string to an integer in thousandths; "--" and unparsable text give 0
        private int covertStrToNum(String str) {
            if (StringUtils.isBlank(str) || "--".equals(str)) {
                return 0;
            }
            try {
                // Drop the trailing unit character, then scale by 1000 (for example seconds to milliseconds)
                String number = StringUtils.substring(str, 0, str.length() - 1);
                return Math.round(Float.parseFloat(number) * 1000);
            } catch (NumberFormatException e) {
                log.warn("Cannot convert to a number: {}", str);
                return 0;
            }
        }
    }
Adding the scheduled task
Add the Autohome crawling task to the scheduled job configuration class QuartzConfig:
    // Job definition for crawling Autohome reviews
    @Bean("crawlerAutoHomeJob")
    public JobDetailFactoryBean crawlerAutoHomeJob() {
        JobDetailFactoryBean jobDetailFactoryBean = new JobDetailFactoryBean();
        // Expose the Spring context to the job under the key "context"
        jobDetailFactoryBean.setApplicationContextJobDataKey("context");
        jobDetailFactoryBean.setJobClass(CrawlerAutoHomeJob.class);
        jobDetailFactoryBean.setDurability(true);
        return jobDetailFactoryBean;
    }

    // Trigger for the crawl job
    @Bean("crawlerAutoHomeJobTrigger")
    public CronTriggerFactoryBean crawlerAutoHomeJobTrigger(
            @Qualifier(value = "crawlerAutoHomeJob") JobDetailFactoryBean itemJobBean) {
        CronTriggerFactoryBean trigger = new CronTriggerFactoryBean();
        trigger.setJobDetail(itemJobBean.getObject());
        // Fires every 5 seconds; @DisallowConcurrentExecution on the job keeps runs from overlapping.
        // A much longer interval is kinder to the target site in real use.
        trigger.setCronExpression("0/5 * * * * ? ");
        return trigger;
    }