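Requirements
The goal is to take screenshots of web pages programmatically, or to convert complex HTML (with advanced CSS3 or JavaScript) into an image.
If you want to generate a PDF from a web page instead, see: https://blog.terrynow.com/2022/12/13/use-node-puppeteer-docker-generate-pdf-as-service-and-support-chinese/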
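Implementation
I considered several approaches:
html2image: works fine for simple HTML and is easy to use, but it cannot render complex HTML correctly.
PhantomJS: reportedly no longer maintained and fairly buggy, so it is not covered here.
The approach presented here, which comes closest to ideal, is built on https://github.com/puppeteer/puppeteer.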
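About Puppeteer
Puppeteer is a Node library maintained by the Chrome team. It provides an API for controlling Chrome; put simply, it drives a headless Chrome browser (headless is the default, but you can also run it with a UI).
Because it is a real browser, anything you can do by hand in a browser Puppeteer can do as well. The name fits: a puppeteer pulls the strings of a puppet, and the library makes it just as easy to script the browser to:
1) Generate screenshots or PDFs of web pages
2) Build advanced crawlers that scrape pages whose content is rendered asynchronously
3) Simulate keyboard input, submit forms automatically, and log in to sites, for automated UI testing
4) Capture a timeline trace of your site to help diagnose performance problems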
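A ready-made Docker screenshot service
The README of this project (published as the mingalevme/screenshoter image) makes it easy to start a Docker service. One caveat: the author apparently did not test pages that contain Chinese text, and I have already run into this pitfall. A container started with the author's default instructions does not render Chinese, so a few extra steps are needed:
- Prepare a Chinese font
You can download Alibaba PuHuiTi 2.0, which has no copyright restrictions and is free for commercial use: https://done.alibabadesign.com/puhuiti2.0 . Any other font you prefer works too.
After downloading, unzip it and copy the files to the Linux host, for example under /opt/fonts: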
[root@localhost fonts]# pwd
/opt/fonts
[root@localhost fonts]# ll
total 61048
-rw-r--r-- 1 root root 2035700 Apr 30 2021 AlibabaPuHuiTi-2-105-Heavy.ttf
-rw-r--r-- 1 root root 2022644 Apr 30 2021 AlibabaPuHuiTi-2-115-Black.ttf
-rw-r--r-- 1 root root 8465416 Apr 30 2021 AlibabaPuHuiTi-2-35-Thin.ttf
-rw-r--r-- 1 root root 8476208 Apr 30 2021 AlibabaPuHuiTi-2-45-Light.ttf
-rw-r--r-- 1 root root 8449680 Apr 30 2021 AlibabaPuHuiTi-2-55-Regular.ttf
-rw-r--r-- 1 root root 8347080 Apr 30 2021 AlibabaPuHuiTi-2-65-Medium.ttf
-rw-r--r-- 1 root root 8293500 Apr 30 2021 AlibabaPuHuiTi-2-75-SemiBold.ttf
-rw-r--r-- 1 root root 8289188 Apr 30 2021 AlibabaPuHuiTi-2-85-Bold.ttf
-rw-r--r-- 1 root root 8124312 Apr 30 2021 AlibabaPuHuiTi-2-95-ExtraBold.ttf
- Pull the image:
docker pull mingalevme/screenshoter
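- Start the container (do not use the author's command as-is; I added a volume that mounts the custom font directory so that Chinese text is rendered):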
docker run -d --shm-size 1G -v /opt/fonts:/usr/share/fonts --restart always -p 9081:8080 --name screenshoter mingalevme/screenshoter
- Use it and check the result
Enter this in a browser:
http://localhost:9081/take?full=1&viewport-width=1400&device-scale-factor=4&url=https://www.baidu.com
You should see an image of the Baidu home page.
To save the image straight to a file on Linux:
curl "http://localhost:9081/take?url=https%3A%2F%2Fwww.baidu.com" > /tmp/screenshot.png
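The curl command above writes whatever the service returns into /tmp/screenshot.png, which could also be an error body if the request failed. As a minimal sketch, the saved bytes can be checked against the standard 8-byte PNG signature. The names PngCheck, isPng, and isPngFile are hypothetical and not part of the screenshoter project, and the check only makes sense when format is left at its png default.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;

final class PngCheck {
    // The fixed 8-byte signature that starts every PNG file.
    private static final byte[] PNG_SIGNATURE = {
        (byte) 0x89, 0x50, 0x4E, 0x47, 0x0D, 0x0A, 0x1A, 0x0A
    };

    // Returns true if the data begins with the PNG signature.
    static boolean isPng(byte[] data) {
        if (data == null || data.length < PNG_SIGNATURE.length) {
            return false;
        }
        byte[] head = Arrays.copyOf(data, PNG_SIGNATURE.length);
        return Arrays.equals(head, PNG_SIGNATURE);
    }

    // Convenience wrapper for a file already saved to disk,
    // e.g. the /tmp/screenshot.png written by the curl command.
    static boolean isPngFile(Path file) throws IOException {
        return isPng(Files.readAllBytes(file));
    }
}
```

For example, PngCheck.isPngFile(Paths.get("/tmp/screenshot.png")) returning false suggests the file holds something other than a PNG image and is worth opening as text.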
A simple Java example that fetches the image:
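import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.net.HttpURLConnection;
import java.net.URL;

// Usage:
// byte[] image = httpGetImageBytes("http://localhost:9081/take?full=1&viewport-width=1400&device-scale-factor=4&url=https://www.baidu.com");
// Alternatively, write the InputStream to a file inside the method; that is basic enough that it is not covered here.

// Fetches the given URL and returns the response body as bytes, or null if the request fails.
public static byte[] httpGetImageBytes(String urlString) {
    try {
        URL url = new URL(urlString);
        HttpURLConnection connection = (HttpURLConnection) url.openConnection();
        connection.setDoInput(true);
        connection.connect();
        // try-with-resources closes the response stream even if reading fails
        try (InputStream input = connection.getInputStream()) {
            ByteArrayOutputStream buffer = new ByteArrayOutputStream();
            int nRead;
            byte[] data = new byte[4096];
            while ((nRead = input.read(data, 0, data.length)) != -1) {
                buffer.write(data, 0, nRead);
            }
            return buffer.toByteArray();
        }
    } catch (IOException e) {
        // Log the exception
        return null;
    }
}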
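The example above returns the image as a byte array. For the variant that saves the stream to a file instead, one possible sketch uses java.nio.file.Files.copy. ScreenshotFiles and saveTo are hypothetical names chosen for illustration.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

final class ScreenshotFiles {
    // Copies everything from the stream into the target file,
    // overwriting any existing file, and returns the number of bytes written.
    // The stream is closed when the copy finishes or fails.
    static long saveTo(InputStream input, Path target) throws IOException {
        try (InputStream in = input) {
            return Files.copy(in, target, StandardCopyOption.REPLACE_EXISTING);
        }
    }
}
```

With an HttpURLConnection like the one in the example above, the call could look like ScreenshotFiles.saveTo(connection.getInputStream(), Paths.get("/tmp/screenshot.png")). Streaming to disk this way avoids holding a large full-page screenshot in memory.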
Parameter reference (I translated a few of the commonly used ones):
Parameter | Type | Required | Description |
---|---|---|---|
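url | string | true | The page to capture, e.g. 'https://www.baidu.com' |
format | string | false | Image format, png or jpg. Defaults to png. |
quality | int | false | Image quality, 1-100. Applies only to jpg images. |
full | int | false | If true, capture the full scrollable page instead of only the viewport. Defaults to false. |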
device | string | false | One of supported device, e.g. iPhone X, see https://github.com/puppeteer/puppeteer/blob/main/src/common/DeviceDescriptors.ts for a full list of devices |
viewport-width | int | false | Viewport width in pixels used for the screenshot. A smaller value, e.g. 460, emulates a phone screen. Defaults to 800. |
viewport-height | int | false | Viewport height in pixels used for the screenshot. |
is-mobile | bool (int) | false | Whether the meta viewport tag is taken into account. Defaults to false. |
has-touch | bool (int) | false | Specifies if viewport supports touch events. Defaults to false. |
is-landscape | bool (int) | false | Specifies if viewport is in landscape mode. Defaults to false. |
device-scale-factor | int | false | Sets the device scale factor (basically dpr) to emulate high-res/retina displays. I usually set it to 4 for a sharper image. Accepts values up to 4; defaults to 1. |
user-agent | string | false | Sets user agent |
cookies | json | false | List with cookies objects (https://github.com/puppeteer/puppeteer/blob/main/docs/api.md#pagesetcookiecookies), e.g. [{"name":"foo","value":"bar","domain":".example.com"}] |
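timeout | int | false | How long to wait for the page before taking the screenshot. Defaults to 30 seconds; pass 0 to disable the timeout. Pages that load and render with JavaScript can take a while, so a longer value is sometimes needed. |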
fail-on-timeout | bool (int) | false | If set to false, we will take a screenshot when timeout is reached instead of failing the request. Defaults to false. |
delay | int | false | If set, we'll wait for the specified number of seconds after the page load event before taking a screenshot. |
wait-until-event | string | false | Controls when the screenshot is taken as the page loads. Supported events include: load - window load event fired (default); domcontentloaded - DOMContentLoaded event fired; networkidle0 - wait until there are zero network connections for at least 500ms; networkidle2 - wait until there are no more than 2 network connections for at least 500ms. domcontentloaded is the fastest but riskiest option: many images and other asynchronous resources may not have loaded yet. networkidle0 is the safest but slowest option. load is a nice middle ground. Defaults to load. |
element | string | false | Query selector of the element to screenshot. It works like a jQuery or CSS selector and captures only the matching element, which is very handy. |
transparency | bool (int) | false | Hides default white webpage background for capturing screenshots with transparency, only works when format is png. Defaults to 0. |
scroll-page-to-bottom | bool (int) | false | (thx https://github.com/Kiuber) Scroll the page to the bottom (https://www.npmjs.com/package/puppeteer-autoscroll-down). |
scroll-page-to-bottom-size | int | false | (scroll-page-to-bottom) Number of pixels to scroll on each step (default: 250). |
scroll-page-to-bottom-delay-ms | int | false | (scroll-page-to-bottom) Delay in ms after each completed scroll step (default: 100). |
scroll-page-to-bottom-steps-limit | int | false | (scroll-page-to-bottom) Max number of steps to scroll. |
width | int | false | If resulted image's width is greater than provided value then image will be proportionally resized to provided width. This action runs before max-height checking. Defaults to 0 (do not resize). |
max-height | int | false | If resulted image's height is greater than provided value then image's height will be cropped to provided value. Defaults to 0 (do not crop). |
ttl | int | false | If last cached screenshot was made less than provided seconds then the cached image will be returned otherwise image will be cached for future use. |
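Several of these parameters carry values that need URL encoding (the url itself, a cookies JSON string, or an element selector containing '#'). The following is a minimal sketch of assembling a /take request from a parameter map. TakeUrlBuilder and build are hypothetical names, and the sketch assumes the service decodes standard form-encoded query strings, in which URLEncoder writes a space as '+'.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;
import java.util.Map;
import java.util.StringJoiner;

final class TakeUrlBuilder {
    // Builds "<baseUrl>/take?k1=v1&k2=v2..." with every key and value encoded.
    // Parameters appear in the iteration order of the map.
    static String build(String baseUrl, Map<String, String> params) {
        StringJoiner query = new StringJoiner("&");
        for (Map.Entry<String, String> entry : params.entrySet()) {
            query.add(encode(entry.getKey()) + "=" + encode(entry.getValue()));
        }
        return baseUrl + "/take?" + query;
    }

    private static String encode(String value) {
        try {
            return URLEncoder.encode(value, "UTF-8");
        } catch (UnsupportedEncodingException ex) {
            // UTF-8 is required to be available on every JVM
            throw new IllegalStateException(ex);
        }
    }
}
```

Passing a LinkedHashMap with url=https://www.baidu.com, full=1, and viewport-width=1400 and the base http://localhost:9081 yields http://localhost:9081/take?url=https%3A%2F%2Fwww.baidu.com&full=1&viewport-width=1400, which matches the encoded form used in the curl example. Encoding matters most for element selectors such as #content, because an unencoded '#' would be treated as the start of the URL fragment and never reach the service.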