Deploying a Puppeteer-Based Docker Service for Clean Web Page Screenshots (Correct Chinese Rendering, Full-Page Scrolling Capture, and More)

2022-10-29

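Requirements

The goal is to take screenshots of web pages programmatically, or to convert complex HTML (with advanced CSS3 or JavaScript) into an image.

Implementation

I considered several approaches:

html2image: simple to use and fine for simple HTML, but it cannot render complex HTML correctly.

PhantomJS: reportedly no longer maintained and rather buggy, so I won't go into it here.

The approach covered here, which gave the best results, is built on https://github.com/puppeteer/puppeteer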

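About Puppeteer

Puppeteer is a Node library maintained by the Chrome team. It provides an API for controlling Chrome; put simply, it drives a headless Chrome browser (headless is the default, but it can also be configured to run with a UI).

Because it is a real browser, Puppeteer can do anything you can do by hand in a browser. The name fits too: like a puppet on strings, the browser is easy to drive, and you can use it to:

1) Generate screenshots or PDFs of web pages
2) Build advanced crawlers that can scrape pages with lots of asynchronously rendered content
3) Simulate keyboard input, form submission, logins and so on for automated UI testing
4) Capture a timeline trace of your site to help diagnose performance problems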

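A ready-made Dockerized screenshot service

Puppeteer is only a Node library, so you still have to write Node.js code to get anything done with it. If, like me, you just want a ready-to-use screenshot service, you can go straight to the mingalevme/screenshoter project (someone on GitHub has already written it for us).

Following the project's README, it is easy to start the Docker service. One caveat: the author does not appear to have tested pages with Chinese content. I have already fallen into this pit: a container started with the author's default instructions does not render Chinese text correctly, so a few extra steps are needed:

  • Prepare Chinese fonts

You can download Alibaba PuHuiTi 2.0, which has no copyright restrictions and is free for commercial use: https://done.alibabadesign.com/puhuiti2.0. Any other font you prefer works as well.

After downloading, unzip the archive and copy the files to the Linux host, as shown below (using /opt/fonts as an example):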

# pwd
/opt/fonts
# ll
total 61048
-rw-r--r-- 1 root root 2035700 Apr 30  2021 AlibabaPuHuiTi-2-105-Heavy.ttf
-rw-r--r-- 1 root root 2022644 Apr 30  2021 AlibabaPuHuiTi-2-115-Black.ttf
-rw-r--r-- 1 root root 8465416 Apr 30  2021 AlibabaPuHuiTi-2-35-Thin.ttf
-rw-r--r-- 1 root root 8476208 Apr 30  2021 AlibabaPuHuiTi-2-45-Light.ttf
-rw-r--r-- 1 root root 8449680 Apr 30  2021 AlibabaPuHuiTi-2-55-Regular.ttf
-rw-r--r-- 1 root root 8347080 Apr 30  2021 AlibabaPuHuiTi-2-65-Medium.ttf
-rw-r--r-- 1 root root 8293500 Apr 30  2021 AlibabaPuHuiTi-2-75-SemiBold.ttf
-rw-r--r-- 1 root root 8289188 Apr 30  2021 AlibabaPuHuiTi-2-85-Bold.ttf
-rw-r--r-- 1 root root 8124312 Apr 30  2021 AlibabaPuHuiTi-2-95-ExtraBold.ttf
  • Pull the image:
docker pull mingalevme/screenshoter
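  • Start the container (do not use the author's command as-is for this step; I added a volume that mounts the custom font folder so that Chinese is supported):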
docker run -d --shm-size 1G -v /opt/fonts:/usr/share/fonts --restart always -p 9081:8080 --name screenshoter mingalevme/screenshoter
  • Use it and check the result

Open this address in a browser:

http://100.103.37.3:9081/take?full=1&viewport-width=1400&device-scale-factor=4&url=https://www.baidu.com

You should see an image of the Baidu home page.

To save the screenshot straight to a file on Linux:

curl "http://100.103.37.3:9081/take?url=https%3A%2F%2Fwww.baidu.com" > /tmp/screenshot.png
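The curl command percent-encodes the target address, while the browser example passes it raw. Encoding matters as soon as the target page has its own query string, because its '?' and '&' would otherwise be read as parameters of the screenshot service. Below is a minimal sketch of a URL builder; TakeUrlBuilder and its build method are hypothetical names introduced for illustration and are not part of the screenshoter project. It assumes Java 10 or later for the URLEncoder overload that takes a Charset.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical helper (not part of the screenshoter project) that builds a /take request URL. */
final class TakeUrlBuilder {

    private TakeUrlBuilder() {}

    /**
     * @param serviceBase base address of the service, e.g. "http://100.103.37.3:9081"
     * @param targetUrl   page to capture; encoded so that its own '?' and '&' survive
     * @param options     extra query parameters such as "full" or "viewport-width"
     */
    static String build(String serviceBase, String targetUrl, Map<String, String> options) {
        StringBuilder sb = new StringBuilder(serviceBase)
                .append("/take?url=")
                .append(encode(targetUrl));
        for (Map.Entry<String, String> e : options.entrySet()) {
            sb.append('&').append(encode(e.getKey())).append('=').append(encode(e.getValue()));
        }
        return sb.toString();
    }

    // URLEncoder applies form encoding (a space becomes '+'), which is acceptable in a query string.
    private static String encode(String s) {
        return URLEncoder.encode(s, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // LinkedHashMap keeps the parameters in insertion order.
        Map<String, String> options = new LinkedHashMap<>();
        options.put("full", "1");
        options.put("viewport-width", "1400");
        System.out.println(build("http://100.103.37.3:9081", "https://www.baidu.com", options));
    }
}
```

The resulting string can be pasted into a browser, passed to curl, or handed to any HTTP client.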

A simple Java example for fetching the image:

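import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.net.HttpURLConnection;
import java.net.URL;

// Usage. Alternatively, write the InputStream to a file inside the method; that is basic enough to skip here.
byte[] image = httpGetImageBytes("http://100.103.37.3:9081/take?full=1&viewport-width=1400&device-scale-factor=4&url=https://www.baidu.com");

// Fetches the screenshot and returns its bytes, or null if the request fails
public static byte[] httpGetImageBytes(String urlString) {
    HttpURLConnection connection = null;
    try {
        URL url = new URL(urlString);
        connection = (HttpURLConnection) url.openConnection();
        connection.setDoInput(true);
        connection.connect();
        // try-with-resources closes the stream even if reading fails
        try (InputStream input = connection.getInputStream()) {
            ByteArrayOutputStream buffer = new ByteArrayOutputStream();
            int nRead;
            byte[] data = new byte[4096];
            while ((nRead = input.read(data, 0, data.length)) != -1) {
                buffer.write(data, 0, nRead);
            }
            return buffer.toByteArray();
        }
    } catch (IOException e) {
        // Log the exception
        return null;
    } finally {
        if (connection != null) {
            connection.disconnect();
        }
    }
}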
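Saving the returned bytes to a file is left out above. As a rough sketch of that step, the helper below writes the bytes to disk and first checks the standard 8-byte PNG signature, so that an HTML error page is not silently saved as screenshot.png. ScreenshotFiles, isPng and save are hypothetical names for illustration, and the check only applies when the default png format is requested.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

/** Hypothetical helper for storing the bytes returned by the screenshot service. */
final class ScreenshotFiles {

    // The fixed 8-byte signature that starts every PNG file.
    private static final byte[] PNG_SIGNATURE =
            {(byte) 0x89, 'P', 'N', 'G', 0x0D, 0x0A, 0x1A, 0x0A};

    private ScreenshotFiles() {}

    /** Returns true if the data starts with the PNG signature. */
    static boolean isPng(byte[] data) {
        if (data == null || data.length < PNG_SIGNATURE.length) {
            return false;
        }
        for (int i = 0; i < PNG_SIGNATURE.length; i++) {
            if (data[i] != PNG_SIGNATURE[i]) {
                return false;
            }
        }
        return true;
    }

    /** Writes the image to the target path, refusing anything that is not a PNG. */
    static Path save(byte[] data, Path target) throws IOException {
        if (!isPng(data)) {
            throw new IOException("Response is not a PNG image, refusing to save it");
        }
        return Files.write(target, data);
    }
}
```

A call such as ScreenshotFiles.save(bytes, java.nio.file.Paths.get("/tmp/screenshot.png")) would then mirror the curl command shown earlier.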

Parameter reference (I translated the descriptions of the most commonly used ones):

Parameter | Type | Required | Description
url | string | true | Page to capture, e.g. 'https://www.baidu.com'
format | string | false | Image format, png or jpg. Defaults to png.
quality | int | false | Image quality, 1-100. Only applies to jpg images.
full | int | false | If true, captures the full scrollable page. Defaults to false.
device | string | false | One of the supported devices, e.g. iPhone X. See https://github.com/puppeteer/puppeteer/blob/main/src/common/DeviceDescriptors.ts for a full list of devices.
viewport-width | int | false | Viewport width in pixels used for the screenshot. A smaller value such as 460 simulates a phone screen. Defaults to 800.
viewport-height | int | false | Viewport height in pixels used for the screenshot.
is-mobile | bool (int) | false | Whether the meta viewport tag is taken into account. Defaults to false.
has-touch | bool (int) | false | Specifies if the viewport supports touch events. Defaults to false.
is-landscape | bool (int) | false | Specifies if the viewport is in landscape mode. Defaults to false.
device-scale-factor | int | false | Sets the device scale factor (basically DPR) to emulate high-res/retina displays. I usually set it to 4 for a sharper image. Accepts values up to 4. Defaults to 1.
user-agent | string | false | Sets the user agent.
cookies | json | false | List of cookie objects (https://github.com/puppeteer/puppeteer/blob/main/docs/api.md#pagesetcookiecookies), e.g. [{"name":"foo","value":"bar","domain":".example.com"}]
timeout | int | false | How long to wait for the page. Defaults to 30 seconds; pass 0 to disable the timeout. Some pages need time to load and render their JavaScript, so it can help to set a longer wait.
fail-on-timeout | bool (int) | false | If set to false, a screenshot is taken when the timeout is reached instead of failing the request. Defaults to false.
delay | int | false | If set, waits the specified number of seconds after the page load event before taking the screenshot.
wait-until-event | string | false | Controls when the screenshot is taken as the page loads. Supported events: load (window load event fired, the default); domcontentloaded (DOMContentLoaded event fired); networkidle0 (zero network connections for at least 500 ms); networkidle2 (no more than 2 network connections for at least 500 ms). domcontentloaded is the fastest but riskiest option, since many images and other asynchronous resources may not have loaded yet. networkidle0 is the safest but slowest option. load is a nice middle ground. Defaults to load.
element | string | false | Query selector of the element to screenshot. It works like a jQuery or CSS selector, so you can capture just one element, which is quite handy.
transparency | bool (int) | false | Hides the default white page background so that screenshots can be captured with transparency. Only works when format is png. Defaults to 0.
scroll-page-to-bottom | bool (int) | false | (thx https://github.com/Kiuber) Scrolls the page to the bottom (https://www.npmjs.com/package/puppeteer-autoscroll-down).
scroll-page-to-bottom-size | int | false | (scroll-page-to-bottom) Number of pixels to scroll on each step (default: 250).
scroll-page-to-bottom-delay-ms | int | false | (scroll-page-to-bottom) Delay in ms after each completed scroll step (default: 100).
scroll-page-to-bottom-steps-limit | int | false | (scroll-page-to-bottom) Maximum number of scroll steps.
width | int | false | If the resulting image is wider than this value, it is resized proportionally to this width. This runs before the max-height check. Defaults to 0 (do not resize).
max-height | int | false | If the resulting image is taller than this value, it is cropped to this height. Defaults to 0 (do not crop).
ttl | int | false | If the last cached screenshot was made less than this many seconds ago, the cached image is returned; otherwise the image is cached for future use.
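Several of these parameters interact to decide how large the final image is. The arithmetic can be sketched as follows, on two assumptions: the captured bitmap measures the CSS size multiplied by device-scale-factor (this is how Puppeteer screenshots behave), and width is applied before max-height as the table states. OutputSize and its methods are hypothetical names, and the rounding used by the real service may differ.

```java
/** Hypothetical estimator of the final image size; rounding in the real service may differ. */
final class OutputSize {
    final int width;
    final int height;

    OutputSize(int width, int height) {
        this.width = width;
        this.height = height;
    }

    /** Pixel size of the raw capture: CSS size multiplied by device-scale-factor. */
    static OutputSize captured(int cssWidth, int cssHeight, int deviceScaleFactor) {
        return new OutputSize(cssWidth * deviceScaleFactor, cssHeight * deviceScaleFactor);
    }

    /**
     * Applies the "width" parameter (proportional resize) first, then "max-height" (crop).
     * A value of 0 disables the corresponding step, matching the defaults in the table.
     */
    OutputSize afterResizeAndCrop(int widthParam, int maxHeightParam) {
        int w = width;
        int h = height;
        if (widthParam > 0 && w > widthParam) {
            // Scale the height by the same ratio as the width.
            h = (int) Math.round((double) h * widthParam / w);
            w = widthParam;
        }
        if (maxHeightParam > 0 && h > maxHeightParam) {
            // Cropping only cuts the height; the width is unchanged.
            h = maxHeightParam;
        }
        return new OutputSize(w, h);
    }

    public static void main(String[] args) {
        // viewport-width=1400 with device-scale-factor=4 on a page 2000 CSS pixels tall
        OutputSize raw = OutputSize.captured(1400, 2000, 4);
        OutputSize out = raw.afterResizeAndCrop(1400, 1500);
        System.out.println(raw.width + "x" + raw.height + " -> " + out.width + "x" + out.height);
    }
}
```

With viewport-width=1400 and device-scale-factor=4, as in the browser example earlier, the raw capture is 5600 pixels wide, which is why the image looks sharp but also why the file can get large; width and max-height are the knobs for bringing it back down.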

 

 

admin
