Shell tips

Using regular expressions with grep and pcre2grep to extract content from HTML source files

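Shell scripts are a good fit for simple text-processing tasks, such as pulling the title or the contents of a particular tag out of HTML source. Simple cases can be handled with awk. For example, to extract the title: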

awk '/<title/,/<\/title>/' index.html
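As a minimal sketch of how that range pattern behaves, the script below builds a small throwaway file and runs the same awk command against it. The file name, its content, and the follow-up sed/tr step that strips the tags are assumptions added for illustration, not part of the original command.

```shell
#!/usr/bin/env bash
# Hypothetical sample file; its name and content are made up for this sketch.
workdir=$(mktemp -d)
cat > "$workdir/index.html" <<'EOF'
<html>
<head>
<title>A list of cool animals
</title>
</head>
<body><h1>Pets</h1></body>
</html>
EOF

# The range pattern prints every line from the first one containing "<title"
# through the next one containing "</title>".
title_lines=$(awk '/<title/,/<\/title>/' "$workdir/index.html")
printf '%s\n' "$title_lines"

# Optional extra step: remove the tags and join the lines to keep only the text.
title_text=$(printf '%s\n' "$title_lines" | sed 's/<[^>]*>//g' | tr -d '\n')
printf '%s\n' "$title_text"

rm -rf "$workdir"
```

The range form works whether the title sits on one line or is split across two, which is why awk is convenient for this particular case.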

This article looks at how to use regular expressions with grep and pcre2grep to extract content from HTML source files.

To make the examples concrete, create a test file named regex-content-01.html with the following content:

<html>
<head>
<title>A list of interesting and uninteresting people
</title>
<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1">
</head>
<body bgcolor="#ffffff" text="#000000">
      <h1>Interesting People</h1>
            <ul>
                  <li><div id="1">Mick Jagger<br>mick@stones.com</div></li>
                  <li><div id="2">Joan Jett<br>joan@runaways.info</div></li>
                  <li><div id="3">John Lennon<br>john@beatles.io</div></li>
                  <li><div id="4">Duck Dunn<br>ddunn@coolmusic.io</div></li>
            </ul>
      <h1>Uninteresting People</h1>
            <ul>
                  <li><div id="5">John Doe<br>jd@uninterestingpeople.com</div></li>
                  <li><div id="6">Jane Doe<br>jane@uninterestingpeople.com</div></li>
                  <li><div id="7">Uninteresting Person<br>up@uninterestingpeople.com</div></li>
            </ul>
</body>
</html>

The general form of the grep command is:

grep -Po <regular_expression> <path/with/filename>

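The -P option (a GNU grep extension) tells grep to treat the pattern as a Perl-compatible regular expression (PCRE). The -o option prints only the text that matched instead of the whole line. The examples below show the effect: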

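Search for people:
$ grep -Po 'people' regex-content-01.html
Result (one match in the title and three in the email addresses; the search is case-sensitive, so People in the headings is not matched):
people
people
people
people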

$ grep -Po '@.*people' regex-content-01.html
Result:
@uninterestingpeople
@uninterestingpeople
@uninterestingpeople

$ grep -Po '@.*people.*\.com' regex-content-01.html
Result:
@uninterestingpeople.com
@uninterestingpeople.com
@uninterestingpeople.com
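To see what -o and -P each contribute, here is a small sketch that runs the same pattern with and without -o, then uses a lookbehind, a construct that needs -P. It assumes GNU grep built with PCRE support; the file name people.html and its two sample rows are made up for the illustration.

```shell
#!/usr/bin/env bash
# Hypothetical two-row sample; people.html is not one of the article's files.
workdir=$(mktemp -d)
cat > "$workdir/people.html" <<'EOF'
<ul>
<li><div id="1">Mick Jagger<br>mick@stones.com</div></li>
<li><div id="5">John Doe<br>jd@uninterestingpeople.com</div></li>
</ul>
EOF

# Without -o, grep prints the whole line that contains a match.
whole_line=$(grep -P '@.*people' "$workdir/people.html")
printf 'without -o: %s\n' "$whole_line"

# With -o, only the matched text is printed.
matched_only=$(grep -Po '@.*people' "$workdir/people.html")
printf 'with -o:    %s\n' "$matched_only"

# (?<=@) is a lookbehind: the match must follow an "@", but the "@" itself
# is not part of the output. [^<]+ then takes everything up to the next tag.
domains=$(grep -Po '(?<=@)[^<]+' "$workdir/people.html")
printf '%s\n' "$domains"

rm -rf "$workdir"
```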

Add the -i option to make the match case-insensitive. Because the pattern starts with .*, the match includes the line's leading whitespace:

$ grep -Poi '.*mick jagger.*' regex-content-01.html
Result:
                  <li><div id="1">Mick Jagger<br>mick@stones.com</div></li>

$ grep -Po '<li>.*</li>' regex-content-01.html
Result:
<li><div id="1">Mick Jagger<br>mick@stones.com</div></li>
<li><div id="2">Joan Jett<br>joan@runaways.info</div></li>
<li><div id="3">John Lennon<br>john@beatles.io</div></li>
<li><div id="4">Duck Dunn<br>ddunn@coolmusic.io</div></li>
<li><div id="5">John Doe<br>jd@uninterestingpeople.com</div></li>
<li><div id="6">Jane Doe<br>jane@uninterestingpeople.com</div></li>
<li><div id="7">Uninteresting Person<br>up@uninterestingpeople.com</div></li>

Add a numeric character class to select items by id:
$ grep -Po '<li><div id="[2-4]">.*</li>' regex-content-01.html
Result:
<li><div id="2">Joan Jett<br>joan@runaways.info</div></li>
<li><div id="3">John Lennon<br>john@beatles.io</div></li>
<li><div id="4">Duck Dunn<br>ddunn@coolmusic.io</div></li>
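If only the names are wanted rather than the whole element, one possible approach is the PCRE \K escape, which discards everything matched so far from the reported match. The sketch below assumes GNU grep with PCRE support and uses a made-up file name, people.html, holding four rows copied from the test data.

```shell
#!/usr/bin/env bash
# Hypothetical file name; the rows are taken from the article's sample data.
workdir=$(mktemp -d)
cat > "$workdir/people.html" <<'EOF'
<li><div id="1">Mick Jagger<br>mick@stones.com</div></li>
<li><div id="2">Joan Jett<br>joan@runaways.info</div></li>
<li><div id="3">John Lennon<br>john@beatles.io</div></li>
<li><div id="4">Duck Dunn<br>ddunn@coolmusic.io</div></li>
EOF

# [2-4] restricts the id to 2, 3, or 4. \K resets the start of the match,
# so the <div ...> prefix is required but not printed. [^<]+ stops at <br>.
names=$(grep -Po '<div id="[2-4]">\K[^<]+' "$workdir/people.html")
printf '%s\n' "$names"

rm -rf "$workdir"
```

Note that [2-4] matches a single digit, so an id such as 23 would not be selected by this pattern.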

To work with more than one HTML file, create a second test file named regex-content-02.html with the following content:

<html>
 <head>
 <title>A list of cool animals
 </title>
 <meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1">
 </head>
 <body bgcolor="#000000" text="#ffffff">
      <h1>Interesting Pets</h1>
            <ul>
                  <li><div id="1">Daffy Duck</div></li>
                  <li><div id="2">Porky Pig</div></li>
                  <li><div id="3">Bugs Bunny</div></li>
                  <li><div id="4">Huckleberry Hound</div></li>
                  <li><div id="5">Crusader Rabbit</div></li>
                  <li><div id="6">Top Cat</div></li>
                  <li><div id="7">Rags T. Tiger</div></li>
            </ul>
</body>
</html>

The command then takes this form:
$ grep -Po <regular_expression> <path/with/filename-01.html> <path/with/filename-02.html> ... <path/with/filename-n.html>

Search for a string. When more than one file is searched, grep prefixes each match with the file name:
$ grep -Po 'Duck' *.html
Result:
regex-content-01.html:Duck
regex-content-02.html:Duck

$ grep -Po '.*Duck.*' *.html
Result:
regex-content-01.html:                  <li><div id="4">Duck Dunn<br>ddunn@coolmusic.io</div></li>
regex-content-02.html:                  <li><div id="1">Daffy Duck</div></li>

Match a tag:

$ grep -Po '<li>.*</li>' *.html
Result:
regex-content-01.html:<li><div id="1">Mick Jagger<br>mick@stones.com</div></li>
regex-content-01.html:<li><div id="2">Joan Jett<br>joan@runaways.info</div></li>
regex-content-01.html:<li><div id="3">John Lennon<br>john@beatles.io</div></li>
regex-content-01.html:<li><div id="4">Duck Dunn<br>ddunn@coolmusic.io</div></li>
regex-content-01.html:<li><div id="5">John Doe<br>jd@uninterestingpeople.com</div></li>
regex-content-01.html:<li><div id="6">Jane Doe<br>jane@uninterestingpeople.com</div></li>
regex-content-01.html:<li><div id="7">Uninteresting Person<br>up@uninterestingpeople.com</div></li>
regex-content-02.html:<li><div id="1">Daffy Duck</div></li>
regex-content-02.html:<li><div id="2">Porky Pig</div></li>
regex-content-02.html:<li><div id="3">Bugs Bunny</div></li>
regex-content-02.html:<li><div id="4">Huckleberry Hound</div></li>
regex-content-02.html:<li><div id="5">Crusader Rabbit</div></li>
regex-content-02.html:<li><div id="6">Top Cat</div></li>
regex-content-02.html:<li><div id="7">Rags T. Tiger</div></li>

$ grep -Po '<li>.*Duck.*</li>' *.html
Result:
regex-content-01.html:<li><div id="4">Duck Dunn<br>ddunn@coolmusic.io</div></li>
regex-content-02.html:<li><div id="1">Daffy Duck</div></li>
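When the file-name prefix gets in the way, or only a summary is needed, GNU grep has a few options that change what is reported for multiple files. The sketch below shows -h (no file names), -c (count of matching lines per file), and -l (names of matching files only). The two file names and their contents are shortened stand-ins for the test files, not the files themselves.

```shell
#!/usr/bin/env bash
# Hypothetical, shortened stand-ins for the two test files.
workdir=$(mktemp -d)
cat > "$workdir/people.html" <<'EOF'
<li><div id="4">Duck Dunn<br>ddunn@coolmusic.io</div></li>
<li><div id="5">John Doe<br>jd@uninterestingpeople.com</div></li>
EOF
cat > "$workdir/pets.html" <<'EOF'
<li><div id="1">Daffy Duck</div></li>
<li><div id="2">Porky Pig</div></li>
EOF

# -h suppresses the "file:" prefix on each match.
no_names=$(cd "$workdir" && grep -Poh 'Duck' *.html)
printf '%s\n' "$no_names"

# -c prints the number of matching lines for each file.
counts=$(cd "$workdir" && grep -Pc 'Duck' *.html)
printf '%s\n' "$counts"

# -l prints only the names of files that contain at least one match.
files=$(cd "$workdir" && grep -Pl 'Porky' *.html)
printf '%s\n' "$files"

rm -rf "$workdir"
```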

One complication is that the title element spans two lines. In the second file it looks like this:

<title>A list of cool animals
</title>

By default grep matches one line at a time, so a pattern containing \n cannot match across the line break:

$ grep -Po '<title.*\n.*</title>' regex-content-02.html
This produces no output.
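As an aside, GNU grep does offer a workaround that stays within grep: the -z option treats NUL rather than newline as the line terminator, so an ordinary text file is read as one long record. The sketch below is an alternative to the approach this article takes, not a replacement for it, and assumes GNU grep with PCRE support. The file name pets.html and its content are made up for the illustration.

```shell
#!/usr/bin/env bash
# Hypothetical sample file with a title split across two lines.
workdir=$(mktemp -d)
cat > "$workdir/pets.html" <<'EOF'
<html>
<head>
<title>A list of cool animals
</title>
</head>
</html>
EOF

# -z: the whole file is one record, so a match may span line breaks.
# (?s): lets "." match newline characters. ".*?" is lazy and stops at the
# first </title>. With -z, output records end in NUL, so tr converts them
# back to newlines.
title=$(grep -Pzo '(?s)<title>.*?</title>' "$workdir/pets.html" | tr '\0' '\n')
printf '%s\n' "$title"

rm -rf "$workdir"
```

Because -z loads each file as a single record, this is best kept for small files.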

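This is where pcre2grep comes in. To install it on Fedora or RHEL:

$ sudo dnf update
$ sudo dnf install pcre2-tools -y
Or, on Debian or Ubuntu:
$ sudo apt update
$ sudo apt-get install pcre2-utils -y
Now try the same pattern again. The -M option allows a match to span more than one line, and -i ignores case:
$ pcre2grep -Mi '<title.*\n.*</title>' *.html
Result: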
regex-content-01.html: <title>A list of interesting and uninteresting people
</title>
regex-content-02.html: <title>A list of cool animals
</title>

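The pattern above handles a single line break. What about content that spans several lines, such as this?

<title>
A list of
cool animals
</title>

In that case, add (?s) at the start of the pattern. It turns on dot-all mode, in which . also matches newline characters. The example below applies it to the <ul> lists. The lazy quantifier +? makes each match stop at the first closing ul> instead of running on to the last one in the file:

$ pcre2grep -Mo '(?s)<ul.+?ul>' *.html
Result: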
regex-content-01.html:<ul>
                  <li><div id="1">Mick Jagger<br>mick@stones.com</div></li>
                  <li><div id="2">Joan Jett<br>joan@runaways.info</div></li>
                  <li><div id="3">John Lennon<br>john@beatles.io</div></li>
                  <li><div id="4">Duck Dunn<br>ddunn@coolmusic.io</div></li>
            </ul>
regex-content-01.html:<ul>
                  <li><div id="5">John Doe<br>jd@uninterestingpeople.com</div></li>
                  <li><div id="6">Jane Doe<br>jane@uninterestingpeople.com</div></li>
                  <li><div id="7">Uninteresting Person<br>up@uninterestingpeople.com</div></li>
            </ul>
regex-content-02.html:<ul>
                  <li><div id="1">Daffy Duck</div></li>
                  <li><div id="2">Porky Pig</div></li>
                  <li><div id="3">Bugs Bunny</div></li>
                  <li><div id="4">Huckleberry Hound</div></li>
                  <li><div id="5">Crusader Rabbit</div></li>
                  <li><div id="6">Top Cat</div></li>
                  <li><div id="7">Rags T. Tiger</div></li>
            </ul>
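The same idea applies to the multi-line title shown earlier. The sketch below builds a throwaway file with a title split over four lines and extracts it with pcre2grep. The file name pets.html is made up, and the script skips the command if pcre2grep is not installed.

```shell
#!/usr/bin/env bash
# Hypothetical sample file with a title spread over four lines.
workdir=$(mktemp -d)
cat > "$workdir/pets.html" <<'EOF'
<html>
<head>
<title>
A list of
cool animals
</title>
</head>
</html>
EOF

title=""
if command -v pcre2grep >/dev/null 2>&1; then
    # -M: allow matches across lines. -o: print only the matched text.
    # (?s): "." also matches newlines. ".*?" stops at the first </title>.
    title=$(pcre2grep -Mo '(?s)<title>.*?</title>' "$workdir/pets.html")
    printf '%s\n' "$title"
else
    echo "pcre2grep is not installed; skipping the example" >&2
fi

rm -rf "$workdir"
```

With a single input file, pcre2grep prints no file-name prefix, so the output is just the matched element.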

Reference: https://developers.redhat.com/articles/2022/10/05/filter-content-html-using-regular-expressions-grep#working_across_multiple_lines_of_html