Extracting content from HTML source files with regular expressions in grep and pcre2grep

Mar 30, 2023

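Shell scripts are a convenient way to handle simple text-processing tasks, such as pulling the title or the contents of a particular tag out of HTML source. Simple jobs like these are often done with awk. For example, to print the title element: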

awk '/<title/,/<\/title>/' index.html
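
To show what the awk range pattern prints, here is a minimal sketch. The scratch file title-demo.html and the variable names are made up for this illustration; only the awk command itself comes from the text above.

```shell
# Hypothetical scratch file, created only for this sketch.
tmpdir=$(mktemp -d)
cat > "$tmpdir/title-demo.html" <<'EOF'
<html>
<head>
<title>A list of cool animals
</title>
</head>
</html>
EOF

# Range pattern: start printing at the first line containing "<title"
# and stop after the next line containing "</title>".
title_block=$(awk '/<title/,/<\/title>/' "$tmpdir/title-demo.html")
printf '%s\n' "$title_block"

rm -r "$tmpdir"
```

The range pattern prints whole lines, so anything else that shares a line with the opening or closing tag is printed as well.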

This article looks at how to use regular expressions with grep and pcre2grep to extract content from HTML source files.

To make the examples easy to follow, create a test file named regex-content-01.html with the following content:

<html>
<head>
<title>A list of interesting and uninteresting people
</title>
<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1">
</head>
<body bgcolor="#ffffff" text="#000000">
      <h1>Interesting People</h1>
            <ul>
                  <li><div id="1">Mick Jagger<br>mick@stones.com</div></li>
                  <li><div id="2">Joan Jett<br>joan@runaways.info</div></li>
                  <li><div id="3">John Lennon<br>john@beatles.io</div></li>
                  <li><div id="4">Duck Dunn<br>ddunn@coolmusic.io</div></li>
            </ul>
      <h1>Uninteresting People</h1>
            <ul>
                  <li><div id="5">John Doe<br>jd@uninterestingpeople.com</div></li>
                  <li><div id="6">Jane Doe<br>jane@uninterestingpeople.com</div></li>
                  <li><div id="7">Uninteresting Person<br>up@uninterestingpeople.com</div></li>
            </ul>
</body>
</html>

The general form of the grep command is:

grep -Po <regular_expression> <path/with/filename>

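The -P option (a GNU grep extension) tells grep to interpret the pattern as a Perl-compatible regular expression (PCRE). The -o option prints only the text that matched rather than the whole line. The examples in the rest of this article use both.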
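
To make the effect of -o concrete, the following sketch runs the same pattern with and without it. It assumes GNU grep built with -P support; the one-line file and the variable names are hypothetical.

```shell
# Hypothetical one-line input file for this sketch.
tmpdir=$(mktemp -d)
printf '%s\n' '<li><div id="1">Mick Jagger<br>mick@stones.com</div></li>' > "$tmpdir/one-line.html"

# Without -o, grep prints the entire line that contains a match.
whole_line=$(grep -P 'mick@\w+' "$tmpdir/one-line.html")

# With -o, grep prints only the part of the line the pattern matched.
# \w does not match ".", so the match stops before ".com".
match_only=$(grep -Po 'mick@\w+' "$tmpdir/one-line.html")

printf 'whole line: %s\n' "$whole_line"
printf 'match only: %s\n' "$match_only"   # match only: mick@stones

rm -r "$tmpdir"
```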

Search for people:
$ grep -Po 'people' regex-content-01.html
Output:
people
people
people
people

$ grep -Po '@.*people' regex-content-01.html
Output:
@uninterestingpeople
@uninterestingpeople
@uninterestingpeople

$ grep -Po '@.*people.*\.com' regex-content-01.html
Output:
@uninterestingpeople.com
@uninterestingpeople.com
@uninterestingpeople.com
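
The patterns above rely on .* being greedy: it consumes as much of the line as it can while still allowing a match. The sketch below, assuming GNU grep with -P, contrasts that with the lazy form .*? on a made-up input line that holds two addresses.

```shell
# Hypothetical input line containing two addresses.
line='jd@uninterestingpeople.com, jane@uninterestingpeople.com'

# Greedy: .* runs to the LAST "people" on the line, giving one long match.
greedy=$(printf '%s\n' "$line" | grep -Po '@.*people')

# Lazy: .*? stops at the FIRST "people" after each "@", giving two matches.
lazy=$(printf '%s\n' "$line" | grep -Po '@.*?people')

printf 'greedy:\n%s\n' "$greedy"
# greedy:
# @uninterestingpeople.com, jane@uninterestingpeople
printf 'lazy:\n%s\n' "$lazy"
# lazy:
# @uninterestingpeople
# @uninterestingpeople
```

In the test file each line holds only one address, so the greedy and lazy forms give the same output there.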

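Add the i option to ignore case. Because the pattern starts with .*, the match includes the leading indentation of the line:

$ grep -Poi '.*mick jagger.*' regex-content-01.html
Output:
                  <li><div id="1">Mick Jagger<br>mick@stones.com</div></li>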

$ grep -Po '<li>.*</li>' regex-content-01.html
Output:
<li><div id="1">Mick Jagger<br>mick@stones.com</div></li>
<li><div id="2">Joan Jett<br>joan@runaways.info</div></li>
<li><div id="3">John Lennon<br>john@beatles.io</div></li>
<li><div id="4">Duck Dunn<br>ddunn@coolmusic.io</div></li>
<li><div id="5">John Doe<br>jd@uninterestingpeople.com</div></li>
<li><div id="6">Jane Doe<br>jane@uninterestingpeople.com</div></li>
<li><div id="7">Uninteresting Person<br>up@uninterestingpeople.com</div></li>

Add a numeric character class to select ids 2 through 4:
$ grep -Po '<li><div id="[2-4]">.*</li>' regex-content-01.html
Output:
<li><div id="2">Joan Jett<br>joan@runaways.info</div></li>
<li><div id="3">John Lennon<br>john@beatles.io</div></li>
<li><div id="4">Duck Dunn<br>ddunn@coolmusic.io</div></li>
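
If only the name inside each div is wanted, one possible refinement is the PCRE escape \K, which discards everything matched so far from the reported match. This goes beyond the commands in the text above and is only a sketch, assuming GNU grep with -P; the file name and variable are hypothetical.

```shell
# Hypothetical file holding four of the list items from the test file.
tmpdir=$(mktemp -d)
cat > "$tmpdir/items.html" <<'EOF'
<li><div id="1">Mick Jagger<br>mick@stones.com</div></li>
<li><div id="2">Joan Jett<br>joan@runaways.info</div></li>
<li><div id="3">John Lennon<br>john@beatles.io</div></li>
<li><div id="4">Duck Dunn<br>ddunn@coolmusic.io</div></li>
EOF

# <div id="[2-4]"> must be present, but \K drops it from the output.
# [^<]+ then matches the text up to the next tag.
names=$(grep -Po '<div id="[2-4]">\K[^<]+' "$tmpdir/items.html")
printf '%s\n' "$names"
# Joan Jett
# John Lennon
# Duck Dunn

rm -r "$tmpdir"
```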

To work with more than one HTML file, create a second test file, regex-content-02.html, with the following content:

<html>
 <head>
 <title>A list of cool animals
 </title>
 <meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1">
 </head>
 <body bgcolor="#000000" text="#ffffff">
      <h1>Interesting Pets</h1>
            <ul>
                  <li><div id="1">Daffy Duck</div></li>
                  <li><div id="2">Porky Pig</div></li>
                  <li><div id="3">Bugs Bunny</div></li>
                  <li><div id="4">Huckleberry Hound</div></li>
                  <li><div id="5">Crusader Rabbit</div></li>
                  <li><div id="6">Top Cat</div></li>
                  <li><div id="7">Rags T. Tiger</div></li>
            </ul>
</body>
</html>

The command then takes the form:
$ grep -Po <regular_expression> <path/with/filename-01.html> <path/with/filename-02.html> ... <path/with/filename-n.html>

Search for a string:
$ grep -Po 'Duck' *.html
Output:
regex-content-01.html:Duck
regex-content-02.html:Duck

$ grep -Po '.*Duck.*' *.html
Output:
regex-content-01.html:                  <li><div id="4">Duck Dunn<br>ddunn@coolmusic.io</div></li>
regex-content-02.html:                  <li><div id="1">Daffy Duck</div></li>

Match a tag:

$ grep -Po '<li>.*</li>' *.html
Output:
regex-content-01.html:<li><div id="1">Mick Jagger<br>mick@stones.com</div></li>
regex-content-01.html:<li><div id="2">Joan Jett<br>joan@runaways.info</div></li>
regex-content-01.html:<li><div id="3">John Lennon<br>john@beatles.io</div></li>
regex-content-01.html:<li><div id="4">Duck Dunn<br>ddunn@coolmusic.io</div></li>
regex-content-01.html:<li><div id="5">John Doe<br>jd@uninterestingpeople.com</div></li>
regex-content-01.html:<li><div id="6">Jane Doe<br>jane@uninterestingpeople.com</div></li>
regex-content-01.html:<li><div id="7">Uninteresting Person<br>up@uninterestingpeople.com</div></li>
regex-content-02.html:<li><div id="1">Daffy Duck</div></li>
regex-content-02.html:<li><div id="2">Porky Pig</div></li>
regex-content-02.html:<li><div id="3">Bugs Bunny</div></li>
regex-content-02.html:<li><div id="4">Huckleberry Hound</div></li>
regex-content-02.html:<li><div id="5">Crusader Rabbit</div></li>
regex-content-02.html:<li><div id="6">Top Cat</div></li>
regex-content-02.html:<li><div id="7">Rags T. Tiger</div></li>

$ grep -Po '<li>.*Duck.*</li>' *.html
Output:
regex-content-01.html:<li><div id="4">Duck Dunn<br>ddunn@coolmusic.io</div></li>
regex-content-02.html:<li><div id="1">Daffy Duck</div></li>
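
When several files are searched, grep prefixes each result with the file name. Two standard grep options change that: -h drops the prefix and -l prints only the names of files that contain a match. The sketch below assumes GNU grep with -P; the file names people.html and pets.html are invented for the illustration.

```shell
# Two hypothetical input files.
tmpdir=$(mktemp -d)
printf '%s\n' '<li><div id="4">Duck Dunn<br>ddunn@coolmusic.io</div></li>' > "$tmpdir/people.html"
printf '%s\n' '<li><div id="1">Daffy Duck</div></li>' \
              '<li><div id="2">Porky Pig</div></li>' > "$tmpdir/pets.html"

# -h: print matches without the "filename:" prefix.
no_prefix=$(cd "$tmpdir" && grep -hPo '<li>.*Duck.*</li>' people.html pets.html)

# -l: print only the names of the files that contain a match.
files_with_duck=$(cd "$tmpdir" && grep -lP 'Duck' people.html pets.html)

printf '%s\n' "$no_prefix"
printf '%s\n' "$files_with_duck"

rm -r "$tmpdir"
```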

The awkward part is that the title tag in file 2 spans two lines:

<title>A list of cool animals
</title>

grep matches one line at a time, so a pattern containing \n cannot match across the line break:

$ grep -Po '<title.*\n.*</title>' regex-content-02.html
This prints nothing.
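
In a script, the same behaviour shows up in grep's exit status, which is non-zero when nothing matched. A minimal sketch, assuming GNU grep with -P and a made-up two-line file:

```shell
# Hypothetical file with a title split over two lines.
tmpdir=$(mktemp -d)
printf '<title>A list of cool animals\n</title>\n' > "$tmpdir/title.html"

# The pattern needs a newline in the middle of the match, which a
# line-by-line search never sees. -q suppresses output; only the
# exit status is used.
if grep -Pq '<title.*\n.*</title>' "$tmpdir/title.html"; then
    cross_line="matched"
else
    cross_line="no match"
fi

# The opening tag alone sits on one line, so this pattern does match.
if grep -Pq '<title.*' "$tmpdir/title.html"; then
    same_line="matched"
else
    same_line="no match"
fi

printf 'across lines: %s\n' "$cross_line"
printf 'single line:  %s\n' "$same_line"

rm -r "$tmpdir"
```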

This is where pcre2grep comes in. To install it on Fedora or RHEL:

$ sudo dnf update
$ sudo dnf install pcre2-tools -y
Or, on Debian or Ubuntu:
$ sudo apt update
$ sudo apt-get install pcre2-utils -y
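
After installing, a quick check that the command is on the PATH can save confusion later. This is only a sketch; the variable name pcre2grep_status is made up.

```shell
# command -v succeeds only if pcre2grep can be found on the PATH.
if command -v pcre2grep >/dev/null 2>&1; then
    pcre2grep_status="installed"
    pcre2grep --version
else
    pcre2grep_status="missing"
    echo "pcre2grep not found; install it with one of the commands above" >&2
fi
```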

Try again. The -M option allows a match to span more than one line:
$ pcre2grep -Mi '<title.*\n.*</title>' *.html
Output:
regex-content-01.html:<title>A list of interesting and uninteresting people
</title>
regex-content-02.html: <title>A list of cool animals
 </title>
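
The same command can be tried on a single scratch file. The sketch below assumes pcre2grep is installed and skips the search otherwise; the file name and variable are hypothetical.

```shell
# Hypothetical file with a title split over two lines.
tmpdir=$(mktemp -d)
printf '<head>\n<title>A list of cool animals\n</title>\n</head>\n' > "$tmpdir/title.html"

if command -v pcre2grep >/dev/null 2>&1; then
    # -M lets the pattern match across a line break, so \n can now match.
    # Without -o, pcre2grep prints every line the match touches.
    title_lines=$(pcre2grep -M '<title.*\n.*</title>' "$tmpdir/title.html")
else
    title_lines="pcre2grep not installed"
fi
printf '%s\n' "$title_lines"

rm -r "$tmpdir"
```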

The pattern above handles a single line break. What if the content spans several lines, as in this case?

<title>
A list of
cool animals
</title>

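To handle this, add the (?s) flag, which lets . match newline characters as well. The example below applies it to the <ul> lists and uses the lazy quantifier +? so that each list is matched separately instead of one match running from the first <ul> to the last </ul>:

$ pcre2grep -Mo '(?s)<ul.+?ul>' *.html
Output: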
regex-content-01.html:<ul>
                  <li><div id="1">Mick Jagger<br>mick@stones.com</div></li>
                  <li><div id="2">Joan Jett<br>joan@runaways.info</div></li>
                  <li><div id="3">John Lennon<br>john@beatles.io</div></li>
                  <li><div id="4">Duck Dunn<br>ddunn@coolmusic.io</div></li>
            </ul>
regex-content-01.html:<ul>
                  <li><div id="5">John Doe<br>jd@uninterestingpeople.com</div></li>
                  <li><div id="6">Jane Doe<br>jane@uninterestingpeople.com</div></li>
                  <li><div id="7">Uninteresting Person<br>up@uninterestingpeople.com</div></li>
            </ul>
regex-content-02.html:<ul>
                  <li><div id="1">Daffy Duck</div></li>
                  <li><div id="2">Porky Pig</div></li>
                  <li><div id="3">Bugs Bunny</div></li>
                  <li><div id="4">Huckleberry Hound</div></li>
                  <li><div id="5">Crusader Rabbit</div></li>
                  <li><div id="6">Top Cat</div></li>
                  <li><div id="7">Rags T. Tiger</div></li>
            </ul>
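
The multi-line title shown earlier can be handled the same way. The following is a sketch under the assumption that pcre2grep is installed; the file name multi.html and the variable are invented for the illustration.

```shell
# Hypothetical file with a title spread over four lines.
tmpdir=$(mktemp -d)
cat > "$tmpdir/multi.html" <<'EOF'
<head>
<title>
A list of
cool animals
</title>
</head>
EOF

if command -v pcre2grep >/dev/null 2>&1; then
    # (?s) lets . match newlines; .+? stops at the first </title>;
    # -M allows a multi-line match; -o prints only the matched text.
    title_block=$(pcre2grep -Mo '(?s)<title>.+?</title>' "$tmpdir/multi.html")
else
    title_block="pcre2grep not installed"
fi
printf '%s\n' "$title_block"

rm -r "$tmpdir"
```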

Reference: https://developers.redhat.com/articles/2022/10/05/filter-content-html-using-regular-expressions-grep#working_across_multiple_lines_of_html