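Shell scripts are often enough for simple text-processing jobs, such as pulling the title or the contents of a tag out of HTML source. Simple cases can be handled with awk, for example extracting the title:
awk '/<title/,/<\/title>/' index.html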
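The awk range pattern above can be tried on a small file. The following is a minimal sketch; the temporary file and its content are assumptions made up for the illustration, not part of the original example.

```shell
# Hypothetical sample file, created only for this illustration.
sample=$(mktemp)
cat > "$sample" <<'EOF'
<html>
<head>
<title>A list of cool animals
</title>
</head>
<body></body>
</html>
EOF

# Range pattern: print every line from the first one matching <title
# through the next one matching </title>, inclusive.
title_lines=$(awk '/<title/,/<\/title>/' "$sample")
printf '%s\n' "$title_lines"

rm -f "$sample"
```

Because awk prints whole lines, anything else that shares a line with the opening or closing tag is printed as well.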
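This article discusses how to use regular expressions with grep and pcre2grep to extract content from HTML source files.
For the examples, create a test file named regex-content-01.html with the following content: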
<html>
<head>
<title>A list of interesting and uninteresting people
</title>
<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1">
</head>
<body bgcolor="#ffffff" text="#000000">
<h1>Interesting People</h1>
<ul>
<li><div id="1">Mick Jagger<br>mick@stones.com</div></li>
<li><div id="2">Joan Jett<br>joan@runaways.info</div></li>
<li><div id="3">John Lennon<br>john@beatles.io</div></li>
<li><div id="4">Duck Dunn<br>ddunn@coolmusic.io</div></li>
</ul>
<h1>Uninteresting People</h1>
<ul>
<li><div id="5">John Doe<br>jd@uninterestingpeople.com</div></li>
<li><div id="6">Jane Doe<br>jane@uninterestingpeople.com</div></li>
<li><div id="7">Uninteresting Person<br>up@uninterestingpeople.com</div></li>
</ul>
</body>
</html>
The grep command takes the following form:
grep -Po <regular_expression> <path/with/filename>
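The -P option tells grep to interpret the pattern as a Perl-compatible regular expression (PCRE), and -o prints only the matched text instead of the whole line. The examples below make this clear.
Search for people: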
$ grep -Po 'people' regex-content-01.html
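Result (one match in the title line and one in each of the three uninterestingpeople.com addresses):
people
people
people
people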
$ grep -Po '@.*people' regex-content-01.html
Result:
@uninterestingpeople
@uninterestingpeople
@uninterestingpeople
$ grep -Po '@.*people.*\.com' regex-content-01.html
Result:
@uninterestingpeople.com
@uninterestingpeople.com
@uninterestingpeople.com
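To see what -o changes, the same pattern can be run with and without it. This is a minimal sketch that assumes GNU grep built with PCRE support; the temporary file and the last pattern (`[\w.]+@[\w.]+`, a deliberately loose address pattern) are illustrative assumptions, not part of the original examples.

```shell
# Hypothetical two-line sample, created only for this illustration.
sample=$(mktemp)
cat > "$sample" <<'EOF'
<li><div id="5">John Doe<br>jd@uninterestingpeople.com</div></li>
<li><div id="6">Jane Doe<br>jane@uninterestingpeople.com</div></li>
EOF

# Without -o: every line that contains a match is printed in full.
whole=$(grep -P '@.*\.com' "$sample")

# With -o: only the matched part of each line is printed.
only=$(grep -Po '@.*\.com' "$sample")

# A loose pattern that also captures the part before the @ sign.
emails=$(grep -Po '[\w.]+@[\w.]+' "$sample")

printf 'whole lines:\n%s\n\n' "$whole"
printf 'matches only:\n%s\n\n' "$only"
printf 'addresses:\n%s\n' "$emails"

rm -f "$sample"
```

Note that `.*` is greedy: it runs to the last place on the line where the rest of the pattern can still match, which is why a more specific character class is often safer than `.*`.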
Add the -i option to ignore case:
$ grep -Poi '.*mick jagger.*' regex-content-01.html
Result:
<li><div id="1">Mick Jagger<br>mick@stones.com</div></li>
$ grep -Po '<li>.*</li>' regex-content-01.html
Result:
<li><div id="1">Mick Jagger<br>mick@stones.com</div></li>
<li><div id="2">Joan Jett<br>joan@runaways.info</div></li>
<li><div id="3">John Lennon<br>john@beatles.io</div></li>
<li><div id="4">Duck Dunn<br>ddunn@coolmusic.io</div></li>
<li><div id="5">John Doe<br>jd@uninterestingpeople.com</div></li>
<li><div id="6">Jane Doe<br>jane@uninterestingpeople.com</div></li>
<li><div id="7">Uninteresting Person<br>up@uninterestingpeople.com</div></li>
Add a digit range to the pattern:
$ grep -Po '<li><div id="[2-4]">.*</li>' regex-content-01.html
Result:
<li><div id="2">Joan Jett<br>joan@runaways.info</div></li>
<li><div id="3">John Lennon<br>john@beatles.io</div></li>
<li><div id="4">Duck Dunn<br>ddunn@coolmusic.io</div></li>
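The digit range can be combined with PCRE's `\K`, which discards everything matched so far from the reported match, to print only the names rather than the whole tag. This is a minimal sketch assuming GNU grep with PCRE support; the temporary file and the `\K[^<]+` tail are additions for illustration.

```shell
# Hypothetical sample, created only for this illustration.
sample=$(mktemp)
cat > "$sample" <<'EOF'
<li><div id="1">Mick Jagger<br>mick@stones.com</div></li>
<li><div id="2">Joan Jett<br>joan@runaways.info</div></li>
<li><div id="3">John Lennon<br>john@beatles.io</div></li>
<li><div id="4">Duck Dunn<br>ddunn@coolmusic.io</div></li>
<li><div id="5">John Doe<br>jd@uninterestingpeople.com</div></li>
EOF

# Match the opening div for ids 2 to 4, drop it from the result with \K,
# then keep everything up to the next '<' (the name).
names=$(grep -Po '<div id="[2-4]">\K[^<]+' "$sample")
printf '%s\n' "$names"

rm -f "$sample"
```

Using `[^<]+` instead of `.*` keeps the match from running past the `<br>` tag.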
To process several HTML files, create a second test file, regex-content-02.html, with the following content:
<html>
<head>
<title>A list of cool animals
</title>
<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1">
</head>
<body bgcolor="#000000" text="#ffffff">
<h1>Interesting Pets</h1>
<ul>
<li><div id="1">Daffy Duck</div></li>
<li><div id="2">Porky Pig</div></li>
<li><div id="3">Bugs Bunny</div></li>
<li><div id="4">Huckleberry Hound</div></li>
<li><div id="5">Crusader Rabbit</div></li>
<li><div id="6">Top Cat</div></li>
<li><div id="7">Rags T. Tiger</div></li>
</ul>
</body>
</html>
With several files, the command takes this form:
$ grep -Po <regular_expression> <path/with/filename-01.html> <path/with/filename-02.html> ... <path/with/filename-n.html>
Search for a word:
$ grep -Po 'Duck' *.html
Result:
regex-content-01.html:Duck
regex-content-02.html:Duck
$ grep -Po '.*Duck.*' *.html
Result:
regex-content-01.html:<li><div id="4">Duck Dunn<br>ddunn@coolmusic.io</div></li>
regex-content-02.html:<li><div id="1">Daffy Duck</div></li>
Match a tag:
$ grep -Po '<li>.*</li>' *.html
Result:
regex-content-01.html:<li><div id="1">Mick Jagger<br>mick@stones.com</div></li>
regex-content-01.html:<li><div id="2">Joan Jett<br>joan@runaways.info</div></li>
regex-content-01.html:<li><div id="3">John Lennon<br>john@beatles.io</div></li>
regex-content-01.html:<li><div id="4">Duck Dunn<br>ddunn@coolmusic.io</div></li>
regex-content-01.html:<li><div id="5">John Doe<br>jd@uninterestingpeople.com</div></li>
regex-content-01.html:<li><div id="6">Jane Doe<br>jane@uninterestingpeople.com</div></li>
regex-content-01.html:<li><div id="7">Uninteresting Person<br>up@uninterestingpeople.com</div></li>
regex-content-02.html:<li><div id="1">Daffy Duck</div></li>
regex-content-02.html:<li><div id="2">Porky Pig</div></li>
regex-content-02.html:<li><div id="3">Bugs Bunny</div></li>
regex-content-02.html:<li><div id="4">Huckleberry Hound</div></li>
regex-content-02.html:<li><div id="5">Crusader Rabbit</div></li>
regex-content-02.html:<li><div id="6">Top Cat</div></li>
regex-content-02.html:<li><div id="7">Rags T. Tiger</div></li>
$ grep -Po '<li>.*Duck.*</li>' *.html
Result:
regex-content-01.html:<li><div id="4">Duck Dunn<br>ddunn@coolmusic.io</div></li>
regex-content-02.html:<li><div id="1">Daffy Duck</div></li>
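When several files are searched, grep prefixes each match with the file name. The sketch below shows two standard grep options that change this: -h suppresses the prefix and -c counts matching lines per file. The directory, file names, and contents are hypothetical and exist only for the illustration; GNU grep with PCRE support is assumed.

```shell
# Hypothetical working directory with two small files.
workdir=$(mktemp -d)
printf '%s\n' '<li><div id="4">Duck Dunn<br>ddunn@coolmusic.io</div></li>' > "$workdir/a.html"
printf '%s\n' '<li><div id="1">Daffy Duck</div></li>' \
              '<li><div id="2">Porky Pig</div></li>' > "$workdir/b.html"

# Default with several files: each match carries its file name.
with_names=$(cd "$workdir" && grep -Po '<div id="[0-9]+">\K[^<]+' *.html)

# -h: same matches, without the file-name prefix.
without_names=$(cd "$workdir" && grep -Poh '<div id="[0-9]+">\K[^<]+' *.html)

# -c: number of matching lines per file instead of the matches themselves.
counts=$(cd "$workdir" && grep -c 'Duck' *.html)

printf '%s\n\n%s\n\n%s\n' "$with_names" "$without_names" "$counts"

rm -rf "$workdir"
```

The prefixed form is convenient for locating matches, while -h is easier to feed into further processing such as sort or uniq.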
A trickier case is the title tag in the second file, which spans two lines:
<title>A list of cool animals
</title>
grep matches one line at a time, so a pattern that uses \n to cross the line break cannot match:
$ grep -Po '<title.*\n.*</title>' regex-content-02.html
This produces no output.
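The line-by-line behaviour can be checked on a small file. As an aside, and assuming GNU grep specifically, the -z option treats the input as NUL-separated records, so a file without NUL bytes is read as one record and a match may then span line breaks. This is only a sketch of that workaround; the temporary file is hypothetical, and `[^<]*` is used instead of `.` because `.` does not match a newline by default.

```shell
# Hypothetical sample with a title split over two lines.
sample=$(mktemp)
cat > "$sample" <<'EOF'
<head>
<title>A list of cool animals
</title>
</head>
EOF

# Line-by-line matching: no single line contains a newline, so nothing matches.
line_mode=$(grep -Po '<title.*\n.*</title>' "$sample" || true)

# -z (GNU grep): the whole file is one record; each match is written with a
# trailing NUL byte, which tr turns back into a newline.
whole_file=$(grep -Pzo '<title>[^<]*</title>' "$sample" | tr '\0' '\n')

printf 'line mode: [%s]\n' "$line_mode"
printf 'with -z:\n%s\n' "$whole_file"

rm -f "$sample"
```

This reads the entire file into memory as a single record, so it suits small files like these.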
This is where pcre2grep comes in. To install it on Fedora or RHEL:
$ sudo dnf update
$ sudo dnf install pcre2-tools -y
or, on Debian or Ubuntu:
$ sudo apt update
$ sudo apt-get install pcre2-utils -y
Try again with pcre2grep, where -M allows a match to span more than one line:
$ pcre2grep -Mi '<title.*\n.*</title>' *.html
Result:
regex-content-01.html:<title>A list of interesting and uninteresting people
</title>
regex-content-02.html:<title>A list of cool animals
</title>
The pattern above handles a single line break. What if the content spans several lines, as in this case?
<title>
A list of
cool animals
</title>
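In that case, add (?s) to the pattern so that . also matches newlines and the match can span any number of lines. The same approach extracts whole <ul> lists; the lazy quantifier .+? stops at the first closing tag, so each list is matched separately:
$ pcre2grep -Mo '(?s)<ul>.+?</ul>' *.html
Result: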
regex-content-01.html:<ul>
<li><div id="1">Mick Jagger<br>mick@stones.com</div></li>
<li><div id="2">Joan Jett<br>joan@runaways.info</div></li>
<li><div id="3">John Lennon<br>john@beatles.io</div></li>
<li><div id="4">Duck Dunn<br>ddunn@coolmusic.io</div></li>
</ul>
regex-content-01.html:<ul>
<li><div id="5">John Doe<br>jd@uninterestingpeople.com</div></li>
<li><div id="6">Jane Doe<br>jane@uninterestingpeople.com</div></li>
<li><div id="7">Uninteresting Person<br>up@uninterestingpeople.com</div></li>
</ul>
regex-content-02.html:<ul>
<li><div id="1">Daffy Duck</div></li>
<li><div id="2">Porky Pig</div></li>
<li><div id="3">Bugs Bunny</div></li>
<li><div id="4">Huckleberry Hound</div></li>
<li><div id="5">Crusader Rabbit</div></li>
<li><div id="6">Top Cat</div></li>
<li><div id="7">Rags T. Tiger</div></li>
</ul>
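The same (?s) technique applies to the title that is spread over four lines. This is a minimal sketch; the temporary file is hypothetical, and the script simply skips the search when pcre2grep is not installed.

```shell
# Hypothetical sample with the title spread over four lines.
sample=$(mktemp)
cat > "$sample" <<'EOF'
<head>
<title>
A list of
cool animals
</title>
</head>
EOF

if command -v pcre2grep >/dev/null 2>&1; then
  # -M: allow matches across lines; -o: print only the matched text.
  # (?s) lets . match newlines; .*? stops at the first </title>.
  title=$(pcre2grep -Mo '(?s)<title>.*?</title>' "$sample")
  printf '%s\n' "$title"
else
  echo "pcre2grep is not installed; skipping" >&2
fi

rm -f "$sample"
```

With a greedy `.*` the match would instead run to the last `</title>` in the input, which matters once a file contains more than one such element.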