{"id":1804,"date":"2022-03-02T17:06:15","date_gmt":"2022-03-02T11:36:15","guid":{"rendered":"https:\/\/smarttech101.com\/?p=1804"},"modified":"2022-03-04T06:50:20","modified_gmt":"2022-03-04T01:20:20","slug":"regular-expression-regex-and-regexp-in-linux-ft-grep","status":"publish","type":"post","link":"https:\/\/smarttech101.com\/regular-expression-regex-and-regexp-in-linux-ft-grep\/","title":{"rendered":"Regular Expression (Regex, and Regexp) in Linux Ft. Grep"},"content":{"rendered":"\n
A regular expression (also known as regex or regexp) is a sequence of characters that programming languages such as Python and Linux tools such as grep, awk, and sed use to match one or more strings. <\/p>\n\n\n\n
For instance, the regex “ Although regular expressions are very useful, no single standard is followed everywhere. For example, we have GNU Regex, POSIX Regex, Perl Regex, etc.<\/p>\n\n\n\n However, these standards differ only slightly. For instance, I need to warn you that a regexp is completely different from a zsh or bash glob (also called a shell pattern).<\/p>\n\n\n\n For example, Similarly, At the same time, you need to prevent the shell from interpreting your regex as a shell pattern. To do this, surround the regex with quotes, preferably single quotes. <\/p>\n\n\n\n Any character other than .?*+{|()[\\^$ is called an ordinary character and is matched literally.<\/p>\n\n\n\n For the example given below, the regex Note: <\/strong>If a line does not contain the given regex pattern, grep does not print that line. Therefore, in the above example, the first line is omitted. <\/p>\n\n\n\n The characters Although these characters are special, you can force your regex engine to treat them as ordinary characters by preceding them with a backslash. Example –<\/p>\n\n\n\n I will explain each of these special characters, with examples, under the headings that follow.<\/p>\n\n\n\n The dot matches any single character. <\/p>\n\n\n\n In the following example, only “fix” from the first line is matched. Here, the dot matches the “i”. In the second line, there is no character between f and x, so nothing is matched. <\/p>\n\n\n\n The caret (^) matches the empty string at the start of a line.<\/p>\n\n\n\n For example, the following command searches for “linux” at the beginning of a line. Since the second line does not have “linux” at the start, it is not printed. <\/p>\n\n\n\n The dollar matches the empty string at the end of a line. 
For the following example, only the second line is matched because only that one has “linux” at the end.<\/p>\n\n\n\n Note<\/strong>: The caret ( Character Class (also known as Character Set<\/strong>) is a list of characters in the Note 1: <\/strong>Letters and numbers have different meanings in different countries and languages. The above table is for the traditional C locale. Put simply, if you work with English text, it should work fine.<\/p>\n\n\n\n Explanations with examples:<\/strong><\/p>\n\n\n\n The above-mentioned Character Classes are sufficient to create any list. But before creating any convoluted list such as Named Classes<\/strong> are predefined character classes. For example, Here is a list of widely used named classes (source: Note 1: <\/strong>Special characters lose their special meanings inside the brackets. However, for a few of them, placement inside the brackets matters. For instance, you need to place Note 2:<\/strong> The Named Classes can be combined with other items inside the brackets. For instance, [[:upper:][:lower:]] matches any uppercase or lowercase letter. <\/p>\n\n\n\n \\w matches a word-constituent character (letter, digit, or underscore). \\W matches anything else (a non-word-constituent character). Ex – <\/p>\n\n\n\nR.*x<\/code>” matches strings such as “
Regex<\/code>“, and “
Regular Expression in Linux<\/code>“, etc.<\/p>\n\n\n\n
sed<\/code> on Linux and macOS follow slightly different standards. In this article, I focus mainly on GNU Regex<\/strong> using the grep command in Linux<\/a>. Along the way, I also mention the variations I am aware of.<\/p>\n\n\n\n
Table of Contents<\/h2>\n\n\n\n
Regex Is Not a Shell Pattern<\/h2>\n\n\n\n
*.mp4<\/code> in shell pattern means any filename ending with
.mp4<\/code>. In a regex, on the other hand, the star (*) repeats the preceding item, so a star at the very start of a regex has nothing to repeat and does not act as a wildcard. <\/p>\n\n\n\n
?<\/code>,
()<\/code> and
|<\/code> have different meanings.<\/p>\n\n\n\n
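To make the glob-versus-regex distinction concrete, here is a minimal sketch. It assumes a throwaway directory with two made-up file names (`clip.mp4` and `notes.txt`), neither of which comes from the article.

```shell
# Work in a temporary directory so the glob has known files to match.
tmpdir=$(mktemp -d)
cd "$tmpdir" || exit 1
touch clip.mp4 notes.txt

# Shell pattern (glob): the shell expands *.mp4 before ls ever runs.
ls *.mp4                 # lists clip.mp4

# Regex: the dot must be escaped to mean a literal dot, and $ anchors the
# match to the end of the line. Single quotes keep the shell from touching
# the pattern.
ls | grep '\.mp4$'       # prints clip.mp4
```

Without the quotes, the shell may try to expand or reinterpret characters in the pattern before grep ever sees it.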
Special and Ordinary Characters in Regular Expression<\/h2>\n\n\n\n
Ordinary Characters<\/h3>\n\n\n\n
linux<\/code> consists only of ordinary characters, so it is matched literally.<\/p>\n\n\n\n
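A small sketch of matching ordinary characters, using two made-up input lines (the original screenshot is not available, so the sample text is an assumption):

```shell
# Only the second line contains the lowercase string "linux",
# so grep prints that line and omits the first.
printf 'I use Windows\nI use linux daily\n' | grep 'linux'
# prints: I use linux daily
```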
Special Characters<\/h3>\n\n\n\n
.?*+{|()[\\^$<\/code> are called special characters (also known as metacharacters<\/strong>) since they have special meanings in the regex:<\/p>\n\n\n\n
Regular expression<\/th> Meaning<\/th><\/tr><\/thead> .<\/td> any single character<\/td><\/tr> *<\/td> the preceding item, matched zero or more times<\/td><\/tr> +<\/td> the preceding item, matched one or more times<\/td><\/tr> ^<\/td> beginning of the line<\/td><\/tr> $<\/td> end of the line<\/td><\/tr> ?<\/td> the preceding item is optional<\/td><\/tr> [ <\/td> list of characters, range of characters, named classes<\/td><\/tr> { <\/td> used for interval expressions <\/td><\/tr> |<\/td> the infix OR (alternation) operator<\/td><\/tr> (<\/td> used for grouping<\/td><\/tr> \\<\/td> has a meaning in combination with other characters, e.g. \\b, \\<, \\w, \\>, etc.<\/td><\/tr><\/tbody><\/table> Dot (.) in Regular Expression<\/h2>\n\n\n\n
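As a rough illustration of the dot, and of escaping it with a backslash, the sketch below uses made-up sample lines and assumes GNU grep:

```shell
# "fix" has exactly one character between f and x; "fx" has none.
printf 'fix\nfx\n' | grep 'f.x'       # prints: fix

# A backslash turns the dot into an ordinary character,
# so only a literal "." between f and x matches.
printf 'fix\nf.x\n' | grep 'f\.x'     # prints: f.x
```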
Caret (^) in Regex<\/h2>\n\n\n\n
Dollar ($) in Regex<\/h2>\n\n\n\n
^<\/code>) and dollar (
$<\/code>) are also called regex anchors.<\/p>\n\n\n\n
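The two anchors can be sketched side by side. The input lines here are invented for illustration:

```shell
# ^linux matches only lines that start with "linux".
printf 'linux is free\nI like linux\n' | grep '^linux'   # prints: linux is free

# linux$ matches only lines that end with "linux".
printf 'linux is free\nI like linux\n' | grep 'linux$'   # prints: I like linux
```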
Character Class: Bracket Expression ([ ]) in Regexp<\/h2>\n\n\n\n
[]<\/code> to match any one character from the list. There are many types:<\/p>\n\n\n\n
[az]<\/td> the character \u201ca\u201d OR \u201cz\u201d<\/td><\/tr> [a-z]<\/td> any letter from a to z (lowercase)<\/td><\/tr> [A-Z]<\/td> any letter from A to Z (uppercase)<\/td><\/tr> [A-Za-z]<\/td> any letter<\/td><\/tr> [0-9]<\/td> any digit<\/td><\/tr> [-az]<\/td> any one character out of the three -, a, z<\/code><\/td><\/tr>
[^abc]<\/td> negates [abc] i.e. matches any character except a, b, c (called Negated Character Class)<\/td><\/tr><\/tbody><\/table> i[sn]<\/code> matches either
is<\/code> or
in<\/code>. <\/li><\/ul>\n\n\n\n
-<\/code>) is used for a “Range Expression”. In the example given below,
[0-9]<\/code> matches only the digits. <\/li><\/ul>\n\n\n\n
[0-9]<\/code> matches any digit<\/figcaption><\/figure>\n\n\n\n
[A-Za-z]<\/code> means any alphabetic character as shown below:<\/li><\/ul>\n\n\n\n
[A-Za-z]<\/code> matches any alphabetic character<\/figcaption><\/figure>\n\n\n\n
[^-10]<\/code> matches any character other than
-, 1, 0<\/code>. <\/li><\/ul>\n\n\n\n
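The bracket expressions described above can be tried with a few made-up inputs. This sketch assumes a grep that supports `-o` (GNU grep does), which prints each match on its own line:

```shell
# i[sn] matches "is" or "in" ("it" is skipped because t is not in the list).
printf 'it is in\n' | grep -o 'i[sn]'     # prints "is" then "in"

# [0-9] is a range expression: it matches one digit at a time.
printf 'room 42\n' | grep -o '[0-9]'      # prints "4" then "2"

# [^-10] is a negated class: any character except -, 1, or 0.
printf '1-0a\n' | grep -o '[^-10]'        # prints: a
```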
[A-Za-z0-9]<\/code>, you can also use the corresponding Named Classes explained below.<\/p>\n\n\n\n
Named Class (aka Named Set) in Regex<\/h2>\n\n\n\n
[:alpha:]<\/code> matches any uppercase or lowercase letter.<\/p>\n\n\n\n
man grep<\/a><\/code>,
man gawk<\/code>):<\/p>\n\n\n\n
[:lower:]<\/code> – lowercase letters<\/li>
[:upper:]<\/code> – uppercase letters<\/li>
[:alpha:]<\/code> – letters<\/li>
[:digit:]<\/code> – digits<\/li>
[:alnum:]<\/code> – letters or digits, i.e.
[A-Za-z0-9]<\/code><\/li>
[:punct:]<\/code> – punctuation characters (characters that are not letters, digits, control characters, or space characters)<\/li>
[:space:]<\/code> – any space character (space, horizontal and vertical tabs, newline, carriage return, and formfeed).<\/li>
[:blank:]<\/code> – space or tab<\/li>
[:cntrl:]<\/code> – control characters<\/li>
[:print:]<\/code> – printable characters i.e.
[:punct:]<\/code>,
[:alnum:]<\/code>, space<\/li>
[:graph:]<\/code> – graphical characters i.e.
[:alnum:]<\/code> and
[:punct:]<\/code><\/li>
[:xdigit:]<\/code> – hexadecimal digits<\/li><\/ul>\n\n\n\n
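A short sketch of named classes in use, with invented sample lines and assuming GNU grep. Note that a named class only works inside a bracket expression, hence the double brackets:

```shell
# [[:digit:]] selects lines containing at least one digit.
printf 'abc123\nABC\n!!!\n' | grep '[[:digit:]]'    # prints: abc123

# [[:punct:]] selects lines containing punctuation.
printf 'abc123\nABC\n!!!\n' | grep '[[:punct:]]'    # prints: !!!

# Named classes can be combined in one bracket expression.
printf 'abc123\nABC\n!!!\n' | grep -o '[[:upper:][:digit:]]' | tr -d '\n'
# prints 123ABC (without a trailing newline, because of tr)
```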
^<\/code> anywhere but first, a dash (-) first or last, and
]<\/code> at the first position for them to be matched as literal characters. <\/p>\n\n\n\n
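The placement rules can be checked with a few one-liners. This is a sketch with made-up inputs, assuming GNU grep's bracket-expression behaviour as described in `man grep`:

```shell
# A closing bracket placed first in the list is literal.
printf 'a]b\n' | grep -o '[]]'      # prints: ]

# A caret that is not first in the list is literal.
printf 'x^y\n' | grep -o '[a^]'     # prints: ^

# A dash placed last (or first) in the list is literal, not a range.
printf 'a-z\n' | grep -o '[az-]' | tr -d '\n'   # prints a-z, one match per character
```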
Backslash Based Regex (\\b, \\B, \\w, \\W, \\s, \\S, \\>, \\<)<\/h2>\n\n\n\n
\\s<\/code> is equivalent to
[[:space:]]<\/code> mentioned above.
\\S<\/code> (non-whitespace) is the exact opposite.<\/p>\n\n\n\n
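The backslash sequences can be sketched as follows. These are GNU extensions, so the example assumes GNU grep, and the sample strings are invented:

```shell
# \w matches letters, digits, and underscore; the space and dash are skipped.
printf 'a_1 -b\n' | grep -o '\w' | tr -d '\n'     # prints a_1b

# \s matches whitespace, so only the line with a space between the words matches.
printf 'foo bar\nfoobar\n' | grep 'foo\sbar'      # prints: foo bar

# \< and \> match the empty string at the start and end of a word,
# so "cat" inside "concat" is not matched.
printf 'cat concat\n' | grep -o '\<cat\>'         # prints: cat
```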