Working with RegEx

Regular expressions (RegEx) are "search and replace" on steroids. They let you define complex search patterns, such as "find every line that starts with a timestamp" or "extract all email addresses from this file."

In Linux, RegEx is the backbone of powerful command-line utilities like grep, sed, and awk. This guide will take you from basic matching to advanced pattern extraction.

Basics of Regular Expressions:

A regular expression is a sequence of characters that forms a search pattern. It can include literal characters, metacharacters, and quantifiers to define a specific pattern.

Literal Characters: Match the exact character (e.g., apple matches "apple").
Metacharacters: Special characters that define rules (e.g., ^, $, .).
Quantifiers: Specify how many times the preceding element may repeat (e.g., *, +, ?).

2. Metacharacters:

Metacharacters have special meanings in regular expressions:

Symbol	Name	Function	Example	Matches
.	Dot	Matches any single character except newline	b.t	bat, bet, bit, b@t
^	Caret	Anchors to the start of the line.	^Error	Lines starting with "Error"
$	Dollar	Anchors to the end of the line.	done$	Lines ending with "done"
[]	Character Class	Matches one character from a specified set or range.	[Rr]ead	Read, read
[^...]	Negation	Matches one character NOT in the specified set.	[^0-9]	Any non-digit character
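As a quick check of the table above, the following sketch runs each kind of metacharacter through grep against a few made-up lines. The sample text and the helper name matches are assumptions for illustration only.

```shell
# Hypothetical sample lines used only for this demonstration.
sample=$(printf '%s\n' 'bat' 'boot' 'b@t' 'Error: disk full' \
  'No Error here' 'job done' 'done deal' 'Read' 'read' 'lead')

# Hypothetical helper: print the matching lines joined by commas.
matches() {
  printf '%s\n' "$sample" | grep "$1" | paste -sd ',' -
}

matches 'b.t'      # dot: any one character between b and t
matches '^Error'   # caret: "Error" only at the start of a line
matches 'done$'    # dollar: "done" only at the end of a line
matches '[Rr]ead'  # character class: R or r, followed by "ead"
matches 'b[^a]t'   # negation: any one character except "a" between b and t
```

Note that "No Error here" and "done deal" are skipped by the anchored patterns, which is the whole point of ^ and $.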

3. Quantifiers:

Quantifiers specify the number of occurrences of a character or group:

*: Matches 0 or more occurrences.
+: Matches 1 or more occurrences.
?: Matches 0 or 1 occurrence.
{n}: Matches exactly n occurrences.
{n,}: Matches n or more occurrences.
{n,m}: Matches between n and m occurrences.
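The quantifiers listed above can be compared side by side with a small sketch. The word list and the helper name whole_line_matches are assumptions for illustration; the -E flag is used because several of these quantifiers need extended syntax.

```shell
# Hypothetical word list: "c", a varying number of "a", then "t".
words=$(printf '%s\n' 'ct' 'cat' 'caat' 'caaat' 'caaaat')

# Hypothetical helper: -E enables extended syntax, -x requires the whole
# line to match, and paste joins the results onto one line.
whole_line_matches() {
  printf '%s\n' "$words" | grep -Ex "$1" | paste -sd ' ' -
}

whole_line_matches 'ca*t'      # 0 or more: ct cat caat caaat caaaat
whole_line_matches 'ca+t'      # 1 or more: cat caat caaat caaaat
whole_line_matches 'ca?t'      # 0 or 1:    ct cat
whole_line_matches 'ca{2}t'    # exactly 2: caat
whole_line_matches 'ca{3,}t'   # 3 or more: caaat caaaat
whole_line_matches 'ca{1,2}t'  # 1 to 2:    cat caat
```

In every pattern the quantifier applies only to the "a" directly before it, not to the whole word.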

Using RegEx with Linux Commands:

1. Using grep:

The grep command is a powerful tool for searching text using regular expressions.

Basic Usage:

grep "pattern" filename

Example:

Assume a file named regextut.txt contains the following text:

apple is made by apple is a company by apple

iphone is made by apple

faaskndjdfnksdjappleaskldjfsl

grep "apple" regextut.txt

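This command prints every line of regextut.txt that contains the string "apple", including lines where it appears inside a longer word, as in the third line. Add the -w flag to match "apple" only as a whole word.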
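To see how a few common grep flags change the result on this sample, here is a minimal sketch. It assumes the file holds the three sample lines shown earlier; the temporary path and the variable names are arbitrary choices for the demo.

```shell
# Recreate the sample text in a temporary file (the path is arbitrary).
file=$(mktemp)
printf '%s\n' \
  'apple is made by apple is a company by apple' \
  'iphone is made by apple' \
  'faaskndjdfnksdjappleaskldjfsl' > "$file"

line_count=$(grep -c 'apple' "$file")                     # lines with a match
occurrences=$(grep -o 'apple' "$file" | grep -c 'apple')  # every single match
word_lines=$(grep -cw 'apple' "$file")                    # lines with the whole word

echo "lines=$line_count occurrences=$occurrences whole-word lines=$word_lines"
# prints: lines=3 occurrences=5 whole-word lines=2
rm -f "$file"
```

The -c flag counts matching lines, -o prints each match on its own line, and -w ignores the "apple" buried inside the third line.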

Basic vs. Extended RegEx

This is where most beginners get stuck. grep and sed use Basic RegEx (BRE) by default, which treats +, ?, and | as literal characters.

To use them as operators, switch to Extended RegEx (ERE) by adding the -E flag. The older egrep command does the same thing, but GNU grep now marks it as obsolescent in favor of grep -E.

Wrong: grep "error+" log.txt (Looks for the literal string "error+")
Right: grep -E "error+" log.txt (Looks for "erro" followed by one or more "r": "error", "errorr", "errorrr")
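The difference is easy to reproduce without a real log file. The sketch below feeds a few made-up lines to grep in both modes; the sample lines and variable names are assumptions for illustration.

```shell
# Hypothetical log lines used only for this comparison.
log=$(printf '%s\n' 'erro' 'error' 'errorrr' 'literal error+ text')

bre=$(printf '%s\n' "$log" | grep 'error+')     # BRE: + is an ordinary character
ere=$(printf '%s\n' "$log" | grep -E 'error+')  # ERE: + means "one or more r"

# Alternation with | also needs -E.
alt=$(printf '%s\n' 'warn: disk' 'fail: net' 'ok: cpu' | grep -E 'warn|fail')

printf 'BRE:\n%s\nERE:\n%s\nAlternation:\n%s\n' "$bre" "$ere" "$alt"
```

In BRE mode only the line containing a literal "error+" is printed. In ERE mode "error" and "errorrr" match as well, while "erro" still does not because + requires at least one "r".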

2. Using sed:

sed is a stream editor: it reads its input line by line and can transform text on the fly using RegEx.

Substitution:

sed 's/pattern/replacement/g' filename

Example:

sed 's/apple/banana/g' regextut.txt

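This command replaces every occurrence of "apple" with "banana" and prints the result to standard output; regextut.txt itself is not modified. Without the trailing g flag, only the first match on each line would be replaced. To edit the file in place, GNU sed provides the -i option.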
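A short sketch of how the substitution behaves in practice, including capture groups, may help here. The input strings, the temporary file, and the variable names are assumptions for illustration.

```shell
line='apple is made by apple'

first_only=$(printf '%s\n' "$line" | sed 's/apple/banana/')  # no g flag
every=$(printf '%s\n' "$line" | sed 's/apple/banana/g')      # with g flag

# -E enables extended syntax; \1, \2, \3 refer to the parenthesized groups.
# '#' is the delimiter here, so slashes in the replacement need no escaping.
reordered=$(printf '%s\n' '2024-01-15' |
  sed -E 's#([0-9]{4})-([0-9]{2})-([0-9]{2})#\3/\2/\1#')

# sed writes to standard output; the input file keeps its original contents.
tmp=$(mktemp)
printf 'apple\n' > "$tmp"
sed 's/apple/banana/g' "$tmp" > /dev/null
after=$(cat "$tmp")
rm -f "$tmp"

printf '%s\n' "$first_only" "$every" "$reordered" "$after"
```

The four printed lines are "banana is made by apple", "banana is made by banana", "15/01/2024", and "apple", the last one showing that the file was left untouched.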

Advanced Examples:

1. Matching IP Addresses:

grep -P '(\d{1,3}\.){3}\d{1,3}' input.txt

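This command uses Perl-compatible regular expressions (the -P flag, available in GNU grep) to print lines containing IPv4-style addresses. The pattern checks only the shape of the address, so it also accepts out-of-range values such as 999.999.999.999.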
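If each octet should also be limited to 0-255, one possible approach is to spell out the numeric ranges. The sketch below is an illustration rather than a definitive validator: it uses -E with [0-9] instead of -P with \d, and the octet variable and candidate list are hypothetical.

```shell
# One octet: 250-255, 200-249, 100-199, or 0-99 without a leading zero.
octet='(25[0-5]|2[0-4][0-9]|1[0-9]{2}|[1-9]?[0-9])'

# Hypothetical candidates, one per line.
candidates=$(printf '%s\n' '192.168.1.1' '255.255.255.255' \
  '999.1.1.1' '10.0.0.256' '1.2.3')

# Shape-only check: the same pattern as above, with [0-9] in place of \d.
loose=$(printf '%s\n' "$candidates" | grep -E '([0-9]{1,3}\.){3}[0-9]{1,3}')

# Range-aware check; -x makes the pattern match the entire line.
strict=$(printf '%s\n' "$candidates" | grep -Ex "${octet}(\.${octet}){3}")

printf 'loose:\n%s\nstrict:\n%s\n' "$loose" "$strict"
```

The loose pattern lets 999.1.1.1 and 10.0.0.256 through, while the strict one keeps only 192.168.1.1 and 255.255.255.255.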

2. Extracting Email Addresses:

grep -oP '\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b' emails.txt

This command extracts email addresses from a file. The -o flag prints only the matched text rather than the whole line, one match per line.
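Extracted addresses often contain duplicates, so a common follow-up is to pipe the matches through sort -u. The sketch below assumes made-up input text and drops the \b anchors so that plain -E is enough; treat it as one possible variation on the command above.

```shell
# Hypothetical input text containing a repeated and a malformed address.
text=$(printf '%s\n' \
  'Contact alice@example.com or bob.smith@mail.example.org.' \
  'Reply-To: alice@example.com' \
  'broken@ address')

# -o prints each match on its own line; sort -u removes the duplicates.
emails=$(printf '%s\n' "$text" |
  grep -oE '[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}' |
  sort -u)

printf '%s\n' "$emails"
```

The result is two lines, alice@example.com and bob.smith@mail.example.org. The trailing period of the first sentence is not captured, and "broken@" is skipped because nothing valid follows the @.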

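Summary Cheatsheet

The \d and \w shorthands below are Perl-style syntax, so use them with grep -P. The ^$ and ^# patterns work in every mode.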

Task	Pattern	Explanation
Match Email	[\w\.-]+@[\w\.-]+	Simple email match.
Match Date	\d{4}-\d{2}-\d{2}	Format YYYY-MM-DD.
Match IP	\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}	Simple IP match.
Blank Lines	^$	Start immediately followed by end.
Comments	^#	Lines starting with a hash.
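Two of these patterns combine naturally when cleaning up a configuration file. The sketch below uses a made-up config and date string, and writes the date pattern with [0-9] so it works with -E instead of -P; all names are hypothetical.

```shell
# Hypothetical config text with comments and blank lines.
config=$(printf '%s\n' '# main settings' '' 'port=8080' '' '# debug' 'debug=false')

# -v inverts the match: drop blank lines (^$) and comment lines (^#).
active=$(printf '%s\n' "$config" | grep -Ev '^$|^#')

# YYYY-MM-DD date, extracted with -o.
dates=$(printf '%s\n' 'released 2024-03-09' 'no date here' |
  grep -Eo '[0-9]{4}-[0-9]{2}-[0-9]{2}')

printf '%s\n' "$active" "$dates"
```

This prints port=8080, debug=false, and 2024-03-09, each on its own line.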