Skip to content
Open
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
305 changes: 305 additions & 0 deletions 04-Regex(docker compose)/RegEx/readme.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,305 @@
# What is RegEx?
Regular expressions, commonly known as RegEx or regexp, are sequences of characters that define search patterns. These patterns are used for string-searching algorithms, "find" or "find and replace" operations on strings, or for input validation³.


### Practical Applications:
- **Form Input Validation**: Ensuring email addresses, phone numbers, and other inputs are in the correct format.
- **Web Scraping**: Extracting specific information from web pages.
- **Search and Replace**: Finding and replacing text in documents or code.
- **Log Analysis**: Extracting useful information from log files.

### Example:
To match an email address, you might use a pattern like:
```regex
[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}
```

This pattern matches typical email addresses like `[email protected]`.

RegEx can seem complex at first, but with practice, it becomes a powerful tool for text processing and data extraction¹².


¹: [Learn Regex: A Beginner's Guide — SitePoint](https://www.sitepoint.com/learn-regex/)

²: [RegexOne - Learn Regular Expressions](https://www.regexone.com/)

³: [Regular expression - Wikipedia](https://en.wikipedia.org/wiki/Regular_expression)

(1) Regular expression - Wikipedia. https://en.wikipedia.org/wiki/Regular_expression.

(2) Learn Regex: A Beginner's Guide — SitePoint. https://www.sitepoint.com/learn-regex/.

(3) RegexOne - Learn Regular Expressions - Lesson 1: An Introduction, and .... https://www.regexone.com/.

(4) en.wikipedia.org. https://en.wikipedia.org/wiki/Regular_expression.


# RegEx Charachters

### 1. **Literal Characters**
- **Pattern**: `hello`
- **Matches**: `hello` in "hello world"
- **Explanation**: Matches the exact sequence of characters.

### 2. **Dot (.)**
- **Pattern**: `h.llo`
- **Matches**: `hello`, `hallo`, `hxllo`
- **Explanation**: Matches any single character except newline.

### 3. **Caret (^)**
- **Pattern**: `^hello`
- **Matches**: `hello` at the start of a string
- **Explanation**: Asserts position at the start of a line.

### 4. **Dollar ($)**
- **Pattern**: `world$`
- **Matches**: `world` at the end of a string
- **Explanation**: Asserts position at the end of a line.

### 5. **Asterisk (*)**
- **Pattern**: `he*llo`
- **Matches**: `hllo`, `hello`, `heello`
- **Explanation**: Matches 0 or more of the preceding element.

### 6. **Plus (+)**
- **Pattern**: `he+llo`
- **Matches**: `hello`, `heello`
- **Explanation**: Matches 1 or more of the preceding element.

### 7. **Question Mark (?)**
- **Pattern**: `he?llo`
- **Matches**: `hllo`, `hello`
- **Explanation**: Matches 0 or 1 of the preceding element.

### 8. **Braces ({})**
- **Pattern**: `he{2}llo`
- **Matches**: `heello`
- **Explanation**: Matches exactly `n` occurrences of the preceding element.

### 9. **Square Brackets ([])**
- **Pattern**: `[aeiou]`
- **Matches**: `a`, `e`, `i`, `o`, `u`
- **Explanation**: Matches any one of the enclosed characters.

### 10. **Negated Character Class ([^])**
- **Pattern**: `[^aeiou]`
- **Matches**: Any character except `a`, `e`, `i`, `o`, `u`
- **Explanation**: Matches any character not in the brackets.

### 11. **Parentheses (())**
- **Pattern**: `(hello)`
- **Matches**: `hello`
- **Explanation**: Groups multiple tokens together.

### 12. **Pipe (|)**
- **Pattern**: `hello|world`
- **Matches**: `hello`, `world`
- **Explanation**: Matches either the pattern before or after the `|`.

### 13. **Backslash (\)**
- **Pattern**: `\d`
- **Matches**: Any digit (0-9)
- **Explanation**: Escapes a special character or denotes a special sequence.

### 14. **Special Sequences**
- **\d**: Matches any digit (0-9)
- **Pattern**: `\d`
- **Matches**: `1`, `2`, `3`, etc.
- **\D**: Matches any non-digit
- **Pattern**: `\D`
- **Matches**: `a`, `b`, `!`, etc.
- **\w**: Matches any word character (alphanumeric + underscore)
- **Pattern**: `\w`
- **Matches**: `a`, `1`, `_`, etc.
- **\W**: Matches any non-word character
- **Pattern**: `\W`
- **Matches**: `!`, `@`, `#`, etc.
- **\s**: Matches any whitespace character
- **Pattern**: `\s`
- **Matches**: ` `, `\t`, `\n`, etc.
- **\S**: Matches any non-whitespace character
- **Pattern**: `\S`
- **Matches**: `a`, `1`, `!`, etc.

### Examples in Context
Let's use these characters in a more complex pattern:

**Pattern**: `^\d{3}-\d{2}-\d{4}$`
- **Matches**: `123-45-6789`
- **Explanation**: Matches a string that starts (`^`) with exactly three digits (`\d{3}`), followed by a hyphen, two digits (`\d{2}`), another hyphen, and four digits (`\d{4}`), and ends (`$`).

(1) Regular expression syntax cheat sheet - JavaScript | MDN. https://developer.mozilla.org/en-US/docs/Web/JavaScript/Guide/Regular_expressions/Cheatsheet.
(2) A Practical Guide to Regular Expressions – Learn RegEx with Real Life .... https://www.freecodecamp.org/news/practical-regex-guide-with-real-life-examples/.
(3) Regex Tutorial - A Cheatsheet with Examples - Regextutorial.org. https://regextutorial.org/.


# Best practices for RegEx

### 1. **Keep It Simple**
- **Explanation**: Write the simplest pattern that accomplishes your goal. Complex patterns are harder to read and maintain.
- **Example**: Instead of `\b\d{1,3}(,\d{3})*(\.\d{2})?\b` to match a number with commas and decimals, use `\d+` if you only need to match digits.

### 2. **Use Comments and Verbose Mode**
- **Explanation**: Add comments and use the verbose mode to make your RegEx more readable.
- **Example**:
```python
pattern = re.compile(r"""
\b # Word boundary
\d{1,3} # 1 to 3 digits
(,\d{3})* # Optional groups of 3 digits, separated by commas
(\.\d{2})? # Optional decimal point followed by 2 digits
\b # Word boundary
""", re.VERBOSE)
```

### 3. **Test Thoroughly**
- **Explanation**: Test your RegEx with various inputs to ensure it works as expected.
- **Example**: Use tools like [regex101](https://regex101.com/) to test and debug your patterns.

### 4. **Avoid Overusing Wildcards**
- **Explanation**: The dot (`.`) matches any character except newline, which can lead to unexpected matches.
- **Example**: Instead of `.*`, use a more specific pattern like `\w+` to match word characters.

### 5. **Use Non-Capturing Groups When Possible**
- **Explanation**: Use `(?:...)` for groups you don't need to capture, which can improve performance.
- **Example**: Instead of `(abc|def)`, use `(?:abc|def)` if you don't need to capture the group.

### 6. **Anchor Your Patterns**
- **Explanation**: Use `^` and `$` to match the start and end of a string, respectively, to avoid partial matches.
- **Example**: Use `^hello$` to match the exact string "hello".

### 7. **Escape Special Characters**
- **Explanation**: Use backslashes (`\`) to escape special characters when you need to match them literally.
- **Example**: Use `\.` to match a period instead of `.`.

### 8. **Use Character Classes**
- **Explanation**: Use `[ ]` to define a set of characters to match, which can simplify your pattern.
- **Example**: Use `[aeiou]` to match any vowel.

### 9. **Optimize for Performance**
- **Explanation**: Avoid patterns that can cause excessive backtracking, which can slow down matching.
- **Example**: Use `\d{4}` instead of `\d\d\d\d` to match four digits.

### 10. **Document Your Patterns**
- **Explanation**: Add comments or documentation to explain complex patterns for future reference.
- **Example**: Include a comment like `# Matches a US phone number` above your pattern.

By following these best practices, you can write more efficient, readable, and maintainable regular expressions.

(1) Regex Learn - Step by step, from zero to advanced.. https://regexlearn.com/.
(2) A Practical Guide to Regular Expressions – Learn RegEx with Real Life .... https://www.freecodecamp.org/news/practical-regex-guide-with-real-life-examples/.
(3) Best Practices for Regular Expressions in .NET - .NET. https://learn.microsoft.com/en-us/dotnet/standard/base-types/best-practices-regex.




# SOME REAL LIFE APPLICATION OF REGEX

### 1. **Extracting Email Addresses from Text**
**Problem**: You have a large text document containing various information, and you need to extract all email addresses.

**RegEx Pattern**: `[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}`

**Example**:
```python
import re

text = "Contact us at [email protected] or [email protected] for more information."
emails = re.findall(r'[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}', text)
print(emails)
```

**Output**:
```
['[email protected]', '[email protected]']
```

### 2. **Extracting Dates from a Document**
**Problem**: You need to extract all dates in the format `MM/DD/YYYY` from a text document.

**RegEx Pattern**: `\b\d{2}/\d{2}/\d{4}\b`

**Example**:
```python
import re

text = "The event is scheduled for 12/25/2024 and the deadline for registration is 11/30/2024."
dates = re.findall(r'\b\d{2}/\d{2}/\d{4}\b', text)
print(dates)
```

**Output**:
```
['12/25/2024', '11/30/2024']
```

### 3. **Extracting Phone Numbers from Text**
**Problem**: You have a text file with various contact information, and you need to extract all phone numbers in the format `(XXX) XXX-XXXX`.

**RegEx Pattern**: `\(\d{3}\) \d{3}-\d{4}`

**Example**:
```python
import re

text = "You can reach us at (555) 123-4567 or (555) 987-6543."
phone_numbers = re.findall(r'\(\d{3}\) \d{3}-\d{4}', text)
print(phone_numbers)
```

**Output**:
```
['(555) 123-4567', '(555) 987-6543']
```

### 4. **Extracting URLs from a Web Page**
**Problem**: You need to extract all URLs from the HTML content of a web page.

**RegEx Pattern**: `https?://[^\s]+`

**Example**:
```python
import re

html_content = """
<a href="https://example.com">Example</a>
<a href="http://example.org">Example Org</a>
"""
urls = re.findall(r'https?://[^\s]+', html_content)
print(urls)
```

**Output**:
```
['https://example.com', 'http://example.org']
```

### 5. **Extracting IP Addresses from Log Files**
**Problem**: You have server log files, and you need to extract all IP addresses.

**RegEx Pattern**: `\b\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}\b`

**Example**:
```python
import re

log_data = """
192.168.1.1 - - [10/Oct/2024:13:55:36 -0700] "GET /index.html HTTP/1.1" 200 2326
172.16.0.2 - - [10/Oct/2024:13:55:36 -0700] "POST /form HTTP/1.1" 404 721
"""
ip_addresses = re.findall(r'\b\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}\b', log_data)
print(ip_addresses)
```

**Output**:
```
['192.168.1.1', '172.16.0.2']
```

These examples demonstrate how RegEx can be used to extract specific types of data from raw text, making it a powerful tool for data processing and analysis.

(1) Five Practical Use Cases for Regular Expressions. https://blog.openreplay.com/five-practical-use-cases-for-regular-expressions/.

(2) A Practical Guide to Regular Expressions – Learn RegEx with Real Life .... https://www.freecodecamp.org/news/practical-regex-guide-with-real-life-examples/.

(3) What have you used Regular Expressions for? - Stack Overflow. https://stackoverflow.com/questions/404430/what-have-you-used-regular-expressions-for.