Regular Expressions / RegEx - Finding Patterns In Strings
Regular Expressions (RegEx) are all about finding patterns within a dataset and then performing some action based on them. They let us extract meaningful information from obscured text, and they help with validation, filtering, and find-and-replace operations. In this post, I explore how RegEx works and how it can be used to make sense of real-world data, learning alongside the reader.
At first glance, RegEx will seem complicated to use and get familiar with. Initially, I got lost in the topic with all the symbols and flags available. It’s a very cryptic, compact way of describing a pattern to look for within a string dataset.
In a nutshell, RegEx is a small pattern language used to find patterns within text. I like to think of it as the search or find feature available in Microsoft Word, except we have access to more advanced features like replace, extract, split, and validate.
We can access RegEx in Python by importing the built-in re module (import re), which brings in all of its functions.
Why RegEx Is Useful (Real Examples)
I’ve never really done any day trading or stocks, apart from once with £50, which I lost. Still, here is an example of how RegEx could be used: think about creating a program that looks for specific words or patterns within a news article about your favourite stock and then tells you whether you should buy or sell. This is known as text mining.
Think also about a corrupted file where all the text inside is obfuscated with random symbols that need to be removed to make the text readable.
As an example:
L%e%a%r%n%i%n%g% P%y%t%h%o%n%…
Manually cleaning a file like this would be tedious work. We could use features like find and replace in most modern applications, but what if those features don’t work, or you want your own program? With RegEx, we can build our own function to locate and remove these characters automatically, leaving behind readable information.
Functions in the re Module
To begin using RegEx in Python, we first need to import the module:
import re
This module brings all of its functions into our Python file for use. The five main functions we’ll look into are:
re.match() # Searches for a match at the beginning of a string
re.search() # Searches the entire string and returns the first match
re.findall() # Returns all matches as a list
re.split() # Splits a string at each match
re.sub() # Finds and replaces matches
All of these functions look for a specific pattern, or regular expression, and then perform a task on it, such as replacing it. As I learn to use these functions, I’ll follow a format of first explaining the syntax and then providing an example. This should help me gain a clearer idea of how I can use them.
Hint: Most examples here use direct strings rather than variables or files. Once file handling is revisited later in the course, these examples will make even more sense.
Using re.match()
The way I understand re.match() is that it only checks the very beginning of a string. If the pattern you specify does not start at index 0, the function simply returns None.
You can think of it as similar to string slicing, where the start position is fixed at 0.
# Syntax
re.match(pattern, dataset, re.I)
# pattern: what you are searching for
# dataset: the string or data you are searching in
# re.I: ignores case sensitivity
Example:
import re
txt = "Hi, I'm Sheikh and I love coding. Python is what I am learning right now, what do you like?" # data set
match = re.match("Hi,", txt, re.I) # Find the match of "Hi," from index 0 in txt and ignore case sensitivity
print(match)
# <re.Match object; span=(0, 3), match='Hi,'>
print(match.span())
# (0, 3)
match2 = re.match("Sheikh", txt, re.I) # This will return None.
To be blunt, the .match() method is very limited in practice. It only returns a match object if the pattern matches starting at index 0 of your string. To illustrate, if “Hi,” sat at index positions 2 to 4 instead, re.match() would return None, because it never looks for the pattern beyond the start of the string.
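Because re.match() returns None when nothing matches at the start, calling .span() or .group() on the result without checking it first raises an AttributeError. Here is a minimal sketch of guarding against that, using a shortened version of the txt string from above. The helper name describe_match is my own invention for illustration, not part of the re module.

```python
import re

txt = "Hi, I'm Sheikh and I love coding."

def describe_match(pattern, text):
    """Hypothetical helper: describe a match at the start of text, if any."""
    match = re.match(pattern, text, re.I)
    if match is None:  # re.match() found nothing starting at index 0
        return f"{pattern!r}: no match at the start"
    return f"{pattern!r}: matched {match.group()!r} at {match.span()}"

print(describe_match("hi,", txt))     # pattern is found at index 0
print(describe_match("Sheikh", txt))  # present in txt, but not at index 0
```

Checking for None first is a habit worth keeping for re.search() as well, since it behaves the same way when there is no match.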
Using re.search()
On the bright side, this method brings a little more functionality to its repertoire. Unlike re.match(), it scans the entire dataset and returns the first match it finds, wherever that is in the string. Like re.match(), it returns None if there is no match.
# Syntax
re.search(pattern, dataset, re.I)
Example
import re
txt = "Hi, I'm Sheikh and I love coding. Python is what I am learning right now, what do you like?"
search = re.search("Sheikh", txt, re.I)
print(search)
# <re.Match object; span=(8, 14), match='Sheikh'>
print(search.span())
# (8, 14)
Now we can see how this function could be useful: it helps find keywords within a dataset and begins to feel more like a find function.
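One detail worth seeing in action is that re.search() reports only the first occurrence, even when the keyword appears several times. The short string below is made up for this sketch.

```python
import re

txt = "Python is fun. I use Python every day."

# "Python" appears twice, but re.search() stops at the first one
first = re.search("Python", txt)
print(first.span())

# When nothing is found the result is None, so check before using it
missing = re.search("Java", txt)
if missing is None:
    print("No match found")
```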
Example Using an External File:
import re
with open("sample.txt", "r") as txt: # This is using a file in the same directory
    search = re.search("Sheikh", txt.read(), re.I)
    # .read() returns the file's contents as a string, which is what re.search() needs.
print(search)
# <re.Match object; span=(8, 14), match='Sheikh'>
print(search.span())
# (8, 14)
Note: Even though the file is opened in read mode, we still need to call .read() to pass its contents to the RegEx function as a string.
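To try the file example without preparing sample.txt by hand, the sketch below writes a throwaway file first and then searches it. The file contents and the temporary path are assumptions made purely so the snippet runs anywhere.

```python
import os
import re
import tempfile

# Create a temporary file so the example is self-contained (contents are made up)
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False, encoding="utf-8") as tmp:
    tmp.write("Hi, I'm Sheikh and I love coding.")
    path = tmp.name

with open(path, "r", encoding="utf-8") as f:
    contents = f.read()  # re functions need a string, not the file object itself

search = re.search("sheikh", contents, re.I)
print(search.span())

os.remove(path)  # clean up the temporary file
```

Reading the contents into a variable first also means the same string can be reused for several searches without reopening the file.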
If you’re like me and have been able to get your own practice sessions in, either by using AI or the course contents, you should be feeling a little more comfortable with using these.
We’ll now start to discover rather more interesting applications of RegEx.
Using re.findall()
When we want to search through an entire dataset and find every occurrence of our regular expression, we use re.findall(). It scours the dataset and returns a list of every non-overlapping match; if nothing matches, the list is empty.
# Syntax
re.findall(pattern, dataset, re.I)
Example
import re
txt = "Hey, I'm Sheikh and I love Python. Coding is something that helps me relax and express my thoughts. Recently, I've been trying new techniques in coding, like RegEx and Loops. I think coding is a great way to connect with creativity. What about you? Do you have any hobbies, maybe something like coding or website design?"
match = re.findall("Coding", txt, re.I)
print(match)
# ['Coding', 'coding', 'coding', 'coding']
match = re.findall("Coding", txt)
print(match)
# ['Coding']
match = re.findall("skhfsk", txt)
print(match)
# []
This example clearly shows the effect of re.I (ignore case). Without it, only exact-case matches are returned.
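As a small aside, re.I is just the short spelling of re.IGNORECASE. The sketch below uses the long name on a made-up sentence and counts the matches with and without the flag.

```python
import re

txt = "Coding is fun. I like coding, and coding likes me."

# re.I and re.IGNORECASE are two names for the same flag
with_flag = re.findall("coding", txt, re.IGNORECASE)
without_flag = re.findall("coding", txt)

# len() on the returned list gives a quick count of occurrences
print(len(with_flag), len(without_flag))
```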
Alternative Matching Styles
txt = "Hey, I'm Sheikh and I love Python. Coding is something that helps me relax and express my thoughts. Recently, I've been trying new techniques in coding, like RegEx and Loops. I think coding is a great way to connect with creativity. What about you? Do you have any hobbies, maybe something like coding or website design?"
match = re.findall('Coding|coding', txt)
print(match)
# ['Coding', 'coding', 'coding', 'coding']
match = re.findall('[Cc]oding', txt)
print(match)
# ['Coding', 'coding', 'coding', 'coding']
This begins to shed light on the different ways we can write or specify a pattern for RegEx, and I will be including a comprehensive list of all the ways we can build an expression.
Using re.finditer()
A slightly more advanced alternative to .findall() is .finditer(). Instead of a list of strings, it yields a match object for each match, which gives you extremely useful data such as where each match occurs.
import re
txt = "Hey, I'm Sheikh and I love Python. Coding is something that helps me relax and express my thoughts. Recently, I've been trying new techniques in coding, like RegEx and Loops. I think coding is a great way to connect with creativity. What about you? Do you have any hobbies, maybe something like coding or website design?"
# Use re.finditer() instead of re.findall()
matches = re.finditer("coding", txt, re.I)
# Iterate through all matches
for match in matches:
    print(f"Found '{match.group()}' at index {match.start()} to {match.end()}")
# Output:
# Found 'Coding' at index 35 to 41
# Found 'coding' at index 145 to 151
# Found 'coding' at index 183 to 189
# Found 'coding' at index 295 to 301
# Built using AI
This approach gives you both the matched text and its position in the dataset, which is far more practical in real-world use cases.
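If the positions are needed later rather than just printed, the match objects can be collected into a list. This is a minimal sketch on a shorter, made-up string; the tuple layout (text, start, end) is simply my own choice.

```python
import re

txt = "Coding is fun. I like coding."

# Build a list of (matched text, start index, end index) from each match object
positions = [(m.group(), m.start(), m.end()) for m in re.finditer("coding", txt, re.I)]
print(positions)

# The indices can be used to slice the original string back out
for text, start, end in positions:
    assert txt[start:end] == text
```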
Using re.sub()
This is where we can think about features like find and replace. I know that I have continuously described RegEx as being similar to features in MS Office, etc.
Please understand that instead of just looking for fixed pieces of text, we are looking for patterns in strings.
For demonstration purposes, we are using plain strings and characters as patterns; however, once I started to learn about all the special characters available for building a pattern, I learnt how complex this feature actually is.
re.sub() allows us to find and replace patterns within a dataset.
Syntax
re.sub(pattern, replacement, dataset)
Example
import re
txt = "L%e%a%r%n%i%n%g% P%y%t%h%o%n%"
clean = re.sub("%", "", txt)
print(clean)
# Learning Python
Earlier, I mentioned having an obfuscated dataset with special characters that need to be removed. This is the tool to use when we want to achieve just that. In Cybersecurity and Data Analytics, you’ll come across similar features.
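Real obfuscated text rarely uses a single symbol, so here is a hedged sketch that strips several at once with a character class. The noisy string is invented for the example; the optional count argument of re.sub() is shown as well, since it limits how many replacements are made.

```python
import re

txt = "L%e#a%r@n%i#n%g P#y@t%h%o#n"

# [%#@] is a character class: it matches any one of the listed symbols
clean = re.sub(r"[%#@]", "", txt)
print(clean)

# count limits the number of replacements (here only the first % is removed)
partial = re.sub("%", "", "a%b%c", count=1)
print(partial)
```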
Using re.split()
This method returns a list of results from the dataset you provide by cutting it at every point that matches the pattern you specify.
It finds the first occurrence of your expression, and everything up to that point becomes an item in the list; it then moves on to the next occurrence and repeats. For simple patterns like the ones below, the separator itself is dropped from the results.
Think about obtaining a list of sentences from a passage, where each sentence is identified and separated by the “.” that usually ends it.
Syntax
re.split(pattern, dataset)
Example – Splitting by Words
import re
txt = "Learning Python is a fantastic experience that opens doors to endless possibilities. As you progress in your coding journey, you'll encounter challenges that will make you a better programmer. The best part of learning Python is that you can apply it to so many fields, from web development to data science and machine learning. By consistently practising, you will become proficient and open doors to new opportunities. Keep exploring, keep coding, and most importantly, keep growing!"
result = re.split(' ', txt) # Splitting By Words
print(result)
#['Learning', 'Python', 'is', 'a', 'fantastic', 'experience', 'that', 'opens', 'doors', 'to', 'endless', 'possibilities.', 'As', 'you', 'progress', 'in', 'your', 'coding', 'journey,', "you'll", 'encounter', 'challenges', 'that', 'will', 'make', 'you', 'a', 'better', 'programmer.', 'The', 'best', 'part', 'of', 'learning', 'Python', 'is', 'that', 'you', 'can', 'apply', 'it', 'to', 'so', 'many', 'fields,', 'from', 'web', 'development', 'to', 'data', 'science', 'and', 'machine', 'learning.', 'By', 'consistently', 'practising,', 'you', 'will', 'become', 'proficient', 'and', 'open', 'doors', 'to', 'new', 'opportunities.', 'Keep', 'exploring,', 'keep', 'coding,', 'and', 'most', 'importantly,', 'keep', 'growing!']
result = re.split(r'\.', txt) # Splitting by sentences (raw string, so \. reaches re unchanged)
print(result)
#['Learning Python is a fantastic experience that opens doors to endless possibilities', " As you progress in your coding journey, you'll encounter challenges that will make you a better programmer", ' The best part of learning Python is that you can apply it to so many fields, from web development to data science and machine learning', ' By consistently practising, you will become proficient and open doors to new opportunities', ' Keep exploring, keep coding, and most importantly, keep growing!']
result = re.split(r'\n', txt) # Splitting by new lines
print(result)
# The txt above is a single line with no '\n', so this returns a one-item list containing the whole string.
# If each sentence sat on its own line instead, the result would be:
# ['Learning Python is a fantastic experience that opens doors to endless possibilities.',
# "As you progress in your coding journey, you'll encounter challenges that will make you a better programmer.",
# 'The best part of learning Python is that you can apply it to so many fields, from web development to data science and machine learning.',
# 'By consistently practising, you will become proficient and open doors to new opportunities.',
# 'Keep exploring, keep coding, and most importantly, keep growing!']
Initially, using “.” didn’t work as expected. That’s because ‘.’ is a special character in RegEx that matches any character. Escaping it with a backslash (\.) solved the issue.
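The sentence split above leaves a leading space on each piece and only handles full stops. As a sketch of one way around both, the pattern below splits on any of ., ! or ? and also swallows the whitespace that follows. The sample text is made up.

```python
import re

txt = "Keep exploring. Keep coding! Are you growing?"

# [.!?] matches any one sentence-ending mark; inside [] the dot is literal
# \s* also consumes the spaces that follow, so pieces have no leading space
parts = re.split(r"[.!?]\s*", txt)
print(parts)

# The final '?' leaves an empty string at the end, so filter it out
sentences = [p for p in parts if p]
print(sentences)
```

Note that inside square brackets the dot does not need escaping, which is a handy alternative to writing \. every time.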
Writing Regular Expression Patterns
I have now explored the basics of using the Regular Expressions module in Python, but this is where things start to become complicated. We want to be able to create expressions that are as obscure as the information we’re looking for. Taking the analogy of cybersecurity from before, we are less likely to be looking for ordinary data like words or phrases, and more likely to want numbers, fractions, or several expressions at once.
These are the building blocks you can use to create a custom expression and then locate it in your dataset.
RegEx Basics
| Pattern | What it Matches | Python Example | Result |
|---|---|---|---|
| `.` | Any character except newline | `re.findall(r'.', "Hi")` | `['H', 'i']` |
| `a` | Literal character a | `re.findall(r'a', "banana")` | `['a', 'a', 'a']` |
| `ab` | Exact string ab | `re.findall(r'ab', "abcab")` | `['ab', 'ab']` |
| `a\|b` | a or b | `re.findall(r'a\|b', "abc")` | `['a', 'b']` |
| `\.` | Literal dot | `re.findall(r'\.', "3.14")` | `['.']` |
Character Classes
| Pattern | What it Matches | Python Example | Result |
|---|---|---|---|
| `[abc]` | One of a, b, or c | `re.findall(r'[abc]', "cat")` | `['c', 'a']` |
| `[a-z]` | Any lowercase letter | `re.findall(r'[a-z]', "Hi!")` | `['i']` |
| `[^a-z]` | Not lowercase letters | `re.findall(r'[^a-z]', "Hi!")` | `['H', '!']` |
| `\d` | Any digit | `re.findall(r'\d', "a1b2")` | `['1', '2']` |
| `\D` | Any non-digit | `re.findall(r'\D', "a1")` | `['a']` |
| `\w` | Letter, digit, _ | `re.findall(r'\w', "hi_2!")` | `['h', 'i', '_', '2']` |
| `\W` | Non-word character | `re.findall(r'\W', "hi!")` | `['!']` |
| `\s` | Whitespace | `re.findall(r'\s', "a b")` | `[' ']` |
| `\S` | Non-whitespace | `re.findall(r'\S', "a b")` | `['a', 'b']` |
Quantifiers (How Many?)
| Pattern | Meaning | Python Example | Result |
|---|---|---|---|
| `a*` | 0 or more a | `re.findall(r'a*', "baaa")` | `['', 'aaa', '']` |
| `a+` | 1 or more a | `re.findall(r'a+', "baaa")` | `['aaa']` |
| `a?` | 0 or 1 a | `re.findall(r'a?', "ba")` | `['', 'a', '']` |
| `\d{2}` | Exactly 2 digits | `re.findall(r'\d{2}', "1234")` | `['12', '34']` |
| `\d{2,3}` | 2 to 3 digits | `re.findall(r'\d{2,3}', "12345")` | `['123', '45']` |
Anchors (Position Based)
| Pattern | Meaning | Python Example | Result |
|---|---|---|---|
| `^Hi` | Starts with Hi | `re.findall(r'^Hi', "Hi there")` | `['Hi']` |
| `there$` | Ends with there | `re.findall(r'there$', "Hi there")` | `['there']` |
| `\bcat\b` | Whole word cat | `re.findall(r'\bcat\b', "cat scatter")` | `['cat']` |
| `\Bcat` | Inside a word | `re.findall(r'\Bcat', "scatter")` | `['cat']` |
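The word boundary \b is the anchor I found easiest to appreciate with a side-by-side comparison. In this sketch (the sentence is made up), the plain pattern also finds cat buried inside longer words, while the anchored one finds only the standalone word.

```python
import re

txt = "The cat sat near the catalogue; concatenate is not a cat."

anywhere = re.findall(r"cat", txt)        # every occurrence, including inside words
whole_word = re.findall(r"\bcat\b", txt)  # only the standalone word

print(len(anywhere), len(whole_word))
```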
Groups & Capturing
| Pattern | Meaning | Python Example | Result |
|---|---|---|---|
| `(ab)` | Captures ab | `re.findall(r'(ab)', "abab")` | `['ab', 'ab']` |
| `(ab)+` | Repeating group (findall returns only the last repetition captured) | `re.findall(r'(ab)+', "abab")` | `['ab']` |
| `(?:ab)` | Non-capturing group | `re.findall(r'(?:ab)', "abab")` | `['ab', 'ab']` |
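Groups become more interesting when a pattern has more than one of them. As a rough sketch on a made-up string: with two capturing groups, re.findall() returns one tuple per match, and with re.search() each group can be read by its number.

```python
import re

txt = "Order 12 shipped on 2024-05-01, order 13 on 2024-05-03."

# Two capturing groups (year and month): findall returns a tuple per match
dates = re.findall(r"(\d{4})-(\d{2})-\d{2}", txt)
print(dates)

# With a match object, group(0) is the whole match and group(1) the first group
m = re.search(r"(\d{4})-(\d{2})-(\d{2})", txt)
print(m.group(0), m.group(1), m.groups())
```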
Lookaheads
| Pattern | Meaning | Python Example | Result |
|---|---|---|---|
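| `\w+(?=!)` | Word characters followed by ! | `re.findall(r'\w+(?=!)', "Hi!")` | `['Hi']` |
| `\w+(?!\!)` | Word characters not followed by ! (the match backtracks to H, because Hi is followed by !) | `re.findall(r'\w+(?!\!)', "Hi!")` | `['H']` |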
The mental model that worked for me was:
what to match → how many times → where it must appear
Example
import re
re.findall(r'^\d{3}$', dataset)
Explanation:
' → start of expression
^ → start of string
\d → digit
{3} → exactly three times
$ → end of string
' → end of expression
This expression matches a string that consists of exactly three digits and nothing else. The digits can be any digits, in any order.
✔ Matches: “123”
✖ Does not match: “12”, “1234”
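To turn the three-digit pattern into a quick validation check, one option is re.fullmatch(), which only succeeds when the whole string fits the pattern, so the ^ and $ anchors are not needed. This is a sketch rather than the course's approach, and the function name is_three_digits is hypothetical.

```python
import re

def is_three_digits(value):
    """Hypothetical validator: True only if value is exactly three digits."""
    return re.fullmatch(r"\d{3}", value) is not None

for candidate in ["123", "12", "1234", "a23"]:
    print(candidate, is_three_digits(candidate))

# The anchored version from above gives the same answer for these inputs
anchored = re.findall(r"^\d{3}$", "123")
print(anchored)
```

One small difference to be aware of: $ also matches just before a trailing newline, so the anchored pattern accepts "123\n", whereas re.fullmatch() does not.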
The main course contents include an image summarising all the symbols from the tables above. On my first pass, it was what I used to gain a better understanding and to practise using RegEx on different text files, like the US presidents’ speeches.
Final Thoughts
By this point of the course, I felt quite exhausted by the different ways we can build an expression using all the information in the tables above. I would say that practising them is the best way to retain as much of the information as possible.
Again, as with everything we learn, it’s more about knowing you have this tool in your arsenal than memorising every piece of information. A good suggestion would be to head over to the actual course and work through the tasks in this module, as they will give you a broader perspective on what we can achieve with this.
👉 Here’s a link to the course: Regular Expressions - 30 Days of Python - Asabeneh
