Python RegEx
A RegEx (regular expression) is a sequence of characters that forms a search pattern, which is used to find a string or a whole set of strings in text.
The syntax of regular expressions may feel a little unusual at first, but with some practice you will be able to use RegEx in Python comfortably.
re Module
In Python, the re module handles RegEx patterns. It is a tool for finding matching patterns in text, which makes it useful in many areas of programming, including competitive programming.
The module is part of the standard library and is imported with the import keyword (import re) before any of its functions are used.
Example
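A minimal sketch of such a search, assuming a made-up sample sentence and the pattern "word" (both are illustrative choices, not fixed by the text):

```python
import re

# Hypothetical sample text and pattern, chosen for illustration.
text = "Regex can find a word in this text"
search = re.search("word", text)

# re.search() returns a match object (truthy) or None (falsy).
if search:
    print("Pattern found")
else:
    print("Pattern not found")
```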
Code Explanation:
- We use the search() function of the re module to find the pattern. It takes two parameters: the pattern that we want to find, and the string in which we are searching for it.
- The if statement runs only if the pattern occurs in the given string. In that case search() returns a match object, which counts as true in a condition; otherwise it returns None.
What happens when we print the “search” variable? Let’s find out.
Example
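A small sketch of printing the result, assuming a sample sentence built so that the hypothetical pattern "word" starts at index 17:

```python
import re

# Illustrative text: "word" occupies indices 17 to 20.
text = "Regex can find a word in this text"
search = re.search("word", text)

# On recent Python 3 versions this prints something like:
# <re.Match object; span=(17, 21), match='word'>
print(search)
```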
In the output, span=(17, 21) gives the start and end positions of the matched text within the string, and match shows the substring that matched the pattern.
findall() and match() functions
findall() function
In the examples above, search() returned a match object carrying a lot of information about the result, but most of the time we do not need all of it.
Also, search() finds only the first match of the pattern: even if several parts of the string satisfy the pattern, it returns the position of the first one only.
The findall() function searches the whole string and returns a list of all the matching substrings (an empty list if there are none). Like search(), it takes two parameters: the pattern and the string.
Example
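A brief sketch of findall(), assuming an arbitrary sample sentence and the pattern "ain":

```python
import re

# Illustrative text in which the pattern occurs several times.
text = "The rain in Spain stays mainly in the plain"

matches = re.findall("ain", text)
print(matches)  # ['ain', 'ain', 'ain', 'ain']

# When nothing matches, findall() returns an empty list.
print(re.findall("xyz", text))  # []
```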
match() function
The match() function is very similar to the search() function used above, with one major difference. First, here is a Python RegEx match() example.
Example
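One possible version of this example, assuming a sample string that begins with the word "String":

```python
import re

# Illustrative text: the pattern sits at the very start.
text = "String matching with the re module"

match = re.match("String", text)
print(match)  # a match object with span=(0, 6), match='String'
```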
We searched for “String” and got a result similar to that of the search() function. So, what is the difference?
match() vs search()
There is one major difference between these functions. match() looks for the pattern only at the very beginning of the string, so a pattern that first occurs later in the string is not found.
search(), on the other hand, scans the whole string. Consider this example:
Example
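A short comparison sketch, assuming a sample string in which "String" appears only at the end:

```python
import re

# Illustrative text: the pattern is present, but not at the start.
text = "This is a String"

match_result = re.match("String", text)
search_result = re.search("String", text)

print(match_result)   # None, because the text does not start with "String"
print(search_result)  # a match object with span=(10, 16)
```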
As you can see, match() could not find the pattern because the pattern does not occur at the very start of the string.
MetaCharacters
Metacharacters are special characters that we can include in a search pattern to specify its details.
They are the most important aspect of RegEx and the reason the re module is so powerful. Consider this example:
Example
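A sketch of what this example could look like, assuming the pattern (my ){2} and a made-up sample string:

```python
import re

# Illustrative text containing "my " twice in a row.
text = "This is my my string"

# 1: the group "(my )" must occur exactly twice in a row.
pattern = r"(my ){2}"

if re.search(pattern, text):
    print("Match found")
else:
    print("No match")
```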
At 1, we use the curly brackets { }, which match an exact number of repetitions. Here we want to check whether “my” occurs in the given string 2 times in a row. In this case, it does.
The parentheses group the string “my” together. Without them, the metacharacter applies only to the single character in front of it.
Example
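The ungrouped variant might look like this, assuming the same made-up sample string and the pattern my {2}:

```python
import re

text = "This is my my string"

# Without parentheses, {2} applies only to the preceding space,
# so the pattern means "my" followed by exactly two spaces.
pattern = r"my {2}"

if re.search(pattern, text):
    print("Match found")
else:
    print("No match")  # this branch runs
```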
Here the output follows the else branch. Because “my” is no longer enclosed in parentheses, Python applies {2} only to the single character right before it, the space after “my”.
The Python re module has more, equally useful metacharacters:
MetaCharacter | Description |
{ } | Matches a specified number of occurrences of the preceding character or group. |
+ | Matches one or more occurrences of the preceding character or group. |
* | Matches zero or more occurrences of the preceding character or group. |
? | Matches zero or one occurrence of the preceding character or group. |
| | Matches either of two alternative patterns. |
[ ] | Matches any one character from a specified set. (Sets are discussed later.) |
. | Matches any single character except a newline. |
^ | Matches if the string starts with the specified character or substring. |
$ | Matches if the string ends with the specified character or substring. |
( ) | Groups characters so that Python RegEx treats them as a unit. |
\ | Introduces a special sequence (Python has several) or escapes a metacharacter. |
Let’s see these metacharacters in examples. Each example may contain several metacharacters, but do not worry: an explanation follows each one.
Example 1 of MetaCharacters
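A possible sketch of this example, assuming a made-up test_string that starts with "Everything" and ends with "practice":

```python
import re

# Illustrative text; only "Everything" is named in the explanation.
test_string = "Everything is possible with practice"

results = []
# ^ anchors the group to the start of the string.
results.append(bool(re.search(r"^(Everything)", test_string)))
# $ anchors the group to the end of the string.
results.append(bool(re.search(r"(practice)$", test_string)))

print(results)  # [True, True]
```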
Let’s understand the code:
- We enclose the string “Everything” in parentheses so that it is treated as a whole. To check whether test_string starts with “Everything”, we use the “^” metacharacter.
- To check whether the string ends with a specific word, we use the “$” metacharacter.
- A list stores the results, which makes it easier to check several patterns and print the outcomes together.
Example 2 of MetaCharacters
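A sketch along these lines, assuming a made-up test_string and the patterns h...e and (the )+:

```python
import re

# Illustrative text with "the" repeated and a five-letter word.
test_string = "the the house"

# "h", exactly three characters of any kind, then "e".
dotted = re.findall(r"h...e", test_string)
print(dotted)  # ['house']

# One or more consecutive occurrences of "the ".
repeated = re.search(r"(the )+", test_string)
print(repeated.group())  # 'the the '
```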
Let’s understand the code:
- We specify the starting letter and the ending letter of the pattern. Each dot “.” stands for exactly one character, so the number of dots fixes how many characters sit between the starting and the ending letter.
- The metacharacter “+” checks for one or more consecutive occurrences of the word “the”.
Example 3 of MetaCharacters
Suppose we first want to find out whether a piece of text contains one of two pattern values, and then want to find zero or more occurrences of another pattern value in the same string.
This can be done as shown below:
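A sketch of both checks, assuming a made-up test_string, the alternatives "Python" and "Java", and the pattern (Fun)*k:

```python
import re

# Illustrative text: it contains "Python" and a "k", but no "Fun".
test_string = "Python makes work easier"

# | matches if either alternative is present.
either = bool(re.search(r"Python|Java", test_string))

# (Fun)* allows zero occurrences of "Fun", so a bare "k" is enough.
zero_or_more = bool(re.search(r"(Fun)*k", test_string))

print(either, zero_or_more)  # True True
```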
Let’s understand the code:
- Using the pipe (|) character, we define a pattern that checks whether at least one of the two values is present in the string. In this case, it is.
- Next, we define another pattern that checks whether the string has zero or more occurrences of “Fun” followed by a “k” character. In this case there were zero, but that still counts as a match.
RegEx Special Sequences and Sets
Special Sequences - backslash "\"
The backslash “\” introduces a special sequence, or marks the following metacharacter as a literal character. The character that follows the backslash determines the meaning. Here is an example:
Example
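A minimal sketch using \d, assuming a made-up sample string that contains a few digits:

```python
import re

# Illustrative text with digits in it.
text = "Room 42 is on floor 3"

# \d matches a single digit; a raw string keeps the backslash intact.
if re.search(r"\d", text):
    print("The string contains a digit")

digits = re.findall(r"\d", text)
print(digits)  # ['4', '2', '3']
```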
Let’s understand the code:
- The backslash introduces the special sequence: “\d” defines a pattern that matches any digit from 0 to 9 in the given string.
- We simply check whether the pattern is present in the string and print the result.
The special sequences available in Python are listed below:
Character | Description |
\d | Matches any digit from 0 to 9. |
\D | Matches any character that is not a digit. |
\b | Matches the empty position at the beginning or end of a word (a word boundary). |
\B | Matches an empty position that is not at the beginning or end of a word. |
\s | Matches any whitespace character (space, tab, newline, and so on). |
\S | Matches any character that is not whitespace. |
\A | Matches only at the start of the string. |
\w | Matches any word character: a letter, a digit from 0 to 9, or an underscore. |
\W | Matches any character that is not a word character, for example a whitespace or a comma. |
\Z | Matches only at the end of the string. |
Let us have an example to understand it better.
Example
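A sketch that exercises several special sequences, assuming a made-up sample string that ends in "nova" and contains "un" both at the start of a word and inside one:

```python
import re

# Illustrative text chosen to fit the patterns below.
text = "under the sun shines a supernova"

# \Z: "nova" must sit at the very end of the string.
at_end = re.findall(r"nova\Z", text)
print(at_end)  # ['nova']

# \s: every whitespace character.
spaces = re.findall(r"\s", text)
print(len(spaces))  # 5

# \w: every word character (letters, digits, underscore).
word_chars = re.findall(r"\w", "a_1 !")
print(word_chars)  # ['a', '_', '1']

# \B: "un" only where it does not start a word ("sun", not "under").
inside = re.findall(r"\Bun", text)
print(inside)  # ['un']
```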
Let’s understand the code:
- We use the ‘r’ prefix to make the pattern a raw string. Without it, Python may interpret a backslash sequence itself before the re module sees it (for example, “\b” becomes a backspace character), and the pattern will not work as intended.
- A pattern is created that tells us whether “nova” is present at the end of the string.
- “\s” is used to get all of the whitespace characters in the string.
- We use “\w” to get all of the word characters in the specified string.
- The string “un” is checked with “\B”, which matches only when “un” is not at the beginning of a word but somewhere inside one.
Try the other special sequences as well to get a better feel for how the backslash works.
Sets - Square Brackets "[ ]"
Square brackets are used to match any one character from a specific set. Most metacharacters placed inside the square brackets (more formally, a set) lose their special meaning and are treated as ordinary characters.
Sets can be defined in these ways:
[0-9] – Matches any single digit from 0 to 9.
[1-4][0-2] – Matches a two-digit sequence whose first digit is between 1 and 4 and whose second digit is between 0 and 2 (10, 11, 12, 20, 21, 22, and so on up to 42). It does not cover every number from 10 to 42. You can change the ranges, but each bracket always matches exactly one character.
[\\] – Matches a literal backslash “\”. Inside a set the backslash still acts as the escape character in Python, so it has to be doubled.
[cdf] – Matches any one of the characters ‘c’, ‘d’, or ‘f’.
[^cdf] – Matches any character except ‘c’, ‘d’, or ‘f’.
[c-f] – Matches any lowercase letter from ‘c’ to ‘f’.
[653] – Matches any one of the digits ‘6’, ‘5’, or ‘3’.
Let us have an example:
Example
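A short sketch of several sets in action, assuming made-up sample strings:

```python
import re

# Illustrative text with letters and digits.
text = "cafe 365"

letters = re.findall(r"[cdf]", text)
print(letters)  # ['c', 'f']

digits = re.findall(r"[0-9]", text)
print(digits)  # ['3', '6', '5']

# A leading ^ inside the brackets negates the set.
others = re.findall(r"[^cdf]", "cafe")
print(others)  # ['a', 'e']

# Each bracket matches one character, so 13 and 50 are skipped.
pairs = re.findall(r"[1-4][0-2]", "10 13 42 50")
print(pairs)  # ['10', '42']

# A doubled backslash inside a set matches one literal backslash.
slashes = re.findall(r"[\\]", r"a\b")
print(len(slashes))  # 1
```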
As you can see, the sets helped return the specific characters in the string that matched the search pattern.
Functions in the re Module
The re module has many more functions than the ones used above. They can make our code considerably more convenient to write. Let us look at them one by one.
compile() Function
The re module has a compile() function that compiles a pattern into a pattern object.
This object can then be used directly to search a string. Consider this example:
Example
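A sketch of compile(), assuming the pattern \d+ and made-up sample strings; only the variable name patternCompile comes from the explanation:

```python
import re

# Illustrative pattern: one or more digits.
pattern = r"\d+"
patternCompile = re.compile(pattern)

# The compiled object exposes findall() and can be reused freely.
first = patternCompile.findall("3 apples and 12 pears")
second = patternCompile.findall("no digits here")

print(first)   # ['3', '12']
print(second)  # []
```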
Let’s understand the code:
- Using the compile() function, we compile the pattern defined above and store the result in the variable patternCompile. This variable now holds a compiled pattern object.
- Next, we call findall() on that object and pass in our string to get the results.
Bonus Tip
- If you have multiple strings to search, a compiled pattern saves you from passing the pattern to findall() again every time.
split() Function
The split() function splits the string wherever a match is found and returns the pieces as a list. It is very useful in programming, especially in competitive programming.
Example
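A minimal sketch of split(), assuming a made-up sample sentence:

```python
import re

# \s matches a single whitespace character.
pattern = re.compile(r"\s")

# Illustrative text with single spaces between words.
words = pattern.split("Split this string here")
print(words)  # ['Split', 'this', 'string', 'here']
```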
Let’s understand the code:
- We compile the “\s” special sequence, which matches a whitespace character, and store it in the “pattern” variable.
- Using the pattern object, we call split(). It divides the string wherever a match is found (in this case, at every whitespace).
Output explanation:
The re.split() function can take two more parameters, “maxsplit” and “flags”. The maxsplit parameter sets the maximum number of splits; the default of 0 means no limit.
The flags parameter modifies how the pattern is matched. Its default value is zero; setting it to, for example, flags=re.IGNORECASE makes matching case-insensitive. We will keep the default for now.
Example
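A sketch of maxsplit, assuming a made-up sample string and a limit of 2:

```python
import re

# Illustrative text with four words.
text = "one two three four"

# At most two splits: the rest of the string stays in the last item.
limited = re.split(r"\s", text, maxsplit=2)
print(limited)  # ['one', 'two', 'three four']
```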
escape() Function
The escape() function returns a copy of the string with special characters escaped by backslashes (older Python versions escape every non-alphanumeric character).
It is useful when a literal string that you want to match contains characters that would otherwise be read as RegEx metacharacters.
Example
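A sketch of escape(), assuming a made-up literal that contains several metacharacters. The exact escaped text can vary between Python versions, so the sketch checks the match rather than the escaped string:

```python
import re

# Illustrative literal containing $, ., ( and ).
price = "$5.00 (approx.)"
text = "It costs $5.00 (approx.) today"

# Used as-is, "$" is read as an end-of-string anchor, so nothing matches.
unescaped = re.search(price, text)
print(unescaped)  # None

# After escaping, every character is matched literally.
escaped = re.escape(price)
found = re.search(escaped, text)
print(found.group())  # $5.00 (approx.)
```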
sub() Function
The sub() function replaces every part of a string that matches the specified pattern with another string.
Example
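A sketch of this clean-up, assuming a made-up sample string with uneven spacing:

```python
import re

# Illustrative text with runs of several spaces.
text = "Too    many   spaces  here"

# Replace every run of two or more whitespaces with one space.
cleaned = re.sub(r"\s{2,}", " ", text)
print(cleaned)  # Too many spaces here
```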
We have a string with an uncertain number of consecutive whitespaces, but we want a string with no more than one whitespace at a time. This situation comes up often in real-life projects. How did we solve it?
Let’s understand the code:
- “\s” matches a whitespace character, and the “{2,}” that follows it restricts the match to runs of two or more consecutive whitespaces.
- In the sub() function, we first pass the match pattern, then the replacement value (in this case a single whitespace, ” “), and lastly the string on which we want to perform this operation.
So, with the help of sub(), we have replaced every run of consecutive whitespaces with a single whitespace.
subn() Function
The sub() and subn() functions are almost identical; the only difference is in the way they present the output.
Let’s take an example to understand the output of subn() function.
Example
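A sketch of subn(), assuming the same kind of made-up sample string with uneven spacing:

```python
import re

# Illustrative text with three runs of extra spaces.
text = "Too    many   spaces  here"

result = re.subn(r"\s{2,}", " ", text)
print(result)     # ('Too many spaces here', 3)
print(result[0])  # Too many spaces here
```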
Unlike sub(), it returns a tuple containing the new string and the number of replacements made.
To access only the new string, we index into the tuple.
That was it for Python RegEx. As always, keep practicing and you will become a master in Python.