Python is a large language with hundreds of libraries and built-in functions that make everyday tasks easier. With them, code that takes many lines in other programming languages can often be reduced to one line or just a few. In this tech blog, we will learn several ways to check whether a Python string contains a substring.
Why do we check if a String contains a Substring?
There are two main reasons to perform this task:
- It is very useful in text analysis, and when you have to retrieve a particular string from a text file in order to perform specific operations on it.
- It can be used to find data in a pandas DataFrame, which is handy for data scientists who want to check whether a particular string exists in their data.
Working with strings is important to learn for any beginner Python programmer because it has many applications in the real world. Let's now learn about approaches to check if a string contains a substring in Python.
Approach 1) Membership Operators
Python has many comparison operators, but membership operators do a different job: they test whether a value is present in a container. They work on strings, lists, tuples, and sets.
There are two membership operators, "in" and "not in". For strings they tell us whether one string occurs inside another, and for lists, tuples, and sets they tell us whether an item is one of the elements. They return either True or False. The operator "not in" is defined to have the inverse truth value of "in".
Let's understand how it works with a simple example:
string = "Hi I am a favtutor student"
print("student" in string)
print("Hi" not in string)
Output:
True
False
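The same two operators work on other containers as well. Here is a minimal sketch; the sample values are made up for illustration and are not part of the example above:

```python
# Sample data invented for illustration
languages = ["python", "java", "go"]   # list
point = (3, 4)                         # tuple
vowels = {"a", "e", "i", "o", "u"}     # set

in_list = "python" in languages        # "python" is one of the list elements
in_tuple = 5 in point                  # 5 is not one of the tuple elements
not_in_set = "z" not in vowels         # "z" is absent from the set

# For strings, "in" looks for a run of characters, not for a whole word
partial = "fav" in "Hi I am a favtutor student"

print(in_list, in_tuple, not_in_set, partial)
```

Note the difference in the last check: with a string on the right-hand side, "in" matches any part of the text, while with a list it only matches complete elements.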
Let's say we have a text file named demo.txt and we have to check whether it contains the word 'technology'. We first look at the contents of the text file:
"Though man had been playing with science for ages, the term ‘technology’ was conceived in the early 20th century. It is inspired by the German terms “Technik” and “technologie”. An American sociologist Read Bain gave the first definition of technology in 1937 in his journal titled “Technology and State Government”.
Bain wrote that all the tools, machinery, equipment, clothes, etc are technology along with the methods we use to produce them. From time to time ‘technology’ had been defined by several scientists and engineers; nevertheless, the basic concept has always remained the same.
Technology has made life easy and convenient. Before modern-day technology, carrying out everyday work was so cumbersome and difficult. It would take a whole day to just travel a couple of miles. Sending messages also wasn’t as easy as it is today."
Code:
# Open the file and read its whole contents into one string
with open("demo.txt", "r") as f:
    lines = f.read()

print("Is technology present in file", "technology" in lines)
Output:
Is technology present in file True
Text read from an external source brings its own challenges for substring checks. The “in” operator is case sensitive and looks for an exact match, i.e., every character of the substring must appear in the string, in the same order and in the same case.
To get accurate results, we usually normalize the text first: converting it to lowercase with lower() removes case sensitivity, strip() removes spaces and newline characters from the beginning and end of the string, and split() breaks the text into a list of words.
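Here is a small sketch of those clean-up steps. The sample text is invented for illustration (it is not the demo.txt file), and the variable names are our own:

```python
# Sample text invented for illustration: mixed case and stray whitespace
raw = "  Technology has made life easy.\nModern TECHNOLOGY is everywhere.  \n"

# Case-sensitive check: lowercase "technology" never appears in raw
exact = "technology" in raw

# strip() trims whitespace at both ends, lower() removes case differences
normalized = raw.strip().lower()
ignore_case = "technology" in normalized

# split() with no arguments splits on any run of whitespace
words = normalized.split()

# List membership matches whole items only; string membership matches any part
whole_word = "tech" in words
substring = "tech" in normalized

print(exact, ignore_case, whole_word, substring)
```

One thing to keep in mind: split() does not remove punctuation, so the word list here contains "easy." with its full stop still attached.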
Approach 2) Regex
Regex stands for Regular Expressions. It is a pattern language, available in Python through the built-in re module, and it is useful when you are searching for a pattern rather than fixed text: for example, a substring followed by a punctuation mark, or words that begin with a given substring and continue with one or more additional letters.
Here is a very simple example:
import re

print(re.search(r"tech\w+", lines))
In the above code, we searched for the pattern “tech\w+”. Here \w matches any single word character (a letter, a digit, or an underscore), and the plus quantifier (+) means "one or more of the preceding item". Together, the pattern matches "tech" followed by one or more word characters. As a result, we got the output below.
Output:
<re.Match object; span=(13, 23), match='technology'>
Here, we can see that the .search() function returns not only the matched string but also the starting and ending indexes of the match. In this case, the match is "technology", and its first occurrence starts at index 13 and ends at index 23 (the end index is exclusive).
Let's see how to find a string ending with punctuation:
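If you want the pieces of the match as plain values rather than the printed Match object, you can ask the object for them. A minimal sketch, using a short sample sentence of our own instead of demo.txt:

```python
import re

# Sample sentence invented for illustration
text = "The term technology was conceived in the early 20th century."

match = re.search(r"tech\w+", text)

if match is not None:            # re.search returns None when nothing matches
    word = match.group()         # the matched text
    start, end = match.span()    # start index and exclusive end index
    print(word, start, end)
    print(text[start:end])       # slicing with the span gives the match back

# A search with no match returns None, so always check before calling .group()
missing = re.search(r"tech\w+", "nothing to find")
print(missing)
```

Checking for None first matters because calling .group() on a failed search raises an AttributeError.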
print(re.search(r"technology[\.,]", lines))
Output:
<re.Match object; span=(947, 958), match='technology,'>
re.search() returns only the first match, here the first "technology" that is followed by a punctuation mark. You can use the .findall() function instead of .search() to find every matching substring that ends with punctuation.
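As a rough sketch of the difference, here is .findall() on a short sample sentence of our own (not the demo.txt contents). re.finditer() is shown too, as one way to get the position of every match:

```python
import re

# Sample sentence invented for illustration
text = "Bain defined technology, and technology. Later, technology evolved."

# findall returns a list of every matching string, not just the first one
endings = re.findall(r"technology[.,]", text)
print(endings)   # ['technology,', 'technology.']

# finditer yields one Match object per match, so positions are available too
starts = [m.start() for m in re.finditer(r"tech\w+", text)]
print(starts)
```

The third "technology" in the sample is followed by a space, so it is not in the first list, but it is still found by the second pattern.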
Approach 3) __contains__()
Python strings also have an instance method, __contains__(), that checks for a substring. It returns a boolean: True if the string contains the substring, and False otherwise. This is the method that the "in" operator calls behind the scenes, so in everyday code "in" is the more readable choice. Let us dive directly into its example:
with open("demo.txt", "r") as f:
    lines = f.read()

# checking in the whole file
print(lines.__contains__('technology'))

# calling __contains__() through the str class
print(str.__contains__("favtutor is an Edutech platform which provides us with top instructors and 24/7 live sessions.", 'us'))
Output:
True
True
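Where __contains__() becomes genuinely useful is in your own classes: if a class defines it, the "in" operator works on its objects. The Sentence class below is hypothetical, invented only to illustrate the idea:

```python
class Sentence:
    """Hypothetical wrapper class, invented here for illustration."""

    def __init__(self, text):
        self.text = text

    def __contains__(self, item):
        # Called for "item in sentence"; this version ignores case
        return item.lower() in self.text.lower()


s = Sentence("Favtutor is an Edutech platform")

found = "EDUTECH" in s        # handled by Sentence.__contains__
missing = "python" not in s   # "not in" is the inverse of the same call

# For plain strings, "in" and __contains__() give the same answer
same = ("us" in "trust us") == "trust us".__contains__("us")

print(found, missing, same)
```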
Finding a Substring in Pandas Dataframe
If you are a data scientist or a student practicing data science on a daily basis, you will often work with CSV or Excel data. To check for a substring in this kind of data, you can use the methods described above, but pandas provides a more convenient method for finding a substring in a DataFrame column.
Example:
# importing library
import pandas as pd

# reading the dataset (raw string so the backslashes in the Windows path are kept as-is)
d = pd.read_csv(r'F:\projects\Favtutor\SPAM text message 20170820 - Data.csv')

# using the first 500 rows of the dataset for finding a substring
data = d.head(500)

# rows whose Message contains the text "membership"
substring = data[data.Message.str.contains("membership")]
print("Occurrence of word membership", "\n", substring.Message, "\n")

# rows whose Message matches the regex member\w+
substring2 = data[data.Message.str.contains(r"member\w+")]
print("Messages matching the member pattern", "\n", substring2.Message)
Output:
Occurrence of word membership
 12    URGENT! You have won a 1 week FREE membership ...
Name: Message, dtype: object

Messages matching the member pattern
 12     URGENT! You have won a 1 week FREE membership ...
356    Thank You for calling.I forgot to say Happy Onam...
Name: Message, dtype: object
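We have used a Kaggle spam message dataset above to demonstrate the substring functions. str.contains() works on a column of any size and returns a boolean Series, with True for each row whose text contains the substring; using that Series to index the DataFrame, as above, returns the complete rows that match. The best part is that the pattern is treated as a regular expression by default, so you can use regex inside this function.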
In short, to determine which entries in a Pandas DataFrame contain a substring, use str.contains().
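If you don't have the Kaggle file at hand, you can try the same idea on a tiny DataFrame built in memory. This is only a sketch: the rows below are made up, and the column name "Message" is simply borrowed from the example above. It also shows three optional parameters of str.contains(): case, na, and regex.

```python
import pandas as pd

# Tiny made-up dataset standing in for the CSV file
data = pd.DataFrame({
    "Message": [
        "URGENT! You have won a FREE membership",
        "Remember to call me",
        "See you at 5",
        None,   # a missing value, common in real data
    ]
})

# case=False ignores case; na=False counts missing values as "no match"
mask = data["Message"].str.contains("member", case=False, na=False)
matches = data[mask]
print(matches)

# regex=False treats the pattern as plain text instead of a regular expression
literal = data["Message"].str.contains("5", regex=False, na=False)
print(literal.tolist())
```

Passing na=False is worth the habit: without it, missing values stay missing in the result, and using such a result to filter rows can raise an error.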
Conclusion
In this article, you learned that the best way to check whether a string contains a substring in Python is to use the "in" membership operator. You also saw regular expressions, __contains__(), and pandas str.contains() for the cases that need more, so you now know how to pick the best approach when you’re working with substrings in Python.