Computers do not understand human language, and raw text usually carries noise that distorts textual analysis, so we clean text data programmatically before processing it. Converting all characters to lowercase, removing punctuation and stop words, and correcting typos gets rid of these unhelpful portions of the data.
Today, we'll look at how to remove punctuation from a string in Python using several different methods, with code for each.
The results of any text-processing strategy depend on this crucial NLP preprocessing phase, which sits alongside steps such as splitting the text into sentences, paragraphs, and phrases. Adding a text preprocessing layer to tasks like sentiment analysis, document categorization, and document retrieval based on user queries can improve accuracy.
What are String Punctuations in Python?
Punctuation marks are symbols that give written English its grammatical structure. When processing text, however, it is often necessary to remove or replace them. Which marks to discard depends on the use case, so choose that list with care.
Before learning how to get rid of them, we must know how punctuation characters are defined in Python.
string.punctuation is a pre-defined constant in Python's string module that contains all ASCII punctuation characters, including quotation marks, brackets, and slashes.
It is a convenient way to reference all of these characters in one place, rather than typing out each one. It contains the following characters:
import string
string.punctuation
Output:
'!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~'
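As a quick check of what the constant covers, the sketch below tests a few characters for membership. The sample characters are illustrative choices made for this sketch; note that string.punctuation holds only ASCII symbols, so typographic marks such as curly quotes are not in it.

```python
import string

# Illustrative sample characters (chosen for this sketch)
samples = ['!', '_', '“', '©', 'a']

# Membership test against the ASCII punctuation constant
membership = {ch: ch in string.punctuation for ch in samples}
print(membership)
print(len(string.punctuation))  # 32 ASCII punctuation characters
```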
You can also add other special symbols that you want to discard to the punctuation list, for example '©', '®', '¾', and '¡'.
import string

regular_punct = list(string.punctuation)   # ASCII punctuation defined by Python
special_punct = ['©', '®', '¾', '¡']        # user-defined special characters to remove

def remove_punctuation(text, punct_list):
    # Replace each listed character with a space, then collapse repeated whitespace
    for punc in punct_list:
        text = text.replace(punc, ' ')
    return ' '.join(text.split())

a = remove_punctuation(" Hello!! welcome to Favetutor blogs©", regular_punct)
b = remove_punctuation(a, special_punct)
print("Sentence after removing python punctuations", a)
print("Sentence after removing special punctuations", b)
Output:
Sentence after removing python punctuations Hello welcome to Favetutor blogs©
Sentence after removing special punctuations Hello welcome to Favetutor blogs
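When the text may contain punctuation beyond ASCII and you would rather not list every symbol by hand, one option is the standard unicodedata module. This is a sketch of an alternative approach, not the method used above: strip_unicode_punct is a hypothetical helper name, and it drops every character whose Unicode general category starts with 'P' (punctuation). Characters such as ©, $ or + are classified as symbols rather than punctuation, so this sketch leaves them in place.

```python
import unicodedata

def strip_unicode_punct(text):
    # Keep characters whose Unicode general category is not punctuation (P*)
    return ''.join(ch for ch in text
                   if not unicodedata.category(ch).startswith('P'))

print(strip_unicode_punct("“Hello”, welcome… ¡hola!"))
```

If symbols should go as well, the same test could be widened to categories starting with 'S', depending on the use case.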
How to Remove Punctuations from Strings?
As discussed, removing punctuation from strings is a common task that every programmer should know how to do. Below, we look at 4 different methods for doing it in Python:
1) The Translation Function
It is one of the easiest ways to strip punctuation.
translate() is a string method that replaces or deletes characters according to a translation table. To remove punctuation, it is used together with str.maketrans(), which builds that table. The first two arguments of str.maketrans() are empty strings, and its third argument is a string of all the punctuation marks you want to eliminate from your string.
Syntax: string_name.translate(str.maketrans('', '', string.punctuation))
We use the maketrans() function to create a translation table that maps the punctuation characters in the punctuation constant string to None. We apply the translation table to the sample string using the translate() method, which replaces each character in the string that matches a key in the translation table with its corresponding value, or removes it if the value is None.
Let's understand with an example:
import string

text = "I'm Komal from Favtutor, hello. How may I assist you??"
translating = str.maketrans('', '', string.punctuation)
new_string = text.translate(translating)
print(new_string)
Output:
Im Komal from Favtutor hello How may I assist you
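To make the translation table less abstract, the following sketch builds a small one and prints it. The two-character punctuation set and the sample sentence are assumptions chosen for illustration.

```python
# Build a small table that deletes only '!' and '?'
table = str.maketrans('', '', '!?')

# The table is a dict mapping Unicode code points to None (meaning "delete")
print(table)                      # {33: None, 63: None}

cleaned = "Really?! Yes!".translate(table)
print(cleaned)                    # Really Yes
```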
This function can also be used if you want to replace specific characters with other characters. Check the example below:
# Map each vowel to a digit: a->1, e->2, i->3, o->4, u->5
table = str.maketrans('aeiou', '12345')

string = 'This is a sample string'
translated_string = string.translate(table)
print(translated_string)
Output:
Th3s 3s 1 s1mpl2 str3ng
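Combining the two ideas, punctuation can be mapped to spaces instead of being deleted, which keeps hyphenated or comma-joined words from running together. This is a sketch under the assumption that a space is an acceptable stand-in for every punctuation mark; the sample text is made up for illustration.

```python
import string

# Map each punctuation character to a space (both strings must have equal length)
to_space = str.maketrans(string.punctuation, ' ' * len(string.punctuation))

text = "state-of-the-art,well-known"
spaced = text.translate(to_space)
print(spaced)                      # state of the art well known

# Collapse any repeated spaces left behind by adjacent punctuation
normalized = ' '.join(spaced.split())
print(normalized)
```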
2) Using the Loop & Replace function
This is a standard way of doing the operation with a plain loop and no imported modules. Here we iterate over the string to check for punctuation characters and replace each one with an empty string using the replace() method. It is a brute-force way to complete the task.
Here is an example:
test_str = "Hi, Welcome to the Favtutor live coding classes -24*7 Expert help availaible. Register Now!!"
print("The original string --> ", "\n", test_str)

# Punctuation characters to strip (raw string so the backslash stays literal)
punc_list = r'''!()-[]{};*:'"\,<>./?@_~'''

# Check each character of the original string; replace() removes every occurrence
for i in test_str:
    if i in punc_list:
        test_str = test_str.replace(i, "")

# printing updated string
print('-----------------------------------------------------------------')
print("The string after removing punctuation -->", '\n', test_str)
Output:
The original string -->
 Hi, Welcome to the Favtutor live coding classes -24*7 Expert help availaible. Register Now!!
-----------------------------------------------------------------
The string after removing punctuation -->
 Hi Welcome to the Favtutor live coding classes 247 Expert help availaible Register Now
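The loop above calls replace() once for every punctuation character it meets, and each call scans the whole string again. A single-pass variant is sketched below; it is an alternative formulation rather than the method shown above, and the shortened sample string is an assumption for illustration.

```python
import string

test_str = "Register Now!! -24*7"
punct = set(string.punctuation)   # set membership checks are fast

# Keep only characters that are not punctuation, building the result in one pass
cleaned = ''.join(ch for ch in test_str if ch not in punct)
print(cleaned)                    # Register Now 247
```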
3) Using Regex
Regex is a powerful tool for pattern matching and manipulation of text, including removing specific characters from a string. Python's re module provides the sub() function, which searches for a pattern in a string and replaces all occurrences of that pattern with a specified string.
The arguments of re.sub() are pattern, repl, string, count, and flags. Here is an example, where the pattern [^\w\s] matches any character that is neither a word character nor whitespace:
import re

string = "Hello, welcome to my blog! :)"
new_string = re.sub(r'[^\w\s]', '', string)
print(new_string)
Output:
Hello welcome to my blog
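One detail worth knowing about this pattern: \w also matches the underscore, so underscores are not removed. The sketch below shows the difference and one possible way to strip them as well; the sample string is an assumption for illustration.

```python
import re

text = "snake_case, done!"

# \w matches letters, digits, and the underscore, so '_' survives this pattern
kept = re.sub(r'[^\w\s]', '', text)
print(kept)       # snake_case done

# Adding '|_' to the pattern removes underscores as well
removed = re.sub(r'[^\w\s]|_', '', text)
print(removed)    # snakecase done
```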
4) The filter() function
The built-in filter() function keeps only the elements of an iterable that satisfy a given condition. You can understand it easily with the following example:
import string

text = "Hello, world!"   # avoid naming a variable 'str', which shadows the built-in type
punctuations = string.punctuation

def remove_punctuation(char):
    # Return True for characters that should be kept
    return char not in punctuations

clean_text = ''.join(filter(remove_punctuation, text))
print(clean_text)
Output:
Hello world
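The join() call in the example is needed because filter() does not return a string. A short sketch with a lambda in place of the named function makes this visible; the lambda form is a stylistic assumption, not a requirement.

```python
import string

text = "Hello, world!"
punct = set(string.punctuation)

# filter() returns a lazy iterator, so it must be joined back into a string
kept = filter(lambda ch: ch not in punct, text)
print(type(kept).__name__)   # filter

result = ''.join(kept)
print(result)                # Hello world
```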
What is the quickest method?
The str.translate() method is the fastest way to remove punctuation from a string in the comparison below. Speed isn't everything, of course, but code that runs noticeably slower than it needs to will often result in a worse user experience.
We can compare all the methods in terms of the time of execution of code:
import re
import string
import timeit

# Named 'text' rather than 'str' so the built-in str type is not shadowed
text = "Hi, Welcome to the Favtutor live coding classes -24*7 Expert help availaible. Register Now!!"
punctuations = set(string.punctuation)
table = str.maketrans("", "", string.punctuation)
regex = re.compile('[%s]' % re.escape(string.punctuation))

def test_join(s):
    return ''.join(ch for ch in s if ch not in punctuations)

def test_re(s):
    return regex.sub('', s)

def test_trans(s):
    return s.translate(table)

def test_repl(s):
    for c in string.punctuation:
        s = s.replace(c, "")
    return s

# Time 1,000,000 calls of each function on the same input
print("Join :", timeit.Timer('f(text)', 'from __main__ import text, test_join as f').timeit(1000000))
print("regex :", timeit.Timer('f(text)', 'from __main__ import text, test_re as f').timeit(1000000))
print("translate :", timeit.Timer('f(text)', 'from __main__ import text, test_trans as f').timeit(1000000))
print("replace :", timeit.Timer('f(text)', 'from __main__ import text, test_repl as f').timeit(1000000))
Output:
Join : 7.931632200023159
regex : 2.2031622999347746
translate : 2.171550799976103
replace : 3.7943124000448734
Hence, in this run translate() is the fastest, narrowly ahead of the precompiled regex. Exact timings will vary with the machine and the Python version.
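Because single timing runs are noisy, it can help to repeat the measurement and keep the best result. The sketch below does this for translate() only, using timeit.repeat with a callable; the batch size and repeat count are arbitrary assumptions chosen to keep the run short.

```python
import string
import timeit

text = "Hi, Welcome to the Favtutor live coding classes -24*7 Expert help availaible. Register Now!!"
table = str.maketrans('', '', string.punctuation)

# Run 5 batches of 10,000 calls each; the fastest batch is the least disturbed by noise
timings = timeit.repeat(lambda: text.translate(table), number=10_000, repeat=5)
print("translate (best of 5):", min(timings))
```

The same pattern can be applied to the other three functions to compare them on your own machine.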
Conclusion
Since punctuation often gets in the way when processing natural English strings, it is usually removed before the strings are used for further processing. We learned different methods to strip punctuation from strings in Python. Happy Learning :)