I need the Python equivalent of WordPress's sanitize_text.
For the title:
'mygubbi raises $25 mn seed funding from bigbasket co founder others'
WordPress gives
"mygubbi-raises-2-5-mn-seed-funding-bigbasket-co-founder-others"
Python slugify gives
"mygubbi-raises-2-5-mn-seed-funding-from-bigbasket-co-founder-others"
I used the python-slugify library.
Am I supposed to just remove words like from, in, and, to? Where can I get those stop words?

The python-slugify library has a stopwords parameter, which can be used in conjunction with the stop word list from nltk.
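
A minimal sketch of that combination, in case it helps. The helper names (load_nltk_stopwords, slugify_without_stopwords) are made up for illustration; only slugify(..., stopwords=...) and nltk.corpus.stopwords.words() come from the libraries themselves. It assumes python-slugify is installed, and the NLTK variant additionally assumes the stopwords corpus has been downloaded.

```python
from slugify import slugify  # pip install python-slugify


def load_nltk_stopwords(language="english"):
    """Return NLTK's stop word list for `language`.

    Raises LookupError if the stopwords corpus is not installed yet;
    nltk.download('stopwords') fetches it without the interactive GUI.
    """
    from nltk.corpus import stopwords  # imported lazily so nltk stays optional
    return stopwords.words(language)


def slugify_without_stopwords(title, stop_words):
    """Slugify `title`, dropping every word that appears in `stop_words`."""
    return slugify(title, stopwords=stop_words)


title = "mygubbi raises $25 mn seed funding from bigbasket co founder others"

# A tiny hand-picked list, just to show the parameter in action.
print(slugify_without_stopwords(title, ["from", "in", "and", "to"]))

# With the NLTK corpus installed, the full English list can be passed instead:
# print(slugify_without_stopwords(title, load_nltk_stopwords()))
```

The NLTK list is considerably longer than the four words above, so it may drop more words than WordPress does; comparing a few real titles before relying on it seems worthwhile.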
After installing nltk, you can install additional corpora, one of which is stopwords. To do this, run its built-in download utility (nltk.download() opens the interactive downloader), select Corpora, scroll down to stopwords, and click the Download button.

There is a Python module called nltk that lets you do exactly this.
http://www.bogotobogo.com/python/NLTK/tokenization_tagging_NLTK.php
Just scroll down a little on this website to find the headline “Removing Stop Words”. There are examples of how to do this using this module.
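
For reference, the idea behind removing stop words can be sketched without any third-party package. This is only an approximation under stated assumptions: SAMPLE_STOP_WORDS is a small illustrative subset (not NLTK's list), the function names are hypothetical, and the character handling is a plain regex that does not try to reproduce WordPress's rules.

```python
import re

# Illustrative subset only; NLTK's English list is much longer.
SAMPLE_STOP_WORDS = {"from", "in", "and", "to", "the", "of", "a"}


def remove_stop_words(text, stop_words=SAMPLE_STOP_WORDS):
    """Return the lowercase alphanumeric tokens of `text` that are not stop words."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return [token for token in tokens if token not in stop_words]


def simple_slug(text, stop_words=SAMPLE_STOP_WORDS):
    """Join the remaining tokens with hyphens to form a slug."""
    return "-".join(remove_stop_words(text, stop_words))


print(simple_slug("mygubbi raises $25 mn seed funding from bigbasket co founder others"))
```

Once the corpus is installed, passing set(nltk.corpus.stopwords.words('english')) as stop_words swaps in the NLTK list.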