Python equivalent to wordpress sanitize_text

I need the Python equivalent of WordPress's sanitize_text.

For the title:

'mygubbi raises $25 mn seed funding from bigbasket co founder others'

WordPress gives

"mygubbi-raises-2-5-mn-seed-funding-bigbasket-co-founder-others"

Python slugify gives

"mygubbi-raises-2-5-mn-seed-funding-from-bigbasket-co-founder-others"

I used the python-slugify library.

Am I supposed to just remove words like from, in, and, to? Where can I get a list of those stop words?

2 comments

  1. The python-slugify library has a stopwords parameter which can be used in conjunction with nltk as follows:

    from slugify import slugify
    from nltk.corpus import stopwords
    
    text = 'mygubbi raises $25 mn seed funding from bigbasket co founder others'
    # print is a function in Python 3
    print(slugify(text, stopwords=stopwords.words('english')))
    

    This would print:

    mygubbi-raises-25-mn-seed-funding-bigbasket-co-founder-others
    

    After installing nltk, you can install additional corpora, one of which is the stopwords corpus. To do this, run the built-in download utility as follows:

    import nltk
    
    # Opens the interactive downloader; nltk.download('stopwords') fetches only that corpus
    nltk.download()
    

    Select Corpora, scroll down to stopwords and click the Download button.
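
If pulling in nltk just for a word list feels heavy, the same idea can be sketched with the standard library alone. This is only an illustration, not how python-slugify or WordPress work internally: the function name simple_slug and the tiny STOP_WORDS set are hypothetical, and a real stop-word list would be much longer.

```python
import re

# Hypothetical, deliberately small stop-word set for illustration only;
# a real list (such as NLTK's English list) contains far more words.
STOP_WORDS = {"from", "in", "and", "to", "the", "of", "a", "an"}


def simple_slug(text, stop_words=STOP_WORDS):
    """Lowercase the text, drop stop words, and join the rest with hyphens."""
    # Turn every run of characters other than a-z and 0-9 into one space,
    # then split on whitespace to get the individual words.
    words = re.sub(r"[^a-z0-9]+", " ", text.lower()).split()
    kept = [word for word in words if word not in stop_words]
    return "-".join(kept)


title = 'mygubbi raises $25 mn seed funding from bigbasket co founder others'
print(simple_slug(title))
```

For the title in the question, this sketch drops "from" and keeps everything else, so which words disappear depends entirely on the stop-word set you pass in.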