Python elasticsearch doesn't strip all HTML or stopwords in WordPress

I dumped my WordPress posts into Elasticsearch, but when I request term suggestions I still get stopwords and HTML elements back, for example "the", "a", or even the p tag. I already specified these filters in the index settings.

Here’s my code.

es.indices.create(
    index='wp-posts',
    body={
        'settings': {
            # just one shard, no replicas for testing
            'number_of_shards': 1,
            'number_of_replicas': 0,

            # custom analyzers for the post content
            'analysis': {
                'analyzer': {
                    'my_analyzer': {
                        'type': 'standard',
                        'stopwords': '_english_'
                    },
                    'wordpress_content': {
                        'type': 'custom',
                        'tokenizer': 'standard',
                        'filter': ['html_strip']
                    }
                }
            }
        }
    },
    # Will ignore 400 errors, remove to ensure you're prompted
    ignore=400
)
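A possible cause, sketched here under stated assumptions rather than as a confirmed fix: in Elasticsearch `html_strip` is a character filter, not a token filter, so it belongs under `char_filter`. The `wordpress_content` analyzer in the question also has no stop filter, and neither analyzer is attached to the `content` field. The helper name `build_index_body` is hypothetical, and the mapping uses the typeless form from Elasticsearch 7 and later; older versions nest it under a document type and use `string` instead of `text`.

```python
def build_index_body():
    """Return a hypothetical index body with HTML stripping and stopword removal."""
    return {
        'settings': {
            'number_of_shards': 1,
            'number_of_replicas': 0,
            'analysis': {
                'analyzer': {
                    'wordpress_content': {
                        'type': 'custom',
                        # html_strip is a character filter: it runs before tokenizing
                        'char_filter': ['html_strip'],
                        'tokenizer': 'standard',
                        # token filters run after tokenizing; 'stop' defaults to English
                        'filter': ['lowercase', 'stop'],
                    }
                }
            },
        },
        # Assumption: typeless mapping (Elasticsearch 7+); adjust for older versions
        'mappings': {
            'properties': {
                'content': {'type': 'text', 'analyzer': 'wordpress_content'}
            }
        },
    }


body = build_index_body()
analyzer = body['settings']['analysis']['analyzer']['wordpress_content']
print(analyzer['char_filter'], analyzer['filter'])
```

The body would be passed as `es.indices.create(index='wp-posts', body=build_index_body())`. Analyzers are applied at index time, so an existing index has to be deleted and the posts re-indexed. Note also that `ignore=400` hides the "index already exists" error, so changed settings may silently never be applied.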

And this is how I request suggestions, unless I am doing something wrong:

result = es.suggest(index="wp-posts", body={"my_suggestion": {"text": post['content'], "term": {"field":"content" }}})
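The term suggester draws its suggestions from the terms indexed in the field, and it analyzes the input text with that field's search analyzer unless an `analyzer` option is given. As a minimal sketch, assuming an analyzer named `wordpress_content` that strips HTML and stopwords exists on the index, the suggest body could name it explicitly. The helper `build_suggest_body` and the sample text are hypothetical.

```python
def build_suggest_body(text, field='content', analyzer='wordpress_content'):
    """Build a term-suggester body that names the analyzer for the input text."""
    return {
        'my_suggestion': {
            'text': text,
            'term': {'field': field, 'analyzer': analyzer},
        }
    }


suggest_body = build_suggest_body('<p>The quick brown fox</p>')
# Newer clients have no es.suggest; the same body goes under a 'suggest' key
# in a regular search request instead.
search_body = {'suggest': suggest_body}
print(search_body)
```

With an older client this would be sent as `es.suggest(index='wp-posts', body=suggest_body)`, and with a newer one as `es.search(index='wp-posts', body=search_body)`. Setting the analyzer here only affects the input text; the indexed terms stay as they were analyzed when the posts were indexed.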
