BeautifulSoup: Parsing bad WordPress HTML

So I need to scrape a site using Python, but the markup is inconsistent and unstructured, which is proving hard to work with.

For example

<p style='font-size: 24px;'>
    <strong>Title A</strong>
</p>
<p>
    <strong> First Subtitle of Title A </strong>
    "Text for first subtitle"
</p>

Then it will switch to

<p>
    <strong style='font-size: 24px;'> Second Subtitle for Title B </strong>
</p>

Then sometimes a new title is tacked onto the end of the previous section's text:

<p>
    ...title E's content finishes 
    <strong>
        <span id="inserted31" style="font-size: 24px;"> Title F </span>
    </strong>
</p>
<p>
    <strong> First Subtitle for Title F </strong> 
</p>

Enough confusion; it's simply poor markup. An obvious pattern such as 'font-size: 24px;' can find the titles, but there isn't a solid, reusable method to scrape the children and associate them with their title.

Regex might work, but I feel the randomness would result in scraping patterns that are too specific and not DRY.

I could offer to rewrite the HTML and fix the hierarchy. However, this being a WordPress site, I fear the content might come back as incompatible with the WordPress admin interface.

Any suggestions for either a better scraping method or a way to go about this in WordPress would be greatly appreciated. I want to avoid just copying/pasting as much as possible.


1 comment

At the very least, you can rely on the tag names and text and navigate the DOM tree horizontally, that is, sideways between siblings. All the tags you are showing are strong, p and span (with an id attribute set).

For example, you can get the strong text and get the following sibling:

>>> from bs4 import BeautifulSoup
>>> data = """
... <p style='font-size: 24px;'>
...     <strong>Title A</strong>
... </p>
... <p>
...     <strong> First Subtitle of Title A </strong>
...     "Text for first subtitle"
... </p>
... """
>>> soup = BeautifulSoup(data, "html.parser")
>>> titles = soup.find_all('strong')
>>> titles[0].text
'Title A'
>>> titles[1].get_text(strip=True)
'First Subtitle of Title A'
>>> titles[1].next_sibling.strip()
'"Text for first subtitle"'
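
If you want to go one step further and attach everything to its title, one possible approach is to walk the text nodes in document order and classify each one by its ancestors. The sketch below is only an illustration under assumptions taken from the question: anything inside a tag styled with 'font-size: 24px' is treated as a title (so the 24px "Second Subtitle" variant would also count as a title), other text inside strong is a subtitle, and the rest is body text. The names TITLE_STYLE, classify and group_by_title are made up for this example.

```python
import re
from bs4 import BeautifulSoup

# Assumption from the question: titles are the pieces styled at 24px.
TITLE_STYLE = re.compile(r"font-size:\s*24px")


def classify(text_node):
    """Return 'title', 'subtitle' or 'text' based on the node's ancestors."""
    in_strong = False
    for parent in text_node.parents:
        if TITLE_STYLE.search(parent.get("style", "")):
            return "title"
        if parent.name == "strong":
            in_strong = True
    return "subtitle" if in_strong else "text"


def group_by_title(html):
    """Group subtitles and body text under the most recent title seen."""
    soup = BeautifulSoup(html, "html.parser")
    sections = []
    for node in soup.find_all(string=True):
        text = node.strip()
        if not text:
            continue  # skip whitespace-only nodes between tags
        kind = classify(node)
        if kind == "title":
            sections.append({"title": text, "items": []})
        elif sections:
            # Text that appears before the first title is dropped.
            sections[-1]["items"].append((kind, text))
    return sections


sample = """
<p style='font-size: 24px;'>
    <strong>Title A</strong>
</p>
<p>
    <strong> First Subtitle of Title A </strong>
    "Text for first subtitle"
</p>
<p>
    ...title A's content finishes
    <strong>
        <span id="inserted31" style="font-size: 24px;"> Title F </span>
    </strong>
</p>
<p>
    <strong> First Subtitle for Title F </strong>
</p>
"""

sections = group_by_title(sample)
for section in sections:
    print(section["title"], section["items"])
```

Because the classification looks at ancestors rather than at a fixed tag layout, it does not matter whether the 24px style sits on the p, the strong or a nested span, or whether a new title is glued onto the end of the previous paragraph.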
