BeautifulSoup: Parsing bad WordPress HTML

So I need to scrape a site using Python, but the markup is inconsistent and unstructured, which is proving hard to work with.

For example:

<p style='font-size: 24px;'>
    <strong>Title A</strong>
</p>
<p>
    <strong> First Subtitle of Title A </strong>
    "Text for first subtitle"
</p>

Then it switches to:

<p>
    <strong style='font-size: 24px;'> Second Subtitle for Title B </strong>
</p>

Then sometimes a new title is appended to the end of the previous subtitle's text:

<p>
    ...title E's content finishes 
    <strong>
        <span id="inserted31" style="font-size: 24px;"> Title F </span>
    </strong>
</p>
<p>
    <strong> First Subtitle for Title F </strong> 
</p>

In short, it's simply poor markup. Obvious patterns such as 'font-size: 24px;' can find the titles, but I don't see a solid, reusable way to scrape the children and associate them with their title.
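
As a rough sketch of that pattern-based title lookup, assuming the 24px inline style is the only reliable marker: the sample markup below is condensed from the snippets above, and TITLE_STYLE is a name made up for illustration.

```python
import re
from bs4 import BeautifulSoup

# Condensed sample based on the snippets above (not the real page).
html = """
<p style='font-size: 24px;'><strong>Title A</strong></p>
<p><strong> First Subtitle of Title A </strong>"Text for first subtitle"</p>
<p><strong style='font-size: 24px;'> Title B </strong></p>
<p>...title E's content finishes
  <strong><span id="inserted31" style="font-size: 24px;"> Title F </span></strong>
</p>
"""

# Hypothetical pattern: tolerate optional whitespace after the colon.
TITLE_STYLE = re.compile(r"font-size:\s*24px")

soup = BeautifulSoup(html, "html.parser")

# Match any tag (p, strong or span) whose style attribute contains the pattern.
title_texts = [tag.get_text(strip=True) for tag in soup.find_all(style=TITLE_STYLE)]
print(title_texts)
```

This only locates the titles; it does nothing yet to attach subtitles or text to them.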

Regex might work, but I feel like the inconsistency would lead to scraping patterns that are too specific and not DRY.

I could offer to rewrite the HTML and fix the hierarchy. However, since this is a WordPress site, I fear the rewritten content might be incompatible with the WordPress admin interface.

Any suggestions for either a better scraping method or a way to approach this through WordPress would be greatly appreciated. I want to avoid copying and pasting as much as possible.

1 comment

  1. At the very least, you can rely on the tag names and text and navigate the DOM tree horizontally, moving between siblings. All the tags you are showing are strong, p and span (with an id attribute set).

    For example, you can get the strong text and get the following sibling:

    >>> from bs4 import BeautifulSoup
    >>> data = """
    ... <p style='font-size: 24px;'>
    ...     <strong>Title A</strong>
    ... </p>
    ... <p>
    ...     <strong> First Subtitle of Title A </strong>
    ...     "Text for first subtitle"
    ... </p>
    ... """
    >>> soup = BeautifulSoup(data, 'html.parser')
    >>> titles = soup.find_all('strong')
    >>> titles[0].text
    u'Title A'
    >>> titles[1].get_text(strip=True)
    u'First Subtitle of Title A'
    >>> titles[1].next_sibling.strip()
    u'"Text for first subtitle"'
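
Building on the sibling navigation above, here is a minimal sketch of one way to group subtitles under their titles. It assumes that a title is any strong tag where the tag itself, its parent, or a nested tag carries the 'font-size: 24px' style, and that a subtitle's text is the text node directly after its strong tag. The names TITLE_STYLE, is_title and group_by_title are made up for illustration, and the sample markup is adapted from the question.

```python
import re
from bs4 import BeautifulSoup, NavigableString

# Hypothetical marker for titles, taken from the pattern in the question.
TITLE_STYLE = re.compile(r"font-size:\s*24px")


def is_title(strong):
    """Assume a <strong> is a title if it, its parent, or a nested tag has the 24px style."""
    candidates = [strong, strong.parent] + strong.find_all(True)
    return any(TITLE_STYLE.search(tag.get("style", "")) for tag in candidates)


def group_by_title(html):
    """Map each title to a list of (subtitle, text) pairs, in document order."""
    soup = BeautifulSoup(html, "html.parser")
    sections = {}
    current = None
    for strong in soup.find_all("strong"):
        text = strong.get_text(strip=True)
        if is_title(strong):
            current = sections.setdefault(text, [])
        elif current is not None:
            # The subtitle's text, if any, is the text node right after the tag.
            body = strong.next_sibling
            body = body.strip() if isinstance(body, NavigableString) else ""
            current.append((text, body))
    return sections


data = """
<p style='font-size: 24px;'>
    <strong>Title A</strong>
</p>
<p>
    <strong> First Subtitle of Title A </strong>
    "Text for first subtitle"
</p>
<p>
    ...title E's content finishes
    <strong>
        <span id="inserted31" style="font-size: 24px;"> Title F </span>
    </strong>
</p>
<p>
    <strong> First Subtitle for Title F </strong>
</p>
"""

sections = group_by_title(data)
for title, subtitles in sections.items():
    print(title, subtitles)
```

Text that precedes an inline title in the same p (such as the end of title E's content) is not captured by this sketch and would need extra handling.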
    
