So I need to scrape a site using Python, but the problem is that the markup is random, unstructured, and proving hard to work with.
For example:
<p style='font-size: 24px;'>
<strong>Title A</strong>
</p>
<p>
<strong> First Subtitle of Title A </strong>
"Text for first subtitle"
</p>
Then it will switch to
<p>
<strong style='font-size: 24px;'> Second Subtitle for Title B </strong>
</p>
Then sometimes the new subtitles are added to the end of the previous subtitle’s text
<p>
...title E's content finishes
<strong>
<span id="inserted31" style="font-size: 24px;"> Title F </span>
</strong>
</p>
<p>
<strong> First Subtitle for Title F </strong>
</p>
Enough confusion, it’s simply poor markup. Obvious patterns such as ‘font-size:24px;’ can find the titles but there isn’t a solid, reusable method to scrape the children and associate them with the title.
Regex might work but I feel like the randomness would result in scraping patterns that are too specific and not DRY.
I could offer to rewrite the HTML and fix the hierarchy; however, this being a WordPress site, I fear the content might come back as incompatible for the admin in the WordPress interface.
Any suggestions for either a better scraping method or a way to go about WordPress would be greatly appreciated. I want to avoid just copying/pasting as much as possible.
At least, you can rely on the tag names and text, navigating the DOM tree horizontally (going sideways). These are all strong, p and span (with an id attribute set) tags you are showing.
For example, you can get the strong text and get the following sibling:
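Here is a minimal sketch of that sideways step. It assumes the BeautifulSoup library (bs4) with the built-in html.parser, which is a choice made here for illustration, and it uses the second snippet from the question as sample input. The variable names are hypothetical.

```python
from bs4 import BeautifulSoup, NavigableString

# Sample markup copied from the question (a subtitle followed by its text).
html = """
<p>
<strong> First Subtitle of Title A </strong>
"Text for first subtitle"
</p>
"""

soup = BeautifulSoup(html, "html.parser")

# Locate the subtitle by tag name and read its text.
strong = soup.find("strong")
subtitle = strong.get_text(strip=True)

# Go sideways: next_sibling is the node right after </strong>.
# In this snippet it is a plain text node, so guard for that case.
sibling = strong.next_sibling
text = sibling.strip() if isinstance(sibling, NavigableString) else ""

print(subtitle)
print(text)
```

In the other layouts from the question, the following sibling may be missing or may be another tag rather than text, so the isinstance check would need to be extended for those cases.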