Scraping product information for comic book website

I’m building a comic book website on WordPress for an old friend’s business. I would like a script that goes to various publisher sites and pulls in the data. I’m new to programming, and I’ve read about many different alternatives but just don’t know where to begin. First, would it be legal to pull this content from these websites? Second, here’s an example of what I would like to do.

  1. The page displays what’s coming out for the month. Copy all the links
    within the appropriate div on that page that lead to the comic book
    details. Save each hyperlink as $comiclink or whatever. The script will
    then follow each hyperlink one at a time.

  2. Go to the hyperlink in $comiclink and scrape content out of the page based
    upon what’s in certain divs on that page. Example:

    • Copy & save comic title within a defined div into $title
    • Copy & save previous and future title hyperlinks within a defined div into $othertitles

      Note: $othertitles will loop back and start the same process again from step 1.

    • Save & download all images within a defined div to $images
    • Copy & save all content within a defined div to $content. $content is then broken down
      and pulled apart based upon the content that is within it. Example:

      • In stores: $date
      • format: $format
      • UPC: $upc
      • Price: $price
      • The Story: $story
  3. Copy & save the hyperlink within a defined div into $seriesinfo.

  4. Copy & save a defined div into $relatedinfo and then break it down:

    • images within $relatedinfo to $relatedimages
    • content within $relatedinfo to $relatedcontent
    • links within $relatedinfo to $relatedlink. $relatedlink will loop back and restart this process from step 1 (a rough code sketch of step 2 follows this list).
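
Step 2 is where most of the parsing happens, so here is a rough sketch of what pulling one detail page apart might look like, using the same Simple HTML DOM library as the updated code further down. Every selector here ('div.title', 'div.other-titles', 'div.gallery', 'div.content') is a placeholder I made up; you would swap in the real class names from the publisher’s markup, and the label patterns ('In stores:', 'UPC:', etc.) are guesses at how the $content block is worded.

<?php
// Hypothetical sketch only: scrape one comic detail page with Simple HTML DOM.
// All selectors below are placeholders -- inspect the real page and swap them in.
require_once 'simple_html_dom.php';

function scrape_comic_detail($comiclink)
{
    $html = file_get_html($comiclink);
    if (!$html) {
        return false;
    }

    $comic = array();

    // Comic title from its div (placeholder selector)
    $comic['title'] = trim($html->find('div.title', 0)->plaintext);

    // Previous/future title links; each of these can be queued and fed
    // back through the same process (the "loop back to step 1" behaviour)
    $comic['othertitles'] = array();
    foreach ($html->find('div.other-titles a') as $a) {
        $comic['othertitles'][] = $a->href;
    }

    // Image URLs inside the gallery div
    $comic['images'] = array();
    foreach ($html->find('div.gallery img') as $img) {
        $comic['images'][] = $img->src;
    }

    // Raw content block, then pick the labelled pieces out of it
    $content = $html->find('div.content', 0)->plaintext;
    if (preg_match('/In stores:\s*(.+)/i', $content, $m))   $comic['date']   = trim($m[1]);
    if (preg_match('/Format:\s*(.+)/i', $content, $m))      $comic['format'] = trim($m[1]);
    if (preg_match('/UPC:\s*(\S+)/i', $content, $m))        $comic['upc']    = trim($m[1]);
    if (preg_match('/Price:\s*(\S+)/i', $content, $m))      $comic['price']  = trim($m[1]);
    if (preg_match('/The Story:\s*(.+)/is', $content, $m))  $comic['story']  = trim($m[1]);

    // clean up memory
    $html->clear();
    unset($html);

    return $comic;
}

Each URL collected in $comic['othertitles'] (and later $relatedlink) can simply be pushed onto a queue and run through this same function, which gives you the "loop off and restart from step 1" behaviour without deep recursion.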

Now that everything is broken apart and saved into its own little pieces, I want WordPress to automatically create a post and then start assigning all this info to the post, working something like this (a rough sketch follows the list):

  1. Check for an existing post with the same $title. If one does not exist, use $title as the post title and page name; if it does exist, skip this item and move on to the next.
  2. Remove the numbers and special characters from $title and check whether a category with that name exists; if it does not, create it and assign it to the post. If it exists, assign that category to the post.
  3. Check for an existing category with the value of $format; if it exists, assign it to the post, if not, create it and assign it to the post.
  4. Upload the images that were downloaded into $images and attach them to this post.
  5. Check for images that contain the word “cover” and assign one as the featured image.
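
Assuming the scraped values end up in an array like the one sketched above, here is a minimal sketch of steps 1–5 using standard WordPress functions: post_exists(), wp_insert_post(), term_exists()/wp_insert_term(), wp_set_object_terms(), media_sideload_image() and set_post_thumbnail(). The $comic field names and create_comic_post() itself are assumptions carried over from the earlier sketch, not existing code.

// Sketch only: turn one scraped $comic array into a WordPress post.
// Assumes it runs inside WordPress (e.g. from a plugin), so core functions are loaded.
require_once ABSPATH . 'wp-admin/includes/post.php';   // post_exists()
require_once ABSPATH . 'wp-admin/includes/media.php';  // media_sideload_image()
require_once ABSPATH . 'wp-admin/includes/file.php';
require_once ABSPATH . 'wp-admin/includes/image.php';

function create_comic_post($comic)
{
    // 1. Skip anything already posted under the same title
    if (post_exists($comic['title'])) {
        return false;
    }

    $post_id = wp_insert_post(array(
        'post_title'   => $comic['title'],
        'post_content' => isset($comic['story']) ? $comic['story'] : '',
        'post_status'  => 'publish',
    ));

    // 2. + 3. Create-or-reuse categories: series name (title minus the numbering) and format
    $series = trim(preg_replace('/[#0-9]+/', '', $comic['title']));
    foreach (array($series, $comic['format']) as $cat_name) {
        $term = term_exists($cat_name, 'category');
        if (!$term) {
            $term = wp_insert_term($cat_name, 'category');
        }
        if (!is_wp_error($term)) {
            wp_set_object_terms($post_id, (int) $term['term_id'], 'category', true);
        }
    }

    // 4. + 5. Sideload each image into the media library; the one whose URL
    // contains "cover" becomes the featured image
    foreach ($comic['images'] as $image_url) {
        $attach_id = media_sideload_image($image_url, $post_id, null, 'id');
        if (!is_wp_error($attach_id) && stripos($image_url, 'cover') !== false) {
            set_post_thumbnail($post_id, $attach_id);
        }
    }

    return $post_id;
}

media_sideload_image() downloads the remote file, adds it to the media library and attaches it to the post, so the images don’t need to be saved to disk separately first.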

There’s also the question of how this whole thing executes. I don’t want it running 24/7 – just once a week I would like it to run by itself, automatically go to the websites in question, scrape the content, and create the pages.
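
For the once-a-week part, one option is WP-Cron: register a custom hook and schedule it weekly (the 'weekly' interval is built into WordPress 5.4+; on older versions you would add it via the cron_schedules filter). The hook name below is made up for this example; a real server cron job hitting the script once a week would also work, and is more reliable on low-traffic sites since WP-Cron only fires when someone visits.

// Sketch, assuming this lives in a plugin file. 'comic_scraper_weekly' is a made-up hook name.
register_activation_hook(__FILE__, function () {
    if (!wp_next_scheduled('comic_scraper_weekly')) {
        wp_schedule_event(time(), 'weekly', 'comic_scraper_weekly');
    }
});

add_action('comic_scraper_weekly', function () {
    // Run the listing scrape (scraping_comic() from the updated code below),
    // then hand each item to the detail/post routines sketched above.
    foreach ((array) scraping_comic() as $item) {
        // e.g. $comic = scrape_comic_detail(...); create_comic_post($comic);
    }
});

register_deactivation_hook(__FILE__, function () {
    wp_clear_scheduled_hook('comic_scraper_weekly');
});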

I’m not asking you guys to write out the whole darn thing for me; though I definitely won’t object to it! Just help point me in the right direction to get this going. Over the past day I’ve read probably 30+ articles on pulling content, and there are so many options that I just don’t know where to begin or how to get the ball moving in the right direction.

Updated Code

Notes: I’ve managed to successfully copy the content and paths for each block, and instead of downloading the images I’m just echoing them from their present location. Next up is automating the process of creating a post in WordPress to dump the data into.

<?php
// Requires the Simple HTML DOM parser (simple_html_dom.php), which provides file_get_html().
require_once 'simple_html_dom.php';

function scraping_comic()
{
    // create HTML DOM
    $html = file_get_html('http://page-on-site-to-scrape.com');

    $ret = array();

    // get each block to scrape
    foreach ($html->find('li.browse_result') as $article)
    {
        $item = array();
        // get title from block
        $item['title'] = trim($article->find('h4', 0)->find('span', 0)->plaintext);
        // get title url from block
        $item['title_url'] = trim($article->find('h4', 0)->find('a.grid-hidden', 0)->href);
        // get image from block
        $item['image_url'] = trim($article->find('img.main_thumb', 0)->src);
        // get details from block
        $item['details'] = trim($article->find('p.browse_result_description_release', 0)->plaintext);
        // get sale info from block
        $item['on_sale'] = trim($article->find('.browse_comics_release_dates', 0)->plaintext);

        $ret[] = $item;
    }

    // clean up memory
    $html->clear();
    unset($html);

    return $ret;
}


// ===== The Code ====

$ret = scraping_comic();

if ( ! empty($ret))
{
    // base url, for when hyperlinks and image srcs don't use the full path
    $scrape = 'http://site-to-scrape.com';

    foreach ($ret as $v)
    {
        echo '<p><a href="'.$scrape.$v['title_url'].'">'.$v['title'].'</a></p>';
        echo '<p><img src="'.$v['image_url'].'"></p>';
        echo '<p>'.$v['details'].'</p>';
        echo '<p>'.$v['on_sale'].'</p>';
    }
}
else
{
    echo 'Could not scrape page!';
}
?>
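
From there, instead of echoing the results, the same loop could hand each item off to the earlier sketches (fetch the detail page, then create the post), something along these lines, with scrape_comic_detail() and create_comic_post() being the hypothetical helpers sketched above:

foreach ($ret as $v) {
    // fetch and parse the detail page, then turn it into a WordPress post
    $comic = scrape_comic_detail($scrape . $v['title_url']);
    if ($comic) {
        create_comic_post($comic);
    }
}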

1 comment

  1. Typically, no, this wouldn’t be legal. Companies that share their data these days implement an API you can call and use in your application (subject to their Terms of Use and Copyright Policy). They don’t like you making automated requests that bog down their server and eat up their bandwidth.

    That being said, product information is often available from other sources, such as Amazon, which does have an API.

    The project you are describing involves a lot of work, essentially customizing the WordPress CMS, and would be far from trivial for someone without any programming experience. You might want to consider hiring a freelancer on oDesk or one of the many other freelance job boards.