Creating plugin using simple_html_dom parser?

I’m running into a problem with my plugin, which is essentially an admin page that uses the simple_html_dom parser to edit information scraped from other websites.

Let’s assume I want to parse this page: https://wordpress.stackexchange.com/

I want to store each question’s title as $item[0] and its URL as $item[1].

My plugin folder is called test and has a structure as follows:

admin.php
index.php
scrape.php
simple_html_dom.php

index.php looks like this:

function test_admin() {  
    include( plugin_dir_path(__FILE__) . '/admin.php' );  
}

This includes admin.php, which looks like this:

<?php include(plugin_dir_path(__FILE__).'scrape.php'); ?>

That in turn includes scrape.php, which looks like this:

<?php
# don't forget the library
include(plugin_dir_path(__FILE__).'simple_html_dom.php');

# this is the global array we fill with article information
$articles = array();

# passing in the first page to parse, it will crawl to the end
# on its own
getArticles('https://wordpress.stackexchange.com/');

function getArticles($page) {
global $urls, $articles, $descriptions;

$html = new simple_html_dom();
$html->load_file($page);

$items = $html->find('div[class=summary]');

foreach($items as $post) {
    # remember comments count as nodes

    $articles[] = array($post->children(0)->plaintext, // Title
                        $post->find('a',0)->href); // URL
}

// print_r($articles);   <--- WORKS, contains full contents of parsed page!

# lets see if there's a next page
if($next = $html->find('a[class=nextpostslink]', 0)) {
    $URL = $next->href;
    echo "going on to $URL <<<\n";
    # memory leak clean up
    $html->clear();
    unset($html);

    getArticles($URL);
}

$html->clear();
print_r($articles); // <--- WORKS, array is full of parsed info!
}
print_r($articles); // <--- DOESN'T WORK, array is magically empty!
?>

<?php 
foreach($articles as $item) {
echo '<a href="' . $item[1] . '">' . $item[0] . '</a>';
echo '<br>';
}
?>

My problem is that while the plugin does appear to include the parser’s php file (evidenced by print_r($articles) containing the parsed page!), it doesn’t seem to be able to loop through the array.

So, here is the output of my array:

Array ( [0] => Array ( [0] => +100 [1] => /questions/81544/using-plural-only-translation-of-register-post-status-in-plugin ) [1] => Array ( [0] => Check if page/post has any anchors that link to an image jpg/gif/png [1] => /questions/53585/check-if-page-post-has-any-anchors-that-link-to-an-image-jpg-gif-png ) [2] => Array ( [0] => Can this be done? Create 50x50 thumbnails of all existing featured images? [1] => /questions/82246/can-this-be-done-create-50x50-thumbnails-of-all-existing-featured-images ) [3] => Array ( [0] => Creating plugin using simple_html_dom parser? [1] => /questions/82096/creating-plugin-using-simple-html-dom-parser ) [4] => Array ( [0] => Warning/Error in Admin Panel while developing theme [1] => /questions/82245/warning-error-in-admin-panel-while-developing-theme ) [5] => Array ( [0] => Chose gallery display type [1] => /questions/82244/chose-gallery-display-type ) [6] => Array ( [0] => Why the JavaScript code is ignored from wp editor? [1] => /questions/33539/why-the-javascript-code-is-ignored-from-wp-editor ) [7] => Array ( [0] => How to list only child categories? [1] => /questions/82178/how-to-list-only-child-categories ) [8] => Array ( [0] => How to get the post_status in javascript on post admin page? [1] => /questions/82226/how-to-get-the-post-status-in-javascript-on-post-admin-page ) [9] => Array ( [0] => WP_Query orderby breaks when using AJAX? 
[1] => /questions/82092/wp-query-orderby-breaks-when-using-ajax ) [10] => Array ( [0] => WordPress on Ubuntu 12.10: permalinks problem [1] => /questions/82225/wordpress-on-ubuntu-12-10-permalinks-problem ) [11] => Array ( [0] => Filtering taxonomies to a single post [1] => /questions/82240/filtering-taxonomies-to-a-single-post ) [12] => Array ( [0] => Display Parent Category of a Post belonging only to Subcategory [1] => /questions/58496/display-parent-category-of-a-post-belonging-only-to-subcategory ) [13] => Array ( [0] => Make display name unique [1] => /questions/82239/make-display-name-unique ) [14] => Array ( [0] => Send AJAX response from a non jQuery function [1] => /questions/82238/send-ajax-response-from-a-non-jquery-function ) [15] => Array ( [0] => Sorting by 2 Custom Fields + Post Title [1] => /questions/32175/sorting-by-2-custom-fields-post-title ) [16] => Array ( [0] => How to show one page with two different templates [1] => /questions/82223/how-to-show-one-page-with-two-different-templates ) [17] => Array ( [0] => flat category urls but retain heirchy? [1] => /questions/82236/flat-category-urls-but-retain-heirchy ) [18] => Array ( [0] => How do I get rid of “category” from my URL structure? [1] => /questions/30128/how-do-i-get-rid-of-category-from-my-url-structure ) [19] => Array ( [0] => Are the wordpress Core css styles really all nessasary? 
[1] => /questions/82228/are-the-wordpress-core-css-styles-really-all-nessasary ) [20] => Array ( [0] => Cannot access dashboard after upgrading to 3.5 [1] => /questions/76447/cannot-access-dashboard-after-upgrading-to-3-5 ) [21] => Array ( [0] => $wpdb error (Call to a member function insert() on a non-object) [1] => /questions/82229/wpdb-error-call-to-a-member-function-insert-on-a-non-object ) [22] => Array ( [0] => Registering tags taxonomy for a custom post type [1] => /questions/82217/registering-tags-taxonomy-for-a-custom-post-type ) [23] => Array ( [0] => register_post_type name character limit [1] => /questions/82227/register-post-type-name-character-limit ) [24] => Array ( [0] => IP location based country language of wordpress site [1] => /questions/78023/ip-location-based-country-language-of-wordpress-site ) [25] => Array ( [0] => Include Post Format in permalink [1] => /questions/70627/include-post-format-in-permalink ) [26] => Array ( [0] => How to detect first visit of a user? [1] => /questions/82211/how-to-detect-first-visit-of-a-user ) [27] => Array ( [0] => Repositioning 'Reply' Link in Comments [1] => /questions/82218/repositioning-reply-link-in-comments ) [28] => Array ( [0] => How do I approach removing menu items on the fly based on settings in my plugin? [1] => /questions/82180/how-do-i-approach-removing-menu-items-on-the-fly-based-on-settings-in-my-plugin ) [29] => Array ( [0] => Custom meta boxes text field unique id [1] => /questions/82222/custom-meta-boxes-text-field-unique-id ) [30] => Array ( [0] => Using Disqus, how to stop storing comments in wp database? 
[1] => /questions/58417/using-disqus-how-to-stop-storing-comments-in-wp-database ) [31] => Array ( [0] => How can I Add a variable PHP in the Menu Nav [1] => /questions/82194/how-can-i-add-a-variable-php-in-the-menu-nav ) [32] => Array ( [0] => Having a lot of difficulty getting add_editor_style() to load into source code [1] => /questions/60092/having-a-lot-of-difficulty-getting-add-editor-style-to-load-into-source-code ) [33] => Array ( [0] => How to custom change author base without $this->front? [1] => /questions/82004/how-to-custom-change-author-base-without-this-front ) [34] => Array ( [0] => Rename image uploads with width in filename [1] => /questions/82193/rename-image-uploads-with-width-in-filename ) [35] => Array ( [0] => Get a post's ID [1] => /questions/82208/get-a-posts-id ) [36] => Array ( [0] => Apply custom names for generic custom taxonomy name? [1] => /questions/82184/apply-custom-names-for-generic-custom-taxonomy-name ) [37] => Array ( [0] => Get post meta in enqueued js file [1] => /questions/82209/get-post-meta-in-enqueued-js-file ) [38] => Array ( [0] => How do I only load a plugin js on it's settings pages? 
[1] => /questions/82032/how-do-i-only-load-a-plugin-js-on-its-settings-pages ) [39] => Array ( [0] => remove post and categories/tags count from right now dashboard widget [1] => /questions/82132/remove-post-and-categories-tags-count-from-right-now-dashboard-widget ) [40] => Array ( [0] => Styling Contact Form 7 fields [1] => /questions/82207/styling-contact-form-7-fields ) [41] => Array ( [0] => Woocommerce: Changing catalog image sizes [1] => /questions/82197/woocommerce-changing-catalog-image-sizes ) [42] => Array ( [0] => Tracing the life of a query [1] => /questions/82183/tracing-the-life-of-a-query ) [43] => Array ( [0] => Custom WP_Query with complex 'post_status' argument [1] => /questions/82200/custom-wp-query-with-complex-post-status-argument ) [44] => Array ( [0] => Get a single post by a unique meta value [1] => /questions/82203/get-a-single-post-by-a-unique-meta-value ) [45] => Array ( [0] => 2 item in a same menu pointing to 1 page [1] => /questions/46962/2-item-in-a-same-menu-pointing-to-1-page ) [46] => Array ( [0] => Modify Notification Message When Profile Updated [1] => /questions/37358/modify-notification-message-when-profile-updated ) [47] => Array ( [0] => Change wordpress meta tag description using WP functions [1] => /questions/82196/change-wordpress-meta-tag-description-using-wp-functions ) )

In short:
– simple_html_dom is in fact being included (tested with breakpoints)
– $articles array does contain the parsed page (evidence above) BUT
– $articles seems to be empty when accessed from outside the getArticles() function.

Is anyone willing to help find out why I am having this problem?

1 comment

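  1. Maybe I’m missing something, but in your first print_r($articles), $articles has been declared global inside getArticles(), so it refers to the global variable and we’re all good there. Your second print_r($articles), after the closing brace of the getArticles() declaration, runs in a different scope: scrape.php is included (via admin.php) from inside test_admin(), and PHP runs an included file’s top-level code in the scope of the function that includes it. There, $articles has not been globalized, so it is a local variable of test_admin() that is still set to the empty array.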

    Doing a global $articles prior to your second print_r($articles) should give you the results.
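
    The scoping behavior described above can be reproduced in isolation. This is only a minimal sketch, not the plugin's code: test_admin_demo(), fill_articles() and the temporary file are hypothetical stand-ins for test_admin(), getArticles() and scrape.php, and no scraping is done.

```php
<?php
// Hypothetical stand-in for scrape.php, written to a temp file so the
// sketch is self-contained.
$included = tempnam(sys_get_temp_dir(), 'scope_demo_');
file_put_contents($included, <<<'CODE'
<?php
// Top-level code of an included file runs in the scope of whatever
// function performed the include, so this $articles is a local variable.
$articles = array();

function fill_articles() {
    global $articles; // the real global, not the includer's local
    $articles[] = array('Example title', '/questions/1/example');
}
fill_articles();

// Still looking at the local, empty array.
$seen_without_global = count($articles);

// Rebind the name to the real global, as the answer suggests.
global $articles;
$seen_with_global = count($articles);
CODE
);

// Stand-in for test_admin(): the include happens inside a function.
function test_admin_demo($file) {
    include $file;
    return array($seen_without_global, $seen_with_global);
}

list($before, $after) = test_admin_demo($included);
unlink($included);

echo "without global: $before, with global: $after\n";
// prints "without global: 0, with global: 1"
```

    In the plugin itself, the same effect should follow from putting global $articles; at the top of scrape.php, before $articles = array();. Having getArticles() return the array instead of writing to a global would avoid the scope question altogether.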