How to scrape wordpress admin into a .csv

Recently a friend of mine conducted a survey on one of his websites. He used “WP Survey And Quiz Tool” apparently this stores everything in a crazy SQL jumble and basically gives no way to use the data. He asked me to help him out so I built a scraper for him to move all of his data (2200 surveys) into excel. This is going to outline how I did that.

The process breaks down into three main steps:
1. Opening a logged in session with wordpress
2. Scraping the data
3. Putting the data into a csv file.

Step 1. Using cURL to get into WordPress.
First we used this function to log in, the comments explain what each piece does, or you can just go ahead and copy paste it.

function curl_get_wp_login( $login_user, $login_pass, $login_url, $visit_url, $http_agent, $cookie_file ){

// Check to make sure cURL is avaiable
if( !function_exists( 'curl_init' ) || ! function_exists( 'curl_exec' )){

$m = "cUrl is not available in you PHP server.";

echo $m;

}

// Preparing postdata for wordpress login
$data = "log=". $login_user ."&pwd=" . $login_pass . "&wp-submit=Log%20In&redirect_to=" . $visit_url;

// Intialize cURL
$ch = curl_init();

// Url to use
curl_setopt( $ch, CURLOPT_URL, $login_url );

// Set the cookies for the login in a cookie file.
curl_setopt( $ch, CURLOPT_COOKIEJAR, $cookie_file );

// Set SSL to false
curl_setopt( $ch, CURLOPT_SSL_VERIFYPEER, false );

// User agent
curl_setopt( $ch, CURLOPT_USERAGENT, $http_agent );

// Maximum time cURL will wait for get response. in seconds
curl_setopt( $ch, CURLOPT_TIMEOUT, 60 );
curl_setopt( $ch, CURLOPT_FOLLOWLOCATION, 1 );

// Return or echo the execution
curl_setopt( $ch, CURLOPT_RETURNTRANSFER, 1 );

// Set Http referer.
curl_setopt( $ch, CURLOPT_REFERER, $login_url );

// Post fields to the login url
curl_setopt( $ch, CURLOPT_POSTFIELDS, $data );
curl_setopt( $ch, CURLOPT_POST, 1);

// Save the return in a variable
$content = curl_exec ($ch);

/*
** if you need to visit another url ,you can do it here.
** curl_setopt( $ch, CURLOPT_URL, 'a new url address or a file download url'
);
** $content = curl_exec ($ch);
*/

// Close the cURL.
curl_close( $ch );

// You can echo or return the page data here. I just echo it to make sure its working, then run the scrape_the_data. Once you are confident it is working, remove this before running a full scrape to save processing power.
echo $content;

//Run the scraper function
scrape_the_data($content);

}

Then you have to set the variables for the cURL to work:
// Enter the username for login, most commonly Admin
$login_user = "login";

// Enter the password for this username
$login_pass = "password";

// Login url address. This is where it usually is, but this can be unique for your site
$login_url = "http://your-site.com/wp-login.php";

/*
This is where you store your cookies, and is used to maintain the cURL session, this can vary from theme to theme.
*/

$cookie_file = "http://your-site.com/wp-content/themes/your-theme/cookie/cookie.txt";

// Set HTTP user agent. Or what browser you want to crawl as.

$http_agent = "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1.6) Gecko/20070725 Firefox/2.0.0.6";

/*
Now enter the url that you want to grab for scraping. This is set if you are using the "WP Survey and Quiz Tool"
// This is the URL
$visit_url = urlencode( 'http://your-site.com//wp-admin/admin.php?page=wpsqt-menu§ion=results&subsection=mark&id=5&resultid='.$pagenumber );

// This is to choose the page number of the survey you want to scrape.
$pagenumber = 1831;

Now because a lot of these have massive volume, you will want to scrape multiple pages. This is how you can grab the content for all of the pages you want, replace the pagenumber and $visit_url above with this code:


//Starting page number
$pagenumber = 1831;

//Final page number
while($pagenumber < = 2300) { //Static URL $visit_url = urlencode( 'http://your-site.com/wp-admin/admin.php?page=wpsqt-menu&section=results&subsection=mark&id=5&resultid='.$pagenumber ); //Run cURL function, which runs the scraper function curl_get_wp_login( $login_user, $login_pass, $login_url, $visit_url, $http_agent, $cookie_file ); //Next page $pagenumber++; }

Step 2. Scraping the Data.
There are many different ways to scrape data, for this one I decided to use regex and preg_match. Regex and preg match is a way to put in html code and show it what you want to pull with parenthesis. \ before a symbol nulls it out. If you want to learn more about regex or test regex, I recommend: For learning http://regexlib.com/
For Testing: http://regexlib.com/RETester.aspx
Note you can just copy and paste the html code into it for testing.

The fields we decided to scrape were the first name, last name, email, and 4 of the survey responses. We identified a response by the red color the plugin gives it. Then we check to make sure it had a response. To ensure regex pulled the right field, we match it to something before and after what we want. We pull what is in the (.*) area.


// $dataurl is the $content variable above, and gets called in the preg_match
function scrape_the_data($data) {

//Matching to the html before and after first name
$first_name = '/

First Name< \/th>(.*) Last Name< \/th>/s';

//Matching to the html before and after last name
$last_name = '/ Last Name< \/th>(.*) Email< \/th>/s';

//Matching to the html before and after email
$email = '/ Email< \/th>(.*) IP Address< \/th>/s';

/* Matching to the html before and after the best artist question. We used the header of the question and the one after it. .*? means anything until it matches the next snippet */
$best_artist = '/

Best Artist< \/u>< \/h4>.*?(.*?)< \/font>.*

Best Female Bartender< \/u>< \/h4>/s';

// same
$best_humanitarian = '/

Top Humanitarian or Social Worker< \/u>< \/h4>.*?(.*?)< \/font>.*

Best Salon< \/u>< \/h4>/s';

// Grab first, last, email
preg_match($first_name,$dataurl,$firstname);
preg_match($last_name,$dataurl,$lastname);
preg_match($email,$dataurl,$useremail);

/* This takes the regex for best artist, the content of the page and matches it, outputting it into a $bestartist array, with the data we want in $bestartist[1] */
preg_match($best_artist,$dataurl,$bestartist);
// Check to see if user responded, if not fill it with 'NA'
if ($bestartist[1] == '') { $bestartist[1]= 'NA';}

//repeat for different fields
preg_match($best_humanitarian,$dataurl,$besthumanitarian);
if ($besthumanitarian[1] == '') { $besthumanitarian[1]= 'NA';}

//put the results in an array to then put into the .csv
$list = array($firstname[1],$lastname[1],$useremail[1],$bestartist[1],$besthumanitarian[1]);
WriteToFile($list);

Step 3. Putting the results into a CSV

Don’t worry, it’s almost done!
The final step is writing all of this to the csv.
function WriteToFile($list) {

//open the csv file you want to write to and append any writings to the end
$fp = fopen('file11.csv', 'a');

//Output the array until the end is reached
foreach ($list as $fields) {
fputcsv($fp, $fields);
}

//last line of the document, close the csv file
fclose($fp);
?>

I know that this is a very long and difficult tutorial if you are not familiar with the above methods. If you have any tips on how to make the code or tutorial itself cleaner or any questions about how to make this work, please comment below or tweet at me @SpencerPadway

Thanks!

Written by

Spencer is the founder of Brandfish. He is an SEO who enjoys the outdoors and rugby.