2010
10.28

I just posted this on Snipplr, but I thought I’d share it here too.

With this script, you can extract bits of data from any site based on MediaWiki (including Wikipedia). I’ve also made a more complex version of the code that allows you to grab certain categories and images using the same function with different parameters (but I need to clean it up before I post it).

It should be pretty much self-explanatory, but let me know if you need help figuring it out.

// Fetch a URL with cURL and return the response body (false on failure).
function curlStart($url){
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_HTTPGET, true);
    curl_setopt($ch, CURLOPT_POST, false);
    curl_setopt($ch, CURLOPT_HEADER, false);          // body only, no response headers
    curl_setopt($ch, CURLOPT_NOBODY, false);
    curl_setopt($ch, CURLOPT_VERBOSE, false);
    curl_setopt($ch, CURLOPT_REFERER, "");
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);   // follow redirects...
    curl_setopt($ch, CURLOPT_MAXREDIRS, 4);           // ...but no more than 4
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);   // return the body instead of printing it
    curl_setopt($ch, CURLOPT_USERAGENT, "Mozilla/5.0 (Windows; U; Windows NT 6.1; he; rv:1.9.2.8) Gecko/20100722 Firefox/3.6.8");
    $page = curl_exec($ch);                           // false if the request failed
    return $page;
}
// Query the OpenSearch API and return a list of array(title, description, url).
// Returns an empty array if the request fails or the response is not valid XML.
function findWikiInfo($s) {
    $wikiarray = array();
    $page = curlStart("http://en.wikipedia.org/w/api.php?action=opensearch&search=".urlencode($s)."&format=xml&limit=10");
    if ($page === false) {
        return $wikiarray;
    }
    $xmlpage = simplexml_load_string($page);
    if ($xmlpage === false) {
        return $wikiarray;
    }
    // foreach handles a single Item and several Items alike, so no special case is needed
    foreach ($xmlpage->Section->Item as $item) {
        $wikiarray[] = array((string)$item->Text, (string)$item->Description, (string)$item->Url);
    }
    return $wikiarray;
}
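// Fall back to an empty search term when ?search= is missing from the URL.
print_r(findWikiInfo(isset($_GET['search']) ? $_GET['search'] : ''));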
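
If you want something nicer than print_r, here is a minimal sketch of how the result could be turned into an HTML list. The helper name renderWikiResults is made up for this example and is not part of the script above; it only assumes that each entry has the array(title, description, url) shape that findWikiInfo builds. The sample entry is invented so the snippet runs without a network call.

```php
<?php
// Hypothetical helper (not part of the original script): render the array
// returned by findWikiInfo() as an HTML list. Each entry is assumed to be
// array(title, description, url).
function renderWikiResults(array $results) {
    if (count($results) === 0) {
        return "<p>No results.</p>";
    }
    $html = "<ul>\n";
    foreach ($results as $entry) {
        list($title, $description, $url) = $entry;
        // Escape everything, since the text comes from a remote site.
        $html .= '<li><a href="' . htmlspecialchars($url) . '">'
               . htmlspecialchars($title) . '</a>: '
               . htmlspecialchars($description) . "</li>\n";
    }
    return $html . "</ul>\n";
}

// Made-up entry in the same shape as a real result.
$sample = array(
    array("Example & Co", "A made-up description.", "http://example.org/wiki/Example"),
);
echo renderWikiResults($sample);
```

Escaping with htmlspecialchars matters here because the titles and descriptions come straight from the remote wiki.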

3 comments so far

  1. Where can you alter the code to display perhaps the first paragraph or two of text from Wikipedia?

    • Hmm, that could be tricky, because it would depend on the HTML structure of the article.

      If there were just p tags, you could do something like this
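
      A minimal sketch of that idea (not the snippet originally posted here): it assumes you already have the article's HTML in a string, for example from curlStart, and that the text you want sits in plain p tags. The function name firstParagraphs and the sample HTML are made up for illustration.

```php
<?php
// Hypothetical helper: return the text of the first $count non-empty
// <p> elements found in an HTML string.
function firstParagraphs($html, $count = 2) {
    $doc = new DOMDocument();
    // Real pages are rarely valid HTML, so collect parser warnings quietly.
    $previous = libxml_use_internal_errors(true);
    // Prefixing an XML declaration is a common trick to make loadHTML read UTF-8.
    $doc->loadHTML('<?xml encoding="UTF-8">' . $html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $paragraphs = array();
    foreach ($doc->getElementsByTagName('p') as $p) {
        $text = trim($p->textContent);   // textContent drops nested tags like <b>
        if ($text === '') {
            continue;                    // skip empty paragraphs
        }
        $paragraphs[] = $text;
        if (count($paragraphs) >= $count) {
            break;
        }
    }
    return $paragraphs;
}

// Hand-written HTML standing in for a fetched article page.
$html = '<div><p></p><p>First <b>paragraph</b>.</p><p>Second one.</p><p>Third.</p></div>';
print_r(firstParagraphs($html));
```

      On a real article the first p tags may not be the introduction (infoboxes and notices can come first), so treat this as a starting point.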

  2. I have been using your technique for some time, but lately it seems to have failed. The failure, however, may be due to the fact that I am behind the Great Firewall of China. Or it may be that, like Google, Wikipedia has changed an API. So my question is: does this technique still work in the free world?