Php/Goutte

What is Goutte

A web scraping library for PHP. Supports PHP 5.3 and later.

Benchmark

Reportedly performs better than Simple HTML DOM Parser.

See: http://qiita.com/soarcreator/items/56a971d42b8b640b76a6

Usage

Download

wget https://raw.github.com/fabpot/Goutte/master/goutte.phar

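DOM traversal

filter('h1')   Nodes matching a CSS selector
filterXPath('//h1')   Nodes matching an XPath expression
eq(1)   Node at the given index (zero-based)
first()   First node
last()   Last node
siblings()   Sibling nodes
nextAll()   All following sibling nodes
previousAll()   All preceding sibling nodes
parents()   Parent (ancestor) nodes
children()   Child nodes
reduce($lambda)   Nodes for which the callable does not return false
selectLink($value)   All links that contain the given text
selectButton($value)   All buttons that contain the given text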
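
The traversal methods above can be tried without any HTTP request by feeding a small HTML string to the Crawler. The following is a minimal sketch, not a definitive reference: the markup is made up for illustration, and it assumes goutte.phar (from the download step) sits next to the script.

```php
<?php
// Assumption: goutte.phar (see the download step) sits next to this script.
require_once __DIR__ . '/goutte.phar';

use Symfony\Component\DomCrawler\Crawler;

// Hypothetical markup, used only for this illustration.
$html = <<<'HTML'
<!DOCTYPE html>
<html>
<body>
<ul id="menu">
<li class="item">One</li>
<li class="item current">Two</li>
<li class="item">Three</li>
</ul>
</body>
</html>
HTML;

$crawler = new Crawler($html);
$items = $crawler->filter('ul#menu > li');

$firstText  = $items->first()->text(); // first matched node
$lastText   = $items->last()->text();  // last matched node
$secondText = $items->eq(1)->text();   // eq() is zero-based, so this is the second li

// Element siblings after / before the second item.
$afterCount  = count($items->eq(1)->nextAll());
$beforeCount = count($items->eq(1)->previousAll());

// children() returns the child elements of the first node in the selection.
$childCount = count($crawler->filter('ul#menu')->children());

// reduce() keeps only the nodes for which the callable does not return false.
$current = $items->reduce(function ($node) {
    return strpos($node->attr('class'), 'current') !== false;
});
$currentText = $current->text();

// filterXPath() selects with an XPath expression instead of a CSS selector.
$xpathCount = count($crawler->filterXPath('//li'));

echo $firstText, ' / ', $secondText, ' / ', $lastText, "\n";
echo $afterCount, ' ', $beforeCount, ' ', $childCount, ' ', $xpathCount, "\n";
echo $currentText, "\n";
```

Each traversal method returns a new Crawler, so calls can be chained as in `$items->eq(1)->nextAll()`.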

Filter selectors

h1   By element name
h1.class1   By class
h1#id1   By id
body > p   Direct children only (p elements directly under body)
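
To make the difference between these selectors concrete, here is a small sketch that counts or reads what each one matches. The markup is hypothetical, and the script assumes goutte.phar is in the same directory.

```php
<?php
// Assumption: goutte.phar (see the download step) sits next to this script.
require_once __DIR__ . '/goutte.phar';

use Symfony\Component\DomCrawler\Crawler;

// Hypothetical markup, used only for this illustration.
$html = <<<'HTML'
<!DOCTYPE html>
<html>
<body>
<h1 class="class1">Title A</h1>
<h1 id="id1">Title B</h1>
<p>Direct child</p>
<div><p>Nested</p></div>
</body>
</html>
HTML;

$crawler = new Crawler($html);

$allH1    = count($crawler->filter('h1'));         // every h1 element
$byClass  = $crawler->filter('h1.class1')->text(); // h1 with class "class1"
$byId     = $crawler->filter('h1#id1')->text();    // h1 with id "id1"
$directP  = count($crawler->filter('body > p'));   // p directly under body only
$anyP     = count($crawler->filter('p'));          // p at any depth

echo $allH1, ' ', $byClass, ' ', $byId, ' ', $directP, ' ', $anyP, "\n";
```

The nested p inside the div is matched by 'p' but not by 'body > p', which is the practical reason to use the child combinator.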

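Getting values

attr($attribute)   Returns the value of the given attribute of the first node
text()   Returns the text content of the first node
html()   Returns the HTML content of the first node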
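
All three getters read from the first node of the selection only. The sketch below shows that behavior and one way to collect a value from every node with each(). The markup is hypothetical, and goutte.phar is assumed to be in the same directory.

```php
<?php
// Assumption: goutte.phar (see the download step) sits next to this script.
require_once __DIR__ . '/goutte.phar';

use Symfony\Component\DomCrawler\Crawler;

// Hypothetical markup, used only for this illustration.
$html = '<html><body>'
    . '<p class="message">Hello <b>World</b>!</p>'
    . '<p class="other">Second</p>'
    . '</body></html>';

$crawler = new Crawler($html);
$p = $crawler->filter('p'); // matches both paragraphs

$class = $p->attr('class'); // attribute of the first p only
$text  = $p->text();        // text content of the first p, tags stripped
$inner = $p->html();        // inner HTML of the first p, tags kept

// each() calls the closure for every node and returns the results as an array.
$allTexts = $p->each(function ($node) {
    return $node->text();
});

echo $class, "\n", $text, "\n", $inner, "\n";
echo implode(', ', $allTexts), "\n";
```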

Sample

<?php
require_once __DIR__ . '/goutte.phar';
use Goutte\Client;
$client = new Client();
$url = "http://example.com";
$crawler = $client->request('GET', $url);
// Print the RSS feed URLs declared in <link> elements
$dom = $crawler->filter('link');
$dom->each(function ($node) {
    if ($node->attr('type') == "application/rss+xml") {
        echo $node->attr('href') . "\n";
    }
});
// Print the text of every h1
$dom = $crawler->filter('h1');
$dom->each(function ($node) {
    echo $node->text() . "\n";
});
// Print the URL and text of every link whose href contains http://
$dom = $crawler->filter('a');
$dom->each(function ($node) {
    if (preg_match("!http://!", $node->attr('href'))) {
        echo $node->attr('href') . "\n";
        echo $node->text() . "\n";
    }
});
// Print the text of links directly inside h2.post
$dom = $crawler->filter('h2.post > a');
$dom->each(function ($node) {
    print $node->text() . "\n";
});
// Another selector to try:
//$dom = $crawler->filter('table.content td');

Setting the HTML directly instead of fetching it over HTTP

use Symfony\Component\DomCrawler\Crawler;
$html = <<<'HTML'
<!DOCTYPE html>
<html>
   <body>
       <p class="message">Hello World!</p>
       <p>Hello Crawler!</p>
   </body>
</html>
HTML;
$crawler = new Crawler($html);
echo $crawler->filter('p.message')->text() . "\n"; // prints "Hello World!"

Clicking a link (a tag)

$crawler = $client->request('GET', $url);
$targetLinkText = 'もっとみる'; // text of the link to click ("See more")
$link = $crawler->selectLink($targetLinkText)->link();
$crawler = $client->click($link);
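
click() requests the URI of the Link object it is given, so it can be useful to check that URI first. The sketch below does this offline: it assumes goutte.phar is in the same directory, and the markup, the link text and the base URI http://example.com/list are made up for illustration. Note that link() throws an exception when nothing matched, so the selection is counted first.

```php
<?php
// Assumption: goutte.phar (see the download step) sits next to this script.
require_once __DIR__ . '/goutte.phar';

use Symfony\Component\DomCrawler\Crawler;

// Hypothetical markup and base URI, used only for this illustration.
$html = '<html><body>'
    . '<a href="/page/2">See more</a> <a href="/about">About</a>'
    . '</body></html>';

// The second constructor argument is the URI the document is treated as
// coming from; relative href values are resolved against it.
$crawler = new Crawler(null, 'http://example.com/list');
$crawler->addHtmlContent($html, 'UTF-8');

$uri = null;
$matches = $crawler->selectLink('See more');
if (count($matches) > 0) {
    // link() builds a Link object from the first matched node.
    $uri = $matches->link()->getUri();
}

// No link has this text, so the selection is empty.
$missing = count($crawler->selectLink('No such link'));

echo $uri, "\n";
echo $missing, "\n";
```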

Spoofing the User-Agent

// Pose as Googlebot
$client->setHeader('User-Agent', 'Googlebot-Video/1.0');
// Or pose as Chrome 28 (setting the same header again replaces the earlier value)
$client->setHeader('User-Agent', 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/28.0.1500.52 Safari/537.36');

Parsing HTML only, without making an HTTP request

use Symfony\Component\DomCrawler\Crawler;
// $html is assumed to already hold the HTML string to parse
$crawler = new Crawler();
$crawler->addHtmlContent($html, 'utf-8');
// Print the src of every img
$dom = $crawler->filter('img');
if (count($dom) > 0) {
    $dom->each(function ($node) {
        echo $node->attr('src') . "\n";
    });
}
// Print the href and text of every link
$dom = $crawler->filter('a');
if (count($dom) > 0) {
    $dom->each(function ($node) {
        echo $node->attr('href') . "\n";
        echo $node->text() . "\n";
    });
}

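If the HTML contains characters outside ISO-8859-1 and no character encoding is specified, the text comes out garbled, so pass the character encoding as the second argument of addHtmlContent().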
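
As a minimal sketch of the point above, the following parses a UTF-8 string containing Japanese text and reads it back. It assumes goutte.phar is in the same directory and that the script file itself is saved as UTF-8. The markup is made up for illustration.

```php
<?php
// Assumption: goutte.phar (see the download step) sits next to this script,
// and this file is saved as UTF-8.
require_once __DIR__ . '/goutte.phar';

use Symfony\Component\DomCrawler\Crawler;

// Hypothetical markup containing characters outside ISO-8859-1.
$html = '<html><body><p>こんにちは</p></body></html>';

$crawler = new Crawler();
// Declare the encoding explicitly so the bytes are decoded as UTF-8.
$crawler->addHtmlContent($html, 'UTF-8');

$text = $crawler->filter('p')->text();
echo $text, "\n";
```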

Reference: http://docs.symfony.gr.jp/symfony2/components/dom_crawler.html
