HTML Parsing (Parser)

This page explains how to parse HTML with Python. Once the HTML has been retrieved, it can be processed in a variety of ways.

1 Retrieving the HTML string
2 Setting header information
3 Retrieving tag information

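Retrieving the HTML string

In the Python 3 listing, line 5 opens the specified URL with urlopen, and line 6 decodes the response body as UTF-8 and prints the retrieved HTML.

Python 3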

import urllib.request

url = 'http://www.python-izm.com/'

htmldata = urllib.request.urlopen(url)
print(htmldata.read().decode('UTF-8'))

htmldata.close()

Note that in Python 2 the module is named urllib2.

Python 2

# -*- coding: utf-8 -*- 

import urllib2

url = 'http://www.python-izm.com/'

htmldata = urllib2.urlopen(url)
print unicode(htmldata.read(), 'utf-8')

htmldata.close()
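
The listings above close the response by hand and assume the page is UTF-8. As a rough Python 3 sketch of a slightly more defensive variant, the helper below uses a with statement so the response is closed even when an error occurs, and prefers the charset announced in the Content-Type header. The name fetch_html and its fallback_charset parameter are hypothetical and do not appear in the original examples.

```python
import urllib.request


def fetch_html(url, fallback_charset='utf-8'):
    """Return the body at `url` as text (hypothetical helper for illustration)."""
    # The with statement closes the response even if read() or decode() raises.
    with urllib.request.urlopen(url) as response:
        # Prefer the charset from the Content-Type header; fall back when absent.
        charset = response.headers.get_content_charset() or fallback_charset
        return response.read().decode(charset)


# Example call (requires network access):
# print(fetch_html('http://www.python-izm.com/'))
```

If the server sends no charset and the page is not UTF-8, decode() can still raise UnicodeDecodeError, so the fallback is only a reasonable default.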

Setting header information

Use build_opener. This example sets the User-Agent header on the opener before opening the URL.

Python 3

import urllib.request

url = 'http://www.python-izm.com/'

opener = urllib.request.build_opener()
opener.addheaders = [
    (
        'User-agent', 
        'Mozilla/5.0 (Windows; U; Windows NT 5.1; ja; rv:1.9.2.3) Gecko/20100401 Firefox/3.6.3 ( .NET CLR 3.5.30729)'
    )
]

htmldata = opener.open(url)
print(htmldata.read().decode('UTF-8'))

htmldata.close()
opener.close()

Note that in Python 2 the module is named urllib2.

Python 2

# -*- coding: utf-8 -*- 

import urllib2

url = 'http://www.python-izm.com/'

opener = urllib2.build_opener()
opener.addheaders = [
    (
        'User-agent', 
        'Mozilla/5.0 (Windows; U; Windows NT 5.1; ja; rv:1.9.2.3) Gecko/20100401 Firefox/3.6.3 ( .NET CLR 3.5.30729)'
    )
]

htmldata = opener.open(url)
print unicode(htmldata.read(), 'utf-8')

htmldata.close()
opener.close()
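
Headers set through opener.addheaders apply to every request made with that opener. When a header is needed for a single request only, one alternative in Python 3 is to pass it to urllib.request.Request. The following is a minimal sketch under that assumption; the helper name build_request and the user agent string 'example-client/1.0' are made up for illustration.

```python
import urllib.request


def build_request(url, user_agent):
    """Return a Request carrying a User-Agent header (hypothetical helper)."""
    # Headers passed here belong to this request only, not to a shared opener.
    return urllib.request.Request(url, headers={'User-Agent': user_agent})


# 'example-client/1.0' is a placeholder value, not a real browser string.
request = build_request('http://www.python-izm.com/', 'example-client/1.0')

# urllib stores header names capitalized, so the lookup key is 'User-agent'.
print(request.get_header('User-agent'))

# To actually send the request (requires network access):
# with urllib.request.urlopen(request) as response:
#     print(response.read().decode('UTF-8'))
```

The Request object can be passed to urlopen or to opener.open in place of the URL string.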

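Retrieving tag information

Subclass HTMLParser and override its handler methods to add your own processing. The example below retrieves the URLs linked from this site's top page.

Python 3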

import urllib.request
from html.parser import HTMLParser

class TestParser(HTMLParser):

    def handle_starttag(self, tagname, attribute):
        if tagname.lower() == 'a':
            for i in attribute:
                if i[0].lower() == 'href':
                    print(i[1])


url = 'http://www.python-izm.com/'

htmldata = urllib.request.urlopen(url)

parser = TestParser()
parser.feed(htmldata.read().decode('UTF-8'))

parser.close()
htmldata.close()

Note that in Python 2 the modules are named urllib2 and HTMLParser.

Python 2

# -*- coding: utf-8 -*- 

import urllib2
from HTMLParser import HTMLParser

class TestParser(HTMLParser):

    def handle_starttag(self, tagname, attribute):
        if tagname.lower() == 'a':
            for i in attribute:
                if i[0].lower() == 'href':
                    print i[1]


url = 'http://www.python-izm.com/'

htmldata = urllib2.urlopen(url)

parser = TestParser()
parser.feed(unicode(htmldata.read(), 'utf-8'))

parser.close()
htmldata.close()
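
The parsers above print each href as it is found. If the links are needed for further processing, a variation for Python 3 might collect them in a list and resolve relative paths against the page URL. This is only a sketch: LinkCollector and extract_links are hypothetical names, and the sample HTML is invented so that the example runs without network access.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collect href values from <a> tags (hypothetical class for illustration)."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser passes tag and attribute names already lowercased.
        if tag != 'a':
            return
        for name, value in attrs:
            if name == 'href' and value:
                # Turn relative paths such as '/about/' into absolute URLs.
                self.links.append(urljoin(self.base_url, value))


def extract_links(html, base_url):
    """Return every link found in `html` as an absolute URL."""
    parser = LinkCollector(base_url)
    parser.feed(html)
    parser.close()
    return parser.links


# Invented sample markup; a real page would come from urlopen(...).read().decode(...).
sample = '<p><a href="/about/">About</a> <A HREF="http://example.com/x">X</A></p>'
for link in extract_links(sample, 'http://www.python-izm.com/'):
    print(link)
```

Returning a list keeps the parsing step separate from whatever is done with the links afterwards, such as filtering or fetching them.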