권영기/web crawler

From ZeroWiki

Revision as of 01:22, 15 July 2012 by imported>trailblaze

Purpose

Build a web crawler in Python, both to learn how to use Python and to scrape the web pages I want (Naver Webtoon (돌아온 럭키짱, 신의 탑, ...), Naver Cast, and various other web pages).

Required skills

HTML
CSS
JavaScript
Python

  HTML, CSS, JavaScript - analyzing web pages
  Python

Progress

Reference documents

http://docs.python.org/
http://hyogeun.tistory.com/107 - try, except.

Getting started

Fetching a web page's source

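import urllib2

# Python 2 code: urllib2 was merged into urllib.request in Python 3.
req = urllib2.Request('http://9632024.tistory.com/974')
try:
	response = urllib2.urlopen(req)
except urllib2.URLError as e:
	# URLError lives in the urllib2 module, so it is referenced through it.
	print e.reason
else:
	# Save the page only if the request succeeded, reusing the same response.
	fo = open("test1.html", "w")
	for line in response.readlines():
		fo.write(line)
	fo.close()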

http://coreapython.hosting.paran.com/howto/HOWTO%20Fetch%20Internet%20Resources%20Using%20urllib2.htm

Extracting only the URLs from the source

fo1 = open("test1.html", "r")
fo2 = open("test2.html", "w")

for line in fo1.readlines():
	# Find the first double-quoted value on the line that starts with http.
	pos = line.find('"http')
	if pos != -1:
		# Copy the characters after the opening quote up to the closing quote.
		for c in range(pos + 1, len(line)):
			if line[c] == '"':
				fo2.write("\n")
				break
			fo2.write(line[c])

fo1.close()
fo2.close()
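
The loop above records at most one URL per line of HTML. As a possible alternative, the sketch below uses a regular expression to collect every double-quoted http(s) URL in a string. The names URL_PATTERN and extract_urls and the sample string are hypothetical, and the pattern is slightly stricter than the loop because it requires "://" after the scheme. It avoids print, so it should run unchanged on Python 2.7 and Python 3.

```python
import re

# Matches a double-quoted value that starts with http:// or https://.
# The pattern is an illustrative assumption, not part of the script above.
URL_PATTERN = re.compile(r'"(https?://[^"]*)"')


def extract_urls(html):
    """Return every double-quoted http(s) URL found in html, in order."""
    return URL_PATTERN.findall(html)


# Hypothetical sample input with two URLs on the same line.
sample = '<a href="http://example.com/a"><img src="http://example.com/b.png"></a>'
urls = extract_urls(sample)
# urls == ['http://example.com/a', 'http://example.com/b.png']
```

To process the saved page, the contents of test1.html could be read into one string and passed to extract_urls.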

Downloading a file

http://www.wellho.net/resources/ex.php4?item=y108/bejo.py
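
A minimal sketch of one way to download a file, not necessarily the approach in the linked example. It assumes Python 3, where urllib2 became urllib.request. The function name download is hypothetical. The file is opened in binary mode so that images are written byte for byte.

```python
import shutil
import urllib.request


def download(url, dest_path):
    """Fetch url and write the response body to dest_path as raw bytes."""
    with urllib.request.urlopen(url) as response, open(dest_path, "wb") as out:
        # copyfileobj streams in chunks, so large files are not held in memory.
        shutil.copyfileobj(response, out)
    return dest_path
```

A call might look like download('http://cfile23.uf.tistory.com/original/2001D2044C945F80495C6F', 'image'), where 'image' is a hypothetical local file name. urlopen raises urllib.error.URLError when the request fails, so a crawler would probably wrap the call in try/except.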

split

 line = 'http://cfile23.uf.tistory.com/original/2001D2044C945F80495C6F'
 line.split('/')[-1] == '2001D2044C945F80495C6F'
 line.split('/')[-2] == 'original'

 say = "This is a line of text"
 part = say.split(' ')
 part == ['This', 'is', 'a', 'line', 'of', 'text']
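
For a crawler, the last piece of a split URL is a natural local file name. The sketch below wraps that idea in a hypothetical helper, filename_from_url. It uses rsplit with a maximum of one split, which is an assumption of convenience: only the final '/' matters here, so the rest of the URL does not need to be cut up.

```python
def filename_from_url(url):
    """Return the text after the last '/' in url."""
    # rsplit with maxsplit=1 splits only once, starting from the right.
    return url.rsplit('/', 1)[-1]


line = 'http://cfile23.uf.tistory.com/original/2001D2044C945F80495C6F'
name = filename_from_url(line)
# name == '2001D2044C945F80495C6F'
```

A URL that ends in '/' yields an empty string, so a caller may want to fall back to a default name in that case.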

swap

Python 2.7.2+ (default, Oct  4 2011, 20:03:08) 
[GCC 4.6.1] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> first = 1
>>> second = 2
>>> first, second = second, first
>>> print first
2
>>> print second
1
>>> first, second = second, first
>>> third = 3
>>> first, second, third = third, first, second
>>> print first, second, third
3 1 2
