Normally, to check whether a remote web file exists, I would call getcode() on the response from urllib's urlopen(), but that is only available in Python 2.6 and newer. In Python 2.5 it's a little more interesting. Thankfully, wget's --spider option can help us out.
from subprocess import Popen, PIPE

def url_exists(url):
    # --spider checks the URL without downloading it; -S prints the server response.
    command = ["wget", "-S", "--spider", url]
    p = Popen(command, stdout=PIPE, stderr=PIPE)
    stdout, stderr = p.communicate()
    # wget reports a missing page as "ERROR 404" on stderr.
    return 'ERROR 404' not in stderr
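
Matching the literal text 'ERROR 404' depends on the exact wording of wget's message. As a minimal sketch of an alternative, the HTTP status lines that -S echoes to stderr can be parsed instead. The names STATUS_LINE and last_status are made up for this illustration, and the assumption that the last status line is the final response (after any redirects) is mine, not something verified against every wget version.

```python
import re
from subprocess import Popen, PIPE

# Matches status lines such as "  HTTP/1.1 404 Not Found" as echoed by wget -S.
STATUS_LINE = re.compile(r'^\s*HTTP/\d(?:\.\d)?\s+(\d{3})', re.MULTILINE)

def last_status(wget_stderr):
    """Return the last HTTP status code found in wget -S output, or None."""
    codes = STATUS_LINE.findall(wget_stderr)
    if not codes:
        return None
    return int(codes[-1])

def url_exists(url):
    command = ["wget", "-S", "--spider", url]
    p = Popen(command, stdout=PIPE, stderr=PIPE)
    stdout, stderr = p.communicate()
    # Decode so the same code handles byte output on newer Pythons.
    status = last_status(stderr.decode('utf-8', 'replace'))
    # Only a 2xx final response counts; no status line at all means failure.
    return status is not None and 200 <= status < 300
```

Keeping the parsing in its own function means it can be checked against saved wget output without touching the network.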

1 comment:
I noticed that this got the wrong answer for some okcupid URLs that I threw at it. I don't think OKWS prints out "ERROR 404" for its 404 pages, but it does put "HTTP/1.0 404" in its response headers. I think it might work better if you disregard whatever's coming out of stdout/stderr and just go by wget's return code. I'm pretty sure wget will return 0 for all pages it fetches successfully and a non-zero code otherwise.
While you're disregarding the output, you don't need Popen or PIPE or communicate or stdout/stderr or whatever. We can just use the os.system call for this (though I admit it's kind of nice having the command components in a Python list) and save a bunch of trouble. One key is using wget's "-q" option, which makes it stop outputting anything (even on error). Here's a simpler version I put together:
from os import system

def url_exists(url):
    command = "wget -qS --spider %s" % url
    res = system(command)
    return not res  # "Success" is backwards for processes
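
One caveat with building a shell string: a URL containing characters like & or ; gets interpreted by the shell. A rough sketch that keeps the return-code idea but passes the command as a list through subprocess.call might look like this. The runner parameter is a made-up hook, added here only so the function can be tried out without wget or a network connection.

```python
import subprocess

def url_exists(url, runner=subprocess.call):
    """Return True if `wget --spider` exits with status 0 for url.

    `runner` is a hypothetical hook for testing; by default the
    command is executed with subprocess.call.
    """
    # A list argument bypasses the shell, so the URL needs no quoting.
    command = ["wget", "-q", "--spider", url]
    return runner(command) == 0
```

For example, url_exists("http://example.com/a?b=1&c=2") hands the whole URL to wget as a single argument.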
Cheers!
Eli
PS, does blogger support code/pre blocks in comments?