XKCD Batch Downloading

Postby Onion_Knight » Tue Jul 29, 2008 4:39 am UTC

Hey Everyone,

I'm a Computer Science graduate who is trying to pick up Python for fun. I heard it was a neat language that had some powerful script utilities, so I decided to give it a shot.

While looking for a little project to test myself, I came across the first Batch Downloading thread. http://forums.xkcd.com/viewtopic.php?f=11&t=25155&p=764716&hilit=Batch+download#p764716.

Taking Berengal's code as a starting point, I tried to make the script fully functional. Here is the script I just finished.

Code: Select all
#Downloads all XKCD comics

#import Web, Reg. Exp, and Operating System libraries
import urllib, re, os

#RegExp for the EndNum variable
CurrentRegExp = re.compile('<h3>Permanent link.*</h3>')

#Check the main XKCD page
site = urllib.urlopen("http://www.xkcd.com/")
contentLine = None

#For each line in the homepage's source...
for line in site.readlines():
    #Break when you find the variable information
    if CurrentRegExp.search(line):
        contentLine = line
        break

#IF the information was found successfully, automatically change EndNum
#ELSE set it to the latest comic as of this writing
if contentLine:
    contentLine = contentLine.split('/')
    EndNum = int(contentLine[3])
else:
    EndNum = 455

#First and Last comics user wishes to download
StartNum = 1
#EndNum = 445 (This is here in case someone wants to make this script a function w/ 2 params)

#Full path of destination folder needs to pre-exist
destinationFolder = "C:\XKCD"

#Full path of "Alt Text" file doesn't need to pre-exist.
#Will be overwritten if it does
textFile = open(destinationFolder+"\AltText.txt",'w')

#Regular Exp. used to find the comics in the webpage source
RegExp = re.compile('.*<img src=".*" title=".*" alt=".*" />.*')

#Reg. Exp used to fix images surrounded by Anchor Tags
LinkRegExp = re.compile(".*<img src=.*")

#XRange creates an iterator to go over the comics
for i in xrange(StartNum, EndNum+1):

    #Gets the Site of the i-th comic
    site = urllib.urlopen("http://www.xkcd.com/"+str(i)+"/")
    contentLine = None

    #For each line in the webpage's source...
    for line in site.readlines():
        #Break when you find the image information
        if RegExp.search(line):
            contentLine = line
            break

    #Skips the non-existent comic #404
    if not contentLine:
        continue

    #Create the Array with the information in it
    info = line.split('"')

    #IF there is a match, image is not embedded in Anchors
    if LinkRegExp.search(info[0]):
        #Gets the url for the image
        source = info[1]
        #The title-text (commonly known on these fora as the alt-text)
        title = info[3]
        #The alt-text
        alt = info[5]
    #ELSE Adjust the indexes to compensate for the Anchor Tags
    else:
        #Gets the url for the image
        source = info[3]
        #The title-text (commonly known on these fora as the alt-text)
        title = info[5]
        #The alt-text
        alt = info[7]
        #Manually downloads the Lojban Translation comic
        urllib.urlretrieve("http://imgs.xkcd.com/comics/lojban_translated.png", os.path.join(destinationFolder, str(i)+" Translated.png"))

    #Printing User-Friendly Messages
    print "Comic %d Found. Downloading..." % i

    #Save image from XKCD to Destination Folder as a PNG (As most comics are PNGs)
    urllib.urlretrieve(source, os.path.join(destinationFolder, str(i)+".png"))
    # Writes the title and alt to a text file
    textFile.write("Comic "+str(i) + ': Alt: "' + alt + '" Title: "' + title + '"\n')

#Graceful program termination
print str(EndNum-StartNum) + " Comics Downloaded"
textFile.close()


This script will detect the newest comic, and download all the comics until then.
It has StartNum and EndNum variables in case someone wishes to modify this code and turn it into a function instead.
I haven't figured out how to determine if one has already downloaded a comic, thus avoiding having to re-download ALL the comics every time you want to download a new one. I have a few ideas that I wanna fool around with, and hopefully one of them works.

I wish to thank Berengal for the basis of the program, and want to remind all users of XKCD's Attribution-NonCommercial Licence.

I attempted to make this as Bullet Proof as possible, but feel free to let me know of any bugs.

Also, if anyone else has some ideas for Python "projects" that would help me expand my Python knowledge, I would appreciate it greatly.

Thanks a lot everyone,
-Onion Knight

Re: XKCD Batch Downloading

Postby Berengal » Tue Jul 29, 2008 3:23 pm UTC

Awesome, I made something useful.

Skipping already downloaded comics: Check if the file exists already.
Code: Select all
fileExists = True
try:
  f = open(os.path.join(destinationFolder, str(i)+'.png'), 'r')
  f.close()
except IOError, e:
  if e.errno == 2:
    fileExists = False
  else:
    raise

if fileExists:
  continue

Put it at the top of the outer for loop for best effect.
Enabling proper resuming or "repair" downloading requires some modification in regards to the title-text file. This is left as an exercise to the reader.
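
One possible way to approach that exercise, as a rough sketch only: read the existing title-text file, collect the comic numbers that already have an entry, and only write entries for the ones that are missing. It is written in Python 3 syntax (the scripts in this thread are Python 2), the helper name `recorded_comics` and the sample lines are made up for illustration, and it assumes the `Comic N: ...` line format written by the script above.

```python
import re

# Matches the start of a line written as 'Comic 12: Alt: "..." Title: "..."'
ENTRY = re.compile(r'^Comic (\d+):')

def recorded_comics(lines):
    """Return the set of comic numbers that already have an entry."""
    found = set()
    for line in lines:
        match = ENTRY.match(line)
        if match:
            found.add(int(match.group(1)))
    return found

# Placeholder lines standing in for the contents of AltText.txt
sample = [
    'Comic 1: Alt: "first alt" Title: "first title"\n',
    'Comic 3: Alt: "third alt" Title: "third title"\n',
]
done = recorded_comics(sample)

# Comics in the range 1..4 that still need an entry
missing = [n for n in range(1, 5) if n not in done]
print(missing)
```

With a real file, the lines would come from `open(path)` instead of the sample list, and the download loop would check both the image file and this set before skipping a comic.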

Re: XKCD Batch Downloading

Postby Onion_Knight » Tue Jul 29, 2008 6:52 pm UTC

Here is an update. I enabled skipping already downloaded comics (Not quite the way Berengal did, but the basic idea was the same 8) ), and fixed the bug that occurred because Comic 256 was also a linked comic.

I assumed 191 was the only linked one, so in the special case, I downloaded the translation (The other link) manually. So the program re-downloaded the translation when it hit comic 256 again.

I also changed the mode in textFile = open(...) from "w" to "a", so the Alt-Text file isn't overwritten when you just want to download the newest comic.
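
To illustrate the difference between the two modes, here is a small sketch in Python 3 syntax. The temporary file and the lines written to it are placeholders, not the real Alt-Text file.

```python
import os
import tempfile

# Hypothetical file used only for this demonstration
path = os.path.join(tempfile.mkdtemp(), "AltText.txt")

# First run: "a" creates the file if it does not exist yet
with open(path, "a") as text_file:
    text_file.write("Comic 1\n")

# Second run: "a" keeps the old contents and adds to the end
with open(path, "a") as text_file:
    text_file.write("Comic 2\n")

with open(path) as text_file:
    contents = text_file.read()

# Reopening with "w" truncates the file before writing
with open(path, "w") as text_file:
    text_file.write("Comic 3\n")

with open(path) as text_file:
    overwritten = text_file.read()

print(repr(contents))
print(repr(overwritten))
```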

Here's the "final" version. I'm pretty sure it's bug-less, but as always, please let me know if any are found.

Code: Select all
#Downloads all XKCD comics

#import Web, Reg. Exp, and Operating System libraries
import urllib, re, os

#RegExp for the EndNum variable
CurrentRegExp = re.compile('<h3>Permanent link.*</h3>')

#Check the main XKCD page
site = urllib.urlopen("http://www.xkcd.com/")
contentLine = None

#For each line in the homepage's source...
for line in site.readlines():
    #Break when you find the variable information
    if CurrentRegExp.search(line):
        contentLine = line
        break

#IF the information was found successfully, automatically change EndNum
#ELSE set it to the latest comic as of this writing
if contentLine:
    contentLine = contentLine.split('/')
    EndNum = int(contentLine[3])
else:
    EndNum = 455

#First and Last comics user wishes to download
StartNum = 1
#EndNum = 192

#Full path of destination folder needs to pre-exist
destinationFolder = "C:\XKCD"

#Full path of "Alt Text" file doesn't need to pre-exist.
#Info will be appended to the end of the file
textFile = open(destinationFolder+"\AltText.txt",'a')

#Regular Exp. used to find the comics in the webpage source
RegExp = re.compile('.*<img src=".*" title=".*" alt=".*" />.*')

#Reg. Exp used to fix images surrounded by Anchor Tags
LinkRegExp = re.compile(".*<img src=.*")

#XRange creates an iterator to go over the comics
for i in xrange(StartNum, EndNum+1):

    #IF you already have the comic, skip downloading it
    if os.path.exists(destinationFolder+"\\"+str(i)+".png"):
        print "Skipping Comic "+str(i)+"..."
        continue

    #Gets the Site of the i-th comic
    site = urllib.urlopen("http://www.xkcd.com/"+str(i)+"/")
    contentLine = None

    #For each line in the webpage's source...
    for line in site.readlines():
        #Break when you find the image information
        if RegExp.search(line):
            contentLine = line
            break

    #Skips the non-existent comic #404
    if not contentLine:
        continue

    #Create the Array with the information in it
    info = line.split('"')

    #IF there is a match, image is not embedded in Anchors
    if LinkRegExp.search(info[0]):
        #Gets the url for the image
        source = info[1]
        #The title-text (commonly known on these fora as the alt-text)
        title = info[3]
        #The alt-text
        alt = info[5]
    #ELSE Adjust the indexes to compensate for the Anchor Tags
    else:
        #Gets the url for the image
        source = info[3]
        #The title-text (commonly known on these fora as the alt-text)
        title = info[5]
        #The alt-text
        alt = info[7]
        if i == 191:
            urllib.urlretrieve("http://imgs.xkcd.com/comics/lojban_translated.png", os.path.join(destinationFolder, str(i)+" Translated.png"))

    #Printing User-Friendly Messages
    print "Comic %d Found. Downloading..." % i

    #Save image from XKCD to Destination Folder as a PNG (As most comics are PNGs)
    urllib.urlretrieve(source, os.path.join(destinationFolder, str(i)+".png"))
    # Writes the title and alt to a text file
    textFile.write("Comic "+str(i) + ': Alt: "' + alt + '" Title: "' + title + '"\n')

#Graceful program termination
print str(EndNum-StartNum) + " Comics Downloaded"
textFile.close()


Hope everyone enjoys this,
-Onion Knight

Re: XKCD Batch Downloading

Postby ediblespread » Tue Jul 29, 2008 10:48 pm UTC

EDIT: My bad! Silly me didn't create the C:\XKCD directory! XD

As a note actually, why don't you create this directory in the code; can Python not do this, or just personal choice for X reason?


EDIT2: Well, I successfully downloaded all the xkcd comics (thanks!), and am now attempting, with basically no knowledge of python, to try and make the code do the same thing for questionable content. Sorry for using your code, but as I said, it works well and I don't know enough to do this from scratch; all credit to you. Basically, I think I can get the basis stuff edited alright, but the more complex stuff just fails for me. I'd muchly appreciate it if you could help me butch- adapt your own code to do qc? ;)

Spoiler:
Code: Select all
#Downloads all QC comics

#import Web, Reg. Exp, and Operating System libraries
import urllib, re, os

#RegExp for the EndNum variable
#QC does not refer to the current episode in text on the homepage, only in the comic itself, so number must come from the <img>
#Can I do that?
CurrentRegExp = re.compile('<img src="http://www.questionablecontent.net/comics/*.png">')

#Check the main QC page
site = urllib.urlopen("http://www.questionablecontent.net/")
contentLine = None

#For each line in the homepage's source...
for line in site.readlines():
    #Break when you find the variable information
    if CurrentRegExp.search(line):
        contentLine = line
        break

#IF the information was found successfully, automatically change EndNum
#ELSE set it to the latest comic as of this writing
#What exactly is the contentLine.split stuff?
if contentLine:
    contentLine = contentLine.split('/')
    EndNum = int(contentLine[3])
else:
    EndNum = 1196

#First and Last comics user wishes to download
StartNum = 1
#EndNum = 192

#Full path of destination folder needs to pre-exist
destinationFolder = "C:\QC"

#Regular Exp. used to find the comics in the webpage source
#QC doesnt have alt text or title, so I removed "title=".*" alt=".*"" from the code
#Is that right?
RegExp = re.compile('.*<img src=".*"/>.*')

#Reg. Exp used to fix images surrounded by Anchor Tags
LinkRegExp = re.compile(".*<img src=.*")

#XRange creates an iterator to go over the comics
for i in xrange(StartNum, EndNum+1):

    #IF you already have the comic, skip downloading it
    if os.path.exists(destinationFolder+"\\"+str(i)+".png"):
        print "Skipping Comic "+str(i)+"..."
        continue

    #Gets the Site of the i-th comic
    #Again, QC has a slightly different format of "http://www.questionablecontent.net/view.php?comic=X"
    site = urllib.urlopen("http://www.questionablecontent.net/view.php?comic="+str(i)+"/")
    contentLine = None

    #For each line in the webpage's source...
    for line in site.readlines():
        #Break when you find the image information
        if RegExp.search(line):
            contentLine = line
            break

    #Create the Array with the information in it
    #Again the line.split stuff?
    info = line.split('"')


# From here on in, I am lost. Completely.

    #IF there is a match, image is not embedded in Anchors
    if LinkRegExp.search(info[0]):
        #Gets the url for the image
        source = info[1]
    #ELSE Adjust the indexes to compensate for the Anchor Tags
    else:
        #Gets the url for the image
        source = info[3]

    #Printing User-Friendly Messages
    print "Comic %d Found. Downloading..." % i

    #Save image from QC to Destination Folder as a PNG (As most comics are PNGs)
    urllib.urlretrieve(source, os.path.join(destinationFolder, str(i)+".png"))

#Graceful program termination
print str(EndNum-StartNum) + " Comics Downloaded"
textFile.close()



Help would be appreciated, though ignore if you wish.

-Edibles

Re: XKCD Batch Downloading

Postby Onion_Knight » Wed Jul 30, 2008 2:30 am UTC

ediblespread wrote:EDIT: My bad! Silly me didn't create the C:\XKCD directory! XD

As a note actually, why don't you create this directory in the code; can Python not do this, or just personal choice for X reason?




You're right. I'll try to automatically create a directory, without deleting the directory if it pre-exists. Thanks. :)
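
A minimal sketch of what that could look like, in Python 3 syntax (`exist_ok` is not available in the Python 2 used by the scripts in this thread, where checking `os.path.isdir` first would be the rough equivalent). The helper name `ensure_folder` and the temporary location are assumptions for illustration.

```python
import os
import tempfile

def ensure_folder(path):
    """Create path (and any missing parents) if needed; leave it alone if it exists."""
    os.makedirs(path, exist_ok=True)
    return path

# A temporary location stands in for the real destination folder
base = tempfile.mkdtemp()
target = os.path.join(base, "XKCD")

ensure_folder(target)

# Drop a file in, then call again to show that nothing gets deleted
open(os.path.join(target, "1.png"), "w").close()
ensure_folder(target)
print(os.listdir(target))
```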

The part where you got lost is there to stop the script from crashing if the XKCD comic is embedded in an anchor tag. That adds extra "s to the line taken from the page, shifting the indexes of the "info" array. The else branch offsets the indexes to compensate.
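
The index shift can be seen on two made-up lines, sketched here in Python 3 syntax. The URLs and attribute values are placeholders; only the shape of the lines follows what the script expects.

```python
# Made-up lines in the shape the script expects; the URLs are placeholders
plain = '<img src="http://example.com/a.png" title="some title" alt="some alt" />'
linked = ('<a href="http://example.com/big">'
          '<img src="http://example.com/b.png" title="t" alt="a" /></a>')

plain_info = plain.split('"')
linked_info = linked.split('"')

# Without an anchor, the piece before the first quote is the <img src= part
print(plain_info[0])   # <img src=
print(plain_info[1])   # the image url

# With an anchor, href="..." contributes two extra quotes,
# so the url, title and alt all move up by 2
print(linked_info[0])  # <a href=
print(linked_info[3])  # the image url
```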

As for QC, all you'll need is the number of the newest comic, which is available on the main page. It's on the line:
<img src="http://www.questionablecontent.net/comics/XXXX.png">.
Once you get that number, loop from 1 to XXXX, download each comic, and you're done! Since there is no Title or Alt to worry about, and since QC doesn't link their images, it'll be a much simpler task. :wink:

Hopefully that is enough to get you going. Python is not a very complex language, so I'm sure you'll be fine. If not, feel free to post back here!
-Onion Knight

Re: XKCD Batch Downloading

Postby brennydoogles » Wed Jul 30, 2008 4:52 am UTC

Being completely unfamiliar with Python, how would I adapt the script to work on Linux? Other than creating a folder for the comics and editing the script to reflect that location (I know enough to do that), I'm not sure what all I would need to change to allow for /'s instead of \'s in the filesystem. Any help would be appreciated!
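
As far as the script posted above goes, the image downloads already use os.path.join, which picks the right separator for the OS. The two places that hard-code a backslash are the "\AltText.txt" concatenation and the os.path.exists check. A sketch of building both paths portably, in Python 3 syntax; the helper name `comic_path` and the folder under the home directory are assumptions for illustration.

```python
import os

def comic_path(folder, number, extension=".png"):
    """Build the path for one comic using the separator of the current OS."""
    return os.path.join(folder, str(number) + extension)

# Hypothetical destination folder inside the user's home directory
folder = os.path.join(os.path.expanduser("~"), "xkcd")

path = comic_path(folder, 42)
alt_text_path = os.path.join(folder, "AltText.txt")
print(path)
print(alt_text_path)
```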

Re: XKCD Batch Downloading

Postby ediblespread » Wed Jul 30, 2008 11:18 am UTC

Right, been doing some work on the code, mostly trying to identify exactly what everything does. Just to double check, these are your variables, aye? (This is what I hate about python; variables are defined anywhere, you just make them up as you go along it seems >_<.)

Spoiler:
currentregexp <-- ?? the text surrounding the last episode of the comic?
contentline <-- the line in the source code that holds the episode of webcomic
endnum <-- the last episode of the comic
startnum <-- the starting episode
destinationfolder <-- C:\QC
regexp <-- ???
info <-- ???
source <-- ???


The following lines are the ones I get stuck on:

Code: Select all
contentLine = contentLine.split('/')
EndNum = int(contentLine[3]) 


I assume this takes contentline, cuts off the non-comic-number part, then turns the comic number into an integer? If so, which part is cut? Defined by where a "/" is found? And if that is right too, then how is it distinguishing between all of the different "/"s?

Code: Select all
RegExp = re.compile('.*<img src=".*"/>.*')


I understand we're trying to make up the line which has the image source in it, but is that line right for QC? And what exactly is it doing?

Code: Select all
info = line.split('"')
source = info[1]


Again, not quite sure how that line split is working, plus have no idea what happens when you add [1] after the info variable...


You can tell I have no clue when it comes to python, can't you? :P I need to pick up a book for it really; I could possibly find this stuff online, but I prefer having something solid to leaf through.

-Edibles

Re: XKCD Batch Downloading

Postby Mat » Wed Jul 30, 2008 3:09 pm UTC

I don't know what I'm doing either but maybe I can help...
ediblespread wrote:
Code: Select all
contentLine = contentLine.split('/')
EndNum = int(contentLine[3]) 

I assume this takes contentline, cuts off the non-comic-number part, then turns the comic number into an integer? If so, which part is cut? Defined by where a "/" is found? And if that is right too, then how is it distinguishing between all of the different "/"s?

Not quite, none of it is cut. It turns the contentLine string into a list of strings by splitting it wherever there's a '/', so the length of the list is one more than the number of '/'s. It's the [3] part which says which bit you want (the 4th one = the bit after the 3rd '/'). Then it converts that to an integer.
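
A quick sketch of that in Python 3 syntax. The exact wording of the "Permanent link" line is an assumption here; only the URL shape matters for the split.

```python
# Assumed shape of the "Permanent link" line on the xkcd homepage
line = '<h3>Permanent link to this comic: http://xkcd.com/455/</h3>'

parts = line.split('/')
print(len(parts))      # 6 pieces from 5 slashes
print(parts[3])        # 455

# The piece is still a string until int() converts it
end_num = int(parts[3])
```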

ediblespread wrote:
Code: Select all
RegExp = re.compile('.*<img src=".*"/>.*')


I understand we're trying to make up the line which has the image source in it, but is that line right for QC? And what exactly is it doing?

I'm not too sure about this, I would've thought the .* at the beginning would match everything? But yeah, it's looking for a line which has an img tag in it somewhere. You could also have a more complicated regular expression with parentheses around the url part so you don't have to split the strings all the time and check for <a> tags. Check out http://www.regular-expressions.info if you haven't already.
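
A rough sketch of the parentheses idea, in Python 3 syntax. The pattern and the sample line are made up to follow the attribute order the original script looks for; real pages may differ.

```python
import re

# Groups capture the three attribute values; [^"]* stops at the closing quote
IMG = re.compile(r'<img src="([^"]*)" title="([^"]*)" alt="([^"]*)" />')

# Made-up line with the image wrapped in an anchor tag
line = ('<a href="http://example.com/big">'
        '<img src="http://example.com/b.png" title="t" alt="a" /></a>')

match = IMG.search(line)
if match:
    # No index juggling: the groups are the same with or without the <a> tag
    source, title, alt = match.groups()
    print(source, title, alt)
```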

Also I had a quick look at your code.
CurrentRegExp = re.compile('<img src="http://www.questionablecontent.net/comics/*.png">')

This won't work. The * means nothing by itself; it repeats whatever comes before it any number of times (including zero), so here it only repeats the "/". You probably mean something like ".+", which will match one or more characters. But again, I'm not sure if that would match too much, so I would use something like "<img src="blablabla/([^.]+.png)">", where [^.]+ matches one or more characters that are not a dot. (Note: you will need to escape everything with backslashes several hundred times before this will actually work)
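
Something along these lines might do it, sketched in Python 3 syntax. The sample line follows the shape quoted in this thread with a sample comic number, and using \d+ for the number is an assumption that the file names are purely numeric.

```python
import re

# \d+ captures the comic number; \. matches a literal dot
LATEST = re.compile(
    r'<img src="http://www\.questionablecontent\.net/comics/(\d+)\.png">')

# Line in the shape quoted in this thread, with a sample number
line = '<img src="http://www.questionablecontent.net/comics/1196.png">'

match = LATEST.search(line)
end_num = int(match.group(1)) if match else None
print(end_num)

# The original pattern fragment does not match this line at all
broken = re.search('comics/*.png', line)
print(broken)
```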

ediblespread wrote:
Code: Select all
info = line.split('"')
source = info[1]


Again, not quite sure how that line split is working, plus have no idea what happens when you add [1] after the info variable...

line.split('"') returns a list. info[1] gets the 2nd element in the list.

Re: XKCD Batch Downloading

Postby ediblespread » Wed Jul 30, 2008 6:50 pm UTC

Right, that makes a lot more sense now...

But hold on. Knowing that all qc comic pngs are saved like the following: http://www.questionablecontent.net/comics/X.png, where X is the number of the comic, couldn't I just read off the latest comic (or even easier, get it inputted, but reading off is probably more professional ;)), and then use that as the source?

Something like...

Code: Select all
#Downloads all QC comics

#import Web, Reg. Exp, and Operating System libraries
import urllib, re, os

#RegExp for the EndNum variable
CurrentRegExp = re.compile('<img src="http://www.questionablecontent.net/comics/(.+).png">')

#Check the main QC page
site = urllib.urlopen("http://www.questionablecontent.net/")
contentLine = None

#For each line in the homepage's source...
for line in site.readlines():
    #Break when you find the variable information
    if CurrentRegExp.search(line):
        contentLine = line
        break

#IF the information was found successfuly automatically change EndNum
#ELSE set it to the latest comic as of this writing
if contentLine:
    contentLine = contentLine.split('/')
    EndNum = int(contentLine[4])
else:
    EndNum = 1197

#First and Last comics user wishes to download
StartNum = 1
#EndNum = 192

#Full path of destination folder needs to pre-exist
destinationFolder = "C:\QC"

#XRange creates an iterator to go over the comics
for i in xrange(StartNum, EndNum+1):

    #IF you already have the comic, skip downloading it
    if os.path.exists(destinationFolder+"\\"+str(i)+".png"):
        print "Skipping Comic "+str(i)+"..."
        continue

    #Create the source url for the current comic number
    source = "http://www.questionablecontent.net/comics/"+str(i)+".png"

    #Printing User-Friendly Messages
    print "Comic %d Found. Downloading..." % i

    #Save image from QC to Destination Folder as a PNG (As most comics are PNGs)
    urllib.urlretrieve(source, os.path.join(destinationFolder, str(i)+".png"))

#Graceful program termination
print str(EndNum-StartNum) + " Comics Downloaded"


??

Or have I seriously misread the coding?

-Edibles


PS: Still unsure about the re.compile stuff. For a start, the original coder used the line "CurrentRegExp = re.compile('<h3>Permanent link.*</h3>')" to search for the information after "permanent link" (I think?). So, why wouldn't this work for me too? Sorry, still trying to get my head around this re. stuff; escaping is even worse too.

Re: XKCD Batch Downloading

Postby Berengal » Wed Jul 30, 2008 6:57 pm UTC

I'd have done it somewhat like this:
Code: Select all
import urllib
destinationFolder = #somewhere
startComic = #some number
endComic = #some number
for i in range(startComic, endComic+1):
  urllib.urlretrieve("http://www.questionablecontent.net/comics/%d.png" % i, destinationFolder)

It's just a single command in bash, so I'd probably do that though.

Re: XKCD Batch Downloading

Postby ediblespread » Wed Jul 30, 2008 7:16 pm UTC

Interesting.

Runs okay but then complains of a lack of permission concerning C:\QC. Which is weird, since your version seems to just be a cut down version of the other, and it runs fine...

-Edibles

EDIT: Edited your code slightly and got it working. Used this:

source = "http://www.questionablecontent.net/comics/"+str(i)+".png"
urllib.urlretrieve(source, os.path.join(destinationFolder, str(i)+".png"))
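
The permission complaint most likely came from handing urlretrieve the folder itself as its second argument, where it expects the name of a file to write. A sketch of building the url and file path for each comic without downloading anything, in Python 3 syntax; the helper name `download_plan` and the relative "QC" folder are made up for illustration.

```python
import os

BASE_URL = "http://www.questionablecontent.net/comics/%d.png"

def download_plan(folder, start, end):
    """Return (url, file path) pairs for comics start..end inclusive."""
    plan = []
    for number in range(start, end + 1):
        url = BASE_URL % number
        # The target must name a file inside the folder, not the folder itself
        target = os.path.join(folder, "%d.png" % number)
        plan.append((url, target))
    return plan

plan = download_plan("QC", 1, 3)
for url, target in plan:
    print(url, "->", target)
    # urllib.request.urlretrieve(url, target) would do the actual download
```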

Re: XKCD Batch Downloading

Postby ediblespread » Wed Jul 30, 2008 9:39 pm UTC

I now have working scripts for 3 webcomics; qc, darth and droids and lucid-tv, plus of course the original xkcd. The scripts for qc, d&d and lucid-tv were put together using Berengal's example, so all credit to him. The script for the xkcd is obviously Onion_Knight's. My scripts are extremely basic, especially compared to his (no automatic lastnum or skipping), and also have a horrible structure, since I had to add lines of code to deal with problematic episodes but really was too lazy to do it properly. When I can be bothered, I'm going to go back and structure them properly, and then when I've learned a bit more python I hope to try and understand and add Onion_Knight's functionality to them ^^.

Thanks guys; does anyone want to see the scripts? They're extremely basic, but if anyone was looking to get all the comics from any of the above webcomics, then it'd save them looking out urls and problematic episodes etc etc.

-Edibles

Re: XKCD Batch Downloading

Postby Onion_Knight » Sat Aug 02, 2008 7:37 pm UTC

For those who are still a little fuzzy on the matter, split works like so:
Code: Select all
var2 = "http://xkcd.com/456/"
var1 = var2.split('/')

This will make var1 a list of strings with the following contents:
var1[0] => "http:"
var1[1] => ""
var1[2] => "xkcd.com"
var1[3] => "456"
var1[4] => ""
That is why I used index 3 when I wanted to find the comic number. ;) Keep in mind that if you don't give split any parameters, it splits on runs of whitespace instead.

re.compile works like so:
Code: Select all
RegExp = re.compile('.*<img src=".*"/>.*')

This will make RegExp a regular expression object that matches any string that has the following format:
Code: Select all
{any number of characters}<img src="{any number of characters}"/>{any number of characters}

Keep in mind that you only need to escape certain characters, such as: . \ [ ] ( ) * + ? ^ $ |
This is because they are also used to create a regular expression, thus if you wish to search for an actual period, you'd do \.
All re.compile does is return an object that will match strings. Using that object to do the matching/searching is what you REALLY want it to do. :) Make sure your Reg. Expression is specific enough to only pick up the image you need, and not any other images that happen to be on the page.
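
A small sketch of using the compiled object and of escaping, in Python 3 syntax. The sample strings are made up. Because search() looks anywhere in the line, the leading and trailing .* in the patterns above are probably not needed.

```python
import re

# Compile once, then reuse the object for every line
pattern = re.compile(r'<img src=".*"/>')
sample = '<p><img src="a.png"/></p>'

found_anywhere = pattern.search(sample)   # search scans the whole line
found_at_start = pattern.match(sample)    # match only tries the very start

# An unescaped dot matches any character; an escaped one only a real dot
loose = re.search(r'1.png', '1xpng')
strict = re.search(r'1\.png', '1xpng')

# re.escape adds the backslashes for you
escaped = re.escape('1.png')

print(bool(found_anywhere), bool(found_at_start))
print(bool(loose), bool(strict))
print(escaped)
```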

Also, I was wondering if the mods, or anyone else, knows if posting scripts for non-XKCD comics is allowed here? Or is that a violation of the forum rules?

Thanks everyone,
-Onion Knight

Re: XKCD Batch Downloading

Postby ediblespread » Sat Aug 02, 2008 7:49 pm UTC

Thanks onion. I've actually learnt most of that stuff over playing with Python over the last few days, but reiteration is good! ^^

I'm currently working on a variation of this script, to find and download a bunch of links from a number of webpages, and am making good use of reg.exprs... just a pity that the source code for the website is written in HUGE lines XD. But I got it with a reg.expr and a split.

One weird thing is, I used the exact same notation to create a text file and write to it as you did here... and yet I get the error message: "C:\\FOLDER" doesn't exist... when the folder is typed in as C:\FOLDER. But I asked in my thread for help with that, so don't feel at all obliged to help here :p

Like I said, I've got cut down versions of scripts for lucid-tv and darth and droids, based off of Berengals. Might get round to expanding them now actually, since I'm waiting on an answer to a problem.

I wonder how you'd do dated webcomics... especially if they didn't always follow a set "Mon/tues/wed" date but instead updated sporadically... load up each and every webpage that might contain a comic, check for a line (any line really, since the webpage won't exist otherwise), and then only save it if it exists?

-Edibles

Re: XKCD Batch Downloading

Postby Onion_Knight » Sat Aug 02, 2008 8:10 pm UTC

ediblespread wrote:Thanks onion. I've actually learnt most of that stuff over playing with Python over the last few days, but reiteration is good! ^^

...

I wonder how you'd do dated webcomics... especially if they didn't always follow a set "Mon/tues/wed" date but instead updated sporadically... load up each and every webpage that might contain a comic, check for a line (any line really, since the webpage won't exist otherwise), and then only save it if it exists?

-Edibles


For dated comics, I'll probably just have an if statement check whether there is a 404. If you try to access a comic that doesn't exist, you know to skip it; otherwise, download it. ;)
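
A minimal sketch of such a check, in Python 3 syntax, where urllib.request.urlopen raises HTTPError for a 404 response. The function name `comic_exists` and the `opener` parameter (there so the check can be tried without touching the network) are assumptions for illustration.

```python
import urllib.error
import urllib.request

def comic_exists(url, opener=urllib.request.urlopen):
    """Return True if the page opens, False if the server answers 404."""
    try:
        response = opener(url)
    except urllib.error.HTTPError as error:
        if error.code == 404:
            return False
        raise  # anything other than "not found" is a real problem
    response.close()
    return True

# Usage inside a download loop would look roughly like:
#     if not comic_exists(page_url):
#         continue
```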
-Onion Knight
Onion_Knight
 
Posts: 22
Joined: Tue Jul 22, 2008 9:13 pm UTC

Re: XKCD Batch Downloading

Postby EvanED » Sun Aug 03, 2008 2:14 am UTC

Onion_Knight wrote:Also, I was wondering if the mods, or anyone else, knows if posting scripts for non-XKCD comics are allowed here? Or is that a violation of the forum rules?
I would say go ahead. If another mod disagrees we'll delete it or something like that, but you won't get in trouble.
EvanED
 
Posts: 4150
Joined: Mon Aug 07, 2006 6:28 am UTC
Location: Madison, WI

Re: XKCD Batch Downloading

Postby brennydoogles » Mon Aug 04, 2008 5:46 pm UTC

I'll admit immediately that I know nothing about Python (I'm a Java guy), but I have been trying to make this script useable for Linux. Here's the code so far:
Code: Select all
    #Downloads all XKCD comics

    #import Web, Reg. Exp, and Operating System libraries
import urllib, re, os

    #RegExp for the EndNum variable
CurrentRegExp = re.compile('<h3>Permanent link.*</h3>')

    #Check the main XKCD page
site = urllib.urlopen("http://www.xkcd.com/")
contentLine = None

    #For each line in the homepage's source...
for line in site.readlines():
        #Break when you find the variable information
   if CurrentRegExp.search(line):
      contentLine = line
   break

    #IF the information was found successfuly automatically change EndNum
    #ELSE set it to the latest comic as of this writing
if contentLine:
   contentLine = contentLine.split('/')
   EndNum = int(contentLine[3])
else:
   EndNum = 455

    #First and Last comics user wishes to download
StartNum = 1
    #EndNum = 192

    #Full path of destination folder needs to pre-exist
destinationFolder = "/home/brendon/xkcd"

    #Full path of "Alt Text" file doesn't need to pre-exist.
    #Info will be appended to the end of the file
textFile = open(destinationFolder+"\AltText.txt",'a')

    #Regular Exp. used to find the comics in the webpage source
RegExp = re.compile('.*<img src=".*" title=".*" alt=".*" />.*')

    #Reg. Exp used to fix images surrounded by Anchor Tags
LinkRegExp = re.compile(".*<img src=.*")

    #XRange creates an iterator to go over the comics
for i in xrange(StartNum, EndNum+1):

        #IF you already have the comic, skip downloading it
   if os.path.exists(destinationFolder+"\\"+str(i)+".png"):
      print "Skipping Comic "+str(i)+"..."
   continue

        #Gets the Site of the i-th comic
site = urllib.urlopen("http://www.xkcd.com/"+str(i)+"/")
contentLine = None

        #For each line in the webpage's source...
for line in site.readlines():
            #Break when you find the image information
   if RegExp.search(line):
      contentLine = line
   break

        #Skips the non-existant comic #404
if not contentLine:
   continue

        #Create the Array with the information in it
   info = line.split('"')

        #IF there is a match, image is not embedded in Anchors
if LinkRegExp.search(info[0]):
            #Gets the url for the image
   source = info[1]
            #The title-text (commonly known on these fora as the alt-text)
   title = info[3]
            #The alt-text
   alt = info[5]
        #ELSE Adjust the indexes to compensate for the Anchor Tags
else:
            #Gets the url for the image
   source = info[3]
            #The title-text (commonly known on these fora as the alt-text)
   title = info[5]
            #The alt-text
   alt = info[7]
if i == 191:
   urllib.urlretrieve("http://imgs.xkcd.com/comics/lojban_translated.png", os.path.join(destinationFolder, str(i)+" Translated.png"))

        #Printing User-Friendly Messages
   print "Comic %d Found. Downloading..." % i

        #Save image from XKCD to Destination Folder as a PNG (As most comics are PNGs)
urllib.urlretrieve(source, os.path.join(destinationFolder, str(i)+".png"))
        # Writes the title and alt to a text file
textFile.write("Comic "+str(i) + ': Alt: "' + alt + '" Title: "' + title + '"\n')

    #Graceful program termination
print str(EndNum-StartNum) + " Comics Downloaded"
textFile.close()

more or less untouched from the code I copied from the forum, with the exception of changing the comic directory and some tabbing as per error messages from terminal. I am currently getting this error
Code: Select all
brendon@brendon-linux:~/bin$ python xkcd.py
  File "xkcd.py", line 66
    continue
SyntaxError: 'continue' not properly in loop

but since I am unfamiliar with Python Syntax I'm not sure what the problem is. Any guesses?
Blessed are the Cheese Makers?
brennydoogles
 
Posts: 16
Joined: Tue Jul 29, 2008 9:40 pm UTC

Re: XKCD Batch Downloading

Postby Onion_Knight » Mon Aug 04, 2008 6:20 pm UTC

brennydoogles wrote:I'll admit immediately that I know nothing about Python (I'm a Java guy), but I have been trying to make this script useable for Linux. Here's the code so far:
Code: Select all
    #Downloads all XKCD comics

    #import Web, Reg. Exp, and Operating System libraries
import urllib, re, os

    #RegExp for the EndNum variable
CurrentRegExp = re.compile('<h3>Permanent link.*</h3>')

    #Check the main XKCD page
site = urllib.urlopen("http://www.xkcd.com/")
contentLine = None

    #For each line in the homepage's source...
for line in site.readlines():
        #Break when you find the variable information
   if CurrentRegExp.search(line):
      contentLine = line
   break

    #IF the information was found successfuly automatically change EndNum
    #ELSE set it to the latest comic as of this writing
if contentLine:
   contentLine = contentLine.split('/')
   EndNum = int(contentLine[3])
else:
   EndNum = 455

    #First and Last comics user wishes to download
StartNum = 1
    #EndNum = 192

    #Full path of destination folder needs to pre-exist
destinationFolder = "/home/brendon/xkcd"

    #Full path of "Alt Text" file doesn't need to pre-exist.
    #Info will be appended to the end of the file
textFile = open(destinationFolder+"\AltText.txt",'a')

    #Regular Exp. used to find the comics in the webpage source
RegExp = re.compile('.*<img src=".*" title=".*" alt=".*" />.*')

    #Reg. Exp used to fix images surrounded by Anchor Tags
LinkRegExp = re.compile(".*<img src=.*")

    #XRange creates an iterator to go over the comics
for i in xrange(StartNum, EndNum+1):

        #IF you already have the comic, skip downloading it
   if os.path.exists(destinationFolder+"\\"+str(i)+".png"):
      print "Skipping Comic "+str(i)+"..."
   continue

        #Gets the Site of the i-th comic
site = urllib.urlopen("http://www.xkcd.com/"+str(i)+"/")
contentLine = None

        #For each line in the webpage's source...
for line in site.readlines():
            #Break when you find the image information
   if RegExp.search(line):
      contentLine = line
   break

        #Skips the non-existant comic #404
if not contentLine:
   continue

        #Create the Array with the information in it
   info = line.split('"')

        #IF there is a match, image is not embedded in Anchors
if LinkRegExp.search(info[0]):
            #Gets the url for the image
   source = info[1]
            #The title-text (commonly known on these fora as the alt-text)
   title = info[3]
            #The alt-text
   alt = info[5]
        #ELSE Adjust the indexes to compensate for the Anchor Tags
else:
            #Gets the url for the image
   source = info[3]
            #The title-text (commonly known on these fora as the alt-text)
   title = info[5]
            #The alt-text
   alt = info[7]
if i == 191:
   urllib.urlretrieve("http://imgs.xkcd.com/comics/lojban_translated.png", os.path.join(destinationFolder, str(i)+" Translated.png"))

        #Printing User-Friendly Messages
   print "Comic %d Found. Downloading..." % i

        #Save image from XKCD to Destination Folder as a PNG (As most comics are PNGs)
urllib.urlretrieve(source, os.path.join(destinationFolder, str(i)+".png"))
        # Writes the title and alt to a text file
textFile.write("Comic "+str(i) + ': Alt: "' + alt + '" Title: "' + title + '"\n')

    #Graceful program termination
print str(EndNum-StartNum) + " Comics Downloaded"
textFile.close()

more or less untouched from the code I copied from the forum, with the exception of changing the comic directory and some tabbing as per error messages from terminal. I am currently getting this error
Code: Select all
brendon@brendon-linux:~/bin$ python xkcd.py
  File "xkcd.py", line 66
    continue
SyntaxError: 'continue' not properly in loop

but since I am unfamiliar with Python Syntax I'm not sure what the problem is. Any guesses?


Python uses indentation as the basis for grouping chunks of code. This means that there is no need to brace-bracket parts of code together. An example would be this Python code...
Spoiler:
Code: Select all
i=0
while True:
    i = i + 1
    if i>10:
        print "i > 10\n"
        break
    else:
        print "i < 10\n"

...is the same as this C++ code...
Code: Select all
i=0;
while (1)
{
i++;
if(i>10)
{
printf("i > 10\n");
break;
}
else printf("i < 10\n");
}

If you do not have the indentations done properly, it will be the same as doing this:
Spoiler:
Code: Select all
i=0
while True:
    i = i + 1
    if i>10:
        print "i > 10\n"
    break
    else:
        print "i < 10\n"

...is the same as this C++ code...
Code: Select all
i=0;
while (1)
{
i++;
if(i>10)
{
printf("i > 10\n");
}
break;    //This break will ALWAYS execute
else printf("i < 10\n");
}

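Here the break is no longer inside the if, so it would execute on the very first pass and the loop would only ever run once. Worse, the else no longer directly follows its if, so neither version is even valid: Python reports a syntax error and the C++ will not compile.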

This may be the problem you're having, since it is complaining that the continue isn't in a loop. Make sure the indentation is correct, and you should be good to go.
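
To see the effect in isolation, here is a small sketch (Python 3, whereas the script above is Python 2) that compiles two snippets without running them. The helper name check_syntax is made up for this illustration; the only difference between the snippets is whether the continue is still indented inside the for loop:

```python
def check_syntax(source):
    """Return None if `source` compiles, otherwise the SyntaxError message."""
    try:
        compile(source, "<snippet>", "exec")
    except SyntaxError as err:
        return err.msg
    return None


# The continue is indented under the for loop, so it belongs to the loop.
inside_loop = (
    "for i in range(3):\n"
    "    if i == 1:\n"
    "        continue\n"
)

# The if block has lost its indentation and now sits outside the loop.
outside_loop = (
    "for i in range(3):\n"
    "    pass\n"
    "if True:\n"
    "    continue\n"
)

print(check_syntax(inside_loop))
print(check_syntax(outside_loop))
```

The second snippet is rejected before any code runs, which matches the error in the terminal output above.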

Hope that helps,
-Onion Knight
Onion_Knight
 
Posts: 22
Joined: Tue Jul 22, 2008 9:13 pm UTC

Re: XKCD Batch Downloading

Postby ediblespread » Mon Aug 04, 2008 6:21 pm UTC

Hrm. I don't know Python very well meself as you can see, but you should double check that the tabbing is done right; copying from place to place has caused nightmares for me before in Python with tabbing. Seriously, I really don't like using tabbing as a way to tell when something ends... usually. It is neat for big nested things I suppose.

-Edibles


EDIT: Damn. Ninja'ed! :P
ediblespread
 
Posts: 33
Joined: Fri Jul 25, 2008 6:29 pm UTC

Re: XKCD Batch Downloading

Postby J Spade » Mon Aug 04, 2008 7:41 pm UTC

Did you try "import xkcdComics"?
J Spade
Luppoewagan
 
Posts: 523
Joined: Wed Apr 18, 2007 7:56 pm UTC
Location: Up a creek without a paddle

Re: XKCD Batch Downloading

Postby brennydoogles » Tue Aug 05, 2008 2:47 am UTC

ediblespread wrote:you should double check that the tabbing is done right


Unfortunately I am completely unfamiliar with Python Tabbing rules, and thus have NO IDEA if it is tabbed correctly. Does anyone have the Python file (tabbed correctly and currently working) hosted anywhere?? If not, I would be more than happy to host the file on my site if someone would be willing to either email me a working copy of the file, or post the source on my private pastebin http://brennydoogles.pastebin.com/. Thanks!!
Blessed are the Cheese Makers?
brennydoogles
 
Posts: 16
Joined: Tue Jul 29, 2008 9:40 pm UTC

Re: XKCD Batch Downloading

Postby gorcee » Tue Aug 05, 2008 3:25 am UTC

brennydoogles wrote:
ediblespread wrote:you should double check that the tabbing is done right


Unfortunately I am completely unfamiliar with Python Tabbing rules, and thus have NO IDEA if it is tabbed correctly. Does anyone have the Python file (tabbed correctly and currently working) hosted anywhere?? If not, I would be more than happy to host the file on my site if someone would be willing to either email me a working copy of the file, or post the source on my private pastebin http://brennydoogles.pastebin.com/. Thanks!!


Just to summarize... there are no tabbing rules other than all tabbing must be consistent.

So if you pick a tab as your nest delimiter, it should always be a tab. If you pick 2 spaces, it should always be 2 spaces.

Good editors should pick up the intentions of copy-pasted code and convert it as suited for the current project.
gorcee
 
Posts: 1501
Joined: Sun Jul 13, 2008 3:14 am UTC
Location: Charlottesville, VA

Re: XKCD Batch Downloading

Postby ediblespread » Tue Aug 05, 2008 10:34 am UTC

Yeah, it's basically just a general rule... I don't like it but *shrug* (I found some weird errors that happen when my code in Notepad++ is opened in the Python interpreter and it doesn't tab quite right). Generally, for ease of use (although as said you just have to keep it consistent), move one tab in for each nested statement:

Code: Select all
for i in range(a,b):
     #We move in a tab to write the code that is to be looped.
#And when finished we merely move out a tab to signal that the to-be-looped code is finished.



-Edibles


PS: That there code hasn't actually got a tab, as I dunno how to do them in a web browser; I just used 5 spaces instead. But use tabs. XD
ediblespread
 
Posts: 33
Joined: Fri Jul 25, 2008 6:29 pm UTC

Re: XKCD Batch Downloading

Postby Berengal » Tue Aug 05, 2008 1:35 pm UTC

Actually, don't use tabs at all, because they can and will be displayed differently for everyone. This can lead to much anger:
Code: Select all
# This is valid (and displays properly in the text editing box)
if something():
   doStuff()
        andMoreStuff()

# This is invalid (but displays correctly in the code box):
if something():
   doStuff()
   andMoreStuff()

The solution is of course to always use just tabs or always use just spaces, but spaces are more consistent anyway, and several editors will insert spaces instead of tabs when you press tab (most Python editors, or editors in Python mode, default to this) so it's easier to keep the code clean anyway.
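
For checking a pasted script before running it, a rough sketch along these lines may help. It is written for Python 3, the function names are invented for this example, and it only looks at leading whitespace, so it is a quick diagnostic and not a full indentation checker:

```python
def mixed_indent_lines(source):
    """Return 1-based numbers of lines whose indentation mixes tabs and spaces."""
    flagged = []
    for number, line in enumerate(source.splitlines(), start=1):
        # Slice off everything after the leading run of spaces and tabs.
        indent = line[: len(line) - len(line.lstrip(" \t"))]
        if " " in indent and "\t" in indent:
            flagged.append(number)
    return flagged


def indent_styles(source):
    """Return which indent characters ('tab', 'space') start lines in the file."""
    styles = set()
    for line in source.splitlines():
        if line.startswith("\t"):
            styles.add("tab")
        elif line.startswith(" "):
            styles.add("space")
    return styles
```

If indent_styles reports both "tab" and "space", the file mixes the two styles somewhere. Python 3 itself refuses ambiguous mixes with a TabError, but a check like this points at the offending lines first.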

This almost belongs in Religious wars. But only almost.
It is practically impossible to teach good programming to students who are motivated by money: As potential programmers they are mentally mutilated beyond hope of regeneration.
Berengal
Superabacus Mystic of the First Rank
 
Posts: 2707
Joined: Thu May 24, 2007 5:51 am UTC
Location: Bergen, Norway

Re: XKCD Batch Downloading

Postby brennydoogles » Tue Aug 05, 2008 2:19 pm UTC

In order to not have to deal with the indentation issue, I re-copied and pasted the code into a Python IDE with auto-indentation. I was thrilled when my terminal window began listing the comics as they were downloaded. There was however one problem: Instead of downloading the comics it downloaded the banner for the store 126 times before I killed the script. Here is the source with every edit I made:
Code: Select all
#Downloads all XKCD comics

#import Web, Reg. Exp, and Operating System libraries
import urllib, re, os

#RegExp for the EndNum variable
CurrentRegExp = re.compile('<h3>Permanent link.*</h3>')

#Check the main XKCD page
site = urllib.urlopen("http://www.xkcd.com/")
contentLine = None

#For each line in the homepage's source...
for line in site.readlines():
    #Break when you find the variable information
    if CurrentRegExp.search(line):
        contentLine = line
        break

#IF the information was found successfuly automatically change EndNum
#ELSE set it to the latest comic as of this writing
if contentLine:
    contentLine = contentLine.split('/')
    EndNum = int(contentLine[3])
else:
    EndNum = 455

#First and Last comics user wishes to download
StartNum = 1
#EndNum = 192

#Full path of destination folder needs to pre-exist
destinationFolder = "/home/brendon/xkcd"

#Full path of "Alt Text" file doesn't need to pre-exist.
#Info will be appended to the end of the file
textFile = open(destinationFolder+"/AltText.txt",'a')

#Regular Exp. used to find the comics in the webpage source
RegExp = re.compile('.*<img src=".*" title=".*" alt=".*" />.*')

#Reg. Exp used to fix images surrounded by Anchor Tags
LinkRegExp = re.compile(".*<img src=.*")

#XRange creates an iterator to go over the comics
for i in xrange(StartNum, EndNum+1):

    #IF you already have the comic, skip downloading it
    if os.path.exists(destinationFolder+"/"+str(i)+".png"):
        print "Skipping Comic "+str(i)+"..."
        continue

    #Gets the Site of the i-th comic
    site = urllib.urlopen("http://www.xkcd.com/"+str(i)+"/")
    contentLine = None

    #For each line in the webpage's source...
    for line in site.readlines():
        #Break when you find the image information
        if RegExp.search(line):
            contentLine = line
            break

    #Skips the non-existant comic #404
    if not contentLine:
        continue

    #Create the Array with the information in it
    info = line.split('"')

    #IF there is a match, image is not embedded in Anchors
    if LinkRegExp.search(info[0]):
        #Gets the url for the image
        source = info[1]
        #The title-text (commonly known on these fora as the alt-text)
        title = info[3]
        #The alt-text
        alt = info[5]
    #ELSE Adjust the indexes to compensate for the Anchor Tags
    else:
        #Gets the url for the image
        source = info[3]
        #The title-text (commonly known on these fora as the alt-text)
        title = info[5]
        #The alt-text
        alt = info[7]
        if i == 191:
            urllib.urlretrieve("http://imgs.xkcd.com/comics/lojban_translated.png", os.path.join(destinationFolder, str(i)+" Translated.png"))

    #Printing User-Friendly Messages
    print "Comic %d Found. Downloading..." % i

    #Save image from XKCD to Destination Folder as a PNG (As most comics are PNGs)
    urllib.urlretrieve(source, os.path.join(destinationFolder, str(i)+".png"))
    # Writes the title and alt to a text file
    textFile.write("Comic "+str(i) + ': Alt: "' + alt + '" Title: "' + title + '"\n')

#Graceful program termination
print str(EndNum-StartNum) + " Comics Downloaded"
textFile.close()

[Attachment: xkcd.png, screenshot of problem with downloading script]
Blessed are the Cheese Makers?
brennydoogles
 
Posts: 16
Joined: Tue Jul 29, 2008 9:40 pm UTC

Re: XKCD Batch Downloading

Postby Berengal » Tue Aug 05, 2008 2:46 pm UTC

Try:
Code: Select all
RegExp = re.compile('^\s*<img src=".*" title=".*" alt=".*" />.*')

Although that will choke on the link-pictures.
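
One possible way around the link-picture problem is to drop the start-of-line anchor and use capture groups that stop at the next quote. This is only a sketch in Python 3: the two sample lines are made up to resemble the markup the scripts expect, and parse_image_line is a hypothetical helper, not part of any script in this thread:

```python
import re

# Capture groups pull out the URL, title text and alt text directly.
# [^"]* stops at the next quote, unlike .* which runs to the last quote on the line.
IMG_PATTERN = re.compile(r'<img src="([^"]+)" title="([^"]*)" alt="([^"]*)" />')

# Hypothetical sample lines: one bare image, one wrapped in an anchor tag.
plain_line = '<img src="http://imgs.xkcd.com/comics/a.png" title="hover" alt="A" />'
linked_line = (
    '<a href="http://example.com/">'
    '<img src="http://imgs.xkcd.com/comics/b.jpg" title="t" alt="B" /></a>'
)


def parse_image_line(line):
    """Return (source, title, alt) for the first matching <img> tag, or None."""
    match = IMG_PATTERN.search(line)
    return match.groups() if match else None


print(parse_image_line(plain_line))
print(parse_image_line(linked_line))
```

Because search() is not anchored, the same pattern handles both cases. It will still match any other image written in the same attribute order, so restricting the src part to the comics directory may be needed as well.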
It is practically impossible to teach good programming to students who are motivated by money: As potential programmers they are mentally mutilated beyond hope of regeneration.
Berengal
Superabacus Mystic of the First Rank
 
Posts: 2707
Joined: Thu May 24, 2007 5:51 am UTC
Location: Bergen, Norway

Re: XKCD Batch Downloading

Postby brennydoogles » Wed Aug 06, 2008 2:01 pm UTC

Berengal wrote:Try:
Code: Select all
RegExp = re.compile('^\s*<img src=".*" title=".*" alt=".*" />.*')

Although that will choke on the link-pictures.


That seems to have worked wonders, but now I have another problem. I get an error when trying to open the first 90 comics saying that the image file is not a png. If you change the extension to .jpg it displays correctly. How would I edit the script to have the first 90 images have a .jpg extension while the rest have a .png extension?? Here is a screenshot.
[Attachment: xkcd2.png]



:::EDIT:::

There are actually mis-extensioned comics scattered randomly about throughout all of them, but the first 90 are all jpgs. Very strange.
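
One way to pick the extension without trusting the URL is to look at the first bytes of the downloaded data, since PNG, JPEG and GIF files each start with a fixed signature. A sketch in Python 3; guess_extension is a name invented here, and falling back to "png" simply mirrors the script's current behaviour:

```python
PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"
JPEG_SIGNATURE = b"\xff\xd8\xff"
GIF_SIGNATURES = (b"GIF87a", b"GIF89a")


def guess_extension(data, default="png"):
    """Guess a file extension from the leading bytes of image data."""
    if data.startswith(PNG_SIGNATURE):
        return "png"
    if data.startswith(JPEG_SIGNATURE):
        return "jpg"
    if data.startswith(GIF_SIGNATURES):  # startswith accepts a tuple of prefixes
        return "gif"
    return default  # unknown format: keep the script's old assumption
```

The script would then read the image bytes first, call guess_extension on them, and build the filename from the result.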
Blessed are the Cheese Makers?
brennydoogles
 
Posts: 16
Joined: Tue Jul 29, 2008 9:40 pm UTC

Re: XKCD Batch Downloading

Postby Mat » Wed Aug 06, 2008 4:04 pm UTC

Ok, I'm going to jump on the bandwagon and post my own script :)

It's a modification of the one Onion_Knight posted. It should work fine on Linux now and name JPEGs correctly. Unfortunately I don't know how to make it check for an existing filename with an arbitrary extension other than to repeatedly check for each possibility, so I left that bit out. Instead, you can make it start from any comic by giving the start number as a command line argument, e.g. python xkcd.py 42

Code: Select all
#Downloads all XKCD comics

#import Web, Reg. Exp, and Operating System libraries
import urllib, re, os
from sys import argv

#RegExp for the EndNum variable
CurrentRegExp = re.compile('<h3>Permanent link.*</h3>')

#Check the main XKCD page
site = urllib.urlopen("http://www.xkcd.com/")
contentLine = None

#For each line in the homepage's source...
for line in site.readlines():
    #Break when you find the variable information
    if CurrentRegExp.search(line):
        contentLine = line
        break

#First and Last comics user wishes to download
StartNum = 1
EndNum = 455

#IF the information was found successfuly automatically change EndNum
#ELSE set it to the latest comic as of this writing
if contentLine:
    contentLine = contentLine.split('/')
    EndNum = int(contentLine[3])

#Get StartNum from the command line argument
if len(argv) > 1:
    try:
        if int(argv[1])>EndNum:
            raise Exception
        StartNum = int(argv[1])
    except:
        print("The start number you entered is not a valid comic number.")

#Make sure destination folder exists
destinationFolder = "xkcd"
if not os.path.exists(destinationFolder):
    os.mkdir(destinationFolder)

#Full path of "Alt Text" file doesn't need to pre-exist.
#Info will be appended to the end of the file
textFile = open(os.path.join(destinationFolder,"AltText.txt"),'a')

#Regular Exp. used to find the comics in the webpage source
RegExp = re.compile('<img src="(http://imgs.xkcd.com/comics/[^"]+)" title="([^"]*)" alt="([^"]*)" />')

#XRange creates an iterator to go over the comics
for i in xrange(StartNum, EndNum+1):

    #Gets the Site of the i-th comic
    site = urllib.urlopen("http://www.xkcd.com/"+str(i)+"/")

    #For each line in the webpage's source...
    for line in site.readlines():
        #Break when you find the image information
        match = RegExp.search(line)
        if match:
            source = match.group(1)
            title = match.group(2)
            alt = match.group(3)
            srcType = source.split(".")[-1] #everything after the last dot

            #Printing User-Friendly Messages
            print "Comic %d Found. Downloading..." % i

            #Save image from XKCD to Destination Folder
            comicFilename = os.path.join(destinationFolder,str(i)+"."+srcType)
            urllib.urlretrieve(source, comicFilename)
           
            # Writes the title and alt to a text file
            textFile.write("Comic "+str(i) + ': Alt: "' + alt + '" Title: "' + title + '"\n')
            break

#Download the translated version of comic 191
comicFilename = os.path.join(destinationFolder,"191 Translated.png")
if not os.path.exists(comicFilename):      
    urllib.urlretrieve("http://imgs.xkcd.com/comics/lojban_translated.png", comicFilename)

#Graceful program termination
print str(EndNum-StartNum) + " Comics Downloaded"
textFile.close()
Mat
 
Posts: 406
Joined: Fri Apr 21, 2006 8:19 pm UTC
Location: London

Re: XKCD Batch Downloading

Postby colouragga » Fri Aug 08, 2008 6:58 am UTC

Hey all, kinda new here but I registered just to improve something on this script. Let's use os.path.expanduser("~") to find the home directory of the current user and create the XKCD folder there? Code would be like this, just a minor might-be improvement. This way it could be adapted to run as cronjob :):

Code: Select all
    #Downloads all XKCD comics

    #import Web, Reg. Exp, and Operating System libraries
    import urllib, re, os
    from sys import argv

    #RegExp for the EndNum variable
    CurrentRegExp = re.compile('<h3>Permanent link.*</h3>')

    #Check the main XKCD page
    site = urllib.urlopen("http://www.xkcd.com/")
    contentLine = None

    #For each line in the homepage's source...
    for line in site.readlines():
        #Break when you find the variable information
        if CurrentRegExp.search(line):
            contentLine = line
            break

    #First and Last comics user wishes to download
    StartNum = 1
    EndNum = 455

    #IF the information was found successfuly automatically change EndNum
    #ELSE set it to the latest comic as of this writing
    if contentLine:
        contentLine = contentLine.split('/')
        EndNum = int(contentLine[3])

    #Get StartNum from the command line argument
    if len(argv) > 1:
        try:
            if int(argv[1])>EndNum:
                raise Exception
            StartNum = int(argv[1])
        except:
            print("The start number you entered is not a valid comic number.")

    #Make sure destination folder exists
    destinationFolder = "xkcd"
    if not os.path.exists(os.path.expanduser("~/") + destinationFolder):
        os.mkdir(os.path.expanduser("~/") + destinationFolder)

    #Full path of "Alt Text" file doesn't need to pre-exist.
    #Info will be appended to the end of the file
    textFile = open(os.path.join(destinationFolder,"AltText.txt"),'a')

    #Regular Exp. used to find the comics in the webpage source
    RegExp = re.compile('<img src="(http://imgs.xkcd.com/comics/[^"]+)" title="([^"]*)" alt="([^"]*)" />')

    #XRange creates an iterator to go over the comics
    for i in xrange(StartNum, EndNum+1):

        #Gets the Site of the i-th comic
        site = urllib.urlopen("http://www.xkcd.com/"+str(i)+"/")

        #For each line in the webpage's source...
        for line in site.readlines():
            #Break when you find the image information
            match = RegExp.search(line)
            if match:
                source = match.group(1)
                title = match.group(2)
                alt = match.group(3)
                srcType = source.split(".")[-1] #everything after the last dot

                #Printing User-Friendly Messages
                print "Comic %d Found. Downloading..." % i

                #Save image from XKCD to Destination Folder
                comicFilename = os.path.join(destinationFolder,str(i)+"."+srcType)
                urllib.urlretrieve(source, comicFilename)
               
                # Writes the title and alt to a text file
                textFile.write("Comic "+str(i) + ': Alt: "' + alt + '" Title: "' + title + '"\n')
                break

    #Download the translated version of comic 191
    comicFilename = os.path.join(destinationFolder,"191 Translated.png")
    if not os.path.exists(comicFilename):     
        urllib.urlretrieve("http://imgs.xkcd.com/comics/lojban_translated.png", comicFilename)

    #Graceful program termination
    print str(EndNum-StartNum) + " Comics Downloaded"
    textFile.close()


Now who is going to start a Google Code project and SVN for this :p
colouragga
 
Posts: 1
Joined: Fri Aug 08, 2008 6:53 am UTC

Re: XKCD Batch Downloading

Postby brennydoogles » Fri Aug 08, 2008 9:57 pm UTC

Ok, so I tried Mat's version of the script, and it seemed to work pretty well. There was one issue that I noticed though. Looking at the code (as a n00b Java programmer who knows nothing about Python) I see this structure:
Code: Select all
#Check the main XKCD page
    site = urllib.urlopen("http://www.xkcd.com/")
    contentLine = None

    #For each line in the homepage's source...
    for line in site.readlines():
        #Break when you find the variable information
        if CurrentRegExp.search(line):
            contentLine = line
            break

    #First and Last comics user wishes to download
    StartNum = 1
    EndNum = 455

    #IF the information was found successfuly automatically change EndNum
    #ELSE set it to the latest comic as of this writing
    if contentLine:
        contentLine = contentLine.split('/')
        EndNum = int(contentLine[3])


Which as far as I can tell is supposed to scan through the url of the most recent comic, and assign the value of the comic number to the Endnum Variable. That does not seem to be working correctly. Whenever you run the script it automatically stops downloading at the comic with the specified index. I had noticed on a previous version of the script that the Endnum variable declaration was commented out, but when I tried that I got errors about an undefined variable. How can I fix this?

The other issue that I noticed is that the script does not automatically detect which comics I already have downloaded, and every time re-downloads each comic (overwriting the copy I have on my computer). While that is not a terribly bad problem it does seem to be a bit inefficient. What about a control structure to scan the destination folder for .jpg and .png images, and take the highest number before the extension+1 and set it as the Startnum variable? If there are no comics downloaded then you could default Startnum to 1, and otherwise it would be set to the correct value.
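
The folder scan described above could look roughly like this. It is a Python 3 sketch, not part of Mat's script: next_start_number and COMIC_FILE are hypothetical names, and it assumes comics are saved as "<number>.jpg" or "<number>.png" as in the scripts in this thread:

```python
import re

# Matches names like "42.png" or "7.JPG", but not "191 Translated.png" or "AltText.txt".
COMIC_FILE = re.compile(r"^(\d+)\.(?:jpg|png)$", re.IGNORECASE)


def next_start_number(filenames):
    """Return the highest downloaded comic number plus one, or 1 if there are none."""
    numbers = []
    for name in filenames:
        match = COMIC_FILE.match(name)
        if match:
            numbers.append(int(match.group(1)))
    return max(numbers) + 1 if numbers else 1
```

In the script this would be used as StartNum = next_start_number(os.listdir(destinationFolder)). One trade-off: it only resumes after the highest number, so a gap left by an interrupted run is not refilled.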
Blessed are the Cheese Makers?
brennydoogles
 
Posts: 16
Joined: Tue Jul 29, 2008 9:40 pm UTC

Re: XKCD Batch Downloading

Postby Mat » Sat Aug 09, 2008 1:48 am UTC

brennydoogles wrote:Which as far as I can tell is supposed to scan through the url of the most recent comic, and assign the value of the comic number to the Endnum Variable. That does not seem to be working correctly. Whenever you run the script it automatically stops downloading at the comic with the specified index. I had noticed on a previous version of the script that the Endnum variable declaration was commented out, but when I tried that I got errors about an undefined variable. How can I fix this?

You mean it stops at 455? That's strange, it worked fine when I tried it. I'll have a look at it again tomorrow.
*Edit* I ran the script again and all the comics get downloaded fine for me. Maybe the indenting got screwed up somewhere? The important part is this bit:
Code: Select all
#For each line in the homepage's source...
    for line in site.readlines():
        #Break when you find the variable information
        if CurrentRegExp.search(line):
           contentLine = line
           break

if the break statement is in the wrong place it won't finish searching and will use the old value of EndNum. Other than that I dunno what to suggest :?

The other issue that I noticed is that the script does not automatically detect which comics I already have downloaded, and every time re-downloads each comic (overwriting the copy I have on my computer). While that is not a terribly bad problem it does seem to be a bit inefficient. What about a control structure to scan the destination folder for .jpg and .png images, and take the highest number before the extension+1 and set it as the Startnum variable? If there are no comics downloaded then you could default Startnum to 1, and otherwise it would be set to the correct value.

Yeah ok, that makes a lot more sense.
Just add this after the "for i in xrange(StartNum, EndNum+1):" bit:
Code: Select all
#Skip comics which have already been downloaded
    jpgPath = os.path.join(destinationFolder,str(i)+".jpg")
    pngPath = os.path.join(destinationFolder,str(i)+".png")
    if os.path.exists(jpgPath) or os.path.exists(pngPath):
        print("Comic %d already exists, skipping download..." % i)
        continue


colouragga wrote:Hey all, kinda new here but I registered just to improve something on this script. Let's use os.path.expanduser("~") to find the home directory of the current user and create the XKCD folder

Ok, but I think your code is missing the ~ part in the path when it actually downloads the comics? I've just included it in the destinationFolder variable in mine:
Code: Select all
destinationFolder = os.path.join("~","xkcd")
destinationFolder = os.path.expanduser(destinationFolder)
Last edited by Mat on Sat Aug 09, 2008 2:47 pm UTC, edited 2 times in total.
Mat
 
Posts: 406
Joined: Fri Apr 21, 2006 8:19 pm UTC
Location: London

Re: XKCD Batch Downloading

Postby zahlman » Sat Aug 09, 2008 3:46 am UTC

Berengal wrote:Actually, don't use tabs at all, because they can and will be displayed differently for everyone.


In contexts that don't involve forum posting, that's exactly why you should use tabs. I.e. so that people can set the tab stop as they like, and have everything magically line up the way they like indented things to line up.

# This is valid (and displays properly in the text editing box)


In your font, sure. And coincidentally in mine, too.

# This is invalid (but displays correctly in the code box):


The problem is that the forum actually converts tabs to spaces for posting. It doesn't help that web browsers don't let you type tabs into text fields (instead expecting that you mean to move to the next UI component).

but spaces are more consistent anyway,


I don't know what you think "consistent" means here.

and several editors will insert spaces instead of tabs when you press tab (most python editors, or editors in python mode default to this)


Yes; and it causes me no end of trouble. Especially because IDLE's settings for the "tab distance" seem to be completely ignored, but also because I like to edit my source with Vim, and have it insert tabs, also.

Now if only I could figure out how to make Vim not use tabs for spacing *beyond* the level of indentation...

This almost belongs in Religious wars. But only almost.


Sorry for dragging it further in that direction :(
Belial wrote:I once had a series of undocumented and nonstandardized subjective experiences that indicated that anecdotal data is biased and unreliable.
zahlman
 
Posts: 638
Joined: Wed Jan 30, 2008 5:15 pm UTC

Re: XKCD Batch Downloading

Postby Berengal » Sat Aug 09, 2008 9:09 am UTC

Google for "you should use tabs in python": 249 000
Google for "you should use spaces in python": 739 000

One of the top results: http://www.python.org/dev/peps/pep-0008/
Excerpt:
For new projects, spaces-only are strongly recommended over tabs. Most editors have features that make this easy to do.

This style guide mostly applies to the standard library, but since when hasn't that been the ultimate authority on language style?
It is practically impossible to teach good programming to students who are motivated by money: As potential programmers they are mentally mutilated beyond hope of regeneration.
Berengal
Superabacus Mystic of the First Rank
 
Posts: 2707
Joined: Thu May 24, 2007 5:51 am UTC
Location: Bergen, Norway

Re: XKCD Batch Downloading

Postby EvanED » Sat Aug 09, 2008 9:24 am UTC

Not the thread for tabs vs spaces
EvanED
 
Posts: 4150
Joined: Mon Aug 07, 2006 6:28 am UTC
Location: Madison, WI

Re: XKCD Batch Downloading

Postby Onion_Knight » Sat Aug 09, 2008 11:08 pm UTC

As for skipping already downloaded comics, I use this right after the "main" xrange loop:
Code: Select all
#Note, this is using Windows file system format, modify for Lunix
if os.path.exists(destinationFolder+"\\"+str(i)+".png"):
        print "Skipping Comic "+str(i)+"..."
        continue


Some have also realized that I commented out the line after StartNum, where I assign a value to EndNum. I left the comment there in case anyone wished to make this script a function (e.g. calling DownloadXKCD(3,100), where StartNum and EndNum are passed in). It appears that the commented-out EndNum value has confused some people, and I'm sorry about that.

Anyone who is still downloading the Store Banner by accident, it's because the script checks for any image that has a src, title, AND an alt. The store banner didn't use to have all 3, but now it does. It's not a big problem, as fancier Reg Exps will take care of it. Just goes to show how fragile source-dependent software can be. ;)
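
As a rough illustration of what a "fancier" expression might look like, the sketch below only accepts images whose src sits under imgs.xkcd.com/comics/. That path layout is an assumption based on the comic URLs seen in this thread, and both sample lines are made up:

```python
import re

# Only accept images whose src is under imgs.xkcd.com/comics/ (assumed layout).
COMIC_RE = re.compile(
    r'<img src="(http://imgs\.xkcd\.com/comics/[^"]+)" title="([^"]*)" alt="([^"]*)"'
)

# Made-up lines for illustration: a store banner and a comic.
banner = '<img src="http://imgs.xkcd.com/store/banner.png" title="Store" alt="Store" />'
comic = '<img src="http://imgs.xkcd.com/comics/example.png" title="Some title text" alt="Example" />'

print(COMIC_RE.search(banner))          # None: the src is not under /comics/
print(COMIC_RE.search(comic).groups())  # (src, title, alt) of the comic image
```

It is still source-dependent, of course; it just depends on a narrower part of the source.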

I'm glad so many people are enjoying the little chunk of code I threw together. As usual, feel free to modify and share as much as you wish! :D
-Onion Knight
Onion_Knight
 
Posts: 22
Joined: Tue Jul 22, 2008 9:13 pm UTC

Re: XKCD Batch Downloading

Postby Mat » Sun Aug 10, 2008 1:04 pm UTC

Onion_Knight wrote:As for skipping already downloaded comics, I use this right after the "main" xrange loop:
Code: Select all
#Note, this is using Windows file system format, modify for Lunix
if os.path.exists(destinationFolder+"\\"+str(i)+".png"):
        print "Skipping Comic "+str(i)+"..."
        continue


That just checks for png files though. You need to look for jpegs as well. Also I added a lot of "os.path.join()" to the script instead of forming paths with the "+". This takes care of whether there needs to be a "/" or a "\" for you, so the script should be cross platform.

Some have also realized that I commented out the line after StartNum, where I assign a value to EndNum. I left the comment there in case anyone wished to make this script a function (e.g. calling DownloadXKCD(3,100), where StartNum and EndNum are passed in). It appears that the commented-out EndNum value has confused some people, and I'm sorry about that.

Nah, I think it was just my meddling that confused people. I uncommented that line and removed the "else" for no real reason other than I thought it would look nicer like that. :oops:

Thanks for sharing by the way :) It's nice to know that if my internet connection dies I can still amuse myself by reading xkcd all day.
Mat
 
Posts: 406
Joined: Fri Apr 21, 2006 8:19 pm UTC
Location: London

Re: XKCD Batch Downloading

Postby Onion_Knight » Fri Aug 15, 2008 2:28 am UTC

Just for anyone who is interested, I managed to write scripts for the following comics.

Gone With The Blastwave (http://www.blastwavecomic.com/):
Spoiler:
Code: Select all
#Downloads all Blastwave comics

#import Web, Reg. Exp, and Operating System libraries
import urllib, re, os

#RegExp for the EndNum variable
RegExp = re.compile('.*<img src="\..*"  alt=".*" >.*')


site = urllib.urlopen("http://www.blastwavecomic.com/")
contentLine = None

#For each line in the homepage's source...
for line in site.readlines():
    #Break when you find the variable information
    if RegExp.search(line):
        contentLine = line
        break

#IF the information was found successfully, automatically change EndNum
#ELSE set it to the latest comic as of this writing
if contentLine:
    contentLine = contentLine.split('/')
    contentLine = contentLine[2].split('.')
    EndNum = int(contentLine[0])
else:
    EndNum = 39

#First and Last comics user wishes to download
StartNum = 1
#EndNum = 39

#Full path of destination folder needs to pre-exist
destinationFolder = r"C:\Thanos Files\Comics\Blastwave"

#XRange creates an iterator to go over the comics
for i in xrange(StartNum, EndNum+1):

    #IF you already have the comic, skip downloading it
    if os.path.exists(destinationFolder+"\\"+str(i)+".jpg"):
        print "Skipping Comic "+str(i)+"..."
        continue
   
    #Printing User-Friendly Messages
    print "Comic %d Found. Downloading..." % i

    #Save image from Blastwave to Destination Folder as a JPG
    if i <= 9:
        urllib.urlretrieve("http://www.blastwavecomic.com/comics/""2006050"+str(i)+".jpg", os.path.join(destinationFolder, str(i)+".jpg"))
    elif i == 10:
        urllib.urlretrieve("http://www.blastwavecomic.com/comics/""200605"+str(i)+".jpg", os.path.join(destinationFolder, str(i)+".jpg"))
    else:
        urllib.urlretrieve("http://www.blastwavecomic.com/comics/"+str(i)+".jpg", os.path.join(destinationFolder, str(i)+".jpg"))

#Manually download the special cases
if not os.path.exists(destinationFolder+"\\Fake.jpg"):
    print "Comic Fake Found. Downloading..."
    urllib.urlretrieve("http://www.blastwavecomic.com/comics/fake.jpg", os.path.join(destinationFolder, "Fake.jpg"))
if not os.path.exists(destinationFolder+"\\Taking A Break.jpg"):
    print "Comic Break Found. Downloading..."
    urllib.urlretrieve("http://www.blastwavecomic.com/comics/ABREak.jpg", os.path.join(destinationFolder, "Taking A Break.jpg"))


#Graceful program termination
print str(EndNum-StartNum+1) + " Comics Downloaded"


Questionable Content (http://questionablecontent.net/):
Spoiler:
Code: Select all
#import Web, Reg. Exp, and Operating System libraries
import urllib, re, os

#RegExp for the EndNum variable
RegExp = re.compile('.*<img src="http://www.questionablecontent.net/comics.*')

#Check the main QC page
site = urllib.urlopen("http://questionablecontent.net/")
contentLine = None

#For each line in the homepage's source...
for line in site.readlines():
    #Break when you find the variable information
    if RegExp.search(line):
        contentLine = line
        break

#IF the information was found successfully, automatically change EndNum
#ELSE set it to the latest comic as of this writing
if contentLine:
    contentLine = contentLine.split('/')
    contentLine = contentLine[4].split('.')
    EndNum = int(contentLine[0])
else:
    EndNum = 1199

#First and Last comics user wishes to download
StartNum = 1
#EndNum = 1199

#Full path of destination folder needs to pre-exist
destinationFolder = r"C:\Thanos Files\Comics\QC"

#XRange creates an iterator to go over the comics
for i in xrange(StartNum, EndNum+1):

    #IF you already have the comic, skip downloading it
    if os.path.exists(destinationFolder+"\\"+str(i)+".png"):
        print "Skipping Comic "+str(i)+"..."
        continue

    #Printing User-Friendly Messages
    print "Comic %d Found. Downloading..." % i

    source = "http://www.questionablecontent.net/comics/"+str(i)+".png"

    #Save image from QC to Destination Folder as a PNG (As most comics are PNGs)
    urllib.urlretrieve(source, os.path.join(destinationFolder, str(i)+".png"))

#Graceful program termination
print str(EndNum-StartNum+1) + " Comics Downloaded"


And the XKCD script, that was the basis of this thread to begin with.
Spoiler:
Code: Select all
#import Web, Reg. Exp, and Operating System libraries
import urllib, re, os

#RegExp for the EndNum variable
CurrentRegExp = re.compile('<h3>Permanent link.*</h3>')

#Check the main XKCD page
site = urllib.urlopen("http://www.xkcd.com/")
#Clear content line, just in case
contentLine = None

#For each line in the homepage's source...
for line in site.readlines():
    #Break when you find the variable information
    if CurrentRegExp.search(line):
        contentLine = line
        break

#IF the information was found successfully, automatically change EndNum
#ELSE set it to the latest comic as of this writing
if contentLine:
    contentLine = contentLine.split('/')
    #Index 3 has the comic number
    EndNum = int(contentLine[3])
else:
    EndNum = 455

#First and Last comics user wishes to download.
#Uncomment EndNum if you want to use this as a function instead of a script
StartNum = 1
#EndNum = 192

#Full path of destination folder needs to pre-exist
destinationFolder = r"C:\Thanos Files\Comics\XKCD"

#Full path of "Alt Text" file doesn't need to pre-exist.
#Info will be appended to the end of the file
textFile = open(os.path.join(destinationFolder, "AltText.txt"),'a')

#Regular Exp. used to find the comics in the webpage source
RegExp = re.compile('.*<img src=".*" title=".*" alt=".*" />.*')

#Reg. Exp used to fix images surrounded by Anchor Tags
LinkRegExp = re.compile(".*<img src=.*")

#Reg. Exp used to bypass downloading the store's image
StoreRegExp = re.compile(".*http://store.xkcd.com/.*")

#XRange creates an iterator to go over the comics
for i in xrange(StartNum, EndNum+1):

    #IF you already have the comic, skip downloading it
    if os.path.exists(destinationFolder+"\\"+str(i)+".png"):
        print "Skipping Comic "+str(i)+"..."
        continue

    #Gets the Site of the i-th comic
    site = urllib.urlopen("http://www.xkcd.com/"+str(i)+"/")
    #Clear content line, just in case
    contentLine = None

    #For each line in the webpage's source...
    for line in site.readlines():
        #Break when you find the image information AND it's not the store's
        if RegExp.search(line) and not StoreRegExp.search(line):
            contentLine = line
            break

    #Skips the non-existent comic #404
    if not contentLine:
        continue

    #Create the Array with the information in it
    info = contentLine.split('"')

    #IF there is a match, image is not embedded in Anchors
    if LinkRegExp.search(info[0]):
        #Gets the url for the image
        source = info[1]
        #The title-text (commonly known on these fora as the alt-text)
        title = info[3]
        #The alt-text
        alt = info[5]
    #ELSE Adjust the indexes to compensate for the Anchor Tags
    else:
        #Gets the url for the image
        source = info[3]
        #The title-text (commonly known on these fora as the alt-text)
        title = info[5]
        #The alt-text
        alt = info[7]
        if i == 191:
            #Manually download the image 191 links to
            urllib.urlretrieve("http://imgs.xkcd.com/comics/lojban_translated.png", os.path.join(destinationFolder, str(i)+" Translated.png"))

    #Printing User-Friendly Messages
    print "Comic %d Found. Downloading..." % i

    #Save image from XKCD to Destination Folder as a PNG (As most comics are PNGs)
    urllib.urlretrieve(source, os.path.join(destinationFolder, str(i)+".png"))
    # Writes the title and alt to a text file
    textFile.write("Comic "+str(i) + ': Alt: "' + alt + '" Title: "' + title + '"\n')

#Graceful program termination
print str(EndNum-StartNum+1) + " Comics Downloaded"
textFile.close()


All these are made for Windows, so feel free to Unixify/Macify them. Others are welcome to share their own creations if they like.

Hope you all enjoy,
-Onion Knight
Onion_Knight
 
Posts: 22
Joined: Tue Jul 22, 2008 9:13 pm UTC

Re: XKCD Batch Downloading

Postby abrenecki » Sat Aug 23, 2008 11:53 am UTC

Here's my version (spoiler'd for readability):
Spoiler:
Code: Select all
#!/usr/bin/python
#(c) Adam brenecki 2008
#Licensed under the CC BY-NC license
#http://creativecommons.org/licenses/by-nc/3.0/
import urllib, os, re

homepage = urllib.urlopen("http://www.xkcd.com/").read()
latest = int(re.compile(r'http://xkcd.com/([0-9]+)/').search(homepage).group(1))
f = open("titles.txt",'a')

dirlist = os.listdir("./") #Get the directory listing
dirre = re.compile(r'^([0-9]+).*') #Because file names aren't just the number here
first = 0
for item in dirlist: #Determine what's been downloaded
   match = dirre.search(item)
   if match:
      num = int(match.group(1))
      if num > first:
         first = num
first += 1 #Get the number of the first one that HASN'T
if first > latest:
   print "All comics already here."
   exit()

print "Downloading comics",first,"to",latest
comicre = re.compile(r'<img src="(http://imgs.xkcd.com/comics/[^"]+)" title="([^"]*)" alt="([^"]*)" />')
extre = re.compile(r'([^\/]+)\.(jpg|png)$')

for number in range(first,latest+1):
   print "Downloading",number
   match = comicre.search(urllib.urlopen("http://www.xkcd.com/"+str(number)+"/").read()) #download the html
   if match:
      url = match.group(1)
      title = match.group(2)
      alt = match.group(3)
      extm = extre.search(url)
      webfilename = extm.group(1) #Filename
      ext = extm.group(2) #Extension
      filename = "%(#)03d"%{"#":number}+"_"+webfilename+"."+ext
      #DIRTY HACK AHEAD
      os.system("curl -s -o \""+filename+"\" \""+url+"\"") #urlretrieve doesn't work?
      f.write("%(#)03d"%{"#":number}+": "+alt+"\n    "+title+"\n")
      print "Successful."
   else:
      print "Failed."
      f.write("FAILED ON COMIC %(#)03d \n"%{"#":number})
f.close()

Advantages:
  • Puts filename of original file in download, prepending with number so they stay in order
  • Keeps a titles.txt
  • Checks for latest comic and checks to see what's been downloaded before starting
Disadvantages:
  • Doesn't work on Windows - Made on a Mac, should work on Linux boxes with curl installed (lines 8 and 38 are the platform specific ones afaik)
Notes:
  • Why curl? urllib.urlretrieve didn't work for me. (Edit: Does it need an absolute path? I'll play with it later.)
  • I'm not fond of looping over lines of a small file (although xkcdmailer (see my sig) does this). Regex engines can handle an entire html file at once, whether or not it's good practise.
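
The thread does not say why urlretrieve failed here, so the following is only a sketch of what a curl-free download step might look like, written for Python 3 (where the function lives in urllib.request). `save_comic` is a hypothetical helper; it keeps the same naming scheme of a zero-padded number plus the original file name:

```python
import os
from urllib.request import urlretrieve  # Python 3 location of urlretrieve

def save_comic(url, number, folder="."):
    """Download url into folder as NNN_originalname.ext and return the path."""
    # The last path component of the URL is the original file name.
    webfilename = url.rsplit("/", 1)[-1]
    filename = "%03d_%s" % (number, webfilename)
    target = os.path.join(folder, filename)
    urlretrieve(url, target)
    return target

# Usage (hypothetical): save_comic(url, number) inside the download loop
```

Avoiding os.system also sidesteps shell quoting problems if a file name ever contains a quote character.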
abrenecki
 
Posts: 62
Joined: Mon Sep 10, 2007 8:39 am UTC
Location: Mid North SA, Australia

Re: XKCD Batch Downloading

Postby Onion_Knight » Sun Aug 24, 2008 10:25 pm UTC

abrenecki wrote:Here's my version (spoiler'd for readability):


I like it. Changing the filenames from just numbers to numbers and titles was rather classy.
-Onion Knight
Onion_Knight
 
Posts: 22
Joined: Tue Jul 22, 2008 9:13 pm UTC

Re: XKCD Batch Downloading

Postby RoadieRich » Thu Aug 28, 2008 12:24 pm UTC

Code: Select all

#assuming page has been downloaded as a string into pageContent

import HTMLParser

#Results are collected in these two lists
listOfComics = []
listOfAltTexts = []

def convertToDict(listOfTuples):
    """convert a list of (key,value) tuples into a dict, assigning the value None when none exists"""
    d = {}
    for t in listOfTuples:
        try:
            d[t[0]] = t[1]
        except IndexError:
            d[t[0]] = None
    return d

class xkcdParser(HTMLParser.HTMLParser):
    def handle_starttag(self, tag, attrs):
        if tag != 'img':
            return
        attrs = convertToDict(attrs)
        #use .get() so img tags without a src or title don't raise KeyError
        if '/comics/' in (attrs.get('src') or ''):
            listOfComics.append(attrs['src'])
            listOfAltTexts.append(attrs.get('title'))

parser = xkcdParser()
parser.feed(pageContent)


It may not be as compact as the re method, but when you've got something ready made, why not use it?
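
For anyone who wants to try the parser approach without fetching a page first, here is a self-contained sketch in Python 3 syntax (where the module is html.parser). The class name `ComicImageParser` and the sample page content are made up for illustration:

```python
from html.parser import HTMLParser  # Python 3 name of the HTMLParser module

class ComicImageParser(HTMLParser):
    """Collect (src, title) for every <img> whose src contains '/comics/'."""

    def __init__(self):
        super().__init__()
        self.comics = []

    def handle_starttag(self, tag, attrs):
        if tag != "img":
            return
        attrs = dict(attrs)  # attrs arrives as a list of (name, value) pairs
        src = attrs.get("src") or ""
        if "/comics/" in src:
            self.comics.append((src, attrs.get("title")))

# Made-up page content for illustration.
page_content = (
    '<a href="/"><img src="http://imgs.xkcd.com/static/logo.png" alt="logo" /></a>'
    '<img src="http://imgs.xkcd.com/comics/example.png" title="Title text" alt="Example" />'
)

parser = ComicImageParser()
parser.feed(page_content)
print(parser.comics)
```

Keeping the results on the parser instance rather than in module-level lists makes it easier to parse several pages in a row.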
roband wrote:Mav is a cow.

UniJam 2012: Inter-university Games Jam hosted by Nottingham Trent University DevSoc.
nlug: Nottingham Linux User Group
DevSoc: The Nottingham Trent University Software Development Society
RoadieRich
The Black Hand
 
Posts: 1030
Joined: Tue Feb 12, 2008 11:40 am UTC
Location: Somewhere only we know
