Visit the application – http://comics.randallcrock.net
Download the code
Download script written in Perl using the LWP::Simple module; MySQL database for storage; PHP front-end for display
My Comics Archive tool grew out of my love for online comics and my desire to have them available for offline viewing. The system is a general framework for downloading the entire archive of a comic and then viewing it from a local web server. The first iteration was written in C# .NET, but I ported it to Perl while teaching myself the language. It is now available to anyone, and I am expanding the number of comics I mirror. If you have any comments, recommendations, or requests, just email them to me. Currently, public access to the archive is not available due to concerns about infringing on the authors' copyrights. The code is available at the link above, with instructions on how to set up and run your own version for personal use.
The Webcomics Archive began life as a triplet of Perl scripts for downloading a comic archive and has evolved into something more complex, built on a MySQL database with an enhanced PHP frontend. The original scripts handled downloading, displaying a single comic's images, and displaying all the comics simultaneously. That first design simply downloaded the images into a directory named after the comic, then did a file lookup to see which files to serve. From a server standpoint this is expensive: directory listings over thousands of files are very disk intensive and unnecessary for something like this. That cost, together with recurring troubles with duplicate file names and with numbering comics that did not follow a pre-numbered scheme, is why the backend was switched to database-driven lookups.

With the data in a database, queries are much faster and easier to process, and the numbering problem disappears entirely: files keep their original names and can be looked up by name or by number. The change has also cleaned up the file structure, since each comic used to carry its own log file and configuration file, both of which are now part of the database framework. The actual image files are still stored in named directories, but all of that information is tracked in the database, so when a request comes in the script simply connects the dots, so to speak, to find each image's location.
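The lookup described above can be sketched roughly as follows. This is a minimal illustration only: the real system uses MySQL with Perl and PHP, and the table name, column names, and sample rows here are all assumptions, shown in Python with SQLite as a stand-in.

```python
import sqlite3

# Hypothetical, simplified schema -- the real backend is MySQL and the
# actual table/column names are assumptions for illustration.
db = sqlite3.connect(":memory:")
db.execute("""
    CREATE TABLE comics (
        comic    TEXT,    -- comic name, doubling as the image directory
        num      INTEGER, -- sequential index assigned at download time
        filename TEXT     -- original file name, kept unchanged
    )
""")
db.executemany(
    "INSERT INTO comics (comic, num, filename) VALUES (?, ?, ?)",
    [("somecomic", 1, "first_strip.png"),
     ("somecomic", 2, "strip_(2).png")],   # odd names are fine: no renaming needed
)

def image_path(comic, num):
    """Connect the dots: look up a strip by comic and number,
    then join the comic's directory with the original file name."""
    row = db.execute(
        "SELECT comic, filename FROM comics WHERE comic = ? AND num = ?",
        (comic, num),
    ).fetchone()
    return f"{row[0]}/{row[1]}"
```

Because the file name is stored alongside a sequential index, a strip can be found either by its original name or by its position, which is what removes the old numbering headaches.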
The downloading system starts from a given page, searches for the link to the next page, and saves the comics from each page as it goes. The links and comic images are found using regular expressions, with the patterns predefined in the configuration file. Currently it downloads only the HTML and comic images, nothing else, but I plan to extend it to also save the title text (what you see when you mouse over an image) and any annotations associated with the page. The title text is relatively simple since, at least in theory, I capture it anyway when grabbing the image's source location. Saving annotations, on the other hand, will be much harder to generalize, since each comic may handle its annotations in a huge variety of ways.
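The follow-the-next-link loop can be sketched as below. The actual script is Perl using LWP::Simple over real HTTP; here the fetch is replaced by canned HTML pages, and the two regex patterns stand in for the ones read from the per-comic configuration file. All URLs and patterns are made up for illustration.

```python
import re

# Canned stand-ins for fetched HTML pages (the real script fetches over HTTP).
PAGES = {
    "/1/": '<img src="/comics/one.png"> <a href="/2/" rel="next">Next</a>',
    "/2/": '<img src="/comics/two.png"> <a href="/3/" rel="next">Next</a>',
    "/3/": '<img src="/comics/three.png">',  # last page: no next link
}

# Patterns of the kind a config file would supply for one comic (assumed forms).
IMAGE_RE = re.compile(r'<img src="(/comics/[^"]+)"')
NEXT_RE = re.compile(r'<a href="([^"]+)" rel="next"')

def crawl(start):
    """Start from a given page, save the comic image from each page,
    and follow the next-page link until none is found."""
    saved, url = [], start
    while url:
        html = PAGES[url]             # real code: fetch the page over HTTP
        m = IMAGE_RE.search(html)
        if m:
            saved.append(m.group(1))  # real code: download and store the image
        nxt = NEXT_RE.search(html)
        url = nxt.group(1) if nxt else None
    return saved
```

The loop terminates naturally on the newest page, since that page has no next-page link for the pattern to match.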
A quick description of how the configuration file is formed.
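The actual file format isn't reproduced here, but conceptually it holds the per-comic settings the downloader needs: where to start and which patterns to match. The key names below are illustrative assumptions, not the real keys.

```ini
; Hypothetical per-comic configuration -- key names and values are assumptions.
name      = SomeComic                       ; also the directory the images go into
start_url = http://example.com/comic/1/     ; first page of the archive
image_re  = <img src="(/comics/[^"]+)"      ; pattern matching the comic image
next_re   = <a href="([^"]+)" rel="next"    ; pattern matching the next-page link
```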
The comics view is fairly straightforward and is based on the names of the files. There is also a feature that lets users save their location in the archives and later return to that point, though it requires cookies.
A screenshot of the “Latest” page of comics
There is also a view showing the latest strips from all of the comics available; it simply pulls the highest-numbered comic from each folder and assembles them into one webpage.
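With the database backend, "highest-numbered from each folder" reduces to a per-comic maximum. A minimal sketch, again with SQLite standing in for MySQL and an assumed schema:

```python
import sqlite3

# Assumed simplified schema; the real backend is MySQL behind a PHP page.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE comics (comic TEXT, num INTEGER, filename TEXT)")
db.executemany("INSERT INTO comics VALUES (?, ?, ?)", [
    ("alpha", 1, "a1.png"), ("alpha", 2, "a2.png"),
    ("beta",  1, "b1.png"),
])

def latest():
    """One row per comic: the file name of its newest (highest-numbered) strip."""
    return db.execute("""
        SELECT comic, filename
        FROM comics
        WHERE (comic, num) IN (SELECT comic, MAX(num) FROM comics GROUP BY comic)
        ORDER BY comic
    """).fetchall()
```

The "Latest" page would then just loop over these rows and emit one image per comic.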