Archiving URLs

From the command line

Once installed, you can start using storytracker’s command-line tools immediately, like storytracker.archive().

$ storytracker-archive http://www.latimes.com

That should pour out a scary looking stream of data to your console. That is the content of the page you requested compressed using gzip. If you’d prefer to see the raw HTML, add the --do-not-compress option.

$ storytracker-archive http://www.latimes.com --do-not-compress

You could save that yourself using a standard UNIX pipeline.

$ storytracker-archive http://www.latimes.com --do-not-compress > archive.html

But why do that when storytracker.create_archive_filename() will work behind the scenes to automatically come up with a tidy name that includes both the URL and a timestamp?

$ storytracker-archive http://www.latimes.com --do-not-compress --output-dir="./"

Run that and you’ll see the file right away in your current directory.

# Try opening the file you spot here with your browser
$ ls | grep .html

Using Python

UNIX-like systems typically come equipped with a built in method for scheduling tasks known as cron. To utilize it with storytracker, one approach is to write a Python script that retrieves a series of sites each time it is run.

import storytracker

SITE_LIST = [
    # A list of the sites to archive
    'http://www.latimes.com',
    'http://www.nytimes.com',
    'http://www.kansascity.com',
    'http://www.knoxnews.com',
    'http://www.indiatimes.com',
]
# The place on the filesystem where you want to save the files
OUTPUT_DIR = "/path/to/my/directory/"

# Runs when the script is called with the python interpreter
# ala "$ python cron.py"
if __name__ == "__main__":
    # Loop through the site list
    for s in SITE_LIST:
        # Spit out what you're doing
        print "Archiving %s" % s
        try:
            # Attempt to archive each site at the output directory
            # defined above
            storytracker.archive(s, output_dir=OUTPUT_DIR)
        except Exception as e:
            # And just move along and keep rolling if it fails.
            print e

Scheduling with cron

Then edit the cron file from the command line.

$ crontab -e

And use cron’s custom expressions to schedule the job however you’d like. This example would schedule the script to run a file like the one above at the top of every hour. Though it assumes that storytracker is available to your global Python installation at /usr/bin/python. If you are using a virtualenv or different Python configuration, you should begin the line with a path leading to that particular python executable.

0 * * * *  /usr/bin/python /path/to/my/script/cron.py