
transmogrify.webcrawler
=======================

A transmogrifier blueprint source which crawls a site starting from a given URL,
reading in every linked page until all have been crawled.

Options
-------

site_url
  URL to start crawling from. This URL is treated as the base, and any links
  outside this base are ignored.

ignore
   Regular expressions for URLs not to follow.

patterns
   Regular expressions to substitute before the HTML is parsed, newline separated.

subs
   Text to replace the matches with.

checkext
   checkext
   
verbose
   verbose
   
maxsize
   Do not crawl anything larger than this size.
      
nonames
   nonames
   
cache
   cache
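
As a rough illustration of how ``site_url`` and ``ignore`` interact, the sketch
below filters candidate links. It is not the blueprint's actual code: the
``should_follow`` helper, the plain prefix comparison, and the use of
``re.search`` are all assumptions made for this example.

```python
import re


def should_follow(url, site_url, ignore_patterns):
    """Return True if url lies under site_url and matches no ignore pattern.

    Hypothetical helper: the real crawler may normalise or match URLs
    differently.
    """
    # Links outside the base URL are ignored.
    if not url.startswith(site_url):
        return False
    # Any matching "ignore" regular expression excludes the link.
    return not any(re.search(p, url) for p in ignore_patterns)


base = "http://example.com/docs/"
ignore = [r"\.pdf$", r"/private/"]
print(should_follow("http://example.com/docs/index.html", base, ignore))
print(should_follow("http://example.com/docs/private/a.html", base, ignore))
print(should_follow("http://example.com/other/index.html", base, ignore))
```

Only the first URL is followed here: the second matches an ignore pattern and
the third lies outside the base.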

Keys inserted
-------------

The following options set the names of the keys added to each pipeline item.

pathkey
  default: _path. The path of the URL, not including the base.

siteurlkey
  default: _site_url. The base of the URL.

originkey
  default: _origin. The original path, in case retrieving the URL caused a redirection.

contentkey
  default: _content. The main content of the URL.

contentinfokey
  default: _content_info. Headers returned by urlopen.

sortorderkey
  default: _sortorder. A count of when a link to this item was first encountered while crawling.

backlinkskey
  default: _backlinks. A list of (url, path) tuples for the pages that linked to this item.
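
To show how a later pipeline section might consume these keys, here is a small
sketch using the default key names. The ``html_items`` generator and the sample
items are hypothetical, and treating ``_content_info`` as a dict with a
``content-type`` entry is based only on the test output in this document.

```python
def html_items(previous):
    """Yield only HTML items, ordered by when they were first linked.

    Hypothetical downstream step; assumes the default key names.
    """
    html = [
        item for item in previous
        if item.get("_content_info", {}).get("content-type") == "text/html"
    ]
    # Sorting buffers every item, which a streaming pipeline may not want.
    for item in sorted(html, key=lambda i: i["_sortorder"]):
        yield item


items = [
    {"_path": "plone_schema.png", "_sortorder": 2,
     "_content_info": {"content-type": "image/png"}},
    {"_path": "file1.htm", "_sortorder": 1,
     "_content_info": {"content-type": "text/html"}},
    {"_path": "", "_sortorder": 0,
     "_content_info": {"content-type": "text/html"}},
]
print([item["_path"] for item in html_items(items)])
```

If the key options are overridden in the pipeline configuration, the literal
key names in such a step would need to change to match.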


Tests
-----

>>> testtransmogrifier("""
... [webcrawler]
... blueprint = transmogrify.webcrawler
... site_url  = file://%s/test_staticsite
... alias_bases = http://somerandomsite file:///
... """)
{'_backlinks': [],
 '_content_info': {'content-type': 'text/html'},
 '_mimetype': 'text/html',
 '_origin': 'file://.../test_staticsite',
 '_path': '',
 '_site_url': 'file://.../test_staticsite/',
 '_sortorder': 0,
 '_type': 'Document'}
...


>>> testtransmogrifier("""
... [webcrawler]
... blueprint = transmogrify.webcrawler
... site_url  = file://%s/test_staticsite
... alias_bases = http://somerandomsite file:///
... """)
{...
 '_path': '',
 ...}
{...
 '_path': 'cia-plone-view-source.jpg',
 ...}
{...
 '_path': 'subfolder',
 ...}
{...
 '_path': 'subfolder2',
 ...}
{...
 '_path': 'file3.html',
 ...}
{...
 '_path': 'subfolder/subfile1.htm',
 ...}
{...
 '_path': 'file.doc',
 ...}
{...
 '_path': 'file2.htm',
 ...}
{...
 '_path': 'file4.HTML',
 ...}
{...
 '_path': 'egenius-plone.gif',
 ...}
{...
 '_path': 'plone_schema.png',
 ...}
{...
 '_path': 'file1.htm',
 ...}
{...
 '_path': 'subfolder2/subfile1.htm',
 ...}
...

>>> testtransmogrifier("""
... [webcrawler]
... blueprint = transmogrify.webcrawler
... site_url  = file://%s/test_staticsite
... alias_bases = http://somerandomsite file:///
... patterns =
...     (?s)<SCRIPT.*Abbreviation"\)
...     (?s)MakeLink\('(?P<u>[^']*)','(?P<a>[^']*)'\)
...     (?s)State=.*<body[^>]*>
... subs =
...     </head><body>
...     <a href="\g<u>">\g<a></a>
...     <br>
... """)
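
The ``patterns`` and ``subs`` values in the test above can be read as pairs of
regular expression and replacement text. The sketch below assumes the two lists
are paired by position and applied in order with ``re.sub``; the
``apply_substitutions`` helper is hypothetical and is not part of the package.

```python
import re


def apply_substitutions(html, patterns, subs):
    """Apply each pattern/replacement pair in order (hypothetical helper)."""
    # Assumption: both options are newline separated and paired by position.
    pattern_lines = [p.strip() for p in patterns.splitlines() if p.strip()]
    sub_lines = [s.strip() for s in subs.splitlines() if s.strip()]
    for pattern, replacement in zip(pattern_lines, sub_lines):
        html = re.sub(pattern, replacement, html)
    return html


# One pattern/replacement pair taken from the test configuration.
patterns = r"""
(?s)MakeLink\('(?P<u>[^']*)','(?P<a>[^']*)'\)
"""
subs = r"""
<a href="\g<u>">\g<a></a>
"""
raw = "MakeLink('page.html','A page')"
print(apply_substitutions(raw, patterns, subs))
```

With this input the JavaScript-style ``MakeLink`` call is rewritten into an
ordinary anchor tag, which an HTML parser can then pick up as a link.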



External scripts used
---------------------

http://svn.python.org/projects/python/trunk/Tools/webchecker/webchecker.py
http://svn.python.org/projects/python/trunk/Tools/webchecker/websucker.py


