Metadata-Version: 1.0
Name: transmogrify.webcrawler
Version: 1.0b1
Summary: Crawling and feeding html content into a transmogrifier pipeline
Home-page: http://github.com/djay/transmogrify.webcrawler
Author: Dylan Jay
Author-email: software@pretaweb.com
License: GPL
Description: Introduction
        ============
        
        transmogrify.webcrawler
        A source blueprint for crawling content from a site or local html files.
        
        # WebCrawler will emit items like:
        # item = dict(_site_url = "Original site_url used",
        #             _path = "The url crawled without _site_url",
        #             _content = "The raw content returned by the url",
        #             _content_info = "Headers returned with content",
        #             _backlinks = names,
        #             _sortorder = "An integer representing the order the url was found within the page/site",
        #            )
        
        
        transmogrify.webcrawler.typerecognitor
        A blueprint for assigning a content type based on the mime-type given by the
        webcrawler
        
        transmogrify.webcrawler.cache
        A blueprint that saves crawled content into a directory structure
        
        
        
        transmogrify.webcrawler
        =======================
        
        A transmogrifier source blueprint which crawls a url, reading in pages until
        every linked page within the site has been crawled.
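        The crawl can be pictured as a breadth-first walk out from ``site_url``. A
        minimal sketch of that idea, where ``fetch`` and ``extract_links`` are
        hypothetical stand-ins for the real fetching and link-extraction code:
        
        ```python
        from collections import deque
        
        def crawl(fetch, extract_links, site_url):
            # Breadth-first walk: visit each url once, in discovery order.
            seen, queue, order = {site_url}, deque([site_url]), 0
            while queue:
                url = queue.popleft()
                content = fetch(url)
                # Emit an item using the keys documented in this README.
                yield {'_path': url[len(site_url):],
                       '_content': content,
                       '_sortorder': order}
                order += 1
                for link in extract_links(content):
                    # Links outside the site_url base are ignored.
                    if link.startswith(site_url) and link not in seen:
                        seen.add(link)
                        queue.append(link)
        ```
        
        The real blueprint does considerably more (redirects, mime-types,
        backlinks); this only illustrates the traversal order behind _sortorder.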
        
        Options
        -------
        
        site_url
        URL to start crawling from. This URL is treated as the base and any links
        outside it will be ignored
        
        ignore
        Regular expressions matching urls that should not be followed
        
        
        patterns
        Regular expressions to substitute before the html is parsed. Newline separated
        
        subs
        Replacement text, one per line, paired in order with ``patterns``
        
        checkext
        Whether to also check that external links resolve, without crawling them
        
        verbose
        Output progress information while crawling
        
        maxsize
        Don't crawl anything larger than this size
        
        nonames
        Disable the use of link text for deriving item names
        
        cache
        Local directory used to cache crawled content
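        Put together, a section using these options might look like the following
        (all values are illustrative only)::
        
            [webcrawler]
            blueprint = transmogrify.webcrawler
            site_url  = http://www.example.com/
            ignore    = \.pdf$
            maxsize   = 1000000
            verbose   = True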
        
        Keys inserted
        -------------
        
        The following options set the keys of the items added to the pipeline
        
        pathkey
        default: _path. The path of the url not including the base
        
        siteurlkey
        default: _site_url. The base of the url
        
        originkey
        default: _origin. The original path, in case retrieving the url caused a redirection
        
        contentkey
        default: _content. The main content of the url
        
        contentinfokey
        default: _content_info. Headers returned by urlopen
        
        sortorderkey
        default: _sortorder. A count of when a link to this item was first encountered while crawling
        
        backlinkskey
        default: _backlinks. A list of (url, path) tuples recording which pages linked to this item
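        A downstream pipeline section sees these keys on each item. A minimal
        sketch of a hypothetical consumer section (not part of this package) that
        passes items through while reading the keys documented above:
        
        ```python
        def printer(previous):
            # A pass-through section: read the documented keys, then yield the
            # item unchanged so later sections still receive it.
            for item in previous:
                path = item.get('_path', '')
                mimetype = item.get('_content_info', {}).get('content-type', '')
                print(path, mimetype, item.get('_sortorder'))
                yield item
        ```
        
        Yielding every item is essential: a transmogrifier section that consumes
        an item without yielding it removes it from the pipeline.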
        
        
        Tests
        -----
        
        >>> testtransmogrifier("""
        ... [webcrawler]
        ... blueprint = transmogrify.webcrawler
        ... site_url  = file://%s/test_staticsite
        ... alias_bases = http://somerandomsite file:///
        ... """)
        {'_backlinks': [],
        '_content_info': {'content-type': 'text/html'},
        '_mimetype': 'text/html',
        '_origin': 'file://.../test_staticsite',
        '_path': '',
        '_site_url': 'file://.../test_staticsite/',
        '_sortorder': 0,
        '_type': 'Document'}
        ...
        
        
        >>> testtransmogrifier("""
        ... [webcrawler]
        ... blueprint = transmogrify.webcrawler
        ... site_url  = file://%s/test_staticsite
        ... alias_bases = http://somerandomsite file:///
        ... """)
        {...
        '_path': '',
        ...}
        {...
        '_path': 'cia-plone-view-source.jpg',
        ...}
        {...
        '_path': 'subfolder',
        ...}
        {...
        '_path': 'subfolder2',
        ...}
        {...
        '_path': 'file3.html',
        ...}
        {...
        '_path': 'subfolder/subfile1.htm',
        ...}
        {...
        '_path': 'file.doc',
        ...}
        {...
        '_path': 'file2.htm',
        ...}
        {...
        '_path': 'file4.HTML',
        ...}
        {...
        '_path': 'egenius-plone.gif',
        ...}
        {...
        '_path': 'plone_schema.png',
        ...}
        {...
        '_path': 'file1.htm',
        ...}
        {...
        '_path': 'subfolder2/subfile1.htm',
        ...}
        ...
        
        >>> testtransmogrifier("""
        ... [webcrawler]
        ... blueprint = transmogrify.webcrawler
        ... site_url  = file://%s/test_staticsite
        ... alias_bases = http://somerandomsite file:///
        ... patterns =
        ...     (?s)<SCRIPT.*Abbreviation"\)
        ...     (?s)MakeLink\('(?P<u>[^']*)','(?P<a>[^']*)'\)
        ...     (?s)State=.*<body[^>]*>
        ... subs =
        ...     </head><body>
        ...     <a href="\g<u>">\g<a></a>
        ...     <br>
        ... """)
        
        
        
        External scripts used
        ---------------------
        
        http://svn.python.org/projects/python/trunk/Tools/webchecker/webchecker.py
        http://svn.python.org/projects/python/trunk/Tools/webchecker/websucker.py
        
        
        
        TypeRecognitor
        ==============
        
        TypeRecognitor is a transmogrifier blueprint which determines the Plone type of
        the item from the mime-type in the headers. It reads the mime-type from the
        headers in _content_info set by transmogrify.webcrawler
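        The recognition can be pictured as a lookup table from mime-type to Plone
        type. The sketch below shows only an illustrative subset consistent with
        the test output in this section; the real table lives in
        ``transmogrify.webcrawler.typerecognitor`` and covers more types, and the
        ``'File'`` fallback here is an assumption:
        
        ```python
        # Illustrative subset of a mime-type to Plone type mapping; the real
        # blueprint also sets a _transform (e.g. doc_to_html) where needed.
        MIME_TO_TYPE = {
            'text/html': 'Document',
            'application/msword': 'Document',
            'image/jpeg': 'Image',
            'image/gif': 'Image',
            'image/png': 'Image',
        }
        
        def recognize(item):
            # Read the mime-type from the headers stored in _content_info.
            mimetype = item.get('_content_info', {}).get('content-type', '')
            item['_mimetype'] = mimetype
            item['_type'] = MIME_TO_TYPE.get(mimetype, 'File')
            return item
        ```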
        
        >>> from os.path import dirname
        >>> from os.path import abspath
        >>> config = """
        ...
        ... [transmogrifier]
        ... pipeline =
        ...     webcrawler
        ...     typerecognitor
        ...     clean
        ...     printer
        ...
        ... [webcrawler]
        ... blueprint = transmogrify.webcrawler
        ... site_url  = file://%s/test_staticsite
        ...
        ... [typerecognitor]
        ... blueprint = transmogrify.webcrawler.typerecognitor
        ...
        ... [clean]
        ... blueprint = collective.transmogrifier.sections.manipulator
        ... delete =
        ...   file
        ...   text
        ...   image
        ...
        ... [printer]
        ... blueprint = collective.transmogrifier.sections.tests.pprinter
        ...
        ... """ % abspath(dirname(__file__)).replace('\\','/')
        
        >>> from collective.transmogrifier.tests import registerConfig
        >>> registerConfig(u'transmogrify.webcrawler.typerecognitor.test', config)
        
        >>> from collective.transmogrifier.transmogrifier import Transmogrifier
        >>> transmogrifier = Transmogrifier(plone)
        >>> transmogrifier(u'transmogrify.webcrawler.typerecognitor.test')
        {...
        '_mimetype': 'image/jpeg',
        ...
        '_path': 'cia-plone-view-source.jpg',
        ...
        '_type': 'Image',
        ...}
        ...
        
        {'_mimetype': 'image/gif',
        '_path': '/egenius-plone.gif',
        '_site_url': 'file:///home/rok/Projects/pretaweb_dev/src/transmogrify.webcrawler/pretaweb/blueprints/test_staticsite',
        '_transform': None,
        '_type': 'Image'}
        {'_mimetype': 'application/msword',
        '_path': '/file.doc',
        '_site_url': 'file:///home/rok/Projects/pretaweb_dev/src/transmogrify.webcrawler/pretaweb/blueprints/test_staticsite',
        '_transform': 'doc_to_html',
        '_type': 'Document'}
        {'_mimetype': 'text/html',
        '_path': '/file1.htm',
        '_site_url': 'file:///home/rok/Projects/pretaweb_dev/src/transmogrify.webcrawler/pretaweb/blueprints/test_staticsite',
        '_transform': None,
        '_type': 'Document'}
        {'_mimetype': 'text/html',
        '_path': '/file2.htm',
        '_site_url': 'file:///home/rok/Projects/pretaweb_dev/src/transmogrify.webcrawler/pretaweb/blueprints/test_staticsite',
        '_transform': None,
        '_type': 'Document'}
        {'_mimetype': 'text/html',
        '_path': '/file3.html',
        '_site_url': 'file:///home/rok/Projects/pretaweb_dev/src/transmogrify.webcrawler/pretaweb/blueprints/test_staticsite',
        '_transform': None,
        '_type': 'Document'}
        {'_mimetype': 'text/html',
        '_path': '/file4.HTML',
        '_site_url': 'file:///home/rok/Projects/pretaweb_dev/src/transmogrify.webcrawler/pretaweb/blueprints/test_staticsite',
        '_transform': None,
        '_type': 'Document'}
        {'_mimetype': 'image/png',
        '_path': '/plone_schema.png',
        '_site_url': 'file:///home/rok/Projects/pretaweb_dev/src/transmogrify.webcrawler/pretaweb/blueprints/test_staticsite',
        '_transform': None,
        '_type': 'Image'}
        {'_mimetype': 'text/html',
        '_path': '/subfolder',
        '_site_url': 'file:///home/rok/Projects/pretaweb_dev/src/transmogrify.webcrawler/pretaweb/blueprints/test_staticsite',
        '_transform': None,
        '_type': 'Document'}
        {'_mimetype': 'text/html',
        '_path': '/subfolder/subfile1.htm',
        '_site_url': 'file:///home/rok/Projects/pretaweb_dev/src/transmogrify.webcrawler/pretaweb/blueprints/test_staticsite',
        '_transform': None,
        '_type': 'Document'}
        
        
        Changelog
        =========
        
        1.0 - Unreleased
        ----------------
        
        * Initial release
        
        0.2
        ---
        
        - Added caching of crawled sites (16-7-09)
          [djay]
        
        - Added UI using z3cform (10-7-09)
          [djay]
        
        
        0.1 - October 25, 2008
        ----------------------
        
        - Renamed package from pretaweb.blueprints to transmogrify.webcrawler.
          [djay]
        
        - Enhanced import view
          [djay]
        
Keywords: transmogrifier blueprint funnelweb source plone import conversion microsoft office
Platform: UNKNOWN
Classifier: Programming Language :: Python
Classifier: Topic :: Software Development :: Libraries :: Python Modules
