It sounds like the best place to get the content is probably Datapages.com. There is no web API, but there are two sets of semantic tags in Datapages: '''<nowiki><meta></nowiki>''' tags and '''<nowiki><cjsXXXXX></nowiki>''' tags, where XXXXX is a field name, giving, for example, '''<nowiki><cjsauthor></nowiki>''', '''<nowiki><cjsyear></nowiki>''', etc. The body of the article, with simple HTML markup, is in '''<nowiki><cjstext></nowiki>''' tags.
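A chapter page might contain markup something like this (a hypothetical fragment: the tag names are the real ones, but the values and layout here are invented for illustration):
<pre>
<meta name="pdf_url" content="...">
<meta name="html_url" content="...">
<cjsauthor>A. N. Author</cjsauthor>
<cjsyear>1977</cjsyear>
<cjspubtitle>Publication title</cjspubtitle>
<cjstext><p>The body of the article, in simple HTML ...</p></cjstext>
</pre>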
 
'''scrapy''' is a web scraping library for Python with convenient XPath addressing of tags and easy ways to push collected items through a processing flow. It starts from a book's base URL, follows chapter links, and collects the page content. It also handles server errors, sends an honest user agent, and waits politely between requests, so we won't take Datapages down while it crawls (which it does quickly: scraping an entire book takes about 1 minute).
===Example spider===
<pre>
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.selector import HtmlXPathSelector

# We define the Article object in items.py (see the sketch below)
from aapg.items import Article

class AapgSpider(CrawlSpider):
    name = 'aapg'

    # Keep the spider on Datapages
    allowed_domains = ['datapages.com']

    start_urls = [
        'http://archives.datapages.com/data/alt-browse/aapg-special-volumes/me10.htm'
    ]

    # All the chapter pages contain the publication number, a095 in this case
    # Links are followed and the contents sent to the parser
    rules = (
        Rule(SgmlLinkExtractor(allow=('specpubs/method', )),
             callback='parse_item'),
    )

    # The parser: gathers the XPath components into a Scrapy Item
    def parse_item(self, response):

        hxs = HtmlXPathSelector(response)

        article = Article()
        article['pdf']     = hxs.select('//meta[contains(@name,"pdf_url")]/@content').extract()
        article['link']    = hxs.select('//meta[contains(@name,"html_url")]/@content').extract()
        article['publ']    = hxs.select('//cjspubtitle/text()').extract()
        article['kind']    = hxs.select('//cjstype/text()').extract()
        article['volume']  = hxs.select('//cjsvolume/text()').extract()
        article['year']    = hxs.select('//cjsyear/text()').extract()
        article['editor']  = hxs.select('//cjseditor/text()').extract()
        article['part']    = hxs.select('//cjstitle/text()').extract()
        article['chapter'] = hxs.select('//cjstitle/text()').extract()
        article['author']  = hxs.select('//cjsauthor/text()').extract()

        # This is the text of the article; only works with access to Datapages
        # article['text'] = hxs.select('//cjstext/text()').extract()

        return article
</pre>
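The '''Article''' item the spider imports is defined in ''items.py'', which isn't shown above. Here is a minimal sketch of what it might look like, assuming the same old-style Scrapy item API the spider uses; the field names come from the parser above, everything else is an assumption:
<pre>
# items.py -- a minimal sketch; the field names match the spider above
from scrapy.item import Item, Field

class Article(Item):
    pdf     = Field()
    link    = Field()
    publ    = Field()
    kind    = Field()
    volume  = Field()
    year    = Field()
    editor  = Field()
    part    = Field()
    chapter = Field()
    author  = Field()
    text    = Field()
</pre>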
==Converting basic HTML to wikitext==
Some patterns to convert (see the sketch after this list):
* <code><nowiki><p><strong>HEADING</strong></p></nowiki></code> &rarr; Article called <code><nowiki>Heading</nowiki></code> (use ''string.isupper()'' to decide)
* <code><nowiki><p><strong>Heading</strong></p></nowiki></code> &rarr; <code><nowiki>==Heading==</nowiki></code>
* <code><nowiki><p>Whole bunch of text.</p></nowiki></code> &rarr; ordinary wikitext
* <code><nowiki><p><strong>Table [0-9]. Caption.</strong></p></nowiki></code> &rarr; Table with caption <code><nowiki>Caption.</nowiki></code>
* <code><nowiki><p><strong>Fig. [0-9]. Caption.</strong></p></nowiki></code> &rarr; Image with caption <code><nowiki>Caption.</nowiki></code>
* <code><nowiki><p><strong><blockquote>End_Page 276------------------------</blockquote></strong></p></nowiki></code> &rarr; delete
* <code><nowiki><p>... \([a-z]+, ((in press)|[0-9]{2,4})\) ...</p></nowiki></code> &rarr; a reference
* <code><nowiki><p>[EQUATION]</p></nowiki></code> &rarr; an equation, probably needs hand-coding in LaTeX
* <code><nowiki><p>... <em>some text</em> ...</p></nowiki></code> &rarr; <code><nowiki>''some text''</nowiki></code>
 
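A minimal sketch of how a few of these patterns might be handled in Python; the function and the exact regexes are illustrative assumptions, not a tested converter:
<pre>
import re

def convert_paragraph(p):
    """Convert one <p>...</p> string from Datapages to wikitext."""
    # The End_Page markers carry no content: delete them outright
    if 'End_Page' in p:
        return ''

    # <p><strong>Heading</strong></p> -> ==Heading==
    # An all-caps HEADING signals the start of a new article instead
    m = re.match(r'<p><strong>([^<]+)</strong></p>$', p)
    if m:
        heading = m.group(1)
        if heading.isupper():
            return heading.capitalize()  # new article title, handled separately
        return '==' + heading + '=='

    # <em>some text</em> -> ''some text''
    p = re.sub(r'</?em>', "''", p)

    # Ordinary text: drop the <p> wrapper and keep the rest as-is
    return re.sub(r'</?p>', '', p)
</pre>
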
For bottom captions in tables, use something like:
<pre>
|+ align="bottom" | Table caption.
</pre>
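In context, the caption line goes right after the table opening; the table contents here are invented purely for illustration:
<pre>
{| class="wikitable"
|+ align="bottom" | Table caption.
|-
! Column A !! Column B
|-
| value 1 || value 2
|}
</pre>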

I haven't seen an equation yet.

==Handling categories, references, figures, etc.==

==Handling tables, equations, etc.==

==Uploading images==

==Adding links to articles==

==Human QC and finessing of conversion==