AAPG Wiki:Converting content

There are some large tasks in the conversion process:

  • Collecting the content in a useful format
  • Converting basic HTML to wikitext
  • Handling categories, references, figures, etc., probably automatically
  • Handling tables, equations, etc., probably manually
  • Uploading images
  • Adding links to articles
  • Human QC and finessing of conversion

Collecting content

The seed content does not exist as semantically marked-up text.

The best place to get the content is probably Datapages.com. There is no web API, but there are two sets of semantic tags in Datapages: <meta> tags and <cjsXXXXX> tags, where XXXXX is some string, giving, for example, <cjsauthor>, <cjsyear>, etc. The body of the article, with simple HTML markup, is in <cjstext> tags.

scrapy is a web scraping library for Python with nice XPath addressing of tags and easy ways to push collected items through a processing flow. It starts from a book's base URL, follows the chapter links, and collects the page content. It also has nice features like handling of server errors, honest user agents, and polite wait times, so we won't take Datapages down while it crawls. It's quick, too: scraping an entire book takes about 1 minute.
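
The politeness knobs live in the project's settings.py. A minimal sketch with illustrative values (these names are real scrapy settings, but the values are assumptions, not what the conversion actually used):

# aapg/settings.py (excerpt) -- illustrative politeness settings

BOT_NAME = 'aapg'

# Identify the crawler honestly
USER_AGENT = 'aapg-wiki-converter (+contact address here)'

# Pause between requests so the crawl stays gentle on the server
DOWNLOAD_DELAY = 2.0

# Retry transient server errors a couple of times, then give up
RETRY_ENABLED = True
RETRY_TIMES = 2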

Example spider

from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.selector import HtmlXPathSelector

# We define the Article object in items.py
from aapg.items import Article

class AapgSpider(CrawlSpider):
    name = 'aapg'
    
    # Keep the spider on Datapages
    allowed_domains = ['datapages.com']
    
    start_urls = [
        'http://archives.datapages.com/data/alt-browse/aapg-special-volumes/me10.htm'
    ]
    
    # The chapter page URLs all contain the publication number (a095 for
    # this book) and live under specpubs/method; matching links are
    # followed and the contents sent to the parser
    rules = (
        Rule(SgmlLinkExtractor(allow=('specpubs/method', )),
        callback='parse_item'),
    )

    # The parser: gathers the XPath selections into a scrapy Item
    def parse_item(self, response):
        
        hxs = HtmlXPathSelector(response)
        
        article = Article()
        article['pdf']     = hxs.select('//meta[contains(@name,"pdf_url")]/@content').extract()
        article['link']    = hxs.select('//meta[contains(@name,"html_url")]/@content').extract()
        article['publ']    = hxs.select('//cjspubtitle/text()').extract()
        article['kind']    = hxs.select('//cjstype/text()').extract()
        article['volume']  = hxs.select('//cjsvolume/text()').extract()
        article['year']    = hxs.select('//cjsyear/text()').extract()
        article['editor']  = hxs.select('//cjseditor/text()').extract()
        # Both selectors read <cjstitle> tags; telling part titles from
        # chapter titles is left to downstream processing
        article['part']    = hxs.select('//cjstitle/text()').extract()
        article['chapter'] = hxs.select('//cjstitle/text()').extract()
        article['author']  = hxs.select('//cjsauthor/text()').extract()
        
        # This is the text of the article; it only works with access to Datapages
        # article['text'] = hxs.select('//cjstext/text()').extract()

        return article
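
The Article item imported above would be defined in aapg/items.py. A minimal sketch, with the fields taken straight from the spider (the real file may declare more):

from scrapy.item import Item, Field

class Article(Item):
    pdf     = Field()
    link    = Field()
    publ    = Field()
    kind    = Field()
    volume  = Field()
    year    = Field()
    editor  = Field()
    part    = Field()
    chapter = Field()
    author  = Field()
    text    = Field()

The crawl then runs from the project directory, dumping the collected items to JSON:

scrapy crawl aapg -o articles.json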

Converting basic HTML to wikitext

Some patterns to convert (a rough regex sketch follows the list):

  • <p><strong>HEADING</strong></p> → a new article called Heading (use str.isupper() to decide)
  • <p><strong>Heading</strong></p> → ==Heading==
  • <p>Whole bunch of text.</p> → ordinary wikitext
  • <p><strong>Table [0-9]+. Caption.</strong></p> → a table with caption Caption.
  • <p><strong>Fig. [0-9]+. Caption.</strong></p> → an image with caption Caption.
  • <p><strong><blockquote>End_Page 276------------------------</blockquote></strong></p> → delete
  • <p>... \([A-Za-z]+, (in press|[0-9]{2,4})\) ...</p> → a reference
  • <p>[EQUATION]</p> → an equation, probably needs hand-coding in LaTeX (I haven't seen an equation yet)
  • <p>... <em>some text</em> ...</p> → ''some text''
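
As a very rough illustration, here are a few of the simpler patterns as regular-expression substitutions. A sketch only: the order of the rules matters, and a real converter should parse the HTML properly with BeautifulSoup or lxml rather than regex it.

import re

def html_to_wikitext(html):
    """First pass at a few of the patterns above; not a complete converter."""
    # Delete End_Page markers first, before the heading rule can catch them
    html = re.sub(r'<p><strong><blockquote>End_Page[^<]*</blockquote></strong></p>',
                  '', html)
    # Bold paragraphs become headings; deciding which all-caps headings
    # should start new articles needs a pass of its own
    html = re.sub(r'<p><strong>(.*?)</strong></p>', r'==\1==', html)
    # Emphasis becomes wiki italics
    html = re.sub(r'<em>(.*?)</em>', r"''\1''", html)
    # Remaining paragraphs become ordinary wikitext, one per line
    html = re.sub(r'<p>(.*?)</p>', '\\1\n', html, flags=re.DOTALL)
    return html

So html_to_wikitext('<p><strong>Introduction</strong></p>') gives '==Introduction=='.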

For bottom captions in tables, use something like:

|+ align="bottom" | Table caption.
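
In the context of a whole table, that looks something like this (the contents here are hypothetical):

{| class="wikitable"
|+ align="bottom" | Table 1. A hypothetical caption.
! Porosity !! Permeability
|-
| 0.25 || 150 mD
|-
| 0.10 || 2 mD
|}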

Methods for converting

  • Useful Python libraries:
    • It should be possible to adapt a scrapy pipeline to process the text (a sketch follows this list)
    • HTMLParser is probably not powerful enough
    • BeautifulSoup or lxml
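
A pipeline along those lines might look like the sketch below, assuming the Article items from the spider and a html_to_wikitext() converter like the one sketched earlier (the aapg.convert module path is hypothetical):

# aapg/pipelines.py -- hypothetical conversion pipeline
from aapg.convert import html_to_wikitext  # the converter sketched earlier

class WikitextPipeline(object):
    """Convert the scraped HTML body of each Article item to wikitext."""

    def process_item(self, item, spider):
        if item.get('text'):
            # extract() returns a list of strings; join before converting
            item['text'] = html_to_wikitext(''.join(item['text']))
        return item

It would be switched on in settings.py with something like ITEM_PIPELINES = ['aapg.pipelines.WikitextPipeline'].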

Handling categories, references, figures, etc.

Handling tables, equations, etc.

Uploading images

Adding links to articles

Human QC and finessing of conversion