AAPG Wiki:Converting content

There are some large tasks in the conversion process:

  • Collecting the content in a useful format
  • Converting basic HTML to wikitext
  • Handling categories, references, figures, etc., probably automatically
  • Handling tables, equations, etc., probably manually
  • Uploading images
  • Adding links to articles
  • Human QC and finessing of conversion

Collecting content

The seed content does not exist as semantically marked-up text.

The best place to get the content is probably Datapages.com. There is no web API, but there are two sets of semantic tags in Datapages: <meta> tags and <cjsXXXXX> tags, where XXXXX is a descriptive string, giving, for example, <cjsauthor>, <cjsyear>, and so on. The body of the article, with simple HTML markup, is in <cjstext> tags.

scrapy is a web scraping library for Python with nice XPath addressing of tags and easy ways to push collected items through a processing pipeline. A spider starts from a book's base URL, follows the chapter links, and collects each page's content. Scrapy also has nice features like handling server errors, honest user agents, and polite wait times, so we won't take Datapages down while crawling. Crawling is fast: scraping an entire book takes about a minute.

Example spider

from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.selector import HtmlXPathSelector

# We define the Article object in items.py
from aapg.items import Article

class AapgSpider(CrawlSpider):
    name = 'aapg'
    
    # Keep the spider on Datapages
    allowed_domains = ['datapages.com']
    
    start_urls = [
        'http://archives.datapages.com/data/alt-browse/aapg-special-volumes/me10.htm'
    ]
    
    # The chapter pages share part of their URL ('specpubs/method' for this book);
    # matching links are followed and their contents sent to the parser
    rules = (
        Rule(SgmlLinkExtractor(allow=('specpubs/method', )),
        callback='parse_item'),
    )

    # The parser: gathers the XPath components into a Scrapy Item
    def parse_item(self, response):
        
        hxs = HtmlXPathSelector(response)
        
        article = Article()
        article['pdf']     = hxs.select('//meta[contains(@name,"pdf_url")]/@content').extract()
        article['link']    = hxs.select('//meta[contains(@name,"html_url")]/@content').extract()
        article['publ']    = hxs.select('//cjspubtitle/text()').extract()
        article['kind']    = hxs.select('//cjstype/text()').extract()
        article['volume']  = hxs.select('//cjsvolume/text()').extract()
        article['year']    = hxs.select('//cjsyear/text()').extract()
        article['editor']  = hxs.select('//cjseditor/text()').extract()
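        # Part and chapter titles both appear in <cjstitle> tags; extract()
        # returns every match, so both fields get the same list of strings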
        article['part']    = hxs.select('//cjstitle/text()').extract()
        article['chapter'] = hxs.select('//cjstitle/text()').extract()
        article['author']  = hxs.select('//cjsauthor/text()').extract()
        
        # This is the text of the article, only works with access to Datapages
        # article['text'] = hxs.select('//cjstext/text()').extract()

        return article
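
The Article item imported above is not shown here; a minimal sketch of aapg/items.py, with field names matching the spider, might be:

from scrapy.item import Item, Field

class Article(Item):
    # One Field per piece of metadata the spider collects
    pdf     = Field()
    link    = Field()
    publ    = Field()
    kind    = Field()
    volume  = Field()
    year    = Field()
    editor  = Field()
    part    = Field()
    chapter = Field()
    author  = Field()
    text    = Field()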

Converting basic HTML to wikitext

Some patterns to convert:

  • <p><strong>HEADING</strong></p> → Article called Heading (use str.isupper() to decide)
  • <p><strong>Heading</strong></p> → ==Heading==
  • <p>Whole bunch of text.</p> → ordinary wikitext
  • <p><strong>Table [0-9]. Caption.</strong></p> → Table with caption Caption.
  • <p><strong>Fig. [0-9]. Caption.</strong></p> → Image with caption Caption.
  • <p><strong><blockquote>End_Page 276------------------------</blockquote></strong></p> → delete
  • <p>... \([A-Za-z]+, (in press|[0-9]{4})\) ...</p> → a reference
  • <p>[EQUATION]</p> → an equation, probably needs hand-coding in LaTeX
  • <p>... <em>some text</em> ...</p> → ''some text''

For bottom captions in tables, use something like:

|+ align="bottom" | Table caption.
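
In context, a converted table would look something like this (made-up values):

{| class="wikitable"
|+ align="bottom" | Table 1. Example caption.
! Sample !! Porosity
|-
| A-1 || 0.23
|}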

Regex patterns

Page number blocks

re.sub(r"<blockquote>.*?Page [0-9]*-*</blockquote>", "", s)

H2 headings, indiscriminate

re.sub(r"<p><strong>(.*)</strong></p>", r"==\1==", s)

Possible new articles, all-caps (H1) headings

re.sub(r"<p><strong>([A-Z ]*)</strong></p>", r"==\1==", s)

Finds reference citations

re.sub(r"\(([a-zA-Z ]*, (?:in press|[0-9]{4}))\)", r"<ref>\1</ref>", s)

Almost builds a reference key (surname plus year) for the list of refs (doesn't handle accented characters)

re.sub(r"<p>([-a-zA-Z]*).*([0-9]{4}).*</p>", r"\1\2", s)

Finds figures

re.sub(r"<p><strong>Fig\. [0-9]*\. (.*)</strong></p>", r"[[File:AUTHOR_SOMETHING.png|thumb|\1]]", s)

Finds tables; we may also want to put an image of the table here and build an empty table

re.sub(r"<p><strong>(Table [0-9]*)\. (.*)</strong></p>", r"<!-- \1 -- \2 --> ", s)

Methods for converting

  • Useful Python libraries:
    • It should be possible to just adapt a scrapy pipeline for processing the text
    • HTMLParser (the standard library parser) is probably not powerful enough
    • BeautifulSoup or lxml (see the sketch below)
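
For example, a minimal sketch of the BeautifulSoup route, assuming BeautifulSoup 4 and the simple <p><strong>...</strong></p> markup described above:

from bs4 import BeautifulSoup

def html_to_wikitext(html):
    """A minimal sketch, assuming BeautifulSoup 4 and the simple
    <p><strong>...</strong></p> markup described above."""
    soup = BeautifulSoup(html, "html.parser")
    out = []
    for p in soup.find_all("p"):
        strong = p.find("strong")
        if strong and strong.get_text(strip=True) == p.get_text(strip=True):
            heading = strong.get_text(strip=True)
            if heading.isupper():
                # All-caps headings are candidates for new articles
                out.append("==%s==  <!-- candidate new article -->" % heading.title())
            else:
                out.append("==%s==" % heading)
        else:
            # Convert emphasis to wikitext italics; keep the rest as plain text
            for em in p.find_all("em"):
                em.replace_with("''%s''" % em.get_text())
            out.append(p.get_text())
    return "\n\n".join(out)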

Handling categories, references, figures, etc.

Handling tables, equations, etc.

Uploading images

Mark is looking into where we will get the images from. The loading method depends a bit on where they are stored and what the files are called.
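
However the image files arrive, they can probably be pushed up in bulk through the MediaWiki API (or a tool such as pywikibot). A minimal sketch using requests; the endpoint URL and the already-logged-in session are assumptions:

import requests

API = "https://wiki.aapg.org/api.php"   # assumed endpoint

def upload_image(session, path, filename, comment):
    """Upload one local file. The session is assumed to be logged in
    already (e.g. with a bot password via action=login)."""
    # Fetch a CSRF token for the write action
    token = session.get(API, params={
        "action": "query", "meta": "tokens", "format": "json",
    }).json()["query"]["tokens"]["csrftoken"]

    with open(path, "rb") as f:
        r = session.post(API, data={
            "action": "upload",
            "filename": filename,
            "comment": comment,
            "token": token,
            "ignorewarnings": 1,
            "format": "json",
        }, files={"file": f})
    return r.json()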

Adding links to articles

Human QC and finessing of conversion