AAPG Wiki:Converting content

There are some large tasks in the conversion process:

  • Collecting the content in a useful format
  • Converting basic HTML to wikitext (see the sketch after this list)
  • Handling categories, references, figures, etc., probably automatically
  • Handling tables, equations, etc., probably manually
  • Uploading images
  • Adding links to articles
  • Human QC and finessing of conversion
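
As an illustration of the HTML-to-wikitext step, a few of the simplest conversions can be done with regular expressions. This is only a sketch covering a handful of tags; a real converter would need a proper HTML parser or an off-the-shelf tool, and the rules below are illustrative, not exhaustive:

    import re

    # Order-dependent substitutions for a handful of simple HTML tags.
    RULES = [
        (r"<b>(.*?)</b>", r"'''\1'''"),              # bold
        (r"<i>(.*?)</i>", r"''\1''"),                # italic
        (r"<h2>(.*?)</h2>", r"== \1 =="),            # section heading
        (r'<a href="(.*?)">(.*?)</a>', r"[\1 \2]"),  # external link
        (r"</?p>", ""),                              # drop paragraph tags
    ]

    def html_to_wikitext(html):
        for pattern, repl in RULES:
            html = re.sub(pattern, repl, html, flags=re.DOTALL)
        return html

    print(html_to_wikitext('<p>See <b>Smith</b>, <a href="http://example.com">here</a>.</p>'))
    # See '''Smith''', [http://example.com here].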

Collecting content

The seed content does not exist as semantically marked-up text.

The best place to get the content is probably Datapages.com. There is no web API, but there are two sets of semantic tags in Datapages: <meta> tags and <cjsXXXXX> tags, where XXXXX is a string naming the field, giving, for example, <cjsauthor>, <cjsyear>, and so on. The body of the article, with simple HTML markup, is in <cjstext> tags.
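
For example, these fields can be pulled out of a saved page with an HTML parser. Here is a minimal sketch using BeautifulSoup; the tag names are those described above, but the sample snippet is invented:

    from bs4 import BeautifulSoup

    # Invented sample in the Datapages style described above.
    html = """
    <cjsauthor>J. Smith</cjsauthor>
    <cjsyear>1998</cjsyear>
    <cjstext><p>Article body with <b>simple</b> HTML markup.</p></cjstext>
    """

    soup = BeautifulSoup(html, "html.parser")
    author = soup.find("cjsauthor").get_text()     # 'J. Smith'
    year = soup.find("cjsyear").get_text()         # '1998'
    body = soup.find("cjstext").decode_contents()  # inner HTML of the article body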

scrapy is a web scraping library for Python with nice XPath addressing of tags and easy ways to push collected items through a processing flow. It also has nice features like handling server errors, sending an honest user agent, and waiting politely between requests, so the crawl won't take Datapages down. The crawl itself is very quick: scraping an entire book takes about 1 minute.
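
A minimal sketch of such a spider, using the tag names above; the start URL, user agent, and contact address are placeholders, not real Datapages details:

    import scrapy

    class DatapagesSpider(scrapy.Spider):
        """Sketch of a spider for Datapages-style article pages."""
        name = "datapages"
        # Placeholder URL; real article URLs would be collected separately.
        start_urls = ["http://www.datapages.com/placeholder-article.html"]
        custom_settings = {
            "DOWNLOAD_DELAY": 2.0,   # polite wait between requests
            "ROBOTSTXT_OBEY": True,  # respect the site's robots.txt
            "USER_AGENT": "aapg-wiki-converter (contact: placeholder@example.org)",
        }

        def parse(self, response):
            # XPath addresses the custom <cjs...> tags directly.
            yield {
                "author": response.xpath("//cjsauthor/text()").get(),
                "year": response.xpath("//cjsyear/text()").get(),
                "body": response.xpath("//cjstext").get(),  # outer HTML of the body
            }

Each yielded dict is an item that scrapy can push through its pipelines for cleaning, wikitext conversion, and so on.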