AAPG Wiki:Converting content

There are some large tasks in the conversion process:

  • Collecting the content in a useful format
  • Converting basic HTML to wikitext (see the sketch after this list)
  • Handling categories, references, figures, etc., probably automatically
  • Handling tables, equations, etc., probably manually
  • Uploading images
  • Adding links to articles
  • Human QC and finessing of conversion
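
As an illustration of the HTML-to-wikitext step, a few of the simplest conversions can be done with regular expressions. This is only a sketch covering a handful of tags; a real converter would need a proper HTML parser or an off-the-shelf tool, and the rules below are illustrative, not exhaustive:

    import re

    # Order-dependent substitutions for a handful of simple HTML tags.
    RULES = [
        (r"<b>(.*?)</b>", r"'''\1'''"),              # bold
        (r"<i>(.*?)</i>", r"''\1''"),                # italic
        (r"<h2>(.*?)</h2>", r"== \1 =="),            # section heading
        (r'<a href="(.*?)">(.*?)</a>', r"[\1 \2]"),  # external link
        (r"</?p>", ""),                              # drop paragraph tags
    ]

    def html_to_wikitext(html):
        for pattern, repl in RULES:
            html = re.sub(pattern, repl, html, flags=re.DOTALL)
        return html

    print(html_to_wikitext('<p>See <b>Smith</b>, <a href="http://example.com">here</a>.</p>'))
    # See '''Smith''', [http://example.com here].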

Collecting content

The seed content does not exist as semantically marked-up text.

The best place to get the content is probably Datapages.com. There is no web API, but there are two sets of semantic tags in Datapages: <meta> tags and <cjsXXXXX> tags, where XXXXX is a string naming the field, giving, for example, <cjsauthor>, <cjsyear>, and so on. The body of the article, with simple HTML markup, is in <cjstext> tags.
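
For example, these fields can be pulled out of a saved page with an HTML parser. Here is a minimal sketch using BeautifulSoup; the tag names are those described above, but the sample snippet is invented:

    from bs4 import BeautifulSoup

    # Invented sample in the Datapages style described above.
    html = """
    <cjsauthor>J. Smith</cjsauthor>
    <cjsyear>1998</cjsyear>
    <cjstext><p>Article body with <b>simple</b> HTML markup.</p></cjstext>
    """

    soup = BeautifulSoup(html, "html.parser")
    author = soup.find("cjsauthor").get_text()     # 'J. Smith'
    year = soup.find("cjsyear").get_text()         # '1998'
    body = soup.find("cjstext").decode_contents()  # inner HTML of the article body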

scrapy is a web scraping library for Python with nice XPath addressing of tags and easy ways to push collected items through a processing flow. It also has nice features like handling server errors, sending an honest user agent, and waiting politely between requests, so the crawl won't take Datapages down. The crawl itself is very quick: scraping an entire book takes about 1 minute.
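
A minimal sketch of such a spider, using the tag names above; the start URL, user agent, and contact address are placeholders, not real Datapages details:

    import scrapy

    class DatapagesSpider(scrapy.Spider):
        """Sketch of a spider for Datapages-style article pages."""
        name = "datapages"
        # Placeholder URL; real article URLs would be collected separately.
        start_urls = ["http://www.datapages.com/placeholder-article.html"]
        custom_settings = {
            "DOWNLOAD_DELAY": 2.0,   # polite wait between requests
            "ROBOTSTXT_OBEY": True,  # respect the site's robots.txt
            "USER_AGENT": "aapg-wiki-converter (contact: placeholder@example.org)",
        }

        def parse(self, response):
            # XPath addresses the custom <cjs...> tags directly.
            yield {
                "author": response.xpath("//cjsauthor/text()").get(),
                "year": response.xpath("//cjsyear/text()").get(),
                "body": response.xpath("//cjstext").get(),  # outer HTML of the body
            }

Each yielded dict is an item that scrapy can push through its pipelines for cleaning, wikitext conversion, and so on.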