Changes

There are some large tasks in the conversion process:

* Collecting the content in a useful format
* Converting basic HTML to wikitext (a conversion sketch follows this list)
* Handling categories, references, figures, etc., probably automatically
* Handling tables, equations, etc., probably manually
* Uploading images
* Adding links to articles
* Human QC and finessing of conversion
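
For the HTML-to-wikitext item, pandoc's MediaWiki writer may already cover the basic markup, so little custom code should be needed. A minimal sketch, assuming pandoc is installed and on the path:

<syntaxhighlight lang="python">
# A sketch of the HTML-to-wikitext step using pandoc's mediawiki writer.
# Assumes pandoc is installed; the sample HTML string is only an illustration.
import subprocess

def html_to_wikitext(html_fragment):
    """Convert a fragment of simple HTML to wikitext with pandoc."""
    result = subprocess.run(
        ["pandoc", "--from=html", "--to=mediawiki"],
        input=html_fragment,
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout

print(html_to_wikitext("<p>Some <b>bold</b> text and an <i>italic</i> phrase.</p>"))
</syntaxhighlight>

Tables, equations, and anything pandoc garbles would still fall to the manual and QC steps.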

==Collecting content==
The seed content does not exist as semantically marked-up text.

It sounds like the best place to get the content is Datapages.com. There is no web API, but Datapages pages carry two sets of semantic tags: '''<nowiki><meta></nowiki>''' tags and '''<nowiki><cjsXXXXX></nowiki>''' tags, where XXXXX is a descriptive string, giving, for example, '''<nowiki><cjsauthor></nowiki>''', '''<nowiki><cjsyear></nowiki>''', etc. The body of the article, in simple HTML markup, sits inside '''<nowiki><cjstext></nowiki>''' tags.

'''Scrapy''' is a web-scraping framework for Python with convenient XPath addressing of tags and a simple pipeline for pushing collected items through processing steps. It also handles server errors, sends an honest user agent, and waits politely between requests, so we won't take Datapages down while it crawls (it is fast; scraping an entire book takes about a minute).
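
A minimal sketch of a spider along these lines, pulling the cjs tags described above. The start URL, the link-selection XPath, and the contact address are placeholders rather than the real Datapages structure; only the cjs* tag names come from the description above:

<syntaxhighlight lang="python">
# A sketch of a polite Scrapy spider for one book on Datapages.
# The start URL, contents-page XPath, and contact address are placeholders;
# only the cjs* tag names come from the page structure described above.
import scrapy

class DatapagesSpider(scrapy.Spider):
    name = "datapages"
    start_urls = ["https://..."]  # the book's table-of-contents page

    # Politeness: identify ourselves, obey robots.txt, wait between requests.
    custom_settings = {
        "USER_AGENT": "wiki-conversion-bot (contact: ...)",
        "ROBOTSTXT_OBEY": True,
        "DOWNLOAD_DELAY": 2.0,
        "RETRY_TIMES": 3,
    }

    def parse(self, response):
        # Follow each article link on the contents page (placeholder XPath).
        for href in response.xpath("//a/@href").getall():
            yield response.follow(href, callback=self.parse_article)

    def parse_article(self, response):
        # The semantic cjs* tags carry the metadata; <cjstext> holds the body HTML.
        yield {
            "author": response.xpath("string(//cjsauthor)").get(),
            "year": response.xpath("string(//cjsyear)").get(),
            "body_html": response.xpath("//cjstext").get(),
            "url": response.url,
        }
</syntaxhighlight>

Running it with <code>scrapy runspider datapages_spider.py -o book.json</code> would give one JSON item per article, with the <code>body_html</code> field feeding the wikitext conversion step.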
