Difference between revisions of "User:Matt/Content conversion"

Latest revision as of 16:09, 1 October 2013

Notes about what will be required to convert from Datapages format to MediaWiki wikitext. Based on:

Well completion — Chapter from Part 9 of Methods 10.

Article information

We will preserve book information. We will not maintain 'canonical' book content in a separate namespace (SPE did this; SEG sort of did this too).

The article info in Datapages is semantically tagged (yay!). This is very useful indeed. Everything after <HR> and before <CJSTEXT> is header information about the article — book name, volume number, page range, etc:

<P><STRONG>Pub. Id:</STRONG> <CJSVOLUME>A095</CJSVOLUME> (<CJSYEAR>1992</CJSYEAR>)</P><P><STRONG>First Page:</STRONG> <CJSFIRSTPAGE>463</CJSFIRSTPAGE></P>
<P><STRONG>Last Page:</STRONG> <CJSLASTPAGE>468</CJSLASTPAGE></P>
<P><STRONG>Book Title:</STRONG> <CJSPUBTITLE>ME 10: Development Geology Reference Manual</CJSPUBTITLE></P>
<P><STRONG>Article/Chapter:</STRONG> <CJSTITLE>Well Completions: Part 9. Production Engineering Methods</CJSTITLE></P>
<P><STRONG>Subject Group:</STRONG> <CJSTOPIC>Oil--Methodology and Concepts</CJSTOPIC></P>
<P><STRONG>Spec. Pub. Type:</STRONG> <CJSTYPE>Methods in Exploration</CJSTYPE></P>
<P><STRONG>Pub. Year:</STRONG> <CJSVOLUMEYEAR>1992</CJSVOLUMEYEAR></P>
<P><STRONG>Author(s):</STRONG> <CJSAUTHOR>Stephen A. Holditch</CJSAUTHOR></P>
<P><STRONG>Text:</STRONG></P>
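Because the header fields are tagged, pulling them out can be a single regular expression. A minimal sketch, assuming every field sits in a `<CJS...>` element as in the sample above, and that only the header (the part before `<CJSTEXT>`) is passed in. The function name `extract_article_info` and the shortened dictionary keys are made up for illustration.

```python
import re

# Assumption: each header field is wrapped in a matching <CJSxxx>...</CJSxxx> pair.
CJS_FIELD = re.compile(r"<(CJS[A-Z]+)>(.*?)</\1>", re.DOTALL)

def extract_article_info(header_html):
    """Map each CJS tag name (prefix dropped, lowercased) to its text."""
    return {tag[3:].lower(): value.strip()
            for tag, value in CJS_FIELD.findall(header_html)}

sample = (
    "<P><STRONG>Pub. Id:</STRONG> <CJSVOLUME>A095</CJSVOLUME> (<CJSYEAR>1992</CJSYEAR>)</P>"
    "<P><STRONG>First Page:</STRONG> <CJSFIRSTPAGE>463</CJSFIRSTPAGE></P>\n"
    "<P><STRONG>Author(s):</STRONG> <CJSAUTHOR>Stephen A. Holditch</CJSAUTHOR></P>"
)
info = extract_article_info(sample)
print(info["volume"], info["year"], info["firstpage"], info["author"])
```

A dictionary keeps only the last value for a repeated tag, so a chapter with several `<CJSAUTHOR>` elements would need a list instead.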

Most of this is repeated in the page's <head>, and this is the easiest place to get the URLs for the HTML and PDF resources:

<!-- Juicy Metadata. Google Scholar eats this up. -->
 <!-- journals -->
  <meta name="citation_publisher" content="AAPG Special Volumes">
  <meta name="citation_title" content="Well Completions: Part 9. Production Engineering Methods">
  <meta name="DC.Title" content="Well Completions: Part 9. Production Engineering Methods">
  <meta name="citation_author" content="Stephen A. Holditch">
  <meta name="DC.Contributor" content="Stephen A. Holditch">
  <meta name="citation_publication_date" content="1992">
  <meta name="DC.Date" content="1992">
  <meta name="citation_volume" content="95">
  <meta name="citation_firstpage" content="463">
  <meta name="citation_lastpage" content="468">
  <meta name="citation_fulltext_html_url" content="http://archives.datapages.com/data/specpubs/methodo1/data/a095/a095/0001/0450/0463.htm">
  <meta name="citation_pdf_url" content="http://archives.datapages.com/data/specpubs/methodo1/images/a095/a0950001/0450/04630.pdf">
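The `citation_*` tags can be collected with the standard library's `html.parser`. This is a sketch only: the class name `CitationMetaParser` and the helper `citation_meta` are hypothetical, and it assumes each `citation_*` name appears once per page.

```python
from html.parser import HTMLParser

class CitationMetaParser(HTMLParser):
    """Collect <meta name="citation_..." content="..."> pairs."""

    def __init__(self):
        super().__init__()
        self.meta = {}

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attrs = dict(attrs)
        name, content = attrs.get("name"), attrs.get("content")
        if name and name.startswith("citation_") and content is not None:
            self.meta[name] = content  # a repeated name overwrites the earlier value

def citation_meta(page_html):
    parser = CitationMetaParser()
    parser.feed(page_html)
    return parser.meta

sample_head = """<head>
 <!-- journals -->
  <meta name="citation_title" content="Well Completions: Part 9. Production Engineering Methods">
  <meta name="DC.Date" content="1992">
  <meta name="citation_fulltext_html_url" content="http://archives.datapages.com/data/specpubs/methodo1/data/a095/a095/0001/0450/0463.htm">
  <meta name="citation_pdf_url" content="http://archives.datapages.com/data/specpubs/methodo1/images/a095/a0950001/0450/04630.pdf">
</head>"""
meta = citation_meta(sample_head)
print(meta["citation_pdf_url"])
```

The Dublin Core (`DC.*`) tags are skipped here because they duplicate the `citation_*` values.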

Technical content

Everything after <CJSTEXT> and before </CJSTEXT> is the text of the article.

Many things need to be converted:

Remove all the   and   tags
Remove the page breaks, which are contained in <blockquote> tags
Delete the  and   tags
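A sketch of the first two steps: cutting out the article body and dropping the page breaks. It assumes `<blockquote>` elements hold nothing but page breaks and are never nested. The function names and the sample page, including the text inside its blockquote, are invented for illustration.

```python
import re

def extract_article_text(page_html):
    """Return whatever sits between <CJSTEXT> and </CJSTEXT>, or None if absent."""
    match = re.search(r"<CJSTEXT>(.*?)</CJSTEXT>", page_html,
                      flags=re.DOTALL | re.IGNORECASE)
    return match.group(1) if match else None

def remove_page_breaks(text):
    """Delete each <blockquote> element and the whitespace that follows it."""
    return re.sub(r"<blockquote\b[^>]*>.*?</blockquote>\s*", "", text,
                  flags=re.DOTALL | re.IGNORECASE)

# Made-up page; the blockquote content is a placeholder, not real Datapages markup.
page = ("<HR><P>header</P><CJSTEXT><P>First paragraph.</P>\n"
        "<BLOCKQUOTE>End Page 463</BLOCKQUOTE>\n"
        "<P>Second paragraph.</P></CJSTEXT>")
body = remove_page_breaks(extract_article_text(page))
print(body)
```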

Headings

Headings can be interpreted, per these examples:

H1, article name — PERFORATING
H2, section name — Types of Guns
H3, subsection name — Expendable Gun

They all need to be converted to sentence case, leaving proper names, all-uppercase words, and anything in parentheses untouched.
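One way the sentence-case rule could look in code. This is a sketch under assumptions: proper names cannot be detected automatically, so they come from a hand-made set passed in by the caller (`proper_names` is hypothetical), and the last two headings in the checks are invented examples.

```python
import re

# A token is either a parenthesised group or a word (letters, apostrophes, hyphens).
TOKEN = re.compile(r"\([^)]*\)|[A-Za-z][A-Za-z'-]*")

def sentence_case(heading, proper_names=frozenset()):
    """Lowercase every word after the first, except the protected ones."""
    seen_first = False

    def fix(match):
        nonlocal seen_first
        token = match.group(0)
        if token.startswith("("):
            return token                       # parenthesised text is left alone
        is_first = not seen_first
        seen_first = True
        all_caps = len(token) > 1 and token.isupper()
        if is_first or all_caps or token in proper_names:
            return token
        return token.lower()

    return TOKEN.sub(fix, heading)

print(sentence_case("Types of Guns"))
print(sentence_case("Tubing Conveyed Perforating (TCP) Guns"))
print(sentence_case("Completions In The Gulf Of Mexico", {"Gulf", "Mexico"}))
```

Under this rule an all-uppercase heading such as PERFORATING comes through unchanged, so fully capitalised H1 headings would need their own treatment.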

Figures

The actual figure references — which might come before or after the mention in the text — look like this:

<P><STRONG>Fig. 1. Wellbore diagram of (a) an open hole completion and (b) a slotted liner completion.</STRONG></P>

We build each file name from the figure caption and the author's name, so every file name is unique.

Ideally, the first mention of a figure, e.g. Figure 1, should trigger the file call:

[[File:<filename.jpg>|thumb|Fig. 1 — Caption.]]
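A sketch of turning a caption paragraph into the file call. The naming scheme in `make_filename` (author surname plus the first six caption words) is an assumption made for illustration, not the scheme actually used, and it does not guarantee uniqueness on its own. Placing the call at the first mention of the figure is not covered here.

```python
import re

CAPTION = re.compile(r"<P><STRONG>Fig\. (\d+)\. (.+?)</STRONG></P>",
                     flags=re.IGNORECASE | re.DOTALL)

def make_filename(author, caption, max_words=6, ext="jpg"):
    """Hypothetical scheme: author surname + first few caption words."""
    surname = author.split()[-1]
    words = re.findall(r"[A-Za-z0-9]+", caption)[:max_words]
    return "{}_{}.{}".format(surname, "-".join(w.lower() for w in words), ext)

def figure_wikitext(caption_html, author):
    """Build a [[File:...]] call from a caption paragraph, or None if no caption is found."""
    match = CAPTION.search(caption_html)
    if match is None:
        return None
    number, caption = match.group(1), match.group(2).strip()
    filename = make_filename(author, caption)
    # \u2014 is the dash used between figure number and caption in the template above.
    return "[[File:{}|thumb|Fig. {} \u2014 {}]]".format(filename, number, caption)

caption_html = ("<P><STRONG>Fig. 1. Wellbore diagram of (a) an open hole completion "
                "and (b) a slotted liner completion.</STRONG></P>")
call = figure_wikitext(caption_html, "Stephen A. Holditch")
print(call)
```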

Lists

Horrible. Here's an example:

<P>-- Slotted liner<BR>
-- Screen and liner<BR>
-- Cemented liner</P>

This becomes:

* Slotted liner
* Screen and liner
* Cemented liner

Similar thing for ordered lists:

<P>1. Expendable gun<BR>
2. Semi-expendable gun<BR>
3. Retrievable, hollow carrier gun</P>

which becomes...

# Expendable gun
# Semi-expendable gun
# Retrievable, hollow carrier gun

Here's how we deal with all of this:

import re

# Convert unordered lists
text = re.sub(r"(<br>)*(\n)*\n-- ",r"\n* ",text)
text = re.sub(r"\n\n\*",r"\n*",text) # to handle double newlines

# Convert ordered lists
text = re.sub(r"(<br>)*(\n)*\n[0-9][0-9]?\. ",r"\n# ",text)
text = re.sub(r"\n\n#",r"\n#",text) # to handle double newlines

# Convert description lists
text = re.sub(r"<dd>",r"",text)
text = re.sub(r"</dd>",r"",text)
text = re.sub(r"<dl>",r"",text)
text = re.sub(r"</dl>",r"",text)
text = re.sub(r"<dt><strong>(?:[0-9]\. )?(.+?)</strong></dt>(\n)?",r"\n\n====\1====\n",text)
text = re.sub(r"<dt>-- (.+?)</dt>(\n)?",r"* \1\n",text)
text = re.sub(r"<dt>(.+?)</dt>(\n)?",r"* \1\n",text)

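# Any remaining <br>s are probably unnecessary line breaks, so treat them as plain-text list items.
text = re.sub(r"\n(.+?)<br>\n",r"\n* \1<br>\n",text) # prefix a bullet to a line that ends in <br>
text = re.sub(r"<br>\n?",r"\n* ",text) # turn each <br> (and the newline after it) into the start of the next item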
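A quick check of the unordered and ordered rules against the two examples above. Two assumptions are baked into the sample strings: the tags have already been lowercased (the patterns match `<br>`, not `<BR>`), and the `<p>` wrappers have already been replaced by newlines, since the patterns need a newline before the first item. The wrapper `convert_simple_lists` is a name made up for this sketch.

```python
import re

def convert_simple_lists(text):
    """Apply only the unordered- and ordered-list substitutions from the notes."""
    text = re.sub(r"(<br>)*(\n)*\n-- ", r"\n* ", text)
    text = re.sub(r"\n\n\*", r"\n*", text)
    text = re.sub(r"(<br>)*(\n)*\n[0-9][0-9]?\. ", r"\n# ", text)
    text = re.sub(r"\n\n#", r"\n#", text)
    return text

# Assumed pre-processing: lowercase tags, <p>...</p> already replaced by newlines.
unordered = "\n-- Slotted liner<br>\n-- Screen and liner<br>\n-- Cemented liner\n"
ordered = ("\n1. Expendable gun<br>\n2. Semi-expendable gun<br>\n"
           "3. Retrievable, hollow carrier gun\n")

print(convert_simple_lists(unordered))
print(convert_simple_lists(ordered))
```

If the `<p>` tag is still attached to the first item, that item has no preceding newline and stays unconverted, so the order of the cleanup steps matters.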