User:Matt/Content conversion

From AAPG Wiki
Revision as of 16:09, 1 October 2013 by Matt (talk | contribs) (updating notes)

Notes about what will be required to convert from Datapages format to MediaWiki wikitext. Based on:

Article information

We will preserve book information. We will not maintain 'canonical' book content in a separate namespace (SPE did this; SEG sort of did this too).

The article info in Datapages is semantically tagged (yay!). This is very useful indeed. Everything after <HR> and before <CJSTEXT> is header information about the article — book name, volume number, page range, etc:

<P><STRONG>Book Title:</STRONG> <CJSPUBTITLE>ME 10: Development Geology Reference Manual</CJSPUBTITLE></P>
<P><STRONG>Article/Chapter:</STRONG> <CJSTITLE>Well Completions: Part 9. Production Engineering Methods</CJSTITLE></P>
<P><STRONG>Subject Group:</STRONG> <CJSTOPIC>Oil--Methodology and Concepts</CJSTOPIC></P>
<P><STRONG>Spec. Pub. Type:</STRONG> <CJSTYPE>Methods in Exploration</CJSTYPE></P>
<P><STRONG>Author(s):</STRONG> <CJSAUTHOR>Stephen A. Holditch</CJSAUTHOR></P>
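Since the tags are semantic, pulling the header fields out could be as simple as one regex per tag. A minimal sketch, assuming each tag appears at most once and is never nested; the tag names come from the sample above, but the dictionary keys and the function name are made up for illustration:

```python
import re

# Tag names are taken from the header sample; the keys are our own labels.
CJS_FIELDS = {
    "book": "CJSPUBTITLE",
    "title": "CJSTITLE",
    "topic": "CJSTOPIC",
    "type": "CJSTYPE",
    "author": "CJSAUTHOR",
}


def parse_article_info(html):
    """Return a dict of the header fields that are present in the page."""
    info = {}
    for key, tag in CJS_FIELDS.items():
        pattern = r"<{0}>(.*?)</{0}>".format(tag)
        match = re.search(pattern, html, re.DOTALL | re.IGNORECASE)
        if match:
            info[key] = match.group(1).strip()
    return info


sample = (
    "<P><STRONG>Book Title:</STRONG> "
    "<CJSPUBTITLE>ME 10: Development Geology Reference Manual</CJSPUBTITLE></P>\n"
    "<P><STRONG>Author(s):</STRONG> <CJSAUTHOR>Stephen A. Holditch</CJSAUTHOR></P>\n"
)
info = parse_article_info(sample)
```

Fields that are missing from a page are simply left out of the dictionary, so the caller can decide what to do about them.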

Most of this is repeated in the page's <head>, and this is the easiest place to get the URLs for the HTML and PDF resources:

<!-- Juicy Metadata. Google Scholar eats this up. -->
 <!-- journals -->
  <meta name="citation_publisher" content="AAPG Special Volumes">
  <meta name="citation_title" content="Well Completions: Part 9. Production Engineering Methods">
  <meta name="DC.Title" content="Well Completions: Part 9. Production Engineering Methods">
  <meta name="citation_author" content="Stephen A. Holditch">
  <meta name="DC.Contributor" content="Stephen A. Holditch">
  <meta name="citation_publication_date" content="1992">
  <meta name="DC.Date" content="1992">
  <meta name="citation_volume" content="95">
  <meta name="citation_firstpage" content="463">
  <meta name="citation_lastpage" content="468">
  <meta name="citation_fulltext_html_url" content="">
  <meta name="citation_pdf_url" content="">
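The <head> metadata can probably be read with the standard library's HTML parser rather than regexes. A rough sketch: the class name is invented, and it keeps a list per name on the assumption that a multi-author page repeats the citation_author tag (not checked against Datapages):

```python
from html.parser import HTMLParser


class MetaCollector(HTMLParser):
    """Collect <meta name="citation_..."> values from a page head."""

    def __init__(self):
        super().__init__()
        self.meta = {}

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attrs = dict(attrs)
        name, content = attrs.get("name"), attrs.get("content")
        if name and name.startswith("citation_") and content is not None:
            # Keep a list, in case a name occurs more than once.
            self.meta.setdefault(name, []).append(content)


head = """
<meta name="citation_title" content="Well Completions: Part 9. Production Engineering Methods">
<meta name="citation_author" content="Stephen A. Holditch">
<meta name="citation_firstpage" content="463">
<meta name="citation_lastpage" content="468">
<meta name="citation_pdf_url" content="">
"""
collector = MetaCollector()
collector.feed(head)
meta = collector.meta
```

An empty content attribute comes back as an empty string, so missing URLs are easy to spot.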

Technical content

Everything after <CJSTEXT> and before </CJSTEXT> is the text of the article.
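Grabbing the article body is then a single non-greedy match. A small sketch, assuming there is exactly one <CJSTEXT> block per page; the function name is hypothetical:

```python
import re


def extract_article_text(html):
    """Return the text between <CJSTEXT> and </CJSTEXT>, or None if absent."""
    match = re.search(r"<CJSTEXT>(.*?)</CJSTEXT>", html, re.DOTALL | re.IGNORECASE)
    return match.group(1).strip() if match else None


page = "<HR>header stuff<CJSTEXT>\n<P>Body text.</P>\n</CJSTEXT>footer"
body = extract_article_text(page)
```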

Many things need to be converted:

  • Remove all the <br>, <br /> and <BR> tags
  • Remove the page breaks, which are contained in <blockquote> tags
  • Delete the <P> tags
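The clean-up in the list above might look something like this. It is only a sketch and makes a few assumptions: it runs last, after headings, captions and lists have been interpreted (those rely on the <P> and <BR> markup); a <blockquote> contains nothing but a page break; and a <br> becomes a newline rather than vanishing, so words on either side do not get glued together. The "Page 464" content in the sample is invented:

```python
import re


def strip_layout_tags(text):
    """Remove page-break blocks, <br> tags and <P> tags from article text."""
    # Drop page-break blocks entirely, including their contents.
    text = re.sub(r"<blockquote\b[^>]*>.*?</blockquote>", "", text,
                  flags=re.DOTALL | re.IGNORECASE)
    # Turn <br>, <br/> and <br /> into plain newlines.
    text = re.sub(r"<br\s*/?>", "\n", text, flags=re.IGNORECASE)
    # End each paragraph with a blank line, then drop the opening tags.
    text = re.sub(r"</p>", "\n\n", text, flags=re.IGNORECASE)
    text = re.sub(r"<p\b[^>]*>", "", text, flags=re.IGNORECASE)
    # Collapse runs of blank lines left behind by the removals.
    text = re.sub(r"\n{3,}", "\n\n", text)
    return text.strip()


sample = ("<P>First paragraph.</P>\n"
          "<blockquote>Page 464</blockquote>\n"
          "<P>Second<BR>paragraph.</P>")
cleaned = strip_layout_tags(sample)
```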


Headings can be interpreted, per these examples:

  • H1, article name — <P><STRONG>PERFORATING</STRONG></P>
  • H2, section name — <P><STRONG>Types of Guns</STRONG></P>
  • H3, subsection name — <P ALIGN=CENTER><STRONG>Expendable Gun</STRONG></P>

They all need to be converted to sentence case, leaving proper names, all-uppercase words, and anything in parentheses alone.
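A possible way to turn those three patterns into wikitext headings is sketched below. Several things here are guesses: level N is written with N+1 equals signs, a heading written entirely in capitals is treated as H1 and lowercased (otherwise PERFORATING would stay as it is), proper names have to be supplied by hand as single words, and anything starting with "Fig." is skipped because captions share the same markup. The TCP heading in the test is invented:

```python
import re

HEADING_RE = re.compile(
    r"<P\b(?P<attrs>[^>]*)><STRONG>(?P<title>[^<]+)</STRONG></P>", re.IGNORECASE
)
WORD_RE = re.compile(r"[^\W\d_]+")  # runs of letters


def sentence_case(heading, proper_names=()):
    """Lowercase ordinary words; keep names, acronyms and parentheses."""
    all_caps = heading.isupper()

    def fix_word(match):
        word = match.group(0)
        if word in proper_names:
            return word
        if word.isupper() and len(word) > 1 and not all_caps:
            return word  # looks like an acronym
        return word.lower()

    # Odd-numbered pieces are parenthesised groups and are left untouched.
    pieces = re.split(r"(\([^)]*\))", heading)
    pieces = [p if p.startswith("(") else WORD_RE.sub(fix_word, p) for p in pieces]
    text = "".join(pieces)
    return text[:1].upper() + text[1:]


def convert_heading(match):
    title = match.group("title").strip()
    if re.match(r"Fig\.\s*\d", title):
        return match.group(0)  # a figure caption, not a heading
    if "center" in match.group("attrs").lower():
        level = 3
    elif title.isupper():
        level = 1
    else:
        level = 2
    marks = "=" * (level + 1)
    return "\n{0}{1}{0}\n".format(marks, sentence_case(title))


def convert_headings(text):
    return HEADING_RE.sub(convert_heading, text)
```

Running convert_headings over the three examples gives ==Perforating==, ===Types of guns=== and ====Expendable gun====, each on its own line.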


The actual figure references — which might come before or after the mention in the text — look like this:

<P><STRONG>Fig. 1. Wellbore diagram of (a) an open hole completion and (b) a slotted liner completion.</STRONG></P>

We make a file name out of the figure caption and the author's name, so every file is unique:

Ideally, the first mention of a figure, e.g. Figure 1, should trigger the file call:

[[File:<filename.jpg>|thumb|Fig. 1 — Caption.]]
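One way this could work: collect the captions first, remove them from the text, then put the file call on its own line just before the line that first mentions each figure. This is a sketch only. The file names are passed in from outside (the naming scheme is not reproduced here, and example.jpg is a placeholder), mentions are assumed to be spelled "Figure N", and all function names are invented:

```python
import re

CAPTION_RE = re.compile(
    r"<P><STRONG>Fig\. (\d+)\. (.+?)</STRONG></P>\n?", re.IGNORECASE | re.DOTALL
)


def collect_captions(text):
    """Return ({figure number: caption}, text with the caption lines removed)."""
    captions = {int(num): " ".join(cap.split())
                for num, cap in CAPTION_RE.findall(text)}
    return captions, CAPTION_RE.sub("", text)


def file_call(number, caption, filename):
    # \u2014 is the dash used in the file-call template above.
    return "[[File:{0}|thumb|Fig. {1} \u2014 {2}]]".format(filename, number, caption)


def insert_file_calls(text, captions, filenames):
    """Put each file call on the line before the first 'Figure N' mention."""
    for number in sorted(captions):
        mention = re.search(r"\bFigure {0}\b".format(number), text)
        if mention is None or number not in filenames:
            continue
        line_start = text.rfind("\n", 0, mention.start()) + 1
        call = file_call(number, captions[number], filenames[number])
        text = text[:line_start] + call + "\n" + text[line_start:]
    return text


raw = ("Completions vary (Figure 1).\n"
       "<P><STRONG>Fig. 1. Wellbore diagram.</STRONG></P>\n"
       "See Figure 1 again.")
captions, stripped = collect_captions(raw)
wiki = insert_file_calls(stripped, captions, {1: "example.jpg"})
```

Figures that are never mentioned in the text are silently skipped here; a real run would probably want to report them.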


Lists are horrible. Here's an example of an unordered list:

<P>-- Slotted liner<BR>
-- Screen and liner<BR>
-- Cemented liner</P>

This becomes:

* Slotted liner
* Screen and liner
* Cemented liner

Similar thing for ordered lists:

<P>1. Expendable gun<BR>
2. Semi-expendable gun<BR>
3. Retrievable, hollow carrier gun</P>

which becomes...

# Expendable gun
# Semi-expendable gun
# Retrievable, hollow carrier gun

Here's how we deal with all of this:

import re

# Convert unordered lists
text = re.sub(r"(<br>)*(\n)*\n-- ",r"\n* ",text)
text = re.sub(r"\n\n\*",r"\n*",text) # to handle double newlines

# Convert ordered lists
text = re.sub(r"(<br>)*(\n)*\n[0-9][0-9]?\. ",r"\n# ",text)
text = re.sub(r"\n\n#",r"\n#",text) # to handle double newlines

# Convert description lists
text = re.sub(r"</?d[dl]>",r"",text) # drop <dd>, </dd>, <dl> and </dl>
text = re.sub(r"<dt><strong>(?:[0-9]\. )?(.+?)</strong></dt>(\n)?",r"\n\n====\1====\n",text)
text = re.sub(r"<dt>-- (.+?)</dt>(\n)?",r"* \1\n",text)
text = re.sub(r"<dt>(.+?)</dt>(\n)?",r"* \1\n",text)

# Any remaining <br>s are probably unnecessary line breaks; treat those lines as ordinary lists.
text = re.sub(r"\n(.+?)<br>\n",r"\n* \1<br>\n",text) # bullet the first line of each run
text = re.sub(r"<br>\n?",r"\n* ",text) # each remaining <br> starts the next item
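To check the first four rules against the two examples, they can be wrapped in a function and run on the raw markup. Note that the patterns only match lowercase <br> and need a newline in front of the first item, so this sketch adds a normalising pre-step (lowercase the <BR> tags, turn <P> and </P> into newlines). That pre-step and both function names are assumptions, not part of the rules above:

```python
import re


def normalise(text):
    """Assumed pre-step: lowercase <BR> tags, turn <P> and </P> into newlines."""
    text = re.sub(r"<br\s*/?>", "<br>", text, flags=re.IGNORECASE)
    return re.sub(r"</?p\b[^>]*>", "\n", text, flags=re.IGNORECASE)


def convert_simple_lists(text):
    # Unordered lists: lines starting with "-- "
    text = re.sub(r"(<br>)*(\n)*\n-- ", r"\n* ", text)
    text = re.sub(r"\n\n\*", r"\n*", text)
    # Ordered lists: lines starting with "1. " up to "99. "
    text = re.sub(r"(<br>)*(\n)*\n[0-9][0-9]?\. ", r"\n# ", text)
    text = re.sub(r"\n\n#", r"\n#", text)
    return text


unordered = "<P>-- Slotted liner<BR>\n-- Screen and liner<BR>\n-- Cemented liner</P>"
ordered = ("<P>1. Expendable gun<BR>\n2. Semi-expendable gun<BR>\n"
           "3. Retrievable, hollow carrier gun</P>")

bullets = convert_simple_lists(normalise(unordered)).strip()
numbers = convert_simple_lists(normalise(ordered)).strip()
```

Without the pre-step, the first item of each list is missed, because it follows <P> rather than a newline.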