User:Matt/Content conversion
Notes about what will be required to convert from Datapages format to MediaWiki wikitext. Based on:
- Well completion — Chapter from Part 9 of Methods 10.
Article information
We will preserve book information. We will not maintain 'canonical' book content in a separate namespace (SPE did this; SEG sort of did this too).
The article info in Datapages is semantically tagged (yay!). This is very useful indeed. Everything after <HR> and before <CJSTEXT> is header information about the article (book name, volume number, page range, etc.):
<P><STRONG>Pub. Id:</STRONG> <CJSVOLUME>A095</CJSVOLUME> (<CJSYEAR>1992</CJSYEAR>)</P>
<P><STRONG>First Page:</STRONG> <CJSFIRSTPAGE>463</CJSFIRSTPAGE></P>
<P><STRONG>Last Page:</STRONG> <CJSLASTPAGE>468</CJSLASTPAGE></P>
<P><STRONG>Book Title:</STRONG> <CJSPUBTITLE>ME 10: Development Geology Reference Manual</CJSPUBTITLE></P>
<P><STRONG>Article/Chapter:</STRONG> <CJSTITLE>Well Completions: Part 9. Production Engineering Methods</CJSTITLE></P>
<P><STRONG>Subject Group:</STRONG> <CJSTOPIC>Oil--Methodology and Concepts</CJSTOPIC></P>
<P><STRONG>Spec. Pub. Type:</STRONG> <CJSTYPE>Methods in Exploration</CJSTYPE></P>
<P><STRONG>Pub. Year:</STRONG> <CJSVOLUMEYEAR>1992</CJSVOLUMEYEAR></P>
<P><STRONG>Author(s):</STRONG> <CJSAUTHOR>Stephen A. Holditch</CJSAUTHOR></P>
<P><STRONG>Text:</STRONG></P>
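Since the header is semantically tagged, pulling it into a dictionary could be as simple as the sketch below. The function names (header_block, parse_article_info) and the choice of lower-cased keys are assumptions made for illustration, not part of any existing converter. A repeated tag (for example several authors) would overwrite the earlier value in this version.

```python
import re

# Matches any semantic tag whose name starts with CJS,
# e.g. <CJSVOLUME>A095</CJSVOLUME>.
CJS_TAG = re.compile(r"<(CJS[A-Z]+)>(.*?)</\1>", re.DOTALL)


def header_block(page_html):
    """Return the part of the page after <HR> and before <CJSTEXT>."""
    _, _, after_hr = page_html.partition("<HR>")
    header, _, _ = after_hr.partition("<CJSTEXT>")
    return header


def parse_article_info(header_html):
    """Map each CJS tag (prefix dropped, lower-cased) to its text."""
    return {
        name[3:].lower(): value.strip()
        for name, value in CJS_TAG.findall(header_html)
    }


# Hypothetical page fragment built from the header fields shown above.
page = (
    "<HR>"
    "<P><STRONG>Pub. Id:</STRONG> <CJSVOLUME>A095</CJSVOLUME> "
    "(<CJSYEAR>1992</CJSYEAR>)</P>"
    "<P><STRONG>First Page:</STRONG> <CJSFIRSTPAGE>463</CJSFIRSTPAGE></P>"
    "<P><STRONG>Author(s):</STRONG> <CJSAUTHOR>Stephen A. Holditch</CJSAUTHOR></P>"
    "<CJSTEXT>article body</CJSTEXT>"
)
info = parse_article_info(header_block(page))
print(info)
```

Slicing out the header first matters, because <CJSTEXT> itself would otherwise match the CJS pattern and swallow the whole article body.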
Most of this is repeated in the page's <head>, and this is the easiest place to get the URLs for the HTML and PDF resources:
<!-- Juicy Metadata. Google Scholar eats this up. -->
<!-- journals -->
<meta name="citation_publisher" content="AAPG Special Volumes">
<meta name="citation_title" content="Well Completions: Part 9. Production Engineering Methods">
<meta name="DC.Title" content="Well Completions: Part 9. Production Engineering Methods">
<meta name="citation_author" content="Stephen A. Holditch">
<meta name="DC.Contributor" content="Stephen A. Holditch">
<meta name="citation_publication_date" content="1992">
<meta name="DC.Date" content="1992">
<meta name="citation_volume" content="95">
<meta name="citation_firstpage" content="463">
<meta name="citation_lastpage" content="468">
<meta name="citation_fulltext_html_url" content="http://archives.datapages.com/data/specpubs/methodo1/data/a095/a095/0001/0450/0463.htm">
<meta name="citation_pdf_url" content="http://archives.datapages.com/data/specpubs/methodo1/images/a095/a0950001/0450/04630.pdf">
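One way to collect the citation_* values, including the two URLs, is with the standard library's HTMLParser. This is only a sketch: the class name CitationMetaParser is made up for illustration, and it deliberately ignores the DC.* duplicates.

```python
from html.parser import HTMLParser


class CitationMetaParser(HTMLParser):
    """Collect <meta name="citation_..." content="..."> pairs."""

    def __init__(self):
        super().__init__()
        self.meta = {}

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attrs = dict(attrs)
        name = attrs.get("name") or ""
        if name.startswith("citation_"):
            self.meta[name] = attrs.get("content") or ""


# A few of the tags from the <head> shown above.
head = """
<meta name="citation_title" content="Well Completions: Part 9. Production Engineering Methods">
<meta name="DC.Title" content="Well Completions: Part 9. Production Engineering Methods">
<meta name="citation_volume" content="95">
<meta name="citation_pdf_url" content="http://archives.datapages.com/data/specpubs/methodo1/images/a095/a0950001/0450/04630.pdf">
"""

parser = CitationMetaParser()
parser.feed(head)
citation = parser.meta
print(citation["citation_pdf_url"])
```

Note that citation_volume here is "95", whereas the CJSVOLUME tag in the body gives "A095", so the two sources are not interchangeable for every field.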
Technical content
Everything after <CJSTEXT> and before </CJSTEXT> is the text of the article.
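Isolating the article text is then a single non-greedy match. A minimal sketch, assuming one <CJSTEXT> element per page (the function name article_text is hypothetical):

```python
import re

# DOTALL lets "." span the line breaks inside the article body.
CJSTEXT = re.compile(r"<CJSTEXT>(.*?)</CJSTEXT>", re.DOTALL | re.IGNORECASE)


def article_text(page_html):
    """Return the markup between <CJSTEXT> and </CJSTEXT>, or None."""
    match = CJSTEXT.search(page_html)
    return match.group(1) if match else None


page = "<P><STRONG>Text:</STRONG></P>\n<CJSTEXT>\n<P>Body text.</P>\n</CJSTEXT>\n<HR>"
body = article_text(page)
print(body)
```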
Many things need to be converted:
- Remove all the <br> and <br /> tags
- Remove the page breaks, which are contained in <blockquote> tags
- Delete the <P> and <BR> tags
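The three clean-up steps might look something like the sketch below. Two choices here are assumptions rather than anything stated in these notes: a closing </P> is kept as a blank line so paragraphs survive, and each <br> becomes a space so adjacent words do not run together. Because headings, figures and lists are all recognized by their <P> and <BR> markup, a pass like this would have to run after those conversions.

```python
import re


def strip_layout_tags(text):
    """Remove page breaks and paragraph/line-break tags from article text."""
    # Page breaks live inside <blockquote> elements; drop each element whole.
    text = re.sub(r"<blockquote\b[^>]*>.*?</blockquote>", "", text,
                  flags=re.DOTALL | re.IGNORECASE)
    # Assumption: </P> marks a paragraph boundary, kept as a blank line.
    text = re.sub(r"</p\s*>", "\n\n", text, flags=re.IGNORECASE)
    # Assumption: every <br>, <BR> or <br /> becomes a single space.
    text = re.sub(r"<br\s*/?>", " ", text, flags=re.IGNORECASE)
    # Delete the remaining <P ...> openers.
    text = re.sub(r"<p\b[^>]*>", "", text, flags=re.IGNORECASE)
    # Collapse the runs of newlines left behind by the deletions.
    return re.sub(r"\n{3,}", "\n\n", text).strip()


# Hypothetical markup; the real page-break content is not shown in these notes.
sample = (
    "<P>First line.<BR>Second line.</P>"
    "<blockquote>page break</blockquote>"
    "<P>Next<br />para.</P>"
)
cleaned = strip_layout_tags(sample)
print(cleaned)
```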
Headings
Headings can be interpreted, per these examples:
- H1, article name: <P><STRONG>PERFORATING</STRONG></P>
- H2, section name: <P><STRONG>Types of Guns</STRONG></P>
- H3, subsection name: <P ALIGN=CENTER><STRONG>Expendable Gun</STRONG></P>
They all need to be converted to sentence case, leaving proper names, all-uppercase words, and text in parentheses alone.
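A rough sketch of how the three patterns could be told apart and re-cased follows. Several things are assumptions: the mapping of H1/H2/H3 to ==, === and ====, the rule that a heading written entirely in capitals is an H1 and gets only an initial capital, and the proper_names argument, which stands in for a hand-curated list. Figure captions use the same <P><STRONG> markup, so the pattern skips anything starting with "Fig.".

```python
import re

WORD = re.compile(r"[A-Za-z][A-Za-z'-]*")
# <P ...><STRONG>Title</STRONG></P>, but not a figure caption.
HEADING = re.compile(
    r"<P\b(?P<attrs>[^>]*)><STRONG>(?!Fig\.)(?P<title>[^<]+)</STRONG></P>",
    re.IGNORECASE,
)


def sentence_case(title, proper_names=frozenset()):
    """Lower-case all but the first word, sparing acronyms and parentheses."""
    if title.isupper():
        # Whole heading in capitals (the H1 pattern): keep the initial only.
        return title.capitalize()
    seen_first = False

    def fix(match):
        nonlocal seen_first
        word = match.group(0)
        keep = (not seen_first or word in proper_names
                or (len(word) > 1 and word.isupper()))
        seen_first = True
        return word if keep else word.lower()

    # Odd-indexed chunks are parenthesized and left untouched.
    chunks = re.split(r"(\([^)]*\))", title)
    return "".join(c if i % 2 else WORD.sub(fix, c)
                   for i, c in enumerate(chunks))


def convert_headings(text):
    def repl(match):
        title = match.group("title").strip()
        if "align=center" in match.group("attrs").lower():
            level = 4  # H3, subsection (assumed wikitext level)
        elif title.isupper():
            level = 2  # H1
        else:
            level = 3  # H2
        marks = "=" * level
        return f"\n{marks}{sentence_case(title)}{marks}\n"

    return HEADING.sub(repl, text)


print(convert_headings("<P><STRONG>Types of Guns</STRONG></P>"))
```

One known weakness of this version: an all-capitals heading that contains an acronym loses the acronym's capitals, so those would need checking by hand.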
Figures
The actual figure captions, which might come before or after the mention in the text, look like this:
<P><STRONG>Fig. 1. Wellbore diagram of (a) an open hole completion and (b) a slotted liner completion.</STRONG></P>
We make a file name out of the figure caption and the author's name, so every file is unique:
Ideally, the first mention of a figure, e.g. Figure 1, should trigger the file call:
[[File:<filename.jpg>|thumb|Fig. 1 — Caption.]]
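The idea of moving each caption to the first mention of its figure can be sketched as follows. The naming scheme in make_filename (author surname plus the first few caption words) is purely hypothetical, since the real scheme is not spelled out in these notes. The sketch also assumes paragraphs are already separated by blank lines and that mentions look like "Figure 1" or "Fig. 1".

```python
import re

CAPTION = re.compile(
    r"<P><STRONG>Fig\. (?P<number>\d+)\. (?P<caption>.+?)</STRONG></P>",
    re.DOTALL,
)


def make_filename(author, caption, max_words=5):
    """Hypothetical naming scheme: surname plus leading caption words."""
    surname = author.split()[-1]
    words = re.findall(r"[A-Za-z0-9]+", caption)[:max_words]
    return "_".join([surname] + words) + ".jpg"


def place_figures(text, author):
    """Remove caption paragraphs and put a File call at each first mention."""
    captions = {}

    def grab(match):
        captions[match.group("number")] = match.group("caption").strip()
        return ""

    text = CAPTION.sub(grab, text)
    paragraphs = [p for p in text.split("\n\n") if p.strip()]
    for number, caption in captions.items():
        # \u2014 is the dash used in the File template above.
        call = (f"[[File:{make_filename(author, caption)}"
                f"|thumb|Fig. {number} \u2014 {caption}]]")
        mention = re.compile(rf"\bFig(?:ure|\.)\s*{number}\b")
        for i, para in enumerate(paragraphs):
            if mention.search(para):
                paragraphs[i] = call + "\n" + para
                break
        else:
            # Never mentioned in the text: keep the figure at the end.
            paragraphs.append(call)
    return "\n\n".join(paragraphs)


sample = (
    "Open hole completions are shown in Figure 1.\n\n"
    "<P><STRONG>Fig. 1. Wellbore diagram.</STRONG></P>\n\n"
    "More text."
)
placed = place_figures(sample, "Stephen A. Holditch")
print(placed)
```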
Lists
Horrible. Here's an example:
<P>-- Slotted liner<BR> -- Screen and liner<BR> -- Cemented liner</P>
This becomes:
* Slotted liner
* Screen and liner
* Cemented liner
Similar thing for ordered lists:
<P>1. Expendable gun<BR> 2. Semi-expendable gun<BR> 3. Retrievable, hollow carrier gun</P>
which becomes...
# Expendable gun
# Semi-expendable gun
# Retrievable, hollow carrier gun
Here's how we deal with all of this:
import re

def convert_lists(text):
    # Note: these patterns match lower-case tags only (<br>, <dt>, ...).

    # Convert unordered lists
    text = re.sub(r"(<br>)*(\n)*\n-- ", r"\n* ", text)
    text = re.sub(r"\n\n\*", r"\n*", text)  # to handle double newlines

    # Convert ordered lists
    text = re.sub(r"(<br>)*(\n)*\n[0-9][0-9]?\. ", r"\n# ", text)
    text = re.sub(r"\n\n#", r"\n#", text)  # to handle double newlines

    # Convert description lists
    text = re.sub(r"<dd>", r"", text)
    text = re.sub(r"</dd>", r"", text)
    text = re.sub(r"<dl>", r"", text)
    text = re.sub(r"</dl>", r"", text)
    text = re.sub(r"<dt><strong>(?:[0-9]\. )?(.+?)</strong></dt>(\n)?", r"\n\n====\1====\n", text)
    text = re.sub(r"<dt>-- (.+?)</dt>(\n)?", r"* \1\n", text)
    text = re.sub(r"<dt>(.+?)</dt>(\n)?", r"* \1\n", text)

    # Any remaining <br>s are probably unnecessary line breaks and can become ordinary list items.
    text = re.sub(r"\n(.+?)<br>\n", r"\n* \1<br>\n", text)
    text = re.sub(r"<br>\n?", r"\n* ", text)  # each leftover <br> starts a new list item
    return text