Difference between revisions of "User:Matt/Content conversion"


Latest revision as of 16:09, 1 October 2013

Notes about what will be required to convert from Datapages format to MediaWiki wikitext. Based on:

  • Well completion, a chapter from Part 9 of Methods 10.

Article information

We will preserve book information. We will not maintain 'canonical' book content in a separate namespace (SPE did this; SEG sort of did this too).

The article info in Datapages is semantically tagged (yay!). This is very useful indeed. Everything after <HR> and before <CJSTEXT> is header information about the article — book name, volume number, page range, etc:

<P><STRONG>Pub. Id:</STRONG> <CJSVOLUME>A095</CJSVOLUME> (<CJSYEAR>1992</CJSYEAR>)</P>
<P><STRONG>First Page:</STRONG> <CJSFIRSTPAGE>463</CJSFIRSTPAGE></P>
<P><STRONG>Last Page:</STRONG> <CJSLASTPAGE>468</CJSLASTPAGE></P>
<P><STRONG>Book Title:</STRONG> <CJSPUBTITLE>ME 10: Development Geology Reference Manual</CJSPUBTITLE></P>
<P><STRONG>Article/Chapter:</STRONG> <CJSTITLE>Well Completions: Part 9. Production Engineering Methods</CJSTITLE></P>
<P><STRONG>Subject Group:</STRONG> <CJSTOPIC>Oil--Methodology and Concepts</CJSTOPIC></P>
<P><STRONG>Spec. Pub. Type:</STRONG> <CJSTYPE>Methods in Exploration</CJSTYPE></P>
<P><STRONG>Pub. Year:</STRONG> <CJSVOLUMEYEAR>1992</CJSVOLUMEYEAR></P>
<P><STRONG>Author(s):</STRONG> <CJSAUTHOR>Stephen A. Holditch</CJSAUTHOR></P>
<P><STRONG>Text:</STRONG></P>
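
Because the header fields are wrapped in their own CJS tags, they could probably be pulled out with a single regular expression. The following is only a sketch: the function name `parse_cjs_header` and the choice to lowercase the field names are assumptions for illustration, not part of the Datapages format.

```python
import re

def parse_cjs_header(html):
    """Return a dict of CJS header fields found before <CJSTEXT>."""
    # Only look at the header: everything before the article text starts.
    header = html.split("<CJSTEXT>", 1)[0]
    # Match <CJSNAME>value</CJSNAME>; the backreference \1 forces the
    # closing tag to carry the same name as the opening tag.
    pairs = re.findall(r"<CJS([A-Z]+)>(.*?)</CJS\1>", header, flags=re.DOTALL)
    return {name.lower(): value.strip() for name, value in pairs}

# A shortened copy of the header shown above.
sample = (
    "<P><STRONG>Pub. Id:</STRONG> <CJSVOLUME>A095</CJSVOLUME> (<CJSYEAR>1992</CJSYEAR>)</P>\n"
    "<P><STRONG>First Page:</STRONG> <CJSFIRSTPAGE>463</CJSFIRSTPAGE></P>\n"
    "<P><STRONG>Author(s):</STRONG> <CJSAUTHOR>Stephen A. Holditch</CJSAUTHOR></P>\n"
    "<CJSTEXT>article text</CJSTEXT>"
)
info = parse_cjs_header(sample)
```

Splitting on `<CJSTEXT>` first keeps the article body out of the result, so `info` only holds keys such as `volume`, `year`, `firstpage` and `author`.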

Most of this is repeated in the page's <head>, and this is the easiest place to get the URLs for the HTML and PDF resources:

<!-- Juicy Metadata. Google Scholar eats this up. -->
 <!-- journals -->
  <meta name="citation_publisher" content="AAPG Special Volumes">
  <meta name="citation_title" content="Well Completions: Part 9. Production Engineering Methods">
  <meta name="DC.Title" content="Well Completions: Part 9. Production Engineering Methods">
  <meta name="citation_author" content="Stephen A. Holditch">
  <meta name="DC.Contributor" content="Stephen A. Holditch">
  <meta name="citation_publication_date" content="1992">
  <meta name="DC.Date" content="1992">
  <meta name="citation_volume" content="95">
  <meta name="citation_firstpage" content="463">
  <meta name="citation_lastpage" content="468">
  <meta name="citation_fulltext_html_url" content="http://archives.datapages.com/data/specpubs/methodo1/data/a095/a095/0001/0450/0463.htm">
  <meta name="citation_pdf_url" content="http://archives.datapages.com/data/specpubs/methodo1/images/a095/a0950001/0450/04630.pdf">
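
One way to read these `<meta>` tags is with the standard library's `html.parser`. This is a minimal sketch; the class name `MetaCollector` and the helper `citation_meta` are made up for illustration, and keeping only the first value per name is an assumption.

```python
from html.parser import HTMLParser

class MetaCollector(HTMLParser):
    """Collect name/content pairs from <meta> tags."""

    def __init__(self):
        super().__init__()
        self.meta = {}

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attrs = dict(attrs)
        name, content = attrs.get("name"), attrs.get("content")
        if name and content is not None:
            # Keep the first value if a name appears more than once.
            self.meta.setdefault(name, content)

def citation_meta(html):
    parser = MetaCollector()
    parser.feed(html)
    return parser.meta

# Three of the tags shown above.
sample_head = """
  <meta name="citation_firstpage" content="463">
  <meta name="citation_fulltext_html_url" content="http://archives.datapages.com/data/specpubs/methodo1/data/a095/a095/0001/0450/0463.htm">
  <meta name="citation_pdf_url" content="http://archives.datapages.com/data/specpubs/methodo1/images/a095/a0950001/0450/04630.pdf">
"""
meta = citation_meta(sample_head)
pdf_url = meta["citation_pdf_url"]
```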

Technical content

Everything after <CJSTEXT> and before </CJSTEXT> is the text of the article.

Many things need to be converted:

  • Remove all the <br> and <br /> tags
  • Remove the page breaks, which are contained in <blockquote> tags
  • Delete the <P> and <BR> tags
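
A rough sketch of the first two steps (isolating the article text and dropping the page breaks) might look like this. The function names and the sample string, including the text inside the blockquote, are invented for illustration. The sketch leaves the `<P>` and `<BR>` tags alone, on the assumption that they are still needed to recognise lists later.

```python
import re

def extract_body(html):
    """Return everything between <CJSTEXT> and </CJSTEXT>, or '' if absent."""
    match = re.search(r"<CJSTEXT>(.*?)</CJSTEXT>", html, flags=re.DOTALL)
    return match.group(1) if match else ""

def strip_page_breaks(text):
    """Remove <blockquote> elements, which hold page-break information."""
    return re.sub(
        r"<blockquote>.*?</blockquote>", "", text,
        flags=re.DOTALL | re.IGNORECASE,
    )

# Invented sample, shaped like a Datapages page.
sample_page = (
    "<HR><P>header</P><CJSTEXT><P>Intro text.</P>\n"
    "<BLOCKQUOTE><P>page break marker</P></BLOCKQUOTE>\n"
    "<P>More text.</P></CJSTEXT>"
)
body = strip_page_breaks(extract_body(sample_page))
```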

Headings

Headings can be interpreted, per these examples:

  • H1, article name — <P><STRONG>PERFORATING</STRONG></P>
  • H2, section name — <P><STRONG>Types of Guns</STRONG></P>
  • H3, subsection name — <P ALIGN=CENTER><STRONG>Expendable Gun</STRONG></P>

They all need to be converted to sentence case, but leaving proper names, all-uppercase words, and parentheses alone.
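
The heading rules above could be sketched roughly as follows. Everything here is an assumption about how one might do it: the names `heading_level` and `sentence_case` are hypothetical, proper names have to be supplied by hand, and the exclusion of figure captions is a guess based on the fact that captions use the same `<P><STRONG>` markup. Read literally, the rule leaves an all-uppercase heading such as PERFORATING untouched.

```python
import re

# Captions look like headings, so anything starting with "Fig. " is skipped.
HEADING_RE = re.compile(r"<P( ALIGN=CENTER)?><STRONG>(?!Fig\. )(.+?)</STRONG></P>")

def heading_level(line):
    """Return 1, 2 or 3 for a heading line, or None if it is not a heading."""
    match = HEADING_RE.fullmatch(line.strip())
    if match is None:
        return None
    centered, title = match.groups()
    if centered:
        return 3
    return 1 if title.isupper() else 2

def sentence_case(heading, proper_names=frozenset()):
    """Lowercase all but the first word, leaving some tokens alone."""
    # A parenthesised chunk counts as one token so it is never altered.
    tokens = re.findall(r"\([^)]*\)|\S+", heading)
    out = []
    for i, tok in enumerate(tokens):
        keep = (
            tok.startswith("(")
            or tok in proper_names
            or (len(tok) > 1 and tok.isupper())
        )
        if keep:
            out.append(tok)
        elif i == 0:
            out.append(tok[:1].upper() + tok[1:].lower())
        else:
            out.append(tok.lower())
    return " ".join(out)

level = heading_level("<P ALIGN=CENTER><STRONG>Expendable Gun</STRONG></P>")
title = sentence_case("Expendable Gun")
```

Proper names are passed in explicitly, for example `sentence_case("Completions in the Gulf of Mexico", proper_names={"Gulf", "Mexico"})`, because there is no reliable way to detect them from the markup alone.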

Figures

The actual figure references — which might come before or after the mention in the text — look like this:

<P><STRONG>Fig. 1. Wellbore diagram of (a) an open hole completion and (b) a slotted liner completion.</STRONG></P>

We make a file name out of the figure caption and the author's name, so every file is unique.

Ideally, the first mention of a figure, e.g. Figure 1, should trigger the file call:

[[File:<filename.jpg>|thumb|Fig. 1 — Caption.]]
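
As an illustration only, here is one way a caption could be turned into a file name and a file call. The naming scheme (author slug, figure number, first six caption words) is a hypothetical choice, not the scheme actually used, and `slug` and `file_call` are invented names. A scheme like this makes collisions unlikely rather than impossible.

```python
import re

CAPTION_RE = re.compile(r"<P><STRONG>Fig\. (\d+)\. (.+?)</STRONG></P>")

def slug(text, max_words=6):
    """Lowercase the first few alphanumeric words and join them with hyphens."""
    words = re.findall(r"[A-Za-z0-9]+", text)
    return "-".join(word.lower() for word in words[:max_words])

def file_call(caption_html, author, ext="jpg"):
    """Build a [[File:...]] call from a Datapages caption, or return None."""
    match = CAPTION_RE.search(caption_html)
    if match is None:
        return None
    number, caption = match.groups()
    # Hypothetical naming scheme: author, figure number, start of caption.
    filename = f"{slug(author)}_fig{number}_{slug(caption)}.{ext}"
    # \u2014 is the dash used in the file call format shown above.
    return f"[[File:{filename}|thumb|Fig. {number} \u2014 {caption}]]"

caption_html = (
    "<P><STRONG>Fig. 1. Wellbore diagram of (a) an open hole completion "
    "and (b) a slotted liner completion.</STRONG></P>"
)
call = file_call(caption_html, "Stephen A. Holditch")
```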

Lists

Horrible. Here's an example:

<P>-- Slotted liner<BR>
-- Screen and liner<BR>
-- Cemented liner</P>

This becomes:

* Slotted liner
* Screen and liner
* Cemented liner

Similar thing for ordered lists:

<P>1. Expendable gun<BR>
2. Semi-expendable gun<BR>
3. Retrievable, hollow carrier gun</P>

which becomes...

# Expendable gun
# Semi-expendable gun
# Retrievable, hollow carrier gun

Here's how we deal with all of this:

import re

# `text` is assumed to hold the article body as a single string.

# Convert unordered lists
text = re.sub(r"(<br>)*(\n)*\n-- ",r"\n* ",text)
text = re.sub(r"\n\n\*",r"\n*",text) # to handle double newlines

# Convert ordered lists
text = re.sub(r"(<br>)*(\n)*\n[0-9][0-9]?\. ",r"\n# ",text)
text = re.sub(r"\n\n#",r"\n#",text) # to handle double newlines

# Convert description lists
text = re.sub(r"<dd>",r"",text)
text = re.sub(r"</dd>",r"",text)
text = re.sub(r"<dl>",r"",text)
text = re.sub(r"</dl>",r"",text)
text = re.sub(r"<dt><strong>(?:[0-9]\. )?(.+?)</strong></dt>(\n)?",r"\n\n====\1====\n",text)
text = re.sub(r"<dt>-- (.+?)</dt>(\n)?",r"* \1\n",text)
text = re.sub(r"<dt>(.+?)</dt>(\n)?",r"* \1\n",text)

# Any more <br>s are probably unnecessary linebreaks and can become ordinary text lists.
text = re.sub(r"\n(.+?)<br>\n",r"\n* \1<br>\n",text)
text = re.sub(r"<br>\n?",r"\n* ",text) # any remaining <br> starts a new list item
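
To check the unordered and ordered rules against the two examples, they can be wrapped in a function and run on the raw Datapages snippets. This sketch assumes a normalisation step that is not shown in these notes: `<P>` tags are turned into newlines and every `<BR>` variant is rewritten as a lowercase `<br>`, since the patterns only match that form. The names `normalise` and `convert_lists` are invented.

```python
import re

def normalise(text):
    """Assumed pre-processing: <P> tags become newlines, <BR> becomes <br>."""
    text = re.sub(r"</?p(\s[^>]*)?>", "\n", text, flags=re.IGNORECASE)
    return re.sub(r"<br\s*/?>", "<br>", text, flags=re.IGNORECASE)

def convert_lists(text):
    """Apply the unordered and ordered list substitutions from the notes."""
    text = re.sub(r"(<br>)*(\n)*\n-- ", r"\n* ", text)
    text = re.sub(r"\n\n\*", r"\n*", text)
    text = re.sub(r"(<br>)*(\n)*\n[0-9][0-9]?\. ", r"\n# ", text)
    text = re.sub(r"\n\n#", r"\n#", text)
    return text

unordered = "<P>-- Slotted liner<BR>\n-- Screen and liner<BR>\n-- Cemented liner</P>"
ordered = (
    "<P>1. Expendable gun<BR>\n"
    "2. Semi-expendable gun<BR>\n"
    "3. Retrievable, hollow carrier gun</P>"
)

bullets = convert_lists(normalise(unordered)).strip()
numbers = convert_lists(normalise(ordered)).strip()
```

The `<p(\s[^>]*)?>` pattern is written so that it matches `<P>` and `<P ALIGN=CENTER>` but not `<pre>`.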