User:Matt/Content conversion

From AAPG Wiki
Revision as of 18:56, 18 June 2013 by Matt (talk | contribs) (initial observations)

Notes about what will be required to convert from Datapages format to MediaWiki wikitext.

Article information

We will need to decide whether we want to preserve book information. One approach is to build 'canonical' book content in a separate namespace. SPE did this, and SEG have done something similar. But the canonical copy will effectively be obsolete as soon as the wiki version is edited (the Nupedia/Wikipedia story), and having two versions of everything could confuse readers.

The article info is semantically tagged (yay!). Everything after <HR> and before <CJSTEXT> is header information about the article: book name, volume number, page range, etc. For example:

<P><STRONG>Pub. Id:</STRONG> <CJSVOLUME>A095</CJSVOLUME> (<CJSYEAR>1992</CJSYEAR>)</P>
<P><STRONG>First Page:</STRONG> <CJSFIRSTPAGE>463</CJSFIRSTPAGE></P>
<P><STRONG>Last Page:</STRONG> <CJSLASTPAGE>468</CJSLASTPAGE></P>
<P><STRONG>Book Title:</STRONG> <CJSPUBTITLE>ME 10: Development Geology Reference Manual</CJSPUBTITLE></P>
<P><STRONG>Article/Chapter:</STRONG> <CJSTITLE>Well Completions: Part 9. Production Engineering Methods</CJSTITLE></P>
<P><STRONG>Subject Group:</STRONG> <CJSTOPIC>Oil--Methodology and Concepts</CJSTOPIC></P>
<P><STRONG>Spec. Pub. Type:</STRONG> <CJSTYPE>Methods in Exploration</CJSTYPE></P>
<P><STRONG>Pub. Year:</STRONG> <CJSVOLUMEYEAR>1992</CJSVOLUMEYEAR></P>
<P><STRONG>Author(s):</STRONG> <CJSAUTHOR>Stephen A. Holditch</CJSAUTHOR></P>
<P><STRONG>Text:</STRONG></P>
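Because the header fields are tagged, pulling them out could be a one-regex job. A rough Python sketch, assuming every field sits in a <CJS...> tag pair as in the sample above; the function name `extract_metadata` is made up for illustration, and repeated tags (e.g. several authors) would need a list rather than a dict:

```python
import re

# Matches any <CJSNAME>value</CJSNAME> pair; \1 forces the closing tag to match.
CJS_FIELD = re.compile(r"<(CJS[A-Z]+)>(.*?)</\1>", re.DOTALL)

def extract_metadata(html):
    """Return {tag name: text} for the header between <HR> and <CJSTEXT>."""
    # Keep only what comes before the article text.
    header = html.split("<CJSTEXT>", 1)[0]
    # Drop anything before the <HR>, if there is one.
    if "<HR>" in header:
        header = header.split("<HR>", 1)[1]
    # Note: a repeated tag overwrites the earlier value in this sketch.
    return {name: value.strip() for name, value in CJS_FIELD.findall(header)}

sample = (
    "<HR><P><STRONG>Pub. Id:</STRONG> <CJSVOLUME>A095</CJSVOLUME> "
    "(<CJSYEAR>1992</CJSYEAR>)</P>"
    "<P><STRONG>First Page:</STRONG> <CJSFIRSTPAGE>463</CJSFIRSTPAGE></P>"
    "<P><STRONG>Author(s):</STRONG> <CJSAUTHOR>Stephen A. Holditch</CJSAUTHOR></P>"
    "<CJSTEXT><P>Body</P></CJSTEXT>"
)
meta = extract_metadata(sample)
```

The resulting dict could then feed an infobox template or page categories, depending on what we decide about preserving book information.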

Technical content

Everything after <CJSTEXT> and before </CJSTEXT> is the text of the article.

Some things need to be converted:

  • Delete the <P> and <BR> tags
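A minimal Python sketch of these two steps, assuming the tags are written exactly as above (upper case, possibly with attributes on <P>). The function names are hypothetical. Stripping <P> and <BR> should probably run last, after headings, figures, and lists have been recognised, because those rely on the tags being present:

```python
import re

def extract_body(html):
    """Return the article text between <CJSTEXT> and </CJSTEXT>, or ''."""
    m = re.search(r"<CJSTEXT>(.*?)</CJSTEXT>", html, re.DOTALL)
    return m.group(1) if m else ""

def strip_p_br(text):
    """Delete <P> and <BR> tags, keeping paragraph and line breaks."""
    # A closing </P> ends a paragraph: turn it into a blank line.
    text = re.sub(r"</P>", "\n\n", text, flags=re.IGNORECASE)
    # <BR> is a line break; swallow the newline that usually follows it.
    text = re.sub(r"<BR\s*/?>\s*", "\n", text, flags=re.IGNORECASE)
    # Opening <P ...> tags carry no content of their own.
    text = re.sub(r"<P\b[^>]*>", "", text, flags=re.IGNORECASE)
    # Collapse runs of blank lines left behind by the substitutions.
    text = re.sub(r"\n{3,}", "\n\n", text)
    return text.strip()
```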

Headings

Headings can be interpreted, per these examples:

  • H1, article name — <P><STRONG>PERFORATING</STRONG></P>
  • H2, section name — <P><STRONG>Types of Guns</STRONG></P>
  • H3, subsection name — <P ALIGN=CENTER><STRONG>Expendable Gun</STRONG></P>

They all need to be converted to sentence case.
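The three patterns above could be detected with a single regex. The sketch below is a guess at the rules, not a tested converter: it assumes all-caps means H1, ALIGN=CENTER means H3, and anything else in a bold-only paragraph is H2. It maps level n to n equals signs, and it skips figure captions, which use the same bold-paragraph markup. The sentence-case step is naive and will lowercase proper nouns and acronyms, so those would need a manual pass or an exceptions list:

```python
import re

# A paragraph whose only content is one <STRONG> run.
HEADING = re.compile(
    r"<P\b(?P<attrs>[^>]*)>\s*<STRONG>(?P<title>[^<]+)</STRONG>\s*</P>",
    re.IGNORECASE,
)

def sentence_case(title):
    """Naive sentence case: first letter upper, everything else lower."""
    title = title.strip()
    return title[:1].upper() + title[1:].lower()

def convert_heading(match):
    title = match.group("title").strip()
    # Figure captions share the bold-paragraph markup; leave them alone.
    if re.match(r"Fig\.\s*\d+", title):
        return match.group(0)
    if re.search(r'ALIGN\s*=\s*"?CENTER', match.group("attrs"), re.IGNORECASE):
        level = 3
    elif title.isupper():
        level = 1
    else:
        level = 2
    marks = "=" * level
    return f"{marks} {sentence_case(title)} {marks}"

def convert_headings(text):
    """Replace every heading-like paragraph with a wikitext heading."""
    return HEADING.sub(convert_heading, text)
```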

Figures

First mention of a figure, e.g. Figure 1, should trigger a file call:

[[File:<filename.jpg>|thumb|Fig. 1 — Caption.]]

The actual figure reference, which might come before or after the mention in the text, looks like this:

<P><STRONG>Fig. 1. Wellbore diagram of (a) an open hole completion and (b) a slotted liner completion.</STRONG></P>

If we're very cunning, we can gather the file calls, gather the actual figure references, and match them up, so that the figure caption is inserted into the file call.
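Here is one way the matching might go, sketched in Python. It assumes captions always look like the "Fig. N." paragraph above and that mentions are spelled "Figure N". The `filenames` mapping from figure number to uploaded file name is hypothetical, since how the image files are named is not yet known. The file call is placed right after the paragraph containing the first mention:

```python
import re

CAPTION = re.compile(
    r"<P>\s*<STRONG>Fig\.\s*(\d+)\.\s*(.*?)</STRONG>\s*</P>\s*",
    re.IGNORECASE | re.DOTALL,
)
MENTION = re.compile(r"\bFigure\s+(\d+)\b")

def place_figures(html, filenames, sep=" \u2014 "):
    """Move each caption into a file call after the figure's first mention."""
    captions = {}

    def grab(m):
        # Remember the caption text and remove the caption paragraph.
        captions[int(m.group(1))] = m.group(2).strip()
        return ""

    html = CAPTION.sub(grab, html)

    inserts = {}  # insertion offset -> file calls to insert there
    seen = set()
    for m in MENTION.finditer(html):
        num = int(m.group(1))
        # Only the first mention triggers a call, and only if we have
        # both a caption and a file name for that figure.
        if num in seen or num not in captions or num not in filenames:
            continue
        seen.add(num)
        end = html.find("</P>", m.end())
        pos = len(html) if end == -1 else end + len("</P>")
        call = f"[[File:{filenames[num]}|thumb|Fig. {num}{sep}{captions[num]}]]"
        inserts.setdefault(pos, []).append(call)

    # Insert from the end so earlier offsets stay valid.
    for pos in sorted(inserts, reverse=True):
        block = "\n" + "\n".join(inserts[pos]) + "\n"
        html = html[:pos] + block + html[pos:]
    return html
```

Figures that have a caption but are never mentioned in the text are silently dropped in this sketch; a real converter should probably log them.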

An alternative approach, which I think would require us to write an extension (I can't find an existing one), would be to upload the images using their captions as the file description (if available). Then we could ask for the description when we call the file, either with a magic word or via a template (less good, because it breaks the standard image call syntax):

[[File:Myfile.jpg|thumb|{{DESCRIPTION}}]]

or

{{fig | 3.2
| myfile.jpg
| Caption text.
| Smith et al. 2006
}}

Lists

Horrible. Here's an example:

<P>-- Slotted liner<BR>
-- Screen and liner<BR>
-- Cemented liner</P>

This will become:

* Slotted liner
* Screen and liner
* Cemented liner

To do this:

  • Interpret such a block as a list: perhaps lines that start with --[SPACE]
  • Delete the <P> and <BR> tags.
  • Replace -- with *
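The steps above can be sketched in Python as follows. This assumes a list is a single <P> block in which every <BR>-separated line starts with "-- "; the function name is made up, and a one-line paragraph that happens to start with "-- " would also be treated as a list:

```python
import re

PARA = re.compile(r"<P\b[^>]*>(.*?)</P>", re.DOTALL | re.IGNORECASE)

def convert_dash_lists(html):
    """Turn <P> blocks made only of '-- ' lines into wikitext bullet lists."""
    def repl(m):
        # Split the paragraph into its <BR>-separated lines.
        lines = re.split(r"<BR\s*/?>", m.group(1), flags=re.IGNORECASE)
        lines = [ln.strip() for ln in lines if ln.strip()]
        # Interpret the block as a list only if every line starts with '-- '.
        if lines and all(ln.startswith("-- ") for ln in lines):
            return "\n".join("* " + ln[3:].strip() for ln in lines)
        # Anything else is an ordinary paragraph: leave it untouched.
        return m.group(0)
    return PARA.sub(repl, html)
```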

Ordered lists need similar treatment, with each numbered line becoming a # item:

<P>1. Expendable gun<BR>
2. Semi-expendable gun<BR>
3. Retrievable, hollow carrier gun</P>
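A comparable sketch for numbered blocks, again in Python and again only a guess at workable rules. To avoid misreading an ordinary paragraph that starts with a number, it assumes the items must be numbered 1, 2, 3, ... in order; the function name is hypothetical:

```python
import re

PARA = re.compile(r"<P\b[^>]*>(.*?)</P>", re.DOTALL | re.IGNORECASE)
ITEM = re.compile(r"(\d+)\.\s+(.*)", re.DOTALL)

def convert_numbered_lists(html):
    """Turn <P> blocks of '1. ', '2. ', ... lines into wikitext '#' lists."""
    def repl(m):
        lines = re.split(r"<BR\s*/?>", m.group(1), flags=re.IGNORECASE)
        items = [ITEM.fullmatch(ln.strip()) for ln in lines if ln.strip()]
        # Every line must look like 'N. text'.
        if not items or not all(items):
            return m.group(0)
        # The numbers must run 1..n, otherwise treat it as prose.
        numbers = [int(it.group(1)) for it in items]
        if numbers != list(range(1, len(items) + 1)):
            return m.group(0)
        return "\n".join("# " + it.group(2).strip() for it in items)
    return PARA.sub(repl, html)
```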

Some things can be removed:

  • Everything in <BLOCKQUOTE> tags is page information we won't want in the wiki
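Removing these could be a single substitution. A small Python sketch, assuming <BLOCKQUOTE> elements are never nested and never hold real article text (worth checking on a few articles first); the function name and the sample content in the usage line are made up:

```python
import re

# Non-greedy match from each opening tag to the nearest closing tag,
# plus any whitespace that follows it.
BLOCKQUOTE = re.compile(
    r"<BLOCKQUOTE\b[^>]*>.*?</BLOCKQUOTE>\s*",
    re.DOTALL | re.IGNORECASE,
)

def drop_blockquotes(html):
    """Delete every <BLOCKQUOTE>...</BLOCKQUOTE> element and its contents."""
    return BLOCKQUOTE.sub("", html)

# Usage with placeholder content standing in for real page information:
cleaned = drop_blockquotes(
    "<P>Before.</P>\n<BLOCKQUOTE><P>page marker</P></BLOCKQUOTE>\n<P>After.</P>"
)
```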