Difference between revisions of "User:Matt/Content conversion"
Latest revision as of 16:09, 1 October 2013
Notes about what will be required to convert from Datapages format to MediaWiki wikitext. Based on:
- Well completion — Chapter from Part 9 of Methods 10.
Article information
We will preserve book information. We will not maintain 'canonical' book content in a separate namespace (SPE did this; SEG sort of did this too).
The article info in Datapages is semantically tagged (yay!). This is very useful indeed. Everything after <HR> and before <CJSTEXT> is header information about the article (book name, volume number, page range, etc.):
<pre>
<P><STRONG>Pub. Id:</STRONG> <CJSVOLUME>A095</CJSVOLUME> (<CJSYEAR>1992</CJSYEAR>)</P>
<P><STRONG>First Page:</STRONG> <CJSFIRSTPAGE>463</CJSFIRSTPAGE></P>
<P><STRONG>Last Page:</STRONG> <CJSLASTPAGE>468</CJSLASTPAGE></P>
<P><STRONG>Book Title:</STRONG> <CJSPUBTITLE>ME 10: Development Geology Reference Manual</CJSPUBTITLE></P>
<P><STRONG>Article/Chapter:</STRONG> <CJSTITLE>Well Completions: Part 9. Production Engineering Methods</CJSTITLE></P>
<P><STRONG>Subject Group:</STRONG> <CJSTOPIC>Oil--Methodology and Concepts</CJSTOPIC></P>
<P><STRONG>Spec. Pub. Type:</STRONG> <CJSTYPE>Methods in Exploration</CJSTYPE></P>
<P><STRONG>Pub. Year:</STRONG> <CJSVOLUMEYEAR>1992</CJSVOLUMEYEAR></P>
<P><STRONG>Author(s):</STRONG> <CJSAUTHOR>Stephen A. Holditch</CJSAUTHOR></P>
<P><STRONG>Text:</STRONG></P>
</pre>
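Because the header fields are wrapped in their own CJS tags, they can probably be pulled out with a single regular expression. The following is a minimal sketch, not the converter's actual code: the function name `parse_cjs_header` and the shortened `sample` string are made up for illustration, and it assumes each CJS tag occurs at most once before <CJSTEXT>.

```python
import re

# Matches any <CJSxxx>...</CJSxxx> pair and captures the tag name and its text.
CJS_FIELD = re.compile(r"<(CJS[A-Z]+)>(.*?)</\1>", re.DOTALL)


def parse_cjs_header(html):
    """Return {tag name: text} for the header fields before <CJSTEXT>.

    Hypothetical helper; a repeated tag would overwrite the earlier value.
    """
    header = html.split("<CJSTEXT>", 1)[0]  # ignore the article body
    return {name: value.strip() for name, value in CJS_FIELD.findall(header)}


# Shortened, illustrative version of the header shown above.
sample = (
    "<P><STRONG>Pub. Id:</STRONG> <CJSVOLUME>A095</CJSVOLUME> "
    "(<CJSYEAR>1992</CJSYEAR>)</P>"
    "<P><STRONG>First Page:</STRONG> <CJSFIRSTPAGE>463</CJSFIRSTPAGE></P>"
    "<P><STRONG>Author(s):</STRONG> <CJSAUTHOR>Stephen A. Holditch</CJSAUTHOR></P>"
    "<CJSTEXT><P>Body text</P></CJSTEXT>"
)
fields = parse_cjs_header(sample)
print(fields["CJSAUTHOR"])
```

Splitting on <CJSTEXT> first keeps the article body out of the result, so only header fields end up in the dictionary.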
Most of this is repeated in the page's <head>, and this is the easiest place to get the URLs for the HTML and PDF resources:
<pre>
<!-- Juicy Metadata. Google Scholar eats this up. -->
<!-- journals -->
<meta name="citation_publisher" content="AAPG Special Volumes">
<meta name="citation_title" content="Well Completions: Part 9. Production Engineering Methods">
<meta name="DC.Title" content="Well Completions: Part 9. Production Engineering Methods">
<meta name="citation_author" content="Stephen A. Holditch">
<meta name="DC.Contributor" content="Stephen A. Holditch">
<meta name="citation_publication_date" content="1992">
<meta name="DC.Date" content="1992">
<meta name="citation_volume" content="95">
<meta name="citation_firstpage" content="463">
<meta name="citation_lastpage" content="468">
<meta name="citation_fulltext_html_url" content="http://archives.datapages.com/data/specpubs/methodo1/data/a095/a095/0001/0450/0463.htm">
<meta name="citation_pdf_url" content="http://archives.datapages.com/data/specpubs/methodo1/images/a095/a0950001/0450/04630.pdf">
</pre>
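One way to read these values is Python's standard `html.parser` module. This is only a sketch: the class name `CitationMetaParser` is hypothetical, the `head` string is a cut-down copy of the metadata above, and a page with several `citation_author` tags would need a list rather than the plain dictionary used here.

```python
from html.parser import HTMLParser


class CitationMetaParser(HTMLParser):
    """Collect <meta name="citation_*" content="..."> pairs (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.meta = {}

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attrs = dict(attrs)
        name = attrs.get("name") or ""
        # Keep only the Google Scholar style fields; skip the DC.* duplicates.
        if name.startswith("citation_"):
            self.meta[name] = attrs.get("content")


# Cut-down copy of the <head> metadata shown above.
head = """<head>
<meta name="citation_title" content="Well Completions: Part 9. Production Engineering Methods">
<meta name="DC.Title" content="Well Completions: Part 9. Production Engineering Methods">
<meta name="citation_firstpage" content="463">
<meta name="citation_pdf_url" content="http://archives.datapages.com/data/specpubs/methodo1/images/a095/a0950001/0450/04630.pdf">
</head>"""

parser = CitationMetaParser()
parser.feed(head)
print(parser.meta["citation_pdf_url"])
```

The parser lower-cases tag and attribute names itself, so the same code should cope with `<META NAME=...>` if Datapages ever serves it that way.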
Technical content
Everything after <CJSTEXT> and before </CJSTEXT> is the text of the article.
Many things need to be converted:
- Remove all the <br> and <br /> tags
- Remove the page breaks, which are contained in <blockquote> tags
- Delete the <P> and <BR> tags
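The clean-up in this list could look roughly like the sketch below. It makes several assumptions that the notes do not confirm: that a <blockquote> never contains anything except a page break, that a closing </P> should become a blank line, and that this step runs after headings and lists have been converted (both of those rely on the <P> and <BR> tags still being present). The function name and the sample string, including its "end page 463" text, are invented for illustration.

```python
import re


def strip_layout_tags(text):
    """Remove page breaks and leftover <P>/<BR> tags (hypothetical helper)."""
    flags = re.IGNORECASE | re.DOTALL
    # Drop whole blockquote blocks, assumed to hold only page-break markers.
    text = re.sub(r"<blockquote>.*?</blockquote>", "", text, flags=flags)
    # A closing </p> ends a paragraph, so turn it into a blank line.
    text = re.sub(r"</p>", "\n\n", text, flags=flags)
    # Remove opening <p ...> tags and every <br> or <br /> variant.
    text = re.sub(r"<p\b[^>]*>", "", text, flags=flags)
    text = re.sub(r"<br\s*/?>", "", text, flags=flags)
    return text.strip()


# Invented sample; the real page-break markup may differ.
raw = (
    "<P>First line<BR>\nsecond line</P>"
    "<blockquote><P>end page 463</P></blockquote>"
    "<P>Next paragraph</P>"
)
cleaned = strip_layout_tags(raw)
print(cleaned)
```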
Headings
Headings can be interpreted, per these examples:
- H1, article name: <P><STRONG>PERFORATING</STRONG></P>
- H2, section name: <P><STRONG>Types of Guns</STRONG></P>
- H3, subsection name: <P ALIGN=CENTER><STRONG>Expendable Gun</STRONG></P>
They all need to be converted to sentence case, leaving proper names, all-uppercase words, and anything in parentheses alone.
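A rough sketch of that sentence-case rule follows. The function `sentence_case` and its `proper_names` argument are hypothetical, proper names have to be supplied by hand because nothing here can detect them, and the third and fourth example headings are invented to exercise the exceptions.

```python
def sentence_case(title, proper_names=()):
    """Lower-case every word after the first, except the protected ones.

    Protected: words listed in proper_names, all-uppercase words of two or
    more letters, and anything between an opening and a closing parenthesis.
    """
    keep = set(proper_names)
    out = []
    in_parens = False
    for i, word in enumerate(title.split()):
        if word.startswith("("):
            in_parens = True
        bare = word.strip("().,:;")
        untouched = (
            i == 0
            or in_parens
            or bare in keep
            or (len(bare) > 1 and bare.isupper())
        )
        out.append(word if untouched else word.lower())
        if word.endswith(")"):
            in_parens = False
    return " ".join(out)


print(sentence_case("Types of Guns"))
print(sentence_case("Expendable Gun"))
print(sentence_case("Logging With TCP Guns (Tubing Conveyed)"))
print(sentence_case("Wells Drilled In Texas", proper_names={"Texas"}))
```

Note that under this rule an all-uppercase H1 such as PERFORATING is left as it is, so article titles would still need separate handling.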
Figures
The actual figure references — which might come before or after the mention in the text — look like this:
<P><STRONG>Fig. 1. Wellbore diagram of (a) an open hole completion and (b) a slotted liner completion.</STRONG></P>
We make a file name out of the figure caption and the author's name, so every file is unique.
Ideally, the first mention of a figure, e.g. Figure 1, should trigger the file call:
[[File:<filename.jpg>|thumb|Fig. 1 — Caption.]]
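The notes do not spell out the naming scheme, so the sketch below invents one purely for illustration: the author's surname followed by the first six words of the caption. The functions `split_caption`, `figure_filename` and `file_call` are all hypothetical, and truncating the caption means two figures by the same author could still collide.

```python
import re

# "Fig. 1. Caption text" -> figure number and caption body.
CAPTION = re.compile(r"Fig\.\s*(\d+)\.\s*(.*)", re.DOTALL)


def split_caption(caption):
    """Return (figure number, caption text); raises if the label is missing."""
    number, body = CAPTION.match(caption).groups()
    return int(number), body.strip()


def figure_filename(caption, author, ext="jpg"):
    """Invented scheme: surname plus the first six caption words."""
    _, body = split_caption(caption)
    surname = author.split()[-1]
    words = re.findall(r"[A-Za-z0-9]+", body)[:6]
    return "{}_{}.{}".format(surname, "-".join(w.lower() for w in words), ext)


def file_call(caption, author):
    """Build the wikitext thumbnail call in the format shown above."""
    number, body = split_caption(caption)
    return "[[File:{}|thumb|Fig. {} — {}]]".format(
        figure_filename(caption, author), number, body
    )


caption = (
    "Fig. 1. Wellbore diagram of (a) an open hole completion "
    "and (b) a slotted liner completion."
)
print(file_call(caption, "Stephen A. Holditch"))
```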
Lists
Horrible. Here's an example:
<pre>
<P>-- Slotted liner<BR>
-- Screen and liner<BR>
-- Cemented liner</P>
</pre>
This becomes:
<pre>
* Slotted liner
* Screen and liner
* Cemented liner
</pre>
Similar thing for ordered lists:
<pre>
<P>1. Expendable gun<BR>
2. Semi-expendable gun<BR>
3. Retrievable, hollow carrier gun</P>
</pre>
which becomes...
<pre>
# Expendable gun
# Semi-expendable gun
# Retrievable, hollow carrier gun
</pre>
Here's how we deal with all of this:
<pre>
import re

# `text` is assumed to hold the article body as a single string.

# Convert unordered lists
text = re.sub(r"(<br>)*(\n)*\n-- ", r"\n* ", text)
text = re.sub(r"\n\n\*", r"\n*", text)  # to handle double newlines

# Convert ordered lists
text = re.sub(r"(<br>)*(\n)*\n[0-9][0-9]?\. ", r"\n# ", text)
text = re.sub(r"\n\n#", r"\n#", text)  # to handle double newlines

# Convert description lists
text = re.sub(r"<dd>", r"", text)
text = re.sub(r"</dd>", r"", text)
text = re.sub(r"<dl>", r"", text)
text = re.sub(r"</dl>", r"", text)
text = re.sub(r"<dt><strong>(?:[0-9]\. )?(.+?)</strong></dt>(\n)?", r"\n\n====\1====\n", text)
text = re.sub(r"<dt>-- (.+?)</dt>(\n)?", r"* \1\n", text)
text = re.sub(r"<dt>(.+?)</dt>(\n)?", r"* \1\n", text)

# Any more <br>s are probably unnecessary linebreaks and can be ordinary text lists.
text = re.sub(r"\n(.+?)<br>\n", r"\n* \1<br>\n", text)
text = re.sub(r"<br>\n?", r"\n* ", text)  # each remaining <br> starts the next bullet
</pre>