Changes

updating notes
Line 3: Line 3:     
==Article information==
 
==Article information==
We will need to decide if we want to preserve book information. One approach is to build 'canonical' book content in a separate namespace. But this will effectively be obsolete as soon as the wiki version is edited (Nupedia/Wikipedia story). SPE did this. SEG have sort of done this too. It could be confusing for readers (two versions of everything).  
+
We will preserve book information. We will not maintain 'canonical' book content in a separate namespace (SPE did this; SEG sort of did this too).
   −
The article info is semantically tagged (yay!). Everything after <code><nowiki><HR></nowiki></code> and before <code><CJSTEXT></code> is header information about the article — book name, volume number, page range, etc:
+
The article info in Datapages is semantically tagged (yay!). This is very useful indeed. Everything after <code><nowiki><HR></nowiki></code> and before <code><CJSTEXT></code> is header information about the article — book name, volume number, page range, etc:
    
<pre>
 
<pre>
Line 17: Line 17:  
<P><STRONG>Author(s):</STRONG> <CJSAUTHOR>Stephen A. Holditch</CJSAUTHOR></P>
 
<P><STRONG>Author(s):</STRONG> <CJSAUTHOR>Stephen A. Holditch</CJSAUTHOR></P>
 
<P><STRONG>Text:</STRONG></P>
 
<P><STRONG>Text:</STRONG></P>
 +
</pre>
 +
 +
Most of this is repeated in the page's <code><nowiki><head></nowiki></code>, and this is the easiest place to get the URLs for the HTML and PDF resources:
 +
 +
<pre>
 +
<!-- Juicy Metadata. Google Scholar eats this up. -->
 +
<!-- journals -->
 +
  <meta name="citation_publisher" content="AAPG Special Volumes">
 +
  <meta name="citation_title" content="Well Completions: Part 9. Production Engineering Methods">
 +
  <meta name="DC.Title" content="Well Completions: Part 9. Production Engineering Methods">
 +
  <meta name="citation_author" content="Stephen A. Holditch">
 +
  <meta name="DC.Contributor" content="Stephen A. Holditch">
 +
  <meta name="citation_publication_date" content="1992">
 +
  <meta name="DC.Date" content="1992">
 +
  <meta name="citation_volume" content="95">
 +
  <meta name="citation_firstpage" content="463">
 +
  <meta name="citation_lastpage" content="468">
 +
  <meta name="citation_fulltext_html_url" content="http://archives.datapages.com/data/specpubs/methodo1/data/a095/a095/0001/0450/0463.htm">
 +
  <meta name="citation_pdf_url" content="http://archives.datapages.com/data/specpubs/methodo1/images/a095/a0950001/0450/04630.pdf">
 
</pre>
 
</pre>
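
A minimal sketch of how those <code><nowiki><meta></nowiki></code> tags might be harvested, using only Python's standard library. The class and function names are made up for illustration, and the sample page is a cut-down copy of the block above. Note that a repeated tag (e.g. several <code>citation_author</code> entries) would overwrite the earlier value in this simple version.

```python
from html.parser import HTMLParser


class CitationMetaParser(HTMLParser):
    """Collect <meta name="citation_*" content="..."> pairs from a page."""

    def __init__(self):
        super().__init__()
        self.meta = {}

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names for us.
        if tag != "meta":
            return
        attributes = dict(attrs)
        name = attributes.get("name") or ""
        if name.startswith("citation_"):
            self.meta[name] = attributes.get("content")


def citation_meta(html):
    """Return a dict of the citation_* metadata found in the HTML string."""
    parser = CitationMetaParser()
    parser.feed(html)
    return parser.meta


sample = """<head>
  <meta name="citation_title" content="Well Completions: Part 9. Production Engineering Methods">
  <meta name="DC.Date" content="1992">
  <meta name="citation_firstpage" content="463">
  <meta name="citation_pdf_url" content="http://archives.datapages.com/data/specpubs/methodo1/images/a095/a0950001/0450/04630.pdf">
</head>"""

meta = citation_meta(sample)
print(meta["citation_pdf_url"])
```

The <code>DC.*</code> tags are skipped here because they repeat the <code>citation_*</code> values.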
   Line 22: Line 41:  
Everything after <code><CJSTEXT></code> and before <code></CJSTEXT></code> is the text of the article.  
 
Everything after <code><CJSTEXT></code> and before <code></CJSTEXT></code> is the text of the article.  
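
As a rough sketch, the header and the article text could be separated with one regular expression. This assumes the first <code><nowiki><HR></nowiki></code> on the page is the one that opens the header, which may not hold for every article. The function name and the sample page are illustrative only.

```python
import re


def split_article(html):
    """Return (header, body): header is between <HR> and <CJSTEXT>,
    body is between <CJSTEXT> and </CJSTEXT>."""
    match = re.search(
        r"<HR>(.*?)<CJSTEXT>(.*?)</CJSTEXT>",
        html,
        flags=re.DOTALL | re.IGNORECASE,
    )
    if match is None:
        raise ValueError("expected <HR> ... <CJSTEXT> ... </CJSTEXT>")
    return match.group(1).strip(), match.group(2).strip()


page = """<HR>
<P><STRONG>Author(s):</STRONG> <CJSAUTHOR>Stephen A. Holditch</CJSAUTHOR></P>
<P><STRONG>Text:</STRONG></P>
<CJSTEXT>
<P>Article text goes here.</P>
</CJSTEXT>"""

header, body = split_article(page)

# The semantic tags in the header make individual fields easy to pull out.
author = re.search(r"<CJSAUTHOR>(.*?)</CJSAUTHOR>", header).group(1)
print(author)
print(body)
```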
   −
Some things need to be converted:
+
Many things need to be converted:
 +
* Remove all the <code><nowiki><br></nowiki></code> and <code><nowiki><br /></nowiki></code> tags
 +
* Remove the page breaks, which are contained in <code><nowiki><blockquote></nowiki></code> tags
 
* Delete the <code><nowiki><P></nowiki></code> and <code><nowiki><BR></nowiki></code> tags
 
* Delete the <code><nowiki><P></nowiki></code> and <code><nowiki><BR></nowiki></code> tags
   Line 31: Line 52:  
* H3, subsection name — <code><nowiki><P ALIGN=CENTER><STRONG>Expendable Gun</STRONG></P></nowiki></code>
 
* H3, subsection name — <code><nowiki><P ALIGN=CENTER><STRONG>Expendable Gun</STRONG></P></nowiki></code>
   −
They all need to be converted to sentence case.
+
They all need to be converted to sentence case, but leaving proper names, all-uppercase words, and parentheses alone.
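
One possible way to do the sentence-case conversion is sketched here. Proper names cannot be detected reliably from the text alone, so this assumes we keep a hand-made set of them (<code>PROPER_NAMES</code> is a hypothetical placeholder, as are the function name and all but the first sample heading).

```python
import re

# Hypothetical list: in practice this would be built up by hand.
PROPER_NAMES = {"Darcy"}


def sentence_case(heading, proper_names=PROPER_NAMES):
    """Lowercase every word except the first, leaving proper names,
    all-uppercase words, and anything inside parentheses untouched."""
    seen_first_word = False

    def fix(match):
        nonlocal seen_first_word
        word = match.group(0)
        is_first = not seen_first_word
        seen_first_word = True
        # Single capital letters are not treated as acronyms here.
        if word in proper_names or (len(word) > 1 and word.isupper()):
            return word
        if is_first:
            return word[0].upper() + word[1:].lower()
        return word.lower()

    # The capture group keeps the parenthesised chunks at odd indices.
    parts = re.split(r"(\([^)]*\))", heading)
    for i, part in enumerate(parts):
        if i % 2 == 0:
            parts[i] = re.sub(r"[A-Za-z][A-Za-z'-]*", fix, part)
    return "".join(parts)


print(sentence_case("Expendable Gun"))
print(sentence_case("Retrievable Hollow Carrier Gun (HCG) Systems"))
print(sentence_case("Using Darcy Flow With TCP Guns"))
```

A heading that is entirely uppercase is left as it is by this rule, so those would need separate handling.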
    
===Figures===
 
===Figures===
First mention of a figure, e.g. '''Figure 1''', should trigger a file call:
+
The actual figure references — which might come before or after the mention in the text — look like this:
: <code>[[File:<filename.jpg>|thumb|Fig. 1 — Caption.]]</code>
  −
 
  −
The actual figure references — which might come before or after the mention in the text, looks like this:
      
<pre>
 
<pre>
Line 43: Line 61:  
</pre>
 
</pre>
   −
If we're very cunning, we can gather the file calls, gather the actual figure references, and match them up, so that the figure caption is inserted into the file call.
+
We make a file name out of the figure caption and the author's name, so every file is unique:
   −
An alternative approach, which would require us to write an extension I think (I can't find one), would be to upload the images using their captions as the file description (if available). Then we could ask for the description when we call the file, either with a magic word or via a template (less good, because it breaks the way to make an image call):
+
Ideally, the first mention of a figure, e.g. '''Figure 1''', should trigger the file call:
 
+
: <code>[[File:<filename.jpg>|thumb|Fig. 1 — Caption.]]</code>
<pre>
  −
[[File:Myfile.jpg|thumb|{{DESCRIPTION}}]]
  −
</pre>
  −
 
  −
or
  −
 
  −
<pre>
  −
{{fig | 3.2
  −
| myfile.jpg
  −
| Caption text.
  −
| Smith et al. 2006
  −
}}
  −
</pre>
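
The idea of building a file name from the author's name and the figure caption might look something like the sketch below. The exact naming scheme is an assumption (surname, then the first few caption words), the caption in the example is invented, and two figures with the same author and the same opening words would still collide, so a real version would need a check for that.

```python
import re
import unicodedata


def _ascii_words(text):
    """Split text into plain ASCII alphanumeric words."""
    text = unicodedata.normalize("NFKD", text)
    text = text.encode("ascii", "ignore").decode("ascii")
    return re.findall(r"[A-Za-z0-9]+", text)


def figure_filename(author, caption, ext="jpg", max_words=6):
    """Build a file name from the author's surname and the caption's first words."""
    name_words = _ascii_words(author)
    if not name_words:
        raise ValueError("author name has no usable characters")
    surname = name_words[-1]
    caption_words = [w.lower() for w in _ascii_words(caption)[:max_words]]
    return "{}_{}.{}".format(surname, "-".join(caption_words), ext)


# Hypothetical caption, for illustration only.
name = figure_filename("Stephen A. Holditch", "Typical perforating gun systems.")
print(name)
```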
      
===Lists===
 
===Lists===
Line 70: Line 75:  
</pre>
 
</pre>
   −
This will become:
+
This becomes:
 
   
<pre>
 
<pre>
 
* Slotted liner
 
* Slotted liner
Line 77: Line 81:  
* Cemented liner
 
* Cemented liner
 
</pre>
 
</pre>
  −
To do this:
  −
* Interpret such a block as a list: perhaps lines that start with <code>--[SPACE]</code>
  −
* Delete the <code><nowiki><P></nowiki></code> and <code><nowiki><BR></nowiki></code> tags.
  −
* Replace <code>--</code> with <code>*</code>
      
Similar thing for ordered lists:  
 
Similar thing for ordered lists:  
   
<pre>
 
<pre>
 
<P>1. Expendable gun<BR>
 
<P>1. Expendable gun<BR>
Line 91: Line 89:  
</pre>
 
</pre>
   −
Some things can be removed:
+
which becomes...
* Everything in <code><nowiki><BLOCKQUOTE></nowiki></code> tags is page information we won't want in the wiki
+
<pre>
 +
# Expendable gun
 +
# Semi-expendable gun
 +
# Retrievable, hollow carrier gun
 +
</pre>
 +
 
 +
Here's how we deal with all of this:
 +
<pre>
 +
# Convert unordered lists
 +
text = re.sub(r"(<br>)*(\n)*\n-- ",r"\n* ",text)
 +
text = re.sub(r"\n\n\*",r"\n*",text) # to handle double newlines
 +
 
 +
# Convert ordered lists
 +
text = re.sub(r"(<br>)*(\n)*\n[0-9][0-9]?\. ",r"\n# ",text)
 +
text = re.sub(r"\n\n#",r"\n#",text) # to handle double newlines
 +
 
 +
# Convert description lists
 +
text = re.sub(r"<dd>",r"",text)
 +
text = re.sub(r"</dd>",r"",text)
 +
text = re.sub(r"<dl>",r"",text)
 +
text = re.sub(r"</dl>",r"",text)
 +
text = re.sub(r"<dt><strong>(?:[0-9]\. )?(.+?)</strong></dt>(\n)?",r"\n\n====\1====\n",text)
 +
text = re.sub(r"<dt>-- (.+?)</dt>(\n)?",r"* \1\n",text)
 +
text = re.sub(r"<dt>(.+?)</dt>(\n)?",r"* \1\n",text)
 +
 
 +
# Any remaining <br>s are probably unnecessary line breaks and can become ordinary text lists.
 +
text = re.sub(r"\n(.+?)<br>\n",r"\n* \1<br>\n",text)
 +
text = re.sub(r"<br>\n?",r"\n* ",text) # turn each leftover <br> into the start of a list item
 +
</pre>
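
To try the first two conversions in isolation, the same four substitutions can be wrapped in a function and run on a small sample. This sketch assumes the <code><nowiki><P></nowiki></code> tags have already been removed and the remaining tags lowercased, since the patterns only match a lowercase <code><nowiki><br></nowiki></code>. The function name and the two intro lines in the sample are illustrative.

```python
import re


def convert_lists(text):
    """Apply the unordered and ordered list substitutions from the notes above."""
    # Unordered lists: a line starting with "-- " becomes a "* " item.
    text = re.sub(r"(<br>)*(\n)*\n-- ", r"\n* ", text)
    text = re.sub(r"\n\n\*", r"\n*", text)  # collapse double newlines before items

    # Ordered lists: a line starting with "1. " to "99. " becomes a "# " item.
    text = re.sub(r"(<br>)*(\n)*\n[0-9][0-9]?\. ", r"\n# ", text)
    text = re.sub(r"\n\n#", r"\n#", text)  # collapse double newlines before items
    return text


sample = (
    "Completion types:<br>\n"
    "-- Slotted liner<br>\n"
    "-- Cemented liner\n"
    "\n"
    "Gun types:<br>\n"
    "1. Expendable gun<br>\n"
    "2. Semi-expendable gun\n"
)

converted = convert_lists(sample)
print(converted)
```

One thing to watch: the ordered-list pattern treats any line that begins with one or two digits, a full stop, and a space as a list item, so ordinary text that happens to start that way would be converted too.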
