Difference between revisions of "Help:Report generation"

From DFM Wiki
(Created page with "== Manual preparation of source document == Before converting a Word document to other formats, we need to ensure that the layout is clean and semantically valid. If authors h...")
 
(Added details)
Revision as of 11:01, 16 May 2022

This document describes a workflow for converting manuscripts received in Microsoft Word format to digital publications that are disseminated through DFM's various online platforms.

Target outputs

Given a source manuscript, we will end up with the following outputs:

  • Editable, version-controlled source text of the working paper (report) on the DFM Wiki
  • Copies of all images used in the report, along with description and licensing metadata, on the DFM Wiki
  • Metadata for all bibliographic references in the report within the DFM Zotero library
  • A web-browsable version of the report on the DFM public-facing website (driedfishmatters.org)
  • A downloadable PDF version of the report in the DFM Zotero library
  • A brief description of the report, with cover thumbnail/preview and links to web and PDF versions, in the Working Papers listing on the DFM Wiki and public website.

Rationale

If our only goal is to distribute a manuscript shared by a research team, the simplest approach is to use the "Export as PDF" function within Microsoft Word to generate a shareable document, which can then be disseminated through the DFM mailing list and website.

The outputs listed above are intended to add value to the reports prepared by project co-investigators and collaborators by maximizing options for the dissemination and re-use of project data across multiple platforms. For example, storing each image on the DFM Wiki allows us to locate and re-use images embedded in various reports through the image catalogue. (If the same image is used in a subsequent report, our image fingerprinting tools should be able to locate the existing version in our catalogue.)

At the same time, this workflow implements mechanisms to streamline the design and copy-editing of reports. For example, we end up with a series of reports with the same branding; consistent citation, table, and figure formatting; and valid reference data for in-text citations and image captions.

Step 1. Copy-edit and format the source document

Before converting a Word document to other formats, we need to ensure that the layout is clean and semantically valid. If authors have used direct formatting instead of semantically defined styles (e.g., boldface type for headings), the document will not convert well into wikitext, HTML, EPUB, or PDF. In this step we will do some basic copy-editing of the document.

  1. Correct the formatting
    1. Go to File > Check for issues > Inspect document. Select all of the inspection options, then press the "Inspect" button. If the inspection finds any comments, tracked changes, document properties, etc., click "Remove All" for each.
    2. Ensure that headings follow a nested outline: Heading 2 (chapter), Heading 3 (subsection), Heading 4 (sub-subsection).
    3. Remove extra line breaks (between paragraphs, before and after headings or tables, line breaks used instead of page breaks, etc.).
  2. Review figures
    1. If pictures are inside of frames, cut and paste back into the document so they are no longer contained within a frame. (Complex layouts in Word will break when converted to other formats.)
    2. Ensure that each image in the document has a caption that fully describes the image and includes attribution (source and license data). Place the caption text in the alt text field for each image (right-click and choose "Edit alt text"). A group of images that shares a single caption should be split up, so that each image has its own caption.
    3. Graphs and SmartArt need to be embedded as images rather than as editable objects. Cut each one and paste it back into the document, selecting "paste as image".
  3. Review citations
    1. Check all in-text citations to ensure they are linked to the DFM Zotero library; create or edit citations as needed. NOTE: Contributors may have used Zotero to insert citations but linked them to a private library rather than the DFM group library. This will NOT work, because we have no way of retrieving the reference data from a private library. Such citations need to be transferred to the DFM group library and updated manually in the manuscript. Citations can also be updated in the DFM Wiki; see Help:Adding Zotero citations and Help:Importing text with Zotero citations from a word processor.
    2. Navigate to Zotero > Document Preferences > Switch to a different word processor. This will convert all the citations into hyperlinks.
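
Because the upload in Step 2 gives errors for images without alt text, it can help to check the manuscript before running it. The following is a minimal sketch, not part of the DFM tooling: the function name is hypothetical, and it assumes that Word stores the alt text in the "descr" attribute of each drawing's docPr element inside word/document.xml. Charts and shapes also carry docPr elements, so the list may include objects other than pictures.

```python
import xml.etree.ElementTree as ET
import zipfile

# Namespace of the DrawingML "wordprocessingDrawing" elements in a .docx file.
WP_NS = "http://schemas.openxmlformats.org/drawingml/2006/wordprocessingDrawing"


def drawings_missing_alt_text(docx_file):
    """Return the names of drawings whose alt text ("descr") is empty.

    docx_file may be a path or a binary file-like object.
    Hypothetical helper for illustration only.
    """
    with zipfile.ZipFile(docx_file) as archive:
        root = ET.fromstring(archive.read("word/document.xml"))
    missing = []
    for doc_pr in root.iter(f"{{{WP_NS}}}docPr"):
        # Treat a missing attribute and a whitespace-only value the same way.
        if not doc_pr.get("descr", "").strip():
            missing.append(doc_pr.get("name", "(unnamed)"))
    return missing
```

Calling drawings_missing_alt_text with the path to the cleaned manuscript returns an empty list when every drawing has alt text; any names it does return point to the objects that still need a caption.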

Step 2. Upload to the wiki

The docx2wiki bot script will update the image database with hash values for any new (unrecognized) images, upload new images from the current document to the wiki, then upload the document text.

  1. Run the docx2wiki command.
    1. Use the -pagename option to provide the target pagename on the wiki (i.e., the title of the document).
    2. Use the -input option to provide the path to the docx manuscript.
    3. Optionally, use the -db option to provide the path to an image fingerprint database.
  2. If successful, you will see the message "Page [[<pagename>]] saved".

Here is an example of the command:

python pwb.py docx2wiki -pagename:"Dried Fish in West Bengal, India: Scoping report" -input:"/mnt/c/users/Eric/Downloads/WBG/DFM_RPT_IITK_Revised-Scoping-Report_2022-02-09_clean.docx" 
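
When several reports need to be uploaded, the same invocation can be assembled in Python. This is a sketch under stated assumptions: the helper name is hypothetical, and the "-db:" form is assumed to take its value after a colon in the same way as -pagename and -input, which the example above does not show.

```python
import shlex


def build_docx2wiki_command(pagename, input_path, db_path=None):
    """Assemble the argument list for the docx2wiki bot script.

    Hypothetical helper; the option names come from the steps above.
    """
    args = [
        "python", "pwb.py", "docx2wiki",
        f"-pagename:{pagename}",
        f"-input:{input_path}",
    ]
    if db_path is not None:
        # Assumption: -db takes its value after a colon, like the other options.
        args.append(f"-db:{db_path}")
    return args


command = build_docx2wiki_command(
    "Dried Fish in West Bengal, India: Scoping report",
    "/mnt/c/users/Eric/Downloads/WBG/DFM_RPT_IITK_Revised-Scoping-Report_2022-02-09_clean.docx",
)
# shlex.join quotes each argument so the line can be pasted into a POSIX shell.
print(shlex.join(command))
```

Passing the list to subprocess.run(command) from the directory that contains pwb.py avoids shell quoting problems with titles that contain spaces, commas, or colons.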

The command should work fairly reliably; however, it will give errors if any image lacks a caption in its alt text field. Note also that images in unknown formats will be ignored; currently the only recognized formats are JPEG and PNG.
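
The idea of recognizing image formats and looking up hash values can be sketched as follows. This is an illustration only and not the docx2wiki implementation: the function names are hypothetical, and a SHA-256 digest is used as a stand-in fingerprint that matches byte-identical files only, whereas the actual image fingerprinting tools may use a different method.

```python
import hashlib

# File signatures ("magic bytes") for the two recognized formats.
PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"
JPEG_SIGNATURE = b"\xff\xd8\xff"


def detect_format(data):
    """Return "png", "jpeg", or None for unrecognized image data."""
    if data.startswith(PNG_SIGNATURE):
        return "png"
    if data.startswith(JPEG_SIGNATURE):
        return "jpeg"
    return None


def find_new_images(images, known_hashes):
    """Split images into new ones and ignored ones.

    images maps a file name to its raw bytes; known_hashes is a set of
    hex digests already in the database. Returns (new, skipped), where
    new maps each unrecognized file name to its digest and skipped lists
    the names of files in unknown formats.
    """
    new, skipped = {}, []
    for name, data in images.items():
        if detect_format(data) is None:
            skipped.append(name)  # unknown format: ignored, as noted above
            continue
        digest = hashlib.sha256(data).hexdigest()
        if digest not in known_hashes:
            new[name] = digest
    return new, skipped
```

In this sketch, only the entries in new would need to be uploaded, and the names in skipped indicate figures that should be re-saved as JPEG or PNG in the source document.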

Step 3. Review and clean up the wiki page

  1. Insert a metadata template
    1. At the top of the document, insert Template:Report metadata. Fill in the required fields: authors (separated with ampersands), abstract, series, number in the series, and institution.
  2. Check table formatting
    1. Manually check the tables for formatting errors. Word sometimes stores incorrect table layout data if columns, rows, or cells have been merged and re-split, or otherwise edited in non-standard ways. If a table is affected, delete it, then copy and paste the table from the source document into the wiki visual editor.
    2. If the table has a caption, move the table caption text into the actual table caption. (Word formats table captions as paragraphs.)
    3. Convert table headers to "Header Cell". (Word formats table headers as regular cells.)
  3. Check citation formatting
    1. Check citations. If the citation data could not be retrieved from the DFM Zotero library (there will be an error printed in red), locate the broken citation and edit the Zotero citation template within the citation field. Occasionally it is just a communication error with the Zotero server that causes the problem; purging the page (or saving again after making further edits) may fix the issue.
    2. Add a "Notes" header at the bottom of the document.
    3. If there were any footnotes or endnotes in the Word manuscript, convert them to footnotes in the wiki: use Cite > Basic to insert a note at the correct location in the document, then copy and paste the footnote text into it.
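
For editors who work in the source editor rather than the visual editor, the table steps above correspond to the "|+" caption line and "!" header cells of MediaWiki table markup. The following sketch generates that markup from placeholder values; the function name is hypothetical, and the "wikitable" class is the common MediaWiki default rather than a confirmed DFM Wiki convention.

```python
def to_wikitable(caption, headers, rows):
    """Return MediaWiki table markup with a real caption and header cells."""
    lines = ['{| class="wikitable"']
    if caption:
        lines.append(f"|+ {caption}")  # table caption, not a separate paragraph
    lines.append("|-")
    lines.extend(f"! {header}" for header in headers)  # "Header Cell" markup
    for row in rows:
        lines.append("|-")
        lines.extend(f"| {cell}" for cell in row)  # regular data cells
    lines.append("|}")
    return "\n".join(lines)


print(to_wikitable("Table caption", ["Column A", "Column B"], [["a1", "b1"]]))
```

Cell text that itself contains a pipe character would need escaping, which this sketch does not attempt.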

Step 4. Create a cover

Step 5. Generate PDF

Step 6. Distribute

  1. Add to Zotero
  2. Add to the DFM Working Papers listing
  3. Publish to the DFM website

Step 7. Publicize

  1. Post to the DFM blog
  2. Post to Twitter
  3. Post to the DFM mailing list
  4. Send message to authors