‘Making PDF Great Again’

The Goal

I am looking at producing a document format for ‘publishing’ academic documents, one which provides ‘a rich amount of data’ for a reader to interact with, rather than thin documents like current PDFs.

  • By ‘publishing’ I mean making public, in a way which freezes the document, much like making a printed document public.
  • By ‘a rich amount of data’ I mean keeping any meta-data the author would want to include, as well as any advanced views of the document, such as something in the style of Liquid Views: http://www.liquid.info/view.html

The Challenge

The ubiquity of PDFs will make it hard to challenge the current workflow.

The Workflow to Augment

The purpose of this project is to augment the whole academic and scientific workflow. Interoperability is required to support the full lifecycle:

•  The Literature Review, which currently happens through ‘thin’ PDF documents.
•  Performing Experiments
•  Developing The Thesis
•  Authoring
•  Collaboration & Review
•  Publishing. This is where the current PDF publishing method strips out a large amount of the contextual data generated in the process, leaving the person who does a literature review based on this document with only a thin sliver of what was generated.

The Proposal : Rich PDF

Many in the academic world are well versed in writing for export in LaTeX, to make sure their documents appear in the appropriate academic layout style.

The proposal here is to help the author tag their document as though exporting with LaTeX-style tagging/formatting. This tag data is then used to export a properly formatted PDF, with a full XML sheet added which carries all the attributes the author would like to have persist with the document (any data which the author may not want included, such as earlier drafts, is removed).

For those who are not used to LaTeX, all they will notice is that they are highlighting and tagging the document.
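As a rough illustration of what such an XML sheet might look like, here is a minimal Python sketch. No spec exists yet, so the element names (richpdf, metadata, field, body, span), the version attribute and the function build_rich_sheet are all hypothetical. It only shows the idea: the tagged spans and the attributes the author wants to keep go in, and anything the author has excluded (such as earlier drafts) stays out.

```python
import xml.etree.ElementTree as ET


def build_rich_sheet(tagged_spans, metadata, exclude=("draft",)):
    """Build a hypothetical Rich PDF XML sheet as a string.

    tagged_spans: list of (tag, text) pairs marked up by the author.
    metadata:     dict of attribute name -> value to persist.
    exclude:      metadata keys the author does not want published.
    """
    # All element and attribute names here are assumptions, not a spec.
    root = ET.Element("richpdf", version="0.1")

    meta = ET.SubElement(root, "metadata")
    for key, value in metadata.items():
        if key in exclude:
            continue  # e.g. earlier drafts are removed before publishing
        ET.SubElement(meta, "field", name=key).text = value

    body = ET.SubElement(root, "body")
    for tag, text in tagged_spans:
        ET.SubElement(body, "span", tag=tag).text = text

    return ET.tostring(root, encoding="unicode")


sheet = build_rich_sheet(
    [("title", "Making PDF Great Again"), ("section", "The Goal")],
    {"author": "A. Author", "draft": "earlier wording the author wants removed"},
)
print(sheet)
```

How the sheet is then carried inside the PDF file is left open here; that is one of the things the spec would need to settle.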

 

The user simply marks up the important elements of the document and these elements are preserved even though the published form is PDF.

 

Open Document in Standard PDF Reader

Anyone can then open the resulting .pdf document in any PDF reader to read the document as they expect.

Open Document in Rich PDF Reader

Any user with a Rich PDF aware application can open the same document. For that application the rendered PDF layout can be thought of almost as a label which is not used; the rich data is still contained within the document and can be used for a rich reading experience.

Specific Benefits

All the advanced features of the authoring environment will be retained. For example:

•  Free-form layouts for mind maps/concept maps will have all the 2D/3D layout data preserved for the reader to view how the author saw the connections in the document.
•  Live formulas will be preserved for the reader to interact with.
•  Rich data can be included, in compressed form.
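To make the last point concrete, here is a minimal sketch of how a block of rich data could be stored in compressed form inside an XML sheet and recovered by a reader. The element name data, the encoding label and the helper names pack_data and unpack_data are assumptions for illustration only; zlib plus base64 is just one plausible choice, not a decision.

```python
import base64
import zlib
import xml.etree.ElementTree as ET


def pack_data(name, raw):
    """Compress raw bytes and wrap them in a (hypothetical) <data> element."""
    element = ET.Element("data", name=name, encoding="zlib+base64")
    # base64 keeps the compressed bytes safe to store as XML text.
    element.text = base64.b64encode(zlib.compress(raw)).decode("ascii")
    return element


def unpack_data(element):
    """Reverse pack_data: decode the text and decompress to the original bytes."""
    return zlib.decompress(base64.b64decode(element.text))


raw = b"1.0,2.0,3.0\n" * 1000          # stand-in for experimental data
element = pack_data("readings", raw)
restored = unpack_data(element)
print(len(raw), len(element.text), restored == raw)
```

Repetitive data like this stand-in compresses very well; real data sets will vary.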

Adoption & Power

This way we have an entirely new document format which enables the user to interact richly and powerfully with the data in a published, ‘archived’ document, yet everyone can still open the basic document.

Producers of authoring software can then further innovate with advanced reading functionality, knowing that documents can support the interactions.

Development Opportunities

If my word processor features an advanced layout, then another developer’s software can choose to use the data from my word processor or to offer alternative views. What it must not do is destroy the data which was generated by my word processor, so the user can always go back and view the document in my application.
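One way to picture this rule is a sheet in which each application owns its own view element, and other software only ever adds to the sheet. The sketch below assumes a hypothetical view element with app and kind attributes, made-up application identifiers, and two helper functions; none of this is an agreed format.

```python
import xml.etree.ElementTree as ET

# A sheet as my word processor might have written it (structure is assumed).
SHEET = """<richpdf>
  <view app="com.example.wordprocessor" kind="concept-map">
    <node id="n1" x="10" y="20">Goal</node>
  </view>
</richpdf>"""


def add_alternative_view(sheet_xml, app_id, kind):
    """Add a new view for another application without touching existing views."""
    root = ET.fromstring(sheet_xml)
    ET.SubElement(root, "view", app=app_id, kind=kind)
    return ET.tostring(root, encoding="unicode")


def views_for(sheet_xml, app_id):
    """Return the view elements that belong to one application."""
    root = ET.fromstring(sheet_xml)
    return [view for view in root.findall("view") if view.get("app") == app_id]


updated = add_alternative_view(SHEET, "org.example.reader", "outline")
original_views = views_for(updated, "com.example.wordprocessor")
print(len(original_views), original_views[0].find("node").text)
```

The design choice being illustrated is append-only: the second application contributes its own view, and the first application's layout data is still there to go back to.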

Implementation

In order to create such a system we will need to evaluate current rich document interchange standards and, based on this, write up a spec for the XML-in-PDF data. We will then share the spec with our friends in the word processor development community and settle together on a standard which will be usable across the board.

I will be in touch with developer friends once a basic in-house spec has been generally agreed upon.

PostScript : Why Now?

“…the way that scientists share their results once they have them … it’s a PDF in a journal that you can download and you can look at charts and graphs and read a static document, that’s what it is basically, there might be the odd chance to download a bit of data or watch a video or look at some pictures but for the most part it’s the same technology that’s been around for hundreds of years albeit you can now download it in a PDF and to Freeman’s mind, and it’s a fair point, that’s ridiculous…”

Jeremy Freeman of the Chan Zuckerberg Initiative, referred to 14 minutes into the Wired podcast https://overcast.fm/+OFMaEXgU

(Since Stuart Arnott brought the Freeman podcast to my attention I have been thinking (again) about how to re-invent academic documents. I sent a page of suggestions to a few people, including my advisor Les Carr, Joe Corneli (who has the very interesting notion of Scholia to put in the mix), Christopher Gutteridge, who shredded my proposal (thanks, it was useful), and Mark Anderson, with whom I had a long chat. The chat should have been recorded, but these chats are often in loud student areas so it would likely have been horrible to listen to anyway.)

This led to the concept presented here, for which I am grateful to Jeremy Freeman for providing the impetus.