Service/App to Assign Correct Document Name & Authors etc to PDF

  1. The user would drag & drop PDFs onto this app (or use a web service) and the system would extract the title of the document (which the file name often is not) and search Mendeley (as in the example below) to extract citation information (author, publisher, year, etc.).
  2. Search results are shown in a dialogue. If none are found or if they are wrong, the user can manually type in the name of the document or paste it.
  3. It will then assign the found name as the document's file name and write the other citation information into the PDF's metadata, for use by other applications, as shown below.
  4. Example Mendeley search:
  5. This will allow applications like Liquid | Author and others to easily allow the reader to copy text As Citation, saving potentially massive amounts of time.
  6. Here is a screenshot of cmd-i in Apple Preview, illustrating some of the embeddable meta-data:
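
The steps above can be sketched roughly as follows. This is only a minimal Python illustration under stated assumptions, not a working service: the `search` argument stands in for a Mendeley (or other) lookup whose real API is not shown here, every function name is a hypothetical placeholder, the title guess is a crude heuristic, and `/Publisher` and `/Year` are assumed custom keys (only `/Title` and `/Author` here are standard PDF document information fields). Actually writing the fields into the file would need a PDF library, which is left out.

```python
import re
from typing import Callable, Optional


def guess_title(first_page_lines: list[str]) -> str:
    """Step 1: guess the title as the first non-empty line of page one (a crude heuristic)."""
    for line in first_page_lines:
        if line.strip():
            return line.strip()
    return ""


def choose_citation(
    title: str,
    search: Callable[[str], list[dict]],
    manual_title: Optional[str] = None,
) -> dict:
    """Step 2: search by title; fall back to a manually typed title if nothing is found."""
    results = search(title)
    if results:
        return results[0]  # stands in for the user picking a result in the dialogue
    return {"title": manual_title or title, "authors": [], "publisher": "", "year": None}


def to_pdf_info(citation: dict) -> dict[str, str]:
    """Step 3: map citation fields onto PDF document information keys."""
    info = {"/Title": citation["title"]}
    if citation.get("authors"):
        info["/Author"] = "; ".join(citation["authors"])
    # Assumed custom keys: there is no standard Info entry for publisher or year.
    if citation.get("publisher"):
        info["/Publisher"] = citation["publisher"]
    if citation.get("year"):
        info["/Year"] = str(citation["year"])
    return info


def safe_filename(title: str) -> str:
    """Step 3: turn the found title into a file name, dropping characters file systems reject."""
    cleaned = re.sub(r'[\\/:*?"<>|]+', " ", title)
    return " ".join(cleaned.split()) + ".pdf"


def stub_search(title: str) -> list[dict]:
    """Stand-in for a real citation lookup; the record below is invented placeholder data."""
    if title == "An Example Paper":
        return [{"title": "An Example Paper", "authors": ["A. Author", "B. Author"],
                 "publisher": "Example Press", "year": 2017}]
    return []


lines = ["", "  An Example Paper  ", "A. Author, B. Author"]
citation = choose_citation(guess_title(lines), stub_search)
print(safe_filename(citation["title"]), to_pdf_info(citation))
```

Passing the search in as a function keeps the sketch independent of any one citation service, so the same flow would work with a different catalogue.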

Rich PDF. Open Access. …

Wednesday meeting with Christopher, Mark and now, Adam Procter. Notions discussed include Open Publishing, and a new name for a ‘rich’ Open Access PDF with EXIF was suggested by Chris. Use .tif self describing. Very much in line with my rich PDF notion.

Annotate eprints. No

Paragraph level addressability. Hmmm…

Eprints Labs. THIS is what we will try to make happen. Author can be on both sides of the workflow of course. Could be brilliant to show at The Future of Text.

‘Making PDF Great Again’

The Goal

I am looking at producing a document format which will be used for ‘publishing’ academic documents which provides ‘a rich amount of data’ for a reader to interact with, not thin documents like current PDFs.

  • By ‘publishing’ I mean making public, in a way which freezes the document, much like making a printed document public.
  • By ‘a rich amount of data’ I mean keeping any meta-data the author would want to include, as well as any advanced views of the document, such as something in the style of Liquid Views:

The Challenge

The ubiquity of PDFs will make it hard to challenge the current workflow.

The Workflow to Augment

The purpose of this project is to augment the full academic and scientific workflow – the interoperability feature is required to fully support the full lifecycle:

•  The Literature Review which currently happens through ‘thin’ PDF documents.
•  Performing Experiments
•  Developing The Thesis
•  Authoring
•  Collaboration & Review
•  Publishing. This is where the current PDF publishing method strips out a large amount of contextual data generated in the process, leaving the person who is doing a literature review based on this document with only a thin sliver of what was generated.

The Proposal : Rich PDF

Many in the academic world are well versed in writing for export via LaTeX to make sure their documents appear in the appropriate academic layout style.

The proposal here is to help the author tag their document as though exporting with LaTeX-style tagging/formatting, to use this tag data to export a properly formatted PDF, and to add a full XML sheet of all the attributes the author would like to have persist with the document (removing any data, such as earlier drafts, which the author may not want included).

For those who are not used to LaTeX, all they will notice is that they are highlighting and tagging the document.


The user simply marks up the important elements of the document, and these elements are preserved even though the published form is PDF.
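
As a rough illustration of what such an XML sheet could look like, here is a minimal Python sketch. No spec exists yet, so everything in it is an assumption: the `richpdf` root element, the `element` and `type` names, the version number and the `build_rich_sheet` function are all hypothetical. It only shows the idea of keeping the author's tagged elements while leaving out anything, such as earlier drafts, that the author does not want published.

```python
import xml.etree.ElementTree as ET


def build_rich_sheet(tagged, exclude=frozenset({"draft"})):
    """Build an XML sheet from (tag, text) pairs, skipping excluded tag types.

    The element and attribute names are placeholders, not an agreed format.
    """
    root = ET.Element("richpdf", {"version": "0.1"})
    for kind, text in tagged:
        if kind in exclude:
            continue  # e.g. earlier drafts the author does not want to persist
        element = ET.SubElement(root, "element", {"type": kind})
        element.text = text
    return ET.tostring(root, encoding="unicode")


tagged = [
    ("title", "Rich PDF"),
    ("draft", "An earlier wording the author discarded"),
    ("heading", "The Goal"),
]
print(build_rich_sheet(tagged))
```

The resulting string could then be attached to the exported PDF by whatever embedding mechanism the eventual spec settles on.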


Open Document in Standard PDF Reader

Anyone can then open the resulting .pdf document in any PDF reader to read the document as they expect.

Open Document in Rich PDF Reader

Any user with a Rich PDF aware application can open the document and the PDF rendered layout can be thought of almost as a label which is not used; the rich data from the document is still contained within and can be used for a rich reading experience.

Specific Benefits

All the advanced features of the authoring environment will be retained. For example:

•  Free-form layouts for mind maps/concept maps will have all the 2D/3D layout data preserved for the reader to view how the author saw the connections in the document.
•  Live formulas will be preserved for the reader to interact with.
•  Rich data can be included, in compressed form.
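
To make the last point concrete, here is a small Python sketch of packing rich data in compressed form and recovering it unchanged. It assumes the data is serialised as JSON and compressed with zlib; both choices, the `pack` and `unpack` names, and the mind map node fields are illustrative assumptions rather than part of any agreed format.

```python
import json
import zlib


def pack(data: dict) -> bytes:
    """Serialise the rich data as JSON and compress it with zlib."""
    return zlib.compress(json.dumps(data, sort_keys=True).encode("utf-8"))


def unpack(blob: bytes) -> dict:
    """Reverse of pack(): decompress and parse back into a dictionary."""
    return json.loads(zlib.decompress(blob).decode("utf-8"))


# Hypothetical mind map layout: node positions the author arranged by hand.
layout = {"nodes": [{"id": i, "x": i * 10, "y": i * 5} for i in range(200)]}

blob = pack(layout)
raw_size = len(json.dumps(layout, sort_keys=True).encode("utf-8"))
print(raw_size, len(blob), unpack(blob) == layout)
```

Because the round trip is lossless, a rich reader would see exactly the layout the author saved, while the payload adds less to the file size than the uncompressed data would.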

Adoption & Power

This way we have an entirely new document format which will enable the user to richly and powerfully interact with the data in a published, ‘archived’ document, yet everyone can still open the basic document in any PDF reader.

Producers of authoring software can then further innovate with advanced reading functionality, knowing that documents can support the interactions.

Development Opportunities

If my word processor features an advanced layout then another developer’s software can choose to be able to use the data from my word processor or offer alternative views, but not destroy the data which was generated by my word processor, so the user can always go back and view it in my application.
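
One way to picture this rule is to give each application its own section in the rich data, so that adding a view never touches anyone else's. The Python sketch below is only an assumption about how that might work: the `apps` key, the application identifiers and both function names are hypothetical.

```python
import copy
from typing import Optional


def add_app_data(document: dict, app_id: str, data: dict) -> dict:
    """Return a new document with `data` stored under `app_id`.

    Sections written by other applications are copied over untouched.
    """
    updated = copy.deepcopy(document)
    updated.setdefault("apps", {})[app_id] = copy.deepcopy(data)
    return updated


def read_app_data(document: dict, app_id: str, fallback_id: Optional[str] = None):
    """Read an application's own section, or optionally fall back to another app's data."""
    apps = document.get("apps", {})
    if app_id in apps:
        return apps[app_id]
    return apps.get(fallback_id)


original = {"apps": {"my.wordprocessor": {"layout": "mindmap"}}}
extended = add_app_data(original, "other.app", {"layout": "outline"})
print(extended["apps"]["my.wordprocessor"], extended["apps"]["other.app"])
```

Here the second application adds its alternative view alongside the first one's layout, and the reader can still reopen the document in the original word processor with its data intact.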


In order to create such a system we will need to evaluate current rich document interchange standards and write up a spec based on this for the XML in PDF data, which we will then share with our friends in the word processor development community, where we will then settle on a standard which will be useable across the board.

I will be in touch with developer friends once a basic in-house spec has been generally agreed upon.

PostScript : Why Now?

“…the way that scientists share their results once they have them … it’s a PDF in a journal that you can download and you can look at charts and graphs and read a static document, that’s what it is basically, there might be the odd chance to download a bit of data or watch a video or look at some pictures but for the most part it’s the same technology that’s been around for hundreds of years albeit you can now download it in a PDF and to Freeman’s mind, and it’s a fair point, that’s ridiculous…”

 Jeremy Freeman of the Chan Zuckerberg Initiative, referred to 14 minutes into the Wired podcast

(Since Stuart Arnott brought the Freeman podcast to my attention I have been thinking (again) about how to re-invent academic documents. I sent a page of suggestions to a few people, including my advisor Les Carr, Joe Corneli (who has the very interesting notion of Scholia to put in the mix), Christopher Gutteridge, who shredded my proposal (thanks, it was useful), and Mark Anderson, with whom I had a long chat, which should have been recorded, but these chats are often in loud student areas so it likely would have been horrible to listen to anyway.)

This led to the concept presented here, for which I am grateful to Jeremy Freeman for providing the impetus.