
Make the Document Readable

 

Now that we have both Liquid | Author and Liquid | Reader, I think it’s time to clarify the differences between an editable manuscript (in Author) and a published, ‘frozen’ document (a PDF opened in Reader), where published means made public and defined as done, at least as of a specific version. In analog times the distinction between a typewritten document and a typeset document was clear: one was produced in very limited numbers and the other was reproducible in large volumes. With digital documents this distinction has disintegrated.

The nearest we have are probably Word documents for manuscripts and PDFs for published documents, where the prime characteristic of the Word document is editability and that of the PDF is that it is frozen. However, software allows for different kinds of manipulations, so this is only a loose rule. The model described here does use PDF as the base published document, but this is subject to change as the world moves on and another document format, such as JATS, may step up. The notion of a private manuscript and a published document remains, however.

TL;DR / Summary / Abstract

This post makes the point that adding appendices to a document can usefully describe the semantics of the document so that reader software can present rich options to the user, rather than fixing information in hard-to-parse ways or embedding it in fragile meta-boxes.

First: Outside the Document

Universal Text Interactions

Powerful interactions should be possible for both categories of documents, and in my world this means Liquid | Flow interactions, where the user can select any text and instantly get a myriad of search results and transformations.

Document Connections

Citation Analysis / Concept Mapping

Citation analysis can be a very visual process based on system-extracted data about documents and how documents connect through citations. I put concept mapping in the same section here since both are based on how concepts or documents connect and are therefore both outside and in-between the documents.

Glossary

By glossary I mean definitions which are specific to a document, author, publisher or a field. The glossary systems I am concerned with have explicit connections to other glossary terms and documents and therefore can merge with concept mapping. I have blogged on various stages of this: http://wordpress.liquid.info/?s=glossary

Author’s Manuscript

Moving forward it will be important to define the interactions, and possible interactions, for each document type. This will really mean defining the Reader document (.pdf), since the Author document (.liquid) should have as rich interactions as possible. There is not much to say on this here since it is covered under all of the work for Author and general interactions. The purpose of the thoughts in this post is to clarify the role of the published document and how to expand and limit its potential interactions:

The Published Document

The defining characteristic of the published document is that it is a frozen substrate where the author’s work is not editable but it is annotatable and citable:

Annotations

Annotations are notes of varying sorts added by the reader ‘on top of’ the author’s work. The reasons for this include:

  • To augment comprehension of the document
  • To augment comprehension of the content of the document in a multi-document context
  • To share with other readers for discussion
  • To share with the author for comment
  • To find passages of text in the future for citing
  • & more

The reader should be able to highlight passages of text and to make any ‘mark’ they want to. The system should store these highlights and marks and make them as useful as possible for any reader in the future. This includes the ability to search an individual document or a set of documents for only the text which has been highlighted, either in the Reader application or as part of a citation or concept analysis.

The way Reader should handle annotating is simply to let the user highlight any text with a colour highlight (default yellow) and that’s it for the initial highlighting. In the future it should be possible to choose colours based on some meaning and to draw and doodle.

The annotations should be stored in such a way as to be accessible both to the Reader application (and any other PDF reader), for searching the document based only on annotated/highlighted text, and to an importing application such as Author, whose citation view can make connections and do other visualisations and interactions based on any keywords in the document and/or only the highlighted text.
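As a rough illustration of searching only highlighted text across a set of documents, here is a minimal sketch. The `Highlight` record, its fields and the file names are hypothetical and chosen for the example; they are not how Reader or the PDF format actually stores annotations.

```python
from dataclasses import dataclass


@dataclass
class Highlight:
    """One highlighted passage (hypothetical record, not Reader's storage format)."""
    document: str            # illustrative document identifier, e.g. a file name
    page: int
    text: str
    colour: str = "yellow"   # the post's default highlight colour


def search_highlights(highlights, query):
    """Return the highlights whose text contains the query, ignoring case."""
    needle = query.casefold()
    return [h for h in highlights if needle in h.text.casefold()]


# Made-up sample data spanning two documents.
highlights = [
    Highlight("paper-a.pdf", 3, "Citations connect documents"),
    Highlight("paper-b.pdf", 7, "A glossary defines terms"),
    Highlight("paper-a.pdf", 9, "High-resolution addressing of citations"),
]
matches = search_highlights(highlights, "citation")
```

Because each record carries its document and page, the same list could feed either an in-document search or a multi-document citation view.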

Citations

Citations are the means through which the reader can connect what they are themselves authoring to the source material in the published work.

This comes straight onto the issue of addressing, which I think is a prime issue to be dealt with and which I have blogged about quite a bit: http://wordpress.liquid.info/?s=addressing

The act of citing is the act of showing the source in relation to the author’s work and the act of reading a citation is the act of recognising the source and seeing if it adds credibility to the author’s work or seeing a new source which can then be investigated to check for relevance and veracity.

The act of adding a citation is currently generally absurd: source documents in PDF do not carry any useful meta-information other than what might be written in plain text in the document, such as a title, the names of the authors and only sometimes the publication date. Companies provide commercial services which search databases to give the user full citation information (but crucially, not the document itself) to help the user cite them. This is a key issue the Reader-Author interaction solves, with the Author-created PDF carrying the meta for Reader to allow the user to simply copy text and then paste it as a full citation: https://www.youtube.com/watch?v=Q-LnkuI2Qx8

(The important aspect of high-resolution addressing can come under this system, but that is not addressed here in detail)

Meta -> Visible ‘About this Document’

In the analog world, the information about a document had to be on the same substrate level as the content; there was no place to hide it. In digital documents, however, there can be a payload of information not visible to the user. In fact this is a requirement of digital documents, since they need a way to convey to the operating system and to reader/editor software what the document is, how it should be displayed and how it can be interacted with. This can clearly be useful, as with the EXIF data of a photograph, which contains a lot of information about the technical circumstances in which the picture was taken, and it has the potential for adding all the citation information (and more) to a document. But there are two issues: publishers (software and companies) usually do not include this meta information, and it gets stripped out on changing formats or printing.

When Jacob implemented the ability to copy the document’s BibTeX textual citation information, however, I learnt that this is findable information for a system, since it starts with a unique and identifiable string. As such, when a user copies a BibTeX entry from a download site to use in Author, the user does not need to copy only the BibTeX text: if the whole web page including the BibTeX is copied, Author will easily parse the text, find the BibTeX and use it.
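To make the idea of finding BibTeX inside a larger block of copied text concrete, here is a minimal sketch. It assumes the identifiable string is the usual `@type{` opening of a BibTeX entry and that the entry ends at the matching closing brace; the function name `find_bibtex` and the sample page text are made up for illustration, and this is not Author's actual parser.

```python
import re

# Assumption: an entry opens with "@", an entry type, then "{", e.g. "@article{".
ENTRY_START = re.compile(r"@[A-Za-z]+\s*\{")


def find_bibtex(text):
    """Return the first brace-balanced BibTeX entry found in text, or None."""
    for match in ENTRY_START.finditer(text):
        depth = 0
        # Start scanning at the opening brace that ends the matched prefix.
        for i in range(match.end() - 1, len(text)):
            if text[i] == "{":
                depth += 1
            elif text[i] == "}":
                depth -= 1
                if depth == 0:
                    return text[match.start():i + 1]
    return None


# Made-up copy of a whole download page, with page clutter around the entry.
page = (
    "Download PDF | Share\n"
    "@article{doe2019,\n"
    "  author = {Doe, Jane},\n"
    "  title = {A {Visible} Record},\n"
    "  year = {2019}\n"
    "}\n"
    "Copyright notice"
)
entry = find_bibtex(page)
```

Counting braces rather than stopping at the first `}` keeps nested braces inside field values from cutting the entry short.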

This gave me the most obvious revelation: humans can read the visible text in documents and so can computer systems, so why not stop worrying about embedding meta and instead leave it visible? This is why Author now has the option to export the BibTeX for the document at the end of the document as plain text, under the heading ‘BibTeX’. It means that Reader opens the document, ‘reads’ it and finds the BibTeX, then uses it when the user performs a basic copy by appending it to the clipboard. When the user then pastes back into Author this is made available, and on paste a dialog asks the user: Paste as plain text or use the embedded BibTeX to paste as a citation? The result is that a simple copy and paste becomes a fully formatted citation where the application accepting the paste (in this case Author) ‘knows’ that this is a citation.
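The copy and paste flow described above could be sketched roughly as follows. Only the ‘BibTeX’ heading comes from this post; the function names, the clipboard layout (selection, blank line, heading, entry) and the shape of the paste options are assumptions made for illustration, not the actual Reader or Author implementation.

```python
MARKER = "BibTeX"  # the plain-text heading named in the post


def split_visible_bibtex(text):
    """Split text at the last line that is exactly the heading: (body, bibtex or None)."""
    lines = text.splitlines()
    for i in range(len(lines) - 1, -1, -1):
        if lines[i].strip() == MARKER:
            body = "\n".join(lines[:i]).rstrip()
            bibtex = "\n".join(lines[i + 1:]).strip()
            return body, bibtex
    return text, None


def copy_with_citation(selection, document_text):
    """Reader side (assumed layout): append the document's visible BibTeX to the copy."""
    _, bibtex = split_visible_bibtex(document_text)
    if bibtex:
        return selection + "\n\n" + MARKER + "\n" + bibtex
    return selection


def paste_options(clipboard_text):
    """Author side (assumed): the choices a paste dialog could offer."""
    quote, bibtex = split_visible_bibtex(clipboard_text)
    if bibtex:
        return {"plain": quote, "citation": {"quote": quote, "bibtex": bibtex}}
    return {"plain": clipboard_text}


# Made-up document text ending in a visible BibTeX section.
document = (
    "Intro text.\nMore text.\n\n"
    "BibTeX\n@book{doe2019, title = {Example}, year = {2019}}\n"
)
clipboard = copy_with_citation("More text.", document)
options = paste_options(clipboard)
```

Because everything travels as plain text, an application that knows nothing about the convention still receives a readable quote followed by its source.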

The next step from this perspective is to encourage software vendors to produce PDF documents where the visual information contains semantic values, rather than expecting hidden information to do the job. In terms of archiving and data transfer this is useful, but it’s also useful now, to make the systems richer and more robust.

Have a section at the end of the document with the BibTeX as citation information, and don’t call it meta; simply call it information. Since it’s clearly marked, any reader can use it in the same way as Reader / Author does.

And let’s go further. Let’s use such an appendix to describe the formatting of the document, including how headings are formatted and so on. This should allow for complete compatibility with basic PDF readers but also allow new readers to extract semantic values to allow for richer interactions, such as automatic headings interactions, citation display and interactions and so on.

This could put an end to the absurd academic time-waste of nit-picking how citations should be displayed: let the teacher/examiner/reader specify how the citations should be displayed. Since the document describes in the appendix how citations are used, the reader software can re-format them to the reader’s tastes.
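As a small illustration of re-formatting on the reader's side, the sketch below renders the same citation fields in two different ways. The two style names and templates are invented for the example and do not correspond to any official citation style; the field values are made up.

```python
# Hypothetical display styles chosen by the reader, not official citation styles.
STYLES = {
    "author_year": "{author} ({year}). {title}.",
    "title_first": "\"{title}\", {author}, {year}.",
}


def format_citation(fields, style):
    """Render citation fields using the reader's chosen display style."""
    return STYLES[style].format(**fields)


# Made-up citation fields, as they might be read from a visible BibTeX entry.
fields = {"author": "Doe, Jane", "year": "2019", "title": "A Visible Record"}

as_author_year = format_citation(fields, "author_year")
as_title_first = format_citation(fields, "title_first")
```

The document only has to state the fields once; the choice of template stays with whoever is reading.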

This can further be used to work with glossaries and much more, and will be robust enough that the document can even be printed out and scanned and all will be retained.

I am putting my money where my mouth is by demonstrating this interaction via Author and Reader, but this is as open as it can possibly be, and the end user can seriously benefit from such a very open, rich-information interchange.

 

Note: This became the Visible-Meta approach.

 

