
Category: Notes On…

This category is for writings I consider fuller articles than the very brief glossary terms or other posts.

Making Information Self Aware

We can fight fake news and find more useful information in the academic and scientific publishing tsunami if we make the information self aware, that is, if the information knows what it is. This is not a suggestion of Harry Potter level magical fantasy but a concrete act we can start with today to lay the foundation for massive future improvement.

the intelligent environment

Many years ago I read an interview with one of the developers of the computer game Crysis in which he was lauded for the quality of the AI of the opponents in the game. He said that making the AI was not really the hard part; making the different parts of the environment aware of their attributes was the key. If a tree trunk is thick, then the enemy can hide behind it. If it is dense, then it will also serve as a shield, up to a point.

the self aware document

This is what we can and must do to documents. We must encode the meaning in documents as clearly as possible so that the document may be read by both software and humans. The document must be aware of who authored it, when, what its title is and so on, to provide at least the minimal context for useful citations.

It should also know what citations it contains, what any charts and graphs mean, what glossary terms are used and how they connect. Of course, we call this ‘metadata’, information about information, and the term has been used in many ways for many years now, but the metadata has so far been hidden inside the document, away from direct human and system interaction. We should maybe instead call it ‘hiddendata’. For some media this is actively used, such as the EXIF data in photographs, but it is lost when the photograph changes format, is inserted into other media or is printed. For text-based documents this is certainly possible today, but it is seldom actually used, it is not usefully read by the reader software and it is lost on printing.

bibtex foundation

You may well feel that this is simply a call for yet another document format, but it is not. It is a call for a new way to add academic ‘industry-standard’ BibTeX style metadata to any document, starting with PDFs, in a robust, useful and legacy friendly way: by adding a final appendix to the document in a format which is visually human-readable (hence BibTeX) and therefore also machine parseable.

This appendix will include who authored the information, which the reading software can ‘understand’, making it possible for the user to simply copy text from the document and paste it as a full citation into a new document in one operation. This makes citations easier, quicker and more robust. Further information can be described for reader-software parsing, such as how the headings are formatted (so that the reader software can re-format the document if required, for example to show academic citation styles in the preference of the reader if it differs from the preference of the author), what citations are used, what glossary terms are used, what the data in tables etc. contains and more.
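To make the idea of a final appendix concrete, here is a minimal sketch in Python of how authoring software might write one. It is an illustration only, not the Visual-Meta format itself: the function name build_appendix, the citation key and the plain ‘BibTeX’ heading line are assumptions made for this example.

```python
def build_appendix(entry_type, key, fields):
    """Return a plain-text appendix holding one BibTeX-style entry.

    The 'BibTeX' heading line and the layout are assumptions for this sketch.
    """
    lines = ["BibTeX", "", f"@{entry_type}{{{key},"]
    for name, value in fields.items():
        # Each field is written as: name = {value},
        lines.append(f"  {name} = {{{value}}},")
    lines.append("}")
    return "\n".join(lines)


# Hypothetical citation key; author and title are taken from this article.
appendix = build_appendix(
    "article",
    "hegland_self_aware",
    {
        "author": "Frode Hegland",
        "title": "Making Information Self Aware",
    },
)
print(appendix)
```

Because the result is ordinary text placed on the last pages, a legacy reader shows it as is, while an aware reader can look for the entry and parse it.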

more connected texts

This is making the document say what it is, where it comes from, how it’s connected, what it means, and what data it contains. This is, in effect, making the document self aware and able to communicate with the world. These are truly augmented documents.

This will power simple parsing today and enable more powerful AI in the future in order to much better ‘understand’ the ‘intention’ of the author producing the document, by making documents readable.

This explicitly applies to documents and has the added benefit that even if they are turned into different formats and even if they are printed and scanned they will still retain the metadata. The concept is extensible to other textual media, but that is beyond this proposal.

visual-meta

I call this approach Visual-Meta and it is presented in more detail here: liquid.info/visual-meta.html. I believe this is important, so I have started the process of hosting a dialog with industry and have produced two proof-of-concept applications, one for authoring Visual-Meta documents and one for reading and parsing them: Liquid | Author and Liquid | Reader: www.liquid.info

paper

Digital capabilities run deeper than those of previous substrates, but even in the pursuit of more liquid information environments we should not ignore the power of the visual symbolic layer. We hide the meta at our peril. If we instead reveal it and include it in the visual document, we gain robustness through document format changes and even printing and scanning, and we gain archival strength without any loss of deep digital interactivity. This matters more and more as we discover how brittle our digital data is and how important rich interactivity is to enable the deeper literacy required to fight propaganda and to propagate academic discoveries which are often lost in the sheer volume of documents.

Furthermore, with the goal of more robust formats and of supporting the reading of printed books and documents, addressing information (as discussed in the Visual-Meta addressability post) can be printed in the footer of each page, so that hand-annotated texts can easily be scanned, OCR’d and entered into the user’s digital workflow automatically. Digital is magic. Paper is also magic. One day they will merge, but until then there is value to be had in using both to their strengths.

 

As we make our information aware,
we increase the potential of our own awareness

 

 


Visual-Meta Introduction

 

Visual-Meta is an approach to making a document’s metadata both machine and human readable by adding an appendix to the end of the document, based on BibTeX, with all the information needed to cite the document (author, title, date etc.) as well as clearly stating the values of any data (such as tables, lists, advanced layouts etc.) and glossary terms.

This visible metadata (plain text in the document) can then be parsed by a Visual-Meta aware PDF reader to enable functionality such as copying text and pasting it as a citation in one step.

Putting the metadata visually into the document means that even if the document format is changed or the document is printed and scanned, the data will still be a part of the document and compatibility with legacy readers is maintained since they will only see the metadata as plain text.

Adding human readable appendices to a PDF document which usefully describe the semantics of the document, and which are also machine readable, offers many benefits and workflow improvements in the academic document space, while adding no document overhead beyond a few plain text pages at the end of the document. This approach keeps compatibility with legacy PDF Readers while opening up rich opportunities for augmented Readers; legacy Readers will simply show a normal PDF with an appendix of BibTeX style information.

 

Augmentations

Visible-Meta Augmented Readers can provide the user with interactions as rich as those of a custom authoring environment: publishing and freezing onto PDF is no longer a limitation. Advanced interactions can include:

  • Copy As Citation using a simple copy command, with all citation information added to the clipboard payload for use by Visible-Meta aware applications on Paste.
  • Instant Outline based on the document specifying heading formatting.
  • Dynamic Views, such as the one implemented in Liquid | Author, could be stored as data and not only as images.
  • Server Access. Repositories can extract information for large scale analysis.
  • Glossary Support. Glossary terms could be added to the appendix.
  • High Resolution, Document Based Addressing. The Name of the document is not the same as the Title, and this can be used to address by document rather than by location and to support High-Resolution Addressing.
  • & more, to be discovered.

 

Benefits

For an author this approach means that they can embed more rich information in their document with a minimum of effort and be sure of the robustness of the information.

It allows the reader a much faster way to cite with a higher degree of accuracy and more access to the original data and interactions.

Augmented textual communication. Using the appendices to describe the document content, such as the formatting of headings and citations as well as the use of glossaries, can allow the reading software to present the document to the reader’s preference without losing the creator’s semantics.

Server Friendly, which allows for large scale citation and other document element analysis. University of Southampton’s Christopher Gutteridge, one of the people behind the university repository, elaborates on this.

Institutions can worry less about the cosmetics of citations and benefit from more documents cited being checked and read.

This could put an end to the absurd academic time-waste of nit-picking how citations should be displayed: let the teacher/examiner/reader specify how the citations should be displayed. Since the document has described in the appendix how citations are used, they can be re-formatted to the reader’s tastes.

Universities still get to dictate the default handing-in formatting but the same document could be displayed in any format the reader chooses.

 

Demonstration

Visual-Meta export is built in to the Liquid | Author word processor and parsing it can be done by the Liquid | Reader PDF reader application, both produced by the author of this article, Frode Hegland: www.liquid.info

Video demonstration of the concept (less than two minutes long): youtube.com/watch?v=Q-LnkuI2Qx8&feature=youtu.be

 

Example

Examples and a description of the format are posted: Visible-Meta Example & Structure.

 

Document Name

Note that the ‘document_name’ is distinct from the title and can be set automatically by the authoring software to help identify the document through search later: http://wordpress.liquid.info/addressability-supplemental-augmentation-for-visual-meta

 

 

Adoption Support

The first implementations will include links to actual code showing how to add this to other developers’ projects, dramatically reducing the implementation overhead.

 

Legacy Support

When using a supported Reader, the user can download a PDF and copy the BibTeX export format on the download page, then open the PDF in Reader and click ‘Assign BibTeX’. The BibTeX will be applied as an appendix and saved (along with a tag stating which source was used and when), the same as if the document had been natively exported with Visual-Meta. Only the citation information will be provided in this way; formatting etc. will not be available.

 

 

Legacy Augmentation

Manual

When using a supported Reader, the user can download a PDF and copy the BibTeX export format on the download page, then open the PDF in Reader and click ‘Assign BibTeX’. The BibTeX will be applied as an appendix and saved, the same as if the document had been natively exported with Visual-Meta. Only the citation information will be provided in this way; formatting etc. will not be available.

Server

Reader applications can also send non-visible-meta PDFs to a server, such as Scholarcy, to have the Visible-Meta extracted and appended.

 

 

Background

This work grew out of work on Liquid | Author: Visible-Meta Origins.

 

How This Relates To My PhD

This work has grown out of my PhD work at the University of Southampton under Dame Wendy Hall and Les Carr. It aims to solve infrastructure issues which hamper citation interaction and visualisations: Visual-Meta & my PhD.

 

Known Issues

There are many issues to be worked out, including how to refer to different authors of different chapters and what exactly to encode.


Make the Document Readable

 

Now that we have both Liquid | Author and Liquid | Reader I think it’s time to clarify the differences between an editable manuscript (in Author) and a published (made public/defined as done, at least to a specific version), ‘frozen’ document (PDF opened in Reader). In analog times this was a clear distinction, like that between a typewritten document and a typeset document: one was produced in very limited numbers and the other was reproducible in large volumes. With digital documents this distinction has disintegrated.

The nearest we have are probably Word documents for manuscripts and PDF for published documents, where the prime characteristic of the Word document is editability and that of the PDF is that it is frozen. However, software allows for different kinds of manipulations so this is only a loose rule. The model described here does use PDF as the base published document, but this is subject to change as the world moves on and another document format, such as JATS, may step up. The notion of a private manuscript and a published document remains, however.

TL;DR / Summary / Abstract

This post makes the point that adding appendices to a document can usefully describe the semantics of the document for the reader software to present rich options to the user, rather than fixing information in hard to parse ways or embedding it in fragile meta-boxes.

First : Outside the Document

Universal Text Interactions

Powerful interactions should be possible for both categories of documents and in my world this means Liquid | Flow interactions where the user can select any text and instantly get a myriad of search results and transformations done.

Document Connections

Citation Analysis / Concept Mapping

Citation analysis can be a very visual process based on system-extracted data about documents and how documents connect through citations. I put concept mapping in the same section here since both are based on how concepts or documents connect and are therefore both outside and in-between the documents.

Glossary

By glossary I mean definitions which are specific to a document, author, publisher or a field. The glossary systems I am concerned with have explicit connections to other glossary terms and documents and therefore can merge with concept mapping. I have blogged on various stages of this: http://wordpress.liquid.info/?s=glossary

Author’s Manuscript

Moving forward it will be important to define the interactions, and possible interactions, for each document type. This will really mean defining the Reader document (.pdf) since the Author document (.liquid) should have as rich interactions as possible. There is not much to say on this here since this is covered under all of the work for Author and general interactions. The purpose of the thoughts in this post is to clarify the role of the published document and how to expand and limit its potential interactions:

The Published Document

The defining characteristic of the published document is that it is a frozen substrate where the author’s work is not editable but it is annotatable and citable:

Annotations

Annotations are notes of varying sorts added by the reader ‘on top of’ the author’s work. The reasons for this include:

  • Augment comprehension of the document
  • Augment comprehension of the content of the document in a multi-document context
  • To share with other readers for discussion
  • To share with the author for comment
  • To find passages of text in the future for citing
  • & more

The reader user should be able to highlight passages of text and to make any ‘mark’ they want to. The system should store these highlights and marks and make them as useful as possible for the/a reader in the future. This includes the ability to search an individual document or a set of documents for only text which has been highlighted, either in the Reader application or as part of a citation or concept analysis.

The way Reader should handle annotating is simply to let the user highlight any text with a colour highlight (default yellow) and that’s it for the initial highlighting. In the future it should be possible to choose colours based on some meaning and to draw and doodle.

The annotations should be stored in such a way as to be accessible to the Reader application and any other PDF reader, for searching the document based on only annotated/highlighted text, and to an importing application, such as Author, for the citation view to make connections and do other visualisations and interactions based on any keywords in the document and/or only highlighted text.
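As a rough sketch of what searching only highlighted text could look like, the Python below stores highlights as simple records and filters them by a query. The Highlight record, its field names and the sample file names are assumptions for illustration; they do not describe how Reader actually stores annotations.

```python
from dataclasses import dataclass


@dataclass
class Highlight:
    document: str           # file the highlight belongs to (hypothetical names below)
    page: int
    text: str               # the highlighted passage itself
    colour: str = "yellow"  # default highlight colour, as described above


def search_highlights(highlights, query):
    """Return highlights whose text contains the query, ignoring case."""
    needle = query.casefold()
    return [h for h in highlights if needle in h.text.casefold()]


highlights = [
    Highlight("paper-a.pdf", 3, "Citations connect the reader to the source material."),
    Highlight("paper-a.pdf", 7, "Annotations are added on top of the author's work."),
    Highlight("paper-b.pdf", 1, "High-resolution addressing of citations."),
]

for hit in search_highlights(highlights, "citations"):
    print(hit.document, hit.page, hit.text)
```

The same list could be handed to an importing application, which would then work with only the passages the reader chose to mark.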

Citations

Citations are the means through which the reader can connect what they are themselves authoring to the source material in the published work.

This leads straight to the issue of addressing, which I think is a prime issue to be dealt with and which I have blogged about quite a bit: http://wordpress.liquid.info/?s=addressing

The act of citing is the act of showing the source in relation to the author’s work and the act of reading a citation is the act of recognising the source and seeing if it adds credibility to the author’s work or seeing a new source which can then be investigated to check for relevance and veracity.

The act of adding a citation is currently generally absurd, with source documents in PDF not carrying any useful meta-information other than what might be written in plain text in the document as a title and names of the authors, and only sometimes the publication date. Companies provide commercial services which search databases and supply full citation information to the user (but crucially, not to the document itself) to help the user cite them. This is a key issue the Reader-Author interaction solves, with the Author-created PDF carrying the meta for Reader to allow the user to simply copy text and then paste it as a full citation: https://www.youtube.com/watch?v=Q-LnkuI2Qx8

(The important aspect of high-resolution addressing can come under this system, but that is not addressed here in detail.)

Meta -> Visible ‘About this Document’

In the analog world the information about a document had to be on the same substrate level as the content; there was no place to hide it. In digital documents, however, there can be a payload of information not visible to the user. In fact it is a requirement of digital documents, since they need a way to convey to the operating system and reader/editor software what the document is, how it should be displayed and how it can be interacted with. This can clearly be useful, such as with the EXIF data of a photograph containing a lot of information about the technical circumstances of the taking of the picture, and it has potential for adding all the citation information, and more, to a document. But there are two issues: publishers (software and companies) usually do not include this meta information, and it gets stripped out on changing formats or printing.

When Jacob implemented the ability to copy the document’s BibTeX textual citation information, however, I learnt that this is findable information for a system, since it starts with a unique and identifiable string. As such, when a user copies BibTeX from a download site to use in Author, the user does not need to copy only the BibTeX text: if the whole web page including the BibTeX is copied, Author will easily parse the text, find the BibTeX and use it.
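Finding a BibTeX entry inside a larger blob of copied text can be sketched in a few lines of Python. This is not Author’s actual parser, only an illustration of the principle: look for the identifiable ‘@type{’ opening and then match braces until the entry closes. The function names and the sample page text are invented for the example, and the field parser deliberately handles only simple, un-nested values.

```python
import re

# A BibTeX entry starts with "@", an entry type and an opening brace.
ENTRY_START = re.compile(r"@(\w+)\s*\{")


def find_bibtex_entry(text):
    """Return the first brace-balanced BibTeX entry found in text, or None."""
    match = ENTRY_START.search(text)
    if match is None:
        return None
    depth = 0
    # Start counting at the opening brace that ends the matched prefix.
    for index in range(match.end() - 1, len(text)):
        char = text[index]
        if char == "{":
            depth += 1
        elif char == "}":
            depth -= 1
            if depth == 0:
                return text[match.start():index + 1]
    return None  # the opening brace was never closed


def parse_fields(entry):
    """Extract simple `name = {value}` fields (values without nested braces)."""
    pairs = re.findall(r"(\w+)\s*=\s*\{([^{}]*)\}", entry)
    return {name.lower(): value.strip() for name, value in pairs}


# Invented sample: a copied web page with an entry in the middle of other text.
page = """Download PDF | Share | Cite
@article{example_key,
  author = {Frode Hegland},
  title = {Make the Document Readable},
}
Copyright notice and other page text."""

entry = find_bibtex_entry(page)
fields = parse_fields(entry)
print(fields["author"])
```

The surrounding page text is simply ignored, which is why the user does not need to select the BibTeX precisely.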

This gave me the most obvious revelation: humans can read the visible text in documents and so can computer systems, so why not stop worrying about embedding meta and instead leave it visible? This is why Author now has the option to export the BibTeX for the document at the end of the document as plain text, under the heading ‘BibTeX’. It means that when Reader opens the document it ‘reads’ it and finds the BibTeX, which it then appends to the clipboard when the user performs a basic copy. When the user then pastes into Author this is made available, and on paste a dialog asks the user: Paste as plain text or use the embedded BibTeX to paste as a citation? The result is that a simple copy and paste becomes a fully formatted citation where the application accepting the paste (in this case Author) ‘knows’ that this is a citation.
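The copy and paste flow described above can be sketched as two small Python functions. The payload is modelled as a plain dictionary and the user’s answer to the dialog as a boolean; both are assumptions for the sake of illustration, not how Reader or Author represent the clipboard.

```python
def copy_selection(selected_text, bibtex_entry=None):
    """Build a clipboard payload: the selected text plus the BibTeX, if found."""
    payload = {"text": selected_text}
    if bibtex_entry:
        payload["bibtex"] = bibtex_entry
    return payload


def paste(payload, as_citation):
    """Paste as a citation when the user asks for it and BibTeX is available."""
    if as_citation and "bibtex" in payload:
        return {
            "kind": "citation",
            "quote": payload["text"],
            "source": payload["bibtex"],
        }
    # Fall back to an ordinary plain-text paste.
    return {"kind": "plain", "text": payload["text"]}


# Invented sample entry standing in for the one found at the end of the PDF.
bibtex = "@article{example_key, author = {Frode Hegland}}"
payload = copy_selection("Humans can read the visible text", bibtex)

print(paste(payload, as_citation=True)["kind"])
print(paste(payload, as_citation=False)["kind"])
```

An application which knows nothing about the extra field still receives the plain text, so an ordinary paste keeps working everywhere.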

The next step from this perspective is to encourage software vendors to produce PDF documents where the visual information contains semantic values, not expecting hidden information to do the job. In terms of archiving and data transfer this is useful but it’s also useful now, to make the systems more rich and robust.

Have a section at the end of the document with the BibTeX as citation information and don’t call it meta, simply call it information. Since it’s clearly marked, any reader can use it in the same way as Reader / Author does.

And let’s go further. Let’s use such an appendix to describe the formatting of the document, including how headings are formatted and so on. This should allow for complete compatibility with basic PDF readers but also allow new readers to extract semantic values to allow for richer interactions, such as automatic headings interactions, citation display and interactions and so on.
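As a sketch of what a reader could do once the appendix describes heading formatting, the Python below maps declared styles to heading levels and builds an outline from the lines of a document. Everything here is hypothetical: the Style and Line records, the font values and the idea that the appendix has already been parsed into a style-to-level dictionary are assumptions, since the post does not define how formatting would be encoded.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Style:
    font: str
    size: int
    bold: bool


@dataclass(frozen=True)
class Line:
    text: str
    style: Style


def build_outline(lines, heading_styles):
    """Return (level, text) pairs for lines whose style is a declared heading."""
    outline = []
    for line in lines:
        level = heading_styles.get(line.style)
        if level is not None:
            outline.append((level, line.text))
    return outline


# Hypothetical styles, as if read from the appendix of the document.
H1 = Style("Helvetica", 18, True)
H2 = Style("Helvetica", 14, True)
BODY = Style("Times", 11, False)
heading_styles = {H1: 1, H2: 2}

lines = [
    Line("The Published Document", H1),
    Line("The defining characteristic of the published document ...", BODY),
    Line("Annotations", H2),
    Line("Annotations are notes of varying sorts ...", BODY),
    Line("Citations", H2),
]

outline = build_outline(lines, heading_styles)
for level, text in outline:
    print("  " * (level - 1) + text)
```

A basic PDF reader ignores all of this and shows the pages as usual; only a reader which understands the declared styles gains the outline.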

This could put an end to the absurd academic time-waste of nit-picking how citations should be displayed: let the teacher/examiner/reader specify how the citations should be displayed. Since the document has described in the appendix how citations are used, they can be re-formatted to the reader’s tastes.

This can further be used to work with glossaries and much more and will be robust enough to even be printed out and scanned and all will be retained.

I am putting my money where my mouth is by demonstrating this interaction via Author and Reader, but this is as open as it can possibly be and the end user can seriously benefit from such a very open, rich-information interchange.

 

Note: This became the Visible-Meta approach.

 
