In short: when it’s in a zoo… Bear with me.
A document is a record of something that has been observed. Such a record can be anything – a utility bill, an employment contract, a sculpture in a museum, or a painting on a wall. All are examples of documents describing something else. The utility bill records and describes your usage of gas and electricity. The employment contract records and describes the details of the handshake you gave at that final job interview. The sculpture or painting documents – there’s that word – a historical event. All these examples of documents are records of something observed by something or someone else.
Paul Otlet (1868-1944), the father of information science, is known for his observation that documents could be three-dimensional. As examples of such “documents” Otlet cites natural objects, artifacts, objects bearing traces of human activity (such as archaeological finds), explanatory models, educational games, and works of art.
Suzanne Briet (1894-1989), also known as “Madame Documentation”, states her case through the enumeration of six objects:
- Is the star in the sky a document? No.
- Is the photo of the star in the sky a document? Yes.
- Is the stone in the river a document? No.
- Is the stone in the museum a document? Yes.
- Is the antelope in the wild a document? No.
- Is the antelope in the zoo a document? Yes.
Suzanne Briet rules: an antelope running wild on the plains of Africa should not be considered a document. But once it is captured, taken to a zoo and made an object of study, it has been made into a document; it has been made evidence. So there is a process involved in making “the something” into a document – we call it documentation.
As humans, we’ve invented all kinds of devices to aid in the process of documentation: library cards, folders, URLs, bibliographies, tags, taxonomies, reference documents. They form part of the discipline that is documentation and the basis for content management.
With the advent of content management systems we seem to have lost some of the high-level abstract concepts that were clearly laid out in the early parts of the 20th century. As an industry sector, involved in content management, we’ve become too focussed on the implementation details of content management systems and the limitations that these systems face.
Context

“What is metadata? What is a document?” These questions typically go hand in hand and are often naively answered by: “the document is a file or a blob that is stored in a database but is difficult to manipulate, so the metadata, table rows and columns, are used to facilitate manipulation and describe the document”.
Metadata provides context with which to consume the document. You’ll have seen this in a zoo. You walk up to the antelope enclosure and there’s a plaque showing the name, the Latin name and a map of the world with a particular part of Africa highlighted. The plaque describes the antelope and its origin – metadata. The zoo is giving you context with which to understand the antelope document.
The same holds true for documents in a content management system. Documents are stored in a particular context described by their metadata. The folder, the author, draft/publish status, tags, taxonomy are all pieces of metadata to aid the consumer in consuming the document.
That consumer may be the content management system itself as it responds to the query “give me all documents in the /marketing folder” on behalf of a web visitor. The consumer can also be a records management system archiving documents “in a published state and that are older than 24 months”.
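The two queries above can be sketched in a few lines of Python. This is only an illustration, not how any particular content management system works: the `Document` class, its field names and the 730-day approximation of 24 months are all assumptions made for the example.

```python
from dataclasses import dataclass, field
from datetime import date, timedelta


@dataclass
class Document:
    """Hypothetical record: content plus the metadata that gives it context."""
    content: bytes
    folder: str
    author: str
    status: str  # assumed values: "draft" or "published"
    created: date
    tags: set = field(default_factory=set)


def in_folder(docs, folder):
    """The web-facing query: all documents in a given folder."""
    return [d for d in docs if d.folder == folder]


def due_for_archiving(docs, today, max_age_days=730):
    """The records-management query: published and older than roughly 24 months."""
    cutoff = today - timedelta(days=max_age_days)
    return [d for d in docs if d.status == "published" and d.created < cutoff]


# Made-up sample data, purely for illustration.
docs = [
    Document(b"Spring campaign", "/marketing", "alice", "published", date(2020, 1, 15)),
    Document(b"Autumn campaign", "/marketing", "bob", "draft", date(2023, 6, 1)),
    Document(b"Sales figures", "/sales", "carol", "published", date(2023, 7, 1)),
]

marketing = in_folder(docs, "/marketing")
archive = due_for_archiving(docs, today=date(2024, 1, 1))
```

Neither function ever looks at `content`. Both consumers find and handle documents purely through their metadata.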
Documents never exist without metadata, without context. For example, the print-out of sales figures that I’ve thrown in the wastebasket is a fully-fledged document of our company’s sales figures. Its location tells the person who picks it out of the wastebasket to treat (read “consume”) it as a discarded document.
I’ve seen this catch people out on a few content migration projects when they try to de-duplicate content repositories. They classify documents as duplicates based on their contents alone, without ever taking context into account. De-duplication is a tricky business because in doing so you are destroying metadata that has, rightly or wrongly, been created to help consume documents.
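To make the pitfall concrete, here is a minimal sketch that contrasts content-only de-duplication with a variant that also compares context. The dictionary layout, the `location` and `status` keys and the sample values are assumptions for illustration; a real migration would have to decide which metadata actually matters.

```python
import hashlib
from collections import defaultdict


def group_by_content(docs):
    """Naive approach: documents with identical bytes are treated as duplicates."""
    groups = defaultdict(list)
    for doc in docs:
        digest = hashlib.sha256(doc["content"]).hexdigest()
        groups[digest].append(doc)
    return groups


def group_by_content_and_context(docs, context_keys=("location", "status")):
    """Context-aware approach: identical bytes only count as duplicates
    when the chosen metadata fields match as well."""
    groups = defaultdict(list)
    for doc in docs:
        digest = hashlib.sha256(doc["content"]).hexdigest()
        context = tuple(doc.get(key) for key in context_keys)
        groups[(digest, context)].append(doc)
    return groups


# Made-up sample: the same print-out, filed once and discarded once.
figures = b"sales figures"
docs = [
    {"content": figures, "location": "/sales/reports", "status": "published"},
    {"content": figures, "location": "wastebasket", "status": "discarded"},
]

naive = group_by_content(docs)
aware = group_by_content_and_context(docs)
```

The naive grouping puts both copies in a single group, so "keeping one" silently throws away the fact that one of them was discarded. The context-aware grouping keeps them apart.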
The accurate consumption or manipulation of documents is intrinsically tied to the accuracy and completeness of their metadata. Is the print-out of sales data in the wastebasket to be trusted? Is the sales data accurate? How should the reader consume the document? Look at the metadata! It’s in the wastebasket. This raises a further question: did I mean to throw the print-out in the wastebasket? Is the metadata accurate? The reader can only make that decision with more metadata. The reader could phone or email me and ask: did you intend to discard that print-out? That creates more metadata and a better context with which to consume the document.
Content management systems merely store metadata; human beings create it – often by hand, sometimes using automated tools. The process of generating metadata or maintaining its accuracy is a human process. Computers don’t care about accuracy or completeness.
Adriaan Bloem, analyst at CMSWatch, touches on this by labeling enterprise search as a “brute force” approach. Adriaan also points out that metadata or context is necessary to communicate. He’s right – otherwise how do we make sense of a document?
What if metadata contains a document, i.e. when one document describes another? Doesn’t this form of reasoning collapse in on itself?
What if you took a photograph of the antelope and attached it to the information plaque outside the enclosure? That way, when the antelope is having an off-day and it’s hiding in the undergrowth, passers-by can still learn about it by reading the plaque. Now you’ve got one document (the photograph) describing another (the antelope), haven’t you? Aren’t both documents? Wrong.
We can describe documents with other documents. Suzanne Briet would argue that the antelope in the zoo is the primary document and any scholarly articles written about it are secondary documents. They provide context around the primary document. There is a document and there is context with which to interpret that document – metadata. Nothing else. Document… Metadata… Document… Metadata.
In an English-language sentence, “things” can be subjects as well as objects, yet can’t be both at the same time. In one situation the photograph is a document, described by metadata from a digital camera (exposure, shutter speed); in the other situation it is metadata describing the antelope.
Confused? What is metadata? In any given situation, ask yourself what the document is; by exclusion, everything else is metadata.
So what does this mean for content management systems? Are they all broken? Do we need metadata management processes as well as content management processes? Do we need a separate metadata lifecycle to run alongside a content lifecycle?
The answer to those questions is unfortunately – yes. Yes, we do need separate metadata management processes. Yes, we do need a separate metadata lifecycle. Unless… we stop building content management systems in the naive fashion of blobs for documents and table rows and columns for metadata. We need to start building these systems so that there is no technical distinction between the content store and the metadata store. Having separate stores for content and metadata forces us to duplicate our efforts by defining parallel processes to support the lifecycle of both document and metadata.
Ironic, since a promise of content management is the removal of duplication.
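One way to picture a store with no technical distinction between content and metadata is sketched below. It is a thought experiment rather than a design for a real system: the `Store` class, its method names and the item identifiers are all hypothetical. Every item is stored the same way, and whether it counts as “the document” or as “metadata” depends only on which item you are looking at.

```python
from collections import defaultdict


class Store:
    """Hypothetical single store: every item is kept the same way, whatever its role."""

    def __init__(self):
        self.items = {}                        # item id -> payload
        self.described_by = defaultdict(list)  # item id -> ids of items describing it

    def put(self, item_id, payload):
        self.items[item_id] = payload

    def describe(self, subject_id, descriptor_id):
        """Record that one stored item provides context for another."""
        self.described_by[subject_id].append(descriptor_id)

    def view(self, focus_id):
        """Treat focus_id as the document; whatever describes it is its metadata."""
        return {
            "document": focus_id,
            "metadata": list(self.described_by.get(focus_id, [])),
        }


store = Store()
store.put("antelope", "the animal in the enclosure")
store.put("plaque", "name, Latin name and map of origin")
store.put("photo", "photograph of the antelope")
store.put("camera-data", "exposure and shutter speed recorded by the camera")

store.describe("antelope", "plaque")
store.describe("antelope", "photo")
store.describe("photo", "camera-data")

antelope_view = store.view("antelope")  # the photo appears here as metadata
photo_view = store.view("photo")        # the same photo is now the document
```

Because the photograph lives in the same store in both roles, it needs one lifecycle, not two.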