Archive Record Management

Targeted for 7.0 Shiny release

Overview

Build a comprehensive archive and records management sub-system for the underlying publishing framework.

The existing FarCry archival system provides a record archive for key content types. However, recovery of archival content is not ideal, especially when objects have been flagged for deletion. In many scenarios a forensic exercise needs to be performed to extract the relevant record information. While this may be acceptable for legal purposes, it's not ideal and makes it difficult to test actual archival coverage. Furthermore, special attention needs to be given to custom content types to ensure that they participate in any sort of archival process at all.

The proposed archive and records management sub-system should provide broad coverage across content types (ie. all content except where explicitly excluded), and provide a framework for simple restore and rollback features for any content type.

Archive Areas to Cover

  • Outline current risk in terms of ability to recover exactly what content was on any of our websites at a given time for legal purposes
  • Identify what objects/content we can and can't recover
  • Identify what opportunities there are to fill any gaps that exist
  • Identify requirements to fill those gaps including description of features, cost estimate and timing
  • Can be split into a range of options, from the minimum needed to meet our legal needs to a fully functional archive recovery tool.
  • I like the option to have screenshots linked to archive instance.

Snapshot of what gets archived by default FarCry 6.x

Cheat sheet of what gets thrown into the archive table by default:

  • any content type that extends versions that is approved; for example, HTML, News, etc
  • associated or related content items attached to your versioned content (note only dmFile and dmImage and not their attachments)
  • any content that is deleted (provided bDoArchive is active for the site)

Check Admin > Config > General > Do Archives is ON

Cheat sheet for content changes that are missed by default:

  • any content that is not extending versions; for example, Navigation (dmNavigation)
  • any historical change where the content item is first unpublished (ie sent to draft), modified, and then re-published (including versioned content)
  • any historical change to a content item that is approved and allows editing on the approved item; for example, Navigation (dmNavigation)
  • any file deleted from the platform has its media archive erroneously removed (assuming it had one ie before FarCry 6)
  • any associated media file (including those attached to dmFile or dmImage)

Short story. Backups give you a complete restore, assuming you have them. If you need to do a restore from a period between backups, it's way harder than it should be. Without backups, it's hit and miss by default whether you can retrieve data from all content types, especially custom built content types. FarCry should automatically archive all approvals and deletions. FarCry should auto-detect media assets based on the formtool metadata and archive them. There are bugs that need to be fixed.

Archival Use Cases

General record management requirements for web based content management platforms, in order of frequency of use. We want to target these use cases and see how much we can achieve for the 7.0 release.

Forensic Restoration for Legal Department

Some organisations have a legal requirement to be able to determine the content on the site at a given moment in time. Generally this requirement is rarely invoked, and could be met with a forensic reconstruction of the site. By forensic, we mean there is no automated or simple restoration process, but the data has been stored and a developer could take reasonable steps to piece together what the site looked like at the time in question.

Forensic restoration is typically needed in response to some legal challenge or in the assessment of some sort of customer recompense. Unfortunately, this requirement may need to be invoked on content ranging over an archival period determined by the company's legal archival obligations; for example, something like seven (7) years in Australia.

It is not realistic to be able to recreate an exact rendition of a page, especially when implementing publishing rules, and personalisation. However, the publishing platform needs to record all publicly visible changes to principal content records such that we can reconstruct "what was there when" with a degree of certainty.

An optional "screen shotting" service, that snags the full length version of a page at 1024 pixels might go a long way to answering this use case.

TODO List

  • write & publish test plans for various user scenarios
  • extend dmArchive to include additional auxiliary data, and audit record
    • typename
    • bDeleted (last archive item)
    • method that generated archive; approve, delete, rollback, undelete
    • username, IP address, roles list
    • friendly URL archive (query as WDDX)
    • category archive (struct of UUID/label)
    • peripheral tree data (parentid, array of children)
  • auto archive all content items
    • on Delete
    • on Approval (including on change of approved content)
  • add bArchive component metadata to allow the exclusion of specific content types from archive
  • re-implement media archive process and automate for all properties by default
    • all file and image fields should be auto-archived based on property field type (ie. image or file)
    • add ftArchive flag to allow developers to exclude specific properties as needed
  • Investigate screen shot services that can be implemented via plugin: http://cutycapt.sourceforge.net/
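The two proposed metadata flags might look something like this on a custom content type. This is only a sketch of the idea: bArchive and ftArchive are the flags proposed in the list above and are not understood by FarCry 6.x, and the component, property names and ftDestination values are made up for illustration.

```cfml
<!--- Hypothetical custom content type. bArchive and ftArchive are the
      PROPOSED flags from the TODO list, not existing FarCry 6.x metadata. --->
<cfcomponent extends="farcry.core.packages.types.types" displayname="Press Kit"
	bArchive="true">

	<cfproperty name="title" type="string" default=""
		ftType="string" ftLabel="Title" />

	<!--- Would be archived by default because the field type is "file" --->
	<cfproperty name="brochure" type="string" default=""
		ftType="file" ftLabel="Brochure" ftDestination="/pressKit/brochure" />

	<!--- Developer opts this one property out of the media archive --->
	<cfproperty name="workingFile" type="string" default=""
		ftType="file" ftLabel="Working file" ftDestination="/pressKit/workingFile"
		ftArchive="false" />

</cfcomponent>
```

Setting bArchive="false" on the component would exclude the whole content type; the default would be to archive everything.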

Content "Diffing"

A review of any archive is extremely tedious if it's not possible to see the change in the content record from one version of the archive to the next. A visual content "diff" is an important part of any UI interacting with the archive system.

When dealing with DRAFT content, it is extremely useful to be able to see quickly what the difference is between various versions of a content item, especially the difference between the current APPROVED content and the underlying DRAFT content; for example, when editing content and approving content changes.

Many users leave content in DRAFT for fear of losing potential changes when no changes exist at all, because it is difficult to ascertain what changes (if any) have been made.

FarCry Capabilities as at v6.1

FarCry has no native "diffing" functions. However, a commercial plugin is available from Daemon that provides the ability to "diff" archive versions within the webtop overview. Release of the plugin to open source is being considered for the 6.2 release. (released in 6.2)

TODO List

  • Create a diffing tool for comparing text properties between content records (could leverage Daemon farcry diff plugin) (released in 6.2)
  • implement webtop overview tab for types extending versions (released in 6.2)
  • implement webtop overview tab for dmArchive
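As a rough illustration of the first step of any diff (working out which properties changed at all), something like the following could be run over two versions of a content record. It is a minimal sketch, not the Daemon diff plugin: the function name is hypothetical, it only compares simple values, and it ignores properties that exist only in the older version.

```cfml
<cffunction name="getChangedProperties" returntype="string" output="false"
	hint="Returns a sorted, lower-case list of simple-value properties that differ between two records.">
	<cfargument name="stOld" type="struct" required="true" />
	<cfargument name="stNew" type="struct" required="true" />

	<cfset var lChanged = "" />
	<cfset var key = "" />

	<cfloop collection="#arguments.stNew#" item="key">
		<cfif NOT structKeyExists(arguments.stOld, key)>
			<!--- Property has been added since the older version --->
			<cfset lChanged = listAppend(lChanged, key) />
		<cfelseif isSimpleValue(arguments.stNew[key]) AND isSimpleValue(arguments.stOld[key])
				AND compare(arguments.stNew[key], arguments.stOld[key]) NEQ 0>
			<!--- Case-sensitive comparison of the two text values --->
			<cfset lChanged = listAppend(lChanged, key) />
		</cfif>
	</cfloop>

	<cfreturn lCase(listSort(lChanged, "textnocase")) />
</cffunction>

<!--- Example: approved record versus its underlying draft (made-up data) --->
<cfset stApproved = structNew() />
<cfset stApproved.title = "Annual report" />
<cfset stApproved.teaser = "Results for the year" />

<cfset stDraft = structNew() />
<cfset stDraft.title = "Annual report" />
<cfset stDraft.teaser = "Results for the financial year" />

<!--- Only teaser differs between the two records --->
<cfoutput>#getChangedProperties(stOld=stApproved, stNew=stDraft)#</cfoutput>
```

An empty result would tell an editor straight away that the DRAFT holds no changes and can safely be discarded.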

Undelete

Probably the most common scenario for restoring content is accidental deletion. The system should provide a simple mechanism for restoring deleted content.

Undeletes normally need to be performed immediately, ie within minutes of the user realising they have accidentally deleted content.

FarCry Capabilities as at v6.1

FarCry has no native "undelete" methods.

TODO List

  • Create content admin view for dmArchive content type
  • Custom action on dmArchive grid to undelete
  • Restore function for basic content items (no tree relationships)
  • Restore function for tree content items (places content in tree)
  • Restore option for related content that may also have been deleted

Content Roll Back

Some users need the ability to roll back content to an earlier state. Typically this is because content has been inadvertently published or changed and is deemed to be inaccurate and needs to be restored to a previous version of the content immediately.

A content roll back is likely to occur within a short time frame from when the erroneous content was published.

Rollbacks are actually relatively rare scenarios, and many companies will typically "roll forward" ie. correct or remove the inaccurate content as needed.

FarCry Capabilities as at v6.1

FarCry has roll back options for content types that extend versions abstract class; for example, HTML, News, etc.

TODO List

  • Integrate existing roll back features into Content Diff tab (released in 6.2)

Screengrab of Content Pages

Screengrabbing a page and attaching the screen grab to the archive would be ideal. While a screengrab is not going to necessarily show all relevant content on the page (for example, tabbed or concealed content), it does offer some important context; what did the page look like, what other compound or dynamic content might have existed juxtaposed to the content record for the page.

Webshot looks like a promising, server side, screen shotting tool: http://www.websitescreenshots.com/

Screengrabbing could be hooked into the types.onStatusChange or types.onStatusApproved event cascade for the full page view of content items.

TODO List

  • Create a hook in FarCry Core to bubble event for screen capture (lib library with blank function call; override in screen grab plugin)
  • Set up sub-system for asynchronous screen grab calls
  • Update dmArchive UI to integrate screen grabs next to relevant archive
  • Update Content Diff tab to show screen grabs in context

Current Record Management & Recovery Issues

Assuming FarCry 6.0.x deployment.

Complete Site Restoration

Currently the only way to ensure a complete site restoration is to have the code base, media assets and database backups from the specific recovery point. Given all of these assets, it is relatively straightforward to restore an application to its historic state.

Advantages:

  • recovery is complete

Disadvantages:

  • time consuming
  • it's unlikely any client will be able to maintain daily backups going back years; for example, monthly archival backups or fewer is likely the norm beyond 7 days

Possible in all versions of FarCry

Determining What Was on a Specific Page at a Specific Time

Assuming Complete Site Restoration is not an option, what can be recovered from a current FarCry 6.0.x platform? That is, what is the publishing platform storing by default?

Archived Content

The FarCry archive is a database table that holds a reference to the archived content item (archiveid) and a WDDX (XML format) packet of the entire content record, including relationships. There are various triggers for the system creating an archive (detailed below).

Content types that are "versioned" (ie. extend the versions abstract class) allow contributors to review a list of all previous versions of the specific content record. Users also have the option of "rolling back" to any version previously listed.

Currently there is no way to access this table directly within a standard installation. And even with access to the database table, you would need to know the UUID of the content item that you were searching for. Trawling the Archive table is essentially a forensic task at this time, and not something a standard user would be able to do in isolation.

Archived Media Assets

FarCry 6.0; no files are archived

Since the implementation of FarCry 6.x it appears that files are no longer being archived in media archive at all. Deleting a file removes the file permanently from the system.

Images and files present a different archival problem. While the content record contains meta-data such as title and description, the underlying media asset is the critical item for archiving. When an associated document is updated or the file content item is deleted, the physical file that is no longer required is moved to an archive section of the project.

Default project media archive file format
./projectname/mediaArchive/typename/propertyname/objectid-epochdatetime-filename.ext

(see below for detail)

Archived files are not readily retrievable or restorable by contributors in the system. The archive is generated primarily for forensic retrieval.

Archived files should not be removed on delete!

Essentially all changes to attached media files are recorded during the lifetime of the record. Unfortunately, deleting a file archives the content record but deletes all attached media. This is not so good.

./packages/formtools/file.cfc; delete()
<!--- Delete file --->
<cfif structKeyExists(stLocation,"fullpath") AND len(stLocation.fullpath)>
	<cftry>
		<cffile action="delete" file="#stLocation.fullpath#" />

		<!--- Delete archived files --->
		<cfdirectory action="list" directory="#application.path.mediaArchive##arguments.stMetadata.ftDestination#/" filter="#arguments.stObject.objectid#*" name="qArchive" />
		<cfloop query="qArchive">
			<cffile action="delete" file="#application.path.mediaArchive##arguments.stMetadata.ftDestination#/#qArchive.name#" />
		</cfloop>
		
		<cfcatch><cfdump var="#cfcatch.message#"><cfdump var="#arguments#"><cfdump var="#stLocation#"><cfabort></cfcatch>
	</cftry>
</cfif>
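For comparison, the behaviour we actually want on delete is closer to the following: move the live file into the media archive and leave any earlier archives alone. This is a minimal sketch under assumptions, not the fix for file.cfc; the function name and arguments are hypothetical, and the file name follows the objectid-epochdatetime-filename.ext convention described on this page.

```cfml
<cffunction name="archiveMediaFile" returntype="string" output="false"
	hint="Moves a file into the media archive and returns the archived path (empty string if there was no file).">
	<cfargument name="sourcePath" type="string" required="true" hint="Full path of the live file" />
	<cfargument name="archiveDirectory" type="string" required="true" hint="Full path of the archive folder" />
	<cfargument name="objectid" type="string" required="true" hint="UUID of the content item" />

	<!--- Seconds since 1970-01-01 00:00:00 UTC --->
	<cfset var epoch = dateDiff("s", createDateTime(1970,1,1,0,0,0), dateConvert("local2utc", now())) />
	<cfset var archivePath = "" />

	<cfif NOT fileExists(arguments.sourcePath)>
		<cfreturn "" />
	</cfif>

	<cfif NOT directoryExists(arguments.archiveDirectory)>
		<cfdirectory action="create" directory="#arguments.archiveDirectory#" />
	</cfif>

	<cfset archivePath = arguments.archiveDirectory & "/" & arguments.objectid
		& "-" & epoch & "-" & getFileFromPath(arguments.sourcePath) />

	<!--- Move rather than delete; existing archives for this objectid are not touched --->
	<cffile action="move" source="#arguments.sourcePath#" destination="#archivePath#" />

	<cfreturn archivePath />
</cffunction>
```

The important difference from the current delete() is that there is no cfdirectory/cffile loop clearing out previous archives for the objectid.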

Programmatic Triggers for Archive

Versioned content on Approve

Any content that extends the versions abstract class – essentially anything that generates an underlying DRAFT content item – automatically creates an Archive record for the approved content when the status changes to "approved". Note this archiving process is bound to the specific draft/pending/approved workflow process and not to the types.onStatusChange cascade.

General Config bDoArchive

types.delete()
<!--- check if need to archive object --->
<cfif application.config.general.bDoArchive EQ "true">
	<cfset stLocal.archiveObject = createobject("component",application.types.dmArchive.typepath)>
	<cfset stLocal.returnVar = stLocal.archiveObject.fArchiveObject(stObj)>
</cfif>

<cfset onDelete(typename=stObj.typename,stObject=stObj) />

fArchiveObject(stObj) archives the entire content item record as a WDDX packet in the dmArchive table.
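To make the WDDX step concrete, here is a small round trip using the standard cfwddx tag: a content record struct is serialised to a packet and later turned back into a struct. This is illustrative only and is not the body of fArchiveObject(); the record and its property values are made up.

```cfml
<!--- A made-up content record, including an array relationship --->
<cfset stObj = structNew() />
<cfset stObj.objectid = createUUID() />
<cfset stObj.typename = "dmNews" />
<cfset stObj.title = "Example headline" />
<cfset stObj.aObjectIDs = arrayNew(1) />
<cfset arrayAppend(stObj.aObjectIDs, createUUID()) />

<!--- Serialise the whole record to a WDDX (XML) packet; this string is what gets stored --->
<cfwddx action="cfml2wddx" input="#stObj#" output="packet" />

<!--- Forensic recovery later: turn the stored packet back into a struct --->
<cfwddx action="wddx2cfml" input="#packet#" output="stRestored" />

<cfoutput>#stRestored.title# has #arrayLen(stRestored.aObjectIDs)# related item(s)</cfoutput>
```

Because the packet carries the array properties as well as the simple ones, relationships come back with the record; what it cannot carry is the physical media those properties point to.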

onDelete Cascade

types.delete()
<cfset onDelete(typename=stObj.typename,stObject=stObj) />

onDelete(typename, stObject) fires a cascade of onDelete() functions across all the formtool properties of the content type. This allows the processing of attached media like images and files.

Media Archive Specification for FarCry 6.x Audit

Archive & Versioning Overview

Archive functionality is a configurable option based on metadata attached to the content type.

There are effectively two levels of archiving. Deleted content can be archived, such that any content that is deleted is archived as a permanent record. Versioned content is archived allowing users to roll back to previous versions of the content.

Deleted Archive

Whenever a content item is deleted, the content item database record is wrapped up and stored within a dmArchive content item. Deleted content is really a safety net that is useful for legal records and emergency data recovery. There is no user interface for easily restoring previously deleted content.

Associated Media

FarCry 6.0; no files are archived

Since the implementation of FarCry 6.x it appears that files are no longer being archived in media archive at all. Deleting a file removes the file permanently from the system.

When an item is deleted that has associated media assets, such as files or images, the attached media content needs to be moved outside of the webroot and stored.

Archived media is moved to ./projectname/archive/typename/property/UUID_YYYYMMDD.ext

Where;

  • UUID is the content item's objectid
  • YYYYMMDD is the current date time stamp
  • ext is the specific extension for the media in question

In addition, if the content item is being updated with a new version of the attached media asset, then the previous version needs to be moved to the archive storage area.

False; Archived Media is Removed on Record Deletion

As noted previously, whilst file changes to an existing record are recorded, deleting a file will result in the removal of its underlying file archive.

Versioning content

Dependencies; content types need to extend the versions.cfc abstract class or equivalent.

The FarCry versioning service effectively kicks in whenever a content item is approved. The approval process takes a copy of the live content item and stores it inside a dmArchive record.

This archiving process should include the movement of associated media to archive, where the associated media differs from the media attached to the draft content item being approved.

False; Only works for dmFile and dmImage system content types

The platform should archive related content and associated media assets dynamically based on the field properties, ie. any field with an attachment. Currently the system only looks specifically for dmFile and dmImage content types. While many systems will use these default libraries, a lot of bespoke applications will not.
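Detecting media properties from field metadata could be as simple as the sketch below. It assumes the caller passes in a struct keyed by property name, where each value is a struct of that property's formtool metadata; that shape, the function name and the ftArchive opt-out flag (proposed earlier on this page) are all assumptions rather than existing FarCry behaviour.

```cfml
<cffunction name="getArchivableMediaProperties" returntype="string" output="false"
	hint="Returns a sorted, lower-case list of file/image properties that should be media-archived.">
	<cfargument name="stProps" type="struct" required="true"
		hint="Property name => struct of formtool metadata (assumed shape)" />

	<cfset var lResult = "" />
	<cfset var prop = "" />
	<cfset var stMeta = "" />

	<cfloop collection="#arguments.stProps#" item="prop">
		<cfset stMeta = arguments.stProps[prop] />
		<cfif structKeyExists(stMeta, "ftType") AND listFindNoCase("file,image", stMeta.ftType)>
			<!--- Archive unless the developer has opted out with ftArchive="false" --->
			<cfif NOT structKeyExists(stMeta, "ftArchive") OR stMeta.ftArchive>
				<cfset lResult = listAppend(lResult, prop) />
			</cfif>
		</cfif>
	</cfloop>

	<cfreturn lCase(listSort(lResult, "textnocase")) />
</cffunction>

<!--- Made-up metadata for a bespoke content type --->
<cfset stProps = structNew() />
<cfset stProps.title = structNew() />
<cfset stProps.title.ftType = "string" />
<cfset stProps.brochure = structNew() />
<cfset stProps.brochure.ftType = "file" />
<cfset stProps.hero = structNew() />
<cfset stProps.hero.ftType = "image" />
<cfset stProps.workingFile = structNew() />
<cfset stProps.workingFile.ftType = "file" />
<cfset stProps.workingFile.ftArchive = "false" />

<!--- brochure and hero qualify; title is not media and workingFile is opted out --->
<cfoutput>#getArchivableMediaProperties(stProps)#</cfoutput>
```

Driving the archive from field type means a bespoke content type gets media archiving without having to use dmFile or dmImage.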

Roll Back

Editors can see an archive view for any particular content item, showing a link to the archive records of previously versioned content. Choosing to roll back to a previous version effectively archives the current live object and copies the archived version to a new live version. That is, the versioning history is retained at all times.
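The roll back sequence can be pictured with a small in-memory model. This is a sketch of the behaviour described above, not the versions.cfc implementation; the function name and the shape of the store (a live struct plus an array of archived structs) are assumptions for illustration.

```cfml
<cffunction name="rollbackTo" returntype="struct" output="false"
	hint="Archives the live record, then makes a copy of the chosen archive the new live record.">
	<cfargument name="stStore" type="struct" required="true"
		hint="Struct with keys live (struct) and aArchive (array of structs); assumed shape" />
	<cfargument name="archiveIndex" type="numeric" required="true" />

	<!--- 1. The current live version joins the archive, so nothing is lost --->
	<cfset arrayAppend(arguments.stStore.aArchive, duplicate(arguments.stStore.live)) />

	<!--- 2. A copy of the archived version becomes live; the archive entry itself stays put --->
	<cfset arguments.stStore.live = duplicate(arguments.stStore.aArchive[arguments.archiveIndex]) />

	<cfreturn arguments.stStore />
</cffunction>

<!--- Version 1 is already archived, Version 2 is live (made-up data) --->
<cfset stStore = structNew() />
<cfset stStore.live = structNew() />
<cfset stStore.live.title = "Version 2" />
<cfset stStore.aArchive = arrayNew(1) />
<cfset stV1 = structNew() />
<cfset stV1.title = "Version 1" />
<cfset arrayAppend(stStore.aArchive, stV1) />

<cfset rollbackTo(stStore, 1) />

<!--- Live is Version 1 again, and the archive now holds both versions --->
<cfoutput>#stStore.live.title# (#arrayLen(stStore.aArchive)# archived versions)</cfoutput>
```

Note the archive only ever grows: rolling back adds an entry rather than removing one, which is what keeps the history intact.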

Automating the roll back of associated media items is a little more complex and must be performed manually at this time.

(Future enhancement: it would be nice to automate the roll back of media items as well as the primary database record)

Ongoing Maintenance

Administrators need to be aware that busy sites may build up a large archive storage. Provisions need to be put in place to manage this area operationally, in line with the organisation's record keeping policies.

(Future enhancements: would be nice to have a variety of utilities in the web top to assist in reporting on and managing the archive area)

File Management

FarCry follows specific rules with respect to how physical files are created, updated, deleted and archived.

Content Migration Issues

FarCry 4.0 does not store media content in a tree based hierarchy (unlike FarCry 2.x). Media content is classified using categories or by direct association with a specific content item.

Document Handling

FarCry creates a database record or file content item that holds all the extraneous metadata associated with a file, such as categorisation, document date, publish date and so on. In addition, the content item is attached to a physical file that is stored on the application file system.

When a file content item is created, the system ensures that any attached document has a unique name on the server. Attempting to upload a document of the same name as something already on the server will result in the uploaded file being renamed to ensure that it is a unique file reference.

When a file content item is updated, the system allows you to upload a new file to update the current record. Regardless of the uploaded document's file name, the system will rename the associated document to have exactly the same name as the previously uploaded file. In effect, it replaces the old file on the system.

If the file being used to update an existing file has a different extension, it is assumed to have a different MIME type, and the system will reject the newly uploaded file reporting a conflict. For example, attempting to upload a PDF over the top of a Word document and retaining the exact filename is going to be fraught with danger.
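The update rules above (keep the original name, reject a change of extension) can be expressed roughly as follows. This is a sketch of the rule as described, not the formtool code; the function name and the "fileConflict" exception type are made up, and files without an extension are not handled.

```cfml
<cffunction name="resolveReplacementName" returntype="string" output="false"
	hint="Returns the name a replacement upload should be stored under, or throws on an extension conflict.">
	<cfargument name="existingName" type="string" required="true" />
	<cfargument name="uploadedName" type="string" required="true" />

	<!--- A different extension is treated as a different MIME type and rejected --->
	<cfif compareNoCase(listLast(arguments.existingName, "."), listLast(arguments.uploadedName, ".")) NEQ 0>
		<cfthrow type="fileConflict"
			message="Cannot replace #arguments.existingName# with a file of a different type." />
	</cfif>

	<!--- Whatever the upload was called, it takes over the existing file name --->
	<cfreturn arguments.existingName />
</cffunction>

<!--- Same extension: the upload is stored under the original name --->
<cfoutput>#resolveReplacementName("annual-report.doc", "annual-report-v2.DOC")#</cfoutput>

<!--- Different extension: the upload is rejected --->
<cftry>
	<cfset resolveReplacementName("annual-report.doc", "annual-report.pdf") />
	<cfcatch type="fileConflict">
		<cfoutput>Rejected: #cfcatch.message#</cfoutput>
	</cfcatch>
</cftry>
```

Keeping the name stable is what allows existing in-line links to the document to keep working after an update.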

Archiving File Content

When an associated document is updated or the file content item is deleted, the physical file that is no longer required is moved to an archive section of the project.

The archive location is configurable but by default is:

./projectname/mediaArchive/typename/propertyname/objectid-epochdatetime-filename.ext
  • projectname: the name of the current application
  • typename: the specific content type in question, eg. dmFile
  • propertyname: the specific property name holding the file path information
  • objectid: the UUID of the file content item
  • epochdatetime: the number of seconds elapsed since midnight UTC of January 1, 1970
  • filename: the filename of the document

Typically typename/propertyname would be defined as part of the formtool metadata for file; ftDestination. A de facto standard for content types with a single file property is to simply use the typename value. For example, core content type dmFile uses ./mediaArchive/typename/*
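For forensic work it helps to be able to go the other way, from an archived file name back to its parts. The sketch below assumes the objectid is a standard 35 character ColdFusion UUID and that the name follows the objectid-epochdatetime-filename.ext pattern exactly; the function name and sample file name are made up, and no validation is attempted.

```cfml
<cffunction name="parseArchiveFileName" returntype="struct" output="false"
	hint="Splits objectid-epochdatetime-filename.ext into its parts.">
	<cfargument name="fileName" type="string" required="true" />

	<cfset var stResult = structNew() />
	<cfset var remainder = "" />

	<!--- A ColdFusion UUID is 35 characters long and itself contains hyphens,
	      so it is taken by position rather than by splitting on "-" --->
	<cfset stResult.objectid = left(arguments.fileName, 35) />
	<cfset remainder = mid(arguments.fileName, 37, len(arguments.fileName) - 36) />

	<!--- First element is the epoch; everything after it is the original file name,
	      which may contain hyphens of its own --->
	<cfset stResult.epoch = listFirst(remainder, "-") />
	<cfset stResult.fileName = listRest(remainder, "-") />

	<!--- Epoch seconds converted back to a date/time value in UTC --->
	<cfset stResult.archivedOn = dateAdd("s", stResult.epoch, createDateTime(1970,1,1,0,0,0)) />

	<cfreturn stResult />
</cffunction>

<cfset stParts = parseArchiveFileName("A1B2C3D4-E5F6-A7B8-C9D0E1F2A3B4C5D6-1300000000-annual-report.pdf") />
<cfoutput>#stParts.fileName# archived #dateFormat(stParts.archivedOn, "yyyy-mm-dd")# for #stParts.objectid#</cfoutput>
```

Listing a mediaArchive folder with a filter on the objectid and sorting on the parsed epoch would give the full file history for a single content item.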

Archived files are not readily retrievable or restorable by contributors in the system. The archive is generated primarily for archival and audit purposes.

True. The publishing platform appears to work as advertised.

Data Loss. When a file is deleted its underlying media archive is erroneously deleted.

Versioning & Approval

Files have status only, there is no inherent versioning of files. Setting a file to a status of draft will prevent others from using it in the content management system and prevent users from downloading it (assuming secure files have been configured).

File content items are not versioned as the attached file is considered to be complete. That is, it was already worked on elsewhere and when it's uploaded the document has been pre-approved.

Versioning files, whilst maintaining the same filename and in-line linkages would be a very different approach to the current methodology and would essentially require a completely custom and separate file content type.

True. The publishing platform appears to work as advertised.