|
With the explosion of regulations and laws mandating policy-based data retention and deletion, archiving is suddenly very popular within the technology industry. There is no shortage of vendors who are recontextualizing their products as archive solutions. What exactly is an archive solution? Let's find out.

First things first. Archive is a process that typically extracts data from an expensive "capture storage" platform and writes it to a less expensive "retention storage" platform. The movement of the data to the archive platform is controlled by a policy. The management of data once written to the archive platform is also controlled by policies. Policies automate certain aspects of archiving so that it is not a subjective or manually intensive process.

Archive is NOT the same as backup or snapshotting, nor is it hierarchical storage management (HSM). Archive is different and these differences are very important. We'll get to them later.

A basic archive platform has three functional components: Policy Tools, an Ingestion Engine, and Management Tools.

Policy tools

An archiving platform is driven by policies: rules and some mechanism to apply the rules in order to automate the operation of the archive platform itself. Specifically, there are:

- Rules for selecting and migrating data from production to retention storage
- Rules for applying data management services (security, non-repudiation, WORM, replication, backup, etc.)
- Rules for deleting data (calendars, shredding functions, etc.)
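To make the three kinds of rules above concrete, here is a minimal sketch in Python. It is an illustration only, not any vendor's policy format: the `ArchivePolicy` class, its field names, and the `action_for` function are all hypothetical, and the thresholds are expressed in days purely as an assumption.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass(frozen=True)
class ArchivePolicy:
    """Hypothetical policy record covering the three kinds of rules."""
    name: str
    migrate_after_days: int  # selection rule: idle days before migration
    retain_days: int         # deletion rule: days kept in the archive
    worm: bool = False       # data management rule: write-once protection


def action_for(policy: ArchivePolicy, last_accessed: date,
               archived_on: Optional[date], today: date) -> str:
    """Return 'keep', 'migrate', 'retain' or 'delete' for one data object."""
    if archived_on is None:
        # Object is still on production storage: apply the selection rule.
        idle_days = (today - last_accessed).days
        return "migrate" if idle_days >= policy.migrate_after_days else "keep"
    # Object is already in the archive: apply the deletion rule.
    archived_days = (today - archived_on).days
    return "delete" if archived_days >= policy.retain_days else "retain"
```

A policy engine built along these lines would evaluate `action_for` for every object on a schedule, so that no operator has to decide case by case what moves and what is deleted.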
The basic archiving platform will provide a mechanism for defining policies and will work in conjunction with an ingestion engine and management tools to apply policies throughout the useful life of the data in the archive.

Ingestion engine

Ingestion sounds like a complex word for a fairly straightforward idea. To put data into the archive ("to ingest it"), the correct data needs to be identified and the target location for the data movement needs to be identified. So, an ingestion engine is a piece of software that, at a minimum, provides:

- A mechanism for selecting and moving data
- A means for validating data contents
- A method for confirming that writes have occurred correctly and completely
- A means for un-ingesting data when required
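The four minimum functions listed above can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions: the archive is just a directory, validation is a SHA-256 comparison, and the function names `ingest` and `un_ingest` are hypothetical rather than taken from any product.

```python
import hashlib
import shutil
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Hash a file in chunks so large files do not have to fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def ingest(source: Path, archive_dir: Path) -> Path:
    """Copy a file into the archive, confirm the write, then remove the original."""
    archive_dir.mkdir(parents=True, exist_ok=True)
    expected = sha256_of(source)            # validate contents before the move
    target = archive_dir / source.name
    shutil.copy2(source, target)            # move step 1: write the archive copy
    if sha256_of(target) != expected:       # confirm the write is complete and correct
        target.unlink()
        raise OSError(f"verification failed for {source}")
    source.unlink()                         # move step 2: free production storage
    return target


def un_ingest(archived: Path, restore_dir: Path) -> Path:
    """Copy an archived file back out, leaving the archive copy in place."""
    restore_dir.mkdir(parents=True, exist_ok=True)
    restored = restore_dir / archived.name
    shutil.copy2(archived, restored)
    if sha256_of(restored) != sha256_of(archived):
        restored.unlink()
        raise OSError(f"restore verification failed for {archived}")
    return restored
```

The original is deleted only after the archive copy has been verified, which is the property that separates a controlled migration from a plain file move.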
There are many other things that could be done with data prior to writing it to the archive, but these are not core functions of an archive. We will discuss them later.

Management tools

Management tools are used to manage the archive infrastructure and to control access to the infrastructure and the data stored there. Management tools available in basic archiving platforms provide the means for:

- Configuring archive volumes
- Configuring ingestion controls for data selection and migration
- Archive media management
- Audit and logging
- Access control
Management tools must be policy aware so they can support the smooth operation of the archive platform.

What About All of The Other Stuff?

There is no shortage of archive add-ons, and it can get confusing trying to figure out what some of these products do and whether they are, as their vendors may suggest, part of the core componentry of an archive or providers of value-add features and functions. Unfortunately, many vendors provide self-serving definitions in their marketing literature that can be hard to weed through. Here's our take:

- Content Addressing: Think of content addressing as the addition of a new metadata hook on data that is being ingested into an archive that will help you locate the data object (or range of objects) more readily. This kind of functionality, while not a core part of many archive platforms today, shows real promise as a means to help migrate data from one set of archive media to another set over time, making sure that you don't overlook or forget any objects. Metadata may also be modified by the Content Addressing engine to provide checksums or other "hashes" produced by exotic algorithms that will ensure that data does not change when you move it around -- a good thing. Content addressing schemes vary from one provider to another, and those that are bound to a specific vendor's proprietary array controller are to be avoided as a rule. Otherwise, you lock yourself into one kind of hardware for your archive, which might not be a good thing in the long run.
- Content Indexing: Content indexing is the review of the contents of a data object prior to ingestion for the purpose of building an index, table or database to facilitate searches at a later time. This is another good idea, but not necessarily part of core archive functionality. Indexing approaches vary widely and if you desire this feature, you probably want to strive for an approach that is hardware agnostic to avoid long-term lock-in.
- Data Extractors: Extractors are programs that select data from email systems, databases, or other specific data object repositories for inclusion in an archive. We think of them as pre-archive tools, though many vendors want to sell them as part of the core archive stack to support their own complete archive solution for a specific data type. They see customers looking for one-stop-shop email archiving solutions or all-in-one-box database archive solutions, so their extractors are bundled with the policy tools or ingestion engine functionality of their products. There is nothing wrong with this approach, except architecturally, it creates archive "stovepipes" -- individual repositories by data type that require their own policies and their own cadre of operators and managers. In some cases, management of these data type-focused approaches can become daunting, especially as the volume of data being archived scales. Ultimately, a business may be better served by a generic (or data type agnostic) archive platform that uses a number of pre-ingestion tools to select data objects by type for inclusion in the archive management scheme. Correlating the activities of multiple pre-ingestion extractors may require a Manager of Managers or MOM.
- Data De-Duplication: This is another one of those terms that means different things to different vendors. At the file level, de-duplication means what it says -- it is technology that finds copies of files and compares them, then removes all but one (or two) of the duplicates, saving space on the archive platform. At the block level, de-duplication usually refers to technology for identifying identical patterns of bits (for example, 00110011) in data, which are then replaced by a "stub." The promised outcome is a much smaller quantity of compressed data, again providing for better capacity allocation efficiency. There is nothing wrong with this technology, provided that data can be restored to a usable form and without a lot of proprietary gear or software. However, de-duplication is not a core archiving function.
- Special Archive File Systems: File systems are logical schemes for organizing data objects. Some prominent file systems have come under attack for various inadequacies that trace back to the engineering constraints that existed when they were designed. For example, saving a file in most file systems overwrites the last valid copy of the file having the same name -- a feature intended to preserve hard disk space, which was prohibitively expensive at the time that the file system was created. This "self destructive" nature of a file system is now considered a bug, rather than a feature, and engineers in many companies are re-inventing file systems for use with modern systems (including archive) that consider modern platform capabilities and costs. Getting to a file system (more likely, an object oriented database) could improve our ability to wrangle user files into some semblance of manageable order -- a good thing. We are watching these developments, but have absolutely no certainty whether any of the current approaches will become mainstream. In any case, a special file system is not a prerequisite for archiving.
- Clustering Approaches: Clustering refers to a set of technologies intended to improve the scalability and in some cases the performance and responsiveness of any collection of systems. The idea is that several nodes represent themselves as one entity and share resources such as bandwidth, capacity and processing speed. This is closely related to GRID technology. There is no reason why an archive cannot be built on a clustered platform, though this might add an additional layer of hardware complexity. In any case, clustering is not a core function of an archive.
- Auto-Classification Engines: Mainly used with files and other rich metadata objects, auto-classification engines seek to streamline the data selection process as part of archive ingestion. Auto-classification, despite vendor hype in many cases, remains a work in progress. Tuning engines to cherry-pick files for inclusion in an archive is often a difficult and tedious process, despite the availability of "templates" supplied by vendors of these products. We are monitoring this space closely and encourage you to visit AMO often for updates on the latest classification tools. Auto-classification is not a core component of an archive platform.
- Manager of Managers (MOM): A MOM is an integration tool that may be used to coordinate the activities of multiple data extraction engines by translating policies into the native syntax of each engine. MOM technology is in its infancy, but the hope is that the operation and output of multiple best-of-breed extractors will be able to be managed by a single cadre of archive operators or managers. MOMs may also play a role in coordinating access controls to individual extractors and their output. MOMs are not currently core components of an archive platform.
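The hashing idea behind two of the add-ons in the list above, content addressing and file-level de-duplication, can be sketched as follows. This is a toy in-memory store for illustration only: the `ContentStore` class and its methods are hypothetical, SHA-256 is an assumed choice of hash, and real products differ widely in how they address and store objects.

```python
import hashlib


class ContentStore:
    """Toy content-addressed store: an object's address is the hash of its bytes."""

    def __init__(self) -> None:
        self._objects = {}  # address -> stored bytes (one copy per unique content)
        self._names = {}    # object name -> address

    def put(self, name: str, data: bytes) -> str:
        address = hashlib.sha256(data).hexdigest()
        # File-level de-duplication: identical content is stored only once.
        self._objects.setdefault(address, data)
        self._names[name] = address
        return address

    def get(self, name: str) -> bytes:
        address = self._names[name]
        data = self._objects[address]
        # Re-hash on read to detect content that changed after ingestion.
        if hashlib.sha256(data).hexdigest() != address:
            raise ValueError(f"stored object for {name!r} no longer matches its address")
        return data

    def unique_objects(self) -> int:
        return len(self._objects)
```

Because the address is derived from the content rather than from a hardware location, the same addresses remain valid when objects are migrated to new media, which is why hardware-independent schemes are preferable.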
Integrating the above functionality where it makes sense for your organization into a unified archive platform can be a challenge. In the absence of real standards, you may want to leverage an experienced integrator or archiving consultant to help you to find and integrate the best solution for you. Most archive vendors have their own theories about how these functions should be integrated and deliver them on a single solution-oriented platform. Others provide a range of products including appliances designed to support smaller firms with smaller budgets that can be migrated over time into larger and more spacious platforms.

To make a smart buy, you need to understand the technology that is (and is not) being provided and select the one that best fits your requirements now and in the foreseeable future. You want to avoid "forklift upgrades" going forward and to prefer archiving software that can scale independently of any particular vendor's hardware. Best practices for archive design will be covered extensively at AMO. And user cases will be provided to validate vendor claims about their products and services.

What Archive is Not

Archive is NOT the Same as Backup

There is some confusion in the marketing literature of vendors that suggests that backups are all the archive anyone needs. While there may seem to be similarities between the two processes, they serve two very different purposes.

Backup is a disaster recovery-oriented data protection scheme. Ideally, mission-critical data (data that is known to support critical applications directly or indirectly) is identified and copied to removable media for placement in secure, off-site storage. The backup process typically entails the use of specialized software that creates “super streams” of target data, writing them to magnetic tape (“rust-coated” Mylar plastic media contained inside a housing and operated inside of a special drive).

The tapes are verified to ensure that they contain accurate copies of the data and are removed to off-site storage for safekeeping. In the event of an unplanned outage, cartridges are recovered from the off-site store and data is restored (written back from tape to disk) to place it in a usable form for business operations.
Backups are copies of data that are retained for discrete intervals and are intended for use if primary data are lost or damaged. As a rule, the data in an archive is not as important to operational recovery following a disaster event as the data stored on backups (or mirrored disk, another data protection approach). In fact, one advantage of archiving is that it selectively removes a quantity of data from the operational environment so that it is no longer part of the backup workload. That way, backups or mirrors might operate more efficiently.

The goals of backup and archive are different. The mechanisms for performing backups are different from those for performing archiving. The management of archives is significantly different from the management of backups or disk mirrors. Typically, the granularity (level of detail) of information stored in a backup is significantly lower than the granularity of archived data. Often backups consist of "bare metal" images of disk drives or otherwise "anonymous" data sets.

Tape media, it should be noted, is just as valid as a medium for archive as it is for backup. Also, it is a good idea to back up your archive platform periodically as an additional safeguard against corruption or other loss events.

Archive is NOT the Same as HSM

Hierarchical storage management (HSM) was pioneered by IBM in the mainframe world as a capacity optimization method. The goal was to improve the operational efficiency of storage by migrating data between tiers of storage (each with significantly different cost and performance dynamics) with a minimum of manual intervention.

With most HSM schemes, data movement is triggered by used capacity measurements or dataset access frequency counts. In the former, when a storage device reaches a specified capacity, older data is moved from one storage device to another in order to economize on space.

In the case of access frequency count-oriented systems, data sets that have not been accessed over a specified period of time are moved to media that makes more economic sense.
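The two HSM triggers just described can be sketched in Python as follows. This is an illustration under stated assumptions, not a model of DFHSM or any other product: the `DataSet` record, the 80 percent high-water mark, and the 90-day idle limit are all hypothetical, and "older data" is approximated here by days since last access.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class DataSet:
    name: str
    size_gb: float
    days_since_access: int


def select_by_capacity(datasets: List[DataSet], capacity_gb: float,
                       high_water: float = 0.8) -> List[str]:
    """Capacity trigger: once usage exceeds the high-water mark, pick the
    least recently accessed data sets until usage drops back to the mark."""
    used = sum(d.size_gb for d in datasets)
    limit = capacity_gb * high_water
    chosen = []
    for d in sorted(datasets, key=lambda d: d.days_since_access, reverse=True):
        if used <= limit:
            break
        chosen.append(d.name)
        used -= d.size_gb
    return chosen


def select_by_idle_time(datasets: List[DataSet],
                        max_idle_days: int = 90) -> List[str]:
    """Access trigger: pick every data set idle longer than the limit."""
    return [d.name for d in datasets if d.days_since_access > max_idle_days]
```

Note that neither function looks at what the data is or what business purpose it serves; both treat data sets as anonymous entities sized and timestamped for capacity purposes, which is the gap an archive policy is meant to fill.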
HSM is essentially a capacity management tool. While mainframe tools like DFHSM in the IBM world provided tools to improve the granularity of data selection, moving much closer to true archiving, most shops did not configure their policies to realize this degree of granular data selection and movement. For the most part, HSM treats data objects as anonymous entities, and moves data between tiers of storage (system memory, direct access storage devices or disk arrays, then to tape and optical) based on media costs and access requirements.

Archives typically use much more sophisticated data movement triggers and consider the data to be archived from a much more specific, and business focused, perspective. The key difference between archive and HSM is that the latter focuses narrowly on cost savings, while archive focuses on a broader business case that includes cost savings, risk reduction and process improvement.

Here is a comparison between backup, HSM and archive from a business value perspective. Only archive makes a full business value case.

Copyright (C) 2007 by Jon William Toigo for the AMO Journal. All Rights Reserved.
|