Data archiving isn't rocket science. Doing it right comes down to applying common sense to decisions about what should be archived and for how long, where and how archive data should be hosted so that it fulfills accessibility and re-reference requirements (as well as any special requirements specified in regulatory and legal mandates), how data will be moved from production IT platforms into the archival platform and how migration and management will be accomplished over time, and how end users will interact with the archive system (a factor that most directly determines the outcome of the overall archive strategy). Following is a brief discussion of each of the major phases of the archive project.
The Definition Phase
Developing an archiving strategy begins with a definition phase. You need to gather information about your data assets, their associations with business processes on one hand and with applications and technology infrastructure on the other, their utilization characteristics, and their value to the organization, to begin developing a business case for undertaking an archive initiative at all. Specific definitions that you will need to develop include the following:
- Policies need to be developed first to classify data according to some intelligent scheme of organization, and second to relate these data classes to specific resources and services that will be allocated to data while archived. Class-based policies may include retention and deletion schedules, special protection for sensitive or critical data, transformation and migration strategies for data that must be retained for a protracted period of time, and hosting preferences.
- The environment in which archiving will take place must also be defined. Determine what type(s) of hosting platforms and media will be used to store archival data based on access requirements, media durability, ingestion rates, and other criteria of relevance.
- Standards and best practices, where they are available from qualified standards bodies, can be useful in policy development and environment definition. These should be identified and documented in a requirements specification or strategy overview document.
- User experience, how users will interact with the archiving system, must be considered carefully.
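As a rough illustration of the class-based policies described in the list above, the following Python sketch relates data classes to retention, protection, and hosting settings. Everything in it is hypothetical: the class names, the retention periods, the hosting labels, and the classification rules are placeholders chosen for the example, not recommendations or requirements from any regulation.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ArchivePolicy:
    """Resources and services allocated to one class of archived data."""
    data_class: str
    retention_years: int          # hypothetical value, set by your own policy
    delete_after_retention: bool  # whether deletion is scheduled at expiry
    sensitive: bool               # whether special protection applies
    hosting: str                  # hypothetical hosting preference label


# Hypothetical policy table: one entry per data class.
POLICIES = {
    "financial_records": ArchivePolicy("financial_records", 7, True, True, "tape"),
    "project_files": ArchivePolicy("project_files", 3, True, False, "disk"),
}


def classify(filename: str, department: str) -> Optional[str]:
    """Assign a data class using two deliberately simple, made-up rules."""
    if department == "finance":
        return "financial_records"
    if filename.lower().endswith((".doc", ".xls", ".ppt")):
        return "project_files"
    return None  # unclassified data is not archived under any policy


def policy_for(filename: str, department: str) -> Optional[ArchivePolicy]:
    """Look up the policy that governs a file, or None if it is unclassified."""
    data_class = classify(filename, department)
    return POLICIES.get(data_class) if data_class else None
```

Keeping classification separate from the policy table means a retention schedule can change without touching the rules that decide which class a file belongs to.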
The Selection Phase
The selection phase involves the application of policies on data selection to the production storage environment. Testing will be required to ensure that classification schemes are sound and that the right data is being selected for movement to archive platform targets. Additional testing may be required to ensure that target media are appropriate for archival data and to validate the environment definition performed in the previous phase. The objective is to simulate the archive process envisioned in the definition phase to ensure that errors have not crept into the process and that important requirements were not overlooked.
This phase also provides an opportunity to evaluate and test tools for "pre-archive" or "pre-ingestion" preparation of archival data. Tools exist for extracting archival data from databases, email systems, electronic content management systems, and user file systems (as well as industry-specific data repositories such as those used in healthcare or video surveillance), and these tools are sometimes confused with archiving itself. While tools for extracting data for inclusion in an archival repository may have enormous value, they should not be mistaken for full archive solutions. An all-too-common consequence of building separate archives around extraction tools for different data types is "stovepiping" -- a separate cadre of personnel and a separate hardware and software infrastructure must be developed around each tool. In the long run, this strategy is neither efficient nor cost-effective.
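One way to simulate selection before any data is moved is a dry run that only lists what a policy would pick. The sketch below assumes a single, hypothetical selection rule (files not modified for a given number of days); real policies would draw on the classification scheme from the definition phase. The function name and parameters are illustrative.

```python
import os
import time
from typing import Iterator, Optional


def select_candidates(root: str, min_idle_days: float,
                      now: Optional[float] = None) -> Iterator[str]:
    """Yield paths under root whose last modification is older than the cutoff.

    This is a dry run: it reads file metadata only and moves nothing.
    """
    now = time.time() if now is None else now
    cutoff = now - min_idle_days * 86400  # 86400 seconds per day
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            if os.stat(path).st_mtime < cutoff:
                yield path
```

Reviewing the resulting list with the data owners is a cheap check that the right data is being selected before an archive target is involved.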
The Ingestion Phase
Ingestion refers to the movement of data from production storage onto the archive platform per policy. In the ingestion phase, the initial loading of data into the archive must be carefully monitored. Data hygiene is of key importance in the process of data movement. An archive is less effective if the data it stores is corrupted, improperly classified, infected with viruses or malware, or duplicates data already ingested. Data hygiene products will aid in separating the "grain" from the "chaff".
Ingestion is also the phase in which services such as content indexing and content addressing may be applied. We define content indexing as the creation of additional metadata context describing the contents of a file or other data object. We define content addressing as the application of additional metadata hooks to the data object in order to support later migration of the data between media in the archive, to ensure its nonrepudiability and/or to validate its completeness via checksum or some hashing scheme, and to facilitate faster data retrieval from the archive.
Data migration describes the actual movement of data from production storage to the archival store and subsequent data movement within the archive repository from older media to newer media. Most archive products include a data migration engine that is actuated by a policy engine. All data movement should be closely controlled by policy and should be auditable through a management tool.
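The hashing idea behind content addressing can be sketched in a few lines of Python. This is a minimal illustration, not how any particular archive product works: it assumes SHA-256 as the hashing scheme and uses a hypothetical in-memory store so that duplicate detection and completeness checks are easy to see.

```python
import hashlib
from typing import Dict, Tuple


class ArchiveStore:
    """Hypothetical in-memory archive keyed by the SHA-256 digest of the content."""

    def __init__(self) -> None:
        self._objects: Dict[str, bytes] = {}

    def ingest(self, data: bytes) -> Tuple[str, bool]:
        """Store data and return (digest, stored); stored is False for duplicates."""
        digest = hashlib.sha256(data).hexdigest()
        if digest in self._objects:
            return digest, False  # identical content was already ingested
        self._objects[digest] = data
        return digest, True

    def verify(self, digest: str) -> bool:
        """Recompute the hash to check that the stored object is intact."""
        return hashlib.sha256(self._objects[digest]).hexdigest() == digest
```

Because the digest depends only on the content, the same check can be rerun after each migration between media to confirm that the object arrived complete.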
The Management Phase
Management is less a phase than a set of activities that parallels all of the preceding phases. A management strategy is required to ensure that data is archived properly, that appropriate protective services are applied at all times, and that retention and deletion schedules and other matters of policy are executed in an orderly and predictable fashion. Special management requirements may include:
- Management of the transformation of data so that it remains machine readable even as hardware and software technologies, media formats, and data "container" formats change over time. Both proprietary and open standards have been defined for data "containers." When significant changes occur to these format definitions, a management process must be defined to un-ingest data formatted with that container definition and to re-ingest it using the latest version.
- Retention must be managed to ensure that data required for retention by law or regulation is not accidentally deleted. Conversely, deletion must be carefully monitored and managed to ensure that deletions occur per policy and always with the approval of a qualified authority. Shredding technology may be required to ensure that data is deleted completely and irretrievably.
- Migrations should always be managed to ensure that data movements occur completely and without error. Re-ingestion should be monitored to confirm that data classifications and the policies governing ingestion make sense from the standpoint of production data use.
- In general, the adherence of archive operations to all policies is essential. Access controls need to be enforced and all archive accesses, changes and modifications, and data movements need to be logged in an auditable fashion.
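The retention, deletion, and logging requirements in the list above can be illustrated with a small gatekeeper that refuses deletion while a retention period is running, requires a named approver, and records every decision. This is a sketch under stated assumptions: the class names, the log format, and the approver string are hypothetical, and a real system would write to a tamper-evident log rather than a Python list.

```python
from dataclasses import dataclass, field
from datetime import date
from typing import List, Optional


@dataclass
class ArchivedItem:
    item_id: str
    retain_until: date  # first day on which deletion is permitted


@dataclass
class DeletionManager:
    """Hypothetical gatekeeper that logs every deletion decision."""
    audit_log: List[str] = field(default_factory=list)

    def request_deletion(self, item: ArchivedItem, today: date,
                         approved_by: Optional[str]) -> bool:
        """Return True only if retention has expired and an approver is named."""
        if today < item.retain_until:
            self.audit_log.append(
                f"DENIED {item.item_id}: retention runs until "
                f"{item.retain_until.isoformat()}")
            return False
        if not approved_by:
            self.audit_log.append(f"DENIED {item.item_id}: no approver")
            return False
        self.audit_log.append(
            f"DELETED {item.item_id}: approved by {approved_by}")
        return True
```

Denied requests are logged as well as completed ones, so the audit trail shows attempted deletions and not only successful ones.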
Ultimately, archive management provides a feedback loop to the definition, selection and ingestion phases of the archive practice, converting archive from a project to a process with continuous improvement built in. Archiving therefore requires a permanent staff and an ongoing review process to ensure that the archive remains synchronized with the changing requirements of the business and the regulatory milieu.
Copyright (C) 2007 by Jon William Toigo for the AMO Journal. All Rights Reserved.