ENTERPRISE DATA FORUM
Pittsburgh Hilton, Pittsburgh, Pennsylvania - November 4-7, 2002
CONFERENCE TRIP REPORT
Produced by Wilshire Conferences
This report compiled and edited by Tony Shaw, Program Chair, Wilshire Conferences
The inaugural Wilshire Enterprise Data Forum was held before an audience of
over 350 attendees and speakers. It covered subject areas related to enterprise-wide
data management, including data integration, XML, analytical technologies, modeling,
data administration and database technologies. To receive more information about
this conference, and related future events, go to http://www.wilshireconferences.com
This report contains summaries of the key discussions and conclusions from most (not all) of the conference sessions, special interest groups, and tutorials.
This report is also available in PDF format.
Reproduction Policy
This document is © Copyright 2002 Wilshire Conferences. It may be copied,
linked to, quoted and/or redistributed without fee or royalty provided that
all copies and excerpts provide attribution to Wilshire Conferences and the
appropriate speaker(s). Any questions regarding reproduction or distribution
may be addressed to info@wilshireconferences.com.
TUTORIALS
Introduction to XML for Enterprise Data Management
and Application Integration
Peter Aiken
Founding Director
Institute for Data Research
XML represents a critical future direction for the management of data, metadata, and business rules, and will play an increasingly important role in business and systems engineering. Peter Aiken explained how XML works, and showed how to start incorporating XML capabilities into data management programs.
XML allows companies to develop scalable and “programmatic” solutions to long-term data integration needs, instead of customized interfaces. The top 5 reasons for using XML-enabled data (according to a CIO Magazine survey) are:
1. Broader search capabilities
2. Perform new operations on data in XML format
3. Reduced development and maintenance time
4. Enables reduced processes and shorter business cycles
5. Allows conversion of EDI data to better formats
Among the business benefits that XML hopes to provide:
- Internet information delivery
- provide common data access
- enable enterprise application integration
- enable web services
- enable business partner data exchange
- reduce application development time
- reduce costs of data conversion
- provide a bridge between .Net and Java
- support new mobile devices
- provide faster, better search capabilities
- provide management and analytical capabilities for unstructured data and content
XML-based EAI technologies permit implementation with minimal or no change to the existing applications or data - a non-intrusive approach.
Avoiding Catastrophe in Data Integration and ETL
Michael Scofield
Consultant and Author
Data integration requirements continue to increase, but getting the data onto the same platform is the easy part. Integrating it so it makes sense is much more difficult. Most technically-oriented programmers aren't good at this task -- it falls to the data analyst to figure out the business context of the data so that integration will be truly successful.
Successful mapping of source data to target fields depends upon a thorough understanding of the business meaning and data architecture of each source, and on designing the target database appropriately. "Semantics" means ensuring that each source data field has the comparable meaning, scope, and normal behavior (not merely field-name and format) corresponding with its peer source field(s).
The workshop discussed a range of techniques available. The speaker emphasized that you have to look beyond the documentation (file descriptions, etc.) of sources, which may be obsolete. Instead, you must look at ALL of the actual data.
Mike discussed techniques for uncovering data anomalies, data quality problems, and semantical discontinuities in how a field is used in the context of a data source. He starts by creating an inventory of the data (particularly the sources), and the logical architecture of each data source, and the behavior of the data, from the high-level view down to the specific, detailed behavior of each field and column, and inter-dependencies. Understanding data behavior includes understanding significant subtypes of subject entities, and their life cycle. Data behavior also includes the quality and consistency of the data.
Techniques in data profiling and domain studies were shown. Some surprises were found -- for example a field may be used in one way for one entity subtype, and in a different way for another subtype (application owners will often use a field for a purpose different than its original intent).
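One way to surface this kind of surprise is to profile a field separately for each entity subtype. The following is a minimal sketch only, not the speaker's tooling; the record layout, the field names (acct_type, region_code), and the sample values are all invented for illustration.

```python
from collections import Counter, defaultdict


def profile_by_subtype(records, subtype_field, target_field):
    """Count the values of target_field separately for each entity subtype."""
    profile = defaultdict(Counter)
    for record in records:
        profile[record[subtype_field]][record[target_field]] += 1
    return profile


# Hypothetical sample rows: the same field holds a region abbreviation for
# one subtype and what looks like a numeric branch number for another.
accounts = [
    {"acct_type": "RETAIL", "region_code": "NE"},
    {"acct_type": "RETAIL", "region_code": "SW"},
    {"acct_type": "RETAIL", "region_code": "NE"},
    {"acct_type": "WHOLESALE", "region_code": "0417"},
    {"acct_type": "WHOLESALE", "region_code": "0932"},
]

profile = profile_by_subtype(accounts, "acct_type", "region_code")
for subtype, values in sorted(profile.items()):
    print(subtype, dict(values))
```

Reading the value distributions side by side makes it easier to spot a field whose domain differs by subtype, which the field name and format alone would not reveal.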
Mike included a survey of techniques for establishing an on-going data surveillance program to ensure that later production-ized loads of data will not be caught by surprise when a source changes definitions or scope of the data it supplies. To recognize the importance of this, we must review all the external factors which force enterprises to “morph” their logical business data architectures. Any external source which is subject to such morphing pressure may change some aspect of its logical architecture (and hence the precise meaning of the data they send you), and a receiving organization needs to be alert for such changes, or significant trauma to the target database (or data warehouse) may result.
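A very simple form of surveillance is to compare the set of values seen in each incoming load against a baseline recorded when the source was first studied. The sketch below assumes that approach; the function name, the status codes, and the baseline are hypothetical, and a real program would track many more properties (volumes, null rates, value distributions).

```python
def domain_drift(baseline_values, incoming_values):
    """Report values that are new in the incoming load or absent from it."""
    baseline = set(baseline_values)
    incoming = set(incoming_values)
    return {
        "new": sorted(incoming - baseline),      # never seen in the baseline
        "missing": sorted(baseline - incoming),  # expected but not supplied
    }


# Hypothetical status codes: baseline from the original domain study,
# incoming from the latest production load.
baseline_status = ["A", "C", "P"]
incoming_status = ["A", "P", "P", "X"]

report = domain_drift(baseline_status, incoming_status)
print(report)
```

An unexpected entry under "new" is a prompt to ask the source owner whether the definition or scope of the field has changed before the load reaches the target database.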
Transforming An Operational Model Into A Physical Warehouse Database
Tom Haughey
Chief Technology Officer
Pepsi Bottling Group
This presentation showed the progression of a data model from an operational model to an analytical database, including the central data warehouse model and different data mart structures. The basic premise of the speaker is that this transformation must be based on principles, not just patterns.
The first major structure is the Central Data Warehouse (CDW). The CDW is the focal and most granular repository within the DW, and supports long-term strategic analysis and reporting. It has three main purposes: to provide data to any requiring application; to support some direct querying; and to support all ad hoc querying. It houses integrated data at both atomic and summarized levels. Its main characteristics are that it is atomic, integrated, historical, read only, general purpose and application independent.
Data marts are a major step in the progression. A data mart is an environment
containing a specialized set of related data, customized for a specific community
of knowledge workers, analysts or planners, to support their reporting and analysis
needs. There are three types of data marts: embedded, dependent and independent.
Data Modeling - The Big Issues
Graeme Simsion
Senior Fellow
University of Melbourne
Graeme Simsion looked at some of the most important issues facing today’s data modelers, and offered practical approaches to addressing them. The tutorial addressed the role of data modeling and data modelers, as well as the modeling process itself. As Graeme said at the beginning of the session, his purpose was to facilitate a discussion rather than lecture the audience, because there often is no single “correct” answer in data modeling. Nonetheless, some of his advice includes:
- Is modelling a design or analysis task? Graeme leans towards modelling as
a design task. To quote from his book (written with Graham Witt) “data modeling
is, overall, a design activity, but it includes the task of understanding requirements.
There’s a time to ask … and a time to propose - and even persuade.”
Considerable time was spent on the oft-used analogy between data modeling and
architecture. This was a thread used throughout the tutorial to examine the
role of the modeler with respect to various issues of discussion. For example,
consider the dialog between an architect and a client. The architect brings
the following to the conversation:
- Tools for explaining
- Domain knowledge - lots!
- Access to specialists - a one-stop shop
- Knowledge of materials and costs
- His own specialization - and idiosyncrasies
- Patterns - lots of them
Additional specific discussions included:
What makes a good entity definition? It will clearly answer two questions:
- What distinguishes instances of this entity from instances of other entities?
- What distinguishes one instance from another? (cf: “identity”)
It will include:
- examples
- counter-examples
- any relevant extreme cases
- behavior?
And what about a good attribute definition? It should answer the questions:
- “What does it mean to assign a value to this attribute?”
- “What does each value that can be assigned to this attribute mean?”
Overall, the tutorial included a great deal of rich discussion and debate that is difficult to capture in this abbreviated summary. Interested readers are strongly encouraged to attend Graeme’s future tutorials and/or workshops, or to obtain a copy of his text, Data Modeling Essentials.
Implementing a Message-Based Data Integration Strategy
David McComb, Simon Robe & Simon Hoare
Semantic Arts
Three consultants from Semantic Arts presented an all day tutorial on a new approach to integration using XML and message oriented middleware. One of the key themes was that in the absence of design and coordination, the current trends in XML and Web Services are going to lead to chaos in Enterprise Application Integration. To counter this they presented a detailed methodology for “Enterprise Message Modeling” and made a case that this was going to be a core discipline for organizations that were migrating from legacy systems to web based technologies. Their tutorial also included:
- The economics of systems integration. They come to the non-obvious conclusion that the ideal application architecture will consist neither of a few very large applications (the ERP approach) nor the pure component/web services model of highly fragmented systems. They propose a moderate number of applications tied through a message model.
- The central role of partitioning and decoupling. Messaging and integration are the obverse sides of the partitioning and decoupling coin, and the choices made in partitioning and decoupling have a great deal of influence on how well the messages will integrate the enterprise. They provided a model that included six axes of decoupling and the intermediate forms needed to make the decoupling really work.
- Practical tips for getting started – The presentation was liberally sprinkled with examples from their client work, and concluded with a step by step approach to getting started with these new methods. They contend that their method allows for incremental implementation over time, which has considerable advantages over other integration approaches which require widespread enterprise deployment from day one, in order to succeed.
Enterprise Metadata Implementation: Getting to Success
R. Todd Stephens
Director of the Metadata Services Group
BellSouth Corporation
This tutorial focused on the formulation and implementation of an enterprise metadata strategy. The Metadata Services Group within BellSouth has spent the last 3 years developing an enterprise metadata solution based on a solid product line and a customer service focus. The speaker acknowledged that his success was enabled by high-level executive support and sufficient budget to get the job done right. However, he also suggested that individual components of his overall strategy can be used to achieve incremental progress.
Enterprise metadata and EAI may seem like a strange relationship, but the reality is that metadata is one of the most critical elements of a solid application integration effort. Todd shared the lessons of a three-year learning curve with the attendees of this workshop. The session reviewed seven perspectives of an enterprise metadata effort.
- Enterprise Metadata Environment
- The Architecture of an Enterprise Metadata effort
- The Project and Implementation side of an enterprise effort
- The importance of Usability in Metadata
- Technical Architecture of the Repository Collection
- The Principle of Success around the service side of delivery
- Key Lessons Learned
Metadata and the principles that define this technology must be expanded into the other areas of the enterprise environment. Interfaces, components, schemas, DTDs, web services, systems, documents, web pages, metrics, as well as the components of the traditional database metadata effort, need to be looked at from a different view. The organization that plans on implementing an enterprise architecture needs to take a long look at the data architecture and ensure that it includes a heavy dose of metadata.
Developing a Master Plan for Managing Your Ever-Growing Data
Daniel Linstedt
Chief Technology Officer
Core Integration Partners
Dealing with ever-increasing data volumes is an issue facing most large organizations. The speaker made the point that managing terabytes of data today is difficult, but that it is a walk in the park compared to managing the petabytes of tomorrow, which he said was the equivalent of climbing Mount Everest. In the limited time available, the speaker talked mostly from the perspective of the business, rather than trying to cover the deeply technical issues. This included: capturing information, integration and management, application of analytics, filtering, reporting, and future trends. He focused on large data sets, quality, and synthesized data as a theme, and also covered the planning, thought processes, up-front work, tricks and traps of these extremely large systems.
OMG Model Driven Architectures
Jon Siegel
Vice President of Technology Transfer
Object Management Group
Cory B. Casanave
President
Data Access Technologies
Because each middleware platform works best in a particular network niche (such as behind the firewall, or over the Internet), today's enterprise must deal with a multitude of platforms and connectivity paradigms. OMG's Model Driven Architecture (MDA) unifies and simplifies this environment by defining software fundamentally at the model level, expressed in the standard Unified Modeling Language (UML).
An application's base model specifies every detail of its business functionality and behavior in a technology-neutral way. Working from the base model, MDA tools use OMG-standard mappings to generate interfaces and most or all of the implementation code for one or more target middleware platforms. Tools also generate cross-platform invocations, allowing easy interworking with other applications wherever they reside. MDA supports applications over their full lifecycle starting with design and moving on to coding, testing, and deployment, through maintenance, and eventually to evolution to a new platform when an application's existing platform becomes obsolete.
Another benefit: because industry standards defined in the MDA are platform-independent, they can be used by every enterprise even in industries that haven't converged on a single middleware platform. The MDA is the base architecture for OMG standards (as of September 2001).
From Agile Modeling to Agile Data
Scott W. Ambler
President and Senior Consultant
Ronin International, Inc.
The Agile Modeling (AM) methodology attempts to answer many questions, including: How do you successfully model the complexities of modern-day software without getting bogged down in mountains of paperwork? How do you effectively engineer the requirements for your system? What techniques can you apply to analyze those requirements? To architect and design your software? How can you document your system in an effective manner? AM is a chaordic, practices-based methodology for effective modeling and documentation (www.agilemodeling.com).
The Agile Data (AD) method describes techniques for application developers, DBAs, enterprise/data administrators, and enterprise architects to work together effectively (www.agiledata.org). AD also attempts to answer many questions, including: How can developers and data professionals work together effectively? How can enterprise-level people work together effectively with project-level people? What techniques and technologies are available so that data professionals can work in an iterative and incremental manner, just as the majority of developers prefer to work? How can data-related concerns be reflected on projects taking an agile approach to development?
Scott makes the point that the vast majority of software processes work in an iterative and incremental manner, and that data professionals need to start working this way too. We need to work together as a team; the finger pointing between the object and data communities has to stop.
CONFERENCE SESSIONS & SPECIAL INTEREST GROUPS
(Summaries are in the sequence they were presented at the conference)
Concordance: Managing Mismatched Data from Multiple Sources
Denise Draper
Chief Software Architect
Nimble Technology
When integrating data from multiple sources, one of the primary hurdles to overcome is how to match the data that refer to the same entity across different sources, when there is no 'natural key.' A classic example is two systems that house customer data that have to be matched on customer name or address, but names and addresses are subtly different.
This problem is solved with various kinds of 'merge/purge' techniques for creating warehouses or cleaning source data, but the problem is different when doing virtual data integration, when the underlying source data remains 'dirty' but cannot be changed. We call this the 'concordance' problem.
The speaker described the issues involved in solving the concordance problem, and described a solution based on creating an independent concordance database, which tracks the relationships between records in multiple sources.
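The idea of an independent concordance database can be illustrated with a single cross-reference table that records which keys in which sources refer to the same real-world entity, leaving the sources themselves untouched. This is a minimal sketch under assumed names: the table layout, the source names ("crm", "billing"), and the keys are hypothetical and are not taken from the speaker's design.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# One row per (source, key) pair; rows sharing an entity_id are judged to
# describe the same entity. The source systems are never modified.
conn.execute("""
    CREATE TABLE concordance (
        entity_id   INTEGER NOT NULL,
        source_name TEXT    NOT NULL,
        source_key  TEXT    NOT NULL,
        PRIMARY KEY (source_name, source_key)
    )
""")
conn.executemany(
    "INSERT INTO concordance VALUES (?, ?, ?)",
    [
        (1, "crm", "C-1001"),
        (1, "billing", "88213"),
        (2, "crm", "C-1002"),
    ],
)


def find_matches(conn, source_name, source_key):
    """Return the (source, key) pairs in other records for the same entity."""
    return conn.execute(
        """
        SELECT b.source_name, b.source_key
        FROM concordance AS a
        JOIN concordance AS b ON b.entity_id = a.entity_id
        WHERE a.source_name = ? AND a.source_key = ?
          AND NOT (b.source_name = a.source_name
                   AND b.source_key = a.source_key)
        ORDER BY b.source_name, b.source_key
        """,
        (source_name, source_key),
    ).fetchall()


print(find_matches(conn, "crm", "C-1001"))
```

A virtual integration layer could consult such a table at query time to join records across sources, while the matching logic that populates it (name and address comparison, manual review) remains a separate concern.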
Stewardship - The Road Taken
Bob Seiner
KIK Consulting
Companies need to get over the semantics hurdle ... if the appropriate term is "data owner" for your organization, so be it ... although it promotes "my data" syndrome.
Stewardship does not come without a cost. However the cost is not a technology cost. The primary cost is in the time and availability of resources to formally accept and adhere to their role and responsibility. Another major cost is the cost of resolving business data integration issues (what's the alternative?).
Recent events on Wall Street will likely drive top level management to look for ways to become more confident in the data and numbers that they are reporting. Companies should look to that level for sponsorship of stewardship efforts -- but it will take some convincing and a well thought out plan for how you are going to make stewardship happen.
Introduction to RosettaNet
Robert Oberwetter
Applications Development Manager
Tokyo Electron America
Robert Oberwetter gave a presentation introducing RosettaNet, the organization and the standards it promotes. RosettaNet.org is a subsidiary of the Uniform Code Council (UCC) and is a consortium of companies working together to implement e-business process standards. These are standards for a common e-business language which aligns processes between supply chain partners. The standards include XML documents for the data to be exchanged, definitions for the data, which data is required and optional, and the business processes and time frames that surround the business process interactions. By implementing the RosettaNet standards, companies can engage in dynamic, flexible trading-partner relationships, gain operational efficiencies, reduce costs and raise productivity. In adopting the RosettaNet standards, end users enjoy speed and uniformity in purchasing practices since everyone is using the same business process.
Using XSLT for Cheap Data Transformation
Hal Davis
Project Manager
Mellon Financial Services
XSLT (eXtensible Stylesheet Language Transformations) has become a universal format for describing and implementing data transformations. It is relatively simple and cheap to implement and can be used to automatically generate metadata for XML documents.
The speaker showed examples of how XSLT can be used for data transformations within and between enterprises, and differences/similarities between XSL Transformations and more conventional data transformation approaches.
Keynote Speaker
The Information Management And Analytics Topography: 5-Year Forecast
John Ladley
Knowledge Interspace
John Ladley replaced the originally scheduled speaker Doug Laney at the last minute, due to Mr. Laney's illness, but essentially spoke to the original presentation. Key recommendations from the talk were:
1. Minimize overlapping analytic toolsets, but select analytic tools for discrete purposes and user communities. Build analytics into operational processes to continuously optimize them and deploy parallel-capable data integration solutions.
2. Consider viability of outsourcing burdensome information management or analytics. Wrap partners, employees, customers and suppliers into common relationship management framework. Apply "real-time" judiciously.
3. Begin regularly auditing information assets, mature (or outsource) information management, and expand the data administrator's role and authority.
Using Web Services for Integration Within and Outside the Enterprise
Leo Kraunelis
Director
OASIS/XML.org
A simple definition of web services is that they are a collection of standards and protocols (XML, SOAP, WSDL, UDDI) that allow us to make processing requests to remote systems by speaking a common language and using common transport protocols (HTTP, TCP/IP, SMTP, …). The speaker admits that “web services” is probably a bad name for a very good idea.
Web services begin with XML as the basis for systems to talk to each other. XML is the power behind the data integration aspect of web services. The transportation protocol is SOAP, which wraps the XML data and commoditizes the APIs. SOAP does for apps what HTML did for content. WSDL (web services description language) describes and defines the XML info in the SOAP envelope. The ultimate promise is a new level of compatibility across multiple platforms.
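To make the "SOAP wraps the XML data" point concrete, the sketch below builds a SOAP 1.1 envelope around a small XML request using only the Python standard library. The envelope namespace is the standard SOAP 1.1 one; the application namespace, the GetOrderStatus operation, and the OrderId element are hypothetical and would in practice be defined by the service's WSDL.

```python
import xml.etree.ElementTree as ET

SOAP_NS = "http://schemas.xmlsoap.org/soap/envelope/"  # SOAP 1.1 envelope
APP_NS = "urn:example:orders"  # hypothetical application namespace


def build_request(order_id):
    """Wrap a hypothetical GetOrderStatus request in a SOAP envelope."""
    ET.register_namespace("soap", SOAP_NS)
    envelope = ET.Element(f"{{{SOAP_NS}}}Envelope")
    body = ET.SubElement(envelope, f"{{{SOAP_NS}}}Body")
    request = ET.SubElement(body, f"{{{APP_NS}}}GetOrderStatus")
    ET.SubElement(request, f"{{{APP_NS}}}OrderId").text = order_id
    return ET.tostring(envelope, encoding="unicode")


message = build_request("PO-12345")
print(message)
```

The resulting text would typically be sent as the body of an HTTP POST; the receiving system needs only an XML parser and knowledge of the agreed message structure, regardless of its platform.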
The Many Become One ... Integrating Disparate Data into an Enterprise Data Warehouse
Alan Chow
SVP, R&D
Teradata, a division of NCR
The key point from Alan’s presentation was that data consolidation is the key to obtaining a single view of the business. He contends that a set of multiple data marts cannot deliver the same benefits as a data warehouse, and will be more costly, more difficult to manage and less flexible. Typically, the problems with data marts are that it is difficult to form linkages between subject areas, it takes too long to generate reports from multiple systems, and users are less able to achieve corporate “single view” goals with a data mart strategy. In addition, the “wasted” cost of redundancy in multiple data marts is between 35% and 70% of the total cost.
The business justification for data mart consolidation is clear. In one actual case, the elimination of 6 data marts in the revenue reporting area saved $11.5 MM over 3 years. Estimated migration project costs were $1.9 MM, resulting in a payback period of less than 3 months. Longer term, their business intelligence capabilities have been improved through elimination of inconsistencies due to data redundancy, latency, and multiple sets of business rules. The new system supports ad hoc querying across multiple subject areas, and they get one centralized version of the “truth.”
Be The Master Of Your Domain
Doug Stacey
Team Leader, Metadata Infrastructure Support
Allstate Insurance Company
Renee Zea
Data Analyst
Allstate Insurance Company
Domain Management is central to Allstate Insurance's data integration strategy. Through the management of Business Domains, the company's Enterprise Data Management team has achieved consistency in business definitions, documented and integrated the multiple sets of values and codes used throughout the enterprise, and provided the links between physical schemas and logical data models. By building a Domain Management set of tools, Allstate has created a solution for researching, managing, and standardizing both encoded and non-encoded data.
Interestingly, many speakers and attendees saw the Allstate system as a potential model for other applications outside the insurance industry, including securities trading (getting to T+1).
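The core of integrating "multiple sets of values and codes" can be pictured as a crosswalk from each system's local code to one enterprise-standard value for the domain. The sketch below is illustrative only and does not describe Allstate's tools; the domain, the system names, and the codes are all invented.

```python
# Hypothetical business domain "marital status": enterprise-standard values
# and the differing codes two (invented) systems use for them.
STANDARD_VALUES = {"SINGLE": "Single", "MARRIED": "Married"}

CODE_MAP = {
    ("policy_admin", "S"): "SINGLE",
    ("policy_admin", "M"): "MARRIED",
    ("claims", "1"): "SINGLE",
    ("claims", "2"): "MARRIED",
}


def standardize(system, code):
    """Return the standard domain value for a system-specific code, or None."""
    return CODE_MAP.get((system, code))


print(standardize("policy_admin", "M"), standardize("claims", "2"))
print(standardize("claims", "9"))  # unmapped code: a candidate for research
```

Keeping the mapping as data rather than burying it in each interface is what allows a central team to research, manage, and standardize the values in one place.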
Data Model Patterns, Generalizations and XML
Roland Berg
Principal Consultant
ThinkSpark
Patterns and generalizations have long been touted as the solution to managing unstable data environments. The use of metadata-centered database designs creates a great deal of flexibility in the structures and allows the database to become evolutionary in content while maintaining structural consistency. XML and related technologies are considered to be the solution to the problem of exchanging dynamic data across diverse systems.
It is only natural that we should explore the implications of applying XML-based data interchange to patterned/generalized databases and vice-versa. This presentation discussed the impact that each technological approach has on the other and explored some alternatives available for representing a patterned/generalized database as an XML structure.
Specific issues addressed included the interaction of the object-like nature of the highly generalized database with the hierarchical structure of XML and options for transporting the highly metadata-driven information via XML.
Unified Modeling Language (UML)
Robert Maksimchuk
Data Modeling Evangelist
Rational Software Corporation
UML is intended to provide a common language for all the participants in the development process (business analyst, software engineer, database designer, web developer). Typically, projects fail because of lack of user input, unclear objectives, incomplete and/or changing requirements and specifications and lack of planning. The use of the UML to model business processes, systems requirements, software applications and database design allows improved communication within software projects, and therefore helps prevent the aforementioned causes of failure.
XML as Meta Data
Matthew Williams
Senior Data Analyst
Worldspan
Matt Williams shared his personal experience as a data analyst in embracing and standardizing on XML practices within his organization. From a DA or DBA perspective XML is metadata, and as such needs to be incorporated into the database design process. This necessitates that programmers/application developers, DAs and DBAs work closely with each other to harness the power of XML without jeopardizing data integrity within the organization.
Database Futures
Alan Chow
SVP, R&D
Teradata, a division of NCR
William Ruh
Senior Vice President of Professional Service
Software AG, Inc
Sam Batterman
Business Intelligence Evangelist
Microsoft Corporation
This panel session offered views of the future of database technology from
the standpoint of three vendors. Software AG is focussed on XML-based solutions,
for all aspects of data management (integration, storage, transport, etc.).
Teradata sees data consolidation as the key to better integration and management.
Microsoft is, to a large extent, also providing XML-based solutions, but not
exclusively. Key takeaways from the session were:
- traditional relational databases are not going away any time soon. New XML
databases will likely co-exist with RDBMS.
- XML-based storage will become a key feature of new databases, allowing all
sorts of unstructured objects (incl. Word documents, Powerpoint files, pictures)
to be stored in more highly structured formats.
- analytics and business intelligence are the key drivers behind new database
technology. Increasingly, analytical tools will be built into the native database.
- demand is also increasing for real-time business monitoring capabilities
Are we Headed Towards Massively Distributed Integration?
Michael Hoskins
President
Data Junction Corporation
This presentation aimed to get attendees thinking about their approach to integrating their information and technologies and what resources they should be querying in order to accomplish it. A dynamic, universal approach to integration continues to elude today’s businesses, regardless of their size, scope or budget. The evolution of integration has shifted from the problem of Enterprise Application Integration (EAI) -- the tying together of all applications inside the enterprise -- to the much more global issue of Business-to-Business Integration (B2Bi). As integration is now requisite both inside and outside the enterprise, new kinds of problems are demanding Distributed Application Integration (DAI).
To solve today’s massively distributed application integration projects, solutions must be massively distributed as well. Basic patterns in biology teach us what type of architecture effectively solves massively distributed problems -- not only must the solution itself be massively distributed, it must also be highly intelligent and dynamic, changing and developing as the challenges themselves evolve. Consequently, it is through emergent integration systems, working at the firewall of each business in an integration chain, that disparate data can be mediated (semantically and syntactically) into the enterprise’s own unique systems.
Analytical Modeling Manifesto
Tom Haughey
Chief Technology Officer
Pepsi Bottling Group
Dimensional modeling is usually presented as the end-all and be-all of data warehousing. Yet, dimensional modeling has strengths and weaknesses. In some ways it has become outmoded. In other ways, it has been around for decades (and will continue to be). There are three ways to improve performance: use better hardware, use better software and optimize the data. The primary justification for dimensional modeling is to improve performance by compromising the data to compensate for the inefficiency of technology. It uses the third method above. A secondary purpose is to provide a consistent base for analysis. Dimensional modeling comes with a price and with restrictions. There are times and places where dimensional modeling is appropriate and will work, and other times and places where it is inappropriate and will actually interfere with the goals of a warehouse.
Experiences of a Data Modeler on a RUP Project
Christine Mandracchia
Manager - Data Administration
American Re-Insurance
The speaker is a logical data modeler who was actively involved in a 2 year project undertaken using the RUP methodology, which does not have an LDM deliverable nor the role of a logical data modeler. She presented both her pre-conceived notions about using the RUP methodology, and her actual experiences on the project, in the areas of the designated roles, the object class model, other related artifacts, and the iterative process.
Her pre-conceived notions centered around her perception that the object class model, and related deliverables, would be created with a design or development focus, and with a mix of data and process, rather than from a business requirements focus, and that the data would then be less sharable. In practice, she experienced the same requirements facilitation and clarification process as for traditional logical data modeling, due to the expertise of the person staffed in the "Architect" role on the project. She also experienced more challenging requirements scope management for this object oriented analyst during each iteration, and additional work for the DBAs due to multiple iterations.
It is her current perspective that if the object class model is developed by a RUP "Architect" who has a business requirements focus, then an entity-relationship LDM does not also need to be developed. She became convinced during this project of the benefits of the iterative process for system development.
Analytical API Update: XML for Analysis & JOLAP
Seth Grimes
Principal Consultant
Alta Plana Corporation
BI vendors led by Microsoft, Hyperion, and SAS Institute last year released version 1.0 of the XML for Analysis (XML/A) specification, "an open-standards-based messaging interface" designed to "promote the standardization of the data access interaction between a client application and business intelligence systems and other applications over the Web and in distributing environments."
Meanwhile, the nascent JOLAP specification provides a similar API for the J2EE [Java] Web services environment, one that "supports the creation and maintenance of OLAP data and metadata, in a vendor-independent manner."
The Semantic Web
Brett Champlin
Technical Shared Services
Allstate Insurance Company
William Ruh
Senior Vice President of Professional Service
Software AG, Inc
Dave McComb
President
Semantic Arts
The Semantic Web is a much anticipated (and yet often misunderstood) concept. Fundamentally, the Semantic Web is a vision of the future in which documents and data contain descriptive metadata which allows them to be easily understood by computers. The potential payoffs are huge, particularly in terms of making search technologies more powerful, but the implementation questions remain largely unanswered. This panel agreed that the vision of the Semantic Web is compelling, but that practical issues such as how to imbue the vast stores of existing documents with meaningful meta data, and how to apply standards consistently across the web, will remain enormous. Progress towards the vision will likely be incremental.
ETL vs. EAI: Comparing Data Integration Approaches
Faisal Shah
Chief Technology Officer
Knightsbridge Solutions
EAI follows ETL as the latest category of data integration tools. Many organizations are tempted to address all of their integration needs through just one category of tool. But the long-term costs of trying to solve ETL issues with EAI tools (and vice versa) can far outweigh the upfront costs. The two categories treat latency, unit of work granularity, meta data integration, third-party product integration, and other product dimensions differently.
Organizations need to address ETL and EAI holistically and at the same time understand that there are still significant differences between the tools and ways to approach integration projects. EAI and ETL tools continue to grow closer together, but there are still significant advantages to using each for its original purpose, and knowing how to leverage these will allow an integration project to deliver the right information at the right time and at the right cost.
Data Refactoring: Enabling Iterative and Incremental Database Development
Scott W. Ambler
President and Senior Consultant
Ronin International
A database refactoring is a small change to the design of a database schema that improves its quality without changing its behavioral or data semantics. Database refactoring is enabled by full regression test suites and effective db management scripts. Database refactoring works in practice, and enables data professionals to work in an iterative and incremental manner. The speaker believes database refactoring may be the best hope for organizations to fix their legacy database schemas. www.agiledata.org/essays/databaseRefactoring.html
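The idea of a small, semantics-preserving schema change guarded by a regression test can be sketched as follows. This is a minimal illustration rather than the speaker's own method: the `customer` table, the column names, and the helper functions are all hypothetical, and SQLite is used only because it ships with Python.

```python
import sqlite3

def create_schema(conn):
    # Hypothetical legacy schema with a poorly named column.
    conn.execute("CREATE TABLE customer (id INTEGER PRIMARY KEY, fname TEXT)")
    conn.executemany(
        "INSERT INTO customer (id, fname) VALUES (?, ?)",
        [(1, "Ada"), (2, "Grace")],
    )

def refactor_rename_column(conn):
    # Small refactoring step: introduce the better-named column and copy
    # the data across. The old column is kept during a transition period
    # so that existing applications continue to work.
    conn.execute("ALTER TABLE customer ADD COLUMN first_name TEXT")
    conn.execute("UPDATE customer SET first_name = fname")

def snapshot(conn, column):
    # Regression check helper: read the data through the given column.
    query = f"SELECT id, {column} FROM customer ORDER BY id"
    return conn.execute(query).fetchall()

conn = sqlite3.connect(":memory:")
create_schema(conn)
before = snapshot(conn, "fname")
refactor_rename_column(conn)
after = snapshot(conn, "first_name")

# The regression test: the data semantics must be unchanged.
assert before == after
```

In a real project the regression suite would cover application behavior as well as data, and the old column would be dropped only after all dependent code had migrated.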
Enterprise Data Integration: Development of an Enterprise Data Model
Noreen Kendle
Enterprise Architect
Delta Technology - Delta Air Lines
This presentation focused on the approach developed and used at Delta Air Lines for the creation of an Enterprise Data Model. The Delta Air Lines Enterprise Data Model is now being used to create the Operational or Enterprise Data Stores, integrating operational data across the airline business. The speaker described her methodology for developing an EDM that incorporates the enterprise view needed for integration to support an ODS and DW, as well as the current state (work already accomplished – existing models) for practicality and quicker development.
The Delta model is now over 700 integrated entities and encompasses 5 of the major business subject areas. The model is being used to build the Operational Enterprise Data Stores and will eventually be used for the Enterprise Data Warehouse.
XML Tools
XML for Data Integration
Mark Milodragovich
Senior Information Engineer
Nimble Technology, Inc.
Bradley Wright
Vice President, Product Development
MetaMatrix, Inc
This session compared two approaches to data integration using XML. The first (from Nimble Technology) uses virtual XML documents, sometimes called XML Views, to dynamically mediate and integrate data from heterogeneous data sources.
The second, from MetaMatrix, discussed the integration of legacy systems into XML standard schemas. Using the Market Data Definition Language (MDDL) as an example, the speaker showed how the schema can be represented as a virtual model and then mapped to non-XML physical sources, as well as XML sources.
Roadmap to Federated Data Architecture
Ho-Chun Ho
President
HoTech Corp
The goal of architectural planning is to enable organizations to optimize revenue and increase shareholder value by establishing the supporting strategy, standard process, culture, technology and best practices. Over the years organizations have built silo systems and isolated data islands, often for practical reasons. What is largely overlooked is that an inadequately designed data architecture organization contributes to this disparity. This presentation discussed models of data architecture organizations, the pros and cons of each type of organization, the concept of federation governance and local autonomy, and the roadmap to establish data architecture in a federated manner based on real-life experience.
Information Quality through Semantic Models
Joshua Fox
Software Architect
Unicorn Solutions Ltd.
Understanding data source semantics and their reference to a unified business model is central to ensuring total information quality. This presentation discussed how to apply a central conceptual model to provide semantics to data schemas, including the two critical questions - where is the data? and what does it mean? Combining such a model with a formal development process ensures information quality that transcends the limits of a single system, transformation, or data warehouse.
Data integrators today analyze the business concepts behind their data, and design transformation logic to unify metadata. These procedures must be repeated individually for each data source and transformation, with the resulting integrations providing low quality output that is often impossible to maintain. The presentation demonstrated how data analysts can understand their numerous data sources without re-analyzing each schema’s semantics and structure. When the rich semantic model helps implement business information quality coherently across the enterprise, disjointed data is transformed into meaningful information.
Drill-Thru and the Corporate Information Factory
Nicholas Galemmo
Information Architect
Nestle
This presentation examined the issues involved in providing drill-through capability from summarized dimensional data marts into a detailed 3NF data warehouse as prescribed in the Corporate Information Factory (CIF) architecture. It presented the Kimball Conformed Dimensions concept and applied it to the CIF, and looked at issues involved in generating and preserving key values and dealing with structural differences between the 3NF and dimensional models. Problem areas and possible solutions were identified, as well as the level of functionality a query tool should provide to support cross-model drill-through capabilities.
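One way to picture the key-preservation issue is a dimension table that carries both its own surrogate key and the natural key used in the 3NF warehouse, so that a summary row can be traced back to its detail. The sketch below is an assumption-laden illustration, not the design shown in the session: every table and column name is hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Detailed 3NF warehouse tables (hypothetical)
CREATE TABLE product (product_code TEXT PRIMARY KEY, name TEXT);
CREATE TABLE order_line (order_id INTEGER, product_code TEXT, amount REAL);

-- Dimensional mart: surrogate key plus the preserved natural key
CREATE TABLE dim_product (
    product_key INTEGER PRIMARY KEY,
    product_code TEXT,
    name TEXT
);
CREATE TABLE fact_sales_summary (product_key INTEGER, total_amount REAL);

INSERT INTO product VALUES ('P1', 'Cocoa'), ('P2', 'Coffee');
INSERT INTO order_line VALUES (100, 'P1', 10.0), (101, 'P1', 15.0), (102, 'P2', 7.5);
INSERT INTO dim_product VALUES (1, 'P1', 'Cocoa'), (2, 'P2', 'Coffee');
INSERT INTO fact_sales_summary VALUES (1, 25.0), (2, 7.5);
""")

def drill_through(conn, product_key):
    # Translate the mart's surrogate key back to the warehouse's natural
    # key, then fetch the detail rows behind the summarized figure.
    return conn.execute(
        """
        SELECT o.order_id, o.amount
        FROM dim_product AS d
        JOIN order_line AS o ON o.product_code = d.product_code
        WHERE d.product_key = ?
        ORDER BY o.order_id
        """,
        (product_key,),
    ).fetchall()

detail = drill_through(conn, 1)
```

If the dimension did not preserve `product_code`, the query tool would have no reliable path from the summary row to the detail rows, which is the kind of problem area the session addressed.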
Current Controversies in Data Modeling
Graeme Simsion (moderator)
Scott Ambler
David Hay
Brett Champlin
Robert Maksimchuk
The panel attempted to discuss some key issues that data modelers are grappling
with today. For the most part, the discussion came down to three main areas:
1. The debate about agile modeling/agile data methods, as proposed primarily
by Scott Ambler (a panelist),
2. A similar debate about UML and the OMG's model driven architecture, and
3. A discussion of data modeling’s place in the world, and whether the role
of the modeler is still relevant today.
Key take-aways from the discussion included:
- Proponents of agile, MDA and "traditional" modeling approaches actually
agree about a great deal, particularly in terms of "people" and "communication"
issues. They depart primarily on the relative priority of data vs programming
roles (perhaps a "chicken or egg" debate)
- There is most definitely a role for data modeling in contemporary development
environments, though modelers need to adjust their thinking, and their skills,
to deal appropriately with the changes that continue to occur. In particular,
they need to be more flexible in their approach - moving away from the "all
or nothing" approach and being willing to work within the constraints of
typical environments. This is not to say that the grand vision of enterprise
data management is not valuable, but instead realizing that it is very difficult
to achieve and that smaller victories are still possible and valuable.
XML Tools: Native XML Databases
Alex Cheng
Director of Engineering
Ipedo
This session explained the similarities and differences between XML databases and Relational DBMS, in terms of how they store data, the query language and application interfaces used, and the methods for managing schema.
Managing XML Assets
Kathryn Breininger
CENTRAL Project Mgr., Emerging Technologies
The Boeing Company
As XML becomes more widely used, the need for efficient management of XML-related assets becomes critical. This presentation described how repository and registry technologies are being used to manage XML assets, such as DTDs and schemas, providing discovery of, access to, and sharing of these assets.
The Central Registration Authority and Locator (CENTRAL) is a Boeing enterprise-wide Registry and Repository designed to store and retrieve reusable eXtensible Markup Language (XML) assets such as Document Type Definitions (DTDs) and XML schemas. The CENTRAL Registry contains metadata and locations for XML assets and makes these assets available to the entire Boeing enterprise as reusable objects.
This presentation provided an overview of the CENTRAL project, including the scope of the project, events that led up to the development of the system, the design of the system, and the functions and roles of the users. The first production release of CENTRAL was presented, including concepts for the architecture, services, and functions to be provided in future phases.
Engaging Data Administration in the Enterprise
Tom Bilcze
Senior Group Coordinator
Roadway Express
Data Architects offer the promise of building a sound data infrastructure yet often end up littering the road to systems development with walls and obstructions. This session showed how to build a collaborative environment by partnering with applications developers and end-user business staffs, in order to make the DA function into a key contributor to business projects.
Essentially, the speaker demonstrated the necessity to break out of the traffic cop mentality to data administration, take stock of your strengths and assets, and get to know your customer better. He discussed how to update your methods and procedures for today's rapid development cycles, how to market and sell your new "value of information" philosophy, and how to use your data modeling tools to effectively communicate with both technical and non-technical users.
New Approaches to Customer Data Integration
Chandos Quill
Vice President, Strategic Marketing
Experian
Jeff Canter
Vice President of Operations
Innovative Systems, Inc.
This session offered two alternative approaches to customer data integration.
Experian’s approach of Referential Linking provides additional data that enhances
the identification process. This is the next evolution of Customer Data Integration
(CDI) where a company compares its data to a reference database. The reference
database contains a vast compilation of variations of names and address history
from thousands of reliable data sources. In essence, referential linking provides
access to all current and former name and address variations to accurately identify
and link disparate information. The key benefits of referential linking are:
* Intelligent decisions are made because additional data is being considered
within the decision making process
* Historic information enables accurate identification of consumers regardless
of the variations that naturally occur
* Increased accuracy of data integration process results in more appropriate
customer interactions
* The many sources of information become an extension of your environment, providing
data that is not available within your organization
* As new information is received it is matched against the reference database,
which enables persistent accuracy
* Cost savings due to elimination of full file refresh – only incremental changes
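The core idea of referential linking, comparing incoming records against a reference database of known name and address variations, can be sketched as a simple lookup. This is only a toy illustration under stated assumptions: the `REFERENCE` table, the identifiers, and the normalization rule are hypothetical and far simpler than any production matching process.

```python
# Hypothetical reference database: each known name/address variation
# (current or former) maps to one persistent consumer identifier.
REFERENCE = {
    ("robert smith", "12 oak st"): "C-001",
    ("bob smith", "12 oak st"): "C-001",      # nickname variation
    ("robert smith", "90 elm ave"): "C-001",  # former address
    ("roberta smyth", "4 pine rd"): "C-002",
}

def normalize(name, address):
    # Assumed normalization: lower-case and collapse repeated whitespace.
    return (" ".join(name.lower().split()), " ".join(address.lower().split()))

def link(records):
    # Group source systems by the reference identifier their record matches.
    # Records with no match in the reference database are grouped under None.
    linked = {}
    for rec in records:
        ref_id = REFERENCE.get(normalize(rec["name"], rec["address"]))
        linked.setdefault(ref_id, []).append(rec["source"])
    return linked

records = [
    {"source": "billing", "name": "Bob Smith", "address": "12 Oak St"},
    {"source": "crm", "name": "Robert  Smith", "address": "90 Elm Ave"},
    {"source": "web", "name": "Jane Doe", "address": "1 Main St"},
]
linked = link(records)
```

Here the billing and CRM records share no name or address in common, yet both resolve to the same identifier because the reference data supplies the historical variations that neither source holds on its own.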
Data Synchronization is a new approach from Innovative Systems, Inc. for effective, accurate, enterprise-wide customer data integration. It reconciles the different customer views that exist across the enterprise’s various application systems and business units. Data Synchronization is the combination of technology, software, process and services required to achieve a synchronized view of the different, purpose-driven customer profiles across the enterprise. It allows customer data to be accurately managed both within, and across, business groups. This session provided an overview of Data Synchronization, including critical success factors and a customer case study.
Surviving and Thriving using Data Modeling Standards & Procedures
Marcie Barkin Goodwin
President/CEO
Axis Software Designs
Standards and procedures are those seemingly nasty things that everyone knows they should have, but don’t want to admit to. Or they do have but don’t use. They cause universal grimaces and moans when someone is faced with the writing, implementing and enforcing of these vitally important (though unpopular) bastions of development.
There is, however, such an enormous advantage to using standards & procedures that the issue is not whether they add value, but how an organization can most efficiently and effectively realize their return on investment.
XML Tools: XQUERY
Denise Draper
Chief Software Architect
Nimble Technology
Alex Cheng
Director of Engineering
Ipedo
XQuery is the new query language being designed by the W3C to query XML data. This talk introduced the main XQuery language features, in particular comparing them to SQL and existing XML access methods such as XPath.
XML has received a lot of momentum as a language for data interchange and integration, and support for XQuery has already been announced by several large vendors. XQuery will soon be one of the standard tools available for application development and integration.
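To make the comparison with SQL and XPath concrete, the sketch below runs a path-style query using Python's standard library and shows, in comments only, roughly how the same request might read in XQuery and SQL. The catalog document and element names are hypothetical, the XQuery and SQL fragments are not executed, and ElementTree supports only a subset of XPath.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample document.
DOC = """<catalog>
  <book><title>Data Modeling</title><price>45</price></book>
  <book><title>XML Basics</title><price>20</price></book>
</catalog>"""

root = ET.fromstring(DOC)

# XPath-style navigation: select every title under the catalog.
titles = [t.text for t in root.findall("./book/title")]

# A filtered query. For comparison only (not executed here):
#   XQuery: for $b in /catalog/book where $b/price > 30 return $b/title
#   SQL:    SELECT title FROM book WHERE price > 30
# ElementTree's XPath subset has no numeric comparison, so the filter
# is applied in Python after the path selection.
expensive = [
    b.findtext("title")
    for b in root.findall("./book")
    if float(b.findtext("price")) > 30
]
```

The XQuery form combines iteration, filtering and result construction in one expression, which is the main point of contrast with plain path selection.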
EAI Aftermath - What Next?
Sheila Jeffrey
Vice President
Wachovia
This presentation provided a framework for identifying key themes for assessing data integration options and tools:
- Understand the problem in context - what type of data/information is targeted,
what are the environmental characteristics that constrain the solution, and
what type of business function will benefit.
- The application of common reference data/values (encapsulating standard business
rules) for transformation, analysis (data mart domains, report categories),
and dynamic integration (EAI - Enterprise Application Integration) can accomplish
virtual data integration.
- Data/information problems can be understood more effectively in the process
context, so it is beneficial to incorporate a high level understanding of the
business activities. This approach will reveal that EAI and ETL are complementary,
not competitive.
- Models - especially an Enterprise Data Model, but also a high level conceptual
functional placement model - are powerful analytic tools that can jump-start
solution identification and design, even if they are not completely populated
or deployed.
- There are no silver bullets - some challenge points for XML approaches and
Web services were discussed.
Achieving Semantic Interoperability in Near-Time Transactional Environments
Chito Jovellanos
President & CEO
forward look, inc.
This presentation compared three techniques for mitigating the effort and costs
of reconciling transactions between senders and receivers:
- standards-based data mapping
- schema mediation, and
- “semantic signaling”.
These techniques support the goal of “straight-through-processing” by enabling the data in the sender’s transaction to be unambiguously understood by the receiver’s system(s). Semantic signaling is a new technique based on information theory, and uses multi-variate statistics for quantifying relative variances in content and meaning in transactions exchanged between business counter-parties. Case studies were presented from the securities industry based on the processing of corporate actions and trades.
The Process Potential of Temporal Data Structures
Henry Feinman
Principal
HJF Information Solutions
Difficulties encountered in developing structures that incorporate time have stunted the growth of the data-centric approach to modeling business process. However, temporal enablement has a foothold in data warehousing. Methods and techniques used here can be extended to operational data models, freeing these structures to define process and state transition.
Business process is the method by which the organization attempts to manage state transitions for the benefit of itself and its customers. A definition of required and desired state transitions, or business rules, can be created in procedural code, or within the data structure, though it is usually done in procedural code. Defining business process in procedural code locks process change to IT systems change.
There are many advantages to moving state transition definition to data (ease of modification, flexibility, agility), but the difficulties encountered in developing structures that incorporate time have prevented widespread exploitation of the data-centric approach.
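The contrast between process defined in procedural code and process defined in data can be illustrated with a small, time-aware transition table. This is a minimal sketch under assumed names: the states, events, and effective dates are hypothetical, and a real implementation would hold the rows in a database rather than a Python list.

```python
from datetime import date

# Each row: (from_state, event, to_state, valid_from, valid_to).
# valid_to is exclusive; None means the rule is still in effect.
TRANSITIONS = [
    ("submitted", "approve", "approved", date(2002, 1, 1), date(2002, 7, 1)),
    ("submitted", "approve", "in_review", date(2002, 7, 1), None),
    ("in_review", "approve", "approved", date(2002, 7, 1), None),
]

def next_state(state, event, as_of):
    # Look up the transition rule that was in effect on the given date.
    for frm, ev, to, start, end in TRANSITIONS:
        in_effect = start <= as_of and (end is None or as_of < end)
        if frm == state and ev == event and in_effect:
            return to
    raise ValueError(f"no transition for {state!r} on {event!r} as of {as_of}")

old_rule = next_state("submitted", "approve", date(2002, 3, 1))
new_rule = next_state("submitted", "approve", date(2002, 11, 4))
```

Introducing the extra review step required only new rows with effective dates, not a code change, and the earlier rule remains available for interpreting historical transactions.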
Managing Schema Chaos for XML
Dan Chang
Research Staff
IBM
Lucian Popa
Research Staff
IBM
Information integration and application integration are among the most critical challenges facing corporate IT staffs. XML is a promising technology for delivering the needed solution. However, without proper schema management, XML will only create a different information chaos: the schema chaos. This presentation discussed a novel solution framework for managing schema chaos for XML. The framework currently consists of an XML Schema catalog and an XML Schema mapper. It is being extended to include the following additional components: XML Schema analyzer, XML Schema integrator and XML Schema evolution manager.
XML in Unstructured Data Environments
Robert Ainsbury
General Manager US Operations
Xyleme
The features of XML that allow it to make “unstructured” data environments more manageable are one of its greatest assets. Exciting new applications in the areas of content management, document publishing, multimedia, search and intellectual asset management will all be facilitated by the power of embedded, XML-enabled, computer-understandable “meaning.” Essentially, it will turn unstructured data into near-structured data. Order out of chaos.
The News and Press industries have been among the fastest to adopt XML for this purpose, with widely accepted standards (such as NewsML) and utilization by virtually all the leading organizations worldwide. Many of the challenges and opportunities pioneered in news and publishing are now surfacing in other industries. Using this industry as an example, this lively session offered no-nonsense insights about what to do, and what not to do, when wrestling with large-scale XML adoption.
Real-Time Integration and Analytics
Seth Grimes
Principal Consultant
Alta Plana Corporation
John Ko
Product Marketing Manager
DataMirror
Ron Agresta
Product Manager
DataFlux
This session examined the increasing demand for real-time business monitoring and instantaneous information. A variety of terminology exists -- “Real-time”, “Zero latency”, “Information on Demand”, “Active Warehousing” -- but essentially they all require more immediate timeframes for getting data into and out of analytical systems.
The speakers discussed both the business drivers for this demand, and the technical requirements to make it happen. Real-time database integration of XML data is critical. Companies must be able to capture selected events such as purchase orders or invoicing from any application database and send them in industry standard XML formats across the enterprise and beyond. One speaker referred to this as "XML streaming."
A Common Model for Classification Hierarchies
William Lewis
Senior Technology Specialist
Cambridge Technology Partners
Among the most frequently occurring challenges faced by data modelers are classification hierarchies, or taxonomies, used for grouping and analyzing common business entities such as Customers, Products, Accounts or Organization Units. Widespread technologies implementing such classification hierarchies include OLAP dimensions, data mining clusters, knowledge taxonomies and LDAP directories. With the growing emphasis on both corporate portals and information security, these requirements have taken on increasing urgency.
This presentation offered examples of conventional patterns for modeling these requirements, along with examples of techniques for "flattening" hierarchical structures. A new, highly abstract, yet widely applicable and implementable model for addressing classifications and hierarchies across multiple application domains was then presented. Capabilities of this model to support flexible and dynamic hierarchical classifications of detailed, summary and historical time-series data, by incorporating features of object, multi-dimensional and entity-relationship models, were explained. With the growing emphasis on both corporate portals and information security, requirements for flexible modeling of classification hierarchies have taken on increasing urgency.
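One common way of "flattening" a hierarchy is to turn a parent-child structure into fixed level columns, as many OLAP dimensions expect. The sketch below illustrates that general technique only; it is not the speaker's model, and the sample taxonomy and the padding rule (repeating the lowest member to fill unused levels) are assumptions.

```python
# Hypothetical parent-child (adjacency list) classification hierarchy.
PARENT = {
    "All Products": None,
    "Beverages": "All Products",
    "Coffee": "Beverages",
    "Tea": "Beverages",
    "Snacks": "All Products",
}

def path_to_root(node, parent):
    # Walk up the hierarchy and return the path from the root down to node.
    path = []
    while node is not None:
        path.append(node)
        node = parent[node]
    return list(reversed(path))

def flatten(parent, depth):
    # Produce one row of `depth` level columns per member. Members that sit
    # above the lowest level are padded by repeating their own name
    # (an assumed convention for ragged hierarchies).
    rows = {}
    for node in parent:
        path = path_to_root(node, parent)
        rows[node] = path + [path[-1]] * (depth - len(path))
    return rows

flat = flatten(PARENT, 3)
```

The flattened form is easy to query and aggregate, but the depth is fixed in advance, which is one reason a more abstract model for dynamic hierarchies is attractive.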
Bringing Data Modeling/Data Administration into an Organization: What's Worked...and Not
Abbie Allen
Systems Technical Consultant
Farmland Insurance
Introducing Data Modeling/Data Administration into an organization can be very challenging and frustrating, yet very rewarding and exciting at the same time…when it works! Yet what works for one company may not work as well for another. This presentation examined some of the trials and tribulations experienced when introducing Data Administration into various companies in order to provide some examples of what worked well, what didn’t work so well and some thoughts on why.
In the final analysis, what generally works is:
– Education for Everyone
– Establishing Standards
– Include the Process
– Make it fun
  - Pin the Attribute on the Model
  - Sticky Note stuff
And what doesn’t generally work...
– Selling the IRM concept just to upper management
– A volatile first project
– No accountability for IRM tasks outside of IRM
– Allowing IT to drive technology without input from Business Community
And remember to…
– learn the rules so you know how to break them properly
– design for the future but accommodate the past
– respect another’s creativity & knowledge
– keep it fun
– pick the appropriate pitch