An architecture essay
In this article, I will describe the typical data repositories in the IT landscape of an enterprise and how they come into being. It’s an elaboration of a part of the earlier essay linked below. You need not have read it to get some value from this essay, but if you do, more of the context and picture will become clear, which is good for an architect.
Let’s first look at the primary (functional) data repositories, a litmus test for data duplication, and secondary data stores.
This architectural thinking applies whether the systems are in a private data centre, hybrid private-public, or public cloud. Some architect from the enterprise or IT partner with holistic and long-term ownership has to understand and guide it.
A. Primary or Functional Data Stores
In decades of experience as an application, integration and enterprise architect, I’ve seen enterprises typically going through the evolution laid out below. Of course, some companies may have birthed their IT landscape at Stages 2, 3 or even 4. But it helps our architectural thinking to consider this progression anyway, as it elegantly explains the need for creating copies, combinations and transformations of the core data to meet business needs.
Stage 1 — OOTB Data Stores
Implement or subscribe to any COTS (Commercial Off The Shelf) Sales, CRM, Self-service, or ERP system, and it comes with a native data repository into which it writes new accounts, orders, payments, invoices, etc. The user interface also reads information, which sometimes leads to further writes (e.g., after a search for a service, product, or offer) and sometimes not. Behind the scenes, the database does different work to write new data than to find and supply existing data.
The product or service is usually equipped with a reporting capability to help the business managers know what is happening and make all sorts of decisions.
While the enterprise is small, such an all-in-one system works just fine, especially a modern one that is well architected, with the requisite cohesive and loosely coupled system components.
These native repositories of systems of engagement have the following characteristics:
● High OLTP load
● Highly variable loads over the day, week, month and year
● Medium-term online storage ranging from weeks to months
● Tuned for writing fast
● Sub-second response back to UI
● Original, unaltered and non-derived data
Here is a picture of this simple first-stage situation.
Stage 2 — Adding Operational Data Stores
As the organisation grows (e.g., tens of thousands of users, >10M USD revenue), the OLTP operations increase. The number of read transactions climbs steeply and can be hundreds to thousands of times the number of write transactions. The primary database starts struggling to respond to users with previously persisted data while simultaneously writing new records. The problem is exacerbated by end-of-month, quarterly and annual reporting loads. Everything starts slowing down: storing, reading, reporting.
The architecture pattern that relieves this situation is an Operational Data Store (ODS). We identify the use cases that read data, especially those not closely followed by write operations. The data required for them is replicated in near real time to a separate data store. Only the needed part of the data is taken, and it is stored online only for the duration required by, say, the 90th percentile of use cases. In a typical scenario, this could be 10–50% of the original data fields, with online retention of three months.
The clients, application UI, and business logic layers must be pointed to the ODS. This refactoring investment is necessary, on top of the investment in the ODS storage, network, software, operations, etc. Once the pattern is implemented, the original SoE and the refactored read use cases both perform fast. In many cases, reporting can continue on the original SoE native data repository, as most of its work can run at night when OLTP loads are almost zero.
ODSs should start with the most problematic use cases and just the data needed for them, and extend to additional use cases and data as required. Modular application architectures for the Systems of Engagement and the clients of the ODSs greatly facilitate this optimised progression.
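The field-projection and retention rules of the ODS pattern can be sketched in a few lines. This is a minimal illustration, not a real replication pipeline; the field names, the 90-day window and the in-memory "tables" are all hypothetical stand-ins for what would be a CDC feed into a tuned read store.

```python
from datetime import datetime, timedelta

# Hypothetical subset of fields the read use cases actually need
# (in practice, 10-50% of the source record, per the text above).
ODS_FIELDS = {"order_id", "customer_id", "status", "total", "updated_at"}
ODS_RETENTION = timedelta(days=90)  # ~90th percentile of read use cases

ods = {}  # stand-in for the ODS table, keyed by order_id


def replicate(change_event: dict) -> None:
    """Project a primary-store change event onto the narrower ODS schema."""
    record = {k: v for k, v in change_event.items() if k in ODS_FIELDS}
    ods[record["order_id"]] = record


def prune(now: datetime) -> None:
    """Drop records that have aged out of the online retention window."""
    cutoff = now - ODS_RETENTION
    for key in [k for k, r in ods.items() if r["updated_at"] < cutoff]:
        del ods[key]
```

The point of the sketch is the two decisions the pattern forces: which fields cross into the ODS, and how long they stay online there.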
ODSs have the following characteristics:
● High OLTP load
● Highly variable loads over the day, week, month and year
● Short-term online storage ranging from days to weeks
● Tuned for reading fast
● Sub-second response back to UI
● Part of the original, unaltered and non-derived data of the SoE
Here is a picture of this second-stage landscape.
Stage 3 — Adding Cross-Domain MIS Data Stores
As the enterprise grows further (e.g., millions of customers, >100M USD revenue), two problems emerge. Reports running against the Systems of Engagement struggle to complete overnight or at the end of the month, quarter or year. At the same time, the need emerges to correlate and join data across business domains for coordination and competitive advantage.
The architectural pattern that provides the solution is a separate cross-domain Management Information Systems Database (MIS DB). If it is used mainly for canned reports, it is referred to as a Reporting Data Mart (RDM), although it usually soon becomes a proper MIS DB.
MIS DBs should start, extend and grow strictly with the needs of the business.
MIS DBs have the following characteristics:
● Medium OLAP load
● Not very variable loads
● Mid-term storage ranging from weeks to a year
● Tuned for batch operations
● Seconds to display canned reports, seconds to minutes for ad hoc queries
● Data is stored as measures such as min, max, count, sum, average, standard deviation, variance, median, etc.
Here is a picture of this third-stage landscape.
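The "measures" in the last characteristic above can be illustrated with a small roll-up function. This is a sketch only; a real MIS DB would compute these aggregates in SQL or an ETL tool, and the dimension and measure names here are hypothetical.

```python
import statistics
from collections import defaultdict


def summarise(rows, dimension, measure):
    """Roll raw facts up into the measures an MIS DB stores,
    grouped by a business dimension (e.g., region or month)."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[dimension]].append(row[measure])
    return {
        key: {
            "count": len(values),
            "min": min(values),
            "max": max(values),
            "sum": sum(values),
            "average": statistics.mean(values),
            "median": statistics.median(values),
            "std_dev": statistics.pstdev(values),
        }
        for key, values in groups.items()
    }
```

Storing these derived measures, rather than the raw transactions, is what distinguishes the MIS DB from the SoE and ODS stores of the earlier stages.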
Stage 4 — Adding BI Data Lakes
Maturity comes with cost and complexity, which need to be justified by the benefits. When an enterprise grows large enough (e.g., 10–100s of millions of customers or >1B USD revenue) or works in a highly competitive environment, it needs to analyse its current performance and future possibilities minutely. This is the aim of business intelligence or advanced analytics.
Data is collected into a ‘lake’ by extracting, loading and transforming from as many sources as possible and retained for as long as possible. The variety and quantity are expected to yield insights that can help business tacticians and strategists make sound decisions. It is a combination of fact-based and probabilistic decision-making, and the outcomes depend on the skills of the data engineering and data science teams.
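The extract-load-transform flow described above can be sketched as follows. Note the order: in ELT, records are landed in the lake as-is first and transformed later. The source names and record fields here are hypothetical, and the list standing in for the raw zone would in practice be object storage or a lakehouse table.

```python
import json
from datetime import datetime, timezone


def extract_orders():
    """Hypothetical extractor; a real lake pulls from many sources."""
    return [{"order_id": 1, "total": 42.0}]


def load_raw(lake: list, source: str, records: list) -> None:
    """ELT: land the records untransformed, tagged with provenance."""
    lake.append({
        "source": source,
        "ingested_at": datetime.now(timezone.utc).isoformat(),
        "payload": json.dumps(records),  # retain the original form
    })


def transform(lake: list, source: str):
    """A later pass reads the raw zone and derives a curated view."""
    for entry in lake:
        if entry["source"] == source:
            for record in json.loads(entry["payload"]):
                yield {"order_id": record["order_id"],
                       "total": record["total"]}
```

Keeping the raw payload means new questions can be answered later by writing new transforms, which is the variety-and-retention bet the data lake makes.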
BI should start, extend and grow strictly with the needs of the business.
BI data lakes have the following characteristics:
● High OLAP load
● Medium availability (regular down days for data reorganisation acceptable)
● Not very variable loads
● Long-term storage ranging from months to years
● Tuned for OLAP operations
● Seconds to display canned analyses, seconds to hours for ad hoc queries
● Complex dimensions and linkages within and between domains’ data
The picture below depicts the end state of a large and mature enterprise in this context.
B. Litmus Test for Data Copying
Every copy of data adds to enterprise cost and complexity. Apply the objective method below consistently, and you will know you have minimised data duplication and its effects.
The picture below shows a Litmus Test I developed for this Architecture Decision. Please contact me if you want to know more and use it.
C. Secondary Data Stores
These data stores are used to operate the hardware and software rather than to serve business functions. We can consider them overhead and should try hard to eliminate the need for them.
Secondary data storage reasons
There are three main reasons to store secondary information.
- Problem management — Information that helps to uncover the root cause of issues. This should be temporary, so remember to turn it off after the problem is solved. (If data is being stored for reconciliation between systems, find the root cause of the mistrust and eliminate it.)
- Security — Information that is required for non-repudiation, forensics and threat management.
- Capacity and performance management — Information which helps to optimise infrastructure and scale up or down systems with business loads.
Secondary data types
These are the main types of secondary data.
- Event data — Detailed information about events in the ecosystem, stored with timestamps. It can range from fine-grained (e.g., code execution steps) to coarse (e.g., business process execution).
- Metadata for management — Information about various types of people, systems, processes, and documentation.
Secondary data stores
These are the main types of secondary data stores, holding event data or metadata for problem management, security, or capacity and performance management.
- Service Log Stores — Business and technical services logs.
- Application Log Stores — Business systems, applications, components and sub-component logs.
- Middleware Log Stores — Logs from BPM, Workflow, ESB, API GW, Database, HTTP Server, App Server, Backup & Restore, DR Tools, etc.
- Hardware Log Stores — The logs from the firmware and operating systems of servers, racks, storage, network equipment and their parts.
- Security Log Stores — Logs from various security equipment and software such as firewalls, proxies, reverse proxies, security incident and event management systems, VPN gateways, etc.
Data and the Cloud
Let’s consider some patterns of data repositories on different types of clouds.
Private Data Centre Clouds
We have seen several typical data stores above that come about naturally in private IT infrastructure. The characteristics of the data that are most fitting to be kept in a private data centre are:
- Large stores above petabyte capacity (as it is likely to be less expensive for infrastructure, software and network costs), e.g., MIS, BI, Core Banking, etc.
- Sensitive and regulatory data. E.g., financial information, defence information, private customer information, individuals’ message and interaction records, security keys, government reporting data, etc.
- Data that is processed heavily or accessed a lot by applications which are in the private data centre (this will reduce network and compute costs). E.g., Core function data of air traffic control, banking, manufacturing, ERP systems, BI systems, Systems of Engagement, etc.
Public Clouds
It is still rare that all the IT systems of a large enterprise are born on the public cloud or migrated there. The usual question is which systems and data stores should be migrated to the public cloud. Here are some typical ways public clouds are best used for data stores.
- Data slices for analytics received from private data centres
- Burst data associated with burst application instances that transiently extend private data centre systems
- Backup or replication data for business continuity and disaster recovery
- Native data of cloud-native Systems and Services of Engagement, Record or Insights
Edge Clouds
By nature, edge clouds are distributed and application-specific and, therefore, the smallest cloud type in compute power and storage. Most data streams through the edge cloud to private or public clouds, but there are three typical reasons to hold data on the edge.
- Edge caches — this helps to quickly deliver often-used information or content to end users without having to get it every time from a backend system. It results in a better user experience and reduced network bandwidth costs. The caches can be in memory or disk storage and are refreshed using rules based on various parameters. Examples are streaming video, virtual reality videos, maps, application-specific data and rulesets for health, vehicle management, etc.
- Edge compute data — autonomous and automated actions that are time sensitive and simple can be carried out at the edge using just enough time series field data stored for just enough time. The actions can be configuration and control changes to field devices (e.g., wearables, smart manufacturing equipment, drones, etc.) or human notifications.
- Edge analytics data — these are cases where analytics, artificial intelligence or machine learning needs to be done at the edge for immediacy or to avoid moving large volumes of raw data to central clouds.
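The edge cache in the first bullet can be sketched as a simple TTL cache. This is a minimal illustration assuming time-based refresh is the only eviction rule; real edge caches also weigh entry size, popularity and content type, and the origin callback here is a hypothetical stand-in for the backend system.

```python
import time


class EdgeCache:
    """Minimal TTL cache for edge delivery of often-used content."""

    def __init__(self, ttl_seconds: float, fetch_from_origin):
        self.ttl = ttl_seconds
        self.fetch = fetch_from_origin  # callback to the backend system
        self.entries = {}  # key -> (value, expiry time)

    def get(self, key):
        entry = self.entries.get(key)
        if entry and entry[1] > time.monotonic():
            return entry[0]  # served from the edge, no backend trip
        value = self.fetch(key)  # miss or expired: one trip to the origin
        self.entries[key] = (value, time.monotonic() + self.ttl)
        return value
```

Every hit inside the TTL window is a backend request and a network round trip avoided, which is exactly the user-experience and bandwidth benefit the bullet describes.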
Combinations of private, edge and public clouds that follow the guidelines above for data location have now become commonplace. They make the best use of each cloud type in architectural combinations that deliver functionality quickly, with quality and cost efficiency.
How to manage it all
Ultimately and inevitably, a large enterprise has many data repositories spread over many systems, locations and technologies. It naturally leads to concerns among technology and business leaders regarding the data’s cost, quality and security. A gnawing suspicion soon follows that much of the data is not becoming information that helps the business, customers and partners.
Architecturally, we can optimise the variety of repositories through the thinking in the preceding sections of this article, but some variety will remain. An architectural construct called the Enterprise Data Fabric (EDF) is an elegant way to manage and govern them.
A data fabric is an architectural system that spans all the enterprise’s data sources, flows and repositories across all its environments, providing visibility, intelligence and control over the data so the business can make the best use of it for tactics, strategies and operations.
The benefits of implementing an EDF are:
- Common knowledge base and semantics across business and technology that boosts cooperation and insights
- Visibility across data silos to manage the data lifecycles better and reduce costs
- Self-service data access and querying for faster insights and decisions
- Coverage of popular data storage technologies and hybrid environments of data centres and clouds
- Data quality governance for better customer and business experiences
- Integration insights for faster development and lower time to market
An EDF needs to be designed carefully, and it will take time to roll out across all the data repositories and business and technology domains. Prioritise the most critical areas and create a multi-year plan to maximise coverage.
The illustration below shows the typical data repository landscape for a large modern digital or Web 3.0-type enterprise with multiple business partners and customer types.
Information Technology is essentially ‘data in, data out’, plus automation and analytics. The highest IT costs for an enterprise are data storage and transport, and moving data in and out quickly and reliably is where most IT issues arise. Getting your information architecture right saves tons of money and grief. It economises and, hopefully, boosts the entire business.
So, architect, don’t skimp on thinking about where each type of data should be located. And make it happen.