Capstone Project: Carnegie Mellon University, Masters in Information Systems Management
(concentration in Business Intelligence and Data Analytics)
Data architecture design for a flexible and scalable solution that provides real-time insights into
critical systems' performance at a major US pharmaceutical chain.
Project Goals: Our primary goal was to design an IT architecture that
would allow executives in the office of the CIO of a $136 billion pharmaceutical chain
to monitor and report on the health and performance indicators of critical retail systems (e.g., POS and supply chain systems)
across the United States in real time.
A sample of the indicators to be monitored includes:
(i) Receipt Print Time
(ii) Payment Authorization Time
(iii) Item Scan Time
Design Approach: The architecture design was broken up into 4 logical phases from the source data to final output.
For this project, the visualization phase was out of scope. We researched and identified potential solutions that met client requirements for all in-scope phases (ingestion, storage and transformation). The architectural design options were evaluated using a scorecard for a final recommendation before tools and vendors were selected.
After research, the team proposed 3 architectural options optimized for different strengths.
Peridot: This architectural design was proposed as a lightweight technology stack.
Data is extracted from source systems using traditional batch ETL solutions. The data is loaded into a cold storage data lake for ad hoc queries and basic data analysis. From the data lake, the data is stored in a downstream data warehouse for a 30-day period. Subsets of the transactional data are then stored in data marts for easier and faster access on the dashboard. Pre-defined rules are applied before analytics data is displayed on the dashboard.
The downside of this design is degraded performance in providing "near real-time" analytics because of the number of components. Relying on a traditional batch ETL process also reduces the solution's reliability and fault tolerance.
Amethyst: Our second recommendation uses a messaging queue to ingest data from the source systems. Messaging systems allow for asynchronous communication, decoupling systems and improving fault tolerance. The data is transformed using a stream processing application that can process continuous data input. All data is stored in a data lake for ad hoc queries, advanced analytics and predictive modeling. Some stream processing applications provide advanced modeling. Data is kept in data warehouses for a shorter period for historical analysis. Subsets of the transactional data are then stored in data marts for easier and faster access on the dashboard. Pre-defined rules are applied before analytics data is displayed on the dashboard.
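As a rough illustration of the decoupling that a messaging queue provides, the sketch below uses Python's standard-library queue and threads as a stand-in for a real messaging product. The event shape (store ID, indicator name, milliseconds) and the function names are assumptions made for this example, not part of the project's design.

```python
import queue
import threading

# Hypothetical event shape: (store_id, indicator_name, milliseconds).
SENTINEL = None  # tells the consumer that no more events will arrive


def source_system(out_queue, events):
    """Producer: publishes events and returns without waiting for processing."""
    for event in events:
        out_queue.put(event)
    out_queue.put(SENTINEL)


def stream_processor(in_queue, results):
    """Consumer: reads events at its own pace, keeping the slowest reading per indicator."""
    while True:
        event = in_queue.get()
        if event is SENTINEL:
            break
        _store_id, indicator, millis = event
        results[indicator] = max(results.get(indicator, 0), millis)


def run_pipeline(events):
    """Wire a producer and a consumer together through an in-memory queue."""
    buffer = queue.Queue()
    results = {}
    consumer = threading.Thread(target=stream_processor, args=(buffer, results))
    producer = threading.Thread(target=source_system, args=(buffer, events))
    consumer.start()
    producer.start()
    producer.join()
    consumer.join()
    return results
```

The producer only knows about the queue, not about who reads from it, so either side could be replaced or restarted without changing the other.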
Andensine: Our final recommendation also uses a messaging queue to ingest data from the source systems. The data is also transformed using a stream processing application that can process continuous data input. The stream application is integrated with the analytics dashboard as a downstream system. All data is stored in a cold storage data lake for ad hoc queries, advanced analytics and predictive modeling, and more recent historical data is available in a variety of databases (NoSQL, SQL, etc.) for quicker analysis.
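To give a feel for what a stream processing step might compute before handing results to a dashboard, here is a minimal sketch of a rolling average with a threshold rule. The window size, the threshold, and all names are hypothetical; a production system would use a dedicated stream processing engine rather than this in-memory loop.

```python
from collections import deque


class RollingAverage:
    """Keeps the mean of the most recent `window` readings for one indicator."""

    def __init__(self, window):
        # A bounded deque drops the oldest reading automatically.
        self.readings = deque(maxlen=window)

    def update(self, value):
        self.readings.append(value)
        return sum(self.readings) / len(self.readings)


def flag_slow_readings(values, window, threshold_ms):
    """Yield (rolling_mean, breached) for each incoming reading, one at a time."""
    tracker = RollingAverage(window)
    for value in values:
        mean = tracker.update(value)
        yield mean, mean > threshold_ms
```

Because the function is a generator, it emits a result for every reading as it arrives instead of waiting for a complete batch, which is the property that separates this approach from batch ETL.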
Following these proposals, we evaluated each architectural option against our client's key requirements using a weighted scorecard.
We scored each architectural option based on our knowledge of information systems, our research, and counsel from our client and advisors.
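A weighted scorecard of this kind can be sketched as follows. The criteria names echo those discussed in this write-up, but the weights, the scores, and the option labels are placeholders chosen for illustration; they are not the project's actual scorecard values.

```python
def weighted_score(scores, weights):
    """Return the weighted average of an option's criterion scores."""
    total_weight = sum(weights.values())
    return sum(scores[criterion] * weight for criterion, weight in weights.items()) / total_weight


def rank_options(options, weights):
    """Return option names ordered from highest to lowest weighted score."""
    return sorted(
        options,
        key=lambda name: weighted_score(options[name], weights),
        reverse=True,
    )


# Placeholder numbers for illustration only (assumed 1-5 scoring scale).
weights = {"scalability": 4, "flexibility": 4, "total_cost_of_ownership": 2}
options = {
    "option_x": {"scalability": 5, "flexibility": 4, "total_cost_of_ownership": 2},
    "option_y": {"scalability": 2, "flexibility": 3, "total_cost_of_ownership": 5},
}
```

Calling rank_options(options, weights) orders the options by weighted score, so a heavily weighted criterion can outweigh a poor score on a lightly weighted one.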
The Andensine and Amethyst options, which use messaging queues and streaming applications, scored higher on scalability and flexibility, which were key requirements. Our client wanted to keep the solution flexible for changing business needs, which may bring growth in data volume and processing requirements. Messaging systems also provide location transparency (a source system does not need to know the location or address of the downstream systems that consume its data) and can ingest a variety of data formats (structured and unstructured).
These options both scored low on total cost of ownership. We conducted a cost-analysis based on our best estimates of products available on Microsoft Azure's platform.
The Andensine architecture was our final recommendation to enable real-time insights on the health and performance of critical systems in a retail and e-commerce organization.
Andensine's architecture and components allow our client to grow and take advantage of advanced technology strategies.
Please see capstone poster for more details on this project.