The requirement was to develop a data platform that collects, organizes and processes data, provides insight into business and operational aspects, and enables the development of value-added, data-driven product features and dashboards for customers.
Raw data was stored with no oversight of its contents.
The platform needs defined mechanisms to catalog and secure data. Without these, data cannot be found or trusted, resulting in a “data swamp”.
Solution
We developed a data lake solution with the following features:
Real-time streaming of data from source systems.
Connectors developed for Oracle DB (PeopleSoft, Banner) and Workday.
Data access is controlled through views set up in Apache Hive, which is connected to the data lake.
ETL jobs run in a loop, identify changed files (via Hive) and update the Report Mart, which is hosted on PostgreSQL. Sample ETL scripts and reports were developed for HR Diversity data.
Data changes from source systems are reflected in the reports within two minutes.
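The looping ETL described above can be illustrated with a minimal sketch. It is not the project's actual ETL code: the `LakeFile` record, the `hr_diversity` table and its columns, and the modification-time watermark are all hypothetical, and an in-memory SQLite database stands in for the PostgreSQL Report Mart so that the example runs without external services. In the real solution the list of changed files would come from Hive rather than from a hard-coded list.

```python
import sqlite3
from dataclasses import dataclass


@dataclass(frozen=True)
class LakeFile:
    """Hypothetical description of one file in the data lake."""
    path: str
    modified_at: float  # epoch seconds of the last change
    rows: tuple         # (employee_id, department) pairs; assumed schema


def changed_since(files, watermark):
    """Return the files modified after the watermark, oldest first."""
    changed = [f for f in files if f.modified_at > watermark]
    return sorted(changed, key=lambda f: f.modified_at)


def run_etl_pass(files, mart, watermark):
    """Load rows from changed files into the mart; return the new watermark."""
    for f in changed_since(files, watermark):
        # INSERT OR REPLACE overwrites the row that has the same primary key.
        mart.executemany(
            "INSERT OR REPLACE INTO hr_diversity (employee_id, department) "
            "VALUES (?, ?)",
            f.rows,
        )
        watermark = max(watermark, f.modified_at)
    mart.commit()
    return watermark


# SQLite stands in for the PostgreSQL Report Mart (assumption for this sketch).
mart = sqlite3.connect(":memory:")
mart.execute(
    "CREATE TABLE hr_diversity (employee_id INTEGER PRIMARY KEY, department TEXT)"
)

files = [
    LakeFile("hr/part-0001", 100.0, ((1, "Finance"), (2, "IT"))),
    LakeFile("hr/part-0002", 200.0, ((2, "Research"),)),
]

# One pass of the loop; a scheduler would repeat this call at a fixed interval.
watermark = run_etl_pass(files, mart, watermark=0.0)
rows = mart.execute(
    "SELECT employee_id, department FROM hr_diversity ORDER BY employee_id"
).fetchall()
```

Keeping a watermark means that a pass with no new changes does no work, which is one way a short polling interval could stay cheap enough to meet a latency target such as the two minutes mentioned above.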
Outcome
The data lake is easier and quicker to populate because data is loaded without transformation
Can ingest any volume of data arriving in real time
Allows organizations to generate different types of insights, including reports on historical data
Ability to store all types of structured and unstructured data
Elimination of data silos
Democratized access to data via a single, unified view of data