Big Data Project Architecture and Evolution
In the realm of big data, crafting efficient and scalable data solutions requires a deep understanding of various technologies and their evolution. This blog delves into a specific data migration and transformation project, providing insights into its architecture, the shift from Hive to Spark, and the critical aspects of Spark job execution and monitoring.
Data Migration and Transformation Project
Project Overview
In a recent project, I worked on migrating and transforming data from a DB2 database to Hadoop/Hive. The primary goal was to extract data, transform it, and store it in Hive for reporting purposes. This write-up is aimed at a diverse audience: freshers and recent graduates, industry professionals with 3–4 years of experience, and seasoned professionals who struggle to articulate their big data projects.
Data Source and Reporting Requirements
The source of our data was a DB2 database, an RDBMS known for handling structured data. The client required a monthly report that answered critical business questions:
- How many credit cards were offered in the last month?
- How many offers were announced?
- Of those announced, how many were used for purchases?
The report needed to be generated by the 5th of every month, necessitating a well-planned ETL (Extract, Transform, Load) process.
Identifying and Querying Source Tables
The data required for the report was spread across multiple tables:
- Customer information
- Credit card details
- Offer details
- Product data
The task involved querying these disparate tables to aggregate the necessary information.
Data Migration and Transformation
Rather than performing transformations within the DB2 environment, we used Sqoop to migrate the data into Hadoop. It was then loaded into Hive tables and queried to produce the required report, with the whole process automated and scheduled to ensure timely report generation.
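To make the reporting step concrete, here is a minimal sketch of the kind of aggregation that ran against the Hive tables. The table and column names (offers, credit_cards, transactions, and so on) are hypothetical placeholders, and the query is shown through PySpark's spark.sql purely for readability; in the original project this was plain HiveQL, whose syntax is essentially the same.

```python
from pyspark.sql import SparkSession

# Hypothetical schema -- the real tables and columns differed.
spark = (
    SparkSession.builder
    .appName("monthly_card_offer_report")
    .enableHiveSupport()  # read the Sqoop-loaded Hive tables
    .getOrCreate()
)

report = spark.sql("""
    SELECT o.offer_month,
           COUNT(DISTINCT c.card_id)  AS cards_offered,
           COUNT(DISTINCT o.offer_id) AS offers_announced,
           COUNT(DISTINCT CASE WHEN t.txn_id IS NOT NULL
                               THEN o.offer_id END) AS offers_used_for_purchases
    FROM offers o
    JOIN credit_cards c ON c.card_id = o.card_id
    LEFT JOIN transactions t ON t.offer_id = o.offer_id
    WHERE o.offer_month = '2016-05'
    GROUP BY o.offer_month
""")

# Persist the month's result so the reporting layer can pick it up.
report.write.mode("overwrite").saveAsTable("reports.monthly_card_offers")
```

Writing the result back to a Hive table keeps the report queryable by downstream tools without another extraction step.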
Evolution from Hive to Spark
The Transition
Around 2015–2016, Hive was the cornerstone of big data processing. It allowed SQL developers to interact with Hadoop ecosystems without extensive Hadoop knowledge. However, as data processing needs evolved, Spark emerged as a superior tool for transformation tasks.
By mid-2016, Spark began to be used alongside Hive, improving performance and handling complex transformations more efficiently. The transition was driven by Spark's maturing APIs (RDDs, then DataFrames and Datasets), which offered in-memory execution and far better performance than Hive's traditional MapReduce-based approach.
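As a rough, hypothetical illustration of why that shift mattered (not the project's actual code), the same filter-and-count can be written against an RDD or a DataFrame. The DataFrame version expresses the logic declaratively, which lets Spark's Catalyst optimizer plan the execution; Datasets add compile-time typing on top of this in Scala and Java, so they have no separate Python form.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("rdd_vs_dataframe").getOrCreate()

# Same logic, two APIs: count offers that led to a non-zero purchase.
rows = [("OFF-1", 120.0), ("OFF-2", 0.0), ("OFF-3", 75.5)]

# RDD API: opaque functions Spark cannot inspect or optimize.
rdd = spark.sparkContext.parallelize(rows)
rdd_count = rdd.filter(lambda r: r[1] > 0).count()

# DataFrame API: declarative expressions the Catalyst optimizer can rewrite.
df = spark.createDataFrame(rows, ["offer_id", "purchase_amount"])
df_count = df.filter(df.purchase_amount > 0).count()

print(rdd_count, df_count)  # both print 2
```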
Hive vs. Spark
While Hive remained essential for storing and querying data, Spark’s ability to perform transformations more efficiently led to its adoption for ETL processes. Spark’s evolution, from RDDs to DataFrames and Datasets, showcased significant performance improvements, solidifying its role in modern big data projects.
Spark’s Integration with Hive
In practice, even with Spark doing the transformations, jobs typically read from Hive tables rather than querying databases like DB2 directly. This keeps the heavy processing inside the big data ecosystem and avoids the cost and load of direct connections to the operational database.
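A minimal sketch of that choice, with placeholder table names and connection details: the same DataFrame could come from a direct JDBC read against DB2, but in this architecture it normally comes from the Hive tables Sqoop has already populated.

```python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("read_from_hive_not_db2")
    .enableHiveSupport()
    .getOrCreate()
)

# Typical pattern: read the data Sqoop already landed in Hive.
offers = spark.table("staging.offers")  # placeholder table name

# Possible but usually avoided: querying DB2 directly over JDBC,
# which puts extra load on the operational database (URL, schema,
# and credentials below are placeholders).
# offers = (
#     spark.read.format("jdbc")
#     .option("url", "jdbc:db2://db2-host:50000/SALESDB")
#     .option("dbtable", "APP.OFFERS")
#     .option("driver", "com.ibm.db2.jcc.DB2Driver")
#     .option("user", "report_user")
#     .option("password", "***")
#     .load()
# )
```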
Spark Job Execution and Monitoring
Deployment and Scheduling
Deploying a Spark job involves several steps:
- Job Execution: A Spark job (usually packaged as a JAR file) is executed via a shell script invoked by a scheduler (e.g., Autosys, Airflow). The output is stored in a Hive table.
- Scheduling: Jobs are scheduled based on customer requirements, which may be weekly or monthly. Popular schedulers include Autosys, ActiveBatch, Control-M, and Airflow.
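As a minimal scheduling sketch using Airflow (one of the schedulers listed above), a monthly DAG can call spark-submit on the packaged JAR so the output lands in Hive before the reporting deadline. The DAG id, cron expression, JAR path, and main class are all hypothetical, and the exact operator import path and schedule parameter vary slightly across Airflow versions.

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator  # module path differs in older Airflow versions

# Run early on the 5th of every month, matching the reporting deadline.
with DAG(
    dag_id="monthly_card_offer_report",
    start_date=datetime(2024, 1, 1),
    schedule_interval="0 2 5 * *",
    catchup=False,
) as dag:

    run_spark_job = BashOperator(
        task_id="run_spark_job",
        # JAR path, main class, and cluster settings are placeholders.
        bash_command=(
            "spark-submit "
            "--master yarn --deploy-mode cluster "
            "--class com.example.reports.MonthlyCardOfferReport "
            "/opt/jobs/monthly-card-offer-report.jar"
        ),
    )
```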
Monitoring and Verification
After the job executes, the status can be monitored through the scheduler’s UI or logs. It’s crucial to verify that the data in the Hive table matches expectations. Any discrepancies, such as null values, should be investigated to determine whether they originate from the source data or the migration process.
Dealing with Null Values
Checking for null values is an essential part of data validation. If nulls are already present in the source (DB2), that needs to be communicated to the downstream team. Conversely, if nulls appear in the Hive data but not in the source, the issue most likely lies in the migration process.
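A minimal sketch of that check (the table name is a placeholder): profile the migrated Hive table for nulls per column, then compare the counts against the same profile taken on the DB2 side to decide where the nulls were introduced.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = (
    SparkSession.builder
    .appName("null_check_hive")
    .enableHiveSupport()
    .getOrCreate()
)

df = spark.table("staging.offers")  # placeholder table name

# One output row holding the null count for every column.
null_counts = df.select(
    [F.count(F.when(F.col(c).isNull(), c)).alias(c) for c in df.columns]
)
null_counts.show(truncate=False)

# A column that is null-free in DB2 but has nulls here points to the
# migration/transformation step; nulls present in both point to the source.
```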
Final Thoughts
Understanding the evolution of big data technologies and the intricacies of job execution and monitoring is crucial for effective data engineering. This blog highlights the importance of adapting to technological advancements, ensuring data integrity, and effectively communicating project details.