Solution-oriented software engineer and quick learner. Experienced in big data engineering using cutting-edge technologies in the Hadoop ecosystem.
Reduced the execution time of an existing Spark batch job to 20% of its original runtime.
Single-handedly automated a data pipeline using Apache Airflow and Bash scripts.
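A minimal sketch of the kind of Airflow DAG this describes; the DAG id, schedule, and script paths are hypothetical placeholders, not the actual pipeline:

```python
# Minimal Airflow DAG orchestrating Bash steps (names/paths hypothetical).
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="daily_batch_pipeline",      # hypothetical name
    start_date=datetime(2021, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    extract = BashOperator(
        task_id="extract",
        # Trailing space stops Jinja from treating the .sh path as a template file.
        bash_command="bash /opt/scripts/extract.sh ",   # hypothetical path
    )
    transform = BashOperator(
        task_id="transform",
        bash_command="bash /opt/scripts/transform.sh ",  # hypothetical path
    )
    extract >> transform  # run extract before transform
```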
Implemented complex batch-processing logic in Apache Spark SQL, including performance optimization and resolution of memory issues.
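A hedged sketch of this Spark SQL batch style; paths, table, and column names are hypothetical, not the production schema:

```python
# Spark SQL batch-logic sketch (all identifiers hypothetical).
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("batch-logic-sketch").getOrCreate()

# Expose the input as a view so the batch logic can be written in SQL
# and optimized by Catalyst.
spark.read.orc("s3://bucket/input/").createOrReplaceTempView("trades")

result = spark.sql("""
    SELECT trade_date,
           participant_id,
           SUM(quantity * price) AS turnover
    FROM trades
    GROUP BY trade_date, participant_id
""")

# Repartitioning by the write key keeps shuffle output balanced and
# reduces memory pressure on large aggregations.
result.repartition("trade_date") \
      .write.mode("overwrite") \
      .partitionBy("trade_date") \
      .orc("s3://bucket/output/")
```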
Knowledge of HDFS-optimized file formats such as ORC and Avro.
Knowledge of AWS EMR and S3, with experience using the AWS console to monitor EMR and EC2 instances and an understanding of their configuration files.
Configured a 3-node Hadoop cluster along with Hive (using MySQL as the metastore) and Spark.
Configured high availability (HA) for the YARN ResourceManager using ZooKeeper.
Played a major role in building the codebase, transformations, and quality checks for maintaining a data lake on S3.
Built a transformer job in PySpark that performs aggregations and stores the results as tables in RDS.
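A minimal sketch of the aggregate-and-write-to-RDS pattern; the JDBC URL, credentials, and table/column names are hypothetical placeholders:

```python
# Aggregate in PySpark and write to RDS over JDBC (identifiers hypothetical).
# The MySQL JDBC driver jar must be on the Spark classpath.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("transformer").getOrCreate()

events = spark.read.orc("s3://bucket/events/")  # hypothetical input path

summary = (events
           .groupBy("participant_id")
           .agg(F.sum("amount").alias("total_amount"),
                F.count("*").alias("event_count")))

# Persist the aggregated result as a table in RDS (MySQL).
(summary.write
        .format("jdbc")
        .option("url", "jdbc:mysql://rds-host:3306/reports")  # hypothetical
        .option("dbtable", "participant_summary")
        .option("user", "app_user")       # supply real credentials securely
        .option("password", "***")
        .mode("overwrite")
        .save())
```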
Developed a PySpark job to extract and transform features required by the data science team.
Actively participated in implementing a backend server in Python Flask that interacts with the dashboard via REST APIs.
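A minimal sketch of the kind of Flask REST endpoint a dashboard would call; the route and payload shape are hypothetical:

```python
# Flask REST endpoint sketch (route and payload hypothetical).
from flask import Flask, jsonify

app = Flask(__name__)

@app.route("/api/metrics", methods=["GET"])
def metrics():
    # The real service would query a backing store here;
    # a fixed payload is returned purely for illustration.
    return jsonify({"jobs_completed": 42, "status": "ok"})

if __name__ == "__main__":
    app.run(port=5000)
```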
Explored services offered by cloud providers such as AWS and GCP for notifying resource update events in real time.
Implemented multiple POCs for AWS resources involving the CloudTrail, SNS, SQS, and Lambda services to make the solution cost-efficient, reliable, and real-time.
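A hedged sketch of the fan-out pattern behind such a POC (CloudTrail → SNS → SQS → Lambda). It assumes the standard SQS record shape with an SNS-wrapped message body (i.e., raw message delivery disabled); the notification's inner fields are not assumed:

```python
# Lambda handler for SQS-delivered SNS notifications (sketch, not the actual POC).
import json

def lambda_handler(event, context):
    for record in event["Records"]:            # one entry per SQS message
        envelope = json.loads(record["body"])  # SNS envelope delivered via SQS
        message = json.loads(envelope["Message"])  # the notification payload
        # React to the resource-update notification in near real time;
        # here we just log what arrived.
        print("received notification with keys:", sorted(message.keys()))
    return {"processed": len(event["Records"])}
```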
Formulated Spark batch jobs for processing stock exchange data at end of day (EOD) to calculate incentives per market participant; the generated output was further used in report generation.
Optimized the Spark jobs to use memory efficiently and minimize processing time, which involved careful partitioning, complex joins, and handling skewed data.
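A hedged sketch of one common skew-mitigation technique (key salting) of the kind such optimizations involve; paths and column names are hypothetical:

```python
# Salted join to spread a skewed key across partitions (identifiers hypothetical).
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("skew-salting").getOrCreate()
SALT_BUCKETS = 16  # tune to the observed skew

big = spark.read.orc("s3://bucket/big/")      # skewed fact table
small = spark.read.orc("s3://bucket/small/")  # dimension table

# Spread each hot key across SALT_BUCKETS synthetic sub-keys on the big side...
big_salted = big.withColumn("salt", (F.rand() * SALT_BUCKETS).cast("int"))

# ...and replicate every small-side row once per bucket so joins still match.
salts = spark.range(SALT_BUCKETS).withColumnRenamed("id", "salt")
small_salted = small.crossJoin(salts)

joined = (big_salted
          .join(small_salted, on=["join_key", "salt"])
          .drop("salt"))
```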
Utilized S3 for storing data in ORC format and EMR for executing the Spark jobs.
Actively participated in all stages of the project, from initial requirement gathering through to production deployment.