Skillset required for this project: Python, AWS, Starburst, Neo4j
Roles and Responsibilities
- Data Pipeline Development
- Design, build, and maintain scalable data pipelines in AWS Glue, written in Python, to ingest, transform, and load data from source S3 buckets into the S3-based data lake.
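A minimal sketch of such a Glue job follows; the bucket paths and column mappings are placeholders, not the project's actual values.

```python
# Minimal AWS Glue job sketch: read raw JSON from a landing S3 bucket,
# normalize column names/types, and write Parquet into the data-lake bucket.
# Bucket names and field names are placeholders for illustration only.
import sys
from awsglue.transforms import ApplyMapping
from awsglue.utils import getResolvedOptions
from awsglue.context import GlueContext
from awsglue.job import Job
from pyspark.context import SparkContext

args = getResolvedOptions(sys.argv, ["JOB_NAME"])
glue_context = GlueContext(SparkContext.getOrCreate())
job = Job(glue_context)
job.init(args["JOB_NAME"], args)

# Ingest: read raw files from the source S3 bucket
raw = glue_context.create_dynamic_frame.from_options(
    connection_type="s3",
    connection_options={"paths": ["s3://example-raw-bucket/clinical/"]},
    format="json",
)

# Transform: rename and retype columns
mapped = ApplyMapping.apply(
    frame=raw,
    mappings=[
        ("patient_id", "string", "patient_id", "string"),
        ("visit_date", "string", "visit_date", "date"),
    ],
)

# Load: write Parquet into the data-lake S3 bucket
glue_context.write_dynamic_frame.from_options(
    frame=mapped,
    connection_type="s3",
    connection_options={"path": "s3://example-lake-bucket/clinical/"},
    format="parquet",
)
job.commit()
```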
- Data Integration
- Integrate data from multiple sources and systems (AWS S3 buckets) to enable a unified, comprehensive view of clinical data.
- Implement ETL (Extract, Transform, Load) processes to prepare raw data for analysis.
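To illustrate the integration step, the hedged sketch below joins two assumed S3 datasets (demographics and lab results sharing a patient_id key, both hypothetical) into one unified view using Glue's Join transform.

```python
# Hedged sketch: join two raw S3 datasets on a shared key to build a
# unified clinical view. Dataset paths and the join key are assumptions.
from awsglue.transforms import Join
from awsglue.context import GlueContext
from pyspark.context import SparkContext

glue_context = GlueContext(SparkContext.getOrCreate())

demographics = glue_context.create_dynamic_frame.from_options(
    connection_type="s3",
    connection_options={"paths": ["s3://example-raw-bucket/demographics/"]},
    format="json",
)
lab_results = glue_context.create_dynamic_frame.from_options(
    connection_type="s3",
    connection_options={"paths": ["s3://example-raw-bucket/lab_results/"]},
    format="json",
)

# Join on the shared patient identifier to produce one comprehensive view
unified = Join.apply(
    frame1=demographics, frame2=lab_results,
    keys1=["patient_id"], keys2=["patient_id"],
)
```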
- Database Management
- Manage and optimize databases and query engines (e.g., Starburst) for performance, scalability, and reliability.
- Implement database schemas and partitioning or indexing strategies to support efficient data querying and retrieval.
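Below is a minimal sketch of schema management against Starburst using the trino Python client; the host, catalog, schema, and table are assumptions. Since Starburst (Trino) does not use traditional indexes, the sketch relies on partitioning for efficient retrieval.

```python
# Hedged sketch: create a schema and a partitioned, S3-backed table through
# Starburst's Trino interface. Connection details and names are placeholders.
import trino

conn = trino.dbapi.connect(
    host="starburst.example.internal",  # placeholder coordinator host
    port=8080,
    user="etl_service",
    catalog="hive",
    schema="clinical",
)
cur = conn.cursor()

cur.execute("CREATE SCHEMA IF NOT EXISTS hive.clinical")

# Partitioning on a frequently filtered column (here, visit_date) plays the
# role an index would in a conventional database.
cur.execute("""
    CREATE TABLE IF NOT EXISTS hive.clinical.lab_results (
        patient_id  varchar,
        test_code   varchar,
        result      double,
        visit_date  date
    )
    WITH (format = 'PARQUET', partitioned_by = ARRAY['visit_date'])
""")
```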
- Data Governance and Security
- Implement data governance policies and procedures to ensure data security, compliance, and privacy, following the AZ GxP process.
- Establish access controls, encryption, and auditing mechanisms to protect sensitive data.
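As an illustration of the access-control and encryption responsibilities, the boto3 sketch below enforces default KMS encryption and blocks public access on a placeholder data-lake bucket; the AZ GxP process steps themselves are procedural and not shown.

```python
# Hedged sketch: harden the data-lake bucket with boto3.
# The bucket name and KMS key ARN are placeholders.
import boto3

s3 = boto3.client("s3")

# Enforce server-side encryption with a customer-managed KMS key by default
s3.put_bucket_encryption(
    Bucket="example-lake-bucket",
    ServerSideEncryptionConfiguration={
        "Rules": [{
            "ApplyServerSideEncryptionByDefault": {
                "SSEAlgorithm": "aws:kms",
                "KMSMasterKeyID": "arn:aws:kms:us-east-1:123456789012:key/example",
            }
        }]
    },
)

# Block all public access so only explicitly granted principals can read
s3.put_public_access_block(
    Bucket="example-lake-bucket",
    PublicAccessBlockConfiguration={
        "BlockPublicAcls": True,
        "IgnorePublicAcls": True,
        "BlockPublicPolicy": True,
        "RestrictPublicBuckets": True,
    },
)
```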
- Monitoring and Optimization
- Monitor data pipelines and systems for performance issues, bottlenecks, and anomalies (see the alarm sketch below).
- Implement optimizations to improve data processing efficiency and reduce latency.
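A hedged monitoring sketch: a CloudWatch alarm on failed Glue tasks, created with boto3. The job name and SNS topic are placeholders; the metric name and dimensions follow AWS's documented Glue job metrics.

```python
# Hedged sketch: alert on any failed Glue tasks via a CloudWatch alarm.
# Job name and SNS topic ARN below are illustrative placeholders.
import boto3

cloudwatch = boto3.client("cloudwatch")

cloudwatch.put_metric_alarm(
    AlarmName="clinical-pipeline-failed-tasks",
    Namespace="Glue",
    MetricName="glue.driver.aggregate.numFailedTasks",
    Dimensions=[
        {"Name": "JobName", "Value": "clinical-etl-job"},  # placeholder job
        {"Name": "JobRunId", "Value": "ALL"},
        {"Name": "Type", "Value": "count"},
    ],
    Statistic="Sum",
    Period=300,               # evaluate in 5-minute windows
    EvaluationPeriods=1,
    Threshold=0,
    ComparisonOperator="GreaterThanThreshold",
    AlarmActions=["arn:aws:sns:us-east-1:123456789012:pipeline-alerts"],
)
```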
- Collaboration and Documentation
- Collaborate with data scientists, analysts, and other stakeholders to understand data requirements and support analytics initiatives.
- Document data engineering processes, workflows, and architectures for knowledge sharing and future reference.
- Collaborate with testers during testing of the developed pipeline and, once testing passes, promote it to higher environments.
- Deployment and Jira Test Execution
- Deploy code from the master branch of the GitHub repository to higher environments such as SIT, PPT, and PROD.
- While deploying, capture evidence and execute the deployment scripts, recording each step and its corresponding evidence in Jira.
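The sketch below shows one way the promotion step could look in boto3: upload the script checked out from master to the target environment's artifact bucket and repoint that environment's Glue job. Bucket and job names are placeholders, and the Jira evidence capture is a manual process step not shown here.

```python
# Hedged sketch: promote a pipeline script to a higher environment.
# Environment names mirror the SIT/PPT/PROD stages described above;
# buckets, job names, and the script path are placeholders.
import boto3

ENV_BUCKETS = {
    "sit": "example-artifacts-sit",
    "ppt": "example-artifacts-ppt",
    "prod": "example-artifacts-prod",
}

def deploy(env: str, script_path: str = "jobs/clinical_etl.py") -> None:
    bucket = ENV_BUCKETS[env]
    key = f"glue-scripts/{script_path.rsplit('/', 1)[-1]}"

    # Upload the script checked out from the master branch
    boto3.client("s3").upload_file(script_path, bucket, key)

    # Point the environment's Glue job at the new script location
    glue = boto3.client("glue")
    job = glue.get_job(JobName=f"clinical-etl-{env}")["Job"]
    command = dict(job["Command"], ScriptLocation=f"s3://{bucket}/{key}")
    glue.update_job(
        JobName=f"clinical-etl-{env}",
        JobUpdate={"Role": job["Role"], "Command": command},
    )

deploy("sit")
```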
- Automation
- Automate pipeline triggering using AWS resources so the code runs without manual intervention.
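A minimal scheduling sketch, assuming a Glue scheduled trigger is the AWS resource used; the trigger name, job name, and cron expression are illustrative.

```python
# Hedged sketch: a Glue trigger that starts the job on a cron schedule,
# removing the need for any manual kick-off. Names are placeholders.
import boto3

glue = boto3.client("glue")

glue.create_trigger(
    Name="clinical-etl-nightly",
    Type="SCHEDULED",
    Schedule="cron(0 2 * * ? *)",               # every day at 02:00 UTC
    Actions=[{"JobName": "clinical-etl-job"}],  # placeholder job name
    StartOnCreation=True,                       # activate immediately
)
```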