As a Site Reliability Engineer, you will play a critical role in ensuring the reliability, availability, and performance of our infrastructure, applications, and hardware and software systems. You will collaborate with cross-functional teams to design, build, maintain, and monitor robust systems that can withstand the demands of high-traffic, mission-critical environments. You will also travel around the country for on-site system setup and maintenance. As a critical thinker, your situational awareness and flexibility in dealing with system issues will be key to our success.
Responsibilities
- Implement and maintain best practices for system reliability, availability, and scalability while minimizing downtime and disruptions for both software and hardware systems.
- Develop and enhance automation tools and scripts for system monitoring, deployment, and recovery to streamline operational processes.
- Identify and resolve performance bottlenecks, proactively optimizing system components to ensure optimal response times and resource utilization.
- Participate in on-call rotations to respond to and resolve system incidents promptly and efficiently, ensuring minimal impact on end users. Site visits and on-site debugging will be required.
- Use Infrastructure as Code tools to manage and version infrastructure, making it more predictable and reproducible.
- Set up and maintain robust monitoring, alerting, and logging systems to detect and mitigate issues before they impact the user experience.
- Analyze system performance trends and collaborate with other teams to plan for capacity requirements and scaling as needed.
- Implement security best practices and participate in vulnerability assessments to protect systems and data from threats.
- Maintain comprehensive documentation for systems, configurations, and procedures to support knowledge sharing and smooth handovers within the team.
Requirements
- Bachelor's degree in a technical or scientific field such as Software Engineering, Computer Science, Electrical Engineering, or IT preferred.
- Minimum of 4 years of proven experience as a Site Reliability Engineer or in a similar role.
- Proficiency in scripting and automation with languages such as Python and Bash.
- Familiarity with cloud platforms (e.g., AWS, Azure, GCP).
- Strong knowledge of containerization and orchestration technologies (e.g., Docker, Kubernetes).
- Experience with Infrastructure as Code tools (e.g., Terraform, Ansible).
- Solid understanding of monitoring tools and practices.
- Knowledge of security best practices and incident response.
- Experience with and knowledge of IoT (e.g., sensors, Raspberry Pi, device management).
- Experience or interest in electrical circuit design, PCB layout, and soldering preferred.
- Excellent problem-solving and communication skills.
- Ability to work effectively in a collaborative team environment.
- Experience with and knowledge of back-end microservices is a plus.
- Strong analytical skills.
- Comfortable with conversational English.