Key Responsibilities:
•            Design and implement resilient system architectures that support high availability and scalability.
•            Develop automation tools and scripts to enhance operational efficiency and reduce manual effort.
•            Define, track, and analyze SLOs and SLIs to ensure reliability and performance meet business needs.
•            Conduct thorough post-mortem analyses following incidents, driving continuous improvement through root cause identification and solution implementation.
•            Collaborate with development and operations teams to establish best practices in system reliability and incident management.
•            Troubleshoot and resolve issues related to database performance, network connectivity, and deployment failures, including diagnosing problems at the underlying platform level (e.g., Kubernetes, virtual machines).
•            Resolve issues within the timeframes stipulated in Service Level Agreements (SLAs), maintaining high standards of service delivery.
•            Identify and troubleshoot performance bottlenecks across systems, providing actionable recommendations for enhancements.
•            Maintain detailed documentation of processes and incident responses to support knowledge sharing and compliance.
Qualifications:
•            Proficiency in programming languages such as Python, Go, Java, or similar, with a focus on operational efficiency.
•            Demonstrated experience in system architecture and design, prioritizing reliability and scalability.
•            Strong understanding of SRE principles, including SLOs, SLIs, toil reduction, and incident post-mortems.
•            Experience with cloud environments (e.g., AWS, Azure, Google Cloud) and their operational management.
•            Strong expertise in Linux system administration.
•            Proven experience in troubleshooting application support issues with a focus on performance and connectivity.
•            Familiarity with networking concepts and effective troubleshooting techniques.
•            Excellent problem-solving abilities and a proactive approach to operational challenges.
•            Bilingual candidates with proficiency in Mandarin preferred, to liaise with Mandarin-speaking clients.
Preferred Skills:
•            Familiarity with monitoring tools and performance optimization techniques.
•            Experience in scripting or automation for system administration tasks.
•            Knowledge of networking concepts and troubleshooting methodologies.
•            Hands-on knowledge of cloud platforms (e.g., AWS, Azure, Google Cloud) and their services.
•            Familiarity with DevOps practices and frameworks, including CI/CD, infrastructure as code, and containerization.