Build Resilient AI Workloads on GKE


You use Spot VMs to save up to 90% on compute costs. Then Google reclaims your VM. Your work vanishes.

Ephemeral compute fails. Your architecture must assume failure.

Listen for the signal. Before a Spot node dies, Kubernetes sends SIGTERM to your containers. You have about 15 seconds. Your code must catch this signal. Stop taking new batches. Flush data to durable storage. Exit with success.
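A minimal Python sketch of that shutdown sequence, using only the standard library. The `run`, `process`, and `flush` names are hypothetical placeholders for your own training or inference loop, not part of any GKE API.

```python
import signal
import threading

# Set when SIGTERM arrives; the work loop checks it between batches.
shutdown_requested = threading.Event()


def handle_sigterm(signum, frame):
    """Only record the request; the main loop does the real cleanup."""
    shutdown_requested.set()


def run(batches, process, flush):
    """Process batches until finished or until a shutdown is requested.

    `process` and `flush` are hypothetical callables supplied by the caller.
    Returns the number of batches completed.
    """
    done = 0
    for batch in batches:
        if shutdown_requested.is_set():
            break  # stop taking new batches
        process(batch)
        done += 1
    flush()  # persist finished work before the process exits
    return done


def main():
    """Install the handler, run the loop, and report success."""
    signal.signal(signal.SIGTERM, handle_sigterm)
    completed = run(range(1000), process=lambda b: None, flush=lambda: None)
    print(f"completed {completed} batches")
    return 0  # exit code 0: a clean shutdown, not a failure
```

Call it with `sys.exit(main())` from your entry point. The handler only sets a flag, so each batch must be short enough to finish inside the grace period.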

Save progress. Local files disappear. Save model weights to Cloud Storage.

Save often.
Keep buckets in the same region.
Load checkpoints on startup.
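The save and resume cycle above can be sketched as follows. This is an illustration under assumptions: `BlobStore` is a hypothetical two-method interface, `InMemoryStore` stands in for a Cloud Storage bucket so the example runs without credentials, and the object name and JSON format are invented for the sketch.

```python
import json
from typing import Optional, Protocol


class BlobStore(Protocol):
    """Hypothetical minimal interface over an object store such as a bucket."""

    def put(self, name: str, data: bytes) -> None: ...
    def get(self, name: str) -> Optional[bytes]: ...


class InMemoryStore:
    """Stand-in for a bucket so the sketch runs without cloud credentials."""

    def __init__(self) -> None:
        self._blobs = {}

    def put(self, name: str, data: bytes) -> None:
        self._blobs[name] = data

    def get(self, name: str) -> Optional[bytes]:
        return self._blobs.get(name)


CHECKPOINT_NAME = "checkpoints/latest.json"  # hypothetical object name


def save_checkpoint(store: BlobStore, step: int, weights: list) -> None:
    """Serialize the step counter and weights, then write them to the store."""
    payload = json.dumps({"step": step, "weights": weights}).encode("utf-8")
    store.put(CHECKPOINT_NAME, payload)


def load_checkpoint(store: BlobStore) -> tuple:
    """Return (step, weights), or a fresh start when no checkpoint exists."""
    raw = store.get(CHECKPOINT_NAME)
    if raw is None:
        return 0, []
    state = json.loads(raw.decode("utf-8"))
    return state["step"], state["weights"]


def train(store: BlobStore, total_steps: int, save_every: int = 10) -> int:
    """Toy loop: resume from the last checkpoint, then save often."""
    step, weights = load_checkpoint(store)
    while step < total_steps:
        weights = weights + [float(step)]  # placeholder for a real update
        step += 1
        if step % save_every == 0 or step == total_steps:
            save_checkpoint(store, step, weights)
    return step
```

A replacement Pod that calls `train` with the same store picks up at the last saved step instead of step 0. A smaller `save_every` loses less work on eviction but writes more often.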

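Stop duplicate data. A restarted worker can process the same image twice. That wastes GPU time and can write duplicate rows. Use UPSERT in your database so a repeat write overwrites instead of duplicating. Check if a record exists before using GPU cycles.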
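One way to sketch the check and the UPSERT, with SQLite from the Python standard library standing in for your real database. The `results` table, its columns, and the `infer` callable are assumptions made for the example; the `ON CONFLICT` clause needs SQLite 3.24 or newer.

```python
import sqlite3


def open_results_db(path=":memory:"):
    """Open the database and create the results table if it is missing."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS results ("
        " image_id TEXT PRIMARY KEY,"
        " label TEXT NOT NULL)"
    )
    return conn


def already_processed(conn, image_id):
    """Cheap lookup that runs before any GPU work is scheduled."""
    row = conn.execute(
        "SELECT 1 FROM results WHERE image_id = ?", (image_id,)
    ).fetchone()
    return row is not None


def upsert_result(conn, image_id, label):
    """Insert the row, or overwrite the label if the image_id already exists."""
    conn.execute(
        "INSERT INTO results (image_id, label) VALUES (?, ?) "
        "ON CONFLICT(image_id) DO UPDATE SET label = excluded.label",
        (image_id, label),
    )
    conn.commit()


def process_image(conn, image_id, infer):
    """Run `infer` only for unseen images. Returns True if inference ran."""
    if already_processed(conn, image_id):
        return False
    upsert_result(conn, image_id, infer(image_id))
    return True
```

The existence check saves compute. The UPSERT covers the case where two workers pass the check at the same time: the second write replaces the first instead of adding a second row.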

Use message queues. Do not hand each worker a static list of tasks for big jobs. If the worker dies, its list is stranded.

Push work to Pub/Sub.
Pull tasks in small chunks.
Send an ACK only after the work is done.
If a node dies, Pub/Sub gives the task to another worker.
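The ack-after-work pattern above can be modeled in a few lines. This is not the Pub/Sub client API: `LeaseQueue` is a hypothetical in-memory stand-in that imitates only the behavior described here, and `die_after` is an invented knob that simulates an eviction.

```python
from collections import deque


class LeaseQueue:
    """In-memory stand-in for a queue with acknowledgements.

    A pulled task stays outstanding until it is acked. Unacked tasks
    become available again once their lease expires.
    """

    def __init__(self, tasks):
        self._ready = deque(tasks)
        self._outstanding = {}
        self._next_ack_id = 0

    def pull(self, max_messages):
        """Hand out up to `max_messages` tasks as (ack_id, task) pairs."""
        batch = []
        while self._ready and len(batch) < max_messages:
            task = self._ready.popleft()
            ack_id = self._next_ack_id
            self._next_ack_id += 1
            self._outstanding[ack_id] = task
            batch.append((ack_id, task))
        return batch

    def ack(self, ack_id):
        """Mark a task as done so it is never delivered again."""
        self._outstanding.pop(ack_id, None)

    def expire_leases(self):
        """Simulate the ack deadline passing: requeue everything unacked."""
        self._ready.extend(self._outstanding.values())
        self._outstanding.clear()

    def pending(self):
        """Count tasks that are still waiting or pulled but unacked."""
        return len(self._ready) + len(self._outstanding)


def worker(queue, handle, chunk_size=2, die_after=None):
    """Pull small chunks and ack each task only after `handle` returns."""
    completed = []
    while True:
        chunk = queue.pull(chunk_size)
        if not chunk:
            return completed
        for ack_id, task in chunk:
            if die_after is not None and len(completed) >= die_after:
                return completed  # evicted: remaining tasks stay unacked
            handle(task)
            queue.ack(ack_id)  # ack only after the work is done
            completed.append(task)
```

If a worker stops partway through a chunk, the unacked tasks return to the queue after `expire_leases()` and a second worker finishes them. Small chunks keep the amount of redelivered work low.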

Design for failure. Save money without losing reliability.

Source: https://dev.to/googlecloud/surviving-the-eviction-how-to-build-interrupt-resilient-ai-workloads-on-gke-5581
Optional learning community: https://t.me/GyaanSetuAi
