{"id":11497,"date":"2024-07-25T10:26:14","date_gmt":"2024-07-25T09:26:14","guid":{"rendered":"https:\/\/www.blopig.com\/blog\/?p=11497"},"modified":"2024-07-26T00:32:33","modified_gmt":"2024-07-25T23:32:33","slug":"the-tale-of-the-undead-logger","status":"publish","type":"post","link":"https:\/\/www.blopig.com\/blog\/2024\/07\/the-tale-of-the-undead-logger\/","title":{"rendered":"The Tale of the Undead Logger"},"content":{"rendered":"<div class=\"wp-block-image\">\n<figure class=\"aligncenter size-full is-resized\"><a href=\"https:\/\/i0.wp.com\/www.blopig.com\/blog\/wp-content\/uploads\/2024\/07\/logger.webp?ssl=1\"><img data-recalc-dims=\"1\" decoding=\"async\" width=\"625\" height=\"625\" loading=\"lazy\" src=\"https:\/\/i0.wp.com\/www.blopig.com\/blog\/wp-content\/uploads\/2024\/07\/logger.webp?resize=625%2C625&#038;ssl=1\" alt=\"A picture of a scary-looking zombie in a lumberjack outfit holding an axe, in the middle of a forest at night, staring menacingly at the viewer.\" class=\"wp-image-11498\" style=\"width:385px;height:auto\" srcset=\"https:\/\/i0.wp.com\/www.blopig.com\/blog\/wp-content\/uploads\/2024\/07\/logger.webp?w=1024&amp;ssl=1 1024w, https:\/\/i0.wp.com\/www.blopig.com\/blog\/wp-content\/uploads\/2024\/07\/logger.webp?resize=300%2C300&amp;ssl=1 300w, https:\/\/i0.wp.com\/www.blopig.com\/blog\/wp-content\/uploads\/2024\/07\/logger.webp?resize=150%2C150&amp;ssl=1 150w, https:\/\/i0.wp.com\/www.blopig.com\/blog\/wp-content\/uploads\/2024\/07\/logger.webp?resize=768%2C768&amp;ssl=1 768w, https:\/\/i0.wp.com\/www.blopig.com\/blog\/wp-content\/uploads\/2024\/07\/logger.webp?resize=624%2C624&amp;ssl=1 624w\" sizes=\"auto, (max-width: 625px) 100vw, 625px\" \/><\/a><figcaption class=\"wp-element-caption\"><em>Fear the Undead Logger all ye who enter here.<br>For he may strike, and drain the life out nodes that you hold dear.<br>Among the smouldering embers of jobs you thought long dead,<br>he lingers on, to terrorise, and cause you frightful dread.<br>But 
hark ye all my tale to save you from much pain,<br>and fight ye not anew the battles I have fought in vain.<\/em><\/figcaption><\/figure>\n<\/div>\n\n\n<p class=\"has-text-align-left\">Or simply&#8230;<\/p>\n\n\n\n<h2 class=\"wp-block-heading\">&#8230; Tips and Tricks to Use When <code>wandb<\/code> Logger<strong> Just. <em>Won&#8217;t.<\/em><\/strong><em> <strong>DIE.<\/strong><\/em><\/h2>\n\n\n\n<p>The <a href=\"https:\/\/docs.wandb.ai\/guides\/track\">Weights and Biases Logger<\/a> (illustrated above by DALL-E, admittedly with some artistic license) hardly requires introduction. It&#8217;s something of an industry standard at this point, well-regarded for the extensive (and extensible) functionality of its interactive dashboard; for advanced features like checkpointing model weights in the cloud and automating hyperparameter sweeps; and for integrating painlessly with frameworks like PyTorch and PyTorch Lightning. It simplifies your life as an ML researcher enormously by making it easy to track and compare experiments and to monitor system resource usage, all while giving you very fun interactive graphs to play with.<br>Plot arbitrary quantities you may be logging against each other, interactively, on the fly, however you like. In Dark Mode, of course (you&#8217;re a professional, after all). 
Here&#8217;s a less artistic impression to give you an idea, should you have been living under a rock:<\/p>\n\n\n\n<!--more-->\n\n\n\n<figure class=\"wp-block-image size-large\"><a href=\"https:\/\/i0.wp.com\/www.blopig.com\/blog\/wp-content\/uploads\/2024\/07\/logger_screenshot.png?ssl=1\"><img data-recalc-dims=\"1\" decoding=\"async\" width=\"625\" height=\"375\" loading=\"lazy\" src=\"https:\/\/i0.wp.com\/www.blopig.com\/blog\/wp-content\/uploads\/2024\/07\/logger_screenshot.png?resize=625%2C375&#038;ssl=1\" alt=\"A screenshot of the wandb logger dashboard homepage in dark mode, showing several colourful line graphs tracking training metrics across different model runs.\" class=\"wp-image-11500\" srcset=\"https:\/\/i0.wp.com\/www.blopig.com\/blog\/wp-content\/uploads\/2024\/07\/logger_screenshot.png?resize=1024%2C614&amp;ssl=1 1024w, https:\/\/i0.wp.com\/www.blopig.com\/blog\/wp-content\/uploads\/2024\/07\/logger_screenshot.png?resize=300%2C180&amp;ssl=1 300w, https:\/\/i0.wp.com\/www.blopig.com\/blog\/wp-content\/uploads\/2024\/07\/logger_screenshot.png?resize=768%2C460&amp;ssl=1 768w, https:\/\/i0.wp.com\/www.blopig.com\/blog\/wp-content\/uploads\/2024\/07\/logger_screenshot.png?resize=1536%2C921&amp;ssl=1 1536w, https:\/\/i0.wp.com\/www.blopig.com\/blog\/wp-content\/uploads\/2024\/07\/logger_screenshot.png?resize=2048%2C1228&amp;ssl=1 2048w, https:\/\/i0.wp.com\/www.blopig.com\/blog\/wp-content\/uploads\/2024\/07\/logger_screenshot.png?resize=624%2C374&amp;ssl=1 624w, https:\/\/i0.wp.com\/www.blopig.com\/blog\/wp-content\/uploads\/2024\/07\/logger_screenshot.png?w=1250&amp;ssl=1 1250w, https:\/\/i0.wp.com\/www.blopig.com\/blog\/wp-content\/uploads\/2024\/07\/logger_screenshot.png?w=1875&amp;ssl=1 1875w\" sizes=\"auto, (max-width: 625px) 100vw, 625px\" \/><\/a><\/figure>\n\n\n\n<p>The logger has been variously introduced on <a href=\"https:\/\/www.blopig.com\/blog\/2022\/06\/visualise-with-weight-and-biases\/\">this<\/a> very <a 
href=\"https:\/\/www.blopig.com\/blog\/2021\/01\/tracking-machine-learning-projects-with-weights-biases\/\">blog<\/a>, so I won&#8217;t belabour the point. Its <a href=\"https:\/\/docs.wandb.ai\/guides\/track\">documentation<\/a> is excellent: read it, then use it.<\/p>\n\n\n\n<p>Excellent, that is, until you encounter the following (purely hypothetical!) scenario: You <code>scancel<\/code> a job, or it runs <code>OOM<\/code> or <code>OOT<\/code> (out of memory or out of time), and SLURM promptly informs you in no uncertain terms that your favourite node has just gone into drain (for the third time this week), swiftly followed by IT, who provide damning evidence that it is <em>your<\/em> jobs leaving undead, un-killable <code>wandb<\/code> logger processes behind, meaning it is, in fact, your fault.<br>Scarcely has this news had time to earn you the undying adoration of your colleagues (\ud83d\ude05\ud83d\ude2c), when IT follow it up with statements like &#8220;In normal operation <code>nvidia-smi<\/code> simply should not hang&#8221; (read: how the !?#!! did you !?#!! up low-level processes this badly?!), and &#8220;FYI this morning the server was so hung <code>ssh<\/code> no longer responded and I had to force a reboot&#8221;. In this <em>entirely hypothetical<\/em> (!!) 
scenario, you may find that the documentation offers you precious little information on what specific crime you ought to charge yourself with.<\/p>\n\n\n\n<p>If you do find yourself facing something akin to this <em>ENTIRELY HYPOTHETICAL (!!!)<\/em> scenario, consider the following steps: Disable multi-process logging (<code>sync_dist=False, rank_zero_only=True<\/code>, if you log through PyTorch Lightning&#8217;s <code>self.log()<\/code>) to simplify debugging, then package up your programme logic in a <code>my_main()<\/code> function, and use the following idiom in your script:<\/p>\n\n\n\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"python\" data-enlighter-theme=\"\" data-enlighter-highlight=\"\" data-enlighter-linenumbers=\"\" data-enlighter-lineoffset=\"\" data-enlighter-title=\"\" data-enlighter-group=\"\">import signal\nimport sys\n\nimport wandb\n\nif __name__ == \"__main__\":\n    # Python does not run `finally` on SIGTERM by default, so turn the signal into SystemExit\n    signal.signal(signal.SIGTERM, lambda signum, frame: sys.exit(128 + signum))\n    try:\n        my_main()\n    finally:\n        wandb.finish()  # now reached on SIGTERM too; should ensure correct shutdown<\/pre>\n\n\n\n<p>The thing to know here is that when SLURM sends your job a <code>SIGTERM<\/code> for whatever reason (rather than your programme logic terminating the job from within <code>Python<\/code>), running <code>wandb<\/code> processes sometimes refuse to die. Left to its defaults, <code>Python<\/code> simply terminates on <code>SIGTERM<\/code> without running any <code>finally<\/code> blocks, which is why the idiom above installs a handler that converts the signal into a <code>SystemExit<\/code> (if your framework already installs its own <code>SIGTERM<\/code> handler, check how the two interact).<br>Either the <code>wandb<\/code> processes are desperately trying to sync any last logs to the server, eventually exhausting SLURM&#8217;s patience, or they may actually be caught in a deadlock (if you&#8217;re running on multiple GPUs and have distributed logging calls).<br>If you&#8217;re lucky, you can even manage to make it impossible to release any GPU resources you hold back to the system, meaning not even <code>nvidia-smi<\/code> can see the GPUs any more.<br>In any case, <code>Python<\/code> does not exit, SLURM is faced with an un-killable job, panics, and sends the node into drain.<\/p>\n\n\n\n<p>If this is you, make the changes above, <code>scancel<\/code> mid-job, and see if you still crash the node. 
If you don&#8217;t, try to re-enable multi-process logging and see if the fix still holds (I&#8217;m not entirely sure I trust that <code>wandb<\/code> is thread-safe, so proceed with caution). If you <em>do<\/em> still crash the node \u2026 happy debugging! May the odds be ever in your favour!<\/p>\n","protected":false},"excerpt":{"rendered":"<p>Or simply&#8230; &#8230; Tips and Tricks to Use When wandb Logger Just. Won&#8217;t. DIE. The Weights and Biases Logger (illustrated above by DALL-E; admittedly with some artistic license) hardly requires introduction. It&#8217;s something of an industry standard at this point, well-regarded for the extensive (and extensible) functionality of its interactive dashboard; for advanced features like [&hellip;]<\/p>\n","protected":false},"author":125,"featured_media":0,"comment_status":"closed","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"nf_dc_page":"","wikipediapreview_detectlinks":true,"_monsterinsights_skip_tracking":false,"_monsterinsights_sitenote_active":false,"_monsterinsights_sitenote_note":"","_monsterinsights_sitenote_category":0,"ngg_post_thumbnail":0,"_jetpack_memberships_contains_paid_content":false,"footnotes":""},"categories":[29,361,621,632,189,227],"tags":[784],"ppma_author":[783],"class_list":["post-11497","post","type-post","status-publish","format-standard","hentry","category-code","category-data-science","category-data-visualization","category-deep-learning","category-machine-learning","category-python-code","tag-debugging"],"jetpack_featured_media_url":"","jetpack_sharing_enabled":true,"authors":[{"term_id":783,"user_id":125,"is_guest":0,"slug":"ody","display_name":"Odysseas 
Vavourakis","avatar_url":"https:\/\/secure.gravatar.com\/avatar\/b74030bdaef5f39ec32be3ae7bb5af054cbcb0b431b1cc51ba1b41d723ecee48?s=96&d=mm&r=g","0":null,"1":"","2":"","3":"","4":"","5":"","6":"","7":"","8":""}],"_links":{"self":[{"href":"https:\/\/www.blopig.com\/blog\/wp-json\/wp\/v2\/posts\/11497","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.blopig.com\/blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.blopig.com\/blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.blopig.com\/blog\/wp-json\/wp\/v2\/users\/125"}],"replies":[{"embeddable":true,"href":"https:\/\/www.blopig.com\/blog\/wp-json\/wp\/v2\/comments?post=11497"}],"version-history":[{"count":5,"href":"https:\/\/www.blopig.com\/blog\/wp-json\/wp\/v2\/posts\/11497\/revisions"}],"predecessor-version":[{"id":11504,"href":"https:\/\/www.blopig.com\/blog\/wp-json\/wp\/v2\/posts\/11497\/revisions\/11504"}],"wp:attachment":[{"href":"https:\/\/www.blopig.com\/blog\/wp-json\/wp\/v2\/media?parent=11497"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.blopig.com\/blog\/wp-json\/wp\/v2\/categories?post=11497"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.blopig.com\/blog\/wp-json\/wp\/v2\/tags?post=11497"},{"taxonomy":"author","embeddable":true,"href":"https:\/\/www.blopig.com\/blog\/wp-json\/wp\/v2\/ppma_author?post=11497"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}