{"id":4201,"date":"2018-07-02T20:40:41","date_gmt":"2018-07-02T19:40:41","guid":{"rendered":"http:\/\/www.blopig.com\/blog\/?p=4201"},"modified":"2018-07-02T20:40:41","modified_gmt":"2018-07-02T19:40:41","slug":"maps-are-useful-but-first-you-need-to-build-store-and-read-a-map","status":"publish","type":"post","link":"https:\/\/www.blopig.com\/blog\/2018\/07\/maps-are-useful-but-first-you-need-to-build-store-and-read-a-map\/","title":{"rendered":"Maps are useful. But first, you need to build, store and read a map."},"content":{"rendered":"<div>\n<p>Recently we embarked on a project that required the storage of a relatively big dictionary with 10M+ key-value pairs. Unsurprisingly, Python took over two hours to build such dictionary, taking into accounts all the time for extending, accessing and writing to the dictionary, AND it eventually crashed. So I turned to C++ for help.<\/p>\n<p>In C++, map is one of the ways you can store a string-key and an integer value. Since we are concerned about the data storage and access, I compared <code>map<\/code> and <code>unordered_map<\/code>.<\/p>\n<\/div>\n<div><\/div>\n<div>(Read more:\u00a0<a href=\"https:\/\/www.geeksforgeeks.org\/map-vs-unordered_map-c\/\">https:\/\/www.geeksforgeeks.org\/map-vs-unordered_map-c\/<\/a>\u00a0.)<\/div>\n<div><\/div>\n<div>An <code>unordered_map<\/code> stores a hash table of the keys and the mapped value; while a <code>map<\/code> is ordered. The important consideration here includes:<\/div>\n<div><\/div>\n<div><\/div>\n<ul>\n<li>Memory: <code>map<\/code> does not have the hash table and is therefore smaller than an <code>unordered_map<\/code>.<\/li>\n<li><span style=\"font-size: 1rem\">Access: accessing an <code>unordered_map<\/code> takes O(1) while accessing a <code>map<\/code> takes log(n).<\/span><\/li>\n<\/ul>\n<div><\/div>\n<div><\/div>\n<div>I have eventually chosen to go with <code>map<\/code>, because it is more memory efficient considering the small RAM size that I have access to. However, it still takes up about 8GB of RAM per object during its runtime (and I have 1800 objects to run through, each building a different dictionary). Saving these seems to open another can of worm.<\/div>\n<div><\/div>\n<div><\/div>\n<div>In Python, we could easily use Pickle or JSON to serialise the dictionary. In C++, it\u2019s common to use the BOOST library. There are two archival functions in BOOST: text or binary archives. Text archives are human-readable but I don\u2019t think I am really going to open and read 10M+ lines of key-value pairs, I opted for binary archives that are machine readable and smaller. (Read more:\u00a0<a href=\"https:\/\/stackoverflow.com\/questions\/1058051\/boost-serialization-performance-text-vs-binary-format\">https:\/\/stackoverflow.com\/questions\/1058051\/boost-serialization-performance-text-vs-binary-format<\/a>\u00a0.)<\/div>\n<div><\/div>\n<div><\/div>\n<div>To further compress the memory size when I save the maps, I used zlib compression. Obviously there are ready-to-use codes from these people half a year ago, which saved me debugging:<\/div>\n<div><a href=\"https:\/\/stackoverflow.com\/questions\/48034374\/compressing-boost-binary-archive-with-zlib\"><span style=\"color: #000000\">https:\/\/stackoverflow.com\/questions\/48034374\/compressing-boost-binary-archive-with-zlib<\/span><\/a><\/div>\n<div><\/div>\n<div><\/div>\n<div>\n<div>Ultimately this gets down to 96GB summing 1800 files, all done within 6 hours.<\/div>\n<div><\/div>\n<div><\/div>\n<\/div>\n","protected":false},"excerpt":{"rendered":"<p>Recently we embarked on a project that required the storage of a relatively big dictionary with 10M+ key-value pairs. Unsurprisingly, Python took over two hours to build such dictionary, taking into accounts all the time for extending, accessing and writing to the dictionary, AND it eventually crashed. So I turned to C++ for help. In [&hellip;]<\/p>\n","protected":false},"author":49,"featured_media":0,"comment_status":"closed","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"nf_dc_page":"","wikipediapreview_detectlinks":true,"_monsterinsights_skip_tracking":false,"_monsterinsights_sitenote_active":false,"_monsterinsights_sitenote_note":"","_monsterinsights_sitenote_category":0,"ngg_post_thumbnail":0,"_jetpack_memberships_contains_paid_content":false,"footnotes":""},"categories":[14,15],"tags":[],"ppma_author":[533],"class_list":["post-4201","post","type-post","status-publish","format-standard","hentry","category-howto","category-technical"],"jetpack_featured_media_url":"","jetpack_sharing_enabled":true,"authors":[{"term_id":533,"user_id":49,"is_guest":0,"slug":"catherine","display_name":"Wing Ki (Catherine) Wong","avatar_url":"https:\/\/secure.gravatar.com\/avatar\/c620f4eb25c8d5c8efdae64e95c4ec436da2bc2d6654ab7085ad1a822e3b4341?s=96&d=mm&r=g","0":null,"1":"","2":"","3":"","4":"","5":"","6":"","7":"","8":""}],"_links":{"self":[{"href":"https:\/\/www.blopig.com\/blog\/wp-json\/wp\/v2\/posts\/4201","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.blopig.com\/blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.blopig.com\/blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.blopig.com\/blog\/wp-json\/wp\/v2\/users\/49"}],"replies":[{"embeddable":true,"href":"https:\/\/www.blopig.com\/blog\/wp-json\/wp\/v2\/comments?post=4201"}],"version-history":[{"count":3,"href":"https:\/\/www.blopig.com\/blog\/wp-json\/wp\/v2\/posts\/4201\/revisions"}],"predecessor-version":[{"id":4204,"href":"https:\/\/www.blopig.com\/blog\/wp-json\/wp\/v2\/posts\/4201\/revisions\/4204"}],"wp:attachment":[{"href":"https:\/\/www.blopig.com\/blog\/wp-json\/wp\/v2\/media?parent=4201"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.blopig.com\/blog\/wp-json\/wp\/v2\/categories?post=4201"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.blopig.com\/blog\/wp-json\/wp\/v2\/tags?post=4201"},{"taxonomy":"author","embeddable":true,"href":"https:\/\/www.blopig.com\/blog\/wp-json\/wp\/v2\/ppma_author?post=4201"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}