{"id":11633,"date":"2024-08-28T23:15:06","date_gmt":"2024-08-28T22:15:06","guid":{"rendered":"https:\/\/www.blopig.com\/blog\/?p=11633"},"modified":"2024-08-28T23:15:25","modified_gmt":"2024-08-28T22:15:25","slug":"memory-mapped-files-for-efficient-data-processing","status":"publish","type":"post","link":"https:\/\/www.blopig.com\/blog\/2024\/08\/memory-mapped-files-for-efficient-data-processing\/","title":{"rendered":"Memory-mapped files for efficient data processing"},"content":{"rendered":"\n<p class=\"wp-block-paragraph\">Memory management is a key concern when working with large datasets. Many researchers and developers will load entire datasets into memory for processing. Although this is a straightforward approach that allows for quick access and manipulation of data, it has its drawbacks. When the dataset size approaches or exceeds the available physical memory, performance degrades rapidly due to excessive swapping, leading to increased latency and reduced throughput. Memory-mapped files are an alternative strategy to access and manipulate large datasets without the need to load them fully into memory.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\"><br><strong>A background on memory-mapped files<\/strong><\/p>\n\n\n\n<p class=\"wp-block-paragraph\">Memory mapping is the process of mapping a file or a portion of a file directly into virtual memory. This mapping establishes a one-to-one correspondence between the file&#8217;s contents on disk and specific addresses in the process&#8217;s memory space. Instead of relying on traditional I\/O operations, such as <code>read()<\/code> and <code>write()<\/code>, which involve copying data between kernel space and user space, the process can access the file\u2019s contents directly through memory addresses. When the process touches an address whose data is not yet resident, a page fault prompts the kernel to load the corresponding chunk (page) into physical memory. These chunks are far smaller than the whole file. 
This direct access reduces overhead and can significantly speed up data processing, especially for large files or applications that require high-throughput I\/O operations.<\/p>\n\n\n\n<!--more-->\n\n\n\n<p class=\"wp-block-paragraph\">In the context of Python, memory-mapped files are built on the <code>mmap()<\/code> system call, which Python&#8217;s <code>mmap<\/code> module wraps on Unix-like systems. In C, <code>mmap()<\/code> is also sometimes used in place of <code>malloc()<\/code> to obtain memory. <code>mmap()<\/code> asks the operating system to map a file or device into the process\u2019s memory. It supports lazy loading, meaning that the file&#8217;s pages are only loaded into physical memory when accessed by the process.<br><\/p>\n\n\n\n<p class=\"wp-block-paragraph\"><strong>How memory-mapped files work in Python<\/strong><\/p>\n\n\n\n<p class=\"wp-block-paragraph\">The <code>mmap()<\/code> system call requires several parameters to establish a memory-mapped file:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li><code>addr<\/code> &#8211; A hint to the operating system indicating where to start the virtual mapping. If set to NULL, the kernel will choose an appropriate address.<\/li>\n\n\n\n<li><code>length<\/code> &#8211; Specifies the length of the mapping in bytes.<\/li>\n\n\n\n<li><code>prot<\/code> &#8211; Defines the protection level of the mapped memory, such as read, write, or execute permissions.<\/li>\n\n\n\n<li><code>flags<\/code> &#8211; Determines various options for the mapping, such as whether it is backed by a file or anonymous (not backed by a file).<\/li>\n\n\n\n<li><code>fd<\/code> &#8211; The file descriptor of the file to be mapped. 
For anonymous mappings, this is set to -1.<\/li>\n\n\n\n<li><code>offset<\/code> &#8211; Indicates the starting point within the file for the mapping, which must be a multiple of the system\u2019s page size.<\/li>\n<\/ul>\n\n\n\n<figure class=\"wp-block-image size-large\"><img data-recalc-dims=\"1\" decoding=\"async\" loading=\"lazy\" src=\"https:\/\/i0.wp.com\/www.clear.rice.edu\/comp321\/html\/laboratories\/lab10\/mmap.png?w=625&#038;ssl=1\" alt=\"\"\/><\/figure>\n\n\n\n<p class=\"wp-block-paragraph\"><code>mmap()<\/code> returns a pointer to the mapped area, allowing the process to access the file as if it were part of its own memory space. The figure above, from materials provided in the <a href=\"https:\/\/www.clear.rice.edu\/comp321\/html\/laboratories\/lab10\/\" data-type=\"link\" data-id=\"https:\/\/www.clear.rice.edu\/comp321\/html\/laboratories\/lab10\/\">COMP 321 course on Operating Systems at Rice University<\/a>, demonstrates the <code>mmap()<\/code> system call in action. It shows how a file on disk is mapped into the process&#8217;s virtual memory space, establishing a direct correspondence between the file&#8217;s bytes and specific memory addresses.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\"><strong>Pros and cons of memory mapping<\/strong><\/p>\n\n\n\n<p class=\"wp-block-paragraph\">Memory-mapped files offer several advantages, the first being lazy loading. Pages of the file are only loaded into memory when accessed, which conserves both memory and processing time, especially in applications where only a portion of the file is needed at any given time. By avoiding the overhead associated with multiple system calls and the need to copy data between kernel space and user space, memory mapping can lead to substantial performance improvements. 
This is particularly beneficial for large files or files that are frequently accessed, as the process interacts with the file\u2019s content as though it were in memory.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">When working with multiple processes that need to access the same data, memory-mapped files allow them to do so from the same location rather than each holding its own copy. This is particularly advantageous in server environments. Finally, memory-mapping provides more versatile memory allocation than alternatives like <code>malloc()<\/code>. It can be used in signal handlers and allows for dynamic memory management, such as allocating memory in the middle of the heap or freeing memory at any point.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\"><br>However, memory-mapping also has some disadvantages. Memory mappings must align with the system\u2019s page boundaries, which can lead to wasted space if the file size is not a multiple of the page size. Additionally, extensive use of memory-mapped files in systems with limited address space can lead to fragmentation, complicating memory management. There is also some overhead associated with maintaining these mappings within the kernel, though this is often outweighed by the performance benefits.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\"><strong>Implementing memory-mapped files in Python<\/strong><\/p>\n\n\n\n<p class=\"wp-block-paragraph\">Python\u2019s <code>mmap<\/code> module provides a straightforward interface for working with memory-mapped files. This can be particularly useful when handling large datasets, such as the QM9 dataset from PyTorch Geometric. 
This example demonstrates how to map a large file into memory, read SMILES strings and their corresponding property values, modify them, and then update the file, all without loading the entire dataset into RAM.<br><\/p>\n\n\n\n<pre class=\"wp-block-code\"><code><code data-enlighter-language=\"python\" class=\"EnlighterJSRAW\">import mmap\nimport os\nimport struct\nfrom torch_geometric.datasets import QM9\n\n# Load the QM9 dataset\ndataset = QM9(root=os.path.join(os.path.dirname(os.path.realpath(__file__)), '..', 'data', 'QM9'))\n\n# Define the file name and size\nfilename = 'qm9_smiles_properties.dat'\nfilesize = 1024 * 1024 * 100  # 100 MB for example\n\n# Create a file of the given size\nwith open(filename, 'wb') as f:\n    f.seek(filesize - 1)\n    f.write(b'\\x00')\n\n# Open and memory-map the file\nwith open(filename, 'r+b') as f:\n    mmapped_file = mmap.mmap(f.fileno(), length=0, access=mmap.ACCESS_WRITE)\n\n    # Write SMILES string and property value to the memory-mapped file\n    for data in dataset:\n        smiles_string = data.smiles  # SMILES string from QM9 dataset\n        # data.y has shape &#91;1, 19]; target index 7 is U0 in PyTorch Geometric's QM9\n        property_value = data.y&#91;0, 7].item()\n\n        smiles_binary = smiles_string.encode('utf-8')\n        property_value_binary = struct.pack('d', property_value)\n\n        entry = (\n            struct.pack('I', len(smiles_binary)) + smiles_binary +\n            property_value_binary\n        )\n\n        # Write the entry to the memory-mapped file\n        mmapped_file.write(entry)\n        mmapped_file.flush()\n\n    # Remember where the written entries end; the rest of the file is zero padding\n    data_end = mmapped_file.tell()\n\n    # Retrieve and update every entry\n    mmapped_file.seek(0)  # Reset pointer to the beginning\n\n    while mmapped_file.tell() &lt; data_end:\n        # Read the length of the SMILES string\n        smiles_len = struct.unpack('I', mmapped_file.read(4))&#91;0]\n        smiles_binary = mmapped_file.read(smiles_len)\n        property_value_binary = mmapped_file.read(8)\n\n        # Decode SMILES string and unpack 
property value\n        smiles_string = smiles_binary.decode('utf-8')\n        property_value = struct.unpack('d', property_value_binary)&#91;0]\n\n        # Adjust property value (multiply by 2)\n        modified_property_value = property_value * 2\n        modified_property_value_binary = struct.pack('d', modified_property_value)\n\n        # Seek back to where the property value was stored\n        mmapped_file.seek(-8, os.SEEK_CUR)\n        mmapped_file.write(modified_property_value_binary)\n        mmapped_file.flush()\n\n    # Close the memory-mapped file\n    mmapped_file.close()<\/code><\/code><\/pre>\n\n\n\n<p class=\"wp-block-paragraph\">In this example, the PyTorch Geometric QM9 dataset is loaded. A 100 MB file is created as a placeholder for storing the data. This file is then memory-mapped, allowing it to be accessed as if it were a part of the process&#8217;s memory. Each molecule&#8217;s SMILES string and corresponding property value (e.g., the U0 energy from the QM9 dataset) are encoded and written to the memory-mapped file. Each entry consists of the length of the SMILES string, the SMILES string itself, and the property value. After writing, the code reads back each SMILES string and property value, modifies the property value (doubling it), and writes the modified value back to the file. Changes are flushed to disk after each write operation, and finally the memory-mapped file is closed.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>Memory management is a key concern when working with large datasets. Many researchers and developers will load entire datasets into memory for processing. Although this is a straightforward approach that allows for quick access and manipulation of data, it has its drawbacks. 
When the dataset size approaches or exceeds the available physical memory, performance degrades [&hellip;]<\/p>\n","protected":false},"author":111,"featured_media":0,"comment_status":"closed","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"nf_dc_page":"","wikipediapreview_detectlinks":true,"_monsterinsights_skip_tracking":false,"_monsterinsights_sitenote_active":false,"_monsterinsights_sitenote_note":"","_monsterinsights_sitenote_category":0,"ngg_post_thumbnail":0,"_jetpack_newsletter_access":"","_jetpack_dont_email_post_to_subs":false,"_jetpack_newsletter_tier_id":0,"_jetpack_memberships_contains_paywalled_content":false,"_jetpack_feature_clip_id":0,"_jetpack_memberships_contains_paid_content":false,"footnotes":"","jetpack_post_was_ever_published":false},"categories":[29,227],"tags":[87,412,152,648],"ppma_author":[766],"class_list":["post-11633","post","type-post","status-publish","format-standard","hentry","category-code","category-python-code","tag-code-2","tag-how-to","tag-python","tag-technical"],"jetpack_featured_media_url":"","jetpack_sharing_enabled":true,"authors":[{"term_id":766,"user_id":111,"is_guest":0,"slug":"adelaide","display_name":"Adelaide 
Punt","avatar_url":"https:\/\/secure.gravatar.com\/avatar\/5eedee079149cbf18cbc455e2a2c69c197eeaae869e51d4a9cb0ac91ba9138df?s=96&d=mm&r=g","author_category":"","user_url":"","last_name":"Punt","first_name":"Adelaide","job_title":"","description":""}],"_links":{"self":[{"href":"https:\/\/www.blopig.com\/blog\/wp-json\/wp\/v2\/posts\/11633","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.blopig.com\/blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.blopig.com\/blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.blopig.com\/blog\/wp-json\/wp\/v2\/users\/111"}],"replies":[{"embeddable":true,"href":"https:\/\/www.blopig.com\/blog\/wp-json\/wp\/v2\/comments?post=11633"}],"version-history":[{"count":5,"href":"https:\/\/www.blopig.com\/blog\/wp-json\/wp\/v2\/posts\/11633\/revisions"}],"predecessor-version":[{"id":11647,"href":"https:\/\/www.blopig.com\/blog\/wp-json\/wp\/v2\/posts\/11633\/revisions\/11647"}],"wp:attachment":[{"href":"https:\/\/www.blopig.com\/blog\/wp-json\/wp\/v2\/media?parent=11633"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.blopig.com\/blog\/wp-json\/wp\/v2\/categories?post=11633"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.blopig.com\/blog\/wp-json\/wp\/v2\/tags?post=11633"},{"taxonomy":"author","embeddable":true,"href":"https:\/\/www.blopig.com\/blog\/wp-json\/wp\/v2\/ppma_author?post=11633"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}