{"id":12603,"date":"2025-05-22T16:35:08","date_gmt":"2025-05-22T15:35:08","guid":{"rendered":"https:\/\/www.blopig.com\/blog\/?p=12603"},"modified":"2025-05-22T16:36:56","modified_gmt":"2025-05-22T15:36:56","slug":"memory-efficient-clustering-of-large-protein-trajectory-ensembles","status":"publish","type":"post","link":"https:\/\/www.blopig.com\/blog\/2025\/05\/memory-efficient-clustering-of-large-protein-trajectory-ensembles\/","title":{"rendered":"Memory Efficient Clustering of Large Protein Trajectory Ensembles"},"content":{"rendered":"\n<p>Molecular dynamics simulations have grown increasingly ambitious, with researchers routinely generating trajectories containing hundreds of thousands or even millions of frames. While this wealth of data offers unprecedented insights into protein dynamics, it also presents a formidable computational challenge: how do you extract meaningful conformational clusters from datasets that can easily exceed available system memory?<\/p>\n\n\n\n<p>Traditional approaches to trajectory clustering often stumble when faced with large ensembles. Loading all pairwise distances into memory simultaneously can quickly consume tens or hundreds of gigabytes of RAM, while conventional PCA implementations require the entire dataset to fit in memory before decomposition can begin. For many researchers, this means either downsampling their precious simulation data or investing in expensive high-memory computing resources.<\/p>\n\n\n\n<p>The solution lies in recognizing that we don&#8217;t actually need to hold all our data in memory simultaneously. By leveraging incremental algorithms and smart memory management, we can perform sophisticated dimensionality reduction and clustering on arbitrarily large trajectory datasets using modest computational resources. 
Let&#8217;s explore how three key strategies\u2014incremental PCA, mini-batch clustering, and intelligent memory management\u2014can transform your approach to analyzing large protein ensembles.<\/p>\n\n\n\n<!--more-->\n\n\n\n<h2 class=\"wp-block-heading\">Incremental PCA: Building Understanding One Chunk at a Time<\/h2>\n\n\n\n<p>The heart of trajectory clustering often involves dimensionality reduction of high-dimensional coordinate or distance data. For protein trajectories, pairwise distances between alpha carbons provide an alignment-free representation that captures the essential conformational relationships between frames. However, for a protein with 200 residues, each frame generates nearly 20,000 pairwise distances\u2014multiply this by 100,000 frames, and you&#8217;re looking at 2 billion distance values that need to fit in memory simultaneously.<\/p>\n\n\n\n<p>Incremental PCA offers an elegant solution by building the principal component decomposition iteratively, processing data in manageable chunks while maintaining the mathematical rigor of the full decomposition. 
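To make these sizes concrete, here is a quick sketch (synthetic coordinates, not from the script) of the condensed distance vector that pdist returns for a single frame:

```python
import numpy as np
from scipy.spatial.distance import pdist

# Synthetic stand-in for one frame of alpha-carbon coordinates (200 residues)
rng = np.random.default_rng(0)
positions = rng.random((200, 3))

# pdist returns the condensed upper triangle: n * (n - 1) / 2 values per frame
frame_distances = pdist(positions, metric="euclidean")
print(len(frame_distances))  # 19900 distances for 200 atoms
```

Stacking 100,000 such vectors is what produces the roughly 2 billion values mentioned above, which is why the script never materialises them all at once.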
The key insight is that PCA can be formulated as an online learning problem, where we update our understanding of the data structure with each new batch of observations.<\/p>\n\n\n\n<p>Here&#8217;s how the trajectory clustering script implements this approach:<\/p>\n\n\n\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"python\" data-enlighter-theme=\"\" data-enlighter-highlight=\"\" data-enlighter-linenumbers=\"true\" data-enlighter-lineoffset=\"\" data-enlighter-title=\"\" data-enlighter-group=\"\">def calculate_distances_and_perform_pca(universe, selection, num_components, chunk_size):\n    \"\"\"Calculate pairwise distances and perform incremental PCA in chunks\"\"\"\n\n    # Select atoms and size the condensed distance vector\n    atoms = universe.select_atoms(selection)\n    n_frames = len(universe.trajectory)\n    n_distances = atoms.n_atoms * (atoms.n_atoms - 1) \/\/ 2\n\n    # Initialize IncrementalPCA with appropriate batch size\n    ipca_batch_size = max(num_components * 10, 100)\n    pca = IncrementalPCA(n_components=num_components, batch_size=ipca_batch_size)\n\n    # Process frames in chunks to manage memory\n    for chunk_start in tqdm(range(0, n_frames, chunk_size), desc=\"Processing frame chunks\"):\n        chunk_end = min(chunk_start + chunk_size, n_frames)\n        chunk_indices = np.arange(chunk_start, chunk_end)\n        chunk_distances = np.zeros((len(chunk_indices), n_distances))\n\n        # Calculate pairwise distances for frames in this chunk\n        for i, frame_idx in enumerate(chunk_indices):\n            universe.trajectory[frame_idx]\n            positions = atoms.positions\n            frame_distances = pdist(positions, metric=\"euclidean\")\n            chunk_distances[i] = frame_distances\n\n        # Update PCA model with this chunk\n        if chunk_start == 0:\n            chunk_pca_coords = pca.fit_transform(chunk_distances)\n        else:\n            pca.partial_fit(chunk_distances)\n            chunk_pca_coords = pca.transform(chunk_distances)<\/pre>\n\n\n\n<p>The beauty of this approach lies in its close agreement with batch PCA while using only a fraction of the memory. 
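As a quick sanity check (synthetic low-rank data, not from the script), fitting chunk by chunk with partial_fit recovers essentially the same variance structure as a single batch decomposition:

```python
import numpy as np
from sklearn.decomposition import PCA, IncrementalPCA

# Synthetic low-rank data standing in for a (frames x distances) matrix
rng = np.random.default_rng(42)
data = rng.normal(size=(2000, 5)) @ rng.normal(size=(5, 50))

# Batch PCA sees all rows at once; IncrementalPCA sees 200-row chunks
batch = PCA(n_components=3).fit(data)
ipca = IncrementalPCA(n_components=3)
for start in range(0, len(data), 200):
    ipca.partial_fit(data[start:start + 200])

# Explained variance ratios agree closely (components match up to sign)
print(batch.explained_variance_ratio_)
print(ipca.explained_variance_ratio_)
```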
The&nbsp;<code>partial_fit<\/code>&nbsp;method allows the PCA model to incrementally update its understanding of the data covariance structure, ensuring that the final principal components capture essentially the same variance relationships as if we had processed the entire dataset at once.<\/p>\n\n\n\n<p>Think of incremental PCA like learning to recognize patterns in a large collection of photographs. Rather than spreading all photos across a warehouse floor to study them simultaneously, you examine them in small, manageable stacks. Each stack teaches you something about the overall patterns, and you update your understanding as you progress. By the end, your knowledge of the complete collection is just as thorough as if you had somehow managed to view everything at once.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\">Mini-Batch K-Means: Clustering Without Memory Constraints<\/h2>\n\n\n\n<p>Once we&#8217;ve reduced our trajectory data to a manageable number of principal components, we face the clustering challenge. Traditional k-means algorithms require loading all data points into memory simultaneously and computing distances to all cluster centers at each iteration. For large trajectory datasets, this can again become prohibitively memory-intensive.<\/p>\n\n\n\n<p>Mini-batch k-means addresses this limitation by updating cluster centers using small, randomly sampled subsets of the data at each iteration. 
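A small illustration of this behaviour (synthetic blobs, not from the script): MiniBatchKMeans recovers near-identical centers to full KMeans while only ever holding one batch of points per update:

```python
import numpy as np
from sklearn.cluster import KMeans, MiniBatchKMeans
from sklearn.datasets import make_blobs

# Synthetic 2-D "PCA coordinates" with four well-separated clusters
coords, _ = make_blobs(n_samples=20000,
                       centers=[[-5, -5], [-5, 5], [5, -5], [5, 5]],
                       cluster_std=0.5, random_state=42)

full = KMeans(n_clusters=4, random_state=42, n_init=10).fit(coords)
mini = MiniBatchKMeans(n_clusters=4, random_state=42,
                       batch_size=1000, n_init=10).fit(coords)

# Sort both sets of centers the same way so they can be compared directly
full_centers = full.cluster_centers_[np.lexsort(full.cluster_centers_.T)]
mini_centers = mini.cluster_centers_[np.lexsort(mini.cluster_centers_.T)]
print(np.abs(full_centers - mini_centers).max())  # small: centers nearly coincide
```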
This approach maintains the convergence properties of standard k-means while dramatically reducing memory requirements and often improving computational speed.<\/p>\n\n\n\n<p>The trajectory clustering script intelligently selects the appropriate clustering algorithm based on dataset size:<\/p>\n\n\n\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"python\" data-enlighter-theme=\"\" data-enlighter-highlight=\"\" data-enlighter-linenumbers=\"true\" data-enlighter-lineoffset=\"\" data-enlighter-title=\"\" data-enlighter-group=\"\">def perform_kmeans_clustering(pca_coords, n_clusters):\n    \"\"\"Perform k-means clustering on PCA coordinates\"\"\"\n    \n    # Use MiniBatchKMeans for large datasets to manage memory\n    if len(pca_coords) > 10000:\n        kmeans = MiniBatchKMeans(n_clusters=n_clusters, random_state=42, batch_size=1000)\n    else:\n        kmeans = KMeans(n_clusters=n_clusters, random_state=42)\n    \n    cluster_labels = kmeans.fit_predict(pca_coords)\n    cluster_centers = kmeans.cluster_centers_\n    \n    return cluster_labels, cluster_centers, kmeans<\/pre>\n\n\n\n<p>The mini-batch approach works by maintaining running estimates of cluster centers and updating them with each batch of data points. Instead of recalculating centers using all data points at each iteration, the algorithm updates centers based on the current batch, weighted by the number of samples previously assigned to each cluster. This creates a natural learning rate that allows the algorithm to adapt quickly to new information while maintaining stability as more data is processed.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\">Intelligent Memory Management: Processing What You Need, When You Need It<\/h2>\n\n\n\n<p>Beyond algorithmic improvements, careful memory management can significantly impact the feasibility of large-scale trajectory analysis. 
The key principle is to avoid loading more data into memory than absolutely necessary at any given moment, while ensuring that data access patterns remain efficient.<\/p>\n\n\n\n<p>The trajectory clustering script demonstrates this principle through its chunked processing approach:<\/p>\n\n\n\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"python\" data-enlighter-theme=\"\" data-enlighter-highlight=\"\" data-enlighter-linenumbers=\"true\" data-enlighter-lineoffset=\"\" data-enlighter-title=\"\" data-enlighter-group=\"\">    for chunk_start in tqdm(range(0, n_frames, chunk_size), desc=\"Processing frame chunks\"):\n        chunk_end = min(chunk_start + chunk_size, n_frames)\n        chunk_indices = frame_indices[chunk_start:chunk_end]\n        chunk_size_actual = len(chunk_indices)\n\n        # Pre-allocate array just for this chunk of frames\n        chunk_distances = np.zeros((chunk_size_actual, n_distances))\n\n        # Process each frame in the chunk\n        for i, frame_idx in enumerate(chunk_indices):\n            # Go to the specific frame\n            universe.trajectory[frame_idx]\n\n            # Get atom positions for this frame\n            positions = atoms.positions\n\n            # Calculate pairwise distances for this frame\n            frame_distances = pdist(positions, metric=\"euclidean\")\n\n            # Store the distances for this frame in the chunk array\n            chunk_distances[i] = frame_distances\n\n        # Partial fit PCA with this chunk\n        if chunk_start == 0:\n            # For the first chunk, we need to fit and transform\n            chunk_pca_coords = pca.fit_transform(chunk_distances)\n        else:\n            # For subsequent chunks, we partial_fit and transform\n            pca.partial_fit(chunk_distances)\n            chunk_pca_coords = pca.transform(chunk_distances)\n\n        # Store the PCA coordinates for this chunk\n        pca_coords[chunk_start:chunk_end] = chunk_pca_coords\n\n        # Free memory 
by deleting the chunk distances\n        del chunk_distances<\/pre>\n\n\n\n<p>This approach processes trajectory frames in small groups, calculates the required distances, feeds them to the incremental PCA algorithm, and then explicitly frees the memory before moving to the next chunk. The&nbsp;<code>del<\/code>&nbsp;statement drops the last reference to the chunk array so that Python&#8217;s reference counting can reclaim the memory immediately, preventing unused arrays from accumulating across iterations.<\/p>\n\n\n\n<p>For even larger datasets, memory-mapped files offer another powerful strategy. NumPy&#8217;s&nbsp;<code>memmap<\/code>&nbsp;functionality allows you to work with arrays that appear to be in memory but are actually stored on disk, with the operating system handling data transfer as needed. While not implemented in this particular script, memory mapping can be invaluable for trajectories that exceed available RAM:<\/p>\n\n\n\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"python\" data-enlighter-theme=\"\" data-enlighter-highlight=\"\" data-enlighter-linenumbers=\"true\" data-enlighter-lineoffset=\"\" data-enlighter-title=\"\" data-enlighter-group=\"\"># Example of memory-mapped array usage (not from the script)\nlarge_array = np.memmap('trajectory_data.dat', dtype='float32',\n                        mode='r+', shape=(n_frames, n_features))<\/pre>\n\n\n\n<p>Memory mapping works particularly well for trajectory analysis because access patterns are often sequential or localized, allowing the operating system to efficiently cache relevant portions of the data.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\">An Illustrative Example<\/h2>\n\n\n\n<p>To illustrate the power of these combined approaches, consider a typical use case: clustering a 500,000-frame trajectory of a 150-residue protein. Using traditional methods, storing pairwise CA distances would require approximately 45 GB of memory (500,000 frames \u00d7 11,175 distances per frame \u00d7 8 bytes per double). 
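The arithmetic behind this estimate, and the per-chunk footprint at the script's default chunk size, is easy to check (note the full-matrix total works out to roughly 45 GB):

```python
# Back-of-the-envelope memory estimate for the example above
n_residues = 150
n_frames = 500_000
bytes_per_double = 8
chunk_size = 100  # the script's default chunk size

n_distances = n_residues * (n_residues - 1) // 2  # condensed pairwise distances
full_bytes = n_frames * n_distances * bytes_per_double
chunk_bytes = chunk_size * n_distances * bytes_per_double

print(n_distances)          # 11175
print(full_bytes / 1e9)     # 44.7 GB for the full distance matrix
print(chunk_bytes / 1e6)    # 8.94 MB per processed chunk
```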
The trajectory clustering script reduces this to one small in-memory chunk at a time (under 10 MB at the default chunk size of 100 frames), processing the data incrementally while achieving near-identical results.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\">Conclusions and Future Directions<\/h2>\n\n\n\n<p>Memory-efficient trajectory clustering represents more than just a technical optimization\u2014it democratizes access to sophisticated conformational analysis for researchers working with large simulation datasets. By combining incremental PCA, mini-batch clustering, and intelligent memory management, we can perform rigorous analysis on trajectories that would otherwise require prohibitively expensive computational resources.<\/p>\n\n\n\n<p>This workflow can be adapted to other dimensionality reduction techniques such as t-SNE and PHATE. The difference for these approaches is that you would first compute a linear PCA and then run the non-linear methods on those coordinates. Because incremental PCA is linear, it is possible to transform during fitting for most protein ensembles. For proteins with variable descriptions (i.e. differing structural coordinate variance and covariance properties), one should fit across the whole dataset first and then transform. The memory-mapped files shown here would be ideal for that two-pass strategy.<\/p>\n\n\n\n<p>The need for this second pass can often be avoided by increasing batch sizes or by ensuring adequate mixing of the training data. Here we show mini-batch k-means, but HDBSCAN is an alternative that requires fewer tuning iterations to reach robust parameters. This may not reduce memory further but may reduce the total computational load.<\/p>\n\n\n\n<p>As molecular simulations continue to grow in scale and ambition, these memory-efficient approaches will become increasingly essential. 
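The fit-first-then-transform variant described above could look something like the following sketch (the file path, shapes, and random data are illustrative assumptions, not part of the script):

```python
import os
import tempfile

import numpy as np
from sklearn.decomposition import IncrementalPCA

n_frames, n_features, chunk = 5_000, 300, 500

# Illustrative memory-mapped store standing in for precomputed per-frame distances
path = os.path.join(tempfile.mkdtemp(), "distances.dat")
distances = np.memmap(path, dtype="float32", mode="w+", shape=(n_frames, n_features))
distances[:] = np.random.default_rng(0).random((n_frames, n_features))

ipca = IncrementalPCA(n_components=10)

# Pass 1: fit on the whole dataset, one chunk at a time
for start in range(0, n_frames, chunk):
    ipca.partial_fit(distances[start:start + chunk])

# Pass 2: transform every chunk with the fully fitted model
coords = np.vstack([ipca.transform(distances[start:start + chunk])
                    for start in range(0, n_frames, chunk)])
print(coords.shape)  # (5000, 10)
```

Because every transform uses the final fitted model, all frames are projected onto the same components, which matters when variance structure differs across the ensemble.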
The principles demonstrated in this trajectory clustering script\u2014processing data incrementally, using online algorithms, and managing memory explicitly\u2014provide a foundation for tackling even larger challenges in computational structural biology.<\/p>\n\n\n\n<p>The next time you face a trajectory that seems too large to analyze, remember that size doesn&#8217;t have to be a barrier. With the right algorithmic tools and memory management strategies, even the most ambitious datasets can yield their secrets, one manageable chunk at a time.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\">Complete script:<\/h2>\n\n\n\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"python\" data-enlighter-theme=\"\" data-enlighter-highlight=\"\" data-enlighter-linenumbers=\"\" data-enlighter-lineoffset=\"\" data-enlighter-title=\"\" data-enlighter-group=\"\">\"\"\"\nThis script takes in a topology and a list of trajectories and k-means clusters the ensemble down to the specified number of clusters.\n\nThe script takes in the following args:\n\n- topology_path (str)\n- trajectory_paths (list[str])\n- atom_selection (str, default: \"name CA\")\n- chunk_size (int, default: 100) # chunk size for memory efficiency when calculating pairwise coords and running PCA\n- number_of_clusters (int, default: 500)\n- num_components (int, default: 10)\n- output_dir (str, default: next to script inside a directory labelled with the topology name, number of clusters and the time)\n- save_pdbs (bool, default: False) # Whether to save individual PDB files for each cluster\n- log (bool, default: True)\n\nThe script performs a memory-efficient PCA on the pairwise coordinates using pdist. Using these reduced coordinates the clusters are picked with k-means.\n\nThe PCA is then comprehensively and professionally plotted using a contour map alongside a scatter plot of the clusters. 
The clusters are then saved to a file in the output directory as xtc.\n\nThe script also saves the arguments and logging information using a logger.\n\n\"\"\"\n\nimport argparse\nimport datetime\nimport logging\nimport os\n\nimport matplotlib.pyplot as plt\nimport MDAnalysis as mda\nimport numpy as np\nimport seaborn as sns\nfrom matplotlib.gridspec import GridSpec\nfrom scipy.spatial.distance import pdist\nfrom sklearn.cluster import KMeans, MiniBatchKMeans\nfrom sklearn.decomposition import IncrementalPCA\nfrom tqdm import tqdm\n\n\ndef setup_logger(log_file):\n    \"\"\"Set up the logger to output to both file and console\"\"\"\n    logger = logging.getLogger()\n    logger.setLevel(logging.INFO)\n\n    # Clear existing handlers if any\n    for handler in logger.handlers[:]:\n        logger.removeHandler(handler)\n\n    # File handler\n    file_handler = logging.FileHandler(log_file)\n    file_formatter = logging.Formatter(\"%(asctime)s - %(levelname)s - %(message)s\")\n    file_handler.setFormatter(file_formatter)\n    logger.addHandler(file_handler)\n\n    # Console handler\n    console_handler = logging.StreamHandler()\n    console_handler.setFormatter(file_formatter)\n    logger.addHandler(console_handler)\n\n    return logger\n\n\ndef parse_arguments():\n    \"\"\"Parse command line arguments\"\"\"\n    parser = argparse.ArgumentParser(description=\"Cluster molecular dynamics trajectories\")\n    parser.add_argument(\"--topology_path\", type=str, required=True, help=\"Path to topology file\")\n    parser.add_argument(\n        \"--trajectory_paths\", nargs=\"+\", type=str, required=True, help=\"Paths to trajectory files\"\n    )\n    parser.add_argument(\n        \"--atom_selection\",\n        type=str,\n        default=\"name CA\",\n        help='Atom selection string (default: \"name CA\")',\n    )\n    parser.add_argument(\n        \"--chunk_size\",\n        type=int,\n        default=100,\n        help=\"Chunk size for memory efficiency (default: 
100)\",\n    )\n    parser.add_argument(\n        \"--number_of_clusters\",\n        type=int,\n        default=500,\n        help=\"Number of clusters for k-means (default: 500)\",\n    )\n    parser.add_argument(\n        \"--num_components\", type=int, default=10, help=\"Number of PCA components (default: 10)\"\n    )\n    parser.add_argument(\n        \"--output_dir\",\n        type=str,\n        default=None,\n        help=\"Output directory (default: auto-generated based on topology name and time)\",\n    )\n    parser.add_argument(\n        \"--save_pdbs\",\n        action=\"store_true\",\n        help=\"Save individual PDB files for each cluster (default: False)\",\n    )\n    parser.add_argument(\n        \"--log\", action=\"store_true\", default=True, help=\"Enable logging (default: True)\"\n    )\n\n    return parser.parse_args()\n\n\ndef calculate_pairwise_rmsd(universe, selection, chunk_size):\n    \"\"\"Calculate pairwise distances between atoms within each frame\"\"\"\n    logger.info(\"Calculating pairwise atomic coordinate distances within frames...\")\n\n    # Select atoms for distance calculation\n    atoms = universe.select_atoms(selection)\n    n_frames = len(universe.trajectory)\n    n_atoms = atoms.n_atoms\n    n_distances = n_atoms * (n_atoms - 1) \/\/ 2\n    logger.info(f\"Selected {n_atoms} atoms, processing {n_frames} frames\")\n\n    # Pre-allocate array for all pairwise distances across all frames\n    all_distances = np.zeros((n_frames, n_distances))\n\n    # Process each frame\n    for i, ts in enumerate(tqdm(universe.trajectory, desc=\"Processing frames\")):\n        # Get atom positions for this frame\n        positions = atoms.positions\n\n        # Calculate pairwise distances for this frame only\n        frame_distances = pdist(positions, metric=\"euclidean\")\n\n        # Store the distances for this frame\n        all_distances[i] = frame_distances\n\n    logger.info(f\"Generated distances for {all_distances.shape[0]} 
frames\")\n\n    return all_distances\n\n\ndef perform_pca_on_distances(distances, num_components):\n    \"\"\"Perform PCA on the distance matrix using IncrementalPCA with appropriate batch size\"\"\"\n    logger.info(f\"Performing PCA with {num_components} components...\")\n\n    # Set batch size to be at least 10 times the number of components as suggested\n    ipca_batch_size = max(num_components * 10, 100)\n\n    # Initialize IncrementalPCA with proper batch size\n    pca = IncrementalPCA(n_components=num_components, batch_size=ipca_batch_size)\n\n    # Fit PCA model and transform the data\n    pca_coords = pca.fit_transform(distances)\n\n    logger.info(f\"Explained variance ratio: {pca.explained_variance_ratio_}\")\n    logger.info(f\"Total variance explained: {sum(pca.explained_variance_ratio_):.2%}\")\n\n    return pca_coords, pca\n\n\ndef perform_kmeans_clustering(pca_coords, n_clusters):\n    \"\"\"Perform k-means clustering on PCA coordinates\"\"\"\n    logger.info(f\"Performing k-means clustering with {n_clusters} clusters...\")\n\n    # Use MiniBatchKMeans for large datasets\n    if len(pca_coords) > 10000:\n        kmeans = MiniBatchKMeans(n_clusters=n_clusters, random_state=42, batch_size=1000)\n    else:\n        kmeans = KMeans(n_clusters=n_clusters, random_state=42)\n\n    cluster_labels = kmeans.fit_predict(pca_coords)\n    cluster_centers = kmeans.cluster_centers_\n\n    # Count frames per cluster\n    unique_labels, counts = np.unique(cluster_labels, return_counts=True)\n\n    logger.info(f\"Clustering complete: {len(unique_labels)} clusters\")\n    logger.info(f\"Average frames per cluster: {np.mean(counts):.1f}\")\n    logger.info(f\"Min frames per cluster: {np.min(counts)}\")\n    logger.info(f\"Max frames per cluster: {np.max(counts)}\")\n\n    return cluster_labels, cluster_centers, kmeans\n\n\ndef create_publication_plots(pca_coords, cluster_labels, cluster_centers, pca, output_dir):\n    \"\"\"Create publication-quality plots of the PCA 
results and clustering\"\"\"\n    plots_dir = os.path.join(output_dir, \"plots\")\n    os.makedirs(plots_dir, exist_ok=True)\n\n    # Set style for publication-quality plots\n    plt.style.use(\"default\")  # Use the default style or choose another valid style\n    plt.rcParams[\"font.family\"] = \"sans-serif\"\n    plt.rcParams[\"font.sans-serif\"] = [\"Arial\", \"Helvetica\", \"DejaVu Sans\"]\n    plt.rcParams[\"font.size\"] = 12\n    plt.rcParams[\"axes.linewidth\"] = 1.5\n    plt.rcParams[\"axes.edgecolor\"] = \"black\"\n\n    # 1. Create main PCA plot with clusters\n    logger.info(\"Creating PCA projection plot with clusters...\")\n\n    # Figure setup with gridspec for complex layout\n    fig = plt.figure(figsize=(18, 15))\n    gs = GridSpec(3, 3, figure=fig, height_ratios=[1, 3, 1])\n\n    # Main PCA scatter plot\n    ax_main = fig.add_subplot(gs[1, :2])\n\n    # Calculate point density for contour plot\n    x, y = pca_coords[:, 0], pca_coords[:, 1]\n    sns.kdeplot(x=x, y=y, ax=ax_main, levels=20, cmap=\"Blues\", fill=True, alpha=0.5, zorder=0)\n\n    # Scatter plot of frames colored by cluster\n    scatter = ax_main.scatter(\n        x, y, c=cluster_labels, cmap=\"viridis\", s=20, alpha=0.7, zorder=10, edgecolor=\"none\"\n    )\n\n    # Plot cluster centers\n    ax_main.scatter(\n        cluster_centers[:, 0],\n        cluster_centers[:, 1],\n        c=\"red\",\n        s=80,\n        marker=\"X\",\n        edgecolors=\"black\",\n        linewidths=1.5,\n        zorder=20,\n        label=\"Cluster Centers\",\n    )\n\n    # Labels and title\n    variance_pc1 = pca.explained_variance_ratio_[0] * 100\n    variance_pc2 = pca.explained_variance_ratio_[1] * 100\n    ax_main.set_xlabel(f\"PC1 ({variance_pc1:.1f}% variance)\", fontsize=14)\n    ax_main.set_ylabel(f\"PC2 ({variance_pc2:.1f}% variance)\", fontsize=14)\n    ax_main.set_title(\"PCA Projection with K-means Clustering\", fontsize=16, pad=20)\n\n    # Add histograms for PC1 distribution\n    ax_top = 
fig.add_subplot(gs[0, :2], sharex=ax_main)\n    sns.histplot(x, kde=True, ax=ax_top, color=\"darkblue\", alpha=0.6)\n    ax_top.set_ylabel(\"Density\", fontsize=12)\n    ax_top.set_title(\"PC1 Distribution\", fontsize=14)\n    ax_top.tick_params(labelbottom=False)\n\n    # Add histograms for PC2 distribution\n    ax_right = fig.add_subplot(gs[1, 2], sharey=ax_main)\n    sns.histplot(y=y, kde=True, ax=ax_right, color=\"darkblue\", alpha=0.6, orientation=\"horizontal\")\n    ax_right.set_xlabel(\"Density\", fontsize=12)\n    ax_right.set_title(\"PC2 Distribution\", fontsize=14)\n    ax_right.tick_params(labelleft=False)\n\n    # Create explained variance ratio plot\n    ax_var = fig.add_subplot(gs[2, :])\n    components = range(1, len(pca.explained_variance_ratio_) + 1)\n    cumulative = np.cumsum(pca.explained_variance_ratio_)\n\n    # Plot individual and cumulative explained variance\n    bars = ax_var.bar(\n        components, pca.explained_variance_ratio_, color=\"steelblue\", alpha=0.7, label=\"Individual\"\n    )\n\n    ax_var2 = ax_var.twinx()\n    line = ax_var2.plot(\n        components,\n        cumulative,\n        \"o-\",\n        color=\"firebrick\",\n        linewidth=2.5,\n        markersize=8,\n        label=\"Cumulative\",\n    )\n\n    # Add explained variance labels\n    ax_var.set_xlabel(\"Principal Component\", fontsize=14)\n    ax_var.set_ylabel(\"Explained Variance Ratio\", fontsize=14)\n    ax_var2.set_ylabel(\"Cumulative Explained Variance\", fontsize=14)\n    ax_var.set_title(\"Explained Variance by Principal Components\", fontsize=16, pad=20)\n\n    # Set x-axis to integers\n    ax_var.set_xticks(components)\n    ax_var2.set_ylim([0, 1.05])\n\n    # Combine legends\n    lines, labels = ax_var.get_legend_handles_labels()\n    lines2, labels2 = ax_var2.get_legend_handles_labels()\n    ax_var.legend(lines + lines2, labels + labels2, loc=\"upper left\", fontsize=12)\n\n    # Add colorbar for cluster labels\n    cbar_ax = fig.add_axes([0.92, 
0.15, 0.02, 0.7])\n    cbar = plt.colorbar(scatter, cax=cbar_ax)\n    cbar.set_label(\"Cluster Label\", fontsize=14, labelpad=15)\n\n    # Save the figure\n    plt.tight_layout()\n    pca_plot_path = os.path.join(plots_dir, \"pca_clusters.png\")\n    plt.savefig(pca_plot_path, dpi=300, bbox_inches=\"tight\")\n    plt.close()\n\n    # 2. Create 3D PCA plot if we have at least 3 components\n    if pca_coords.shape[1] >= 3:\n        logger.info(\"Creating 3D PCA plot...\")\n        fig = plt.figure(figsize=(12, 10))\n        ax = fig.add_subplot(111, projection=\"3d\")\n\n        scatter = ax.scatter(\n            pca_coords[:, 0],\n            pca_coords[:, 1],\n            pca_coords[:, 2],\n            c=cluster_labels,\n            cmap=\"viridis\",\n            s=30,\n            alpha=0.7,\n        )\n\n        ax.scatter(\n            cluster_centers[:, 0],\n            cluster_centers[:, 1],\n            cluster_centers[:, 2],\n            c=\"red\",\n            s=100,\n            marker=\"X\",\n            edgecolors=\"black\",\n            linewidths=1.5,\n        )\n\n        variance_pc3 = pca.explained_variance_ratio_[2] * 100\n        ax.set_xlabel(f\"PC1 ({variance_pc1:.1f}% variance)\", fontsize=12)\n        ax.set_ylabel(f\"PC2 ({variance_pc2:.1f}% variance)\", fontsize=12)\n        ax.set_zlabel(f\"PC3 ({variance_pc3:.1f}% variance)\", fontsize=12)\n        ax.set_title(\"3D PCA Projection with Clusters\", fontsize=16)\n\n        plt.colorbar(scatter, ax=ax, label=\"Cluster Label\")\n        plt.tight_layout()\n\n        pca_3d_path = os.path.join(plots_dir, \"pca_3d.png\")\n        plt.savefig(pca_3d_path, dpi=300, bbox_inches=\"tight\")\n        plt.close()\n\n    # 3. 
Create cluster size distribution plot\n    logger.info(\"Creating cluster size distribution plot...\")\n    unique_labels, counts = np.unique(cluster_labels, return_counts=True)\n\n    plt.figure(figsize=(14, 8))\n    sns.histplot(counts, kde=True, color=\"steelblue\")\n    plt.axvline(np.mean(counts), color=\"red\", linestyle=\"--\", label=f\"Mean: {np.mean(counts):.1f}\")\n    plt.axvline(\n        np.median(counts), color=\"green\", linestyle=\"--\", label=f\"Median: {np.median(counts):.1f}\"\n    )\n\n    plt.xlabel(\"Frames per Cluster\", fontsize=14)\n    plt.ylabel(\"Frequency\", fontsize=14)\n    plt.title(\"Distribution of Cluster Sizes\", fontsize=16)\n    plt.legend(fontsize=12)\n    plt.tight_layout()\n\n    cluster_dist_path = os.path.join(plots_dir, \"cluster_distribution.png\")\n    plt.savefig(cluster_dist_path, dpi=300)\n    plt.close()\n\n    return {\n        \"pca_plot\": pca_plot_path,\n        \"pca_3d_plot\": pca_3d_path if pca_coords.shape[1] >= 3 else None,\n        \"cluster_dist\": cluster_dist_path,\n    }\n\n\ndef save_cluster_trajectories(\n    universe, cluster_labels, pca_coords, cluster_centers, output_dir, save_pdbs=False\n):\n    \"\"\"Save cluster trajectories to a single XTC file and optionally save representative PDB files\n\n    Parameters:\n    -----------\n    universe : MDAnalysis.Universe\n        The universe containing all frames\n    cluster_labels : numpy.ndarray\n        Array of cluster labels for each frame\n    pca_coords : numpy.ndarray\n        PCA coordinates for each frame\n    cluster_centers : numpy.ndarray\n        K-means computed cluster centers in PCA space\n    output_dir : str\n        Output directory path\n    save_pdbs : bool, optional\n        Whether to save individual PDB files for each cluster representative\n    \"\"\"\n    logger.info(\"Saving cluster trajectories...\")\n\n    clusters_dir = os.path.join(output_dir, \"clusters\")\n    os.makedirs(clusters_dir, exist_ok=True)\n\n    # Get unique 
clusters\n    unique_clusters = np.unique(cluster_labels)\n    n_clusters = len(unique_clusters)\n\n    # Find the frame in each cluster that is closest to the true cluster center\n    representative_frames = {}\n\n    for cluster_idx in unique_clusters:\n        # Get indices of frames in this cluster\n        cluster_mask = cluster_labels == cluster_idx\n        if not np.any(cluster_mask):\n            continue\n\n        cluster_frame_indices = np.where(cluster_mask)[0]\n\n        # Get PCA coordinates for frames in this cluster\n        cluster_pca_coords = pca_coords[cluster_mask]\n\n        # Get the center for this cluster\n        center = cluster_centers[cluster_idx]\n\n        # Calculate distances from each frame to the cluster center\n        distances = np.sqrt(np.sum((cluster_pca_coords - center) ** 2, axis=1))\n\n        # Find the frame with minimum distance to center\n        min_dist_idx = np.argmin(distances)\n\n        # Get the original frame index\n        representative_frame_idx = cluster_frame_indices[min_dist_idx]\n\n        # Store the representative frame\n        representative_frames[cluster_idx] = representative_frame_idx\n\n    # Create a single trajectory file with cluster centers only\n    all_clusters_file = os.path.join(clusters_dir, \"all_clusters.xtc\")\n    with mda.Writer(all_clusters_file, universe.atoms.n_atoms) as writer:\n        # Go through each cluster and save only the representative frame\n        for cluster_idx in tqdm(unique_clusters, desc=\"Saving cluster centers\"):\n            if cluster_idx in representative_frames:\n                frame_idx = representative_frames[cluster_idx]\n                universe.trajectory[frame_idx]\n                writer.write(universe.atoms)\n\n    # Optionally save representative frames as PDB files\n    if save_pdbs:\n        for cluster_idx, frame_idx in tqdm(representative_frames.items(), desc=\"Saving PDB files\"):\n            universe.trajectory[frame_idx]\n            
with mda.Writer(\n                os.path.join(clusters_dir, f\"cluster_{cluster_idx}_rep.pdb\")\n            ) as pdb_writer:\n                pdb_writer.write(universe.atoms)\n\n    # Also save a CSV file mapping frame to cluster\n    frame_to_cluster = np.column_stack((np.arange(len(cluster_labels)), cluster_labels))\n    np.savetxt(\n        os.path.join(clusters_dir, \"frame_to_cluster.csv\"),\n        frame_to_cluster,\n        delimiter=\",\",\n        header=\"frame_index,cluster_label\",\n        fmt=\"%d\",\n        comments=\"\",\n    )\n\n    logger.info(\n        f\"Saved {n_clusters} true cluster centers to a single trajectory file: {all_clusters_file}\"\n    )\n    if save_pdbs:\n        logger.info(f\"Saved {len(representative_frames)} representative PDB files\")\n    logger.info(\"Saved frame-to-cluster mapping to frame_to_cluster.csv\")\n\n\ndef calculate_distances_and_perform_pca(universe, selection, num_components, chunk_size):\n    \"\"\"Calculate pairwise distances and perform incremental PCA in chunks\"\"\"\n    logger.info(\"Calculating pairwise distances and performing incremental PCA...\")\n\n    # Select atoms for distance calculation\n    atoms = universe.select_atoms(selection)\n    n_frames = len(universe.trajectory)\n    n_atoms = atoms.n_atoms\n    n_distances = n_atoms * (n_atoms - 1) \/\/ 2\n\n    logger.info(f\"Selected {n_atoms} atoms, processing {n_frames} frames\")\n    logger.info(f\"Each frame will generate {n_distances} pairwise distances\")\n\n    # Initialize IncrementalPCA\n    ipca_batch_size = max(num_components * 10, 100)\n    pca = IncrementalPCA(n_components=num_components, batch_size=ipca_batch_size)\n\n    # Process frames in chunks\n    frame_indices = np.arange(n_frames)\n\n    # Store PCA coordinates for all frames\n    pca_coords = np.zeros((n_frames, num_components))\n\n    for chunk_start in tqdm(range(0, n_frames, chunk_size), desc=\"Processing frame chunks\"):\n        chunk_end = min(chunk_start + 
chunk_size, n_frames)\n        chunk_indices = frame_indices[chunk_start:chunk_end]\n        chunk_size_actual = len(chunk_indices)\n\n        # Pre-allocate array just for this chunk of frames\n        chunk_distances = np.zeros((chunk_size_actual, n_distances))\n\n        # Process each frame in the chunk\n        for i, frame_idx in enumerate(chunk_indices):\n            # Go to the specific frame\n            universe.trajectory[frame_idx]\n\n            # Get atom positions for this frame\n            positions = atoms.positions\n\n            # Calculate pairwise distances for this frame\n            frame_distances = pdist(positions, metric=\"euclidean\")\n\n            # Store the distances for this frame in the chunk array\n            chunk_distances[i] = frame_distances\n\n        # First pass: fit only. Transforming here would project early chunks\n        # onto a partially fitted basis, so their coordinates would not be\n        # comparable with those of later chunks\n        pca.partial_fit(chunk_distances)\n\n        # Free memory by deleting the chunk distances\n        del chunk_distances\n\n    # Second pass: recompute distances chunk by chunk and project every frame\n    # onto the final, fully fitted components. This doubles the trajectory\n    # reads but keeps the memory footprint constant\n    for chunk_start in tqdm(range(0, n_frames, chunk_size), desc=\"Projecting frame chunks\"):\n        chunk_end = min(chunk_start + chunk_size, n_frames)\n        chunk_indices = frame_indices[chunk_start:chunk_end]\n        chunk_distances = np.zeros((len(chunk_indices), n_distances))\n\n        for i, frame_idx in enumerate(chunk_indices):\n            universe.trajectory[frame_idx]\n            chunk_distances[i] = pdist(atoms.positions, metric=\"euclidean\")\n\n        # Store the PCA coordinates for this chunk\n        pca_coords[chunk_start:chunk_end] = pca.transform(chunk_distances)\n        del chunk_distances\n\n    logger.info(f\"Completed incremental PCA with {num_components} components\")\n    logger.info(f\"Explained variance ratio: {pca.explained_variance_ratio_}\")\n    logger.info(f\"Total variance explained: {sum(pca.explained_variance_ratio_):.2%}\")\n\n    return pca_coords, pca\n\n\ndef main():\n    # Parse command line arguments\n    args = parse_arguments()\n\n    # Start timer\n    start_time = datetime.datetime.now()\n\n    # Set up output directory\n    if args.output_dir is None:\n        topology_name = os.path.splitext(os.path.basename(args.topology_path))[0]\n        timestamp = 
datetime.datetime.now().strftime(\"%Y%m%d-%H%M%S\")\n        args.output_dir = os.path.join(\n            os.path.dirname(os.path.abspath(__file__)),\n            f\"{topology_name}_clusters{args.number_of_clusters}_{timestamp}\",\n        )\n\n    # Create the output directory\n    os.makedirs(args.output_dir, exist_ok=True)\n\n    # Set up logging\n    global logger\n    log_file = os.path.join(args.output_dir, \"cluster_trajectory.log\")\n    logger = setup_logger(log_file)\n\n    logger.info(\"=\" * 80)\n    logger.info(\"Starting trajectory clustering\")\n    logger.info(\"=\" * 80)\n    logger.info(f\"Arguments: {vars(args)}\")\n\n    # Load universe\n    logger.info(f\"Loading topology from {args.topology_path}\")\n    logger.info(f\"Loading trajectories: {args.trajectory_paths}\")\n    universe = mda.Universe(args.topology_path, *args.trajectory_paths)\n    logger.info(\n        f\"Loaded universe with {len(universe.atoms)} atoms and {len(universe.trajectory)} frames\"\n    )\n\n    # Calculate distances and perform PCA\n    pca_coords, pca = calculate_distances_and_perform_pca(\n        universe, args.atom_selection, args.num_components, args.chunk_size\n    )\n\n    # Perform k-means clustering\n    cluster_labels, cluster_centers, kmeans = perform_kmeans_clustering(\n        pca_coords, args.number_of_clusters\n    )\n\n    # Create publication-quality plots\n    plot_paths = create_publication_plots(\n        pca_coords, cluster_labels, cluster_centers, pca, args.output_dir\n    )\n\n    # Save cluster trajectories\n    save_cluster_trajectories(\n        universe,\n        cluster_labels,\n        pca_coords,\n        cluster_centers,\n        args.output_dir,\n        args.save_pdbs,\n    )\n\n    # Save analysis data\n    data_dir = os.path.join(args.output_dir, \"data\")\n    os.makedirs(data_dir, exist_ok=True)\n\n    np.save(os.path.join(data_dir, \"pca_coordinates.npy\"), pca_coords)\n    np.save(os.path.join(data_dir, \"cluster_labels.npy\"), 
cluster_labels)\n    np.save(os.path.join(data_dir, \"cluster_centers.npy\"), cluster_centers)\n\n    # End timer and report\n    end_time = datetime.datetime.now()\n    elapsed = end_time - start_time\n    logger.info(\"=\" * 80)\n    logger.info(f\"Clustering complete. Results saved to {args.output_dir}\")\n    logger.info(f\"Total execution time: {elapsed}\")\n    logger.info(\"=\" * 80)\n\n\nif __name__ == \"__main__\":\n    main()\n<\/pre>\n","protected":false},"excerpt":{"rendered":"<p>Molecular dynamics simulations have grown increasingly ambitious, with researchers routinely generating trajectories containing hundreds of thousands or even millions of frames. While this wealth of data offers unprecedented insights into protein dynamics, it also presents a formidable computational challenge: how do you extract meaningful conformational clusters from datasets that can easily exceed available system memory? [&hellip;]<\/p>\n","protected":false},"author":115,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"nf_dc_page":"","wikipediapreview_detectlinks":true,"_monsterinsights_skip_tracking":false,"_monsterinsights_sitenote_active":false,"_monsterinsights_sitenote_note":"","_monsterinsights_sitenote_category":0,"ngg_post_thumbnail":0,"_jetpack_memberships_contains_paid_content":false,"footnotes":""},"categories":[361,621,296,647,228,227],"tags":[430],"ppma_author":[725],"class_list":["post-12603","post","type-post","status-publish","format-standard","hentry","category-data-science","category-data-visualization","category-hints-and-tips","category-molecular-dynamics","category-protein-structure","category-python-code","tag-clustering"],"jetpack_featured_media_url":"","jetpack_sharing_enabled":true,"authors":[{"term_id":725,"user_id":115,"is_guest":0,"slug":"alexi","display_name":"Alexi 
Hussain","avatar_url":"https:\/\/secure.gravatar.com\/avatar\/aa505acb767aa3aa80da3b68d4684bf5d5064ac6aed2a01206a61383a967535a?s=96&d=mm&r=g","0":null,"1":"","2":"","3":"","4":"","5":"","6":"","7":"","8":""}],"_links":{"self":[{"href":"https:\/\/www.blopig.com\/blog\/wp-json\/wp\/v2\/posts\/12603","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.blopig.com\/blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.blopig.com\/blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.blopig.com\/blog\/wp-json\/wp\/v2\/users\/115"}],"replies":[{"embeddable":true,"href":"https:\/\/www.blopig.com\/blog\/wp-json\/wp\/v2\/comments?post=12603"}],"version-history":[{"count":2,"href":"https:\/\/www.blopig.com\/blog\/wp-json\/wp\/v2\/posts\/12603\/revisions"}],"predecessor-version":[{"id":12606,"href":"https:\/\/www.blopig.com\/blog\/wp-json\/wp\/v2\/posts\/12603\/revisions\/12606"}],"wp:attachment":[{"href":"https:\/\/www.blopig.com\/blog\/wp-json\/wp\/v2\/media?parent=12603"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.blopig.com\/blog\/wp-json\/wp\/v2\/categories?post=12603"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.blopig.com\/blog\/wp-json\/wp\/v2\/tags?post=12603"},{"taxonomy":"author","embeddable":true,"href":"https:\/\/www.blopig.com\/blog\/wp-json\/wp\/v2\/ppma_author?post=12603"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}