TY - JOUR
T1 - TeraHeap
T2 - Exploiting Flash Storage for Mitigating DRAM Pressure in Managed Big Data Frameworks
AU - Kolokasis, Iacovos G.
AU - Evdorou, Giannos
AU - Akram, Shoaib
AU - Kozanitis, Christos
AU - Papagiannis, Anastasios
AU - Zakkak, Foivos S.
AU - Pratikakis, Polyvios
AU - Bilas, Angelos
N1 - Publisher Copyright: © 2024 Copyright held by the owner/author(s).
PY - 2024/12/13
Y1 - 2024/12/13
AB - Big data analytics frameworks, such as Spark and Giraph, need to process and cache massive datasets that do not always fit on the managed heap. Therefore, frameworks temporarily move long-lived objects outside the heap (off-heap) on a fast storage device. However, this practice results in (1) high serialization/deserialization (S/D) cost and (2) high memory pressure when off-heap objects are moved back for processing. In this article, we propose TeraHeap, a system that eliminates S/D overhead and expensive GC scans for a large portion of objects in analytics frameworks. TeraHeap relies on three concepts: (1) It eliminates S/D by extending the managed runtime (JVM) to use a second high-capacity heap (H2) over a fast storage device. (2) It offers a simple hint-based interface, allowing analytics frameworks to leverage object knowledge to populate H2. (3) It reduces GC cost by fencing the collector from scanning H2 objects while maintaining the illusion of a single managed heap, ensuring memory safety. We implement TeraHeap in OpenJDK8 and OpenJDK17 and evaluate it with fifteen widely used applications in two real-world big data frameworks, Spark and Giraph. We find that for the same DRAM size, TeraHeap improves performance by up to 73% and 28% compared to native Spark and Giraph, respectively. Also, it can still provide better performance by consuming up to and less DRAM than native Spark and Giraph, respectively. TeraHeap can also be used for in-memory frameworks, and applying it to the Neo4j Graph Data Science library improves its performance by up to 26%. Finally, it outperforms Panthera, a state-of-the-art garbage collector for hybrid DRAM-NVM memories, by up to 69%.
KW - fast storage devices
KW - garbage collection
KW - Java Virtual Machine (JVM)
KW - large analytics
KW - large managed heaps
KW - memory hierarchy
KW - memory management
KW - serialization
UR - http://www.scopus.com/inward/record.url?scp=86000552868&partnerID=8YFLogxK
U2 - 10.1145/3700593
DO - 10.1145/3700593
M3 - Article
AN - SCOPUS:86000552868
SN - 0164-0925
VL - 46
JO - ACM Transactions on Programming Languages and Systems
JF - ACM Transactions on Programming Languages and Systems
IS - 4
M1 - 12
ER -