

IBM Researchers Maximize Apache Spark’s Capabilities

IBM Austin Research Laboratory
Illustration by Lonnie Busch

Problem sizes are growing exponentially for developers, data scientists and end users. For many applications, keeping the entire working set of data in main memory isn’t feasible, for several reasons: memory density isn’t increasing at the same rate as other technologies, and DRAM is expensive. Even in the long term, the problem sizes businesses face will outgrow the rate at which main memory becomes available. As a result, developers must be deliberate about which data to keep in main memory and which to place on a secondary storage system.

Testing, Testing

IBM is also conducting trials of CAPI Flash with industries, such as transportation and advertising, that face large data problems and whose performance suffers when main memory is no longer sufficient to hold the data being processed.

“CAPI Flash has been developed as a more efficient secondary storage solution,” Rellermeyer says. “We’re working with companies that already have lots of data and are collecting raw data every day. We can now offer them a Spark-customized solution that allows them to do clever things, such as move data from memory into an extension, which enables them to correlate certain behaviors and gain insights from their data in ways they’ve not done before.

“In the future, CAPI Flash will be applied to more generic systems up to the point that maybe we’ll ultimately support it as part of the memory hierarchy through the OS; then every application should benefit from it.”

Accelerating Workloads

Meanwhile, researchers at the IBM Thomas J. Watson Research Center in Yorktown Heights, New York, have investigated accelerating Spark’s workloads using GPUs. The new GPU-enabled Spark ecosystem enables businesses to exploit the high performance of GPUs without changing their Spark programs. GPUs can be viewed as massively parallel supercomputers on a chip and are increasingly used to accelerate workloads with numerically intensive computations (e.g., matrix computations) using their hardware resources. A GPU’s high memory bandwidth also makes it ideal for accelerating memory-intensive applications, such as graph analytics.

GPUs help companies make use of big data: businesses can provide services to customers in real time without requiring multiple servers. Rajesh Bordawekar, a member of IBM’s research team that specializes in GPU and multicore acceleration for analytics and data management, explains how to get the best of both worlds with GPU-enabled Spark:

“Spark provides an easy programming model for distributing data and for in-memory computations. GPUs enable parallel execution of numerically and memory-intensive workloads. As there’s no change of code, developers can write code and accelerate it transparently using the GPUs for the workloads suited for GPU acceleration (e.g., machine learning, deep learning, graph analytics and data management).”
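The “no change of code” point Bordawekar describes can be illustrated with a short sketch. All names here are hypothetical, not IBM’s actual GPU-enabled Spark API: the idea is simply that application code calls the same high-level operation, while the runtime decides whether a numerically intensive kernel, such as a matrix multiply, runs on a GPU or falls back to the CPU.

```python
# Illustrative sketch only: function names are hypothetical and do not
# reflect IBM's GPU-enabled Spark implementation. The application calls one
# API; the backend choice (GPU vs. CPU) is invisible to it.

def matmul_cpu(a, b):
    """Plain nested-loop matrix multiply -- the CPU fallback path."""
    n, k, m = len(a), len(b), len(b[0])
    return [[sum(a[i][p] * b[p][j] for p in range(k)) for j in range(m)]
            for i in range(n)]

def gpu_available():
    # A real runtime would probe for accelerator devices here; this is a stub.
    return False

def matmul(a, b):
    """The API the application calls; backend selection is transparent."""
    if gpu_available():
        # A real runtime would launch a GPU kernel for this operation.
        raise NotImplementedError("GPU path not modeled in this sketch")
    return matmul_cpu(a, b)

# Application code stays identical whether or not a GPU is present.
result = matmul([[1, 2], [3, 4]], [[5, 6], [7, 8]])
print(result)  # [[19, 22], [43, 50]]
```

The design point is that acceleration lives behind the API boundary, so existing Spark programs need no rewriting to benefit when suitable hardware is present.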

Juliet Stott is a freelance journalist based in York, England.


