Uniform Hashing of Arbitrary Input Into Key-Exclusive Segments by Paul Dorfman and Don Henderson
Wed, Jun 12 | Webinar
Don Henderson presents a method for using hash functions to split an arbitrarily large dataset into manageable chunks for processing.
Time & Location
Jun 12, 2024, 12:00 PM – 1:00 PM EDT
Webinar
Aggregating or combining large data volumes can challenge computing resources. For example, the process may run up against system limits on utility space or memory and, as a result, either fail or run too long to be useful. It is natural to try solving the problem by dividing the input records into a number of smaller segments, processing them independently, and combining the results. However, for such a divide-and-conquer tactic to work, two seemingly contradictory criteria must be met: first, to aggregate or combine the data correctly, no segment can share any key value with any other segment; and second, the segments must be more or less equal in size. In this presentation, we show how a hash function can be used to satisfy both criteria for arbitrary input, with no prior knowledge of how the key values are distributed among its records. Effectively, the method makes any task of aggregating or combining data of any size feasible, provided the input is split into a large enough number of segments. The segments can then be processed sequentially or in parallel. The trade-off is the need to partially re-read the data. However, that is a rather small price to pay for making a failing or endlessly running task finish on time.
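As a rough illustration of the idea described above, here is a minimal Python sketch. It is not the presenters' SAS implementation. The function names `segment_of` and `aggregate_by_segments`, the choice of MD5 as the hash function, and the (key, amount) record layout are all assumptions made for the example. Because the segment number depends only on the key, every record with a given key lands in the same segment, and a well-mixing hash spreads distinct keys roughly evenly across segments.

```python
import hashlib


def segment_of(key: str, n_segments: int) -> int:
    """Map a key to a segment number in the range [0, n_segments).

    Assumption: MD5 is used only as a well-mixing hash, not for security.
    """
    digest = hashlib.md5(key.encode("utf-8")).digest()
    # Interpret the first 8 bytes of the digest as an unsigned integer.
    return int.from_bytes(digest[:8], "big") % n_segments


def aggregate_by_segments(records, n_segments):
    """Sum amounts per key, processing one segment at a time.

    `records` is a hypothetical re-iterable sequence of (key, amount) pairs.
    It is scanned once per segment, which stands in for re-reading the input.
    """
    totals = {}
    for seg in range(n_segments):
        partial = {}  # holds only the keys belonging to the current segment
        for key, amount in records:
            if segment_of(key, n_segments) == seg:
                partial[key] = partial.get(key, 0) + amount
        # Segments are key-exclusive, so merging never overwrites a key.
        totals.update(partial)
    return totals
```

Calling `aggregate_by_segments(records, n_segments)` on a list of pairs should give the same per-key totals as a single-pass aggregation, while keeping only about one segment's worth of keys in memory at a time. In this sketch the segments are processed sequentially; since they share no keys, they could equally be handed to independent workers.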
Bios:
Paul Dorfman is an Independent Consultant. He specializes in developing SAS software solutions, from ad hoc programming to building complete data management systems, in a range of industries such as telecom, banking, pharmaceutical, and retail. A native of Ukraine, Paul started using SAS while pursuing his degree in physics in the late 1980s. In 1998, he pioneered the use of hash algorithms in SAS programming by designing a set of hash routines based on SAS arrays. With the advent of the SAS hash object, Paul was the first to use it practically and to author a SUGI white paper on the subject. In the process, he introduced hash object techniques for metadata-based parameter type matching, sorting, unduplication, filtering, data aggregation, dynamic file splitting, and memory usage optimization. Paul has presented papers at global, regional, and local SAS conferences and meetings annually since 1998.
Don Henderson is now enjoying retirement. Don has used SAS software since 1975, designing and developing business applications with a focus on data warehouse, business intelligence, and analytic applications. Don was one of the primary architects in the initial development and release of SAS/IntrNet software in 1996, and he was one of the original developers for the SAS/IntrNet Application Dispatcher. Don has presented numerous papers at SUGI and regional SAS user group meetings, and continues to be a great supporter of SAS and its products.