Tuesday, January 1, 2013

Java : find union and intersection of two sorted arrays


Java : find union and intersection of two sorted arrays


This problem essentially reduces to a join operation and then a filter operation (to remove duplicates and only keep inner matches).
As the inputs are both already sorted, the join can be efficiently achieved through a merge join, with O(size(a) + size(b)).
The filter operation will be O(n) because the output of the join is sorted and to remove duplicates all you have to do is check if the each element is the same as the one before it. Filtering only the inner matches is trivial, you just discard any elements that were not matched (the outer joins).
There are opportunities for parallelism (both in the join and filter) to achieve better performance. For example the Apache Pig framework on Hadoop offers a parallel implementation of a merge join.
There are obvious trade-offs between performance and complexity (and thus maintainability). So I would say a good answer to a interview question really needs to take account of the performance demands.
  • Set based comparison - O(nlogn) - Relatively slow, very simple, use if there are no performance concerns. Simplicity wins.
  • Merge join + Filter - O(n) - Fast, prone to coding error, use if performance is an issue. Ideally try to leverage an existing library to do this, or perhaps even use a database if appropriate.
  • Parallel Implementation - O(n/p) - Very fast, requires other infrastructure in place, use if the volume is very large and anticipated to grow and this is a major performance bottleneck.
(Also note that the function in the question intersectSortedArrays is essentially a modified merge join, where the filter is done during the join. You can filter afterwards at no performance loss, although a slightly increased memory footprint).
Final thought.
In fact, I suspect most modern commercial RDBMSs offer thread parallelism in their implementation of joins, so what the Hadoop version offers is machine-level parallelism (distribution). From a design point of view, perhaps a good, simple solution to the question is to put the data on a database, index on A and B (effectively sorting the data) and use an SQL inner join.


No comments:

Post a Comment