Monday, June 5, 2017

Finding intersections of an array in matrix rows more efficiently

Let's imagine a case where we want to find the number of matching elements of an array in each row of a large matrix.

In this example, we have such as matrix containing one million rows, each having six columns. Each row in the matrix contains unique random integer values between 1 and 10.

We'll collect the number of the matching needle elements in each row of the haystack into matchcount. The number of the individual elements matching to the matrix rows is also collected into numcount.

haystack=zeros(1000000,6);        
for i=1:1000000                   
    haystack(i,:)=randperm(10,6); 
end                               
needle     = [1 2 3 4 5];         
matchcount = zeros(5, 1);         
numcount   = zeros(5, 1);         

There doesn't seem to be good vectorized method to find the intersections, so we'll use for loops in all the three examples.

In the first code, we'll use intersect() to get the matching indices for each row. Intersect() does not require the inputs to be sorted, so it works out-of-the-box in our case. Intersect returns the common elements to the needle and the rows of the haystack. We'll keep count of these matches.

for i = 1:length(haystack)                            
    match_ids = intersect(needle, haystack(i, :));    
    matches = length(match_ids);                      
    if matches > 0                                    
        matchcount(matches) = matchcount(matches) + 1;
        numcount(match_ids) = numcount(match_ids) + 1;
    end                                               
end                                                   


Execution of this code takes almost one minute; the average for three runs was 58.08 seconds.

The results of a single run look like this:

>> disp([1:5;numcount';matchcount'])
           1           2           3           4           5
      599684      600619      599554      599802      601223
       23761      237749      476269      238289       23932


To get the same results, we can also use ismember() which returns the logical array of the locations of the matching elements instead of the actual elements. This means we have to change the counting of matches from length() to sum():

for i = 1:length(haystack)                            
    match_ids = ismember(needle, haystack(i, :));     
    matches = sum(match_ids);                         
    if matches > 0                                    
        matchcount(matches) = matchcount(matches) + 1;
        numcount(match_ids) = numcount(match_ids) + 1;
    end                                               
end                                                   

This method is already much faster; average execution time for three runs was now only 15.13 seconds. The results are still looking correct:

>> disp([1:5;numcount';matchcount'])
           1           2           3           4           5
      600486      600681      599275      599443      599843
       23841      238526      475544      238242       23847

To make the computation even faster, we have to identify the bottlenecks and remove as many of them as possible. To find out where ismember() uses the most of the time, we can use profile(). Using MATLAB Profiler will be discussed more in some later post; now we will get straight to the findings.

It can be seen in the Profile report, that most of the time is used to verify the input arguments and branch to the correct parts of the code based on the types of the arguments. As we know the types of the arguments and we can be sure they do not contain NaNs or anything else requiring additional operations, we can skip most of the code.

Actually we can skip all the MATLAB code and go straight to an undocumented builtin function which is called deep in the ismember(). This undocumented function is _ismemberhelper() and it needs to be called with a special function builtin(). This function requires the arguments to be sorted, so we need additional sort() for the haystack:

for i = 1:length(haystack)                                               
    match_ids = builtin('_ismemberhelper', needle, sort(haystack(i, :)));
    matches = sum(match_ids);                                            
    if matches > 0                                                       
        matchcount(matches) = matchcount(matches) + 1;                   
        numcount(match_ids) = numcount(match_ids) + 1;                   
    end                                                                  
end                                                                      

For this code, the execution time reduced to 7.69 seconds.

And the results still look correct:

>> disp([1:5;numcount';matchcount'])
           1           2           3           4           5
      599938      599039      599923      600895      599592
       23879      237804      476875      237935       23507

To summarize the results, here are the execution times for all the three methods:

  • intersect()               - 58.08 seconds
  • ismember()             - 15.13 seconds
  • _ismemberhelper() -  7.69 seconds

No comments:

Post a Comment