Saturday, June 10, 2017

Using MATLAB Profiler to remove function overheads

As we could see in the previous post, there can be a huge amount of overhead in some MATLAB functions.

The functions are usually very generic and can handle lots of different argument types. This adds a lot of branching and validation which causes also a lot of overhead.

In this post we'll go through again the same example as in the previous post, but now we'll see how it's done by using MATLAB Profiler. As you may remember, we were looking for the number of matching elements between the needle (an array of five figures) and each row of the haystack (matrix containing one million rows, each having six figures between one and ten.)

Using intersect() the operation took 58 seconds. We can run the code in the MATLAB Profiler to find out which operations are included in the consumed time. This can be done by clicking "Run and Time" button in the MATLAB Code Editor.




After the code has been executed, we'll get the report from the Profiler. In the report, we'll see which lines have used the most of the time and which child functions are the most demanding.


It's quite clear that the most of the time was used in the intersect() function:

This can be seen also in another view having all the code lines, the number of calls and the amount of consumed time shown:


As we can see in the column which contains the consumed time, running the code via the MATLAB Profiler makes the execution a little bit slower. Instead of 58 seconds the execution took now 117 seconds, but we can still see the parts that have consumed the biggest amount of the execution time.

By clicking the intersect() in the Profiler view, we'll get the similar result of the consumed time inside the intersect().

Intersect() itself is a simple wrapper function. It checks the input arguments and calls a subfunction intersectR2012a() or intersectlegacy() with some option flags depending on the arguments. In this case the called function was intersectR2012a() as can be seen in the Profiler report:


Some part of the time was used in the validation of the arguments and parameters, but the most of the time was used inside the intersectR2012a(). As we click that function in the Profiler view, we'll get again a new report page:


This far, all the extra time was consumed by overheads caused by argument checking and the branching. These lines can be ignored if we know which branch is always used for our arguments.

After ignoring all the extra lines, we can see that the time consuming part of the remaining lines is mainly split between two different functions; ismember() and unique(). As we know from our data, calling the unique() is not required, so we can save a lot of time by calling ismember() only.


Also, we don't need to index our needle with the output of ismember() as the intersectR2012a() is doing. It's enough to call ismember(needle, haystack(row, :)). To verify this, we'll run the commands for a single row of the haystack:



The call with intersect() returned 1 and 3 as the figures of needle found in the haystack. Calling ismember() returned indices 1 and 3 which point to the same figures in the needle.

The modified code looks like this:


As it was mentioned in the previous post, the execution time with ismember() was reduced to 15 seconds. By continuing with Profiler, we can try to find even more overheads to remove:



Again, as we have a simple combination of arguments, the branching in the function is simple and finally all the magic happens in a subfunction called ismemberR2012a(). In that function we'll find more unnecessary overhead. Here we can see one example of such lines which are not required for our known arguments but which seems to consume quite a big part of the execution time:


The most parts of the code in ismemberR2012a() are argument based branching again, so finally we find only one mandatory line:


Another subfunction, so no wonder there's quite a lot of overhead in the function calls. In the ismemberBuiltInTypes() we see that some branching depends on the size of the first argument. In our case, the argument is needle. If the argument has more than six elements, the comparing of the arrays is done with a different method.


Because our needle contains only 5 figures, we would use this branch by default:


This is actually a method that was not used in our previous post. So we'll measure the execution time now.

Here's the code we use to test the performance with any()

for i = 1:length(haystack)
    match_ids = false(length(needle));
    for j = 1:length(needle)
        match_ids(j) = any(needle(j)==haystack(i, :));
    end
    matches = sum(match_ids);    
    if matches > 0
        matchcount(matches) = matchcount(matches) + 1;
        numcount(match_ids) = numcount(match_ids) + 1;
    end
end

As you may remember, the execution time with ismember() was about 15 seconds and with _ismemberhelper() it was just below 8 seconds.

For any(), the execution time was between them; the average time for three runs was 12.88 seconds. So _ismemberhelper() seems to be clearly faster even though our array in the comparation was below the hardcoded branching limit of five elements. The comment of _ismemberhelper() section in the code is the following:


This is why we had the sorting included in the code using _ismemberhelper(). If the haystack would be sorted already before the execution, we could probably save few seconds more as can be seen in the Profiler report of the code having sort() and _ismemberhelper() in separate lines:


As we saw in this optimization experiment, there can be a lot of overhead in the MATLAB functions. Using the correct functions for the operations is already helping a lot (intersect() vs ismember() vs any()), but finding the best method requires effort.

NOTE: If the best method is an undocumented function, it may get deprecated in the future MATLAB versions or the behavior of the function may get changed. Therefore it's not recommended to use the undocumented functions in any production code.

Monday, June 5, 2017

Finding intersections of an array in matrix rows more efficiently

Let's imagine a case where we want to find the number of matching elements of an array in each row of a large matrix.

In this example, we have such as matrix containing one million rows, each having six columns. Each row in the matrix contains unique random integer values between 1 and 10.

We'll collect the number of the matching needle elements in each row of the haystack into matchcount. The number of the individual elements matching to the matrix rows is also collected into numcount.

haystack=zeros(1000000,6);        
for i=1:1000000                   
    haystack(i,:)=randperm(10,6); 
end                               
needle     = [1 2 3 4 5];         
matchcount = zeros(5, 1);         
numcount   = zeros(5, 1);         

There doesn't seem to be good vectorized method to find the intersections, so we'll use for loops in all the three examples.

In the first code, we'll use intersect() to get the matching indices for each row. Intersect() does not require the inputs to be sorted, so it works out-of-the-box in our case. Intersect returns the common elements to the needle and the rows of the haystack. We'll keep count of these matches.

for i = 1:length(haystack)                            
    match_ids = intersect(needle, haystack(i, :));    
    matches = length(match_ids);                      
    if matches > 0                                    
        matchcount(matches) = matchcount(matches) + 1;
        numcount(match_ids) = numcount(match_ids) + 1;
    end                                               
end                                                   


Execution of this code takes almost one minute; the average for three runs was 58.08 seconds.

The results of a single run look like this:

>> disp([1:5;numcount';matchcount'])
           1           2           3           4           5
      599684      600619      599554      599802      601223
       23761      237749      476269      238289       23932


To get the same results, we can also use ismember() which returns the logical array of the locations of the matching elements instead of the actual elements. This means we have to change the counting of matches from length() to sum():

for i = 1:length(haystack)                            
    match_ids = ismember(needle, haystack(i, :));     
    matches = sum(match_ids);                         
    if matches > 0                                    
        matchcount(matches) = matchcount(matches) + 1;
        numcount(match_ids) = numcount(match_ids) + 1;
    end                                               
end                                                   

This method is already much faster; average execution time for three runs was now only 15.13 seconds. The results are still looking correct:

>> disp([1:5;numcount';matchcount'])
           1           2           3           4           5
      600486      600681      599275      599443      599843
       23841      238526      475544      238242       23847

To make the computation even faster, we have to identify the bottlenecks and remove as many of them as possible. To find out where ismember() uses the most of the time, we can use profile(). Using MATLAB Profiler will be discussed more in some later post; now we will get straight to the findings.

It can be seen in the Profile report, that most of the time is used to verify the input arguments and branch to the correct parts of the code based on the types of the arguments. As we know the types of the arguments and we can be sure they do not contain NaNs or anything else requiring additional operations, we can skip most of the code.

Actually we can skip all the MATLAB code and go straight to an undocumented builtin function which is called deep in the ismember(). This undocumented function is _ismemberhelper() and it needs to be called with a special function builtin(). This function requires the arguments to be sorted, so we need additional sort() for the haystack:

for i = 1:length(haystack)                                               
    match_ids = builtin('_ismemberhelper', needle, sort(haystack(i, :)));
    matches = sum(match_ids);                                            
    if matches > 0                                                       
        matchcount(matches) = matchcount(matches) + 1;                   
        numcount(match_ids) = numcount(match_ids) + 1;                   
    end                                                                  
end                                                                      

For this code, the execution time reduced to 7.69 seconds.

And the results still look correct:

>> disp([1:5;numcount';matchcount'])
           1           2           3           4           5
      599938      599039      599923      600895      599592
       23879      237804      476875      237935       23507

To summarize the results, here are the execution times for all the three methods:

  • intersect()               - 58.08 seconds
  • ismember()             - 15.13 seconds
  • _ismemberhelper() -  7.69 seconds

Make MATLAB Great Again!

This new blog is about optimizing MATLAB codes, finding the ways to remove the overheads and getting the most out of the MATLAB.

The authors of the blog are highly experienced MATLAB users. They have participated many demanding projects requiring MATLAB optimization skills in the single computer solutions as well as in the large scale distributed computing.

In the posts, we will discuss about tools and tips to find the bottlenecks of the MATLAB codes, and try to find the alternative methods to solve some common problems in a faster way.

We hope that you'll find the posts useful, and that there will be good discussions about the topics. We are also very pleased to receive any new discussion items from the readers of the blog.

Yours,
Matlab User