David Santos on 12 Aug 2019

Edited: David Santos on 14 Aug 2019

I'm trying to calculate a percentile across a large number of files (25000 or more), each containing a 4x1 cell array that holds 4 maps, i.e. four 1483x2824 matrices.

I'm using tall arrays, following the documentation example "Percentiles of Tall Matrix Along Different Dimensions":

tic

%start local pool for multithreading

c=parcluster('local');

c.NumWorkers=20;

parpool(c, c.NumWorkers);

folder='/home/temporal2/dsantos/mat/*.mat'; %more than 25000 files

A=ones(1483,2824,2); %aux matrix to establish the prctile data type

y=tall(A);

%datastore of files containing a 4x1 cell of 1483x2824 maps

ds=fileDatastore(folder,'ReadFcn',@loadPrc,'FileExtensions','.mat','UniformRead', true)

t=tall(ds);

%fill the aux tall array with each map in the correct format

for i=1:25000

y(:,:,i)=t(1+(i-1)*1483:1483*i,:);

end

%calculate the percentile

p90_1=prctile(y,90,3)

P90_1=gather(p90_1);

save('/home/temporal2/dsantos/p90_1.mat','P90_1','-v7.3');

toc

But it seems that tall arrays won't work for this because I get the error:

Warning: Error encountered during preview of tall array 'p90_1'. Attempting to

gather 'p90_1' will probably result in an error. The error encountered was:

Requested 500025x500025 (1862.8GB) array exceeds maximum array size preference.

Creation of arrays greater than this limit may take a long time and cause

MATLAB to become unresponsive. See <a href="matlab: helpview([docroot

'/matlab/helptargets.map'], 'matlab_env_workspace_prefs')">array size limit</a>

or preference panel for more information.

> In tall/display (line 21)

p90_1 =

MxNx... tall array

? ? ? ...

? ? ? ...

? ? ? ...

: : :

: : :

>> Error using digraph/distances (line 72)

Internal problem while evaluating tall expression. The problem was:

Requested 500028x500028 (1862.9GB) array exceeds maximum array size preference.

Creation of arrays greater than this limit may take a long time and cause

MATLAB to become unresponsive. See <a href="matlab: helpview([docroot

'/matlab/helptargets.map'], 'matlab_env_workspace_prefs')">array size limit</a>

or preference panel for more information.

Error in

matlab.bigdata.internal.lazyeval.LazyPartitionedArray>iGenerateMetadata (line 756)

allDistances = distances(cg.Graph);

Error in

matlab.bigdata.internal.lazyeval.LazyPartitionedArray>iGenerateMetadataFillingPartitionedArrays (line 739)

[metadatas, partitionedArrays] = iGenerateMetadata(inputArrays, executorToConsider);

Error in ...

Error in tall/gather (line 50)

[varargout{:}] = iGather(varargin{:});

Caused by:

Error using matlab.internal.graph.MLDigraph/bfsAllShortestPaths

Requested 500028x500028 (1862.9GB) array exceeds maximum array size

preference. Creation of arrays greater than this limit may take a long time

and cause MATLAB to become unresponsive. See <a href="matlab:

helpview([docroot '/matlab/helptargets.map'],

'matlab_env_workspace_prefs')">array size limit</a> or preference panel for

more information.

Any clue on how to solve this problem?

All the best

### Answers (2)

Edric Ellis on 13 Aug 2019

That particular error is an internal error basically because your tall array expression is simply too large - contains too many expressions. tall arrays operate by building up a symbolic representation of all the expressions you've evaluated, and then running them all together when you call gather. Because you've got a for loop over 25000 elements, this symbolic representation is large - too large to be evaluated. tall arrays are basically not designed to be looped over in this way. Instead, you need to express your program in terms of a smaller number of vectorised operations.
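As a minimal illustration of that deferred evaluation, here is a sketch on a small in-memory stand-in (the matrix `A` and the variable names are assumptions for the example, not the poster's data). Each call on the tall array only adds to the pending expression, and a single `gather` evaluates everything together:

```matlab
% Sketch: a few vectorised tall operations evaluated by one gather call.
A = rand(1000, 3);          % small in-memory stand-in for the real data
t = tall(A);                % tall array backed by A

colMax = max(t, [], 1);     % deferred: nothing is computed yet
colMin = min(t, [], 1);     % deferred as well

% Both pending results are evaluated together here.
[colMax, colMin] = gather(colMax, colMin);
```

The pending expression here stays at a couple of operations regardless of how much data sits behind `t`, whereas a loop of 25000 slice assignments adds to it on every iteration.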

I would proceed in the following manner (I can't be more specific since your problem statement isn't executable - see this page on tips regarding making a minimal reproduction):

- Have your loadPrc return a 4 × 1483 × 2824 numeric matrix (rather than a cell array)
- Your corresponding tall array t will then be 25000 × 1483 × 2824
- Instead of the for loop, simply call prctile in dimension 1

ds = fileDatastore();

t = tall(ds);

p90_1=prctile(t,90,1);

P90_1=gather(p90_1);

% and then perhaps

P90_1 = shiftdim(P90_1, 1)

David Santos on 13 Aug 2019

Thanks a lot for your answer Edric!

I'm not sure how to solve point 1. Here's my simplified loadPrc:

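function dataOut=loadPrc(filename)

s=load(filename); %load returns a struct with one field per saved variable

names=fieldnames(s); %assumption: each file stores a single variable

data=s.(names{1}); %data is a 4x1 cell: 4 frequency maps of 1483x2824 points

dataOut=data{1}; %only the first frequency map for the moment: a 1483x2824 matrix

end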

How can I modify this to match your proposal?

I've now tried this on my server, and because it runs version 2017a, "'UniformRead', true" is not working, so dataOut is always a cell. Can I get a numeric matrix somehow?

On the other hand, if I calculate the percentile of just one frequency map (as in loadPrc above), dataOut is going to be a 2-D matrix, not 3-D. I'm doing this because if I join the 4 frequencies, dataOut becomes 4x1483x2824, and then how can I calculate the percentile of each frequency? Maybe I can do:

p90_1=prctile(t(1:4:end,:,:),90,1);

P90_1=gather(p90_1);

p90_2=prctile(t(2:4:end,:,:),90,1);

P90_2=gather(p90_2);

?

All the best

David Santos on 13 Aug 2019

Edited: David Santos on 13 Aug 2019

· About point 3: trying it on my local computer, I get:

p90_1=prctile(t,90,1);

P90_1=gather(p90_1);

Error using tall/prctile (line 56)

Percentile or quantile of a tall array in the first dimension is only supported for tall column vectors.

· After more testing (you can reproduce this with any set of big matrices), I think that when you call:

P90_1=gather(p90_1)

gather needs to load the entire t matrix into memory.

Edric Ellis on 14 Aug 2019

Ah, sorry, I hadn't realised that prctile in the tall dimension supports only vectors. Hm, this might turn out to be trickier than I thought. In fact, I'm not sure I know how to do this using tall arrays.

Let me just confirm that I got the basics of your problem correct - you do want to compute percentiles individually for each 1483x2824 element - so 4187992 percentiles down vectors of length 25000.

It may be that tall arrays aren't the right tool in this case - at the very least, I think it will be necessary to "transpose" the data so that you can load a handful of 25000-element vectors in memory at a time and call prctile on those in sequence (perhaps even in parallel if you have Parallel Computing Toolbox).
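One way to read the "transpose" idea is to work through the maps in blocks of rows: for each block, read only those rows from every file, stack them along the third dimension, and call prctile once on the stack. The sketch below is self-contained and uses small synthetic files. It assumes each file stores a plain numeric variable, here given the hypothetical name `map`, because matfile can read part of a numeric array but cannot index into the contents of a cell array without loading it. The sizes, file names and `blockRows` are illustrative only.

```matlab
% Sketch: block-wise 90th percentile across many files, a few rows at a time.
nFiles = 6; nRows = 8; nCols = 5; blockRows = 3;   % toy sizes
tmpDir = tempname; mkdir(tmpDir);
ref = zeros(nRows, nCols, nFiles);       % kept only to check the result
for k = 1:nFiles
    map = rand(nRows, nCols);            % stands in for one real map
    ref(:,:,k) = map;
    save(fullfile(tmpDir, sprintf('f%03d.mat', k)), 'map', '-v7.3');
end

files = dir(fullfile(tmpDir, '*.mat'));
p90 = zeros(nRows, nCols);
for r0 = 1:blockRows:nRows
    r1 = min(r0 + blockRows - 1, nRows);
    block = zeros(r1 - r0 + 1, nCols, numel(files));
    for k = 1:numel(files)
        m = matfile(fullfile(tmpDir, files(k).name));
        block(:,:,k) = m.map(r0:r1, 1:nCols);   % partial read of this row block
    end
    p90(r0:r1, :) = prctile(block, 90, 3);      % percentile across files
end
```

The trade-off is that every block touches every file once, so a larger `blockRows` means fewer passes over the files but more memory per block. Each block is independent of the others, so the outer loop is a candidate for parfor.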

David Santos on 14 Aug 2019

Thanks for your answer!

- With the approach from the beginning of my question (the auxiliary tall array that arranges the data into the 3-D layout prctile expects), I was able to calculate prctile over the 3rd dimension of the tall array of size 1483x2824x25000 like this:

p90_1=prctile(t,90,3);

P90_1=gather(p90_1);

The problem was that at the end, when I called gather, MATLAB needed to load the entire array into memory, and it is always too big. I think tall arrays won't work because of this. It would be great to be able to load only the p90_1 variable into memory instead of the entire (400 GB) t matrix.

- Yes, you got it right: I want to compute percentiles individually for each element of the 1483x2824 map. What you propose could be a solution, but even with parallel processing (40 cores) it would imply a lot of file loading, wouldn't it? I will run a small test and see what happens.

- What about other ways? Mapreduce? Using big MAT-files on disk? Approximations to the percentile such as the P2 algorithm?
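As one possible direction for the approximation route, the sketch below keeps a fixed-bin histogram per pixel and fills it in a single pass over the maps, then reads the percentile off the cumulative counts. This is a plain histogram approximation, not the P2 algorithm, and it assumes the values fall in a known range `[lo, hi]`. The bounds, bin count and toy sizes are hypothetical, and `rand` stands in for loading each file.

```matlab
% Sketch: approximate per-pixel 90th percentile from one pass over the maps.
rng(0);
nRows = 6; nCols = 4; nMaps = 500;       % toy sizes
lo = 0; hi = 1; nBins = 200;             % assumed value range and resolution
counts = zeros(nRows*nCols, nBins, 'uint16');   % uint16 is enough for < 65536 maps
pix = (1:nRows*nCols)';
ref = zeros(nRows, nCols, nMaps);        % kept only to check the result

for k = 1:nMaps
    map = rand(nRows, nCols);            % stands in for loading file k
    ref(:,:,k) = map;
    b = floor((map(:) - lo) / (hi - lo) * nBins) + 1;
    b = min(max(b, 1), nBins);           % clamp out-of-range values
    idx = sub2ind(size(counts), pix, b); % one (pixel, bin) entry per pixel
    counts(idx) = counts(idx) + 1;
end

cum = cumsum(double(counts), 2);
% First bin whose cumulative count reaches 90% of the samples.
[~, binIdx] = max(double(cum >= 0.9 * nMaps), [], 2);
p90approx = reshape(lo + binIdx * (hi - lo) / nBins, nRows, nCols);  % upper bin edge
```

The error is bounded by roughly one bin width, so accuracy is bought with memory: at full size this needs 1483 x 2824 x nBins counters, which at 2 bytes each comes to about 2 GB for 256 bins.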

David Santos on 14 Aug 2019

Edited: David Santos on 14 Aug 2019

SOME TESTING

I did some testing using just 4 maps/files (1483x2824 matrices) with your "slicing" percentile calculation proposal. The first 2 options (using matfile and tall arrays) only calculate the first 10 rows.

%%Using matfile option

tic

folder=dir('matBorrame/*.mat'); %4 files folder

P=zeros(1483,2824,2);

save('P.mat','P','-v7.3');

m=matfile('P.mat','Writable',true);

for i=1:4

fprintf('%d\n',i);

v=load(strcat('matBorrame/',folder(i).name));

id=strcat('l',folder(i).name(1:end-4-7));

m.P(:,:,i)=v.(id){1};

end

p90_1=ones(1483,2824);

for r=1:10 %%

fprintf('ROW:%d\n',r);

for c=1:2824

p90_1(r,c)=prctile(m.P(r,c,:),90);

end

end

save('p5_90_4.mat','p90_1','-v7.3');

toc

%Elapsed time is 190.559574 seconds.%

%%Tall arrays option

tic

ds=fileDatastore('matBorrame','ReadFcn',@loadPrc,'FileExtensions','.mat','UniformRead', true)

t=tall(ds);

A=ones(1483,2824,2); %aux matrix to establish the prctile data format

y=tall(A);

for i=1:4

y(:,:,i)=t(1+(i-1)*1483:1483*i,:);

end

p90_1=ones(1483,2824);

for r=1:10

fprintf('ROW:%d\n',r);

for c=1:2824

aux=squeeze(y(r,c,:));

p90_1(r,c)=gather(prctile(aux,90));

end

end

save('p1_90_4.mat','p90_1','-v7.3');

toc

%Elapsed time is >>5000 s; I stopped it before it finished.

%%In memory option; Processing all rows!!

tic

p90_1=prctile(m.P,90,3);

toc

%Elapsed time is 1.335489 seconds.%

My conclusions:

- Tall arrays are a bad fit for the slicing approach you proposed; it would take forever.

- Using matfile could work, but it is around 4000 times slower than the standard in-memory solution.

- What about other ways? Mapreduce? Approximations to the percentile such as the P2 algorithm? Black magic?

All the best

