Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Make unused replications warning more informative #2128

Open
wants to merge 4 commits into
base: master
Choose a base branch
from

Conversation

martsc1
Copy link

@martsc1 martsc1 commented Nov 7, 2022

Enhancement.
Tested once without creating errors. Imo this adds useful additional information to the warning.

@robertoostenveld
Copy link
Contributor

thanks for the suggestion.

Please use ft_warning with sprintf-like argument, e.g., like this

  ft_warning('Using only %d out of %d replications for the computation of the statistic', repsused, repstotal);

When I run the following

>> cfg = [];
>> cfg.ivar = 1
>> ft_statfun_indepsamplesT(cfg, randn(1,10), [1 1 1 2 2 2 2 3 3 3])

I get Warning: Using only 0 out of 1 replications for the computation of the statistic! whereas I had expected that it would tell me that I used 7 out of 10. So the counts don't appear to be correct.

@robertoostenveld
Copy link
Contributor

I believe it should be

 ft_warning('Using only %d out of %d replications for the computation of the statistic', nrepl, size(design,2));

@schoffelen
Copy link
Contributor

schoffelen commented Dec 12, 2022

Hi Martin @martsc1 , could you make the change suggested by Robert, and test it? If it works as intended we can merge this PR!

@martsc1
Copy link
Author

martsc1 commented Dec 12, 2022

I get the following error when I try to use the suggested changes with a data matrix instead of a vector.
I'm not sure however, if that is a bug or I misunderstood how the function is supposed to work.

cfg = [];
cfg.ivar = 1;
ft_statfun_indepsamplesT(cfg, rand(10,10), [1 1 1 2 2 2 2 3 3 3 ])

Error using warning
Formatted arguments must be scalar.

Error in ft_notification (line 354)
warning(msgId, varargin{:});

Error in ft_warning (line 63)
ft_notification(varargin{:});

Perhaps this could be a fix:
ft_warning('Using only %d out of %d replications for the computation of the statistic', nrepl(1), size(design,2));

This doesn't give me an error for either a data vector or matrix, and gives the expected result.

@schoffelen
Copy link
Contributor

I see, good point. The statement is only strictly valid once all(nrepl==nrepl(1)). For now it is safe to assume this I think

@martsc1
Copy link
Author

martsc1 commented Dec 12, 2022

I think it is actually not safe to assume that. If the data contains NaN values for instance.

I'm a bit confused now about what is being checked at this point in the function.
Is it the number of used units of observations (participants for example), or the number of valid data points?
For example, is it checking whether for each point in the design matrix there is some valid data, or out of all the data points, how many are valid.

So in the above example with the random 10x10 matrix, 7 out 10 units of observations are used, but 70 out of 70 data points are used. Which of the two should be checked here?

@schoffelen
Copy link
Contributor

the data matrix in the ft_statfun_ is always datapoint vs. observation. The datapoint dimension is a combination of spatial, spectral (frequencies), and time points. In some situations, e.g. a TFR with a frequency-varying time-window, it could be that there are some observations that contain a nan (i.e. corresponding to a time point at which no estimate could be made due to not enough data in the time window). For other frequency bins this data could be available. This would only be an issue I think, once the statistics are performed within a single subjects, across trials. If it's a 'second level' test, when the observation dimension reflects subjects, then I think it is (at this moment) safe to assume: once-a-nan, always-a-nan.

we could consider to replace the nrepl(1) in the warning to mean(nrepl) and then replacing 'Using only', with Using on average'

@martsc1
Copy link
Author

martsc1 commented Dec 12, 2022

I still don't understand whether a replication refers to a datapoint or an observatoin?

@schoffelen
Copy link
Contributor

replication = observation, can be trial or subject, and is defined in the columns

@martsc1
Copy link
Author

martsc1 commented Dec 12, 2022

Thanks for the clarification!
I have a dataset with powerspectrum data and a lot of NaN's due to the situation you described above. For each replication I have either 0 or the full number of observations, so in that case both nrepl(1) or mean(nrepl) would give misleading information. I don't know how common my situation is though, and whether that is a problem with my data rather than with the function.

@martsc1
Copy link
Author

martsc1 commented Dec 14, 2022

How about this solution:

if any(nrepl<size(design,2))
    if ismember(0,nrepl)
        datapused=sum(nrepl==size(design,2));
        ft_warning('Using only %d out of %d datapoints for the computation of the statistic!', datapused, numel(nrepl));
        if any(nrepl(nrepl<size(design,2))>0)
            ft_warning('Using between %d and %d out of %d replications for the computation of the statistic',min(nrepl(nrepl>0)), max(nrepl), size(design,2));
        end
    else
        ft_warning('Using between %d and %d out of %d replications for the computation of the statistic', min(nrepl), max(nrepl), size(design,2));
    end
end

Now the function would warn about both unused replications and unused datapoints.
Alternatively, if this feels to clunky, a more minimalistic solution would be to not make any changes to the code but only change the warning text to 'Not all replications or datapoints are used for the computation of the statistic.'

@schoffelen
Copy link
Contributor

Hi @martsc1 thanks for your perseverance, I don't mind being informative, but the risk exists that people get annoyed by the amount of warnings on their screen -> ft_statfun_ will be called many times when running permutation statistics, and that will lead to many warnings, unless the user's settings are tweaked a bit to suppress repeating messages. I don't know from the top of my head what the default is. But let's not worry about that here.

May I suggest a slight change of logic? You propose a nested series of if statements, which I think can be simplified to 2 non-nested if statements:

if any(nrepl==0)
% here I would change the message a bit: "computing the test statistic for %d out of %d datapoints"
end

and

if any(nrepl<size(design,2)
end

@martsc1
Copy link
Author

martsc1 commented Dec 15, 2022

I'm happy with any suggestions. I think we need to modify it slightly to prevent getting replication warnings when not all datapoints are used, but all replications are. Maybe this could work:

if any(nrepl==0)
        datapused=sum(nrepl==size(design,2));
        ft_warning('Computing the test statistic for %d out of %d datapoints', datapused, numel(nrepl));
end

if any(nrepl(nrepl>0)<size(design,2))
       ft_warning('Using between %d and %d out of %d replications for the computation of the statistic',min(nrepl(nrepl>0)), max(nrepl), size(design,2));
end

martsc1 and others added 2 commits January 24, 2023 09:59
Updated changes based on commit discussion.
@schoffelen
Copy link
Contributor

I think this is almost ready to be merged, were it not for the fact that for overall code consistency we might want to consider to also add similar heuristics to the other (in)depsamples statfun (F and T)

@schoffelen
Copy link
Contributor

Hi Martin, @martsc1 would you be up for making similar suggestions for the other 'statfuns', so that we can push this forward, and don't lose the work that you have spent on this up until now?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants