Add simple C14 evaluation cumulative timer #40
Conversation
Use the existing timer type, but fix the uninitialised components. The timer lifetime and output are tied to the model: when the model is loaded, the timer resets; when the model is destroyed, the measurement is printed to STDOUT. Note that the timer may have somewhat high overhead (it is processor dependent, but on Linux with gfortran it involves a syscall to getrusage). Hence it is unsuitable for use inside the 'hot loop'. It should be good enough, though, to get a rough estimate of the time taken by the whole inference loop.
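For orientation, a minimal sketch of what such a cumulative timer can look like in Fortran. This is not the actual CLUBB timer code: the names follow the usage visible in this PR (timer_start, timer_end, the time_elapsed component of C14_timer_total), but the body is illustrative and uses the portable cpu_time intrinsic rather than getrusage:

```fortran
module c14_timer_sketch
  implicit none
  private
  public :: timer_t, timer_start, timer_end

  type :: timer_t
    ! Default-initialised, so the uninitialised-component bug
    ! mentioned above cannot occur
    real :: time_elapsed = 0.0
    real :: t_start      = 0.0
  end type timer_t

contains

  subroutine timer_start(timer)
    type(timer_t), intent(inout) :: timer
    call cpu_time(timer % t_start)
  end subroutine timer_start

  subroutine timer_end(timer)
    ! Accumulate the interval since the matching timer_start
    type(timer_t), intent(inout) :: timer
    real :: t_end
    call cpu_time(t_end)
    timer % time_elapsed = timer % time_elapsed + (t_end - timer % t_start)
  end subroutine timer_end

end module c14_timer_sketch
```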
```fortran
call torch_delete( C14_neural_net )

! Print the time
write(unit=fstdout, fmt='(a,g,a)') "C14 total evaluation time: ", C14_timer_total % time_elapsed, " [s]"
```
To discuss: I couldn't really decide on the format for the measurement. If I were to follow the CLUBB-TIMER convention it should be f10.4. I went with the general edit descriptor since I thought we don't really care, and the Fortran library can decide what is the most appropriate format for the given value.
I don't have a strong opinion on this. The only issue I could see with f10.4 is if we test models that take only milliseconds or hours to evaluate, both of which I doubt will happen (?). Anyway I think going with general won't do harm!
To be fair, given the timer overheads, I don't think our measurement is precise enough to tell anything significant below 0.1 ms, so accuracy-wise 4 decimal places should be OK. But if we are happy not to follow the previous CLUBB output format, I am happy :-)
Quick update regarding this @Mikolaj-A-Kowalski. When I compiled this with gfortran it complained about the Fortran 2018 standard:
Error: Fortran 2018: positive width required at (1) with fmt='(a,g,a)'.
Apparently the more general fmt='(a,g0,a)' is needed? At least it compiled without problems.
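For reference, a standalone snippet contrasting the two descriptors: g without a width is rejected by gfortran under the Fortran 2018 standard, while g0 (standard since Fortran 2008) uses a processor-chosen minimal width:

```fortran
program g0_demo
  implicit none
  real :: elapsed = 0.5123

  ! fmt='(a,g,a)' fails to compile with gfortran -std=f2018
  ! ("positive width required"); g0 works:
  write(*, fmt='(a,g0,a)') "C14 total evaluation time: ", elapsed, " [s]"
end program g0_demo
```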
My bad! Thank you for catching that! Do you wish to submit a quick PR with a fix, or would you prefer I do it?
Thanks @Mikolaj-A-Kowalski, a couple of thoughts. Is it possible to get a comparison to the non-ML C14? I'd presume that, as a standard arithmetic operation, it's negligible. It would also be interesting to know what the overheads associated with loading the net are - this is usually the expensive part, though the net is relatively small in this case.
If we want, I can add a timer for that. I didn't, since I was thinking the loading happens once at initialisation, so it is of less interest: it becomes less significant the longer or larger the calculation.
Given that, if I am reading things right, we always evaluate the 'classical' C14 here: clubb_ML/src/CLUBB_core/advance_xp2_xpyp_module.F90, lines 596 to 602 at ec06009. I can add a timer around that as well. I will put both in separate commits so we can drop them easily if we decide they are uninteresting.
```fortran
call timer_start(C14_timer_total)
! Interpolate Lscales from thermal to momentum grid
Lscale_up_zm(:,:) = zt2zm_api( nzm, nzt, ngrdcol, gr, Lscale_up(:,:), zero_threshold )
```
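As a self-contained illustration of how the start/end bracketing accumulates into the cumulative total (reusing the sketch timer module from above; the loop body is a stand-in for the interpolation plus net evaluation, not the real CLUBB code):

```fortran
program c14_timing_demo
  use c14_timer_sketch, only: timer_t, timer_start, timer_end
  implicit none
  type(timer_t) :: C14_timer_total
  real :: x(100000)
  integer :: step

  do step = 1, 100                 ! stand-in for the model time-stepping loop
    call timer_start(C14_timer_total)
    call random_number(x)          ! stand-in for interpolation + inference
    x = sqrt(x)
    call timer_end(C14_timer_total)
  end do

  ! Printed once at the end, mirroring the destructor-time output of the PR
  write(*, fmt='(a,g0,a)') "C14 total evaluation time: ", &
       C14_timer_total % time_elapsed, " [s]"
end program c14_timing_demo
```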
Here we pretend the interpolation from the thermal (zt) to the momentum (zm) grid is part of the net. This probably makes sense, since this step is not needed in no-ML mode. But it made me wonder why the net can't do that itself, i.e. take the Lscales on zt as input?
@adconnolly I believe your input on that may be necessary ;-) [I myself lack the knowledge and context]
We could add 2 Lscales, one from above and one from below, for all but the zm(1) point, where the surface is; since C14 doesn't matter at the surface, that exception is fine. This could be interesting to try, at least offline, to see if it provides any additional skill. I'm skeptical it will, because the gradients dL/dz never appear in the existing physical closure model, and the drawback would be a bigger net. If it really is just the average that matters, a larger network would be a waste of computational expense.
Another point: the interpolation of the Lscale does happen in the non-ML model, but it is just the "master" Lscale that gets interpolated, to combine with sqrt(TKE) into a single time scale at zm points. I've checked this before and I'm pretty sure it is the master length scale that gets interpolated, but it could just as well be the underlying _up and _down Lscales, or the resultant inverse time scale, that gets interpolated. If it is the _up and _down Lscales that get interpolated, I suppose we could pass those through and save some re-interpolation, but we'd need to look at where the time scale, tau, or its inverse gets calculated.
I have added the extra timers. For the standalone BOMEX case the load time is about 0.5s.
vopikamm left a comment:
LGTM and will for sure be useful down the line!
+1 to your comment that it could be useful to write these measurements to some output file.
Although the point I was trying to make is that perhaps it is not worth bothering about (at least at this point), since we can just easily grep the STDOUT 😅
I agree on sticking with STDOUT for now.
I'm relieved that batching improves the speed so much!
Closes #36
The 'null' measurement using the timer (i.e. timer_end immediately after timer_start) gives a total of ~1 ms, while the measurement of the total inference time is around 0.5 s. I was hesitating a bit over whether output to STDOUT is sufficient (or if we should print to a dedicated output file), but in the end you get something like:
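(The exact line from the original run is not preserved here; the following is an illustrative example in the format of the write statement above, with a made-up value:)

```
C14 total evaluation time: 0.51234567 [s]
```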
which is nice.
What is worrying is that we are dealing with non-trivial overhead: 1/3 of the runtime is C14 evaluation! At least for BOMEX. 🤞 it will get better with batching or larger problems.