Debugging the MOZART chemical solver

(Martin Schultz, 12/03/2014)

As we have just gone through this extensively, it may be a good time to document a bit how the MOZART solver works and what one can do to debug it (or rather: debug the chemical mechanism and/or initial concentrations).

Calling sequence

moz_chemistry (in mo_moz_chemistry.f90) is the wrapper routine for the MOZART chemistry driver chemdr (in mo_moz_chemdr.f90). The first operates in the "ECHAM world" (kproma, krow, etc.), the latter translates things into the "MOZART world" (plonl, lat, etc.). MOZART is not "nproma-aware", which is why all arrays that are passed to chemdr must be filled from kproma+1:kbdim.

After calculation of photolysis and reaction rates, chemdr calls the actual solver for the chemical equation systems. The MOZART preprocessor (mozpp) can generate code for three different solver types: explicit, implicit, and Rhodas. All three of them can be executed in parallel since every compound must be associated with one (and only one) solver type ("solution classes" in MOZART language). In ECHAM-HAMMOZ, we assume that all species are solved with the implicit solver imp_sol (in mo_moz_imp_sol.f90).

Description of imp_sol

The implicit solver loops over all grid boxes of the (plev, plonl) vector of volume mixing ratios(\!) it is given, so in effect it does a box model calculation for each grid box. The solver time step is initially set to the model time step length (dt = delt). If the implicit solution does not converge, it will revert to the starting conditions and try again with a time step that is cut in half. This happens until convergence is reached or the cut_limit (default value is 5) is reached. If a solution is reached with using a shorter timestep, the internal time (interval_done) will be increased, and the next chunk of the model time step is treated the same way. If, after two steps, the solver still converges, it will try to increase the time step length again by a factor of two. This procedure is repeated until the interval_done reaches or exceeds the model time step length.

In order to expediate the convergence, the species in the base_solution vector are reordered (more or less by lifetime). lsol is the "local" solution vector containing the mixing ratios of all species at the current iteration, solution is the reordered vector which is actually modified. The reordering is defined in the implicit%permute index vector. The implicit%clsmap index vector points to those species (i.e. ECHAM tracers) which shall be subjected to the implicit solver. If you only run MOZ, do jt = 1, implicit%clsmap is the same as do jt = 1, ntrac. Would you have for example the explicit solver running in parallel (e.g. to solve for CH4, CO, etc.), the corresponding species indices would be missing in the implicit%clsmap vector.

For each time step first the invariant parts of the Jacobian are computed (this is basically y(t)/dt), then the linear equation terms (photo- or decomposition reactions) are stored in the Jacobian matrix (subroutine imp_linmat is a preprocessor generated routine that can be found in mo_moz_mat.f90). Then, the solver enters the iteration loop for the Newton Raphson iteration to solve for f(y) = 0. This loop is executed up to implicit%iter_max times (default value is 11). The actual Jacobian matrix is stored in sys_jac. sys_jac is filled with the non-linear equation terms in imp_nlnmat (also in mo_moz_mat.f90). At the end of this routine, the previously calculated linear terms are added to sys_jac (this also happens within mo_moz_mat.90, subroutine imp_nlnmat_finit). Note that the Jacobian is saved as a sparse diagonal matrix, i.e. only terms != 0 are actually included.

Once the Jacobian is formed, a LU factorization takes place (imp_lu_fac), the production and loss rates are computed, and the tendency ("forcing") is evaluated as forcing(m) = solution(m)*dti - (iter_invariant(m) + prod(m) - loss(m)). Then, the equations are solved in imp_lu_slv (imp_lu_fac, imp_lu_slv, and imp_prod_loss are all preprocessor generated and can be found in mo_moz_mat.f90). Finally, the solution vector is updated and the convergence is tested by a relative criterion. Note, that MOZART applies stricter convergence criteria for some key compounds (presently these are O3, NO, NO2, NO3, HNO3, HO2NO2, N2O5, OH, and HO2) than for others. Th criteria applied are 1.e-3 for "normal" compounds, and 1.e-4 for the key species (rel_err and high_rel_err).

If all species converged, the iteration loop is exited, and the current solution is written back to the base_sol vector.

Debugging techniques

Unless you change your chemical mechanism (and forget to update some reaction rate constants), or you mess up your initial tracer concentrations file, you should rarely encounter a situation where you see a lot of imp_sol: failed to converge (lon,lat,lev,indx,dt)@ messages. If there are a few of those when you do an initial run, don't bother. However, if they persist over many time steps, you will have to act.

The imp_sol code will write some additional information about the reason for non-convergence. In particular, it will list the processor, lon, lev, lat and "indx" value where it happened and show you the species for which convergence was not reached. Again, under normal circumstances you will see one or two species here, and usually the same species in all locations. Thus you can first double-check your initial concentrations for this species (or these species). If this doesn't help, make use of the extended debugging code in mo_moz_imp_sol.f90, which is normally commented out. For this, you should pick one specific grid box (look at the "index" value from the output) and set index_debug to that value. If you then run the code with all the extra debugging information, you should get a lot of detail about what happens at each Newton Raphson time step. Look at the base_sol output and the production and loss rates to see if they make sense. Note that species are given in mixing ratios. This means that prod and loss values are not given in the canonical units of "cm^3 molecules^-1 s^-1", but rather in "vmr^-1 s^-1", which means they are multiplied by the Loschmid number (2.7e19 under normal surface conditions).

If you cannot figure out what is going on, try to build a reduced mechanism with the preprocessor so that you have a chance to carefully check each individual reaction equation. It may also help to print out all rates in imp_sol and make sure you find what you expect. To analyze this, it is very helpful to look at the "doc" file that is produced by the MOZART preprocessor. A sample "doc" file for the current HAMMOZ mechanism (219 species) is attached. Note that reactions are numbered in two different ways: in the equations in mo_moz_mat.f90, you will find "rNNN" and "jNNN" for gas-phase and photolysis reactions, respectively. The index numbers for these are listed on the left of the "doc" file's equations. In contrast, the rates array uses indices from 1 to nphoto and adds the gas-phase reactions on top. Use the right-hand side indices in the "doc" file for debugging the rates array.

Good luck!

kproma testing of the MOZ code

Due to the fact that imp_sol will always recompute an entire block if it fails to converge somewhere, you cannot guarantee bit-identical results for different kproma values unless you set cut_limit and iter_max to zero!