Many state-of-the-art applications for massively parallel computer architectures have the drawback that their communication costs grow disproportionately with the number of processing elements (PEs). This problem will be exacerbated in the future by steadily increasing numbers of PEs. In particular, investments in emerging exascale architectures may not pay off to their full potential. Consequently, the proposed research project aims at developing new methods for significantly reducing the communication costs of important application classes on massively parallel NUMA architectures. Typically, one models communication and/or data dependencies by a so-called application graph. In order to assign processes and/or data to PEs in a load-balanced and communication-optimized manner, the application graph is suitably partitioned into subgraphs, and the latter are mapped onto the PEs. Current parallel tools for this form of partitioning and mapping do not scale sufficiently or compromise on quality. The proposed project aims at significantly improving the trade-off between scalability and quality, and targets an acceleration of typical communication-bound applications by several factors. The proposed research unifies the partitioning and mapping of a potentially dynamic application graph. To this end, we model communication costs by exploiting graph-theoretical properties of typical non-uniform architectures. For the optimization of communication costs we employ the multilevel framework, which has proven extremely effective in related contexts. In contrast to common practice, the algorithms to be developed within our unified approach will optimize an application's communication costs in all phases of the multilevel framework.
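To make the objective concrete, the following sketch illustrates one common way to model communication costs of a mapping on a hierarchical (NUMA-like) architecture: the cost of each cut edge is its communication volume weighted by the distance between the PEs its endpoints are mapped to. This is an illustrative toy model, not the project's actual cost function; the hierarchy sizes, function names, and example graph are hypothetical.

```python
def hierarchy_distance(pe_a, pe_b, level_sizes=(4, 2)):
    """Distance between two PEs in a hypothetical hierarchy:
    here 4 cores per node and 2 nodes per island. Communication
    across a higher hierarchy level is assumed to cost more."""
    if pe_a == pe_b:
        return 0
    dist = 1
    group_a, group_b = pe_a, pe_b
    for size in level_sizes:
        group_a //= size
        group_b //= size
        if group_a == group_b:
            return dist
        dist += 1
    return dist

def communication_cost(edges, mapping):
    """Sum of edge weight times PE distance over all edges of the
    application graph; edges = [(u, v, weight)], mapping[v] = PE of v.
    Edges inside one PE contribute zero."""
    return sum(w * hierarchy_distance(mapping[u], mapping[v])
               for u, v, w in edges)

# Tiny application graph: 4 processes with weighted communication edges.
edges = [(0, 1, 5), (1, 2, 3), (2, 3, 5), (0, 3, 1)]
mapping_bad = {0: 0, 1: 4, 2: 1, 3: 5}    # heavy edges cross nodes
mapping_good = {0: 0, 1: 1, 2: 4, 3: 5}   # heavy edges stay on-node
print(communication_cost(edges, mapping_bad))   # higher cost
print(communication_cost(edges, mapping_good))  # lower cost
```

A mapping-aware partitioner minimizes this kind of objective instead of the plain edge cut, which treats all inter-PE communication as equally expensive.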
Among the many algorithms that are used in the context of multilevel graph partitioning and process mapping, we have selected two main classes as a starting point: (i) strictly local combinatorial optimization methods, and (ii) more global diffusion-based methods. Through our previous work, we have expertise in both classes. Strictly local optimization methods are myopic and most often do not parallelize well. Global optimization methods, in spite of better scalability, are usually prohibitively expensive. Consequently, we want to combine and extend the most desirable features of both classes into "semi-local" optimization methods. We expect this hybridization to overcome the problems described above, that is, to provide high-quality mappings for, and on, very large-scale parallel machines. We will integrate our new methods into the established software libraries of our external partner. The libraries are free and permanently available to the community, hence fostering immediate application of our contributions to real-world, frontier simulation codes.
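To illustrate the "global" character of class (ii), the following sketch shows a classical first-order diffusion iteration, a building block of diffusion-based balancing and refinement schemes. Each PE repeatedly exchanges a fraction of its load difference with its neighbors, so load spreads across the whole machine even though every step uses only local information. This is a generic textbook scheme, not the project's algorithm; the ring topology and parameters are hypothetical.

```python
def diffusion_step(load, neighbors, alpha=0.25):
    """One Jacobi-style diffusion step: PE u sends alpha * (load_u - load_v)
    to each neighbor v (negative values mean it receives load)."""
    new_load = load[:]
    for u, load_u in enumerate(load):
        for v in neighbors[u]:
            new_load[u] -= alpha * (load_u - load[v])
    return new_load

# 4 PEs connected in a ring, with all load initially on PE 0.
neighbors = {0: [1, 3], 1: [0, 2], 2: [1, 3], 3: [2, 0]}
load = [8.0, 0.0, 0.0, 0.0]
for _ in range(50):
    load = diffusion_step(load, neighbors)
print([round(x, 3) for x in load])  # converges toward the average 2.0 per PE
```

A strictly local method, by contrast, would only move individual vertices across nearby block boundaries based on an immediate gain, which is exactly the myopia the planned semi-local methods are intended to overcome.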