Graph partitioning requires the division of a graph's vertex set into k equally sized subsets s. t. some objective function is optimized. High-quality partitions are important for many applications, whose objective functions are often NP-hard to optimize. Most state-of-the-art graph partitioning libraries use a variant of the KernighanLin (KL) heuristic within a multilevel framework. While these libraries are very fast, their solutions do not always meet all user requirements. Moreover, due to its sequential nature, KL is not easy to parallelize. Its use as a load balancer in parallel numerical applications therefore requires complicated adaptations. That is why we developed previously an inherently parallel algorithm, called Bubble-FOS/C [H. Meyerhenke, B. Monien, S. Schamberger, Accelerating shape optimizing load balancing for parallel FEM simulations by algebraic multigrid, in: Proceedings of the 20th IEEE International Parallel and Distributed Processing Symposium, IPDPS'06, IEEE Computer Society, 2006, p. 57 (CD)], which optimizes partition shapes by a diffusive mechanism. However, it is too slow for practical use, despite its high solution quality. In this paper, besides proving that Bubble-FOS/C converges towards a local optimum of a potential function, we develop a much faster method for the improvement of partitionings. This faster method called TruncCons is based on a different diffusive process, which is restricted to local areas of the graph and also contains a high degree of parallelism. By coupling TruncCons with Bubble-FOS/C in a multilevel framework based on two different hierarchy construction methods, we obtain our new graph partitioning heuristic DibaP. Compared to Bubble-FOS/C, DibaP shows a considerable acceleration, while retaining the positive properties of the slower algorithm. Experiments with popular benchmark graphs show that DibaP computes consistently better results than the state-of-the-art libraries METIS and JOSTLE. Moreover, with our new algorithm, we have improved the best known edge-cut values for a significant number of partitionings of six widely used benchmark graphs.