Graph partitioning requires the division of the vertex set of a graph into k equally sized subsets such that some objective function is optimized. For many important objective functions, e.g., the number of edges incident to different partitions, the problem is NP-hard. Graph partitioning is an important task in many applications, so that a variety of algorithms and tools for its solution have been developed. Most state-of-the-art graph partitioning libraries use a variant of the Kernighan-Lin (KL) heuristic within a multilevel framework. While these libraries are very fast, their solutions do not always meet all requirements of the users. This includes the choice of the appropriate objective function and the shape of the computed partitions. Moreover, due to its sequential nature, the KL heuristic is not easy to parallelize. Thus, its use as a load balancer in parallel numerical applications requires complicated adaptations. That is why we have developed previously an inherently parallel algorithm, called Bubble-FOS/C (Meyerhenke et al., IPDPS'06), which optimizes the partition shapes, using a diffusive mechanism. Yet, it is too slow to be of real practical use, despite its high solution quality. In this paper, besides proving that Bubble-FOS/C converges towards a local optimum, we develop a much faster method for the improvement of partitionings. It is based on a different diffusive process, which is restricted to local areas of the graph and also contains a high degree of parallelism. By coupling this new technique with Bubble-FOS/C in a multilevel framework based on two different hierarchy construction methods, we obtain our new graph partitioning heuristic DibaP. Compared to Bubble-FOS/C, it shows a considerable acceleration, while retaining the positive properties of the slower algorithm. Experiments with popular benchmark graphs show an extremely good behavior. First, DibaP computes consistently better results -- measured by the edge-cut and the number of boundary vertices in the summation and the maximum norm -- than the state-of-the-art libraries Metis and Jostle. Second, with our new algorithm, we have improved the best known edge-cut values for a significant number of partitionings of six widely used benchmark graphs.