hyteg issues
https://i10git.cs.fau.de/hyteg/hyteg/-/issues (last updated 2020-02-24)

Issue #114: Extend HyTeG to work on level 1 and 0 (Nils Kohl, updated 2020-02-24)
https://i10git.cs.fau.de/hyteg/hyteg/-/issues/114

Apparently, for some P1 GMG tests (even P1-P1 Stokes) we can plug in level 1 as the minimum level and it works.
Level 0 could really be helpful to accelerate our coarse grid solvers.
* [x] indexing
* [x] functions
* [x] operator assembly
* [x] single level solvers (CG, PETSc suite, ...)
* [x] grid transfer, multigrid solvers
* [x] fix 2D
* [x] function iterator, PETSc field split indexing
* [x] agglomeration
* [ ] ~~VTK~~ -> #116

Assignee: Dominik Thoennes

Issue #72: Number of global MPI reduce ops in P2Function::dot()? (Marcus Mohr, updated 2018-07-26)
https://i10git.cs.fau.de/hyteg/hyteg/-/issues/72

Hi,
taking a look at the `P2Function::dot()` I see that there are two invocations of `walberla::mpi::allReduceInplace()`, one each by
- `VertexDoFFunction< ValueType >::dot()`
- `EdgeDoFFunction< ValueType >::dot()`
Having no insight into walberla I cannot be sure, of course, but if this leads to two global MPI reduce operations being performed, we should avoid that with 3D and HPC runs in mind.
My suggestion would be to add another optional parameter to `*DoFFunction::dot()` that turns the call to `walberla::mpi::allReduceInplace()` on or off. Then we could perform the reduce once inside `P2Function::dot()`, the same as with
`P2Function::getMaxMagnitude()`, see [d7a2cc47].
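To illustrate the suggestion, here is a minimal sketch with stand-in functions (none of these names are the actual HyTeG/walberla API): the component dot products are summed locally first, and a single reduction stub is invoked on the combined value.

```cpp
#include <numeric>
#include <vector>

// Local (per-process) dot product contribution of one DoF component.
double localDot( const std::vector< double >& a, const std::vector< double >& b )
{
   return std::inner_product( a.begin(), a.end(), b.begin(), 0.0 );
}

// Stand-in for a single walberla::mpi::allReduceInplace() call:
// sums the per-process partial results. The point is that it runs once.
double allReduceSumStub( const std::vector< double >& perProcessPartials )
{
   return std::accumulate( perProcessPartials.begin(), perProcessPartials.end(), 0.0 );
}

double p2DotSingleReduce( const std::vector< double >& vertexA, const std::vector< double >& vertexB,
                          const std::vector< double >& edgeA, const std::vector< double >& edgeB )
{
   // Combine vertex and edge contributions locally first ...
   const double localSum = localDot( vertexA, vertexB ) + localDot( edgeA, edgeB );
   // ... then reduce once (here: a one-element "communicator" as a stub).
   return allReduceSumStub( { localSum } );
}
```

With MPI this would mean one reduction over the combined scalar instead of one per component function.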
Opinion?
Cheers
Marcus

Assignee: Dominik Thoennes

Issue #61: Implement vector valued unknowns by extending the indexing functions (Nils Kohl, updated 2019-11-20)
https://i10git.cs.fau.de/hyteg/hyteg/-/issues/61

For higher-order finite elements or, e.g., DG1 discretizations, **we need vector-valued unknowns** on the discretized domain.
Currently this is implemented (but not tested and most likely not functional!) by template parameters (`ValueType`) that could theoretically be set to `std::array` or similar.
As an alternative, **we could extend the indexing functions by another dimension** that refers to the nth element of such a vector valued unknown.
Example:
To get the three DG1 unknowns at position (x, y) on a macro-face, an indexing function call could look like this:
```cpp
real_t dg1_a = data[ index< Level >( x, y, DG_CENTER, 0 ) ];
real_t dg1_b = data[ index< Level >( x, y, DG_CENTER, 1 ) ];
real_t dg1_c = data[ index< Level >( x, y, DG_CENTER, 2 ) ];
```
(Such mappings must of course be documented; this would equally apply to the templated approach.)
The function memory is simply extended to the required size.
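To make the memory-layout point concrete, here is a hedged sketch of such an indexing function in both layouts (the row-length formula and the constants are assumptions for a macro-face vertex grid, not HyTeG's actual indexing):

```cpp
#include <cstdint>

using uint_t = std::uint64_t;

// Hypothetical: number of scalar components per vector-valued unknown.
constexpr uint_t numComponents = 3; // e.g. three DG1 unknowns

constexpr uint_t rowLength( uint_t level ) { return ( uint_t( 1 ) << level ) + 1; }
constexpr uint_t numPositions( uint_t level ) { return rowLength( level ) * rowLength( level ); }

// Array-of-structs layout: the components of one unknown are stored contiguously.
constexpr uint_t indexAoS( uint_t level, uint_t x, uint_t y, uint_t component )
{
   return ( y * rowLength( level ) + x ) * numComponents + component;
}

// Struct-of-arrays layout: all 0th components first, then all 1st components, ...
constexpr uint_t indexSoA( uint_t level, uint_t x, uint_t y, uint_t component )
{
   return component * numPositions( level ) + y * rowLength( level ) + x;
}
```

Since the layout is encoded only in the indexing function, switching between AoS and SoA does not touch any kernel code.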
Further advantages:
* the `ValueType` template parameter ~~can be removed~~ would be restricted to either `float` or `double`
* the memory layout (AoS vs SoA) can be easily switched by exchanging the indexing function(!) - this would not be possible with the `ValueType`-template approach
Disadvantages:
* we would restrict the function classes to unknowns of type ~~`real_t`^N~~ `float`^N or `double`^N (for `uint_t`-typed functions we would then need extra classes, but maybe that's not that bad anyway?)
* ~~https://i10git.cs.fau.de/terraneo/tinyhhg/issues/34 would be really hard to solve (but uncertain if we ever need that)~~

Assignee: Nils Kohl

Issue #50: Benchmark (level-)templated vs non-templated index functions (Nils Kohl, updated 2018-10-01)
https://i10git.cs.fau.de/hyteg/hyteg/-/issues/50

Currently, the indexing functions are templated with the corresponding refinement level.
But it is not clear whether that optimization is necessary (it is not even clear whether the templates are better than non-templated indexing functions). We should therefore measure the performance impact of a non-templated indexing function.
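A benchmark could compare the two variants side by side; the following is a minimal sketch (function names and the row-length formula are assumptions, not the actual HyTeG code):

```cpp
#include <cstdint>

using uint_t = std::uint64_t;

// Level-templated variant: the row length is a compile-time constant.
template < uint_t Level >
inline uint_t indexTpl( uint_t x, uint_t y )
{
   constexpr uint_t rowLength = ( uint_t( 1 ) << Level ) + 1;
   return y * rowLength + x;
}

// Non-templated variant: the level is a runtime parameter.
inline uint_t indexRt( uint_t level, uint_t x, uint_t y )
{
   const uint_t rowLength = ( uint_t( 1 ) << level ) + 1;
   return y * rowLength + x;
}
```

If the compiler hoists the runtime row-length computation out of hot loops, both variants should generate near-identical code; only a measurement can confirm this.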
If there is none, we could get rid of the templated indexing functions, which would
1. presumably speed up compilation considerably
2. simplify modularization of the library
3. simplify the code structure (we could get rid of the SPECIALIZE macro(s) and lots of template code)

Assignee: Dominik Thoennes

Issue #48: ParMetis graph contains edges between primitives that are more than one dimension apart (Nils Kohl, updated 2019-04-18)
https://i10git.cs.fau.de/hyteg/hyteg/-/issues/48

Inter-primitive communication is currently only allowed among primitives that differ in only one dimension.
The corresponding ParMetis graph should reflect that relation by not creating edges between primitives that differ by two dimensions.

Assignee: Nils Kohl

Issue #45: Unnecessary communication during interpolate, add, assign (Nils Kohl, updated 2019-02-18)
https://i10git.cs.fau.de/hyteg/hyteg/-/issues/45

Currently (in various DoF spaces) we communicate the halos during the linear-algebra routines interpolate, add, and assign, although this is not strictly necessary, since all of them only update local DoFs and depend only on local DoFs.
It would be more consistent to let routines that need updated halos pull them when necessary (e.g. operators, before stencils are applied).
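A minimal sketch of this pull model (class and member names are hypothetical): interpolate/add/assign merely invalidate the halos, and the first routine that actually reads them triggers the communication.

```cpp
// Hypothetical "pull" halo updates via a dirty flag.
struct Function
{
   bool halosUpToDate = false;

   void assign( double /*value*/ )
   {
      // ... update local DoFs only ...
      halosUpToDate = false; // no communication here
   }

   void syncHalosIfNeeded()
   {
      if ( !halosUpToDate )
      {
         // ... communicate halos here ...
         halosUpToDate = true;
      }
   }
};

struct Operator
{
   void apply( Function& src )
   {
      // The operator pulls updated halos right before applying stencils.
      src.syncHalosIfNeeded();
      // ... apply stencils ...
   }
};
```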
Resolving this issue
1. would clear up which routines really depend on updated halos
2. could increase performance, as it might save time that is currently spent on unnecessary communication (however, shifting the communication to other routines could also decrease performance, since we might not be able to overlap it with computation in some cases)

Issue #32: Remove switch statement from index functions (Nils Kohl, updated 2017-09-18)
https://i10git.cs.fau.de/hyteg/hyteg/-/issues/32

The switch could be replaced by generated or templated functions for each stencil direction.
However, it is unclear whether the switch is optimized away by the compiler anyway.
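A minimal sketch of the idea (the directions and names are illustrative assumptions, not HyTeG's stencil directions):

```cpp
// Runtime switch over stencil directions vs. a direction-templated function.
enum class Dir
{
   C,
   W,
   E
};

// Switch-based variant: the branch happens at runtime
// (unless the compiler constant-folds it at the call site).
inline int stencilOffsetSwitch( Dir d )
{
   switch ( d )
   {
   case Dir::W:
      return -1;
   case Dir::E:
      return +1;
   default:
      return 0;
   }
}

// Templated variant: one instantiation per direction, resolved at compile time.
template < Dir d >
inline int stencilOffsetTpl()
{
   if constexpr ( d == Dir::W )
      return -1;
   else if constexpr ( d == Dir::E )
      return +1;
   else
      return 0;
}
```

Each instantiation of the templated variant compiles to a branch-free constant; whether the switch version ends up equally cheap depends on the optimizer, as noted above.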
Issue #30: Graph weights for load balancing (Nils Kohl, updated 2019-11-20)
https://i10git.cs.fau.de/hyteg/hyteg/-/issues/30

Currently all primitives are equally weighted, but they should be weighted by their number of DoFs (vertex < edge < face < cell).

Assignee: Nils Kohl

Issue #29: BufferedCommunicator fails when using parallel sends / recvs via OpenMPBufferSystem (Nils Kohl, updated 2019-11-07)
https://i10git.cs.fau.de/hyteg/hyteg/-/issues/29

There are likely race conditions when invoking the pack / unpack (send / recv) callbacks. Maybe there is more to it.
Switched to serial sends / recvs now, seems to work fine.
Unsure whether this causes a big performance drop -> profile first before fixing!
Also: this is only an issue when building with OpenMP.

Assignee: Nils Kohl

Issue #25: Potential Optimization: replace std::pow(2, ...) with bit shifts or lookup tables (Nils Kohl, updated 2017-09-15)
https://i10git.cs.fau.de/hyteg/hyteg/-/issues/25

Assignee: Dominik Thoennes

Issue #19: Potential Optimization: cache communication dependencies (Nils Kohl, updated 2017-08-01)
https://i10git.cs.fau.de/hyteg/hyteg/-/issues/19

The BufferedCommunicator could cache neighbor dependencies instead of arranging them before every communication.

Assignee: Nils Kohl