pycodegen / pystencils / Merge requests / !230

Improve non-temporal stores

Merged. Michael Kuron requested to merge nontemporal into master on Apr 01, 2021

ARM doesn't have real non-temporal stores like x86 does. It does, however, have a special instruction (dc zva) that sets a cache line to zero without first reading it from memory, thus saving memory bandwidth. This pull request makes use of it. The other important property of non-temporal stores is that they don't pollute the cache. Some architectures, like PowerPC, can emulate that with a special store instruction that flushes a cache line from the cache. It's not available on ARM, but this pull request adds support for it anyway.
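
To illustrate the cache-line zeroing described above, here is a minimal C sketch. It is not the code this pull request generates; the helper names `zva_block_size` and `zero_block` are hypothetical. It assumes the block size is read from the `DCZID_EL0` register on AArch64 and falls back to `memset` on every other architecture, or when `dc zva` is prohibited.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: number of bytes that one `dc zva` zeroes,
   or 0 if the instruction is unavailable or prohibited. */
static inline size_t zva_block_size(void) {
#if defined(__aarch64__)
    uint64_t dczid;
    __asm__ volatile("mrs %0, dczid_el0" : "=r"(dczid));
    if (dczid & 0x10) return 0;         /* DZP bit set: dc zva is prohibited */
    return (size_t)4 << (dczid & 0xF);  /* BS field: log2 of the block size in 4-byte words */
#else
    return 0;
#endif
}

/* Hypothetical helper: zero one block starting at p.
   p must be aligned to block_size. On AArch64 this uses dc zva when
   block_size matches the hardware block size, so the line is not read
   from memory first; otherwise it falls back to memset. */
static inline void zero_block(void *p, size_t block_size) {
    assert(block_size != 0 && ((uintptr_t)p % block_size) == 0);
#if defined(__aarch64__)
    if (zva_block_size() == block_size) {
        __asm__ volatile("dc zva, %0" : : "r"(p) : "memory");
        return;
    }
#endif
    memset(p, 0, block_size);
}
```

A caller would zero each block once, immediately before the first vector store into it, and then overwrite the zeros with the actual results.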

Furthermore, this pull request adds fence instructions after non-temporal stores on x86. The fence guarantees that all stored data has actually arrived in memory before the kernel returns, so subsequent memory accesses cannot read stale data. This was not an issue in practice, as the overhead of exiting and calling a kernel would usually have taken long enough for the data to arrive in memory, but it wasn't guaranteed.
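
The store-then-fence pattern can be sketched in C with SSE2 intrinsics. This is an illustration under stated assumptions, not the kernel code pystencils emits: the function name `fill_streaming` is hypothetical, the destination is assumed to be 16-byte aligned, and non-x86 builds fall back to regular stores.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#if defined(__SSE2__)
#include <emmintrin.h>  /* _mm_set1_pd, _mm_stream_pd, _mm_sfence */
#endif

/* Hypothetical kernel: fill dst[0..n) with value.
   dst must be 16-byte aligned and n must be a multiple of 2. */
static inline void fill_streaming(double *dst, size_t n, double value) {
    assert(((uintptr_t)dst % 16) == 0 && n % 2 == 0);
#if defined(__SSE2__)
    const __m128d v = _mm_set1_pd(value);
    for (size_t i = 0; i < n; i += 2)
        _mm_stream_pd(dst + i, v);  /* non-temporal store of two doubles */
    _mm_sfence();                   /* order the streamed stores before the function returns */
#else
    for (size_t i = 0; i < n; ++i)
        dst[i] = value;             /* fallback: regular stores, no fence needed */
#endif
}
```

The fence is issued once after the loop rather than after every store, so its cost is paid once per kernel call.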

Note that non-temporal stores are not guaranteed to be faster than regular stores on all processors. For example, on my Apple M1, the non-temporal stores are actually slightly slower (so I did not gain anything by implementing this). This is not due to the overhead of checking whether a vector is the first one in a cache line (which can easily be confirmed by replacing dc zva with nop), but is likely an artifact of Apple implementing dc zva differently than the ARM documentation suggests. So on ARM, one should always benchmark both regular and non-temporal stores and use whichever is faster.
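
The check mentioned above, whether a vector is the first one in a cache line, amounts to a mask test on the store address. The following C sketch shows one way to write it; the function names are hypothetical, and the line size is a parameter because it differs between processors.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: true if p is the first address of a cache line.
   line_size must be a power of two. */
static inline int starts_cacheline(const void *p, size_t line_size) {
    assert(line_size != 0 && (line_size & (line_size - 1)) == 0);
    return ((uintptr_t)p & (line_size - 1)) == 0;
}

/* Hypothetical helper: count how many vector stores of vec_bytes each,
   covering [base, base + n_bytes), begin a new cache line. These are the
   stores that would be preceded by a cache-line zeroing instruction. */
static inline size_t count_line_starts(const void *base, size_t n_bytes,
                                       size_t vec_bytes, size_t line_size) {
    const unsigned char *p = (const unsigned char *)base;
    size_t hits = 0;
    for (size_t off = 0; off + vec_bytes <= n_bytes; off += vec_bytes)
        hits += (size_t)starts_cacheline(p + off, line_size);
    return hits;
}
```

With 16-byte vectors and an assumed 64-byte line, only every fourth store passes the check, which is consistent with the check itself being cheap.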

Fixes #25 (closed). Supersedes !225 (closed).

Edited Apr 12, 2021 by Michael Kuron
Source branch: nontemporal