Computing
ATLAS-Grid-Computing
The huge amount of data recorded by the ATLAS experiment at CERN in 
Geneva requires enormous storage and computing resources. These can only 
be provided by sharing the work across many computing centres. The LHC 
experiments therefore combine their forces in the Worldwide LHC 
Computing Grid (WLCG), a distributed computing infrastructure organized 
in four tiers.
 
Fig. 1: WLCG is made up of four layers, or "tiers": 0, 1, 2 and 3. Each tier provides a specific set of services. (graphic: WLCG)
The innermost layer, Tier 0, is based in the CERN data centre and 
provides around 20% of the total compute capacity. The next layer, 
Tier 1, consists of thirteen large computer centres with sufficient 
storage capacity and round-the-clock support for the Grid. By hosting 
both large tape and disk storage, they are responsible for the 
safe-keeping of a proportional share of raw and reconstructed data, as 
well as of simulated data produced at the Tier 2s. The Tier 2 centres 
are usually based at universities and research laboratories. They store 
data and provide adequate computing power for analysis tasks, and they 
handle a proportional share of simulated event production and 
reconstruction. Currently there are around 160 Tier 2 sites all over 
the world.
Since the beginning of data taking at the LHC, our group has 
successfully operated an ATLAS Tier 2 centre. To cope with the 
continuously growing data volume, we have to provide steadily more 
compute power and disk storage over the years.
Our group is also responsible for several ATLAS-specific operations 
tasks in the cloud around the Tier 1 centre GridKa at KIT. The 
performance of all Tier 2 centres in the cloud (data transfer, job 
submission, job success rate, etc.) is continually tested and 
monitored. We are responsible for summarizing and analysing the current 
status.
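As an illustration of this kind of summary, the following minimal sketch computes a job success rate per site from a list of job outcomes. The function name, the input format and the site names are assumptions made for this example; they do not describe the actual ATLAS monitoring tools.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def success_rates(jobs: Iterable[Tuple[str, bool]]) -> Dict[str, float]:
    """Return the fraction of successful jobs per site.

    `jobs` is a sequence of (site name, succeeded) pairs. This input
    format is a hypothetical simplification for illustration only.
    """
    totals: Dict[str, int] = defaultdict(int)
    passed: Dict[str, int] = defaultdict(int)
    for site, succeeded in jobs:
        totals[site] += 1
        if succeeded:
            passed[site] += 1
    return {site: passed[site] / totals[site] for site in totals}


# Hypothetical job records for two made-up sites.
example_jobs = [("SITE_A", True), ("SITE_A", False), ("SITE_B", True)]
rates = success_rates(example_jobs)
```

A real summary would also cover data transfers and job submission, which are left out here to keep the sketch short.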
ATLAS HammerCloud Project
Since 2014 we have been active in the ATLAS HammerCloud project. 
HammerCloud (HC) is a distributed analysis testing system. It can test 
one or more sites and report the results of those tests. It provides a 
web interface for scheduling on-demand tests and for reporting the 
results of automated tests. A subset of the automated functional tests 
is used for the automated exclusion of sites from, and their recovery 
into, the central ATLAS distributed computing workflow management 
system. If the dedicated tests fail, the complete computing site is 
excluded from the ATLAS computing Grid, so that no further production 
or user analysis jobs are brokered to the unhealthy site. Once the HC 
tests succeed again, the compute resources of the site are re-included 
into the computing Grid.
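The exclusion and recovery cycle described above can be sketched as a small state update. This is a simplified illustration and not HammerCloud's actual implementation: the class, the function and both thresholds (`exclude_after`, `recover_after`) are hypothetical.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class Site:
    name: str
    online: bool = True  # True: jobs may be brokered to this site


def apply_test_results(site: Site, results: Sequence[bool],
                       exclude_after: int = 3,
                       recover_after: int = 2) -> Site:
    """Update the site status from recent test outcomes (oldest first).

    True marks a successful test job, False a failed one. Both
    thresholds are illustrative assumptions, not the real policy.
    """
    if site.online:
        tail = results[-exclude_after:]
        # Exclude only if the most recent tests all failed.
        if len(tail) == exclude_after and not any(tail):
            site.online = False
    else:
        tail = results[-recover_after:]
        # Re-include only if the most recent tests all succeeded.
        if len(tail) == recover_after and all(tail):
            site.online = True
    return site


site = Site("EXAMPLE_SITE")  # hypothetical site name
apply_test_results(site, [True, False, False, False])  # site is excluded
apply_test_results(site, [False, False, True, True])   # site is re-included
```

Using separate thresholds for exclusion and recovery is one possible design choice: it prevents a single isolated failure or success from toggling the site status.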
 
Fig. 2: Working mode of HC (graphic: HammerCloud Twiki)
Since 2019 one of the two experiment-specific coordinators has come 
from our group. We are involved in site support, maintenance tasks and 
new developments, such as the implementation of several monitoring web 
pages within the HammerCloud web platform, e.g. site and cloud 
overviews, auto-exclusion summaries, and nightly test visualizations. 
Additional features have been developed, such as centralized 
benchmarking and JobShaping. The latter is a newly developed feature 
that speeds up the auto-exclusion and recovery decisions by dynamically 
adjusting the number of parallel running jobs. It also provides 
additional information about the root causes of failures by 
automatically submitting dedicated debug jobs to sites with failing 
test jobs.
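The idea behind JobShaping can be illustrated with a short sketch. The rule below (double the number of parallel test jobs while a verdict is pending or failures occur, halve it when the site is healthy) is an assumption chosen for this example, as are the function name and the limits; it is not the algorithm used in HammerCloud.

```python
from typing import Tuple


def shape_jobs(current: int, failed: int, finished: int,
               minimum: int = 1, maximum: int = 10) -> Tuple[int, bool]:
    """Return (new number of parallel test jobs, submit a debug job?).

    `failed` and `finished` count recent test jobs at one site. All
    limits and the doubling/halving rule are hypothetical.
    """
    if finished == 0:
        # No results yet: run more jobs so that a verdict arrives sooner.
        return min(current * 2, maximum), False
    if failed > 0:
        # Failures seen: probe harder and ask for a dedicated debug job.
        return min(current * 2, maximum), True
    # Site looks healthy: fall back towards the minimum test load.
    return max(current // 2, minimum), False
```

For example, `shape_jobs(4, 1, 3)` raises the parallelism from 4 to 8 and requests a debug job, while `shape_jobs(4, 0, 4)` lowers it to 2.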
R&D of innovative digital technologies
Another field of research is optimizing the usage of heterogeneous 
compute resources. As shown in Fig. 3, we operate a computing setup in 
which the resources of the HPC cluster NEMO at the computing centre 
(Rechenzentrum) are integrated transparently and opportunistically into 
the local ATLAS Tier 3 resources by connecting the two clusters with 
COBalD/TARDIS.
 
Fig. 3: Transparent integration of NEMO resources into the local ATLAS Tier 2/3 center.
This allows us to enlarge the compute resources for local ATLAS users 
by a given share of the NEMO cluster, without burdening users with 
switching between the two clusters or with differences in software or 
operating systems. Users submit their jobs to SLURM, the batch system 
of the ATLAS Tier 2/3 center, and the jobs run either on resources of 
the Tier 2/3 center or on virtualized nodes on the HPC cluster.
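Opportunistic integration of this kind needs some rule for deciding how many external nodes to request at a given time. The sketch below shows one simple feedback rule that such a setup could use. It does not use or reproduce the COBalD/TARDIS interface; the function name, the two metrics and all thresholds are assumptions for illustration.

```python
def adjust_demand(demand: int, utilisation: float, allocation: float,
                  low_utilisation: float = 0.5,
                  high_allocation: float = 0.9,
                  step: int = 1, max_nodes: int = 20) -> int:
    """Return the number of virtualized nodes to request next.

    utilisation: fraction of the acquired resources doing useful work.
    allocation:  fraction of the acquired resources assigned to jobs.
    Both metrics and all thresholds are hypothetical for this sketch.
    """
    if utilisation < low_utilisation:
        # Acquired nodes are mostly idle: release some of them.
        return max(demand - step, 0)
    if allocation > high_allocation:
        # Nearly everything is in use: request more nodes.
        return min(demand + step, max_nodes)
    # Otherwise keep the current demand.
    return demand
```

With these defaults, `adjust_demand(5, 0.95, 0.95)` asks for one more node, and `adjust_demand(5, 0.2, 0.3)` gives one back. The `max_nodes` cap stands in for the given share of the HPC cluster.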
In this context we are developing monitoring and accounting solutions 
for heterogeneous site infrastructures, and we are validating and 
benchmarking a possible caching infrastructure for the LHC Run 4 
storage landscape.

