Computing
ATLAS Grid Computing
The huge amount of data taken by the ATLAS Experiment at CERN in
Geneva requires an enormous data store. This is only feasible by sharing
the work across several computing centres. The LHC experiments combine
their forces in the so-called Worldwide LHC Computing Grid (WLCG), a
distributed computing infrastructure organized in hierarchical tiers.
Fig. 1: WLCG is made up of four layers, or "tiers": 0, 1, 2 and 3. Each tier provides a specific set of services. (graphic: WLCG)
The innermost layer, the Tier 0, is based in the CERN data centre and
provides around 20% of the total compute capacity. The next layer, the
Tier 1, is provided by thirteen large computing centres with sufficient
storage capacity and round-the-clock support for the Grid. By hosting
both huge tape and disk storage, they take care of the safe-keeping of a
proportional share of the raw and reconstructed data, as well as of the
simulated data produced at the Tier 2s. The outermost layer consists of
the Tier 2 centres, which are usually based at universities and research
laboratories. They store data and provide adequate computing power for
analysis tasks, handling analysis requirements as well as a proportional
share of simulated event production and reconstruction. Currently there
are around 160 Tier 2 sites all over the world.
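To keep this division of labour at a glance, the following minimal Python sketch summarizes the tier roles exactly as described above; it is purely illustrative and not an official WLCG data structure.

```python
# Purely illustrative summary of the WLCG tier roles described in the text;
# not an official WLCG data structure.
WLCG_TIERS = {
    "Tier 0": {
        "location": "CERN data centre",
        "role": "provides around 20% of the total compute capacity",
    },
    "Tier 1": {
        "location": "13 large computing centres (e.g. GridKa at KIT)",
        "role": "safe-keeping of raw, reconstructed and simulated data "
                "on tape and disk, round-the-clock Grid support",
    },
    "Tier 2": {
        "location": "about 160 sites at universities and research laboratories",
        "role": "user analysis, simulated event production and reconstruction",
    },
}

for tier, info in WLCG_TIERS.items():
    print(f"{tier} ({info['location']}): {info['role']}")
```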
Since the beginning of data taking at the
LHC our group has successfully operated an ATLAS Tier-2 centre. In order
to cope with the continuously growing data volume during data taking, we
are required to provide gradually more compute power as well as disk
storage over the years.
Our group is also responsible for
several ATLAS-specific operation tasks in the cloud around the Tier-1
centre GridKa at KIT. The performance of all Tier 2 centres in the cloud
(data transfer, job submission, job success rate, etc.) is continually
tested and monitored, and we are responsible for summarizing and
analysing the current status.
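As a rough illustration of how such test results can be condensed into a site status, the following Python sketch aggregates per-site pass/fail results into a success rate; the site names, threshold, and helper function are our own hypothetical example, not the actual monitoring code.

```python
from collections import defaultdict

def summarize_sites(test_results, min_success_rate=0.8):
    """Aggregate (site, test, passed) tuples into a per-site status.

    Hypothetical helper for illustration only; names and threshold are made up.
    """
    counts = defaultdict(lambda: {"passed": 0, "total": 0})
    for site, _test, passed in test_results:
        counts[site]["total"] += 1
        counts[site]["passed"] += int(passed)
    return {
        site: {
            "success_rate": c["passed"] / c["total"],
            "status": "ok" if c["passed"] / c["total"] >= min_success_rate else "degraded",
        }
        for site, c in counts.items()
    }

print(summarize_sites([
    ("site-A", "data transfer", True),
    ("site-A", "job submission", True),
    ("site-B", "job submission", False),
]))
```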
ATLAS HammerCloud Project
Since 2014 we have been active in the ATLAS HammerCloud project.
HammerCloud (HC) is a distributed analysis testing system. It can test
sites and report the results of those tests, and it provides a web
interface for scheduling on-demand tests and reporting the results of
automated tests. A subset of the automated functional tests is used for
automated site exclusion and recovery by the central ATLAS
distributed computing workflow management system. If the dedicated tests
fail, the whole computing site is excluded from the ATLAS computing
Grid, so that no further production and/or user analysis jobs are
brokered to an unhealthy computing site. Once the HC tests succeed
again, the compute resources of that site are re-included into
the computing Grid.
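The following minimal Python sketch illustrates this exclusion and recovery logic under our own simplifying assumptions; the class and function names are invented for illustration and are not the HammerCloud or ATLAS workflow-management API.

```python
# Sketch of the auto-exclusion idea: failing functional tests take a site out
# of brokering, succeeding tests bring it back. Names are illustrative only.
from dataclasses import dataclass

@dataclass
class SiteState:
    name: str
    online: bool = True

def apply_test_results(site: SiteState, functional_tests_passed: bool) -> SiteState:
    if site.online and not functional_tests_passed:
        site.online = False   # exclude: no more production/analysis jobs brokered here
    elif not site.online and functional_tests_passed:
        site.online = True    # recovery: re-include the site's compute resources
    return site

site = SiteState("EXAMPLE_SITE")
apply_test_results(site, functional_tests_passed=False)
print(site)  # SiteState(name='EXAMPLE_SITE', online=False)
apply_test_results(site, functional_tests_passed=True)
print(site)  # SiteState(name='EXAMPLE_SITE', online=True)
```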
Fig. 2: Working mode of HC (graphic: HammerCloud Twiki)
Since 2019 one of the two experiment-specific coordinators has been from
our group. We are involved in site support and maintenance tasks as
well as new developments, such as the implementation of several
monitoring web pages within the HammerCloud web platform, e.g. site and
cloud overviews, auto-exclusion summaries, and nightly test
visualizations. Additional features have been developed, such as
centralized benchmarking and JobShaping. The latter is a newly developed
feature that speeds up the auto-exclusion and recovery decisions by
dynamically adjusting the number of parallel running test jobs, and that
provides additional information about the root causes of failing sites
by automatically submitting dedicated debug jobs to sites with failing
test jobs.
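A simplified sketch of the JobShaping idea, under our own assumptions, is shown below: the number of parallel test jobs grows with the recent failure rate, and debug jobs are triggered for clearly failing sites. All numbers and names are hypothetical and do not reflect the actual HammerCloud implementation.

```python
def shape_jobs(recent_failures: int, recent_total: int,
               base_parallel_jobs: int = 2, max_parallel_jobs: int = 10):
    """Return (n_parallel_test_jobs, submit_debug_jobs).

    Illustrative only; thresholds and scaling are made-up assumptions.
    """
    if recent_total == 0:
        return base_parallel_jobs, False
    failure_rate = recent_failures / recent_total
    # more parallel test jobs when tests fail -> faster exclusion/recovery decisions
    n_jobs = min(max_parallel_jobs,
                 base_parallel_jobs + round(failure_rate * max_parallel_jobs))
    # submit dedicated debug jobs once a site fails most of its recent tests
    return n_jobs, failure_rate > 0.5

print(shape_jobs(recent_failures=4, recent_total=5))  # -> (10, True)
```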
R&D on innovative digital technologies
Another field of research is the optimization of the usage of
heterogeneous compute resources. As shown in Fig. 3, we operate a
computing setup in which we transparently and opportunistically
integrate resources of the HPC cluster NEMO, hosted at the university
computing centre (Rechenzentrum), into the local ATLAS Tier-3 resources
by connecting the two clusters with COBalD/TARDIS.
Fig. 3: Transparent integration of NEMO resources into the local ATLAS Tier 2/3 centre.
This allows us to enlarge the compute resources available to local ATLAS
users by a given share of the NEMO cluster, without burdening the users
with switching between the two compute clusters or dealing with any
differences in software or operating systems. Users submit their jobs to
SLURM, the batch system of the ATLAS Tier 2/3 centre, and use either
resources of the Tier 2/3 centre or virtualized nodes on the HPC
cluster.
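Conceptually, this opportunistic integration can be pictured as a simple demand-driven decision, sketched below in Python. This is not actual COBalD/TARDIS code or configuration; all names, slot counts, and thresholds are illustrative assumptions.

```python
# Conceptual sketch of the opportunistic scaling shown in Fig. 3: when demand
# in the local batch queue exceeds the free local capacity, additional
# virtualized NEMO worker nodes are requested, up to the agreed share;
# when demand drops, fewer nodes are kept. Purely illustrative.
def desired_nemo_nodes(pending_jobs: int, local_free_slots: int,
                       nemo_share_limit: int, slots_per_node: int = 20) -> int:
    """Return the desired number of virtualized NEMO worker nodes."""
    unmet_demand = max(0, pending_jobs - local_free_slots)
    wanted = (unmet_demand + slots_per_node - 1) // slots_per_node  # ceiling division
    return min(nemo_share_limit, wanted)

# Example: 150 pending jobs, 30 free local slots, at most 8 NEMO nodes allowed
print(desired_nemo_nodes(150, 30, nemo_share_limit=8))  # -> 6
```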
In this context we are developing monitoring and
accounting solutions for heterogeneous site infrastructures and are
validating and benchmarking a possible caching infrastructure for the
LHC Run-4 storage landscape.