# The 40 Tbit/s challenge

### Real-time analysis of structured data for the LHCb experiment at CERN

### **Dorothea vom Bruch**

**CPPM** Marseille

CEDRE thematic day (Journée thématique CEDRE), December 1, 2022, Campus Saint-Jérôme












- Real-time data analysis in particle physics
- Trend towards heterogeneous computing systems
- Computing challenge: Analyze 40 Tbit/s of data in real-time at the LHCb experiment @ CERN
  - Analyze data on Graphics Processing Units (GPUs)
  - Data structure, algorithms processed in real-time
  - Parallelization strategy
  - Commissioning of system in 2022

### LHC @ CERN





### Particle collisions

- Two beams of proton bunches in opposite directions
- One bunch crossing of the two beams every 25 ns at the four large LHC experiments

→ "Event"

- The proton-proton collisions occur in a region spread along the beamline
- The position of one proton-proton collision is called primary vertex (PV)
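
The 25 ns spacing between bunch crossings fixes the nominal crossing rate. A minimal arithmetic sketch (the variable names are illustrative, not from the talk):

```python
# Nominal bunch-crossing rate from the 25 ns bunch spacing quoted above.
BUNCH_SPACING_S = 25e-9  # time between two bunch crossings, in seconds

crossing_rate_hz = 1.0 / BUNCH_SPACING_S
print(f"{crossing_rate_hz / 1e6:.0f} MHz")  # 40 MHz
```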





### Typical particle detector




### Data challenges in particle physics



- Storage
- Simulation
- Data analysis



### "Trigger": Real-time data analysis and reduction






### When to use hardware versus software trigger?

#### Hardware trigger

Local characteristic signature, for example a particle with high energy or high transverse momentum ($p_T$)

#### Software trigger

Analysis of whole event required → reconstruct all trajectories



### Change in trigger paradigm



#### Access as much information about the collision as early as possible

### Real-time software challenges



#### Largest single internet exchange point: 14 Tbit/s



#### LHCb experiment @ CERN 40 Tbit/s



## Computing performance challenge @ CERN



- In high energy physics, usually assume flat budget for computing cost estimation
- Estimated performance increase for the same budget: 10-15% per year
- Can no longer count on a stable increase for CPU servers
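
To get a feeling for what 10-15% per year means at a flat budget, the growth can be compounded over several years. This is a minimal sketch; the 10-year horizon and the function name are assumptions chosen for illustration.

```python
def cumulative_gain(rate_per_year: float, years: int) -> float:
    """Performance factor after `years` of compound growth at a fixed budget."""
    return (1.0 + rate_per_year) ** years


# Compare the two ends of the 10-15% range over an assumed 10-year horizon.
for rate in (0.10, 0.15):
    print(f"{rate:.0%} per year over 10 years: x{cumulative_gain(rate, 10):.1f}")
```

Under these assumptions the same budget buys roughly 2.6 to 4 times more computing after ten years.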

### Trend towards heterogeneous solutions: TOP500



### Graphics Processing Unit (GPU)

Developed for graphics-oriented workloads









#### Consider how much of the problem can actually be parallelized!
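
One common way to quantify this is Amdahl's law, which bounds the speedup by the fraction of the work that stays serial. The sketch below is a generic illustration; the 95% parallel fraction and the worker count are made-up inputs, not LHCb measurements.

```python
def amdahl_speedup(parallel_fraction: float, n_workers: int) -> float:
    """Upper bound on the speedup when only part of a task runs in parallel."""
    if not 0.0 <= parallel_fraction <= 1.0:
        raise ValueError("parallel_fraction must be between 0 and 1")
    if n_workers < 1:
        raise ValueError("n_workers must be at least 1")
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / n_workers)


# Even with 1000 workers, a 5% serial part caps the speedup near 20.
print(f"{amdahl_speedup(0.95, 1000):.1f}")  # 19.6
```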

### The LHCb experiment at CERN

#### LHC @ CERN



Example: Vertex locator detector



Data produced by every sensor of the detector looks like this:



### Recurrent tasks in real-time data analysis

#### Raw data decoding

- Transform binary payload from subdetector raw banks into collections of hits (x,y,z) in LHCb coordinate system

#### Track reconstruction

- Consists of two steps:
  - Pattern recognition: Which hits were produced by the same particle? → "Track"
    - → Huge combinatorics when testing different combinations of hits
  - Track fitting: Describe track with mathematical model
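
The two steps can be illustrated with a toy example. The sketch below counts the naive hit combinations and then fits a straight line to the hits of one track by least squares. The straight-line model, the hit values and the function names are assumptions for illustration; the talk does not specify the track model used at LHCb.

```python
def n_hit_combinations(hits_per_layer: int, n_layers: int) -> int:
    """Number of ways to pick one hit from each detector layer."""
    return hits_per_layer ** n_layers


def fit_straight_line(hits):
    """Least-squares fit of x = x0 + tx * z to a list of (z, x) hits."""
    n = len(hits)
    if n < 2:
        raise ValueError("need at least two hits")
    mean_z = sum(z for z, _ in hits) / n
    mean_x = sum(x for _, x in hits) / n
    var_z = sum((z - mean_z) ** 2 for z, _ in hits)
    if var_z == 0.0:
        raise ValueError("hits must not all share the same z")
    cov_zx = sum((z - mean_z) * (x - mean_x) for z, x in hits)
    tx = cov_zx / var_z          # slope dx/dz
    x0 = mean_x - tx * mean_z    # intercept at z = 0
    return x0, tx


# Pattern recognition: 100 hits in each of 3 layers already give 10^6 candidates.
print(n_hit_combinations(100, 3))  # 1000000

# Track fitting: hits lying exactly on x = 1 + 0.5 * z.
track_hits = [(0.0, 1.0), (10.0, 6.0), (20.0, 11.0), (30.0, 16.0)]
x0, tx = fit_straight_line(track_hits)
print(f"x0={x0:.2f}, tx={tx:.2f}")  # x0=1.00, tx=0.50
```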

#### Vertex finding

- Where did proton-proton collisions take place?
- Where did particles decay within the detector volume?

#### Calorimeter / muon detector reconstruction

- Reconstruct clusters in the calorimeter / muon detectors
- Match tracks to clusters



## LHCb's first level real-time analysis on GPUs

#### High Level Trigger 1 (HLT1) tasks

- Decode binary payload of five sub-detectors
- Reconstruct charged particle trajectories
- Identify particle types
- Reconstruct particle decay vertices
- Select pp-bunch collisions to store



- Manageable number of algorithms with highly parallelizable tasks
- Ideally suited for parallel architecture of GPUs
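
The first task in the list, decoding a binary payload into hits, can be sketched as follows. The bank layout (32-bit little-endian words packing sensor, row and column), the pixel pitch and the sensor positions are all hypothetical, invented for this example; they are not the LHCb raw bank format.

```python
import struct

PITCH_MM = 0.055                   # hypothetical pixel pitch
SENSOR_Z_MM = {0: 0.0, 1: 25.0}    # hypothetical sensor positions along the beamline


def decode_bank(payload: bytes):
    """Turn a hypothetical raw bank into a list of (x, y, z) hits in mm.

    Assumed word layout: bits 24-31 sensor id, bits 12-23 row, bits 0-11 column.
    An unknown sensor id raises KeyError.
    """
    if len(payload) % 4 != 0:
        raise ValueError("payload must be a whole number of 32-bit words")
    hits = []
    for (word,) in struct.iter_unpack("<I", payload):
        sensor = (word >> 24) & 0xFF
        row = (word >> 12) & 0xFFF
        col = word & 0xFFF
        hits.append((col * PITCH_MM, row * PITCH_MM, SENSOR_Z_MM[sensor]))
    return hits


# Two example words: (sensor 1, row 2, column 3) and (sensor 0, row 0, column 1).
example_payload = struct.pack("<2I", (1 << 24) | (2 << 12) | 3, 1)
example_hits = decode_bank(example_payload)
```

Each word is decoded independently of the others, which is the kind of work that maps naturally onto many parallel threads.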

### Main task: particle trajectory reconstruction



Huge computing challenge for 10<sup>9</sup> – 10<sup>10</sup> tracks / second



### Minimize copies to / from GPU



### How does HLT1 map to GPUs?

| Characteristics of LHCb HLT1                                                                    | Characteristics of GPUs                                                                    |
|-------------------------------------------------------------------------------------------------|--------------------------------------------------------------------------------------------|
| Intrinsically parallel problem:<br>- Run events in parallel<br>- Reconstruct tracks in parallel | Good for<br>- Data-intensive parallelizable applications<br>- High throughput applications |
| Huge compute load                                                                               | Many TFLOPS                                                                                |
| Full data stream from all detectors is read out<br>→ no stringent latency requirements          | Higher latency than CPUs, not as predictable as FPGAs                                      |
| Small raw event data (~100 kB)                                                                  | Connection via PCIe → limited I/O bandwidth                                                |
| Small raw event data (~100 kB)                                                                  | Thousands of events fit into O(10) GB of memory                                            |
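
The last row of the table can be checked with a quick estimate. The sketch below takes the ~100 kB event size and reads O(10) GB as exactly 10 GB, which is an assumption; the result is an upper bound, since intermediate reconstruction buffers also occupy GPU memory.

```python
EVENT_SIZE_BYTES = 100 * 10**3   # ~100 kB raw event size
GPU_MEMORY_BYTES = 10 * 10**9    # O(10) GB, taken as exactly 10 GB here


def max_events_in_memory(memory_bytes: int, event_size_bytes: int) -> int:
    """Upper bound on the number of raw events that fit into memory."""
    if event_size_bytes <= 0:
        raise ValueError("event_size_bytes must be positive")
    return memory_bytes // event_size_bytes


print(max_events_in_memory(GPU_MEMORY_BYTES, EVENT_SIZE_BYTES))  # 100000
```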

### Perfect fit!

### Allen

- Named after Frances E. Allen
- Fully standalone software project: https://gitlab.cern.ch/lhcb/Allen, Sphinx documentation
- Framework developed for processing LHCb's first real-time selection stage (HLT1) on GPUs
- Cross-architecture compatibility via macros & a few coding guidelines
  - GPU code written in CUDA, runs on CPUs, Nvidia GPUs (CUDA), AMD GPUs (HIP)
- Algorithm sequences defined in Python and generated at run-time
- Multi-event processing with dedicated scheduler
- Memory manager allocates large chunk of GPU memory at start-up
- Reconstruction algorithms re-designed for parallelism and low memory usage: O(MB) per core
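
The idea of reserving one large chunk of memory at start-up and handing out slices from it can be sketched with a simple arena allocator. This is a host-memory illustration in Python with hypothetical names; it is not Allen's memory manager, whose actual design is not described in these slides.

```python
class ArenaAllocator:
    """Reserve one buffer up front and hand out aligned slices of it."""

    def __init__(self, capacity: int):
        self._buffer = bytearray(capacity)  # single allocation at start-up
        self._offset = 0

    @property
    def used(self) -> int:
        return self._offset

    def allocate(self, size: int, alignment: int = 8) -> memoryview:
        """Return a view of `size` bytes, starting at an aligned offset."""
        start = (self._offset + alignment - 1) // alignment * alignment
        end = start + size
        if end > len(self._buffer):
            raise MemoryError("arena exhausted")
        self._offset = end
        return memoryview(self._buffer)[start:end]

    def reset(self) -> None:
        """Make the whole arena available again, e.g. for the next batch of events."""
        self._offset = 0


arena = ArenaAllocator(64)
first = arena.allocate(10)   # occupies bytes 0..9
second = arena.allocate(4)   # starts at the next multiple of 8, i.e. byte 16
```

Handing out slices this way avoids repeated allocation calls while events are being processed, at the price of fixing the total memory budget in advance.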



### LHCb: Software-only real-time analysis since 2022

- Two challenges:
  - 1) Connect sub-detectors to server-farm → FPGA card
  - 2) Use best suited computing architecture for reconstruction of particle collisions at 30 MHz
    - → Partial reconstruction fully implemented on GPUs
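
Dividing the 30 MHz input rate by the number of GPUs gives a rough per-GPU throughput target. The sketch below assumes 250 GPUs, taken from the O(250) order-of-magnitude figure in the backup overview table; it is an estimate, not a measured number.

```python
EVENT_RATE_HZ = 30e6   # collisions to reconstruct per second
N_GPUS = 250           # assumption: O(250) read as exactly 250


def rate_per_gpu(event_rate_hz: float, n_gpus: int) -> float:
    """Average event rate each GPU must sustain if the load is shared evenly."""
    if n_gpus <= 0:
        raise ValueError("n_gpus must be positive")
    return event_rate_hz / n_gpus


print(f"{rate_per_gpu(EVENT_RATE_HZ, N_GPUS) / 1e3:.0f} kHz per GPU")  # 120 kHz per GPU
```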



### History: HLT1 architecture choice





- Developed two solutions simultaneously
- Both the multi-threaded CPU & the GPU HLT1 fulfilled the requirements from the 2014 TDR
- Detailed cost-benefit analysis (arXiv:2105.04031)

- GPU solution leads to cost savings on processors and the network
- Throughput headroom for additional features
- Decision: A GPU-based software trigger will allow LHCb to expand its physics reach in Run 3 and beyond.



See also arXiv:2106.07701 on LHCb's energy efficiency with a CPU and GPU HLT1

## GPU HLT1 within data acquisition system



![](_page_28_Picture_2.jpeg)


![](_page_29_Figure_1.jpeg)

![](_page_29_Picture_2.jpeg)

### HLT1 commissioning: Allen within the DAQ system

![](_page_30_Figure_1.jpeg)

### HLT1 commissioning: Towards first collisions

![](_page_31_Picture_1.jpeg)


July 2022: First collisions @ 13.6 TeV at the LHC

Happy trigger commissioning team

![](_page_32_Picture_2.jpeg)

- Real-time analysis systems of particle physics experiments are entering the exascale computing era
- Need to exploit modern computing technologies to face this challenge
- LHCb experiment is commissioning a real-time analysis system fully implemented in software in 2022
- First time in particle physics that 30 million proton-proton collisions per second are processed on GPUs
- Developed Allen: a heterogeneous software framework for multi-event processing
- Gain expertise in heterogeneous DAQ systems
  - → Preparing to exploit emerging new architectures entering the market

![](_page_33_Picture_8.jpeg)

# Backup

### What do we reconstruct at LHCb?

![](_page_35_Figure_1.jpeg)

#### Run 3: 40 Tbit/s → PCIe40 card developed

- Receives data from sub-detectors and transfers it to the server memory for event building via PCIe connection
- Local data processing occurs on the card using only the information from the links connected to it
- Card is generic enough to be re-used by other experiments: ALICE, Belle-II, Mu3e

PCIe40 card

- Towards Run 5: increase bandwidth and processing power by factor 10
- Run 4: PCIe400 card to transfer 400 Gbit/s via PCIe connection
- Run 5: Transfer 800 Gbit/s via Ethernet connection using more powerful FPGA
- Add more local processing to the board in the future to reduce processing load of HLT

![](_page_36_Picture_10.jpeg)

### Overview of GPU usage in various HEP experiments

| Experiment | Main tasks processed on GPU | Event / data rate | Number of GPUs | Deployment date |
|------------|-----------------------------|-------------------|----------------|-----------------|
| Mu3e       | Track & vertex reconstruction | 20 MHz / 32 Gbit/s | O(10) | 2023 |
| CMS        | Decoding, clustering, pattern recognition in pixel detector | 100 kHz | | 2022 (tbc) |
| ALICE      | Track reconstruction in three sub-detectors | 50 kHz Pb-Pb or < 5 MHz p-p / 30 Tbit/s | O(2000) | 2022 |
| LHCb       | Decoding, clustering, track reconstruction in three sub-detectors, vertex reconstruction, | 30 MHz / 40 Tbit/s | O(250) | 2022 |

D. vom Bruch, https://arxiv.org/pdf/2003.11491.pdf


### CPU – GPU – FPGA

|      | Latency | Connection | Engineering cost | FP performance | Serial / parallel | Memory | Backward compatibility |
|------|---------|------------|------------------|----------------|-------------------|--------|------------------------|
| CPU  | O(10) µs | Ethernet, USB, PCIe | Low entry level: programmable with C++, Python, etc. | O(1-10) TFLOPS | Optimized for serial, increasingly vector processing | O(100) GB RAM | Compatible, except for vector instruction sets |
| GPU  | O(100) µs | PCIe, NVLink | Low to medium entry level: programmable with CUDA, OpenCL, etc. | O(10) TFLOPS | Optimized for parallel performance | O(10) GB | Compatible, except for specific features |
| FPGA | Fixed O(100) ns | Any connection via PCB | High entry level: traditionally hardware description languages, some high-level syntax available | Optimized for fixed-point performance | Optimized for parallel performance | O(10) MB on the FPGA itself | Not easily backward compatible |

- Developed for graphics pipeline
- General purpose computations possible
- Increasingly used for AI applications
- Hardware has been specialized in this direction for a few years
- Programmed with high-level language

![](_page_39_Figure_6.jpeg)

CPU: low core count / powerful ALU, complex control unit, large caches → **Latency optimized**

![](_page_39_Figure_9.jpeg)

GPU: high core count, no complex control unit, small caches → **Throughput optimized**

### FPGAs – High Level Synthesis for Neural Networks

- Traditionally, programmed with hardware description languages (Verilog, VHDL) → long development time
- Increasingly, high-level synthesis (HLS) languages are being developed
- Challenges:
  - Fit into resource constraints of FPGA
  - Preserve model performance
- Specialized hardware blocks, such as tensor blocks, are emerging that implement functions for neural networks

FPGA: thousands of logic blocks, I/O blocks, connected via programmable interconnect

![](_page_40_Figure_8.jpeg)

Source: National Instruments