Rockport Networks Launches 300 Gbps Switchless Fabric, Demonstrates 396-Node Deployment at TACC

2021-11-24 04:06:43 By : Mr. yuiyin zhang

Since 1987-covering the world's fastest computers and their operators

This week, Rockport Networks launched a 300 Gbps switchless network architecture aimed at the needs of the high-performance computing and advanced artificial intelligence markets. Early customers include the Texas Advanced Computing Center (TACC), which has installed the networking technology as part of its Frontera system, and DiRAC/Durham University, which is also using the networking equipment. The Ohio State University high-performance networking team has also worked with Rockport, contributing expertise in standards support.

Rockport's distributed switching function is implemented by its patented rNOS software, a network operating system that runs on the network cards. The software consumes no server resources and is invisible to the server, which sees only a high-performance Ethernet NIC. The network function is distributed down to each node, and the nodes are directly connected to each other through passive cabling. Rockport said that the control plane and the routing plane are both distributed, and that nodes are self-discovering, self-configuring, and self-healing. The software determines the best path through the network to minimize congestion and latency, and it breaks packets into smaller pieces (Rockport calls these FLITs) so that high-priority messages are not blocked by bulk data.
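To make the FLIT idea concrete, here is a minimal Python sketch of how cutting messages into small pieces lets an urgent message overtake a large transfer that is already in progress. It is an illustration only: the article does not describe Rockport's FLIT size or scheduling policy, so `FLIT_BYTES`, `to_flits`, and `transmit_order` are hypothetical names and values chosen for the example.

```python
from collections import deque

FLIT_BYTES = 64  # hypothetical flit size; the article gives no figure


def to_flits(message_id, payload, flit_bytes=FLIT_BYTES):
    """Split a payload into (message_id, index, chunk) flits."""
    return [
        (message_id, i // flit_bytes, payload[i:i + flit_bytes])
        for i in range(0, len(payload), flit_bytes)
    ]


def transmit_order(bulk_flits, urgent_flits, urgent_arrives_after):
    """Return the order in which flits leave one link.

    The urgent message becomes ready once `urgent_arrives_after` flits
    have been sent (or earlier if the bulk queue runs dry). From then on,
    urgent flits are sent ahead of any remaining bulk flits.
    """
    bulk, urgent = deque(bulk_flits), deque()
    pending = list(urgent_flits)  # not yet arrived at the link
    sent = []
    while bulk or urgent or pending:
        if pending and (len(sent) >= urgent_arrives_after or not bulk):
            urgent.extend(pending)
            pending = []
        # Strict priority: drain urgent flits before resuming bulk traffic.
        sent.append(urgent.popleft() if urgent else bulk.popleft())
    return sent
```

With a 640-byte bulk transfer (10 flits) and a 100-byte urgent message (2 flits) arriving after three bulk flits have gone out, the urgent message finishes in slots 4 and 5 instead of waiting behind all 10 bulk flits. Sent as a single unbroken packet, the bulk transfer would have held the link until it completed.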

In addition to rNOS, Rockport Networks' solution also includes three parts:

Rockport Chief Technology Officer Matt Williams said that the products currently shipping are based on an advanced version of a 6D torus, with a high degree of path diversity. He stated that the technology currently supports up to 1,500 nodes, but the architecture is designed to scale to more than 100,000 nodes using topologies such as Dragonfly.
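As a rough illustration of what "path diversity" means in a torus, the following Python sketch models a plain 6D torus. It is not Rockport's topology: the article says only that the product uses an "advanced version" of a 6D torus, so the dimension sizes and the function names `torus_neighbors`, `torus_hops`, and `shortest_path_count` are assumptions made for the example.

```python
from math import factorial


def torus_neighbors(coord, dims):
    """Directly connected nodes: one step forward or back along each axis."""
    neighbors = set()
    for axis, size in enumerate(dims):
        for step in (-1, 1):
            n = list(coord)
            n[axis] = (n[axis] + step) % size  # wrap around the ring
            neighbors.add(tuple(n))
    neighbors.discard(tuple(coord))  # only matters for axes of size 1
    return neighbors


def torus_hops(a, b, dims):
    """Minimum hop count: shortest way around the ring on every axis."""
    return sum(min((x - y) % s, (y - x) % s) for x, y, s in zip(a, b, dims))


def shortest_path_count(a, b, dims):
    """Number of distinct minimum-hop routes between two nodes."""
    offsets, ties = [], 0
    for x, y, s in zip(a, b, dims):
        d = min((x - y) % s, (y - x) % s)
        offsets.append(d)
        if d and 2 * d == s:  # exactly halfway: both directions are equally short
            ties += 1
    count = factorial(sum(offsets))  # interleavings of the per-axis steps
    for d in offsets:
        count //= factorial(d)
    return count * 2 ** ties
```

For a hypothetical 3x3x3x3x3x3 torus (729 nodes, within the 1,500-node figure quoted above), every node has 12 direct neighbors, the node at `(1, 1, 1, 1, 1, 1)` is 6 hops from the origin, and there are 720 distinct shortest routes between the two. That abundance of equally short routes is what gives a routing layer room to steer around congested links.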

To test and validate its solution, Rockport Networks worked with the Texas Advanced Computing Center (TACC) in Austin for about a year. With the support of its new Rockport Center of Excellence, TACC recently installed the Rockport network on 396 nodes of its Frontera supercomputer. (The Dell system, with approximately 8,000 nodes, ranks tenth on the Top500 list and uses Nvidia-Mellanox HDR InfiniBand as its main interconnect.) The Rockport-connected nodes are being used for production science in support of quantum computing research, pandemic-related research, and emergency-response computing for destructive weather events and other large-scale disasters.

"TACC is delighted to be a Rockport Center of Excellence. We run a variety of advanced computing workloads that rely on high-bandwidth, low-latency communications to maintain performance at scale," said Dan Stanzione, Director of TACC and Vice President of Research at UT-Austin. "We are very happy to work with innovative technologies such as Rockport's switchless network design.

"Our team has seen promising initial results in terms of congestion and latency control. We were impressed by the simplicity of installation and management. We look forward to continuing to test new and larger workloads and to extending the Rockport Switchless Network further into our data center," he added.

Williams reported that the Rockport installation at TACC was completed in only a week and a half. "This is actually a two-step process," he said. "Plug in the card, plug in the cable."

Williams told HPCwire that, compared to InfiniBand, customers running their applications under load saw an average performance improvement of 28% and a 3x reduction in end-to-end latency. "Under load, we have better overall performance and consistently deliver better workload completion times. Every workload is different and you don't always see 28%. Sometimes we will be higher or lower, depending on how sensitive the workload is to network conditions. But on average, we see about 28%."

He clarified that these four tests compared the Rockport solution with a 100 Gbps InfiniBand network, but stated that the company saw "very similar results" in internal tests against 200 Gbps InfiniBand. The top HPC workloads use moving-grid fluid dynamics codes.

Speaking of methodology and comparison, Williams said, "The important thing about how we define performance is that it is in production and it is under load. Many traditional network vendors like to focus on raw baseline performance on unloaded infrastructure. However, when you deploy them in production, and you have multiple workloads running in a mix of bandwidth-sensitive and latency-sensitive workloads, you start to see a significant drop in performance compared to what you saw in the baseline test. So we always talk about how we operate and how we perform in a loaded environment, just like you see in a multi-workload production environment."

According to Williams, Rockport network technology has been tested with customers and is now available for mass production. HPC, AI, and machine learning are the beachhead markets, and the company is targeting high-performance applications that are very sensitive to network performance (mainly latency) but also require consistent bandwidth.

"This is a lossless solution, but we still use the standard host interface, so in order to test or deploy our solution, our customers only need to remove the existing IB card, or in some cases the Ethernet NIC, and replace it with our card," Williams said. "No software changes; not a single driver even changes. We appear as a standard Ethernet NIC interface with all the advanced offload features provided."

The solution delivered to customers is the same as the one installed at TACC. Unlike traditional HPC network infrastructure, which prioritizes connecting nodes within a rack, the Rockport setup connects nodes in different racks directly to one another. The takeaway is that the network is not very sensitive to physical location. Williams pointed out that the TACC deployment spans 11 equipment racks across the data center, with direct connections over that distance.

The announcement was endorsed by HPC analyst firm Hyperion Research.

"There is a lot of evidence that switchless architecture can significantly improve application performance, which has traditionally been achieved at a high cost," Hyperion Research CEO Earl C. Joseph said at a press conference. "Making these advances more economically accessible should greatly benefit the global research community and is expected to raise our expectations of the network in terms of research returns and time to results."

Durham University's DiRAC and Ohio State University's Network Computing Lab also issued statements of support.

"In discovering next-generation HPC network technologies, Durham's team continues to push boundaries," said Alastair Basden, COSMA HPC Cluster Technical Manager, DiRAC/Durham University. "Based on the 6D torus, we found that the Rockport Switchless Network is very easy to set up and install. We studied codes that rely on peer-to-peer communication between all nodes with different packet sizes, where congestion usually reduces the performance of traditional networks. We were able to achieve consistent low latency under load and look forward to seeing the impact this will have on larger-scale cosmological simulations."

"Our mission is to provide standard libraries for the advanced computing community, such as MVAPICH2, that support the best performance available on the market. We keep the library fresh through innovative approaches, such as Rockport Networks' new switchless architecture," said DK Panda, a computer science professor and distinguished scholar at Ohio State University, and head of network-based computing research. "We look forward to continuing cooperation with Rockport to define new standards for our upcoming version."

© 2021 HPCwire. All rights reserved.