
Our computer chips are not infallible. They make "silent errors": miscalculations that go unnoticed. Dimitris Gizopoulos, Professor of Computer Architecture at the National and Kapodistrian University of Athens (EKPA), and his research team have won a collaboration with Meta to measure and contain this major problem.
When we search for information on an Internet search engine, we trust that the result is reliable. When we log into our e-banking, we believe the amount displayed on the screen is correct and we sleep soundly. How many of us constantly check the results that computers give us? Until recently this confidence was well founded: researchers studying the reliability of computer systems believed that errors in the operation of microprocessors occurred in somewhere between 1 in a million and 1 in a billion chips, which is very rare. In February 2021, however, Meta (then Facebook) stirred things up by publishing a study stating that microchip errors are far more common, reaching 1 in 1,000 processor chips. Simply put, one chip in a thousand gives wrong results. A small earthquake shook the tech community. The New York Times wrote, characteristically, "Tiny chips, big headaches."
Four months after Meta's publication, Google confirmed the issue and its frequency. Meta then announced a call to the international academic community for solutions to these so-called "silent errors" of computers. The Faculty of Informatics and Telecommunications of the National and Kapodistrian University of Athens is one of the five winning institutions. Dimitris Gizopoulos, Professor of Computer Architecture, and his team convinced Meta that they could find answers to this complex and potentially troubling problem. "This is a significant success, testifying to the high level of our university institutions and researchers, even in the field of advanced technology research," notes the author of the proposal. "62 proposals were submitted from 54 universities around the world, and alongside ours, the approved proposals came only from leading institutions in North America: Stanford, Carnegie Mellon, and Northeastern in the USA and British Columbia in Canada," the professor adds.
"A computer can malfunction for many reasons," explains Mr. Gizopoulos. "The microprocessor may be improperly designed or manufactured, its operation may be affected by environmental factors (radiation, temperature), or, finally, it may begin to wear out from intensive and prolonged use. All four of these causes can make our programs misbehave. This is nothing new; we have known it for decades. What we did know was that the main memory and the magnetic or other external storage devices, the so-called disks, were the weak points. What we did not know was that the problem is so extensive in the central processing unit (CPU), that is, the processor."
Gizopoulos emphasizes that in storage units there are easier ways to detect an error or to fix it without the user noticing. Even when an error cannot be corrected, it usually produces a malfunction that is easy to detect; this is what happens when computers freeze or Windows shows a blue screen. There are often error-detection or error-correction codes in both hardware and software that tell the user something is not working properly and the result should not be trusted. In the central processing unit, however, in the microprocessor itself, everything is different. There may be a problem, and the result of a critical arithmetic calculation may be incorrect, yet no one will ever know about it in time; that is why the errors there are called "silent."
"If I add 5 plus 7 in Excel and it doesn't give me 12, I notice right away," Mr. Gizopoulos explains. "But I don't use Excel for such simple calculations. I ask for 11.356 times 145.8, it gives me a result, and I go and buy a car or see how much money I have in the bank. Do you ever check what Excel gives you?" The error Meta described in its first publication is exactly this: a particular processor fails to compute a power correctly, raising 1.1 to the 53rd (1.1^53), while every other processor gets it right. Instead of the correct result, it returns zero. That zero corresponded to the size of a file on disk. The client could see the file on disk, but the counter reported zero, so they asked, "If the file exists, why is it telling me zero?" Another CPU did that calculation correctly but failed on a different power of 1.1, and a third on yet another. Dozens of engineers spent months hunting for what was going on until they found the faulty processors and the wrong instruction in each of them. Panic reigned in the company.
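To picture the kind of check that exposes such a fault, here is a minimal sketch in Python; it is not Meta's actual test code, only an illustration built around the published example of a core that returned zero for 1.1 raised to the 53rd power. It computes the value twice, once with the library pow() and once by repeated multiplication, and flags a mismatch; on a healthy core the two paths agree.

import math

def check_power(base: float, exponent: int) -> bool:
    """Compare the library pow() against a repeated-multiplication reference."""
    fast = math.pow(base, exponent)      # single call to the pow routine
    slow = 1.0
    for _ in range(exponent):            # independent reference path
        slow *= base
    # A healthy core gives two nearly identical results; the defective core
    # Meta described returned 0.0 on the pow path, which fails this check.
    return math.isclose(fast, slow, rel_tol=1e-9)

if __name__ == "__main__":
    print("pow(1.1, 53) =", math.pow(1.1, 53))
    print("consistent:", check_power(1.1, 53))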
Meta and then Google identified the problem because they do not have a single laptop running a few hours a day like the rest of us, but tens of millions of machines, with four or eight cores each, running all kinds of programs 24 hours a day. "The conclusion that 1 in 1,000 processor chips is faulty is just crazy. Right now, if you walk around our department, we probably have many thousands of processor chips. So many of them are broken and no one knows which ones." What makes the problem so hard to pin down is its "silence": if the program a computer executes does not use the faulty arithmetic unit, all of its operations are correct; but if the program does use it, it will keep producing incorrect results that no one notices.
“It’s a nightmare. It’s exciting. Our main goal is to measure the size of the problem and build tests that will identify broken chips.”
Modeling
"The frequency of the error depends on the hardware, the software and the conditions," Mr. Gizopoulos explains. "It depends on the room temperature, the age of the machine, the altitude and other factors. We are talking about arithmetic operations that simply give an incorrect result; they do not crash the computer, and no error-detection or correction code catches them. A nightmare. It's exciting. It could also be a movie script. Our main goal is to measure the size of the problem and build tests that will identify broken chips. We are trying to model the problem by working with the chip makers, Intel and AMD, and to come up with smart tests so that, when the chips are used on many machines, the errors can be detected and the bad results are not used. The research collaboration has been running for several months now and is just one piece of the big puzzle."
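The screening idea the professor describes can be illustrated with a toy sketch. This is not the team's actual test suite, just a hypothetical example of pinning a fixed arithmetic workload to each CPU core in turn and comparing the results; it is Linux-only, since it relies on os.sched_setaffinity.

import os
import math

def workload() -> float:
    """Deterministic mix of floating-point operations with a fixed result."""
    acc = 0.0
    for i in range(1, 10_000):
        acc += math.sqrt(i) * 1.000001 - math.log(i + 1.0)
    return acc

if __name__ == "__main__":
    cores = sorted(os.sched_getaffinity(0))
    # In a real screening campaign the reference would be a known-good
    # ("golden") value; here we simply take the first core's result.
    reference = None
    for core in cores:
        os.sched_setaffinity(0, {core})   # pin this process to one core
        value = workload()
        if reference is None:
            reference = value
        ok = math.isclose(value, reference, rel_tol=1e-12)
        print(f"core {core}: {'OK' if ok else 'SUSPECT'}")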
Mr. Gizopoulos cannot be sure of the extent of the problem. "We may only be seeing the tip of the iceberg," he warns. "The problem is definitely much bigger. Meta and Google each had a client come back and tell them, 'The calculation you gave me was wrong, because I checked.' How many are the rest, who never complained?"
And why should I care, the layman asks. Why should Mr. Gizopoulos and his team play detective to solve a problem found in Meta's and Google's data centers? The obvious answer is that we all use these two companies' applications. Moreover, the same microprocessor chips are used every day by all of us, on a massive scale, in all our devices: mobile phones, tablets, laptops, desktop computers. The chips in these devices "age," "wear out," and are affected by environmental conditions; they can therefore produce the same silent errors as the Meta and Google chips.
A complex equation
Knowing that 1 out of 1,000 chips can miscalculate inevitably forces a re-evaluation of many things, as our lives become ever more dependent on digital technologies. The fact that it is hard to know which chips are problematic only adds to the concern. "An easy way to verify a result is to run the calculation on two different processors and compare," says the professor, but such methods burden devices' performance and power consumption. "That is why, in the processors we use every day, addressing this problem is neither easy nor cheap. Applications with a high need for reliability come at a high price. Airplanes, for example, use three processors working in parallel to do the same job. Banks also systematically check the results of their calculations; they have excess processing power. In some applications, however, there may be serious risk and few ways to mitigate it. The problem starts with scale. If the scale and load conditions of the processor are such that you have massiveness, pressure, and complex, time-consuming calculations, then there may be a problem. In supercomputers its footprint can be significant, for example in vaccine research. An error in the calculations will lead you to serious mistakes, which in critical situations can be catastrophic."
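The "run it on two processors and compare" idea quoted above can be sketched as follows. This is an illustrative example, not code from the article: it executes the same computation in two separate worker processes and accepts the result only if they agree. On a desktop OS the two workers usually, but not necessarily, land on different cores, and the doubled work is exactly the performance and energy cost the professor mentions.

from concurrent.futures import ProcessPoolExecutor
import math

def critical_calc(x: float) -> float:
    # The kind of power calculation cited in Meta's example.
    return math.pow(x, 53)

def checked(x: float) -> float:
    """Run the calculation twice in separate processes and compare."""
    with ProcessPoolExecutor(max_workers=2) as pool:
        a, b = pool.map(critical_calc, [x, x])
    if not math.isclose(a, b, rel_tol=1e-12):
        raise RuntimeError(f"silent error suspected: {a} != {b}")
    return a

if __name__ == "__main__":
    print(checked(1.1))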
According to Mr. Gizopoulos, "silent errors" should not cause us to panic. "The problem will never be permanently solved, because processor design and manufacturing methods keep getting more complex, but solutions will be found by those who are willing to pay. Right now the goal is to measure the extent of the problem accurately and to contain it." The rest of us can hope that, as at the end of many crime films, the detective will find the culprits, and we, users and viewers, will continue to sleep peacefully.
Source: Kathimerini
