
Artificial intelligence systems don’t think like humans. In fact, they don’t even understand what they are saying. They can mimic human speech because the models they are based on have “read” a huge amount of text, most of it posted on the Internet.
These texts are the AI’s main source of information about the world, and they influence how it responds to users. For example, chatbots excel at law exams because thousands of pages of exam preparation material are included in their training data.
Tech companies don’t disclose what material they feed their AI systems, but the Washington Post has now analyzed one of those datasets, revealing the websites whose text chatbots are trained on.
The Post analyzed Google’s C4 dataset, a massive snapshot of the content of 15 million websites that has been used to train some of the most important English-language artificial intelligence systems, such as Google’s T5 and Facebook’s LLaMA.
OpenAI does not disclose which datasets it uses to train the AI models that its popular ChatGPT chatbot is based on.
About a third of the sites could not be classified because they no longer appear on the web. As the American newspaper points out, personal and often offensive information ends up in the training data of artificial intelligence systems.
From Wikipedia to Patents
The countless sites from which AI systems learn are dominated by journalism, entertainment and content creation platforms. This partly explains why these particular industries may be threatened by the rise of artificial intelligence.
The top three sites in the training data are patents.google.com (#1), which contains the text of patents from around the world, the online encyclopedia Wikipedia.org (#2) and Scribd.com (#3), a digital subscription library.
Also high on the list is b-ok.org (#190), a site offering “pirated” e-books that the US Department of Justice was trying to shut down.