Transcript of Episode #1029

The Illusion of Thinking

Description: In memoriam: Bill Atkinson. Meta native apps and JavaScript collude for a localhost local mess. The EU rolls out its own DNS4EU filtered DNS service. Ukraine DDoSes Russia's Railway DNS - and so what? The Linux Foundation creates an alternative WordPress package manager. Court tells OpenAI it must NOT delete ANYONE's chats. Period! A CVSS 10.0 in Erlang/OTP's SSH library. Can Russia intercept Telegram? Perhaps. Spain's ISPs mistakenly block Google sites. Reddit sues Anthropic. Twitter's new encrypted DMs are as lame as the old ones. The Login.gov site may not have any backups. And finally, Apple explores the question of recent Large Reasoning Models "thinking."

High quality  (64 kbps) mp3 audio file URL: http://media.GRC.com/sn/SN-1029.mp3

Quarter size (16 kbps) mp3 audio file URL: http://media.GRC.com/sn/sn-1029-lq.mp3

SHOW TEASE: It's time for Security Now!. Steve Gibson is here. We're going to talk about an Apple research paper that explores whether these new large reasoning models are really doing any thinking. I'll give you a hint. They're not. Goodbye to Bill Atkinson. And then what is the Linux Foundation's response to the WordPress kerfuffle? All that and more coming up next on Security Now!.

Leo Laporte: This is Security Now! with Steve Gibson, Episode 1029, recorded Tuesday, June 10th, 2025: The Illusion of Thinking.

It's time for Security Now!, the show where we cover your security, privacy, and everything else, frankly. We're going to do AI today with this guy right here, he's the king of the hill as far as security goes, frankly as far as geekiness goes, Mr. Steve Gibson. Hello, Steve.

Steve Gibson: Hello, Leo. Great to be with you again.

Leo: Although nothing is more geeky than the clock with the milliseconds behind me now.

Steve: Oh, show everybody. Let's just take 15 minutes to discuss this. Open it up, show us...

Leo: No, we already did it on MacBreak Weekly.

Steve: Show us how it's a clapper. We need to see the clapper.

Leo: It's a clapper. It's always - it's really, I think it's very cool. This is the new clock so that people can - because people want a clock behind me for some reason.

Steve: Leo's latest toy.

Leo: Yeah, it's my new toy.

Steve: It came yesterday; right?

Leo: Yeah.

Steve: What's coming tomorrow?

Leo: We'll see.

Steve: Okay. We've got a great episode. Something happened with Meta, and also Yandex, which needs to get some attention, mostly because it's really interesting. We get to do a deep dive. It bears on a lot, as we'll see. And that's how we're going to largely start today's podcast. We're going to finish it by looking at some research that Apple's guys did. And I don't think this is sour grapes because, you know, they kind of missed the AI train. They did some tests using something other than math, which they argue is not a really good way to measure a reasoning model's ability to reason because, if it's just really good at matching patterns, it can score better than people can because we're not that good. Anyway, so lots of fun stuff to talk about. We are going to start by, as your programs have, Leo, remembering somebody. I've got that for the first page of the show notes.

Leo: Oh, good. Oh, good.

Steve: Because he was amazing.

Leo: Yeah.

Steve: We're going to talk about this Meta native apps and JavaScript colluding behind their users' backs. The EU has, believe it or not, rolled out their own DNS service, which the good news is it works much better in the EU than it does here in the states. They didn't, you know, they didn't create it for some guy in Southern California with a benchmark, which is a good thing because, whoa, not good over here. Also, Ukraine DDoSed Russia's Railway DNS, and we're going to pause on that briefly to say, uh, so? The Linux Foundation has created an alternative WordPress package manager because apparently there's some politics over at WordPress land that created some schisms. Oh, and a court has told OpenAI that they must not delete anyone's chats. Anyone's, not just selective people. We're going to dig into that.

Also there is a very bad vulnerability - well, how bad depends upon who uses Erlang/OTP's SSH library. If you do, hopefully you already know about this 10.0 CVSS. There have been some questions raised about whether Russia is able to intercept Telegram messages. Seems like maybe, which would be a surprise. Spain's ISPs blocked Google sites. Whoops. Reddit is suing Anthropic. Twitter's new encrypted DMs are apparently as lame as the old ones were. Also it seems that the Login.gov site doesn't have backups.

Leo: Oh, no. Oh.

Steve: What could possibly go wrong? Wow. And then we're going to look at an interesting way that Apple came up with to generate some really good metrics about to what degree this next generation, like the o3 and Claude 3.7, so-called large, not large language models, but large reasoning models, are actually reasoning, and whether they're just better at language than reasoning. So I think maybe this time, Leo, for Podcast #1029, we've actually got some interesting stuff to talk about.

Leo: Finally, after a thousand podcasts.

Steve: I know. We're getting the hang of it, I think.

Leo: Actually I'm really - this Apple paper was a research paper, is very, you know, scientific and deep.

Steve: Yup.

Leo: And I'm really thrilled that you want to dissect it because I need some help.

Steve: We've got charts. We have charts.

Leo: We've got charts. Charts coming up.

Steve: Charts coming up.

Leo: Of course. We count on Steve every week. Aren't you glad you're here? This is why we're here. Thank you, Steve. We'll get right to the meat of it. Steve, let's talk. We've got a Picture of the Week.

Steve: We do. And I gave this one the caption, "If your kitchen oven challenges you to prove you're human, something has gone very wrong somewhere."

Leo: No. Not a CAPTCHA - wait a minute. We've got to look at this one. A CAPTCHA on an oven?

Steve: What the what? as they would say.

Leo: Oh, no.

Steve: You can see this lady, you can see the reflection of her face with her wearing glasses in the screen of the oven. Somehow she is being asked - the caption that is there on her oven's screen is saying, "Click all the buttons that contain traffic lights." You can see the word "traffic lights."

Leo: Yeah. This is a smart thing. This must be a Samsung oven. That's crazy.

Steve: It is not so smart.

Leo: But they built - what's not smart is they built browsers into their appliances. It's just dumb. And by the way, I have a close friend who has a Samsung refrigerator. She can't use the browser because it's out of date. And that happens so quickly. But this is even worse. Oy, oy, oy.

Steve: So, gee, all I want to do is warm up the pie. And let's see. Where are the traffic lights?

Leo: Insane.

Steve: If your kitchen oven challenges you to prove you're human, well...

Leo: You've got the wrong oven.

Steve: ...it's got a little more technology than it should have.

Leo: Yeah.

Steve: So I wanted to take a moment to note with sadness, as a huge, I mean, the Internet has really responded to the passing of Bill Atkinson, who died last Thursday, June 5th, after losing his battle with pancreatic cancer. And as you noted, that's of course also what took Steve Jobs 14 years ago, back in 2011. Steve was only 56 at the time. Still, Bill went too soon. He was 74, born in 1951.

I got a kick out of what he wrote in the third person of himself for the "About" page of his website. He modestly said: "Aside from being a nature photographer, he [meaning himself] is also well known in the world of software design." Uh-huh. "Years ago, as a member of the original Macintosh team at Apple, he helped design much of the initial Macintosh user interface and wrote the original QuickDraw, MacPaint, and HyperCard software." And as I said, talk about modest. So Bill received his undergrad degree from UC San Diego, which is where he also met the now also famous Apple alumnus Jef Raskin.

Leo: Who also died of pancreatic cancer.

Steve: And isn't that kind of bizarre, Leo?

Leo: Yeah.

Steve: Like were they all drinking the same strange potion that Jobs came up with or what? How could all three of these guys - I don't know. Just seems bizarre. What are the chances?

Leo: It does.

Steve: But on the other hand, you know, we tend to see patterns, even when they don't exist. Anyway, Jef Raskin was Apple employee #31.

Leo: Right. Raskin was #31.

Steve: Yes. And he met Jef at UC San Diego, where he was one of his professors. Then later Bill Atkinson studied neuroscience, neurochemistry at the University of Washington. Then, while Atkinson was closing in on his PhD, Raskin invited him to visit Apple, where of course Steve Jobs got his hooks into him and persuaded him to, oh, forget about school, you don't need one of those degrees. You know, who needs that? Join the company and change the world. And of course Jobs can be very persuasive when he wants to be, so Atkinson became employee #51.

And of course at Apple Bill became the principal designer and developer of the GUI for Apple's Lisa, and later became one of the first 30 members of the original Apple Mac dev team, where he also principally designed the Mac's UI. He is the author of MacPaint which, at the time, I'm sure you remember this, Leo, our jaws dropped. I mean, MacPaint was an astonishing piece of work. No one could believe it. And it was built upon the foundation of the QuickDraw toolbox, which Bill had first written for the Lisa and then ported...

Leo: 1978, just to put it in perspective. This was a long time ago.

Steve: Yup, and then ported that to the Mac. And need I note that QuickDraw was 100% pure Motorola 68000 assembly language?

Leo: Yeah. Yeah.

Steve: Because that's the only way you could get these machines, I mean, in order to create a reasonably priced consumer PC then, you basically had a processor and some fancy Woz hardware in order to map some memory onto the screen. But there was no GPU. You know, it was all "bit banging," as we called it, in order to draw all this.

Leo: In order to - you know, Bill studied with Raskin at UC San Diego.

Steve: Right.

Leo: Where of course UCSD Pascal came from. And Bill really wanted Pascal on the Macintosh. And everybody said no, you can't put UCSD Pascal on the Macintosh. Bill went home.

Steve: Six days. I'm trying to get six fingers on the screen.

Leo: And wrote it and made it work on the Macintosh. And so I wrote 68000 code on my original Mac, but I also remember very well Macintosh Programmers Workshop and being able to write in Pascal on the Macintosh. That was a small language.

Steve: Well, and that's what impressed Steve and forever changed Jobs' opinion of Atkinson.

Leo: Right, yeah.

Steve: And the key to this was that Pascal was based on a pseudo-machine, a p-machine. And so this was the brilliant thing that Bill Atkinson realized. All he had to do was to implement the UCSD Pascal's p-code, the pseudo-machine, in Motorola 68000 code, and then all of the rest - the compiler, the editor that was part of UCSD Pascal - all of that and all the apps and everything would start running. So it was like the perfect thing to do in under a week in order to say, okay, we got UCSD Pascal now.

Leo: So brilliant.

Steve: And actually it was a very nice Pascal. I didn't mess with it on the Mac, but I did on the Apple II because Apple II also had UCSD Pascal. Maybe it was using a soft card. I don't quite remember now. But I remember that I wrote something that solved some sort of puzzles. I think it was just one of the peg-jumping puzzles at the time, and I did it recursively.

Leo: I think now that you say that, that the port was for the Apple II, not the Mac; right? I don't know.

Steve: Oh, maybe it...

Leo: I think it was. I have to go back and look.

Steve: May have been.

Leo: Yeah.

Steve: That would certainly make sense.

Leo: Yeah, kind of makes more sense.

Steve: Yeah.

Leo: What was interesting is the interface for the Macintosh - and this is Inside Macintosh, the volumes - is all in Pascal. So if you wanted to write to the Apple ROM, you could do it...

Steve: Oh, that's right, so Pascal would have existed, it would have been well in place by then.

Leo: That's right. So I think he did it for the Apple II.

Steve: And that makes sense, too, because MacPaint was a hybrid of Pascal and assembly language.

Leo: Right, some of those low-level, well, QuickDraw has obviously got to be in assembly; right?

Steve: Yeah.

Leo: Pretty impressive.

Steve: Anyway, and of course Bill also then famously designed and implemented HyperCard, which gave non-programmers access to programming and database design. And in fact, years later, Bill Atkinson received, it was in 1994, the EFF Pioneer Award for his contributions in the field of personal computing. You interviewed him on a long Triangulation.

Leo: We did four with him. One of them was a five-hour interview, yeah.

Steve: Yup. And you chopped it up into two pieces. So I wanted to let our listeners know that you guys did a great job with Bill Atkinson. Anybody who wants to listen to him and look at him, you know, being interviewed by you, it was great.

Leo: I felt very fortunate to be able to spend so much time with somebody that I admired so much.

Steve: Well, and all of his PhotoCard stuff that you talked about for years.

Leo: Yeah, was incredible.

Steve: Yeah.

Leo: Here's, if you go to my blog, Leo.fm, John "JammerB" Slanina took a bunch of pictures of Bill, and I took a few of him on our set. And you notice, by the way, he brought the Sidekick, he brought the Macintosh, he brought a lot of stuff. And I think it was Alex Gumpel who had an unopened copy of HyperCard that Bill signed for us. Just incredible. I have a link there to all of the interviews that we did with Bill over a period of time. The first one was in 2016 at the Brick House, and the last one was in 2018 in the Eastside studio. And yeah, that was the one we spent five hours together. I just, I'm really - for some reason this really - this one really hit me.

Steve: Well, he was a good guy.

Leo: He really was. He was a genuine...

Steve: And also, you know, 74. Let's not keep leaving at age 74. I'm certainly not planning to.

Leo: We're both getting close, and that's maybe another reason. But also it hit me I think because this is a generation.

Steve: Yes.

Leo: Steve Jobs, Jef Raskin, Bill Atkinson were a generation of people who changed computing forever, and we owe them so much; you know?

Steve: Yeah. There was someone we were - Lorrie and I were looking at or talking about the other day who, I don't know, they're 10. And, like, they will never know a world that doesn't have the Internet, that probably won't be aware of it, really, that didn't have AI Assistant stuff. I mean, they're growing up in an entirely different environment than we did. I mean, it's just - there's no comparison.

Leo: I guess that's why I feel it's incumbent on us to remind them of their elders, the people who made it all possible.

Steve: And then we just seem old. Old.

Leo: Yeah. Back in the day. Oh, well. Oh, well.

Steve: Okay. So if anyone might be at all unsure about just how badly the likes of Meta are determined to surreptitiously track their users' movements around the Internet for the purpose of secretly profiling them, the news I have to share about a recent super-sneaky tracking discovery, something we've never talked about before, will disabuse anyone of any doubts along those lines.

To quickly lay out what it does and how it works, the write-up of this begins with a quick overview. The guys who found this wrote: "We disclose a novel tracking method by Meta and Yandex, potentially affecting billions of Android users." And I'll just say for the record, not only Android. This is cross-platform. But it's being done on Android. "We found that native Android apps, including Facebook and Instagram and several Yandex apps including Maps and Browser" - get this - "silently listen on fixed local ports for tracking purposes."

Okay, now, I'll just interrupt to note that that's actually kind of diabolically brilliant, although I'm not endorsing it. It's not completely new. For example, my own native Windows SQRL client, and the other SQRL clients that people created, running in the user's machine, opens and listens on port 25519 - of course I chose that port because that's the crypto that I used - for connections from a SQRL script running on login pages. The SQRL login JavaScript on a website's login page would send the SQRL client app, which is running on the user's machine, a unique token by opening a TCP connection to the localhost IP where the resident SQRL client app was listening.

The SQRL client app would then connect to the remote site at the URL provided by the website which contained a unique token. It would identify its user, that is, the SQRL client app would identify its user and use the unique token to perform a secure public key authentication. Upon authentication success, the remote site would return a URL which the SQRL client would then forward to the waiting web browser, which would then jump the user to the logged-on page at the site, thus essentially "Presto." Without doing anything, the user would be logged in with complete security that could not be hacked, spoofed, or intercepted. So that's how I've used this feature, which is controversial at best, to allow script running in the browser to connect to something listening on the localhost IP. You know, 127.0.0.1.
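To make the pattern concrete, here's a minimal sketch of that kind of localhost listener in Python. The port number is the one Steve mentions; the one-line token exchange is purely illustrative and is not SQRL's actual protocol:

```python
# Minimal sketch of a native helper app listening on the loopback
# interface, as SQRL's clients did. Port 25519 is from the episode;
# the message format here is invented for illustration only.
import socket

def run_local_helper(port: int = 25519) -> None:
    srv = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    srv.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    srv.bind(("127.0.0.1", port))   # loopback only; unreachable from the network
    srv.listen(1)
    while True:
        conn, _ = srv.accept()
        token = conn.recv(1024).decode().strip()  # token pushed by the page's JavaScript
        # ...the real client would authenticate to the remote site here,
        # using this token, then receive a logged-in URL back...
        conn.sendall(b"https://example.com/logged-in\n")  # URL for the browser to follow
        conn.close()

if __name__ == "__main__":
    run_local_helper()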

So the idea of allowing a website's JavaScript to talk to a local native app is not entirely new. But of course what SQRL was doing was aboveboard and fully documented as part of the protocol. That is decidedly not the case with Meta and Yandex, who were doing this purely for tracking. And, oh, is this powerful for tracking because it bypasses everything. During the development of SQRL there was some worry about this handy facility disappearing, since Microsoft was aware of the potential for the abuse of this, and for a while they tried to shut down browser access to the localhost IP from within the web browser. But turns out there are many other legitimate use cases for this, too; so much so that too many things broke when Microsoft tried to do this, and they were forced to backpedal and leave the facility in place on Windows. And it's obviously there on Android.

So the guys who discovered Meta and Yandex's abuse explained: "These native Android apps receive browsers' metadata, cookies, and commands from the Meta Pixel" - which is what they call it, it's actually a JavaScript - "Meta Pixel and Yandex Metrica scripts embedded on 5.8 million websites. These JavaScripts load on users' mobile browsers and silently connect with native apps running on the same device through localhost sockets. Since native apps have access to device identifiers like the Android Advertising ID or directly handle actual user identities as in the case of Meta apps, this method effectively allows these organizations to link mobile browsing sessions and web cookies to real-world user identities, de-anonymizing users visiting sites embedding their scripts.

"This web-to-app ID sharing method bypasses all" - this is them writing this - "bypasses all typical privacy protections such as clearing cookies, Incognito Mode, and all of Android's permission controls."

Leo: Oy.

Steve: Yes. "It also opens the door for potentially malicious apps eavesdropping on users' web activity." Because nothing prevents other apps from also saying, oh, let's monitor all of these Meta Pixel JavaScripts, which are going to be trying to connect to localhost.

So what we have here is an interesting and extremely privacy-invasive hack. The concept is that this is not leveraging some bug that can be found, fixed, and eliminated. As I noted, Microsoft previously tried and failed to eliminate this capability. I think it was when they were heading toward IE11, as I recall. I think that was the IE that was going to be saying, eh, no more of this localhost business. They had to back away. Maybe it was 10. I don't know. Anyway.

So that everyone's clear about this, the problem Microsoft had with cutting off their browser from all access to the local machine is that it has always been possible to do this. And, as we've often seen, anytime something is possible, it will eventually be done. And once applications have become dependent upon some available mechanism, it's extremely difficult to take it back. For example, many web developers run local web servers on their machines, and they test their web code locally on web browsers running on the same machines. It's entirely practical and easier than needing to set up some second external web server somewhere and talk to it.

Another example is that web browsers have become so powerful that a local application might be written to be "headless," without its own desktop UI and presence on its own. Instead, it will just launch the system's web browser to perform all communication with the user. The user experiences it as a website, but they're actually communicating with an application running on their own local machine. This is done by running a web server on the local machine which the web browser communicates with.
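As a sketch of that headless pattern (everything here is illustrative; the port is an arbitrary choice), such a local app can be little more than this:

```python
# Sketch of a "headless" app: the UI is served from a local web server
# and the system browser is pointed at it. Standard library only.
import http.server
import webbrowser

PAGE = b"<html><body><h1>Hello from a local app</h1></body></html>"

class UIHandler(http.server.BaseHTTPRequestHandler):
    def do_GET(self):
        self.send_response(200)
        self.send_header("Content-Type", "text/html")
        self.end_headers()
        self.wfile.write(PAGE)

server = http.server.HTTPServer(("127.0.0.1", 8080), UIHandler)
webbrowser.open("http://127.0.0.1:8080/")  # the app's "window" is a browser tab
server.serve_forever()
```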

So Meta and Yandex are both abusing this deliberate and formally supported ability of web browsers, not only to connect to faraway remote servers out on the Internet, but also to little local servers set up and running inside any application on the same machine. And there's no obvious way any user can know this is going on, let alone prevent it from happening. Since this problem is not going away, let's take a closer look at what these researchers found.

They wrote: "While there are subtle differences in the way Meta and Yandex bridge web and mobile contexts and identifiers, both of them essentially misuse" - again, this is them writing this - "essentially misuse the unvetted access to localhost sockets. The Android OS allows any installed app with the Internet permission" - which will be all Android apps except maybe Calculator - "to open a listening socket on the loopback interface (127.0.0.1). Browsers running on the same device also access this interface without user consent or platform mediation. This allows JavaScript embedded on web pages to communicate with native Android apps and share identifiers and browsing habits, bridging ephemeral web identifiers to long-lived mobile app IDs using standard Web APIs.

"The Meta (Facebook) Pixel JavaScript, when loaded in an Android mobile browser, transmits the first-party _fbp cookie using WebRTC to UDP ports 12580-12585 to any app on the device that's listening on those ports." They said: "We found Meta-owned Android apps Facebook and Instagram, available on the Google Play Store, listening on this port range."

So here's the step-by-step of this in detail. First, in their normal course of use, the user opens their native Facebook or Instagram app on their device. You know, on any Android device, Android smartphone. The app is eventually switched away from, is sent to the background, and creates a background service to listen for incoming traffic on a TCP port (12387 or 12388) and a UDP port, the first unoccupied port in the range from 12580-12585. Users must be logged-in with their credentials on the apps. So the user is identified to the app, Facebook or Instagram. The user then opens their web browser and visits any one of 5.8 million websites which integrate the Meta Pixel JavaScript. Websites may ask for consent depending upon the website and the visitor's location and, you know, local requirements for them to do so.

The Meta Pixel script sends the _fbp cookie to the native Instagram or Facebook app using the WebRTC protocol. The Meta Pixel script simultaneously sends the _fbp value, so the same cookie it's sending to the local app, it sends it to www.facebook.com/tr. And gee, do you think that maybe "tr" might be short for track? The URL's query tail contains other parameters such as the page's URL, website and browser metadata, and even the event type, like PageView, AddToCart, Donate, Purchase, whatever. The Facebook or Instagram app which has received that _fbp cookie from the Meta Pixel JavaScript running on the browser then transmits that to graph.facebook.com/graphql, along with other persistent user identifiers, which links the user's _fbp cookie ID with their Facebook or Instagram account, thus bypassing all other privacy controls which the industry has created through the past, you know, most recent 10 years or so.

The researchers explain: "According to Meta's Cookies Policy, the _fbp cookie 'identifies browsers for the purpose of providing advertising and site analytics services and has a lifespan of 90 days.' The cookie is present on approximately 25% of the top million websites" - as we saw, 5.8 million overall - "making it the third most common first-party cookie of the web, according to Web Almanac 2024." They said: "A first-party cookie implies that it cannot be used to track users across websites, as it is set under the website's domain. That means the same user has different _fbp cookies on different websites." Right? It's the way it's supposed to be now. "However, the method we disclose," they write, "allows the linking of the different _fbp cookies to the same user, which bypasses existing protections and runs counter to user expectations."

Okay? So just to be clear, this entire surreptitious surveillance system was specifically designed to explicitly and deliberately bypass, not only all user-expressible anti-tracking wishes, but also to circumvent all of the work the browser vendors have invested in to limit and control cross-site tracking. This neatly circumvents all of the explicit first-party domain-tied cookie isolation and stovepiping that our web browsers have recently added specifically to prevent the abuse...

Leo: So evil.

Steve: It is really evil, Leo.

Leo: My god.

Steve: And there is no other purpose. It's doing nothing other than this. There is no other reason.

Leo: And the only way to really remove it is to remove Facebook and Yandex apps from your phone.

Steve: Yeah. This behavior is entirely indefensible.

Leo: I just deleted Facebook from everything. Everything. Unbelievable.

Steve: So that's what Meta has been up to. How does the Russian service Yandex compare? The researchers write: "Since 2017, the Yandex Metrica script initiates HTTP requests with long and opaque parameters to localhost through specific TCP ports: 29009, 29010, 30102, and 30103. Our investigation revealed that Yandex-owned applications - such as Yandex Maps, Navigator, Search and Browser - actively listen on these ports. Furthermore, our analysis indicates that" - get this one, Leo, oh, boy - "the domain yandexmetrica.com is resolving to the loopback address" - I put it into nslookup because I couldn't believe it yesterday, and sure enough it came up - "127.0.0.1."

Leo: What, it resolves to localhost?

Steve: Yes, in order to be extra sneaky. And I'll explain that in a second. "And that the Yandex Metrica script transmits data via HTTPS to local ports 29010 and 30103. This design choice," they wrote, "obfuscates the data exfiltration process, thereby complicating conventional detection mechanisms." Okay. In other words, it's quite sneaky to have a public domain like yandexmetrica.com resolving to the localhost IP 127.0.0.1 since script code analyzers would likely look for the string "localhost" or the IP "127.0.0.1," but Yandex embeds a public-appearing domain name to further obscure what's actually going on. And their use of HTTPS means that any communication is also obscured and less easy to intercept, monitor, and analyze.
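Anyone can repeat that check; a two-line sketch (the domain's resolution may well have changed since the disclosure):

```python
# Reproducing the nslookup check: does yandexmetrica.com really resolve
# to the loopback address? (It may no longer, post-disclosure.)
import socket

addr = socket.gethostbyname("yandexmetrica.com")
print(addr, "->", "loopback!" if addr == "127.0.0.1" else "not loopback")
```

And then Yandex gets even trickier.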

The researchers explain: "Yandex apps contact a Yandex domain, startup.mobile.yandex.net or similar, to retrieve the list of ports to listen to. The endpoint returns a JSON object containing the local port number (30102, 29009) and" - get this - "and a 'first_delay_seconds' parameter which," they wrote, "we believe is used to delay the initiation of the service. On one of our test devices, first_delay_seconds roughly corresponded to the number of seconds it took for the Yandex app to begin listening on local ports, which was around three days." The only possible reason for this is to avoid detection and to prevent any researchers from easily discovering this deliberately concealed behavior. It's really despicable.

Leo: At least Facebook wouldn't do anything the Russians would do.

Steve: Right. They said: "After receiving the localhost HTTP requests from the Yandex Metrica script, the mobile app responds with a Base64-encoded binary payload embedding and bridging the Android Advertising ID among other identifiers accessible from Java APIs like Google's advertising ID and UUIDs, potentially Yandex-specific. As opposed to Meta's Pixel case, all of this information is aggregated and uploaded together to the Yandex Metrica server, mc.yango.com, by the JavaScript code running in the web browser, rather than by the native app. In the case of Yandex, the native app acts as a proxy to collect native Android-specific identifiers, then transferring them back to the browser context through localhost sockets."

Okay. In other words, Meta has their native Facebook or Instagram app doing the communicating with the Meta mothership; whereas the various Yandex apps run native servers that the Yandex JavaScripts communicate with in order to, specifically, to obtain whatever device-specific information Yandex may wish. That information is then returned to the browser from the little local Yandex servers, which the Yandex JavaScript then forwards to Yandex.
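A minimal sketch of that app-as-proxy pattern follows. The port is one of those named in the report; the payload and the CORS header are illustrative, the latter because without a permissive CORS response header the page's JavaScript could not read a cross-origin localhost reply:

```python
# Sketch of a native app acting as a localhost "identifier proxy":
# page JavaScript fetches this URL and gets a device ID back.
# Port 29009 is from the report; everything else is illustrative.
import base64
import http.server

DEVICE_ID = base64.b64encode(b"example-advertising-id").decode()

class IDHandler(http.server.BaseHTTPRequestHandler):
    def do_GET(self):
        self.send_response(200)
        # Permissive CORS lets the embedding page's script read the body.
        self.send_header("Access-Control-Allow-Origin", "*")
        self.end_headers()
        self.wfile.write(DEVICE_ID.encode())

http.server.HTTPServer(("127.0.0.1", 29009), IDHandler).serve_forever()
```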

The researchers point out an additional problem under their heading "Additional risk: Browsing history leak." And Leo, I note that we're at 40 minutes in, so let's take a pause, and then we're going to look at the additional problems that doing this creates. And there are several.

Leo: Now, we should mention that they've stopped doing this; right? This is...

Steve: The day this report was published they went, "Oopsie."

Leo: That's admitting it. That's saying...

Steve: The day it came out it suddenly stopped.

Leo: Oh, we don't do that.

Steve: What are you talking about?

Leo: What are you talking about? OMG.

Steve: I know.

Leo: Is it, now, if I use the Facebook app on a computer, is it doing the same thing? Like the website?

Steve: Well, it would be interesting to see, if you ran the Facebook app on Windows...

Leo: Oh, it would do the same thing.

Steve: ...you could do a netstat and get the application names that are opening and listening on the localhost and see whether Facebook and Instagram apps are listening on localhost. I don't know if that is. I'm not running any of that.
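For the curious, that same check can be scripted. A hedged sketch using the third-party psutil package (pip install psutil; resolving process names may require elevated privileges):

```python
# List which local processes are holding listening TCP sockets,
# i.e., an automated version of "do a netstat and look."
import psutil

for conn in psutil.net_connections(kind="tcp"):
    if conn.status != psutil.CONN_LISTEN:
        continue
    try:
        name = psutil.Process(conn.pid).name() if conn.pid else "?"
    except psutil.Error:
        name = "?"
    print(f"{conn.laddr.ip}:{conn.laddr.port:<6} {name}")
```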

Leo: Right. Yeah. For good reason. Holy cow.

Steve: It is a spy on anyone's machine.

Leo: Willfully bypassing every indication that you as a user have made that you want privacy.

Steve: Yes. And willfully bypassing all of the browser's well-meaning attempts to allow this to happen, but we're going to keep you from tracking with it.

Leo: Now, can I block these ports?

Steve: We're going to be talking about that.

Leo: Okay. That's coming up after our break.

Steve: After our break.

Leo: Obviously I have many questions. All of which will be answered soon. Oh, my goodness.

Steve: Yeah. It's just evil.

Leo: This is why we listen. What a great show. Boy, I want to hear more. Let's go.

Steve: Yeah. So they said under their heading "Additional Risk: Browsing History Leak," they wrote: "Using HTTP requests for web-to-native ID sharing," which is what these guys are doing, "may expose users' browsing history to third parties. A malicious third-party Android application that also listens on the aforementioned ports can intercept HTTP requests sent by the Yandex Metrica script and Meta's communication channel by monitoring the Origin HTTP header." Which is the website domain. Thus any app on the platform is able to use this to basically - the user's web browser has now been turned into a leaking sieve which is broadcasting everywhere the user goes that has either a Yandex or a Meta JavaScript cookie, and anybody is able to listen for it.

They said: "We developed a proof-of-concept app to demonstrate the feasibility of this browsing history harvesting by any malicious third-party app. We found that browsers such as Chrome, Firefox, and Edge are susceptible to this form of browsing history leakage in both default and private browsing modes." You can't hide from this. "The Brave browser was unaffected by this issue due to their blocklist and the blocking of requests to the localhost; and DuckDuckGo was only minimally affected due to missing domains in their blocklist." I didn't understand what they meant by that. But it's interesting that Brave does have localhost blocked.

"While the possibility for other apps to listen to these ports exist, we have not observed any other app, not owned by Meta or Yandex, listening to these ports. Due to Yandex using HTTP requests for its localhost communications, any app listening on the required ports can monitor the website a user visited with these tracking capabilities as demonstrated by the video above." And they had a video on their site showing it. They said: "We first open our proof of concept app, which listens to the ports used by Yandex, and send it to the background. Next, we visit five websites across different browsers. Afterwards, we can see the URLs of these five sites listed in the app."

In other words, once this local system abuse is present, there's nothing to prevent other apps from establishing their own competing services, little servers, and hooking into this illicit extra-browser communications to obtain for their own purposes the same Internet-wide tracking and monitoring that the Meta and Yandex apps are deliberately employing.

Finally, summarizing things, they wrote: "This novel tracking method exploits unrestricted access to localhost sockets on the Android platforms, including most Android browsers. As we show, these trackers perform this practice without user awareness, as current privacy controls - sandboxing approaches, mobile platform and browser permissions, web consent models, incognito modes, resetting mobile advertising IDs, or clearing cookies - are all insufficient to control and mitigate it. We note that localhost communications may be used for legitimate purposes such as web development. However, the research community has raised concerns about localhost sockets becoming a potential vector for data leakage and persistent tracking. To the best of our knowledge, however, no evidence of real-world abuse for persistent user tracking across platforms has been reported until our disclosure.

"Our responsible disclosure to major Android browser vendors led to several patches attempting to mitigate this issue, some already deployed, others currently in development. We thank all participating vendors (Chrome, Mozilla, DuckDuckGo, and Brave) for their active collaboration and constructive engagement throughout the process. Other Chromium-based browsers should follow upstream code changes to patch their own products.

"However, beyond these short-term fixes, fully addressing the issue will require a broader set of measures as they are not covering the fundamental limitations of platforms' sandboxing methods and policies. These include user-facing controls to alert users about localhost access, stronger platform policies accompanied by consent and strict enforcement actions to proactively prevent misuse, and enhanced security around Android's interprocess communication mechanisms, particularly those relying on localhost connections."

So I'll add that, while these guys are only focusing upon, as I said earlier, mobile platforms, this is not a mobile-only problem. As I said, my implementation, and others, of this legitimate intra-platform communication for SQRL's use works cross-platform everywhere, on both mobile and desktop. So we know that there are currently no controls for this.

My own feeling is that no browsers should allow this by default. It's just too dangerous to permit out of the box. So the default should be for browsers to block and notify their user when any website they visit attempts to open a backdoor channel to something running, perhaps surreptitiously, on their own local machine. Any legitimate use of this, such as for web development, would then expect and permit this. And a browser might offer some configuration. There might be like, for example, three settings: block and don't notify, or request permission, or always allow.

And as another option - since, for example, Firefox certainly appears to have no upper limit on the number of fine-grained configuration settings that it's able to manage - a user might permit this localhost network communication only over certain ports, such as the standard web ports 80 and 443, to permit local web server access while blocking all other high ports that apps might use.
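In pseudocode terms, the policy being proposed here is simple. A sketch (the names are hypothetical illustration, not any browser's actual API or settings):

```python
# Sketch of the proposed browser policy: block script-initiated
# localhost requests by default, with a per-port allowlist.
ALLOWED_LOCALHOST_PORTS = {80, 443}  # permit local web-dev servers only

def allow_localhost_request(port: int, user_typed_address: bool) -> bool:
    if user_typed_address:
        return True  # the user deliberately navigated to localhost
    return port in ALLOWED_LOCALHOST_PORTS  # otherwise: block and notify
```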

Technology aside, this makes one sort of shake one's head, Leo, and I know your head's been shaking for the last half hour.

Leo: Yes, no kidding.

Steve: You know, Yandex is Russian. So they're not friends of the West, and they're certainly not...

Leo: Or privacy.

Steve: Right. And they're certainly not on any friendship trajectory toward the West.

Leo: Right.

Steve: But Meta is a huge and, we would wish, responsible U.S. corporation that would like to have and deserve the trust of its users. But the design and installation of these covert backdoors in their apps, which can only have the purpose of communicating with matching user-tracking web scripts spread across 5.8 million Internet sites, really deserves the attention, I think, of U.S. authorities.

And, as you noted, Meta knows this was wrong because this horrifying behavior was immediately shut down, the same day, after the publication of this research. They got caught bypassing all user choice and anti-tracking browser enforcement and immediately turned it off. They're able to do this since those JavaScripts are all being sourced by their own content delivery network. So it was only a matter of changing the code being sent out from the mothership. But their apps will still be opening and listening for any local web browser connections. Who's to say where, when, and how they might attempt to resume this behavior in the future? Who would know?

Leo: Yeah, I'm sure they'll try something else. These guys are smart.

Steve: Boy. And this just demonstrates how determined they are. They insist on profiling their own users.

Leo: Well, if there's any question in anybody's mind about whether Facebook was evil, there should not be any question. Evil's maybe a strong term. Not your friend?

Steve: Amoral.

Leo: Yeah. I mean, I'm sure in their minds it's justified because they need that tracking to sell ads, and that's their revenue model. I think it's really good that you've exposed them, and these guys have exposed them, and everybody should know this. Should somebody - so a couple of things. One point somebody made is - Paul did, Paul Holder, your friend and ours - because as soon as everybody knew this we could have reverse-abused them and flooded them with fake sites and IDs, which is true. As soon as it becomes public, it's easy to fake.

Another person pointed out, OutofSync, also very smart, that it would be nice if you'd get a popup when the browser is accessing a localhost because...

Steve: Yup.

Leo: ...that's definitely...

Steve: That's questionable behavior.

Leo: There's times when you do that. I do it. But you know you're doing it. If it's happening, and you haven't done it on purpose...

Steve: Exactly.

Leo: ...that's not good.

Steve: Or I would say if the user puts localhost into the URL address, then they're deliberately going to a localhost server.

Leo: Right, right.

Steve: If scripts try to access localhost - oh, and boy, isn't that tricky, setting up YandexMetrica.com to resolve to 127.0.0.1, ohhh.

Leo: See, what domain registrar would allow that? I guess you just change the DNS to point...

Steve: Exactly. It's just the DNS pointing there.

Leo: Yeah. Wow. Unbelievable.

Steve: Yeah, I mean, there is no excuse for this. They got caught.

Leo: They got caught.

Steve: And, I mean, their own guilt is demonstrated by the fact that they immediately turned it off. It's like, oops. Bad idea, guys.

Leo: Wow, what a story. Thank you for that.

Steve: Yeah. Let's take another break since we're now at an hour, and then we're going to look at the DNS servers, the new service that's been set up in the EU by the EU.

Leo: I think that's fascinating.

Steve: Yeah.

Leo: Boy, that's really interesting.

Steve: Just don't use them from the U.S.

Leo: Just as long as you trust them it's okay.

Steve: Actually there's been some question.

Leo: Yeah. Why would they - oh, it's a service.

Steve: Yes. And no one makes anyone, you know, use their DNS. So I think it's aboveboard. Anyway...

Leo: It's interesting.

Steve: ...we'll get to that in a second.

Leo: Yeah. Ah, okay, Steve.

Steve: Okay. So last week you could go, Leo, to joindns4.eu.

Leo: Funny that it's in English.

Steve: Last week the European Union launched its own multi-flavor DNS service.

Leo: They call it a safe space.

Steve: Uh-huh. Join the European safe digital space. So there are flavors for government, for telcos, and for home users. The service is designed to provide secure and privacy-focused DNS resolvers for the EU bloc as an alternative to U.S. and other foreign services.

Leo: Ah.

Steve: So they want their own.

Leo: That makes sense, yeah.

Steve: They want their own.

Leo: Okay. Okay.

Steve: The project was first announced back in October 2022 and was built under the supervision of the EU cybersecurity agency ENISA. It's currently managed by a consortium led by the Czech security firm Whalebone, and members include cybersecurity companies, CERTs, and academic institutions from 10 EU countries.

Leo: Sounds good.

Steve: I confirmed the "Whalebone" ownership since I immediately dropped the various DNS resolver IPs into GRC's DNS Benchmark, and the Benchmark's ownership tab showed they were all within a network owned by "WHALEBONE S.R.O." Now, naturally, these EU resolvers include built-in DNS filters for malicious and malware-linked domains - that is, filtering them out - that prevent users from connecting to known bad sites. The lists are managed from a central location by EU threat intel analysts, and none of this costs anything for EU users, or anybody, for that matter, nor companies, or any governments that might decide to adopt the service.

The pitch to governments and telcos is that having the EU offer a trusted DNS service can eliminate the costs and overhead associated with running their own DNS infrastructure. And to the degree that independent DNS services required security personnel to manage and filter the directory, you know, like upkeep and all that, that can now be offloaded to the dedicated DNS4EU team. The variations that are offered for DNS which are targeted to home users give people a choice of different profiles. You know, malicious domains can be removed; adult content; ad filtering, interestingly.

Leo: Oh. So this is like NextDNS or OpenDNS or 1.1.1.1, like Cloudflare.

Steve: Exactly. Exactly.

Leo: Ah, interesting.

Steve: And so on their page for home users they offered, they say, "Choose the Resolver That Fits Your Needs." So at 86.54.11.1, that's the Protective Resolution that removes, you know, questionable and malware domains. If you use .11.12, you get Protective + Child Protection. So it removes adult content. Or if you use 11.13, you get Protective + Ad blocking. 11.11 gives you all of that - Protective, Child Protection, and Ad blocking. Or if you go to 11.100, that is to say, 86.54.11.100, you get Unfiltered DNS, all of the domains that are available on the 'Net.
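For anyone who wants to try one of these profiles without touching system settings, here's a sketch using the third-party dnspython package (pip install dnspython); the IP below is the Protective Resolution profile just listed:

```python
# Query a single name through one of the DNS4EU resolver profiles.
import dns.resolver

resolver = dns.resolver.Resolver(configure=False)  # ignore the system's resolvers
resolver.nameservers = ["86.54.11.1"]              # DNS4EU "Protective Resolution"
for record in resolver.resolve("grc.com", "A"):
    print(record)
```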

Now, while it would be nice to have government-backed free DNS web content filtering, I have a DNS Benchmark. And so I immediately dropped those IPs in, wondering how those five resolver IPs would fare on the Benchmark. And I was not impressed. I included a clip from the Benchmark showing the performance, where the word "atrocious" comes to mind. But stop, because people in the EU have since confirmed they work great over there. And of course that's what you'd expect; right?

For me in Southern California, their average response time ranged from 163 to 173 milliseconds, which is very slow. For example, compare that to Cloudflare's DNS that the same benchmark had come in at 20 milliseconds. But again, I want to make sure everybody understands, they didn't do this for me in Southern California. The European Union is not suggesting that someone located in Southern California should be using their DNS at all, let alone...

Leo: Is it typically the case that if it's geographically closer, it's faster?

Steve: Yes, because the packets have to travel all that distance.

Leo: But it's at the speed of light, Steve. I mean...

Steve: Yeah, but it turns out it's got to go across the ocean, undersea cable, you know...

Leo: Does it go through other servers, too, on the way?

Steve: Yeah. It is bouncing, well, I am connecting directly to that server. The reason that Cloudflare is so fast anywhere is that they're a CDN.

Leo: Right.

Steve: You know, you use a Cloudflare IP, what is it, 1.1.1.1, well, you're not actually, you know, that's an anycast IP.

Leo: Right.

Steve: You're actually being routed to some very local Cloudflare DNS server that is physically close to you, even though I use that and people in the EU use that IP. They're getting a Cloudflare server near them.

Leo: So it makes sense that Whalebone would be slow from Southern California.

Steve: Yes. And again, I want to make sure everybody understands. I posted to the grc.dns.dev newsgroup where we've all been testing this evolving next-generation DNS Benchmark code. And I asked anybody who's located in the EU to give the same set of DNS IPs a run. Because of the time zone difference, I didn't hear back by the time I posted today's show notes. Since then, I have, and for anybody in the EU, they're getting great performance.

Leo: Oh, good.

Steve: They're getting the same 20 millisecond-ish performance from those.

Leo: You'd expect that, yeah.

Steve: Yes. And that's why, frankly, GRC's Benchmark is so valuable is, you know...

Leo: Right. It's not the same for everybody.

Steve: I don't get the same thing as when somebody else runs it.

Leo: Yeah, right, right, right.

Steve: It matters where you're running it from. Which is to say, you know, and that's the DNS server you want to choose for that location. So the DNS services are available under all protocols, IPv4, IPv6, DNS over UDP, and - so those, you know, IPv4 and v6 over UDP, but also DoH and DoT where you get privacy-enforcing secure DNS over TCP with TLS. So the Benchmark showed them in green, which also indicates that they support DNSSEC security so that the records that are available, they will support signed, cryptographically signed DNS records to prevent anyone from spoofing or altering those records. So anyone in the EU wishing to explore this further should jump their browser over to joindns4.eu, where you'll find all the information.
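And for a crude, one-query taste of what the Benchmark measures properly with repeated sampling, the same dnspython setup can time a lookup against DNS4EU and Cloudflare from wherever you happen to be:

```python
# Time one identical lookup against two resolvers. A single query is
# only suggestive; GRC's Benchmark does the statistically honest version.
import time
import dns.resolver

for ip, label in [("86.54.11.1", "DNS4EU"), ("1.1.1.1", "Cloudflare")]:
    r = dns.resolver.Resolver(configure=False)
    r.nameservers = [ip]
    start = time.perf_counter()
    r.resolve("example.com", "A")
    print(f"{label:10} {(time.perf_counter() - start) * 1000:6.1f} ms")
```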

Leo: It's free; right?

Steve: Yeah. It's free.

Leo: See, I understand if you're in the EU you might want to use this. If our government decided to make a DNS server, I don't think I'd use it.

Steve: No.

Leo: I just don't think so. Don't want to use the DOGE DNS server.

Steve: While we're on the topic of DNS, I noted that Ukraine's military intelligence agency claims that it took down the DNS service of the Russian Railways using a 6 gigabit-per-second, 2.5-million-packet-per-second DDoS attack. The reporting was in Ukrainian news, and it was in Ukrainian, and I didn't bother to dig any further. It's unclear to me what that accomplished. You know?

Leo: It was fun. We could do it.

Steve: Yeah. As we know...

Leo: Trains do not run on time now.

Steve: Yeah. Any attack on DNS would need to be sustained until the local DNS caches expired. At that point, things would begin to collapse. But it wasn't clear what would collapse. Would the trains no longer run at all? Would the scheduling and the ticket sales fail? I don't know. Now, that said, using a large number of inexpensive, stealthily inserted autonomous drones to remotely take out many extremely expensive Russian cruise missile-launching warplanes, now, that's something to write home about.
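On the caching point: the TTL on a zone's records is what determines how long resolvers keep serving answers after the authoritative servers go dark. A quick way to see a record's TTL, again with dnspython (the domain is just an example):

```python
# A record's TTL bounds how long caches shield a site from a DNS outage.
import dns.resolver

answer = dns.resolver.resolve("example.com", "A")
print("cached copies can survive for up to", answer.rrset.ttl, "seconds")
```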

Leo: And a 6Gb attack is not that - that's not that big.

Steve: No. Those are like, okay, well, okay. I guess it wasn't a very, well, and it's probably some server in a closet somewhere that started to smoke. But okay, who cares? It's the Russian Railway.

Leo: It smokes a lot anyway. It's the Russian Railway.

Steve: Right. The Linux Foundation has launched what they call the FAIR WordPress package manager. Given the astonishing number of websites that use the WordPress core as their content management system, their CMS, I always want to keep our listeners abreast of any important WordPress-related news. So when the Linux Foundation announces the launch of their replacement for WordPress.org's own package manager, that makes the news cut.

I haven't kept up to date on the politics surrounding WordPress and Automattic. But the reporting that I saw said: "The new system is a decentralized alternative to the WordPress.org plugin and theme ecosystem developed with help from veteran WordPress developers who were pushed out from the main WordPress project last year during a power grab by Automattic and Matt Mullenweg."

Leo: Oh. Ow.

Steve: So there.

Leo: Yeah.

Steve: So what I do know is that this replacement looks pretty sweet. It's called the "fairpm" page, so it's github.com/fairpm. They explain: "The FAIR Package Manager is an open-source initiative backed by the Linux Foundation. Our goal is to rethink how software is distributed and managed in the world of open web publishing. We focus on decentralization, transparency, and giving users more control. Our community brings together developers, infrastructure providers, and open web contributors and advocates who all share the same mission: to move away from centralized systems and empower site owners and hosting providers with greater independence.

"FAIR is governed through open working groups and consensus-driven processes, ensuring that its development reflects the needs of the broader community. Whether you're a contributor, a host, or an end user, FAIR invites participation at every level, from writing code and documentation to community organization and governance. As a community-led project, we aim to build public digital infrastructure that's both resilient and fair. The FAIR Package Manager is a decentralized alternative to the central WordPress.org plugin and theme ecosystem, designed to return control to WordPress hosts and developers. It operates as a drop-in WordPress plugin, and seamlessly replaces existing centralized services with a federated, open-source infrastructure."

And then they finished with: "There are two core pillars to the FAIR system. First, API Replacement: It replaces communication with WordPress.org APIs, such as update checks and event feeds, using local or FAIR-governed alternatives. Some features, like browser version checks, are handled entirely within the plugin using embedded logic," and they said, "for example, browserslist. And then, second, Decentralized Package Management: FAIR introduces a new package distribution model for themes and plugins. It supports opt-in packages that use the FAIR protocol and enables hosts to configure their own mirrors for plugin/theme data using AspirePress or their own domains. While stable plugins currently use mirrors of WordPress.org, future versions will fully support FAIR-native packages." So anyway, this seems like a useful addition to the Internet's number one web authoring and delivery system.

Leo: Yeah. Kind of a rebuke to Matt Mullenweg.

Steve: Yeah. Yeah, especially when it was created by people who were pushed out, who were old WordPress hands.

Leo: Right.

Steve: So they said, okay, fine, we'll do our own.

Leo: However you feel about Matt, it does seem appropriate that WordPress should not be dependent entirely on WordPress.org for its libraries, I think.

Steve: Yeah. It's just too important.

Leo: Yeah.

Steve: It's gotten, I mean, it's too big a success, essentially.

Leo: Right.

Steve: Okay. I was reminded of my recent discovery and reporting of the privacy-preserving, I mean, explicitly and deliberately privacy-preserving and unfiltered conversational AI, which we talked about a couple weeks ago, "Venice.ai," when I saw Ars Technica's headline "OpenAI slams court order to save all ChatGPT logs, including deleted chats," with the subhead "OpenAI defends privacy of hundreds of millions of ChatGPT users." Yikes. And when Ars says "all ChatGPT logs," they mean all of every user's ChatGPT logs, not just those of selected users. Not just users that some court order might say, you know, like under subpoena you must save. So this is everyone's ChatGPT interactions, period.

Leo: Even if you explicitly say "Delete this interaction."

Steve: Yes. Yes.

Leo: Which is the big problem here.

Steve: They are not legally able, they are not currently legally allowed to actually delete people's chats. So it seems clearly better for ChatGPT to never have any logs to save in the first place, which is one of the features of that "Venice.ai" service. To understand what's going on here, I think the details are worth sharing. So here's what Ars reported.

They said: "OpenAI is now fighting a court order to preserve all ChatGPT user logs, including deleted chats and sensitive chats logged through its AI business offering, after news organizations suing over copyright claims accused the AI company of destroying evidence. OpenAI explained in a court filing demanding oral arguments in a bid to block the controversial order: 'Before OpenAI had an opportunity to respond to those unfounded accusations, the court ordered OpenAI to preserve and segregate all output log data that would otherwise be deleted on a going forward basis until further order of the Court, in essence the output log data that OpenAI has been destroying.

"In the filing, OpenAI alleged that the court rushed the order based only on a hunch raised by The New York Times and other news plaintiffs. And now, without 'any just cause,' OpenAI argued, the order 'continues to prevent OpenAI from respecting its users' privacy decisions.' That risk extended to users of ChatGPT Free, Plus, and Pro, as well as users of OpenAI's application programming interface, OpenAI said. The court order came after news organizations expressed concern that people using ChatGPT to skirt paywalls might be more likely to delete all [their] searches to cover their tracks. What? Okay. Even that seems kind of farfetched to me. Do people even know that what they're getting from ChatGPT skirted a paywall? OpenAI said that.

"Evidence to support that claim, news plaintiffs argued, was missing from the record because, so far, OpenAI had only shared samples of chat logs that users had agreed that the company could retain." Okay. They're being responsible; right? Respecting their users' privacy concerns. "Sharing the news plaintiffs' concerns, the judge, Ona Wang, ultimately agreed that OpenAI likely would never stop deleting that alleged evidence absent a court order, granting news plaintiffs' request to force the preservation of all chats.

"OpenAI argued that the May 13 order was premature and should be vacated until, at a minimum, news organizations can establish a substantial need for OpenAI to preserve all chat logs. They warned that the privacy of hundreds of millions of ChatGPT users globally is at risk every day that the 'sweeping, unprecedented' order continues to be enforced.

"OpenAI argued: 'As a result, OpenAI is forced to jettison its commitment to allow users to control when and how their ChatGPT conversation data is used, and whether it is retained.' Meanwhile, there's no evidence beyond speculation yet supporting claims that OpenAI had intentionally deleted data, OpenAI alleged. And supposedly there is not a single piece of evidence supporting claims that copyright-infringing ChatGPT users are more likely to delete their chats." And to me that seems reasonable. "OpenAI argued: 'OpenAI did not "destroy" any data, and certainly did not delete any data in response to litigation events. The Order appears to have incorrectly assumed the contrary.'

"At a conference in January, Wang [the judge] raised a hypothetical in line with her thinking on the subsequent order. She asked OpenAI's legal team to consider a ChatGPT user who found some way to get around the pay wall and was getting The New York Times content somehow as the output. If the user then hears about this case and says, oh, whoa, you know, I'm going to ask them to delete all of my searches and not retain any of my searches going forward, the judge asked, wouldn't that be 'directly the problem' that the order would address?

"OpenAI does not plan to give up this fight, alleging that news plaintiffs have fallen silent on claims of intentional evidence destruction, and the order should be deemed unlawful. For OpenAI, risks of breaching its own privacy agreements could not only damage relationships with users but could also risk putting the company in breach of contracts and global privacy regulations. Further, the order imposes significant burdens on OpenAI, supposedly forcing the ChatGPT maker to dedicate months of engineering hours at substantial costs to comply, OpenAI claimed.

"It follows then that OpenAI's potential for harm 'far outweighs News Plaintiffs' speculative need for such data,' OpenAI argued. 'While OpenAI appreciates the court's efforts to manage discovery in this complex set of cases, it has no choice but to protect the interests of its users by objecting to the Preservation Order and requesting its immediate vacatur,' OpenAI said.

"Millions of people use ChatGPT daily for a range of purposes, OpenAI noted, 'ranging from the mundane to profoundly personal.' People may choose to delete chat logs that contain their private thoughts, OpenAI said, as well as sensitive information, like financial data from balancing the house budget or intimate details from workshopping wedding vows. And for business users connecting to OpenAI's API, the stakes may be even higher, as their logs may contain their companies' most confidential data, including trade secrets and privileged business information. Given that array of highly confidential and personal use cases, OpenAI goes to great lengths to protect its users' data and privacy,' OpenAI argued. 'It does this partly by honoring its privacy policies and contractual commitments to users." And the article goes on, but everyone has the idea.

So anyway, it's a mess. The bottom line is that, for the time being, and since this began, no one's ChatGPT logs have actually been deleted. Since May 13th. They've been forced by court order to retain everyone's everything. And I don't mean to make more of this than it is. I'm not suggesting that we should be terrified. I have no doubt that ChatGPT will treat them, these logs, with as much respect as possible. But "deleted," you know, needs to be put in air quotes. It doesn't actually mean now that it's truly gone.

So for what it's worth, if you are someone who cares about maintaining as much absolute privacy as possible, you'll want to look at something such as Venice.ai, whose entire architecture is designed in TNO mode - Trust No One - so that they never have any logs to either keep or delete. I should mention, though, that after I talked about Venice.ai I did some side-by-side comparisons against OpenAI's o3 model, which blows Venice.ai away.

Leo: And o3 blows pretty much everyone away. It's pretty amazing.

Steve: It's astonishing, yeah. So it's not like they are at parity. But unfortunately, ChatGPT being the big guy in town has become a target of the advertisers, I mean of the content producers, and they're saying, hey, you know, our content's being slurped up, and users are getting it for free by asking ChatGPT what happened today.

Leo: What model - oh, it's all using open source models, Venice is, like Llama and...

Steve: Yeah. And actually it's distributed open source, and they're not using the ChatGPT API. They're using...

Leo: Right, right, they can't, obviously.

Steve: Right, exactly.

Leo: Yeah.

Steve: Yeah, because they are completely uncensored.

Leo: Actually, somebody can, which is Apple. Apple claims that they don't send any information to ChatGPT when you use it on an iPhone. So presumably you could use ChatGPT, maybe not its strongest models, but you could use it on an Apple device.

Steve: So, what? Because, I mean, especially what we heard at WWDC yesterday, they're all, like, they're engaging ChatGPT all over the place.

Leo: Yeah. But it doesn't - it'll send the prompts, it has to. But it won't send any personal information. So they've made a deal, obviously, of some sort with ChatGPT, with OpenAI, to do that, yeah.

Steve: You mean it won't identify who you are.

Leo: Right.

Steve: So it's anonymizing your...

Leo: Exactly. [Crosstalk] it can't.

Steve: Right.

Leo: But so if you send it your tax returns, you're out of luck. But if you send it just a simple prompt, it doesn't know who you are.

Steve: Got it. Okay. Erlang. I don't know anybody who uses Erlang. But when you get a CVSS of 10.0...

Leo: Ooh, that's not good.

Steve: You know, the four people who do use it really need to pay attention.

Leo: It's actually widely used, because it was written by Ericsson for telephone switching systems. So there are a lot of embedded and interesting uses of Erlang. CVSS 10 is a big deal.

Steve: Yeah, and it's on an SSH server. So it's an authentication bypass. It got a 10.0. That's the official CVSS. The description says: "Erlang/OTP is a set of libraries for the Erlang programming language. Prior to versions" - now, there are three version branches, 27.3.3, 26.2.5.11, and 25.3.2.20. Those versions are safe. Prior to those: "...an SSH server may allow an attacker" - and we know that when they say "may," that means we gave it a 10.0. Read between the lines. Forget "may." It probably actually should say an SSH server already did allow an attacker. The attacker already has what they want - "to perform unauthenticated remote code execution (RCE).

"By exploiting a flaw in SSH protocol message handling, a malicious actor could [and we know they mean 'did'] gain unauthorized access to affected systems and execute arbitrary commands without valid credentials. A temporary workaround involves" pulling the plug. No. Involves "disabling the SSH server or to prevent access via firewall rules." Meaning don't let anybody use your SSH server.

Anyway, you know, even though no one talks about using Erlang, as I wrote in the show notes, apparently it's out there. And Leo, you've confirmed that.

Leo: Yeah.

Steve: Ericsson mobile phones? Does Ericsson still make mobile phones?

Leo: No, but they made Erlang, so there you go.

Steve: Oh, they made Erlang. Okay.

Leo: You know, "OTP" implies it's a one-time password. Oh, no, that's actually part of the name. Erlang/OTP - the OTP stands for Open Telecom Platform. Okay.

Steve: Ah.

Leo: So it's not a library, it's Erlang. Okay. Wow.

Steve: Anyway, 10.0, kiddies. So unplug it if you've got it.

Leo: Holy cow.

Steve: Yikes, yes. Can Russia "intercept" Telegram messages? There's a report that appears to allege that Russia now has some means for intercepting Telegram messages. My most pressing question is whether this applies to two-party one-to-one messages. Here's what the reporting says: "Russian human rights NGO, known as First Department, warned on Friday" - just this past Friday - "that Russia's Federal Security Service" - the infamous FSB - "has learned to intercept messages sent by Russians to bots or feedback accounts associated with certain Ukrainian Telegram channels, potentially exposing anyone communicating with such outlets to treason charges.

"Russia's principal domestic intelligence agency (FSB) has gained access to correspondence made with Ukrainian Telegram channels including Crimean Wind and Vision Vishnun, according to First Department, which said that the FSB's hacking of Ukrainian Telegram channels had come about during a 2022 investigation into the Ukrainian intelligence agencies 'gathering information that threatens the security of the Russian Federation' via messengers and social networks including Telegram.

"The case is being handled by the FSB's investigative department, though no suspects or defendants have been named in the case, according to First Department. When the FSB identifies individual Russian citizens who have communicated with or transmitted funds to certain Ukrainian Telegram channels, it contacts the FSB office in their region, which then typically opens a criminal case for treason against the implicated person.

"First Department said: 'We know that by the time the defendants in cases of "state treason" are detained, the FSB is already in possession of their correspondence. And the fact that neither defendants nor a lawyer are named in the main case allows the FSB to hide how exactly it goes about gaining access to that correspondence.' First Department stressed that their findings highlighted the various security risks inherent in using Telegram for confidential communication, especially in cases where the contents of such private messages could result in criminal charges.

"Dmitry Zair-Bek, the head of First Division, said that materials from Telegram have already been used as evidence in 'a significant number of cases,' adding that 'in most cases, they have been accessed due to compromised devices. However, there are also cases in which no credible technical explanations consistent with known access methods can be identified.'" So this guy does sound like he knows what he's talking about. He said: "This could indicate either the use of undisclosed cyberespionage tools or Telegram's cooperation with the Russian authorities, obvious signs of which we see in a number of other areas."

So, you know, we've been watching Pavel Durov's previously adamant stance soften somewhat over time, particularly after he was arrested and charged in France last summer. Has he allowed Telegram to be compromised? You know, it's certainly not a messaging system that can be trusted. And remember that an audit of its homegrown crypto technology did raise additional concerns several months ago. So it's not what I would recommend anybody use.

Leo: On we go with the show.

Steve: Okay. So I had to double-check the date on this news when I read that Spanish ISPs had accidentally blocked Google domains while attempting to crack down on illegal soccer live streams. The double-check was required, of course, because this is not the first time this has happened, nor the first time we've noted what a lame and harebrained approach it is to force specific ISPs to locally filter large chunks of the Internet for only their own subscribers. Right? I mean, everybody else can see what they want. Maybe someday we'll learn, but I'm not holding my breath.

I did note that Reddit has sued Anthropic for scraping and using Reddit comments to train its Claude AI chatbot. And I guess this is just going to be a thing, Leo, for a while. You know, we just talked about OpenAI in trouble with The New York Times and other plaintiffs. And now Anthropic, you know, Reddit's upset. And we know there are sites that specifically say, oh, no, don't worry, AI is not allowed in. So I would just say obey those robots.txt files, folks, and, you know, behave yourselves.

A recent analysis of Twitter's new encrypted XChat messaging appears to leave as much to be desired as you might imagine. The researcher who looked into it wrote: "When Twitter launched encrypted DMs a couple of years ago, it was the worst kind of end-to-end encrypted - technically end-to-end encrypted, but in a way that made it relatively easy for Twitter to inject new encryption keys and get everyone's messages anyway. It was also lacking a whole bunch of features such as sending pictures, so the entire thing was largely a waste of time. But a couple of days ago, Elon announced the arrival of XChat, a new encrypted message platform built on Rust" - it actually isn't, it's written in C - "with Bitcoin-style encryption, whole new architecture." And then the guy says: "Maybe they got it right this time."

And then a little bit later he says: "The TL;DR is: No. Use Signal."

Leo: Yeah.

Steve: He said: "Twitter can probably obtain your private keys and admit that they can man-in-the-middle you and have full access to your metadata."

So anyway, the analysis goes deeper. And to me it looked kind of interesting. It might make for some additional attention and a deeper dive for the podcast, so I may return to that next week. We'll see. In the meantime, I would follow this investigator's recommendation and not assume that what Elon has brought us in this new XChat is actually secure because they apparently were in a hurry, didn't actually write it in Rust, and, you know.

Leo: That's hysterical that he would even claim that.

Steve: I know. Because I guess, whoo, Rust.

Leo: Makes it better.

Steve: Must be good.

Leo: Rust makes it better. And what does it even mean to say "Bitcoin-style encrypted"?

Steve: I don't know.

Leo: Bitcoin's not encrypted, by the way.

Steve: Exactly. It's a public ledger that everyone can look at.

Leo: So I guess what they're admitting is, oh, yeah, there's no encryption.

Steve: But I think you just, like, throw in some more buzzwords.

Leo: Yeah. Maybe the message is all the DMs are put on the blockchain for everyone.

Steve: You would think he would have been in Dogecoin, but I guess not.

Leo: Oh, geez Louise.

Steve: Yeah. Meanwhile, Thundermail, the worst-named service ever - please - will have email servers located in the European Union for increased privacy. Yeah. Okay. Fine. Whatever. But could you please change the damn name?

Leo: How about Lightning Mail? Do you like that better?

Steve: That's better than Thundermail.

Leo: It is.

Steve: I mean, Thundermail just sounds so bad. I don't know what it is. In other happy news...

Leo: It's from Thunderbird. That's why; right?

Steve: I mean, I get it. Yes. And on Thunderbird that seems fine. I don't know why you can't change the "bird" to "mail" and have it still be good.

Leo: There's something about a message and thunder that just don't go together somehow, yeah.

Steve: I don't know. In other happy news, the GAO, the U.S. Government Accountability Office, has a report out which incidentally noted in passing that the Login.gov service has no policy to verify that its backups are working. So a cyberattack, a mistake, or any other IT issue could completely crash the U.S. government's entire login and identity system for, I don't know, days, weeks, or even months until it's restored.

Leo: This is how I get into my Social Security account.

Steve: Yeah, well, you'd better log in and hope you stay logged in because apparently it could go away.

Leo: Unbelievable. Yeah.

Steve: And lord knows, I mean...

Leo: Oh, also Global Entry. My Global Entry account's there. My IRS account. Actually, they use ID.me. That really makes me nervous. They use a third-party system.

Steve: Yeah. Maybe it's better to send it somewhere else. I would imagine ID.me probably actually has backups. Okay. So let's talk about The Illusion of Thinking and Apple's work on this. We have one more break, but we'll get to that halfway through this.

Leo: Yeah. It's a quick break, so...

Steve: Yeah, okay. A couple of days ago I added an "AI" group to GRC's long-running text-only NNTP newsgroups. In my inaugural post to that group, I wrote: "Everyone, I've learned not to haphazardly create groups that do not have enduring value, since it's more difficult to remove groups than to create them, and endless group proliferation is not ideal. But I think it's WAY beyond clear that Artificial Intelligence is in the process of rapidly changing the world, and I cannot imagine any more important and worthwhile new group to create."

Then, just this past Sunday, upon discovering this just-released research from Apple, thanks to feedback from one of our listeners, Urs Rau, I posted the following into this new, our brand new AI newsgroup there. I said: "'The Illusion of Thinking' is how the title of their well-assembled paper begins. The entire title is 'The Illusion of Thinking: Understanding the Strengths and Limitations of Reasoning Models via the Lens of Problem Complexity.'"

Leo: Woohoo.

Steve: And so I wrote in this posting to GRC's newsgroup: "Is this just sour grapes, engendered by Apple finding themselves behind the rest of the industry in AI deployment? I don't think so. This looks like an exploration that adds to our understanding of what we have today. And it's not suggesting that what we have today is not useful, nor that Apple might not wish they had some of their own. What it's doing is exploring the LIMITS of what we are now calling 'Artificial Intelligence' and suggesting what many of us have intuited, which is that, while a massive problem space can be solved with powerful pattern matching, when there are not patterns to be matched, today's systems are revealed to not be exhibiting anything like true problem understanding."

In other words, Leo, your earliest take on this, which was that AI was little more than fancy spell correction, carried an essential kernel of truth onto which Apple has just placed a very fine point. I think everyone should listen carefully to what Apple's research paper abstract explains. They wrote: "Recent generations of frontier language models have introduced Large Reasoning Models (LRMs) that generate detailed thinking processes before providing answers." And Leo, you and I were just talking about o3. And, yes, it is astonishing.

They said: "While these models demonstrate improved performance on reasoning benchmarks, their fundamental capabilities, scaling properties, and limitations remain insufficiently understood." And as I said a week or two ago, researchers are going to be studying what we have. And it's not something that happens overnight, but we're going to begin to get answers that tell us more about what it is we have. This is one such set of answers.

They wrote: "Current evaluations primarily focus on established mathematical and coding benchmarks, emphasizing final answer accuracy. However, this evaluation paradigm often suffers from data contamination and does not provide insights into the reasoning traces' structure and quality. In this work, we systematically investigate these gaps with the help of controllable puzzle environments that allow precise manipulation of compositional complexity while maintaining consistent logical structures. This setup enables the analysis of not only final answers, but also the internal reasoning traces, offering insights into how LRMs 'think.'" And they have that in air quotes.

"Through extensive experimentation across diverse puzzles, we show that frontier LRMs face a complete accuracy collapse beyond certain complexities." There's a cliff. "Moreover, they exhibit a counterintuitive scaling limit. Their reasoning effort increases with problem complexity up to a point, then declines, despite having an adequate token budget." Meaning we're letting you have, we're letting you think about this as much as you want. Keep going. But they don't.

They wrote: "By comparing LRMs with their standard LLM counterparts under equivalent inference compute, we identify three performance regimes: First, low-complexity tasks where standard models surprisingly outperform LRMs; second, medium-complexity tasks where additional thinking in LRMs demonstrates advantage; and then, three, high-complexity tasks where both models experience complete collapse.

"We found that LRMs have limitations in exact computation. They fail to use explicit algorithms, and reason inconsistently across puzzles. We also investigate the reasoning traces in more depth, studying the patterns of explored solutions and analyzing the models' computational behavior, shedding light on their strengths, limitations, and ultimately raising crucial questions about their true reasoning capabilities."

Okay, now, as I've cautioned before, anything and everything that's believed to be known about AI definitely needs to carry a date stamp and also probably a "best used by" expiration date. What this means for us here is that Apple is showing us some interesting and probably previously under-appreciated features of today's LRMs (Large Reasoning Models). It's worth reminding ourselves that, if Apple had written this same paper a year ago, before the appearance of LRMs, and only challenging LLMs, the results would have been similar, though significantly less impressive for the AI side.

The question, then, is whether, and if so to what degree, even LARGER Reasoning Models in the future will be able to eclipse the performance of today's Large Reasoning Models. In other words, since what we all want to know today is what's going to happen with AI in the future, to what degree is Apple's research able to speak to any fundamental underlying limitations that might limit any future AI? That is, will this current large language model, neural network-based approach hit a wall?

To answer that question, we need to see what Apple's research discovered. Here's how Apple's researchers set up the question. They wrote: "Large Language Models (LLMs) have recently evolved to include specialized variants explicitly designed for reasoning tasks: Large Reasoning Models such as OpenAI's o1/o3, DeepSeek-R1, Claude 3.7 Sonnet Thinking, and Gemini Thinking. These models are new artifacts, characterized by their 'thinking' mechanisms such as long Chain-of-Thought (CoT) with self-reflection, and have demonstrated promising results across various reasoning benchmarks. Their emergence suggests a potential paradigm shift in how LLM systems approach complex reasoning and problem-solving tasks, with some researchers proposing them as significant steps toward more general artificial intelligence capabilities.

"Despite these claims and performance advancements, the fundamental benefits and limitations of LRMs remain insufficiently understood." And, you know, also they're very new; right? So, okay. "Critical questions still persist: Are these models capable of generalizable reasoning, or are they leveraging different forms of pattern matching? How does their performance scale with increasing problem complexity? How do they compare to their non-thinking standard LLM counterparts when provided with the same inference token compute? Most importantly, what are the inherent limitations of current reasoning approaches, and what improvements might be necessary to advance toward more robust reasoning capabilities?

"We believe," they wrote, "the lack of systematic analysis investigating these questions is due to limitations in current evaluation paradigms. Existing evaluations predominantly focus on established mathematical and coding benchmarks which, while valuable, often suffer from data contamination issues and do not allow for controlled experimental conditions across different settings and complexities. Moreover, these evaluations do not provide insights into the structure and quality of reasoning traces. To understand the reasoning behavior of these models more rigorously, we need environments that enable controlled experimentation.

"In this study, we probe the reasoning mechanisms of frontier LRMs through the lens of problem complexity. Rather than standard benchmarks, meaning math problems, we adopt controllable puzzle environments that let us vary complexity systematically by adjusting puzzle elements while preserving the core logic and inspect both solutions and internal reasoning."

Then we see, to my delight, the paper's diagram of one of the puzzle tests Apple's researchers chose, which is the famous Towers of Hanoi. This is a classic puzzle with very simple rules, which is what makes it such a great puzzle. I received a beautiful wooden version one Christmas when, as a child, my annoying aunt, who was always trying to stump me, thought it might do the trick. Okay. Now, for those who are not familiar...

Leo: I love it. I had that one when I was a kid, too, and that's how I learned recursion. I think it's why I was able to grok recursion right away.

Steve: Yup.

Leo: Isn't that fascinating, yeah.

Steve: So for those who are not familiar, the puzzle consists of three pegs in a line, with one of the pegs having a stack of discs of decreasing diameter, with the largest disc on the bottom, and going to the smallest disc on top. The challenge is to move all of the discs from the starting peg to the peg at the other end of the three by moving only one disc at a time from any peg to any other peg, while never placing a larger disc over a smaller disc. It's a truly lovely puzzle because that's the rules. The rules are simple. But the solution requires patience, repetition, and grasping a deeper solution concept. That's what makes this such a perfect puzzle to test reasoning.

Okay, now, I should note that the puzzle is also a joy to solve by computer using traditional coding methods, and that the most elegant coding solution employs recursion, since this puzzle itself is deeply recursive. For anyone who has an age-appropriate child, or a nephew, Amazon has a large selection, like pages, of beautifully rendered...

Leo: I bet they do.

Steve: ...wooden and colorful versions of this famous puzzle. Now, what's so clever about Apple's choice of this puzzle is that its complexity can be uniformly scaled simply by changing the number of discs. So, first, imagine that we just have one disc. We can simply move it to its destination peg. If we have two discs, the smaller disc must first be placed on the middle peg, so that the bottom larger disc can be placed on its destination peg at the other end of the puzzle. Then the smaller disc can join the larger disc on the end peg, and the two-disc puzzle is solved.

Switching to three discs requires a bit more work. So visualize three pegs and three discs. The smallest disc temporarily goes onto the third destination peg. The middle disc goes to the middle peg. Now the smallest disc can go on top of the middle disc on the middle peg. This frees up the third peg to receive the largest bottom disc, which is now all alone on the original peg. So you move that over to the third peg. The smallest size disc is then moved to the first peg, which uncovers the middle size disc, which is on the middle peg, which can now be placed onto the third destination peg, and the smallest disc can then join the others to complete the stack and solve the three-disc puzzle.

It is quite satisfying to do this. And note that the two versus three disc puzzle may hopefully teach the astute puzzler which peg should first receive the smallest disc, based upon whether the disc count is even or odd. And that would be confirmed by solving the four-disc puzzle.

Leo: Aha.

Steve: Now, I should mention that if anyone who's listening is planning to make a gift of one of these, please encourage its recipient to start out this way, rather than just jumping into a very frustrating deep end using all of the eight or 10 discs that these puzzles provide. Solving the puzzle with very few discs will provide the encouragement and stamina that will eventually be needed...

Leo: Stamina.

Steve: ...to tackle and solve this very gratifying full puzzle.

Leo: Then make them write it in Python, and now you've got something.

Steve: And again, that little trick about noticing which peg to start out with will definitely save the day.
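And since you brought up Python, Leo, here's the minimal recursive solution, purely for illustration. Notice that the parity trick falls right out of the recursion: with an odd number of discs, the smallest disc's first move is to the destination peg; with an even number, it's to the spare.

    # A minimal recursive Tower of Hanoi solver, for illustration.
    # Moves n discs from 'source' to 'target' via 'spare', printing
    # each move. The total move count is always 2**n - 1.
    def hanoi(n: int, source: str, target: str, spare: str) -> None:
        if n == 0:
            return
        hanoi(n - 1, source, spare, target)            # park the n-1 smaller discs
        print(f"move disc {n}: {source} -> {target}")  # move the largest disc
        hanoi(n - 1, spare, target, source)            # restack the smaller discs

    hanoi(3, "A", "C", "B")  # reproduces the three-disc walkthrough above in 7 moves

Three lines of actual logic, and it solves any disc count you have the patience to watch.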

Leo: And you need it. It's recursive, so you need it each time you start, yeah, the next thing, yeah.

Steve: Yup. So I think that Apple's choice of the Towers of Hanoi is brilliant by reason of the puzzle's lovely scalability of difficulty. In all, they used four different, somewhat similar, sequential combinatorial puzzles: Towers of Hanoi, Checker Jumping on a linear strip of squares, something that they call Block World, and also River Crossing.

So here's what Apple explained. They said: "These puzzles, first, offer fine-grained control over complexity; second, avoid contamination common in established benchmarks; third, require only the explicitly provided rules, emphasizing algorithmic reasoning; and, fourth, support rigorous, simulator-based evaluation, enabling precise solution checks and detailed failure analysis." Just very clever that they did this.
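Just to make that fourth point concrete, here's the sort of deterministic checker I imagine they mean. This is my own little Python sketch, not Apple's actual harness. It replays a model's proposed moves, given as (from peg, to peg) pairs, and reports the first illegal move, which is exactly the kind of "first failure" analysis the paper goes on to describe:

    # A sketch of a deterministic Tower of Hanoi simulator (my own
    # illustration, not Apple's code). It replays a proposed move list
    # and reports success, or the first illegal move it encounters.
    def check_solution(n: int, moves: list) -> str:
        pegs = {0: list(range(n, 0, -1)), 1: [], 2: []}  # disc n on the bottom
        for i, (src, dst) in enumerate(moves, start=1):
            if not pegs[src]:
                return f"illegal move {i}: peg {src} is empty"
            disc = pegs[src][-1]
            if pegs[dst] and pegs[dst][-1] < disc:
                return f"illegal move {i}: disc {disc} placed on a smaller disc"
            pegs[dst].append(pegs[src].pop())
        if pegs[2] == list(range(n, 0, -1)):
            return f"solved in {len(moves)} moves"
        return "all moves legal, but the puzzle was not solved"

    # The correct two-disc solution: the smaller disc goes to the middle peg first.
    print(check_solution(2, [(0, 1), (0, 2), (1, 2)]))  # "solved in 3 moves"

Because the simulator is deterministic, every intermediate step of a model's "thinking" can be graded precisely, not just the final answer.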

They said: "Our empirical investigation reveals several key findings about current Large Reasoning Models (LRMs): First, despite their sophisticated self-reflection mechanisms learned through reinforcement learning, these models fail to develop generalizable problem-solving capabilities for planning tasks..."

Leo: Huh.

Steve: Yeah, and look at these charts here in the middle of page 19, Leo - "...with performance collapsing to zero beyond a certain complexity threshold. Second, our comparison between LRMs and standard LLMs under equivalent inference compute reveals three distinct reasoning regimes." And that's what I talked about before.

They said: "For simpler, low-compositional problems, standard LLMs demonstrate greater efficiency and accuracy." Like there's this overthink problem. "As problem complexity moderately increases, thinking models gain an advantage." So that's what we're now seeing, right, in what o3 is doing. We're seeing this improved advantage. "However, when problems reach high complexity with longer compositional depth, both model types experience complete performance collapse." And we see that in the chart that I've got on page 19 on the left.

They said: "Notably, near this collapse point, LRMs begin reducing their reasoning effort (measured by inference-time tokens) as problem complexity increases, despite operating well below generation limits." That's shown in the middle diagram. They said: "This suggests a fundamental inference time scaling limitation in LRMs' reasoning capabilities relative to problem complexity."

And they said: "Finally, our analysis of intermediate reasoning traces or thoughts reveals complexity-dependent patterns. In simpler problems, reasoning models often identify correct solutions early, but inefficiently continue exploring incorrect alternatives, an 'overthinking' phenomenon. At moderate complexity, correct solutions emerge only after extensive exploration of incorrect paths." And that's fair. "And beyond a certain complexity threshold, models completely fail to find correct solutions." In other words, they're not really reasoning.

"This indicates LRMs possess limited self-correction capabilities that, while valuable, reveal fundamental inefficiencies and clear scaling limitations. These findings highlight both the strengths and limitations of existing LRMs, raising questions about the nature of reasoning in these systems with important implications for their design and deployment." They then list their key contributions from this research, which we're going to go into after our final break.

Leo: All right. You got me thinking. And I just ordered a Towers of Hanoi because I remember this with such fondness from my childhood.

Steve: It's just pleasant and gratifying, yeah.

Leo: And once you understand it, it's pretty straightforward. But it's fun, yeah.

Steve: But for a five-year-old or an eight-year-old?

Leo: You know, I hadn't really thought about this, but I think the fact that that was on our coffee table when I was a little kid, and I did figure out how to solve it, probably prepared me well for understanding recursion because you repeat the same algorithm over and over.

Steve: Yup. Exactly.

Leo: And planning because you have to start on the right peg to make it most efficient. There's a few things that...

Steve: And you're able to give yourself simpler versions of it in order to kind of get the hang of it.

Leo: Right, right, because you're just repeating it, yeah.

Steve: Okay. So they say their key contributions are, from the research...

Leo: Yes.

Steve: "We question the current evaluation paradigm of LRMs on established math benchmarks and design a controlled experimental testbed by leveraging algorithmic puzzle environments that enable controllable experimentation with respect to problem complexity. We show that state-of-the-art LRMs (o3-mini, DeepSeek-R1, Claude 3.7 Sonnet Thinking) still fail to develop generalizable problem-solving capabilities, with accuracy ultimately collapsing to zero beyond certain complexities across different environments. We find that there exists a scaling limit in the LRMs' reasoning effort with respect to problem complexity, evidenced by the counterintuitive decreasing trend in the thinking tokens after a complexity point.

"We question the current evaluation paradigm based on final accuracy and extend our evaluation to intermediate solutions of thinking traces with the help of deterministic puzzle simulators. Our analysis reveals that, as problem complexity increases, correct solutions systematically emerge at later positions in thinking, compared to incorrect ones, providing quantitative insights into the self-correction mechanisms within LRMs. And finally, we uncover surprising limitations in LRMs' ability to perform exact computation, including their failure to benefit from explicit algorithms" - we'll get to this. But at one point they told it how to do the Towers, and it still couldn't. Like they gave it instruction, here's how you solve this. Anyway, "and their inconsistent reasoning across puzzle types."

Okay. So for those listening to this without the advantage of the performance charts in the show notes, the Claude 3.7 thinking versus non-thinking model performance on the Towers of Hanoi puzzle was interesting. We talked about - everyone understands the Tower of Hanoi now. Both the earlier Large Language Model and the later Large Reasoning Models performed perfectly, returning success 100% of the time when only one or two discs were used. And we saw how simple those were. Both models still did very well after a third disc was added; but interestingly, the fancier thinking model underperformed the simpler LLM by about 4%.

Leo: That's wild.

Steve: Yeah. But when that first peg was stacked with four discs, the deeper thinking model's performance was restored: the simpler Claude 3.7 LLM collapsed to finding the solution only 35% of the time, whereas the thinking model held at 100%. As the disc count then increases above four, both models' performance continues to drop, but the LRM holds a huge lead over the LLM until they get to eight discs. The LLM is never able to solve that one, whereas the thinking model finds the eight-disc solution about one out of every 10 tries, about 10%, but 10 discs is beyond the reach of either.

The full research paper has lots of interesting detail about the various models' performance on the four puzzle types. I noted, however, that the nature of the other three puzzles seemed to be pretty much beyond the grasp of any of this so-called "AI." One of their more interesting findings was the appearance of what they term the three "complexity regimes."

Paraphrasing from the paper, they wrote, under "How Does Complexity Affect Reasoning?" they said: "Motivated by the observations to systematically investigate the impact of problem complexity on reasoning behavior, we conducted experiments comparing thinking and non-thinking model pairs across our controlled puzzle environments. Our analysis focused on matching pairs of LLMs with identical model backbones, specifically Claude 3.7 Sonnet, with and without thinking, and DeepSeek R1 versus V3. For each puzzle, we vary the complexity by manipulating problem size N, where N represents the disc count, the checker count, the block count, or the crossing elements.

"Results from these experiments demonstrate that, unlike observations from math" - and that's probably one of the most significant things here is that we keep seeing, oh, this thing, these do better than a math PhD. It's like, okay. How about frogs jumping over each other? Oh, well, no, it can't do frogs, no. So they said: "There exist three regimes in the behavior of these models with respect to complexity. In the first regime, where problem complexity is low, we observe that non-thinking models are capable of obtaining performance comparable to, or even better than, thinking models with more token-efficient inference." Meaning it's cheaper to do them. "In the second regime, with medium complexity, the advantage of reasoning models capable of generating long chain of thought begin to manifest, and the performance gap between the model pairs increases.

"The most interesting regime is the third regime, where problem complexity is higher, and the performance of both models have collapsed to zero. Results show that, while thinking models delay this collapse, they ultimately encounter the same fundamental limitations as their non-thinking counterparts."

I think it's important to address their decision to use puzzles as an evaluation mechanism versus math problems. They gave this a lot of thought, and they wrote on the "Math and Puzzle Environments" question, they wrote the following. They said: "Currently, it is not clear whether the performance enhancements observed in recent reinforcement learning (RL)-based thinking models" - all of the LRMs we've been talking about - "are attributable to increased exposure to established mathematical benchmark data, to the significantly greater inference compute allocated to thinking tokens, or to reasoning capabilities developed by RL-based training." That is, the reinforcement learning training.

"Recent studies have explored this question with established math benchmarks by comparing the upper-bound capabilities of reinforcement learning-based thinking models with their non-thinking standard LLM counterparts. They've shown that under equivalent inference token budgets, non-thinking LLMs can eventually reach performance comparable to thinking models on benchmarks like MATH500 and AIME24. We also conducted our comparative analysis of frontier LRMs like Claude 3.7 Sonnet, with and without thinking, and DeepSeek R1 versus V3. Our results confirm that, on the MATH500 dataset, the performance of thinking models is comparable to their non-thinking counterparts when provided with the same inference token budget. However, we observed that this performance gap widens on the AIME24 benchmark and widens further on AIME25.

"This widening gap presents an interpretive challenge. It could be attributed to either increasing complexity requiring more sophisticated reasoning processes, thus revealing genuine advantages of the thinking models for more complex problems; or reduced data contamination in newer benchmarks, particularly AIME25. Interestingly, human performance on AIME25 was actually higher than on AIME24, suggesting that AIME25 might be less complex. Yet models perform worse on AIME25 than AIME24, potentially suggesting that data contamination during the training of frontier LRMs is occurring." That is, there's more contamination in the older models because there's been more time for the contamination to happen, as compared to the newer training benchmarks, or testing benchmarks.

"Given these non-justified observations, and the fact that mathematical benchmarks do not allow for controlled manipulation of problem complexity, we turned to puzzle environments that enable more precise and systematic experimentation."

Okay. So we have the very real problem of data contamination that makes judging what these AI models are actually doing, meaning that the models, you know, may have previously encountered the problems during their training and simply memorized the answer. So they're not actually reasoning. They're not thinking or solving new problems, they're pattern-matching at a very high level and just regurgitating. But even puzzles like the Towers of Hanoi and River Crossing exist on the Internet and are also presumably in the training data. The researchers talk about this under the heading "Open Questions: Puzzling Behavior of Reasoning Models." They write: "We present surprising results concerning the limitations of reasoning models in executing exact problem-solving steps, as well as demonstrating different behaviors of the models based on the number of moves.

"In the Tower of Hanoi environment, even when we provide the algorithm in the prompt" - here again, this is what I was talking about. "In the Tower of Hanoi environment, even when we provide the algorithm to be used in the prompt, so that the model only needs to execute the prescribed steps, performance does not improve, and the observed collapse still occurs at roughly the same point. This is noteworthy because finding and devising a solution should require substantially more computation for search and verification than merely executing a given algorithm. This further highlights the limitations of reasoning models in verification and in following logical steps to solve a problem, suggesting that further research is needed to understand the symbolic manipulation capabilities of such models.

"Moreover, we observe very different behavior from the Claude 3.7 Sonnet thinking model. In the Tower of Hanoi environment, the model's first error in the proposed solution often occurs much later, around move 100 for when you have 10 discs, compared to the River Crossing environment, where the model can only produce a valid solution until move four. Note that this model also achieves near-perfect accuracy when solving the Tower of Hanoi with five discs, which requires 31 moves, while it fails to solve the River Crossing puzzle with just N=3, which has a solution in only 11 moves. This likely suggests that examples of River Crossing with N>2 are scarce on the web, meaning LRMs may not have frequently encountered or memorized such instances during training."

In other words, it is very, very difficult to test these models where you need clean models that have not absorbed contaminating information that allows them to appear to be creating new thought, as opposed to just finding something from the past.

So this work by Apple's researchers is full of terrific insights that I want to commend to anyone who's interested in obtaining a more thorough understanding of where things probably stand at this point in time. I've got a link right under the title at the beginning of this in the show notes. So here's what the researchers conclude.

They said: "In this paper, we systematically examine frontier Large Reasoning Models through the lens of problem complexity using controllable puzzle environments. Our findings reveal fundamental limitations in current models. Despite sophisticated self-reflection mechanisms, these models fail to develop generalizable reasoning capabilities beyond certain complexity thresholds." So I'm going to repeat that, since I think that's the essence of this entire paper: "Our findings reveal that, despite sophisticated self-reflection mechanisms, these models fail to develop generalizable reasoning capabilities beyond certain complexity thresholds." So the models are doing much better at doing what their simpler LLM brethren have been doing, but the difference is fundamentally quantitative, not qualitative.

Apple continues: "We identified three distinct reasoning regimes: standard LLMs outperform LRMs at low complexity, LRMs excel at moderate complexity, and both collapse at higher complexity. Particularly concerning is the counterintuitive reduction in reasoning effort as problems approach critical complexity, suggesting an inherent compute scaling limit in LRMs. Our detailed analysis of reasoning traces further expose complexity-dependent reasoning patterns, from inefficient 'overthinking' on simpler problems to complete failure on complex ones. These insights challenge prevailing assumptions about LRM capabilities and suggest that current approaches may be encountering fundamental barriers to generalizable reasoning.

"Finally, we presented some surprising results on LRMs that lead to several open questions for future work. Most notably, we observed their limitations in performing exact computation. For example, when we provided the solution algorithm for the Tower of Hanoi to the models, their performance on this puzzle did not improve." They gave them the answer, and it didn't help. "Moreover, investigating the first failure move of the models revealed surprising behaviors. For instance, they could perform up to 100 correct moves in the Tower of Hanoi, but fail to provide more than five correct moves in the River Crossing puzzle. We believe our results can pave the way for further future investigations into the reasoning capabilities of these systems."

And then, finally, under "Limitations," they just said: "We acknowledge that our work has limitations. While our puzzle environments enable controlled experimentation with fine-grained control over problem complexity, they represent a narrow slice of reasoning tasks and may not capture the diversity of real-world or knowledge-intensive reasoning problems." You know, they're algorithmic, not knowledge based. "It is notable that most of our experiments rely on black-box API access to the closed frontier LRMs, limiting our ability to analyze internal states or architectural components. Furthermore, the use of deterministic puzzle simulators assumes that reasoning can be perfectly validated step by step. However, in less structured domains, such precise validation may not be feasible, limiting the transferability of this analysis to other more generalizable reasoning."

So in other words, the only thing this is, is what it is. It may or may not be more widely applicable, and it may not even have any meaning or utility beyond the scope of these problems. There's not a great deal of real-world need, you know, for stacking discs on poles, after all. But for what it's worth, it does track with the intuition many of us have about where the true capabilities of today's AI fall. You know, terms like "comprehend" or "understand" or even "reason" really don't seem to apply. They're used by AI fanboys. Maybe they're just a lazy shorthand, but I don't feel that they're helpful. In fact, I think they're anti-helpful. So what I think we need is some new anti-anthropomorphic terminology to accompany this new technology.

There's zero question that scale-driven computation has changed the world forever. Everyone is asking ChatGPT and other consumer AI more and more questions every day, and that's only going to accelerate as the benefits of this become more widely known. AI does not need to become AGI or self-aware to be useful; and, frankly, I would strongly prefer that it did not. To that end, I doubt that we have anything to worry about anytime soon, and perhaps not even for the foreseeable future. Thus the title of today's podcast, "The Illusion of Thinking," because I believe that the fairest conclusion is that's all we have today. It's useful, but it's not thought.

Leo: Yeah. And, you know, Anthony Nielsen's asking a legit question, which is how much they coached the LRM. You know, you can say to it, for instance, "use code," and it might well have been able to do better had they said "use code." There are things you can say, like "think harder," that actually make a difference. But it doesn't change your main point, which is, no, they're not thinking. Maybe they can do better. But even if they did better, it wouldn't necessarily mean they're thinking, by any means.

Steve: And I think, in the same way, we were initially astonished when these things started to, like, talk, and appeared to understand us.

Leo: Yeah.

Steve: It's like, holy tamoli.

Leo: It is astonishing, yeah.

Steve: And so I think what we're now underappreciating is the amount of knowledge that these things have captured, and that when we ask them to think more, think longer, think harder, we get more of that captured knowledge out - what appears to be understanding, but actually isn't. You know, we squeeze the sponge harder, and we get more out of it.

Leo: Yeah. And that's of course what these companies are doing as fast as they can, because everybody's competing to come out with the smartest solution. We should also note that this paper was written with the older models from Claude. They have 4.0 out now.

Steve: Right. And as I said also, this is all a moving target.

Leo: Yeah.

Steve: I mean, it's...

Leo: Absolutely.

Steve: And that's really the point, though, Leo. Does it matter which model, how far into the future this goes?

Leo: Probably not. Fundamentally they're not thinking.

Steve: Exactly. And I don't think they're going to. I think they're just going to be able to squeeze the sponge harder and get more of the juice out. But at some point, you know...

Leo: There's a limit.

Steve: They're not creating new juice.

Leo: Right. It's exciting times. We'll see. I don't know myself. It was a great paper. I'm glad you explained it. I appreciate it. As always, I look to you. Every week I say, oh, I can't wait till Tuesday, I wonder what Steve's going to say about this.

Steve: And again, if anyone has a youngster around, look at how gorgeous those puzzles are. Aren't they beautiful?

Leo: Oh, they're fantastic, yeah.

Steve: Yeah.


Copyright (c) 2014 by Steve Gibson and Leo Laporte. SOME RIGHTS RESERVED

This work is licensed for the good of the Internet Community under the
Creative Commons License v2.5. See the following Web page for details:
http://creativecommons.org/licenses/by-nc-sa/2.5/


