17 comments

  • temp7000 38 minutes ago
    Is it me, or does the article sound like LLM output?

    The pattern "It's not mere X — it's Y" occurs like 4 times in the text :v

    • figmert 36 minutes ago
      > :v

      I guess I found the millennial. I haven't seen that in so long!

    • mtremsal 30 minutes ago
      An AI slop pattern so widespread it’s now referred to as “it’s not pee pee it’s poo poo”.
    • kbouw 36 minutes ago
      You would be correct. Ran the article through GPTZero, 100% AI.
  • conception 21 minutes ago
    I’m pretty excited about the Edge Gallery iOS app with Gemma 4 on it, but it seems like they hobbled it: no access to intents, and you have to write custom plugins for web search, etc. Does anyone have a favorite way to run these usefully? ChatMCP works pretty well but only supports models via API.
  • codybontecou 2 hours ago
    Unfortunately Apple appears to be blocking the use of these LLMs within apps on their App Store. I've been trying to ship an app that contains local LLMs and have hit a brick wall with guideline 2.5.2.
    • CubsFan1060 1 hour ago
      Though of course Apple's rules aren't always consistent, I have 2 separate apps currently on my phone that can run this (Google's Edge Gallery and Locally AI).
      • cyanydeez 1 hour ago
        Can't be just a SaaSpocalypse. LLMs with the right harness could obliterate many of the TODO-type apps with a general assistant.

        But it's more likely it's just walled garden + security theatre that'll keep them from allowing outside apps.

        • varispeed 42 minutes ago
          Wouldn't trust AI to run a TODO list, especially weak models. They can hallucinate tasks, forget to remind you, etc.
          • tapvt 14 minutes ago
            LLMs are stateless. But given an actual database of task-shaped items and some work, I could see the potential.

            With a canonical source of truth, and set input/output expectations, the potential blast radius is quite small.
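
            That validate-before-commit shape can be sketched in a few lines (hypothetical names throughout; the stubbed `fake_llm` stands in for real local inference):

```python
import json

# Canonical source of truth: tasks live in a plain store, not in the model.
TASKS = {1: "file taxes", 2: "renew passport"}
ALLOWED_ACTIONS = {"complete", "snooze"}

def fake_llm(prompt: str) -> str:
    # Stand-in for model output; note the hallucinated task_id 99.
    return '[{"action": "complete", "task_id": 1}, {"action": "delete", "task_id": 99}]'

def apply_validated(raw: str) -> list:
    """Apply only proposals that reference real tasks and allowed verbs."""
    applied = []
    for step in json.loads(raw):
        if step.get("action") in ALLOWED_ACTIONS and step.get("task_id") in TASKS:
            applied.append((step["action"], TASKS[step["task_id"]]))
        # Hallucinated IDs or unknown verbs never touch the store.
    return applied

print(apply_validated(fake_llm("tidy my list")))  # only the valid proposal survives
```

            Since the store is the source of truth and the model can only propose whitelisted actions against existing rows, the blast radius really is small.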

    • MillionOClock 30 minutes ago
      What is your app doing? Just LLM inference?
    • Gareth321 1 hour ago
      I think Apple will become increasingly draconian about LLMs. Very soon people won't need to buy many of their apps. They can just make them. This threatens Apple's entire business model.
      • mrkpdl 40 minutes ago
        But… why would I put the effort into getting an LLM to make me an app when there’s an existing app that I don’t have to maintain? I don’t want to have to make every app I use.
        • orrito 28 minutes ago
          There's a huge difference between local apps with a one-time cost of $3-10 and apps that ask for a subscription of $5-20 per month. The first category will remain and might become more popular as quality increases; the second category will be obliterated because the value isn't there, even if all the buyers are rich. The second group takes up a much larger part of the pie than the first, though, so Apple's revenue will decrease.
      • StilesCrisis 55 minutes ago
        Apple's business model isn't really affected by 2% of its users choosing not to spend $100/yr on the App Store. That isn't even a blip on the radar.

        A kid playing Roblox can spend more than that in a good weekend.

      • borborigmus 27 minutes ago
        VibeOS. It’s just an LLM from which all other userspace is vibed.
      • Forgeties79 47 minutes ago
        I guess I am not seeing why I would want to abandon most (if any) of my simple, small, purpose-built apps that always do the exact thing I want for a private company’s ever-changing LLM that will approximate what I’m asking and approximate its response while using far more resources.

        I’m sure there are things on my phone it could replace (though I struggle to think of them) but there are plenty it can’t. My black magic camera app, web browsers, local send, libby/hoopla…

        I can’t really think of any apps I use every day - or every week - that an LLM would replace. I’m not coding on my smartphone, and aside from that an LLM is basically a more complex, somewhat inconsistent search engine experience right now for most people. Siri didn’t replace any of my apps, for instance. Why would ChatGPT?

        TL;DR: what apps would an LLM replace on my iPhone?

    • saagarjha 1 hour ago
      Use of the LLMs to do what?
  • karimf 2 hours ago
    Related: Gemma 4 on iPhone (254 comments) - https://news.ycombinator.com/item?id=47652561
    • redbell 1 hour ago
      Another related submission from 22 days ago: iPhone 17 Pro Demonstrated Running a 400B LLM (+700 pts, +300 cmts): https://news.ycombinator.com/item?id=47490070
      • zozbot234 46 minutes ago
        That's very impressive, but it's streaming weights in from flash storage. That's not really viable in a mobile context; it will use way too much power. Smaller models are far more applicable to typical use, perhaps with mid-sized models (like the Gemma 4 26B-A4B) using weight offload from SSD for rare uses involving slower "pro" inference.
  • Chrisszz 25 minutes ago
    I just installed Google AI Edge Gallery on my iPhone 16 Pro. Here are the results of the first benchmark on GPU (Prefill Tokens=256, Decode Tokens=256, 3 runs): Prefill Speed=231 t/s, Decode Speed=16 t/s, Time to First Token=1.16 s, first init time=20 s.
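
    As a sanity check on those numbers (plain arithmetic, nothing device-specific):

```python
# Prefill time should roughly match time-to-first-token.
prefill_tokens, prefill_speed = 256, 231   # tokens, tokens/s (run above)
decode_tokens, decode_speed = 256, 16

prefill_time = prefill_tokens / prefill_speed  # ~1.11 s, close to the 1.16 s TTFT
decode_time = decode_tokens / decode_speed     # ~16 s to emit all 256 tokens

print(round(prefill_time, 2), round(decode_time, 1))  # 1.11 16.0
```

    So the reported TTFT is almost entirely prefill, and generating a full 256-token response takes around 16 seconds on this hardware.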
  • grimmai143 8 minutes ago
    Do you know of a way of running these models on Android? Also, what does the thermal throttling look like?
  • the_inspector 18 minutes ago
    You are referring to the edge models, right? E2B and E4B, not the bigger ones (26B, 31B)...
  • bearjaws 37 minutes ago
    Would love to see a showdown of performance on iPhone vs Google's Tensor G5, which in my experience is two full generations behind performance-wise.
  • usmanshaikh06 1 hour ago
    ESET is blocking this site, saying:

    > Threat found: This web page may contain dangerous content that can provide remote access to an infected device, leak sensitive data from the device or harm the targeted device. Threat: JS/Agent.RDW trojan

  • mistic92 2 hours ago
    It runs on Android too, with AI Core or even with llama.cpp
  • pabs3 2 hours ago
    > edge AI deployment

    Isn't the "edge" meant to be computing near the user, but not on their devices?

    • stingraycharles 1 hour ago
      No, it isn’t. This is about as “edge” as AI gets.

      In a general sense, edge just means moving the computation to the user, rather than in a central cloud (although the two aren’t mutually exclusive, eg Cloudflare Workers)

    • hhh 1 hour ago
      It depends, because edge is a meaningless term and people choose what they want for it. In 2022, we set up a call with a vendor for ‘edge’ AI. Their edge meant something like 5kW, and our edge was a single raspberry pi in the best case.
    • pgt 1 hour ago
      Your device is the ultimate edge. The next frontier would be running models on your wetware.
      • acters 1 hour ago
        Man can't wait for AI in my brain. And then intelligence will be pay to win.
      • elcritch 1 hour ago
        Not just running it on your wetware, but charging you for it.

        Can't wait until AI companies go from mimicking human thought to figuring out how to license those thoughts. ;)

  • logicallee 1 hour ago
    For those who would like an example of its output, I'm currently working through creating a small, free (CC0, public domain) encyclopedia (just a couple of thousand entries) of core concepts in Biology and Health Sciences, Physical Sciences, and Technology. Each entry is being entirely written by Gemma 4:e4b (the 10 GB model). I believe this may be slightly larger than the model that runs locally on phones, so perhaps this model is slightly better, but the output is similar. Here is an example entry:

    https://pastebin.com/ZfSKmfWp

    Seems pretty good to me!

    • everyday7732 1 hour ago
      What's your goal? Do you have a project you want the encyclopedia for?
  • ValleZ 1 hour ago
    There are many apps to run local LLMs on both iOS & Android
  • bossyTeacher 2 hours ago
    Is the output coherent though? I have yet to see a local model running on consumer-grade hardware that is actually useful.
    • the_pwner224 59 minutes ago
      I have a 128 GB Strix Halo tablet (same as the other commenter here with the Framework Desktop). I'm using the larger Gemma 4 26B-A4B model (only 28 GB @ Q8) and it's been working great and runs very fast.

      It's a 100% replacement for free ChatGPT/Gemini.

      Compared to the paid pro/thinking models... Gemma does have reasoning, and I have used the reasoning mode for some tax & legal/accounting advice recently as well as other misc problems. It's worked well for that, but I haven't tried any real difficult tasks. From what I've heard re. agentic coding, the open weight models are ~18-24 months behind Anthropic & Google's SOTA.

      Qwen 3.5 122B-A10B should just fit into 128 GB with a Q4/5 and may be a bit smarter. There's apparently also a similar sized Gemma 4 model but they haven't released it yet, the 26B was the largest released.
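
      For anyone sizing models to RAM, the back-of-envelope math is just parameter count times bits per weight (weights only; real GGUF files carry some extra overhead, and the 4.5 bits/weight figure for a Q4/Q5 mix is my assumption):

```python
def weight_gb(params_billion: float, bits_per_weight: float) -> float:
    """Rough weight-only footprint in GB; quantized files add some overhead."""
    return params_billion * 1e9 * bits_per_weight / 8 / 1e9

print(round(weight_gb(26, 8), 1))     # 26.0 GB of weights -> ~28 GB file, as reported
print(round(weight_gb(122, 4.5), 1))  # ~68.6 GB -> fits 128 GB with room for KV cache
```

      Which is why a 122B model is only viable in 128 GB at around Q4/Q5, while Q8 would not fit.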

      • zozbot234 34 minutes ago
        There's a 31B dense model in the Gemma 4 series that's obviously going to be smarter (though a whole lot slower) than the MoE 26A4B.
        • the_pwner224 31 minutes ago
          I tried it and it was unusably slow at ~5-6 TPS. 26A4B gets close to 40 TPS which is faster than you can read, and still pretty quick with reasoning enabled.
    • jeroenhd 1 hour ago
      Google's models work quite well on my Android phone. I haven't found a use case beyond generating shitposts, but the model does its job pretty well. It's not exactly ChatGPT, but minor things like "alter the tone of this email to make it more professional" work like a charm.

      You need a relatively beefy phone to run this stuff on large amounts of text, though, and you can't have every app run it because your battery wouldn't last more than an hour.

      I think the real use case for apps is going to be tiny, purpose-trained models, like the 270M models Google wants people to train and use: https://developers.googleblog.com/on-device-function-calling... With these things, you can set up somewhat intelligent situational automation without having to work out logic trees and edge cases beforehand.
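
      The function-calling pattern those tiny models target can be sketched generically (hypothetical tool names and output format here; the linked post describes Google's actual API):

```python
import json

# Hypothetical tool registry; a real app would register its own actions.
TOOLS = {"set_brightness": lambda level: f"brightness={level}"}

def dispatch(model_output: str) -> str:
    """A small function-calling model emits a structured call; the app executes it."""
    call = json.loads(model_output)
    fn = TOOLS.get(call["name"])
    if fn is None:
        return "unknown tool"
    return fn(**call["args"])

# What a purpose-trained model might emit for "dim the screen a bit":
print(dispatch('{"name": "set_brightness", "args": {"level": 40}}'))  # brightness=40
```

      The model only has to map fuzzy language to a structured call; all the actual logic stays in ordinary code.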

    • lrvick 2 hours ago
      I run Qwen3.5 122B on a Framework Desktop at 35 t/s as a daily driver for security, OS, and software engineering work.

      Never paid an LLM provider and I have no reason to ever start.

      • mixermachine 2 hours ago
        What spec of Framework Desktop do you run this on?
        • the_pwner224 1 hour ago
          If you're looking to buy new hardware, also consider the Asus Rog Flow Z13. It has the same chip as the Framework desktop and is ~20% cheaper ($2,700) for the 128 GB spec while coming in a tablet/laptop form factor. It's capped at a slightly lower power but Strix Halo scales down very well in TDP - I never even use the max power mode on my Z13 because you don't really get any extra perf.

          The only downside is that I suspect the Framework would be a decent bit quieter under load (not that this thing is abnormally loud). Also, you're limited to a single M.2 2230 internal SSD slot (I believe Micron recently launched a 4 TB model, but generally you'll max out at 2 TB without using an external enclosure).

          I don't have anything against the Framework, I'm sure it's a great machine, but the Z13 is an incredible portable all-in-one device that can handle everything from general PC use to gaming to tablet/entertainment to LLMs & high perf.

        • breisa 1 hour ago
          There is only one spec that works: for this model you need the one with 128 GiB RAM.
    • fsiefken 2 hours ago
      Qwen3.5-9b and Qwen3.5-27b are pretty coherent on my 24G android phone
      • dpacmittal 1 hour ago
        Which android phone has 24G?
    • jfoster 2 hours ago
      It can write (some) code that works. Just roughly guessing from my use, but I think of it as being a bit like ChatGPT circa-2024 in terms of capability & speed.

      Disappointing if you compare it to anything else from 2026, but fairly impressive for something that can run locally at an OK speed.

    • a_paddy 2 hours ago
      I can try it for you
    • logicallee 1 hour ago
      It's highly coherent (see my other comment for an example of its text output) and yes, it's useful. I am starting to use Gemma 4:e4b as my daily driver for simple commands it definitely knows, things that are too simple to use ChatGPT for. It is also able to code through moderately difficult coding tasks. If you want to see it in action, I posted a video about it here[1] (the 10 GB one is at the 2 minute mark and the 20 GB one says hello at 5 minutes 45 seconds into the video.) You can see its speed and output on simple consumer grade hardware, in this case a Mac Mini M4 with 24 GB of RAM.

      [1] https://youtube.com/live/G5OVcKO70ns

  • camillomiller 2 hours ago
    Can we please ban content that is CLEARLY written by AI?
    • stingraycharles 1 hour ago
      I find it fascinating that after all this time reporters still don’t even bother to proofread content for obvious AI tells. I guess nobody really cares anymore?
    • dax_ 1 hour ago
      That bugged me too, so I started looking at other articles - they all look AI generated to me. Whole website should be banned.
  • andsoitis 6 hours ago
    is there a comparison of it running on iPhone vs. Android phones?
    • jeroenhd 55 minutes ago
      Running Gemma-4-E2B-it on an iPhone 15 (can't go higher than that due to RAM limitations) versus a Pixel 9 Pro, I don't really notice much of a difference between the two. The Pixel is a bit faster, but also a year more recent.

      The model itself works absolutely fine, though the iPhone thermal throttles at some point which really reduces the token generation speed. When I asked it to write me a business plan for a fish farm in the Nevada desert, it slowed down after a couple thousand tokens, whereas the Pixel seems to just keep going.

    • lrvick 2 hours ago
      You can run Android on just about anything so it boils down to Linux GPU benchmarks.
      • fsiefken 1 hour ago
        That doesn't answer the question; I'm curious too. I think there's a speed and battery advantage for the A19 Pro chip compared to the Snapdragon 8 Elite Gen 5, but to know for sure one would have to run the same model in the most efficient way on both flagship iOS and Android phones.