The Surprising Truth About What AI Crawlers Extract From Your Website

There is a lot of noise right now about AI optimization, GEO (generative engine optimization), and how to get your business cited by large language models like ChatGPT, Gemini, Perplexity, and Google AI Overviews. Most of what is being sold under the banner of “AI optimization” is either unproven, misleading, or outright wrong. So instead of speculating, Marketing 1on1 ran a controlled real-world test to find out exactly what AI crawlers actually read, extract, and cite from a live website. The findings cut through the marketing fluff in a way that most agency blogs will never tell you.

As someone who has spent years analyzing how search algorithms and now AI retrieval systems interact with web content, these results confirmed several things I had already suspected, while also producing a few findings that deserve serious attention from anyone managing a business website in the current landscape.

How the Test Was Structured

Marketing 1on1 set up a deliberate experiment across live websites, making specific, measurable changes to how information was presented. They then repeatedly queried multiple major LLM platforms over more than two months to see whether those models could correctly surface the information. The test focused on three specific scenarios, each designed to isolate a different variable in how AI crawlers interpret and index website content.

This was not a theoretical exercise. Real queries were submitted to real AI systems. The results were documented. And what came back was both clarifying and, for the AI optimization industry, damaging.

Schema Markup Alone Is Not Enough: AI Models Ignored It Completely

In Marketing 1on1’s test, business operating hours were removed from the visible website content and placed exclusively inside LocalBusiness schema markup. After more than two months, every LLM queried, including ChatGPT, Gemini, and Perplexity, claimed the website did not publish operating hours. Not one model successfully extracted the schema-only data. This strongly suggests AI crawlers do not meaningfully process or rely on structured data markup when generating answers.

This is the part that should give every business owner pause. Schema markup has long been positioned as a best practice for both traditional SEO and, more recently, AI optimization. Vendors are actively selling “AI schema optimization” services with the promise that adding the right structured data will make your business more visible inside AI-generated answers. The Marketing 1on1 test directly challenges that claim.

When the operating hours were visible only in the schema and not displayed anywhere on the actual webpage, the AI models consistently reported that the website did not include operating hours at all. They did not pull from the LocalBusiness markup. They did not reference the structured data. They simply told users the information was not there because, from their perspective, it was not.
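
To make that behavior concrete, here is a minimal Python sketch of one way it could come about. It assumes (this is my assumption, not something the test confirmed) that the retrieval pipeline reduces a page to its text nodes and discards script contents, which is where JSON-LD lives. The business name, hours, and page are hypothetical.

```python
from html.parser import HTMLParser


class TextOnly(HTMLParser):
    """Collect text nodes, skipping the contents of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth and data.strip():
            self.parts.append(data.strip())


# Hypothetical page: the hours exist only inside the JSON-LD block.
PAGE = """
<html><body>
<h1>Example Plumbing Co.</h1>
<p>Licensed plumbers serving the metro area.</p>
<script type="application/ld+json">
{"@context": "https://schema.org", "@type": "LocalBusiness",
 "name": "Example Plumbing Co.", "openingHours": "Mo-Fr 08:00-17:00"}
</script>
</body></html>
"""


def extract_text(html):
    parser = TextOnly()
    parser.feed(html)
    parser.close()
    return " ".join(parser.parts)


text = extract_text(PAGE)
print(text)             # Example Plumbing Co. Licensed plumbers serving the metro area.
print("08:00" in text)  # False: the hours never reach the extracted text
```

Under that assumption, a model answering from the extracted text has no hours to report, which matches what the test observed.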

This aligns with my analysis of the Ahrefs schema markup study, which reached a similar conclusion: schema markup showed no meaningful improvement in AI citation rates. Two independent studies, same result. At some point, that stops being a coincidence and starts being a pattern worth taking seriously.

“If AI models are ignoring your schema markup when generating answers, then every dollar spent on schema-focused AI optimization is buying you nothing. The data does not support the sales pitch.”

It is also worth noting that Google’s own AI optimization guidance does not instruct webmasters to focus heavily on schema as a mechanism for improving AI visibility. When even Google does not point to structured data as the lever for AI visibility, that is a signal worth heeding.

Phone Number Anchor Text: When AI Models Lose the Information Entirely

Marketing 1on1 changed the clickable phone number on their website so that the anchor text displayed “Call Us” rather than the digits of the phone number. The number remained functional as a link, but the visible text no longer showed the number itself. Within weeks, LLMs began reporting that no phone number was published on the website. The number was there in the link, but because the visible text did not display it, AI crawlers could not extract or cite it.

This result is particularly revealing because it shows how literally AI systems read visible content. The phone number existed in the href attribute of the anchor tag. A human could click “Call Us” and be connected immediately. But because the digits themselves were not rendered as readable text on the page, the LLMs concluded the number simply did not exist.

When asked directly whether a phone number was listed on the company website, the AI models responded clearly: no phone number is published anywhere on the website. That response, while technically inaccurate from a developer’s perspective, is entirely accurate from the perspective of what a model sees when it processes a page’s text content.

This tells you something important about how AI retrieval systems appear to work. They are not parsing HTML attributes and inferring facts from them. They are reading the text of the page. If a piece of critical business information is buried in a tag attribute, in an alt text field that is not contextually connected, or in any other location outside the page text, there is a strong chance AI systems will not surface it.

The practical implication is straightforward: if you want AI crawlers to cite your phone number, your hours, your address, your pricing, or any other business-critical data, that information needs to appear as actual visible text on the page.
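
A short Python sketch shows the gap between what a link carries and what a text extractor sees. The markup and the 555 phone number are hypothetical, and the extractor is a simplified stand-in for whatever the AI platforms actually run.

```python
from html.parser import HTMLParser


class LinkAndText(HTMLParser):
    """Record link targets and text nodes separately."""

    def __init__(self):
        super().__init__()
        self.text_parts = []
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.hrefs.append(href)

    def handle_data(self, data):
        if data.strip():
            self.text_parts.append(data.strip())


# Hypothetical snippet: the number lives only in the href attribute.
SNIPPET = '<p>Questions? <a href="tel:+15555550100">Call Us</a> today.</p>'

parser = LinkAndText()
parser.feed(SNIPPET)
parser.close()

visible_text = " ".join(parser.text_parts)
print(visible_text)   # Questions? Call Us today.
print(parser.hrefs)   # ['tel:+15555550100']
```

The number is recoverable only by a system that deliberately inspects href values. Anything that works from the text alone sees “Call Us” and nothing else.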

The One Surprising Finding: Hidden Content with Display None Was Readable

Despite the failures with schema markup and anchor text, Marketing 1on1 found that content hidden using CSS display: none was actually crawled and cited by LLMs. This is counterintuitive and raises meaningful questions about exactly how AI crawlers access and process web content compared with traditional search bots.

Google has historically been cautious about hidden content, sometimes treating it as lower quality or potentially manipulative depending on context. AI crawlers, at least based on this test, appear to read the raw HTML content regardless of whether the CSS instructs a browser to display it. The text was there in the source, and the models could cite it.

I would caution against reading this as permission to stuff hidden content with information you want AI to surface. That would be a manipulative approach that is unlikely to survive as AI retrieval systems mature. But it does tell us something interesting about the technical architecture of how these crawlers are ingesting content. They appear to be working at the HTML text extraction layer rather than simulating a fully rendered browser environment in all cases.
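
The difference between the two layers can be sketched in Python. This is an illustration under stated assumptions, not a description of any real crawler: the "rendered" mode below only honors inline display: none styles, whereas a real browser also applies stylesheets and scripts. The page content is hypothetical.

```python
from html.parser import HTMLParser

# Elements with no closing tag; they must not affect the nesting count.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


class Extractor(HTMLParser):
    """Text extractor that can optionally skip inline display:none subtrees."""

    def __init__(self, skip_hidden):
        super().__init__()
        self.skip_hidden = skip_hidden
        self.hidden_depth = 0   # > 0 while inside a hidden element
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        style = (dict(attrs).get("style") or "").replace(" ", "").lower()
        if self.hidden_depth or "display:none" in style:
            self.hidden_depth += 1

    def handle_endtag(self, tag):
        if tag not in VOID_TAGS and self.hidden_depth:
            self.hidden_depth -= 1

    def handle_data(self, data):
        if data.strip() and not (self.skip_hidden and self.hidden_depth):
            self.parts.append(data.strip())


def extract(html, skip_hidden):
    parser = Extractor(skip_hidden)
    parser.feed(html)
    parser.close()
    return " ".join(parser.parts)


# Hypothetical page with one visible and one CSS-hidden paragraph.
PAGE = ('<div><p>Open by appointment.</p>'
        '<p style="display: none">Weekend emergency line available.</p></div>')

raw = extract(PAGE, skip_hidden=False)
rendered = extract(PAGE, skip_hidden=True)
print(raw)       # Open by appointment. Weekend emergency line available.
print(rendered)  # Open by appointment.
```

The test result is consistent with the first mode: text that exists in the source is picked up whether or not CSS would show it to a visitor.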

That said, the broader lesson from the complete test is not “use hidden content.” The broader lesson is that clearly visible, properly rendered, human-readable text content is the most reliable way to ensure AI models can extract and cite your information.

What This Means for How AI Crawlers Actually Work

Taken together, these three findings paint a fairly clear picture of AI crawler behavior that contradicts much of what is currently being sold in the market.

  • AI crawlers appear to prioritize rendered, visible text content above structured data
  • Schema markup in its current form does not appear to influence what LLMs cite or how they describe your business
  • Anchor text matters more than the underlying href value for phone numbers and similar data points
  • CSS-hidden content may be accessible to AI crawlers at the source level, though this should not be relied upon as a strategy
  • LLMs can confidently state that information does not exist on a website, even when it technically does, if it is not visible as text

The implication for website owners is not complicated: write clearly, present your information in plain, visible text, and do not assume that backend technical configurations such as schema, JSON-LD, or custom meta fields are doing the heavy lifting for AI visibility.
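
One way to check whether a page leans on schema for facts it never states in text is to compare the two directly. The following Python sketch is a rough audit under simple assumptions: it uses exact, case-insensitive substring matching, so a value that is formatted differently in the page text will also be flagged for review. The function names and the sample page are hypothetical.

```python
import json
from html.parser import HTMLParser


class PageSplitter(HTMLParser):
    """Separate JSON-LD payloads from ordinary text nodes."""

    def __init__(self):
        super().__init__()
        self._mode = None    # None, "jsonld", or "skip"
        self.jsonld = []     # one string per JSON-LD script block
        self.text = []

    def handle_starttag(self, tag, attrs):
        if tag == "script":
            kind = (dict(attrs).get("type") or "").lower()
            if kind == "application/ld+json":
                self._mode = "jsonld"
                self.jsonld.append("")
            else:
                self._mode = "skip"
        elif tag == "style":
            self._mode = "skip"

    def handle_endtag(self, tag):
        if tag in ("script", "style"):
            self._mode = None

    def handle_data(self, data):
        if self._mode == "jsonld":
            self.jsonld[-1] += data
        elif self._mode is None:
            self.text.append(data)


def leaf_strings(node, key=""):
    """Yield (key, value) for each string leaf, skipping @-keywords."""
    if isinstance(node, dict):
        for k, v in node.items():
            if not k.startswith("@"):
                yield from leaf_strings(v, k)
    elif isinstance(node, list):
        for item in node:
            yield from leaf_strings(item, key)
    elif isinstance(node, str):
        yield key, node


def schema_only_values(html):
    """Return JSON-LD string values that never appear in the page text."""
    splitter = PageSplitter()
    splitter.feed(html)
    splitter.close()
    visible = " ".join(" ".join(splitter.text).split()).lower()
    missing = []
    for payload in splitter.jsonld:
        try:
            data = json.loads(payload)
        except json.JSONDecodeError:
            continue  # malformed JSON-LD is ignored in this sketch
        for key, value in leaf_strings(data):
            if value.lower() not in visible:
                missing.append((key, value))
    return missing


# Hypothetical page: name and phone are in the text, hours are schema-only.
PAGE = """
<h1>Example Plumbing Co.</h1>
<p>Call 555-555-0100.</p>
<script type="application/ld+json">
{"@context": "https://schema.org", "@type": "LocalBusiness",
 "name": "Example Plumbing Co.", "telephone": "555-555-0100",
 "openingHours": "Mo-Fr 08:00-17:00"}
</script>
"""

print(schema_only_values(PAGE))  # [('openingHours', 'Mo-Fr 08:00-17:00')]
```

Anything the audit returns is a candidate for being written out in plain text on the page.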

The “Fake AI Optimization” Problem Is Real and Growing

One of the reasons this study matters beyond academic curiosity is the volume of services currently being sold under the label of AI optimization or GEO. Many of these services are built on the assumption, usually stated confidently in sales materials, that schema markup, custom coding, and structured data layers are the keys to appearing inside AI-generated answers. The Marketing 1on1 test directly refutes that.

If a vendor is selling you “AI citation schema packages” or promising that their proprietary code will make you more visible inside ChatGPT or Google AI Overviews, the evidence does not support that claim. More than two months of testing on a live website showed that schema markup was completely invisible to LLMs when surfacing business-specific information. That is not a minor caveat. It is a fundamental flaw in the entire premise of schema-centric AI optimization services.

“The SEO industry has a long history of selling technical solutions to problems that actually require content solutions. AI optimization appears to be repeating that pattern almost exactly.”

What actually moves the needle, based on the evidence available, is the quality, clarity, and visibility of your written content. Information that is clearly written, properly structured in readable headings and paragraphs, and factually accurate is what AI models extract and cite. That has always been a good content strategy. It remains a good content strategy now.

Practical Takeaways for Business Owners and SEO Professionals

Based on the Marketing 1on1 findings and the broader body of evidence accumulating around AI retrieval behavior, here is what I would recommend to any business concerned about their visibility inside AI-generated answers:

  1. Display all critical business information as visible text. Hours, phone numbers, service areas, pricing ranges, and key differentiators should all appear as readable content on the page, not just inside schema or hidden in attributes.
  2. Do not rely on schema markup as your primary AI optimization strategy. Use it for traditional SEO purposes where it still provides value, but do not expect it to improve your AI citation rates based on current evidence.
  3. Write with specificity. AI models favor content that directly answers specific questions. Vague marketing language is unlikely to be extracted or cited. Clear, factual, specific content is.
  4. Structure content for extraction. Use clear headings, short informative paragraphs, and direct answers near the top of sections. This matches how AI retrieval systems scan and pull content.
  5. Audit your anchor text. If your phone number, email, or other contact information is hidden behind generic anchor text like “contact us” or “call now” without the actual data visible, AI models may not be able to surface it.
  6. Be skeptical of technical AI optimization services. If a vendor cannot explain in plain terms how their technical intervention changes what a language model reads from your page, push back. The evidence suggests the answer may be that it does not.
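
The anchor text audit in item 5 can be partly automated. Here is a minimal Python sketch, assuming contact links use tel: and mailto: hrefs. The function name and the sample markup are hypothetical, and the check looks only at each link's own anchor text, not at whether the same detail appears elsewhere on the page.

```python
import re
from html.parser import HTMLParser
from urllib.parse import unquote


class ContactLinks(HTMLParser):
    """Collect (href, anchor_text) pairs for tel: and mailto: links."""

    def __init__(self):
        super().__init__()
        self._current = None   # [href, text chunks] while inside a contact link
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href") or ""
            if href.lower().startswith(("tel:", "mailto:")):
                self._current = [href, []]

    def handle_data(self, data):
        if self._current is not None:
            self._current[1].append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._current is not None:
            href, chunks = self._current
            self.links.append((href, " ".join("".join(chunks).split())))
            self._current = None


def digits(value):
    return re.sub(r"\D", "", value)


def find_generic_contact_links(html):
    """Return contact links whose anchor text does not show the contact detail."""
    collector = ContactLinks()
    collector.feed(html)
    collector.close()
    flagged = []
    for href, text in collector.links:
        scheme, _, value = href.partition(":")
        value = unquote(value.split("?")[0])
        if scheme.lower() == "tel":
            shown = digits(text)
            # Accept the number with or without its country code prefix.
            ok = len(shown) >= 7 and digits(value).endswith(shown)
        else:
            ok = value.lower() in text.lower()
        if not ok:
            flagged.append((href, text))
    return flagged


# Hypothetical markup: two generic links and two that show the detail.
PAGE = """
<a href="tel:+15555550100">Call Us</a>
<a href="tel:+15555550100">(555) 555-0100</a>
<a href="mailto:info@example.com">Email us</a>
<a href="mailto:info@example.com">info@example.com</a>
"""

for href, text in find_generic_contact_links(PAGE):
    print(f"{href!r} is shown as {text!r}")
```

Run against the sample markup, this flags the “Call Us” and “Email us” links and passes the two whose anchor text shows the number or address.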

Myths vs. Facts: AI Crawlers and Schema Markup

Common Claim | What the Evidence Actually Shows
Schema markup helps AI models surface your business information | Marketing 1on1’s test showed AI models did not extract schema-only information after two months of testing
JSON-LD structured data improves AI citation rates | The Ahrefs study and Marketing 1on1’s test both found no meaningful improvement from schema in AI citation behavior
AI crawlers work similarly to Googlebot | AI crawlers appear to prioritize visible rendered text differently, and may read CSS-hidden content that traditional SEO advice discourages
Technical AI optimization services can get you cited by LLMs | No controlled evidence supports technical backend changes improving AI citation rates over clearly written visible content
A clickable phone number link is enough for AI models to cite your number | Marketing 1on1 showed that if the anchor text does not display the number, AI models report no phone number exists on the site

Why Visible Content Remains the Most Powerful AI Optimization Signal

Everything in this study points back to a principle that good content strategists have understood for years: write for humans first, and make sure the information is actually on the page. AI systems are, at their core, trained on human-readable text. They are designed to understand and reproduce the kind of information that a person would read and find useful. When that information is clear, accurate, and present, AI models can work with it. When it is buried in backend code, hidden in markup that browsers do not render to users, or obscured by generic anchor text, AI models behave as though it simply does not exist.

This is both a challenge and an opportunity. The challenge is that many businesses have historically treated certain types of information as “backend” data managed through CMS fields, schema plugins, or database entries that never get written into page content. Those businesses may be invisible to AI retrieval systems for important queries. The opportunity is that fixing this requires content work, not expensive technical packages. Write your hours on the page. Display your phone number as text. Describe your services clearly. Answer the questions your customers actually ask.

That is, based on the best available evidence, including the Marketing 1on1 study, what AI crawlers are actually reading.

Frequently Asked Questions

Do AI crawlers read schema markup when generating business citations?

Based on Marketing 1on1’s controlled test, AI crawlers did not extract or cite information that existed only in schema markup, including LocalBusiness structured data containing operating hours. After more than two months of testing, the LLMs consistently reported that the information was not present on the website, despite it being correctly formatted in the schema. This strongly suggests that schema markup is not a reliable mechanism for influencing what AI models cite about your business.

What happens if my phone number is a link, but the anchor text says “Call Us” instead of showing the number?

According to Marketing 1on1’s test, LLMs were unable to identify or cite the phone number when the anchor text was changed to “Call Us” rather than displaying the actual number. The models reported that no phone number was published on the website. This suggests that AI crawlers rely on the text shown for a link, not on underlying link attributes or href values, when extracting contact information.

Can AI models read CSS hidden content using display: none?

Marketing 1on1’s test found that content hidden with CSS display: none was accessible to AI crawlers and could be cited in responses. This suggests AI systems may be processing raw HTML text content rather than fully simulating a browser-rendered page in every instance. However, using hidden content as an AI optimization strategy is not recommended, as it mirrors techniques historically associated with manipulative practices and may be treated differently as AI retrieval systems evolve.

Is investing in schema markup still worthwhile for SEO in the context of AI?

Schema markup still holds value for traditional search engine optimization purposes, including rich results and Knowledge Graph contributions. However, the evidence from both Marketing 1on1’s study and the Ahrefs schema research suggests it does not reliably improve AI citation rates or influence what LLMs report about your business. For AI visibility specifically, clearly written, visible text content appears to be a substantially more effective investment than schema-focused optimization.

What type of website content is most likely to be cited by AI systems like ChatGPT, Gemini, and Perplexity?

Based on current evidence, AI systems favor clearly written, factually specific, human-readable text content that directly answers questions. Information displayed as visible on-page text, organized under descriptive headings, and written with precision, is most likely to be extracted and cited. Business-critical details like hours, contact information, service descriptions, and pricing should be written as plain text, not stored exclusively in schema markup, CMS backend fields, or link attributes.

Final Thoughts

The Marketing 1on1 study is one of the more practically useful pieces of real-world AI crawler research I have seen. It does not rely on theory or extrapolation. It ran actual queries on actual AI systems over an extended period and documented what happened. The results are consistent with Ahrefs’ research, Google’s guidance, and the general principle that AI models are fundamentally text-reading systems that prioritize what is visible to users.

If you are evaluating your current website for AI visibility, the question to ask is simple: if a person loaded your page and read only the visible text, would they find every piece of information you want AI models to surface about your business? If the answer is no, that is where to start. Not with schema packages. Not with proprietary coding overlays. With the content itself.

That is what the evidence actually supports, and in an industry full of speculation and salesmanship, evidence matters.

If you want an honest assessment of how your website is currently positioned for AI retrieval and what practical steps would actually improve your visibility, reach out to Affordable SEO Expert for a straightforward conversation grounded in what the data actually shows, not what vendors want to sell you.
