The question comes up in nearly every recent ministry conversation we have, in some form: “Can we just use ChatGPT Deep Research for this?”
It is a fair question, and one whose answer our clients understand once we explain it.
Deep Research, and similar offerings from various AI providers, is genuinely capable. It will autonomously browse the open web for several minutes per query, read across text, images, and PDFs, and return a multi-thousand-word cited synthesis. It lets you nominate or prioritise specific indexed sites, and increasingly connects to external data sources through MCP. For a one-shot literature review on an indexed topic, such as a sector primer, an industry scan, or a precedent search, it is often the right tool. We use it ourselves for that class of work, and it provides a quick, useful first pass.
The gap appears when the requirement is not literature-review-shaped. Sometimes the brief is find and synthesise what's been discussed about X. Sometimes the brief is tell me how citizens received [policy announcement] in the 24 hours after, on the platforms where they actually argue, in the languages they actually speak, separated from coordinated noise, with confidence levels we can stand behind.
The first is research. The second is intelligence.
Deep Research:
- Corpus: Open-web index, indexed only
- Depth: A citation sample
- Language: English-first
- Claims: Confident prose

Media intelligence:
- Corpus: Source-prioritised across mainstream · social · forums
- Depth: The underlying conversation
- Language: Native EN · ZH · MS · TA + Singlish
- Claims: Source · method · confidence
Both are legitimate jobs; they call for different tools. The aim of this piece is not to argue that Deep Research is bad (it isn't) and not to run a head-to-head feature table. It is to give you a way to recognise the right tool for the task in front of you.
A worked example: the post-event reception read
Imagine a high-stakes policy announcement made six hours ago. The requirement from senior management is an hourly read for the first 24 hours, until the situation stabilises, followed by a full report: how did it land among the grassroots, which stakeholders expressed opinions and on which platforms, are early narratives forming that will need a response, and is the public reception growing and being amplified.
What Deep Research returns. A polished memo. Mainstream coverage well-summarised. A handful of high-engagement X posts pulled into the synthesis. Surface themes identified, with citations. The prose reads confident and well-formed. For a quick read it is genuinely useful.
What it misses on this particular brief. Six things, each previewing one of the gaps.
- The HardwareZone EDMW thread where the policy critique actually seeded, and where the most-shared screenshots originated, never appears. It isn't in the corpus the tool can reach. And even when the tool does pull in a representative Reddit thread, it doesn't cover the rest of social media (Gap 1: source control).
- The first hours' reactions don't appear because web indexing hasn't caught up with them yet; meanwhile, material from outside the 24-hour window gets pulled in because it ranks high in web search, diluting the post-event signal (Gap 2: bounded time).
- Mainstream articles are read at the article level, but the social media posts and subsequent comment threads (where the public reception is actually being argued out) are never ingested. The corpus is a citation sample, not the underlying conversation (Gap 3: depth of corpus).
- Only English results are covered. Other local languages (Mandarin, Malay, and Tamil) and code-switched Singlish are not, so language and cultural nuance is lost (Gap 4: native multilingual).
- Comments from recently created accounts (which may be bots or spam accounts) are treated as authentic public reaction, counted alongside organic voices in the “what people are saying” tally (Gap 5: authentic vs amplified).
- A confident claim of a sentiment shift which, on closer examination, the underlying data does not actually support. Trying to be helpful, the LLM produces fluent prose where the evidence is thin, and ends up hallucinating in the process (Gap 6: calibrated claims).
What a Wisma post-event report does instead on the same brief. A source-prioritised corpus across mainstream, social, and forums, including closed-space coverage where lawful and appropriate. Scoping bounded to the actual 24-hour window, not “top results.” The conversation read in native English, Mandarin, Malay, and Tamil (with English translations for analyst checks) with cultural and specialist context preserved. Bots and spam filtered out of the source dataset before the analysis runs, so the analysis can separate organic public reception from inauthentic amplification. Source, method, and confidence documented on every classification, with a human reviewer checking the output.
Six gaps to close
We work through each gap in turn.
1. Source control, not just search. Deep Research's corpus is whatever its tool surfaces from the open web. Its source-prioritisation surface lets you nominate or weight indexed sites, which is genuinely useful when your sources happen to be web-indexed. What it cannot reach are the platforms that fall outside the open web, such as local forums where policy critique seeds, semi-private channels, and niche industry boards. Nor can it weight sources by trust and method beyond on/off and prioritisation toggles. A media intelligence partner specifies the corpus on the question's terms: which forums, which platforms, which closed spaces (where lawful and appropriate), each weighted by trust and relevance to the brief in front of you.
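To make "weighted by trust and relevance" concrete, here is a minimal sketch of what a question-scoped corpus specification can look like. The channel names, example sources, and weights are all illustrative assumptions, not a real configuration:

```python
# Illustrative corpus spec: channels, sources, and trust weights are
# hypothetical values for the sketch, not a real configuration.
corpus_spec = {
    "mainstream": {"sources": ["straitstimes.com", "channelnewsasia.com"], "trust": 0.9},
    "forums":     {"sources": ["hardwarezone.com.sg (EDMW)"],              "trust": 0.8},
    "social":     {"sources": ["reddit.com/r/singapore", "x.com"],         "trust": 0.6},
}

def effective_weight(channel: str, relevance: float) -> float:
    """Combine per-channel trust with a document's relevance to the brief."""
    return corpus_spec[channel]["trust"] * relevance

# A highly relevant forum post can outrank a marginally relevant
# mainstream article in the analysis.
forum_doc = effective_weight("forums", 0.9)           # ≈0.72
mainstream_doc = effective_weight("mainstream", 0.5)  # ≈0.45
```

The point of the shape is that the corpus is an input to the question, not a by-product of whatever a search index happens to rank highly.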
2. Time, bounded to the question, including very recent and very specific. Two failure modes here, both common. Sub-day windows. When the question covers the first twelve hours after an announcement, Deep Research depends on web search indexing, whose latency for many sources runs into the next day or beyond. A partner with continuous ingestion across news, social, and forums sees the same material within minutes of publication. Historically bounded windows. When the question is “13–20 January 2024,” Deep Research returns top relevance matches for the search terms, which routinely include material from outside the date range. A partner takes the start, end, and granularity as part of the question, and analyses only what falls inside it.
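As a minimal sketch of what "bounded to the question" means in practice (illustrative only; the field names and the in-memory list stand in for an ingestion pipeline):

```python
from datetime import datetime, timezone

def within_window(posts, start, end):
    """Keep only posts whose timestamp falls inside [start, end)."""
    return [p for p in posts if start <= p["ts"] < end]

posts = [
    {"id": "a", "ts": datetime(2024, 1, 14, 23, 0, tzinfo=timezone.utc)},
    {"id": "b", "ts": datetime(2024, 1, 15, 6, 30, tzinfo=timezone.utc)},
    {"id": "c", "ts": datetime(2024, 1, 21, 9, 0, tzinfo=timezone.utc)},  # outside the range
]

window = within_window(
    posts,
    start=datetime(2024, 1, 13, tzinfo=timezone.utc),
    end=datetime(2024, 1, 21, tzinfo=timezone.utc),  # exclusive bound: 13-20 January
)
# only "a" and "b" survive; high-ranking material from outside the
# date range never enters the analysis at all
```

The filter is trivial; what matters is that it runs over a continuously ingested corpus, so the window can start six hours ago as easily as six months ago.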
3. Depth of corpus, not a sample of it. Deep Research operates at the level of page-level retrieval. It surfaces and synthesises from a finite citation set, and it is genuinely good at picking the high-relevance sources and pulling from them. Where the answer requires reading what most people are saying (every comment thread under the article, every forum reply, every related social-media chain), the unit shifts from cited sources to data points, and the volume shifts by orders of magnitude. A partner ingests at the depth of the underlying conversation and filters down to what matters; the read is grounded in the conversation itself, not a citation sample of it.
4. Native multilingual, with cultural and specialist context. Singlish, code-switched threads, religious register, trade-specific or community-specific language, and in-group sarcasm carry meaning that machine translation flattens. Models that natively understand multilingual and multicultural contexts read them in the register they were written in. Specialist domains compound the same problem: legal language, healthcare terminology, and technical industry idiom each have their own in-group conventions a translation pass cannot reconstruct. A partner builds for that linguistic and cultural surface deliberately, rather than treating English translation as the default unit of analysis.
5. Authentic signal, not amplified noise. Bot rings, coordinated reply networks, persona farms, and brigading rings inflate engagement counts and skew “what people are saying” tallies. A partner filters these out of the source dataset before the analysis runs, so the read describes organic public reception rather than amplification machinery. The honest tractable form of this work is at the network layer, not the content layer: what amplified this, who seeded it, which network carried it. Reliable detection of whether any single post was written by a human is not something we (or anyone) can claim today, and a partner who does claim it should not be trusted on the rest of the stack either.
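A toy illustration of what "the network layer, not the content layer" means: flag a URL when many distinct accounts push it within a short burst, without claiming to know whether any individual post was human-written. The thresholds, field shapes, and in-memory event list are assumptions for the sketch:

```python
from collections import defaultdict

def co_share_clusters(events, window_s=60, min_accounts=3):
    """Flag URLs pushed by many distinct accounts inside a short burst:
    a network-layer signal of coordinated amplification, not a verdict
    on any single post's authorship."""
    by_url = defaultdict(list)
    for account, url, ts in events:
        by_url[url].append((ts, account))
    flagged = {}
    for url, shares in by_url.items():
        shares.sort()
        for i in range(len(shares)):
            # accounts sharing within window_s seconds of share i
            burst = {a for t, a in shares if 0 <= t - shares[i][0] <= window_s}
            if len(burst) >= min_accounts:
                flagged[url] = sorted(burst)
                break
    return flagged

events = [
    ("acct1", "example.com/x", 0),
    ("acct2", "example.com/x", 12),
    ("acct3", "example.com/x", 40),    # three accounts in 60 s: flagged
    ("acct4", "example.com/y", 0),
    ("acct5", "example.com/y", 3600),  # spread out: looks organic
]
flagged = co_share_clusters(events)
# → {"example.com/x": ["acct1", "acct2", "acct3"]}
```

Real systems layer many such signals (account age, reply graphs, timing fingerprints); the sketch only shows why the tractable unit is the network, not the sentence.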
6. Calibrated claims, not confident prose. OpenAI's own documentation for ChatGPT Deep Research concedes that the tool “occasionally makes factual hallucinations or incorrect inferences” and “may not accurately convey uncertainty.” The same caveat, in different wording, applies to its peers. That disclosure is honest, and worth holding onto. The question that matters for the buyer is what additional methodology sits on top of the LLM. A partner pairs LLM analysis with proprietary hallucination-reduction techniques and human review on classifications that carry political weight, and documents source, method, and confidence on every published claim. The output is something you can still cite in a week, in a different register, with the method footnotes intact.
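What "source, method, and confidence on every claim" can look like as a data shape (a sketch; the fields and the threshold are illustrative, not a production schema):

```python
from dataclasses import dataclass

@dataclass
class Claim:
    text: str
    sources: list         # where the evidence lives
    method: str           # how the classification was produced
    confidence: float     # calibrated 0-1, not prose-level certainty
    human_reviewed: bool = False

def publishable(claim: Claim, threshold: float = 0.6) -> bool:
    """A claim ships only if it clears the bar and a human has reviewed it."""
    return claim.confidence >= threshold and claim.human_reviewed

claim = Claim(
    text="Negative sentiment rose among forum commenters in hours 6-12.",
    sources=["forum-thread-4412", "social-batch-0615Z"],
    method="LLM classification with analyst spot-check",
    confidence=0.7,
    human_reviewed=True,
)
# publishable(claim) → True; drop confidence below the threshold
# and the claim is held back or caveated instead of published
```

The discipline the shape enforces is the point: a fluent sentence with no sources, no method, and no calibrated confidence simply has nowhere to live.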
Research, intelligence, and what to do with the distinction
What the six gaps together diagnose is not which tool is “better.” It is the line between research and intelligence: between find and synthesise what's been written about X and tell me what is happening in this corpus, on this timeline, in these languages, separated from this noise, with confidence levels you can stand behind. Most ministry briefs sit cleanly on one side of that line or the other. Some genuinely sit on the research side, and Deep Research is the right instrument for those. Some sit on the intelligence side, and a different shape of partner is needed.
You will notice what this piece does not include. There is no head-to-head comparison table. No claim that Deep Research is bad. No ranked vendor list. The category distinction does the work; the rest would be theatre.
We should say openly: we are one of the companies that builds to close those gaps. We have been building NarrativeIQ for the intelligence requirements this piece covers:
- A source-prioritised corpus across mainstream, social, and forums, including closed-space coverage where lawful and appropriate.
- Continuous ingestion at minute-level latency, with analysis bounded to the question's chosen window and depth at the level of the underlying conversation rather than a citation sample of it.
- Models that natively understand multilingual and multicultural contexts, including specialist and in-group terminology.
- Spam and coordinated amplification filtered out of the source dataset before analysis runs.
- Source, method, and confidence documented on our reports, with domain-reviewer validation on high-stakes claims.
If you are thinking about how research becomes intelligence in your function, and how to tell brief by brief when each is the right tool, we would be glad to walk it through. Sometimes the honest answer is use a tool like Deep Research. Other times it is not. Either way, we are happy to think it through with you and share what we have seen across similar briefs. Write to us at contact@wisma.ai.