The problem - a low Signal to Noise Ratio
One of the tenets of recording activity from neurons (electrophysiology or, more generally, signal processing) is to maximize the signal-to-noise ratio. A strong signal by itself is not what matters most - having a clear signal relative to the background noise is far more important (i.e., one can improve a recording by increasing the signal strength from a neuron and/or reducing the background noise).
I think the same applies to the internet. The internet democratized information by making a good fraction of collective human knowledge accessible at our fingertips. It did this by reducing friction in getting your voice and knowledge heard by anyone around the world.
However, as a consequence, it also increased the noise, making it too hard to pull out the reliable sources of information from the unreliable.
Example plot showing that the noise level matters greatly when distinguishing the signal: in this case, 10x noise makes the signal no longer visible. (Generated using ChatGPT)
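The effect in the plot can be reproduced numerically. Here is a minimal sketch (signal shape, noise levels, and variable names are all illustrative) showing how a 10x increase in noise amplitude collapses the SNR:

```python
import numpy as np

# Illustrative sketch: the same sine "signal" measured against
# modest noise vs. 10x noise. All values here are made up.
rng = np.random.default_rng(0)

t = np.linspace(0, 1, 1000)
signal = np.sin(2 * np.pi * 5 * t)  # 5 Hz sine wave, amplitude 1

def snr_db(signal, noise_std, rng):
    """SNR in decibels: 10 * log10(signal power / noise power)."""
    noise = rng.normal(0, noise_std, signal.shape)
    return 10 * np.log10(np.mean(signal ** 2) / np.mean(noise ** 2))

clean = snr_db(signal, 0.1, rng)  # modest background noise
noisy = snr_db(signal, 1.0, rng)  # 10x the noise amplitude
# 10x the noise amplitude means 100x the noise power,
# so the SNR drops by roughly 20 dB.
```

Because power scales with the square of amplitude, a 10x noise increase costs about 20 dB of SNR - more than enough to bury a signal that was previously obvious.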
How did we get here?
Our current ability to filter information from noise relies on search engines. These search engines parse the query and use it to find webpages that might be what we are looking for. Importantly, they rely on previous clicks and visits to order the search results, allowing the most clicked/visited website to surface on top.
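The click/visit-based ordering described above can be caricatured in a few lines. This is a toy sketch - the page names and counts are hypothetical, and real engines combine many more signals:

```python
# Toy sketch of popularity-based ranking: order pages purely by
# historical click counts. Pages and counts are hypothetical.
click_counts = {
    "example.com/in-depth-guide": 1200,
    "example.com/seo-farm": 4500,   # count inflated by SEO tricks
    "example.com/forum-answer": 300,
}

def rank_by_clicks(counts):
    # The most clicked/visited page surfaces on top.
    return sorted(counts, key=counts.get, reverse=True)

ranking = rank_by_clicks(click_counts)
```

The weakness is visible immediately: anything that inflates clicks inflates rank, which is exactly the lever the noise sources below exploit.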
This idea, which allowed Google to become synonymous with web search, worked very well until a few years ago. However, search engines no longer seem as effective as before, because several sources of noise are polluting the results:
- Search engine optimization. This seems to be a big source of noise, affecting almost all search engines.1
- Result personalization based on search history. This too adds noise, in my view, because it filters results based on what the engine predicts the user wants. Although the intention is to remove noise, it often creates a search bubble, which itself becomes a source of noise when one is trying to break out of it.
- Hard-to-discern marketing links. A growing number of marketing links and ads in search engines are hard to distinguish from organic results. These may be irrelevant to the search term yet appear high on the list, becoming a source of noise.
- Redundant webpages to increase ad-based revenue. Many websites post near-identical content in the hope of attracting visitors and increasing ad revenue. High-quality content is slowly displaced by low-quality, high-quantity content, so search results fill up with the same type of content rather than diverse results tied to the query.
All this combined makes it incredibly hard to actually find relevant, broad, and varied results for a particular search term. The noise consumes the underlying signal.
And this will only get worse! Our internet history has only begun.2
The future
Given the high noise in search results, we need a way to whittle down the noise and pull out the underlying signal.
LLMs for search
I feel that LLMs naturally solve a few problems associated with increased noise in searches.
- The redundant information scattered across webpages is actually useful as training data here: the model generalizes across it and provides cohesive answers, which reduces noise.
- SEO works by exploiting the specific ways in which webpages are indexed. Given the lack of transparency in how LLMs arrive at their output, however, it may be harder to game the training data and optimize it for a particular purpose.
- Currently, marketing and personalization of answers based on user interests appear hard to do (though this is a work in progress). This is just based on my experience: responses to the same keywords seem similar across users.
- Generating a wide variety of LLMs by fine-tuning them for specific kinds of search (general, research, travel, etc.) and running them privately on one's own computer opens up the possibility of custom search engines that can be downloaded based on need.3
Caveats
That said, there are a lot of caveats that need to be figured out:
- LLMs make the generation of seemingly well-written information easy. The availability of open models that can be fine-tuned to produce specific types of articles could enable new (even more disastrous) ways of adding noise to the system.
- There is a class of adversarial attacks that has been developed to fool image-based classifiers. Similar methods may be developed in the future to craft inputs that bias LLMs toward generating ads or other specific types of output.4
- Providing reliable links and citations for the generated answers.5 Harder still is inferring the credibility of those links and citations, and getting the LLM to surface reliable (generalizable) results with the right citations.
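For intuition on the adversarial-attack caveat, here is a toy, FGSM-style perturbation against a linear classifier (all weights and inputs are made up); attacks on image classifiers scale the same idea up using gradients of deep networks:

```python
import numpy as np

# Toy FGSM-style sketch on a linear classifier. For a linear score
# s = w . x, the fastest way to change the prediction within a small
# per-feature budget eps is to step each feature along sign(w).
w = np.array([1.0, -2.0, 0.5])   # hypothetical classifier weights
x = np.array([0.2, 0.1, -0.3])   # hypothetical input; score w @ x is negative

eps = 0.1
x_adv = x + eps * np.sign(w)     # nudge every feature toward a higher score
# The perturbation is tiny (0.1 per feature), yet it flips the
# sign of the score and hence the classifier's decision.
```

A similarly small, targeted nudge to an LLM's input or training data is the worry raised above - the perturbation need not be noticeable to a human to change the output.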
Footnotes
- “Is Google Getting Worse? A Longitudinal Investigation of SEO Spam in Search Engines” - Bevendorff et al., 2024 ↩
- Only 50% of the world was consistently on the internet in 2016, which increased to 63% in 2023 (source: Our World in Data). ↩
- This is already happening: open-source software like ollama and LM Studio already allows open LLM models to be fine-tuned and run locally on computers. Notes on my explorations are in Leveraging local LLMs. ↩
- Search engines like Perplexity or my new favorite, Correkt (made by UCSB undergrads), already do a decent job at this. OpenAI, Google (Gemini), and DuckDuckGo also give AI-assisted answers alongside search results, though they are not as good. ↩