Tuesday, October 15, 2019

The sad state of NLP tech in the marketplace today

Update: Sinequa has caught up to us in going straight to the answer, EXCEPT we are FASTER and MORE ACCURATE. Don't believe it? Try the first 3 covid queries on our covid virus search at noonean.com and then try them against Sinequa's covid search. We blow them away. One good thing about so many companies doing covid search: we finally get put-up-or-shut-up comparisons.
-------------------------------------

Very few products use advanced NLP techniques. We've scoured the marketplace and found product after product lacking.

Here is a brief list of things they do:
 
     Phrase recognition
     Simple named entity recognition
     Part-of-speech tagging
     Lemmatization
     User Intent Analysis
     Disambiguation

While it's true that all of these are parts of NLP, none of them are ADVANCED NLP techniques. Do they help improve search? Yes. But not as much as advanced techniques do.

Here is how one competitor describes their tech:

  • Semantic search: uses a variety of signals to understand the user’s intent and handle ambiguity. For example, semantic search understands that when a user searches for “profit,” she would also want to find data sets that reference “net income.” Similarly, if a user searches for “NJ,” they would also return entries for “New Jersey.”
  • Lemmatization uses variations on words such as plurals, tenses, genders, hyphenated forms, and more. For example, a search for “running” would return matches for “runs” or “ran.”
  • Advanced syntax increases precision through techniques such as phrase search, fielded search, Boolean matching, and proximity search.
  • Fuzzy matching increases recall and allows for looser matching. Substring and approximate matches allow users to find data sources when they only have partial information or even incorrect information.

None of this is Advanced NLP. Sadly, it's commonplace for companies to pitch that they have NLP tech in their search when it's all very rudimentary. You can't argue that lemmatization and simple named entity extraction are NOT NLP, because they are, but they're really just the first few steps of a processing chain with the higher order functions lopped off and ignored. And why is that? Because these companies don't really have backgrounds in doing NLP research.
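
To give a rough sense of how little machinery the features quoted above need, here is a minimal Python sketch of lemma lookup, synonym expansion, and fuzzy matching using only the standard library. The tables, the vocabulary, and the function name expand_term are all made up for illustration; this is not any vendor's actual implementation.

```python
import difflib

# Hypothetical lookup tables; a real product would load much larger ones.
SYNONYMS = {"profit": ["net income"], "nj": ["new jersey"]}
LEMMAS = {"running": "run", "runs": "run", "ran": "run"}
VOCABULARY = ["revenue", "profit", "run", "new jersey", "net income"]


def expand_term(term):
    """Return the lemma of `term` plus any synonyms and one fuzzy match."""
    term = term.lower()
    base = LEMMAS.get(term, term)               # lemmatization by table lookup
    expanded = [base] + SYNONYMS.get(base, [])  # synonym expansion
    if base not in VOCABULARY:
        # Fuzzy matching: closest known word, if it is similar enough.
        expanded += difflib.get_close_matches(base, VOCABULARY, n=1, cutoff=0.8)
    return expanded


print(expand_term("profit"))  # lemma plus its synonym
print(expand_term("ran"))     # inflected form mapped to its lemma
print(expand_term("proft"))   # misspelling picks up a close vocabulary word
```

A few dictionary lookups and a string similarity call cover all of the quoted features, which is the point: they are useful, but they are the first steps of the chain.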

It's a successful strategy that gets investment and sales, but the result is a shoddy product. Just like the AI buzz that gets everyone funded, it's quite troubling that such minimal, low-level tech is taken as representative of the state of the art. It isn't.

Shockingly, things are no better at Google, Microsoft, or Amazon when it comes to their NLP search offerings. Built mostly around user intent recognition, sentiment analysis, and disambiguation, these products are aimed at chat bots, not enterprise search.



Sinequa recently demoed their tech at KMWorld. First you get a list of documents, and then you have to click on each document to see its analysis. Noonean is more advanced: we show you the answers DIRECTLY, not documents.


Noonean is designed from the beginning for enterprise scale and works with a parallel index so you can trial the technology without any risk of damaging your current enterprise. The index can reside as a SOLR core on the same machine or a completely different machine. 


Noonean.com brings more advanced NLP techniques to the market.

Companies see the value of automating customer service and think their traditional search technologies are sufficient for the enterprise. But moving to Cognitive Search and Insight engines with Advanced NLP and AI learning will take things to another level and have a profound impact on their bottom lines. In reality, enterprise search is a crude, first-line form of knowledge management, and fact bases built on NLP Ontologies are the next level after that.




Tuesday, October 1, 2019

User Intent Analysis and NLP - Potato Chips or Computer Chips

One area that is not the sine qua non of NLP but nonetheless a core technique is user intent analysis. This is more critical for areas like building a chat bot or an Alexa, and has some but lesser utility in enterprise search.

So consider the user entered query:
    what is the most expensive chip

Do they mean computer chip or potato chip? How can the machine know? Technically this is a query disambiguation class problem.

There are two types of user intent. One is categorical and the other is free intent.

For categorical intent, an NLP corpus is split into different areas. This is quite common. One person may work in the food side of the org and another in the silicon chip side. Typically there might be 10-30 categorical intents.

To solve for categorical intent, the corpus for each area is processed separately and statistical maps or 3D spatial similarity maps are generated. Typically these are built on NLP-generated tokens, not the words themselves. In our example, "expensive" would have a stronger correlation score with computer chips than with potato chips. That would trigger the user intent of "computers," and query transforms that boost toward that category would be applied.
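
As a rough illustration of the statistical-map idea, here is a minimal Python sketch that scores a query against per-category co-occurrence tables. Everything in it is an assumption made for the example: the toy corpora, the stopword list, the function names, and the use of raw words rather than NLP-generated tokens. It is not how any particular product does it.

```python
from collections import Counter

# Toy per-category corpora (hypothetical stand-ins for departmental corpora).
CORPORA = {
    "computers": [
        "the most expensive chip in the server is the processor",
        "this chip is expensive because of its transistor count",
        "a fast chip needs fast ram",
    ],
    "food": [
        "this potato chip is salty and crunchy",
        "a bag of chip snacks is cheap",
        "the saltiest snack is the kettle chip",
    ],
}
STOPWORDS = {"what", "is", "the", "a", "of", "in", "this", "and", "its"}


def cooccurrence_scores(docs, anchor):
    """For each token, the fraction of `anchor` documents it also appears in."""
    counts = Counter()
    anchor_docs = 0
    for doc in docs:
        tokens = set(doc.split())
        if anchor in tokens:
            anchor_docs += 1
            counts.update(tokens - {anchor})
    if anchor_docs == 0:
        return {}
    return {tok: n / anchor_docs for tok, n in counts.items()}


def categorical_intent(query, anchor, corpora=CORPORA):
    """Pick the category whose corpus correlates best with the query terms."""
    terms = [t for t in query.lower().split()
             if t != anchor and t not in STOPWORDS]
    scores = {}
    for category, docs in corpora.items():
        table = cooccurrence_scores(docs, anchor)
        scores[category] = sum(table.get(t, 0.0) for t in terms)
    return max(scores, key=scores.get), scores


best, scores = categorical_intent("what is the most expensive chip", "chip")
print(best, scores)
```

On this toy data "expensive" co-occurs with "chip" only in the computers corpus, so that category wins. A real system would need far more data and a better similarity measure.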

The other kind of user intent is more subtle: boosting a query based on historical analysis of the user. If the user's other query is "what is the fastest ram" then we might deduce that they are seeking computer chips. If the user's other query from history is "what is the saltiest snack" then we might deduce potato chips.
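
A minimal Python sketch of that deduction, assuming a hand-made vocabulary per category and a simple vote count over past queries. The vocabularies and the name intent_from_history are hypothetical; a real system would learn these associations from its corpora.

```python
# Hypothetical category vocabularies.
CATEGORY_TERMS = {
    "computers": {"ram", "cpu", "processor", "gpu", "fastest"},
    "food": {"snack", "saltiest", "salty", "potato", "crunchy"},
}


def intent_from_history(history, category_terms=CATEGORY_TERMS):
    """Guess an intent category from past queries, or None if nothing matches."""
    votes = {category: 0 for category in category_terms}
    for past_query in history:
        tokens = set(past_query.lower().split())
        for category, terms in category_terms.items():
            votes[category] += len(tokens & terms)  # count overlapping terms
    best = max(votes, key=votes.get)
    return best if votes[best] > 0 else None


print(intent_from_history(["what is the fastest ram"]))
print(intent_from_history(["what is the saltiest snack"]))
print(intent_from_history([]))
```

Returning None when the history says nothing is deliberate: with no evidence, the query should go through untouched.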

Ah, but do we really know the user's intent? What if they've just started a keto diet and all potato chips are verboten? If we start rewriting queries based on analyzed intent, that's going to piss a lot of users off! So historical intent is more an art than a science, and the goal is usually to extend and nudge the query rather than obliterate it. How much is enough and how much is too much? Generally the intent-provided term should carry enough weight to show up in result sets but not enough to dominate them. Think of it as a leather clad dominatrix who uses a pillow rather than a whip. err. ok don't think of that. The point is, it's subtle and not overbearing; being overbearing is the mistake that's made most often. The same goes for a chat bot, only much worse - "Alexa, turn off the lights" becomes "burn down the house." "Sure can do!" replies Alexa and turns on the oven. Subtle. These are guesses after all! They take a lot of tuning and regression testing to get the right sensibility.
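
One way to nudge rather than obliterate, sketched in Python: keep the user's query intact and append the intent term as an optional clause with a small boost, using the "^" boost syntax of the standard Lucene/Solr query parser. The function name, the default boost of 0.3, and the cap of 0.5 are assumptions for illustration, and the sketch assumes the parser's default operator is OR so the extra term stays optional. The right numbers only come out of tuning and regression testing.

```python
MAX_BOOST = 0.5  # hypothetical cap so the intent term can never dominate


def nudge_query(query, intent_term, boost=0.3):
    """Append a low-weight, optional intent term to the user's own query."""
    if not intent_term:
        return query                      # no intent guess: leave query alone
    boost = min(boost, MAX_BOOST)         # clamp overly aggressive boosts
    return f"({query}) {intent_term}^{boost}"


print(nudge_query("what is the most expensive chip", "computer"))
print(nudge_query("what is the most expensive chip", None))
```

The user's original terms are never removed or rewritten, so a wrong guess costs a slightly reordered result set rather than a wrong answer.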

Products which deliver intent analysis without enterprise search integration are clearly targeting chat bots. And while chat bots have their niche use, providing User Intent is also a powerful tool for enterprise NLP search.