Tuesday, October 15, 2019

The sad state of NLP tech in the marketplace today

Update: Sinequa has caught up to us in going straight to the answer, EXCEPT we are FASTER and MORE ACCURATE. Don't believe it? Try the first 3 COVID queries on our COVID virus search at noonean.com and then try them against Sinequa's COVID search. We blow them away. One good thing about so many companies doing COVID search: we finally get the put-up-or-shut-up comparisons.
-------------------------------------

Very few products exist which use advanced NLP techniques. We've scoured the marketplace and found product after product lacking.

Here is a brief list of things they do:
 
     Phrase recognition
     Simple named entity recognition
     Part-of-speech tagging
     Lemmatization
     User Intent Analysis
     Disambiguation

While it's true that all of these are parts of NLP, none of them are ADVANCED NLP techniques. Do they help improve search? Yes. But not as much as advanced techniques do.
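
To make the distinction concrete, here's roughly what that basic chain looks like; a minimal sketch using the open-source spaCy library (not our stack, just illustration), covering lemmatization, part-of-speech tagging, and simple named entity recognition:

    # Minimal sketch of the basic NLP steps listed above, using spaCy.
    # Requires: pip install spacy && python -m spacy download en_core_web_sm
    import spacy

    nlp = spacy.load("en_core_web_sm")
    doc = nlp("Microsoft acquires Documentum.")

    for token in doc:
        # lemmatization and part-of-speech tagging
        print(token.text, token.lemma_, token.pos_)

    for ent in doc.ents:
        # simple named entity recognition
        print(ent.text, ent.label_)

Everything here is the first few links of the processing chain; nothing above the word level is being understood.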

Here is how one competitor describes their tech:

  • Semantic search: uses a variety of signals to understand the user’s intent and handle ambiguity. For example, semantic search understands that when a user searches for “profit,” she would also want to find data sets that reference “net income.” Similarly, if a user searches for “NJ,” they would also return entries for “New Jersey.”
  • Lemmatization uses variations on words such as plurals, tenses, genders, hyphenated forms, and more. For example, a search for “running” would return matches for “runs” or “ran.”
  • Advanced syntax increases precision through techniques such as phrase search, fielded search, Boolean matching, and proximity search.
  • Fuzzy matching increases recall and allows for looser matching. Substring and approximate matches allow users to find data sources when they only have partial information or even incorrect information.
None of this is Advanced NLP. And sadly, it's commonplace for companies to pitch that they have NLP tech in their search when it's all very rudimentary. You can't argue that lemmatization and simple named entity extraction are NOT NLP, because they are, but they're really just the first few steps of a processing chain with the higher-order functions lopped off and ignored. And why is that? Because these companies don't really have backgrounds in NLP research.
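
To illustrate how rudimentary this is, here's a hypothetical sketch of the kind of "semantic search" expansion described above; every name and mapping is invented, and the point is that it amounts to a lookup table, not advanced NLP:

    # Hypothetical sketch of the competitor-style expansion described above:
    # a synonym map plus lemma folding. A lookup table, not advanced NLP.
    SYNONYMS = {"profit": ["net income"], "nj": ["new jersey"]}
    LEMMAS = {"running": "run", "runs": "run", "ran": "run"}

    def expand_query(terms):
        expanded = []
        for t in terms:
            t = t.lower()
            expanded.append(LEMMAS.get(t, t))      # lemmatization
            expanded.extend(SYNONYMS.get(t, []))   # "semantic" expansion
        return expanded

    print(expand_query(["Profit", "running", "NJ"]))
    # -> ['profit', 'net income', 'run', 'nj', 'new jersey']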

It's a successful strategy that gets investment and sales, but really it's a shoddy product. So just like the AI buzz that gets everyone funded, it's quite troubling that such minimal, low-level tech is presented as state of the art. It isn't.

Shockingly, things are no better at Google, Microsoft, or Amazon regarding their NLP search offerings. Focused more on user intent recognition, sentiment analysis, and disambiguation, their products target chat bots, not enterprise search.



Sinequa just recently demoed their tech at KMWorld. First you get a list of documents, and then you have to click on each document to see its analysis. Noonean is more advanced: we show you the answers DIRECTLY, not documents.


Noonean is designed from the beginning for enterprise scale and works with a parallel index, so you can trial the technology without any risk of damaging your current enterprise. The index can reside as a SOLR core on the same machine or on a completely different machine.
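
As a rough sketch of what the parallel index means in practice (the core and field names here are hypothetical, using the pysolr client), the NLP-enriched index lives in its own core and your production core is never written to:

    # Hedged sketch of a parallel SOLR core alongside production.
    # Requires: pip install pysolr, plus a core created with:
    #   bin/solr create -c noonean_nlp
    import pysolr

    # Production core, shown only for contrast; it is never written to.
    production = pysolr.Solr("http://localhost:8983/solr/enterprise_docs")
    parallel = pysolr.Solr("http://localhost:8983/solr/noonean_nlp")

    # NLP-enriched documents go into the parallel core only.
    parallel.add([{"id": "doc-1", "subject_t": "microsoft",
                   "verb_t": "acquire", "object_t": "documentum"}],
                 commit=True)

    # Trial queries run against the parallel core.
    for result in parallel.search("object_t:documentum"):
        print(result["id"])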


Noonean.com brings more advanced NLP techniques to the market.

Companies see the value of automating customer service and think their traditional search technologies are sufficient for the enterprise. But moving to Cognitive Search and Insight engines with advanced NLP and AI learning will take things to another level and have a profound impact on their bottom lines. In reality, enterprise search is a crude first line of knowledge management, and fact bases built on NLP ontologies are the next level after that.




Tuesday, October 1, 2019

User Intent Analysis and NLP - Potato Chips or Computer Chips

One area that is not the sine qua non of NLP, but is nonetheless a core technique, is user intent analysis. It is more critical for areas like building a chat bot or an Alexa, and has some, but lesser, utility in enterprise search.

So consider the user entered query:
    what is the most expensive chip

Do they mean a computer chip or a potato chip? How can the machine know? Technically, this is a query disambiguation class of problem.

There are two types of user intent. One is categorical and the other is free intent.

For categorical intent, an NLP corpus is split into different areas. This is quite common: one person may work on the food side of the org and another on the silicon chip side. Typically there might be 10-30 categorical intents.

To solve for categorical intent, each corpus gets processed in its different area and statistical maps or 3D spatial similarity maps are generated. Typically these are generated on NLP-generated tokens, not the words themselves. In our case, "expensive" would have a stronger correlation score to computer chips than to potato chips, so that would trigger the user intent of "computers" and query transforms boosting that category would be applied.
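
A toy sketch of that routing (all correlation scores are invented for illustration; real ones come from the statistical maps built per corpus): sum each query token's association with each category and route the query to the strongest one.

    # Toy sketch of categorical intent routing; scores are invented.
    CATEGORY_SCORES = {
        "computers": {"expensive": 0.8, "chip": 0.6, "ram": 0.9},
        "food":      {"expensive": 0.2, "chip": 0.6, "salty": 0.9},
    }

    def categorical_intent(query_tokens):
        totals = {}
        for category, scores in CATEGORY_SCORES.items():
            totals[category] = sum(scores.get(t, 0.0) for t in query_tokens)
        return max(totals, key=totals.get)

    print(categorical_intent(["expensive", "chip"]))   # -> computers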

The other kind of user intent is more subtle: boosting a query based on historical analysis of the user. So if the user's other query is "what is the fastest ram," then we might deduce that he is seeking computer chips. If the user's other query from history is "what is the saltiest snack," then we might deduce potato chips.

Ah, but do we really know the user's intent? What if they've just started a keto diet and all potato chips are verboten? So if we start rewriting queries based on analyzed intent, that's going to piss a lot of users off! For historical intent, it's more an art than a science, and the goal is often to extend and nudge the query rather than obliterate it. How much is enough and how much is too much? Generally, the intent-provided term should be enough to show up in result sets but not dominate. Think of it as a leather-clad dominatrix who uses a pillow rather than a whip. Err, OK, don't think of that. The point is, it's subtle and not overbearing, which is the mistake that's made most often. The same goes for a chat bot, but much worse: "Alexa, turn off the lights" becomes "burn down the house." "Sure can do!" replies Alexa, and turns on the oven. Subtle. These are guesses, after all! They take a lot of tuning and regression testing to get the right sensibility.
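
In code, "nudge rather than obliterate" amounts to appending the intent term with a deliberately small boost. A sketch using standard Lucene/SOLR boost syntax (the 0.3 weight is illustrative; the point is that it stays well below 1.0):

    # Sketch of a gentle intent nudge via a low query-time boost.
    # The boosted term can surface results but never dominate them.
    def nudge_query(original_query, intent_term, boost=0.3):
        return '%s OR "%s"^%.1f' % (original_query, intent_term, boost)

    print(nudge_query("what is the most expensive chip", "computer"))
    # -> what is the most expensive chip OR "computer"^0.3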

Products which deliver intent analysis without enterprise search integration are clearly targeting chat bots. And while chat bots have their niche uses, user intent is also a powerful tool for enterprise NLP search.

Tuesday, September 24, 2019

bFloat16 for Neural Networks? You win some and you lose some

So you're probably wondering what's the hype with Nervana's processor and the push to adopt the bFloat16 format for AI.

One of the main issues is that the extreme performance levels needed for large-scale neural networks require throughput in the 50-200 TFLOP range, large scale being something like 8-20 billion neuron units. There are a few cards that can do such things, like Nvidia's new Titan RTX. But there's a catch: it's only at 16-bit. Go to full 32-bit precision and performance drops off.

Neural networks rarely need such large numbers. What they need more is precision. Yet bFloat16 has only 7 bits for the fraction, rather than the 10 of a standard 16-bit float. Since each bit is a DOUBLING of resolution, those 3 bits mean a big difference in precision. What gives?

Remember that most neural networks store values between 0 and 1. The need for large numbers doesn't exist.

It has to do with the speed of moving huge pipelines of 32-bit precision data in traditional memory into and out of these systems. Because bFloat16 keeps an exponent identical to the 32-bit float's, conversion from 32-bit float is much simpler: you basically just whack 16 bits off the end.
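
You can see the whole conversion in a few lines of plain Python; a sketch of pure truncation (real hardware typically adds rounding, which this ignores):

    # Sketch: float32 -> bFloat16 by truncating the low 16 bits.
    # The 8-bit exponent is identical, so no re-scaling is needed.
    import struct

    def to_bfloat16_bits(x):
        bits32 = struct.unpack(">I", struct.pack(">f", x))[0]
        return bits32 >> 16   # keep sign + exponent + top 7 fraction bits

    def from_bfloat16_bits(bits16):
        return struct.unpack(">f", struct.pack(">I", bits16 << 16))[0]

    x = 0.1
    y = from_bfloat16_bits(to_bfloat16_bits(x))
    print(x, y, abs(x - y) / x)   # the relative error is the precision lost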

For reference, the bit layouts:

    32-bit float: 1 sign bit | 8 exponent bits | 23 fraction bits
    bFloat16:     1 sign bit | 8 exponent bits |  7 fraction bits
    float16:      1 sign bit | 5 exponent bits | 10 fraction bits


What is the trade-off? Well, it's the loss of precision in small numbers. So if you are designing a re-entrant, highly convolutional neural network, precision might be much more important to you than it is for a vision system that needs huge data pipes into the system. Since neural network systems use very specialized memory (GDDR6 or HBM), which is more expensive, it's tough to get 16 or even 24 gigabytes of RAM to support the processing. That limits the number of nodes you can express.

bFloat16 is the format Google is pushing, but is it right for everyone? No, it's not. The loss of precision will probably hinder the kinds of problems you can solve with neural networks. Personally, I'd rather go with precision.

Friday, September 13, 2019

The Importance of Broad Querying and Narrow Indexing

One of the issues we came across in developing our enterprise NLP search engine was precision. Rather than merely being ranked lower, documents that do not meet the precision level simply fall out of the query match.

Let's take the case of a term being the predicate object in the query but the noun subject in the sentence being indexed. There would be no match. So how do we handle this? Well, we can create a broad query containing the term as both noun subject and predicate object. Now it will match sentences with either case.
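
A sketch of that broadening (the field names are hypothetical, in SOLR-style fielded query syntax): the term is searched in both grammatical roles, so sentences indexed with either role still match.

    # Sketch of broad querying across grammatical roles; field names hypothetical.
    def broaden_roles(term):
        return "subject_t:%s OR object_t:%s" % (term, term)

    print(broaden_roles("documentum"))
    # -> subject_t:documentum OR object_t:documentum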

Wait, didn't we lose precision? Yep, we sure did. So how does that pass muster? It has to do with two things. The first is that many other things contribute to the match, and all those elements produce partial scores that add up to a total. The second is that the end user experience of the results found will match their mental expectation: they expect queries to resolve symmetries of grammar structure, or as we call them, inversions. There can be models where prepositional clauses dangle from different parts of the sentence. Should that be a match? For enterprise search, a large part of this process is tuning and using a "gist," a "gestalt," so that things feel OK to the brain. There is no hard-and-fast rule.

Another way to approach broadening is to handle cases where you didn't find an expected document and realize it was due to a narrow definition. While you shouldn't simply broaden the case blindly, you can run a test with the broadened version, see how it affects your regression test set of queries, and then do a bit of spot checking.

Finally, the last issue is the near miss. This often happens with the verb: "Microsoft acquires Documentum" vs. "Microsoft purchased Documentum" might result in a total miss. So how do we handle this? The best technique is to run your docset through a processor which determines similarity and clustering. Then you can extend your query with additional terms if any fall within a specified distance. Again, it's a technique that can assist or blow up your query. It takes tuning and time to review many queries.
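
A toy version of that verb expansion (the embedding vectors are invented; real ones would come from the similarity/clustering pass over your docset): compute cosine similarity over verb vectors and add any verb within the cutoff to the query.

    # Toy sketch of near-miss verb expansion via cosine similarity.
    import math

    VERB_VECTORS = {
        "acquire":  [0.9, 0.1, 0.3],
        "purchase": [0.8, 0.2, 0.35],
        "eat":      [0.1, 0.9, 0.2],
    }

    def cosine(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b))
        return dot / norm

    def expand_verb(verb, cutoff=0.95):
        base = VERB_VECTORS[verb]
        return [v for v, vec in VERB_VECTORS.items() if cosine(base, vec) >= cutoff]

    print(expand_verb("acquire"))   # -> ['acquire', 'purchase'] with these toy vectors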

So if your latest search technology isn't producing the results you expected, remember that broadening the query rather than the index is one technique to bring more potential matches into your query rankings.