Default language change in Japanese
Why did the default language change when modeling Japanese text features?
Hi team, this is a question from a customer:
When modeling with Japanese text features, "language" used to be set to "english" by default. However, when I recently built a model on the same data, the setting had changed to "language=japanese". Until now the default has essentially been "language=english"; from now on, if I input Japanese text, will it automatically be set to "language=japanese"?
I was able to reproduce this behavior with my own data. The model created on July 19, 2022 had language=english, but a model I created today with the same settings had language=japanese. Was this setting updated when the default changed from "Word N-Gram" to "Char N-Gram"?
Previously, we showed "english" for every dataset, which was incorrect. Now, after the NLP Heuristics Improvements, we dynamically detect and set the dataset's language.
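To give a feel for what dynamic language detection can look like, here is a toy sketch. It is not the platform's actual heuristic: the function name `detect_language`, the `kana_threshold` parameter, and the fallback to "english" are all assumptions made for illustration. The only fact it relies on is that Hiragana and Katakana occupy the Unicode range U+3040 to U+30FF.

```python
def detect_language(texts, kana_threshold=0.05):
    """Toy heuristic (hypothetical, not the platform's logic):
    label a text column 'japanese' when a large enough share of its
    non-whitespace characters are Hiragana or Katakana."""
    total = 0
    kana = 0
    for text in texts:
        for ch in text:
            if ch.isspace():
                continue
            total += 1
            # Hiragana (U+3040-U+309F) and Katakana (U+30A0-U+30FF)
            if "\u3040" <= ch <= "\u30ff":
                kana += 1
    if total == 0:
        # Assumed fallback for empty input
        return "english"
    return "japanese" if kana / total >= kana_threshold else "english"


print(detect_language(["これは日本語のテキストです"]))
print(detect_language(["This is English text"]))
```

A real detector would need to handle many more languages and mixed-language columns; this sketch only separates kana-heavy text from everything else.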
Additionally, we found that char-grams perform better than word-grams on Japanese datasets, so we switched to char-grams for better speed and accuracy. To keep Text AI Word Cloud Insights in good shape, we also train one word-gram-based blueprint so you can inspect both char-gram and word-gram word clouds.
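For intuition on why char-grams are a natural fit here, the sketch below contrasts the two on a short Japanese sentence. It is an illustration only, not the blueprint's implementation: `word_ngrams` and `char_ngrams` are hypothetical helpers, and splitting on whitespace is an assumption chosen to show the problem (Japanese is written without spaces between words, so word-level tokenization normally requires a morphological analyzer).

```python
def word_ngrams(text, n=1):
    """Word n-grams using naive whitespace tokenization (an assumption
    for illustration; not how a Japanese-aware tokenizer works)."""
    tokens = text.split()
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def char_ngrams(text, n=2):
    """Character n-grams: every run of n consecutive characters."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


japanese = "機械学習は楽しい"
english = "machine learning is fun"

# Whitespace splitting finds four words in the English sentence...
print(word_ngrams(english))
# ...but treats the whole Japanese sentence as a single token.
print(word_ngrams(japanese))
# Character bigrams need no tokenizer and still yield usable features.
print(char_ngrams(japanese, 2))
```

Character n-grams sidestep the tokenization step entirely, which is one plausible reason they can be both faster and more robust on Japanese text.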
Let me know if you have more questions, happy to help!
Robot 2, thank you for the comment. I will tell the customer that the NLP heuristics have improved and the language is now set correctly. I was also able to confirm that the word-gram based BP model was created as you mentioned. Thanks!