Learn how our approach to text analytics combines machine learning with rules-based methods to yield more accurate results at scale without sacrificing coverage or recall.
Much as oil powered the industrial economy, data is the fuel powering the information economy. By effectively collecting and analyzing data, companies can operate more efficiently and claim significant competitive advantages.
Yet the challenges that accompany the transition to Big Data are substantial. Data creation is growing at incredible velocity: by 2025, worldwide data creation is projected to reach 163 zettabytes, roughly ten times the amount created in 2017. This speed, and the diverse nature of today's data, can quickly overwhelm even sophisticated enterprises.
Compounding this difficulty is the fact that much of this data is unstructured (not easily classifiable or searchable), and therefore more difficult to parse and analyze. Structured data found within IT systems represents only a portion of the information decision-makers need to consider.
Despite the Big Data hype cycle, we are still in the early stages of treating data as an asset. This means that today's organizations can carve out a distinct competitive advantage by adopting innovative new data technologies.
One such technology is text analytics, an essential tool for capturing and unlocking critical intelligence within large data sets.
Unstructured data is simply defined as information that is not grouped, organized, or pre-defined in some fashion. Most unstructured data is text, but it may also include images, video, audio, numbers, etc. Because of this lack of structure, high-quality, actionable information in this category is often lost or remains hidden from organizations.
Problems with unstructured data generally fit within the "Four V" framework: volume, velocity, variety, and veracity.
While the problems presented by the Four Vs are formidable, they can be solved through the use of text mining. Ronen Feldman, co-author of "The Text Mining Handbook," is credited with coining the term "text mining" in the mid-1990s.
Text mining allows diverse data sets to be processed at scale through automated interpretation and contextualization. It can identify patterns in documents much the same way a human domain expert would, only far faster and more efficiently.
This extraction of value and meaning from unstructured text is accomplished through machine learning and a technology called natural language processing (NLP). A branch of computer science, NLP allows complex and unstructured texts to be rendered into structured data sets that can be easily analyzed, processed, and visualized.
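To make this concrete, here is a minimal sketch of that transformation using the open-source spaCy library (an illustrative choice, not Amenity's engine): free text goes in, and structured rows of tokens, grammatical roles, and named entities come out.

```python
# A minimal sketch of NLP structuring with spaCy (illustrative only).
# Requires: pip install spacy && python -m spacy download en_core_web_sm
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Acme Corp. reported a 12% revenue decline in Q3.")

# Render free text into structured rows: token, part of speech, syntactic role.
for token in doc:
    print(f"{token.text:10} {token.pos_:6} {token.dep_}")

# Named entities are one common structured output.
print([(ent.text, ent.label_) for ent in doc.ents])
```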
Text Mining vs. Text Analytics
In a business context, text mining and text analytics share the same definition and are used interchangeably.
Organizations derive a number of powerful benefits from text analytics, including cost control, faster processing, smoother integration of data analysis capabilities, and more consistent outcomes.
Text analytics is especially useful for organizations that analyze or consume complex documents and texts, or that rely on this process in some form. Common examples include the following:
Analyzing SEC Filings

As anyone who has combed through an annual report can attest, SEC filings may contain enormous amounts of boilerplate language, legal jargon, hypothetical scenarios, and so forth. Analysts can use text analytics to identify potential red flags for investors, such as incomplete documents, earnings release anomalies, economic or consumer spending issues, and cybersecurity vulnerabilities.
All of this can be accomplished without having to wade through hundreds or thousands of pages.
Building Medical Knowledge Bases

Text analytics enables the creation of a medical knowledge base: a vast trove of information culled from sources such as patents, clinical trial results, forum comments, and scientific news. This base of information can provide the foundation for advances in the treatment of diseases or the tracking of epidemics.
Processing Insurance Claims

In order to guide decisions about coverage and case reserves, claims adjusters must carefully review all documents attached to a case. These case files can quickly become large and complex, potentially resulting in missed information and sub-optimal decisions.
Predictive models, which incorporate input from the best underwriters, allow insurers to process claims not only more efficiently, but also more accurately. A claims adjuster can use this technology to view claims in order of priority, or to surface key information that helps them determine whether to settle a case or pursue a legal resolution.
Analyzing Earnings Calls

Earnings calls offer investors a critically important window into the performance of a company. A text analytics model, engineered with sophisticated NLP capabilities, can analyze transcripts of these calls and generate valuable insight into the future market performance of portfolio or target firms.
The most advanced text analytics can detect uncertainty, evasion, dishonesty or doubt in a CEO's language, allowing investors and analysts to identify potential problems before they materialize publicly.
When performing text mining, a variety of information extraction and analysis techniques can be employed. Each model has specific advantages and disadvantages that should be considered.
Search

Though this method is popular (given its ease and minimal cost), search is not a true form of text mining. Search-based research is information retrieval rather than information extraction. In other words, searching surfaces documents that match a query but cannot extract the data within those documents.
This means that anyone using the search method must contend with long lists of documents rather than an aggregation of all results. As a purely manual process, it offers none of the efficiency, accuracy or efficacy of true text mining.
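A toy example makes the distinction plain (the documents here are invented for illustration): retrieval returns pointers to documents, while extraction would pull the facts out of them.

```python
# Toy corpus; document texts are invented for illustration.
docs = {
    1: "Revenue grew 8% year over year.",
    2: "The company disclosed a material weakness in internal controls.",
}

def search(query: str) -> list[int]:
    """Information retrieval: returns IDs of matching documents, nothing more."""
    return [doc_id for doc_id, text in docs.items() if query.lower() in text.lower()]

# The searcher still has to read every hit; no data is extracted.
print(search("material weakness"))  # -> [2]
```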
Traditional Rules-Based Systems

These systems classify unstructured text according to custom linguistic rules, enabling a high degree of precision.
However, traditional rule-based systems feature shallow, word-based pattern matching. Because these rules are based on a mere portion of each sentence, they struggle to provide a high level of accuracy and effectiveness.
Additionally, because these systems necessitate extensive rule writing, recall (or coverage) is another persistent problem.
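The sketch below illustrates the weakness with a deliberately shallow, regex-style rule (an invented example, not any production rule set): it is precise on the exact pattern it encodes, but blind to paraphrase.

```python
import re

# A shallow word-pattern rule: it keys on a fixed word sequence rather
# than sentence structure, so paraphrases slip past it (poor recall).
RULE = re.compile(r"\b(decline|drop|fall)\w*\s+in\s+revenue\b", re.IGNORECASE)

print(bool(RULE.search("The firm saw a decline in revenue.")))      # True
print(bool(RULE.search("Revenue declined sharply this quarter.")))  # False: same meaning, missed
```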
Bag of Words

The approach used by most commercial platforms, Bag of Words is also the simplest of all NLP language models. Machine learning is deployed to count words (pre-designated as negative or positive), and a score is assigned based on that count.
Like other word-based models, this approach is limited by its shallowness. Additionally, designating words as positive or negative introduces a measure of subjectivity into the process. Rich, semantic contextual details are lost; word order, grammar and other elements are not taken into consideration.
If words are used sarcastically or hypothetically—or if a preceding word changes the definition of an analyzed word—the model will not accurately reflect these circumstances. All of this context is unavailable within this model; the nuanced relationship between words, sentences and paragraphs is absent.
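A minimal sketch makes the limitation visible; the word lists here are invented for illustration, not any vendor's lexicon.

```python
# A bare-bones Bag of Words scorer with a hand-picked lexicon.
POSITIVE = {"growth", "strong", "beat", "improved"}
NEGATIVE = {"decline", "weak", "miss", "impairment"}

def bow_score(text: str) -> int:
    """Score = positive word count minus negative word count."""
    words = text.lower().split()
    return sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)

print(bow_score("strong growth despite a weak quarter"))  # 1
# Context is invisible to the model: negation changes nothing.
print(bow_score("we do not expect strong growth"))        # 2, despite the negation
```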
Statistical Modeling

Featuring a higher degree of accuracy than the Bag of Words approach, statistical modeling takes a predictive, algorithmic approach to in-text patterns. Such algorithmic models can process and "understand" language without explicit programming, and no manual rule coding is necessary.
In order to be functionally accurate, however, such models require vast amounts of training data. Statistical models also struggle to understand context and interpret texts in a human-like fashion.
Additionally, platforms incorporating this model generally cannot be customized. Products are typically one-size-fits-all, regardless of industry or the specific needs of an organization.
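For illustration, here is what a bare-bones statistical text classifier looks like, sketched with scikit-learn (an assumed library choice). Note how a real system would need orders of magnitude more labeled examples than the handful shown.

```python
# A minimal statistical text classifier: TF-IDF features plus
# logistic regression. Training data is invented for illustration.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

train_texts = [
    "revenue and margins improved this quarter",
    "we raised full-year guidance on strong demand",
    "sales declined and we recorded an impairment",
    "guidance was cut amid weakening demand",
]
train_labels = ["positive", "positive", "negative", "negative"]

model = make_pipeline(TfidfVectorizer(), LogisticRegression())
model.fit(train_texts, train_labels)

print(model.predict(["demand weakened and sales fell"]))  # likely ['negative']
```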
Deep Learning

Also known as text classification, the deep learning NLP model analyzes sample texts that have been assigned specific categories. As such, building an accurate and comprehensive system requires massive data sets. Additionally, each new task may require distinct (and equally vast) training data.
Overall, the amount of training data required may be much higher than what is traditionally seen with machine-learning algorithms, as it is often difficult to create a sufficient diversity of training examples within a short time frame.
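As a sketch of the paradigm (using PyTorch, an assumed choice, with a deliberately tiny vocabulary and dataset), a neural text classifier learns category assignments directly from labeled examples:

```python
# A toy deep-learning text classifier in PyTorch. Real systems need far
# larger labeled datasets; this only shows the shape of the approach.
import torch
import torch.nn as nn

VOCAB = {"growth": 0, "strong": 1, "decline": 2, "weak": 3}

def vectorize(text: str) -> torch.Tensor:
    """Count known words into a fixed-length feature vector."""
    vec = torch.zeros(len(VOCAB))
    for w in text.lower().split():
        if w in VOCAB:
            vec[VOCAB[w]] += 1.0
    return vec

texts = ["strong growth", "growth strong growth", "weak decline", "decline weak"]
labels = torch.tensor([0, 0, 1, 1])  # 0 = positive, 1 = negative
X = torch.stack([vectorize(t) for t in texts])

model = nn.Sequential(nn.Linear(len(VOCAB), 8), nn.ReLU(), nn.Linear(8, 2))
optimizer = torch.optim.Adam(model.parameters(), lr=0.05)
loss_fn = nn.CrossEntropyLoss()

for _ in range(200):  # tiny training loop
    optimizer.zero_grad()
    loss = loss_fn(model(X), labels)
    loss.backward()
    optimizer.step()

print(model(vectorize("strong growth")).argmax().item())  # likely 0 (positive)
```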
The complexity of text mining makes it difficult to perform with genuine speed and accuracy. Traditional approaches, whether custom (rules-based) or automated (Bag of Words, statistical modeling, deep learning), are insufficient for the task.
To solve this problem, Amenity Analytics offers an innovative new model, combining machine learning with a rules-based approach.
While rule composition serves as the foundation for Amenity's NLP technology, our solution profoundly improves upon traditional rules-based models by employing full parsing. This reduces the number of rules needed.
Amenity also solves the coverage/recall problem by using full sentences rather than word-based patterns, thus providing full context. The process of defining rules is also made much easier. Amenity writes the rules for users, who simply guide the system into relevant sections of a document before identifying anything that needs to be extracted—a perfect collaboration between human knowledge and algorithmic power.
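The following sketch conveys the general idea of writing rules over full parses rather than word patterns; it uses spaCy's DependencyMatcher as a stand-in and is not Amenity's proprietary rule engine. The rule fires only when "revenue" is the grammatical subject of a decline verb, so a sentence that merely contains both words does not trigger it.

```python
# Rules over full parses, sketched with spaCy's DependencyMatcher
# (illustrative only; not Amenity's actual technology).
# Requires: pip install spacy && python -m spacy download en_core_web_sm
import spacy
from spacy.matcher import DependencyMatcher

nlp = spacy.load("en_core_web_sm")
matcher = DependencyMatcher(nlp.vocab)

# Rule: a decline-type verb whose grammatical subject is "revenue".
pattern = [
    {"RIGHT_ID": "verb",
     "RIGHT_ATTRS": {"LEMMA": {"IN": ["decline", "fall", "drop"]}}},
    {"LEFT_ID": "verb", "REL_OP": ">", "RIGHT_ID": "subject",
     "RIGHT_ATTRS": {"DEP": "nsubj", "LEMMA": "revenue"}},
]
matcher.add("REVENUE_DECLINE", [pattern])

for text in ["Revenue declined sharply in the third quarter.",
             "The decline in costs boosted revenue this quarter."]:
    doc = nlp(text)
    # Only the first sentence matches: structure, not word co-occurrence.
    print(text, "->", bool(matcher(doc)))
```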
As artificial intelligence grows more sophisticated and powerful, it also becomes more difficult to explain in human terms. This development has led to a new trend called "explainable AI".
Amenity fits squarely within this movement. The system is designed to offer a clear and transparent explanation for every extraction. This is possible because every extraction is traceable to the precise set of rules that produced it.
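In practice, traceability can be as simple as carrying rule provenance on every extraction record; the schema below is hypothetical, invented for illustration rather than taken from Amenity's product.

```python
from dataclasses import dataclass

# A hypothetical extraction record illustrating rule-level traceability;
# field names and values are invented, not Amenity's actual schema.
@dataclass
class Extraction:
    text: str          # the extracted span
    label: str         # assigned classification
    rule_id: str       # the rule responsible, enabling explanation
    sentence: str      # full-sentence context

e = Extraction(
    text="revenue declined 12%",
    label="Negative: Revenue",
    rule_id="REVENUE_DECLINE_v3",
    sentence="In Q3, revenue declined 12% on weaker demand.",
)
print(f"'{e.text}' tagged {e.label} by rule {e.rule_id}")
```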
The result is simple yet effective: Amenity parses data with high accuracy while adding context and metadata to each extraction, increasing precision and recall in the process. All of this occurs at exceptionally fast speeds (up to 150 MB per minute), making Amenity fast, deeply accurate, and computationally affordable.
This communication does not represent investment advice. Transcript text provided by S&P Global Market Intelligence.
Copyright ©2019 Amenity Analytics.