Imagine that you were tasked with indexing (categorizing and summarizing) the contents of a large collection of randomly formatted text documents to make them easily accessible throughout your company. Perhaps the collection contains legal documents as well as medical records, with various levels of sensitivity. Suppose that it holds over 100 million documents for you to analyze. Besides categorizing each document as medical or legal, you must also assign each one to a subcategory such as cardiology or probate. Finally, you must capture the meaning of each document for reporting and analytics.
Besides setting aside the rest of your life to do this job manually, does this problem have a solution?
Natural Language Processing [NLP]
Natural Language Processing (NLP) tools and techniques could potentially provide a workable solution to the problem at hand. NLP is an area of Computer Science that aims to programmatically understand and produce human language in both written and spoken form.
Someday NLP will consider the number of languages used on the face of the planet and the dialects and colloquialisms that go along with them. For the purpose of this blog, let’s stick to the language in which you are currently reading, US English.
For that pile of unstructured documents, NLP in its basic form can recognize human language, using knowledge of sentence structure and dictionaries to identify nouns, verbs, adverbs, and clauses. But this is not enough to extract the context and meaning you need to summarize and categorize the documents.
If you had dictionaries for medical and legal terms, there still may be overlapping content between the two. For example, the word “Contract” is likely to show up in a medical document of a patient as well as a legal document describing a business arrangement. So what then? Would you know how to classify the documents without additional information?
A natural progression within the area of NLP is to discern the actual meaning of a word or group of words based on its context, without having a human intervene. To remove the ambiguity associated with the word ‘contract’ in two documents, or even its use within a single document, we need to know the context in which it appears. This includes things like how close certain other words are to the word ‘contract’, how many times it appears in a document, and the distances between words. These and other properties build up the context of a document, allowing for its proper disambiguation.
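As a rough illustration of these context properties, the sketch below counts how often a target word appears and which words fall within a few tokens of it, then compares those neighbors against two small cue lists. The cue words, the window size, and the function names are assumptions made up for this example, not part of any particular NLP library.

```python
import re
from collections import Counter

# Hypothetical cue words; a real system would draw these from curated dictionaries.
MEDICAL_CUES = {"muscle", "patient", "heart", "infection"}
LEGAL_CUES = {"party", "agreement", "breach", "clause"}

def context_features(text, target="contract", window=5):
    """Return how often `target` occurs and which words appear within `window` tokens of it."""
    tokens = re.findall(r"[a-z']+", text.lower())
    positions = [i for i, tok in enumerate(tokens) if tok == target]
    neighbors = Counter()
    for pos in positions:
        start = max(0, pos - window)
        end = min(len(tokens), pos + window + 1)
        for i in range(start, end):
            if i != pos:
                neighbors[tokens[i]] += 1
    return {"count": len(positions), "neighbors": neighbors}

def guess_sense(text, target="contract"):
    """Guess whether `target` is used in a medical or legal sense from nearby cue words."""
    feats = context_features(text, target)
    medical = sum(feats["neighbors"][w] for w in MEDICAL_CUES)
    legal = sum(feats["neighbors"][w] for w in LEGAL_CUES)
    if medical == legal:
        return "unknown"
    return "medical" if medical > legal else "legal"

print(guess_sense("The heart muscle failed to contract normally."))
print(guess_sense("Either party may end the contract after a breach."))
```

A production system would weigh many more signals than proximity counts, but the same idea applies: the words surrounding ‘contract’ carry the information needed to tell one sense from the other.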
Taxonomies and Ontologies
To sort out the context of words, phrases, and sentences in a document, bodies of knowledge called taxonomies and ontologies are employed. Taxonomy is the science of classification. A particular taxonomy is a set of rules for classifying things, whether they are viruses, animals, or words in a document. Ontology is a branch of philosophy that studies the nature of being, or what things mean. For practical purposes, an ontology can be broken down into three components: a taxonomy, a set of rules, and use cases.
See this entry in the Encyclopedia of Databases for a more complete discussion of ontologies as used in Information Technology.
By applying a taxonomy and ontology constructed for legal documents we should be able to classify the documents according to the specific area of law practice in which they belong and then to extract the meaning of the document into a summarized form suitable for inserting into a database.
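To make the classification step concrete, here is a minimal sketch that treats a taxonomy as a two-level lookup (category, then subcategory) and picks the subcategory whose terms match a document most often. The term lists and the `classify` function are hypothetical, and the sketch covers only the taxonomy part of an ontology; the rules and use cases needed to extract meaning are left out.

```python
import re

# Hypothetical two-level taxonomy: category -> subcategory -> indicative terms.
TAXONOMY = {
    "medical": {
        "cardiology": {"heart", "cardiac", "arrhythmia", "stent"},
        "oncology": {"tumor", "chemotherapy", "biopsy"},
    },
    "legal": {
        "probate": {"estate", "executor", "will", "heir"},
        "contracts": {"agreement", "breach", "party", "clause"},
    },
}

def classify(text):
    """Return the (category, subcategory) whose terms match the text most often."""
    tokens = re.findall(r"[a-z]+", text.lower())
    best_score, best_category, best_subcategory = 0, None, None
    for category, subcategories in TAXONOMY.items():
        for subcategory, terms in subcategories.items():
            score = sum(1 for tok in tokens if tok in terms)
            if score > best_score:
                best_score, best_category, best_subcategory = score, category, subcategory
    return best_category, best_subcategory

print(classify("The executor distributed the estate to each heir."))
print(classify("A cardiac stent was placed after the arrhythmia."))
```

A document that matches no term comes back as `(None, None)`, which is a useful signal that the taxonomy needs to be extended or that the document belongs to neither domain.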
According to IBM, we create 2.5 quintillion bytes of new data every day, which is roughly 2.5 billion gigabytes. This blog is approximately 30,000 bytes. Therefore, we create the equivalent of 83 trillion unique blogs daily across all languages and dialects. Given the ubiquity of text-based data and the sheer amount that we generate, you can appreciate the challenge of this climb.
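The arithmetic behind these figures can be checked in a few lines. The variable names are just labels for this example, and the gigabyte count depends on whether a gigabyte is taken as 10^9 or 2^30 bytes.

```python
BYTES_PER_DAY = 2.5e18  # 2.5 quintillion bytes, the figure quoted above
BLOG_BYTES = 30_000     # approximate size of this blog

decimal_gigabytes = BYTES_PER_DAY / 1e9     # gigabyte counted as 10^9 bytes
binary_gigabytes = BYTES_PER_DAY / 2**30    # gigabyte counted as 2^30 bytes
blogs_per_day = BYTES_PER_DAY / BLOG_BYTES

print(f"{decimal_gigabytes / 1e9:.2f} billion GB (decimal)")
print(f"{binary_gigabytes / 1e9:.2f} billion GB (binary)")
print(f"{blogs_per_day / 1e12:.0f} trillion blog-sized documents")
```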
Having 100 million text documents to index can certainly be called ‘big data’. In reality, this may be considered the lower end of ‘big’ (see “Is the mountain …” side bar). Relatively speaking, our problem looks less insurmountable; however, a Big Data solution will still be needed.
We know we have an incredible number of unstructured text files to process and that they are stored in various places throughout the corporate enterprise. The first step is to collect them all in a commonly accessible location; in this case, we advise using the Hadoop Distributed File System (HDFS). HDFS distributes the files across many processing and storage nodes, providing the processing capability needed for the second step.
The second step applies the logic dictated by the taxonomy and ontology to each individual file to build the index. Fortunately, with the HDFS storage solution, we can leverage the many processing nodes and their proximity to the data to generate this index in a reasonable period of time, with the ability to reduce processing time further by adding nodes.
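The shape of this second step can be sketched on a single machine as a map step that classifies each file independently, followed by a reduce step that groups the results into an index. This is only a local stand-in under stated assumptions: the keyword lists, the function names, and the in-memory `documents` dictionary are hypothetical, and the thread pool merely plays the role that cluster nodes would play in a real Hadoop deployment.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor

# Hypothetical keyword lists standing in for a real taxonomy and ontology.
CATEGORY_TERMS = {
    "medical": {"patient", "diagnosis", "cardiac"},
    "legal": {"estate", "plaintiff", "agreement"},
}

def map_document(item):
    """Map step: assign one document to its best-matching category."""
    doc_id, text = item
    words = set(text.lower().split())
    scores = {cat: len(words & terms) for cat, terms in CATEGORY_TERMS.items()}
    best = max(scores, key=scores.get)
    return (best if scores[best] > 0 else "unclassified", doc_id)

def build_index(documents, workers=4):
    """Reduce step: group document ids by the category the map step assigned."""
    index = defaultdict(list)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        for category, doc_id in pool.map(map_document, documents.items()):
            index[category].append(doc_id)
    return dict(index)

documents = {
    "d1": "patient diagnosis pending",
    "d2": "the plaintiff signed the agreement",
    "d3": "lunch menu",
}
print(build_index(documents))
```

Because each document is classified independently of every other, the map step can be spread across as many workers as are available, which is why adding nodes shortens the overall processing time.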
Alternatives to the above solution framework could be built using Spark, Storm, or any other distributed computation framework depending on the nature of the actual problem, with HDFS ultimately providing the storage of the files.
The Next Frontier
I am certain that companies like Google and Forest Rim will continue to pursue this problem on a global scale and perhaps they will find a solution for climbing the proverbial mountain faster than the mountain is growing. The good news for corporations is that for practical use cases (i.e., indexing volumes of corporate data), Big Data solutions are available today and companies like Bridgera are here to help.
About the Author: William Tribbey (Data Scientist) is a Mathematician with a Ph.D. in Computer Science. He has over 20 years’ experience in the areas of software development, business intelligence, and analytics and has developed online courses in statistics, machine learning, and Big Data architectures for data warehousing. Will provides technology leadership for Bridgera within our Big Data and IoT innovation labs and leads predictive analytics projects for our clients.