What Does This Text Really Mean?
By Joydeep Misra, June 21, 2016

Imagine that you were tasked with indexing (categorizing and summarizing) the contents of a large collection of randomly formatted text documents to make them easily accessible throughout your company. Perhaps this use case contains legal documents as well as medical records (with various levels of sensitivity). Suppose that the size of this collection contained over 100 million documents that you had to analyze. Besides categorizing each document as medical or legal, you must also assign them to subcategories like cardiology or probate. Finally, you must capture the meaning of each document for reporting and analytics.

Besides setting aside the rest of your life to do this job manually, does this problem have a solution?

Natural Language Processing [NLP]

Natural Language Processing (NLP) tools and techniques could potentially provide a workable solution to the problem at hand. NLP is an area of Computer Science whose aim is to programmatically understand and produce human language in both written and spoken form.

A complete treatment of NLP would have to account for every language spoken on the planet, along with the dialects and colloquialisms that go with them. For the purposes of this blog, let's stick to the language in which you are currently reading: US English.

In its basic form, NLP has had success recognizing human language by using knowledge of sentence structure and dictionaries to identify nouns, verbs, adverbs, and clauses. But for that pile of unstructured documents, this is not enough to extract the context and meaning needed to summarize and categorize them.
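To make that basic capability concrete, here is a minimal sketch of part-of-speech tagging using the open-source NLTK library. The sample sentence is illustrative, and the resource names passed to nltk.download can vary slightly between NLTK versions.

# Minimal sketch of basic NLP: tokenize a sentence and tag each word
# with its part of speech, using the open-source NLTK library.
import nltk

# One-time model downloads (resource names may differ by NLTK version).
nltk.download("punkt", quiet=True)
nltk.download("averaged_perceptron_tagger", quiet=True)

sentence = "The patient signed a contract before the cardiology consult."
tokens = nltk.word_tokenize(sentence)
tagged = nltk.pos_tag(tokens)  # e.g. [('The', 'DT'), ('patient', 'NN'), ...]
print(tagged)

A tagger like this tells us which words are nouns and which are verbs, but it says nothing about whether the document itself is medical or legal.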

Even if you had dictionaries of medical and legal terms, there may still be overlapping content between the two. For example, the word "contract" is likely to show up in a patient's medical record as well as in a legal document describing a business arrangement. So what then? Would you know how to classify the documents without additional information?

A natural progression within NLP is to discern the actual meaning of a word or group of words from its context, without human intervention. To remove the ambiguity associated with the word 'contract' across two documents, or even within a single document, we need to know the context in which it appears: which other words occur near 'contract', how many times it appears in the document, and the distances between words. These and other properties build up the context of a document, allowing for its proper disambiguation.
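As an illustrative sketch (the sample text, window size, and feature choices below are my own assumptions, not a prescribed method), such context features can be computed in a few lines of Python:

# Illustrative sketch: simple context features for the ambiguous word
# "contract" -- how often it appears and which words occur near it.
from collections import Counter

document = ("The patient presented with chest pain. The heart muscle may "
            "contract irregularly. The cardiology team reviewed the record.")

words = [w.strip(".,").lower() for w in document.split()]
target = "contract"
window = 3  # how many words on each side count as "nearby"

frequency = words.count(target)
neighbors = Counter()
for i, w in enumerate(words):
    if w == target:
        lo, hi = max(0, i - window), i + window + 1
        neighbors.update(words[lo:i] + words[i + 1:hi])

print("frequency:", frequency)
print("nearby words:", neighbors.most_common(5))

Here the nearby words ("heart", "muscle", "irregularly") point toward the medical sense of "contract" rather than the legal one.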

Taxonomies and Ontologies

To sort out the context of words, phrases, and sentences in a document, bodies of knowledge called taxonomies and ontologies are employed. Taxonomy is the science of classification; a particular taxonomy is a set of rules for classifying things, whether viruses, animals, or words in a document. Ontology is a branch of philosophy that studies the nature of being, or what things mean. For our purposes, an ontology can be broken down into three components: a taxonomy, a set of rules, and use cases.

See the Encyclopedia of Databases for a more complete discussion of ontologies as used in Information Technology.

By applying a taxonomy and ontology constructed for legal documents, we should be able to classify each document according to the specific area of law practice to which it belongs, and then extract its meaning into a summarized form suitable for inserting into a database.
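The sketch below shows the flavor of that classification step, with tiny hypothetical term lists standing in for a real taxonomy. A genuine ontology would also encode relationships and rules, not just flat term sets.

# Toy sketch: a "taxonomy" of category -> term sets, used to score and
# classify a document. The categories and terms here are placeholders.
TAXONOMY = {
    "legal/contract-law": {"contract", "party", "breach", "consideration"},
    "legal/probate":      {"estate", "will", "executor", "probate"},
    "medical/cardiology": {"cardiac", "contract", "ventricle", "arrhythmia"},
}

def classify(text: str) -> str:
    tokens = {w.strip(".,;").lower() for w in text.split()}
    scores = {cat: len(tokens & terms) for cat, terms in TAXONOMY.items()}
    return max(scores, key=scores.get)

print(classify("The executor filed the will for probate of the estate."))
# -> legal/probate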

“Is the mountain growing faster than we can climb?”
Bill Inmon, who is known as the "father of data warehousing", and his company, Forest Rim Technology, are leaders in the area of expanding NLP through ontologies. They have trademarked the term "Textual Disambiguation" to describe the problem and their concepts. Building taxonomies and ontologies for general textual disambiguation (i.e., discerning the actual meaning of all text in all documents, or in a completely random set of documents) is a difficult problem. After 12+ years of work invested by Inmon and his team, the problem of ingesting astronomically large volumes of unstructured text-based data and transforming it into usable and actionable information does not have a generally applicable solution. The mountain has only been partially climbed, and the summit may be rising further out of reach as the volume of data grows.

According to IBM, we create 2.5 quintillion bytes of new data every day, which is slightly more than 2 billion gigabytes. This blog is approximately 30,000 bytes, so we create the equivalent of roughly 83 trillion unique blogs daily across all languages and dialects. Given the ubiquity of text-based data and the sheer amount that we generate, you can appreciate the challenge of this climb.
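Working the arithmetic out explicitly, using the figures quoted above:

# Back-of-the-envelope check of the figures quoted above.
bytes_per_day = 2.5e18                       # 2.5 quintillion bytes (IBM figure)
gigabytes_per_day = bytes_per_day / 2**30    # roughly 2.3 billion gigabytes
blogs_per_day = bytes_per_day / 30_000       # roughly 83 trillion blog-sized documents
print(f"{gigabytes_per_day:.2e} GB/day, {blogs_per_day:.2e} blogs/day")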

Big Data

Having 100 million text documents to index can certainly be called 'big data'. In reality, this may sit at the lower end of 'big' (see the "Is the mountain …" sidebar). Relatively speaking, the problem does not look insurmountable; however, a Big Data solution will still be needed.

We know we have an incredible number of unstructured text files to process and that they are stored in various places throughout the corporate enterprise. The first step is to collect them all in a commonly accessible location; in this case, we advise using the Hadoop Distributed File System (HDFS). HDFS distributes the files across many processing and storage nodes, providing the processing capability needed for the second step.

The second step applies the logic dictated by the taxonomy and ontology to each individual file to build the index. Fortunately, with the HDFS storage solution, we can leverage the many processing nodes and the proximity of the data to generate this index in a reasonable period of time, with the option of reducing processing time further by adding nodes.

Alternatives to the above solution framework could be built using Spark, Storm, or any other distributed computation framework depending on the nature of the actual problem, with HDFS ultimately providing the storage of the files.
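As a rough sketch of this framework (the HDFS paths and the trivial classify function are placeholders, not a working deployment), a Spark job reading the collected documents from HDFS and writing out an index might look like this:

# Rough sketch: building the index with Spark over documents stored in HDFS.
from pyspark.sql import SparkSession

def classify(text: str) -> str:
    # Placeholder for the taxonomy/ontology logic sketched earlier.
    return "legal" if "contract" in text.lower() else "medical"

spark = SparkSession.builder.appName("document-indexer").getOrCreate()

# wholeTextFiles yields (file path, file contents) pairs, read in parallel
# from the nodes where HDFS has placed the data blocks.
docs = spark.sparkContext.wholeTextFiles("hdfs:///corpus/unstructured/*")

# Apply the classification logic to every document and persist the index.
index = docs.map(lambda kv: (kv[0], classify(kv[1])))
index.saveAsTextFile("hdfs:///corpus/index")

Because the computation runs where the data lives, adding nodes shortens the indexing time roughly in proportion to the cluster size.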

The Next Frontier

I am certain that companies like Google and Forest Rim will continue to pursue this problem on a global scale and perhaps they will find a solution for climbing the proverbial mountain faster than the mountain is growing. The good news for corporations is that for practical use cases (i.e., indexing volumes of corporate data), Big Data solutions are available today and companies like Bridgera are here to help.

About the Author: William Tribbey (Data Scientist) is a Mathematician with a Ph.D. in Computer Science. He has over 20 years’ experience in the areas of software development, business intelligence, and analytics and has developed online courses in statistics, machine learning, and Big Data architectures for data warehousing. Will provides technology leadership for Bridgera within our Big Data and IoT innovation labs and leads predictive analytics projects for our clients.
