Data Investigation and Collection

EDIS will organize interviews with key stakeholders in the operator’s organization to fully understand record descriptions, search criteria, and the physical and virtual locations of data and records of interest. These interviews capture knowledge held by long-time employees and client subject matter experts, providing a history of the types, locations, and nomenclature associated with legacy records. This input feeds the collection process, in which collection teams are fielded to defensibly query and gather both physical and electronic records. Collected records and data are then processed in our facilities to populate the project database.

Metadata Analytics and Gap Analysis

EDIS performs “Records Viability Analytics” on current metadata to assess the availability of critical record types. Although such metadata is generally sparse and incomplete, the hits it produces point to the “low-hanging fruit”: records that appear to be readily available. This metadata typically includes digital and physical file inventory descriptions and searchable index fields in content management systems. Analyzing the metadata creates a profile of the available records, giving the project team the information it needs to plan intelligent processing.
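As a minimal sketch of this kind of metadata profiling (the field and record-type names here are hypothetical, not from an actual client system), a completeness count per index field quickly shows which record types are readily available:

```python
from collections import Counter

def profile_metadata(records, fields):
    """Count how often each index field is actually populated.

    records: list of dicts, e.g. exported from a content management system.
    fields:  the searchable index fields of interest (hypothetical names).
    """
    filled = Counter()
    for rec in records:
        for f in fields:
            if rec.get(f) not in (None, ""):
                filled[f] += 1
    total = len(records)
    # Completeness ratio per field highlights the "low-hanging fruit".
    return {f: filled[f] / total for f in fields}

# Example: sparse legacy index entries
records = [
    {"doc_type": "U1", "vessel_id": "V-101", "date": "1987-03-02"},
    {"doc_type": "MTR", "vessel_id": "", "date": "1991-07-15"},
    {"doc_type": "U1", "vessel_id": "V-102", "date": ""},
]
print(profile_metadata(records, ["doc_type", "vessel_id", "date"]))
```

A profile like this, run over the whole inventory, is what supports the processing decisions described above.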

We use the results of the analytics, together with customer interviews, to produce a detailed breakdown of the available data. Attributes are broken down per valve section, component, and/or vessel to define present and missing records, along with the data discrepancies that must be resolved to fulfill baseline requirements for engineering analysis. We provide our clients with reports summarizing the findings, so that missing data and related records can be investigated and back-filled.
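The gap analysis itself can be sketched as a comparison of collected attributes against a baseline requirement set. The baseline attributes and component identifiers below are hypothetical examples, not an actual client specification:

```python
# Hypothetical baseline: attributes required per component/vessel
# before engineering analysis can proceed.
BASELINE = {"design_pressure", "wall_thickness", "material_spec", "install_date"}

def gap_report(components):
    """Return the missing baseline attributes for each component or vessel.

    components: mapping of component id -> dict of attributes collected so far.
    """
    report = {}
    for cid, attrs in components.items():
        present = {k for k, v in attrs.items() if v not in (None, "")}
        missing = sorted(BASELINE - present)
        if missing:
            report[cid] = missing
    return report

collected = {
    "V-101": {"design_pressure": 275, "wall_thickness": 0.375,
              "material_spec": "SA-516-70", "install_date": "1987-03-02"},
    "V-102": {"design_pressure": 150, "wall_thickness": None,
              "material_spec": "SA-106-B", "install_date": ""},
}
print(gap_report(collected))  # only V-102 has gaps to back-fill
```

The resulting report tells the client exactly which attributes must be investigated and back-filled per component.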

Data Migration/Database Expertise

Data migration projects entail porting data from legacy applications and supporting connections to other systems that share a reliance on that data. EDIS has the data expertise to assist in these projects, with seasoned staff who have delivered multiple complex integrity management projects. Common tasks we support related to data migration include:
  • Creating data maps for collecting the target data needed to feed new systems and applications
  • Assessing "as-built" legacy systems to determine data structures
  • Performing analytics to determine the utility of legacy data
  • Performing data cleansing and normalization
  • Extracting data and creating load files for new systems
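A data map of the kind described above can be sketched as a simple mapping from legacy column names to target fields, each with a normalization step. The column names and transforms here are hypothetical illustrations:

```python
# Hypothetical data map: legacy column name -> (target field, transform)
DATA_MAP = {
    "NOM_DIA":  ("nominal_diameter_in", float),
    "MTL_SPEC": ("material_spec", str.strip),
    "INST_DT":  ("install_date", lambda s: s.replace("/", "-")),
}

def migrate_row(legacy_row):
    """Apply the data map to one legacy record, producing a cleansed,
    normalized row ready for a new-system load file."""
    out = {}
    for src, (dst, transform) in DATA_MAP.items():
        raw = legacy_row.get(src)
        # Empty legacy values become explicit nulls rather than bad data.
        out[dst] = transform(raw) if raw not in (None, "") else None
    return out

print(migrate_row({"NOM_DIA": "8.625", "MTL_SPEC": " X52 ", "INST_DT": "1987/03/02"}))
```

Running every legacy row through such a map yields the cleansed load file for the new system.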
In addition to the staff skills needed to support data migration, EDIS has licensed technology from a global technology company that uses software robots, designed on the desktop, to automate data look-up, normalization, and reconciliation tasks. The EDIS team is fully trained in robot design and deployment. Software robots offer several advantages:
  • They are particularly good at scraping data from websites and auto-populating other sites or databases
  • They eliminate the need to write code for custom APIs
  • They can be used to keep data synchronized between applications
Using several tools and systems, EDIS collaborates with your team to specify a framework for a custom database and presentation layer that meets the client’s needs, which can include a secure web UI for client access to data during the project. Our team is versed in SQL, C++, Java, and other programming languages and can rapidly create data mashups. We prepare custom load files to auto-populate distinct fields, per client specifications, into RBI and PMS applications, yielding significant savings in time and labor over manual, engineering-intensive methods.

Technology Assisted Classification of Records

Records come to us in various forms:
  • Physical records in boxes containing many different document types (usually unknown at the start) and of various vintage
  • Electronic records of a single record type (e.g., alignment sheets, U1s)
  • Electronic records from file servers, with many file and document types (again, usually unknown at the start)
  • Project files which inherently contain multiple document types but which are found in a single large file such as a PDF
Many of the most critical documents are previously imaged (scanned) records of tests, inspections, materials design, project files, and as-built and completion reports. These PDF or TIFF files may have poor text quality and can be difficult to process with conventional OCR tools. Our workflow is tailored to each situation and includes tasks such as:
  • Inventorying content and generating hash values for chain of custody
  • Removing blacklisted file types
  • De-duplication of the files
  • Where there are multiple, unknown document types, using Technology Assisted Classification (TAC) to classify document types
  • Attributing (deep indexing) of specific document types (such as U1s) using Technology Assisted Attribution (TAA)
  • Performing quality control to a 99% accuracy level
  • Exporting results to a load file for ingestion into the target RBI or PMS software application

Technology Assisted Attribution of Records

Certain document types contain a wealth of data that must be deeply indexed at a very high level of accuracy; one example is the “U1, Manufacturer’s Data Report,” an extremely common, data-rich document. Our tools include software that can be configured to auto-extract the necessary data (attributes), which is then subjected to a three-level QC process to ensure data integrity. Based on customer requirements, we apply business and low-level engineering rules after QC to transform and normalize the data into the desired state for ingestion into the target RBI or PMS applications. EDIS has developed and licensed multiple attribution tools from leading software technology companies after extensive due diligence and testing to confirm their performance and their ability to support a robust quality assurance process; after thorough evaluation and price negotiation, we selected those that exhibited the best combination of cost and performance. EDIS also uses open source software from the Apache Foundation, including tools for regular expressions, fuzzy logic, and pattern matching. One method of attribute extraction that can be very cost efficient is the combined use of regular expressions and validation tables (standard lists of values). Our philosophy is to use the most cost-effective tools to attribute as much data as possible, reserving the more costly tools for the most difficult data.
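The regular-expression-plus-validation-table method can be sketched as follows. The material specifications and the tolerant pattern below are hypothetical illustrations, not an actual production rule set:

```python
import re

# Hypothetical validation table: a standard list of material specifications
MATERIAL_SPECS = {"SA-516-70", "SA-106-B", "SA-285-C"}

# Tolerant pattern: allows OCR'd spaces, dots, or dashes between the parts
SPEC_RE = re.compile(r"S\s*A[\s.-]*(\d{3})[\s.-]*([0-9A-Z]{1,3})")

def extract_material_spec(ocr_text):
    """Pull candidate material specs from OCR text, normalize them, and keep
    only those that match an entry in the validation table."""
    hits = []
    for m in SPEC_RE.finditer(ocr_text.upper()):
        candidate = f"SA-{m.group(1)}-{m.group(2)}"
        if candidate in MATERIAL_SPECS:
            hits.append(candidate)
    return hits

print(extract_material_spec("Shell material: SA 516 - 70, heads per SA.285.C"))
```

The validation table is what makes this cheap method reliable: a loose pattern casts a wide net over noisy OCR text, and the table rejects any false captures.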

Advanced Search Technology

In many cases records have not been adequately indexed, yet it is desirable to cross-correlate attributes and metadata to determine the “family” of documents associated with a certain term for a given valve segment or vessel, where the term of interest might be a manufacturer, for example. Because the source documents are scanned images, the extracted text is never 100% accurate, so a literal search for a company name tied to a missing record will not return all of the actual matches. To address this, we have developed our Advanced Search Technology (“AST”) search engine, which tolerates error conditions in the text and returns those results as additional hits. These additional hits can be very meaningful in locating relevant records for engineers tasked with, for example, design pressure confirmation, especially where such records can serve as the basis for multiple components or sections. AST also takes advantage of all available metadata to better narrow the universe of relevant records. For example, the vendor name “LABARGE PIPE AND STEEL COMPANY” may be clearly evident on visual examination of a PDF, yet not be returned by a text-based search engine: character corruption introduced by optical character recognition (“OCR”) software can eliminate a high percentage of search hits. Using EDIS AST, the corrupted text “LA$BARG IPE ST*EL” would still be found as a true positive hit.
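The underlying idea of error-tolerant matching can be sketched with a standard-library similarity ratio. This is only an illustration of the principle, not the actual AST implementation, and the threshold value is a hypothetical choice:

```python
from difflib import SequenceMatcher

def fuzzy_hits(query, ocr_lines, threshold=0.5):
    """Return OCR'd lines similar to the query even when characters are corrupted.

    A literal substring search would miss these lines; a similarity ratio
    tolerates OCR errors such as dropped or substituted characters.
    """
    query = query.upper()
    hits = []
    for line in ocr_lines:
        ratio = SequenceMatcher(None, query, line.upper()).ratio()
        if ratio >= threshold:
            hits.append((line, round(ratio, 2)))
    return hits

lines = ["LA$BARG IPE ST*EL", "ACME VALVE WORKS"]
print(fuzzy_hits("LABARGE PIPE AND STEEL COMPANY", lines))
```

The corrupted vendor name scores well above the threshold while the unrelated line does not, which is exactly the behavior that recovers the additional hits described above.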

Data Quality Improvement

Legacy records and data suffer from a number of factors that hamper their use in downstream applications such as MAOP design pressure confirmation, fitness for service, and risk-based inspection. Numerous versions of individual record types (such as purchase orders and mill test reports from different vendors over time) add to the complexity of organizing the data into logical buckets. Legacy databases often have different data schemas and layouts, presenting challenges in linking like attributes across tables from different systems. Data quality is critical to integrity management because missing data produces incorrect or incomplete data sets upon which the risk and engineering calculations are based. Software performing these calculations will either (A) rely on default values, which produces false positive results, or (B) stop short of its advertised features because it lacks the legacy data to perform the analysis. EDIS helps improve data quality by performing the following functions:
  • Properly identifying the versions of records which are the same record but have different content and layouts
  • Creating a high quality "hard index" of the content attributes (PO Numbers, dates, locations, work order numbers, material specifications)
  • Normalizing the metadata so that specific data expressions are appropriately consistent across the database
  • Standardizing the metadata, such as converting fractions to decimals to facilitate later calculations
  • Performing simple arithmetic calculations, such as converting ID to OD where needed
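The last two items are straightforward to illustrate. The sketch below converts fractional dimensions to decimals and derives outside diameter from inside diameter, assuming the standard relation OD = ID + 2 × wall thickness (the example values are hypothetical):

```python
from fractions import Fraction

def to_decimal(dim):
    """Convert a dimension like '3 1/2' or '5/8' to a decimal value."""
    parts = dim.strip().split()
    return float(sum(Fraction(p) for p in parts))

def od_from_id(inner_diameter, wall_thickness):
    """Outside diameter = inside diameter + twice the wall thickness."""
    return inner_diameter + 2 * wall_thickness

id_in = to_decimal("8 5/8")      # 8.625
wt_in = to_decimal("3/8")        # 0.375
print(od_from_id(id_in, wt_in))  # 9.375
```

Normalizing every dimension to decimal form up front is what allows downstream RBI and PMS calculations to run without per-record hand conversion.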

On-Site/Off-Site Physical Records Conversion

EDIS operates a 60,000-square-foot facility in the Houston area equipped with high-speed scanners, workstations for up to 175 people, and a records storage area capable of holding up to 250,000 cubic feet of physical records. We perform all steps of the imaging workflow and have a history of supporting complex projects in the energy sector, with a focus on high-quality work and data integrity. Our staff has been trained on several AIM projects and is led by our project managers and AIM SME. Our experience includes deploying onsite and mobile scanning solutions for short- and medium-term assignments where, for reasons of security, compliance, or immediacy of access, our clients need us nearby. These assignments include deploying experienced teams with hardware and software for a project and coordinating delivery of images to our data center for processing. We can also perform manual triage at the box, file, or folder level to assess importance and ascertain the business benefit of further processing.