Luddite's Guide to Analytics Tools

Dealing with data

For legal professionals, the ever-expanding landscape of data is a constant challenge. As new technologies, devices and communications channels emerge, data takes on ever more varied forms, making processes like eDiscovery more complex.

But, as data volumes have grown and developed, so too have the tools and technologies to manage them. The problem? Most legal professionals simply don’t know enough about these tools to unlock their potential.

With that in mind, we thought it was time to create another handy Luddite’s guide for legal professionals looking to better understand legal tech and eDiscovery solutions.

In this latest addition to the series, we’ll unpack everything there is to know about analytics tools.

Lawyers have been slow to adopt analytics. As is true of most technologies in the legal space, those who would benefit most from analytics tools are often too busy to get to grips with them.

It’s understandable. Not only is law one of the most time-consuming professions, but the world of legal tech moves and changes at an incredible pace. New extensions, add-ons and best practices are constantly emerging, with old versions being rendered obsolete as quickly as new updates are made.

But the truth is that every legal professional — from independent firms to in-house counsel — can benefit from what analytics tools offer. The time you spend understanding analytics tools will save you time on your next eDiscovery project.

The benefits of analytics tools — in 30 seconds or less

In the right hands, analytics tools can offer benefits that few other legal tech innovations can match. When used during eDiscovery, analytics can enable legal teams to:

Make faster, more informed decisions
Automate the labour-intensive aspects of reviews
Recognise relationships between different sources of information
Analyse data across all digital formats
Sift through huge document sets and organise them based on relevance
Deliver a higher standard of client service

In short, analytics tools can empower you to be more efficient and accurate, and to make significant cost savings. With analytics, you can get the most relevant information in your caseload as quickly as possible and find those critical case documents faster than ever before.

Why is it worth getting to grips with analytics tools?

Each tool has benefits and limitations you should be aware of. The key to getting the best results from eDiscovery is having a 360-degree understanding of the systems and solutions you’re putting to use.

Armed with that knowledge, you can build a custom stack of these technologies tailored to fit your team's needs and objectives or a particular case. To that end, in this Luddite’s guide, we look at some of the best analytics tools out there, assessing their strengths, weaknesses and the purposes they’re best suited to.

It’s worth noting that while we look at multiple different tools, these aren’t necessarily standalone technologies. The capabilities discussed in this guide should be available as standard features for any worthy cloud-based eDiscovery software.

Email threading

Email threading works by gathering all replies (including reply-all messages), forwards and attachments from an email chain, grouping them for ease of review.
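
For the technically curious, the grouping step can be pictured with a short Python sketch. It assumes each email carries an identifier and the identifier of the message it replies to (much like the Message-ID and In-Reply-To headers in real email). The field names `id`, `in_reply_to` and `sent` are our own, and a real eDiscovery platform will use far more sophisticated logic than this.

```python
from collections import defaultdict

def thread_root(msg_id, parents):
    """Follow reply links upwards until no known parent remains."""
    seen = set()  # guards against circular reply links
    while parents.get(msg_id) is not None and msg_id not in seen:
        seen.add(msg_id)
        msg_id = parents[msg_id]
    return msg_id

def group_into_threads(emails):
    """Group email ids by thread root, each thread in sent order."""
    parents = {e["id"]: e["in_reply_to"] for e in emails}
    threads = defaultdict(list)
    for email in sorted(emails, key=lambda e: e["sent"]):
        threads[thread_root(email["id"], parents)].append(email["id"])
    return dict(threads)

# Hypothetical sample data: "x1" was never collected, only its reply "x2".
emails = [
    {"id": "m3", "in_reply_to": "m2", "sent": "2023-03-03"},
    {"id": "m1", "in_reply_to": None, "sent": "2023-03-01"},
    {"id": "m2", "in_reply_to": "m1", "sent": "2023-03-02"},
    {"id": "x2", "in_reply_to": "x1", "sent": "2023-03-05"},
]
threads = group_into_threads(emails)
print(threads)
```

In this sketch, a thread whose root id does not appear in the collection (here `x1`) hints at a missing email, which is the kind of gap a threading tool can mark with a placeholder.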

What are the benefits of email threading?

Email threading can help you significantly reduce the number of documents in your review queue and save time by weeding out duplicate content. The tool groups together email chains from end to end, meaning they can be reviewed all at once in sequential order rather than mixed in with the rest of the data set.

This allows you to interpret the information in its intended context, which can significantly improve the quality of your review. Email threading tools can also highlight missing emails, inserting placeholders to show where replies or forwards were once present in an email chain.

Email threading can also be used for email text data and scanned hard copies of emails, making for a more robust deduplication process than tools that use hash values.

Are there any limitations to email threading?

While it brings the advantage of grouping conversations together and enabling you to see individual messages in their appropriate context, email threading can make the review process longer than it needs to be.

For example, only one email in a chain may be relevant to your case. Because threading groups it with the rest of the chain instead of surfacing it by relevance, you can end up reviewing more documents than necessary.

A more labour-intensive review can, in turn, mean a more labour-intensive redaction exercise than you need.

When to use email threading

Email threading is best when reviewing substantial amounts of email data and for data sets containing email data from various custodian mailboxes.

When not to use email threading

You shouldn’t use email threading for standalone electronic documents or for cases where you anticipate a lot of redactions. Keep in mind that when you’re coding for privilege, reviewing whole threads may lead to an extensive redaction exercise.

Clustering

Clustering identifies conceptually similar documents, pulls them together and places them into logical groups — or ‘clusters’.
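
To make the idea concrete, here is a rough Python sketch. Plain word overlap (cosine similarity over word counts) stands in for the concept-based analysis a real clustering tool performs, and the function names, the sample documents and the 0.3 threshold are all our own assumptions.

```python
import math
from collections import Counter

def vectorise(text):
    """Turn a document into a bag of lower-cased word counts."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two word-count vectors (0 = nothing shared)."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def greedy_cluster(docs, threshold=0.3):
    """Put each document in the first cluster whose seed it resembles."""
    clusters = []  # list of (seed vector, [document names])
    for name, text in docs.items():
        vec = vectorise(text)
        for seed, members in clusters:
            if cosine(seed, vec) >= threshold:
                members.append(name)
                break
        else:  # no existing cluster was close enough, so start a new one
            clusters.append((vec, [name]))
    return [members for _, members in clusters]

# Hypothetical sample documents
docs = {
    "a": "lease renewal for office premises",
    "b": "office lease renewal terms and rent",
    "c": "quarterly sales forecast spreadsheet",
    "d": "sales forecast for next quarter",
}
print(greedy_cluster(docs))  # [['a', 'b'], ['c', 'd']]
```

Note that no example documents or categories are supplied up front: the groups emerge from the documents themselves.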

What are the benefits of clustering?

If you’re working with an unfamiliar data set, clustering is a quick way to gain an overview of the topics covered in your collection.

Each cluster is named according to the documents' content, making a large workload much easier to navigate. It provides you with a high-level overview of the themes present within your document collection before review has even begun.

By allowing you to focus on the most relevant topic clusters and disregard any irrelevant ones, clustering helps prioritise your review process and complete tasks more quickly.

Clustering doesn’t require user input or example documents to be applied. Plus, it can help find relevant information that can slip through the net when using keyword filtering, either by revealing synonyms or by identifying documents that didn’t respond to your keywords but are conceptually similar to them.

Assigning individual reviewers to individual clusters is a great way to reduce errors and improve coding consistency.

Are there any limitations to clustering?

Reviewing by cluster will break the chronology of your review queue as it prioritises content themes over dates and metadata. Foreign languages can make things difficult too, so if you have a high population of foreign language documents in your collection, they may need to be clustered separately.

In general, ensuring your clusters are as relevant and accurate as possible can often require you to clean up your documents manually.

When to use clustering

Clustering is perfect when you’re presented with an unfamiliar data set. It’s also useful before disclosure, to check for coding consistency and ensure documents aren’t missed.

Clustering is also a useful quality control tool to assess your review team’s work during a managed review.

When not to use clustering

Don’t use clustering if reviewing your document queue in chronological order is crucial to your project or if your data set consists primarily of images or media files. Clustering is also best avoided if your data set contains very long documents that discuss several different topics, as this can compromise accuracy.

Categorisation

Categorisation allows you to create a set of example documents that form the basis for identifying and grouping other conceptually similar documents across your data set.

What are the benefits of categorisation?

For documents that touch on more than one concept or subject, categorisation can ensure that they are classified accordingly.

Unlike clustering, categorisation can place the same documents into multiple categories, should they be a conceptual match for several different themes. A rank is then assigned to each document, indicating how conceptually similar it is to the overarching themes of each category.
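
A simplified Python sketch of that behaviour follows. It scores a document against the example documents for each category, keeps every category that scores above a cut-off and returns them ranked. Word overlap (Jaccard similarity) stands in for real conceptual matching, and the names, the 0.2 cut-off and the sample texts are assumptions of ours.

```python
def jaccard(a, b):
    """Share of distinct words two texts have in common (0.0 to 1.0)."""
    words_a, words_b = set(a.lower().split()), set(b.lower().split())
    union = words_a | words_b
    return len(words_a & words_b) / len(union) if union else 0.0

def categorise(doc_text, examples, min_score=0.2, max_categories=5):
    """Return (category, score) pairs for every category the document matches."""
    scores = {
        category: max(jaccard(doc_text, example) for example in example_docs)
        for category, example_docs in examples.items()
    }
    matches = [(cat, score) for cat, score in scores.items() if score >= min_score]
    matches.sort(key=lambda pair: pair[1], reverse=True)  # best match first
    return matches[:max_categories]

# Hypothetical example documents, one per category
examples = {
    "pricing": ["discount price list agreed with supplier"],
    "employment": ["employee contract termination notice period"],
}
doc = "supplier price discount and employee termination notice"
print([category for category, _ in categorise(doc, examples)])
# ['employment', 'pricing']
```

The same document lands in both categories, each with its own score, which is the key difference from clustering.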

This is ideal for projects involving long and detailed documents with more nuanced subject matter that covers several topics.

Once you’ve submitted your set of example documents, categorisation quickly sorts the rest of your data set by critical issue. This can significantly help you prioritise your review according to relevance and identify the documents you need early in the process.

Are there any limitations to categorisation?

This tool can require a fair bit of user input. As well as requiring you to submit example data, it works best when the categories of interest are identified manually beforehand.

Example data must be focused entirely on a single concept, with at least two detailed paragraphs of meaningful text and free from any other text features which can confuse the system (such as repeated text or headers and footers).

Finally, any single document can only be assigned to a maximum of five categories.

When to use categorisation

Categorisation is useful when you’ve identified particular issues or categories of interest within your data set.

To use it effectively, you need at least one example document for each category’s subject theme. It’s particularly helpful when you take on a substantial data set (such as a received disclosure) that needs to be coded for issue after you’ve already coded your own data set for issue.

When not to use categorisation

As documents can’t be assigned to more than five different categories, we wouldn’t recommend categorising any project where you’re coding for a long list of issues or points of interest.

Active Learning - prioritised review

Active Learning is a technology-assisted review tool that predicts which documents are most likely relevant to your case, allowing your data to be organised rapidly.

Within Active Learning, there are two main methods of review: prioritised review and coverage review. Prioritised review queues documents in the order the system deems most relevant, serving the highest ranked document first.

Its data visualisation gives you a crystal clear picture of how the review is progressing in real time, and its elusion testing features allow you to make an informed judgment on when to stop your review. This can save time and money.
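
Elusion testing is usually described as drawing a random sample from the documents that haven't been reviewed and checking how many of them turn out to be relevant. The arithmetic behind it can be sketched in a few lines of Python. The function name and the figures are made up for illustration, and a real platform will typically add confidence intervals on top.

```python
def elusion_estimate(sample_size, relevant_in_sample, unreviewed_total):
    """Estimate how many relevant documents remain in the unreviewed pile."""
    rate = relevant_in_sample / sample_size  # the elusion rate
    return rate, round(rate * unreviewed_total)

# Hypothetical figures: 5 relevant documents in a random sample of 500,
# drawn from 40,000 documents that have not been reviewed.
rate, likely_missed = elusion_estimate(500, 5, 40_000)
print(f"Elusion rate {rate:.1%}, roughly {likely_missed} relevant documents left")
```

A low elusion rate suggests little of value remains, which supports a decision to stop. The judgement on what counts as low enough still sits with the review team.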

Are there any limitations to prioritised review?

The system will serve random documents until its initial training quota is met. If your data set has low richness, targeted searches may be required to find relevant documents that can effectively train the model.

User judgement is also required (albeit with support from the system’s statistical analysis) to determine when a review should be stopped. Finally, once the review process has started, you can’t change the system settings until the project is complete.

When to use prioritised review

Prioritised review can handle data sets as small as 1,000 files and is suitable for complete or filtered data sets of no more than 9 million documents.

It’s also ideal when you need to quickly review the most relevant documents, when you wish to review complete ‘document families’ together or when working with data sets with an expected lower richness level.

It can also be used to quality control previously coded documents, helping identify outliers and coding inconsistencies.

When not to use prioritised review

Prioritised review is unsuitable for small volumes of data or data sets made up of images or media files — including scanned documents of text. Prioritised review will also be ineffective for data sets with poor quality OCR (Optical Character Recognition). OCR is a process that converts scanned documents or other images with text into text data.

If OCR is run on low-quality documents (something handwritten or from an old printer or fax machine, for example), the text can be misinterpreted and converted with errors.

What are the benefits of prioritised review?

Prioritised review allows you to quickly locate and review the most relevant documents in your caseload, dramatically speeding up review and keeping costs down.

It also continuously learns from your coding decisions, meaning the more documents you code, the better the system understands which documents are relevant. Using this understanding, the system then updates its document ranks every 20 minutes to ensure the most relevant documents continue to be pushed to the front of the review queue.

Additional documents can be added after the review has begun; they are ranked at the next update. Prioritised review can be run in combination with other analytics tools, is language agnostic and requires a low level of manual input. It needs only five documents coded with your positive choice and five with a negative choice before it can rank the rest of your data set.
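
The learn-then-rank loop can be illustrated with a toy Python model. A very simple word-weight scorer stands in here for whatever classifier a real Active Learning system uses, the sample set is smaller than the five-and-five minimum mentioned above to keep things short, and all names and texts are our own.

```python
from collections import Counter

def train(coded):
    """Learn a weight per word from (text, is_relevant) coding decisions."""
    weights = Counter()
    for text, is_relevant in coded:
        for word in set(text.lower().split()):
            weights[word] += 1 if is_relevant else -1
    return weights

def score(text, weights):
    """Average weight of the distinct words in a document."""
    words = set(text.lower().split())
    return sum(weights[w] for w in words) / len(words) if words else 0.0

def prioritised_queue(uncoded, weights):
    """Order uncoded documents from most to least likely relevant."""
    return sorted(uncoded, key=lambda name: score(uncoded[name], weights), reverse=True)

# Hypothetical coding decisions made so far
coded = [
    ("breach of contract claim", True),
    ("contract penalty clause breach", True),
    ("office party invitation", False),
    ("canteen lunch menu", False),
]
uncoded = {
    "d1": "lunch menu update",
    "d2": "notice of contract breach",
    "d3": "meeting room booking",
}
print(prioritised_queue(uncoded, train(coded)))  # ['d2', 'd3', 'd1']
```

Each new coding decision would be appended to `coded` and the queue rebuilt, which mirrors the periodic rank updates described above.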

Active Learning - coverage review

Coverage review prioritises and serves up documents which will best train the Active Learning model. Unlike prioritised review, which is intended to locate the most relevant documents, coverage review aims to train the system as quickly as possible.

What are the benefits of coverage review?

Coverage review serves up the documents the model is most unsure about, as these are most beneficial to the Active Learning model. These documents are chosen solely to train the model and are categorised as either relevant or not relevant.
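
Serving the documents a model is least sure about is often called uncertainty sampling. A minimal Python sketch, assuming the model gives each document a relevance score between 0 and 1 (the scores below are invented), looks like this:

```python
def coverage_queue(predicted):
    """Order documents so the ones scored closest to 0.5 come first."""
    return sorted(predicted, key=lambda name: abs(predicted[name] - 0.5))

# Hypothetical relevance scores from a partly trained model
predicted = {"d1": 0.95, "d2": 0.52, "d3": 0.10, "d4": 0.40}
print(coverage_queue(predicted))  # ['d2', 'd4', 'd3', 'd1']
```

Compare this with prioritised review, which would serve `d1` first because it has the highest score.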

Like prioritised review, coverage review is language agnostic and additional documents can be added after the review has begun. Coverage review will continue to serve up documents until there are no longer any documents in the queue.

However, you should stop reviewing documents once the model stabilises to save time and resources.

Are there any limitations to coverage review?

Coverage review doesn’t serve family documents together and, as with prioritised review, system settings can’t be changed once the review has started.

When to use coverage review

Coverage review is suitable when quick production is necessary and you need to classify documents into relevant or not relevant data sets in a short period. It’s also beneficial for large projects where you don’t need to review and code all relevant documents.

Coverage review can also be helpful in general investigations and information mining, when you’re looking to gather information for your own purposes.

When not to use coverage review

We’d advise against using coverage review for projects where all relevant documents must be reviewed, for data sets with poor quality OCR and for documents that must be reviewed alongside the rest of their document family.

Like prioritised review, coverage review is unsuitable for data sets made up of images or media files, including scanned copies of written documents.

Repeated content filtering

Repeated content filtering identifies commonly occurring text within your data set and then suppresses this content from your analytics.

What are the benefits of repeated content filtering?

Using repeated content filters will improve the quality of your analytics index. This prevents headers, footers and boilerplate text from overshadowing the subject matter of a batch of documents while improving the accuracy and efficiency of keyword searches.

These filters also suppress duplicate text from the desired analytics index without altering document contents.

Repeated content filters can be used alongside regular expression filters to eliminate text that follows a recurring pattern or repeatedly appears with slight variation, such as a URL or a Bates stamp.

Are there any limitations to repeated content filtering?

Repeated content filtering requires user configuration, such as setting the number of occurrences and word count. Repeated content filters can’t be directly applied for search term reports or other index types, such as dtSearch.
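
As a rough illustration, the Python sketch below finds lines that recur across documents, using an occurrence count and a word count as its settings, and leaves them out of the text passed to an analytics index. It also strips a Bates-style stamp with a regular expression. The parameter names, the stamp pattern and the sample emails are our own assumptions rather than any product's settings.

```python
import re
from collections import Counter

# Hypothetical Bates-stamp shape: 2-5 capital letters then 6+ digits, e.g. ABC000123
BATES = re.compile(r"\b[A-Z]{2,5}\d{6,}\b")

def find_repeated_lines(docs, min_occurrences=3, min_words=4):
    """Return lines that appear in at least `min_occurrences` documents."""
    counts = Counter()
    for text in docs:
        for line in {line.strip() for line in text.splitlines()}:
            if len(line.split()) >= min_words:
                counts[line] += 1
    return {line for line, n in counts.items() if n >= min_occurrences}

def text_for_index(text, repeated):
    """Build the text to index; the original document is left untouched."""
    kept = [line for line in text.splitlines() if line.strip() not in repeated]
    return BATES.sub("", "\n".join(kept)).strip()

footer = "This email is confidential and intended for the recipient only"
docs = [
    "Lease draft attached ABC000123\n" + footer,
    "Invoice query\n" + footer,
    "Meeting moved to Friday\n" + footer,
]
repeated = find_repeated_lines(docs)
print(text_for_index(docs[0], repeated))  # Lease draft attached
```

Raising `min_occurrences` or `min_words` makes the filter more conservative, which is the sort of tuning the configuration step involves.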

When to use repeated content filtering

Repeated content filters are best used for conceptual analytics across your data set. You can also use them when running keyword searches that yield high false positive results due to matching terms found in email footers or boilerplate text.

When not to use repeated content filtering

Repeated content filtering is unsuitable when you aren’t using conceptual analytics tools. It’s also unsuitable for data sets made primarily of images or media files (including scanned copies of written documents) or for data sets with poor-quality OCR.

Textual near duplicate identification

Textual near duplicate identification analyses the extracted text of all documents and determines percentage similarity for each document compared to the others within its data set. Following this, it groups them based on this percentage similarity.

What are the benefits of textual near duplicate identification?

Textual near duplicate identification works to accelerate your review by quickly identifying textually similar documents within your data set. This makes it a handy quality control tool as it can locate near duplicate documents and compare their relevance, privilege and issue coding decisions.

The process is controlled by a minimum similarity percentage parameter set automatically by the system. You can change this if you feel the level of accuracy isn’t right.

This parameter determines how similar another document in a batch must be to be placed in the same group as the principal document in question. A percentage of 100% would indicate an exact textual duplicate.

The higher you set the minimum similarity percentage, the more similarity is required, which will result in smaller groups of documents. A higher setting will also mean a faster review process, as fewer files have to be compared. It’s worth noting that textual near duplicate identification doesn’t rely on hash values.
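
The effect of the minimum similarity percentage can be seen in a small Python sketch. It uses `difflib.SequenceMatcher` from the standard library as a stand-in for a real tool's comparison, and the function names and sample texts are ours.

```python
from difflib import SequenceMatcher

def similarity_pct(a, b):
    """Percentage similarity between two texts, compared word by word."""
    return 100 * SequenceMatcher(None, a.split(), b.split()).ratio()

def near_duplicate_groups(docs, min_similarity=90):
    """Group each document with the first principal document it resembles."""
    groups = []  # the first name in each group is its principal document
    for name, text in docs.items():
        for group in groups:
            if similarity_pct(docs[group[0]], text) >= min_similarity:
                group.append(name)
                break
        else:
            groups.append([name])
    return groups

# Hypothetical documents: v2 differs from v1 by a single word
docs = {
    "v1": "the tenant shall pay rent on the first day of each month",
    "v2": "the tenant shall pay rent on the first day of every month",
    "memo": "please book the meeting room for friday",
}
print(near_duplicate_groups(docs, min_similarity=90))  # [['v1', 'v2'], ['memo']]
print(near_duplicate_groups(docs, min_similarity=95))  # [['v1'], ['v2'], ['memo']]
```

Because the comparison works on the text itself, it doesn't matter whether the two versions would produce matching hash values.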

Are there any limitations to textual near duplicate identification?

If the percentage similarity of documents is set too low, the grouping function can be overly inclusive, which defeats the purpose of this tool.

When to use textual near duplicate identification

Textual near duplicate identification is useful when hash values aren’t available or when your data set may contain multiple versions of the same text in different formats (e.g. as both a Word document and a PDF file).

It’s also a convenient tool for when metadata spoliation has occurred and therefore hash values may not match.

When not to use textual near duplicate identification

Textual near duplicate identification isn’t suitable for documents with a low word count, data sets made of images, media files, scanned hard copies of written documents or data sets with poor quality OCR.

Language identification

Language identification examines extracted text from each document in your review project to determine its primary language and any secondary languages present. You’ll see how many languages exist in your collection and the percentages of each language per document.
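
A toy Python version shows the shape of the output. It counts common words from tiny hand-picked lists for three languages, whereas a real tool relies on far richer statistical models. The word lists, function name and sample sentence are assumptions for illustration only.

```python
from collections import Counter

# Tiny hypothetical word lists; a real tool covers far more languages and words
COMMON_WORDS = {
    "english": {"the", "and", "of", "to", "is", "in"},
    "french": {"le", "la", "et", "les", "des", "est"},
    "german": {"der", "die", "und", "das", "ist", "nicht"},
}

def identify_languages(text, max_languages=3):
    """Return up to `max_languages` (language, percentage) pairs, primary first."""
    hits = Counter()
    for word in text.lower().split():
        for language, words in COMMON_WORDS.items():
            if word in words:
                hits[language] += 1
    total = sum(hits.values())
    return [(lang, round(100 * n / total)) for lang, n in hits.most_common(max_languages)]

print(identify_languages("the contract is in the annex et la clause est valide"))
# [('english', 57), ('french', 43)]
```

The first entry plays the role of the primary language and any others are secondary languages.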

What are the benefits of language identification?

Language identification tools allow foreign language documents in your workload to be isolated. This can help produce better quality analytics indexes and, depending on the volume of foreign language documents in your data set, can help you create a separate review queue for a particular language. You can then forward this to foreign language reviewers.

The language identification tool supports 173 languages. It can consider all Unicode characters and recognise the different characters associated with each language.

This tool is handy as you can run it without impacting your overall review time and identify languages in your data set you may otherwise have been unaware of.

Are there any limitations to language identification?

The tool can only detect three different languages present in one given text – one primary language and up to two secondary languages. Language identification may also flag false positives for documents containing foreign languages outside the main body text, such as in an email footer.

When to use language identification

Language identification is best used for data sets composed entirely of electronic data or with good quality OCR. This tool can also add value when reviewing unfamiliar data sets and running other analytics tools that benefit from split language indexes.

When not to use language identification

Language identification is unsuitable for data sets made of images, media files or scanned hard copies of written documents or data sets with poor quality OCR.

Keyword expansion

Keyword expansion allows you to submit single keywords or blocks of text and returns a list of conceptually related terms that appear within your data set.

What are the benefits of keyword expansion?

As the name implies, keyword expansion allows you to expand on a starting list of keywords and identify more relevant terms relating to them, leading to more successful and accurate searches.

This includes synonyms or terms strongly related to your predefined keywords, which you may not have considered. Users can either submit single words or blocks of text for keyword expansion.

After compiling a list of related words, terms and phrases, the tool will provide a score for how closely the returned keywords relate to the principal term (or terms) submitted.
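
One simple way to picture the scoring is through co-occurrence: terms that tend to appear in the same documents as your keyword score highly. The Python sketch below does exactly that with a Jaccard score. This is a stand-in of ours for the concept analysis a real tool performs, and the names and sample documents are hypothetical.

```python
def expand_keyword(keyword, docs, top_n=3):
    """Return the `top_n` terms that most often share documents with `keyword`."""
    doc_words = [set(d.lower().split()) for d in docs]
    with_keyword = [words for words in doc_words if keyword in words]
    candidates = set().union(*with_keyword) - {keyword}
    scores = {}
    for term in candidates:
        both = sum(1 for words in with_keyword if term in words)
        either = sum(1 for words in doc_words if term in words or keyword in words)
        scores[term] = both / either  # 1.0 means the two always appear together
    # Highest score first; ties are broken alphabetically
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))[:top_n]

# Hypothetical document set
docs = [
    "invoice payment overdue",
    "payment remittance overdue reminder",
    "remittance advice attached",
    "team lunch friday",
]
print(expand_keyword("payment", docs))
# [('overdue', 1.0), ('invoice', 0.5), ('reminder', 0.5)]
```

Each returned term could then be added to your search list, or fed back in for a further round of expansion.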

Are there any limitations to keyword expansion?

Keyword expansion is not purpose-built for large lists of keywords or phrases. As such, this tool can be time-consuming if not used in a targeted manner.

When to use keyword expansion

Keyword expansion is a powerful tool to use when running keyword searches. It should be a go-to method when looking to improve searching criteria or identify the different ways a particular concept has been discussed or described within your document set.

Keyword expansion can also significantly help with data sets containing multiple foreign language documents.

When not to use keyword expansion

Keyword expansion is unsuitable for data sets made of images, media files or scanned hard copies of written documents or data sets with poor quality OCR.

Find Similar Documents

As the name implies, ‘Find Similar Documents’ allows you to identify conceptually similar documents to the one you’re viewing.

What are the benefits of ‘Find Similar Documents’?

Using the ‘Find Similar Documents’ feature, you can quickly find additional relevant documents in your data set that may have initially been missed.

Unlike textual near-duplicate identification, ‘Find Similar Documents’ ranks files based on their content's similarity rather than a text-focused comparison. In short, it identifies documents based on their context rather than just the words and terminology used.

Like textual near duplicate identification, it makes a robust quality control tool, as it can be used to check coding choices and ensure review consistency before production.

Are there any limitations to ‘Find Similar Documents’?

‘Find Similar Documents’ can flag false positives if used incorrectly and requires manual quality control to get the best results.

When to use ‘Find Similar Documents’

‘Find Similar Documents’ is useful when hash values aren’t available or when metadata spoliation has occurred and hash values may therefore not match. This tool is also helpful when trying to identify multiple versions of the same document or when your data set contains multiple versions of the same text in different formats.

‘Find Similar Documents’ ensures documents are identified and coded correctly and redacted as necessary to prevent privileged information from accidentally being disclosed.

When not to use ‘Find Similar Documents’

‘Find Similar Documents’ is unsuitable for data sets made of images, media files or scanned hard copies of written documents or data sets with poor quality OCR. We’d also advise not using ‘Find Similar Documents’ when dealing with large documents that discuss several different topics or documents primarily made up of numbers.

Still have questions about analytics tools? Altlaw can help. Contact one of our team today for an informal chat.

eDiscovery Services: 020 7566 7566
Print/Hard Copy Services: 020 7490 1646
Email us: enquiries@altlaw.co.uk
