Keyword searching has long been the go-to method for culling data in the early stages of eDiscovery projects. But as analytics tools and AI capabilities continue to improve, many of us are asking: do we need to use keyword searching any more?
In this blog post, we will explore the pros and cons of keyword searching and what the alternatives might look like.
The pros of keyword searching:
Keyword searching is a quick way of culling your dataset so that you handle only the documents you think are most likely to be relevant. This is achieved by applying a set of terms or phrases (ideally 10-15 good ones) across the dataset. The keyword-responsive documents then make up your review population. Slight spelling variations of your search terms can be identified by applying fuzzy searching. This can help ensure that key documents are not missed due to human error, or American/British (or other) spelling differences.
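To make the fuzzy searching idea concrete, here is a minimal sketch in Python. It is not how any particular eDiscovery platform implements fuzzy search; it uses the standard library's `difflib` similarity ratio, and the function name `fuzzy_hits`, the 0.85 threshold, and the sample documents are all assumptions made purely for illustration.

```python
import re
from difflib import SequenceMatcher


def tokenize(text):
    """Lower-case the text and split it into alphabetic words."""
    return re.findall(r"[a-z]+", text.lower())


def fuzzy_hits(documents, keywords, threshold=0.85):
    """Return {doc_id: sorted keywords matched exactly or approximately}.

    A keyword counts as a hit when any word in the document has a
    similarity ratio of at least `threshold` (an illustrative value).
    """
    hits = {}
    for doc_id, text in documents.items():
        tokens = set(tokenize(text))
        matched = set()
        for keyword in keywords:
            target = keyword.lower()
            if any(SequenceMatcher(None, target, tok).ratio() >= threshold
                   for tok in tokens):
                matched.add(keyword)
        if matched:
            hits[doc_id] = sorted(matched)
    return hits


# Hypothetical documents: one American spelling, one typo, one unrelated.
documents = {
    "DOC-001": "The organization approved the payment.",
    "DOC-002": "Our organisation recieved the invoice late.",
    "DOC-003": "Lunch is at noon on Friday.",
}

review_population = fuzzy_hits(documents, ["organisation", "received"])
for doc_id, matched in review_population.items():
    print(doc_id, matched)
```

The threshold is the lever to watch here: lower it and you catch more misspellings but also more unrelated words, which is the same trade-off you face when tuning fuzziness on a real platform.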
Keyword searching is a trusted and well-understood process in eDiscovery which makes it an ideal tool for those more wary of Artificial Intelligence and its use within law. The use of keywords is also especially helpful later on in the case when you have had the chance to become more familiar with your case matter. At this point, you can suggest very specific keywords with a high chance of producing relevant material.
On top of saving you time by reducing the number of documents you have to review, keyword searching also saves you money: the culling happens in an ECA (Early Case Assessment) workspace, which is much cheaper than a review workspace.
The cons of keyword searching:
While keyword searching is an established method of culling data, it is often cumbersome and required too early in the eDiscovery process. A common theme of communication between our project managers and clients is how difficult it is to produce case-specific keywords before you have had the time to learn much about the case.
Keyword searching is also rather nuanced and can be quite confusing for those who aren't well acquainted with the process. Additionally, at this point in time (though there are developments heading our way), keywords are taken at face value. If you are looking for communications about a cat and you use 'cat' as a keyword, every communication that includes the word 'cat' will be returned, but every communication that uses 'feline' or 'kitten' instead will be missed. This means you have to be very sure of your keyword selection to avoid missing any key documents.
You must also be prepared for keyword searching to collect a proportion of irrelevant documents, because it is not context specific. Anywhere a keyword is mentioned, that document is collected, whether or not it has anything to do with your case. You will need to take this into account when searching for more commonly occurring words.
It seems the main issue with keyword searching revolves around your familiarity with your case and the potential to miss key documents due to a lack thereof. Not only is there a risk of missing your smoking gun document, but there are also additional time and cost pressures that come from having to re-search and re-review documents in order to find it.
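The 'cat' example can be sketched in a few lines of Python. The four documents and the reviewer coding below are invented for illustration, and the whole-word regular expression is only a stand-in for how a literal keyword match behaves, not any platform's actual search syntax.

```python
import re


def keyword_hits(documents, keyword):
    """Return the ids of documents containing `keyword` as a whole word."""
    pattern = re.compile(rf"\b{re.escape(keyword)}\b", re.IGNORECASE)
    return {doc_id for doc_id, text in documents.items()
            if pattern.search(text)}


# Hypothetical documents for a case about a pet cat.
documents = {
    "A": "The cat knocked the vase off the shelf.",      # relevant
    "B": "Our feline friend has been unwell all week.",  # relevant
    "C": "The kitten needs its vaccinations.",           # relevant
    "D": "Please book the CAT scan for Tuesday.",        # irrelevant
}
relevant = {"A", "B", "C"}  # assumed reviewer coding, for illustration only

hits = keyword_hits(documents, "cat")
missed = relevant - hits      # relevant documents the keyword never finds
false_hits = hits - relevant  # irrelevant documents pulled into review

recall = len(hits & relevant) / len(relevant)
precision = len(hits & relevant) / len(hits)

print("hits:", sorted(hits))
print("missed:", sorted(missed))
print("false hits:", sorted(false_hits))
print(f"recall={recall:.2f} precision={precision:.2f}")
```

Both problems show up at once: the 'feline' and 'kitten' documents never reach the review population, while the medical appointment does, simply because the word appears in it.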
So what can we do to try and eliminate this problem?
Your initial thought might be to simply delay the use of keyword searching until you are more sure of your case details and what data could be relevant. However, under current industry processes, this could delay your eDiscovery quite significantly, leading to time pressures further down the line. Instead, let's have a look at an alternative solution.
Alternatives to keyword searching:
In order to remove the dependence on keywords we would suggest an approach that utilises the capabilities of your eDiscovery platform without relying on keyword data. This approach would therefore have to be completely analytical and rely only a little on preliminary case knowledge. It would go a little something like this...
1) Apply Concept Clustering: The use of concept clustering is an excellent way to get a high-level overview of the themes within your data. Not only is this a completely analytic tool that requires no suggestion from the project manager or client, but it is also a very effective way of discovering what keywords might be useful later on in the process.
2) Apply Sentiment Analysis/Other Tools: By applying sentiment analysis and other analytics tools you can double down on the information you can learn from your data. For example, suppose all you know is that your case is between two brothers and concerns a holiday they took in a specified time period, which ended badly. With this very basic knowledge, you can use concept clustering to dive into the holiday segment (if there is one; it may be a sub-segment depending on the amount of communication) and then run sentiment analysis. Now you can focus only on documents that register negative or angry sentiment, allowing you to be very targeted in your approach to your data.
When Relativity's new communications analysis tool is released you will also be able to go through this same process with specific communicators in mind!
3) Train an Active Learning Algorithm: Now that you have some very targeted datasets that are likely to contain relevant material, you can promote all of your data to a review workspace and use these datasets to train an Active Learning algorithm. When it comes to training an AL algorithm, Relativity recommends at least 5 high-quality relevant documents and 5 high-quality irrelevant documents, though we would recommend a number more in the 100-200 region. The more the better really. Regardless of the number you choose to train your algorithm on, finding these documents should be fairly easy with some simple manipulation of the two previous steps.
4) Prioritised Review of the Whole Dataset: Here when I say 'whole dataset', I simply mean that we do not filter out any documents as irrelevant before beginning the review process (other than those removed by de-duplicating measures etc.). This allows the algorithm to learn as you learn and helps negate the risk of losing important documents. As it is an Active Learning review we would still expect there to be plenty of documents we do not have to review, especially if we have taken the time to train the model well in the beginning. A further benefit is that we can run an Elusion test, which samples the documents left unreviewed, to check the likelihood of missing a document and make decisions based on reliable statistics.
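To show the kind of statistics an Elusion test gives you, here is a rough sketch of the arithmetic in Python. It is not Relativity's implementation: the function name `elusion_estimate`, the Wilson score interval, the 95% confidence level, and the sample figures are all assumptions chosen for illustration. The idea is simply to review a random sample of the unreviewed documents, see what share turn out to be relevant, and scale that share up.

```python
import math


def elusion_estimate(sample_codings, unreviewed_total, z=1.96):
    """Estimate how many relevant documents remain unreviewed.

    sample_codings   -- True/False relevance calls for a random sample
                        drawn from the unreviewed documents
    unreviewed_total -- number of documents left unreviewed
    z                -- z-score for the interval (1.96 is roughly 95%)
    """
    n = len(sample_codings)
    found = sum(sample_codings)
    rate = found / n  # elusion rate: share of the sample that was relevant

    # Wilson score interval for the rate (one common choice, assumed here).
    denom = 1 + z * z / n
    centre = (rate + z * z / (2 * n)) / denom
    half = z * math.sqrt(rate * (1 - rate) / n + z * z / (4 * n * n)) / denom

    return {
        "elusion_rate": rate,
        "interval": (max(0.0, centre - half), min(1.0, centre + half)),
        "estimated_missed": rate * unreviewed_total,
    }


# Hypothetical figures: 500 sampled documents, 5 coded relevant,
# drawn from 40,000 documents the review never reached.
sample = [True] * 5 + [False] * 495
result = elusion_estimate(sample, unreviewed_total=40_000)

low, high = result["interval"]
print(f"elusion rate: {result['elusion_rate']:.2%}")
print(f"interval: {low:.2%} to {high:.2%}")
print(f"estimated relevant documents left behind: {result['estimated_missed']:.0f}")
```

Whether the resulting estimate is acceptable is a judgement call for the case team, but it is a judgement made on a measured figure rather than on a hunch about the keyword list.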
There are, of course, pros and cons to this method, and it will likely change many times as new tools are released and processes are perfected. However, it does present as a viable alternative to keyword searching and I think, given the right circumstances, this could be a very effective way of navigating the potential information loss problem.
The fact that a process can be designed that entirely removes the need for keyword searching is an interesting argument towards keyword searching becoming obsolete, but I don't think we are the whole way there yet... do you?