Video of Brian Carver’s talk at Stanford Law School on January 16, 2014, entitled “It is Long Past Time for Free Online Access to the Law: Free Law Project.”
Yesterday Mark Boyd published a great story about the CourtListener API on Programmable Web. Mark talked to several of the API’s early adopters and really learned what the issues are and how people are addressing them. Thanks to all those quoted in the story for taking the time to talk with Mark about the CourtListener REST API. We’re excited about how you all are already using the API and hope to continue improving it. (There’s nothing like people hitting your website thousands of times a day to shake loose hard-to-find bugs…and we’ve had some of that too and hope to get any and all bugs resolved ASAP!)
I particularly like Waldo Jaquith’s sentiment quoted in the article that 24 months from now we will find it quaint that anyone found this interesting. I sure hope so! That will mean we’ve made many advances and the thought of not having an API for United States case law will seem unimaginable. Unfortunately, free programmatic access, or even digital access, to U.S. case law has been little more than a fanciful dream for a long time in the legal technology community. For years we’ve been like a textile industry that knows the cotton gin exists and even knows how to build one, and yet no one has ever built one. We now have in place a key tool necessary to enable a revolution. We also need the raw materials: the court documents themselves, and we’ll continue to work on that until we can confidently say we’ve got that fully covered, but in 24 months I think we’ll look back and not only find it quaint that this was interesting in 2013, but also we’ll mark this time as an important turning point, where the seed for that legal technology revolution was planted.
Today a group of non-profit public interest organizations has released updated Best-Practices Language for Making Government Data “License-Free.” Free Law Project is glad to sign on to their statement and to support the effort to assist government agencies in making clear that their data is free of copyright or contractual restrictions and can be re-used freely.
Free Law Project is proud to announce that it has been officially accepted as a member of the Free Access to Law Movement. FALM is a consortium of non-profit institutions dedicated to providing free and open access to the world’s law. Its members subscribe to the Declaration on Free Access to Law.
The Declaration explains in part that,
- Public legal information from all countries and international institutions is part of the common heritage of humanity. Maximising access to this information promotes justice and the rule of law;
- Public legal information is digital common property and should be accessible to all on a non-profit basis and free of charge;
- Organisations such as legal information institutes have the right to publish public legal information and the government bodies that create or control that information should provide access to it so that it can be published by other parties.
We have been operating consistently with the principles laid out in the Declaration for some time. Finding ourselves in complete agreement with the Declaration on Free Access to Law, we are excited now to make it official and to formally join with our colleagues around the globe engaged in these endeavors.
FALM members come from Africa, the Americas, Asia, Australia, and Europe. Free Law Project looks forward to sharing ideas and knowledge with this diverse group so that we can all better meet the challenges facing free access to public legal information throughout the world.
For a long time we’ve had a feature that allowed you to look at the items that cite an opinion, letting you look into the future and see what cases found it important down the line. As of today, we’re announcing the complementary feature that allows easy travel into the past. Starting immediately, when you look at almost any case in our collection you’ll see an Authorities section in its sidebar.
For example, Roe v. Wade looks like this:
This section shows the top five opinions that were cited by the one you are looking at. If you wish to see all of the opinions it cited there is a link at the bottom that takes you to the new Table of Authorities page, which shows everything:
Now, when you’re looking at an opinion, you can easily travel through time to either opinions that came later or ones that came before. Doc Brown would be proud.
Posted by: Michael Lissner
Today marks another big day for the Free Law Project. We’re happy to share that we’ve created the first ever API for U.S. Legal Opinions. An API (Application Programming Interface) is a way for computers to talk to each other and consume each other’s data in an automated fashion. From this day forth, developers, researchers and legal startups can begin consuming the data that we have at CourtListener in a granular and very specific manner.
For example, here are some very basic things that can be done with our API (these links will only work if you are signed in to your CourtListener account):
- Include a list of relevant opinions on your blog or website.
- Get a list of the new opinions of the day (here’s today’s, for example) and make a Twitter or Facebook page from it.
- Keep track of opinions that we’ve blocked from search engines at the request of an involved party. This might allow you to block such cases in your project or otherwise analyze privacy concerns in legal opinions.
- See modifications that we’ve recently made to our collection (this can sometimes be a very large number of items!).
- Interrogate or track the citations within an opinion or the citations to an opinion you’re interested in.
- Keep track of changes to our database of American jurisdictions or simply get a list of them.
- Show the most relevant opinions for a controversial topic like abortion (Roe v. Wade is the second hit thanks to CiteGeist).
- Build a citation cross-walk that allows you to find parallel citations.
And this is just the beginning. Legal data has been hard to query and analyze for a very long time, and with this initiative we hope to begin breaking down this barrier. If you’re interested in using our API, it’s free, though we appreciate a linkback or a powered-by logo. Just make sure you’re logged in, and you’ll be good to go.
Posted By: Michael Lissner
We’re excited to announce that beginning today our relevancy engine will provide significantly better results than it has in the past. Starting today, whenever you place a query we will analyze which opinions are the most cited, and we will use that to provide the best results possible. We’re calling this the CiteGeist score because it finds the spirit of your query (“Geist”) and gives you the best possible results. This is currently enabled for our corpus starting in the 1750s up through about 1985, and the remaining years will get the CiteGeist treatment as well over the next few days.
The details of how CiteGeist works are in our code, but the basic idea is to give a high CiteGeist score to opinions that are cited many times by other important opinions, and to give a lower CiteGeist to opinions that have not been cited or that have only been cited by unimportant opinions. Once we’ve established the CiteGeist score, we combine it with a query’s keyword-based (TF/IDF) relevancy. Together, we get a combined score which is a measure of how intrinsically important a case is (its CiteGeist) as well as how closely it matches your specific query.
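To make the idea concrete, here is a minimal sketch of a citation-based importance score combined with a keyword score. It is not the CiteGeist code: it assumes a PageRank-style iteration as a stand-in for "cited by other important opinions", and it assumes the two scores are multiplied together. The function names, the damping factor and the toy graph are all hypothetical.

```python
from collections import defaultdict


def citation_scores(cites, damping=0.85, iterations=50):
    """PageRank-style importance over a citation graph (an assumed stand-in).

    `cites` maps each opinion id to the list of opinion ids it cites.
    """
    nodes = set(cites)
    for targets in cites.values():
        nodes.update(targets)
    n = len(nodes)
    score = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        incoming = defaultdict(float)
        dangling = 0.0  # score held by opinions that cite nothing
        for node in nodes:
            targets = cites.get(node, [])
            if targets:
                share = score[node] / len(targets)
                for target in targets:
                    incoming[target] += share
            else:
                dangling += score[node]
        # Spread the dangling score evenly so the total stays at 1.
        score = {
            node: (1 - damping) / n + damping * (incoming[node] + dangling / n)
            for node in nodes
        }
    return score


def combined_score(keyword_relevance, importance):
    """Blend query relevance with importance; the product is an assumption."""
    return keyword_relevance * importance


# Toy graph: "c" is cited by "a" and "b"; "a" is cited by "d".
scores = citation_scores({"a": ["c"], "b": ["c"], "c": [], "d": ["a"]})
```

In this toy graph "c" ends up with the highest score, and "a" outranks "b" because something cites it.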
We are proud to offer this service, and as always we give away our data in our bulk files and soon via our API. We hope that this new feature will make legal research faster, easier and more accurate and we couldn’t be prouder to offer this service.
This feature was developed by a volunteer contributor, Bo Jin (Krist). He is majoring in Software Engineering at Tianjin University and spent the summer of 2013 taking classes at UC Berkeley. Krist worked closely with us while in Berkeley learning about our code base and has continued to contribute now that he’s returned to China to finish his degree. He hopes to return to the U.S. next fall to pursue a Master’s degree in Computer Science.
Posted by: Michael Lissner
Note: This is the third in the series of posts explaining the work that we did to release the data donation from Lawbox LLC. This is a very technical post exploring and documenting the process we use for extracting meta data and merging it with our current collection. If you’re not technically-inclined (or at least curious), you may want to scoot along.
Working with legal data is hard. We all know that, but this post serves to document the many reasons why that’s the case and then delves deeply into the ways we dealt with the problems we encountered while importing the Lawbox donation. The data we received from Lawbox contains about 1.6M HTML files and we’ve spent the past several months working with them to extract good meta data and then merge it with our current corpus. This post is a long and technical one and below I’ve broken it into two sections explaining this process: Extraction and Merging.
Extraction is a difficult process when working with legal data because it’s inevitably quite dirty: Terms aren’t used consistently, there are no reliable identifiers, formats vary across jurisdictions, and the data was made by humans, with typos galore. To overcome these issues we use a number of approaches ranging from hundreds of regular expressions to clever heuristics.
The first step we take is to convert each HTML file into three things: an in-memory tree that we can traverse and query using XPath, a variable that contains only the text of the opinion (for later analysis), and a variable that contains a simplified version of the HTML with any headers or other junk stripped out.
From there, the tree, text and simplified tree get sent into various functions that extract the following pieces of meta data:
- Case name
- Case date
- Docket number
Of these, jurisdiction and citations are by far the hardest. The others are fairly straightforward, though dates are often missing and must be laboriously looked up.
Citations are extracted using our standard citation finder. We’ve described how it works in the past (pdf), but the basic idea is to tokenize text into valid words and then find valid reporters within the tokens. Whenever a valid reporter is found, you then inch backwards and forwards from it, identifying the volume, page number, year, and any other related information.
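As a rough illustration of that "find the reporter, then look around it" idea, here is a minimal sketch. It is not our citation finder: the reporter set is a tiny hypothetical sample limited to single-token reporters, and the rules for volume, page and year are simplifying assumptions.

```python
import re

# Tiny illustrative sample; a real reporter list is far larger and
# includes multi-word reporters, which this sketch does not handle.
REPORTERS = {"U.S.", "F.2d", "F.3d", "S.Ct."}

YEAR = re.compile(r"\((\d{4})\)[.,;]?")


def find_citations(text):
    """Return a dict for each 'volume reporter page' pattern found in text."""
    tokens = text.split()
    found = []
    for i, token in enumerate(tokens):
        if token not in REPORTERS:
            continue
        if i == 0 or i + 1 >= len(tokens):
            continue
        volume = tokens[i - 1]                # one step backwards
        page = tokens[i + 1].rstrip(",.;")    # one step forwards
        if not (volume.isdigit() and page.isdigit()):
            continue
        year = None
        if i + 2 < len(tokens):               # optional "(1973)" after the page
            match = YEAR.fullmatch(tokens[i + 2])
            if match:
                year = int(match.group(1))
        found.append(
            {"volume": int(volume), "reporter": token, "page": int(page), "year": year}
        )
    return found


citations = find_citations("Roe v. Wade, 410 U.S. 113 (1973)")
```

Here `citations` holds a single entry with volume 410, reporter "U.S.", page 113 and year 1973.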
Finding the jurisdiction relies on a collection of about 500 regular expressions, each designed to find a specific court. Since the data provided by Lawbox is rather dirty, you can see that these regular expressions do a lot of work. Unfortunately this approach isn’t enough for many jurisdictions, and for the hard ones we go a step further.
If the regular expressions fail, the next step we take is to use the citation information as a clue towards the jurisdiction. In many cases it works! It’s often enough to know that a case is in the California Appellate Reporter or the U.S. reporter. Using that information alone, we can often figure out the hard cases.
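The two-stage lookup described above might be sketched like this. The patterns, the reporter hints and the court identifiers are hypothetical examples, not entries from our actual collection of regular expressions.

```python
import re

# Hypothetical examples; the real collection has about 500 patterns.
COURT_PATTERNS = [
    (re.compile(r"Supreme Court of the United States", re.I), "scotus"),
    (re.compile(r"United States Court of Appeals,?\s+Ninth Circuit", re.I), "ca9"),
]

# Hypothetical mapping from a reporter to the court it implies.
REPORTER_HINTS = {"U.S.": "scotus", "Cal. App.": "calctapp"}


def guess_court(header, reporters=()):
    """Try the regular expressions first, then fall back to reporter clues."""
    for pattern, court_id in COURT_PATTERNS:
        if pattern.search(header):
            return court_id
    for reporter in reporters:
        if reporter in REPORTER_HINTS:
            return REPORTER_HINTS[reporter]
    return None  # still unknown; harder heuristics are needed
```

For example, `guess_court("Court of Appeal, Second District", ["Cal. App."])` misses every pattern but is resolved by the reporter hint.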
But sometimes they’re really hard to figure out.
The really hard cases in the Lawbox collection describe their jurisdiction like so: “United States District Court, D. Alabama”. Doesn’t look hard, but, well, Alabama currently has three district courts, the Middle, Northern and Southern, but it doesn’t have a generic “D. Alabama” (at least not since 1824). For the rare case like this, we developed a clever solution: We use the judge information in the case to determine the jurisdiction. Since most judges don’t move too much between courts, before we began importing anything we extracted all the judges and made tallies of where they worked. Then, when we encountered a case like the above, we said, “OK, who’s the judge in this case, and where does he work?” In almost every case this worked very well, but in some cases it didn’t and for those, we simply put the information in manually.
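A minimal sketch of the judge-tally idea follows. The function names, the court identifiers and the 80% threshold are assumptions for illustration; the post does not state how strong a tally had to be before we trusted it.

```python
from collections import Counter, defaultdict


def build_judge_tallies(known_cases):
    """Count how often each judge appears in each court.

    `known_cases` is an iterable of (judge, court_id) pairs taken from
    cases whose jurisdiction is already known.
    """
    tallies = defaultdict(Counter)
    for judge, court_id in known_cases:
        tallies[judge][court_id] += 1
    return tallies


def court_from_judge(judge, tallies, min_share=0.8):
    """Return the judge's usual court, or None if the tally is unclear."""
    counts = tallies.get(judge)
    if not counts:
        return None
    court_id, hits = counts.most_common(1)[0]
    if hits / sum(counts.values()) >= min_share:
        return court_id
    return None  # ambiguous; leave for manual entry


# Hypothetical data: Smith sits almost entirely in one district.
known = [("Smith", "almd")] * 9 + [("Smith", "alnd"), ("Jones", "alnd"), ("Jones", "alsd")]
tallies = build_judge_tallies(known)
```

With this data, `court_from_judge("Smith", tallies)` returns "almd", while Jones is split evenly between two courts and comes back as `None`.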
For the remainder of the meta data fields listed, we employed similar tricks, but these were the hardest examples. For the remainder of our approach, you can inspect the code itself. Just be careful of hairballs.
Once all of the meta data is properly extracted, the next step is to merge it with our existing corpus, identifying duplicates and merging them, or simply adding new cases if no duplicate was found.
The merging process takes one of three main avenues:
- Cases for which there cannot be a duplicate.
- Cases for which there is exactly one duplicate.
- Cases for which there are multiple duplicates.
For the vast majority of the Lawbox donation, we were able to simply add the case to our collection without further ado. We determined this by comparing the date and jurisdiction of the new opinion to our collection and seeing if we had any cases from that jurisdiction during that time. If there weren’t any cases from that place and time, bingo, the new case couldn’t be a duplicate and we could add it straightaway.
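That first check can be sketched in a few lines. The data structure and the 15-day window are assumptions made for the example (the window is borrowed from the candidate search described in step 1 below), not a description of our database queries.

```python
from datetime import date, timedelta


def could_have_duplicate(court_id, filed, existing, window_days=15):
    """True if we already hold any case from this court near this date.

    `existing` maps a court id to the filing dates already in the corpus.
    """
    window = timedelta(days=window_days)
    return any(abs(filed - known) <= window for known in existing.get(court_id, ()))


# Hypothetical corpus holding a single 1973 Supreme Court case.
existing = {"scotus": [date(1973, 1, 22)]}
```

A case from a court we have never covered, or from a year we have no cases for, returns `False` and can be added straightaway.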
For the opinions that might have duplicates, we developed a duplicate-detecting algorithm. The process for this algorithm is as follows.
1: Create a set of candidate documents that might be duplicates by searching our existing corpus. First search it for cases in the same jurisdiction within 15 days of when it was issued and which have the same words in their name. Since names can vary greatly, the last word of the plaintiff and the first word in the defendant are used as queries, but only if those words:
- aren’t uppercase and less than three letters long (indicating an abbreviation);
- aren’t words that occur very frequently (indicating a stop word);
- don’t contain punctuation or numbers (indicating something out of the norm); and
- aren’t less than two letters long (indicating they’re an abbreviation).
Once this query returns, if it has results, we continue to step 2, but if not, we try a new query using the docket number instead of the case name. This often works, but if it fails we try one final time using the citations. Unfortunately we can’t use the citations for all queries because prior to this donation we did not have a good collection of citation information.
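The word-selection rules in step 1 could be sketched as follows. The stop-word list is a small hypothetical sample, and splitting the name on " v. " is an assumption about how the parties are separated.

```python
import string

# Small hypothetical sample of very frequent case-name words.
STOP_WORDS = {"the", "of", "and", "in", "state", "states", "united", "people"}


def usable(word):
    """Apply the four filters from the list above to one candidate word."""
    if word.isupper() and len(word) < 3:
        return False  # short uppercase word: probably an abbreviation
    if word.lower() in STOP_WORDS:
        return False  # too common to narrow the search
    if any(ch.isdigit() or ch in string.punctuation for ch in word):
        return False  # punctuation or numbers: something out of the norm
    if len(word) < 2:
        return False  # a single letter: probably an abbreviation
    return True


def name_query_words(case_name):
    """Last word of the plaintiff and first word of the defendant, if usable."""
    plaintiff, separator, defendant = case_name.partition(" v. ")
    if not separator:
        return []
    candidates = plaintiff.split()[-1:] + defendant.split()[:1]
    return [word for word in candidates if usable(word)]
```

When this returns an empty list, as it does for "United States v. AT&T Corp.", the search falls through to the docket number and then the citations.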
2: Once we have some cases that our new one might be a duplicate of, we attempt to match up the duplicates by docket number. This often works, but if it doesn’t, we gather statistics about the items our new document might be a duplicate of. Specifically, we gather:
- The edit distance between our new case and each of the candidates;
- The edit distance between the text of our case and each of the candidates;
- The difference in length between our case and each of the candidates; and
- The cosine similarity between our case and each of the candidates.
Once that’s gathered, we set it aside and move on to step three.
3: At this point, we compare the case names to see if any of them are good matches. We assume that if we have one candidate, if it has all the same words in its case name as does our new document, and if they’re in the right order, it must be a duplicate. So for example, Lissner v. Carver is a duplicate of Michael Lissner v. Brian Carver, but not of Carver v. Lissner (right words but wrong order).
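The ordered comparison in step 3 can be sketched as a subsequence test. This assumes that "all the same words" means every word of the shorter name appears, in order, in the longer one, which is what the Lissner v. Carver example suggests; the function names are hypothetical.

```python
def words_in_order(short_name, long_name):
    """True if every word of short_name appears in long_name, in order."""
    remaining = iter(long_name.lower().split())
    # `word in remaining` consumes the iterator up to the match, so later
    # words can only match at later positions.
    return all(word in remaining for word in short_name.lower().split())


def names_match(name_a, name_b):
    """Check the subsequence test in whichever direction applies."""
    return words_in_order(name_a, name_b) or words_in_order(name_b, name_a)
```

Here `names_match("Lissner v. Carver", "Michael Lissner v. Brian Carver")` is true, while swapping the parties to "Carver v. Lissner" makes it false.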
4: If this approach fails, our next step is to attempt a similarity test based on the docket number instead of the case name. This often works, but when it doesn’t, we have another approach, using the statistics generated above.
5: Our last approach is the statistical approach. Of the statistics generated above, the cosine similarity is very accurate and the others seem flawed in various ways. Cosine similarity takes all of the words in each case, counts how many times each one occurs, and treats those counts as a vector in a multi-dimensional space. Once we have a vector for each case, we measure how closely the new case’s vector points in the same direction as each candidate’s. If the two cases are very similar, they get a high similarity rating. If not, they get a low one. In our experience a good duplicate has a similarity of about 98%, and a dissent to the same case usually has a slightly lower similarity, generally around 97%. Anything below 90% is unrelated. The extent to which this approach works is remarkable, but it is slow and can lose accuracy if cases have additional words, say, as part of the header information.
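For readers who have not met cosine similarity before, here is a minimal sketch over raw word counts. The tokenizing rule is an assumption, and our production code may weight or normalize words differently.

```python
import math
import re
from collections import Counter


def word_counts(text):
    """Count lowercase word occurrences in a text."""
    return Counter(re.findall(r"[a-z0-9']+", text.lower()))


def cosine_similarity(text_a, text_b):
    """Cosine of the angle between two word-count vectors, from 0.0 to 1.0."""
    a, b = word_counts(text_a), word_counts(text_b)
    dot = sum(count * b[word] for word, count in a.items())
    norm_a = math.sqrt(sum(count * count for count in a.values()))
    norm_b = math.sqrt(sum(count * count for count in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty text is similar to nothing
    return dot / (norm_a * norm_b)
```

Two identical texts score 1.0 and two texts with no words in common score 0.0. Extra words in only one document, such as header text, lengthen that vector in directions the other lacks, which is why headers can drag the score down.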
6: After all this is done, if we still haven’t determined if we have any duplicates, our final approach is to send the new case and all its candidates to a human reviewer, who looks at their contents and makes a determination. Fortunately, this only happens rarely.
The Final Step
Once the meta data is extracted and any duplicates have been found, we take the best parts of each document, merge them together and save it to the index. Once this is complete the document shows up in search, in our bulk files and everywhere else.
This approach took many months of development, and it will receive another round of polish the next time we add a batch of cases. Until then, we hope this post has been educational, and that it can serve as a reference for any data merging projects.
Note: This is a technical post exploring and documenting the work that was done in order to build our new jurisdiction picker. If you’re not technically-inclined (or at least curious), you may want to move along before getting sucked in.
While prepping to import the Lawbox corpus, one of the many things we did was redesign our jurisdiction picker so it would support more than 350 jurisdictions. Completing this effort was a collaboration between me and a volunteer contributor, Peter Nguyen. Peter and I worked together iteratively, first building a wireframe of the jurisdiction picker, then a prototype, then the final version that you see today.
Before beginning, we outlined the use cases that the new picker should support. It should:
- Allow a user to select a single jurisdiction;
- Allow a user to select all jurisdictions from state, federal, district, bankruptcy or all of the above;
- Allow a user to select in hybrid mode – expanding a selection of state courts to the related federal courts or vice versa;
- Allow users to easily select the courts they desire by filtering to the ones they’re interested in;
- Support more than 300 jurisdictions without taking up too much space; and
- Be responsive and fast.
In the version we released yesterday, we accomplished most of these goals. Using the links at the top of the picker, it’s easy to select or clear an entire tab, a collection of tabs, or all tabs. Using the filter, it’s easy to select by typing, allowing our users to select the courts they want without skimming long lists. By typing just a few letters and pressing enter, you can make sophisticated jurisdiction selections.
There are still a few features we’d like to add to the picker, and you can watch for them soon. One is the hybrid selection mode, and the other is synonym support, so that typing words like “Eastern” will return courts in the “E.” district.
On the backend, we ran into a couple challenges while building the new picker. First, Internet Explorer 8 and below only support about 2000 characters in a URL. This causes lots of problems across the Web, but for us it meant that users selecting lots of jurisdictions would run into problems. The id codes we use for the jurisdictions are about four letters long, each, and our URLs used to look like
For every jurisdiction selected, it would send &court_id= along for the ride, making the URLs very long. Our new version tweaks this so that all the court identifiers are simply separated by a comma and sent in a single block. Much better, and of course the old URLs still work so long as they’re short enough.
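As an illustration of the two URL styles, here is a minimal sketch that builds the compact form and still reads the old one. The `court_id` parameter comes from the old URLs described above; the `court` parameter name, the court identifiers and the example.com address are hypothetical.

```python
from urllib.parse import parse_qs, urlencode, urlsplit


def build_query(court_ids):
    """Send every selected court in a single comma-separated block."""
    return urlencode({"court": ",".join(court_ids)}, safe=",")


def parse_courts(url):
    """Read court ids from either the new or the old URL style."""
    params = parse_qs(urlsplit(url).query)
    courts = []
    for value in params.get("court", []):      # new style: one block
        courts.extend(part for part in value.split(",") if part)
    courts.extend(params.get("court_id", []))  # old style: repeated key
    return courts


new_style = build_query(["scotus", "ca9", "cand"])
```

The compact form costs one character per extra court on top of its id, where the old form added the full `&court_id=` prefix each time.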
As we built the new court picker, we had several iterations. The first was a basic wireframe like so:
This didn’t work at all, but it provided a pretty good place to start. Around the same time, Peter was experimenting with the Chosen jQuery plugin, and built a prototype that greatly enhanced our sidebar without forcing people to use modal dialogs:
Not bad, but it made it very difficult to select lots of jurisdictions, which was a problem. We did another iteration of the modal dialog, and ended up with a working demo that looks like this:
Our final version is quite similar, but is changed in a couple significant ways. It has “Clear” and “Select All” links at the top, and it changes the filter so that it checks boxes as you type rather than hiding them out of view. The final version is now live on the site, and should provide a great foundation as we move forward. We’re already investigating more jurisdictions, and we expect it won’t be too hard next time.
For Immediate Release — Berkeley, CA
After many years of collecting and curating data, today CourtListener crossed some incredible boundaries. Thanks to a generous data donation from Lawbox LLC, our computers are currently adding more than 1.5M new opinions to CourtListener, expanding our coverage to a total of more than 350 jurisdictions. This new data enables legal professionals and researchers insight into data that has never before been available in bulk and greatly enhances the data we previously had. This data will be slowly rolling out in our front end, and will soon be available in bulk from our bulk downloads page. A new version of our coverage page was developed, and, as always, you can see our current coverage for any jurisdiction we support.
It’s difficult to overstate the importance of this new data. In addition to being a massive expansion of our coverage, it also brings some notable improvements to the project:
- For all of the new data and much of our old data, we have added star pagination throughout. For the first time, this will make pinpoint citations possible using the CourtListener platform.
- We’ve re-organized our database for more accurate citations, enabling for the first time the creation of a citation cross-walk. We will soon be releasing an API for our data and when we do, a simple query for a citation could tell you equivalent citations for that opinion. For example, a query for a Supreme Court opinion could tell you its citation in West’s Supreme Court Reporter, Lawyers’ Edition, and a historical citation, like one to Howard’s Supreme Court Reports. Similarly, for courts with neutral citations, one could query the neutral citation and get back the citation in the regional reporter and state reporter or vice versa. This has long been a pipe dream for numerous legal professionals and will soon be a reality.
- This fills in previously unknown gaps in the data available from Resource.org. Although it is often considered complete, we have identified a few small gaps, which this donation has corrected.
- We’ve completed a first pass at extracting judge information from all of the new opinions. This feature is still in beta since our extraction is not comprehensive, but this feature can be used for rough queries starting immediately.
- We’ve created a massive database of all known reporters and released it for free to the public. In addition to containing all of the reporters we found when working with this donation, it contains variations for their names as found in our corpus and in the Cardiff Index to Legal Abbreviations. This database can be used in citation finders or other tools, like the Free Law Ferret.
- We’ve created a new database of American jurisdictions. It currently contains 351 jurisdictions and can be used to create systems such as CourtListener. The data is not yet complete and we welcome your contributions.
As you can tell, this is a very big day for the Free Law Project and the legal world — one that we’ve quietly been working towards for months. Over the remainder of the week we will be writing two additional posts about this topic, explaining the design work behind our new jurisdiction picker, and the process we use to merge new corpuses in with our existing data.
We hope that these new opinions and features will unleash a new surge in legal research and technology, and that you’ll help support our project so that we can continue bringing these technologies and information to the fore.