How to scrape webpages in 2022 and best ways to remove boilerplate
November 02, 2022
Scraping webpages is an important part of many NLP workflows. It's challenging, though, because webpages are complex and carry a lot of excess text. If you're working on a data science project, what can you do about it?
"Boilerplate" refers to text that appears on many web pages and doesn't add much to the meaning of a page. Typically it includes things such as navigation elements, legal disclosures, consent spam (ironically, when looking for a "consent spam" article, I got a non-dismissible, non-compliant cookie spam banner), headers and footers, sponsored content widgets, ads, etc. For example, here are two screenshots of webpages showing the main content highlighted in blue, with everything else being boilerplate.
Ok, so how are we going to solve this problem?
Twelve years ago, a good paper was published about a decision-tree-based algorithm called boilerpipe, which used a lot of HTML features to decide whether a given piece of text on a webpage should be classified as boilerplate. This library has been the standard bearer for a long time, but the web has changed a lot in 12 years! Thankfully, there is a new kid on the block called Trafilatura. It's a pretty easy-to-use library! Let's give it a try on that CNN article.
>>> from trafilatura import fetch_url, extract
>>> from rich import print  # make it pretty!
>>> download = fetch_url("https://www.cnn.com/2022/08/04/health/updated-boosters-fall/index.html")
>>> content = extract(download)  # strip the boilerplate, keep the main text
>>> print(content.split("\n")[0:10])  # print the first 10 lines
[
    "(CNN)This fall, Americans could get boosted with a mRNA Covid-19 vaccine unlike any that's come before.",
    'Both Pfizer and Moderna are working on bivalent boosters: vaccines made up of both the old formula and a new one that targets the Omicron BA.4 and BA.5 subvariants of the coronavirus.',
    'If the shots meet US Food and Drug Administration standards, they will probably be available as early as September, the FDA says.',
    'But cases are high now. There are about 124,000 new cases reported each day -- far from the levels reported during the Omicron surge, but nearing peak case rates from the Delta wave -- and cases are more undercounted than ever.',
    'Some experts wonder whether the Omicron-specific boosters will come in time to make a difference and if they will actually offer more protection than the current shots.',
    'A prediction game',
    'The current shots are based on the original strain of the virus and offered nearly full protection, even from infection, early on. With new variants in circulation, the vaccines still are good at keeping people out of the hospital, but most scientists think people need a vaccine that offers more protection.',
    'Dr. Michael Chang, a pediatric infectious disease specialist at Memorial Hermann Health System in Houston, thinks vaccines with an Omicron component will be helpful -- within limits.',
    '"I just wish that the timing had been sooner so that we could actually be dealing with the kind of BA.5 surge that we have right now," he said.',
    'With the highly contagious BA.5 subvariant now dominant, the goal of minimizing the number of infections is "kind of lost," but the new vaccines should help keep hospitalizations and deaths down, Chang said.'
]
Pretty cool! Let's see how it does on this Wired article with an annoying GDPR popup.
>>> download = fetch_url("https://www.wired.co.uk/article/gdpr-cookie-consent-eprivacy")
>>> content = extract(download)
>>> print(content.split("\n")[0:5])
[
    'The user experience of browsing the web is worse than ever. Even if you only spend a tiny amount of time online, it’s impossible to escape cookie consent notices. They’re the intrusive banners and blocks that appear each time you visit a new website that collects data about you through cookies. Each is asking the same question: will you allow this website to collect your information?',
    'The spread of these cookie notices is down to European legislation. A combination of GDPR and how it altered the ePrivacy Directive forced pretty much every site on the web to ensure people in Europe clicked ‘allow’.',
    'The legal changes were meant to make understanding web tracking easier for everyone. But two years after the arrival of GDPR, cookie consent notices are a blight on the web. Researchers have found that they use dark patterns to trick people into clicking ‘yes’, with a lack of enforcement against websites that don’t comply with the rules – and a general lack of awareness of what the cookie notices are meant to achieve – creating a real mess.',
    '“Usually people click to get it away because it’s really big on the screen,” says Estelle Massé, a senior policy analyst and global data protection lead at non-profit internet advocacy group Access Now. “You want to move on. You don’t actually read what is happening, you don’t actually know what you’re consenting to. It’s not really helpful as a tool.”'
]
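Both examples repeat the same two calls, so if you're scraping more than a couple of pages it may be worth wrapping them in a small helper. Here's a rough sketch of how that could look. The function names `scrape_main_text` and `first_paragraphs` are my own and not part of Trafilatura; the only library calls are the `fetch_url` and `extract` functions used above. Both of those can return `None` (for instance when the download fails or no main text is found), so the sketch guards against that instead of crashing on `.split`.

```python
from typing import List, Optional


def scrape_main_text(url: str) -> Optional[str]:
    """Download a page and return its main text, or None if either step fails.

    Hypothetical helper: only fetch_url and extract come from Trafilatura.
    """
    # Imported here so the pure helper below works without the dependency.
    from trafilatura import extract, fetch_url

    downloaded = fetch_url(url)
    if downloaded is None:  # download failed (network error, blocked, ...)
        return None
    return extract(downloaded)  # may itself be None if no main text was found


def first_paragraphs(text: Optional[str], n: int = 5) -> List[str]:
    """Return the first n non-empty lines of the extracted text."""
    if not text:
        return []
    return [line for line in text.split("\n") if line.strip()][:n]


if __name__ == "__main__":
    content = scrape_main_text(
        "https://www.wired.co.uk/article/gdpr-cookie-consent-eprivacy"
    )
    for paragraph in first_paragraphs(content, n=5):
        print(paragraph)
```

An empty list from `first_paragraphs` then tells you a page needs a closer look, rather than the script dying halfway through a batch.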
Nice! I'm a happy user. In a later blog post, I'll describe how we can embed this extracted text for machine learning.