Extracting information from websites is one of the most important skills in modern data science, because the Internet is now the key source of information for many kinds of studies.
Parsing websites can be a tricky business, especially for those with limited technical knowledge. Extracting data from HTML and other web formats is no easy feat: you first have to figure out the structure of the site and decide which parts you need before you can do any parsing. Then there are things like JavaScript, which can add further layers of complexity, for example when content is rendered in the browser and is missing from the raw HTML. If you’re not careful, it’s easy to miss important information or to parse duplicate records.
An additional difficulty in extracting information from websites is that all sites differ in structure, as well as in code and markup. And the bigger and older the website, the more nuances you need to consider.
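To make these difficulties concrete, here is a minimal sketch that uses only Python's standard library. The HTML snippet, the `post-title` class name, and the helper names (`TitleLinkParser`, `extract_articles`) are all invented for illustration and do not reflect the markup of any real site. The sketch shows the two steps mentioned above: picking out only the elements you care about, and skipping records you have already seen.

```python
from html.parser import HTMLParser

# Hypothetical markup, invented for this example; real sites differ.
SAMPLE_HTML = """
<div class="results">
  <article><a class="post-title" href="/post/1">Intro to Python</a></article>
  <article><a class="post-title" href="/post/2">Web scraping basics</a></article>
  <article><a class="post-title" href="/post/1">Intro to Python</a></article>
  <a class="nav" href="/about">About</a>
</div>
"""


class TitleLinkParser(HTMLParser):
    """Collects (title, href) pairs from <a class="post-title"> elements."""

    def __init__(self):
        super().__init__()
        self.records = []    # unique (title, href) pairs, in page order
        self._seen = set()   # hrefs already recorded, used to drop duplicates
        self._href = None    # href of the title link currently being read
        self._text = []      # text fragments found inside that link

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        # Only links with the (assumed) title class are of interest.
        if tag == "a" and attrs.get("class") == "post-title":
            self._href = attrs.get("href")
            self._text = []

    def handle_data(self, data):
        # Accumulate text only while inside a title link.
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            if self._href not in self._seen:
                self._seen.add(self._href)
                title = "".join(self._text).strip()
                self.records.append((title, self._href))
            self._href = None


def extract_articles(html):
    """Return unique (title, href) pairs found in the given HTML string."""
    parser = TitleLinkParser()
    parser.feed(html)
    parser.close()
    return parser.records


articles = extract_articles(SAMPLE_HTML)
for title, href in articles:
    print(title, href)
```

The navigation link is ignored because it lacks the assumed class, and the repeated `/post/1` entry is dropped because its `href` has already been seen. On a real site, the tag and class to match would have to be found by inspecting the page source first.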
In this post I want to share a technique for parsing one of the largest Russian-language IT websites, habr.com.
Habr is a great community-driven website that offers a wealth of knowledge on the latest topics in technology, science, business and more. With regularly updated content written by professionals in their respective industries, habr.com provides educational and interesting articles on areas like programming, DevOps, data analysis and software engineering in general.
Besides professional blogs, Habr also hosts tutorials and live discussions, which helps its users stay up to date with the ever-evolving tech world. In addition to this regular content, habr.com also puts out special publications dedicated to specific technical subjects or projects. The user experience is good as well: the site is organized so that you can move quickly between topics and easily find what you’re looking for.
Because of my specialty, I often come to this website looking for news from the IT industry or new articles on programming. So below I will show you how to quickly retrieve all the articles from this website for the query you want (using “Python” as an example).