Scrapera: A universal tool of scraper scripts for humans
🎉 Introduction 🎉
Today’s technological advancements are heavily dependent on the availability of structured data. Though data is available in huge quantities all over the internet, data collection and cleaning are major tasks for any kind of analysis or model development. This is where Scrapera comes in to save the day!
🙋♂️ About Scrapera 🙋♂️
Scrapera is an all-in-one tool for common scraping domains such as images, text, audio, and video. It aims to gather common scraping tasks under a single library, which makes them convenient to use. With Scrapera, data scientists and ML researchers can focus their time and energy on building better preprocessing pipelines and training models rather than on collecting data. The package is designed to return data that is clean and reliable.
A list of all scrapers supported by Scrapera can be found here
🤔 Why would you use Scrapera? 🤔
Okay, but why use Scrapera when you can write your own custom scripts? Here are some unique features that will help you decide:
- Scrapera is updated weekly, sometimes biweekly, to keep its scrapers up to date.
- Scrapera is completely ChromeDriver-free. All data is extracted directly from public API endpoints instead of incurring the heavy overhead of running a browser. No extensions or external files, just Python 3 and its modules.
- Avoiding the browser and minimizing CSS-dependent code makes Scrapera extremely fast and robust to unforeseen DOM changes.
- The library is well structured, and modules are separated by domain, which makes them easy to use.
- All scrapers have their own examples and documentation, which makes it easier to look things up and stay updated. Just import, instantiate, and execute!
- Scrapera has full proxy support and provides dedicated, fully tweakable sleep cycles to prevent detection and avoid overloading or crashing servers.
- No CAPTCHA problems! Yes, you heard that right. Scrapera uses unique and specific headers to simulate browser configurations and avoid CAPTCHAs.
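The points above about headers, proxies, and sleep cycles can be illustrated with the standard library alone. The following is a minimal sketch and not Scrapera's actual code: the header values and the `build_request`, `make_opener`, and `polite_delay` helpers are all assumptions made for illustration.

```python
import random
import urllib.request

# Hypothetical browser-like headers; the header sets Scrapera really sends may differ.
BROWSER_HEADERS = {
    "User-Agent": (
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
        "(KHTML, like Gecko) Chrome/120.0 Safari/537.36"
    ),
    "Accept": "application/json, text/plain, */*",
    "Accept-Language": "en-US,en;q=0.9",
}


def build_request(url, headers=None):
    """Return a urllib Request that carries browser-like headers."""
    merged = dict(BROWSER_HEADERS)
    merged.update(headers or {})  # caller-supplied headers override the defaults
    return urllib.request.Request(url, headers=merged)


def make_opener(proxy_url=None):
    """Return an opener that routes traffic through proxy_url, if one is given."""
    if proxy_url is None:
        return urllib.request.build_opener()
    handler = urllib.request.ProxyHandler({"http": proxy_url, "https": proxy_url})
    return urllib.request.build_opener(handler)


def polite_delay(min_seconds=1.0, max_seconds=3.0):
    """Pick a random pause length, in seconds, to wait between requests."""
    if min_seconds > max_seconds:
        raise ValueError("min_seconds must not exceed max_seconds")
    return random.uniform(min_seconds, max_seconds)
```

In a scraping loop, one would call `time.sleep(polite_delay())` between requests and send each request built by `build_request` through the opener returned by `make_opener`.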
If these points aren’t enough to convince you, Scrapera is also completely open source and depends heavily on the active contributors who make it what it is today.
🚀 How to get started? 🚀
Scrapera is a powerful library, but its true power lies in simplicity. A simple pip install scrapera will get you the latest version.
All modules are maintained separately by contributors, and issues are fixed as soon as possible so that released versions stay reliable. The structure of these modules is easy to grasp. The usual import pattern for a desired scraper is:
scrapera/{domain}/{website}
This level of abstraction is very convenient for collecting data for quick experiments and analysis.
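As a rough illustration of that layout, the path scrapera/{domain}/{website} corresponds to a dotted Python module path. The sketch below only builds and resolves such a path. The `scraper_module_path` and `load_scraper_module` helpers are hypothetical and not part of Scrapera, and any domain or website names passed to them are placeholders rather than confirmed modules.

```python
import importlib


def scraper_module_path(domain, website):
    """Turn a domain/website pair into a dotted module path."""
    return f"scrapera.{domain}.{website}"


def load_scraper_module(domain, website):
    """Import the module for a scraper, or return None if it is unavailable."""
    try:
        return importlib.import_module(scraper_module_path(domain, website))
    except ImportError:
        # Scrapera is not installed, or no such domain/website module exists.
        return None
```

For example, `scraper_module_path("video", "vimeo")` yields the string `scrapera.video.vimeo`; whether a module by that name exists should be checked against the project's list of supported scrapers.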
🤝 Contributing to Scrapera 🤝
It takes both sides to build a bridge. — Frederik Nael
Since its inception, Scrapera has relied heavily on the open source community and experts to keep it constantly up to date.
If you are passionate about data and feel that this library was helpful to you, then I would like to invite you to contribute to this project and be a part of its success.
✏️ Conclusion ✏️
Scrapera is a quick way to get data into your hands, and this kind of automation is essential for today’s growing data requirements.
Some new features that will be added in the coming days are:
- Support for more websites
- Automatic rotation of proxies
- Automatic header switching
- Dynamic sleep time allocations
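To give a feel for what rotation and dynamic sleep times could look like, here is a minimal sketch under stated assumptions. It is not Scrapera's planned implementation: the `Rotator` class, the `next_delay` function, and the choice of doubling the delay on HTTP 429 or 503 responses are all hypothetical.

```python
import itertools


class Rotator:
    """Cycle endlessly through a fixed pool of values (proxies or header sets)."""

    def __init__(self, pool):
        items = list(pool)
        if not items:
            raise ValueError("pool must not be empty")
        self._cycle = itertools.cycle(items)

    def next(self):
        """Return the next value in the pool, wrapping around at the end."""
        return next(self._cycle)


def next_delay(current, status_code, base=1.0, ceiling=60.0):
    """Double the delay after a throttling response, otherwise reset to base."""
    if status_code in (429, 503):
        return min(max(current, base) * 2, ceiling)
    return base
```

A scraper could keep one `Rotator` for proxies and another for header sets, call `next()` on each before every request, and feed each response status into `next_delay` to decide how long to sleep before the next one.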
If you don’t want to miss these updates, star the GitHub repository and turn on watching so that you catch ’em all!