Writing for RefinePro

Copywriting of a series of articles on ETL and web scraping, published for RefinePro under Martin Magdinier’s name.

(The client may have modified the articles since, without my knowledge.)

(Photo by Artem Sapegin.)


Article 1, published March 1, 2020

YOU’RE LOOKING FOR A WEB SCRAPING TOOL? LOOK FOR A SERVICE INSTEAD

Is web scraping easy? No. Is it profitable? It can be, with a solid scheduling, monitoring, and maintenance plan. Learn how!

Your Python skills are not too bad, and it’s not your first time dealing with tech. In fact, technology is part of your daily life. You’ve heard how web scraping could help automate your business. So, you start looking for the best scrapers and web scraping tools, hoping maybe to boost your e-commerce sales.

Not a bad plan in the short term. But what about maintaining your web scraping system? What will you do when your target websites start updating? And what’s your solution if you get blocked and can’t get the data you need?

Web scraping has opened the door to big data in a world where every market analysis and business strategy relies on it. But even though the web offers a plethora of information on the “how to,” few take the time to explain the long-term implications of maintaining a web scraping infrastructure. And that’s because web scraping is not a product. It’s a service. Here’s why.
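To make those long-term worries concrete, here is a minimal sketch of the kind of upkeep a scraper demands: retrying when a request is blocked, and flagging layout changes for maintenance. The URL and CSS selector are hypothetical placeholders, not taken from the article.

```python
import time

import requests
from bs4 import BeautifulSoup

URL = "https://example.com/products"    # hypothetical target site
PRICE_SELECTOR = ".product .price"      # hypothetical CSS selector

def fetch_with_retry(url, attempts=3, backoff=5):
    """Retry with a growing delay -- a crude answer to temporary blocks."""
    for i in range(attempts):
        response = requests.get(url, timeout=10)
        if response.status_code == 200:
            return response.text
        time.sleep(backoff * (i + 1))   # wait longer before each retry
    raise RuntimeError(f"Still blocked after {attempts} attempts")

html = fetch_with_retry(URL)
prices = BeautifulSoup(html, "html.parser").select(PRICE_SELECTOR)
if not prices:
    # An empty result usually means the page structure changed, not that
    # the data vanished -- time to update the scraper.
    print("Selector found nothing; the target site may have been redesigned.")
```

Even this toy version hints at the real cost: every target site needs its own retry policy, its own change detection, and someone on call to fix it.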

Read the full article here, or on the client’s website.


Article 2, published March 12, 2020

THE WHO, THE WHAT, AND THE “WITH WHAT” OF WEB SCRAPING

Data is the new differentiator. It’s what you, a product owner, a marketing strategist, your local journalist, and a multimillionaire who already owns twelve successful companies all need. And web scraping is one way to get that data.

But where to start? Sure, the Internet will give you everything you need to know. Soon, you’ll come across lists of the best tools available, each with a name that will never make the list for best marketing decision of the year: Octoparse, Scrapy, BeautifulSoup, ParseHub, Mozenda… But how to choose? Looking at descriptions and ratings is a good system when you’re buying shoes, but not when you’re trying to find the best scraper for your project.
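To give a sense of what these tools actually do, here is a minimal sketch using BeautifulSoup, one of the code libraries named above. The parsing itself is one line; everything around it is still your job. The URL and tags are hypothetical.

```python
import requests
from bs4 import BeautifulSoup

# Fetch a page and pull every headline. The parsing is trivial, but
# crawling, scheduling, retries, and storage all remain up to you.
html = requests.get("https://example.com/articles", timeout=10).text
soup = BeautifulSoup(html, "html.parser")
titles = [h2.get_text(strip=True) for h2 in soup.find_all("h2")]
print(titles)
```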

Think about it: if you’re about to send something out there on the web to gather the reliable data you need, you have to make sure that the tool you’re using is the best for your project and your specific goals. A pair of flip-flops, no matter how good the ratings, won’t get you far if you’re visiting Norway in December. And that’s why we put so much energy into helping our clients find the right tools for their projects. Over the years, we’ve tested many of them, and you can now benefit from our experience. Here’s what we know.

Read the full article here, or on the client’s website.


Article 3, published April 5, 2020

HOW TO MAINTAIN DATA QUALITY AT EVERY STEP OF YOUR PIPELINE

Maintaining the quality of your data is paramount to any web scraping or data integration project. Think about it: there’s absolutely no point in collecting a massive amount of data if you can’t rely on it to make sound decisions! And the only way to maintain high quality is by implementing quality checks and validation at every step of your data pipeline. As the saying goes: garbage in, garbage out!

We’ve already discussed what a good ETL (Extract, Transform, Load) tool is and what it should do; now we’ll look at how data quality assurance fits into that process. ETL is the process of extracting, transforming, and loading data: it defines what, when, and how data gets from your source sites to your readable database. Data quality, on the other hand, relies on checks implemented from the early stage of extraction all the way to the final loading of your data into readable databases.
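As an illustration only (the article’s own checkpoints may differ), here is a minimal sketch of validation at each of the three ETL stages. The record fields and rules are hypothetical.

```python
def extract(raw_rows):
    # Checkpoint 1: reject structurally broken rows at the door.
    return [r for r in raw_rows if r.get("name") and r.get("price")]

def transform(rows):
    # Checkpoint 2: validate again after conversion, not just before it.
    out = []
    for r in rows:
        price = float(r["price"].replace("$", ""))
        if price > 0:                       # a simple sanity rule
            out.append({"name": r["name"].strip(), "price": price})
    return out

def load(rows, minimum_expected=1):
    # Checkpoint 3: refuse to load a suspiciously small batch.
    if len(rows) < minimum_expected:
        raise ValueError("Batch too small; extraction may have failed upstream")
    print(f"Loading {len(rows)} validated rows")

load(transform(extract([{"name": " Widget ", "price": "$9.99"}, {"name": ""}])))
```

Garbage gets filtered at every gate, so what reaches the database is data you can actually rely on.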

Choosing the right scraper and the right ETL tools will help you streamline this process, but these tools don’t automatically guarantee the quality of your end results. That’s why you need to work with a partner, like us, who will put in place all the proper checkpoints.

Read the full article here, or on the client’s website.


Article 4, published May 15, 2020

14 RULES TO SUCCEED WITH YOUR ETL PROJECT

Extracting, transforming, and loading (ETL) data is a complex process at the center of most organizations’ data extraction projects. As we saw in our article on web scraping and ETL, the implementation of an ETL workflow is a process that requires a lot of in-depth knowledge in several subfields of statistics and programming.

ETL developers thus work at the crossroads of different fields. They must make sure that every aspect of the data life cycle has been addressed, to guarantee operability and maintainability. To help your developer navigate the deep and dark waters of ETL, we’ve drawn on our years of experience to create a list of ETL principles and best practices.

What you see here is not meant to be a grocery list; these guidelines need to be considered, and then implemented or rejected. Your developer will draw on their understanding of the project and their experience to decide which principles are needed, when, and to what extent.
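To show what such a principle looks like in practice, here is one that comes up in most ETL checklists: idempotent loads, meaning a batch can be re-run after a crash without duplicating data. This is an illustration only; the article’s own fourteen rules may emphasize different principles, and the table and fields below are hypothetical.

```python
import sqlite3

def load_idempotent(conn, rows):
    """Re-running the same batch must not duplicate data."""
    # INSERT OR REPLACE keyed on a unique id makes the load safely repeatable.
    conn.executemany(
        "INSERT OR REPLACE INTO products (id, name, price) VALUES (?, ?, ?)",
        rows,
    )
    conn.commit()

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE products (id INTEGER PRIMARY KEY, name TEXT, price REAL)")
load_idempotent(conn, [(1, "Widget", 9.99)])
load_idempotent(conn, [(1, "Widget", 9.99)])   # second run: still one row
print(conn.execute("SELECT COUNT(*) FROM products").fetchone()[0])  # -> 1
```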

Read the full article here, or on the client’s website.


Article 5, published May 17, 2020

HOW TO DIVIDE AND CONQUER YOUR DATA PROJECT FOR SUCCESS

Data extraction is now one of the most efficient ways for companies not only to stay up to date with current events and trends, but also to position themselves in their field. Yet for a lot of small entrepreneurs and even larger companies, the implementation of data extraction projects presents new challenges: How should these processes be implemented, and by whom?

Web scraping is the process by which data is extracted from different sources and then transformed into usable information. As such, a huge part of any web scraping project relies on a strong Extraction, Transformation, and Loading process, known as ETL. But building a solid ETL architecture for your web scraping requires a lot of technical know-how, combined with the knowledge necessary to adapt these “easy-to-use” tools to your specific needs. Most importantly, your project will also rely on many other crucial processes, including data quality management and administrative procedures.
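In the spirit of the article’s title, here is a minimal sketch of what dividing a pipeline into conquerable pieces can look like: each stage is a small function with one responsibility, so it can be built, tested, and handed off independently. The stages and field names are hypothetical.

```python
import json

def extract():
    # Stand-in for a real source: an API call, a scrape, a file drop.
    return ['{"city": "Oslo", "temp": 3}']

def transform(raw_lines):
    return [json.loads(line) for line in raw_lines]

def validate(records):
    return [r for r in records if "city" in r and "temp" in r]

def load(records):
    print(f"Stored {len(records)} records")

# The pipeline is just the composition of its parts; each stage is a
# separate piece of the project that a different person or tool can own.
load(validate(transform(extract())))
```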

Read the full article here, or on the client’s website.


Article 6, published May 18, 2020

PDF EXTRACTION – EVERYTHING YOU NEED TO KNOW

Read the full article here, or on the client’s website.


Article 7, published May 25, 2020

10 QUESTIONS TO ASK BEFORE USING NEW DATA

Data extraction projects are complex and often require quite a lot of time and effort. To make sure your organization is creating value and that your money and your time are well spent, the first logical step is to choose your sources carefully. To help you achieve just that, we created a list of 10 questions you need to ask before you set your sights on a dataset. The goal here is to collect and analyze all the existing information about the data in order to clarify its ownership, publication, structure, content, quality, relationships, etc. Only by going through this process can you guarantee the suitability of your sources and identify potential problems and particularities.
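Several of those questions (structure, content, quality) can be answered with a quick profiling pass. Here is a hedged sketch with pandas; the toy dataset stands in for whatever source you are evaluating.

```python
import pandas as pd

# Hypothetical dataset under review; in practice, df = pd.read_csv(...).
df = pd.DataFrame({"city": ["Oslo", "Oslo", None], "temp": [3, 3, 99]})

print(df.dtypes)                   # structure: which columns, which types?
print(df.isna().mean())            # quality: share of missing values per column
print(df.duplicated().sum())       # quality: how many exact duplicate rows?
print(df.describe(include="all"))  # content: ranges, frequencies, outliers
```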

This checklist will help you assess all the elements you need to know in order to proceed with your data project. Most of all, once you have the answers, you will have everything you need to define your game plan for transforming and manipulating the datasets you chose.

Read the full article here, or on the client’s website.


Article 8, published July 9, 2020

THE SECRET TO LONG-TERM GROWTH – OR WHY DATA IS THE NEW OIL

Data is the new oil. It stands at the center of an organization’s value proposition, at the core of its product or service creation process. Understanding and managing data is a core competency, not a by-product.

COMPANIES THAT ARE LEADERS IN THE USE OF DATA ARE THREE TIMES MORE LIKELY TO BE FINANCIALLY SUCCESSFUL. (Source: Economist Intelligence Unit)

It’s estimated that some 40,000 exabytes of data were created, replicated, or consumed in 2020, compared to 1,200 in 2010. And this increase is happening in almost every industry. Organizations must therefore learn to exploit and refine data if they want to grow. And to do so, they need to understand how their data strategy evolves at every step of the customer journey and product life cycle. They need their employees to know how to use, read, and interpret data. And they need to know how to build products and services that are driven by data.

Read the full article here, or on the client’s website.


Need someone to write blog posts for your website? Contact me!