When I saw this title, I assumed it was about using ML to build dynamic scrapers that self-configure as the data elements change. Does that exist anywhere?
At a very high level, it's similar. We use computer vision and ML to extract structured data from any web page, even ones we haven't seen before. https://www.diffbot.com/
If anyone has any questions or wants to try it out, feel free to email me directly at dru@diffbot.com
Interesting read. I liked the part about identifying rare templates. Plus, if I ever plan to do ML on HTML, the article gave me a good place to start on feature extraction.
It might have some uses in identifying phishing URLs, etc.
Not sure why you're getting down-voted. You're right, the author didn't close the loop. Typically we would expect to see an insight after you apply the ML technique.
So they extracted features, clustered the pages and found... what?
I am sure they learned something but it might be proprietary.
What I'd really like to see is a presentation of an algorithm that automatically recognizes and hides the first dismissive HN comment that inevitably appears. Any takers?
On a more serious note, I would not mind a bit more critical and less clickbaity attitude in tech.
Not every if statement or regexp is ML and AI, and not everything requires blockchain.