Picking 25 samples of every domain - Mailing list pgsql-novice
From | Gary Warner |
---|---|
Subject | Picking 25 samples of every domain |
Date | |
Msg-id | 5436FD55.2010806@askgar.com Whole thread Raw |
Responses |
Re: Picking 25 samples of every domain
|
List | pgsql-novice |
-----BEGIN PGP SIGNED MESSAGE----- Hash: SHA1 I have a set of Postgres tables that are related to URLs found in email. The tables are BIG. 40-50 million records per day. We are using them for some research into tricks spammers use to confound blacklists. When we parse the URLs, we pull out the "domain" portion of each URL and store it in a field called "top_domain". The full URL is available as "link". Through various forms of randomization, customization, and wild-carding, a domain may have as many as 1 million URLs per day. I am needing a query that would grab a sample number of URLs per domain (let's say 25 for conversation) . . . something that in pseudo-code might look like for each top_domain in urltable do select top_domain, link limit 25; Thoughts on the fastest way to do a query like that? Real examples (to sort of show the point . . .) spam_urls=# select top_domain, link from spam_info_2014_09_20 where top_domain = 'pacedoctor[.]ru' limit 10; (To prevent risk, I'm changing "." to "[.]" top_domain | link - ---------------+----------------------------------- pacedoctor[.]ru | http://kftdouhixn[.]pacedoctor[.]ru pacedoctor[.]ru | http://ozukalaj[.]pacedoctor[.]ru pacedoctor[.]ru | http://knlssrmp[.]pacedoctor[.]ru pacedoctor[.]ru | http://aaztcobtwztx[.]pacedoctor[.]ru pacedoctor[.]ru | http://kqyarrarn[.]pacedoctor[.]ru pacedoctor[.]ru | http://zzj[.]pacedoctor[.]ru pacedoctor[.]ru | http://piqtj[.]pacedoctor[.]ru pacedoctor[.]ru | http://wahednc[.]pacedoctor[.]ru pacedoctor[.]ru | http://fjhmin[.]pacedoctor[.]ru pacedoctor[.]ru | http://wevtwo[.]pacedoctor[.]ru (10 rows) top_domain | link - ------------+----------------------------------------------------------------------- i-cer[.]com | http://www[.]i-cer[.]com/6907-372-837-334251002/receptionist/tindex8[.]html i-cer[.]com | http://www[.]i-cer[.]com/6907-372-837-334251002/receptionist/tindex7[.]html i-cer[.]com | http://www[.]i-cer[.]com/6907-372-837-334251002/receptionist/tindex6[.]html i-cer[.]com | http://www[.]i-cer[.]com/6907-372-837-334251002/receptionist/tindex5[.]html i-cer[.]com | http://www[.]i-cer[.]com/6907-372-837-334251002/receptionist/tindex4[.]html i-cer[.]com | http://www[.]i-cer[.]com/6907-372-837-334251002/receptionist/tindex3[.]html i-cer[.]com | http://www[.]i-cer[.]com/6907-372-837-334251002/receptionist/tindex2[.]html i-cer[.]com | http://www[.]i-cer[.]com/6907-372-837-334251002/receptionist/tindex1[.]html i-cer[.]com | http://www[.]i-cer[.]com/6907-372-838-106207395/vonmill/u[.]html i-cer[.]com | http://www[.]i-cer[.]com/6907-372-838-106207395/vonmill/rindex14[.]html (10 rows) top_domain | link - --------------+--------------------------------------------------------------------- ezzesoft[.]com | http://www[.]ezzesoft[.]com/6903-320-722-440072897/yur/u[.]html ezzesoft[.]com | http://www[.]ezzesoft[.]com/6903-320-722-440072897/yur/rindex5[.]html ezzesoft[.]com | http://www[.]ezzesoft[.]com/6903-320-722-440072897/yur/tindex4[.]html ezzesoft[.]com | http://www[.]ezzesoft[.]com/6903-320-722-440072897/yur/tindex3[.]html ezzesoft[.]com | http://www[.]ezzesoft[.]com/6903-320-722-440072897/yur/tindex2[.]html ezzesoft[.]com | http://www[.]ezzesoft[.]com/6903-320-722-440072897/yur/tindex1[.]html ezzesoft[.]com | http://www[.]ezzesoft[.]com/6903-320-722-238360728/kthomas/u[.]html ezzesoft[.]com | http://www[.]ezzesoft[.]com/6903-320-722-238360728/kthomas/rindex5[.]html ezzesoft[.]com | http://www[.]ezzesoft[.]com/6903-320-722-238360728/kthomas/tindex4[.]html ezzesoft[.]com | http://www[.]ezzesoft[.]com/6903-320-722-238360728/kthomas/tindex3[.]html (10 rows) - -- Thanks for any suggestions! _-_ gar -----BEGIN PGP SIGNATURE----- Version: GnuPG v2.0.17 (MingW32) Comment: Using GnuPG with Thunderbird - http://www.enigmail.net/ iEYEARECAAYFAlQ2/VUACgkQg79eYCOO6PtTiwCbBomLVmWST81FEI3eQX0g5HrI C8EAnjoXM6gb4+sn4DROXd/3IvFRoBoV =50PP -----END PGP SIGNATURE-----
pgsql-novice by date: