MSc Methods & Statistics
Working at Jibes Data Analytics
Open source projects:
yagmail | send emails in 2 lines (html/attach) | 246 |
sky | next-gen intelligent web scraping | 57 |
gittyleaks | find users/keys/pass in git repos | 18 |
pytrending | discover trending python | 10 |
xtoy | automatic prep/model/predict | 2 |
Interesting for Python because:
{"domain": "http://www.gtbit.org",
 "url": "http://www.gtbit.org/news/viewitem.php?id=40",
 "injectable": true,
 "on line": true,
 "error": false,
 "at line": false,
 "time": "Wed Oct 28 00:59:39 2015",
 "warning": true,
 "failed_request": false,
 "emails": ["gtbit@rediffmail.com", "inderjeet@gmail.com"],
 "sql": true}
Use sqlmap (written in Python) to figure out which tables are in the database.

Action | Amount |
---|---|
Web data of 145TB | 1.81 billion |
URL contains "php?" | 109.715 |
Keep only unique domains | 27.046 |
Append single quote | |
Test for SQL errors on page | 1.742 |
if error: scan homepage + contact for email | 692 |
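
The first filtering steps in the table (keep URLs containing "php?", keep one URL per unique domain, append a single quote) can be sketched roughly as below. This is a minimal illustration and not the code used for the talk: the function names `unique_domain_urls` and `with_single_quote` are hypothetical, and treating the lowercased network location as "the domain" is an assumption.

```python
from urllib.parse import urlsplit


def unique_domain_urls(urls, marker="php?"):
    """Keep the first URL seen per domain among URLs containing `marker`."""
    first_per_domain = {}
    for url in urls:
        if marker not in url:
            continue
        # Assumption: the lowercased network location identifies the domain.
        domain = urlsplit(url).netloc.lower()
        if domain and domain not in first_per_domain:
            first_per_domain[domain] = url
    return list(first_per_domain.values())


def with_single_quote(url):
    """Return the URL with a single quote appended to the last parameter."""
    return url + "'"
```

Keeping only one URL per domain is what shrinks the candidate list so sharply between the second and third rows of the table.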
part = r'[^?@ ><\'":\\\/]+'
email_re = re.compile(part + '@' + part + r'\.' + part)
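
As a small usage sketch, the pattern above can be applied to page text with `findall`. The helper name `extract_emails` and the sample text are made up for illustration. Note that the character class does not exclude periods or commas, so an address followed directly by sentence punctuation may be captured together with that punctuation.

```python
import re

# Same pattern as above: runs of characters that are not separators,
# joined by '@' and at least one '.'.
part = r'[^?@ ><\'":\\\/]+'
email_re = re.compile(part + '@' + part + r'\.' + part)


def extract_emails(text):
    """Return the unique email-like strings found in `text`, sorted."""
    return sorted(set(email_re.findall(text)))


sample = 'Contact: info@example.com or <a href="mailto:sales@example.org">sales</a>'
print(extract_emails(sample))
```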
for wet_path in wetpaths:
    swp = slugger(wet_path)
    if swp in dones:
        continue
    t1 = time.time()
    results = []
    # Start a connection to one of the WARC files
    k = Key(pds, wet_path)
    f = warc.WARCFile(fileobj=GzipStreamFile(k))
    for i, record in enumerate(f):
        if record.url is not None and 'php?id=' in record.url:
            results.append(record.url)
    print(time.time() - t1)
    save_file_s3('\n'.join(results), swp)
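
For trying the same scan without S3 access, a rough standard-library variant is sketched below. It assumes a WET file has already been downloaded locally and that each record carries a `WARC-Target-URI:` header line. The function name `matching_urls` is hypothetical, and reading headers line by line is a simplification: a page body that happens to contain a line starting with that header name would be matched too.

```python
import gzip


def matching_urls(wet_file, needle="php?id="):
    """Yield target URLs from a gzipped WET file that contain `needle`."""
    prefix = "WARC-Target-URI:"
    # gzip.open reads concatenated gzip members as one continuous stream.
    with gzip.open(wet_file, "rt", encoding="utf-8", errors="replace") as fh:
        for line in fh:
            if line.startswith(prefix):
                url = line[len(prefix):].strip()
                if needle in url:
                    yield url
```

Because the function is a generator, it streams through the file once and never holds more than one line in memory.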
kootenpv.github.io
PascalvKooten
kootenpv
kootenpv@gmail.com
pascalvkooten