Tri is a PhD candidate in the school of electrical engineering at Princeton University. He is advised by Professor David Wentzlaff in the field of computer architecture, as part of the Princeton Parallel Computing research group. Tri is currently in his 5th year and hopes to graduate sometime next year...

Besides studying architecture, Tri dedicates whatever is left of his time to guitar, video/board gaming, running, and last but not least, learning cool stuff that he wishes he had studied instead on... YouTube.

Curriculum Vitae (07/2018)


July 2018: Two papers--CABLE and PiCL--got accepted to MICRO'18! Japan, I'm coming!!!


April 2018: I have accepted a postdoc appointment at the Harvard Medical School to work in the Wei lab. I'm excited to learn how neurons work and create a revolution for neural network machine learning!

Research at a Glance

For his PhD thesis, Tri's primary interest is in understanding the bandwidth wall: its causes, manifestations, and solutions. At the risk of oversimplifying, the bandwidth wall refers to the widening gap between computation performance (core count or FLOP/s) and memory performance (number of memory channels, DRAM latency). Limited bandwidth is already a problem for today's throughput computing systems, and for future computing systems such as data centers and supercomputers, memory will truly become a first-class optimization problem. In his research, Tri takes the viewpoint of systems 10+ years in the future, where the throughput of commercial manycore servers is more valuable than the single-threaded performance of today's consumer desktops.

Selected Publications

MORC: Manycore Cache Compression (MICRO'15) pdf, slides

One approach to overcoming limited off-chip bandwidth is to localize data movement on-chip as much as possible. Cache compression is a promising technique to increase effective cache capacity, improve cache hit rate, and decrease off-chip accesses. Much like file compression for email, cache compression compacts the data residing in caches in order to store more cache lines. Unfortunately, cache compression is notoriously hard to implement, and is plagued by internal fragmentation, external fragmentation, data store expansion, and, last but not least, low-performance compression algorithms.

To solve these challenges all in one fell swoop, MORC utilizes a novel log-based cache organization to compress a log composed of multiple cache lines together, gzip-style. This approach trades off a slight increase in access latency for vastly improved compression ratios, higher throughput, and lower energy consumption for future manycore architectures. MORC was published in MICRO'15 in Waikiki.
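
To get a rough feel for why compressing a log of several cache lines together can beat compressing each line on its own, here is a toy software sketch. It is not MORC's algorithm or hardware design: zlib merely stands in for a gzip-style compressor, the 64-byte line size and the synthetic data (a few patterns repeated across lines) are assumptions, and all function names are hypothetical.

```python
import hashlib
import zlib

LINE_SIZE = 64  # bytes per cache line (assumed; a common size)


def make_lines(num_lines, num_patterns=4):
    """Build synthetic cache lines: each line looks random on its own,
    but only `num_patterns` distinct lines exist, so redundancy lives
    across lines rather than inside any single line."""
    # A SHA-512 digest is exactly 64 bytes, i.e. one assumed cache line.
    patterns = [hashlib.sha512(bytes([p])).digest() for p in range(num_patterns)]
    return [patterns[i % num_patterns] for i in range(num_lines)]


def per_line_compressed_size(lines):
    """Total size when every line is compressed independently."""
    return sum(len(zlib.compress(line)) for line in lines)


def log_compressed_size(lines):
    """Size when all lines are appended into one log and compressed together,
    letting the compressor reuse matches from earlier lines."""
    return len(zlib.compress(b"".join(lines)))


if __name__ == "__main__":
    lines = make_lines(32)
    print("raw bytes:          ", len(lines) * LINE_SIZE)
    print("per-line compressed:", per_line_compressed_size(lines))
    print("log compressed:     ", log_compressed_size(lines))
```

On this synthetic data the per-line variant cannot shrink anything, because each line carries no internal redundancy, while the log variant can point back at earlier lines. Real workloads will behave differently, and a hardware compressor faces latency and area constraints that this sketch ignores.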

MORC fig1

Piton (website)

Piton (pronounced pee-t-on) is a manycore prototype designed in-house at Princeton and taped out at the IBM fab (now GlobalFoundries) at 32nm. The computational core is based on the OpenSPARC T1, and it has all the traditional features you have ever wanted in a manycore prototype: tile-based, distributed shared caches, directory-based shared memory, 3 NoCs, seamless multi-chip...

Piton chip


Email: firstname + lastname[0] at princeton dot edu